1. Introduction
High-altitude balloons are becoming increasingly important in areas such as monitoring volcanic plumes [1], atmospheric analysis [2], and other applications [3,4]. Consequently, assessing the feasibility of path planning for high-altitude balloons in the stratosphere is crucial to meeting diverse mission requirements. Balloons possess no horizontal propulsion system: they ascend or descend solely by inflating and deflating to change their buoyancy, navigating vertically through different wind layers to determine their horizontal direction. This method of control results in minimal energy consumption and facilitates prolonged flights in the stratosphere [5,6]. Selecting the appropriate altitude in a three-dimensional wind field presents considerable path planning challenges. Furthermore, partial observability increases the complexity of the problem [7]: even state-of-the-art real-time wind prediction technology exhibits large deviations [8]. Given the discrepancies in wind field predictions and the intricate nonlinear dynamics between balloon controls and guided navigation toward set goals, it is difficult for conventional techniques to design a direct and efficient control policy [9,10].
In recent years, various path planning methods have been applied extensively across many fields and applications, including target tracking [11], valet parking [12], autonomous mobile robot navigation [13], collision avoidance [14], and trajectory planning [15]. Traditional algorithms, such as graph search algorithms [16,17] and rapidly exploring random trees [18], are structurally simple but inefficient in large-scale searches and lack robust global optimization capabilities. In contrast, intelligent heuristic algorithms, including particle swarm optimization [19], genetic algorithms [20], ant colony optimization [21], and neural network-based algorithms [22], typically offer superior performance on complex problems. However, these heuristic methods are sensitive to complex environments, such as dynamic wind fields, and are prone to becoming stuck in local optima or failing to reach the goal. Deep reinforcement learning can handle high uncertainty and high dimensionality simultaneously, making it well suited to learning complex policies and non-obvious relationships [23]. This capability offers new opportunities for path planning in the stratosphere [24]. To date, research on path planning for stratospheric aerostats has focused primarily on airships. Yang et al. [25] proposed an adaptive horizontal trajectory control method for stratospheric airships in uncertain wind fields using the Q-learning algorithm, where the action strategy is determined by wind direction. Zheng et al. [26] employed a double deep recurrent Q-network for an airship path planning model; compared to heuristic algorithms, this method achieves higher energy efficiency and success rates under identical conditions. Considering the impact of high-altitude cold clouds on airship motion, Wang et al. [27] introduced a trajectory planning algorithm for stratospheric airships based on the Soft Actor–Critic (SAC) reinforcement learning algorithm, which effectively plans trajectories to any area while avoiding cold clouds. These studies on airship path planning assume 2D wind fields, with the airship operating on a fixed horizontal plane. Maintaining a constant altitude is feasible for airships because they are powered vehicles capable of controlling direction and speed via their propulsion systems rather than relying solely on the wind field. In contrast, balloons lack a propulsion system and can only adjust their altitude to find winds favorable to the mission. As a result, balloons have fewer control options and must navigate a 3D wind field, which significantly increases the complexity of their path planning.
Research on balloon control via deep reinforcement learning is not extensive and has predominantly focused on station-keeping, that is, keeping the balloon within 50 km of a base station so that it can reliably communicate with a ground device [28]. The most prominent work is Google's Project Loon. Bellemare et al. [29] designed a deep reinforcement learning approach that significantly improved station-keeping performance compared to their previous manually designed algorithm. Subsequently, Google launched Loon balloons at altitudes of 15–20 km in diverse regions of the world, a notable case being a 39-day controlled expedition over the Pacific Ocean. Jeger et al. [30] focused on flying a balloon to a predefined target position at low altitudes (up to 3 km above mean sea level), where winds vary over smaller length scales and timescales, making more locations reachable than in high-altitude operation. They implemented the DQN method on a fully autonomous, custom-designed outdoor prototype in real-world conditions. Over six flights, the prototype navigated to predefined positions with an average target distance error of 360 m after traveling approximately 10 km within a volume of 22 × 22 × 3.2 km.
Path planning for stratospheric aerostats plays a crucial role in various applications [31], particularly high-altitude communication [29], environmental monitoring [32,33], and scientific research [34]. However, existing research on balloon control has focused primarily on station-keeping tasks [35,36,37], while studies on stratospheric aerostat path planning largely concentrate on powered airships navigating 2D wind fields. The critical challenges of path planning for stratospheric balloons remain unexplored in the current literature. This study therefore employs a reinforcement learning-based approach to address them. We define the scenario shown in Figure 1: the balloon is initially in station-keeping mode, and upon switching to path planning mode it must locate and follow favorable wind currents by ascending and descending so as to reach a small target area as quickly as possible. This is a challenging path planning problem, as the wind field may lack favorable currents and the time-optimal path to the target may require spatial detours. Our aim is to control the balloon so that it reaches the target range as quickly as possible. This paper proposes a path planning method for stratospheric balloon navigation based on deep reinforcement learning. The main contributions of this work are as follows:
We develop a reinforcement learning agent for a superpressure balloon that can reach a randomly selected target within a stratospheric 3D wind field, using one-dimensional actions while operating under limited temporal and spatial constraints. It provides an efficient solution to the previously unexplored problem of navigating a stratospheric balloon to a small target area.
By analyzing the characteristics of balloon flight and wind fields, we design an effective state space and reward function that enhances the practicality and reliability of the controller. Additionally, we establish a baseline controller for comparison to evaluate the feasibility and effectiveness of the reinforcement learning-based path planning for stratospheric balloons.
The experimental results indicate that reinforcement learning can effectively learn a control policy that achieves fast navigation with a higher success rate, surpassing baseline methods. The controller was trained in Google's Balloon Learning Environment [38], which allows researchers to conveniently train reinforcement learning controllers for balloon agents; in principle, such agents could switch smoothly between controllers to handle different situations and tasks. This work aims to advance balloon control technology by addressing previously underexplored mission scenarios.
3. Experiment
3.1. Training
For the training phase, each episode lasts for a maximum of 960 steps and consists of one flight. The controller receives inputs and emits commands at three-minute intervals. To prevent the balloon from crossing the edge of the target range undetected within a three-minute interval, the balloon's ambient variables are updated every 10 s. After receiving a command from the controller, the Altitude Control System (ACS) adjusts the balloon's altitude by either pumping in or venting out air, thereby selecting a favorable wind direction for controlling its horizontal path. In actual flight, factors such as buoyancy, gravity, and air resistance significantly influence the balloon's dynamics, so a realistic simulation environment is crucial for bridging the gap between simulation and real-world conditions. The US Standard Atmosphere 1976 model is used to simulate environmental changes, including variations in atmospheric temperature, density, and pressure with altitude, and the simulation also accounts for the heat exchange between the balloon and its surroundings. Simulating the flight at this level of fidelity significantly increases the computational load and consequently slows down the simulation: the controller was trained for 30 days (wall-clock time) on an RTX 3070 GPU across 100,000 episodes.
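To make the timing of the simulation loop concrete, the following is a minimal sketch of how the 3-min control interval can be decomposed into 10 s environment updates; the function and variable names (e.g., acs_step, update_ambient_state) are illustrative and not taken from the Balloon Learning Environment API.

# Minimal sketch of the control/physics timing described above (illustrative names).
CONTROL_INTERVAL_S = 3 * 60     # controller acts every 3 minutes
PHYSICS_STEP_S = 10             # ambient variables updated every 10 s
MAX_EPISODE_STEPS = 960         # one flight lasts at most 960 control steps

def run_episode(env, controller):
    state = env.reset()
    for step in range(MAX_EPISODE_STEPS):
        action = controller.act(state)          # ascend, descend, or stay
        for _ in range(CONTROL_INTERVAL_S // PHYSICS_STEP_S):
            env.acs_step(action)                # ACS pumps in or vents air
            env.update_ambient_state()          # temperature, pressure, density, winds
            if env.reached_target() or env.out_of_bounds():
                return env.summary()            # stop as soon as a boundary is crossed
        state = env.observe()
    return env.summary()                        # time limit reached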
At the beginning of each episode, the station's location, the balloon's altitude, the date and time, and the wind noise are determined by a random seed. The target position is sampled uniformly within the 50 km station-keeping range. During the real-world deployment of Project Loon, the balloon remained within the station-keeping range approximately 79% of the time in the Pacific Ocean experiment. To match this scenario, the balloon's initial latitude and longitude fall within the station-keeping range, and its initial distance from the station follows a distribution similar to that observed in the Pacific Ocean experiment, namely a beta distribution with parameters a = 4 and b = 2. An episode concludes when the balloon reaches the target range, goes out of bounds, or exceeds the time limit.
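As an illustration, episode initialization along these lines could be sampled as follows; scaling the beta-distributed distance by the 50 km station-keeping radius, sampling the target uniformly in area, and the field names are assumptions made for this sketch.

import numpy as np

STATION_KEEPING_RADIUS_KM = 50.0

def sample_initial_conditions(rng: np.random.Generator):
    """Illustrative sampling of an episode's initial conditions (assumed scaling)."""
    # Target: uniform over the station-keeping disc (assumed uniform in area).
    target_r = STATION_KEEPING_RADIUS_KM * np.sqrt(rng.uniform())
    target_theta = rng.uniform(0.0, 2.0 * np.pi)

    # Balloon start: distance from the station follows Beta(a=4, b=2),
    # scaled here to the 50 km range to mirror the Pacific Ocean distribution.
    start_r = STATION_KEEPING_RADIUS_KM * rng.beta(4.0, 2.0)
    start_theta = rng.uniform(0.0, 2.0 * np.pi)

    return {
        "target_xy_km": (target_r * np.cos(target_theta), target_r * np.sin(target_theta)),
        "start_xy_km": (start_r * np.cos(start_theta), start_r * np.sin(start_theta)),
    }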
To investigate the outcomes of various decisions thoroughly, we adopt an exploratory strategy that covers a wider range of possibilities. The majority of training trials are dedicated to exploration; during these trials, the reinforcement learning controller and the exploration policy are alternated at intervals of 4 h and 2 h. Exploration runs that last too long produce large volumes of data that are of little use in the replay buffer. Data are logged as a series of state, action, and reward transitions; when a trial concludes, these transitions are appended to a randomly chosen replay buffer for subsequent training.
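The sketch below illustrates this data-collection scheme: running a trial with either the learned policy or an exploration policy and appending the finished trial to one of several replay buffers. The class and buffer interfaces are hypothetical.

import random

def collect_trial(env, rl_policy, explore_policy, replay_buffers, use_exploration):
    """Illustrative data collection: run one trial with either the RL policy or an
    exploration policy, then append its transitions to a randomly chosen buffer."""
    policy = explore_policy if use_exploration else rl_policy
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = policy.act(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    random.choice(replay_buffers).extend(transitions)   # random buffer, as in the text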
3.2. Baseline Controller
The station-keeping controller proposed in previous work provides insight into how to design a manual algorithm [29]. When the balloon is outside the station-keeping range, it preferentially looks for winds whose bearing deviates little from the direction of the station. The baseline controller receives the same wind variables of the state space as input, over 181 different pressure levels. So that the balloon reaches the target as soon as possible, wind speed is also taken into account. A wind score is computed for each pressure level l on the basis of the wind magnitude and bearing at that level, and the balloon moves toward the wind layer with the higher score. The score combines four terms. The first term defines a penalty for the angle of the wind relative to the target and is scaled by a constant. The second term represents the relative weight assigned to the distance D between the balloon and the target; its coefficient has the most significant impact on the score, and the weight becomes smaller the farther the distance. The third term is the wind uncertainty at pressure level l, scaled by constant coefficients. The last term is a hysteresis term, representing the extra time cost of moving between layers. Depending on which wind layer has the highest score, the balloon chooses to ascend, descend, or stay at its current level.
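To make the structure of this layer score concrete, the following is a minimal sketch, assuming a simple functional form for each of the four terms and hypothetical coefficient names (k_angle, k_dist, k_sigma, k_hyst); the exact formula and constants used by the baseline controller are not reproduced here.

import numpy as np

def wind_layer_scores(wind_speed, wind_bearing, bearing_to_target, wind_sigma,
                      distance_to_target_km, current_level,
                      k_angle=1.0, k_dist=0.02, k_sigma=0.5, k_hyst=0.05):
    """Illustrative score per pressure level (arrays indexed by the 181 levels).

    Term 1: penalty for the angle between the wind and the bearing to the target.
    Term 2: distance-dependent weight on wind speed (smaller weight when farther).
    Term 3: penalty for wind uncertainty at each level.
    Term 4: hysteresis penalty for moving away from the current level.
    The functional forms and default coefficients are assumptions for illustration.
    """
    angle_error = np.abs(np.angle(np.exp(1j * (wind_bearing - bearing_to_target))))
    angle_term = -k_angle * angle_error

    dist_weight = 1.0 / (1.0 + k_dist * distance_to_target_km)
    speed_term = dist_weight * wind_speed

    uncertainty_term = -k_sigma * wind_sigma

    levels = np.arange(len(wind_speed))
    hysteresis_term = -k_hyst * np.abs(levels - current_level)

    return angle_term + speed_term + uncertainty_term + hysteresis_term

# The balloon then ascends, descends, or stays, moving toward np.argmax(scores).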
3.3. Performance Evaluation
The primary performance evaluation metrics for the task include success rate and average success time. Success rate serves as a core indicator of the balloon’s ability to reach the target area. It is calculated as the ratio of successful attempts to the total number of tasks, reflecting the controller’s capability to complete the assigned mission. Average success time measures the average time required for the balloon to navigate from the starting point to the target area. This metric highlights the navigation efficiency, particularly in scenarios where the task is time-constrained.
Every 20,000 training episodes, the controller undergoes an evaluation with 1000 flights. During this evaluation, the factors influencing the balloon's flight path, such as starting position, altitude, target and station locations, date, time, and wind conditions, are set by a random seed. Using a consistent seed ensures that the flight path is determined solely by the neural network parameters of the RL controller, improving time efficiency by minimizing randomness and reducing the number of flights that must be simulated to obtain a fair assessment. Each flight records the random seed, whether the balloon reached the target range, and the accumulated reward, and the key performance metrics for the controller are the success rate and the average time to successfully reach the target range.
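The two evaluation metrics can be computed directly from the per-flight records, as in the short sketch below; the record field names are illustrative.

def evaluate(records):
    """Compute success rate and average success time from per-flight records.

    Each record is assumed to carry 'reached_target' (bool) and, for successful
    flights, 'flight_time_h' (hours); the field names are illustrative.
    """
    successes = [r for r in records if r["reached_target"]]
    success_rate = len(successes) / len(records)
    avg_success_time = (sum(r["flight_time_h"] for r in successes) / len(successes)
                        if successes else float("nan"))
    return success_rate, avg_success_time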
4. Simulation Results and Discussion
4.1. Baseline Controller Performance
The baseline controller, which requires no training, is assessed with the same seed over 1000 flights. We evaluated its performance for target radii of 5 km and 10 km. The results are shown in Table 3, where SR (%) is the success rate and AST (h) is the average success time.
4.2. RL Controller Performance
Multi-step decision tasks require the agent to take a sequence of actions to achieve a goal, with each decision influencing both future states and the eventual outcome. A key challenge in such tasks is balancing the final reward against the rewards received at each step. Excessive reliance on stepwise rewards may lead the agent to optimize short-term goals while neglecting the long-term global objective. Conversely, over-dependence on the final reward can cause training instability, especially in the early stages, when the agent may struggle to accurately evaluate the value of individual actions. Therefore, designing and adjusting the reward structure, especially the balance between the final goal and stepwise rewards, is crucial for solving complex tasks. In this study, we train the reinforcement learning (RL) controller with target ranges of 5 km and 10 km, respectively, and adjust the balance between the final reward and the stepwise rewards. Each controller was trained over 100,000 episodes and evaluated at intervals of 20,000 episodes. The results are presented in Figure 4 and Figure 5. The experimental data indicate that as the target radius increases, the balloon's success rate in reaching the target area also increases, while the average success time decreases. With a final reward of 100, the success rate fluctuates for target ranges of both 5 km and 10 km, and convergence is slower and less stable than with a final reward of 50. In contrast, with a final reward of 50, the controller achieves a high success rate as early as 40,000 episodes and improves gradually as training progresses. The experimental results therefore show that setting the final reward to 50 yields more stable and effective controller training than setting it to 100.
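The balance discussed here can be summarized as a reward of the following shape; the stepwise shaping term and its scale are assumptions made for this sketch, and only the terminal values (50 versus 100) come from the experiments above.

def reward(prev_distance_km, distance_km, reached_target, out_of_bounds,
           final_reward=50.0, step_scale=0.01):
    """Illustrative reward: terminal bonus on success plus small progress shaping.

    final_reward is the terminal value compared in the experiments (50 vs. 100);
    the progress-based stepwise term and step_scale are assumed for illustration.
    """
    if reached_target:
        return final_reward
    if out_of_bounds:
        return 0.0
    return step_scale * (prev_distance_km - distance_km)  # reward for moving closer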
Additionally, the average success time cannot serve as the sole metric for performance evaluation; it should be assessed alongside the success rate. When the balloon operates without any control, its mission success rate is low despite the short average time to reach the target. This indicates that an uncontrolled balloon reaches the target mainly when it happens to start in close proximity to it, rather than that passive flight improves time efficiency.
As shown in Figure 6, the moving averages of the episodic reward and the success rate, computed with a window size of 200, illustrate the training trends for a target radius of 5 km and a final reward of 100. Initially, owing to a lack of experience samples, the agent relies on the exploration strategy, resulting in relatively low reward values. As training progresses, the agent interacts with the environment and accumulates a substantial number of experience samples, and the reward values and the average success rate gradually increase.
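The smoothing used for these curves is a standard trailing moving average; a minimal version is shown below.

import numpy as np

def moving_average(values, window=200):
    """Trailing moving average used to smooth episodic rewards and success rates."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")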
4.3. Performance Analysis
Figure 7 and Figure 8 illustrate the best performance of the baseline controller and the RL controller; we also tested a random-selection controller and a passive-drift controller. Comparing the success rates of the different controllers shows that the RL controller outperforms the baseline controller: it achieves a 60.6% success rate versus 54.4% for the baseline when the target radius is 10 km, and 57.6% versus 43.1% when the target radius is 5 km. Comparing average success times, while achieving the higher success rate, the RL controller reached the target range 3.77 h faster than the baseline controller on average for a 5 km target radius and 1.68 h faster for a 10 km target radius.
Figure 9 and Figure 10 illustrate the superior performance of the RL controller over the baseline controller. Using the same random seed during evaluation ensures identical conditions for the target area, the balloon's starting position, and the wind conditions. The balloon's position was recorded every three minutes for both controllers.
Figure 9 illustrates one episode of the task, where both controllers reached the target range starting from the same position and under identical wind field conditions.
Figure 10 presents the variation in the balloon's distance from the target center and in wind speed over time. The baseline controller completed the flight in 21.9 h, while the reinforcement learning controller required only 12.9 h. The baseline controller, following its manual algorithm, preferentially selected wind flows at a relatively small angle to the target. In contrast, the reinforcement learning controller began to diverge from the target after about five hours, at one point exceeding 80 km from it, but eventually reached the target quickly by exploiting high-speed wind currents from approximately 11 h onward, taking a more circuitous route to find a faster path. The reinforcement learning controller thus demonstrates a superior ability to navigate complex wind fields and make decisions from a global perspective. Compared to traditional methods, it can account for long-term dynamics and optimize for the overall task objective, enabling better adaptation to changing conditions and closer alignment with the task's global goals.
We also recorded the balloon's electrical energy. Irrespective of controller type, final reward value, or target radius, and after RL controller training beyond 20,000 episodes, the balloon's average residual power is approximately 75%, whether or not the target range is reached. Adding a constraint on excessive inflating to the reward function or to the manual algorithm could, in theory, conserve power, but it would also curtail the balloon's exploratory potential. Given that path planning tasks require relatively short execution times compared to station-keeping tasks and that the simulation experiments show high residual power, the design prioritizes performance and imposes no such constraint.
4.4. Robustness Analysis of Balloon Initial Position
To evaluate the robustness of the controllers to the balloon's initial position, we selected five initial distances from the station center (0 km, 12.5 km, 25 km, 37.5 km, and 50 km). With a fixed 5 km target range, each distance was evaluated over 1000 episodes to analyze the impact of the initial distance on success rate and average success time. Based on the experimental data, the following conclusions can be drawn:
Figure 11 shows the success rates of the RL and baseline controllers at different initial distances from the station center. As the initial position distance from the station center increases, the average expected distance of the balloon to a random target point becomes larger, and the success rate gradually decreases. The reinforcement learning (RL) controller demonstrated stronger robustness under different initial conditions, while the baseline controller showed relatively weaker performance. In evaluations with random initial positions, the success rate of the RL controller was 54.4%, significantly outperforming the baseline controller’s 43.1%. When the initial position was fixed at the station center (0 km), the success rates of the RL and baseline controllers increased by 8.4% and 7.1%, respectively. Conversely, when the initial position was fixed 50 km from the station, the success rates decreased by 1.2% and 6.6%, respectively.
Figure 12 shows the average success times of the RL and baseline controllers at different initial distances from the station center. Further analysis of average success time reveals that the RL controller not only excelled in terms of success rate but also demonstrated significantly better navigation efficiency. Under random initial position conditions, the RL controller’s average success time was 11.38 h, compared to 15.2 h for the baseline controller. In tests with fixed initial positions, the RL controller’s average success time increased from 8.6 h (at 0 km distance) to 13.67 h (at 50 km distance), while the baseline controller’s average success time increased from 10.14 h to 17.8 h.
In summary, the experimental results indicate that the RL controller exhibits significantly greater robustness to variations in the balloon’s initial position compared to the baseline controller. This robustness enables the RL controller to complete tasks effectively under a broader range of initial conditions and achieve higher task efficiency in complex navigation scenarios. These findings provide important insights for optimizing navigation strategies in practical applications and further validate the advantages of reinforcement learning in path planning tasks.
5. Conclusions and Discussion
In this paper, we propose for the first time the task of superpressure balloon path planning in the stratosphere and employ deep reinforcement learning to build a controller for an autonomous balloon. Using the surrounding wind variables and ambient variables, the balloon, acting as the agent, learns an action policy through the optimized network that tries to reach the target range as soon as possible. Using success rate and average success time as evaluation metrics, the results demonstrate that the RL controller outperforms the baseline controller, reaching the target range more frequently and with a shorter average success time. Although we ultimately developed a controller capable of efficiently solving this novel problem, it may not be optimal in terms of training time and final performance. Owing to the large computational overhead, training for 100,000 episodes requires about 30 days of computation, which limited our search for optimal hyperparameters. Configurations likely exist that could further enhance performance, such as improved reward function designs, different hyperparameter choices, and varying neural network sizes. In principle, defining a broad range for all settings and parameters, sampling within this range, training controllers, and recording both training time and final performance could approach the theoretical optimum in a Monte Carlo fashion. However, the large data volume and the high precision required for path calculations impose significant time constraints, preventing such an exhaustive exploration. Despite these limitations, the successful application of our method to this problem establishes it as a reliable starting point for further research. The success of the reinforcement learning method in stratospheric wind field tasks demonstrates that reinforcement learning can capture the nonlinear relationship between complex wind fields and balloon control, and that its exploratory nature can find more efficient flight paths than manual algorithms.
The ability of balloons to achieve precise navigation to small targets in the stratosphere represents the core outcome of this research. This capability has the potential to bring significant benefits to fields such as meteorology, communication networks, and environmental/ecological monitoring. For instance, in emergency communication scenarios or when providing connectivity to remote areas, balloons can precisely navigate to target locations, delivering stable communication signal coverage to ground users. By navigating to specific target regions, balloons can efficiently perform localized environmental monitoring tasks, such as air quality assessments or greenhouse gas emission evaluations. In atmospheric science and climate change research, balloons can accurately collect high-resolution data, which is particularly valuable given that the current highest resolution of stratospheric wind field data is only 23 km. This capability significantly contributes to global climate change monitoring and prediction. Additionally, in the aftermath of natural disasters, balloons can rapidly move to affected areas, providing critical support for rescue operations and data collection.
The deployment of balloons for precise navigation to small targets raises ethical and safety concerns that must be addressed responsibly. High-resolution sensors, such as cameras or radar, may inadvertently infringe on privacy by monitoring ground activities or sensitive areas. To mitigate this, data collection should be restricted to mission-specific, low-resolution data, avoiding private regions, with transparency ensured through communication with stakeholders and necessary permissions obtained. Additionally, risks such as hardware failures or debris impacting sensitive ecosystems can be reduced by using biodegradable materials and avoiding protected ecological areas, ensuring both ethical and environmental considerations are met.