Path Planning for Autonomous Balloon Navigation with Reinforcement Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The study applies deep reinforcement learning to the challenge of path planning for superpressure balloons in the stratosphere, an area previously unexplored. This represents a significant advancement in the field of autonomous aerial vehicles. The RL controller showed superior capabilities in navigating complex wind fields and making decisions from a global perspective.
There are several things that could be improved:
1. The training process is computationally intensive, which could limit the practical application and further optimization of the model. Is there any method to improve it?
2. While the study reports high residual power, it also mentions that constraining inflation to conserve power could limit the balloon's exploratory potential. This trade-off between energy efficiency and performance is not fully explored.
3. The study focuses on specific target ranges (5 km and 10 km) and doesn't explore a wider variety of scenarios or longer-duration missions. This could limit the generalizability of the findings.
4. While the RL controller is compared with a baseline controller, there's no comparison with other advanced path planning methods, which could provide a more comprehensive evaluation of its effectiveness.
Author Response
Comments 1: [The training process is computationally intensive, which could limit the practical application and further optimization of the model. Is there any method to improve it?]
Response 1:[
Thank you for your comment. We agree that the computational intensity of the training process is a significant challenge, and we appreciate the opportunity to clarify this issue. We have provided additional details regarding the reasons for the computational intensity and potential methods for improvement. Below are the updates:
The prolonged training time is primarily attributed to two factors:
Global Scope of the Dataset: Our ultimate objective is to ensure that the balloon controller can effectively perform missions across the globe. However, wind field characteristics vary significantly across regions and seasons. As such, we utilize a dataset covering wind fields from the entire globe and all seasons. The vast scale of this dataset inevitably increases training time.
High-Resolution Trajectory Updates: During training, calculating the balloon’s trajectory in real-time requires substantial computational resources. While the controller operates in real-world conditions by issuing commands every 3 minutes, training updates the balloon's position, altitude, and related parameters every 10 seconds, as mentioned on page 8, line 269. This approach, sketched in code after the two points below, provides two key advantages:
Improved Task Accuracy: A shorter update interval (10 seconds) increases the precision of task success detection, reducing the likelihood of missing instances where the balloon enters and exits the target region during the longer 3-minute interval.
Enhanced Trajectory Precision: Frequent updates enable more accurate modeling of the balloon’s position and altitude, considering the continuously changing pressure during ascent or descent. This precision is crucial for real-world experiments.
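To make this sub-stepping concrete, below is a minimal, hedged sketch in Python. It assumes a hypothetical `wind_lookup` callable and a simplified 2D position update; the actual simulator also evolves altitude, pressure, and power, so this is illustrative only, not the implementation used in the manuscript.

```python
import numpy as np

COMMAND_INTERVAL_S = 180   # the controller issues a command every 3 minutes
PHYSICS_STEP_S = 10        # training updates the balloon state every 10 seconds

def roll_out_command(position, wind_lookup, target, target_radius_m):
    """Advance the balloon through one 3-minute command with 10-second sub-steps.

    `wind_lookup` is a hypothetical callable returning the local horizontal wind
    (vx, vy) in m/s at the balloon's position; the real simulator also evolves
    altitude, pressure, and battery state.
    """
    reached = False
    for _ in range(COMMAND_INTERVAL_S // PHYSICS_STEP_S):
        wind = np.asarray(wind_lookup(position), dtype=float)
        position = position + PHYSICS_STEP_S * wind
        # Checking success at every 10 s sub-step avoids missing cases where the
        # balloon enters and then exits the target region within one 3-minute command.
        if np.linalg.norm(position - target) <= target_radius_m:
            reached = True
            break
    return position, reached

# Toy usage with a constant 10 m/s easterly wind (purely illustrative):
# pos, ok = roll_out_command(np.zeros(2), lambda p: (10.0, 0.0),
#                            target=np.array([1500.0, 0.0]), target_radius_m=100.0)
```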
To address the computational intensity, we acknowledge that potential optimizations could include reducing the dataset size or lowering the update frequency. While these adjustments could expedite training and facilitate quicker model tuning, they come with significant trade-offs:
Reducing Dataset Size: This would compromise the model's robustness for global mission coverage, potentially limiting its ability to generalize effectively across diverse wind field conditions.
Lowering Update Frequency: This would reduce trajectory precision and task success detection accuracy, undermining the reliability of the model for real-world applications.
]
[We have provided a more detailed description of the dataset, including its source, selected range, and time span. The updated text can be found on page 5, lines 167-169.]
Comments 2: [While the study reports high residual power, it also mentions that constraining inflation to conserve power could limit the balloon's exploratory potential. This trade-off between energy efficiency and performance is not fully explored. ]
Response 2:[
Thank you for your valuable comment. We agree that the trade-off between energy efficiency and performance deserves further exploration.
In the initial phase of our experiments, we incorporated the percentage of remaining battery power into the balloon’s encoding. This resulted in an average residual power of approximately 75%, which we deemed satisfactory. Consequently, all subsequent training and evaluation processes retained this setup without modification. Given the limited design methodologies available for this issue, identifying an effective training configuration was particularly challenging. Thus, we prioritized adjustments to elements such as reward function design, balloon encoding, and exploration strategies, which have a more significant impact on training efficiency.
At the beginning of our experiments, we determined the parameters through comparative experiments. For example, the terminal reward was initially set to 10,000 times the intermediate rewards. However, after training for 10,000 episodes, the task success rate was only about 7%, comparable to random action selection. We then reduced the reward ratio to 5,000 and 1,000 for comparative experiments. After 50,000 episodes, the 1,000 reward ratio achieved a success rate of 56.4% with stable training performance, while the 5,000 reward ratio only reached 41.7% with more fluctuations, aligning with the observation that high terminal rewards destabilize early-stage training. Ultimately, we fixed the ratio at 500, achieving both stable training and superior final performance. This iterative approach also helped refine other experimental parameters, such as whether the balloon's position encoding should be relative to the station or the target, and the exploration strategy's time ratio. While these settings are directly presented in the manuscript, the process of determining them required extensive preliminary experimentation.
Theoretically, we believe that incorporating a penalty term for continuous inflation in the reward function and iteratively optimizing it could achieve higher energy savings with minimal impact on success rates. However, after thorough consideration, we concluded that the current level of residual battery power sufficiently supports our future experiments. As a result, we allocated computational resources to more critical aspects of model development.
]
[We further elaborated on the limitations encountered in the study and their impact on the interpretation of the results in the conclusion and discussion section. The updated text can be found on page 15, lines 451-466.]
Comments 3: [The study focuses on specific target ranges (5 km and 10 km) and doesn't explore a wider variety of scenarios or longer-duration missions. This could limit the generalizability of the findings.]
Response 3:[
Thank you for pointing this out. We agree that exploring a wider variety of scenarios and longer-duration missions could enhance the generalizability of the study's findings.
The setup for A to B path planning problems typically falls into two categories: defining a target range, as described in this study, or optimizing the balloon's proximity to a target point within a specified timeframe. Our choice of a target range is primarily motivated by the novelty of the research question. This is an unexplored area, and to date, no specific studies have addressed the capability of stratospheric balloons to reach a narrowly defined target.
Compared to the commonly studied station-keeping task with a radius of 50 km, the selected target ranges of 5 km and 10 km in this study are significantly smaller in area. These ranges are sufficient for many mainstream tasks, such as environmental, military, or ecological monitoring, as well as communication network applications. For meteorological data collection, there currently exists no stratospheric dataset with a resolution finer than the 23 km grid wind field. Furthermore, defining a fixed target range allows tasks to terminate immediately upon reaching the range, thereby reducing computational overhead during training.
While we acknowledge the value of investigating longer-duration missions or broader scenarios, as this is a novel problem, the scenarios considered in this study are primarily application-driven. Future work could expand upon these findings by exploring more diverse task profiles and operational conditions.]
[In the conclusion and discussion section, we highlighted the practical significance of the research results for the balloon industry, emphasizing specific applications and potential benefits. The updated text can be found on page 16, lines 467-480.]
Comments 4: [While the RL controller is compared with a baseline controller, there's no comparison with other advanced path planning methods, which could provide a more comprehensive evaluation of its effectiveness.]
Response 4:[
Thank you for your comment. We conducted extensive research in related areas, and our response is as follows:
Project Loon was the first to apply reinforcement learning to stratospheric balloon control tasks. Following the validation of neural networks’ capacity to learn complex wind field features and RL’s applicability to such control tasks, the most recent studies in this domain are predominantly based on deep reinforcement learning (DRL). While our specific task differs, the control processes and the need to handle complex wind field inputs are consistent. Advanced control algorithms, such as Model Predictive Control (MPC) and tree search controllers (OPD), have been explored in prior work. This resource (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2939-8/MediaObjects/41586_2020_2939_MOESM1_ESM.pdf) from the Google Project Loon researchers highlights several limitations of these advanced methods:
Computational Expense: Simulating a two-day flight required approximately 60 seconds, making algorithms like MPC or OPD computationally demanding. For instance, their MPC algorithm with an approximate model incurred high computation costs, with each decision taking 345 ms compared to 19 ms for an RL controller.
Sample Efficiency: The slow simulation process necessitated efficient sample usage to keep training feasible (<1 month). This limitation made direct policy optimization unattractive due to its typically higher interaction demands.
Robustness to Real Conditions: RL controllers effectively bridged the gap between simulation and real wind conditions, performing well even in untrained scenarios. In contrast, MPC often failed under real conditions due to insufficient uncertainty modeling and reliance on heuristic fixes.
Based on these findings, it is reasonable to infer that similar disadvantages of MPC and OPD would also arise in our task. Furthermore, we reviewed recent studies on stratospheric airship control, which have universally adopted DRL methods:
Station-keeping for high-altitude balloon with reinforcement learning: Station-keeping task; the dataset covers 2019-2021 in Changsha; trained over 1800 episodes using RL, achieving stability within 35 minutes.
Resource-Constrained Station-Keeping for Helium Balloons using Reinforcement Learning: Station-keeping task; the dataset covers 2022-2023 in the tropics; trained over 2500 episodes.
Reinforcement Learning for Outdoor Balloon Navigation: Path planning task; the dataset is from Switzerland and uses low-altitude (3 km) wind field data; training lasted 60 hours.
Trajectory planning based on continuous decision deep reinforcement learning for stratospheric: Path planning task; no dataset is mentioned. Methods such as DQN, DDPG, and TD3 are compared using airships in 2D regional wind fields.
Trajectory Planning of Stratosphere Airship in Wind-Cloud Environment Based on Soft Actor-Critic: Path planning task; the dataset is NOAA wind field data at 70 hPa. Methods such as DQN and SAC are compared using airships in 2D regional wind fields.
Path planning of stratospheric airship in dynamic wind field based on deep reinforcement learning: Path planning task; the dataset is NOAA wind field data at 70 hPa. Methods such as DRQN, DDRQN, and D3RQN are compared using airships in 2D regional wind fields.
Stratospheric airship trajectory planning in wind field using deep reinforcement learning: Path planning task; the dataset is ERA5 data for the whole year of 2022. Airships navigate 2D regional wind fields using the SAC method.
These are some of the most recent studies in the field, and each was published in 2022 or later. However, the scope of these studies is largely limited to 2D wind fields or smaller regions, allowing faster model adjustments and method comparisons. There is currently no study on global 3D wind fields for balloon path planning in the stratosphere. Consequently, existing work provides limited reference value for our task. Testing other RL algorithms (e.g., SAC, DDPG, TD3) would require re-training agents; given the large dataset and the high-precision path calculations involved, these algorithms are more complex than DQN, making it harder to design simple, efficient methods and to determine parameters.
Our decision to use DQN is grounded in several practical considerations:
Suitability for Discrete Action Spaces: DQN is well-suited for problems with a finite set of actions, such as fixed directional choices in navigation.
Computational Efficiency: Compared to policy-based algorithms (e.g., A3C, PPO) or hybrid algorithms (e.g., SAC), DQN is computationally simpler and resource-efficient.
High-Dimensional State Space: In tasks with high-dimensional state spaces but relatively small action spaces, DQN effectively optimizes cumulative rewards via value function approximation.
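As a point of reference for these considerations, below is a minimal PyTorch-style sketch of a value network for a high-dimensional state with a small discrete action set. The 1095-dimensional input matches the state size used in this work, but the hidden widths and the three-action output (e.g., ascend/descend/maintain altitude) are illustrative placeholders, not the architecture reported in the manuscript.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative DQN value network: high-dimensional state, few discrete actions."""

    def __init__(self, state_dim: int = 1095, num_actions: int = 3, hidden: int = 600):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection over the small discrete action space:
# q_net = QNetwork()
# action = q_net(torch.randn(1, 1095)).argmax(dim=-1)
```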
In summary, our primary focus is on solving the specific task scenario. The choice of DQN was driven by its robust performance in discrete action spaces, simplicity, and prior success in similar applications, making it a reliable starting algorithm. Future work could explore alternative RL algorithms or advanced path planning methods in this context.]
[We have provided a more detailed description, further explaining the differences between our work and existing research. The updated text can be found on page 2, lines 76-80.]
Reviewer 2 Report
Comments and Suggestions for Authors
1. The article proposes the use of deep reinforcement learning for the path planning of superpressure balloons in the stratosphere. The authors are requested to explain the differences and improvements of their work compared to existing research.
2. The state space includes 1095 variables, does this lead to a curse of dimensionality? How do the authors ensure that the algorithm can learn effectively in such a high-dimensional state space?
3. In the article, the design of the reward function takes into account factors such as reaching the target range and time consumption. The authors are asked to explain how they balance these factors to ensure the stability and effectiveness of the training.
4. Why do the authors choose DQN instead of other reinforcement learning algorithms?
5. Does the training process requiring 30 days of wall-clock time limit the practicality of the algorithm?
6. Are the experimental results sensitive to initial conditions, such as changes in the starting position of the balloon or the target location?
7. Have the authors considered the impact of extreme weather conditions or other unpredictable factors?
Author Response
Comments 1: [The article proposes the use of deep reinforcement learning for the path planning of superpressure balloons in the stratosphere. The authors are requested to explain the differences and improvements of their work compared to existing research.]
Response 1:[
Thank you for pointing this out. We have expanded our discussion to highlight the differences and improvements of our work compared to existing studies.
The differences among existing research on the stratospheric aerostat control problem are as follows.
Task difference: station-keeping tasks (remaining within a large range without leaving) versus path planning tasks (traveling from the current position A to a goal B).
Aerostat difference: airships (with lateral propulsion) versus balloons (without lateral propulsion).
Flight range difference: 3D wind fields (flying in 3D space) versus 2D wind fields (flying in a 2D plane).
By reviewing recent studies in the field, we observed that most state-of-the-art research adopts deep reinforcement learning (DRL) methods. Below are some representative studies:
Station-keeping for high-altitude balloon with reinforcement learning: This study addressed a station-keeping task with wind field data from Changsha (2019–2021). The strategy converged after 1800 episodes, with each training session lasting approximately 35 minutes.
Resource-Constrained Station-Keeping for Helium Balloons using Reinforcement Learning: This study addressed a station-keeping task with tropical wind field data (2022–2023) and trained over 2500 episodes.
Reinforcement Learning for Outdoor Balloon Navigation: This study addressed a path planning task using low-altitude (3 km) wind field data from Switzerland. Training lasted approximately 60 hours.
Trajectory planning based on continuous decision deep reinforcement learning for stratospheric: This study addressed a path planning task, with no specific dataset mentioned. Methods such as DQN, DDPG, and TD3 were compared using airships in 2D regional wind fields.
Trajectory Planning of Stratosphere Airship in Wind-Cloud Environment Based on Soft Actor-Critic: This study utilized NOAA wind field data at 70 hPa for a path planning task. Methods such as DQN and SAC were compared using airships in 2D regional wind fields.
Path planning of stratospheric airship in dynamic wind field based on deep reinforcement learning: This study used NOAA wind field data at 70 hPa for a path planning task and compared methods such as DRQN, DDRQN, and D3RQN in 2D regional wind fields.
Stratospheric airship trajectory planning in wind field using deep reinforcement learning: This study utilized ERA5 wind field data for the entire year of 2022, focusing on path planning with SAC in 2D regional wind fields.
These represent the latest research in this field, with all studies published in 2022 or later.
Additionally, the study Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning explored station-keeping tasks in the stratosphere using global wind field data generated by 100 parallel simulations. The best-performing controller was trained for 24 days.
Our work differs from these studies in terms of task, dataset size, and aerostat selection. Our problem is the path planning of a superpressure balloon in the stratosphere using a global wind field dataset, and it is a relatively unexplored problem. Existing studies often utilize smaller datasets to allow for faster iterations and comparisons of different models and parameters.
Rather than focusing on optimizing and comparing different algorithms, our study emphasizes whether the model can learn the global wind field characteristics effectively, whether the reward function is reasonably designed, and whether it addresses the novel challenge of navigating a superpressure balloon from point A to point B in the stratosphere.]
[We have provided a more detailed description, further explaining the differences between our work and existing research. The updated text can be found on page 2, lines 76-80.]
Comments 2: [The state space includes 1095 variables, does this lead to a curse of dimensionality? How do the authors ensure that the algorithm can learn effectively in such a high-dimensional state space?]
Response 2:[
Thank you for your comment. Our response to this comment is as follows:
The state space indeed includes 1095 variables, of which 1083 are related to wind field characteristics. As noted in the manuscript, we adopted the same wind field dataset and encoding strategy as Project Loon. In their study, the reinforcement learning controller for the station-keeping task demonstrated consistently high performance both in simulation and real-world scenarios. This strongly indicates that the wind field encoding approach enables reinforcement learning models to effectively capture and utilize wind field characteristics.
We hypothesized that differences in balloon tasks (e.g., station-keeping vs. path planning) would not significantly affect the model's ability to learn wind field features using this encoding. Our experimental results confirmed this hypothesis, as the model exhibited high performance in the path planning task, further validating its ability to learn wind characteristics from the same encoding.
One practical advantage of using consistent wind field encoding is that it ensures compatibility across different controllers when switching between tasks or objectives. This uniformity reduces the complexity of processing wind forecast data and simplifies the integration of various control strategies during real-world balloon operations.
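For illustration only, a hedged sketch of how such a state vector could be assembled is shown below; the 1083 wind-related variables and the remaining 12 balloon-related variables follow the counts stated above, but the internal ordering and feature names are placeholders rather than the exact BLE encoding.

```python
import numpy as np

NUM_WIND_FEATURES = 1083     # wind-column features, as stated in the manuscript
NUM_BALLOON_FEATURES = 12    # remaining balloon-related features (1095 - 1083)

def encode_state(wind_features, balloon_features):
    """Concatenate wind-column and balloon features into one 1095-dim state vector."""
    wind = np.asarray(wind_features, dtype=np.float32)
    balloon = np.asarray(balloon_features, dtype=np.float32)
    assert wind.shape == (NUM_WIND_FEATURES,)
    assert balloon.shape == (NUM_BALLOON_FEATURES,)
    return np.concatenate([wind, balloon])

# state = encode_state(np.zeros(1083), np.zeros(12))  # shape (1095,)
```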
]
Comments 3: [ In the article, the design of the reward function takes into account factors such as reaching the target range and time consumption. The authors are asked to explain how they balance these factors to ensure the stability and effectiveness of the training.]
Response 3:[
Thank you for highlighting this important aspect of our study.
In the A-to-B path planning task, the reward function framework assigns a large positive reward for successfully completing the task and a large negative reward for failure. During the task, smaller incremental rewards are given based on specific environmental factors, such as the distance of the agent to point B and time constraints.
A key challenge in such tasks lies in balancing the terminal rewards and stepwise rewards. Excessive reliance on stepwise rewards may cause the agent to optimize short-term goals at the expense of long-term objectives. Conversely, excessive reliance on terminal rewards can lead to unstable training, especially in the early stages, when the agent struggles to accurately evaluate the value of individual actions. Striking a balance between terminal and stepwise rewards is therefore critical for solving complex tasks effectively.
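To make the structure described above concrete, here is a hedged sketch of a reward of this form. The terminal-to-stepwise ratio of 500 matches the value reported later in this response; the distance-progress shaping and time penalty are illustrative stand-ins, not the exact formula in the manuscript.

```python
def reward(prev_distance_km, distance_km, done, success,
           terminal_reward=500.0, time_penalty=0.01):
    """Hedged sketch: large terminal reward/penalty plus small stepwise shaping.

    The 500:1 terminal-to-stepwise scale follows the ratio discussed in this
    response; the progress and time terms are illustrative placeholders.
    """
    if done:
        return terminal_reward if success else -terminal_reward
    progress = prev_distance_km - distance_km   # positive when moving toward B
    return progress - time_penalty
```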
As this is a novel and unexplored task with limited related references in the literature, our approach to balancing these factors primarily relied on iterative experimentation and adjustment. Initially, we set the terminal reward to be approximately 10,000 times the stepwise reward. However, after 10,000 episodes, the task success rate was around 7%, which was comparable to random action selection. This indicated that the excessively high terminal reward hindered the convergence of the neural network.
Subsequently, we reduced the ratio to 5,000 and 1,000 in separate experiments. After 50,000 episodes, the experiment with a 1,000 ratio achieved a 56.4% success rate with relatively stable performance during training. In contrast, the experiment with a 5,000 ratio only achieved a 41.7% success rate and exhibited higher fluctuations in performance. This aligns with the observation that large terminal rewards often lead to instability during the early stages of training.
After a series of experiments, we finalized the ratio at 500, which provided stable training and higher overall performance.
In the early stages of our research, we also applied this iterative approach to determine other key parameters, such as whether the balloon's position encoding should be relative to the station or the target and the exploration-to-exploitation time ratio in the exploration strategy. Although these settings are explicitly stated in the paper, the experiments conducted to finalize these parameters were extensive and time-consuming.
We believe these efforts highlight the necessity of iterative fine-tuning in novel scenarios to achieve stable and effective training outcomes.]
[We have further explored how to theoretically determine the optimal parameters in the conclusion and discussion section. The updated text can be found on page 15, lines 456-461.]
Comments 4: [Why do the authors choose DQN instead of other reinforcement learning algorithms?]
Response 4:[
Thank you for highlighting this important aspect of our study.
We understand your concerns, our decision to use DQN is grounded in several practical considerations:
Suitability for Discrete Action Spaces: DQN is well-suited for problems with a finite set of actions, such as fixed directional choices in navigation.
Computational Efficiency: Compared to policy-based algorithms (e.g., A3C, PPO) or hybrid algorithms (e.g., SAC), DQN is computationally simpler and resource-efficient.
High-Dimensional State Space: In tasks with high-dimensional state spaces but relatively small action spaces, DQN effectively optimizes cumulative rewards via value function approximation.
In summary, our primary focus is on solving the specific task scenario. The choice of DQN was driven by its robust performance in discrete action spaces, simplicity, and prior success in similar applications, making it a reliable starting algorithm. Future work could explore alternative RL algorithms or advanced path planning methods in this context.
]
Comments 5: [Does the training process requiring 30 days of wall-clock time limit the practicality of the algorithm?]
Response 5:[
Thank you for your comment. We agree that the time of the training process is a significant challenge, and we appreciate the opportunity to clarify this issue. We have provided additional details regarding the reasons for the training time and potential methods for improvement. Below are the updates:
The prolonged training time is primarily attributed to two factors:
Global Scope of the Dataset: Our ultimate objective is to ensure that the balloon controller can effectively perform missions across the globe. However, wind field characteristics vary significantly across regions and seasons. As such, we utilize a dataset covering wind fields from the entire globe and all seasons. The vast scale of this dataset inevitably increases training time.
High-Resolution Trajectory Updates: During training, calculating the balloon’s trajectory in real-time requires substantial computational resources. While the controller operates in real-world conditions by issuing commands every 3 minutes, training updates the balloon's position, altitude, and related parameters every 10 seconds. This approach provides two key advantages:
Improved Task Accuracy: A shorter update interval (10 seconds) increases the precision of task success detection, reducing the likelihood of missing instances where the balloon enters and exits the target region during the longer 3-minute interval.
Enhanced Trajectory Precision: Frequent updates enable more accurate modeling of the balloon’s position and altitude, considering the continuously changing pressure during ascent or descent. This precision is crucial for real-world experiments.
To address the computational intensity, we acknowledge that potential optimizations could include reducing the dataset size or lowering the update frequency. While these adjustments could expedite training and facilitate quicker model tuning, they come with significant trade-offs:
Reducing Dataset Size: This would compromise the model's robustness for global mission coverage, potentially limiting its ability to generalize effectively across diverse wind field conditions.
Lowering Update Frequency: This would reduce trajectory precision and task success detection accuracy, undermining the reliability of the model for real-world applications.
Currently, the only comparable study on the same dataset is Google Loon. In their station-keeping task, the optimal controller was ultimately trained for 24 days. By comparison, a training duration of 30 days is acceptable on our time scale.
]
[We have provided a more detailed description of the dataset, including its source, selected range, and time span. The updated text can be found on page 5, lines 167-169.]
Comments 6: [Are the experimental results sensitive to initial conditions, such as changes in the starting position of the balloon or the target location?]
Response 6:[
Thank you for raising this insightful question. In our review of the relevant literature in this field, we have not encountered studies specifically addressing the sensitivity of experimental results to initial conditions, such as changes in the starting position of the balloon or the target location. In our current setup, the target location is randomly distributed, with equal probability of appearing at any point. This design is intended to align closely with our planned experimental scenarios in real-world flight, where we anticipate switching between controllers with different functionalities based on specific tasks. The probability of a balloon’s starting position is approximately proportional to its distance from the station in station-keeping tasks.
To address your question, we could evaluate the sensitivity of our controller to initial conditions by modifying the balloon's starting position or target location. Alternatively, we could directly analyze the correlation between the initial distance between the balloon’s starting position and the target and the success rate. If initial conditions do influence performance, we could leverage this information to predict mission success or accelerate early-stage training, which would be highly beneficial for further optimizing the controller.
Within the constraints of our study, we selected five test points at varying distances and evaluated whether the initial conditions affected the controller’s performance under a 5 km target range. The results indicate that the farther the initial position is from the station’s center, the greater the average distance of the balloon from the target and the lower the success rate of the controller. Specifically, the reinforcement learning (RL) controller demonstrated higher robustness to initial conditions compared to the baseline. When evaluated under random starting positions, the success rates of the RL and baseline controllers were 54.4% and 43.1%, respectively. When the initial position was at the station center, these success rates increased by 8.4% and 7.1%, respectively. Conversely, when the initial position was 50 km away from the station, the success rates decreased by 1.2% and 6.6%, respectively.
In summary, for our experimental setup, we believe that our reinforcement learning controller demonstrates sufficient robustness to variations in the balloon’s starting position and target location.
]
[We have added a section discussing the sensitivity of the controller to the initial position of the balloon. The updated text can be found on page 14, lines 411-442.]
Comments 7: [Have the authors considered the impact of extreme weather conditions or other unpredictable factors?]
Response 7:[
Thank you for highlighting this important consideration. We agree that understanding the impact of extreme weather conditions and other unpredictable factors is crucial.
The stratosphere's wind and weather characteristics are inherently favorable for conducting balloon missions. Notably, the stratosphere does not experience precipitation. This is because the stratosphere, situated above the troposphere, lacks significant temperature variations and sufficient water vapor to form rain. The temperature in the stratosphere increases with altitude, which inhibits the condensation of water vapor into raindrops. Additionally, the stratosphere's air primarily moves horizontally, with minimal vertical convection, further preventing the upward movement and cooling of water vapor necessary for cloud and rain formation. This results in clear skies and high atmospheric transparency, which are advantageous for balloon missions.
As detailed in NASA's balloon project logs (https://blogs.nasa.gov/superpressureballoon/), mission planning for stratospheric balloons typically involves careful observation of weather and wind conditions to ensure favorable conditions before launch. Missions are often delayed—sometimes up to three or four times—due to adverse weather or wind patterns, emphasizing the importance of mitigating unpredictable factors. Moreover, given that our missions operate at altitudes of 15-20 km, well above the typical flight altitude of commercial aircraft (8,000–10,000 meters), interactions with other objects in this space are rare.
Our team has also considered other unpredictable factors, such as navigating no-fly zones (NFZs). NFZs are high-risk or restricted areas where unauthorized entry could lead to severe consequences. While NFZ avoidance introduces new challenges distinct from A-to-B navigation, our team has conducted research on this topic, which may interest you. For further details, you can refer to our paper, High-Altitude Balloons No-Fly Zone Avoidance Based on Reinforcement Learning.
These considerations ensure that our approach is designed to minimize risks from unpredictable conditions while maintaining mission effectiveness.
]
Reviewer 3 Report
Comments and Suggestions for Authors
1. Recommended to create Path Planning for Autonomous Balloon Navigation with Reinforcement Learning model
2. To increase references list till 60 sources
3. To describe in detail the novelty of the scientific article
Author Response
Comments 1: [Recommended to create Path Planning for Autonomous Balloon Navigation with Reinforcement Learning model]
Response 1:[Thank you for your recommendation. The control process of our model is relatively straightforward. In summary, it involves using balloon information and wind field data as inputs to a neural network, which outputs a control command. This process is already illustrated in Figure 3 (Training Process of DQN) on page 8, line 263, which encapsulates the details of our approach.]
Comments 2: [To increase references list till 60 sources]
Response 2:[Thank you for your valuable suggestion. We have expanded the reference list to 46 sources, providing a more comprehensive overview of existing research and applications in high-altitude balloon control as well as the broader context of path planning applications.]
[We have increased the reference list to 46 sources. The updated text can be found on page 2, lines 76-80, and page 16, lines 501-591.]
Comments 3: [To describe in detail the novelty of the scientific article]
Response 3:[
Thank you for pointing this out. We have expanded our discussion to highlight the differences and improvements of our work compared to existing studies.
The differences among existing research on the stratospheric aerostat control problem are as follows.
Task difference: station-keeping tasks (remaining within a large range without leaving) versus path planning tasks (traveling from the current position A to a goal B).
Aerostat difference: airships (with lateral propulsion) versus balloons (without lateral propulsion).
Flight range difference: 3D wind fields (flying in 3D space) versus 2D wind fields (flying in a 2D plane).
By reviewing recent studies in the field, we observed that most state-of-the-art research adopts deep reinforcement learning (DRL) methods. Below are some representative studies:
Station-keeping for high-altitude balloon with reinforcement learning: This study addressed a station-keeping task with wind field data from Changsha (2019–2021). The strategy converged after 1800 episodes, with each training session lasting approximately 35 minutes.
Resource-Constrained Station-Keeping for Helium Balloons using Reinforcement Learning: This study addressed a station-keeping task with tropical wind field data (2022–2023) and trained over 2500 episodes.
Reinforcement Learning for Outdoor Balloon Navigation: This study addressed a path planning task using low-altitude (3 km) wind field data from Switzerland. Training lasted approximately 60 hours.
Trajectory planning based on continuous decision deep reinforcement learning for stratospheric: This study addressed a path planning task, with no specific dataset mentioned. Methods such as DQN, DDPG, and TD3 were compared using airships in 2D regional wind fields.
Trajectory Planning of Stratosphere Airship in Wind-Cloud Environment Based on Soft Actor-Critic: This study utilized NOAA wind field data at 70 hPa for a path planning task. Methods such as DQN and SAC were compared using airships in 2D regional wind fields.
Path planning of stratospheric airship in dynamic wind field based on deep reinforcement learning: This study used NOAA wind field data at 70 hPa for a path planning task and compared methods such as DRQN, DDRQN, and D3RQN in 2D regional wind fields.
Stratospheric airship trajectory planning in wind field using deep reinforcement learning: This study utilized ERA5 wind field data for the entire year of 2022, focusing on path planning with SAC in 2D regional wind fields.
These represent the latest research in this field, with all studies published in 2022 or later.
Additionally, the study Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning explored station-keeping tasks in the stratosphere using global wind field data generated by 100 parallel simulations. The best-performing controller was trained for 24 days.
Our work differs from these studies in terms of task, dataset size, and aerostat selection. Our problem is the path planning of a superpressure balloon in the stratosphere using a global wind field dataset, and it is a relatively unexplored problem. Existing studies often utilize smaller datasets to allow for faster iterations and comparisons of different models and parameters.
Rather than focusing on optimizing and comparing different algorithms, our study emphasizes whether the model can learn the global wind field characteristics effectively, whether the reward function is reasonably designed, and whether it addresses the novel challenge of navigating a superpressure balloon from point A to point B in the stratosphere.
]
[We have provided a more detailed description, further explaining the differences between our work and existing research and highlighting the novelty of the problem. The updated text can be found on page 2, lines 76-80, and page 3, lines 90-94.]
Reviewer 4 Report
Comments and Suggestions for Authors
Please note the following:
- Provide a detailed description of the dataset used, including source, size and preprocessing methods applied to the data.
- Explain in detail the specific reinforcement learning algorithms used, including neural network architecture, training parameters, and model validation methods.
- Include an extensive section with analyses of model performance under various test conditions, such as extreme or varying wind conditions, for evaluating model robustness.
- Detail the performance metrics used to evaluate the success of balloon navigation, with specific examples of their application in analyzing the results.
- Deepen the discussion of the limitations encountered in the research and their impact on the interpretation of the results.
- Provide comparisons with other existing studies or techniques to contextualize the effectiveness and innovation of your methodology.
- Describe the practical implications of the findings for the ballooning industry, highlighting concrete applications and potential benefits.
- Address the ethical and safety considerations associated with applying your algorithms in real-world scenarios, discussing any possible risks and mitigation measures
Author Response
Comments 1: [Provide a detailed description of the dataset used, including source, size and preprocessing methods applied to the data.]
Response 1:[
Thank you for pointing this out. We have provided a more detailed description of the dataset used in our study, including its source, size, and preprocessing methods.
The dataset is sourced from ECMWF's ERA5 global reanalysis dataset, which provides a four-dimensional wind field consisting of altitude, latitude, longitude, and time. For training, the dataset was uniformly and randomly sampled within the tropical region (25°N to 25°S) over the period from 2005 to 2010, ensuring sufficient variability and representativeness of the wind field conditions in the tropics.
Regarding preprocessing, the methods are integrated into the Balloon Learning Environment (BLE) framework. While detailing all formulas might be overwhelming, we summarize the key functionalities of the preprocessing. The process involves applying procedural noise to increase the spatial resolution and to account for prediction errors, ensuring a more realistic and high-resolution wind field simulation.
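As a small, hedged illustration of the sampling described above (uniform random draws over the tropics, 25°N to 25°S, and the 2005-2010 period), the sketch below shows one way such training scenarios could be drawn; the actual BLE sampling and the procedural-noise augmentation are more involved and are not reproduced here.

```python
import random
from datetime import datetime, timedelta

def sample_training_scenario(rng: random.Random):
    """Draw a start time in 2005-2010 and a start location in the tropics."""
    start, end = datetime(2005, 1, 1), datetime(2011, 1, 1)
    t = start + timedelta(seconds=rng.uniform(0, (end - start).total_seconds()))
    lat = rng.uniform(-25.0, 25.0)    # 25°S to 25°N
    lon = rng.uniform(-180.0, 180.0)
    return t, lat, lon

# scenario = sample_training_scenario(random.Random(0))
```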
]
[We have provided a more detailed description of the dataset, including its source, selected range, and time span. The updated text can be found on page 5, lines 167-169.]
Comments 2: [Explain in detail the specific reinforcement learning algorithms used, including neural network architecture, training parameters, and model validation methods.]
Response 2:[
Thank you for raising this important point. We have explained the reinforcement learning algorithm and neural network architecture used in our study on page 7, lines 249-250, and we have further elaborated on the training parameters in the revised manuscript.
Regarding the model validation methods, this is a novel task without a standardized validation framework. For our specific scenario, the primary evaluation metrics include the task success rate and average success time. These metrics are designed to assess the controller's ability to achieve the defined objectives effectively and efficiently under varying conditions.
]
[We provide more detailed training parameters and model validation methods. For the updated text, see page 7, lines 250-253, and page 9, lines 317-323.]
Comments 3: [ Include an extensive section with analyses of model performance under various test conditions, such as extreme or varying wind conditions, for evaluating model robustness.]
Response 3:[
Thank you for your comment. We have provided an extended section analyzing the model's robustness under various test conditions, including insights into extreme and varying wind scenarios.
In the context of stratospheric balloon control, wind quality is primarily assessed through wind field diversity. This diversity is defined by the presence of opposing horizontal winds at different altitude layers over the same geographic location and time. Greater wind diversity theoretically expands the balloon's reachable range. Under this definition, extreme wind conditions are often characterized by poor wind diversity, which was excluded during the dataset preselection phase. Such wind conditions are practically uncontrollable for balloons without propulsion, rendering them irrelevant for training purposes. In contrast, varying wind fields are exactly the types of conditions the model needs to handle effectively.
To evaluate the robustness of the controller, we tested its performance under varying initial positions of the balloon. Specifically, we selected five test points at different initial distances (0 km, 12.5 km, 25 km, 37.5 km, and 50 km) from the station center. For each fixed initial position, we conducted 1,000 episodes within a 5 km target range and analyzed how the initial position impacts task success rate and average success time.
Overall, for our experimental setup, we conclude that the RL controller demonstrates sufficient robustness to variations in the balloon's initial or target positions. This robustness underscores the RL controller's adaptability to diverse conditions and its practical application potential in stratospheric balloon navigation tasks.
]
[We have added a section discussing the sensitivity of the controller to the initial position of the balloon. The updated text can be found on page 14, lines 411-442.]
Comments 4: [Detail the performance metrics used to evaluate the success of balloon navigation, with specific examples of their application in analyzing the results.]
Response 4:[
Thank you for your comment. We have provided additional details on the performance metrics used to evaluate balloon navigation, along with examples of their application in analyzing results.
Success Rate: The success rate is a core metric for evaluating whether the balloon successfully reaches the target area during a task. It is calculated as the ratio of successful missions to the total number of missions, reflecting the controller's task completion capability. For instance, if 80 out of 100 missions are successful in a given experiment, the success rate would be 80%. In our analysis, the success rate serves as a key measure to compare the performance of different controllers under identical wind field conditions.
Average Success Time to Reach Target: This metric measures the average time required for the balloon to navigate from its starting point to the target area. It provides insight into the navigation efficiency, particularly for tasks with time constraints. For example, by comparing the average success time of two controllers, we can assess which approach demonstrates higher efficiency.
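For completeness, a small sketch of how these two metrics can be computed from a batch of evaluation episodes is given below; the episode record format is an illustrative assumption.

```python
def summarize_episodes(episodes):
    """Compute success rate and average success time.

    `episodes` is assumed to be a list of (success: bool, time_hours: float)
    pairs, where the time is only meaningful for successful episodes.
    """
    if not episodes:
        return 0.0, float("nan")
    success_times = [t for ok, t in episodes if ok]
    success_rate = len(success_times) / len(episodes)
    avg_success_time = (sum(success_times) / len(success_times)
                        if success_times else float("nan"))
    return success_rate, avg_success_time

# Example: 80 successes out of 100 missions -> success_rate = 0.80
```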
]
[We provide more detailed performance metrics and their application in analyzing the results. For the updated text, see page 9, lines 317-323.]
Comments 5: [Deepen the discussion of the limitations encountered in the research and their impact on the interpretation of the results.]
Response 5:[
Thank you for your insightful comment. We acknowledge that limitations in computational resources have, to some extent, constrained the further optimization of the model. This is particularly evident during the early stages of training, where a large number of parameters and training configurations needed to be determined. Achieving a stable and efficient training setup often required extensive experimentation and iterative adjustments. For example, we conducted numerous small-scale experiments to fine-tune critical settings, such as the ratio of terminal rewards to stepwise rewards, the encoding of balloon positions relative to the station or the target, and the exploration strategy's time allocation. While these settings are explicitly described in the paper, determining them required a time-intensive trial-and-error process during the initial phase.
Ultimately, we developed a controller that efficiently addresses this novel problem, but we acknowledge that the solution may not represent the optimal configuration in terms of training time or final performance. In theory, systematically defining a range for all settings and parameters, followed by sampling and training controllers to record training time and performance, could lead to the Monte Carlo theoretical optimum. However, the large-scale dataset and high-precision path calculations necessary for this approach made it infeasible within our resource constraints.
Nonetheless, the successful application of our controller to this problem establishes it as a reliable starting point for future research, demonstrating its practical potential despite these limitations.
]
[We further elaborated on the limitations encountered in the study and their impact on the interpretation of the results in the conclusion and discussion section. The updated text can be found on page 15, lines 451-466.]
Comments 6: [Provide comparisons with other existing studies or techniques to contextualize the effectiveness and innovation of your methodology.]
Response 6:[
Thank you for pointing this out. We have expanded our discussion to highlight the differences and improvements of our work compared to existing studies.
The differences among existing research on the stratospheric aerostat control problem are as follows.
Task difference: station-keeping tasks (remaining within a large range without leaving) versus path planning tasks (traveling from the current position A to a goal B).
Aerostat difference: airships (with lateral propulsion) versus balloons (without lateral propulsion).
Flight range difference: 3D wind fields (flying in 3D space) versus 2D wind fields (flying in a 2D plane).
By reviewing recent studies in the field, we observed that most state-of-the-art research adopts deep reinforcement learning (DRL) methods. Below are some representative studies:
Station-keeping for high-altitude balloon with reinforcement learning: This study addressed a station-keeping task with wind field data from Changsha (2019–2021). The strategy converged after 1800 episodes, with each training session lasting approximately 35 minutes.
Resource-Constrained Station-Keeping for Helium Balloons using Reinforcement Learning: This study addressed a station-keeping task with tropical wind field data (2022–2023) and trained over 2500 episodes.
Reinforcement Learning for Outdoor Balloon Navigation: This study addressed a path planning task using low-altitude (3 km) wind field data from Switzerland. Training lasted approximately 60 hours.
Trajectory planning based on continuous decision deep reinforcement learning for stratospheric: This study addressed a path planning task, with no specific dataset mentioned. Methods such as DQN, DDPG, and TD3 were compared using airships in 2D regional wind fields.
Trajectory Planning of Stratosphere Airship in Wind-Cloud Environment Based on Soft Actor-Critic: This study utilized NOAA wind field data at 70 hPa for a path planning task. Methods such as DQN and SAC were compared using airships in 2D regional wind fields.
Path planning of stratospheric airship in dynamic wind field based on deep reinforcement learning: This study used NOAA wind field data at 70 hPa for a path planning task and compared methods such as DRQN, DDRQN, and D3RQN in 2D regional wind fields.
Stratospheric airship trajectory planning in wind field using deep reinforcement learning: This study utilized ERA5 wind field data for the entire year of 2022, focusing on path planning with SAC in 2D regional wind fields.
These represent the latest research in this field, with all studies published in 2022 or later.
Additionally, the study Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning explored station-keeping tasks in the stratosphere using global wind field data generated by 100 parallel simulations. The best-performing controller was trained for 24 days.
Our work differs from these studies in terms of task, dataset size, and aerostat selection. Our problem is the path planning of a superpressure balloon in the stratosphere using a global wind field dataset, and it is a relatively unexplored problem. Existing studies often utilize smaller datasets to allow for faster iterations and comparisons of different models and parameters.
Rather than focusing on optimizing and comparing different algorithms, our study emphasizes whether the model can learn the global wind field characteristics effectively, whether the reward function is reasonably designed, and whether it addresses the novel challenge of navigating a superpressure balloon from point A to point B in the stratosphere.
]
[We have provided a more detailed description, further explaining the differences between our work and existing research. The updated text can be found on page 2, lines 76-80.]
Comments 7: [Describe the practical implications of the findings for the ballooning industry, highlighting concrete applications and potential benefits.]
Response 7:[Thank you for pointing this out. We have added practical significance to the balloon industry, emphasizing specific applications and potential benefits.
The ability of balloons to achieve precise navigation to small targets in the stratosphere represents the core outcome of this research. This capability has the potential to bring significant benefits to fields such as meteorology, communication networks, and environmental/ecological monitoring. For instance, in emergency communication scenarios or providing connectivity in remote areas, balloons can precisely navigate to target locations, delivering stable communication signal coverage to ground users. By navigating to specific target regions, balloons can efficiently perform localized environmental monitoring tasks, such as air quality assessments or greenhouse gas emission evaluations. In atmospheric science and climate change research, balloons can accurately collect high-resolution data, which is particularly valuable given that the current highest resolution of stratospheric wind field data is only 23 km. This capability significantly contributes to global climate change monitoring and prediction. Additionally, in the aftermath of natural disasters, balloons can rapidly move to affected areas, providing critical support for rescue operations and data collection. ]
[In the conclusion and discussion section, we highlighted the practical significance of the research results for the balloon industry, emphasizing specific applications and potential benefits. The updated text can be found on page 15, lines 467-480.]
Comments 8: [Address the ethical and safety considerations associated with applying your algorithms in real-world scenarios, discussing any possible risks and mitigation measures]
Response 8:[
Thank you for raising this important concern. Ethical and safety considerations are indeed critical when applying algorithms for real-world balloon navigation. Below, we discuss potential risks and outline mitigation measures to address them:
Privacy Concerns: Balloons equipped with high-resolution sensors (e.g., cameras, radars) for data collection may inadvertently monitor ground activities, structures, or individuals. This raises privacy concerns, particularly in sensitive or unauthorized areas.
Mitigation Measures: We propose strict limitations on data collection, ensuring that only low-resolution or task-specific data related to the mission is gathered, avoiding surveillance of private areas or sensitive locations. Additionally, transparency and communication with stakeholders (e.g., governments, communities) are essential before initiating missions. Clearly communicating the purpose and use of the data and obtaining necessary authorizations can prevent misuse and alleviate public concerns.
Environmental Risks: The operation of balloons could result in hardware damage, crashes, or debris that may pose ecological threats, especially in sensitive environments.
Mitigation Measures: Utilizing recyclable or biodegradable materials for balloon construction can minimize long-term environmental impacts. Furthermore, flight paths should be carefully planned to avoid ecologically sensitive areas, such as nature reserves or marine ecosystems. By designating "no-fly zones" (NFZs), we can further reduce environmental risks.
Potential for Misuse: Balloon navigation technology might be repurposed for military applications, such as surveillance in adversarial regions, potentially leading to international disputes.
Mitigation Measures: We advocate for clear agreements limiting the technology to non-military applications. International collaboration and regulatory frameworks can ensure transparency and legal compliance in the use of such technologies. Partnering with international organizations to develop standards and guidelines can prevent potential misuse.
Note on Military Applications:
Regarding the third point, we firmly oppose the use of balloon navigation technology for military purposes. However, we believe that explicitly discussing this in the manuscript might inadvertently inspire misuse by those with ill intentions. We hope for your understanding on this matter.
By addressing these ethical and safety concerns, we aim to promote the responsible and beneficial application of balloon navigation technology, ensuring its positive impact across various domains.
]
[We provide information on the ethical and safety issues and mitigation measures associated with applying the algorithms in real-world scenarios. The updated text can be found on page 16, lines 481-489.]
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have answered my question well.
Reviewer 2 Report
Comments and Suggestions for Authors
Accept.