Abstract
This work presents a reinforcement learning (RL) framework integrated into GMV’s GSharp® precise point positioning (PPP) algorithm to optimize GNSS measurement processing. Initially developed for multipath mitigation, the RL agent has evolved into a decision-making tool that evaluates the usefulness of GNSS observations to enhance positioning accuracy. The model processes GNSS data epoch by epoch using features such as pseudoranges, signal-to-noise ratios, elevation angles, and residuals. Based on these inputs, the agent decides whether each measurement should be included in the positioning solution. A custom reward function encourages decisions that reduce positioning error while maintaining solution stability. The system was trained on over 50 h of GNSS raw data collected in diverse environments, including urban canyons, suburban areas, and open spaces, promoting generalization across real-world conditions. Preliminary validation shows that the RL-enhanced PPP algorithm achieves accuracy improvements over the baseline GSharp® solution in several challenging scenarios. These results suggest that RL can support GNSS data processing by adaptively managing the quality and relevance of observations, potentially enabling more robust and precise positioning in complex environments.
1. Introduction
Satellite positioning is key for critical applications such as urban navigation, autonomous vehicles, and emergency systems. However, its performance is compromised in dense environments, like urban areas with tall buildings, where Global Navigation Satellite System (GNSS) signals suffer errors including multipath, high noise, and non-line-of-sight (NLOS) reception. These conditions can introduce meter-level biases in pseudoranges and significant carrier-phase errors, which in turn degrade the accuracy of advanced methods such as precise point positioning (PPP).
Over the past decades, several strategies have been developed to mitigate these errors. Among the classical methods are techniques directly linked with receiver-level improvements, such as specialized antennas and correlation discriminators, aimed at mitigating the multipath effect. Also notable are approaches such as sidereal filtering (SF), which exploits the daily repeatability of the satellite-receiver geometry to model and cancel the error. Dong et al. (2016) [1] proposed advanced sidereal filtering (ASF), which adjusts the individual orbital period of each satellite, while other work has employed hemispheric maps or code-minus-carrier (CMC) operators to exclude corrupted measurements prior to positioning [2]. Although effective in static conditions, these methods often have limitations in dynamic scenarios, such as reduced accuracy and increased error rates.
More recently, machine learning (ML) techniques have been explored to detect and exclude degraded signals. In supervised classification, Ozeki and Kubo (2022) [3] developed a support vector machine (SVM) model that analyzes carrier-to-noise-density ratio (C/N0) time series and residuals to identify NLOS satellites, while Munin et al. (2019) [4] employed convolutional neural networks (CNNs) trained with correlation signals to detect multipath. In parallel, works such as Shukla and Sinha (2022) [5] and Wang et al. (2022) [6] used unsupervised clustering to classify corrupted signals in real time, with good results when integrated into GNSS/inertial navigation system (INS) architectures. These proposals confirm that ML can improve the selection of observations in complex environments.
Simultaneously, reinforcement learning (RL) has started to be applied to GNSS positioning as a tool to optimize dynamic decisions. Shin et al. (2018) [7] applied deep reinforcement learning in high-accuracy GNSS/INS systems to tune sensor fusion parameters. Gao et al. (2020) [8] proposed RL-AKF, an RL-guided adaptive Kalman filter, and Tang et al. (2023) [9] developed a deep RL-based agent to correct GNSS errors by learning temporal patterns. While these works show promise, they focus on internal filter adjustments or global corrections, rather than the active selection of GNSS observations.
This work addresses that gap: real-time autonomous decision making on which GNSS observations to use within PPP positioning. Currently, this selection is based on fixed heuristic criteria, such as elevation or C/N0, which do not adapt to the variability of the environment or the context of each measurement. In contrast, we propose integrating an RL agent into the GSharp® PPP algorithm so that it evaluates, at each epoch, the individual quality of the observations (pseudorange, carrier phase, signal metrics) and decides whether or not they should be included in the solution.
Our approach turns the agent into an adaptive filter trained to maximize positioning accuracy. The agent interacts with the Kalman filter, providing only those observations it has assessed as reliable. It receives rewards based on the positioning error, learning to recognize patterns linked to accurate measurements.
For training, we used more than 50 h of GNSS data collected in varied environments (from open areas to urban canyons), allowing the model to generalize to real-world conditions. In experimental tests, the agent reduced error compared to conventional PPP, particularly in adverse conditions with a strong presence of multipath or NLOS signals.
The results demonstrate that an RL agent can increase the robustness and accuracy of GNSS positioning through active and adaptive management of observations. This paper describes the architecture, methodology, and validation of this proposal, which seeks to improve PPP performance specifically in environments where traditional techniques fail due to multipath and NLOS errors.
2. Materials and Methods
The development of this system involved an iterative process of design, testing and tuning, combining real GNSS data with reinforcement learning techniques and a functional precise positioning environment. This section describes the main components of the system, from the data used to the integration of the RL agent and the evaluation criteria applied.
The system has been trained and validated using real GNSS observations obtained during different campaigns carried out between February 2023 and January 2024, in the center and surroundings of Madrid, Spain, under mostly kinematic conditions. The data include GPS signals—L1C, L2C and L5Q; Galileo: E1C, E5aQ and E5bQ; and BeiDou: B1C and B2Ap—and cover environments with different satellite visibility conditions, from open areas to urban canyons with strong presence of reflections and occlusions.
In total, approximately 50 h of data was used. In a previous study focused on multipath mitigation using machine learning techniques, these data allowed the construction of a scenario analysis framework that, while it did not eliminate reflection errors, did provide a detailed understanding of their behavior. This knowledge motivated the evolution towards an autonomous decision making approach using reinforcement learning, with the goal of translating that information into a system capable of dynamically managing the quality of observations without relying on predefined heuristics or human biases.
We process GNSS observations to extract, at each epoch, a set of features that will be used as input to the RL agent. These include pseudoranges, carrier phase, Doppler shift, signal-to-noise ratio (SNR), satellite elevation, pseudorange residuals, ionospheric delay, as well as rate of change in CMC and pseudorange consistency rate (CRC) along with features specific to the day of the year, second of the day, and time since the previous observation, for each satellite. This set of features had been previously validated in a scenario analysis framework, and allows for robust characterization of the real-time status of each observation.
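As an illustration, the per-epoch feature assembly can be sketched as follows; the feature names and dictionary layout are hypothetical and do not reflect GSharp®'s internal interface:

```python
import numpy as np

# Illustrative feature order; these names are assumptions, not the
# actual GSharp(R) feature layout.
FEATURES = [
    "pseudorange", "carrier_phase", "doppler", "snr", "elevation",
    "pr_residual", "iono_delay", "cmc_rate", "crc",
    "day_of_year", "second_of_day", "dt_since_last_obs",
]

def epoch_features(observations):
    """Stack one row of features per tracked satellite for a single epoch.

    `observations` maps a satellite id to a dict of feature values.
    """
    rows = [[observations[sat][name] for name in FEATURES]
            for sat in sorted(observations)]
    return np.asarray(rows, dtype=np.float64)
```

The resulting matrix has one row per visible satellite and one column per feature, which is the shape the agent consumes at each epoch.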
2.1. Reinforcement Learning Agent Design
The initial agent design used a continuous action space representing a multiplier of the GNSS measurement weight in the Kalman filter. This followed the standard proximal policy optimization (PPO) implementation, where each action is sampled from a normal distribution. However, this approach had several limitations. Agent decisions tended to concentrate around the mean of the action range, causing the Kalman filter to prioritize position propagation over new GNSS measurements. In addition, to handle measurements unavailable in a given epoch, the agent's input and output were fixed to the maximum possible number of observations, with the positions lacking real observations padded with neutral values and masked during decision making.
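A minimal sketch of this padding-and-masking step, assuming a hypothetical maximum of 40 satellite slots and a neutral fill value of 0 (both values are illustrative):

```python
import numpy as np

MAX_SATS = 40   # assumed maximum number of simultaneous observations
NEUTRAL = 0.0   # assumed neutral fill value for padded slots

def pad_epoch(features, max_sats=MAX_SATS):
    """Pad per-satellite features to a fixed size and return a validity mask."""
    n, d = features.shape
    padded = np.full((max_sats, d), NEUTRAL, dtype=features.dtype)
    padded[:n] = features
    mask = np.zeros(max_sats, dtype=bool)
    mask[:n] = True   # True = real observation, False = padding
    return padded, mask
```

The mask travels with the padded tensor so that downstream decision making can ignore the artificial slots.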
To alleviate these limitations, we changed the implementation of the projection network within the actor so that actions were sampled from a Beta distribution [10]. Its asymmetry makes it possible to bias the distribution towards specific values, letting the Kalman filter weigh the GNSS measurements as rated by the agent against direct propagation of the position. This change provided more control over the influence of each measurement, but it also revealed the agent's strong dependence on exploration, leading to instability during learning.
Consequently, we modified the agent design, replacing the continuous actions sampled from a distribution with a discrete action space, thereby reducing the combinatorial complexity. The agent now makes an independent binary decision to accept or reject each available observation. Each decision is modeled as a parameterized Bernoulli distribution, and the actions corresponding to satellites not visible at a given epoch are masked directly by modifying the logits [11,12].
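The masked Bernoulli decision step can be sketched as follows. This is a simplified NumPy version for illustration; in the real agent the logits come from the policy network:

```python
import numpy as np

def masked_bernoulli_decisions(logits, valid_mask, rng):
    """Sample one accept/reject decision per satellite slot.

    Invalid (padded) slots have their logits driven to -inf so their
    acceptance probability is exactly 0 and they are never selected.
    """
    masked_logits = np.where(valid_mask, logits, -np.inf)
    p_accept = 1.0 / (1.0 + np.exp(-masked_logits))   # sigmoid
    return (rng.random(logits.shape) < p_accept).astype(np.int64)
```

Setting the masked logits to negative infinity, rather than post-filtering sampled actions, keeps the sampling and the log-probability computation consistent during policy-gradient updates.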
Finally, the agent consists of two neural networks, the actor and the critic (value network), which are shared across all GNSS measurements. Both receive a state from our observation space, composed of the features of each possible GNSS observation, whether real or padded. This state is encoded separately by the policy and value networks, both of which include layers with Long Short-Term Memory (LSTM) cells to capture temporal dependencies in the training trajectories, allowing the agent to detect long- and short-term patterns in the evolution of the observations. Once the state is encoded, the actor network computes logits for the discrete action in its final projection layer, while the value network uses two independent heads to compute the global value of each objective and the subregional value. This topic is further addressed in the next subsection, where the reward function and the transition from a single-objective to a multi-objective approach are detailed. The resulting architecture is lightweight to allow efficient epoch-based inference and was designed with real-time flow compatibility in mind.
2.2. Reward Function and Multi-Objective Problem
The reward function is one of the most critical parts of any RL system. At first, it was defined as the difference between the Euclidean error made by the agent and the Euclidean error of the baseline, addressing the main objective of the problem: to reduce the system's positioning error. Although with this formulation the reward was perfectly human-readable, we suspected there remained potential for improvement in RL terms.
We detected a nuance that could be detrimental to the agent's performance: positive rewards represented the improvement in position against the baseline, with both compared against a precise reference, so the improvement is bounded and at best the agent can be as good as the reference. The degradation, however, is not bounded, causing the reward to have unequal sensitivity to positive and negative values, since it was not centered at 0.
It was necessary to reformulate the reward function to center it at 0 and balance its sensitivity. Using the logarithmic improvement ratio centered the reward at 0, but the function decreased and grew at different rates, so we expressed it as a signed piecewise function that grows at the same rate as it decreases:

$r_t = \begin{cases} \log\left(e_{\mathrm{base}} / e_{\mathrm{agent}}\right), & e_{\mathrm{agent}} \le e_{\mathrm{base}} \\ -\log\left(e_{\mathrm{agent}} / e_{\mathrm{base}}\right), & e_{\mathrm{agent}} > e_{\mathrm{base}} \end{cases}$

where $e_{\mathrm{agent}}$ is the Euclidean error made by the agent and $e_{\mathrm{base}}$ is the Euclidean error of the baseline.
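A small sketch of the signed log-ratio reward described above; the epsilon guard against zero errors is an assumption for numerical safety, not part of the published formulation:

```python
import math

def reward(e_agent, e_base):
    """Signed piecewise log-ratio reward, centered at 0.

    Positive when the agent beats the baseline, negative otherwise,
    and symmetric: reward(a, b) == -reward(b, a).
    """
    eps = 1e-9  # assumed guard against division by or log of zero
    e_agent = max(e_agent, eps)
    e_base = max(e_base, eps)
    if e_agent <= e_base:
        return math.log(e_base / e_agent)    # improvement over baseline
    return -math.log(e_agent / e_base)       # degradation vs. baseline
```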
This setting reduced the reward's human interpretability but proved more appropriate for agent learning. However, it still had a limitation: the position error depends on the satellite geometry and the geometry of the environment, so the same position with different geometry can produce very different errors. When analyzing the individual components of the positioning error in the local North-East-Up (NEU) frame, we observed that some components had minimal impact, while others were responsible for most of the deviation. This revealed that not all axes contribute equally to the overall Euclidean error, which guided the design of our reward function.
To address this, we converted the single-objective RL problem, in which the only objective was to reduce the position error, into a multi-objective problem in which each component of the error is a separate objective, forcing the agent to learn to balance the different sources of error without introducing the human bias that would have arisen had we actively chosen the weight of each component.
By switching from the global reward function mentioned above to the same logarithmic ratio function applied per component, we vectorized the rewards. To maintain the PPO algorithm over a vectorized reward space, we define a weight vector $\mathbf{w}$ containing one weight per reward component. This allows us to compute a scalar reward as the dot product between the reward vector and the weight vector: $\bar{r}_t = \mathbf{w}^{\top}\mathbf{r}_t = \sum_{i=1}^{n} w_i\, r_{i,t}$, where $\mathbf{r}_t$ represents the vector of individual reward signals at time step $t$, $\mathbf{w}$ is the vector of corresponding weights, $n$ is the number of reward components, and $t$ is the current timestep.
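The scalarization itself is a plain dot product, as in this short sketch (the example component values are illustrative only):

```python
import numpy as np

def scalarize(reward_vec, weight_vec):
    """Scalar reward as the dot product of per-component rewards and weights."""
    reward_vec = np.asarray(reward_vec, dtype=np.float64)
    weight_vec = np.asarray(weight_vec, dtype=np.float64)
    assert reward_vec.shape == weight_vec.shape
    return float(weight_vec @ reward_vec)
```

For example, with per-axis rewards [0.5, -0.2, 0.1] (say, N, E, U) and weights [0.4, 0.3, 0.3], the scalar reward is 0.17.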
Because standard policy gradient algorithms apply the Bellman operator to a single objective, we have to extend it to a value space containing all bounded functions $\mathbf{Q}\colon S \times A \times \Omega \to \mathbb{R}^{m}$, the expected total reward under $m$-dimensional preference vectors $\mathbf{w} \in \Omega$, and express it as follows:

$(\mathcal{T}\mathbf{Q})(s, a, \mathbf{w}) = \mathbf{r}(s, a) + \gamma\, \mathbb{E}_{s'}\!\left[\mathbf{Q}\!\left(s', \arg\max_{a'} \mathbf{w}^{\top}\mathbf{Q}(s', a', \mathbf{w}), \mathbf{w}\right)\right]$
Here, $S$ denotes the state space and $A$ the action space, $s \in S$ is a particular state of the environment and $a \in A$ is the action taken in that state (in our case, the binary decision to accept or reject an observation). $\mathbf{Q}(s, a, \mathbf{w})$ is the principal state-action value function associated with the global multi-objective reward under a given preference vector $\mathbf{w}$ in $\Omega$. By driving the Bellman residual towards zero for all $(s, a)$ and $\mathbf{w}$, the agent optimizes the reward frontier. During learning, a set of preference vectors is generated; their components are updated dynamically according to the relative improvement capability of each reward component and are used to guide the agent's behavior. The components of these vectors are constrained to lie between 0 and 1, so larger weights indicate more important objectives. Following the PPO formulation extended to the multi-objective space [13], the critic model estimates an advantage per criterion, which is weighted to compute the surrogate policy objective to be optimized:

$L(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\, \mathbf{w}^{\top}\hat{\mathbf{A}}_t,\ \operatorname{clip}\!\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right)\mathbf{w}^{\top}\hat{\mathbf{A}}_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$
where the advantage $\hat{A}_{i,t} = \hat{G}_{i,t} - V_i(s_t)$ is a noisy estimate from the critic model used to evaluate the actor's strategy, $\hat{G}_{i,t}$ is the discounted return of objective $i$ at time $t$, and $V_i(s_t)$ is the state value estimated by the critic model.
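A sketch of this clipped surrogate for a single timestep, with per-objective advantages scalarized by a preference vector. This is a simplified scalar-ratio version under the standard PPO clipping parameter; it is not the authors' implementation:

```python
import numpy as np

def mo_ppo_surrogate(ratio, advantages, weights, clip_eps=0.2):
    """Clipped PPO surrogate for one timestep.

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantages` holds one
    advantage per objective; `weights` is the preference vector.
    """
    a = float(np.dot(weights, advantages))            # weighted advantage
    unclipped = ratio * a
    clipped = float(np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)) * a
    return min(unclipped, clipped)                    # pessimistic bound
```

With ratio 1.5, advantages [1.0, 0.5], and equal weights [0.5, 0.5], the weighted advantage is 0.75 and the clipped term (1.2 × 0.75 = 0.9) wins the minimum.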
2.3. Parallel Environment Training and Integration with GSharp®
Unlike typical simulated environments in RL, the GSharp® positioning algorithm is designed for in-vehicle integration, so it operates in a single-threaded way and does not support GPU vectorized execution, which presents significant challenges in terms of efficiency and diversity of experience during training.
To address these challenges, we took advantage of the fact that PPO uses stochastic policies to collect trajectories, and that we can launch multiple concurrent instances of GSharp®. This allowed us to run parallel environments processing the same scenario, where the agent collects different trajectories due to stochastic sampling.
Although the use of multiple instances partially addressed the need for exploration, a major challenge remained: reliably evaluating policies in a partially observable environment. Two moments that may appear similar, within or across scenarios, can in fact differ significantly due to unobserved variables, such as satellite geometry or urban obstructions. As a result, the impact of each policy update cannot be reliably assessed in an immediate, online manner, as would be possible in fully observable or simulated environments.
To obtain more reliable feedback without losing the newly acquired exploration capability, we developed a cascading mechanism, as shown in Figure 1. It divides the parallel environments into subgroups with time delays. These delays allow new and previous policies to be compared under the same conditions, without having to wait for scenario completion to evaluate a policy.
Figure 1.
Illustration of the cascading environment setup used during training. Each group of environments collects experience with a delay of one policy update with respect to the previous group, enabling more consistent comparisons between policy versions.
This cascade approach allows the collection of experience in a continuous and evaluable mode, while respecting the constraints of the original environment and ensuring that the learned policies were evaluated in a consistent and comparable manner. This design balances the exploration capability of each individual policy with the evaluation of policy updates under the same conditions, but at the sacrifice of higher computational cost in terms of memory and CPU.
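The cascading schedule can be illustrated with a small sketch, where group g lags group g-1 by one policy update, so adjacent policy versions process the same scenario segment at different wall-clock steps (the function name and representation are ours, for illustration):

```python
def cascade_schedule(n_groups, steps):
    """List, per step, the (group, scenario-segment) pairs being processed
    when group g starts one policy update after group g-1."""
    timeline = []
    for step in range(steps):
        active = [(g, step - g) for g in range(n_groups) if step - g >= 0]
        timeline.append(active)
    return timeline
```

At step 2 with three groups, group 0 (newest policy lineage) is on segment 2 while group 2 is just starting segment 0, so each segment is eventually visited by consecutive policy versions and their errors can be compared like for like.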
The source code of the RL agent, the reward function and the training environment will be published in a public repository after the review process. The data used belong to private campaigns under internal use agreement, and cannot be published openly.
3. Results
The system has been trained and evaluated online on 11 main scenarios, including automotive trajectories in both favorable and challenging GNSS environments. These are divided into 37 sub-scenarios, organized by increasing difficulty, as part of a curriculum learning strategy based on baseline positioning error. This strategy is inspired by how humans learn, starting with simpler tasks that become more complicated over time.
Instead of exposing the agent directly to scenarios of maximum difficulty, we designed a gradual progression based on the baseline error (PPP without the agent). Overall horizontal and vertical positioning errors were computed from the NEU components and used as the main performance indicators. The sub-scenarios have been classified according to their difficulty in these two components, with lower and upper thresholds segmenting the error into low, medium, and high levels. The progression prioritizes increasing vertical difficulty over horizontal difficulty.
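The thresholding and ordering described above can be sketched as follows; the threshold values (1 m and 5 m) are illustrative, not the ones used in the study:

```python
def difficulty_level(error, low_thr, high_thr):
    """Classify a baseline error into low/medium/high by two thresholds."""
    if error < low_thr:
        return "low"
    if error < high_thr:
        return "medium"
    return "high"

def sort_key(h_err, v_err, low=1.0, high=5.0):
    """Curriculum ordering key: vertical difficulty increases before
    horizontal difficulty (thresholds in meters are assumed values)."""
    levels = {"low": 0, "medium": 1, "high": 2}
    return (levels[difficulty_level(v_err, low, high)],
            levels[difficulty_level(h_err, low, high)])
```

Sorting sub-scenarios by this key yields the low-to-high progression, with the vertical component as the primary ordering criterion.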
Training has been completed on all 37 sub-scenarios, covering a wide variety of difficulty combinations in horizontal and vertical components. The progression of scenarios ranges from low error environments to scenarios in more complex sections. This enables continuous evaluation during training, as each data collection validates the learned policy in new observation states.
Table 1 below summarizes the distribution of the classified sub-scenarios and their availability for training. These correspond to 11 scenarios out of the total 50 h available, and the remaining scenarios are reserved for future offline validation to assess generalization.
Table 1.
Distribution of sub-scenarios by error level in horizontal and vertical components.
During training, multiple instances of the environment are run per sub-scenario using the same policy. Each step of the environment is equivalent to 60 s of GNSS data processed by GSharp® at 10 Hz, providing 600 decision instants per step. This sequence is reused for 10 training epochs, allowing the agent to adjust its policy within the same sub-scenario. Each set of sub-scenarios covers the variability of the main scenario from which it is derived. Since GSharp® is deterministic, the system is designed so that, after full training, the learned policy can be frozen and applied in direct inference, ensuring replicability and consistency in decision making.
To analyze the online evaluation of the agent during training, we show five examples of sub-scenarios that illustrate its performance: improvements obtained by the agent, an intermediate behavior, and a case where performance worsens. All the figures below share the same legend: the dashed blue line is the horizontal component of the baseline error, corresponding to the original GSharp® PPP solution without the RL-based observation filter; the dashed red line is the baseline error in the vertical component; for the agent, the solid purple line is the error in the horizontal component and the solid red line is the error in the vertical component. Both baseline and agent trajectories are evaluated against a high-accuracy reference position. The agent result represents the average of the 20 cascaded training environments, and the shading indicates the range from minimum to maximum error. We now briefly discuss the behavior observed in each figure.
Figure 2 represents the errors obtained in a fragment of the sub-scenario with low errors, recorded in an interurban highway. At the beginning of the fragment we can see how the agent manages to reduce a peak error of 3.5 m to 1.5 m.
Figure 2.
Agent vs baseline positioning error in a low-difficulty sub-scenario (interurban highway).
Figure 3 shows the behavior of the agent in response to outlier errors. The agent reduces the first error peak from 100 m to 20 m and matches the second peak. In the vertical direction, although it matches the peak value, it worsens the preceding instants.
Figure 3.
Agent vs baseline positioning error in a section with dead reckonings.
For Figure 4, a design decision made during training must be mentioned. The Kalman filter needs time to converge, and in earlier training we observed that starting a sub-scenario in a complex zone could hinder that convergence. The decision was to start each sub-scenario at the start of the nominal scenario and have the agent begin taking actions at the formal start, with the Kalman filter already converged. This introduces initial perturbations in the behavior of the Kalman filter when the agent begins to make decisions. Specifically, in Figure 4, we observe a peak error in the vertical component of up to 10 m, eventually converging to a constant error of around 0.5 m.
Figure 4.
Initial vertical error spike due to the agent starting decisions after Kalman convergence. The solid line represents the mean performance across all training environments, while the shaded region indicates the variability in the error.
This initial disturbance is not always resolved quickly; in Figure 5, it persists for several seconds before the agent stabilizes the error. Nevertheless, the agent demonstrates significant improvement when faced with a sudden jump in the baseline, underscoring the remarkable robustness that the system can achieve.
Figure 5.
Agent vs. baseline positioning error in cold convergence. The solid line represents the mean performance across all training environments, while the shaded region indicates the variability in the error.
In the last of the fragments, Figure 6, we can see how the perturbation in the positioning estimates, triggered by the agent's early actions, completely breaks the convergence of the Kalman filter, resulting in a jump in the error that persists over short periods of time. This produces a stepped error pattern, where the agent fails to stabilize its behavior or recover accuracy. This case reflects one of the main lines of improvement of the system.
Figure 6.
Case where perturbation breaks Kalman convergence. The solid line represents the mean performance across all training environments, while the shaded region indicates the variability in the error.
These examples illustrate both the potential and the current limits of the system. They guide future improvements in the management of such disturbances and in robustness in very complex urban environments. The dependence of the PPP system, and of the RL agent, on satellite geometry and the geometry of the environment is evident. In the last case discussed, the steps observed in the error correspond to vehicle turns, for example from a southbound street to an eastbound one.
Taken together, these fragments confirm that the agent improves accuracy in most situations, but they also show that the interaction with the Kalman filter at initial times requires more robust handling to avoid unstable convergences.
4. Discussion
Training the agent through a curriculum of sub-scenarios with increasing difficulty has validated its ability to improve PPP accuracy, particularly in the vertical component, which is crucial for applications requiring precise altitude measurements. In simple environments, the agent consistently rejects low-quality observations. In more degraded conditions, it manages to reduce extreme errors that the base system does not optimally handle, although occasional perturbations have also been observed at the beginning of some sub-scenarios, related to the interaction with the Kalman filter. This effect, which is more pronounced in the vertical component, can temporarily compromise stability if the Kalman filter is still converging or if the agent takes abrupt actions relative to the number of measurements received prior to inference.
Although a stochastic policy was employed during training to maximize the diversity of experiences, the results presented are from an online validation phase, rather than from a static inference environment where the policy is fixed. Validation of the system in a deterministic production scenario remains an active line of work. Previous research explored the use of machine learning for multipath mitigation, and this work demonstrates that a reinforcement agent can discriminate the utility of real-time GNSS observations within a functional positioning pipeline and provide a foundation for further research into stability and generalization improvements. A key direction for future research is to improve the definition of the state, transitioning from partial to more informed representations that may increase the agent’s robustness in maintaining PPP positioning accuracy under adverse conditions. Additionally, future work could explore interpretability techniques to better understand which features influence the agent’s decisions to accept or reject GNSS observations.
Author Contributions
Conceptualization, Á.T. and A.C.; methodology, Á.T.; software, Á.T. and M.C.; validation, Á.T., A.C., A.D.-Á. and V.R.-F.; formal analysis, Á.T.; investigation, Á.T.; resources, Á.T.; data curation, Á.T.; writing—original draft preparation, Á.T.; writing—review and editing, Á.T., V.R.-F. and A.G.; visualization, Á.T.; supervision, A.C., A.D.-Á., V.R.-F. and A.G.; project administration, A.G.; funding acquisition, A.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The GNSS observation data used in this study cannot be shared due to confidentiality and usage restrictions. However, the source code and framework used for training and evaluation will be made publicly available at https://github.com/AlvaroTena/NavAI (accessed on 23 April 2025).
Conflicts of Interest
Authors Álvaro Tena, María Crespo, Adrián Chamorro and Ana González were employed by the company GMV. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| ASF | Advanced Sidereal Filtering |
| CMC | Code-Minus-Carrier |
| CNN | Convolutional Neural Network |
| C/N0 | Carrier-to-Noise-Density Ratio |
| GNSS | Global Navigation Satellite System |
| INS | Inertial Navigation System |
| LSTM | Long Short-Term Memory |
| ML | Machine Learning |
| NEU | North-East-Up |
| NLOS | Non Line-of-Sight |
| PPO | Proximal Policy Optimization |
| PPP | Precise Point Positioning |
| RL | Reinforcement Learning |
| RL-AKF | Reinforcement Learning Adaptive Kalman Filter |
| SF | Sidereal Filtering |
| SVM | Support Vector Machine |
References
- Dong, D.; Wang, M.; Chen, W.; Zeng, Z.; Song, L.; Zhang, Q.; Cai, M.; Cheng, Y.; Lv, J. Mitigation of multipath effect in GNSS short baseline positioning by the multipath hemispherical map. J. Geod. 2016, 90, 255–262. [Google Scholar] [CrossRef]
- Caamano, M.; Crespillo, O.G.; Gerbeth, D.; Grosch, A. Detection of GNSS Multipath with Time-Differenced Code-Minus-Carrier for Land-Based Applications. In Proceedings of the 2020 European Navigation Conference (ENC), Dresden, Germany, 23–24 November 2020; pp. 1–12. [Google Scholar] [CrossRef]
- Ozeki, T.; Kubo, N. GNSS NLOS Signal Classification Based on Machine Learning and Pseudorange Residual Check. Front. Robot. AI 2022, 9, 868608. [Google Scholar] [CrossRef]
- Munin, E.; Blais, A.; Couellan, N. Convolutional Neural Network for Multipath Detection in GNSS Receivers. arXiv 2019, arXiv:1911.02347. [Google Scholar] [CrossRef]
- Shukla, A.K.; Sinha, S.A. Unsupervised Machine Learning Approach for Multipath Classification of NavIC Signals. In Proceedings of the 35th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2022), Denver, CO, USA, 19–23 September 2022; pp. 2618–2624. [Google Scholar] [CrossRef]
- Wang, H.; Pan, S.; Gao, W.; Xia, Y.; Ma, C. Multipath/NLOS Detection Based on K-Means Clustering for GNSS/INS Tightly Coupled System in Urban Areas. Micromachines 2022, 13, 1128. [Google Scholar] [CrossRef] [PubMed]
- Shin, H.; Lee, J.; Sung, C.-k. Implementation of Deep Reinforcement Learning on High Precision GNSS/INS Augmentation System. In Proceedings of the 31st International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2018), Miami, FL, USA, 24–28 September 2018; pp. 3179–3185. [Google Scholar] [CrossRef]
- Gao, X.; Luo, H.; Ning, B.; Zhao, F.; Bao, L.; Gong, Y.; Xiao, Y.; Jiang, J. RL-AKF: An Adaptive Kalman Filter Navigation Algorithm Based on Reinforcement Learning for Ground Vehicles. Remote Sens. 2020, 12, 1704. [Google Scholar] [CrossRef]
- Tang, J.; Li, Z.; Guo, R.; Zhao, H.; Wang, Q.; Liu, M.; Xie, S.; Polycarpou, M. Improving GNSS Positioning Correction Using Deep Reinforcement Learning with Adaptive Reward Augmentation Method. In Proceedings of the 36th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2023), Denver, CO, USA, 11–15 September 2023; pp. 38–52. [Google Scholar] [CrossRef]
- Petrazzini, I.G.B.; Antonelo, E.A. Proximal Policy Optimization with Continuous Bounded Action Space via the Beta Distribution. arXiv 2021, arXiv:2111.02202. [Google Scholar] [CrossRef]
- Stolz, R.; Krasowski, H.; Thumm, J.; Eichelbeck, M.; Gassert, P.; Althoff, M. Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking. arXiv 2024. [Google Scholar] [CrossRef]
- Tang, C.Y.; Liu, C.H.; Chen, W.K.; You, S.D. Implementing action mask in proximal policy optimization (PPO) algorithm. ICT Express 2020, 6, 200–203. [Google Scholar] [CrossRef]
- Khoi, N.D.H.; Pham Van, C.; Tran, H.V.; Truong, C.D. Multi-Objective Exploration for Proximal Policy Optimization. In Proceedings of the 2020 Applying New Technology in Green Buildings (ATiGB), Da Nang, Vietnam, 12–13 March 2021; pp. 105–109. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.