Uncertainty-Aware Continual Reinforcement Learning via PPO with Graph Representation Learning
Abstract
1. Introduction
- The formalization of CRL challenges (i.e., domain and task switches) by mapping them to virtual and concept drift, respectively; a formal sketch of this mapping follows the list.
- A mathematical proof demonstrating the robustness of the message-passing mechanism in GNNs against virtual drift (domain switches).
- A novel CRL framework combining GNN-based state representations with PPO, which empirically outperforms conventional methods in reducing forgetting and improving learning stability in dynamic environments.
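To make the first contribution concrete, here is a hedged formalization of the drift mapping, following the taxonomy of Gama et al.; the paper's exact definitions may differ. Writing $P_t(s)$ for the state distribution and $P_t(r, s' \mid s, a)$ for the reward–transition kernel at stage $t$ of the curriculum:

$$
\text{Virtual drift (domain switch):}\quad P_{t+1}(s) \neq P_t(s) \quad\text{while}\quad P_{t+1}(r, s' \mid s, a) = P_t(r, s' \mid s, a),
$$

$$
\text{Concept drift (task switch):}\quad P_{t+1}(r, s' \mid s, a) \neq P_t(r, s' \mid s, a).
$$

That is, a domain switch perturbs what the agent sees without changing what it should do, while a task switch changes the objective itself.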
2. Materials and Methods
2.1. MiniGrid Environments
2.1.1. Domain Switch
- MiniGrid-Empty-8x8 for simple navigation: An empty room in which the agent must reach a goal square. It serves as a baseline for evaluating the agent's retention of basic navigation skills.
- MiniGrid-GoToDoor-8x8 for goal-directed navigation: This environment requires the agent to navigate to a specified door, testing its spatial understanding.
- MiniGrid-GoToObject-8x8 for specific goal-directed navigation: This environment further increases complexity by requiring the agent to identify and navigate to specific objects based on their type or color. (A sketch of this domain-switch curriculum follows the list.)
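A minimal sketch of the domain-switch curriculum, assuming the `gymnasium` and `minigrid` packages; the exact environment IDs (version and `N` suffixes) and the per-domain step budget are assumptions, not taken from the paper:

```python
# Minimal sketch of the domain-switch curriculum described above.
# Environment IDs and the per-domain budget are assumptions.
import gymnasium as gym
import minigrid  # noqa: F401 -- importing registers the MiniGrid envs

DOMAIN_SEQUENCE = [
    "MiniGrid-Empty-8x8-v0",          # simple navigation (baseline)
    "MiniGrid-GoToDoor-8x8-v0",       # goal-directed navigation
    "MiniGrid-GoToObject-8x8-N2-v0",  # object-specific navigation
]

def domain_switch_stream(steps_per_domain: int = 100_000):
    """Yield (env_id, env, budget) triples, one domain at a time."""
    for env_id in DOMAIN_SEQUENCE:
        env = gym.make(env_id)
        try:
            yield env_id, env, steps_per_domain
        finally:
            env.close()

# Usage: train one agent across all domains without resetting its weights.
# for env_id, env, budget in domain_switch_stream():
#     agent.train(env, total_steps=budget)  # `agent` is a hypothetical API
```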
2.1.2. Task Switch
2.2. Proximal Policy Optimization (PPO)
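For reference, PPO updates the policy by maximizing the clipped surrogate objective of Schulman et al., where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
$$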
2.3. Message Passing PPO
Model Architecture
3. Theoretical Analysis of Robustness of MP Under Domain Shift
3.1. Relations of CRL to Data Drift
3.2. Robustness of MP to Virtual Drift
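As a sketch of the mechanism under analysis, assuming the GraphSAGE-style mean aggregation used in Section 2.3 (the paper's proof may use a more general operator), the representation of node $v$ at layer $k$ is

$$
h_v^{(k)} = \sigma\!\left( W^{(k)} \left[\, h_v^{(k-1)} \,\big\|\, \operatorname{mean}_{u \in \mathcal{N}(v)} h_u^{(k-1)} \,\right] \right).
$$

The intuition is that the update depends only on local neighborhood statistics, and the mean aggregator does not amplify perturbations of its inputs, so a bounded shift in the input feature distribution (virtual drift) induces a correspondingly bounded shift in the node representations.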
3.3. Instability Under Concept Drift and Implications
4. Experimental Results as Empirical Evidence
4.1. Robustness of MP-PPO in Domain Switches
4.2. Message-Passing PPO Is Competitive in Task Switching
5. Discussion
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Khetarpal, K.; Riemer, M.; Islam, R.; Precup, D.; Caccia, M. Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res. 2022, 75, 1401–1486. [Google Scholar] [CrossRef]
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 2014, 46, 1–37. [Google Scholar] [CrossRef]
- Widmer, G.; Kubat, M. Learning in the presence of concept drift and hidden contexts. Mach. Learn. 1996, 23, 69–101. [Google Scholar] [CrossRef]
- Salganicoff, M. Tolerating concept and sampling shift in lazy learning using prediction error context switching. In Lazy Learning; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1997; pp. 133–155. [Google Scholar]
- Delany, S.J.; Cunningham, P.; Tsymbal, A.; Coyle, L. A case-based technique for tracking concept drift in spam filtering. In Proceedings of the International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 13–15 December 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 3–16. [Google Scholar]
- Tsymbal, A. The problem of concept drift: Definitions and related work. Comput. Sci. Dep. Trinity Coll. Dublin 2004, 106, 58. [Google Scholar]
- Widmer, G.; Kubat, M. Effective learning in dynamic environments by explicit context tracking. In Proceedings of the Machine Learning: ECML-93: European Conference on Machine Learning, Vienna, Austria, 5–7 April 1993; Proceedings 6. Springer: Berlin/Heidelberg, Germany, 1993; pp. 227–243. [Google Scholar]
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation; Elsevier: Amsterdam, The Netherlands, 1989; Volume 24, pp. 109–165. [Google Scholar]
- Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 729–734. [Google Scholar]
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
- Gallicchio, C.; Micheli, A. Graph echo state networks. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–8. [Google Scholar]
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. Computational capabilities of graph neural networks. IEEE Trans. Neural Netw. 2008, 20, 81–102. [Google Scholar] [CrossRef] [PubMed]
- Chevalier-Boisvert, M.; Willems, L.; Pal, S. Minimalistic Gridworld Environment for Gymnasium. GitHub Repository, 2018. Available online: https://github.com/Farama-Foundation/Minigrid [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
- Konda, V.; Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 1999, 12, 1008–1014. [Google Scholar]
- Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1025–1035. [Google Scholar]
- Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 2009, 10, 1633–1685. [Google Scholar]
- Zamir, A.R.; Sax, A.; Shen, W.; Guibas, L.J.; Malik, J.; Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3712–3722. [Google Scholar]
Model architecture comparison between the proposed Message Passing PPO and the baseline PPO:

| Layer | Message Passing PPO (Proposed) | PPO (Baseline) |
|---|---|---|
| Feature Extractor | | |
| Initial Conv | Conv2d (3, 16, k = 2, s = 1) + ReLU | Conv2d (3, 32, k = 3, s = 1, p = 1) + ReLU |
| | Conv2d (16, 32, k = 2, s = 1) + ReLU | Conv2d (32, 64, k = 3, s = 1, p = 1) + ReLU |
| | | Conv2d (64, 112, k = 3, s = 1) + ReLU |
| Graph Layers | GraphSAGE (32 → 64, aggr = mean) | N/A |
| | GraphSAGE (64 → 128, aggr = mean) | |
| Final Linear | Flatten → Linear (3200 → 128) + ReLU | Flatten → Linear (2800 → 128) + ReLU |
| Actor–Critic MLP Heads | | |
| Policy Net | Linear (128 → 64) + Tanh | Linear (128 → 64) + Tanh |
| | Linear (64 → 64) + Tanh | Linear (64 → 64) + Tanh |
| Value Net | Linear (128 → 64) + Tanh | Linear (128 → 64) + Tanh |
| | Linear (64 → 64) + Tanh | Linear (64 → 64) + Tanh |
| Output Layers | | |
| Action Output | Linear (64 → 7) | Linear (64 → 7) |
| Value Output | Linear (64 → 1) | Linear (64 → 1) |
| Total Parameters | 458,040 | 467,896 |
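A minimal PyTorch sketch of the proposed feature-extractor column from the table above, assuming PyTorch Geometric's `SAGEConv` and a 4-neighbor grid-to-graph wiring of the 5 × 5 feature map (the paper may connect cells differently). Two k = 2 convolutions shrink MiniGrid's 7 × 7 × 3 partial observation to 5 × 5 × 32, giving 25 graph nodes; 25 × 128 = 3200 matches the table's final Linear (3200 → 128).

```python
# Sketch of the MP-PPO feature extractor; grid wiring is an assumption.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

def grid_edges(h: int = 5, w: int = 5) -> torch.Tensor:
    """4-neighbour connectivity over an h x w cell grid, shape (2, E)."""
    edges = []
    for r in range(h):
        for c in range(w):
            v = r * w + c
            if c + 1 < w: edges += [(v, v + 1), (v + 1, v)]
            if r + 1 < h: edges += [(v, v + w), (v + w, v)]
    return torch.tensor(edges).t()

class MPFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Two k=2 convs: 7x7x3 observation -> 5x5x32 feature map.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
        )
        # Each of the 25 cells is a node; GraphSAGE aggregates
        # neighbour messages with mean pooling, as in the table.
        self.sage1 = SAGEConv(32, 64, aggr="mean")
        self.sage2 = SAGEConv(64, 128, aggr="mean")
        self.fc = nn.Sequential(nn.Linear(25 * 128, 128), nn.ReLU())

    def forward(self, obs: torch.Tensor, edge_index: torch.Tensor):
        # obs: (3, 7, 7) image; edge_index: (2, E) grid adjacency.
        x = self.conv(obs.unsqueeze(0)).squeeze(0)  # (32, 5, 5)
        nodes = x.flatten(1).t()                    # (25, 32): one row per cell
        h = torch.relu(self.sage1(nodes, edge_index))
        h = self.sage2(h, edge_index)               # (25, 128)
        return self.fc(h.flatten())                 # (128,) state embedding

# Usage: feat = MPFeatureExtractor()(torch.randn(3, 7, 7), grid_edges())
```

The 128-dimensional output then feeds the shared actor–critic MLP heads listed in the table.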