Article

Uncertainty-Aware Continual Reinforcement Learning via PPO with Graph Representation Learning

Department of AI-Based Convergence, Dankook University, 152 Jukjeon-ro, Yongin-si 16890, Republic of Korea
Mathematics 2025, 13(16), 2542; https://doi.org/10.3390/math13162542
Submission received: 3 July 2025 / Revised: 31 July 2025 / Accepted: 5 August 2025 / Published: 8 August 2025
(This article belongs to the Special Issue Decision Making under Uncertainty in Soft Computing)

Abstract

Continual reinforcement learning (CRL) agents face significant challenges when encountering distributional shifts. This paper formalizes these shifts into two key scenarios, namely virtual drift (domain switches), where object semantics change (e.g., walls becoming lava), and concept drift (task switches), where the environment’s structure is reconfigured (e.g., moving from object navigation to a door key puzzle). This paper demonstrates that while conventional convolutional neural networks (CNNs) struggle to preserve relational knowledge during these transitions, graph convolutional networks (GCNs) can inherently mitigate catastrophic forgetting by encoding object interactions through explicit topological reasoning. A unified framework is proposed that integrates GCN-based state representation learning with a proximal policy optimization (PPO) agent. The GCN’s message-passing mechanism preserves invariant relational structures, which diminishes performance degradation during abrupt domain switches. Experiments conducted in procedurally generated MiniGrid environments show that the method significantly reduces catastrophic forgetting in domain switch scenarios. While showing comparable mean performance in task switch scenarios, our method demonstrates substantially lower performance variance (Levene’s test, p < 1.0 × 10⁻¹⁰), indicating superior learning stability compared to CNN-based methods. By bridging graph representation learning with robust policy optimization in CRL, this research advances the stability of decision-making in dynamic environments and establishes GCNs as a principled alternative to CNNs for applications requiring stable, continual learning.

1. Introduction

Reinforcement learning (RL) has achieved remarkable success in solving complex sequential decision-making problems, yet a predominant assumption is that the environment remains stationary [1]. However, real-world scenarios are seldom static; they are characterized by dynamic changes, requiring agents to continuously adapt. Continual reinforcement learning (CRL) addresses this challenge, focusing on agents that can learn sequentially in environments that change dynamically while preserving previously acquired knowledge [2].
A critical challenge in CRL arises from distributional shifts in the agent’s experience, which can manifest in various forms. In this study, these shifts are formalized by mapping them to established concepts in the data drift literature [3]. Two primary scenarios are defined: (1) task switch, where the underlying task structure and the task itself change (e.g., transitioning from an object navigation task to a door key puzzle), so that the goal, objective, and reward change correspondingly. This is mapped to concept drift [4,5], where P(Y|X) changes while P(X) remains the same, as the mapping from state–action pairs to optimal outcomes is altered. (2) Domain switch, where the semantics of states change while the task’s core logic remains intact (e.g., lava tiles are replaced with walls, but the agent still “falls into” the wall and dies). This is mapped to virtual drift [6,7,8], where the input data distribution P(X) changes, but the fundamental conditional probability P(Y|X), representing the environment’s transition dynamics and reward functions, remains invariant with respect to the objects’ relational roles (Figure 1).
Conventional approaches in deep RL, which predominantly rely on convolutional neural networks (CNNs) [9,10], often falter under these shifts. CNNs excel at extracting visual features and implicitly encode the spatial relations of states. When a switch occurs, the change in an object’s visual appearance, or in its relation to the reward, leads CNNs to discard learned policies, even if the underlying relational structure of the environment is preserved. This results in severe catastrophic forgetting [11], in which previously learned tasks are forgotten after a series of new tasks is learned.
To overcome this limitation, this study proposes a framework centered on graph neural networks (GNNs) [12,13,14]. GNNs provide a powerful inductive bias for relational reasoning by explicitly modeling entities and their relationships as a graph structure. The core mechanism of GNNs, message-passing [13,15], allows the network to learn functions based on the topological structure of interactions between objects, making the architecture inherently more robust to changes in node features P(X), as seen in virtual drift.
Building on this theoretical foundation, this study provides empirical results together with a mathematical proof. The proof establishes the robustness of a deep RL agent that uses GNNs to solve CRL problems, but only with respect to virtual drift. Specifically, a proximal policy optimization (PPO) agent is used and evaluated in MiniGrid environments [16]. The agent’s performance is evaluated in both domain switch (formalized here as virtual drift) and task switch (formalized here as concept drift) scenarios. The experiments empirically validate the theoretical findings, showing that the GNN-based PPO significantly reduces catastrophic forgetting under domain switches and accelerates adaptation to new tasks compared to strong CNN baselines.
In summary, the main contributions of this paper are as follows:
  • The formalization of CRL challenges (i.e., domain and task switches) by mapping them to virtual and concept drift, respectively.
  • A mathematical proof demonstrating the robustness of the message-passing mechanism in GNNs against virtual drift (domain switches).
  • A novel CRL framework combining GNN-based state representations with PPO, which empirically outperforms conventional methods in reducing forgetting and improving learning stability in dynamic environments.
In doing so, this study bridges the gap between graph representation learning and robust decision-making under uncertainty, establishing GNNs as a principled and effective foundation for developing the next generation of adaptive agents.

2. Materials and Methods

2.1. MiniGrid Environments

To validate the proposed approach, all 8 × 8 MiniGrid environments available in the MiniGrid benchmark were considered [16], and three were selected with a view toward a more challenging continual learning context. MiniGrid-Empty-8x8 serves as a baseline for navigation tasks in a simple, obstacle-free setting. MiniGrid-GoToDoor-8x8 and MiniGrid-GoToObject-8x8, both instruction-following tasks, were chosen because they contain many visual objects, such as colored doors and objects, which makes changes in the environment harder to recognize through interactions with those objects.
To evaluate the continual learning capabilities of the proposed method, two experimental setups were designed, namely task switch and domain switch. These setups are intended to test the agent’s ability to adapt to new tasks and domains without catastrophic forgetting. Each setup is described in detail below.

2.1.1. Domain Switch

The domain switch experiment focuses on testing the agent’s ability to generalize across visually altered domains. For this setup, three environments are used, namely MiniGrid-Empty-8x8, MiniGrid-GoToDoor-8x8, and MiniGrid-GoToObject-8x8. These environments were chosen because they represent progressively more complex tasks that require different policies.
  • MiniGrid-Empty-8x8 for simple navigation: This environment provides a simple navigation task, serving as a baseline for evaluating the agent’s ability to retain basic navigation skills.
  • MiniGrid-GoToDoor-8x8 for goal-directed navigation: This environment requires the agent to navigate to specified doors. It tests the agent’s ability to handle spatial understanding.
  • MiniGrid-GoToObject-8x8 for specific goal-directed navigation: This environment further increases complexity by requiring the agent to identify and navigate to specific objects based on their type or color.
However, in this case, there is a 50% probability that a domain switch occurs in each episode. During a domain switch, the pixel values of objects in the environment are replaced with visually distinct alternatives (Figure 2). For example, doors can be replaced with walls or other objects.
These modifications ensure that while the underlying task objectives remain unchanged, the visual representation of the environment differs significantly. This design assesses the robustness of the learned policies and evaluates how well the agent can generalize across visually distinct but semantically equivalent tasks. By introducing domain switches probabilistically, it is ensured that the agent encounters both the original and altered domains during training, providing a balanced evaluation of its adaptability. Specifically, each environment was trained for 140,000 steps, with the domain-switch decision made at every step.
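To make the protocol concrete, the following is a minimal sketch of how such a domain switch could be implemented as a Gymnasium wrapper around a MiniGrid environment. The SWAP_TABLE, the wrapper name, and the per-episode sampling are illustrative assumptions, not the exact implementation used in this study.

```python
import random
import gymnasium as gym
from minigrid.core.constants import OBJECT_TO_IDX

# Hypothetical remapping of object identities (e.g., walls rendered as lava);
# the actual set of swapped objects is an assumption for illustration.
SWAP_TABLE = {OBJECT_TO_IDX["wall"]: OBJECT_TO_IDX["lava"],
              OBJECT_TO_IDX["lava"]: OBJECT_TO_IDX["wall"]}

class DomainSwitchWrapper(gym.Wrapper):
    """With probability p_switch, relabels object identities for the current
    episode while leaving layout, dynamics, and rewards untouched (virtual drift)."""

    def __init__(self, env, p_switch=0.5):
        super().__init__(env)
        self.p_switch = p_switch
        self.switched = False

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.switched = random.random() < self.p_switch
        return self._remap(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._remap(obs), reward, terminated, truncated, info

    def _remap(self, obs):
        if not self.switched:
            return obs
        img = obs["image"].copy()                 # (H, W, 3): object, color, state
        for src, dst in SWAP_TABLE.items():
            img[..., 0][obs["image"][..., 0] == src] = dst
        return {**obs, "image": img}
```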

2.1.2. Task Switch

The task switch experiment involves alternating between three environments, i.e., MiniGrid-Empty-8x8, MiniGrid-GoToDoor-8x8, and MiniGrid-GoToObject-8x8 (Figure 2). The agent switches environments every 14,000 steps. By alternating between these environments, we can evaluate how well the agent adapts to new tasks while retaining knowledge from previously encountered ones. Training lasts 1.4 million steps in total, corresponding to roughly 100 task switches.
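A skeleton of this alternating schedule is sketched below; `agent.learn` is a placeholder training interface, and the environment IDs follow the names used in the text (the registered MiniGrid IDs may carry additional suffixes).

```python
from itertools import cycle

import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid environments)

ENV_IDS = ["MiniGrid-Empty-8x8-v0",
           "MiniGrid-GoToDoor-8x8-v0",
           "MiniGrid-GoToObject-8x8-v0"]
STEPS_PER_TASK = 14_000
TOTAL_STEPS = 1_400_000

def continual_training(agent):
    """Cycles through the three tasks every STEPS_PER_TASK environment steps."""
    env_cycle = cycle(ENV_IDS)
    steps_done = 0
    while steps_done < TOTAL_STEPS:
        env = gym.make(next(env_cycle))
        agent.learn(env, num_steps=STEPS_PER_TASK)   # placeholder interface
        env.close()
        steps_done += STEPS_PER_TASK
```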

2.2. Proximal Policy Optimization (PPO)

This study employs two reinforcement learning models to compare their performance in dynamic environments: the first is a standard PPO agent with a convolutional encoder, which serves as the baseline, and the second is the proposed message-passing PPO agent with a graph-based encoder.
PPO is a state-of-the-art, on-policy reinforcement learning algorithm widely used for its stability and reliable performance across various tasks [17]. It operates on an actor–critic architecture [18], optimizing a surrogate objective function that constrains the size of policy updates to prevent destructive, large changes. The core of PPO is its clipped objective function:
L^{CLIP}(θ) = Ê_t [ min( r_t(θ) Â_t, clip( r_t(θ), 1 − ε, 1 + ε ) Â_t ) ]
where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio and Â_t is the estimated advantage. In the baseline model for this research, the policy and value functions are parameterized using a conventional CNN to process the grid-based state observations.
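For reference, a minimal PyTorch sketch of this clipped objective, written as a loss to minimize, is shown below; the tensor names and the assumption that advantages are estimated and normalized upstream are ours, not taken from the paper’s code.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negated clipped surrogate objective L^CLIP for gradient descent.
    All tensors have shape (batch,); advantages are assumed precomputed (e.g., GAE)."""
    ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```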

2.3. Message Passing PPO

To explicitly model the relational structure of the environment, we propose composing PPO with message-passing, i.e., the MP-PPO agent. This model replaces the standard CNN backbone of the PPO agent with a message-passing neural network. Instead of treating the state as a flat grid of pixels, the MP-PPO first converts the state observation into a graph. Specifically, a King’s graph structure is used for the edges between nodes: each cell in the N × M grid becomes a node, and each node is connected by an edge to its up to eight neighboring cells, similar to the moves of a king in chess. This connectivity captures not only orthogonal adjacencies but also diagonal ones, allowing for a richer flow of local information. The initial feature vector for each node encodes the object present in that cell (e.g., agent, wall, lava, goal).
The King’s graph structure was intentionally chosen, as it provides a good balance between capturing local relational information and maintaining computational tractability. Our preliminary empirical tests confirmed that this structure offers robust performance without the significant computational overhead of more densely connected graphs.
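A minimal sketch of this King’s-graph construction is given below; the function name and the use of the PyTorch Geometric edge_index convention are illustrative assumptions rather than the exact implementation.

```python
import torch

def kings_graph_edge_index(height: int, width: int) -> torch.Tensor:
    """Edge index of a King's graph over an H x W grid: each cell is connected
    to its up-to-eight orthogonal and diagonal neighbours."""
    edges = []
    for r in range(height):
        for c in range(width):
            v = r * width + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < height and 0 <= nc < width:
                        edges.append((v, nr * width + nc))
    # Shape (2, num_edges), the convention used by PyTorch Geometric.
    return torch.tensor(edges, dtype=torch.long).t().contiguous()

# Example: an 8 x 8 grid yields 64 nodes and 420 directed edges.
edge_index = kings_graph_edge_index(8, 8)
```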

Model Architecture

The overall architecture of the MP-PPO agent processes information through a multi-stage pipeline. First, the input grid state is transformed into the aforementioned King’s graph. This graph is then fed into a GNN encoder; GraphSAGE is employed for this purpose [19]. The GraphSAGE network processes the graph through several message-passing layers to compute a high-level embedding for each node. Following the GNN encoder, a global pooling operation is applied to the set of all node embeddings to produce a single, fixed-size vector representing the entire graph state. Finally, this state vector is passed to two separate feed-forward networks (MLPs): the actor head for the policy and the critic head for the state value.
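The sketch below outlines this pipeline in PyTorch Geometric under simplifying assumptions: a linear per-node embedding replaces the convolutional front end of Table 1, global mean pooling is used for the readout, and the GraphSAGE widths (32 → 64 → 128) follow Table 1. It is meant as an orientation aid, not the exact architecture.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class GraphEncoder(nn.Module):
    """Per-node embedding, two GraphSAGE layers, and global mean pooling."""
    def __init__(self, in_dim=3, hidden=32, out_dim=128):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden)        # encodes the object in each cell
        self.sage1 = SAGEConv(hidden, 64, aggr="mean")
        self.sage2 = SAGEConv(64, out_dim, aggr="mean")

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.embed(x))
        h = torch.relu(self.sage1(h, edge_index))
        h = torch.relu(self.sage2(h, edge_index))
        return global_mean_pool(h, batch)             # one vector per graph state

class ActorCritic(nn.Module):
    """Shared graph encoder with separate policy and value heads (cf. Table 1)."""
    def __init__(self, n_actions=7):
        super().__init__()
        self.encoder = GraphEncoder()
        self.policy = nn.Sequential(nn.Linear(128, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh(),
                                    nn.Linear(64, n_actions))
        self.value = nn.Sequential(nn.Linear(128, 64), nn.Tanh(),
                                   nn.Linear(64, 64), nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, x, edge_index, batch):
        z = self.encoder(x, edge_index, batch)
        return self.policy(z), self.value(z)
```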
The specific layer dimensions for both the MP-PPO and the baseline CNN PPO are detailed in Table 1. These architectures were intentionally designed to have a roughly equivalent number of trainable parameters, ensuring that the comparison evaluates their structural advantages rather than differences in model capacity.
While CNNs are known for higher computational efficiency due to hardware optimizations, the primary goal of the paper was to equalize the model complexity and representational power by aligning the number of trainable parameters. This ensures a fair comparison of the architectural biases inherent in GCNs and CNNs.
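For completeness, parameter counts such as those reported in Table 1 can be obtained as follows; `ActorCritic` refers to the illustrative sketch above.

```python
def count_trainable_parameters(model) -> int:
    """Number of trainable parameters, the quantity matched between the two models."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., print(count_trainable_parameters(ActorCritic()))
```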

3. Theoretical Analysis of Robustness of MP Under Domain Shift

3.1. Relations of CRL to Data Drift

In CRL, an agent must adapt to a sequence of changing Markov decision processes (MDPs), M_0, M_1, …, M_T. We represent the state s within each MDP M_t as a graph G_t = (V, E, X_t), where V is the set of nodes, E is the set of edges defining the static relational structure (e.g., grid cell adjacencies), and X_t is a matrix of node features that can change over time (e.g., object types like “wall”, “lava”, “key”).
We frame the environmental changes between M t and M t + 1 using the lens of data drift theory. We identify two fundamental types of shifts:
Definition 1 
(Virtual Drift as Domain Switch). A virtual drift occurs when the distribution of node features P(X) changes, but the underlying graph structure (V, E) and the transition/reward functions conditional on the true semantics of the nodes remain the same. This corresponds to a domain switch in CRL. For example, “wall” objects (represented by feature vector x_wall) might turn into “lava” objects (x_lava), changing the agent’s input observations, but the rule “colliding with an obstacle terminates the episode” is constant. It can be formalized as follows:
p_{t+1}(X) ≠ p_t(X)
P_{t+1}(s′ | s, a) = P_t(s′ | s, a) and R_{t+1}(s, a) = R_t(s, a)
In this context, the state s represents the true, underlying meaning or function of an object in the environment. In contrast, the feature vector x is the specific, raw data the agent actually observes or “sees”.
Definition 2 
(Concept Drift as Task Switch). A concept drift occurs when the transition dynamics P or the reward function R change, fundamentally altering the optimal policy π * . This corresponds to a task switch in CRL. The agent’s observations and the world’s structure might be identical, but the goal itself changes. For example, the reward for reaching a “green door (object)” tile becomes zero, and a new reward is assigned to reaching a “blue door (object)” tile.
P_{t+1}(s′ | s, a) ≠ P_t(s′ | s, a) or R_{t+1}(s, a) ≠ R_t(s, a)
p_{t+1}(X) = p_t(X)
This can be written even more compactly using the joint probability of the reward r and the next state s′: P_{t+1}(r, s′ | s, a) ≠ P_t(r, s′ | s, a). Note that standard reinforcement learning notation is used here: P denotes the environment’s state transition probability, while p denotes the probability distribution of the agent’s input features X.
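As a toy illustration of the two definitions (the role map and task names below are hypothetical and purely expository), the reward can be written as a function of semantic roles: a virtual drift changes only how a role is rendered in the observation, while a concept drift rewrites the reward rule itself.

```python
# Toy illustration of Definitions 1 and 2 over a graph state G = (V, E, X).
ROLE_OF = {"wall": "obstacle", "lava": "obstacle",
           "green_door": "target", "blue_door": "distractor"}

def reward(observed_object: str, task: str = "reach_green_door") -> float:
    """Reward depends on the semantic role of the reached object, not its rendering."""
    if task == "reach_green_door":
        return 1.0 if ROLE_OF[observed_object] == "target" else 0.0
    if task == "reach_blue_door":         # concept drift: the rule itself changes
        return 1.0 if observed_object == "blue_door" else 0.0
    return 0.0

# Virtual drift (domain switch): the agent now observes "lava" where it used to
# observe "wall"; p(X) changes, but both map to "obstacle", so dynamics and
# rewards conditioned on roles are unchanged.
# Concept drift (task switch): observations are unchanged, but switching the task
# from "reach_green_door" to "reach_blue_door" changes R(s, a).
```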

3.2. Robustness of MP to Virtual Drift

Now, we analyze how a policy π_θ parameterized by a message-passing neural network (MPNN), a general form of a GNN, behaves under these drifts. An MPNN updates the representation h_v for each node v by aggregating messages from its neighbors N(v). The k-th layer is defined as
h_v^{(k+1)} = φ( h_v^{(k)}, ⨁_{u ∈ N(v)} ψ( h_v^{(k)}, h_u^{(k)} ) )
where ψ is a message function, ⨁ is a permutation-invariant aggregation function (e.g., summation, mean), and φ is a node update function. The initial representations h_v^{(0)} are derived from the input node features X.
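For concreteness, the following is a minimal PyTorch sketch of one such layer, with ψ and φ realized as small MLPs and summation as the aggregator ⨁; it mirrors the update rule above rather than the GraphSAGE layers used in the experiments.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """h_v^(k+1) = phi(h_v^(k), sum_{u in N(v)} psi(h_v^(k), h_u^(k)))."""
    def __init__(self, dim: int):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # message function
        self.phi = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # node update function

    def forward(self, h, edge_index):
        src, dst = edge_index                          # directed edges u -> v
        msgs = self.psi(torch.cat([h[dst], h[src]], dim=-1))
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, msgs)                   # permutation-invariant sum over N(v)
        return self.phi(torch.cat([h, agg], dim=-1))
```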
Proposition 1. 
An MPNN-based policy π_θ that has learned the correct relational dynamics is inherently robust to virtual drifts (domain switches) and does not require retraining of its weights θ.
Proof for Proposition 1. 
Consider a learned MPNN policy π_θ. The network learns the functions ψ and φ, which operate based on the relationships between node features, not their absolute values. In a virtual drift (domain switch), the node features change from X_t to X_{t+1}. For instance, a set of nodes V_wall ⊆ V that previously had features x_wall now has features x_lava. The initial node embeddings change accordingly, h_v^{(0)} = f(x_v), where x_v is the new feature vector for node v.
However, the MPNN’s computation graph (the edges E) remains unchanged. The network has learned a universal relational function. If it learned a rule such as “do not move towards neighbors whose embedding corresponds to an obstacle”, this rule is applied during message-passing regardless of whether the input feature vector was originally x_wall or is now x_lava. The network simply processes the new feature values X_{t+1} through its fixed-weight functions (ψ, φ). Since the underlying transition/reward dynamics are the same, the learned relational policy remains optimal. The output of π_θ(G_{t+1}) = π_θ((V, E, X_{t+1})) is correct without any change to θ.
In contrast, a standard CNN would process the scene as a grid of pixels. A change from the “wall” texture to “lava” texture creates a completely new input pattern, causing the learned convolutional filters to fail and necessitating retraining. The MPNN’s explicit separation of relational structure (E) from node identity ( X t ) provides this robustness.    □

3.3. Instability Under Concept Drift and Implications

While robust to virtual drift, an MPNN policy might not be robust to concept drift. In a concept drift (task switch), the reward function R or transition function P changes. The policy π θ may have perfectly learned to navigate to a green tile based on R t . When the goal changes to a blue tile under R t + 1 , the relational model learned by the MPNN is still sound (“this node is a blue tile”, “this node is adjacent to me”), but the value assigned to those relations is now incorrect. The weights θ , which encode the desirability of certain states and actions, must be updated to reflect the new task.
However, this does not mean the agent must learn from scratch. While the MPNN architecture does not grant the same inherent robustness to concept drift, its ability to learn a stable and transferable representation of the environment’s relational structure can still be highly advantageous. By preserving the underlying “how the world works” model, the agent can adapt to a new task more quickly. This facilitates positive transfer learning effects [20,21], where previously acquired knowledge accelerates learning in the new task. Note that MiniGrid environments have similar core objectives (i.e., goal-directed navigation) that have not changed.

4. Experimental Results as Empirical Evidence

4.1. Robustness of MP-PPO in Domain Switches

To validate the aforementioned mathematical analysis, a comparison was made between the maximum, mean, and minimum episode rewards from five runs in three environments (Figure 3A–C). In MiniGrid-Empty-8x8, although PPO converged earlier, greater stability was demonstrated by MP-PPO in the later training phase, as indicated by the small gap between its maximum and minimum episode rewards. A similar trend was observed in MiniGrid-GoToObject-8x8, where the models’ maximum and mean performances were comparable. However, PPO’s minimum rewards suffered from significant drops not seen with MP-PPO. In contrast, for the MiniGrid-GoToDoor-8x8 environment, PPO’s mean reward was found to be similar to MP-PPO’s maximum reward, suggesting that PPO was the superior model in this case. Nevertheless, it must be noted that both models’ episode rewards remained below 0.25 (out of a maximum of 1.0), meaning that neither model converged in this environment.
Furthermore, the absolute difference in episode reward, defined as |ΔPerformance_Domain|, was calculated between cases with and without a domain switch (Figure 3D–F). Specifically, in MiniGrid-Empty-8x8-v0 (Figure 3D), no significant difference was present between PPO (0.097 ± 0.148; mean ± standard deviation) and MP-PPO (0.192 ± 0.158) in the early stage (independent t-test, T(6212) = 0.031, p = 0.975), nor in the late stage (T(6219) = 1.139, p = 0.255; PPO = 0.106 ± 0.187, MP-PPO = 0.039 ± 0.049). However, in the middle stage, there was a significant difference (T(6218) = 2.910, p = 0.004; PPO = 0.007 ± 0.012, MP-PPO = 0.017 ± 0.019), although the changes in performance were very small for both models.
In contrast, in MiniGrid-GoToDoor-8x8, although the episode reward did not converge for either model, a statistically significant difference (independent t-test, t(12,566) = 13.425, p < 1 × 10⁻¹⁰) emerged in the middle stage between PPO (μ_PPO = 0.030 ± 0.028) and MP-PPO (μ_MP-PPO = 0.023 ± 0.020), and there was also a significant difference (t(12,570) = 2.425, p = 0.015) in the late stage between PPO (μ_PPO = 0.067 ± 0.052) and MP-PPO (μ_MP-PPO = 0.042 ± 0.043). In the early stage, however, no significant difference was observed (t(12,563) = 0.104, p = 0.917; PPO: μ_PPO = 0.016 ± 0.012; MP-PPO: μ_MP-PPO = 0.018 ± 0.012).
In MiniGrid-GoToObject-8x8, a statistically significant difference (independent t-test, t(15,189) = 3.822, p = 1 × 10⁻³) emerged in the early stage between PPO (μ_PPO = 0.033 ± 0.029) and MP-PPO (μ_MP-PPO = 0.027 ± 0.019), and also a significant difference (t(15,194) = 2.077, p = 0.038) in the late stage between PPO (μ_PPO = 0.083 ± 0.068) and MP-PPO (μ_MP-PPO = 0.085 ± 0.070). However, the middle stage showed no significant difference (t(15,190) = 1.641, p = 0.101) between the two models (μ_PPO = 0.133 ± 0.076 vs. μ_MP-PPO = 0.105 ± 0.070). This might be because, unlike the early stage, where rapid learning causes a large absolute difference, or the late stage, where the models differ in robustness to domain switches, the learning patterns are similar in the middle stage, when active learning is in progress.
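As a hedged sketch of the analysis above (variable names and data layout are assumptions, not the study’s code), the stage-wise comparison can be reproduced with SciPy as follows.

```python
import numpy as np
from scipy import stats

def domain_gap(rewards_no_switch, rewards_switch):
    """|Delta Performance_Domain|: absolute episode-reward gap between matched
    episodes without and with a domain switch."""
    return np.abs(np.asarray(rewards_no_switch) - np.asarray(rewards_switch))

def compare_stage(gap_ppo, gap_mp_ppo):
    """Independent two-sample t-test between the two models' gaps for one stage."""
    t, p = stats.ttest_ind(gap_ppo, gap_mp_ppo)
    return t, p
```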

4.2. Message-Passing PPO Is Competitive in Task Switching

We also compare how MP-PPO handles the task-switching environment, and interestingly, MP-PPO was also competitive under task switches, i.e., concept drift. The PPO and MP-PPO algorithms were compared on the basis of their performance drops, evaluated at the early, middle, and late training stages (Figure 4). MP-PPO demonstrated competitive results compared to the standard CNN-based PPO (Figure 4A) and was slightly worse at some steps; however, its maximum result among the five runs successfully adapted to task switches at which PPO had failed (around 0.7 M, 0.9 M, and 1.1 M steps, etc.). This suggests that the MPNN is beneficial not only for virtual drift but also for concept drift, which constitutes a different kind of continual learning problem in terms of data drift.
Further analysis was carried out on performance recovery, defined as the current task’s mean episode reward minus the previous task’s late-phase mean episode reward (Figure 4B,C). Performance recovery averaged over five runs showed no difference across stages after task switching (Figure 4B). Specifically, for both models, performance recovery did not differ between the early and middle stages (two-sided paired t-test; PPO’s early (0.205 ± 2.304) vs. PPO’s middle (0.243 ± 2.308), t(98) = 1.076, p = 0.284; MP-PPO’s early (0.179 ± 0.590) vs. MP-PPO’s middle (0.175 ± 0.570), t(98) = 0.231, p = 0.818), and the same held for the middle and late stages (PPO’s middle vs. PPO’s late (0.195 ± 2.303), t(98) = 1.672, p = 0.098; MP-PPO’s middle vs. MP-PPO’s late (0.170 ± 0.605), t(98) = 0.276, p = 0.783).
However, performance recovery of the maximum episode reward over five runs showed significant differences between stages (Figure 4C). Specifically, for both models, performance recovery differed between the early and middle stages (two-sided paired t-test; PPO’s early (0.260 ± 1.900) vs. PPO’s middle (0.367 ± 1.901), t(98) = 3.456, p = 8.125 × 10⁻⁴; MP-PPO’s early (0.331 ± 0.713) vs. MP-PPO’s middle (0.421 ± 0.718), t(98) = 3.152, p = 2.151 × 10⁻³), as was the case for the middle and late stages (PPO’s middle vs. PPO’s late (0.257 ± 1.898), t(98) = 3.558, p = 5.782 × 10⁻⁴; MP-PPO’s middle vs. MP-PPO’s late (0.340 ± 0.761), t(98) = 3.723, p = 3.289 × 10⁻⁴).
Interestingly, as with the middle stage of the MiniGrid-GoToObject-8x8 domain switch experiment (Figure 3B), a significant increase occurred during the middle stage. Note that while performance recovery over the maximum of five runs appears better for MP-PPO than for PPO, there was no significant difference in the means at any of the three stages; however, the variances of PPO and MP-PPO differed significantly (Levene’s test; early stage: F = 106.072, p < 1 × 10⁻¹⁰; middle stage: F = 115.039, p < 1 × 10⁻¹⁰; late stage: F = 104.533, p < 1 × 10⁻¹⁰). This was consistent with the performance recovery of the maximum episode reward over five runs (Figure 4) (Levene’s test; early stage: F = 49.763, p < 1 × 10⁻¹⁰; middle stage: F = 49.939, p < 1 × 10⁻¹⁰; late stage: F = 44.369, p = 3 × 10⁻¹⁰). This finding is the primary evidence for MP-PPO’s benefits in concept drift scenarios. The much lower variance of MP-PPO suggests a more reliable and stable learning process; the agent is less prone to catastrophic failures or wildly different outcomes across runs. This stability is a highly desirable property for continual learning agents, as it implies more predictable and robust adaptation to new tasks.
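A minimal sketch of this stability analysis is shown below, again with hypothetical input arrays (one performance-recovery value per task switch and model); scipy.stats.levene and scipy.stats.ttest_ind correspond to the variance and mean comparisons reported above.

```python
from scipy import stats

def stability_comparison(recovery_ppo, recovery_mp_ppo):
    """Compares run-to-run variance (Levene) and means (t-test) of performance
    recovery between PPO and MP-PPO for one training stage."""
    F, p_var = stats.levene(recovery_ppo, recovery_mp_ppo)       # variance equality
    t, p_mean = stats.ttest_ind(recovery_ppo, recovery_mp_ppo)   # mean equality
    return {"levene_F": F, "levene_p": p_var, "t": t, "t_p": p_mean}
```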

5. Discussion

This study presents a theoretical demonstration and empirical evidence for using a GCN to mitigate problems in CRL, such as catastrophic forgetting [2]. The mathematical proof presented in the paper shows that an MPNN architecture is inherently robust to changes in input features ( P ( X ) ), as long as the underlying relational structure and task dynamics remain constant. This was empirically validated in MiniGrid environments, where the MP-PPO agent showed more stable performance and a smaller performance drop during domain switches compared to a standard CNN-based PPO, especially in later training stages. This confirms that the explicit modeling of relational structures by GNNs provides a significant inductive bias for generalization when object appearances change but their functions do not.
Interestingly, while MPNNs are not theoretically guaranteed to be robust to concept drift (task switches), our empirical results revealed a significant advantage. Although the average performance recovery was comparable to the CNN baseline, the MP-PPO exhibited dramatically lower variance across runs. This finding is critical, as low variance implies a more stable, reliable, and predictable learning process. For CRL agents that must operate continuously without catastrophic failure, this stability is a highly desirable property and a key contribution of our work.
However, we acknowledge several limitations that open avenues for future research. First, superior architectural biases do not guarantee success in all contexts, as seen in the MiniGrid-GoToDoor-8x8 experiment, where neither model converged. This suggests that for tasks with significant exploration challenges or sparse rewards, the bottleneck may not be the representation model itself. Future studies could address this by combining our GNN representation with more advanced algorithms suited for such challenges, such as hierarchical reinforcement learning (HRL) or techniques using intrinsic motivation.
Second, our evaluation is confined to MiniGrid environments. While suitable for controlled analysis, these environments do not capture the full complexity of real-world visual data. Future studies must validate whether the benefits of GNNs persist in more complex, high-dimensional environments. Third, the effectiveness of GNNs depends on the choice of an appropriate graph construction. While the King’s graph used in this study provided a good balance of performance and efficiency, exploring methods for learning the optimal graph structure for a given task remains a valuable direction for future studies.
Finally, to better handle concept drift, future research should focus on integrating our GNN-based framework with explicit CRL adaptation mechanisms. By leveraging the stable and transferable state representations learned by the GNN, mechanisms like experience replay or dynamic parameter updates could more effectively and efficiently adapt the agent’s policy to new task objectives.

Funding

The present research was supported by the research fund of Dankook University in 2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and data supporting the reported results will be made publicly available upon publication of this article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  2. Khetarpal, K.; Riemer, M.; Islam, R.; Precup, D.; Caccia, M. Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res. 2022, 75, 1401–1486. [Google Scholar] [CrossRef]
  3. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 2014, 46, 1–37. [Google Scholar] [CrossRef]
  4. Widmer, G.; Kubat, M. Learning in the presence of concept drift and hidden contexts. Mach. Learn. 1996, 23, 69–101. [Google Scholar] [CrossRef]
  5. Salganicoff, M. Tolerating concept and sampling shift in lazy learning using prediction error context switching. In Lazy Learning; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1997; pp. 133–155. [Google Scholar]
  6. Delany, S.J.; Cunningham, P.; Tsymbal, A.; Coyle, L. A case-based technique for tracking concept drift in spam filtering. In Proceedings of the International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 13–15 December 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 3–16. [Google Scholar]
  7. Tsymbal, A. The problem of concept drift: Definitions and related work. Comput. Sci. Dep. Trinity Coll. Dublin 2004, 106, 58. [Google Scholar]
  8. Widmer, G.; Kubat, M. Effective learning in dynamic environments by explicit context tracking. In Proceedings of the Machine Learning: ECML-93: European Conference on Machine Learning, Vienna, Austria, 5–7 April 1993; Proceedings 6. Springer: Berlin/Heidelberg, Germany, 1993; pp. 227–243. [Google Scholar]
  9. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  10. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  11. McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation; Elsevier: Amsterdam, The Netherlands, 1989; Volume 24, pp. 109–165. [Google Scholar]
  12. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 729–734. [Google Scholar]
  13. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  14. Gallicchio, C.; Micheli, A. Graph echo state networks. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–8. [Google Scholar]
  15. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. Computational capabilities of graph neural networks. IEEE Trans. Neural Netw. 2008, 20, 81–102. [Google Scholar] [CrossRef] [PubMed]
  16. Chevalier-Boisvert, M.; Willems, L.; Pal, S. Minimalistic gridworld environment for gymnasium. Adv. Neural Inf. Process. Syst 2018, 101, 8024–8035. [Google Scholar]
  17. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  18. Konda, V.; Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 1999, 12, 1008–1014. [Google Scholar]
  19. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1025–1035. [Google Scholar]
  20. Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 2009, 10, 1633–1685. [Google Scholar]
  21. Zamir, A.R.; Sax, A.; Shen, W.; Guibas, L.J.; Malik, J.; Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3712–3722. [Google Scholar]
Figure 1. Schematic view of the data drift. Concept drift is a shift in P(Y|X), while virtual drift is a shift in P(X).
Figure 2. Minigrid environment for continual reinforcement learning. There are three kinds of environments and two kinds of changes in the environment. Task switching is related to changes in state transition and reward; domain switching is related to changes in state representations.
Figure 3. Performance comparison of PPO and MP-PPO in three different environments: “Empty”, “GoToDoor”, and “GoToObject”. The top panels (A–C) show the episode rewards over training steps. The bottom panels (D–F) represent the performance difference (|ΔPerformance_Domain|) between the two algorithms at the early, middle, and late training stages. Statistically significant performance differences were observed in the later stages for the “Empty” and “GoToObject” environments.
Figure 4. Performance in a task-switching environment. Blue and yellow solid lines represent PPO and MP-PPO, respectively. (A) Learning curves of the two models. The semi-transparent region represents the maximum and minimum of each RL agent over five runs, while the dashed vertical lines represent the best of each RL agent. Note that the dashed lines do not exactly match the performance changes (mostly drops) after a task switch because the curves are averaged over 1000 steps. (B) Performance recovery from the average of five runs. (C) Performance recovery from the maximum of five runs. To measure the improvement following the task change, the data were segmented into three equal thirds, representing the early, middle, and late stages. Error bars represent the standard deviation.
Table 1. A comparison of the network architectures for the proposed message-passing PPO (MP-PPO) and the baseline PPO, which is composed of a standard CNN. The layer dimensions were chosen to provide both models with a similar total parameter count for a fair comparison. Both models also share an identical MLP structure for the final policy and value heads.
Layer | Message Passing PPO (Proposed) | PPO (Baseline)
Feature Extractor
Initial Conv | Conv2d (3, 16, k = 2, s = 1) + ReLU | Conv2d (3, 32, k = 3, s = 1, p = 1) + ReLU
             | Conv2d (16, 32, k = 2, s = 1) + ReLU | Conv2d (32, 64, k = 3, s = 1, p = 1) + ReLU
             |                                      | Conv2d (64, 112, k = 3, s = 1) + ReLU
Graph Layers | GraphSAGE (32 → 64, aggr = mean) | N/A
             | GraphSAGE (64 → 128, aggr = mean) |
Final Linear | Flatten → Linear (3200 → 128) + ReLU | Flatten → Linear (2800 → 128) + ReLU
Actor–Critic MLP Heads
Policy Net | Linear (128 → 64) + Tanh | Linear (128 → 64) + Tanh
           | Linear (64 → 64) + Tanh | Linear (64 → 64) + Tanh
Value Net | Linear (128 → 64) + Tanh | Linear (128 → 64) + Tanh
          | Linear (64 → 64) + Tanh | Linear (64 → 64) + Tanh
Output Layers
Action Output | Linear (64 → 7) | Linear (64 → 7)
Value Output | Linear (64 → 1) | Linear (64 → 1)
Total Parameters | 458,040 | 467,896