Article

Unpacking Performance Variability in Deep Reinforcement Learning: The Role of Observation Space Divergence

1 Department of Computer Engineering, Hanbat National University, Daejeon 34158, Republic of Korea
2 Department of Metaverse & Game, Soonchunhyang University, Asan 31538, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8247; https://doi.org/10.3390/app15158247
Submission received: 5 June 2025 / Revised: 8 July 2025 / Accepted: 22 July 2025 / Published: 24 July 2025
(This article belongs to the Special Issue Advancements and Applications in Reinforcement Learning)

Abstract

Deep Reinforcement Learning (DRL) algorithms often exhibit significant performance variability across different training runs, even with identical settings. This paper investigates the hypothesis that a key contributor to this variability is the divergence in the observation spaces explored by individual learning agents. We conducted an empirical study using Proximal Policy Optimization (PPO) agents trained on eight Atari environments. We analyzed the collected agent trajectories by qualitatively visualizing and quantitatively measuring the divergence in their explored observation spaces. Furthermore, we cross-evaluated the learned actor and value networks, measuring the average absolute TD-error, the RMSE of value estimates, and the KL divergence between policies to assess their functional similarity. We also conducted experiments where agents were trained from identical network initializations to isolate the source of this divergence. Our findings reveal a strong correlation: environments with low-performance variance (e.g., Freeway) showed high similarity in explored observation spaces and learned networks across agents. Conversely, environments with high-performance variability (e.g., Boxing, Qbert) demonstrated significant divergence in both explored states and network functionalities. This pattern persisted even when agents started with identical network weights. These results suggest that differences in experiential trajectories, driven by the stochasticity of agent–environment interactions, lead to specialized agent policies and value functions, thereby contributing substantially to the observed inconsistencies in DRL performance.

1. Introduction

Deep reinforcement learning (DRL) has demonstrated an extraordinary ability to solve high-dimensional decision-making problems, from Atari video games [1] to the board game Go [2]. Among modern on-policy methods, Proximal Policy Optimization (PPO) [3] has emerged as a de facto baseline due to its empirical performance and implementation simplicity, and it remains the algorithm of choice in a wide range of recent studies. Yet a persistent challenge plagues the field: despite identical hyperparameters and training budgets, independent runs of the same DRL algorithm can yield markedly different returns [4], complicating the fair evaluation and deployment of new methods.
Existing research has identified several key sources contributing to this performance variability. One line of inquiry has focused on the profound impact of implementation details. Foundational work by Henderson et al. [4] demonstrated that differences in codebases, hyperparameter settings, and even random seeds could lead to drastically different outcomes. This was further emphasized by Engstrom et al. [5], who showed that seemingly minor code-level decisions—such as the choice of activation function or the order of operations—can alter performance by orders of magnitude. Another perspective has examined the dynamics of the training process itself. Bjorck et al. [6] provided evidence that much of this variability originates early in training, as a few “outlier” runs drift onto low-reward trajectories and never recover. This aligns with the work by Jang et al. [7], which explored how entropy-aware initialization can foster more effective exploration from the outset, thereby preventing early stagnation. A third approach has delved into the internal mechanics of the learning agent. For instance, Moalla et al. [8] recently established a connection between performance instability in PPO and internal “representation collapse,” where the network learns insufficiently diverse features, leading to trust issues in policy updates.
While these studies provide crucial insights into implementation, training dynamics, and internal representations, a complementary perspective remains less explored: the divergence in the observation space that each agent actually experiences. RL agents learn from the specific trajectories of states and rewards they encounter. If two agents, due to chance events early in training, begin to explore different regions of the state space, they are effectively training on different datasets. This can lead them to converge to substantially different—and unequally performing—policies. This paper investigates the hypothesis that this divergence in lived experience is a primary cause of performance variability. Unlike prior work focused on implementation choices or internal network states [8], we aim to directly quantify the differences in state visitation distributions among independently trained agents. While studies such as [6,7] identify that runs diverge, we characterize how they diverge in terms of the states they encounter and explicitly link this divergence to the functional dissimilarity of the resulting policies.
In this work, we perform a controlled empirical study of five PPO agents, each trained with an independent random seed, across eight Atari environments—Alien, Boxing, Breakout, Enduro, Freeway, KungFuMaster, Pong, and Qbert—using the Arcade Learning Environment [9] and Gymnasium [10]. Training five independent PPO agents per game yields a spectrum of outcomes, from highly successful to comparatively poor policies. We then ask the following: Do higher-performing agents encounter a broader or different set of states than lower-performing agents? And do such differences manifest in the learned representations of the actor (policy) and critic (value function) networks? To answer these questions, we employ a range of analysis techniques. First, we visualize and quantify each agent’s observation distribution using dimensionality-reduction methods and statistical measures of dispersion and similarity. Second, we cross-evaluate the learned actor and value networks, applying them to states experienced by other agents to assess functional differences. Our findings reveal a clear correlation between divergence in explored observation spaces, dissimilarity of the learned networks, and variance in achieved performance across agents within each environment.
The remainder of this paper is organized as follows. Section 2 details our methodology. Section 3 presents the experimental results, featuring analyses of observation space characteristics and the discrepancies observed in actor–critic networks. In Section 4, we discuss the implications of our findings. Finally, Section 5 concludes the paper by outlining potential avenues for future research.

2. Methodology

To investigate the relationship between observation space divergence, network differences, and performance variability, a series of experiments were conducted using PPO agents trained on eight Atari environments. For this investigation, we specifically selected Proximal Policy Optimization (PPO) and the Atari 2600 benchmark suite. PPO was chosen due to its status as a robust, high-performing, and widely adopted baseline algorithm in the DRL community, making our findings on its variability highly relevant. The Atari suite, via the Arcade Learning Environment, offers a diverse collection of environments with varying complexities and reward structures. This diversity is crucial for our study, as it allows us to systematically compare the divergence phenomenon in games known to produce both highly consistent and highly variable performance outcomes.

2.1. Experimental Setup

All experiments were conducted on a workstation with the following specifications:
  • Hardware: Intel(R) Core(TM) i7-14700K (28 cores), 128 GB RAM.
  • Software: Python 3.10.16, along with key scientific computing libraries including NumPy 2.2.3, Pandas 2.2.3, SciPy 1.15.3, and Scikit-learn 1.7.0.

2.2. Training Environment and Agents

We trained five independent PPO agents for each of the eight Atari environments shown in Figure 1: Alien, Boxing, Breakout, Enduro, Freeway, KungFuMaster (KFM), Pong, and Qbert. The training was conducted with Stable-Baselines3 [11], and the agents used the PPO hyperparameters for Atari games released in the RL Baselines3 Zoo [12]. Each agent within an environment was trained from scratch, with the only source of variation being the random seeds.
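As a rough sketch of this setup, the snippet below trains several PPO agents that differ only in their random seed, using Stable-Baselines3's Atari helpers. The hyperparameter values are an abbreviated stand-in for the full RL Baselines3 Zoo Atari configuration (which, for example, uses linear schedules for the learning rate and clip range), and the environment ID and 10 M-step budget are illustrative assumptions rather than details taken from the paper.

```python
# Minimal training sketch: several PPO agents per game, differing only in their random seed.
# Hyperparameters are an abbreviated stand-in for the RL Baselines3 Zoo Atari config;
# the environment ID and training budget are illustrative.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

def train_agent(env_id: str, seed: int, total_timesteps: int = 10_000_000) -> PPO:
    # Standard Atari preprocessing (frame skip, resize, grayscale) plus 4-frame stacking.
    env = VecFrameStack(make_atari_env(env_id, n_envs=8, seed=seed), n_stack=4)
    model = PPO(
        "CnnPolicy", env,
        n_steps=128, n_epochs=4, batch_size=256,
        learning_rate=2.5e-4, clip_range=0.1,
        ent_coef=0.01, vf_coef=0.5,
        seed=seed, verbose=0,
    )
    model.learn(total_timesteps=total_timesteps)
    return model

if __name__ == "__main__":
    for seed in range(5):
        train_agent("BreakoutNoFrameskip-v4", seed=seed).save(f"ppo_breakout_seed{seed}")
```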

2.3. Data Collection

After the training phase concluded for all agents, each of the five converged agents per environment was rolled out for 30 episodes. The performance, measured as the average reward over 30 evaluation episodes, was recorded for each agent. During these rollouts, at each step $t$, a tuple containing the current observation ($o_t$), the received reward ($r_t$), the value estimate from the agent’s own critic ($v_t = V_{\phi}(o_t)$), and the action logits from the agent’s own actor ($\text{action\_logits}_t = \pi_{\theta}(o_t)$) was collected and stored. This process generated a dataset of trajectories specific to each agent’s learned policy.
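A minimal rollout-collection sketch is given below. It assumes an evaluation environment wrapped identically to training (e.g., obtained from model.get_env()) and uses Stable-Baselines3 policy helpers (obs_to_tensor, predict_values, get_distribution) as available in recent releases; the dictionary-based storage format and the additional done flags are illustrative choices, not details from the paper.

```python
# Sketch: store (o_t, r_t, v_t, action_logits_t) at each step of 30 evaluation episodes.
# Helper names follow recent Stable-Baselines3 releases and may need adjusting elsewhere.
import numpy as np
import torch
from stable_baselines3 import PPO

def collect_trajectories(model: PPO, vec_env, n_episodes: int = 30):
    # vec_env: a single-environment VecEnv preprocessed exactly as during training.
    data = {"obs": [], "rew": [], "val": [], "logits": [], "done": []}
    for _ in range(n_episodes):
        obs = vec_env.reset()
        done = False
        while not done:
            obs_t, _ = model.policy.obs_to_tensor(obs)
            with torch.no_grad():
                value = model.policy.predict_values(obs_t)      # v_t = V_phi(o_t)
                dist = model.policy.get_distribution(obs_t)     # pi_theta(. | o_t)
            action, _ = model.predict(obs, deterministic=False)
            data["obs"].append(np.array(obs, copy=True))
            data["val"].append(value.cpu().numpy().squeeze())
            data["logits"].append(dist.distribution.logits.cpu().numpy().squeeze())
            obs, reward, dones, infos = vec_env.step(action)
            data["rew"].append(float(reward[0]))
            done = bool(dones[0])
            data["done"].append(done)
    return {key: np.array(val) for key, val in data.items()}
```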

2.4. Analysis of Observation Space

The collected observation data was analyzed to characterize the extent and nature of exploration by each agent and to compare observation distributions across agents.

2.4.1. Visualization

To qualitatively assess the similarity of explored observation spaces, the high-dimensional observation data (raw pixel frames) was visualized in a 2D space using t-SNE. For each environment, observations from all five agents were combined. From each agent’s collected observations, 5000 frames were randomly subsampled. These observations were then flattened into vectors before being projected into two dimensions using t-SNE. The perplexity hyperparameter for t-SNE was set to 30, a choice informed by a trustworthiness analysis (presented in Section 3.5) to ensure high-quality embeddings. The resulting scatter plots were color-coded by agent ID to reveal patterns of overlap and separation.
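A sketch of this visualization step is given below, assuming per-agent observation arrays are already loaded in memory; matplotlib is used for the scatter plot, which is an assumption beyond the libraries listed in Section 2.1, and the function name is illustrative.

```python
# Sketch of the joint t-SNE embedding: subsample 5000 frames per agent, flatten,
# embed with perplexity 30, and colour the points by agent ID.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(agent_obs: dict[int, np.ndarray], n_samples: int = 5000, perplexity: float = 30.0):
    rng = np.random.default_rng(0)
    feats, labels = [], []
    for agent_id, obs in agent_obs.items():                 # obs: (N, H, W, C) frames
        idx = rng.choice(len(obs), size=min(n_samples, len(obs)), replace=False)
        feats.append(obs[idx].reshape(len(idx), -1).astype(np.float32))
        labels.append(np.full(len(idx), agent_id))
    X, y = np.concatenate(feats), np.concatenate(labels)
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=2, cmap="tab10")
    plt.title("t-SNE of explored observations, coloured by agent")
    plt.show()
```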

2.4.2. Single-Agent Observation Dispersion

To quantify the diversity of observations encountered by each individual agent, a multi-step process was employed:
  • Data Loading and Preprocessing: For each agent, its collected observations over 30 episodes were loaded, and then 5000 frames were subsampled and flattened into vectors.
  • Dimensionality Reduction and Denoising: Principal Component Analysis (PCA) was applied to reduce the dimensionality of the flattened observations to 50 dimensions. This step serves to denoise and compress the data, focusing on the directions of highest variance, which can lead to more robust and meaningful clustering in the subsequent step. Working with high-dimensional raw pixel data directly for clustering can be computationally expensive and sensitive to noise; PCA mitigates these issues by capturing the most salient features.
  • State Space Discretization via Clustering: The PCA-transformed features from all agents within an environment were pooled together. K-means clustering was then performed on this pooled set of features to define a common set of discrete state categories. To assess the sensitivity of our metrics to the granularity of discretization, we performed this step with K values of 50, 100, and 200. The results for K = 100 are presented in the main analysis, with K = 50 and K = 200 used for comparison.
  • Occupancy Histogram Generation: For each agent, an occupancy histogram (a probability distribution $p(c)$) over the K clusters was computed.
  • Dispersion Metrics Calculation: Based on this probability distribution $p(c)$ (where $c$ is a cluster index), the following metrics were calculated (a code sketch of the full pipeline is given after this list):
    • Entropy (H(p)): Measures the uncertainty or diversity of the visited clusters. Higher entropy indicates that an agent visits a wider range of distinct state categories more uniformly.
      $H(p) = -\sum_{c} p(c) \log p(c).$
    • Effective Support Size ($N_{\mathrm{eff}}$): Estimates the number of effectively visited clusters, providing an interpretable scale for diversity.
      $N_{\mathrm{eff}} = \exp\bigl(H(p)\bigr).$
    • Coverage Ratio (Cov): The proportion of the K-defined clusters that were visited at least once.
      $\mathrm{Cov} = \frac{\left|\{\, c : p(c) > 0 \,\}\right|}{K}.$
    • Gini–Simpson Index (G): Measures diversity, with values closer to 1 indicating higher diversity (i.e., probabilities are spread out more evenly across many clusters).
      $G = 1 - \sum_{c} p(c)^{2}.$
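The sketch below outlines this dispersion pipeline with scikit-learn, matching the procedure above (PCA to 50 components, K-means over the pooled features, per-agent occupancy histograms, and the four metrics). The function name and the dictionary-based input format are illustrative choices, not taken from the paper's code.

```python
# Sketch of the dispersion pipeline: PCA to 50 dimensions, K-means over the pooled
# features of all agents, per-agent occupancy histograms, and the four metrics above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def dispersion_metrics(agent_feats: dict[int, np.ndarray], k: int = 100):
    # agent_feats: agent id -> flattened observation vectors, shape (n_samples, n_pixels).
    pooled = np.concatenate(list(agent_feats.values()))
    pca = PCA(n_components=50).fit(pooled)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pca.transform(pooled))

    results = {}
    for agent_id, feats in agent_feats.items():
        labels = kmeans.predict(pca.transform(feats))
        counts = np.bincount(labels, minlength=k)
        p = counts / counts.sum()                           # occupancy histogram p(c)
        nz = p[p > 0]
        entropy = -np.sum(nz * np.log(nz))                  # H(p)
        results[agent_id] = {
            "H": entropy,
            "N_eff": np.exp(entropy),                       # effective support size
            "Cov": np.count_nonzero(counts) / k,            # coverage ratio
            "G": 1.0 - np.sum(p ** 2),                      # Gini-Simpson index
        }
    return results
```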

2.4.3. Pairwise Observation Distribution Comparison

To quantify the similarity between observation distributions of pairs of agents, the following metrics were computed. The cluster definitions and the individual agent occupancy histograms (e.g., $p_i(c)$ and $p_j(c)$) used for these pairwise comparisons are established and computed according to the procedures detailed in Section 2.4.2.
  • Pairwise Total Variation (TV) Distance: For each pair of agents $(i, j)$, the TV distance between their cluster occupancy histograms $p_i(c)$ and $p_j(c)$ was calculated. This ranges from 0 (identical) to 1 (disjoint).
    $\mathrm{TV}(i, j) = \frac{1}{2} \sum_{c} \bigl| p_i(c) - p_j(c) \bigr|.$
  • Pairwise Maximum Mean Discrepancy (MMD): MMD was computed between the PCA-reduced feature sets of each pair of agents using a Gaussian kernel [13]. MMD is a statistical test for determining whether two samples are drawn from the same distribution, and its magnitude provides a measure of their dissimilarity. It operates directly on the feature representations rather than relying on the explicit cluster histograms.
    $\mathrm{MMD}(i, j) = \mathbb{E}_{x, x' \sim P_i}\bigl[K(x, x')\bigr] - 2\,\mathbb{E}_{x \sim P_i,\, y \sim P_j}\bigl[K(x, y)\bigr] + \mathbb{E}_{y, y' \sim P_j}\bigl[K(y, y')\bigr],$
    where $K$ is a positive-definite kernel. We choose a Gaussian kernel $K(x, y) = \exp\bigl(-\|x - y\|^{2} / 2\sigma^{2}\bigr)$. The kernel bandwidth $\sigma$ is set to the median pairwise distance among all states in the combined dataset (median heuristic), which provides a reasonable scale. MMD is essentially a distance in a reproducing kernel Hilbert space and is 0 if and only if the two distributions are identical (for characteristic kernels). A minimal computation sketch of both pairwise metrics follows this list.
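The sketch below computes the two pairwise metrics, assuming NumPy arrays for the occupancy histograms and the PCA-reduced feature sets. The returned MMD value is the standard biased (V-statistic) estimate of the squared MMD with a Gaussian kernel and median-heuristic bandwidth; the function names are illustrative.

```python
# Sketch of the pairwise comparison metrics: total-variation distance between occupancy
# histograms p_i, p_j, and Gaussian-kernel MMD between PCA-reduced feature sets x_i, x_j.
# The MMD value returned is the biased (V-statistic) estimate of the squared MMD.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def tv_distance(p_i: np.ndarray, p_j: np.ndarray) -> float:
    return 0.5 * float(np.abs(p_i - p_j).sum())

def gaussian_mmd(x_i: np.ndarray, x_j: np.ndarray) -> float:
    # Median heuristic: sigma = median pairwise distance over the combined sample.
    sigma = np.median(pdist(np.concatenate([x_i, x_j])))
    gamma = 1.0 / (2.0 * sigma ** 2)
    k_ii = np.exp(-gamma * cdist(x_i, x_i, "sqeuclidean")).mean()
    k_jj = np.exp(-gamma * cdist(x_j, x_j, "sqeuclidean")).mean()
    k_ij = np.exp(-gamma * cdist(x_i, x_j, "sqeuclidean")).mean()
    return float(k_ii - 2.0 * k_ij + k_jj)
```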

2.5. Analysis of Actor and Value Network Differences

To assess how the differences in explored observation spaces translate to differences in the learned actor and value networks, a cross-evaluation methodology was employed. For each pair of agents $(i, j)$ within an environment, the observations collected by agent i during its rollouts were used as input to the actor and value networks of agent j. The following metrics were calculated (a code sketch of these computations is provided after the list):
  • Absolute TD-Error: The absolute TD-error was computed over agent i’s trajectory data using agent j’s value network, measuring how well agent j’s value function generalizes to agent i’s experiences.
    $\mathrm{TD}_{ij} = \frac{1}{|D_i|} \sum_{(s_t,\, r_t,\, s_{t+1}) \in D_i} \bigl| r_t + \gamma\, V_{\phi_j}(s_{t+1}) - V_{\phi_j}(s_t) \bigr|.$
  • Root Mean Squared Error (RMSE) of Value Estimates: For each state $s_t$ in agent i’s trajectory, the value estimated by agent i’s critic, $V_{\phi_i}(s_t)$, was compared to the value predicted by agent j’s critic, $V_{\phi_j}(s_t)$. This directly measures how different the outputs of the two value functions are on the states that agent i visits.
    $\mathrm{RMSE}_{ij} = \sqrt{\frac{1}{|D_i|} \sum_{s_t \in D_i} \bigl( V_{\phi_i}(s_t) - V_{\phi_j}(s_t) \bigr)^{2}}.$
  • Kullback–Leibler Divergence (KLD) of Action Logits: For each state $s_t$ in agent i’s trajectory, the KLD was computed between the action probability distributions (derived from the logits) produced by agent i’s actor, $\pi_{\theta_i}(\cdot \mid s_t)$, and agent j’s actor, $\pi_{\theta_j}(\cdot \mid s_t)$, quantifying how far the two policies diverge on the states that agent i visits.
    $\mathrm{KLD}_{ij} = \frac{1}{|D_i|} \sum_{s_t \in D_i} D_{\mathrm{KL}}\bigl( \pi_{\theta_i}(\cdot \mid s_t) \,\|\, \pi_{\theta_j}(\cdot \mid s_t) \bigr).$
These metrics were computed for all 5 × 5 agent pairings in each of the eight environments, resulting in matrices that reveal the extent of network generalization and similarity.
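A sketch of these cross-evaluation computations is shown below, assuming that per-step rewards, done flags, value estimates, and action logits have been collected as aligned NumPy arrays (e.g., with the rollout routine in Section 2.3). The episode-boundary handling (bootstrapping a value of zero at terminal steps) and the default discount factor are assumptions, not specifics from the paper.

```python
# Sketch of the cross-evaluation metrics on agent i's trajectory D_i using agent j's
# networks. Inputs are aligned per step: rewards (T,), done flags (T,), value estimates
# (T,) from each critic, and action logits (T, n_actions) from each actor.
import numpy as np
from scipy.special import log_softmax

def abs_td_error(rewards, dones, values_j, gamma=0.99):
    # Mean |r_t + gamma * V_phi_j(s_{t+1}) - V_phi_j(s_t)|; terminal steps bootstrap to 0
    # (an assumed convention), and the discount factor is likewise an assumption.
    v_next = np.where(dones[:-1], 0.0, values_j[1:])
    td = rewards[:-1] + gamma * v_next - values_j[:-1]
    return float(np.abs(td).mean())

def value_rmse(values_i, values_j):
    # RMSE between the two critics' estimates on the same states.
    return float(np.sqrt(np.mean((values_i - values_j) ** 2)))

def mean_policy_kld(logits_i, logits_j):
    # Mean KL( pi_theta_i(.|s_t) || pi_theta_j(.|s_t) ) over the states in D_i.
    log_p = log_softmax(logits_i, axis=-1)
    log_q = log_softmax(logits_j, axis=-1)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
```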

2.6. Trained Agents with Identically Initialized Networks

To further investigate the sources of performance variability, we conducted an additional experiment. For four of the eight environments—Boxing and Qbert (high-performance variance), Alien (medium variance), and Freeway (low variance)—we trained five new agents starting from the same initial network weights. The only source of variation in these runs was the stochasticity inherent in the agent–environment interaction loop (e.g., action sampling, environment responses). This setup allows us to determine whether different weight initializations primarily drive performance divergence or if it emerges naturally from the training process itself. These agents were then analyzed using the same data collection and analysis pipeline described above.
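One way to realize this setup with Stable-Baselines3 is sketched below: a single reference model supplies the initial weights, which are copied into each run via get_parameters/set_parameters before training, while the environment and action-sampling seeds differ per run. PPO hyperparameters are omitted for brevity, and the environment ID is illustrative; this is a sketch under those assumptions, not the paper's exact script.

```python
# Sketch of the identical-initialisation runs: one reference model supplies the starting
# weights for every run; only the environment and action-sampling seeds differ.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

ENV_ID = "BoxingNoFrameskip-v4"   # illustrative choice

def make_env(seed: int):
    return VecFrameStack(make_atari_env(ENV_ID, n_envs=8, seed=seed), n_stack=4)

# Reference initialisation shared by all runs.
init_params = PPO("CnnPolicy", make_env(seed=0), seed=0).get_parameters()

for run_seed in range(5):
    model = PPO("CnnPolicy", make_env(seed=run_seed), seed=run_seed)
    model.set_parameters(init_params)                 # identical starting weights
    model.learn(total_timesteps=10_000_000)
    model.save(f"ppo_boxing_same_init_seed{run_seed}")
```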

3. Results

3.1. Performance Variation Across Agents and Environments

As detailed in Table 1, the five PPO agents trained independently in each of the eight environments achieved markedly different final scores. The agents are consistently listed from #1 (highest score) to #5 (lowest score). The performance gap between the best and worst agents varied dramatically across environments.
Environments such as Freeway and Pong showed high consistency. In Freeway, scores were tightly clustered between 21.27 and 22.03, a minimal difference. In contrast, other environments displayed extreme variability. In KFM, the top agent scored 37,453, while the lowest-scoring agent achieved only 3600. Breakout also showed a massive spread, from 343.6 down to 41.2. Boxing and Qbert demonstrated significant, though less extreme, variance. This wide range of performance disparities provides a strong basis for investigating the correlation with observation space divergence. Table 2 shows the mean episode lengths, which often correlate inversely with performance (e.g., in Boxing, higher-scoring agents have shorter episodes).

3.2. Qualitative Analysis of Observation Space via t-SNE

The t-SNE visualizations in Figure 2 and Figure 3 provide a qualitative view of the explored state manifolds, color-coded by agent ID. The degree of overlap and separation between agent distributions varies significantly across games, correlating strongly with performance variance.
  • Low-Variance Environments (Freeway, Enduro): In Freeway, the agent observations are thoroughly intermingled, forming a dense, homogeneous cloud with ring-like structures. This indicates that all agents explore nearly identical state spaces, consistent with their minimal performance variance. Enduro shows a similar pattern of high overlap, with points from all agents mixed together within several large clusters.
  • High-Variance Environments (Boxing, Qbert, KFM, Breakout): These environments show clear signs of divergent exploration. In Boxing, agents form highly distinct clusters; for example, agent #1 (blue) and agent #3 (green) occupy almost completely separate regions. In Qbert, agents trace out unique, winding paths with little overlap, suggesting highly specialized policies. In KFM, the lowest-performing agent #5 (purple) forms tight, isolated clusters, while the higher-performing agents explore a much larger, albeit still differentiated, central region. Breakout shows less distinct clustering, but agents still carve out visibly different trajectories.
  • Moderate-Variance Environments (Alien, Pong): These games represent an intermediate case. Alien agents share a large central cluster but also have unique “tendrils” of exploration. Pong agents form a series of concentric rings, but some agents are more confined to inner rings while others explore the outer regions more, and some agents (e.g., #5, purple) have more diffuse distributions.
Figure 2. Qualitative t-SNE analysis of observation space representations from five distinct agents in four Atari games: (a) Alien, (b) Boxing, (c) Breakout, and (d) Enduro.
Figure 3. Qualitative t-SNE analysis of observation space representations from five distinct agents in four Atari games: (a) Freeway, (b) KungFuMaster, (c) Pong, and (d) Qbert.

3.3. Quantitative Analysis of Observation Space

3.3.1. Single-Agent Observation Dispersion

The dispersion metrics in Table 3 (calculated with K = 100 clusters) quantify the t-SNE visualizations. In Freeway and Enduro, all five agents exhibit consistently high entropy (H), effective support size ($N_{\mathrm{eff}}$), and coverage (Cov), indicating they all explore the state space broadly and similarly. For example, all Freeway agents cover nearly 100% of the defined clusters. In high-variance games, these metrics exhibit significant differences between agents. In KFM, the lowest-performing agent (#5) has a much lower entropy (3.52) and effective size (33.74) compared to the others (H > 4.2, $N_{\mathrm{eff}}$ > 69), quantitatively confirming its limited exploration seen in the t-SNE plot. In Boxing, there is no simple correlation between performance and dispersion (e.g., agent #3 has the lowest dispersion, while agent #1 has moderate dispersion), suggesting that the quality and uniqueness of the explored region, not just its size, are critical. In Qbert, the top three agents (which all achieved the max score) have higher dispersion than the lower-performing agents #4 and #5.

3.3.2. Pairwise Observation Distribution Comparison

The Pairwise Total Variation (TV) distance (Table 4) and Maximum Mean Discrepancy (MMD) (Table 5) confirm the qualitative findings.
  • Freeway and Enduro show extremely low TV distances (mostly <0.1 for Freeway, <0.15 for Enduro) and MMD values (mostly <0.015 for Freeway, <0.025 for Enduro), confirming that all agents have statistically very similar observation distributions.
  • Boxing, Qbert, and KFM show very high TV distances, often approaching 1.0, indicating that many agent pairs explore almost entirely different sets of state clusters. For instance, the TV distance between Qbert #1 and Qbert #2 is 0.831. In KFM, the low-performing agent #5 has TV distances > 0.7 against all other agents. The MMD values are correspondingly high, quantitatively demonstrating the stark divergence in exploration.

3.3.3. Sensitivity to Cluster Number K

To ensure our findings were not an artifact of the chosen cluster number (K = 100), we repeated the analysis for K = 50 and K = 200. While the absolute values of the dispersion and TV distance metrics changed with K (as expected), the relative trends remained consistent. For instance, across all values of K, Freeway and Enduro agents consistently showed high, uniform dispersion and low pairwise TV distances. Conversely, Boxing and Qbert agents consistently exhibited significant differences in dispersion and high pairwise TV distances. This indicates that our core conclusion—that performance variance correlates with observation space divergence—is robust to the specific granularity of state space discretization. The complete tables for this sensitivity analysis, which confirm these trends, are provided in Appendix A.

3.4. Analysis of Actor and Value Network Differences

Cross-evaluation of the learned networks (Table 6, Table 7 and Table 8) shows that divergence in observation space corresponds to functional divergence in the actor and value networks.
  • For Freeway and Enduro, the off-diagonal values in the TD-error, RMSE, and KLD tables are very low, often close to the diagonal (self-evaluation) values. This means any agent’s networks can accurately predict values and replicate policies for states experienced by any other agent, indicating functional convergence.
  • For Boxing, Qbert, and KFM, the off-diagonal values are substantially high. The KLD values in Boxing are frequently above 5.0, and the RMSE values in Qbert can be in the hundreds. This signifies that the agents have learned profoundly different policies and value functions that are highly specialized to their own unique experiences and do not generalize to those of their peers.
This provides strong evidence that when agents see different things, they learn to do different things, and their internal models of the world (value functions) become incompatible.

3.5. Analysis of Methodology Choices

3.5.1. Trustworthiness of t-SNE

To validate our choice of perplexity 30 for the t-SNE visualizations, we computed the trustworthiness metric for a range of perplexity values. Trustworthiness measures how well the local structure of the original high-dimensional data is preserved in the low-dimensional embedding. As shown in Table 9, a perplexity of 30 consistently yields high trustworthiness scores across most environments, indicating a reliable visualization.
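A sketch of this perplexity sweep using scikit-learn's trustworthiness function is shown below; the candidate perplexity values, the neighborhood size, and the function name are illustrative assumptions.

```python
# Sketch of the perplexity selection: compute the trustworthiness of the 2-D t-SNE
# embedding for several candidate perplexities and keep the highest-scoring value.
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

def perplexity_sweep(X: np.ndarray, perplexities=(5, 15, 30, 50), n_neighbors=5):
    scores = {}
    for perp in perplexities:
        emb = TSNE(n_components=2, perplexity=perp, init="pca").fit_transform(X)
        scores[perp] = trustworthiness(X, emb, n_neighbors=n_neighbors)
    return scores

# Usage: scores = perplexity_sweep(flattened_observations); pick the perplexity with
# the highest trustworthiness (30 in the setting reported here).
```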

3.5.2. Computation Time

Our analysis pipeline is computationally efficient. As shown in Table 10, PCA and K-means clustering take only a few seconds to a minute on the subsampled data. The main bottleneck is memory; increasing the number of samples per agent beyond 5000 becomes challenging, highlighting a limitation for even more fine-grained analyses on standard hardware.

3.6. Impact of Identical Network Initialization

To test whether performance variability stems solely from different random initializations or from the training process itself, we ran experiments where all five agents in an environment started with the same network weights. The results show that performance variance, while sometimes reduced, remains substantial in high-variance environments (Table 11). In Boxing, scores still ranged from 100 down to 90.97. In Qbert, the gap was even larger, from 4050 down to 800. In contrast, Freeway remained highly consistent. Crucially, the observation space and network divergence metrics for these agents mirrored the results from the randomly initialized experiments. For instance, in high-variance environments like Boxing and Qbert, the single-agent dispersion statistics showed considerable spread between agents, unlike in Freeway where all agents explored consistently (Table 12). This divergence in experience led to functionally different networks; the cross-agent absolute TD-error was substantial in high-variance games but negligible in Freeway (Table 13). This pattern of divergence is also confirmed by the high pairwise dissimilarity (high TV and MMD values, see Table 14 and Table 15) and network error metrics (high RMSE and KLD values, see Table 16 and Table 17). The key finding is that the stochasticity of the agent–environment interaction loop itself is a powerful driver of divergence. Even from an identical starting point, chance exploration events can set agents on different learning pathways from which they do not recover, leading to the same pattern of specialized, non-transferable policies and significant performance gaps.

4. Discussion

The results from our expanded study across eight Atari games provide compelling and robust evidence that divergence in explored observation spaces is a primary driver of performance variability in DRL. Our analysis indicates that the degree of this divergence is not random but is strongly tied to the intrinsic characteristics of the environment itself. The Freeway environment serves as a crucial control case. The consistent performance of Freeway agents is tightly linked to their consistent exploration patterns. All agents are guided through a similar, comprehensive set of experiences, which leads to the development of functionally equivalent policies and value functions. In stark contrast, high-variance environments like Boxing, Breakout, KFM, and Qbert reveal the consequences of divergent exploration. In these more complex settings, agents can and do find different niches within the state space. The t-SNE plots and quantitative metrics show that agents often specialize in distinct sub-regions, becoming experts in local areas while remaining naive about others. This specialization is path-dependent; early stochastic events steer an agent toward a particular trajectory, and the actor–critic learning loop reinforces this direction. An agent’s value network becomes more accurate for its frequented states, which in turn biases its policy to continue visiting them. This creates a feedback loop that amplifies initial small differences into significant chasms in both experience and capability.

4.1. Environment Characteristics and Their Impact on Divergence

The consistent patterns of divergence and stability observed across the eight Atari games can be primarily attributed to their intrinsic mechanics and objectives. By categorizing the environments, we can provide more targeted guidance for future DRL applications.

4.1.1. Low-Divergence Environments: Structured and Convergent

Environments that foster low divergence and stable performance, such as Freeway, Pong, and Enduro, often share characteristics like a clear, singular objective and a functionally narrow state space that guides agents toward a single dominant strategy.
  • In Freeway, the goal is simple and monotonic: move up. There are no complex sub-tasks or branching strategic paths. This structure naturally channels all agents toward the same optimal behavior, leading to highly overlapping observation spaces and consistent performance.
  • Pong is a purely reactive game where the optimal policy is to mirror the ball’s vertical movement. The simplicity and deterministic nature of this strategy mean there is little room for meaningful strategic variation to emerge.
  • Enduro, while more complex visually, is also driven by a primary objective of continuous forward progress and overtaking. The core gameplay loop does not contain significant strategic “bottlenecks” that could send agents down wildly different learning paths.
In these games, the path to high rewards is straightforward, causing all agent trajectories and policies to converge. For real-world problems with similar characteristics (e.g., simple process optimization), we can expect DRL training to be relatively stable and reproducible.

4.1.2. High-Divergence Environments: Strategic Bottlenecks and Divergent Policies

Conversely, environments prone to high divergence, such as Qbert, Boxing, KFM, and Breakout, often feature strategic bottlenecks, multiple viable strategies, or complex state dependencies that amplify the effects of stochasticity.
  • Boxing is a highly interactive, opponent-dependent game. An agent might learn an aggressive rushing strategy, while another learns a defensive, counter-punching style. These are two distinct but viable approaches that lead to entirely different patterns of interaction, creating separate clusters in the observation space and varied performance outcomes.
  • Qbert and KFM contain significant exploration challenges and “bottlenecks.” In KFM, an agent that fails to learn how to defeat a specific enemy type will be trapped in early-level states, while an agent that succeeds will unlock a vast new region of the observation space. This creates a sharp bifurcation in experience and performance. Similarly, the unique board structure in Qbert presents many locally optimal paths, causing agents to specialize in different sections of the pyramid.
  • Breakout is a classic example of an environment with a critical strategic bottleneck: learning to tunnel the ball behind the brick wall. Agents that discover this strategy enter a new, high-scoring phase of the game with a completely different set of observations. Agents that fail to discover it remain trapped in a low-scoring, repetitive gameplay loop, leading to extreme performance variance.

4.2. Potential Applications of Observation Space Divergence Analysis

The observation space divergence analysis presented in this study extends beyond understanding the problem of performance variability in DRL; it can be directly utilized to devise solutions for improving evaluation performance in practical applications. Specifically, the findings of this research can be applied to ensemble methods—a widely used technique for surpassing the performance of single agents [14]—from two perspectives.
First, for improving evaluation performance via ensemble methods. When ensembling N agents to boost performance, a novel approach can be explored beyond conventional techniques, such as majority voting, simple averaging, or weighted sums based on rewards. Specifically, one could employ a weighted sum based on the similarity between the representation of the current observation and the observation space representation of each agent in the ensemble. This method would prioritize the actions of agents whose experiences are most similar to the current situation, thereby potentially leading to more sophisticated decision-making and higher performance (an illustrative sketch of this weighting scheme is given at the end of this subsection).
Second, for analyzing why ensemble effectiveness varies across environments. Our findings provide insight into predicting whether an ensemble will be effective in a given environment. For instance, in an environment like Freeway, where all agents explore a highly similar observation space, we can predict that the benefit of ensembling will be minimal because the individual agent policies are already convergent. In contrast, in an environment like Boxing, where agents learn distinct strategies (e.g., aggressive vs. defensive) resulting in clearly separated observation spaces, we can anticipate a significant performance boost from ensembling, as combining their different specializations would be highly complementary. This provides a practical guideline for determining where to focus limited computational resources when constructing ensembles.
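As a toy illustration of the first application above, the sketch below weights each ensemble member's action distribution by the proximity of the current observation (in PCA space) to that member's explored observation space, summarized here by a single centroid per member. The centroid summary, the softmax weighting, and all function and parameter names are hypothetical choices, not methods taken from this paper.

```python
# Hypothetical similarity-weighted ensemble: each member's action distribution is weighted
# by how close the current observation (in PCA space) is to that member's experience.
import numpy as np
from scipy.special import softmax

def ensemble_action(obs_feat, member_logits, member_centroids, temperature=1.0):
    # obs_feat: (d,) PCA feature of the current observation.
    # member_logits: (M, n_actions) action logits, one row per ensemble member.
    # member_centroids: (M, d) mean explored-PCA-feature of each member.
    dists = np.linalg.norm(member_centroids - obs_feat, axis=1)
    weights = softmax(-dists / temperature)          # closer experience -> larger weight
    probs = softmax(member_logits, axis=-1)
    mixed = weights @ probs                          # weighted mixture of member policies
    return int(np.argmax(mixed))
```

A member whose past experience closely matches the current situation thus dominates the mixture, while members that never visited similar states contribute little; the temperature controls how sharply this preference is applied.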

4.3. General Implications and Study Limitations

The most significant finding is from the identical initialization experiment. The persistence of high-performance variance and observation space divergence in games like Boxing and Qbert, even when starting from the same network weights, demonstrates that the problem is not merely one of poor initialization. The stochastic nature of the RL interaction process itself is sufficient to drive agents onto divergent paths. This suggests that achieving reproducible performance in complex environments requires more than just fixing random seeds; it may require fundamentally new approaches to guide exploration or to make learning more robust to variations in experience.
Our comprehensive set of metrics—from visual t-SNE to quantitative measures of dispersion (entropy, TV distance, MMD) and network function (TD-error, RMSE, KLD)—presents a unified narrative. When agents see different things (high TV/MMD), their understanding of the world diverges (high value RMSE), and their resulting behaviors diverge (high policy KLD). This work highlights that reporting only the mean and standard deviation of final scores can obscure the rich and varied behaviors that contribute to those scores. A deeper analysis of the underlying state visitation distributions is crucial for a comprehensive understanding of DRL algorithm behavior.
We acknowledge that the empirical results of our study focus on the PPO algorithm within the Atari domain. This deliberate choice of scope enabled a deep and multifaceted analysis of the divergence phenomenon. However, it is reasonable to question how these findings generalize to other algorithms and environments. We hypothesize that the core mechanism—stochastic agent–environment interactions leading to divergent experiential trajectories and specialized policies—is a fundamental aspect of the reinforcement learning process and is not unique to PPO. For instance, off-policy algorithms like DQN or SAC, which utilize a replay buffer, might exhibit different dynamics. A replay buffer could mitigate divergence by averaging experiences across different trajectories. Conversely, it could also exacerbate the issue if certain types of “lucky” trajectories become overrepresented early in training. Similarly, extending this analysis to domains with continuous action spaces, such as robotics tasks in MuJoCo, or environments with sparser rewards, represents an important next step. Exploring how the structure of the state space and the nature of the reward function influence the degree of observation divergence is a compelling avenue for future research.

5. Conclusions and Future Works

This paper investigated the role of observation space divergence as a contributing factor to performance variability in deep reinforcement learning. Through a series of experiments on PPO agents across eight Atari environments, we demonstrated a strong link between the similarity of states explored by different agents, the functional similarity of their learned networks, and the consistency of their final performance. The key findings include the following:
  • We expanded the analysis to eight Atari games, confirming that environments with low variance in performance (e.g., Freeway, Enduro) exhibit highly similar state space exploration across agents. In contrast, high-variance environments (e.g., Boxing, KFM, Qbert) show significant divergence.
  • Cross-evaluation of actor and value networks confirmed that agents with divergent observation distributions learn functionally different networks that do not generalize well to each other’s experiences.
  • A new experiment with identically initialized networks revealed that performance variability and observation space divergence persist even without different initial weights, highlighting that stochasticity in the agent–environment interaction is a primary source of this divergence.
  • Our analysis was shown to be robust to the choice of the number of clusters (K) used for state discretization, and the computational cost of the analysis pipeline was found to be modest.
This work highlights the importance of examining not only the performance levels achieved but also how agents attain them. Building on this foundational analysis in PPO and Atari, future research should extend this investigation in several key directions:
  • Temporal and Causal Analysis: Study the evolution of observation space differences during the training process. Analyzing the temporal relationship between when divergence occurs and when agent performance begins to vary would help establish a more direct causal link and better explain the dynamic nature of performance fluctuations.
  • Systematic Component Analysis: Conduct a comparative analysis of how different DRL components influence exploration behavior and performance variability. This includes systematically varying hyperparameters, network architectures, and regularization techniques to understand their impact on the observation space distribution of trained agents.
  • Broadening Algorithmic and Environmental Scope: Apply this analysis pipeline to a broader range of algorithms, including other on-policy variants and prominent off-policy algorithms (e.g., SAC, DQN), to determine how mechanisms like replay buffers affect divergence. Furthermore, expanding the study to other domains, such as continuous control benchmarks or environments with sparse rewards, is crucial to test the generality of our findings.
  • Developing Mitigation Strategies: Based on the insights from a component-level analysis, develop and benchmark novel exploration or regularization methods specifically designed to reduce undesirable trajectory divergence, promote more consistent learning outcomes, and directly mitigate DRL performance variability.
  • Theoretical Foundations: Work towards proposing a theoretical framework that can interpret these diverse experimental results, providing a more formal understanding of the relationship between stochastic exploration, state space coverage, and performance instability in DRL.

Author Contributions

Conceptualization, S.J. and A.L.; methodology, S.J.; software, S.J.; validation, S.J. and A.L.; formal analysis, S.J.; investigation, S.J.; resources, S.J.; data curation, S.J.; writing—original draft preparation, S.J.; writing—review and editing, S.J. and A.L.; visualization, S.J.; supervision, A.L.; project administration, A.L.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the research fund of Hanbat National University in 2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on reasonable request to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT and Gemini for the purposes of formatting and English grammar correction. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Single agent dispersion statistics (K = 50).
Agent | H | N_eff | Cov | G
Alien #1 | 2.82 | 16.70 | 0.64 | 0.90
Alien #2 | 3.23 | 25.23 | 0.78 | 0.95
Alien #3 | 3.18 | 24.05 | 0.86 | 0.94
Alien #4 | 2.87 | 17.67 | 0.66 | 0.91
Alien #5 | 2.92 | 18.51 | 0.64 | 0.92
Boxing #1 | 2.61 | 13.62 | 0.48 | 0.91
Boxing #2 | 3.17 | 23.82 | 0.74 | 0.95
Boxing #3 | 2.31 | 10.08 | 0.32 | 0.90
Boxing #4 | 3.06 | 21.29 | 0.66 | 0.95
Boxing #5 | 3.09 | 22.02 | 0.60 | 0.95
Breakout #1 | 2.61 | 13.61 | 0.56 | 0.87
Breakout #2 | 2.56 | 12.88 | 0.64 | 0.86
Breakout #3 | 2.55 | 12.86 | 0.56 | 0.87
Breakout #4 | 2.63 | 13.92 | 0.62 | 0.88
Breakout #5 | 2.55 | 12.80 | 0.52 | 0.88
Enduro #1 | 3.71 | 41.00 | 0.96 | 0.97
Enduro #2 | 3.70 | 40.26 | 1.00 | 0.97
Enduro #3 | 3.72 | 41.25 | 1.00 | 0.97
Enduro #4 | 3.74 | 41.97 | 0.98 | 0.97
Enduro #5 | 3.71 | 40.77 | 1.00 | 0.97
Freeway #1 | 3.84 | 46.69 | 1.00 | 0.98
Freeway #2 | 3.84 | 46.33 | 1.00 | 0.98
Freeway #3 | 3.84 | 46.56 | 1.00 | 0.98
Freeway #4 | 3.84 | 46.61 | 1.00 | 0.98
Freeway #5 | 3.83 | 46.25 | 1.00 | 0.98
KFM #1 | 3.51 | 33.50 | 1.00 | 0.96
KFM #2 | 3.42 | 30.62 | 0.92 | 0.96
KFM #3 | 3.61 | 36.88 | 0.98 | 0.97
KFM #4 | 3.42 | 30.68 | 0.96 | 0.96
KFM #5 | 3.05 | 21.16 | 0.64 | 0.94
Pong #1 | 3.58 | 35.95 | 0.82 | 0.97
Pong #2 | 3.60 | 36.46 | 0.82 | 0.97
Pong #3 | 3.79 | 44.34 | 1.00 | 0.98
Pong #4 | 3.60 | 36.62 | 0.94 | 0.97
Pong #5 | 3.00 | 20.07 | 0.98 | 0.93
Qbert #1 | 2.93 | 18.77 | 0.64 | 0.92
Qbert #2 | 2.96 | 19.21 | 0.60 | 0.93
Qbert #3 | 2.96 | 19.36 | 0.64 | 0.93
Qbert #4 | 2.76 | 15.73 | 0.60 | 0.92
Qbert #5 | 2.28 | 9.78 | 0.60 | 0.85
Table A2. Single agent dispersion statistics (K = 200).
Agent | H | N_eff | Cov | G
Alien #1 | 3.94 | 51.66 | 0.43 | 0.97
Alien #2 | 4.34 | 76.73 | 0.70 | 0.98
Alien #3 | 4.32 | 75.09 | 0.66 | 0.98
Alien #4 | 3.98 | 53.56 | 0.57 | 0.97
Alien #5 | 4.25 | 69.93 | 0.50 | 0.98
Boxing #1 | 3.60 | 36.72 | 0.29 | 0.97
Boxing #2 | 4.37 | 79.35 | 0.64 | 0.98
Boxing #3 | 3.44 | 31.29 | 0.21 | 0.97
Boxing #4 | 4.50 | 89.92 | 0.63 | 0.99
Boxing #5 | 4.63 | 102.24 | 0.61 | 0.99
Breakout #1 | 3.12 | 22.61 | 0.39 | 0.88
Breakout #2 | 3.02 | 20.47 | 0.37 | 0.87
Breakout #3 | 2.93 | 18.70 | 0.32 | 0.88
Breakout #4 | 3.45 | 31.55 | 0.40 | 0.93
Breakout #5 | 2.98 | 19.70 | 0.31 | 0.89
Enduro #1 | 5.13 | 169.43 | 0.99 | 0.99
Enduro #2 | 5.11 | 165.93 | 0.99 | 0.99
Enduro #3 | 5.12 | 167.66 | 0.99 | 0.99
Enduro #4 | 5.15 | 172.38 | 0.99 | 0.99
Enduro #5 | 5.14 | 170.26 | 0.99 | 0.99
Freeway #1 | 5.24 | 187.79 | 1.00 | 0.99
Freeway #2 | 5.23 | 186.22 | 1.00 | 0.99
Freeway #3 | 5.23 | 186.07 | 1.00 | 0.99
Freeway #4 | 5.23 | 187.14 | 1.00 | 0.99
Freeway #5 | 5.23 | 187.21 | 1.00 | 0.99
KFM #1 | 4.91 | 135.12 | 0.93 | 0.99
KFM #2 | 4.94 | 139.90 | 0.91 | 0.99
KFM #3 | 4.96 | 142.28 | 0.94 | 0.99
KFM #4 | 4.92 | 137.49 | 0.91 | 0.99
KFM #5 | 4.01 | 55.20 | 0.42 | 0.98
Pong #1 | 4.81 | 122.46 | 0.72 | 0.99
Pong #2 | 4.69 | 108.97 | 0.66 | 0.99
Pong #3 | 5.08 | 160.98 | 0.97 | 0.99
Pong #4 | 4.81 | 122.53 | 0.85 | 0.99
Pong #5 | 4.51 | 90.51 | 0.83 | 0.99
Qbert #1 | 4.12 | 61.31 | 0.51 | 0.97
Qbert #2 | 3.72 | 41.29 | 0.35 | 0.96
Qbert #3 | 4.10 | 60.58 | 0.51 | 0.97
Qbert #4 | 3.59 | 36.13 | 0.42 | 0.95
Qbert #5 | 3.60 | 36.64 | 0.42 | 0.96
Table A3. Pairwise Total Variation (TV) distance between agent observation distributions (K = 50).
Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.825 | 0.527 | 0.818 | 0.835
Alien #2 | 0.825 | 0.000 | 0.679 | 0.600 | 0.361
Alien #3 | 0.527 | 0.679 | 0.000 | 0.707 | 0.748
Alien #4 | 0.818 | 0.600 | 0.707 | 0.000 | 0.618
Alien #5 | 0.835 | 0.361 | 0.748 | 0.618 | 0.000
Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.959 | 0.913 | 0.979 | 0.985
Boxing #2 | 0.959 | 0.000 | 0.967 | 0.502 | 0.457
Boxing #3 | 0.913 | 0.967 | 0.000 | 0.978 | 0.984
Boxing #4 | 0.979 | 0.502 | 0.978 | 0.000 | 0.148
Boxing #5 | 0.985 | 0.457 | 0.984 | 0.148 | 0.000
Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.724 | 0.777 | 0.780 | 0.815
Breakout #2 | 0.724 | 0.000 | 0.812 | 0.740 | 0.832
Breakout #3 | 0.777 | 0.812 | 0.000 | 0.717 | 0.687
Breakout #4 | 0.780 | 0.740 | 0.717 | 0.000 | 0.711
Breakout #5 | 0.815 | 0.832 | 0.687 | 0.711 | 0.000
Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.125 | 0.079 | 0.104 | 0.088
Enduro #2 | 0.125 | 0.000 | 0.116 | 0.120 | 0.108
Enduro #3 | 0.079 | 0.116 | 0.000 | 0.103 | 0.079
Enduro #4 | 0.104 | 0.120 | 0.103 | 0.000 | 0.098
Enduro #5 | 0.088 | 0.108 | 0.079 | 0.098 | 0.000
Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.051 | 0.047 | 0.051 | 0.050
Freeway #2 | 0.051 | 0.000 | 0.049 | 0.050 | 0.051
Freeway #3 | 0.047 | 0.049 | 0.000 | 0.047 | 0.046
Freeway #4 | 0.051 | 0.050 | 0.047 | 0.000 | 0.052
Freeway #5 | 0.050 | 0.051 | 0.046 | 0.052 | 0.000
Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.171 | 0.183 | 0.180 | 0.748
KFM #2 | 0.171 | 0.000 | 0.198 | 0.155 | 0.791
KFM #3 | 0.183 | 0.198 | 0.000 | 0.219 | 0.686
KFM #4 | 0.180 | 0.155 | 0.219 | 0.000 | 0.773
KFM #5 | 0.748 | 0.791 | 0.686 | 0.773 | 0.000
Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.074 | 0.214 | 0.216 | 0.766
Pong #2 | 0.074 | 0.000 | 0.198 | 0.213 | 0.766
Pong #3 | 0.214 | 0.198 | 0.000 | 0.245 | 0.574
Pong #4 | 0.216 | 0.213 | 0.245 | 0.000 | 0.740
Pong #5 | 0.766 | 0.766 | 0.574 | 0.740 | 0.000
Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.742 | 0.032 | 0.599 | 0.737
Qbert #2 | 0.742 | 0.000 | 0.739 | 0.839 | 0.792
Qbert #3 | 0.032 | 0.739 | 0.000 | 0.589 | 0.733
Qbert #4 | 0.599 | 0.839 | 0.589 | 0.000 | 0.833
Qbert #5 | 0.737 | 0.792 | 0.733 | 0.833 | 0.000
Table A4. Pairwise Total Variation (TV) distance between agent observation distributions (K = 200).
Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.883 | 0.573 | 0.898 | 0.909
Alien #2 | 0.883 | 0.000 | 0.787 | 0.724 | 0.579
Alien #3 | 0.573 | 0.787 | 0.000 | 0.805 | 0.856
Alien #4 | 0.898 | 0.724 | 0.805 | 0.000 | 0.752
Alien #5 | 0.909 | 0.579 | 0.856 | 0.752 | 0.000
Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.970 | 0.939 | 0.981 | 0.984
Boxing #2 | 0.970 | 0.000 | 0.974 | 0.757 | 0.637
Boxing #3 | 0.939 | 0.974 | 0.000 | 0.980 | 0.984
Boxing #4 | 0.981 | 0.757 | 0.980 | 0.000 | 0.323
Boxing #5 | 0.984 | 0.637 | 0.984 | 0.323 | 0.000
Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.870 | 0.905 | 0.885 | 0.916
Breakout #2 | 0.870 | 0.000 | 0.927 | 0.891 | 0.935
Breakout #3 | 0.905 | 0.927 | 0.000 | 0.847 | 0.817
Breakout #4 | 0.885 | 0.891 | 0.847 | 0.000 | 0.848
Breakout #5 | 0.916 | 0.935 | 0.817 | 0.848 | 0.000
Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.193 | 0.165 | 0.166 | 0.169
Enduro #2 | 0.193 | 0.000 | 0.197 | 0.190 | 0.172
Enduro #3 | 0.165 | 0.197 | 0.000 | 0.179 | 0.170
Enduro #4 | 0.166 | 0.190 | 0.179 | 0.000 | 0.166
Enduro #5 | 0.169 | 0.172 | 0.170 | 0.166 | 0.000
Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.102 | 0.109 | 0.109 | 0.104
Freeway #2 | 0.102 | 0.000 | 0.108 | 0.107 | 0.107
Freeway #3 | 0.109 | 0.108 | 0.000 | 0.093 | 0.105
Freeway #4 | 0.109 | 0.107 | 0.093 | 0.000 | 0.107
Freeway #5 | 0.104 | 0.107 | 0.105 | 0.107 | 0.000
Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.278 | 0.305 | 0.289 | 0.801
KFM #2 | 0.278 | 0.000 | 0.253 | 0.197 | 0.825
KFM #3 | 0.305 | 0.253 | 0.000 | 0.292 | 0.720
KFM #4 | 0.289 | 0.197 | 0.292 | 0.000 | 0.805
KFM #5 | 0.801 | 0.825 | 0.720 | 0.805 | 0.000
Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.287 | 0.367 | 0.330 | 0.817
Pong #2 | 0.287 | 0.000 | 0.375 | 0.402 | 0.836
Pong #3 | 0.367 | 0.375 | 0.000 | 0.385 | 0.607
Pong #4 | 0.330 | 0.402 | 0.385 | 0.000 | 0.754
Pong #5 | 0.817 | 0.836 | 0.607 | 0.754 | 0.000
Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.853 | 0.113 | 0.800 | 0.840
Qbert #2 | 0.853 | 0.000 | 0.853 | 0.906 | 0.902
Qbert #3 | 0.113 | 0.853 | 0.000 | 0.823 | 0.840
Qbert #4 | 0.800 | 0.906 | 0.823 | 0.000 | 0.914
Qbert #5 | 0.840 | 0.902 | 0.840 | 0.914 | 0.000
Table A5. Pairwise Maximum Mean Discrepancy (MMD) between agent observation distributions (K = 50).
Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.323 | 0.205 | 0.329 | 0.354
Alien #2 | 0.323 | 0.000 | 0.273 | 0.243 | 0.180
Alien #3 | 0.205 | 0.273 | 0.000 | 0.285 | 0.278
Alien #4 | 0.329 | 0.243 | 0.285 | 0.000 | 0.237
Alien #5 | 0.354 | 0.180 | 0.278 | 0.237 | 0.000
Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.476 | 0.557 | 0.542 | 0.536
Boxing #2 | 0.476 | 0.000 | 0.363 | 0.219 | 0.195
Boxing #3 | 0.557 | 0.363 | 0.000 | 0.437 | 0.430
Boxing #4 | 0.542 | 0.219 | 0.437 | 0.000 | 0.074
Boxing #5 | 0.536 | 0.195 | 0.430 | 0.074 | 0.000
Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.288 | 0.352 | 0.295 | 0.361
Breakout #2 | 0.288 | 0.000 | 0.418 | 0.332 | 0.434
Breakout #3 | 0.352 | 0.418 | 0.000 | 0.220 | 0.209
Breakout #4 | 0.295 | 0.332 | 0.220 | 0.000 | 0.213
Breakout #5 | 0.361 | 0.434 | 0.209 | 0.213 | 0.000
Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.023 | 0.017 | 0.020 | 0.021
Enduro #2 | 0.023 | 0.000 | 0.023 | 0.023 | 0.015
Enduro #3 | 0.017 | 0.023 | 0.000 | 0.019 | 0.014
Enduro #4 | 0.020 | 0.023 | 0.019 | 0.000 | 0.019
Enduro #5 | 0.021 | 0.015 | 0.014 | 0.019 | 0.000
Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.011 | 0.013 | 0.014 | 0.013
Freeway #2 | 0.011 | 0.000 | 0.012 | 0.011 | 0.011
Freeway #3 | 0.013 | 0.012 | 0.000 | 0.010 | 0.013
Freeway #4 | 0.014 | 0.011 | 0.010 | 0.000 | 0.012
Freeway #5 | 0.013 | 0.011 | 0.013 | 0.012 | 0.000
Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.067 | 0.060 | 0.078 | 0.315
KFM #2 | 0.067 | 0.000 | 0.076 | 0.078 | 0.334
KFM #3 | 0.060 | 0.076 | 0.000 | 0.098 | 0.282
KFM #4 | 0.078 | 0.078 | 0.098 | 0.000 | 0.317
KFM #5 | 0.315 | 0.334 | 0.282 | 0.317 | 0.000
Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.113 | 0.137 | 0.114 | 0.476
Pong #2 | 0.113 | 0.000 | 0.152 | 0.120 | 0.485
Pong #3 | 0.137 | 0.152 | 0.000 | 0.119 | 0.358
Pong #4 | 0.114 | 0.120 | 0.119 | 0.000 | 0.444
Pong #5 | 0.476 | 0.485 | 0.358 | 0.444 | 0.000
Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.225 | 0.013 | 0.156 | 0.225
Qbert #2 | 0.225 | 0.000 | 0.225 | 0.272 | 0.255
Qbert #3 | 0.013 | 0.225 | 0.000 | 0.159 | 0.225
Qbert #4 | 0.156 | 0.272 | 0.159 | 0.000 | 0.255
Qbert #5 | 0.225 | 0.255 | 0.225 | 0.255 | 0.000
Table A6. Pairwise Maximum Mean Discrepancy (MMD) between agent observation distributions (K = 200).
Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.323 | 0.205 | 0.329 | 0.354
Alien #2 | 0.323 | 0.000 | 0.273 | 0.243 | 0.180
Alien #3 | 0.205 | 0.273 | 0.000 | 0.285 | 0.278
Alien #4 | 0.329 | 0.243 | 0.285 | 0.000 | 0.237
Alien #5 | 0.354 | 0.180 | 0.278 | 0.237 | 0.000
Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.476 | 0.557 | 0.542 | 0.536
Boxing #2 | 0.476 | 0.000 | 0.363 | 0.219 | 0.195
Boxing #3 | 0.557 | 0.363 | 0.000 | 0.437 | 0.430
Boxing #4 | 0.542 | 0.219 | 0.437 | 0.000 | 0.074
Boxing #5 | 0.536 | 0.195 | 0.430 | 0.074 | 0.000
Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.288 | 0.352 | 0.295 | 0.361
Breakout #2 | 0.288 | 0.000 | 0.418 | 0.332 | 0.434
Breakout #3 | 0.352 | 0.418 | 0.000 | 0.220 | 0.209
Breakout #4 | 0.295 | 0.332 | 0.220 | 0.000 | 0.213
Breakout #5 | 0.361 | 0.434 | 0.209 | 0.213 | 0.000
Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.023 | 0.017 | 0.020 | 0.021
Enduro #2 | 0.023 | 0.000 | 0.023 | 0.023 | 0.015
Enduro #3 | 0.017 | 0.023 | 0.000 | 0.019 | 0.014
Enduro #4 | 0.020 | 0.023 | 0.019 | 0.000 | 0.019
Enduro #5 | 0.021 | 0.015 | 0.014 | 0.019 | 0.000
Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.011 | 0.013 | 0.014 | 0.013
Freeway #2 | 0.011 | 0.000 | 0.012 | 0.011 | 0.011
Freeway #3 | 0.013 | 0.012 | 0.000 | 0.010 | 0.013
Freeway #4 | 0.014 | 0.011 | 0.010 | 0.000 | 0.012
Freeway #5 | 0.013 | 0.011 | 0.013 | 0.012 | 0.000
Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.067 | 0.060 | 0.078 | 0.315
KFM #2 | 0.067 | 0.000 | 0.076 | 0.078 | 0.334
KFM #3 | 0.060 | 0.076 | 0.000 | 0.098 | 0.282
KFM #4 | 0.078 | 0.078 | 0.098 | 0.000 | 0.317
KFM #5 | 0.315 | 0.334 | 0.282 | 0.317 | 0.000
Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.113 | 0.137 | 0.114 | 0.476
Pong #2 | 0.113 | 0.000 | 0.152 | 0.120 | 0.485
Pong #3 | 0.137 | 0.152 | 0.000 | 0.119 | 0.358
Pong #4 | 0.114 | 0.120 | 0.119 | 0.000 | 0.444
Pong #5 | 0.476 | 0.485 | 0.358 | 0.444 | 0.000
Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.225 | 0.013 | 0.156 | 0.225
Qbert #2 | 0.225 | 0.000 | 0.225 | 0.272 | 0.255
Qbert #3 | 0.013 | 0.225 | 0.000 | 0.159 | 0.225
Qbert #4 | 0.156 | 0.272 | 0.159 | 0.000 | 0.255
Qbert #5 | 0.225 | 0.255 | 0.225 | 0.255 | 0.000

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  2. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  3. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  4. Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep Reinforcement Learning That Matters. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 3207–3214. [Google Scholar]
  5. Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; Madry, A. Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. In Proceedings of the International Conference on Learning Representations (ICLR), Virtually, 26–30 April 2020. [Google Scholar]
  6. Bjorck, J.; Gomes, C.P.; Weinberger, K.Q. Is High Variance Unavoidable in Reinforcement Learning? A Case Study in Continuous Control. In Proceedings of the International Conference on Learning Representations (ICLR), Virtually, 25–29 April 2022. [Google Scholar]
  7. Jang, S.; Kim, H.I. Entropy-Aware Model Initialization for Effective Exploration in Deep Reinforcement Learning. Sensors 2022, 22, 5845. [Google Scholar] [CrossRef] [PubMed]
  8. Moalla, S.; Miele, A.; Pyatko, D.; Pascanu, R.; Gulcehre, C. No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 69652–69699. [Google Scholar]
  9. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. 2013, 47, 253–279. [Google Scholar] [CrossRef]
  10. Towers, M.; Kwiatkowski, A.; Terry, J.; Balis, J.U.; Cola, G.D.; Deleu, T.; Goulão, M.; Kallinteris, A.; Krimmel, M.; Arjun, K.G.; et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv 2024, arXiv:2407.17032. [Google Scholar] [CrossRef]
  11. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  12. Raffin, A. RL Baselines3 Zoo. 2020. Available online: https://github.com/DLR-RM/rl-baselines3-zoo (accessed on 5 June 2025).
  13. Roy, A.; Ghosh, A.K. Some tests of independence based on maximum mean discrepancy and ranks of nearest neighbors. Stat. Probab. Lett. 2020, 164, 108793. [Google Scholar] [CrossRef]
  14. Ganaie, M.; Hu, M.; Malik, A.; Tanveer, M.; Suganthan, P. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
Figure 1. The eight Atari 2600 environments used for agent training: (a) Alien, (b) Boxing, (c) Breakout, (d) Enduro, (e) Freeway, (f) KungFuMaster (KFM), (g) Pong, and (h) Qbert.
Table 1. Performance of trained agents (mean and standard deviation of rewards).
Task | #1 | #2 | #3 | #4 | #5
Alien | 1394.33 (448.74) | 1135.67 (491.36) | 971.00 (509.41) | 903.33 (257.38) | 824.00 (255.94)
Boxing | 99.73 (0.81) | 99.10 (0.70) | 99.03 (1.14) | 93.47 (3.20) | 92.10 (4.55)
Breakout | 343.60 (90.93) | 312.03 (114.54) | 171.80 (174.42) | 106.20 (143.42) | 41.20 (19.39)
Enduro | 874.20 (170.43) | 862.67 (143.47) | 771.20 (198.16) | 711.13 (164.85) | 700.93 (203.18)
Freeway | 22.03 (1.62) | 21.37 (1.33) | 21.30 (1.51) | 21.27 (1.31) | 21.27 (1.34)
KFM | 37,453.33 (7052.74) | 34,853.33 (6008.42) | 28,180.00 (7171.30) | 22,820.00 (7527.79) | 3600.00 (0.00)
Pong | 21.00 (0.00) | 21.00 (0.00) | 20.80 (0.40) | 19.83 (0.37) | 19.33 (3.73)
Qbert | 4050.00 (0.00) | 4050.00 (0.00) | 4050.00 (0.00) | 2008.33 (1553.87) | 1116.67 (977.84)
Table 2. Collected data size from trained agents (mean and standard deviation of episode length).
Task | #1 | #2 | #3 | #4 | #5
Alien | 856.57 (80.35) | 731.37 (155.80) | 711.17 (183.88) | 699.20 (124.09) | 746.60 (113.81)
Boxing | 286.17 (21.92) | 667.27 (46.83) | 949.33 (29.09) | 1163.43 (93.63) | 1111.53 (68.38)
Breakout | 17,244.43 (11,941.62) | 22,010.73 (9967.87) | 22,905.47 (9143.04) | 23,686.77 (8433.69) | 25,257.17 (6501.71)
Enduro | 11,198.40 (1819.96) | 11,753.43 (2229.88) | 9535.17 (2540.15) | 9534.53 (2057.03) | 10,422.03 (2539.81)
Freeway | 2044.43 (2.54) | 2044.07 (2.26) | 2044.20 (2.30) | 2044.37 (1.89) | 2044.23 (1.67)
KFM | 3482.80 (788.32) | 3866.47 (448.77) | 3037.83 (595.05) | 4139.97 (553.75) | 1099.23 (2.54)
Pong | 1684.03 (20.47) | 1679.57 (26.72) | 1688.63 (23.34) | 2707.00 (302.06) | 1794.80 (85.39)
Qbert | 622.70 (1.90) | 618.43 (6.15) | 623.37 (7.86) | 796.10 (111.40) | 773.87 (52.29)
Table 3. Single agent dispersion statistics (K = 100).
Agent | H | Neff | Cov | G
Alien #1 | 3.41 | 30.33 | 0.52 | 0.95
Alien #2 | 3.75 | 42.70 | 0.76 | 0.97
Alien #3 | 3.77 | 43.17 | 0.75 | 0.97
Alien #4 | 3.39 | 29.81 | 0.70 | 0.94
Alien #5 | 3.60 | 36.62 | 0.60 | 0.96
Boxing #1 | 3.12 | 22.58 | 0.38 | 0.95
Boxing #2 | 3.77 | 43.52 | 0.70 | 0.97
Boxing #3 | 2.82 | 16.78 | 0.26 | 0.94
Boxing #4 | 3.80 | 44.92 | 0.63 | 0.97
Boxing #5 | 3.90 | 49.29 | 0.62 | 0.98
Breakout #1 | 2.90 | 18.14 | 0.50 | 0.88
Breakout #2 | 2.79 | 16.21 | 0.51 | 0.87
Breakout #3 | 2.73 | 15.40 | 0.45 | 0.87
Breakout #4 | 2.86 | 17.54 | 0.52 | 0.88
Breakout #5 | 2.73 | 15.38 | 0.38 | 0.88
Enduro #1 | 4.43 | 84.01 | 0.98 | 0.99
Enduro #2 | 4.39 | 80.82 | 0.99 | 0.99
Enduro #3 | 4.42 | 82.96 | 1.00 | 0.99
Enduro #4 | 4.44 | 84.72 | 0.98 | 0.99
Enduro #5 | 4.40 | 81.72 | 0.99 | 0.99
Freeway #1 | 4.55 | 94.51 | 1.00 | 0.99
Freeway #2 | 4.54 | 94.09 | 1.00 | 0.99
Freeway #3 | 4.54 | 93.84 | 1.00 | 0.99
Freeway #4 | 4.54 | 94.05 | 1.00 | 0.99
Freeway #5 | 4.54 | 94.00 | 1.00 | 0.99
KFM #1 | 4.26 | 70.70 | 0.96 | 0.98
KFM #2 | 4.24 | 69.09 | 0.90 | 0.98
KFM #3 | 4.31 | 74.49 | 0.95 | 0.98
KFM #4 | 4.26 | 70.71 | 0.93 | 0.98
KFM #5 | 3.52 | 33.74 | 0.53 | 0.96
Pong #1 | 4.15 | 63.51 | 0.74 | 0.98
Pong #2 | 3.99 | 54.04 | 0.69 | 0.98
Pong #3 | 4.45 | 85.65 | 1.00 | 0.99
Pong #4 | 4.10 | 60.40 | 0.88 | 0.98
Pong #5 | 3.91 | 49.71 | 0.93 | 0.97
Qbert #1 | 3.48 | 32.55 | 0.54 | 0.95
Qbert #2 | 3.19 | 24.36 | 0.45 | 0.94
Qbert #3 | 3.49 | 32.73 | 0.54 | 0.95
Qbert #4 | 2.98 | 19.69 | 0.45 | 0.92
Qbert #5 | 3.15 | 23.38 | 0.54 | 0.94
H: entropy, Neff: effective support size, Cov: Coverage Ratio, G: Gini–Simpson Index.
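For reference, the per-agent dispersion statistics in Table 3 can be computed directly from a histogram of cluster assignments over the K = 100 clusters fitted to the explored observations. The sketch below is a minimal illustration of those formulas, not the code used in this study; the `counts` array and the use of natural-log entropy (consistent with the reported scale, since ln 100 ≈ 4.61) are our assumptions.

```python
import numpy as np

def dispersion_stats(counts):
    """Dispersion statistics for one agent from its cluster-visit counts.

    counts: length-K array, counts[k] = number of observations assigned to cluster k.
    Returns (H, Neff, Cov, G) as reported in Table 3.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()            # empirical cluster-visit distribution
    nz = p[p > 0]
    H = float(-np.sum(nz * np.log(nz)))  # Shannon entropy (natural log)
    n_eff = float(np.exp(H))             # effective support size
    cov = float(np.mean(counts > 0))     # fraction of the K clusters ever visited
    g = float(1.0 - np.sum(p ** 2))      # Gini-Simpson index
    return H, n_eff, cov, g

# Toy example with K = 100 clusters and a near-uniform visit pattern.
rng = np.random.default_rng(0)
counts = rng.multinomial(10_000, np.ones(100) / 100)
print(dispersion_stats(counts))
```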
Table 4. Pairwise Total Variation (TV) distance between agent observation distributions.

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.854 | 0.512 | 0.857 | 0.869
Alien #2 | 0.854 | 0.000 | 0.767 | 0.684 | 0.500
Alien #3 | 0.512 | 0.767 | 0.000 | 0.783 | 0.809
Alien #4 | 0.857 | 0.684 | 0.783 | 0.000 | 0.671
Alien #5 | 0.869 | 0.500 | 0.809 | 0.671 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.959 | 0.936 | 0.979 | 0.985
Boxing #2 | 0.959 | 0.000 | 0.971 | 0.673 | 0.565
Boxing #3 | 0.936 | 0.971 | 0.000 | 0.979 | 0.984
Boxing #4 | 0.979 | 0.673 | 0.979 | 0.000 | 0.252
Boxing #5 | 0.985 | 0.565 | 0.984 | 0.252 | 0.000

Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.809 | 0.861 | 0.845 | 0.890
Breakout #2 | 0.809 | 0.000 | 0.879 | 0.843 | 0.905
Breakout #3 | 0.861 | 0.879 | 0.000 | 0.782 | 0.783
Breakout #4 | 0.845 | 0.843 | 0.782 | 0.000 | 0.792
Breakout #5 | 0.890 | 0.905 | 0.783 | 0.792 | 0.000

Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.146 | 0.123 | 0.130 | 0.135
Enduro #2 | 0.146 | 0.000 | 0.143 | 0.141 | 0.125
Enduro #3 | 0.123 | 0.143 | 0.000 | 0.135 | 0.124
Enduro #4 | 0.130 | 0.141 | 0.135 | 0.000 | 0.135
Enduro #5 | 0.135 | 0.125 | 0.124 | 0.135 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.064 | 0.072 | 0.075 | 0.069
Freeway #2 | 0.064 | 0.000 | 0.066 | 0.071 | 0.072
Freeway #3 | 0.072 | 0.066 | 0.000 | 0.070 | 0.076
Freeway #4 | 0.075 | 0.071 | 0.070 | 0.000 | 0.067
Freeway #5 | 0.069 | 0.072 | 0.076 | 0.067 | 0.000

Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.206 | 0.227 | 0.210 | 0.779
KFM #2 | 0.206 | 0.000 | 0.205 | 0.165 | 0.818
KFM #3 | 0.227 | 0.205 | 0.000 | 0.235 | 0.711
KFM #4 | 0.210 | 0.165 | 0.235 | 0.000 | 0.794
KFM #5 | 0.779 | 0.818 | 0.711 | 0.794 | 0.000

Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.204 | 0.324 | 0.288 | 0.796
Pong #2 | 0.204 | 0.000 | 0.372 | 0.355 | 0.824
Pong #3 | 0.324 | 0.372 | 0.000 | 0.316 | 0.582
Pong #4 | 0.288 | 0.355 | 0.316 | 0.000 | 0.742
Pong #5 | 0.796 | 0.824 | 0.582 | 0.742 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.831 | 0.084 | 0.698 | 0.782
Qbert #2 | 0.831 | 0.000 | 0.830 | 0.878 | 0.897
Qbert #3 | 0.084 | 0.830 | 0.000 | 0.736 | 0.781
Qbert #4 | 0.698 | 0.878 | 0.736 | 0.000 | 0.882
Qbert #5 | 0.782 | 0.897 | 0.781 | 0.882 | 0.000
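If each agent's explored observations are summarized as a visit histogram over the same K clusters, the Total Variation distance between two agents is half the L1 distance between their normalized histograms. The following sketch is an illustrative reconstruction under that histogram assumption; it is not the authors' implementation.

```python
import numpy as np

def tv_distance(counts_p, counts_q):
    """Total Variation distance between two empirical cluster-visit distributions."""
    p = np.asarray(counts_p, dtype=float)
    q = np.asarray(counts_q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())

def pairwise_tv(histograms):
    """Symmetric matrix of TV distances for a list of per-agent histograms."""
    n = len(histograms)
    return np.array([[tv_distance(histograms[i], histograms[j]) for j in range(n)]
                     for i in range(n)])
```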
Table 5. Pairwise Maximum Mean Discrepancy (MMD) between agent observation distributions.

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.323 | 0.205 | 0.329 | 0.354
Alien #2 | 0.323 | 0.000 | 0.273 | 0.243 | 0.180
Alien #3 | 0.205 | 0.273 | 0.000 | 0.285 | 0.278
Alien #4 | 0.329 | 0.243 | 0.285 | 0.000 | 0.237
Alien #5 | 0.354 | 0.180 | 0.278 | 0.237 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.476 | 0.557 | 0.542 | 0.536
Boxing #2 | 0.476 | 0.000 | 0.363 | 0.219 | 0.195
Boxing #3 | 0.557 | 0.363 | 0.000 | 0.437 | 0.430
Boxing #4 | 0.542 | 0.219 | 0.437 | 0.000 | 0.074
Boxing #5 | 0.536 | 0.195 | 0.430 | 0.074 | 0.000

Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.288 | 0.352 | 0.295 | 0.361
Breakout #2 | 0.288 | 0.000 | 0.418 | 0.332 | 0.434
Breakout #3 | 0.352 | 0.418 | 0.000 | 0.220 | 0.209
Breakout #4 | 0.295 | 0.332 | 0.220 | 0.000 | 0.213
Breakout #5 | 0.361 | 0.434 | 0.209 | 0.213 | 0.000

Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.023 | 0.017 | 0.020 | 0.021
Enduro #2 | 0.023 | 0.000 | 0.023 | 0.023 | 0.015
Enduro #3 | 0.017 | 0.023 | 0.000 | 0.019 | 0.014
Enduro #4 | 0.020 | 0.023 | 0.019 | 0.000 | 0.019
Enduro #5 | 0.021 | 0.015 | 0.014 | 0.019 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.011 | 0.013 | 0.014 | 0.013
Freeway #2 | 0.011 | 0.000 | 0.012 | 0.011 | 0.011
Freeway #3 | 0.013 | 0.012 | 0.000 | 0.010 | 0.013
Freeway #4 | 0.014 | 0.011 | 0.010 | 0.000 | 0.012
Freeway #5 | 0.013 | 0.011 | 0.013 | 0.012 | 0.000

Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.067 | 0.060 | 0.078 | 0.315
KFM #2 | 0.067 | 0.000 | 0.076 | 0.078 | 0.334
KFM #3 | 0.060 | 0.076 | 0.000 | 0.098 | 0.282
KFM #4 | 0.078 | 0.078 | 0.098 | 0.000 | 0.317
KFM #5 | 0.315 | 0.334 | 0.282 | 0.317 | 0.000

Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.113 | 0.137 | 0.114 | 0.476
Pong #2 | 0.113 | 0.000 | 0.152 | 0.120 | 0.485
Pong #3 | 0.137 | 0.152 | 0.000 | 0.119 | 0.358
Pong #4 | 0.114 | 0.120 | 0.119 | 0.000 | 0.444
Pong #5 | 0.476 | 0.485 | 0.358 | 0.444 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.225 | 0.013 | 0.156 | 0.225
Qbert #2 | 0.225 | 0.000 | 0.225 | 0.272 | 0.255
Qbert #3 | 0.013 | 0.225 | 0.000 | 0.159 | 0.225
Qbert #4 | 0.156 | 0.272 | 0.159 | 0.000 | 0.255
Qbert #5 | 0.225 | 0.255 | 0.225 | 0.255 | 0.000
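Maximum Mean Discrepancy compares two sets of observation feature vectors through a kernel two-sample statistic. The sketch below uses an RBF kernel with a median-heuristic bandwidth and the biased estimator; these specific choices are our own assumptions for illustration and may differ from the configuration used in this study.

```python
import numpy as np

def rbf_mmd(X, Y, sigma=None):
    """Biased MMD estimate between samples X (n, d) and Y (m, d) under an RBF kernel."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    Z = np.vstack([X, Y])
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # pairwise squared distances
    if sigma is None:                                 # median-heuristic bandwidth
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(X)
    kxx, kyy, kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return float(np.sqrt(max(kxx.mean() + kyy.mean() - 2.0 * kxy.mean(), 0.0)))
```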
Table 6. Mean and standard deviation of absolute TD-error (value network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 5.062 (13.643) | 4.566 (20.929) | 4.126 (21.149) | 3.962 (21.101) | 4.096 (21.242)
Alien #2 | 4.426 (19.533) | 4.855 (18.012) | 3.631 (19.554) | 3.677 (19.447) | 3.610 (19.559)
Alien #3 | 5.524 (15.813) | 4.156 (14.448) | 3.414 (14.492) | 3.787 (14.483) | 3.670 (14.567)
Alien #4 | 4.288 (11.267) | 4.177 (11.085) | 3.442 (10.821) | 3.929 (8.413) | 3.775 (10.457)
Alien #5 | 3.610 (11.117) | 4.304 (11.979) | 2.804 (10.900) | 3.355 (11.073) | 3.093 (10.593)

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.302 (0.318) | 0.596 (0.555) | 0.492 (0.563) | 0.482 (0.543) | 0.389 (0.516)
Boxing #2 | 0.806 (0.756) | 0.278 (0.281) | 0.398 (0.519) | 0.282 (0.443) | 0.239 (0.415)
Boxing #3 | 0.887 (0.785) | 0.937 (0.802) | 0.211 (0.220) | 0.578 (0.571) | 0.425 (0.512)
Boxing #4 | 0.780 (0.640) | 0.375 (0.342) | 0.339 (0.327) | 0.165 (0.211) | 0.181 (0.227)
Boxing #5 | 0.777 (0.683) | 0.395 (0.384) | 0.352 (0.376) | 0.202 (0.274) | 0.178 (0.245)

Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.492 (4.030) | 0.496 (4.823) | 0.360 (1.212) | 0.510 (1.231) | 0.510 (2.710)
Breakout #2 | 0.299 (3.824) | 0.331 (4.800) | 0.173 (0.858) | 0.165 (0.635) | 0.224 (2.645)
Breakout #3 | 0.273 (4.810) | 0.283 (5.625) | 0.126 (0.699) | 0.098 (0.758) | 0.110 (2.346)
Breakout #4 | 0.406 (4.518) | 0.527 (5.628) | 0.488 (0.842) | 0.213 (0.868) | 0.313 (2.500)
Breakout #5 | 0.275 (4.896) | 0.324 (5.904) | 0.125 (0.808) | 0.108 (0.819) | 0.117 (2.402)

Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.365 (0.556) | 0.375 (0.543) | 0.362 (0.535) | 0.384 (0.544) | 0.369 (0.516)
Enduro #2 | 0.362 (0.556) | 0.368 (0.541) | 0.356 (0.534) | 0.383 (0.548) | 0.365 (0.513)
Enduro #3 | 0.346 (0.524) | 0.358 (0.520) | 0.341 (0.504) | 0.366 (0.518) | 0.354 (0.496)
Enduro #4 | 0.365 (0.569) | 0.377 (0.559) | 0.362 (0.545) | 0.380 (0.551) | 0.369 (0.530)
Enduro #5 | 0.358 (0.530) | 0.366 (0.521) | 0.351 (0.509) | 0.376 (0.527) | 0.357 (0.493)

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.014 (0.018) | 0.015 (0.019) | 0.014 (0.018) | 0.014 (0.017) | 0.011 (0.018)
Freeway #2 | 0.014 (0.019) | 0.015 (0.019) | 0.014 (0.018) | 0.014 (0.018) | 0.011 (0.019)
Freeway #3 | 0.015 (0.019) | 0.015 (0.019) | 0.014 (0.019) | 0.014 (0.018) | 0.011 (0.018)
Freeway #4 | 0.014 (0.018) | 0.015 (0.018) | 0.014 (0.018) | 0.014 (0.017) | 0.011 (0.017)
Freeway #5 | 0.015 (0.019) | 0.015 (0.019) | 0.014 (0.018) | 0.015 (0.018) | 0.012 (0.019)

Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 21.380 (40.486) | 21.578 (39.371) | 21.223 (38.835) | 20.854 (37.581) | 22.033 (40.885)
KFM #2 | 21.788 (41.577) | 21.903 (40.504) | 21.573 (40.012) | 21.547 (38.781) | 22.244 (41.771)
KFM #3 | 20.385 (37.923) | 20.541 (37.465) | 20.302 (35.906) | 19.908 (34.719) | 20.512 (37.669)
KFM #4 | 21.799 (41.559) | 21.873 (40.700) | 21.747 (40.639) | 21.488 (39.138) | 22.458 (42.314)
KFM #5 | 14.926 (28.450) | 15.255 (32.852) | 15.377 (28.981) | 14.786 (28.030) | 15.571 (30.718)

Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.005 (0.005) | 0.013 (0.033) | 0.068 (0.144) | 0.123 (0.153) | 0.116 (0.118)
Pong #2 | 0.011 (0.043) | 0.007 (0.008) | 0.063 (0.146) | 0.114 (0.156) | 0.111 (0.119)
Pong #3 | 0.054 (0.190) | 0.067 (0.227) | 0.011 (0.011) | 0.126 (0.163) | 0.089 (0.098)
Pong #4 | 0.104 (0.162) | 0.125 (0.218) | 0.089 (0.121) | 0.017 (0.024) | 0.119 (0.152)
Pong #5 | 0.063 (0.184) | 0.087 (0.249) | 0.048 (0.107) | 0.129 (0.175) | 0.028 (0.027)

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 13.459 (40.799) | 17.428 (68.259) | 21.169 (48.023) | 18.693 (48.216) | 23.974 (63.322)
Qbert #2 | 21.006 (76.693) | 21.924 (55.678) | 19.646 (69.327) | 19.100 (87.946) | 22.719 (68.507)
Qbert #3 | 12.338 (38.798) | 16.137 (66.068) | 18.025 (44.184) | 17.672 (47.846) | 22.270 (63.021)
Qbert #4 | 14.591 (65.426) | 9.250 (41.167) | 22.122 (53.370) | 12.973 (45.607) | 17.736 (42.792)
Qbert #5 | 5.782 (21.723) | 30.517 (78.335) | 15.457 (31.070) | 5.923 (24.812) | 19.779 (43.431)
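The cross-evaluation behind Table 6 applies one agent's value network to transitions collected by another agent and averages the absolute one-step TD error |r + γ(1 − done)V(s′) − V(s)|. The sketch below assumes a generic `value_fn` callable (e.g., a thin wrapper around a trained critic) and NumPy batches of transitions; it is an illustrative reconstruction, not the evaluation code used in this study.

```python
import numpy as np

def abs_td_error(value_fn, obs, next_obs, rewards, dones, gamma=0.99):
    """Mean and std of |r + gamma * (1 - done) * V(s') - V(s)| under value_fn."""
    v = np.asarray(value_fn(obs), dtype=float)
    v_next = np.asarray(value_fn(next_obs), dtype=float)
    td = rewards + gamma * (1.0 - np.asarray(dones, dtype=float)) * v_next - v
    return float(np.abs(td).mean()), float(np.abs(td).std())
```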
Table 7. RMSE of value predictions (value network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 144.577 | 113.062 | 144.116 | 142.873
Alien #2 | 94.272 | 0.000 | 94.082 | 100.405 | 91.038
Alien #3 | 98.369 | 50.567 | 0.000 | 50.526 | 49.087
Alien #4 | 60.747 | 54.165 | 57.083 | 0.000 | 57.330
Alien #5 | 58.614 | 46.007 | 56.287 | 53.889 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 14.821 | 17.998 | 18.625 | 18.874
Boxing #2 | 5.238 | 0.000 | 6.008 | 5.078 | 5.472
Boxing #3 | 11.721 | 2.514 | 0.000 | 1.814 | 2.654
Boxing #4 | 8.454 | 3.326 | 1.733 | 0.000 | 0.778
Boxing #5 | 8.926 | 4.008 | 1.320 | 0.633 | 0.000

Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 13.209 | 13.732 | 7.988 | 9.830
Breakout #2 | 9.126 | 0.000 | 5.545 | 6.682 | 7.023
Breakout #3 | 6.973 | 5.874 | 0.000 | 3.001 | 4.424
Breakout #4 | 6.980 | 7.435 | 3.630 | 0.000 | 4.300
Breakout #5 | 9.928 | 12.507 | 4.670 | 6.514 | 0.000

Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.660 | 0.832 | 0.761 | 1.004
Enduro #2 | 0.651 | 0.000 | 0.775 | 0.695 | 0.920
Enduro #3 | 0.796 | 0.749 | 0.000 | 0.651 | 0.750
Enduro #4 | 0.712 | 0.694 | 0.634 | 0.000 | 0.698
Enduro #5 | 1.007 | 0.927 | 0.791 | 0.747 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.023 | 0.028 | 0.030 | 0.069
Freeway #2 | 0.022 | 0.000 | 0.026 | 0.031 | 0.068
Freeway #3 | 0.027 | 0.027 | 0.000 | 0.028 | 0.061
Freeway #4 | 0.030 | 0.030 | 0.028 | 0.000 | 0.056
Freeway #5 | 0.063 | 0.068 | 0.061 | 0.058 | 0.000

Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 41.427 | 71.822 | 74.515 | 99.879
KFM #2 | 40.641 | 0.000 | 69.775 | 57.753 | 96.297
KFM #3 | 75.014 | 68.594 | 0.000 | 43.661 | 48.528
KFM #4 | 83.364 | 63.393 | 47.948 | 0.000 | 77.542
KFM #5 | 61.220 | 43.578 | 40.010 | 51.762 | 0.000

Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.101 | 0.366 | 0.818 | 0.874
Pong #2 | 0.187 | 0.000 | 0.371 | 0.786 | 0.877
Pong #3 | 0.392 | 0.355 | 0.000 | 0.757 | 0.773
Pong #4 | 0.677 | 0.487 | 0.391 | 0.000 | 0.631
Pong #5 | 0.414 | 0.465 | 0.361 | 0.647 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 593.532 | 80.419 | 145.879 | 397.322
Qbert #2 | 502.554 | 0.000 | 469.197 | 522.818 | 380.142
Qbert #3 | 76.790 | 569.957 | 0.000 | 110.466 | 364.446
Qbert #4 | 251.197 | 296.141 | 151.877 | 0.000 | 227.688
Qbert #5 | 244.059 | 293.804 | 158.743 | 250.996 | 0.000
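The RMSE entries in Table 7 compare the value predictions of two critics on a shared batch of observations, which is why the diagonal is zero. A minimal sketch under the same assumptions as above (generic callables, NumPy batches); which agent's trajectories supply `obs` is left to the caller.

```python
import numpy as np

def value_rmse(value_fn_a, value_fn_b, obs):
    """Root-mean-squared difference between two critics' value estimates."""
    va = np.asarray(value_fn_a(obs), dtype=float)
    vb = np.asarray(value_fn_b(obs), dtype=float)
    return float(np.sqrt(np.mean((va - vb) ** 2)))
```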
Table 8. Mean and standard deviation of KL divergence of action logits (actor network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 3.640 (2.879) | 2.441 (3.047) | 3.451 (2.731) | 4.730 (4.081)
Alien #2 | 3.251 (3.140) | 0.000 | 2.253 (2.433) | 2.831 (2.740) | 2.213 (2.185)
Alien #3 | 2.105 (2.308) | 2.308 (2.391) | 0.000 | 3.250 (2.959) | 4.702 (4.060)
Alien #4 | 2.916 (2.809) | 2.705 (3.039) | 2.899 (3.103) | 0.000 | 3.208 (3.014)
Alien #5 | 3.262 (3.063) | 2.092 (2.352) | 3.758 (4.141) | 2.801 (2.476) | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 9.609 (4.661) | 6.428 (4.556) | 6.798 (3.950) | 4.989 (3.495)
Boxing #2 | 10.185 (7.840) | 0.000 | 6.217 (5.017) | 4.573 (5.023) | 4.514 (5.009)
Boxing #3 | 12.033 (8.884) | 7.685 (5.039) | 0.000 | 6.610 (3.730) | 6.582 (4.077)
Boxing #4 | 10.883 (7.330) | 5.016 (4.726) | 5.804 (3.565) | 0.000 | 2.552 (3.602)
Boxing #5 | 10.436 (7.049) | 4.731 (4.573) | 5.519 (3.862) | 1.836 (2.045) | 0.000

Agent | Breakout #1 | Breakout #2 | Breakout #3 | Breakout #4 | Breakout #5
Breakout #1 | 0.000 | 0.496 (0.975) | 1.798 (2.181) | 0.602 (1.265) | 1.297 (1.968)
Breakout #2 | 1.065 (1.537) | 0.000 | 0.782 (0.929) | 0.469 (0.818) | 1.656 (1.960)
Breakout #3 | 0.423 (0.759) | 0.457 (0.577) | 0.000 | 0.243 (0.917) | 0.363 (0.681)
Breakout #4 | 0.968 (1.415) | 0.631 (1.055) | 0.667 (0.937) | 0.000 | 0.991 (1.439)
Breakout #5 | 0.367 (0.828) | 0.426 (0.576) | 0.163 (0.644) | 0.187 (0.803) | 0.000

Agent | Enduro #1 | Enduro #2 | Enduro #3 | Enduro #4 | Enduro #5
Enduro #1 | 0.000 | 0.709 (0.862) | 0.688 (0.775) | 0.668 (0.863) | 0.790 (0.881)
Enduro #2 | 0.737 (0.840) | 0.000 | 0.732 (0.854) | 0.685 (0.822) | 0.760 (0.902)
Enduro #3 | 0.679 (0.810) | 0.752 (0.998) | 0.000 | 0.642 (1.018) | 0.646 (0.953)
Enduro #4 | 0.657 (0.828) | 0.667 (0.834) | 0.649 (0.807) | 0.000 | 0.682 (0.968)
Enduro #5 | 0.782 (0.839) | 0.735 (0.883) | 0.630 (0.686) | 0.676 (0.806) | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.002 (0.000) | 0.000 (0.000) | 0.003 (0.000) | 0.124 (0.005)
Freeway #2 | 0.001 (0.000) | 0.000 | 0.001 (0.000) | 0.003 (0.000) | 0.109 (0.004)
Freeway #3 | 0.000 (0.000) | 0.001 (0.000) | 0.000 | 0.002 (0.001) | 0.108 (0.013)
Freeway #4 | 0.002 (0.000) | 0.002 (0.000) | 0.002 (0.000) | 0.000 | 0.040 (0.001)
Freeway #5 | 0.008 (0.000) | 0.007 (0.000) | 0.007 (0.001) | 0.003 (0.000) | 0.000

Agent | KFM #1 | KFM #2 | KFM #3 | KFM #4 | KFM #5
KFM #1 | 0.000 | 0.042 (0.022) | 0.035 (0.036) | 0.051 (0.073) | 0.042 (0.025)
KFM #2 | 0.043 (0.030) | 0.000 | 0.039 (0.021) | 0.022 (0.029) | 0.088 (0.032)
KFM #3 | 0.039 (0.038) | 0.045 (0.027) | 0.000 | 0.039 (0.037) | 0.037 (0.030)
KFM #4 | 0.060 (0.111) | 0.024 (0.037) | 0.046 (0.052) | 0.000 | 0.093 (0.103)
KFM #5 | 0.080 (0.031) | 0.142 (0.055) | 0.034 (0.027) | 0.064 (0.050) | 0.000

Agent | Pong #1 | Pong #2 | Pong #3 | Pong #4 | Pong #5
Pong #1 | 0.000 | 0.983 (1.202) | 0.924 (1.483) | 5.373 (5.521) | 2.513 (2.894)
Pong #2 | 0.713 (0.800) | 0.000 | 0.960 (1.497) | 6.800 (6.938) | 2.327 (2.432)
Pong #3 | 1.027 (1.653) | 1.376 (1.828) | 0.000 | 4.076 (4.804) | 2.049 (2.235)
Pong #4 | 1.650 (1.814) | 2.251 (2.687) | 1.427 (1.864) | 0.000 | 1.853 (2.703)
Pong #5 | 1.336 (1.880) | 1.838 (2.165) | 1.093 (1.477) | 3.119 (3.791) | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.805 (1.399) | 0.108 (0.174) | 0.690 (0.972) | 0.221 (0.469)
Qbert #2 | 0.876 (1.184) | 0.000 | 1.133 (2.091) | 0.622 (0.968) | 0.702 (0.856)
Qbert #3 | 0.118 (0.216) | 0.723 (1.425) | 0.000 | 0.753 (1.209) | 0.201 (0.462)
Qbert #4 | 0.688 (1.396) | 0.991 (1.908) | 0.696 (1.486) | 0.000 | 0.548 (0.970)
Qbert #5 | 0.525 (0.832) | 0.749 (1.153) | 0.322 (0.729) | 0.508 (0.734) | 0.000
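The KL entries in Table 8 compare the categorical action distributions that two actor networks assign to the same observations, obtained by applying a softmax to their raw logits. The sketch below is illustrative rather than the study's code; `logits_p` and `logits_q` are assumed to be (batch, n_actions) arrays produced by the two actors on a shared batch.

```python
import numpy as np

def mean_kl_from_logits(logits_p, logits_q):
    """Mean and std over a batch of KL(p || q) between categorical policies."""
    def log_softmax(z):
        z = np.asarray(z, dtype=float)
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=1)   # per-observation KL
    return float(kl.mean()), float(kl.std())
```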
Table 9. Trustworthiness scores of t-SNE embeddings across varying perplexity values. Columns denote different perplexity settings, and cell entries are the corresponding trustworthiness metrics, with the highest score highlighted in bold.
Task | 5 | 10 | 30 | 50 | 200
Alien | 0.9122 | 0.9166 | 0.9092 | 0.9063 | 0.8992
Boxing | 0.9966 | 0.9984 | 0.9985 | 0.9984 | 0.9979
Breakout | 0.9731 | 0.9865 | 0.9893 | 0.9880 | 0.9916
Enduro | 0.9954 | 0.9965 | 0.9969 | 0.9968 | 0.9966
Freeway | 0.9976 | 0.9995 | 0.9998 | 0.9997 | 0.9995
KFM | 0.9958 | 0.9975 | 0.9967 | 0.9963 | 0.9951
Pong | 0.9897 | 0.9959 | 0.9959 | 0.9959 | 0.9946
Qbert | 0.9870 | 0.9994 | 0.9996 | 0.9995 | 0.9993
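Trustworthiness quantifies how well an embedding preserves local neighborhoods of the original feature space, and scikit-learn provides both t-SNE and the trustworthiness score. The snippet below is a minimal illustration on synthetic data standing in for the observation features, with the perplexity grid mirroring the columns of Table 9; the feature dimensionality and the n_neighbors value are placeholders, not necessarily the settings used in this study.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))            # stand-in for observation features

for perplexity in (5, 10, 30, 50, 200):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    score = trustworthiness(X, emb, n_neighbors=5)
    print(f"perplexity={perplexity:>3}: trustworthiness={score:.4f}")
```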
Table 10. Mean and standard deviation of computation time (seconds) for PCA and K-means with K = 50, 100, and 200.
Task | PCA | K-Means (K = 50) | K-Means (K = 100) | K-Means (K = 200)
Alien | 12.516 (0.415) | 0.133 (0.007) | 0.546 (0.135) | 0.871 (0.334)
Boxing | 12.448 (0.411) | 0.137 (0.006) | 0.445 (0.172) | 0.718 (0.346)
Breakout | 13.845 (1.255) | 0.111 (0.006) | 0.443 (0.151) | 0.777 (0.409)
Enduro | 12.514 (0.469) | 0.124 (0.003) | 0.537 (0.136) | 0.905 (0.202)
Freeway | 12.169 (0.471) | 0.125 (0.004) | 0.465 (0.150) | 0.552 (0.372)
KFM | 12.568 (0.509) | 0.119 (0.002) | 0.542 (0.037) | 0.917 (0.254)
Pong | 11.910 (0.447) | 0.146 (0.021) | 0.350 (0.178) | 0.403 (0.297)
Qbert | 12.170 (0.527) | 0.112 (0.005) | 0.434 (0.145) | 0.676 (0.369)
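The timings in Table 10 correspond to fitting PCA followed by K-means with K ∈ {50, 100, 200} on the pooled observation features. The sketch below times scikit-learn's implementations on synthetic data; the data shape, number of PCA components, and K-means settings are placeholders rather than the configuration used in this study.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 512)).astype(np.float32)  # stand-in for flattened observations

t0 = time.perf_counter()
X_reduced = PCA(n_components=50).fit_transform(X)       # dimensionality reduction first
print(f"PCA: {time.perf_counter() - t0:.3f} s")

for k in (50, 100, 200):
    t0 = time.perf_counter()
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    print(f"K-means (K={k}): {time.perf_counter() - t0:.3f} s")
```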
Table 11. Performance of trained agents with identically initialized networks (mean and standard deviation of rewards).
Task | #1 | #2 | #3 | #4 | #5
Alien | 1456.00 (1065.92) | 1079.33 (211.80) | 1075.00 (425.80) | 1070.33 (325.79) | 916.33 (631.83)
Boxing | 100.00 (0.00) | 93.77 (2.50) | 93.77 (2.60) | 93.03 (4.53) | 90.97 (3.43)
Freeway | 21.43 (1.31) | 21.33 (1.35) | 21.30 (1.37) | 21.27 (1.29) | 21.13 (1.12)
Qbert | 4050.00 (0.00) | 4010.00 (48.99) | 958.33 (221.11) | 873.33 (44.22) | 800.00 (0.00)
Table 12. Single agent dispersion statistics of trained agents with identically initialized networks.
Agent | H | Neff | Cov | G
Alien #1 | 3.84 | 46.55 | 0.66 | 0.97
Alien #2 | 3.46 | 31.88 | 0.77 | 0.93
Alien #3 | 3.42 | 30.48 | 0.60 | 0.95
Alien #4 | 3.71 | 40.69 | 0.76 | 0.97
Alien #5 | 3.72 | 41.14 | 0.77 | 0.96
Boxing #1 | 2.77 | 15.89 | 0.28 | 0.92
Boxing #2 | 3.35 | 28.45 | 0.49 | 0.96
Boxing #3 | 3.83 | 46.28 | 0.65 | 0.98
Boxing #4 | 3.04 | 20.84 | 0.40 | 0.95
Boxing #5 | 3.82 | 45.46 | 0.67 | 0.97
Freeway #1 | 4.55 | 94.53 | 0.99 | 0.99
Freeway #2 | 4.54 | 93.97 | 1.00 | 0.99
Freeway #3 | 4.54 | 93.89 | 1.00 | 0.99
Freeway #4 | 4.54 | 93.80 | 1.00 | 0.99
Freeway #5 | 4.55 | 94.28 | 1.00 | 0.99
Qbert #1 | 3.08 | 21.82 | 0.43 | 0.93
Qbert #2 | 3.29 | 26.77 | 0.47 | 0.94
Qbert #3 | 2.92 | 18.46 | 0.31 | 0.93
Qbert #4 | 2.93 | 18.74 | 0.38 | 0.92
Qbert #5 | 3.04 | 20.90 | 0.37 | 0.93
Table 13. Mean and standard deviation of absolute TD-error for agents with identically initialized networks (value network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 7.938 (34.771) | 5.529 (40.785) | 5.497 (40.223) | 4.728 (40.526) | 4.891 (40.635)
Alien #2 | 5.081 (17.172) | 4.217 (8.028) | 4.382 (16.863) | 4.026 (14.063) | 5.109 (11.535)
Alien #3 | 4.118 (14.219) | 3.482 (14.628) | 4.185 (13.904) | 3.327 (14.711) | 3.784 (14.777)
Alien #4 | 4.507 (17.851) | 4.143 (17.104) | 4.017 (17.549) | 4.390 (17.833) | 4.850 (19.507)
Alien #5 | 4.544 (16.142) | 3.611 (15.878) | 4.308 (16.030) | 3.650 (15.803) | 3.995 (15.891)

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.036 (0.036) | 0.570 (0.795) | 0.554 (0.762) | 0.598 (0.676) | 0.516 (0.650)
Boxing #2 | 0.908 (0.858) | 0.173 (0.213) | 0.279 (0.313) | 0.435 (0.370) | 0.319 (0.327)
Boxing #3 | 1.336 (1.549) | 0.241 (0.321) | 0.221 (0.275) | 0.455 (0.440) | 0.247 (0.308)
Boxing #4 | 0.770 (0.727) | 0.277 (0.332) | 0.301 (0.339) | 0.174 (0.216) | 0.273 (0.293)
Boxing #5 | 1.171 (1.296) | 0.257 (0.342) | 0.245 (0.320) | 0.425 (0.418) | 0.215 (0.274)

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.015 (0.018) | 0.015 (0.018) | 0.017 (0.019) | 0.016 (0.019) | 0.012 (0.019)
Freeway #2 | 0.015 (0.018) | 0.015 (0.018) | 0.016 (0.019) | 0.016 (0.019) | 0.012 (0.018)
Freeway #3 | 0.015 (0.018) | 0.014 (0.017) | 0.016 (0.018) | 0.016 (0.019) | 0.011 (0.018)
Freeway #4 | 0.014 (0.018) | 0.014 (0.016) | 0.016 (0.018) | 0.016 (0.018) | 0.011 (0.017)
Freeway #5 | 0.014 (0.018) | 0.014 (0.017) | 0.016 (0.018) | 0.016 (0.018) | 0.011 (0.017)

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 17.643 (43.687) | 23.649 (65.018) | 28.110 (86.530) | 16.074 (66.547) | 22.951 (88.250)
Qbert #2 | 14.553 (36.225) | 16.889 (54.559) | 24.112 (85.800) | 10.127 (61.133) | 17.677 (83.162)
Qbert #3 | 11.549 (25.748) | 7.513 (22.826) | 2.730 (11.624) | 8.916 (31.148) | 7.016 (30.265)
Qbert #4 | 22.447 (27.629) | 14.627 (23.644) | 16.403 (22.565) | 9.976 (23.709) | 9.516 (15.295)
Qbert #5 | 21.291 (50.213) | 15.357 (41.226) | 14.825 (24.857) | 21.417 (66.445) | 3.444 (6.031)
Table 14. Pairwise TV distance for agents with identically initialized networks.

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.619 | 0.822 | 0.643 | 0.718
Alien #2 | 0.619 | 0.000 | 0.774 | 0.700 | 0.638
Alien #3 | 0.822 | 0.774 | 0.000 | 0.757 | 0.740
Alien #4 | 0.643 | 0.700 | 0.757 | 0.000 | 0.613
Alien #5 | 0.718 | 0.638 | 0.740 | 0.613 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.929 | 0.929 | 0.973 | 0.933
Boxing #2 | 0.929 | 0.000 | 0.773 | 0.979 | 0.850
Boxing #3 | 0.929 | 0.773 | 0.000 | 0.974 | 0.289
Boxing #4 | 0.973 | 0.979 | 0.974 | 0.000 | 0.972
Boxing #5 | 0.933 | 0.850 | 0.289 | 0.972 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.065 | 0.072 | 0.074 | 0.074
Freeway #2 | 0.065 | 0.000 | 0.067 | 0.068 | 0.072
Freeway #3 | 0.072 | 0.067 | 0.000 | 0.075 | 0.079
Freeway #4 | 0.074 | 0.068 | 0.075 | 0.000 | 0.072
Freeway #5 | 0.074 | 0.072 | 0.079 | 0.072 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.758 | 0.916 | 0.898 | 0.879
Qbert #2 | 0.758 | 0.000 | 0.895 | 0.730 | 0.713
Qbert #3 | 0.916 | 0.895 | 0.000 | 0.903 | 0.891
Qbert #4 | 0.898 | 0.730 | 0.903 | 0.000 | 0.777
Qbert #5 | 0.879 | 0.713 | 0.891 | 0.777 | 0.000
Table 15. Pairwise MMD for agents with identically initialized networks.

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 0.215 | 0.348 | 0.250 | 0.291
Alien #2 | 0.215 | 0.000 | 0.328 | 0.229 | 0.226
Alien #3 | 0.348 | 0.328 | 0.000 | 0.291 | 0.264
Alien #4 | 0.250 | 0.229 | 0.291 | 0.000 | 0.163
Alien #5 | 0.291 | 0.226 | 0.264 | 0.163 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 0.553 | 0.538 | 0.559 | 0.544
Boxing #2 | 0.553 | 0.000 | 0.225 | 0.463 | 0.277
Boxing #3 | 0.538 | 0.225 | 0.000 | 0.441 | 0.095
Boxing #4 | 0.559 | 0.463 | 0.441 | 0.000 | 0.442
Boxing #5 | 0.544 | 0.277 | 0.095 | 0.442 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.011 | 0.011 | 0.012 | 0.011
Freeway #2 | 0.011 | 0.000 | 0.011 | 0.012 | 0.010
Freeway #3 | 0.011 | 0.011 | 0.000 | 0.012 | 0.011
Freeway #4 | 0.012 | 0.012 | 0.012 | 0.000 | 0.013
Freeway #5 | 0.011 | 0.010 | 0.011 | 0.013 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 0.194 | 0.375 | 0.322 | 0.275
Qbert #2 | 0.194 | 0.000 | 0.322 | 0.266 | 0.241
Qbert #3 | 0.375 | 0.322 | 0.000 | 0.278 | 0.313
Qbert #4 | 0.322 | 0.266 | 0.278 | 0.000 | 0.232
Qbert #5 | 0.275 | 0.241 | 0.313 | 0.232 | 0.000
Table 16. RMSE of value predictions for agents with identically initialized networks (value network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 | 211.413 | 205.393 | 209.773 | 206.833
Alien #2 | 105.855 | 0.000 | 107.732 | 95.769 | 74.780
Alien #3 | 81.813 | 81.002 | 0.000 | 88.808 | 93.255
Alien #4 | 57.325 | 61.768 | 70.915 | 0.000 | 38.993
Alien #5 | 28.692 | 27.610 | 31.997 | 22.099 | 0.000

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 | 26.277 | 25.232 | 25.964 | 25.476
Boxing #2 | 11.259 | 0.000 | 1.045 | 1.054 | 1.270
Boxing #3 | 13.332 | 1.433 | 0.000 | 1.671 | 0.544
Boxing #4 | 7.830 | 1.270 | 1.111 | 0.000 | 0.837
Boxing #5 | 11.726 | 1.829 | 0.761 | 2.079 | 0.000

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 | 0.043 | 0.029 | 0.028 | 0.062
Freeway #2 | 0.042 | 0.000 | 0.039 | 0.039 | 0.041
Freeway #3 | 0.028 | 0.038 | 0.000 | 0.025 | 0.058
Freeway #4 | 0.026 | 0.038 | 0.024 | 0.000 | 0.060
Freeway #5 | 0.054 | 0.039 | 0.057 | 0.057 | 0.000

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 | 381.729 | 477.101 | 518.526 | 554.895
Qbert #2 | 126.572 | 0.000 | 316.792 | 288.990 | 303.697
Qbert #3 | 129.170 | 109.197 | 0.000 | 125.973 | 115.177
Qbert #4 | 73.667 | 62.259 | 64.624 | 0.000 | 55.432
Qbert #5 | 299.573 | 496.026 | 144.667 | 541.692 | 0.000
Table 17. Mean and standard deviation of KL divergence of action logits for agents with identically initialized networks (actor network cross-evaluation).

Agent | Alien #1 | Alien #2 | Alien #3 | Alien #4 | Alien #5
Alien #1 | 0.000 (0.000) | 2.588 (2.449) | 3.184 (2.705) | 2.074 (2.055) | 3.578 (3.203)
Alien #2 | 3.646 (3.635) | 0.000 (0.000) | 3.365 (3.147) | 2.009 (1.851) | 2.290 (2.735)
Alien #3 | 3.841 (4.202) | 3.185 (2.868) | 0.000 (0.000) | 3.947 (3.600) | 3.666 (3.621)
Alien #4 | 2.905 (3.664) | 1.577 (1.444) | 3.203 (2.997) | 0.000 (0.000) | 2.232 (2.240)
Alien #5 | 3.798 (4.947) | 1.728 (1.823) | 3.778 (3.648) | 3.017 (2.609) | 0.000 (0.000)

Agent | Boxing #1 | Boxing #2 | Boxing #3 | Boxing #4 | Boxing #5
Boxing #1 | 0.000 (0.000) | 7.752 (5.730) | 10.043 (6.436) | 8.779 (5.299) | 8.894 (5.835)
Boxing #2 | 6.039 (4.752) | 0.000 (0.000) | 7.312 (6.168) | 5.546 (3.896) | 6.794 (5.672)
Boxing #3 | 3.554 (4.251) | 5.966 (5.001) | 0.000 (0.000) | 5.373 (3.457) | 3.883 (5.556)
Boxing #4 | 5.014 (4.029) | 3.505 (2.186) | 7.819 (6.909) | 0.000 (0.000) | 10.088 (8.279)
Boxing #5 | 4.304 (4.144) | 6.278 (5.051) | 4.308 (5.592) | 4.978 (2.780) | 0.000 (0.000)

Agent | Freeway #1 | Freeway #2 | Freeway #3 | Freeway #4 | Freeway #5
Freeway #1 | 0.000 (0.000) | 0.010 (0.000) | 0.001 (0.000) | 0.000 (0.000) | 0.148 (0.006)
Freeway #2 | 0.005 (0.000) | 0.000 (0.000) | 0.002 (0.000) | 0.004 (0.000) | 0.014 (0.001)
Freeway #3 | 0.001 (0.001) | 0.004 (0.000) | 0.000 (0.000) | 0.001 (0.000) | 0.086 (0.003)
Freeway #4 | 0.000 (0.000) | 0.009 (0.000) | 0.001 (0.000) | 0.000 (0.000) | 0.139 (0.005)
Freeway #5 | 0.007 (0.000) | 0.001 (0.000) | 0.004 (0.000) | 0.007 (0.000) | 0.000 (0.000)

Agent | Qbert #1 | Qbert #2 | Qbert #3 | Qbert #4 | Qbert #5
Qbert #1 | 0.000 (0.000) | 0.643 (1.246) | 1.834 (3.214) | 0.868 (1.672) | 0.542 (1.135)
Qbert #2 | 1.124 (2.864) | 0.000 (0.000) | 1.687 (2.995) | 0.492 (0.626) | 0.472 (0.528)
Qbert #3 | 1.913 (4.283) | 1.331 (2.403) | 0.000 (0.000) | 1.230 (2.116) | 1.176 (2.075)
Qbert #4 | 1.016 (1.841) | 0.441 (0.436) | 1.276 (2.430) | 0.000 (0.000) | 0.376 (0.582)
Qbert #5 | 1.452 (2.305) | 0.608 (0.704) | 1.498 (2.852) | 0.359 (0.546) | 0.000 (0.000)