1. Introduction
Deep reinforcement learning (DRL) has demonstrated an extraordinary ability to solve high-dimensional decision-making problems, from Atari video games [1] to the board game Go [2]. Among modern on-policy methods, Proximal Policy Optimization (PPO) [3] has emerged as a de facto baseline due to its empirical performance and implementation simplicity, and it remains the algorithm of choice in a wide range of recent studies. Yet a persistent challenge plagues the field: despite identical hyperparameters and training budgets, independent runs of the same DRL algorithm can yield markedly different returns [4], complicating the fair evaluation and deployment of new methods.
Existing research has identified several key sources contributing to this performance variability. One line of inquiry has focused on the profound impact of implementation details. Foundational work by Henderson et al. [4] demonstrated that differences in codebases, hyperparameter settings, and even random seeds could lead to drastically different outcomes. This was further emphasized by Engstrom et al. [5], who showed that seemingly minor code-level decisions—such as the choice of activation function or the order of operations—can alter performance by orders of magnitude. Another perspective has examined the dynamics of the training process itself. Bjorck et al. [6] provided evidence that much of this variability originates early in training, as a few “outlier” runs drift onto low-reward trajectories and never recover. This aligns with the work by Jang et al. [7], which explored how entropy-aware initialization can foster more effective exploration from the outset, thereby preventing early stagnation. A third approach has delved into the internal mechanics of the learning agent. For instance, Moalla et al. [8] recently established a connection between performance instability in PPO and internal “representation collapse,” where the network learns insufficiently diverse features, leading to trust issues in policy updates.
While these studies provide crucial insights into implementation, training dynamics, and internal representations, a complementary perspective remains less explored: the divergence in the observation space that each agent actually experiences. RL agents learn from the specific trajectories of states and rewards they encounter. If two agents, due to chance events early in training, begin to explore different regions of the state space, they are effectively training on different datasets. This can lead them to converge to substantially different—and unequally performing—policies. This paper investigates the hypothesis that this divergence in lived experience is a primary cause of performance variability. Unlike prior work focused on implementation choices or internal network states [8], we aim to directly quantify the differences in state visitation distributions among independently trained agents. And while studies such as [6,7] identify that runs diverge, we characterize how they diverge in terms of the states they encounter and explicitly link this to the functional dissimilarity of the resulting policies.
In this work, we perform a controlled empirical study of five PPO agents, each trained with an independent random seed, across eight Atari environments—Alien, Boxing, Breakout, Enduro, Freeway, KungFuMaster, Pong, and Qbert—using the Arcade Learning Environment [9] and Gymnasium [10]. Training five independent PPO agents per game yields a spectrum of outcomes, from highly successful to comparatively poor policies. We then ask the following: Do higher-performing agents encounter a broader or different set of states than lower-performing agents? And do such differences manifest in the learned representations of the actor (policy) and critic (value function) networks? To answer these questions, we employ a range of analysis techniques. First, we visualize and quantify each agent’s observation distribution using dimensionality-reduction methods and statistical measures of dispersion and similarity. Second, we cross-evaluate the learned actor and value networks, applying them to states experienced by other agents to assess functional differences. Our findings reveal a clear correlation between divergence in explored observation spaces, dissimilarity of the learned networks, and variance in achieved performance across agents within each environment.
The remainder of this paper is organized as follows. Section 2 details our methodology. Section 3 presents the experimental results, featuring analyses of observation space characteristics and the discrepancies observed in actor–critic networks. In Section 4, we discuss the implications of our findings. Finally, Section 5 concludes the paper and outlines avenues for future research.
2. Methodology
To investigate the relationship between observation space divergence, network differences, and performance variability, we conducted a series of experiments with PPO agents trained on eight Atari environments. We selected PPO and the Atari 2600 benchmark suite for this investigation. PPO was chosen because it is a robust, high-performing, and widely adopted baseline in the DRL community, which makes findings about its run-to-run variability broadly relevant. The Atari suite, accessed via the Arcade Learning Environment, offers a diverse collection of environments with varying complexity and reward structures. This diversity is crucial for our study, as it allows us to systematically compare the divergence phenomenon across games known to produce both highly consistent and highly variable performance outcomes.
2.1. Experimental Setup
All experiments were conducted on a workstation with the following specifications:
Hardware: Intel(R) Core(TM) i7-14700K (28 cores), 128 GB RAM.
Software: Python 3.10.16, along with key scientific computing libraries including NumPy 2.2.3, Pandas 2.2.3, SciPy 1.15.3, and Scikit-learn 1.7.0.
2.2. Training Environment and Agents
We trained five independent PPO agents for each of the eight Atari environments shown in Figure 1: Alien, Boxing, Breakout, Enduro, Freeway, KungFuMaster (KFM), Pong, and Qbert. Training was conducted with Stable-Baselines3 [11], using the PPO hyperparameters for Atari games released in RL Baselines3 Zoo [12]. Each agent within an environment was trained from scratch, with the random seed as the only source of variation.
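For concreteness, the following sketch shows how a single training run of this kind can be launched with Stable-Baselines3. The environment ID, timestep budget, and hyperparameter values are illustrative; they mirror the RL Baselines3 Zoo Atari defaults (which additionally apply linear schedules to the learning rate and clip range), and the exact configuration should be taken from the Zoo files.

```python
# Hedged sketch: one PPO training run on an Atari game with Stable-Baselines3.
# Hyperparameter values are representative of the RL Baselines3 Zoo Atari defaults;
# linear schedules are omitted for brevity.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

SEED = 0  # the only quantity varied across the five runs per game

env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=SEED)
env = VecFrameStack(env, n_stack=4)  # standard 4-frame stacking

model = PPO(
    "CnnPolicy",
    env,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.1,
    ent_coef=0.01,
    vf_coef=0.5,
    seed=SEED,
    verbose=1,
)
model.learn(total_timesteps=10_000_000)
model.save(f"ppo_breakout_seed{SEED}")
```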
2.3. Data Collection
After the training phase concluded for all agents, each of the five converged agents per environment was rolled out for 30 episodes. The performance, measured as the average reward over the 30 evaluation episodes, was recorded for each agent. During these rollouts, at each step $t$, a tuple containing the current observation ($o_t$), the received reward ($r_t$), the value estimate from the agent’s own critic ($V_i(o_t)$), and the action logits from the agent’s own actor ($\pi_i(\cdot \mid o_t)$) was collected and stored. This process generated a dataset of trajectories specific to each agent’s learned policy.
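A minimal sketch of this rollout logging is shown below, assuming a single-environment Stable-Baselines3 VecEnv with the same preprocessing used during training; the record layout and function name are illustrative.

```python
# Hedged sketch of the evaluation rollouts: for each step, store the observation,
# reward, the agent's own value estimate, and its action logits.
import numpy as np
import torch
from stable_baselines3 import PPO

def collect_rollouts(model: PPO, vec_env, n_episodes: int = 30):
    records, episode_returns = [], []
    for _ in range(n_episodes):
        obs, done, ep_return = vec_env.reset(), False, 0.0
        while not done:
            step_obs = obs.copy()
            obs_tensor, _ = model.policy.obs_to_tensor(step_obs)
            with torch.no_grad():
                value = model.policy.predict_values(obs_tensor)   # own critic V_i(o_t)
                logits = model.policy.get_distribution(obs_tensor).distribution.logits
            action, _ = model.predict(step_obs, deterministic=False)
            obs, rewards, dones, infos = vec_env.step(action)
            done, reward = bool(dones[0]), float(rewards[0])
            ep_return += reward
            records.append({
                "obs": step_obs,                          # o_t (stacked frames)
                "reward": reward,                         # r_t
                "value": float(value.item()),             # V_i(o_t)
                "logits": logits.cpu().numpy().ravel(),   # actor logits at o_t
            })
        episode_returns.append(ep_return)
    return records, float(np.mean(episode_returns))
```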
2.4. Analysis of Observation Space
The collected observation data was analyzed to characterize the extent and nature of exploration by each agent and to compare observation distributions across agents.
2.4.1. Visualization
To qualitatively assess the similarity of explored observation spaces, the high-dimensional observation data (raw pixel frames) was visualized in a 2D space using t-SNE. For each environment, observations from all five agents were combined. From each agent’s collected observations, 5000 frames were randomly subsampled. These observations were then flattened into vectors before being projected into two dimensions using t-SNE. The perplexity hyperparameter for t-SNE was set to 30, a choice informed by a trustworthiness analysis (presented in Section 3.5) to ensure high-quality embeddings. The resulting scatter plots were color-coded by agent ID to reveal patterns of overlap and separation.
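The following sketch illustrates this joint embedding procedure with scikit-learn’s t-SNE; array shapes, function names, and plotting details are illustrative.

```python
# Hedged sketch of the joint t-SNE embedding: 5000 randomly subsampled frames per
# agent, flattened and embedded together, then colored by agent ID.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_joint_tsne(obs_per_agent, n_sub=5000, perplexity=30, seed=0):
    """obs_per_agent: list with one array of frames per agent, shape (N, H, W, C)."""
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for agent_id, obs in enumerate(obs_per_agent):
        idx = rng.choice(len(obs), size=min(n_sub, len(obs)), replace=False)
        feats.append(obs[idx].reshape(len(idx), -1).astype(np.float32))
        labels.append(np.full(len(idx), agent_id))
    X, y = np.concatenate(feats), np.concatenate(labels)

    emb = TSNE(n_components=2, perplexity=perplexity, init="pca",
               random_state=seed).fit_transform(X)
    for agent_id in np.unique(y):
        mask = y == agent_id
        plt.scatter(emb[mask, 0], emb[mask, 1], s=2, label=f"agent {agent_id}")
    plt.legend(markerscale=4)
    plt.title("t-SNE of observations, colored by agent")
    plt.show()
```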
2.4.2. Single-Agent Observation Dispersion
To quantify the diversity of observations encountered by each individual agent, a multi-step process was employed:
Data Loading and Preprocessing: For each agent, its collected observations over 30 episodes were loaded, and then 5000 frames were subsampled and flattened into vectors.
Dimensionality Reduction and Denoising: Principal Component Analysis (PCA) was applied to reduce the dimensionality of the flattened observations to 50 dimensions. This step serves to denoise and compress the data, focusing on the directions of highest variance, which can lead to more robust and meaningful clustering in the subsequent step. Working with high-dimensional raw pixel data directly for clustering can be computationally expensive and sensitive to noise; PCA mitigates these issues by capturing the most salient features.
State Space Discretization via Clustering: The PCA-transformed features from all agents within an environment were pooled together. K-means clustering was then performed on this pooled set of features to define a common set of discrete state categories. To assess the sensitivity of our metrics to the granularity of discretization, we performed this step with K values of 50, 100, and 200. The results for K = 100 are presented in the main analysis, with K = 50 and K = 200 used for comparison.
Occupancy Histogram Generation: For each agent $i$, an occupancy histogram (probability distribution $p_i$) over the K clusters was computed.
Dispersion Metrics Calculation: Based on this probability distribution $p_i(c)$ (where $c$ is a cluster index), the following metrics were calculated:
Entropy ($H(p) = -\sum_c p(c)\log p(c)$): Measures the uncertainty or diversity of the visited clusters. Higher entropy indicates that an agent visits a wider range of distinct state categories more uniformly.
Effective Support Size ($\mathrm{ESS} = \exp(H(p))$): Estimates the number of effectively visited clusters, providing an interpretable scale for diversity.
Coverage Ratio (Cov): The proportion of the K defined clusters that were visited at least once.
Gini–Simpson Index ($G = 1 - \sum_c p(c)^2$): Measures diversity, with values closer to 1 indicating higher diversity (i.e., probabilities are spread more evenly across many clusters).
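A compact sketch of this pipeline is given below. For brevity it fits PCA on the pooled observations of all agents rather than per agent, and it uses natural-log entropy with ESS = exp(H); these simplifications are assumptions made for illustration.

```python
# Hedged sketch of the dispersion pipeline: PCA to 50 dimensions, K-means on the
# pooled per-environment features, per-agent occupancy histograms, and the four
# dispersion metrics defined above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def dispersion_metrics(flat_obs_per_agent, n_components=50, k=100, seed=0):
    pooled = np.concatenate(flat_obs_per_agent)
    feats = PCA(n_components=n_components, random_state=seed).fit_transform(pooled)
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(feats)

    per_agent, start = [], 0
    for obs in flat_obs_per_agent:
        agent_labels = labels[start:start + len(obs)]
        start += len(obs)
        counts = np.bincount(agent_labels, minlength=k)
        p = counts / counts.sum()
        nz = p[p > 0]
        entropy = float(-(nz * np.log(nz)).sum())
        per_agent.append({
            "entropy": entropy,                           # H(p)
            "effective_support": float(np.exp(entropy)),  # ESS = exp(H)
            "coverage": float((counts > 0).mean()),       # Cov: fraction of K clusters visited
            "gini_simpson": float(1.0 - (p ** 2).sum()),  # G
        })
    return per_agent
```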
2.4.3. Pairwise Observation Distribution Comparison
To quantify the similarity between observation distributions of pairs of agents, the following metrics were computed. The cluster definitions and the individual agent occupancy histograms (e.g., $p_i$ and $p_j$) used for these pairwise comparisons are established and computed according to the procedures detailed in Section 2.4.2.
Pairwise Total Variation (TV) Distance: For each pair of agents $(i, j)$, the TV distance between their cluster occupancy histograms, $\mathrm{TV}(p_i, p_j) = \frac{1}{2}\sum_c \lvert p_i(c) - p_j(c) \rvert$, was calculated. This ranges from 0 (identical) to 1 (disjoint).
Pairwise Maximum Mean Discrepancy (MMD): MMD was computed between the PCA-reduced feature sets of each pair of agents using a Gaussian kernel [13]. MMD underlies a statistical test for determining whether two samples are drawn from the same distribution, and its magnitude provides a measure of their dissimilarity. It operates directly on the feature representations rather than relying on the explicit cluster histograms. Given feature sets $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ from two agents, the squared MMD is estimated as
$$\mathrm{MMD}^2(X, Y) = \frac{1}{m^2}\sum_{a, b} k(x_a, x_b) - \frac{2}{mn}\sum_{a, b} k(x_a, y_b) + \frac{1}{n^2}\sum_{a, b} k(y_a, y_b),$$
where $k$ is a positive-definite kernel. We choose a Gaussian kernel $k(x, x') = \exp\left(-\lVert x - x' \rVert^2 / (2\sigma^2)\right)$. The kernel bandwidth $\sigma$ is set to the median pairwise distance among all states in the combined dataset (median heuristic), which provides a reasonable scale. MMD is essentially a distance in a reproducing kernel Hilbert space and is 0 if and only if the two distributions are identical (for characteristic kernels).
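The sketch below computes both pairwise metrics, assuming the occupancy histograms and PCA-reduced features are produced as in Section 2.4.2; the biased (V-statistic) MMD estimator is used for simplicity.

```python
# Hedged sketch of the pairwise comparison metrics: TV distance between cluster
# occupancy histograms, and squared MMD between PCA feature sets with a Gaussian
# kernel and median-heuristic bandwidth (biased V-statistic estimator).
import numpy as np
from scipy.spatial.distance import cdist

def tv_distance(p_i: np.ndarray, p_j: np.ndarray) -> float:
    """TV(p_i, p_j) = 0.5 * sum_c |p_i(c) - p_j(c)|, in [0, 1]."""
    return 0.5 * float(np.abs(p_i - p_j).sum())

def mmd_squared(X: np.ndarray, Y: np.ndarray) -> float:
    """Squared MMD between feature sets X of shape (m, d) and Y of shape (n, d)."""
    Z = np.concatenate([X, Y])
    d = cdist(Z, Z)                                      # pairwise Euclidean distances
    sigma = np.median(d[np.triu_indices_from(d, k=1)])   # median heuristic
    gamma = 1.0 / (2.0 * sigma ** 2)

    k_xx = np.exp(-gamma * cdist(X, X, "sqeuclidean"))
    k_yy = np.exp(-gamma * cdist(Y, Y, "sqeuclidean"))
    k_xy = np.exp(-gamma * cdist(X, Y, "sqeuclidean"))
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())
```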
2.5. Analysis of Actor and Value Network Differences
To assess how the differences in explored observation spaces translate to differences in the learned actor and value networks, a cross-evaluation methodology was employed. For each pair of agents within an environment, the observations collected by agent i during its rollouts were used as input to the actor and value networks of agent j. The following metrics were calculated:
Absolute TD-Error: The absolute TD-error was computed over agent $i$’s trajectory data using agent $j$’s value network, measuring how well agent $j$’s value function generalizes to agent $i$’s experiences.
Root Mean Squared Error (RMSE) of Value Estimates: For each state $o_t$ in agent $i$’s trajectory, the value estimated by agent $i$’s critic, $V_i(o_t)$, was compared to the value predicted by agent $j$’s critic, $V_j(o_t)$. This directly measures how different the outputs of the two value functions are on the states that agent $i$ visits.
Kullback–Leibler Divergence (KLD) of Action Logits: For each state $o_t$ in agent $i$’s trajectory, the KLD was computed between the action probability distributions (derived from the logits) produced by agent $i$’s actor, $\pi_i(\cdot \mid o_t)$, and agent $j$’s actor, $\pi_j(\cdot \mid o_t)$. This quantifies the divergence between the two policies on the states that agent $i$ encounters.
These metrics were computed for all 5 × 5 agent pairings in each of the eight environments, resulting in matrices that reveal the extent of network generalization and similarity.
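The following sketch illustrates one way to compute these cross-evaluation metrics from agent $i$’s stored rollout records (see the collection sketch above) and agent $j$’s loaded model; the one-step TD target and the handling of episode boundaries are simplifying assumptions.

```python
# Hedged sketch of the cross-evaluation: agent i's stored observations are fed
# through agent j's networks and compared against agent i's own stored values and
# logits. Episode boundaries are ignored in the TD term for brevity.
import numpy as np
import torch
import torch.nn.functional as F

def cross_evaluate(records_i, model_j, gamma=0.99):
    obs = np.concatenate([r["obs"] for r in records_i])          # (T, H, W, C)
    rewards = np.array([r["reward"] for r in records_i])
    values_i = np.array([r["value"] for r in records_i])
    logits_i = torch.as_tensor(np.stack([r["logits"] for r in records_i]))

    obs_tensor, _ = model_j.policy.obs_to_tensor(obs)
    with torch.no_grad():
        values_j = model_j.policy.predict_values(obs_tensor).cpu().numpy().ravel()
        logits_j = model_j.policy.get_distribution(obs_tensor).distribution.logits.cpu()

    # Mean |TD error| of agent j's critic along agent i's trajectory.
    td = rewards[:-1] + gamma * values_j[1:] - values_j[:-1]
    abs_td_error = float(np.abs(td).mean())

    # RMSE between the two critics' estimates on agent i's states.
    value_rmse = float(np.sqrt(np.mean((values_i - values_j) ** 2)))

    # Mean KL(pi_i || pi_j) over agent i's states.
    log_p_i, log_p_j = F.log_softmax(logits_i, dim=-1), F.log_softmax(logits_j, dim=-1)
    policy_kld = float((log_p_i.exp() * (log_p_i - log_p_j)).sum(dim=-1).mean())

    return {"abs_td_error": abs_td_error, "value_rmse": value_rmse, "policy_kld": policy_kld}
```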
2.6. Trained Agents with Identically Initialized Networks
To further investigate the sources of performance variability, we conducted an additional experiment. For four of the eight environments—Boxing and Qbert (high performance variance), Alien (medium variance), and Freeway (low variance)—we trained five new agents starting from the same initial network weights. The only source of variation in these runs was the stochasticity inherent in the agent–environment interaction loop (e.g., action sampling, environment responses). This setup allows us to determine whether performance divergence is primarily driven by different weight initializations or whether it emerges from the stochasticity of the training process itself. These agents were then analyzed using the same data collection and analysis pipeline described above.
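One possible way to realize this setup in Stable-Baselines3 is sketched below: a reference model supplies the shared initial weights via get_parameters/set_parameters, while each run uses a different seed for the interaction loop. The environment ID and timestep budget are illustrative, and the authors’ exact procedure may differ.

```python
# Hedged sketch of the identical-initialization runs: one reference model provides
# the shared starting weights, and each run copies them before training with a
# different seed, so only the interaction stochasticity varies.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

GAME = "BoxingNoFrameskip-v4"  # illustrative environment ID

def make_env(seed: int):
    env = make_atari_env(GAME, n_envs=8, seed=seed)
    return VecFrameStack(env, n_stack=4)

# Reference model whose freshly initialized weights are shared by every run.
reference = PPO("CnnPolicy", make_env(seed=0), seed=0)
init_params = reference.get_parameters()

for run_seed in range(5):
    model = PPO("CnnPolicy", make_env(seed=run_seed), seed=run_seed)
    model.set_parameters(init_params)      # identical starting weights
    model.learn(total_timesteps=10_000_000)
    model.save(f"ppo_boxing_same_init_seed{run_seed}")
```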
4. Discussion
The results from our expanded study across eight Atari games provide compelling and robust evidence that divergence in explored observation spaces is a primary driver of performance variability in DRL. Our analysis indicates that the degree of this divergence is not random but is strongly tied to the intrinsic characteristics of the environment itself. The Freeway environment serves as a crucial control case. The consistent performance of Freeway agents is tightly linked to their consistent exploration patterns. All agents are guided through a similar, comprehensive set of experiences, which leads to the development of functionally equivalent policies and value functions. In stark contrast, high-variance environments like Boxing, Breakout, KFM, and Qbert reveal the consequences of divergent exploration. In these more complex settings, agents can and do find different niches within the state space. The t-SNE plots and quantitative metrics show that agents often specialize in distinct sub-regions, becoming experts in local areas while remaining naive about others. This specialization is path-dependent; early stochastic events steer an agent toward a particular trajectory, and the actor–critic learning loop reinforces this direction. An agent’s value network becomes more accurate for its frequented states, which in turn biases its policy to continue visiting them. This creates a feedback loop that amplifies initial small differences into significant chasms in both experience and capability.
4.1. Environment Characteristics and Their Impact on Divergence
The consistent patterns of divergence and stability observed across the eight Atari games can be primarily attributed to their intrinsic mechanics and objectives. By categorizing the environments, we can provide more targeted guidance for future DRL applications.
4.1.1. Low-Divergence Environments: Structured and Convergent
Environments that foster low divergence and stable performance, such as Freeway, Pong, and Enduro, often share characteristics like a clear, singular objective and a functionally narrow state space that guides agents toward a single dominant strategy.
In Freeway, the goal is simple and monotonic: move up. There are no complex sub-tasks or branching strategic paths. This structure naturally channels all agents toward the same optimal behavior, leading to highly overlapping observation spaces and consistent performance.
Pong is a purely reactive game where the optimal policy is to mirror the ball’s vertical movement. The simplicity and deterministic nature of this strategy mean there is little room for meaningful strategic variation to emerge.
Enduro, while more complex visually, is also driven by a primary objective of continuous forward progress and overtaking. The core gameplay loop does not contain significant strategic “bottlenecks” that could send agents down wildly different learning paths.
In these games, the path to high rewards is straightforward, causing all agent trajectories and policies to converge. For real-world problems with similar characteristics (e.g., simple process optimization), we can expect DRL training to be relatively stable and reproducible.
4.1.2. High-Divergence Environments: Strategic Bottlenecks and Divergent Policies
Conversely, environments prone to high divergences, such as Qbert, Boxing, KFM, and Breakout, often feature strategic bottlenecks, multiple viable strategies, or complex state dependencies that amplify the effects of stochasticity.
Boxing is a highly interactive, opponent-dependent game. An agent might learn an aggressive rushing strategy, while another learns a defensive, counter-punching style. These are two distinct but viable approaches that lead to entirely different patterns of interaction, creating separate clusters in the observation space and varied performance outcomes.
Qbert and KFM contain significant exploration challenges and “bottlenecks.” In KFM, an agent that fails to learn how to defeat a specific enemy type will be trapped in early-level states, while an agent that succeeds will unlock a vast new region of the observation space. This creates a sharp bifurcation in experience and performance. Similarly, the unique board structure in Qbert presents many locally optimal paths, causing agents to specialize in different sections of the pyramid.
Breakout is a classic example of an environment with a critical strategic bottleneck: learning to tunnel the ball behind the brick wall. Agents that discover this strategy enter a new, high-scoring phase of the game with a completely different set of observations. Agents that fail to discover it remain trapped in a low-scoring, repetitive gameplay loop, leading to extreme performance variance.
4.2. Potential Applications of Observation Space Divergence Analysis
The observation space divergence analysis presented in this study extends beyond understanding the problem of performance variability in DRL; it can be directly utilized to devise solutions for improving evaluation performance in practical applications. Specifically, the findings of this research can be applied to ensemble methods—a widely used technique for surpassing the performance of single agents [14]—from two perspectives.
First, for improving evaluation performance via ensemble methods. When ensembling N agents to boost performance, a novel approach can be explored beyond conventional techniques such as majority voting, simple averaging, or reward-based weighted sums. Specifically, one could use a weighted sum whose weights reflect the similarity between the representation of the current observation and the observation space representation of each agent in the ensemble, as sketched below. This method would prioritize the actions of agents whose experiences most closely match the current situation, potentially yielding more informed action selection and higher performance.
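To make the idea concrete, the sketch below implements one possible similarity-weighted combination, assuming each ensemble member contributes an action probability vector and a set of PCA-reduced reference features of its own observations; the nearest-neighbor distance and softmax weighting are illustrative choices, not the method proposed in this paper.

```python
# Hedged sketch of a similarity-weighted ensemble: each member's action
# distribution is weighted by how close the current observation's feature lies to
# that member's own reference features. All names and the weighting scheme are
# illustrative assumptions.
import numpy as np

def similarity_weighted_action(obs_feat, member_probs, member_feats, temperature=1.0):
    """
    obs_feat:     (D,) PCA-reduced feature of the current observation.
    member_probs: list of (n_actions,) action probability vectors, one per member.
    member_feats: list of (N_k, D) reference features collected by each member.
    """
    # Similarity of the current observation to each member's experience
    # (negative nearest-neighbor distance, turned into softmax weights).
    dists = np.array([np.linalg.norm(feats - obs_feat, axis=1).min()
                      for feats in member_feats])
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()

    # Weighted mixture of member policies; act greedily on the mixture.
    mixed = sum(w * p for w, p in zip(weights, member_probs))
    return int(np.argmax(mixed)), mixed
```

In practice, the per-member reference features could be subsampled rollout observations or the cluster centroids from Section 2.4.2, which would keep the nearest-neighbor lookup inexpensive.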
Second, for analyzing why ensemble effectiveness varies across environments. Our findings provide insight into predicting whether an ensemble will be effective in a given environment. For instance, in an environment like Freeway, where all agents explore a highly similar observation space, we can predict that the benefit of ensembling will be minimal because the individual agent policies are already convergent. In contrast, in an environment like Boxing, where agents learn distinct strategies (e.g., aggressive vs. defensive) resulting in clearly separated observation spaces, we can anticipate a significant performance boost from ensembling, as combining their different specializations would be highly complementary. This provides a practical guideline for determining where to focus limited computational resources when constructing ensembles.
4.3. General Implications and Study Limitations
The most significant finding is from the identical initialization experiment. The persistence of high-performance variance and observation space divergence in games like Boxing and Qbert, even when starting from the same network weights, demonstrates that the problem is not merely one of poor initialization. The stochastic nature of the RL interaction process itself is sufficient to drive agents onto divergent paths. This suggests that achieving reproducible performance in complex environments requires more than just fixing random seeds; it may require fundamentally new approaches to guide exploration or to make learning more robust to variations in experience.
Our comprehensive set of metrics—from visual t-SNE to quantitative measures of dispersion (entropy, TV distance, MMD) and network function (TD-error, RMSE, KLD)—presents a unified narrative. When agents see different things (high TV/MMD), their understanding of the world diverges (high-value RMSE), and their resulting behaviors diverge (high policy KLD). This work highlights that reporting only the mean and standard deviation of final scores can obscure the rich and varied behaviors that contribute to those scores. A deeper analysis of the underlying state visitation distributions is crucial for a comprehensive understanding of DRL algorithm behavior.
We acknowledge that the empirical results of our study focus on the PPO algorithm within the Atari domain. This deliberate choice of scope enabled a deep and multifaceted analysis of the divergence phenomenon. However, it is reasonable to question how these findings generalize to other algorithms and environments. We hypothesize that the core mechanism—stochastic agent–environment interactions leading to divergent experiential trajectories and specialized policies—is a fundamental aspect of the reinforcement learning process and is not unique to PPO. For instance, off-policy algorithms like DQN or SAC, which utilize a replay buffer, might exhibit different dynamics. A replay buffer could mitigate divergence by averaging experiences across different trajectories. Conversely, it could also exacerbate the issue if certain types of “lucky” trajectories become overrepresented early in training. Similarly, extending this analysis to domains with continuous action spaces, such as robotics tasks in MuJoCo, or environments with sparser rewards, represents an important next step. Exploring how the structure of the state space and the nature of the reward function influence the degree of observation divergence is a compelling avenue for future research.
5. Conclusions and Future Work
This paper investigated the role of observation space divergence as a contributing factor to performance variability in deep reinforcement learning. Through a series of experiments on PPO agents across eight Atari environments, we demonstrated a strong link between the similarity of states explored by different agents, the functional similarity of their learned networks, and the consistency of their final performance. The key findings include the following:
We expanded the analysis to eight Atari games, confirming that environments with low variance in performance (e.g., Freeway, Enduro) exhibit highly similar state space exploration across agents. In contrast, high-variance environments (e.g., Boxing, KFM, Qbert) show significant divergence.
Cross-evaluation of actor and value networks confirmed that agents with divergent observation distributions learn functionally different networks that do not generalize well to each other’s experiences.
A new experiment with identically initialized networks revealed that performance variability and observation space divergence persist even without different initial weights, highlighting that stochasticity in the agent–environment interaction is a primary source of this divergence.
Our analysis was shown to be robust to the choice of the number of clusters (K) used for state discretization, and the computational cost of the analysis pipeline was found to be modest.
This work highlights the importance of examining not only the performance levels achieved but also how agents attain them. Building on this foundational analysis in PPO and Atari, future research should extend this investigation in several key directions:
Temporal and Causal Analysis: Study the evolution of observation space differences during the training process. Analyzing the temporal relationship between when divergence occurs and when agent performance begins to vary would help establish a more direct causal link and better explain the dynamic nature of performance fluctuations.
Systematic Component Analysis: Conduct a comparative analysis of how different DRL components influence exploration behavior and performance variability. This includes systematically varying hyperparameters, network architectures, and regularization techniques to understand their impact on the observation space distribution of trained agents.
Broadening Algorithmic and Environmental Scope: Apply this analysis pipeline to a broader range of algorithms, including other on-policy variants and prominent off-policy algorithms (e.g., SAC, DQN), to determine how mechanisms like replay buffers affect divergence. Furthermore, expanding the study to other domains, such as continuous control benchmarks or environments with sparse rewards, is crucial to test the generality of our findings.
Developing Mitigation Strategies: Based on the insights from a component-level analysis, develop and benchmark novel exploration or regularization methods specifically designed to reduce undesirable trajectory divergence, promote more consistent learning outcomes, and directly mitigate DRL performance variability.
Theoretical Foundations: Work towards proposing a theoretical framework that can interpret these diverse experimental results, providing a more formal understanding of the relationship between stochastic exploration, state space coverage, and performance instability in DRL.