Article

Effectiveness of Experience-Sharing Group Learning in Deep Reinforcement Learning †

Graduate School of Sustainable System Sciences, Osaka Metropolitan University, Sakai 599-8531, Japan
*
Author to whom correspondence should be addressed.
† This paper is an extended version of our paper: Muroya, K.; Ikeda, M.; Notsu, A. Effectiveness of Experience-Sharing Group Learning in Deep Reinforcement Learning. In Proceedings of the 26th International Symposium on Advanced Intelligent Systems (ISIS2025), Cheongju, Republic of Korea, 6–9 November 2025.
Appl. Sci. 2026, 16(7), 3250; https://doi.org/10.3390/app16073250
Submission received: 27 February 2026 / Revised: 24 March 2026 / Accepted: 26 March 2026 / Published: 27 March 2026
(This article belongs to the Special Issue Advances in Intelligent Systems—2nd edition)

Abstract

Deep reinforcement learning faces a critical trade-off between computational cost and performance. This study proposes an experience-sharing group-learning framework in which multiple agents with different network sizes collaboratively learn a single task through a shared experience replay memory. Unlike conventional multi-agent approaches that assume homogeneous agents, our method enables agents with different computational capabilities to share experiences, allowing low-performance agents to benefit from high-performance agents’ quality experiences. The proposed method was evaluated in CartPole and Super Mario Bros environments. In CartPole two-agent experiments, the low-performance agent (Agent16, 404 parameters) achieved approximately 2× performance improvement (93.3 to 184.4 steps) through group learning, while the high-performance agent (Agent64, 4676 parameters) maintained comparable performance, though several group conditions fell below the solo 200-step result. Three-agent experiments further improved Agent16 to 196.5 steps with reduced variance. Under step-matched comparisons in Super Mario Bros, the low-capacity agent benefits from experience sharing beyond solo baselines that consume roughly twice as many steps, while the high-capacity agent remains broadly comparable between group and solo. Claims are limited to step-based normalisation. Q-value analysis revealed accelerated early learning, with Q-values increasing by +10.1 (Mario) and +7.7 (Luigi) at 1 million steps. These results demonstrate that experience-sharing group learning can improve learning efficiency for resource-constrained agents under a fixed environment-step budget.

1. Introduction

The advancement of deep reinforcement learning has created a critical challenge involving the trade-off between computational cost and performance. Building high-performance deep reinforcement learning systems requires large-scale neural network architectures and vast computational resources, which poses a significant barrier to implementation in resource-constrained environments [1]. AlphaGo [2], for example, defeated the world champion in Go but required thousands of CPUs and hundreds of GPUs over several weeks of training.
In human learning, it is well known that individuals with different abilities can effectively solve problems that would be difficult to solve alone by cooperating. Research on collaborative learning in education has demonstrated that learners with different abilities can improve individual learning outcomes by sharing knowledge with each other [3]. This principle of collaborative learning—that gathering individuals with different perspectives and abilities produces wisdom beyond individual capabilities—is considered applicable to machine learning systems.
Multi-agent reinforcement learning offers a framework in which multiple agents interact and learn within the same environment [4]. Tan [5] laid the foundation for multi-agent reinforcement learning, systematically analysing the impact of inter-agent cooperation on learning efficiency. Three forms of cooperation were examined—sharing sensory information, sharing episodes, and sharing learned policies—and episode (experience) sharing was found to be particularly effective. This finding provides the theoretical basis for the experience-sharing group learning proposed in this study. Subsequent work has explored various forms of inter-agent cooperation, including centralised training with decentralised execution (MADDPG [6]), opponent-aware learning (LOLA [7]), cooperative multi-agent control [8], and competitive and cooperative environments using Pong [9]. However, most existing multi-agent reinforcement learning research assumes homogeneous agents with identical network structures, and cooperation between agents with different computational capabilities has not been sufficiently examined.
Algorithm-level approaches to improving computational efficiency include A3C [10], which uses multiple workers interacting with the environment in parallel while asynchronously computing gradients, and PPO [11], which constrains update magnitude through clipping and has become one of the most widely adopted algorithms [12]. Model-level approaches include network pruning [13], which reduces parameters by over 90% while maintaining performance, MobileNets [14], which reduce computation to approximately 1/8–1/9, SqueezeNet [15], which achieves AlexNet-level accuracy with 50× fewer parameters, and simple random search with linear policies [16], which can match complex deep RL methods with over 15× computational efficiency improvement. While these approaches have demonstrated significant improvements, they do not exploit the potential benefits of heterogeneous agent cooperation.
Based on these insights, this study proposes experience-sharing group learning, a novel framework in which multiple agents with different network sizes collaboratively learn a single task through a shared experience replay memory. Unlike conventional self-play or competitive multi-agent learning, our method allows agents with different computational capabilities to share the same experience replay memory, enabling low-performance agents to utilise the experiences of high-performance agents. We evaluate the proposed method in CartPole [17] and Super Mario Bros environments through two-agent experiments, three-agent experiments, and complex visual task experiments, demonstrating that group learning significantly improves low-performance agent capabilities while maintaining high-performance agent performance.

2. Related Work

Scope and recent progress (2021–2025): We survey recent advances in (a) heterogeneous multi-agent RL and knowledge/experience sharing, (b) replay-buffer design and sharing strategies across agents, (c) compute-efficient and resource-constrained deep RL (including low-resource training and on-device learning), and (d) statistical reporting practices for empirical RL. In contrast to parameter-sharing approaches that typically assume homogeneous architectures, our framework enables structure-agnostic experience sharing among heterogeneous agents and explicitly targets fixed environment-step budgets appropriate for compute-limited settings. This compute-constrained perspective distinguishes our goals from many state-of-the-art methods that optimise peak performance under abundant hardware.
Heterogeneous multi-agent RL and experience sharing: Recent work on multi-agent reinforcement learning has increasingly explored heterogeneous agent configurations. Yang et al. [18] proposed Z-Score experience replay to normalise TD-error priorities across agents with different value scales in off-policy deep RL. Jamal et al. [19] demonstrated a hybrid multi-agent RL approach for spectrum sharing in vehicular networks, where agents with different communication capabilities cooperate through shared reward structures. Yang et al. [20] developed a coordination optimisation framework combining reward redistribution and experience reutilisation for cooperative multi-agent settings. Kim et al. [21] applied reinforcement learning to dynamic trajectory optimisation in self-driving vehicles, addressing imbalanced sub-goals. Yang et al. [22] investigated adversarial tactics against cooperative MARL systems, highlighting robustness concerns. In the broader context of adaptive learning systems, adaptive vehicle detection methods using self-learning approaches [23] demonstrate how heterogeneous adaptive systems can improve performance in resource-constrained environments. Our work differs from these approaches by focusing specifically on experience sharing between agents with heterogeneous network architectures under fixed compute budgets, without requiring parameter sharing or centralised training.
Value decomposition and cooperative MARL: A complementary line of research addresses credit assignment in cooperative multi-agent settings through value decomposition. VDN [24] decomposes team reward into per-agent value functions that are summed to approximate the joint value, while QMIX [25] extends this idea with a monotonic mixing network that preserves the argmax consistency needed for decentralised execution. Yu et al. [26] demonstrated that even simple PPO-based approaches can be surprisingly effective in cooperative multi-agent games when properly tuned, raising questions about the necessity of complex coordination architectures. Most relevant to our work, Christianos et al. [27] proposed Shared Experience Actor–Critic (SEAC), which allows agents to learn from each other’s trajectories in a multi-agent actor–critic framework; however, SEAC assumes homogeneous policy-gradient agents, whereas our method targets heterogeneous DQN agents that differ in network capacity. Finally, Henderson et al. [28] highlighted the importance of rigorous statistical reporting in deep RL, including the need for confidence intervals and multiple seeds, which informed our statistical methodology for the CartPole experiments.
Replay-buffer design and sharing strategies: Experience replay has evolved significantly since its introduction in DQN [1]. Double DQN [29] addresses the overestimation bias inherent in the max operator by decoupling action selection from value estimation. Prioritised experience replay [30] samples transitions based on TD-error magnitude. Recent extensions include distributed replay architectures for multi-agent settings, hindsight experience replay for sparse-reward tasks, and curiosity-driven replay prioritisation. However, most replay-sharing work assumes homogeneous agent architectures. Our method extends replay sharing to heterogeneous agents, where the shared buffer serves as a structure-agnostic knowledge transfer mechanism.
Compute-efficient and resource-constrained deep RL: The computational demands of deep RL have motivated various efficiency-oriented approaches. Network compression techniques such as pruning [13], knowledge distillation, and lightweight architectures (MobileNets [14] and SqueezeNet [15]) reduce individual model size. Algorithmic improvements including PPO [11], sample-efficient off-policy methods, and model-based approaches reduce the number of environment interactions needed. Our approach is complementary: rather than compressing a single agent, we improve the learning efficiency of low-capacity agents through heterogeneous experience sharing, without modifying their architectures.
What is new in the journal version: Relative to our ISIS 2025 conference paper, this journal manuscript adds: (i) a new three-agent CartPole experiment enabling analysis of scaling effects in heterogeneous experience sharing, and (ii) a new Super Mario Bros experiment (image-based DRL) with step-matched comparisons and Q-value evolution. We also expand the 2021–2025 literature review, strengthen statistical reporting in CartPole (confidence intervals and effect sizes), and add detailed reproducibility information; the conclusions are adjusted to reflect only journal-version evidence.

3. Proposed Method

In response to the computational cost problem in deep reinforcement learning and the unexplored nature of cooperation between agents with different computational capabilities, this study proposes a method in which agents with different network sizes learn collaboratively while sharing experiences. This method aims to expand the scope of reinforcement learning in resource-limited environments and contribute to developing efficient AI systems.

3.1. Framework Overview

The group learning proposed in this study is a novel framework for multiple agents with different network sizes to share experiences through a shared experience replay memory and collaboratively learn a single task. Experiences obtained by each agent (tuples of state, action, reward, and next state) are accumulated in the shared memory, and all agents sample from this shared memory for learning. Figure 1 illustrates the overall architecture of the proposed framework.
The core of the proposed method lies in mutually leveraging the strengths of large networks’ high representational capacity and small networks’ computational efficiency through experience sharing. High-performance agents learn complex state–action relationships and accumulate high-quality experiences. Low-performance agents may achieve performance levels difficult to reach alone by learning from these experiences.
At the beginning of each episode, one agent is selected by the agent selection mechanism to interact with the environment using its ε-greedy policy. The experiences are stored in the shared memory. After storing, all agents—not just the interacting one—sample mini-batches from the shared memory and independently update their Q-networks. Thus, only one agent interacts with the environment at each step, but all agents perform Q-network updates every step.
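The interaction-and-update cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `SharedReplayMemory`, the helper `run_episode`, the environment interface (`reset`, `step`) and the agent interface (`act`, `update`) are all hypothetical names introduced for this example. The capacity of 10,000 and the mini-batch size of 32 are the values reported in the common settings.

```python
import random
from collections import deque


class SharedReplayMemory:
    """Fixed-capacity buffer that every agent writes to and samples from."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling; never asks for more items than are stored.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def run_episode(env, agents, actor_index, memory, batch_size=32):
    """One episode: a single agent acts, every agent learns from the shared memory.

    `env` (reset/step) and the agents' `act`/`update` methods are hypothetical
    interfaces used only for this sketch.
    """
    actor = agents[actor_index]
    state = env.reset()
    done, steps = False, 0
    while not done:
        action = actor.act(state)                  # epsilon-greedy choice of the acting agent
        next_state, reward, done = env.step(action)
        memory.push(state, action, reward, next_state)
        for agent in agents:                       # all agents update, not just the actor
            agent.update(memory.sample(batch_size))
        state = next_state
        steps += 1
    return steps
```

Because the buffer stores plain transition tuples, nothing in it depends on the size of the network that produced them, which is what lets agents of different capacity read from the same memory.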

3.2. Collaborative Learning Process

Collaborative learning is realised through independent Q-network updates and shared experience replay memory. Each agent maintains its own Q-network and independently updates based on data from the shared memory. This enables agents with different computational capabilities to learn at different abstraction levels from the same experiences.
This method adopts experience sharing rather than parameter sharing. Parameter sharing requires identical network structures and cannot be applied between agents with different network sizes [8]. Experience sharing is indirect knowledge transfer via the replay memory, independent of network structure, applicable between agents of any size. The Q-network update uses the standard DQN loss function [1]:
$$L = \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \left[ r + \gamma \max_{a'} Q_{\mathrm{target}}(s', a') - Q_{\mathrm{online}}(s, a) \right]^{2}$$
where B is the mini-batch and γ is the discount factor.
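As a worked illustration of this loss, the sketch below evaluates the mean squared temporal-difference error over a mini-batch of (state, action, reward, next state) tuples. The callables `q_online` and `q_target` are hypothetical stand-ins for the two networks, each returning one Q-value per action. Terminal-state handling is omitted because the formula above does not include it, and the default γ = 0.9 is the value reported for CartPole.

```python
def dqn_loss(batch, q_online, q_target, gamma=0.9):
    """Mean squared TD error over a mini-batch of (s, a, r, s_next) tuples.

    q_online(s) and q_target(s) are hypothetical callables that return a list
    of Q-values, one per action.
    """
    total = 0.0
    for state, action, reward, next_state in batch:
        td_target = reward + gamma * max(q_target(next_state))  # bootstrap from the target network
        td_error = td_target - q_online(state)[action]          # compare with the online estimate
        total += td_error ** 2
    return total / len(batch)
```

In group learning each agent would evaluate this loss with its own pair of networks, while the batch itself comes from the shared memory.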

3.3. Agent Selection Strategies

Two agent selection strategies were implemented. Performance-based selection (PB) selects the agent with the highest recent average step count with probability 1 − e, and a random agent with probability e (e = 0.2). This strategy prioritises the high-performance agent, allowing it to generate high-quality experiences more frequently. Random selection (rand) selects agents completely randomly, giving equal learning opportunities and promoting diverse experience collection.
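The two strategies can be written as a single selection function. The sketch below is an assumption-laden illustration: the function name `select_agent`, its arguments, and the way the recent average step counts are supplied are hypothetical and not taken from the authors' code.

```python
import random


def select_agent(recent_steps, strategy="PB", explore_prob=0.2, rng=random):
    """Return the index of the agent that will act in the next episode.

    recent_steps: each agent's recent average step count.
    strategy: "PB" (performance-based) or "rand" (uniformly random).
    """
    n_agents = len(recent_steps)
    if strategy == "rand" or rng.random() < explore_prob:
        return rng.randrange(n_agents)                           # any agent, uniformly
    return max(range(n_agents), key=lambda i: recent_steps[i])  # current best agent
```

With two agents and e = 0.2, the random branch can also pick the stronger agent, so under this reading the stronger agent acts in roughly 90% of episodes as long as its recent average stays ahead.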

4. Experimental Setup

To comprehensively verify the proposed method, three types of experiments were conducted: two-agent group learning in CartPole, three-agent group learning in CartPole, and group learning in Super Mario Bros. These experiments examine the effects of agent composition, selection strategy, exploration rate, increased agent count, and task complexity on the proposed method. Detailed experimental settings are provided in Appendix A.

4.1. Common Settings

All experiments used DQN-based learning with the following common hyperparameters: mini-batch size 32, replay memory size 10,000, learning rate 0.0001, and target network updates at the end of each episode. Action selection used ε-greedy with dynamic decay:
ε_current = 0.001 + 0.9/(1 + episode) × α
where α = 1 (standard condition) or α = 0.5 (half condition). The half condition was introduced to examine whether group-learning effectiveness depends on ε values. Each condition was run for 10 trials.
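The decay schedule is small enough to write out directly. The sketch below assumes a zero-based episode index and reads the formula with ordinary operator precedence, so that α scales only the decaying term and the floor of 0.001 is unchanged; both points are assumptions, since the text does not spell them out.

```python
def epsilon(episode, alpha=1.0):
    """Exploration rate for a zero-based episode index.

    alpha=1.0 is the standard condition, alpha=0.5 the half condition.
    """
    return 0.001 + 0.9 / (1 + episode) * alpha
```

Under this reading the first episode starts at ε ≈ 0.901 in the standard condition and ε ≈ 0.451 in the half condition, and both schedules approach the floor of 0.001 as training proceeds.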

4.2. CartPole Environment

CartPole-v1 from OpenAI Gym was used as the first evaluation environment (Figure 2). The state space consists of four continuous variables: cart position (−4.8, 4.8), cart velocity (unbounded), pole angle (−24°, 24°), and pole angular velocity (unbounded). The action space consists of two discrete actions: push left or push right. Episodes terminate when the pole exceeds ±15° from vertical or the cart leaves bounds. A reward of +1 is given for lasting 195 or more steps and −1 otherwise. Maximum steps per episode: 200; discount factor γ = 0.9.
To reduce total experimental runtime across all conditions, we use a modified CartPole variant (±15° termination, 200-step cap, and simplified reward) instead of the full default Gym-v1 episode length. All experiments—solo, two-agent, and three-agent—use this same customised environment, ensuring fair internal comparison.
Three agent types with different network sizes were used, as shown in Table 1. All agents use fully connected networks with two hidden layers; the number of hidden units determines the agent’s representational capacity and parameter count. Agent64 has the highest capacity with approximately 4676 parameters, while Agent16 has the lowest with approximately 404 parameters—an 11.6× difference.
The two-agent experiment comprised 14 conditions across two categories: solo learning (6 conditions, including standard and half ε for each agent type) and group learning (8 conditions), as shown in Table 2. Two agent compositions were tested (A64&A32 and A64&A16), each combined with two selection strategies (PB vs. random) and two exploration rates (standard ε vs. ε/2). This design enables systematic comparison of the effects of agent composition, selection strategy, and exploration rate on group learning.

4.3. CartPole Three-Agent Experiment

Three-agent experiments used Agent64, Agent32, and Agent16 in group learning to examine the effect of increased agent count. Four conditions were tested, as shown in Table 3.
Two compositions were compared: A64&A32&A16 (one of each type) and A64&A16&A16 (two low-performance agents). The latter composition tests whether duplicating the weakest agent affects group learning dynamics.

4.4. Super Mario Bros Environment

Building on the CartPole results, experiments were conducted in the more complex Super Mario Bros environment using the gym-super-mario-bros package (fork by tyfkda: https://github.com/tyfkda/gym-super-mario-bros/, accessed on 3 December 2025; Figure 3), targeting Stage 1-1. The action space was limited to “walk right” and “jump right.” State representation used frame skipping (every 4 frames), greyscale conversion, 84 × 84 pixel resizing, normalisation (0–1), and 4-frame stacking, resulting in a (4, 84, 84) tensor—approximately 28,000 dimensions compared to CartPole’s 4 dimensions.
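The state construction described above can be illustrated with a simplified NumPy sketch. It is not the authors' pipeline: greyscale conversion is done here by plain channel averaging and resizing by nearest-neighbour sampling, where a real implementation would more likely use an image library. The names `preprocess` and `FrameStack` are hypothetical, the 240 × 256 input size in the usage note is an assumption about the emulator output, and frame skipping is left out.

```python
from collections import deque

import numpy as np


def preprocess(frame_rgb, size=84):
    """RGB uint8 frame (H, W, 3) -> greyscale float32 (size, size) in [0, 1]."""
    grey = frame_rgb.astype(np.float32).mean(axis=2)   # simple channel average
    rows = np.arange(size) * grey.shape[0] // size      # nearest-neighbour row picks
    cols = np.arange(size) * grey.shape[1] // size      # nearest-neighbour column picks
    return grey[rows][:, cols] / 255.0


class FrameStack:
    """Keeps the most recent k processed frames as one (k, 84, 84) state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame_rgb):
        first = preprocess(frame_rgb)
        for _ in range(self.frames.maxlen):             # fill the stack with the first frame
            self.frames.append(first)
        return self.state()

    def push(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))       # the oldest frame drops out
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)
```

Calling `FrameStack().reset(frame)` on a 240 × 256 × 3 frame returns an array of shape (4, 84, 84), that is 4 × 84 × 84 = 28,224 values, which is the "approximately 28,000 dimensions" quoted above.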
The high-performance agent was named Mario and the low-performance agent Luigi. Their configurations are shown in Table 4.
Group learning used performance-based selection: the agent with the higher average step count over the last 20 episodes was selected with 80% probability, and a random agent with 20% probability. Training was conducted for 35,000 episodes. Solo learning used individual replay memories with identical hyperparameters.
Compute normalisation and claim scope (Mario): We adopt environment steps as the primary compute budget in a resource-limited setting. In two-agent group learning, both agents update their Q-networks at every step, implying approximately twice as many optimiser updates as solo per environment step. We therefore limit claims to step-matched comparisons and explicitly refrain from update-matched or wall-clock-matched claims, which would require substantially more compute.

5. Results

5.1. CartPole Two-Agent Experiment

Solo learning results revealed a clear performance hierarchy by network size. Under standard ε, Agent64 achieved a stable high performance of 200 ± 0.0 steps (task solved); Agent32 also reached 200 ± 0.0 steps, but Agent16 struggled at only 93.3 ± 77.69 steps. Under halved ε, performance decreased for the two larger agents: Agent64 achieved 190.5 ± 24.58 and Agent32 181.7 ± 40.55 steps, whereas Agent16 rose slightly to 116.7 ± 81.47 steps. Even with this slight improvement, Agent16 still failed to reliably solve the task in solo learning (Figure 4).
Table 5 shows the complete results for all two-agent conditions. In group learning with A64&A16, Agent16 showed substantial improvement across all conditions: from 93.3 ± 77.69 steps in solo learning to 184.4 ± 36.38 steps in A64&A16_rand-ε—approximately 2× improvement. All four A64&A16 group conditions improved Agent16 performance (167.5–184.4 steps), with random selection yielding higher values than PB selection (184.4 vs. 167.5 under standard ε). Agent64 maintained high performance (179.6–193.7 steps) across all group conditions with no material degradation.
To quantify the reliability of these differences, we report 95% confidence intervals and Cohen’s d effect sizes against the corresponding solo baselines. For Agent16 in A64&A16_rand-ε, the improvement over solo was large (d = 1.50, 95% CI [158.4, 210.4]), confirming that the ~2× gain is robust. Other A64&A16 group conditions showed similarly large effects for Agent16: A64&A16_PB-ε (d = 1.07, 95% CI [124.6, 210.4]), A64&A16_rand-ε/2 (d = 0.86, 95% CI [144.5, 198.1]), and A64&A16_PB-ε/2 (d = 0.87, 95% CI [145.6, 197.4]). For Agent64 across group conditions, effect sizes against solo were small-to-negligible (d = −0.37 to +0.16), indicating no material degradation. In A64&A32 conditions, Agent32 effects ranged from negligible to medium (d = −0.45 to +0.62), with A64&A32_rand-ε/2 notably achieving 199.5 ± 1.58 (d = 0.62).
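The headline figures for Agent16 can be reproduced from the reported means and standard deviations. The text does not state which formulas were used, so the sketch below rests on assumptions: Cohen's d with a pooled standard deviation, a t-based confidence interval for the group mean, and 10 trials per condition (hence 9 degrees of freedom and a critical value of about 2.262). The function names are hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d of group b relative to group a, using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var)


def mean_ci95(mean, sd, n, t_crit=2.262):
    """Two-sided 95% CI of a mean; t_crit=2.262 is the t quantile for n - 1 = 9."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Agent16: solo (93.3 +/- 77.69) vs. A64&A16_rand-eps (184.4 +/- 36.38), 10 trials each
d = cohens_d(93.3, 77.69, 10, 184.4, 36.38, 10)
low, high = mean_ci95(184.4, 36.38, 10)
print(f"d = {d:.2f}, 95% CI [{low:.1f}, {high:.1f}]")  # d = 1.50, 95% CI [158.4, 210.4]
```

Under these assumptions the output matches the values quoted for A64&A16_rand-ε, which suggests, but does not prove, that the reported statistics were computed this way.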
In group learning with A64&A32, both agents maintained near-optimal performance. Notably, A64&A32_rand-ε/2 achieved an Agent32 performance of 199.5 ± 1.58—effectively solving the task with minimal variance. Agent64 showed consistent performance across all A64&A32 conditions (179.6–191.0 steps).
A notable outcome was that group learning performance exceeded the simple average of individual performances. At episode 50, Agent64 solo achieved approximately 180 steps and Agent16 solo approximately 40 steps, averaging approximately 110 steps. In contrast, the A64&A16 group achieved approximately 130–140 steps at the same point, substantially exceeding this average. This pattern was consistent across ε conditions, indicating a collaborative learning effect.

Q-Value Distribution Analysis

Q-value distribution analysis revealed that group learning expanded Agent16’s state-space coverage from approximately 60% to 85%. The Q-value heatmap showed that group learning produced more continuous Q-value distributions with clearer action boundaries near 0° pole angle, compared to the predominantly negative Q-values in solo learning (Figure 5).
The Q-value heatmaps were generated as follows. The CartPole state space was discretised into a 20 × 20 grid over the pole-angle and angular-velocity dimensions (the two most informative variables), with cart position and velocity held at their mean values. For each grid cell, the trained Q-network evaluated all available actions, and the maximum Q-value was recorded. State-space coverage was defined as the proportion of grid cells in which the agent’s maximum Q-value exceeded a threshold of zero. All reported heatmaps and coverage percentages are averaged over 10 trials for CartPole.
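The heatmap and coverage procedure described above can be sketched in a few lines. The names `q_heatmap`, `coverage` and the callable `q_function` are hypothetical, the grid uses cell mid-points, and the ranges passed for pole angle and angular velocity are left to the caller because the text does not state them. Averaging over the 10 trials is not shown.

```python
def q_heatmap(q_function, angle_range, velocity_range, mean_position=0.0,
              mean_velocity=0.0, bins=20):
    """Return a bins x bins grid of max-over-actions Q-values.

    q_function(state) is a hypothetical callable returning one Q-value per action
    for state = (cart position, cart velocity, pole angle, pole angular velocity).
    """
    def centres(low, high):
        width = (high - low) / bins
        return [low + (i + 0.5) * width for i in range(bins)]  # cell mid-points

    grid = []
    for angle in centres(*angle_range):
        row = []
        for ang_vel in centres(*velocity_range):
            state = (mean_position, mean_velocity, angle, ang_vel)
            row.append(max(q_function(state)))                 # best action's Q-value
        grid.append(row)
    return grid


def coverage(grid, threshold=0.0):
    """Fraction of grid cells whose maximum Q-value exceeds the threshold."""
    cells = [q for row in grid for q in row]
    return sum(q > threshold for q in cells) / len(cells)
```

With the default 20 bins the grid has 400 cells, so a coverage of 85% corresponds to 340 cells with a positive maximum Q-value.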
The performance improvement is attributed to efficient utilisation of shared experience replay memory and complementary exploration characteristics between agents at different performance levels. The diverse experiences generated by agents with different representational capacities provided each agent access to learning data that would be unavailable in solo learning. High-performance agents explore regions of the state space that low-performance agents cannot reach independently, and these experiences, stored in the shared memory, enable low-performance agents to learn from otherwise inaccessible states.

5.2. CartPole Three-Agent Experiment

Three-agent group learning confirmed similar trends to the two-agent experiments, demonstrating that increased agent count does not impair group-learning effectiveness. Table 6 shows the results for all three-agent conditions.
Agent16 showed notable improvement across all conditions, with the highest performance in A64&A16&A16_rand (196.5 ± 11.06 steps)—approximately 2.1× the solo performance of 93.3 steps. Comparing agent compositions, A64&A16&A16 outperformed A64&A32&A16 for Agent16 (196.5 vs. 188.3 in random selection), likely because Agent16 received more opportunities to interact with the environment when Agent32 was absent (Figure 6).
Effect-size analysis confirms these improvements are large. Agent16 in A64&A16&A16_rand showed the strongest effect (d = 1.86, 95% CI [188.6, 204.4]), followed by A64&A16&A16_PB (d = 1.65, 95% CI [169.0, 212.0]) and A64&A32&A16_rand (d = 1.62, 95% CI [167.8, 208.8]). Agent64 effects were medium-to-large and negative in sign (d = −0.72 to −0.90), reflecting slight decreases from the solo maximum of 200.0 that remain within a practical range (191.5–195.8 steps).
Regarding selection strategies, random selection outperformed performance-based selection for Agent16 (196.5 ± 11.06 vs. 190.5 ± 30.04 in A64&A16&A16), as performance-based selection prioritises Agent64 and reduces Agent16’s execution opportunities. Random selection provides more balanced interaction time, allowing Agent16 to benefit from both learning from shared experiences and generating its own diverse experiences.
Compared to two-agent learning, Agent16’s performance improved from 184.4 ± 36.38 (A64&A16_rand-ε) to 196.5 ± 11.06 (A64&A16&A16_rand), with substantially reduced variance. This demonstrates that increasing the agent count did not impair but rather enhanced collaborative learning effectiveness and stability under certain conditions, suggesting that increased experience diversity in the shared memory benefits low-performance agents.

5.3. Super Mario Bros Experiment

In the Super Mario Bros environment, learning experiments were conducted for 35,000 episodes, and solo learning and group learning were compared. Table 7 shows the results at the 35,000-episode point.
Single-trial proof-of-concept and compute accounting: The Super Mario Bros experiment was conducted once due to the extremely high computational cost of running image-based DQN for 35,000 episodes. We therefore present this experiment as a single-trial proof-of-concept, intended to illustrate that the experience-sharing benefits observed in CartPole extend to a substantially more complex visual domain. To strengthen the interpretability of this single-trial setting without additional runs, we emphasise two descriptive analyses: (i) the last-200-episode mean rewards (with variability bands), and (ii) the Q-value evolution at matched training steps. Because our compute-fairness criterion is based on a fixed environment-step budget, we additionally report the total environment steps accumulated by each agent. All comparisons between solo and group learning are therefore drawn at matched total-step checkpoints, ensuring that differences reflect the effect of experience sharing rather than unequal compute.
At 35,000 episodes, each agent in group learning accumulated approximately 3.37 million total steps because two agents shared environmental interactions (Figure 7). In contrast, solo learning consumed approximately 7.08 million steps for Mario and 6.35 million steps for Luigi—roughly twice the step count of group learning. Since a simple comparison at the same episode count involves a large discrepancy in the amount of experience each agent has directly obtained from the environment, it cannot be considered a fair comparison. Therefore, to properly evaluate the effectiveness of group learning, we compared mean rewards at equivalent step counts (i.e., equivalent computational cost).
In group learning, because each agent can also utilise the other agent’s experiences through the shared experience replay memory, the effective amount of learning data is equivalent to or greater than solo learning even though each agent’s direct environmental interactions are halved.

5.3.1. Equivalent-Step Comparison

Table 8 presents the mean reward comparison at equivalent step counts from 500,000 to 3,000,000 steps.
As shown in Table 8, group learning outperformed solo learning for both agents across all step counts from 500,000 to 3,000,000. For Mario, a +7.9% improvement was observed at 500,000 steps, reaching +12.1% at 3,000,000 steps, where solo Mario achieved 797.2 compared to 893.5 for group Mario. For Luigi, the maximum improvement of +14.9% was recorded at 2,000,000 steps, where solo Luigi achieved 677.2 compared to 778.0 for group Luigi. At 3,000,000 steps, a +11.7% improvement was maintained, with solo achieving 745.2 versus 832.7 for group.
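The percentage improvements quoted above follow directly from the solo and group mean rewards. The short check below recomputes them from the figures stated in the text; the helper name `improvement_pct` is introduced only for this illustration.

```python
def improvement_pct(solo_reward, group_reward):
    """Relative gain of group over solo learning, in percent."""
    return (group_reward - solo_reward) / solo_reward * 100.0


# (agent, steps, solo mean reward, group mean reward) as quoted in the text
quoted = [
    ("Mario", 3_000_000, 797.2, 893.5),
    ("Luigi", 2_000_000, 677.2, 778.0),
    ("Luigi", 3_000_000, 745.2, 832.7),
]
for agent, steps, solo, group in quoted:
    print(f"{agent} @ {steps:,} steps: {improvement_pct(solo, group):+.1f}%")
# Mario @ 3,000,000 steps: +12.1%
# Luigi @ 2,000,000 steps: +14.9%
# Luigi @ 3,000,000 steps: +11.7%
```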
These results are consistent with the trends observed in the CartPole environment, with particularly large improvements confirmed for Luigi, the low-performance agent. The +14.9% improvement for Luigi at 2,000,000 steps suggests that Luigi efficiently utilised the high-quality experiences collected by Mario through the shared memory. Furthermore, improvement rates increased from early training (500,000 steps: Mario +7.9%, Luigi +1.9%) to later training (3,000,000 steps: Mario +12.1%, Luigi +11.7%), suggesting that the quality of experiences accumulated in the shared memory improves with training progression, amplifying the benefits of group learning.
Step-matched summary (Mario): Under step-matched evaluation, group Luigi (low-capacity) consistently outperforms solo Luigi at the same total-step checkpoints and remains competitive with—or above—solo Luigi trajectories that consume roughly twice as many steps; by contrast, Mario (high-capacity) shows broadly comparable performance between group and solo at matched steps. These trends are reported alongside last-200-episode means and Q-value evolution and should be interpreted as descriptive evidence under our step-based budget.
Clarification on updates: Because both agents update every step in the group setting, the number of optimiser updates per environment step is higher than in solo. The above conclusions therefore do not claim update-matched superiority; rather, they show that even under a stricter reading (two agents imply ~2× updates), Luigi benefits in a way that exceeds solo trained with ~2× steps, while Mario remains roughly unchanged.

5.3.2. Q-Value Analysis

The effects of group learning were also prominent in Q-value comparisons. Table 9 shows the Q-value comparison at equivalent step counts.
The Q-value difference was most pronounced at 1,000,000 steps. Mario solo’s Q-value of 32.5 increased to 42.6 in group learning (+10.1), and Luigi solo’s 35.5 increased to 43.2 (+7.7). This substantial early elevation indicates that both agents could draw on each other’s experiences through the shared memory from the early stages of training, and the size of the gap at 1,000,000 steps suggests that group learning accelerates the initial phase of learning. At 3,000,000 steps, Q-value differences of +6.4 (Mario) and +6.8 (Luigi) were maintained, indicating a sustained effect of group learning. However, the tendency for Q-values in group learning to be higher than actual rewards suggests that Q-value overestimation inherent to DQN may be amplified through the shared memory.
A lightweight overestimation check was performed by comparing predicted Q-values with realised discounted returns on held-out greedy evaluation trajectories (ε = 0, 3 episodes per agent, γ = 0.9). For group Mario, the mean gap Δ = Q(s,a) − G_t was +3.0 (median +1.3, IQR [0.4, 3.5], with 82% of steps showing Δ > 0), indicating a clear tendency toward Q-value overestimation compared with solo Mario (mean Δ = +1.0, 46% of steps with Δ > 0). For group Luigi, the share of overestimated steps remained close to the solo level (45% vs. 43%), although the mean gap shifted from −1.0 (solo) to +1.3 (group). These results suggest that the shared replay mechanism may amplify overestimation for the high-capacity agent, whose Q-network bootstraps from a mixture of heterogeneous experiences, while the low-capacity agent is not similarly affected. This asymmetry is consistent with the discussion in Section 6.3 and supports the potential benefit of Double DQN integration in future work.
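A minimal sketch of how such a gap statistic could be computed is given below. It assumes that each evaluation episode terminates (so the return after the final step is zero) and that the Q-value of the chosen action is logged at every step. The function names and the toy trajectory are hypothetical and are not data from the experiments.

```python
from statistics import mean, median


def discounted_returns(rewards, gamma=0.9):
    """Realised return G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def overestimation_summary(q_values, rewards, gamma=0.9):
    """Summarise the gap Delta_t = Q(s_t, a_t) - G_t along one greedy trajectory."""
    gaps = [q - g for q, g in zip(q_values, discounted_returns(rewards, gamma))]
    return {
        "mean_gap": mean(gaps),
        "median_gap": median(gaps),
        "frac_over": sum(d > 0 for d in gaps) / len(gaps),  # share of steps with Delta > 0
    }


# Hypothetical toy trajectory (illustration only).
rewards = [1.0, 1.0, 1.0]
q_values = [3.0, 1.5, 1.2]
summary = overestimation_summary(q_values, rewards)
print(summary)
```

A positive mean gap together with a high share of steps with Δ > 0 corresponds to the overestimation pattern reported for group Mario.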

5.3.3. Step Distribution and Efficiency Analysis

We next analyse each agent’s environment interactions in group learning. At 35,000 episodes, Mario’s total steps were 3,375,454 and Luigi’s were 3,377,137, a nearly equal distribution (Luigi 50.0%). Despite the use of performance-based selection, the similar step counts indicate that both agents maintained comparable performance levels during group learning. The combined total of 6,752,591 steps is approximately equivalent to the computational cost of solo Mario (7,077,188 steps) or solo Luigi (6,354,023 steps).
Notably, group Luigi reached a last-200-episode average of 826.0 with approximately 3.37 million steps, roughly half the steps of solo learning, which is approximately 97% of solo Luigi’s last-200-episode average of 852.5. When compared at the same step count (~3.37 million steps), solo Luigi achieved 727.1 versus 826.0 for group Luigi, a +13.6% improvement. This suggests that utilising Mario’s experiences through the shared memory improved Luigi’s learning efficiency. We note that in group learning, the number of neural network updates is approximately twice as high because both agents update at every step. Even accounting for this factor, at the Table 8 checkpoints solo Luigi trained for twice as many steps still scores below group Luigi (e.g., 745.2 at the 3.0 M solo checkpoint versus 753.6 at the 1.5 M group checkpoint), suggesting genuine efficiency gains from heterogeneous experience sharing.
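The step-share and efficiency figures in this subsection follow from simple arithmetic on the reported values. The sketch below recomputes them; all inputs are the numbers quoted above and in Table 7, and the variable names are ours.

```python
# Step counts and rewards as quoted in this subsection and in Table 7.
MARIO_GROUP_STEPS = 3_375_454
LUIGI_GROUP_STEPS = 3_377_137
SOLO_LUIGI_STEPS = 6_354_023

total_group_steps = MARIO_GROUP_STEPS + LUIGI_GROUP_STEPS
luigi_share = LUIGI_GROUP_STEPS / total_group_steps
step_ratio = LUIGI_GROUP_STEPS / SOLO_LUIGI_STEPS  # group Luigi steps relative to solo Luigi

group_last200, solo_last200 = 826.0, 852.5  # last-200-episode means
solo_at_matched_steps = 727.1               # solo Luigi at ~3.37 M steps

relative_to_solo_final = group_last200 / solo_last200
matched_gain_pct = 100.0 * (group_last200 - solo_at_matched_steps) / solo_at_matched_steps

print(f"combined group steps: {total_group_steps:,}")
print(f"Luigi share of group steps: {luigi_share:.1%}")
print(f"group/solo step ratio for Luigi: {step_ratio:.2f}")
print(f"group Luigi relative to solo final: {relative_to_solo_final:.1%}")
print(f"gain at matched steps: {matched_gain_pct:+.1f}%")
```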

6. Discussion

6.1. Cross-Environment Comparison

The results indicate that experience-sharing group learning can improve low-performance agent capabilities, while the high-performance agent maintains comparable performance across both environments, though not uniformly across all conditions. In CartPole, Agent16 approximately doubled its performance (93.3 to 184.4 steps in two-agent, 196.5 in three-agent), while several Agent64 group conditions fell slightly below the solo 200-step result. In Super Mario Bros, under step-matched evaluation, Luigi achieved improvements of up to +14.9% at equivalent step counts, while Mario showed broadly comparable performance between group and solo.
Comparing environments, Agent16’s performance approximately doubled in CartPole, while Luigi’s improvement in Super Mario Bros ranged from +1.9% to +14.9%. This difference can be attributed to two factors. First, CartPole’s Agent16 had only 404 parameters and could barely solve the task alone, while Luigi had approximately 140,000 parameters and could achieve reasonable performance independently—the capability gap was less extreme. Second, Super Mario Bros’ 84 × 84 × 4 state space (~28,000 dimensions) is vastly more complex than CartPole’s four dimensions, causing knowledge transfer through experience sharing to proceed more gradually, manifesting as continuous, stable improvement rather than substantial performance jumps.

6.2. Effect of Capability Gap

The magnitude of improvement is related to the capability gap between agents. In CartPole, Agent16 (404 parameters) could barely solve the task alone, resulting in a substantial ~2× improvement. In Super Mario Bros, Luigi (~140,000 parameters) achieved reasonable solo performance, yielding more moderate but consistent improvements of +1.9% to +14.9%. This suggests that the proposed method is most beneficial when the low-performance agent has insufficient capacity for independent task completion.
The tendency for improvement rates to increase as training progresses (early: +1.9% to +7.9%; later: +11.7% to +12.1%) suggests that the quality of experiences in the shared memory improves over time. As the high-performance agent develops better policies, its experiences become more informative for the low-performance agent. Q-value analysis adds a complementary observation: the largest Q-value boost occurred at 1 M steps (+10.1 for Mario), indicating that shared experiences also accelerate the initial phase of value estimation.

6.3. Q-Value Overestimation

The elevated Q-values in group learning (e.g., 49.7 vs. 43.3 at 3 M steps for Mario) warrant attention. While higher Q-values partially reflect genuine learning improvement, they may also indicate amplified overestimation through the shared memory mechanism. The overestimation check in Section 5.3.2 provides direct evidence: group Mario showed a mean Δ = Q(s,a) − G_t of +3.0, with 82% of steps overestimating, substantially higher than solo Mario (mean Δ = +1.0, 46%). In contrast, group Luigi’s overestimation (mean Δ = +1.3, 45%) remained comparable to solo levels (−1.0, 43%), suggesting that the amplification is asymmetric and primarily affects the high-capacity agent whose Q-network bootstraps from heterogeneous experiences. Future work should investigate Double DQN [29] integration to mitigate this effect.
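To illustrate why Double DQN [29] is a natural candidate, the sketch below contrasts the two bootstrap targets for a single transition. It is not the implementation used in this study. The Q-value vectors are hypothetical, and γ = 0.9 is taken from Table 4.

```python
import numpy as np


def dqn_target(reward, next_q_target, gamma=0.9, done=False):
    """Standard DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_target))


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.9, done=False):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * float(next_q_target[best_action])


# Hypothetical Q-values for one next state with two actions (e.g., walk-right, jump-right).
next_q_online = np.array([1.0, 2.0])
next_q_target = np.array([3.0, 1.5])

y_dqn = dqn_target(1.0, next_q_target)
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target)
print(y_dqn, y_ddqn)
```

Because the maximum of the target network’s values is never smaller than its value at any single action, the Double DQN target cannot exceed the DQN target for the same transition, which is the property that motivates its use against overestimation.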

6.4. Learning Role Division

In Super Mario Bros, the nearly equal step distribution between Mario (50.0%) and Luigi (50.0%) despite performance-based selection suggests comparable performance levels during group learning. This may reflect a division of learning roles: Luigi, with its smaller network capacity, captures coarse features of the state space relatively early, while Mario leverages its larger capacity to acquire refined feature representations. If such a role division does occur, it would be consistent with the principle of collaborative learning, in which agents with different abilities contribute their respective strengths, an implication that is also of interest from the perspective of educational psychology. However, because this experiment was conducted only once, verifying this hypothesis requires multiple trials and a detailed analysis of each agent’s action-selection patterns.

6.5. Comparison with Existing Approaches

The proposed method differs from parallel methods (A3C [10]) in using heterogeneous rather than homogeneous workers, and from knowledge distillation approaches in performing online bidirectional experience sharing rather than offline unidirectional transfer. The experience-sharing mechanism is structurally agnostic, allowing application between networks of any size without the identical-structure requirement of parameter sharing [8]. Compared to model compression approaches such as pruning [13] and MobileNets [14], which reduce individual model size, our approach improves learning efficiency through inter-agent cooperation without modifying network architectures.

6.6. Component Attribution

To clarify which mechanism drives the observed gains, we interpret our existing results by mapping them to three components: (i) heterogeneity of agent capacities, (ii) shared replay, and (iii) the agent-selection policy. In CartPole, Agent16’s consistent improvement across two- and three-agent settings, along with reduced variance under the A64&A16&A16 configuration, indicates that heterogeneity (exposing a low-capacity learner to states visited by stronger partners) appears to be a key driver of the effect. The same trend is present regardless of whether selection is performance-based or random, which suggests that replay sharing itself is the main conduit for the benefit, while selection primarily modulates how quickly the benefit materialises rather than whether it appears. These attributions are consistent with the manuscript’s analyses of state-coverage expansion and Q-value changes. We emphasise that this decomposition is interpretative and supported by the CartPole evidence already reported. A fully causal isolation requires three minimal controls: (1) homogeneous shared replay (two identical agents), (2) heterogeneous no-sharing (same selection policy, separate buffers), and (3) single-agent DQN with matched optimiser-update counts. We designate these as priority future work for Mario and other high-cost visual domains.
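The three components can be made concrete with a small sketch of a shared replay memory and an agent-selection step. This is an illustration under stated assumptions, not the released implementation: the class and function names are hypothetical, and "highest recent score acts" is only a stand-in for the performance-based rule. The default capacity of 200,000 is the shared-memory size listed in Table 4.

```python
import random
from collections import deque


class SharedReplay:
    """One replay memory written by whichever agent acts and sampled by every agent."""

    def __init__(self, capacity=200_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        k = min(batch_size, len(self.buffer))
        return [self.buffer[i] for i in rng.sample(range(len(self.buffer)), k)]


def select_agent(recent_scores, policy, rng):
    """Choose the acting agent; 'performance' is an assumed highest-recent-score rule."""
    names = sorted(recent_scores)
    if policy == "random":
        return rng.choice(names)
    return max(names, key=recent_scores.get)


rng = random.Random(0)
memory = SharedReplay(capacity=5)                    # tiny capacity for the demo
recent_scores = {"Agent64": 180.0, "Agent16": 90.0}  # hypothetical recent episode scores
for step in range(8):
    actor = select_agent(recent_scores, "performance", rng)
    memory.push((actor, step))                       # placeholder for (s, a, r, s', done)

# Every agent, regardless of network size, draws its own batch from the same memory.
batches = {name: memory.sample(3, rng) for name in recent_scores}
print(batches)
```

In this framing, control (2) above would correspond to giving each agent its own `SharedReplay` instance while keeping the selection rule unchanged.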
Positioning relative to the recent state of the art: Many recent state-of-the-art systems pursue peak performance with substantially larger compute budgets and are optimised for different constraints than ours. Our contribution is orthogonal: within a fixed environment-step budget and identical model classes, heterogeneous experience sharing improves the efficiency and stability of low-capacity agents without increasing model size. A comprehensive head-to-head against large-compute state-of-the-art methods is promising future work once resources allow; here we focus on the compute-bounded regime that motivates the method.

6.7. Limitations and Future Work

The Super Mario Bros study is a single-trial experiment due to the prohibitive computational resources required for multi-seed image-based DQN training. Accordingly, we refrain from making statistical claims and limit our interpretation to descriptive indicators, namely the last-200-episode reward trends and Q-value evolution at matched step counts. Consistent with our compute-constrained setting, all comparisons are made at matched total-step budgets, rather than through repeated independent runs. Extending the Mario component to multi-seed evaluation and including matched-update or wall-clock-normalised baselines remain priority directions for future work once additional compute becomes available. These limitations do not affect the CartPole experiments, where 10-trial statistical reporting (confidence intervals and effect-size measures) is provided.
For future work, the following research directions are considered. First, introducing prioritised experience replay [30] may further improve efficiency. While this study used simple memory sharing, TD-error-based priority sampling could enable more effective experience sharing. Second, introducing Double DQN [29] to suppress Q-value overestimation is promising, as group learning showed a tendency for Q-values to exceed actual rewards. Third, experiments with further enlarged performance gaps between agents and statistical verification through multiple trials are needed to confirm the robustness of these findings. Fourth, application to additional environments and continuous action spaces would broaden the method’s applicability.
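As an indication of what the first direction would involve, the sketch below shows proportional TD-error-based prioritisation in the spirit of [30]. It is a simplified illustration rather than a proposal for the exact design. The TD errors are hypothetical, and the exponents α and β are illustrative values.

```python
import numpy as np


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritisation: p_i = |delta_i| + eps, P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i))^(-beta), normalised by the maximum."""
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()


# Hypothetical TD errors for four stored transitions.
td_errors = [0.1, 0.5, 2.0, 0.0]
probs = sampling_probabilities(td_errors)
weights = importance_weights(probs)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(td_errors), size=2, p=probs, replace=False)
print(probs, weights, batch_idx)
```

Transitions with large TD error are sampled more often and receive smaller importance weights, which compensates for the sampling bias in the gradient update. In a shared memory, each agent would need its own priorities, since TD errors differ between networks.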

7. Conclusions

This study proposed experience-sharing group learning, a framework in which agents with different network sizes collaboratively learn through a shared experience replay memory, and evaluated its effectiveness in CartPole and Super Mario Bros environments.
In the two-agent CartPole experiment, the low-performance agent (Agent16) improved from 93.3 steps in solo learning to 184.4 steps in group learning—approximately doubling—while the high-performance agent (Agent64) maintained comparable performance, though several group conditions fell below the solo 200-step result. Q-value distribution analysis revealed that group learning expanded Agent16’s state-space coverage from approximately 60% to 85%. The three-agent experiment confirmed similar trends, with Agent16 achieving 196.5 steps (2.1× solo) with reduced variance.
In the Super Mario Bros single-trial experiment spanning 35,000 episodes, step-matched comparisons showed that the low-capacity agent (Luigi) benefits from experience sharing beyond solo baselines that consume roughly twice as many steps, while the high-capacity agent (Mario) remains broadly comparable between group and solo. These results are descriptive and limited to step-based normalisation; update-matched and time-matched evaluations are left for future work. Q-value analysis showed substantial early-stage increases (Mario +10.1, Luigi +7.7 at 1,000,000 steps), suggesting that group learning accelerates the learning startup under a fixed step budget.
From these results, the following findings were obtained regarding experience-sharing group learning. First, group learning can improve low-performance agent performance, while the high-performance agent maintains comparable levels, though not uniformly across all conditions. Second, the effect of group learning tends to increase with training progression, with benefits growing as the quality of experiences in the shared memory improves. Third, greater capability differences between agents tend to produce higher improvement rates for low-performance agents. These findings demonstrate the potential value of heterogeneous agent cooperation for efficient deep reinforcement learning under compute-constrained settings.

Author Contributions

K.M. conceptualised the study, developed the software, conducted experiments, and wrote the manuscript; M.I. contributed to data collection and analysis; A.N. supervised the research and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Numbers JP18K11473 and JP22K12182.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and training logs are available at https://github.com/keita324/ (accessed on 15 February 2026). Repositories: CartPole 2-Agent (Cartpole), CartPole 3-Agent (CartPole_3Agent), Mario Solo (Mario_solo), Luigi Solo (Luigi_solo), and Group Learning (Mario_Luigi_group). The CartPole environment was obtained from the official OpenAI Gym package. The Super Mario Bros environment was based on the gym-super-mario-bros fork by tyfkda (https://github.com/tyfkda/gym-super-mario-bros/; accessed on 3 December 2025).

Acknowledgments

The authors used Claude (Anthropic) (https://claude.ai) for the sole purpose of improving the readability and language of the manuscript. Claude was not used for experimental design, coding, or result interpretation. The authors take full responsibility for the integrity of the content of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Source code repositories for all experiments are available on GitHub and are listed below.
CartPole 2-Agent: https://github.com/keita324/Cartpole (accessed on 15 February 2026).
CartPole 3-Agent: https://github.com/keita324/CartPole_3Agent (accessed on 15 February 2026).
Mario Solo: https://github.com/keita324/Mario_solo (accessed on 15 February 2026).
Luigi Solo: https://github.com/keita324/Luigi_solo (accessed on 15 February 2026).
Group Learning: https://github.com/keita324/Mario_Luigi_group (accessed on 15 February 2026).
Reproducibility information: All CartPole experiments were run for 10 independent trials with random seeds 0–9. The optimiser was Adam with a learning rate of 0.0001. Replay memory warm-up: learning begins after 1000 transitions are collected. Final evaluation: the reported performance is the mean over the last 100 episodes of each trial. Both the CartPole and Super Mario Bros experiments were conducted on a single NVIDIA GeForce RTX 3060 GPU. Software: Python 3.10, PyTorch 2.0, OpenAI Gym 0.26. Repository commit hashes are recorded in each repository’s README.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
  2. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489.
  3. Sakurai, S. Special Issue on Collaborative Learning and AI. J. Jpn. Soc. Artif. Intell. 2008, 23, 159–162.
  4. Busoniu, L.; Babuska, R.; De Schutter, B. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Trans. Syst. Man Cybern. C 2008, 38, 156–172.
  5. Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the ICML, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
  6. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6390.
  7. Foerster, J.; Chen, R.Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; Mordatch, I. Learning with Opponent-Learning Awareness. In Proceedings of the AAMAS, Stockholm, Sweden, 10–15 July 2018; pp. 122–130.
  8. Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative Multi-Agent Control Using Deep Reinforcement Learning. In Proceedings of the AAMAS, São Paulo, Brazil, 8–12 May 2017; pp. 66–83.
  9. Tampuu, A.; Maarand, T.; Matiisen, T.; Kont, D.; Driessche, G.V.D.; Mets, T.; Aru, J.; Kuzovkin, I.; Vicente, R. Multiagent Cooperation and Competition with Deep Reinforcement Learning. PLoS ONE 2017, 12, e0172395.
  10. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the ICML, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
  11. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  12. OpenAI. OpenAI Five. 2018. Available online: https://openai.com/research/openai-five (accessed on 20 January 2025).
  13. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks. In Proceedings of the NeurIPS, Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143.
  14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
  15. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
  16. Mania, H.; Guy, A.; Recht, B. Simple Random Search of Static Linear Policies Is Competitive for Reinforcement Learning. In Proceedings of the NeurIPS, Montreal, QC, Canada, 3–8 December 2018; pp. 1805–1814.
  17. Muroya, K.; Ikeda, M.; Notsu, A. Effectiveness of Experience-Sharing Group Learning in Deep Reinforcement Learning. In Proceedings of the 26th International Symposium on Advanced Intelligent Systems (ISIS2025), Cheongju, Republic of Korea, 6–9 November 2025.
  18. Yang, Y.; Xi, M.; Dai, H.; Wen, J.; Yang, J. Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning. Sensors 2024, 24, 7746.
  19. Jamal, M.; Ullah, Z.; Naeem, M.; Abbas, M.; Coronato, A. A Hybrid Multi-Agent Reinforcement Learning Approach for Spectrum Sharing in Vehicular Networks. Future Internet 2024, 16, 152.
  20. Yang, B.; Gao, L.; Zhou, F.; Yao, H.; Fu, Y.; Sun, Z.; Tian, F.; Ren, H. A Coordination Optimization Framework for Multi-Agent Reinforcement Learning Based on Reward Redistribution and Experience Reutilization. Electronics 2025, 14, 2361.
  21. Kim, Y.-J.; Ahn, W.-J.; Jang, S.-H.; Lim, M.-T.; Pae, D.-S. A Reinforcement Learning Approach to Dynamic Trajectory Optimization with Consideration of Imbalanced Sub-Goals in Self-Driving Vehicles. Appl. Sci. 2024, 14, 5213.
  22. Yang, G.; Miao, X.; Peng, Y.; Huang, W.; Zhang, F. Optimized Adversarial Tactics for Disrupting Cooperative Multi-Agent Reinforcement Learning. Electronics 2025, 14, 2777.
  23. Guerrero-Contreras, G.; Balderas-Díaz, S.; García-Pascual, A.; Muñoz, A. Adaptive Vehicle Detection in Urban Environments: A Self-learning Approach. In Proceedings of the 15th International Symposium on Ambient Intelligence (ISAmI 2024), Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2025; Volume 1279, pp. 25–34.
  24. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the AAMAS, Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087.
  25. Rashid, T.; Samvelyan, M.; Schroeder de Witt, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
  26. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Baez, A.; Bhatt, S.; Fong, V.; Rui, H.; Liang, E. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022.
  27. Christianos, F.; Schäfer, L.; Albrecht, S.V. Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning. In Proceedings of the NeurIPS, Online, 6–12 December 2020.
  28. Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep Reinforcement Learning That Matters. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; pp. 3207–3214.
  29. Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
  30. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
Figure 1. Overview of the proposed group-learning framework. Multiple agents with different network sizes interact with the environment through an agent selection mechanism and share experiences via a common experience replay memory. All agents independently update their Q-networks from the shared memory.
Figure 2. CartPole-v1 environment. The agent must balance a pole on a cart by applying left or right forces.
Figure 3. Super Mario Bros Stage 1-1 environment used for evaluation. The agent navigates the level using walk-right and jump-right actions.
Figure 4. Learning curves for the two-agent CartPole experiments (mean of 10 trials, smoothed with a 50-episode moving average). The y-axis shows episode steps; the x-axis shows episode number. (a) Solo learning for Agent64, Agent32, and Agent16; (b) group learning with performance-based selection under standard ε; (c) group learning with random selection under standard ε; (d) group learning with half ε conditions.
Figure 5. Q-value heatmaps for Agent16 in CartPole. (a) Solo learning: predominantly negative Q-values with limited state-space coverage (~60%). (b) Group learning: more continuous distributions with clearer action boundaries and expanded coverage (~85%).
Figure 6. Three-agent group learning curves in CartPole. Agent16 achieves near-optimal performance across all conditions, with particularly high performance and low variance in A64&A16&A16_rand.
Figure 7. Learning curves for Mario and Luigi in group learning over 35,000 episodes. Mean reward is recorded every 100 episodes (unsmoothed). The y-axis shows mean episode reward; the x-axis shows episode number.
Table 1. Agent specifications for the CartPole environment.
| Agent | Hidden Layers | Units per Layer | Parameters |
|---|---|---|---|
| Agent64 | 2 | 64 | ~4676 |
| Agent32 | 2 | 32 | ~1348 |
| Agent16 | 2 | 16 | ~404 |
Table 2. Two-agent experimental conditions in the CartPole environment.
| Category | Condition | Agent 1 | Agent 2 | Selection | ε Decay |
|---|---|---|---|---|---|
| Solo | A64-ε | Agent64 | — | — | Standard |
| Solo | A32-ε | Agent32 | — | — | Standard |
| Solo | A16-ε | Agent16 | — | — | Standard |
| Solo | A64-ε/2 | Agent64 | — | — | Half |
| Solo | A32-ε/2 | Agent32 | — | — | Half |
| Solo | A16-ε/2 | Agent16 | — | — | Half |
| Group | A64&A32_rand-ε | Agent64 | Agent32 | Random | Standard |
| Group | A64&A16_rand-ε | Agent64 | Agent16 | Random | Standard |
| Group | A64&A32_rand-ε/2 | Agent64 | Agent32 | Random | Half |
| Group | A64&A16_rand-ε/2 | Agent64 | Agent16 | Random | Half |
| Group | A64&A32_PB-ε | Agent64 | Agent32 | PB | Standard |
| Group | A64&A16_PB-ε | Agent64 | Agent16 | PB | Standard |
| Group | A64&A32_PB-ε/2 | Agent64 | Agent32 | PB | Half |
| Group | A64&A16_PB-ε/2 | Agent64 | Agent16 | PB | Half |
PB = performance-based selection; rand = random selection; half = α = 0.5 in ε-decay formula. Each condition was run for 10 trials.
Table 3. Three-agent experimental conditions in the CartPole environment.
| Condition | Agent 1 | Agent 2 | Agent 3 | Selection |
|---|---|---|---|---|
| A64&A32&A16_PB | Agent64 | Agent32 | Agent16 | PB |
| A64&A16&A16_PB | Agent64 | Agent16 | Agent16 | PB |
| A64&A32&A16_rand | Agent64 | Agent32 | Agent16 | Random |
| A64&A16&A16_rand | Agent64 | Agent16 | Agent16 | Random |
Each condition was run for 10 trials.
Table 4. Agent configurations for the Super Mario Bros environment.
| Parameter | Mario (High) | Luigi (Low) |
|---|---|---|
| Conv layers | 32-64-64 ch | 12-24-24 ch |
| FC layer | 512 units | 160 units |
| Parameters | ~280,000 | ~140,000 |
| Learning rate | 0.00025 | 0.00025 |
| Batch size | 32 | 32 |
| ε decay | 0.99999975 | 0.99999975 |
| Min ε | 0.1 | 0.1 |
| Discount factor γ | 0.9 | 0.9 |
| Shared memory | 200,000 | 200,000 |
Table 5. Two-agent experiment results in the CartPole environment (mean ± SD over 10 trials, final evaluation). 95% CI = 95% confidence interval of the mean; Cohen’s d = standardised effect size vs. the corresponding solo baseline (|d| ≥ 0.8: large; ≥0.5: medium; <0.5: small). — indicates the agent was not present in that condition.
| Condition | Agent64 (Mean ± SD [95% CI] d) | Agent32 (Mean ± SD [95% CI] d) | Agent16 (Mean ± SD [95% CI] d) |
|---|---|---|---|
| A64-ε (solo) | 200.0 ± 0.0 [200.0, 200.0] | — | — |
| A32-ε (solo) | — | 200.0 ± 0.0 [200.0, 200.0] | — |
| A16-ε (solo) | — | — | 93.3 ± 77.69 [37.7, 148.9] |
| A64-ε/2 (solo) | 190.5 ± 24.58 [172.9, 208.1] | — | — |
| A32-ε/2 (solo) | — | 181.7 ± 40.55 [152.7, 210.7] | — |
| A16-ε/2 (solo) | — | — | 116.7 ± 81.47 [58.4, 175.0] |
| A64&A32_rand-ε | 187.8 ± 25.72 [169.4, 206.2] d = −0.67 | 187.3 ± 40.16 [158.6, 216.0] d = −0.45 | — |
| A64&A16_rand-ε | 189.2 ± 23.35 [172.5, 205.9] d = −0.65 | — | 184.4 ± 36.38 [158.4, 210.4] d = +1.50 |
| A64&A32_rand-ε/2 | 183.9 ± 50.91 [147.5, 220.3] d = −0.17 | 199.5 ± 1.58 [198.4, 200.6] d = +0.62 | — |
| A64&A16_rand-ε/2 | 179.6 ± 34.3 [155.1, 204.1] d = −0.36 | — | 171.3 ± 37.5 [144.5, 198.1] d = +0.86 |
| A64&A32_PB-ε | 191.0 ± 17.17 [178.7, 203.3] d = −0.74 | 192.9 ± 12.7 [183.8, 202.0] d = −0.79 | — |
| A64&A16_PB-ε | 191.4 ± 18.14 [178.4, 204.4] d = −0.67 | — | 167.5 ± 59.91 [124.6, 210.4] d = +1.07 |
| A64&A32_PB-ε/2 | 179.6 ± 18.16 [177.7, 203.7] d = +0.01 | 186.7 ± 31.58 [164.1, 209.3] d = +0.14 | — |
| A64&A16_PB-ε/2 | 193.7 ± 14.35 [183.4, 204.0] d = +0.16 | — | 171.5 ± 36.23 [145.6, 197.4] d = +0.87 |
Values represent mean steps at final evaluation. Underlined values indicate performance improvements compared with solo learning under the same ε condition.
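The interval and effect-size columns can be recomputed from the reported means and standard deviations. The sketch below assumes a Student-t interval with n = 10 trials and a Cohen’s d that pools the two standard deviations with equal weight. The pooling formula is our assumption, since it is not stated explicitly, and the function names are hypothetical.

```python
import math

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t with 9 degrees of freedom


def ci95(mean, sd, n=10, t_crit=T_CRIT_DF9):
    """95% confidence interval of the mean from summary statistics."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def cohens_d(mean_group, sd_group, mean_solo, sd_solo):
    """Cohen's d with an equal-n pooled SD (assumed formula)."""
    pooled_sd = math.sqrt((sd_group ** 2 + sd_solo ** 2) / 2)
    return (mean_group - mean_solo) / pooled_sd


low, high = ci95(93.3, 77.69)            # A16-ε (solo)
d = cohens_d(184.4, 36.38, 93.3, 77.69)  # Agent16 in A64&A16_rand-ε vs. A16-ε (solo)
print(f"[{low:.1f}, {high:.1f}] d = {d:+.2f}")
```

Under these assumptions the two example entries come out as [37.7, 148.9] and d = +1.50, matching the corresponding cells of Table 5.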
Table 6. Three-agent experiment results in the CartPole environment (mean ± SD over 10 trials). The 95% CI and Cohen’s d are computed against the corresponding solo baseline (A64-ε for Agent64, A32-ε for Agent32, and A16-ε for Agent16). — indicates the agent was not present.
| Condition | Agent64 (Mean ± SD [95% CI] d) | Agent32 (Mean ± SD [95% CI] d) | Agent16 (Mean ± SD [95% CI] d) |
|---|---|---|---|
| A64&A32&A16_PB | 195.8 ± 8.2 [189.9, 201.7] d = −0.72 | 191.3 ± 14.5 [180.9, 201.7] d = −0.85 | 181.2 ± 35.4 [155.9, 206.5] d = +1.46 |
| A64&A16&A16_PB | 194.2 ± 10.1 [187.0, 201.4] d = −0.81 | — | 190.5 ± 30.04 [169.0, 212.0] d = +1.65 |
| A64&A32&A16_rand | 193.1 ± 11.7 [184.7, 201.5] d = −0.83 | 194.6 ± 9.3 [187.9, 201.2] d = −0.82 | 188.3 ± 28.6 [167.8, 208.8] d = +1.62 |
| A64&A16&A16_rand | 191.5 ± 13.4 [181.9, 201.1] d = −0.90 | — | 196.5 ± 11.06 [188.6, 204.4] d = +1.86 |
Values represent mean steps at final evaluation. Underlined values indicate performance improvements compared with solo learning under the same ε condition.
Table 7. Results at 35,000 episodes in Super Mario Bros.
| Method | Agent | Total Steps | Final ε | Mean Reward | Mean (Last 200) | Max Reward | Mean Q |
|---|---|---|---|---|---|---|---|
| Solo | Mario | 7,077,188 | 0.170 | 1208.4 | 1125.9 ± 102.8 | 1398.5 | 52.5 |
| Solo | Luigi | 6,354,023 | 0.204 | 877.4 | 852.5 ± 51.7 | 977.4 | 42.5 |
| Group | Mario | 3,375,454 | 0.430 | 824.9 | 865.6 ± 62.8 | 1020.2 | 49.3 |
| Group | Luigi | 3,377,137 | 0.430 | 804.6 | 826.0 ± 51.7 | 962.2 | 49.0 |
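The Final ε values in Table 7 are consistent with the ε decay factor in Table 4 applied once per step of the agent in question. The sketch below assumes an initial ε of 1.0 and a purely multiplicative per-step schedule with the minimum of 0.1. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def epsilon_after(steps, decay=0.99999975, eps_start=1.0, eps_min=0.1):
    """Exploration rate after a number of agent steps under per-step multiplicative decay."""
    return max(eps_min, eps_start * decay ** steps)


# Total steps per agent, taken from Table 7.
runs = [
    ("solo Mario", 7_077_188),
    ("solo Luigi", 6_354_023),
    ("group Mario", 3_375_454),
    ("group Luigi", 3_377_137),
]
for label, steps in runs:
    print(f"{label}: {epsilon_after(steps):.3f}")
```

Under this reading, each group agent finishes at a higher ε than its solo counterpart because it has taken roughly half as many steps of its own, which is worth keeping in mind when comparing final rewards across rows.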
Table 8. Mean episode reward across all episodes completed by the given step counts (Δ = improvement over solo).
| Steps | Mario Solo | Mario Group | Δ | Luigi Solo | Luigi Group | Δ |
|---|---|---|---|---|---|---|
| 500 K | 641.7 | 692.2 | +7.9% | 676.1 | 689.1 | +1.9% |
| 1.0 M | 698.5 | 739.6 | +5.9% | 670.1 | 698.2 | +4.2% |
| 1.5 M | 740.2 | 790.2 | +6.8% | 707.7 | 753.6 | +6.5% |
| 2.0 M | 748.6 | 810.0 | +8.2% | 677.2 | 778.0 | +14.9% |
| 2.5 M | 804.7 | 835.4 | +3.8% | 683.6 | 754.2 | +10.3% |
| 3.0 M | 797.2 | 893.5 | +12.1% | 745.2 | 832.7 | +11.7% |
Table 9. Q-values for the state encountered and action taken at each equivalent training step count.
| Steps | Mario Solo | Mario Group | Δ | Luigi Solo | Luigi Group | Δ |
|---|---|---|---|---|---|---|
| 1.0 M | 32.5 | 42.6 | +10.1 | 35.5 | 43.2 | +7.7 |
| 2.0 M | 42.5 | 45.7 | +3.2 | 42.8 | 45.4 | +2.6 |
| 3.0 M | 43.3 | 49.7 | +6.4 | 42.0 | 48.8 | +6.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
