1. Introduction
Multi-agent reinforcement learning (MARL) is a promising field of artificial intelligence (AI) research that, over the last few years, has increasingly moved beyond "toy" problems (full game environments such as Atari and the StarCraft Multi-Agent Challenge (SMAC)) toward complex "real-world" problems [1,2,3]. Coordination of agents across a large state space is a challenging and multifaceted problem, and many approaches can be used to increase coordination. These include communication between agents, both learned and established; parameter sharing and other methods of imparting additional information to function approximators; and increasing levels of centralization.
One paradigm of MARL that aims to increase coordination is Centralized Learning Decentralized Execution (CLDE) [4]. CLDE algorithms train their agents' policies with additional global information through a centralized mechanism. During execution, the centralized element is removed, and each agent's policy is conditioned only on local observations. This has been shown to increase the coordination of agents [5]. CLDE algorithms fall into two major categories: centralized policy gradient methods [6,7,8] and value decomposition methods [9,10]. Recently, however, there has been work questioning the assumption that centralized mechanisms do indeed increase coordination. Lyu et al. [11] found that in actor–critic systems, the use of a centralized critic led to an increase in the variance of the final learned policy; however, they noted more coordinated agent behaviour during training and concluded that the use of a centralized critic should be thought of as a choice that carries a bias–variance trade-off.
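For concreteness, a brief sketch of the two CLDE families in our own notation (not taken from the cited works): during execution, each agent $i$ acts only from its local observation, $a_i \sim \pi_{\theta_i}(a_i \mid o_i)$. Centralized policy gradient methods train each actor against a centralized critic that conditions on the global state and all agents' actions (or a state-value critic $V_\phi(s)$ for advantage-based methods),
\[ \nabla_{\theta_i} J \approx \mathbb{E}\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\, Q_\phi(s, a_1, \dots, a_n) \right], \]
while value decomposition methods learn a joint action value that factorizes into per-agent utilities,
\[ Q_{tot}(s, \mathbf{a}) = f\big(Q_1(o_1, a_1), \dots, Q_n(o_n, a_n); s\big), \]
with $f$ a simple sum in VDN and a monotonic, state-conditioned mixing network in QMIX.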
One aspect of agent coordination that is similarly often taken at face value is the use of a joint reward in cooperative systems that use centralization. The assumption is that joint rewards are necessary for the coordination of systems that rely on centralization. We have not been able to find a theoretical basis for this claim. The closest works addressing team rewards in cooperative settings that we could find concern difference rewards, which attempt to measure the impact of an individual agent's actions on the full system reward [12]. Their high learnability, among other desirable properties, makes difference rewards attractive but impractical, since they require knowledge of the total system state [13,14,15].
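One common formulation of the difference reward, following this literature (notation ours), is
\[ D_i(z) = G(z) - G(z_{-i} \cup c_i), \]
where $G$ is the global system reward, $z$ is the joint state–action of all agents, $z_{-i}$ is the system with agent $i$'s contribution removed, and $c_i$ is a fixed counterfactual replacing that contribution. Evaluating the counterfactual term $G(z_{-i} \cup c_i)$ is what requires access to the total system state.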
We investigate the effects of changing the reward from a joint reward to an individual reward in the Level-Based Foraging (LBF) environment, examining how the performance of different CLDE algorithms changes as a result and discussing why. Specifically, we study the effect of varying the reward function from joint to individual rewards on Independent Q-Learning (IQL) [16], Independent Proximal Policy Optimization (IPPO) [17], independent synchronous advantage actor–critic (IA2C) [6], multi-agent proximal policy optimization (MAPPO) [7], multi-agent synchronous advantage actor–critic (MAA2C) [5,6], value decomposition networks (VDN) [10], and QMIX [9] when evaluated on the LBF environment [18]. This environment was chosen because, as a gridworld, it is simpler to understand than other MARL environments such as those based on StarCraft; nevertheless, it is a very challenging environment that requires cooperation to solve and allows forced cooperation and partial observability to be enabled for study.
We show empirically that using an individual reward in the LBF environment increases the variance of the reward term in the Temporal Difference (TD) error signal and of any quantity derived from this term. We study the effects of this increased variance on the selected algorithms and discuss whether it is helpful for learning better joint policies in the LBF environment. Our results show that PPO-based algorithms (with and without centralization) and QMIX perform better with individual rewards, while actor–critic models based on A2C suffer when using individual rewards.
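As a minimal sketch of where the reward enters the learning signal (our notation, not a formal result), the one-step TD error for agent $i$ with value function $V_\phi$ is
\[ \delta_i = r_i + \gamma V_\phi(o_i') - V_\phi(o_i). \]
Under a joint reward, every agent receives the same team reward (e.g., the sum of the individual rewards); under individual rewards, $r_i$ contains only agent $i$'s own sparse contribution. The variance discussed above refers to this $r_i$ term and to any quantity derived from $\delta_i$, such as value targets, advantages, and policy gradients.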
This work comprises multiple sections, starting with the background in Section 2. Section 3 outlines our experimental method, and we report our results in Section 4. We discuss the results and compare them to previous results in Section 5. All supplementary information pertaining to this work can be found in Appendix A, Appendix B, and Appendix C.
3. Method
To enable comparison with previous publications, we ensured that the scenarios and scenario parameters matched those of Papoudakis et al. [5] and Atrazhev et al. [19], and we compare our results against those reported in these works.
To remain consistent with previous publications, the LBF scenarios selected for this study are 8x8-2p-2f-coop, 2s-8x8-2p-2f-coop, 10x10-3p-3f, and 2s-10x10-3p-3f. Algorithms were selected on the same basis: IQL [16], IA2C [6], IPPO [17], MAA2C [5], MAPPO [7], VDN [10], and QMIX [9] were chosen because they are studied in both Papoudakis et al. [5] and Atrazhev et al. [19] and together cover independent algorithms, centralized critic CLDE algorithms, and value factorization CLDE algorithms.
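For reference, the studied scenarios can be instantiated with the lbforaging package using gym-style IDs. The following Python sketch is illustrative only; the exact ID pattern and version suffix depend on the installed lbforaging release.

import gym
import lbforaging  # noqa: F401 -- importing registers the Foraging-* environments

# Assumed IDs for the four studied scenarios: the "2s" prefix limits each agent's
# sight (partial observability), and "-coop" forces all agents to cooperate in
# order to collect a food item.
SCENARIOS = [
    "Foraging-8x8-2p-2f-coop-v2",
    "Foraging-2s-8x8-2p-2f-coop-v2",
    "Foraging-10x10-3p-3f-v2",
    "Foraging-2s-10x10-3p-3f-v2",
]

envs = {name: gym.make(name) for name in SCENARIOS}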
To evaluate the performance of each algorithm, we calculate the average and maximum returns achieved across all evaluation windows during training, along with the 95% confidence interval across ten seeds.
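A minimal sketch of how these summary statistics can be computed from per-seed evaluation returns follows; the array layout and the normal-approximation interval are our assumptions, not the benchmark's exact implementation.

import numpy as np

def summarize(returns):
    """Summarize evaluation returns of shape (n_seeds, n_eval_windows)."""
    per_seed_mean = returns.mean(axis=1)   # average return per seed
    per_seed_max = returns.max(axis=1)     # maximum return per seed
    mean_return = per_seed_mean.mean()
    max_return = per_seed_max.mean()
    # 95% confidence interval across seeds (normal approximation)
    ci95 = 1.96 * per_seed_mean.std(ddof=1) / np.sqrt(returns.shape[0])
    return mean_return, max_return, ci95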
Our investigation varies two factors: the reward function and the episode length. The episode length was varied between 25, the value reported by Papoudakis et al. [5], and 50, the default episode length in the environment. We performed two separate hyperparameter tunings, one for each reward type, adhering to the hyperparameter tuning protocol of Papoudakis et al. [5].
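To illustrate the reward manipulation: LBF natively returns per-agent (individual) rewards, and the joint-reward condition gives every agent the same shared team signal. The wrapper below is a hypothetical sketch of our own, not the implementation used in the benchmark codebase.

import gym
import numpy as np

class JointRewardWrapper(gym.Wrapper):
    """Replace each agent's individual reward with the shared team reward."""

    def step(self, actions):
        obs, rewards, done, info = self.env.step(actions)
        team_reward = float(np.sum(rewards))
        # every agent receives the same joint signal
        return obs, [team_reward] * len(rewards), done, info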
All other experimental parameters are taken from Papoudakis et al. [5], and we encourage readers to consult that work for further details.
4. Results
We compare IQL, IA2C, IPPO, MAA2C, MAPPO, VDN, and QMIX, reporting the mean and maximum returns achieved with individual rewards in Table 1 and Table 2, respectively. The mean and maximum returns with joint rewards are reported in Table 3 and Table 4, respectively. Tables for the increased episode length (50 timesteps) are included in Appendix C.
We generally observe that in the individual reward case, QMIX consistently achieves the highest maximum return in all scenarios. In terms of the highest mean returns, QMIX outperforms IPPO in the partially observable scenarios. In the joint reward case, the majority of the results are in line with those reported in [5]; however, we note that the average return results for QMIX are much higher with our hyperparameters. We go into more detail regarding these results in Appendix A.
When comparing joint reward performance with individual reward performance, we note that the effects of the reward choice are not easily predictable. Centralized critic algorithms are evenly split, with MAPPO performing better with individual rewards while MAA2C's performance suffers; this is paralleled by their independent counterparts, IPPO and IA2C. The value factorization algorithms are also divided, with QMIX becoming the top-performing algorithm across the tested scenarios, while VDN sees a dramatic drop in performance when using individual rewards. Finally, IQL performance with individual rewards is relatively unaffected in the simpler 8x8 scenarios but decreases in the larger scenarios.
6. Conclusions and Future Work
In summary, our results show that different CLDE algorithms respond in different ways when the reward is changed from joint to individual in the LBF environment. MAPPO and QMIX are able to leverage the additional variance present in the individual reward to find improved policies, while VDN and MAA2C suffer from the increase and perform worse. For the centralized critic algorithms, it seems crucial that the centralized critic converge slowly enough to find the optimal joint policy, but not so quickly that it settles into a local minimum. In addition, if the critic is too sensitive to the increase in variance, it may diverge, as with MAA2C, and be unable to find the optimal policy. Value decomposition methods also seem to need additional state information on which to condition agent coordination in order to learn optimal policies. Since much of the emergent behaviour sought in MARL systems is a function of how agents work together, we feel that the choice of reward function may be even more important in MARL environments than in single-agent environments. Our results hint that there may be a broader bias–variance-type trade-off between joint and individual rewards; however, more research is needed to confirm this.
As we have outlined throughout this work, many questions remain before we can definitively say that the choice between a joint reward and an individual reward when training MARL algorithms comes down to a bias–variance trade-off. First, the theory of increased variance would need to be studied in simpler scenarios that can be solved analytically in order to confirm that individual rewards do increase variance; such scenarios would need to have the same sparse, positive reward structure as the LBF environment. Once this theoretical underpinning is established, the next step would be to relax either the sparsity constraint or the positive-reward constraint and test whether the theory still holds. Only then could a definitive conclusion be drawn about the effects of varying reward functions between joint and individual rewards in cooperative MARL systems.