Author Contributions
Conceptualization, L.O. and H.X.; methodology, L.O.; software, L.O.; validation, L.O. and H.X.; formal analysis, L.O.; investigation, L.O.; resources, H.X.; data curation, L.O.; writing—original draft preparation, L.O.; writing—review and editing, L.O. and H.X.; visualization, L.O.; supervision, H.X.; project administration, H.X. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Conceptual connection between Relaxed-QMIX (R-QMIX) and cooperative multi-robot systems. A team of robots operates under partial observability in a shared task. Each agent learns a local utility based on a Deep Recurrent Q-Network (DRQN). An R-QMIX mixing network with a soft monotonicity regularizer combines these utilities with the global state to form the joint action–value function. Centralized training uses global information, while decentralized execution uses greedy actions with respect to the local utilities, enabling coordinated multi-robot behavior.
Figure 2.
R-QMIX architecture. The mixing network (pink) combines per-agent utilities into the joint action–value using state-conditioned hypernetworks (red). A soft monotonicity regularization term is added to the mixer training objective (penalizing negative local slopes), while mixer weights remain unconstrained. Agents use a DRQN (deep recurrent Q-network) utility model with an MLP–gated recurrent unit (GRU)–MLP structure (green). Arrows indicate the direction of information flow.
Figure 3.
Win-rate comparison between QMIX, R-QMIX, and QTRAN on 3m. Dashed lines show the quarterly mean; shaded bands show the quarterly standard deviation.
Figure 4.
Win-rate comparison between QMIX, R-QMIX, and QTRAN on MMM2. Dashed lines show the quarterly mean; shaded bands show the quarterly standard deviation.
Figure 5.
Win-rate comparison between QMIX, R-QMIX, and QTRAN on 6h vs. 8z. Dashed lines show the quarterly mean; shaded bands show the quarterly standard deviation.
Figure 6.
Win-rate comparison between QMIX, R-QMIX, and QTRAN on 27m vs. 30m. Dashed lines show the quarterly mean; shaded bands show the quarterly standard deviation.
Figure 7.
Fraction of slopes below margin on the 3m map.
Figure 8.
Fraction of slopes below margin on the MMM2 map.
Figure 9.
Fraction of slopes below margin on the 6h vs. 8z map.
Figure 10.
Fraction of slopes below margin on the 27m vs. 30m map.
Table 1.
QMIX vs. R-QMIX at a glance.
| Aspect | QMIX | R-QMIX |
|---|---|---|
| Mixer weight constraint | Non-negative mixer weights. | Weights unconstrained (may be negative); monotonicity encouraged via a soft penalty on local slopes. |
| Monotonic guarantee | Yes: the joint value is non-decreasing in each agent utility (IGM under the model class). | No formal guarantee; encouraged locally / in expectation via regularization. |
| Extra loss terms | TD loss only (plus any shared baseline regularizers). | TD loss plus the soft monotonicity penalty. |
| Extra hyperparameters | None beyond shared architecture/training knobs. | Penalty-weight schedule, margin, exponent p (and a step size for finite differences). |
| Computational overhead | Baseline. | Small to moderate (computing the local slopes and the penalty). |
| When it helps (intuition) | When a monotone factorization is sufficient; typically stable and sample-efficient. | When interactions are non-monotonic (synergy/interference) and strict QMIX monotonicity underfits or destabilizes learning on hard scenarios. |
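To make the R-QMIX column more concrete, the following is a minimal sketch of a soft penalty on local slopes. It is not the authors' implementation. The hinge form `max(0, margin - slope) ** p`, the forward finite-difference estimate, and all names (`local_slopes`, `soft_monotonicity_penalty`, `fraction_below_margin`, `toy_mixer`, `eps`) are assumptions chosen for illustration; the toy mixer is a fixed linear map rather than a state-conditioned hypernetwork.

```python
from typing import Callable, List, Sequence


def local_slopes(mixer: Callable[[Sequence[float]], float],
                 utilities: Sequence[float],
                 eps: float = 1e-3) -> List[float]:
    """Forward finite-difference estimate of d(mixer)/d(utility_i) per agent."""
    base = mixer(utilities)
    slopes = []
    for i in range(len(utilities)):
        bumped = list(utilities)
        bumped[i] += eps  # perturb only agent i's utility
        slopes.append((mixer(bumped) - base) / eps)
    return slopes


def soft_monotonicity_penalty(slopes: Sequence[float],
                              margin: float = 0.0,
                              p: int = 2) -> float:
    """Assumed hinge form: mean of max(0, margin - slope)^p over agents.

    The penalty is zero whenever every slope is at least `margin`.
    """
    return sum(max(0.0, margin - s) ** p for s in slopes) / len(slopes)


def fraction_below_margin(slopes: Sequence[float], margin: float = 0.0) -> float:
    """Share of slopes strictly below the margin (a simple diagnostic)."""
    return sum(1 for s in slopes if s < margin) / len(slopes)


def toy_mixer(q: Sequence[float]) -> float:
    """Hypothetical unconstrained linear mixer; the second weight is negative."""
    weights = [1.0, -0.5, 2.0]
    return sum(w * x for w, x in zip(weights, q))


slopes = local_slopes(toy_mixer, [0.2, 0.4, 0.1])
penalty = soft_monotonicity_penalty(slopes, margin=0.0, p=2)
print(slopes, penalty, fraction_below_margin(slopes))
```

In this toy case only the second agent has a negative slope, so it alone contributes to the penalty. In a training loop, a term of this kind would be scaled by a penalty weight and added to the TD loss, which is where the extra hyperparameters in the table come from.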
Table 2.
Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 3m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.300 | 0.126 | 0.675 | 0.325 | **0.739** | 0.243 |
| 2 | 0.669 | 0.152 | **0.980** | 0.013 | 0.972 | 0.019 |
| 3 | 0.893 | 0.069 | 0.983 | 0.012 | **0.989** | 0.010 |
| 4 | 0.974 | 0.017 | 0.982 | 0.011 | **0.990** | 0.009 |
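As an illustration of how quarterly statistics of this kind could be computed, the sketch below splits a win-rate curve into four contiguous segments and reports the mean and standard deviation of each. The segmentation rule, the use of the population standard deviation, the function name `quarterly_stats`, and the example curve are all assumptions; the source does not state how the quarters were formed or how seeds were pooled.

```python
import statistics
from typing import List, Sequence, Tuple


def quarterly_stats(win_rates: Sequence[float],
                    n_quarters: int = 4) -> List[Tuple[float, float]]:
    """Return (mean, population std) for each contiguous segment of the curve."""
    n = len(win_rates)
    if n < n_quarters:
        raise ValueError("need at least one evaluation per quarter")
    stats = []
    for q in range(n_quarters):
        lo = q * n // n_quarters        # first index of this quarter
        hi = (q + 1) * n // n_quarters  # one past the last index
        chunk = win_rates[lo:hi]
        stats.append((statistics.fmean(chunk), statistics.pstdev(chunk)))
    return stats


# Made-up evaluation curve with eight points, two per quarter.
curve = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0, 1.0]
for quarter, (mean, std) in enumerate(quarterly_stats(curve), start=1):
    print(f"Q{quarter}: mean={mean:.3f} std={std:.3f}")
```

With several seeds, one option would be to pool all evaluations that fall in a quarter before taking the statistics; another would be to average per-seed quarterly means. The two choices give different standard deviations.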
Table 3.
Time to reach win-rate thresholds on 3m (seeds = 6; hold consecutive evals). Values are mean across seeds; “n/6” indicates how many runs reached the threshold.
| Algorithm | 50% WR: Steps | 50% WR: Clock | 70% WR: Steps | 70% WR: Clock | 90% WR: Steps | 90% WR: Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 361 k | 00:34:46 | 599 k | 00:56:54 | 1.03 M | 01:35:28 | 6/6/6 |
| R-QMIX | 154 k | 00:14:34 | 247 k | 00:22:51 | 319 k | 00:29:19 | 6/6/6 |
| QTRAN | 120 k | 00:11:09 | 190 k | 00:17:30 | 307 k | 00:28:11 | 6/6/6 |
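The time-to-threshold metric with a hold requirement can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation script. It assumes a run reaches a threshold at the first evaluation of the first streak of `hold` consecutive evaluations at or above it, and that the reported mean covers only the seeds that reached the threshold. The names `time_to_threshold` and `summarize` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def time_to_threshold(steps: Sequence[int],
                      win_rates: Sequence[float],
                      threshold: float,
                      hold: int) -> Optional[int]:
    """Step at which a streak of `hold` consecutive evals >= threshold begins."""
    streak = 0
    for idx, wr in enumerate(win_rates):
        streak = streak + 1 if wr >= threshold else 0
        if streak == hold:
            return steps[idx - hold + 1]  # first evaluation of the streak
    return None  # threshold never held for `hold` evaluations


def summarize(per_seed: List[Optional[int]]) -> Tuple[Optional[float], str]:
    """Mean over seeds that reached the threshold, plus an 'n/N' reached count."""
    reached = [t for t in per_seed if t is not None]
    mean = sum(reached) / len(reached) if reached else None
    return mean, f"{len(reached)}/{len(per_seed)}"


# Made-up evaluation log for one seed.
steps = [10_000, 20_000, 30_000, 40_000, 50_000]
wr = [0.4, 0.6, 0.3, 0.7, 0.8]
print(time_to_threshold(steps, wr, threshold=0.5, hold=2))  # 40000
print(summarize([40_000, None, 60_000]))                    # (50000.0, '2/3')
```

The single evaluation at 20,000 steps exceeds 50% but does not count, because the next evaluation drops below the threshold. This is the kind of transient spike a hold requirement is meant to filter out.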
Table 4.
Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the MMM2 map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.001 | **0.162** | 0.168 | 0.000 | 0.001 |
| 2 | 0.002 | 0.005 | **0.762** | 0.125 | 0.001 | 0.003 |
| 3 | 0.085 | 0.061 | **0.932** | 0.027 | 0.035 | 0.030 |
| 4 | 0.423 | 0.146 | **0.971** | 0.013 | 0.247 | 0.108 |
Table 5.
Time to reach win-rate thresholds on MMM2 (seeds = 6; hold consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold. "N/A" indicates that no runs have met the threshold.
| Algorithm | 50% WR: Steps | 50% WR: Clock | 70% WR: Steps | 70% WR: Clock | 90% WR: Steps | 90% WR: Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.97 M | 09:09:41 | 5.57 M | 10:14:54 | N/A | N/A | 4/4/0 |
| R-QMIX | 1.51 M | 02:41:40 | 1.90 M | 03:22:14 | 2.63 M | 04:43:41 | 6/6/6 |
| QTRAN | 5.36 M | 09:32:44 | 5.69 M | 10:02:00 | N/A | N/A | 3/1/0 |
Table 6.
Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 6h vs. 8z map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | **0.005** | 0.008 | 0.000 | 0.000 |
| 2 | 0.000 | 0.000 | **0.071** | 0.043 | 0.002 | 0.005 |
| 3 | 0.000 | 0.000 | **0.301** | 0.090 | 0.005 | 0.008 |
| 4 | 0.000 | 0.001 | **0.575** | 0.063 | 0.014 | 0.015 |
Table 7.
Time to reach win-rate thresholds on 6h vs. 8z (seeds = 6; hold consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold. "N/A" indicates that no runs have met the threshold.
| Algorithm | 50% WR: Steps | 50% WR: Clock | 70% WR: Steps | 70% WR: Clock | 90% WR: Steps | 90% WR: Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
| R-QMIX | 4.41 M | 08:23:15 | 5.70 M | 10:50:57 | N/A | N/A | 6/3/0 |
| QTRAN | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
Table 8.
Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 27m vs. 30m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | **0.168** | 0.167 | 0.002 | 0.004 |
| 2 | 0.001 | 0.004 | **0.733** | 0.130 | 0.043 | 0.029 |
| 3 | 0.203 | 0.133 | **0.930** | 0.024 | 0.191 | 0.076 |
| 4 | 0.580 | 0.078 | **0.966** | 0.012 | 0.352 | 0.052 |
Table 9.
Time to reach win-rate thresholds on 27m vs. 30m (seeds = 6; hold consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold. "N/A" indicates that no runs have met the threshold.
| Algorithm | 50% WR: Steps | 50% WR: Clock | 70% WR: Steps | 70% WR: Clock | 90% WR: Steps | 90% WR: Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.49 M | 15:44:36 | 4.72 M | 16:41:25 | 5.02 M | 17:33:47 | 6/4/1 |
| R-QMIX | 1.47 M | 05:34:45 | 2.02 M | 07:20:18 | 2.89 M | 10:07:24 | 6/6/6 |
| QTRAN | 3.84 M | 14:07:32 | 4.44 M | 16:24:35 | N/A | N/A | 2/2/0 |