Article

Relaxed Monotonic QMIX (R-QMIX): A Regularized Value Factorization Approach to Decentralized Multi-Agent Reinforcement Learning

Department of Electrical and Biomedical Engineering, University of Nevada, Reno, NV 89557, USA
*
Authors to whom correspondence should be addressed.
Robotics 2026, 15(1), 28; https://doi.org/10.3390/robotics15010028
Submission received: 12 December 2025 / Revised: 16 January 2026 / Accepted: 16 January 2026 / Published: 21 January 2026
(This article belongs to the Special Issue AI-Powered Robotic Systems: Learning, Perception and Decision-Making)

Abstract

Value factorization methods have become a standard tool for cooperative multi-agent reinforcement learning (MARL) in the centralized-training, decentralized-execution (CTDE) setting. QMIX (a monotonic mixing network for value factorization), in particular, constrains the joint action–value function to be a monotonic mixing of per-agent utilities, which guarantees consistency with individual greedy policies but can severely limit expressiveness on tasks with non-monotonic agent interactions. This work revisits this design choice and proposes Relaxed Monotonic QMIX (R-QMIX), a simple regularized variant of QMIX that encourages but does not strictly enforce the monotonicity constraint. R-QMIX removes the sign constraints on the mixing network weights and introduces a differentiable penalty on negative partial derivatives of the joint value with respect to each agent's utility. This preserves the computational benefits of value factorization while allowing the joint value to deviate from strict monotonicity when beneficial. R-QMIX is implemented in the standard PyMARL framework (an open-source MARL codebase) and evaluated on the StarCraft Multi-Agent Challenge (SMAC). On a simple map (3m), R-QMIX matches the asymptotic performance of QMIX while learning substantially faster. On more challenging maps (MMM2, 6h vs. 8z, and 27m vs. 30m), R-QMIX significantly improves both sample efficiency and final win rate (WR), for example increasing the final-quarter mean win rate from 42.3 % to 97.1 % on MMM2, from 0.0 % to 57.5 % on 6h vs. 8z, and from 58.0 % to 96.6 % on 27m vs. 30m. These results suggest that soft monotonicity regularization is a practical way to bridge the gap between strictly monotonic value factorization and fully unconstrained joint value functions. A further comparison against QTRAN (Q-value transformation), a more expressive value factorization method, shows that R-QMIX achieves higher and more reliably convergent win rates on the challenging SMAC maps considered.

1. Introduction

Reinforcement learning (RL) provides a framework for sequential decision-making in which an agent interacts with an environment in order to maximize cumulative reward [1]. Modern deep RL methods have achieved impressive results in high-dimensional domains such as Atari games and continuous control by combining neural function approximation with value-based or policy-based learning [2,3,4,5,6].
Many real-world problems, however, are naturally multi-agent: multiple learners act concurrently, share a common objective, and operate under partial observability. Examples include cooperative robotics, distributed sensor networks, and multi-vehicle coordination. This has motivated the field of multi-agent reinforcement learning (MARL), where several agents must coordinate their behavior using local observations and, typically, a joint reward signal [7,8]. A widely adopted paradigm is centralized training with decentralized execution (CTDE), in which a learning algorithm can access global information and joint actions during training while the final policies for execution are decentralized. Figure 1 provides a conceptual overview of how R-QMIX fits into a cooperative multi-robot CTDE pipeline, from per-agent utilities to the joint action–value function.
A core challenge in cooperative MARL is credit assignment, that is, how to decompose a global team reward into signals that enable each agent to learn an effective local policy. Value factorization methods address this by parameterizing a joint action–value function Q tot as a combination of per-agent utilities Q a , trained end-to-end from a global reward. Value decomposition networks (VDN) [9] use a simple additive factorization; QMIX [10] extends this idea with a nonlinear mixing network that is constrained to be monotonic in each agent’s utility. The monotonicity constraint ensures that greedy individual policies are consistent with greedy joint policies, enabling efficient decentralized execution.
Despite its practical success, QMIX’s hard monotonicity constraint can be overly restrictive. Many cooperative tasks involve interactions that are fundamentally non-monotonic with respect to individual utilities, for instance coordinated flanking maneuvers or sacrifice strategies in combat. Empirically, relaxing this constraint often improves representational power but may lead to unstable optimization and poor generalization. This raises a natural question: Can the monotonicity constraint be softened in a principled way that trades off expressiveness and empirical convergence?
This paper proposes Relaxed Monotonic QMIX (R-QMIX), a regularized value factorization method that retains the QMIX architecture but replaces the hard non-negativity constraint on mixing-network weights with a soft differentiable penalty on monotonicity violations. Instead of enforcing $\partial Q_{\text{tot}} / \partial Q_a \geq 0$ for all agents and states, a regularization term is introduced that penalizes negative partial derivatives, allowing the network to deviate from strict monotonicity only when this significantly improves the fit to the Bellman target. Figure 2 summarizes the R-QMIX architecture, including the hypernetwork-based mixer and the soft monotonicity regularizer used during training.
The main contributions of this work are as follows:
  • A soft monotonicity regularizer is formulated that measures and penalizes violations of the individual-global-max (IGM) property via the partial derivatives of the mixing network.
  • R-QMIX is introduced, augmenting the standard QMIX TD loss with the proposed regularizer while keeping the underlying architecture and training pipeline unchanged.
  • An efficient implementation based on automatic differentiation is provided and integrated into a standard PyMARL/SMAC stack.
  • R-QMIX is empirically evaluated in the StarCraft Multi-Agent Challenge (SMAC) [11]. On 3m, MMM2, and 27m vs. 30m, R-QMIX consistently matches or outperforms QMIX, with particularly large gains in convergence speed (sample efficiency) and asymptotic win rate on the harder maps.
  • R-QMIX is empirically compared with QTRAN in four SMAC scenarios, showing that R-QMIX achieves stronger and more reliable convergence than both QMIX and QTRAN on the harder maps.
More broadly, cooperative MARL and its core challenges, including credit assignment, non-stationarity, and coordination, have been extensively surveyed, providing context for the algorithms considered in this work [12,13,14,15,16].
Overall, the results show that softly relaxing the monotonicity constraint is a simple yet powerful modification that improves convergence in challenging cooperative tasks without sacrificing the computational advantages of value factorization.

2. Background and Related Work

2.1. Dec-POMDPs and the CTDE Paradigm

Cooperative multi-agent tasks are commonly modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [7]. A Dec-POMDP is defined by the tuple
$$G = \langle S, U, P, r, Z, O, n, \gamma \rangle,$$
where $S$ is the state space, $U$ is the discrete set of actions for each agent, $P(s' \mid s, \mathbf{u})$ is the state-transition function, $r(s, \mathbf{u})$ is the shared reward, $Z$ is the local observation space, $O(z_t^a \mid s_t, u_t^a)$ is the observation function, $n$ is the number of agents, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t$, agents select actions $u_t^a \in U$, forming a joint action $\mathbf{u}_t \in U^n$, and receive a shared reward $r_t = r(s_t, \mathbf{u}_t)$.
Because the global state $s_t$ is not observed directly, each agent $a$ receives a local observation $z_t^a \sim O(\cdot \mid s_t, u_t^a)$ and maintains an action–observation history $\tau_t^a \in (Z \times U)^{*}$, conditioning its policy $\pi^a(u_t^a \mid \tau_t^a)$ on this history. The joint policy $\pi$ induces a joint action–value function
$$Q^{\pi}(s_t, \mathbf{u}_t) = \mathbb{E}\left[ R_t \mid s_t, \mathbf{u}_t \right],$$
where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted return.
In the centralized-training, decentralized-execution (CTDE) setting, the learning algorithm can access global information (e.g., s t and all { τ t a } a ) during training, while in execution each agent must rely only on its own history τ t a . Value factorization methods fit naturally into this paradigm; they learn a centralized estimate of a joint value Q tot while deriving decentralized policies from per-agent utilities.

2.2. Deep Q-Learning and Recurrent Extensions

Deep Q-learning (DQL) approximates a single-agent action–value function $Q(s, u; \theta)$ with a neural network [2,17,18]. Transitions $(s_t, u_t, r_t, s_{t+1})$ are stored in a replay buffer, and the parameters $\theta$ are trained by minimizing the squared temporal-difference (TD) error
$$\mathcal{L}_{\text{DQN}}(\theta) = \sum_{i=1}^{B} \left( y_i^{\text{DQN}} - Q(s_i, u_i; \theta) \right)^2$$
with target
$$y_i^{\text{DQN}} = r_i + \gamma \max_{u'} Q(s_i', u'; \theta^{-}),$$
where $\theta^{-}$ are the parameters of a target network periodically updated from $\theta$.
To handle partial observability, deep recurrent Q-networks (DRQN) [19] replace the feedforward Q-network with a recurrent neural network (RNN) such as gated recurrent unit (GRU) [20] or long short-term memory (LSTM) [21], which is conditioned on an agent’s action–observation history τ t a . The recurrent hidden state summarizes past observations and actions, and serves as an approximate belief state. DRQN forms the basis for many CTDE value factorization methods, including QMIX and R-QMIX.
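As a concrete illustration, a minimal DRQN-style utility network can be sketched in PyTorch as follows; the class name, layer sizes, and method names are illustrative choices, not taken from the PyMARL code.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent utility network: a GRU cell carries the recurrent hidden
    state that summarizes the action-observation history tau_t^a."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def init_hidden(self, batch_size):
        # Fresh hidden state at the start of each episode
        return torch.zeros(batch_size, self.rnn.hidden_size)

    def forward(self, obs, hidden):
        x = torch.relu(self.fc_in(obs))
        h = self.rnn(x, hidden)   # approximate belief state
        q = self.fc_out(h)        # Q_a(tau_t^a, .) for every action
        return q, h
```

At execution time, each agent feeds its own observation stream through such a network and acts greedily on the returned utilities, which is all that decentralized execution requires.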

2.3. Independent Q-Learning

Independent Q-learning (IQL) [22,23] is a simple baseline in which each agent learns its own Q-function Q a ( τ t a , u t a ) by treating other agents as part of the environment. Although conceptually straightforward and often competitive on simpler tasks, IQL suffers from non-stationarity; as other agents update their policies, the effective environment dynamics change, undermining single-agent convergence guarantees. This motivates methods that exploit centralized information during training while still supporting decentralized execution.

2.4. Value Decomposition Networks (VDN)

Value decomposition networks (VDN) [9] address credit assignment by explicitly decomposing the joint action–value function into a sum of per-agent utilities:
$$Q_{\text{tot}}(s_t, \mathbf{u}_t) = \sum_{a=1}^{n} Q_a(\tau_t^a, u_t^a).$$
Each Q a depends only on agent a’s local history, while the scalar joint value is trained against the global team reward. This additive factorization ensures that greedy policies with respect to each Q a are consistent with the greedy joint policy. However, the assumption that agent contributions simply add is often too restrictive for tasks with strong interaction effects or heterogeneous roles.

2.5. QMIX

QMIX [10] generalizes VDN by introducing a learnable nonlinear mixing network
$$Q_{\text{tot}}(s_t, \mathbf{u}_t) = f_{\theta}\left( s_t, \, Q_1(\tau_t^1, u_t^1), \ldots, Q_n(\tau_t^n, u_t^n) \right),$$
where $f_{\theta}$ is parameterized by a feedforward network with weights produced by state-conditioned hypernetworks. The key structural constraint is that $Q_{\text{tot}}$ is required to be monotone in each agent's utility:
$$\frac{\partial Q_{\text{tot}}}{\partial Q_a} \geq 0, \quad \forall a.$$
In practice, this is enforced by restricting certain mixer weights to be non-negative (e.g., via absolute value or softplus). The monotonicity constraint guarantees the individual-global-max (IGM) property: maximizing Q tot over joint actions is equivalent to each agent greedily maximizing its own utility. This yields efficient decentralized greedy execution, but restricts the functional form of Q tot and can limit expressiveness on tasks with non-monotonic interactions.
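A minimal sketch of such a hypernetwork mixer is shown below, assuming illustrative layer sizes; the `torch.abs` calls on the generated weights are what enforce monotonicity, and R-QMIX's relaxation amounts to dropping them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: state-conditioned hypernetworks generate the
    mixing weights, and abs() keeps them non-negative so that
    dQ_tot/dQ_a >= 0 for every agent."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot per batch element
```

Because the hidden activation (ELU) is monotone increasing and both weight layers are non-negative, the composed mixer is monotone in every `agent_qs` entry by construction.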
Several QMIX variants increase mixing expressiveness while preserving useful structure, including weighted QMIX [24], QPLEX (a duplex dueling value decomposition method) [25], Qatten [26], and deep coordination graphs (DCG) [27]. These methods highlight the tradeoff between structural constraints that facilitate decentralized execution and the need for representational power in complex cooperative tasks.

2.6. QTRAN

QTRAN [28] takes a different approach to value factorization. Instead of enforcing monotonicity, it introduces a transformed joint action–value function $\tilde{Q}_{\text{tot}}$ and auxiliary constraints that align independent greedy policies with the optimal joint policy. Concretely, QTRAN minimizes a TD loss on $\tilde{Q}_{\text{tot}}$ together with additional terms that encourage
$$\tilde{Q}_{\text{tot}}(s_t, \mathbf{u}_t) = \sum_a Q_a(\tau_t^a, u_t^a) + C(s_t)$$
for the joint greedy action and certain baseline conditions. This formulation is more expressive than QMIX in principle; however, the resulting optimization problem can be difficult to solve in practice, and QTRAN is empirically unstable on several SMAC maps.
The approach proposed in this paper lies between strictly monotone and fully unconstrained factorization. A single mixer over per-agent utilities is retained, but hard monotonicity is replaced with a soft regularizer that penalizes violations of the IGM structure.

2.7. Other MARL Methods and Communication

In addition to value factorization, cooperative MARL has been studied through policy gradient and actor–critic methods, hierarchical exploration, and explicit communication. Counterfactual Multi-Agent Policy Gradients (COMA) uses a centralized critic with a counterfactual baseline for each agent [29], while Multi-Agent Variational Exploration (MAVEN) augments QMIX with a hierarchical latent variable for multimodal exploration [30]. Architectures such as BiCNet [31] and more recent communication-focused surveys [32] model differentiable communication channels between agents.
These approaches are complementary to value factorization, as they explore different ways of injecting structure into cooperative MARL. R-QMIX follows the value-factorization line, focusing specifically on relaxing the monotonicity constraint while retaining decentralized greedy execution and a QMIX-compatible architecture.

3. Relaxed Monotonic QMIX Methodology

In this section, we develop Relaxed Monotonic QMIX (R-QMIX), a variant of QMIX that replaces the hard monotonicity constraint with a soft differentiable monotonicity penalty applied to the mixing network. All methods in our experiments share the same agent architecture, mixer architecture, optimizer, learning rate schedule, parameter groups, and gradient clipping settings; we adopt these training heuristics uniformly because they substantially improve stability on the harder SMAC maps. R-QMIX differs from this common baseline only in its learning objective, which adds the soft monotonicity term to the standard TD loss. For clarity, we refer to QMIX trained under this identical optimization setup as QMIX-Baseline. Unless otherwise noted, these optimization settings are fixed across maps.
For clarity, Table 1 summarizes the key differences between QMIX and R-QMIX at a high level. Figure 2 shows the architecture; the remainder of this section defines the regularizer, learning objective, and monotonicity coefficient.

3.1. R-QMIX Architecture

The centralized-training, decentralized-execution (CTDE) setting described in Section 2 is adopted. Each agent $a \in \{1, \ldots, n\}$ maintains a parametric utility function
$$Q_a(\tau_t^a, u_t^a; \theta_a) \in \mathbb{R},$$
implemented as a deep recurrent Q-network (DRQN) that conditions on the agent's action–observation history $\tau_t^a$ and its current action $u_t^a$.
At each time step $t$, the per-agent utilities are collected into a vector
$$\mathbf{Q}_t = \left( Q_1(\tau_t^1, u_t^1), \ldots, Q_n(\tau_t^n, u_t^n) \right) \in \mathbb{R}^n$$
and combined with the global state $s_t$ through a mixing network
$$Q_{\text{tot}}(s_t, \mathbf{u}_t; \theta_{\text{mix}}) = f_{\text{mix}}\left( \mathbf{Q}_t, s_t; \theta_{\text{mix}} \right),$$
where $f_{\text{mix}}$ is the original QMIX hypernetwork-based mixer.
In standard QMIX, the architecture of $f_{\text{mix}}$ is constrained so that $Q_{\text{tot}}$ is a monotone function of each $Q_a$, i.e., $\partial Q_{\text{tot}} / \partial Q_a \geq 0$ for all agents. This is enforced by restricting certain weights of the mixing network to be non-negative, which guarantees that the joint greedy action can be obtained by greedy action selection with respect to each $Q_a$.
R-QMIX keeps the same mixer parameterization but removes these hard sign constraints. Instead, monotonicity is encouraged by adding a smooth regularization term to the loss that penalizes violations of $\partial Q_{\text{tot}} / \partial Q_a \geq 0$. The decentralized execution policy remains the same as in QMIX: each agent acts greedily with respect to its local utility $Q_a$.

3.2. Soft Monotonicity Regularization

The key idea in R-QMIX is to softly enforce the monotonicity of $Q_{\text{tot}}$ with respect to each agent utility $Q_a$ via a penalty on the partial derivatives
$$g_t^a = \frac{\partial Q_{\text{tot}}(s_t, \mathbf{u}_t)}{\partial Q_a(\tau_t^a, u_t^a)}.$$
Intuitively, these slopes should be non-negative, and in practice strictly larger than a small margin $\eta > 0$ in order to avoid flat regions.
For each training batch, the mixer output $Q_{\text{tot}}$ is recomputed on the per-agent utilities and the tensor of partial derivatives $g_{b,t} = (g_{b,t}^1, \ldots, g_{b,t}^n)$ is obtained using automatic differentiation. The per-agent violation is defined as
$$v_{b,t}^a = \max\left( 0, \, \eta - g_{b,t}^a \right),$$
which is non-zero only when the slope falls below the margin $\eta$. Let $m_{b,t} \in \{0, 1\}$ be a mask indicating valid (non-padded, non-terminated) time steps in episode $b$ at time index $t$, and let $M = \{(b, t) : m_{b,t} = 1\}$ be the set of valid positions. Given a batch of $B$ episodes and $T$ time steps, the monotonicity loss is defined as
$$\mathcal{L}_{\text{mono}} = \frac{1}{|M| \, n} \sum_{(b,t) \in M} \sum_{a=1}^{n} \left( v_{b,t}^a \right)^p,$$
where $p \in \{1, 2\}$ controls whether an $\ell_1$-style ($p = 1$) or squared ($p = 2$) penalty is used. In practice, $p = 2$ is used, which smoothly emphasizes larger violations.
The partial derivatives g b , t a are computed in two interchangeable ways:
  • Autograd mode (default): The per-agent utilities Q a are treated as differentiable inputs to the mixer, and g b , t a is obtained via automatic differentiation of Q tot with respect to these inputs. This is efficient and stable in modern deep-learning frameworks.
  • Finite-difference mode (ablation): For each agent $a$, the utility $Q_a$ is perturbed by a small scalar $\delta$ and the slope is approximated as
$$g_{b,t}^a \approx \frac{Q_{\text{tot}}(s_t, \mathbf{u}_t; Q_a + \delta) - Q_{\text{tot}}(s_t, \mathbf{u}_t; Q_a)}{\delta}.$$
This variant isolates the regularizer from gradient-flow artifacts, but is more expensive; in the main experiments, the autograd version is used.
In both cases, the mask m b , t ensures that the regularizer is only applied to valid transitions in each episode.
In addition to contributing to the training loss, several diagnostics derived from $g_{b,t}^a$ are also recorded, including the fraction of slopes below the margin $\eta$ and the mean violation magnitude; these help to interpret the strength and effect of the regularizer over the course of training.
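The autograd mode can be sketched compactly; here `mixer` is any callable mapping per-agent utilities and state to $Q_{\text{tot}}$, and all names, shapes, and defaults are illustrative assumptions rather than the released implementation.

```python
import torch

def monotonicity_loss(mixer, agent_qs, state, mask, eta=0.01, p=2):
    """Soft monotonicity penalty L_mono: penalize slopes g_a = dQ_tot/dQ_a
    below the margin eta, averaged over valid (mask == 1) positions and
    agents. agent_qs: (batch, n_agents); mask: (batch,)."""
    # Treat the utilities as differentiable inputs to the mixer (sketch:
    # the penalty here trains the mixer, not the agent networks).
    qs = agent_qs.detach().requires_grad_(True)
    q_tot = mixer(qs, state)
    # One backward pass yields all per-agent slopes at once.
    (g,) = torch.autograd.grad(q_tot.sum(), qs, create_graph=True)
    v = torch.clamp(eta - g, min=0.0) ** p   # per-agent violation v_a
    v = v * mask.unsqueeze(-1)               # ignore padded time steps
    n_agents = qs.size(-1)
    return v.sum() / (mask.sum().clamp(min=1) * n_agents)
```

With a mixer whose slopes all exceed the margin, the penalty is exactly zero; a mixer with a negative slope for some agent produces a strictly positive loss, which is the signal the regularizer feeds back into training.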

3.3. TD Target and Learning Objective

R-QMIX follows the standard off-policy training procedure introduced in QMIX [10]. For each transition in a replayed episode $b$ at time step $t$, the agents receive a shared reward $r_{b,t}$ and a termination flag $z_{b,t} \in \{0, 1\}$. The target joint action–value is obtained by bootstrapping from the next state using the target networks:
$$V_{b,t+1} = Q_{\text{tot}}\left( s_{b,t+1}, \mathbf{u}_{b,t+1}^{*}; \theta^{-} \right),$$
where $\theta^{-}$ denotes the target parameters and $\mathbf{u}_{b,t+1}^{*}$ is selected via greedy or double Q-learning.
Because the implementation uses the same one-step TD target as the original QMIX, the learning target is
$$G_{b,t}^{(1)} = r_{b,t} + \gamma \, (1 - z_{b,t}) \, V_{b,t+1}.$$
Using a mask $m_{b,t}$ to ignore padded transitions, the TD loss over a batch is
$$\mathcal{L}_{\text{TD}} = \frac{1}{\sum_{b,t} m_{b,t}} \sum_{b,t} m_{b,t} \, \ell\left( Q_{\text{tot}}(s_{b,t}, \mathbf{u}_{b,t}; \theta) - G_{b,t}^{(1)} \right),$$
where $\ell(\cdot)$ is either the mean squared error (MSE) loss or the Huber loss (used on harder maps for increased robustness).
The R-QMIX learning objective adds the soft monotonicity penalty to the original QMIX TD loss:
$$\mathcal{L}_{\text{R-QMIX}} = \mathcal{L}_{\text{TD}} + \lambda_{\text{mono}}(t_{\text{env}}) \, \mathcal{L}_{\text{mono}},$$
where $\lambda_{\text{mono}}(t_{\text{env}})$ is a scheduled regularization coefficient described in Section 3.5.
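Assembling the full objective from the masked TD error and the monotonicity penalty might look as follows; the function signature, tensor shapes, and the `use_huber` switch are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def rqmix_loss(q_tot, targets, mask, mono_loss, lam, use_huber=False):
    """R-QMIX objective: masked one-step TD loss plus the scheduled soft
    monotonicity penalty. Huber optionally replaces MSE on harder maps."""
    targets = targets.detach()  # no gradient through bootstrap targets
    if use_huber:
        per_step = F.huber_loss(q_tot, targets, reduction="none")
    else:
        per_step = (q_tot - targets) ** 2
    # Average only over valid (non-padded) transitions.
    td = (per_step * mask).sum() / mask.sum().clamp(min=1)
    return td + lam * mono_loss
```

Setting `lam = 0` recovers the QMIX-Baseline objective under the shared optimization setup, which is exactly the ablation considered in Section 4.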

3.4. Interpreting the Monotonicity Regularizer

The monotonicity penalty in R-QMIX can be viewed as a soft relaxation of the individual-global-max (IGM) constraint that underpins QMIX [10]. In strictly monotone value factorization, the condition
$$\frac{\partial Q_{\text{tot}}}{\partial Q_a} \geq 0 \quad \forall a$$
ensures that independently greedy per-agent policies are consistent with the joint greedy policy. The regularizer does not enforce this constraint exactly, but instead penalizes violations of the margin condition $g_t^a \geq \eta$ for the local slopes $g_t^a = \partial Q_{\text{tot}} / \partial Q_a$. In regions of the state–action space where a monotone factorization is sufficient to explain the data, the optimizer can reduce the TD error while driving $\mathcal{L}_{\text{mono}} \to 0$, effectively recovering an approximate IGM property. Conversely, when accurately fitting the Bellman target requires non-monotonic interactions, the model is free to trade off a small increase in $\mathcal{L}_{\text{mono}}$ against a larger decrease in TD error.
From a regularization perspective, the monotonicity penalty defines a smooth prior over joint value functions that discourages pathological and highly non-cooperative couplings between agent utilities. Large negative slopes $g_t^a \ll 0$ correspond to regimes where small local improvements in an individual agent's utility can sharply decrease the joint value, which is typically undesirable in cooperative settings and can destabilize learning. By softly penalizing such configurations, R-QMIX biases the mixer toward representations where $Q_{\text{tot}}$ varies smoothly and predominantly positively with each $Q_a$, without forbidding localized non-monotonic structure when it is genuinely needed. The regularization coefficient $\lambda_{\text{mono}}(t_{\text{env}})$ then plays the role of a tunable bias–variance knob: larger values enforce a stronger cooperative prior and more QMIX-like behavior, while smaller values move the model closer to an unconstrained factorization.
This perspective helps to explain the empirical behavior observed in Section 4, where R-QMIX remains close to a monotone factorization on easy maps while exploiting non-monotonic structure on the harder scenarios.

Decentralized Execution and IGM Under Relaxed Monotonicity

QMIX enforces a strictly monotone mixing function, which guarantees the individual-global-max (IGM) property: the joint greedy action that maximizes $Q_{\text{tot}}$ is consistent with the composition of per-agent greedy actions. R-QMIX relaxes this constraint by replacing hard sign constraints with a soft penalty on local partial derivatives $g^a = \partial Q_{\text{tot}} / \partial Q_a$, encouraging $g^a \geq \eta$ in expectation rather than enforcing global monotonicity everywhere. As a result, IGM is not theoretically guaranteed under R-QMIX, and in principle there may exist states where decentralized greedy execution is suboptimal relative to the joint argmax. In practice, we treat monotonicity as a regularizer that trades strict consistency for improved optimization and credit assignment on hard scenarios. To make this tradeoff transparent, we report the frequency of monotonicity violations observed during training.

3.5. Scheduling the Monotonicity Coefficient

To trade off between convergence behavior and expressiveness over the course of training, the strength of the monotonicity regularizer is varied as a function of environment time steps. The schedule is parameterized by four scalars:
$$\lambda_{\text{mono}}^{\text{start}}, \quad \lambda_{\text{mono}}^{\text{end}}, \quad t_{\text{warmup}}, \quad t_{\text{anneal}}.$$
We define a normalized progress variable
$$\alpha(t_{\text{env}}) = \min\left( 1, \, \frac{\max\{0, \, t_{\text{env}} - t_{\text{warmup}}\}}{t_{\text{anneal}}} \right),$$
then set the effective weight $\lambda_{\text{mono}}(t_{\text{env}})$ as follows. For the linear schedule,
$$\lambda_{\text{mono}}(t_{\text{env}}) = (1 - \alpha) \, \lambda_{\text{mono}}^{\text{start}} + \alpha \, \lambda_{\text{mono}}^{\text{end}},$$
while for the cosine schedule,
$$\lambda_{\text{mono}}(t_{\text{env}}) = \lambda_{\text{mono}}^{\text{end}} + \frac{1}{2}\left( \lambda_{\text{mono}}^{\text{start}} - \lambda_{\text{mono}}^{\text{end}} \right)\left( 1 + \cos(\pi \alpha) \right).$$
In this work, the linear schedule is used by default. This formulation allows the practitioner to, for example, start with a relatively strong monotonicity prior and gradually relax it, or conversely to introduce the regularizer only after the value function has roughly converged.
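The two schedules are a direct transcription of the formulas above; the function and argument names below are our own shorthand.

```python
import math

def lambda_mono(t_env, lam_start, lam_end, t_warmup, t_anneal, mode="linear"):
    """Scheduled monotonicity coefficient: hold lam_start for t_warmup
    environment steps, then anneal toward lam_end over t_anneal steps."""
    # Normalized progress alpha in [0, 1]
    alpha = min(1.0, max(0.0, t_env - t_warmup) / t_anneal)
    if mode == "linear":
        return (1.0 - alpha) * lam_start + alpha * lam_end
    # Cosine schedule: smooth interpolation from lam_start to lam_end
    return lam_end + 0.5 * (lam_start - lam_end) * (1.0 + math.cos(math.pi * alpha))
```

For example, with `lam_start=1.0`, `lam_end=0.1`, `t_warmup=1000`, and `t_anneal=9000`, the coefficient stays at 1.0 for the first 1000 steps, reaches 0.55 halfway through the anneal, and settles at 0.1 thereafter.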

3.6. Optimization and Training Heuristics

The remaining training details follow the implementation used in the experiments:
  • Optimizer and parameter groups: The Adam optimizer [33] is used with separate parameter groups for agents and mixer. The mixer parameters are trained with a slightly lower learning rate and a small weight decay, which stabilizes the training on harder maps.
  • Learning rate schedule: The learning rate of each parameter group is decayed by a constant factor at a small number of predefined environment-step milestones.
  • Target networks: Separate target networks are maintained for both the agents and the mixer. These are updated either by hard copies at fixed environment-step intervals or by Polyak averaging with coefficient τ , depending on the experiment.
  • Double Q-learning and masking: Double Q-learning is optionally used when selecting bootstrap actions, and standard episode masks are applied to ignore padded transitions and to stop bootstrapping beyond terminal states.
  • Gradient clipping: Gradients are clipped to a fixed norm to prevent exploding updates, which is particularly important on super-hard SMAC maps.
Together, the soft monotonicity regularizer, its schedule, and these training heuristics define the full R-QMIX training procedure used in the experiments, and are crucial for achieving robust convergence on the harder SMAC maps.
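Of the heuristics above, the Polyak-averaged target update admits a particularly compact sketch; the function name is ours, and `tau` plays the role of the averaging coefficient in the text.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def polyak_update(target, online, tau=0.005):
    """Soft target-network update:
    theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```

Calling this once per training step with a small `tau` makes the target parameters a slowly moving average of the online parameters, in contrast to hard copies at fixed environment-step intervals.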

3.7. QTRAN Baseline

For QTRAN, we use the open-source PyMARL implementation and its default hyperparameters, which are widely used as a reference configuration. Because QTRAN’s objective and auxiliary losses differ from QMIX-style methods, we do not retune it to match our QMIX optimization heuristics; our goal is a representative reference baseline rather than a best-possible QTRAN result. All algorithms share the same six random seeds (Table A4) on each SMAC map.

4. Experimental Results

This section empirically evaluates R-QMIX on the StarCraft Multi-Agent Challenge (SMAC) benchmark. Four representative scenarios are considered: the easy homogeneous map 3m, the mixed-unit map MMM2, the hard map 6h vs. 8z, and the super-hard map 27m vs. 30m. Unless otherwise stated, the mean test win rate over multiple random seeds is reported as a function of environment time steps, and R-QMIX is compared against the original QMIX baseline. For each scenario, all algorithms are tested on the same six randomly selected seeds listed in Table A4.
Within these curves, shaded regions denote the standard deviation across seeds. For each map, quarterly statistics over the training horizon (one statistic per 25% slice of time steps) summarize early, middle, and late training performance, as reported in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8.
In this empirical study, the term convergence is used along two axes: (i) sample efficiency, measured by how quickly the mean test win rate improves as a function of environment steps t env , and (ii) asymptotic performance, measured by the maximum or final-quarter mean test win rate attained over training. An algorithm is said to exhibit better convergence on a given scenario if it reaches high win rates with fewer environment steps and/or achieves a higher final-quarter mean win rate with lower variance across seeds.

4.1. Experimental Environments

All experiments in this work are conducted on the StarCraft Multi-Agent Challenge (SMAC) benchmark [11], which provides a suite of micromanagement scenarios based on StarCraft II. Each scenario defines a partially observable, fully cooperative task in which a fixed team of allied units must defeat a fixed team of enemy units.
The implementation is based on the open-source PyMARL framework (https://github.com/oxwhirl/pymarl, accessed on 5 January 2025) and extends it with an implementation of R-QMIX. All experiments were conducted on a desktop workstation equipped with an NVIDIA RTX 4080 Super GPU, an AMD Ryzen 7950X3D CPU, and 128 GB of RAM. Training and evaluation were carried out using a dedicated Conda environment that pins package versions for reproducibility.
For transparency and reproducibility, the full code and statistics will be released in a public repository once the paper is accepted (https://github.com/liamobbrien1/R-QMIX-A-Regularized-Value-Factorization-Approach-to-MARL (accessed on 3 January 2026)).

4.2. Overall Performance Across SMAC Maps

Figure 3, Figure 4, Figure 5 and Figure 6 illustrate the set of tasks considered in this work. For each map, QMIX and R-QMIX are trained under identical network architectures and training hyperparameters, differing only in the presence of the monotonicity regularizer and its schedule. R-QMIX generally matches QMIX in simpler scenarios while providing clearer benefits in more challenging maps. Detailed results for each scenario are presented in the following subsections. To characterize how often the learned mixer violates the soft monotonicity margin during training, Figure 7, Figure 8, Figure 9 and Figure 10 report the fraction of partial-derivative slopes $g^a = \partial Q_{\text{tot}} / \partial Q_a$ that fall below $\eta = 0.01$ on 3m, MMM2, 6h vs. 8z, and 27m vs. 30m, respectively. Table 9 reports the mean ± std (across seeds) number of environment steps and wall-clock time required to reach 50%, 70%, and 90% win-rate thresholds on 27m vs. 30m.

4.3. Easy Scenario: 3m

The 3m map consists of three marines on each side, and is commonly used as a sanity check for multi-agent algorithms due to its relatively simple coordination structure. Figure 3 shows the test win rate of QMIX and R-QMIX in this scenario.
In this easy scenario, both QMIX and R-QMIX achieve high win rates early in the training. Quarterly statistics on the training horizon confirm that the average performance of R-QMIX is comparable to that of QMIX, with overlapping standard deviations and similar final-quarter mean win rates. This is desirable, as R-QMIX does not sacrifice convergence on simple tasks while still providing benefits on more complex ones.

4.4. Mixed-Unit Scenario: MMM2

The MMM2 scenario features a heterogeneous team of marines, marauders, and medivacs, and is known to be significantly more challenging than 3m. It requires both spatial coordination and non-trivial credit assignment among different unit types.
In MMM2, R-QMIX achieves stronger convergence than QMIX across most of the training horizon. In particular, the quarterly averages in the second half of training indicate both a faster improvement in the mean win rate (higher sample efficiency) and higher final-quarter mean win rates with reduced variance for R-QMIX. This suggests that relaxing the strict monotonicity constraint while softly encouraging monotone behavior allows the mixing network to better capture the complex credit assignment structure present in heterogeneous teams.

4.5. Hard Scenario: 6h vs. 8z

The 6h vs. 8z scenario pits six hydralisks against eight zealots and is considered a hard SMAC map due to the asymmetric unit types and the need for precise kiting and focus fire. This makes it a useful testbed for evaluating the robustness of value factorization methods under more intricate coordination requirements.
The 6h vs. 8z scenario fills the gap between the relatively easy 3m and the super-hard 27m vs. 30m, and its quarterly averages highlight how the relaxed monotonicity constraint interacts with asymmetric unit dynamics. In terms of the convergence criteria defined above (sample efficiency and asymptotic win rate), both QMIX and QTRAN effectively fail to converge, maintaining near-zero win rates throughout training, whereas R-QMIX converges to a non-trivial policy with a final-quarter mean win rate of 57.5 %.

4.6. Super-Hard Scenario: 27m vs. 30m

The super-hard 27m vs. 30m scenario has been widely used as a stress test for MARL algorithms due to its large number of units and highly nontrivial coordination requirements. This map is particularly sensitive to optimization behavior and the representational capacity of the mixer.
On 27m vs. 30m, R-QMIX achieves substantially higher win rates than QMIX while exhibiting smoother and more reliable convergence, as reflected by both instantaneous win rate curves and quarterly statistics. This result supports the central claim that softly enforcing monotonicity via a gradient-based regularizer can provide a better tradeoff between expressiveness and convergence on complex multi-agent tasks.
An ablation experiment with the same training heuristics but no monotonicity regularization (setting λ mono = 0 ) reduces the final-quarter mean win rate on 27m vs. 30m from 96.6 % to 21.6 % (Appendix B), confirming that the soft monotonicity penalty is the primary driver of the observed performance gains.

4.7. Summary of Empirical Findings

Across all four SMAC scenarios, R-QMIX exhibits improved empirical convergence in the sense defined at the beginning of this section, i.e., both in terms of sample efficiency and asymptotic performance.
  • Easy map (3m). On 3m, both QMIX and R-QMIX quickly reach near-optimal performance. The final-quarter mean win rates are 97.4 % for QMIX and 98.2 % for R-QMIX (Table 2), with overlapping standard deviations. R-QMIX learns slightly faster in the early quarters (e.g., 67.5 % vs. 30.0 % in Quarter 1), but the main takeaway is that softening the monotonicity constraint does not harm convergence on simple cooperative tasks.
  • Mixed-unit map (MMM2). On MMM2, the benefits of relaxed monotonicity are pronounced. QMIX reaches a final-quarter mean win rate of 42.3 % , while R-QMIX attains 97.1 % with substantially lower variance (Table 4). R-QMIX also learns much faster; by Quarter 2 it already achieves a 76.2 % mean win rate, versus 0.2 % for QMIX. QTRAN remains below both methods, ending at 24.7 % .
  • Hard map (6h vs. 8z). On the hard 6h vs. 8z scenario, QMIX and QTRAN effectively fail to learn, staying at a near-zero win rate throughout training. In contrast, R-QMIX converges to a non-trivial policy with a final-quarter mean win rate of 57.5 % (Table 6) after gradually improving from 0.5 % in Quarter 1 and 7.1 % in Quarter 2. This highlights that soft monotonicity can unlock policies that are inaccessible to strictly monotone or unconstrained-but-unstable baselines.
  • Super-hard map (27m vs. 30m). The super-hard 27m vs. 30m scenario is the most challenging setting in this study. Here, R-QMIX reaches a final-quarter mean win rate of 96.6 % , while QMIX plateaus at 58.0 % and QTRAN at 35.2 % (Table 8). The gap opens early in training: by Quarter 2, R-QMIX already achieves a 73.3 % mean win rate, compared to 0.1 % for QMIX and 4.3 % for QTRAN.
An ablation experiment on 27m vs. 30m with λ mono = 0 confirms that the soft monotonicity penalty is crucial for these gains. With identical architectures and optimization heuristics but no regularization, the final-quarter mean win rate drops from 96.6 % to 21.6 % (Table A5). This suggests that relaxed monotonicity, rather than incidental tuning, is the primary driver of the improved convergence.
Taken together, these results support the central claim of this paper: replacing hard monotonicity with a soft differentiable regularizer yields a value factorization method that matches QMIX on easy tasks and substantially improves both sample efficiency and asymptotic performance on challenging SMAC scenarios while remaining compatible with decentralized greedy execution.
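For concreteness, the quarterly statistics reported throughout this section can be computed from a logged evaluation curve as follows. This is a minimal sketch under stated assumptions: the function name `quarterly_stats` and the equal-quarter split are illustrative, not a verbatim excerpt of the PyMARL logging code.

```python
from statistics import mean, pstdev

def quarterly_stats(win_rates):
    """Split an evaluation win-rate curve into four quarters of the
    training horizon and report (mean, std) for each quarter."""
    n = len(win_rates)
    q = n // 4
    stats = []
    for i in range(4):
        # the last quarter absorbs any remainder evaluation points
        chunk = win_rates[i * q:(i + 1) * q] if i < 3 else win_rates[3 * q:]
        stats.append((mean(chunk), pstdev(chunk)))
    return stats
```

A curve that improves over training yields increasing quarterly means, matching the presentation style of the tables referenced in this section.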

5. Discussion

5.1. Summary of Findings

The experiments indicate that strict adherence to the monotonicity constraint in QMIX is not always necessary for effective decentralized execution, and can limit performance on tasks with strong non-monotonic interactions between agents. By introducing a soft monotonicity regularizer, R-QMIX retains the standard QMIX agent architecture and CTDE pipeline while expanding the effective hypothesis class of the mixer. Practically, the largest gains appear in scenarios where classical value factorization methods typically struggle.
In particular, on the mixed-unit MMM2 and super-hard 27m vs. 30m maps, R-QMIX consistently achieves high final performance in our experiments, whereas QMIX either fails to learn strong policies or plateaus at substantially lower win rates and QTRAN exhibits instability and weaker convergence. Overall, these results suggest that a modest change to the learning objective can reduce a meaningful portion of the expressiveness gap to more complex factorization schemes while preserving the simplicity and scalability of the original QMIX architecture.

5.2. Why Soft Monotonicity Helps: Optimization and Representation

5.2.1. Optimization Perspective

From an optimization perspective, the regularizer acts as a smoothing term on the joint value landscape. By penalizing strong negative local slopes of Q tot with respect to individual utilities, it discourages sharp antagonistic couplings between agents early in training when TD targets are noisy and Q-values are poorly shaped. This bias can prevent the mixer from overfitting to spurious and brittle interactions that undermine cooperative credit assignment, leading to faster and more reliable convergence in practice.

5.2.2. Representation Perspective

From a representation perspective, soft monotonicity provides a continuum between strictly monotone value factorization (QMIX) and an unconstrained joint action–value model. R-QMIX occupies an intermediate regime: it is more expressive than QMIX due to the removal of hard sign constraints, yet retains the same decentralized greedy execution rule as QMIX and remains structured enough to train stably in practice. Strong performance on MMM2 and 27m vs. 30m suggests that this intermediate regime is well matched to the complex coordination patterns in SMAC.

5.3. Role of the Regularizer: Evidence from Ablations

The ablation with λ mono = 0 on 27m vs. 30m further supports this interpretation. When the regularizer is removed while keeping the remaining training heuristics fixed, performance degrades sharply (final-quarter mean win rate drops from 96.6 % to 21.6 % ). This result suggests that the monotonicity penalty itself—not incidental implementation choices—accounts for most of the observed gains on the hardest scenario.

5.4. Mechanistic Comparison to Weighted QMIX and QPLEX on Non-Monotonic Tasks

Weighted QMIX [24] and QPLEX [25] were proposed to expand the representational capacity of monotonic value factorization beyond the original QMIX formulation [10]. To situate R-QMIX relative to these approaches on non-monotonic tasks, we provide the following qualitative mechanistic comparison.

5.4.1. QMIX and the Source of the Limitation

QMIX enforces the individual-global-max (IGM) property by constraining the mixing function so that Q tot / Q a 0 for all agents a [10]. This guarantees that decentralized greedy action selection is consistent with the joint greedy action under the learned factorization. However, the same constraint restricts the hypothesis class and can underfit tasks in which an agent’s effective contribution is context-dependent or exhibits interference (i.e., locally increasing Q a can decrease Q tot in some regions).

5.4.2. R-QMIX: Relaxing Monotonicity via a Soft Constraint on Local Slopes

R-QMIX retains the hypernetwork-based mixer parameterization but replaces hard non-negativity constraints with a penalty on local slopes g a = Q tot / Q a falling below a margin. Mechanistically, this introduces a continuous tradeoff between approximate monotonicity and expressive non-monotonic couplings: when a monotone factorization suffices, the optimizer can reduce TD error while driving the monotonicity loss toward zero; when accurately fitting Bellman targets requires localized non-monotonic interactions, the mixer can violate the slope condition at a controlled cost. Importantly, this approach directly controls the degree of monotonicity violation through λ mono ( t env ) without changing the decentralized execution rule.
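The margin penalty described above can be sketched as follows under stated assumptions: a generic callable `mixer` stands in for the hypernetwork mixing network, and finite differences stand in for the autograd partial derivatives used in practice.

```python
def mono_penalty(mixer, q_agents, margin=0.01, eps=1e-4):
    """Hinge penalty on local slopes g_a = dQ_tot/dQ_a that fall below
    `margin`. `mixer` maps a list of per-agent utilities to a scalar
    Q_tot; the actual method differentiates the mixing network with
    autograd, while finite differences keep this sketch dependency-free."""
    base = mixer(q_agents)
    total = 0.0
    for a in range(len(q_agents)):
        bumped = list(q_agents)
        bumped[a] += eps
        g_a = (mixer(bumped) - base) / eps  # local slope w.r.t. agent a
        total += max(0.0, margin - g_a)     # zero cost once g_a >= margin
    return total / len(q_agents)

# A monotone mixing incurs (near-)zero penalty, while an antagonistic
# coupling (negative slope for agent 1) is charged in proportion to the
# violation; this term is added to the TD loss with weight lambda_mono.
monotone = lambda q: 2.0 * q[0] + 0.5 * q[1]
antagonistic = lambda q: q[0] - q[1]
```

The hinge form makes the tradeoff explicit: the optimizer pays nothing for approximately monotone slopes and a cost proportional to the violation otherwise.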

5.4.3. Weighted QMIX: Expanding Capacity While Preserving IGM

Weighted QMIX expands the representational capacity of monotone factorization while maintaining an IGM-style guarantee [24]. Mechanistically, it enlarges the class of monotone joint value functions representable by the mixer, which can mitigate underfitting while preserving decentralized greedy consistency. In contrast, R-QMIX explicitly permits localized non-monotonic structure when needed, trading the theoretical IGM guarantee for additional expressiveness.

5.4.4. QPLEX: Restructuring Factorization for Greater Expressiveness

QPLEX introduces a duplex dueling factorization designed to represent a broader class of joint action–values while retaining an IGM-consistent structure in the relevant components [25]. Mechanistically, QPLEX increases expressiveness by modifying the decomposition and mixing of utilities/advantages rather than by directly relaxing the sign constraints on mixer weights. Compared to R-QMIX, QPLEX aims to better approximate complex interactions within an IGM-consistent framework, whereas R-QMIX adopts a direct relaxation that can represent localized monotonicity violations when strictly monotone representations are insufficient.

5.4.5. When Might Each Approach Help on “Non-Monotonic” Scenarios

If the task can be well-approximated by a richer monotone factorization, then weighted QMIX and QPLEX may improve performance while preserving decentralized-greedy consistency. In settings where optimal behavior requires genuine non-monotonic couplings between agent utilities (e.g., interference patterns or context-dependent negative contributions), R-QMIX provides an explicit mechanism to represent such structure, with  λ mono ( t env ) controlling how often and how strongly the mixer departs from the monotone regime.
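As an illustration of the schedule λ mono ( t env ), a linear ramp between the start and end values listed in Table A1 might look like the following. The warmup/anneal parameterization is an assumption consistent with Table A1, not a verbatim excerpt of the training code.

```python
def lambda_mono(t_env, start=0.3, end=0.6, t_warmup=0, t_anneal=2_000_000):
    """Linearly ramp the monotonicity coefficient from `start` to `end`
    over `t_anneal` environment steps after an optional warmup period.
    Defaults mirror the 27m vs. 30m R-QMIX row of Table A1."""
    if t_env <= t_warmup:
        return start
    frac = min(1.0, (t_env - t_warmup) / t_anneal)
    return start + frac * (end - start)
```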

5.4.6. Optimality of Decentralized Greedy Execution Under Monotonicity Violations

QMIX guarantees IGM only when the mixing function is globally monotone in each agent utility, i.e.,  Q tot / Q a 0 for all a and relevant inputs [10]. Because R-QMIX relaxes this constraint, decentralized greedy execution is not theoretically guaranteed to be optimal in states where monotonicity is violated (e.g., where some local slopes g a = Q tot / Q a become negative). In such cases, the joint maximizer arg max u Q tot ( s , u ) may differ from the composition of per-agent greedy actions. In practice, R-QMIX treats monotonicity as a regularizer: the coefficient λ mono controls a tradeoff between approximate IGM consistency and additional expressiveness, and we monitor the frequency and magnitude of slope violations to characterize how close the learned mixer remains to the monotone regime.
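The consistency question above can be made concrete with a brute-force check over small joint action spaces. This is an illustrative sketch: the callable `mixer` and the exhaustive enumeration are assumptions, since the paper's diagnostic tracks slope violations rather than the joint argmax.

```python
from itertools import product

def greedy_matches_joint_argmax(q_agents, mixer):
    """Return True iff composing per-agent greedy actions equals the joint
    argmax of Q_tot = mixer(...), i.e., IGM consistency holds here.

    q_agents: per-agent lists of action values; mixer: callable on the
    chosen per-agent utilities. Brute force is feasible only for tiny
    action sets, which suffices for a diagnostic illustration."""
    decentralized = tuple(max(range(len(qa)), key=qa.__getitem__)
                          for qa in q_agents)
    joint = max(product(*(range(len(qa)) for qa in q_agents)),
                key=lambda u: mixer([qa[ua] for qa, ua in zip(q_agents, u)]))
    return decentralized == joint
```

With a positively weighted sum the check always passes; with a negative slope for one agent the decentralized greedy tuple can differ from the joint argmax, which is exactly the failure mode the λ mono penalty suppresses.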

5.5. Limitations and Future Work

There are several limitations and directions for future work. First, this study considers only value-based methods with discrete actions; extending similar regularization ideas to actor–critic frameworks and continuous-action MARL would be valuable [29,30]. Second, the evaluation focuses on a limited set of SMAC maps; broader testing on diverse benchmarks—including real-world multi-robot settings—would provide a more comprehensive assessment.
Third, the current regularizer uses a margin-based penalty on local partial derivatives. Alternative formulations (e.g., Huber-style penalties, different margins/schedules, or constraints that incorporate higher-order structure) may yield further improvements. Fourth, while we report diagnostics on monotonicity violations, a direct test-time IGM-consistency evaluation (quantifying mismatch between decentralized greedy execution and the joint argmax of Q tot ) remains an important direction for future work.
Finally, the empirical comparison is restricted to classical value-factorization baselines (QMIX and QTRAN). Evaluating R-QMIX against more recent methods such as weighted QMIX, QPLEX, and DCG on a broader suite of tasks [24,25,27] is an important next step.

6. Conclusions

This paper introduces Relaxed Monotonic QMIX (R-QMIX), an extension of QMIX that replaces hard monotonicity constraints with a soft differentiable regularizer. R-QMIX maintains the architecture and training pipeline of QMIX but allows the mixing network to adopt non-monotonic configurations when they significantly improve the fit to Bellman targets.
On the StarCraft Multi-Agent Challenge, R-QMIX matches QMIX on an easy map while providing substantial gains in both sample efficiency and final performance (i.e., improved empirical convergence) on more challenging scenarios such as MMM2 and 27m vs. 30m. These results demonstrate that soft monotonicity regularization is a promising direction for improving value factorization methods in cooperative MARL.
Future work will extend this framework in two main directions: first, we aim to combine relaxed monotonicity with hierarchical and heterogeneous value factorization, in which different subteams (for example, air, ground, or underwater robots) have their own mixers coupled by a higher-level coordinator; second, we plan to evaluate R-QMIX and its hierarchical variants on real-world multi-robot platforms, where partial observability, actuation delays, and safety constraints play a central role. Taken together, these directions turn R-QMIX from a single algorithm into the foundation of a broader line of relaxed-structure value factorization methods for cooperative MARL.

Author Contributions

Conceptualization, L.O. and H.X.; methodology, L.O.; software, L.O.; validation, L.O. and H.X.; formal analysis, L.O.; investigation, L.O.; resources, H.X.; data curation, L.O.; writing—original draft preparation, L.O.; writing—review and editing, L.O. and H.X.; visualization, L.O.; supervision, H.X.; project administration, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science Foundation under Grant 2144646 and in part by the Army Research Office through the Cooperative Agreement under Grant W911NF-24-2-0133.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Trained models, configuration files, and training logs will be made available at https://github.com/liamobbrien1/R-QMIX-A-Regularized-Value-Factorization-Approach-to-MARL (accessed on 3 January 2026).

Acknowledgments

The authors thank the members of the Autonomous Systems Lab at the University of Nevada, Reno for helpful discussions. During the preparation of this manuscript, the authors used GPT-4 to check grammar and adherence to MDPI formatting and standards. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
RL: Reinforcement Learning
MARL: Multi-Agent Reinforcement Learning
MDP: Markov Decision Process
POMDP: Partially Observable Markov Decision Process
Dec-POMDP: Decentralized Partially Observable Markov Decision Process
CTDE: Centralized Training, Decentralized Execution
VDN: Value Decomposition Networks
QMIX: Monotonic Mixing Network
QTRAN: Q-Value Transformation
QPLEX: Duplex Dueling Multi-Agent Q-Learning
DCG: Deep Coordination Graphs
R-QMIX: Relaxed Monotonic QMIX (this work)
DQN: Deep Q-Network
DRQN: Deep Recurrent Q-Network
TD: Temporal Difference
SMAC: StarCraft Multi-Agent Challenge
MSE: Mean Squared Error
COMA: Counterfactual Multi-Agent policy gradients
MAVEN: Multi-Agent Variational Exploration
WR: Win Rate

Appendix A. Additional Experimental Details

Table A1. Key training hyperparameters used in the SMAC experiments. All runs share the same network architecture; differences between rows correspond to map-specific or algorithm-specific settings. "N/A" indicates that the hyperparameter is not used by the algorithm.

| Map | Algorithm | t_max | Batch | γ | η | lr_mixer | t_warmup | t_anneal | λ_mono (start → end) | ε (start → end) |
|---|---|---|---|---|---|---|---|---|---|---|
| 3m | QMIX | 2 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 3m | R-QMIX | 2 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 1.0 → 1.0 | 1.0 → 0.05 |
| 3m | QTRAN | 2 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| MMM2 | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| MMM2 | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 0.5 → 0.8 | 1.0 → 0.05 |
| MMM2 | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 6h vs. 8z | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 6h vs. 8z | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 1.6 → 1.6 | 1.0 → 0.05 |
| 6h vs. 8z | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 27m vs. 30m | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 27m vs. 30m | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 0.3 → 0.6 | 1.0 → 0.05 |
| 27m vs. 30m | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
Table A2. Optimizer and optimization schedule used in all experiments (Adam with parameter groups).

| Setting | Value |
|---|---|
| Optimizer | Adam |
| Parameter groups | agents (lr), mixer (lr_mixer + weight decay) |
| Adam (β1, β2) | (0.9, 0.999) |
| Adam ε | 10⁻⁸ |
| Agent learning rate (lr) | 5 × 10⁻⁴ |
| Mixer learning rate (lr_mixer) | 2.5 × 10⁻⁴ |
| Weight decay (mixer params only) | 1 × 10⁻⁵ |
| Gradient clipping (ℓ2 norm) | clipped to 10.0 |
| LR milestones (t_env) | {2.0 M, 3.5 M, 5.0 M} |
| LR decay factor (per milestone) | 0.5 |
| Target update cadence (environment steps) | target_update_interval_t = 20,000 |
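The milestone schedule in Table A2 corresponds to a simple step decay, which might be implemented as follows (a sketch; the function name `mixer_lr` is illustrative, not taken from the training code).

```python
def mixer_lr(t_env, base_lr=2.5e-4,
             milestones=(2_000_000, 3_500_000, 5_000_000), decay=0.5):
    """Halve the mixer learning rate at each environment-step milestone
    listed in Table A2."""
    passed = sum(t_env >= m for m in milestones)  # milestones already crossed
    return base_lr * decay ** passed
```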
Table A3. Exploration and evaluation protocol used in SMAC runs. Unless noted, settings are identical across maps and algorithms.

| Setting | Value |
|---|---|
| Action selector | ε-greedy |
| ε schedule | 1.0 → 0.05 over t_anneal env steps |
| t_anneal (3m, R-QMIX only) | 2 × 10⁵ |
| t_anneal (all other maps) | 2 × 10⁶ |
| t_anneal (QTRAN) | 5 × 10⁴ |
| Evaluation interval | every test_interval = 10,000 env steps |
| Evaluation episodes | test_nepisode = 32 |
| Evaluation policy | greedy (test_greedy=True) |
Table A4. Random seeds used for all experiments. The same six seeds are reused across all algorithms and SMAC maps.

| Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 |
|---|---|---|---|---|---|
| 551,715,561 | 816,329,254 | 246,463,945 | 802,630,730 | 906,766,130 | 601,271,281 |

Appendix B. Ablation: 27m vs. 30m with λ mono = 0

Figure A1. Test win rate on 27m vs. 30m when the monotonicity regularization coefficient is set to λ mono = 0 (no regularization). This ablation isolates the effect of the soft monotonicity penalty from the other training heuristics used in R-QMIX. Dashed lines show quarterly mean, shadows show quarterly standard deviation.
Table A5. Ablation on R-QMIX on 27m vs. 30m with λ mono = 0 (no monotonicity regularization).

| Quarter | Mean Win Rate | Std |
|---|---|---|
| 1 | 0.037 | 0.037 |
| 2 | 0.112 | 0.029 |
| 3 | 0.180 | 0.031 |
| 4 | 0.216 | 0.036 |
Figure A2. Test win rate on MMM2 with η sweep. Map seed: 601,271,281.
Table A6. Quarterly mean win rate and within-run standard deviation for seed 601,271,281 across η settings. Each cell reports Mean(win) ± Std for that quarter. Bold indicates the highest mean win rate within each quarter.

| Setting | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
|---|---|---|---|---|
| η = 0.00 | 0.000 ± 0.000 | 0.026 ± 0.046 | **0.652 ± 0.260** | 0.924 ± 0.059 |
| η = 0.01 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.002 ± 0.008 | 0.124 ± 0.131 |
| η = 0.02 | 0.000 ± 0.000 | **0.049 ± 0.082** | 0.634 ± 0.230 | **0.933 ± 0.060** |
| η = 0.03 | 0.000 ± 0.000 | **0.049 ± 0.090** | 0.574 ± 0.187 | 0.892 ± 0.072 |
| η = 0.04 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |

Appendix C. Algorithm Result Comparison Table

Table A7. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on the 3m map.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +37.5 | +0.199 | +43.9 | +0.117 |
| 2 | +31.1 | −0.139 | +30.3 | −0.133 |
| 3 | +9.0 | −0.057 | +9.6 | −0.059 |
| 4 | +0.8 | −0.006 | +1.6 | −0.008 |
Table A8. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on MMM2.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +16.2 | +0.167 | +0.0 | +0.000 |
| 2 | +76.0 | +0.120 | −0.1 | −0.002 |
| 3 | +84.7 | −0.034 | −5.0 | −0.031 |
| 4 | +54.8 | −0.133 | −17.6 | −0.038 |
Table A9. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on 6h vs. 8z.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +0.5 | +0.008 | +0.0 | +0.000 |
| 2 | +7.1 | +0.043 | +0.2 | +0.005 |
| 3 | +30.1 | +0.090 | +0.5 | +0.008 |
| 4 | +57.5 | +0.062 | +1.4 | +0.014 |
Table A10. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on 27m vs. 30m.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +16.8 | +0.167 | +0.2 | +0.004 |
| 2 | +73.2 | +0.126 | +4.2 | +0.025 |
| 3 | +72.7 | −0.109 | −1.2 | −0.057 |
| 4 | +38.6 | −0.066 | −22.8 | −0.026 |

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  2. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  3. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  4. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  5. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
  6. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  7. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer: Cham, Switzerland, 2016. [Google Scholar]
  8. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6390. [Google Scholar]
  9. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [Google Scholar]
  10. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304. [Google Scholar]
  11. Samvelyan, M.; Rashid, T.; de Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.; Hung, C.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Montreal, QC, Canada, 13–17 May 2019; pp. 2186–2188. [Google Scholar]
  12. Panait, L.; Luke, S. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst. 2005, 11, 387–434. [Google Scholar] [CrossRef]
  13. Busoniu, L.; Babuška, R.; De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 2008, 38, 156–172. [Google Scholar] [CrossRef]
  14. Hernández-Leal, P.; Kartal, B.; Taylor, M.E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797. [Google Scholar] [CrossRef]
  15. Oroojlooyjadid, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. arXiv 2019, arXiv:1908.03963. [Google Scholar]
  16. Papoudakis, G.; Christianos, F.; Schäfer, L.; Albrecht, S.V. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv 2019, arXiv:1906.04737. [Google Scholar]
  17. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  18. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  19. Hausknecht, M.; Stone, P. Deep recurrent Q-learning for partially observable MDPs. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents, Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
  20. Cho, K.; Merriënboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  21. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  22. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395. [Google Scholar] [CrossRef] [PubMed]
  23. Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the 10th International Conference on Machine Learning (ICML), Amherst, MA, USA, 27–29 June 1993; pp. 330–337. [Google Scholar]
  24. Rashid, T.; de Witt, C.S.; Farquhar, G.; Whiteson, S. Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv 2020, arXiv:2006.10800. [Google Scholar] [CrossRef]
  25. Wang, T.; Han, B.; Xu, H.; Wang, X.; Dong, H.; Zhang, C. QPLEX: Duplex dueling multi-agent Q-learning. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  26. Yang, Y.; Meng, Z.; Hao, J.; Zhang, H.; Wang, Z. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv 2020, arXiv:2002.03939. [Google Scholar] [CrossRef]
  27. Böhmer, W.; Rashid, T.; Thoma, J.; Oliehoek, F.A.; Whiteson, S. Deep coordination graphs. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 12–18 July 2020. [Google Scholar]
  28. Son, K.; Kim, D.; Kang, W.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
  29. Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  30. Mahajan, A.; Samvelyan, M.; Rashid, T.; de Witt, C.S.; Whiteson, S. MAVEN: Multi-agent variational exploration. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  31. Peng, P.; Wen, Y.; Yang, Y.; Yuan, Q.; Tang, Z.; Long, H.; Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv 2017, arXiv:1703.10069. [Google Scholar]
  32. Zhu, C.; Dastani, M.; Wang, S. A survey of multi-agent deep reinforcement learning with communication. Auton. Agents Multi-Agent Syst. 2024, 38, 4. [Google Scholar] [CrossRef]
  33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Conceptual connection between Relaxed Monotonic QMIX (R-QMIX) and cooperative multi-robot systems. A team of robots operates under partial observability in a shared task. Each agent learns a local Deep Recurrent Q-Network (DRQN) based utility Q a ( τ t a , u t a ) , which is combined with the global state s t by an R-QMIX mixing network with a soft monotonicity regularizer to form the joint action–value Q tot ( s t , u t ) . Centralized training uses global information while decentralized execution uses greedy actions with respect to the local utilities, enabling coordinated multi-robot behavior.
Figure 2. R-QMIX architecture. The mixing network (pink) combines per-agent utilities into Q tot using state-conditioned hypernetworks (red). A soft monotonicity regularization term is added to the mixer training objective (penalizing negative Q tot / Q i ), while mixer weights remain unconstrained. Agents use a DRQN (deep recurrent Q-network) utility model with an MLP–gated recurrent unit (GRU)–MLP structure (green). Arrows indicate the direction of information flow.
Figure 3. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 3m. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 4. Win-rate comparison between QMIX, R-QMIX, and QTRAN on MMM2. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 5. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 6h vs. 8z. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 6. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 27m vs. 30m. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 7. Fraction of slopes below the margin η = 0.01 on the 3m map.
Figure 8. Fraction of slopes below the margin η = 0.01 on the MMM2 map.
Figure 9. Fraction of slopes below the margin η = 0.01 on the 6h vs. 8z map.
Figure 10. Fraction of slopes below the margin η = 0.01 on the 27m vs. 30m map.
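Figures 7–10 track the fraction of estimated local slopes ∂Q_tot/∂Q_a that fall below the margin η = 0.01. As a minimal illustrative sketch (the function name and input convention are ours, not from the paper), the diagnostic reduces to a thresholded average over a batch of slope estimates:

```python
import numpy as np

def fraction_below_margin(slopes, eta=0.01):
    """Fraction of estimated partial derivatives dQ_tot/dQ_a that fall
    below the margin eta (the quantity plotted in Figures 7-10).
    `slopes` is any array of slope estimates collected over a batch."""
    slopes = np.asarray(slopes, dtype=float)
    return float(np.mean(slopes < eta))
```

A value of 0 means every sampled slope respects the margin (strictly monotone behavior); larger values indicate more frequent local monotonicity violations.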
Table 1. QMIX vs. R-QMIX at a glance.
| Aspect | QMIX | R-QMIX |
|---|---|---|
| Mixer weight constraint | Non-negative mixer weights, e.g., W = abs(W_raw) or W = softplus(W_raw). | Weights unconstrained (W = W_raw; may be negative); monotonicity encouraged via a soft penalty on local slopes. |
| Monotonicity guarantee | Yes: ∂Q_tot/∂Q_a ≥ 0 (IGM under the model class). | No formal guarantee; encouraged locally/in expectation via regularization. |
| Extra loss terms | TD loss only (plus any shared baseline regularizers). | L_TD + λ_mono(t_env) · L_mono. |
| Extra hyperparameters | None beyond shared architecture/training knobs. | λ_mono schedule, margin η, exponent p (and δ for finite differences). |
| Computational overhead | Baseline. | Small–moderate (compute ∂Q_tot/∂Q_a and the penalty). |
| When it helps (intuition) | When a monotone factorization is sufficient; typically stable and sample-efficient. | When interactions are non-monotonic (synergy/interference) and strict QMIX monotonicity underfits or destabilizes learning on hard scenarios. |
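The R-QMIX column above mentions a soft penalty on local slopes with margin η, exponent p, and a finite-difference step δ. The following is a hedged, illustrative reconstruction of such a penalty on a single joint sample (the function name, hinge form, and finite-difference scheme are our assumptions; the paper's implementation may differ, e.g., by using autograd instead of finite differences):

```python
import numpy as np

def soft_monotonicity_penalty(q_tot_fn, q_agents, eta=0.01, p=2, delta=1e-3):
    """Sketch of a soft monotonicity penalty L_mono for one sample.

    q_tot_fn : callable mapping a vector of per-agent utilities to Q_tot
               (state conditioning omitted for brevity).
    q_agents : vector of per-agent utilities Q_a at the sampled actions.

    Estimates each slope dQ_tot/dQ_a by a central finite difference with
    step delta, then applies a hinge at the margin eta raised to power p,
    so slopes >= eta incur zero penalty.
    """
    q_agents = np.asarray(q_agents, dtype=float)
    penalties = []
    for a in range(q_agents.shape[0]):
        bump = np.zeros_like(q_agents)
        bump[a] = delta
        # central finite-difference estimate of the partial derivative
        slope = (q_tot_fn(q_agents + bump) - q_tot_fn(q_agents - bump)) / (2 * delta)
        # hinge on the margin: zero penalty when slope >= eta
        penalties.append(max(0.0, eta - slope) ** p)
    return float(np.mean(penalties))
```

In training this term would be scaled by the schedule λ_mono(t_env) and added to the TD loss, per the table above.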
Table 2. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 3m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.300 | 0.126 | 0.675 | 0.325 | 0.739 | 0.243 |
| 2 | 0.669 | 0.152 | 0.980 | 0.013 | 0.972 | 0.019 |
| 3 | 0.893 | 0.069 | 0.983 | 0.012 | 0.989 | 0.010 |
| 4 | 0.974 | 0.017 | 0.982 | 0.011 | 0.990 | 0.009 |
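The quarterly statistics reported here (and in Tables 4, 6, and 8) summarize each algorithm's evaluation win-rate curve over four equal training phases. One plausible aggregation, assumed rather than taken from the paper, pools all evaluation points falling in each quarter across seeds:

```python
import numpy as np

def quarterly_stats(win_rates):
    """Split each seed's evaluation win-rate curve into four equal
    quarters (by evaluation index) and report (mean, std) of the values
    pooled across seeds within each quarter. Aggregation scheme assumed."""
    win_rates = np.atleast_2d(np.asarray(win_rates, dtype=float))  # (seeds, evals)
    quarters = np.array_split(win_rates, 4, axis=1)
    return [(float(q.mean()), float(q.std())) for q in quarters]
```

For a curve that starts at 0 and converges to 1, the first-quarter mean stays low while the fourth-quarter mean approaches the asymptotic win rate, which is why the quarter-4 column is the best proxy for final performance.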
Table 3. Time to reach win-rate thresholds on 3m (seeds = 6; hold k = 3 consecutive evals). Values are mean across seeds; “n/6” indicates how many runs reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 361 k | 00:34:46 | 599 k | 00:56:54 | 1.03 M | 01:35:28 | 6/6/6 |
| R-QMIX | 154 k | 00:14:34 | 247 k | 00:22:51 | 319 k | 00:29:19 | 6/6/6 |
| QTRAN | 120 k | 00:11:09 | 190 k | 00:17:30 | 307 k | 00:28:11 | 6/6/6 |
Table 4. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the MMM2 map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.001 | 0.162 | 0.168 | 0.000 | 0.001 |
| 2 | 0.002 | 0.005 | 0.762 | 0.125 | 0.001 | 0.003 |
| 3 | 0.085 | 0.061 | 0.932 | 0.027 | 0.035 | 0.030 |
| 4 | 0.423 | 0.146 | 0.971 | 0.013 | 0.247 | 0.108 |
Table 5. Time to reach win-rate thresholds on MMM2 (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.97 M | 09:09:41 | 5.57 M | 10:14:54 | N/A | N/A | 4/4/0 |
| R-QMIX | 1.51 M | 02:41:40 | 1.90 M | 03:22:14 | 2.63 M | 04:43:41 | 6/6/6 |
| QTRAN | 5.36 M | 09:32:44 | 5.69 M | 10:02:00 | N/A | N/A | 3/1/0 |
Table 6. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 6h vs. 8z map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.005 | 0.008 | 0.000 | 0.000 |
| 2 | 0.000 | 0.000 | 0.071 | 0.043 | 0.002 | 0.005 |
| 3 | 0.000 | 0.000 | 0.301 | 0.090 | 0.005 | 0.008 |
| 4 | 0.000 | 0.001 | 0.575 | 0.063 | 0.014 | 0.015 |
Table 7. Time to reach win-rate thresholds on 6h vs. 8z (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
| R-QMIX | 4.41 M | 08:23:15 | 5.70 M | 10:50:57 | N/A | N/A | 6/3/0 |
| QTRAN | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
Table 8. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 27m vs. 30m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.168 | 0.167 | 0.002 | 0.004 |
| 2 | 0.001 | 0.004 | 0.733 | 0.130 | 0.043 | 0.029 |
| 3 | 0.203 | 0.133 | 0.930 | 0.024 | 0.191 | 0.076 |
| 4 | 0.580 | 0.078 | 0.966 | 0.012 | 0.352 | 0.052 |
Table 9. Time to reach win-rate thresholds on 27m vs. 30m (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.49 M | 15:44:36 | 4.72 M | 16:41:25 | 5.02 M | 17:33:47 | 6/4/1 |
| R-QMIX | 1.47 M | 05:34:45 | 2.02 M | 07:20:18 | 2.89 M | 10:07:24 | 6/6/6 |
| QTRAN | 3.84 M | 14:07:32 | 4.44 M | 16:24:35 | N/A | N/A | 2/2/0 |
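The time-to-threshold tables above use a “hold k = 3 consecutive evaluations” criterion. A small sketch of one way such a criterion could be computed (here reporting the environment step at which the hold begins; the paper's exact convention is assumed, not stated):

```python
def time_to_threshold(eval_steps, win_rates, threshold, k=3):
    """First environment step at which the evaluation win rate reaches
    `threshold` and holds for k consecutive evaluations; None if never.

    eval_steps : environment-step count at each periodic evaluation.
    win_rates  : evaluation win rate at each of those steps.
    """
    hits = 0
    for i, wr in enumerate(win_rates):
        hits = hits + 1 if wr >= threshold else 0  # reset on any dip below
        if hits == k:
            return eval_steps[i - (k - 1)]  # step where the hold began
    return None
```

Per-algorithm table entries would then be the mean of this quantity over the seeds that reached the threshold, with “N/A” when no run did.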
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

O’Brien, L.; Xu, H. Relaxed Monotonic QMIX (R-QMIX): A Regularized Value Factorization Approach to Decentralized Multi-Agent Reinforcement Learning. Robotics 2026, 15, 28. https://doi.org/10.3390/robotics15010028


