Article

Relaxed Monotonic QMIX (R-QMIX): A Regularized Value Factorization Approach to Decentralized Multi-Agent Reinforcement Learning

Department of Electrical and Biomedical Engineering, University of Nevada, Reno, NV 89557, USA
*
Authors to whom correspondence should be addressed.
Robotics 2026, 15(1), 28; https://doi.org/10.3390/robotics15010028
Submission received: 12 December 2025 / Revised: 16 January 2026 / Accepted: 16 January 2026 / Published: 21 January 2026
(This article belongs to the Special Issue AI-Powered Robotic Systems: Learning, Perception and Decision-Making)

Abstract

Value factorization methods have become a standard tool for cooperative multi-agent reinforcement learning (MARL) in the centralized-training, decentralized-execution (CTDE) setting. QMIX (a monotonic mixing network for value factorization), in particular, constrains the joint action–value function to be a monotonic mixing of per-agent utilities, which guarantees consistency with individual greedy policies but can severely limit expressiveness on tasks with non-monotonic agent interactions. This work revisits this design choice and proposes Relaxed Monotonic QMIX (R-QMIX), a simple regularized variant of QMIX that encourages but does not strictly enforce the monotonicity constraint. R-QMIX removes the sign constraints on the mixing network weights and introduces a differentiable penalty on negative partial derivatives of the joint value with respect to each agent's utility. This preserves the computational benefits of value factorization while allowing the joint value to deviate from strict monotonicity when beneficial. R-QMIX is implemented in the standard PyMARL framework (an open-source MARL codebase) and evaluated on the StarCraft Multi-Agent Challenge (SMAC). On a simple map (3m), R-QMIX matches the asymptotic performance of QMIX while learning substantially faster. On more challenging maps (MMM2, 6h vs. 8z, and 27m vs. 30m), R-QMIX significantly improves both sample efficiency and final win rate (WR), for example increasing the final-quarter mean win rate from 42.3 % to 97.1 % on MMM2, from 0.0 % to 57.5 % on 6h vs. 8z, and from 58.0 % to 96.6 % on 27m vs. 30m. These results suggest that soft monotonicity regularization is a practical way to bridge the gap between strictly monotonic value factorization and fully unconstrained joint value functions. A further comparison against QTRAN (Q-value transformation), a more expressive value factorization method, shows that R-QMIX achieves higher and more reliably convergent win rates on the challenging SMAC maps considered.

1. Introduction

Reinforcement learning (RL) provides a framework for sequential decision-making in which an agent interacts with an environment in order to maximize cumulative reward [1]. Modern deep RL methods have achieved impressive results in high-dimensional domains such as Atari games and continuous control by combining neural function approximation with value-based or policy-based learning [2,3,4,5,6].
Many real-world problems, however, are naturally multi-agent: multiple learners act concurrently, share a common objective, and operate under partial observability. Examples include cooperative robotics, distributed sensor networks, and multi-vehicle coordination. This has motivated the field of multi-agent reinforcement learning (MARL), where several agents must coordinate their behavior using local observations and, typically, a joint reward signal [7,8]. A widely adopted paradigm is centralized training with decentralized execution (CTDE), in which a learning algorithm can access global information and joint actions during training while the final policies for execution are decentralized. Figure 1 provides a conceptual overview of how R-QMIX fits into a cooperative multi-robot CTDE pipeline, from per-agent utilities to the joint action–value function.
A core challenge in cooperative MARL is credit assignment, that is, how to decompose a global team reward into signals that enable each agent to learn an effective local policy. Value factorization methods address this by parameterizing a joint action–value function Q tot as a combination of per-agent utilities Q a , trained end-to-end from a global reward. Value decomposition networks (VDN) [9] use a simple additive factorization; QMIX [10] extends this idea with a nonlinear mixing network that is constrained to be monotonic in each agent’s utility. The monotonicity constraint ensures that greedy individual policies are consistent with greedy joint policies, enabling efficient decentralized execution.
Despite its practical success, QMIX’s hard monotonicity constraint can be overly restrictive. Many cooperative tasks involve interactions that are fundamentally non-monotonic with respect to individual utilities, for instance coordinated flanking maneuvers or sacrifice strategies in combat. Empirically, relaxing this constraint often improves representational power but may lead to unstable optimization and poor generalization. This raises a natural question: Can the monotonicity constraint be softened in a principled way that trades off expressiveness and empirical convergence?
This paper proposes Relaxed Monotonic QMIX (R-QMIX), a regularized value factorization method that retains the QMIX architecture but replaces the hard non-negativity constraint on mixing-network weights with a soft differentiable penalty on monotonicity violations. Instead of enforcing $\partial Q_{\text{tot}} / \partial Q_a \geq 0$ for all agents and states, a regularization term is introduced that penalizes negative partial derivatives, allowing the network to deviate from strict monotonicity only when this significantly improves the fit to the Bellman target. Figure 2 summarizes the R-QMIX architecture, including the hypernetwork-based mixer and the soft monotonicity regularizer used during training.
The main contributions of this work are as follows:
  • A soft monotonicity regularizer is formulated that measures and penalizes violations of the individual-global-max (IGM) property via the partial derivatives of the mixing network.
  • R-QMIX is introduced, augmenting the standard QMIX TD loss with the proposed regularizer while keeping the underlying architecture and training pipeline unchanged.
  • An efficient implementation based on automatic differentiation is provided and integrated into a standard PyMARL/SMAC stack.
  • R-QMIX is empirically evaluated in the StarCraft Multi-Agent Challenge (SMAC) [11]. On 3m, MMM2, and 27m vs. 30m, R-QMIX consistently matches or outperforms QMIX, with particularly large gains in convergence speed (sample efficiency) and asymptotic win rate on the harder maps.
  • R-QMIX is empirically compared with QTRAN in four SMAC scenarios, showing that R-QMIX achieves stronger and more reliable convergence than both QMIX and QTRAN on the harder maps.
More broadly, cooperative MARL and its core challenges, including credit assignment, non-stationarity, and coordination, have been extensively surveyed, providing context for the algorithms considered in this work [12,13,14,15,16].
Overall, the results show that softly relaxing the monotonicity constraint is a simple yet powerful modification that improves convergence in challenging cooperative tasks without sacrificing the computational advantages of value factorization.

2. Background and Related Work

2.1. Dec-POMDPs and the CTDE Paradigm

Cooperative multi-agent tasks are commonly modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [7]. A Dec-POMDP is defined by the tuple
$$G = \langle S, U, P, r, Z, O, n, \gamma \rangle,$$
where $S$ is the state space, $U$ is the discrete set of actions for each agent, $P(s' \mid s, \mathbf{u})$ is the state-transition function, $r(s, \mathbf{u})$ is the shared reward, $Z$ is the local observation space, $O(z_t^a \mid s_t, u_t^a)$ is the observation function, $n$ is the number of agents, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t$, agents select actions $u_t^a \in U$, forming a joint action $\mathbf{u}_t \in U^n$, and receive a shared reward $r_t = r(s_t, \mathbf{u}_t)$.
Because the global state $s_t$ is not observed directly, each agent $a$ receives a local observation $z_t^a \sim O(\cdot \mid s_t, u_t^a)$ and maintains an action–observation history $\tau_t^a \in (Z \times U)^{*}$, conditioning its policy $\pi^a(u_t^a \mid \tau_t^a)$ on this history. The joint policy $\pi$ induces a joint action–value function
$$Q^{\pi}(s_t, \mathbf{u}_t) = \mathbb{E}\left[ R_t \mid s_t, \mathbf{u}_t \right],$$
where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted return.
In the centralized-training, decentralized-execution (CTDE) setting, the learning algorithm can access global information (e.g., s t and all { τ t a } a ) during training, while in execution each agent must rely only on its own history τ t a . Value factorization methods fit naturally into this paradigm; they learn a centralized estimate of a joint value Q tot while deriving decentralized policies from per-agent utilities.

2.2. Deep Q-Learning and Recurrent Extensions

Deep Q-learning (DQL) approximates a single-agent action–value function $Q(s, u; \theta)$ with a neural network [2,17,18]. Transitions $(s_t, u_t, r_t, s_{t+1})$ are stored in a replay buffer, and the parameters $\theta$ are trained by minimizing the squared temporal-difference (TD) error
$$\mathcal{L}_{\text{DQN}}(\theta) = \sum_{i=1}^{B} \left( y_i^{\text{DQN}} - Q(s_i, u_i; \theta) \right)^2$$
with target
$$y_i^{\text{DQN}} = r_i + \gamma \max_{u'} Q(s_i', u'; \theta^{-}),$$
where $\theta^{-}$ are the parameters of a target network periodically updated from $\theta$.
To handle partial observability, deep recurrent Q-networks (DRQN) [19] replace the feedforward Q-network with a recurrent neural network (RNN) such as gated recurrent unit (GRU) [20] or long short-term memory (LSTM) [21], which is conditioned on an agent’s action–observation history τ t a . The recurrent hidden state summarizes past observations and actions, and serves as an approximate belief state. DRQN forms the basis for many CTDE value factorization methods, including QMIX and R-QMIX.
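As a concrete illustration, a minimal DRQN-style utility network can be sketched in PyTorch as follows; the class name, layer sizes, and method names are illustrative choices, not taken from the PyMARL code.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent utility network: a GRU cell carries the recurrent hidden
    state that summarizes the action-observation history tau_t^a."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def init_hidden(self, batch_size):
        # Fresh hidden state at the start of each episode
        return torch.zeros(batch_size, self.rnn.hidden_size)

    def forward(self, obs, hidden):
        x = torch.relu(self.fc_in(obs))
        h = self.rnn(x, hidden)   # approximate belief state
        q = self.fc_out(h)        # Q_a(tau_t^a, .) for every action
        return q, h
```

At execution time, each agent feeds its own observation stream through such a network and acts greedily on the returned utilities, which is all that decentralized execution requires.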

2.3. Independent Q-Learning

Independent Q-learning (IQL) [22,23] is a simple baseline in which each agent learns its own Q-function Q a ( τ t a , u t a ) by treating other agents as part of the environment. Although conceptually straightforward and often competitive on simpler tasks, IQL suffers from non-stationarity; as other agents update their policies, the effective environment dynamics change, undermining single-agent convergence guarantees. This motivates methods that exploit centralized information during training while still supporting decentralized execution.

2.4. Value Decomposition Networks (VDN)

Value decomposition networks (VDN) [9] address credit assignment by explicitly decomposing the joint action–value function into a sum of per-agent utilities:
$$Q_{\text{tot}}(s_t, \mathbf{u}_t) = \sum_{a=1}^{n} Q_a(\tau_t^a, u_t^a).$$
Each Q a depends only on agent a’s local history, while the scalar joint value is trained against the global team reward. This additive factorization ensures that greedy policies with respect to each Q a are consistent with the greedy joint policy. However, the assumption that agent contributions simply add is often too restrictive for tasks with strong interaction effects or heterogeneous roles.

2.5. QMIX

QMIX [10] generalizes VDN by introducing a learnable nonlinear mixing network
$$Q_{\text{tot}}(s_t, \mathbf{u}_t) = f_{\theta}\left( s_t, \, Q_1(\tau_t^1, u_t^1), \ldots, Q_n(\tau_t^n, u_t^n) \right),$$
where $f_{\theta}$ is parameterized by a feedforward network with weights produced by state-conditioned hypernetworks. The key structural constraint is that $Q_{\text{tot}}$ is required to be monotone in each agent's utility:
$$\frac{\partial Q_{\text{tot}}}{\partial Q_a} \geq 0, \quad \forall a.$$
In practice, this is enforced by restricting certain mixer weights to be non-negative (e.g., via absolute value or softplus). The monotonicity constraint guarantees the individual-global-max (IGM) property: maximizing Q tot over joint actions is equivalent to each agent greedily maximizing its own utility. This yields efficient decentralized greedy execution, but restricts the functional form of Q tot and can limit expressiveness on tasks with non-monotonic interactions.
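A minimal sketch of such a hypernetwork mixer is shown below, assuming illustrative layer sizes; the `torch.abs` calls on the generated weights are what enforce monotonicity, and R-QMIX's relaxation amounts to dropping them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: state-conditioned hypernetworks generate the
    mixing weights, and abs() keeps them non-negative so that
    dQ_tot/dQ_a >= 0 for every agent."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot per batch element
```

Because the hidden activation (ELU) is monotone increasing and both weight layers are non-negative, the composed mixer is monotone in every `agent_qs` entry by construction.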
Several QMIX variants increase mixing expressiveness while preserving useful structure, including weighted QMIX [24], QPLEX (a duplex dueling value decomposition method) [25], Qatten [26], and deep coordination graphs (DCG) [27]. These methods highlight the tradeoff between structural constraints that facilitate decentralized execution and the need for representational power in complex cooperative tasks.

2.6. QTRAN

QTRAN [28] takes a different approach to value factorization. Instead of enforcing monotonicity, it introduces a transformed joint action–value function $\tilde{Q}_{\text{tot}}$ and auxiliary constraints that align independent greedy policies with the optimal joint policy. Concretely, QTRAN minimizes a TD loss on $\tilde{Q}_{\text{tot}}$ together with additional terms that encourage
$$\tilde{Q}_{\text{tot}}(s_t, \mathbf{u}_t) = \sum_a Q_a(\tau_t^a, u_t^a) + C(s_t)$$
for the joint greedy action and certain baseline conditions. This formulation is more expressive than QMIX in principle; however, the resulting optimization problem can be difficult to solve in practice, and QTRAN is empirically unstable on several SMAC maps.
The approach proposed in this paper lies between strictly monotone and fully unconstrained factorization. A single mixer over per-agent utilities is retained, but hard monotonicity is replaced with a soft regularizer that penalizes violations of the IGM structure.

2.7. Other MARL Methods and Communication

In addition to value factorization, cooperative MARL has been studied through policy gradient and actor–critic methods, hierarchical exploration, and explicit communication. Counterfactual Multi-Agent Policy Gradients (COMA) uses a centralized critic with a counterfactual baseline for each agent [29], while Multi-Agent Variational Exploration (MAVEN) augments QMIX with a hierarchical latent variable for multimodal exploration [30]. Architectures such as BiCNet [31] and more recent communication-focused surveys [32] model differentiable communication channels between agents.
These approaches are complementary to value factorization, as they explore different ways of injecting structure into cooperative MARL. R-QMIX follows the value-factorization line, focusing specifically on relaxing the monotonicity constraint while retaining decentralized greedy execution and a QMIX-compatible architecture.

3. Relaxed Monotonic QMIX Methodology

In this section, we develop Relaxed Monotonic QMIX (R-QMIX), a variant of QMIX that replaces the hard monotonicity constraint with a soft differentiable monotonicity penalty applied to the mixing network. All methods in our experiments share the same agent architecture, mixer architecture, optimizer, learning rate schedule, parameter groups, and gradient clipping settings; we adopt these training heuristics uniformly because they substantially improve stability on the harder SMAC maps. R-QMIX differs from this common baseline only in its learning objective, which adds the soft monotonicity term to the standard TD loss. For clarity, we refer to QMIX trained under this identical optimization setup as QMIX-Baseline. Unless otherwise noted, these optimization settings are fixed across maps.
For clarity, Table 1 summarizes the key differences between QMIX and R-QMIX at a high level. Figure 2 shows the architecture; the remainder of this section defines the regularizer, learning objective, and monotonicity coefficient.

3.1. R-QMIX Architecture

The centralized-training, decentralized-execution (CTDE) setting described in Section 2 is adopted. Each agent $a \in \{1, \ldots, n\}$ maintains a parametric utility function
$$Q_a(\tau_t^a, u_t^a; \theta_a) \in \mathbb{R},$$
implemented as a deep recurrent Q-network (DRQN) that conditions on the agent's action–observation history $\tau_t^a$ and its current action $u_t^a$.
At each time step $t$, the per-agent utilities are collected into a vector
$$\mathbf{Q}_t = \left( Q_1(\tau_t^1, u_t^1), \ldots, Q_n(\tau_t^n, u_t^n) \right) \in \mathbb{R}^n$$
and combined with the global state $s_t$ through a mixing network
$$Q_{\text{tot}}(s_t, \mathbf{u}_t; \theta_{\text{mix}}) = f_{\text{mix}}\left( \mathbf{Q}_t, s_t; \theta_{\text{mix}} \right),$$
where $f_{\text{mix}}$ is the original QMIX hypernetwork-based mixer.
In standard QMIX, the architecture of $f_{\text{mix}}$ is constrained so that $Q_{\text{tot}}$ is a monotone function of each $Q_a$, i.e., $\partial Q_{\text{tot}} / \partial Q_a \geq 0$ for all agents. This is enforced by restricting certain weights of the mixing network to be non-negative, which guarantees that the joint greedy action can be obtained by greedy action selection with respect to each $Q_a$.
R-QMIX keeps the same mixer parameterization but removes these hard sign constraints. Instead, monotonicity is encouraged by adding a smooth regularization term to the loss that penalizes violations of $\partial Q_{\text{tot}} / \partial Q_a \geq 0$. The decentralized execution policy remains the same as in QMIX: each agent acts greedily with respect to its local utility $Q_a$.

3.2. Soft Monotonicity Regularization

The key idea in R-QMIX is to softly enforce the monotonicity of $Q_{\text{tot}}$ with respect to each agent utility $Q_a$ via a penalty on the partial derivatives
$$g_t^a = \frac{\partial Q_{\text{tot}}(s_t, \mathbf{u}_t)}{\partial Q_a(\tau_t^a, u_t^a)}.$$
Intuitively, these slopes should be non-negative, and in practice strictly larger than a small margin $\eta > 0$ in order to avoid flat regions.
For each training batch, the mixer output $Q_{\text{tot}}$ is recomputed on the per-agent utilities and the tensor of partial derivatives $g_{b,t} = (g_{b,t}^1, \ldots, g_{b,t}^n)$ is obtained using automatic differentiation. The per-agent violation is defined as
$$v_{b,t}^a = \max\left( 0, \, \eta - g_{b,t}^a \right),$$
which is non-zero only when the slope falls below the margin $\eta$. Let $m_{b,t} \in \{0, 1\}$ be a mask indicating valid (non-padded, non-terminated) time steps in episode $b$ at time index $t$, and let $M = \{(b, t) : m_{b,t} = 1\}$ be the set of valid positions. Given a batch of $B$ episodes and $T$ time steps, the monotonicity loss is defined as
$$\mathcal{L}_{\text{mono}} = \frac{1}{|M| \, n} \sum_{(b,t) \in M} \sum_{a=1}^{n} \left( v_{b,t}^a \right)^p,$$
where $p \in \{1, 2\}$ controls whether an $\ell_1$-style ($p = 1$) or squared ($p = 2$) penalty is used. In practice, $p = 2$ is used, which smoothly emphasizes larger violations.
The partial derivatives g b , t a are computed in two interchangeable ways:
  • Autograd mode (default): The per-agent utilities Q a are treated as differentiable inputs to the mixer, and g b , t a is obtained via automatic differentiation of Q tot with respect to these inputs. This is efficient and stable in modern deep-learning frameworks.
  • Finite-difference mode (ablation): For each agent $a$, the utility $Q_a$ is perturbed by a small scalar $\delta$ and the slope is approximated as
$$g_{b,t}^a \approx \frac{Q_{\text{tot}}(s_t, \mathbf{u}_t; Q_a + \delta) - Q_{\text{tot}}(s_t, \mathbf{u}_t; Q_a)}{\delta}.$$
This variant isolates the regularizer from gradient-flow artifacts, but is more expensive; in the main experiments, the autograd version is used.
In both cases, the mask m b , t ensures that the regularizer is only applied to valid transitions in each episode.
In addition to contributing to the training loss, several diagnostics derived from $g_{b,t}^a$ are also recorded, including the fraction of slopes below the margin $\eta$ and the mean violation magnitude; these help to interpret the strength and effect of the regularizer over the course of training.
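The autograd mode can be sketched compactly; here `mixer` is any callable mapping per-agent utilities and state to $Q_{\text{tot}}$, and all names, shapes, and defaults are illustrative assumptions rather than the released implementation.

```python
import torch

def monotonicity_loss(mixer, agent_qs, state, mask, eta=0.01, p=2):
    """Soft monotonicity penalty L_mono: penalize slopes g_a = dQ_tot/dQ_a
    below the margin eta, averaged over valid (mask == 1) positions and
    agents. agent_qs: (batch, n_agents); mask: (batch,)."""
    # Treat the utilities as differentiable inputs to the mixer (sketch:
    # the penalty here trains the mixer, not the agent networks).
    qs = agent_qs.detach().requires_grad_(True)
    q_tot = mixer(qs, state)
    # One backward pass yields all per-agent slopes at once.
    (g,) = torch.autograd.grad(q_tot.sum(), qs, create_graph=True)
    v = torch.clamp(eta - g, min=0.0) ** p   # per-agent violation v_a
    v = v * mask.unsqueeze(-1)               # ignore padded time steps
    n_agents = qs.size(-1)
    return v.sum() / (mask.sum().clamp(min=1) * n_agents)
```

With a mixer whose slopes all exceed the margin, the penalty is exactly zero; a mixer with a negative slope for some agent produces a strictly positive loss, which is the signal the regularizer feeds back into training.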

3.3. TD Target and Learning Objective

R-QMIX follows the standard off-policy training procedure introduced in QMIX [10]. For each transition in a replayed episode $b$ at time step $t$, the agents receive a shared reward $r_{b,t}$ and a termination flag $z_{b,t} \in \{0, 1\}$. The target joint action–value is obtained by bootstrapping from the next state using the target networks:
$$V_{b,t+1} = Q_{\text{tot}}\left( s_{b,t+1}, \mathbf{u}_{b,t+1}^{*}; \theta^{-} \right),$$
where $\theta^{-}$ denotes the target parameters and $\mathbf{u}_{b,t+1}^{*}$ is selected via greedy or double Q-learning.
Because the implementation uses the same one-step TD target as the original QMIX, the learning target is
$$G_{b,t}^{(1)} = r_{b,t} + \gamma \, (1 - z_{b,t}) \, V_{b,t+1}.$$
Using a mask $m_{b,t}$ to ignore padded transitions, the TD loss over a batch is
$$\mathcal{L}_{\text{TD}} = \frac{1}{\sum_{b,t} m_{b,t}} \sum_{b,t} m_{b,t} \, \ell\left( Q_{\text{tot}}(s_{b,t}, \mathbf{u}_{b,t}; \theta) - G_{b,t}^{(1)} \right),$$
where $\ell(\cdot)$ is either the mean squared error (MSE) loss or the Huber loss (used on harder maps for increased robustness).
The R-QMIX learning objective adds the soft monotonicity penalty to the original QMIX TD loss:
$$\mathcal{L}_{\text{R-QMIX}} = \mathcal{L}_{\text{TD}} + \lambda_{\text{mono}}(t_{\text{env}}) \, \mathcal{L}_{\text{mono}},$$
where $\lambda_{\text{mono}}(t_{\text{env}})$ is a scheduled regularization coefficient described in Section 3.5.
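Assembling the full objective from the masked TD error and the monotonicity penalty might look as follows; the function signature, tensor shapes, and the `use_huber` switch are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def rqmix_loss(q_tot, targets, mask, mono_loss, lam, use_huber=False):
    """R-QMIX objective: masked one-step TD loss plus the scheduled soft
    monotonicity penalty. Huber optionally replaces MSE on harder maps."""
    targets = targets.detach()  # no gradient through bootstrap targets
    if use_huber:
        per_step = F.huber_loss(q_tot, targets, reduction="none")
    else:
        per_step = (q_tot - targets) ** 2
    # Average only over valid (non-padded) transitions.
    td = (per_step * mask).sum() / mask.sum().clamp(min=1)
    return td + lam * mono_loss
```

Setting `lam = 0` recovers the QMIX-Baseline objective under the shared optimization setup, which is exactly the ablation considered in Section 4.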

3.4. Interpreting the Monotonicity Regularizer

The monotonicity penalty in R-QMIX can be viewed as a soft relaxation of the individual-global-max (IGM) constraint that underpins QMIX [10]. In strictly monotone value factorization, the condition
$$\frac{\partial Q_{\text{tot}}}{\partial Q_a} \geq 0 \quad \forall a$$
ensures that independently greedy per-agent policies are consistent with the joint greedy policy. The regularizer does not enforce this constraint exactly, but instead penalizes violations of the margin condition $g_t^a \geq \eta$ for the local slopes $g_t^a = \partial Q_{\text{tot}} / \partial Q_a$. In regions of the state–action space where a monotone factorization is sufficient to explain the data, the optimizer can reduce the TD error while driving $\mathcal{L}_{\text{mono}} \to 0$, effectively recovering an approximate IGM property. Conversely, when accurately fitting the Bellman target requires non-monotonic interactions, the model is free to trade off a small increase in $\mathcal{L}_{\text{mono}}$ against a larger decrease in TD error.
From a regularization perspective, the monotonicity penalty defines a smooth prior over joint value functions that discourages pathological and highly non-cooperative couplings between agent utilities. Large negative slopes $g_t^a \ll 0$ correspond to regimes where small local improvements in an individual agent's utility can sharply decrease the joint value, which is typically undesirable in cooperative settings and can destabilize learning. By softly penalizing such configurations, R-QMIX biases the mixer toward representations where $Q_{\text{tot}}$ varies smoothly and predominantly positively with each $Q_a$, without forbidding localized non-monotonic structure when it is genuinely needed. The regularization coefficient $\lambda_{\text{mono}}(t_{\text{env}})$ then plays the role of a tunable bias–variance knob: larger values enforce a stronger cooperative prior and more QMIX-like behavior, while smaller values move the model closer to an unconstrained factorization.
This perspective helps to explain the empirical behavior observed in Section 4, where R-QMIX remains close to a monotone factorization on easy maps while exploiting non-monotonic structure on the harder scenarios.

Decentralized Execution and IGM Under Relaxed Monotonicity

QMIX enforces a strictly monotone mixing function, which guarantees the individual-global-max (IGM) property: the joint greedy action that maximizes $Q_{\text{tot}}$ is consistent with the composition of per-agent greedy actions. R-QMIX relaxes this constraint by replacing hard sign constraints with a soft penalty on local partial derivatives $g^a = \partial Q_{\text{tot}} / \partial Q_a$, encouraging $g^a \geq \eta$ in expectation rather than enforcing global monotonicity everywhere. As a result, IGM is not theoretically guaranteed under R-QMIX, and in principle there may exist states where decentralized greedy execution is suboptimal relative to the joint argmax. In practice, we treat monotonicity as a regularizer that trades strict consistency for improved optimization and credit assignment on hard scenarios. To make this tradeoff transparent, we report the frequency of monotonicity violations observed during training.

3.5. Scheduling the Monotonicity Coefficient

To trade off between convergence behavior and expressiveness over the course of training, the strength of the monotonicity regularizer is varied as a function of environment time steps. The schedule is parameterized by four scalars:
$$\lambda_{\text{mono}}^{\text{start}}, \quad \lambda_{\text{mono}}^{\text{end}}, \quad t_{\text{warmup}}, \quad t_{\text{anneal}}.$$
We define a normalized progress variable
$$\alpha(t_{\text{env}}) = \min\left( 1, \, \frac{\max\{0, \, t_{\text{env}} - t_{\text{warmup}}\}}{t_{\text{anneal}}} \right),$$
then set the effective weight $\lambda_{\text{mono}}(t_{\text{env}})$ as follows. For the linear schedule,
$$\lambda_{\text{mono}}(t_{\text{env}}) = (1 - \alpha) \, \lambda_{\text{mono}}^{\text{start}} + \alpha \, \lambda_{\text{mono}}^{\text{end}},$$
while for the cosine schedule,
$$\lambda_{\text{mono}}(t_{\text{env}}) = \lambda_{\text{mono}}^{\text{end}} + \frac{1}{2}\left( \lambda_{\text{mono}}^{\text{start}} - \lambda_{\text{mono}}^{\text{end}} \right)\left( 1 + \cos(\pi \alpha) \right).$$
In this work, the linear schedule is used by default. This formulation allows the practitioner to, for example, start with a relatively strong monotonicity prior and gradually relax it, or conversely to introduce the regularizer only after the value function has roughly converged.
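The two schedules are a direct transcription of the formulas above; the function and argument names below are our own shorthand.

```python
import math

def lambda_mono(t_env, lam_start, lam_end, t_warmup, t_anneal, mode="linear"):
    """Scheduled monotonicity coefficient: hold lam_start for t_warmup
    environment steps, then anneal toward lam_end over t_anneal steps."""
    # Normalized progress alpha in [0, 1]
    alpha = min(1.0, max(0.0, t_env - t_warmup) / t_anneal)
    if mode == "linear":
        return (1.0 - alpha) * lam_start + alpha * lam_end
    # Cosine schedule: smooth interpolation from lam_start to lam_end
    return lam_end + 0.5 * (lam_start - lam_end) * (1.0 + math.cos(math.pi * alpha))
```

For example, with `lam_start=1.0`, `lam_end=0.1`, `t_warmup=1000`, and `t_anneal=9000`, the coefficient stays at 1.0 for the first 1000 steps, reaches 0.55 halfway through the anneal, and settles at 0.1 thereafter.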

3.6. Optimization and Training Heuristics

The remaining training details follow the implementation used in the experiments:
  • Optimizer and parameter groups: The Adam optimizer [33] is used with separate parameter groups for agents and mixer. The mixer parameters are trained with a slightly lower learning rate and a small weight decay, which stabilizes the training on harder maps.
  • Learning rate schedule: The learning rate of each parameter group is decayed by a constant factor at a small number of predefined environment-step milestones.
  • Target networks: Separate target networks are maintained for both the agents and the mixer. These are updated either by hard copies at fixed environment-step intervals or by Polyak averaging with coefficient τ , depending on the experiment.
  • Double Q-learning and masking: Double Q-learning is optionally used when selecting bootstrap actions, and standard episode masks are applied to ignore padded transitions and to stop bootstrapping beyond terminal states.
  • Gradient clipping: Gradients are clipped to a fixed norm to prevent exploding updates, which is particularly important on super-hard SMAC maps.
Together, the soft monotonicity regularizer, its schedule, and these training heuristics define the full R-QMIX training procedure used in the experiments, and are crucial for achieving robust convergence on the harder SMAC maps.
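Of the heuristics above, the Polyak-averaged target update admits a particularly compact sketch; the function name is ours, and `tau` plays the role of the averaging coefficient in the text.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def polyak_update(target, online, tau=0.005):
    """Soft target-network update:
    theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```

Calling this once per training step with a small `tau` makes the target parameters a slowly moving average of the online parameters, in contrast to hard copies at fixed environment-step intervals.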

3.7. QTRAN Baseline

For QTRAN, we use the open-source PyMARL implementation and its default hyperparameters, which are widely used as a reference configuration. Because QTRAN’s objective and auxiliary losses differ from QMIX-style methods, we do not retune it to match our QMIX optimization heuristics; our goal is a representative reference baseline rather than a best-possible QTRAN result. All algorithms share the same six random seeds (Table A4) on each SMAC map.

4. Experimental Results

This section empirically evaluates R-QMIX on the StarCraft Multi-Agent Challenge (SMAC) benchmark. Four representative scenarios are considered: the easy homogeneous map 3m, the mixed-unit map MMM2, the hard map 6h vs. 8z, and the super-hard map 27m vs. 30m. Unless otherwise stated, the mean test win rate over multiple random seeds is reported as a function of environment time steps, and R-QMIX is compared against the original QMIX baseline. For each scenario, all algorithms are tested on the same six randomly selected seeds listed in Table A4.
Within these curves, shaded regions denote the standard deviation across seeds. For each map, quarterly statistics over the training horizon (one statistic per 25% slice of time steps) summarize early, middle, and late training performance, as reported in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8.
In this empirical study, the term convergence is used along two axes: (i) sample efficiency, measured by how quickly the mean test win rate improves as a function of environment steps t env , and (ii) asymptotic performance, measured by the maximum or final-quarter mean test win rate attained over training. An algorithm is said to exhibit better convergence on a given scenario if it reaches high win rates with fewer environment steps and/or achieves a higher final-quarter mean win rate with lower variance across seeds.

4.1. Experimental Environments

All experiments in this work are conducted on the StarCraft Multi-Agent Challenge (SMAC) benchmark [11], which provides a suite of micromanagement scenarios based on StarCraft II. Each scenario defines a partially observable, fully cooperative task in which a fixed team of allied units must defeat a fixed team of enemy units.
The implementation is based on the open-source PyMARL framework (https://github.com/oxwhirl/pymarl, accessed on 5 January 2025) and extends it with an implementation of R-QMIX. All experiments were conducted on a desktop workstation equipped with an NVIDIA RTX 4080 Super GPU, an AMD Ryzen 7950X3D CPU, and 128 GB of RAM. Training and evaluation were carried out using a dedicated Conda environment that pins package versions for reproducibility.
For transparency and reproducibility, the full code and statistics will be released in a public repository once the paper is accepted (https://github.com/liamobbrien1/R-QMIX-A-Regularized-Value-Factorization-Approach-to-MARL (accessed on 3 January 2026)).

4.2. Overall Performance Across SMAC Maps

Figure 3, Figure 4, Figure 5 and Figure 6 illustrate the set of tasks considered in this work. For each map, QMIX and R-QMIX are trained under identical network architectures and training hyperparameters, differing only in the presence of the monotonicity regularizer and its schedule. R-QMIX generally matches QMIX in simpler scenarios while providing clearer benefits in more challenging maps. Detailed results for each scenario are presented in the following subsections. To characterize how often the learned mixer violates the soft monotonicity margin during training, Figure 7, Figure 8, Figure 9 and Figure 10 report the fraction of partial-derivative slopes $g^a = \partial Q_{\text{tot}} / \partial Q_a$ that fall below $\eta = 0.01$ on 3m, MMM2, 6h vs. 8z, and 27m vs. 30m, respectively. Table 9 reports the mean ± std (across seeds) number of environment steps and wall-clock time required to reach 50%, 70%, and 90% win-rate thresholds on 27m vs. 30m.

4.3. Easy Scenario: 3m

The 3m map consists of three marines on each side, and is commonly used as a sanity check for multi-agent algorithms due to its relatively simple coordination structure. Figure 3 shows the test win rate of QMIX and R-QMIX in this scenario.
In this easy scenario, both QMIX and R-QMIX achieve high win rates early in the training. Quarterly statistics on the training horizon confirm that the average performance of R-QMIX is comparable to that of QMIX, with overlapping standard deviations and similar final-quarter mean win rates. This is desirable, as R-QMIX does not sacrifice convergence on simple tasks while still providing benefits on more complex ones.

4.4. Mixed-Unit Scenario: MMM2

The MMM2 scenario features a heterogeneous team of marines, marauders, and medivacs, and is known to be significantly more challenging than 3m. It requires both spatial coordination and non-trivial credit assignment among different unit types.
In MMM2, R-QMIX achieves stronger convergence than QMIX across most of the training horizon. In particular, the quarterly averages in the second half of training indicate both a faster improvement in the mean win rate (higher sample efficiency) and higher final-quarter mean win rates with reduced variance for R-QMIX. This suggests that relaxing the strict monotonicity constraint while softly encouraging monotone behavior allows the mixing network to better capture the complex credit assignment structure present in heterogeneous teams.

4.5. Hard Scenario: 6h vs. 8z

The 6h vs. 8z scenario pits six hydralisks against eight zealots and is considered a hard SMAC map due to the asymmetric unit types and the need for precise kiting and focus fire. This makes it a useful testbed for evaluating the robustness of value factorization methods under more intricate coordination requirements.
The 6h vs. 8z scenario fills the gap between the relatively easy 3m and the super-hard 27m vs. 30m, and its quarterly averages highlight how the relaxed monotonicity constraint interacts with asymmetric unit dynamics. In terms of the convergence criteria defined above (sample efficiency and asymptotic win rate), both QMIX and QTRAN effectively fail to converge, maintaining near-zero win rates throughout training, whereas R-QMIX converges to a non-trivial policy with a final-quarter mean win rate of 57.5 %.

4.6. Super-Hard Scenario: 27m vs. 30m

The super-hard 27m vs. 30m scenario has been widely used as a stress test for MARL algorithms due to its large number of units and highly nontrivial coordination requirements. This map is particularly sensitive to optimization behavior and the representational capacity of the mixer.
On 27m vs. 30m, R-QMIX achieves substantially higher win rates than QMIX while exhibiting smoother and more reliable convergence, as reflected by both instantaneous win rate curves and quarterly statistics. This result supports the central claim that softly enforcing monotonicity via a gradient-based regularizer can provide a better tradeoff between expressiveness and convergence on complex multi-agent tasks.
An ablation experiment with the same training heuristics but no monotonicity regularization (setting λ mono = 0 ) reduces the final-quarter mean win rate on 27m vs. 30m from 96.6 % to 21.6 % (Appendix B), confirming that the soft monotonicity penalty is the primary driver of the observed performance gains.

4.7. Summary of Empirical Findings

Across all four SMAC scenarios, R-QMIX exhibits improved empirical convergence in the sense defined at the beginning of this section, i.e., both in terms of sample efficiency and asymptotic performance.
  • Easy map (3m). On 3m, both QMIX and R-QMIX quickly reach near-optimal performance. The final-quarter mean win rates are 97.4 % for QMIX and 98.2 % for R-QMIX (Table 2), with overlapping standard deviations. R-QMIX learns slightly faster in the early quarters (e.g., 67.5 % vs. 30.0 % in Quarter 1), but the main takeaway is that softening the monotonicity constraint does not harm convergence on simple cooperative tasks.
  • Mixed-unit map (MMM2). On MMM2, the benefits of relaxed monotonicity are pronounced. QMIX reaches a final-quarter mean win rate of 42.3 % , while R-QMIX attains 97.1 % with substantially lower variance (Table 4). R-QMIX also learns much faster; by Quarter 2 it already achieves a 76.2 % mean win rate, versus 0.2 % for QMIX. QTRAN remains below both methods, ending at 24.7 % .
  • Hard map (6h vs. 8z). On the hard 6h vs. 8z scenario, QMIX and QTRAN effectively fail to learn, staying at a near-zero win rate throughout training. In contrast, R-QMIX converges to a non-trivial policy with a final-quarter mean win rate of 57.5 % (Table 6) after gradually improving from 0.5 % in Quarter 1 and 7.1 % in Quarter 2. This highlights that soft monotonicity can unlock policies that are inaccessible to strictly monotone or unconstrained-but-unstable baselines.
  • Super-hard map (27m vs. 30m). The super-hard 27m vs. 30m scenario is the most challenging setting in this study. Here, R-QMIX reaches a final-quarter mean win rate of 96.6 % , while QMIX plateaus at 58.0 % and QTRAN at 35.2 % (Table 8). The gap opens early in training: by Quarter 2, R-QMIX already achieves a 73.3 % mean win rate, compared to 0.1 % for QMIX and 4.3 % for QTRAN.
An ablation experiment on 27m vs. 30m with λ mono = 0 confirms that the soft monotonicity penalty is crucial for these gains. With identical architectures and optimization heuristics but no regularization, the final-quarter mean win rate drops from 96.6 % to 21.6 % (Table A5). This suggests that relaxed monotonicity, rather than incidental tuning, is the primary driver of the improved convergence.
Taken together, these results support the central claim of this paper: replacing hard monotonicity with a soft differentiable regularizer yields a value factorization method that matches QMIX on easy tasks and substantially improves both sample efficiency and asymptotic performance on challenging SMAC scenarios while remaining compatible with decentralized greedy execution.
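For concreteness, the quarterly statistics reported throughout this section can be computed from a logged evaluation curve as follows. This is a minimal sketch under stated assumptions: the function name `quarterly_stats` and the equal-quarter split are illustrative, not a verbatim excerpt of the PyMARL logging code.

```python
from statistics import mean, pstdev

def quarterly_stats(win_rates):
    """Split an evaluation win-rate curve into four quarters of the
    training horizon and report (mean, std) for each quarter."""
    n = len(win_rates)
    q = n // 4
    stats = []
    for i in range(4):
        # the last quarter absorbs any remainder evaluation points
        chunk = win_rates[i * q:(i + 1) * q] if i < 3 else win_rates[3 * q:]
        stats.append((mean(chunk), pstdev(chunk)))
    return stats
```

A curve that improves over training yields increasing quarterly means, matching the presentation style of the tables referenced in this section.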

5. Discussion

5.1. Summary of Findings

The experiments indicate that strict adherence to the monotonicity constraint in QMIX is not always necessary for effective decentralized execution, and can limit performance on tasks with strong non-monotonic interactions between agents. By introducing a soft monotonicity regularizer, R-QMIX retains the standard QMIX agent architecture and CTDE pipeline while expanding the effective hypothesis class of the mixer. Practically, the largest gains appear in scenarios where classical value factorization methods typically struggle.
In particular, on the mixed-unit MMM2 and super-hard 27m vs. 30m maps, R-QMIX consistently achieves high final performance in our experiments, whereas QMIX either fails to learn strong policies or plateaus at substantially lower win rates and QTRAN exhibits instability and weaker convergence. Overall, these results suggest that a modest change to the learning objective can reduce a meaningful portion of the expressiveness gap to more complex factorization schemes while preserving the simplicity and scalability of the original QMIX architecture.

5.2. Why Soft Monotonicity Helps: Optimization and Representation

5.2.1. Optimization Perspective

From an optimization perspective, the regularizer acts as a smoothing term on the joint value landscape. By penalizing strong negative local slopes of Q tot with respect to individual utilities, it discourages sharp antagonistic couplings between agents early in training when TD targets are noisy and Q-values are poorly shaped. This bias can prevent the mixer from overfitting to spurious and brittle interactions that undermine cooperative credit assignment, leading to faster and more reliable convergence in practice.

5.2.2. Representation Perspective

From a representation perspective, soft monotonicity provides a continuum between strictly monotone value factorization (QMIX) and an unconstrained joint action–value model. R-QMIX occupies an intermediate regime: it is more expressive than QMIX due to the removal of hard sign constraints, yet retains the same decentralized greedy execution rule as QMIX and remains structured enough to train stably in practice. Strong performance on MMM2 and 27m vs. 30m suggests that this intermediate regime is well matched to the complex coordination patterns in SMAC.

5.3. Role of the Regularizer: Evidence from Ablations

The ablation with λ mono = 0 on 27m vs. 30m further supports this interpretation. When the regularizer is removed while keeping the remaining training heuristics fixed, performance degrades sharply (final-quarter mean win rate drops from 96.6 % to 21.6 % ). This result suggests that the monotonicity penalty itself—not incidental implementation choices—accounts for most of the observed gains on the hardest scenario.

5.4. Mechanistic Comparison to Weighted QMIX and QPLEX on Non-Monotonic Tasks

Weighted QMIX [24] and QPLEX [25] were proposed to expand the representational capacity of monotonic value factorization beyond the original QMIX formulation [10]. To situate R-QMIX relative to these approaches on non-monotonic tasks, we provide the following qualitative mechanistic comparison.

5.4.1. QMIX and the Source of the Limitation

QMIX enforces the individual-global-max (IGM) property by constraining the mixing function so that Q tot / Q a 0 for all agents a [10]. This guarantees that decentralized greedy action selection is consistent with the joint greedy action under the learned factorization. However, the same constraint restricts the hypothesis class and can underfit tasks in which an agent’s effective contribution is context-dependent or exhibits interference (i.e., locally increasing Q a can decrease Q tot in some regions).

5.4.2. R-QMIX: Relaxing Monotonicity via a Soft Constraint on Local Slopes

R-QMIX retains the hypernetwork-based mixer parameterization but replaces hard non-negativity constraints with a penalty on local slopes g a = Q tot / Q a falling below a margin. Mechanistically, this introduces a continuous tradeoff between approximate monotonicity and expressive non-monotonic couplings: when a monotone factorization suffices, the optimizer can reduce TD error while driving the monotonicity loss toward zero; when accurately fitting Bellman targets requires localized non-monotonic interactions, the mixer can violate the slope condition at a controlled cost. Importantly, this approach directly controls the degree of monotonicity violation through λ mono ( t env ) without changing the decentralized execution rule.
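The margin penalty described above can be sketched as follows under stated assumptions: a generic callable `mixer` stands in for the hypernetwork mixing network, and finite differences stand in for the autograd partial derivatives used in practice.

```python
def mono_penalty(mixer, q_agents, margin=0.01, eps=1e-4):
    """Hinge penalty on local slopes g_a = dQ_tot/dQ_a that fall below
    `margin`. `mixer` maps a list of per-agent utilities to a scalar
    Q_tot; the actual method differentiates the mixing network with
    autograd, while finite differences keep this sketch dependency-free."""
    base = mixer(q_agents)
    total = 0.0
    for a in range(len(q_agents)):
        bumped = list(q_agents)
        bumped[a] += eps
        g_a = (mixer(bumped) - base) / eps  # local slope w.r.t. agent a
        total += max(0.0, margin - g_a)     # zero cost once g_a >= margin
    return total / len(q_agents)

# A monotone mixing incurs (near-)zero penalty, while an antagonistic
# coupling (negative slope for agent 1) is charged in proportion to the
# violation; this term is added to the TD loss with weight lambda_mono.
monotone = lambda q: 2.0 * q[0] + 0.5 * q[1]
antagonistic = lambda q: q[0] - q[1]
```

The hinge form makes the tradeoff explicit: the optimizer pays nothing for approximately monotone slopes and a cost proportional to the violation otherwise.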

5.4.3. Weighted QMIX: Expanding Capacity While Preserving IGM

Weighted QMIX expands the representational capacity of monotone factorization while maintaining an IGM-style guarantee [24]. Mechanistically, it enlarges the class of monotone joint value functions representable by the mixer, which can mitigate underfitting while preserving decentralized greedy consistency. In contrast, R-QMIX explicitly permits localized non-monotonic structure when needed, trading the theoretical IGM guarantee for additional expressiveness.

5.4.4. QPLEX: Restructuring Factorization for Greater Expressiveness

QPLEX introduces a duplex dueling factorization designed to represent a broader class of joint action–values while retaining an IGM-consistent structure in the relevant components [25]. Mechanistically, QPLEX increases expressiveness by modifying the decomposition and mixing of utilities/advantages rather than by directly relaxing the sign constraints on mixer weights. Compared to R-QMIX, QPLEX aims to better approximate complex interactions within an IGM-consistent framework, whereas R-QMIX adopts a direct relaxation that can represent localized monotonicity violations when strictly monotone representations are insufficient.

5.4.5. When Might Each Approach Help on “Non-Monotonic” Scenarios

If the task can be well-approximated by a richer monotone factorization, then weighted QMIX and QPLEX may improve performance while preserving decentralized-greedy consistency. In settings where optimal behavior requires genuine non-monotonic couplings between agent utilities (e.g., interference patterns or context-dependent negative contributions), R-QMIX provides an explicit mechanism to represent such structure, with  λ mono ( t env ) controlling how often and how strongly the mixer departs from the monotone regime.
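As an illustration of the schedule λ mono ( t env ), a linear ramp between the start and end values listed in Table A1 might look like the following. The warmup/anneal parameterization is an assumption consistent with Table A1, not a verbatim excerpt of the training code.

```python
def lambda_mono(t_env, start=0.3, end=0.6, t_warmup=0, t_anneal=2_000_000):
    """Linearly ramp the monotonicity coefficient from `start` to `end`
    over `t_anneal` environment steps after an optional warmup period.
    Defaults mirror the 27m vs. 30m R-QMIX row of Table A1."""
    if t_env <= t_warmup:
        return start
    frac = min(1.0, (t_env - t_warmup) / t_anneal)
    return start + frac * (end - start)
```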

5.4.6. Optimality of Decentralized Greedy Execution Under Monotonicity Violations

QMIX guarantees IGM only when the mixing function is globally monotone in each agent utility, i.e.,  Q tot / Q a 0 for all a and relevant inputs [10]. Because R-QMIX relaxes this constraint, decentralized greedy execution is not theoretically guaranteed to be optimal in states where monotonicity is violated (e.g., where some local slopes g a = Q tot / Q a become negative). In such cases, the joint maximizer arg max u Q tot ( s , u ) may differ from the composition of per-agent greedy actions. In practice, R-QMIX treats monotonicity as a regularizer: the coefficient λ mono controls a tradeoff between approximate IGM consistency and additional expressiveness, and we monitor the frequency and magnitude of slope violations to characterize how close the learned mixer remains to the monotone regime.
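The consistency question above can be made concrete with a brute-force check over small joint action spaces. This is an illustrative sketch: the callable `mixer` and the exhaustive enumeration are assumptions, since the paper's diagnostic tracks slope violations rather than the joint argmax.

```python
from itertools import product

def greedy_matches_joint_argmax(q_agents, mixer):
    """Return True iff composing per-agent greedy actions equals the joint
    argmax of Q_tot = mixer(...), i.e., IGM consistency holds here.

    q_agents: per-agent lists of action values; mixer: callable on the
    chosen per-agent utilities. Brute force is feasible only for tiny
    action sets, which suffices for a diagnostic illustration."""
    decentralized = tuple(max(range(len(qa)), key=qa.__getitem__)
                          for qa in q_agents)
    joint = max(product(*(range(len(qa)) for qa in q_agents)),
                key=lambda u: mixer([qa[ua] for qa, ua in zip(q_agents, u)]))
    return decentralized == joint
```

With a positively weighted sum the check always passes; with a negative slope for one agent the decentralized greedy tuple can differ from the joint argmax, which is exactly the failure mode the λ mono penalty suppresses.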

5.5. Limitations and Future Work

There are several limitations and directions for future work. First, this study considers only value-based methods with discrete actions; extending similar regularization ideas to actor–critic frameworks and continuous-action MARL would be valuable [29,30]. Second, the evaluation focuses on a limited set of SMAC maps; broader testing on diverse benchmarks—including real-world multi-robot settings—would provide a more comprehensive assessment.
Third, the current regularizer uses a margin-based penalty on local partial derivatives. Alternative formulations (e.g., Huber-style penalties, different margins/schedules, or constraints that incorporate higher-order structure) may yield further improvements. Fourth, while we report diagnostics on monotonicity violations, a direct test-time IGM-consistency evaluation (quantifying mismatch between decentralized greedy execution and the joint argmax of Q tot ) remains an important direction for future work.
Finally, the empirical comparison is restricted to classical value-factorization baselines (QMIX and QTRAN). Evaluating R-QMIX against more recent methods such as weighted QMIX, QPLEX, and DCG on a broader suite of tasks [24,25,27] is an important next step.

6. Conclusions

This paper introduces Relaxed Monotonic QMIX (R-QMIX), an extension of QMIX that replaces hard monotonicity constraints with a soft differentiable regularizer. R-QMIX maintains the architecture and training pipeline of QMIX but allows the mixing network to adopt non-monotonic configurations when they significantly improve the fit to Bellman targets.
On the StarCraft Multi-Agent Challenge, R-QMIX matches QMIX on an easy map while providing substantial gains in both sample efficiency and final performance (i.e., improved empirical convergence) on more challenging scenarios such as MMM2 and 27m vs. 30m. These results demonstrate that soft monotonicity regularization is a promising direction for improving value factorization methods in cooperative MARL.
Future work will extend this framework in two main directions: first, we aim to combine relaxed monotonicity with hierarchical and heterogeneous value factorization, in which different subteams (for example, air, ground, or underwater robots) have their own mixers coupled by a higher-level coordinator; second, we plan to evaluate R-QMIX and its hierarchical variants on real-world multi-robot platforms, where partial observability, actuation delays, and safety constraints play a central role. Taken together, these directions turn R-QMIX from a single algorithm into the foundation of a broader line of relaxed-structure value factorization methods for cooperative MARL.

Author Contributions

Conceptualization, L.O. and H.X.; methodology, L.O.; software, L.O.; validation, L.O. and H.X.; formal analysis, L.O.; investigation, L.O.; resources, H.X.; data curation, L.O.; writing—original draft preparation, L.O.; writing—review and editing, L.O. and H.X.; visualization, L.O.; supervision, H.X.; project administration, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science Foundation under Grant 2144646 and in part by the Army Research Office through the Cooperative Agreement under Grant W911NF-24-2-0133.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Trained models, configuration files, and training logs will be made available at https://github.com/liamobbrien1/R-QMIX-A-Regularized-Value-Factorization-Approach-to-MARL (accessed on 3 January 2026).

Acknowledgments

The authors thank the members of the Autonomous Systems Lab at the University of Nevada, Reno for helpful discussions. During the preparation of this manuscript, the authors used GPT-4 to check grammar and adherence to MDPI formatting and standards. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
RL: Reinforcement Learning
MARL: Multi-Agent Reinforcement Learning
MDP: Markov Decision Process
POMDP: Partially Observable Markov Decision Process
Dec-POMDP: Decentralized Partially Observable Markov Decision Process
CTDE: Centralized Training, Decentralized Execution
VDN: Value Decomposition Networks
QMIX: Monotonic Mixing Network
QTRAN: Q-Value Transformation
QPLEX: Duplex Dueling Multi-Agent Q-Learning
DCG: Deep Coordination Graphs
R-QMIX: Relaxed Monotonic QMIX (this work)
DQN: Deep Q-Network
DRQN: Deep Recurrent Q-Network
TD: Temporal Difference
SMAC: StarCraft Multi-Agent Challenge
MSE: Mean Squared Error
COMA: Counterfactual Multi-Agent policy gradients
MAVEN: Multi-Agent Variational Exploration
WR: Win Rate

Appendix A. Additional Experimental Details

Table A1. Key training hyperparameters used in the SMAC experiments. All runs share the same network architecture; differences between rows correspond to map-specific or algorithm-specific settings. "N/A" indicates that the hyperparameter is not used by the algorithm.

| Map | Algorithm | t_max | Batch | γ | η | lr_mixer | t_warmup | t_anneal | λ_mono (start → end) | ε (start → end) |
|---|---|---|---|---|---|---|---|---|---|---|
| 3m | QMIX | 2 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 3m | R-QMIX | 2 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 1.0 → 1.0 | 1.0 → 0.05 |
| 3m | QTRAN | 2 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| MMM2 | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| MMM2 | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 0.5 → 0.8 | 1.0 → 0.05 |
| MMM2 | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 6h vs. 8z | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 6h vs. 8z | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 1.6 → 1.6 | 1.0 → 0.05 |
| 6h vs. 8z | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 27m vs. 30m | QMIX | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
| 27m vs. 30m | R-QMIX | 6 M | 32 | 0.99 | 0.01 | 0.00025 | 0 | 2 M | 0.3 → 0.6 | 1.0 → 0.05 |
| 27m vs. 30m | QTRAN | 6 M | 32 | 0.99 | N/A | 0.00025 | N/A | N/A | N/A | 1.0 → 0.05 |
Table A2. Optimizer and optimization schedule used in all experiments (Adam with parameter groups).

| Setting | Value |
|---|---|
| Optimizer | Adam |
| Parameter groups | agents (lr), mixer (lr_mixer + weight decay) |
| Adam (β1, β2) | (0.9, 0.999) |
| Adam ε | 10⁻⁸ |
| Agent learning rate (lr) | 5 × 10⁻⁴ |
| Mixer learning rate (lr_mixer) | 2.5 × 10⁻⁴ |
| Weight decay (mixer params only) | 1 × 10⁻⁵ |
| Gradient clipping (ℓ2 norm) | clipped to 10.0 |
| LR milestones (t_env) | {2.0 M, 3.5 M, 5.0 M} |
| LR decay factor (per milestone) | 0.5 |
| Target update cadence (environment steps) | target_update_interval_t = 20,000 |
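The milestone schedule in Table A2 corresponds to a simple step decay, which might be implemented as follows (a sketch; the function name `mixer_lr` is illustrative, not taken from the training code).

```python
def mixer_lr(t_env, base_lr=2.5e-4,
             milestones=(2_000_000, 3_500_000, 5_000_000), decay=0.5):
    """Halve the mixer learning rate at each environment-step milestone
    listed in Table A2."""
    passed = sum(t_env >= m for m in milestones)  # milestones already crossed
    return base_lr * decay ** passed
```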
Table A3. Exploration and evaluation protocol used in SMAC runs. Unless noted, settings are identical across maps and algorithms.

| Setting | Value |
|---|---|
| Action selector | ε-greedy |
| ε schedule | 1.0 → 0.05 over t_anneal env steps |
| t_anneal (3m, R-QMIX only) | 2 × 10⁵ |
| t_anneal (all other maps) | 2 × 10⁶ |
| t_anneal (QTRAN) | 5 × 10⁴ |
| Evaluation interval | every test_interval = 10,000 env steps |
| Evaluation episodes | test_nepisode = 32 |
| Evaluation policy | greedy (test_greedy=True) |
Table A4. Random seeds used for all experiments. The same six seeds are reused across all algorithms and SMAC maps.

| Seed 1 | Seed 2 | Seed 3 | Seed 4 | Seed 5 | Seed 6 |
|---|---|---|---|---|---|
| 551,715,561 | 816,329,254 | 246,463,945 | 802,630,730 | 906,766,130 | 601,271,281 |

Appendix B. Ablation: 27m vs. 30m with λ mono = 0

Figure A1. Test win rate on 27m vs. 30m when the monotonicity regularization coefficient is set to λ mono = 0 (no regularization). This ablation isolates the effect of the soft monotonicity penalty from the other training heuristics used in R-QMIX. Dashed lines show quarterly mean, shadows show quarterly standard deviation.
Table A5. Ablation on R-QMIX on 27m vs. 30m with λ mono = 0 (no monotonicity regularization).

| Quarter | Mean Win Rate | Std |
|---|---|---|
| 1 | 0.037 | 0.037 |
| 2 | 0.112 | 0.029 |
| 3 | 0.180 | 0.031 |
| 4 | 0.216 | 0.036 |
Figure A2. Test win rate on MMM2 with η sweep. Map seed: 601,271,281.
Table A6. Quarterly mean win rate and within-run standard deviation for seed 601,271,281 across η settings. Each cell reports Mean(win) ± Std for that quarter. Bold indicates the highest mean win rate within each quarter.

| Setting | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
|---|---|---|---|---|
| η = 0.00 | 0.000 ± 0.000 | 0.026 ± 0.046 | **0.652 ± 0.260** | 0.924 ± 0.059 |
| η = 0.01 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.002 ± 0.008 | 0.124 ± 0.131 |
| η = 0.02 | 0.000 ± 0.000 | **0.049 ± 0.082** | 0.634 ± 0.230 | **0.933 ± 0.060** |
| η = 0.03 | 0.000 ± 0.000 | **0.049 ± 0.090** | 0.574 ± 0.187 | 0.892 ± 0.072 |
| η = 0.04 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |

Appendix C. Algorithm Result Comparison Table

Table A7. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on the 3m map.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +37.5 | +0.199 | +43.9 | +0.117 |
| 2 | +31.1 | −0.139 | +30.3 | −0.133 |
| 3 | +9.0 | −0.057 | +9.6 | −0.059 |
| 4 | +0.8 | −0.006 | +1.6 | −0.008 |
Table A8. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on MMM2.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +16.2 | +0.167 | +0.0 | +0.000 |
| 2 | +76.0 | +0.120 | −0.1 | −0.002 |
| 3 | +84.7 | −0.034 | −5.0 | −0.031 |
| 4 | +54.8 | −0.133 | −17.6 | −0.038 |
Table A9. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on 6h vs. 8z.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +0.5 | +0.008 | +0.0 | +0.000 |
| 2 | +7.1 | +0.043 | +0.2 | +0.005 |
| 3 | +30.1 | +0.090 | +0.5 | +0.008 |
| 4 | +57.5 | +0.062 | +1.4 | +0.014 |
Table A10. Quarterly percentage-point difference in mean win rate and signed difference in standard deviation relative to QMIX on 27m vs. 30m.

| Quarter | Δμ R-QMIX (%) | Δσ R-QMIX | Δμ QTRAN (%) | Δσ QTRAN |
|---|---|---|---|---|
| 1 | +16.8 | +0.167 | +0.2 | +0.004 |
| 2 | +73.2 | +0.126 | +4.2 | +0.025 |
| 3 | +72.7 | −0.109 | −1.2 | −0.057 |
| 4 | +38.6 | −0.066 | −22.8 | −0.026 |

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  2. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  3. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  4. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  5. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
  6. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  7. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer: Cham, Switzerland, 2016. [Google Scholar]
  8. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6390. [Google Scholar]
  9. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [Google Scholar]
  10. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304. [Google Scholar]
  11. Samvelyan, M.; Rashid, T.; de Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.; Hung, C.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Montreal, QC, Canada, 13–17 May 2019; pp. 2186–2188. [Google Scholar]
  12. Panait, L.; Luke, S. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst. 2005, 11, 387–434. [Google Scholar] [CrossRef]
  13. Busoniu, L.; Babuška, R.; De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 2008, 38, 156–172. [Google Scholar] [CrossRef]
  14. Hernández-Leal, P.; Kartal, B.; Taylor, M.E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797. [Google Scholar] [CrossRef]
  15. Oroojlooyjadid, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. arXiv 2019, arXiv:1908.03963. [Google Scholar]
  16. Papoudakis, G.; Christianos, F.; Schäfer, L.; Albrecht, S.V. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv 2019, arXiv:1906.04737. [Google Scholar]
  17. Bellman, R. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  18. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  19. Hausknecht, M.; Stone, P. Deep recurrent Q-learning for partially observable MDPs. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents, Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
  20. Cho, K.; Merriënboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  21. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  22. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395. [Google Scholar] [CrossRef] [PubMed]
  23. Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the 10th International Conference on Machine Learning (ICML), Amherst, MA, USA, 27–29 June 1993; pp. 330–337. [Google Scholar]
  24. Rashid, T.; de Witt, C.S.; Farquhar, G.; Whiteson, S. Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv 2020, arXiv:2006.10800. [Google Scholar] [CrossRef]
  25. Wang, T.; Han, B.; Xu, H.; Wang, X.; Dong, H.; Zhang, C. QPLEX: Duplex dueling multi-agent Q-learning. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  26. Yang, Y.; Meng, Z.; Hao, J.; Zhang, H.; Wang, Z. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv 2020, arXiv:2002.03939. [Google Scholar] [CrossRef]
  27. Böhmer, W.; Rashid, T.; Thoma, J.; Oliehoek, F.A.; Whiteson, S. Deep coordination graphs. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 12–18 July 2020. [Google Scholar]
  28. Son, K.; Kim, D.; Kang, W.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
  29. Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  30. Mahajan, A.; Samvelyan, M.; Rashid, T.; de Witt, C.S.; Whiteson, S. MAVEN: Multi-agent variational exploration. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  31. Peng, P.; Wen, Y.; Yang, Y.; Yuan, Q.; Tang, Z.; Long, H.; Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv 2017, arXiv:1703.10069. [Google Scholar]
  32. Zhu, C.; Dastani, M.; Wang, S. A survey of multi-agent deep reinforcement learning with communication. Auton. Agents Multi-Agent Syst. 2024, 38, 4. [Google Scholar] [CrossRef]
  33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Conceptual connection between Relaxed Monotonic QMIX (R-QMIX) and cooperative multi-robot systems. A team of robots operates under partial observability in a shared task. Each agent learns a local Deep Recurrent Q-Network (DRQN) based utility Q a ( τ t a , u t a ) , which is combined with the global state s t by an R-QMIX mixing network with a soft monotonicity regularizer to form the joint action–value Q tot ( s t , u t ) . Centralized training uses global information while decentralized execution uses greedy actions with respect to the local utilities, enabling coordinated multi-robot behavior.
Figure 2. R-QMIX architecture. The mixing network (pink) combines per-agent utilities into Q tot using state-conditioned hypernetworks (red). A soft monotonicity regularization term is added to the mixer training objective (penalizing negative Q tot / Q i ), while mixer weights remain unconstrained. Agents use a DRQN (deep recurrent Q-network) utility model with an MLP–gated recurrent unit (GRU)–MLP structure (green). Arrows indicate the direction of information flow.
Figure 3. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 3m. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 4. Win-rate comparison between QMIX, R-QMIX, and QTRAN on MMM2. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 5. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 6h vs. 8z. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 6. Win-rate comparison between QMIX, R-QMIX, and QTRAN on 27m vs. 30m. Dashed lines show quarterly means; shaded regions show quarterly standard deviations.
Figure 7. Fraction of slopes below the margin η = 0.01 on the 3m map.
Figure 8. Fraction of slopes below the margin η = 0.01 on the MMM2 map.
Figure 9. Fraction of slopes below the margin η = 0.01 on the 6h vs. 8z map.
Figure 10. Fraction of slopes below the margin η = 0.01 on the 27m vs. 30m map.
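Figures 7–10 track the fraction of estimated local slopes ∂Q_tot/∂Q_a that fall below the margin η = 0.01. As a minimal illustrative sketch (the function name and input convention are ours, not from the paper), the diagnostic reduces to a thresholded average over a batch of slope estimates:

```python
import numpy as np

def fraction_below_margin(slopes, eta=0.01):
    """Fraction of estimated partial derivatives dQ_tot/dQ_a that fall
    below the margin eta (the quantity plotted in Figures 7-10).
    `slopes` is any array of slope estimates collected over a batch."""
    slopes = np.asarray(slopes, dtype=float)
    return float(np.mean(slopes < eta))
```

A value of 0 means every sampled slope respects the margin (strictly monotone behavior); larger values indicate more frequent local monotonicity violations.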
Table 1. QMIX vs. R-QMIX at a glance.
| Aspect | QMIX | R-QMIX |
|---|---|---|
| Mixer weight constraint | Non-negative mixer weights, e.g., W = abs(W_raw) or W = softplus(W_raw). | Weights unconstrained (W = W_raw; may be negative); monotonicity encouraged via a soft penalty on local slopes. |
| Monotonicity guarantee | Yes: ∂Q_tot/∂Q_a ≥ 0 (IGM under the model class). | No formal guarantee; encouraged locally/in expectation via regularization. |
| Extra loss terms | TD loss only (plus any shared baseline regularizers). | L_TD + λ_mono(t_env) · L_mono. |
| Extra hyperparameters | None beyond shared architecture/training knobs. | λ_mono schedule, margin η, exponent p (and δ for finite differences). |
| Computational overhead | Baseline. | Small–moderate (compute ∂Q_tot/∂Q_a and the penalty). |
| When it helps (intuition) | When a monotone factorization is sufficient; typically stable and sample-efficient. | When interactions are non-monotonic (synergy/interference) and strict QMIX monotonicity underfits or destabilizes learning on hard scenarios. |
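The R-QMIX column above mentions a soft penalty on local slopes with margin η, exponent p, and a finite-difference step δ. The following is a hedged, illustrative reconstruction of such a penalty on a single joint sample (the function name, hinge form, and finite-difference scheme are our assumptions; the paper's implementation may differ, e.g., by using autograd instead of finite differences):

```python
import numpy as np

def soft_monotonicity_penalty(q_tot_fn, q_agents, eta=0.01, p=2, delta=1e-3):
    """Sketch of a soft monotonicity penalty L_mono for one sample.

    q_tot_fn : callable mapping a vector of per-agent utilities to Q_tot
               (state conditioning omitted for brevity).
    q_agents : vector of per-agent utilities Q_a at the sampled actions.

    Estimates each slope dQ_tot/dQ_a by a central finite difference with
    step delta, then applies a hinge at the margin eta raised to power p,
    so slopes >= eta incur zero penalty.
    """
    q_agents = np.asarray(q_agents, dtype=float)
    penalties = []
    for a in range(q_agents.shape[0]):
        bump = np.zeros_like(q_agents)
        bump[a] = delta
        # central finite-difference estimate of the partial derivative
        slope = (q_tot_fn(q_agents + bump) - q_tot_fn(q_agents - bump)) / (2 * delta)
        # hinge on the margin: zero penalty when slope >= eta
        penalties.append(max(0.0, eta - slope) ** p)
    return float(np.mean(penalties))
```

In training this term would be scaled by the schedule λ_mono(t_env) and added to the TD loss, per the table above.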
Table 2. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 3m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.300 | 0.126 | 0.675 | 0.325 | 0.739 | 0.243 |
| 2 | 0.669 | 0.152 | 0.980 | 0.013 | 0.972 | 0.019 |
| 3 | 0.893 | 0.069 | 0.983 | 0.012 | 0.989 | 0.010 |
| 4 | 0.974 | 0.017 | 0.982 | 0.011 | 0.990 | 0.009 |
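The quarterly statistics reported here (and in Tables 4, 6, and 8) summarize each algorithm's evaluation win-rate curve over four equal training phases. One plausible aggregation, assumed rather than taken from the paper, pools all evaluation points falling in each quarter across seeds:

```python
import numpy as np

def quarterly_stats(win_rates):
    """Split each seed's evaluation win-rate curve into four equal
    quarters (by evaluation index) and report (mean, std) of the values
    pooled across seeds within each quarter. Aggregation scheme assumed."""
    win_rates = np.atleast_2d(np.asarray(win_rates, dtype=float))  # (seeds, evals)
    quarters = np.array_split(win_rates, 4, axis=1)
    return [(float(q.mean()), float(q.std())) for q in quarters]
```

For a curve that starts at 0 and converges to 1, the first-quarter mean stays low while the fourth-quarter mean approaches the asymptotic win rate, which is why the quarter-4 column is the best proxy for final performance.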
Table 3. Time to reach win-rate thresholds on 3m (seeds = 6; hold k = 3 consecutive evals). Values are mean across seeds; “n/6” indicates how many runs reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 361 k | 00:34:46 | 599 k | 00:56:54 | 1.03 M | 01:35:28 | 6/6/6 |
| R-QMIX | 154 k | 00:14:34 | 247 k | 00:22:51 | 319 k | 00:29:19 | 6/6/6 |
| QTRAN | 120 k | 00:11:09 | 190 k | 00:17:30 | 307 k | 00:28:11 | 6/6/6 |
Table 4. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the MMM2 map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.001 | 0.162 | 0.168 | 0.000 | 0.001 |
| 2 | 0.002 | 0.005 | 0.762 | 0.125 | 0.001 | 0.003 |
| 3 | 0.085 | 0.061 | 0.932 | 0.027 | 0.035 | 0.030 |
| 4 | 0.423 | 0.146 | 0.971 | 0.013 | 0.247 | 0.108 |
Table 5. Time to reach win-rate thresholds on MMM2 (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.97 M | 09:09:41 | 5.57 M | 10:14:54 | N/A | N/A | 4/4/0 |
| R-QMIX | 1.51 M | 02:41:40 | 1.90 M | 03:22:14 | 2.63 M | 04:43:41 | 6/6/6 |
| QTRAN | 5.36 M | 09:32:44 | 5.69 M | 10:02:00 | N/A | N/A | 3/1/0 |
Table 6. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 6h vs. 8z map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.005 | 0.008 | 0.000 | 0.000 |
| 2 | 0.000 | 0.000 | 0.071 | 0.043 | 0.002 | 0.005 |
| 3 | 0.000 | 0.000 | 0.301 | 0.090 | 0.005 | 0.008 |
| 4 | 0.000 | 0.001 | 0.575 | 0.063 | 0.014 | 0.015 |
Table 7. Time to reach win-rate thresholds on 6h vs. 8z (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
| R-QMIX | 4.41 M | 08:23:15 | 5.70 M | 10:50:57 | N/A | N/A | 6/3/0 |
| QTRAN | N/A | N/A | N/A | N/A | N/A | N/A | 0/0/0 |
Table 8. Quarterly mean win rate and standard deviation for QMIX, R-QMIX, and QTRAN on the 27m vs. 30m map. Bold indicates the highest mean win rate in each quarter.
| Quarter | QMIX Mean | QMIX Std | R-QMIX Mean | R-QMIX Std | QTRAN Mean | QTRAN Std |
|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.168 | 0.167 | 0.002 | 0.004 |
| 2 | 0.001 | 0.004 | 0.733 | 0.130 | 0.043 | 0.029 |
| 3 | 0.203 | 0.133 | 0.930 | 0.024 | 0.191 | 0.076 |
| 4 | 0.580 | 0.078 | 0.966 | 0.012 | 0.352 | 0.052 |
Table 9. Time to reach win-rate thresholds on 27m vs. 30m (seeds = 6; hold k = 3 consecutive evals). Values are mean ± std across seeds; “n/6” indicates how many runs reached the threshold; “N/A” indicates that no run reached the threshold.
| Algorithm | 50% WR t_env | 50% WR Clock | 70% WR t_env | 70% WR Clock | 90% WR t_env | 90% WR Clock | Reached (50/70/90) |
|---|---|---|---|---|---|---|---|
| QMIX | 4.49 M | 15:44:36 | 4.72 M | 16:41:25 | 5.02 M | 17:33:47 | 6/4/1 |
| R-QMIX | 1.47 M | 05:34:45 | 2.02 M | 07:20:18 | 2.89 M | 10:07:24 | 6/6/6 |
| QTRAN | 3.84 M | 14:07:32 | 4.44 M | 16:24:35 | N/A | N/A | 2/2/0 |
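The time-to-threshold tables above use a “hold k = 3 consecutive evaluations” criterion. A small sketch of one way such a criterion could be computed (here reporting the environment step at which the hold begins; the paper's exact convention is assumed, not stated):

```python
def time_to_threshold(eval_steps, win_rates, threshold, k=3):
    """First environment step at which the evaluation win rate reaches
    `threshold` and holds for k consecutive evaluations; None if never.

    eval_steps : environment-step count at each periodic evaluation.
    win_rates  : evaluation win rate at each of those steps.
    """
    hits = 0
    for i, wr in enumerate(win_rates):
        hits = hits + 1 if wr >= threshold else 0  # reset on any dip below
        if hits == k:
            return eval_steps[i - (k - 1)]  # step where the hold began
    return None
```

Per-algorithm table entries would then be the mean of this quantity over the seeds that reached the threshold, with “N/A” when no run did.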
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

O’Brien, L.; Xu, H. Relaxed Monotonic QMIX (R-QMIX): A Regularized Value Factorization Approach to Decentralized Multi-Agent Reinforcement Learning. Robotics 2026, 15, 28. https://doi.org/10.3390/robotics15010028


