Energy-Aware Multi-Agent Proximal Policy Optimization with Depletion Safety Constraints for Multi-Robot Coordination

Abdelmeguid, Yassin; Hasan, Ammar

doi:10.3390/robotics15050095



by Yassin Abdelmeguid and Ammar Hasan *

College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates

* Author to whom correspondence should be addressed.

Robotics 2026, 15(5), 95; https://doi.org/10.3390/robotics15050095

Submission received: 19 March 2026 / Revised: 3 May 2026 / Accepted: 4 May 2026 / Published: 8 May 2026

(This article belongs to the Section AI in Robotics)


Abstract

Multi-robot systems operating on battery power face a fundamental constraint: energy limitations directly impact mission success. Existing multi-agent reinforcement learning approaches optimize for task performance without explicit energy consideration, leading to inefficient consumption and depletion risk. This paper presents a framework for energy-aware multi-agent coordination that treats battery management as a safety constraint, rather than an optimization objective. We introduce Energy-Aware Multi-Agent Proximal Policy Optimization (EA-MAPPO) with energy-augmented observations and shaped rewards and extend it to Safe Energy-Aware MAPPO (SEA-MAPPO), which combines predictive action masking with safety-oriented reward shaping. Experimental validation on the Georgia Tech Robotarium with 7 agents demonstrates that SEA-MAPPO reaches 95% goal completion 19× faster than standard MAPPO, requiring only 0.5 M environment steps versus 9.4 M. Throughout training, SEA-MAPPO reduces cumulative depletion events by 93% compared to MAPPO while maintaining superior energy efficiency. SEA-MAPPO achieves 100% goal completion versus 81.5% for MAPPO at the same training budget. Physical deployment on GTernal robots without fine-tuning achieves 100% goal completion with zero depletion events across 70 robot-trials, with the energy predictor achieving $R^{2} = 0.89$ against measured power consumption.

Keywords: multi-agent reinforcement learning; energy-aware coordination; multi-robot systems; safe reinforcement learning; battery management; swarm robotics

1. Introduction

Multi-robot systems have emerged as powerful solutions across warehouse logistics, agricultural monitoring, search and rescue operations, and environmental sensing. The coordination of multiple autonomous robots offers advantages through parallelism, redundancy, and distributed sensing. As these systems scale in fleet size and operational duration, energy management becomes a critical constraint that limits mission success and operational sustainability. Finite battery capacity restricts operation time, and premature depletion of even a single robot can compromise team-wide task completion. Different actions consume varying amounts of energy, with aggressive acceleration drawing more power than steady-state motion, making intelligent energy management a mission-critical capability for autonomous mobile robots that cannot rely on continuous power availability.

Despite this importance, the majority of multi-agent reinforcement learning research optimizes for task-centric metrics such as completion time, throughput, or coverage area without explicit energy modeling. This leads to policies that achieve task success but exhibit problematic energy behaviors, including unnecessary aggressive maneuvers, failure to balance workload across the fleet, and individual robots driven to critical battery levels while others remain underutilized. Such energy-unaware policies undermine operational sustainability and create a risk of mission failure through battery exhaustion.

This paper addresses the gap between task-focused multi-agent coordination and the energy realities of physical robot deployment through the following contributions:

Problem Formulation: We formulate energy-aware multi-robot coordination as a constrained decentralized partially observable Markov decision process with explicit battery dynamics where battery depletion constitutes mission failure.
Energy-Aware Algorithm: We present Energy-Aware Multi-Agent Proximal Policy Optimization (EA-MAPPO) with energy-augmented observations and shaped rewards for efficiency and load balancing.
Safe Energy-Aware Algorithm: The proposed EA-MAPPO is extended to Safe Energy-Aware MAPPO (SEA-MAPPO) combining predictive action masking with safety-oriented reward shaping that prevents battery depletion by filtering unsafe actions.
Energy Predictor Integration: We integrate an autoregressive energy predictor [1] trained on physical GTernal robot telemetry, enabling accurate energy estimation in simulation and action masking during deployment.
Experimental Validation: We demonstrate that SEA-MAPPO reaches 95% goal completion 19× faster than MAPPO while reducing cumulative training depletion by 93%, achieving 100% goal completion versus 81.5% for MAPPO at the same training budget.
Physical Deployment: We deploy trained policies on physical GTernal robots without fine-tuning, achieving 100% goal completion and zero depletion events across 70 robot-trials. The energy predictor achieves $R^{2} = 0.89$ with measured power consumption during deployment.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 formulates the problem. Section 4 describes energy predictor integration. Section 5 presents the algorithms. Section 6 details experimental evaluation. Section 7 concludes.

2. Related Work

2.1. Multi-Agent Reinforcement Learning Foundations

Multi-agent deep reinforcement learning enables robots to learn cooperative behaviors through environmental interaction. Orr and Dutta [2] survey applications across coverage, path planning, and task allocation, identifying three algorithmic families: value-based methods learning action-value functions, policy gradient methods directly optimizing policies, and actor-critic architectures combining both approaches. A key paradigm is Centralized Training with Decentralized Execution (CTDE), where algorithms access global information during training but execute using only local observations.

Value decomposition methods factorize joint action-value functions into per-agent utilities combined through mixing networks, enabling decentralized execution while training on global rewards. QMIX [3] represents this paradigm, employing a monotonicity constraint that ensures greedy action selection on individual Q-values yields optimal joint actions, achieving 90–95% win rates on StarCraft benchmarks with teams of 5–27 units. QPLEX [4] extends this through duplex dueling structure, achieving 95–98% on harder scenarios.

Independent learners treat each agent separately, ignoring other agents during training. Independent Proximal Policy Optimization (iPPO) applies PPO [5] separately to each agent with its own policy and value function based solely on local observations. While theoretically limited by non-stationarity as other agents’ policies change during training, iPPO serves as an important baseline demonstrating when centralized information is beneficial.

Centralized critic methods employ critics with access to global state during training while maintaining decentralized actors for deployment. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [6] pioneered this paradigm using centralized critics with Deep Deterministic Policy Gradient (DDPG) [7]. Multi-Agent PPO (MAPPO) [8] extends PPO to multi-agent settings with a centralized critic, demonstrating state-of-the-art performance across cooperative benchmarks while often matching or exceeding more complex algorithms. The centralized critic enables effective credit assignment while decentralized actors ensure deployable policies. Multi-Agent Actor-Critic (MAAC) [9] adds attention-weighted aggregation, scaling to teams above 10 agents.

We selected MAPPO as the foundation for our energy-aware extensions based on its demonstrated performance across cooperative benchmarks. Yu et al. [8] show that MAPPO matches or exceeds value decomposition methods such as QMIX [3] while offering simpler implementation and more stable training dynamics. Independent learners (iPPO) lack the centralized critic necessary for effective credit assignment in our fleet energy coordination setting, where individual agent contributions to collective battery preservation must be distinguished.

Uwano [10] emphasizes decentralized partially observable Markov decision processes (Dec-POMDPs) as the theoretical framework for learning agents in robot navigation. In Dec-POMDPs, agents observe partial state information through local sensors and coordinate implicitly through learned policies. Foundational single-agent algorithms like Deep Q-Networks (DQN) [11] and Asynchronous Advantage Actor-Critic (A3C) [12] provide building blocks for multi-agent variants through experience replay and parallel training, with recurrent architectures further addressing partial observability [13].

2.2. Energy-Aware Multi-Robot Coordination

Energy considerations have received attention primarily in aerial robotics where battery constraints are severe. Nemer et al. [14] address energy-efficient UAV movement control for fair communication coverage, incorporating energy directly into state representations and reward functions. Their deep reinforcement learning approach demonstrates improved coverage fairness while respecting energy constraints on Unmanned Aerial Vehicles (UAVs).

Ramezani and Amiri Atashgah [15] introduce hierarchical reinforcement learning with predictive energy modeling for search-and-rescue UAVs. Their framework separates decision-making into a high-level controller selecting survivor locations and a low-level controller outputting continuous velocity commands. The critical innovation is a bidirectional Long Short-Term Memory (LSTM) energy predictor pre-trained on 195 real UAV flights, achieving Root Mean Square Error (RMSE) of approximately 4.5 W on unseen test data. An adaptive switching mechanism initially relies on LSTM predictions during early training when policy value estimates are unreliable, then phases out external predictions once the learned policy’s Temporal Difference (TD) error falls below LSTM error. Experiments demonstrate 92.4% mission success versus 84.7% for hierarchical actor–critic baseline and 62.2% for flat Soft Actor-Critic (SAC).

Li et al. [16] address energy-aware collaborative execution for mission-oriented drone networks where battery capacity directly impacts mission completion. Their multi-agent reinforcement learning approach enables each drone to learn collaborative task execution and trajectory planning based on the current status, including the battery level. Experiments demonstrate success rates of at least 80% across varying task configurations, achieving up to 100% when task density is sufficient.

Alternative approaches employ proxy metrics to encourage efficiency without explicit battery modeling. Jeon et al. [17] track the total distance traveled as an energy proxy, reporting 38% more deliveries and 30% better distance efficiency versus baseline. However, distance proxies ignore acceleration costs that often dominate energy budgets. Said et al. [18] employ hard budget constraints that terminate episodes when exceeded, while Singh et al. [19] use bio-inspired meta-heuristics for cluster-head selection, extending network lifetime by 20–26% through energy-balanced rotation.

These approaches share a common limitation: treating energy purely as an optimization objective through reward shaping, rather than as a safety constraint with hard guarantees. A robot that depletes its battery mid-mission represents a categorical failure distinct from suboptimal efficiency. Our work addresses this gap by introducing action masking that prevents constraint-violating actions by construction. The energy predictor we integrate specifically addresses ground robot dynamics on the Robotarium platform since the existing research focuses on aerial platforms with fundamentally different energy profiles.

2.3. Safe and Constrained Multi-Agent Reinforcement Learning

Safe reinforcement learning addresses satisfying constraints during learning and deployment. Constrained Markov Decision Process (CMDP) formulations [20] augment the objective with cost constraints, typically handled through Lagrangian relaxation [21] or trust region methods [22].

Lu et al. [23] formulate decentralized, safe multi-agent reinforcement learning as distributed CMDPs, deriving the Safe Decentralized Policy Gradient algorithm with provable convergence guarantees while satisfying per-agent constraints. Their primal–dual optimization approach simultaneously updates policy parameters and Lagrangian multipliers with proven convergence properties.

Gu et al. [24] derive Multi-Agent Constrained Policy Optimization (MACPO), modeling multi-robot systems as CMDPs, for which a joint policy maximizes expected return subject to per-agent cost constraints. Benchmarked on Safe MAMuJoCo and MARobosuite, MACPO achieves zero constraint violations after convergence while matching or exceeding unconstrained baselines in episodic return.

However, the safe reinforcement learning literature focuses predominantly on collision avoidance and physical damage prevention, while resource depletion constraints have received comparatively little attention despite being equally catastrophic for mission success. A multi-robot team that completes its task with one robot stranded due to battery failure has not achieved mission success. We position battery depletion as a safety constraint warranting the same formal treatment as collision avoidance, where action masking provides hard guarantees that the policy cannot select actions leading to predicted depletion, analogous to how barrier certificates prevent collision by construction.

Our action masking approach differs fundamentally from Lagrangian and constrained optimization methods. CMDP-based approaches such as Constrained Policy Optimization (CPO) [22] and Lagrangian methods [21] treat constraints as soft penalties, iteratively adjusting multipliers to balance reward maximization against constraint satisfaction. These methods provide asymptotic guarantees but permit constraint violations during learning as the policy explores. In contrast, action masking enforces constraints by construction: unsafe actions are removed from the policy’s support before sampling, safeguarding against constraint violations from the first training step. This hard guarantee comes at the cost of requiring a predictive model, here our energy predictor, to evaluate action safety before execution, whereas Lagrangian methods learn constraint costs from experience. For battery depletion, where a single violation constitutes irreversible mission failure, we argue that safeguarding against violations throughout training outweighs the additional modeling requirement. Related work on safe exploration in continuous domains [25] similarly addresses constraint satisfaction, though without the predictive masking mechanism we employ.

2.4. Scalability and Multi-Objective Considerations

As deployments scale from small teams to large fleets, challenges emerge in computational complexity and formation stability [26]. Parameter sharing uses identical neural network weights for all agents, transforming N-agent learning into single-agent training on N parallel experience streams. Gupta et al. [27] demonstrate that shared actor–critic networks learn coordinated behaviors for teams of 3–100 agents, with parameter sharing reducing training time by factors of 5–10 compared to independent learners. Following this standard practice [8,28], we employ parameter sharing across all agents in our framework, improving sample efficiency while naturally handling the non-stationarity inherent to multi-agent optimization.

Energy-aware coordination involves multiple conflicting objectives exhibiting fundamental trade-offs. Roijers et al. [29] provide a comprehensive taxonomy of multi-objective reinforcement learning approaches, distinguishing between methods learning single scalarized solutions versus those discovering Pareto-optimal policy sets. While multi-objective frameworks offer principled handling of trade-offs, our reward formulation combines task objectives with energy-aware terms through a weighted sum, with weights tuned to balance task performance against energy efficiency. This provides a simpler optimization landscape well suited to the PPO algorithm.

3. Problem Formulation

We formulate multi-robot coordination as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) with cooperative structure and energy constraints.

3.1. Dec-POMDP Definition

The system is defined by the tuple

$$G = \langle N, S, \{A_{i}\}, \{O_{i}\}, P, R, \{Ω_{i}\}, γ, C \rangle$$

where $N$ denotes the number of agents, $S$ is the global state space, $A_{i}$ and $O_{i}$ are agent-specific action and observation spaces, $P$ is the state transition function, $R$ is the reward function, $Ω_{i}$ maps global states to local observations, $γ$ is the discount factor, and $C$ represents energy safety constraints.

3.2. State Space

The global state $s \in S$ encodes the following components for each robot $i$:

Position $p_{i} \in \mathbb{R}^{2}$ and orientation $θ_{i}$.
Velocities $(v_{i}, ω_{i})$ for linear and angular motion.
Battery level $b_{i} \in [0, 1]$ where 1 represents full charge and 0 represents complete depletion.
Task-specific information, including goal locations.

3.3. Battery Dynamics

The battery level of robot $i$ evolves according to

$$b_{i}^{t + 1} = b_{i}^{t} - Δ b_{i}^{t} \tag{1}$$

where $0 \leq Δ b_{i}^{t} \leq b_{i}^{t}$ is the energy consumed during timestep $t$, depending on the robot motion through motor power draw.

For simulation training, the autoregressive energy predictor [1] provides consumption estimates. Given the strong temporal autocorrelation in power consumption, the predictor conditions on current velocity features alongside a sliding window of recent power history to estimate $Δ b_{i}^{t}$.

For physical deployment, battery consumption is computed from current and voltage measurements via the onboard INA260 power monitor:

$$Δ b_{i}^{t} = \frac{1}{B_{max}} \int_{t}^{t + Δ t} P_{i} (τ) \, d τ \tag{2}$$

where $B_{max}$ is battery capacity, and $P_{i} (τ)$ is the instantaneous power for robot $i$.
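As a rough illustration of Equations (1) and (2), the sketch below approximates the power integral with the trapezoidal rule over discrete samples and then applies the clamped battery update. The function names, the milliwatt input unit, and the capacity value used in the example are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def battery_drop(power_mw, dt_s, capacity_j):
    """Approximate Eq. (2): fraction of capacity consumed over a run of samples.

    power_mw   -- power samples in milliwatts (assumed unit), evenly spaced
    dt_s       -- sampling interval in seconds (e.g. 1/30 for a 30 Hz stream)
    capacity_j -- battery capacity B_max in joules (hypothetical value)
    """
    if len(power_mw) < 2:
        return 0.0  # a single sample spans no time interval
    energy_j = 0.0
    for p0, p1 in zip(power_mw, power_mw[1:]):
        # Trapezoidal rule; 1e-3 converts milliwatts to watts.
        energy_j += 0.5 * (p0 + p1) * 1e-3 * dt_s
    return energy_j / capacity_j


def step_battery(level, drop):
    """Eq. (1), with the consumption clamped so that 0 <= drop <= level."""
    drop = min(max(drop, 0.0), level)
    return level - drop


# One second of a constant 2 W draw against a hypothetical 20 kJ battery.
drop = battery_drop([2000.0] * 31, 1.0 / 30.0, 20000.0)
new_level = step_battery(1.0, drop)
```

The clamp in `step_battery` mirrors the stated bound on $Δ b_{i}^{t}$, so the battery level cannot become negative.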

3.4. Observation Space

Each agent $i$ receives a local observation $o_{i}^{t}$ consisting of position and orientation ($p_{i} \in \mathbb{R}^{3}$), relative goal displacement ($Δ g_{i} \in \mathbb{R}^{2}$), the agent’s own battery level ($b_{i} \in \mathbb{R}$), and the battery levels of all $N$ agents ($b \in \mathbb{R}^{N}$), enabling fleet-level energy awareness. In our experiments, $N = 7$ agents yield a total observation dimension of $3 + 2 + 1 + 7 = 13$.
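The observation layout can be sketched as a simple concatenation. The ordering of the components and the function name below are assumptions; the source specifies only which quantities are included and the resulting dimension.

```python
def build_observation(pose, goal_xy, own_index, fleet_battery):
    """Assemble one agent's local observation (component order is assumed).

    pose          -- (x, y, theta) of this agent
    goal_xy       -- (gx, gy) goal position in the arena frame
    own_index     -- index of this agent in fleet_battery
    fleet_battery -- battery levels of all N agents, each in [0, 1]
    """
    x, y, theta = pose
    goal_dx, goal_dy = goal_xy[0] - x, goal_xy[1] - y  # relative goal displacement
    own_battery = fleet_battery[own_index]
    return [x, y, theta, goal_dx, goal_dy, own_battery, *fleet_battery]


fleet = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]  # N = 7 agents
obs = build_observation((1.0, 2.0, 0.5), (1.5, 1.0), 2, fleet)
```

With seven agents the list has 3 + 2 + 1 + 7 = 13 entries, matching the actor input size used later.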

3.5. Action Space

Each robot operates under a discrete action space with five movement primitives:

$$A_{i} = \{\text{null}, \text{up}, \text{down}, \text{left}, \text{right}\} \tag{3}$$

These primitives correspond to waypoint displacements in the arena’s global coordinate frame (e.g., ‘up’ translates to +Y). This formulation follows established Robotarium methodology [30], which demonstrates that discrete waypoint actions enable effective multi-agent coordination while the platform’s Control Lyapunov Function (CLF) and Control Barrier Function (CBF) control stack handles continuous trajectory execution and collision avoidance. Each discrete action produces a target position passed to the CLF-based position controller, which computes the unicycle velocities $(v, ω)$ required to reach the waypoint. For action masking, we compute the exact $(v, ω)$ that the CLF controller would produce for each candidate action, query the energy predictor, and filter candidates predicted to cause depletion. This hierarchical separation ensures that safety guarantees hold regardless of policy behavior, while enabling precise energy-aware constraint enforcement.
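The mapping from a discrete primitive to a waypoint might look like the sketch below. Only the convention that 'up' means +Y comes from the text; the remaining directions, the step length, and all names are assumptions for illustration. The CLF controller that turns the waypoint into $(v, ω)$ is not modeled here.

```python
ACTIONS = ("null", "up", "down", "left", "right")

# Unit displacements in the arena's global frame. 'up' -> +Y is from the
# text; the other three directions are the natural but assumed completion.
_DISPLACEMENT = {
    "null": (0.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
}


def action_to_waypoint(position, action, step=0.1):
    """Return the target position for a primitive; `step` (m) is hypothetical."""
    dx, dy = _DISPLACEMENT[action]
    return (position[0] + step * dx, position[1] + step * dy)


target = action_to_waypoint((0.0, 0.0), "up", step=0.2)
```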

3.6. Reward Structure

The reward function combines task objectives with energy considerations, with different algorithm variants using different subsets of the available components.

The task reward $R_{task, i} = - \| p_{i} - g_{i} \|$ provides dense feedback based on goal distance, computed as the negative Euclidean distance to the goal position. For MAPPO, the total reward is simply:

$$R_{i}^{\text{MAPPO}} = R_{task, i} \tag{4}$$

EA-MAPPO augments this with energy-aware components. The energy efficiency penalty $R_{energy, i} = - β \cdot Δ b_{i}$ discourages wasteful consumption, while the load balancing penalty $R_{balance} = - γ_{b} \cdot \mathrm{Var} (\{b_{1}, \dots, b_{N}\})$ encourages equitable workload distribution. The total EA-MAPPO reward is:

$$R_{i}^{\text{EA-MAPPO}} = R_{task, i} + R_{energy, i} + R_{balance} \tag{5}$$

SEA-MAPPO further adds safety-specific components. Safety shaping $R_{safety, i} = α_{s} \cdot \mathbf{1} [b_{i} > b_{crit}]$ provides a positive signal for maintaining a safe battery margin. A readiness bonus, $R_{ready, i}$, is awarded when agents maintain the battery above the masking threshold $b_{mask}$. The depletion penalty $R_{deplete, i} = - λ_{d} \cdot \mathbf{1} [\text{depletion}]$ provides a strong negative signal when any robot depletes. The total SEA-MAPPO reward is:

$$R_{i}^{\text{SEA-MAPPO}} = R_{task, i} + R_{energy, i} + R_{balance} + R_{safety, i} + R_{ready, i} + R_{deplete, i} \tag{6}$$

This tiered structure enables controlled ablation: EA-MAPPO adds energy awareness through reward shaping alone, while SEA-MAPPO combines safety-oriented rewards with action masking for comprehensive constraint enforcement.
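The three reward tiers in Equations (4) to (6) can be sketched as one function with a variant switch. The weights and thresholds are the values reported in Section 5; the function signature, the use of population variance, and treating the depletion indicator as a per-agent flag are assumptions made for this illustration.

```python
import math

BETA, GAMMA_B = 0.4, 0.2                         # energy and balance weights
ALPHA_S, READY_BONUS, LAMBDA_D = 0.85, 0.1, 3.5  # safety-specific weights
B_CRIT, B_MASK = 0.15, 0.155                     # critical and masking thresholds


def variance(values):
    """Population variance (assumed; the source does not say which form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def reward(variant, pos, goal, drop, battery, fleet, depleted):
    """Per-agent reward for 'MAPPO', 'EA-MAPPO', or 'SEA-MAPPO'."""
    r = -math.dist(pos, goal)                     # R_task, Eq. (4)
    if variant == "MAPPO":
        return r
    r += -BETA * drop - GAMMA_B * variance(fleet)  # R_energy + R_balance, Eq. (5)
    if variant == "EA-MAPPO":
        return r
    if battery > B_CRIT:
        r += ALPHA_S                              # R_safety
    if battery > B_MASK:
        r += READY_BONUS                          # R_ready
    if depleted:
        r -= LAMBDA_D                             # R_deplete
    return r                                      # Eq. (6)


fleet = [0.5] * 7
args = ((0.0, 0.0), (3.0, 4.0), 0.01, 0.5, fleet, False)
r_mappo = reward("MAPPO", *args)
r_ea = reward("EA-MAPPO", *args)
r_sea = reward("SEA-MAPPO", *args)
```

Because each tier only adds terms to the previous one, differences between the variants in an ablation can be attributed to the added components.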

The task reward formulation mitigates local minima common in sparse reward settings. The negative Euclidean distance $R_{task, i} = - \| p_{i} - g_{i} \|$ provides a dense, monotonically improving signal as agents approach their goals, creating a reward landscape with a single maximum at the goal and no plateaus that would trap gradient-based optimization. Additionally, goal positions are randomized each episode, preventing policies from memorizing environment-specific paths and encouraging generalizable navigation strategies. The energy-aware reward components ($R_{energy}$, $R_{balance}$) are similarly smooth functions of the battery state, avoiding discontinuities that could introduce local optima.

3.7. Safety Constraint

The core safety constraint requires all robots to maintain their battery at or above the critical threshold throughout operation:

$$C : b_{i}^{t} \geq b_{crit} \quad \forall i \in \{1, \dots, N\}, \; \forall t \tag{7}$$

When violated, the robot is considered depleted. It moves to the nearest arena corner to avoid obstructing other robots and is counted as lost for mission success evaluation, representing categorical mission degradation.

The novelty of this study lies in ensuring battery safety against depletion. Collision avoidance is addressed at the lower control level through the integration of a Control Barrier Function (CBF) [31], which guarantees inter-robot safety, irrespective of the high-level policy behavior.

3.8. Success Metrics

Mission success is evaluated through multiple metrics:

Goal Completion Rate: Average fraction of agents reaching goals per episode.
Depletion Rate: Average fraction of agents depleting per episode.
Mean Final Battery: Average battery level across all agents at episode end.
Battery Variance: Variance of battery levels across fleet at episode end.
Fleet Readiness: Fraction of agents with battery above masking threshold.

A mission achieving all task objectives but losing robots to depletion is not considered fully successful.
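A per-episode version of these metrics could be computed as in the sketch below, with the reported figures then obtained by averaging across episodes. The function name, argument layout, and the use of population variance are assumptions; the default masking threshold of 0.155 is the value given in Section 5.4.

```python
def episode_metrics(reached, depleted, final_battery, b_mask=0.155):
    """Compute the five success metrics for a single episode.

    reached, depleted -- per-agent booleans
    final_battery     -- per-agent battery level at episode end, in [0, 1]
    """
    n = len(final_battery)
    mean_b = sum(final_battery) / n
    return {
        "goal_completion": sum(reached) / n,
        "depletion_rate": sum(depleted) / n,
        "mean_final_battery": mean_b,
        "battery_variance": sum((b - mean_b) ** 2 for b in final_battery) / n,
        "fleet_readiness": sum(b > b_mask for b in final_battery) / n,
    }


metrics = episode_metrics(
    reached=[True, True, True, False],
    depleted=[False, False, False, True],
    final_battery=[0.5, 0.5, 0.5, 0.1],
)
```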

4. Energy Predictor Integration

Accurate energy prediction is essential for simulation-based training and action masking during deployment. The Robotarium simulator [30] provides robot kinematics and collision dynamics but does not model battery state or power consumption. We address this gap by integrating an autoregressive energy predictor developed specifically for the GTernal platform [1].

4.1. Predictor Overview

The energy predictor exploits the key insight that power consumption on differential-drive robots exhibits strong temporal autocorrelation, with lag-1 correlation $ρ_{1} = 0.95$ across diverse motion patterns. This autocorrelation structure implies that recent power history contains far more predictive information than the current kinematic state alone.

The predictor is a lightweight Multi-Layer Perceptron (MLP) with 7041 parameters that processes an 11-dimensional input vector consisting of six velocity features (linear velocity, angular velocity, their derivatives, and absolute values) and five power history lags. The architecture achieves $R^{2} = 0.90$ on held-out motion patterns. Physical validation across seven robots in random walk scenarios yields a mean of $R^{2} = 0.87$, demonstrating zero-shot transfer to unseen robots and behaviors.

4.2. Simulation Deployment

For CTDE training, the predictor operates autoregressively by maintaining a buffer of recent predictions that serve as input for subsequent steps. At each environment step, the predictor receives the commanded velocity $(v, ω)$ alongside the simulated power history buffer and outputs the estimated power consumption in milliwatts:

$$\hat{P}_{i, t} = f_{pred} (v_{i, t}, ω_{i, t}, \hat{P}_{i, t - 1}, \dots, \hat{P}_{i, t - 5}) \tag{8}$$

This recursive structure accurately models energy accumulation over extended episodes without requiring ground-truth power readings.
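The autoregressive loop of Equation (8) can be sketched with a fixed-length buffer that feeds each prediction back as the newest lag. The predictor is passed in as a plain callable; the feature ordering, the finite-difference derivatives, the buffer seeding, and the toy predictor are all assumptions standing in for the trained MLP, which is not reproduced here.

```python
from collections import deque

N_LAGS = 5  # five power history lags, as described in Section 4.1


def velocity_features(v, w, prev_v, prev_w, dt):
    """Six velocity features; derivatives via finite differences (assumed)."""
    return [v, w, (v - prev_v) / dt, (w - prev_w) / dt, abs(v), abs(w)]


def rollout_power(predict, commands, dt, initial_power_mw):
    """Run the predictor autoregressively over a sequence of (v, w) commands."""
    # history[0] is the most recent estimate; seeding with a constant is assumed.
    history = deque([initial_power_mw] * N_LAGS, maxlen=N_LAGS)
    prev_v, prev_w = 0.0, 0.0
    estimates = []
    for v, w in commands:
        x = velocity_features(v, w, prev_v, prev_w, dt) + list(history)  # 11 inputs
        p_hat = predict(x)
        history.appendleft(p_hat)  # the oldest lag falls off the right end
        estimates.append(p_hat)
        prev_v, prev_w = v, w
    return estimates


def toy_predictor(x):
    """Hypothetical stand-in, not the trained model: decays toward a load term."""
    return 0.9 * x[6] + 100.0 * x[4] + 50.0


trace = rollout_power(toy_predictor, [(0.1, 0.0)] * 3, 1.0 / 30.0, 1000.0)
```

Because no measured power enters the loop after seeding, prediction errors can compound over long rollouts, which is one reason measured power replaces the estimates for state tracking during physical deployment.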

The predictor runs in 224 μs per inference (about 4460 inferences per second), enabling real-time deployment at roughly 150× the platform’s 30 Hz control rate. This leaves ample computational budget for policy inference and higher-level planning.

4.3. Deployment Configuration

During physical operation on GTernal robots, actual power measurements from the onboard INA260 sensor replace predictor estimates for state tracking. The INA260 provides 10 mW precision power readings synchronized with the 30 Hz velocity command rate. Action masking continues to use the predictor for candidate action evaluation, querying the expected consumption for each action before selection to filter those predicted to cause depletion.

A limitation of this approach is that action masking relies on predictor accuracy. In scenarios with dynamics substantially different from the training distribution, predictor error could cause inappropriate masking. The threshold-based fallback provides robustness, and the predictor’s strong performance on unseen robots suggests reasonable generalization, but deployment on platforms with fundamentally different power characteristics would require retraining.

5. Methodology

We present an algorithm progression building upon Multi-Agent Proximal Policy Optimization.

5.1. Network Architecture

Both actor and critic networks employ Gated Recurrent Unit (GRU)-based architectures to capture temporal dependencies in the observation sequence. Because our problem is formulated as a Dec-POMDP, feed-forward networks suffer from perceptual aliasing where identical observations may require different actions, depending on the unobserved temporal context. Recurrent architectures aggregate action-observation histories into latent state representations [8]. We employ GRUs, rather than LSTMs, because GRUs achieve equivalent asymptotic performance with approximately 25% fewer parameters, reducing the inference latency for decentralized deployment.
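The 25% figure follows from gate counts: a GRU cell has three weight blocks where an LSTM cell has four, each of the same shape. The sketch below counts parameters for a single recurrent layer under the textbook formulation with one bias vector per block; deep learning frameworks may count biases differently, which changes the totals but not the ratio.

```python
def recurrent_params(n_blocks, input_dim, hidden_dim):
    """Parameters of one recurrent layer with `n_blocks` gate/candidate blocks.

    Each block has an input matrix, a recurrent matrix, and one bias vector
    (textbook formulation, assumed here).
    """
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block


gru = recurrent_params(3, 256, 256)    # update gate, reset gate, candidate
lstm = recurrent_params(4, 256, 256)   # input, forget, output gates, candidate
saving = 1 - gru / lstm
```

For the 256-unit recurrent layer used here, `saving` evaluates to 0.25 for the recurrent layer alone; the surrounding dense layers are identical in both cases and are not counted.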

The actor network processes agent observations through the following structure:

$$\text{Input} (13) \to \text{Dense} (256) \to \text{GRU} (256) \to \text{Dense} (256) \to \text{Dense} (5) \tag{9}$$

The output produces logits over the five discrete actions, transformed to probabilities via softmax (or masked softmax for SEA-MAPPO).

The critic network processes the global state through:

$$\text{Input} (s) \to \text{Dense} (256) \to \text{GRU} (256) \to \text{Dense} (256) \to \text{Dense} (1) \tag{10}$$

We select 256 hidden units following MAPPO implementation guidelines [8], which scale network capacity to task complexity. Our 13-dimensional observation space encoding spatial and energy dynamics represents moderate complexity, making 256 the appropriate configuration. This choice aligns with EPyMARL (v2.0.0; Autonomous Agents Research Group, University of Edinburgh, Edinburgh, UK) benchmarking defaults [32]. Parameter sharing is employed across all agents, improving sample efficiency by aggregating all agents’ experiences into unified optimization steps [28]. Behavioral diversity emerges naturally through observation conditioning, as each agent’s unique position and energy state produces specialized actions despite shared weights.

5.2. MAPPO Foundation

MAPPO extends Proximal Policy Optimization to multi-agent settings with a centralized critic observing the global state during training while maintaining decentralized actors for deployment. Each agent maintains a policy, $π_{i} (a_{i} | o_{i}; θ)$, mapping local observations to action distributions, and a shared critic, $V (s; ϕ)$, evaluates states using full information. The policy is updated to maximize the clipped surrogate objective:

$$L^{CLIP} (θ) = E_{t} [\min (r_{t} (θ) \hat{A}_{t}, \text{clip} (r_{t} (θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})] \tag{11}$$

where $r_{t} (θ) = \frac{π_{θ} (a_{t} | o_{t})}{π_{θ_{old}} (a_{t} | o_{t})}$ is the probability ratio, $ϵ = 0.2$ is the clipping parameter that constrains policy updates to a trust region, and $\hat{A}_{t}$ is the advantage estimate computed via Generalized Advantage Estimation.
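For concreteness, Equation (11) can be evaluated on a small batch in plain Python as sketched below. The function name and the log-probability inputs are choices made for this illustration; a training implementation would compute the same quantity on tensors and differentiate through it.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective of Eq. (11) over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic bound
    return total / len(advantages)


# The ratio doubles on a positive-advantage sample, so the gain is capped at 1 + eps.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
# The same ratio with a negative advantage is not clipped, so the full penalty applies.
uncapped = clipped_surrogate([math.log(2.0)], [0.0], [-1.0])
```

The asymmetry in the two calls is the point of the `min`: improvements beyond the trust region earn no extra credit, while regressions are penalized in full.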

5.3. EA-MAPPO

Energy-Aware MAPPO (EA-MAPPO) extends the MAPPO foundation with energy-augmented observations and shaped rewards while leaving the core algorithm unchanged.

Agent observations include battery information, as specified in Section 3, enabling the policy to condition decisions on individual and collective energy states. The full fleet battery vector $b \in \mathbb{R}^{N}$ provides each agent with complete energy awareness.

The reward function includes the energy efficiency penalty ($β = 0.4$) and the load balancing penalty ($γ_{b} = 0.2$). These weights were determined through a grid search over $β \in \{0.1, 0.2, 0.4, 0.6\}$ and $γ_{b} \in \{0.1, 0.2, 0.3\}$, selecting the configuration that achieved the best trade-off between goal completion and fleet energy variance. The energy penalty, at approximately 40% of the typical per-step task reward magnitude, influences learning without dominating task objectives. This configuration represents the standard approach in the energy-aware multi-agent reinforcement learning literature, where energy objectives are incorporated purely through reward engineering without explicit safety mechanisms.

EA-MAPPO demonstrates that policies can learn more energy-efficient behaviors when incentivized through rewards. However, while reward shaping incentivizes energy-conservative behavior, it provides no hard guarantees, and a policy optimizing expected return may still select actions that risk depletion when task rewards dominate energy penalties.

The multi-objective reward landscape may slow initial task-focused learning compared to single-objective MAPPO, as the policy must balance competing objectives.

5.4. SEA-MAPPO

Safe Energy-Aware MAPPO (SEA-MAPPO) extends EA-MAPPO with both predictive action masking and safety-oriented reward shaping, providing comprehensive constraint enforcement.

Given the current battery level $b_{i}$ and the energy predictor $f_{pred}$, the action masking mechanism excludes actions predicted to cause constraint violation:

$$A_{i}^{valid} (b_{i}) = \{a \in A_{i} \mid b_{i} - f_{pred} (a, P_{hist}) \geq b_{crit}\} \tag{12}$$

where $P_{hist}$ is the recent power history used by the autoregressive predictor.

The policy network outputs logits for all actions, and masking modifies the softmax normalization to consider only valid actions:

π_{θ} (a_{i} | o_{i}) = \begin{cases} \frac{\exp (f_{θ} (o_{i}, a_{i}))}{\sum_{a' \in A_{i}^{valid}} \exp (f_{θ} (o_{i}, a'))} & \text{if } a_{i} \in A_{i}^{valid} \\ 0 & \text{otherwise} \end{cases}

(13)

This formulation guarantees that sampled actions satisfy battery constraints, assuming predictor accuracy. The policy learns over the valid action subspace, naturally adapting behavior as the battery depletes and the valid set shrinks.
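The masked normalization in Equation (13) can be illustrated with a short sketch. It is written with plain Python lists rather than the tensor library used for training, and the name `masked_softmax` is illustrative. Invalid actions receive exactly zero probability, and the remaining probabilities are renormalized over the valid set.

```python
import math
from typing import List, Sequence

def masked_softmax(logits: Sequence[float], valid: Sequence[bool]) -> List[float]:
    """Softmax restricted to valid actions; invalid actions get probability 0."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    shift = max(l for l, v in zip(logits, valid) if v)
    exps = [math.exp(l - shift) if v else 0.0 for l, v in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)  # the masked middle action has probability 0.0
```

Because the null action is always valid in the threshold-based variant described in this section, the empty-set error branch should not be reached in that setting.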

In addition to action masking, SEA-MAPPO augments the reward function with safety-specific components. With EA-MAPPO base weights held constant to ensure fair ablation, we conducted 60-trial Bayesian optimization over the safety-specific parameters. This yielded safety shaping of

α_{s} = 0.85

, providing a positive signal for maintaining the battery above the critical threshold, a readiness bonus of 0.1 for staying above the masking threshold, and a depletion penalty of

λ_{d} = 3.5

for catastrophic battery exhaustion. Details of the parameter sensitivity analysis, including search ranges and sensitivity patterns, are provided in Appendix A. The depletion penalty is intentionally large relative to other reward components, ensuring that battery exhaustion is treated as catastrophic mission failure, rather than a soft optimization trade-off. The masking threshold (0.155) was set to 0.005 above the critical threshold (0.15) to provide a safety margin absorbing predictor uncertainty. These components complement action masking by shaping the policy toward energy-conservative behaviors even before masking activates.

In practice, we implement threshold-based masking where movement actions (up, down, left, right) are masked when the battery falls below threshold

b_{mask}

, while the null action remains always valid. This provides a conservative safety mechanism independent of predictor accuracy for near-critical battery states.
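The threshold rule described above reduces to a single comparison per agent. The sketch below is a minimal illustration: the action labels follow the text, while the function name `threshold_mask` and the dictionary representation of the mask are assumptions.

```python
from typing import Dict

ACTIONS = ("null", "up", "down", "left", "right")
MOVEMENT_ACTIONS = frozenset({"up", "down", "left", "right"})

def threshold_mask(battery: float, b_mask: float = 0.155) -> Dict[str, bool]:
    """Map each action to True (allowed) or False (masked).

    Movement actions are masked once the battery falls below b_mask;
    the null action is always allowed.
    """
    movement_allowed = battery >= b_mask
    return {a: (a not in MOVEMENT_ACTIONS) or movement_allowed for a in ACTIONS}

print(threshold_mask(0.60))  # every action allowed
print(threshold_mask(0.15))  # only "null" allowed
```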

The critical threshold

b_{crit} = 0.15

represents the minimum battery level for safe operation, determined by platform characteristics and mission requirements, rather than algorithm tuning. For the GTernal platform, this value ensures sufficient energy to execute the failsafe retreat behavior (moving to the nearest arena corner) from any position while maintaining voltage levels adequate for motor control. The masking threshold

b_{mask} = 0.155

introduces a margin of 0.005 above

b_{crit}

to absorb prediction uncertainty. This margin was selected based on predictor accuracy: with a mean absolute error of 31.5 mW on held-out validation data [1] and typical power consumption of approximately 3.5 W, the single-step prediction error is below 1%. The 0.005 margin (3.3% relative to

b_{crit}

) absorbs approximately 5–6 steps of worst-case accumulated error, providing robustness against predictor inaccuracy without overly restricting the action space.
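The two ratios quoted above can be checked directly from the reported values. The sketch below only reproduces the single-step relative error and the relative margin. It does not attempt to derive the 5–6 step figure, which depends on per-step battery consumption that is not restated here.

```python
# Values reported in the text.
mae_w = 31.5e-3          # predictor mean absolute error, in watts (31.5 mW)
typical_power_w = 3.5    # typical power consumption, in watts
b_crit = 0.15            # critical battery threshold
b_mask = 0.155           # masking threshold

relative_error = mae_w / typical_power_w      # single-step relative error
relative_margin = (b_mask - b_crit) / b_crit  # margin relative to b_crit

print(f"{relative_error:.1%}")   # 0.9%
print(f"{relative_margin:.1%}")  # 3.3%
```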

For deployment on new platforms, practitioners should set

b_{crit}

based on hardware specifications (minimum safe discharge level, failsafe energy requirements) and then set

b_{mask}

above it, with a margin proportional to the expected predictor error. More energy-intensive actions, longer episodes, or higher predictor uncertainty warrant larger margins. The threshold-based fallback, where movement actions are masked whenever the battery falls below

b_{mask}

, regardless of the predictor output, provides an additional safety layer independent of prediction accuracy for near-critical battery states.

5.5. Centralized Training with Decentralized Execution

Both EA-MAPPO and SEA-MAPPO operate under the Centralized Training with Decentralized Execution (CTDE) paradigm. During training, the centralized critic accesses global state—including all agent positions, velocities, and battery levels—enabling effective credit assignment across the fleet. During execution, each actor conditions solely on its local observation,

o_{i}

, a fixed 13-dimensional vector containing the agent’s pose, goal displacement, and fleet energy state. The fleet battery state comprises N scalar values per timestep, a bandwidth comparable to that of the goal positions that multi-robot coordination systems routinely share, and does not require learned communication protocols [33,34] that introduce additional trainable parameters and emergent messaging complexity. The Robotarium provides this telemetry through its standard API, and the same information could be supplied by any platform that periodically broadcasts mission state. With the fleet energy state available, the framework enables coordinated energy-aware behavior where each agent accounts for the collective battery status, while action masking provides formal safeguards against depletion.

A limitation of threshold-based masking is that the threshold requires tuning to balance safety and task performance. Different scenarios may benefit from different thresholds, and more energy-intensive tasks might require larger margins between

b_{mask}

and

b_{crit}

.

6. Experimental Evaluation

6.1. Experimental Setup

Experiments employ the Georgia Tech Robotarium [30], a remotely accessible swarm robotics testbed providing GTernal differential-drive robots [35] with onboard power monitoring enabling precise energy tracking. The GTernal platform features an 11 cm × 9.5 cm footprint with a maximum linear speed of approximately 26 cm/s. The Robotarium API exposes per-timestep power readings, Control Lyapunov Function (CLF)-based position control, and Control Barrier Function (CBF) collision avoidance, as shown in Figure 1.

We conduct our evaluation using the Navigation scenario with

N = 7

agents, where each robot must reach an assigned goal position that is randomized each episode. Episodes terminate upon all robots reaching goals or at the maximum step limit. This scenario isolates coordination and energy management challenges, requiring agents to navigate efficiently while preserving battery and avoiding depletion. The seven-agent configuration aims to balance computational tractability with sufficient fleet density to induce meaningful multi-agent interactions and energy contention. Table 1, Table 2 and Table 3 summarize the environment parameters, training hyperparameters, and reward components.

6.2. Training Protocol and Algorithm Configurations

Training employed convergence-based early stopping, where training was terminated when the test goal completion rate exceeded 95% for 500 consecutive policy updates (approximately 0.8 M environment steps), indicating policy convergence. This adaptive stopping criterion ensures a fair comparison by evaluating each algorithm at convergence, rather than at a fixed computational budget. SEA-MAPPO converges the fastest at approximately 10 M environment steps due to the reduced exploration space from action masking combined with safety reward shaping, followed by EA-MAPPO at approximately 11.5 M steps, while vanilla MAPPO requires approximately 15 M steps, as it must discover energy-efficient behaviors without explicit guidance.
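The stopping criterion can be sketched as a streak counter over the evaluation history. The sketch assumes one goal completion value per policy update, which simplifies the evaluation schedule, and the function name `first_stable_update` is illustrative.

```python
from typing import Optional, Sequence

def first_stable_update(
    rates: Sequence[float],
    threshold: float = 0.95,
    patience: int = 500,
) -> Optional[int]:
    """Index of the update that completes `patience` consecutive updates
    with goal completion above `threshold`, or None if never reached."""
    streak = 0
    for i, rate in enumerate(rates):
        streak = streak + 1 if rate > threshold else 0
        if streak >= patience:
            return i
    return None

# Toy history with a short patience so the behavior is easy to follow:
# a dip at index 5 resets the streak, so the criterion is met at index 8.
history = [0.5, 0.5, 0.5, 0.96, 0.96, 0.90, 0.97, 0.97, 0.97]
print(first_stable_update(history, patience=3))  # 8
```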

Test metrics were evaluated every 1% of training via greedy rollouts (deterministic argmax actions) on 64 parallel test environments for statistical stability. Training metrics were logged every policy update and include exploration noise from stochastic action sampling.

Table 4 reports sample efficiency as environment steps to first reach each threshold, the standard metric for comparing learning speed in reinforcement learning. Due to training instability, MAPPO’s goal completion fluctuates after first crossing 95%, as is visible in Figure 2. The 81.5% goal completion at 10 M steps reported in Table 5 reflects this volatility, rather than a contradiction. Stable convergence, defined as sustained performance above 95% for 500 consecutive policy updates, requires approximately 15 M steps for MAPPO, 11.5 M for EA-MAPPO, and 10 M for SEA-MAPPO. Notably, EA-MAPPO reaches stable convergence faster than MAPPO despite its slower initial progress, as the multi-objective reward landscape ultimately aids learning once energy-efficient behaviors emerge.

6.3. Results

Table 4 presents the sample efficiency advantage of SEA-MAPPO. The combination of action masking and safety rewards enables SEA-MAPPO to reach 95% goal completion in only 0.49 M environment steps, a 19× speedup over MAPPO (9.36 M steps) and approximately 21× over EA-MAPPO (10.21 M steps). This improvement arises because action masking eliminates unsafe exploration, allowing the policy to focus learning on the constrained action subspace where all solutions are feasible.
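The sample efficiency metric and the resulting speedup ratios can be reproduced with a few lines. The helper names are illustrative. The step counts are those reported in the text (in millions of environment steps), and the learning curve in the usage lines is a toy series, not experimental data.

```python
from typing import Optional, Sequence

def steps_to_threshold(
    steps: Sequence[float], rates: Sequence[float], threshold: float = 0.95
) -> Optional[float]:
    """Environment steps at which the rate first reaches the threshold."""
    for s, r in zip(steps, rates):
        if r >= threshold:
            return s
    return None

def speedup(baseline_steps: float, method_steps: float) -> float:
    """How many times fewer steps the method needs than the baseline."""
    return baseline_steps / method_steps

# Toy learning curve (not experimental data).
print(steps_to_threshold([0.1, 0.3, 0.5], [0.40, 0.80, 0.96]))  # 0.5

# Reported steps to first reach 95% goal completion.
print(round(speedup(9.36, 0.49), 1))   # 19.1 (MAPPO vs. SEA-MAPPO)
print(round(speedup(10.21, 0.49), 1))  # 20.8 (EA-MAPPO vs. SEA-MAPPO)
```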

Table 5 compares task performance at 10 M environment steps, the point where SEA-MAPPO has converged. At this fixed training budget, MAPPO achieves only 81.5% goal completion while EA-MAPPO improves to 91.6% through energy reward shaping. SEA-MAPPO achieves perfect 100% goal completion.

Table 6 quantifies training stability. MAPPO exhibits a peak depletion rate of 49.1% (nearly half the fleet depleting in a single episode) and 1511 evaluation points where depletion exceeded 10%. EA-MAPPO reduces high-depletion points to 359 through reward incentives. SEA-MAPPO considerably improves stability, with only 50 points exceeding 10% depletion and the lowest variance in both goal completion (0.159 vs. 0.339) and depletion rate (0.017 vs. 0.076) across training.
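The stability statistics in this paragraph can be computed from a logged depletion-rate series, as the sketch below illustrates. It assumes one depletion rate per evaluation point and uses the population variance; the estimator behind the reported variance values is not specified here, so this choice is an assumption, as is the function name.

```python
from statistics import pvariance
from typing import Dict, Sequence

def stability_summary(
    depletion_rates: Sequence[float], high_threshold: float = 0.10
) -> Dict[str, float]:
    """Peak rate, count of points above the threshold, and variance."""
    return {
        "peak": max(depletion_rates),
        "high_points": sum(1 for d in depletion_rates if d > high_threshold),
        "variance": pvariance(depletion_rates),
    }

# Toy series (not experimental data).
summary = stability_summary([0.0, 0.2, 0.05, 0.491])
print(summary["peak"], summary["high_points"])  # 0.491 2
```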

Table 7 presents cumulative performance across all training. SEA-MAPPO achieves mean goal completion of 95.5% across its entire training run compared to 61.2% for MAPPO and 70.6% for EA-MAPPO. Energy preservation and fleet readiness metrics are reported with full statistical characterization in Table 8.

Figure 2 presents goal completion rates evaluated via greedy rollouts throughout training. SEA-MAPPO achieves near-perfect goal completion within the first 0.5 M environment steps, demonstrating the sample efficiency gains enabled by action masking and safety rewards. Figure 3 shows depletion rates during training, where MAPPO exhibits peak depletion rates approaching 0.5 in early training with persistent variance, while SEA-MAPPO drops to near-zero depletion within the first 2 M steps and maintains this level throughout.

Figure 4 analyzes the mean fleet battery preservation during training. SEA-MAPPO consistently preserves the most energy, reaching approximately 0.80 mean final battery at convergence. Figure 5 shows fleet battery variance, where SEA-MAPPO achieves the lowest variance (approximately 0.03–0.04), indicating that action masking combined with safety rewards not only prevents depletion but also encourages more equitable energy distribution across the fleet.

Figure 6 presents fleet readiness, where SEA-MAPPO achieves and maintains near-perfect readiness rapidly, while MAPPO exhibits a volatile trajectory with frequent drops to 0.3–0.6 mid-training. Figure 7 shows episode returns, where SEA-MAPPO exhibits the fastest improvement and lowest variance throughout training.

Table 8 summarizes the converged performance across five seeds. At convergence, all algorithms achieve high goal completion rates, but SEA-MAPPO maintains advantages in energy efficiency (0.809 vs. 0.727 mean battery), load balancing (0.032 vs. 0.047 variance), and fleet readiness (99.8% vs. 94.6%). The key finding is that SEA-MAPPO reaches this performance level in approximately 10 M steps, while MAPPO requires 15 M steps, and SEA-MAPPO maintains safety guarantees throughout training while MAPPO experiences numerous depletion events during learning.

6.4. Physical Deployment

To validate the sim-to-real transfer, we deployed SEA-MAPPO policies trained entirely in simulation on physical GTernal robots in the Robotarium facility. The transfer process involved three stages. First, the energy predictor was trained on physical robot telemetry prior to any policy training, ensuring that simulated energy dynamics matched real-world consumption patterns. Second, policies were trained in simulation using the Robotarium’s physics engine for kinematics and collision dynamics, augmented with predictor-based energy estimates. Third, trained policies were deployed directly on physical robots without fine-tuning or domain adaptation, so the weights learned in simulation were executed unchanged on hardware. During physical operation, actual power measurements from INA260 sensors (10 mW resolution) replaced predictor estimates for battery state tracking, providing ground-truth energy levels at each timestep. Action masking continued to use the predictor for candidate action evaluation, querying expected consumption for each action before selection. This hybrid approach, with ground-truth measurements for state observation and predictor estimates for action filtering, ensures accurate battery tracking while maintaining the safety guarantees established during training. The key enabler of successful transfer is the predictor’s zero-shot generalization: trained on telemetry from structured motion primitives, it achieves

R^{2} = 0.87

on unseen robots executing learned policies in collision avoidance scenarios [1]. This generalization arises from the predictor’s reliance on velocity features and power history, rather than trajectory-specific patterns, making it robust to the novel behaviors produced by trained policies.

Ten deployment trials were conducted per algorithm, with seven robots executing the Navigation scenario. Robot positions were tracked by the Robotarium’s Vicon motion capture system at up to 120 Hz with submillimeter precision. Each trial used randomized goal assignments to ensure diverse evaluation conditions.

Table 9 summarizes deployment results. SEA-MAPPO achieved perfect goal completion with zero depletion events across all 70 robot-trials, demonstrating successful sim-to-real transfer without fine-tuning. The predictor achieved

R^{2} = 0.89

against measured power consumption during Navigation deployment, consistent with the

R^{2} = 0.87

reported for multi-robot random walk validation [1], as illustrated in Figure 8.
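For reference, the coefficient of determination can be computed from paired measured and predicted power values as sketched below. The sketch uses the standard definition, one minus the ratio of the residual to the total sum of squares. The exact variant used for the reported values is not stated here, and the sample values are toy numbers.

```python
from typing import Sequence

def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

measured = [3.0, 3.5, 4.0]                   # toy power readings in watts
print(r_squared(measured, measured))         # 1.0 for a perfect predictor
print(r_squared(measured, [3.5, 3.5, 3.5]))  # 0.0 when predicting the mean
```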

Residual prediction error arises from factors including temperature-dependent motor efficiency variations, surface friction differences between the simulation and the physical arena, and battery state-of-charge effects on voltage that influence power draw. The threshold-based masking mechanism provides robustness to these prediction errors by maintaining a conservative margin (

b_{mask} = 0.155

versus

b_{crit} = 0.15

) that serves to absorb estimation uncertainty.

6.5. Scalability Analysis

To evaluate how SEA-MAPPO scales with fleet size, we conducted experiments with

N \in \{4, 7, 14\}

agents. The network architecture remained unchanged across configurations—only the input layer was resized to accommodate the fleet battery vector

b \in \mathbb{R}^{N}

in each agent’s observation. Because all agents share identical policy and value network weights, adding agents introduces no per-agent parameters; the only fleet-size dependence is the resized input layer.

Figure 9 shows that SEA-MAPPO achieves consistent convergence dynamics across fleet sizes. All three configurations reach stable returns within similar training budgets, indicating that the combination of parameter sharing and action masking maintains learning efficiency as the fleet grows.

Figure 10 presents mean battery preservation across fleet sizes. At convergence, the mean battery reaches

0.801 \pm 0.034

for

N = 4

,

0.809 \pm 0.026

for

N = 7

, and

0.730 \pm 0.035

for

N = 14

. The modest decrease with larger fleets reflects increased coordination complexity in denser environments, where agents must navigate around more teammates to reach their goals.

Table 10 summarizes safety metrics and the computational cost. Goal completion remains perfect across all fleet sizes, confirming that action masking maintains its effectiveness as the fleet grows. Fleet readiness remains above 98% even at

N = 14

, demonstrating robust energy management at scale. The inference time scales approximately linearly with fleet size, consistent with the

O (N)

growth in observation dimensionality, and remains well within the Robotarium’s 30 Hz control loop (33.3 ms budget) even at

N = 14

, suggesting that substantially larger fleets can be supported without exceeding real-time constraints.

Table 11 summarizes the theoretical complexity. Parameter sharing keeps the number of trainable weights constant, regardless of the fleet size, with only the input layer dimensions scaling with N. This linear scaling enables deployment on larger fleets without architectural redesign.
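The linear growth of the input layer can be made concrete with a parameter count. The sketch assumes, purely for illustration, that the observation consists of 6 agent-specific features plus N battery values. This reproduces the 13-dimensional observation reported for N = 7, but the split is not a breakdown given in the text. It also assumes a single dense input layer with bias feeding the 256-unit hidden dimension listed in Table A2.

```python
HIDDEN_DIM = 256    # hidden dimension from Table A2
BASE_FEATURES = 6   # hypothetical count of non-battery observation features

def obs_dim(n_agents: int) -> int:
    """Observation size under the assumed 6 + N layout."""
    return BASE_FEATURES + n_agents

def input_layer_params(n_agents: int, hidden: int = HIDDEN_DIM) -> int:
    """Weights plus biases of a dense layer mapping the observation to hidden."""
    return obs_dim(n_agents) * hidden + hidden

for n in (4, 7, 14):
    print(n, obs_dim(n), input_layer_params(n))
# 4 10 2816
# 7 13 3584
# 14 20 5376
```

Under these assumptions each additional agent adds exactly 256 input-layer weights, while every other layer is unchanged.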

6.6. Limitations and Scope

This work focused on the Navigation scenario with a seven-agent fleet, a configuration that captures the core challenges of multi-robot coordination under energy constraints while remaining computationally tractable for extensive ablation studies. The algorithmic components—energy-augmented observations, safety reward shaping, and predictive action masking—are designed to be task-agnostic, and adapting the framework to other coordination tasks, such as coverage or formation control, would primarily involve adjusting reward weights and masking thresholds, rather than architectural changes.

The energy predictor encodes dynamics specific to the GTernal differential-drive platform, having been trained on telemetry from these robots. Deployment on platforms with different locomotion or power profiles would require collecting platform-specific telemetry and retraining the predictor, following the same methodology described in Section 4. The predictor architecture itself generalizes across differential-drive systems, and for platforms where telemetry collection is impractical, physics-based energy models could serve as alternatives with appropriate margin adjustments to the masking threshold.

The CTDE framework and parameter sharing employed here scale with the fleet size, requiring only input layer resizing as demonstrated across the configurations tested. Extension to heterogeneous fleets, where robots differ in energy capacity or consumption profiles, represents a natural direction for future work, potentially requiring per-class predictors or adaptive masking thresholds.

7. Conclusions

This paper has presented a framework for energy-aware multi-agent coordination that treats battery management as a safety constraint. We introduced EA-MAPPO with energy-augmented observations and shaped rewards and extended it to SEA-MAPPO, combining predictive action masking with safety-oriented reward shaping for comprehensive constraint enforcement. Central to our approach is the integration of an autoregressive energy predictor [1] that enables accurate energy estimation in simulation and action masking during deployment.

Experimental results on the Georgia Tech Robotarium with 7 agents demonstrate that SEA-MAPPO achieves significant improvements in both sample efficiency and training safety. SEA-MAPPO reaches 95% goal completion 19× faster than MAPPO, requiring only 0.5 M environment steps versus 9.4 M. Throughout training, SEA-MAPPO reduces cumulative depletion events by 93% compared to MAPPO while exhibiting substantially lower variance in all metrics. At the 10 M-step budget at which SEA-MAPPO converges, it achieves 100% goal completion compared to 81.5% for MAPPO and 91.6% for EA-MAPPO.

Physical deployment on GTernal robots validates sim-to-real transfer, with SEA-MAPPO maintaining zero depletion events across 70 robot-trials while achieving perfect goal completion. The framework demonstrates that safety constraints, when properly integrated through action masking and reward shaping, accelerate, rather than hinder, learning.

The framework developed in this study can be applied beyond battery depletion safety. By demonstrating that action masking, combined with safety rewards, can enforce battery safety constraints while significantly accelerating convergence, we establish a template applicable to other resource-constrained multi-robot domains.

Author Contributions

Conceptualization, Y.A. and A.H.; methodology, Y.A. and A.H.; software and validation, Y.A.; resources, A.H.; data curation, Y.A.; supervision, A.H.; writing—original draft preparation, Y.A.; writing—review and editing, A.H.; visualization, Y.A.; project administration, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the American University of Sharjah under Project FSU26-E06.

Data Availability Statement

The training configurations and experimental logs are available upon reasonable request.

Acknowledgments

The authors thank the Robotarium team at Georgia Institute of Technology for platform access and for implementing the power sensing API that enabled this research. This paper represents the opinions of the authors and does not mean to represent the position or opinions of the American University of Sharjah.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

A3C	Asynchronous Advantage Actor-Critic
CBF	Control Barrier Function
CLF	Control Lyapunov Function
CMDP	Constrained Markov Decision Process
CPO	Constrained Policy Optimization
CTDE	Centralized Training with Decentralized Execution
DDPG	Deep Deterministic Policy Gradient
Dec-POMDP	Decentralized Partially Observable Markov Decision Process
DQN	Deep Q-Network
EA-MAPPO	Energy-Aware MAPPO
GRU	Gated Recurrent Unit
iPPO	Independent Proximal Policy Optimization
LSTM	Long Short-Term Memory
MAAC	Multi-Agent Actor-Critic
MACPO	Multi-Agent Constrained Policy Optimization
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
MAPPO	Multi-Agent Proximal Policy Optimization
MARL	Multi-Agent Reinforcement Learning
MLP	Multi-Layer Perceptron
PPO	Proximal Policy Optimization
RMSE	Root Mean Square Error
SAC	Soft Actor-Critic
SEA-MAPPO	Safe Energy-Aware MAPPO
TD	Temporal Difference
UAV	Unmanned Aerial Vehicle

Appendix A. Reward Parameter Sensitivity Analysis

Appendix A.1. EA-MAPPO Parameter Selection

For EA-MAPPO, reward weights were determined through a grid search over

β \in \{0.1, 0.2, 0.4, 0.6\}

and

γ_{b} \in \{0.1, 0.2, 0.3\}

, evaluating the goal completion rate and fleet energy variance after 5 M training steps. The selected configuration (

β = 0.4

,

γ_{b} = 0.2

) achieved the best trade-off, with the energy penalty at approximately 40% of the typical per-step task reward magnitude.
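The grid described above contains twelve configurations, which can be enumerated as in the sketch below. The selection helper takes a user-supplied scoring function because the exact trade-off criterion between goal completion and energy variance is not given in closed form. Both `select_best` and the toy scoring function are hypothetical.

```python
from itertools import product
from typing import Callable, List, Tuple

BETAS = (0.1, 0.2, 0.4, 0.6)   # energy efficiency penalty weights
GAMMAS = (0.1, 0.2, 0.3)       # load balancing penalty weights

grid: List[Tuple[float, float]] = list(product(BETAS, GAMMAS))
print(len(grid))  # 12

def select_best(
    configs: List[Tuple[float, float]],
    score: Callable[[float, float], float],
) -> Tuple[float, float]:
    """Return the (beta, gamma_b) pair with the highest score."""
    return max(configs, key=lambda cfg: score(*cfg))

# Hypothetical scoring function standing in for a full training run.
def toy_score(beta: float, gamma_b: float) -> float:
    return -abs(beta - 0.4) - abs(gamma_b - 0.2)

print(select_best(grid, toy_score))  # (0.4, 0.2)
```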

Appendix A.2. SEA-MAPPO Parameter Optimization

SEA-MAPPO safety-specific parameters were tuned via 60-trial Bayesian optimization using the Tree-Structured Parzen Estimator. The objective maximized a weighted combination of the success rate, the mean battery preservation, and depletion avoidance:

f (θ) = 0.4 \cdot GoalCompletion + 0.4 \cdot MeanBattery - 0.2 \cdot DepletionRate

(A1)
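Equation (A1) translates directly into a scalar objective. The sketch assumes that all three inputs are expressed as fractions in [0, 1]; the function name and the example inputs are illustrative.

```python
def objective(goal_completion: float, mean_battery: float, depletion_rate: float) -> float:
    """Weighted objective of Equation (A1); higher is better."""
    return 0.4 * goal_completion + 0.4 * mean_battery - 0.2 * depletion_rate

# Example: perfect goal completion, 0.809 mean battery, no depletion.
print(round(objective(1.0, 0.809, 0.0), 4))  # 0.7236
```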

Table A1. Bayesian optimization search space and selected values.

Parameter	Search Range	Selected Value
Safety shaping $α_{s}$	[0.05, 1.0]	0.85
Readiness bonus	[0.01, 0.5]	0.10
Depletion penalty $λ_{d}$	[0.5, 5.0]	3.5
Masking threshold $b_{mask}$	[0.10, 0.50]	0.155

Appendix A.3. Sensitivity Patterns

An analysis of the optimization trials revealed distinct sensitivity patterns that inform parameter selection for other platforms:

Masking threshold: Most sensitive parameter. Values exceeding 0.35 notably degraded goal completion by over-restricting the action space, reducing objective scores by 30–40%. Values below the critical threshold eliminated safety guarantees. The effective operating range of 0.15–0.20 (corresponding to 0–33% margin above $b_{crit}$ ) balances safety with task performance.
Depletion penalty: Second most sensitive. The effective range spans 2.0–4.5. Penalties below 1.5 permitted excessive depletion during training (rates exceeding 3%), while values above 5.0 dominated the reward signal and suppressed goal-directed exploration. Practitioners should scale this parameter proportionally to typical episode returns.
Safety shaping: Moderately robust across 0.5–1.0. Performance degraded noticeably below 0.3, where insufficient positive reinforcement for safe states slowed convergence. The parameter primarily affects learning speed rather than final performance.
Readiness bonus: Least sensitive parameter. Performance remained stable across the full search range, with values between 0.05 and 0.20 producing statistically indistinguishable results. This parameter can be set conservatively without extensive tuning.

The optimization converged after approximately 40 trials, with a total optimization time of 6.4 h. For deployment on new platforms, we recommend prioritizing masking threshold calibration, followed by depletion penalty scaling relative to the task reward magnitude.

Appendix A.4. Additional Hyperparameters

For reproducibility, we provide additional training configuration details below.

Table A2. Additional Training Hyperparameters.

Parameter	Value
Optimization
Optimizer	Adam
Learning rate	0.002
Adam $β_{1}$ , $β_{2}$	0.9, 0.999
Max gradient norm	0.5
PPO
Discount factor $γ$	0.99
GAE parameter $λ$	0.95
Clip parameter $ϵ$	0.2
Entropy coefficient	0.01
Value loss coefficient	0.5
PPO epochs per update	10
Number of minibatches	4
Rollout
Parallel environments	16
Random seeds	5
Network
Hidden dimension	256
GRU layers	1
Activation	ReLU
Weight initialization	Orthogonal
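To show how the discount factor and GAE parameter in Table A2 enter the advantage computation, the following is a minimal sketch of standard generalized advantage estimation. It covers a single trajectory segment without episode terminations and uses plain Python lists. The function name `gae` and its arguments are illustrative and are not taken from the training code.

```python
from typing import List, Sequence

def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimates for one trajectory segment."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted backward sum.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

print(gae([1.0, 1.0], [0.0, 0.0], last_value=0.0))
```

In the two-step usage example, the final advantage equals the last TD error, and the first adds that error discounted by the product of the two parameters.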

References

1. Abdelmeguid, Y.; Hasan, A. Data-Driven Autoregressive Power Prediction for GTernal Robots in the Robotarium. arXiv 2026, arXiv:2603.13908.
2. Orr, J.; Dutta, A. Multi-agent deep reinforcement learning for multi-robot applications: A survey. Sensors 2023, 23, 3625.
3. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for decentralised multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
4. Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex dueling multi-agent Q-learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
5. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
6. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6379–6390.
7. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
8. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative, multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
9. Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
10. Uwano, F. Learning agents in robot navigation: Trends and next challenges. J. Robot. Mechatron. 2024, 36, 508–516.
11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
12. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016.
13. Alcayaga, J.M.; Menéndez, O.A.; Torres-Torriti, M.A.; Vásconez, J.P.; Arévalo-Ramirez, T.; Prado Romo, A.J. LSTM-Enhanced Deep Reinforcement Learning for Robust Trajectory Tracking Control of Skid-Steer Mobile Robots Under Terra-Mechanical Constraints. Robotics 2025, 14, 74.
14. Nemer, I.A.; Sheltami, T.R.; Belhaiza, S.; Mahmoud, A.S. Energy-efficient UAV movement control for fair communication coverage: A deep reinforcement learning approach. Sensors 2022, 22, 1919.
15. Ramezani, M.; Amiri Atashgah, M.A. Energy-aware hierarchical reinforcement learning based on the predictive energy consumption algorithm for search and rescue aerial robots in unknown environments. Robot. Auton. Syst. 2024, 172, 104589.
16. Li, Y.; Li, C.; Chen, J.; Roinou, C. Energy-Aware Multi-Agent Reinforcement Learning for Collaborative Execution in Mission-Oriented Drone Networks. In Proceedings of the IEEE International Conference on Computer Communications and Networks, New York, NY, USA, 2–5 May 2022.
17. Jeon, S.; Lee, H.; Kaliappan, V.K.; Nguyen, T.A.; Jo, H.; Cho, H.; Min, D. Multiagent reinforcement learning based on fusion-multiactor-attention-critic for multiple-unmanned-aerial-vehicle navigation control. Energies 2022, 15, 7426.
18. Said, T.; Wolbert, J.; Khodadadeh, S.; Dutta, A.; Kreidl, O.P.; Bölöni, L.; Roy, S. Multi-robot information sampling using deep mean field reinforcement learning. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Melbourne, VIC, Australia, 17–20 October 2021.
19. Singh, K.J.; Nayyar, A.; Kapoor, D.S.; Mittal, N.; Mahajan, S.; Pandit, A.K.; Masud, M. Adaptive flower pollination algorithm-based energy efficient routing protocol for multi-robot systems. IEEE Access 2021, 9, 82417–82434.
20. Altman, E. Constrained Markov Decision Processes; CRC Press: Boca Raton, FL, USA, 1999.
21. Tessler, C.; Mankowitz, D.J.; Mannor, S. Reward constrained policy optimization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
22. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017.
23. Lu, S.; Zhang, K.Q.; Chen, T.Y.; Başar, T.; Horesh, L. Decentralized policy gradient descent ascent for safe multi-agent reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 8767–8775.
24. Gu, S.; Kuba, J.G.; Chen, Y.P.; Du, Y.L.; Yang, L.; Knoll, A.; Yang, Y.D. Safe multi-agent reinforcement learning for multi-robot control. Artif. Intell. 2023, 319, 103905.
25. Dalal, G.; Dvijotham, K.; Vecerik, M.; Hester, T.; Paduraru, C.; Tassa, Y. Safe exploration in continuous action spaces. arXiv 2018, arXiv:1801.08757.
26. Soza-Mamani, K.M.; Alcoba, M.S.; Torres, F.; Prado-Romo, A.J. Cohesion-Based Flocking Formation Using Potential Linked Nodes Model for Multi-Robot Agricultural Swarms. Agriculture 2026, 16, 155.
27. Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017.
28. Terry, J.K.; Grammel, N.; Son, S.; Black, B.; Agrawal, A. Revisiting parameter sharing in multi-agent deep reinforcement learning. arXiv 2021, arXiv:2005.13625.
29. Roijers, D.M.; Vamplew, P.; Whiteson, S.; Dazeley, R. A survey of multi-objective sequential decision-making. J. Artif. Intell. Res. 2013, 48, 67–113.
30. Wilson, S.; Glotfelter, P.; Wang, L.; Mayya, S.; Notomista, G.; Mote, M.; Egerstedt, M. The Robotarium: Globally impactful opportunities, challenges, and lessons learned in remote-access, distributed control of multirobot systems. IEEE Control Syst. Mag. 2020, 40, 26–44.
31. Ames, A.D.; Xu, X.; Grizzle, J.W.; Tabuada, P. Control barrier function based quadratic programs for safety critical systems. IEEE Trans. Autom. Control 2017, 62, 3861–3876.
32. Papoudakis, G.; Christianos, F.; Schäfer, L.; Albrecht, S.V. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Online, 6–14 December 2021.
33. Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning multiagent communication with backpropagation. Adv. Neural Inf. Process. Syst. 2016, 29, 2244–2252.
34. Singh, A.; Jain, T.; Sukhbaatar, S. Learning when to communicate at scale in multiagent cooperative and competitive environments. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
35. Kim, S.; Davis, P.; Yam, N.; Coogan, S.; Wilson, S. GTernal: A robot design for the autonomous operation of a multi-robot research testbed. In Distributed Autonomous Robotic Systems; Springer: Cham, Switzerland, 2024; pp. 489–504.

Figure 1. The Georgia Tech Robotarium platform with GTernal robots.

Figure 2. Goal completion rate on Navigation scenario over training (greedy evaluation).

Figure 3. Depletion rate during training.

Figure 4. Mean fleet battery level during training. SEA-MAPPO preserves the most energy throughout, followed by EA-MAPPO, with MAPPO showing the lowest preservation and highest variance.

Figure 5. Fleet battery variance over training (greedy evaluation). SEA-MAPPO achieves the lowest variance, indicating superior load balancing, while MAPPO exhibits high early variance that persists longer.

Figure 6. Fleet readiness rate over training (greedy evaluation). SEA-MAPPO maintains near-perfect readiness throughout, while MAPPO and EA-MAPPO show substantial variability.

Figure 7. Episode returns during training. SEA-MAPPO achieves the highest returns the fastest, with the lowest variance, while MAPPO shows large negative returns from depletion penalties in early training.

Figure 8. Physical deployment results on Robotarium with GTernal robots. Learned policies transfer successfully with performance consistent with simulation, validating the energy predictor accuracy and overall framework.

Figure 9. Per-agent returns during training for SEA-MAPPO across fleet sizes $N \in \{4, 7, 14\}$. All configurations exhibit similar convergence dynamics, reaching stable performance within comparable training budgets.

Figure 10. Mean fleet battery during training for SEA-MAPPO across fleet sizes $N \in \{4, 7, 14\}$. Larger fleets exhibit modestly lower battery preservation, reflecting increased coordination demands.

Table 1. Environment parameters.

Parameter	Value
Number of agents N	7
Maximum steps	100
Initial battery $b_{0}$	1.0
Critical threshold $b_{crit}$	0.15
Masking threshold $b_{mask}$	0.155
Energy cost multiplier	0.4
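
As a rough illustration of how the two thresholds in Table 1 might interact, the following Python sketch masks any action whose predicted post-action battery level would fall below $b_{mask}$. Only the parameter values come from Table 1; the masking rule itself, the fallback to the cheapest action, the names `EnvParams` and `mask_actions`, and the per-action costs are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvParams:
    """Values taken from Table 1."""
    num_agents: int = 7
    max_steps: int = 100
    initial_battery: float = 1.0
    critical_threshold: float = 0.15   # b_crit
    masking_threshold: float = 0.155   # b_mask
    energy_cost_multiplier: float = 0.4


def mask_actions(battery, predicted_costs, params=EnvParams()):
    """Return one boolean per action: True if the action stays allowed.

    Assumed rule (not confirmed by the source): an action is allowed when
    the predicted battery level after taking it is at least b_mask.
    """
    allowed = [battery - cost >= params.masking_threshold
               for cost in predicted_costs]
    # Assumption: never mask every action; keep the cheapest one available.
    if not any(allowed):
        cheapest = min(range(len(predicted_costs)),
                       key=predicted_costs.__getitem__)
        allowed[cheapest] = True
    return allowed


# Hypothetical per-action energy costs (e.g., stay, slow move, fast move).
costs = [0.0, 0.01, 0.03]
print(mask_actions(0.17, costs))
```

Under this assumed rule, placing $b_{mask}$ slightly above $b_{crit}$ leaves a small buffer between the point where actions start being masked and the point where an agent counts as depleted.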

Table 2. Training Hyperparameters.

Parameter	Value
Learning rate	0.002
Discount factor $γ$	0.99
GAE parameter $λ$	0.95
PPO clip $ϵ$	0.2
Hidden dimension	256
Parallel environments	16
Random seeds	5
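
To show where the discount factor and GAE parameter from Table 2 enter training, the sketch below applies the standard generalized advantage estimation recursion to a single trajectory. It is a generic illustration rather than the authors' code; the function name `compute_gae` and the toy rewards and value estimates are hypothetical.

```python
GAMMA = 0.99       # discount factor from Table 2
GAE_LAMBDA = 0.95  # GAE parameter from Table 2


def compute_gae(rewards, values, last_value, gamma=GAMMA, lam=GAE_LAMBDA):
    """Standard GAE recursion over one trajectory without early termination.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Toy trajectory with hypothetical numbers.
adv = compute_gae(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.3], last_value=0.0)
print([round(a, 4) for a in adv])
```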

Table 3. Reward Components by Algorithm. ✓ indicates the feature is used by the algorithm.

Component	Weight	MAPPO	EA-MAPPO	SEA-MAPPO
$R_{task}$	–	✓	✓	✓
$R_{energy}$	$β = 0.4$	–	✓	✓
$R_{balance}$	$γ_{b} = 0.2$	–	✓	✓
$R_{safety}$	$α_{s} = 0.85$	–	–	✓
$R_{ready}$	$0.1$	–	–	✓
$R_{deplete}$	$λ_{d} = 3.5$	–	–	✓
Action masking	–	–	–	✓
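
The configuration in Table 3 can be read as a per-algorithm selection of reward terms. The sketch below encodes that selection in Python. It assumes the total reward is a weighted sum of the listed components, that each component value already carries its own sign (penalties are negative), and that the task term has weight 1.0 since the table lists none; these assumptions and the names `WEIGHTS`, `ACTIVE`, and `total_reward` are illustrative only.

```python
# Weights from Table 3; the task weight of 1.0 is an assumption.
WEIGHTS = {
    "task": 1.0,
    "energy": 0.4,    # beta
    "balance": 0.2,   # gamma_b
    "safety": 0.85,   # alpha_s
    "ready": 0.1,
    "deplete": 3.5,   # lambda_d
}

# Components used by each algorithm, following the check marks in Table 3.
ACTIVE = {
    "MAPPO": {"task"},
    "EA-MAPPO": {"task", "energy", "balance"},
    "SEA-MAPPO": {"task", "energy", "balance", "safety", "ready", "deplete"},
}


def total_reward(algorithm, components):
    """Weighted sum over the components the given algorithm uses.

    Assumption: a linear combination in which each component value
    already carries its sign; missing components count as zero.
    """
    return sum(WEIGHTS[name] * components.get(name, 0.0)
               for name in ACTIVE[algorithm])


# Hypothetical component values for one agent at one step.
step = {"task": 1.0, "energy": -0.5, "balance": -0.1,
        "safety": 0.2, "ready": 1.0, "deplete": 0.0}
for algorithm in ACTIVE:
    print(algorithm, round(total_reward(algorithm, step), 3))
```

Action masking is not a reward term, so it does not appear in the sum; in Table 3 it applies to SEA-MAPPO only.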

Table 4. Sample Efficiency: Environment Steps to First Reach Threshold.

Threshold	MAPPO	EA-MAPPO	SEA-MAPPO
95% Goal Completion	9.36 M	10.21 M	0.49 M
90% Goal Completion	5.65 M	6.08 M	0.49 M
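
The speedup factors implied by Table 4 follow from dividing each baseline's step count by that of SEA-MAPPO. The short Python sketch below does this arithmetic; the helper name `speedup` is hypothetical.

```python
# Environment steps (in millions) to first reach each threshold, from Table 4.
steps_to_threshold = {
    "95%": {"MAPPO": 9.36, "EA-MAPPO": 10.21, "SEA-MAPPO": 0.49},
    "90%": {"MAPPO": 5.65, "EA-MAPPO": 6.08, "SEA-MAPPO": 0.49},
}


def speedup(threshold, baseline, method="SEA-MAPPO"):
    """Ratio of baseline steps to method steps for one threshold."""
    row = steps_to_threshold[threshold]
    return row[baseline] / row[method]


for threshold in steps_to_threshold:
    for baseline in ("MAPPO", "EA-MAPPO"):
        print(f"{threshold} vs {baseline}: {speedup(threshold, baseline):.1f}x")
```

For the 95% threshold against MAPPO the ratio is about 19.1, which matches the 19× figure quoted in the abstract.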

Table 5. Task Performance at 10 M Environment Steps (Fixed-Budget Comparison).

Metric	MAPPO	EA-MAPPO	SEA-MAPPO
Goal Completion (%)	81.5	91.6	100.0

Table 6. Training stability metrics.

Metric	MAPPO	EA-MAPPO	SEA-MAPPO
Peak Depletion Rate	0.491	0.455	0.375
Eval Points with >10% Depletion	1511	359	50
Goal Completion Std (Training)	0.339	0.286	0.159
Depletion Rate Std (Training)	0.076	0.043	0.017

Table 7. Cumulative training metrics.

Metric	MAPPO	EA-MAPPO	SEA-MAPPO
Mean Depletion Rate (All Training)	0.052	0.018	0.004
Mean Goal Completion (All Training)	61.2%	70.6%	95.5%
Cumulative Depletion Reduction	–	65.8%	93.2%
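
The reduction row in Table 7 can be cross-checked against the mean depletion rates in the same table. The sketch below assumes the reduction is measured relative to MAPPO's mean depletion rate, which this table does not state explicitly. Because the means are printed to three decimals, it also reports the range of reductions compatible with that rounding; the names `reduction_range` and `HALF_UNIT` are hypothetical.

```python
# Mean depletion rates over all training, as printed in Table 7 (3 decimals).
mean_depletion = {"MAPPO": 0.052, "EA-MAPPO": 0.018, "SEA-MAPPO": 0.004}
HALF_UNIT = 0.0005  # largest rounding error of a value printed to 3 decimals


def reduction_range(method, baseline="MAPPO"):
    """Return (low, point, high) relative reduction in percent.

    Assumption: reduction = 1 - method_rate / baseline_rate.
    """
    m, b = mean_depletion[method], mean_depletion[baseline]
    point = 100 * (1 - m / b)
    low = 100 * (1 - (m + HALF_UNIT) / (b - HALF_UNIT))
    high = 100 * (1 - (m - HALF_UNIT) / (b + HALF_UNIT))
    return low, point, high


for method in ("EA-MAPPO", "SEA-MAPPO"):
    low, point, high = reduction_range(method)
    print(f"{method}: {point:.1f}% (range {low:.1f}% to {high:.1f}%)")
```

From the rounded means the point estimates are about 65.4% and 92.3%; the reported 65.8% and 93.2% both fall inside the ranges that three-decimal rounding allows, so the rows are mutually consistent under this assumption.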

Table 8. Navigation scenario results at convergence (mean ± std over 5 seeds).

Metric	MAPPO	EA-MAPPO	SEA-MAPPO
Goal Completion (%)	$99.3 \pm 0.7$	$98.7 \pm 0.7$	$100.0 \pm 0.0$
Mean Final Battery	$0.727 \pm 0.038$	$0.760 \pm 0.032$	$0.809 \pm 0.026$
Battery Variance	$0.047 \pm 0.008$	$0.052 \pm 0.006$	$0.032 \pm 0.002$
Fleet Readiness (%)	$94.6 \pm 3.0$	$96.7 \pm 1.4$	$99.8 \pm 0.1$

Table 9. Physical deployment results.

Metric	Value
Deployment trials	10
Robots per trial	7
Total robot-trials	70
Goal completion	100%
Depletion events	0
Predictor $R^{2}$ with measured power	0.89

Table 10. SEA-MAPPO performance across fleet sizes at convergence.

Metric	N = 4	N = 7	N = 14
Goal Completion (%)	$100.0 \pm 0.0$	$100.0 \pm 0.0$	$100.0 \pm 0.0$
Mean Final Battery	$0.801 \pm 0.034$	$0.809 \pm 0.026$	$0.730 \pm 0.035$
Battery Variance	$0.037 \pm 0.005$	$0.032 \pm 0.002$	$0.044 \pm 0.002$
Fleet Readiness (%)	$100.0 \pm 0.0$	$99.8 \pm 0.1$	$98.6 \pm 0.4$
Inference Time (ms/step)	2.8	4.6	9.1
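
As a quick check on how inference time grows with fleet size, the sketch below fits a straight line to the three inference times in Table 10 using ordinary least squares. Three points are too few to establish a scaling law, so this is only a consistency check; the helper name `least_squares_line` is hypothetical.

```python
# Inference time per step (ms) for each fleet size, from Table 10.
fleet_sizes = [4, 7, 14]
inference_ms = [2.8, 4.6, 9.1]


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares_line(fleet_sizes, inference_ms)
print(f"~{slope:.2f} ms per additional agent, intercept {intercept:.2f} ms")
```

The fitted line passes within about 0.06 ms of all three measurements, which is consistent with roughly linear growth in $N$ over this range.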

Table 11. Theoretical Complexity Scaling.

Component	Scaling with N
Actor input dimension	$O (N)$
Actor parameters	$O (1)$ (parameter sharing)
Critic input dimension	$O (N)$
Action masking (per agent)	$O (1)$
Experience buffer (per step)	$O (N)$
