We present a progression of algorithms built on Multi-Agent Proximal Policy Optimization (MAPPO).
5.1. Network Architecture
Both actor and critic networks employ Gated Recurrent Unit (GRU)-based architectures to capture temporal dependencies in the observation sequence. Because our problem is formulated as a Dec-POMDP, feed-forward networks suffer from perceptual aliasing, where identical observations may require different actions depending on the unobserved temporal context. Recurrent architectures aggregate action-observation histories into latent state representations [8]. We employ GRUs, rather than LSTMs, because GRUs achieve equivalent asymptotic performance with approximately 25% fewer parameters, reducing the inference latency for decentralized deployment.
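The parameter saving quoted above follows from the gate count: a GRU layer has three gate blocks where an LSTM layer has four. The sketch below counts the weights of a single recurrent layer under simplifying assumptions that are ours, not the paper's: one bias vector per gate, and the 13-dimensional observation fed directly into a 256-unit recurrent layer. The function names are hypothetical, and library implementations may count biases differently.

```python
def recurrent_params(input_dim: int, hidden_dim: int, num_gates: int) -> int:
    """Count weights and biases of one recurrent layer.

    Each gate block is assumed to hold an input matrix (hidden x input),
    a recurrent matrix (hidden x hidden), and a single bias vector (hidden).
    """
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return num_gates * per_gate


def gru_params(input_dim: int, hidden_dim: int) -> int:
    # GRU: reset gate, update gate, candidate state -> 3 blocks.
    return recurrent_params(input_dim, hidden_dim, num_gates=3)


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    # LSTM: input, forget, and output gates plus cell candidate -> 4 blocks.
    return recurrent_params(input_dim, hidden_dim, num_gates=4)


if __name__ == "__main__":
    obs_dim, hidden = 13, 256  # sizes taken from the text
    gru, lstm = gru_params(obs_dim, hidden), lstm_params(obs_dim, hidden)
    print(f"GRU: {gru}  LSTM: {lstm}  saving: {1 - gru / lstm:.0%}")
```

The 3:4 ratio holds for the recurrent layer alone, whatever the input and hidden sizes; any surrounding fully connected layers are identical in both variants and reduce the overall saving.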
The actor network processes agent observations through the following structure:
The output produces logits over the five discrete actions, transformed to probabilities via softmax (or masked softmax for SEA-MAPPO).
The critic network processes the global state through:
We select 256 hidden units following MAPPO implementation guidelines [8], which scale network capacity to task complexity. Our 13-dimensional observation space encoding spatial and energy dynamics represents moderate complexity, making 256 the appropriate configuration. This choice aligns with EPyMARL (v2.0.0; Autonomous Agents Research Group, University of Edinburgh, Edinburgh, UK) benchmarking defaults [32]. Parameter sharing is employed across all agents, improving sample efficiency by aggregating all agents’ experiences into unified optimization steps [28]. Behavioral diversity emerges naturally through observation conditioning, as each agent’s unique position and energy state produces specialized actions despite shared weights.
5.3. EA-MAPPO
Energy-Aware MAPPO (EA-MAPPO) extends the MAPPO foundation with energy-augmented observations and shaped rewards while leaving the core algorithm unchanged.
Agent observations include battery information, as specified in Section 3, enabling the policy to condition decisions on individual and collective energy states. The full fleet battery vector provides each agent with complete energy awareness.
The reward function includes the energy efficiency penalty () and the load balancing penalty (). These weights were determined through a grid search over and , selecting the configuration that achieved an optimal trade-off between goal completion and fleet energy variance. The energy penalty at approximately 40% of the typical per-step task reward magnitude influences learning without dominating task objectives. This configuration represents the standard approach in the energy-aware multi-agent reinforcement learning literature, where energy objectives are incorporated purely through reward engineering without explicit safety mechanisms.
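As an illustration of how such a shaped reward might be assembled, the sketch below subtracts two weighted penalties from the task reward. The functional forms are assumptions: the energy term is taken to be proportional to the energy consumed in the step, and the load balancing term is taken to be the population variance of the fleet battery levels. The weights are required arguments with no defaults because their values are not reproduced here, and the name `shaped_reward` is hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def shaped_reward(
    task_reward: float,
    energy_used: float,
    fleet_batteries: Sequence[float],
    w_energy: float,
    w_balance: float,
) -> float:
    """Task reward minus energy-efficiency and load-balancing penalties.

    Assumed forms (not taken from the paper):
      energy penalty  = w_energy  * energy consumed this step
      balance penalty = w_balance * variance of fleet battery levels
    """
    energy_penalty = w_energy * energy_used
    balance_penalty = w_balance * pvariance(fleet_batteries)
    return task_reward - energy_penalty - balance_penalty
```

With a perfectly balanced fleet the variance term vanishes and only the energy penalty remains, so the two weights can be tuned largely independently in a grid search.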
EA-MAPPO demonstrates that policies can learn more energy-efficient behaviors when incentivized through rewards. However, while reward shaping incentivizes energy-conservative behavior, it provides no hard guarantees, and a policy optimizing expected return may still select actions that risk depletion when task rewards dominate energy penalties.
The multi-objective reward landscape may slow initial task-focused learning compared to single-objective MAPPO, as the policy must balance competing objectives.
5.4. SEA-MAPPO
Safe Energy-Aware MAPPO (SEA-MAPPO) extends EA-MAPPO with both predictive action masking and safety-oriented reward shaping, providing comprehensive constraint enforcement.
Given the current battery level and the energy predictor, the action masking mechanism excludes actions predicted to cause a constraint violation. The autoregressive predictor takes the recent power history as input.
The policy network outputs logits for all actions, and masking modifies the softmax normalization to consider only valid actions:
This formulation guarantees that sampled actions satisfy battery constraints, assuming predictor accuracy. The policy learns over the valid action subspace, naturally adapting behavior as the battery depletes and the valid set shrinks.
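A minimal sketch of masked softmax in plain Python follows. It assumes the mask is supplied as one boolean per action; the function name and signature are illustrative rather than the authors' implementation. Invalid actions receive exactly zero probability, and the normalization runs over valid actions only.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], valid: Sequence[bool]) -> List[float]:
    """Softmax restricted to the actions flagged as valid."""
    if len(logits) != len(valid):
        raise ValueError("logits and valid must have the same length")
    if not any(valid):
        raise ValueError("at least one action must remain valid")

    # Subtract the largest valid logit for numerical stability.
    max_valid = max(z for z, ok in zip(logits, valid) if ok)
    exps = [math.exp(z - max_valid) if ok else 0.0 for z, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because masked entries are set to zero before normalization, sampling from the returned distribution can never select a masked action, whatever the raw logits are.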
In addition to action masking, SEA-MAPPO augments the reward function with safety-specific components. With EA-MAPPO base weights held constant to ensure fair ablation, we conducted 60-trial Bayesian optimization over the safety-specific parameters. This yielded safety shaping of
, providing a positive signal for maintaining the battery above the critical threshold, a readiness bonus of 0.1 for staying above the masking threshold, and a depletion penalty of
for catastrophic battery exhaustion. Details of the parameter sensitivity analysis, including search ranges and sensitivity patterns, are provided in Appendix A. The depletion penalty is intentionally large relative to other reward components, ensuring that battery exhaustion is treated as catastrophic mission failure, rather than a soft optimization trade-off. The masking threshold (0.155) was set to 0.005 above the critical threshold (0.15) to provide a safety margin absorbing predictor uncertainty. These components complement action masking by shaping the policy toward energy-conservative behaviors even before masking activates.
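The three safety components can be sketched as a single additive term. The thresholds (0.15 and 0.155) and the readiness bonus (0.1) come from the text; the safety shaping value and the depletion penalty are left as required arguments because their values are not reproduced here. Treating depletion as a battery level at or below zero, and combining the bonuses additively, are assumptions of this sketch, and the name `safety_reward` is hypothetical.

```python
def safety_reward(
    battery: float,
    safety_bonus: float,
    depletion_penalty: float,
    critical: float = 0.15,
    masking: float = 0.155,
    readiness_bonus: float = 0.1,
) -> float:
    """Safety-specific reward term for one agent at one step.

    `depletion_penalty` is passed as a positive magnitude and subtracted.
    """
    if battery <= 0.0:
        # Assumed definition of catastrophic exhaustion.
        return -abs(depletion_penalty)

    reward = 0.0
    if battery > critical:
        reward += safety_bonus      # stayed above the critical threshold
    if battery > masking:
        reward += readiness_bonus   # stayed above the masking threshold
    return reward
```

An agent between the two thresholds collects the safety bonus but not the readiness bonus, which gives the policy a graded signal before masking activates.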
In practice, we implement threshold-based masking where movement actions (up, down, left, right) are masked when the battery falls below the masking threshold (0.155), while the null action always remains valid. This provides a conservative safety mechanism independent of predictor accuracy for near-critical battery states.
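The threshold rule can be sketched in a few lines. The action ordering and the names `ACTIONS` and `threshold_mask` are assumptions made for illustration; the default of 0.155 is the masking threshold stated in the text.

```python
from typing import List

# Assumed ordering of the five discrete actions.
ACTIONS = ("up", "down", "left", "right", "null")
MASK_THRESHOLD = 0.155  # masking threshold given in the text


def threshold_mask(battery: float, mask_threshold: float = MASK_THRESHOLD) -> List[bool]:
    """Return one validity flag per action.

    Movement actions are valid only while the battery is at or above the
    threshold; the null action is always valid, so the mask is never empty.
    """
    can_move = battery >= mask_threshold
    return [True if name == "null" else can_move for name in ACTIONS]
```

Because the null action is never masked, the valid set is never empty, so a masked softmax over the policy logits is always well defined.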
The critical threshold represents the minimum battery level for safe operation, determined by platform characteristics and mission requirements, rather than algorithm tuning. For the GTernal platform, this value ensures sufficient energy to execute the failsafe retreat behavior (moving to the nearest arena corner) from any position while maintaining voltage levels adequate for motor control. The masking threshold introduces a margin of 0.005 above the critical threshold to absorb prediction uncertainty. This margin was selected based on predictor accuracy: with a mean absolute error of 31.5 mW on held-out validation data [1] and typical power consumption of approximately 3.5 W, the single-step prediction error is below 1%. The 0.005 margin (3.3% relative to the critical threshold) absorbs approximately 5–6 steps of worst-case accumulated error, providing robustness against predictor inaccuracy without overly restricting the action space.
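The two percentages quoted above can be checked directly from the stated figures. The snippet below is only a worked calculation; the constant and function names are ours.

```python
MAE_W = 0.0315            # predictor mean absolute error: 31.5 mW
TYPICAL_POWER_W = 3.5     # typical power consumption: about 3.5 W
CRITICAL = 0.15           # critical battery threshold
MASKING = 0.155           # masking threshold


def relative_prediction_error(mae_w: float = MAE_W,
                              typical_power_w: float = TYPICAL_POWER_W) -> float:
    """Single-step prediction error as a fraction of typical power draw."""
    return mae_w / typical_power_w


def relative_margin(critical: float = CRITICAL, masking: float = MASKING) -> float:
    """Safety margin as a fraction of the critical threshold."""
    return (masking - critical) / critical


if __name__ == "__main__":
    print(f"single-step error: {relative_prediction_error():.2%}")
    print(f"margin vs. critical threshold: {relative_margin():.1%}")
```

The first ratio is 0.9%, consistent with the "below 1%" claim, and the second is about 3.3%.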
For deployment on new platforms, practitioners should set the critical threshold based on hardware specifications (minimum safe discharge level, failsafe energy requirements) and place the masking threshold above it, with the margin proportional to the expected predictor error. More energy-intensive actions, longer episodes, or higher predictor uncertainty warrant larger margins. The threshold-based fallback, where movement actions are masked whenever the battery falls below the masking threshold, regardless of the predictor output, provides an additional safety layer independent of prediction accuracy for near-critical battery states.
5.5. Centralized Training with Decentralized Execution
Both EA-MAPPO and SEA-MAPPO operate under the Centralized Training with Decentralized Execution (CTDE) paradigm. During training, the centralized critic accesses the global state, including all agent positions, velocities, and battery levels, which enables effective credit assignment across the fleet. During execution, each actor conditions solely on its local observation, a fixed 13-dimensional vector containing the agent’s pose, goal displacement, and fleet energy state. The fleet battery state comprises N scalar values per timestep, a bandwidth comparable to that of the goal positions that multi-robot coordination systems routinely share, and does not require learned communication protocols [33,34] that introduce additional trainable parameters and emergent messaging complexity. The Robotarium provides this telemetry through its standard API, a capability also available on platforms that periodically broadcast mission state. With the fleet energy state available, the framework enables coordinated energy-aware behavior where each agent accounts for the collective battery status, while action masking provides formal safeguards against depletion.
A limitation of threshold-based masking is that the threshold requires tuning to balance safety and task performance. Different scenarios may benefit from different thresholds, and more energy-intensive tasks might require larger margins between the critical and masking thresholds.