This section presents a comprehensive empirical evaluation of our proposed D* Lite + TE-SAC framework. We designed experiments to assess its performance, robustness, and scalability against state-of-the-art baselines in challenging multi-agent maritime scenarios.
4.1. Experimental Setup
All baseline algorithms were implemented with careful attention to ensuring a fair and rigorous comparison. Specifically, the neural network backbones for all learning-based agents (e.g., MAPPO, QMIX) were designed to have a comparable number of trainable parameters to our proposed TE-SAC agent. Furthermore, each baseline was subjected to an extensive hyperparameter sweep using a grid search methodology to identify its optimal configuration for the given tasks. The results reported for all baselines represent their best-achieved performance after this tuning process, providing a robust and equitable basis for comparison.
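For illustration, the sketch below shows a generic grid-search loop of the kind used for this tuning; the parameter names and ranges are hypothetical placeholders rather than the exact grids searched for each baseline.

```python
from itertools import product

# Hypothetical search grid; the actual ranges differed per baseline.
GRID = {
    "learning_rate": [3e-4, 1e-4, 3e-5],
    "batch_size": [256, 512],
    "discount_gamma": [0.95, 0.99],
}

def grid_search(train_and_eval, grid=GRID):
    """Exhaustively evaluate every hyperparameter combination and return the
    configuration with the best score (e.g., mean win rate over seeds)."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```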
To comprehensively evaluate the performance of our proposed framework, we compare it against several state-of-the-art and classical baselines, categorized as follows:
Multi-Agent Reinforcement Learning (MARL) Baselines: We include QMIX, a representative value-decomposition method for cooperative tasks; MADDPG, a widely-used actor-critic algorithm suitable for mixed cooperative-competitive settings; and MAPPO, a state-of-the-art on-policy MARL algorithm known for its strong performance and stability. These baselines represent the current state of the art in learning-based approaches.
Classical Planning Baselines: To highlight the advantages of learning complex behaviors, we also compare our approach against hybrid planners that combine D* Lite with Optimal Reciprocal Collision Avoidance (ORCA) [32] and the Dynamic Window Approach (DWA) [33]. ORCA is a well-established geometric method for multi-agent collision avoidance, while DWA is a popular local planner in robotics. These represent robust, non-learning-based solutions [34].
A high-fidelity multi-agent simulation environment was constructed in Python 3.8 to validate the proposed framework. The training and evaluation processes were executed on a desktop computer with an Intel Core i9-12900K CPU, 64 GB of DDR5 RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM, which was used to accelerate neural network computations.
To ensure a comprehensive and rigorous evaluation of our framework, we designed a suite of experimental scenarios with a clear rationale, progressing from rule-based validation to complex adversarial testing. The selection and configuration of these scenarios are explicitly designed to dissect and validate different facets of our proposed architecture:
Rule-Intensive Crossing Scenario (5-ship crossing): This scenario is specifically designed to test the core COLREGs compliance and safety performance of the local planner (TE-SAC). The configuration involves multiple vessels approaching from various angles, creating a dense environment where adherence to Rules 13, 14, and 15 is critical and non-trivial. This setup directly challenges the agent’s ability to internalize complex rules, a primary objective of our work, and provides a clear benchmark against classical methods such as ORCA and DWA, which often struggle with such nuanced interactions. The number of ships (five) was chosen to create sufficient complexity without becoming chaotic, allowing for a precise analysis of decision-making. (A simplified sketch of how the encounter types governed by these rules can be classified from the encounter geometry is given after this list.)
Game-Theoretic Adversarial Scenarios (Cooperative Escort, Area Defense, Pursuit-Evasion): These three scenarios shift the focus from static rule compliance to dynamic, strategic, and cooperative multi-agent behavior. Their purpose is to evaluate the framework’s robustness, scalability, and generalization capabilities, particularly the effectiveness of the GNN-Transformer encoder for intent inference and the adversarial meta-learning mechanism.
Cooperative Escort tests coordinated protective maneuvers.
Area Defense assesses spatial control and emergent team strategy.
Pursuit-Evasion evaluates predictive capabilities against agile, unpredictable opponents.
The configurations, involving varying numbers of agents (N vs. M) and predefined team objectives, are standard benchmarks in multi-agent reinforcement learning. By testing across these diverse, high-stakes scenarios, we can robustly validate the practical applicability and superior performance of our framework beyond simple collision avoidance.
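As a point of reference for the rule-intensive scenario above, the following sketch shows one common way to classify the encounter types addressed by Rules 13–15 from relative bearings and headings. The angular thresholds are widely used rules of thumb, not the exact logic embedded in our reward or constraint design.

```python
import math

def relative_bearing_deg(from_pos, from_heading_deg, to_pos):
    """Bearing of `to_pos` relative to `from_heading_deg`, wrapped to (-180, 180]."""
    dx, dy = to_pos[0] - from_pos[0], to_pos[1] - from_pos[1]
    absolute = math.degrees(math.atan2(dx, dy))  # 0 deg = north, clockwise positive
    return (absolute - from_heading_deg + 180.0) % 360.0 - 180.0

def classify_encounter(own_pos, own_hdg, tgt_pos, tgt_hdg):
    """Simplified COLREGs encounter typing with assumed angular thresholds."""
    bearing_of_own_from_tgt = relative_bearing_deg(tgt_pos, tgt_hdg, own_pos)
    bearing_of_tgt_from_own = relative_bearing_deg(own_pos, own_hdg, tgt_pos)
    hdg_diff = abs((own_hdg - tgt_hdg + 180.0) % 360.0 - 180.0)

    if abs(bearing_of_own_from_tgt) > 112.5:
        return "overtaking (Rule 13)"  # own ship approaches from abaft the target's beam
    if hdg_diff > 165.0 and abs(bearing_of_tgt_from_own) < 15.0:
        return "head-on (Rule 14)"     # nearly reciprocal courses, target nearly ahead
    return "crossing (Rule 15)"
```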
The environment encompasses a 10 km × 10 km rectangular area featuring static island obstacles and neutral vessels on predefined courses, creating a complex and dynamic setting. To bridge the inherent gap between simulation and reality, we incorporated several key features that enhance the realism of the environment. A second-order Nomoto model governs the agent’s motion, capturing fundamental vessel hydrodynamics and enforcing constraints on maximum speed, acceleration, and turn rate. The virtual sensor suite (e.g., radar, cameras) simulates imperfect perception with realistic limitations, such as range-dependent noise, limited fields of view, and data dropouts. The environment also simulates four levels of sea-state conditions (based on the Douglas sea scale), introducing varying degrees of sensor noise and data inconsistency to rigorously test the algorithm’s robustness against common real-world disturbances.
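To make the vessel model concrete, the following sketch integrates a second-order Nomoto yaw response in discrete time and clips the result to the turn-rate limit; the coefficient values are placeholders, not the identified parameters of our simulated hull.

```python
import numpy as np

# Placeholder Nomoto coefficients; the identified values for our hull differ.
K, T1, T2, T3 = 0.2, 10.0, 5.0, 2.0      # gain [1/s] and time constants [s]
R_MAX = np.deg2rad(3.0)                   # maximum turn rate [rad/s]

def nomoto_step(r, r_dot, delta, delta_dot, dt=0.1):
    """Second-order Nomoto yaw dynamics,
       T1*T2*r_ddot + (T1 + T2)*r_dot + r = K*delta + K*T3*delta_dot,
    integrated with forward Euler and clipped to the turn-rate limit."""
    r_ddot = (K * delta + K * T3 * delta_dot - (T1 + T2) * r_dot - r) / (T1 * T2)
    r_dot_new = r_dot + r_ddot * dt
    r_new = np.clip(r + r_dot_new * dt, -R_MAX, R_MAX)
    return r_new, r_dot_new
```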
To simulate realistic perception capabilities, the agent is equipped with a virtual sensor suite, including a multi-threaded radar for detecting object positions and velocities within a 5 km range (with added noise) and high-definition cameras for detection within a 2 km forward-facing arc. Path smoothing of the agent’s trajectory is achieved using cubic spline interpolation, which ensures curvature continuity, as described by the following piecewise form:

$$S_i(u) = a_i + b_i u + c_i u^2 + d_i u^3, \qquad u \in [0, 1],$$

where $S_i(u)$ represents the points of the smooth path on the $i$-th segment and $u$ is the interpolation factor.
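For illustration, path smoothing of this kind can be implemented directly with SciPy’s cubic spline; the natural boundary condition and sampling density below are illustrative choices, not necessarily those used in our implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_path(waypoints, samples_per_segment=20):
    """Fit cubic splines x(u), y(u) through the planner's waypoints and
    resample densely; the spline's C2 continuity yields a
    curvature-continuous trajectory."""
    waypoints = np.asarray(waypoints, dtype=float)
    u = np.arange(len(waypoints))                        # one knot per waypoint
    cs_x = CubicSpline(u, waypoints[:, 0], bc_type="natural")
    cs_y = CubicSpline(u, waypoints[:, 1], bc_type="natural")
    u_fine = np.linspace(0, len(waypoints) - 1,
                         samples_per_segment * (len(waypoints) - 1) + 1)
    return np.column_stack([cs_x(u_fine), cs_y(u_fine)])
```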
To evaluate the framework’s performance from multiple perspectives, we designed three typical multi-robot, game-theoretic scenarios:
Cooperative Escort: N friendly agents must escort a high-value target ship to its destination while defending against M waves of enemy attacks.
Area Defense: N friendly agents must form a dynamic defensive perimeter to intercept M enemy agents attempting to breach a specified rectangular region.
Pursuit-Evasion: N pursuer agents must cooperatively track and capture M agile evading targets within an environment containing obstacles.
These scenarios test cooperative strategies, spatial control, and adaptive planning against dynamic adversaries.
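For concreteness, these scenarios can be parameterized as shown below; the field names, the pursuit-evasion team sizes, and the episode length are illustrative assumptions (the 4v3 escort and 5v5 defense configurations follow the setups reported later in this section).

```python
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    """Illustrative parameterization of an N-vs-M game-theoretic scenario."""
    name: str
    n_friendly: int
    m_adversary: int
    arena_km: tuple = (10.0, 10.0)     # 10 km x 10 km operating area
    sea_state: int = 1                 # Douglas-scale level (1-4 simulated)
    max_episode_steps: int = 2000      # assumed episode horizon

SCENARIOS = [
    ScenarioConfig("cooperative_escort", n_friendly=4, m_adversary=3),
    ScenarioConfig("area_defense", n_friendly=5, m_adversary=5),
    ScenarioConfig("pursuit_evasion", n_friendly=3, m_adversary=2),  # hypothetical sizes
]
```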
The network architecture and hyperparameter settings used throughout the experiments are detailed in Table 3. These settings were kept constant across all comparable algorithm implementations to ensure reproducible results.
We employ an automated curriculum learning (CL) strategy to improve training stability and accelerate convergence. The agent progresses through four levels of increasing task difficulty, from basic static obstacle avoidance to the complete N-vs-M adversarial task, as detailed in Table 3. The transition between levels is automatically triggered once the agent’s performance on key metrics (e.g., success rate, collision rate) exceeds a predefined threshold. As demonstrated in Figure 5, the CL strategy enables the agent to achieve a higher final win rate (~85%) more quickly than training from scratch (~81% after additional steps), validating its role as a critical component for efficiently achieving high-performance policies.
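A minimal sketch of the automatic level-transition logic is given below; the per-level metric thresholds are placeholders rather than the exact values used in our curriculum.

```python
# Placeholder promotion thresholds per curriculum level (success rate, collision rate).
LEVEL_THRESHOLDS = [
    {"success_rate": 0.90, "collision_rate": 0.05},  # L1: static obstacle avoidance
    {"success_rate": 0.85, "collision_rate": 0.05},  # L2: dynamic neutral traffic
    {"success_rate": 0.80, "collision_rate": 0.08},  # L3: partial adversarial task
    {"success_rate": 0.75, "collision_rate": 0.10},  # L4: full N-vs-M adversarial task
]

class CurriculumController:
    """Promote the agent to the next difficulty level once its rolling
    evaluation metrics clear the current level's thresholds."""
    def __init__(self):
        self.level = 0

    def update(self, metrics: dict) -> int:
        t = LEVEL_THRESHOLDS[self.level]
        if (self.level < len(LEVEL_THRESHOLDS) - 1
                and metrics["success_rate"] >= t["success_rate"]
                and metrics["collision_rate"] <= t["collision_rate"]):
            self.level += 1
        return self.level
```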
The following quantitative metrics are used for performance evaluation:
Task Success Rate (%)/Win Rate (%): The primary metric indicating the percentage of episodes where the mission objective is completed.
COLREGs Compliance Rate (%): Measures the adherence to maritime traffic rules.
Strategy Degradation Rate (%): Measures the drop in win rate when facing an unseen opponent strategy, used to evaluate generalization and robustness.
Response Time (ms): The end-to-end processing delay from perception to action.
Energy Consumption (MJ/task): An estimate of the cost per mission.
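The first three metrics can be aggregated from episode logs as sketched below; the log field names are assumptions, and the degradation rate is computed here as a relative drop, which is one of several reasonable conventions.

```python
import numpy as np

def aggregate_metrics(episodes):
    """Aggregate per-episode logs (assumed field names) into summary metrics."""
    win_rate = np.mean([ep["success"] for ep in episodes])
    compliance = np.mean([ep["colregs_compliant_steps"] / ep["encounter_steps"]
                          for ep in episodes if ep["encounter_steps"] > 0])
    return {"win_rate_pct": 100.0 * win_rate,
            "colregs_compliance_pct": 100.0 * compliance}

def strategy_degradation_pct(win_rate_seen, win_rate_unseen):
    """Relative drop in win rate when facing an unseen opponent strategy."""
    return 100.0 * (win_rate_seen - win_rate_unseen) / win_rate_seen
```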
Modeling Real-World Constraints: Sensor Noise and Communication Imperfections.
A core objective of our evaluation is to assess the framework’s robustness under non-ideal, realistic conditions. To this end, we have explicitly incorporated two critical sources of real-world challenges into our simulation environment:
Varying Levels of Sensor Noise: The virtual sensor suite is not perfect. We model range-dependent noise and detection dropouts, and we simulate four distinct sea-state conditions (based on the Douglas scale). Each level introduces progressively higher degrees of sensor noise and data inconsistency, directly testing the policy’s ability to operate with imperfect perception.
Communication Delays and Packet Loss: To simulate real-world communication latency and unreliability, particularly in multi-agent coordination, we conduct experiments with communication delays of 100 ms and 200 ms and with packet loss rates of 5% and 10% (a sketch of how these degradations are injected follows below).
By systematically evaluating our framework against these varying levels of degradation, we can provide a more credible assessment of its practical deployability and resilience compared to baselines.
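The sketch below illustrates one way these degradations can be injected at the observation interface, combining range-dependent Gaussian noise, a fixed-latency buffer, and random packet loss; the noise scale, delay length, and drop handling are illustrative assumptions rather than our exact simulator code.

```python
import random
from collections import deque

import numpy as np

class DegradedObservationWrapper:
    """Inject range-dependent noise, a fixed delay, and packet loss into the
    observations delivered to the policy (illustrative model)."""
    def __init__(self, delay_steps=2, drop_prob=0.05, noise_per_km=0.01):
        self.buffer = deque(maxlen=delay_steps + 1)   # fixed-latency queue
        self.drop_prob = drop_prob                     # packet-loss probability
        self.noise_per_km = noise_per_km               # noise grows with target range
        self.last_delivered = None

    def observe(self, true_positions_km, own_position_km):
        ranges = np.linalg.norm(true_positions_km - own_position_km, axis=1)
        noisy = true_positions_km + np.random.normal(
            scale=self.noise_per_km * ranges[:, None], size=true_positions_km.shape)
        self.buffer.append(noisy)
        delayed = self.buffer[0]                       # oldest entry = delayed packet
        if random.random() < self.drop_prob:           # packet lost: reuse last delivery
            return self.last_delivered if self.last_delivered is not None else delayed
        self.last_delivered = delayed
        return delayed
```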
4.2. Results and Analysis
All experiments were conducted with five different random seeds, and the reported results are averaged. Shaded areas in plots represent the standard deviation.
Figure 6 presents a comparison of the win rates across the three primary tasks. Our framework (Adv-TransAC) consistently outperforms all baselines. Notably, the cooperative escort mission achieves a 92% win rate, a significant improvement over the next-best algorithm, MAPPO (78%). Even in the highly complex 5v5 area blockade task, our method maintains a superior win rate of 85%.
A comparison with classical planners in a 5-ship crossing scenario (Table 3) further highlights the advantages of our learning-based approach. While ORCA and DWA offer faster decision times (25 ms and 42 ms, respectively), they exhibit poor compliance with COLREGs (65.4% and 58.1%, respectively). In contrast, our framework achieves a near-perfect 98.7% compliance rate and a higher task success rate (95.2%), demonstrating its ability to master complex, rule-based behaviors.
A zero-shot transfer test evaluated the framework’s robustness against unseen opponent strategies. As shown in Figure 6, the win rates of traditional MARL algorithms, such as MADDPG and QMIX, dropped by 45% and 38%, respectively. Our Adv-TransAC framework, however, experienced only an 8% performance degradation, showcasing the high effectiveness of its adversarial meta-learning mechanism in promoting generalization rather than simple memorization.
The ablation study results (Table 6) quantify the contribution of each architectural component in the 5v5 area blockade task.
Removing the D* Lite global planner (w/o DLite) resulted in a sharp drop in win rate to 61.5%, confirming the necessity of long-term strategic guidance.
Removing the GNN module (w/o GNN) decreased the win rate to 71.3%, demonstrating the critical importance of explicit spatial relationship modeling.
Replacing the Transformer with a standard LSTM (w/o Transformer) yielded a win rate of 75.8%, validating the superior capability of the self-attention mechanism for capturing long-term temporal dependencies.
The framework’s scalability was tested by increasing the number of agents from 4 to 20. Our method exhibits more graceful performance degradation than MAPPO, maintaining a win rate of ~70% at 20 agents. Analysis of failure cases identified two primary modes: decision deadlock in highly symmetric situations and delayed response to sudden, coordinated enemy attacks. These failure modes suggest that future work should focus on improving high-level coordination protocols and enhancing the predictive capabilities of the meta-learning mechanism.
We tested the framework’s performance under simulated data latency and packet loss conditions to replicate real-world scenarios. As shown in Table 7, our framework demonstrates significantly stronger robustness than MAPPO. With a 200 ms latency, our win rate dropped by only 9.4%, compared to MAPPO’s 15.6%. This resilience is attributed to the Transformer’s ability to infer and predict target states from historical trajectories, even with incomplete or delayed information.
4.4. Extended Performance Evaluation
We now present the results of our comprehensive experiments. All reported results are averaged over five independent runs with different random seeds to ensure statistical reliability, with shaded areas in plots representing the standard deviation. We analyze the framework’s performance from several key perspectives: its overall effectiveness compared to baselines, the specific contributions of its architectural components, and its robustness under challenging and non-ideal conditions.
4.4.1. Overall Performance Comparison
First, we evaluate the overall performance of our framework against the state-of-the-art MARL baselines, MAPPO and QMIX. Quantitatively, across the three tested scenarios, our framework improves the win rate by 10–17 percentage points compared to the strong MAPPO baseline, and by a remarkable 18–32 percentage points compared to QMIX.
Figure 6 summarizes the results across the three primary task scenarios. The bar chart on the left illustrates the final win rates, while the pie chart on the right shows the compliance rate of our framework with the COLREGs.
As the results indicate, our D* Lite + TE-SAC framework (labeled Adv-TransAC) consistently outperforms the baselines in all tasks. In the challenging Cooperative Escort (4v3) task, it achieves a win rate of approximately 90%, demonstrating its superior coordination capabilities. In the more complex, large-scale Area Defense (5v5) scenario, it maintains a strong win rate of over 75%, surpassing both MAPPO and QMIX.
Crucially, the pie chart highlights one of the most significant achievements of our framework: a 98.7% compliance rate with COLREGs. This near-perfect adherence to maritime regulations is directly attributed to our hybrid design, which effectively integrates rule-based constraints into the learning process. This result is significant as it addresses a key challenge for the practical deployment and certification of autonomous systems. While other MARL agents can learn to complete tasks, our framework demonstrates the ability to do so in a manner that is both effective and certifiably safe.
Figure 6. The bar chart displays the win rates of our method (Adv-TransAC) against the MARL baselines (MAPPO, QMIX); the star-marker line, corresponding to the right y-axis, highlights our framework’s consistently high COLREGs compliance rate.
To further contextualize these results and highlight the advantages of our learning-based approach, we also compare it with classical hybrid planners in a complex 5-ship crossing scenario. The results, presented in Table 5, reveal a critical trade-off between decision speed and intelligent, rule-compliant behavior.
While classical planners like D* Lite + ORCA and D* Lite + DWA offer faster decision times (25 ms and 42 ms, respectively), they exhibit poor COLREGs compliance (65.4% and 58.1%, respectively) and lower overall success rates. More importantly, their reactive nature leads to less safe and less smooth maneuvers. Our proposed framework, in contrast, achieves a near-perfect 98.7% compliance rate and a higher task success rate of 95.2%.
Furthermore, the secondary metrics underscore the superior quality of the learned behavior. Our TE-SAC agent maintains a much larger Average Minimum DCPA of 0.82 nm, nearly double that of ORCA, indicating markedly safer passages. It also produces a significantly lower Average Path Curvature (1.25 vs. 2.89 and 4.16), resulting in smoother trajectories, improved passenger comfort, and reduced energy consumption owing to fewer rudder actions. Although our method incurs a slightly longer path length and decision latency, it remains within real-time requirements (176 ms) and demonstrates a more holistic and intelligent decision-making capability. This result underscores the unique ability of our RL-based agent to internalize and execute complex, rule-based behaviors that are challenging to hard-code with classical methods.
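For reference, the two secondary metrics can be computed as sketched below, assuming constant-velocity extrapolation for DCPA and finite-difference curvature along the sampled trajectory; these are standard definitions, though the exact estimators used in our evaluation pipeline may differ in detail.

```python
import numpy as np

def dcpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Distance at the closest point of approach, assuming constant velocities."""
    p = np.asarray(tgt_pos, float) - np.asarray(own_pos, float)
    v = np.asarray(tgt_vel, float) - np.asarray(own_vel, float)
    t_cpa = max(0.0, -np.dot(p, v) / (np.dot(v, v) + 1e-9))  # time of closest approach
    return float(np.linalg.norm(p + v * t_cpa))

def mean_abs_curvature(path_xy):
    """Average |curvature| of a sampled path via finite differences."""
    x, y = np.asarray(path_xy, float).T
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / np.clip((dx**2 + dy**2) ** 1.5, 1e-9, None)
    return float(np.mean(kappa))
```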
4.4.2. Ablation Study: Deconstructing the Framework’s Success
To dissect the sources of this performance gain, we conducted a systematic ablation study on the core architectural components of our framework in the challenging “5v5 area blockade” task. The results, summarized in Table 8, quantify the contribution of each element.
Global Planner (D* Lite): Removing the D* Lite planner and resorting to a purely end-to-end RL approach (w/o DLite) caused the most significant performance degradation, with the win rate plummeting from 85.0% to 61.5%. This 23.5-point drop confirms that long-term strategic guidance is indispensable for achieving high success rates in complex navigation tasks, preventing the agent from adopting myopic, inefficient behaviors.
Spatial Encoder (GNN): Eliminating the GNN module and feeding state vectors directly to the Transformer (w/o GNN) reduced the success rate to 71.3%. This 13.7-point drop underscores the importance of explicitly modeling the instantaneous spatial topology and relational structure in multi-agent scenarios.
Temporal Encoder (Transformer): Replacing the Transformer with a standard LSTM, a common model for sequence processing, decreased the success rate to 75.8%. While still effective, this 9.2-point drop validates the superior capacity of the Transformer’s self-attention mechanism for capturing the long-range temporal dependencies crucial for inferring vessel intent.
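To make the spatial-then-temporal encoding order examined in this ablation concrete, the following PyTorch sketch applies one round of graph message passing per timestep before a Transformer encoder over the history. The layer sizes, the mean-aggregation GNN, and the single ego-agent readout are illustrative simplifications, not our exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative GNN-then-Transformer encoder: a per-timestep graph layer
    aggregates neighbouring agents, and self-attention models the history."""
    def __init__(self, obs_dim=16, hid=64, heads=4, layers=2):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())
        self.msg_mlp = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(d_model=hid, nhead=heads,
                                               batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, obs, adj):
        # obs: (batch, time, agents, obs_dim); adj: (batch, time, agents, agents)
        h = self.node_mlp(obs)
        # One round of mean message passing over the adjacency matrix.
        msgs = torch.einsum("btij,btjd->btid", adj, h) / adj.sum(-1, keepdim=True).clamp(min=1)
        h = self.msg_mlp(torch.cat([h, msgs], dim=-1))
        ego = h[:, :, 0]                  # ego agent's spatial embedding per step
        return self.temporal(ego)         # (batch, time, hid): temporally-aware features
```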
4.4.3. Robustness and Generalization Against Adversaries
A critical requirement for real-world deployment is the ability to generalize to unseen situations. We evaluated this via a zero-shot transfer experiment, in which our trained agent faced opponents with entirely new, previously unobserved strategies.
The contribution of our adversarial meta-learning mechanism, detailed in the section on Adversarial Meta-Learning for Robust Policy Generalization, is starkly evident here. As shown in Figure 7, traditional MARL algorithms, such as MADDPG and QMIX, exhibited severe performance degradation when transferred, with their win rates dropping by over 45% and 38%, respectively. However, our complete framework (Adv-TransAC), leveraging the meta-learning component, suffered only an 8% performance loss. An ablation variant of our model without the meta-learning component (w/o Meta) performed almost as poorly as the baselines in this transfer task, demonstrating a sharp decline in generalization. This highlights a qualitative shift achieved by our approach: from merely “memorizing” responses to known tactics to “learning how to adapt” to novel ones, a crucial step towards genuine autonomy.
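For orientation, the sketch below shows a first-order (Reptile-style) meta-update over a pool of opponent strategies, which captures the inner-adaptation/outer-update structure underlying adversarial meta-learning; the loss function, learning rates, and opponent sampling are placeholders, and this is not a faithful reproduction of our training procedure.

```python
import copy
import random

import torch

def reptile_meta_update(policy, opponent_pool, rollout_loss,
                        inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """Reptile-style meta-update: adapt a copy of the policy against one
    sampled opponent, then move the meta-parameters toward the adapted
    weights. Illustrative only."""
    opponent = random.choice(opponent_pool)
    adapted = copy.deepcopy(policy)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                 # inner loop: adapt to this opponent
        loss = rollout_loss(adapted, opponent)   # e.g., negative return of a rollout
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # outer loop: interpolate meta-weights
        for p_meta, p_adapt in zip(policy.parameters(), adapted.parameters()):
            p_meta += meta_lr * (p_adapt - p_meta)
```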
4.4.4. Robustness Against Realistic Constraints: Performance Under Noise and Latency
The practical utility of an autonomous system is ultimately determined by its ability to perform reliably with imperfect information. We rigorously tested this by subjecting our framework and the MAPPO baseline to the sensor noise and communication degradation conditions outlined in Section 4.1.
The results, presented in Table 6, clearly demonstrate the superior resilience of our framework. Under communication stress, our method consistently outperforms the baseline. For instance, at a 200 ms latency, our win rate degrades by only 9.4%, whereas MAPPO’s performance drops by a more significant 15.6%. A similar trend is observed under packet loss, where our framework maintains a clear advantage.
This enhanced robustness can be directly attributed to the predictive capabilities of our spatio-temporal encoder. The GNN-Transformer architecture is not merely reactive to the immediate state; it learns to infer and predict target states from historical trajectories. This allows it to ‘fill in the gaps’ created by delayed or lost data packets and to ‘see through’ transient sensor noise by relying on a more stable, temporally aware understanding of the situation. This resilience is a critical feature for real-world maritime operations, where perfect, instantaneous information is rarely guaranteed (see Table 7).
4.4.5. Scalability and Failure Case Analysis
Finally, we assessed the framework’s scalability by increasing the number of agents from 4 (2v2) to 20 (10v10). As depicted in Figure 6, our method exhibits more graceful performance degradation as the agent count increases, maintaining a roughly 70% success rate at 20 agents, while the baseline drops below 60%.
Analysis of the rare failure cases provided valuable insights for future work. The primary failure modes were identified as: (1) Decision Deadlock in highly symmetric scenarios, where improved coordination protocols are needed; and (2) Delayed Response to sudden, highly coordinated swarm attacks, highlighting an opportunity to enhance the predictive foresight of the meta-learning mechanism.
4.4.6. System Performance and Efficiency
Beyond task success, the practical deployability of an autonomous navigation system hinges on its computational efficiency and energy consumption. We evaluated these critical engineering metrics, which are summarized in Table 8.