4.1. Scenario Design
To evaluate the effectiveness of the proposed multi-agent reinforcement learning algorithm in cooperative pursuit tasks, we design two simulation scenarios: a 3-vs-1 and a 5-vs-2 UAV engagement. These configurations allow assessment of the algorithm’s scalability, coordination efficiency, and robustness under dynamic conditions.
In each scenario, the pursuer UAVs are equipped with the proposed reinforcement learning-based decentralized decision-making model, and their policies are updated throughout training through interaction with the environment. In contrast, the evader UAVs follow a predefined, threat-aware escape strategy, adjusting their flight direction, speed, and overload in real time according to the distance to the nearest pursuer.
3-vs-1 Scenario: Three pursuer UAVs collaborate to capture a single evading target UAV. Two distinct initial formations are explored:
Fan-shaped deployment: Pursuers are initialized in a fan-like formation ahead of or behind the target UAV. This layout provides a wide coverage area and enables quick convergence toward the target. It is particularly effective when the target’s escape direction is known or constrained.
Triangular encirclement: Pursuers are positioned at the vertices of an equilateral triangle centered on the target. This symmetric layout creates an initial full-surround structure, reducing escape options and enhancing early-stage containment. It is suitable for open areas with uncertain target headings.
Figure 6 illustrates the two initial deployment modes used in the 3-vs-1 experiments.
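The two 3-vs-1 layouts described above can be sketched geometrically as follows. This is a minimal illustration, not the authors' initialization code: the function names, the stand-off `radius`, the fan opening angle `spread_deg`, and the choice to place all pursuers at the target's altitude are assumptions made for the example.

```python
import math

def fan_formation(target, radius, n=3, spread_deg=60.0, heading_deg=0.0, behind=True):
    """Place n pursuers on an arc of spread_deg degrees at distance radius.

    The arc is centered behind (or ahead of) the target relative to its
    heading. All parameters are illustrative assumptions.
    """
    center = heading_deg + (180.0 if behind else 0.0)
    if n == 1:
        angles = [center]
    else:
        step = spread_deg / (n - 1)
        angles = [center - spread_deg / 2 + i * step for i in range(n)]
    tx, ty, tz = target
    return [
        (tx + radius * math.cos(math.radians(a)),
         ty + radius * math.sin(math.radians(a)),
         tz)
        for a in angles
    ]

def triangular_formation(target, radius, phase_deg=90.0):
    """Place 3 pursuers at the vertices of an equilateral triangle.

    The triangle is centered on the target; radius is the distance from
    the target to each vertex (assumed, not taken from the paper).
    """
    tx, ty, tz = target
    return [
        (tx + radius * math.cos(math.radians(phase_deg + 120.0 * k)),
         ty + radius * math.sin(math.radians(phase_deg + 120.0 * k)),
         tz)
        for k in range(3)
    ]
```

In the triangular case the centroid of the three pursuers coincides with the target, which is what gives the symmetric full-surround structure; in the fan case all pursuers lie on one side of the target.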
5-vs-2 Scenario: Five pursuers cooperate to intercept two evading targets. This more complex setting introduces multiple objectives and increased decision coupling, providing a rigorous test of the algorithm’s adaptability and inter-agent coordination.
Target Strategy: Each evader UAV continuously monitors the nearest pursuer and determines its threat level based on the distance d.
Table 1 outlines the dynamic maneuvering behavior selected at each threat level.
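The distance-based threat logic can be sketched as a simple lookup. The structure below follows the description (nearest-pursuer distance d mapped to a threat level, which selects a maneuver), but the three levels, the distance thresholds, and the speed and overload values are hypothetical placeholders, not the entries of Table 1.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Maneuver:
    speed_scale: float   # fraction of the evader's maximum speed
    max_overload: float  # allowed overload, in g

# HYPOTHETICAL thresholds and values, for illustration only.
# Each row: (upper bound on d in metres, threat level, maneuver).
THREAT_TABLE = [
    (500.0, "high", Maneuver(speed_scale=1.0, max_overload=6.0)),
    (1500.0, "medium", Maneuver(speed_scale=0.8, max_overload=4.0)),
    (math.inf, "low", Maneuver(speed_scale=0.6, max_overload=2.0)),
]

def nearest_pursuer_distance(evader, pursuers):
    """Distance d from the evader to its closest pursuer."""
    return min(math.dist(evader, p) for p in pursuers)

def select_maneuver(d):
    """Map the distance d to a (threat level, maneuver) pair."""
    for upper, level, maneuver in THREAT_TABLE:
        if d <= upper:
            return level, maneuver
    raise ValueError("distance must be a non-negative number")
```

At each time step the evader would call `nearest_pursuer_distance` and then `select_maneuver`, so its behavior changes only when d crosses one of the thresholds.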
All experiments were conducted over multiple independent runs to ensure statistical reliability. Performance metrics including success rates and capture times are reported as mean values calculated across these independent trials.
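As a minimal sketch of how such metrics can be aggregated, the function below computes a success rate and a mean capture time from a list of trial outcomes. The record layout and the choice to average capture time over successful trials only are assumptions; the text does not specify how failed trials enter the capture-time average.

```python
from statistics import mean

def summarize_trials(trials):
    """Aggregate (success, capture_time_s) records from independent trials.

    Assumption: capture time is averaged over successful trials only.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    times = [t for ok, t in trials if ok]
    success_rate = len(times) / len(trials)
    mean_capture_time = mean(times) if times else float("nan")
    return success_rate, mean_capture_time
```

For example, four trials with three captures at 30 s, 40 s, and 50 s give a success rate of 0.75 and a mean capture time of 40 s.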
4.2. Experimental Results
To verify the effectiveness and adaptability of the proposed reinforcement learning algorithm in multi-UAV cooperative pursuit tasks, simulations are conducted on both the 3-vs-1 and 5-vs-2 scenarios using different initial configurations. Training performance is evaluated based on accumulated rewards, mission success rates, and trajectory coordination. For comparison, we implement two baselines: a rule-based heuristic strategy [35], in which each pursuer UAV selects the closest evading target and moves towards it at maximum speed, and MADDPG, a mainstream multi-agent reinforcement learning method that employs centralized training with decentralized execution.
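The heuristic baseline amounts to greedy nearest-target pursuit. A minimal sketch of one decision step is given below; the function name, the tuple-based position format, and the absence of any dynamics or overload constraints are simplifying assumptions, not details taken from [35].

```python
import math

def heuristic_velocity(pursuer, evaders, v_max):
    """Velocity command for one pursuer under the greedy rule.

    The pursuer picks the closest evader and flies straight at it
    at maximum speed. Positions are (x, y, z) tuples.
    """
    target = min(evaders, key=lambda e: math.dist(pursuer, e))
    d = math.dist(pursuer, target)
    if d == 0.0:
        return (0.0, 0.0, 0.0)  # already at the target position
    return tuple(v_max * (t - p) / d for p, t in zip(pursuer, target))
```

Because every pursuer decides independently, several pursuers can commit to the same evader and approach it from the same side, which is the kind of uncoordinated behavior the learned policy is meant to avoid.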
(1) 3-vs-1: Fan-Shaped Deployment
In the fan-shaped deployment scenario, the UAV team was trained for 30,000 episodes. The performance metrics recorded during training are shown in Figure 7a,b. A moving average filter (window size = 1000) was applied to smooth both the reward and success rate curves.
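The smoothing step can be sketched as a trailing moving average over per-episode values. The window size of 1000 comes from the text; the trailing alignment and the decision to drop the first `window - 1` points are assumptions, since the edge handling is not specified.

```python
from itertools import accumulate

def moving_average(values, window=1000):
    """Trailing moving average computed from prefix sums.

    Returns one value per full window, so the output has
    len(values) - window + 1 entries (empty if the input is shorter).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    prefix = [0.0, *accumulate(float(v) for v in values)]
    return [
        (prefix[i] - prefix[i - window]) / window
        for i in range(window, len(prefix))
    ]
```

Applied to a sequence of 0/1 episode outcomes, the same function yields the success rate over the most recent `window` episodes, which is how a smoothed success rate curve can be derived from raw episode results.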
Figure 7a presents the average reward per episode throughout training. The reward rises rapidly during the first 5000 episodes, reflecting fast early improvement in the agents’ policies and decision quality. After approximately 15,000 episodes, the curve plateaus and its fluctuations diminish, indicating that the policy has converged and that training has reached a stable state.
As shown in Figure 7b, during the early training phase (within the first 5000 episodes), the success rate remained relatively low. However, it quickly improved and reached approximately 90% around the 800th episode. As training progressed, the success rate stabilized and remained near 95%, indicating that the agents had effectively learned the cooperative pursuit strategy.
The comparative performance of the proposed reinforcement learning method, the rule-based heuristic strategy, and MADDPG in this deployment scenario is summarized in Table 2. The proposed BRNN-RL method achieves a task success rate of 94.8%, outperforming both the heuristic baseline (68.3%) and the MADDPG baseline (80.2%). The proposed algorithm is also more time-efficient, reducing the average capture time to 38.1 s, compared to 50.7 s for the heuristic method and 42.3 s for MADDPG. As shown in Figure 7, BRNN-RL outperforms MADDPG in both the training curves and the test results.
The performance improvement of BRNN-RL stems from its core design. The BRNN communication enables strategic coordination among pursuers, enhancing capture success. Meanwhile, its phased optimization prevents policy oscillations, ensuring smoother pursuit and faster capture times.
The trajectory of the 3-vs-1 fan-shaped deployment is shown in Figure 8, illustrating the coordinated pursuit strategy of the UAV team. Panel (a) presents the 3D view, where the three pursuers (PUAV1, PUAV2, PUAV3) are shown closing in on the evader (EUAV) in three-dimensional space. The trajectories of the UAVs demonstrate a coordinated fan-shaped formation, with the pursuers converging towards the evader from different directions.
Panels (b), (c), and (d) show the projections of the UAVs’ trajectories on the XY, XZ, and YZ planes, respectively. In panel (b), the XY-plane projection highlights the horizontal movement of the UAVs, where the pursuers gradually surround the evader. The XZ-plane projection in panel (c) illustrates the vertical maneuvering, showing how the UAVs adjust their altitude to maintain pursuit. Finally, panel (d) presents the YZ-plane projection, demonstrating the UAVs’ vertical coordination to restrict the evader’s movement in the vertical direction, further tightening the encirclement.
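The plane projections in panels (b) to (d) correspond to dropping one coordinate from each 3D trajectory point. A minimal sketch follows, assuming a trajectory is stored as a list of (x, y, z) tuples; the function name and data layout are illustrative only.

```python
def project_trajectory(trajectory):
    """Project a 3D trajectory onto the XY, XZ, and YZ planes.

    trajectory: list of (x, y, z) points for one UAV.
    Returns a dict mapping plane name to a list of 2D points.
    """
    return {
        "XY": [(x, y) for x, y, _ in trajectory],  # horizontal motion
        "XZ": [(x, z) for x, _, z in trajectory],  # altitude vs. x
        "YZ": [(y, z) for _, y, z in trajectory],  # altitude vs. y
    }
```

Each 2D point list can then be passed to any plotting routine to reproduce one panel per plane.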
(2) 3-vs-1: Triangular Deployment
Figure 9a,b show faster reward convergence with minor fluctuations in success rate. As illustrated in Figure 9a, the reward curve rises rapidly during the initial training phase (approximately the first 5000 episodes), indicating efficient policy learning. The reward stabilizes between 8000 and 9000 episodes. A noticeable fluctuation appears around episode 20,000, but the curve quickly recovers and remains steady, reflecting a temporary instability in the strategy that is corrected through further training.
Figure 9b shows the evolution of the task success rate. The success rate rises sharply during the early episodes and reaches nearly 90% around episode 8000. Although a slight decline is observed after approximately 10,000 episodes, the overall success rate remains high and stable, indicating that the learned strategy is both effective and robust under this deployment scheme.
As reported in Table 3, the BRNN-RL method achieves a 90.3% success rate, compared to 76.8% for the heuristic method and 78.9% for MADDPG. The average capture time is also reduced to 34.6 s, from 45.6 s (heuristic) and 43.3 s (MADDPG). As shown in Figure 9, BRNN-RL outperforms MADDPG in both the training curves and the test results, demonstrating improved efficiency.
The trajectory of the 3-vs-1 triangular deployment is shown in Figure 10, showcasing the movement patterns of the three pursuers (PUAV1, PUAV2, and PUAV3) and the evader (EUAV) in both 3D and 2D projections. In panel (a), the 3D view reveals the coordinated maneuvers of the UAVs as they move through space. This 3D view highlights the complexity of the pursuit, where the UAVs use three-dimensional space to their advantage, adjusting their trajectories to ensure effective collaboration. The evader, in contrast, displays evasive maneuvers, attempting to create gaps and escape the encirclement.
Panels (b), (c), and (d) offer projections of the UAVs’ trajectories onto the XY, XZ, and YZ planes, respectively, providing a clearer view of the UAVs’ actions in different dimensions. In the XY-plane projection (b), the horizontal movement of the UAVs is more apparent, showing how the pursuers close in on the evader from different angles. This projection emphasizes the horizontal coordination and tactical positioning of the UAVs. In the XZ-plane projection (c), the vertical adjustments of the UAVs are depicted, where they modify their altitude to maintain alignment with the evader’s trajectory, showcasing their ability to adapt to the target’s movement in the vertical dimension. Lastly, in the YZ-plane projection (d), the vertical chase becomes more pronounced, with the pursuers aligning their height adjustments to restrict the evader’s movement along the vertical axis. The combined efforts of the UAVs in both horizontal and vertical dimensions create a dynamic and coordinated pursuit strategy, effectively limiting the evader’s movement space and gradually tightening the encirclement.
(3) 5-vs-2: Multi-Target Scenario
In this higher-complexity task, training proceeds for 50,000 episodes. As shown in Figure 11a, the average reward increases rapidly during the initial 5000 episodes, reflecting the agents’ rapid learning and policy improvement. After this phase, the reward curve stabilizes at a high value of around 20,000. This pattern indicates that the agents progressively refine their strategies, achieving near-optimal behavior while maintaining high adaptability in complex interactions.
Figure 11b demonstrates a similar trend in success rate. Initially, the performance exhibits volatility, especially within the first 15,000 episodes. However, after this early phase, the success rate improves steadily and surpasses 90% by around episode 20,000. The curve then converges to near 95% success and maintains this performance with minimal fluctuation, indicating the robustness and effectiveness of the learned cooperative policy.
Table 4 shows that the proposed BRNN-RL algorithm achieves a statistically significantly higher success rate (97.4%) than the heuristic baseline (80.3%) and MADDPG (79.5%). Furthermore, BRNN-RL reduces the average target capture time to 61.2 s, from 75.8 s (heuristic) and 74 s (MADDPG). As shown in Figure 11, BRNN-RL outperforms MADDPG in both the training curves and the test results. These improvements highlight the advantage of BRNN-RL in multi-target, multi-agent environments, where dynamic coordination and flexibility are critical for performance.
The trajectory of the 5-vs-2 scenario is depicted in Figure 12, showcasing the dynamic pursuit and evasion paths of five pursuers (PUAV1 to PUAV5) and two evaders (EUAV1 and EUAV2). In panel (a), the 3D view illustrates the relative movement of the UAVs in three-dimensional space. The pursuers adjust their positions and trajectories to close in on the evaders. The evaders, in turn, attempt to maneuver and evade capture by altering their paths. The 3D trajectories show the overall movement of each UAV, highlighting the complexity of multi-agent coordination in a pursuit–evasion task.
Panels (b), (c), and (d) show the UAVs’ trajectories projected onto the XY, XZ, and YZ planes, respectively. In the XY-plane projection (b), the horizontal pursuit strategy is evident, with the pursuers surrounding the evaders from different directions, creating a converging formation. The XZ-plane projection (c) emphasizes the vertical movement of the UAVs, demonstrating how they adjust their altitude during the pursuit to control the evaders’ vertical escape. Finally, in the YZ-plane projection (d), the coordination between pursuers and evaders is more apparent in the vertical dimension, showing how the UAVs restrict the evaders’ movement and compress their escape routes. The combined trajectories in all projections demonstrate the collaborative effort of the pursuers in gradually closing in on the evaders and successfully limiting their escape space.
In summary, across all evaluated configurations, the proposed multi-agent RL algorithm demonstrates rapid convergence, stable performance, and superior pursuit effectiveness. These results confirm the method’s potential for real-world application in large-scale, cooperative UAV interception tasks.