5.3. Results Analysis
To demonstrate the scalability and performance of the algorithm in complex scenarios, comparison experiments were conducted with 5, 10, and 15 obstacles, respectively. The H-PPO algorithm was trained for 500 rounds with an initial network learning rate of 0.001. The task area contained five threat areas: one dynamic threat area and four static threat areas. The change curve of the reward function is shown in Figure 8. To verify the effectiveness of the learning rate setting, comparison experiments with learning rates of 0.1, 0.01, and 0.0001 were also conducted.
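As a rough illustration of how such a learning-rate comparison can be organized, the following Python sketch sweeps the candidate values used in Figure 8. The `train_hppo` function and its arguments are hypothetical stand-ins for the training routine, not the authors' implementation.

```python
# Illustrative learning-rate sweep for the H-PPO training setup described above.
# `train_hppo` is a hypothetical stand-in for the training routine; it is assumed
# to return the per-round reward history of one full training run.

CANDIDATE_LRS = [0.1, 0.01, 0.001, 0.0001]   # learning rates compared in Figure 8
NUM_ROUNDS = 500                              # training rounds per run
NUM_THREATS = 5                               # 1 dynamic + 4 static threat areas

def sweep_learning_rates(train_hppo):
    """Train one H-PPO model per candidate learning rate and collect reward curves."""
    reward_curves = {}
    for lr in CANDIDATE_LRS:
        rewards = train_hppo(learning_rate=lr,
                             rounds=NUM_ROUNDS,
                             num_threat_areas=NUM_THREATS)
        reward_curves[lr] = rewards
    return reward_curves
```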
As shown in Figure 8, with the learning rate set to 0.001, the reward value rises the fastest, stabilizes the earliest, and reaches the highest final value after 500 rounds of training compared with the other settings.
To further verify the performance of the H-PPO algorithm, a comparison experiment was carried out in the same experimental environment with the other experimental parameters unchanged; the resulting reward curves are shown in Figure 9.
As shown in Figure 9, the abscissa represents the training round and the ordinate represents the reward value. Over multiple rounds of learning, the H-PPO reward curve rose rapidly between rounds 50 and 330 and gradually converged after reaching 137 at around round 360, with no large fluctuations and a faster convergence rate. The reward of the PPO algorithm began to rise rapidly after 130 rounds of training, but its growth was smaller than that of H-PPO; the increase slowed around round 390 and gradually converged to about 129. Compared with PPO, the H-PPO algorithm converges faster and achieves a higher reward, which shows that combining hierarchical reinforcement learning with PPO accelerates network convergence and improves network performance. The reward of the SAC algorithm began to rise rapidly in round 80; its rise was initially faster than that of PPO but slowed after around 230 rounds, and its final reward was lower than that of PPO. The reward of the DDPG algorithm began to rise rapidly around round 120, but its growth rate was lower than that of PPO, and after 500 rounds its reward was about 110, significantly lower than those of the H-PPO, PPO, and SAC algorithms. Compared with the other networks, the DDPG reward curve fluctuated more, and the model performance was less stable. This comparison shows that the H-PPO algorithm constructed in this paper effectively improves convergence speed and model performance and obtains a higher reward value.
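Reward curves such as those in Figure 9 are typically smoothed before comparison. The sketch below is a minimal example of applying a moving average to per-round rewards and plotting the algorithms on common axes; the data layout is an assumption made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(rewards, window=20):
    """Smooth a per-round reward sequence with a simple moving average."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

def plot_reward_comparison(curves, window=20):
    """Plot smoothed reward curves for several algorithms on one set of axes.

    `curves` maps an algorithm name (e.g., "H-PPO", "PPO", "SAC", "DDPG")
    to its per-round reward list.
    """
    for name, rewards in curves.items():
        plt.plot(moving_average(rewards, window), label=name)
    plt.xlabel("Training round")
    plt.ylabel("Reward")
    plt.legend()
    plt.show()
```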
To further verify model robustness in complex task scenarios, we tested maneuver decision performance against multiple threat targets while keeping the other experimental conditions unchanged. The number of threat targets was 10, including 3 dynamic and 7 static threat targets, still randomly distributed in the task area; the model reward curves are shown in Figure 10.
Because the increased number of threat targets deepens the complexity of the task scene, we set the number of training rounds to 800.
Figure 10 shows that the reward of the H-PPO algorithm rose rapidly after about the 10th training round, and its rising trend was significantly steeper than those of the PPO, SAC, and DDPG algorithms. The model converged well at around round 580, reaching a reward of 120, significantly higher than those of the other models. The DDPG curve showed much larger fluctuations and was still oscillating at the end of the 800 training rounds, so its performance was not stable.
We continued to increase the number of threat targets to 15, including 7 dynamic threat targets and 8 static threat targets. The graph of the resulting model reward function is shown in
Figure 11.
With 1200 rounds of training, the H-PPO algorithm can still be clearly distinguished from the other algorithms. Its reward began to rise significantly after round 170, earlier than the PPO, SAC, and DDPG algorithms, and grew faster, with the network converging at around round 950 to a reward of about 120. In contrast, the PPO, SAC, and DDPG algorithms did not converge well; in particular, the SAC and DDPG curves fluctuated greatly, and these models did not cope well with changes in the complex environment.
As the number of threat targets and the proportion of dynamic threat targets increase, the H-PPO reward converges better and faster than those of the comparison algorithms and reaches a higher final value, proving that H-PPO can better realize intelligent maneuver decision-making in a complex dynamic environment and that the network's decision performance and robustness are stronger.
The average maneuver success rate of 100 tests was calculated, and the results are shown in
Figure 12. The abscissa indicates the number of threat targets, and the ordinate indicates the success rate.
Comparing across obstacle counts, when the total number of obstacles is five, the H-PPO algorithm achieves a success rate of 80%, which is 22% higher than the PPO algorithm, 29% higher than the SAC algorithm, and 38% higher than the DDPG algorithm. As the proportion of dynamic threat targets grows, H-PPO still maintains a high success rate: with 15 threat targets in total, of which 7 (47%) are dynamic, it still achieves a 58% success rate, while the PPO algorithm achieves 40%, the SAC algorithm 34%, and the DDPG algorithm only 24%. These results prove that the trained agent achieves a higher success rate in unknown, dynamic, complex scenes, that the model is relatively robust, and that its adaptability to complex environments is strong.
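The success-rate statistics reported in Figure 12 follow from repeated test episodes. A minimal sketch of such an evaluation loop is given below; `env`, `policy`, and the `info["success"]` flag are assumed, gym-style stand-ins rather than the paper's actual interface.

```python
def evaluate_success_rate(env, policy, num_episodes=100):
    """Run repeated test episodes and return the fraction that reach the target.

    `env` and `policy` are hypothetical stand-ins for the task environment and
    the trained agent; an episode is counted as successful when the environment
    reports that the UAV reached the target without entering a threat area.
    """
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = policy.act(obs)
            obs, reward, done, info = env.step(action)
        if info.get("success", False):   # assumed flag set by the environment
            successes += 1
    return successes / num_episodes
```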
To verify the effectiveness and practical application value of the algorithm for maneuvering penetration, its performance over 100 tests was measured when the number of threat targets was 5, 10, and 15, respectively, and the results are shown in
Table 3,
Table 4 and
Table 5.
It can be seen from Table 3 that when the number of threat targets in the task scenario is five, the H-PPO algorithm achieves the highest success rate, 80%, significantly higher than the PPO (58%), SAC (51%), and DDPG (42%) algorithms. Its average maneuvering range is 96 km, which is 48 km shorter than the PPO algorithm, 65 km shorter than the SAC algorithm, and 96 km shorter than the DDPG algorithm. Its average flight time is 128 s, which is 64 s faster than the PPO algorithm, 88 s faster than the SAC algorithm, and 128 s faster than the DDPG algorithm. The average calculation time of the model is only 1.64 s; this faster computation makes it suitable for real-time mission scenarios.
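The quantities reported in Tables 3 through 5 (success rate, average maneuvering range, average flight time, and average model calculation time) can be aggregated from per-episode test logs. The sketch below assumes a simple per-episode record; the field names are illustrative, not the paper's data structure.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EpisodeRecord:
    """Assumed per-episode log entry; fields mirror the quantities in Tables 3-5."""
    success: bool
    range_km: float        # maneuvering range flown in this episode (km)
    flight_time_s: float   # flight time for this episode (s)
    calc_time_s: float     # total model inference time for this episode (s)

def summarize(records):
    """Aggregate 100-episode test logs into the table statistics."""
    return {
        "success_rate": sum(r.success for r in records) / len(records),
        "avg_range_km": mean(r.range_km for r in records),
        "avg_flight_time_s": mean(r.flight_time_s for r in records),
        "avg_calc_time_s": mean(r.calc_time_s for r in records),
    }
```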
The number of threat targets was then increased to 10 to verify the decision-making performance of the algorithm in more complex environments; the results of 100 experiments are presented in Table 4.
The number of threat targets was then increased to 15, and the decision-making performance of the algorithm in complex environments was verified again; the results of 100 experiments are presented in Table 5.
As can be seen from Table 3, Table 4, and Table 5, as the number of threat targets increases and the environmental complexity deepens, the success rate of the H-PPO algorithm decreases, but with 15 threat targets it still reaches 58%, and its average maneuver range, flight time, and model calculation time remain significantly lower than those of the PPO, SAC, and DDPG algorithms. This proves that the H-PPO algorithm adapts better to complex environments, has better model robustness, and can better realize intelligent autonomous decision-making tasks in complex environments.
The visual effect generated by the network test is shown in Figure 13. In the experiment, the static threat areas are represented by gray areas and the dynamic threat areas by orange areas, with their positions and sizes generated randomly. The red dot represents the location of the UAV, the blue dot the location of the designated target, and the red curve the maneuvering route of the UAV. The three panels of Figure 13 show the UAV maneuvering schematics when the number of static areas is 4, 7, and 8 and the corresponding number of dynamic areas is 1, 3, and 7, respectively. The meanings of the differently colored areas in this figure are the same as those in Figure 7.
As can be seen from Figure 13, when the agent's maneuvers are controlled by H-PPO, it can choose the optimal path and maneuver to the designated target location over the shortest distance while effectively avoiding the threat areas, showing a good ability to deal with multiple threat areas.
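A rendering in the style of Figure 13 can be produced with standard plotting tools. The sketch below follows the same color convention (gray static areas, orange dynamic areas, red UAV route, blue target); the input data formats are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

def plot_scenario(static_areas, dynamic_areas, trajectory, target):
    """Render one test scenario with the color convention used in Figure 13.

    `static_areas` and `dynamic_areas` are lists of (x, y, radius) tuples,
    `trajectory` is a list of (x, y) UAV positions, and `target` is an (x, y)
    point; all of these are assumed formats for illustration.
    """
    fig, ax = plt.subplots()
    for x, y, r in static_areas:
        ax.add_patch(plt.Circle((x, y), r, color="gray"))     # static threat area
    for x, y, r in dynamic_areas:
        ax.add_patch(plt.Circle((x, y), r, color="orange"))   # dynamic threat area
    xs, ys = zip(*trajectory)
    ax.plot(xs, ys, color="red")                               # UAV maneuvering route
    ax.plot(xs[0], ys[0], "ro")                                # UAV position (start of route)
    ax.plot(*target, "bo")                                     # designated target location
    ax.set_aspect("equal")
    plt.show()
```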
The number of static threat areas is set to 7 and the number of dynamic threat areas to 3; the dynamic threat areas move at 2 m/s, and their positions are generated randomly. The visualization results of the test are shown in Figure 14. The meanings of the differently colored areas in this figure are the same as those in Figure 7.
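The paper does not specify the motion model of the dynamic threat areas, so the sketch below simply assumes each area moves at a constant 2 m/s along a fixed, randomly chosen heading; it is meant only to illustrate how such a moving circular threat region might be updated and queried.

```python
import math
import random

class DynamicThreatArea:
    """Illustrative moving circular threat area (motion model assumed, not from the paper)."""

    def __init__(self, x, y, radius, speed=2.0):
        self.x, self.y, self.radius = x, y, radius
        self.speed = speed                                   # movement speed in m/s
        self.heading = random.uniform(0.0, 2.0 * math.pi)    # assumed fixed random heading

    def step(self, dt=1.0):
        """Advance the area's center by speed * dt along its heading."""
        self.x += self.speed * dt * math.cos(self.heading)
        self.y += self.speed * dt * math.sin(self.heading)

    def contains(self, px, py):
        """Check whether a point (e.g., the UAV position) lies inside the threat area."""
        return math.hypot(px - self.x, py - self.y) <= self.radius
```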
As can be seen from Figure 14, after the trial starts, the agent maneuvers continuously toward the designated target area and chooses the best path from the starting point according to the movement trend of the threat targets. At 40 s, it avoids the first, closer threat target, makes an effective avoidance decision for the threat area on its route, and replans its maneuver direction to find the optimal path forward, showing good generalization and adaptability to complex scenarios. At 96 s, the agent chooses the closer of two possible maneuver directions and passes through, indicating that the trained agent can solve more complex maneuver decision problems. At 126 s, the agent safely and efficiently reaches the specified target point. The whole experiment verifies the feasibility of the algorithm for autonomous path planning in complex dynamic regions and demonstrates good robustness and generalization.
In this experiment, the UAV's starting point and target point are randomly generated, and the number and radii of the threat targets in the mission area vary. In this complex environment, the agent can safely maneuver to the goal, effectively avoid threat areas, and choose the optimal path without obvious risky behavior, showing that the model has good versatility.
Maneuvering speed is one of the important performance indexes of a UAV: it directly affects mission-completion efficiency and response speed. In a complex dynamic environment, the UAV must react quickly and adjust its maneuvering strategy in time when facing multi-target threats, and a faster flight speed requires a more efficient maneuver decision-making network to adjust the strategy in time. To further verify the influence of speed on the maneuvering success rate, different maximum speeds were set while the experimental environment and other parameters were kept unchanged, and the success rate of maneuvering to the specified target position was measured over 100 tests for each setting. The experimental results are shown in Figure 15.
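The speed experiment amounts to repeating the 100-test success-rate evaluation at several maximum-speed settings. A sketch of such a sweep is shown below; `make_env`, its `max_speed` parameter, and the `evaluate` callback are hypothetical stand-ins for the environment construction and evaluation routine.

```python
def sweep_max_speed(make_env, policy, evaluate, max_speeds, num_episodes=100):
    """Measure maneuvering success rate at several maximum-speed settings.

    `make_env(max_speed=...)` is a hypothetical factory that builds the task
    environment with the given speed limit, and `evaluate` is an evaluation
    routine such as the success-rate loop sketched earlier in this section.
    """
    results = {}
    for v_max in max_speeds:
        env = make_env(max_speed=v_max)
        results[v_max] = evaluate(env, policy, num_episodes)
    return results
```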
As can be seen from Figure 15, as the maximum maneuvering speed increases, the success rate of the UAV in performing maneuvering tasks decreases somewhat, but not by much. A higher speed increases the possibility of collision with a threat target; however, the H-PPO network handles the maneuvering decision problem hierarchically, which effectively reduces the complexity of each stage of the maneuvering problem, and the low-level strategy can adjust the maneuver more accurately under the guidance of the high-level strategy. In this way, the UAV can quickly make accurate obstacle-avoidance actions even at a higher maximum flight speed and thus maintain its success rate.