The framework of the proposed approach, illustrated in Figure 1, incorporates a self-attention mechanism and a self-play module into the intelligent agent on the basis of an improved MADDPG algorithm. From the perspective of strategy optimization, the improved MADDPG algorithm, the self-play mechanism, and the attention mechanism work synergistically to improve the performance of the agent's strategy. The improved MADDPG algorithm adopts centralized training with decentralized execution, enabling the agent to adapt continuously within the environment. The self-play mechanism allows the agent to play against itself, generating adversarial data for further strategy optimization. The self-attention mechanism dynamically weights the input features, guiding the agent to focus on the most decision-relevant information. Together, these components yield more accurate and efficient strategic decisions. Overall, this strategy optimization framework enables the intelligent agent to refine its strategy more effectively during gameplay, thereby improving its success rate.
3.4. Strategy Optimization Model Building
By integrating MADDPG, the self-attention mechanism, and the self-play mechanism within the strategy optimization process, the distinct advantages of each component can be fully leveraged, thereby significantly enhancing the overall decision-making capability of intelligent agents. In the initial phases, MADDPG provides a robust foundational strategy that enables effective learning in simpler environments. However, as the complexity of the environment escalates, relying solely on MADDPG often proves insufficient to meet the demands of more intricate tasks. The incorporation of the self-attention mechanism enables the agent to focus on critical state features during the decision-making process, thereby improving its capacity to process complex information, particularly in dynamically evolving environments. Simultaneously, the self-play mechanism bolsters the adaptability and robustness of the strategies by allowing the agent to engage in training against its own historical strategies, thus mitigating the risk of converging to suboptimal solutions. Through the weighted combination of these mechanisms, the relative importance of each can be dynamically adjusted according to the specific phase of training and the demands of the task. This approach facilitates continuous strategy optimization within a changing environment, ultimately enhancing the long-term performance and adaptability of the intelligent agent.
The linear weighting method is computationally simple and offers high real-time performance, which enables more timely decision-making in dynamic air combat environments. Moreover, the physical meaning of its weighting parameters is relatively clear, which makes debugging more intuitive. Therefore, to take both the original strategy and its optimization framework into account, and to allow the variables within the strategy to adapt dynamically, a novel strategy model is developed. In this model, the optimized strategy is denoted as $\pi_{\text{opt}}$, the original MADDPG strategy as $\pi_{\text{MADDPG}}$, the strategy derived from the self-attention mechanism as $\pi_{\text{att}}$, and the strategy generated through the self-play mechanism as $\pi_{\text{sp}}$. The respective coefficients of these strategies are denoted $\alpha$, $\beta$, and $\gamma$, which represent the weights assigned to each mechanism. Consequently, the optimized strategy can be expressed as:

$$\pi_{\text{opt}}(s_t) = \alpha \, \pi_{\text{MADDPG}}(s_t) + \beta \, \pi_{\text{att}}(s_t) + \gamma \, \pi_{\text{sp}}(s_t)$$
Let $\pi_{\text{MADDPG}}(s_t)$ represent the strategy of the MADDPG agent in state $s_t$; $Q$, $K$, and $V$ denote the query, key, and value vectors, respectively, with $d$ representing the dimension of these vectors. Additionally, $\pi_{\text{sp}}(s_t)$ refers to the strategy employed by the agent within the self-play mechanism, and $r_{\text{sp}}$ indicates the reward received by the agent from its interaction with its own historical strategies.
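For reference, the scaled dot-product attention over $Q$, $K$, and $V$ used by the self-attention branch can be sketched in Python as follows; the tensor shapes and the self-attention usage example are illustrative assumptions rather than details taken from the proposed model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V : tensors of shape (batch, seq_len, d), where d is the
              dimension of the query/key/value vectors.
    """
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d ** 0.5)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                          # attention weights
    return torch.matmul(weights, V)                              # weighted sum of values

# Example: 8 state features, each embedded into a 16-dimensional vector,
# attended over themselves (self-attention).
x = torch.randn(1, 8, 16)
attended = scaled_dot_product_attention(x, x, x)
```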
In the strategy optimization process, the coefficients $\alpha$, $\beta$, and $\gamma$ quantify the relative importance of the MADDPG strategy, the self-attention mechanism, and the self-play mechanism, respectively, in the final optimized strategy. These coefficients are adjusted dynamically to calibrate the contribution of each mechanism according to environmental feedback and the agent's performance. This allows the agent to adaptively modulate the influence of each mechanism, ensuring that the strategy remains optimal in the face of changing environmental conditions.
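Before turning to how the coefficients are adjusted, the weighted combination itself can be illustrated with a minimal sketch; the function name, the tensor shapes, and the renormalization of the weights are assumptions for illustration, not part of the proposed method.

```python
import torch

def blend_policies(pi_maddpg, pi_att, pi_sp, alpha, beta, gamma):
    """Linearly combine the three policy outputs with scalar weights.

    pi_maddpg, pi_att, pi_sp : tensors of identical shape, e.g. (batch, action_dim),
        produced by the MADDPG actor, the attention-augmented policy,
        and the self-play policy.
    alpha, beta, gamma       : non-negative weights for the three mechanisms.
    """
    # Renormalize the weights so the blended action keeps the same scale
    # (an added safeguard, not stated in the text).
    total = alpha + beta + gamma
    alpha, beta, gamma = alpha / total, beta / total, gamma / total
    return alpha * pi_maddpg + beta * pi_att + gamma * pi_sp

# Example usage with dummy action tensors (batch of 4, 2-D actions).
a1, a2, a3 = torch.randn(4, 2), torch.randn(4, 2), torch.randn(4, 2)
blended = blend_policies(a1, a2, a3, alpha=0.5, beta=0.3, gamma=0.2)
```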
When the reward associated with a given mechanism is high, it indicates that the mechanism has made a substantial contribution to the agent’s strategy, necessitating an increase in the corresponding coefficient. Conversely, if the reward is low, the coefficient for that mechanism should be decreased. The adjustment process is formalized by the following formula:
Let $r_{\text{MADDPG},t}$, $r_{\text{att},t}$, and $r_{\text{sp},t}$ denote the rewards associated with the MADDPG, self-attention, and self-play mechanisms at time step $t$, respectively. The terms $\tilde{r}_{\text{MADDPG},t}$, $\tilde{r}_{\text{att},t}$, and $\tilde{r}_{\text{sp},t}$ represent the normalized values of these rewards, while $\eta_{\alpha}$, $\eta_{\beta}$, and $\eta_{\gamma}$ control the adjustment rates for each mechanism.
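As a hedged illustration of this reward-driven adjustment, the sketch below uses an additive update on min-max-normalized rewards; the additive form, the normalization choice, and all identifiers are assumptions for illustration, not the exact formula of the paper.

```python
def minmax_normalize(values):
    """Min-max normalize a dict of per-mechanism values to [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return {k: (v - lo) / span for k, v in values.items()}

def reward_feedback_update(coeffs, rewards, eta):
    """Increase each coefficient in proportion to its normalized reward.

    coeffs  : current weights, e.g. {"maddpg": alpha, "att": beta, "sp": gamma}
    rewards : raw rewards r_MADDPG, r_att, r_sp at the current time step
    eta     : per-mechanism adjustment rates
    """
    r_norm = minmax_normalize(rewards)
    return {k: coeffs[k] + eta[k] * r_norm[k] for k in coeffs}

coeffs = {"maddpg": 0.4, "att": 0.3, "sp": 0.3}
rewards = {"maddpg": 1.2, "att": 0.5, "sp": 0.9}   # rewards credited to each mechanism
eta = {"maddpg": 0.05, "att": 0.05, "sp": 0.05}
coeffs = reward_feedback_update(coeffs, rewards, eta)
```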
When the loss associated with a particular mechanism is substantial, it indicates that the mechanism has not effectively contributed to the enhancement of the strategy, warranting a reduction in its corresponding coefficient. The adjustment formula for loss feedback is as follows:
Let $L_{\text{MADDPG},t}$, $L_{\text{att},t}$, and $L_{\text{sp},t}$ represent the loss values associated with the MADDPG, self-attention, and self-play mechanisms at time step $t$, respectively. The hyperparameters $\lambda_{\alpha}$, $\lambda_{\beta}$, and $\lambda_{\gamma}$ regulate the loss feedback for each mechanism.
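A corresponding sketch for the loss feedback, again under assumed forms (a subtractive update on min-max-normalized losses, clamped at zero), might look like this:

```python
def loss_feedback_update(coeffs, losses, lam):
    """Shrink each coefficient in proportion to its normalized loss.

    coeffs : current weights, e.g. {"maddpg": alpha, "att": beta, "sp": gamma}
    losses : raw losses L_MADDPG, L_att, L_sp at the current time step
    lam    : hyperparameters regulating the loss feedback
    """
    lo, hi = min(losses.values()), max(losses.values())
    span = (hi - lo) or 1.0
    l_norm = {k: (v - lo) / span for k, v in losses.items()}
    # A larger normalized loss yields a larger reduction; clamp at zero.
    return {k: max(coeffs[k] - lam[k] * l_norm[k], 0.0) for k in coeffs}

coeffs = {"maddpg": 0.45, "att": 0.31, "sp": 0.33}
losses = {"maddpg": 0.8, "att": 1.6, "sp": 0.4}
lam = {"maddpg": 0.05, "att": 0.05, "sp": 0.05}
coeffs = loss_feedback_update(coeffs, losses, lam)
```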
The success rate of a strategy serves as an indicator of an agent’s ability to complete a given task. When a mechanism contributes to an increased success rate, it implies that the mechanism has a greater impact on the strategy, thereby justifying an increase in its corresponding coefficient. The adjustment formula for success rate feedback is as follows:
In the above equation, $S_{\text{MADDPG},t}$, $S_{\text{att},t}$, and $S_{\text{sp},t}$ represent the task success rates associated with the MADDPG, self-attention, and self-play mechanisms at time step $t$, respectively. The terms $\tilde{S}_{\text{MADDPG},t}$, $\tilde{S}_{\text{att},t}$, and $\tilde{S}_{\text{sp},t}$ denote the normalized values of these success rates, while $\mu_{\alpha}$, $\mu_{\beta}$, and $\mu_{\gamma}$ are hyperparameters that govern the adjustment based on the success rates.
To effectively integrate the feedback from rewards, losses, and success rates, these factors can be combined for the purpose of coefficient adjustment as follows:
In this manner, the adjustment of the coefficients incorporates the integrated feedback from rewards, losses, and success rates, enabling a more holistic optimization of all components of the strategy. This approach ensures that the intelligent agent continuously adapts the contribution of each mechanism, thereby enhancing its ability to effectively respond to environmental changes throughout the training process.
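As a closing illustration, the combined adjustment could be realized as sketched below; the particular way the three feedback signals are mixed and the final renormalization so that the coefficients sum to one are assumptions of this sketch, not the paper's exact formulation.

```python
def combined_update(coeffs, rewards, losses, successes, eta=0.05, lam=0.05, mu=0.05):
    """Adjust the coefficients using reward, loss, and success-rate feedback,
    then renormalize so the three coefficients sum to one.

    All four dicts are keyed by mechanism name, e.g. "maddpg", "att", "sp".
    """
    def minmax(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in d.items()}

    r, l, s = minmax(rewards), minmax(losses), minmax(successes)
    new = {}
    for k in coeffs:
        # Rewards and success rates push the coefficient up, losses push it down.
        new[k] = max(coeffs[k] + eta * r[k] - lam * l[k] + mu * s[k], 1e-6)
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}

coeffs = {"maddpg": 0.4, "att": 0.3, "sp": 0.3}
coeffs = combined_update(
    coeffs,
    rewards={"maddpg": 1.1, "att": 0.7, "sp": 0.9},
    losses={"maddpg": 0.5, "att": 1.2, "sp": 0.8},
    successes={"maddpg": 0.62, "att": 0.55, "sp": 0.60},
)
```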