Fuzzy Reinforcement Learning and Curriculum Transfer Learning for Micromanagement in Multi-Robot Confrontation

Multi-Robot Confrontation on physics-based simulators is a complex and time-consuming task, but simulators are required to evaluate the performance of advanced algorithms. Recently, a few advanced algorithms have been able to handle considerably complex levels in the robot confrontation system when the agents face multiple opponents. Meanwhile, the current confrontation decision-making system suffers from difficulties in optimization and generalization. In this paper, fuzzy reinforcement learning (RL) and curriculum transfer learning are applied to micromanagement for the robot confrontation system. Firstly, an improved Q-learning in the semi-Markov decision-making process is designed to train the agent, and an efficient RL model is defined to avoid the curse of dimensionality. Secondly, a multi-agent RL algorithm with parameter sharing is proposed to train the agents. We use a neural network with adaptive momentum acceleration as a function approximator to estimate the state-action function. Then, a fuzzy logic method is used to regulate the learning rate of RL. Thirdly, a curriculum transfer learning method is used to extend the RL model to more difficult scenarios, which ensures the generalization of the decision-making system. The experimental results show that the proposed method is effective.


The Robot Confrontation System
The aim of Artificial Intelligence (AI) is to develop a computer program that can realize human-level intelligence, self-consciousness, and knowledge application. Multi-Agent Systems (MAS) have recently become popular as an important means for the study of confrontational decision-making, strategic behavior in electricity markets [1], and so on. As an effective tool for AI research, the simulation platform allows the agent to rely on a predefined algorithm to perform various kinds of actions in a certain scenario, which plays a role in replacing the real physical environment [2]. These simulations can not only substitute for the tangible physical environment but can also be set to scenarios that cannot exist in real life, depending on our imagination [3]. Recently, computer games have been used for AI research, which helps an agent grow from scratch, including the Atari video games [4], imperfect-information games [5], and so on. The Multi-Robot Confrontation system [6], as a platform to imitate real battlefields, provides convenience for military command, situation assessment, and intelligent decision-making, and is also an effective platform for developing AI applications.

Paper Structure
The remainder of the paper is organized as follows: Section 2 is devoted to the background for the proposed technique, such as reinforcement learning and the softmax function. Section 3 presents a learning model for a single agent; this model uses an improved Q-learning with the fuzzy method. A neural network with adaptive momentum and the proposed multi-agent RL algorithm with parameter sharing for Multi-Robot Confrontation are introduced in Section 4. Section 5 introduces curriculum transfer learning to address the generalization issue; this method transfers the prior learning experience to different scenarios without starting from scratch. Experiments are detailed in Section 6 to illustrate the performance of the proposed learning model. Conclusions are drawn in the last section.

Reinforcement Learning
The goal of reinforcement learning is to obtain an optimal strategy π in which the agent selects action A under state S, which is given by π(S) = A. The architecture for reinforcement learning is shown in Figure 1. s_t is the current state of the agent and s_{t+1} is the next state. a_t is the current action that the agent takes. r_t is the current reward for the transition s_t → s_{t+1}. γ is the discount factor. V(s) is the state value function for the state s. T^a_{ss'} is the transition probability from the state s to the next state s'. R^a_{ss'} is the reward obtained from the environment for the transition from the state s to the next state s'. α is the learning rate and its value range is (0, 1).
The agent which is in state s_t selects action a_t until the final state is reached, and the cumulative reward obtained in this process is shown in Equation (1):
R(S_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = r_t + γ R(S_{t+1}) = Σ_{i=0}^{∞} γ^i r_{t+i}. (1)
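As a small worked example of Equation (1), the Python sketch below computes a discounted return both directly and through the recursion R(S_t) = r_t + γR(S_{t+1}); the reward sequence and discount factor are illustrative values, not taken from the paper.

```python
# Minimal sketch of Equation (1): discounted return computed directly and recursively.
# The reward sequence and discount factor below are illustrative assumptions.
gamma = 0.9
rewards = [1.0, 0.0, 0.5, 2.0]          # r_t, r_{t+1}, r_{t+2}, r_{t+3}

# Direct form: R(S_t) = sum_i gamma^i * r_{t+i}
direct = sum(gamma ** i * r for i, r in enumerate(rewards))

# Recursive form: R(S_t) = r_t + gamma * R(S_{t+1})
def discounted_return(rs):
    if not rs:
        return 0.0
    return rs[0] + gamma * discounted_return(rs[1:])

assert abs(direct - discounted_return(rewards)) < 1e-12
print(direct)   # ≈ 2.863
```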

For a reinforcement learning algorithm that uses the future P-step average rewards, the mathematical expression for the value function V^π(s_t) is shown in Equation (2).
When an agent takes actions under a strategy π, the value function represents the expectation of the cumulative reward obtained by the agent.
The value function V^π(s) of an agent in state s is given by Equation (3). Q-learning is a model-free, off-policy reinforcement learning algorithm. The Q value Q(s, a) is estimated via the temporal difference (TD) method [21], as in Equation (4):
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], (4)
where the learning rate α reflects the efficiency of an RL algorithm.

Softmax Function Based on Simulated Annealing
In order to control the randomness of action selection, a simulated annealing (SA) algorithm [23,24] is used to optimize the softmax function. The softmax function is a method for balancing exploration and exploitation [25] in the RL method; it chooses the action according to the average reward of each action, and the probability of action a_t being chosen is higher if the average reward produced by this action is higher than that produced by the other actions.
The probability distribution of the action in the softmax algorithm is based on the Boltzmann distribution. The probability P_i of action a_i being selected is given by
P_i = e^{r̄(a_i)/T_t} / Σ_{k=1}^{K} e^{r̄(a_k)/T_t}, (5)
where P_i represents the probability of choosing action a_i, r̄(a_i) is the average reward of action a_i, and the total number of actions is K. The action selection policy based on the Boltzmann distribution is used to ensure the randomness of the action selection, and the simulated annealing algorithm is added. In this method, the probability of the action a_i being selected is given by Equation (6), where T_t is the temperature parameter. The smaller the temperature parameter T_t is, the higher the probability that the action with the high average reward is chosen; T_t ranges between T_min and T_max. The temperature value tuned by the simulated annealing is given by Equation (7), where η is the annealing factor and its value range is 0 ≤ η ≤ 1.
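To make the Boltzmann action selection and the annealing of the temperature parameter concrete, the following minimal Python sketch draws an action from the softmax distribution over the actions' average rewards and then decays the temperature. The average rewards are invented for the example, and the exponential decay bounded below by T_min is an assumption consistent with the annealing factor η and the T_max and T_min parameters listed later in Table 1.

```python
import numpy as np

def softmax_action_probs(avg_rewards, temperature):
    """Boltzmann (softmax) distribution over actions given their average rewards."""
    prefs = np.asarray(avg_rewards, dtype=float) / temperature
    prefs -= prefs.max()                      # shift for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def anneal_temperature(t_current, eta, t_min):
    """Exponential annealing of the temperature, bounded below by T_min (assumed form)."""
    return max(eta * t_current, t_min)

rng = np.random.default_rng(0)
avg_rewards = [0.2, 0.5, 0.1, 0.4]            # hypothetical average rewards of 4 actions
temperature, eta, t_min = 0.1, 0.9, 0.01      # T_max, annealing factor, T_min from Table 1
probs = softmax_action_probs(avg_rewards, temperature)
action = rng.choice(len(avg_rewards), p=probs)
temperature = anneal_temperature(temperature, eta, t_min)
```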

An Improved Q-Learning Method in Semi-Markov Decision Processes
For Markov dynamic systems, stochastic control problems are modeled as semi-Markov decision processes (SMDPs) [26,27]. The time cost for the RL system to transit from one state to the next state is defined as the sojourn time. The robot confrontation process is regarded as an SMDP, and the agent may take a series of the same actions before transiting to the next state. Compared with dynamic modeling methods, the RL method can purposely diminish the uncertainty resulting from modeling.
For our agent to defeat the opponents with a higher probability, the output of the agent is required to be more stable in the robot confrontation system for micromanagement. The classical Q-learning method solves the dilemma of exploration and exploitation using the ε-greedy algorithm, which gives the agent a certain probability of exploring new actions [28]. However, when the ε-greedy algorithm explores, each action is selected with the same probability, so the actions that can produce better rewards are not easier to choose. To tackle this problem of the greedy algorithm for micromanagement in the robot confrontation system, we design an improved Q-learning with the softmax function, which makes our agent explore more actions in the early stage of the learning process and exploit previous experience in the later stage. For the learning process of the SMDP, in each epoch, the next state is reached from the current state using the same action after T learning cycles. r_{t+i} (i = 0, 1, 2, ..., T − 1) is the real-time reward. Equations (8)–(10) give the updating method of the state-action function.
where T_t is the temperature parameter. The detailed steps of this algorithm (the SSAQ algorithm) are shown in Algorithm 1. The SSAQ algorithm is a way to solve the dilemma of exploration and exploitation, and this method can output more stable actions for micromanagement scenarios.

Algorithm 1. The SSAQ algorithm.
Initialize s_t, a_t, s_{t+1}, a_{t+1}, r_t; initialize each value of the Q matrix arbitrarily;
Repeat (for each episode):
    Choose an initial state s_0; t ← 0; i ← 0;
    Repeat (for each step of the episode):
        Select a_t according to P(a_t | s_t) = max_{k=1,...,K} P(a_k | s_t) (softmax with simulated annealing);
        Observe s_{t+1} after T learning cycles of the same action a_t;
        Obtain the rewards r_t, r_{t+1}, ..., r_{t+T−1};
        Select a_{t+1} according to P(a_{t+1} | s_{t+1}) = max_{k=1,...,K} P(a_k | s_{t+1});
        Q(s_t, a_t) ← f(Q(s_{t+1}, a_{t+1}), r_t, ..., r_{t+T−1}, α, γ);
        s_t ← s_{t+1}; t ← t + 1;
    until s is terminal;
Until the Q matrix converges.
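Since Equations (8)–(10) are not reproduced above, the following Python sketch shows one plausible form of the tabular update in Algorithm 1: a standard T-step discounted backup over the sojourn rewards, which matches the signature f(Q(s_{t+1}, a_{t+1}), r_t, ..., r_{t+T−1}, α, γ) but should be read as an assumption rather than the paper's exact update.

```python
def ssaq_update(Q, s, a, rewards, s_next, a_next, alpha, gamma):
    """One SMDP-style update after repeating action a for T learning cycles.

    Q       : dict mapping (state, action) -> value (the Q matrix of Algorithm 1)
    rewards : [r_t, ..., r_{t+T-1}] collected while the action was repeated
    The exact form of Equations (8)-(10) is not reproduced in the text; a standard
    T-step discounted backup is used here as an assumption.
    """
    T = len(rewards)
    G = sum(gamma ** i * r for i, r in enumerate(rewards))   # discounted reward over the sojourn
    q_sa = Q.get((s, a), 0.0)
    target = G + gamma ** T * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
    return Q

# usage with the parameters of Table 1 (alpha = 0.3, gamma = 0.9)
Q = ssaq_update({}, s=0, a=1, rewards=[0.0, 0.0, 1.0], s_next=3, a_next=0, alpha=0.3, gamma=0.9)
```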

A Reinforcement Learning Method using a Fuzzy System
In order to reduce the learning cost, a learning method using a dynamic learning rate is a good solution [29]. For the reward function in the RL system, when the reward is positive after using the RL method, a high learning rate strengthens the influence of the positive feedback and ensures faster learning; on the contrary, a low learning rate guides the RL method to faster convergence. Therefore, there is a relationship between the obtained reward and the value of the learning rate. Meanwhile, the reward is a fuzzy concept; for instance, no fixed value divides a "large positive number" from a "large negative number". So, a fuzzy system is used to develop a dynamic learning rate to improve the performance of this learning system. The reward r is taken as the input of the fuzzy system and the learning rate α_F is taken as the output. We set the fuzzy description of r as "little negative number, large negative number, zero, large positive number, little positive number", abbreviated as "TN, LN, ZO, LP, TP". We set the fuzzy description of α_F as "little small, very small, medium, very large, little large", abbreviated as "LS, VS, M, VL, LL". Figure 2 gives the corresponding fuzzy membership functions. Previous work has shown that the performance of the fuzzy system is affected by the shape of the membership functions [30]. In the fuzzy system, the triangular membership function and the trapezoidal membership function are simple and effective, so we choose these two membership functions.
Figure 2 shows the curves of these membership functions. The degree of truth for the input shows a general mathematical symmetry. We take "LP" as an example, and its membership function is shown in Equation (11).
The number of discrete points for the input domain is represented by N_r. We select independent discrete points from the input domain, which are represented by I_r = {I_r^j | j = 1, ..., N_r}. These discrete points have their own degrees of truth corresponding to the five fuzzy input descriptions, and Equation (12) gives the input degree-of-truth discrete matrix. The number of discrete points for the output domain is M_{α_F}, and we select independent discrete points from the output domain; Equation (13) shows the corresponding output degree-of-truth discrete matrix. Then, we design the fuzzy rules as follows: if the reward is "LP", then the learning rate is "VL"; if the reward is "TP", then the learning rate is "LL"; if the reward is "ZO", then the learning rate is "M"; if the reward is "TN", then the learning rate is "LS"; if the reward is "LN", then the learning rate is "VS". The fuzzy inference is represented by Equation (14).
where "∨" represents the operation of choosing the maximum value and "∧" represents the operation of choosing the minimum value. The fuzzification operation transforms the measured reward r_0 into a fuzzy input vector FI_{r_0} = [FI_{r_0}^j]_{1×N_r}. The degrees of truth for r_0 corresponding to the five inputs are μ_i (i = 1, ..., 5). Equation (15) uses the weighted-average method to calculate FI_{r_0}. We use the "min-max compose" operation to calculate the fuzzy output vector using Equations (14) and (15).
The de-fuzzifying operation uses a weighted average method to transform the fuzzy output into an output value and the weighted average method is shown in Equation (17).
where the learning rate α_{FS} is the final result of the fuzzy system relative to r_0.
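The following Python sketch illustrates the overall fuzzy pipeline described above: triangular memberships over the reward, the five rules, and a weighted-average defuzzification of the fired rules. The breakpoints of the membership functions and the representative learning-rate values of the output sets are illustrative assumptions, and the weighted average is a simplification of the min-max composition of Equations (14)–(17).

```python
def trimf(x, a, b, c):
    """Triangular membership function with vertices a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Fuzzy sets over the reward r; the breakpoints are illustrative assumptions.
reward_sets = {
    "LN": lambda r: trimf(r, -2.0, -1.0, -0.3),   # large negative
    "TN": lambda r: trimf(r, -0.6, -0.3,  0.0),   # little negative
    "ZO": lambda r: trimf(r, -0.2,  0.0,  0.2),   # zero
    "TP": lambda r: trimf(r,  0.0,  0.3,  0.6),   # little positive
    "LP": lambda r: trimf(r,  0.3,  1.0,  2.0),   # large positive
}
# Rule base from the paper: LP->VL, TP->LL, ZO->M, TN->LS, LN->VS.
rules = {"LP": "VL", "TP": "LL", "ZO": "M", "TN": "LS", "LN": "VS"}
# Representative learning-rate values for the output sets (assumed centers).
alpha_centers = {"VS": 0.05, "LS": 0.15, "M": 0.35, "LL": 0.60, "VL": 0.90}

def fuzzy_learning_rate(r):
    """Fire the rules for reward r and defuzzify with a weighted average of the
    output-set centers (a simplification of the min-max composition)."""
    activations = {out: reward_sets[inp](r) for inp, out in rules.items()}
    total = sum(activations.values())
    if total == 0.0:
        return alpha_centers["M"]          # outside all input sets: fall back to "medium"
    return sum(alpha_centers[o] * w for o, w in activations.items()) / total

print(fuzzy_learning_rate(0.5))   # a moderately positive reward yields a larger learning rate
```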

A Proposed Learning Model for Multi-Robot Confrontation
The classical Q-learning algorithm chooses actions by the look-up table method. However, with the increase in action space and state space, the look-up table method is no longer suitable, which results in low learning efficiency. In order to solve this problem, an RL method in which a neural network approximates the value function of Q-learning has been proposed [31]. In random task scenarios, the state variable s_t is used as the input of the neural network and the Q value is used as the output, which is the neural network's estimate of the Q value based on previous experience. This Q value is Q_current, and the action taken by the agent is the action corresponding to the maximum Q_current. After the agent takes the action, the environment gives the agent a reward, and the state of the agent is transferred to s_{t+1}. Similarly, the state s_{t+1} is input into the neural network, and the Q value corresponding to the state s_{t+1} is obtained, which is Q_next. Finally, Q_next and Q_current are used to update the Q value using (4). The gradient descent method is used to update the weights of the neural network. The loss function L_t of the neural network is shown in Equation (18).
where Q_current* is the Q value after the state s_t is updated by (4), and Q_next is the Q value before the state s_t is updated.
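Equation (18) is not reproduced above, so the following sketch uses a standard squared TD-error loss in which the bootstrapped target plays the role of Q_current* (the Q value after applying update rule (4)); the exact loss used in the paper may differ.

```python
import numpy as np

def td_loss(q_current, q_next, action, reward, gamma):
    """Squared TD-error loss for the network's Q estimate (a standard form, assumed).

    q_current : Q values predicted by the network for state s_t (one entry per action)
    q_next    : Q values predicted by the network for state s_{t+1}
    """
    target = reward + gamma * np.max(q_next)     # bootstrapped target built from Q_next
    td_error = target - q_current[action]
    return 0.5 * td_error ** 2, td_error

loss, delta = td_loss(np.array([0.2, 0.5]), np.array([0.1, 0.4]),
                      action=1, reward=1.0, gamma=0.9)
```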

Neural Network Model with Adaptive Momentum
Since an efficient action selection strategy should be considered for micromanagement scenarios, and our agent's experience usually covers only a limited subset of the large state space, it is difficult to apply conventional reinforcement learning to learn an optimal policy. To address this problem, a BP neural network is used to approximate the state-action values and improve the generalization of our RL model. An acceleration algorithm using adaptive momentum is used in the BP neural network to ensure the efficiency of action selection.
For any neuron j in a layer other than the input layer and the output layer, the input signal is represented by net_j and the output signal by y_j; y_i represents the output signal of neuron i in the lower layer. Equation (19) gives the computing method for this output.
where the constants are a = 1.725 and b = 0.566, and the threshold for neuron k is represented by θ_k. In the output layer, y_k is the label output signal for the k-th neuron and net_k is its input signal. y_k and net_k are computed by Equation (20), if y_j is the output signal of the j-th neuron in the hidden layer next to the output layer.
In the t-th iteration for updating the weights, an input value is x_p(t). O_pk(t) is the label output signal for the k-th neuron, and y_pk(t) is the actual output signal. Equation (21) gives the mean square error.
The weights of this neural network are updated by the back-propagation method. The mean square error of this neural network is given by Equation (22) if there are M_n input values.
The weights ω_kj(t) are updated along the gradient direction of E_p(t) to minimize the square error. The correction of ω_kj(t) is Δ_p ω_kj(t), which is given by Equation (23), where the learning rate is λ and the momentum constant is β, which is 0.95.
If the learning rate of the neural network is too large, the weight correction Δ_p ω_kj(t) will be too large, so the stability of the learning process will be affected. Therefore, we use an adaptive learning rate. α(p) represents the adaptive learning rate, and the mean square error is E_p(t). The adaptive learning rate satisfies Equation (24).
If the mean square error increases, the learning rate increases and the convergence of the neural network accelerates; otherwise, the neural network tends to be stable.
The updating of the weights is given by Equation (27). Finally, Δ_p ω_ji = λ y_pi δ_pj is the correction of the weights for the hidden layer.
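As a sketch of the momentum-accelerated update with an adaptive learning rate described above, the following Python class applies the correction Δω = −λ·∂E/∂ω + β·Δω_prev with β = 0.95 and adjusts λ according to the change of the mean square error; the multiplicative adaptation factors are assumptions, since Equation (24) is not reproduced here.

```python
import numpy as np

class AdaptiveMomentumUpdater:
    """Sketch of the weight correction with momentum and an adaptive learning rate.

    The momentum constant beta = 0.95 follows the text; the adaptation factors
    (1.05 and 0.7) are illustrative assumptions.
    """
    def __init__(self, lr=0.01, beta=0.95):
        self.lr = lr
        self.beta = beta
        self.prev_delta = None
        self.prev_error = None

    def step(self, weights, grad, mse):
        # Adapt the learning rate according to the change of the mean square error,
        # following the qualitative rule stated in the text.
        if self.prev_error is not None:
            self.lr *= 1.05 if mse > self.prev_error else 0.7
        self.prev_error = mse
        delta = -self.lr * grad                            # gradient-descent correction
        if self.prev_delta is not None:
            delta = delta + self.beta * self.prev_delta    # momentum acceleration
        self.prev_delta = delta
        return weights + delta

updater = AdaptiveMomentumUpdater()
w = np.zeros(3)
w = updater.step(w, grad=np.array([0.2, -0.1, 0.05]), mse=0.8)
```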

Multi-Agent RL Algorithm Based on Decision-Making Neural Network with Parameter Sharing
In this paper, an accelerated BP neural network with adaptive momentum is used as an approximator of the state-action value function. In this study, we extend the SSAQ algorithm to multiple agents by sharing the parameters of the neural network. Agents behave differently because each one receives different states and actions from the environment; therefore, it is feasible for multiple agents to use the same neural network via parameter sharing. The input of the neural network is the agent state set, and the output is the state-action value function of the corresponding state. In order to ensure continuous actions of the agent, a softmax layer is added after the output layer of the neural network, and the softmax layer uses the softmax function to select the action for the agent. Meanwhile, the simulated annealing algorithm is added to the softmax function to adjust the temperature parameter. This kind of neural network is called a decision-making neural network (DMNN) in this paper. The neural network is used as the approximator for the SSAQ algorithm mentioned above, and the learning model for our agents is shown in Figure 3.
As shown in Figure 4, the decision-making neural network includes an input layer, multiple hidden layers, an output layer, and a softmax layer. The state s_t of the agent is input to the input layer of the decision-making neural network, and the Q value corresponding to the state s_t is output by the output layer, which is recorded as Q(s_t, a; θ), where θ is the weight vector of the decision-making neural network. The Q values of all actions are entered into the softmax layer with a non-linear transformation function, which is shown in Equation (30), and the action a_t is selected and output using the Max operation, as shown in Equation (31).
The agent takes action a_t and the environment gives an instant reward R(t). RL based on the state-action function stores the state-action function in the form of a table, but this method cannot solve the large-scale continuous state-space problem, so the powerful generalization ability of the neural network is used to approximate the value function Q(s_t, a_t; θ) for RL, which solves the problem of a high-dimensional continuous state space. The proposed method has good generalization and can be used in other combat games, such as StarCraft II [18].
To update the weights of the neural network efficiently, the TD error of the RL method is used as the loss function of the decision-making neural network. The TD error is shown in Equation (32). The back-propagation algorithm is used to update the weights of the neural network. The multi-agent SSAQ algorithm with network parameter sharing is given by Algorithm 2.
In Algorithm 2, each agent repeats the following inner-loop steps with the shared network:
Observe s_{t+1} after T learning cycles of the same action a_t;
Obtain the rewards r_t, r_{t+1}, ..., r_{t+T−1};
Select a_{t+1} according to P(a_{t+1} | s_{t+1}) = max_{k=1,...,K} P(a_k | s_{t+1});
Update the TD error and the network weights.
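The following Python sketch illustrates the parameter-sharing pattern of Algorithm 2: several agents query and update a single set of network weights with their own states and actions. A tiny linear Q head stands in for the DMNN purely for illustration; the class name and its methods are hypothetical.

```python
import numpy as np

class SharedDMNN:
    """Stand-in for the decision-making neural network with one shared parameter set.

    A single linear Q head is used here purely for illustration; the paper's DMNN is a
    multilayer network, but the parameter-sharing pattern is the same: every agent
    queries and updates the same weights theta with its own state and action.
    """
    def __init__(self, state_dim, n_actions, rng):
        self.theta = 0.01 * rng.standard_normal((state_dim, n_actions))

    def q_values(self, state):
        return state @ self.theta

    def softmax_action(self, state, temperature, rng):
        prefs = self.q_values(state) / temperature
        prefs -= prefs.max()
        probs = np.exp(prefs)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    def td_update(self, state, action, td_error, lr):
        # gradient of the linear Q head with respect to theta for the chosen action
        self.theta[:, action] += lr * td_error * state

rng = np.random.default_rng(0)
net = SharedDMNN(state_dim=6, n_actions=4, rng=rng)             # one shared network ...
states = [rng.standard_normal(6) for _ in range(4)]             # ... four agents, four states
actions = [net.softmax_action(s, temperature=0.1, rng=rng) for s in states]
```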

Curriculum Transfer Learning for Different Micromanagement Scenarios
It will cost a lot of time if the agent starts learning from scratch in a new environment. Many researchers focus on improving the learning performance by exploiting domain knowledge shared between related tasks. Transfer learning exploits the prior learning experience from the source task in the target task to accelerate learning. Therefore, we use the transfer learning method to take the well-trained model of the source task as the prior experience to build the learning model for the target task.
In the curriculum transfer learning for micromanagement scenarios in the confrontation decision-making system, the mapping ρ: π*_{last time} → π*_{current time} represents the process that transfers the learning policy of the source task to the learning policy of the target task. In this paper, the state space and the action space remain unchanged, as shown in Equation (33).
ρ_A: A_{last time} → A_{current time}, ρ_S: S_{last time} → S_{current time}. (33)
Much interference information exists in the learning process. Therefore, we set up a decay function using Newton's law of cooling. The decay function enables agents to exploit the domain knowledge with a decreasing probability, and a steady state is achieved eventually. The threshold is ε. Equation (34) shows the mathematical relationship between the threshold ε and the time t. The agent uses the domain knowledge from the source task if a random number is smaller than the threshold ε; otherwise, the agent uses the conventional maximum-Q-value strategy to select an action.
where the decay coefficient is p and the initial time is t_0. The probability of using the prior experience from the source task gradually decreases until a stable value is achieved. If the target task is too difficult compared with the source task, an intermediate task is usually set up in curriculum transfer learning, and the agent can gain more experience from the learning model of the intermediate task. The curriculum transfer learning with an intermediate task and a decay function for micromanagement scenarios is shown in Figure 5.
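Equation (34) is not reproduced above, so the following sketch assumes an exponential decay in the spirit of Newton's law of cooling, with the threshold falling from an initial value toward a stable value; the constants and the functional form are assumptions, and the comparison direction follows the description that the prior experience is exploited with a decreasing probability.

```python
import math, random

def transfer_threshold(t, t0=0.0, eps0=0.9, eps_stable=0.1, p=0.01):
    """Decay threshold in the spirit of Newton's law of cooling (assumed form).

    The threshold decays from eps0 toward eps_stable as time t grows past t0;
    the decay coefficient p and the other constants are illustrative assumptions.
    """
    return eps_stable + (eps0 - eps_stable) * math.exp(-p * (t - t0))

t = 120
if random.random() < transfer_threshold(t):
    use_source_task_policy = True    # reuse the knowledge transferred from the source task
else:
    use_source_task_policy = False   # fall back to the maximum-Q-value action selection
```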
The integral framework of the proposed learning model for micromanagement is shown in Figure 6. The proposed method has three parts: the decision-making neural network, the loss function that uses the TD error for the neural network, and a fuzzy method. The state s_t of the agent is input into the neural network, and the neural network outputs the action a_t; the same holds for the next state s_{t+1}. The reward R(t) is obtained. The TD error is calculated as the loss function, and a fuzzy method is used to adjust the learning rate of the RL method.

Experiment and Analysis
Generally, in robotic systems that are based on reinforcement learning, higher learning rates enable robots to utilize the previous learning experience. A larger discount rate makes learning agents think more about long-term returns in the future. For exploration and exploitation, the ε-greedy algorithm with a larger threshold allows the learning agent to utilize prior experience more. In this work, the initial value for the threshold of the decay function is the same as that of the classical ε-greedy algorithm. In these experiments, the values for the learning rate, the discount factor, the exploring rate, the annealing factor, the maximum temperature parameter, and the minimum temperature parameter are given manually and empirically.


RL Model for a Confrontation Decision-Making System
Robocode [32,33] is an open-source platform, where the goal is to develop a robot to battle against other robots. The platform is shown in Figure 7. In this paper, the experiments are conducted in this platform and we consider several scenarios with the different enemies to test the generalization of the proposed method.

Effective state-action space definition of the RL model is still an open problem with no universal solution. An RL model for robot confrontation with inputs from the game engine is constructed, which keeps the size of the state-action space unchanged to avoid the curse of dimensionality.
State space: The combination of the relative direction angle, the absolute orientation angle, and the distance between the robots forms the state space. The absolute direction angle and the relative direction angle are each discretized into four kinds of states, and the range of the absolute direction angle is 0°–360°. We divide the distance between the robots into 20 discrete parts.
Action space: Movement and rotation are the two kinds of motion for the robot in this platform. At each time step, each robot can move in arbitrary directions over arbitrary distances on the ground. Similar to other types of combat games, our robot can choose to attack its opponents with bullets of different energies. Four kinds of movements, forward, backward, clockwise rotation, and anticlockwise rotation, form the action space.
Reward function: If a robot's bullets hit the enemy, or the robot is hit by bullets, the health point of this robot will change. We propose a reward function to help the agent balance the losses of our agents and the opponents. In this reward function, E^m_t represents the health point of our agent at time t, E^m_{t−1} represents the health point of our agent at time t−1, E^e_t represents the health point of the opponent at time t, and E^e_{t−1} represents the health point of the opponent at time t−1. If we lose fewer health points than the opponent loses, we will get a positive reward. The ratio of absolute health point changes between the two sides will encourage our agent to hurt the opponent more in battle. According to the experimental results, the RL model is effective at controlling our agent in the micromanagement scenarios.
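Since the reward equation itself is not reproduced above, the following Python sketch encodes the description in a simple form built on the ratio of absolute health-point changes; the exact functional form used in the paper may differ.

```python
def confrontation_reward(E_m_t, E_m_prev, E_e_t, E_e_prev, eps=1e-6):
    """Sketch of a reward consistent with the description in the text (assumed form).

    It is built from the ratio of absolute health-point changes and is positive
    when our agent loses fewer health points than the opponent does.
    """
    loss_ours = abs(E_m_t - E_m_prev)     # our health points lost between t-1 and t
    loss_enemy = abs(E_e_t - E_e_prev)    # the opponent's health points lost between t-1 and t
    if loss_ours == 0.0 and loss_enemy == 0.0:
        return 0.0                        # nothing changed in this step
    return loss_enemy / (loss_ours + eps) - 1.0

# example: we lose 5 HP while the opponent loses 12 HP -> positive reward
r = confrontation_reward(E_m_t=95, E_m_prev=100, E_e_t=88, E_e_prev=100)
```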


Proposed RL Algorithm Test
In order to verify the effectiveness of the proposed SSAQ algorithm, a comparative experiment of robot migration is designed. The map for robot migration satisfies a binary tree structure. As shown in Figure 8, there are N layers on the map, and the number of endpoints is 2^N − 1.
The robot can get the reward corresponding to each endpoint by starting from the beginning and repeating branch selection down to the bottom endpoint. A total of 2^N − 1 actions can be selected by the robot. The corresponding Q values of the endpoints are expressed as Q_1 ∼ Q_{2^N−1}. When the robot reaches the endpoint 2^N − 1, it gets a reward of +1000, and it does not get a reward at the rest of the endpoints. The Q values can be obtained through the experiments. This paper uses the SSAQ method to compare with the ε-greedy policy (Greedy-policy) and the ε-greedy policy using a decay threshold (Dgreedy-policy) [34]. The settings of parameters for this experiment are listed in Table 1.

Table 1. The settings of parameters for the robot migration experiment.
Parameter                                Value
Learning rate α                          0.3
Discount rate γ                          0.9
Exploring rate ε                         0.9
Annealing factor η                       0.9
Maximum temperature parameter T_max      0.1
Minimum temperature parameter T_min      0.01
Layers of the map N                      10

Through observation during the experiments, it was found that Q_3, Q_7 and Q_15 change in the experiment, and we record these Q values. Therefore, the changes in the three Q values Q_3, Q_7 and Q_15 are used in the experiment to compare the three different strategies. As shown in Figure 9, the convergence time of the proposed method (SSAQ algorithm) is 214, 168 and 122 for Q_3, Q_7 and Q_15, respectively. The convergence time of the proposed method is the shortest and its convergence rate is the fastest compared with the other two methods. In addition, the Q-value curve of the proposed strategy is relatively smooth, which also shows that the proposed strategy is more stable and can reach the target state quickly while ensuring the stability of learning performance.


Effect Test for Multi-Agent RL Based on DMNN and Fuzzy Method
The convergence and stability of the neural network are important factors affecting the performance of micromanagement for an agent. Hence, this paper proposes a decision-making neural network with adaptive momentum. In order to verify the effectiveness of the proposed multi-agent RL algorithm based on the DMNN, a team of agents trained by the neural network with adaptive momentum (NN with AWM) and another team of tank agents trained by the neural network without adaptive momentum (NN without AWM) are used to fight the team of tank agents trained by the TD method separately. Each team has four tank agents. The settings of parameters for this experiment are listed in Table 2. We compare "NN with AWM" and "NN without AWM" from three different perspectives: total score ratio, defensive score, and fluctuation value of the score ratio. The number of rounds is 500, and the score is recorded every 10 rounds; every 10 rounds are called a session. Figures 10–12 show the resulting curves of this experiment, in which the horizontal axis represents the number of sessions and the vertical axis represents the score ratio, the fluctuation value, or the defensive score. This paper calculates the first-order difference of the total score ratio, that is, Score ratio_{t+1} − Score ratio_t, and obtains the fluctuation value curve of the total score ratio, as shown in Figure 11. From the experimental results, it can be seen that the score ratios of the tank agents trained by the NN with AWM method are obviously higher than those of the NN without AWM method, and the fluctuation value is relatively small. Figure 12 shows the curve of the defensive score. Similar to Figure 10, the defensive score of the NN with AWM method is more stable and higher than that of the NN without AWM method. The experimental results show that the proposed multi-agent RL algorithm can not only make the tank agent get higher scores in combat, but also get more stable scores. Hence, the proposed method has outstanding performance for micromanagement.
Figure 10. The comparison curve of the total score ratio.
The number of discrete points is 7. According to the fuzzy method used in this paper, the results of the fuzzy system are shown in Figure 13. As shown in Figure 13, the fuzzy method converts the linear input into a smooth output, which is appropriate for micromanagement in the Multi-Robot Confrontation system. Then, the experimental results of the fuzzy-inspired learning model (FLM), the learning model (LM) with the learning rate being 0.2 (LM-0.2) [12], and the learning model with the learning rate being 0.95 (LM-0.95) are tested in the Robocode platform, as shown in Figure 14. The three methods FLM, LM-0.2, and LM-0.95 were respectively fought against the TD method. The solid lines represent the scores of the three methods, while the dotted lines represent the scores of the TD method fighting against the three methods. From Figure 14, in FLM, the score ratio remains at around 0.8; in LM-0.2 and LM-0.95, the score ratio remains at around 0.7. According to the above results, the fuzzy method can get a higher score in combat than the methods with a fixed learning rate for micromanagement.

Effect Test for Curriculum Transfer Learning (TF)
Six tank agents trained by the TD method are formed into a team, which is called "TD_team", and the multiple robots in the formation will not attack each other. Then, the learning model proposed in this paper is used to fight the TD_team. Compared with the above experiments, our tank agents will face six opponents at the same time, as shown in Figure 15. In Figure 15, the agents in the red box are the enemy team, while the agents in the yellow box are our team.


In order to prove the efficiency and practicality of the learning model with curriculum transfer learning (LM with TF), it is compared with the learning model without curriculum transfer learning (LM without TF) for 500 rounds.
The tank agent using curriculum transfer learning will use the prior experience gained in the above experiments. From Figure 16, transfer learning accelerates the learning speed of tank agents in the new micromanagement scenario and achieves higher scores than the method without transfer learning. From Figure 17, it is obvious that the defensive score of the LM with TF method is higher, which means that the agent can defend against the opponent's attack well while attacking opponents. Ten tank agents trained by the TD method were then formed, and the LM with TF method and the LM without TF method were used to fight against them respectively. The task of fighting the TD_team is regarded as the intermediate task because the new task is more difficult than the original one. From Figure 18, the score ratio of the LM with TF method is still higher than that of the LM without TF method, which means the learning model with curriculum transfer learning has strong and stable performance for micromanagement scenarios in the confrontation decision-making system.


Conclusions
In this paper, confrontation decision-making for micromanagement scenarios in SMDPs is studied. This paper makes several contributions, including an improved Q-learning in SMDPs (the SSAQ algorithm) for confrontation decision-making, a fuzzy method for adjusting the learning rate of the RL method, a multi-agent RL algorithm with parameter sharing, an accelerated neural network for the representation of the state (DMNN), and a curriculum transfer learning method with a decay function. The designed RL model keeps the size of the state-action space unchanged to avoid the curse of dimensionality, and the reward function helps the agent balance the losses of our agents and the opponents. The decision-making neural network uses adaptive momentum and is used as an approximator to estimate the state-action function. The accelerated algorithm using adaptive momentum allows the decision-making neural network to react quickly in micromanagement scenarios. Finally, the curriculum transfer learning method with the decay function extends our model to other different scenarios. The proposed transfer learning method can still achieve good experimental results in more complex confrontation scenarios. The results of the experiments demonstrate that the proposed methods can achieve excellent and stable control for micromanagement in the robot confrontation system. In the future, we will extend this model to the game scenario of StarCraft II. In addition, behavior decomposition methods [35-38] will be studied to achieve more advanced multi-agent confrontation strategies.