Learning to Utilize Curiosity: A New Approach of Automatic Curriculum Learning for Deep RL

In recent years, reinforcement learning algorithms based on automatic curriculum learning have been increasingly applied to multi-agent system problems. However, in the sparse reward environment, the reinforcement learning agents get almost no feedback from the environment during the whole training process, which leads to a decrease in the convergence speed and learning efficiency of the curriculum reinforcement learning algorithm. Based on the automatic curriculum learning algorithm, this paper proposes a curriculum reinforcement learning method based on the curiosity model (CMCL). The method divides the curriculum sorting criteria into temporal-difference error and curiosity reward, uses the K-fold cross validation method to evaluate the difficulty priority of task samples, uses the Intrinsic Curiosity Module (ICM) to evaluate the curiosity priority of the task samples, and uses the curriculum factor to adjust the learning probability of the task samples. This study compares the CMCL algorithm with other baseline algorithms in cooperative-competitive environments, and the experimental simulation results show that the CMCL method can improve the training performance and robustness of multi-agent deep reinforcement learning algorithms.


Introduction
Deep reinforcement learning [1] combines the perception ability of deep learning with the decision-making ability of reinforcement learning, and has been widely used in complex decision-making tasks [2], such as Atari games [3], complex robot action control [4,5], and AlphaGo [6]. In 2015, Hinton, Bengio and LeCun, famous experts in the field of machine learning, published a review paper on deep learning in Nature, which considered deep reinforcement learning as an important development direction of deep learning [7].
However, there is a significant problem in the application of deep reinforcement learning algorithms in multi-agent systems [8]. With the increase in the number of agents and the increase in the complexity of the environment, the coordination and cooperation between agents becomes more difficult, which can easily cause a situation where the Reinforcement Learning (RL) algorithm does not converge or even cannot be trained [9,10].
Curriculum learning [11], as a hot field of current artificial intelligence research, was proposed by Bengio et al. at the International Conference on Machine Learning (ICML) in 2009. Bengio et al. pointed out that the curriculum learning method can be regarded as a special kind of continuous optimization method, which can start with smoother (i.e., simpler) optimization problems and gradually add rougher (i.e., more difficult) non-convex optimization problems, and finally optimize the target task. In curriculum reinforcement learning algorithms [12], tasks of different difficulty levels are set manually, and more difficult tasks are gradually added to the simple reinforcement learning tasks, so that the knowledge of the source tasks can be reused in the process of learning difficult tasks, thereby accelerating the convergence of the model to the optimal policy.
The above-mentioned predefined curriculum learning methods require the tasks to be generated, assessed and sorted manually in advance, so the quality of the generated curriculums is directly affected by the experience of experts, and the methods lack task versatility. The current curriculum learning field therefore gradually adopts automatic curriculum learning (ACL) instead of predefined curriculum learning to train reinforcement learning agents.
Before the agent learns the whole task, the difficulty of the experience samples in the experience replay buffer is evaluated and sorted, and the experience samples are learned in order from easy to difficult, so automatic curriculum learning [13] can realize the learning of difficult tasks, shorten the training time, and improve the training performance of task learning.
Traditional automatic curriculum learning often uses the temporal-difference error to evaluate and sort the difficulty of task samples, that is, it obtains the optimal policy by maximizing the external reward value that appears in the process of interacting with the environment. In a sparse reward environment, however, it is difficult for the agent to obtain reward feedback from the environment over long sequences of time steps. The lack of reward signals affects the iteration and update of the agent's action policy, so it is hard for the agent to learn an effective policy.
To solve the above problems, this paper proposes a curriculum learning method based on the curiosity module (CMCL). The method adds a curiosity intrinsic reward to the curriculum sorting criteria: the curiosity reward value of each experience sample is evaluated to obtain its curiosity priority, and the curriculum sequence of the experience samples is sorted by this priority together with the temporal-difference error. The pace at which the curriculum difficulty is selected is adjusted by a curriculum difficulty factor, so as to enhance the exploration and training performance of the curriculum reinforcement learning algorithm. The experimental results of two tasks in the multi-agent particle environment show that the CMCL method proposed in this paper can greatly improve the performance on multi-agent tasks in sparse reward environments compared with the three baseline algorithms.
The contributions of this paper are as follows: (1) This paper proposes a curriculum reinforcement learning method based on the curiosity module. By adding curiosity priority to the curriculum sorting criteria, it can enhance the exploration ability and robustness of reinforcement learning agents and prevent agents from turning in place; (2) This paper introduces a curriculum difficulty factor into the curriculum selection process and dynamically adjusts the difficulty of the currently selected curriculum through this factor, so as to realize automatic curriculum learning that prioritizes experience from easy to difficult.
The rest of this paper is organized as follows. Section 2 introduces related work, Section 3 introduces the MADDPG algorithm and the theory of automatic curriculum learning, Section 4 introduces the CMCL algorithm in detail, Section 5 presents experimental results and analyzes them, Section 6 presents discussion and Section 7 draws some conclusions.

Related Work
How to reasonably arrange the sequence of curriculums and select curriculums in the process of curriculum learning is the main research problem of current automatic curriculum reinforcement learning research. Carlos Florensa et al. [14] used generative networks to propose tasks that the agent needs to implement to automatically generate curriculums capable of learning many types of tasks without requiring prior knowledge. Ren et al. [15] proposed an automatic curriculum reinforcement learning method that uses a priority curriculum sorting method to extract experience samples from the experience replay buffer to achieve automatic curriculum learning. Jiayu Chen et al. [16] used the perspective of variational inference to automatically generate training curriculums for the task environment and the number of agents from two aspects of task expansion and agent expansion, which can be used to solve cooperative multi-agent reinforcement learning problems in difficult environments.
Curiosity-driven agent exploration is an important approach in reward function design for reinforcement learning. In supervised learning, curiosity is used to alleviate the problem of imbalanced representation and distributional bias among data [17,18]. Pathak et al. [19] used curiosity as an intrinsic reward value for agents, which can encourage the agent to explore new environmental states. Our method is derived from the curiosity mechanism of the human brain [20]. Curiosity is used as a reference standard for automatic curriculum learning's curriculum sorting, which can complement the priority experience replay algorithm (PER). The selection probability of novel samples is increased in the samples to balance the exploration of the uncertain state in the process of environmental exploration of multi-agent system.
The most important works related to our method include the self-adaptive priority correction algorithm proposed by Hongjie Zhang et al. [21], the High-Value Prioritized Experience Replay proposed by Xi Cao et al. [22], and the Curriculum Guided Hindsight Experience Replay proposed by Meng Fang et al. [4]. Hongjie Zhang et al. predicted the sum of the real temporal-difference error of all samples in the experience replay, and corrected it by an importance weight. Xi Cao et al. designed a priority experience replay method based on the combination of temporal-difference error and value for the sparse reward environment. Meng Fang et al. applied the curiosity mechanism to the Hindsight Experience Replay algorithm (HER), and learned successful experience from failure through the HER mechanism. Our method provides a further improvement on the basis of the above methods. As one of the curriculum sorting standards in the priority experience replay algorithm, the curiosity mechanism can compensate for the lack of exploration and randomness of the agent in the sparse reward environment, thereby improving the training performance and robustness of the algorithm.

Basic Concepts
This section introduces, in order, some important concepts of Deep Reinforcement Learning, the Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG), and Automatic Curriculum Learning (ACL).

Deep Reinforcement Learning
Reinforcement learning [23] consists of two parts: agents and environment. To maximize their total reward value, the agents observe the state of the environment and take actions from an action set; the environment accepts the actions and gives the agents a reward. This process can be modeled as a Markov decision quintuple $(S, A, R, P, \gamma)$, where $S$ represents the state space, $A$ represents the action space, $R$ represents the reward function, $P$ represents the state transition function, and $\gamma$ represents the discount factor. The schematic diagram of reinforcement learning is shown in Figure 1.
Deep reinforcement learning approximates the policy function and the value function through a multi-layer neural network, thereby solving the high-dimensional mapping problem caused by continuous high-dimensional state-action pairs [24]. The goal of agents is to maximize the expected reward $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ by continuously optimizing the policy $\pi_\theta$, then the optimal policy is
$$\pi^* = \arg\max_{\pi_\theta} \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^{t} r_t\right]$$
where $r_t$ represents the reward of agents at time $t$.
Deep reinforcement learning algorithms can be divided into the following three categories [25]: deep reinforcement learning based on value function, deep reinforcement learning based on policy gradient, and deep reinforcement learning based on the actor-critic (AC) framework. The DRL algorithm based on the AC framework uses the error of the value function to guide the policy update and improve the performance of the algorithm training. The policy $\pi_\theta$ is updated by the policy gradient $\nabla_\theta J(\pi_\theta)$ of the expected reward. The formula is as follows:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
where $\pi_\theta(a \mid s)$ represents the actor function and $R(\tau)$ represents the critic function.
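As a small illustration of the quantity the agents maximize, the sketch below computes a discounted return for one trajectory. It assumes that $R(\tau)$ is the discounted sum of per-step rewards $r_t$ with discount factor $\gamma$; the function name `discounted_return` is hypothetical and not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] (assumed form of R(tau))."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three-step trajectory with rewards 1, 0, 2 and gamma = 0.5:
# 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

In a sparse reward environment most entries of `rewards` are zero, so this sum carries little signal, which is the difficulty the later sections address.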

MADDPG Algorithm
The Multi-Agent Deep Deterministic Policy Gradient algorithm [26] (MADDPG) is an improved multi-agent reinforcement learning algorithm based on the AC network framework, which can be considered as an extended application of the DDPG algorithm in a multi-agent environment. To solve the problem of non-stationarity in the multi-agent training process [27], MADDPG pioneered the principle of centralized training and distributed execution (CTDE): in the training stage, the agents are allowed to obtain global information during learning, while only local information is used in decision execution. In the AC training framework, the actor network is used for policy exploration, and the critic network acts as an evaluator that evaluates the policy, so as to obtain the current optimal policy. The algorithm structure consists of the actor network, critic network, target actor network and target critic network. The training framework of the MADDPG algorithm is shown in Figure 2.
The MADDPG algorithm stores experience tuples through the experience replay mechanism: during the training process, experience tuples are stored in batches in the experience replay buffer, and small batches of experience are extracted from the buffer in stages and input into the neural network for model training. This experience replay mechanism can reduce the degree of association between experience tuples, thus improving the neural network training efficiency.
The MADDPG algorithm updates the actor network of agents using the stochastic gradient descent method. The formula is as follows:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D}\left[\nabla_{\theta_i} \mu_i(o_i; \theta_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N)\Big|_{a_i = \mu_i(o_i; \theta_i)}\right]$$
In the formula, $o_i$ and $a_i$ represent the observation value and action of the $i$th agent respectively; $\mu_i(o_i; \theta_i)$ represents the action of agent $i$ obtained by inputting the observation value $o_i$ into the actor network; $x$ denotes the global information available to the critic during centralized training.
The critic network of agents is iteratively updated as follows to minimize the loss function:
$$L(\theta_i) = \mathbb{E}_{x, a, r, x'}\left[\left(Q_i^{\mu}(x, a_1, \ldots, a_N) - y\right)^2\right], \quad y = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \ldots, a_N')\Big|_{a_j' = \mu_j'(o_j')}$$
In the formula, the function $y$ represents the cumulative average reward of agent $i$ in the target actor network.
The network parameters of the target actor network and target critic network are replicated and updated in stages:
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\, \theta_i'$$
In the formula, $\tau$ represents the control parameter of the network parameter updating frequency, which can stabilize the parameter network update process. $\theta_i'$ represents the target network parameter of the $i$th agent, and $\theta_i$ represents the parameter of the corresponding original (non-target) network of the $i$th agent.
In view of the good stability and convergence of the MADDPG algorithm, it can be applied to various task scenarios such as cooperative, competitive and hybrid. The innovation and experimental verification of the algorithm in this paper are partly based on the MADDPG algorithm and its accompanying multi-agent particle environment (MPE).
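The staged update of the target networks described above can be illustrated with a short sketch. It assumes the usual soft-update rule $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ used by DDPG-style algorithms; the function name `soft_update` and the toy parameter arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Blend each online parameter array into its target array, in place."""
    for target, online in zip(target_params, online_params):
        target *= (1.0 - tau)   # keep most of the old target value
        target += tau * online  # move a small step towards the online value


# Toy parameters standing in for one agent's actor or critic weights.
online = [np.ones((2, 2)), np.full(3, 2.0)]
target = [np.zeros((2, 2)), np.zeros(3)]
soft_update(target, online, tau=0.01)
print(target[0][0, 0], target[1][0])
```

A small $\tau$ makes the target networks change slowly, which is what stabilizes the critic's regression target.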

Automatic Curriculum Learning
End-to-end deep reinforcement learning methods have led to breakthroughs in board games, real-time strategy games, and path planning problems. However, reinforcement learning agents still face difficulties and challenges when dealing with many application scenarios [13]. The reason is that agents need to fully interact with the environment to obtain enough information to continuously modify their own policies, but the environment itself has the problems of reward sparseness, partial observability, delayed reward, and an action space of too high a dimension, so that the training time of the agent is too long, or training even fails to converge, when dealing with difficult tasks.
In response to the above problems, Curriculum Learning (CL) can utilize knowledge from source tasks to speed up the learning of complex target tasks, thus improving the training performance of reinforcement learning agents on fixed task sets [28]. As an important paradigm in the field of machine learning, curriculum learning imitates the human learning sequence from easy to difficult. In the initial stage of reinforcement learning, the curriculum learning algorithm trains the model in a simple simulation environment (fewer obstacles and more reward values). As the training progresses, the simulation environment is gradually made more difficult (sparser reward values and more obstacles), and finally, the algorithm is validated in the full simulation environment.
Most traditional curriculum learning methods use predefined methods [13], that is, using expert experience to evaluate the difficulty of task curriculums and formulate curriculum plans from the perspectives of the number of agents, initial state distribution, reward function, goals, environment distribution, opponent policy, etc., such as tasks with a higher number of agents and more obstacles are generally considered more difficult training environments. Because the predefined curriculum learning method requires manual assessment and sorting of curriculum difficulties and lacks task versatility, the current curriculum learning field gradually adopts automatic curriculum learning instead of predefined curriculum learning to train reinforcement learning agents [29].
The current automatic curriculum learning process can be divided into a curriculum sorting stage and a curriculum selection stage [30]. The main idea is to construct a task curriculum sampler $q(n, \phi)$ based on the experience replay buffer, which evaluates the difficulty of the transitions in the experience replay buffer and sorts them from easy to difficult. The task $M(n, \phi)$ that is currently most suitable for agent training is then extracted in real time from the experience replay to maximize the cumulative reward value $J(\theta)$ of the reinforcement learning agent. Here $\phi$ represents the environmental factor variables that affect the difficulty of the task curriculum.
To show that curriculum updating can increase the cumulative reward value of agents in the process of automatic curriculum learning, a mathematical proof is given as follows.
Proof. For a given number $n$ of agents, $J(\theta)$ can be simplified as follows: In the formula, $p(\phi)$ represents the uniform distribution of $\phi$ in the range of possible values. For all $\phi$, the inequality follows from $x - 1 \geq \log x$, and the equal sign of the inequality holds if and only if $p(\phi) = q(\phi)$.
Through the simplification of the above equation, the cumulative reward value J(θ) can be composed of the policy update reward J 1 and the curriculum update reward J 2 . The policy update reward J 1 represents that reinforcement learning agents update their own policy functions iteratively to maximize their reward value obtained from the environment, and the curriculum update reward J 2 represents the task curriculum sampling sorting and adjustment through the task curriculum sampler q(n, φ), which can improve the agent's ability to explore environment and the training performance of the model to maximize agents' cumulative reward value.
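One derivation that is consistent with the stated inequality and with the split into $J_1$ and $J_2$ can be sketched as follows. This is an assumption rather than the paper's own equation: the value function $V(\phi, \pi_\theta)$ is a symbol introduced here for illustration, it is assumed to be non-negative, and $q(\phi) > 0$ is assumed wherever $p(\phi) > 0$.

```latex
\begin{aligned}
J(\theta)
  &= \mathbb{E}_{\phi \sim p(\phi)}\bigl[V(\phi, \pi_\theta)\bigr]
   = \mathbb{E}_{\phi \sim q(\phi)}\left[\frac{p(\phi)}{q(\phi)}\, V(\phi, \pi_\theta)\right] \\
  &= \mathbb{E}_{\phi \sim q(\phi)}\bigl[V(\phi, \pi_\theta)\bigr]
   + \mathbb{E}_{\phi \sim q(\phi)}\left[\left(\frac{p(\phi)}{q(\phi)} - 1\right) V(\phi, \pi_\theta)\right] \\
  &\geq \underbrace{\mathbb{E}_{\phi \sim q(\phi)}\bigl[V(\phi, \pi_\theta)\bigr]}_{J_1}
   + \underbrace{\mathbb{E}_{\phi \sim q(\phi)}\left[V(\phi, \pi_\theta)\,\log \frac{p(\phi)}{q(\phi)}\right]}_{J_2}
\end{aligned}
```

The last step applies $x - 1 \geq \log x$ with $x = p(\phi)/q(\phi)$, so equality holds exactly when $p(\phi) = q(\phi)$. Under this reading, $J_1$ depends on the policy for a fixed sampler, and $J_2$ depends on how the sampler $q$ is adjusted.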
In traditional automatic curriculum learning algorithms, the ordering of task curriculums often takes the environmental reward value of agents as the reference standard, that is, it adjusts its own action policy according to the external reward value. However, in sparse reward environments, it is difficult for an agent to obtain positive or negative rewards from the environment during most of the exploration process. Under the framework of the traditional automatic curriculum learning algorithm, selecting the task curriculum from low to high according to temporal-difference error can easily lead to overfitting of the model training, and agents stay in circles in the environment, making it difficult to train a good policy.

Curriculum Reinforcement Learning Based on Curiosity Model
This paper proposes a general automatic curriculum learning framework, curiosity module-based curriculum learning for deep RL (CMCL), which is divided into two stages: curriculum sorting and curriculum selection. For all reinforcement learning tasks, suppose $D = \{d_1, d_2, \cdots, d_j, \cdots, d_K\}$ represents the experience sample set in the experience replay buffer, and the task curriculum sampler $q(n, \phi)$ is used to operate on the experience sample set $D$. The first stage evaluates and sorts the difficulty of the samples in the experience sample set to generate a curriculum learning plan; the second stage selects curriculums from the curriculum plan according to the set ability evaluation rules.
The core of the curriculum difficulty sorting is to define the difficulty of the task samples. To convert the task samples in the experience replay into a curriculum sequence, a curriculum index function (CI) needs to be defined to calculate the priority $p_{d_j}$ of task sample $d_j$.

Definition 1. Curriculum Index Function (CI).
The function $CI(d_j) \to \mathbb{R}$ is used to define the curriculum sequence of the task sample $d_j$ in the experience replay $D$. For the task samples $d_i$ and $d_j$, if $CI(d_i) < CI(d_j)$, the task sample $d_i$ comes before the task sample $d_j$ in the curriculum sequence.
In this paper, the curriculum sequence function is divided into two parts, $KP(\cdot)$ and $CP(\cdot)$:
$$CI(d_j) = KP(c_j, \lambda) + \eta\, CP(d_j)$$
where $KP(\cdot)$ represents the K-fold priority function, $CP(\cdot)$ represents the curiosity priority function, $c_j$ represents the K-fold teacher model score of task sample $d_j$, $\lambda$ represents the curriculum learning factor, and $\eta$ represents the hyperparameter used to control the balance between efficiency and exploration in sample learning.

K-Fold Priority Experience Replay
In this paper, the absolute value of the temporal-difference error of the neural network is used as a reference standard for the curriculum sequence function CI(d j ), and the difficult task is defined as the task with a large weight correction value for the current neural network model. The reason is that tasks with large temporal-difference error may have an adverse effect on the improvement of training model ability. For example, 1. The random noise during the model training process is prone to data deviation, thereby affecting the training accuracy of model; 2. In the stochastic gradient descent process of deep neural network training, tasks with large temporal-difference error often require a small update step size to obtain a better model convergence effect.
In this paper, the K-fold cross-validation method is used to evaluate the difficulty of the samples in the experience replay buffer. The experience replay $D$ is divided into $K$ equal parts $\{D_i : i = 1, 2, \ldots, K\}$, and each part is trained separately to obtain $K$ teacher model networks $\theta = \{\theta_1, \theta_2, \cdots, \theta_K\}$. Since the experience replay $D$ is divided, the obtained $K$ teacher model networks are independent of each other. Each teacher model network $\theta_i$ is trained on its own part $D_i$ by minimizing $L$, where $L$ represents the loss function of the temporal-difference error.
The $K$ teacher models obtained are then cross-validated. For example, if sample $d_j$ belongs to teacher model $i$, then the sample $d_j$ is scored on the $K - 1$ teacher models other than its own teacher model $i$. In the scoring formula, $c_{ji}$ represents the difficulty score of the teacher model $i$ for the sample $d_j$, $Q^{\pi}_{teacher}$ represents the Q value obtained by inputting the state value $s$ and the action value $a$ into the value function network, $Q^{\pi}$ represents the Q value obtained after the state $s$ and the action value $a$ are input into the policy function network, and $\gamma$ represents the discount factor. The final difficulty score of the task sample $d_j$ is the sum of the difficulty scores of all other teacher models:
$$c_j = \sum_{i = 1,\, i \neq k}^{K} c_{ji}, \quad d_j \in D_k$$

Definition 2. K-Fold Priority Function (KP).
The function $KP(c_j, \lambda) \to [0, 1]$ is used to define the K-fold priority of task sample $d_j$ in experience replay $D$, where $c_j$ represents the final difficulty score of task sample $d_j$ after K-fold cross-validation and $\lambda$ represents the curriculum difficulty factor of the currently selected task curriculum. The K-fold priority function $KP(c_j, \lambda)$ has the following properties:
1. $KP(c_j, \lambda)$ is monotonically decreasing when $c_j > \lambda$;
2. $KP(c_j, \lambda)$ is monotonically increasing when $c_j < \lambda$;
3. $KP(c_j, \lambda)$ takes its maximum value when $c_j = \lambda$.
The K-fold priority function takes the difficulty score $c_j$ of the task sample and the curriculum factor $\lambda$ as input and outputs a scalar in the range $[0, 1]$, thereby reflecting the priority of the task sample in the dimension of temporal-difference error. As the curriculum learning progresses, the curriculum factor $\lambda$ can be gradually increased, thereby increasing the priority of the task curriculums with higher difficulty score $c_j$. Since the selection probability of task samples is proportional to the K-fold priority, agents frequently select experience samples that fit the current model capabilities. The graph of the K-fold priority function is shown in Figure 3, where $\lambda = 0.6$ is shown in the figure.
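The three properties of the K-fold priority function can be illustrated with a toy function. The bell-shaped form below is a hypothetical stand-in chosen only because it peaks at $c_j = \lambda$ and stays within $[0, 1]$; it is not the expression used in the paper, and the `width` parameter is an assumption introduced here.

```python
import math


def kfold_priority(c, lam, width=0.2):
    """Hypothetical bell-shaped stand-in for KP(c_j, lambda).

    Increases for c < lam, decreases for c > lam, equals 1 at c == lam.
    """
    return math.exp(-((c - lam) ** 2) / (2.0 * width ** 2))


# With lam = 0.6, samples whose score is close to 0.6 get the highest priority.
for score in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(score, round(kfold_priority(score, lam=0.6), 3))

# Raising lam later in training shifts the peak towards harder samples.
print(kfold_priority(0.9, lam=0.6) < kfold_priority(0.9, lam=0.9))
```

Any function with the same three properties would play the same role: it favors samples whose difficulty matches the current curriculum factor.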
The framework of the K-Fold Cross-Validation method is shown in Figure 4.
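The cross-scoring procedure can be sketched as follows, under stated assumptions: `train_teacher` and `score_sample` are hypothetical callables standing in for teacher-network training and for the per-teacher difficulty score $c_{ji}$, and the random split into folds is a choice made here, not something the paper specifies.

```python
import numpy as np


def kfold_difficulty_scores(samples, k, train_teacher, score_sample, seed=0):
    """Score each sample with the K-1 teachers not trained on its fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    folds = np.array_split(order, k)  # K near-equal index sets D_1..D_K
    # One teacher per fold, trained only on that fold's samples.
    teachers = [train_teacher([samples[j] for j in fold]) for fold in folds]
    scores = np.zeros(len(samples))
    for own, fold in enumerate(folds):
        for j in fold:
            # c_j is the sum of c_ji over all teachers except the sample's own.
            scores[j] = sum(score_sample(teachers[i], samples[j])
                            for i in range(k) if i != own)
    return scores


# Toy stand-ins: a "teacher" is the mean of its fold, and the score is the
# absolute distance of a sample from that mean.
data = [0.1, 0.2, 0.3, 0.4, 5.0, 0.25]
scores = kfold_difficulty_scores(
    data, k=3,
    train_teacher=lambda fold: float(np.mean(fold)),
    score_sample=lambda teacher, sample: abs(sample - teacher),
)
print(scores.shape)
```

Because a sample is never scored by the teacher that saw it during training, its score reflects how hard it is for independently trained models rather than how well it was memorized.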

Curiosity Exploration Rewards
In the K-fold priority function $KP(c_j, \lambda)$, we use the temporal-difference error as the reference standard for prioritization, which can improve the utilization efficiency of task samples and the robustness of training. However, in a multi-agent system, the traditional reinforcement learning algorithm uses extrinsic reward to guide agents to adjust their own policy. The agents take actions to interact with the environment. When the policy is correct, the agent gets a positive reward value; otherwise it gets a negative reward value. This extrinsic reward method can achieve good performance in most RL environments, but in a sparse reward environment, agents obtain no immediate reward value during most of their exploration, so they cannot adjust their own policy according to the reward value, which greatly reduces the convergence speed and training efficiency of the algorithm.

Inspired by the theory of intrinsic motivation, based on the curiosity exploration mechanism [11], this paper uses the curiosity exploration reward as one of the reference standards of the curriculum sequence function $CI(d_j)$ to enhance the agent's exploration of the environment and avoid the over-fitting phenomenon of "turning in place" of agents.
The basic principle of curiosity exploration mechanism is that when the next state is inconsistent with the predicted state of policy network, the intrinsic reward of curiosity is generated. The greater the difference between actual state and predicted state, the greater the value of curiosity reward.
This curiosity-based mechanism is called the Intrinsic Curiosity Module (ICM), and the curiosity reward value is calculated through two sub-module networks. The first sub-module uses a convolutional neural network to extract the features of the state $s_t$ in experience samples and encodes them as $\phi(s_t)$; the second sub-module contains a forward neural network $\theta_F$ and an inverse dynamic network $\theta_I$. The evaluation mechanism of the curiosity reward value is shown in Figure 5.
In the ICM mechanism, the inverse dynamic network $\theta_I$ estimates the action value $a_t$ through the function $g$:

$$\hat{a}_t = g(s_t, s_{t+1}; \theta_I) \quad (13)$$

In the formula, $a_t$ represents the actual action taken from state $s_t$ to state $s_{t+1}$, and $\hat{a}_t$ represents the estimate of $a_t$. The experience tuple $(s_t, a_t, r, s_{t+1})$ is obtained from the experience replay $D$, and the network parameters of the inverse dynamic network $\theta_I$ are optimized by the following expression:

$$\min_{\theta_I} L_I(\hat{a}_t, a_t) \quad (14)$$

where $L_I$ represents the loss function between the predicted action value $\hat{a}_t$ and the actual action value $a_t$. The maximum likelihood estimates of the parameters $\theta_I$ of the inverse dynamic network can be obtained by minimizing $L_I$.

For the forward neural network $\theta_F$, the estimated feature encoding $\hat{\phi}(s_{t+1})$ of the state at the next time step $t + 1$ can be obtained by inputting the action value $a_t$ and the feature encoding $\phi(s_t)$ of the state $s_t$:

$$\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t; \theta_F) \quad (15)$$

The forward neural network parameter $\theta_F$ is optimized by minimizing a loss function $L_F$ between the predicted and the actual feature encoding. In the overall optimization function learned by the reinforcement learning agents, $0 \le \beta \le 1$ represents the weight parameter between the inverse dynamic network and the forward neural network, and $\lambda > 0$ represents the weight parameter between the intrinsic curiosity reward value and the gradient descent loss function. The curiosity reward value $r^i_t$ is obtained from this optimization.
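For reference, the standard ICM formulation writes the forward loss, the joint objective and the curiosity reward roughly as sketched below. This is an assumption about the intended form, not a reproduction of this paper's own equations: the policy parameters $\theta_P$, the total reward $r_t$ and the scaling factor $\kappa$ are names introduced here for illustration only, and $\kappa$ is chosen to avoid a clash with the balance weight $\eta$ used later.

```latex
\begin{align}
L_F\big(\phi(s_{t+1}), \hat{\phi}(s_{t+1})\big)
  &= \tfrac{1}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2 \\
\min_{\theta_P,\,\theta_I,\,\theta_F}\;
  &\Big[ -\lambda\, \mathbb{E}_{\pi(s_t;\theta_P)}\Big[\textstyle\sum_t r_t\Big]
     + (1-\beta)\, L_I + \beta\, L_F \Big] \\
r^i_t
  &= \tfrac{\kappa}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2
\end{align}
```

Under this reading, a transition that the forward network predicts poorly yields a large $r^i_t$, which is what makes it attractive for exploration.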

Definition 3. Curiosity Priority Function (CP).
The function $CP(r^i_t(d_j)) \to [0, 1]$ defines the curiosity priority of the task sample $d_j$ in the experience replay $D$, where $r^i_t(d_j)$ represents the curiosity reward value of the task sample $d_j$. $CP(r^i_t(d_j))$ is a monotonically increasing function of $r^i_t(d_j)$.
From the above, the curriculum sequence function $CI(d_j) = KP(c_j, \lambda) + \eta\, CP(d_j)$ can be obtained, which gives the priority $p_{d_j}$ of each experience sample $d_j$ in the experience replay $D$. The sampling probability of each experience sample $d_j$ is then determined by its priority $p_{d_j}$, where $a$ represents the degree to which the priority is used.
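A minimal sketch of how the combined priority could be turned into sampling probabilities. It assumes the proportional rule familiar from prioritized experience replay, $P(d_j) = p_{d_j}^{a} / \sum_k p_{d_k}^{a}$, and a min-max normalization for CP. Both are illustrative choices: the text only requires CP to map into $[0, 1]$ and to increase monotonically. The K-fold priorities are passed in as plain numbers because KP is defined in Equation (12), and all function names are hypothetical.

```python
def curiosity_priority(rewards):
    """Min-max normalize curiosity rewards into [0, 1] (assumed form of CP)."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        # All samples are equally novel: give them the same mid priority.
        return [0.5 for _ in rewards]
    return [(r - lo) / (hi - lo) for r in rewards]


def curriculum_priority(kp_values, curiosity_rewards, eta):
    """CI(d_j) = KP(c_j, lambda) + eta * CP(d_j), with KP values precomputed."""
    cp_values = curiosity_priority(curiosity_rewards)
    return [kp + eta * cp for kp, cp in zip(kp_values, cp_values)]


def sampling_probabilities(priorities, a, eps=1e-6):
    """Assumed proportional rule: P(d_j) = p_j^a / sum_k p_k^a."""
    # eps keeps zero-priority samples reachable; a = 0 gives uniform sampling.
    scaled = [(p + eps) ** a for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


kp = [0.2, 0.9, 0.5]      # hypothetical K-fold priorities
r_int = [0.0, 1.0, 4.0]   # hypothetical curiosity rewards
ci = curriculum_priority(kp, r_int, eta=0.5)
probs = sampling_probabilities(ci, a=0.6)
```

With these numbers the third sample has a modest K-fold priority but the largest curiosity reward, so the balance weight `eta` lifts it almost to the level of the second sample.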

Algorithm Framework and Pseudocode
The CMCL algorithm proposed in this paper combines the K-fold priority function and the curiosity priority function in the curriculum sorting stage, so that the temporal-difference error and the curiosity reward jointly determine the ordering of the curricula. By adjusting the curriculum factor, the K-fold priority selection of task samples can be controlled to ensure that the agents frequently select the samples best suited to the current training difficulty, and to improve the agents' exploration of the environment. The basic framework of the CMCL algorithm is shown in Figure 6, and Algorithm 1 describes the training process of the CMCL algorithm.

Algorithm 1: CMCL algorithm.
Input: experience replay buffer $D$, curriculum factor $\lambda$, curriculum stride $\mu$, balance weight $\eta$, curriculum sequence vector $ci = [ci_1, ci_2, \cdots, ci_N]$
Output: the final policy $\pi_\theta$
for episode = 1 to max_episode do
    Initialize a random process $\mathcal{N}$ for reinforcement learning action exploration
    Receive initial state $s_0$
    for t = 1 to max_episode_length do
        In state $s_t$, the agents select action $a$ through policy network $\pi_\theta(s_t)$
        Obtain the reward $r$ given by environment $E$
        Store $(s_t, a, s_{t+1}, r)$ in experience replay buffer $D$
        $s_t \leftarrow s_{t+1}$
        Sample the experience samples in $D$ for K-level teacher model training $\theta_i : i = 1, 2, \ldots, K$
        Evaluate the score of experience sample $d_j$ by cross validation: $c_j = \sum_{i \in (1, \ldots, N),\, i \neq k} c_{ji}$
        Obtain the K-fold priority $kp_j = KP(c_j, \lambda)$ according to Equation (12)
        Calculate the curiosity reward $r^i_t(d_j)$
        Obtain the curiosity priority $cp_j = CP(r^i_t(d_j))$ according to Equation (19)
        Update the curriculum sequence function $ci_j$ by $ci(d_j) = kp(c_j, \lambda) + \eta\, cp(d_j)$
        for agent v = 1 to N_agent do
            Sample a minibatch of transitions $(s_t, a, s_{t+1}, r)$ from $D$ according to the priority sampling probability
            Update the neural network parameter $\theta$ by the gradient descent algorithm
        end for
        Adjust the curriculum factor $\lambda$ based on current model capabilities: $\lambda = \lambda + \mu$
    end for
end for

Figure 6. Framework diagram of curriculum reinforcement learning algorithm based on curiosity module.
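The sampling and curriculum-factor steps at the end of each iteration of Algorithm 1 can be sketched as follows. The sketch covers only the bookkeeping: the sampling probabilities are taken as given, the transitions are dummy tuples, and the cap `lambda_max` is an added assumption so that the factor cannot grow without bound. The listing itself states only $\lambda = \lambda + \mu$.

```python
import random


def sample_minibatch(buffer, probabilities, batch_size, rng):
    """Draw transitions with replacement, weighted by sampling probability."""
    return rng.choices(buffer, weights=probabilities, k=batch_size)


def step_curriculum_factor(lam, mu, lambda_max=1.0):
    """Advance lambda by the stride mu; the upper cap is an assumption."""
    return min(lam + mu, lambda_max)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
buffer = [
    ("s0", "a0", "s1", 0.0),
    ("s1", "a1", "s2", 1.0),
    ("s2", "a2", "s3", 0.0),
]
probs = [0.1, 0.8, 0.1]  # hypothetical priority-based probabilities

lam = 0.0
for _ in range(5):
    batch = sample_minibatch(buffer, probs, batch_size=2, rng=rng)
    # A gradient step on `batch` would go here.
    lam = step_curriculum_factor(lam, mu=0.3)
```

Sampling with replacement keeps the example short; a real replay buffer might instead sample without replacement or use a sum-tree for efficiency.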

Experiment
In this paper, the simulation verification of the CMCL algorithm is carried out in the Multi-Agent Particle Environment (MPE) [27], with a multi-agent cooperative task and a competitive task as the target tasks. Based on this environment, sparse reward scenarios are constructed to test the performance of the CMCL algorithm in teamwork and policy confrontation, respectively. Each set of experiments is carried out in a software environment of Ubuntu 18.04.3 + OpenAI + PyTorch on hardware consisting of an Intel Core i7-9700K + 64 GB + GeForce RTX 2080. The CMCL algorithm is compared with several baseline algorithms to demonstrate its effectiveness and feasibility. The key hyperparameters set for the RL training process are listed in Table 1. The state values and action values of the agents are fed to the input of the neural network, which computes the target Q value of the agents. The loss function is obtained by subtracting the original Q value from the target Q value, and the original Q value function is updated. Finally, the reinforcement learning algorithm is applied to the deep learning structure.

The multi-agent cooperation experiment adopts the cooperative navigation experiment in the MPE environment. As shown in Figure 7, N agents and N landmarks are randomly generated in a square two-dimensional plane with side length 1. The plane is surrounded by walls; the agents can observe the landmarks but not the walls, and their mission is to reach the landmarks in as few steps as possible while avoiding collisions with other agents.
Mathematics 2022, 10, x FOR PEER REVIEW 14 of 20

[Figure 7: Cooperative Navigation; legend: Agent, Landmark]

Combined with the size of the two-dimensional plane, it is stipulated that when an agent enters an area with a radius of 0.1 around a landmark, the landmark is considered covered by that agent, and the cooperative navigation task is considered successful only when all landmarks are uniquely covered.
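The coverage rule can be sketched as a small check. It adopts one reading of "uniquely covered" that the text does not spell out: every landmark has exactly one agent inside its coverage radius, and no agent is counted for two landmarks. Positions are assumed to be 2-D coordinates in the unit square, and the function names are hypothetical.

```python
import math


def covering_agents(landmark, agents, radius=0.1):
    """Indices of the agents that lie inside the landmark's coverage radius."""
    return [i for i, pos in enumerate(agents) if math.dist(pos, landmark) < radius]


def navigation_succeeded(agents, landmarks, radius=0.1):
    """True when each landmark is covered by exactly one agent of its own."""
    used = set()
    for landmark in landmarks:
        covering = covering_agents(landmark, agents, radius)
        if len(covering) != 1 or covering[0] in used:
            return False
        used.add(covering[0])
    return True


landmarks = [(0.2, 0.2), (0.8, 0.8)]
spread_out = [(0.25, 0.2), (0.8, 0.75)]   # one agent near each landmark
crowded = [(0.25, 0.2), (0.22, 0.2)]      # both agents near the first landmark

success = navigation_succeeded(spread_out, landmarks)
failure = navigation_succeeded(crowded, landmarks)
```

Under this reading, crowding two agents onto one landmark fails even though that landmark is technically reached, which matches the requirement that the agents spread out.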
To construct a sparse reward scenario, the dense reward function of the original MPE environment, which is set according to the distance between the agent and the landmark, is canceled. The reward value obtained by each agent at each time step therefore consists of only two parts: 1. When agents collide with each other or an agent hits a wall, the environment gives a negative reward value, namely the agent collision reward value $C_1$; 2. When an agent covers a landmark, the environment gives a positive reward value, namely the landmark coverage reward value $C_2$.
Here $C_1$ is the negative reward value for a collision and $C_2$ is the positive reward value for covering a landmark. As shown in Figure 7, in the N = 4 environment, the CMCL, ACL, PER-MADDPG, and MADDPG algorithms are used to control the movement of the agents. To prevent the agents from spinning in place or exploring aimlessly, the episode duration is set to 30 steps; after 30 steps, the environment is reinitialized and a new episode of exploration begins. Figure 8 shows the average reward value curves and the coverage curves obtained by the four algorithms after 20,000 episodes of training in the cooperative navigation environment. Figure 9 shows a bar graph of the average reward value of the four algorithms over 20,000 episodes, that is, the quotient of the total reward value obtained by each algorithm over the whole training run and the number of episodes.

As can be seen from the curves in Figure 8, at the beginning of training the agents are prone to colliding with other agents or with the wall. As training progresses, the agents gradually learn the policy of cooperatively covering landmarks. The curve of the CMCL algorithm oscillates slightly early in training and gradually smooths in the later stage, and it reaches higher reward values and landmark coverage than the baseline algorithms, showing better training performance. Figure 10 shows a rendering of the agents in the cooperative navigation environment after the CMCL algorithm has been trained for 12,500 episodes. From the rendering, it can be seen that the agents can successfully approach and cover the landmarks in the environment.

Figure 10. Diagram of the training effect of CMCL algorithm in cooperative environment.

Competition Experiment
In a cooperative training environment, agents share the observed value of the environment to maximize the total reward value, but in a multi-agent competition task, as training progresses, the policies of their opponents are constantly improved, resulting in the continuous fluctuation of the cumulative reward value. In addition to cooperating with other agents, the agent also needs to make policy corrections for the opponent's policy.
The multi-agent competition experiment uses the predator-prey experiment in the MPE environment. On a two-dimensional plane with side length 1, m predators and n prey are randomly generated, together with three randomly placed obstacles that are large enough to block the agents' observation and movement. The goal of the predators is to capture the prey as quickly as possible through team cooperation. During this process, the predators and the prey move randomly, and the prey move twice as fast as the predators. All predators form a team to hunt down the prey, and a capture is considered successful when the distance between a predator and the prey is less than the pursuit radius.
To construct the sparse reward scenario of the predator-prey environment, the dense reward function set according to the distance between predator and prey is canceled. The reward value obtained by the predator agent at each time step therefore consists of two parts: 1. When the predator encounters the prey, it receives a positive reward value, namely the capture reward value $D_1$; 2. To prevent agents from escaping the boundary, an agent that hits the wall receives a negative reward value, namely the collision reward value $D_2$.
Under the capture reward $D_1$, the predator gets a large positive reward value when it captures the prey, while the prey gets a large negative reward value.
Under the boundary collision reward $D_2$, both the predator and the prey get a negative reward value when they collide with the boundary.
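The two reward components can be sketched for a single predator and a single prey. The magnitudes `capture_reward` and `wall_penalty`, the value of `pursuit_radius`, and the convention that the plane spans $[0, 1] \times [0, 1]$ are all assumptions made for illustration; none of these values is given in the text.

```python
import math


def predator_prey_rewards(predator, prey, pursuit_radius=0.1,
                          capture_reward=10.0, wall_penalty=-1.0, side=1.0):
    """Return (predator_reward, prey_reward) for one time step."""

    def hits_wall(pos):
        # Touching or crossing an edge of the square counts as a collision.
        return any(c <= 0.0 or c >= side for c in pos)

    predator_reward = 0.0
    prey_reward = 0.0
    if math.dist(predator, prey) < pursuit_radius:
        # D1: capture rewards the predator and penalizes the prey.
        predator_reward += capture_reward
        prey_reward -= capture_reward
    # D2: either side is penalized for colliding with the boundary.
    if hits_wall(predator):
        predator_reward += wall_penalty
    if hits_wall(prey):
        prey_reward += wall_penalty
    return predator_reward, prey_reward


captured = predator_prey_rewards((0.5, 0.5), (0.55, 0.5))
at_wall = predator_prey_rewards((0.0, 0.5), (0.9, 0.9))
idle = predator_prey_rewards((0.5, 0.5), (0.9, 0.9))
```

Because no distance-based term remains, most time steps return zero for both sides, which is the sparsity that the curiosity reward is meant to compensate for.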
As shown in Figure 11, the CMCL, ACL, PER-MADDPG, and MADDPG algorithms are used to control the movements of predators and prey, respectively. To prevent the agents from spinning in place or performing meaningless exploration, the episode duration is set to 30 steps; after 30 steps, the exploration ends and the environment is reinitialized for a new episode.
In Figure 12, the predator agents are controlled by the CMCL, ACL, PER-MADDPG, and MADDPG algorithms respectively, and the prey agents are controlled by the MADDPG algorithm. The figure shows the bar chart and the error band chart of the average reward value obtained after 20,000 episodes of training, where the average reward value in the bar chart is the quotient of the total reward value obtained by each algorithm over the whole training run and the number of episodes. As can be seen in the figure, as training progresses, the predator agents controlled by the four algorithms gradually learn the cooperative hunting policy, which tends to stabilize after 10,000 episodes. Throughout the training process, the average reward value of the CMCL algorithm is generally higher than that of the baseline algorithms and is significantly higher than that of the other three algorithms after 10,000 episodes.
Figure 13a shows the win rate obtained by both sides in each round when the predator agents adopt the CMCL algorithm and the prey agents adopt the ACL algorithm. Figure 13b shows the win rate obtained by both sides in each round when the predator agents adopt the CMCL algorithm and the prey agents adopt the PER-MADDPG algorithm. It can be seen from the figure that when the predator agents controlled by the CMCL algorithm compete against the prey agents controlled by the ACL algorithm, both sides win and lose rounds in the early stage. However, after a certain training period (5000 rounds), the predator agents controlled by the CMCL algorithm gain a significant advantage. Against the prey agents controlled by the PER-MADDPG algorithm, the predator agents controlled by the CMCL algorithm gain an obvious advantage within a short period of time, and the winning rate is above 0.85.
Figure 13. The win rate of predator and prey using two algorithms respectively in the adversarial environment. (a) CMCL vs. ACL; (b) CMCL vs. PER-MADDPG.

Figure 14 shows the training effect diagram of the CMCL algorithm obtained after 10,000 episodes of training in the competitive environment. It can be seen from the diagram that the predator agents can learn a batch-hunting policy, that is, rounding up the prey agents in two batches by making rational use of terrain obstacles. The CMCL algorithm can thus achieve better training performance than the baseline algorithms in the multi-agent competitive environment.

Figure 14. Training effect diagram of CMCL algorithm in competitive environment. (Legend: Predator, Prey, Obstacle.)

Discussion
On the basis of the analysis of the above two experimental environments, the overall performance of our proposed CMCL algorithm is better than that of the other three baseline algorithms, and the following experimental results can be obtained.
In the cooperative environment, the average reward value and landmark coverage of the CMCL algorithm are better than those of the ACL, PER-MADDPG and MADDPG algorithms. The screenshots of the agents' actual performance in the simulation environment show that agents trained with the CMCL algorithm in the cooperative environment learn to disperse and cooperatively cover the landmarks, while avoiding collisions between agents and between agents and the wall.
In the competitive environment, the average reward value of the CMCL algorithm is better than those of the ACL, PER-MADDPG and MADDPG algorithms, and when the predator agents controlled by the CMCL algorithm are confronted with prey agents controlled by the ACL or PER-MADDPG algorithm, they obtain a good win rate after a period of training. The screenshots of the agents' actual performance in the simulation environment show that predator agents trained by the CMCL algorithm in the competitive environment learn to cooperate in surrounding the prey agents and to hunt them in groups, while avoiding collisions between agents and between agents and walls.
The current CMCL algorithm can achieve good training performance in the sparse reward environment, but there are still two limitations:
1. The dimension explosion problem. With a large number of agents in the reinforcement learning environment, the state space and the action space become excessively large, and the algorithm can easily fail to converge because of the explosion of dimensions.
2. The problem of reliability distribution. When multiple agents are trained in a reinforcement learning environment, the effective exploration of the environment by the agents can easily be affected by the uneven distribution of rewards, and this problem becomes more obvious as more agents are trained together.

Conclusions
To solve the problem that the training efficiency of the automatic curriculum reinforcement learning algorithm is low in sparse reward scenarios, this paper adds a curiosity module on the basis of automatic curriculum learning and uses the curiosity reward value and the temporal-difference error as the reference standards for curriculum sorting. The ICM module is used to evaluate the curiosity priority, and the curriculum factor is designed to control the selection of curriculum difficulty. On this basis, an automatic curriculum reinforcement learning algorithm based on the curiosity module is proposed, and its availability and superiority in sparse reward scenarios are verified by simulation experiments in cooperative and competitive environments. As the number of agents in multi-agent reinforcement learning increases, the number of input nodes and the complexity of the neural network grow linearly, which can easily cause dimension explosion during training and make the algorithm difficult to converge. Methods that can be adopted include compressing the state space and sharing parameters between agents. In the future, based on automatic curriculum reinforcement learning, further research will be conducted on how to reduce the time complexity of multi-agent reinforcement learning training with a large number of agents.
The main abbreviations are listed in Table 2.