Cooperative Object Transportation Using Curriculum-Based Deep Reinforcement Learning

This paper presents a cooperative object transportation technique using deep reinforcement learning (DRL) based on curricula. Previous studies on object transportation depended heavily on complex and intractable controls, such as grasping, pushing, and caging. Recently, DRL-based object transportation techniques have been proposed that show improved performance without precise controller design. However, DRL-based techniques not only take a long time to learn their policies but also sometimes fail to learn; it is difficult to learn a DRL policy through random actions alone. Therefore, we propose two curricula for the efficient learning of object transportation: region-growing and single- to multi-robot. During the learning process, the region-growing curriculum gradually extends the region in which an object is initialized. This step-by-step learning raises the success probability of object transportation by restricting the working area. In addition, multiple robots can easily learn a new policy by exploiting the pre-trained policy of a single robot; this single- to multi-robot curriculum helps robots learn a transporting method through trial and error. Simulation results are presented to verify the proposed techniques.


Introduction
An object transportation technique using robots has been widely applied to diverse fields, such as logistics [1], exploration [2], retrieval tasks [3], and service robotics [4]. Cooperative object transportation is inspired by the collective behaviors of animals (e.g., ants) [5]. For example, ants can push or pull a large object that is much bigger than their bodies; they know by instinct that working together is better than working alone. Inspired by these cooperative animal behaviors, many researchers have studied cooperative transportation techniques that imitate them. Some researchers presented a grasping method in which robots grasp an object using their manipulators and transport it to a goal [6]. Other researchers suggested a pushing method in which robots push an object using their bodies [7]. A caging method extends the pushing method by enclosing an object with multiple robots [8]. Although these methods have some advantages, they raise many issues, such as the requirement for a gripper, precise pushing control, and real-time acquisition of the object shape.
Recently, deep reinforcement learning (DRL)-based navigation techniques have made rapid progress. DRL has been successfully applied to various mobile robotics fields, such as collision avoidance, object transportation, multi-robot navigation, and social navigation [9][10][11]. Among them, DRL-based object transportation techniques have attracted attention from many researchers because DRL can solve tricky issues of conventional methods [12][13][14]. Using a DRL algorithm, robots can learn how to transport an object to a goal without preliminary knowledge. They do not have to consider complex interactive behaviors between the robots and the object; they only need plenty of training data. Complicated or precise controls for object transportation are no longer necessary. However, it takes a long time to learn a transportation technique due to the necessity of sequential transportation procedures. For example, the typical process of object transportation is as follows [15]. First, multiple robots approach an object. Second, the robots prepare a proper formation for object transportation; they have to head toward the goal together. Finally, the robots push the object to the goal after the previous processes are completed. These procedures should be executed in order, which is difficult to learn, or requires a long training time, with random actions alone.
Therefore, we propose a new cooperative object transportation technique using curriculum-based DRL. Multiple robots can learn an object transportation method by gradual learning from easy to difficult tasks; knowledge learned from easy tasks facilitates the learning of difficult tasks. We present two curricula based on this principle: a region-growing curriculum and a single- to multi-robot curriculum. During the learning process, the pose initialization region is gradually extended according to the region-growing curriculum. Once robots are proficient at transporting an object in a small region, they can easily learn a transportation method in a larger region. The robots also exploit the knowledge that a single robot has already learned; using this single- to multi-robot curriculum, robots can utilize the policy of a single robot to learn their new policies.
This paper is organized as follows. Section 2 presents related work on cooperative object transportation and DRL-based transportation. The object transportation problem is defined in Section 3. The preliminary knowledge for DRL-based object transportation is presented in Section 4, and Section 5 presents the proposed DRL framework. Two curricula for object transportation are suggested in Section 6. Simulation results are presented in Section 7, and we discuss the importance of this paper in Section 8. Finally, our conclusions are given in Section 9.

Cooperative Object Transportation
Cooperative object transportation methods are divided into three categories: grasping, pushing, and caging [16]. First, multiple robots can transport an object with manipulators through grasping actions [6,17]. The grasping method enables robots to manipulate an object precisely and robustly in a real environment. The movement of the object is restrained by the robots' manipulators, which means that the object remains under the robots' control. However, the grasping method is not only intractable but also requires preliminary activity, such as object gripping. Additionally, the control complexity increases drastically as more robots are used, because multiple robots must be controlled synchronously for object transportation. Second, multiple robots can push an object to a goal using their bodies [7,18]. In contrast to the grasping method, the pushing method requires neither an equipped manipulator nor proactive actions. Robots can change their poses freely because they are not tied to the object. However, information about the surrounding environment, such as the static friction between the object and the ground, the object shape, or the geometrical structures, should be known in advance. Finally, the caging method combines the robust manipulation of the grasping method with the flexible transportation of the pushing method [8,19,20]. Multiple robots approach an object and enclose it to prevent it from escaping the robot formation. Then, the robots can transport the object by maintaining the enclosing formation; they do not have to consider the object's movement during the transportation process. However, multiple robots must be precisely controlled to maintain the enclosing formation, and a large number of robots is required to enclose a target object.

Deep Reinforcement Learning-Based Object Transportation
The main problem of conventional transportation methods is that predicting the movements of the robots and the object is difficult. There are many unpredictable sensing and control errors between robots and an object in a real environment that can deteriorate the transportation capacity. Many researchers, therefore, have focused on learning-based object transportation methods to solve this problem. Wang and De Silva [21] presented a reinforcement learning-based cooperative box-pushing method. They showed that the performance of single-agent Q-learning was better than that of team Q-learning due to the lack of sufficient random actions. Rahimi et al. [13] compared single- and multi-robot cases with RL-based approaches. They showed that the performance of cooperative box-pushing could be improved by frequent Q-table updates. The above-mentioned RL-based box-pushing methods are effective in simple, small environments; however, they cannot be applied in large environments, as high-dimensional state and action spaces are necessary to describe large environments.
To overcome this problem, deep reinforcement learning (DRL) has received much attention in the machine learning field as an alternative to conventional RL [22][23][24]. The most popular DRL application is Atari 2600 games by Google DeepMind [22]; the action-value function (Q-function) is approximated by a deep convolutional neural network called the deep Q-network (DQN). The DQN agent was able to surpass human performance. Following that, various improved DRL methods have been presented, such as double DQN (DDQN) [25], deep recurrent Q-learning (DRQN) [26], and proximal policy optimization (PPO) [27]. With the aid of recent scholarly exploration of these DRL methods, many researchers have attempted to apply DRL to the cooperative object transportation field. End-to-end reinforcement learning-based methods were presented for manipulating a large object [12,28]. A decentralized DRL control scheme was presented for cooperative transport behavior [29], and a decentralized/centralized Q-net separation method for action selection was proposed [30]. In the field of animation, agent-based cooperative methods for pushing, pulling, and moving objects were studied [14].
The above methods, however, are operational only under specific conditions. For example, the object and robots must be connected with a rod in advance [29], or transportation works only in a grid environment [30]. In this paper, we focus on cooperative box-pushing with free motion based on DQN, which enables this method to be applied to real environments without such restrictions.

Problem Formulation
The problem addressed in this paper is how to transport an object to a desired goal within a minimum time. Figure 1 shows the object transportation problem. Two robots are used to transport an object to a goal. The object cannot be transported by a single robot alone due to its heavy weight; however, two robots are able to transport the object by pushing together. If the object arrives within d_success of the goal, the object transportation is considered a success. On the contrary, object transportation fails if the object does not arrive at the goal within the maximum number of time steps.
We make four assumptions for the detailed problem formulation. First, two-wheeled nonholonomic mobile robots are used for transportation on the Euclidean plane, and robot movements are partially restricted by kinematics and dynamics. Second, all robots have homogeneous characteristics. Third, we assume that all robots can identify the positions of the object and the goal with respect to each robot. This is a reasonable assumption because a robot can detect other objects using its own sensors, such as visual or LiDAR sensors. Finally, we assume that there are no obstacles and that each robot has the ability to recognize the target object. In reality, there are static and dynamic obstacles, and thus the ability to distinguish obstacles from objects is an important skill. However, we concentrate on transportation methods in this paper; the detection of static and dynamic obstacles is out of scope.
Figure 1. Object transportation problem description. Two robots transport an object to a goal. If the object is located within d_success of the goal, the object transportation is completed. The distance d_t and angle θ_t between a robot and a target (the object or the goal) are represented with respect to the local coordinate frame of each robot.
Therefore, the problem can be formulated as finding the policy parameters that minimize the expected transportation time:
arg min_{π_θ} E[T | π_θ, s_1],   (1)
where π_θ is the policy parameterized by θ, s_1 is the initial state, and T is the number of time steps until the object reaches the goal.
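The success/failure criterion above can be sketched as a simple check; the threshold d_success and the step budget below are illustrative placeholders, not the paper's exact settings:

```python
import math

def transport_status(object_pos, goal_pos, t, d_success=0.3, max_steps=1000):
    """Classify the episode outcome; d_success and max_steps are
    illustrative values, not the paper's exact parameters."""
    dist = math.hypot(object_pos[0] - goal_pos[0], object_pos[1] - goal_pos[1])
    if dist <= d_success:
        return "success"   # the object arrived within d_success of the goal
    if t >= max_steps:
        return "failure"   # the time budget was exhausted
    return "ongoing"
```

For example, `transport_status((1.0, 1.2), (1.1, 1.3), t=400)` returns "success" because the object lies about 0.14 m from the goal.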

Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm concerned with how agents take actions to maximize accumulated returns in an environment [31]. The environment of RL can be described as a Markov decision process (MDP), which consists of the 4-tuple (S, A, R, P). At each time step t, an agent observes a state s_t ∈ S and selects an action a_t ∈ A according to a policy π, which is a mapping function from S to A. Then, a reward r_t ∼ R(s_t, a_t) is received, and the state s_t transitions to s_{t+1} ∼ P(s_t, a_t). The agent executes this process iteratively until the terminal state is reached.
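The agent-environment interaction loop described above can be sketched as follows; ToyEnv is a hypothetical five-state environment used only to make the (S, A, R, P) interface concrete:

```python
import random

class ToyEnv:
    """A placeholder environment exposing the MDP interface (S, A, R, P)."""
    def reset(self):
        return 0                                  # initial state s_1
    def step(self, state, action):
        next_state = (state + action) % 5         # s_{t+1} ~ P(s_t, a_t)
        reward = 1.0 if next_state == 4 else 0.0  # r_t ~ R(s_t, a_t)
        done = next_state == 4                    # terminal state reached
        return next_state, reward, done

env = ToyEnv()
s, total = env.reset(), 0.0
for t in range(100):
    a = random.choice([1, 2])   # a_t ~ pi(s_t): here a purely random policy
    s, r, done = env.step(s, a)
    total += r
    if done:
        break
```

Learning algorithms such as Q-learning replace the random policy with one that maximizes the accumulated return.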

Deep Q-Learning
Q-learning is a model-free, off-policy reinforcement learning algorithm for predicting the long-term expected return [32]. This return is represented as a state-action value function Q_π(s, a), which is the expected return obtained by performing action a under policy π given state s. Choosing an action a with a high Q-value Q(s, a) leads to better results than choosing one with a lower Q-value. The Q-function is iteratively updated by minimizing the difference between the current and expected Q-function as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],   (2)
where α ∈ [0, 1) is the learning rate, r_t is the immediate reward, γ ∈ [0, 1] is the discount factor, and r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) is the expected Q-function. Initially, the Q-function was represented by a tabular method called a Q-table. The Q-table is sufficient to represent the state and action space in a simple environment, such as a 5 × 5, 7 × 7, or 10 × 10 grid world. However, the real world must be represented by large state and action spaces due to its complex and unstructured characteristics; the Q-table not only fails to fully describe the real environment but also requires an extremely large memory.
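The tabular update in Equation (2) amounts to a few lines of code; the states, actions, and step sizes below are illustrative:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, Equation (2):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # zero-initialised Q-table
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
# First update from a zero table: Q[(0, 1)] = 0.1 * 1.0 = 0.1
```

Repeating the same transition moves Q[(0, 1)] toward the target value of 1.0 at rate α.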
To solve these problems, Mnih et al. [22] exploited a deep Q-network (DQN) instead of the Q-table for describing the Q-function. Thus, Equation (2) is rewritten with network parameters θ as follows:
L_i(θ) = E[(r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ⁻) − Q(s_t, a_t; θ))²],   (3)
where θ⁻ are the parameters of the target network, which are copied from θ. The Q-network is trained by minimizing the sum of the loss functions L_i(θ):
L(θ) = Σ_i L_i(θ).   (4)
In standard DQN, the training process is sometimes unstable; e.g., the Q-function can oscillate or vary drastically during training. The core cause of an unstable Q-function is time correlation, which means that consecutive state-action sequences have a negative effect on network optimization. Researchers have attempted to solve this problem in two ways. First, random sample batches are extracted from a data storage called replay memory [33], with actions chosen by an ε-greedy algorithm [34]; the time correlation between sample batches disappears with this random extraction. Second, the parameters of the target Q-network, θ⁻, are copied from θ only periodically, not at every step [22]; frequent parameter updates make the network unstable.
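The two stabilizers, replay memory with uniform sampling and a periodically synchronized target network, can be sketched as follows; this is a minimal sketch with parameters stored as plain dictionaries rather than an actual neural network:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer; uniform random sampling breaks the time
    correlation between consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out
    def push(self, transition):                # (s_t, a_t, s_{t+1}, r_{t+1})
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def maybe_sync_target(episode, K, theta, theta_target):
    """Copy active parameters into the target network every K episodes
    (theta^- <- theta); otherwise leave the target untouched."""
    if episode % K == 0:
        theta_target = dict(theta)
    return theta_target
```

In a full DQN, `theta` would be the weights of the active network and `sample` would feed mini-batches into the gradient step of Equation (3).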

Curriculum Learning
Curriculum learning is a step-wise training strategy for efficient learning [35][36][37]. For example, when a teacher attempts to teach a new theory to students, the teacher helps the students understand the basic concepts first. Once the students fully understand the basic concepts, the teacher progresses to more difficult ones. Similarly, the students attempt to learn still more difficult concepts once they are familiar with the previous ones; this staged progression is called curriculum learning.
Generally, object transportation has a sparse reward problem; there is no reward until the object finally reaches the goal. To solve this problem, we introduce two curriculum approaches. The first is randomized pose initialization in a gradually extended region, based on the region-growing concept [38]. The second is a single- to multi-robot curriculum based on policy reuse [39]. Detailed explanations are given in Section 6.

Reinforcement Learning Framework for Object Transportation
As mentioned in Section 4.1, the MDP of an object transportation problem can be formulated as follows. First, the state of robot i at time step t consists of the relative spatial information among the robot, the object, and the goal:
s_t^i = (d_t^{r_i,o}, θ_t^{r_i,o}, d_t^{r_i,g}, θ_t^{r_i,g}),   (5)
where d_t^{r_i,o} and θ_t^{r_i,o} are the distance and angle from robot i to the object, and d_t^{r_i,g} and θ_t^{r_i,g} are those from robot i to the goal, expressed in the robot's local coordinate frame. Examples of states are shown in Figure 1. When two robots are used, a composite state S_t is represented as concatenate(s_t^1, s_t^2). Second, a robot takes a position near the object and pushes it; thus, the robot chooses an action a_t^i at each time step t. For free robot motion, the action space consists of six actions:
a_t^i ∈ {a_1, a_2, a_3, a_4, a_5, a_6}.   (6)
A stop action is not defined because the robots should move continuously until the object arrives at the goal; there is no need for a stop action in object transportation. The translational and rotational velocities of the actions in Equation (6) are presented in Table 1. In addition, a composite action space A_t = (a_t^1, a_t^2) is introduced when multiple robots are used.
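A minimal sketch of the state construction and a plausible six-action table follows; the velocity values are placeholders, as the actual values are listed in Table 1:

```python
import math

# Illustrative 6-action set as (translational v, rotational w) pairs;
# the paper's exact velocities live in its Table 1.
ACTIONS = {
    0: (0.20,  0.0),   # move forward
    1: (0.20,  0.8),   # forward + turn left
    2: (0.20, -0.8),   # forward + turn right
    3: (-0.20, 0.0),   # move backward
    4: (0.0,   0.8),   # rotate left in place
    5: (0.0,  -0.8),   # rotate right in place
}

def relative_obs(robot_pose, target_pos):
    """Distance and bearing to a target in the robot's local frame."""
    x, y, yaw = robot_pose
    dx, dy = target_pos[0] - x, target_pos[1] - y
    d = math.hypot(dx, dy)
    theta = math.atan2(dy, dx) - yaw
    return d, theta

def make_state(robot_pose, object_pos, goal_pos):
    """State s_t^i of Equation (5): relative info to object and goal."""
    d_o, th_o = relative_obs(robot_pose, object_pos)
    d_g, th_g = relative_obs(robot_pose, goal_pos)
    return [d_o, th_o, d_g, th_g]
```

Concatenating `make_state` outputs for two robots yields the composite state S_t.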
Finally, a reward function should be designed to provide not only a final reward for reaching the goal but also consecutive rewards during the transport process. In typical object transportation methods, a final reward is provided only when the object reaches the goal; a robot rarely learns the transportation method if a reward is given only once, which is known as a sparse reward problem. Therefore, we define a reward function that yields consecutive returns during the entire transportation process:
r_t = 1 if the object reaches the goal; −0.01 if the object hits the wall; −0.005 if a robot collides with the wall; w_o ∆d_t^{o,g} + w_r ∆d_t^{r_i,o} otherwise,   (7)
where w_o = 5 w_r and ∆d_t^{r_i,o} = d_{t−1}^{r_i,o} − d_t^{r_i,o} is the per-step change in the robot-object distance; ∆d_t^{o,g} is defined analogously for the object-goal distance. If a robot gradually approaches the object, ∆d_t^{r_i,o} has a positive value, which reinforces the approaching action. On the contrary, if a robot moves further away from the object, ∆d_t^{r_i,o} has a negative value, which penalizes the leaving action. Similarly, the distance difference between the object and the goal, ∆d_t^{o,g}, guides the selection of transport actions according to their spatial distance. We assigned a five-times larger weight to the object-goal distance difference than to the robot-object distance difference because transporting the object toward the goal is more important than approaching the object. We also defined negative reward values for collisions with a wall. The 100-times (object collision) and 200-times (robot collision) reward differences relative to the goal-reaching reward proved appropriate experimentally, as described in the second and third terms of Equation (7), respectively. More concretely, recovering from a stuck object was more difficult than recovering from a stuck robot, and thus we gave double weight to the object-hitting case.
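The reward structure of Equation (7) can be sketched as follows; the terminal reward of 1 follows from the stated 100x/200x ratios, while the shaping weight w is an illustrative assumption (only the 5x ratio between the two distance terms comes from the text):

```python
def reward(reached_goal, object_hit_wall, robot_hit_wall,
           delta_d_obj_goal, delta_d_robot_obj, w=0.1):
    """Reward sketch following Equation (7)'s structure. The shaping
    weight w is illustrative; only the 5x ratio and the -0.01/-0.005
    penalties come from the text."""
    if reached_goal:
        return 1.0       # terminal reward for reaching the goal
    if object_hit_wall:
        return -0.01     # object collision: 100x smaller than goal reward
    if robot_hit_wall:
        return -0.005    # robot collision: 200x smaller than goal reward
    # distance shaping: object->goal progress weighted 5x more than
    # robot->object progress
    return 5 * w * delta_d_obj_goal + w * delta_d_robot_obj
```

A positive `delta_d_robot_obj` (robot getting closer to the object) or `delta_d_obj_goal` (object getting closer to the goal) yields a positive shaping reward, and moving away yields a negative one.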

Curricula for Efficient Learning
Learning object transportation takes a long time, or sometimes fails, due to the sparse reward problem; a situation in which the object reaches the goal rarely occurs if no guidance is provided. To solve this problem, we introduce two curriculum-based learning methods: the region-growing and single- to multi-robot curricula. Using the region-growing curriculum, robots can gradually learn their actions as the pose initialization region extends from a small to a large region. In addition, multiple robots can learn a new complex task from a simpler task that a single robot has already learned, which is the single- to multi-robot curriculum. Detailed explanations are given in the next sections.

Region-Growing Curriculum
In object transportation using RL, the success of learning highly depends on the positions where the object is initialized in each episode. For example, robots can learn how to transport an object if the object is initialized near the goal. This is because the travel distance from the object to the goal is short, which enables the robots to experience arriving at the goal frequently. From the robots' perspective, object transportation is an easy task when the object is initialized near the goal. On the other hand, robots have difficulty transporting an object that is generated far away from the goal, which is a difficult task; long travel distances cause the robots to fail at transportation. However, if robots have gained ample experience with easy tasks in advance, they can succeed in object transportation even when the object is initialized far away from the goal. This is the core principle of the region-growing curriculum.
Based on this principle, we adopted the region-growing concept to set the stage for learning. Region-growing is an image segmentation method [38] that partitions an image by extending a region from a seed point. We adjusted the degree of learning difficulty using the region-growing technique. For example, we first allowed the object to be initialized in the ρ_1-region only, as illustrated in Figure 2. In this case, the robots learn how to transport an object initialized in the ρ_1-region only. Once the robots are good at object transportation in the ρ_1-region, the allowable region for pose initialization is extended from ρ_1 to ρ_2. Similarly, the robots then learn how to transport an object in the ρ_2-region. This process continues until the whole region is covered: the robots gradually learn how to transport an object as the region in which the object is initialized extends. The region shape can be chosen to suit the environment: symmetrical, circular partitioning (Figure 2a) or asymmetrical, rectangular partitioning (Figure 2b).
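A possible schedule ρ(episode) for growing the initialization region is sketched below, assuming H = 4 uniform partitions and an even split of the episode budget (the exact schedule is an assumption):

```python
def region_ratio(episode, total_episodes=2000, num_regions=4):
    """rho(episode): map training progress to the allowed initialisation
    region, growing from rho_1 (nearest the goal) to the full workspace.
    Uniform partitioning and an even episode split are assumed."""
    episodes_per_region = total_episodes // num_regions
    j = min(episode // episodes_per_region + 1, num_regions)
    return j / num_regions   # fraction of the workspace currently available
```

With these defaults, episodes 0-499 draw object poses from the innermost quarter of the workspace, episodes 500-999 from half of it, and so on until the whole region is covered.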
In this section, we consider a single robot only, to concentrate on the effect of region-growing curriculum learning. The region-growing curriculum is also useful when multiple robots are applied; we will address the multi-robot region-growing case, including the single- to multi-robot curriculum, in the next section. Algorithm 1 presents the region-growing curriculum for single-robot DQN. First, we initialize the replay memory D and the active and target Q-networks (lines 1-3). Second, we set the initialization region according to the number of episodes using the region-growing algorithm (lines 5-7); the more the episodes increase, the further the initialization region extends. The rest of the algorithm is similar to a typical DQN [22]: select and execute action a_t with the ε-greedy algorithm, and then observe s_{t+1} and store transitions (s_t, a_t, s_{t+1}, r_{t+1}) into D for all steps (lines 8-14). Third, the active Q-network is trained with random mini-batches using gradient descent (lines 15-17). Finally, we update the target Q-network with the active Q-network every K episodes (line 18).
Algorithm 1: Region-Growing Curriculum-Based DQN.
1  Initialize replay memory D;
2  Initialize parameters of active Q-network with random weights θ;
3  Initialize parameters of target Q-network: θ⁻ = θ;
4  for episode = 1 to M do
5      Set region-growing ratio ρ_j ← ρ(episode) (j = 1, 2, ..., H);
6      Initialize the pose of the robot and the object in the ρ_j-region;
7      Set state s_t using the initialized poses of the robot and the object;
8      Set ε according to the ε-greedy algorithm: ε(ε_initial, ε_final, episode);
9      for t = 1 to T do
10         Select random action a_t with probability ε; otherwise a_t ← arg max_a Q(s_t, a; θ);
11         Execute action a_t;
12         Observe state s_{t+1} and receive reward r_{t+1};
13         Store transition (s_t, a_t, s_{t+1}, r_{t+1}) into D;
14         Replace s_t with s_{t+1};
15         for iteration = 1 to L do
16             Sample random mini-batch of transitions from D;
17             Perform gradient descent step on L_t(θ) in Equation (4) w.r.t. parameters θ;
18     Update target Q-network with active network every K episodes: θ⁻ = θ;

Single- to Multi-Robot Curriculum
Multi-robot object transportation is a more complex task than a single-robot case because robots should manipulate an object by cooperative behavior; robots should not only perform their own transport actions but also consider the actions of other robots. In addition, they should simultaneously push the object in the same direction to a goal. Taking such actions by the random exploration of multiple robots is almost impossible.
Robots, however, can learn a cooperative transport technique if they reuse the policy already learned by a single robot, which is called policy reuse [39]. The single- to multi-robot curriculum is based on the policy-reuse method, and the core idea of this curriculum is that a complex problem (i.e., multi-robot object transportation) can be solved using prior knowledge (i.e., a transporting method learned by a single robot). Multiple robots can concentrate on cooperative behaviors because basic transporting actions have already been learned by a single robot. Figure 3 shows how to train a multi-robot Q_multi-network using the pre-trained single-robot Q_single-network. Each robot takes an action using the pre-trained Q_single-network with a probability of ψ and the Q_multi-network with a probability of 1 − ψ; the robots reuse the pre-trained Q_single-network for selecting their actions and train the Q_multi-network using the transitions generated by the Q_single-network. The probability ψ gradually decreases from 1.0 to 0.0, which means that the robots increasingly use their Q_multi-network to take actions as episodes proceed. Actions from the pre-trained Q_single-network help the robots induce transport behaviors; they succeed in transporting an object from the beginning, even in a non-stationary environment. The input of the Q_multi-network is a concatenated state S̄_t composed of three consecutive states of each robot. The output action A_t is partitioned into two actions (a_t^1 and a_t^2), one command for each robot.
Figure 3. Q_multi-network training procedure. The states of the two robots (s̄_t^1 and s̄_t^2) are concatenated into one state (S̄_t) and used as the input of the Q_multi-network. Three consecutive states of each robot are treated as one state: s̄_t^i = {s_{t−2}^i, s_{t−1}^i, s_t^i} (t ≥ 2) for i ∈ {1, 2}. The Q_multi-network consists of three fully-connected (FC) layers.
At the beginning of training, each robot selects its actions from the pre-trained Q_single-network, i.e., a_t^i ← arg max_a Q_single(s_t^i, a; θ_single), with a probability of ψ. The probability ψ gradually decreases from 1.0 to 0.0, so the output action set A_t from arg max_{A_t} Q_multi(S̄_t, A_t) is used for selecting actions with a probability of 1 − ψ. The action set A_t is divided into two actions (i.e., a_t^1 and a_t^2), one for each robot.
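The ψ-mixed action selection can be sketched as follows; the linear decay of ψ is an assumed schedule, since the text states only that ψ decreases from 1.0 to 0.0:

```python
import random

def psi_schedule(episode, total_episodes):
    """Decay psi linearly from 1.0 to 0.0 over training (an assumed
    schedule; only the 1.0 -> 0.0 decay is stated in the text)."""
    return max(0.0, 1.0 - episode / total_episodes)

def select_actions(psi, q_single, q_multi, states, joint_state, rng=random):
    """Policy reuse: with probability psi each robot acts from the
    pre-trained single-robot policy; otherwise the joint Q_multi
    network emits one action per robot."""
    if rng.random() < psi:
        return [q_single(s) for s in states]   # a_t^i from Q_single
    return q_multi(joint_state)                # (a_t^1, a_t^2) from Q_multi
```

Here `q_single` and `q_multi` stand for the greedy arg-max policies of the two networks; transitions generated either way are stored and used to train Q_multi.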
Algorithm 2 presents the single- to multi-robot curriculum-based DQN. First, we initialize the replay memory and the parameters of the Q_multi-network and load the pre-trained Q_single-network (lines 1-4). The region-growing curriculum is also applied to multi-robot object transportation (lines 6-7). Using the Q_single-network with a probability of ψ, each robot can take approaching and transport actions from the beginning (lines 12-15). With the help of the actions from the Q_single-network, the robots can train their Q_multi-network (lines 16-20). The rest of the algorithm is similar to that of the region-growing curriculum (Section 6.1), except for the concatenation of states.

Algorithm 2: Single- to Multi-Robot Curriculum-Based DQN.
1  Initialize replay memory D_multi;
2  Initialize parameters of active Q_multi-network with random weights θ_multi;
3  Initialize parameters of target Q_multi-network with the identical weights of the active Q_multi-network: θ⁻_multi = θ_multi;
4  Load pre-trained Q_single-network;
5  for episode = 1 to M do
6      Set region-growing ratio ρ_j ← ρ(episode) (j = 1, 2, ..., H);
7      Initialize the poses of two robots and an object in the ρ_j-region;
8      Set each state s_t^i of robot i (i = 1, 2);
9      Set state S_t = concatenate(s_t^1, s_t^2);
10     Set ε according to the ε-greedy algorithm: ε(ε_initial, ε_final, episode);
11     for t = 1 to T do
12         if with a probability of ψ then
13             Select each action a_t^i of robot i using the Q_single-network: a_t^i ← arg max_a Q_single(s_t^i, a; θ_single);
14             Execute each action a_t^i for all robots;
15             Observe each state s_{t+1}^i and receive a reward r_{t+1};
16         else
17             Select two actions A_t = {a_t^1, a_t^2} with the ε-greedy algorithm;
18             Execute the action set A_t for all robots;
19             Observe state S_{t+1} and receive a reward r_{t+1};
20         Decompose the observed state S_{t+1} into each state s_{t+1}^i of robot i;
21         Store transition (S_t, A_t, S_{t+1}, r_{t+1}) into D_multi;
22         Replace S_t with S_{t+1};
23         Replace s_t^1 and s_t^2 with s_{t+1}^1 and s_{t+1}^2, respectively;
24         for iteration = 1 to L do
25             Sample random mini-batch of transitions from D_multi;
26             Perform gradient descent step on L_t(θ_multi) w.r.t. parameters θ_multi;
27     Update target Q_multi-network with active network every K episodes: θ⁻_multi = θ_multi;

Results
We conducted simulations to verify the proposed algorithms. First, we describe the simulation environment and hyperparameters in Section 7.1. Second, region-growing curriculum-based transportation results using a single robot are presented in Section 7.2. Third, single- to multi-robot curriculum-based transportation results based on policy reuse are described in Section 7.3. Finally, we tested the proposed method in an environment with action noise to verify its robustness in Section 7.4.

Simulation Environment
We constructed a simulation environment, as shown in Figure 4. We assumed a warehouse scenario in which one or two robots transported a pallet (i.e., the object) to a desired point (i.e., the goal). The total size of the workspace was 7.5 m (W) × 6.0 m (H), and that of the pallet was 1. A Turtlebot3-waffle [40] was used as the transporter robot; the diameter of the robot is about 0.3 m. Gazebo with ROS Noetic was used for the simulation, which builds a physics model based on the open dynamics engine (ODE). We ran the Gazebo simulations about 70 times faster than real time, and it took about 10 and 20 h to train 2000 (single-robot) and 4000 (multi-robot) episodes, respectively; these total numbers of training episodes were determined experimentally because the average reward curves converged around them, as described in Figure 5. Training was conducted on a GeForce GTX-1650 GPU and an Intel i7-9700 3.00 GHz CPU. The hyperparameters for training are described in Table 2.

The Results of the Region-Growing Curriculum
We conducted simulations to verify the performance of the region-growing curriculum. To concentrate on its effect, we consider only a single robot in this section; the case of multiple robots is addressed in the next section. We set the mass of the pallet to 0.1 kg, which is light enough to be manipulated by a single robot. The whole region was uniformly partitioned into four divisions around the goal: H equals 4 in Algorithm 1, and the distance ratios among the ρ_j-regions are identical, as shown in Figure 2b. Figure 5 shows the average rewards for the no-curriculum and region-growing curriculum cases. Initially, the average rewards showed little improvement, regardless of curriculum, due to the exploration period; the value of ε given by the ε-greedy algorithm was high (1.0) in the initial period. The average reward of the no-curriculum case increased continuously as episodes proceeded. However, the average reward of the region-growing curriculum rose and fell at regular intervals because ε was changed periodically in the region-growing curriculum. For example, ε was 1.0 at episode 0 and decreased continuously until episode 500; then ε was reset to 1.0 at episode 500. This process was repeated until the total episodes ended. Although the difference was small, region-growing curriculum-based learning showed larger average rewards than no-curriculum learning until about episode 1600. This means that curriculum-based RL can be trained readily from the beginning; pose initialization in a restricted small region raises the probability of transport completion. The average reward of the no-curriculum case overtook that of the region-growing curriculum after about 1600 episodes.
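The periodic reward pattern above can be reproduced by an ε schedule that restarts at each region boundary; the linear decay shape and ε_final value below are assumptions, while the 500-episode period comes from the experiment:

```python
def epsilon(episode, eps_initial=1.0, eps_final=0.05, region_len=500):
    """Epsilon schedule that restarts at each region boundary, matching
    the periodic rises in the region-growing reward curve. The linear
    decay and eps_final are assumptions."""
    phase = episode % region_len   # position within the current region
    return eps_initial - (eps_initial - eps_final) * phase / region_len
```

Each reset to ε = 1.0 forces fresh exploration of the newly extended initialization region, which explains the temporary dips in the average reward.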
However, this result did not affect the success rate, as shown in Table 3; the region-growing curriculum helped the robot learn a transportation method for the whole region, not for specific regions.
The success rate of the region-growing curriculum was 5% higher than that of the no-curriculum case, as described in Table 3. In addition, the average travel distance of the region-growing curriculum was shorter than that of the no-curriculum case, and the average number of steps for transportation completion was also smaller. This means a robot can learn a transportation method with lower energy consumption.

Results of the Single- to Multi-Robot Curriculum
Unlike the previous section, we used two robots and a heavier pallet for the simulation of the single- to multi-robot curriculum. We set the mass of the pallet to 0.3 kg, which is too heavy to be transported by a single robot alone, but two robots can push the pallet through cooperative behavior. The pre-trained network of a single robot, the Q_single-network, was generated by the region-growing curriculum of a single robot in Section 7.2. In addition, we tested not only the single- to multi-robot curriculum but also the region-growing curriculum to verify the overall performance of the curriculum DRL-based object transportation methods. Figure 6 shows the average reward according to the combinations of curricula. The average reward of the single- to multi-robot curriculum was the largest but did not always increase during the episodes. On the other hand, the average reward continuously increased when both the single- to multi-robot and region-growing curricula were applied. The sustained growth of the average reward indicates that the robots kept learning the object transportation method. Although the average reward of the combined curricula was lower than that of the single- to multi-robot curriculum alone, the success rate of the combined curricula was the highest, as described in Table 4. The average reward rarely increased when neither the region-growing nor the single- to multi-robot curriculum was applied, which indicates that the robots could not learn anything. The robots rarely succeeded because they did not receive any guidance for object transportation learning, which lowered the possibility of obtaining a reward. The average reward of the single- to multi-robot curriculum was higher than that of the region-growing curriculum, which means that applying the pre-trained model was more helpful than restricting the pose initialization area. The success rates of the multi-robot object transportation methods are summarized in Table 4.
We also compared the curriculum-based methods with a previous multi-robot Q-learning method called independent Q-learning (IQL) [41]. For all cases, the success rates of the two-robot case were lower than those of the single-robot case due to the change in the problem setting (i.e., the pallet mass increased from 0.1 kg to 0.3 kg in Section 7.3). The success rate was the highest when both the region-growing and single- to multi-robot curricula were used. On the other hand, the success rates were close to zero when the single- to multi-robot curriculum was not applied, which means that the pre-trained policy of a single robot provided crucial chances for the success of object transportation. The performance of IQL was also poor because the strategy of cooperative behavior could not be learned by IQL. Sample paths of multi-robot transportation using the region-growing and single- to multi-robot curricula are shown in Figure 7. The two robots gathered around the pallet when they were initialized at different positions. Then, they pushed the pallet to the goal while maintaining a line formation. There were sometimes unnecessary behaviors while the robots approached and transported the pallet; the robots showed redundant actions, especially in the vicinity of the goal.
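One simple way to realize the warm start of the single- to multi-robot curriculum is to initialize each robot's Q-network from the pre-trained Q_single weights and then fine-tune. The sketch below uses plain NumPy arrays as stand-in weights; the layer names and shapes are hypothetical and not the paper's actual architecture.

```python
import numpy as np

def warm_start_multi(q_single_weights, n_robots=2):
    """Give every robot an independent copy of the pre-trained
    single-robot weights, which each robot then fine-tunes."""
    return [{name: w.copy() for name, w in q_single_weights.items()}
            for _ in range(n_robots)]

# Hypothetical Q_single weights (state dim 8, six discrete actions).
q_single = {"W1": np.random.randn(8, 64), "W2": np.random.randn(64, 6)}
q_multi = warm_start_multi(q_single, n_robots=2)
```

Independent copies matter here: if both robots shared one array, fine-tuning one policy would silently overwrite the other's.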

Results in an Environment with Action Noise
We also tested the proposed methods by adding noise to verify the effect of control errors. Gaussian noise with zero mean and an increasing standard deviation σ was added to the action values as follows: ṽ_i = v_i + n_v and ω̃_i = ω_i + n_ω, with n_v, n_ω ∼ N(0, σ), where v_i and ω_i are the translational and rotational velocities of robot i without noise, respectively. Table 5 shows the success rate in an environment with action noise from N(0, 1.0) to N(0, 5.0). At first, the success rate decreased only slightly as σ grew to 1.0. However, the success rate decreased drastically from σ = 2.0, and it was close to zero from σ = 3.0. Although the proposed method was robust to small noise, it could not guarantee performance when the action noise exceeded a specific threshold.
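The noise injection amounts to one perturbation per velocity command. The function name and random-generator handling below are our own; the perturbation itself follows the zero-mean Gaussian model with standard deviation σ described above.

```python
import numpy as np

def noisy_action(v, w, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to
    the translational (v) and rotational (w) velocity commands."""
    rng = rng or np.random.default_rng()
    return v + rng.normal(0.0, sigma), w + rng.normal(0.0, sigma)
```

Setting sigma to 0 recovers the noise-free commands, so the same simulation loop can sweep σ from 0 to 5.0 as in Table 5.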

Discussion
To solve the cooperative object transportation problem, two main issues should be addressed. The first is how to deal with sequential tasks, such as the approaching and transport phases; these phases should be executed in order. Second, multiple robots should be fully cooperative with synchronized motions; that is, they should push an object together to manipulate it. DRL could be a solution to these issues because of its effective learning ability; however, it cannot be applied directly due to the sparse reward problem. Therefore, we presented the region-growing and single- to multi-robot curricula in this paper.
First, the region-growing curriculum made the problem easier by restricting the initialization region of the object. The region-growing curriculum is applicable to diverse environments regardless of environment size and obstacle distribution; if the region-growing steps are designed appropriately, this curriculum could be exploited in various practical fields, such as warehouses, factories, and garbage collection centers. Second, the single- to multi-robot curriculum helped the robots to work together based on the pre-trained policy of a single robot. The pre-trained policy provides a rough learning direction for multiple robots, which means that the complex, cooperative behaviors of multiple robots can be learned from the simple behavior of a single robot. The two strategies have step-by-step learning in common; the object transportation method is learned gradually, from easy tasks to difficult ones.
However, there are certain limitations to our work. First, diverse object shapes were not validated. We only tested the proposed method using a rectangular object, i.e., a pallet, because we assumed a warehouse scenario. In reality, objects come in various shapes, and we should also consider them; we will therefore study various-shaped object transportation using DRL in future work. Second, the proposed methods are inappropriate for more than two robots (i.e., N ≥ 3) because the dimensions of the state and action spaces increase exponentially as more robots are used. For example, if we use six actions as in this paper, the dimension of the action space will be 6^10 when 10 robots are used; this dimension is unlikely to be manageable using the proposed DRL framework. Meanwhile, there is another solution for using multiple robots: a distributed multi-robot team can be applied to object transportation. If we introduce a distributed system, each robot can decide its actions based on its own sensing information. The state and action spaces remain identical regardless of the number of robots because each robot considers only its own status. Even in this case, however, there is a learning problem due to the characteristics of a non-stationary environment; the policies of the other robots continuously change over the learning steps. In future work, we will examine an extensible cooperative object transportation system that considers these non-stationary characteristics. Finally, practical experiments were not conducted. Although the robots were tested in simulation environments with action noise to approximate real experiments, the simulation-to-real transfer performance (i.e., Sim-to-Real) should be verified. We leave the Sim-to-Real verification as future work.
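The action-space blow-up mentioned above is simple exponentiation: a centralized controller choosing one of six actions per robot faces |A|^N joint actions. The helper name below is ours, purely for illustration.

```python
def joint_action_dim(n_actions=6, n_robots=2):
    """Dimension of the joint action space for a centralized
    controller: every robot picks one of n_actions each step."""
    return n_actions ** n_robots
```

For two robots this gives 36 joint actions, which a DQN-style discrete head can handle, but for ten robots it gives 6^10 = 60,466,176, which is why a distributed formulation becomes attractive.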

Conclusions
In this paper, we proposed two curriculum-based reinforcement learning methods for object transportation: the region-growing and the single- to multi-robot curriculum. The region-growing curriculum gradually extended the initialization region of the object, and the single- to multi-robot curriculum used the pre-trained policy of a single robot for cooperative object transportation. Both curricula share a common principle: complicated and difficult tasks can be learned starting from easy tasks. If robots become proficient at easy tasks, the difficult tasks can then be solved. Based on these curricula, robots can overcome the learning failure problem of deep reinforcement learning. We also tested the proposed methods in an environment with action noise to approximate a real environment. The proposed methods can be applied to various fields, such as foraging, logistics, and waste collection.