Cooperative Multi-Robot Task Allocation with Reinforcement Learning

Abstract: This paper deals with the concept of multi-robot task allocation, referring to the assignment of multiple robots to tasks such that an objective function is maximized. The performance of existing meta-heuristic methods worsens as the number of robots or tasks increases. To tackle this problem, a novel Markov decision process formulation for multi-robot task allocation is presented for reinforcement learning. The proposed formulation sequentially allocates robots to tasks to minimize the total time taken to complete them. Additionally, we propose a deep reinforcement learning method to find the best allocation schedule for each problem. Our method adopts the cross-attention mechanism to compute the preference of robots to tasks. The experimental results show that the proposed method finds better solutions than meta-heuristic methods, especially when solving large-scale allocation problems.


Introduction
With the development of robot technology, the use of robots has increased in various fields such as automated factories, military weapons, autonomous vehicles, and drones. A robot can be controlled manually by an expert, but it is difficult to control a large number of robots simultaneously. When there are many robots, the job of the expert is to determine the overall behavior of the robots rather than to control them individually. In such a case, the allocation of each robot can be done using an algorithm. As an example, consider a building with ten rooms of different sizes and five robot vacuum cleaners waiting at the battery-charging station. In this case, we want to allocate robots to rooms such that all rooms are cleaned and the total elapsed time is minimized. If we consider all possible allocations, there are 5 · (10!) possibilities, accounting for the five robots and the permutations of the ten rooms. This problem is NP-hard, and finding an optimal solution takes a long time. It is termed the multi-robot task allocation (MRTA) problem, which involves the assignment of tasks to multiple robots to achieve a given goal [1][2][3]. Assigning tasks to robots efficiently is essential for various applications of multi-robot systems, including reconnaissance [4,5], search and rescue [6,7], and transportation [8,9].
MRTA can be formulated as a problem of finding the optimal combination of a task schedule. Numerous approaches, including integer-linear programming (ILP) methods [6,10,11], auction-based approaches [12,13], and graph-based methods [14,15], have been suggested to solve such combinatorial optimization problems related to MRTA. However, they usually are not scalable to large-scale systems given that the number of possible combinations increases exponentially as the number of robots or tasks increases. In addition, they are not generalizable to complex problems without carefully handcrafted heuristics.
Recent advances in deep learning have led to breakthroughs in various real-life applications, including MRTA, aquaculture, thermal processes, energy systems, mobile networks, and vehicles [16][17][18][19][20][21][22]. Deep learning is applicable to various fields because its high-dimensional representations can encode complex knowledge about the real world. Vinyals et al. [21] and McClellan et al. [23] propose deep-learning-based methods for geometric problems that are applicable to the travelling salesman problem (TSP). The advantage of deep learning is that the features are learned end-to-end instead of being handcrafted. Additionally, recent deep-learning-based methods outperform heuristic-based methods in various tasks. However, methods based on deep learning require a large amount of labeled data and are hardly generalizable to out-of-distribution instances. Recently, Kool et al. [22] suggested a method based on reinforcement learning (RL) for combinatorial optimization problems including the TSP and the vehicle routing problem (VRP). Unlike supervised deep learning, RL-based methods interact with an environment to make decisions instead of learning from labeled data. For most real-world applications, acquiring a sufficient amount of labeled data is very expensive. Although existing deep-learning-based methods have shown promising results, they are limited to relatively simple problems, such as the TSP and VRP, which do not consider multi-task robots or multi-robot tasks.
Therefore, we propose an RL-based MRTA method that can handle multi-robot tasks. Specifically, we focus on allocation problems with single-task robots, multi-robot tasks, and time-extended assignments (ST-MR-TA). First, we formulate the problem as a Markov Decision Process (MDP) so that RL algorithms can be applied. Next, we suggest a hierarchical RL environment for MRTA, including an input representation and a reward function. Then, we propose an RL-based MRTA method including a model structure and an optimization scheme. The dot-product cross-attention mechanism [24] is used in the model to guide the interactions between robots and tasks. Additionally, it provides the model with interpretability in the sense that the attention weights indicate the importance of tasks to robots. The model is optimized with the policy gradient with a greedy baseline [22], where the baseline is used to approximate the difficulty of each instance.
Our proposed method is evaluated in terms of the total time taken to complete the given tasks. We test our method in various settings with a different number of tasks. The results show that our method consistently outperforms the meta-heuristic baselines in different settings. Our method is not only superior in performance but also sample-efficient.
Our contributions are as follows: (1) an allocation algorithm in an MDP formulation for complex scenarios that involve multi-robot tasks, and (2) a deep RL-based method for MRTA, including the model structure and the optimization scheme.

Multi-Robot Task Allocation
MRTA is the problem of assigning robots to tasks while maximizing a given utility. This problem can be defined in various ways according to the environmental settings, which are categorized along three axes [2]: (1) the number of tasks that a robot can handle at one time, (2) the number of robots required for a task, and (3) the consideration of future planning. The first axis distinguishes single-task robots (ST) from multi-task robots (MT). The second axis distinguishes single-robot tasks (SR) from multi-robot tasks (MR), and the last axis distinguishes instantaneous assignments (IA) from time-extended assignments (TA). The simplest combination is ST-SR-IA and the most complex combination is MT-MR-TA. For multiple robots, the number of possible combinations increases exponentially as the number of robots increases. In addition, if the problem contains time-extended assignments, it can be viewed as a decision process. In such a case, the problem is NP-hard [1]. Although a meta-heuristic algorithm can be designed for this type of problem, the information of the environment must first be simplified into a form that the algorithm can use, and information is lost in the process. For example, finding the shortest path in a village with the Dijkstra algorithm [25] requires a graph structure of the points, which makes it difficult to consider the features of the streets. On the other hand, a neural network enables end-to-end training, so the algorithm can be built without this loss of information.

Reinforcement Learning
Recent advances in hardware have made deep learning possible, and artificial neural networks trained on massive amounts of data show high performance in many fields, including vision and natural language processing. Reinforcement Learning (RL) has also progressed rapidly with deep learning. In RL, an agent learns the optimal action in a given state, and massive amounts of data and a simulator are required to train the agent. RL is defined on a Markov Decision Process with a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the set of all possible states and $A$ is the set of all possible actions. $P$ represents the transition probability, $R$ is the reward function, and $\gamma$ is a discount factor. The advantage of using an MDP is that the decision process can be completely characterized by the current state. In other words, the current state is a sufficient statistic of the future. The goal of the training is to find the optimal policy $\pi$ which maximizes the expected discounted return $V^{\pi}(s)$ of a state $s$, where
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right] \quad (1)$$
Similarly, the goal of MRTA is to find the optimal policy $\pi$ which describes how to allocate agents. In detail, instantaneous allocation can be viewed as a type of RL with a discount factor $\gamma$ of 0, meaning that the future is not considered. If time is involved, the MRTA problem includes the future impact, and the discount factor $\gamma$ is then a non-zero number. We hypothesize that the neural network can encode the information needed for optimal allocation.
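The role of the discount factor can be illustrated with a small sketch that computes the discounted return of one sampled episode. The function name `discounted_return` and the example reward lists are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for one sampled episode.

    rewards[k] plays the role of R_{t+k+1}; the list is assumed to be finite.
    """
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# With gamma = 0 only the immediate reward counts (instantaneous assignment).
print(discounted_return([1.0, 2.0, 3.0], 0.0))
# With a sparse terminal reward and gamma = 1, the return equals that reward.
print(discounted_return([0.0, 0.0, -12.0], 1.0))
```

With `gamma = 0` the return collapses to the first reward, which matches the view of instantaneous assignment as a decision that ignores the future.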
There are several benefits of RL-based approaches. First, inference is fast even though training takes a long time. To allocate a single robot, inference computes the cross-attention between robots and tasks, with a time complexity of $O(mn)$ for $m$ robots and $n$ tasks. Because neural networks compress decision rules learned from the training data into parameters, it is possible to approximate the solution without considering all possible combinations. These trained parameters result in better performance than a meta-heuristic solution search. Next, this method can handle complex dependencies, where the optimal allocation for a task may be affected by the previous allocation. Additionally, it does not require delicately designed heuristics, which is one of the problems with meta-heuristic algorithms.

Problem Formulation
In this section, we describe a cooperative problem with single-task robots, multi-robot tasks, and time-extended assignments (ST-MR-TA). There are robots and tasks on a map. A robot moves to a task and works on it. When a robot works on a task, it reduces the workload by 1 at each time step, and the work is done when the workload is less than or equal to 0. Other robots can also work on the same task, in which case the workload of the task is reduced by n at each time step, with n representing the number of robots working on the task. We formulate this problem by means of (1) mixed integer programming and (2) a Markov decision process.
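The cooperative workload rule can be sketched as a short loop. The helper name `steps_to_finish` is hypothetical, and the sketch assumes that all robots are already at the task and stay there until it is done.

```python
def steps_to_finish(workload, num_robots):
    """Count time steps until the workload drops to 0 or below.

    Each of the num_robots robots removes one unit of workload per time step.
    """
    if num_robots < 1:
        raise ValueError("at least one robot must work on the task")
    steps = 0
    while workload > 0:
        workload -= num_robots
        steps += 1
    return steps


print(steps_to_finish(6, 1))  # one robot working alone
print(steps_to_finish(6, 2))  # two robots cooperating on the same task
```

Under these assumptions the number of steps is the workload divided by the number of robots, rounded up.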

Mixed Integer Programming Formulation
In Equations (2)-(13), we describe the mixed integer programming (MIP) formulation of the cooperative ST-MR-TA problem following [26]. Let $T$ be a set of $n$ tasks, where each task $t_j$ has x,y coordinates and a workload $w(t_j)$, and let $R$ be a set of $m$ robots with x,y coordinates. We add an initial task $t_0$ of workload 0 at coordinates (0,0), which is the initial location of the robots. Additionally, robot $r_i$ can have a battery constraint $b(r_i)$. The distance between two objects on the map is denoted by $d(\cdot, \cdot)$, which is the Euclidean distance. The continuous variables $S_{r_i t_j}$ and $F_{r_i t_j}$ represent the start time and finish time of work on task $t_j$ by robot $r_i$. The start time and finish time are 0 if robot $r_i$ is not allocated to task $t_j$. The binary variable $A_{r_i t_j}$ is 1 if task $t_j$ is assigned to robot $r_i$, and the binary variable $X_{r_i t_j, t_k}$ is 1 if robot $r_i$ performs task $t_j$ followed by $t_k$.
The constraints for the MIP formulation are described in Equations (2)-(13). Equation (2) is the objective of the problem. We add another variable $F_{last}$, which is greater than or equal to the finish time $F_{r_i t_j}$ for all robots $r_i \in R$ and tasks $t_j \in T$ in Equation (3), and minimize it.
Equations (4)-(6) give the domain of each variable. Equation (7) ensures that the cumulative work time on task $t_j$ over all robots is at least the workload $w(t_j)$ of task $t_j$. Equation (8) forces the start time $S_{r_i t_j}$ and finish time $F_{r_i t_j}$ to be 0 if task $t_j$ is not allocated to robot $r_i$. A very large number $M$ is used to express this condition. Equation (9) ensures that the travel time between successively allocated tasks is bounded by the distance between the two tasks. Equations (10) and (11) are used to match the allocation variable $A_{r_i t_k}$ and the successive allocation variable $X_{r_i t_j, t_k}$. Equation (12) indicates the initial position for every robot $r_i \in R$. Equation (13) is optional and applies a battery constraint.
minimize: $F_{last}$ (2)
subject to: $F_{last} \geq F_{r_i t_j}, \quad \forall r_i \in R, \; \forall t_j \in T$ (3)

Markov Decision Process Formulation
The cooperative ST-MR-TA problem can be solved by individual agents. That is, each robot chooses one of the remaining tasks. However, we can consider another option in which there is a manager responsible for robot allocation. Therefore, we suggest an MDP formulation in a hierarchical environment. This involves an outer environment whose only action is the allocation of idle robots to one of the remaining tasks and an inner environment which simulates robots and tasks. Both environments are designed as MDPs, each a sequence of states, actions, and rewards $(S_0, A_0, R_1, S_1, A_1, R_2, \cdots)$ [27]. An example of an episode in the hierarchical environment is described in Figure 1. We define the makespan as the total number of time steps spent in the inner environment, and the goal is to minimize the makespan.
Figure 1. Time steps of an example episode. The time step of the allocation environment is denoted by $T_i$ and the time step of the inner environment is denoted by $t_{i,j}$. In this example, the allocation algorithm has allocated robots 3 times and the makespan is 12 = (5 + 3 + 4).

Allocation Environment
The role of the allocation environment is to allocate robot $r_i$ to task $t_j$. We formulate the allocation environment as follows. The observation is the encoded information of the $m$ robots and the remaining tasks $T_{remaining}$. The reward $r(s_t)$ is sparse: a penalty of $-\text{makespan}$ is given at the end of an episode. The episode finishes when there are no remaining tasks.
We define $T_{remaining}$ as the set of remaining tasks whose workload is larger than 0, $T_{finished} = T \setminus T_{remaining}$ as the set of finished tasks, and $R_{ready}$ as the set of ready robots which are waiting to be allocated. A single time step of the environment is the allocation of one robot $r_i \in R_{ready}$, as described in Figure 2. Since we allocate only one robot in a single time step, $m$ successive environment time steps are required for $m$ ready robots. This successive allocation is denoted in lines (4-6) of Algorithm 1. Therefore, the action $A_t$ is a mapping $A_t = (r_i \to t_j)$. Since a target task should be one of the tasks in $T_{remaining}$, we use the notation Allocate($r_i$, $T_{remaining}$) to describe the allocation of robot $r_i$ in Algorithm 1. Note that this environment works independently from the actual simulation of robots and is only responsible for the allocation of robots.

Figure 2. A single time step of the allocation environment. When the allocation algorithm gives action $A_t$ to the environment, the environment gives the next state and reward pair $(S_{t+1}, R_{t+1})$. When we simulate the environment, robots $\{R_1, R_2, R_3, R_4\}$ choose actions in the inner environment, which is also a form of MDP. The allocation algorithm decides which task $t_j$ in $T_{remaining}$ is assigned to the given robot $r_i$ in $R_{ready}$. In this example, $R_{ready} = \{R_3, R_4\}$, $T_{remaining} = \{T_1, T_2, T_3\}$, and the given robot is $R_4$. The algorithm allocates robot $R_4$ to task $T_3$. After allocation, the inner environment runs until there is another ready robot. In this case, robot $R_3$ is another ready robot and the next allocation happens immediately. Best viewed in color.

Algorithm 1 Allocation Algorithm in MDP
1: Inputs: Robots $R$, Tasks $T$, workload $w(t)$, distance function $d(\cdot, \cdot)$, work distance $\epsilon$
2: Initialize:
4:     for $r_i$ in $R_{ready}$ do
5:     end for
8:     while $R_{ready}$ is ∅ do # begin inner environment episode
9:         for $r_i$ in $R$ do
10:             t = $r_i$.task
11:             if $d(r_i, t) \leq \epsilon$ then
12:                 $T_{remaining} = T_{remaining} \setminus \{t\}$
22:             end if
23:         end for
24:     end while
25: end while

Inner Environment
The robots take one of two actions (working or moving) in the inner environment. We first describe how to encode the information of the robots and tasks. Coordinates $(x_{r_i}, y_{r_i})$ and $(x_{t_j}, y_{t_j})$ denote the positions of robots and tasks. The state of robot $r_i$ is one of ready, going, and working. The state of task $t_j$ is one of done and remaining. $(\mathrm{task}(r_i).x, \mathrm{task}(r_i).y)$ are the x,y coordinates of the task allocated to robot $r_i$. If no task is allocated to robot $r_i$, $(\mathrm{task}(r_i).x, \mathrm{task}(r_i).y) = (0, 0)$. Since the problem is cooperative, there can be multiple robots working on and coming to task $t_j$. The allocation must consider the robots already associated with task $t_j$, so that we can prevent unnecessary allocation (when the allocated task is done before the robot reaches its position) and encourage cooperation. Therefore we encode two types of information, $\mathrm{work}(t_j)$ and $\mathrm{com}(t_j)$. $\mathrm{work}(t_j)$ is the number of robots working on task $t_j$, from which we can tell when the task will finish. $\mathrm{com}(t_j)$ is the set of distances $d(r_i, t_j)$ for the robots $r_i$ coming to task $t_j$. We provide the mean and variance of these distances, and 0 for each value if there is no robot coming to the task. The overall encoded information $\{R_1, \cdots, R_m\}$ and $\{T_1, \cdots, T_n\}$ is shown in Equations (15) and (16) for robots and tasks.
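The two task statistics described above can be sketched as follows. The dictionary keys (`xy`, `state`, `task_xy`) and the returned field names are hypothetical, since the exact layout of Equations (15) and (16) is not reproduced here; the variance is taken as the population variance, which is also an assumption.

```python
import math


def encode_task(task_xy, workload, robots):
    """Summarize the robots working on, or travelling to, one task.

    robots: list of dicts with hypothetical keys "xy", "state" and "task_xy".
    """
    working = 0
    incoming = []
    for robot in robots:
        if robot["task_xy"] != task_xy:
            continue  # robot is assigned elsewhere
        if robot["state"] == "working":
            working += 1
        elif robot["state"] == "going":
            incoming.append(math.dist(robot["xy"], task_xy))
    if incoming:
        mean = sum(incoming) / len(incoming)
        var = sum((d - mean) ** 2 for d in incoming) / len(incoming)
    else:
        mean, var = 0.0, 0.0  # no robot is coming to the task
    return {"xy": task_xy, "workload": workload, "work": working,
            "com_mean": mean, "com_var": var}


robots = [
    {"xy": (0.0, 0.0), "state": "working", "task_xy": (0.0, 0.0)},
    {"xy": (3.0, 0.0), "state": "going", "task_xy": (0.0, 0.0)},
    {"xy": (0.0, 5.0), "state": "going", "task_xy": (0.0, 0.0)},
]
print(encode_task((0.0, 0.0), 4, robots))
```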
A single time step of the inner environment corresponds to one unit of time, which means that robot $r_i$ moves 1 unit of distance when moving to task $t_j$ and reduces the workload $w(t_j)$ by 1 when working on task $t_j$. In the MIP formulation, the start time and finish time are real values. Therefore, the discrete environment does not exactly match the problem formulation. However, the two become more similar as the unit time approaches 0. To account for the gap between the formulation and the environment, we assume that the robot has reached the task when the distance $d(r_i, t_j)$ to the allocated task $t_j$ is less than some small positive value $\epsilon$. When the robot reaches the task, we change the state of robot $r_i$ from going to working. This change of state is described in line (11) of Algorithm 1. Even though the inner environment can be formulated as an MDP, we do not train the robots because the goal of our paper is to train a task allocation method, not individual robots. The episode of the inner environment starts when $R_{ready}$ is empty and runs until some robots have finished their assigned task. When a task is finished, the robots that worked on the task become idle and they are allocated to new tasks in $T_{remaining}$. We designed this hierarchical architecture for future study, in which not only the outer allocation policy but also the robots themselves are trained.
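The interplay of the two environments can be sketched as a small simulation that returns the makespan. This is a simplified sketch under stated assumptions, not the authors' Algorithm 1: the function name `simulate`, the `allocate` callback signature, the default `eps`, and the rule that a robot is released as soon as its target task is finished (even if it is still travelling) are all illustrative choices.

```python
import math


def simulate(robot_positions, tasks, allocate, eps=0.5, max_steps=100_000):
    """Run one episode and return the makespan (number of inner time steps).

    robot_positions: list of (x, y) start positions.
    tasks: list of ((x, y), workload) pairs.
    allocate: callable(robot_index, remaining_task_indices, positions, workloads)
              returning one of the remaining task indices.
    """
    pos = [tuple(p) for p in robot_positions]
    target = [None] * len(pos)            # task index assigned to each robot
    work = [float(w) for _, w in tasks]   # remaining workload per task
    makespan = 0
    while any(w > 0 for w in work):
        if makespan >= max_steps:
            raise RuntimeError("episode did not terminate")
        remaining = [j for j, w in enumerate(work) if w > 0]
        # Outer (allocation) environment: allocate ready robots one at a time.
        for i in range(len(pos)):
            if target[i] is None:
                target[i] = allocate(i, remaining, pos, work)
        # Inner environment: every robot acts for one unit of time.
        for i, j in enumerate(target):
            goal = tasks[j][0]
            dist = math.dist(pos[i], goal)
            if dist <= eps:
                work[j] -= 1.0                    # working on the task
            else:
                frac = min(1.0, dist) / dist      # move one unit distance
                pos[i] = (pos[i][0] + (goal[0] - pos[i][0]) * frac,
                          pos[i][1] + (goal[1] - pos[i][1]) * frac)
        makespan += 1
        # Robots whose task is finished become ready again.
        for i, j in enumerate(target):
            if work[j] <= 0:
                target[i] = None
    return makespan


def first_remaining(robot_index, remaining, positions, workloads):
    # Placeholder policy: always pick the first remaining task.
    return remaining[0]


print(simulate([(0.0, 0.0)], [((3.0, 0.0), 2)], first_remaining))
```

Any allocation rule, including a learned policy or one of the meta-heuristic baselines, could be plugged in through the `allocate` callback.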

Multi-Robot Task Allocation Methods
In this section, we describe our cross-attention-based method and the baseline methods used to allocate robots. We use baselines based on meta-heuristic algorithms proposed to solve combinatorial optimization problems. We test four algorithms: Random (RD), Stochastic Greedy (SG), Iterated Greedy (IG), and Genetic Algorithm (GA). Other meta-heuristic algorithms such as particle swarm optimization and ant colony optimization are also possible but not directly applicable to the MDP formulation.
The deep learning model used in our method is based on the encoder-decoder architecture described in Figure 3. The model is similar to the Transformer architecture by Vaswani et al. [24]. The information of robots $\{R_1, \cdots, R_m\}$ is embedded by the encoder using a feed-forward neural network, ReLU activation, and batch normalization. Since our method allocates only one robot in a single time step, we must indicate the robot we are considering. Therefore, we use an additional one-hot feature to indicate the robot and obtain $\bar{R}_i$ in Equation (17). Then $\bar{R}_i$ is projected to a $d_h$-dimensional vector $e_{r_i}$ through a linear projection with trainable parameters $W_r$ and $b_r$.
$$\bar{R}_i = \begin{cases} [x_{r_i}, y_{r_i}, \mathrm{state}_{r_i}, \mathrm{task}(r_i).x, \mathrm{task}(r_i).y, 1] & \text{if } r_i \text{ is to be allocated} \\ [x_{r_i}, y_{r_i}, \mathrm{state}_{r_i}, \mathrm{task}(r_i).x, \mathrm{task}(r_i).y, 0] & \text{if } r_i \text{ is not to be allocated} \end{cases} \quad (17)$$
Finally, we compute the logit $p_j$ for each task embedding through a linear projection with parameters $W_p$ and $b_p$. Then we sample the index of a remaining task from the resulting probabilities.
Figure 4 describes what is learned by the model. The model has two types of layers: (1) an encoding layer for robots and tasks, and (2) a cross-attention layer to compute the attention score between robots and tasks. The role of the encoding layer is to represent the information of robots and tasks. The role of the cross-attention layer is to calculate an attention score, which can be considered an importance score between robots and tasks.
This interpretability is an advantage of the cross-attention layer compared to other types of layers. As a result, the neural network is trained to produce better vector representations of robots and tasks and to compute what is important. In addition, since the model is trained to find efficient allocations, the information for finding better allocations may be indirectly encoded in the neural network.
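The idea of scoring tasks for one robot with dot-product attention and then sampling a task can be sketched in plain Python. This is a deliberately reduced illustration, not the model of Figure 3: it uses the embedding of the robot to be allocated as the query and treats the normalized scores directly as the allocation distribution, and the helper names (`linear`, `attention_scores`, `sample_task`) and the example weights are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear(x, weight, bias):
    """Linear projection W x + b, with weight given as a list of rows."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(robot_query, task_keys):
    """Scaled dot-product scores of one robot embedding against task embeddings."""
    scale = math.sqrt(len(robot_query))
    return softmax([dot(robot_query, key) / scale for key in task_keys])


def sample_task(probs, remaining, rng):
    """Sample one remaining task index according to the attention weights."""
    return rng.choices(remaining, weights=probs, k=1)[0]


# Robot feature vector in the style of Equation (17), projected to 2 dimensions.
robot_features = [1.0, 2.0, 0.0, 0.0, 0.0, 1.0]
w_r = [[0.1, 0.0, 0.0, 0.0, 0.0, 0.5], [0.0, 0.1, 0.0, 0.0, 0.0, 0.5]]
query = linear(robot_features, w_r, [0.0, 0.0])
task_keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
probs = attention_scores(query, task_keys)
print(probs, sample_task(probs, [0, 1, 2], random.Random(0)))
```

The printed weights are the kind of quantity that can be inspected to see which task the model considers most important for the robot.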

Figure 3. The neural network architecture for the MRTA (blocks: robot embedding, task embedding, batch norm, cross attention, FFN, sampling). The robot to be allocated is $r_2$ and the remaining tasks are $(t_1, t_2, t_3, t_4)$. Tasks $t_4$, $t_5$ are finished tasks and they are not used in the model. After normalizing the logits, we sample an action from the probabilities. Here the robot $r_2$ is allocated to $t_k$ where $t_k \in \{t_1, t_2, t_3, t_4\}$. Best viewed in color.
Figure 4. What the model learns. The representation of robots and tasks includes the relationship information between robots and tasks. In this figure, $T_1$ is the most important task for robot $R_1$ and $T_3$ is the most important task for robot $R_2$. The rectangular blocks in the cross-attention block are the vector representations of robots and tasks. The color gradient denotes the importance of the task: the redder the block, the more important the task.

Optimization
We train the reinforcement learning model with a policy gradient method with a baseline. It is common to use a critic network to estimate the value function. However, as noted in [22], using a baseline instead of a critic network is a better choice for combinatorial optimization. Therefore, we construct the baseline $b(s)$ as the moving average of the makespan during training. We also tried a critic network; however, training the value network made the training unstable. We optimize with Proximal Policy Optimization (PPO), which clips the loss so that it does not diverge. PPO takes the loss as the minimum of the advantage weighted by the probability ratio and the advantage weighted by the clipped ratio:
$$L(\pi_\theta \mid s_t) = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t, \; \mathrm{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, 1 - \epsilon_{clip}, 1 + \epsilon_{clip} \right) \hat{A}_t \right)$$
where $\epsilon_{clip}$ is the clipping range. We update the baseline $b(s_t)$ after the end of the episode as the moving average of the terminal reward, which is $-\text{makespan}$.
The overall training algorithm of our model with baseline b(s t ) is in Algorithm 2.
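The clipped objective and a moving-average baseline can be sketched for a single sample as follows. The clipping range `clip_eps=0.2`, the exponential form of the moving average, its `momentum=0.9`, and the definition of the advantage as terminal reward minus baseline are assumptions made for illustration; the text does not specify them.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


class MovingAverageBaseline:
    """Exponential moving average of the terminal reward (-makespan)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def advantage(self, terminal_reward):
        # Before any update there is no reference point, so the advantage is 0.
        if self.value is None:
            return 0.0
        return terminal_reward - self.value

    def update(self, terminal_reward):
        if self.value is None:
            self.value = terminal_reward
        else:
            self.value = (self.momentum * self.value
                          + (1.0 - self.momentum) * terminal_reward)
        return self.value


baseline = MovingAverageBaseline()
baseline.update(-100.0)                 # first episode: makespan 100
adv = baseline.advantage(-80.0)         # second episode is better than average
print(adv, ppo_clip_objective(1.5, adv))
baseline.update(-80.0)
```

A positive advantage with a ratio above the clipping range is capped, which limits how far a single update can move the policy.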

Meta-Heuristics for MRTA
To evaluate the performance of the neural network, we compare it with three types of meta-heuristic algorithms and a random selection. The solution of the cooperative MRTA problem can be viewed as permutations of $n$ tasks for $m$ robots. The solution can then be represented as in Equation (22), where $\pi^i_j$ is the $j$-th task that robot $r_i$ visits. The size of the solution space is $m \cdot (n!)$, which increases exponentially as the number of tasks increases. Permutation-based solutions have already been addressed with meta-heuristics, and we use variants following the construction of meta-heuristic algorithms in [28].
$$\pi = ((\pi^1_1, \pi^1_2, \cdots, \pi^1_n), \cdots, (\pi^m_1, \pi^m_2, \cdots, \pi^m_n)) \quad (22)$$

Genetic Algorithm
Genetic Algorithms (GA) are widely used to solve many optimization problems, including scheduling problems. The GA is a population-based algorithm that searches for the optimal solution from a set of candidates. Given a population, the approach applies crossover and mutation to generate new children from the population. Then, it selects some of the children and removes some of the original population so that a new population is formed. However, our solution is a schedule for $m$ robots. Therefore, in a single iteration we randomly choose a robot $r_i$ in $R$ and search for better solutions for that robot. Given the population $\{(\pi^i_{1,p}, \cdots, \pi^i_{n,p}) \mid p = 1, 2, \cdots, P\}$ of robot $r_i$, the crossover process randomly selects two solutions and a task index $k$. Then, it combines the left side of the first solution with the right side of the second solution, split at point $k$. This results in a new child, which is a combination of the original two solutions. If there are duplicated tasks, we use the original task of the parent. In the implementation, we pair all solutions in the population and undertake the crossover routine with a probability of 0.4. The mutation randomly selects two positions in $(\pi^i_1, \cdots, \pi^i_n)$ and switches the two tasks at these positions with a probability of 0.3. After applying the two operations, the number of children is bounded by $(P^2 + P)$. During the selection, we first calculate the fitness of the children and then choose the best $P$ individuals from the candidates. The above process is repeated to achieve better solutions. We set $P$ to 10 because setting $P$ to a large number is computationally expensive.
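The crossover and mutation operators for one robot's permutation can be sketched as below. The duplicate-repair rule is one possible reading of "we use the original task of the parent" (duplicates are replaced by the tasks that went missing, in the first parent's order) and should be treated as an assumption, as should the function names.

```python
import random


def crossover(parent_a, parent_b, k):
    """Join the left part of parent_a with the right part of parent_b at point k."""
    child = parent_a[:k] + parent_b[k:]
    # Tasks that disappeared from the child, in the order of parent_a.
    missing = [task for task in parent_a if task not in child]
    seen = set()
    for idx, task in enumerate(child):
        if task in seen:
            child[idx] = missing.pop(0)   # repair a duplicated task
        else:
            seen.add(task)
    return child


def mutate(perm, rng, prob=0.3):
    """With probability prob, swap the tasks at two random positions."""
    perm = list(perm)
    if rng.random() < prob:
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


rng = random.Random(0)
child = crossover([0, 1, 2, 3], [3, 2, 1, 0], 2)
print(child, mutate(child, rng, prob=1.0))
```

Both operators return a valid permutation, so every child can be evaluated with the same fitness function as its parents.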

Iterated Greedy Algorithm
Iterated Greedy (IG) is widely used to solve optimization problems. Like GA, IG is not guaranteed to find the optimal solution. However, the solution is improved over iterations through destruction, reconstruction, and local search. Unlike GA, IG is not a population-based algorithm, and it starts from a single solution $\pi$. The algorithm randomly selects a robot $r_i$ and improves its permutation $(\pi^i_1, \cdots, \pi^i_n)$. The destruction phase randomly selects $d$ positions and removes the corresponding tasks from the permutation, and the reconstruction phase inserts them again into the partial permutation. The algorithm inserts each task using a local search which considers all possible positions and finds the position where the minimum makespan is achieved. All tasks are guaranteed to be finished even though the permutation of robot $r_i$ is partial during the reconstruction phase, because the permutations of the other robots contain all the tasks. The local search selects a random position of the permutation and considers all possible positions to find the best one. We set $d$ to 0.2 times the length of the permutation.
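One destruction and reconstruction pass on a single permutation can be sketched as follows. The `cost` callback stands in for the makespan evaluation and is a hypothetical parameter; the toy cost in the usage lines (number of inversions) is only there to make the example self-contained.

```python
import random


def destroy_and_rebuild(perm, cost, rng, rate=0.2):
    """Remove d = rate * len(perm) tasks and greedily reinsert each one."""
    d = max(1, int(rate * len(perm)))
    removed_idx = set(rng.sample(range(len(perm)), d))
    removed = [perm[i] for i in sorted(removed_idx)]
    partial = [task for i, task in enumerate(perm) if i not in removed_idx]
    for task in removed:
        # Try every insertion position and keep the cheapest result.
        candidates = [partial[:p] + [task] + partial[p:]
                      for p in range(len(partial) + 1)]
        partial = min(candidates, key=cost)
    return partial


def inversions(perm):
    """Toy cost: number of out-of-order pairs (a stand-in for the makespan)."""
    return sum(1 for i in range(len(perm))
               for j in range(i + 1, len(perm)) if perm[i] > perm[j])


start = [4, 0, 3, 1, 2]
print(destroy_and_rebuild(start, inversions, random.Random(0)))
```

Each insertion requires one cost evaluation per candidate position, which is why most of the running time of IG goes into evaluation when the cost is a full simulation.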

Stochastic Greedy Algorithm
Simply selecting the closest task is one way to solve the problem: we can allocate robot $r_i$ to the nearest remaining task. However, there is one more factor to consider, the workload. Even when a task is close, in certain cases distant tasks must be considered first. On the other hand, there could be another problem instance in which working on the nearest task first is better than doing it later. Figure 5 shows such a case. Therefore, we also considered a greedy algorithm as a baseline. Because the original greedy algorithm is deterministic, we modified it so that the allocation algorithm does not always choose the nearest task and instead selects one of the remaining tasks with a certain probability. In this work, we only considered the distance-based greedy algorithm because its performance was not good and we expect the same result with the workload-based greedy algorithm. Stochastic Greedy (SG) selects the task to be allocated to robot $r_i$ by sampling from the probability $p(t_j) = d(r_i, t_j)/N$, where $N$ is the normalization constant $N = \sum_{t_k \in T_{remaining}} d(r_i, t_k)$.
Figure 5. Two types of greedy algorithm on an example problem (panels: workload-based greedy and distance-based greedy). In the workload-based greedy, the robot chooses task $T_1$, whose workload is larger than the workload of $T_2$. In this case, the makespan = 3 + (3 + 6/2) + 9 = 18, while the makespan of the distance-based greedy is (10 + 1) = 11. In this example, the travel time of $R_2$ at time step 1 is redundant in the workload-based greedy.
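The sampling rule can be sketched directly from the formula for $p(t_j)$. The sketch implements the formula literally as stated in the text, so the probability of a task grows with its distance from the robot; the function names and the uniform fallback for the all-zero-distance case are assumptions.

```python
import math
import random


def sg_probabilities(robot_xy, task_xys):
    """p(t_j) = d(r_i, t_j) / N with N the sum of distances to remaining tasks."""
    dists = [math.dist(robot_xy, xy) for xy in task_xys]
    total = sum(dists)
    if total == 0:
        # Degenerate case (robot on top of every task): fall back to uniform.
        return [1.0 / len(dists)] * len(dists)
    return [d / total for d in dists]


def sg_select(robot_xy, task_xys, rng):
    """Sample the index of a remaining task from the SG distribution."""
    probs = sg_probabilities(robot_xy, task_xys)
    return rng.choices(range(len(task_xys)), weights=probs, k=1)[0]


tasks = [(3.0, 0.0), (0.0, 1.0)]
print(sg_probabilities((0.0, 0.0), tasks), sg_select((0.0, 0.0), tasks, random.Random(0)))
```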

Random Selection
In Random Selection, the algorithm randomly generates the permutation of tasks for all robots $\pi$. There are two reasons why we use this method. First, it is noted in [22] that random sampling also works well in simple combinatorial optimization problems. Second, since our solution space is large ($m \cdot (n!)$), running a meta-heuristic algorithm could be a worse choice than simply sampling solutions. For example, local search in IG requires the evaluation of neighboring solutions, and if the first solution is bad, the neighboring solutions are likely to be similarly bad.
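Random selection over the representation in Equation (22) can be sketched as repeated sampling of one task permutation per robot, keeping the best sample. The `cost` callback again stands in for the makespan evaluation, and the toy cost in the usage lines is purely illustrative.

```python
import random


def random_solution(num_robots, num_tasks, rng):
    """One random permutation of all task indices for every robot."""
    return [rng.sample(range(num_tasks), num_tasks) for _ in range(num_robots)]


def random_search(num_robots, num_tasks, cost, samples, rng):
    """Sample solutions independently and keep the one with the lowest cost."""
    best, best_cost = None, float("inf")
    for _ in range(samples):
        candidate = random_solution(num_robots, num_tasks, rng)
        value = cost(candidate)
        if value < best_cost:
            best, best_cost = candidate, value
    return best, best_cost


def toy_cost(solution):
    # Illustrative stand-in for the makespan: position of task 0 in each plan.
    return sum(perm.index(0) for perm in solution)


best, best_cost = random_search(2, 4, toy_cost, samples=50, rng=random.Random(0))
print(best, best_cost)
```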

Performance of Meta-Heuristic Algorithms
First, we demonstrate the performance of the Random, SG, IG, and GA algorithms on 100 samples. The makespan of the algorithms over time is shown in Figure 6. We find that the makespan of the meta-heuristic algorithms decreases over time. However, as noted in [22], the Random algorithm also works well considering the computation time. IG shows the worst performance because most of the time is spent on the evaluations required by the local search, and the performance does not improve significantly. GA shows the best performance among the four algorithms. This occurs because the population-based method can generate a variety of solutions by means of crossover and mutation from the best solution. The key parameters of the meta-heuristic algorithms were empirically determined. RD and SG require no parameters, and IG requires a destruction rate, which we set to 0.2 because it shows the best performance among four options [0.1, 0.2, 0.3, 0.4]. GA requires the crossover and mutation probabilities and the population size. We empirically found that a population of size $N$ generates approximately $N^2$ children and that evaluating all of these solutions is computationally expensive. Therefore, we set the population size to 10, which shows the largest performance improvement among five options [5,10,20,30,50]. All these parameters were tuned on problem instances with five robots and 50 tasks.

Reinforcement Learning
We test our model in five types of environments. We set the number of robots, the map size, and the maximum workload to 5, 100, and 20, respectively. The tasks are randomly distributed in a 100 × 100 continuous 2D space and the initial workload of each task is sampled uniformly from the range [1,20]. We tested 10, 20, 30, 40, and 50 tasks to compare the performance of the algorithms at various difficulty levels. We ran IG, SG, RD, and GA and trained our model on 100 problem instances. All the problem instances are simulated separately using a single Intel(R) Xeon(R) Gold 6226 @ 2.70GHz CPU for both the meta-heuristics and the RL-based model. The RL-based model is optimized using an NVIDIA Quadro RTX 6000 GPU. We ran each method for an hour and compared the minimum makespan. One of the most important hyperparameters of the deep learning model is the number of hidden dimensions. We report the experimental results with different numbers of hidden dimensions for 20 samples in Appendix A.2.
We plot the minimum makespan for 50 tasks in Figure 7. The results for the other numbers of tasks can be found in Appendix A.1. With 50 tasks, IG, SG, and RD converge early and their performance does not improve much, whereas the performance of GA improves over time. GA and RD find better solutions in the early steps than the RL-based search because we train the model starting from random initialization, and time is needed to train the initial weights. As training goes on, the proposed method (PPO-based) dominates the other methods and the makespan decreases continuously. Unlike GA, whose makespan decreases mostly in the early phase (before 500 s), the performance of our method improves smoothly and does not converge during the hour of training. Moreover, the variance over the 100 samples is greater for GA than for our method. Therefore, we conclude that the solution search ability of our model improves over time and that it is not biased toward a few problem instances. Although our model architecture (encoder-decoder) can be generalized to any number of robots and tasks, we do not pre-train the model, as the cooperative MRTA is more complex than the TSP or VRP and because pre-training requires an additional training algorithm. We leave this as a future research project.
In Table 1, we present the minimum makespan found after 20, 40, and 60 min for all the methods tested with the various numbers of tasks. GA performs the best among the meta-heuristic baselines, ahead of IG, SG, and RD. In the case of 10 tasks, GA shows similar performance (132.0 makespan) to our model (132.2 makespan). In the case of 20 tasks, GA performs best at 20 min (199.4 makespan), but our model performs better for most of the remaining time. When the number of tasks is 30 or more, our model shows significantly better performance than the other methods. As a result, we conclude that our model is highly effective for more complex problems.
With regard to the statistical results, we compare two result groups, PPO and GA, for 100 samples of 50 tasks. The p-value of the two-sided t-test is 1.7 × 10^−45, which suggests that the two samples have different average values. In addition, we measure the difference in performance for each sample, as shown in Figure 8. This allows the outcomes to be compared sample by sample. The difference in the makespan for each sample suggests that the proposed method finds better solutions than the meta-heuristics for all 100 samples.
Table 1. Mean makespan of 100 problem instances for various numbers of tasks and allocation algorithms. The bold text indicates the best makespan among the allocation algorithms.
Figure 8. Boxplot of the makespan gap (makespan_meta − makespan_PPO) between PPO and the meta-heuristic methods for each of the 100 samples with 5 robots, 50 tasks, and 1 h of training.
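The two quantities used in this comparison, a t statistic for the difference in mean makespan and the sample-wise gap, can be computed as sketched below. The text does not state which variant of the t-test was used; Welch's unequal-variance form is assumed here, and the function names and input values are purely illustrative.

```python
import math
import statistics
from typing import List, Sequence


def welch_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for the difference in means of two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


def samplewise_gap(meta: Sequence[float], ppo: Sequence[float]) -> List[float]:
    """Per-instance gap makespan_meta - makespan_PPO; positive means PPO is better."""
    return [m - p for m, p in zip(meta, ppo)]


# Made-up makespans for four instances, for illustration only.
meta_makespan = [10.0, 12.0, 11.0, 13.0]
ppo_makespan = [8.0, 9.0, 7.0, 10.0]

t_value = welch_t_statistic(meta_makespan, ppo_makespan)
gaps = samplewise_gap(meta_makespan, ppo_makespan)
ppo_better_everywhere = all(g > 0 for g in gaps)
```

Turning the statistic into a two-sided p-value requires the t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value for this variant.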
When the problem is simple (i.e., with ten tasks), GA slightly outperforms our model. This occurs because the search space is smaller for simpler problems, so near-optimal solutions can be found with a random exhaustive search. In contrast, the on-policy method improves on its current solution, so the solutions it finds can be biased by the randomly initialized policy. Meta-heuristic methods therefore perform relatively well on simpler problems. However, the performance difference with 10 tasks is marginal, and the proposed method significantly outperforms the baselines in more complex cases with more tasks.
We also present the baseline reward and the on-policy reward during training. As shown in Figure 9, the baseline reward works as a lower bound for the on-policy reward, and the two rewards increase at similar rates: an improvement in the policy results in an improvement in the baseline, and the policy is in turn evaluated against the improved baseline.
It is important to design an interpretable algorithm when devising a reliable autonomous allocation system. For example, the greedy algorithm selects tasks greedily, so its overall plan is easy to anticipate. However, the decisions of the neural network are not intuitive to interpret, despite the fact that the RL-based approach outperforms the baseline algorithms. We show this non-explainable property of our training scheme in Figure 10. We train on a single instance with five robots and 50 tasks. No visual patterns can be found despite the 38% performance gain from a makespan of 658 to a makespan of 412. Although the RL-based algorithm shows a significant performance gain on complex problems, it is still difficult to understand the intention of the neural network.
Figure 10. The disadvantage of end-to-end training. The first row represents the initial allocation of the PPO model and the second row represents the trained model. The makespan before training is 658, while the trained allocation has a makespan of 412. Even though the makespan is reduced by 38%, the interpretation of the model is not trivial. The darker color represents the later allocation.

Conclusions
In this paper, we solve the cooperative ST-MR-TA problem, which is more complex than other combinatorial problems. We propose an MDP formulation for this problem. Then, we suggest a robot and task cross-attention architecture to learn the allocation of robots to the remaining tasks. We compare the deep RL-based method with several meta-heuristics and demonstrate a trade-off between the computation time and performance at various levels of complexity. Our method outperforms the baselines, especially when the problem is complex. The main limitation of our study is that the model decisions are not interpretable. Therefore, explainable artificial intelligence for MRTA problems can be suggested as a future research direction.

The ability to find a better solution depends on the complexity of the problem. For a detailed comparison between the RL-based approach and the meta-heuristics, we plot the solution search progress for 10, 20, 30, and 40 tasks in Figure A1. As the problem becomes more difficult, the RL-based solution search outperforms the other baselines. When the number of tasks is 10, the performance gap between all the algorithms is small. GA reaches the minimum makespan in the early steps, while PPO takes more than 1000 s to reach similar performance. When the number of tasks is 20, PPO shows better performance than GA at 3600 s. However, the performance gap is not large, and GA may be preferable to PPO if time and performance are considered together. In the cases of 30 and 40 tasks, PPO already shows better performance at 600 s, and the performance gap between GA and PPO increases as training goes on.

Appendix A.2. Training Details
We present the hyperparameters for the model and the training in Table A1. For the robot and task embeddings, we used two linear layers to obtain a better vector representation. Our reward scheme is only valid when an episode is finished. Therefore, we used full episodes to train the model. For example, if the batch size is 32 and three sampled episodes span 31 time steps in total, we run another episode, so that the collected time steps exceed the batch size. We also present the effect of the model size in Figure A2. The training graph suggests that there is an appropriate model size for the problem. We also tested an A2C policy network, as shown in Figure A3. The result suggests that the A2C on-policy method shows a different learning curve.
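The full-episode batching rule described above can be sketched as follows. The names `collect_full_episodes` and `make_runner` and the transition format are hypothetical; only the rule itself (keep sampling whole episodes until the number of collected time steps reaches the batch size) comes from the text.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Transition = Tuple[int, int]  # hypothetical placeholder: (episode_id, step_index)


def collect_full_episodes(
    run_episode: Callable[[], List[Transition]],
    batch_size: int,
) -> Tuple[List[Transition], int]:
    """Sample whole episodes until at least `batch_size` time steps are collected.

    Episodes are never truncated, because the reward is only defined once an
    episode has finished. The result may therefore exceed `batch_size`.
    """
    steps: List[Transition] = []
    n_episodes = 0
    while len(steps) < batch_size:
        steps.extend(run_episode())
        n_episodes += 1
    return steps, n_episodes


def make_runner(lengths: Sequence[int]) -> Callable[[], List[Transition]]:
    """Build a stand-in environment that yields episodes of the given lengths."""
    pending: Iterator[Tuple[int, int]] = iter(enumerate(lengths))

    def run_episode() -> List[Transition]:
        episode_id, length = next(pending)
        return [(episode_id, t) for t in range(length)]

    return run_episode


# Three episodes give 31 steps (< 32), so a fourth full episode is added.
batch, episodes_used = collect_full_episodes(make_runner([10, 10, 11, 12]), 32)
```

In this example the batch ends up with 43 time steps from four episodes rather than exactly 32, which matches the behaviour described in the paragraph above.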

Comparison between Mixed Integer Programming and RL Environment
Before training in the RL environment, it is necessary to check whether the mixed integer programming formulation and the discrete-time step environment match. Therefore, we compare the solution of the branch and bound method for the mixed integer program with the solution of a meta-heuristic in the RL environment. We test on 100 random problem instances with 3 robots and 3 tasks, a map size of 100, and a uniform workload in [1,20]. There are (3!)^3 = 216 possible permutations of allocation. Even though meta-heuristic algorithms improve solutions gradually, they are not guaranteed to find the optimal solution. Hence, we run the meta-heuristics for a sufficient number of iterations: we set 500 iterations and a population size of 10 for GA. The makespan ratios of the GA and LP solutions are shown in Figure A4. The gap between the two solutions is small. The difference between LP and GA is shown in Figure A5. For some samples, the GA makespan is shorter than that of the optimal solution. That is, the makespan of the optimal solution in the MIP formulation is larger than that of the sub-optimal solution in the MDP formulation. This mismatch arises because we construct the RL environment based on the MDP formulation, while the optimal solution comes from the MIP formulation. Specifically, the MDP formulation treats time steps as discrete values, whereas the MIP formulation uses continuous values.
Figure A4. Note that there are some samples whose performance is better than the optimal solution, which represents the gap between the mixed integer programming formulation and the RL environment.
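The size of this small search space and the ratio used in the comparison can be checked with a short sketch. It assumes that each robot is assigned its own visiting order over all tasks, which is one reading of the count given above; the function names are hypothetical and the ratio inputs are made up.

```python
from itertools import permutations, product
from math import factorial


def count_allocations(n_robots: int, n_tasks: int) -> int:
    """Count joint plans where every robot has its own ordering of all tasks."""
    orders = list(permutations(range(n_tasks)))  # n_tasks! orderings per robot
    return sum(1 for _ in product(orders, repeat=n_robots))


def makespan_ratio(heuristic_makespan: float, reference_makespan: float) -> float:
    """Ratio of a heuristic makespan to a reference (e.g., solver) makespan.

    A ratio below 1 means the heuristic found a shorter makespan in its own
    environment than the reference reports, which can only come from a
    mismatch between the two formulations.
    """
    return heuristic_makespan / reference_makespan


n_plans = count_allocations(n_robots=3, n_tasks=3)

# Made-up values for illustration: the heuristic is 2% shorter than the reference.
ratio = makespan_ratio(98.0, 100.0)
```

With 216 joint plans the space is small enough to enumerate directly, which is why an instance of this size is convenient for checking one formulation against the other.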