End-To-End Deep Reinforcement Learning for Decentralized Task Allocation and Navigation for a Multi-Robot System

In this paper, we present a novel deep reinforcement learning (DRL) based method that performs multi-robot task allocation (MRTA) and navigation in an end-to-end fashion. The policy operates in a decentralized manner, mapping raw sensor measurements to the robot's steering commands without the need to construct a map of the environment. We also present a new metric called the Task Allocation Index (TAI), which measures how well a method that performs MRTA and navigation from end-to-end carries out the MRTA part. The policy was trained in a simulated Gazebo environment using the centralized learning, decentralized execution paradigm and was evaluated both quantitatively and visually. The simulation results show the effectiveness of the proposed method deployed on multiple Turtlebot3 robots.


Introduction
Classical multi-robot systems (MRS) usually have two main steps: task allocation and task execution. In the context of multiple homogeneous mobile robots whose tasks are reaching multiple goal positions, the two main MRS steps become multi-robot task allocation (MRTA) and navigation. The MRTA algorithm's role is to assign tasks to robots in order to decrease the total cost exerted by all of the robots [1]. On the other hand, the navigation part usually includes multi-robot path planning algorithms (MPPs) [2], which plan paths for multiple robots to avoid any collisions between the robots or with other obstacles. The problem that arises from decoupling the two algorithms is that most MRTA algorithms do not adjust for cost changes in the paths of the robots [3], which can occur for multiple reasons, for example, dynamic obstacles or partial observability.
Reinforcement learning (RL) is a branch of machine learning in which an agent interacts with the environment by receiving a state from the environment and then taking an action; a reward signal indicates the quality of the action. The goal of an RL algorithm is to learn, through trial and error, how to maximize not only the immediate reward, but also the subsequent rewards [4]. One of the drawbacks of classical RL algorithms is their lack of scalability with the dimensionality of the environment's state; thus, their use is limited to low-dimensional problems [5]. The rise of deep neural networks as a powerful tool that can learn compact features of high-dimensional data led to the introduction of the field of deep reinforcement learning (DRL) [5].
Researchers in the field of multi-robot systems (MRS) have been attracted to DRL due to its ability to handle high-dimensional data [6] and the fact that RL algorithms are model free and can handle nonlinear, stochastic dynamics and non-quadratic reward functions [7]. Researchers have used DRL to solve the MRTA problem, especially the fairness of allocation problem [8] and the dynamic task allocation problem [9]. The use of DRL to solve the navigation problem has received even more attention due to DRL's ability to learn decentralized policies that can drive an MRS towards its goal positions [10][11][12][13]. As far as we know, there is a gap in the literature: no proposed method combines both MRTA and navigation.
In this paper, we propose a novel method that combines both MRTA and navigation for a homogeneous multi-robot system. The method is a decentralized deep reinforcement learning policy whose inputs are the raw sensor measurements and whose outputs are the robot's steering commands. The output of the policy should drive the robot from its initial position to a unique goal position without colliding with other robots or obstacles. We demonstrate that our method learned to perform both MRTA and navigation from end-to-end effectively, and we compare our method to two methods from the literature that only perform decentralized navigation for a multi-robot system. Thus, this paper is a first step to address performing MRTA and navigation from end-to-end and shows promising results that can be built on to enhance the current state of multi-robot systems.
The main contributions of this paper are: • A novel method that performs MRTA and navigation for a homogeneous multi-robot system from end-to-end using deep reinforcement learning. • A new metric called the Task Allocation Index (TAI), which measures the MRTA performance of methods that perform MRTA and navigation from end-to-end.

Related Works
Recently, Zhu et al. [8] attempted to solve the fairness of allocation problem for a team of heterogeneous robots with different capabilities. The problem they attempted to solve is an ST-MR-IA MRTA problem. They modeled the capability of a robot as a resource used to accomplish the task. When each task in a group of tasks requests a group of resources, the problem of the fairness of allocation arises: each resource requester (task) should be allocated resources optimally to maximize the satisfaction of all resource requesters. Zhu et al. [8] developed a decentralized two-phase algorithm. The first phase is resource requester selection, which was modeled as a coordination game and solved using a deep Q-network (DQN) [14] to learn the optimal strategies. The second phase is a consensus based algorithm for forming a team of robots to solve the task. The method thus only solves the MRTA problem without addressing the navigation problem.
Chen et al. [10] used a DRL algorithm that performs decentralized multi-agent collision avoidance. The DRL algorithm they used is a value network that uses the agent's joint configuration (positions and velocities) and the joint configurations of the other agents to encode the estimated time to the given goal position. The advantage of using the value network is its ability to run in real time and consider the uncertainty in the other agent's motion to predict a collision-free velocity vector. Chen et al. [10] designed their algorithm to work in cases of unreliable communication between the agents, which requires the value network to inherently predict the other agents' intents. Hence, the other agents' states are calculated using the agent's sensor measurements. They modeled static obstacles as stationary agents. They formulated the problem as a partially observable sequential decision making problem. The value network was first trained by using supervised learning (back-propagation) on a training dataset consisting of 500 trajectories generated using the optimal reciprocal collision avoidance (ORCA) algorithm [15], then the trained weights were used as the initialization to the DRL training. The value network was trained using a modified Q-learning update [4]. A major drawback of the paper is that by treating obstacles as stationary agents, the input size to the neural network increases with the increase of the number of obstacles, leading to the network being impractical in environments with a large number of obstacles. It is obvious from the paper that the value network effectively performs navigation with both decentralized MPP and path tracking, but does not address the MRTA problem.
On the other hand, Lin et al. [11] developed a novel DRL method to navigate a team of robots through an unknown environment. The aim of the method is to drive the robot team such that its geometric centroid reaches a goal position while avoiding collisions and maintaining connectivity between the robots; the robots need to stay within a certain distance of each other to remain connected. Centralized learning with decentralized execution was used to train the policy neural network, of which each robot has a copy controlling it. The policy network takes the raw readings of a 2D laser scanner, the goals' relative position, and the other robots' relative positions as inputs and outputs continuous velocity controls. The reward function rewards reaching the goal position, maintaining connectivity, and achieving smooth motion, while penalizing collisions. They trained their policy network using proximal policy optimization (PPO) [16]. The paper only addressed making the geometric centroid of the robot team reach the goal position and did not address each robot's terminal goal position.
Long et al. [12] developed a decentralized sensor-level collision avoidance algorithm for an MRS. The policy is a deep neural network (DNN) that maps raw sensor measurements directly to a differential drive robot's linear and rotational velocities in order to drive the robot towards its goal position. The policy network is decentralized in that each robot has a copy of the network. At each time step, the policy network receives an observation consisting of the readings of a 2D laser scanner, the goal's position relative to the robot, and the robot's current linear and rotational velocities. The action space from which the policy network samples the output is continuous, with the linear velocity in the range [0.0, 1.0] and the rotational velocity in the range [−1.0, 1.0]. The problem is formulated as a partially observable Markov decision process (POMDP) [17]. They trained their neural network using proximal policy optimization (PPO) [16] and adopted the centralized learning, decentralized execution paradigm, in which each robot collects its own experiences (observations, actions, rewards) and the policy is trained on the collective experiences gathered by all the robots. The approach is similar to that of Chen et al. [10]: the policy network effectively performs navigation with both decentralized MPP and path tracking when the goal positions are set for each robot, but the paper did not address the MRTA problem.
Fan et al. [13] published subsequent work to Long et al.'s [12] work, in which they validated Long et al.'s policy in real-world setups. They made two observations. The first is that Long et al.'s policy does not perform optimally in some trivial situations in which there are no obstacles or other agents nearby or when a robot is near its goal position. The second is that although Long et al.'s policy has shown a very high success rate, it is still prone to collisions even though these collision events are rare. Thus, Fan et al. proposed integrating Long et al.'s policy into a hybrid control framework to further improve the policy's robustness and effectiveness. The hybrid control system is composed of three separate sub-policies. The first policy is a PID policy that is used in cases where there are no obstacles or other agents nearby or the goal position is very close to the robot's current position. The PID policy provides an almost optimal solution to move the robot towards the goal position. The second policy is called the conservative safe policy, which is used in cases when the robot is very close to other agents or obstacles. The conservative safe policy uses Long et al.'s policy, but scales down the laser scanner input and sets a maximum velocity to the policy's output. The third policy is the same as Long et al.'s policy and is used in cases that are different from that of the other two sub-policies. The results show that the hybrid policy managed to navigate an MRS in a warehouse-like environment with human workers around. The policy also managed to navigate a single robot safely in a dense human crowd. Yet, it did not address the MRTA problem.
In summary, we reviewed works from the literature that used DRL methods to solve either the MRTA problem or the navigation problem. Table 1 summarizes which works from the literature perform MRTA and which perform navigation and compares them with our proposed method. As far as we know, there has not been an attempt to combine both MRTA and navigation in a single algorithm. Table 1. A summary of the presented works from the literature that either perform multi-robot task allocation (MRTA) or navigation using DRL.

Problem Formulation
The problem of multi-robot task allocation and navigation is defined in the context of differential drive mobile robots moving in a 2D plane with both static and dynamic obstacles and other moving robots (all robots are homogeneous). The map is unknown; thus, each robot has a 360° laser scanner mounted on it. The goal locations are laid out in the environment, and the number of goal locations equals the number of robots, denoted by N. We use the term ego-robot to describe the robot from whose perspective the measurements are taken and whose perspective the policy uses to compute actions.
Each robot i, where 1 ≤ i ≤ N, at time step t receives an observation $o_t^i$ and calculates the output command $a_t^i$, which drives the robot from the start position $P^i$ towards the goal position $G^i$. The observation is drawn from a probability distribution $p(o_t^i \mid s_t^i, a_{t-1}^i)$, which depends on the underlying system state and the previous action. The observation for each robot is composed of four parts:

$o_t^i = \left[ o_t^e, o_t^o, o_t^l, o_t^{aux} \right]$ (1)

where $o_t^e$ is the relative positions of all the goals in the ego-robot's local polar coordinates, $o_t^o$ is the relative positions of all the goals in the other moving robots' local polar coordinates, $o_t^l$ is the 2D laser scanner measurements, and $o_t^{aux}$ is a vector containing the current time of the robot, measured from the beginning of the robot's motion, and the robot's previous action $a_{t-1}^i$. The observation provides only partial information about the robot's state and the other robots' states, which forces the robot to predict the other robots' intents.
Given the observations, each robot calculates its own action independently. The action is composed of two parts:

$a_t^i = \left[ v_t^l, v_t^r \right]$

where $v_t^l$ is the robot's linear velocity and $v_t^r$ is the robot's rotational velocity. The actions are sampled from two distributions produced by a policy $\pi$, which is copied across all the robots:

$a_t^i \sim \pi_\theta(a_t^i \mid o_t^i)$

where $\theta$ denotes the parameters of the policy. The output action guides the robot closer towards the goal that the policy selected while avoiding collisions on the way, within the time limit between two observations. The target allocation is implicit: the policy does not explicitly select one of the goals, but outputs the actions that drive the robot towards the selected goal position. The problem is formulated as a partially observable Markov decision process (POMDP) [17], in which each robot performs a sequence of observations and actions forming a trajectory $\tau^i$ from its start position $P_0^i$ to the goal position chosen by the policy, $g^i$, where $t_g^i$ is the travel time. The POMDP is episodic, and each episode ends either when one of the robots has a collision, when all the robots successfully reach unique goal positions, or when the maximum episode duration $T_{max}$ is exhausted.
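As a minimal sketch of how a robot samples its two-part action from the policy's two categorical heads, the snippet below draws one bin index per head from softmax-normalized logits; the bin count K and the logits themselves are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 21  # number of bins per action head (hypothetical value)

def sample_action(logits_linear, logits_rot):
    """Sample one categorical index per action head (linear and
    rotational velocity) from the policy's two softmax outputs."""
    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()
    k_lin = rng.choice(K, p=softmax(logits_linear))
    k_rot = rng.choice(K, p=softmax(logits_rot))
    return k_lin, k_rot

# Placeholder logits standing in for the policy network's outputs
k_lin, k_rot = sample_action(rng.normal(size=K), rng.normal(size=K))
```

Each sampled index is later mapped to a continuous velocity command, as described in the actions section.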
The main objective is to find an optimal policy that minimizes the expectation of the total arrival time of all the robots that successfully reach their goals:

$\underset{\pi_\theta}{\operatorname{arg\,min}} \; \mathbb{E}\left[ \sum_{i=1}^{N} t_g^i \right]$ (2)

Deep Reinforcement Learning Setup
Within the deep reinforcement learning setup, the focus is on learning a policy capable of controlling a robot in a decentralized manner within a multi-robot system. The policy selects one goal position among multiple candidates and produces a series of actions that gets the robot to that goal position while avoiding collisions. Inherently, the policy should learn the intents of the other robots, infer which goal position each of them is heading towards, and aim for a different goal position. The policy learns by going through episodes of collecting data.

Observation Description
The observation space, as mentioned in Section 3, consists of four components. For $o_t^o$, the first dimension is the number of the other robots, $N-1$, while the third dimension, $c$, represents each goal's distance and heading from each of the other robots in its polar coordinates. Since $o_t^o$ encompasses the information of all the goals relative to the other robots, it enables the ego-robot to predict the intents of the other robots and which robot is aiming towards which goal; this helps the ego-robot choose a unique goal position to aim for. $o_t^l$ is the 360° laser scanner data of the ego-robot, and it is a 2D tensor $o_t^l \in \mathbb{R}^{q \times 360}$; the second dimension holds the 360 distance values. $o_t^l$ enables the robot to detect both static and dynamic obstacles and helps it navigate without collisions. Finally, $o_t^{aux}$ is a 1D vector $o_t^{aux} \in \mathbb{R}^3$ representing the time elapsed since the ego-robot started to move and the previous action, which consists of the linear velocity $v_{t-1}^l$ and the rotational velocity $v_{t-1}^r$ of the ego-robot at the last time step. Since the policy is trained to minimize the expected total arrival time of all the robots, it requires an input that informs it of the time spent while moving. Having the previous action as an input informs the robot about its own intent at the last time step.
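The observation components above can be sketched as plain tensors; the number of stacked frames q is a hypothetical value, and b = 2 (one distance and one heading per goal) is an assumption inferred from the network-architecture description, not a value stated here.

```python
import numpy as np

N = 2  # robots (and goals), as in the training setup
q = 3  # stacked observation frames (hypothetical value)
b = 2  # distance and heading per goal (assumption)

# o_e: all goals in the ego-robot's local polar coordinates
o_e = np.zeros((q, b, N))
# o_o: all goals in each other robot's local polar coordinates
o_o = np.zeros((N - 1, q, b, N))
# o_l: q stacked 360-degree laser scans
o_l = np.zeros((q, 360))
# o_aux: elapsed time plus previous linear and rotational velocities
o_aux = np.zeros(3)

observation = {"o_e": o_e, "o_o": o_o, "o_l": o_l, "o_aux": o_aux}
```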

Actions' Description
The action space $a_t^i \in \mathbb{R}^2$ consists of two parts: the robot's linear velocity $v_t^l$ and the robot's rotational velocity $v_t^r$, where $a_t^i = [v_t^l, v_t^r]$. The policy network has two output heads, $h_{v_t^l}$ and $h_{v_t^r}$, each of which is a softmax distribution [4] representing $v_t^l$ and $v_t^r$, respectively. Each softmax head is a K-dimensional categorical distribution $h_{v_t^l}, h_{v_t^r} \in \mathbb{R}^K$, so $\arg\max h_{v_t^l}$ and $\arg\max h_{v_t^r}$ each lie in the range $[0, K-1]$. The category with the highest probability is converted to a continuous value within the range $[-1, 1]$ for each action. Equations (3) and (4) are used for the conversion.
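Since Equations (3) and (4) are not reproduced in the text, the snippet below sketches one plausible form of the conversion: a uniform linear mapping from a bin index in [0, K−1] to a continuous value in [−1, 1]. The exact mapping is an assumption, not the paper's equation.

```python
def index_to_velocity(k, K):
    """Map a categorical index k in [0, K-1] to a continuous value in
    [-1, 1] via a uniform linear mapping (assumed form of the
    conversion in Equations (3) and (4))."""
    return 2.0 * k / (K - 1) - 1.0

K = 21  # hypothetical number of bins
assert index_to_velocity(0, K) == -1.0          # lowest bin
assert index_to_velocity(K - 1, K) == 1.0       # highest bin
assert index_to_velocity((K - 1) // 2, K) == 0.0  # middle bin
```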

Reward Function
The reward function is designed to incentivize each robot to choose and move towards a unique goal position while minimizing the time consumed to reach the goal position. It penalizes collisions with obstacles or other robots and multiple robots reaching the same goal position. Equation (5) shows the reward r received by robot i at time step t.
where $(r_d)_t^i$ is used to incentivize the ego-robot to move towards a goal position. It is calculated in Equation (7); the intermediate variable defined in Equation (6) is used in Equation (7) to simplify it.
Equation (10) describes $(r_{col})_t^i$, which penalizes the ego-robot for collisions with obstacles or other robots. Equation (12) describes $(r_o)_t^i$, which rewards or penalizes the ego-robot for moving towards a unique goal position or towards a goal position that another robot is moving towards, respectively. $d_{oth}$ in Equation (11) is an intermediate variable used in Equation (12) to simplify it.
Finally, Equation (13) describes $(r_t)_t^i$, which penalizes the ego-robot in proportion to the amount of time it takes to reach a goal position. The ego-robot receives a negative reward $-r_{terminal}$ if the maximum episode duration $T_{max}$ is exhausted before it reaches a goal position.
All the constants $r_{terminal}$, $w_g$, $w_{oth1}$, $w_{oth2}$, $T_{max}$, and $w_t$ are set during experimentation, and the constant $R$ depends on the robot dimensions.
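Since Equations (5) through (13) are not reproduced here, the following is a minimal, hypothetical sketch of how a per-step reward with the structure described above could be assembled. The functional forms and all constant values are assumptions, not the paper's exact equations.

```python
def reward(d_goal_prev, d_goal, collided, same_goal_as_other, dt,
           w_g=2.5, w_t=0.01, w_oth=0.1, r_terminal=15.0, R=0.2):
    """Hypothetical per-step reward combining the four terms described
    in the text: progress toward a goal (r_d), collision penalty
    (r_col), unique-goal term (r_o), and time penalty (r_t).
    All constants here are illustrative placeholders."""
    r_d = w_g * (d_goal_prev - d_goal)   # progress toward a goal
    if d_goal < R:
        r_d += r_terminal                # large bonus for reaching a goal
    r_col = -r_terminal if collided else 0.0        # collision penalty
    r_o = -w_oth if same_goal_as_other else w_oth   # unique-goal term
    r_t = -w_t * dt                      # time penalty per step
    return r_d + r_col + r_o + r_t
```

Moving closer to a goal yields a higher reward than standing still, and any collision dominates the step's reward with a large negative value.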

Training Algorithm
The deep reinforcement learning algorithm used is proximal policy optimization (PPO) [16] in an actor-critic setup. We adopted the centralized learning, decentralized execution paradigm [12], in which each robot has a copy of the policy network $\pi_\theta$ and the value network $V_\phi$, with separate sets of weights $\theta$ and $\phi$, respectively. Algorithm 1 summarizes the entire data collection and training process. At each time step, each robot receives its observation $o_t^i$ and uses its copy of the policy $\pi_\theta$ to generate the action $a_t^i$. Each robot collects its data $(o_t^i, a_t^i, r_t^i)$ from the environment, and once the amount of data exceeds a threshold $T_{th}$, it sends its data rollouts to a centralized copy of both the policy and the value networks. The gradients of the objective function $L$ with respect to the centralized policy network weights $\theta$ and value network weights $\phi$ are then computed, and the Adam optimizer [18] updates the weights using the learning rate $l_r$ for $E$ epochs. After each update, each robot receives a copy of the updated weights $\theta$ and $\phi$ and starts collecting a new batch of data. Thus, the policy and value networks are trained on the experiences collected by all the robots simultaneously.
As shown in Algorithm 1, Line 27, the objective function $L$ is a combination of three objective functions. The first is the PPO clipped surrogate objective $L_{clip}$ with the generalized advantage estimator (GAE) [19] $A_t^i$ as the advantage estimator; the GAE uses the value estimate $V_\phi(o_t^i)$ as a baseline and $\lambda$ as the decay parameter. $L_{clip}$ is used to update the policy network. The second objective function, $L_v$, trains the value network $V_\phi$ to reduce the mean squared error between the returns $R_t^i$ and the value estimates $V_\phi(o_t^i)$. The third objective function is the entropy $L_e$ of the policy, which ensures that the policy performs sufficient environment exploration over the training period. Gradient ascent on the objective function is performed by applying gradient descent to its negative, $-L$.
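The combined objective can be sketched numerically as below; the clipping range eps and the weighting coefficients c_v and c_e are assumed hyperparameter names, since the exact combination weights are not given in the text.

```python
import numpy as np

def ppo_objective(ratio, adv, returns, values, entropy,
                  eps=0.2, c_v=0.5, c_e=0.01):
    """Sketch of the combined objective L: the PPO clipped surrogate
    L_clip, minus the value MSE L_v, plus the policy entropy L_e.
    Coefficient names and values are illustrative assumptions."""
    # Clipped surrogate: min of unclipped and clipped importance ratio
    l_clip = np.mean(np.minimum(ratio * adv,
                                np.clip(ratio, 1 - eps, 1 + eps) * adv))
    l_v = np.mean((returns - values) ** 2)  # value-function MSE
    l_e = np.mean(entropy)                  # mean policy entropy
    # L is maximized; in practice gradient descent is applied to -L
    return l_clip - c_v * l_v + c_e * l_e
```

With a ratio of 1, a unit advantage, and zero value error and entropy, the objective reduces to the surrogate term alone.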

(Algorithm 1: the centralized learning, decentralized execution training loop. Each robot's episode breaks when $T_i \ge T_{max}$, when a collision happens, or when every robot reaches a goal; data collection stops when $T_{total} \ge T_{th} N$, and the data rollouts are sent to train the centralized policy and value networks, which are updated for $E$ epochs.)

Network Architecture
The policy network $\pi_\theta$ takes the observation $o_t^i$ as the input and outputs the action $a_t^i$. Figure 1 shows the architecture of the policy network. There are four inputs to the policy network, each corresponding to one of the observation components, and each input has its own branch of hidden layers through which it passes. The input $o_t^l$ goes through the first branch, where it passes through a convolutional layer [20] consisting of 32 one-dimensional filters with a kernel size of 5 and a stride of 2; the second hidden layer in the branch is another convolutional layer identical to the first, except with a kernel size of 3. The third hidden layer is a flatten layer followed by a 128 neuron fully-connected layer [20]. The second branch takes the input $o_t^o$ and passes it through a reshape layer that reshapes it into the shape $\mathbb{R}^{(N-1) \times (q \times b \times N)}$; it is then passed through a 32 neuron fully-connected layer applied to the last dimension of the reshape layer's output, and the output of the fully-connected layer is flattened. The observation $o_t^e$ goes through the third branch, which is a single flatten layer. The fourth branch has no hidden layers; it passes the $o_t^{aux}$ input through to the next stage. The outputs of all the branches are concatenated and passed through a 256 neuron fully-connected layer. All the layers up to this point have ReLU activations [21]. The final layers are the two output layers, each corresponding to an action component of $a_t^i = [v_t^l, v_t^r]$, and each layer has K neurons with softmax activations [22]. The components of the action $a_t^i$ are then sampled from the two softmax distributions, as explained in Section 4.1.2. The value network $V_\phi$, shown in Figure 2, shares the same architecture as the policy network except for the output: the output layer of the value network produces the network's value estimate and is a single neuron fully-connected layer with no activation.
The value network has a separate set of weights from the policy network.
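The policy architecture described above can be sketched in TensorFlow's Keras API. The values of q (stacked frames), b, and K are placeholders, and details not specified in the text, such as the stride of the second convolution, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N, q, b, K = 2, 3, 2, 21  # q, b, and K are placeholder values

# Laser branch: two Conv1D layers, then flatten and a 128-unit dense
o_l = layers.Input(shape=(360, q), name="o_l")
x = layers.Conv1D(32, 5, strides=2, activation="relu")(o_l)
x = layers.Conv1D(32, 3, strides=2, activation="relu")(x)  # stride assumed
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# Other-robots branch: reshape, per-robot dense on last dim, flatten
o_o = layers.Input(shape=(N - 1, q, b, N), name="o_o")
y = layers.Reshape((N - 1, q * b * N))(o_o)
y = layers.Dense(32, activation="relu")(y)
y = layers.Flatten()(y)

# Ego-goals branch: a single flatten layer
o_e = layers.Input(shape=(q, b, N), name="o_e")
z = layers.Flatten()(o_e)

# Auxiliary branch: passed through unchanged
o_aux = layers.Input(shape=(3,), name="o_aux")

# Shared trunk and the two K-way softmax action heads
h = layers.Concatenate()([x, y, z, o_aux])
h = layers.Dense(256, activation="relu")(h)
head_v = layers.Dense(K, activation="softmax", name="v_linear")(h)
head_w = layers.Dense(K, activation="softmax", name="v_rot")(h)
policy = Model([o_l, o_o, o_e, o_aux], [head_v, head_w])
```

The value network would reuse this trunk layout but replace the two softmax heads with a single linear output neuron.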

Simulation Environment
We developed a simulation environment to train and test our deep reinforcement learning method. The robots used were Turtlebot3 Waffle Pi robots [23]. We adapted the simulation environment built by the Turtlebot3 team and modified it to support multiple robots and goal positions. The simulation environment was built using the Gazebo simulator [24], as shown in Figure 3. During training, ten simulation environments were launched in parallel, each containing two robots (N = 2), for a total of 20 robots. Each robot had a copy of the policy and value networks and used them for data collection. After the data were collected, they were sent to a centralized copy of the policy and value networks to be updated; the updated weights were then sent back to each copy.

Training Configuration
The algorithm was implemented using TensorFlow [25] on an NVIDIA DGX-1 station. The training took 94 h (about 8 k iterations). The frequency of computing the actions on which the policy was operating was set to 5 Hz. Table 2 shows the hyperparameters used during training. The evolution of the reward over training is shown in Figure 4.

Evaluation and Testing
The overall performance of the policy can be evaluated using the same quantitative metrics used by Long et al. [12]: • Success rate: the ratio of the number of robots that reach unique goals successfully during the episode to the total number of robots (the higher the better). • Extra time: the difference between the robot's actual travel time and the lower bound of the travel time, i.e., traveling along the straight line towards the goal (the lower the better). • Extra distance: the difference between the robot's traveled distance and the straight-line distance to the goal (the lower the better). • Average speed: the average speed of the robot during the episode (the higher the better). The metrics were measured over all the robots and all episodes during the evaluation to remove any effects of the variance in the goals' positions.
We categorized the robot collisions into three categories. Category A is a collision with an obstacle; Category B is one robot colliding with another robot that reached a goal position; and finally, Category C is two robots colliding together while moving in free space. These three categories cover 100% of the collision events that might happen in our setting. This categorization can be used to analyze the policy's performance when it fails and what kinds of collisions tend to happen. Such an analysis can give us insight into the behavior that the policy was trained to perform.
As our method combines MRTA and navigation, the categorization of collisions enabled us to develop a new metric that can measure the performance of our method, and of similar future methods, in performing MRTA. This metric measures MRTA performance by measuring the policy's failures in performing MRTA. We call this new metric the Task Allocation Index (TAI). It depends on the ratio of episodes ending with Category B collisions to the total number of episodes, because a Category B collision, in which two robots reach the same goal, is a clear failure of MRTA. The TAI also depends on the ratio of goals whose information is input to the policy to the total number of goals in the environment, as this gives a higher score to policies that perform task allocation over a larger share of the goals available in the environment. Equation (14) shows the formula for the TAI (the higher the TAI value, the better). For policies that only perform navigation and do not perform MRTA at all, the TAI value is zero. where: • $N_i$: the number of goals whose information is input to the policy. • $N_g$: the total number of goals. • $N_{CategoryB}$: the number of episodes ended by Category B collisions. • $N_{Episodes}$: the total number of episodes.
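Since Equation (14) itself is not reproduced in the text, the snippet below implements one plausible form of the TAI consistent with its stated properties: it grows with the fraction of goals the policy sees, shrinks with the fraction of episodes ending in Category B collisions, and is zero for navigation-only policies that see a single preassigned goal ($N_i = 1$). This exact formula is an assumption, not the paper's equation.

```python
def task_allocation_index(n_i, n_g, n_cat_b, n_episodes):
    """Hypothetical TAI form: a goal-coverage factor that vanishes
    when only one goal is input (navigation-only policies), scaled
    down by the share of episodes ending in Category B collisions."""
    goal_ratio = (n_i - 1) / (n_g - 1) if n_g > 1 else 0.0
    return goal_ratio * (1.0 - n_cat_b / n_episodes)

# Navigation-only policy: one preassigned goal, so TAI is zero
assert task_allocation_index(1, 2, 0, 100) == 0.0
```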
We compared our method with the methods proposed by Long et al. [12] and Fan et al. [13]. These works were chosen because of the similarity of their objective to ours: performing navigation for multi-robot systems using deep reinforcement learning. The difference is that our policy selects which goal position to drive the robot towards, while the methods of Long et al. [12] and Fan et al. [13] need the goal position to be preselected for each robot. Thus, our method performs MRTA and navigation from end-to-end, while the other methods only perform navigation from end-to-end. The second reason these methods were chosen is that both papers reported their performance metrics. Long et al. [12] reported better performance than the NH-ORCA method over most test cases. We trained the method proposed by Long et al. [12] on our simulation setup with the same kind of robot we are using; their method was implemented using the same setup, network architecture, and hyperparameters proposed in their paper. The training process took around 22 h. For the method proposed by Fan et al. [13], we used the same policy that we trained for Long et al.'s [12] method as part of the hybrid control system, similar to what Fan et al. [13] did in their paper.
To test and evaluate each method, we repeated the evaluations for 2 h. The goals' positions were changed every time each robot reached a unique goal or after five episodes had passed since the last goal position change ($E_g = 5$). The results are shown in Figure 5. It shows that the methods are on par with each other, with a slight advantage for the method proposed by Long et al. [12] over the other two methods in terms of the success rate. In terms of extra time and extra distance, the method proposed by Fan et al. [13] has an advantage over the other two methods. This advantage is attributed to the PID policy of the hybrid control system, which drives the robot directly towards the goal position if there is no obstacle nearby or if the goal position is closer to the robot than the nearest obstacle. On the other hand, the TAI metric highlights the advantage of our method over the other two, since ours is the only one that performs both MRTA and navigation from end-to-end. Our method takes as input information about all the goals in the environment ($N_i = N_g$ in Equation (14)); thus, the only factor affecting the value of the TAI metric in this case is the percentage of episodes ending with Category B collisions, $N_{CategoryB}$, as shown in Table 3. The TAI metric for both Long et al.'s [12] and Fan et al.'s [13] methods is zero, since neither method performs MRTA and each only has input information about the preassigned goal position for each robot ($N_i = 1$). Table 3 shows the statistics related to the number of episodes and the way they ended for each method. All methods have a very low percentage of episodes ending with collisions, yet Fan et al.'s [13] method has a slight advantage. We attribute our method's higher collision rate to its tendency to move the robots at an average speed higher than that produced by the other methods, as shown in Figure 5, which can lead to more collisions.
Table 3 also shows the percentage of occurrence of each collision category during testing. It shows that all methods managed to avoid collisions of Category C. Our method tends to have collisions of Category B about five times more often than Category A, which indicates that the policies on both robots chose the same goal position to move towards. Samples of both categories of collisions are visualized in Figure 6. On the other hand, Long et al.'s [12] method tends to have slightly more collisions of Category A than of Category B. Most of the Category A collisions occur when the goal position is near the obstacle wall, which means that the policy drives the robot to approach such goals at an angle that exposes it to collision. Alternatively, the Category B collisions occur when the other robot reaches a goal position that lies between the ego-robot and its assigned goal position, as shown in Figure 6, which indicates that the policy failed to avoid a stationary robot in front of it. As for Fan et al.'s [13] method, the only instance when a collision occurred was when a certain goal configuration appeared during testing. This collision instance is of Category A, as shown in Figure 6. It happens when a goal position is spawned behind Robot 1 while the conservative safe policy of Fan et al.'s [13] hybrid control system is controlling the robot. Figure 7 shows the distribution of collisions within the environment; it reinforces the analysis of both Table 3 and Figure 6 by showing where each method's collisions are concentrated. In Figure 8, we visualize three different successful episodes from testing for each method. The visualization shows the trajectory of each robot, color coded. It is clear from the graph that the methods of Long et al. [12] and Fan et al. [13] tend to move the robots along straighter trajectories, while our method tends to produce trajectories that are more irregular and sometimes overshoot the goal positions.
Figure 9 shows the distribution of goals reached by each robot throughout the environment during testing for our method. Robot 0, which is spawned near the upper right corner of the environment, tends to reach the goals positioned in the environment's upper half, while Robot 1, which is spawned in the lower left corner, tends to reach the goals positioned in the lower half. These observations indicate that the task allocation strategy learned during training is to split the environment into two halves, with each robot attempting to reach the goals available in its half.
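The learned strategy can be paraphrased as a spatial partition: each robot claims the goals in the half of the environment nearest its spawn point. A hand-coded approximation of that emergent behavior (the spawn positions and goal coordinates below are illustrative, not taken from the experiments):

```python
def allocate_by_half(goals, spawns):
    """Assign each goal to the robot whose spawn point is nearest,
    approximating the learned 'split the environment in half' strategy.

    goals:  list of (x, y) goal positions.
    spawns: dict robot_id -> (x, y) spawn position.
    """
    def dist2(a, b):
        # Squared Euclidean distance; monotone, so fine for nearest-point tests.
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return {goal: min(spawns, key=lambda rid: dist2(spawns[rid], goal))
            for goal in goals}

# Upper-right goal goes to the upper-right robot, and vice versa.
assignment = allocate_by_half(
    goals=[(1.5, 1.5), (-1.5, -1.5)],
    spawns={"robot_0": (2.0, 2.0), "robot_1": (-2.0, -2.0)},
)
```

Unlike this fixed rule, the learned policy arrives at the same partition implicitly, from raw sensor input, without any explicit assignment step.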
To analyze the reward function for our method introduced in Section 4.1.3 and to evaluate the effects of the values of its constants shown in Table 2, Figures 10 and 11 visualize the reward function during a successful episode and during an episode that ended with a collision, both produced by our method. For the successful episode in Figure 10, the reward r_d, described in Equation (7), which incentivizes the ego-robot to move towards a goal position, has a positive value for both robots as each of them approaches its goal position, and takes a large terminal value once each robot reaches a unique goal position. An exception occurred at the beginning of the episode for Robot 1, since the policy moved it away from the goal position for the first 10 time steps. As for the other reward terms: since there were no collisions, the reward r_col, described in Equation (10), had a zero value for both robots, and since neither robot moved towards the same goal position as the other, the reward r_o, described in Equation (12), had a positive value for both robots.

On the other hand, the episode that ended in a collision is shown in Figure 11. The collision that occurred in this episode is Category B. Since Robot 0 reached a goal position successfully, its aggregate reward showed positive values throughout its movement, and when it reached the goal position, a large terminal reward was given. Robot 1, however, moved towards the same goal position as Robot 0 and collided with it. This behavior is reflected in the reward r_o, described in Equation (12), which took on a negative value once Robot 1 started to move towards the same goal position as Robot 0.
Once the collision occurred, Robot 1 received a large negative terminal reward from r_col, described in Equation (10). The combination of all the reward terms led to Robot 1's aggregate reward having a negative value for most of the time steps.
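The qualitative behavior of the reward terms in Figures 10 and 11 can be summarized in a single per-step function. The exact forms of Equations (7), (10), (12), and (13) are not reproduced in this excerpt, so the expressions and constants below are illustrative stand-ins that merely reproduce the signs and terminal magnitudes discussed above; in particular, treating r_t as a small constant time penalty is our assumption:

```python
def step_reward(progress: float, collided: bool,
                reached_unique_goal: bool, same_goal_as_other: bool) -> float:
    """Illustrative per-step aggregate reward; signs and structure only,
    NOT the paper's exact Equations (7), (10), (12), (13).

    progress: decrease in distance to the chosen goal this step (metres).
    """
    r_d = 2.5 * progress               # positive while moving toward a goal
    if reached_unique_goal:
        r_d += 15.0                    # large terminal bonus on success
    r_col = -15.0 if collided else 0.0 # large terminal penalty on collision
    r_o = -0.1 if same_goal_as_other else 0.1  # penalize goal conflicts
    r_t = -0.01                        # assumed small per-step time penalty
    return r_d + r_col + r_o + r_t
```

With these signs, an agent making steady progress toward an uncontested goal accumulates positive reward, while one that commits to the other robot's goal and collides ends the episode deeply negative, matching the traces in Figures 10 and 11.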
Figure 10. Visualizing the rewards collected by each robot using our method in an episode that ended successfully. (Panels: the motion of the robots; the individual reward values and their aggregate for Robot 0; the individual reward values and their aggregate for Robot 1.) The trajectory of Robot 0 is shown in red, while the trajectory of Robot 1 is shown in green.
Figure 11. Visualizing the rewards collected by each robot using our method in an episode that ended with a collision. (Panels: the motion of the robots; the individual reward values and their aggregate for Robot 0; the individual reward values and their aggregate for Robot 1.) The trajectory of Robot 0 is shown in red, while the trajectory of Robot 1 is shown in green.

Conclusions
In this paper, we present a novel decentralized sensor-level policy that performs multi-robot task allocation and navigation end-to-end. We train and test it in a simulation environment using a simulation of a real robotic platform. We also propose a new metric, the Task Allocation Index (TAI), which measures how well a method that performs MRTA and navigation end-to-end carries out the MRTA part. Our method is compared with two other DRL-based methods that only perform navigation for a multi-robot system. It shows results on par with these methods, with a slight advantage for the other methods in terms of success rate, extra time, and extra distance. On the other hand, our method outperforms the other two in terms of the TAI, as they do not perform MRTA at all. Our work serves as a starting point towards building a robust system that performs task allocation and navigation end-to-end. Future work includes training our method on more complex environments and validating it in experimental setups. We also plan to use more advanced network architectures to increase performance and to adapt methods from natural language processing so that the neural network architecture can accept observations from a variable number of robots in the system.