Multi AGV Coordination Tolerant to Communication Failures

: Most path planning algorithms used presently in multi-robot systems are based on of-ﬂine planning. The Timed Enhanced A* (TEA*) algorithm gives the possibility of planning in real time, rather than planning in advance, by using a temporal estimation of the robot’s positions at any given time. In this article, the implementation of a control system for multi-robot applications that operate in environments where communication faults can occur and where entire sections of the environment may not have any connection to the communication network will be presented. This system uses the TEA* to plan multiple robot paths and a supervision system to control communications. The supervision system supervises the communication with the robots and checks whether the robot’s movements are synchronized. The implemented system allowed the creation and execution of paths for the robots that were both safe and kept the temporal efﬁciency of the TEA* algorithm. Using the Simtwo2020 simulation software, capable of simulating movement dynamics and the Lazarus development environment, it was possible to simulate the execution of several different missions by the implemented system and analyze their results.


Introduction
The constant technological development felt at present creates a need for a constant adaptation on the part of the industrial sector in order to fulfil the demands of the corporate market. To remain competitive, industries must generate a demand for new innovative solutions in an attempt to create value. These solutions do not always reflect a direct valorization of the final product. In most industries, the production costs have a significant impact in their market competitiveness, which leads to a significant evolution of the automated systems. These systems grant the possibility of reducing the labour cost and simultaneously optimizing production time. As a product of this evolution, Automated Guided Vehicles Systems (AGV) were created and saw their first use in an industrial environment in 1954. Since then, the use of this type of system has seen a steady increase, and is a common sight in industry at present [1].
Currently, the AGV are predominantly used in the moving of products, being mainly used as a mean of transporting material between the production lines and the storage sectors. However, their use is not restricted to the industrial sector; AGV, for example, can also be used in hospitals and distribution centres. The mass use of AGV creates questions about their efficiency and productivity in scenarios where multiple robots are operating in restricted environments and exposed to communications faults.
The coordination of a fleet of autonomous vehicles is a very complex task, most multi-robot systems rely currently on static and pre-configured interactions between the robots [2]. When the unpredictability associated with communication flaws is added, this task becomes even more difficult. Due to this fact, the study of trajectory planning

State of the Art
The coordination of a fleet of robots falls under the Multi-Agent Path Finding (MAPF) category of problems. This problems comprise of finding a set of paths for the agents that are encompassed by the system. These paths have the objective of moving the agents from their current vertices to their targeted vertices while avoiding conflicts and reducing as much as possible the cost for the movement of the agents [6]. MAPF has practical applications in video games, traffic control, and robotics [7]. Solving MAPF problem in an optimal way is NP-Hard (non-deterministic polynomial-time hardness) in terms of complexity [8,9].
Presently, several different methods are commonly used to plan the movement and actions of fleet of mobile robots. This methods can range from the use of potential fields [10] to the use of Reinforcement Learning [11,12].
From all these methods, the ones based on Graph Search algorithms are of special importance not just because they can discover the possible paths but also because they allow the discovery of the most efficient possible paths for the robots [13].
As the name itself suggests, these methods consist of two stages: the initial decomposition of the workspace into a graph, and the application of a graph search algorithm in order to discover efficient paths for the robots. The initial decomposition of the workspace, in most cases, can be easily achieved by either using an approach based on Roadmaps [14,15] or Cell Decomposition [16].
After obtaining the graph, necessary the implementation of a method of searching it is necessary. Of particular importance in this field is the A* algorithm and its derivatives [17]. The standard A* algorithm, introduced by [17], uses both the cost already incurred from the initial node and the cost until the finish node. This characteristic allows this algorithm to obtain optimal and complete solutions.
Later, new variants of the A* algorithm were developed with the intention of being used in Multi-AGV systems. Such as the Dynamic A* algorithm proposed by [18], this algorithm stands out as it is capable of operating in situations where the environment is unknown or just partially known.
Another notable variant is the Lifelong Planning A* algorithm, also called LPA*, proposed by [19]. This algorithm allows the use of previous information in order to reduce the processing time. More recently, in 2011, the 3D A* algorithm was introduced by [20], which adds the temporal dimension, allowing for a coordinate representation of (x,y,t). This detail allows for the representation of stationary movement, where the x and y coordinates maintain the same through iterations and the t coordinate is increasing.
Another variant of the A* algorithm is the M* [21], this variant was created since planning trajectories for large numbers of robots is very computationally expensive. Therefore, the M* algorithm allows for the creation of a relatively cheap path for the robots by sacrificing a small portion of the overall efficiency of the system. Although this method does not generate the most efficient solution possible for the robot routing problem, it generates good enough solutions that avoid conflicts between the robots.
Recently, a new graph search algorithm was introduced for the implementation in Multi-AGV systems. The Time Enhanced A* algorithm, also known as TEA*, introduced by [3] and later modified by [4], is a graph search algorithm created due to the need for an algorithm suitable for Multi-AGV systems, which avoids collisions, deadlocks and guarantees the efficient execution of a set of tasks [3]. The TEA* algorithm was created by introducing the concept of temporal layers to the A* algorithm, which gives the possibility of the algorithm of executing the path planning in an online mode [3] using the estimation of the position of the robots for any given time instance. This makes the TEA* algorithm capable of taking into account possible changes that can occur in the environment, unlike its predecessor. The TEA* algorithm is also capable of dynamically shifting the priorities of each robot in order to resolve possible conflict situations [4], unlike the static priorities used by [22].
All the methods based on graph search algorithms referred to above rely on a centralized architecture [2,23], which expands on the idea of a decentralized multi-robot system by using a service-oriented architecture. However, this idea is still in its infancy.
In the field of tolerance to communication faults, most advances are focused either on regeneration of communication faults [24] or in applying estimation methods to deal with sudden sporadic communication faults [25]. However, there are few studies on multi robot systems that are capable of dealing both with sporadic faults, as well as, dealing with entire zones of the environment where there is no communication with the robots.

Overall System Architecture and Lower Level Control
To better comprehend the function and implementation of both the TEA* algorithm and the supervision module, the system's overall architecture is explained, as well as its overall composition.
The implemented system is based on a centralized control architecture [23]. This configuration was chosen, since the TEA* Graph Search Algorithm needs prior knowledge of all the current robots positions in order to correctly generate a set of paths for all the robots controlled by the systems [4].
The implemented system is responsible for controlling the movement of a team of robots. This team of robots would consist of four robots similar to the ones used in the Factory lite 2019 edition [26]. These robots locomotion system is made up of two differential traction wheels and are simulated using the Simtwo2020 simulation software (CRIIS, Porto, Portugal), created by Paulo Costa. This software is capable of simulating differential traction robots [27], as well as already having implemented the simulation model for the robots from the Factory Lite competition [26], these robots were developed and built by Centre for Robotics in Industry and Intelligent Systems (CRIIS).
In Figure 1, a top level representation of the implemented system is shown. As seen in this figure, the implemented control system is comprised by three core modules. The TEA* adapted library and supervisor modules located in an application that represents the central control unit and the trajectory control module located in each robot control unit. Information is transmitted between the four robots and the central controlling application via the UDP/IP communication protocol. The central control unit uses the interaction between the supervision module and the modified TEA* algorithm, to control each robot planned path, this path is then transmitted to the robot. The robot control algorithm is only responsible for controlling the velocities of each of the robots wheels, so that it can accurately execute its assigned path.
The simulation of communication faults in this simulated environment was executed by a special module that was implemented between the simulation platform and the centralised control platform. In the case of static communication faults, this module will analyze the position of the robot and checks if it is within the area defined by the user. If this condition is verified, the module will not transmit any data regarding that robot to the control system. For sporadic faults, this module uses a fixed probability to decide whether to transmit or not the data to the central control unit. It is also capable of simulating sporadic faults within a defined area or in the entire environment. All the data used by this module is isolated from the rest of the system to maintain the validity of the experiment results.

Modified TEA* Algorithm
As described before, several modifications to the algorithm presented by [4] were implemented during the design and creation of the central control unit. Three major modifications were implemented into this algorithm, these modifications are the following: • Implementation of a Task Scheduling function; • Modification of the method that associates the current robot position with a node from the graph; • Implementation of an algorithm that creates a binary semaphore when a communication fault is detected.
These modifications had the objective of allowing the application of the algorithm into industrial scenarios, and also allowing the system to safely plan paths for the robots during communication faults.

Task Scheduling
The main difference between the method used by [4] and the implemented Task Scheduling function, was that the later would move the robot to a charging post when all of the tasks assigned to the robot were concluded. This would allow for the robot to recharge while no new tasks are assigned to it.
To make the robots move to their charging stations when no tasks are assigned to them, both the function that obtains the next task, as well as the function that adds new tasks to the robot need to be modified.
When a robot reaches its target point, this new function checks if there are anymore tasks assigned to the robot and in the case that there are no more missions, it orders the robot to move to the closest recharging station; this attribution is represented via a flowchart in Figure 2. When a new task is added, the system checks whether the robot is either moving to or stopped at a charging station or if the robot is executing another task. If it ends up being the latter, it will add the new task to the subtask array as the last element and will also increment the number of tasks. Otherwise, it will change the target point to the new task node and add that task to the subtask array. It is also necessary to increment both the current task point as well as the number of tasks.
At the beginning of each planning cycle the nodes corresponding to the positions of all robots that are currently charging, will be considered to be "Obstacle Inaccessible Robot". This new node status represents a node that is currently occupied by a robot which is currently recharging or that has suffered a communication fault, therefore meaning that this robot cannot be moved to give way to another robot. This allows the planning of path around the position of these robots and avoiding the movement of these robots.
For this reason, the recharging positions should be placed in locations where they will not block traffic, such as at the extremities of dead end pathways. After a robot has completed all of its tasks, the robot will move to the nearest unoccupied recharging station and will remain there until another task is assigned to it.

Planning during Communication Faults
It is also necessary to modify the TEA* library so that it is able to plan safe and efficient paths even when one or more robots are experiencing communication faults. The library created by [4] does not take into account this possibility, therefore, when one or more robots are subjected to communication faults, it is not able to guarantee a safe and efficient planning, having a high probability of occurring collisions and deadlocks.
The method implemented in this project, in order to create a system robust to communication faults and to areas where there is no communication, consists in the interaction between both modules of the central control unit.
Each communication fault will have a set of nodes associated with it, these nodes are generated by the supervision module and their generation is explained in Section 5. Then the modified TEA* library places the nodes associated with the current active faults as immovable obstacles, for all robots and during all steps of the planning, until the affected robot exits the communication fault area and communication can be re-establish. This prevents any other robot from entering the communication fault zone. Therefore, creating a binary semaphore [28] that regulates the traffic of robots entering and leaving the zone.
The new library will also not plan any new paths for the robots experiencing communication faults, this serves to reduce the risk of the TEA* algorithm inadequately assuming that these robots are executing a different path than the one they are actually taking, which could lead to a loss of efficiency when planning the paths of the remaining robots.
This additional control is important since when a robot is experiencing a communication fault situation there are no guarantees that any alterations to the planned path will reach and be executed by the robot, therefore it is necessary to limit the presence of several robots inside the same communication fault zone. However, in some cases this is impossible as it is shown in Section 6.

Supervision Module
This module is responsible for controlling when the robots paths need to be replanned and for detecting, measuring and handling communications faults. This module is composed by two sub-modules hierarchically related with each other. Each of these submodules is responsible for handling one of the tasks mentioned earlier. These sub-modules will be referred as the Planning Supervision Sub-Module (PSSM) and the Communication Supervision Sub-Module (CSSM), respectively.
The Figure 3 represents the hierarchical relation between all the parts that constitute the implemented control system. Represented in blue in this figure are both the submodules that constitute the System Supervision Module. In this figure, it is also possible to observe that the CSSM is located above the PSSM, showing that the CSSM commands have a higher authority than the PSSM ones.
The relation between the two sub-modules is based on a loop and interrupt-based control. On environments where no communication faults occur the CSSM does not intervene and the replanning of paths is handled by the PSSM. In this case the CSSM's only job is to detect that a communication fault is occurring. However if a communication fault is detected, the PSSM is overruled and the replanning of the robots paths is then controlled by the CSSM. When the fault has ceased, the control is handed back to the PSSM.

Planning Supervision Sub-Module (PSSM)
This sub-module is responsible for controlling when the robots need to have their path replanned in situations where no communication fault is detected. In order to accomplish this task, it needs not only to detect when one of the robots is delayed or ahead of time but also detect when a robot has completed the current step of his path.
Firstly, it is necessary to define when a robot is considered delayed or ahead of time. When using the TEA* algorithm, it is necessary that all robots are synchronized, meaning that all robots must be in the same step of their path, in order to avoid collisions and deadlocks. Therefore, delayed and ahead of time robots are robots that are one or more steps behind and one or more steps forward than the remaining robots.
However, in practice, this is not so linear since it would be impossible to synchronise all the robots in the system, thereby originating a very ineffective system where the paths are constantly being replanned. For this reason, a detection system must be created that allows small delays when those delays do not constitute a problem for the overall safety of the robot. To accomplish this goal, the PSSM executes three different verifications every time the robots position is updated. These verifications are useful to decide if it is necessary or not to replan the paths for all robots .
• Check whether any robots are too distant from their supposed position; • Check whether the maximum difference between steps is 1; • Check whether any robot is moving to a position currently occupied by another robot.
However, before any of these verifications can be executed it is necessary to know the type of movement the robot is executing. This movement could be classified into one of three categories: Robot is currently moving along a link.
After obtaining the type of movement of the robot, the sub-module checks all of the robots current steps in order to obtain the lowest step value possible. It then compares the current robot coordinates to the coordinates that the robot would have if it was currently executing the same step as the robot with the lowest step value. If the distance between this coordinates is greater than the stipulated threshold, the robots paths are replanned.
However, this function only checks for robots that are currently moving between nodes, not taking into consideration any delays during the rotation or even during communication with the robot. Another problem of this function is that it assumes that all the links have a dimension superior to the stipulated threshold.
Therefore, another function is also required to guarantee that the robots that are rotating or stopped are also synchronized with the other robots, as well as, guarantee that the synchronization and safe planning of paths for the robots can also be executed, even if the link dimension is shorter than threshold distance. This function checks whether the maximum step different between all the robots in the system is superior to one. If this condition is verified it means that the robots are not synchronized therefore their paths must be replanned.
Even in this example, the desynchronization was artificially induced; in an industrial environment it can be caused by a lot of factors, from small differences in the wheels diameter to differences in the grip coefficient.
In some cases, even a small desynchronization can cause a collision, as illustrated in Figure 4. In these cases, the most viable solution in order to maintain the safety of the system is to constantly replan the robots' paths as soon as the position of one of the robots is updated. Therefore the sub-module must detect these cases and act accordingly. To detect these specific cases, a new function was implemented; this function checks whether the node that corresponds to the current step of the robot path is equal to the current position of any of the remaining robots. In the case of a robot being currently located in a link between two nodes, the function checks the coordinates of both nodes that are part of the link, in order to guarantee that no other robot is moving to those nodes. If any of the described conditions happen, the robots paths are immediately replanned so that collisions are avoided.
When no replanning of the robots path is being executed, this sub-module needs to detect when a robot has completed its current path step and increment that step. However, these points can be a problem in the case where some robots are waiting for other robots to finish their movement. Therefore, it is necessary to stipulate when the steps of these robots can be increment in a safe way.
If the robot movement is a rotation or if the robot is travelling between two nodes, determining when the robot has finished its current step of the planned path can be done by comparing the current robot position and orientation with the desired robot position and orientation at the end of that step.
In the case where the robot has stopped and is waiting for other robots to finish their movement, the robot step counter only increments when all moving robots have finished that step, preventing the movement of the robot before the path is cleared. This method assumes the worst case scenario where the movement of the stopped robot depends on the movement of all the moving robots of the system. It would be relatively costly in computational terms, to check all of the systems robots path and determine which robots needs to move in order to allow the movement of the stopped robot.

Communication Supervision Sub-Module (CSSM)
In order to create a system robust to communication faults the Communication Supervision Sub-Module (CSSM) was implemented. It is responsible for detecting communication faults, calculating their size and forcing the replanning of the robots paths to take into consideration said faults. This sub-module will predominantly be faced with two different fault situations: In either case, the system will not have any prior knowledge of the existence and location of these faults. For this reason it will have to constantly map the factory floor map, since faults can be dynamically created and destroyed. This allows the modified TEA* algorithm to then plan a path accordingly, as explained in Section 4.2. Figure 5 shows a flowchart of the normal functioning of this sub-module. On the detection of a communication fault in one or more of the robots, the CSSM generates an interrupt, where the normal functioning of the central control unit is halted and the size of the fault is calculated. Afterwards, it forces a new replanning event, in order to take into account the estimated fault size; and return the system back to its normal functioning. During this replanning event, the nodes associated with the active faults are placed as occupied, therefore preventing other robots' entry into the zone affected by an active fault.
As shown in Figure 5, while the robot is unaffected by communication faults the current position of the robot is recorded has locations where it is possible to establish communication with the robot. These nodes will be referred to as unfaulted nodes and will be used in the estimation stage of the method to delimit the area subjected to faults. However, this designation does not mean that it is always possible to establish communication with a robot located in them. These nodes can still suffer from sporadic faults, as well as the appearance of new communication faults in areas where previously communication could be established. Therefore, their status as unfaulted nodes may not be permanent. Figure 6 shows an example of the mapping of unfaulted.
The estimation stage of this method happens when a robot enters a fault state. When entering this stage there are three possible situations that can happen: either the robot is entering a fault that is still not mapped, it is entering a fault that has already been mapped previously, or it is entering a unmapped extension of a previously mapped fault. Each of these situations is treated differently, as shown in Figure 7.  To estimate the dimension of a fault, the planned path for the robot is analyzed, so that it can determine the most likely exit node, node where communication can re-establish. To achieve this, each of the nodes corresponding to the next step of the planned path is sequentially compared with the records of the unfaulted nodes. This comparison goes on until either one of the path nodes is found to be an unfaulted node or the planned path ends. Afterwards, this method records all of the nodes that comprised the robot path between the current robot position and the exit node as a set of faulted nodes. Each set of faulted nodes therefore represents an area where no communication can be established with the robots, and have associated with it an array of possible entry/exit nodes. After the structure of nodes that represent the fault is created, the id value of the robot that detected it is associated with the fault.
In this method, when a fault is detected in the middle of a link, the entry/exit points of a fault zone are chosen in a worse case scenario; therefore these points are the last known nodes with good communication with the robot and the first point where communication could be re-established, respectively.
This method also accounts for the fact that a robot can exit sooner than the estimated exit node, especially in a situation where only a small percentage of the graph node has already been mapped. In these cases, an interrupt is generated when communication is re-establish with the robot. During this interrupt, both the faulted node set, as well as the entry/exit node set associated with that fault are corrected. This correction is executed by removing from the faulted node set all the robots path nodes that are between the actual exit point and the estimated exit node. The status of the removed nodes are changed to not mapped. Finally, the node where communication was re-established is placed into the entry/exit set. On exiting the fault, the robot will no longer be associated with that fault. Figure 8 shows an example of the detection and mapping of a new fault; in this example the nodes represented in green are nodes considered to be unfaulted nodes, the nodes represented in orange are part of the entry/exit set associated with this fault and the nodes represented in red are faulted nodes associated with this fault. In this figure, the button graph representation serves to illustrate the current status attributed to each node. After the estimation of the fault dimension, the group of nodes associated with that fault is analyzed to see if they belong to an already existing fault or if they can be associated with an already existing fault as an extension of it, as shown in Figure 7.
First, the method starts by checking whether at least two of the nodes marked as entry/exit nodes are also entry/exit nodes of an already mapped fault. If this is not the case, the method will analyze each of the nodes that comprise the affected node set, individually and see if they belong to other already mapped faults. If any of these conditions are true both node sets are merged. In the rare cases where the estimated fault nodes sets are associated with more than one already mapped fault, all the faults associated with the estimated fault are merged into only one fault.
The mapping methodology presented is a methodology where the efficiency is directly proportional to the time spent, by moving the robots. An increased movement of the robots to different nodes would lead to more nodes being mapped, therefore leading to a more efficient estimation of the affected area upon the detection of a fault. To deal with sporadic faults, this idea was of an increase in efficiency with the passing of time was improved upon.
For starters, the CSSM would only assume that a robot is in a communication fault status, if there is no new message received after a time equivalent to the execution to four of the robot control cycles has passed since the last message from that robot was received. In the implemented system, this means that a gap of approximately 400 ms elapsed, taking into account the central unit control cycle period, before any action is taken. In terms of the robots' movement, this gap is too small, when used in conjunction with the PSSM, to cause any collisions or any significantly loss of efficiency, as shown in Section 6. Assuming a linear nominal velocity of 10 cm per second the gap would correlate with a movement of 4 cm if the robot was going full speed.
During this gap, the robot currently experiencing the fault will attempt to resend its data at least three times, therefore reducing any affect that random packet loss could induce a communication fault situation.
Applying the idea of increasing the efficiency of the algorithm with the passing of time, a method for dealing with sporadic faults was implemented. This method is comprised of two parts, the first one is executed every time a communication is established with the robot in a node that was previously flagged as suffering communication faults and the second one is executed when a communication fault is detected in an unfaulted node.
The first part of this function verifies whether the node where the robot is communicating from is considered a faulted node. If so, the fault associated with that node is removed and all the faulted nodes from that are placed as not mapped, except for the node from where the communication originated which is placed as unfaulted. All the entry/exit nodes associated with that fault are still kept as unfaulted nodes. Figure 9 shows an example of the removal of a fault; in this figure both the faulted nodes, represented in red, originated from to a sporadic fault that is currently inactive. Therefore, when a robot enters this fault it does not experience any communication loss, this leads to all the nodes being considered to be unfaulted nodes, represented in green. This first part serves to remove any sporadic faults that might occur. This removal is especially important for the efficient planning of the paths since it avoids the unnecessary creation of binary semaphores.
In situations when a sporadic fault occurs on an entry/exit node of another already mapped fault, the sporadic fault is associated with that fault. When the sporadic fault is removed, the nodes of the already mapped fault are placed as not mapped. This allows for the storing of the dimension associated with that fault, permitting an efficient remapping of the fault if a robot enters it.
The Figure 10e shows an example of the removal of a sporadic fault that has been merged with a fault that corresponds to an area where there is no communication with the robot. As with Figure 9, the robot starts by placing both the faulted nodes as unmapped nodes. However, when the robot enters the second unmapped node it experiences a communication fault for being in an area inaccessible by the communication network Figure 10d, therefore the fault dimension estimation algorithm is called leading to the remapping of that fault. The second part of this function runs when a fault is detected in an unfaulted node. When this happens, this fault is assumed as a sporadic fault, so its dimension is limited to the node where it happened. This node is removed from the unfaulted set and the mapping methodology is applied. Similar to the static faults when communication is reestablished with the robot, the size of the fault is updated.

Tests and Validation
In this section, all the implemented modules as well as the interaction between them will be tested and validated. It will also present the situations where this system is incapable of guaranteeing a safe execution of the mission.
These tests will be divided into two categories, those with static communication faults and those with sporadic and static communication faults. The results of these tests, as well as other tests executed to the system, are represented in the appendixes of this document. For the representation of the results of these tests, a methodology where the robots are represented by a full square, their destination point is represented by an empty square of the same colour as the robot and their executed path between images is represented in the same colour as the robot. For ease of comprehension, numbers with the same colour as the robot have been added to the figures, in order to demonstrate the task order associated with each robot.
For these tests, four robots will be used. Each robot will be assigned a set of tasks that it needs to complete. When all tasks assigned to all of the robots that comprise the system are completed and all robots have reached their charging post the test is deemed as finished.
All tests were executed using the same factory floor map. This layout was inspired by a real-life factory.

Planning Supervision and Overall System Performance when Subjected to Static Communication Faults
The first category will analyze the behaviour of the system when only exposed to areas where there is no communication. The second set of tests will expose the system not only to areas where there is no communication but also to packet loss and sporadic communication faults.
For the first set of tests, the initial conditions represented in Figure 11 will be used. In this set of tests the no communication section chosen encompasses both workstations 10 and 11, this area is represented in Figure 12 by the colour red.
To test the system response, the creation of a new mission is also necessary. As the intent of these tests is to analyze the implemented system response to areas where there is no communication, a mission comprised of only one task to each robot is enough. Table 1 shows each robot assigned task as well as its initial priority.  Table 1. Tasks and initial priority of each robot. The numbers under the task column represent the workstation, as identified in Figure 11, to which the robot is ordered to move to.

Id Robot
Colour Initial Priority Task   1  Brown  1  9   2  Dark Green  2  10   3  Black  3  11   4 Red 4 5 Figure A4 shows the paths travelled by the robots during the execution of this mission, as well as showing the mapping of the factory floor in order to determine which nodes are experiencing communication faults. In this mapping, the nodes represented in green are considered unfaulted nodes, the nodes represented in yellow are considered entry/exit nodes of the faults that have been mapped, the nodes represented in purple are considered possible entry/exit nodes of the faults that have yet to be mapped, and finally the red nodes are the faulted nodes.
By analyzing Figure A4, it is possible to observe that the system managed to successfully complete the assigned mission avoiding any collisions or deadlocks, even when two of the tasks required the robots to travel through the area that wasn't covered by the communication network. The system was also capable of correctly mapping the fault by using the data from the two robots that travelled through it.
The system first detected the fault in Figure A4e; initially the critical zone was considered to be part of the fault as seen in Figure A4f. On detecting this fault a replanning event is triggered and this leads the black robot to stop and wait for the dark green robot to exit the fault.
When the dark green robot exits the fault, Figure A4i, the fault dimensions are adjusted and the status of the node representing workstation 10 is changed from faulted node to an entry/exit node, as seen in Figure A4j. The black robot can now advance in its course, since the fault is no longer occupied, therefore a new replanning event is executed given the current position of the dark green robot, the black robot is now forced to plan a path that accesses the area via the bottom side instead of using the top side as initially planned.
When the dark green robot re-enterw the fault, in Figure A4k, the fault is once again placed as occupied. However, this time since the black robot is accessing the fault via the bottom side there is no need for it to stop immediately. An important thing to note is that in this moment only the top node of the fault has been mapped as faulted node since it was the only one used by the robots until this point, as seen in Figure A4l. This leads to the black robot only stopping when it reaches the bottom entrance of the zone as seen in Figure A4m.
After the dark green robot once again exits the fault, in Figure A4o, the black robot enters the fault from the other side and therefore maps the other faulted node, as seen in Figure A4p. Since one of the new mapped faulted nodes is the same as one of the possible entry/exit nodes of the previously mapped fault, the system merges both faults.
A link to a video showing the execution of this test can be found in Appendix E. This test was repeated 10 times in order to show that the presented sample was not a sporadic success. Analyzing the data from these tests, it is possible to conclude that the system is capable of successfully dealing with areas that are not covered by the communication network, since in all samples the system executed the mission successfully. It also allows for the stipulation of a baseline execution time for this mission of 1 min and 47 s. When this mission is executed without any communication faults, its mean execution time is of 1 min and 34 s. As expected, the overall efficiency of the system decreases when the system is exposed to communication faults.

Planning Supervision and Overall System Performance when Subjected to Static and Sporadic Communication Faults
The second set of tests has the objective of verifying that the implemented system is capable of completing an assigned mission, in a safe and efficient way, when exposed to both sporadic faults, as well as, zones where there is no communication with the robot.
To accomplish this, the CSSM must first not only detect the area where there is no communication but also detected the occasional loss of communication in a specific node. After the detection, this sub-module must also check if any of the detected faults have ceased to exist, as explained in Section 4.2.
In order to compare results of both sets of tests, the same mission and initial conditions used in the first set will be used, as shown in Figure 13 and Table 2.  Table 2. Tasks and initial priority of each robot. The numbers under the task column represent the workstation, as identified in Figure 13, to which the robot is ordered to move to.

Id Robot
Colour Initial Priority Task   1  Purple  1  9   2  Yellow  2  10   3  Orange  3  11 4 Green 4 5 The sector of the map that is not covered by the communication network is also the same as the one used in the previous set of tasks, as shown in Figure 14. These sets of tests will test the occurrence of a sporadic fault in an entry/exit node of the fault that represents the area where no communication can be established with the robot, this area is represented in orange in Figure 14.
The representation of the results of the first part of this set of tests is represented in Figure A6. As with the previous tests, the nodes represented in green are considered unfaulted nodes, the nodes represented in yellow are considered entry/exit nodes of the faults that have been mapped, the nodes represented in purple are considered possible entry/exit nodes of the faults that have yet to be mapped and finally the red nodes are the faulted nodes.
By analyzing Figure A6, it is possible to observe that when the yellow robot moves to the top entry node of the fault corresponding to the section of the factory that is not covered by the communication network, referred as the static fault for brevity, a sporadic fault occurs in this node. Since this fault occurs in an entry/exit node this node is associated with the static fault, as seen in Figure A6a-f. Similar to the previous tests, the orange robot advances until the edge of the fault and then stops and waits for the yellow robot to exit the fault, and therefore places the binary semaphore then corresponds to that fault in the free state before proceeding, seen in Figure A6e.
On exiting the fault a second time, as shown in Figure A6g, the yellow robot sees that the sporadic fault has ceased since the system can now communicate with the robot in a zone where it previously couldn't. This leads to the system reclassifying the node where communication was re-established with the robot as an unfaulted node. The system also reclassifies the node belonging to the static fault as an unmapped node, due to the impossibility of the system knowing if it belongs to the static fault or to the sporadic fault.
Similar to the previous tests after the yellow robot exits the fault, the orange robot is cleared to proceed with the execution of its task. This leads to the orange robot entering the static fault, as shown in Figure A6i. The executing of this robot tasks leads to the remapping of the static fault and this time the dimension is correct since no sporadic fault occurs when the orange robot exits the fault for a second time. This process is shown in Figure A6i-p.
A link to a video showing the execution of this test can be found in Appendix E.
The execution of this test was repeated 10 times. In all samples, the system was able to successfully operate in environments where sporadic communication faults as well as static communication faults can occur, and manage to safely execute its assigned tasks.

Cases Where the Implemented System Can't Guarantee a Safe Execution of the Mission
Even due the overall results of the implemented system tests were very successful there are some situations where the implemented system is incapable of guarantee a safe execution of its assigned mission.
These situations normally occur due to either external interference or due to a communication fault occurring in an unfortunate time or having a large dimension. These situations are the following: • Two or more robots entering the same fault at the same time; • Impossibility of a robot to move away from the exit node of a fault.
In the first one, if two or more robots enter the same unmapped static fault at the same time, a collision can occur as the system has no way to control the synchronization between robots as it is incapable of communicating with the robots. This normally occurs when a communication fault has a large dimension.
The second situation where this may occur is if the entry/exit of a fault is located in a link between two nodes. There is a chance of a situation where one robot is currently traversing that link and another robot enters the fault through another entry point. The first robot will be in a deadlock situation since both of its possible movements choices are now locked until the second robot exits the fault.

Conclusions
The article presents the implementation of a control system capable of controlling the traffic of a fleet of robots, i.e., plan and control the movement of a fleet of robots that have multiple tasks assigned. The implemented system must plan safe and efficient paths, avoid deadlocks and be immune to network failures. It uses the TEA* algorithm, proposed in [4], as a base for the path planning and has a high tolerance to communication faults. The implemented system is comprised of several modules. This system not only planned the robots paths but also supervised their execution and was on guard against any communication failures.
Several modifications of the TEA* algorithm, proposed by [4], were also implemented in order to make it compatible with the new environment conditions that the algorithm would have to face. This modifications also intended to move the TEA* algorithm closer to a real life industrial application.
As shown in Section 6.2, the implemented system is capable of safely executing a set of tasks in environments where both static and sporadic communication faults occur. On previous implementations of the TEA* algorithm [4] , when the robots were subjected to communication faults, there would be a high probability of the occurrence of deadlocks, since the TEA* algorithm requires the synchronization of the movements and positions of all the robots of the system in order to be effective.
Although the system has a good tolerance to communication faults, it still lacks total effectiveness when dealing with communication faults, as demonstrated by in Section 6.3. Data Availability Statement: This study didn't report any new data.

Acknowledgments:
The authors of this work would like to thank the members of INESC TEC iiLab for all the support rendered to this project.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:  Figure A1. Movement of the robots during the execution of the first mission, each robot as two tasks assigned to it. In subfigures (a-r), the robot is represent by the full coloured square, while its movement between each of the subfigures is represented by the outlined square of the same colour. The numbers presented in the subfigures represent the task order assigned to each robot. In subfigures a-p, the robot is represent by the full coloured square, while its movement between each of the subfigures is represented by the outlined square of the same colour. The numbers presented in the subfigures represent the task order assigned to each robot. (a) Initial positions of the robots; (b) All robots keep following their planned paths; (c) Since the brown robot has a higher priority than the pink one, this last one is force to take the longer path in order to avoid any conflicts; (d) The light blue robot had to stop and wait for the pink robot to move from the intersection before resuming its movement; (e) Delays in the movement of the dark blue robot force the system to replan the path of the pink robot to avoid pathing conflicts; (f) Delays in the rotation of the brown robot force further replanning of the robots paths; (g) No replanning is necessary since the robots are still synchronised; (h) Both the brown and dark blue robot finish their fist task.      Figure A6. Behaviour of the system when subjected to a sporadic fault in an entry/exit node, second image. In right subfigures, the robot is represent by the full coloured square, while its movement between each of the subfigures is represented by the outlined square of the same colour. The numbers presented in the subfigures represent the task order assigned to each robot. In the left subfigures the status of each node is displayed, blue equals unmapped, red equals faulted, yellow represents a entry/exit node and green represents an unfaulted node.