1. Introduction
Multi-robot systems (MRS) have emerged as a cornerstone of modern automation, playing a vital role in diverse fields such as logistics, disaster response, agriculture, and industrial manufacturing. These systems offer increased efficiency, flexibility, and scalability by distributing tasks among multiple robots. However, the complexity of coordinating such systems grows significantly when considering dynamic environments, unpredictable conditions, and the inevitability of robot or communication failures. Task assignment in these scenarios, especially in the presence of mid-task failures, remains a critical challenge.
Failures during task execution, whether caused by hardware malfunctions, communication disruptions, or environmental factors, can severely degrade overall system performance. A robot’s inability to complete a task can lead to bottlenecks, inefficiencies, or even mission-critical failures in time-sensitive applications. Traditional task assignment approaches, while effective in static environments, often lack the adaptability and fault tolerance required to address such mid-task disruptions.
To tackle these challenges, we developed a fault-tolerant task assignment framework designed to dynamically detect and recover from failures during task execution. The framework integrates a failure detection mechanism with an enhanced task reassignment algorithm, enabling the redistribution of incomplete tasks to available robots in real time. By ensuring continuous operation and mitigating the cascading effects of failures, the framework enhances the reliability and robustness of multi-robot systems.
The framework emphasizes scalability, adaptability, and resource efficiency. By leveraging redundancy and real-time data exchange, the system maintains collaboration and minimizes disruptions in task execution. Through simulation and analysis, the framework demonstrates its ability to reduce downtime, improve task completion rates, and maintain system stability in various operational scenarios. This paper describes the design and implementation of the fault-tolerant task assignment framework and evaluates its performance in handling mid-task failures; the results highlight its potential for robust multi-robot coordination.
2. Related Work
Task assignment in multi-robot systems (MRS) has been studied extensively with respect to efficiency, scalability, and reliability. Traditional methods, such as market-based approaches, have been successful in static environments by using bidding mechanisms to allocate tasks among robots based on their capabilities and resource availability. Dias et al. conducted a comprehensive survey analyzing market-based methods, highlighting their potential and limitations in dynamic scenarios [1]. Gerkey and Mataric developed a structured classification system for various task allocation methods in multi-robot systems, underscoring the need for adaptable approaches in multi-robot task allocation [2].
Recent advancements in fault-tolerant control have introduced mechanisms to ensure system reliability in the presence of robot or task failures. Luo and Yang proposed a real-time cooperative fault-tolerant control scheme that dynamically adjusts to failures, enhancing system stability and performance [3]. Zhang and Jiang compared active and passive fault-tolerant control systems, providing insights into their applicability in multi-robot systems [4].
Cloud robotics has emerged as a powerful paradigm to address the computational and scalability challenges of MRS. By transferring computationally demanding tasks to cloud infrastructures [5], robotic systems leverage enhanced computational resources for prompt information processing and action selection. Optimization algorithms based on heuristic and metaheuristic principles, including genetic algorithms (GA) and particle swarm optimization (PSO), have been applied to enhance task assignment efficiency. Yan et al. analyzed various coordination techniques in MRS, demonstrating the effectiveness of heuristic-based approaches [5]. Kehoe et al. [6] explored the evolution of cloud robotics and discussed how it could significantly enhance scalability and coordination in multi-robot systems. However, issues such as communication latency and data security remain challenges for real-time fault recovery [7]. Rahimi et al. proposed an adaptive fault-tolerant scheduling method for cloud robotics, which dynamically reallocates tasks in response to failures [8]. Zhang et al. [9] proposed a cloud-centric architecture that supports scalable and real-time simultaneous localization and mapping (SLAM) for multi-robot systems, thereby enhancing coordination in dynamic environments. Nonetheless, these methods often lack the adaptability required to handle task failures in dynamic environments. Dynamic task reassignment has been explored to enhance system robustness [10].
Based on these studies, we developed an adaptive and resilient task allocation framework that integrates real-time failure detection, recovery-driven reassignment, and robust communication protocols. Unlike traditional methods, the developed framework addresses fault tolerance and scalability and offers a robust solution for dynamic and uncertain environments.
4. Framework Architecture
The architecture introduced for achieving fault tolerance in multi-robot systems utilizes the principles of cloud robotics to support real-time monitoring, flexible task redistribution, and reliable coordination. At the center of this framework lies the base cloud station, deployed on a scalable cloud infrastructure such as Amazon Web Services, which maintains comprehensive records of robot configurations, task parameters, and environmental data (Figure 1).
This station employs an adaptive fault-tolerant task allocation algorithm to optimize task distribution and reallocation during mid-task failures, ensuring efficient and reliable system operation. The local cloud station functions as a centralized hub for handling region-specific data, encompassing task progress updates, robot performance indicators, and sensor-generated information. It incorporates an application programming interface (API) management layer to facilitate seamless and structured communication between the base cloud, local cloud, and robots. Additionally, the dynamic reassignment engine within the local cloud station enables real-time redistribution of incomplete tasks when failures are detected.
A communication link between the local cloud station and the robot fleet uses the low-latency, reliable message queuing telemetry transport (MQTT) protocol to ensure continuous data exchange for monitoring and coordination. Each robot in the fleet is equipped with sensors, actuators, on-board computational modules, and embedded systems for executing tasks while reporting its status to the local cloud. This robust communication framework supports fault detection and recovery, enabling quick responses to dynamic changes in the operational environment. By integrating cloud computing resources, real-time data analysis, and adaptive algorithms, the framework enhances fault tolerance and scalability in multi-robot systems. It provides an efficient and reliable solution for diverse applications, including industrial automation, disaster response, and logistics, where dynamic environments demand robust task allocation and system resilience.
4.1. Algorithm
The fault-tolerant multi-robot task assignment algorithm with dynamic reassignment optimizes task scheduling by assigning priorities according to estimated execution durations and criticality while adapting to changing system conditions. It achieves fault tolerance through real-time monitoring and recovery-driven task reassignment. Tasks are prioritized by computed weights so that the most critical tasks are addressed first, and priorities are periodically re-evaluated to reflect current conditions. The improvements comprise dynamic weight recalibration in response to system load, real-time ranking of task priorities, prompt allocation of high-urgency tasks, and a fairness strategy to avoid task starvation, which together enhance the responsiveness and efficiency of the scheduling mechanism (Figure 2).
The execution scheduling order for the assigned tasks is defined as follows.
The processing duration of task i on machine j is estimated using Equation (1),
where T[i] represents the computational workload or size of task i, and D[j] denotes the processing speed or execution capability of machine j.
The mean processing time required for task i is determined as follows.
The scaled variation between the maximum and minimum execution times is estimated using Equation (3).
The scheduling weight for task i is calculated as follows.
Here, x denotes the total count of tasks in the system.
Task weights are adjusted based on the system load.
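The quantities named above can be illustrated with a minimal sketch. The formulas in it are assumptions, since only the names and roles of the quantities are given here: the runtime is taken as workload divided by processing speed, the scaled variation as the max-min spread divided by the mean, and the weight as the product p[i]·c[i] normalized over all x tasks. The function names are hypothetical, and positive workloads and speeds are assumed.

```python
from statistics import mean


def estimate_runtimes(T, D):
    """E[i][j]: runtime of task i on machine j, assumed here to be T[i] / D[j]."""
    return [[t / d for d in D] for t in T]


def scheduling_weights(E, load_level, cutoff, adjustment_multiplier):
    """Return (p, c, s) per task; the formulas for c and s are assumptions."""
    # p[i]: mean runtime of task i over the y machines.
    p = [mean(row) for row in E]
    # c[i]: max-min spread of the runtimes, scaled by the mean (assumed scaling).
    c = [(max(row) - min(row)) / p_i for row, p_i in zip(E, p)]
    # s[i]: p[i] * c[i], normalized over all x tasks (assumed weighting).
    raw = [p_i * c_i for p_i, c_i in zip(p, c)]
    total = sum(raw)
    x = len(E)
    s = [r / total if total > 0 else 1.0 / x for r in raw]
    # Load-based adjustment: scale the weights once the load passes the cutoff.
    if load_level > cutoff:
        s = [s_i * adjustment_multiplier for s_i in s]
    return p, c, s


E = estimate_runtimes(T=[10, 30], D=[1, 2, 5])
p, c, s = scheduling_weights(E, load_level=0.5, cutoff=0.8, adjustment_multiplier=1.2)
print(s)  # the larger task receives the larger weight under these assumptions
```

Under these assumed formulas the weight grows with the task's mean runtime, so longer tasks are ranked first. A different weighting would change the ranking but not the overall structure of the computation.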
When a task is classified as urgent, it is immediately dispatched to the next robot that becomes available, and the assignment status is published, typically using the MQTT protocol. For non-urgent tasks, the system applies a greedy selection strategy. It first sorts the remaining tasks in descending order of their priority score s[i]. For each task in that order, it selects the robot with the minimal cost E[i][j] or the best current load, assigns the task to that robot, and publishes the assignment.
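The assignment step can be sketched as follows. This is one possible reading rather than the exact procedure: the text allows either the minimal cost E[i][j] or the best current load as the selection criterion, and the sketch combines the two by choosing the robot with the earliest estimated finish time. The function name `assign_tasks` and its parameters are hypothetical, and publishing the assignment is omitted.

```python
def assign_tasks(E, s, urgent):
    """Greedy assignment sketch; returns ({task: robot}, per-robot load).

    E[i][j] is the estimated runtime of task i on robot j, s[i] the priority
    score, and `urgent` a set of task indices flagged as urgent.
    """
    y = len(E[0])
    load = [0.0] * y          # accumulated estimated work per robot
    assignment = {}

    # Urgent tasks go to the robot that becomes free first.
    for i in sorted(urgent):
        j = min(range(y), key=lambda r: load[r])
        assignment[i] = j
        load[j] += E[i][j]

    # Remaining tasks in descending order of priority score s[i].
    rest = sorted((i for i in range(len(E)) if i not in urgent),
                  key=lambda i: s[i], reverse=True)
    for i in rest:
        # Assumed criterion: earliest finish time (current load plus cost).
        j = min(range(y), key=lambda r: load[r] + E[i][r])
        assignment[i] = j
        load[j] += E[i][j]
    return assignment, load


E = [[10, 5], [20, 10], [4, 2]]
assignment, load = assign_tasks(E, s=[0.3, 0.6, 0.1], urgent={2})
print(assignment, load)
```

Choosing by finish time rather than raw cost keeps the fastest robot from accumulating every task, which is in the spirit of the load-balancing goal described above.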
Once tasks are assigned, robots begin executing them. Each robot regularly publishes status updates, including location, progress, and resource availability, to the MQTT broker. Robot status is continuously monitored through MQTT channels such as robot/failure and robot/status. Failures are detected through disconnections, sensor alerts, or missed heartbeat signals. If a robot fails, it is marked as “failed,” and the system initiates a recovery-driven reassignment process. The system identifies all tasks that the failed robot was executing or had queued. For each affected task, it either recomputes or reuses parameters such as E[i][k], p[i], c[i], and s[i]. Each task is then reassigned to a healthy robot that either minimizes execution time or balances the overall system load, and a reassignment message is published. To reduce delays, the system prioritizes tasks that were partially completed or are approaching their deadlines. Urgent tasks are re-evaluated if necessary. The system load across all healthy robots is then re-evaluated, and if the system is overloaded, the task weights s[i] are adjusted to optimize performance.
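A minimal sketch of heartbeat-based failure detection and recovery-driven reassignment is given below. It is an illustration under stated assumptions: the timeout rule, the function names `detect_failed` and `reassign`, and the use of a progress fraction to order affected tasks are all hypothetical, and deadline handling and message publishing are left out.

```python
def detect_failed(last_heartbeat, now, timeout):
    """Robots whose last heartbeat is older than `timeout` seconds."""
    return {r for r, t in last_heartbeat.items() if now - t > timeout}


def reassign(assignment, E, load, failed, progress=None):
    """Move every task of a failed robot to a healthy one.

    assignment: {task: robot}; E[i][j]: estimated runtime of task i on robot j;
    load: estimated work per robot; progress: optional {task: fraction done}.
    """
    progress = progress or {}
    healthy = [r for r in range(len(load)) if r not in failed]
    if not healthy:
        raise RuntimeError("no healthy robot left to take over tasks")

    new_assignment = dict(assignment)
    # A failed robot no longer carries any load.
    new_load = [0.0 if r in failed else load[r] for r in range(len(load))]

    # Partially completed tasks are handled first (assumed ordering rule).
    affected = [i for i, r in assignment.items() if r in failed]
    affected.sort(key=lambda i: progress.get(i, 0.0), reverse=True)

    for i in affected:
        # Pick the healthy robot with the earliest estimated finish time.
        j = min(healthy, key=lambda r: new_load[r] + E[i][r])
        new_assignment[i] = j
        new_load[j] += E[i][j]
    return new_assignment, new_load


failed = detect_failed({0: 10.0, 1: 3.0, 2: 9.5}, now=11.0, timeout=5.0)
E = [[10, 5, 2], [20, 10, 4], [30, 15, 6], [8, 4, 2]]
print(reassign({0: 0, 1: 1, 2: 1, 3: 2}, E, [10.0, 25.0, 2.0], failed,
               progress={2: 0.5}))
```

In a deployment, `last_heartbeat` would be updated from the robot/status topic, and each change to the assignment would be published as a reassignment message.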
The system collects key performance metrics, including the completion rate (ratio of completed tasks to total tasks), robot utilization (such as CPU usage, battery level, or custom indicators), and the number of tasks still waiting for assignment. These metrics are evaluated against predefined thresholds or service-level agreements. If performance falls below expectations, the system may adjust the adjustment_multiplier or allocate additional resources. All metrics are published to a dashboard or through MQTT. To ensure scalability and resilience, the system adapts to new robots by computing E[i][j] for the newly added units. It also accommodates new tasks by computing s[i] and assigning them accordingly. Redundant communication channels are maintained in case MQTT becomes unavailable. The system periodically checks the overall load and reassigns tasks as needed.
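The threshold check can be sketched as below. The text only states that the adjustment_multiplier may be adjusted, so the direction and size of the change, along with the names `tune_multiplier`, `target_rate`, `step`, and `max_multiplier`, are assumptions made for illustration.

```python
def completion_rate(completed, total):
    """Ratio of completed tasks to total tasks (1.0 when there are no tasks)."""
    return completed / total if total else 1.0


def tune_multiplier(rate, target_rate, adjustment_multiplier,
                    step=0.1, max_multiplier=2.0):
    """Raise the multiplier by `step` when the rate misses the target.

    Assumed policy: the direction, step size, and cap are illustrative only.
    """
    if rate < target_rate:
        return min(adjustment_multiplier + step, max_multiplier)
    return adjustment_multiplier


rate = completion_rate(completed=45, total=50)
print(rate, tune_multiplier(rate, target_rate=0.95, adjustment_multiplier=1.0))
```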
The system loops through the phases of task assignment, failure detection, reassignment, and performance evaluation. It terminates when all tasks are completed or when the system is shut down. A final status message is published to MQTT for logging purposes. The variables used are as follows.
x = The count of tasks to be scheduled;
y = Total count of available machines;
T[i] = Workload (size) of task i;
D[j] = Processing rate of machine j;
E[i][j] = Estimated runtime of task i on machine j;
p[i] = Mean runtime of task i;
c[i] = Adjusted difference in task i;
s[i] = Weighted value for task i;
load_level = System’s present workload;
Cutoff = Limit point for dynamic workload adaptation;
adjustment_multiplier = Weight adjustment multiplier.
4.2. Connection Between Robot and Cloud-Based Control Center
Efficient robot coordination via a cloud-based control center is vital for dynamic task reassignment in fault-tolerant multi-robot systems. Robots connect to the cloud system over wireless networks and exchange data in real time, enabling task coordination and decision-making. Robots regularly send updates on their status, including location, task status, and resource availability. The cloud processes this data to dynamically reassign tasks, ensuring efficient collaboration, adaptability to environmental changes, and minimal downtime. The cloud’s computational power enhances scalability and decision-making accuracy in dynamic environments. MQTT is a lightweight messaging protocol designed for efficiency, which is particularly advantageous for cloud-integrated multi-robot platforms. Its publish–subscribe model lets robots send updates to a broker that disseminates them to subscribed devices, supports low-bandwidth environments, and provides reliable, real-time communication. MQTT’s efficiency and simplicity make it well suited to coordinating task updates, failure alerts, and robot statuses, enabling dynamic task reassignment.
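The publish–subscribe pattern described above can be illustrated with a toy in-memory broker. This is not an MQTT client and involves no network; `ToyBroker`, `status_payload`, and the JSON field names are hypothetical stand-ins used only to show how a status update published on a topic reaches the subscribers of that topic. The topic names robot/status and robot/failure follow the channels mentioned in Section 4.1.

```python
import json


class ToyBroker:
    """In-memory stand-in for a broker: routes payloads by exact topic name."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Deliver the payload to every callback registered for this topic.
        for callback in self._subscribers.get(topic, []):
            callback(topic, payload)


def status_payload(robot_id, location, progress, battery):
    """Serialize a status update; the field names are illustrative."""
    return json.dumps({"robot_id": robot_id, "location": location,
                       "progress": progress, "battery": battery})


broker = ToyBroker()
broker.subscribe("robot/status",
                 lambda topic, payload: print(topic, json.loads(payload)))
broker.publish("robot/status", status_payload("r1", [1.0, 2.0], 0.4, 0.9))
```

A real deployment would replace `ToyBroker` with an MQTT broker and client library, which additionally provide quality-of-service levels and connection-loss notifications.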
4.3. Dataset
The dataset represents a simulation of task distribution in a cloud-managed multi-robot system for fault-tolerant operations in dynamic, unpredictable environments. Comprising 5000 tasks, the dataset includes task length, priority, execution time, energy use, communication delay, subtasks, emergency level, task failures, communication gaps, reassignment events, and progress tracking. These features capture routine operations and failure scenarios, such as robot malfunctions and communication disruptions. The dataset is used to evaluate dynamic task reassignment strategies, support optimization, and assess the reliability of task distribution across varying conditions. Visualizations, such as scatter plots and bar charts, reveal key relationships such as task length versus energy consumption, helping assess scalability, energy efficiency, robustness, and overall system performance (Table 1, Figure 3 and Figure 4).
4.4. Task Analysis
To validate the effectiveness of the proposed fault-tolerant task reassignment framework, a series of simulations was conducted under varying task loads and failure scenarios.
Table 2 presents a comparative analysis of task completion times for the proposed Fault-Tolerant scheduling algorithm against two baseline methods, the periodic min–max weight algorithm (PMW) and PSO. The results show that the proposed approach consistently achieves shorter task completion times, with the advantage becoming more pronounced as the system load increases. For example, with a load of 100 tasks, the Fault-Tolerant algorithm completed execution in 125 ms, outperforming PMW (135 ms) and PSO (145 ms). This trend persists as the load increases, with the Fault-Tolerant scheduler maintaining a 50–80 ms advantage at 300 tasks. These improvements are attributed to the framework’s dynamic reassignment mechanism, which promptly redistributes failed or unassigned tasks among available robots with minimal coordination overhead.
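The relative size of the 100-task advantage can be worked out directly from the three completion times quoted above. The helper name `relative_gain_percent` is hypothetical, and only the figures stated in the text are used.

```python
def relative_gain_percent(baseline_ms, proposed_ms):
    """Percentage reduction in completion time relative to the baseline."""
    return (baseline_ms - proposed_ms) / baseline_ms * 100.0


# Completion times at a load of 100 tasks, as quoted in the text.
completion_ms = {"Fault-Tolerant": 125, "PMW": 135, "PSO": 145}

for name in ("PMW", "PSO"):
    gain = relative_gain_percent(completion_ms[name],
                                 completion_ms["Fault-Tolerant"])
    print(f"vs {name}: {gain:.1f}% shorter completion time")
# prints 7.4% for PMW and 13.8% for PSO
```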
Under unpredictable settings that simulated disruptions, including communication failures, robot breakdowns, and interruptions during tasks, the PMW and PSO approaches exhibited delayed recovery and task congestion. In contrast, the proposed framework maintained operational continuity due to its integration of real-time failure detection, MQTT-based low-latency messaging, and actuator redundancy that minimizes the need for full task reassignment in the event of partial hardware failure. In fault-injected scenarios, the system demonstrated 30–40 ms (Figure 5 and Figure 6).
These findings highlight the robustness and adaptability of the proposed method in dynamic environments, particularly within the context of multi-robot systems operating in mission-critical scenarios such as disaster response, industrial automation, and autonomous exploration. By ensuring rapid recovery and efficient load redistribution, the framework offers a scalable and resilient solution for maintaining system performance in the face of uncertainty.
4.5. Applications of the Proposed Framework
The proposed fault-tolerant framework for dynamic task reassignment in multi-robot systems enhances efficiency across industries by keeping tasks running despite failures. In warehouses, it enables uninterrupted inventory management, while in healthcare, it supports timely medical supply delivery and patient monitoring. Logistics and transportation benefit from adaptive delivery routing, and smart manufacturing reduces downtime by reallocating assembly tasks. In disaster response, robots dynamically adjust search and rescue efforts, improving coverage in critical situations. In agriculture, the framework supports precision farming, and in autonomous security systems, it helps maintain continuous surveillance. The framework enhances resilience, scalability, and operational continuity, making it well suited to unpredictable environments.