Article

A Modular ROS–MARL Framework for Cooperative Multi-Robot Task Allocation in Construction Digital Environments

1 S.M.A.R.T. Construction Research Group, Division of Engineering, New York University Abu Dhabi (NYUAD), Experimental Research Building, Saadiyat Island, Abu Dhabi P.O. Box 129188, United Arab Emirates
2 KINESIS Core Technology Platform, New York University Abu Dhabi (NYUAD), Experimental Research Building, Saadiyat Island, Abu Dhabi P.O. Box 129188, United Arab Emirates
* Author to whom correspondence should be addressed.
Buildings 2026, 16(3), 539; https://doi.org/10.3390/buildings16030539
Submission received: 12 November 2025 / Revised: 16 January 2026 / Accepted: 23 January 2026 / Published: 28 January 2026
(This article belongs to the Special Issue Robotics, Automation and Digitization in Construction)

Abstract

The deployment of autonomous robots in construction remains constrained by the complexity and variability of real-world environments. Conventional programming and single-agent approaches lack the adaptability required for dynamic multi-robot operating conditions, underscoring the need for cooperative, learning-based systems. This paper presents an ROS-based modular framework that integrates Multi-Agent Reinforcement Learning (MARL) into a generic 2D simulation and execution pipeline for cooperative mobile robots in construction-oriented digital environments, enabling adaptive task allocation and coordinated execution without predefined datasets or manual scheduling. The framework adopts a centralized-training, decentralized-execution (CTDE) scheme based on Multi-Agent Proximal Policy Optimization (MAPPO) and decomposes the system into interchangeable modules for environment modeling, task representation, robot interfaces, and learning, allowing different layouts, task sets, and robot models to be instantiated without redesigning the core architecture. Validation through an ROS-based 2D simulation and real-world experiments with TurtleBot3 robots demonstrated effective task scheduling, adaptive navigation, and cooperative behavior under uncertainty. In simulation, the learned MAPPO policy is benchmarked against non-learning baselines for multi-robot task allocation; in real-robot experiments, the same policy is evaluated to quantify and discuss the performance gap between simulated and physical execution. Rather than presenting a complete construction-site deployment, this initial study focuses on proposing and validating a reusable MARL–ROS framework and digital testbed for multi-robot task allocation in construction-oriented digital environments. The results show that the framework supports effective cooperative task scheduling, adaptive navigation, and logic-consistent behavior, while highlighting practical issues that arise in sim-to-real transfer. Overall, the framework provides a reusable digital foundation and benchmark for studying adaptive and cooperative multi-robot systems in construction-related planning and management contexts.

1. Introduction

The slow adoption of robotic technology across industries, particularly in construction, can largely be attributed to the complexity and variability of real-world tasks that require robotic intervention [1]. Most industrial robotic systems are designed for the autonomous execution of repetitive, isolated operations [2,3]. Common implementations include stationary manipulators [4], which, although effective within specific contexts, typically operate independently rather than collaboratively. This operational isolation limits overall system efficiency and highlights a critical research gap in developing cooperative robotic systems inspired by human-like task coordination.
Robotic components and services are generally developed and tested in controlled, predictable environments before deployment [5], which contrasts with the dynamic and fragmented conditions of construction projects. Consequently, their performance often degrades in dynamic, unstructured settings [6], where effective operation demands flexible planning and real-time adaptation. Traditional approaches to robotics have underutilized the potential for adaptability and collaboration [3], limiting their contribution to construction planning and control.
To fully realize the benefits of robotics in practical applications, systems must enable effective cooperation among multiple robots. Multi-Robot Task Allocation (MRTA) has been explored through diverse methodologies [7], including market-based mechanisms grounded in economic bidding [8], behavior-based approaches that rely on predefined roles [9], and utility-based methods that optimize task distribution through cost–benefit analysis [10]. While effective in static conditions with complete prior knowledge, these methods often fail in dynamic, uncertain environments. Learning-based approaches, including Q-learning [11] and policy gradient methods [12], represent a significant advancement by enabling systems to adapt through interaction. Reinforcement Learning (RL) has demonstrated success in domains such as industrial manufacturing [13], board games [14], robot control [15], and autonomous driving [16], where adaptability to complex and unpredictable environments is critical [17,18], including the evolving and uncertainty-prone conditions of construction sites.
Optimizing task allocation in multi-robot systems remains challenging due to dynamic schedules, evolving workspaces, and inter-agent dependencies typical of real-world operations [19], such as those in construction logistics and site management. Multi-Agent Reinforcement Learning (MARL) offers a promising approach by enabling robots to learn cooperative policies through shared experiences while accounting for others’ actions [20]. Such collaboration would increase efficiency and adaptability in handling complex, multi-agent environments, enabling flexible, digitally coordinated robot teams.
Although MARL has achieved considerable success on controlled benchmarks such as StarCraft II and the Hanabi card game [21], its application in real-world industrial and construction settings remains limited by the lack of standardized testing environments [22]. The lack of unified frameworks for modeling, benchmarking, and validation limits the translation of MARL’s theoretical advancements into practical robotic systems. Existing MARL frameworks and MRTA + MARL applications are typically implemented as bespoke, scenario-specific stacks in which environment models, task logic, and robot interfaces are tightly coupled, making them difficult to reuse across different robots, task structures, or construction constraints. In particular, they rarely expose a modular architecture that allows construction-relevant constraints (e.g., precedence relations, resource capacities, robot availability) to be introduced without reengineering the learning and execution pipeline.
This paper addresses that gap by introducing a modular, ROS-based framework that couples Multi-Agent Reinforcement Learning with a generic 2D simulation and execution stack for cooperative mobile robots in construction-oriented digital environments. The framework adopts a centralized-training, decentralized-execution (CTDE) architecture implemented with Multi-Agent Proximal Policy Optimization (MAPPO) and structures the system into distinct modules for environment modeling, task representation, robot interfaces, and learning. By decoupling these components through standardized data flows, the framework allows different layouts, task sets, and robot models to be instantiated without redesigning the underlying MARL implementation, and is intended to serve as reusable digital infrastructure for construction management and robotics research.
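To make the CTDE idea concrete, the following minimal sketch (illustrative only, not the paper's implementation; all dimensions and parameters are hypothetical) shows the structural split MAPPO relies on: each agent keeps its own actor over local observations for decentralized execution, while a single centralized critic scores the concatenated global state during training, and actor updates use PPO's clipped surrogate objective.

```python
import random

random.seed(0)

class Actor:
    """Decentralized policy: maps a local observation to action preferences."""
    def __init__(self, obs_dim, n_actions):
        self.W = [[random.gauss(0, 0.1) for _ in range(n_actions)]
                  for _ in range(obs_dim)]
    def act(self, obs):
        # Score each action and pick the best (greedy, for the sketch).
        logits = [sum(o * w for o, w in zip(obs, col))
                  for col in zip(*self.W)]
        return logits.index(max(logits))

class CentralCritic:
    """Centralized value function: sees the concatenated global state."""
    def __init__(self, state_dim):
        self.w = [0.0] * state_dim
    def value(self, state):
        return sum(s * w for s, w in zip(state, self.w))

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate used by (MA)PPO for each agent's actor update."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

obs = [[0.5, -1.0, 0.2, 0.0], [1.0, 0.3, -0.4, 0.8]]  # local obs, 2 agents
actors = [Actor(4, 3) for _ in range(2)]
critic = CentralCritic(8)                 # global state = concat of local obs
actions = [a.act(o) for a, o in zip(actors, obs)]
value = critic.value(obs[0] + obs[1])
```

At execution time only the `Actor` objects are needed, which is what allows the trained policy to run decentrally on each robot.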
A proof-of-concept case study employing small-scale mobile robots is presented to validate the framework’s performance in task allocation, navigation, and collaborative learning. The results demonstrate enhanced autonomy, adaptability, and efficiency in multi-robot operations within dynamic environments.
In this study, we focus on establishing and validating the framework itself rather than modeling the full complexity of a construction site, positioning the contribution as a reusable MARL–ROS framework and digital testbed for multi-robot task allocation.
Although the environment resembles a generic indoor facility rather than a detailed construction project, the same digital pipeline can later be instantiated with richer site models and constraints, preserving the link to construction through construction-oriented digital environments rather than a fully realistic site deployment.
The main contributions of this paper are as follows:
  • Framework: We develop a modular MARL–ROS framework for cooperative multi-robot task allocation, with interchangeable modules for environment modeling, task representation, robot interfaces, and learning.
  • Formulation: We cast multi-robot task allocation as a MAPPO-based centralized-training, decentralized-execution problem and specify the state, observation, reward, and optimization structure in a robotics-compatible way.
  • Digital testbed: We instantiate the framework in a simplified 2D Flatland benchmark and on TurtleBot3 robots, demonstrating an end-to-end pipeline from digital environment description and MARL training to decentralized execution.
  • Evaluation: We compare the learned MAPPO policy with non-learning baselines in simulation and analyze the sim-to-real performance gap, illustrating how the framework can be used as a construction-oriented digital testbed.
The remainder of this paper is organized as follows: Section 2 reviews relevant literature on MRTA and related MARL approaches. Section 3 details the proposed methodology. Section 4 presents the case study and experimental results, and Section 5 and Section 6 discuss limitations, conclusions, and directions for future research.

2. Literature Review

2.1. MRTA Definitions and Solutions

The Multi-Robot Task Allocation (MRTA) problem aims to efficiently distribute tasks among multiple robots to maximize productivity and operational performance [23]. As a discrete optimization problem, MRTA often relies on graph-based formulations, requiring exploration of a large, constrained search space [10]. Since the 1990s, MRTA has attracted significant research attention across disciplines, including computer science, operations research, robotics, and artificial intelligence [24]. Gerkey and Matarić [25] introduced a widely adopted taxonomy that categorizes MRTA problems based on robot capabilities, task characteristics, and timing constraints, as summarized in Table 1. For example, the MT-SR-IA configuration represents a scenario where multiple tasks are executed simultaneously, each assigned to a single robot with instantaneous allocation. Moreover, MRTA applications span diverse domains, from industrial manufacturing to autonomous exploration, each demanding distinct optimization strategies [26]. Various algorithmic approaches have been developed to address these challenges [27], each tailored to specific task requirements, coordination mechanisms, and environmental dynamics.
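The three taxonomy axes can be encoded directly in code; the sketch below (a hedged illustration following the standard Gerkey–Matarić axis names, not an artifact of this paper) shows how a class label such as MT-SR-IA is composed from the robot, task, and allocation dimensions.

```python
from dataclasses import dataclass
from enum import Enum

# Axes of the Gerkey–Matarić MRTA taxonomy (Table 1).
class RobotType(Enum):
    ST = "single-task robots"
    MT = "multi-task robots"

class TaskType(Enum):
    SR = "single-robot tasks"
    MR = "multi-robot tasks"

class AllocationType(Enum):
    IA = "instantaneous assignment"
    TA = "time-extended assignment"

@dataclass(frozen=True)
class MRTAClass:
    robots: RobotType
    tasks: TaskType
    allocation: AllocationType
    def code(self):
        # Compose the conventional three-letter class label.
        return f"{self.robots.name}-{self.tasks.name}-{self.allocation.name}"

# The configuration discussed in the text:
mt_sr_ia = MRTAClass(RobotType.MT, TaskType.SR, AllocationType.IA)
```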
Numerous studies have addressed sequential planning and coordination in multi-agent decision-making [28]. However, a comprehensive review is necessary to identify suitable methodologies for specific applications. Optimization-based approaches have been widely used to dynamically allocate tasks among multiple agents under constraints such as time windows and task completion uncertainty, particularly in drone delivery systems [29]. While effective, these methods require detailed modeling and a precise understanding of the operational context. In contrast, market-based approaches rely on auction mechanisms that enable robots to make autonomous decisions based on real-time information, thereby adapting to changing task priorities and agent availability [30]. Although these methods offer scalability and flexibility, challenges remain in accurately estimating bids under dynamic environmental conditions, leading to suboptimal allocations and increased communication overhead.
Beyond these, the broader spectrum of MRTA methodologies encompasses several distinct approaches, each with unique advantages and limitations that influence their suitability for specific applications:
(a) Behavior-based approaches are simple to implement and perform well in dynamic environments; however, they lack adaptability to evolving conditions and may fail to achieve globally optimal task allocation.
(b) Utility-based approaches provide precise and efficient task distribution through defined utility functions but require detailed system modeling, and their computational demands increase sharply with the number of tasks and agents.
(c) Optimization-based approaches are capable of finding globally optimal solutions and offer strong adaptability, yet they are computationally intensive and often depend on accurate models that may not capture real-world variability.
(d) Learning-based approaches enable adaptability and continuous improvement through experience, supporting effective task management after sufficient training. Nonetheless, they require large datasets and may exhibit unstable performance during early training stages.
(e) Consensus- and cooperation-based approaches emphasize coordinated task distribution and conflict resolution, improving collective performance. However, they rely on continuous inter-agent communication, which becomes challenging in large-scale or bandwidth-limited systems.
Table 2 summarizes the necessary information to benchmark algorithms for the MRTA problem.
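As a minimal illustration of the utility-based family (b), the sketch below greedily assigns each task to the highest-utility free robot. The cost model and identifiers are hypothetical, and a real system would use optimal assignment (e.g., the Hungarian method) rather than this greedy pass.

```python
from itertools import product

def greedy_allocate(robots, tasks, utility):
    """robots/tasks: lists of ids; utility: dict[(robot, task)] -> float.
    Assigns each task to at most one robot, highest utility first."""
    pairs = sorted(product(robots, tasks), key=lambda p: utility[p],
                   reverse=True)
    busy_robots, done_tasks, plan = set(), set(), {}
    for r, t in pairs:
        if r not in busy_robots and t not in done_tasks:
            plan[t] = r
            busy_robots.add(r)
            done_tasks.add(t)
    return plan

# Hypothetical utilities, e.g. benefit minus travel cost:
utility = {("r1", "t1"): 5.0, ("r1", "t2"): 2.0,
           ("r2", "t1"): 4.0, ("r2", "t2"): 3.0}
plan = greedy_allocate(["r1", "r2"], ["t1", "t2"], utility)
```

Even this toy example shows the sensitivity noted above: the quality of the allocation is only as good as the utility model supplied to it.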

2.2. MRTA Solved by MARL

While the methods described previously are effective for small-scale cooperative tasks, they often fail to adequately address the complexities associated with large-scale operations and intricate task constraints [43]. For example, recent research proposed an efficient Ant Colony System for multi-robot task allocation, specifically designed for large-scale tasks and precedence constraints [44]. Another study introduced multi-objective task assignment models that optimize time and energy consumption across groups of robots, using genetic algorithms to minimize the overall system cost [45]. Although these optimization-based algorithms can find optimal solutions, they require significant computational resources and precise models that may not accurately capture real-world variability. Other research has focused on learning-based methods to address the dynamics of such environments. Liu et al. [46] solved a planning problem with a learning-based algorithm based on an options framework for cooperative multi-robot systems in an aircraft painting application. Despite these advancements, there remains no one-size-fits-all solution for training robots to handle diverse roles across different settings. The design and development of activities and training procedures for robotic systems still require significant foundational work. Leveraging MARL emerges as a compelling approach to addressing the MRTA problem effectively in complex settings. MARL combines the strengths of existing methods to handle both the scale and the dynamic variability of tasks, offering a more flexible and comprehensive solution for managing complex, cooperative task allocations in evolving environments. For example, Agrawal et al. [47] suggested an attention-inspired MARL method for multi-robot task allocation in warehouse environments. Recently, Lee et al. [48] introduced a digital-twin-driven DRL approach for adaptive task allocation in robotic construction, assembling prefabricated concrete bricks with stationary robotic arms in a simulated environment. A MARL approach to the MRTA problem can learn high-quality policies that empirically approach strong solutions and continuously improve over time. MARL supports cooperative task distribution and effectively resolves conflicts, significantly enhancing system performance. This method not only optimizes the allocation process across multiple agents but also evolves dynamically with the environment, leading to more robust and efficient outcomes in complex settings.

2.3. Research Gaps

Traditional approaches to multi-robot coordination often rely on complex heuristic rules and computationally intensive procedures. The emergence of Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL) has significantly advanced the capacity to address combinatorial optimization problems with large solution spaces, enabling artificial agents to collaborate effectively in decision-making [40]. However, integrating MARL into real-world multi-robot systems remains limited. A primary challenge in RL research lies in configuring task environments that align with algorithmic assumptions and constraints [49]. Most RL algorithms are developed and evaluated in low-dynamic environments such as DeepMind Lab [50] and OpenAI Gym [51]. Adapting realistic activities and task logic to these frameworks remains difficult, further constrained by the lack of standardized training and benchmarking environments [52]. Additionally, uncertainty persists in identifying the most effective algorithms for efficient collaboration in dynamic, unstructured operational contexts [53]. These challenges highlight the need for a modular and standardized framework that supports interoperability across robotic platforms, task structures, and environmental settings, enabling consistent testing and adaptation.
(1) Simulation Environments Setup
Current RL simulation environments, such as those used for optimal scheduling [54] or automated gameplay [55], have limitations for representing dynamic, real-world robotic contexts. The trial-and-error nature of RL complicates the design of robust training schemes that accurately capture the interaction between agents and their environments. Challenges include initializing states, modeling agent–environment dynamics, and defining appropriate constraints and reward functions. Simplifying and standardizing these aspects is critical to improving the scalability and realism of RL applications in complex robotic systems [56].
(2) MARL Algorithms
Robotic activities demand a precise understanding of state–action relationships, often requiring expert knowledge to solve physical equations, design control scripts, and define temporally and spatially coupled constraints [57]. There remains a pressing need to benchmark and adapt existing MARL algorithms for different robotic platforms and operational domains. Customizing algorithms to specific applications (e.g., by adjusting hyperparameters, reward structures, and communication strategies) could substantially improve training efficiency and task performance.
(3) Multi-Agent Collaboration
Multi-robot systems frequently encounter challenges in resource allocation, scheduling, and risk management that impact project performance, leading to inefficiencies such as design errors, delays, and cost overruns. Integrating these practical constraints into learning systems for real-time decision-making remains a major difficulty. Searching for optimal solutions in large, dynamic solution spaces is computationally expensive [58], and transferring learned experiences across agents to achieve coordinated, group-level performance remains an open problem. Strengthening inter-agent communication and shared learning mechanisms is essential to improving cooperation and achieving more cohesive multi-robot behavior [59].

3. Methodology

The proposed methodology builds on a modular, extensible framework for simulating robotic models in dynamic, variable environments. Through iterative training cycles, robots autonomously learn, adapt, and coordinate their assigned roles to optimize collective performance and ensure system stability. The methodology is organized into distinct but interconnected work packages, each addressing a critical stage of the multi-robot learning process, ranging from environment modeling and scenario definition to communication, coordination, and real-world verification.
The framework is divided into three main segments (Figure 1). The blue box represents dynamic inputs, including environmental, robotic, and task-logic models. These inputs ensure flexibility and adaptability across different robotic platforms, operational settings, and activity types. The red box corresponds to the simulator, which manages the training process, data flow, and communication among agents. The green box denotes the application stage, where policies trained in simulation are transferred to real-world systems for verification and performance evaluation.
The modular design allows for seamless interaction among the work packages, facilitating iterative refinement between simulation and physical implementation. Each data flow defines a structured exchange between work packages, enabling systematic progression from virtual experimentation to real-world validation. This workflow depicts the interdependencies and feedback loops essential to adaptive multi-robot task allocation and coordination.

3.1. Work Packages

3.1.1. Work Package 1: Communication Link

The first work package focuses on establishing communication protocols that support effective multi-agent interaction. These protocols are essential for coordinating collaborative tasks, allocating resources, and maintaining efficiency within the networked system architecture. Robust communication frameworks are fundamental to enabling mutual understanding and functional coordination among robotic agents [60]. Within the proposed framework, the communication link functions as both a shared memory and a data exchange hub, facilitating the storage and transfer of information across agents [61]. Each task scenario is modeled within the simulator, which dynamically responds to changes and transmits the resulting data to the communication link. This process allows agents to train within a unified network, forming connections between the simulation environment and the learning algorithms. The system supports collaborative learning and coordinated behavior among agents operating across varied tasks, environmental conditions, and training configurations.
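The shared-memory-plus-exchange role described above can be sketched as a tiny in-process hub that mimics ROS-style topic publish/subscribe: the latest message per topic is retained (shared memory), and each new message is forwarded to all subscribers (data exchange). This is an illustrative stand-in, not the actual ROS middleware; topic names and message fields are invented.

```python
from collections import defaultdict

class CommLink:
    """Toy communication link: topic-based storage plus fan-out delivery."""
    def __init__(self):
        self._subs = defaultdict(list)
        self.latest = {}                       # topic -> last message seen

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, msg):
        self.latest[topic] = msg               # shared memory role
        for cb in self._subs[topic]:           # data exchange role
            cb(msg)

link = CommLink()
received = []
link.subscribe("/robot1/task", received.append)
link.publish("/robot1/task", {"task_id": 3, "goal": (1.0, 2.0)})
```

In the real framework this role is played by the ROS graph itself (topics, services, and parameters) rather than a single Python object.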

3.1.2. Work Package 2: Simulation Environments

Work Package 2 focuses on developing a simulated environment, which is essential for validating the integration of reinforcement learning (RL) algorithms [62]. A high-fidelity simulator is required due to the computational demands of algorithm training and the limitations associated with direct experimentation on physical robots. For training to be both robust and practical, the simulator must provide sufficient physical realism while maintaining computational efficiency, enabling extensive testing and iterative refinement without excessive time or resource costs [63]. Such simulation environments serve as a critical interface between theoretical algorithm development and real-world robotic implementation.

3.1.3. Work Package 3: Scenario Definition

Defining tasks within the simulation environment is a critical step for the effective training and evaluation of autonomous agents [64]. Accurate task representation requires careful specification of parameters that govern robotic operations and interactions [65]. In Multi-Agent Reinforcement Learning (MARL), platforms such as OpenAI Gym provide standardized environments and interfaces that facilitate benchmarking and algorithm comparison across diverse tasks [66]. Each environment within OpenAI Gym encapsulates a distinct task or problem, enabling systematic testing of algorithmic performance and adaptability [67]. Similarly, this work package extends the concept of task definition by developing dynamic, interactive simulation scenarios that enable multiple agents to operate, collaborate, and adapt under varying environmental and operational conditions [68]. This structured approach supports comprehensive evaluation and fosters the development of more generalizable MARL strategies for robotic applications.
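Following the Gym-style reset/step convention mentioned above, a multi-robot task scenario can be outlined as below. This is a hedged sketch with an invented shared team reward (one point per newly completed task), not the environment used in the case study.

```python
class MultiRobotTaskEnv:
    """Toy Gym-style scenario: agents claim task indices until none remain."""
    def __init__(self, n_agents=2, n_tasks=3):
        self.n_agents, self.n_tasks = n_agents, n_tasks

    def reset(self):
        self.remaining = set(range(self.n_tasks))
        return [self._obs(i) for i in range(self.n_agents)]

    def step(self, actions):
        reward = 0.0
        for task in actions:                   # each agent picks a task index
            if task in self.remaining:
                self.remaining.discard(task)
                reward += 1.0                  # shared team reward
        done = not self.remaining
        return [self._obs(i) for i in range(self.n_agents)], reward, done, {}

    def _obs(self, agent_id):
        # Local observation: own id plus the set of unfinished tasks.
        return (agent_id, tuple(sorted(self.remaining)))

env = MultiRobotTaskEnv()
obs = env.reset()
obs, reward, done, info = env.step([0, 1])
```

Keeping scenarios behind this small interface is what lets different task sets be swapped in without touching the learning code.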

3.1.4. Work Package 4: MARL Algorithm

The development of Multi-Agent Reinforcement Learning (MARL) algorithms is often characterized by limited transparency, as evaluations are conducted across diverse environments with restricted interpretability of internal processes [69]. To address this challenge, this work package incorporates systematic benchmarking to identify and assess the most suitable algorithms based on performance metrics that reflect real-world applicability. The training framework is designed to be adaptable, allowing modifications to algorithmic structures, hyperparameters, and learning strategies in response to feedback from the simulated environment. This adaptive approach enables continuous refinement of algorithm performance and facilitates evaluation of its robustness and scalability in complex, dynamic settings.

3.2. Data Flow

The data flows within the proposed framework describe the sequential and interconnected processes that govern the integration of environment modeling, robotic control, and MARL-based learning. Together, they represent the complete training and validation cycle, from virtual simulation to real-world application.
Data Flow 1: Environment Digitization
The first data flow focuses on digitizing the operational environment. Physical structures, workspace geometry, and environmental parameters are encoded into the simulator, providing the foundation for realistic interaction and analysis.
Data Flow 2: Robotic Modeling
This data flow integrates detailed models of the robots, including their mechanical configurations, sensor arrays, and control dynamics. These digital representations ensure accurate simulation of perception and motion behaviors.
Data Flow 3: Dynamic Task Configuration
Task definitions and logic, traditionally embedded through static programming, are dynamically configured to respond to changing environmental and operational conditions. This data flow captures the adaptive task logic and the mathematical formulations that define robot behaviors, from simple navigation to complex manipulation tasks.
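One way to picture adaptive task logic is a task record carrying precedence relations, so the allocator can only dispatch tasks whose predecessors are complete. The field names below are illustrative assumptions, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: int
    location: tuple
    predecessors: frozenset = frozenset()   # task ids that must finish first

def ready_tasks(tasks, completed):
    """Tasks that are not yet done and whose predecessors are all complete."""
    return [t for t in tasks
            if t.task_id not in completed and t.predecessors <= completed]

tasks = [Task(0, (0, 0)),
         Task(1, (2, 1), frozenset({0})),
         Task(2, (3, 3), frozenset({0, 1}))]
ready = ready_tasks(tasks, completed=set())
```

Recomputing the ready set after every completion is what makes the task configuration dynamic rather than a static, pre-programmed schedule.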
Data Flow 4: Task-environment Interaction and Iterative Learning
Adaptability and learning are central to the MARL approach. This data flow represents the iterative feedback process through which agents refine decision-making based on continuous interactions between tasks and environmental conditions.
Data Flow 5: Communication Build-Up
The communication build-up acts as a real-time data exchange layer. It serves dual functions: first, as a repository for continuous sensory data generated by the agents; and second, as an executor that processes and dispatches control commands derived from the learning algorithms. This process enables responsive and coordinated agent behavior within the simulation.
Data Flow 6: MARL Training Communication
This data flow establishes bidirectional communication between the simulator and the MARL training framework. It transmits environmental states and agent feedback to the communication link while delivering updated training parameters and decision inputs back to the simulator, facilitating iterative learning and algorithm refinement.
Data Flow 7: Decision Command Processing
In this stage, the MARL algorithm processes environmental inputs, evaluates potential actions, and outputs corresponding decisions. These decisions are transmitted to the robots as globally assigned tasks, guiding their autonomous operations within the simulated environment.
Data Flow 8: Simulation-to-Reality Translation
The final data flow bridges the simulation and real-world implementation. It translates validated behaviors and optimized decision strategies from the virtual environment into executable commands for physical robots, ensuring consistency and reliability between training and deployment.
Collectively, these data flows form the operational backbone of the framework, encapsulating the full life cycle of robotic learning, from environmental modeling and task adaptation to real-world application.

4. Case Study: Small-Scale Collaborative Mobile Robots

To evaluate the proposed modular framework for Multi-Agent Reinforcement Learning (MARL), a case study was conducted using small-scale mobile robots operating within simulated and physical environments. The objective of this case study is to demonstrate how the framework supports adaptive task allocation, coordinated navigation, and autonomous decision-making in dynamic, multi-agent contexts.
The implementation employs the Robot Operating System (ROS), an open-source middleware widely adopted in robotics research, to facilitate communication, data exchange, and system integration across simulation and control layers. The Multi-Agent Proximal Policy Optimization (MAPPO) algorithm [70] serves as the central learning mechanism, supporting centralized training and decentralized execution to enhance cooperative performance among multiple agents.
Table 3 summarizes the key characteristics of the case study configuration, outlining the simulation environment, scenario definition, communication framework, algorithmic setup, and simulation-to-reality translation through eight interconnected data flows.

4.1. Work Package 1: Communication Link

In this case study, ROS serves as the central integration framework, linking the training algorithms, simulation environment, and robotic agents. This integration facilitates advanced functionalities such as machine learning, adaptive control, and real-time data exchange. A key practical application of this architecture is the ROS Navigation Stack, which exemplifies the framework’s capability to support autonomous navigation and coordination in realistic operational settings. Table 4 presents an overview of the ROS network architecture and its sub-components, illustrating the communication pathways and interactions among system elements.
The robot task planner operates as a core node within this framework, receiving schedule data, task descriptions, and the Simulation Description Format (SDF) representation of the building environment. These inputs are combined with project- and task-specific rules that define operational constraints related to scheduling, safety, and spatial coordination. Based on these inputs, the path and task planner generates the robot's motion trajectories and visualizes behaviors through textual and graphical outputs. The system produces a series of output files, including an updated task schedule and recorded positional data of the robot's mobile base and end-effector, captured as 3D coordinates throughout the simulation.
Figure 2 presents an overview of the selected ROS nodes and packages used in this case study, illustrating their functions and integration within the ROS framework. Among existing options, ROS provides a suitable platform for combining simulation, control, and learning components to enable machine learning and data-driven control for multi-agent training within the simulator. Currently, there is no standardized framework for integrating reinforcement learning with the ROS interface for task allocation problems. To address this gap, a dedicated MARL training node, “rl_manager,” was developed to interface between the simulator and the training scripts, enabling seamless data exchange between ROS processes and the neural network during training.
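The core job of a node like "rl_manager" is translation: turning simulator state messages into the flat observation vectors a policy network consumes, and turning discrete policy actions back into robot commands. The sketch below illustrates that translation layer in plain Python; the message fields, action table, and velocity values are hypothetical, not the paper's actual interfaces.

```python
def state_to_obs(state_msg):
    """Flatten a per-robot state message into a policy observation vector."""
    x, y = state_msg["pose"]
    return [x, y, float(state_msg["battery"]), float(state_msg["task_id"])]

def action_to_cmd(action):
    """Map a discrete policy action to a Twist-like velocity command."""
    table = {0: (0.0, 0.0),    # stop
             1: (0.2, 0.0),    # forward
             2: (0.0, 0.5),    # turn left
             3: (0.0, -0.5)}   # turn right
    lin, ang = table[action]
    return {"linear_x": lin, "angular_z": ang}

obs = state_to_obs({"pose": (1.5, -0.5), "battery": 0.8, "task_id": 2})
cmd = action_to_cmd(1)
```

In the deployed system these conversions sit between ROS subscribers/publishers on one side and the neural network's tensors on the other.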

4.2. Work Package 2: Simulation Environments

A simulation environment was developed as a testbed for integrating reinforcement learning (RL) algorithms into the multi-robot task allocation (MRTA) problem. The Flatland simulator [71] was used to incorporate multiple input models, including dynamic objects, structural elements, and the operational boundaries of the training environment.
Data Flow 1: Environment Digitization
The simulation environment is established by linking digital environment models to the robotic framework. The environment is digitized using Building Information Modeling (BIM) data, where files in .sdf or .dae formats provide detailed representations of workspace geometry, structural components, and dynamic entities. In practice, this setup mirrors the laboratory environment, where BIM models of the lab layout are imported into the simulator to visualize robotic movements and interactions within a realistic digital setting.
As shown in Figure 3a, green columns represent static obstacles and boundaries introduced to increase environmental complexity and to ensure the simulation accurately reflects real-world challenges. This design supports meaningful scheduling and obstacle avoidance during training, allowing results to be transferable to real-world applications. The ROS network continuously records environmental changes triggered by robotic actions and displays them in real time through the visualization interface. This establishes a dynamic, interactive simulation platform that underpins the development and validation of RL algorithms in robotics.
Figure 3b illustrates the recorded robot trajectories, where red and green particle clouds represent the real-time odometry data of the mobile robots during operation. These visualizations provide valuable insight into movement accuracy, environmental responsiveness, and algorithmic performance throughout the training process.
Data Flow 2: Robotic Modeling
The second data flow focuses on modeling the robots and defining their functional characteristics within the ROS environment. This process begins with the creation of robot models using the Unified Robot Description Format (URDF), an XML-based specification used to import the models into ROS and to describe their physical configuration, kinematic structure, and dynamic properties. The URDF file serves as a comprehensive blueprint for the robotic platform, enabling the definition of its movement capabilities and interaction parameters.
For navigation, ROS utilizes a layered navigation stack comprising global and local planners. The global planner generates a path from the robot’s current position to a designated goal based on a pre-mapped environment, while the local planner manages short-term adjustments to accommodate dynamic obstacles and real-time changes. In this study, the global planner was modified to introduce variability during navigation, ensuring that even with fixed start and goal positions, robots exhibit non-deterministic behaviors by exploring alternative paths and strategies. The local planner was further enhanced to incorporate collision-avoidance mechanisms for multi-robot coordination, enabling agents to dynamically adjust their trajectories when paths intersect.

4.3. Work Package 3: Scenario Definition

Data Flow 3: Dynamic Task Configuration
This data flow focuses on defining and managing task-related parameters within the simulation environment. As illustrated in Figure 4, four robots are programmed to navigate toward eight predefined goal positions. To reflect realistic operational complexity, logical constraints are introduced into the task schedule, such as task dependencies (specifically, the predecessor–successor relationships between tasks A and B), to ensure activities are executed in a logical, sequential order consistent with real-world project requirements.
The simulation’s dynamics are strongly influenced by the navigation strategies adopted. The robots employ ROS navigation stacks that use the “Navfn” and “Dynamic Window Approach (DWA)” planners, enabling autonomous path planning based on environmental awareness and obstacle avoidance. These planners allow the robots to determine efficient navigation paths to their assigned goals under varying environmental conditions. However, because real-time decision-making introduces stochasticity, each training episode can produce different navigation routes and durations. This variability contributes to a more realistic simulation environment, reflecting the uncertainty and dynamic nature of real-world task execution.
Data Flow 4: Task–Environment Interaction and Iterative Learning
Based on the defined robotic work scenario, this data flow formalizes the processes required for iterative learning. Algorithm 1 shows the implementation workflow, outlining the sequence of operations that govern multi-agent training and task execution. The learning cycle is structured around key components, including the state definition, action space, and core learning operations—action selection and execution, state updating, observation, reward calculation, episode termination, and reset. These stages collectively define the interaction between tasks and the environment, shaping how agents perceive, act, and adapt throughout each episode. The formalization provides a systematic framework for capturing the learning dynamics, enabling consistent evaluation, parameter adjustment, and behavioral refinement of agents within the simulated environment.
Algorithm 1. Pseudocode implementation of the iterative learning process for robot task assignment
// Initialization and Definition of Parameters
Initialize robots R as [R_0, R_1, ..., R_n]
Initialize Global Assigned Tasks T_assigned as [T_Ai, T_Bj, None]
Initialize Global Reached Tasks T_r as [], an empty list
Set Simulation Time t_t and Time Goal Reached t_r
Define Action Space A_ij for each robot
// Begin Iterative Training Steps
for each episode do
        // Action Assignment
        for each robot in R do
                Assign new goal from Action Space A_ij based on current state and policy
        // Execution
        for each robot in R do
                Attempt to reach assigned goal within the environment
                Record execution success or failure
        // State Update
        Update the state of the environment
        Track changes and update Simulation Time t_t
        // Observation State Reporting
        Process and report the state as neural network input
        Update Observation State with new environment and robot statuses
        // Reward Calculation
        for each robot in R do
                Calculate reward based on task completion, logic, and schedule adherence
        // Episode Check
        if all goals in T_assigned are reached or timeout is reached then
                Conclude the episode
        else
                Increment step count and continue with next step in the episode
        // Reset and Error Handling
        if episode concluded or connection error then
                if connection error then
                        Terminate the simulator to prevent error propagation
                else
                        Reset robots to initial positions for next episode
                end if
        end if
end for
// Dynamic Configuration and Adaptation
Adjust parameters for agent count, efficiency, and pick/place durations as needed
Modify logic and weights governing task sequences for various scenarios
This iterative learning process plays a central role in developing autonomous robotic behavior. By simulating realistic task execution scenarios, robots learn to complete assigned tasks while optimizing path planning and logical consistency, capabilities essential for deployment in dynamic, unpredictable environments. The process focuses not on predefined navigation paths but on the sequencing and allocation of tasks among multiple robots. Decision-making in the multi-robot task allocation (MRTA) framework centers on determining the optimal order of task completion and assigning responsibilities to specific agents, thereby addressing logistical efficiency and coordination rather than explicit route planning. Within this framework, the action space is defined by assigning distinct goal positions to individual robots. A “None” activity may also be designated, representing an idle state. Each robot then employs the navigation stack (comprising the global and local planners) to autonomously determine and follow its path to the assigned goal.
To formalize, the action space is described as:
$$NAV_{ij} = \{NAV_{A1}, \ldots, NAV_{Ai}, \ldots, NAV_{B1}, \ldots, NAV_{Bj}, \mathrm{None}\}$$
The action space can be formalized as a function of robot–environment interactions, where behavior dynamics arise from non-static path planning. Consequently, each iteration may produce unique navigation patterns, even with identical start and end points, reflecting the stochastic nature of the learning process. In MARL, defining effective observation spaces is crucial for enabling agents to make informed decisions in such dynamic settings. At each time step, agents access both their individual state observations and shared data from other agents, promoting coordinated learning. The complete set of state variables for each robotic agent is implemented in code and stored within the ROS network, as detailed in Table 5, for subsequent use in neural network training. Furthermore, the performance metrics summarized in Table 6 provide a comprehensive assessment of system behavior, supporting targeted refinements in algorithm design and robotic task execution.
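As an illustration, the action space defined above (goal-navigation actions plus the idle action) can be constructed with a short helper; the task identifiers below are hypothetical and simply mirror the A/B task types from the scenario.

```python
def build_action_space(tasks_a, tasks_b):
    """Enumerate goal-navigation actions NAV_A1..NAV_Ai and NAV_B1..NAV_Bj,
    plus the idle action 'None' described in the text."""
    actions = [f"NAV_{t}" for t in tasks_a] + [f"NAV_{t}" for t in tasks_b]
    actions.append("None")  # idle state: robot holds position
    return actions

# Example: eight goals as in Figure 4, split into A- and B-type tasks.
space = build_action_space(["A1", "A2", "A3", "A4"], ["B1", "B2", "B3", "B4"])
```

Each agent's discrete policy output then indexes into this list, and the selected goal is handed to the navigation stack for execution.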
The reward function quantitatively evaluates each agent’s action. It provides feedback on the quality of an action relative to the current environmental state and its resulting outcome, thereby guiding the agents toward behaviors that maximize overall task performance.
$$rs_{logic}^{i}(t) = \begin{cases} 5, & \text{robot assigned a task in the correct order and finished} \\ 0, & \text{robot assigned a task in the correct order but not finished} \\ -2, & \text{robot assigned a task in the wrong order but finished} \\ -5, & \text{robot assigned a task in the wrong order and not finished} \end{cases}$$
$$rs_{nav}^{i}(t) = \begin{cases} 2 - 2\,\dfrac{t_t - t_{min}}{t_{avg}}, & t_{min} < t_t < t_{max} \\ -8, & t_t > t_{max} \end{cases}$$
$$rs_{col}^{i}(t) = -0.05$$
$$rs_{idle}^{i}(t) = -0.22 \cdot t_{idle}$$
$$R^{i}(t) = \omega_a\, rs_{logic}^{i}(t) + \omega_b\, rs_{nav}^{i}(t) + \omega_c\, rs_{col}^{i}(t) + \omega_d\, rs_{idle}^{i}(t)$$
The reward function considers the following components that collectively evaluate task performance, navigation efficiency, and operational safety within the learning framework:
$rs_{logic}^{i}(t)$ provides rewards based on task logic and completion. A high reward (5) is given for completing a task in the correct order, a neutral reward (0) for being assigned a correct-order task but not finishing it, a penalty (−2) for finishing a task in the wrong order, and a larger penalty (−5) for being assigned a task in the wrong order and not completing it.
$rs_{nav}^{i}(t)$ accounts for the navigation time relative to minimum and maximum time thresholds, rewarding more efficient navigation.
$rs_{col}^{i}(t)$ penalizes the robot for collisions with a fixed negative value (−0.05).
$rs_{idle}^{i}(t)$ penalizes the robot for idle time, reducing the overall reward in proportion to the duration of inactivity.
The cumulative reward function $R^{i}(t)$ for a robot $i$ at a given time step $t$ is defined as a weighted sum of multiple components that evaluate logic-based task performance, navigation efficiency, idle duration, and collision penalties. The weights $\omega_a$, $\omega_b$, $\omega_c$, and $\omega_d$ determine the relative importance of each factor in optimizing overall performance. This structured reward formulation encourages balanced decision-making that promotes efficiency, safety, and task adherence, thereby guiding robots toward behaviors transferable to real-world construction environments.
This structured reward design guides the agents toward behavior that balances task accuracy, efficiency, and safety. It ensures that each robot learns to operate autonomously while adhering to performance standards aligned with practical deployment in dynamic and real-world environments.
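As a minimal sketch, this reward composition can be written as a few small functions. The specific constants and the weight tuple below are assumptions drawn from the equations and the values reported in Section 4.4.5, and their mapping to $\omega_a$ through $\omega_d$ is illustrative, not definitive.

```python
def logic_reward(correct_order, finished):
    # 5 / 0 / -2 / -5 depending on ordering correctness and completion.
    if correct_order:
        return 5.0 if finished else 0.0
    return -2.0 if finished else -5.0

def nav_reward(t, t_min, t_avg, t_max):
    # Faster-than-average navigation earns a positive reward; exceeding
    # the maximum allowed time incurs a fixed penalty.
    if t > t_max:
        return -8.0
    if t_min < t < t_max:
        return 2.0 - 2.0 * (t - t_min) / t_avg
    # Times at or below t_min fall outside the specified cases and are
    # treated as full reward here (an assumption).
    return 2.0

def total_reward(r_logic, r_nav, r_col, r_idle, w=(0.6, 0.35, 0.8, 0.2)):
    # Weighted sum R_i(t); the weight ordering is an assumption.
    wa, wb, wc, wd = w
    return wa * r_logic + wb * r_nav + wc * r_col + wd * r_idle

r = total_reward(logic_reward(True, True),
                 nav_reward(30.0, 10.0, 40.0, 120.0),
                 -0.05,            # one collision event
                 -0.22 * 4.0)      # idle penalty for 4 s of inactivity
```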
Data Flow 5: Communication Build-Up (ROS node communication setup)
Establishing an efficient communication interface between the RL algorithm and the simulation environment is fundamental to developing an effective training pipeline within a robotic context. Data Flow 5 describes the ROS node communication setup that enables this information exchange.
At the core of the training architecture, the ROS framework coordinates the data flow between the RL algorithm—referred to as the algorithm brain—and the simulated environment. The communication process is managed by a series of ROS nodes that publish and subscribe to specific topics, facilitating bidirectional data exchange.
  • Publishing Actions: Nodes broadcast the actions generated by the algorithm at each time step t. These actions are transmitted through dedicated ROS topics (e.g., r_i/goal), where i denotes the specific robot instance. The published message represents the goal position toward which the robot's navigation stack autonomously maneuvers.
  • Subscribing to Observations: Nodes simultaneously subscribe to observation topics that provide information about the robot’s state and the surrounding environment. These data serve as neural network inputs, informing decision-making and reward computation.
Additional communication topics support the exchange of contextual and performance data throughout training:
  • Goal Command: Nodes publish to the r_i/goal topic to issue action commands to the corresponding robot.
  • State and Environment Data: Nodes subscribe to /scan and /odom to obtain spatial and motion data, ensuring situational awareness and collision avoidance.
  • Simulation Time: Subscribing to /sim_time synchronizes training steps with the simulation clock.
  • Idle Time: The /idle_time topic logs periods of robot inactivity, helping assess efficiency.
  • Task Management: Nodes publish to /Global_assigned_task to record task allocations and subscribe to /Global_reached_task to monitor task completion order, providing essential data for evaluating task sequencing and logical consistency.
This ROS-based communication architecture serves as the backbone of the MARL training process. It enables continuous, real-time interaction between the algorithm and robotic agents, supporting adaptive decision-making and coordinated task execution. The framework thus provides a foundation for developing autonomous, precise robotic behaviors in complex, dynamic environments.
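For reference, the publish/subscribe roles listed above can be collected into a small routing table. The helper below is purely illustrative: the topic names are taken from the text, the direction labels are from the learning algorithm's point of view, and per-robot goal topics are parameterized by index.

```python
def topic_map(n_robots):
    """Routing table for the MARL training node, from the algorithm's
    perspective: which topics it publishes to and which it subscribes to."""
    topics = {f"r{i}/goal": "publish" for i in range(n_robots)}
    topics.update({
        "/scan": "subscribe",                 # laser data for collision avoidance
        "/odom": "subscribe",                 # pose and motion data
        "/sim_time": "subscribe",             # synchronizes training steps
        "/idle_time": "subscribe",            # logs robot inactivity
        "/Global_assigned_task": "publish",   # records task allocations
        "/Global_reached_task": "subscribe",  # monitors completion order
    })
    return topics

topics = topic_map(2)
```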

4.4. Work Package 4: MARL Algorithm (MAPPO Under CTDE)

Data Flow 6: MARL Training Communication
As agents interact with the environment and adapt through experience, Data Flow 6 operates as a bidirectional communication channel between the MARL algorithm and the simulation framework. It retrieves state observations from the communication link and transmits training outputs, such as policy updates and value function refinements, back into the network. This exchange ensures that the learning algorithm continuously aligns its decision-making with environmental dynamics and agent performance.
The MARL algorithm functions on the principle of perception and evaluation, where each agent observes both its own state and partial representations of other agents’ states within the shared environment. The environment then evaluates these actions, generating feedback that informs the subsequent learning cycle.
  • State Observation: Data Flow 6 extracts the environment’s state from the communication link, comprising sensory readings and internal status data. These observations serve as the neural network inputs that represent the agents’ understanding of their surroundings.
  • Training Outputs: Updated parameters, including value functions and policy adjustments, are communicated back to the network. This iterative feedback loop—action, evaluation, and update—forms the core of the MARL mechanism.
In this case study, the task and corresponding reward function are explicitly defined to influence the policy optimization process. The algorithm uses all observed state information to calculate rewards and update the network at each time step and episode. The observations, which may include raw sensor vectors or image-based data, are processed within the ROS network illustrated in Figure 5. These inputs are fed into a Recurrent Neural Network (RNN) [72], where training updates the network parameters (ω and θ) to improve temporal learning and action prediction. The optimized policy outputs are then transmitted back through the ROS communication link, allowing the simulator to execute the neural network–generated task allocation plan.
Data Flow 7: Decision Command Processing
Data Flow 7 plays a pivotal role in coordinating joint actions within the MARL framework. It implements a stochastic decision-making mechanism that integrates centralized learning with decentralized execution, forming a hybrid architecture that balances global coordination with individual autonomy. Under this model, all agents are trained collectively to develop a shared strategy while retaining the ability to act independently based on their local observations and internal decision-making processes.
As shown in Figure 6, this setup corresponds to the structure of the MAPPO algorithm. We choose MAPPO because PPO’s clipped surrogate is stable and performant in cooperative MARL when paired with a centralized critic, often matching or exceeding more complex off-policy methods across standard benchmarks (MPE, SMAC, Hanabi), while remaining simple to implement and tune.
During training, agents operate in a shared virtual environment governed by a central policy that is iteratively updated based on the aggregated experiences of all agents. This centralized learning process enables the model to capture the collective dynamics of multi-agent interaction and optimize the overall task performance.
During execution, each agent $i$ has a decentralized actor $\pi_{\theta}(a_t^i \mid o_t^i)$ that conditions only on its local observation $o_t^i$, while a centralized critic $V_{\phi}(s_t)$ conditions on the global state $s_t$ (or concatenated observations) only during training to compute advantages; at test time, agents act independently using $\pi_{\theta}$ with local observations. This means that the learned policy is distributed among individual agents, each applying it autonomously to respond to real-time environmental changes. This decentralized execution enables flexibility, scalability, and resilience in dynamic, uncertain conditions. Each agent's local decision-making process draws from the globally trained policy, generating individualized action commands that are synchronized at the system level. When executed collectively, these coordinated actions yield optimal group performance and efficient task allocation within the simulation framework.
MAPPO Formulation under CTDE
In this work, cooperative multi-robot task allocation is formulated as a cooperative multi-agent Markov decision process (MMDP) with $N$ agents (robots). At each discrete time step $t$, the environment is characterized by a global state $s_t \in S$, and each agent $i \in \{1, \ldots, N\}$ receives a local observation $o_t^i \in O^i$. Based on its observation, agent $i$ selects an action $a_t^i \in A^i$. In our case study, the actions correspond to goal commands for the navigation stack (i.e., target waypoints in the 2D environment), while the observations and state variables follow the definitions summarized in Table 5 and the observation space in Table 6.
The joint action at time $t$ is denoted by:
$$a_t = (a_t^1, \ldots, a_t^N)$$
and the environment evolves according to the transition dynamics:
$$s_{t+1} \sim P(\cdot \mid s_t, a_t)$$
while each agent receives a scalar reward $r_t^i$. In this cooperative setting, the reward for agent $i$ is constructed as a weighted sum of logic, time, idle, and collision components, as described in Section 4.3:
$$r_t^i = w_{logic}\, r_{logic,t}^i + w_{time}\, r_{time,t}^i + w_{idle}\, r_{idle,t}^i + w_{col}\, r_{col,t}^i$$
where $w_{logic}$, $w_{time}$, $w_{idle}$, and $w_{col}$ are scalar weights controlling the relative importance of task-logic correctness, navigation efficiency, idle time, and collision penalties.

4.4.1. CTDE Architecture

We adopt a Centralized Training, Decentralized Execution (CTDE) paradigm. During training, a centralized critic $V_{\phi}(s_t)$ conditions on the global state $s_t$ (or an equivalent concatenation of all agents' observations and global variables), while each agent $i$ has a decentralized actor $\pi_{\theta}^{i}(a_t^i \mid o_t^i)$ that only observes its local input $o_t^i$. Policy parameters $\theta$ are shared across agents in this study, but the formulation permits heterogeneous actors if needed.
Centralized critic:
$$V_{\phi} : S \to \mathbb{R}, \qquad V_{\phi}(s_t) \approx \mathbb{E}[G_t \mid s_t],$$
where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ is the discounted return, $\gamma \in [0,1)$ is the discount factor, and $r_k$ is the team reward (e.g., mean or sum across agents).
Decentralized actors:
$$\pi_{\theta}^{i} : O^i \to \mathcal{P}(A^i), \qquad a_t^i \sim \pi_{\theta}^{i}(\cdot \mid o_t^i)$$
During execution (test time), only the actors are used: each robot i selects actions based solely on its local observation o t i , without access to the centralized critic or other agents’ internal states. This ensures that execution remains fully decentralized and consistent with real robotic constraints, while training benefits from global information to stabilize learning and shape the value estimates.
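This actor–critic split can be sketched structurally with plain numpy: decentralized actors consume only local observations, while the centralized critic consumes the concatenation of all observations and is used during training only. The linear layers, dimensions, and initialization below are illustrative placeholders for the RNN used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class Actor:
    """Decentralized actor: maps a LOCAL observation to a distribution
    over discrete goal actions (softmax over a linear layer; the paper
    uses an RNN, so this is only a structural sketch)."""
    def __init__(self, obs_dim, n_actions):
        self.W = rng.normal(scale=0.1, size=(obs_dim, n_actions))

    def action_probs(self, obs):
        logits = obs @ self.W
        e = np.exp(logits - logits.max())  # stable softmax
        return e / e.sum()

class CentralCritic:
    """Centralized critic: conditions on the GLOBAL state (here, the
    concatenation of all agents' observations), training time only."""
    def __init__(self, state_dim):
        self.w = rng.normal(scale=0.1, size=state_dim)

    def value(self, global_state):
        return float(global_state @ self.w)

obs_dim, n_actions, n_agents = 6, 9, 2
actors = [Actor(obs_dim, n_actions) for _ in range(n_agents)]
critic = CentralCritic(obs_dim * n_agents)

local_obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
probs = [a.action_probs(o) for a, o in zip(actors, local_obs)]
v = critic.value(np.concatenate(local_obs))  # used for advantages in training
```

At deployment, only the `Actor` objects would run on the robots; the `CentralCritic` is discarded after training, matching the decentralized-execution constraint.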

4.4.2. PPO-Based Policy Objective

We build on PPO to implement the policy update for each agent. Let $\theta_{old}$ denote the policy parameters before an update, and define the importance sampling ratio for agent $i$ at time $t$ as:
$$r_t^i(\theta) = \frac{\pi_{\theta}^{i}(a_t^i \mid o_t^i)}{\pi_{\theta_{old}}^{i}(a_t^i \mid o_t^i)}$$
Advantages A ^ t i are computed using a centralized critic with Generalized Advantage Estimation (GAE), which combines temporal-difference residuals over multiple steps to reduce variance and bias:
$$\hat{A}_t^i = \sum_{l=0}^{L-1} (\gamma \lambda)^l\, \delta_{t+l}^i, \qquad \delta_t^i = r_t^i + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)$$
where $\lambda \in [0,1]$ controls the trade-off between bias and variance.
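Written out, the backward recursion that implements this truncated GAE sum might look as follows. Only $\gamma = 0.98$ is specified in the paper's hyperparameters; $\lambda = 0.95$ is an assumed default.

```python
def gae_advantages(rewards, values, gamma=0.98, lam=0.95):
    """Generalized Advantage Estimation over one truncated trajectory.
    `values` has one extra entry, V(s_T), used for bootstrapping the
    final temporal-difference residual."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = [0.0] * T
    acc = 0.0
    for t in reversed(range(T)):
        # acc accumulates sum_l (gamma * lam)^l * delta_{t+l}
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages

adv = gae_advantages(rewards=[1.0, 0.0, -1.0], values=[0.5, 0.4, 0.3, 0.0])
```

The backward pass is equivalent to evaluating the sum in the equation above for every $t$, but in a single $O(T)$ sweep.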
For each agent i , the PPO clipped surrogate objective is:
$$L_{policy}^{i}(\theta) = \mathbb{E}_t\left[\min\left(r_t^i(\theta)\, \hat{A}_t^i,\; \mathrm{clip}\left(r_t^i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t^i\right)\right]$$
where $\epsilon > 0$ is the PPO clipping parameter. The clipping operation constrains the magnitude of policy updates, improving training stability across episodes with diverse task allocations.
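For a single sample, the clipped surrogate term reduces to a few lines; the numeric example simply illustrates how a large policy shift on a positive advantage gets capped at $(1+\epsilon)\hat{A}$.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """One term of the PPO clipped surrogate objective: the minimum of
    the unclipped and clipped importance-weighted advantage. eps follows
    the paper's clipping parameter."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large policy shift (ratio 1.8) on a positive advantage is capped at
# (1 + eps) * A, so the update cannot move arbitrarily far from the old
# policy; a ratio inside the trust region passes through unchanged.
capped = ppo_clip_term(ratio=1.8, advantage=2.0)
uncapped = ppo_clip_term(ratio=1.05, advantage=2.0)
```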

4.4.3. Value Function Loss and Entropy Regularization

The centralized critic is trained by minimizing the value function loss, defined as the mean-squared error between predicted values and empirical returns:
$$L_{value}(\phi) = \mathbb{E}_t\left[\left(V_{\phi}(s_t) - \hat{R}_t\right)^2\right]$$
where $\hat{R}_t$ is an estimate of the discounted return at time $t$ (e.g., the sum of discounted rewards from a truncated trajectory, potentially with bootstrapping from $V_{\phi}(s_T)$).
To encourage exploration and avoid premature convergence to deterministic policies, we include an entropy regularization term for each agent:
$$L_{entropy}^{i}(\theta) = \mathbb{E}_t\left[\mathcal{H}\left(\pi_{\theta}^{i}(\cdot \mid o_t^i)\right)\right]$$
where $\mathcal{H}(\cdot)$ denotes the Shannon entropy of the policy distribution. Higher entropy corresponds to more stochastic policies, which can help agents escape suboptimal coordination patterns in multi-agent settings.

4.4.4. Overall Multi-Agent Objective

Combining the policy surrogate, value loss, and entropy regularization yields the overall MAPPO training objective:
$$L(\theta, \phi) = \sum_{i=1}^{N} \mathbb{E}_t\left[L_{policy}^{i}(\theta) + c_v\, L_{value}(\phi) - c_e\, L_{entropy}^{i}(\theta)\right]$$
where $c_v > 0$ and $c_e > 0$ are scalar coefficients controlling the relative weight of the value error and entropy bonus, respectively. In practice, this objective is optimized using minibatch stochastic gradient ascent (for the policy) and descent (for the critic) over trajectories collected jointly from all agents in the shared environment.
This CTDE formulation, together with the structured reward design and ROS-based communication architecture introduced in Work Package 4, yields a stable and reproducible MAPPO training routine for cooperative robotic task allocation. The explicit decomposition into actor loss, critic loss, and entropy terms also supports the training diagnostics discussed in Section 4.5, where actor and critic losses are monitored over time to assess convergence and training stability.

4.4.5. Training Hyperparameters

To ensure reproducibility, Table 7 summarizes the main hyperparameters used in the MAPPO training process. For all experiments in the 2D benchmark, we use a discount factor $\gamma = 0.98$ and PPO clipping parameter $\epsilon = 0.2$ (ppo_eps). The entropy coefficient is set to $c_e = 0.01$ (entropy_factor), and the value loss coefficient to $c_v = 5.0$ (value_factor), balancing exploration against value fitting. We train with the Adam optimizer with a learning rate of $1 \times 10^{-5}$ and an exponential decay factor of 0.95 every 10,000 steps, using a batch size of 1000, a minibatch size of 256, and 10 PPO epochs per update. Episodes run at 5 Hz for 120 s (HZ = 5.0, episode_duration = 120), with a target of 70,000 environment steps and 8 parallel threads for data collection. Unless otherwise stated, this configuration is used for all results reported in the 2D environment.
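As a sketch, this configuration and its stepwise learning-rate schedule can be expressed directly. The dictionary keys below are illustrative rather than the framework's actual config file; the values follow the hyperparameters reported in the text.

```python
config = {
    "gamma": 0.98, "ppo_eps": 0.2,
    "entropy_factor": 0.01, "value_factor": 5.0,
    "lr": 1e-5, "lr_decay": 0.95, "lr_decay_every": 10_000,
    "batch_size": 1000, "minibatch_size": 256, "ppo_epochs": 10,
    "HZ": 5.0, "episode_duration": 120,
    "target_steps": 70_000, "threads": 8,
}

def learning_rate(step, cfg=config):
    """Exponential decay: multiply the base rate by 0.95 once every
    10,000 environment steps (a step function, not continuous decay)."""
    return cfg["lr"] * cfg["lr_decay"] ** (step // cfg["lr_decay_every"])

lr_start = learning_rate(0)       # base learning rate
lr_late = learning_rate(25_000)   # after two decay intervals
```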
For all experiments, the reward is composed of four components: task-logic correctness, time efficiency, idle time, and collision penalty. The corresponding weights are set to $w_{logic} = 0.6$, $w_{time} = 0.35$, $w_{idle} = 0.2$, and $w_{col} = 0.8$, reflecting a higher priority on satisfying task-logic constraints and avoiding collisions, while still encouraging shorter completion times and reduced idle behavior. These weights may need to be fine-tuned for different experimental settings.

4.5. Simulation Results

4.5.1. Simulation Results Reporting

Each robot executes its assigned tasks according to the sequence determined by the task allocation algorithm. This process produces diverse configurations of task assignments and navigation trajectories, reflecting the dynamic nature of multi-agent coordination. During each experimental run, robots aim to complete their designated tasks in the assigned order, with navigation efficiency strongly influenced by the underlying path-planning algorithms. Consequently, variations in completion time across episodes reveal critical insights into the relative performance of different task allocation strategies.
At the conclusion of each episode, the system generates and records a detailed task execution schedule. This record serves as a key analytical dataset for evaluating task sequencing, execution time, and agent coordination. The primary objective of this analysis is to improve task completion efficiency and overall system performance. By systematically recording, evaluating, and comparing outcomes across episodes, the framework provides a structured basis for continuously improving robot task management and decision-making algorithms. Through iterative analysis, the agents progressively refine their understanding of multi-agent dynamics, leading to more effective coordination and optimized performance within the simulated environment.

4.5.2. Non-Training Baseline

The baseline experiment conducted in the simulation environment provides an essential reference for evaluating the MARL framework's subsequent performance. This initial trial was executed without applying advanced training algorithms or establishing a control policy. During this phase, the robots were programmed to randomly select goal positions and attempt to complete predefined tasks within a fixed time frame. Figure 7 presents the mean reward trajectories for robots $R_0$ and $R_1$ over multiple episodes. The fluctuations in mean reward reflect the robots' stochastic behavior when operating without learned strategies, demonstrating inconsistent task execution and frequent performance variability.
Figure 8 illustrates the breakdown of individual reward components for each robot: logic, time, idle time, and collision rewards. The dominance of logic and idle-time variations indicates that the robots frequently failed to follow the optimal task order or remained inactive for parts of the episode, both of which contributed to overall inefficiency.
Simulation results with two robots performing eight tasks yielded an average episode duration of 98.75 s and an average of 4.6 steps per episode (Table 8). These results indicate intermittent idle periods, during which robots were not actively engaged in task execution. As the number of robots increased to four, the average episode duration rose to 112.27 s, reflecting the additional coordination complexity inherent in multi-agent systems. On average, each robot completed between one and two tasks per episode, revealing inefficiencies in task sequencing and resource utilization under untrained conditions.
When logic constraints were introduced using the Genetic Algorithm (GA), the average completion times improved markedly to 40.20, 45.61, and 43.37 s for two-, three-, and four-robot configurations, respectively. These improvements demonstrate the effectiveness of incorporating task logic into the optimization process. Collectively, these results establish a quantitative baseline for evaluating the MARL framework, highlighting the limitations of random task allocation and the necessity of coordinated learning strategies to enhance multi-robot efficiency and collaboration.
The baseline experiment establishes a foundational understanding of how artificial intelligence can enhance robots’ autonomous capabilities. This initial phase serves as a critical reference point, demonstrating both the necessity and potential benefits of incorporating advanced training methodologies. These methods are designed to improve operational performance by promoting greater autonomy and efficiency in robotic behavior. Implementing a deterministic schedule with fixed navigation durations provides a controlled benchmark for performance comparison. This benchmark enables a comprehensive assessment of the MARL algorithm’s effectiveness in dynamic task allocation and navigation, offering a clear measure of improvement over non-trained, rule-based conditions.

4.5.3. Dynamic Task Allocation Training (Two-Robot Case)

The simulation was executed at approximately 40 times real-time by enhancing the Flatland server’s performance fivefold and distributing the computation across an eight-core parallel processing framework. The complete training process spanned six hours, focusing on analyzing how the reward metrics stabilized and how agent behavior evolved.
Across 12,000 episodes, the cumulative mean rewards of two robotic agents were recorded to evaluate learning progression. As shown in Figure 9, both agents initially exhibited substantial reward fluctuations, with mean values ranging from approximately −20 to slightly above 0. During early training episodes, the MARL algorithm explored a wide range of action policies, leading to unstable performance as agents learned to interpret environmental feedback.
As training advanced, the frequency of low-reward actions decreased, and the agents progressively refined their task-selection strategies. Around the 4000th episode, both robots began converging toward more consistent performance, with mean rewards stabilizing at higher values. This trend indicates that the agents successfully learned to coordinate and allocate tasks efficiently within the shared environment. The similarity in the convergence patterns for both robots further suggests the development of cooperative behavior and balanced task distribution within the multi-agent system.
As shown in Figure 10, the MARL training process exhibits clear convergence trends across all reward components, reflecting progressive refinement in decision-making and task coordination by both robots. The duration of task execution decreases steadily as training advances, indicating that the agents learn to prioritize efficient task sequencing and coordination. While convergence in efficiency occurs relatively quickly, additional training time is required to strengthen the agents’ understanding of task logic, which is essential for extending learned behaviors to complex, real-world scenarios.
The Logic Reward (blue line) displays substantial fluctuations during the initial learning phase, characteristic of an exploratory stage dominated by trial-and-error interactions. As the agents refine their decision-making, the logic reward stabilizes after approximately 30,000 steps, indicating improved alignment between actions and predefined task logic. This stabilization demonstrates that both robots successfully learned to interpret task dependencies and execute activities in the correct sequence.
The Time Reward (orange line) shows a consistent upward trend, signifying gradual improvements in task completion efficiency. Early penalties diminish as the algorithm optimizes navigation and scheduling strategies, eventually stabilizing near −2, which corresponds to task durations approaching the expected performance baseline.
The Idle Time Reward (green line) remains close to zero throughout most of the training, with minor early fluctuations. This stability reflects a balanced allocation of work, with idle periods minimized except when required by task sequencing constraints. The Collision Reward (red line) remains stable near zero, indicating effective spatial awareness and avoidance behavior throughout the training episodes.
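The four reward components discussed above (logic, time, idle, collision) combine into a single scalar signal per step. The sketch below shows one plausible composition; the weights, scales, and sign conventions are illustrative assumptions, not the paper’s tuned values.

```python
def step_reward(logic_ok, duration_s, expected_s, idle_s, collided,
                w_logic=1.0, w_time=0.1, w_idle=0.05, w_coll=1.0):
    """Composite reward from the four components discussed in the text.
    Weights and scales are illustrative assumptions, not the paper's values."""
    logic = w_logic if logic_ok else -w_logic             # task-order consistency
    time_r = -w_time * max(0.0, duration_s - expected_s)  # penalize overruns
    idle_r = -w_idle * idle_s                             # discourage idling
    coll_r = -w_coll if collided else 0.0                 # sparse contact penalty
    return logic + time_r + idle_r + coll_r

# A logic-consistent step with a small overrun and brief idling:
r = step_reward(logic_ok=True, duration_s=12.0, expected_s=10.0,
                idle_s=2.0, collided=False)
```

Under such a composition, the logic term dominates early training (large swings as sequencing is learned), while the time and idle terms shape efficiency later, consistent with the curves in Figure 10.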
In addition to monitoring the cumulative mean episode reward, we tracked the evolution of the actor (policy) and critic (value) losses during training to assess the stability and convergence behavior of the MAPPO algorithm. Figure 11a shows the total training loss aggregated over all agents for the benchmark scenario with two robots (dark blue: R1; light blue: R2). The loss exhibits large fluctuations and several peaks during the early stages of training (up to roughly 1700 episodes), followed by a clear downward trend and eventual stabilization close to zero after about 2000 updates.
Figure 11b reports the actor (policy) loss. Although the per-update values are noisy due to the on-policy nature of PPO, the loss remains small in magnitude and fluctuates around a narrow band without any drift or explosive growth over the entire 2500 updates. This indicates that policy updates are bounded and do not destabilize learning once a reasonably good policy has been found. The critic (value) loss in Figure 11c follows a similar pattern to the total loss: it starts at relatively high values with pronounced oscillations when the value function is still inaccurate, then decreases steadily and settles near zero as the critic converges to a consistent approximation of the return.
Importantly, neither the actor nor the critic loss displays sustained oscillations or divergence; instead, both losses remain bounded and progressively stabilize while the cumulative reward curves (Figure 9 and Figure 10) increase and then saturate. Taken together, these diagnostics support the claim that the proposed MAPPO configuration under the CTDE architecture yields a stable and convergent training process for cooperative multi-robot task allocation in the considered 2D environment.
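The actor and critic losses monitored above follow the standard PPO objectives that MAPPO inherits. A minimal scalar sketch of both losses (for intuition only, not the paper’s implementation, which operates on batched tensors) is:

```python
import math

def ppo_actor_loss(logp_new, logp_old, advantage, clip=0.2):
    """Clipped PPO surrogate for a single sample (scalars for clarity).
    The negative sign turns gradient ascent on the surrogate into a loss."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip) * advantage
    return -min(unclipped, clipped)

def critic_loss(value_pred, value_target):
    """Squared-error value loss driving the critic toward the return."""
    return (value_pred - value_target) ** 2

# A large policy shift (ratio = e^0.5 ~ 1.65) is clipped to 1 + 0.2 = 1.2,
# which is exactly the bounding behavior that keeps updates from
# destabilizing learning once a good policy is found.
loss = ppo_actor_loss(logp_new=-1.0, logp_old=-1.5, advantage=2.0)
```

The clipping term explains why the actor loss in Figure 11b stays within a narrow band: once the ratio leaves the [1 − ε, 1 + ε] interval, the surrogate is flat and the update is bounded.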
Overall, these results demonstrate that although task-logic learning exhibits the highest variability and difficulty, the MARL framework enables the robots to achieve temporal efficiency, balanced utilization, and collision-free cooperation. The consistent convergence across both agents confirms the framework’s ability to promote coordinated, stable learning behavior in multi-robot environments.

4.5.4. Dynamic Task Allocation Training (Four-Robot Case)

To assess the scalability of the proposed framework beyond the two-robot case, we conducted an additional simulation experiment with four robots (N = 4) assigned to the same pool of eight tasks in the 2D benchmark environment. Increasing the number of agents substantially raises the interaction density and the likelihood of path conflicts, making the cooperative task-allocation problem more challenging than in the original two-robot configuration.
The four-robot setting was trained for 20,000 episodes using the same MAPPO configuration and reward structure described in Section 4.4.2. The combined episode reward (Figure 12) initially exhibits high variance, reflecting extensive exploration and frequent sub-optimal joint decisions. As training progresses, the total reward gradually increases and then stabilizes, indicating that the agents learn to coordinate their task selections and navigation actions under denser traffic conditions.
Figure 13 breaks down the reward components and shows that the logic reward remains the most difficult term to optimize, with delayed and more oscillatory convergence compared with the two-robot case. This behavior is expected, as collisions and path conflicts are primarily reflected through logic-consistency penalties: when robots obstruct one another or fail to reach intermediate goals, the corresponding tasks are not completed and negative feedback is injected through the logic reward. In contrast, the explicit collision reward term is configured with a small magnitude and is only triggered when the local planner reports physical contact with surrounding obstacles. As a result, the collision reward curve remains close to zero throughout training and mainly captures occasional contacts induced by the low-level navigation stack, while the dominant learning signal for avoiding disruptive interactions is provided by the logic component.
To make the effect of increased agent density more explicit, Figure 14 compares the collision-event frequency for the two-robot and four-robot configurations over a fixed evaluation window. As expected, the four-robot case exhibits a higher number of collision events, reflecting the increased interaction complexity and the greater probability of local planner contacts in shared corridors and intersections. However, the collision frequency remains bounded and does not lead to systematic failure of the task set: the learned policies still generate coherent task sequences and feasible navigation behaviors that complete the global task set within the allowed horizon. Overall, the reward evolution and collision-frequency comparison indicate that the proposed MAPPO-based CTDE formulation and MARL–ROS framework scale beyond the simple two-robot case and remain effective under higher agent densities in the considered 2D benchmark environment.
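The collision-frequency comparison in Figure 14 reduces to counting planner-reported contact events over a fixed evaluation window. A sketch of that metric, with hypothetical contact timestamps standing in for the logged data, is:

```python
def collision_frequency(contact_log, window_s):
    """Collision events per second over a fixed evaluation window.
    `contact_log` holds contact timestamps reported by the local planner
    (the timestamps below are hypothetical, not the experiment's data)."""
    events = [t for t in contact_log if 0.0 <= t <= window_s]
    return len(events) / window_s

two_robot = collision_frequency([12.0, 40.5], window_s=60.0)
four_robot = collision_frequency([3.1, 12.0, 25.4, 40.5, 51.2], window_s=60.0)
```

Normalizing by the window length makes the two- and four-robot runs directly comparable even if their episode durations differ.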

4.5.5. Scalability Analysis

The optimized task allocation duration generated by the MAPPO framework is summarized in Table 9, resulting in an overall task completion time of 42 s. The corresponding task distribution and execution schedule for both robots are illustrated in Figure 15, providing a visual representation of the optimized workflow.
The optimal allocation strategy is defined as:
R0 = [A1, A2, B2, B1] and R1 = [A3, A4, B3, B4]
This allocation minimizes task completion time while maintaining logical task order and balanced workload distribution. As shown in Figure 15, the scheduling timeline indicates that both robots operate in parallel, executing tasks efficiently without overlap or idle time. The logical correctness of the global task sequence is confirmed, and each robot successfully adheres to the defined task dependencies.
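The two properties claimed above, logical task order and parallel (non-idle) execution, can be checked mechanically for any candidate allocation. The sketch below assumes, for illustration, that each Bi depends on its matching Ai and that all durations are uniform; the paper’s actual dependencies and durations come from its task definitions and Table 9.

```python
# Hypothetical durations and precedence (each Bi assumed to follow its Ai);
# the paper's actual task times are in Table 9 and are not reproduced here.
DUR = {"A1": 5, "A2": 5, "A3": 5, "A4": 5, "B1": 6, "B2": 6, "B3": 6, "B4": 6}
PRECEDES = {"B1": "A1", "B2": "A2", "B3": "A3", "B4": "A4"}

def logic_consistent(sequence_by_robot):
    """True if, within each robot's own sequence, every Bi comes after its
    Ai whenever both are assigned to that robot (a simplified local check)."""
    for seq in sequence_by_robot.values():
        for task, dep in PRECEDES.items():
            if task in seq and dep in seq and seq.index(task) < seq.index(dep):
                return False
    return True

def makespan(sequence_by_robot):
    """Completion time when the robots run their sequences in parallel."""
    return max(sum(DUR[t] for t in seq) for seq in sequence_by_robot.values())

alloc = {"R0": ["A1", "A2", "B2", "B1"], "R1": ["A3", "A4", "B3", "B4"]}
```

With uniform durations, this allocation is perfectly balanced (both robots finish at the same time), which is why the schedule in Figure 15 shows no idle gaps.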
A comparison with the baseline and GA results highlights the efficiency of the MARL-derived schedule. The MARL configuration achieves a total completion time comparable to the logic-constrained GA solution while maintaining greater adaptability and autonomous decision-making. The close similarity between the schedules derived via MARL and GA validates the learning framework’s ability to converge on optimal solutions through experience-based training.
The four-robot experiment extends this analysis to a denser multi-agent setting with the same task set and reward structure. The corresponding results are reported in Table 10. Without training, the four-robot baseline requires 112.27 s to complete the tasks, indicating that simply adding more agents does not improve performance in the absence of coordination. With task logic enforced, the GA planner produces the schedule
R0 = [B3, B4], R1 = [B2, B1], R2 = [A3, A4], R3 = [A2, A1]
with a completion time of 55 s, a reduction of approximately 51% relative to the untrained baseline. The trained MARL policy, in turn, converges to the allocation
R0 = [B4, B2], R1 = [A1, B1], R2 = [A4, B3], R3 = [A2, A3],
demonstrating that the learned policy is able to coordinate four agents effectively in the shared environment.
Compared with the logic-constrained GA schedule, the learned policy is about 27% slower, but it discovers its own decomposition of the task set from interaction data rather than from explicit combinatorial search, and it remains fully compatible with the decentralized-execution requirement. This performance gap is expected: the MARL policy is trained under stochastic navigation times and collision risks and therefore favours task allocations that are robust to timing variability and local path conflicts, whereas the deterministic GA schedule is generated from nominal travel times and does not explicitly account for these probabilistic effects. In rollout experiments, the GA allocation yields logic-consistent executions in only about 65% of episodes, whereas the MARL allocation maintains logic correctness in approximately 92% of episodes, highlighting its superior ability to handle stochastic environments.
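The robustness comparison above can be reproduced in spirit with a Monte Carlo rollout: perturb the nominal travel times and measure how often a fixed schedule still completes within its logic-consistency window. All numbers below are illustrative; they are not the paper’s measured 65%/92% rates.

```python
import random

def logic_success_rate(nominal_times, deadline, jitter=0.3, trials=2000, seed=0):
    """Monte Carlo estimate of how often a fixed schedule stays feasible
    (here: finishes within `deadline`) when each travel time is perturbed
    by a uniform +/- `jitter` fraction. Illustrative sketch only."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        total = sum(t * (1.0 + rng.uniform(-jitter, jitter)) for t in nominal_times)
        ok += total <= deadline
    return ok / trials

# A schedule optimized to a tight nominal makespan fails more often under
# jitter than one that keeps slack, which is the qualitative effect behind
# the GA-vs-MARL robustness gap discussed above.
tight = logic_success_rate([10, 10, 10, 10], deadline=42.0)   # little slack
slack = logic_success_rate([10, 10, 10, 10], deadline=50.0)   # robust margin
```

This is the mechanism by which a slightly slower but slack-preserving allocation can dominate a nominally optimal one in stochastic execution.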
The trajectories and schedules for the four-robot case (Figure 16) further illustrate the learned cooperative behavior. The agents avoid duplicating work, respect the global task logic, and tend to stagger their presence in constrained regions of the map, which aligns with the reward-breakdown and collision analyses in Section 4.5.4. Although the increased agent density leads to more frequent local planner contacts, the collision-event statistics remain bounded and do not prevent successful completion of the global task set.
Taken together, the two- and four-robot results show that the proposed MARL–ROS framework (i) recovers near-optimal, logic-consistent allocations in the small-team case; (ii) adapts its task-distribution strategy when the team size is doubled from two to four robots without any change to the underlying learning architecture; and (iii) maintains cooperative, fully decentralized execution under higher agent densities while achieving substantial performance gains over uncoordinated baselines and robust behavior under stochastic navigation times.
These findings provide concrete evidence that the framework supports adaptive, cooperative, and scalable multi-robot task allocation in the considered 2D, construction-oriented digital environment.

4.6. Real-World Verification

To validate the MARL framework beyond the simulation environment, a real-world experiment was conducted using two TurtleBot3 Burger robots (Robotis, Seoul, Republic of Korea). These compact and cost-effective mobile platforms are widely adopted for indoor navigation and reinforcement learning due to their modular architecture and compatibility with the Robot Operating System (ROS). Each robot is equipped with a 360° 2D LiDAR sensor (HLS-LFCD LDS-01), wheel encoders, and a Raspberry Pi 3 Model B+ onboard computer (Cambridge, UK), enabling autonomous navigation and real-time data exchange. Mobility is achieved through a differential drive system integrated with the ROS Navigation Stack for path planning and motion control.
The reinforcement learning computations were performed offboard on a workstation equipped with an Intel Xeon E7 (8-core) CPU and an NVIDIA GeForce RTX 2060 GPU. Communication between the robots and the central system was established over a local ROS network, ensuring synchronized transmission of observations, actions, and sensor feedback. This configuration maintained consistency with the simulation framework while incorporating the additional uncertainties of real-world operation. The trained MARL policy was subsequently deployed on the physical robots to execute task allocation and navigation within a laboratory environment designed to replicate the simulated setup.

Real-World Results and Discussion

The ROS Noetic middleware was employed to integrate the developed Python 3.10 scripts with the robot control packages, enabling seamless communication across hardware and algorithmic components. The TurtleBot3 Burger model was implemented in Gazebo 11 for simulation and RViz 1.14 for visualization, with onboard sensors, such as LiDAR and wheel encoders, simulated to reflect the real robots’ perception capabilities.
To closely replicate the physical testing conditions, the Gazebo simulation environment was built using a Building Information Model (BIM) generated in Autodesk Revit 2024, accurately modeling the laboratory where the real-world experiments were conducted. Key structural and spatial features from the BIM were extracted and imported into Gazebo to ensure high-fidelity correspondence between the simulated and physical environments.
Python scripts implementing the MAPPO algorithm were converted into ROS nodes, facilitating communication between the reinforcement learning architecture and the robotic agents. The trained policy was then deployed on two TurtleBot3 Burger robots in the physical laboratory setup shown in Figure 17, where each robot autonomously navigated to predefined positions and executed assigned activities.
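The core of such a policy-execution node is the mapping from a discrete MAPPO action to a navigation goal that the ROS side forwards to move_base. The sketch below shows only that dispatch logic in plain Python; the rospy publisher/subscriber glue is omitted, and the action set and goal coordinates are hypothetical, not the paper’s configuration.

```python
# Dispatch logic a ROS policy node would wrap: a discrete action index is
# translated into a map-frame goal. In the actual node this pose would be
# sent as a move_base goal via an actionlib client; here we return the raw
# coordinates. Task coordinates and the action set are illustrative.
TASK_GOALS = {  # task -> (x, y) goal in the map frame
    "A1": (1.0, 0.5), "A2": (1.0, 1.5), "B1": (3.0, 0.5), "B2": (3.0, 1.5),
}
ACTIONS = ["A1", "A2", "B1", "B2", "IDLE"]

def action_to_goal(action_idx):
    """Translate a policy output into a goal pose, or None for IDLE."""
    task = ACTIONS[action_idx]
    if task == "IDLE":
        return None
    return TASK_GOALS[task]
```

Keeping this mapping pure (no ROS dependencies) is what lets the same policy code run unchanged against the Flatland simulation and the physical TurtleBot3 stack.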
As summarized in Table 10, the MARL-based schedule achieved a task completion duration of 42 s in simulation, while the real-world execution under identical task assignments required 60 s. The discrepancy primarily arises from real-world factors such as sensor noise, surface friction, localization drift, and processing latency, which are abstracted in simulation. Despite the extended duration, the task allocation pattern, assigning R0 to ‘A1’, ‘A2’, ‘B2’, and ‘B1’, and R1 to ‘A3’, ‘A4’, ‘B3’, and ‘B4’, was consistent across both domains. This consistency demonstrates the framework’s transferability and policy robustness from simulation to reality, while underscoring the need for adaptive compensation mechanisms to account for real-world dynamics.
The extended task duration observed in real-world trials can be attributed to several key factors.
1. Reduced realized speed (commanded vs. effective). Although velocity commands were capped at v_max in simulation, logs show that the effective linear speed on the TurtleBot3 was consistently lower due to actuator limits and floor friction: about 0.19 m/s on hardware versus 0.22 m/s in simulation, a decrease of roughly 14%. In corridors, safety inflation and tighter turns further depressed the average speed.
2. Navigational hesitation from perception and localization. The robots frequently paused to reassess their trajectories because of sensor noise and minor localization errors, resulting in extended idle periods and reduced overall efficiency. LiDAR noise, beam dropouts, and small heading jitter trigger micro-stops and replans (safety checks, oscillation damping) that are largely absent in deterministic simulation runs. Concretely, each episode exhibited 3–6 pauses with a median duration of 0.5–1.2 s, alongside an average of 1.5 replan events per episode. The measured LiDAR dropout rate was roughly 2–4%, with processing latency of around 40–70 ms, all of which cumulatively inflated idle time.
3. Environmental irregularities not captured in the BIM-to-2D model. Despite the controlled setting, the laboratory introduced unpredictable surface variations and minor obstacles, such as uneven flooring and reflections, that were not fully represented in the simulated BIM-based model. Minor floor unevenness, glossy reflections, and small transient obstacles (tripods, bags, cables) caused conservative costmap inflation and detours that do not exist in the Flatland world.
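The pause statistics reported above can be extracted from a sampled speed trace by detecting contiguous near-zero-velocity runs. The thresholds in this sketch (stop speed, minimum duration) are illustrative assumptions, not the values used in the experiments.

```python
def detect_pauses(times_s, speeds_mps, v_stop=0.02, min_dur=0.3):
    """Extract micro-stop intervals from a sampled speed trace: contiguous
    runs below `v_stop` lasting at least `min_dur` seconds. A pause still
    open at the end of the trace is ignored. Thresholds are illustrative."""
    pauses, start = [], None
    for t, v in zip(times_s, speeds_mps):
        if v < v_stop and start is None:
            start = t                         # pause begins
        elif v >= v_stop and start is not None:
            if t - start >= min_dur:
                pauses.append((start, t))     # pause long enough to count
            start = None
    return pauses

# Synthetic trace at 0.1 s sampling: moving, a 0.5 s stop, moving again.
ts = [i * 0.1 for i in range(20)]
vs = [0.2] * 8 + [0.0] * 5 + [0.2] * 7
pauses = detect_pauses(ts, vs)
```

Aggregating such intervals per episode yields the pause counts and median durations quoted in the discussion.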
Addressing these issues is essential for bridging the gap between simulation and physical deployment. Future work will focus on enhancing the MARL framework’s robustness by incorporating adaptive motion control, sensor uncertainty modeling, and environment-aware learning to improve performance in real-world, dynamic conditions.

5. Limitations

This paper represents a first step toward applying MARL within a reusable digital framework for cooperative robotics; however, several limitations must be acknowledged to guide future development. First, the experimental validation focuses on small teams of two and four homogeneous mobile robots operating in a simplified, flat 2D environment. These settings are sufficient to verify that the proposed MARL–ROS framework can support end-to-end training and execution and to demonstrate scalability from two to four agents, but they still do not fully exploit the potential advantages of MARL for larger fleets or very dense interactions. As a result, the present results should be interpreted as proof-of-concept for the framework and learning pipeline rather than as a definitive demonstration of multi-robot “swarm” coordination. Future work will extend the benchmark to scenarios with larger (e.g., six or more) and potentially heterogeneous teams in confined areas, enabling systematic study of congestion effects, collision risk, and performance trends as the number of interacting robots increases.
Second, the current framework simplifies several aspects of task logic and physical interaction, which limits its capacity to capture the full complexity of real-world construction operations. The robots are differential-drive TurtleBot3 platforms operating on a flat, obstacle-free surface, and the navigation problem is formulated in 2D. This configuration does not yet represent uneven terrain, multilevel structures, heavy machinery kinematics, or dense human–equipment traffic, and is therefore closer to a generic logistics or indoor transport scenario than to a detailed construction site. In future work, the same framework will be instantiated in richer construction-oriented digital environments that incorporate changing maps (e.g., evolving building states), resource and access constraints, and dynamic obstacles, and will be coupled to robot models that more closely approximate construction machinery such as haulers or material-handling equipment.
Third, the selection of algorithms in this study is restricted to a MAPPO-based formulation with fixed hyperparameters, informed by common practice in the literature. Alternative MARL methods, adaptive parameter schedules, and more expressive policy architectures are not explored here. Although the current configuration was sufficient to obtain stable convergence in the benchmark scenario, a broader comparison across algorithms and architectures will be needed to fully assess robustness and sample efficiency.
Finally, the simulation environment imposes constraints on both physical fidelity and computational efficiency. Current limitations in modeling real-world dynamics and maintaining real-time performance affect the precision of algorithm validation and the speed of training. Enhancing simulation fidelity and optimizing computational pipelines will be essential to further close the sim-to-real gap and to evaluate the framework under more demanding construction scenarios.

6. Conclusions and Outlook

This paper presents a modular framework for multi-agent task allocation using Multi-Agent Reinforcement Learning (MARL), integrating centralized training with decentralized execution in an ROS-based simulation environment. The framework establishes a cohesive communication structure that links reinforcement learning algorithms, robotic agents, and simulation modules, enabling real-time data exchange and adaptive decision-making. Validation through both simulated experiments and a real-world proof-of-concept case study using two TurtleBot3 Burger robots demonstrated the framework’s capability to support cooperative task execution, efficient navigation, and logic-driven autonomy in a simplified 2D environment.
The central contribution of this work is the development of an integrated MARL–ROS platform that functions as a reusable digital testbed rather than a complete construction-site solution. By organizing the system around modular data flows, standardized interfaces, and configurable task and reward parameters, the framework provides a reproducible and extensible foundation for multi-agent experimentation, benchmarking, and adaptive control across diverse robotic systems and environment layouts.
Despite these advances, the study’s current scope remains limited by simplified task structures, deterministic scheduling, and the abstraction of real-world complexities such as sensor noise, actuator latency, and environmental uncertainty. In addition, the robot team is small and homogeneous, and the physical environment is flat and obstacle-free, so important construction-specific aspects such as uneven terrain, heavy equipment kinematics, dynamic human–machine interactions, and evolving site layouts are not yet represented. In our experiments, the learned policy achieves a completion time of 42 s in simulation and 60 s on the real robots (≈43% increase). This sim-to-real gap likely arises from sensor noise, localization drift, and actuation delays that are not modeled in the current 2D benchmark. In this first study, we use the real-robot runs mainly to verify that the MARL–ROS pipeline can be executed end-to-end. In future work, we plan to reduce this gap by applying domain randomization and noise injection in the environment layer (e.g., randomized sensor noise, delays, and speed profiles) so that the framework can be used to systematically study and mitigate sim-to-real discrepancies.
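The domain-randomization direction outlined above can be realized as a thin environment wrapper that perturbs observations and injects actuation delays before they reach the policy. The sketch below assumes a Gym-style `step` interface and illustrative noise scales; it is a design sketch, not part of the current framework.

```python
import random

class NoisyObsWrapper:
    """Domain-randomization sketch: add Gaussian noise to observations and
    record a random actuation delay per step. A Gym-style env with
    step(action) -> (obs, reward, done, info) is assumed; noise scales are
    illustrative."""
    def __init__(self, env, obs_sigma=0.05, delay_range=(0.0, 0.1), seed=0):
        self.env, self.obs_sigma, self.delay_range = env, obs_sigma, delay_range
        self.rng = random.Random(seed)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        noisy = [o + self.rng.gauss(0.0, self.obs_sigma) for o in obs]
        info["action_delay_s"] = self.rng.uniform(*self.delay_range)
        return noisy, reward, done, info

class _StubEnv:
    """Minimal stand-in environment used only to exercise the wrapper."""
    def step(self, action):
        return [1.0, 2.0], 0.0, False, {}

env = NoisyObsWrapper(_StubEnv())
obs, r, done, info = env.step(0)
```

Because the wrapper sits in the environment layer, the same MAPPO training loop can be run with or without randomization, which is what makes sim-to-real discrepancies systematically studiable within the framework.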
Future research will aim to enhance simulation fidelity, incorporate probabilistic modeling of task logic and feedback, and benchmark performance across a broader set of MARL algorithms. Extending the framework to larger teams and heterogeneous robots, and introducing construction-specific constraints (e.g., precedence relations, resource availability, access restrictions) and dynamic obstacles, will be key steps toward evaluating MARL under more realistic construction conditions. Extending the framework to heterogeneous robot teams and developing adaptive mechanisms for dynamic parameter tuning will further improve robustness and scalability. Collectively, these developments will advance the framework toward real-world deployment, helping to realize autonomous, coordinated, and efficient robotic systems for complex operational environments.

Author Contributions

Conceptualization, X.X. and B.G.d.S.; methodology, X.X., S.A.P. and B.G.d.S.; software, X.X.; validation, X.X., S.A.P. and B.G.d.S.; formal analysis, X.X., S.A.P. and B.G.d.S.; investigation, X.X.; data curation, X.X. and B.G.d.S.; writing—original draft preparation, X.X.; writing—review and editing, X.X., S.A.P. and B.G.d.S.; visualization, X.X.; supervision, S.A.P. and B.G.d.S.; project administration, S.A.P. and B.G.d.S.; funding acquisition, B.G.d.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

This work was partially supported by different Centers at NYUAD. In particular, the Center for Sand Hazards and Opportunities for Resilience, Energy, and Sustainability (SHORES), the Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001, and the Center for Artificial Intelligence and Robotics (CAIR). Part of this research benefited from the resources in the Core Technology Platform (CTP) at New York University Abu Dhabi (NYUAD), particularly the CTP’s Kinesis Lab. Special thanks to Nikolaos Giakoumidis for valuable discussions and assistance with laboratory methods and experimental design.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Arents, J.; Greitans, M. Smart industrial robot control trends, challenges and opportunities within manufacturing. Appl. Sci. 2022, 12, 937. [Google Scholar] [CrossRef]
  2. Piyush, P.; Mohamed, E.; Gabriella, S.K. Identifying the Challenges to Adopting Robotics in the US Construction Industry. J. Constr. Eng. Manag. 2021, 147, 05021003. [Google Scholar] [CrossRef]
  3. Bloss, R. Collaborative robots are rapidly providing major improvements in productivity, safety, programing ease, portability and cost while addressing many new applications. Ind. Robot. Int. J. 2016, 43, 463–468. [Google Scholar] [CrossRef]
  4. Feng, C.; Xiao, Y.; Willette, A.; McGee, W.; Kamat, V.R. Vision guided autonomous robotic assembly and as-built scanning on unstructured construction sites. Autom. Constr. 2015, 59, 128–138. [Google Scholar] [CrossRef]
  5. Pedersen, M.R.; Nalpantidis, L.; Andersen, R.S.; Schou, C.; Bøgh, S.; Krüger, V.; Madsen, O. Robot skills for manufacturing: From concept to industrial deployment. Robot. Comput. Manuf. 2016, 37, 282–291. [Google Scholar] [CrossRef]
  6. Hosseini, M.R.; Martek, I.; Zavadskas, E.K.; Aibinu, A.A.; Arashpour, M.; Chileshe, N. Critical evaluation of off-site construction research: A Scientometric analysis. Autom. Constr. 2018, 87, 235–247. [Google Scholar] [CrossRef]
  7. Khamis, A.; Hussein, A.; Elmogy, A. Multi-robot task allocation: A review of the state-of-the-art. In Cooperative Robots and Sensor Networks; Springer International Publishing: Cham, Switzerland, 2015; pp. 31–51. [Google Scholar]
  8. Badreldin, M.; Hussein, A.; Khamis, A. A comparative study between optimization and market-based approaches to multi-robot task allocation. Adv. Artif. Intell. 2013, 2013, 56524. [Google Scholar] [CrossRef]
  9. Parker, L.E. Task-oriented multi-robot learning in behavior-based systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS ’96, Osaka, Japan, 4–8 November 1996. [Google Scholar]
  10. Chakraa, H.; Guérin, F.; Leclercq, E.; Lefebvre, D. Optimization techniques for Multi-Robot Task Allocation problems: Review on the state-of-the-art. Robot. Auton. Syst. 2023, 168, 104492. [Google Scholar] [CrossRef]
  11. Clifton, J.; Laber, E. Q-learning: Theory and applications. Annu. Rev. Stat. Appl. 2020, 7, 279–301. [Google Scholar] [CrossRef]
  12. Cai, Q.; Pan, L.; Tang, P. Generalized deterministic policy gradient algorithms. arXiv 2018, arXiv:1807.03708. [Google Scholar]
  13. Lansing, E. Optimizing Production Manufacturing using Reinforcement Learning Sridhar Mahadevan and Georgios Theo-charous. Available online: https://cdn.aaai.org/FLAIRS/1998/FLAIRS98-072.pdf (accessed on 4 May 2024).
  14. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef]
  15. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Rob. Res. 2013, 32, 1238–1274. [Google Scholar] [CrossRef]
  16. Isele, D.; Rahimi, R.; Cosgun, A.; Subramanian, K.; Fujimura, K. navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018. [Google Scholar]
  17. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue Correlation Filters for Robust Visual Tracking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4844–4853. [Google Scholar] [CrossRef]
  18. Xu, X.; Garcia de Soto, B. Reinforcement learning with construction robots: A review of research areas, challenges and opportunities. In Proceedings of the International Symposium on Automation and Robotics in Construction (ISARC), Bogotá, Colombia, 13–15 July 2022. [Google Scholar] [CrossRef]
  19. Conde, M. Organization based multiagent architecture for distributed environments. Doctoral dissertation, Universidad de Salamanca, Salamanca, Spain, 2010. [Google Scholar]
  20. Edmondson, J.; Schmidt, D. Multi-agent distributed adaptive resource allocation (MADARA). Int. J. Commun. Netw. Distrib. Syst. 2010, 5, 229–245. [Google Scholar] [CrossRef]
  21. Ontanon, S.; Synnaeve, G.; Uriarte, A.; Richoux, F.; Churchill, D.; Preuss, M. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Trans. Comput. Intell. AI Games 2013, 5, 293–311. [Google Scholar] [CrossRef]
  22. Turner, C.J.; Oyekan, J.; Stergioulas, L.; Griffin, D. Utilizing industry 4.0 on the construction site: Challenges and opportunities. IEEE Trans. Ind. Inform. 2021, 17, 746–756. [Google Scholar] [CrossRef]
  23. Zhu, X.; Xu, J.; Ge, J.; Wang, Y.; Xie, Z. Multi-task multi-agent reinforcement learning for real-time scheduling of a dual-resource flexible job shop with robots. Processes 2023, 11, 267. [Google Scholar] [CrossRef]
  24. Chipade, V.S. Collaborative Task Allocation and Motion Planning for Multi-Agent Systems in the Presence of Adversaries. Doctoral Dissertation, University of Michigan, Ann Arbor, MI, USA, 2022. [Google Scholar]
  25. Gerkey, B.P.; Matarić, M.J. A formal analysis and taxonomy of task allocation in multi-robot systems. Int. J. Robot. Res. 2004, 23, 939–954. [Google Scholar] [CrossRef]
  26. Nunes, E.; Manner, M.; Mitiche, H.; Gini, M. A taxonomy for task allocation problems with temporal and ordering constraints. Robot. Auton. Syst. 2017, 90, 55–70. [Google Scholar] [CrossRef]
  27. Calzavara, M.; Faccio, M.; Granata, I. Multi-objective task allocation for collaborative robot systems with an Industry 5.0 human-centered perspective. Int. J. Adv. Manuf. Technol. 2023, 128, 297–314. [Google Scholar] [CrossRef]
  28. Gmytrasiewicz, P.J.; Doshi, P. A framework for sequential planning in multi-agent settings. J. Artif. Intell. Res. 2005, 24, 49–79. [Google Scholar] [CrossRef]
  29. Choudhury, S.; Gupta, J.; Kochenderfer, M.; Sadigh, D.; Bohg, J. Dynamic multi-robot task allocation under uncertainty and temporal constraints. Auton. Robot. 2022, 46, 231–247. [Google Scholar] [CrossRef]
  30. Robu, V. Market-based task allocation and control for distributed logistics. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, Utrecht, The Netherlands, 25–29 July 2005; p. 1383. [Google Scholar]
  31. Liu, L.; Shell, D. Optimal market-based multi-robot task allocation via strategic pricing. In Proceedings of the Robotics: Science and Systems Conference, Berlin, Germany, 24–28 June 2013. [Google Scholar]
  32. Tang, F.; Parker, L.E. A complete methodology for generating multi-robot task solutions using ASyMTRe-D and market-based task allocation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, 10–14 April 2007; pp. 3351–3358. [Google Scholar]
  33. Hussein, A.; Khamis, A. Market-based approach to Multi-robot Task Allocation. In Proceedings of the 2013 International Conference on Individual and Collective Behaviors in Robotics (ICBR), Sousse, Tunisia, 15–17 December 2013; pp. 69–74. [Google Scholar]
  34. Parker, L.E. L-ALLIANCE: Task-oriented multi-robot learning in behavior-based systems. Adv. Robot. 1996, 11, 305–322. [Google Scholar] [CrossRef]
  35. Seenu, N.; Kuppan Chetty, R.M.; Ramya, M.M.; Janardhanan, M.N. Review on state-of-the-art dynamic task allocation strategies for multiple-robot systems. Ind. Rob. 2020, 47, 929–942. [Google Scholar]
  36. Liu, F.; Liang, S.; Xian, X. Multi-robot task allocation based on utility and distributed computing and centralized determination. In Proceedings of the 27th Chinese Control and Decision Conference (CCDC), Qingdao, China, 23–25 May 2015. [Google Scholar]
  37. Mazdin, P.; Barcis, M.; Hellwagner, H.; Rinner, B. Distributed task assignment in multi-robot systems based on information utility. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020. [Google Scholar]
  38. Shelkamy, M.; Elias, C.M.; Mahfouz, D.M.; Shehata, O.M. Comparative analysis of various optimization techniques for solving multi-robot task allocation problem. In Proceedings of the 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 24–26 October 2020. [Google Scholar]
  39. Majumder, A.; Majumder, A.; Bhaumik, R. Teaching–learning-based optimization algorithm for path planning and task allocation in multi-robot plant inspection system. Arab. J. Sci. Eng. 2021, 46, 8999–9021. [Google Scholar] [CrossRef]
  40. Park, B.; Kang, C.; Choi, J. Cooperative Multi-Robot Task Allocation with Reinforcement Learning. Appl. Sci. 2022, 12, 272. [Google Scholar] [CrossRef]
  41. Kim, I.; Morrison, J.R. Learning based framework for joint task allocation and system design in stochastic multi-UAV systems. In Proceedings of the 2018 International Conference on Unmanned Aircraft Systems (ICUAS), Dallas, TX, USA, 12–15 June 2018; pp. 324–334. [Google Scholar]
  42. Jin, L.; Li, S.; La, H.M.; Zhang, X.; Hu, B. Dynamic task allocation in multi-robot coordination for moving target tracking: A distributed approach. Automatica 2019, 100, 75–81. [Google Scholar] [CrossRef]
  43. Bischoff, E.; Meyer, F.; Inga, J.; Hohmann, S. Multi-robot task allocation and scheduling considering cooperative tasks and precedence constraints. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020. [Google Scholar]
  44. Liu, X.F.; Lin, B.C.; Zhan, Z.H.; Jeon, S.W.; Zhang, J. An efficient ant colony system for multi-robot task allocation with large-scale cooperative tasks and precedence constraints. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 4–7 December 2021. [Google Scholar]
  45. Alitappeh, R.J.; Jeddisaravi, K. Multi-robot exploration in task allocation problem. Appl. Intell. 2022, 52, 2189–2211. [Google Scholar] [CrossRef]
  46. Liu, Z.; Chen, B.; Zhou, H.; Koushik, G.; Hebert, M.; Zhao, D. MAPPER: Multi-Agent Path Planning with Evolutionary Reinforcement Learning in Mixed Dynamic Environments. In Proceedings of the IROS 2020 International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020. [Google Scholar]
  47. Agrawal, A.; Bedi, A.; Manocha, D. RTAW: An Attention Inspired Reinforcement Learning Method for Multi-Robot Task Allocation in Warehouse Environments. arXiv 2023, arXiv:2209.05738. [Google Scholar] [CrossRef]
  48. Lee, D.; Lee, S.; Masoud, N.; Krishnan, M.; Li, V.C. Digital twin-driven deep reinforcement learning for adaptive task allocation in robotic construction. Adv. Eng. Inform. 2022, 53, 101710. [Google Scholar] [CrossRef]
  49. Metelli, A.M. Configurable Environments in Reinforcement Learning: An Overview. In Special Topics in Information Technology; Piroddi, L., Ed.; Springer International Publishing: Cham, Switzerland, 2022; pp. 101–113. [Google Scholar]
  50. Beattie, C.; Leibo, J.Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V. DeepMind Lab. arXiv 2016, arXiv:1612.03801. [Google Scholar] [CrossRef]
  51. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
  52. Abioye, S.O.; Oyedele, L.O.; Akanbi, L.; Ajayi, A.; Delgado, J.M.D.; Bilal, M.; Akinade, O.O.; Ahmed, A. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 2021, 44, 103299. [Google Scholar] [CrossRef]
  53. Liu, S.; Liu, P. Benchmarking and optimization of robot motion planning with motion planning pipeline. Int. J. Adv. Manuf. Technol. 2021, 118, 949–961. [Google Scholar] [CrossRef]
  54. Kayhan, B.M.; Yildiz, G. Reinforcement learning applications to machine scheduling problems: A comprehensive literature review. J. Intell. Manuf. 2023, 34, 905–929. [Google Scholar] [CrossRef]
  55. de Woillemont, P.L.P.; Labory, R.; Corruble, V. Automated Play-Testing through RL Based Human-Like Play-Styles Generation. Proc. AAAI Conf. Artif. Intell. Interact. Digit. Entertain. 2022, 18, 146–154. [Google Scholar] [CrossRef]
  56. Brito, B.; Everett, M.; How, J.P.; Alonso-Mora, J. Where to go next: Learning a subgoal recommendation policy for navigation in dynamic environments. IEEE Robot. Autom. Lett. 2021, 6, 4616–4623. [Google Scholar] [CrossRef]
  57. Xie, J.; Ge, F.; Cui, T.; Wang, X. A virtual test and evaluation method for fully mechanized mining production system with different smart levels. Int. J. Coal Sci. Technol. 2022, 9, 41. [Google Scholar] [CrossRef]
  58. Conway, B.A. A Survey of Methods Available for the Numerical Optimization of Continuous Dynamic Systems. J. Optim. Theory Appl. 2012, 152, 271–306. [Google Scholar] [CrossRef]
  59. Gazi, V.; Passino, K.M. Swarm Stability and Optimization; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  60. Green, S.A.; Billinghurst, M.; Chen, X.; Chase, J.G. Human-Robot Collaboration: A Literature Review and Augmented Reality Approach in Design. Int. J. Adv. Robot. Syst. 2008, 5, 1–18. [Google Scholar] [CrossRef]
  61. Fjeldstad, D.; Snow, C.C.; Miles, R.E.; Lettl, C. The architecture of collaboration. Strateg. Manag. J. 2012, 33, 734–750. [Google Scholar] [CrossRef]
  62. Al-Hamadani, M.N.A.; Fadhel, M.A.; Alzubaidi, L.; Harangi, B. Reinforcement Learning Algorithms and Applications in Healthcare and Robotics: A Comprehensive and Systematic Review. Sensors 2024, 24, 2461. [Google Scholar] [CrossRef] [PubMed]
  63. Bommasani, R. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  64. Choi, H.; Crump, C.; Duriez, C.; Elmquist, A.; Hager, G.; Han, D.; Hearl, F.; Hodgins, J.; Jain, A.; Leve, F.; et al. On the use of simulation in robotics: Opportunities, challenges, and suggestions for moving forward. Proc. Natl. Acad. Sci. USA 2021, 118, e1907856118. [Google Scholar] [CrossRef] [PubMed]
  65. You, H.; Zhou, T.; Zhu, Q.; Ye, Y.; Du, E.J. Embodied AI for Dexterity-Capable Construction Robots: Dexbot Framework. Adv. Eng. Inform. 2024, 62, 102572. [Google Scholar] [CrossRef]
  66. Silver, T.; Chitnis, R. PDDLGym: Gym Environments from PDDL Problems. arXiv 2020, arXiv:2002.06432. [Google Scholar] [CrossRef]
  67. Gomes, G.; Vidal, C.A.; Cavalcante-Neto, J.B.; Nogueira, Y.L. A modeling environment for reinforcement learning in games. Entertain. Comput. 2022, 43, 100516. [Google Scholar] [CrossRef]
  68. Jonassen, D.H.; Rohrer-Murphy, L. Activity theory as a framework for designing constructivist learning environments. Educ. Technol. Res. Dev. 1999, 47, 61–79. [Google Scholar] [CrossRef]
  69. Perel, M.; Elkin-Koren, N. Black box tinkering: Beyond transparency in algorithmic enforcement. SSRN Electron. J. 2016, 69, 181. [Google Scholar]
  70. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. arXiv 2021, arXiv:2103.01955. [Google Scholar]
  71. Flatland. A 2D Robot Simulator for ROS. Github. Available online: https://github.com/avidbots/flatland (accessed on 11 November 2025).
  72. Grossberg, S. Recurrent neural networks. Scholarpedia 2013, 8, 1888. [Google Scholar] [CrossRef]
Figure 1. Framework integrating MARL for solving the Multi-Robot Task Allocation (MRTA) problem.
Figure 2. ROS interface combined with robot topics defining robot models and functionalities.
Figure 3. (a) Lab environment demonstration; (b) Rviz visualizer with robots.
Figure 4. Collaborative scenario definition for the multi-agent navigation sequence.
Figure 5. MAPPO input and output recurrent neural network.
Figure 6. MAPPO Algorithm Structure.
Figure 7. Mean reward trajectories for robots R0 and R1 under the non-training baseline condition.
Figure 8. Breakdown of reward components for robots R0 (a) and R1 (b) under the non-training baseline condition.
Figure 9. Mean reward progression for robots R0 and R1.
Figure 10. Breakdown of reward components for robots R0 (a) and R1 (b).
Figure 11. (a) Total loss, (b) policy loss, and (c) value loss for two robots.
Figure 12. Combined reward for four-robot task allocation.
Figure 13. Reward breakdown for four-robot task allocation.
Figure 14. Collision-event frequency for the two-robot (a) and four-robot (b) configurations.
Figure 15. Optimized task schedule for two robots.
Figure 16. Optimized task schedule for four robots.
Figure 17. TurtleBot3 Burger robots operating in the laboratory test environment.
Table 1. Characteristics of the MRTA problem.
| Problem Definition | Application Category | Possible Solutions |
| --- | --- | --- |
| Single-task robots (ST) | Multiple traveling salesman problem (mTSP) | Market-Based Approaches |
| Multi-task robots (MT) | Vehicle routing problem (VRP) | Behavior-Based Approaches |
| Single-robot tasks (SR) | Location routing problem (LRP) | Utility-Based Approaches |
| Multi-robot tasks (MR) | Job scheduling problem (JSP) | Optimization-Based Approaches |
| Instantaneous assignment (IA) | Linear assignment problem (LAP) | Learning-Based Approaches |
| Time-extended assignment (TA) | | Consensus and Cooperation-Based Approaches |
Table 2. Pros and cons of various MRTA solutions.
| Possible Solutions | Pros | Cons | References |
| --- | --- | --- | --- |
| Market-Based Approaches | Flexibility; Scalability | Overhead; Instability | [31,32,33] |
| Behavior-Based Approaches | Simplicity; Robustness | Inflexibility; Efficiency | [9,34] |
| Utility-Based Approaches | Optimality; Precision | Complexity; Computation | [35,36,37] |
| Optimization-Based Approaches | Comprehensive; Adaptable | Computational demand; Rigidity | [8,10,38] |
| Learning-Based Approaches | Adaptive learning; Generalization | Initial learning curve; Predictability | [39,40,41] |
| Consensus and Cooperation-Based Approaches | Conflict resolution; Cooperative | Communication requirements; Complexity | [42] |
Table 3. Overview of case study configuration and data flows.
| Work Package | Characteristics |
| --- | --- |
| Simulation Environment | Simulator: ROS Flatland. Data Flow 1: 2D laboratory environment reconstruction from BIM. Data Flow 2: Multi-robot setup using multiple TurtleBot3 platforms for testing collaborative navigation. |
| Scenario Definition | Defines the multi-agent task allocation problem and corresponding robot actions. Data Flow 3: Generation of navigation sequences for coordinated robot movement. Data Flow 4: Definition of the iterative learning process and continuous update of task logic. |
| Communication Link | ROS middleware enables inter-agent and environment communication through publishers and subscribers. Data Flow 5: Exchange of task commands, state data, and control feedback among ROS nodes. Data Flow 6: Bidirectional data exchange for MARL training, with state observations as inputs and policy updates as outputs. |
| Algorithm | Reinforcement learning algorithm: MAPPO. Data Flow 6: Processing observation-state inputs and producing training outputs for policy refinement. Data Flow 7: Decision command processing using centralized training and decentralized execution to maximize group performance. |
| Simulation-to-Reality Translation | Data Flow 8: Transfers optimized policies and decision strategies from the simulation environment to real-world robots, ensuring consistency and reliability between virtual training and physical execution. |
Table 4. Information fed into the ROS Framework and Subcomponents.
| External Scripts | Format | Functionalities | ROS Packages Used |
| --- | --- | --- | --- |
| Environment Model | .SDF | The environment model that the robots operate in | /Gazebo; /map_server; /Rviz |
| Robot Model | .URDF | Robot model imported into ROS for simulation | /amcl; /move_base; /robot_state_publisher; /joint_state_publisher |
| Scheduler | .XML | A baseline schedule for robotic tasks | /rl_task_allocation_manager |
| RL Training Script | Python | Training method to find the optimal task-allocation policy within the defined constraints and rewards | /rl_manager; /rl_task_allocation_manager |
Table 5. States representations for One Robot.
| State | Representation |
| --- | --- |
| Step number | Step_i |
| Initial position | (p_x, p_y, p_z) |
| Robot ID | R_n |
| Current and previous task | Task_t and Task_{t-1} |
| Current and previous goal | Goal_t and Goal_{t-1} |
| Robot status | Reach_goal_flag (Bool) |
| Task category | Task A or B |
| Navigation duration | t_Nav,i |
| Step duration | t_Step,i |
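As an illustration, the per-robot state in Table 5 can be sketched as a plain container; the field and class names below are hypothetical (they are not taken from the framework's actual implementation):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class RobotState:
    """Per-robot state mirroring Table 5; field names are illustrative."""
    step: int                              # step number i
    position: Tuple[float, float, float]   # initial position (p_x, p_y, p_z)
    robot_id: int                          # robot ID R_n
    current_task: str                      # Task_t
    previous_task: str                     # Task_{t-1}
    current_goal: str                      # Goal_t
    previous_goal: str                     # Goal_{t-1}
    reached_goal: bool                     # robot status: reach-goal flag
    task_category: str                     # task category: 'A' or 'B'
    nav_duration: float                    # navigation duration t_Nav,i (s)
    step_duration: float                   # step duration t_Step,i (s)

# Example state at the start of an episode
state = RobotState(
    step=0, position=(0.0, 0.0, 0.0), robot_id=0,
    current_task="A1", previous_task="", current_goal="A1", previous_goal="",
    reached_goal=False, task_category="A", nav_duration=0.0, step_duration=0.0,
)
```

Keeping the state as a typed record like this makes it straightforward to flatten into the numeric observation vector fed to the policy network.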
Table 6. Collaborative Schedule Measurement Criteria (performance metrics).
| Observation Space | Format | Category | Usage |
| --- | --- | --- | --- |
| Global assigned tasks | A list of assigned tasks | Global | Allocation strategy |
| Task status | Task-finished list: (Bool) True or False | Global | Track completion |
| Global reached goals | A list of tasks reached, with order and time | Global | Check logic |
| Logic correctness | Logic-correct flag: (Bool) True or False | Global | Check logic |
| Episode duration | A list of step durations | Global | Optimization |
| Idle time | Time spent waiting for the other robot to finish the step after reaching the goal | Local | Use ratio |
| Navigation time | Time to reach the goal in a step | Local | Path efficiency |
| Collision | Whether a collision occurred: (Bool) True or False | Local | Fatal error |
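The global/local observation split in Table 6 might be represented as a simple record, sketched below; the dictionary keys are hypothetical names chosen for readability, not the framework's actual API:

```python
def make_observation():
    """Illustrative observation record mirroring Table 6 (hypothetical keys)."""
    return {
        # Global observations, shared across agents
        "assigned_tasks": [],     # allocation strategy: (robot_id, task_id) pairs
        "task_status": [],        # track completion: list of bool flags
        "reached_goals": [],      # check logic: (task_id, order, time) tuples
        "logic_correct": True,    # check logic: sequencing-constraint flag
        "step_durations": [],     # optimization: per-step durations
        # Local observations, per robot
        "idle_time": 0.0,         # use ratio: wait time after reaching a goal
        "navigation_time": 0.0,   # path efficiency: time to reach the goal
        "collision": False,       # fatal-error flag
    }

obs = make_observation()
```

Grouping the global entries first keeps the centralized critic's input separable from the local observations each agent uses during decentralized execution.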
Table 7. MAPPO training hyperparameters used in the 2D benchmark experiments.
| Hyperparameter | Symbol | Value | Description |
| --- | --- | --- | --- |
| Discount factor | γ | 0.98 | Weighting of future rewards in return estimation |
| PPO clipping parameter | ε | 0.20 | Trust-region bound on policy ratio (epo_eps) |
| Entropy coefficient | c_e | 0.01 | Weight of entropy bonus (entropy_factor) |
| Value loss coefficient | c_v | 5.0 | Weight of value-function loss (value_factor) |
| Policy/value learning rate | | 1 × 10⁻⁵ | Adam step size for actor and critic (learning_rate) |
| Learning-rate decay factor | | 0.95 per 10,000 steps | Multiplicative LR decay and interval (learning_rate_decay, learning_rate_decay_steps) |
| Batch size | | 1000 | Number of samples per PPO update (batch_size) |
| Minibatch size | | 256 | Size of SGD minibatches (mini_batch_size) |
| PPO epochs per update | | 10 | Passes over the collected batch (epochs) |
| Control frequency | | 5 Hz | Environment update rate (HZ) |
| Episode duration | | 120 s | Maximum simulated episode time (episode_duration) |
| Total environment steps | | 70,000 | Target number of steps per training run (num_env_steps) |
| Number of parallel threads | | 8 | Parallel environments used for data collection (threads) |
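The settings in Table 7 can be collected into a plain configuration dict (key names follow the parenthesized identifiers given in the table), with a small helper showing how the stepped multiplicative decay would act on the learning rate. This is a sketch of the schedule, not the repository's code:

```python
# MAPPO hyperparameters transcribed from Table 7.
mappo_config = {
    "gamma": 0.98,                        # discount factor
    "epo_eps": 0.20,                      # PPO clipping parameter
    "entropy_factor": 0.01,               # entropy bonus weight
    "value_factor": 5.0,                  # value-function loss weight
    "learning_rate": 1e-5,                # Adam step size (actor and critic)
    "learning_rate_decay": 0.95,          # multiplicative decay factor...
    "learning_rate_decay_steps": 10_000,  # ...applied every 10,000 steps
    "batch_size": 1000,                   # samples per PPO update
    "mini_batch_size": 256,               # SGD minibatch size
    "epochs": 10,                         # PPO epochs per update
    "HZ": 5,                              # control frequency (Hz)
    "episode_duration": 120,              # max simulated episode time (s)
    "num_env_steps": 70_000,              # total environment steps per run
    "threads": 8,                         # parallel data-collection environments
}

def lr_at_step(cfg: dict, step: int) -> float:
    """Learning rate after applying the stepped multiplicative decay."""
    n_decays = step // cfg["learning_rate_decay_steps"]
    return cfg["learning_rate"] * cfg["learning_rate_decay"] ** n_decays
```

Under this schedule, the learning rate stays at 1 × 10⁻⁵ for the first 10,000 steps and is multiplied by 0.95 at each subsequent 10,000-step boundary.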
Table 8. Simulation results for non-training and GA optimized task allocation.
| Algorithm and Scenario | Duration (avg, s) | Tasks Assigned R0 | Tasks Assigned R1 | Tasks Assigned R2 | Tasks Assigned R3 |
| --- | --- | --- | --- | --- | --- |
| No training, 2 robots | 98.75 | Random (average task-allocation count 4.6) | | | |
| GA, 2 robots with logic | 40.2 | A1, A2, B2, B1 | A3, A4, B3, B4 | NA | NA |
| No training, 4 robots | 112.27 | Random (average task-allocation count 2.6) | | | |
| GA, 4 robots with logic | 43.37 | B3, B4 | B2, B1 | A3, A4 | A2, A1 |
Table 9. Summary of average duration (s) for different numbers of agents.
| Agents | No Training | GA | MAPPO |
| --- | --- | --- | --- |
| N = 2 | 98.75 | 40.2 | 42 |
| N = 4 | 112.27 | 43.37 | 55 |
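From the averages in Table 9, the relative savings of each method over the untrained baseline follow directly; the snippet below only transcribes the table's values and computes percentage reductions:

```python
# Average episode durations (s), transcribed from Table 9.
durations = {
    2: {"no_training": 98.75, "GA": 40.2, "MAPPO": 42.0},
    4: {"no_training": 112.27, "GA": 43.37, "MAPPO": 55.0},
}

def reduction_vs_baseline(n_agents: int, method: str) -> float:
    """Percent reduction in average duration relative to the untrained baseline."""
    base = durations[n_agents]["no_training"]
    return 100.0 * (base - durations[n_agents][method]) / base
```

For two agents, MAPPO cuts the average duration by roughly 57% relative to the untrained baseline (GA: roughly 59%); for four agents, MAPPO's reduction is about 51%.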
Table 10. Comparison of simulation and real-world task execution results.
| Algorithm and Scenario | Duration (s) | Tasks Assigned R0 | Tasks Assigned R1 |
| --- | --- | --- | --- |
| No training, 2 robots | 98.75 | Random | Random |
| GA, 2 robots with logic | 40.2 | A1, A2, B2, B1 | A3, A4, B3, B4 |
| MARL, 2 robots | 42 | A1, A2, B2, B1 | A3, A4, B3, B4 |
| Real-world 2D navigation | 60 | A1, A2, B2, B1 | A3, A4, B3, B4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Xu, X.; Prieto, S.A.; García de Soto, B. A Modular ROS–MARL Framework for Cooperative Multi-Robot Task Allocation in Construction Digital Environments. Buildings 2026, 16, 539. https://doi.org/10.3390/buildings16030539
