Review

Learning-Based Multi-Robot Active SLAM: A Conceptual Framework and Survey

1 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 Shunde Innovation School, University of Science and Technology Beijing, Shunde, Foshan 528399, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1412; https://doi.org/10.3390/app16031412
Submission received: 19 December 2025 / Revised: 21 January 2026 / Accepted: 22 January 2026 / Published: 30 January 2026
(This article belongs to the Special Issue Applications of Robot Navigation in Autonomous Systems)

Abstract

Multi-robot systems (MRSs) offer distinct advantages in large-scale exploration but require tight coupling between decentralized decision-making and collaborative estimation. This survey reviews learning-based multi-robot Active Collaborative Simultaneous Localization and Mapping (AC-SLAM), modeling it as a coupled system comprising a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) decision layer and a distributed factor-graph estimation layer. By synthesizing these components into a conceptual framework, recent methods for cooperative perception, mapping, and policy learning are systematically critiqued. The analysis concludes that Hierarchical Reinforcement Learning (HRL) and graph-based spatial abstraction currently offer superior scalability and robustness compared to monolithic end-to-end approaches. Furthermore, a comprehensive analysis of Sim-to-Real transfer strategies is provided, ranging from domain randomization to emerging Real-to-Sim techniques based on NeRF and 3D Gaussian Splatting. Finally, future directions are outlined, moving from geometric mapping toward LLM-driven active semantic understanding and dynamic digital twins to bridge the reality gap.

1. Introduction

SLAM (Simultaneous Localization and Mapping) serves as a foundational capability for autonomous systems in GPS-denied, unmapped environments. It transforms continuous observations from sensors—such as cameras, LiDARs, and Inertial Measurement Units (IMUs)—into actionable pose estimates and environmental maps, thereby providing the core “spatial perception” required for autonomy [1,2]. Without this layer of reliable spatial understanding, high-level functions such as path planning, obstacle avoidance, target tracking, and human-robot interaction lack a unified coordinate system and a consistent world model, making it difficult for the system to achieve stable, closed-loop autonomy.
Modern SLAM has evolved from representations based primarily on geometry and appearance to more expressive world models that incorporate semantics, dynamics, hierarchical structures, and affordances [3]. This evolution equips agents with a more comprehensive understanding of the environment, enabling them to execute more complex tasks [4,5]. Embodied AI emphasizes that agents acquire information and develop long-term planning and learning capabilities through perception–action–interaction feedback within the environment [6]. Given this perspective, SLAM serves as a crucial bridge between perception (what is seen) and planning (how to act). Furthermore, this perception-action coupling is the essence of active SLAM: a paradigm that constructs more accurate and complete environmental models by planning and controlling motion, while balancing the uncertainty trade-off between exploration and exploitation. For critical applications such as autonomous driving, warehousing and industrial scenarios, disaster rescue [7], and subterranean or planetary exploration [8], the focus of SLAM research has shifted from simply pursuing higher accuracy to enabling long-term autonomy [9]. This entails continuously updating maps, managing map size, and maintaining robust localization and scalable operation within dynamic and unstructured environments. Ultimately, SLAM capabilities set the system’s safety boundaries, enabling autonomous exploration and task execution even in complex, GPS-denied environments without prior maps.
As illustrated in Table 1, against the backdrop of extreme underground environments such as the DARPA Subterranean Challenge, multi-robot systems (MRSs) have achieved kilometer-scale coverage and real-time mapping capabilities through parallel exploration and heterogeneous collaboration. Experimental data from real-world mines, caves, and underground facilities [8,10], as well as underwater environments [11], have demonstrated the feasibility and performance of multi-robot collaboration. Under constrained Field of View (FoV) conditions, MARVEL [12] utilizes multi-agent reinforcement learning (MARL) to learn a decentralized exploration policy, outperforming traditional planning baselines in metrics such as coverage, trajectory length, and overlap rate, with its deployability verified on real UAV platforms. Regarding localization accuracy, multi-robot collaborative SLAM typically introduces additional constraints to suppress drift by leveraging inter-robot loop closure detection and joint pose graph optimization or submap fusion. Xu et al. reported an approximate 32.8% improvement in localization accuracy over open-source baselines in public benchmarks [13]. Meanwhile, research on distributed collaborative SLAM indicates that inter-robot constraints and outlier rejection are crucial for global consistency [14,15]. From a scalability perspective, mechanisms such as loop closure prioritization and lightweight information exchange can reduce global optimization and communication overhead while maintaining accuracy [15,16]. However, real-world deployments like SubT reveal that restricted communication and connectivity intermittency in underground environments significantly constrain the frequency of collaboration and the upper limit of data volume, necessitating a systemic trade-off between the benefits of information sharing and the costs of real-time performance and bandwidth [8].
Based on the two dimensions of robot quantity (single-robot vs. multi-robot) and decision autonomy (passive vs. active), existing SLAM research can be categorized into four classes, as shown in Table 2. The foundational principles of SLAM were established in the seminal tutorial by Durrant-Whyte and Bailey [22], which formalized the joint estimation problem and probabilistic solution approaches that underpin modern systems. Single-Robot Passive SLAM has established a relatively stable and modular technology stack. Research focus has gradually expanded from classic geometric optimization to multi-sensor fusion, semantic enhancement, and robustness in dynamic or degraded environments. Surveys on visual SLAM and multi-sensor fusion have systematically reviewed the frameworks for front-end, loop closure, and back-end optimization, along with their evolving trends toward learning-based methods [23,24,25]. Single-Robot Active SLAM couples mapping and localization with action selection. It typically characterizes the trade-off among exploration, uncertainty, and cost from the perspective of information-theoretic objectives or POMDPs, emphasizing the challenges of non-myopic planning, real-time performance, and evaluation benchmarks. The survey published in IEEE T-RO by Placed et al. systematically reviewed information-driven path planning and deep reinforcement learning methods [9], Ahmed et al. reviewed the evolution of active SLAM regarding trajectory generation and control policies over the past decade [26], and Lluvia et al. conducted a comprehensive survey from the perspectives of active mapping and robot exploration [27]. The core of Multi-Robot Passive SLAM (Collaborative/Distributed SLAM) lies in inter-robot constraint construction and robust fusion (e.g., inter-robot loop closure, coordinate alignment, and outlier rejection), which is strongly constrained by communication and resource management. Centralized and decentralized architectures, along with their respective advantages and disadvantages, have been systematically summarized in collaborative SLAM surveys [1,28].
In contrast, Multi-Robot Active SLAM necessitates the simultaneous resolution of “collaborative decision-making (task allocation, redundancy reduction, conflict avoidance, and communication adaptation)” and “collaborative estimation (map and trajectory fusion).” Consequently, it more closely resembles a coupled system integrating Decentralized Partially Observable Markov Decision Processes (Dec-POMDP)/MARL with distributed estimation. Existing surveys typically approach the topic from the perspective of either active SLAM or collaborative SLAM. For instance, Ahmed et al. [26] reviewed active SLAM and provided the first relatively systematic qualitative and statistical analysis of active collaborative SLAM (AC-SLAM); Lajoie et al. [28] approached the subject from the perspective of collaborative SLAM, discussing active C-SLAM in specific sections; focusing on “collaborative exploration systems,” the recent work by Wang et al. [29] analyzed system architectures, offering a modular summarization of localization and mapping, collaborative planning, and communication modules; meanwhile, Orr et al. [30] summarized multi-robot collaborative decision-making methods from the perspective of multi-agent deep reinforcement learning. Although these works cover key elements such as active SLAM, collaborative SLAM, multi-robot collaborative exploration, and MARL, to the best of our knowledge, they only discuss Multi-Robot AC-SLAM in limited sections. A systematic survey centered on AC-SLAM that uniformly covers problem formulation, long-term decision-making, and Sim-to-Real transfer has not yet been established. This constitutes one of the primary motivations for our work.
This survey focuses on multi-robot AC-SLAM. Under communication-constrained and partially observable conditions, it investigates how to drive effective exploration through collaborative policies that select joint actions $a_{1:N}$, while simultaneously performing robust inter-robot constraint construction and map fusion to achieve scalable and sustainable team-level spatial awareness. This problem can be conceptualized as a coupled system comprising a Dec-POMDP/MARL decision-making layer and a distributed factor graph/map fusion estimation layer. Existing surveys on active and collaborative SLAM provide unified objective functions and system challenge descriptions for this domain [28], while surveys on collaborative exploration systems offer modular engineering frameworks tailored for real-world tasks [29].
Beyond the need for a unified survey treatment, AC-SLAM offers fundamental theoretical advantages that motivate its study. From an information-theoretic perspective, active exploration yields strictly greater information gain than passive trajectory execution: by deliberately visiting high-uncertainty regions and actively seeking loop closures, AC-SLAM reduces the posterior entropy $H(X, M \mid Z)$ more efficiently than random or predetermined paths [9]. This advantage becomes critical in scenarios where (1) exploration time is constrained (e.g., search-and-rescue missions requiring rapid coverage) [31], (2) GPS is unavailable and drift accumulation must be actively mitigated through loop closure (e.g., subterranean or indoor environments) [8], or (3) perceptual aliasing is prevalent and deliberate viewpoint planning is needed for disambiguation (e.g., warehouse or tunnel environments with repetitive structures) [32]. Despite these theoretical advantages, which have been validated in simulation benchmarks and small-scale multi-robot experiments [33], deploying learning-driven AC-SLAM systems at the industrial scale still faces significant challenges: the sim-to-real gap, real-time computational budgets, and reliability requirements remain active research problems [26].
The main contributions of this survey are summarized as follows:
  • An integrated conceptual framework for multi-robot AC-SLAM is introduced, modeling it as a coupled system consisting of a Dec-POMDP/MARL decision layer and a distributed factor graph/map fusion estimation layer. Within this abstraction, the fundamental components and objectives of collaborative perception, collaborative mapping, and collaborative decision-making are delineated, together with the input–output relationships of each module, to provide a structured lens for organizing and analyzing existing systems and methods. To avoid overstating novelty, it is emphasized that this framework serves primarily as an organizing abstraction: different surveyed works instantiate different subsets of modules with varying levels of experimental maturity. Table 3 summarizes the validation status of each component.
  • Learning-driven multi-robot active SLAM methods are systematically reviewed, with a specific focus on analyzing the advantages and limitations of techniques such as end-to-end DRL, hierarchical reinforcement learning, and sparse map representations in achieving effective team collaborative exploration and safe constrained control within long-duration, complex environments.
  • The primary challenges in the sim-to-real transfer of multi-robot active SLAM are summarized, including model bias, sensor and dynamics mismatch, and discrepancies in communication conditions. Typical technical pathways—such as domain randomization, stylized simulation, and online adaptation—are further categorized, and their transfer efficiency and applicable scenarios are compared.
The remainder of this paper is organized as follows: Section 2 summarizes the formal problem definition of AC-SLAM and presents a conceptual system framework encompassing collaborative perception, collaborative optimization, and collaborative decision-making. Under this framework, Section 3 reviews learning-driven active SLAM methods, with a particular focus on hierarchical reinforcement learning, graph abstraction, and safe, constrained MARL policies. Section 4 focuses on sim-to-real transfer, analyzing typical methods along with their applicability and limitations. Section 5 discusses future directions, including the integration of Large Language Models for semantic understanding and dynamic digital twins for bridging the reality gap. Finally, Section 6 concludes this paper.

2. Problem Formulation and Solution Framework of AC-SLAM

This section first reviews the formal definition of the AC-SLAM problem. It then proposes a conceptual system framework composed of cooperative perception, cooperative mapping (optimization), and cooperative decision-making. This framework can be abstracted as a hierarchical model coupling a Decision Layer (Dec-POMDP/MARL) with a Distributed SLAM Estimation Layer (Factor Graph/Map Fusion).

2.1. Formal Definition of AC-SLAM

AC-SLAM is not just a state estimation problem but a perception-driven stochastic optimal control problem [37]. Its core objective is to actively control the trajectories of multiple robots to maximize the accuracy of jointly estimating the environment map and the robot swarm’s poses, all under constraints of communication bandwidth, computational resources, and energy. This paper formalizes AC-SLAM as a constrained non-linear optimization problem within a finite time horizon T.

2.1.1. Notations

  • Robots (Agents): $\mathcal{I} = \{1, \ldots, N\}$, with $\mathcal{N}(i)$ denoting the communication neighbors of robot i.
  • Joint State: $X_t = \{x_{1,t}, \ldots, x_{N,t}, M\}$, where $x_{i,t}$ denotes the pose of robot i at time t and M denotes the environmental map (feature points, occupancy grid, or 3D mesh).
  • Joint Actions: $A_t = \{a_{1,t}, \ldots, a_{N,t}\}$, representing the control inputs of all N robots at time t.
  • Joint Observations: $Z_t = \{z_{1,t}, \ldots, z_{N,t}\}$, where $z_{i,t}$ is the observation obtained by robot i at time t.
  • Policy: $\pi_i : \mathcal{H}_i \to \Delta(A_i)$, the decision policy of robot i mapping its observation–action history to a distribution over actions.
  • Reward: $R : S \times A_1 \times \cdots \times A_N \to \mathbb{R}$, the joint reward function integrating exploration gain, mapping quality, and costs.
  • Belief: $b_t$, the posterior probability distribution over the joint state $X_t$ given all past observations and actions, i.e., $b_t = P(X_t \mid Z_{1:t}, A_{1:t}, X_0)$.

2.1.2. Optimization Objective Function

The goal of AC-SLAM is to find the optimal action sequence A t : t + T that minimizes the expected posterior uncertainty (over future observations) while also penalizing action costs. Following the standard information-gain formulation established in active SLAM literature [9], the objective can be written as
$A^{*}_{t:t+T} = \arg\min_{A_{t:t+T}} \sum_{k=t}^{t+T} \mathbb{E}_{Z_{k+1}}\Big[\underbrace{U(b_{k+1})}_{\text{Uncertainty}} + \lambda \cdot \underbrace{C(a_k)}_{\text{Cost}}\Big]$   (1)
Here, the uncertainty metric $U(b_{k+1})$ typically adopts A-optimality (the trace of the covariance matrix, $\mathrm{tr}(\Sigma)$) or D-optimality (Shannon entropy $H(X, M)$). Essentially, this objective seeks to maximize the information gain $IG = H(b_t) - H(b_{t+1})$. The cost term $C(a_k)$ encompasses travel distance, energy consumption, and communication bandwidth usage. The expectation operator $\mathbb{E}_{Z_{k+1}}$ indicates that, since future measurements Z are unknown, the expectation must be computed under the current observation model $P(Z \mid X)$. In practice, the choice of uncertainty metric directly affects algorithm design: A-optimality (trace minimization) yields computationally efficient greedy planners suitable for real-time systems, while D-optimality (entropy) better captures correlations but incurs higher computational cost. The trade-off parameter $\lambda$ is platform-dependent; energy-constrained UAVs typically use a larger $\lambda$ to penalize costly maneuvers. Furthermore, the uncertainty metric serves as a natural reward signal for learning-based approaches, guiding the policy to actively reduce map and pose uncertainty.
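As a concrete illustration of this objective, the following sketch scores candidate actions by combining an uncertainty metric computed from a predicted posterior covariance (A- or D-optimality) with a weighted action cost, mirroring Equation (1); the candidate covariances, costs, and weight are illustrative placeholders rather than values from any surveyed system.

```python
import numpy as np

def a_optimality(cov: np.ndarray) -> float:
    """A-optimality: trace of the predicted pose/map covariance."""
    return float(np.trace(cov))

def d_optimality(cov: np.ndarray) -> float:
    """D-optimality: Gaussian differential entropy (up to additive constants)."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet

def score_actions(candidate_covs, costs, lam=0.1, metric=a_optimality):
    """Evaluate Eq. (1) per candidate: expected posterior uncertainty + lambda * cost."""
    return [metric(cov) + lam * c for cov, c in zip(candidate_covs, costs)]

# Hypothetical candidates: one action shrinks position uncertainty, the other heading.
covs = [np.diag([0.20, 0.20, 0.10]), np.diag([0.05, 0.05, 0.40])]
costs = [1.0, 3.0]
best = int(np.argmin(score_actions(covs, costs)))  # action minimizing uncertainty + cost
```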
The AC-SLAM optimization problem is subject to three categories of constraints. Kinematic constraints characterize the robot motion model shown in Equation (2), where f ( · ) is the state transition function (such as differential drive or omnidirectional kinematic models) and w k is the process noise (typically assumed to be Gaussian white noise):
$x_{i,k+1} = f(x_{i,k}, a_{i,k}) + w_k$,   (2)
Observation model constraints relate sensor measurements to the state shown in Equation (3), where h ( · ) is the observation function (such as LiDAR scan matching or visual feature projection) and v k is the measurement noise:
$z_{i,k} = h(x_{i,k}, M) + v_k$,   (3)
Communication constraints define when robots can exchange information (e.g., factor graph data, feature descriptors, decision intents). Robots i and j are allowed to share data only if they are within communication range $R_{\text{comm}}$ of each other or are connected in the communication network (graph) $G_{\text{comm}}$. This constraint reflects real-world wireless communication effects such as signal attenuation with distance, occlusion interference, and limited bandwidth.
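The snippet below instantiates these three constraint types for a hypothetical setup: a differential-drive kinematic model with additive Gaussian noise for Equation (2), a range–bearing observation of a map landmark for Equation (3), and a simple disk model of the communication graph; the specific models and noise levels are illustrative only.

```python
import numpy as np

def diff_drive_step(x, a, dt=0.1, sigma=(0.01, 0.01, 0.005)):
    """State transition f(.) of Eq. (2): x = [px, py, theta], a = [v, omega]."""
    px, py, th = x
    v, om = a
    w = np.random.normal(0.0, sigma)                     # process noise w_k
    return np.array([px + v * np.cos(th) * dt,
                     py + v * np.sin(th) * dt,
                     th + om * dt]) + w

def range_bearing_obs(x, landmark, sigma=(0.05, 0.02)):
    """Observation model h(.) of Eq. (3): range and bearing to a landmark in M."""
    dx, dy = landmark[0] - x[0], landmark[1] - x[1]
    v = np.random.normal(0.0, sigma)                     # measurement noise v_k
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx) - x[2]]) + v

def comm_graph(positions, r_comm=10.0):
    """Disk-model communication constraint: (i, j) may exchange data iff within R_comm."""
    n = len(positions)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(positions[i][:2] - positions[j][:2]) <= r_comm}
```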

2.2. Conceptual System Framework: Coupled Model

In our AC-SLAM framework, the system is modeled as a tightly coupled, dual-layer, closed-loop architecture consisting of a decision-making layer and an estimation layer. The decision-making layer (based on Dec-POMDP [38]) coordinates cooperative behavior and joint action selection among multiple robots operating with local observations. The estimation layer (based on distributed factor graphs [39]) performs front-end perception fusion and back-end optimization of motion and sensing data. It produces the multi-robot pose estimates and global map and, in turn, provides the decision layer with updated belief states and uncertainty information.

2.2.1. Framework Architecture Diagram

As illustrated in Figure 1, AC-SLAM is a multi-robot closed-loop architecture with three coupled layers: cooperative perception, cooperative mapping, and cooperative decision-making. Each robot i maintains a local belief state b i , which encapsulates the global map M, all robot poses X 1 : N , and associated uncertainty estimates. This belief state is updated through front-end perception, back-end distributed factor graph optimization, and an information/entropy-based decision planner. The communication network enables cooperation across all layers via three channels: feature descriptors for cross-robot data association, factor graph data for distributed pose graph optimization and map fusion, and decision intentions for joint policy learning and task allocation within the Dec-POMDP framework.
From a data flow perspective, this architecture can be formally described as follows. At time t, each robot i obtains a raw observation $z_i(t)$ from its sensors and executes an action $a_i(t)$ from its motion controller. Perception: the cooperative perception operator $C^{\text{per}}$ processes $z_i(t)$ to extract local features, compresses them into a descriptor $d_i(t)$, and exchanges descriptors with neighbors to form a pooled feature set $\tilde{d}_i(t) = \{ d_j(t) \mid j \in \mathcal{N}(i) \}$. Mapping: the cooperative mapping operator $C^{\text{map}}$ uses $\{ z_i(t), a_i(t), \tilde{d}_i(t) \}$ to perform distributed optimization on the joint factor graph $\mathcal{G}$, solving Equation (4) to update the multi-robot posterior $p(X_{1:N}, M \mid Z_{1:t})$ and yielding an improved belief state $b_i(t+1)$:
$(X^{*}, M^{*}) = \arg\min_{X, M} \sum_{k \in \mathcal{E}} \lVert e_k(X, M) \rVert^{2}_{\Sigma_k}$,   (4)
where the edge set $\mathcal{E}$ includes odometry constraints and inter-robot loop closure constraints. Decision-Making: the cooperative decision operator $C^{\text{dec}}$ takes the updated beliefs $\{ b_j(t+1) \mid j \in \mathcal{N}(i) \cup \{i\} \}$ and task objectives as input and selects joint actions by solving Equation (5) under the Dec-POMDP/MARL framework to maximize the expected cumulative reward:
$\pi^{*}_{\theta} = \arg\max_{\pi_\theta} \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \gamma^{t} R(s_t, a_{1:N,t})\Big]$,   (5)
where R ( s t , a 1 : N , t ) integrates exploration gain, mapping quality, localization accuracy, and cooperative gain. The output action a i ( t + 1 ) is transmitted to the motion controller to act upon the environment, generating a new observation z i ( t + 1 ) , which flows back to the perception layer, completing the closed-loop. Equations (4) and (5) follow standard factor graph optimization [39,40] and Dec-POMDP/MARL formulations [41], respectively.
In summary, the estimation layer uses distributed graph optimization to fuse observations and inter-robot loop closures, yielding a global belief state b ( t ) (with estimated poses X, map M, and covariance Σ ). The cooperative perception operator C per decides which observations to share given communication constraints, while the cooperative mapping operator C map focuses on how to maintain a consistent global estimation on the distributed factor graph. In the Dec-POMDP framework, the decision layer uses the shared belief and task requirements to select joint actions a 1 : N , t via the cooperative policy operator C dec . This achieves a well-coordinated exploration policy (effective division of labor) under partially observable, communication-constrained conditions, completing the closed-loop cycle of Physical World → Perception → Estimation → Decision → Control.
Algorithm 1 presents the complete AC-SLAM pipeline. At each iteration, robots first acquire observations (Equation (3)) and selectively share informative features (Equation (6)). The shared data is fused via distributed optimization (Equation (7)) to update the belief state. Based on this belief, the planner evaluates candidate actions (Equation (1)) and the learned policy outputs the next action (Equation (8)), which drives the robot to a new state (Equation (2)), completing the perception–estimation–decision loop.
Algorithm 1: Multi-robot AC-SLAM pipeline
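A minimal runnable sketch of this closed loop is given below; the perception, mapping, and decision operators are deliberately trivial placeholders meant only to expose the data flow described above, not to reproduce the actual Algorithm 1 listing.

```python
import numpy as np

def C_per(z):
    """Cooperative perception (toy): compress an observation into a descriptor, cf. Eq. (6)."""
    return {"mean": float(np.mean(z))}

def C_map(belief, z, shared):
    """Cooperative mapping (toy stand-in for Eq. (7)): fuse own and shared data."""
    vals = [np.mean(z)] + [d["mean"] for d in shared]
    belief["map_estimate"] = 0.5 * belief["map_estimate"] + 0.5 * float(np.mean(vals))
    belief["uncertainty"] *= 0.9          # fusing information shrinks uncertainty
    return belief

def C_dec(belief):
    """Cooperative decision (toy stand-in for Eqs. (1) and (8)): pick the lower-cost option."""
    scores = {"explore": 0.8 * belief["uncertainty"], "revisit": 0.95 * belief["uncertainty"]}
    return min(scores, key=scores.get)

robots = {i: {"belief": {"map_estimate": 0.0, "uncertainty": 1.0}, "action": "explore"}
          for i in range(3)}
for t in range(5):                                         # perception-estimation-decision loop
    obs = {i: np.random.rand(10) for i in robots}          # Eq. (3): acquire observations
    descs = {i: C_per(obs[i]) for i in robots}             # share compressed descriptors
    for i, r in robots.items():
        shared = [descs[j] for j in robots if j != i]      # fully connected network assumed
        r["belief"] = C_map(r["belief"], obs[i], shared)
        r["action"] = C_dec(r["belief"])                   # drives the next motion, Eq. (2)
```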

2.2.2. Collaborative Perception

The core challenge of collaborative perception is to select and share the most informative observations under strict bandwidth constraints to maximize the information gain for the global map. Formally, given each robot’s raw observation z i ( t ) and communication constraints (e.g., bandwidth limit B max , latency bounds), the collaborative perception objective can be expressed as
$\max_{Z_{\text{shared}}} \; I(Z_{\text{shared}}; M) - \lambda \cdot \mathrm{Comm\_cost}(Z_{\text{shared}})$,   (6)
where $Z_{\text{shared}}$ denotes the subset of observation features shared across robots. The term $I(Z_{\text{shared}}; M)$ is the mutual information between the shared observations and the map M, quantifying information gain; $\mathrm{Comm\_cost}(\cdot)$ denotes the communication cost (e.g., data volume, latency); and $\lambda$ is the trade-off coefficient between communication and information. The cooperative perception operator $C_i^{\text{per}}$ maps robot i’s local observations and the messages received from its neighbors to compressed descriptors for transmission and to derived inter-robot observations usable for mapping.
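As a simple illustration of this objective, the sketch below greedily selects the features with the best information-gain-to-cost ratio under a bandwidth budget $B_{\max}$; it is a hypothetical stand-in for the learned selection mechanisms discussed below, not an implementation of any of them.

```python
import numpy as np

def select_features_to_share(features, info_gain, comm_cost, budget):
    """Greedy approximation of Eq. (6): maximize shared information under a comm budget."""
    ratio = np.asarray(info_gain, dtype=float) / np.asarray(comm_cost, dtype=float)
    shared, spent = [], 0.0
    for k in np.argsort(-ratio):                 # best gain-per-byte first
        if spent + comm_cost[k] <= budget:
            shared.append(features[k])
            spent += comm_cost[k]
    return shared

# Hypothetical keyframe descriptors with per-feature information gain and transmission cost.
features = ["kf_12", "kf_17", "kf_03", "kf_21"]
info_gain = [0.9, 0.4, 0.7, 0.1]
comm_cost = [3.0, 1.0, 2.0, 0.5]     # e.g., kilobytes
shared = select_features_to_share(features, info_gain, comm_cost, budget=4.0)
```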
Typical algorithms can be classified into three categories. Feature-level fusion methods, like V2V-Net [42], perform intermediate feature concatenation. They achieve a balance between communication efficiency and perception quality by sharing intermediate layer features of deep neural networks rather than raw data or final detection results. Such methods typically employ Convolutional Neural Networks (CNNs) to extract local features, followed by spatial attention mechanisms to fuse multi-robot perspectives. Learning-based compression and selective communication methods, like Where2comm [43] or DiscoNet [44], utilize attention mechanisms or gating networks to learn to select high-value regions for communication. The core idea is that not all observations need to be shared; priority should be given to transmitting information that contributes most to cooperative mapping, such as keyframes containing loop closure cues and observations of high-uncertainty regions. Uncertainty or information gain-based adaptive sharing policies trigger high-bandwidth exchange only when potential inter-robot loop closures or high-entropy regions are detected, based on map uncertainty Σ fed back from the estimation layer. For instance, the transmission frequency of feature descriptors is increased when a robot enters a new area or detects a potential revisit to a historical trajectory [16,18]; conversely, in well-observed areas, communication frequency is reduced to save bandwidth.
Failure Modes and Mitigation Strategies. Despite advances in learning-based feature sharing, collaborative perception remains vulnerable to two critical failure modes. Perceptual aliasing occurs when perceptually similar but spatially distinct locations produce indistinguishable descriptors, leading to false positive matches; this is particularly problematic in repetitive environments such as warehouses, tunnels, or urban canyons, where structural regularity exacerbates the problem [32]. False inter-robot loop closures arise when incorrect cross-robot associations propagate erroneous constraints into the pose graph, potentially causing catastrophic map inconsistency; this is especially severe in GNSS-denied environments where external corrections are unavailable [15,19]. In recent years, researchers have proposed various algorithms and strategies to improve the robustness of multi-robot collaborative perception: (1) pairwise consistency maximization (PCM): DOOR-SLAM [15] rejects geometrically implausible matches by finding the maximum set of mutually consistent loop closures, formulated as a maximum clique problem; (2) multi-level consistency checking: recent work [45] combines inter-robot pairwise-loop consistency with intra-robot loop-odometry consistency, achieving up to 89% reduction in position error compared to PCM alone; (3) distributed outlier rejection: Kimera-Multi [19] performs robust verification using only peer-to-peer communication, enabling scalable deployment; (4) graduated non-convexity (GNC) optimization: LAMP 2.0 [14] applies GNC to iteratively down-weight outlier factors during pose graph optimization, improving robustness without sacrificing convergence; (5) loop closure prioritization: selectively processing high-confidence candidates first reduces computational load and limits false positive propagation; experiments show that evaluating fewer than 1% of top-priority candidates can achieve accuracy comparable to exhaustive verification [16]. These techniques are essential for deploying collaborative SLAM in challenging real-world environments, though recent analysis suggests that complete robustness in extreme conditions remains an open challenge.

2.2.3. Collaborative Mapping/Optimization

The goal of collaborative mapping is to fuse the local observations of multiple robots with inter-robot constraints to solve for globally consistent map and trajectory estimates. This is typically modeled as a multi-robot joint factor graph optimization problem within a factor graph framework. Let $X = \{X_{1:N}\}$ be the set of poses for all robots (where $X_i = \{x_{i,1}, \ldots, x_{i,T_i}\}$ is the trajectory of robot i) and M be the environmental map (which can be a feature point cloud, occupancy grid, or semantic graph). The edge set $\mathcal{E}$ contains three types of factors: odometry factors encode robot motion model constraints $x_{i,t+1} = f(x_{i,t}, a_{i,t}) + w_t$; observation factors encode sensor measurement model constraints $z_{i,t} = h(x_{i,t}, M) + v_t$; and inter-robot loop closure factors represent inter-robot pose constraints detected by the cooperative perception operator (e.g., robot i at time $t_i$ and robot j at time $t_j$ observing the same scene).
The optimization objective of cooperative mapping is to minimize the sum of weighted squared residuals of all factors:
$(X^{*}, M^{*}) = \arg\min_{X, M} \sum_{k \in \mathcal{E}} \lVert r_k(X, M) \rVert^{2}_{\Sigma_k}$,   (7)
where $r_k(X, M)$ is the residual function of the k-th factor and $\lVert \cdot \rVert^{2}_{\Sigma_k}$ denotes the squared Mahalanobis distance weighted by the covariance matrix $\Sigma_k$ (or its inverse, the information matrix $\Omega_k$). Based on local observations $z_i(t)$, actions $a_i(t)$, and inter-robot constraints $\tilde{d}_i(t)$, the cooperative optimization operator $C^{\text{map}}$ updates the global posterior distribution $p(X_{1:N}, M \mid Z_{1:t})$ and calculates the marginal covariance matrix $\Sigma$ (quantifying the uncertainty of poses and map), finally outputting the belief state $b(t) = \{\hat{X}, \hat{M}, \Sigma\}$ for the decision layer. This least-squares formulation enables sparse Cholesky or iterative solvers (e.g., Levenberg–Marquardt in GTSAM or Ceres), achieving real-time performance with thousands of variables. The Mahalanobis weighting naturally down-weights outliers, improving robustness against perceptual aliasing.
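To make this formulation concrete, the following minimal sketch builds a toy two-robot pose graph with the GTSAM Python bindings (a prior, odometry factors, and one inter-robot loop closure) and recovers both the optimized poses and a marginal covariance of the kind fed back to the decision layer; the keys, measurements, and noise values are illustrative and not taken from any surveyed system.

```python
import numpy as np
import gtsam

def A(i): return gtsam.symbol('a', i)   # keys for robot A's poses
def B(i): return gtsam.symbol('b', i)   # keys for robot B's poses

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-3, 1e-3, 1e-3]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

graph.add(gtsam.PriorFactorPose2(A(0), gtsam.Pose2(0, 0, 0), prior_noise))             # gauge anchor
graph.add(gtsam.BetweenFactorPose2(A(0), A(1), gtsam.Pose2(1.0, 0, 0), odom_noise))    # odometry factors
graph.add(gtsam.BetweenFactorPose2(B(0), B(1), gtsam.Pose2(1.0, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(A(1), B(0), gtsam.Pose2(0.5, 0.2, 0), loop_noise))  # inter-robot loop closure

initial = gtsam.Values()
initial.insert(A(0), gtsam.Pose2(0, 0, 0))
initial.insert(A(1), gtsam.Pose2(0.9, 0.1, 0))
initial.insert(B(0), gtsam.Pose2(1.6, 0.1, 0))
initial.insert(B(1), gtsam.Pose2(2.6, 0.2, 0))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
marginals = gtsam.Marginals(graph, result)
belief_cov = marginals.marginalCovariance(B(1))   # uncertainty handed to the decision layer
```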
Common distributed algorithms include three categories. ADMM-based Distributed PGO [34] decomposes the global optimization problem into multiple sub-problems (each robot maintains its own sub-graph). By enforcing consistency constraints on shared poses (separator nodes) and utilizing the Alternating Direction Method of Multipliers (ADMM) [46] to alternately optimize local sub-graphs and dual variables, this method guarantees convergence to the global optimum for convex problems or a good local optimum for non-convex problems, featuring moderate communication volume but fast convergence. Distributed Gauss-Seidel/Jacobi Iteration [47] methods involve robots alternately updating their local sub-graphs while keeping neighbor pose estimates fixed. Gauss-Seidel uses the latest estimates from neighbors (sequential update), while Jacobi uses estimates from the previous round (parallel update). These methods have low communication overhead and simple implementation, but their convergence speed depends on the robot topology and update order. Hierarchical Optimization Methods [35], such as Swarm-SLAM, Multi S-Graphs, and CGE, partition the global factor graph into local sub-graphs managed by individual robots. Each robot independently optimizes its local sub-graph, then achieves cross-subgraph consistency via boundary node covariance propagation or submodular optimization. These methods significantly reduce computational and communication complexity in large-scale systems.
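The distributed Gauss–Seidel idea can be illustrated on a deliberately simplified one-dimensional pose graph: each robot repeatedly re-solves its own sub-graph while holding its neighbor's latest separator estimates fixed. The scalar pose parameterization, constraint values, and weights below are illustrative only.

```python
import numpy as np

# (robot_a, idx_a, robot_b, idx_b, measurement d, weight w); residual is x_b - x_a - d.
constraints = [
    ("A", 0, "A", 1, 1.0, 100.0),   # robot A odometry
    ("A", 1, "A", 2, 1.0, 100.0),
    ("B", 0, "B", 1, 1.0, 100.0),   # robot B odometry
    ("B", 1, "B", 2, 1.0, 100.0),
    ("A", 2, "B", 0, 0.5, 50.0),    # inter-robot loop closure (separator variables)
]
poses = {"A": np.zeros(3), "B": np.array([5.0, 6.0, 7.0])}   # poor initial guess for B

for sweep in range(20):                       # Gauss-Seidel sweeps over the robots
    for rid in ("A", "B"):
        n = len(poses[rid])
        H, b = np.zeros((n, n)), np.zeros(n)
        if rid == "A":
            H[0, 0] += 1e6                    # gauge prior: anchor A's first pose at 0
        for ra, ia, rb, ib, d, w in constraints:
            if rid not in (ra, rb):
                continue
            if ra == rb == rid:               # both endpoints owned by this robot
                H[ia, ia] += w; H[ib, ib] += w
                H[ia, ib] -= w; H[ib, ia] -= w
                b[ia] -= w * d; b[ib] += w * d
            elif ra == rid:                   # other endpoint fixed at neighbor's estimate
                H[ia, ia] += w; b[ia] += w * (poses[rb][ib] - d)
            else:
                H[ib, ib] += w; b[ib] += w * (poses[ra][ia] + d)
        poses[rid] = np.linalg.solve(H, b)    # local normal-equations solve

# Converges to A ~ [0, 1, 2] and B ~ [2.5, 3.5, 4.5] without any central solver.
```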

2.2.4. Collaborative Policy for AC-SLAM

The objective of the collaborative policy is to determine the next action for each robot in the multi-robot system based on the current belief over the map and robot poses, thereby maximizing the long-term cumulative reward. In the context of AC-SLAM, the collaborative policy addresses the multi-robot action decision-making problem within the framework of Decentralized Partially Observable Markov Decision Processes (Dec-POMDP). A Dec-POMDP can be formally defined as a tuple $\langle N, S, \{A_i\}, T, \{\Omega_i\}, O, R, \gamma \rangle$, where
  • N is the number of robots;
  • $S = \{X_{1:N}, M\}$ represents the global state (including the poses of all robots and the environmental map);
  • $A_i$ denotes the action space of robot i (e.g., target poses, velocity commands);
  • $T : S \times A_1 \times \cdots \times A_N \to \Delta(S)$ is the state transition function;
  • $\Omega_i$ is the observation space of robot i;
  • $O : S \times A_1 \times \cdots \times A_N \to \prod_i \Delta(\Omega_i)$ is the observation function;
  • $R : S \times A_1 \times \cdots \times A_N \to \mathbb{R}$ is the joint reward function (integrating exploration coverage, pose and map uncertainty, communication/motion costs, and cooperative gains);
  • $\gamma \in (0, 1]$ is the discount factor.
The collaborative policy set $\Pi = \{\pi_i\}_{i=1}^{N}$ consists of the local policies of each robot, where $\pi_i : \mathcal{H}_i \to \Delta(A_i)$ maps the local observation–action history $\mathcal{H}_i$ of robot i to a probability distribution over the action space $A_i$. Under a belief-based formulation, the policy can be simplified as $\pi_i(a_i \mid b_i)$, where $b_i = \{\hat{X}, \hat{M}, \Sigma\}$ is the local or approximate global belief state comprising poses, maps, and uncertainty provided by the cooperative mapping operator. The optimization objective for the AC-SLAM policy is to maximize the expected cumulative reward:
$\Pi^{*} = \arg\max_{\Pi} \mathbb{E}_{\Pi}\Big[\sum_{t=0}^{T} \gamma^{t} R(s_t, a_{1:N,t})\Big]$.   (8)
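For readers who prefer code to tuples, the sketch below mirrors the Dec-POMDP definition above as a plain container, together with the discounted return whose expectation Equation (8) maximizes; all field types are schematic placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Schematic container for the tuple <N, S, {A_i}, T, {Omega_i}, O, R, gamma>."""
    n_robots: int
    state_space: object            # S = {X_{1:N}, M}
    action_spaces: Sequence        # A_i per robot (e.g., target poses, velocity commands)
    transition: Callable           # T(s, a_{1:N}) -> distribution over next states
    observation_spaces: Sequence   # Omega_i per robot
    observation_fn: Callable       # O(s, a_{1:N}) -> distribution over joint observations
    reward: Callable               # R(s, a_{1:N}) -> float (exploration, mapping, costs)
    gamma: float = 0.99            # discount factor in (0, 1]

def discounted_return(rewards, gamma=0.99):
    """The quantity sum_t gamma^t R_t whose expectation Eq. (8) maximizes."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g., discounted_return([1.0, 0.5, 0.25]) evaluates one sampled episode.
```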
The design of the collaborative policy represents a critical bottleneck in AC-SLAM, facing several core challenges: the state and action spaces expand exponentially as the number of robots increases, individual robots typically have access only to local observations without global environmental awareness, and the exploration–exploitation trade-off (reducing map uncertainty vs. completing task objectives) further exacerbates the complexity. Traditional heuristic methods based on frontiers or simple information gain often fail to balance efficiency and coordination requirements when performing long-term planning within high-dimensional map and multi-robot joint action spaces. Moreover, real-world systems are constrained by local observations, limited communication bandwidth, and dynamic topology changes, making centralized planning and static task allocation schemes difficult to scale in complex, large-scale scenarios. Consequently, learning-driven MARL [36,48] has emerged as a core methodology for addressing the collaborative policy problem in AC-SLAM, offering a more efficient and scalable solution through adaptive learning mechanisms. Table 3 summarizes the validation status of each framework component, revealing that while individual modules—especially in the estimation layer—have reached high maturity, their systematic integration into complete AC-SLAM pipelines remains an open challenge.

3. Learning-Driven Multi-Robot Active SLAM

Within AC-SLAM, the collaborative decision operator C dec is crucial for addressing the structural challenges inherent in Dec-POMDPs. This section provides an overview of representative, learning-driven active SLAM approaches in MRSs.

3.1. The Cognitive Shift in Collaborative Decision-Making

Historically, the “decision” component of AC-SLAM—the collaborative decision operator C dec —has been dominated by information–theoretic heuristics. These classical approaches, rooted in the minimization of Shannon entropy or the reduction of pose-graph covariance (D-optimality) [49,50], view exploration as a purely geometric optimization problem. Agents maximize an information gain metric, I ( x ; z ) , typically defined over a voxelized occupancy grid, to select the next best view. While theoretically rigorous, these methods inherently lack “spatial intuition”. A classical frontier-based explorer [51] treats a doorway leading to a vast hall and a dead-end alcove as mathematically identical if their immediate frontier information gain is equivalent. It cannot predict, infer, or reason about the unseen environment based on learned structural priors.
The integration of deep learning and, specifically, DRL promises to bridge this gap. By enabling agents to learn from experience, the field moves from purely reactive exploration to predictive exploration. In MRSs, this capability is amplified. Learning-driven agents can implicitly learn coordination policies, such as dynamic role allocation or spatial partitioning, without the prohibitive communication overhead required to share dense metric maps for centralized optimization.

3.2. End-to-End Deep Reinforcement Learning

End-to-End (E2E) Deep Reinforcement Learning methods [52,53,54] represent a direct application of learning to the SLAM problem. Driven by the hypothesis that a single differentiable function could map raw high-dimensional sensor inputs directly to optimal low-level control commands, E2E methods attempt to collapse the traditional perception–planning–control pipeline into a unified neural architecture.

3.2.1. Unified Architectures and Perception Backbones

In the monolithic E2E paradigm, the robot is modeled as an agent interacting with a partially observable environment. The policy π θ ( a t | o t ) is parameterized by a deep neural network, where o t represents the raw observation and a t represents the motor actuation. The perception backbone of these systems typically relies on CNNs to extract geometric and textural features. For instance, early works utilized standard architectures like ResNet or custom shallow CNNs to process depth images, encoding them into a latent feature vector z t [55]. However, SLAM is inherently a temporal process; a single frame is insufficient to resolve state ambiguity or estimate motion. Consequently, these feature vectors are invariably fed into Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks. The LSTM’s hidden state h t acts as a “neural memory,” implicitly encoding the robot’s belief state regarding its pose and the map structure. This memory allows the agent to exhibit behaviors like backtracking or obstacle avoidance without explicitly maintaining a geometric map [52,56,57].
Recent advancements have integrated Attention Mechanisms and Transformer blocks into this backbone [58]. Unlike LSTM, which suffers from forgetting over long horizons, Transformer-based encoders allow the agent to attend to specific historical observations that are relevant to the current context. For example, the agent can recall a visual landmark seen minutes ago to re-orient itself, a process akin to loop closure.
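The following PyTorch sketch illustrates the kind of recurrent perception backbone described above: a small CNN encodes each depth frame into a feature z_t, an LSTM carries the implicit belief state h_t, and actor–critic heads emit the action distribution and value. Layer sizes and the input format are illustrative rather than drawn from any specific surveyed system.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """CNN encoder + LSTM memory + actor-critic heads (schematic E2E backbone)."""
    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)   # actor (e.g., trained with PPO)
        self.value_head = nn.Linear(hidden, 1)            # critic

    def forward(self, depth_seq, hidden_state=None):
        # depth_seq: (batch, time, 1, H, W) sequence of depth images.
        b, t = depth_seq.shape[:2]
        z = self.encoder(depth_seq.flatten(0, 1)).view(b, t, -1)   # per-frame features z_t
        h, hidden_state = self.lstm(z, hidden_state)               # neural memory h_t
        return self.policy_head(h), self.value_head(h), hidden_state

# logits, values, state = RecurrentPolicy(n_actions=4)(torch.zeros(2, 8, 1, 64, 64))
```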

3.2.2. Rewards and Learning

The training of these networks is conducted via Reinforcement Learning algorithms such as Proximal Policy Optimization (PPO) [59], Soft Actor–Critic (SAC) [60], or Deep Deterministic Policy Gradient (DDPG) [61]. The “intelligence” of the system is shaped entirely by the reward function R ( s , a ) . In active SLAM, the reward is often composite and dense, designed to guide the agent through the sparse-reward landscape of exploration.
A typical reward structure includes the following (a minimal sketch of such a composite reward is given after the list):
  • Intrinsic Exploration Reward ( r e x p ): Proportional to the number of newly visited grid cells or the reduction in map entropy. This drives the fundamental map-building behavior.
  • Obstacle Avoidance Penalty ( r c o l ): A negative reward for collisions or proximity to obstacles, ensuring safety.
  • Smoothness Penalty ( r s m o o t h ): Penalties for jerky movements or rapid oscillations in angular velocity, which are detrimental to odometry estimation and map consistency.
  • Coordination Reward ( r c o o r d ): In multi-robot settings, a penalty is often applied for overlap with other agents’ trajectories or sensor footprints, encouraging dispersion and reducing redundant coverage.
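Below is a minimal sketch of such a composite reward; the weights and input signals are illustrative placeholders rather than values reported in any cited work.

```python
def composite_reward(new_cells, entropy_drop, collided, jerk, overlap_area,
                     w=(1.0, 0.5, 5.0, 0.1, 0.3)):
    """Dense composite reward combining the four terms described above."""
    w_exp, w_ent, w_col, w_smooth, w_coord = w
    r_exp = w_exp * new_cells + w_ent * entropy_drop   # intrinsic exploration reward
    r_col = -w_col if collided else 0.0                # obstacle-avoidance penalty
    r_smooth = -w_smooth * jerk                        # smoothness penalty
    r_coord = -w_coord * overlap_area                  # coordination (anti-overlap) penalty
    return r_exp + r_col + r_smooth + r_coord

# e.g., composite_reward(new_cells=12, entropy_drop=0.8, collided=False, jerk=0.2, overlap_area=3.0)
```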
The appeal of this approach lies in its theoretical elegance. By backpropagating gradients from the reward signal all the way to the convolutional filters, the feature extractor learns to identify exactly those visual cues that are relevant for navigation. It might learn to recognize “doorways” or “corridors” not because it was trained on labeled datasets of doors but because passing through them consistently leads to high exploration rewards [62]. This eliminates the error accumulation seen in modular pipelines, where a noisy map update might cause a valid plan to fail.

3.2.3. Challenges in Multi-Agent Scalability

Despite their conceptual purity, monolithic E2E methods face severe hurdles when scaled to MRSs [63]. The joint state space of N robots grows exponentially, making the learning of a joint policy intractable. To address this, the field has largely adopted Centralized Training with Decentralized Execution (CTDE) frameworks [64], such as Multi-Agent PPO (MAPPO) [36]. Standardized benchmarking libraries like BenchMARL [65] have emerged to facilitate reproducible comparison across MARL algorithms. In CTDE, a centralized “critic” network has access to the global state (the positions and observations of all robots) during training, allowing it to estimate the value function accurately. The individual “actor” networks, however, rely only on local observations, ensuring that the deployed system remains decentralized.
However, even with CTDE, the Credit Assignment Problem remains acute. Methods such as COMA [66] address this by using counterfactual baselines to isolate each agent’s contribution, yet challenges persist in complex scenarios. If a team of three robots fails to map a complex building efficiently, is it because Robot A chose a poor path, Robot B failed to cover its sector, or Robot C collided? A monolithic network struggles to disentangle these causal factors from a scalar reward signal. Furthermore, E2E policies are notoriously brittle; a network trained in a simulated “Gibson” environment often fails catastrophically when transferred to a real-world office due to subtle differences in lighting or texture [67,68].
While End-to-End DRL offers a compelling vision of “tabula rasa” learning, where agents derive complex behaviors from raw experience, the analysis suggests that it acts as a local optimizer rather than a global strategist. The requirement for the network to simultaneously learn perception, mapping, and planning places an enormous burden on the optimization process. This results in extreme sample inefficiency, often requiring millions of interaction steps to learn basic navigation that a classical geometric algorithm could perform instantly. Furthermore, the lack of interpretability in monolithic networks is a significant barrier to deployment. In safety-critical AC-SLAM applications, the inability to guarantee that a neural network will not hallucinate a path through a solid wall is a disqualifying factor. Consequently, E2E methods can be viewed as foundational research into neural representations, but in real-world, large-scale, multi-robot deployment, their performance is usually inferior to modular or hierarchical methods.
Computational Complexity Analysis
The scalability of MARL approaches is fundamentally constrained by the exponential growth of joint state and action spaces. For N agents with individual state space $|S|$ and action space $|A|$, a naive centralized approach faces joint spaces of size $|S|^N$ and $|A|^N$, making exact solutions intractable for $N > 3$. Table 4 summarizes the complexity characteristics of different paradigms:
  • Independent learners avoid the exponential blowup by treating other agents as environment dynamics, achieving O ( 1 ) per-agent training and inference at the cost of non-stationarity.
  • CTDE methods (QMIX, VDN) maintain tractable per-agent inference O ( | A | ) while requiring O ( N ) mixing network forward passes during centralized training.
  • Hierarchical approaches reduce decision frequency through temporal abstraction, with high-level policies operating at O ( T / k ) frequency where k is the option length.
  • Communication-based methods introduce O ( N · | M | ) per-agent message complexity for fully-connected topologies, where | M | is the message size.
For multi-robot SLAM specifically, the estimation backend adds $O(n^{1.5})$ complexity for sparse Cholesky factorization, where n is the number of pose variables, dominating computation in long-horizon missions [40].

3.3. Hierarchical Policy and Spatial Abstraction

To overcome the limitations of monolithic architectures, decomposing the decision-making process into distinct hierarchical levels is essential.

3.3.1. Policy Hierarchy

Policy hierarchy addresses the frequency mismatch between strategic decision-making and motor control. It is theoretically grounded in the Options Framework and Semi-Markov Decision Processes (SMDPs) [70], where actions (options) are not instantaneous but persist over variable durations. In active SLAM, typical hierarchical architectures include the following [71]:
  • High-level Policy: Runs at a lower frequency to select long-term sub-goals in an abstract state space (e.g., navigate to the corridor end node). Its action space is a discrete set of graph nodes [62].
  • Low-level Control: Runs at a higher frequency to generate specific motion commands (linear velocity v, angular velocity ω ) that move the robot toward the sub-goals. This layer also handles obstacle avoidance and smooth control execution.
This approach is often regarded as a form of temporal abstraction, which involves encapsulating a sequence of continuous, fine-grained “atomic actions” into a high-level action that spans a duration, enabling the agent to make more macro-level decisions. Additionally, due to the involvement of multiple agents, learning roles for task allocation can also achieve higher-quality decision-making.
Some works have demonstrated the advantages of hierarchical architectures on single agents [72,73,74], and some works have already applied hierarchical architectures to MRSs. ROMA [75] proposes a framework for dynamic role emergence, where the hierarchy is established between a latent role embedding and individual policies. In this architecture, a high-level role encoder maps local observations to stochastic role embeddings, which then condition the low-level utility networks. To ensure this hierarchical abstraction is semantically meaningful, ROMA introduces specialized regularizers that force roles to be identifiable from long-term behavioral trajectories and distinct in their responsibilities, effectively allowing agents to specialize dynamically as the task evolves.
RODE [76] explicitly structures the policy hierarchy by decomposing the joint action space based on action effects. It adopts a bi-level learning hierarchy similar to the options framework: a high-level Role Selector operates at a lower temporal frequency to assign specific roles, which are defined as restricted subsets of the primitive action space. A low-level Role Policy then runs at a high frequency to select primitive actions within these restricted subsets. This method not only reduces the search space by clustering actions with similar effects but also realizes efficient temporal abstraction, enabling scalable coordination in large-scale scenarios.
Ref. [77] proposes a Multi-Agent Hierarchical Deep Deterministic Policy Gradient (MH-DDPG) framework. In this framework, the high-level policy controls abstract robots in a simplified environment to rapidly explore and identify optimal intermediate subgoals. These subgoals are then passed to the low-level policy, which governs the actual robots in the physical environment, executing primitive actions to reach the subgoals while managing complex dynamics and avoiding collisions with obstacles and other agents.
By leveraging temporal abstraction and task decomposition, hierarchical architectures effectively mitigate the curse of dimensionality inherent in large-scale environments. These approaches decouple the complexity of long-horizon planning from the dynamics of immediate control.

3.3.2. Spatial Abstraction

Spatial abstraction addresses the “curse of dimensionality” in state representation. A raw grid map for a large facility might contain millions of cells. Feeding this high-dimensional tensor into a DRL agent is computationally prohibitive and prone to overfitting.
Spatial abstraction compresses this continuous Euclidean space into discrete, semantic, or topological structures. Within the architecture of Hierarchical Reinforcement Learning (HRL), an abstract graph G = ( V , E ) serves as a critical bridge connecting the high-level policy with low-level execution [78]. It compresses the continuous, high-dimensional Euclidean space into a discrete, low-dimensional graph structure. The vertex set V represents key locations in the environment—such as intersections, room centers, frontiers, or the robot’s current position—while the edge set E signifies navigability or spatial relationships between nodes.
To handle the non-Euclidean structure of graph data, Graph Neural Networks (GNNs) are widely employed for feature extraction [79,80]. Through a message-passing mechanism, a node v can aggregate information from its neighbors: $h_v^{(k+1)} = \mathrm{Update}\big(h_v^{(k)}, \mathrm{Aggregate}(\{h_u^{(k)} \mid u \in \mathcal{N}(v)\})\big)$. Further details on this principle can be found in works such as [81,82]. This mechanism enables robots to perceive the topological structure of the environment (e.g., dead ends or loops) rather than relying solely on geometric distances.
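The sketch below implements one such message-passing round in plain PyTorch, using mean aggregation over an adjacency matrix and a GRU-style update; the aggregation and update functions are illustrative choices, as the surveyed systems differ in their exact designs.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of h_v^(k+1) = Update(h_v^(k), Aggregate({h_u^(k) : u in N(v)}))."""
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)     # message function applied to neighbor features
        self.update = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj @ self.msg(h)) / deg    # mean-aggregate messages from neighbors
        return self.update(agg, h)         # combine aggregated message with own state

# h_next = MessagePassingLayer(dim=32)(torch.randn(5, 32), torch.ones(5, 5))
```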
The topological graphs formed by extracted features aggregate diverse information for downstream modules [83,84]. For instance, ref. [83] employs a hierarchical clustering algorithm to automatically segment complex environmental maps into distinct regions (e.g., rooms, corridors). The center of each region is represented as a node, and the paths connecting regions are represented as edges, thereby transforming the exploration task into a graph-theoretic problem. Subsequently, a model combining Graph Convolutional Networks (GCNs) and Deep Q-Networks (DQNs)—termed MAG-DQN—is utilized to learn how to assign exploration targets (i.e., nodes) to each agent. This model leverages the topological map structure to enable more informed decision-making. Similarly, ref. [84] utilizes a GNN-based approach with imitation learning for path planning, achieving a comparable information aggregation effect.
Ref. [85] proposes a Neural Topological Mapping framework that constructs a dynamic topological graph containing two types of nodes: main nodes, representing key locations in explored areas generated by visual encoders and distance heuristics, and ghost nodes, representing entrances to unexplored areas (i.e., frontiers) attached to the main nodes. The high-level policy utilizes a GNN to process this topology. The GNN not only aggregates structural information from explored areas but also predicts the potential value (i.e., long-term coverage) of ghost nodes, subsequently outputting a ghost node as the global target. The low-level policy then employs heuristics to navigate to the selected node.
Ref. [86] proposes a three-layer planning framework. The abstract layer identifies frontiers in free space to construct a sparse graph representation of the environment. The decision layer employs a Multi-GNN to process this graph, using policy-based RL to compute affinities between robots and frontier points. The execution layer assigns robots to frontier targets and performs local path planning via heuristic methods. Ref. [12] notes that small UAVs typically carry cameras with limited FOV, introducing directional challenges where the robot must decide not only where to go (position) but also where to look (orientation). Consequently, the nodes sampled in the environment are enriched with additional information such as relative position, utility, and occupancy status. An action policy is then inferred through a policy network based on Graph Attention Networks (GAT). The robot then selects an optimal orientation based on information gain.
Table 5 compares representative multi-robot active SLAM methods that use either hierarchical architectures or spatial abstraction techniques.
To complement the method-level comparison in Table 5 and the complexity analysis in Table 4, the key characteristics of MARL paradigms for AC-SLAM deployment are briefly summarized below. Value decomposition methods (e.g., QMIX [48]) require global state access during training and currently see limited adoption in AC-SLAM. Actor–Critic CTDE approaches (e.g., MAPPO [36], MADDPG [64]) offer better scalability (10–100 agents) and are emerging in multi-robot coordination research. Communication-based methods enable learned message passing but remain primarily in the research stage for SLAM applications. Fully decentralized approaches (e.g., independent PPO [59]) scale to large teams (>100 agents) and are commonly used in practice due to their simplicity. Hierarchical MARL (e.g., RODE [76], ROMA [75]) shows promise by combining role-based coordination with high scalability, though AC-SLAM integration remains an active research direction.

3.4. The Robustness Gap

The most significant liability of learning-based methods is the sim-to-real gap, which is discussed in the next section.
Monolithic E2E policies trained in simulators often see success rates drop precipitously when deployed. Studies of E2E visual navigation report success rates falling from 77% in simulation to 23% in real-world trials [87]. The domain shift in visual texture (lighting, shadows, camera noise) renders the learned perception features useless.
In contrast, modular learning approaches (using classical SLAM for mapping and pose estimation, and learning only for high-level decision-making on a spatial abstraction) exhibit better robustness. Real-world success rates for modular approaches can reach up to 80–90% in the same experiments, comparable to classical methods [87]. This validates the hierarchical taxonomy: rely on robust classical engineering for perception and mapping, and use learning for the policy.

4. Sim-to-Real Transfer Pathways

In multi-robot active SLAM, the core challenge is not only algorithm design. Another critical difficulty is how to learn effective policies in a safe, controlled simulation and then seamlessly transfer them to the uncertainty-laden real world. While numerous simulation platforms dedicated to agent training currently exist, such as [88,89], policies trained in simulation often experience performance degradation when transferred to real environments. The reason is that real-world conditions introduce uncertainties in perception, modeling, and communication, leading to several gaps between simulation and reality. These sim-to-real gaps can be categorized into four main aspects:
  • Non-stationarity in Action Execution: In simulation, robot actions are idealized to be instantaneous, synchronous, and of fixed step size. However, in the real world, physical constraints and network latency cause action execution to experience delays and jitter. This variability at the execution level results in the robot’s actual responses deviating from the commands issued in simulation.
  • Non-stationarity in Reward Function: Simulation-based training typically relies on a fixed reward function. However, the real environment is dynamic; previously defined reward weights or environmental features may no longer apply, and new factors can emerge. This non-stationarity of the reward function means that a policy optimal in simulation may not remain optimal in reality.
  • Environmental Uncertainty in State Perception: Simulation environments are highly simplified and often ignore complex physical factors like friction, slopes, and lighting variations. This leads to uncertainty in the state transition model. In other words, the robot’s perception and prediction of its own state and the environment in the real world will deviate from those in simulation.
  • Technical Uncertainty in Observation and Communication: Simulation assumes precise sensor observations, zero-latency communication, and no packet loss. In reality, sensor data is often accompanied by noise and information loss, and multi-robot communication faces issues such as bandwidth limitations, delays, and out-of-order delivery. This uncertainty in the perception and communication links causes the quality of information obtained by the robot during task execution to be significantly lower than in the simulation environment.
The stages at which these gaps emerge are illustrated in Figure 2.
Therefore, additional training strategies should be introduced to ensure that policies trained in simulation can be effectively deployed in real-world environments.

4.1. Domain Randomization

Domain randomization serves as the most universal cornerstone technology for addressing the sim-to-real gap. Its core hypothesis is that if the simulation environment possesses sufficiently high diversity, the real world can be viewed merely as a single sample within the simulation distribution. By introducing randomized perturbations to various parameters during the training process, the model is forced to learn robust domain-invariant features that remain constant despite environmental variations [90].
Ref. [91] leverages thousands of parallel GPU environments to perform massive-scale visual domain randomization over lighting conditions, object materials, camera intrinsics, image noise, and even sensor latency. In the context of active SLAM, such large-scale randomization not only helps the robot identify environmental features but also suggests potential benefits for the robustness of visual odometry under drastic lighting changes. Experiments show that policies trained via this large-scale DR approach can be deployed directly on humanoid robots to perform complex long-horizon manipulation tasks without any fine-tuning on real data, demonstrating the efficacy of domain randomization when computational power is abundant.
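A minimal sketch of this idea is given below, randomizing brightness, additive pixel noise, and actuation latency once per episode; the parameter ranges are illustrative assumptions and not those used in [91].

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_randomization():
    """Draw one set of perturbation parameters per training episode."""
    return {
        "brightness": rng.uniform(0.6, 1.4),   # global illumination scale
        "noise_std":  rng.uniform(0.0, 0.05),  # additive pixel noise
        "latency":    rng.integers(0, 3),      # action delay in control steps
    }

def randomize_observation(img, params):
    """Perturb the rendered image so the policy cannot overfit to one appearance."""
    out = img * params["brightness"] + rng.normal(0.0, params["noise_std"], img.shape)
    return np.clip(out, 0.0, 1.0)

class DelayedActuator:
    """Inject execution latency: commands take effect `latency` steps later."""
    def __init__(self, latency, default_action=0):
        self.queue = [default_action] * latency

    def step(self, action):
        self.queue.append(action)
        return self.queue.pop(0)

params = sample_randomization()
obs = randomize_observation(rng.random((64, 64, 3)), params)
actuator = DelayedActuator(int(params["latency"]))
executed = actuator.step(action=1)
```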

4.2. Domain Adaptation

Unlike domain randomization, which attempts to “encompass” the real distribution by expanding the simulation distribution, domain adaptation aims to “explicitly adapt” knowledge learned in simulation to a specific real-world environmental distribution. This typically requires a small amount of real-world data as guidance.
One pixel-level adaptation approach utilizes Generative Adversarial Networks (GANs) to transform simulation images into a realistic style, or real images into a canonical simulation style. The advantage of this method lies in preserving the geometric consistency of image content while altering only the appearance style. RL-CycleGAN [92] is a seminal work in this field, introducing a task-consistency loss to ensure that the image translation process does not alter the semantic information of the scene (such as obstacle positions), which is crucial for SLAM. In active SLAM, robots can conduct large-scale training in simulation environments with simple textures to learn exploration policies. Upon real-world deployment, a lightweight GAN converts complex footage from real cameras into simulation-style images in real time, and these translated images are fed into the policy network.
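The deployment-time use of such a translator can be sketched as follows, with an untrained placeholder generator standing in for a trained CycleGAN-style network and a toy linear policy; module shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CanonicalStyleGenerator(nn.Module):
    """Placeholder real-to-canonical translator (stands in for a trained generator)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

generator = CanonicalStyleGenerator().eval()                      # frozen at deployment
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 4))   # trained on canonical images

real_frame = torch.rand(1, 3, 64, 64)          # camera image from the real domain
with torch.no_grad():
    canonical = generator(real_frame)          # appearance changed, geometry preserved
    action_logits = policy(canonical)          # policy only ever sees the canonical style
```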
Another approach, feature-level adaptation, does not process image pixels directly but instead aligns the distributions of simulation and real data within the feature space. Ref. [93] proposes an invariance regularization method, adding a constraint to the reinforcement learning objective that encourages the policy network to keep its feature outputs invariant to visual changes that do not affect action selection (e.g., wallpaper color, floor patterns). Essentially, this method guides the neural network to learn “what information is truly important for navigation.” For MRSs equipped with diverse sensors, ref. [94] proposes using optimal transport theory to align point cloud and image features between simulation and reality. During the sim-to-real process, this method can be used to align local map features generated by different robots (or even different sensor configurations). For example, a LiDAR-based map fusion policy trained in simulation can be transferred via feature alignment to real robots equipped only with depth cameras, facilitating collaborative mapping by heterogeneous MRSs in real environments.
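A minimal sketch of the feature-level idea, assuming a simple random color shift as the task-irrelevant visual change and an illustrative regularization weight of 0.1 (this is not the exact formulation of [93]):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # shared perception features

def task_irrelevant_augment(obs):
    """Perturbation that should not change the chosen action (e.g., a color shift)."""
    color_shift = torch.rand(1, 3, 1, 1) * 0.2 - 0.1
    return (obs + color_shift).clamp(0.0, 1.0)

obs = torch.rand(8, 3, 32, 32)
feat = encoder(obs)
feat_aug = encoder(task_irrelevant_augment(obs))

invariance_loss = (feat - feat_aug).pow(2).mean()   # encourage identical features
rl_loss = torch.tensor(0.0)                         # stand-in for the usual policy loss
total_loss = rl_loss + 0.1 * invariance_loss        # regularized training objective
```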

4.3. Real-to-Sim

Traditional sim-to-real techniques often rely on handcrafted simulation scenarios (e.g., block worlds in Gazebo or manually modeled indoor environments), which are a primary source of the visual gap. Recently, with the advent of Neural Radiance Fields (NeRFs) [95] and 3D Gaussian Splatting (3DGS) [96], real environments can be reconstructed inside the simulator with high fidelity, making it possible to replicate the real world in simulation and thereby largely eliminate the visual gap at its source.
The Vid2Sim framework proposed by [97] represents the state of the art in this direction. This framework is capable of directly reconstructing geometrically consistent 3D scenes from real-world monocular videos and generating simulation environments equipped with interactive physical properties. Agents perform visual navigation training within this highly realistic “Digital Twin” environment. Experimental data indicates that compared to training in traditional synthetic environments (such as Gibson or Matterport3D), agents trained using Vid2Sim-generated environments achieve a 68.3% improvement in navigation success rates in the real world. Its core advantage lies in not only reconstructing photorealistic visual appearances but also recovering physical geometry that supports collision detection, enabling robots to experience authentic physical interactions during simulation.
Similarly, EmbodiedSplat [98] leverages the efficient rendering characteristics of 3DGS. This method requires only videos captured by consumer-grade mobile phones (e.g., an iPhone) to rapidly reconstruct high-fidelity real scenes for RL training. Unlike the implicit representation of NeRFs, 3DGS supports real-time explicit rendering, which drastically improves RL training efficiency. Experiments demonstrate that fine-tuning policies within EmbodiedSplat-reconstructed environments yields extremely high sim-vs.-real correlation (0.87–0.97). This implies that a policy performing better in simulation is very likely to perform better in the real world, addressing the critical pain point of simulation evaluation disconnecting from real-world performance.
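For readers less familiar with this metric, the short example below computes a sim-vs.-real correlation as the Pearson correlation between simulated and real success rates across several policy checkpoints; the numbers are hypothetical and only illustrate how a high correlation allows simulation to rank policies reliably.

```python
import numpy as np

# Hypothetical success rates of several policy checkpoints, evaluated in the
# reconstructed simulation and on the physical robot (illustrative numbers only).
sim_success  = np.array([0.42, 0.55, 0.63, 0.71, 0.80, 0.88])
real_success = np.array([0.35, 0.50, 0.57, 0.66, 0.74, 0.83])

r = np.corrcoef(sim_success, real_success)[0, 1]
print(f"sim-vs-real correlation: {r:.2f}")  # values near 1.0 mean the simulation ranking transfers
```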
Evidently, while domain randomization remains the cornerstone for low-cost cold starts, real-to-sim technologies based on NeRFs and 3DGS are rapidly emerging. Scanning real environments with low-cost devices to generate high-fidelity simulations has become an effective methodology for enhancing the sim-to-real success rate of visual navigation policies.

4.4. Robustness in Communication and Distributed Coordination

In MRSs, communication conditions are another critical aspect requiring consideration. Wireless communication in real-world environments is subject not only to bandwidth constraints but also to stochastic packet loss, latency, and dynamically changing communication topologies.
To address communication instability, ref. [99] proposed a “Message Dropout” mechanism for MARL training. Messages from teammates are randomly masked during training, forcing agents to learn to predict teammate intentions and positions from historical information. This perturbation-based training enables policies to maintain a degree of collaborative consistency when encountering communication blind spots during real-world deployment, preventing stagnation or redundant exploration.
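A minimal sketch of message dropout during training is shown below, where randomly selected teammate messages are zeroed to emulate lost packets; the drop probability and message encoding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_message_dropout(messages, drop_prob=0.3):
    """Randomly zero teammate messages during training so the policy learns to
    fall back on its own history when communication fails at deployment time."""
    kept = []
    for msg in messages:
        if rng.random() < drop_prob:
            kept.append(np.zeros_like(msg))   # simulate a lost packet
        else:
            kept.append(msg)
    return kept

teammate_msgs = [rng.random(8) for _ in range(3)]   # e.g., encoded positions/intents
policy_input_msgs = apply_message_dropout(teammate_msgs, drop_prob=0.3)
```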
Ref. [100] explicitly introduces a two-tier communication mechanism for multi-robot exploration under constrained communication. A low-rate channel is used to share lightweight robot position information, while a high-rate channel is dedicated to exchanging dense map data. By imposing severe bandwidth limitations and disconnections during training in simulation, the RL policy learns a “rendezvous-based” collaboration mode in which robots tend to merge maps only when physically close enough to ensure successful data transmission.
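This rendezvous-based behavior can be caricatured by a simple exchange rule: lightweight poses are always shared, whereas dense submaps are exchanged only within an assumed proximity threshold that serves as a proxy for a reliable high-rate link; the threshold value is purely illustrative.

```python
import numpy as np

RENDEZVOUS_RANGE_M = 5.0   # assumed proximity threshold for a reliable high-rate transfer

def plan_exchange(pose_a, pose_b):
    """Low-rate channel: always share poses. High-rate channel: exchange dense
    submaps only when the robots are close enough for the transfer to succeed."""
    dist = float(np.linalg.norm(np.asarray(pose_a) - np.asarray(pose_b)))
    return {"share_pose": True, "share_submap": dist <= RENDEZVOUS_RANGE_M}

print(plan_exchange((0.0, 0.0), (3.0, 2.0)))    # {'share_pose': True, 'share_submap': True}
print(plan_exchange((0.0, 0.0), (30.0, 2.0)))   # {'share_pose': True, 'share_submap': False}
```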
Beyond policy-level adaptation, low-level distributed SLAM algorithms must also be optimized for real network characteristics. Ref. [101] presents D2SLAM, a decentralized and distributed SLAM system designed specifically for aerial swarms. To bridge the sim-to-real gap, its front end adaptively manages near-field and far-field estimation, while its back end employs an asynchronous distributed ADMM solver. This asynchronous design gives the system extremely high tolerance to communication latency in real networks; even if packets from some robots arrive late, they do not block the optimization of the entire system. Real-world drone swarm experiments have shown that D2SLAM maintains global trajectory consistency even under frequent network fluctuations.
Kimera-Multi [102] adopts a mesh-based metric-semantic SLAM approach. To alleviate bandwidth pressure during the sim-to-real process, the system exchanges only simplified 3D semantic meshes between robots rather than raw point clouds. Furthermore, to cope with false loop closures caused by perceptual aliasing in real environments, Kimera-Multi employs several robust mechanisms (such as extensions of PCM [103] and Graduated Non-Convexity (GNC) [104]) to ensure reliable multi-robot map fusion.
Although communication constraints are frequently discussed in qualitative terms, most surveyed works do not report standardized quantitative parameters, a gap that limits reproducibility and cross-system comparison. Bandwidth in multi-robot systems spans several orders of magnitude: from ultra-low-power UWB links achieving stable throughput of 12.5 kb/s [105] to high-end radios reaching 100 Mbps in ideal conditions [101]. Data payload per exchange varies significantly by representation: 2D/2.5D submaps require 70–72 KB, while 3D submaps reach 1.3 MB [106]; aggressive compression can reduce map transmission to under 1 kb/s with a 99% compression ratio while maintaining localization accuracy within 4 cm [105]. Regarding robustness, asynchronous backends like D2SLAM tolerate unbounded network delays without blocking [101], and Message-Dropout training enables policies to maintain coordination even under severe packet loss [99]. The absence of standardized communication benchmarks for multi-robot SLAM remains an open challenge for the field.
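A back-of-the-envelope calculation with the figures above shows why compression and asynchrony matter in practice; combining numbers reported by different systems in a single example is done purely for illustration.

```python
# Budget sketch combining figures reported above (approximate values from [101,105,106]).
def transfer_seconds(payload_bytes: float, link_bits_per_s: float) -> float:
    return payload_bytes * 8 / link_bits_per_s

submap_3d = 1.3e6   # ~1.3 MB 3D submap
uwb_link = 12.5e3   # 12.5 kb/s ultra-low-power UWB link
fast_link = 100e6   # 100 Mbps high-end radio under ideal conditions

print(f"3D submap over UWB link:     {transfer_seconds(submap_3d, uwb_link):7.1f} s")   # ~832 s
print(f"3D submap over fast radio:   {transfer_seconds(submap_3d, fast_link):7.3f} s")  # ~0.104 s
print(f"99%-compressed submap (UWB): {transfer_seconds(submap_3d * 0.01, uwb_link):7.1f} s")  # ~8.3 s
```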
In summary, no single sim-to-real strategy dominates across all scenarios; for practical multi-robot AC-SLAM deployment, it is recommended to combine these complementary approaches based on available resources and target environment characteristics. This conclusion aligns with the trade-off analysis presented in Table 6.

5. Future Directions

The preceding sections systematically examined learning-based multi-robot active SLAM, tracing the progression from theoretical foundations to practical deployment along with the associated challenges and solutions. This section discusses several directions likely to become focal points for future research.

5.1. From Mapping to Understanding

In recent years, the integration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has been shifting the focus from “Active Mapping” (building a geometric model) to “Active Semantic Understanding” (building a knowledge base).
Many recent works have begun attempting to integrate LLMs into MRSs to accomplish tasks such as high-level task allocation, mid-level motion planning, and low-level action generation [107]. In these frameworks, the robot’s visual observations are often processed by a VLM to generate textual descriptions. This text, combined with the current topological graph state, serves as a prompt for an LLM. The LLM reasons about the exploration strategy using its vast pre-trained “common sense” knowledge.
For example, ref. [108] proposed Hi-Dyna Graph, a hierarchical framework that transcends static geometric representations by anchoring dynamic relationship subgraphs—derived from real-time video streams—onto a global topological map. Unlike traditional approaches that view the environment as a fixed snapshot, this method models the scene as a living knowledge base of persistent layouts and transient human–object interactions. By parsing this unified graph into structured textual prompts, the integrated LLM agent can reason about latent environmental triggers (e.g., inferring a “cleanup” task from a cluttered table state) and generate executable instructions grounded in robotic affordances, thus enabling long-horizon autonomy in rapidly evolving human-centric environments.
RL-GPT [109] introduces a two-level hierarchical framework designed to integrate RL with code-as-policy, addressing the inherent limitations of LLMs in handling intricate logic and achieving precise control in embodied tasks. The framework comprises a “slow agent,” responsible for task decomposition and strategic planning, and a “fast agent,” responsible for code generation. Its core mechanism is to allocate strategies dynamically based on task difficulty: high-level planning amenable to direct coding is translated into executable code, while low-level actions that are difficult to encode (such as specific operational details) are delegated to RL for training. The generated code is embedded into the RL action space as high-level actions, thereby leveraging the complementary strengths of both modalities.
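To make the generic VLM-to-LLM prompting pipeline described at the start of this subsection concrete, the following minimal sketch assembles a textual prompt from a scene caption and a topological graph state; the prompt template, node labels, and the placeholder llm call are illustrative assumptions, not the interface of any cited framework.

```python
def build_exploration_prompt(vlm_caption: str, topo_nodes: list, frontier_nodes: list) -> str:
    """Turn the VLM's scene description and the topological map state into an LLM prompt."""
    graph_state = "; ".join(f"node {n['id']}: {n['label']}" for n in topo_nodes)
    frontiers = ", ".join(str(n) for n in frontier_nodes)
    return (
        "You coordinate a team of exploring robots.\n"
        f"Current view: {vlm_caption}\n"
        f"Known places: {graph_state}\n"
        f"Unexplored frontier nodes: {frontiers}\n"
        "Which frontier node should robot 1 visit next, and why?"
    )

prompt = build_exploration_prompt(
    vlm_caption="a corridor with an open door on the left",
    topo_nodes=[{"id": 0, "label": "lobby"}, {"id": 1, "label": "corridor"}],
    frontier_nodes=[2, 3],
)
# response = llm(prompt)   # placeholder: any chat-completion backend could be substituted here
print(prompt)
```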
These preliminary works indicate a potential convergence between SLAM and high-level reasoning. By leveraging LLMs to interpret the semantics behind geometric changes, active SLAM systems appear to be moving toward more adaptive perception. This suggests a future in which robots autonomously perceive and reason about the evolution of human-centric environments with greater contextual awareness.

5.2. Dynamic Digital Twins

Another promising future direction lies in maintaining a Dynamic Digital Twin that synchronizes with the physical world in real time. This paradigm, recently termed “Real-is-Sim”, seeks to narrow the sim-to-real gap from an alternative direction.
Section 4 discussed EmbodiedSplat, which converts Gaussian point clouds into meshes that robots can interact with, enabling training in more realistic simulated environments. Building on this idea, ref. [110] attempts to further narrow the gap through a dynamic digital twin. Instead of adapting a policy to cope with the complexities of the real world or relying solely on static reconstructions, this approach keeps the simulator active and synchronized during deployment. In this framework, the robot's policy interacts exclusively with the simulator, while the physical robot acts as a “follower” tracking the simulated agent's state. The responsibility for bridging the reality gap is thus shifted from the policy to the simulator's synchronization mechanism.
Because the agent can rehearse aggressive exploration actions in the synchronized twin to evaluate their information gain and safety before physical execution, both the quality and the safety of the resulting policy can be improved. Moreover, the framework's design reduces the cost of sim-to-real transfer by construction, which may enable broader applications in the future.
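The leader-follower structure of this paradigm can be sketched as follows: the policy acts only on a toy digital twin, and periodic measurements pull the twin back toward the real state; the dynamics, the controller, and the synchronization gain are illustrative assumptions rather than the mechanism of [110].

```python
import numpy as np

class TwinSimulator:
    """Minimal stand-in for the synchronized digital twin (kinematics only)."""
    def __init__(self, state):
        self.state = np.asarray(state, dtype=float)

    def step(self, action):
        self.state = self.state + np.asarray(action, dtype=float) * 0.1  # 10 Hz integration
        return self.state

    def resync(self, measured_state, gain=0.5):
        """Pull the twin toward the measured real state to absorb the reality gap."""
        self.state += gain * (np.asarray(measured_state, dtype=float) - self.state)

def policy(sim_state):
    return -0.2 * sim_state                     # placeholder controller acting only on the twin

def read_real_state(sim_state, rng):
    return sim_state + rng.normal(0.0, 0.02, sim_state.shape)  # stand-in for real sensing

rng = np.random.default_rng(0)
twin = TwinSimulator([1.0, -0.5])
for t in range(20):
    target = twin.step(policy(twin.state))      # the policy interacts only with the simulator
    # the physical robot would track `target` via its low-level controller (not modeled here)
    twin.resync(read_real_state(twin.state, rng))  # synchronization bridges the reality gap
```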

6. Conclusions

This study introduced an integrated conceptual framework for Multi-Robot AC-SLAM, modeling the problem as a tightly coupled system comprising a Dec-POMDP decision layer and a distributed factor-graph estimation layer. Through this lens, the state of the art in learning-based methods was systematically reviewed, concluding that hierarchical reinforcement learning and graph-based spatial abstraction are currently the most effective pathways to balance exploration efficiency with the scalability constraints of large-scale teams. While end-to-end approaches offer theoretical elegance, modular designs remain superior in practical deployment due to their interpretability and robustness.
Furthermore, the critical sim-to-real gaps—spanning perception, dynamics, and communication—were analyzed, and transfer strategies ranging from domain randomization to emerging real-to-sim techniques based on NeRFs and 3D Gaussian Splatting were categorized. Looking forward, it is anticipated that the integration of Large Language Models and dynamic digital twins will drive the field from purely geometric mapping toward active semantic understanding, enabling robust collaboration.
Learning-driven multi-robot active SLAM remains a vibrant and challenging interdisciplinary field. Future breakthroughs will depend on a deeper integration of artificial intelligence, robotics, and control theory. With continued exploration of these frontier issues, collaborative systems composed of intelligent robots are expected to evolve into reliable partners for humanity in exploring the unknown, addressing crises, and expanding our capabilities.

Author Contributions

Conceptualization, B.L. and S.D.; methodology, B.L. and S.D.; formal analysis, B.L. and S.D.; investigation, B.L. and S.D.; resources, B.L. and S.D.; data curation, B.L.; writing—original draft preparation, B.L.; writing—review and editing, B.L. and S.D.; visualization, B.L.; supervision, S.D.; project administration, B.L. and S.D.; funding acquisition, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangdong Basic and Applied Basic Research Foundation, grant number 2023A1515140071, and the Science and Technology Innovation Program of Xiongan New Area, grant number 2024XAGG0006.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Chen, W.; Wang, X.; Gao, S.; Shang, G.; Zhou, C.; Li, Z.; Xu, C.; Hu, K. Overview of multi-robot collaborative SLAM from the perspective of data fusion. Machines 2023, 11, 653. [Google Scholar] [CrossRef]
  2. Rosen, D.M.; Doherty, K.J.; Terán Espinoza, A.; Leonard, J.J. Advances in inference and representation for simultaneous localization and mapping. Annu. Rev. Control Robot. Auton. Syst. 2021, 4, 215–242. [Google Scholar] [CrossRef]
  3. Rosinol, A.; Violette, A.; Abate, M.; Hughes, N.; Chang, Y.; Shi, J.; Gupta, A.; Carlone, L. Kimera: From SLAM to spatial perception with 3D dynamic scene graphs. Int. J. Robot. Res. 2021, 40, 1510–1546. [Google Scholar] [CrossRef]
  4. Liu, Y.; Chen, W.; Bai, Y.; Liang, X.; Li, G.; Gao, W.; Lin, L. Aligning cyber space with physical world: A comprehensive survey on embodied AI. IEEE ASME Trans. Mechatron. 2025, 30, 7253–7274. [Google Scholar] [CrossRef]
  5. Sun, F.; Chen, R.; Ji, T.; Luo, Y.; Zhou, H.; Liu, H. A comprehensive survey on embodied intelligence: Advancements, challenges, and future perspectives. Caai Artif. Intell. Res. 2024, 3, 9150042. [Google Scholar] [CrossRef]
  6. Liu, Y.; Liu, L.; Zheng, Y.; Liu, Y.; Dang, F.; Li, N.; Ma, K. Embodied navigation. Sci. China Inf. Sci. 2025, 68, 141101. [Google Scholar] [CrossRef]
  7. Moosavi, S.K.R.; Zafar, M.H.; Sanfilippo, F. Collaborative robots (cobots) for disaster risk resilience: A framework for swarm of snake robots in delivering first aid in emergency situations. Front. Robot. AI 2024, 11, 1362294. [Google Scholar] [CrossRef]
  8. Ebadi, K.; Bernreiter, L.; Biggie, H.; Catt, G.; Chang, Y.; Chatterjee, A.; Denniston, C.E.; Deschênes, S.P.; Harlow, K.; Khattak, S.; et al. Present and future of SLAM in extreme environments: The DARPA SubT challenge. IEEE Trans. Robot. 2024, 40, 936–959. [Google Scholar] [CrossRef]
  9. Placed, J.A.; Strader, J.; Carrillo, H.; Atanasov, N.; Indelman, V.; Carlone, L.; Castellanos, J.A. A Survey on Active Simultaneous Localization and Mapping: State of the Art and New Frontiers. IEEE Trans. Robot. 2023, 39, 1686–1705. [Google Scholar] [CrossRef]
  10. Bernreiter, L.; Khattak, S.; Ott, L.; Siegwart, R.; Hutter, M.; Cadena, C. A framework for collaborative multi-robot mapping using spectral graph wavelets. Int. J. Robot. Res. 2024, 43, 2070–2088. [Google Scholar] [CrossRef]
  11. Huang, Y.; Lin, X.; Englot, B. Multi-robot autonomous exploration and mapping under localization uncertainty with expectation-maximization. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: New York, NY, USA, 2024; Volume 18, pp. 7236–7242. [Google Scholar] [CrossRef]
  12. Chiun, J.; Zhang, S.; Wang, Y.; Cao, Y.; Sartoretti, G. MARVEL: Multi-agent reinforcement learning for constrained field-of-view multi-robot exploration in large-scale environments. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 17–23 May 2025; IEEE: New York, NY, USA, 2025; Volume 20, pp. 11392–11398. [Google Scholar] [CrossRef]
  13. Xu, W.; Chen, Y.; Liu, S.; Nie, A.; Chen, R. Multi-robot cooperative simultaneous localization and mapping algorithm based on sub-graph partitioning. Sensors 2025, 25, 2953. [Google Scholar] [CrossRef]
  14. Chang, Y.; Ebadi, K.; Denniston, C.E.; Ginting, M.F.; Rosinol, A.; Reinke, A.; Palieri, M.; Shi, J.; Chatterjee, A.; Morrell, B.; et al. LAMP 2.0: A robust multi-robot SLAM system for operation in challenging large-scale underground environments. IEEE Robot. Autom. Lett. 2022, 7, 9175–9182. [Google Scholar] [CrossRef]
  15. Lajoie, P.Y.; Ramtoula, B.; Chang, Y.; Carlone, L.; Beltrame, G. DOOR-SLAM: Distributed, online, and outlier resilient SLAM for robotic teams. IEEE Robot. Autom. Lett. 2020, 5, 1656–1663. [Google Scholar] [CrossRef]
  16. Denniston, C.E.; Chang, Y.; Reinke, A.; Ebadi, K.; Sukhatme, G.S.; Carlone, L.; Morrell, B.; Agha-mohammadi, A.A. Loop closure prioritization for efficient and scalable multi-robot SLAM. IEEE Robot. Autom. Lett. 2022, 7, 9651–9658. [Google Scholar] [CrossRef]
  17. Ahmed, M.F.; Maragliano, M.; Frémont, V.; Recchiuto, C.T. Efficient multi-robot active SLAM. J. Intell. Robot. Syst. 2025, 111, 64. [Google Scholar] [CrossRef]
  18. Chen, Y.; Zhao, L.; Lee, K.M.B.; Yoo, C.; Huang, S.; Fitch, R. Broadcast your weaknesses: Cooperative active pose-graph SLAM for multiple robots. IEEE Robot. Autom. Lett. 2020, 5, 2200–2207. [Google Scholar] [CrossRef]
  19. Chang, Y.; Tian, Y.; How, J.P.; Carlone, L. Kimera-multi: A system for distributed multi-robot metric-semantic simultaneous localization and mapping. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: New York, NY, USA, 2021; Volume 12. [Google Scholar] [CrossRef]
  20. Schmuck, P.; Ziegler, T.; Karrer, M.; Perraudin, J.; Chli, M. COVINS: Visual-inertial SLAM for centralized collaboration. In Proceedings of the 2021 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Bari, Italy, 4–8 October 2021; IEEE: New York, NY, USA, 2021; Volume 13. [Google Scholar] [CrossRef]
  21. Cao, H.; Shreedharan, S.; Atanasov, N. Multi-robot object SLAM using distributed variational inference. IEEE Robot. Autom. Lett. 2024, 9, 8722–8729. [Google Scholar] [CrossRef]
  22. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  23. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  24. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual SLAM: From tradition to semantic. Remote Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
  25. Favorskaya, M.N. Deep learning for visual SLAM: The state-of-the-art and future trends. Electronics 2023, 12, 2006. [Google Scholar] [CrossRef]
  26. Ahmed, M.F.; Masood, K.; Fremont, V.; Fantoni, I. Active SLAM: A Review on Last Decade. Sensors 2023, 23, 8097. [Google Scholar] [CrossRef]
  27. Lluvia, I.; Lazkano, E.; Ansuategi, A. Active mapping and robot exploration: A survey. Sensors 2021, 21, 2445. [Google Scholar] [CrossRef]
  28. Lajoie, P.Y.; Ramtoula, B.; Wu, F.; Beltrame, G. Towards collaborative simultaneous localization and mapping: A survey of the current research landscape. Field Robot. 2022, 2, 971–1000. [Google Scholar] [CrossRef]
  29. Wang, C.; Yu, C.; Xu, X.; Gao, Y.; Yang, X.; Tang, W.; Yu, S.; Chen, Y.; Gao, F.; Jian, Z.; et al. Multi-Robot System for Cooperative Exploration in Unknown Environments: A Survey. arXiv 2025, arXiv:2503.07278. [Google Scholar] [CrossRef]
  30. Orr, J.; Dutta, A. Multi-agent deep reinforcement learning for multi-robot applications: A survey. Sensors 2023, 23, 3625. [Google Scholar] [CrossRef]
  31. Queralta, J.P.; Taipalmaa, J.; Pullinen, B.C.; Sarker, V.K.; Gia, T.N.; Tenhunen, H.; Gabbouj, M.; Raitoharju, J.; Westerlund, T. Collaborative Multi-Robot Search and Rescue: Planning, Coordination, Perception, and Active Vision. IEEE Access 2020, 8, 191617–191643. [Google Scholar] [CrossRef]
  32. Lajoie, P.Y.; Hu, S.; Beltrame, G.; Carlone, L. Modeling Perceptual Aliasing in SLAM via Discrete-Continuous Graphical Models. IEEE Robot. Autom. Lett. 2019, 4, 1232–1239. [Google Scholar] [CrossRef]
  33. Feng, D.; Qi, Y.; Zhong, S.; Chen, Z.; Chen, Q.; Chen, H.; Wu, J.; Ma, J. S3E: A Multi-Robot Multimodal Dataset for Collaborative SLAM. IEEE Robot. Autom. Lett. 2024, 9, 11401–11408. [Google Scholar] [CrossRef]
  34. Tian, Y.; Khosoussi, K.; Rosen, D.M.; How, J.P. Distributed certifiably correct pose-graph optimization. IEEE Trans. Robot. 2021, 37, 2137–2156. [Google Scholar] [CrossRef]
  35. Lajoie, P.Y.; Beltrame, G. Swarm-SLAM: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems. IEEE Robot. Autom. Lett. 2024, 9, 475–482. [Google Scholar] [CrossRef]
  36. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; WU, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
  37. Indelman, V.; Carlone, L.; Dellaert, F. Planning in the continuous domain: A generalized belief space approach for autonomous navigation in unknown environments. Int. J. Robot. Res. 2015, 34, 849–882. [Google Scholar] [CrossRef]
  38. Lauri, M.; Pajarinen, J.; Peters, J. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Auton. Agent. Multi. Agent. Syst. 2020, 34, 42. [Google Scholar] [CrossRef]
  39. Dellaert, F.; Kaess, M. Factor graphs for robot perception. Found. Trends Robot. 2017, 6, 1–139. [Google Scholar] [CrossRef]
  40. Kaess, M.; Johannsson, H.; Roberts, R.; Ila, V.; Leonard, J.J.; Dellaert, F. iSAM2: Incremental smoothing and mapping using the Bayes tree. Int. J. Robot. Res. 2012, 31, 216–235. [Google Scholar] [CrossRef]
  41. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar] [CrossRef]
  42. Wang, T.H.; Manivasagam, S.; Liang, M.; Yang, B.; Zeng, W.; Urtasun, R. V2VNet: Vehicle-to-vehicle communication for joint perception and prediction. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 34, pp. 605–621. [Google Scholar]
  43. Hu, Y.; Fang, S.; Lei, Z.; Zhong, Y.; Chen, S. Where2comm: Communication-Efficient Collaborative Perception via Spatial Confidence Maps. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 4874–4886. [Google Scholar]
  44. Li, Y.; Ren, S.; Wu, P.; Chen, S.; Feng, C.; Zhang, W. Learning Distilled Collaboration Graph for Multi-Agent Perception. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 29541–29552. [Google Scholar]
  45. Chen, J.; Wu, Z.; Li, H.; Yang, F.; Xia, L. Robust Loop Closure Selection Based on Inter-Robot and Intra-Robot Consistency for Multi-Robot Map Fusion. Remote Sens. 2023, 15, 2796. [Google Scholar] [CrossRef]
  46. Shi, W.; Ling, Q.; Yuan, K.; Wu, G.; Yin, W. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 2014, 62, 1750–1761. [Google Scholar] [CrossRef]
  47. Choudhary, S.; Carlone, L.; Nieto, C.; Rogers, J.; Christensen, H.I.; Dellaert, F. Distributed mapping with privacy and communication constraints: Lightweight algorithms and object-based models. Int. J. Robot. Res. 2017, 36, 1286–1311. [Google Scholar] [CrossRef]
  48. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
  49. Feder, H.J.S.; Leonard, J.J.; Smith, C.M. Adaptive mobile robot navigation and mapping. Int. J. Robot. Res. 1999, 18, 650–668. [Google Scholar] [CrossRef]
  50. Bourgault, F.; Makarenko, A.; Williams, S.; Grocholsky, B.; Durrant-Whyte, H. Information based adaptive robotic exploration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Lausanne, Switzerland, 30 September–4 October 2002; Volume 1, pp. 540–545. [Google Scholar] [CrossRef]
  51. Yamauchi, B. A frontier-based approach for autonomous exploration. In Proceedings of the 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97. ‘Towards New Computational Principles for Robotics and Automation’, Monterey, CA, USA, 10–11 July 1997; pp. 146–151. [Google Scholar] [CrossRef]
  52. Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A.J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.; Kavukcuoglu, K.; et al. Learning to Navigate in Complex Environments. arXiv 2017, arXiv:1611.03673. [Google Scholar] [CrossRef]
  53. Parisotto, E.; Salakhutdinov, R. Neural Map: Structured Memory for Deep Reinforcement Learning. arXiv 2017, arXiv:1702.08360. [Google Scholar] [CrossRef]
  54. Lin, J.; Yang, X.; Zheng, P.; Cheng, H. End-to-end decentralized multi-robot navigation in unknown complex environments via deep reinforcement learning. In Proceedings of the 2019 IEEE International Conference on Mechatronics and Automation (ICMA), Tianjin, China, 4–7 August 2019; IEEE: New York, NY, USA, 2019; Volume 42. [Google Scholar] [CrossRef]
  55. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
  56. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: New York, NY, USA, 2017; pp. 2043–2050. [Google Scholar]
  57. Clark, R.; Wang, S.; Wen, H.; Markham, A.; Trigoni, N. Vinet: Visual-inertial odometry as a sequence-to-sequence learning problem. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
  58. Cai, Y.; He, X.; Guo, H.; Yau, W.Y.; Lv, C. Transformer-based Multi-Agent Reinforcement Learning for Generalization of Heterogeneous Multi-Robot Cooperation. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 13695–13702. [Google Scholar] [CrossRef]
  59. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  60. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Volume 80, pp. 1861–1870. [Google Scholar]
  61. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  62. Chaplot, D.S.; Gandhi, D.; Gupta, S.; Gupta, A.; Salakhutdinov, R. Learning to Explore using Active Neural SLAM. arXiv 2020, arXiv:2004.05155. [Google Scholar] [CrossRef]
  63. Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
  64. Lowe, R.; WU, Y.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  65. Bettini, M.; Prorok, A.; Moens, V. BenchMARL: Benchmarking Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2024, 25, 1–10. [Google Scholar]
  66. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  67. Devin, C.; Gupta, A.; Darrell, T.; Abbeel, P.; Levine, S. Learning modular neural network policies for multi-task and multi-robot transfer. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2169–2176. [Google Scholar] [CrossRef]
  68. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Virtual Conference, 1–4 December 2020; pp. 737–744. [Google Scholar] [CrossRef]
  69. Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840. [Google Scholar] [CrossRef]
  70. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef]
  71. Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  72. Hafner, D.; Lee, K.H.; Fischer, I.; Abbeel, P. Deep Hierarchical Planning from Pixels. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 26091–26104. [Google Scholar]
  73. Li, J.; Tang, C.; Tomizuka, M.; Zhan, W. Hierarchical Planning Through Goal-Conditioned Offline Reinforcement Learning. IEEE Robot. Autom. Lett. 2022, 7, 10216–10223. [Google Scholar] [CrossRef]
  74. Gürtler, N.; Büchler, D.; Martius, G. Hierarchical Reinforcement Learning with Timed Subgoals. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 21732–21743. [Google Scholar]
  75. Wang, T.; Dong, H.; Lesser, V.; Zhang, C. ROMA: Multi-Agent Reinforcement Learning with Emergent Roles. arXiv 2020, arXiv:2003.08039. [Google Scholar] [CrossRef]
  76. Wang, T.; Gupta, T.; Mahajan, A.; Peng, B.; Whiteson, S.; Zhang, C. RODE: Learning Roles to Decompose Multi-Agent Tasks. arXiv 2020, arXiv:2010.01523. [Google Scholar] [CrossRef]
  77. Setyawan, G.E.; Hartono, P.; Sawada, H. Cooperative multi-robot hierarchical reinforcement learning. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 745–751. [Google Scholar] [CrossRef]
  78. Singh Chaplot, D.; Salakhutdinov, R.; Gupta, A.; Gupta, S. Neural topological SLAM for visual navigation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, 14–19 June 2020; IEEE: New York, NY, USA, 2020; Volume 44. [Google Scholar] [CrossRef]
  79. Savinov, N.; Dosovitskiy, A.; Koltun, V. Semi-parametric Topological Memory for Navigation. arXiv 2018, arXiv:1803.00653. [Google Scholar] [CrossRef]
  80. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar] [CrossRef]
  81. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural Message Passing for Quantum Chemistry. Int. Conf. Mach. Learn. 2017, 70, 1263–1272. [Google Scholar]
  82. Kipf, T. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  83. Luo, T.; Subagdja, B.; Wang, D.; Tan, A.H. Multi-agent collaborative exploration through graph-based deep reinforcement learning. In Proceedings of the 2019 IEEE International Conference on Agents (ICA), Beijing, China, 6–9 July 2019; IEEE: New York, NY, USA, 2019; Volume 49. [Google Scholar] [CrossRef]
  84. Tzes, M.; Bousias, N.; Chatzipantazis, E.; Pappas, G.J. Graph Neural Networks for Multi-Robot Active Information Acquisition. arXiv 2022, arXiv:2209.12091. [Google Scholar] [CrossRef]
  85. Yang, X.; Yang, Y.; Yu, C.; Chen, J.; Yu, J.; Ren, H.; Yang, H.; Wang, Y. Active Neural Topological Mapping for Multi-Agent Exploration. IEEE Robot. Autom. Lett. 2024, 9, 303–310. [Google Scholar] [CrossRef]
  86. Cai, G.; Guo, L.; Chang, X. An Enhanced Hierarchical Planning Framework for Multi-Robot Autonomous Exploration. arXiv 2024, arXiv:2410.19373. [Google Scholar] [CrossRef]
  87. Gervet, T.; Chintala, S.; Batra, D.; Malik, J.; Chaplot, D.S. Navigating to objects in the real world. Sci. Robot. 2023, 8, eadf6991. [Google Scholar] [CrossRef] [PubMed]
  88. Puig, X.; Undersander, E.; Szot, A.; Cote, M.D.; Yang, T.Y.; Partsey, R.; Desai, R.; Clegg, A.W.; Hlavac, M.; Min, S.Y.; et al. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots. arXiv 2023, arXiv:2310.13724. [Google Scholar] [CrossRef]
  89. NVIDIA Isaac Sim. Available online: https://developer.nvidia.com/isaac-sim (accessed on 15 December 2025).
  90. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; Volume 56. [Google Scholar]
  91. He, T.; Wang, Z.; Xue, H.; Ben, Q.; Luo, Z.; Xiao, W.; Yuan, Y.; Da, X.; Castañeda, F.; Sastry, S.; et al. VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation. arXiv 2025, arXiv:2511.15200. [Google Scholar] [CrossRef]
  92. Rao, K.; Harris, C.; Irpan, A.; Levine, S.; Ibarz, J.; Khansari, M. RL-CycleGAN: Reinforcement learning aware simulation-to-real. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, 14–19 June 2020; IEEE: New York, NY, USA, 2020; Volume 58. [Google Scholar]
  93. Béres, A.; Gyires-Tóth, B. Enhancing visual domain randomization with real images for Sim-to-real transfer. Infocommun. J. 2023, 15, 15–25. [Google Scholar] [CrossRef]
  94. Cheng, S.; Ma, L.; Chen, Z.; Mandlekar, A.; Garrett, C.; Xu, D. Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training. arXiv 2025, arXiv:2509.18631. [Google Scholar] [CrossRef]
  95. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2022, 65, 99–106. [Google Scholar] [CrossRef]
  96. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139:1–139:14. [Google Scholar] [CrossRef]
  97. Xie, Z.; Liu, Z.; Peng, Z.; Wu, W.; Zhou, B. Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Vancouver, BC, Canada, 16–21 June 2025; pp. 1581–1591. [Google Scholar]
  98. Chhablani, G.; Ye, X.; Irshad, M.Z.; Kira, Z. EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–10 October 2025; pp. 25431–25441. [Google Scholar]
  99. Kim, W.; Cho, M.; Sung, Y. Message-dropout: An efficient training method for multi-agent deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6079–6086. [Google Scholar] [CrossRef]
  100. Pongsirijinda, K.; Cao, Z.; Pik Lik Lau, B.; Liu, R.; Yuen, C.; Tan, U.X. MEF-explore: Communication-constrained multi-robot entropy-field-based exploration. IEEE Trans. Autom. Sci. Eng. 2025, 22, 16062–16078. [Google Scholar] [CrossRef]
  101. Xu, H.; Liu, P.; Chen, X.; Shen, S. D2SLAM: Decentralized and distributed collaborative visual-inertial SLAM system for aerial swarm. IEEE Trans. Robot. 2024, 40, 3445–3464. [Google Scholar] [CrossRef]
  102. Tian, Y.; Chang, Y.; Herrera Arias, F.; Nieto-Granda, C.; How, J.; Carlone, L. Kimera-multi: Robust, distributed, dense metric-semantic SLAM for multi-robot systems. IEEE Trans. Robot. 2022, 38, 2022–2038. [Google Scholar] [CrossRef]
  103. Mangelson, J.G.; Dominic, D.; Eustice, R.M.; Vasudevan, R. Pairwise consistent measurement set maximization for robust multi-robot map merging. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; Volume 70. [Google Scholar] [CrossRef]
  104. Yang, H.; Antonante, P.; Tzoumas, V.; Carlone, L. Graduated non-convexity for robust spatial perception: From non-minimal solvers to global outlier rejection. IEEE Robot. Autom. Lett. 2020, 5, 1127–1134. [Google Scholar] [CrossRef]
  105. Zhang, L.; Deng, J. Deep Compressed Communication and Application in Multi-Robot 2D-Lidar SLAM: An Intelligent Huffman Algorithm. Sensors 2024, 24, 3154. [Google Scholar] [CrossRef]
  106. Han, J.; Ma, C.; Zou, D.; Jiao, S.; Chen, C.; Wang, J. Distributed Multi-Robot SLAM Algorithm with Lightweight Communication and Optimization. Electronics 2024, 13, 4129. [Google Scholar] [CrossRef]
  107. Li, P.; An, Z.; Abrar, S.; Zhou, L. Large Language Models for Multi-Robot Systems: A Survey. arXiv 2025, arXiv:2502.03814. [Google Scholar] [CrossRef]
  108. Hou, J.; Xue, X.; Zeng, T. Hi-Dyna Graph: Hierarchical Dynamic Scene Graph for Robotic Autonomy in Human-Centric Environments. arXiv 2025, arXiv:2506.00083. [Google Scholar] [CrossRef]
  109. Liu, S.; Yuan, H.; Hu, M.; Li, Y.; Chen, Y.; Liu, S.; Lu, Z.; Jia, J. RL-GPT: Integrating Reinforcement Learning and Code-as-policy. In Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 28430–28459. [Google Scholar] [CrossRef]
  110. Abou-Chakra, J.; Sun, L.; Rana, K.; May, B.; Schmeckpeper, K.; Suenderhauf, N.; Minniti, M.V.; Herlant, L. Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin. arXiv 2025, arXiv:2504.03597. [Google Scholar] [CrossRef]
Figure 1. Conceptual system framework of AC-SLAM: A closed-loop architecture encompassing Physical World–Perception–Estimation–Decision–Control.
Figure 2. The sim-to-real gap.
Table 1. Comparison of single-robot SLAM and multi-robot collaborative SLAM.
Dimension | Single-Robot SLAM | Multi-Robot Collaborative SLAM | Evidences
Exploration Efficiency/Coverage | Serial exploration; coverage time scales with environment size; limited sensing range/FoV. | Parallel exploration reduces time to completion and improves coverage; requires task allocation and mitigation of overlap/conflicts. | [7,17,18]
Localization Accuracy | Relies on intra-robot loop closure; prone to drift over long trajectories; lacks inter-robot constraint correction. | Inter-robot loop closure and joint optimization significantly reduce drift and improve global consistency; relies on robust inter-robot data association. | [14,19,20,21]
Robustness/Fault Tolerance | Single point of failure; limited recovery methods during sensor degradation, occlusion, or dynamic interference. | Redundant observations and multi-path loop closures enhance robustness; network splitting or outliers may lead to inconsistency or erroneous fusion. | [14,15,16,19]
Computation/Scalability | Centralized front/back-end; large-scale or high-frequency data constrained by onboard compute and memory. | Computation can be distributed or server-aided; however, global optimization, verification, and map fusion can become bottlenecks. | [16,17,20,21]
Communication Load | No inter-robot communication (local bus/storage only). | Requires exchange of keyframes/submaps/constraints; significantly impacted by bandwidth, latency, and disconnection. | [15,18,19,20]
System Complexity | Relatively simple architecture (no inter-robot data association, alignment, or consistency maintenance). | Requires time synchronization, coordinate initialization/alignment, inter-robot outlier rejection, and consistency management. | [14,16,19]
Note: Based on multi-robot SLAM literature with experimental metrics from 2020 to 2025.
Table 2. Representative Surveys of Four SLAM Categories (2020–2025).
Feature | Single/Passive SLAM | Single/Active SLAM | Multi/Passive SLAM | Multi/Active SLAM
Core Objective | Estimate (X, M) only: high-precision localization and mapping | Joint decision and estimation: select a to obtain (X, M) faster/more accurately | Fuse multi-robot Z_{1:N} to estimate global (X_{1:N}, M) | Collaborative decision and distributed estimation: jointly select a_{1:N} to maximize info gain and map consistency
Math. Essence | State estimation/optimization (filtering, BA/factor graph) | POMDP/info-theoretic planning + SLAM backend | Distributed/centralized estimation (factor graph, map fusion) | Dec-POMDP/multi-agent collaboration + distributed estimation
RL Relevance | Low–Med: partially used for learned front-end/loop closure | Med–High: used for view selection and policy learning in exploration tasks | Low–Med: used for communication scheduling/task allocation | High: MARL for collaborative policy; strongly coupled with SLAM backend
Key Challenges | Degraded scenes, dynamic environments, long-term drift | High cost of non-myopic planning; balancing efficiency/safety/real-time | Inter-robot loop closure, map fusion, bandwidth limits, scalability | Collaborative policy design, sparse comms/sensing, robustness, decentralization, false loop closures
Surveys | [23,24,25] | [9,26,27] | [1,28] | [9,26,29,30]
Table 3. Validation status of AC-SLAM framework components.
Component | Theory | Sim | 1-R | N-R | Bench | Benchmarks/Evidences
Estimation Layer (Section 2.2.2 and Section 2.2.3)
Factor Graph Backend | KITTI, EuRoC, TUM RGB-D [8]
Distributed PGO | Custom pose graphs [34,35]
Loop Closure | Place recognition [15]
Decision Layer (Section 2.2.4)
Frontier Planning | Standard grid worlds
Info-gain Planning | Active SLAM metrics [9]
MARL Exploration | SMAC, Habitat (non-SLAM) [36]
E2E Visual RL | Habitat, Gibson (nav only)
Full Pipeline Integration
Estimation–Decision | No unified benchmark
Complete AC-SLAM | No standard benchmark
✔ = well established; ∘ = limited validation; – = lacking. 1-R = single-robot real-world; N-R = multi-robot deployment; Bench = standardized benchmarks.
Table 4. Computational complexity of MARL paradigms. N: number of agents; |S_i|, |A_i|: per-agent state/action space; |M|: message size; k: option length; d: neighbor degree. Dec-POMDP is NEXP-complete [69]; see [41] for a comprehensive treatment. Based on typical benchmarks (e.g., BenchMARL [65]) and survey [63].
Paradigm: Joint State/Action; Training; Inference; Comm.; Scale
Centralized: |S|^N, |A|^N (exponential); O(|A|^N); O(|A|^N); N/A; Low (N ≲ 3)
Independent [63]: local only (|S_i|, |A_i|); O(1); O(1); None; High (N ≳ 10)
CTDE [48]: factorized; O(N) mix; O(|A_i|); Train only; Moderate
Hierarchical [70,71]: abstracted; O(T/k) dec.; O(1); Varies; High
Comm.-based [63]: + |M| msgs; O(d|M|); O(d|M|); O(|E||M|); Moderate
Notes: All Big-O with respect to N; per-agent |S_i|, |A_i|, |M|, d treated as constants. |E|: comm. edges (full O(N^2), sparse O(N)). Training/inference are per agent; comm. is team-wide. Centralized assumes naive enumeration over joint actions. Scalability: centralized is limited by NEXP-completeness (N ≲ 3); independent scales but suffers non-stationarity; CTDE balances via value factorization; hierarchical decouples from N via temporal abstraction; comm.-based is bounded by |E|.
Table 5. Comparison of approaches.
Method | Year | Sensing | Spatial Abstraction | Hierarchical Approach | Learning Method | Agent Count | Environment
[72] | 2022 | Visual and Proprioceptive | Discrete Latent Goal Codes | High-level: Selects latent goals; Low-level: Generates primitive actions | Dreamer (Model-Based RL) | 1 | Standard RL Suites (DMLab, DM Control, Atari)
[74] | 2021 | State Space S (Controllable and Environmental) | Subgoal Space Restricted to Directly Controllable State Part | High-level: Assigns long-term subgoals; Low-level: Generates primitive actions | SAC (with Hindsight Action Relabeling and HER) | 1 | 4 Standard Benchmarks (e.g., AntFourRooms) and 3 Dynamic Tasks (Platforms, Drawbridge, Tennis2D)
[73] | 2022 | Visual | Implicit Latent Space | High-level: Selects sub-goal images; Low-level: Generates primitive actions | Q-Learning (Conservative) | 1 | CARLA (Town03) and Antmaze
[75] | 2020 | Proprioceptive and Relative Features of Nearest Units | Stochastic Role Embedding | High-level: Samples latent role; Low-level: Generates primitive actions | QMIX | 5–27 | StarCraft II (SMAC)
[76] | 2020 | Relative Features of Nearest Units | Action Space Decomposition | High-level: Selects action subsets; Low-level: Generates primitive actions | QMIX (Mixing) | 2–27 | StarCraft II (SMAC)
[77] | 2022 | State Vector (Relative Polar Coords and Physics) | Abstract Environmental Layer (Ignored Obstacles and Enhanced Dynamics) | High-level: Generates subgoals (abstract states); Low-level: Generates primitive actions | MH-DDPG (Hierarchical MADDPG) | 3 | Modified MPE (Simple Spread with Obstacles)
[83] | 2019 | Lidar and Graph Node States | Topological Graph | High-level: Allocates target region; Low-level: Deterministic path planning | MAG-DQN (Spectral GCN + Centralized DQN) | 1–10 | ROS Room Dataset (2D Indoor Grid Maps)
[84] | 2022 | Positional, State Estimates, and Grid Map | Occupancy Grid and Uncertainty Heatmap | Global: Graph aggregation (info sharing via DKF); Local: Node-update policy (CNN + Action MLP) | Imitation Learning | 10–80 | 2D Cluttered Continuous Area
[85] | 2024 | Visual and Pose | Topological Graph | High-level: Selects global goal (via GNN); Low-level: Path planning (FMM) and primitive actions | MAPPO | 3, 4 | Habitat (Gibson and HM3D)
[86] | 2024 | LiDAR and Pose | Occupancy Grid and Sparse Frontier Graph | High-level: Allocates cluster centers (mGNN); Low-level: Local routing and ROS Nav | PPO (with mGNN) | 3 | iGibson (Gibson Dataset)
[12] | 2025 | Visual (Constrained) and Pose | Informative Graph | Viewpoint and heading selection (via graph attention and action pruning) | SAC (Attentive Privileged Critic) | 2, 4, 8 | Randomly Generated Large-Scale Indoor
Table 6. Overview of sim-to-real gaps and solutions.
Category | Work | Targeted Gap | Advantages | Limitations
Domain Rand. | [91] | Perception, Dynamics | Simple implementation; enables cold start without real data; robust to dynamic changes. | Policy may be overly conservative; difficult to cover long-tail distributions; high compute cost.
Domain Adapt. | [92] | Perception | Optimal performance for specific target environments; effectively utilizes real-world priors. | Requires target domain data; GAN training instability; may introduce geometric distortions affecting depth.
Real-to-Sim | [97,98] | Perception | High visual fidelity; highest sim-vs.-real correlation; supports complex physical interactions. | High reconstruction cost; requires re-scanning if environmental physical properties change.
Comm. Robustness | [100,101] | Communication | Explicitly handles multi-robot disconnection; enhances practical robustness of distributed systems. | Increases state space complexity; higher demands on real-time algorithmic performance.
