Article

Optimizing Multi-Camera Mobile Mapping Systems with Pose Graph and Feature-Based Approaches

by Ahmad El-Alailyi 1,2,*, Luca Morelli 2, Paweł Trybała 2, Francesco Fassi 1 and Fabio Remondino 2
1 Department of Architecture, Built Environment and Construction Engineering (DABC), Politecnico di Milano, 20133 Milan, Italy
2 3D Optical Metrology (3DOM) Unit, Bruno Kessler Foundation (FBK), 38123 Trento, Italy
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2810; https://doi.org/10.3390/rs17162810
Submission received: 31 May 2025 / Revised: 5 August 2025 / Accepted: 7 August 2025 / Published: 13 August 2025

Abstract

Multi-camera Visual Simultaneous Localization and Mapping (V-SLAM) increases spatial coverage through multi-view image streams, improving localization accuracy and reducing data acquisition time. Despite its speed and general robustness, V-SLAM often struggles to achieve the precise camera poses necessary for accurate 3D reconstruction, especially in complex environments. This study introduces two novel multi-camera optimization methods to enhance pose accuracy, reduce drift, and ensure loop closures. These methods refine multi-camera V-SLAM outputs within existing frameworks and are evaluated in two configurations: (1) multiple independent stereo V-SLAM instances operating on separate camera pairs; and (2) multi-view odometry processing all camera streams simultaneously. The proposed optimizations include (1) a multi-view feature-based optimization that integrates V-SLAM poses with rigid inter-camera constraints and bundle adjustment; and (2) a multi-camera pose graph optimization that fuses multiple trajectories using relative pose constraints and robust noise models. Validation is conducted through two complex 3D surveys using the ATOM-ANT3D multi-camera fisheye mobile mapping system. Results demonstrate survey-grade accuracy comparable to traditional photogrammetry, with reduced computational time, advancing toward near real-time 3D mapping of challenging environments.

Graphical Abstract

1. Introduction

Three-dimensional spatial data, such as dense point clouds and textured models, offers unique insights that go beyond traditional 2D formats, unlocking new opportunities for analysis and exploration across various fields. Close-range photogrammetry and terrestrial laser scanning (TLS) are widely adopted survey techniques that provide high geometric accuracy, but are relatively slow in data acquisition and require significant post-processing, especially in large-scale and complex environments [1,2,3,4,5,6]. To address these limitations, Visual- and LiDAR (Light Detection and Ranging)-based Mobile Mapping Systems (MMS), which integrate Simultaneous Localization and Mapping (SLAM) [7], have gained popularity for their acquisition speed and flexibility, enabling rapid 3D data acquisition and real-time positioning [7,8,9,10,11]. However, these systems may generate output data with lower density and accuracy compared to traditional methods [12]. They can vary significantly in terms of cost, weight, size, sensor quality (e.g., LiDAR sensor accuracy or camera resolution), and algorithm efficiency, depending on the target application. Furthermore, many MMS implementations rely on heterogeneous sensor fusion (e.g., LiDAR, camera, inertial measurement unit, etc.) or require external reference data (e.g., ground control points) and often involve extensive post-processing to ensure the accuracy and completeness of the output, especially in complex environments [13,14].
SLAM originated in robotics to enable autonomous localization and mapping in unknown environments [15,16]. Feature-based V-SLAM utilizes sequential image data to estimate camera poses and reconstruct the sensor trajectory by detecting, matching, and tracking local features [17]. Through feature triangulation, it incrementally builds a sparse 3D representation of the environment, which is used to estimate and refine the motion trajectory [18,19,20,21].
The integration of V-SLAM with Visual MMS [8,9,22,23] offers promising advances in real-time localization and mapping, improving speed, flexibility [8,22], efficiency, and accuracy [24]. V-SLAM-aided photogrammetry, a hybrid approach that integrates V-SLAM-generated camera poses and keyframe selection into the photogrammetric pipeline to establish spatial and temporal reference, has also shown great potential for enhancing the speed of 3D spatial data processing and possibly the accuracy of outputs [23]. Despite the growing popularity of SLAM-based vision mobile mapping systems, significant challenges remain, particularly related to the capabilities of V-SLAM algorithms to achieve metric accuracy and operational reliability required for geospatial applications.

1.1. Related Works

Various V-SLAM algorithms have been developed with different configurations based on the number of cameras used. Monocular V-SLAM offers a simple and lightweight setup, relying solely on a single camera to infer motion and the scene structure. Several key monocular V-SLAM algorithms have been developed over the years, each addressing different aspects of accuracy, robustness, and computational efficiency. MonoSLAM [25] pioneered the field using a constant velocity model combined with an Extended Kalman Filter for feature initialization; however, it remained susceptible to scale ambiguity and tracking loss. PTAM [18] introduced the concept of parallel tracking and mapping, incorporating local bundle adjustment (BA) to improve accuracy, although it required manual initialization. DTAM [19] followed a direct-based approach to achieve dense 3D surface reconstruction, but incurred high computational costs and lacked loop closure capabilities. SVO-SLAM [26] combined feature-based tracking with direct methods and offered improved speed and accuracy, but remained vulnerable to cumulative drift over time. ORB-SLAM [20] introduced a breakthrough by presenting a feature-based framework utilizing ORB descriptors for tracking, mapping, re-localization, and loop closure, making it one of the most widely adopted and influential V-SLAM frameworks. Building upon direct methods, DSO-SLAM [27] introduced a direct sparse approach that optimized camera motion and scene geometry through photometric constraints, improving accuracy but lacking loop closure detection. CubeSLAM [28] integrated semantic object detection, leveraging cuboid-shaped objects for scale recovery, but similar to other direct methods, it did not include a loop closure mechanism. SLAM3R [29], a monocular RGB system that regresses dense 3D data using a feed-forward network, performs real-time global alignment; however, the accuracy of its camera pose estimates may be limited compared to other SLAM solutions, due to the lack of global bundle adjustment with camera parameters refinement.
Stereo V-SLAM overcomes the scale ambiguity inherent to monocular approaches and improves the depth estimation accuracy. A notable advancement in this field was ORB-SLAM2.0 [30], an extension of ORB-SLAM that incorporated support for stereo and RGB-D cameras, enabling improved localization, mapping, and loop closure performance. It introduced an additional optimization thread dedicated to full bundle adjustment (BA), improving global consistency and SLAM performance in both indoor and outdoor environments. To address the challenge of feature scarcity in low-texture scenes, PL-SLAM [31] extended stereo V-SLAM by incorporating both point and line features into visual odometry and BA, while leveraging the Bag-of-Words model for loop closure detection. To further improve the stereo V-SLAM speed, HOOFR-SLAM [32] focused on maximizing parallelization by implementing hardware-software mapping on a CPU-GPU system to achieve faster processing while preserving accuracy. TIMA-SLAM [33] introduced a multi-camera approach inspired by ORB-SLAM2.0, where the cameras operate independently without requiring pre-calibration while sharing a common map. The system estimated and refined extrinsic calibrations during mapping to improve accuracy. However, it remained susceptible to scale ambiguity when using monocular cameras, which was mitigated by integrating RGB-D (i.e., red, green, blue channels, and depth) sensors for on-device depth estimation. AQUA-SLAM [34] introduced a tightly-coupled underwater SLAM system that fuses Doppler Velocity Log (DVL), stereo camera, and inertial measurement unit (IMU) within a graph optimization framework supplemented by a multi-sensor calibration method. RGB-D SLAM frameworks incorporate depth measurements obtained from active light sensors, such as structured light or time-of-flight sensors, enabling direct and dense depth recovery alongside visual information [35,36,37].
Recent research has increasingly focused on enhancing the accuracy and robustness of V-SLAM systems by integrating advanced computational techniques, including artificial intelligence (AI) components. EC-SLAM [38] integrates Neural Radiance Fields (NeRF) into a real-time dense RGB-D SLAM framework, using implicit NeRF-based loop closures and globally constrained bundle adjustment to improve tracking and map consistency. To address challenges posed by dynamic environments, RoDyn-SLAM [39] proposed an RGB-D SLAM framework that employs a motion mask generation method that combines optical flow and semantic segmentation to filter out dynamic regions by rejecting sampled points along camera rays that intersect moving objects in the 3D scene. A more uncertainty-aware approach was presented by CG-SLAM [40], which utilizes a 3D Gaussian field-based representation to enhance both tracking efficiency and mapping robustness. Additionally, Chai et al. [41] introduced a pedestrian dead-reckoning-aided visual-inertial SLAM system, which incorporates vanishing point observations as external references to correct for system drift and improve localization stability. VINGS-Mono [42] introduced a monocular visual-inertial SLAM framework that integrates Gaussian Splatting with dense bundle adjustment, NVS (Novel View Synthesis)-based loop closure, and dynamic-object filtering to achieve real-time, kilometer-scale mapping.
Semantic V-SLAM extends traditional SLAM frameworks by integrating deep learning-based object detection and semantic segmentation to improve localization accuracy, particularly in dynamic environments. Liu et al. [43] proposed a real-time optimization method for V-SLAM combining YOLOv8 with geometric constraints to improve tracking in dynamic scenes. Yang et al. [44] introduced a lightweight YOLOv8-based dynamic V-SLAM algorithm that focuses on computational efficiency while maintaining reliable tracking performance in challenging environments. Li et al. [45] developed a semantic stereo V-SLAM system based on ORB-SLAM2.0, tailored for outdoor dynamic environments to improve mapping consistency and reduce tracking errors. MVS-SLAM [46] builds upon ORB-SLAM3.0 by integrating YOLOv7-based semantic object detection to strengthen feature selection and ego-motion estimation and to improve the SLAM robustness in highly dynamic environments.
A recent trend in V-SLAM research has been the use of multi-camera systems, which integrate multiple synchronized cameras and, in some cases, asynchronous and additional sensors, to expand the field-of-view (FoV) and spatial awareness, thereby increasing robustness against tracking failures and improving localization and mapping performance. Yeh et al. [47] proposed a fusion method that integrates V-SLAM with Inertial Navigation Systems and Global Navigation Satellite Systems (GNSS), achieving lane-level positioning accuracy of 1.5 m in dynamic vehicular environments. Similarly, BundledSLAM [48] extended ORB-SLAM2.0 by introducing a virtual camera model, the BundledFrame, which fuses measurements from multiple cameras into a unified structure to improve pose tracking and bundle adjustment. However, the system still faces challenges in motion-blurred and texture-deficient scenarios. Focusing on scalability, Kaveti et al. [24] introduced a generalized MultiCam Visual Odometry framework that treats multiple cameras as a single imaging device using a generalized camera model, enabling efficient cross-matching across overlapping FoVs while minimizing computational overhead. MAVIS [49], an optimization-based Visual-Inertial-SLAM framework, extends traditional monocular and stereo SLAM to multi-camera configurations by combining wide FoV imaging with IMU-based metric scale measurements. A key innovation of MAVIS is its exact IMU (Inertial Measurement Unit) pre-integration formulation, based on the Lie Group SE2(3) [49], representing the extended poses (position, orientation, and velocity) in the 3D space, to improve tracking performance during fast rotations. Additionally, MAVIS optimizes front-end tracking and back-end pose estimation to accommodate multi-camera systems, improving the mapping accuracy while maintaining computational efficiency. Other works explore asynchronous multi-camera setups and advanced semantic integration. For example, AMV-SLAM [50] introduces a generalized SLAM framework capable of handling asynchronous sensor observations by employing a continuous-time motion model with cubic B-splines across multi-frame data during tracking, local mapping, and loop closure. Lastly, Multicam-SLAM [51] eliminates the need for FoV overlap by accurately estimating pose relationships among multiple RGB-D cameras, leveraging a multi-camera model and multi-keyframe architecture for enhanced scalability.
Despite significant advancements, persistent limitations include susceptibility to tracking failures due to challenging conditions such as motion blur, extreme and low lighting, or texture-poor environments, eventually leading to the accumulation of errors. Although stereo and multi-camera setups mitigate some of these challenges by providing additional geometric constraints and observational redundancy, drift remains an issue. Loop closure techniques [52,53,54], while beneficial, can fail due to insufficient or inconsistent visual correspondences, viewpoint changes, environmental variations, or repetitive structures. False-positive loop closures can introduce significant errors, whereas false negatives allow drift to persist, highlighting the need for more efficient error-reduction and drift-mitigation strategies. Additionally, incorporating supplementary heterogeneous sensors such as inertial measurement units (IMUs), Global Navigation Satellite Systems (GNSS), and RGB-D cameras, although beneficial, increases hardware complexity, power consumption, and operational costs. Furthermore, the lack of fully open-source frameworks or comprehensive documentation hinders reproducibility and extensibility.
Collectively, these challenges emphasize ongoing research needs in improving the accuracy, robustness, scalability, and adaptability of V-SLAM systems across various real-world scenarios.

1.2. Research Objectives and Contributions

This study aims to adapt and extend V-SLAM approaches to enhance their applicability for geomatics and 3D reconstruction tasks by improving the accuracy of the V-SLAM estimated camera poses. To this end, two novel multi-camera V-SLAM pose optimization approaches are introduced: (1) a Multi-View Feature-Based Optimization (Section 2.2.1); and (2) a Multi-Camera Pose Graph Optimization (Section 2.2.2).
The proposed optimization approaches are integrated within two multi-camera V-SLAM processing paradigms: (1) Multiple Single-Instance Stereo V-SLAM (Section 2.1.1), which independently processes synchronized stereo pairs; and (2) Multi-View Odometry (Section 2.1.2), which jointly processes all five camera views to produce a unified trajectory.
Building on this integration, we present two enhanced multi-camera V-SLAM approaches: (1) Multi-Instance Stereo V-SLAM (Section 2.3.1), which merges multiple stereo trajectories into a globally consistent solution utilizing both proposed optimization techniques; and (2) Multi-View V-SLAM (Section 2.3.2), which enhances simultaneous multi-view processing through feature-based optimization, reducing drift and adding a loop closure capability.
The proposed approaches are evaluated on the portable handheld ATOM-ANT3D system [10,23], equipped with five synchronized and overlapping wide-FoV (190°) fisheye cameras, configured to support four stereo-pair arrangements (Figure 1).
The evaluation is conducted in two complex real-world environments, demonstrating the practical applicability of the proposed approaches in producing accurate and metrically reliable results. This study focuses exclusively on a vision-based SLAM framework, utilizing only multi-camera image data without integrating additional sensors such as IMUs, GNSS, or LiDAR. The core novelty of the proposed optimizations lies in their ability to jointly refine multi-camera V-SLAM poses by leveraging synchronized and overlapping views from the multi-camera rig, allowing the incorporation of both inter-camera visual feature correspondences and pre-calibrated rig camera configurations (i.e., known rigid relative poses between the cameras). This enables an optimization framework that is robust, metrically consistent, and particularly well suited for multi-camera systems where auxiliary sensors are unavailable or difficult to integrate.

2. Multi-Camera V-SLAM Approaches and Optimization Algorithms

To process datasets acquired with multi-camera imaging systems and estimate the multi-camera trajectory required for high-quality 3D reconstructions, two initial approaches are evaluated (Section 2.1): Multiple Single-Instance Stereo V-SLAM and Multi-View Odometry. From our tests, both approaches exhibited limitations and produced suboptimal trajectory estimates. Therefore, to address these shortcomings, this study introduces two novel optimization methods, as follows:
(1) a Multi-View Feature-Based Optimization (Section 2.2.1);
(2) a Multi-Camera Pose Graph Optimization (Section 2.2.2).
These optimization methods are integrated into the initial approaches, resulting in two enhanced multi-camera V-SLAM approaches: Multi-Instance Stereo V-SLAM (Section 2.3.1) and Multi-View V-SLAM (Section 2.3.2). The proposed integration significantly reduces trajectory errors and enhances the overall accuracy. Figure 2 outlines the proposed optimization methodologies and their integration into the investigated multi-camera V-SLAM approaches.

2.1. Multi-Camera V-SLAM Approaches

2.1.1. Multiple Single-Instance Stereo V-SLAM

The multiple single-instance stereo V-SLAM approach is applied to the ATOM-ANT3D system [23], using ORB-SLAM3.0 [55] as the underlying SLAM foundation. This approach leverages the known relative poses between overlapping camera pairs (Figure 1) to independently generate scaled 3D reconstructions for each stereo configuration. Consequently, multiple trajectories are estimated for the multi-camera system, each with varying levels of accuracy. To optimize and merge the independently computed stereo trajectories into a single globally optimized solution, this approach is integrated with the two proposed optimization strategies (Section 2.2).

2.1.2. Multi-View Odometry

An alternative to running multiple independent stereo V-SLAM instances is to jointly process images from all cameras using a Multi-View Odometry approach. This approach increases tie point redundancy and enriches the observation set. To implement this, COLMAP-SLAM [56] was selected due to its support for generic multi-camera system configurations. The internal logic of the approach operates as follows. The keyframe selection is performed based on a designated primary camera; a new keyframe is created when the median optical flow of tie points between the current and previous primary keyframe exceeds a predefined threshold (e.g., 5 pixels). The corresponding secondary frames, captured at the same epoch, are also selected as keyframes. Each primary or secondary keyframe is subsequently matched spatially at the same epoch with adjacent cameras and temporally with previous frames from the same camera. COLMAP-SLAM supports a variety of deep learning-based feature extraction and matching algorithms. For this study, ALIKE [57] was selected due to its robust performance under challenging illumination conditions and faster matching capabilities. Camera orientations are incrementally estimated using the COLMAP API, with iterative local and global bundle adjustment (BA) performed throughout the process. However, COLMAP-SLAM does not natively support loop closure detection, a critical limitation for improving the accuracy of tracking and trajectory estimation. To address this, the novel Multi-View Feature-Based Optimization (Section 2.2.1) is integrated, serving as an additional loop closure mechanism, reducing accumulated drift and enhancing global trajectory consistency.
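For clarity, the keyframe decision rule described above can be illustrated with the following minimal Python sketch; the function name, variable layout, and threshold default are ours and do not reproduce the internal COLMAP-SLAM code:

```python
import numpy as np

def is_new_keyframe(prev_kpts: np.ndarray, curr_kpts: np.ndarray,
                    threshold_px: float = 5.0) -> bool:
    """prev_kpts, curr_kpts: (N, 2) pixel coordinates of tie points matched
    between the previous primary keyframe and the current primary frame."""
    flow = np.linalg.norm(curr_kpts - prev_kpts, axis=1)  # per-point displacement
    return float(np.median(flow)) > threshold_px

# When the primary camera triggers a new keyframe, the synchronized frames of
# the secondary cameras captured at the same epoch are promoted to keyframes too.
```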

2.2. Optimization Methods

Two novel multi-camera V-SLAM optimization methods, applicable either independently or sequentially, are introduced in the following sections.

2.2.1. Multi-View Feature-Based Optimization

Our method is implemented through the Agisoft Metashape Software (v2.1) and its Python (v3.10) API [58]. We explored other open-source libraries; however, they currently do not offer full support for multi-camera fisheye systems with fixed rig constraints, which are essential to our application. The goal of the Multi-View Feature-Based Optimization is to reduce accumulated drift in the V-SLAM trajectory and refine the camera poses by incorporating additional spatial constraints within a bundle adjustment (BA) (Figure 3).
The initial trajectory, which is typically estimated by the SLAM system for a designated master camera, serves as a reference. Based on this reference, the poses of all five cameras in the ATOM-ANT3D system are determined, under the assumption of prior system calibration and known inter-camera transformations (i.e., fixed baselines and orientations).
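As a minimal illustration of this step, the sketch below expands a master-camera pose to the remaining rig cameras by composing it with the calibrated relative transforms; the identity matrices stand in as placeholders and are not the actual ATOM-ANT3D calibration values:

```python
import numpy as np

def expand_rig_poses(T_w_master: np.ndarray, T_master_cam: dict) -> dict:
    """T_w_master: (4, 4) homogeneous world pose of the master camera at one keyframe.
    T_master_cam: camera id -> (4, 4) fixed transform from the master camera frame
    to that camera frame, obtained from the laboratory rig calibration."""
    return {cam_id: T_w_master @ T_m_c for cam_id, T_m_c in T_master_cam.items()}

# Example with identity placeholders for the five-camera rig calibration:
rig_calibration = {f"cam{i}": np.eye(4) for i in range(5)}
keyframe_pose = np.eye(4)  # master-camera pose taken from the V-SLAM trajectory
all_camera_poses = expand_rig_poses(keyframe_pose, rig_calibration)
```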
In the Multiple Single-Instance Stereo V-SLAM approach, any of the four stereo trajectories can be used to initialize the full multi-camera pose estimates. In contrast, the Multi-View Odometry strategy directly outputs a unified trajectory. Each approach (Section 2.1) results in a trajectory with distinct keyframe selections and pose accuracies, influencing the subsequent optimization.
To implement the optimization, the multi-camera fisheye rig is first defined, with fixed baselines and relative orientations between overlapping cameras. Since the cameras are pre-calibrated under controlled laboratory conditions, we set a low uncertainty (i.e., standard deviation) of the rig’s cameras’ relative poses. These values serve as constraints during the bundle adjustment, allowing the entire camera rig to be adjusted as a rigid body. While the internal implementation is not publicly disclosed, the general optimization concepts follow standard constrained bundle adjustment principles, where additional priors or constraints are introduced to preserve known relative transformations among cameras [59,60].
Feature extraction and matching are performed internally using SIFT-like descriptors. The imported V-SLAM trajectory plays a key role in guiding spatial proximity during keyframe pair selection for feature matching. While Metashape (v2.1) does not expose a user-defined search radius, it internally leverages the spatial relationships encoded in the initial trajectory to prioritize image pairs for matching. This process is further supported by the fixed multi-camera rig configuration, which provides known relative orientations between the camera poses. As a result, image matching over the V-SLAM trajectory is performed both within each multi-camera rig instance and between keyframes captured at different times but located in close spatial proximity. This implicitly enables loop-closure-like behavior without relying on explicit global loop detection, similar to strategies commonly adopted in LiDAR-SLAM frameworks [61]. In our case, the use of a fisheye multi-camera system further improves observation correspondences, as the wide field-of-view increases scene coverage and overlap across cameras, enhancing the chances of successful feature matching even under moderate drift conditions. This improves robustness while keeping computational cost manageable. Following this, multi-view triangulation is performed to reconstruct a sparse 3D point cloud of tie points observed from multiple overlapping viewpoints.
The triangulated tie points, along with the defined multi-camera rig constraints, are then used to iteratively refine the camera poses via bundle adjustment. After triangulation and prior to bundle adjustment (BA), tie points exhibiting a reprojection error RMSE greater than 1 pixel are filtered out. During the iterative bundle adjustment process, outliers are progressively reduced by applying thresholding criteria. Specifically, at each iteration, the reconstruction uncertainty threshold is decreased by increments of 10–20, while the projection accuracy threshold is reduced by increments of 5–10, directly influencing outlier points filtering. Following each threshold reduction, optimization is performed. The iterative process continues until convergence criteria are met, defined by achieving a reprojection error RMSE of tie points below 1 pixel and observing no significant further changes in the optimized camera poses. This process ensures geometric consistency and improves the accuracy of the final multi-camera trajectory and resulting 3D reconstruction.
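The iterative refinement loop can be summarized schematically as follows. The helper methods (filter_by_*, run_bundle_adjustment, tie_point_rmse, max_pose_change) are hypothetical placeholders standing in for the photogrammetric engine's gradual-selection and optimization calls, and the step sizes follow the increments reported above:

```python
def iterative_refinement(project,
                         uncertainty_start=100.0, uncertainty_step=15.0,
                         accuracy_start=50.0, accuracy_step=7.5,
                         target_rmse_px=1.0, pose_tolerance_m=1e-3,
                         max_iterations=10):
    # Pre-filter: discard tie points with a reprojection RMSE above 1 pixel.
    project.filter_by_reprojection_error(max_rmse_px=target_rmse_px)
    uncertainty, accuracy = uncertainty_start, accuracy_start
    for _ in range(max_iterations):
        # Tighten the gradual-selection thresholds, removing weak tie points.
        project.filter_by_reconstruction_uncertainty(uncertainty)
        project.filter_by_projection_accuracy(accuracy)
        # Re-run the rig-constrained bundle adjustment.
        previous_poses = project.camera_poses()
        project.run_bundle_adjustment(fit_rig_constraints=True)
        # Convergence: sub-pixel tie-point RMSE and stable camera poses.
        if (project.tie_point_rmse() < target_rmse_px and
                project.max_pose_change(previous_poses) < pose_tolerance_m):
            break
        uncertainty -= uncertainty_step
        accuracy -= accuracy_step
```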

2.2.2. Multi-Camera Pose Graph Optimization (PGO)

In the Multiple Single-Instance Stereo V-SLAM approach (Section 2.1.1), multi-camera data acquisition produces redundant camera pose estimates (i.e., up to four trajectory estimations). To integrate these estimates into a unified, globally consistent solution, we propose a novel pose graph optimization (PGO) algorithm, implemented using the Georgia Tech Smoothing and Mapping Library (GTSAM v4.2.0) [62].
A pose graph is a common representation in SLAM, modeling camera poses (i.e., positions and orientations) as nodes, and their relative constraints (i.e., measurements and observations) as edges. A corresponding nonlinear factor graph mathematically represents these relationships. In our framework, the nodes of the graph correspond to the full keyframe poses (i.e., position and orientation) estimated independently from up to four stereo trajectories, and are treated as the unknowns of the pose graph. Each node is assigned an initial estimate, with the first node (i.e., the first keyframe in the trajectory) fixed, anchoring the model against gauge freedom. Edges are defined by relative pose constraints between consecutive keyframes within each trajectory. Due to the unavailability of uncertainty estimates for individual pose measurements (e.g., standard deviations of 3D poses), an isotropic noise model is adopted. To mitigate the influence of noisy measurements and outliers, while leveraging the redundancy of multi-trajectory constraints, a robust noise model based on the Cauchy loss function is employed, enhancing optimization stability and preventing divergence. The scale parameter of the loss function was empirically tuned to the value of 0.3, balancing the strong reduction of outlying observations’ weights and the stability of the optimization process.
Once the factor graph is constructed, a Gauss-Newton optimizer iteratively minimizes the squared error of the non-linear system to refine the graph and optimize the pose estimates. The optimization problem can be expressed as the minimization of the sum of relative pose errors across all trajectory edges, as shown in Equation (1) [63]. The final output is a single global trajectory. A simplified illustration of the factor graph structure is provided in Figure 4.
$$\min_{\{T_i\}} \sum_{(i,j)\in\mathcal{E}} \rho\left( \left\| \log\left( Z_{ij}^{-1}\,(T_i^{-1} T_j) \right) \right\|^{2}_{\Sigma_{ij}^{-1}} \right) \tag{1}$$
where
$\rho(\cdot)$: robust loss function;
$T_i \in SE(3)$: estimated absolute pose at node $i$;
$Z_{ij} \in SE(3)$: measured relative pose from node $i$ to node $j$;
$\mathcal{E}$: set of relative-pose edges across all trajectories;
$\Sigma_{ij}$: covariance of measurement $Z_{ij}$;
$\log(\cdot)$: logarithm map from $SE(3)$ to its Lie algebra;
$\|e\|^{2}_{\Sigma^{-1}} = e^{T}\Sigma^{-1}e$: squared Mahalanobis norm.
Figure 4. The proposed multi-camera pose graph optimization scheme fuses ATOM-ANT3D individual estimated stereo pairs’ trajectories into a single final optimized trajectory.
Beyond merging stereo trajectories, the proposed PGO framework is highly extensible. Additional constraints, such as known inter-camera baselines, shared tie points, and loop closure detections, can be seamlessly integrated. Moreover, the framework allows for the incorporation of other sensor modalities, such as IMUs or LiDAR odometry [64], making it a scalable solution for robust multi-camera V-SLAM.
A key consideration for the robustness of the optimization process is the availability of sufficient common keyframes across individual trajectories. Since each stereo instance independently selects keyframes based on tracking quality and keyframe count thresholds, this might create sparsity in common keyframes, potentially reducing the strength of inter-trajectory constraints. V-SLAM uses all frames for camera pose estimation; however, to maintain computational efficiency and enable real-time performance, only keyframes are retained during the optimization, and non-keyframes are excluded. Nonetheless, non-keyframes are abundant and can offer valuable improvement to the graph constraints. To leverage this, we enhance the pose graph by incorporating both keyframe-based and frame-to-frame (i.e., non-keyframe) constraints. Specifically, between-factors are introduced to connect consecutive frames within each trajectory, with keyframes acting as high-certainty anchors. This enriched constraint structure increases the density of shared keyframes, improving inter-trajectory consistency and enhancing the performance and stability of the proposed optimization framework.
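A minimal sketch of how such a pose graph can be assembled and optimized with the GTSAM Python bindings (v4.x) is given below; the trajectory container, keyframe indexing, and noise sigma are illustrative assumptions, and the full implementation additionally includes the frame-to-frame between-factors described above:

```python
import gtsam

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

# Isotropic base noise on the 6-DoF relative poses (no per-pose covariances
# are available), wrapped in a Cauchy robust kernel with scale 0.3.
base_noise = gtsam.noiseModel.Isotropic.Sigma(6, 0.1)
robust_noise = gtsam.noiseModel.Robust.Create(
    gtsam.noiseModel.mEstimator.Cauchy.Create(0.3), base_noise)

# Hypothetical container: one list of (global_keyframe_id, gtsam.Pose3) per
# stereo trajectory; keyframes shared across trajectories reuse the same id.
trajectories = [
    [(0, gtsam.Pose3()),
     (1, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.1, 0.0, 0.0)))],
]

# Fix the first keyframe with a tight prior to remove gauge freedom.
prior_noise = gtsam.noiseModel.Isotropic.Sigma(6, 1e-6)
graph.add(gtsam.PriorFactorPose3(0, trajectories[0][0][1], prior_noise))

for trajectory in trajectories:
    for (i, pose_i), (j, pose_j) in zip(trajectory[:-1], trajectory[1:]):
        # Relative pose measured within this stereo instance (edge i -> j).
        z_ij = pose_i.between(pose_j)
        graph.add(gtsam.BetweenFactorPose3(i, j, z_ij, robust_noise))
        if not initial.exists(i):
            initial.insert(i, pose_i)
        if not initial.exists(j):
            initial.insert(j, pose_j)

# Gauss-Newton refinement of all keyframe poses, minimizing Equation (1).
result = gtsam.GaussNewtonOptimizer(graph, initial).optimize()
```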

2.3. Integration Methodology

The proposed optimizations (Section 2.2) are integrated into both V-SLAM approaches (Section 2.1), resulting in two new multi-camera V-SLAM solutions: (1) Multi-Instance Stereo V-SLAM and (2) Multi-View V-SLAM.

2.3.1. Multi-Instance Stereo V-SLAM

In the ATOM-ANT3D system, each stereo pair independently generates a V-SLAM trajectory, which may exhibit varying levels of drift despite the application of loop closure mechanisms. The integration of Multi-View Feature-Based Optimization (Section 2.2.1) refines these individual trajectories, improving their accuracy and spatial consistency by expanding the estimated stereo V-SLAM poses into five cameras’ 360° FoV, leveraging the fixed rigid baselines and orientations in a constrained iterative BA. This optimization enhances the internal accuracy of each trajectory. To merge them into a globally consistent solution, the Multi-Camera Pose Graph Optimization (Section 2.2.2) is applied, enforcing both intra- and inter-trajectory constraints to produce a unified, optimized multi-camera trajectory.
While the order of applying both optimizations can be reversed in principle, we intentionally apply the feature-based optimization first to support the goals of our evaluation. Our study investigates two different multi-camera V-SLAM configurations, and applying the feature-based optimization directly to the output of both configurations maintains a consistent and fair evaluation framework. This allows us to assess how each SLAM configuration behaves when refined independently by the same optimization method. In the case of multi-instance stereo V-SLAM, each stereo pair generates a trajectory with its unique drift profile, viewpoint-specific coverage, and keyframe distribution. Applying feature-based optimization at this stage not only improves local spatial consistency but also enables us to analyze the refined trajectories individually. This is valuable for understanding the performance and error characteristics of each SLAM instance before merging. If the pose graph optimization were to be applied first, this level of insight would be lost, as the feature-based refinement would then operate on a single, merged trajectory. Finally, applying the pose graph optimization after the feature-based optimization allows us to assess whether the optimization produces a balanced compromise between the independently optimized trajectories or further enhances the globally consistent solution. This ordering ultimately provides both deeper diagnostic insight and a more transparent understanding of the contributions of each optimization stage.

2.3.2. Multi-View V-SLAM

The Multi-View Odometry approach (Section 2.1.2) generates a single trajectory using all five cameras simultaneously; however, it lacks a dedicated loop detection thread. To overcome this limitation, the Multi-View Feature-Based Optimization (Section 2.2.1) is integrated to enhance the image network connectivity, ensure global coherence during BA, reduce drift, and enable loop closures when applicable, thereby transitioning from odometry estimation into a V-SLAM approach. Multi-View V-SLAM is primarily designed as a development and testing platform, making it better suited for post-processing V-SLAM sequences rather than real-time field deployment.
For the development and assessment of this study, the multi-camera pose graph optimization was implemented using the Python wrapper of the GTSAM (Georgia Tech Smoothing and Mapping) library (v4.2.0) [62]. Trajectory metric evaluation was conducted with the Evo Python package [65]. Additionally, we utilized two photogrammetric software packages, Agisoft Metashape (v2.1) [58] and COLMAP (v3.8) [66]. Finally, CloudCompare (v2.13) [67] was used for 3D point cloud processing.

3. Case Studies and Evaluation Approach

This section presents the two case studies used to evaluate the proposed approaches, detailing the data collection process, survey methodologies, ground truth generation, and evaluation strategy.
Apart from ground truth data used for validation, no Ground Control Points (GCPs) or external constraints were used during the processing of the proposed approaches. We relied solely on the fixed, pre-calibrated rigid transformation between cameras, with known baselines and orientations, to achieve a scaled metric 3D reconstruction. This setup enabled a fully self-contained evaluation of our pipeline from acquisition to final output.

3.1. Case Study 1: Sordine of the Duomo Di Milano

The Sordine of the Duomo di Milano (Italy), located below the rooftop terrace, is an intricately designed U-shaped architectural space. It spans approximately 70 m in length with an average path width of 1 m (Figure 5). The environment consists of nine interconnected rooms featuring curved dome-like structures, linked by narrow passages as small as 75 cm wide. This case study presents several challenges, including complex geometries, tight corridors, intricate architectural elements, restricted maneuverability, and highly variable lighting conditions ranging from well-lit areas to near-dim lighting.
These factors pose difficulties for conventional photogrammetry and are time-consuming for static laser scanning. The primary challenge lies in achieving a coherent reconstruction of the U-shape trajectory while ensuring adequate coverage, particularly across the connecting passages between compartments.
The survey followed a U-shaped trajectory, with forward and backward passes beginning and ending at the same location to enable loop closure (Figure 5c). Over approximately fifteen minutes of field surveying, a total of 13,655 images were acquired (2731 images per camera). The acquisition rate was configured at fifteen frames per second for the whole five-camera rig, with each camera saving three images per second.

3.2. Case Study 2: The Minguzzi Spiral Staircase of the Duomo Di Milano

The Minguzzi spiral staircase (Figure 6), located at the front-right corner of the cathedral, is a marble structure extending vertically from ground level to the rooftop, reaching a height of ca. 25 m.
Its interior features a narrow passageway, approximately 70 cm wide, spiraling around a central column with a diameter of 40 cm. The confined space, limited line of sight, scarce surface texture, and absence of loop closures present substantial challenges for 3D localization and mapping. This case study thus provides an ideal testbed for evaluating the robustness and accuracy of the proposed approaches. The dataset was originally collected during a previous study [23], following an ascending spiral trajectory starting at the lower entrance and ending at the rooftop exit. Over ca. 8 min of field acquisition, a total of 7905 images were captured (1581 images per camera).

3.3. Ground Truth and Evaluation Methodology

For the validation of the Sordine case study, 16 natural GCPs were sampled from a high-resolution terrestrial laser scanning survey (Figure 5) covering the entire area. In the areas where the GCPs were sampled, the TLS scan-to-object distance ranged from ca. 1 to 5 m, yielding a calculated point-cloud sampling interval of ca. 1 mm at a manufacturer-specified accuracy of 2 mm at a 10 m scan distance. Such high resolution ensures that naturally occurring features and high-contrast surface textures are both visible and well defined in the TLS point cloud. On the other hand, the image data acquisition, with a camera-to-object distance of ca. 1 to 2 m where the GCPs were sampled, produced a ground sampling distance (GSD) of less than ca. 1 cm; therefore, these TLS-derived natural points constitute valid GCPs for the photogrammetric reconstruction.
These points were used to scale, reference, and constrain a photogrammetric reconstruction of the entire dataset (13,655 images), establishing the ground truth reference for evaluation. Figure 7 presents the distribution of the GCPs at the start, middle, and end of the Sordine area, equally divided into two groups: checkpoints and control points. The Root Mean Square Error (RMSE) analysis of the natural GCPs yielded an RMSE of ca. 1 cm on the reference XYZ coordinates for both groups.
The consistency between the photogrammetric reconstruction and the TLS point cloud was evaluated by fitting least-squares planes to each point cloud and performing a cloud-to-cloud (C2C) distance analysis in CloudCompare. Distances from each point in one cloud were projected onto the fitted plane of the other cloud, enabling a comparison that captured deviations between corresponding surfaces. The C2C analysis yielded a mean distance of approximately 1.8 cm, with 95% of the points falling within 3.7 cm. Notably, large glass surfaces, dome-like curved geometries, low-texture regions, and high-elevation areas in the surveyed environment introduced challenges for the photogrammetric process, often resulting in local deviations and increased variability in pointwise distances, even when the overall alignment remained consistent. Figure 8 presents the computed C2C absolute distances as a scalar field coupled with the histogram, demonstrating that the majority of points exhibit minimal discrepancies, with deviations remaining relatively modest across the scene.
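For illustration, the underlying plane-fitting and point-to-plane distance computation can be sketched in a few lines of Python; CloudCompare was used for the actual analysis, and the point arrays below are random placeholders rather than survey data:

```python
import numpy as np

def fit_plane(points: np.ndarray):
    """points: (N, 3). Returns (centroid, unit normal) of the least-squares plane."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[-1]  # normal = direction of least variance

def point_to_plane_distances(points: np.ndarray, centroid, normal):
    return np.abs((points - centroid) @ normal)

# Example: distances of photogrammetric points to a plane fitted on TLS points.
tls_patch = np.random.rand(1000, 3)    # placeholder TLS point patch
photo_patch = np.random.rand(1000, 3)  # placeholder photogrammetric patch
c, n = fit_plane(tls_patch)
d = point_to_plane_distances(photo_patch, c, n)
print(f"mean = {d.mean():.3f} m, 95th percentile = {np.percentile(d, 95):.3f} m")
```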
Despite potential local noise in the photogrammetric reconstruction, its agreement with the TLS point cloud is acceptable at the centimeter level. Consequently, the trajectory derived from the photogrammetric reconstruction is considered valid as ground truth for this case study.
For the Minguzzi case study, a set of 20 natural points, originally identified in a previous photogrammetric survey [68], was utilized. These points were evenly distributed along the staircase (Figure 7) and served to constrain a photogrammetric reconstruction of the complete image dataset.
The RMSE computed on the XYZ coordinates was 0.4 cm for control points and 2 cm for check points. Considering the complexity of the staircase environment and the challenges posed to alternative surveying methods, this level of accuracy is deemed acceptable for establishing a reliable ground truth in the context of this study.
The trajectory accuracy evaluations presented in the following sections compare the V-SLAM estimated and optimized trajectories against the ground truth trajectories. Specifically, the accuracy assessment is based on the RMSE (i.e., the root mean square of the deviations between estimated and reference camera poses over the entire trajectory) of the following metrics (a minimal evaluation sketch is given after this list):
(1) Absolute Pose Error (APE): assessing the global consistency of the trajectory with respect to translational deviations.
(2) Relative Pose Error (RPE): assessing the local consistency of the trajectory in terms of translation and orientation differences, with the reported error referring to the translational deviations.
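The sketch below shows how such an evaluation can be run with the Evo package's Python interface, assuming TUM-format trajectory files; the file names are hypothetical:

```python
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("ground_truth_tum.txt")
traj_est = file_interface.read_tum_trajectory_file("optimized_vslam_tum.txt")

# Associate poses by timestamp and align the estimate to the reference
# (rigid-body Umeyama alignment, as performed by `evo_ape ... -a`).
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)
traj_est.align(traj_ref, correct_scale=False)

# APE on the translational part: global trajectory consistency.
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("APE RMSE [m]:", ape.get_statistic(metrics.StatisticsType.rmse))

# RPE on the translational part between consecutive frames: local consistency.
rpe = metrics.RPE(metrics.PoseRelation.translation_part,
                  delta=1, delta_unit=metrics.Unit.frames)
rpe.process_data((traj_ref, traj_est))
print("RPE RMSE [m]:", rpe.get_statistic(metrics.StatisticsType.rmse))
```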

4. Results

The outputs from the proposed V-SLAM approaches are the estimated camera poses and trajectories, as well as their corresponding sparse point clouds. Due to the inherent keyframe selection mechanism of V-SLAM, the number of retained images varied in each approach, as reported in Table 1.
The photogrammetric reconstruction is performed with the complete image datasets for both case studies, simulating a workflow without the V-SLAM keyframe selection mechanism. In addition to traditional photogrammetry, a V-SLAM-aided photogrammetry [23] workflow is evaluated, using estimated camera poses from V-SLAM to improve spatial matching between adjacent cameras and reduce computational time. To achieve this, we used camera poses generated by the Multi-View V-SLAM method (Section 2.3.2), as it retained a larger number of keyframes, nearly matching the full image dataset in the Minguzzi case study, which allows a robust evaluation of the trade-off between computational efficiency and reconstruction accuracy. Both photogrammetric approaches are run on an Intel CPU i9-11900H @ 2.5GHz and NVIDIA GeForce RTX 3080, and serve as baselines for comparing accuracy and approximate computational time.

4.1. Multi-Instance Stereo V-SLAM Optimization

Due to hardware limitations and the rapid increase in memory consumption during map construction, only two stereo instances (i.e., corresponding to two stereo pairs) could be processed simultaneously in real-time on-field. To ensure fair and consistent comparison across all configurations, each of the four stereo instances in both case studies was reprocessed offline at a fixed rate of three frames per second on an Intel CPU i9-11900H @ 2.5 GHz and NVIDIA GeForce RTX 3080. This offline setting approximately simulates the time required for the V-SLAM operation on the field.
In the Sordine dataset, the front-left stereo instance could not provide and maintain reliable feature tracking due to limited illumination and insufficient texture within its field of view. These conditions impaired the system’s ability to re-localize and merge new observations with previously mapped areas, resulting in disconnected trajectory segments and poor camera pose estimation. As a result, this instance was manually excluded from further analysis by the operator based on the real-time feedback from the system. This highlights the advantage of multi-camera systems, where the redundancy of observations allows other stereo instances to maintain consistent tracking and mapping, preserving dataset continuity even in the presence of partial failures.
Table 2 presents the RMSE of Absolute Pose Error (APE) in both case studies for the proposed multi-camera V-SLAM approaches, before and after applying the optimization methods. For comparative purposes, the results of traditional photogrammetry and V-SLAM-aided photogrammetry are reported. The APE RMSE for photogrammetry is 0.03 m in the Sordine case and 0.05 m in the Minguzzi case, while the V-SLAM-aided photogrammetry yielded 0.04 m and 0.06 m, respectively.
The results presented in Table 2 reveal notable variability in pre-optimization trajectory accuracy among the stereo instances. In the Sordine case study, the left stereo instance achieves the lowest pre-optimization RMSE at 0.45 m, whereas the right stereo instance records the highest error at 1.27 m. The front-right stereo instance also exhibits a relatively high error of 1.10 m.
In contrast, the Minguzzi case study demonstrates a different distribution of errors. The right stereo instance attains the lowest RMSE at 0.34 m, while the left stereo instance shows a considerably higher error of 0.92 m. The front-left and front-right stereo instances report intermediate errors of 0.60 m and 0.53 m, respectively.
Following the application of the proposed Multi-View Feature-Based Optimization, a substantial improvement in camera pose accuracy was observed across all individual stereo instances. In Sordine, the left stereo RMSE drops from 0.45 m to 0.03 m, and the right stereo improves from 1.27 m to 0.02 m. Similarly, the front-right stereo instance improves from 1.10 m to 0.07 m.
Comparable improvements are recorded in Minguzzi. The left stereo instance RMSE was reduced from 0.92 m to 0.12 m, and the right and front-right stereo instances improved from 0.34 m and 0.53 m, respectively, to 0.05 m. However, the front-left stereo instance continued to exhibit a relatively high RMSE of 0.26 m after optimization. This residual error is attributed to a combination of limiting factors, including a feature-poor viewpoint, exposure to strong artificial lighting, and the lowest number of retained keyframes, all of which contribute to potentially weaker visual correspondences. Although this stereo configuration is geometrically divergent, the divergence alone does not fully explain the error, as the other divergent instance (i.e., front-right) achieves better performance. Figure 9 illustrates the initial drift discrepancies between the left and right stereo trajectories in the Sordine before optimization. Figure 10 demonstrates the improvements achieved post-optimization, with refined camera poses and reduced accumulated drift leading to improved global trajectory consistency.
Beyond optimizing individual instances, the proposed Multi-Instance Stereo V-SLAM approach integrates multiple single-instance stereo trajectories using the Multi-Camera Pose Graph Optimization. This integration enforces global consistency across multiple viewpoints, producing a unified and optimized trajectory for the entire multi-camera system. The final achieved RMSE values for both case studies are 0.04 m in the Sordine and 0.03 m in Minguzzi, confirming the effectiveness of the proposed approach.
In terms of computational time efficiency, the average processing time of the Multi-Instance Stereo V-SLAM is ca. 0.53 h for Sordine and 0.27 h for Minguzzi. These computation times are lower than those of traditional photogrammetry, ca. 9.5 h for the Sordine case and ca. 3.0 h for the Minguzzi case, as well as V-SLAM-aided photogrammetry, which required ca. 2.0 h and 1.25 h, respectively.

4.2. Multi-View V-SLAM Optimization

To evaluate the effectiveness of the proposed Multi-View V-SLAM pipeline, the results of the underlying Multi-View Odometry are presented before integrating the Multi-View Feature-Based Optimization. Figure 11 presents the results of the Multi-View Odometry approach run on an Intel Core i9-10900F CPU @ 5.2 GHz and an NVIDIA GeForce GTX 1080 GPU, showing a visual comparison of trajectories generated using two, three, and five cameras. The corresponding APE RMSE values for these configurations are summarized in Table 3.
In the Sordine case study, a substantial reduction in the RMSE is observed when the number of cameras is increased from two to three, with the error decreasing from 1.93 m to 0.50 m. However, increasing the number of cameras to five increased the RMSE to 0.64 m, suggesting a potential trade-off between enhanced viewpoint redundancy and the introduction of additional feature mismatches or noise that accumulate and propagate through the incremental estimation process.
In contrast, the Minguzzi case study exhibits consistently low RMSE values of ca. 0.11 m across all configurations, indicating greater reconstruction stability in this environment. To further interpret the outcomes, Figure 12 presents a component-wise (X, Y, Z) comparison of the pose deviations across increasing camera counts relative to the ground truth. The Sordine dataset demonstrates a larger variation across all axes, whereas the confined structure of the Minguzzi staircase results in a more stable performance with minimal deviation. Notably, the application of loop closure detection via the proposed Multi-View Feature-Based Optimization significantly enhanced the final trajectory accuracy. As presented in Table 2, the post-optimization RMSE values improved to approximately 0.08 m for both case studies.
Furthermore, Figure 13 demonstrates successful loop closure detection in the Sordine case study, confirming the trajectory estimation improvement and drift reduction achieved. The Multi-View V-SLAM, including both SLAM processing and optimization, required ca. 1.4 h and 0.7 h for the Sordine and Minguzzi case studies, respectively. A comparison of the multi-camera V-SLAM approaches before and after optimization is presented in Figure 14, which illustrates the variation plots of the APE residuals along each trajectory, benchmarked against the corresponding ground truth data across the two case studies.
In the Sordine case study (Figure 14a), the pre-optimization trajectories of the left, right, and front-right stereo configurations exhibit a gradual accumulation of APE, followed by sharp vertical drops at specific timestamps. This behavior corresponds to significant positional drift events and irregular pose estimation caused by outlier observations and loss of reliable visual features. The subsequent steep drops in APE reflect partial realignment events associated with loop closures or re-localization. In the Minguzzi case study (Figure 14b), pre-optimization trajectories display consistent, repetitive oscillations in APE, which are indicative of periodic geometric ambiguities and drift, particularly during repeated rotations within the spiral staircase traversal. This cyclic drift behavior is further highlighted for the Minguzzi case study in Figure 15a–d before optimization, which illustrates the recurring pose estimation errors.
Overall, the results of integrating the proposed optimization approaches demonstrate their effectiveness in refining camera poses and reducing accumulated drift in the overall trajectory estimations, as evidenced by the post-optimization plots in Figure 14 and Figure 15.

5. Discussion

Prior to optimization, the V-SLAM trajectories exhibited substantial camera pose errors and drift, with the RMSE in APE exceeding 1.0 m in some cases. Although the initial V-SLAM estimates often demonstrated relatively low median RPE values (Figure 16), the presence of high-magnitude outliers contributed to the cumulative drift, thereby increasing the global APE metric. In particular, this effect is evident in the Sordine case study, where the right stereo instance exhibited the highest APE RMSE of 1.27 m before optimization, despite having a median RPE value of ca. 0.05 m. This highlights how the frame-to-frame accuracy does not reflect the overall consistency of the trajectory owing to the buildup of drift.
Following the application of the proposed Multi-View Feature-Based Optimization, significant improvements were achieved. In Sordine, the right stereo instance error decreased from 1.27 m to 0.02 m, whereas in Minguzzi, the left stereo instance error was reduced from 0.92 m to 0.12 m. The box plots of RPE (Figure 16) illustrate these improvements, highlighting a marked reduction in both the median RPE and the variance across all samples. After optimization, the RPE outliers were significantly reduced, leading to more stable and consistent trajectory estimations and achieving centimeter-level APE accuracy in nearly all cases.
The Multi-Instance Stereo V-SLAM approach, integrated with Multi-Camera Pose Graph Optimization, successfully fused individual stereo trajectories into a globally consistent solution by enforcing inter-trajectory pose constraints. In the Minguzzi case, this strategy yielded better overall trajectory accuracy than individually optimized stereo instances, while in the Sordine case, it stabilized performance across all configurations, emphasizing the benefits of global graph-based refinement.
The impact of increasing the number of cameras within the Multi-View Odometry framework varied between the case studies. In the Sordine case study, expanding from two to three cameras substantially reduced the APE RMSE from 1.93 m to 0.50 m. However, further expansion to five cameras introduced a slight accuracy degradation with the RMSE increasing to 0.64 m, likely due to feature mismatches, outlier propagation, and the incremental nature of the pipeline accumulating the drift. Conversely, in the Minguzzi case study, the APE RMSE remained consistent at ca. 0.11 m across all camera configurations. While the use of two cameras may offer computational savings compared to three- and five-camera configurations, this observation is specific to the Minguzzi case study, where the constrained geometry and stable feature tracks using the ALIKE feature extractor potentially minimized drift across configurations. However, when loop closure is required, utilizing all five cameras is advisable, as the increased viewpoint redundancy introduces additional constraints in the final optimization, enhancing the robustness and generalizability of the solution, particularly in more complex or less structured environments. The integration of loop closure detection through the Multi-View Feature-Based Optimization further improved the final APE values to approximately 8 cm in both case studies, demonstrating the robustness and generalizability of the proposed optimization framework, reducing the accumulated drift error and providing a loop closure mechanism when applicable.
The computational efficiency of the proposed multi-camera V-SLAM approaches, along with V-SLAM-aided photogrammetry, was compared to the traditional photogrammetry workflow. The results indicate a reduction in processing time, with the V-SLAM methods achieving ca. 77–94% faster computation compared to conventional photogrammetry, while maintaining comparable reconstruction accuracy. Additionally, the V-SLAM-aided photogrammetry approach, leveraging a pre-selected subset of the V-SLAM-derived keyframes, further optimized processing, reducing computation time by ca. 58–79% relative to the full photogrammetric pipeline.

6. Conclusions

This study presented two novel multi-camera V-SLAM optimization methods (Multi-View Feature-Based Optimization and Multi-Camera Pose Graph Optimization) developed and integrated within two multi-camera V-SLAM approaches (Multi-Instance Stereo V-SLAM and Multi-View V-SLAM). These approaches were designed to improve camera pose accuracy, reduce drift, and enable reliable loop closure in multi-camera V-SLAM applications, thereby supporting more accurate and robust 3D reconstructions in complex and challenging environments.
The proposed approaches were evaluated using extensive and complex datasets acquired with ATOM-ANT3D, a synchronized wide-FoV fisheye multi-camera mobile mapping system featuring a pre-calibrated rigid camera configuration. The Multi-Instance Stereo V-SLAM approach demonstrated that, despite possible failures in individual trajectory estimation, the redundancy provided by the multi-camera setup ensured tracking and mapping continuity. The Multi-View Feature-Based Optimization improved the single-instance trajectories, significantly reducing errors and achieving centimeter-level accuracy in V-SLAM camera poses. Furthermore, the integration of Multi-Camera Pose Graph Optimization merged the individual stereo trajectories into a globally consistent and refined single solution.
In contrast, the Multi-View V-SLAM approach, which simultaneously processes all five cameras and integrates Multi-View Feature-Based Optimization, demonstrates a clear advantage over the Multi-View Odometry-based approach in scenarios where loop closure is required to correct the drift and improve the accuracy of the estimated trajectory.
Overall, this study demonstrates that multi-camera V-SLAM, when combined with dedicated optimization techniques, can achieve accurate 3D reconstructions, bridging the gap between real-time SLAM capability and the accuracy of traditional photogrammetric methods. The proposed approaches significantly outperform conventional V-SLAM, reaching centimeter-level accuracy while requiring less computational time than traditional photogrammetry.

7. Patents

The ANT3D multi-camera system used in this study is the subject of patent proposal no. 102021000000812. The patent was licensed on 24 January 2023.

Author Contributions

Conceptualization, A.E.-A., L.M., P.T., F.F. and F.R.; methodology, A.E.-A., L.M., P.T., F.F. and F.R.; data acquisition, A.E.-A. and F.F.; software, A.E.-A., L.M. and P.T.; writing—original draft preparation, A.E.-A., L.M. and P.T.; writing—review and editing, A.E.-A., L.M., P.T., F.F. and F.R.; supervision, F.F. and F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the “Boostech Valorization Program 2022” within the Italian “Piano Nazionale di Ripresa e Resilienza—Missione 1, Componente 2, Investimento 6”, funded by the European Union—NextGenerationEU, with the goal of industrializing the ANT3D prototype, which is already the subject of patent proposal no. 102021000000812. The patent was licensed on 24 January 2023.

Data Availability Statement

Data are not available due to non-disclosure agreements.

Acknowledgments

The authors would like to thank Veneranda Fabbrica del Duomo di Milano for allowing the tests to take place in Milan’s Cathedral.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analysis, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Pollefeys, M.; Frahm, J.-M.; Fraundorfer, F.; Zach, C.; Wu, C.; Clipp, B.; Gallup, D. Challenges in Wide-Area Structure-from-Motion. IPSJ Trans. Comput. Vis. Appl. 2010, 2, 105–120.
2. Rüther, H.; Bhurtha, R.; Held, C.; Schröder, R.; Wessels, S. Laser Scanning in Heritage Documentation. Photogramm. Eng. Remote Sens. 2012, 78, 309–316.
3. Holst, C.; Kuhlmann, H. Challenges and Present Fields of Action at Laser Scanner Based Deformation Analyses. J. Appl. Geod. 2016, 10, 17–25.
4. Leduc, P.; Peirce, S.; Ashmore, P. Short Communication: Challenges and Applications of Structure-from-Motion Photogrammetry in a Physical Model of a Braided River. Earth Surf. Dynam. 2019, 7, 97–106.
5. Berra, E.F.; Peppa, M.V. Advances and Challenges of UAV SFM MVS Photogrammetry and Remote Sensing: Short Review. In Proceedings of the 2020 IEEE Latin American GRSS & ISPRS Remote Sensing Conference (LAGIRS), Santiago, Chile, 22–27 March 2020; pp. 267–272.
6. Waqar, A.; Othman, I.; Saad, N.; Qureshi, A.H.; Azab, M.; Khan, A.M. Complexities for Adopting 3D Laser Scanners in the AEC Industry: Structural Equation Modeling. Appl. Eng. Sci. 2023, 16, 100160.
7. Elhashash, M.; Albanwan, H.; Qin, R. A Review of Mobile Mapping Systems: From Sensors to Applications. Sensors 2022, 22, 4262.
8. Ortiz-Coder, P.; Sánchez-Ríos, A. A Self-Assembly Portable Mobile Mapping System for Archeological Reconstruction Based on VSLAM-Photogrammetric Algorithm. Sensors 2019, 19, 3952.
9. Torresani, A.; Menna, F.; Battisti, R.; Remondino, F. A V-SLAM Guided and Portable System for Photogrammetric Applications. Remote Sens. 2021, 13, 2351.
10. Perfetti, L.; Fassi, F.; Vassena, G. Ant3D—A Fisheye Multi-Camera System to Survey Narrow Spaces. Sensors 2024, 24, 4177.
11. Będkowski, J. Open Source, Open Hardware Hand-Held Mobile Mapping System for Large Scale Surveys. SoftwareX 2024, 25, 101618.
12. Szrek, A.; Romańczukiewicz, K.; Kujawa, P.; Trybała, P. Comparison of TLS and SLAM Technologies for 3D Reconstruction of Objects with Different Geometries. IOP Conf. Ser. Earth Environ. Sci. 2024, 1295, 012012.
13. Xu, X.; Zhang, L.; Yang, J.; Cao, C.; Wang, W.; Ran, Y.; Tan, Z.; Luo, M. A Review of Multi-Sensor Fusion SLAM Systems Based on 3D LIDAR. Remote Sens. 2022, 14, 2835.
14. Zhu, J.; Li, H.; Zhang, T. Camera, LiDAR, and IMU Based Multi-Sensor Fusion SLAM: A Survey. Tsinghua Sci. Technol. 2024, 29, 415–429.
15. Smith, R.; Self, M.; Cheeseman, P. A Stochastic Map for Uncertain Spatial Relationships. In Proceedings of the 4th International Symposium on Robotic Research, Santa Cruz, CA, USA, 9–14 August 1987; pp. 467–474.
16. Durrant-Whyte, H.; Bailey, T. Simultaneous Localization and Mapping: Part I. IEEE Robot. Automat. Mag. 2006, 13, 99–110.
17. Jin, Y.; Mishkin, D.; Mishchuk, A.; Matas, J.; Fua, P.; Yi, K.M.; Trulls, E. Image Matching Across Wide Baselines: From Paper to Practice. Int. J. Comput. Vis. 2021, 129, 517–547.
18. Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 1–10.
19. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327.
20. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163.
21. Macario Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A Comprehensive Survey of Visual SLAM Algorithms. Robotics 2022, 11, 24.
22. Kuo, J.; Muglikar, M.; Zhang, Z.; Scaramuzza, D. Redesigning SLAM for Arbitrary Multi-Camera Systems. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020; pp. 2116–2122.
23. Elalailyi, A.; Perfetti, L.; Fassi, F.; Remondino, F. V-SLAM-Aided Photogrammetry to Process Fisheye Multi-Camera Systems Sequences. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 48, 189–195.
24. Kaveti, P.; Vaidyanathan, S.N.; Chelvan, A.T.; Singh, H. Design and Evaluation of a Generic Visual SLAM Framework for Multi Camera Systems. IEEE Robot. Autom. Lett. 2023, 8, 7368–7375.
25. Davison, A.J. Real-Time Simultaneous Localisation and Mapping with a Single Camera. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 2, pp. 1403–1410.
26. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast Semi-Direct Monocular Visual Odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–5 June 2014; pp. 15–22.
27. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625.
28. Yang, S.; Scherer, S. CubeSLAM: Monocular 3-D Object SLAM. IEEE Trans. Robot. 2019, 35, 925–938.
29. Liu, Y.; Dong, S.; Wang, S.; Yin, Y.; Yang, Y.; Fan, Q.; Chen, B. SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos. arXiv 2025, arXiv:2412.09401.
30. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
31. Gomez-Ojeda, R.; Zuñiga-Noël, D.; Moreno, F.-A.; Scaramuzza, D.; Gonzalez-Jimenez, J. PL-SLAM: A Stereo SLAM System through the Combination of Points and Line Segments. IEEE Trans. Robot. 2019, 35, 734–746.
32. Nguyen, D.-D.; Elouardi, A.; Rodriguez Florez, S.A.; Bouaziz, S. HOOFR SLAM System: An Embedded Vision SLAM Algorithm and its Hardware-Software Mapping-Based Intelligent Vehicles Applications. IEEE Trans. Intell. Transport. Syst. 2019, 20, 4103–4118.
33. Ince, O.F.; Kim, J.-S. TIMA SLAM: Tracking Independently and Mapping Altogether for an Uncalibrated Multi-Camera System. Sensors 2021, 21, 409.
34. Xu, S.; Zhang, K.; Wang, S. AQUA-SLAM: Tightly-Coupled Underwater Acoustic-Visual-Inertial SLAM with Sensor Calibration. arXiv 2025, arXiv:2503.11420.
35. Steinbrucker, F.; Sturm, J.; Cremers, D. Real-Time Visual Odometry from Dense RGB-D Images. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 719–722.
36. Kerl, C.; Sturm, J.; Cremers, D. Dense Visual SLAM for RGB-D Cameras. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 2100–2106.
37. Kerl, C.; Stuckler, J.; Cremers, D. Dense Continuous-Time Tracking and Mapping with Rolling Shutter RGB-D Cameras. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2264–2272.
38. Li, G.; Chen, Q.; Yan, Y.; Pu, J. EC-SLAM: Effectively Constrained Neural RGB-D SLAM with Sparse TSDF Encoding and Global Bundle Adjustment. arXiv 2024, arXiv:2404.13346.
39. Jiang, H.; Xu, Y.; Li, K.; Feng, J.; Zhang, L. RoDyn-SLAM: Robust Dynamic Dense RGB-D SLAM with Neural Radiance Fields. IEEE Robot. Autom. Lett. 2024, 9, 7509–7516.
40. Hu, J.; Chen, X.; Feng, B.; Li, G.; Yang, L.; Bao, H.; Zhang, G.; Cui, Z. CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-Aware 3D Gaussian Field. arXiv 2024, arXiv:2403.16095.
41. Chai, W.; Li, C.; Zhang, M.; Sun, Z.; Yuan, H.; Lin, F.; Li, Q. An Enhanced Pedestrian Visual-Inertial SLAM System Aided with Vanishing Point in Indoor Environments. Sensors 2021, 21, 7428.
42. Wu, K.; Zhang, Z.; Tie, M.; Ai, Z.; Gan, Z.; Ding, W. VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes. arXiv 2025, arXiv:2501.08286.
43. Liu, B.; Cheng, M. Real-Time Visual SLAM Optimization Method Based on YOLOv8 and Geometric Constraints in Dynamic Scenes. In Proceedings of the International Conference on Remote Sensing, Mapping, and Image Processing, Xiamen, China, 19–21 January 2024; p. 10.
44. Yang, Z.; Zhang, H.; Fan, X. Dynamic Visual SLAM Algorithm Based on Lightweight YOLOv8. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Computer Information Technology (AICIT), Yichang, China, 20–22 September 2024; pp. 1–4.
45. Li, Y.; Song, G.; Hao, S.; Mao, J.; Song, A. Semantic Stereo Visual SLAM toward Outdoor Dynamic Environments Based on ORB-SLAM2. Int. J. Robot. Res. Appl. 2023, 50, 542–554.
46. Islam, Q.U.; Ibrahim, H.; Chin, P.K.; Lim, K.; Abdullah, M.Z. MVS-SLAM: Enhanced Multiview Geometry for Improved Semantic RGBD SLAM in Dynamic Environment. J. Field Robot. 2024, 41, 109–130.
47. Yeh, T.-H.; Chiang, K.-W.; Lu, P.-R.; Li, P.-L.; Lin, Y.-S.; Hsu, C.-Y. V-SLAM Enhanced INS/GNSS Fusion Scheme for Lane Level Vehicular Navigation Applications in Dynamic Environment. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 547–553.
48. Song, H.; Liu, C.; Dai, H. BundledSLAM: An Accurate Visual SLAM System Using Multiple Cameras. In Proceedings of the IEEE 7th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 15–17 March 2024; pp. 106–111.
49. Wang, Y.; Ng, Y.; Sa, I.; Parra, A.; Rodriguez, C.; Lin, T.J.; Li, H. MAVIS: Multi-Camera Augmented Visual-Inertial SLAM Using SE2(3) Based Exact IMU Pre-Integration. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 1694–1700.
50. Yang, A.J.; Cui, C.; Bârsan, I.A.; Urtasun, R.; Wang, S. Asynchronous Multi-View SLAM. arXiv 2021, arXiv:2101.06562.
51. Li, S.; Pang, L.; Hu, X. Multicam-SLAM: Non-Overlapping Multi-Camera SLAM for Indirect Visual Localization and Navigation. arXiv 2024, arXiv:2406.06374.
52. Li, X.; Ling, H. PoGO-Net: Pose Graph Optimization with Graph Neural Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5875–5885.
53. Duan, R.; Feng, Y.; Wen, C.-Y. Deep Pose Graph-Matching-Based Loop Closure Detection for Semantic Visual SLAM. Sustainability 2022, 14, 11864.
54. Jia, G.; Li, X.; Zhang, D.; Xu, W.; Lv, H.; Shi, Y.; Cai, M. Visual-SLAM Classical Framework and Key Techniques: A Review. Sensors 2022, 22, 4582.
55. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
56. Morelli, L.; Ioli, F.; Beber, R.; Menna, F.; Remondino, F.; Vitti, A. COLMAP-SLAM: A Framework for Visual Odometry. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 317–324.
57. Zhao, X.; Wu, X.; Miao, J.; Chen, W.; Chen, P.C.Y.; Li, Z. ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction. IEEE Trans. Multimed. 2023, 25, 3101–3112.
58. Agisoft Metashape, version 2.1; Agisoft: St. Petersburg, Russia, 2024.
59. Będkowski, J. Large-Scale Simultaneous Localization and Mapping; Cognitive Intelligence and Robotics; Springer Nature: Singapore, 2022; ISBN 978-981-19-1971-8.
60. Luhmann, T.; Robson, S.; Kyle, S.; Boehm, J. Close-Range Photogrammetry and 3D Imaging; De Gruyter: Berlin, Germany, 2023; ISBN 978-3-11-102967-2.
61. Habich, T.-L.; Stuede, M.; Labbe, M.; Spindeldreier, S. Have I Been Here before? Learning to Close the Loop with LiDAR Data in Graph-Based SLAM. In Proceedings of the 2021 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Delft, The Netherlands, 12–16 July 2021; pp. 504–510.
62. Dellaert, F.; Varun, A.; Roberts, R.; Cunningham, A.; Beall, C.; Duy-Nguyen, T.; Jiang, F.; Lucacarlone; Nikai; Blanco-Claraco, J.L.; et al. Borglab/Gtsam: Release 4.2; 2023. Available online: https://github.com/borglab/gtsam (accessed on 16 October 2024).
63. Carlone, L.; Kim, A.; Dellaert, F.; Barfoot, T.; Cremers, D. From Localization and Mapping to Spatial Intelligence. In SLAM Handbook; Cambridge University Press: Cambridge, UK, 2024.
64. Elalailyi, A.; Trybała, P.; Morelli, L.; Fassi, F.; Remondino, F.; Fregonese, L. Pose Graph Data Fusion for Visual- and LiDAR-Based Low-Cost Portable Mapping Systems. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 48, 147–154.
65. Grupp, M. Evo: Python Package for the Evaluation of Odometry and SLAM. 2017. Available online: https://github.com/MichaelGrupp/evo (accessed on 12 August 2024).
66. Schonberger, J.L.; Frahm, J.-M. Structure-from-Motion Revisited. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4104–4113.
67. Cloud Compare, version 2.13; GPL software; Daniel Girardeau-Montaut: Grenoble, France, 2024.
68. Perfetti, L.; Polari, C.; Fassi, F. Fisheye Photogrammetry: Tests and Methodologies for the Survey of Narrow Spaces. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, 42, 573–580.
Figure 1. ATOM-ANT3D portable fisheye multi-camera mobile mapping system (left) and the spatial arrangement of the cameras demonstrating the four stereo configurations (right).
Figure 2. Schematic of the optimization pipeline for multi-camera V-SLAM approaches. S.I. = Stereo Instance; Opt. = Optimization; Traj. = trajectory.
Figure 3. The proposed multi-view feature-based optimization, transitioning from single stereo V-SLAM pose estimates (top) to a full five-camera system during the constrained optimization (bottom).
Figure 5. TLS point cloud of the case study used as ground truth. (a) Side view showing the dome-like structure; (b) top view showing the U-shape geometry; (c) cross-sectional plan view illustrating the starting and ending points of the acquisition path, along with the narrow passage connections between the nine compartments.
Figure 6. Photogrammetric point cloud of the Minguzzi spiral staircase (left) and its location within the Duomo di Milano (right).
Figure 7. Validation of the Sordine photogrammetric reconstruction using natural points distributed across the surveyed area indicated by red bounding boxes (left) and distribution of natural points along the Minguzzi staircase for photogrammetric referencing (right).
Figure 8. Accuracy validation between the photogrammetric and terrestrial laser scanning point clouds.
Figure 9. Visual comparison of left (blue) and right (red) stereo V-SLAM instances’ trajectory estimation before optimization.
Figure 10. V-SLAM camera pose estimates from a single stereo instance (2 cam) showing persistent drift despite loop closure, leading to a noisy 3D tie point cloud (top); camera pose estimates with minimized drift and reduced outliers after the Multi-View Feature-Based Optimization (5 cam) (bottom).
Figure 11. Multi-View Odometry trajectory results with two (left), three (middle), and five (right) cameras.
Figure 12. Comparison of Absolute Pose Errors (APE) along the individual X, Y, and Z axes of the camera poses for Multi-View Odometry using different numbers of cameras for (a) Sordine and (b) Minguzzi case studies.
Figure 13. Multi-View Odometry (left) and Multi-View V-SLAM integrating the Multi-View Feature-Based Optimization (right).
Figure 14. Propagation of Absolute Pose Error (APE) residuals along the trajectories for all proposed approaches before and after optimization for (a) Sordine and (b) Minguzzi case studies.
Figure 15. Comparison between V-SLAM trajectories (blue) and corresponding ground truth (red) before and after applying the proposed optimization approaches. (a) left, (b) right, (c) front-right, (d) front-left stereo instances, (e) Multi-Instance Stereo V-SLAM, and (f) Multi-View Odometry (left) and Multi-View V-SLAM (right).
Figure 16. Box plot analysis of Relative Pose Error (RPE) between each multi-camera V-SLAM approach’s estimates and their respective ground truth before and after optimization for (a) Sordine and (b) Minguzzi case studies.
Table 1. Total number of images used by each of the proposed approaches. S.I. = stereo instance.

                                        Sordine Study Case    Minguzzi Study Case
Total Dataset                           2731 img/cam          1581 img/cam
Multi-Instance Stereo V-SLAM
    Left S.I.                           636 img/cam           592 img/cam
    Right S.I.                          765 img/cam           489 img/cam
    Front-right S.I.                    987 img/cam           434 img/cam
    Front-left S.I.                     N.A                   321 img/cam
Multi-View V-SLAM                       1955 img/cam          1526 img/cam
Photogrammetry                          2731 img/cam          1581 img/cam
V-SLAM-aided photogrammetry             1955 img/cam          1526 img/cam
Table 2. RMSE results (in meters) for the proposed multi-camera V-SLAM approaches, reported before and after optimization, and compared against traditional photogrammetry and V-SLAM-aided photogrammetry. S.I. = stereo instance.

Cameras Absolute Poses Root Mean Square Error [m]

                                          Sordine case study                                 Minguzzi case study
Approaches                                Before Optim.    Feature-Based    Pose Graph       Before Optim.    Feature-Based    Pose Graph
                                                           Optim.           Optim.                            Optim.           Optim.
Single-instance stereo V-SLAM
    Left S.I.                             0.45 (2 cam)     0.03 (5 cam)     -                0.92 (2 cam)     0.12 (5 cam)     -
    Right S.I.                            1.27 (2 cam)     0.02 (5 cam)     -                0.34 (2 cam)     0.05 (5 cam)     -
    Front-right S.I.                      1.10 (2 cam)     0.07 (5 cam)     -                0.53 (2 cam)     0.05 (5 cam)     -
    Front-left S.I.                       N.A              N.A              -                0.60 (2 cam)     0.26 (5 cam)     -
Multi-View Odometry/V-SLAM (5 cam)        0.64 (Odometry)  0.08 (V-SLAM)    -                0.11 (Odometry)  0.08 (V-SLAM)    -
Multi-Instance Stereo V-SLAM              -                -                0.04             -                -                0.03
Baselines
    Photogrammetry                        0.03                                               0.05
    V-SLAM-aided Photogrammetry           0.04                                               0.06
Table 3. Comparison of Absolute Pose Errors RMSE between two, three, and five cameras in the Multi-View Odometry approach before applying the proposed optimization.

Cameras Absolute Poses Root Mean Square Error [m]

                                          Sordine Study Case    Minguzzi Study Case
Multi-View Odometry    Two cameras        1.932                 0.107
                       Three cameras      0.505                 0.113
                       Five cameras       0.643                 0.110
