Abstract
Realizing real-time dense 3D reconstruction on resource-limited mobile platforms remains a significant challenge, particularly in low-texture environments that demand robust multi-frame fusion to resolve matching ambiguities. However, the inherent tight coupling of pose estimation and mapping in traditional monolithic SLAM architectures imposes a severe restriction on integrating high-complexity fusion algorithms without compromising tracking stability. To overcome these limitations, this paper proposes MDR–SLAM, a modular and fully decoupled stereo framework. The system features a novel keyframe-driven temporal filter that synergizes efficient ELAS stereo matching with Kalman filtering to effectively accumulate geometric constraints, thereby enhancing reconstruction density in textureless areas. Furthermore, a confidence-based fusion backend is employed to incrementally maintain global map consistency and filter outliers. Quantitative evaluation on the NUFR-M3F indoor dataset demonstrates the effectiveness of the proposed method: compared to the standard single-frame baseline, MDR–SLAM reduces map RMSE by 83.3% (to 0.012 m) and global trajectory drift by 55.6%, while significantly improving map completeness. The system operates entirely on CPU resources with a stable 4.7 Hz mapping frequency, verifying its suitability for embedded mobile robotics.
1. Introduction
Autonomous mobile robots are increasingly deployed in modern logistics warehouses, office buildings, and home environments [1,2]. Their ability to perform tasks such as path planning, autonomous navigation, and safe interaction depends to a large extent on accurate three-dimensional perception of the surrounding environment [3,4]. High-fidelity three-dimensional models are crucial for these downstream tasks [5,6]. Thanks to the low cost and information richness of vision sensors, visual SLAM (Simultaneous Localization and Mapping) has flourished. However, mainstream systems such as ORB-SLAM [7,8] and VINS-Mono [9] mainly generate sparse maps intended for localization; these maps lack geometric detail and fall far short of the requirements of dense reconstruction [10,11]. This has spurred extensive research into high-density, high-fidelity 3D mapping. In the context of visual SLAM, low-texture environments are rigorously defined as scenes dominated by large, uniform, or monochromatic surfaces (e.g., plain industrial walls, bare concrete floors, or dimly lit areas) that intrinsically lack the high-frequency intensity variations necessary for accurate feature extraction and stereo matching. In such scenes, traditional passive stereo methods such as ELAS suffer from high-variance depth measurements, which rapidly lead to sparse, incomplete, and noisy 3D maps.
Although significant progress has been made in dense reconstruction, achieving high-quality modeling remains a long-standing challenge in large-scale low-texture scenarios dominated by man-made structures such as walls and floors [12,13]. In such environments, the lack of stable, distinguishable visual features means that traditional stereo matching algorithms, which usually rely on feature matching or photometric consistency (e.g., SGBM [14], PatchMatch [15]), encounter severe matching ambiguity. This ambiguity directly produces large data voids, significant noise, or incorrect geometric structures in the reconstructed model, seriously degrading the integrity and usability of the map and, in turn, the navigation and obstacle-avoidance capabilities of the robot. To meet the low-texture challenge, the research community has explored a variety of strategies, such as relying on geometric priors (e.g., planarity assumptions) or fusing multi-frame information. Among these, temporal filtering, which exploits camera motion to accumulate weak geometric cues over time, has proven to be a particularly promising direction. However, such multi-frame fusion algorithms usually demand substantial computational resources. In the traditional monolithic SLAM architecture (e.g., LSD-SLAM [16] and DSO [17]), real-time front-end pose estimation and back-end mapping are tightly coupled in a single process. Under this design, the computational budget must prioritize localization performance, which severely restricts the integration and deployment of complex, time-consuming dense mapping algorithms [18]. This inherent conflict constitutes a key research bottleneck for high-quality dense mapping.
To break this bottleneck and specifically address the reconstruction challenges of low-texture environments, this paper introduces MDR–SLAM: a modular, fully decoupled dense three-dimensional reconstruction framework. The core idea of our work is to unlock the performance potential of mapping algorithms through architectural innovation. MDR–SLAM adopts the modular design philosophy common in the Robot Operating System (ROS) [1] and decouples the system into three independent nodes that communicate asynchronously through standardized message interfaces: pose estimation, dense depth computation, and map fusion. This design makes it possible to integrate a complex yet efficient keyframe-driven temporal filter without sacrificing front-end localization performance. The filter combines ELAS stereo matching [19] with Kalman filtering to continuously and stably aggregate temporal information, thereby effectively densifying low-texture regions. The main contributions of this work are summarized as follows:
- A novel modular system architecture: We propose the MDR–SLAM framework, whose decoupled design accommodates computation-intensive mapping algorithms to cope with challenging scenarios that traditional monolithic systems struggle to handle.
- A robust temporal dense depth estimation algorithm: We design a keyframe-driven temporal filtering algorithm specifically aimed at increasing point cloud density and reconstruction completeness in low-texture environments.
- A dynamic fusion strategy for long-term mapping: We propose an incremental fusion strategy with a confidence model. It dynamically maintains the global map through a closed-loop “fusion-update-cleaning” process to ensure long-term consistency in large-scale scene mapping.
2. Related Work
2.1. Pose Estimation
Accurate and robust pose estimation is the geometric foundation of all three-dimensional reconstruction tasks, because even small accumulated errors can produce ghosting or distortion in the final map. In visual SLAM, mainstream methods can be roughly divided into feature-based methods and direct methods. Feature-based methods, such as the pioneering PTAM, achieve high-precision localization by extracting and matching sparse feature points. In contrast, direct methods use the photometric information of pixels to estimate camera motion; representative works include SVO [20] and DTAM [21]. However, such systems have two core limitations: first, they primarily serve localization, and the sparse maps they generate lack sufficient geometric detail to be used directly for dense three-dimensional reconstruction [22]; second, most adopt a monolithic architecture that tightly couples all functional modules, which hampers subsequent scalability and performance enhancements. In recent years, deep-learning-based visual SLAM has shown great potential [23], but its computational overhead and hardware dependence remain key considerations.
2.2. Dense 3D Reconstruction in Low-Texture Scenes
Dense three-dimensional reconstruction, especially multi-view stereo (MVS), faces great challenges in large low-texture or texture-free areas [24,25]. The challenge stems mainly from the extreme difficulty of establishing reliable pixel-level matches in such regions. Traditional MVS methods, such as the classic PMVS [26], rely mainly on photometric consistency across multiple views to compute depth. Although effective in texture-rich areas, these methods often fail in low-texture areas because the photometric cues are ambiguous. Some advanced traditional methods introduce geometric constraints (such as planar priors) to alleviate this problem [27], but they remain insufficient under extremely low-texture conditions. In recent years, deep-learning-based MVS has made remarkable breakthroughs. Networks such as R-MVSNet [28] and CasMVSNet [29] learn robust image features and matching priors from large-scale datasets, allowing them to handle some low-texture surfaces better [30]. However, these methods also have limitations: first, their generalization ability weakens when the test scene deviates significantly from the training data distribution; second, they usually require substantial computing and memory resources [31]. The latest trend is toward novel scene representations such as neural radiance fields (NeRF), which can generate high-quality models, but their high training and rendering costs limit widespread deployment in mobile mapping applications.
2.3. System Architectures in SLAM
The mainstream architecture of contemporary visual SLAM systems is monolithic: all modules, including front-end tracking, back-end optimization, and mapping, run in a tightly coupled process. To guarantee real-time performance, this architecture requires strict control of the computation time of every module, which often rules out more complex but more effective mapping algorithms. In contrast, the modular or decoupled paradigm treats different functional modules as independent processes that communicate through message-passing mechanisms such as the Robot Operating System (ROS). Although modularity is a common paradigm in robotics, few studies in visual SLAM have explicitly employed a decoupled architecture as a strategic tool. Modern SLAM systems are evolving toward multimodal and multitask capability, such as integrating semantics, inertia, and sparse point features, which places higher demands on system modularity and scalability. Based on this observation, this work explores a decoupled architecture to overcome the challenges faced by traditional MVS.
3. Method
This section details the overall system architecture, the implementation of each core functional module, and the key techniques that enable inter-module collaboration.
3.1. System Overview
The proposed framework follows a modular design and decomposes the complex reconstruction task into three independent functional nodes that communicate asynchronously through standard ROS message interfaces. This architecture not only enhances the maintainability of the system but also greatly simplifies replacing or upgrading any algorithmic component in the future. Compared with monolithic systems, this ROS-based modular design offers significant engineering advantages. First, it enhances system robustness, because the failure of one node (for example, the depth estimation module) does not bring down the whole system, enabling graceful degradation. Second, it improves scalability, so that a single module can easily be upgraded or replaced with an alternative with minimal integration effort (for example, replacing the ORB-based pose estimator with a VINS-based one). Finally, it allows flexible resource allocation, because computation-intensive nodes can be distributed to different processing units when necessary, optimizing the use of on-board computing hardware. The overall data flow of the system is shown in Figure 1.
Figure 1.
The decoupled dense reconstruction framework proposed in this paper. The system consists of three independent ROS nodes that exchange data through ROS topics. In the ‘Delaunay Triangulation’ inset, black dots represent support points, grey circles denote circumcircles, and red dots indicate query pixels. In the ‘Disparity Estimation’ map, the color spectrum represents disparity magnitude, ranging from blue (far) to red (near).
The system’s workflow is as follows:
- Data input: The system uses a stereo camera as its main sensor, which publishes left and right image streams at a fixed frequency together with the corresponding camera calibration information.
- Parallel processing: The pose estimation module (Node 1) and the depth estimation module (Node 2) subscribe to these image streams in parallel. Node 1 focuses on computing the six-degree-of-freedom (6-DoF) pose of the camera in real time, while Node 2 generates a dense point cloud, in the current camera coordinate system, for each stereo image pair.
- Data fusion: The map fusion module (Node 3), which forms the back end of the system, subscribes to the outputs of the first two nodes. It adopts a time-synchronization mechanism to ensure that each point cloud is associated with its best temporally matched camera pose.
- Map generation: After performing the coordinate transformation, Node 3 merges the local point clouds, expressed in the world coordinate system, into a global map. To keep the data volume manageable, the system downsamples the global map and finally publishes a global dense point cloud for visualization or other downstream applications.
The core innovation of the system lies in its decoupled architecture, which addresses the poor flexibility, difficult maintenance, and limited scalability common in existing tightly coupled systems.
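To make the data-fusion step concrete, the following minimal sketch shows how the fusion node might pair the two asynchronous streams with a ROS approximate-time synchronizer. It is illustrative only: the topic names, message types, queue size, and slop value are assumptions, not the paper's actual configuration.

```python
#!/usr/bin/env python
# Minimal sketch of the fusion node's input side (illustrative only).
# Topic names, queue size, and slop are assumptions, not the actual configuration.
import rospy
import message_filters
from geometry_msgs.msg import PoseStamped
from sensor_msgs.msg import PointCloud2

def synced_callback(pose_msg, cloud_msg):
    # Each local point cloud arrives paired with the pose whose timestamp is
    # closest to it; transformation into the world frame would happen here.
    rospy.loginfo("Fusing cloud at t=%.3f with pose at t=%.3f",
                  cloud_msg.header.stamp.to_sec(), pose_msg.header.stamp.to_sec())

if __name__ == "__main__":
    rospy.init_node("map_fusion_node")
    pose_sub = message_filters.Subscriber("/pose_estimation/keyframe_pose", PoseStamped)
    cloud_sub = message_filters.Subscriber("/depth_estimation/local_cloud", PointCloud2)
    # Approximate time synchronization pairs the two asynchronous streams.
    sync = message_filters.ApproximateTimeSynchronizer([pose_sub, cloud_sub],
                                                       queue_size=10, slop=0.05)
    sync.registerCallback(synced_callback)
    rospy.spin()
```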
3.2. Pose Estimation Module
In visual simultaneous localization and mapping, the core task of pose estimation is to recover the motion trajectory of the camera in three-dimensional space from a series of images [32,33,34]. Our system employs a feature-based front end that solves for camera motion by establishing constraints between 2D image features and 3D world points. The projection of a 3D point $\mathbf{P}_w$ (in the world frame) onto pixel coordinates $\mathbf{p}$ can be described by the pinhole camera model:
$$ s\,\mathbf{p} = \mathbf{K}\,[\mathbf{R} \mid \mathbf{t}]\,\mathbf{P}_w $$
where $s$ is the depth scale factor, $\mathbf{K}$ is the camera intrinsic matrix, and $[\mathbf{R} \mid \mathbf{t}]$ is the camera extrinsic matrix, composed of a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$.
We utilize ORB features for robust matching. During initialization, the relative pose is recovered by solving the epipolar constraint:
$$ \mathbf{p}_2^{\top}\,\mathbf{F}\,\mathbf{p}_1 = 0 $$
Here, $\mathbf{p}_1$ and $\mathbf{p}_2$ are the homogeneous coordinates of a matched feature pair, and $\mathbf{F}$ is the fundamental matrix, which encapsulates the relative rotation and translation between the two views and can be estimated from at least eight point correspondences.
Once an initial map is constructed, for each new image frame the camera’s absolute pose can be estimated by finding correspondences between existing 3D map points and their 2D projections in the current image. The PnP algorithm solves the projection equation
$$ \{\mathbf{R}^{*}, \mathbf{t}^{*}\} = \arg\min_{\mathbf{R},\,\mathbf{t}} \sum_{i} \left\| \mathbf{p}_i - \frac{1}{s_i}\,\mathbf{K}\left(\mathbf{R}\,\mathbf{P}_i + \mathbf{t}\right) \right\|^{2} $$
to find the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ that minimize the reprojection error between each observed 2D feature point $\mathbf{p}_i$ and the projection of its corresponding 3D point $\mathbf{P}_i$.
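For illustration, both geometric solvers can be sketched with OpenCV as follows. This is not the authors' implementation (Node 1 builds on an ORB-SLAM2-style pipeline, see Section 4.5); the RANSAC thresholds and the use of the essential rather than the fundamental matrix are illustrative choices.

```python
# Illustrative sketch of two-view initialization and PnP tracking with OpenCV.
# Thresholds and solver choices are assumptions, not the paper's implementation.
import cv2
import numpy as np

def initialize_from_two_views(pts1, pts2, K):
    """Recover relative pose from matched 2D points via the epipolar constraint."""
    # Essential matrix (the calibrated counterpart of the fundamental matrix F).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t

def track_frame_pnp(pts3d, pts2d, K):
    """Estimate the absolute camera pose from 3D map points and their 2D observations."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None,
                                                 reprojectionError=2.0)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    return R, tvec
```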
In order to reduce the inevitable drift accumulated by inter-frame motion estimation, it is necessary to fine-tune the camera trajectory and three-dimensional map structure through back-end optimization.
- Bundle Adjustment (BA): As the gold standard for refinement in visual SLAM, bundle adjustment is a large-scale nonlinear optimization problem that minimizes the total weighted reprojection error over a set of camera poses and three-dimensional points. Its cost function is expressed as follows:
$$ \min_{\{\mathbf{R}_j, \mathbf{t}_j\},\,\{\mathbf{P}_i\}} \; \sum_{i,j} \left\| \mathbf{p}_{ij} - \frac{1}{s_{ij}}\,\mathbf{K}\left(\mathbf{R}_j\,\mathbf{P}_i + \mathbf{t}_j\right) \right\|_{\Sigma_{ij}}^{2} $$
Solving this minimization problem refines all variables jointly, yielding a globally optimal and geometrically consistent estimate. In practice, to preserve real-time performance, full bundle adjustment is typically performed only upon loop closure detection, while the computationally cheaper local bundle adjustment is carried out over a sliding window composed of the most recent keyframes (a minimal sketch of the minimized residual is given after this list).
- Pose Graph Optimization: For long-term, large-scale mapping tasks, maintaining global consistency is crucial. This is addressed primarily through pose graph optimization, especially after a loop closure is detected. A pose graph is a simplified representation of the problem in which each node represents the pose of a camera keyframe and each edge represents a spatial constraint between two nodes; these constraints originate from odometry between consecutive frames or from non-consecutive loop closure detections. The goal of pose graph optimization is to adjust the poses of all nodes in the graph so as to minimize the error of these constraints. Its cost function can be written as follows:
$$ \min_{\{\mathbf{T}_i\}} \; \sum_{(i,j)\in\mathcal{E}} \left\| \log\!\left( \mathbf{T}_{ij}^{-1}\,\mathbf{T}_i^{-1}\,\mathbf{T}_j \right)^{\vee} \right\|_{\Sigma_{ij}}^{2} $$
where $\mathbf{T}_i$ and $\mathbf{T}_j$ are keyframe poses, $\mathbf{T}_{ij}$ is the measured relative transformation of edge $(i, j)$, and $\Sigma_{ij}$ is the associated covariance.
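For concreteness, the sketch below shows the reprojection residual that a local BA step minimizes over a sliding window, using an illustrative axis-angle pose parameterization and SciPy's `least_squares`. It is a minimal sketch only; the actual system relies on the optimizer of its ORB-SLAM2-based front end.

```python
# Minimal sketch of the reprojection residual minimized by (local) bundle adjustment.
# Parameterization and solver choice are illustrative, not the system's implementation.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, K, observations, n_cams):
    """params = [6 pose params per camera (axis-angle + t)] + [3 coords per point]."""
    residuals = []
    for cam_idx, pt_idx, uv in observations:              # uv: observed pixel (2,)
        pose = params[cam_idx * 6: (cam_idx + 1) * 6]
        R = Rotation.from_rotvec(pose[:3]).as_matrix()
        t = pose[3:6]
        P = params[n_cams * 6 + pt_idx * 3: n_cams * 6 + (pt_idx + 1) * 3]
        p = K @ (R @ P + t)                               # pinhole projection
        residuals.append(p[:2] / p[2] - uv)
    return np.concatenate(residuals)

# A local BA step then refines all poses and points in the sliding window, e.g.:
# result = least_squares(reprojection_residuals, x0, args=(K, observations, n_cams))
```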
3.3. Dense Depth Estimation with Temporal Filtering
To balance accuracy and efficiency, Node 2 integrates the efficient large-scale stereo matching (ELAS) algorithm [19] with a keyframe-guided temporal filter.
3.3.1. ELAS: Single-Frame Depth Estimation
ELAS casts stereo matching as a probabilistic generative model. The algorithm first selects a set of high-confidence support points in texture-rich areas by analyzing image gradients and other cues, and then triangulates the image plane over these points. The mathematical essence of ELAS is that it associates each support point $s$ with a local plane defined by the parameter vector $\boldsymbol{\pi}_s = (\alpha_s, \beta_s, \gamma_s)$, such that the disparity of any pixel $(u, v)$ lying on this plane is $d(u, v) = \alpha_s u + \beta_s v + \gamma_s$. Given the triangulated support points, the goal of the algorithm is to find the most likely disparity value for each remaining pixel by maximizing the posterior probability (MAP):
$$ d^{*} = \arg\max_{d}\; p\!\left(d \mid S, \mathbf{o}^{(l)}, \mathbf{o}^{(r)}\right) \;\propto\; p\!\left(d \mid S, \mathbf{o}^{(l)}\right)\, p\!\left(\mathbf{o}^{(r)} \mid d, \mathbf{o}^{(l)}\right) $$
where $S$ denotes the set of support points and $\mathbf{o}^{(l)}$, $\mathbf{o}^{(r)}$ the left- and right-image observations.
The estimation process consists of two core probability models:
- Prior Probability $p\!\left(d \mid S, \mathbf{o}^{(l)}\right)$: Based on the prior assumption that disparity varies smoothly across the surfaces spanned by neighboring support points, the model places a Gaussian prior on the disparity, centered on the value interpolated from the triangulated support planes. Its probability density is directly proportional to the following:
$$ p\!\left(d \mid S, \mathbf{o}^{(l)}\right) \;\propto\; \exp\!\left( -\frac{\big(d - \mu(S, \mathbf{o}^{(l)})\big)^{2}}{2\sigma^{2}} \right) $$
where $\mu(S, \mathbf{o}^{(l)})$ is the disparity predicted at the pixel by the support-point triangulation and $\sigma$ controls the strength of the prior.
- Likelihood Function $p\!\left(\mathbf{o}^{(r)} \mid d, \mathbf{o}^{(l)}\right)$: The likelihood measures how well the right-image observation agrees with the left-image observation under a given disparity (plane) hypothesis:
$$ p\!\left(\mathbf{o}^{(r)} \mid d, \mathbf{o}^{(l)}\right) \;\propto\; \exp\!\left( -\beta \left\| \mathbf{f}^{(l)} - \mathbf{f}^{(r)} \right\|_{1} \right) $$
where $\mathbf{f}^{(l)}$ and $\mathbf{f}^{(r)}$ are local feature descriptors at the left pixel and at the right pixel displaced by $d$, and $\beta$ is a constant.
After estimating the optimal local plane model for all support points, the final step is to infer the disparity of all remaining non-support pixels in the image. ELAS constructs a Markov Random Field (MRF) and solves it with efficient graph inference algorithms such as belief propagation. In this process, each pixel is assigned to the plane of the neighboring support point it most likely belongs to, or is marked as an outlier. Once a pixel $\mathbf{p} = (u, v)$ is assigned to the plane of support point $s$, its disparity is computed directly from the optimal plane parameters $\boldsymbol{\pi}_s^{*} = (\alpha_s, \beta_s, \gamma_s)$:
$$ d_{\mathbf{p}} = \alpha_s u + \beta_s v + \gamma_s $$
After the disparity value is calculated for each pixel, it is converted to a depth value in 3D space based on the pinhole camera geometry. The relationship between depth $Z$ and disparity $d$ is defined by the following equation:
$$ Z = \frac{f \cdot b}{d} $$
where $f$ is the focal length of the camera and $b$ is the baseline length of the stereo camera. Through this series of steps, the ELAS algorithm generates dense depth information for the vast majority of pixels in the image, laying a solid foundation for subsequent three-dimensional reconstruction.
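A minimal sketch of this back-projection step, converting an ELAS disparity map into a camera-frame point cloud, is given below. The minimum-disparity cutoff is an illustrative value, and the intrinsics are placeholders for the calibrated parameters.

```python
# Sketch: converting a dense disparity map to metric depth and a camera-frame
# point cloud via the standard stereo geometry Z = f*b/d. Values are illustrative.
import numpy as np

def disparity_to_points(disparity, fx, fy, cx, cy, baseline, min_disp=0.5):
    """Back-project a dense disparity map into 3D points in the camera frame."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > min_disp                  # reject invalid/near-zero disparities
    Z = np.zeros_like(disparity, dtype=np.float64)
    Z[valid] = fx * baseline / disparity[valid]   # depth from disparity
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.stack([X[valid], Y[valid], Z[valid]], axis=-1)   # (N, 3) array
```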
3.3.2. Temporal Depth Filtering
To suppress single-frame noise, we implement a recursive Bayesian update scheme anchored to keyframes. Upon keyframe creation, a depth map is initialized with high uncertainty. For subsequent frames, 3D points are projected onto the current image plane, and valid measurements are fused using a Kalman filter update that iteratively reduces the depth variance. Points are back-projected into the 3D map only when their variance converges below a strict threshold, ensuring that only geometrically stable structures are reconstructed.
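A minimal per-pixel sketch of this recursive update follows, assuming the depth measurements have already been warped into the keyframe. The initial variance, measurement variance, and convergence threshold are illustrative values, not the system's tuned parameters.

```python
# Sketch of the per-pixel recursive (Kalman) depth update anchored to a keyframe.
# Noise parameters and the convergence threshold are illustrative values only.
import numpy as np

class KeyframeDepthFilter:
    def __init__(self, height, width, init_var=4.0, meas_var=0.25, conv_var=0.01):
        self.mu = np.zeros((height, width))            # fused depth estimate (m)
        self.var = np.full((height, width), init_var)  # high initial uncertainty
        self.meas_var = meas_var                       # per-measurement depth variance
        self.conv_var = conv_var                       # export threshold

    def update(self, depth_meas, valid_mask):
        """Fuse a new depth measurement (projected into the keyframe) per pixel."""
        m = valid_mask & (depth_meas > 0)
        # Kalman gain for a scalar state that is observed directly.
        k = self.var[m] / (self.var[m] + self.meas_var)
        self.mu[m] = self.mu[m] + k * (depth_meas[m] - self.mu[m])
        self.var[m] = (1.0 - k) * self.var[m]

    def converged_depth(self):
        """Return the depth map and a mask of pixels whose variance has converged."""
        return self.mu, self.var < self.conv_var
```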
3.4. Global Map Fusion
Node 3 constructs a globally consistent map by integrating local point clouds via a confidence-based voting mechanism.
3.4.1. Coordinate Transformation
Local point clouds are transformed into the global world frame using the optimized pose matrix:
$$ \mathbf{P}_w = \mathbf{T}_{wc}\,\mathbf{P}_c $$
Here, $\mathbf{P}_w$ and $\mathbf{P}_c$ represent the homogeneous coordinates of a point in the world and camera frames, respectively, while $\mathbf{T}_{wc}$ is a 4 × 4 rigid-body transformation matrix from the camera coordinate system to the world coordinate system. This matrix is provided in real time by the pose estimation module.
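In code, this step is a single homogeneous matrix multiplication; the sketch below assumes the local cloud is an N × 3 NumPy array.

```python
# Sketch: transforming a local point cloud from the camera frame to the world
# frame with the 4x4 pose T_wc supplied by the pose estimation node.
import numpy as np

def transform_to_world(points_cam, T_wc):
    """points_cam: (N, 3) array in the camera frame; T_wc: (4, 4) rigid transform."""
    ones = np.ones((points_cam.shape[0], 1))
    homog = np.hstack([points_cam, ones])   # homogeneous coordinates (N, 4)
    return (T_wc @ homog.T).T[:, :3]        # back to Cartesian world coordinates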
3.4.2. Incremental Map Fusion with Confidence Updating
Incremental Fusion and Cleaning. We employ a dynamic voxel grid for data management. Incoming points are downsampled and associated with existing map points via a nearest-neighbor search. If a match with a consistent surface normal is found, the points are merged, and a confidence counter is incremented. To maintain map quality, the system periodically executes an outlier cleaning process. Points that fail to accumulate sufficient confidence (indicating transient noise or dynamic objects) are removed from the global map. This “Fuse–Update–Clean” cycle ensures long-term map consistency with minimal memory overhead.
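A simplified sketch of this cycle is given below. It omits the surface-normal consistency check and the voxel downsampling for brevity, ignores duplicate associations within one scan, and uses illustrative values for the association radius and confidence threshold.

```python
# Simplified sketch of the "Fuse-Update-Clean" cycle with nearest-neighbor
# association and a per-point confidence counter. Thresholds are illustrative.
import numpy as np
from scipy.spatial import cKDTree

class ConfidenceMap:
    def __init__(self, assoc_radius=0.05, min_conf=3):
        self.assoc_radius = assoc_radius            # association distance (m)
        self.min_conf = min_conf                    # votes required to keep a point
        self.points = np.empty((0, 3))
        self.conf = np.empty((0,), dtype=int)

    def fuse(self, new_points):
        """Associate incoming (already downsampled) points with the global map."""
        if len(self.points):
            dist, idx = cKDTree(self.points).query(new_points)
            matched = dist < self.assoc_radius
            # Merge matched points (running average) and raise their confidence.
            self.points[idx[matched]] = 0.5 * (self.points[idx[matched]]
                                               + new_points[matched])
            self.conf[idx[matched]] += 1
            new_points = new_points[~matched]
        # Unmatched points enter the map with an initial confidence of 1.
        self.points = np.vstack([self.points, new_points])
        self.conf = np.concatenate([self.conf, np.ones(len(new_points), dtype=int)])

    def clean(self):
        """Periodically drop points that never accumulated enough confidence."""
        keep = self.conf >= self.min_conf
        self.points, self.conf = self.points[keep], self.conf[keep]
```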
4. Results
4.1. Experimental Setup
To comprehensively and quantitatively evaluate the performance of the proposed decoupled dense reconstruction framework, we designed a series of experiments and validated them on multiple public and self-collected datasets. The system was developed and tested on a hardware platform running Ubuntu 20.04. The whole framework is built on the Robot Operating System (ROS), and its core functions are encapsulated in independent ROS nodes. Communication and data flow between nodes are handled by the standard ROS publish/subscribe mechanism, which ensures high modularity and low coupling of the system components.
4.1.1. Datasets
Our experimental evaluation adopts several datasets of different types to comprehensively examine the accuracy, robustness, and generalization ability of the system:
- EuRoC MAV dataset: This is a visual-inertial dataset collected by a drone in indoor environments, which contains high-precision ground truth. We use its stereo image sequences to quantitatively evaluate the trajectory accuracy of the pose estimation front end (Node 1).
- KITTI odometry dataset: As one of the most authoritative benchmarks in the field of autonomous driving, this dataset contains stereo image sequences collected from an on-board platform in large outdoor environments. We use it to evaluate the quality and long-term stability of the system’s dense 3D mapping in such scenarios.
- To further verify the adaptability of the framework across different sensors and scenarios, we used an Intel RealSense D435i camera (Intel Corporation, Santa Clara, CA, USA) to collect a series of challenging indoor scenes. This dataset is used to qualitatively demonstrate the system’s actual reconstruction performance.
- For the core quantitative map evaluation, we utilize the recently published NUFR-M3F (Neufield Robotics Multi-Modal Multi-Floor) Dataset [35]. This dataset is specifically chosen because it provides synchronized stereo visual data (from a ZED camera (Stereolabs, San Francisco, CA, USA)) and high-fidelity 3D map ground truth (generated from a LiDAR sweep), which is essential for calculating our RMSE and Completeness metrics. We utilize the 2nd_floor.bag sequence, which simulates a challenging indoor office environment with prevalent low-texture walls, long corridors, and structural symmetry. The dataset is publicly available at https://github.com/neufieldrobotics/NUFR-M3F?tab=readme-ov-file (accessed on 7 December 2025).
4.1.2. Evaluation Metrics
We evaluate the overall performance of the system from two main perspectives: pose accuracy and map quality. The former is quantified using metrics like Absolute Trajectory Error (ATE), while the latter is assessed by Map Accuracy (RMSE) and Completeness (%).
- Absolute Trajectory Error (ATE, m): We use ATE to measure the global consistency of pose estimation. After aligning the estimated trajectory with the ground truth, this metric computes the root mean square error (RMSE) between all corresponding positions.
- Map Accuracy (RMSE, m): This metric measures how closely the estimated points align with the true surface. It is defined as the root mean square error (RMSE) of the distances from every point in the estimated map to its nearest neighbor in the ground-truth map.
- Map Completeness (%): This measures the percentage of the ground-truth map that has been successfully reconstructed. It is calculated as the proportion of ground-truth points that lie within a distance threshold of their nearest neighbor in the estimated map. A minimal sketch of both map metrics is given after this list.
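The following sketch computes both map metrics with nearest-neighbor queries, assuming the estimated and ground-truth clouds are already expressed in the same world frame; the 5 cm completeness threshold is an illustrative value, not necessarily the one used in the evaluation.

```python
# Sketch of the map accuracy (RMSE) and completeness computations against a
# ground-truth point cloud. The 0.05 m threshold is an illustrative value.
import numpy as np
from scipy.spatial import cKDTree

def map_metrics(est_points, gt_points, completeness_thresh=0.05):
    """est_points (N, 3) and gt_points (M, 3) must share the same world frame."""
    # Accuracy: distance from each estimated point to its nearest ground-truth point.
    d_acc, _ = cKDTree(gt_points).query(est_points)
    rmse = np.sqrt(np.mean(d_acc ** 2))
    # Completeness: fraction of ground-truth points covered by the reconstruction.
    d_comp, _ = cKDTree(est_points).query(gt_points)
    completeness = np.mean(d_comp < completeness_thresh) * 100.0
    return rmse, completeness
```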
4.1.3. Camera Calibration
For the self-collected dataset, we carried out dedicated camera calibration. To ensure the geometric accuracy of the estimated poses and the final map, we accurately determined the intrinsic parameters, the distortion coefficients, and the extrinsic transformation of the stereo camera system. To this end, we adopted the widely used Kalibr toolbox together with a standard checkerboard calibration target for a comprehensive calibration.
We verify the calibration quality by analyzing the reprojection error, whose distribution is shown in Figure 2. The figure shows the reprojection error distribution of the left (cam0) and right (cam1) cameras over the whole calibration image sequence. As shown in the figure, the errors of both cameras are tightly concentrated near the origin (0,0), with most confined to within ±0.2 pixels, indicating very high accuracy. In addition, the error exhibits a random, unbiased distribution, which confirms that the camera model and the distortion coefficients effectively compensate for systematic errors. The final average reprojection error is 0.16 pixels, which verifies the high reliability of this calibration and lays a solid foundation for all subsequent experiments.
Figure 2.
Visualization of camera calibration accuracy. (a) The reprojection error distribution of the left camera (cam0); (b) the reprojection error distribution of the right camera (cam1). The color bar represents the image index.
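Kalibr reports these residuals directly; purely to illustrate how the per-corner residuals plotted in Figure 2 can be reproduced, the sketch below recomputes them with OpenCV, assuming the detected checkerboard corners, per-view poses, intrinsics, and distortion coefficients are available.

```python
# Sketch of the reprojection-error check used to assess calibration quality.
# Assumes checkerboard detections and calibrated parameters are already available.
import cv2
import numpy as np

def reprojection_errors(object_pts, image_pts, rvecs, tvecs, K, dist):
    """Per-corner pixel residuals between detected and reprojected corners."""
    errors = []
    for obj, img, rvec, tvec in zip(object_pts, image_pts, rvecs, tvecs):
        proj, _ = cv2.projectPoints(obj, rvec, tvec, K, dist)
        errors.append(img.reshape(-1, 2) - proj.reshape(-1, 2))
    residuals = np.vstack(errors)
    mean_err = np.mean(np.linalg.norm(residuals, axis=1))
    return residuals, mean_err  # residuals cluster near (0, 0) for a good calibration
```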
4.2. Accuracy Analysis of Pose Estimation
The accuracy of the front-end pose estimation determines the quality of the back-end dense mapping. Therefore, this section quantitatively evaluates the trajectory accuracy of our decoupled pose estimation front end (Node 1) and examines the impact of the proposed ROS architecture on the performance of the underlying SLAM algorithm. We selected five representative sequences from the EuRoC MAV dataset, with absolute trajectory error (ATE) as the core evaluation metric. The quantitative results are detailed in Table 1, while the corresponding trajectory and error distribution visualizations are presented in Figure 3.
Table 1.
Localization accuracy of the system on the EuRoC dataset (ATE).
Figure 3.
Comparison of the estimated trajectories (red dashed lines) and ground truth (blue solid lines) projected onto the XY, XZ, and YZ planes for four EuRoC sequences: (a) MH_01_easy, (b) MH_03_medium, (c) V1_01_easy, and (d) V1_02_medium.
The data presented in Table 1 indicate that the system achieves high-precision localization across all test sequences. The system performs particularly well on the MH_01 (easy) and MH_03 (medium) sequences, with ATE RMSE values of only 0.035 m and 0.036 m, respectively, reaching centimeter-level accuracy. Even on the more challenging MH_05 (difficult) sequence, the system maintains a stable RMSE of 0.050 m. Similarly, the system provides reliable tracking on the V1_01 (easy) and V1_02 (medium) sequences, which contain fast translations and rotations, with RMSE values of 0.087 m and 0.060 m, respectively.
From a qualitative perspective, the visualized trajectories in Figure 3 corroborate these results, showing no obvious divergence or significant drift. Overall, these results confirm that the estimated trajectories of our front end remain closely aligned with the ground truth. This not only verifies the reliability of the decoupled architecture but also shows that the communication overhead introduced by ROS has a negligible impact on localization performance. Figure 4 provides a more detailed quantitative view through the ATE distribution histograms. The errors of the MH_01 and MH_03 sequences are concentrated mainly within 0.05 m, which is highly consistent with the centimeter-level accuracy reported in Table 1. In contrast, the error distribution of the V1_01 sequence is noticeably wider, which is also consistent with its relatively high RMSE value.
Figure 4.
Translation error evolution over time (left) and probability distribution histograms (right) for four EuRoC sequences: (a) MH_01_easy, (b) MH_03_medium, (c) V1_01_easy, and (d) V1_02_medium.
4.3. Ablation Study
To quantitatively validate the individual and synergistic contributions of our core mapping modules—the Temporal Filtering and the Confidence Fusion—we perform a comprehensive ablation study. We compare the performance of the MDR–SLAM full system against a Minimalistic ELAS-only Baseline (Temporal OFF, Confidence OFF).
4.3.1. Low-Texture Qualitative Evidence
Figure 5 provides the necessary qualitative evidence of our system’s robustness in challenging indoor and visually degraded conditions. This figure compares the raw ELAS depth output (middle row) with the stabilized result from the MDR–SLAM Full System (bottom row) across three challenging scenarios captured by the stereo camera (top row).
Figure 5.
Comparison of Mapping Fidelity between ELAS only and MDR–SLAM across three challenging scenarios.
- Uniform Low-Texture (Column a): In the scenario dominated by a large, uniform white wall, the raw ELAS output is plagued by high noise and erroneous depth estimates, resulting in a fractured and highly inconsistent point cloud. The MDR–SLAM output, however, successfully filters this noise, generating a much smoother and geometrically accurate representation of the wall and surrounding structures.
- Specular/Ambiguous Texture (Column b): In the area featuring strip-like white light bands, raw ELAS fails to establish correct correspondences, leading to significant depth ambiguity and large voids in the reconstruction. The MDR–SLAM system successfully aggregates geometric constraints across multiple frames, effectively filling these voids and reconstructing the environment with high fidelity.
- High-Motion Cornering (Column c): During a corridor cornering motion (high translation/rotation), the ELAS depth map suffers from motion blur and frame inconsistency. The MDR–SLAM output maintains structural integrity during this challenging maneuver, demonstrating the filter’s ability to maintain consistency even when input quality is momentarily degraded.
This visual evidence confirms that our multi-frame Temporal Filtering and Confidence Fusion modules are essential for the system to successfully reconstruct complex geometry where single-frame depth estimation (ELAS-only) fails.
4.3.2. Map Quality Analysis
The quantitative results in Table 2 and Figure 6 provide definitive quantitative support for our core claims regarding map quality. The Minimalistic Baseline, relying on raw single-frame ELAS depth output, yields a map error of RMSE = 0.072 m and a completeness of 39.33%. This level of noise confirms that ELAS’s inherent limitations must be addressed architecturally. In direct contrast, the MDR–SLAM full system achieves a substantial reduction in map error, lowering the RMSE to only 0.012 m, an outstanding 83.3% improvement in map accuracy. Completeness is also significantly improved, to 45.15%. This leap in map fidelity validates the synergistic effect of the Temporal Filtering and the Confidence Fusion.
Table 2.
Map and Pose Quantitative Ablation on NUFR-M3F.
Figure 6.
Comparative evaluation of MDR–SLAM and ELAS_only algorithms using FastLIO2 as the reference baseline. (Top): Trajectory comparison on the XY plane. (Bottom): Absolute Pose Error (APE) of translation over time.
4.4. Reconstruction Showcase
Building upon the internal ablation study presented in Section 4.3, this section extends the evaluation to external comparative benchmarks and real-world application scenarios. We first present a rigorous quantitative and qualitative comparison against an established offline MVS baseline to validate the system’s geometric fidelity. Subsequently, we demonstrate the system’s adaptability across diverse indoor environments characterized by varying scales and degrees of texture degradation.
4.4.1. Quantitative Comparison with Baselines
To rigorously benchmark the dense mapping capabilities of our proposed system, we performed a comparative evaluation against OpenMVS (v2.3.0) [26], which is widely regarded as the gold standard for offline multi-view stereo reconstruction. We selected a sequence characterized by large low-texture surfaces (e.g., monochromatic walls and whiteboards) from the NUFR-M3F dataset to specifically evaluate performance in regions where photometric cues are scarce.
The qualitative visual comparison is presented in Figure 7. The corresponding quantitative metrics, evaluated against the high-precision LiDAR ground truth, are detailed in Table 3.
Figure 7.
Qualitative comparison of reconstruction completeness on extreme low-texture surfaces. (a,b) MDR–SLAM results showing dense and continuous surface recovery on the monochromatic wall and whiteboard; (c,d) OpenMVS results exhibiting significant data voids and sparsity in the corresponding regions (highlighted in red boxes).
Table 3.
Quantitative comparison of the extremely low-texture segment.
As illustrated in Figure 7, the fundamental difference lies in the handling of textureless surfaces. OpenMVS (Figure 7c,d) exhibits significant data voids and sparse reconstruction in the monochromatic orange wall and whiteboard regions (highlighted in red boxes). This confirms that standard photometric optimization algorithms fail to establish reliable correspondences in the absence of texture. In contrast, MDR–SLAM (Figure 7a,b) successfully recovers these surfaces with continuous density. This demonstrates that our Temporal Filter, which aggregates geometric constraints over time, offers superior robustness against photometric ambiguity compared to traditional MVS pipelines.
The quantitative results in Table 3 reveal a critical insight. While OpenMVS leverages global offline optimization to attain a completeness of 34.07% versus 26.31% for our method, MDR–SLAM demonstrates superior geometric fidelity. This is evidenced by a significant reduction in reconstruction error, with the RMSE dropping from 0.227 m in OpenMVS to 0.099 m in MDR–SLAM. This higher error in OpenMVS suggests that in low-texture areas, the offline algorithm generates noisy or “hallucinated” geometry in an attempt to maximize coverage. Conversely, our system maintains higher geometric fidelity by effectively filtering out unstable depth measurements, proving that MDR–SLAM provides more reliable geometric information for robot navigation tasks.
Unlike OpenMVS, which relies on unrestricted offline computational resources to maximize reconstruction density, MDR–SLAM prioritizes real-time responsiveness (4.7 Hz) while maintaining structural fidelity. Although this entails a minor trade-off in global completeness, MDR–SLAM demonstrates significant advantages over offline MVS methods in robotic applications requiring online processing.
4.4.2. Indoor Reconstruction in Complex and Degraded Scenes
We evaluate the system’s practical application and generalization ability across challenging indoor environments, using data collected with an Intel RealSense D435i camera (Intel Corporation, Santa Clara, CA, USA).
Complex Geometry and Fidelity (Office): The reconstruction in a typical office environment demonstrates the system’s ability to handle complex geometry and rich fine details. As shown in Figure 8a, the generated dense point cloud preserves the original geometric information with high fidelity. This validates the framework’s capability to produce detail-rich 3D models essential for indoor service robots.
Figure 8.
Indoor reconstruction: (a) office scene reconstruction: capturing fine geometric structure via TSDF-fused triangular mesh. (b) Dense point cloud reconstruction in underground parking.
Perceptually Degraded Conditions (Underground Garage): This experiment validates the system’s robustness in a highly challenging, perceptually degraded environment. The underground parking garage presents classic difficulties for visual SLAM: poor lighting, severe low-texture areas, and highly repetitive structures. Despite these challenges, the system showed extraordinary robustness. As shown in Figure 8b, the concrete columns and the overall “T”-shaped outline of the garage structure were reconstructed with high fidelity and excellent consistency. The system proved its ability to maintain stable tracking and accurately reconstruct key structures even when features are scarce. The robustness shown in this environment is particularly valuable for applications where GPS signals are unavailable.
4.5. Performance and Resource Analysis
To quantify the computational advantage of the proposed decoupled architecture, we conducted a detailed performance analysis of the three core ROS nodes. All experiments were run on a standard Intel CPU platform without GPU acceleration, strictly evaluating performance under resource-constrained conditions.
4.5.1. Decoupling Advantage and Resource Isolation
The results in Table 4 emphatically demonstrate the necessity of the decoupled design in mitigating resource contention. Node 1 (ORB-SLAM2) is dedicated to low-latency pose tracking, yet demands nearly a full core (98.0% CPU usage). Concurrently, the resource-intensive tasks of Depth Estimation (Node 2) and Map Fusion (Node 3) also show high CPU loads (84.8% and 80.4%, respectively). This high concurrent load empirically confirms that a monolithic architecture would inevitably suffer severe resource blocking, preventing the core tracking thread from executing in real time. Our architecture strategically isolates these high-load tasks and exploits multi-core parallelism.
Table 4.
Resource consumption of decoupled nodes (Intel CPU).
4.5.2. Platform Adaptability and Efficiency
MDR–SLAM demonstrates strong resource utilization efficiency tailored for mobile platforms. The entire pipeline operates fully on CPU resources (0% GPU usage), directly confirming its suitability for embedded systems lacking dedicated GPU or VRAM budgets. Furthermore, the total memory footprint remains manageable at approximately 1.3 GB, with Node 3 (Map Fusion) utilizing only ~463 MB to manage the dense point cloud, showcasing superior memory efficiency compared to many GPU-accelerated, learning-based MVS methods.
4.5.3. Real-Time Throughput Validation
Despite the total CPU load exceeding a single core (~2.6 Cores), the system maintains a stable global map update frequency of 4.7 Hz. This validated throughput satisfies the real-time mapping requirements for downstream mobile robot tasks such as path planning and obstacle avoidance, conclusively proving the real-time capability of MDR–SLAM on CPU-constrained hardware.
5. Discussion
The experimental results comprehensively validate that the proposed decoupled architecture effectively resolves the long-standing bottleneck in resource-constrained dense mapping. In this field, systems are typically forced to compromise between tracking stability and reconstruction quality due to computational limitations. By architecturally isolating the computationally intensive temporal filter from the latency-sensitive pose estimation thread, our system successfully leverages multi-frame geometric constraints to resolve matching ambiguities in low-texture environments. Compared with the standard single-frame ELAS baseline, the global trajectory drift was reduced by 55.6%, and the map RMSE was significantly reduced by 83.3%. Furthermore, the comparative analysis with the offline OpenMVS benchmark highlights the key advantages of our confidence-based fusion strategy: unlike methods that generate noise or “hallucinated” structures to maximize coverage in textureless areas, MDR–SLAM prioritizes geometric fidelity through rigorous outlier rejection. This design choice ensures that the generated map maintains high reliability for robotic navigation tasks, demonstrating that high-precision dense reconstruction is achievable in real-time on standard CPU platforms without relying on GPU acceleration.
6. Conclusions
This research addresses the dual challenges of reconstruction quality and computational efficiency in visual SLAM within low-texture indoor environments. We propose MDR–SLAM, a modular and fully decoupled dense 3D reconstruction framework. By architecturally isolating the computationally intensive mapping thread from the latency-sensitive tracking thread, we successfully integrated a keyframe-guided temporal filter and a confidence-based fusion backend on standard CPU hardware without compromising pose estimation stability. The proposed method has been rigorously validated through extensive experiments on the NUFR-M3F dataset. Ablation studies demonstrate that our temporal filtering mechanism effectively resolves matching ambiguities where traditional stereo matching methods fail. Quantitative results indicate that compared to the single-frame baseline (ELAS-only), MDR–SLAM reduces map RMSE by 83.3% (reaching 0.012 m) and decreases global trajectory drift by 55.6%. Compared against the offline “gold standard” OpenMVS, our real-time system exhibits superior geometric fidelity in extremely low-texture zones, achieving an RMSE of 0.099 m compared to the offline benchmark’s 0.227 m. This confirms the system’s ability to effectively filter out photometric noise that typically misleads global optimizers. Crucially, the system maintains a stable mapping frequency of 4.7 Hz operating solely on CPU resources, verifying its suitability for resource-constrained embedded platforms.
Despite these significant advantages, this study has limitations. First, as a pure stereo-vision system, the effectiveness of the temporal filter relies on triangulation parallax; consequently, reconstruction performance degrades in scenarios with insufficient depth observability, such as pure rotational motion or static hovering. Second, to ensure real-time performance and high accuracy, our strict outlier rejection strategy entails a minor trade-off in global map completeness. Future research will focus on overcoming these geometric constraints. We plan to leverage the system’s modular design to integrate Inertial Measurement Units (IMU). By fusing inertial data, we aim to ensure robust scale and depth estimation even during pure rotation or rapid maneuvers, further extending the system’s applicability to complex agile robotic tasks.
Author Contributions
Conceptualization, K.Z. and L.Z.; methodology, K.Z.; software, K.Z.; validation, K.Z.; writing—original draft preparation, K.Z.; writing—review and editing, L.Z.; supervision, L.Z.; project administration, L.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Data will be made available upon request to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Han, J.; Ma, C.; Zou, D.; Jiao, S.; Chen, C.; Wang, J. Distributed Multi-Robot SLAM Algorithm with Lightweight Communication and Optimization. Electronics 2024, 13, 4129. [Google Scholar] [CrossRef]
- Rostum, H.; Vásárhelyi, J. Enhancing Machine Learning Techniques in VSLAM for Robust Autonomous Unmanned Aerial Vehicle Navigation. Electronics 2025, 14, 1440. [Google Scholar] [CrossRef]
- Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Towards the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
- Rodríguez-Lira, D.-C.; Córdova-Esparza, D.-M.; Terven, J.; Romero-González, J.-A.; Alvarez-Alvarado, J.M.; González-Barbosa, J.-J.; Ramírez-Pedraza, A. Recent Developments in Image-Based 3D Reconstruction Using Deep Learning: Methodologies and Applications. Electronics 2025, 14, 3032. [Google Scholar] [CrossRef]
- Kerbl, B.; Kopanas, G.; Leimkuehler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2022, 65, 99–106. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 12776–12786. [Google Scholar]
- Wang, Y.; Zhang, Y.; Hu, L.; Ge, G.; Wang, W.; Tan, S. Improved Feature Point Extraction Method of VSLAM in Low-Light Dynamic Environment. Electronics 2024, 13, 2936. [Google Scholar] [CrossRef]
- Yang, X.; Jiang, G. A Practical 3D Reconstruction Method for Weak Texture Scenes. Remote Sens. 2021, 13, 3103. [Google Scholar] [CrossRef]
- Ma, Y.; Lv, J.; Wei, J. High-Precision Visual SLAM for Dynamic Scenes Using Semantic–Geometric Feature Filtering and NeRF Maps. Electronics 2025, 14, 3657. [Google Scholar] [CrossRef]
- Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
- Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. ACM Trans. Graph. 2009, 28, 1–11. [Google Scholar] [CrossRef]
- Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Computer Vision–ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8690, pp. 834–849. ISBN 978-3-319-10604-5. [Google Scholar]
- Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
- Crocetti, F.; Brilli, R.; Dionigi, A.; Fravolini, M.L.; Costante, G.; Valigi, P. Comparison of DSO and ORB-SLAM3 in Low-Light Environments with Auxiliary Lighting and Deep Learning Based Image Enhancing. J. Field Robot. 2025, 42, 3748–3771. [Google Scholar] [CrossRef]
- Geiger, A.; Roser, M.; Urtasun, R. Efficient Large-Scale Stereo Matching. In Computer Vision–ACCV 2010; Kimmel, R., Klette, R., Sugimoto, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6492, pp. 25–38. ISBN 978-3-642-19314-9. [Google Scholar]
- Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast Semi-Direct Monocular Visual Odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: Hong Kong, China, 2014; pp. 15–22. [Google Scholar]
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Barcelona, Spain, 2011; pp. 2320–2327. [Google Scholar]
- Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3D Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773. [Google Scholar] [CrossRef]
- Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
- Romanoni, A.; Matteucci, M. TAPA-MVS: Textureless-Aware PAtchMatch Multi-View Stereo. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10412–10421. [Google Scholar] [CrossRef]
- Asafa, G.F.; Ren, S.; Mamun, S.S.; Gobena, K.A. DepthCloud2Point: Depth Maps and Initial Point for 3D Point Cloud Reconstruction from a Single Image. Electronics 2025, 14, 1119. [Google Scholar] [CrossRef]
- Furukawa, Y.; Ponce, J. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1362–1376. [Google Scholar] [CrossRef] [PubMed]
- Xu, Q.; Kong, W.; Tao, W.; Pollefeys, M. Multi-Scale Geometric Consistency Guided and Planar Prior Assisted Multi-View Stereo. IEEE Trans. PATTERN Anal. Mach. Intell. 2023, 45, 4945–4963. [Google Scholar] [CrossRef] [PubMed]
- Yao, Y.; Luo, Z.; Li, S.; Shen, T.; Fang, T.; Quan, L. Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5520–5529. [Google Scholar] [CrossRef]
- Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2492–2501. [Google Scholar] [CrossRef]
- Laga, H.; Jospin, L.V.; Boussaid, F.; Bennamoun, M. A Survey on Deep Learning Techniques for Stereo-Based Depth Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1738–1764. [Google Scholar] [CrossRef] [PubMed]
- Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An Open-Source Library for Real-Time Metric-Semantic Localization and Mapping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1689–1696. [Google Scholar] [CrossRef]
- Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; IEEE: Nara, Japan, 2007; pp. 1–10. [Google Scholar]
- Ragot, N.; Khemmar, R.; Pokala, A.; Rossi, R.; Ertaud, J.-Y. Benchmark of Visual SLAM Algorithms: ORB-SLAM2 vs RTAB-Map. In Proceedings of the 2019 Eighth International Conference on Emerging Security Technologies (EST), Colchester, UK, 22–24 July 2019; IEEE: Colchester, UK, 2019; pp. 1–6. [Google Scholar]
- Qin, T.; Cao, S.Z.; Pan, J.; Shen, S.J. A General Optimisation-Based Framework for Global Pose Estimation with Multiple Sensors. IET Cyber-Syst. Robot. 2025, 7, e70023. [Google Scholar] [CrossRef]
- Kaveti, P.; Gupta, A.; Giaya, D.; Karp, M.; Keil, C.; Nir, J.; Zhang, Z.; Singh, H. Challenges of Indoor SLAM: A Multi-Modal Multi-Floor Dataset for SLAM Evaluation. In Proceedings of the 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), Auckland, New Zealand, 26–30 August 2023; IEEE: Auckland, New Zealand, 2023; pp. 1–8. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).