The overview of our proposed StereoGS-SLAM system is illustrated in Figure 1, and the algorithmic formulation is summarized in Algorithm 1. Given a stream of stereo images, we first employ a pre-trained stereo network [22] to estimate dense disparity maps. The network is pre-trained on the SceneFlow dataset and further fine-tuned on the target datasets; during SLAM operation, its parameters are kept frozen, preserving real-time performance while providing robust depth estimates. Dense depth maps are then computed from the estimated disparities and the camera intrinsic parameters. These maps offer rich geometric guidance for the 3DGS optimization process, owing to their high spatial density and more complete scene coverage.
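Under the usual rectified-stereo model, metric depth follows from disparity as Z = f·b/d. A minimal sketch of this conversion (the function name and the example focal length and baseline are illustrative, not values from the paper):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a dense disparity map (pixels) to metric depth for a
    rectified stereo pair: Z = f * b / d. Invalid (non-positive)
    disparities are mapped to zero depth."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Example: focal length 700 px, baseline 0.12 m, disparity 35 px -> 2.4 m.
disp = np.array([[35.0, 0.0], [70.0, 14.0]])
depth = disparity_to_depth(disp, focal_px=700.0, baseline_m=0.12)
```

Masking non-positive disparities avoids divisions by zero in occluded or unmatched regions, which would otherwise inject spurious far-away points into the map.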
Algorithm 1 StereoGS-SLAM Overview
Require: Stereo image stream, camera intrinsics, pre-trained stereo and semantic networks
Ensure: Camera poses and Gaussian map
1: Initialize the Gaussian map, set t = 0, select the initial keyframe set
2: for each time step t do
3:    Estimate disparity and depth; compute depth statistics
4:    Initialize Gaussians from depth and colors
5:    Render color, depth, and semantic maps from the current pose and Gaussian map
6:    Tracking: optimize the camera pose by minimizing the tracking loss
7:    Update the keyframe set using the hybrid selection strategy
8:    Mapping: sample keyframes and optimize the Gaussian map via the mapping loss
9:    Update the semantic decoder and semantic features
10:   Apply adaptive scene depth estimation to refine Gaussian scales
11: end for
3.1. 3D Gaussian Scene Representation
We generate the dense Gaussian scene representation $G = \{g_i\}_{i=1}^{N}$ following the method in [4], where $N$ is the number of Gaussians. Each Gaussian $g_i$ is defined by its 3D position $\mu_i \in \mathbb{R}^3$ in the world coordinate system, 3D covariance matrix $\Sigma_i \in \mathbb{R}^{3 \times 3}$, opacity $o_i \in [0, 1]$, RGB color $c_i \in \mathbb{R}^3$, and semantic feature $s_i \in \mathbb{R}^{M}$ ($M$ denotes the number of objects in the scene).
With the 3D Gaussian scene representation parameters, we render multiple modalities at each pixel using the differentiable Gaussian splatting technique [4], including color, depth, and semantic features. Given the camera pose $T \in SE(3)$, the $i$-th 3D Gaussian is projected onto the 2D image plane for rendering, with

$$\Sigma'_i = J_i W \Sigma_i W^{\top} J_i^{\top},$$

where $J_i$ is the Jacobian matrix of the projection function of the $i$-th Gaussian, and $W$ is the rotation part of the camera pose $T$. After projection, the color of a single pixel is rendered by sorting the Gaussians in depth order and performing front-to-back $\alpha$-blending,

$$C = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),$$
where $c_i$ represents the color of the $i$-th Gaussian, and $\alpha_i$ is the density computed from the opacity $o_i$ and the 2D covariance matrix $\Sigma'_i$ as,

$$\alpha_i = o_i \exp\left(-\frac{1}{2} \Delta_i^{\top} {\Sigma'_i}^{-1} \Delta_i\right),$$

where $\Delta_i$ is the offset vector from the pixel center to the projected mean of the $i$-th Gaussian. Similarly, the depth at each pixel is rendered as,

$$D = \sum_{i=1}^{N} d_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),$$

where $d_i$ denotes the depth of the $i$-th Gaussian centroid in the camera coordinate system.
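The front-to-back $\alpha$-blending used for color and depth (and, below, for the silhouette) can be sketched for a single pixel; the per-Gaussian $\alpha$ values are assumed to be already computed from the opacities and 2D covariances:

```python
import numpy as np

def blend_pixel(colors, depths, alphas):
    """Front-to-back alpha compositing for one pixel.
    `colors` (N, 3), `depths` (N,), `alphas` (N,) are per-Gaussian
    contributions already sorted by increasing depth."""
    C = np.zeros(3)
    D = 0.0
    T = 1.0  # accumulated transmittance: prod_j (1 - alpha_j)
    for c, d, a in zip(colors, depths, alphas):
        w = a * T       # blending weight alpha_i * prod(1 - alpha_j)
        C += w * c
        D += w * d
        T *= (1.0 - a)
    return C, D, 1.0 - T  # 1 - T is the cumulative opacity (silhouette)

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
depths = np.array([1.0, 2.0])
alphas = np.array([0.5, 0.5])
C, D, O = blend_pixel(colors, depths, alphas)
```

Note how the second Gaussian only contributes through the transmittance left over by the first, which is what gives nearer primitives their dominant influence.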
Compared with 3D semantic annotations, 2D semantic labels are a more readily available prior. Moreover, directly storing a high-dimensional semantic feature vector in each Gaussian is memory-inefficient. Instead, we follow the same procedure as color and depth rendering. The semantic feature at each pixel is computed as,

$$S = \sum_{i=1}^{N} s_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),$$

where $s_i \in \mathbb{R}^{M}$ represents the semantic feature vector of the $i$-th Gaussian, with $M$ denoting the number of semantic classes. The semantic features are initialized from semantic segmentation predictions produced by the segmentation network [23] on RGB images, which provide the semantic signals used during online operation.
To decode the semantic label at each pixel, we utilize a lightweight CNN decoder $f_{\theta}$ to map the aggregated semantic features to semantic predictions,

$$\hat{S} = f_{\theta}(S),$$

where the semantic decoder $f_{\theta}$ is implemented as a single convolutional layer that transforms the input semantic feature dimension to the number of semantic classes. The decoder is trained end-to-end with the Gaussian optimization process. Specifically, the decoder weights are initialized randomly and updated jointly with the Gaussian semantic features $s_i$ using the Adam optimizer. The semantic decoder and Gaussian semantic features are optimized simultaneously to ensure consistency between the 3D scene representation and semantic understanding. This joint optimization strategy allows the system to refine the initial SAM-based semantic predictions based on the reconstructed 3D scene geometry and photometric consistency, leading to more accurate and spatially coherent semantic segmentation results.
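Since the decoder is a single convolutional layer, a 1x1 convolution reduces to a per-pixel linear map followed by a softmax. A torch-free sketch with hypothetical shapes (feature dimension 8, 3 classes):

```python
import numpy as np

def decode_semantics(feat, weight, bias):
    """A 1x1 convolution over an (H, W, F) feature map is a per-pixel
    linear map; softmax turns logits into class probabilities.
    `weight` is (F, M), `bias` is (M,) for M classes."""
    logits = feat @ weight + bias                   # (H, W, M)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))   # rendered semantic features
W = rng.normal(size=(8, 3))         # randomly initialized, as in the text
b = np.zeros(3)
probs = decode_semantics(feat, W, b)
```

In the actual system these weights would be updated by Adam together with the Gaussian features; the sketch only shows the forward map.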
We encode semantic information directly in the Gaussian primitives via the learnable features $s_i$, which are initialized from 2D semantic predictions and then optimized jointly with geometry. During rendering, the semantic features are splatted and composited using the same opacity-aware accumulation as color and depth, so each pixel receives a weighted mixture of nearby Gaussians along the ray. This effectively spreads semantic cues across neighboring primitives while preserving spatial coherence through the Gaussian support and the visibility weights. The subsequent decoder maps the aggregated feature to class probabilities, and the semantic loss provides feedback that updates both $s_i$ and the decoder, encouraging consistent semantic labels across the map.
We also render a silhouette image to determine the cumulative opacity for each pixel,

$$O = \sum_{i=1}^{N} \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j).$$
3.2. Adaptive Scene Depth Estimation
The variations in depth range across different frames result in significant variance in the 3D Gaussian scales corresponding to these frames, leading to overly large or small Gaussian spheres that degrade reconstruction quality and system robustness. To address this limitation, we propose a novel adaptive scene depth estimation strategy. This strategy dynamically estimates and updates the scene depth, enabling more accurate Gaussian initialization and optimization.
For each incoming frame, we derive robust geometric statistics from the depth distribution to characterize the scene. Based on these statistics, we formulate an adaptive scene radius computation that accounts for both local scene characteristics and global scale variations. The adaptive scaling mechanism employs different strategies for uniform and complex scenes: for relatively uniform depth distributions, we use linear scaling of the central depth value, while for scenes with significant depth variations, we apply a logarithmic transformation to balance the scale estimation for both near and far objects. The adaptive scene radius is defined as,

$$r = \begin{cases} \alpha \, d_{med}, & \sigma_d \le \tau, \\ \beta \, \log(1 + d_{med}), & \sigma_d > \tau, \end{cases}$$

where $d_{med}$ is the median depth, $\sigma_d$ is the depth standard deviation, $\tau$ is a scene depth threshold, and $\alpha$, $\beta$ are scaling parameters.
To ensure temporal consistency and prevent abrupt scale variations that could destabilize the system, we apply a temporal smoothing mechanism based on an exponential moving average with smoothing factor $\eta$, which balances responsiveness and stability. The resulting adaptive scene depth estimate is then integrated into the Gaussian mapping pipeline. Specifically, during Gaussian scale initialization, the mean squared distances are modulated using the estimated scene depth. This strategy allows the system to handle scenes with varying depth ranges effectively and to remain robust under sparse or noisy depth measurements. As the scene depth range evolves over time, the mean Gaussian scale is designed to adapt accordingly rather than strictly converge, reflecting the intended response to changing scene geometry.
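The exact radius rule and its constants are not fully reproduced here; the following sketch assumes the stated behavior (linear scaling of the median depth for near-uniform scenes, logarithmic scaling for large depth spreads, then EMA smoothing), with alpha, beta, tau, and eta as hypothetical parameters:

```python
import numpy as np

def adaptive_scene_radius(depth, alpha=1.0, beta=1.0, tau=2.0):
    """Piecewise scene-radius rule: linear in the median depth when the
    depth distribution is near-uniform, logarithmic when the spread is
    large. alpha, beta, tau are hypothetical parameters."""
    d = depth[depth > 0]
    d_med, d_std = np.median(d), np.std(d)
    if d_std <= tau:                 # relatively uniform depth
        return alpha * d_med
    return beta * np.log1p(d_med)    # significant depth variation

def smooth_radius(r_new, r_prev, eta=0.3):
    """Exponential moving average to avoid abrupt scale jumps."""
    return eta * r_new + (1.0 - eta) * r_prev

depth = np.array([2.0, 2.1, 1.9, 2.0])  # near-uniform -> linear branch
r = adaptive_scene_radius(depth)
r_smoothed = smooth_radius(r, r_prev=4.0, eta=0.5)
```

The EMA keeps the Gaussian scale responsive to genuine changes in scene depth while damping frame-to-frame noise in the depth statistics.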
3.3. Spatial Consistency Mapping
Initial Gaussians are generated from all pixels, as the rendered silhouette map is initially empty. Each Gaussian is initialized with the following properties: color sampled from the corresponding pixel, center position determined by unprojecting the stereo-estimated depth, opacity fixed at 0.5, semantic features randomly initialized using spherical harmonics representation, and radius initialized based on the adaptive scene depth estimation to maintain appropriate projection size in the image plane.
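New Gaussian centers come from unprojecting the stereo-estimated depth through the pinhole intrinsics; a minimal sketch (the intrinsics below are illustrative):

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map to camera-frame 3D points using the
    pinhole intrinsics K (3x3); these points seed Gaussian centers."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)   # (H, W, 3)

K = np.array([[500.0, 0.0, 2.0],
              [0.0, 500.0, 1.0],
              [0.0, 0.0, 1.0]])
pts = unproject(np.full((2, 4), 2.0), K)  # constant 2 m depth map
```

A point at the principal point maps onto the optical axis (x = y = 0) at the measured depth, which is a quick sanity check for the intrinsics ordering.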
After tracking a frame, we enforce spatial consistency in the Gaussian representation through an adaptive expansion strategy. New Gaussians are selectively introduced only in spatial regions that are inadequately represented by existing elements, ensuring comprehensive yet efficient scene coverage. Spatial consistency is maintained by utilizing cumulative opacity and depth information to construct an unobservable region mask $M_u$,

$$M_u(p) = \mathbb{1}\big(O(p) < \tau_o\big) \,\vee\, \mathbb{1}\big(\|\hat{D}(p) - D(p)\|_1 > E_{med}\big),$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, $\tau_o$ represents the cumulative opacity threshold for unobservable regions, $\|\cdot\|_1$ indicates the $\ell_1$-norm, and $E_{med}$ corresponds to the median $\ell_1$-norm error between rendered and stereo-estimated depth.
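A sketch of the unobservable-region test, assuming the mask flags pixels with low cumulative opacity or with depth error above the per-frame median; `tau_o` is a hypothetical threshold value:

```python
import numpy as np

def unobservable_mask(opacity, depth_rendered, depth_stereo, tau_o=0.99):
    """Flag pixels that are poorly covered (low cumulative opacity) or
    whose rendered depth disagrees strongly with the stereo depth
    (L1 error above the per-frame median)."""
    err = np.abs(depth_rendered - depth_stereo)
    return (opacity < tau_o) | (err > np.median(err))

O = np.array([0.5, 1.0, 1.0, 1.0])    # first pixel barely covered
Dr = np.array([1.0, 1.0, 1.0, 5.0])   # last pixel badly reconstructed
Ds = np.array([1.0, 1.0, 1.0, 1.0])
mask = unobservable_mask(O, Dr, Ds)
```

New Gaussians would then be spawned only where `mask` is true, which keeps densification targeted and the primitive count bounded.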
The mapping objective aims to produce a spatially coherent and detailed 3D representation consistent across all observed frames. This is achieved through joint optimization of all Gaussian parameters by minimizing a comprehensive rendering-based objective function. To reduce optimization bias and enhance global map consistency, we employ a random keyframe sampling strategy during mapping. Instead of optimizing all keyframes in each iteration, we randomly sample a subset of keyframes, which discourages overfitting to recent observations and maintains a more balanced representation of the scene. The color loss $L_c$ combines the $\ell_1$-norm with the structural similarity (SSIM) between rendered and observed colors,

$$L_c = (1 - \lambda_{ssim}) \, \|\hat{C} - C\|_1 + \lambda_{ssim} \big(1 - \mathrm{SSIM}(\hat{C}, C)\big),$$

where $\hat{C}$ and $C$ represent the rendered and ground-truth colors, respectively, with $\lambda_{ssim}$ as the balancing factor.
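A sketch of the combined L1 + SSIM color loss; for brevity, the SSIM here is computed globally over the image rather than with the usual sliding window, so it only illustrates the structure of the loss:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM over the whole image (no sliding window)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def color_loss(rendered, gt, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM); lam is the balancing factor."""
    l1 = np.abs(rendered - gt).mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(rendered, gt))

img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
loss_same = color_loss(img, img)      # identical images -> zero loss
loss_diff = color_loss(img, 1 - img)  # mismatched images -> positive
```

The SSIM term penalizes structural discrepancies that a pure per-pixel L1 term can miss, such as locally inverted contrast.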
The geometric loss $L_g$ incorporates depth uncertainty to enhance the geometric accuracy of the reconstructed scene,

$$L_g = \left\| \frac{\hat{D} - D}{\sigma_D} \right\|_1,$$

where $\hat{D}$ and $D$ denote the rendered and estimated depth, respectively, and $\sigma_D$ represents the depth uncertainty estimated from the stereo matching process. The depth uncertainty reflects the reliability of stereo depth estimation, with higher uncertainty assigned to regions where depth estimation is less reliable; it is derived from both local disparity-gradient consistency and image gradient information to provide robust depth estimates across different scene regions. Specifically, the depth uncertainty is formulated as,

$$\sigma_D(p) = \frac{\|\nabla d(p)\|_2}{g_{90}} + \exp\left(-\frac{e(p)}{\kappa}\right),$$

where $p$ denotes the 2D pixel coordinate in the image plane, $\|\nabla d(p)\|_2$ denotes the Euclidean norm of the disparity gradient measuring local disparity variation, $g_{90}$ denotes the 90th percentile of disparity-gradient magnitudes over the entire image, used as a normalization factor, $e(p)$ denotes the Euclidean distance from pixel $p$ to the nearest image edge extracted using the Canny operator, and $\kappa$ denotes a normalization constant controlling the influence range of image edges on the confidence value.
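An illustrative version of the uncertainty map, combining a percentile-normalized disparity-gradient term with an edge-proximity term; a simple gradient threshold stands in for the Canny detector, and `kappa` and `edge_thresh` are hypothetical constants:

```python
import numpy as np

def depth_uncertainty(disp, image, kappa=5.0, edge_thresh=0.2):
    """Per-pixel uncertainty: disparity-gradient norm normalized by its
    90th percentile, plus exp(-dist/kappa), which is large near image
    edges where stereo matching tends to be unreliable."""
    gy, gx = np.gradient(disp)
    grad = np.hypot(gx, gy)
    g90 = np.percentile(grad, 90) + 1e-8
    iy, ix = np.gradient(image)
    edges = np.argwhere(np.hypot(ix, iy) > edge_thresh)
    H, W = disp.shape
    yy, xx = np.mgrid[0:H, 0:W]
    if len(edges) == 0:
        dist = np.full((H, W), np.inf)
    else:
        # brute-force distance to the nearest edge pixel (small images)
        d2 = (yy[..., None] - edges[:, 0])**2 + (xx[..., None] - edges[:, 1])**2
        dist = np.sqrt(d2.min(axis=-1))
    return grad / g90 + np.exp(-dist / kappa)

disp = np.ones((6, 6))                    # flat disparity: no gradient term
img = np.zeros((6, 6)); img[:, 3:] = 1.0  # vertical intensity edge
sigma = depth_uncertainty(disp, img)
```

A real implementation would use a proper Canny detector and a distance transform instead of the brute-force search, but the structure of the two terms is the same.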
The semantic consistency loss $L_s$ is formulated as a multi-class cross-entropy (softmax + CE),

$$L_s = -\sum_{m=1}^{M} S_m \log \hat{S}_m,$$

where $S$ is the one-hot semantic label and $\hat{S}$ is the decoded class-probability vector at each pixel.
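The per-pixel softmax cross-entropy can be sketched as follows (the array shapes are hypothetical):

```python
import numpy as np

def semantic_ce(prob, onehot, eps=1e-12):
    """Multi-class cross-entropy between decoded class probabilities
    (H, W, M) and one-hot labels (H, W, M), averaged over pixels.
    eps guards the logarithm against zero probabilities."""
    return -(onehot * np.log(prob + eps)).sum(axis=-1).mean()

prob = np.array([[[0.7, 0.2, 0.1]]])    # one pixel, three classes
onehot = np.array([[[1.0, 0.0, 0.0]]])  # true class is class 0
loss = semantic_ce(prob, onehot)        # -> -log(0.7)
```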
The composite mapping loss incorporates awareness of unobservable regions while jointly optimizing color, geometric, and semantic consistency,

$$L_{map} = \lambda_c L_c + \lambda_g L_g + \lambda_s L_s,$$

where $\lambda_c$, $\lambda_g$, and $\lambda_s$ are the color, geometric, and semantic consistency loss weights, respectively.
3.4. Tracking and Keyframe Selection
Given a Gaussian scene representation $G$, the camera pose initialization follows a constant-velocity model in $SE(3)$ [24]. In implementation, this is approximated as a first-order extrapolation of the rotation and translation parameters:

$$q_{t+1} = q_t + (q_t - q_{t-1}), \qquad \mathbf{t}_{t+1} = \mathbf{t}_t + (\mathbf{t}_t - \mathbf{t}_{t-1}),$$

where $q$ is the camera quaternion and $\mathbf{t}$ is the translation. To ensure accurate camera pose estimation, only those rendered pixels with reliable depth information are factored into the tracking loss function. The camera pose is then updated iteratively by gradient-descent optimization through differentiably rendering color, depth, and semantic maps, minimizing the following loss function,
$$L_{track} = \lambda_c L_c + \lambda_g L_g + \lambda_s L_s,$$

where $L_c$ simply employs the $\ell_1$-norm between the rendered image and the ground truth, and $\lambda_c$, $\lambda_g$, and $\lambda_s$ are the weights for each term.
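A minimal sketch of the first-order constant-velocity pose extrapolation described above; the quaternion is re-normalized after extrapolation so it remains a valid rotation:

```python
import numpy as np

def propagate_pose(q_prev2, t_prev2, q_prev, t_prev):
    """First-order extrapolation: x_t = x_{t-1} + (x_{t-1} - x_{t-2}),
    applied to the quaternion and translation separately."""
    q = q_prev + (q_prev - q_prev2)
    q = q / np.linalg.norm(q)       # keep a unit quaternion
    t = t_prev + (t_prev - t_prev2)
    return q, t

q0 = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation at t-2
q1 = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation at t-1
t0 = np.array([0.0, 0.0, 0.0])
t1 = np.array([0.1, 0.0, 0.0])       # moved 0.1 m along x
q2, t2 = propagate_pose(q0, t0, q1, t1)
```

This initial guess is then refined by the gradient-based tracking optimization; extrapolating per-component and re-normalizing is a common lightweight stand-in for true SE(3) velocity composition.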
To address the limitations of fixed-interval keyframe selection, we propose a hybrid keyframe selection strategy that combines adaptive motion-based selection with a fixed-interval fallback mechanism. The motion-aware selection computes the relative transformation between the current frame and the most recently selected keyframe, estimating the rotation angle $\theta$ and translation distance $d$,

$$\theta = \arccos\left(\frac{\mathrm{tr}(R_{rel}) - 1}{2}\right), \qquad d = \|t_{rel}\|_2,$$

where $R_{rel}$ and $t_{rel}$ are the rotation matrix and translation vector of the relative transformation, respectively. If either $\theta$ exceeds the rotation threshold or $d > 0.1$ m, the current frame is selected as a keyframe. Otherwise, the method defaults to fixed-interval selection for temporal regularity.
Additionally, special handling ensures robust initialization and completion: the first and final frames are always selected as keyframes. We implement random keyframe sampling during optimization, selecting a subset of keyframes each iteration to reduce computational load while maintaining global map consistency.