1. Introduction
Endoscopic technology, as a pivotal diagnostic tool in modern minimally invasive medicine, plays a critical role in clinical practice due to its minimally invasive access and real-time imaging capabilities [1]. With the increasing prevalence of gastrointestinal cancers and related diseases, precise intraoperative navigation and spatial awareness are essential for enhancing early lesion detection and ensuring surgical safety. However, in complex anatomical regions, conventional endoscopic systems with a limited field of view often impede surgeons’ ability to precisely assess the spatial distance between instrument tips and target tissues, and the resulting visual blind zones pose operational risks such as tissue injury or perforation. Monocular depth estimation based on deep learning addresses this deficiency by reconstructing 3D depth maps directly from single-frame endoscopic images, compensating for the inherent lack of geometric awareness in endoscopic visualization systems [2]. It provides a fundamental geometric perception basis for safe autonomous robotic intervention, accurate lesion measurement, and quantitative postoperative assessment. Supervised depth estimation methods rely on ground truth for model training, but acquiring high-precision depth labels is difficult in the confined operational environment of an endoscope [3]. In contrast, self-supervised approaches that exploit sequential images have become a research priority: they achieve depth prediction without labeled data through multi-view geometric consistency constraints and Structure-from-Motion (SfM) optimization. Self-supervised methods are more applicable to real clinical data because the natural geometric relationships between consecutive frames in endoscopic videos lend themselves to constructing supervisory signals without labels. In addition, depth representations learned under unlabeled conditions tend to generalize across different organs and lighting conditions.
Several self-supervised paradigms have been explored for monocular depth estimation. Geometry- and temporal-consistency methods [4,5] leverage multi-view reconstruction and Structure-from-Motion (SfM) constraints to predict depth, yet they are sensitive to non-rigid tissue deformation. Optical-flow-based approaches [6] jointly estimate motion and depth to improve robustness in dynamic scenes, but their performance deteriorates under large occlusions. Generative reconstruction models [7] that synthesize target views as implicit supervision are efficient, but they struggle to preserve fine structural details in complex endoscopic scenes.
Recent advancements in model architectures [8] centered on convolutional neural networks (CNNs) and Vision Transformers (ViTs) [9] have demonstrated remarkable efficacy in monocular self-supervised depth estimation, yet both families exhibit inherent limitations. CNNs, a class of deep learning models that use convolutional operations to extract local image features such as edges and textures, offer strong local feature perception, as shown in Figure 1a. However, their restricted receptive fields—the region of the image a model can analyze in a single operation—limit their ability to model geometric correlations between spatially distant regions. ViTs, which process images by dividing them into non-overlapping patch sequences and applying self-attention, establish global long-range dependencies via multi-head self-attention, as illustrated in Figure 1b. Self-attention allows ViTs to capture relationships across the entire image, but computing the attention weights incurs quadratic complexity in the number of tokens, resulting in significant overhead when processing high-resolution endoscopic images. The recently proposed Vision Mamba (ViM) architecture [10], based on Structured State Space Models (SSMs)—a framework for modeling dynamic systems that evolve over time—represents a breakthrough, as shown in Figure 1c. ViM captures positional information across all directions through a cross-scan strategy while maintaining linear computational complexity for long-range spatial relationship modeling. Nevertheless, its predefined fixed scanning paths may adapt poorly to anisotropic feature distributions, where features vary unevenly across directions, in complex anatomical structures. For example, the rigid, geometry-driven scanning order often disrupts the continuity of semantically cohesive regions, thereby hindering effective learning of the structural priors essential for depth estimation.
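To make the fixed scanning order concrete, the short sketch below (our own illustration, not code from any of the cited works) enumerates the four cross-scan traversals typically used by ViM/VMamba-style models over an H × W token grid; every image is serialized along the same paths regardless of its content, which is exactly the rigidity discussed above.

```python
import torch

def cross_scan_orders(h: int, w: int) -> torch.Tensor:
    """Return four fixed cross-scan orders as index permutations of h*w tokens.

    Illustrative reimplementation: row-major forward/backward and
    column-major forward/backward, independent of image content.
    """
    idx = torch.arange(h * w).view(h, w)
    row_fwd = idx.flatten()        # left-to-right, top-to-bottom
    row_bwd = row_fwd.flip(0)      # reverse of the row-major path
    col_fwd = idx.t().flatten()    # top-to-bottom, column by column
    col_bwd = col_fwd.flip(0)      # reverse of the column-major path
    return torch.stack([row_fwd, row_bwd, col_fwd, col_bwd])

# Example: the four fixed serialization paths for a 3x3 patch grid.
print(cross_scan_orders(3, 3))
```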
To obtain omnidirectional positional information and adapt to complex environments while keeping computation low, this work proposes Mono-ViM, a lightweight and efficient self-supervised monocular depth estimation model based on the Mamba architecture. The proposed method uses the efficient sequence modeling of SSMs to achieve linear computational complexity. Relative to the ViM baseline, we introduce a depth-first scanning mechanism coupled with a cross-query interaction module. Specifically, for endoscopic imaging, where depth continuity induces distinct feature representations across tissue layers, we design the Depth-Local Visual Mamba (DLViM) module, which extends the ViM framework with an adaptive depth-first scanning mechanism that serializes image tokens originating from deeper anatomical structures before progressing to shallower ones. This scanning order is inherently aligned with the spatial configuration of endoscopic scenes, allowing the model to first establish a coherent representation of the underlying tissue geometry. As illustrated in Figure 1d, this sequential feature aggregation from deep to shallow strengthens feature correlations along the depth dimension, thereby effectively preserving the global spatial coherence of the environment. Furthermore, inspired by ViTs, we design an enhanced Cross-Query Layer (CQL) in which encoder features serve as object queries that interrogate the decoded depth map to derive depth bins and probability maps for high-precision depth prediction.
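The following simplified sketch illustrates the idea of depth-first serialization. It is our own approximation rather than the released DLViM implementation, and the per-token depth cue used for ordering (depth_cue) is a hypothetical stand-in for whatever ordering signal the module derives internally.

```python
import torch
import torch.nn as nn

def depth_first_scan(tokens: torch.Tensor, depth_cue: torch.Tensor,
                     seq_model: nn.Module) -> torch.Tensor:
    """Illustrative depth-first serialization (simplified sketch).

    tokens:    (B, N, C) flattened image tokens.
    depth_cue: (B, N) coarse per-token depth estimate; larger = farther.
    seq_model: sequence model applied to the reordered tokens
               (a Mamba block in Mono-ViM; a placeholder here).
    """
    order = depth_cue.argsort(dim=1, descending=True)            # deep -> shallow
    gathered = torch.gather(tokens, 1,
                            order.unsqueeze(-1).expand_as(tokens))
    processed = seq_model(gathered)                               # scan in depth order
    out = torch.empty_like(processed)
    out.scatter_(1, order.unsqueeze(-1).expand_as(processed), processed)
    return out                                                    # restore spatial order

# Minimal usage with a placeholder sequence model.
B, N, C = 2, 16, 8
seq_model = nn.Sequential(nn.Linear(C, C), nn.GELU())
x = torch.randn(B, N, C)
d = torch.rand(B, N)
y = depth_first_scan(x, d, seq_model)   # (B, N, C)
```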
The main contributions of this work can be summarized as follows:
Propose Mono-ViM, a lightweight Mamba-based model for monocular depth estimation in endoscopic images.
Propose a novel depth-first scanning strategy to enhance local feature representation.
Propose a Cross-Query Layer to boost fine-grained detail representation.
Demonstrate that Mono-ViM is simple, effective, and more accurate through comprehensive experiments on the SimCol, C3VD, and KITTI datasets.
2. Related Work
2.1. Monocular Depth Estimation
Estimating depth from a single image is an inherently ill-posed problem. Nevertheless, deep learning methods have enabled significant progress in the field.
Using ground-truth depth as supervision, a predictive model can exploit the relationship between color images and their corresponding depth values. Eigen et al. [11] first introduced a multi-scale convolutional framework combining global coarse prediction and local fine refinement, inaugurating the use of CNNs for depth regression. Subsequent studies enhanced accuracy through various architectural and loss-level innovations, such as Conditional Random Fields for post-processing [12,13], the reverse Huber loss, and improved up-sampling modules [14]. These networks demonstrated that deeper and better-designed CNNs could significantly improve pixel-wise depth prediction quality. Fu et al. [15] further proposed a multi-scale ordinal regression strategy. However, dense ground-truth depth remains challenging to acquire in varied real-world settings. Recent work has shown that conventional structure-from-motion (SfM) pipelines [16] can generate sparse training signals for both camera pose and depth.
However, as fully supervised methods advanced rapidly, the availability of precise depth labels in the real world became a major issue. Network architectures also play an important role in achieving good results in self-supervised depth estimation. Numerous studies have explored monocular depth estimation through architectural refinements such as attention mechanisms [17], multi-scale feature modulation [18], and lightweight backbone designs [19]. These methods can capture local and global cues, yet they are limited by either computational complexity or restricted receptive fields. With the evolution of deep vision models, self-supervised depth estimation based on CNNs and ViTs has achieved significant progress. Monodepth2 [20] substantially improves the accuracy and robustness of monocular video depth prediction through a photometric consistency loss over synthesized views and an automatic masking mechanism that handles dynamic objects and occlusions. SQLDepth [21] further captures fine-grained scene structures via self-query layers and self-cost convolution. SPIDepth [22] introduces a more robust pose network to enhance geometric understanding of scenes, thereby achieving more precise depth estimation. Lite-Mono [23] enables real-time and efficient inference on resource-constrained devices through depthwise separable convolutions and channel attention mechanisms. MonoViT [24] leverages vision transformers with multi-scale feature fusion and self-supervised learning to obtain high-precision depth predictions without paired annotations. However, CNNs are constrained by their limited local receptive fields, while ViTs suffer from quadratic computational complexity. To address these limitations, we propose a ViM-based depth estimation model that balances efficient computation and long-range dependency modeling.
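For reference, the photometric consistency objective popularized by Monodepth2, which most of the self-supervised pipelines above build on, compares the target frame with a view synthesized from an adjacent frame using the predicted depth and camera pose, combining an SSIM term with an L1 term (the weight α is commonly set to 0.85):

$$
\mathcal{L}_{p}(I_t, \hat{I}_t) = \alpha \, \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_t)}{2} + (1 - \alpha)\, \big\| I_t - \hat{I}_t \big\|_1
$$

where $I_t$ is the target frame and $\hat{I}_t$ is the view warped from a source frame.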
2.2. State Space Models and Vision Mamba
Structured State Space Models (SSMs) have emerged as a powerful framework for long-range sequence modeling. S4 [25] leverages parameterized state transitions and convolutional kernels to achieve linear-complexity infinite-context modeling. However, its time-invariant design—with static system dynamics over time—limits adaptation to dynamic patterns. Mamba [26] overcomes this via selective SSMs (S6), where input-dependent gating dynamically adjusts state transitions, enabling feature-aware computation without sacrificing efficiency.
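For context, a discretized SSM layer updates a hidden state token by token; in the selective (S6) formulation the step size Δ and the projections B and C are predicted from the current input, which is what makes the state transition input-dependent. In simplified notation:

$$
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t, \qquad
\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B
$$

with Δ, B, and C produced from the input $x_t$ by learned projections, so that the recurrence can selectively retain or forget information.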
VMamba [27] adapts Mamba to vision via cross-scan patch partitioning, transforming 2D images into 1D sequences while preserving local context. LocalMamba [28] processes images through overlapping/non-overlapping windows with local selective scans and cross-window aggregation, balancing global and local modeling. However, the blurred boundaries of medical images challenge fixed scanning strategies. We propose a depth-first scanning approach that prioritizes structurally similar regions, enhancing depth estimation accuracy in blurred medical images.
2.3. Architectural Comparison with Related Models
To clearly elucidate the architectural innovations of Mono-ViM, we first provide a systematic comparison with its most relevant counterparts, LocalMamba [28] and MonoViT [24], in Table 1. While all three frameworks target enhanced visual representation learning, they diverge fundamentally in core architectural paradigms.
LocalMamba [28] processes images through overlapping/non-overlapping windows with local selective scans, emphasizing local feature extraction while maintaining computational efficiency through linear complexity. However, its window-based mechanism inherently limits global receptive fields, which may compromise the modeling of long-range dependencies in endoscopic scenes with continuous anatomical structures. MonoViT [24] leverages vision transformers with multi-scale feature fusion, achieving global context modeling through self-attention mechanisms. Nevertheless, the quadratic computational complexity of self-attention poses significant challenges when processing high-resolution endoscopic images, limiting its practical deployment in real-time applications.
In contrast, Mono-ViM introduces a novel depth-first scanning strategy within the Visual State Space Model (ViM) framework, enabling global receptive field coverage while maintaining linear computational complexity. The proposed Depth Local Visual Mamba (DLViM) module specifically addresses the challenges of endoscopic imaging by prioritizing structurally similar regions through adaptive scanning paths, effectively capturing both local details and global contextual information essential for accurate depth estimation in complex anatomical environments.
Beyond direct architectural comparisons, Mono-ViM’s Mamba-based design offers distinct advantages over other popular depth estimation paradigms. Diffusion-based methods iteratively refine depth maps from noise through a multi-step denoising process; this iterative nature incurs substantial computational overhead and hinders real-time application, whereas Mamba produces depth estimates in a single forward pass and therefore offers a significant advantage in inference speed. Recurrent frameworks (e.g., based on LSTMs or GRUs) share Mamba’s sequential nature and are designed for efficient step-by-step processing, but traditional RNNs often struggle with long-range dependencies due to vanishing gradients. Mamba’s selective state space mechanism fundamentally overcomes this limitation, enabling effective modeling of long-range contextual information—critical for understanding endoscopic scenes—while retaining high computational efficiency.
Thus, as summarized in Table 1, Mono-ViM occupies a unique niche: it integrates the global receptive field of Transformers, overcomes the quadratic complexity bottleneck through selective state spaces, and introduces a depth-first scanning strategy specifically optimized for the structural continuities and blurred boundaries characteristic of endoscopic imagery. This positions it as a highly suitable paradigm for real-time medical imaging applications where accuracy, efficiency, and contextual awareness are paramount.
4. Experiments
In this section, we evaluate the proposed framework on three public datasets.
4.1. Implementation Details
All models were implemented in the PyTorch 1.13.0 and CUDA 11.7 framework, and trained on an RTX 4090 GPU with a batch size of 12. We use AdamW as the optimizer with an initial learning rate of
, which decays to
following a cosine schedule, and training is conducted for 25 epochs. Following the settings in [23], we apply color and flip augmentations to the images during training. To ensure fair comparison, each baseline model uses its own default learning rate settings.
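A minimal sketch of this training setup is given below; it is illustrative only, the network and loss are stand-ins, and the concrete learning-rate values are placeholders since the exact values are not stated above.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in network; in the paper this would be the Mono-ViM depth/pose networks.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# AdamW with cosine decay over 25 epochs, as described above.
# The learning-rate values below are placeholders, not the paper's settings.
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=25, eta_min=1e-6)

for epoch in range(25):
    for _ in range(10):                        # stand-in for the training loader
        images = torch.rand(12, 3, 64, 64)     # batch size 12, augmented frames
        pred = model(images)
        loss = pred.abs().mean()               # placeholder for the self-supervised loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                           # cosine learning-rate decay per epoch
```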
The accuracy is evaluated using the seven standard metrics used in [23]: absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean squared error (RMSE), root mean squared log error (RMSE log), and accuracy under the threshold values δ < 1.25, δ < 1.25², and δ < 1.25³.
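These are the standard monocular depth-evaluation metrics; for reference, a compact implementation following the usual definitions (our own sketch, not the evaluation code of [23]) is:

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Standard monocular depth metrics over valid ground-truth pixels."""
    mask = gt > 0                              # ignore pixels without ground truth
    pred, gt = pred[mask], gt[mask]

    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = ((pred - gt) ** 2).mean().sqrt()
    rmse_log = ((pred.log() - gt.log()) ** 2).mean().sqrt()

    ratio = torch.max(pred / gt, gt / pred)    # per-pixel threshold ratio
    return {
        "abs_rel": abs_rel.item(),
        "sq_rel": sq_rel.item(),
        "rmse": rmse.item(),
        "rmse_log": rmse_log.item(),
        "delta1": (ratio < 1.25).float().mean().item(),
        "delta2": (ratio < 1.25 ** 2).float().mean().item(),
        "delta3": (ratio < 1.25 ** 3).float().mean().item(),
    }
```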
4.2. Datasets
To support intestinal surgical navigation, the C3VD dataset provides real clinical colonoscopy data for realistic evaluation, while the SimCol dataset offers high-precision synthetic data for quantitative validation. In addition, the KITTI dataset is included to assess the method’s generalization beyond medical scenes. The combination of these datasets allows comprehensive evaluation across synthetic, clinical, and outdoor domains.
The SimCol dataset: The SimCol dataset [29] integrates three categories of anatomical structure data derived from real human CT scans, with each category containing multiple generated trajectory paths. Each trajectory contains synchronously captured RGB rendered images, corresponding camera intrinsic parameter matrices, high-precision depth field information, and camera extrinsic pose parameters. The original images are of size
pixels, which is downsampled to
to accommodate model requirements. The data partitioning scheme of the SimCol dataset is illustrated in
Table 2, with the triplet index set as
. The ternary frame structure
means that the current frame is at the center, with the 3 preceding and 3 following frames used as reference frames for training. This choice increases the parallax between adjacent frames while ensuring sufficient overlap between them.
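One simple way to realize such sampling is sketched below; the offset k is a generic stand-in for the index set reported in Table 2, and for simplicity the target frame is paired here with a single reference frame on each side.

```python
def build_triplets(num_frames: int, k: int) -> list[tuple[int, int, int]]:
    """Build (reference_before, target, reference_after) index triplets.

    The target frame sits at the center; frames k steps before and after it
    serve as reference views, enlarging the parallax between views while
    keeping enough overlap for photometric supervision.
    """
    return [(t - k, t, t + k) for t in range(k, num_frames - k)]

# Example: a 10-frame trajectory with an offset of 3.
print(build_triplets(10, 3))   # [(0, 3, 6), (1, 4, 7), (2, 5, 8), (3, 6, 9)]
```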
C3VD: The Colonoscopy 3D Video Dataset (C3VD) [30] is an open-source dataset designed to promote computer vision research in colonoscopy, containing 22 high-definition clinical colonoscopy videos with accurate 3D ground truth. The dataset provides depth maps, surface normals, optical flow, occlusion annotations, and 6-degree-of-freedom poses, effectively addressing the lack of ground-truth data in colonoscopy vision tasks. All images have an original resolution of
pixels, which is downsampled to
to accommodate model requirements. The data partitioning scheme of the C3VD dataset is illustrated in
Table 3, with the triplet index set as
due to the slow movement of the endoscope.
KITTI: The KITTI dataset [31] is a widely adopted benchmark for stereo road scene understanding, encompassing multi-modal data from cameras, LiDAR, and an inertial measurement unit (IMU). Following the Eigen split [32], we partition the monocular sequences into 39,180 triplets for training, 4,424 for validation, and 697 for testing. Notably, the test set employs refined ground-truth depth maps generated through the method described in [33], ensuring enhanced measurement accuracy.
4.3. SimCol Dataset Results
We compare our model with previous representative methods, and the results are presented in Table 4. Mono-ViM outperforms all approaches while being the second smallest model. It achieves AbsRel (0.070), SqRel (0.061), RMSE (0.356), and RMSElog (0.099), demonstrating its accuracy in endoscopic image depth estimation.
Additionally, Mono-ViM exhibits significant performance improvements over Lite-Mono, the prototype model upon which our design builds. In the self-supervised setting, Mono-ViM improves AbsRel by 16.7%, SqRel by 14.8%, RMSE by 6.6%, and RMSElog by 12.4%. These substantial improvements underscore the impact of strengthening the depth decoder network and its information flow in Mono-ViM.
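The percentages above are most naturally read as relative error reductions with respect to the Lite-Mono baseline; for AbsRel, for example, the improvement would be computed as

$$
\Delta_{\text{AbsRel}} = \frac{\text{AbsRel}_{\text{Lite-Mono}} - \text{AbsRel}_{\text{Mono-ViM}}}{\text{AbsRel}_{\text{Lite-Mono}}} \times 100\%.
$$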
Overall, these results validate the effectiveness of Mono-ViM in self-supervised monocular depth estimation. The qualitative results in Figure 4 further demonstrate its superior performance by effectively reducing overestimated depth regions (e.g., diminished red high-depth areas) and generating depth maps that align closely with the ground truth, especially in challenging scenarios. Furthermore, compared to baseline models, our approach exhibits improved coherence, characterized by smoother depth transitions and enhanced inter-region consistency. These improvements contribute to both higher reliability and superior visual quality of the depth estimation outputs.
4.4. C3VD Results
To address the domain gap between the synthetic SimCol data and real clinical imagery, we also train our model on the real-world endoscopic dataset C3VD. As shown in Table 5, Mono-ViM (small) achieves state-of-the-art performance on this real-world benchmark, with AbsRel (0.081), SqRel (0.005), RMSE (0.048), and RMSElog (0.109), demonstrating its robust adaptation to real endoscopic scenes. Qualitative comparisons in Figure 5 further highlight the framework’s superior capability in reconstructing fine-grained depth details from real endoscopic images.
4.5. KITTI Results
Furthermore, we evaluate our model on the outdoor KITTI dataset, conducting comparative experiments with lightweight baseline models (<10 M parameters). These baseline models are tested using their publicly available optimal configurations without additional fine-tuning, ensuring a fair comparison under identical training protocols.
As shown in Table 6, the proposed Mono-ViM-small achieves the lowest absolute relative error (0.081) and the highest accuracy under the threshold δ < 1.25 (0.929) among all compared methods, while maintaining a compact model size of only 3.3 M parameters. Compared to other lightweight models such as Lite-Mono (3.1 M), our method exhibits superior accuracy with marginal parameter overhead. Moreover, it surpasses heavier counterparts like R-MSFMX3-GC (5.0 M) and R-MSFMX6-GC (5.3 M), indicating a more efficient utilization of model capacity. These results demonstrate that Mono-ViM-small offers a favorable trade-off between accuracy and computational cost, making it well-suited for real-world applications with limited hardware resources.
4.6. Ablation Study on Model Architectures
To more systematically evaluate the effectiveness of the proposed DLViM, we performed a qualitative analysis of the gating units across different stages. As illustrated in Figure 6, the activation weights are distributed relatively uniformly over the entire image at the early stage; however, as the network depth increases, these weights progressively concentrate on regions containing salient structural information, indicating that the model develops stronger spatial selectivity and structural awareness during the feature extraction process.
To further assess the effectiveness of the proposed model, we conducted a series of ablation studies to examine the contribution of individual components. All experiments were performed on the SimCol dataset, with results summarized in Table 7.
First, we evaluated the impact of removing the CQL module. This reduced the model size by 0.3 M parameters; however, the RMSE increased by 12.5%, indicating a notable decline in accuracy and highlighting the importance of CQL for the model’s predictive performance. Next, we replaced the DLViM module with a standard CNN. While this replacement also reduced the model size by 0.3 M, it led to a significant increase in prediction error, suggesting that DLViM is critical for capturing long-range global context, a capability lacking in traditional CNNs limited to local feature extraction.
Furthermore, we examined the effect of substituting DLViM with the scanning strategy from LocalMamba. Although this modification did not significantly alter the number of model parameters, it led to a performance drop. This suggests that the proposed depth-first scan mechanism contributes to the model’s ability to understand structural information, reinforcing its role in maintaining global coherence in the learned representation.
5. Discussion and Conclusions
This work presents Mono-ViM, a novel self-supervised monocular depth estimation framework tailored for endoscopic imaging. The proposed architecture employs a hybrid design that seamlessly combines CNNs with Mamba to enhance global contextual understanding and refine semantic detail perception. Experiments on the SimCol and C3VD datasets confirm that Mono-ViM comprehensively outperforms state-of-the-art methods. In the self-supervised setting on the SimCol dataset, the proposed framework reduces errors over current models by up to 16.7% in AbsRel, 14.8% in SqRel, 6.6% in RMSE, and 12.4% in RMSElog. These results demonstrate the benefit of incorporating the DLViM module for capturing local–global feature dependencies and the CQL mechanism for improved fine-grained depth reconstruction. Furthermore, evaluations on the KITTI dataset confirm that the proposed framework maintains robust performance and exhibits strong generalization across varying environments.
The proposed Mono-ViM framework still has several limitations. First, in endoscopic scenes, digestive fluids can produce strong specular reflections that violate the photometric consistency assumed in self-supervised learning, leading to local depth discontinuities. To address this, future work will incorporate a reflection-aware photometric loss to enhance robustness under reflective conditions. Second, Mono-ViM assumes structural continuity between adjacent frames; however, periodic peristaltic motion introduces non-rigid tissue deformation, which violates this assumption and affects depth stability. To mitigate this issue, optical-flow consistency constraints or temporal feature aggregation mechanisms can be adopted to explicitly model periodic deformation and reduce transient estimation errors.