1. Introduction
Monocular Depth Estimation (MDE) refers to the recovery of a scene’s three-dimensional geometric structure from a single two-dimensional image, representing a fundamental challenge in the field of computer vision. The core difficulty of this task stems from its inherent ill-posed nature: the projection process from the three-dimensional world to a two-dimensional image loses geometric constraints such as depth and scale, theoretically allowing a single image to correspond to an infinite number of valid three-dimensional interpretations [
1]. Despite this fundamental challenge, MDE demonstrates immense application potential across numerous domains—including autonomous driving and robotic navigation—due to its advantages of requiring only a monocular camera, low cost, and ease of deployment. For instance, in 3D reconstruction, the quality of predicted depth maps directly determines the accuracy and realism of the reconstructed scene [
2].
However, this information gap makes depth prediction models highly susceptible to interference from complex environments in practical applications. Factors such as missing surface textures, extreme lighting variations, and complex occlusions severely limit the estimation accuracy and robustness of existing methods. Therefore, overcoming these bottlenecks to systematically enhance the accuracy, detail preservation capability, and adaptability to complex environments of monocular depth models remains a major research challenge that urgently needs to be addressed in this field.
To address these challenges, research methodologies have evolved from traditional approaches to deep learning-based solutions. Early traditional methods, such as Structure from Motion (SFM) [
3] and Multi-View Stereo (MVS) [
4], often relied on strong assumptions and exhibited limited generalization capabilities in complex scenes. For instance, MVS methods typically infer depth by analyzing pixel differences in image pairs [
5]. While primarily binocular approaches, some variants can be applied to monocular images. Geometry-constrained methods leverage scene geometry and optical flow to estimate depth. Common techniques include monocular depth estimation via photometric consistency [
6] and depth estimation using adaptive surface normal constraints and geometric context [
7]. Most deep learning-based approaches train neural networks using encoder–decoder (ED) architectures to extract image features and progressively reconstruct depth information. Some neural networks or their variants further enhance accuracy through fully supervised, self-supervised, or unsupervised learning. These methods rely on diverse training strategies and geometric constraints to derive depth information. Deep learning excels at handling complex scenes due to its superior feature extraction capabilities and faster inference speeds [
8]. Despite the diversity of techniques, improvements based on the ED framework remain a focal point in academia due to their outstanding performance in both accuracy and efficiency.
Within the mainstream ED framework, researchers have optimized encoders and decoders by incorporating CNNs, Transformers, or hybrid models [
9]. However, these approaches still exhibit significant limitations in both feature extraction and context modeling. Regarding feature extraction, the primary issue lies in inadequate fusion of multi-scale features across layers, making it difficult to balance local details with global context. Regarding context modeling, the lack of effective filtering for critical scale information causes a disconnect between global and local details, making the model susceptible to background noise interference. Although numerous studies have attempted to address these issues, efficient synergistic optimization between feature extraction completeness and context modeling accuracy remains elusive, presenting a key research opportunity.
Methods based on the encoder–decoder framework are the mainstream approach for monocular depth estimation. This framework extracts features through an encoder and reconstructs depth maps via a decoder. However, shortcomings persist in both feature extraction and context modeling: First, deficiencies in feature extraction manifest as insufficient fusion of multi-scale features across layers [
10]. For instance, the Lite-Mono model proposed by Zhang et al. [
11] integrates CNN and Transformer features through a local-global feature interaction module. However, its fixed receptive field at the base layer, constrained by conventional convolutional dimensions, struggles to fully capture depth information in large-scale scenes. Yang et al. [
12] introduced semantic priors using a pre-trained CNN encoder, yet relied on skip connections to transfer cross-layer features without optimizing semantic matching for multi-scale features, resulting in incomplete feature representation in areas like road edges and building corners; Song et al. [
13] achieved feature enhancement through knowledge distillation, thereby improving the performance of depth estimation networks. While this enhanced the weighting of local features, it lacked a comprehensive feature interaction mechanism throughout the entire process. The SS-MDE method proposed by Hu et al. [
14] optimizes encoder–decoder training by introducing a teacher-student distillation mechanism, enhancing the model’s depth estimation accuracy and robustness. Although this reduced computational complexity, it still exhibited deficiencies in the coherence of multi-scale feature fusion.
On the other hand, inadequate context modeling primarily manifests as the absence of scale filtering and the disconnect between global and local information. Song et al. [
15] introduced Laplacian pyramids to optimize decoder gradient flow and enhanced depth continuity through multi-scale feature superposition. However, they failed to effectively distinguish the importance of features across different levels, leading to interference from irrelevant features such as background noise in context modeling. Zhu et al. [
16] embedded a Gated Recurrent Unit (GRU) temporal module within the DPT (Dense Prediction Transformer) encoder–decoder framework to leverage temporal information for global context modeling. However, achieving precise matching of local context for high-frequency spatial details remains challenging; the Align3r method proposed by Lu et al. [
17] achieves depth estimation by incorporating the DUSt3R (Dense Unconstrained Stereo 3D Reconstruction) model into an encoder–decoder architecture for feature optimization, but it fails to adapt to contextual information under varying lighting conditions. Yang et al. [
18] designed a hybrid encoder with an adaptive decoder, enhancing monocular depth estimation performance by strengthening the fusion of local and global information. However, this multi-frame input dependency fails to meet context modeling demands in applications like autonomous driving. Consequently, context discontinuities frequently occur in scenes with alternating light and shadow.
Although previous studies have attempted to address the aforementioned issues, certain limitations persist. Hwang et al. [
19] employed a CNN encoder to extract local features and introduced a Transformer into the global self-attention module, enhancing feature interaction capabilities through self-supervised learning. However, they failed to optimize for the differences in receptive fields across multi-scale features. Li et al. [
20] proposed the BinsFormer model, which employs a Transformer decoder for cross-scale attention computation to generate deep interval centers. However, it fails to resolve the issue of semantic discontinuity across feature layers. Zhao et al. [
21] introduced the MonoViT model, utilizing a dual-branch CNN-ViT architecture for feature extraction combined with a convolutional upsampling decoder for information fusion. Yet, it lacks a dynamic filtering mechanism for critical scale features. Li et al. [
22] designed a parallel encoder architecture where a convolutional branch preserves local details while a Transformer branch models long-range dependencies. Although this enhances feature diversity, it remains inefficient in multi-scale feature fusion. Furthermore, Lyu et al. [
23] enhanced the ED structure by optimizing skip connections and introducing a feature fusion module. While this improved depth accuracy in high-gradient regions (e.g., cliff edges), it failed to achieve synergistic optimization between feature extraction and context modeling. Godard et al. [
24] proposed the Monodepth2 model, which optimizes the ED framework through minimum reprojection loss and full-resolution multi-scale sampling. Although it has become a domain benchmark, there is still room for improvement in feature robustness under complex lighting conditions [
8]. Liu et al. [
25] proposed the FCMdepth depth estimation framework, which incorporates feature optimization and contextual information. However, improvements in feature extraction and contextual modeling are still needed to achieve more accurate depth estimation. Wang et al.’s [
26] MG-Mono achieves improved accuracy in self-supervised monocular depth estimation while maintaining a lightweight design. This is accomplished by introducing a multi-granularity information fusion module in its encoder to integrate pixel, local, and global features, and by employing a neighborhood-weighted collaborative prediction head in its decoder. However, feature extraction is not sufficiently detailed, and contextual information is not comprehensive enough, leaving room for optimization. Some of these methods have not proposed systematic solutions from the perspective of synergistically optimizing feature extraction completeness and context modeling accuracy, while others still have room for improvement in feature optimization and context modeling capabilities.
The aforementioned methods effectively enhance the accuracy of monocular depth estimation, yet there remains room for further optimization in feature extraction and context modeling. Many existing approaches tend to optimize encoders (feature extraction) or decoders (context modeling and reconstruction) independently, failing to fundamentally address the collaborative challenges between them in semantic representation and scale perception. This separate optimization makes it difficult for models to achieve optimal global performance. To this end, we propose a novel monocular depth estimation model, EFDepth, whose main contributions include the following aspects:
The EC-Net encoder achieves optimized multi-scale feature extraction. Addressing the limitations of the MobileNetV3 architecture, namely insufficient shallow-to-deep information fusion and restricted receptive fields in lower layers, we designed the EC-Net encoder, comprising MobileNetV3-E and ETFBlock modules, to enhance cross-layer feature interaction. Dilated convolutions expand receptive fields, providing the decoder with both granular and globally contextual feature representations. Experiments were conducted on the public KITTI dataset (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago);
The decoder LapFA-Net enhances contextual modeling capabilities. A five-level Laplacian pyramid performs bottom-up upsampling layer by layer. Each layer implements feature skip connections corresponding to the encoder layer. A Fourier Modulated Attention (FMA) mechanism is introduced at the base layer to select critical scale information. Residual L1–L5 cascades refine high-frequency details, ultimately producing a high-quality depth map that combines global consistency with sharp edges;
EFDepth, a monocular depth estimation model based on a U-Net-like encoder–decoder architecture, achieves refined optimization of multi-scale features by constructing a dedicated encoder EC-Net and decoder LapFA-Net. This effectively enhances information exchange and fusion between upper and lower layers of the model.
The core innovation lies in achieving synergistic effects between feature extraction and depth reconstruction through the collaborative interaction between encoders and decoders. This powerful combination ultimately enhances depth map prediction performance.
The EC-Net encoder focuses on enhancing multi-scale features, extracting rich, multi-scale details to lay the foundation for depth estimation. The LapFA-Net decoder employs the FMA module to recalibrate and filter this feature stream, enabling precise depth map reconstruction. This synergistic mechanism ensures the encoder provides information-rich input to the decoding process through feature enhancement, while the decoder guides more accurate depth map reconstruction via its internal information filtering mechanism. This enhances model performance, ultimately improving depth estimation accuracy in complex scenes.
Experimental results demonstrate that the EFDepth monocular depth estimation model can provide a relatively accurate depth map foundation for applications such as 3D reconstruction.
The overall structure of this paper is as follows:
Section 2. Related Work briefly introduces prior work closely related to this study;
Section 3. EFDepth Model focuses on the proposed EFDepth model, including its overall framework and the design of the encoder EC-Net and decoder LapFA-Net;
Section 4. Experiments and Results Evaluation presents experiments and evaluates results to validate the effectiveness of the EFDepth model;
Finally, the research findings are summarized, and future directions are explored.
3. EFDepth Model
3.1. Overall Framework of the Network Model
The EFDepth model adopts a U-shaped network encoder–decoder architecture, as shown in
Figure 2.
The EC-Net encoder is based on the MobileNetV3 architecture and consists of two modules: ① The MobileNetV3-E module is formed by introducing a Feature Enhancement Module (FEM) into the MobileNetV3 network. This module integrates multi-scale information to enhance feature extraction capabilities, enabling the network to capture richer spatial and semantic features. ② The ETFBlock module refines low-level features by employing dilated convolutions to expand the receptive field, strengthening the expression of granular details and enabling the network to capture broader contextual information.
The decoder LapFA-Net employs a five-level Laplacian pyramid incorporating an FMA module. During upscaling, skip connections and residual fusion with the preceding layer refine the predicted depth map layer by layer. This approach ensures the network maintains both global consistency and high-frequency detail, ultimately reconstructing a high-accuracy depth map.
3.2. Encoder EC-Net
EC-Net consists of MobileNetV3-E and ETFBlock modules, which synergistically optimize the feature extraction process. FEM can be employed for feature enhancement. Thus, MobileNetV3-E enhances its multi-scale feature extraction capabilities by introducing the FEM module after MobileNetV3, enabling it to capture rich spatial and semantic information. Meanwhile, the ETFBlock module expands the receptive field through dilated convolutions, allowing the network to better capture long-range dependencies and global contextual information. This collaborative design optimizes feature representation capabilities, thereby improving the features extracted throughout the entire encoding process.
3.2.1. MobileNetV3-E Module
Given the performance bottlenecks encountered by MobileNetV3 when processing multi-scale features and fine-grained information, a FEM module was introduced after MobileNetV3 to form the MobileNetV3-E module. This aims to enhance multi-scale feature extraction and feature augmentation capabilities. Its network architecture is illustrated in
Figure 3.
After processing through the MobileNetV3-E module, RGB images are extracted into feature maps at four scales (X1, X2, X3, X4). Their spatial dimensions are progressively halved: X1 is half the size of the original image, X2 is one-quarter, X3 is one-eighth, and X4 is one-sixteenth. Concurrently, the number of feature channels increases progressively. Consequently, X4 possesses the smallest spatial scale while carrying the richest channel information and the highest-level semantic details, providing multidimensional support for depth estimation tasks.
In Figure 3, the four feature maps (X1–X4) from the MobileNetV3-E module represent feature maps at different scales. Within the MobileNetV3-E module, the FEM is inserted after each feature map extracted by MobileNetV3. The features extracted by MobileNetV3 are then processed by the FEM, enhancing multi-scale feature extraction capabilities and improving the perception of both local and global information.
The FEM in MobileNetV3-E (as shown in
Figure 4) originates from the FFCA-YOLO model [
34], initially designed to enhance detection capabilities for small objects. In FFCA-YOLO, the FEM enriches feature diversity through a multi-branch dilated convolution architecture, expanding the model’s local perception capabilities to enhance semantic information representation for small objects. The introduction of the FEM addresses MobileNetV3’s limitations in detail perception, enabling it to better handle small objects in complex scenes, enrich local feature information, and improve the model’s ability to capture fine-grained features, ultimately achieving feature enhancement.
The FEM first performs preliminary processing on the input features through multiple 1 × 1 convolutional layers. It then extracts features at different scales in parallel using standard and dilated convolutions, and finally concatenates the resulting feature maps with torch.cat(). The 3 × 3 dilated convolutions expand the receptive field, enabling the model to capture local and global information simultaneously and strengthening its multi-scale feature modeling. This fusion of local and global cues improves the capture of fine-grained details and enriches semantic expression, elevating the model's overall representational capability in complex scenes and achieving feature enhancement.
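As a rough illustration, the following PyTorch sketch mirrors the multi-branch structure described above: 1 × 1 preprocessing, parallel standard and dilated 3 × 3 convolutions, and concatenation via torch.cat(). The branch count, channel split, and dilation rates (1, 3, 5) are illustrative assumptions rather than the exact FFCA-YOLO configuration.

```python
import torch
import torch.nn as nn

class FEMSketch(nn.Module):
    """Illustrative FEM-style block: parallel branches of 1x1 preprocessing followed
    by standard (dilation 1) or dilated 3x3 convolutions, concatenated with torch.cat()."""
    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, kernel_size=1, bias=False),   # 1x1 preprocessing
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),                # standard or dilated 3x3
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        self.fuse = nn.Conv2d(branch_ch * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))   # concatenate branch outputs

# Example: enhance a 64-channel feature map (shape is illustrative)
out = FEMSketch(64)(torch.randn(1, 64, 48, 160))
```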
After processing RGB images through the MobileNetV3-E backbone, four feature maps are generated sequentially, each possessing distinct spatial resolutions and channel dimensions. These form complementary representations at both scale and semantic levels, capturing both fine-grained and global scene information.
3.2.2. ETFBlock Module
After feature enhancement via MobileNetV3-E, the lowest-level features in the encoder retain rich local details and texture information but suffer from significantly reduced spatial resolution. This results in limited feature receptive fields, making it difficult to effectively capture long-range dependencies and global scene structures within images. This lack of global contextual information deprives the decoder of sufficient scene-level semantic guidance during subsequent upsampling and depth map prediction. Consequently, the overall consistency of depth estimation is compromised, leading to prediction errors and structural distortions. To mitigate this issue, drawing inspiration from Atrous Spatial Pyramid Pooling (ASPP) [
35], the ETFBlock module (structural diagram shown in
Figure 5) is designed to capture multi-scale contextual information, expand the receptive field, enhance the model’s contextual modeling capability, and achieve feature optimization.
Traditional ASPP achieves multi-scale perception and effectively expands the receptive field by fusing convolutional features with different expansion rates. However, in monocular depth estimation tasks, the balanced fusion of global and local information remains an unsolved challenge. Building upon ASPP’s multiscale perception advantages, ETFBlock introduces innovative designs to achieve precise fusion of global-local information: it optimizes feature representation through multiscale context awareness and feature reuse mechanisms. Compared to ASPP, it enhances the model’s adaptability to complex scenes in prediction tasks. The dimension reduction layer serves as a preprocessing step, employing 1 × 1 convolutions to perform channel compression on input features, thereby reducing the number of parameters for subsequent operations. To comprehensively capture multi-scale contextual information, this module deploys multiple feature extraction paths in parallel: 1 × 1 convolutions preserve core features at the original scale while performing linear transformations; three sets of dilated convolutions with varying dilation rates progressively expand receptive fields, precisely extracting hierarchical contextual information from local details to global structures; a global mean pooling layer generates image-level global feature descriptors by compressing spatial dimensions, infusing scene-level prior knowledge into the model. Subsequently, by sequentially fusing feature maps from each path, it achieves complementary integration of local details, multi-scale context, and global information. ETFBlock introduces residual connections, overlaying the original input features onto the concatenated multi-scale features. This enables the model to focus on residual learning between target features and input features, simplifying the optimization objective while ensuring the transmission of all feature information. Finally, 1 × 1 convolutions perform channel adjustment and cross-channel information integration on the fused and enhanced features, generating optimized output features. This hierarchical collaborative design enhances ETFBlock’s feature richness, with the feature fusion process within the ETFBlock module described by Equation (1).
In Equation (1), the term on the left-hand side represents the merged features, and the terms on the right-hand side denote the features to be concatenated.
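The following is a minimal PyTorch sketch of an ETFBlock-style module under the description above: 1 × 1 channel reduction, a parallel 1 × 1 branch, three dilated 3 × 3 branches, a global average pooling branch, concatenation (Equation (1)), reuse of the original input as a residual, and a final 1 × 1 projection. The dilation rates (6, 12, 18), the reduction ratio, and the choice of concatenating (rather than adding) the input for the residual connection are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ETFBlockSketch(nn.Module):
    """Illustrative ETFBlock-style module: 1x1 reduction, parallel 1x1 / dilated 3x3
    branches, global average pooling, concatenation, input residual reuse, 1x1 fusion."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(6, 12, 18)):
        super().__init__()
        mid = in_ch // 2                               # channel compression (assumed ratio)
        self.reduce = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.branch1x1 = nn.Conv2d(mid, mid, 1, bias=False)
        self.dilated = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, padding=d, dilation=d, bias=False) for d in dilations
        )
        self.gap = nn.AdaptiveAvgPool2d(1)             # image-level global descriptor
        self.gap_conv = nn.Conv2d(mid, mid, 1, bias=False)
        cat_ch = mid * (2 + len(dilations))            # 1x1 branch + dilated branches + global branch
        self.project = nn.Conv2d(cat_ch + in_ch, out_ch, 1)  # fuse after reusing the input features

    def forward(self, x):
        r = self.reduce(x)
        feats = [self.branch1x1(r)] + [conv(r) for conv in self.dilated]
        g = self.gap_conv(self.gap(r))
        g = F.interpolate(g, size=r.shape[-2:], mode="bilinear", align_corners=False)
        feats.append(g)
        cat = torch.cat(feats, dim=1)                  # Equation (1): concatenation of branch outputs
        cat = torch.cat([cat, x], dim=1)               # residual reuse of the original input features
        return self.project(cat)
```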
The ETFBlock module achieves multi-scale feature optimization through collaboration with the MobileNetV3-E module, expanding the receptive field and enhancing contextual modeling capabilities. By capturing both local and global information through multi-scale perception, and further integrating residual connections with low-level features, ETFBlock significantly boosts feature expressiveness. This synergistic design ensures a balanced trade-off between detail and global consistency in depth estimation, thereby improving performance in complex scenes.
The multi-scale feature maps (X1–X4) extracted by MobileNetV3-E are renamed for subsequent reference: X1, X2, and X3 are sequentially designated as Layer1, Layer2, and Layer3. X4 requires further processing through the ETFBlock, and its output is named Layer4.
3.3. Decoder LapFA-Net
The decoder of LapFA-Net adopts a five-layer Laplacian pyramid architecture. It takes the multi-scale features optimized by the encoder and progressively restores the spatial resolution of the depth map layer by layer during decoding. To further enhance detail reconstruction quality, the network incorporates an FMA module. This module extracts frequency-domain information and performs feature selection, thereby strengthening the decoder’s ability to model local structural features. Simultaneously, the FMA module integrates with multi-scale feature fusion to further strengthen the model’s contextual modeling performance. At each decoder layer, synergistic interaction with the FMA module not only achieves high-quality detail recovery in depth maps but also ensures the consistency and accuracy of depth estimation results across the entire scene.
LapFA-Net (as shown in
Figure 6) employs a five-layer Laplacian pyramid decoder. During each upsampling step, it utilizes skip connections to incorporate encoder features while introducing Laplacian residuals. Concurrently, it employs the FMA module [
36] to progressively refine depth map predictions. The decoder achieves high-quality depth estimation by extracting multi-scale features from different encoder layers (Layer4 to Layer1), progressively restoring the depth map’s spatial resolution layer by layer while refining depth details. The decoding process begins with low-resolution, high-semantic features from Layer4. Through a series of upsampling operations, it gradually enlarges the feature map dimensions and fuses them with corresponding encoder layer features. This integration of global semantic information and local details enhances the accuracy of depth estimation.
The LapFA-Net decoder employs a Laplacian pyramid architecture to predict depth residuals at multiple scales. Lower-resolution scales primarily handle overall depth trend prediction, while higher-resolution scales focus on detail recovery. Through the progressive accumulation of multi-layer residuals, the model ultimately generates detailed and accurate depth maps. Each layer reintroduces the mean-scaled Laplacian residual, progressively balancing feature scales across layers. This enhances model stability, mitigates gradient vanishing issues, and improves training efficiency. Furthermore, by employing diverse nonlinear activation functions, the model effectively captures complex nonlinear mapping relationships. This ensures depth predictions align more closely with real-world depth variation patterns.
Laplacian residuals introduce edge features to assist the decoder in refining and reconstructing images, enabling progressive refinement of depth map predictions [
37]. The Laplacian residual calculation formula is given by Equation (2).
In Equation (2), the original image is first subsampled and then bilinearly upscaled back to the original resolution; the Laplacian residual is obtained by subtracting this bilinearly upscaled result from the original image.
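A minimal sketch of the Laplacian residual in Equation (2), assuming average pooling for subsampling and bilinear interpolation for upscaling:

```python
import torch
import torch.nn.functional as F

def laplacian_residual(img: torch.Tensor) -> torch.Tensor:
    """Equation (2) sketch: subsample the image, bilinearly upscale it back,
    and subtract the result from the original image to obtain the residual."""
    down = F.avg_pool2d(img, kernel_size=2)                       # subsampled image
    up = F.interpolate(down, size=img.shape[-2:], mode="bilinear",
                       align_corners=False)                       # bilinear upscaling
    return img - up                                               # Laplacian residual

# Example on a single-channel image; the 192x640 resolution is only illustrative
res = laplacian_residual(torch.randn(1, 1, 192, 640))
```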
The FMA module originates from the SRConvNet architecture. By integrating Fourier transforms with spatial domain feature interactions, it replaces traditional computations with linear-complexity frequency domain operations while partitioning local windows. This approach reduces computational and parameter overhead while simultaneously capturing both long-range and short-range dependencies to suit lightweight tasks. It also suppresses irrelevant feature responses and enhances texture edge details, enabling high-quality image reconstruction with efficient global information capture capabilities [
36]. Given the critical importance of global information in depth prediction within depth estimation model decoders, the FMA module is introduced into the LapFA-Net decoder to enhance the structural consistency and texture detail restoration accuracy of depth maps.
The FMA module is only introduced in decoder layers D1 to D4 and is not used in layer D5. The application of the FMA module in the decoder is based on a grouping mechanism. Given the limited number of channels in the D5 layer, introducing this module may not only incur unnecessary computational overhead but also lead to excessive feature smoothing. This could result in the loss of high-frequency details and compromise depth map accuracy. Furthermore, the core task of the D5 layer is to generate the final depth map. At this stage, features have undergone sufficient fusion and refinement, necessitating the preservation of more features to ensure detail recovery capability.
Decoder layers D1 to D4 primarily handle the fusion and refinement of multi-scale features. The FMA module effectively enhances the expressive power of medium-to-low resolution features, thereby improving detail recovery accuracy. Incorporating 0.1 × Li.mean into each layer compensates for potentially lost detail information and increases feature diversity, further boosting the model’s detail recovery capability. The FMA module filters key information from multi-scale features, preserving depth map details while ensuring overall structural integrity. Ultimately, this enhances depth map prediction accuracy and detail recovery capabilities.
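The hedged sketch below shows how a single D1–D4 decoder step could combine these ingredients (upsampling, the encoder skip connection, FMA refinement, the Laplacian residual, and the 0.1 × L_i.mean() compensation term); the exact fusion order, channel handling, and the fuse_conv/fma modules are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decoder_step(prev_feat: torch.Tensor, skip_feat: torch.Tensor,
                 lap_residual: torch.Tensor, fuse_conv: nn.Module, fma: nn.Module):
    """One illustrative D1-D4 step: upsample, skip-connect, fuse, FMA refinement,
    then reintroduce the Laplacian residual and the 0.1 * mean(L_i) term."""
    x = F.interpolate(prev_feat, scale_factor=2, mode="bilinear", align_corners=False)
    x = torch.cat([x, skip_feat], dim=1)       # skip connection with the matching encoder layer
    x = fuse_conv(x)                           # channel fusion (e.g., 3x3 conv + ReLU), assumed
    x = fma(x)                                 # frequency-domain attention refinement
    x = x + lap_residual                       # reintroduce the Laplacian residual (broadcast over channels)
    x = x + 0.1 * lap_residual.mean()          # 0.1 x L_i.mean() detail-compensation term
    return x
```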
To process information at different scales more efficiently, enhance focus on detailed regions, and reduce occlusion and blurring, the decoder incorporates FMA modules in its first four layers (as shown in
Figure 7). The FMA module transforms input feature maps from the spatial domain to the frequency domain via the Fourier transform, capturing frequency information across different scales. It then divides the feature maps into multiple regions using non-overlapping block segmentation and assigns weights to each region through an attention mechanism, thereby amplifying features in critical areas. After attention-weighted feature concatenation, the final output is obtained through deep processing involving Softmax and 1 × 1 convolutions. By extracting frequency information and implementing adaptive feature selection, the FMA module enhances the model’s ability to focus on local details. This optimizes depth map detail recovery while improving feature selection efficiency. Combined with the pyramid structure, it enhances contextual modeling capabilities, emphasizing both local details and global context. Ultimately, this achieves more accurate and refined depth predictions.
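To make the idea concrete, the sketch below implements a greatly simplified frequency-domain modulation with softmax-gated attention; it is only a stand-in for the FMA concept, not the SRConvNet module, and the fixed feature-map size and single learnable frequency filter are assumptions.

```python
import torch
import torch.nn as nn

class FMASketch(nn.Module):
    """Simplified stand-in for the FMA idea: modulate features with a learnable
    frequency-domain filter, then gate them with a softmax-normalized 1x1-conv
    attention map. Not the exact SRConvNet FMA."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # learnable per-channel frequency filter; rfft2 keeps width // 2 + 1 frequency bins
        self.freq_weight = nn.Parameter(torch.ones(channels, height, width // 2 + 1))
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                  # spatial -> frequency domain
        spec = spec * self.freq_weight                           # emphasize/suppress frequency bands
        x_f = torch.fft.irfft2(spec, s=(h, w), norm="ortho")     # back to the spatial domain
        gate = torch.softmax(self.attn(x_f), dim=1)              # softmax attention over channels
        return x_f * gate                                        # adaptively selected features
```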
The LapFA-Net decoder integrates global semantic information and local details through the fusion of multi-scale features, a Laplacian pyramid structure, precise application of the FMA module, and residual learning, ultimately generating high-accuracy depth maps.
4. Experiments and Results Evaluation
4.1. Experimental Environment
In the experiment, the EFDepth neural network model was constructed using the PyTorch framework and Python language. Model debugging and training were conducted on a Windows 11 system. Training hardware comprised an NVIDIA GeForce RTX 3060 Laptop GPU with 8GB VRAM, leveraging PyTorch 2.0.0 accelerated by CUDA 11.8. Python version 3.9 was employed. Following training, model evaluation was performed on Ubuntu 20.04 running on identical hardware specifications.
4.2. Dataset
The KITTI dataset (as shown in
Figure 8) is a widely used dataset in the field of computer vision, providing images with corresponding lidar point cloud data. Its 64-line lidar generates high-resolution 3D point clouds that can be aligned with RGB images using camera parameters, yielding sparse depth maps (commonly used as ground truth for depth estimation). The NYU Depth V2 dataset features dense depth maps alongside their corresponding RGB images, enabling training, evaluation and testing. The training process employs a fully supervised strategy, with labels being either lidar depth maps or dense depth maps.
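For illustration, a generic sketch of how lidar points can be projected into a sparse depth map is given below; it assumes the points have already been transformed into the rectified camera frame and that P is the corresponding 3 × 4 projection matrix, and it is not tied to the paper's specific preprocessing.

```python
import numpy as np

def lidar_to_sparse_depth(points_xyz: np.ndarray, P: np.ndarray, h: int, w: int) -> np.ndarray:
    """Project lidar points (N, 3), already in the camera frame, to a sparse depth
    map using a 3x4 projection matrix P. Keeps the nearest depth per pixel."""
    pts = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # homogeneous coordinates
    uvz = (P @ pts.T).T                                 # (N, 3): [u*z, v*z, z]
    z = uvz[:, 2]
    valid = z > 0                                       # keep points in front of the camera
    u = np.round(uvz[valid, 0] / z[valid]).astype(int)
    v = np.round(uvz[valid, 1] / z[valid]).astype(int)
    z = z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)          # 0 marks pixels without ground truth
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:    # nearest point wins on collisions
            depth[vi, ui] = zi
    return depth
```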
4.3. Evaluation Indicators
The evaluation metrics align with previous literature [6,7,8,9,10,11,12,13,14,15], employing Root Mean Square Error (RMSE), Root Mean Square Logarithmic Error (RMSELog), Absolute Relative Error (AbsRel), Square Relative Error (SqRel), and accuracy (δ1, δ2, δ3). Calculation methods are detailed in Equations (3)–(7).
In Equations (3)–(6), N denotes the total number of samples in the test set, i.e., the number of pixels in the actual depth map with valid depth values; the ground-truth term represents the actual depth value, while the predicted term denotes the depth value estimated by the model.
In Equation (7), max(a,b) denotes the maximum value between a and b; as in the preceding equations, the two quantities involved are the actual depth value and the depth value predicted by the model. δ1, δ2, and δ3 are used to evaluate the accuracy of the depth estimation model: δ1 measures whether the relative error between the predicted depth and the true depth is less than 1.25, indicating an error within 25%; δ2 checks whether the error is less than 1.25² (1.5625), tolerating a maximum error of 56.25%; δ3 permits a threshold of 1.25³ (about 1.9531), corresponding to a maximum error of 95.31%. These thresholds help evaluate the model’s performance under different accuracy requirements, with smaller thresholds indicating more precise predictions.
The smaller the RMSE, RMSELog, AbsRel, and SqRel values calculated by the model, and the larger the accuracy values δ1, δ2, and δ3, the better the model performs.
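For reference, the sketch below computes these metrics with their standard definitions from the literature, which Equations (3)–(7) are assumed to follow; the valid-pixel mask gt > 0 is an assumption.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth metrics over pixels with valid ground truth."""
    mask = gt > 0                                  # evaluate only pixels with valid depth
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)      # the max(a, b) ratio of Equation (7)
    return {
        "RMSE":    np.sqrt(np.mean((pred - gt) ** 2)),
        "RMSELog": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "AbsRel":  np.mean(np.abs(pred - gt) / gt),
        "SqRel":   np.mean((pred - gt) ** 2 / gt),
        "delta1":  np.mean(thresh < 1.25),
        "delta2":  np.mean(thresh < 1.25 ** 2),
        "delta3":  np.mean(thresh < 1.25 ** 3),
    }
```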
4.4. Loss Functions and Iterative Training
During training, to mitigate the influence of extreme values, the experiment employs a scale-invariant log loss function [
15]. This function is motivated by the requirement for scale invariance in depth estimation, particularly addressing discrepancies between distant and near-range depth data. Standard loss functions may be overly sensitive to errors in distant data. The scale-invariant log loss mitigates this influence by taking the logarithm of both the predicted and true depth maps, enabling the model to exhibit greater stability across different scales. This method is widely adopted in existing monocular depth estimation models. The scale-invariant error is expressed by Equation (8).
In Equation (8), the predicted depth map and the true depth map are compared in log space: taking the logarithm of both maps and subtracting them yields the error over the N pixels of the RGB image for which the true depth map contains valid depth information [38]. Subsequently, due to the limitations of 3D sensors, depth data collected at close range is dense, while data at long range is sparse. To mitigate this imbalance, the loss function from [39] is adopted as the base loss function T to compare the predicted depth map with the true depth map, as shown in Equation (9).
In Equation (9), the balance factor μ is set to 0.85, consistent with reference [15]. The variance penalty term reduces estimation errors for long-range depths by penalizing the variance of prediction errors, thereby enhancing model stability on long-range depth data. This approach originates from [15] and is implemented in [39], which proposes that variance penalties balance dense near-range data with sparse far-range data in depth datasets. This ensures the model does not neglect far-range depth estimation due to the excessive density of near-range data. This loss has been applied in multiple studies [15,17,38,39].
In Equation (10), L denotes the final loss function. The hyperparameter α is set to 10, consistent with references [15] and [39].
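Assuming Equations (8)–(10) follow the standard scale-invariant log loss with variance penalty used in [15,39], a minimal sketch with μ = 0.85 and α = 10 looks as follows:

```python
import torch

def silog_loss(pred: torch.Tensor, gt: torch.Tensor,
               mu: float = 0.85, alpha: float = 10.0) -> torch.Tensor:
    """Scale-invariant log loss with variance penalty (assumed form):
    g = log(pred) - log(gt) over valid pixels,
    L = alpha * sqrt(mean(g^2) - mu * mean(g)^2)."""
    mask = gt > 0                                        # ignore pixels without ground truth
    g = torch.log(pred[mask]) - torch.log(gt[mask])      # log-space error per valid pixel
    return alpha * torch.sqrt((g ** 2).mean() - mu * g.mean() ** 2)
```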
During model training, the AdamW optimizer is employed. Due to the massive dataset size, training is time-consuming. To enhance efficiency while maintaining training effectiveness, selecting an appropriate learning rate strategy and number of epochs can help the model rapidly converge to its optimal values.
4.4.1. The Impact of Total Training Epochs
During the training process of the EFDepth model, its loss value gradually decreased and stabilized as training progressed, indicating that the model was converging. To comprehensively evaluate model performance, we saved multiple training checkpoints for testing, including those from epochs 1, 5, 10, 15, 20, 25, 30, and 35. Testing was conducted across multiple epochs on the KITTI dataset, revealing that the model checkpoint at epoch 33 achieved optimal performance across all evaluation metrics. Consequently, we formally designate the training results from epoch 33 as the final EFDepth model.
To investigate overfitting, we continued training beyond epoch 33 and obtained checkpoints with lower training losses (Minimum-loss checkpoint).
Figure 9 illustrates the evolution of performance metrics from epoch 1 to epoch 35, alongside the final EFDepth model and the checkpoint with the minimum loss.
As shown in
Figure 9, the bar chart uses the left Y1 axis as its vertical coordinate, while the line chart uses the right Y2 axis. Observation reveals that during the early training phase, both the model’s error metrics and loss values gradually decrease with increasing epochs, while the accuracy metrics consistently increase. This indicates the model is in an effective learning phase. All metrics peak at epoch 33, corresponding to our selected EFDepth model. However, starting from epoch 35, despite the training loss continuing to decrease, multiple performance metrics on the test set begin to deteriorate. The evaluation performance of the minimum-loss checkpoint obtained from further training also fails to surpass that of epoch 33. This clearly indicates that the model begins to overfit after epoch 33.
Therefore, we set the total training epochs to 35. This setting represents a comprehensive trade-off between model performance and overfitting risk. Experiments show this configuration ensures sufficient model convergence while halting training early enough to avoid overfitting.
Experiments were conducted with a batch size of 8, and the full 35-epoch training cycle took approximately 90 h. Under this configuration, the finalized EFDepth model achieved an optimal performance equilibrium. Its EC-Net encoder and LapFA-Net decoder synergistically enhance predictive capabilities through multiscale feature optimization and contextual modeling, while ensuring stable convergence during training and high-quality depth map outputs.
4.4.2. The Impact of Learning Rate
The choice of learning rate impacts training effectiveness, so three learning rate strategies were selected for training and analysis: dynamic learning rate (0.0001 → 0.00001), a fixed learning rate of 0.0001, and a fixed learning rate of 0.00001. The experimental results of the high-performing models trained using these three learning rates are shown in
Table 1.
The dynamic learning rate adjustment strategy follows Reference [
15], employing a polynomial decay approach to balance parameter exploration efficiency and convergence stability during model training. Specifically, the learning rate is initially set to a preset baseline rate (baselr) and gradually decays smoothly with training iterations, ultimately converging to a preset minimum learning rate (endlr). The adjustment formula follows a fast-then-slow decay pattern, as shown in Equation (11).
In Equation (11), the learning rate for the i-th iteration is computed from the preset base learning rate, the preset minimum learning rate, and the total number of iterations. The training process varies with the selected learning rate.
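A minimal sketch of such a polynomial decay schedule is shown below; the decay exponent is an assumption, while the 0.0001 → 0.00001 range follows the text. In practice the returned value can be written into each parameter group of the AdamW optimizer at every iteration.

```python
def poly_lr(step: int, total_steps: int,
            base_lr: float = 1e-4, end_lr: float = 1e-5, power: float = 0.9) -> float:
    """Polynomial ("fast-then-slow") decay from base_lr toward end_lr.
    The exponent `power` is an assumed value."""
    frac = min(step, total_steps) / total_steps          # training progress in [0, 1]
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr
```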
The experimental results in Table 1 demonstrate that the dynamic learning rate model achieves optimal performance, while fixed learning rates of 0.0001 and 0.00001 exhibit inferior results. The core reason lies in the inability of fixed learning rates to adapt to the demands of the entire training schedule: the former, due to its fixed step size, struggles to perform fine-grained parameter optimization during the latter stages of training, often oscillating near the optimal solution and consequently yielding inferior accuracy and error metrics compared to the dynamic learning rate; the latter, with its excessively small step size, not only converges slowly in the early training phase but also tends to get stuck in local minima, ultimately yielding the worst performance due to insufficient convergence. In contrast, the dynamic learning rate adapts to different training stages, facilitating rapid convergence in the early phase and enabling fine-tuning in the later phase. Therefore, a dynamic learning rate strategy is adopted during training.
4.5. Ablation Experiment
This paper adopts an encoder–decoder network architecture, where the base encoder is MobileNetV3 and the base decoder employs a five-layer Laplacian pyramid structure (termed Lap-Net). Building upon this foundation, the encoder EC-Net incorporates MobileNetV3-E and ETFBlock modules, while the decoder LapFA-Net utilizes a five-layer Laplacian pyramid structure with FMA modules integrated only in layers D1 to D4. To clarify the impact of adding the FMA module at layer D5, an ablation experiment was designed where the FMA module participates in every layer of the five-layer Laplacian pyramid structure. This scheme is termed Lap5FA-Net. Given that ETFBlock is an extended version of ASPP, the ablation experiments include a comparative analysis between the ASPP module and the ETFBlock. To evaluate the impact of each module on performance, multiple ablation experiments were conducted on a unified hardware platform using two datasets. Specific module configurations are detailed in
Table 2, and ablation results are summarized in
Table 3 and
Table 4.
As shown in
Table 2,
Table 3 and
Table 4, through stepwise analysis of ablation experiments, the specific contribution of each module to model performance becomes evident.
On the KITTI dataset and the NYU Depth V2 dataset, experimental results demonstrate that the integrated MobileNetV3-E, ETFBlock, and Laplacian pyramid structure with the FMA module achieve optimal performance through synergistic collaboration, fully validating the functionality and improvement value of each component. When MobileNetV3-E is applied alone for basic feature extraction, its performance is the poorest due to the absence of subsequent processing modules. However, when combined with either ETFBlock or ASPP, the RMSE drops sharply, confirming that its feature optimization potential requires activation through collaborative modules. As an extended version of ASPP, ETFBlock cannot definitively claim superiority in isolation. Under identical configurations, experiments employing ETFBlock demonstrated lower RMSE than those employing ASPP, but this advantage only materializes when ETFBlock collaborates with other modules, such as MobileNetV3-E. This demonstrates that while inheriting ASPP’s multi-scale feature processing philosophy, ETFBlock achieves more efficient feature refinement through structural optimization. Within this model, ETFBlock therefore proves more suitable than the ASPP module, although both have their respective strengths and weaknesses, and neither holds an absolute advantage over the other. The Laplacian pyramid architecture with the FMA module further enhances multi-scale feature fusion accuracy. Compared to EFDepth, Experiment 9 (without FMA) and Experiment 16 (five-layer FMA) both exhibit higher RMSE; thus, a four-layer FMA is sufficient and provides critical support for overall performance improvement. The EFDepth model demonstrated optimal performance across all metrics, outperforming any other combination, which validates the applicability of MobileNetV3-E, ETFBlock, and LapFA-Net.
The progressive introduction of MobileNetV3-E, ETFBlock, and FMA modules has continuously enhanced model accuracy and stability. MobileNetV3-E synergizes with ETFBlock to strengthen feature extraction capabilities and multi-scale information fusion, thereby improving model performance. The FMA module, combined with the Laplacian pyramid structure, enhances the complementarity between features at different scales and the contextual modeling capability, ultimately improving depth estimation accuracy. Compared to using any single module alone, the synergistic application of all three yields superior depth map prediction results. The collaborative effect among the three modules optimizes the overall model performance, with the EFDepth configuration ultimately demonstrating the best accuracy and stability.
4.6. Comparative Experiment
To enhance the credibility of the model results, comparative experiments were conducted. Under identical experimental conditions, Lite-mono [11], Lapdepth [15], Hr-depth [23], Monodepth2 [24], and FCMdepth [25] were compared with the EFDepth model. These five models and EFDepth were evaluated on both datasets. To assess robustness, random noise was added to the data used for evaluation on both datasets; in the experiments, this robustness test was named EFDepth-B. Through the comparative experiments, we analyze performance across the different datasets to better evaluate the strengths and weaknesses of the EFDepth model. The comparative results are shown in
Table 5.
The comparative experimental results in
Table 5 demonstrate that the EFDepth model exhibits certain advantages: it achieves optimal performance on the depth estimation error metrics, such as RMSE and SqRel, while also yielding the highest accuracy metrics δ1, δ2, and δ3. Furthermore, this model requires fewer parameters and has lower FLOPs (Floating Point Operations). Even when noise is introduced, causing a slight decline in test performance, the results remain satisfactory. Consequently, EFDepth leads in error control and depth prediction accuracy, exhibiting good monocular depth estimation capabilities.
EFDepth demonstrates reduced error and improved accuracy across both datasets, validating its effectiveness and robustness. To further illustrate its reliability, depth map predictions were generated for selected images from the KITTI dataset. The prediction results are shown in
Figure 10. Compared to other models, EFDepth exhibits greater stability in depth prediction, particularly in scenes with intricate details (e.g., areas highlighted in red boxes), where its predictions are more accurate and closer to the true depth values.
Observing
Figure 10, EFDepth demonstrates advantages in KITTI dataset scenes: on streets with dense trees or vehicles and high brightness (first column), it accurately predicts depth in highly reflective areas, clearly distinguishing vehicles from surrounding environments; on dimly lit streets (second column), the generated depth map remains stable and smooth overall; in ordinary street scenes (third column), it renders vehicle and building edges more sharply with clearer depth hierarchy; and on shaded ordinary streets (fourth column), it captures distant vehicles with greater clarity, achieving more comprehensive depth prediction.
As shown in
Table 5, the EFDepth model demonstrates limited improvement in some metrics but overall performance gains. Error metrics generally decrease while accuracy values such as δ1 simultaneously increase, enhancing the prediction quality of depth maps. This advancement is visually evident, as demonstrated in
Figure 10. The model delivers clearer and more coherent predictions along object edges and complex structures, reducing blurring and distortion. Consequently, EFDepth’s enhancement extends beyond quantitative metrics to strengthen depth information, offering practical value across various domains.