Article

A Hybrid UNet with Attention and a Perceptual Loss Function for Monocular Depth Estimation

Hamidullah Turkmen and Devrim Akgun
1 Computer and Informatics Engineering Department, Institute of Natural Science and Technology, Sakarya University, Esentepe Campus, 54050 Serdivan, Sakarya, Türkiye
2 Software Engineering Department, Faculty of Computer and Information Sciences, Sakarya University, 54050 Serdivan, Sakarya, Türkiye
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(16), 2567; https://doi.org/10.3390/math13162567
Submission received: 1 July 2025 / Revised: 1 August 2025 / Accepted: 4 August 2025 / Published: 11 August 2025
(This article belongs to the Special Issue Artificial Intelligence and Algorithms with Their Applications)

Abstract

Monocular depth estimation is a crucial technique in computer vision that determines the depth or distance of objects in a scene from a single 2D image captured by a camera. UNet-based models are a fundamental architecture for monocular depth estimation due to their effective encoder–decoder structure. This study presents an effective depth estimation model based on a hybrid UNet architecture that incorporates ensemble features. The new model integrates Transformer-based attention blocks to capture global context and an encoder built on ResNet18 to extract spatial features. Additionally, a novel Boundary-Aware Depth Consistency Loss (BADCL) function has been introduced to enhance accuracy. This function features dynamic scaling, smoothness regularization, and boundary-aware weighting, which provide sharper edges, smoother depth transitions, and scale-consistent predictions. The proposed model has been evaluated on the NYU Depth V2 dataset, achieving a Structural Similarity Index Measure (SSIM) of 99.8%. The performance of the proposed model indicates increased depth accuracy compared to state-of-the-art methods.

1. Introduction

Depth estimation is an essential component of autonomous vehicle systems, allowing them to accurately perceive their environment and make informed decisions. Various methods are currently used for depth estimation. For instance, LiDAR point clouds [1] provide a highly effective means of estimating scene depth precisely. However, obtaining such detailed and highly accurate depth maps is often costly. High hardware costs, processing power requirements, and complexity limit active sensing methods such as LiDAR. This has directed researchers toward more cost-effective, flexible, and efficient solutions. In this context, passive sensing methods like monocular depth estimation have emerged. These techniques, which use a single camera to estimate depth, are more affordable and versatile, making them suitable for many applications. Image-based depth estimation has evolved significantly from basic principles to today’s advanced technologies. Early depth estimation techniques relied on exploiting depth cues such as focus [2], vanishing points [3], and shadow information [4]. However, the applicability of these methods has often been restricted to constrained scenes. The evolution of computer vision led to the adoption of handcrafted features (SIFT [5], SURF [6], PHOG [7]) and probabilistic graphical models (CRF [8], MRF [9]) for monocular depth estimation. These techniques were often integrated with machine learning approaches to learn model parameters or non-parametric representations [8,9]. The emergence of deep learning algorithms has brought great advantages to image processing [10,11,12] and especially to depth estimation methods [13]. The advent of neural networks, particularly Convolutional Neural Networks (CNNs) [14], has enhanced image depth detection capabilities. Different models, such as ResNet [15], VGG [16], MobileNet [17], encoder–decoder networks [18], and UNet [19], have been developed using CNN methods and components. Later, with the development of Transformer and attention layers, monocular depth estimation came to be applied in areas such as autonomous driving, robotic navigation, and augmented reality.
Duong et al. [20] proposed URNet, an improved UNet architecture with a residual backbone for monocular depth estimation. The model handles depth estimation for autonomous vehicles effectively, showing good results on the KITTI dataset. Wang et al. [21] presented a UNet++-like architecture for lightweight and effective depth estimation; they sought to increase depth estimation accuracy while further reducing parameters by using convolution-based light channel attention in the decoder stage. Xue et al. [22] proposed DNet, a hierarchical self-supervised approach to monocular absolute depth estimation for autonomous driving applications, which enhances object-level depth estimation using hierarchical constraints and densely connected layers. Tadepalli et al. [23] proposed a hybrid UNet architecture with an EfficientNet-B0 backbone for monocular dense-depth map estimation. Their approach uses EfficientNet-B0 for efficient feature extraction and bilinear upsampling for depth map generation, aiming to achieve good results with fewer parameters. Li et al. [24] proposed an Adaptive Semantic Fusion Framework for unsupervised monocular depth estimation, integrating segmentation data to enhance depth consistency; this approach is effective in difficult scenarios. Dang et al. [25] proposed a lightweight model called RCCNet that uses Reducing Channel Convolution (RCConv) to extract local features efficiently; RCCNet achieves results comparable to existing methods while substantially reducing the computational cost. Shen et al. [26] proposed DNA-Depth, a model developed for monocular depth estimation in night-time autonomous driving scenarios. It uses the Fourier transform for domain alignment to handle the domain shift between day and night images and, to deal with moving light sources, employs an unsupervised joint learning architecture that jointly estimates depth, optical flow, and ego-motion. Xia et al. [27] proposed the Stereoscopic Pyramid Transformer-Depth (SPT-Depth) model for monocular depth estimation, which utilizes pyramid Transformers and multi-scale feature fusion to integrate shallow and deep semantic information. The model employs a training strategy with shift- and scale-invariant loss functions and edge smoothing techniques to reduce noise and enhance robustness; on the NYU Depth V2 dataset, SPT-Depth achieved a 10% reduction in Absolute Relative Error and a 17% reduction in RMSE. Sharma et al. [28] proposed the 2T-UNet model for robust stereo depth estimation, which uses twin convolution towers with different weights to avoid explicit stereo correspondence matching. To improve scene geometry prediction, 2T-UNet, in contrast to conventional stereo algorithms, takes both the left and right stereo images along with monocular depth cues as input, and it achieved competitive results on the challenging Scene Flow dataset. Tang et al. [29] proposed CATNet for monocular depth estimation, treating the task as an ordinal regression problem. The model uses an encoder–decoder architecture that minimizes complexity and parameters while preserving depth estimation accuracy.
CATNet incorporates a Multi-dimensional Convolutional Attention (MCA) module and a Dual Attention Transformer (DAT) module to enhance multi-scale feature refinement and global feature extraction. Experimental results on the KITTI and NYU datasets demonstrate that CATNet achieves performance comparable to state-of-the-art Transformer-based models. Kolbeinsson and Mikolajczyk [30] proposed a monocular end-to-end depth estimation model called UCorr. The method addresses challenges such as thin-wire obstacles in autonomous drone navigation by using a temporal correlation layer trained on synthetic data, and it demonstrates how the safety and accuracy of autonomous drones might be enhanced in practical situations. Wu et al. [31] proposed AMENet, a monocular depth estimation network designed to increase autostereoscopic accuracy. AMENet combines a CNN for local feature extraction with a Vision Transformer (ViT) for global semantic feature extraction; the authors highlighted that the improved model demonstrated a discernible improvement in accuracy when evaluated on the KITTI dataset. Xi et al. [32] proposed LapUNet, a new approach to monocular depth estimation that addresses the difficulty of preserving structural details and edge sharpness in depth maps. This hybrid model uses a ResNeXt101 encoder and a decoder with a Dynamic Laplacian Residual U-shape module to improve the preservation of high-frequency features during upsampling, and their experiments on the NYU Depth V2 dataset showed improved results.
Depth estimation is a fundamental technique in computer vision that enables the extraction of three-dimensional scene structure from two-dimensional images. Monocular depth estimation finds applications in areas such as autonomous systems, robotics, augmented reality, and medical imaging, where accurately determining the distance of objects plays an important role. Increasing model accuracy allows such applications to make safe decisions when interacting with their environment. Traditional UNets are good at capturing fine-grained local details, but they struggle to capture the broader scene context. While large UNet models can achieve high accuracy, they often demand considerable computational resources, reducing their practicality for real-time applications on resource-constrained hardware. This paper addresses the limitations of current hybrid models in effectively integrating local and global information for depth estimation by designing a hybrid architecture. The new model uses a ResNet18 encoder for rich local feature extraction and integrates Transformer-based attention blocks, allowing the model to learn long-range dependencies and understand relationships between distant objects, thereby improving scene understanding. A further contribution of the present work is the Boundary-Aware Depth Consistency Loss (BADCL) function, which combines a boundary-aware term to sharpen edges, a smoothness regularizer for uniform surfaces, and a dynamic scaling factor to ensure scale-consistent predictions.
The contributions of this paper can be summarized as follows:
  • A new hybrid model that combines a ResNet18-based encoder for extracting local spatial features with Transformer attention blocks designed to capture global context and long-range dependencies.
  • A Boundary-Aware Depth Consistency Loss (BADCL) function introduced to improve depth estimation accuracy during training.
  • Comparative results showing that the proposed approach achieves comparable accuracy with fewer trainable parameters.
The remainder of this paper is structured as follows: Section 2 describes the materials and methods, providing an overview of the UNet model, the practical application of monocular depth estimation, and the NYU-Depth V2 dataset used for evaluation. Section 3 introduces the proposed method, describing the architecture of the Transformer-based hybrid UNet, the innovative Boundary-Aware Depth Consistency Loss (BADCL) function, and the metrics used for performance evaluation. In Section 4, we present the experimental results, demonstrating the efficiency of the model on the NYU Depth Dataset V2 using various test metrics. Finally, Section 5 concludes the paper by summarizing the key contributions and suggesting potential directions for future work.

2. Materials and Methods

2.1. UNet Model

The UNet architecture is a well-known and popular deep learning model. Its flexible design and ability to capture multi-scale spatial information make it an ideal choice for a variety of computer vision tasks, including depth estimation. The UNet model is made up of two main parts: the encoder and the decoder. The encoder uses convolutional layers to extract features and reduce spatial resolution. Conversely, the decoder uses transposed convolutions (up-convolutions) to gradually upsample the feature maps and recover the original spatial dimensions. Skip connections between the encoder and decoder help preserve fine-grained information, which is crucial for accurate depth prediction. UNet thus effectively combines local and global feature information, and this multi-scale feature fusion capability has made it well suited for monocular depth estimation. In the encoder block of the UNet, novel model architectures can be constructed by employing various backbone networks such as ResNet, VGG, and MobileNet. These approaches have become increasingly prevalent in contemporary deep learning applications, particularly in tasks like depth estimation.
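For illustration, the following minimal PyTorch sketch shows the encoder–decoder pattern with a skip connection described above. It is a didactic example only; the layer sizes, block count, and names are assumptions and do not correspond to the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Minimal UNet-style encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, in_ch=3, out_ch=1, base=32):
        super().__init__()
        # Encoder: convolutions extract features while pooling reduces spatial resolution
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU(inplace=True))
        # Decoder: transposed convolution (up-convolution) restores the original resolution
        self.up = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        s1 = self.enc1(x)                           # high-resolution features kept for the skip
        e2 = self.enc2(self.down(s1))               # lower-resolution, more abstract features
        d1 = self.up(e2)                            # upsample back toward the input size
        d1 = self.dec1(torch.cat([d1, s1], dim=1))  # skip connection preserves fine detail
        return self.head(d1)                        # single-channel depth map

depth = MiniUNet()(torch.randn(1, 3, 224, 224))  # -> shape [1, 1, 224, 224]
```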

2.2. Monocular Depth Estimation in Practice

Monocular depth estimation (single-camera depth estimation) extracts depth information from a single image and offers advantages over stereo vision, such as simplicity, cost-effectiveness, and versatility. Although it may be less accurate due to intrinsic uncertainty, advances in deep learning techniques are improving its accuracy and extending its applications in areas such as autonomous driving, robotics, and augmented reality. These applications require depth information to perceive their environment and make accurate decisions. Monocular depth estimation is used in such systems because it offers a lower-cost and lighter alternative. However, obtaining depth information from a single camera can be difficult compared to sensors like stereo cameras or LiDAR, as it requires extracting 3D information from 2D images.
Monocular depth estimation uses deep learning methods to analyze 2D images and predict the depth value of each pixel in the scene, as shown in Figure 1. Such systems provide significant advantages, especially for low-cost applications. This technique is typically used in autonomous vehicles for tasks such as environmental perception, road lane and obstacle detection, tracking, and navigation. The accuracy of monocular depth estimation is generally lower than that of stereo or LiDAR-based systems and may encounter some limitations; therefore, many autonomous vehicles combine monocular depth estimation with other sensors (e.g., stereo cameras, LiDAR, or radar) to obtain better results [33]. On the other hand, monocular depth estimation enables depth information acquisition with low-cost camera systems and can replace expensive sensors such as stereo cameras or LiDAR. Nevertheless, accurate depth estimation from a single image is difficult due to scene complexity, lighting conditions, and visibility constraints. Therefore, researchers are working on deep learning-based methods to increase the accuracy of algorithms and improve real-time performance.

2.3. NYU-Depth V2

NYU-Depth V2 [34] is a widely used dataset for depth estimation and indoor scene understanding. It consists of RGB color images and depth data from 464 indoor scenes in three cities, captured with a Microsoft Kinect sensor. The dataset contains 1449 RGB-D image pairs with dense pixel-level semantic annotations that enable multi-class segmentation. Missing depth values in the labeled data are preprocessed to ensure completeness, increasing its utility for supervised learning tasks. Additionally, 407,024 unlabeled frames from continuous video sequences are included to support unsupervised and self-supervised learning approaches. For the experiments, the NYU Depth V2 data were randomly split into three distinct subsets: 70% for training, 10% for validation, and the remaining 20% for testing, corresponding to 35,481 training, 5068 validation, and 10,139 test samples. Figure 2 shows sample images from the NYU-Depth V2 dataset.
Before training the model, each image in the dataset is resized to a fixed resolution of 224 × 224 pixels, matching the input size of the ResNet backbone. After resizing, the pixel values are normalized using the mean and standard deviation derived from the ImageNet dataset. This step is crucial because the model uses weights pretrained on ImageNet. Finally, the processed image is converted into a PyTorch 2.5.1 tensor, preparing it for input to the neural network. For the depth maps, preprocessing involves a multi-step normalization process. First, the raw depth values, initially provided in millimeters, are converted to meters by dividing them by 1000. Next, these values are clipped to enforce a maximum depth of 10 m, thereby addressing outliers or invalid measurements. The final step normalizes the depth map to the range [0, 1] by dividing it by 10. No data augmentation is applied.
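The preprocessing steps described above can be expressed as the following PyTorch/torchvision sketch. The function and variable names are illustrative assumptions; only the operations themselves (resize, ImageNet normalization, millimeter-to-meter conversion, 10 m clipping, and scaling to [0, 1]) come from the text.

```python
import torch
import numpy as np
from torchvision import transforms

# RGB preprocessing: resize to 224x224, convert to tensor, ImageNet normalization
# (expects a PIL image as input)
rgb_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                                # scales to [0, 1], shape [3, H, W]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def preprocess_depth(depth_mm: np.ndarray) -> torch.Tensor:
    """Depth preprocessing as described: mm -> m, clip to 10 m, scale to [0, 1]."""
    depth_m = depth_mm.astype(np.float32) / 1000.0    # millimeters to meters
    depth_m = np.clip(depth_m, 0.0, 10.0)             # enforce a 10 m maximum depth
    depth_norm = depth_m / 10.0                       # normalize to [0, 1]
    return torch.from_numpy(depth_norm).unsqueeze(0)  # shape [1, H, W]
```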

3. Proposed Method

3.1. Transformer-Based Hybrid UNet

The Transformer-enhanced UNet model improves upon the local feature extraction capabilities of the traditional UNet while incorporating the global context captured by Transformer mechanisms. This integration results in a more effective model for monocular depth estimation. The hybrid model is based on the encoder–decoder structure of UNet. In the encoder section, the ResNet18 model is used to extract multi-scale feature maps from the input image, providing local edge, texture, and structural information. In addition to these local features, Transformer blocks are used to incorporate global context, allowing the model to understand the relationships between distant regions. In particular, the Multi-Head Self-Attention (MHSA) [35] mechanism learns long-range relationships in the input data and focuses attention on critical locations, enabling the model to comprehend the scene structure in detail.
In the decoder section, low-resolution feature maps are integrated with the global features processed by the Transformer. This combination allows for merging high-resolution depth maps, effectively preserving both local details and global context. Moreover, skip connections in the UNet enable the transfer of detailed spatial information from the encoder to the decoder. Therefore, skip connections enhance depth estimation accuracy, especially in critical areas such as object boundaries. The improved Boundary-Aware Depth Consistency Loss (BADCL) function of the proposed model further enhances accuracy at object boundaries while ensuring depth consistency in homogeneous regions. The Transformer-based hybrid lightweight architecture combines both local features and global context, contributing to monocular depth estimation applications in autonomous vehicles and robotic systems.
The proposed hybrid depth estimation model consists of three main components, the encoder, Transformer layers, and decoder, comprising fourteen layers in total. Based on the pre-trained ResNet18 backbone, the encoder includes five layers, ResNet Block 1 through ResNet Block 5, which progressively downsample the input image from 224 × 224 to 7 × 7 while extracting hierarchical features. Following the encoder, four Transformer attention layers are applied to the deepest encoder features to capture global context and long-range dependencies. Each Transformer layer uses MHSA and feedforward networks to improve the feature representations. The decoder then reconstructs the high-resolution depth map using four decoder blocks, which perform upsampling operations and use skip connections from the encoder layers. The process ends with the output layer, which produces a single-channel depth map at the original resolution of 224 × 224 × 1. The model details are shown in Table A1 in Appendix A.
Figure 3 illustrates the overall structure of the proposed hybrid UNet model for depth estimation. The structure of the model is organized as follows: First, the encoder section includes ResNet blocks (from 1 to 4). These are followed by a FlattenBlock. Subsequently, Transformer blocks are employed to capture global context and long-range dependencies. Finally, the decoder blocks complete the process by generating a depth map. Figure 4 explains the details of the blocks.
The encoder blocks consist of a Conv2D, a ReLU, and a MaxPooling layer. The FlattenBlock takes the output of the previous layer, flattens and reshapes it, and passes the result to the next block. The Transformer blocks operate as follows: the features from the previous layer are first passed to the Multi-Head Self-Attention (MHSA) layer, and its output is added to the block input through a residual connection before being fed into a Normalization layer. The normalized features are then passed to a Dense (feedforward) layer, whose output is again combined with its input through a residual connection and normalized. The result is forwarded to the next block, and this pattern is repeated for each Transformer layer. The decoder blocks contain an UpSampling2D, a Concatenate, and a Conv2D layer.
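A minimal PyTorch sketch of such a Transformer attention block is given below, assuming standard residual connections around the attention and dense sub-layers. The class name, head count, and feedforward size are illustrative assumptions, not the exact implementation of the proposed model.

```python
import torch
import torch.nn as nn

class TransformerAttentionBlock(nn.Module):
    """Sketch of the described block: MHSA and a dense layer, each followed by a
    residual addition and normalization (dimensions and names are illustrative)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(inplace=True),
                                 nn.Linear(channels, channels))
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)           # FlattenBlock: [B, H*W, C]
        attn_out, _ = self.attn(tokens, tokens, tokens)    # multi-head self-attention
        tokens = self.norm1(tokens + attn_out)             # residual connection + normalization
        tokens = self.norm2(tokens + self.ffn(tokens))     # dense layer, residual + normalization
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # reshape back to a feature map

out = TransformerAttentionBlock(256)(torch.randn(2, 256, 14, 14))  # -> [2, 256, 14, 14]
```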

3.2. Hybrid Loss Function

In deep learning, the loss function plays an important role in training the model parameters efficiently. Selecting a suitable loss function can improve the performance of the model, optimize its efficiency, and enhance its ability to focus on the intended target. A single loss function may not always provide pixel-level accuracy for intricate tasks such as depth estimation.
In this study, a new loss function is proposed: the Boundary-Aware Depth Consistency Loss (BADCL), which improves upon the boundary-aware loss [36]. BADCL is designed to enhance the accuracy of monocular depth estimation by highlighting the essential features of depth maps. It consists of three components, boundary awareness, smoothness regularization, and dynamic scaling, to optimize monocular depth estimation.

3.2.1. Boundary-Aware Loss

Boundary-aware loss focuses on improving depth accuracy near high-gradient regions (e.g., object boundaries) where sharp changes in depth are crucial [37]. The mathematical equation is presented below:
$$\mathcal{L}_{\text{boundary}} = \frac{1}{N} \sum_{i=1}^{N} B(i) \cdot \left| \hat{D}(i) - D(i) \right|$$
In this formula, $\hat{D}(i)$ represents the predicted depth value at pixel $i$, while $D(i)$ represents the corresponding ground truth depth value. The term $|\hat{D}(i) - D(i)|$ calculates the absolute error in depth prediction for pixel $i$. The boundary map $B(i)$ is a weighting factor in the loss function and indicates the importance of pixels near high-gradient areas of the image. The overall loss is computed by summing these weighted errors over all $N$ pixels and normalizing by $N$.
$$B(i) = \exp\!\left( \sqrt{ I_x^2 + I_y^2 } \right)$$

3.2.2. Smoothness Regularization Loss

The smoothness term is applied only in low-gradient regions, with weights derived from the image gradients [36].
$$\mathcal{L}_{\text{smoothness}} = \frac{1}{N} \sum_{i=1}^{N} \left( \left| \hat{D}_x(i) \right| \cdot e^{-\left| I_x(i) \right|} + \left| \hat{D}_y(i) \right| \cdot e^{-\left| I_y(i) \right|} \right)$$
The predicted depth map is defined as $\hat{D}$, with its gradients along the horizontal and vertical directions represented as $\hat{D}_x$ and $\hat{D}_y$, respectively. These gradients capture changes in the depth prediction. The image gradients, defined as $I_x$ and $I_y$, represent the intensity changes in the input image in the horizontal and vertical directions. The exponential terms $e^{-|I_x|}$ and $e^{-|I_y|}$ serve as weights that emphasize the smoothness loss in regions where the image gradients are low, indicating less texture or detail. The weighted depth gradients are summed over all $N$ pixels and then normalized by $N$.

3.2.3. Dynamic Scaling Loss

The dynamic scaling term scales the predicted depth to align it with the ground truth [38]. The equation is given below.
$$\mathcal{L}_{\text{scaling}} = \frac{1}{N} \sum_{i=1}^{N} s \cdot \left| \hat{D}(i) - D(i) \right|$$
For each pixel $i$, the term $|\hat{D}(i) - D(i)|$ calculates the absolute error between the predicted and the ground truth depth values. This error is scaled by a dynamic factor $s$, which aligns the overall scale of the predicted depth map with that of the ground truth. The dynamic scaling factor is calculated as follows:
$$s = \frac{\sum_{i=1}^{N} D(i)}{\sum_{i=1}^{N} \hat{D}(i) + \epsilon}$$
The scaling factor $s$ is determined by the ratio of the sum of the ground truth depth values $\sum_{i=1}^{N} D(i)$ to the sum of the predicted depth values $\sum_{i=1}^{N} \hat{D}(i)$. A small constant $\epsilon$ is added to the denominator to prevent division by zero.

3.3. Combined Loss Functions

The total loss function combines the three components with weighting factors:
$$\mathcal{L}_{\text{BADCL}} = \lambda_{\text{boundary}} \cdot \mathcal{L}_{\text{boundary}} + \lambda_{\text{smoothness}} \cdot \mathcal{L}_{\text{smoothness}} + \lambda_{\text{scaling}} \cdot \mathcal{L}_{\text{scaling}}$$

3.4. Hyperparameters

  • $\lambda_{\text{boundary}}$, $\lambda_{\text{smoothness}}$, $\lambda_{\text{scaling}}$: control the contribution of each term to the final loss.
  • The three components are combined with weighting factors to form the BADCL (Boundary-Aware Depth Consistency Loss) function. BADCL is designed to enhance the accuracy of monocular depth maps while ensuring the overall scale of the depth map is correct. As a result, monocular depth estimates provide more reliable and accurate predictions.
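For clarity, the following PyTorch sketch combines the three loss terms as defined above. The finite-difference gradient operators, padding, and mean reductions are illustrative assumptions rather than the exact implementation; the default weights follow the empirically chosen values reported in the experiments.

```python
import torch
import torch.nn.functional as F

def badcl_loss(pred, target, image, lam_b=3.0, lam_s=1.2, lam_sc=1.0, eps=1e-8):
    """Sketch of the BADCL terms (pred, target: [B,1,H,W]; image: [B,3,H,W]).
    Gradient handling and reductions are assumptions for illustration."""
    # Image gradients (grayscale) via finite differences
    gray = image.mean(dim=1, keepdim=True)
    gx = gray[:, :, :, 1:] - gray[:, :, :, :-1]
    gy = gray[:, :, 1:, :] - gray[:, :, :-1, :]

    # Boundary-aware term: weight the absolute depth error by an edge map B(i)
    gx_f = F.pad(gx, (0, 1, 0, 0))
    gy_f = F.pad(gy, (0, 0, 0, 1))
    boundary_map = torch.exp(torch.sqrt(gx_f ** 2 + gy_f ** 2 + eps))
    l_boundary = (boundary_map * (pred - target).abs()).mean()

    # Smoothness term: penalize depth gradients where image gradients are small
    dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    l_smooth = (dx * torch.exp(-gx.abs())).mean() + (dy * torch.exp(-gy.abs())).mean()

    # Dynamic scaling term: align the global scale of prediction and ground truth
    s = target.sum() / (pred.sum() + eps)
    l_scale = (s * (pred - target).abs()).mean()

    return lam_b * l_boundary + lam_s * l_smooth + lam_sc * l_scale
```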

3.5. Performance Metrics

3.5.1. SSIM (Structural Similarity Index Measure)

SSIM is a metric used to evaluate the structural similarity between two images. It considers pixel-level differences and factors like luminance, contrast, and structural similarity. The Structural Similarity Index (SSIM) is defined as
$$\text{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
The luminance, represented by the means $\mu_x$ and $\mu_y$ of the two image patches $x$ and $y$, measures the similarity in brightness levels. The contrast, determined by the standard deviations $\sigma_x$ and $\sigma_y$, measures the difference in intensity variations between the images. The structural similarity, captured by the covariance $\sigma_{xy}$, reflects the correlation of pixel patterns and textures in the two images. Two stabilizing constants, $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, where $L$ is the dynamic range of pixel values and $K_1$, $K_2$ are small constants, ensure numerical stability by preventing division by zero.

3.5.2. MSE (Mean Squared Error)

The MSE between the predicted depth map (x) and the ground truth depth map (y) is calculated as
$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$
MSE is computed over all $N$ pixels in the image, where $N$ represents the total number of pixels. For each pixel $i$, the predicted depth value is denoted $x_i$ and the ground truth depth value $y_i$. The squared difference $(x_i - y_i)^2$ measures the error for pixel $i$, and the summation accumulates these errors over all pixels. Finally, dividing by $N$ gives the average squared error as a measure of the quality of the result.

3.5.3. ARE (Absolute Relative Error)

The ARE metric for a predicted depth map (x) and a ground truth depth map (y) is calculated as
$$\text{ARE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|x_i - y_i|}{y_i}$$
Here, $N$ represents the total number of pixels in the depth map. For each pixel $i$, the predicted depth value is denoted $x_i$, while the corresponding ground truth value is $y_i$. The absolute difference $|x_i - y_i|$ measures the magnitude of the error at pixel $i$, and the term $\frac{|x_i - y_i|}{y_i}$ expresses this error relative to the true depth value. Summing these relative errors over all pixels and dividing by $N$ gives the average relative error of the computed depth map. This measure is useful for evaluating depth estimation performance, since the magnitude of depth values varies throughout the map.

3.5.4. δ -Metrics

Delta metrics ($\delta$-metrics) are crucial accuracy measures in depth estimation. They evaluate how closely predicted depth values align with ground truth values by measuring their adherence to specific error thresholds. The thresholds $\delta < 1.25$, $\delta < 1.25^2$, and $\delta < 1.25^3$ are widely used standards for evaluating depth estimation performance in the literature. The delta metrics are defined as
$$\delta_k = \text{percentage of predictions } d_i \text{ such that } \max\!\left( \frac{d_i}{g_i}, \frac{g_i}{d_i} \right) < \delta^k$$
For each prediction $d_i$ and its corresponding ground truth depth value $g_i$, the term $\max\!\left(\frac{d_i}{g_i}, \frac{g_i}{d_i}\right)$ represents the relative error. This ensures that predictions close to the ground truth are considered accurate. The metric uses a threshold $\delta^k$, where $\delta$ is a base threshold value (e.g., 1.25) and $k$ defines the tolerance for larger errors. Typically, $k$ takes values of 1, 2, or 3, representing increasingly lenient levels of error acceptance. The final value of $\delta_k$ indicates the percentage of predictions that satisfy $\max\!\left(\frac{d_i}{g_i}, \frac{g_i}{d_i}\right) < \delta^k$.

3.5.5. Log10 Error (Mean Logarithmic Error)

The Log10 Error provides a scale-invariant and relative accuracy metric for depth estimation. The Log10 Error is calculated as
$$\text{Log10} = \frac{1}{N} \sum_{i=1}^{N} \left| \log_{10}(d_i) - \log_{10}(g_i) \right|$$
The Log10 Error metric is applied to all $N$ pixels in the depth map, where $N$ represents the total number of pixels. The logarithmic transformation, $\log_{10}(d_i)$ for the predicted values and $\log_{10}(g_i)$ for the ground truth values, makes the metric scale-invariant. To compute the Log10 Error, the sum operator accumulates the absolute differences $|\log_{10}(d_i) - \log_{10}(g_i)|$ over all pixels and divides by $N$.
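As a reference for how these metrics can be computed in practice, the following NumPy sketch evaluates MSE, RMSE, ARE, Log10 error, and the $\delta$-accuracies on flattened depth maps. Masking of invalid pixels and the windowed SSIM computation are omitted, and the function name is an illustrative assumption.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """Sketch of the error and accuracy metrics defined above (positive depths assumed)."""
    pred, gt = pred.flatten(), gt.flatten()
    mse = np.mean((pred - gt) ** 2)
    are = np.mean(np.abs(pred - gt) / (gt + eps))             # Absolute Relative Error
    log10 = np.mean(np.abs(np.log10(pred + eps) - np.log10(gt + eps)))
    ratio = np.maximum(pred / (gt + eps), gt / (pred + eps))  # max(d/g, g/d)
    delta = {f"delta<{1.25 ** k:.4g}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return {"MSE": mse, "RMSE": np.sqrt(mse), "ARE": are, "Log10": log10, **delta}

# Example on random positive depth values (illustrative only)
m = depth_metrics(np.random.uniform(0.5, 10, 1000), np.random.uniform(0.5, 10, 1000))
```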

4. Experimental Results

The experiments were carried out on a computer equipped with a 12-core CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, running Ubuntu 22.04. The model was implemented in Python 3.12.5 using the PyTorch deep learning library. The proposed model was trained for 100 epochs on the NYU Depth V2 dataset, a widely recognized benchmark for depth estimation research. The loss function weights were empirically determined and set as follows: $\lambda_{\text{boundary}} = 3.0$, $\lambda_{\text{smoothness}} = 1.2$, and $\lambda_{\text{scaling}} = 1.0$. The constant $\epsilon$ in Equation (5), used for numerical stability, was set to $1 \times 10^{-8}$. The remaining training details, such as the backbone, optimizer, and learning rate, are summarized in Table 1.
The proposed hybrid model showed promising performance on widely used monocular depth estimation metrics, including Root Mean Squared Error (RMS), Mean Relative Error (MRE), mean Log10 error, and thresholded accuracy measures ($\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$). In addition, the SSIM value, which highlights the model’s ability to preserve spatial structure during depth prediction, reached 0.998. Table 2 compares our model against other existing methods, while Figure 5 shows that after training the model for 100 epochs, the training loss approaches 0.0005 and the validation loss reaches approximately 0.0006. This consistent reduction in loss demonstrates the training convergence of the proposed model.
The performance of a monocular depth estimation model can be improved by changing its architecture and loss function. In this study, the backbone of the proposed model was constructed using ResNet18, whose residual connections help avoid the vanishing gradient problem. In the experiments, the lightweight model achieved the best results among the compared models from the literature. Furthermore, loss functions are critical in solving monocular depth estimation problems; the proposed hybrid loss function, a combination of boundary-aware loss, smoothness regularization loss, and dynamic scaling loss, considerably improved depth estimation accuracy.
Although our method achieved good performance, as shown visually in Figure 6, a closer inspection of the predicted maps in complex scenes reveals some areas for improvement. Specifically, our model tends to smooth over fine, high-frequency details in visually dense scenes such as the store aisle depicted in our results. This reflects the inherent difficulty of resolving ambiguous, repetitive patterns from a single 2D image, where the model may average out features to maintain overall structural consistency. A second limitation is observed with thin objects such as chair legs, where predictions can show blurring at the boundaries. This is likely caused by the low pixel count of such structures, for which the convolutional encoder cannot extract robust features without some spatial uncertainty.
The accuracy and efficiency of the model make it suitable for a range of applications. For instance, in augmented reality, accurate depth maps are essential for properly placing virtual objects within real-world scenes, ensuring realistic occlusion by actual objects. Since the model was evaluated on the NYU-Depth V2 indoor dataset, it is particularly well suited for indoor robotics, enabling tasks such as obstacle avoidance, mapping, and safe navigation in cluttered environments. Additionally, the model holds significant potential for autonomous driving systems, where accurate depth information is critical for environmental perception. However, because the model was trained and tested exclusively on the NYU-Depth V2 dataset, its performance in dynamic outdoor lighting and complex weather conditions remains untested. Furthermore, as a monocular system, it is inherently more susceptible to challenges due to scene complexity and visibility constraints compared to multi-sensor systems that utilize LiDAR or stereo cameras.

Computational Complexity

This study developed an efficient model suitable for real-time applications on resource-constrained hardware, utilizing a lightweight ResNet18 backbone. The resulting model comprises approximately 4.6 million trainable parameters and achieves promising performance with fewer parameters than larger architectures. The model processes a fixed input resolution of 224 × 224 pixels, and a single forward pass requires 43.02 giga multiply-accumulate operations (mult-adds). Since each mult-add accounts for two floating-point operations (FLOPs), the overall computational cost of processing a single 224 × 224 frame is approximately 86.04 GFLOPs. This makes the architecture a computationally efficient solution for monocular depth estimation.
Table 3 shows the experiments conducted for the FPS tests and provides a performance analysis of the model in real-time applications. While a single-image batch achieves a high throughput of nearly 250 images per second, the performance scales significantly with larger batch sizes, increasing to 1200 images per second with a batch size of 32. Furthermore, the latency, or the average time to process a single image, decreases considerably as the batch size increases. The processing time drops from 4 ms for a single image to under 1 ms per image for all batch sizes of 4 or greater. The performance gains begin to plateau at larger batch sizes (16 and 32), which suggests that the GPU is approaching its maximum utilization for this particular model. Overall, the results confirm that the model can operate beyond typical real-time thresholds such as 30–60 FPS.
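A throughput measurement of this kind can be sketched as follows in PyTorch. The warm-up and iteration counts, function name, and synchronization strategy are assumptions, a CUDA device is assumed to be available, and absolute numbers depend on the hardware and batch size, as Table 3 illustrates.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=8, iters=50, device="cuda"):
    """Hedged sketch of an FPS/latency measurement loop (requires a CUDA device)."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):                      # warm-up iterations (not timed)
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):                   # timed forward passes
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    images_per_s = iters * batch_size / elapsed
    latency_ms = 1000.0 * elapsed / (iters * batch_size)
    return images_per_s, latency_ms
```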

5. Conclusions

This study presents a novel hybrid depth estimation model aimed at improving the accuracy of monocular depth estimation for applications in areas such as autonomous driving and robotics. The model builds on existing methods by combining a ResNet18-based encoder for local feature extraction with Transformer attention layers to capture global contexts. Another improvement of the study is the BADCL, a new loss function that integrates boundary awareness, smoothness regularization, and dynamic scaling. This approach enhances depth estimation accuracy, especially around critical areas like object edges, while ensuring smoother predictions in uniform regions. Experimental evaluations on the NYU Depth V2 dataset showed quantitatively that the proposed model achieved an Absolute Relative Error (ARE) of 0.063, a Log10 error of 0.026, and a thresholded accuracy ( δ < 1.25 ) of 0.982. With a Structural Similarity Index (SSIM) of 99.8%, these results have significant implications for practical applications. The increased accuracy enables autonomous systems to make safer decisions when interacting with their environment. Additionally, the proposed method accomplishes these results with fewer trainable parameters, addressing the demand for efficient models suitable for real-time applications on resource-constrained hardware. Future work will explore more lightweight architectures and effective loss functions to enhance performance for different datasets.

Author Contributions

Conceptualization, H.T. and D.A.; methodology, H.T. and D.A.; software, H.T.; validation and data curation, H.T.; original draft preparation, H.T. and D.A.; review and editing, H.T. and D.A.; visualization, H.T. and D.A.; supervision, D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NYU Depth V2 dataset used in this study is openly available at https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html [34] (accessed on 1 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMENet: A Monocular Depth Estimation Network
ARE: Absolute Relative Error
BADCL: Boundary-Aware Depth Consistency Loss
CATNet: Convolutional Attention and Transformer
CNN: Convolutional Neural Network
CRF: Conditional Random Field
DAT: Dual Attention Transformer
DB: Decoder Block
DNet: Hierarchical Self-Supervised Monocular Absolute Depth Estimation
EB: Encoder Block
FLOPs: Floating-Point Operations
LapUNet: Laplacian Residual U-shape Network
LiDAR: Light Detection and Ranging
MCA: Multi-dimensional Convolutional Attention
MHSA: Multi-Head Self-Attention
MRF: Markov Random Field
MSE: Mean Squared Error
PHOG: Pyramid Histogram of Oriented Gradients
RCConv: Reducing Channel Convolution
REL: Mean Relative Error
ResNet: Residual Network
RMS: Root Mean Squared Error
SIFT: Scale-Invariant Feature Transform
SPT-Depth: Stereoscopic Pyramid Transformer-Depth
SSIM: Structural Similarity Index Measure
SURF: Speeded Up Robust Features
TB: Transformer Block
UNet: U-shaped encoder–decoder network architecture
ViT: Vision Transformer

Appendix A

Table A1. Hybrid depth estimation model summary.
Layer (Type) | Output Shape | Param #
HybridDepthEstimationModel | [8, 1, 224, 224] | –
 CNNEncoder | [8, 64, 56, 56] | –
  Sequential | [8, 64, 56, 56] | –
   Conv2d | [8, 64, 112, 112] | 9408
   BatchNorm2d | [8, 64, 112, 112] | 128
   ReLU | [8, 64, 112, 112] | –
   MaxPool2d | [8, 64, 56, 56] | –
  Sequential | [8, 64, 56, 56] | –
   BasicBlock | [8, 64, 56, 56] | 73,984
   BasicBlock | [8, 64, 56, 56] | 73,984
  Sequential | [8, 128, 28, 28] | –
   BasicBlock | [8, 128, 28, 28] | 230,144
   BasicBlock | [8, 128, 28, 28] | 295,424
  Sequential | [8, 256, 14, 14] | –
   BasicBlock | [8, 256, 14, 14] | 919,040
   BasicBlock | [8, 256, 14, 14] | 1,180,672
  TransformerAttention | [8, 128, 28, 28] | –
   MultiheadAttention | [8, 784, 128] | 66,048
   LayerNorm | [8, 784, 128] | 256
   Sequential | [8, 784, 128] | 131,712
   LayerNorm | [8, 784, 128] | 256
  TransformerAttention | [8, 256, 14, 14] | –
   MultiheadAttention | [8, 196, 256] | 263,168
   LayerNorm | [8, 196, 256] | 512
   Sequential | [8, 196, 256] | 525,568
   LayerNorm | [8, 196, 256] | 512
  UpsampleBlock | [8, 128, 28, 28] | –
   Conv2d | [8, 128, 28, 28] | 442,496
   BatchNorm2d | [8, 128, 28, 28] | 256
   Conv2d | [8, 128, 28, 28] | 147,584
   BatchNorm2d | [8, 128, 28, 28] | 256
  UpsampleBlock | [8, 64, 56, 56] | –
   Conv2d | [8, 64, 56, 56] | 110,656
   BatchNorm2d | [8, 64, 56, 56] | 128
   Conv2d | [8, 64, 56, 56] | 36,928
   BatchNorm2d | [8, 64, 56, 56] | 128
  UpsampleBlock | [8, 64, 112, 112] | –
   Conv2d | [8, 64, 112, 112] | 73,792
   BatchNorm2d | [8, 64, 112, 112] | 128
   Conv2d | [8, 64, 112, 112] | 36,928
   BatchNorm2d | [8, 64, 112, 112] | 128
  ConvTranspose2d | [8, 32, 224, 224] | 32,800
  Conv2d | [8, 1, 224, 224] | 289
Total params: 4,653,313
Trainable params: 4,653,313
Non-trainable params: 0
Total mult-adds (G): 43.02

References

  1. Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep learning for lidar point clouds in autonomous driving: A review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432. [Google Scholar] [CrossRef]
  2. Tang, C.; Hou, C.; Song, Z. Depth recovery and refinement from a single image using defocus cues. J. Mod. Opt. 2015, 62, 441–448. [Google Scholar] [CrossRef]
  3. Tsai, Y.M.; Chang, Y.L.; Chen, L.G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proceedings of the 2006 IEEE International Symposium on Intelligent Signal Processing and Communications, Yonago, Japan, 12–15 December 2006; pp. 586–589. [Google Scholar] [CrossRef]
  4. Zhang, R.; Tsai, P.S.; Cryer, J.E.; Shah, M. Shape-from-shading: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 690–706. [Google Scholar] [CrossRef]
  5. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar] [CrossRef]
  6. Bay, H. Surf: Speeded up robust features. In Computer Vision—ECCV; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar] [CrossRef]
  7. Bosch, A.; Zisserman, A.; Munoz, X. Image classification using random forests and ferns. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar] [CrossRef]
  8. Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
  9. Cross, G.R.; Jain, A.K. Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell. 1983, PAMI-5, 25–39. [Google Scholar] [CrossRef]
  10. Güney, E.; Bayılmış, C.; Çakar, S.; Erol, E.; Atmaca, Ö. Autonomous control of shore robotic charging systems based on computer vision. Expert Syst. Appl. 2024, 238, 122116. [Google Scholar] [CrossRef]
  11. Yolcu, G.; Oztel, I.; Kazan, S.; Oz, C.; Palaniappan, K.; Lever, T.E.; Bunyak, F. Facial expression recognition for monitoring neurological disorders based on convolutional neural network. Multimed. Tools Appl. 2019, 78, 31581–31603. [Google Scholar] [CrossRef] [PubMed]
  12. Sazak, H.; Kotan, M. Automated Blood Cell Detection and Classification in Microscopic Images Using YOLOv11 and Optimized Weights. Diagnostics 2024, 15, 22. [Google Scholar] [CrossRef] [PubMed]
  13. Yang, W.J.; Wu, C.C.; Yang, J.F. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation. Sensors 2024, 25, 80. [Google Scholar] [CrossRef]
  14. O’Shea, K. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar] [CrossRef]
  15. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  17. Sinha, D.; El-Sharkawy, M. Thin mobilenet: An enhanced mobilenet architecture. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; pp. 280–285. [Google Scholar] [CrossRef]
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  19. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar] [CrossRef]
  20. Duong, H.T.; Chen, H.M.; Chang, C.C. URNet: An UNet-based model with residual mechanism for monocular depth estimation. Electronics 2023, 12, 1450. [Google Scholar] [CrossRef]
  21. Wang, B.; Wang, S.; Dou, Z.; Ye, D. Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation. arXiv 2024, arXiv:2309.09272. [Google Scholar] [CrossRef]
  22. Xue, F.; Zhuo, G.; Huang, Z.; Fu, W.; Wu, Z.; Ang, M. Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 2330–2337. [Google Scholar] [CrossRef]
  23. Tadepalli, Y.; Kollati, M.; Kuraparthi, S.; Kora, P. EfficientNet-B0 Based Monocular Dense-Depth Map Estimation. Trait. Signal 2021, 38, 1485–1493. [Google Scholar] [CrossRef]
  24. Li, R.; Yu, H.; Du, K.; Xiao, Z.; Yan, B.; Yuan, Z. Adaptive Semantic Fusion Framework for Unsupervised Monocular Depth Estimation. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  25. Dang, Y.; Li, C.; Zhang, L.; Gao, Y. RCCNet: Reducing Channel Convolution Network for Monocular Depth Estimation. In Proceedings of the 2023 4th IEEE International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 1–4. [Google Scholar] [CrossRef]
  26. Shen, M.; Wang, Z.; Su, S.; Liu, C.; Chen, Q. DNA-Depth: A Frequency-Based Day-Night Adaptation for Monocular Depth Estimation. IEEE Trans. Instrum. Meas. 2023, 72, 2530112. [Google Scholar] [CrossRef]
  27. Xia, Z.; Wu, T.; Wang, Z.; Zhou, M.; Wu, B.; Chan, C.; Kong, L.B. Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion. Sci. Rep. 2024, 14, 7037. [Google Scholar] [CrossRef]
  28. Sharma, M.; Choudhary, R.; Anil, R. 2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 757–764. [Google Scholar]
  29. Tang, S.; Lu, T.; Liu, X.; Zhou, H.; Zhang, Y. CATNet: Convolutional attention and transformer for monocular depth estimation. Pattern Recognit. 2024, 145, 109982. [Google Scholar] [CrossRef]
  30. Kolbeinsson, B.; Mikolajczyk, K. UCorr: Wire Detection and Depth Estimation for Autonomous Drones. In Proceedings of the International Conference on Robotics, Computer Vision and Intelligent Systems, Rome, Italy, 25–27 February 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 179–192. [Google Scholar] [CrossRef]
  31. Wu, T.; Xia, Z.; Zhou, M.; Kong, L.B.; Chen, Z. AMENet is a monocular depth estimation network designed for automatic stereoscopic display. Sci. Rep. 2024, 14, 5868. [Google Scholar] [CrossRef]
  32. Xi, Y.; Li, S.; Xu, Z.; Zhou, F.; Tian, J. LapUNet: A novel approach to monocular depth estimation using dynamic laplacian residual U-shape networks. Sci. Rep. 2024, 14, 23544. [Google Scholar] [CrossRef]
  33. Zou, Y.; Luo, Z.; Huang, J.B. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 36–53. [Google Scholar]
  34. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the ECCV, Florence, Italy, 7–13 October 2012. [Google Scholar] [CrossRef]
  35. Wu, C.; Wu, F.; Ge, S.; Qi, T.; Huang, Y.; Xie, X. Neural news recommendation with multi-head self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6389–6394. [Google Scholar] [CrossRef]
  36. Borse, S.; Wang, Y.; Zhang, Y.; Porikli, F. Inverseform: A loss function for structured boundary-aware segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5901–5911. [Google Scholar]
  37. Ngoc, M.Ô.V.; Chen, Y.; Boutry, N.; Chazalon, J.; Carlinet, E.; Fabrizio, J.; Mallet, C.; Géraud, T. Introducing the Boundary-Aware loss for deep image segmentation. In Proceedings of the British Machine Vision Conference (BMVC) 2021, Online, 22–25 November 2021. [Google Scholar]
  38. Abdusalomov, A.; Umirzakova, S.; Shukhratovich, M.B.; Kakhorov, A.; Cho, Y.I. Breaking New Ground in Monocular Depth Estimation with Dynamic Iterative Refinement and Scale Consistency. Appl. Sci. 2025, 15, 674. [Google Scholar] [CrossRef]
  39. Mancini, M.; Costante, G.; Valigi, P.; Ciarfuglia, T.A.; Delmerico, J.; Scaramuzza, D. Toward domain independence for learning-based monocular depth estimation. IEEE Robot. Autom. Lett. 2017, 2, 1778–1785. [Google Scholar] [CrossRef]
  40. Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925. [Google Scholar] [CrossRef]
  41. Alhashim, I.; Wonka, P. High Quality Monocular Depth Estimation via Transfer Learning. arXiv 2018, arXiv:1812.11941. [Google Scholar] [CrossRef]
  42. Li, J.; Klein, R.; Yao, A. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3372–3380. [Google Scholar]
  43. Rudolph, M.; Dawoud, Y.; Güldenring, R.; Nalpantidis, L.; Belagiannis, V. Lightweight monocular depth estimation through guided decoding. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2344–2350. [Google Scholar] [CrossRef]
  44. Lee, J.H.; Kim, C.S. Monocular depth estimation using relative depth maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9729–9738. [Google Scholar]
  45. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2485–2494. [Google Scholar] [CrossRef]
  46. Basak, H.; Ghosal, S.; Sarkar, M.; Das, M.; Chattopadhyay, S. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image. In Proceedings of the 2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), Prayagraj, India, 27–29 November 2020; pp. 1–6. [Google Scholar] [CrossRef]
  47. Das, D.; Das, A.D.; Sadaf, F. Depth Estimation From Monocular Images with Enhanced Encoder-Decoder Architecture. arXiv 2024, arXiv:2410.11610. [Google Scholar] [CrossRef]
  48. Ignatov, D.; Ignatov, A.; Timofte, R. Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? In Proceedings of the Synthetic Data for Computer Vision Workshop@ CVPR 2024, Seattle, WA, USA, 17–21 June 2024.
Figure 1. Monocular depth estimation from a single camera image [34].
Figure 2. Examples of RGB images and corresponding depth maps from NYUv2 dataset [34].
Figure 3. Proposed hybrid UNet depth estimation model.
Figure 4. The content of the layers inside each block.
Figure 5. Training and validation loss plots for (a) total loss, (b) boundary-aware loss, (c) smoothness loss, and (d) dynamic scaling loss.
Figure 6. Input RGB images, ground truth depth maps, and predicted depth maps.
Table 1. Training details.
Parameter | Value / Specification
Backbone | ResNet-18 (pretrained on ImageNet)
Optimizer | AdamW (PyTorch defaults)
Learning Rate | 1 × 10⁻⁴ (constant, no schedule)
Batch Size | 8
Number of Epochs | 100
Random Seed | 42 (for dataset splits)
Table 2. Performance comparison of various existing methods for monocular depth estimation using error metrics and accuracy at different thresholds ($\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$). Error metrics (ARE, RMSE, Log10): lower is better; accuracy (δ thresholds): higher is better. The proposed model outperforms the others.
Method | ARE | RMSE | Log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³
Mancini et al. [39] | 0.312 | 0.565 | 0.336 | 0.809 | 0.786 | 0.911
Xu et al. [40] | 0.125 | 0.593 | 0.057 | 0.806 | 0.952 | 0.986
Alhashim et al. [41] | 0.123 | 0.465 | 0.053 | 0.846 | 0.974 | 0.994
Li et al. (VGG16) [42] | 0.152 | 0.611 | 0.064 | 0.789 | 0.955 | 0.988
Li et al. (VGG19) [42] | 0.146 | 0.617 | 0.063 | 0.795 | 0.958 | 0.991
Li et al. (ResNet50) [42] | 0.143 | 0.635 | 0.063 | 0.788 | 0.958 | 0.991
Rudolph et al. [43] | 0.138 | 0.501 | 0.058 | 0.823 | 0.961 | 0.990
Lee et al. [44] | 0.131 | 0.538 | – | 0.837 | 0.971 | 0.994
Guizilini et al. [45] | 0.072 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994
Basak et al. [46] | 0.103 | 0.388 | – | 0.892 | 0.978 | 0.995
Das et al. (Enc-Dec-IRv2) [47] | 0.064 | 0.228 | 0.032 | 0.893 | 0.967 | 0.985
Ignatov et al. [48] | 0.090 | 0.322 | 0.039 | 0.929 | 0.991 | 0.998
Hybrid Ensemble UNet (Proposed) | 0.063 | 0.237 | 0.026 | 0.982 | 0.996 | 0.998
The best result in each column is highlighted in bold.
Table 3. Model inference performance by batch size.
Batch Size | Throughput (Images/s) | Batches/s | Latency (ms/Image)
1 | 249.84 | 249.84 | 4.003
4 | 1011.47 | 252.87 | 0.989
8 | 1087.35 | 135.92 | 0.920
16 | 1176.84 | 73.55 | 0.850
32 | 1212.19 | 37.88 | 0.825
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
