Article

Robust Self-Supervised Monocular Depth Estimation via Intrinsic Albedo-Guided Multi-Task Learning

1 Graduate School of Science and Engineering, Ritsumeikan University, Kusatsu 525-8577, Shiga, Japan
2 Department of Intelligent Robotics, Faculty of Information Engineering, Toyama Prefectural University, Imizu 939-0398, Toyama, Japan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 714; https://doi.org/10.3390/app16020714
Submission received: 1 December 2025 / Revised: 25 December 2025 / Accepted: 5 January 2026 / Published: 9 January 2026
(This article belongs to the Special Issue Convolutional Neural Networks and Computer Vision)

Abstract

Self-supervised monocular depth estimation has demonstrated high practical utility, as it can be trained using a photometric image reconstruction loss between the original image and a reprojected image generated from the estimated depth and relative pose, thereby alleviating the burden of large-scale label creation. However, this photometric image reconstruction loss relies on the Lambertian reflectance assumption. Under non-Lambertian conditions such as specular reflections or strong illumination gradients, pixel values fluctuate depending on the lighting and viewpoint, which often misguides training and leads to large depth errors. To address this issue, we propose a multitask learning framework that integrates albedo estimation as a supervised auxiliary task. The proposed framework is implemented on top of representative self-supervised monocular depth estimation backbones, including Monodepth2 and Lite-Mono, by adopting a multi-head architecture in which the shared encoder–decoder branches at each upsampling block into a Depth Head and an Albedo Head. Furthermore, we apply Intrinsic Image Decomposition to generate albedo images and design an albedo supervision loss that uses these albedo maps as training targets for the Albedo Head. We then integrate this loss term into the overall training objective, explicitly exploiting illumination-invariant albedo components to suppress erroneous learning in reflective regions and areas with strong illumination gradients. Experiments on the ScanNetV2 dataset demonstrate that, for the lightweight backbone Lite-Mono, our method achieves an average reduction of 18.5% over the four standard depth error metrics and consistently improves accuracy metrics, without increasing the number of parameters and FLOPs at inference time.

1. Introduction

Self-supervised monocular depth estimation learns to predict depth from sequential monocular images during training, without requiring ground-truth depth maps. By obviating the need for costly ground-truth depth acquisition (e.g., LiDAR-based measurements), this paradigm has substantially broadened the feasibility of leveraging commercial datasets [1]. Its effectiveness primarily stems from view synthesis: an image is reconstructed from adjacent frames using the estimated depth and relative pose, and learning is driven by enforcing a photometric consistency constraint between the reconstructed and original images [2,3]. Owing to sustained advances in recent years, self-supervised monocular depth estimation has achieved consistent accuracy improvements and has been increasingly adopted in practical settings, including autonomous driving [4], robotics [5], and augmented reality [6].
In recent years, research has shifted from merely achieving high accuracy on simple benchmarks to examining the generalization performance of trained models under changes in environment and domain [7,8,9]. However, the photometric image reconstruction loss relies on the Lambertian reflectance assumption, which presumes that the appearance of a surface remains constant regardless of the viewing direction. Under non-Lambertian conditions—such as specular reflections, metal, glass, water surfaces, glossy plastics, or strong highlights, shadows, and illumination changes—even when the same point is reprojected, pixel values vary depending on illumination and viewpoint rather than geometry. Consequently, the photometric error can be introduced into depth estimation as a spurious supervision signal, often leading to severe depth degradation in scenes characterized by strong reflections or sharp illumination gradients [3,10].
To enhance the robustness of self-supervised monocular depth estimation, a variety of approaches have been explored. In particular, some studies improve performance on reflective surfaces by incorporating Intrinsic Image Decomposition (IID) branches [10,11] into the depth estimation framework and jointly predicting albedo, shading, and specular components from the input image together with depth. Although such designs are appealing because they can leverage illumination-invariant reference cues, they typically require U-Net–type IID decoders or additional prediction branches that are separate from the depth network and must also be executed at inference time, which often leads to a substantial increase in the overall number of parameters and FLOPs.
To address these limitations, this study introduces a multitask learning framework that incorporates albedo estimation as an auxiliary task within a self-supervised monocular depth estimation pipeline. The objective is to explicitly mitigate the influence of illumination variations while maintaining a balance between computational efficiency and generalization capability. The proposed framework is integrated into prominent backbones, such as Monodepth2 [3] and Lite-Mono [12], utilizing a multi-head architecture where each upsampling block in the shared encoder–decoder branches into dedicated Depth and Albedo Heads. Furthermore, by leveraging Intrinsic Image Decomposition (IID) as a preprocessing step to generate albedo maps, we define an albedo supervision loss. This strategy facilitates the integration of illumination-invariant features into the training objective without the need for computationally intensive IID modules within the network. Consequently, the model suppresses erroneous learning in regions characterized by specular reflections or steep illumination gradients, all while adhering to a strictly self-supervised paradigm that requires no additional sensors or manual annotations. The main contributions of this paper are summarized as follows:
  • We propose a multitask learning framework for self-supervised monocular depth estimation based on a multi-head architecture, where each upsampling block in a shared encoder–decoder branches into a Depth Head and an Albedo Head.
  • We introduce an albedo supervision loss that uses albedo images generated in advance by Intrinsic Image Decomposition as supervision signals, allowing the network to exploit illumination-invariant albedo cues without embedding additional IID modules into the inference-time architecture.
  • Through experiments on the ScanNetV2 dataset, we demonstrate that integrating the proposed framework into multiple self-supervised monocular depth estimation backbones, including Monodepth2 and Lite-Mono, improves depth error and accuracy metrics without increasing the number of model parameters or FLOPs at inference time.

2. Related Works

2.1. Self-Supervised Monocular Depth Estimation

Supervised methods can achieve high accuracy. However, acquiring dense ground-truth depth maps is costly and highly dependent on the environment. In practice, this often requires extensive preprocessing, such as measurements using sensors like LiDAR, calibration, interpolation of missing values, and noise removal. These factors make it difficult to deploy such methods in a scalable manner. Representative supervised monocular depth estimation methods based on convolutional neural networks (CNNs) include the multi-scale architecture proposed by Eigen et al. [13], the residual network-based approach by Laina et al. [14], DORN, which treats depth as discrete ordinal classes [15], and AdaBins, which is based on adaptive binning [16]. Although these methods have demonstrated high estimation accuracy, they still face limitations when scaling to large datasets for the reasons mentioned above.
As an approach to overcoming the limitations of supervised methods, self-supervised monocular depth estimation has been extensively explored. Self-supervised depth estimation methods are generally divided into two categories: stereo-based methods and temporal sequence (video-based) methods. The former are inspired by traditional stereo vision, where disparity is estimated from left–right image pairs and a monocular depth network is trained using the corresponding reprojection error. Garg et al. [17] were the first to demonstrate self-supervised learning from stereo pairs using a reprojection loss, and Godard et al. [18] further improved accuracy with MonoDepth by enforcing a left–right disparity consistency constraint. Poggi et al. [19] reduced occlusion artifacts by exploiting a multi-camera setup, while Watson et al. [20] strengthened the supervision signal by incorporating depth hints from conventional stereo algorithms into the photometric image reconstruction loss. In addition, Gonzalez-Bello et al. [21] introduced the FAL network with occlusion masks, Zhu et al. [22] proposed EdgeDepth, which leverages semantic information, and Peng et al. [23] developed EPCDepth, which improves accuracy through edge-based graph filtering.
In contrast, temporal sequence-based methods jointly estimate depth and camera pose from consecutive frames captured by a monocular camera. Zhou et al. proposed SfM-Learner [2], which introduced a self-supervised framework that jointly trains a depth network and a camera pose regression network by minimizing a photometric image reconstruction loss between adjacent frames. Subsequently, Godard et al. presented Monodepth2 [3], which improves training stability and accuracy by introducing minimum reprojection error and an auto-masking scheme to handle dynamic objects and occlusions. Shu et al. proposed FeatDepth [24], which incorporates a feature-distance loss in the latent space, and Lyu et al. introduced HR-Depth [25], which redesigns skip connections to preserve high-resolution features. Since these self-supervised methods do not require ground-truth depth labels, they substantially reduce the cost associated with depth acquisition and preprocessing, while facilitating deployment across diverse environments and domains.
Furthermore, motivated by real-time applications such as autonomous driving, robotics, and AR/VR, there has been growing interest in lightweight self-supervised monocular depth estimation models. Dong et al. conducted a survey on real-time monocular depth estimation for robotics [5], and highlighted the trade-off between model size and inference speed as a key challenge for practical deployment. Liu et al. proposed a lightweight monocular depth estimation network designed for edge devices [26], achieving a good balance between parameter reduction and prediction accuracy. Zhang et al. introduced Lite-Mono [12], a self-supervised monocular depth estimation model that combines CNNs and Transformers, and demonstrated that it can substantially reduce the number of parameters and FLOPs while maintaining, or even surpassing, the accuracy of previous models. In addition to the advantage of not requiring ground-truth depth labels, this trend toward lightweight and efficient architectures is driving demand for monocular depth models that can run in real time on resource-constrained hardware platforms.
In recent years, research has shifted from merely improving metrics on standard benchmarks toward systematically evaluating the generalization ability of trained models under varying environmental conditions and domain shifts. However, in real-world scenarios, non-Lambertian effects such as specular reflections and strong illumination changes frequently occur, while the aforementioned photometric image reconstruction loss is still based on the Lambertian reflectance assumption, which presumes that the appearance of a surface is independent of the viewing direction. Consequently, under non-Lambertian conditions involving materials such as metals, glass, water surfaces, and glossy plastics, as well as strong highlights, shadows, and pronounced illumination gradients, pixel intensities can vary with lighting and viewpoint even when reprojecting the same 3D point. This causes the photometric error to propagate as an inappropriate supervision signal for depth estimation, and it has been reported that depth errors can grow significantly, especially in scenes containing strong reflections or steep illumination gradients [3,10].
At the same time, multitask learning approaches that jointly train depth estimation with auxiliary tasks have been explored to improve robustness by leveraging additional cues. Klingner et al. [27] introduced a multitask framework that incorporates semantic segmentation, where dynamic classes such as cars and pedestrians are detected based on semantic labels and their pixels are masked out from the photometric image reconstruction loss. In this way, they suppress incorrect supervision signals caused by moving objects that violate the static-world assumption. Petrovai et al. [28] also proposed a framework that jointly learns self-supervised depth estimation and video panoptic segmentation. By designing depth losses conditioned on panoptic labels and masking dynamic objects, they mitigate broken supervision signals contained in the reprojection loss. While these methods improve robustness by exploiting high-level cues such as semantics and instance boundaries, they still rely heavily on high-quality pseudo-labels or segmentation annotations. As a result, they require re-annotation when transferring to new domains, and misdetections in unseen environments can destabilize training.
Furthermore, as a more direct line of multitask learning that addresses depth errors caused by the breakdown of the photometric image reconstruction loss under non-Lambertian conditions, several methods have integrated Intrinsic Image Decomposition (IID) into depth estimation. Daher et al. [11] designed a non-Lambertian model that simultaneously estimates depth, albedo, shading, and specular components, and achieved robust depth estimation in reflective regions by treating specular reflections as an independent component. Similarly, Choi et al. [10] integrated IID into self-supervised monocular depth estimation. By combining reflection-region identification, the removal of invalid gradients, and pseudo-depth distillation, they improved performance in both reflective and non-reflective regions. However, in these methods, an additional U-Net–style IID decoder or prediction branch is embedded alongside the depth network and is also executed at inference time. Although this design allows the use of illumination-invariant reference information, it naturally increases overall model complexity and inference cost. Finally, although the minimum reprojection and auto-masking strategy introduced by Godard et al. [3] stabilizes training by suppressing the influence of pixels that violate geometric or motion assumptions, performance degradation is still reported in scenarios with strong specular reflections or large illumination changes, where the Lambertian reflectance assumption itself breaks down.

2.2. Intrinsic Image Decomposition

Intrinsic Image Decomposition (IID) has long been studied as a classical problem that aims to separate an observed image into an albedo component corresponding to surface reflectance and a shading component corresponding to illumination. Land et al. [29] introduced a Retinex-based framework that models an image as the product of albedo and shading, thereby laying the conceptual foundation of IID. Building on this, Grosse et al. [30] constructed the MIT Intrinsics dataset based on high-precision measurements of real objects, enabling quantitative evaluation of IID algorithms using ground-truth albedo and shading. Furthermore, Bell et al. [31] proposed the Intrinsic Images in the Wild (IIW) dataset, which uses crowdsourced human annotations of relative reflectance, providing a large-scale benchmark for assessing albedo consistency across diverse indoor and outdoor scenes. Shen et al. [32] designed a sparse-representation-based optimization method that incorporates priors encouraging large image gradients to be interpreted as albedo changes, and achieved stable decomposition results on natural images.
In the domain of supervised learning–based models, Barron et al. [33] developed a physically consistent model that jointly estimates shape, illumination, and reflectance, and proposed a unified framework that also includes a supervised extension to RGB-D data. Narihira et al. [34] introduced Direct Intrinsics, which directly regresses albedo and shading using convolutional neural networks (CNNs), and demonstrated high-accuracy decomposition with annotated datasets. Fan et al. [35] revisited existing deep-learning-based IID models, conducting a detailed analysis of how differences in datasets and loss designs affect decomposition performance, and presented refined architectures that yield higher-quality intrinsic predictions. More recently, Careaga et al. [36] proposed an extended intrinsic model that explicitly handles a specular residual component in addition to albedo and shading, achieving high-resolution and high-quality decompositions on real images, including outdoor scenes. These studies collectively advance IID in the direction of explicitly modeling non-Lambertian and specular effects by combining physical models with deep learning, thereby going beyond the classical Lambertian assumption.
Meanwhile, in order to reduce the cost of collecting ground-truth albedo and shading maps, self-supervised and weakly supervised IID models have also been proposed. Ma et al. [37] combined a two-stream network with temporal consistency constraints, leveraging temporal information from video sequences as a supervisory signal to learn intrinsic decomposition without ground-truth annotations, while still enabling single-image inference at test time.

3. Materials and Methods

3.1. Design Motivation

3.1.1. Basic Assumptions of Self-Supervised Monocular Depth Estimation

Self-supervised monocular depth estimation eliminates the need for ground-truth depth maps by casting depth learning as an image reconstruction task governed by photometric consistency. For this formulation to hold, prior work commonly adopts the following assumptions [2,3]:
(a)
The scene is assumed to be static while the camera undergoes motion [2,3].
(b)
The brightness of objects is assumed to be consistent across viewpoints [3].
(c)
All scene surfaces are assumed to follow Lambertian reflectance [2,3].
In practice, outdoor datasets often include factors that violate these assumptions, such as moving vehicles and pedestrians as well as rapidly varying weather and illumination. To alleviate violations of the static-scene assumption (a), some self-supervised monocular depth estimation methods employ an auto-mask [3] to disregard pixel regions that do not satisfy the static-scene constraint. While the Lambertian assumption (c) is not strictly satisfied in all cases, it remains a useful approximation for many outdoor scenarios. By contrast, (c) is frequently violated in indoor environments containing reflective materials, mirrors, and transparent objects; moreover, the broader diversity of object categories and lighting conditions can substantially deteriorate depth estimation accuracy.
In this study, we target the breakdown of assumption (c) by explicitly regularizing the depth network with illumination-invariant albedo cues obtained via intrinsic image decomposition.

3.1.2. Intrinsic Image Decomposition

As shown in Figure 1, the intrinsic decomposition model provides albedo, shading, diffuse, and residual components used in our framework. To suppress the influence of strong reflections and illumination changes, we adopt a pretrained intrinsic decomposition model [36] based on the intrinsic residual image formation model. Specifically, an observed RGB image I is decomposed into an albedo layer A, a shading layer S, and a residual layer R that captures non-Lambertian illumination effects:
I = A S + R
where A represents the albedo component, which is largely insensitive to illumination changes, while S is an RGB-valued shading layer that captures illumination-dependent variations such as shading and global illumination. The residual term R aggregates view-dependent and non-diffuse effects, including specular highlights and strong reflection artifacts.
As an auxiliary supervision signal, we use A as the Albedo target and define the Diffuse target as the reconstruction from albedo and shading:
I_diffuse = A S
which suppresses specular and strong reflection effects by separating them into R, while smooth shading and global illumination variations may still remain through S.
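As a concrete illustration, the decomposition above can be checked numerically with toy arrays; the values below are random placeholders rather than outputs of a real IID model, and the element-wise product stands in for the A S term:

```python
import numpy as np

# Toy sanity check of the intrinsic residual image formation model I = A S + R.
# All arrays are H x W x 3; the values are random placeholders, not outputs
# of a real intrinsic decomposition model.
rng = np.random.default_rng(0)
H, W = 4, 4
A = rng.uniform(0.2, 0.9, (H, W, 3))  # albedo: illumination-invariant reflectance
S = rng.uniform(0.1, 1.0, (H, W, 3))  # shading: illumination-dependent variation
R = rng.uniform(0.0, 0.2, (H, W, 3))  # residual: specular / non-diffuse effects

I = A * S + R     # observed image under the model (element-wise products)
diffuse = A * S   # Diffuse target: the observation with residual effects removed

# Subtracting the residual from the observation recovers the diffuse component.
assert np.allclose(I - R, diffuse)
```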
In this work, instead of jointly learning a depth estimation network and an intrinsic decomposition module from scratch, we use the pretrained model of [36] as an auxiliary teacher.

3.2. Overall Architecture

This article is a revised and expanded version of “Multi-task Learning for Monocular Depth Estimation and Intrinsic Image Decomposition”. The overall architecture of the self-supervised monocular depth estimation model used in this study is illustrated in Figure 2. In the existing method, the model consists of two subnetworks: a DepthNet for depth estimation and a PoseNet for camera pose estimation.
In contrast, the proposed method additionally introduces an AlbedoNet that generates pseudo albedo images from the input RGB images. Following [3,38], we adopt the same PoseNet architecture. Specifically, a pre-trained ResNet-18 [39] is used as the pose encoder, which takes a pair of color images as input. The encoder output is fed into a pose decoder composed of four convolutional layers, which regresses the 6-DoF relative pose P_{t→s} from the target frame I_t to each source frame I_s (s ∈ {t − 1, t + 1}), given the input image pairs (I_t, I_{t−1}) and (I_t, I_{t+1}).
For the DepthNet of the existing method, an encoder–decoder network takes the RGB image I_t as input and outputs a depth map D_t through a Depth Head. During training, the depth map D_t predicted by the Depth Head, together with the camera poses P_{t→s} estimated by the PoseNet, is used to reconstruct the target image I_t from its adjacent frames, yielding a reconstructed image Î_t. The photometric image reconstruction error between Î_t and the original image I_t is then used as the supervision signal. As a result, pixels affected by specular reflections or strong illumination changes are uniformly treated as depth supervision, which tends to propagate incorrect gradients under non-Lambertian conditions. In practice, Monodepth2 and HR-Depth predict depth maps at four scales (1, 1/2, 1/4, 1/8), whereas Lite-Mono and MonoLENS [40] output depth at three scales (1, 1/2, 1/4) in a multi-scale fashion.
In contrast, the DepthNet in the proposed method preserves the same encoder–decoder backbone structure as the existing methods, while extending the head architecture attached to the decoder outputs. Specifically, two heads, a Depth Head and an Albedo Head, are attached to the decoder feature maps at each scale. The Depth Head predicts depth maps D_t at all scales as in the original models, whereas the Albedo Head simultaneously estimates illumination-invariant albedo images A_t from the same features. In our implementation, Monodepth2 and HR-Depth are extended to jointly predict depth and albedo at four scales (1, 1/2, 1/4, 1/8), and Lite-Mono and MonoLENS are similarly extended to do so at three scales (1, 1/2, 1/4). The architecture and processing flow of the PoseNet are kept identical to those of the baseline methods.
Furthermore, the proposed method incorporates an albedo supervision pipeline based on AlbedoNet. A pre-trained IID model (AlbedoNet) is applied to the input RGB image I_t to obtain illumination-invariant pseudo albedo images Â_t. During training, for each scale s, we compute an L1 loss between the predicted albedo A_t generated by the Albedo Head and the IID-derived albedo Â_t, and use the average of this loss over scales to update the parameters of both the Albedo Head and the shared backbone, thereby encouraging the network to learn illumination-invariant features that are beneficial for depth estimation. At inference time, however, only the Depth Head of the DepthNet is used to predict depth, and neither the Albedo Head nor AlbedoNet is executed.

3.3. Albedo Head Architectures

In this subsection, we describe the architectures of the Albedo Heads adopted for different backbones, as illustrated in Figure 3.
Figure 3a illustrates the Albedo Head architecture adopted for Monodepth2 and HR-Depth. In these models, the spatial resolutions of the decoder output feature maps at each scale are designed so that they can be directly used as the resolutions of the corresponding albedo outputs. Therefore, in this work we employ a simple head structure that applies only a 3 × 3 convolution followed by a Sigmoid activation to the features at each scale, without any additional upsampling layers. By keeping the number of channels in the convolutional layer comparable to that of the Depth Head, we limit the increase in the number of parameters relative to the original Depth Head during training, while still enabling albedo estimation at all scales.
Figure 3b shows the Albedo Head architecture adopted for Lite-Mono and MonoLENS. Due to the encoder–decoder design of Lite-Mono and MonoLENS, the resolutions of the decoder output features do not necessarily match the final albedo output resolution at all scales. Therefore, in our design we first apply a 3 × 3 convolution to the feature maps at each scale and then, when necessary, insert an upsampling layer to align them with the resolution of the upper scale before generating the albedo images. Thus, unlike the head in Figure 3a, the Lite-Mono/MonoLENS Albedo Head includes an upsampling operation. However, by keeping the number of channels in each stage small, we confine the increase in parameters during training to a range that does not compromise the inherent lightweight nature of Lite-Mono and MonoLENS.
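A minimal PyTorch sketch of the two head variants described above might look as follows; the module name, channel sizes, and upsampling factor are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class AlbedoHead(nn.Module):
    """Sketch of an Albedo Head: a 3x3 convolution followed by a Sigmoid,
    optionally with bilinear upsampling (Lite-Mono/MonoLENS-style heads)."""
    def __init__(self, in_channels: int, upsample: bool = False):
        super().__init__()
        layers = [nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)]
        if upsample:
            # Align the feature resolution with the albedo output resolution
            # of the upper scale when the decoder features are smaller.
            layers.append(nn.Upsample(scale_factor=2, mode="bilinear",
                                      align_corners=False))
        layers.append(nn.Sigmoid())  # constrain albedo values to [0, 1]
        self.head = nn.Sequential(*layers)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.head(feat)

feat = torch.randn(1, 64, 48, 64)                 # decoder features at one scale
simple = AlbedoHead(64)(feat)                     # Monodepth2/HR-Depth variant
upsampled = AlbedoHead(64, upsample=True)(feat)   # Lite-Mono/MonoLENS variant
```

Keeping the output at 3 channels and a small channel budget keeps the added training-time parameter count low, and the head is simply dropped at inference time.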

3.4. Self-Supervised Depth and Albedo Supervision Loss

We follow the standard self-supervised monocular depth estimation framework and formulate depth prediction as an image reconstruction problem. Similar to previous work [2,3], our training objective consists of a photometric image reconstruction loss and an edge-aware smoothness loss, computed at multiple decoder scales. In addition, we introduce an albedo supervision loss.
Let I_t denote the target image at time t, I_s the source images from adjacent frames s ∈ {−1, +1}, D_t the predicted depth map for I_t, and K the camera intrinsics. Let S be the set of decoder scales. In our Lite-Mono/MonoLENS baseline models, we use three scales
S = {1, 1/2, 1/4},
while the Monodepth2/HR-Depth baselines use four scales
S = {1, 1/2, 1/4, 1/8}.
Photometric image reconstruction loss. Given the predicted depth D_t and the relative pose P_{t→s} estimated by the pose network, we synthesize a target view Î_t^(s) from each source frame I_s via differentiable warping:
Î_t^(s) = F(I_s, P_{t→s}, D_t, K)
where F(·) denotes the standard 3D projection and sampling function.
Following [3], we define the per-source photometric error between Î_t^(s) and I_t as a weighted combination of SSIM and L1 distance:
L_p(Î_t^(s), I_t) = α (1 − SSIM(Î_t^(s), I_t)) / 2 + (1 − α) ‖Î_t^(s) − I_t‖_1
where α = 0.85 as in [3]. To handle occlusions and out-of-view regions, we take the per-pixel minimum photometric error over all source frames:
L_photo(I_t) = min_{s ∈ {−1, +1}} L_p(Î_t^(s), I_t)
As in Monodepth2 [3], we additionally use an auto-masking scheme that compares the reprojection loss against an identity reprojection loss obtained by directly comparing the source images I_s with the target I_t. This suppresses pixels whose appearance can be explained without motion, such as moving objects or static-camera failure cases. We denote the resulting masked photometric image reconstruction loss at scale s as L_r^(s).
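The per-source error and the minimum-reprojection composition can be sketched as follows; for brevity this NumPy version computes SSIM over a single global window rather than the usual local 3x3 window, so it is only an approximation of the actual loss:

```python
import numpy as np

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel error: alpha * (1 - SSIM)/2 + (1 - alpha) * L1.
    SSIM is computed over one global window here for brevity; real
    implementations use a local (e.g., 3x3) window."""
    l1 = np.abs(pred - target).mean(axis=-1)
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1

def min_reprojection_loss(reconstructions, target, alpha=0.85):
    """Per-pixel minimum over the source-frame reconstructions, then a mean."""
    errors = np.stack([photometric_error(r, target, alpha)
                       for r in reconstructions])
    return errors.min(axis=0).mean()
```

Taking the per-pixel minimum (rather than the average) over source frames lets each pixel be supervised by whichever adjacent frame sees it unoccluded.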
Edge-aware smoothness loss. To regularize the predicted depth and encourage piecewise-smooth inverse depth while preserving image edges, we use the edge-aware smoothness loss [3,18]. Let d_t = 1/D_t be the inverse depth and d*_t = d_t / d̄_t the mean-normalized inverse depth, where d̄_t is the spatial mean of d_t. The smoothness term at scale s is defined as
L_smooth^(s) = |∂_x d*_t| e^{−|∂_x I_t|} + |∂_y d*_t| e^{−|∂_y I_t|}
where ∂_x and ∂_y denote spatial gradients along the horizontal and vertical axes, respectively. The exponential image-gradient terms down-weight the smoothness penalty near strong intensity edges.
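A NumPy sketch of this term, using forward differences as the spatial gradients (an assumption; implementations differ in the exact discretization):

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Edge-aware smoothness on mean-normalized disparity (inverse depth).
    disp: H x W inverse-depth map; img: H x W x 3 image in [0, 1]."""
    d = disp / (disp.mean() + 1e-7)                       # mean normalization
    dx = np.abs(d[:, 1:] - d[:, :-1])                     # horizontal disparity gradient
    dy = np.abs(d[1:, :] - d[:-1, :])                     # vertical disparity gradient
    ix = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)   # horizontal image gradient
    iy = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)   # vertical image gradient
    # Down-weight the penalty where the image itself has strong edges.
    return (dx * np.exp(-ix)).mean() + (dy * np.exp(-iy)).mean()
```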
Multi-scale depth loss. For each decoder scale s ∈ S, we compute the sum of the photometric reconstruction loss and the smoothness term:
L_depth^(s) = L_r^(s) + λ_smooth L_smooth^(s)
where λ_smooth is a small constant (e.g., λ_smooth = 10^−3) controlling the strength of the smoothness regularization.
The total depth loss is obtained by averaging over all scales:
L_depth = (1/|S|) ∑_{s ∈ S} L_depth^(s)
Albedo supervision loss. In addition to the depth branch, our network includes an albedo prediction branch that outputs an albedo image A_t^(s) at each scale s. As supervision, we use pseudo albedo ground truth Â_t^(s), obtained by applying an intrinsic image decomposition model to the input RGB frames and downsampling the result to the corresponding scale.
For each scale s ∈ S, we define the albedo supervision loss as an L1 distance between the predicted and pseudo-ground-truth albedo images:
L_albedo^(s) = ‖A_t^(s) − Â_t^(s)‖_1
We then average this loss over all scales:
L_albedo = (1/|S|) ∑_{s ∈ S} L_albedo^(s)
This auxiliary term encourages the shared encoder to learn representations that are also predictive of illumination-invariant albedo, which empirically improves the robustness of depth estimation, especially in scenes containing strong specular highlights or reflections.
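The multi-scale L1 supervision reduces to a few lines; this NumPy sketch assumes the IID-derived targets have already been downsampled to each scale:

```python
import numpy as np

def albedo_supervision_loss(predicted, pseudo_gt):
    """Average, over decoder scales, of the mean L1 distance between the
    predicted albedo and the IID-derived pseudo albedo at that scale."""
    per_scale = [np.abs(p - t).mean() for p, t in zip(predicted, pseudo_gt)]
    return sum(per_scale) / len(per_scale)
```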
Overall training objective. Combining the depth loss and the albedo supervision loss, our final self-supervised training objective is
L_total = L_depth + λ_albedo L_albedo
where λ_albedo controls the contribution of the albedo supervision. In non-Lambertian regions, the photometric reconstruction loss can yield gradients that correlate with appearance changes (e.g., specular highlights) rather than geometry, which may bias updates in the shared backbone. By adding L_albedo, the shared parameters receive an additional gradient signal that favors illumination-invariant, reflectance-consistent features. Because the objective is a weighted sum, its gradient decomposes as ∇_θ L_total = ∇_θ L_depth + λ_albedo ∇_θ L_albedo, which helps mitigate non-geometric photometric artifacts and empirically improves robustness under reflections.
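The gradient decomposition of a weighted-sum objective can be verified on a toy example; the scalar loss functions and the weight below are illustrative stand-ins, not the actual depth and albedo losses:

```python
import torch

# Stand-in losses over a shared parameter vector theta (illustrative only).
def depth_loss(t):
    return (t ** 2).sum()

def albedo_loss(t):
    return (t - 1.0).abs().sum()

theta = torch.tensor([1.0, -2.0], requires_grad=True)
lam = 0.1  # assumed albedo weight for this demo

total = depth_loss(theta) + lam * albedo_loss(theta)
g_total, = torch.autograd.grad(total, theta)

g_depth, = torch.autograd.grad(depth_loss(theta), theta)
g_albedo, = torch.autograd.grad(albedo_loss(theta), theta)

# The gradient of the weighted sum is the weighted sum of the gradients.
assert torch.allclose(g_total, g_depth + lam * g_albedo)
```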

3.5. Experimental Setup

3.5.1. Baselines and Training Setups

In our experiments, we evaluate four representative architectures that have shown strong performance in prior studies: Monodepth2, HR-Depth, Lite-Mono, and MonoLENS. All baselines are trained in a self-supervised fashion by minimizing the photometric reconstruction error following the training strategy of Monodepth2, and we collectively denote these settings as [Baseline]. To assess the benefit of our method, we extend each baseline by attaching an additional Albedo Head to the decoder outputs and adopting a multitask learning scheme. Specifically, depth is learned in a self-supervised manner via the photometric image reconstruction loss, while the albedo branch is supervised using pre-generated albedo images obtained from an IID model. We refer to this augmented training configuration as [Baseline + Ours].

3.5.2. Dataset

ScanNetV2 [41] consists of indoor scenes captured by an RGB-D scanner, and includes 1201 training scenes, 312 validation scenes, and 100 test scenes. To simulate training scenarios dominated by non-Lambertian surfaces and to ensure a fair comparison with 3D Distillation [42], we use the training triplet split provided by the authors, which contains 45,539 training images and 439 validation images.

3.5.3. Implementation Details

Our method was implemented in PyTorch 1.12.1 and trained on a workstation equipped with a 13th Gen Intel Core i7-13700F CPU, an NVIDIA GeForce RTX 4070 GPU with 12 GB VRAM (PNY Technologies, Parsippany, NJ, USA), and 78 GB of system memory. The software environment consisted of Ubuntu 22.04 LTS. The batch size during training was set to 12 for all architectures. Following Choi et al. [10], we initialized the learning rate of all networks to 1 × 10⁻⁴ and reduced it by a factor of 10 at the 26th and 36th epochs over a total of 41 training epochs. All models were trained with input images of resolution 384 × 288. The valid depth range was set to a minimum of 0.1 m and a maximum of 10 m.
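Assuming an Adam optimizer (as in the Monodepth2 training strategy), this step schedule can be reproduced with PyTorch's `MultiStepLR`; the single placeholder parameter below stands in for the actual network parameters.

```python
import torch

# Placeholder parameter standing in for the network weights (assumption).
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-4)
# Decay the learning rate by 10x at epochs 26 and 36 of a 41-epoch run.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[26, 36], gamma=0.1)

lrs = []
for epoch in range(41):
    # ... one training epoch would run here ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```

After the loop, `lrs` holds 1e-4 for the first 26 epochs, 1e-5 for the next 10, and 1e-6 for the final 5.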
For the quantitative evaluation of depth estimation, we applied median scaling to the predicted depth maps and used standard metrics that are widely adopted in the field [13]. These metrics consist of four error measures (Abs Rel, Sq Rel, RMSE, and RMSE log) and three accuracy thresholds (δ < 1.25, δ < 1.25², and δ < 1.25³). In the definitions below, $d_i$ denotes the predicted depth at pixel $i$, $d_i^{*}$ the corresponding ground-truth depth, and $N$ the set of valid pixels.
$$\text{Abs Rel} = \frac{1}{|N|} \sum_{i \in N} \frac{\left| d_i - d_i^{*} \right|}{d_i^{*}}$$
$$\text{Sq Rel} = \frac{1}{|N|} \sum_{i \in N} \frac{\left( d_i - d_i^{*} \right)^{2}}{d_i^{*}}$$
$$\text{RMSE} = \sqrt{\frac{1}{|N|} \sum_{i \in N} \left( d_i - d_i^{*} \right)^{2}}$$
$$\text{RMSE log} = \sqrt{\frac{1}{|N|} \sum_{i \in N} \left( \log d_i - \log d_i^{*} \right)^{2}}$$
$$\text{Accuracy} = \% \text{ of } d_i \text{ such that } \max\!\left( \frac{d_i}{d_i^{*}}, \frac{d_i^{*}}{d_i} \right) = \delta < \text{threshold}$$
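The median-scaled metrics above can be computed, for example, as in the NumPy sketch below, which operates on flattened arrays of valid depth values (the function name `depth_metrics` is introduced here for illustration):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics with median scaling (Eigen et al. [13] style).
    pred, gt: 1-D arrays of predicted and ground-truth depths at valid pixels."""
    pred = pred * np.median(gt) / np.median(pred)  # median scaling
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    accuracies = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, accuracies
```

A perfect prediction yields zero for all four error measures and 1.0 for each threshold accuracy.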

4. Experimental Results and Discussion

This section reports the quantitative and qualitative results on the ScanNet-Reflection benchmark, and discusses the impact of the proposed albedo-supervised multitask training. We also analyze the computational complexity and training overhead, followed by an ablation study.

4.1. ScanNetV2 Results

We present quantitative results on the ScanNet-Reflection validation set in Table 1. Across all backbones, the proposed method (Baseline + Ours) consistently outperforms the corresponding baselines on multiple metrics.
For Monodepth2, the average improvement over the four error metrics (Abs Rel, Sq Rel, RMSE, RMSE log) reaches approximately 10.8%, with Sq Rel in particular improving by about 30.3%. Similarly, for HR-Depth, we observe an average improvement of roughly 11.0% across the four error metrics, where the gain in Sq Rel is especially pronounced at around 32.1%. For the lightweight backbone Lite-Mono, the average improvement over these four metrics further increases to approximately 18.5%, and Sq Rel improves by about 44.3%, indicating that the proposed method substantially enhances robustness in the reflection-heavy ScanNet-Reflection environment. Finally, for our most compact backbone MonoLENS, the proposed method achieves the largest relative gains, with an average improvement of about 27.6% over the four error metrics.
These gains can be attributed to the introduction of an auxiliary task that learns from IID-generated albedo images as supervision. By doing so, the network is encouraged to form internal representations based on illumination-invariant reflectance components rather than on pixels heavily affected by illumination changes or specular reflections. Consequently, even in regions containing strong highlights or non-Lambertian surfaces such as glass and metal, the photometric image reconstruction loss better reflects geometric consistency, leading to a reduction of large depth errors.
On the other hand, the improvements are not strictly monotonic across all threshold-based accuracy metrics. As shown in Table 1, the δ < 1.25 score (and, for Monodepth2, also δ < 1.25²) becomes slightly worse for Monodepth2, HR-Depth, and MonoLENS when our albedo supervision is introduced. One possible explanation is that, although the proposed method mainly helps to suppress large errors and improve challenging regions with strong reflections or illumination variations, the albedo supervision loss may also attenuate textures and local contrast that are beneficial for fine-grained geometric prediction. This trade-off can slightly perturb predictions whose errors lie near the decision boundaries of the δ-metrics, leading to marginal degradations in some threshold-based accuracy scores despite overall gains in most error and accuracy measures.
Figure 4 presents qualitative results on the ScanNet-Reflection validation set. For each backbone, we show the input image (left), the baseline prediction (middle), and the proposed method (Baseline + Ours) (right).
Overall, the baselines exhibit noticeable local overestimation and underestimation of depth in regions where strong illumination produces highlights on tables and floors, indicating that photometric disturbances caused by reflections and illumination changes are directly translated into depth errors. In contrast, Baseline + Ours yields smoother and more spatially consistent depth in these regions, and the predictions are visibly more robust to pixels affected by highlights and specular reflections. This behavior can be attributed to incorporating albedo estimation as an auxiliary task using IID-generated albedo images: the network is encouraged to form internal representations based on illumination-invariant reflectance components, which enables more geometrically consistent depth predictions even in challenging scenes that contain non-Lambertian surfaces.
On the other hand, compared with the corresponding baselines, Baseline + Ours tends to slightly smooth fine structures such as chair legs, table edges, and small objects, leading to a modest loss of sharpness in depth around thin boundaries and distant objects. This suggests that introducing albedo estimation as an auxiliary task may partially smooth not only illumination components but also textures and local contrasts that are beneficial for precise geometric prediction.

4.2. Comparison with Reflective-Robust Self-Supervised Depth Methods

In this section, we compare our method with approaches on ScanNetV2 that are designed to improve robustness in reflective and non-Lambertian regions. For a fair comparison, we evaluate all methods under a common Monodepth2-backbone setting using standard depth error metrics (Abs Rel, Sq Rel, RMSE, and RMSE log) and accuracy metrics (δ < 1.25, δ < 1.25², and δ < 1.25³). Note that only Ours is trained and evaluated in our experimental environment, whereas the results of the other methods are taken from their respective papers. In addition, to ensure fairness, we compare results obtained under the setting where training is performed using known camera poses in Table 2.
As shown in Table 2, under the common Monodepth2 backbone, our method achieves better depth-error performance than the self-teaching approach of Poggi et al. [43] while maintaining the same parameter budget (14.3M for both). Specifically, our method improves Abs Rel from 0.192 to 0.156, Sq Rel from 0.188 to 0.100, RMSE from 0.548 to 0.477, and RMSE log from 0.233 to 0.203. In terms of accuracy, our method attains δ < 1.25, δ < 1.25², and δ < 1.25³ of 0.775, 0.942, and 0.986, respectively. In contrast, Choi et al. [10] introduce an additional IID decoder/branch, and thus their parameter count is expected to exceed the Monodepth2 baseline. In this comparison, Choi et al. achieve the best RMSE and RMSE log, whereas our method preserves the same model size as the baseline (14.3M) and achieves the best Abs Rel and δ-based metrics (with ties on Sq Rel and δ < 1.25³), demonstrating a favorable trade-off between accuracy and model complexity.

4.3. Complexity and Speed Evaluation

In this section, we evaluate the model size and computational complexity of the proposed method and each baseline. Table 3 reports the number of model parameters and FLOPs at inference time for the encoder and decoder of each method. All values are measured using the THOP library at an input resolution of 384 × 288 . Note that, in [Baseline + Ours], we insert the Albedo Head into the DepthNet only during training to perform multitask learning, whereas at inference time this Albedo Head is disabled and only depth is predicted. Consequently, the depth networks used in deployment can improve accuracy over their corresponding baselines without incurring any increase in the number of parameters or FLOPs.
Among the four architectures, MonoLENS is the most lightweight, with a total of 1.78M parameters and 3.60G FLOPs, followed by Lite-Mono with 3.07M parameters and 4.53G FLOPs. In contrast, Monodepth2 and HR-Depth exhibit larger model size and higher computational cost, with HR-Depth in particular being dominated by the computational cost of its decoder.
To further assess the computational overhead introduced by our multitask framework, which introduces an additional Albedo Head and multiple loss functions, we also measured the training time of Monodepth2 and Monodepth2 + Ours under the same training configuration. The baseline Monodepth2 model required 11 h 23 m 57 s, whereas Monodepth2 + Ours required 11 h 56 m 04 s, corresponding to an increase of only about 32 min (approximately 4.7%). This moderate additional cost during training, together with the unchanged parameter count and FLOPs at inference time, indicates that the proposed multitask framework improves accuracy with negligible impact on deployment efficiency.

4.4. Training Stability and Loss Convergence

Figure 5 shows the loss curves during training of the proposed method. This plot corresponds to the model trained with a Monodepth2 backbone and the albedo-loss weight set to λ albedo = 0.3. Here, L depth denotes the depth-only objective (the photometric reconstruction loss combined with an edge-aware smoothness term; see Section 3.4). In addition, L albedo records the ℓ1 discrepancy between the albedo image predicted by the Albedo Head and the reference albedo image produced by AlbedoNet. We report both the per-step raw losses and their moving-average (MA) curves to simultaneously visualize short-term fluctuations and long-term convergence trends.
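The moving-average curves can be produced with a simple sliding-window average over the per-step raw losses; the window size below is an illustrative choice, not the one used for Figure 5.

```python
import numpy as np

def moving_average(values, window=100):
    """Sliding-window mean of a 1-D loss trace, used to visualise
    long-term convergence trends behind short-term fluctuations."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```

With `mode="valid"`, a trace of length n yields n - window + 1 smoothed points, so the MA curve is slightly shorter than the raw curve.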
As shown in Figure 5, both L depth and L albedo decrease rapidly in the early stage of training and then continue to converge with smoothly decreasing MA curves. While the raw curves exhibit occasional spikes, the variations remain bounded and do not indicate divergence or sustained oscillations in the MA trends. The total loss L total , which is optimized during training (Equation (13)), also decreases stably and gradually approaches a plateau in the later stage. These observations suggest that introducing the auxiliary albedo supervision does not destabilize optimization, and the proposed multitask training remains stable throughout learning.

4.5. Gradient Propagation Analysis

To provide a mechanistic explanation for the observed accuracy gains, this section analyzes how the auxiliary albedo supervision reshapes the optimization process via gradient propagation. Specifically, during training of Monodepth2 + Ours, we quantify the gradient norms induced by L albedo and by the total loss for each predefined parameter group to reveal where the auxiliary signal actually updates the shared backbone. In this analysis, the gradient norm represents the update pressure exerted by a loss term on each parameter group: larger norms indicate that the corresponding layer is more strongly driven by the loss and therefore more influential in the learning dynamics.
First, to clarify the extent to which the auxiliary supervision itself influences the network internals, we analyze gradients induced by the albedo supervision loss L albedo. Specifically, for each predefined parameter group, we measure the ℓ2 norm of the gradient and report the value averaged over all epochs:
$$g_{\text{albedo}}^{(\text{group})} = \left\lVert \nabla_{\theta_{\text{group}}} \mathcal{L}_{\text{albedo}} \right\rVert_{2}$$
Here, the encoder is divided into the Stem and Blocks 1–4, and the decoder is divided into Stages 0–4, where a smaller index corresponds to a shallower layer and a larger index corresponds to a deeper layer.
As shown in Figure 6a, the largest gradients induced by L albedo are observed at the Albedo Head, while substantial gradients are also distributed across the encoder and the decoder. In terms of the contribution to the sum of mean gradient norms, the encoder accounts for 47.1%, the decoder accounts for 37.8%, and the Albedo Head accounts for 15.1%, indicating that the shared backbone covers 84.9% in total. In particular, relatively larger gradients are observed in deeper layers such as Encoder Block 4 and Decoder Stage 3, suggesting that albedo supervision strongly drives the learning of high-level feature representations. Furthermore, since the albedo supervision loss does not directly update the Depth Head, the improvement in depth estimation can be interpreted as being obtained by updating the feature representations in the shared backbone, which then contributes to depth estimation indirectly.
Next, we compute gradients with respect to the total loss L total used for backpropagation during training (Equation (13)). For each parameter group, we measure the ℓ2 norm of the gradient,
$$g_{\text{total}}^{(\text{group})} = \left\lVert \nabla_{\theta_{\text{group}}} \mathcal{L}_{\text{total}} \right\rVert_{2}$$
and report the value averaged over all epochs.
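A sketch of this per-group measurement in PyTorch is given below. The prefix-based grouping convention (mapping a group name such as "Encoder Block 4" to a parameter-name prefix) is our assumption about how parameters are partitioned.

```python
import torch

def grad_norms_by_group(model, loss, groups):
    """l2 norm of the gradient of `loss` per named parameter group.
    groups: {group_name: parameter-name prefix} (assumed naming convention)."""
    named = list(model.named_parameters())
    grads = torch.autograd.grad(loss, [p for _, p in named],
                                retain_graph=True, allow_unused=True)
    norms = {}
    for gname, prefix in groups.items():
        total = 0.0
        for (pname, _), g in zip(named, grads):
            if g is not None and pname.startswith(prefix):
                total += float(g.pow(2).sum())
        norms[gname] = total ** 0.5
    return norms
```

Because `retain_graph=True` is set, the same computation graph can then be reused to measure norms for a second loss term (e.g., both L albedo and L total) before the optimizer step.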
As shown in Figure 6b, the largest gradients are observed at the Depth Head, followed by the Albedo Head. Meanwhile, non-negligible gradients are also distributed across the decoder stages and the encoder (Stem/Blocks), confirming that the effect of the auxiliary supervision is not confined locally to the Albedo Head but propagates into shared representations. We also observe a tendency for relatively larger gradients in deeper layers, such as Encoder Block 4, suggesting that the total loss including the auxiliary supervision strongly drives the learning of high-level features.

4.6. Ablation Study

4.6.1. Sensitivity to λ albedo : Monodepth2 + Ours on ScanNet-Reflection

Table 4 summarizes how the error and accuracy metrics change when the weight of the albedo supervision loss, λ albedo , is varied in the range { 0 , 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 } . First, compared with the baseline setting λ albedo = 0 , all configurations with λ albedo > 0 consistently reduce the four error metrics (Abs Rel, Sq Rel, RMSE, RMSE log). In particular, for λ albedo = 0.3 and 0.4 , the average relative reduction over these four metrics reaches about 20% with respect to the baseline, indicating that a moderately weighted albedo supervision loss is effective in suppressing large depth errors. On the other hand, when λ albedo is further increased to 0.5 or 0.6 , the metrics tend to become slightly worse than those of the best-performing settings. This suggests that excessively emphasizing the albedo supervision loss as an auxiliary task can start to negatively affect depth estimation.
For the accuracy metrics, the strictest threshold δ < 1.25 attains its highest value at the baseline λ albedo = 0, and becomes slightly lower for all configurations with λ albedo > 0. Nevertheless, for λ albedo = 0.1 and 0.3, the values remain almost comparable to the baseline. In contrast, the looser thresholds δ < 1.25² and δ < 1.25³ are improved in most non-zero configurations, and achieve their best values around λ albedo = 0.3–0.4. This indicates that increasing the weight of the albedo supervision loss mainly helps reduce larger errors, thereby improving performance at looser thresholds, while introducing a certain trade-off in the most stringent accuracy measure.
Overall, λ albedo = 0.3 achieves improvements in the error metrics that are comparable to those at λ albedo = 0.4 , while keeping the degradation of δ < 1.25 to a minimum. Therefore, it can be regarded as the most favorable setting in terms of balancing error reduction and threshold-based accuracy.

4.6.2. Impact of Albedo and Diffuse IID Targets on Depth Estimation Performance

We compare two intrinsic targets, Albedo and Diffuse, for our auxiliary supervision (with λ albedo fixed at 0.3), and quantify their impact on depth errors and accuracy.
Table 5 reports a comparison between using Albedo and Diffuse as supervision targets. Overall, Baseline + Ours (Albedo teacher) achieves slightly better depth accuracy: Abs Rel, RMSE, and RMSE log are marginally lower, and the accuracy metrics δ < 1.25 and δ < 1.25² are slightly higher. Sq Rel remains identical between the two variants, and δ < 1.25³ is essentially comparable (marginally higher for the Diffuse teacher). These results indicate that, within our framework, using the Albedo component as the teacher for the auxiliary task is more beneficial for depth estimation performance than using the Diffuse component.
This trend can be attributed to the different properties represented by the two intrinsic components. Albedo emphasizes illumination-invariant reflectance properties and tends to preserve discontinuities corresponding to object boundaries and material changes more clearly. In contrast, although the Diffuse component suppresses specular highlights, it still retains smooth shading and global illumination variations. When Diffuse is used as the teacher, such smooth shading and global illumination variations act as noise from the perspective of the auxiliary task, which in turn relatively weakens textures and local contrast associated with geometric boundaries and slightly degrades depth estimation accuracy around edges and fine structures.
On the other hand, when Albedo is used as the teacher, the feature representations are formed primarily based on reflectance components corresponding to materials and object boundaries that are less dependent on illumination conditions. This likely contributes to more stable geometric predictions, even in regions containing non-Lambertian surfaces and strong highlights.

4.6.3. Effect of Supervision Scale: Attaching the Albedo Head at Different Decoder Levels

In this ablation, we investigate the effect of supervision scale by attaching the Albedo Head at different decoder levels in Monodepth2. At the scales where the Albedo Head is present, the network predicts an albedo map and is trained to match a pseudo albedo target generated by an IID teacher. At the remaining scales, only the Depth Head is used and the network predicts depth in the standard self-supervised manner without auxiliary albedo supervision.
Table 6 summarizes how the error and accuracy metrics change when the Albedo Head is attached at each individual scale. Compared with the baseline without any Albedo Head (None), introducing the Albedo Head at any single scale consistently improves the overall depth performance, which indicates that supervising albedo at intermediate feature levels is beneficial for depth estimation.
When we compare the different attachment points, the most favorable results are obtained when the Albedo Head is attached at the 1/2 and 1 scales. In these settings, the errors are clearly reduced and the threshold-based accuracy metrics, such as δ < 1.25² and δ < 1.25³, reach their highest values. A plausible explanation is that supervision at higher-resolution decoder features aligns better with the final depth map resolution. Coarser scales such as 1/8 and 1/4 can help suppress illumination changes and reflections in a coarse manner, but they are less sensitive to thin structures and object boundaries. In contrast, placing the Albedo Head at 1/2 and 1 allows the model to learn albedo representations that directly constrain high-resolution features along edges and fine textures, which in turn improves both robustness in reflective regions and the fidelity of geometric details.

4.6.4. Sensitivity to Pseudo-Albedo Quality: Training with Noisy IID Supervision

In this section, we evaluate the sensitivity of the proposed framework to the quality of pseudo-albedo generated by the pretrained IID teacher (AlbedoNet) under Monodepth2 + Ours with λ albedo = 0.3 . During training, we corrupt the pseudo-albedo targets by injecting pixel-wise zero-mean Gaussian noise and clipping the perturbed targets to [ 0 , 1 ] . Importantly, this perturbation is applied only in training and only when a pseudo-albedo target is available, while no noise is added at evaluation time. To ensure reproducibility, we generate a deterministic random seed for each sample and scale using the scene folder name, the center-frame index, the scale s, and a fixed salt string, and then sample the Gaussian noise based on this seed. We progressively increase the noise setting while keeping all other configurations fixed (architecture, optimization, and loss weights), and evaluate depth on the same ScanNetV2 validation split using standard metrics.
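The deterministic per-sample seeding can be sketched as follows. The concrete hashing scheme (SHA-256 over the concatenated key) is an assumption, since the text specifies only the seed inputs (scene folder name, center-frame index, scale, and a fixed salt string).

```python
import hashlib
import numpy as np

def seeded_albedo_noise(albedo, scene, frame_idx, scale, sigma, salt="albedo-noise"):
    """Reproducibly corrupt a pseudo-albedo target: derive a deterministic seed
    from (scene, frame, scale, salt), add zero-mean Gaussian noise of std sigma,
    and clip the result to [0, 1]. Applied only during training."""
    key = f"{scene}|{frame_idx}|{scale}|{salt}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    noisy = albedo + rng.normal(0.0, sigma, size=albedo.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Because the seed depends only on the sample identity, the same corrupted target is regenerated at every epoch, which keeps the noisy-supervision experiment reproducible across runs.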
Table 7 reports the results. The proposed framework remains relatively robust to moderate degradations of the pseudo-albedo targets: as the noise level increases, the depth error metrics rise only modestly, and the δ -based accuracies stay largely stable with only minor fluctuations across settings. This indicates that the method is not overly sensitive to the exact quality of IID-derived pseudo-albedo supervision and can still benefit from the auxiliary signal even when the teacher outputs contain noticeable noise. That said, a gradual deterioration is observed under larger noise, suggesting that improving the IID teacher quality remains a promising direction for further strengthening depth robustness in reflective regions.

5. Conclusions

In this work, we proposed a multitask learning framework that integrates albedo estimation, supervised by albedo maps obtained via Intrinsic Image Decomposition, as an auxiliary task. Specifically, we target representative self-supervised monocular depth estimation backbones, Monodepth2 and Lite-Mono, and introduce a multi-head architecture in which the output of each upsampling block branches into a Depth Head and an Albedo Head. By using albedo images obtained by applying IID prior to training as a weak supervisory signal for the Albedo Head, we incorporate illumination-invariant albedo information into the loss design without embedding a heavy IID decoder or additional inference branches inside the network. This design allows the model to learn depth while suppressing the influence of illumination changes even for pixels in reflective regions and regions with strong illumination gradients. Furthermore, by using only the Depth Head at inference time, the proposed framework does not increase the number of parameters or FLOPs. Evaluation experiments on the ScanNetV2 dataset demonstrate that the proposed method achieves an 18.5% improvement in the average of the depth error metrics compared with the original Lite-Mono. In addition, since it can be integrated into existing backbones without increasing the number of parameters or FLOPs at inference time, the proposed framework also exhibits high practical potential for real-time processing and deployment on edge devices.
On the other hand, in this study the losses for the depth and albedo tasks are combined in a relatively simple manner, and there remains a possibility that the interaction between tasks is not fully controlled. To address this issue, an important direction for future work is to introduce dynamic loss weighting methods that suppress gradient interference between tasks, with the aim of further improving the performance and robustness of both depth and albedo estimation.

Author Contributions

Conceptualization, G.H.; Funding acquisition, X.K. and H.T.; Investigation, G.H.; Methodology, G.H.; Software, G.H.; Supervision, T.S., X.K. and H.T.; Validation, G.H.; Writing—original draft, G.H.; Writing—review and editing, T.S., X.K. and H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partly supported by Mazda Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to its integration into a larger proprietary software framework that is currently under development and subject to confidentiality agreements with our collaborators.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, versions 5 and 5.1) for the purposes of text generation and manuscript refinement. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J. Survey on Monocular Metric Depth Estimation. arXiv 2025, arXiv:2501.11841. [Google Scholar]
  2. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
  3. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
  4. Schön, M.; Buchholz, M.; Dietmayer, K. Mgnet: Monocular Geometric Scene Understanding for Autonomous Driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15804–15815. [Google Scholar]
  5. Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards Real-Time Monocular Depth Estimation for Robotics: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961. [Google Scholar] [CrossRef]
  6. Schreiber, A.M.; Hong, M.; Rozenblit, J.W. Monocular Depth Estimation Using Synthetic Data for an Augmented Reality Training System in Laparoscopic Surgery. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Melbourne, Australia, 17–20 October 2021; pp. 2121–2126. [Google Scholar]
  7. Saunders, K.; Vogiatzis, G.; Manso, L.J. Self-Supervised Monocular Depth Estimation: Let’s Talk about the Weather. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8907–8917. [Google Scholar]
  8. Gasperini, S.; Morbitzer, N.; Jung, H.; Navab, N.; Tombari, F. Robust Monocular Depth Estimation under Challenging Conditions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8177–8186. [Google Scholar]
  9. Bae, J.; Hwang, K.; Im, S. A Study on the Generality of Neural Network Structures for Monocular Depth Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2224–2238. [Google Scholar] [CrossRef] [PubMed]
  10. Choi, W.; Hwang, K.; Choi, M.; Han, K.; Choi, W.; Shin, M.; Im, S. Intrinsic Image Decomposition for Robust Self-Supervised Monocular Depth Estimation on Reflective Surfaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2555–2563. [Google Scholar]
  11. Daher, R.; Vasconcelos, F.; Stoyanov, D. SHADeS: Self-Supervised Monocular Depth Estimation through Non-Lambertian Image Decomposition. Int. J. Comput. Assist. Radiol. Surg. 2025, 20, 1255–1263. [Google Scholar] [CrossRef] [PubMed]
  12. Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18537–18546. [Google Scholar]
  13. Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374. [Google Scholar]
  14. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
  15. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
  16. Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4009–4018. [Google Scholar]
  17. Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756. [Google Scholar]
  18. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
  19. Poggi, M.; Tosi, F.; Mattoccia, S. Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions. In Proceedings of the International Conference on 3D Vision, Verona, Italy, 5–8 September 2018; pp. 324–333. [Google Scholar]
  20. Watson, J.; Firman, M.; Brostow, G.J.; Turmukhambetov, D. Self-Supervised Monocular Depth Hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2162–2171. [Google Scholar]
  21. GonzalezBello, J.L.; Kim, M. Forget about the Lidar: Self-Supervised Depth Estimators with MED Probability Volumes. Adv. Neural Inf. Process. Syst. 2020, 33, 12626–12637. [Google Scholar]
  22. Zhu, S.; Brazil, G.; Liu, X. The Edge of Depth: Explicit Constraints between Segmentation and Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13116–13125. [Google Scholar]
  23. Peng, R.; Wang, R.; Lai, Y.; Tang, L.; Cai, Y. Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15560–15569. [Google Scholar]
  24. Shu, C.; Yu, K.; Duan, Z.; Yang, K. Feature-Metric Loss for Self-Supervised Learning of Depth and Egomotion. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 572–588. [Google Scholar]
  25. Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; pp. 2294–2301. [Google Scholar]
  26. Liu, S.; Yang, L.T.; Tu, X.; Li, R.; Xu, C. Lightweight Monocular Depth Estimation on Edge Devices. IEEE Internet Things J. 2022, 9, 16168–16180. [Google Scholar] [CrossRef]
  27. Klingner, M.; Termöhlen, J.-A.; Mikolajczyk, J.; Fingscheidt, T. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 582–600. [Google Scholar]
  28. Petrovai, A.; Nedevschi, S. MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-Aware Video Panoptic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3077–3086. [Google Scholar]
  29. Land, E.H.; McCann, J.J. Lightness and Retinex Theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef] [PubMed]
  30. Grosse, R.; Johnson, M.K.; Adelson, E.H.; Freeman, W.T. Ground Truth Dataset and Baseline Evaluations for Intrinsic Image Algorithms. In Proceedings of the IEEE 12th International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2009; pp. 2335–2342. [Google Scholar]
  31. Bell, S.; Bala, K.; Snavely, N. Intrinsic Images in the Wild. ACM Trans. Graph. 2014, 33, 159. [Google Scholar] [CrossRef]
  32. Shen, L.; Yeo, C.; Hua, B.-S. Intrinsic Image Decomposition Using a Sparse Representation of Reflectance. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2904–2915. [Google Scholar] [CrossRef]
  33. Barron, J.T.; Malik, J. Shape, Illumination, and Reflectance from Shading. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1670–1687. [Google Scholar] [CrossRef] [PubMed]
  34. Narihira, T.; Maire, M.; Yu, S.X. Direct Intrinsics: Learning Albedo–Shading Decomposition by Convolutional Regression. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; p. 2992. [Google Scholar]
  35. Fan, Q.; Yang, J.; Hua, G.; Chen, B.; Wipf, D. Revisiting Deep Intrinsic Image Decompositions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8944–8952. [Google Scholar]
  36. Careaga, C.; Aksoy, Y. Colorful Diffuse Intrinsic Image Decomposition in the Wild. ACM Trans. Graph. 2024, 43, 178. [Google Scholar] [CrossRef]
  37. Ma, W.-C.; Chu, H.; Zhou, B.; Urtasun, R.; Torralba, A. Single Image Intrinsic Decomposition without a Single Intrinsic Image. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 201–217. [Google Scholar]
  38. Zhou, Z.; Fan, X.; Shi, P.; Xin, Y. R-MSFM: Recurrent Multi-Scale Feature Modulation for Monocular Depth Estimating. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12777–12786. [Google Scholar]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  40. Higashiuchi, G.; Shimada, T.; Kong, X.; Yan, H.; Tomiyama, H. MonoLENS: Monocular Lightweight Efficient Network with Separable Convolutions for Self-Supervised Monocular Depth Estimation. Appl. Sci. 2025, 15, 10393. [Google Scholar] [CrossRef]
  41. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  42. Shi, X.; Dikov, G.; Reitmayr, G.; Kim, T.-K.; Ghafoorian, M. 3D Distillation: Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9133–9143. [Google Scholar]
  43. Poggi, M.; Aleotti, F.; Tosi, F.; Mattoccia, S. On the Uncertainty of Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3227–3237. [Google Scholar]
Figure 1. Examples of intrinsic decomposition results.
Figure 2. Overall architecture of the proposed method. Components highlighted in red indicate the newly added modules and loss terms.
Figure 3. Albedo head architectures for (a) Monodepth2/HR-Depth and (b) Lite-Mono/MonoLENS.
Figure 4. Qualitative comparison on two example scenes across the four backbones. For each backbone, we show three columns: input (left), baseline prediction (middle), and baseline + ours (right). From top to bottom: Monodepth2, HR-Depth, Lite-Mono, and MonoLENS. Yellow boxes indicate reflective regions used for visual comparison.
Figure 5. Training loss curves with auxiliary albedo supervision. We show raw losses and moving averages (MA) for the depth loss L_depth, the auxiliary albedo supervision loss L_albedo, and the total loss L_total (Equation (13)).
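The moving averages (MA) overlaid on the raw loss curves in Figure 5 can be reproduced with a simple trailing window; the sketch below is illustrative, as the paper does not specify the window size used.

```python
def moving_average(xs, window=5):
    """Trailing moving average used to smooth noisy per-iteration loss curves.

    Each output value averages the current sample and up to (window - 1)
    preceding samples, so early entries average over shorter prefixes.
    """
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out
```

For example, `moving_average([1.0, 2.0, 3.0], window=2)` yields `[1.0, 1.5, 2.5]`.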
Figure 6. Gradient propagation analysis for Monodepth2 + Ours. We report the mean ℓ2 norm of gradients, averaged over all epochs, for each predefined parameter group: (a) gradients induced by the albedo supervision loss L_albedo and (b) gradients induced by the total training loss L_total.
Table 1. Quantitative comparison on the ScanNet-Reflection validation set. Depth errors (lower is better) and accuracy metrics (higher is better). For each backbone, the better value between the Baseline and Baseline + Ours is highlighted in bold.

| Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Monodepth2 | 0.195 | 0.228 | 0.612 | 0.243 | 0.733 | 0.923 | 0.974 |
| Monodepth2 + Ours | 0.188 | 0.159 | 0.572 | 0.236 | 0.731 | 0.922 | 0.974 |
| HR-Depth | 0.187 | 0.193 | 0.596 | 0.239 | 0.737 | 0.922 | 0.974 |
| HR-Depth + Ours | 0.186 | 0.131 | 0.547 | 0.232 | 0.706 | 0.928 | 0.982 |
| Lite-Mono | 0.201 | 0.264 | 0.638 | 0.249 | 0.729 | 0.919 | 0.973 |
| Lite-Mono + Ours | 0.184 | 0.147 | 0.552 | 0.230 | 0.730 | 0.924 | 0.978 |
| MonoLENS | 0.221 | 0.402 | 0.709 | 0.263 | 0.714 | 0.915 | 0.971 |
| MonoLENS + Ours | 0.193 | 0.140 | 0.550 | 0.236 | 0.700 | 0.917 | 0.980 |
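The seven columns in Table 1 are the standard monocular depth error and accuracy metrics. A minimal sketch of how they are typically computed from per-pixel ground-truth and predicted depths (valid pixels only, depths assumed positive; this is the conventional definition, not code from the paper):

```python
import math

def depth_metrics(gt, pred):
    """Standard depth metrics over flat lists of positive depths.

    Returns (abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3), where d_k is the
    fraction of pixels with max(gt/pred, pred/gt) < 1.25 ** k.
    """
    n = len(gt)
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in zip(gt, pred)) / n
    )
    deltas = [max(g / p, p / g) for g, p in zip(gt, pred)]
    d1 = sum(d < 1.25 for d in deltas) / n
    d2 = sum(d < 1.25 ** 2 for d in deltas) / n
    d3 = sum(d < 1.25 ** 3 for d in deltas) / n
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```

A perfect prediction yields zero for all four error metrics and 1.0 for all three accuracy thresholds.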
Table 2. Comparison with prior methods under a common Monodepth2 backbone on ScanNetV2. Depth errors (lower is better) and accuracy (higher is better) are reported. The best value for each metric is shown in bold, while the second-best is underlined. In case of ties for the best value, all tied results are shown in bold. Ours denotes the Monodepth2 baseline augmented with the proposed method using λ_albedo = 0.2.

| Method | Params (M) | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|
| Poggi et al. [43] | 14.3 | 0.192 | 0.188 | 0.548 | 0.233 | 0.764 | 0.920 | 0.967 |
| Choi et al. [10] | >14.3 | 0.158 | 0.100 | 0.462 | 0.200 | 0.769 | 0.939 | 0.986 |
| Ours | 14.3 | 0.156 | 0.100 | 0.477 | 0.203 | 0.775 | 0.942 | 0.986 |
Table 3. Model size and computational complexity. Left block: Parameters [M] (lower is better). Right block: FLOPs [G] (lower is better).

| Method | Params Total [M] | Params Encoder [M] | Params Decoder [M] | FLOPs Total [G] | FLOPs Encoder [G] | FLOPs Decoder [G] |
|---|---|---|---|---|---|---|
| Monodepth2 | 14.329 | 11.177 | 3.153 | 7.234 | 4.019 | 3.215 |
| Monodepth2 + Ours | 14.329 | 11.177 | 3.153 | 7.234 | 4.019 | 3.215 |
| HR-Depth | 16.960 | 11.177 | 5.784 | 28.462 | 4.019 | 24.443 |
| HR-Depth + Ours | 16.960 | 11.177 | 5.784 | 28.462 | 4.019 | 24.443 |
| Lite-Mono | 3.068 | 2.842 | 0.225 | 4.525 | 3.882 | 0.642 |
| Lite-Mono + Ours | 3.068 | 2.842 | 0.225 | 4.525 | 3.882 | 0.642 |
| MonoLENS | 1.776 | 1.648 | 0.127 | 3.604 | 3.368 | 0.235 |
| MonoLENS + Ours | 1.776 | 1.648 | 0.127 | 3.604 | 3.368 | 0.235 |
Table 4. Ablation on λ_albedo for Monodepth2 + Ours on the ScanNet-Reflection validation set. The best value for each metric is shown in bold, and the second-best value is underlined.

| Method | λ_albedo | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|
| Monodepth2 | 0.00 | 0.195 | 0.228 | 0.612 | 0.243 | 0.733 | 0.923 | 0.974 |
| Monodepth2 + Ours | 0.10 | 0.188 | 0.159 | 0.572 | 0.236 | 0.731 | 0.922 | 0.974 |
| Monodepth2 + Ours | 0.20 | 0.177 | 0.123 | 0.534 | 0.224 | 0.725 | 0.931 | 0.982 |
| Monodepth2 + Ours | 0.30 | 0.177 | 0.122 | 0.522 | 0.222 | 0.730 | 0.933 | 0.983 |
| Monodepth2 + Ours | 0.40 | 0.178 | 0.120 | 0.519 | 0.221 | 0.726 | 0.933 | 0.984 |
| Monodepth2 + Ours | 0.50 | 0.183 | 0.125 | 0.532 | 0.226 | 0.717 | 0.929 | 0.983 |
| Monodepth2 + Ours | 0.60 | 0.182 | 0.122 | 0.527 | 0.224 | 0.715 | 0.932 | 0.984 |
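The sweep in Table 4 varies only the weight on the albedo supervision term. A minimal sketch of the weighted objective, assuming the common linear combination of the two losses (Equation (13) itself is not reproduced in this excerpt, so the exact form and the names `l_depth`/`l_albedo` here are illustrative):

```python
def total_loss(l_depth, l_albedo, lam_albedo=0.3):
    """Combine the self-supervised depth loss with the auxiliary albedo
    supervision loss, scaled by the hyperparameter lam_albedo (the quantity
    swept in Table 4)."""
    return l_depth + lam_albedo * l_albedo
```

With `lam_albedo = 0.0` the objective reduces to the baseline depth loss, matching the first row of the table.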
Table 5. Ablation on IID teacher type (λ_albedo = 0.3) on the ScanNetV2 dataset. The best value for each metric is shown in bold.

| Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Baseline + Ours (Albedo teacher) | 0.177 | 0.122 | 0.522 | 0.222 | 0.731 | 0.933 | 0.983 |
| Baseline + Ours (Diffuse teacher) | 0.178 | 0.122 | 0.524 | 0.223 | 0.726 | 0.932 | 0.984 |
Table 6. Effect of supervision scale (Monodepth2 backbone, λ_albedo = 0.3). Depth errors (lower is better) and accuracy (higher is better) on the ScanNetV2 dataset. The best value for each metric is shown in bold.

| Supervised Scale | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| None (Baseline) | 0.195 | 0.228 | 0.612 | 0.243 | 0.733 | 0.923 | 0.974 |
| 1/8 | 0.188 | 0.131 | 0.539 | 0.229 | 0.705 | 0.925 | 0.982 |
| 1/4 | 0.184 | 0.132 | 0.539 | 0.229 | 0.718 | 0.928 | 0.981 |
| 1/2 | 0.179 | 0.123 | 0.524 | 0.222 | 0.730 | 0.933 | 0.984 |
| 1 | 0.179 | 0.122 | 0.524 | 0.222 | 0.724 | 0.932 | 0.984 |
Table 7. Sensitivity to IID (AlbedoNet) supervision quality on ScanNetV2 (Monodepth2 backbone, λ_albedo = 0.3). We add Gaussian perturbations to the pseudo-albedo targets during training. Depth errors (lower is better) and accuracy (higher is better) are reported.

| Noise Ratio | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| 0% | 0.177 | 0.122 | 0.522 | 0.222 | 0.730 | 0.933 | 0.983 |
| 5% | 0.180 | 0.127 | 0.538 | 0.226 | 0.719 | 0.930 | 0.981 |
| 15% | 0.184 | 0.140 | 0.546 | 0.230 | 0.723 | 0.925 | 0.979 |
| 30% | 0.188 | 0.155 | 0.564 | 0.235 | 0.729 | 0.922 | 0.974 |
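One plausible reading of the perturbation protocol in Table 7 is zero-mean Gaussian noise whose standard deviation scales with the target value, followed by clamping to the valid albedo range. The interpretation of "noise ratio" below is an assumption on our part, as is the function name; the paper's exact procedure may differ.

```python
import random

def perturb_albedo(albedo, noise_ratio, seed=0):
    """Corrupt pseudo-albedo targets with multiplicative-scale Gaussian noise.

    Each value a gets zero-mean Gaussian noise with std = noise_ratio * a
    (hypothetical reading of 'noise ratio'), then is clamped to [0, 1].
    A fixed seed keeps the perturbation reproducible across epochs.
    """
    rng = random.Random(seed)
    return [
        min(1.0, max(0.0, a + rng.gauss(0.0, noise_ratio * a)))
        for a in albedo
    ]
```

At a 0% noise ratio the targets are unchanged, matching the clean-teacher row of the table.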
Higashiuchi, G.; Shimada, T.; Kong, X.; Tomiyama, H. Robust Self-Supervised Monocular Depth Estimation via Intrinsic Albedo-Guided Multi-Task Learning. Appl. Sci. 2026, 16, 714. https://doi.org/10.3390/app16020714
