2.1. Self-Supervised Monocular Depth Estimation
Supervised methods can achieve high accuracy, but acquiring dense ground-truth depth maps is costly and highly dependent on the environment. In practice, it often requires extensive preprocessing, such as measurement with sensors like LiDAR, calibration, interpolation of missing values, and noise removal, which makes such methods difficult to deploy at scale. Representative supervised monocular depth estimation methods based on convolutional neural networks (CNNs) include the multi-scale architecture proposed by Eigen et al. [13], the residual network-based approach by Laina et al. [14], DORN, which treats depth as discrete ordinal classes [15], and AdaBins, which is based on adaptive binning [16]. Although these methods have demonstrated high estimation accuracy, they still face limitations when scaling to large datasets for the reasons mentioned above.
As an approach to overcoming the limitations of supervised methods, self-supervised monocular depth estimation has been extensively explored. Self-supervised depth estimation methods are generally divided into two categories: stereo-based methods and temporal-sequence (video-based) methods. The former are inspired by traditional stereo vision, where disparity is estimated from left–right image pairs and a monocular depth network is trained using the corresponding reprojection error. Garg et al. [17] were the first to demonstrate self-supervised learning from stereo pairs using a reprojection loss, and Godard et al. [18] further improved accuracy with MonoDepth by enforcing a left–right disparity consistency constraint. Poggi et al. [19] reduced occlusion artifacts by exploiting a multi-camera setup, while Watson et al. [20] strengthened the supervision signal by incorporating depth hints from conventional stereo algorithms into the photometric image reconstruction loss. In addition, Gonzalez-Bello et al. [21] introduced the FAL network with occlusion masks, Zhu et al. [22] proposed EdgeDepth, which leverages semantic information, and Peng et al. [23] developed EPCDepth, which improves accuracy through edge-based graph filtering.
In contrast, temporal sequence-based methods jointly estimate depth and camera pose from consecutive frames captured by a monocular camera. Zhou et al. proposed SfM-Learner [2], which introduced a self-supervised framework that jointly trains a depth network and a camera pose regression network by minimizing a photometric image reconstruction loss between adjacent frames. Subsequently, Godard et al. presented Monodepth2 [3], which improves training stability and accuracy by introducing a minimum reprojection error and an auto-masking scheme to handle dynamic objects and occlusions. Shu et al. proposed FeatDepth [24], which incorporates a feature-distance loss in the latent space, and Lyu et al. introduced HR-Depth [25], which redesigns skip connections to preserve high-resolution features. Since these self-supervised methods do not require ground-truth depth labels, they substantially reduce the cost associated with depth acquisition and preprocessing, while facilitating deployment across diverse environments and domains.
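The minimum-reprojection and auto-masking idea described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: it uses a plain L1 photometric error (Monodepth2 additionally combines L1 with an SSIM term and operates on warped RGB images produced by the depth and pose networks), and all function names are illustrative.

```python
import numpy as np

def photometric_error(pred, target):
    """Per-pixel L1 photometric error (SSIM term omitted for brevity)."""
    return np.abs(pred - target)

def min_reprojection_loss(target, warped_sources, raw_sources):
    """Monodepth2-style loss sketch: take the per-pixel minimum error over
    all warped source frames, and auto-mask pixels whose un-warped
    (identity) error is already lower, e.g. static scenes or objects
    moving at the same speed as the camera."""
    # Error of each source frame after warping into the target view.
    reproj = np.stack([photometric_error(w, target) for w in warped_sources])
    # Error of each raw source frame without any warping.
    identity = np.stack([photometric_error(s, target) for s in raw_sources])
    min_reproj = reproj.min(axis=0)
    min_identity = identity.min(axis=0)
    # Auto-mask: keep only pixels where warping actually reduces the error.
    mask = min_reproj < min_identity
    return (min_reproj * mask).sum() / max(mask.sum(), 1)
```

The per-pixel minimum handles occlusions (a point hidden in one source frame is usually visible in another), while the mask removes pixels that violate the moving-camera, static-scene assumption.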
Furthermore, motivated by real-time applications such as autonomous driving, robotics, and AR/VR, there has been growing interest in lightweight self-supervised monocular depth estimation models. Dong et al. conducted a survey on real-time monocular depth estimation for robotics [5] and highlighted the trade-off between model size and inference speed as a key challenge for practical deployment. Liu et al. proposed a lightweight monocular depth estimation network designed for edge devices [26], achieving a good balance between parameter reduction and prediction accuracy. Zhang et al. introduced Lite-Mono [12], a self-supervised monocular depth estimation model that combines CNNs and Transformers, and demonstrated that it can substantially reduce the number of parameters and FLOPs while matching or even surpassing the accuracy of previous models. Combined with the advantage of not requiring ground-truth depth labels, this trend toward lightweight and efficient architectures is driving demand for monocular depth models that run in real time on resource-constrained hardware platforms.
In recent years, research has shifted from merely improving metrics on standard benchmarks toward systematically evaluating the generalization ability of trained models under varying environmental conditions and domain shifts. In real-world scenarios, however, non-Lambertian effects such as specular reflections and strong illumination changes occur frequently, while the aforementioned photometric image reconstruction loss still rests on the Lambertian reflectance assumption, which presumes that the appearance of a surface is independent of the viewing direction. Consequently, under non-Lambertian conditions involving materials such as metals, glass, water surfaces, and glossy plastics, as well as strong highlights, shadows, and pronounced illumination gradients, pixel intensities can vary with lighting and viewpoint even when the same 3D point is reprojected. The photometric error then propagates as an inappropriate supervision signal for depth estimation, and depth errors have been reported to grow significantly in scenes containing strong reflections or steep illumination gradients [3,10].
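The assumption in question can be stated compactly. Using generic notation (not taken from any one of the cited papers), let $I_t$ and $I_s$ be the target and source frames, $K$ the camera intrinsics, $D_t(p)$ the predicted depth at target pixel $p$, $T_{t \to s}$ the relative camera pose, and $\pi$ the projection onto the image plane. The photometric reconstruction loss presumes brightness constancy, i.e. that a reprojected 3D point has the same intensity in both views:

```latex
% Brightness-constancy (Lambertian) assumption behind the photometric loss:
% pixel p and its reprojection into the source view have equal intensity.
I_t(p) \;\approx\; I_s\!\left( \pi\!\left( K \, T_{t \to s} \, D_t(p) \, K^{-1} p \right) \right)
```

For a non-Lambertian surface the observed intensity also depends on the viewing direction, so the approximation fails and the resulting photometric error can penalize geometrically correct depth predictions.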
At the same time, multitask learning approaches that jointly train depth estimation with auxiliary tasks have been explored to improve robustness by leveraging additional cues. Klingner et al. [27] introduced a multitask framework that incorporates semantic segmentation, where dynamic classes such as cars and pedestrians are detected based on semantic labels and their pixels are masked out from the photometric image reconstruction loss. In this way, they suppress incorrect supervision signals caused by moving objects that violate the static-world assumption. Petrovai et al. [28] also proposed a framework that jointly learns self-supervised depth estimation and video panoptic segmentation. By designing depth losses conditioned on panoptic labels and masking dynamic objects, they mitigate broken supervision signals contained in the reprojection loss. While these methods improve robustness by exploiting high-level cues such as semantics and instance boundaries, they still rely heavily on high-quality pseudo-labels or segmentation annotations. As a result, they require re-annotation when transferring to new domains, and misdetections in unseen environments can destabilize training.
Furthermore, as a more direct line of multitask learning that addresses depth errors caused by the breakdown of the photometric image reconstruction loss under non-Lambertian conditions, several methods have integrated Intrinsic Image Decomposition (IID) into depth estimation. Daher et al. [11] designed a non-Lambertian model that simultaneously estimates depth, albedo, shading, and specular components, and achieved robust depth estimation in reflective regions by treating specular reflections as an independent component. Similarly, Choi et al. [10] integrated IID into self-supervised monocular depth estimation. By combining reflection-region identification, the removal of invalid gradients, and pseudo-depth distillation, they improved performance in both reflective and non-reflective regions. However, in these methods, an additional U-Net–style IID decoder or prediction branch is embedded alongside the depth network and is also executed at inference time. Although this design allows the use of illumination-invariant reference information, it naturally increases overall model complexity and inference cost. Finally, although the minimum reprojection and auto-masking strategy introduced by Godard et al. [3] stabilizes training by suppressing the influence of pixels that violate geometric or motion assumptions, performance degradation is still reported in scenarios with strong specular reflections or large illumination changes, where the Lambertian reflectance assumption itself breaks down.
2.2. Intrinsic Image Decomposition
Intrinsic Image Decomposition (IID) has long been studied as a classical problem that aims to separate an observed image into an albedo component corresponding to surface reflectance and a shading component corresponding to illumination. Land et al. [29] introduced a Retinex-based framework that models an image as the product of albedo and shading, thereby laying the conceptual foundation of IID. Building on this, Grosse et al. [30] constructed the MIT Intrinsics dataset based on high-precision measurements of real objects, enabling quantitative evaluation of IID algorithms using ground-truth albedo and shading. Furthermore, Bell et al. [31] proposed the Intrinsic Images in the Wild (IIW) dataset, which uses crowdsourced human annotations of relative reflectance, providing a large-scale benchmark for assessing albedo consistency across diverse indoor and outdoor scenes. Shen et al. [32] designed a sparse-representation-based optimization method that incorporates priors encouraging large image gradients to be interpreted as albedo changes, and achieved stable decomposition results on natural images.
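The multiplicative image-formation model underlying these classical methods, together with the prior that large image gradients are albedo changes, can be illustrated with a toy sketch. This is a NumPy example under simplifying assumptions (a grayscale image row and a hand-picked threshold), not a reconstruction of any cited method:

```python
import numpy as np

def retinex_formation(albedo, shading):
    """Classical intrinsic model: the observed image is the pixel-wise
    product of albedo (surface reflectance) and shading (illumination)."""
    return albedo * shading

def classify_gradients(image, threshold=0.3):
    """Retinex-style heuristic: in the log domain, large horizontal
    gradients are attributed to albedo edges, small ones to smooth
    shading. The threshold value is illustrative only."""
    log_grad = np.diff(np.log(image), axis=1)
    return np.where(np.abs(log_grad) > threshold, "albedo", "shading")
```

In the log domain the product becomes a sum, so a sharp reflectance edge produces a large gradient while smoothly varying illumination produces small ones, which is exactly the prior exploited by Retinex-style optimization.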
Among supervised learning-based models, Barron et al. [33] developed a physically consistent model that jointly estimates shape, illumination, and reflectance, and proposed a unified framework that also includes a supervised extension to RGB-D data. Narihira et al. [34] introduced Direct Intrinsics, which directly regresses albedo and shading using convolutional neural networks, and demonstrated high-accuracy decomposition on annotated datasets. Fan et al. [35] revisited existing deep-learning-based IID models, conducting a detailed analysis of how differences in datasets and loss designs affect decomposition performance, and presented refined architectures that yield higher-quality intrinsic predictions. More recently, Careaga et al. [36] proposed an extended intrinsic model that explicitly handles a specular residual component in addition to albedo and shading, achieving high-resolution, high-quality decompositions on real images, including outdoor scenes. Collectively, these studies move IID beyond the classical Lambertian assumption by combining physical models with deep learning to explicitly model non-Lambertian and specular effects.
Meanwhile, in order to reduce the cost of collecting ground-truth albedo and shading maps, self-supervised and weakly supervised IID models have also been proposed. Ma et al. [37] combined a two-stream network with temporal consistency constraints, leveraging temporal information from video sequences as a supervisory signal to learn intrinsic decomposition without ground-truth annotations, while still enabling single-image inference at test time.