EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles

Zhang, Chenyuan; Lee, Deokwoo

doi:10.3390/app15169130

Open AccessArticle

EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles

by

Chenyuan Zhang

and

Deokwoo Lee

^*

Department of Computer Engineering, Keimyung University, Daegu 42601, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(16), 9130; https://doi.org/10.3390/app15169130

Submission received: 26 July 2025 / Revised: 15 August 2025 / Accepted: 18 August 2025 / Published: 19 August 2025

Download

Browse Figures

Versions Notes

Abstract

Monocular depth estimation (MDE) is a cornerstone of computer vision and is applied to diverse practical areas such as autonomous vehicles, robotics, etc., yet even the latest methods suffer substantial errors in high-dynamic-range (HDR) scenes where over- or under-exposure erases critical texture. To address this challenge in real-world autonomous driving scenarios, we propose the Exposure-Aware Single-Step Diffusion Framework for Monocular Depth Estimation (EASD). EASD leverages a pre-trained Stable Diffusion variational auto-encoder, freezing its encoder to extract exposure-robust latent RGB and depth representations. A single-step diffusion process then predicts the clean depth latent vector, eliminating iterative error accumulation and enabling real-time inference suitable for autonomous vehicle perception pipelines. To further enhance robustness under extreme lighting conditions, EASD introduces an Exposure-Aware Feature Fusion (EAF) module—an attention-based pyramid that dynamically modulates multi-scale features according to global brightness statistics. This mechanism suppresses bias in saturated regions while restoring detail in under-exposed areas. Furthermore, an Exposure-Balanced Loss (EBL) jointly optimises global depth accuracy, local gradient coherence and reliability in exposure-extreme regions—key metrics for safety-critical perception tasks such as obstacle detection and path planning. Experimental results on NYU-v2, KITTI, and related benchmarks demonstrate that EASD reduces absolute relative error by an average of 20% under extreme illumination, using only 60,000 labelled images. The framework achieves real-time performance (<50 ms per frame) and strikes a superior balance between accuracy, computational efficiency, and data efficiency, offering a promising solution for robust monocular depth estimation in challenging automotive lighting conditions such as tunnel transitions, night driving and sun glare.

Keywords:

monocular depth estimation; diffusion model; single-step diffusion; extreme illumination; attention mechanism; exposure-aware

1. Introduction

Monocular depth estimation (MDE) infers the distance of each pixel in a single 2D RGB image to the camera. As a fundamental yet challenging computer-vision task [1], accurate depth estimation is indispensable for three-dimensional scene understanding, environmental perception, and autonomous interaction. In the context of autonomous vehicles, MDE plays a pivotal role in enabling real-time obstacle avoidance, a critical factor for ensuring vehicle safety. For instance, by precisely inferring the distances of surrounding objects from a single RGB image, autonomous vehicles can make split-second decisions to avoid collisions. Furthermore, robust depth estimation supports accurate scene construction for augmented reality (AR)/virtual reality (VR) systems, autonomous robotic navigation, and 3D reconstruction [2].

However, MDE is inherently ill-posed: an identical 2D projection can correspond to infinitely many 3D geometries [3]. Recent deep-learning approaches, leveraging large-scale convolutional neural networks (CNNs) and Vision Transformers (ViTs), have achieved promising results under normal illumination by learning statistical regularities linking appearance, semantics, and geometry.

In autonomous driving scenarios, however, image sensors often encounter high-dynamic-range (HDR) scenes characterised by severe over-exposure or under-exposure. Over-exposed regions lose texture and edge cues due to saturation, while under-exposed regions suffer from poor signal-to-noise ratios. These conditions degrade conventional learning-based MDE models, leading to depth maps with holes, blurred boundaries, scale drift, and unreliable predictions [1,3]. Most existing methods assume well-exposed inputs, rendering them ineffective in HDR environments—a critical limitation for autonomous vehicles operating in diverse lighting conditions.

The emergence of latent diffusion models (LDMs) offers a novel pathway. Pre-trained on Internet-scale image corpora, LDMs encode rich visual priors about shape, material, and lighting, excelling in image synthesis and in-painting. For example, Marigold [4] fine-tunes Stable Diffusion for depth regression, achieving cross-domain generalisation. However, its 50-step DDIM inference ≈1 s exceeds real-time constraints for autonomous driving applications. Other approaches, such as Lotus, compress denoising to a single-step via x₀-prediction or distillation, improving inference speed but neglecting the adverse impact of extreme exposure on feature extraction and depth inference.

To address these limitations, we propose the Exposure-Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles (EASD). This framework integrates the single-step efficiency of LDMs with explicit exposure modelling to achieve real-time performance while enhancing robustness in HDR scenes. Key contributions include:

Single-step latent depth regression. We regress the latent depth map in a single x₀-prediction step, eliminating iterative error accumulation and heavy computation.
Exposure-Aware Feature Fusion (EAF). Global brightness statistics guide an attention mechanism that dynamically re-weights multi-scale features, amplifying subtle textures in under-exposed regions and suppressing saturation artefacts in highlights.
Diversified Depth Predictor (DDP) within the EASD framework supports two inference modes: Multi-Sample Mode draws multiple stochastic samples to provide pixel-level uncertainty estimates, making it suitable for reliability-critical scenarios, while Single-Shot Inference Mode fixes the noise vector to deliver fast, deterministic predictions for real-time applications; choosing between the two modes allows users to balance accuracy and speed.

2. Related Work

This section reviews work relevant to the proposed EASD framework, with a focus on autonomous driving applications.

2.1. Monocular Depth Estimation

Monocular depth estimation has evolved from geometry-based algorithms to deep-learning methods. Early approaches relied on scene constraints, hand-crafted features and probabilistic graphical models, but generalised poorly beyond those assumptions.

CNN-based methods. Eigen et al. [1] pioneered a multi-scale CNN that directly regresses depth. Subsequent studies refined architectures, loss functions and training schemes. MiDaS [2] trains on a large hybrid dataset and—using an affine-invariant loss—achieves strong zero-shot cross-dataset generalisation. LeReS [5] integrates point-cloud reprojection and ranks first on several unseen datasets, underscoring the importance of detail recovery. HDN [6] introduces multi-scale depth normalisation and reports further gains. While these methods perform well under normal illumination, they struggle with HDR scenes common in autonomous driving, where exposure extremes are under-represented during training.

Transformer-based methods. Dense Prediction Transformers (DPT) [7] adopt a ViT backbone whose long-range dependency modelling strengthens global scene understanding. DepthAnything V2 [8] employs weak labels and prompt learning to set new benchmarks. Yet, both CNN and ViT architectures implicitly assume well-exposed inputs, making them prone to scale drift and texture loss in autonomous driving environments with extreme lighting.

Additionally, robustness under adverse atmospheric conditions (e.g., fog, snow) remains a critical limitation. Prior studies have validated this phenomenon, demonstrating that depth estimation models trained on clean data typically exhibit substantial performance degradation under low-visibility conditions. For instance, Gasperini et al. [9] demonstrate that haze-induced texture contrast reduction and light scattering can degrade depth accuracy by up to 30%. Furthermore, snow accumulation on sensors introduces significant noise in monocular depth estimation pipelines.

2.2. Depth Estimation Under High-Dynamic-Range (HDR) or Extreme-Exposure Conditions

High-dynamic-range scenes present a formidable challenge for monocular depth estimation (MDE) in autonomous vehicles. In HDR images, the luminance span is far wider than that representable by conventional low-dynamic-range (LDR) imaging. Over-exposed regions become saturated, losing virtually all texture and structural detail and thereby undermining appearance-based depth cues. Conversely, under-exposed regions exhibit extremely low signal-to-noise ratios, with useful visual information submerged in noise. As a result, directly deploying MDE models designed for LDR imagery in HDR scenarios typically causes a marked deterioration in depth-map quality.

Researchers have therefore explored several strategies for depth estimation in HDR or extreme-exposure environments. In visual-odometry (VO) and simultaneous-localisation-and-mapping (SLAM) research, some studies propose active exposure-control schemes [10], dynamically adjusting shutter time to obtain image sequences at multiple exposures and thus reducing photometric error; such methods, however, rely on specialised hardware or multi-frame input. In the image-processing community, numerous HDR-reconstruction pipelines have emerged. They either (i) fuse differently exposed LDR images into a single HDR frame or (ii) recover HDR information from a single LDR image via gradient-consistency constraints, after which depth is inferred from the reconstructed HDR image. This two-stage “reconstruct-then-infer” approach is cumbersome, and inaccuracies in HDR reconstruction propagate directly to depth estimation. A third line of work designs bespoke network architectures that tackle extreme exposure end-to-end—for example, dual-branch CNNs that process short- and long-exposure images separately and learn complementary depth cues between them [10]. While such methods partly mitigate exposure-induced artefacts, they either require multi-frame alignment or substantially increase computational cost, making them difficult to embed efficiently in real-time, single-shot MDE frameworks.

2.3. Applications of Diffusion Models in Vision Tasks

Diffusion probabilistic models (denoising diffusion probabilistic models, DDPMs) [11] and their derivatives have recently achieved breakthrough results in a range of vision problems, including image synthesis, super-resolution, and inpainting. The underlying idea is to inject noise into the data through a parameterised Markov chain—the forward diffusion—and then to learn its inverse process to recover the original signal—the reverse denoising. Owing to their training stability and their ability to generate highly diverse, high-fidelity samples, diffusion models have rapidly become a focal point of generative-modelling research.

In autonomous driving, Marigold [4] fine-tunes a pre-trained Stable Diffusion V2 backbone and secures state-of-the-art cross-domain performance on the affine-invariant depth-estimation benchmark, confirming the value of large-scale diffusion priors for depth inference. However, Marigold still demands up to 50 DDIM sampling steps, leading to inference latencies > 1 s—unsuitable for real-time deployment. GeoWizard [12] jointly models depth and surface normals and introduces a geometric switcher, yet it retains the conventional ε-prediction objective and multi-step sampling. DiffusionDepth [13] re-casts MDE as a denoising diffusion process in latent space and employs self-diffusion to cope with sparse ground truth, but likewise suffers from slow inference.

StableNormal [14] addresses variance inflation in diffusion trajectories but overlooks exposure imbalance, a critical factor for HDR depth estimation in autonomous vehicles.

2.4. Single-Step Diffusion

To alleviate the heavy computational burden imposed by the multi-step sampling of conventional diffusion models, researchers have actively explored single- or few-step inference schemes. Direct prediction of the clean target

x_{0} - p r e d i c t i o n

than the noise term

ϵ - p r e d i c t i o n

has proved particularly effective. Follow-up work on Marigold observed that variance can be amplified in the early denoising stages under extreme lighting conditions. This highlights the fragility of multi-step pipelines for autonomous driving [15], where split-second decisions are critical. By predicting

x_{0}

in conjunction with single - step denoising, frameworks like

Lotus [16] compress iterations to one, reducing inference time from >1 s to <100 ms—a latency suitable for autonomous vehicle perception systems; its Detail Preserver module curbs texture degradation, delivering then-state-of-the-art speed for both depth and normal estimation. GenPercept [17] further simplifies inference by omitting Gaussian-noise inputs, adopting a “one-shot perception” policy. This approach underscores the generality of diffusion priors for dense prediction tasks in autonomous navigation, where robustness to input corruption is paramount. Similarly, DepthMaster [18] introduces a single-step diffusion model that retunes generative features for the discriminative demands of depth estimation, adding feature-alignment and Fourier-enhancement blocks.

Although these single- or few-step methods have achieved impressive gains in efficiency while preserving generation quality, their principal focus has been on generic scenes [13,14]. They currently provide no explicit, task-specific modelling of, or adaptive compensation for, the feature degradation induced by extreme illumination. Consequently, even these high-throughput frameworks can still prove fragile in severely over- or under-exposed regions.

3. Method

This section details our Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles (EASD) framework for depth estimation. Centred on a pre-trained latent-space diffusion backbone and a single-step denoising scheme, EASD innovatively integrates three components—(i) exposure-statistics computation, (ii) an exposure-adaptive feature-fusion (EAF) module, and (iii) a diversified depth predictor (DDP) function—to achieve efficient and robust depth estimation under extreme illumination [19,20,21].

3.1. Framework Overview

The overall architecture of EASD is illustrated in Figure 1. It consists of the following key modules, which collaborate to produce the final depth prediction:

The data flow proceeds as follows. An input RGB image

I_{r g b}

is encoded by the VAE encoder to obtain its latent representation

z_{r g b}

. In parallel, the exposure-statistics module extracts exposure cues, yielding an exposure vector

W_{e x p}

. During training, the ground-truth depth map

D_{g t}

is likewise encoded to a latent depth tensor

z_{g t}

and perturbed with noise to form

z_{g t}^{t}

. The

z_{g t}^{t}

and

z_{r g b}

is fed into the single-step diffusion depth regressor, whose internal features are modulated by the exposure-adaptive feature-fusion (EAF) module. The denoising network outputs a refined latent depth vector

{\hat{z}}_{0}

, which is decoded by the VAE decoder to produce the estimated depth map

\hat{D}

. Network parameters are optimised with the exposure-balancing loss (EBL).

At inference time, a diversity predictor may be invoked to select different operating modes according to image characteristics; otherwise, a single forward pass suffices to generate the depth map.

3.2. Latent-Space Encoding and Input Pre-Processing

To harness the powerful priors of large-scale pre-trained models and to obtain a modality-consistent representation, EASD processes both the input image and, during training, the target depth map within a unified latent space.

3.2.1. RGB-Image Encoding

An input RGB image

I_{r g b} \in R^{3 \times H \times W}

is first standardised by normalising each channel with the ImageNet mean and standard deviation. The normalised image is then passed to a pre-trained, frozen Stable Diffusion VAE encoder

E_{V A E}

.

E_{V A E}

projects the high-dimensional pixel array into a compact latent vector

z_{r g b} = E_{V A E} (I_{r g b}) \in R^{C_{z} \times H / f \times W / f}

(1)

where

C_{z}

follows the Stable Diffusion VAE configuration.

Freezing the encoder preserves the rich visual semantics and structural priors acquired from large-scale data, supplies high-quality conditioning for subsequent depth estimation, and ensures that all inputs reside in a single, semantically expressive latent space.

3.2.2. Depth-Map Encoding

During training, the ground-truth depth map

D_{g t} \in R^{1 \times H \times W}

must be embedded in the same latent space as the RGB image. We first min–max normalise

D_{g t}

to a fixed range (e.g., [−1, 1]). To satisfy the VAE encoder’s three-channel input requirement, the single-channel depth map is then replicated across the RGB channels to create a pseudo-colour image. This processed depth map is fed into the shared, frozen VAE encoder

E_{V A E}

, generating a latent representation

z_{g t} = E_{V A E} (D_{g t})

(2)

whose dimensions match those of

z_{g t}

.

This single-encoder, dual-branch design enforces stylistic and representational consistency between the RGB and depth modalities within the common latent space, thereby lowering the difficulty of cross-modal learning and mitigating error propagation that might arise from using separate encoders.

3.3. Single-Step Diffusion Strategy for Depth Reconstruction

Conventional diffusion models generate target data from pure noise through an iterative, multi-step denoising schedule—an approach that incurs substantial computational overhead and noticeable inference latency. To enable efficient depth estimation, EASD adopts an

x_{0} - p r e d i c t i o n

, single-step diffusion strategy for depth reconstruction.

3.3.1. $x_{0} - P r e d i c t i o n$ and Single-Step Denoising

Unlike

\in - p r e d i c t i o n

, which regresses the per-step noise

\in_{t}

,

x_{0} - p r e d i c t i o n

seeks to predict the clean signal after denoising. In single- or few-step generation tasks, this choice has proved more effective: it typically converges faster to the target data distribution and avoids the error accumulation and early-stage variance amplification that can plague multi-step schedules.

Noise injection (training stage): During training, we perturb the latent depth vector

z_{g t}

only once to emulate the state of the diffusion process at a specific—usually large—time-step

t

. The noised depth latent is computed as

z_{g t}^{t} = \sqrt{{\bar{α}}_{t}} z_{g t} + \sqrt{1 - {\bar{α}}_{t}} ϵ

(3)

where

ϵ ~ N (0, I)

is standard Gaussian noise and

I

is the identity covariance matrix.

{\bar{α}}_{t}

is a pre-defined noise-schedule parameter that sets the signal-to-noise ratio at step

t

. In EASD, we choose a fixed, relatively large

t

, providing the denoising network with a consistent and challenging starting point.

Denoising network

f_{θ}

: A lightweight U-Net serves as the single-step diffusion regressor

f_{θ}

. Its input is the concatenation of the noised depth latent

z_{g t}^{t}

and the image latent

z_{r g b}

; optionally, the time-step embedding

t

or a learnable task token

s_{d e p t h}

is appended as an additional condition. The network outputs the predicted clean depth latent

{\hat{z}}_{0} = f_{θ} (z_{g t}^{t}, z_{r g b}, c o n d i t i o n)

(4)

Here, condition can be either

t

or

s_{d e p t h}

Our approach employs a single-step diffusion process to directly predict depth latent variables:

z_{0} = ϵ_{θ} (X_{t}, c; θ) + σ_{t} \cdot z_{t}

(5)

where

c

denotes exposure features,

σ_{t}

represents the noise scaling factor. By leveraging a parameter-sharing UNet architecture, the proposed framework jointly optimises noise prediction and depth generation, enabling efficient inference through a single forward pass.

EASD achieves a remarkable reduction in inference latency—from 5000 ms with conventional DDPM to less than 50 ms—while maintaining identical model capacity (380 M parameters). This validates the computational efficiency of single-step diffusion while preserving depth estimation accuracy. The architectural design further demonstrates robustness to exposure variations through adaptive noise scaling, addressing critical limitations of conventional methods in HDR scenarios.

To accommodate the eight-channel concatenated input, the first convolutional layer of the U-Net duplicates its kernel weights uniformly across channels, mitigating abrupt shifts in activation statistics.

3.3.2. Latent-Space Loss Functions

To supervise the denoising network effectively, we define the following loss terms directly in the latent space.

Reconstruction loss

L_{r e c}

: This term measures the discrepancy between the predicted clean depth latent vector

{\hat{z}}_{0}

and the ground-truth clean depth latent vector

z_{g t},

thereby encouraging the network to recover depth information faithfully in latent space:

L_{r e c} = ‖{\hat{z}}_{0} - z_{g t}‖

(6)

We adopt the

L 1 - n o r m

rather than the

L 2 - n o r m

, as the former typically yields sharper depth discontinuities and crisper edges.

Edge-consistency loss

L_{e d g e}

: To retain structural detail and crisp edges in the depth map, we impose an edge-consistency constraint in latent space. Specifically, we compare the spatial gradients of the predicted and ground-truth depth latents:

L_{e d g e} = ‖{\nabla \hat{z}}_{0} - {\nabla z}_{g t}‖

(7)

where

\nabla

denotes a spatial-gradient operator applied channel-wise in the latent domain.

The overall latent loss

L_{l a t e n t}

combines the two terms:

L_{l a t e n t} = W_{r e c} L_{r e c} + W_{e d g e} L_{e d g e}

(8)

where

W_{r e c}

and

W_{e d g e}

weight the reconstruction and edge components, respectively. By coupling the single-step,

x_{0} - p r e d i c t i o n

strategy with these latent-space constraints, our model not only slashes inference time, but also learns depth representations that are markedly more robust and structurally faithful under challenging exposure conditions.

3.4. Exposure-Statistics Computation

To enable EASD to perceive and adapt to exposure variations in the input, we devise an exposure-statistics module that extracts quantitative descriptors of exposure imbalance from the RGB image and passes them, as priors, to the exposure-adaptive feature-fusion (EAF) block.

Brightness-histogram imbalance

Ψ (I)

: The input RGB image is converted to a luminance map

L

. This indicator gauges the uniformity of the global luminance distribution. A well-exposed, properly contrasted image has a fairly even histogram and therefore a low imbalance score, whereas over- or under-exposed images have most of their mass piled up at one end of the grey scale, yielding a high score. We compute

Ψ (I) = \frac{σ (H (L))}{μ (H (L)) + ϵ_{h i s t}}

(9)

where

H (L)

is the normalised luminance histogram of

L

;

μ (\cdot)

and

σ (\cdot)

denote the mean and standard deviation of that histogram, respectively;

ϵ_{h i s t} > 0

is a small constant that prevents division by zero. Larger

Ψ (I)

values typically signal more severe exposure imbalance.

Adaptive-gamma

γ_{I}

: Gamma correction is a standard technique for adjusting image brightness and contrast. We derive a data-driven gamma value from the global luminance statistics of the current frame: if the image is too dark,

γ_{I} < 1

brightens it; if the image is overly bright,

γ_{I} > 1

compresses the highlights.

Both exposure descriptors—histogram imbalance

Ψ (I)

and adaptive gamma

γ_{I}

are computed independently at three spatial scales: the original resolution, 1/2 resolution, and 1/4 resolution. This yields a six-dimensional exposure vector.

By concatenating the statistical features obtained at each scale and passing them through a lightweight multi-layer perceptron (MLP), we obtain a compact exposure-recalibration vector

W_{e x p}

. This vector is supplied as a conditioning signal to the EAF module, allowing it to adaptively modulate the denoising network’s internal feature representations.

3.5. Exposure-Adaptive Feature Fusion (EAF)

Exposure-adaptive feature fusion (EAF) is one of the principal innovations of the EASD framework. Guided by exposure statistics extracted from the input image, the module dynamically modulates the multi-scale features inside the single-step diffusion depth regressor. This strategy enables the network to cope with extreme illumination: it heightens sensitivity to fine structure in under-exposed regions while suppressing the saturation artefacts characteristic of over-exposed regions.

As illustrated in Figure 2, the Exposure-Adaptive Feature (EAF) module first computes global brightness statistics (e.g., mean intensity, variance) from the input image. These statistics are subsequently fed into a lightweight fully connected network to generate multi-scale attention weights. The weights are then fused with RGB/depth features through element-wise multiplication, dynamically modulating regional responses to balance under-exposed and over-exposed areas.

3.5.1. Channel-Attention Branch

For each pyramid level

F_{i}

, we apply channel attention, modulated by the exposure-recalibration vector

W_{r e c}

.

The channel-attention weights

A_{i}^{c h} \in R^{C_{i} \times 1 \times 1}

are obtained as

A_{i}^{c h} = σ ({M L P}_{c h} (G A P (F_{i}))) ⊙ W_{e x p, i}^{c h}

(10)

where

G A P (F_{i})

is the global-average-pooled channel descriptor.

{M L P}_{c h}

is a lightweight two-layer perceptron that learns inter-channel importance.

σ (\cdot)

is the element-wise Sigmoid.

⊙

denotes element-wise addition. The attended feature map is then

F_{i}^{'} = F_{i} ⊙ A_{i}^{c h}

(11)

With

⊙

denoting channel-wise multiplication.

This design allows the network to dynamically enhance or suppress individual channels according to exposure priors: channels sensitive to faint textures are boosted in under-exposed regions, whereas channels easily triggered by saturated pixels are attenuated in over-exposed regions.

3.5.2. Spatial Self-Attention Branch

After the channel–attention refinement, we further apply a spatial self-attention mechanism to the feature map

F_{i}^{'}

. This step highlights information-rich image regions and adaptively re-weights spatial features according to the exposure conditions.

First,

F_{i}^{'}

is linearly projected to query, key, and value tensors:

Q_{i} = W_{Q} F_{i}^{'}, K_{i} = W_{K} F_{i}^{'}, V_{i} = W_{V} F_{i}^{'}

(12)

where

W_{Q}, W_{K}, V_{i}

are learnable weight matrices. The conventional self-attention is then computed as

A t t n (W_{Q}, W_{K}, V_{i}) = S o f t m a x (\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}) V_{i}

(13)

With

d_{k}

denoting the dimensionality of the key vectors. To integrate the exposure prior, we inject either the global exposure kernel

W_{e x p}

or its channel-wise component

W_{e x p, i}^{c h}

into the attention formulation. Specifically,

V_{i}^{m o d} = F_{i} ⊙ A_{i}^{c h} Q_{i}, K_{i} = L i n e a r (V_{i}^{m o d})

(14)

And the modulated attention output is

{\tilde{F}}_{i} = S o f t m a x (\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}) V_{i}^{m o d}

(15)

A residual connection is finally added:

{\tilde{F}}_{i} = {\tilde{F}}_{i} + F_{i}^{'}

(16)

The spatial attention module enables the network to suppress noise in over-/under-exposed regions while preserving faint structural cues that may appear in poorly illuminated regions.

The set of enhanced features

{{\tilde{F}}_{i}}

produced by the EAF module is subsequently routed to the corresponding levels of the U-Net decoder. In this way, the EAF module dynamically amplifies or attenuates feature responses under varying exposure conditions. Ablation studies confirm that incorporating the EAF module yields substantial gains in regions subject to extreme exposure.

3.6. Diversified Depth Predictor (DDP)

To boost the EASD framework’s adaptability and prediction reliability across heterogeneous scenes, we add an optional Diversified Depth Predictor (DDP) that leverages exposure-driven scene cues to select—or softly blend—two complementary inference regimes: a multi-sample generative mode, which runs several stochastic forward passes to yield both a depth mean and an uncertainty map for intrinsically ambiguous cases (e.g., sky-dominant, specular, or glass scenes), and a fast deterministic mode, which issues real-time predictions when depth cues are clear, such as on texture-rich, uniformly lit Lambertian surfaces; by dynamically balancing these modes in a single lightweight module, the DDP achieves robust depth estimates without compromising overall speed.

3.6.1. Mode Selection

The inference regime is chosen according to the exposure statistics derived in Section 3.4: if the luminance-histogram imbalance surpasses a predefined threshold or extensive over-/under-exposed regions are detected, the scene is classified as highly ambiguous and the system prefers the multi-sample mode; otherwise, it defaults to the fast-prediction mode.

3.6.2. Multi-Sample Mode

When the scene exhibits high ambiguity, EASD switches to a generative inference regime. To explore multiple depth hypotheses and quantify epistemic uncertainty, we inject an isotropic Gaussian noise term

ϵ_{s} ~ N (0, σ_{s}^{2} I)

into the noisy depth latent

z_{g t}^{t}

before feeding it to the single-step diffusion regressor

f_{θ}

:

z_{i n}^{t} = z^{t} + ϵ_{s}

(17)

We then draw

N

independent samples, producing

N

distinct clean depth latents

{{\hat{z}}_{0, n}}_{n = 1}^{N}

. Each latent is decoded by the VAE decoder

D_{V A E}

to obtain depth maps

{{\hat{D}}_{n}}_{n = 1}^{N}

. From this ensemble, we derive a pixel-wise uncertainty map

U (p) = V a r ({{\hat{D}}_{n} (p)}_{n = 1}^{N})

(18)

Finally, the depth estimate is the mean of the

N

predictions,

{\hat{D}}_{m e a n} = \frac{1}{n} \sum_{n = 1}^{N} {\hat{D}}_{n}

(19)

3.6.3. Single-Shot Inference Mode

When the scene presents low ambiguity—or when ultra-low latency is imperative—EASD reverts to a discriminative pathway. No additional noise is injected

ϵ_{s}

; instead, a deterministic one-step forward pass through yields a depth map

\hat{D}

in a single shot. This mode achieves the lowest latency within the EASD suite and is therefore the default choice for real-time deployments.

3.7. Exposure-Balanced Loss (EBL)

To provide more informative supervision, especially in regions suffering from extreme exposure, we propose the Exposure-Balanced Loss

L_{E B L}

. The loss is computed in the pixel domain on the depth map

\hat{D}

decoded from the clean latent

{\hat{z}}_{0}

by the VAE decoder

D_{V A E}

.

EBL simultaneously (i) improves global depth accuracy, (ii) preserves local geometric consistency, and (iii) improves depth estimation in highly exposure imbalanced areas.

L_{E B L} = λ_{1} L_{g l o b a l} + λ_{2} L_{s t r u c t} + λ_{3} L_{e x p}

(20)

Global depth error

L_{g l o b a l}

. Penalises the holistic discrepancy between the predicted depth map and ground-truth:

L_{g l o b a l} = L_{R M S E} (\hat{D}, D_{g t}) = \sqrt{\frac{1}{M} \sum_{p} {(\hat{D} (p) - D_{g t} (p))}^{2}}

(21)

Local structural consistency

L_{s t r u c t}

. Encourages preservation of local structural cues such as edge sharpness and surface smoothness:

L_{s t r u c t} = L_{S S I M} (\hat{D}, D_{g t}) = \frac{1 - S S I M (\hat{D}, D_{g t})}{2}

(22)

Exposure-aware error

L_{e x p}

. It encourages the model to focus on regions that are difficult to predict owing to extreme over- or under-exposure.

Its contribution is modulated by the exposure-imbalance metric

Ψ (I (p))

and a threshold

τ

:

L_{e x p} = 1_{Ψ (I) > τ} \cdot \frac{1}{M_{e x p}} \sum_{p \in P_{e x p}} w (p) | \hat{D} (p) - D_{g t} (p) |

(23)

The global gating mechanism

1_{Ψ (I) > τ}

is defined as follows:

1_{Ψ (I) > τ} = 1

if the image-level imbalance

Ψ (I)

exceeds the threshold

τ

; Let

P_{e x p}

denote the set of over-/under-exposed pixels in the image, and let

{M_{e x p} = | P}_{e x p} |

represent the total number of such pixels.

Pixel weight

w (p)

is an optional weight. Setting

w (p) = Ψ (I (p))

highlights the most severely imbalanced pixels;

w (p) = 1

yields uniform weighting. Error metric

| \hat{D} (p) - D_{g t} (p) |

is the

L_{1}

depth error at pixel

p

.

Throughout our experiments, we use

λ_{1} : λ_{2} : λ_{3}

= 1:0.5:0.2.

4. Experiments

This section presents a comprehensive suite of experiments that validate the proposed EASD framework for monocular depth estimation, with particular focus on extreme-illumination scenarios. We first detail the datasets, evaluation metrics, and implementation specifics. Next, we benchmark EASD against state-of-the-art (SOTA) methods through both quantitative and qualitative comparisons. Finally, ablation studies quantify each key component’s contribution and evaluate the framework’s computational efficiency and potential for generalisation.

4.1. Datasets

To comprehensively evaluate the performance of the EASD framework across multiple scenes, wide illumination ranges and dense-depth-gradient conditions, we performed experiments on seven public benchmark datasets spanning indoor, outdoor and synthetic environments:

NYU-Depth-v2 [22] Indoor RGB-D pairs captured with a Microsoft Kinect sensor. Depth annotations are dense and accurate, covering residential, office and classroom settings. The official train/test split is adopted.
KITTI [23] Autonomous-driving imagery acquired by a vehicle-mounted stereo camera. Ground-truth depth, projected from LiDAR point clouds, is sparse and measurements are concentrated on roads and vehicle surfaces. Experiments follow the Eigen split.
ETH3D [24] Indoor–outdoor dataset captured with an industrial-grade camera array, delivering high-resolution images with sub-pixel dense depth; several sequences exhibit strong contrast or non-uniform lighting.
ScanNet [25] Large-scale indoor RGB-D benchmark with more than 1500 scans and accompanying dense depth and semantic labels. We use the official train/validation/test partitions.
DIODE [26] Indoor–outdoor dataset with laser-scanned dense depth that includes numerous extreme-exposure samples, making it ideal for assessing robustness under HDR conditions.

Synthetic datasets (for pre-training or augmentation):

Hypersim [27] Physically based, path-traced indoor dataset offering high-quality RGB, dense depth, surface normals and semantic labels; used to improve model generalisation.
Virtual-KITTI [28] Photo-realistic virtual replica of KITTI supporting diverse weather and lighting. Provides complete ground truth for depth, pose and semantics, offering controllable samples where LiDAR coverage is sparse.

4.2. Experimental Setup

EASD adopts Stable Diffusion v2 as backbone. Model development was conducted on several public datasets curated for canonical diffusion-model benchmarks and for extreme-illumination restoration tasks. Training ran on a workstation with 90 GB RAM, an NVIDIA RTX A3090 GPU (32 GB VRAM) and an Intel Xeon^® CPU.

To isolate the learned image prior, text-condition modulation was disabled. The diffusion trajectory comprised 1000 fixed time-steps. Optimisation used the decoupled-weight-decay Adam algorithm with an initial learning rate of 5 × 10⁻⁵, decayed by a factor of 0.2. The system batch size was 32.

In the rapid-inference setting, the network was trained for 3000 iterations (≈24 h). In the high-fidelity multi-sampling setting, training was prolonged to 10,000 iterations (≈96 h), as visualised in Table 1.

4.3. Evaluation Indicators

To quantify the depth-estimation performance of EASD and competing methods, we employed a suite of standard metrics that are widely used in monocular depth-estimation (MDE) studies. These metrics fall into two categories: error metrics (lower is better) and accuracy metrics (higher is better), as visualised in Figure 3.

Let

{\hat{d}}_{i}

denote the predicted absolute depth at pixeli.

d_{i}^{*}

the corresponding ground-truth depth, and

N

the total number of valid pixels.

Absolute relative error (AbsRel, dimensionless):

A b s R e l = \frac{1}{N} \sum_{i = 1}^{N} \frac{| {\hat{d}}_{i} - d_{i}^{*} |}{d_{i}^{*}}

(24)

Threshold accuracy:

δ 1 : t h r = 1.25, δ 2 : t h r = {1.25}^{2}, δ 3 : t h r = {1.25}^{3}

(25)

where threshold set is

t h r \in {1.25, {1.25}^{2}, {1.25}^{3}}

, referred to as

δ^{1}, δ^{2} a n d δ^{3},

respectively.

4.4. Quantitative Comparisons

The qualitative results in Figure 4 further validate the superiority of EASD in addressing extreme exposure scenarios across diverse datasets. On NYUv2, EASD accurately reconstructs underexposed corners—such as depth details of the right columnar object—and preserves sharp edges surrounding overexposed window regions, whereas MiDaS obscures these boundaries due to saturation effects. Within Hypersim, EASD uniquely excludes distractors (e.g., mirrors) while maintaining a precise room layout. In the KITTI dataset, EASD resolves depth discontinuities on sunlit vehicle surfaces and retains road textures in shadowed areas, outperforming Marigold’s over-smoothed predictions. For Virtual-KITTI, EASD maintains a consistent global layout, including street topology. These observations align with the quantitative metrics in Table 2, confirming EASD’s robust adaptability to extreme lighting conditions.

Table 2 summarises the primary performance metrics of EASD and several representative state-of-the-art (SOTA) methods across the NYU-Depth-v2, KITTI, ETH3D and ScanNet benchmarks. Downward (↓, lower is better) and upward (↑, higher is better) arrows indicate the optimisation direction of each metric. The EASD (Ours) results are derived from the experiments reported in this study.

Across the NYU-Depth-v2 [22], KITTI [23], ETH3D [24] and ScanNet [25] datasets, EASD lowers the HDR absolute-relative error by an average of 20% relative to state-of-the-art (SOTA) methods. Presenting these numbers explicitly in Table 2 would provide clear evidence of EASD’s distinct advantage in coping with extreme-illumination scenes. Accordingly, we will perform a more fine-grained comparison on datasets with well-defined HDR scenarios to underscore EASD’s capability under such challenging conditions.

4.5. Ablation Studies

In the ablation study detailed in Table 2, NYUv2 and KITTI were chosen as the two zero-shot validation sets. Commencing with the baseline configuration, we systematically assessed the effects of various components: parameterisation types, Exposure-adaptive features, and Exposure-Balanced Loss. Initially, the model was trained solely on the Hypersim dataset to establish a baseline. Subsequently, a mixed-dataset strategy was implemented, integrating the Virtual KITTI dataset to boost the model’s generalisation capability across diverse domains. These ablation results validate the efficacy of our proposed method, demonstrating that each design component is critical for optimising the diffusion model in the context of dense prediction tasks.

Figure 4 compares depth maps generated with and without the EAF module, underscoring its pivotal role in exposure-aware feature modulation. In the absence of EAF, the model fails to discriminate between valid signals and noise in overexposed regions—such as saturated window areas and indistinct bathtubs in the input—resulting in blurred depth transitions. Additionally, underexposed regions (e.g., dark floor corners) suffer from texture loss due to inadequate feature weighting. When EAF is activated, the module dynamically amplifies gradients in underexposed regions to recover floor details and suppresses spurious activations in highlights, thereby preserving sharp edges around light sources. These observations align with the ablation results in Table 2: the incorporation of EAF reduces AbsRel by 4.5% on NYUv2 and 5.6% on KITTI, demonstrating its effectiveness in balancing feature sensitivity under extreme exposure conditions. The performance degradation observed in the baseline model—lacking explicit exposure adaptation—likely stems from its inherent sensitivity to exposure variations, as evidenced by the increased texture distortion in HDR scenarios [9,30].

To objectively evaluate the effect of the Exposure-adaptive feature (EAF) component on the results when enabled versus disabled, we compared the outputs under these two conditions, with the results presented in Figure 4.

As shown in Figure 5, the integration of the Exposure-Adaptive Feature Fusion (EAF) module significantly enhances edge restoration performance in depth maps. Further incorporation of the Exposure-Based Regularisation (EBL) module yields sharper fine details and improved spatial coherence. The final output exhibits smoother surfaces and more continuous depth transitions, particularly in regions with extreme exposure variations.

Ablation Study. To evaluate the contribution of each component in EASD, we conduct a stepwise ablation study on the KITTI dataset under extreme lighting conditions. As shown in Table 3, Figure 5, the introduction of the EAF module significantly improves depth accuracy in over- and under-exposed regions by adaptively reweighting multi-scale features based on global exposure statistics, reducing AbsRel by 10%. Further incorporating the EBL loss enhances local gradient coherence and boundary preservation, especially around obstacles. Finally, the full model with single-step diffusion achieves the best performance with real-time efficiency, demonstrating the synergy of our design choices.

4.6. Limitations and Future Work

Although the frozen VAE performs slightly inferior to the fine-tuned VAE in non-standard scenarios (e.g., tunnels), its advantage lies in eliminating the need for additional training—a critical benefit for resource-constrained deployment environments. Future work could explore a Selective Joint Fine-tuning strategy to optimise domain adaptation while preserving generalisable features through partial parameter updates. as visualised in Table 4.

To evaluate model performance in over-exposed and under-exposed regions, we define extreme exposure regions as pixels with luminance values exceeding or falling below predefined thresholds. The errors in these regions are quantified using Equation (23), which incorporates an exposure-balanced loss function that prioritises medium-exposure pixels through higher weighting coefficients. This design choice, while beneficial for overall accuracy, may compromise performance in extreme regions due to reduced gradient contributions from underrepresented intensity levels. as visualised in Table 5.

5. Conclusions

This paper presents EASD, a single-step diffusion framework tailored for exposure-aware monocular depth estimation (MDE) in high-dynamic-range (HDR) scenes, particularly addressing the critical demands of autonomous driving. By integrating three core innovations—single-step latent depth regression, Exposure-Aware Feature Fusion (EAF), and a Diversified Depth Predictor (DDP)—EASD overcomes the limitations of existing methods in extreme illumination conditions, achieving a superior balance between accuracy, computational efficiency, and data efficiency.

EASD leverages a frozen Stable Diffusion VAE encoder to harness pre-trained visual priors, enabling robust feature extraction even from over- or under-exposed inputs a critical requirement for autonomous vehicle perception in real-world lighting conditions. The single-step diffusion strategy eliminates iterative error accumulation, achieving real-time inference while maintaining high precision. The EAF module dynamically modulates multi-scale features using exposure statistics, suppressing noise in saturated regions and restoring details in dark areas, as validated by qualitative comparisons in Figure 3 and ablation studies in Figure 4. Complemented by the Exposure-Balanced Loss (EBL), which prioritises supervision in exposure-extreme regions, EASD outperforms state-of-the-art methods with only 60k labeled images—reducing absolute relative error (AbsRel) by an average of 20% across NYUv2, KITTI, ETH3D, and ScanNet (Table 3), with particularly strong gains in HDR scenarios.

The DDP further enhances adaptability, offering multi-sample uncertainty estimates for ambiguous scenes and fast deterministic predictions for real-time applications. These results demonstrate that EASD effectively embeds exposure awareness into diffusion-based MDE, paving the way for robust depth estimation in challenging lighting conditions such as autonomous driving, AR/VR, and robotic navigation.

Limitations include degraded performance in fully saturated regions (all-white/all-black) and reliance on VAE priors trained on general imagery. Experimental analysis reveals a 3–5% AbsRel increase in extreme exposure regions (Table 4) compared to standard-exposure areas, primarily due to data scarcity (<5% HDR samples) and loss function bias toward mid-exposure pixels. Future work will explore dynamic exposure prediction networks to refine EAF modulation and integrate multi-modal inputs (e.g., infrared) to extend robustness in extreme HDR environments. Real-world deployment will further require validation under dynamic lighting and moving objects.

Author Contributions

Conceptualisation, C.Z. and D.L.; methodology, C.Z.; software, C.Z.; validation, C.Z. and D.L.; formal analysis, C.Z.; investigation, D.L.; resources, C.Z.; data curation, C.Z. and D.L.; writing—original draft preparation, C.Z. and D.L.; writing—review and editing, C.Z. and D.L.; visualisation, C.Z. and D.L.; supervision, C.Z. and D.L.; project administration, C.Z. and D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is the result of a study on the “Convergence and Open Sharing System “Project, supported by the Ministry of Education and National Research Foundation of Korea.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

I would like to express my sincere gratitude to my advisor, Deokwoo Lee., for their continuous support, guidance, and valuable suggestions throughout my research. Their expertise and encouragement have been instrumental in the completion of this work. Additionally, I would like to acknowledge Keimyung University for providing the necessary resources and funding for this research. Lastly, I extend my deepest appreciation to my family for their unwavering support and encouragement throughout my academic journey.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS); IEEE: New York, NY, USA; NeurIPS: La Jolla, CA, USA, 2014. [Google Scholar]
Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]
Zhang, C.; Lee, D. Advancing Nighttime Object Detection Through Image Enhancement and Domain Adaptation. Appl. Sci. 2024, 14, 8109. [Google Scholar] [CrossRef]
Ke, B.; Qu, K.; Wang, T.; Metzger, N.; Huang, S.; Li, B.; Schindler, K. Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis. arXiv 2025. [Google Scholar] [CrossRef] [PubMed]
Yin, W.; Zhang, J.; Wang, O.; Simon, S. Learning to Recover Scale-Consistent Depth from Single Monocular Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
Zhang, C.; Yin, W.; Wang, Z.; Yu, G.; Fu, B.; Shen, C. Hierarchical Normalization for Robust Monocular Depth Estimation. arXiv 2022, arXiv:2210.09670. [Google Scholar] [CrossRef]
Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021. [Google Scholar]
Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16 June 2024; IEEE: New York, NY, USA, 2024; pp. 10371–10381. [Google Scholar]
Gasperini, S.; Morbitzer, N.; Jung, H.; Navab, N.; Tombari, F. Robust Monocular Depth Estimation Under Challenging Conditions. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; IEEE: New York, NY, USA, 2023; pp. 8143–8152. [Google Scholar]
Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems. IEEE Trans. Robot. 2017, 33, 249–265. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS); IEEE: New York, NY, USA; NeurIPS: La Jolla, CA, USA, 2020. [Google Scholar]
Fu, X.; Yin, W.; Hu, M.; Wang, K.; Ma, Y.; Tan, P.; Shen, S.; Lin, D.; Long, X. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image. arXiv 2024, arXiv:2403.12013. [Google Scholar] [CrossRef]
Duan, Y.; Guo, X.; Zhu, Z. DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation. arXiv 2023, arXiv:2303.05021. [Google Scholar] [CrossRef]
Ye, C.; Qiu, L.; Gu, X.; Zuo, Q.; Wu, Y.; Dong, Z.; Bo, L.; Xiu, Y.; Han, X. StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal. arXiv 2024, arXiv:2406.16864. [Google Scholar] [CrossRef]
Gao, J.; Zhang, S.; Lu, W. Federated Learning-Based Detection and Control Mechanism of In-Car Navigation Safety System. J. Multimed. Inf. Syst. 2024, 11, 57–66. [Google Scholar] [CrossRef]
Shi, J.; Jiang, L.; Dai, D.; Van Gool, L. General-Purpose Inpainting and Denoising with Lotus: A Single-Step Diffusion Model. arXiv 2023, arXiv:2305.12999. [Google Scholar]
Xu, G.; Ge, Y.; Liu, M.; Fan, C.; Xie, K.; Zhao, Z.; Chen, H.; Shen, C. What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? arXiv 2024, arXiv:2403.06090. [Google Scholar]
Song, Z.; Wang, Z.; Li, B.; Zhang, H.; Zhu, R.; Liu, L.; Jiang, P.T.; Zhang, T. DepthMaster: Taming Diffusion Models for Monocular Depth Estimation. arXiv 2025, arXiv:2501.02576. [Google Scholar] [CrossRef]
Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. arXiv 2020, arXiv:2003.06620. [Google Scholar] [CrossRef]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Volume 3, pp. 154–196. [Google Scholar]
Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Schops, T.; Schonberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Vasiljevic, I.; Kolkin, N.; Luo, S.; Shakhnarovich, G. DIODE: A Dense Indoor/Outdoor DEpth Dataset. arXiv 2019, arXiv:1908.00463. [Google Scholar]
Roberts, R.; Ramakrishnan, K.; Park, D.; Varma, G.; Vineet, V.; Kowdle, A.; Bautista, M.A.; Paczan, N.; Webb, R.; Shrivastava, A. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
Gaidon, A.; Wang, Q.; Cabon, Y.; Vig, E. Virtual KITTI: Learning visual realistic new data from old. arXiv 2016, arXiv:1605.06465. [Google Scholar]
Yin, W.; Wang, X.; Shen, C.; Liu, Y.; Tian, Z.; Xu, S.; Sun, C.; Renyin, D. DiverseDepth: Affine-Invariant Depth Prediction Using Diverse Data. arXiv 2020, arXiv:2002.00569. [Google Scholar]
Kim, H.; Lee, D.; Kwon, S. Image Reconstruction Method by Spatial Feature Prediction Using CNN and Attention. J. Multimed. Inf. Syst. 2024, 11, 1–8. [Google Scholar] [CrossRef]

Figure 1. Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles (EASD) pipeline. An input RGB image is first encoded by a frozen VAE encoder to obtain its latent-space representation, while exposure analysis is carried out in parallel. The single-step diffusion depth regressor, augmented by the exposure-adaptive feature-fusion module, predicts a latent depth vector, which is subsequently decoded by the VAE decoder to yield the final depth map.

Figure 2. Architectural Overview of the EAF Module.

Figure 3. Qualitative comparison of monocular depth estimation methods across diverse datasets. EASD exhibits superior performance in capturing small objects within high-dynamic environments and preserving the overall scene layout.

Figure 4. Comparison of depth maps w/and w/o EAF. In high-dynamic environments, dense prediction tasks may compromise the model’s capacity to generate high-quality images, resulting in blurring in detail-rich regions. To strengthen the preservation of such fine-grained details, EAF is introduced to enhance the model’s capability of generating high-quality images.

Figure 5. Ablation Study Visualisation: Impact of EAF and EBL Modules on Depth Map Quality.

Table 1. Environment and parameterisation of the experiment.

Parameters	Configuration
CPU	Intel Xeon Platinum 8352 V
GPU	RTX 3090 (32 GB)
System	Ubuntu 22.04
Deep learning architecture	PyTorch1.11.0 + Python 3.10 + Cuda12.3
Training Epochs (Multi-Sample)	10,000
Training Epochs (Single-Shot)	3000
Batch size	32
Base	5 × 10⁻⁵

Table 2. Performance comparison between EASD and state-of-the-art methods on standard benchmarks. Avg. Rank is the arithmetic mean of ranks across all metrics on NYUv2, KITTI, ETH3D, and ScanNet. Lower values indicate better overall performance. ↓ indicates lower-is-better metrics, ↑ indicates higher-is-better metrics.

Method	Training Data	NYUv2 [21]		KITTI [22]		ETH3D [23]		ScanNet [24]		Avg. Rank
Method	Training Data	$AbsRel ↓$	$δ$ $1 ↑$	$AbsRel ↓$	$δ$ $1 ↑$	$AbsRel ↓$	$δ$ $1 ↑$	$AbsRel ↓$	$δ$ $1 ↑$	Avg. Rank
DiverseDepth [29]	320 K	18.7	83.1	27.4	59.6	33.7	62.2	16.0	79.8	5.5
Marigold [4]	74 K	11.2	86.9	17.0	80.6	12.2	86.5	12.1	85.9	2.4
MiDaS [2]	2 M	17.8	84.2	34.1	53.3	27.2	67.5	17.8	76.5	5.5
LeRes [5]	354 K	14.4	87.2	21.5	66.5	25.3	69.7	13.4	83.0	3.9
HDN [6]	300 K	11	90.1	16.6	73.6	17.9	74.6	11.8	84.9	2.5
EASD (Ours)	60 K	8.8	91.6	14.3	77.6	9.6	86.0	9.4	86.1	1.2

Table 3. Ablation study on the incremental design of adaptive protocols for adapting pre-trained diffusion models to dense prediction tasks. This table reports the experimental results in monocular depth estimation. ↓ indicates lower-is-better metrics, ↑ indicates higher-is-better metrics.

Method	Training Data	NYUv2		KITTI
Method	Training Data	$AbsRel ↓$	$δ$ $1 ↑$	$AbsRel ↓$	$δ$ $1 ↑$
Baseline model	40 K	19.1	83.0	29.8	59.6
+x₀-prediction	40 K	13.8	87.8	25.2	63.5
+x₀-prediction + EAF	40 K	9.3	91.2	19.6	70.1
+x₀-prediction + EAF + EBL	40 K	9.2	91.2	19.5	70.8
Multi-Sample EASD	60 K	9.0	91.4	16.8	74.3
Single-Shot EASD	60 K	8.8	91.6	14.3	77.6

Table 4. Frozen VAE vs. Fine-tuned VAE: Initial Comparative Results on KITTI. ↓ indicates lower-is-better metrics, ↑.

Scenario Types	$Frozen VAE AbsRel ↓$	$Fine - Tuned VAE AbsRel ↓$	Promote
Tunnel Scenarios	12.3%	9.8%	+2.5%
Night Vision Scenarios	14.7%	11.2%	+3.5%
Extreme Low-Light Conditions	17.5%	12.4%	+5.1%

Table 5. Error Discrepancy Between Standard-Exposure and Extreme-Exposure Regions in EASD. ↓ indicates lower-is-better metrics.

Dataset	Overall AbsRel	$Standard - Exposure AbsRel ↓$	$Extreme - Exposure AbsRel ↓$	Overall RMSE
NYUv2	8.8%	7.5%	13.3%	0.45 m
KITTI	14.3%	13.2%	17.5%	0.38 m

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, C.; Lee, D. EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles. Appl. Sci. 2025, 15, 9130. https://doi.org/10.3390/app15169130

AMA Style

Zhang C, Lee D. EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles. Applied Sciences. 2025; 15(16):9130. https://doi.org/10.3390/app15169130

Chicago/Turabian Style

Zhang, Chenyuan, and Deokwoo Lee. 2025. "EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles" Applied Sciences 15, no. 16: 9130. https://doi.org/10.3390/app15169130

APA Style

Zhang, C., & Lee, D. (2025). EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles. Applied Sciences, 15(16), 9130. https://doi.org/10.3390/app15169130

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles

Abstract

1. Introduction

2. Related Work

2.1. Monocular Depth Estimation

2.2. Depth Estimation Under High-Dynamic-Range (HDR) or Extreme-Exposure Conditions

2.3. Applications of Diffusion Models in Vision Tasks

2.4. Single-Step Diffusion

3. Method

3.1. Framework Overview

3.2. Latent-Space Encoding and Input Pre-Processing

3.2.1. RGB-Image Encoding

3.2.2. Depth-Map Encoding

3.3. Single-Step Diffusion Strategy for Depth Reconstruction

3.3.1. x 0 − P r e d i c t i o n and Single-Step Denoising

3.3.2. Latent-Space Loss Functions

3.4. Exposure-Statistics Computation

3.5. Exposure-Adaptive Feature Fusion (EAF)

3.5.1. Channel-Attention Branch

3.5.2. Spatial Self-Attention Branch

3.6. Diversified Depth Predictor (DDP)

3.6.1. Mode Selection

3.6.2. Multi-Sample Mode

3.6.3. Single-Shot Inference Mode

3.7. Exposure-Balanced Loss (EBL)

4. Experiments

4.1. Datasets

4.2. Experimental Setup

4.3. Evaluation Indicators

4.4. Quantitative Comparisons

4.5. Ablation Studies

4.6. Limitations and Future Work

5. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.3.1. $x_{0} - P r e d i c t i o n$ and Single-Step Denoising