Learning Depth from Focus with Multi-Candidate Estimation and Proximal Refinement

Mahmood, Muhammad Tariq

doi:10.3390/electronics15122548

Future Convergence Engineering, School of Computer Science and Engineering, Korea University of Technology and Education, Cheonan 31253, Republic of Korea

Electronics 2026, 15(12), 2548; https://doi.org/10.3390/electronics15122548 (registering DOI)

Submission received: 12 May 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 9 June 2026

(This article belongs to the Special Issue Image/Video Processing and Computer Vision)

Abstract

In this paper, we propose a novel Depth from Focus (DFF) framework that formulates depth estimation as an energy minimization problem and unrolls the corresponding iterative optimization into a trainable neural architecture. Given a focal stack, a deep feature extractor constructs a learned focus volume that encodes defocus and structural cues. Based on this representation, multiple candidate depth maps are generated using a plane-based probabilistic formulation, while an attention mechanism adaptively assigns pixel-wise confidence weights to each candidate. The depth estimation is performed through an iterative refinement process, where each stage corresponds to a learned proximal update implemented via lightweight conditional networks. These updates incorporate focus consistency, adaptive step sizes, and learned regularization priors, enabling effective integration of physical imaging constraints with data-driven modeling. A final refinement module further enhances prediction accuracy by fusing the refined depth, focus volume features, and candidate hypotheses to estimate residual corrections. The entire framework is trained end-to-end, ensuring coherent optimization across all components. Experimental results demonstrate that the proposed method achieves improved robustness and accuracy, particularly in low-texture and noisy regions, while preserving interpretability through its unfolding-based design.

Keywords:

depth from focus; focus measure; focus volume; 3D shape; proximal refinement

1. Introduction

Depth estimation is one of the fundamental tasks in computer vision and is widely used in applications including autonomous navigation [1], robotic perception [2], 3D scene reconstruction, and augmented reality systems [3]. Existing approaches for depth estimation are generally divided into active and passive sensing techniques. Active methods employ dedicated hardware sensors, such as LiDAR (Light Detection and Ranging) [4] and time-of-flight cameras [5], to directly capture depth information from the environment. In contrast, passive approaches estimate depth by analyzing visual cues inherently present in images, including motion [6], stereo correspondence [7], and focus variations [8]. Among passive approaches, Depth from Focus (DFF) has gained significant attention because it can recover scene geometry using only focus information obtained from a single camera setup [9]. In DFF, multiple images are captured under different focal settings to form a focal stack, and depth is inferred from the relationship between image sharpness and object distance. Since the method does not require additional sensing hardware or multi-view image acquisition, it offers an efficient and practical solution for monocular depth estimation in real-world environments. DFF methods can be broadly categorized into two groups: (1) traditional methods and (2) deep learning-based methods.

Traditional DFF methods construct the focus volume using handcrafted focus-measure operators applied directly to the input focal stack. However, the limited representational capability of manually designed focus features often leads to issues such as edge bleeding, loss of fine structural details, and inaccurate depth estimation in complex scenes [10]. On the other hand, learning-based DFF methods have limitations in the depth extraction stage from the deep focus volume. Most deep DFF frameworks consist of two main components: (i) constructing a deep focus volume from the focal stack, and (ii) estimating depth from this representation. While significant research efforts have focused on improving focus volume generation through advanced network architectures, comparatively little attention has been given to the depth extraction process itself. In most existing approaches, depth is recovered using a simple one-step reduction operation, such as direct regression, $1 \times 1$ convolutions, or soft-$\arg\max$-based aggregation. For example, Hazirbas et al. [11] employed a single $1 \times 1$ convolution to directly regress the depth map from the focus volume. Other methods approximate the traditional $\arg\max$ operator using differentiable alternatives such as soft-$\arg\max$ [12], where the focus volume is first converted into a probability distribution through softmax or softplus activations [13,14], followed by weighted averaging over focus distances to obtain the final depth estimate. Although computationally efficient, these approaches collapse the three-dimensional focus volume into a two-dimensional depth representation in a single step, often neglecting important spatial dependencies and structural relationships. As a result, the estimated depth maps may suffer from blurred object boundaries, loss of fine details, and reduced structural consistency, particularly in challenging regions with weak textures or depth discontinuities.

In this work, we introduce a novel DFF framework that formulates depth estimation as an energy minimization problem and transforms the associated iterative optimization procedure into a trainable neural network through optimization unrolling. Among various optimization frameworks including Half Quadratic Splitting (HQS) [15], the Alternating Direction Method of Multipliers (ADMM) [16], and Approximate Message Passing (AMP) [17], the Proximal Gradient Descent (PGD) algorithm serves as the theoretical foundation for most Deep Unfolding Networks (DUNs) [18]. It is an iterative process that operates through two steps: a gradient descent step and a proximal step.
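For reference, the two PGD steps can be written in generic form. The block below is a sketch in generic notation: the smooth term $f$, the regularizer $R$, the step size $\eta$, the weight $\lambda$, and the iterate $x^{(k)}$ are illustrative symbols and are not tied to any specific unfolding network.

```latex
\begin{aligned}
\tilde{x}^{(k)} &= x^{(k-1)} - \eta \, \nabla f\bigl(x^{(k-1)}\bigr)
  && \text{(gradient descent step)} \\
x^{(k)} &= \operatorname{prox}_{\eta \lambda R}\bigl(\tilde{x}^{(k)}\bigr)
  = \arg\min_{u} \; \tfrac{1}{2} \bigl\lVert u - \tilde{x}^{(k)} \bigr\rVert_2^2 + \eta \lambda R(u)
  && \text{(proximal step)}
\end{aligned}
```

In an unfolded network, each iteration becomes one network stage, and the proximal operator is typically replaced by a learned module.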

Given an input focal stack, the proposed framework first extracts deep focus representations that encode both defocus characteristics and structural scene information. Using these representations, multiple candidate depth maps are generated through a plane-based probabilistic modeling strategy, while an attention mechanism adaptively estimates pixel-wise confidence weights for each candidate hypothesis. Depth reconstruction is then progressively refined through a sequence of learned proximal updates, where each iteration is implemented using lightweight conditional networks. These refinement stages incorporate focus consistency constraints, adaptive step-size estimation, and learned regularization priors, enabling effective integration of physical imaging properties with data-driven learning. In addition, a dedicated refinement module further improves prediction quality by jointly exploiting refined depth estimates, focus volume features, and candidate depth hypotheses to predict residual corrections. The entire framework is optimized in an end-to-end manner, ensuring consistent learning across all stages of the model. Extensive experiments demonstrate that the proposed method achieves superior robustness and accuracy compared to existing approaches, particularly in challenging low-texture and noisy regions, while maintaining interpretability through its unfolding-based optimization design.

The remainder of this paper is organized as follows. Section 2 reviews existing studies on DFF and related depth estimation approaches to establish the context of the proposed work. Section 3 describes the proposed framework and explains its major components in detail. Section 4 presents the experimental setup along with quantitative and qualitative evaluations of the proposed method. Finally, Section 5 concludes the paper and discusses potential future research directions.

2. Related Work

2.1. Traditional Approaches

Traditional DFF methods generally follow a multi-stage pipeline, as illustrated in Figure 1. First, a sequence of differently focused images, referred to as a focal stack, is captured. This can be achieved either by moving the object or camera, or by adjusting the camera focus while keeping the scene fixed [9,19]. In the second stage, the focus quality of each pixel across the focal stack is estimated using a focus measure (FM) operator [20]. Applying the FM to all images produces a three-dimensional focus volume (FV), where each voxel represents the sharpness response at a particular pixel and focus level. The quality of the FV strongly depends on image characteristics such as texture, contrast, saturation, noise, and window size [21]. These factors often introduce inconsistencies and artifacts into the focus responses, leading to unreliable depth estimation, especially in textureless or noisy regions.

To improve the quality of the initial FV, several optimization and filtering strategies have been proposed. Subbarao and Choi [22] introduced a planar-focused image surface fitting approach for refining depth estimates. Linear filtering methods [9] smooth focus responses within local neighborhoods, while nonlinear techniques such as anisotropic diffusion [23] preserve structural details more effectively. More recently, guidance- and regularization-based approaches [10,24,25] have incorporated structural priors and auxiliary information to improve robustness against noise, depth discontinuities, and low-texture regions.

After obtaining an enhanced FV, the initial depth map is estimated by selecting, for each pixel, the focus level corresponding to the maximum focus response. However, the resulting depth maps often suffer from edge bleeding, structural inaccuracies, and loss of fine details. Consequently, an additional refinement or post-processing stage is commonly employed to improve depth consistency and reconstruction quality. Moeller et al. [26] proposed a variational depth-from-focus framework based on isotropic total variation regularization. Li et al. [27] introduced an adaptive weighted guided image filtering approach for detail enhancement. Danismaz et al. [28] improved robustness to outliers using a linearized least-squares Laplace regression formulation. More recently, Ashfaq et al. [29,30] proposed a dual-stage focus measure framework that combines vector-to-scalar conversion with directional focus analysis to obtain more reliable depth maps.
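The classical pipeline described above can be illustrated with a minimal sketch. It assumes a modified-Laplacian focus measure, which is one common handcrafted choice and not necessarily the operator used in the cited works; the function names and the toy focal stack are hypothetical.

```python
def modified_laplacian(img, x, y):
    """Modified-Laplacian focus response at an interior pixel (x, y)."""
    horiz = abs(2 * img[y][x] - img[y][x - 1] - img[y][x + 1])
    vert = abs(2 * img[y][x] - img[y - 1][x] - img[y + 1][x])
    return horiz + vert


def depth_from_focus(stack):
    """Return, per interior pixel, the index of the slice with maximum focus."""
    height, width = len(stack[0]), len(stack[0][0])
    depth = [[0] * (width - 2) for _ in range(height - 2)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # One column of the focus volume: the response at (x, y) per slice.
            responses = [modified_laplacian(img, x, y) for img in stack]
            depth[y - 1][x - 1] = max(range(len(stack)), key=responses.__getitem__)
    return depth


# Toy focal stack (assumed data): only slice 1 shows local contrast.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
sharp = [[5, 5, 5], [5, 9, 5], [5, 5, 5]]
toy_depth = depth_from_focus([flat, sharp, flat])
```

The hard per-pixel maximum used here is the step that tends to produce the noisy, edge-bleeding depth maps that the refinement stages above aim to correct.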

2.2. Deep Learning-Based Approaches

Deep learning-based depth-from-focus (DFF) methods generally estimate depth from a focal stack through two primary stages: (i) generation of a deep focus volume, and (ii) recovery of the final depth map from this representation, as illustrated in Figure 1. In the first stage, encoder–decoder (ED) networks are employed to extract deep focus features from the input focal stack. Both 2D and 3D ED architectures have been explored in previous studies [11,13,14].

In 2D encoder–decoder (2DED) approaches, each focal image is processed independently using 2D convolutional operations, and the extracted feature maps are subsequently aggregated to construct a deep focus volume. For example, Hazirbas et al. [11] utilized a VGG-16-based encoder with a symmetric decoder to learn focus-aware representations. DefocusNet [31] first synthesizes a focal stack from a single RGB image through a 2DED architecture and then estimates depth using a second 2DED network. Yang et al. [14] proposed a multi-scale focus volume extraction framework based on 2D CNNs and further introduced differential focus volumes computed along the focal dimension, later improving the method with PSF-based aberration correction [32]. Fujimura et al. [33] presented a camera-agnostic framework that directly estimates a cost volume using a 2D CNN, while Jiang et al. [34] constructed multi-scale focus volumes from event-based focal stacks using attention-enhanced 2DED architectures.

In contrast, 3D encoder–decoder (3DED) approaches apply 3D convolutions directly to the entire focal stack, enabling simultaneous modeling of spatial and focal correlations within a volumetric representation. Wang et al. [13] proposed Inception3D, a 3D encoder–decoder framework designed to generate deep focus volumes directly from focal stacks. Won et al. [35] combined 2D and 3D convolutional operations to construct multi-scale focus representations and further improved performance through a Sharpness Region Detection (SRD) module. Xie et al. [36] employed a 3D CNN to fuse focal stack information into an all-in-focus image, from which depth estimation was subsequently performed. More recently, Kang et al. [37] introduced a transformer-based framework integrated with LSTM and CNN decoder modules, enabling the model to generalize across varying focal stack lengths while benefiting from large-scale monocular depth pretraining.

The second stage focuses on extracting depth information from the generated deep focus volume. Simple $1 \times 1$ convolution operations, such as those used in [11], often provide limited effectiveness for accurate depth recovery. Since the conventional $\arg\max$ operation is non-differentiable and therefore unsuitable for end-to-end neural network training, many approaches adopt a differentiable soft-$\arg\max$ strategy [12]. In this formulation, a softmax operation is applied along the focus dimension to produce a probability distribution, and the final depth value is obtained as the weighted average of focus distances. Wang et al. [13] employed soft-$\arg\max$ for simultaneous prediction of depth and all-in-focus images. Yang et al. [14] further incorporated uncertainty-aware confidence estimation during the soft-$\arg\max$ process on refined focus volumes. Won et al. [35] combined multi-scale depth predictions using soft-$\arg\max$, while Jiang et al. [34] utilized scale-wise $1 \times 1$ convolutions followed by fusion of intermediate depth estimates. More recently, Ganj et al. [38] improved depth refinement by integrating priors obtained from single-image depth estimation models. Another recent work combines a traditionally computed focus volume with a deep recurrent network to obtain sharper depth estimates [39]. Recent developments, including graph-based reasoning [40] and sparse representation techniques [41], offer promising directions that could be incorporated into DFF frameworks to enhance depth inference and scene understanding.

3. Proposed Framework

Given a focal stack $\{I_{z}(x, y)\}_{z=1}^{Z} = \{I_{1}(x, y), I_{2}(x, y), \dots, I_{Z}(x, y)\}$ consisting of $Z$ images, each of spatial resolution $H \times W$, our goal is to predict a dense depth map $\hat{D}(x, y) \in \mathbb{R}^{H \times W}$. We formulate depth estimation as an optimization problem and solve it using an unrolled deep neural network. The proposed framework comprises four main components: (1) Feature Volume Construction, (2) Depth Candidate Generation, (3) Proximal Refinement, and (4) Gated Proximal Network. An overview of the proposed framework is illustrated in Figure 2.

3.1. Feature Volume Construction

We first extract a feature representation $F$ by providing the input stack $I_{z}(x, y) \in \mathbb{R}^{Z \times H \times W}$ to a deep encoder–decoder network, represented as

$$\tilde{F}_{l} = \Phi(I_{z}(x, y)), \quad l \in \{1, 2, 3, 4\}, \tag{1}$$

$$\acute{F}_{l} = \acute{\Phi}(\tilde{F}_{l}), \tag{2}$$

$$F = \frac{1}{4} \sum_{l=1}^{4} \acute{F}_{l}, \tag{3}$$

where $\Phi(\cdot)$ denotes an encoder that takes the input focal stack $I$ and extracts multi-scale features $\tilde{F}_{l}$, $l \in \{1, 2, 3, 4\}$. The encoder can be any backbone feature extractor; in our implementation, it is realized using a ResNet-18 [42]. A decoder $\acute{\Phi}(\cdot)$ is then applied to $\tilde{F}_{l}$ to produce refined multi-scale focus features $\acute{F}_{l}$ by incorporating various operations at different scales. Finally, the deep focus volume $F$ is obtained by averaging the decoded multi-scale responses.

3.2. Depth Candidate Generation

We generate $N = 3$ depth candidates $D = \{D_{1}, D_{2}, D_{3}\}$ by estimating a distribution over $C$ depth planes (channels) from the focus features $F \in \mathbb{R}^{B \times C \times H \times W}$ and adding learned residuals. The number of depth planes equals the number of channels $C$; the planes are uniformly sampled and indexed by $\{1, \dots, C\}$. A probability distribution over the planes is obtained by applying a softmax along the channel dimension, and a soft-argmax depth estimate is computed as its expectation:

$$P_{c}(x, y) = \frac{\exp(F_{c}(x, y))}{\sum_{c'=1}^{C} \exp(F_{c'}(x, y))}, \quad c \in \{1, 2, \dots, C\}, \tag{4}$$

$$\bar{D}(x, y) = \sum_{c=1}^{C} c \cdot P_{c}(x, y), \tag{5}$$

where $P_{c}(x, y)$ denotes the probability of depth plane $c$ at pixel location $(x, y)$. The $i$-th depth candidate, denoted by $D_{i}(x, y)$, is obtained by adding a residual correction term $\Delta \bar{D}_{i}(x, y)$ to the soft-argmax depth estimate $\bar{D}(x, y)$:

$$D_{i}(x, y) = \bar{D}(x, y) + \Delta \bar{D}_{i}(x, y). \tag{6}$$

The residual correction $\Delta \bar{D}_{i}(x, y)$ is predicted by a candidate-specific residual head network $\psi_{i}(\cdot)$, consisting of a small stack of convolutional layers. Each residual head takes the focus-volume feature $F_{c}(x, y)$ as input and learns a refinement term that improves the accuracy of the depth hypothesis:

$$\Delta \bar{D}_{i}(x, y) = \psi_{i}(F_{c}(x, y)). \tag{7}$$
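The candidate generation in Eqs. (4)–(7) can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions: the residual heads $\psi_{i}$ are replaced by hypothetical constant functions, and all function names are introduced here for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats (Eq. 4)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax_depth(focus_scores):
    """Expected plane index under the softmax distribution, planes 1..C (Eq. 5)."""
    probs = softmax(focus_scores)
    return sum(c * p for c, p in enumerate(probs, start=1))


def depth_candidates(focus_scores, residual_heads):
    """Soft-argmax depth plus one residual per candidate head (Eqs. 6-7)."""
    base = soft_argmax_depth(focus_scores)
    return [base + head(focus_scores) for head in residual_heads]


# Hypothetical stand-ins for the learned residual heads psi_i (constants here).
heads = [lambda f: -0.1, lambda f: 0.0, lambda f: 0.1]

scores = [0.0, 4.0, 0.0]  # focus response at one pixel, peaking at plane 2 of C = 3
base_depth = soft_argmax_depth(scores)
candidates = depth_candidates(scores, heads)
```

Because the toy scores are symmetric around plane 2, the soft-argmax estimate is 2, and the three candidates differ only by their residuals.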

3.3. Proximal Refinement

The problem of estimating the depth $\hat{D}$ from multiple depth candidates $D_{i}$, $i \in \{1, 2, \dots, N\}$, is formulated as the minimization of the following objective, where the pixel coordinates $(x, y)$ are omitted for notational simplicity:

$$\min_{\hat{D}} \sum_{i=1}^{N} W_{i}^{2} (\hat{D} - D_{i})^{2} + \lambda R(\hat{D}), \tag{8}$$

where the first term $\sum_{i=1}^{N} W_{i}^{2} (\hat{D} - D_{i})^{2}$ is the data fidelity term, which is smooth; $\lambda$ is a constant regularization parameter; and the second term $R(\hat{D})$ is a regularization term, which is typically non-smooth. To solve the objective in (8), we adopt the Proximal Gradient Descent (PGD) algorithm due to its simplicity and effectiveness. PGD is an iterative optimization procedure consisting of two steps: a gradient descent step and a proximal step. The gradient descent step updates the current estimate along the negative gradient of the data fidelity term:

$$\tilde{D}^{(k)} = D^{(k)} - \eta_{k} (S D^{(k)} - b), \tag{9}$$

$$S = \sum_{i=1}^{N} W_{i}^{2}, \tag{10}$$

$$b = \sum_{i=1}^{N} W_{i}^{2} D_{i}, \tag{11}$$

where $\tilde{D}^{(k)}$ denotes the intermediate estimate produced by the gradient descent step, $S D^{(k)} - b$ is the gradient of the data fidelity term up to a constant factor of 2 that is absorbed into the step size, and $\eta_{k}$ is the step size at iteration $k$. The proximal step at iteration $k$, $k \in \{1, \dots, K\}$, is defined as:

$$D^{(k)} = P_{k}(\tilde{D}^{(k)}, F, \eta_{k}, \lambda, D^{(k-1)}), \tag{12}$$

where $P_{k}(\cdot)$ represents the proximal operator, implemented using a U-Net-like network conditioned on intermediate variables.

The initial depth estimate $D^{(0)}$ is obtained as the weighted average of all depth candidates:

$$D^{(0)} = \frac{\sum_{i=1}^{N} W_{i}^{2} \cdot D_{i}}{\sum_{i=1}^{N} W_{i}^{2}}, \tag{13}$$

$$W_{i}^{2} = \frac{\exp(\pi_{i}(F))}{\sum_{j=1}^{N} \exp(\pi_{j}(F))}, \tag{14}$$

where the weight matrix $W$ is computed using an attention network $\pi(\cdot)$. To further enhance the final depth prediction, we apply a lightweight residual refinement module progressively to the depth obtained at the final iteration, $D^{(K)}$. Specifically, the estimated depth $\hat{D}^{(0)} = D^{(K)}$ is concatenated with the feature volume $F$ and passed through a sequence of convolutional blocks to predict a residual correction. The refined depth map is then obtained as:

$$\Delta D = F_{\mathrm{ref}}(\mathrm{cat}(\hat{D}^{(t)}, F)), \quad t \in \{0, \dots, T - 1\}, \tag{15}$$

$$\hat{D}^{(t+1)} = \hat{D}^{(t)} + \Delta D, \tag{16}$$

where $F_{\mathrm{ref}}(\cdot)$ denotes the refinement network and $T$ is the number of refinement steps. This residual formulation allows the model to focus on correcting fine details and structural inconsistencies, leading to improved depth accuracy.
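The fusion and gradient steps in Eqs. (9)–(14) can be sketched per pixel as follows. This is an illustration under stated assumptions rather than the trained model: the attention logits are fixed numbers, the learned proximal operator $P_{k}$ is replaced by a hypothetical placeholder that averages its two inputs, each gradient step is assumed to start from the most recent iterate, and all function names are introduced here.

```python
import math


def fusion_weights(logits):
    """Softmax over candidate logits; plays the role of W_i^2 in Eq. (14)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_depth(weights, candidates):
    """Weighted average of the candidates, Eq. (13)."""
    return sum(w * d for w, d in zip(weights, candidates)) / sum(weights)


def gradient_step(depth, weights, candidates, eta):
    """Eqs. (9)-(11): depth - eta * (S * depth - b)."""
    s = sum(weights)
    b = sum(w * d for w, d in zip(weights, candidates))
    return depth - eta * (s * depth - b)


def placeholder_prox(depth_tilde, previous_depth):
    """Hypothetical stand-in for the learned operator P_k: a plain average."""
    return 0.5 * (depth_tilde + previous_depth)


def unrolled_refinement(logits, candidates, eta=0.5, num_iters=3, start=None):
    """Run num_iters gradient + placeholder-proximal stages from D^(0) or start."""
    weights = fusion_weights(logits)
    depth = initial_depth(weights, candidates) if start is None else start
    for _ in range(num_iters):
        depth_tilde = gradient_step(depth, weights, candidates, eta)
        depth = placeholder_prox(depth_tilde, depth)
    return depth


candidates = [1.9, 2.0, 2.4]   # assumed candidate depths at one pixel
logits = [0.0, 1.0, 0.0]       # assumed attention logits pi_i(F)
weights = fusion_weights(logits)
d0 = initial_depth(weights, candidates)
d_from_init = unrolled_refinement(logits, candidates)
d_from_far = unrolled_refinement(logits, candidates, start=5.0, num_iters=50)
```

Since the softmax weights sum to one, $S = 1$ and $b = D^{(0)}$, so the gradient of the data term vanishes at $D^{(0)}$. With this placeholder, the iterates therefore stay at $D^{(0)}$, or converge to it from any other start; in the proposed framework, any departure from $D^{(0)}$ comes from the learned proximal operator and the regularization prior.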

3.4. Gated Proximal Network

In the proposed framework, the proximal operator $P_{k}(\cdot)$ is designed as a U-Net-style encoder–decoder with five layers and gated skip connections. This architecture selectively propagates spatial features via attention-based gating while producing an auxiliary depth estimate. The input tensor at iteration $k$ is constructed as:

$$z^{(k)} = \mathrm{cat}(D^{(k)}, F, \eta_{k}, \lambda), \tag{17}$$

where $\mathrm{cat}(\cdot)$ denotes channel-wise concatenation. The encoder extracts hierarchical feature representations at multiple levels of abstraction:

$$e_{1} = \phi_{1}(z^{(k)}), \tag{18}$$

$$e_{2} = \phi_{2}(e_{1}), \tag{19}$$

$$e_{3} = \phi_{3}(\mathrm{Pool}(e_{2})), \tag{20}$$

where $\phi_{1}(\cdot)$ and $\phi_{2}(\cdot)$ are convolutional blocks that capture low- and mid-level spatial features, respectively, and $\phi_{3}(\cdot)$ operates on downsampled features via a pooling operation to encode higher-level contextual information with a larger receptive field.

The decoder progressively restores the spatial resolution by upsampling the encoded features and refining them through skip connections:

$$u_{2} = \mathrm{upsample}(e_{3}), \tag{21}$$

$$u_{1} = \mathrm{upsample}(e_{2}), \tag{22}$$

where the upsampling operations recover spatial details from the coarse latent representation, enabling the network to reconstruct fine-grained structures in the depth map. Instead of directly concatenating encoder features, we employ attention gates to suppress irrelevant activations. Given an encoder feature $e$ and a decoder feature $u$, the gated feature is defined as:

$$\tilde{e} = \sigma\bigl(\mathrm{prj}(\mathrm{rel}(\mathrm{con}_{1 \times 1}(e) + \mathrm{con}_{1 \times 1}(u)))\bigr) \odot e, \tag{23}$$

where $\mathrm{con}_{1 \times 1}(\cdot)$ denotes $1 \times 1$ convolutions, $\mathrm{rel}(\cdot)$ is the ReLU activation, $\mathrm{prj}(\cdot)$ is a linear projection, $\sigma(\cdot)$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. The gated features are incorporated into the skip connections to effectively fuse encoder and decoder representations:

$$d_{2} = \phi_{d}(\mathrm{cat}(u_{2}, \tilde{e}_{2})), \tag{24}$$

$$d_{1} = \phi_{d}(\mathrm{cat}(u_{1}, \tilde{e}_{1})), \tag{25}$$

where $\phi_{d}(\cdot)$ represents convolutional refinement blocks that integrate the upsampled decoder features with the attention-filtered encoder features to recover spatial details while suppressing irrelevant information. The proximal operator $P_{k}(\cdot)$ produces the refined depth estimate $D^{(k)}$ by projecting the final decoder feature $d_{1}$ at iteration $k$.
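The attention gate of Eq. (23) can be sketched at a single pixel, where a $1 \times 1$ convolution reduces to a matrix-vector product over channels. The sketch assumes that the linear projection maps to one scalar gate per pixel, which the text does not specify; the weight matrices and function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """A 1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def attention_gate(e, u, w_e, w_u, w_prj):
    """Gate encoder feature e using decoder feature u at one pixel (Eq. 23)."""
    mixed = [a + b for a, b in zip(matvec(w_e, e), matvec(w_u, u))]
    hidden = [max(0.0, m) for m in mixed]               # ReLU
    logit = sum(w * h for w, h in zip(w_prj, hidden))   # projection to one scalar
    alpha = 1.0 / (1.0 + math.exp(-logit))              # sigmoid gate in (0, 1)
    return [alpha * value for value in e], alpha


identity = [[1.0, 0.0], [0.0, 1.0]]        # assumed 1x1 convolution weights
e_feat, u_feat = [1.0, -2.0], [0.5, 0.5]   # assumed two-channel features

# A zero projection gives a gate of exactly 0.5: the feature is halved.
half_open, alpha_half = attention_gate(e_feat, u_feat, identity, identity, [0.0, 0.0])
# A strongly negative projection drives the gate toward 0: the feature is suppressed.
closed, alpha_closed = attention_gate(e_feat, u_feat, identity, identity, [-50.0, -50.0])
```

The gate rescales the encoder feature without changing its direction, so the skip connection can pass or suppress a location but cannot inject new content.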

3.5. Loss Function

To train the proposed framework end-to-end, we model depth estimation as a regression task and optimize the network using a standard Mean Squared Error (MSE) loss between the final refined depth prediction $\hat{D}$ and the corresponding ground-truth depth map $D_{gt}$:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{H \times W} \sum_{x=1}^{W} \sum_{y=1}^{H} (\hat{D}(x, y) - D_{gt}(x, y))^{2}, \tag{26}$$

where $H \times W$ is the total number of pixels in the depth map, and $\hat{D}(x, y)$ and $D_{gt}(x, y)$ denote the predicted and ground-truth depth values at spatial location $(x, y)$, respectively.

4. Results and Discussion

4.1. Experimental Setup

The proposed model is evaluated on three benchmark datasets with diverse characteristics to assess both quantitative performance and generalization capability. The first dataset, FlyingThings3D (FT) [43], is a large-scale synthetic dataset containing 1000 training focal stacks and 100 testing focal stacks. Each stack consists of 15 focus-varying RGB images with a spatial resolution of $960 \times 540$. Since the dataset was originally designed for stereo matching and disparity estimation, the focal sequence is ordered from the farthest focal plane to the nearest one. The corresponding focus distances are uniformly distributed between 10 and 100 units. The second dataset, Middlebury (MB) [44], is a real-world benchmark primarily used to evaluate the generalization ability of the proposed method. It contains 14 focal stacks, each composed of 15 focus-varying images. For consistency in the experiments, all images are resized to $512 \times 512$. Similar to FT, the focal images are arranged from far focus to near focus because the dataset was also developed for stereo correspondence tasks. Moreover, the focus distances are evenly spaced within the range of 10 to 60 units. The third dataset, HCI [45], is a synthetic benchmark generated from 4D light fields. It includes 24 focal stacks, each containing 10 images. Among them, 20 stacks are used for training, while the remaining four stacks are reserved for testing. Due to its relatively limited size, the HCI dataset is mainly utilized for ablation studies and component-wise analysis of the proposed framework. We further evaluate the model on the real-world DDFF dataset [11], which is captured using a light-field camera and provides 10 images per focal stack. For DDFF, we use the official test set for evaluation.

The proposed model is implemented in PyTorch (version 2.8.0) [46]. We train the network using the AdamW optimizer [47], with the maximum learning rate set to $1 \times 10^{-4}$ and controlled by a one-cycle learning rate schedule. For all training experiments, we use a batch size of 2 focal stacks. During training, patches of size $256 \times 256$ are used for the FT and HCI datasets, while patches of size $224 \times 224$ are used for the DDFF dataset.

On the FT dataset, we compare our model with both classical and learning-based depth-from-focus approaches. The classical baseline is RFVR [10], a regularization-based method evaluated using the authors' official implementation. For learning-based methods, we compare with AiFDNet [13] and DWild [35] using their publicly released pretrained models. We also include two variants of DFV [14]: DFV-FV, which uses the standard focus volume, and DFV-Diff, which uses a differential focus representation. Both DFV variants are retrained using the official training code to ensure a fair comparison. For the DDFF dataset, we compare against AiFDNet [13], DFV-FV [14], and DFV-Diff [14] using their officially released checkpoints and official evaluation configurations.

To quantitatively assess the effectiveness of the proposed model, several widely adopted evaluation metrics are employed. These metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMS), and logarithmic Root Mean Squared Error (logRMS). In addition, Relative Absolute Error (AbsRel) and Relative Squared Error (SqRel) are used to evaluate relative prediction accuracy. We further report threshold-based accuracy measures, namely Acc_1, Acc_2, and Acc_3, following the definitions provided in [11]. To analyze the statistical consistency between the estimated and ground-truth outputs, the Pearson correlation coefficient (Corr) is also computed, which reflects the degree of linear correlation between the two variables.
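The evaluation metrics listed above can be sketched as follows. The formulas follow common conventions in the depth-estimation literature and are assumptions here: in particular, Acc_k is taken as the fraction of pixels with $\max(\hat{D}/D_{gt}, D_{gt}/\hat{D}) < 1.25^{k}$, and the exact definitions in [11] may differ in detail. The function names and sample values are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Error and accuracy metrics over flat lists of positive depth values."""
    n = len(pred)
    diffs = [p - g for p, g in zip(pred, gt)]
    mse = sum(d * d for d in diffs) / n
    log_sq = [(math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)]
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return {
        "MAE": sum(abs(d) for d in diffs) / n,
        "MSE": mse,
        "RMS": math.sqrt(mse),
        "logRMS": math.sqrt(sum(log_sq) / n),
        "AbsRel": sum(abs(d) / g for d, g in zip(diffs, gt)) / n,
        "SqRel": sum(d * d / g for d, g in zip(diffs, gt)) / n,
        # Assumed threshold convention: ratio below 1.25, 1.25^2, 1.25^3.
        "Acc": {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)},
    }


def pearson(pred, gt):
    """Pearson correlation coefficient between two flat lists."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gt)
    return cov / math.sqrt(var_p * var_g)


# Assumed toy values, not results from the paper.
pred = [10.0, 22.0, 30.0, 40.0]
gt = [10.0, 20.0, 20.0, 20.0]
scores = depth_metrics(pred, gt)
corr_linear = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Lower is better for the error metrics, while higher is better for the Acc_k and Corr scores.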

4.2. Ablation Study

The proposed framework progressively refines the estimated depth maps through iterative correction updates generated by proximal optimization steps. These updates incorporate focus consistency constraints, adaptive step-size estimation, and learned proximal updates, allowing the model to effectively combine physical imaging principles with data-driven representations. Qualitative refinement results obtained at different iteration stages for a focal stack from the FT test dataset are illustrated in Figure 3. It can be observed that the model trained with three iterations produces increasingly accurate depth predictions up to the third refinement stage, yielding sharper structural boundaries and reduced reconstruction artifacts. Beyond the fifth iteration, however, the performance improvement becomes marginal, and in certain cases a slight degradation can be observed. This behavior is likely caused by overfitting to intermediate correction updates, which may lead to the suppression of fine structural details and the introduction of subtle inconsistencies within smooth regions.

In addition, quantitative evaluations were conducted on the FT test dataset to analyze the effect of iterative refinement. The values of all evaluation metrics were computed across the test samples at different refinement stages, and the corresponding results are reported in Table 1. The results demonstrate a consistent improvement in performance during the first three iterations, as reflected by lower MAE, RMS, and AbsRel values, along with higher Acc_1, Acc_2, and Acc_3 scores. These findings indicate that the proposed proximal refinement strategy effectively enhances the predicted depth maps and progressively improves the overall reconstruction quality.

To gain further insight into the behavior of the proposed fusion mechanism, Figure 4 visualizes the attention weighting maps and the corresponding average $W^{2}$ maps associated with the depth candidates. The first three columns show the weighting maps obtained from focal stacks 1, 16, and 54 for their three depth candidates, while the last column presents the average $W^{2}$ map computed across these maps. It can be observed that the weighting maps exhibit consistent spatial patterns across different focal stacks, indicating that the attention module learns stable and meaningful confidence estimates for the candidate depths. The averaged $W^{2}$ map further reveals the overall contribution of individual depth candidates during the fusion process, demonstrating that the final depth prediction is obtained through a balanced combination of multiple hypotheses rather than relying on a single candidate.

The proposed model is further evaluated on the real-world MB dataset and the synthetic HCI dataset to assess its generalization capability across diverse data domains. Qualitative results for focal stacks 0, 1, and 3 from the MB dataset are presented in Figure 5, while representative results from the HCI dataset are shown in Figure 6. Since both datasets provide pseudo ground-truth (GT) depth maps, the GT depth maps are included for visual comparison. In addition, quantitative results are reported in Table 2. For each dataset, the evaluation metrics are first computed independently for every focal stack in the test set and then averaged across all focal stacks to obtain the final performance scores presented in the table. From the presented results, it can be observed that the predicted depth maps closely resemble the corresponding GT maps while preserving important structural and object boundary information. The proposed method successfully maintains fine details and produces spatially coherent depth estimations across complex scene regions. We attribute this strong performance to the proposed proximal refinement framework, which effectively integrates focus-dependent image cues with learned regularization priors to achieve accurate and structurally consistent depth reconstruction.

4.3. Comparative Analysis

We conduct a comparative evaluation on the FT dataset, which is the largest dataset considered in this work, comprising 1000 training samples and 100 testing samples. The proposed method is compared with both conventional and deep learning-based depth-from-focus approaches. Among the conventional methods, we include RFVR [10], a regularization-driven technique, evaluated using the official implementation provided by the authors. For learning-based baselines, we consider AiFDNet [13] and DWild [35], utilizing their publicly available pretrained models. In addition, two variants of the DFV framework [14] are included in the comparison: DFV-FV, which employs a standard focus volume representation, and DFV-Diff, which introduces a differential focus strategy. To ensure fairness in the experimental setup, both DFV variants were retrained using the official training configurations and scripts.

The quantitative comparison results for the FT dataset are presented in Table 3. The proposed approach achieves the lowest average MAE of 1.46, while the second-best MAE of 5.51 is obtained by DFV-Diff. This corresponds to a relative MAE reduction of approximately 73.5%. Among the compared methods, RFVR performs weakest overall, with the highest errors and the lowest accuracy and correlation scores. AiFDNet and DFV-FV improve moderately over RFVR, indicating the advantage of learning-based representations over conventional handcrafted approaches. DWild and DFV-Diff are the strongest competing methods, with very similar quantitative results, although DWild is slightly better on several metrics (RMS, AbsRel, Acc_1, and Acc_2). The proposed method also attains the best scores on all remaining metrics, namely RMS, logRMS, AbsRel, Acc_1, Acc_2, Acc_3, and Corr. We attribute this improvement to the ability of the proposed framework to preserve fine structural details and accurately capture focus-dependent scene information, which competing methods often fail to model effectively. This advantage becomes more apparent in the qualitative comparisons.
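The relative improvement quoted above follows directly from the two MAE values in Table 3. The short check below reproduces the arithmetic; the helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

mae_dfv_diff = 5.51  # second-best MAE in Table 3
mae_ours = 1.46      # MAE of the proposed method in Table 3

improvement = relative_improvement(mae_dfv_diff, mae_ours)
print(f"{improvement:.1f}%")  # prints "73.5%"
```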

A qualitative comparison for the subset {12, 28, 54, 65, 70, 97} from the FT dataset is provided in Figure 7. Among the compared approaches, RFVR exhibits noticeable limitations in handling background noise and preserving sharp object boundaries. This is primarily due to its heavy reliance on the number of images within the focal stack, which restricts the effective dynamic range of focus information and reduces the method's capability to accurately estimate depth in scenes containing large depth variations. In contrast, the five learning-based methods yield better output maps. It is evident from Figure 7 that our proposed method produces the least noise, particularly in the second and fifth columns. In addition, our approach effectively handles complex internal patterns within objects; for example, in the fifth column a brown object contains intricate internal patterns, and our method accurately estimates its distance without being misled by the texture, an area where several competing methods fall short.

The effectiveness of the proposed method is further demonstrated through focal stack-wise quantitative comparisons presented in Figure 8. For visualization purposes, all metric values are normalized to the range [0, 1]. Consequently, the comparisons should be interpreted relatively, with the best-performing method determined according to the original metric definitions. From the figure, it can be observed that the proposed method achieves the lowest values for the error-based metrics, including MAE and RMS, while simultaneously obtaining the highest scores for the accuracy-based metrics Acc_1, Acc_2, Acc_3, and Corr. These results demonstrate the effectiveness of the proposed framework in producing accurate and structurally consistent depth estimations across different focal stacks.
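One way to obtain such a normalized view is per-metric min-max scaling across methods, sketched below. The text only states that values are normalized to [0, 1], so the min-max scheme is an assumption, and the inputs are the dataset-level averages from Table 3 used purely as illustrative data rather than the per-stack values plotted in Figure 8.

```python
import numpy as np

methods = ["RFVR", "AiFDNet", "DFV-FV", "DFV-Diff", "DWild", "Ours"]
# Dataset-level averages from Table 3 (illustrative input only).
mae = np.array([11.89, 6.81, 6.33, 5.51, 5.54, 1.46])
acc1 = np.array([72.79, 85.43, 85.09, 86.18, 86.35, 99.28])

def min_max_normalize(values):
    """Rescale values to [0, 1] (assumed scheme; the exact one is not stated)."""
    values = np.asarray(values, dtype=np.float64)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

mae_norm = min_max_normalize(mae)
acc1_norm = min_max_normalize(acc1)

# Normalization preserves ordering, so the best method is still read off
# using the original metric definition: lower is better for errors,
# higher is better for accuracies.
best_by_mae = methods[int(np.argmin(mae_norm))]
best_by_acc1 = methods[int(np.argmax(acc1_norm))]
print(best_by_mae, best_by_acc1)
```

Because each metric is rescaled independently, bar heights are comparable only within a metric, not across metrics.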

Finally, we present qualitative results on the real-world DDFF [11] test set, a large-scale benchmark comprising diverse and challenging real-world scenes. Since the test set does not provide ground-truth depth maps, the evaluation is conducted through visual comparison using the officially released checkpoints of AiFDNet [13], DFV-FV [14], and DFV-Diff [14]. The results are shown in Figure 9, where the first row displays an RGB image from the focal stack and the subsequent rows show the depth predictions generated by different methods. Each column corresponds to a distinct test sample.

The DDFF [11] benchmark presents significant challenges due to complex scene structures, depth discontinuities, and varying imaging conditions. Despite these difficulties, the proposed method produces competitive depth estimates and, in several cases, exhibits improved object structure preservation and sharper boundary delineation. For instance, in the first column, the proposed method generates clearer object boundaries and more consistent depth transitions compared to the competing approaches. These results demonstrate the robustness and effectiveness of the proposed framework on challenging real-world data.

5. Conclusions

In this paper, we presented a novel DFF framework based on optimization unrolling, where depth estimation is formulated as an energy minimization problem and solved through a trainable iterative refinement process. The proposed approach integrates physical imaging principles with deep learning by combining learned focus representations, probabilistic depth hypotheses, attention-guided confidence estimation, and proximal optimization updates within a unified end-to-end framework.

The iterative refinement strategy enables progressive enhancement of the predicted depth maps while effectively preserving structural details and reducing artifacts in challenging regions. In addition, the incorporation of focus consistency constraints, adaptive step sizes, and learned regularization priors improves the robustness of the model under noisy and low-texture conditions. Experimental evaluations on both synthetic and real-world datasets demonstrate that the proposed method consistently outperforms existing classical and learning-based DFF approaches across multiple quantitative metrics while also producing visually accurate and structurally coherent depth maps.

Overall, the proposed framework provides an interpretable and effective solution for depth map estimation by bridging model-based optimization and data-driven learning. Future work will focus on extending the framework toward real-time applications, improving cross-domain generalization, and incorporating temporal consistency for video-based depth estimation scenarios.

Funding

This work was supported by the Research Promotion Program Koreatech (2026).

Data Availability Statement

The datasets used in the study are openly available at https://lmb.informatik.uni-freiburg.de/resources/datasets/ (accessed on 7 June 2026), as mentioned in reference [13]. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Liao, M.; Lu, F.; Zhou, D.; Zhang, S.; Li, W.; Yang, R. Dvi: Depth guided video inpainting for autonomous driving. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16; Springer: Cham, Switzerland, 2020; pp. 1–17.
2. Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards real-time monocular depth estimation for robotics: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961.
3. Du, R.; Turner, E.; Dzitsiuk, M.; Prasso, L.; Duarte, I.; Dourgarian, J.; Afonso, J.; Pascoal, J.; Gladstone, J.; Cruces, N.; et al. DepthLab: Real-time 3D interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology; Association for Computing Machinery: New York, NY, USA, 2020; pp. 829–843.
4. Park, K.; Kim, S.; Sohn, K. High-precision depth estimation with the 3d lidar and stereo fusion. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2018; pp. 2156–2163.
5. Cui, Y.; Schuon, S.; Chan, D.; Thrun, S.; Theobalt, C. 3D shape scanning with a time-of-flight camera. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2010; pp. 1173–1180.
6. Griffin, B.A.; Corso, J.J. Depth from camera motion and object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 1397–1406.
7. Tosi, F.; Bartolomei, L.; Poggi, M. A survey on deep stereo matching in the twenties. Int. J. Comput. Vis. 2025, 133, 4245–4276.
8. Zheng, Z.; Feng, S.; Chen, C.; Qu, Y. Depth from Focus in 3D Measurement: An Overview. IEEE Trans. Instrum. Meas. 2025, 74, 5034338.
9. Nayar, S.K.; Nakagawa, Y. Shape from focus. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 824–831.
10. Ali, U.; Mahmood, M.T. Robust focus volume regularization in shape from focus. IEEE Trans. Image Process. 2021, 30, 7215–7227.
11. Hazirbas, C.; Soyer, S.G.; Staab, M.C.; Leal-Taixé, L.; Cremers, D. Deep depth from focus. In Proceedings of the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part III 14; Springer: Cham, Switzerland, 2019; pp. 525–541.
12. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 66–75.
13. Wang, N.H.; Wang, R.; Liu, Y.L.; Huang, Y.H.; Chang, Y.L.; Chen, C.P.; Jou, K. Bridging unsupervised and supervised depth from focus via all-in-focus supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 12621–12631.
14. Yang, F.; Huang, X.; Zhou, Z. Deep depth from focus with differential focus volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 12642–12651.
15. Zhang, K.; Zuo, W.; Gu, S.; Zhang, L. Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 3929–3938.
16. Rick Chang, J.; Li, C.L.; Poczos, B.; Vijaya Kumar, B.; Sankaranarayanan, A.C. One network to solve them all–solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 5888–5897.
17. Metzler, C.A.; Maleki, A.; Baraniuk, R.G. From denoising to compressed sensing. IEEE Trans. Inf. Theory 2016, 62, 5117–5144.
18. Li, T.; Yan, Q.; Zou, Q.; Dai, Q. Gates-controlled deep unfolding network for image compressed sensing. IEEE Trans. Comput. Imaging 2024, 10, 103–114.
19. Dogan, R.O.; Dogan, H.; Cal, S. From Handcrafted Focus Measurement Operators to Deep Learning: A Comprehensive Review of Shape from Focus Strategies. Arch. Comput. Methods Eng. 2026, 33, 4609–4623.
20. Krotkov, E. Focusing. Int. J. Comput. Vis. 1988, 1, 223–237.
21. Pertuz, S.; Puig, D.; Garcia, M.A. Analysis of focus measure operators for shape-from-focus. Pattern Recognit. 2013, 46, 1415–1432.
22. Subbarao, M.; Choi, T. Accurate recovery of three-dimensional shape from image focus. IEEE Trans. Pattern Anal. Mach. Intell. 1995, 17, 266–274.
23. Mahmood, M.T.; Choi, T.S. Nonlinear approach for enhancement of image focus volume in shape from focus. IEEE Trans. Image Process. 2012, 21, 2866–2873.
24. Jeon, H.G.; Surh, J.; Im, S.; Kweon, I.S. Ring Difference Filter for Fast and Noise Robust Depth From Focus. IEEE Trans. Image Process. 2019, 29, 1045–1060.
25. Ali, U.; Lee, I.H.; Mahmood, M.T. Guided image filtering in shape-from-focus: A comparative analysis. Pattern Recognit. 2021, 111, 107670.
26. Moeller, M.; Benning, M.; Schönlieb, C.; Cremers, D. Variational depth from focus reconstruction. IEEE Trans. Image Process. 2015, 24, 5369–5378.
27. Li, Y.; Li, Z.; Zheng, C.; Wu, S. Adaptive weighted guided image filtering for depth enhancement in shape-from-focus. Pattern Recognit. 2022, 131, 108900.
28. Danismaz, S.; Dogan, R.O.; Dogan, H. Two-phase deep learning method for image fusion-based extended depth of focus. J. Supercomput. 2025, 81, 1298.
29. Ashfaq, K.; Mahmood, M.T. A dual-stage focus measure for vector-valued images in shape from focus. Pattern Recognit. 2026, 170, 112112.
30. Ashfaq, K.; Mahmood, M.T. Depth from focus using directional spherical difference filter and Vector to Scalar Fusion. J. Vis. Commun. Image Represent. 2026, 117, 104794.
31. Lu, Y.; Milliron, G.; Slagter, J.; Lu, G. Self-supervised single-image depth estimation from focus and defocus clues. IEEE Robot. Autom. Lett. 2021, 6, 6281–6288.
32. Yang, X.; Fu, Q.; Elhoseiny, M.; Heidrich, W. Aberration-aware depth-from-focus. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 47, 7268–7278.
33. Fujimura, Y.; Iiyama, M.; Funatomi, T.; Mukaigawa, Y. Deep depth from focal stack with defocus model for camera-setting invariance. Int. J. Comput. Vis. 2024, 132, 1970–1985.
34. Jiang, C.; Lin, M.; Zhang, C.; Wang, Z.; Yu, L. Learning Depth from Focus with Event Focal Stack. IEEE Sens. J. 2024, 25, 1950–1958.
35. Won, C.; Jeon, H.G. Learning depth from focus in the wild. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–18.
36. Xie, X.; Qingyan, J.; Chen, D.; Guo, B.; Li, P.; Zhou, S. StackMFF: End-to-end multi-focus image stack fusion network. Appl. Intell. 2025, 55, 503.
37. Kang, X.; Han, F.; Fayjie, A.R.; Vandewalle, P.; Khoshelham, K.; Gong, D. FocDepthFormer: Transformer with Latent LSTM for Depth Estimation from Focal Stack. In Proceedings of the Australasian Joint Conference on Artificial Intelligence; Springer: Singapore, 2024; pp. 273–290.
38. Ganj, A.; Su, H.; Guo, T. HybridDepth: Robust Metric Depth Fusion by Leveraging Depth from Focus and Single-Image Priors. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 973–982.
39. Ashfaq, K.; Mahmood, M.T. Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network. Int. J. Comput. Vis. 2026, 134, 115.
40. Yang, H.; Liu, Z.; Liu, W.; Wang, H.; Zhang, Y.; Wang, H. Graph-MDETR: A graph-guided Mamba-DETR network for UAV catenary support components detection in electrified railways. IEEE Trans. Intell. Transp. Syst. 2026, 27, 6319–6332.
41. Chen, Z.; You, K.; Yang, J.; Chen, L.; Li, F.; Feng, Z.; Jia, L. A sparse-to-dense guided fusion framework for three-dimensional object detection in railway environments. Eng. Appl. Artif. Intell. 2026, 178, 115095.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 770–778.
43. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 4040–4048.
44. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany, 2–5 September 2014; Proceedings 36; Springer: Cham, Switzerland, 2014; pp. 31–42.
45. Honauer, K.; Johannsen, O.; Kondermann, D.; Goldluecke, B. A dataset and evaluation methodology for depth estimation on 4D light fields. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part III 13; Springer: Cham, Switzerland, 2017; pp. 19–34.
46. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf (accessed on 7 June 2026).
47. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.

Figure 1. Depth from Focus pipelines for traditional and deep learning methods.

Figure 2. Overview of the proposed framework. Multiple depth candidates are extracted from the deep focus volume and progressively fused via proximal refinement to obtain an intermediate depth estimate. A Depth Refiner Module then applies correction updates to produce the final depth map.

Figure 3. Outputs at different stages of the proposed model on the FT dataset. From left to right: ground truth (GT), outputs after iteration 1, iteration 3, iteration 5, and the final refined output.

Figure 4. Visualization of the attention weighting maps and $W^{2}$ maps associated with the depth candidates. The first three columns show the attention maps generated for focal stacks {1, 16, 54}, respectively, while the last column presents the average $W^{2}$ map.

Figure 5. Qualitative evaluation of our method on the MB dataset. The first row shows the all-in-focus (AiF) images, the second row shows the ground truth (GT), and the last row shows the predicted outputs.

Figure 6. Qualitative evaluation of our method on the HCI dataset. The first row shows the all-in-focus (AiF) images, the second row shows the ground truth (GT), and the last row shows the predicted outputs.

Figure 7. Qualitative comparison on the subset {12, 28, 54, 65, 70, 97} of the FT dataset. The first column shows the ground truth (GT) of the focal stack numbers {12, 28, 54, 65, 70, 97}, respectively, while the remaining columns show the depth maps obtained by using different methods.

Figure 8. Quantitative comparison on focal stack 54 from the FT dataset. For visualization purposes, all metric values are normalized to the range [0, 1]. Consequently, the comparisons should be interpreted relatively, with the best-performing method determined according to the original metric definitions.

Figure 9. Qualitative comparison on the real-world DDFF test set.

Table 1. Quantitative results at different stages of the proposed model on the FT dataset. The table reports performance after iteration 1, iteration 3, iteration 5, and the final refined output.

Setting	MAE	RMS	logRMS	AbsRel	Acc_1	Acc_2	Acc_3	Corr
Iteration 1	1.525	2.952	0.081	0.040	98.89	99.62	99.77	0.996
Iteration 3	1.475	2.867	0.072	0.036	99.11	99.74	99.85	0.997
Iteration 5	1.469	2.849	0.069	0.036	99.18	99.77	99.87	0.997
Refined	1.460	2.842	0.066	0.035	99.28	99.81	99.89	0.997

Table 2. Quantitative results for test subsets from MB [44] and HCI [45] datasets. Each row reports the metrics averaged over the focal stacks of the corresponding test subset.

Dataset	MAE	RMS	logRMS	AbsRel	Acc_1	Acc_2	Acc_3	Corr
MB	23.59	29.94	0.60	1.42	4.92	37.32	93.11	0.76
HCI	0.07	0.16	0.06	0.03	97.78	99.53	99.91	0.97

Table 3. Quantitative comparison on the FT dataset. Rows correspond to different methods, while columns represent evaluation metrics.

Method	MAE	RMS	logRMS	AbsRel	Acc_1	Acc_2	Acc_3	Corr
RFVR	11.89	23.63	0.79	1.55	72.79	80.81	84.60	0.73
AiFDNet	6.81	13.14	0.59	0.73	85.43	87.67	88.87	0.93
DFV-FV	6.33	12.10	0.57	0.89	85.09	87.60	89.52	0.95
DFV-Diff	5.51	10.65	0.53	0.62	86.18	88.09	89.93	0.97
DWild	5.54	10.44	0.53	0.61	86.35	88.21	89.84	0.97
Ours	1.46	2.84	0.07	0.04	99.28	99.81	99.89	0.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
