Article

SDENet: A Novel Approach for Single Image Depth of Field Extension

School of Computer Science and Artificial Intelligence, Zhengzhou University of Light Industry, Zhengzhou 450002, China
*
Author to whom correspondence should be addressed.
Algorithms 2026, 19(3), 216; https://doi.org/10.3390/a19030216
Submission received: 3 February 2026 / Revised: 4 March 2026 / Accepted: 10 March 2026 / Published: 13 March 2026

Abstract

Traditional hardware-based approaches for depth-of-field extension (DOF-E), such as optimized lens design or focus stacking via layer scanning, are often plagued by bulkiness and prohibitive costs. Meanwhile, conventional multi-focus image fusion algorithms demand precise spatial alignment, a challenge that becomes particularly acute in applications like microscopy. To address these limitations, this paper proposes a novel single-image DOF-E method termed SDENet. The method adopts an encoder–decoder architecture enhanced with multi-scale self-attention and depth enhancement modules, enabling the transformation of a single partially focused image into a fully focused output while effectively recovering regions outside the original depth of field (DOF). To support model training and performance evaluation, we introduce a dedicated dataset (MSED) containing 1772 pairs of single-focus and all-focus images covering diverse scenes. Experimental results on multiple datasets verify that SDENet significantly outperforms state-of-the-art deblurring methods, achieving a PSNR of 26.98 dB and an SSIM of 0.846 on the DPDD dataset, a substantial improvement in clarity and visual coherence over existing techniques. Furthermore, SDENet achieves performance competitive with multi-image fusion methods while requiring only a single input.

1. Introduction

DOF refers to the spatial range within which a camera can capture sharp images. In real-world settings, three-dimensional scenes often extend beyond the camera’s DOF, causing regions outside the DOF to appear blurred. This blurring effect not only obscures fine-grained details critical for tasks in microscopy, surveillance, and other computer vision domains [1,2,3], but also diminishes the visual fidelity of the captured images [4,5]. Optical techniques can enhance DOF, such as large-aperture lenses, specialized coatings, or multi-lens systems [6,7], but these solutions often involve complex designs and high costs. An alternative DOF-E approach is multi-focus image fusion, in which a series of images with varying focal settings is captured and merged into a single all-focus image with uniformly sharp details [8,9], as demonstrated on the left side of Figure 1. However, these methods require capturing a stack of images with different focal regions, which is time-consuming and cumbersome, and misalignment between images may introduce artifacts, thus necessitating precise image registration [10,11,12,13].
Single-image defocus deblurring algorithms restore defocus blur by estimating a pixel-level defocus map (or blur kernel) and then performing non-blind deblurring; they mainly target the case where the entire image is blurred because the camera failed to focus. Recent deep learning techniques have shown progress in this area. Ma et al. [14] proposed an interpretable end-to-end network that predicts spatially varying inverse blur kernels for defocus deblurring. However, accurately simulating the spatially varying defocus effect remains an immensely challenging task, as errors in blur-kernel estimation can severely compromise the quality of the final result for a locally defocused image. Recovering a fully sharp, all-focus image from a single input, i.e., extending the DOF of ordinary scene images, remains a difficult problem [15].
In this work, we developed a single-image DOF-E network that avoids explicit blur-map estimation. The proposed model learns to extend the DOF of a locally focused image directly, rather than decomposing the restoration problem into separate defocus-estimation stages. To summarize, the key contributions of this paper are:
  • A novel dataset (MSED) is introduced to mitigate the scarcity of dedicated training resources for single-image depth-of-field extension (DOF-E), comprising 1772 rigorously aligned single local-focus and all-focus image pairs with pixel-level correspondence.
  • A single-input DOF extension network (SDENet) is developed to reconstruct depth-of-field information from a single acquisition, which not only reduces reliance on complex optical systems but also eliminates the image registration and fusion limitations inherent to multi-image methodologies.
  • A Multi-scale Self-attention (MsS) Transformer and a Feature Enhancement Module (FEM) are specifically designed to comprehensively extract and refine hierarchical depth features from monocular input, thus facilitating context-aware depth restoration.
  • The robustness of the proposed method is validated across diverse benchmarks and real-world scenarios; it not only achieves significant performance gains over conventional deblurring and defocus techniques but also attains parity with multi-focus fusion approaches, remarkably, while using only a single input image.
To better position our work within the existing literature, the following section provides a detailed review of traditional and deep learning-based methods for depth-of-field extension.

2. Related Work

2.1. Depth-of-Field Extension (DOF-E)

DOF-E can be achieved through complex optics design or computational imaging techniques. Optical DOF-E methods include adjustable-aperture optics, focus-tunable lenses, and coded apertures, which introduce aberrations or increased complexity [16]. In microscopy, DOF-E systems typically rely on specialized optics or active focus sweeping [17,18]; however, these approaches are often costly and computationally intensive. Recently, computational techniques have been explored as a way to reduce dependence on complex hardware.

2.2. Multi-Focus Image Fusion (MFIF)

MFIF techniques combine multiple images captured at different focus depths into a single all-in-focus image, thereby extending the DOF. Traditional fusion methods operate in either spatial or transform domains (e.g., wavelets, gradients) and rely on focus-measure heuristics [19]. In recent years, deep learning approaches have been applied to MFIF, with convolutional neural networks (CNNs) and generative models (e.g., GANs) achieving state-of-the-art fusion performance [20]. Vision Transformers have also been utilized to capture long-range dependencies in fusion tasks [21]. These methods can produce exceptionally sharp, all-in-focus images when provided with well-aligned multi-focal inputs. However, acquiring multiple focus-stacked images is time-consuming, and any misalignment or exposure variations between the input images can lead to artifacts. Additionally, multi-image systems increase the complexity of image capture.

2.3. Single-Image DOF-E/Defocus Deblurring

Removing defocus blur from a single image has been explored as a way to achieve effects similar to depth-of-field extension (DOF-E). Early approaches typically estimate a defocus blur map and then apply non-blind deconvolution. More recent end-to-end deep networks bypass explicit defocus estimation and directly predict the sharp image. For example, Quan et al. [22] introduced a deep network that explicitly predicts spatially varying inverse kernels for defocus deblurring, significantly improving interpretability. Nevertheless, most single-image methods struggle with the complex, spatially varying nature of defocus blur. Errors in blur estimation or limitations in network generalization often result in residual blur or artifacts [23,24]. To our knowledge, few methods have specifically focused on extending DOF for general images using only a single input. SDENet is designed precisely for this task: it directly learns to recover an all-in-focus image without requiring multiple captures or explicit defocus maps. More recently, the landscape of image restoration has been further advanced by sophisticated attention mechanisms and generative priors. For instance, the GRL (Global, Regional, and Local) network [25] and the latest SKA-based (Selective Kernel Attention) architectures [26] have shown that dynamic feature selection is crucial for handling spatially varying blur. In the domain of deblurring, Blur2Blur [27] introduced unsupervised blur conversion to bridge the gap between synthetic and real-world domains. Furthermore, recent research has begun exploring the integration of latent diffusion models for depth-of-field recovery [28], demonstrating superior texture synthesis in severely blurred regions. These trends reinforce the importance of multi-scale feature aggregation and robust depth enhancement, core principles we have integrated into our SDENet through the MsS Transformer and FEM. Table 1 presents the advantages and disadvantages of previous related work.
Building upon the limitations of the aforementioned methods, we propose SDENet, whose detailed architecture and core components are described in the following section.

3. Method

This section provides a comprehensive exposition of the proposed SDENet, a deep learning-based framework specifically engineered for single-image depth-of-field extension. We begin by delineating the complete pipeline of the SDENet architecture, which adopts a hierarchical U-shape structure to facilitate multi-scale feature learning. Subsequently, we provide an in-depth analysis of the two pivotal components that drive the network’s performance: (a) the Multi-scale Self-attention (MsS) Transformer Module, designed to capture long-range focus dependencies across various spatial resolutions, and (b) the Feature Enhancement Module (FEM), which is strategically integrated to refine high-frequency details and ensure the structural integrity of the reconstructed all-in-focus images.

3.1. Network Architecture

The structure of the proposed SDENet is illustrated in Figure 2. It follows a standard encoder–decoder architecture, sharing similarities with various UNet variants. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, a 3 × 3 convolution is first applied to obtain shallow features $F_0 \in \mathbb{R}^{H \times W \times C}$. These features are then passed through a 5-level symmetric encoder–decoder, producing deep features $F_r \in \mathbb{R}^{H \times W \times 2C}$, where the channel dimension is doubled to incorporate multi-scale contextual information. The factor 2 arises from the concatenation of features processed through residual downsampling modules. After the decoder, a 3 × 3 convolutional layer is applied to obtain the residual image $F_{res} \in \mathbb{R}^{H \times W \times 3}$. The final all-in-focus image $I_{out}$ is then computed by adding this residual to the original input image:
$I_{out} = I + F_{res}$ (1)
This residual learning enables the network to focus on reconstructing the missing focus information (i.e., the difference between the input and the target all-focus image) rather than learning to generate the entire image from scratch, which stabilizes training and improves convergence. It is important to note that our network does not explicitly estimate depth maps; rather, it learns to recover focus information by implicitly leveraging depth-related cues through the proposed MsS Transformer and FEMs.
The MsS Transformer module combines a multi-head self-attention mechanism with feed-forward neural network layers to extract features and obtain long-range pixel dependencies. To enhance computational efficiency, a moving-window self-attention mechanism is applied to the feature maps. The enhanced features from the Feature Enhancement Module are passed to the MsS Transformer block for capturing long-range dependencies. The residual downsampling module doubles the number of channels using a 4 × 4 convolution with a stride of 2. The original input features are then added to these doubled features via a residual connection to produce the final output features. The choice of 4 × 4 kernel size with stride 2 offers two advantages over the more common 3 × 3 stride-2 convolution. First, the larger kernel provides a slightly expanded receptive field during downsampling, enabling better capture of spatial context before attention operations. Second, the even kernel size ensures symmetric downsampling and aligns with the 2 × 2 transpose convolutions used in the upsampling modules, maintaining spatial alignment across skip connections. At the end of the encoder, bottleneck stages composed of stacked MsS Transformer blocks are introduced. Owing to the hierarchical structure, longer-range dependencies, potentially global, can be captured by these blocks when the window size is set equal to the feature map size. At this deepest resolution (H/16 × W/16 × 16C), the feature map size is sufficiently small that global attention becomes computationally feasible while still capturing cross-image dependencies. Importantly, this global operation is applied only in the bottleneck stages; all preceding encoder stages retain the efficient sliding-window mechanism to process higher-resolution features. This hybrid design enables the network to benefit from both local detail preservation and global context modeling without incurring prohibitive computational costs. 
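The downsampling arithmetic above can be traced with a short sketch (illustrative only; the function name and example values are ours, while the four halving stages with channel doubling follow the text):

```python
def encoder_shapes(H, W, C, stages=4):
    """Trace feature-map shapes through the encoder: each residual
    downsampling stage halves H and W and doubles the channel count."""
    shapes = [(H, W, C)]  # shallow features F0
    for _ in range(stages):
        H, W, C = H // 2, W // 2, C * 2
        shapes.append((H, W, C))
    return shapes

# Four stages place the bottleneck at H/16 x W/16 x 16C, as stated above.
print(encoder_shapes(256, 256, 32)[-1])  # (16, 16, 512)
```

At this bottleneck resolution the token count is small enough that setting the window size equal to the feature-map size yields global attention at modest cost.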
In the four-level decoder, each stage consists of an MsS Transformer block, a feature enhancement module, and an upsampling module, similar to the encoder. In the upsampling module, a 2 × 2 transpose convolution with a stride of 2 is employed for upsampling, which simultaneously reduces the number of feature channels by half and doubles the size of the feature maps.
At all levels except the top one, the number of channels is reduced by half through a 1 × 1 convolution, which is applied after the fusion operation via skip connections with the corresponding encoder features. The image recovery is then performed using MsS Transformer blocks and the feature enhancement module. After the decoder, the output features are reshaped into a 2D feature map, and a 3 × 3 convolutional layer is applied to obtain the residual image. Finally, the DOF-extended image I o u t is obtained.

3.2. MsS Transformer Block

To clarify the methodological novelty of the MsS Transformer, it is important to distinguish it from generic hierarchical models like the Swin Transformer. While both adopt a multi-scale approach, the MsS Transformer is specifically engineered for the unique challenges of defocus blur restoration. The core innovation, the Dynamic-Sliding Multi-head Self-attention (DS-MSA), replaces the discrete shifted-window mechanism with a dynamic sliding window. This ensures continuous spatial modeling, which prevents the formation of block artifacts in transition regions between blurred and focused areas. Additionally, by incorporating multi-scale feature extraction directly into the attention branches, the model can simultaneously process both coarse blur structures and fine-grained textures. This specialized architecture provides a more robust solution for single-image DOF-E than general-purpose restoration transformers, offering a better balance between local texture recovery and global structural consistency.
Long-range pixel interactions can be effectively captured by the self-attention mechanism; however, its computational complexity increases quadratically with spatial resolution [29,30,31], which makes its application to high-resolution images challenging in the DOF-E context. Local contextual information is crucial for DOF-E. As sequence length increases, dependencies between distant positions become sparse, posing difficulties for transformers in capturing long-range dependencies. Therefore, a multi-scale self-attention transformer (MsS Transformer) is used in the U-shaped network of SDENet. A sliding-window operation is applied within the multi-head self-attention mechanism to reduce computational complexity and enhance feature extraction and fusion capabilities. Standard global self-attention has complexity $O(H^2 W^2 C)$, which grows quadratically with spatial resolution. By partitioning the feature map into non-overlapping windows of size M × M and computing attention within each window, DS-MSA reduces the complexity to $O(HW \cdot M^2 C)$. This makes the mechanism feasible for high-resolution inputs. Deep convolution is utilized in feed-forward networks to obtain local context information, which is beneficial for DOF-E. Additionally, the depthwise convolutions in DEFN further reduce computational cost compared to standard convolutions. Several key distinctions motivate the MsS Transformer architecture for DOF-E tasks. First, unlike conventional window-based transformers that rely on fixed window partitioning with regular shifting strategies, our Dynamic-Sliding Multi-head Self-Attention (DS-MSA) introduces a learnable sliding mechanism that adaptively adjusts window positions based on feature content. This enables more flexible aggregation of cross-window information without requiring predefined shift patterns.
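The complexity reduction can be made concrete with a back-of-the-envelope sketch (our own illustration, not the paper's code; the cost functions count attention-score terms up to a constant factor):

```python
def global_attn_cost(H, W, C):
    """Standard self-attention over all H*W tokens: O((HW)^2 * C)."""
    return (H * W) ** 2 * C

def window_attn_cost(H, W, C, M):
    """Attention restricted to non-overlapping M x M windows: O(HW * M^2 * C)."""
    return H * W * M * M * C

# For a 256 x 256 feature map with 10 x 10 windows, windowed attention is
# HW / M^2 times cheaper than global attention.
ratio = global_attn_cost(256, 256, 32) / window_attn_cost(256, 256, 32, 10)
print(ratio)  # 655.36
```

The savings factor HW/M² grows with resolution, which is why the sliding-window form is applied at the higher-resolution encoder stages.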
Second, our Depth-wise Enhanced Feed-Forward Network (DEFN) integrates depthwise convolutions to explicitly capture local contextual information, which is crucial for restoring fine-grained details in defocused regions-a capability not present in standard transformer feed-forward networks. Third, the MsS Transformer is specifically designed within a U-shaped encoder–decoder where feature enhancement modules precede attention, creating a hybrid pipeline tailored for depth-of-field extension. These architectural choices collectively enhance the model’s ability to recover spatially varying defocus blur while maintaining computational efficiency. Specifically, the MsS Transformer consists of two core components: (1) Dynamic-Sliding Multi-head Self-Attention (DS-MSA). (2) the Depth-wise Enhanced Feed-Forward Network (DEFN).

3.2.1. Dynamic-Sliding Multi-Head Self-Attention (DS-MSA)

Instead of using global attention, local self-attention is applied to the feature map, followed by the application of a sliding window, as shown in Figure 3, which markedly alleviates the computational workload. As illustrated in Figure 3, within the DS-MSA module, the input feature maps are processed by a 1 × 1 convolution followed by a 3 × 3 convolution to generate the Value (V) representations. Crucially, each convolutional layer is coupled with a LeakyReLU activation function and Layer Normalization. This structure prevents the two layers from merging into a single linear transformation and facilitates the integration of local spatial context with channel-wise features, which is essential for accurate depth-of-field extension.
For a given input 2D feature map $F \in \mathbb{R}^{H \times W \times C}$, the map $F$ is partitioned into independent, non-overlapping $M \times M$ patches. From each patch, the flattened and transposed feature $F_i \in \mathbb{R}^{M^2 \times C}$ is obtained. Next, self-attention is employed on the flattened features of each patch. To produce the query ($Q$), key ($K$), and value ($V$) projections, 1 × 1 convolutions followed by 3 × 3 depth-wise convolutions are utilized in place of linear layers. The computation of $Q$, $K$, and $V$ is given in Equation (2).
$Q = W_i^Q W_P^Q F$,  $K = W_i^K W_P^K F$,  $V = W_i^V W_P^V F$ (2)
where $W_i^{(\cdot)}$ represents the 1 × 1 convolution and $W_P^{(\cdot)}$ denotes the 3 × 3 depth-wise convolution. Subsequently, the queries ($Q$) and keys ($K$) are used to generate attention maps of corresponding sizes through softmax-normalized dot products, which then weight the values ($V$). Local attention is computed on the feature map in blocks, and the results are merged. Owing to the U-shaped network design, the window-based self-attention mechanism is applied to low-resolution feature maps, enabling larger receptive fields to be effectively obtained. Long-range dependencies are thus captured and local image information is enhanced, which is beneficial for DOF-E.
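The window bookkeeping behind this block-wise attention can be sketched as follows. This is a minimal NumPy illustration of only the partition/merge step, not the learned sliding mechanism or the attention computation itself:

```python
import numpy as np

def partition_windows(feat, M):
    """Split an H x W x C map into non-overlapping M x M windows,
    flattening each so attention can run per window (H, W divisible by M)."""
    H, W, C = feat.shape
    wins = feat.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return wins.reshape(-1, M * M, C)  # (num_windows, tokens_per_window, C)

def merge_windows(wins, H, W, M):
    """Exact inverse of partition_windows."""
    C = wins.shape[-1]
    wins = wins.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return wins.reshape(H, W, C)

feat = np.random.rand(40, 40, 8)
wins = partition_windows(feat, 10)  # 16 windows of 100 tokens each
assert wins.shape == (16, 100, 8)
assert np.allclose(merge_windows(wins, 40, 40, 10), feat)
```

Per-window attention then operates on each (M², C) slice independently before the windows are merged back into the spatial map.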

3.2.2. Depth-Wise Enhanced Feed-Forward Network (DEFN)

The Feed-Forward Network is enhanced by incorporating deep convolution, as described in [32]. In DEFN, 1 × 1 convolutions are used both before and after the 3 × 3 depth-wise convolutions, first to increase the number of feature channels and then to reduce them back to the original input dimensions. Specifically, the 3 × 3 depth-wise convolutions capture local information, which can then be applied more effectively to DOF recovery. The information flow across hierarchical levels in the pipeline is managed by the DEFN, enabling each level to focus on enhancing fine details that complement those of other levels, while normalized features are processed to enhance DOF-E recovery details. Following each linear and convolutional layer, the GELU activation function [33] is employed to mitigate the vanishing gradient problem. In the projection stage of the DS-MSA module, we employ a sequence of 1 × 1 and 3 × 3 depth-wise convolutions to generate the query, key, and value representations instead of conventional linear layers. This design choice is motivated by the need to incorporate local spatial context into the self-attention calculation. Defocus blur characteristics are spatially correlated, and 3 × 3 depth-wise convolutions provide an efficient way to capture these local pixel relationships (inductive bias) that linear layers typically ignore. By integrating this local information, the model can more accurately estimate attention weights for focus restoration. Additionally, the use of depth-wise convolutions significantly reduces the number of parameters and computational overhead compared to fully connected linear projections, enhancing the overall efficiency of SDENet.
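The DEFN data flow (1 × 1 expand, GELU, 3 × 3 depth-wise, GELU, 1 × 1 reduce) can be sketched in NumPy as below. The weights are random placeholders and the expansion ratio is our assumption; this illustrates only the forward shape arithmetic, not a trained module:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise3x3(x, k):
    """Per-channel 3x3 convolution with zero padding; k has shape (3, 3, C)."""
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += k[di, dj] * p[di:di + H, dj:dj + W]
    return out

def defn(x, w_up, k_dw, w_down):
    """1x1 expand -> GELU -> 3x3 depth-wise -> GELU -> 1x1 reduce."""
    h = gelu(np.einsum('hwc,cd->hwd', x, w_up))   # 1x1 conv = per-pixel matmul
    h = gelu(depthwise3x3(h, k_dw))               # local spatial mixing
    return np.einsum('hwd,dc->hwc', h, w_down)    # back to C channels

rng = np.random.default_rng(0)
C, r = 8, 2  # channels and (assumed) expansion ratio
x = rng.standard_normal((16, 16, C))
y = defn(x, rng.standard_normal((C, r * C)) * 0.1,
         rng.standard_normal((3, 3, r * C)) * 0.1,
         rng.standard_normal((r * C, C)) * 0.1)
assert y.shape == x.shape
```

The depth-wise stage mixes only spatial neighbors within each channel, which is the inductive bias the text motivates for defocused-region restoration.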

3.3. Feature Enhancement Module

Fine details at a local scale can be diluted by convolutional self-attention layers; therefore, a Feature Enhancement Module (FEM) is introduced after each MsS block to restore and enrich the features [34,35]. A Pyramid Pooling Module (PPM) is incorporated in the FEM to gather multi-scale global context. Specifically, parallel average-pooling operations with different grid sizes (e.g., strides 1, 2, and 4) are applied by the PPM to generate hierarchical context features. These pooled features are upsampled and concatenated with the original feature map, enabling both global and intermediate image statistics to be captured. The combined feature is then passed through channel attention and spatial attention mechanisms by which informative feature channels and regions are adaptively emphasized. Global pooling is employed to compute per-channel weights in Channel Attention (CAM), while salient spatial locations are emphasized in Spatial Attention (SAM). By integrating multi-scale pooling with attention mechanisms, rich contextual cues are injected by the FEM, which complement the representations provided by the transformer. In summary, the output of each MsS block is enriched by the FEM through the reintroduction of high-level priors and the emphasis on the most relevant features. This process, as shown in Figure 4, is crucial for accurately reconstructing full-focus images from blurred inputs.
For a more granular understanding of the FEM’s internal data flow, the process is detailed as follows: the input feature map first enters the Pyramid Pooling Module (PPM), which extracts multi-scale contextual information through different pooling layers. These pooled features are then restored to their original spatial resolution via bilinear upsampling and concatenated with the initial input feature map, effectively doubling the channel depth. To refine this fused information, a 1 × 1 convolution is applied to compress the channels back to their original count. Subsequently, the data flows through a sequential attention mechanism: first, the Channel Attention Module (CAM) recalibrates the importance of different feature channels, and then the Spatial Attention Module (SAM), utilizing a 7 × 7 convolutional kernel, localizes and emphasizes focus-critical regions. This explicit pipeline, from pyramid pooling and concatenation to sequential channel and spatial attention, ensures that the model robustly integrates both global context and local details for superior depth-of-field extension. The decision to apply independent attention refinement to each pyramid branch before final concatenation is rooted in the multi-scale nature of defocus blur. Since different pooling scales capture varying levels of semantic and structural information, a shared attention mechanism or simple early fusion would fail to account for the unique noise characteristics and feature importance of each scale. By processing each branch with its own sequential channel and spatial attention, the model can effectively perform scale-specific feature recalibration, prioritizing focus-critical details while suppressing redundant background information. Although this increases architectural complexity, the computational cost remains low because the attention operations are performed on reduced-resolution maps.
This design ensures that the final concatenated feature is a highly refined composite of multi-scale information, which is essential for the high-fidelity restoration of complex blurred regions.
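The pyramid-pooling step of the FEM can be sketched as below. The grid sizes follow the text (1, 2, 4); the upsampling here is nearest-neighbor rather than bilinear for brevity, and the channel/spatial attention stages are omitted:

```python
import numpy as np

def avg_pool(x, g):
    """Average-pool an H x W x C map onto a g x g grid (H, W divisible by g)."""
    H, W, C = x.shape
    return x.reshape(g, H // g, g, W // g, C).mean(axis=(1, 3))

def pyramid_pool(x, grids=(1, 2, 4)):
    """Pool at several grid sizes, upsample back, and concatenate on channels."""
    H, W, C = x.shape
    branches = [x]
    for g in grids:
        p = avg_pool(x, g)
        # nearest-neighbor upsampling back to H x W
        up = np.repeat(np.repeat(p, H // g, axis=0), W // g, axis=1)
        branches.append(up)
    return np.concatenate(branches, axis=-1)

x = np.random.rand(16, 16, 8)
out = pyramid_pool(x)
print(out.shape)  # (16, 16, 32): original plus three pooled branches
```

A 1 × 1 convolution would then compress the concatenated channels back to C before the attention stages, as described above.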
The detailed implementation steps of SDENet as mentioned above have been summarized in Figure 5.

3.4. Training Losses

The proposed network was trained using a loss function that combines MS-SSIM [36] and L1 [37]. MS-SSIM is known to be superior in preserving contrast in high-frequency regions, while L1 preserves color and luminance by treating errors equally across all local structures. However, L1 does not replicate the contrast enhancement achieved by MS-SSIM. To integrate the best aspects of both loss functions, a combined MS-SSIM and L1 loss, as defined in Equation (3), was employed.
$Loss_{Mix} = \alpha \cdot Loss_{MS\text{-}SSIM} + (1 - \alpha) \cdot G_{\sigma} \cdot L_1$ (3)
where α = 0.84, $G_{\sigma}$ is a Gaussian weighting term with parameter σ, $L_1$ is defined in Equation (4), x represents the predicted image, y denotes the all-focus ground-truth image, and N is the total number of pixels in the image. The loss function in Equation (3) employs the weighting factor α to balance the contributions of the MS-SSIM and L1 losses. Following established empirical studies in image restoration, we set α = 0.84. This choice is justified by our observation that it provides a superior trade-off between structural detail retention and pixel-level intensity consistency. Specifically, our sensitivity testing showed that values significantly lower than 0.84 lead to blurred structural edges, while excessively high values may introduce slight chromatic distortions. The chosen value ensures robust optimization for the depth-of-field extension task across diverse imaging scenarios. In SDENet, the $L_1$ loss is calculated between the predicted images obtained after the five stages and the fully focused ground-truth images. For MS-SSIM, the loss is calculated at scales 1, 2, and 3 in the encoder and decoder parts of SDENet to obtain the multi-scale SSIM loss.
$L_1(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - y_i \right|$ (4)
$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\text{-}SSIM}(x, y)$ (5)
The two losses are combined as above to obtain $Loss_{Mix}$ in Equation (3). In this way, detail enhancement and high-frequency information preservation are achieved in the restored image, enabling a wider range of depth-of-field extension (DOF-E).
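The weighting in the mixed loss can be illustrated with the sketch below. Note the hedges: the SSIM term here is a simplified single-scale stand-in computed from global image statistics rather than the windowed MS-SSIM the paper uses, and the Gaussian weighting $G_{\sigma}$ of the L1 term is omitted:

```python
import numpy as np

def l1_loss(x, y):
    """Mean absolute error over all pixels (Equation (4))."""
    return np.abs(x - y).mean()

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified single-scale SSIM from global statistics; a stand-in
    for the windowed MS-SSIM used in the paper."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def mix_loss(x, y, alpha=0.84):
    """alpha * (1 - SSIM) + (1 - alpha) * L1, following Equation (3)."""
    return alpha * (1 - ssim_global(x, y)) + (1 - alpha) * l1_loss(x, y)

x = np.random.rand(64, 64)
assert abs(mix_loss(x, x)) < 1e-9  # identical images give (near-)zero loss
```

With α = 0.84 the structural term dominates, while the L1 term keeps per-pixel intensities anchored.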
After defining the technical framework of SDENet, we next present the experimental setup and a comprehensive evaluation of its performance on multiple datasets.

4. Results and Discussion

4.1. Data

Currently, existing datasets are primarily designed for multi-image-to-single-image tasks, and datasets specifically intended for single-image depth-of-field extension (DOF-E), which maps a single-focus image to an all-in-focus image, are still lacking. To address this deficiency, images were selected from the DLDP [38], DEDD [14], DPDD [23], and LFDOF [39] datasets (see Table 2). These images are not globally defocused, but contain clearly focused regions, while other areas are blurred due to imaging limitations. These selected images were used to construct the Multi-Scene Single-Image DOF-E Dataset (MSED), which comprises 1772 naturally sharp images and 1772 single-focus images. To validate the model, two test datasets (TEST(A) and TEST(B)) were created within MSED. Sample images from these datasets are shown in Figure 6. TEST(A) consists of 387 indoor and outdoor images from the above datasets, all characterized by single-focus blur. TEST(B) includes 100 single-focus blurred images captured by various smartphones (Apple, Xiaomi, Vivo, Honor) across different scenes. The MSED covers a wide range of indoor and outdoor scenes, containing both single-focus blur and fully focused images. Specifically, 1772 pairs of images from the MSED were adopted for training, and the dataset was divided into an 85% training set and a 15% validation set. TEST(A) and TEST(B) from the MSED were employed as the test set to evaluate the model’s robustness, along with publicly available datasets such as DPDD, RealDOF [40], and DLDP. The testing dataset from DLDP does not include fully focused clear images.
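The 85/15 split of the 1772 pairs can be reproduced with a simple sketch; the shuffling procedure and the seed used here are illustrative assumptions, not the authors' exact protocol:

```python
import random

def split_pairs(n_pairs=1772, train_frac=0.85, seed=42):
    """Shuffle pair indices and split them into train/validation lists."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)  # deterministic shuffle for reproducibility
    cut = int(n_pairs * train_frac)
    return idx[:cut], idx[cut:]

train, val = split_pairs()
print(len(train), len(val))  # 1506 266
```

Splitting at the pair level (rather than the image level) keeps each single-focus image and its all-focus counterpart in the same partition.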

4.2. Implementation Details

The model was implemented using PyTorch (version 2.1.0) and trained on an NVIDIA GeForce RTX 4090D GPU with CUDA 12.1 and cuDNN 8.9. The Adam optimizer [41] was used for network training, with a weight decay rate of 0.02. The initial learning rate was set to $1 \times 10^{-4}$, and a cosine decay strategy was applied to gradually reduce it to $1 \times 10^{-6}$. The sliding-window size for multi-head attention was set to 10 × 10. Random horizontal and vertical flips were employed for data augmentation, each with a probability of 0.5. The model was trained for 3000 epochs with a batch size of 4. Input images were randomly cropped to 256 × 256 patches during training. The random seed was fixed to 42 for reproducibility. For reference, at a standardized input resolution of 256 × 256, the computational cost is 85.2 GFLOPs. For inference, the model processes a single 1024 × 1024 image in 0.35 s on the same GPU with batch size 1 and FP32 precision.
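The cosine decay schedule described above, from 1e-4 down to 1e-6, can be sketched as (a standard cosine-annealing form; the exact schedule granularity used by the authors is not stated):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

assert abs(cosine_lr(0, 3000) - 1e-4) < 1e-12     # start of training
assert abs(cosine_lr(3000, 3000) - 1e-6) < 1e-12  # end of training
```

The schedule decays slowly at first, allowing large early updates, then flattens near the floor learning rate toward the final epochs.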

4.3. Performance Comparison

A comparative analysis between SDENet and the methods of MIMO-UNet [42], Uformer [43], MTRNN [32], MPRNet [44], DRBNet [45], and Blur2Blur [27] is presented in this section on the TEST(A), TEST(B), DPDD, RealDOF, and DLDP test sets, aiming to evaluate the performance of the SDENet model in DOF-E tasks. Among these methods, MTRNN, MIMO-UNet, and Blur2Blur are designed for single-image deblurring, while Uformer, MPRNet, and DRBNet are defocus deblurring models. Beyond quantitative accuracy, we also consider the theoretical computational characteristics of these methods. MIMO-UNet employs a multi-input multi-output architecture with shared parameters across scales, achieving roughly linear complexity but with a higher memory footprint due to parallel branches. Uformer adopts a U-shaped Transformer with fixed window-based self-attention, lacking the dynamic sliding mechanism that enables adaptive receptive field adjustment in SDENet. MTRNN utilizes recurrent connections across temporal frames, introducing sequential dependencies that limit parallelization efficiency for single-image tasks. MPRNet adopts a multi-stage progressive design, where sequential processing inherently increases cumulative complexity. DRBNet leverages recurrent blocks that increase network depth, but its iterative refinement leads to longer inference paths. Blur2Blur focuses on unsupervised domain adaptation, trading efficiency for cross-domain generalization. In comparison, SDENet’s combination of dynamic window-based self-attention and efficient feature enhancement modules achieves competitive representational capacity while maintaining theoretical complexity on par with or lower than these methods, particularly for high-resolution inputs. This analysis underscores the practical viability of SDENet for real-world DOF-E applications.

4.3.1. MSED-TEST(A) and (B)

MSED-TEST(A) is composed of 387 indoor and outdoor multi-scene DOF blurred images without fully focused image references. TEST(B) consists of 100 single-focus images captured across multiple scenes using different mobile phones, including Apple (USA), Honor (China), Xiaomi (China), and Vivo (China). A quantitative comparison of various methods, using Standard Deviation (SD) and Entropy as metrics on MSED-TEST(A) and TEST(B), is presented in Table 3, where SDENet is shown to achieve excellent performance. Since fully focused ground-truth images are not available for these datasets, no-reference statistical metrics are adopted for quantitative evaluation. Specifically, Standard Deviation (SD) reflects the global intensity variation and structural contrast of an image, while Entropy measures the richness and complexity of the information distribution. Images with clearer structures and more detailed textures generally exhibit higher SD and Entropy values. Therefore, these metrics are used as supplementary indicators to evaluate the enhancement of structural details and information content. It is worth noting that SD and Entropy may also be influenced by noise or contrast variations. Consequently, in this work, these metrics are used only as auxiliary statistical measures for no-reference evaluation, rather than as primary indicators of perceptual quality.
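Both no-reference statistics follow standard definitions and can be computed directly from the image and its gray-level histogram. The sketch below is a minimal NumPy implementation of those definitions (assuming 8-bit grayscale input), not the authors' evaluation code.

```python
import numpy as np

def standard_deviation(img):
    """Global intensity standard deviation (higher ~ stronger contrast)."""
    return float(np.std(img.astype(np.float64)))

def entropy(img, levels=256):
    """Shannon entropy of the gray-level histogram, in bits per pixel."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```

A perfectly flat image scores zero on both measures, while sharper, more textured images spread intensity over more histogram bins and score higher, which is the behavior the tables above rely on.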
Entropy is used to measure the richness of image details; higher values are preferable, as they indicate that the model has recovered more structural information and texture from the defocused areas. The visualized results (the areas shown in blue in Figure 7) show that, compared with other models, SDENet sharpens object edges more effectively during DOF-E and renders the recovered information more realistically.

4.3.2. DPDD and RealDOF Test Set

The DPDD test set comprises 76 pairs of indoor and outdoor multi-scene DOF blurred images, while RealDOF contains 50 pairs. A quantitative comparison of the various methods using the average PSNR, SSIM, and LPIPS metrics on the DPDD and RealDOF test sets is presented in Table 4. PSNR, SSIM, and LPIPS are computed over the entire image using well-aligned all-in-focus ground truth. Since defocus blur mainly affects specific spatial regions, improvements in previously defocused areas contribute directly to the overall metric values. To complement the global quantitative results, we provide qualitative comparisons focusing on defocused regions to visually illustrate the model's effectiveness in restoring sharp structures and extending depth of field. A systematic ROI-based quantitative evaluation requires datasets with pixel-level defocus mask annotations, which are not uniformly available across our test sets, and is therefore left for future work. The experimental results demonstrate that SDENet achieves satisfactory performance in terms of the PSNR, SSIM, and LPIPS metrics.
Visualizing the results (Figure 8, where the areas shown in red and yellow are magnified views of the red and yellow boxes) shows that single-image deblurring models often introduce distortions or excessive smoothing during DOF-E, whereas defocus deblurring models may over-process or distort the clear regions outside the DOF-E area. The experimental results demonstrate that the images generated by SDENet closely resemble authentic images and achieve the highest PSNR (the PSNR value is marked below each image).

4.3.3. DLDP Test Set

The DLDP test set comprises 263 indoor and outdoor multi-scene DOF blurred images which are devoid of fully focused image references. Quantitative comparisons on the DLDP test set are conducted using Standard Deviation (SD) and Entropy as metrics for various methods (see Table 5). The SDENet model achieves good results in both SD and Entropy.
Depth information recovery. In this experiment, four individuals were positioned at distances of 3 m, 5 m, 10 m, and 15 m from the camera, each wearing a chest badge labeled with the corresponding distance (3 m, 5 m, 10 m, 15 m). Images were captured with the camera focused on the person at the 3 m position. Various methods were then applied to process these images, and the recovery of depth information was evaluated by assessing the legibility of the text on the chest badges.
By visualizing the results (as shown in Figure 9, highlighted in red), it was observed that depth information at 10 m was effectively recovered by SDENet for the images obtained in this experiment. At a depth of 15 m, the recovery of depth information by SDENet was found to be less effective.

4.4. Comparison with Multi-Image Fusion Methods

Compared with multi-image fusion algorithms, single-image DOF-E is advantageous in terms of data acquisition and processing. Multi-image fusion methods require capturing a stack of images with varying focus settings, which is time-consuming and necessitates precise image registration to avoid misalignment artifacts. In contrast, SDENet operates on a single input image, completely bypassing these acquisition and registration burdens. However, this convenience introduces a trade-off: while multi-image fusion can theoretically achieve higher fidelity by leveraging complementary focal information from multiple captures, SDENet must rely on learned priors to hallucinate out-of-focus details. This may lead to occasional artifacts in regions with severe defocus or complex depth variations, yet it offers a practical solution when multiple captures are infeasible or when hardware simplicity is prioritized.
To demonstrate the effectiveness of SDENet in DOF-E, experiments were conducted on both registered and non-registered data using the Wavelet Transform (CWT) [26], DSIFT [27], GFDF [28], SESF [46], and SDENet. The CWT algorithm is implemented with PyWavelets, DSIFT and GFDF are implemented in MATLAB R2024b for multi-image fusion, and SESF is a DL-based multi-focus fusion algorithm. The experiments on registered and non-registered data presented below demonstrate that the single-image DOF-E model (SDENet) performs comparably to multi-image fusion models.
Non-Registered data. Eleven sets of non-registered images were collected, including six sets of microscopic images, each captured from a different location within the same scene. These sets were processed using five algorithms. DOF-E was performed on individual images by SDENet, while all images within each set were fused by the other algorithms.
The experimental results depicted in Figure 10 and Figure 11, which feature microscopic images, reveal significant shortcomings of CWT in achieving satisfactory image fusion. Edge blurring and substantial loss of original image information, including color and detail, often result from its application. Similarly, noticeable artifacts along the edges are exhibited by the DSIFT and GFDF fusion algorithms implemented in MATLAB, as well as by the DL-based SESF fusion model, with these artifacts becoming more pronounced as the number of images used for fusion increases. When SDENet is used for single-image DOF-E, the artifacts and color loss associated with image fusion are mitigated, facilitating the recovery of more image information. The experimental results underscore the capability of SDENet to restore the DOF of the microscopic images.

4.4.1. Registered Data

In this section, several multi-image fusion algorithms that merge registered images are compared with SDENet, which processes single images independently. Nine sets of images (six to thirteen images per set) selected from the Lytro dataset and six sets of registered images (two images per set) selected from the MFI-WHU dataset [47] are used for evaluation. The experimental results using registered images are presented in Figure 12 (two images), Figure 13 and Figure 14 (eight images). Color distortion is observed in the results of CWT, and compared with the DL-based SESF model, the MATLAB-based DSIFT and GFDF models still exhibit incomplete fusion and artifacts in certain regions. The visual output of SDENet closely resembles the results of multi-image fusion, and the experiments demonstrate its capability to perform DOF-E on a single image [48].

4.4.2. Single Image Selection Criteria

Based on the experimental results, when SDENet is employed for DOF-E, images with prominent main objects should be selected (if multiple images are available) to ensure greater clarity and visibility. SDENet performs best when the input image contains well-focused salient objects that provide strong structural guidance. In contrast, images with large uniform defocused regions and limited foreground context tend to yield less satisfactory results, as the network lacks sufficient cues to hallucinate missing details. This highlights a key limitation of single-image DOF-E compared to multi-image fusion, which can leverage complementary focal information from multiple captures.

4.5. Ablation Study

To ensure a fair comparison, all ablation variants were trained with identical settings: 3000 epochs, batch size 4, cosine learning rate decay from 1 × 10⁻⁴ to 1 × 10⁻⁶, Adam optimizer (weight decay 0.02), and fixed random seed 42. This ensures that the performance differences observed in Table 6 and Table 7 can be attributed solely to the architectural modifications. The effectiveness of the MsS Transformer and FEM in the proposed method is assessed through ablation experiments, which were conducted on the MSED-TEST(A) and DPDD test sets, as described below.
Table 6 and Table 7 illustrate the progression from a basic U-Net architecture to adding each module separately (MsS Transformer and FEM). The results demonstrate that the addition of each module enhances the performance of the model, underscoring the significance of both modules for overall performance.
We conducted additional ablation experiments to compare variants with different deepest resolutions, while keeping all other components identical. The results are presented in Table 8.
The results show that H/16 × W/16 × 16C achieves the best performance across all metrics. The choice of H/16 × W/16 × 16C as the deepest resolution balances receptive field expansion with spatial detail preservation. While deeper representations (H/32 × W/32 × 32C) offer larger receptive fields, they risk losing fine spatial details critical for DOF extension, particularly at object boundaries and textured regions.
To further validate the design choice of module ordering, we conducted ablation experiments comparing two variants on the DPDD test set: (1) our proposed order with FEM followed by MsS (FEM → MsS), and (2) the reversed order with MsS followed by FEM (MsS → FEM). Both variants were trained with identical settings to ensure fair comparison. The results are presented in Table 9.
The experimental results demonstrate that placing FEM before MsS achieves better performance across all metrics. This confirms that enriching features with multi-scale contextual information before self-attention is beneficial for DOF-E tasks, as it provides stronger structural cues for the subsequent attention mechanism.

4.6. Sensitivity Analysis and Robustness

To evaluate the robustness of SDENet, we conducted a sensitivity analysis on key hyperparameters. Regarding the window size in the DS-MSA module, we observed that a 7 × 7 window provides the best trade-off; smaller windows limit the receptive field, while larger ones increase computational cost without significant gain. For attention heads, a multi-head configuration (e.g., 4 or 8 heads) was found to be more stable than a single-head setup, as it allows the model to attend to different feature subspaces simultaneously. In terms of loss weighting, the combination of MS-SSIM and L1 loss ensures both structural consistency and pixel-level accuracy; varying the weight of MS-SSIM within a reasonable range (0.6 to 0.9) showed minimal impact on the final PSNR, demonstrating the stability of our loss formulation. These results indicate that our method is robust to parameter variations and generalizes well across different imaging scenarios.
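The loss weighting described above can be sketched as follows. For brevity this uses a simplified global (single-window) SSIM in place of the multi-scale MS-SSIM actually used in the paper, and the default weight alpha = 0.8 is our assumption within the stated stable range of 0.6 to 0.9.

```python
import numpy as np

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Simplified single-window SSIM over the whole image (not multi-scale)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def combined_loss(pred, target, alpha=0.8):
    """alpha-weighted structural term plus (1 - alpha) pixel-wise L1 term."""
    l1 = np.abs(pred.astype(np.float64) - target.astype(np.float64)).mean()
    return alpha * (1.0 - ssim_global(pred, target)) + (1.0 - alpha) * l1
```

The structural term drives the network toward consistent local contrast and structure, while the L1 term anchors per-pixel intensities; the reported insensitivity to alpha suggests the two terms are largely complementary rather than competing.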

5. Conclusions

This paper proposes a single-image depth-of-field extension (DOF-E) network, which enables the restoration of out-of-focus blur induced by factors such as imaging device imperfections. Leveraging the Multi-scale Self-attention (MsS) Transformer and Feature Enhancement Module (FEM), the network optimizes feature fusion across all scales—thereby expanding the depth-of-field freedom for single images, including microscopic imagery. The algorithm specifically addresses two critical limitations in existing approaches: the prohibitive cost and design complexity of hardware-based single-image DOF-E solutions, and the inherent challenges of data acquisition and registration fusion in multi-image fusion frameworks. Extensive experimental validations demonstrate that this method outperforms state-of-the-art deblurring and defocus deblurring algorithms, delivering superior robustness and DOF-E performance across both public datasets and realistic scenarios.

Author Contributions

X.Z.: validation, writing—review and editing and supervision. M.W.: conceptualization, methodology, writing—original draft. J.J.: writing—review and editing and supervision. Y.L.: validation, writing—review and editing and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Henan Science and Technology R&D Program Joint Fund Project (Young Scientists), grant number 225200810098 and Key R&D and Promotion Projects of Henan Province (Science and Technology Research), grant number 242102211008.

Data Availability Statement

The data is publicly available at https://github.com/miaomiao-Wen/SDENet (accessed on 26 February 2026).

Acknowledgments

The authors appreciate the support from the National Natural Science Foundation of China in identifying collaborators for this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sui, X.; Kuang, D.; Liu, G.; Ding, Y.; Meng, M.; Xi, R. Highly focused beam generated with a height tuned micro-optical structure for high contrast microscopic imaging. Opt. Express 2024, 32, 19308–19318. [Google Scholar] [CrossRef]
  2. Nguyen, C.M.; Chan, E.R.; Bergman, A.W.; Wetzstein, G. Diffusion in the dark: A diffusion model for low-light text recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 4146–4157. [Google Scholar]
  3. Kang, B.; Moon, S.; Cho, Y.; Yu, H.; Kang, S. Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 434–443. [Google Scholar]
  4. Hildén, P.; Shevchenko, A. Extended depth of field of an imaging system with an annular aperture. Opt. Express 2023, 31, 11102–11115. [Google Scholar] [CrossRef]
  5. Li, Y.; Wang, J.; Zhang, X.; Hu, K.; Ye, L.; Gao, M.; Cao, Y.; Xu, M. Extended depth-of-field infrared imaging with deeply learned wavefront coding. Opt. Express 2022, 30, 40018–40031. [Google Scholar] [CrossRef]
  6. Thériault, G.; De Koninck, Y.; McCarthy, N. Extended depth of field microscopy for rapid volumetric two-photon imaging. Opt. Express 2013, 21, 10095–10104. [Google Scholar] [CrossRef]
  7. Bo, Y.; Shubin, W. Research and design of a metasurface with an extended depth of focus in the near field. Appl. Opt. 2023, 29, 7621–7627. [Google Scholar] [CrossRef]
  8. Wang, W.; Chang, F. A multi-focus image fusion method based on Laplacian pyramid. J. Comput. 2011, 12, 2559–2566. [Google Scholar] [CrossRef]
  9. Wu, J.; Dou, J.; Kainerstorfer, J.M. Predicting cerebral partial pathlength and absorption changes using a deep learning model: A phantom study. In Proceedings of the Optica Biophotonics Congress: Biomedical Optics 2024 (Translational, Microscopy, OCT, OTS, BRAIN), Fort Lauderdale, FL, USA, 7–10 April 2024; Optica Publishing: Washington, DC, USA, 2024; p. JM4A.29. [Google Scholar]
  10. Zhang, C.; Zhang, Z.; Li, H.; He, S.; Feng, Z. Multi-focus image fusion via online convolutional sparse coding. Multimed. Tools Appl. 2024, 83, 17327–17356. [Google Scholar] [CrossRef]
  11. Zhao, F.; Zhao, W.; Lu, H.; Liu, Y.; Yao, L.; Liu, Y. Depth-Distilled Multi-Focus Image Fusion. IEEE Trans. Multimed. 2021, 25, 966–978. [Google Scholar] [CrossRef]
  12. Li, X.; Li, X.; Ye, T.; Cheng, X.; Liu, W.; Tan, H. Bridging the gap between multi-focus and multi-modal: A focused integration framework for multi-modal image fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1628–1637. [Google Scholar]
  13. Zhang, X. Deep learning-based multi-focus image fusion: A survey and a comparative study. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4819–4838. [Google Scholar] [CrossRef]
  14. Ma, H.; Liu, S.; Liao, Q.; Zhang, J.; Xue, J.H. Defocus image deblurring network with defocus map estimation as auxiliary task. IEEE Trans. Image Process. 2021, 31, 216–226. [Google Scholar] [CrossRef]
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  16. Levin, A.; Hasinoff, S.W.; Green, P.; Durand, F.; Freeman, W.T. 4D frequency analysis of computational cameras for depth of field extension. ACM Trans. Graph. 2009, 28, 1–14. [Google Scholar] [CrossRef]
  17. Dantas de Oliveira, A.; Rubio Maturana, C.; Zarzuela Serrat, F.; Carvalho, B.M.; Sulleiro, E.; Prats, C.; Veiga, A.; Bosch, M.; Zulueta, J.; Abelló, A.; et al. Development of a low-cost robotized 3D prototype for automated optical microscopy diagnosis: An open-source system. PLoS ONE 2024, 19, e0304085. [Google Scholar] [CrossRef] [PubMed]
  18. Nuñez, I.; Matute, T.; Herrera, R.; Keymer, J.; Marzullo, T.; Rudge, T.; Federici, F. Low cost and open source multi-fluorescence imaging system for teaching and research in biology and bioengineering. PLoS ONE 2017, 12, e0187163. [Google Scholar] [CrossRef]
  19. Qiu, X.; Li, M.; Zhang, L.; Yuan, X. Guided filter-based multi-focus image fusion through focus region detection. Signal Process. Image Commun. 2019, 72, 35–46. [Google Scholar] [CrossRef]
  20. Duan, Z.; Luo, X.; Zhang, T. Combining transformers with CNN for multi-focus image fusion. Expert Syst. Appl. 2024, 235, 121156. [Google Scholar] [CrossRef]
  21. Zhai, H.; Ouyang, Y.; Luo, N.; Chen, L.; Zeng, Z. MSI-DTrans: A multi-focus image fusion using multilayer semantic interaction and dynamic transformer. Displays 2024, 85, 102837. [Google Scholar] [CrossRef]
  22. Quan, Y.; Wu, Z.; Ji, H. Neumann network with recursive kernels for single image defocus deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5754–5763. [Google Scholar]
  23. Abuolaim, A.; Brown, M.S. Defocus deblurring using dual-pixel data. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 111–126. [Google Scholar]
  24. Zhang, Z.; Foo, L.G.; Rahmani, H. Performing defocus deblurring by modeling its formation process. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 5791–5801. [Google Scholar]
  25. Li, Y.; Fan, Y.; Xiang, X.; Demandolx, D.; Ranjan, R.; Timofte, R.; Gool, L. Efficient and Explicit Modelling of Image Hierarchies for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18278–18289. [Google Scholar]
  26. Dhavalagimath, S.S.; Rajesh, T.M.; Singh, R.K. Design of an Intelligent Method for Blur Classification and Adaptive Restoration Framework for Robust Aerial Image Deblurring Operations. Int. J. Inf. Technol. 2026, 1, 1–10. [Google Scholar] [CrossRef]
  27. Pham, B.D.; Tran, P.; Tran, A.; Pham, C.; Nguyen, R.; Hoai, M. Blur2Blur: Blur conversion for unsupervised image deblurring on unknown domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2804–2813. [Google Scholar]
  28. Yoon, S.; Park, I.K. Geometrically Consistent Light Field Synthesis Using Repaint Video Diffusion Model. In Proceedings of the International Conference on Pattern Recognition, Honolulu, HI, USA, 19–23 October 2025; pp. 145–160. [Google Scholar]
  29. Lee, M.; Heo, J.P. Noise-free optimization in early training steps for image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 2920–2928. [Google Scholar]
  30. Luo, X.; Xie, Y.; Qu, Y.; Fu, Y. SkipDiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 4017–4025. [Google Scholar]
  31. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
  32. Park, D.; Kang, D.U.; Kim, J.; Chun, S.Y. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  33. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  34. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  36. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 2, 47–57. [Google Scholar] [CrossRef]
  37. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  38. Abuolaim, A.; Afifi, M.; Brown, M.S. Improving single-image defocus deblurring: How dual-pixel images help through multi-task learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1231–1239. [Google Scholar]
  39. Ruan, L.; Chen, B.; Li, J.; Lam, M. AIFNet: All-in-focus image restoration network using a light field-based dataset. IEEE Trans. Comput. Imaging 2021, 7, 675–688. [Google Scholar] [CrossRef]
  40. Lee, J.; Son, H.; Rim, J.; Cho, S.; Lee, S. Iterative filter adaptive network for single image defocus deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2034–2042. [Google Scholar]
  41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  42. Cho, S.J.; Ji, S.W.; Hong, J.P.; Jung, S.W.; Ko, S.J. Rethinking coarse-to-fine approach in single image deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4641–4650. [Google Scholar]
  43. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  44. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831. [Google Scholar]
  45. Ruan, L.; Chen, B.; Li, J.; Lam, M. Learning to deblur using light field generated and real defocus images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16304–16313. [Google Scholar]
  46. Ma, B.; Yin, X.; Wu, D.; Ban, X. End-to-end learning for simultaneously generating decision map and multi-focus image fusion result. Neurocomputing 2022, 470, 204–216. [Google Scholar] [CrossRef]
  47. Zhang, H.; Le, Z.; Shao, Z.; Xu, H.; Ma, J. MFF-GAN: An unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion. Inf. Fusion 2021, 66, 40–53. [Google Scholar] [CrossRef]
  48. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Figure 1. Comparison of the workflows for multi-image DOF-E (left) and single-image DOF-E (right).
Figure 2. Architecture of SDENet. The network consists of multi-scale hierarchical design including (a) efficient MsS Transformer blocks and (b) Feature Enhancement Module.
Figure 3. The process of dynamic-sliding multi-head self-attention to calculate local attention on the block feature map.
Figure 4. The Feature Enhancement Module employs a three-level pooling structure to enhance the features input to the CAM and SAM and then concatenates the enhanced features.
Figure 5. The detailed implementation steps of SDENet.
Figure 6. Example of selected images by MSED.
Figure 7. Visual comparisons with MIMO-UNet, Uformer, MPRNet, Blur2blur, and SDENet(our) on MSED-TEST(B).
Figure 8. Visual comparisons with MTRNN, MIMO-UNet, Uformer, MPRNet, DRBNet, Blur2blur and SDENet(ours) on DPDD and RealDOF test set.
Figure 9. Visual comparisons with MIMO-UNet, Uformer, MPRNet, DRBNet, and SDENet (ours) on depth information recovery.
Figure 10. Example results demonstrate the fusing effects on unregistered input images using various methods, with SDENet performing DOF-E on a single image. This set of images provides 7 images; multi-focus image fusion fuses ‘a~g’, SDENet processes a single image, and ‘g’ has the best effect.
Figure 11. Example results demonstrate the fusion effects on unregistered input images using various methods, with SDENet performing DOF-E on a single image. This set of images includes 13 images. Multi-focus image fusion fuses ‘a~m’, SDENet processes a single image, and ‘h’ achieves the best effect.
Figure 12. Visual comparison with multi-image fusion (two images a and b) and SDENet, where SDENet performs DOF-E on a single image ‘b’.
Figure 13. Visual comparison with multi-image fusion and SDENet, where SDENet performs DOF-E on a single image. This set of images includes 8 images. Multi-focus image fusion fuses ‘a~h’, SDENet processes a single image, and ‘e’ achieves the best effect.
Figure 14. Visual comparison with multi-image fusion and SDENet, where SDENet performs DOF-E on a single image. This set of images includes 8 images. Multi-focus image fusion fuses ‘a~h’, SDENet processes a single image, and ‘d’ achieves the best effect.
Table 1. The advantages and disadvantages of the previous related works.
Category | Typical Methods | Advantages | Disadvantages
Hardware-based DOF-E | Coded apertures, focus-tunable lenses, active focus sweeping | High optical quality; directly captures sharp images | High cost; increased system complexity and bulkiness; computationally intensive
Multi-Focus Image Fusion (MFIF) | CNNs, GANs, Vision Transformers, Wavelet Transform | Produces exceptionally sharp images; utilizes multiple focal cues | Requires multiple captures (time-consuming); sensitive to misalignment and exposure variations
Single-Image Defocus Deblurring | Blur map estimation, non-blind deconvolution, end-to-end networks | Only requires a single input; no hardware modifications | Struggles with complex spatially varying blur; susceptible to residual blur or artifacts
Recent Generative & Attention Trends | GRL, SKA-based nets, latent diffusion models | Superior texture synthesis; dynamic handling of complex blur kernels | Extremely high computational cost; potential for non-physical artifacts
SDENet (ours) | MsS Transformer + FEM | Single input; avoids explicit blur map estimation; balances efficiency and quality | Requires large-scale specialized datasets (e.g., MSED) for training
Table 2. The data sources constituting the MSED dataset.
Training set (paired)
Dataset Name | DPDD | DLDP | LFDOF | DEDD | Total Quantity
Selection Quantity | 350 | 300 | 385 | 737 | 1772

TEST(A)
Dataset Name | DPDD | DLDP | DEDD | Lytro | Total Quantity
Selection Quantity | 76 | 200 | 63 | 91 | 387

TEST(B)
Dataset Name | Xiaomi | Apple | Vivo | Honor | Total Quantity
Selection Quantity | 26 | 29 | 24 | 21 | 100
Table 3. Quantitative comparison of MIMO-UNet, MPRNet, DRBNet, Uformer, Blur2blur, and SDENet (ours) on MSED-TEST(A) and MSED-TEST(B).
| Method | SD↑ (TEST A) | Entropy↑ (TEST A) | SD↑ (TEST B) | Entropy↑ (TEST B) |
|---|---|---|---|---|
| MIMO-UNet [42] | 60.0311 | 7.3044 | 59.0302 | 7.0881 |
| MPRNet [44] | 60.0024 | 7.2846 | 60.3419 | 7.0500 |
| DRBNet [45] | 59.0151 | 7.3254 | 59.5713 | 7.0602 |
| Uformer [43] | 60.3174 | 7.3423 | 63.3031 | 7.0851 |
| Blur2blur [27] | 60.4036 | 7.3115 | 64.0659 | 7.0843 |
| SDENet (ours) | 60.5944 | 7.3256 | 65.4211 | 7.1235 |
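The no-reference metrics SD and Entropy reported above (and in Tables 5 and 7) have simple closed forms. As a point of reference, the sketch below assumes the common definitions — SD as the standard deviation of pixel intensities and Entropy as the Shannon entropy (in bits) of the 256-bin grayscale histogram; the authors' exact evaluation code may differ in detail.

```python
import numpy as np

def sd_and_entropy(gray: np.ndarray) -> tuple[float, float]:
    """No-reference sharpness cues on an 8-bit grayscale image:
    SD      = standard deviation of pixel intensities,
    Entropy = Shannon entropy (bits) of the 256-bin histogram."""
    vals = gray.astype(np.float64)
    sd = float(vals.std())
    hist, _ = np.histogram(vals, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; 0 * log2(0) is taken as 0
    entropy = float(-(p * np.log2(p)).sum())
    return sd, entropy

# A perfectly flat image carries no variation or information:
flat = np.full((64, 64), 128, dtype=np.uint8)
print(sd_and_entropy(flat))  # (0.0, 0.0)
```

Higher SD and entropy indicate richer contrast and texture, which is why both rise when defocused regions are restored.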
Table 4. Quantitative comparison of MTRNN, MPRNet, MIMO-UNet, Uformer, DRBNet, Blur2blur, and SDENet (ours) on the DPDD and RealDOF test sets.
| Method | PSNR↑ (DPDD) | SSIM↑ (DPDD) | LPIPS↓ (DPDD) | PSNR↑ (RealDOF) | SSIM↑ (RealDOF) | LPIPS↓ (RealDOF) |
|---|---|---|---|---|---|---|
| MTRNN [32] | 24.7811 | 0.7588 | 0.2815 | 22.6408 | 0.6482 | 0.4093 |
| MPRNet [44] | 24.7882 | 0.7673 | 0.2869 | 22.6356 | 0.6574 | 0.4197 |
| MIMO-UNet [42] | 26.3792 | 0.8251 | 0.2048 | 25.0924 | 0.7849 | 0.2515 |
| Uformer [43] | 26.8938 | 0.8444 | 0.1638 | 25.5951 | 0.8117 | 0.2168 |
| DRBNet [45] | 26.7756 | 0.8274 | 0.1411 | 24.6081 | 0.7284 | 0.2251 |
| Blur2blur [27] | 26.7992 | 0.8325 | 0.1406 | 25.3172 | 0.8254 | 0.1905 |
| SDENet (ours) | 26.9816 | 0.8461 | 0.1375 | 26.1756 | 0.7977 | 0.1788 |
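The full-reference scores above rely on PSNR (higher is better), SSIM, and the learned LPIPS distance. PSNR has a standard closed-form definition; the sketch below assumes 8-bit images with peak value 255 and is only an illustrative reimplementation, not the authors' evaluation code (SSIM and LPIPS require windowed statistics and a pretrained network, respectively, and are omitted).

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a ground-truth
    all-in-focus image and a restored one, via mean squared error."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

# An image offset from its reference by a constant 8 gray levels:
ref = np.zeros((32, 32), dtype=np.uint8)
shifted = (ref + 8).astype(np.uint8)
print(round(psnr(ref, shifted), 2))  # 30.07 dB
```

On this scale, the roughly 2 dB gap between MTRNN/MPRNet and SDENet in Table 4 corresponds to a large reduction in mean squared error, since PSNR is logarithmic in MSE.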
Table 5. Quantitative comparison of MTRNN, MIMO-UNet, Uformer, MPRNet, DRBNet, Blur2blur, and SDENet (ours) on the DLDP test set.
| Method | SD↑ | Entropy↑ |
|---|---|---|
| MTRNN [32] | 61.3427 | 7.2905 |
| MIMO-UNet [42] | 61.3427 | 7.2983 |
| Uformer [43] | 61.7846 | 7.3184 |
| MPRNet [44] | 62.2771 | 7.2919 |
| DRBNet [45] | 60.6655 | 7.2893 |
| Blur2blur [27] | 62.3564 | 7.2988 |
| SDENet (ours) | 62.2367 | 7.3240 |
Table 6. Ablation experiments conducted on the DPDD test set.
| FEM | MsS Transformer | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| – | – | 26.1636 | 0.8209 | 0.2093 |
| ✓ | – | 26.2849 | 0.8248 | 0.1920 |
| – | ✓ | 26.7472 | 0.8324 | 0.1829 |
| ✓ | ✓ | 26.9816 | 0.8461 | 0.1375 |
Table 7. Ablation experiments conducted on the MSED-TEST(A).
| FEM | MsS Transformer | SD↑ | Entropy↑ |
|---|---|---|---|
| – | – | 60.2863 | 7.3044 |
| ✓ | – | 60.3807 | 7.3111 |
| – | ✓ | 60.2712 | 7.3256 |
| ✓ | ✓ | 60.5944 | 7.3287 |
Table 8. Ablation study on the deepest feature resolution evaluated on the DPDD test set.
| Deepest Resolution | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| H/8 × W/8 × 8C | 26.7132 | 0.8380 | 0.1520 |
| H/16 × W/16 × 16C | 26.9816 | 0.8461 | 0.1375 |
| H/32 × W/32 × 32C | 26.8901 | 0.8420 | 0.1440 |
Table 9. Ablation study on module ordering evaluated on the DPDD test set.
| Module Ordering | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| MsS → FEM | 26.7824 | 0.8430 | 0.1587 |
| FEM → MsS (Ours) | 26.9816 | 0.8461 | 0.1375 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, X.; Wen, M.; Jia, J.; Liu, Y. SDENet: A Novel Approach for Single Image Depth of Field Extension. Algorithms 2026, 19, 216. https://doi.org/10.3390/a19030216


