1. Introduction
Image clarity fundamentally depends on the distance between the camera sensor and objects. Optimal sharpness occurs when objects lie precisely on the focal plane, whereas deviations cause defocus blur, a prevalent degradation that compromises detail visibility and edge definition. Despite continuous advances, existing deblurring methods often fail in complex scenarios where spatially varying blur obscures boundaries between focused and defocused regions. These limitations critically hinder downstream vision tasks such as object recognition, 3D reconstruction, and scene understanding [1,2,3]. Consequently, effective defocus deblurring remains essential for generating artifact-free, detail-rich images to support advanced image processing [4,5,6,7].
Initial research on defocus deblurring primarily relied on traditional optimization methods. These approaches utilized global and local structural priors within images to guide deblurring tasks. However, they often faced interference from image blurriness and noise during the prior acquisition phase. This resulted in poor restored image quality, with noticeable artifacts or excessive smoothing at boundaries between blurred and clear regions. Subsequently, deep learning methods [8,9,10] gained prominence, achieving superior deblurring outcomes compared to traditional methods. These methods typically follow two implementation patterns. One pattern [11,12,13] directly regresses clear image pixel values through deep networks, involving steps like learning feature representations from blurred images, enhancing these features, and regressing to clear outputs. Despite this, images from this pattern frequently exhibit over-smoothing, artifacts, or unnatural transitions at blur–clear boundaries. Alternatively, a second pattern first applies traditional methods for preliminary deblurring and then uses deep learning to iteratively optimize the results. This pattern reduces training costs by avoiding separate models for different blur levels while also mitigating some issues like over-smoothing. Nevertheless, it still struggles with unnatural transitions or artifacts at boundaries. Building on this second pattern, GKMNet [14] demonstrates excellent performance in reducing over-smoothing and artifacts. However, experimental observations reveal a persistent issue: unnatural transitions or artifacts often remain at blur–clear boundaries, particularly in complex scenarios, and iterative optimization fails to eliminate them. This problem arises because GKMNet’s feature fusion incorporates adjacent regions that may belong to different depth levels than the central region. As blur complexity increases, more regions are fused, raising the probability of selecting incorrect adjacent areas and exacerbating artifacts. Moreover, the iterative optimization process is ineffective due to insufficient feature extraction capabilities. Specifically, the initial convolution layer’s limited receptive field yields incomplete local features, hindering precise detail restoration. Additionally, global feature extraction via max pooling causes information loss and noise sensitivity, limiting artifact correction accuracy.
To address these limitations, we propose GKMANet, a self-attention enhanced Gaussian Kernel Mixture Network. After initial deblurring akin to GKMNet [14], GKMANet first computes local scores from blur–clear boundary distributions to identify artifact-prone regions with abnormally low gradients, which are then optimized to alleviate unnatural transitions. Furthermore, we introduce a Dual-Attention CNN (DA-CNN) to replace the feature extractor. DA-CNN expands convolutional processing to resolve limited receptive fields in local feature extraction and employs a learnable weighting strategy for global feature aggregation, eliminating information loss and noise sensitivity inherent in max pooling.
The main contributions of this article are as follows:
An enhanced Gaussian mixture model coefficient estimation module is designed, incorporating a self-attention mechanism. Built upon a scale-recursive architecture, this module significantly improves the model’s sensitivity in capturing the boundaries between blurry and clear regions. Furthermore, it demonstrates strong universality in determining the defocus blur at object boundaries within complex scenes.
A novel image multi-scale feature fusion network module is introduced. This network captures fine-grained pixel-level edge details through shallow convolutional layers, while simultaneously fusing object-level contour context information within its deeper layers. This explicit modeling of multi-scale characteristics, from micro to macro perspectives, effectively prevents boundary breakage or excessive smoothing that often results from relying on a single receptive field.
Comprehensive experimental results on multiple standard datasets demonstrate the effectiveness of the proposed GKMANet model. Specifically, GKMANet achieves superior Peak Signal-to-Noise Ratio (PSNR) in restoring edge details at junctions compared to previous approaches. Additionally, a significant improvement in computational efficiency is observed.
2. Related Work
2.1. Two-Stage Defocus Deblurring
Let us consider the problem of Single Image Defocus Deblurring (SIDD). Traditional methods typically employ a two-stage strategy. In the first stage, a defocus map is estimated to obtain blur information. In the second stage, image restoration is performed using non-blind deconvolution techniques [15,16,17]. Although this approach achieves acceptable recovery results in some static scenarios, it faces two key challenges. Firstly, it depends heavily on the accuracy of defocus map estimation, which makes the method extremely sensitive to errors. Secondly, the overall computational cost is substantial, limiting its application to high-resolution images or real-time situations.
Specifically, during defocus map estimation, traditional non-learning methods often obtain an initial sparse defocus map through edge detection or sparse region analysis [16,17,18,19,20]. They then infer a dense defocus map using propagation algorithms such as Poisson inpainting [21]. Although these methods can theoretically restore the spatial distribution of the blur kernel, their reliance on sparse sampling and uniform propagation assumptions often leads to error accumulation in boundary areas. This causes artifacts at transitions between blurred and clear regions, where the blur kernel scale exhibits continuous gradient changes. If the defocus map estimation lacks sub-pixel-level accuracy, it severely degrades subsequent deblurring quality. Meanwhile, to compensate for propagation inaccuracies, such methods frequently introduce complex global optimization strategies, significantly increasing computational costs and creating a key bottleneck for practical use.
To improve defocus map estimation accuracy, researchers have proposed various methods to reduce error propagation. Early approaches relied on manually designed features, such as sparse coding based on blur atoms [16], maximum rank of gradient domain blocks in different directions [20], and re-blurred gradient amplitude from scale-consistent edge maps [18]. While these methods enhance the estimation quality of sparse defocus values to some extent, they still depend on techniques like Laplacian map segmentation [22], weighted averaging [18], or regression trees [23] to extend these values to dense maps across the entire image. Furthermore, the propagation process itself can introduce additional errors. With advancements in deep learning, convolutional neural networks have been utilized to learn features related to defocus blur and assist in estimation. However, most current deep methods retain a two-stage structure involving edge estimation followed by propagation filling. Their core process remains similar to traditional propagation mechanisms, leaving room for improvement in accuracy and stability for boundary transition areas.
Although existing methods have made progress in defocus map estimation, they still confront two major challenges. Firstly, in regions where blurred and clear areas transition, the point spread function should be modeled as a continuous mixture distribution combining a sharp pulse function and a blurred Gaussian kernel. Traditional methods that derive the point spread function from defocus maps cannot accurately capture this mixed characteristic, resulting in artifacts during non-blind deconvolution of boundary regions. Secondly, the iterative inversion of non-uniform blur operators demands significant computational resources. Unlike uniform blur operators such as convolution, which can be inverted efficiently using the Fast Fourier Transform (FFT), non-uniform blur operators lack fast numerical inversion methods. This forces traditional approaches to adopt simplified uniform kernel assumptions in boundary regions, further sacrificing recovery accuracy.
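To make the second challenge concrete, the following minimal sketch illustrates why a uniform (spatially invariant) blur is cheap to invert: under an assumed circular boundary condition, the FFT diagonalizes the blur, so inversion reduces to a pointwise division. The kernel, image size, and function names are illustrative assumptions and are not taken from any cited method.

```python
import numpy as np

def pad_kernel(kernel, shape):
    """Embed a small odd-sized kernel in a zero image, centred at index (0, 0)."""
    padded = np.zeros(shape)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Circular convolution expects the kernel centre at the origin.
    return np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))

def uniform_blur(x, kernel):
    """Circular convolution of x with a single shared kernel, via FFT."""
    k_hat = np.fft.fft2(pad_kernel(kernel, x.shape))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * k_hat))

def uniform_deblur(y, kernel, eps=1e-12):
    """Invert the uniform blur by pointwise division in the frequency domain."""
    k_hat = np.fft.fft2(pad_kernel(kernel, y.shape))
    return np.real(np.fft.ifft2(np.fft.fft2(y) / (k_hat + eps)))

rng = np.random.default_rng(0)
x = rng.random((16, 16))                 # hypothetical sharp image
k1d = np.array([0.2, 0.6, 0.2])          # hypothetical separable blur
kernel = np.outer(k1d, k1d)
y = uniform_blur(x, kernel)
x_rec = uniform_deblur(y, kernel)        # noise-free case only
```

When the kernel changes from pixel to pixel, the operator is no longer a single convolution, and this frequency-domain shortcut is unavailable.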
2.2. End-to-End Deblurring
Although end-to-end deep neural networks (DNNs) [24] have demonstrated significant potential for this deblurring task, existing methods still rely on specific hardware configurations or dual-view input. The pioneering work in this area traces back to Abuolaim and Brown, who trained a U-shaped CNN to predict a full-focus image from dual-view sensor captures. This approach alleviates blur kernel estimation errors by leveraging multi-view geometric constraints. It should be noted that in single-view scenarios, due to the absence of cross-view depth cues, models struggle to capture the spatial gradient characteristics at blurry–clear boundaries. Although Abuolaim and Brown’s model outperforms traditional two-stage methods significantly, its performance declines markedly when processing a single defocused image. This limitation stems from a lack of clear mechanisms for handling spatially variable defocus blurring within single-view inputs.
In the context of the single-image defocus deblurring (SIDD), a key challenge for traditional methods is effectively managing the spatial variations induced by defocus blur. Current network architectures often depend on specialized hardware or dual-view inputs, which constrains their applicability to single-view images. In particular, single-view defocus blur exhibits pronounced spatial variability in boundary regions. Therefore, edges of objects may present a multi-scale mixture of blur kernels caused by local depth variations. Traditional single-view approaches lack explicit mechanisms for modeling the spatial correlation of blur kernels. This deficiency leads to discretization errors during blur kernel estimation at boundaries, subsequently causing artifacts such as ringing effects or texture confusion.
2.3. Deep Learning Deblurring
Research on deep learning applications for spatially varying motion deblurring has grown substantially in recent years, particularly for dynamic scenes [25,26,27,28,29,30,31,32]. Most existing methods, however, rely on dynamic scene features such as optical flow or temporal information. Consequently, they prove unsuitable for addressing defocus deblurring problems. Defocus blur differs fundamentally from motion blur in its inherent characteristics: motion blur typically exhibits rapidly changing dynamic patterns, whereas defocus blur manifests as isotropic and pixel-level consistent blur. Due to this distinction, current deblurring methods predominantly focus on motion blur and consequently fail to effectively handle the spatial variability inherent in defocus blur. In contrast, deep neural networks employ coarse-to-fine structures based on multi-scale extensions of fixed-point iteration for Gaussian kernel mixture deblurring. These architectures utilize scale-recursive attention mechanisms to predict relevant coefficients within this framework.
A particularly challenging aspect of defocused images involves distinguishing boundaries between sharp and blurred regions. This boundary ambiguity significantly complicates the deblurring task, especially when local image areas exhibit substantial variations in blur intensity. Traditional methodologies face difficulties preserving detail restoration accuracy under such conditions. Consequently, designing efficient deblurring models capable of handling transition zones between blur and clarity has emerged as a critical research challenge for defocus blur characteristics.
Optimization unrolling represents a prevalent approach for designing deep neural networks (DNNs) dedicated to image restoration. Studies across image deblurring have widely adopted this methodology [33,34,35,36]. These optimization-unrolled deep networks typically target uniform image deblurring scenarios. This design preference arises because uniform blur operators permit efficient inverse computations through the Fast Fourier Transform (FFT), ensuring high effectiveness. However, such methods prove inapplicable to non-uniform deblurring problems like Single Image Defocus Deblurring, as no known fast numerical scheme currently exists for inverting non-uniform blur operators. Although certain studies attempt to address unknown point-spread functions through blind deblurring techniques, most remain confined to uniform blur models. This limitation hinders their ability to handle complex non-uniform blur scenarios. These constraints become particularly evident when processing transitions between sharp and blurred areas, frequently resulting in either excessive restoration or detail loss in existing methodologies.
It should be noted that although our method belongs to the category of deep learning-based image restoration, it differs fundamentally from existing deep unfolding (a.k.a. optimization unrolling) frameworks. Deep unfolding methods parameterize each iterative step of a model-based optimization problem into network layers, retaining the multi-stage iterative solution structure. Most existing designs are limited to uniform deblurring scenarios that support fast inverse computation of blur operators via FFT. In contrast, our GKMANet does not adopt the iterative unrolling design paradigm. We reformulate the spatially varying defocus blur modeling problem as a lightweight end-to-end weight prediction task for pre-defined Gaussian basis kernels. This task requires only a single forward pass to obtain all prediction results and completely avoids the dependency on fast inversion of non-uniform blur operators that plagues deep unfolding methods. Furthermore, our self-attention-enhanced mechanism explicitly models the continuous variation in blur kernels at sharp–blurred boundaries, which addresses the limitation of the local block-wise uniform assumptions commonly used in existing deep unfolding methods for non-uniform deblurring.
3. Method Overview
Spatial-Channel Attention Maps have gained widespread adoption in image processing in recent years, demonstrating considerable effectiveness in tasks such as image deblurring, super-resolution, restoration, and object detection. This approach guides networks to prioritize key information areas through attention regulation across spatial and channel dimensions, thereby enhancing model performance. However, conventional attention mechanisms typically rely on local weighting operations during feature enhancement. Their limited receptive fields present significant constraints when processing blurred images. Specifically, these mechanisms struggle to establish global feature connections between blurred and sharp region boundaries. Moreover, when handling defocused scenes where blur kernel scales vary continuously with spatial position, they exhibit inaccurate modeling of transitional zones, leading to artifacts or detail loss.
To address these limitations, we propose an enhanced spatial-channel attention framework incorporating self-attention mechanisms. Self-attention possesses the capability to model long-range pixel dependencies, explicitly capturing gradual blur kernel variations in boundary regions. Consequently, it precisely distinguishes sharp and blurred features across boundaries, significantly improving discriminative capacity for blurred areas. Our model integrates a self-attention module into conventional spatial-channel attention networks, enabling adaptive adjustment of regional weights across both spatial and channel dimensions. This integration enhances adaptability to unevenly distributed blur patterns and elevates restored image quality.
Although the existing Gaussian Kernel Mixture Network (GKMNet) has advanced defocus deblurring, it confronts critical limitations in complex scenarios. Firstly, GKMNet tends to generate artifacts or unnatural transitions at blur–sharp boundaries. Secondly, its global feature aggregation employs maximum pooling operations, which frequently cause information loss and exhibit noise sensitivity. This constrains the model’s adaptability to complex ambiguous patterns. Furthermore, GKMNet’s loss function relies solely on spatial domain constraints, hindering effective restoration of high-frequency details. This often results in excessive smoothing of edges and textures.
To overcome these challenges, we introduce the Gaussian Kernel Mixture Attention Network (GKMANet), featuring an adaptive attention mechanism. Our approach explicitly enhances sensitivity to blurred boundaries by dynamically learning channel-wise frequency responses and spatial gradient distributions, effectively mitigating boundary artifacts. Additionally, we design a Dual-Attention Convolutional Neural Network (DA-CNN). This architecture enhances local feature extraction in initial convolutional layers while employing a learnable weight-driven averaging strategy for global feature aggregation, substantially improving feature accuracy and noise robustness. Furthermore, we propose a novel loss function incorporating frequency reconstruction constraints. This addition specifically targets high-frequency detail restoration, effectively countering the excessive smoothing prevalent in conventional loss functions. The framework of our proposed methodology is illustrated in Figure 1.
4. Detail Description of the Method
Consider a defocused image, $y$. The relationship between it and the full-focus image $x$ is expressed by Equation (1):

$$y = Bx + \epsilon \tag{1}$$

where $\epsilon$ represents the measurement noise. The symbol $B$ denotes a linear operator. Its specific definition is provided in Equation (2):

$$(Bx)[p,q] = \sum_{i,j} k_{i,j}[p-i,\, q-j]\; x[i,j] \tag{2}$$

Each pixel located at position $[i,j]$ is associated with a distinct blur kernel, $k_{i,j}$. This kernel, also known as the point spread function (PSF), is determined by the distance from the focal plane. Typically, these pixel-level PSFs are approximated as Gaussian or circular kernels. It is important to recognize that these pixel-level PSFs are generally unknown due to the lack of additional scene depth information. Consequently, single-image defocus deblurring presents a challenging nonlinear inverse problem, requiring the estimation of both $B$ and $x$ from Equation (1).
Within the core processing module of the proposed network, the original structure has undergone optimization and enhancement. Several novel functional modules have been incorporated, and the overall architecture has been adjusted accordingly. The specific improvements made are illustrated in Figure 2.
This work incorporates a frequency reconstruction loss term into the overall loss function. The primary purpose of this addition is to enhance the network’s capability to recover detailed image information by leveraging insights from the frequency domain perspective. Experimental results demonstrate that this strategy contributes significantly to improved model performance. The specific details and implications of these improvements are shown in Figure 3.
4.1. Extracting Multi-Scale Features to Support Attention Modeling
The Feature Extraction (FE) module comprises a sequential connection of three encoder modules and three decoder modules. Each encoder module consists of a convolutional layer followed by a residual block. The residual block itself consists of two convolutional layers separated by a ReLU activation function. The first encoder module generates a feature map with 32 channels. Each subsequent encoder module doubles the number of channels relative to its predecessor. This channel increase is accompanied by a reduction in spatial dimensions, achieved through a convolutional layer with a stride of 2. The final encoder module maintains the channel count generated by the preceding module. In contrast, each decoder module contains a residual block followed by a transposed convolution layer. The transposed convolution doubles the spatial dimensions of the feature map while halving the number of channels.
4.2. Predicting Spatial Channel Attention to Guide Feature Enhancement
This work introduces an attention prediction path designed specifically to achieve adaptive modeling of features within blurry regions. The path sequentially integrates three key components: Global Average Pooling (GAP), a two-layer Long Short-Term Memory (LSTM) structure, and a Softmax module. Respectively, these components perform global information aggregation, modeling of blur-sensitive features, and attention weight normalization. The path processes features originating from the bottleneck of the Feature Extractor (FE). Initially, GAP generates a compact global feature descriptor. Subsequently, this descriptor is fed into the two-layer LSTM structure. The ReLU activation function separates these two LSTM layers. The first LSTM layer reduces the feature dimension to one quarter of its original size. Conversely, the second LSTM layer then expands this reduced dimensionality. To effectively capture blur information at multiple scales and enhance restoration performance across these scales, the hidden unit representations from both LSTM layers are concatenated cross-scale. Ultimately, the Softmax function normalizes the resulting coefficient vector, producing the final attention weights. The normalized coefficient vector output by this attention prediction path corresponds exactly to the spatially varying combination weights of pre-defined Gaussian bases in our Gaussian Kernel Mixture model. Softmax normalization guarantees that the sum of weights at each spatial position equals 1, which satisfies the probability modeling requirement of the Gaussian mixture framework.
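The normalization property stated above can be checked with a minimal NumPy sketch. It covers only the final Softmax step, applied over the kernel axis of a hypothetical score tensor; the GAP and LSTM stages are not modeled, and the tensor sizes and names are illustrative assumptions.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
K, H, W = 5, 4, 6                      # hypothetical: 5 basis kernels, 4 x 6 map
logits = rng.normal(size=(K, H, W))    # stand-in for raw attention scores
beta = softmax(logits, axis=0)         # mixing weights, one map per kernel

# At every spatial position the K weights are positive and sum to 1.
weight_sums = beta.sum(axis=0)
```

Because the weights at each position form a convex combination, the mixed kernel at that position preserves the total intensity whenever every basis kernel is itself normalized.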
4.3. Modeling Long-Range Dependencies to Enhance Perception of Blurry Boundaries
The Self-Attention Module (SEAT) processes feature maps generated by the Feature Extraction Module (FE). This mechanism possesses the capability to model long-range dependencies between pixels. Consequently, it can explicitly capture the gradual variations in the blur kernel within boundary regions. This ability enables SEAT to precisely distinguish between sharp and blurry features on either side of a boundary, thereby significantly enhancing the discrimination capability for blurry areas. This work incorporates the SEAT module into a traditional spatial-channel attention network. The integration enables adaptive adjustment of regional weights across both spatial and channel dimensions. Consequently, this enhancement improves the model’s adaptability and restoration quality for images characterized by unevenly distributed blur features.
The self-attention mechanism captures complex interdependencies across spatial locations and feature channels within an image. This capability allows the model to more accurately adjust the weights assigned to different regions and channels. The mechanism proves particularly beneficial in challenging scenarios where boundaries between sharp and blurry regions are difficult to distinguish. Conventional methods often encounter limitations such as over-correction or inadequate handling of blurry areas in these situations. The network proposed in this paper utilizes the self-attention mechanism to strengthen attention on image details. This strategy enables the model to handle ambiguous boundary regions with greater flexibility. Consequently, it effectively distinguishes and restores details within both blurry and sharp regions.
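As an illustration of how self-attention relates every pixel to every other pixel, the sketch below implements generic scaled dot-product self-attention over a flattened feature map. It is not the SEAT module itself: the projection matrices, feature sizes, and function name are hypothetical, and the actual module may differ in structure.

```python
import numpy as np

def self_attention(feat, w_q, w_k, w_v):
    """Scaled dot-product self-attention over all spatial positions.

    feat: array of shape (C, H, W); w_q, w_k: (C, D); w_v: (C, C_out).
    Returns the attended features (C_out, H, W) and the (H*W, H*W) attention map.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T              # one token per pixel: (N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])         # pairwise pixel affinities
    scores -= scores.max(axis=1, keepdims=True)    # stabilise the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # each row sums to 1
    out = attn @ v                                 # aggregate over all pixels
    return out.T.reshape(-1, h, w), attn

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))                  # hypothetical FE output
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 8))
out, attn = self_attention(feat, w_q, w_k, w_v)
```

Each row of `attn` weights all H × W positions, so a pixel near a blur–sharp boundary can draw on distant pixels at a similar blur level rather than only on its immediate neighbours.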
4.4. Performing Gaussian Convolution in Place of Coarse Feature Aggregation
The Gaussian Convolution Module (GCM) implements a straightforward group convolution layer. Its primary function is to apply a series of pre-defined 2D Gaussian kernels independently to the R, G, and B channels of the input. The module utilizes a predefined set of odd kernel sizes: 1 × 1, 3 × 3, 5 × 5, …, M × M. For each kernel size m × m with m > 1, two distinct Gaussian kernels are generated, with standard deviations of (m−1)/4 and (m−2)/4, respectively. The 1 × 1 kernel functions as the Dirac delta kernel and therefore does not undergo this generation process. The (M−1)/2 kernel sizes larger than 1 thus contribute M−1 kernels, and together with the delta kernel the GCM employs a total of M distinct Gaussian kernels. Crucially, the parameters defining these kernels remain fixed and consistent across all spatial scales within the processing.
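The kernel bank described above can be sketched as follows. The sketch assumes that kernel sizes run over the odd values 1, 3, …, M and that (m−1)/4 and (m−2)/4 are Gaussian standard deviations; the value M = 21 and the function names are hypothetical choices for illustration.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalised isotropic 2D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def build_kernel_bank(M):
    """Delta kernel plus two Gaussians for every odd size m in 3..M."""
    if M < 1 or M % 2 == 0:
        raise ValueError("M is assumed to be a positive odd integer")
    bank = [np.ones((1, 1))]                       # 1 x 1 Dirac delta kernel
    for m in range(3, M + 1, 2):
        for sigma in ((m - 1) / 4.0, (m - 2) / 4.0):
            bank.append(gaussian_kernel(m, sigma))
    return bank

bank = build_kernel_bank(21)                       # hypothetical M = 21
num_kernels = len(bank)                            # 1 + 2 * (M - 1) / 2 = M
```

Under these assumptions the bank contains exactly M kernels, which matches the count stated in the text.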
4.5. Fitting Spatially Adaptive Blur Kernels to Model Complex Defocus Distributions
This work introduces the Gaussian Kernel Mixture (GKM) model to better address the spatially uneven distribution of defocus blur. The GKM accurately models blur patterns across diverse image regions, demonstrating strong adaptability to local blur characteristics, particularly under uneven blur intensity distributions. Furthermore, the network integrates a self-attention mechanism. This integration enables the model to dynamically prioritize severely blurred regions and autonomously adjust restoration focus across different areas. Consequently, it effectively mitigates the overall image quality degradation often caused by ambiguous blur–sharp boundaries, significantly enhancing both the balance and precision of the deblurring effect. Regarding the loss function, an improved formulation incorporating frequency reconstruction loss is proposed. This loss term constrains the restoration process within the frequency domain, with specific emphasis on high-frequency detail recovery. Consequently, it effectively counters the excessive spatial smoothing frequently observed in traditional loss functions. The frequency reconstruction loss enables superior preservation of details and textures, achieving marked improvement in reconstructing high-frequency information like edges and textures. This innovation yields restored images with enhanced clarity, richer detail, and demonstrably improved image quality and restoration fidelity. Overall, the proposed method synergistically combines the Gaussian kernel mixture model, the attention mechanism, and the enhanced loss function, achieving superior performance compared to existing methods in single-image defocus deblurring.
The GKMA model proposed in this paper takes advantage of the strong homogeneity and smoothness of the defocus point spread function (PSF), providing an efficient parameterization of the defocus PSF. This form represents each kernel $k_{i,j}$ of the defocus PSF as a linear combination of predefined Gaussian kernels, as shown in Equation (3):

$$k_{i,j} = \sum_{k=1}^{K} \beta_k[i,j]\, G(\sigma_k) \tag{3}$$

where $G(\sigma_k)$ represents a two-dimensional Gaussian kernel with variance $\sigma_k^2$, and $\beta_k$ represents the mixing coefficient matrix used for the k-th Gaussian kernel within the mixture. In our decoupled design, the Gaussian basis kernels $G(\sigma_k)$ are pre-defined with fixed variances that cover the full range of possible defocus blur degrees, including the Dirac delta kernel used to model all-in-focus sharp regions.
To enhance feature extraction and adaptability to complex, ambiguous scenarios, the network depth is increased beyond the original architecture. Specifically, additional convolutional layers and dense connection blocks are introduced. This expanded depth enables the model to capture a broader range of multi-scale features, from microscopic details to macroscopic structures. This enhanced representational capacity significantly improves the model’s ability to learn diverse defocus blur characteristics and its deblurring performance on complex scenes. Since GKMA effectively models a wide range of isotropic (and potentially anisotropic) kernels, Equation (3) offers a more accurate representation of the actual defocus PSF than commonly used single Gaussian or disk models.
The GKMA model offers distinct advantages. Its weighted combination of isotropic Gaussian kernels allows accurate representation of numerous non-Gaussian isotropic kernels, providing a closer fit to real-world defocus PSFs than single-parameter Gaussian or disk kernels. Notably, the GKMA model simplifies to a single-parameter Gaussian kernel when $K = 1$. Furthermore, while another approach [37] models the defocus PSF beyond a single Gaussian kernel using a generalized Gaussian function with complex nonlinear parameter estimation, the GKMA model utilizes a predefined linear model based on $\{G(\sigma_k)\}_{k=1}^{K}$. This structure results in more straightforward estimation of the parameters $\{\beta_k\}_{k=1}^{K}$. Thus, Equation (1) is reformulated as Equation (4):

$$y = \sum_{i,j} k_{i,j} \otimes \left(\delta_{i,j} \odot x\right) + \epsilon \tag{4}$$

where $y$ and $x$ are the defocused and full-focus images, $\delta_{i,j}$ denotes the Dirac function centered at position $[i,j]$, and ⊗ and ⊙ denote the two-dimensional convolution operation and element-wise multiplication operation, respectively. Substituting the Gaussian mixture of Equation (3) for each $k_{i,j}$ and exchanging the order of summation, the defocus blur model based on GKM is derived, as shown in Equation (5):

$$y = \sum_{k=1}^{K} G(\sigma_k) \otimes \left(\beta_k \odot x\right) + \epsilon \tag{5}$$

All network parameters, including the weights of the attention prediction branch, are optimized end-to-end via back-propagation under the constraint of our proposed joint loss function. During inference, all combination weights can be obtained in a single forward pass of the network, without requiring any additional post-processing or iterative optimization.
Compared with traditional methods that estimate blur kernel parameters independently before image restoration, our learning-based weight estimation strategy avoids accumulated errors caused by separate kernel estimation. It also greatly improves inference efficiency, making our method more suitable for real-world defocus deblurring applications.
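A minimal NumPy sketch of the GKM forward blur model follows: each coefficient map masks the sharp image, the masked image is convolved with its fixed basis kernel, and the results are summed. The three-kernel bank, the image size, and all names are hypothetical, noise is omitted, and zero padding at the image border is an assumption.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalised isotropic 2D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def conv2d_same(img, kernel):
    """Zero-padded 'same' 2D convolution for odd-sized kernels."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]
    out = np.zeros(img.shape, dtype=float)
    for di in range(kh):
        for dj in range(kw):
            out += flipped[di, dj] * padded[di:di + img.shape[0],
                                            dj:dj + img.shape[1]]
    return out

def gkm_blur(x, kernels, beta):
    """Sum over k of kernel_k convolved with (beta_k * x); noise omitted."""
    return sum(conv2d_same(b * x, g) for g, b in zip(kernels, beta))

rng = np.random.default_rng(0)
x = rng.random((8, 8))                             # hypothetical sharp image
kernels = [np.ones((1, 1)),                        # delta: in-focus regions
           gaussian_kernel(3, 0.5),
           gaussian_kernel(5, 1.0)]

beta_sharp = np.zeros((3, 8, 8))
beta_sharp[0] = 1.0                                # all weight on the delta
y_sharp = gkm_blur(x, kernels, beta_sharp)         # reproduces x

beta_mixed = np.zeros((3, 8, 8))
beta_mixed[0, :, :4] = 1.0                         # left half in focus
beta_mixed[2, :, 4:] = 1.0                         # right half defocused
y_mixed = gkm_blur(x, kernels, beta_mixed)
```

Only K convolutions with fixed kernels are needed regardless of how the blur varies across the image, which is what makes predicting the coefficient maps in a single forward pass practical.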
4.6. Joint Spatial-Frequency Domain Loss for Restoring High-Frequency Detail Information
The loss function used in our GKMANet training is defined in Equation (6):

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_1 \mathcal{L}_{\mathrm{FRL}} + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} \tag{6}$$

where $\lambda_1$ and $\lambda_2$ are real-valued coefficients. We set both to 0.1 to balance the contribution of each component. This setting follows the general convention for multi-component joint losses, in which 0.1 is a commonly used default weight for auxiliary loss terms. The leading performance achieved by the model in this paper on multiple mainstream benchmark datasets further verifies the rationality and effectiveness of this parameter selection. Subsequently, we analyze the role of each term in Equation (6).
The mean squared error (MSE) loss, widely applied in regression and image processing, quantifies the model error by the squared difference between predicted and true values. Its objective is to minimize this error for parameter optimization, as defined in Equation (7):

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2 \tag{7}$$

where $N$ represents the number of samples, $x_i$ is the true value of the i-th sample, and $\hat{x}_i$ is the predicted value of the i-th sample.
Frequency Reconstruction Loss (FRL), widely used in image restoration, denoising, and deblurring, enhances detail and structural recovery by emphasizing frequency characteristics, particularly in high-frequency components. Unlike pixel-based losses, FRL operates in the frequency domain to preserve critical high-frequency information. Frequency components encode image attributes: low frequencies represent structural outlines, while high frequencies contain textures and edges. As high-frequency components are challenging to recover, FRL directly optimizes the model by comparing frequency spectra of real and reconstructed images. Its formula is shown as Equation (8).
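$$L_{\mathrm{FRL}} = \frac{1}{N}\sum_{i=1}^{N}\left\| \mathcal{F}(I_i) - \mathcal{F}(\hat{I}_i) \right\| \tag{8}$$
where $I_i$ represents the true image of the i-th sample; $\hat{I}_i$ represents the predicted image of the i-th sample; $\mathcal{F}(\cdot)$ denotes the Fourier transform, which converts the image from the spatial domain to the frequency domain; $N$ is the number of samples; and $\|\cdot\|$ represents the difference calculation of the frequency components.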
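As a rough illustration of how a frequency-domain loss of this kind can be computed, the following NumPy sketch compares the 2-D Fourier spectra of predicted and ground-truth images. The function name `frequency_reconstruction_loss`, the choice of the mean absolute (L1) spectral difference, and the averaging over all frequency bins as well as over samples are assumptions made for illustration; the norm and normalization used in Equation (8) may differ.

```python
import numpy as np


def frequency_reconstruction_loss(pred, target):
    """Mean absolute difference between the 2-D spectra of two image batches.

    pred, target: arrays of shape (N, H, W).
    The L1 distance and the averaging over frequency bins are assumptions.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Fourier transform over the two spatial axes of every sample.
    spec_pred = np.fft.fft2(pred, axes=(-2, -1))
    spec_true = np.fft.fft2(target, axes=(-2, -1))
    # np.abs of a complex array is the modulus of each coefficient.
    return float(np.mean(np.abs(spec_pred - spec_true)))


# Toy check: an all-ones image and an all-zeros image differ only in the
# DC coefficient of the spectrum.
ones = np.ones((1, 4, 4))
zeros = np.zeros((1, 4, 4))
toy_loss = frequency_reconstruction_loss(ones, zeros)
```

In the toy check, the spectrum of the all-ones 4×4 image has a single non-zero coefficient (the DC term, equal to 16), so the mean over the 16 frequency bins is 1.0. A practical training loop would compute the same quantity with the automatic-differentiation framework used for the network.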
The Structural Similarity (SSIM) loss function is a perceptually based metric widely used in image quality assessment and restoration tasks. Unlike traditional pixel-level loss functions such as Mean Squared Error (MSE), SSIM incorporates structural, luminance, and contrast information. This enables it to better approximate human visual perception of image quality, making it prevalent in denoising, deblurring, and super-resolution applications. Its mathematical formulation is given in Equation (9).
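The structure of the SSIM index can be sketched in a few lines of NumPy. The version below is a simplification that computes the statistics once over the whole image, whereas standard SSIM averages the index over local sliding windows. The function names, the constants `0.01` and `0.03` (the usual defaults), and the use of `1 - SSIM` as the loss are assumptions for illustration and are not taken from Equation (9).

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (a single global window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilize the ratios when the denominators are tiny.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # Luminance term compares means; the second term merges contrast and structure.
    luminance = (2.0 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast_structure = (2.0 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure


def ssim_loss(pred, target, data_range=1.0):
    """Loss that is 0 for identical images and grows as similarity drops."""
    return 1.0 - global_ssim(pred, target, data_range)


ramp = np.linspace(0.0, 1.0, 64).reshape(8, 8)
loss_same = ssim_loss(ramp, ramp)            # identical images
loss_inverted = ssim_loss(ramp, 1.0 - ramp)  # structure is anti-correlated
```

An inverted image has the same mean and variance as the original ramp but a negative covariance with it, so its loss exceeds 1, while the loss for identical images is zero up to rounding.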
This paper proposes a novel multi-level loss function integrating MSE loss, frequency reconstruction loss, and SSIM loss. The introduction of frequency reconstruction loss constitutes a key innovation. By constraining image restoration in the frequency domain, this approach significantly enhances the model’s high-frequency detail recovery capability. Particularly in cases with indistinct boundaries between blurred and clear regions, this constraint mitigates the excessive smoothing and detail loss prevalent in traditional methods. Experimental results demonstrate that our loss function design substantially improves deblurring performance, excelling in detail restoration and boundary processing.
As established in prior analysis, the blur operator B preserves low-frequency components while attenuating high-frequency components. Let I denote the identity mapping. Consequently, the mapping I − B attenuates low-frequency components and preserves high-frequency components. Thus, ignoring noise ϵ and restating Equation (1) yields the modified expression in Equation (10).
This yields the fixed-point iteration for defocus deblurring given in Equation (11).
The fixed-point iteration converges when I − B constitutes a contraction mapping, i.e., when the eigenvalues of B satisfy λ∈(0,1). This condition holds for uniform defocus blur modeled by a normalized Gaussian kernel, corresponding to constant scene depth within the field of view.
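The convergence behaviour can be checked numerically with a small, noise-free sketch. It assumes that Equation (1) has the form y = Bx + ϵ, so that the iteration reads x ← y + (I − B)x, and it models B as a circular convolution with a normalized Gaussian kernel. The helper names, the periodic boundary handling, the kernel width `sigma=0.6`, and the iteration count are illustrative choices, not values from the paper.

```python
import numpy as np


def gaussian_transfer_function(shape, sigma):
    """DFT of a normalized periodic Gaussian kernel, i.e. the eigenvalues of B."""
    def axis_kernel(n):
        idx = np.arange(n)
        dist = np.minimum(idx, n - idx)  # wrap-around distance to index 0
        g = np.exp(-dist**2 / (2.0 * sigma**2))
        return g / g.sum()

    kernel = np.outer(axis_kernel(shape[0]), axis_kernel(shape[1]))
    # The kernel is symmetric, so its spectrum is real up to rounding error.
    return np.fft.fft2(kernel).real


def blur(x, transfer):
    """Apply B as a circular convolution, evaluated in the frequency domain."""
    return np.fft.ifft2(np.fft.fft2(x) * transfer).real


def fixed_point_deblur(y, transfer, num_iters=200):
    """Iterate x <- y + (I - B) x, starting from the blurred image."""
    x = y.copy()
    for _ in range(num_iters):
        x = y + x - blur(x, transfer)
    return x


rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
H = gaussian_transfer_function(sharp.shape, sigma=0.6)
blurred = blur(sharp, H)
restored = fixed_point_deblur(blurred, H)
max_error = np.max(np.abs(restored - sharp))
```

Each frequency component of the error is multiplied by 1 − λ per iteration, where λ is the corresponding eigenvalue of B, so components with λ close to zero converge slowly. With noise present, those same components are the ones most prone to amplification, which is one reason a learned, regularized version of the iteration is attractive.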
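Define $\sigma_0 = 0$ and $G(\sigma_0) = \delta$. This way, the clear region can be modeled by setting $\beta_0 = 1$ and $\beta_j = 0$ when $j > 0$. Let $\gamma_0 = 1 - \beta_0$, and when $j > 0$, let $\gamma_j = -\beta_j$. Then, Equation (11) can be expressed as Equation (12) shown below.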
Since the defocus point spread function (PSF) derives from the GKM model, Equation (1) is solvable via fixed-point iteration involving learnable coefficient matrices. This method is preferable to alternatives (e.g., gradient descent, half-quadratic splitting) as it depends solely on the forward operator $B$ without requiring $B^{\top}$ or the pseudo-inverse $B^{\dagger}$.
The mixed coefficient matrix can be interpreted as an attention map associated with the feature maps generated by different Gaussian kernels. Therefore, we construct a deep neural network (DNN) with an attention module and long skip connections that implements Equation (12) for SIDD. Furthermore, the scale-recursive framework shares weights across scales while progressively refining predictions: at each iteration, the DNN predicts the all-in-focus image at the current scale and then up-samples it for processing at the next scale. Integrating self-attention further improves the prediction of the spatial-channel attention maps. This enhancement improves the model’s capacity to capture subtle blur–clear transitions, particularly under non-uniform blur distributions. Consequently, the network achieves more accurate region and channel attention mapping, yielding better deblurring performance with lightweight computation.
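To make the attention-map interpretation concrete, the sketch below runs the mixture update with oracle coefficient maps in place of network predictions. It assumes that the GKM forward model is a pixel-wise weighted sum y = Σ_j β_j ⊙ (G(σ_j) ⊗ x) whose weights sum to one at every pixel, that the update is x ← y + Σ_j γ_j ⊙ (G(σ_j) ⊗ x), and that γ_0 = 1 − β_0 and γ_j = −β_j for j > 0 with σ_0 = 0 acting as the identity. These forms, the periodic boundary handling, and all function names are illustrative assumptions and not the GKMANet implementation.

```python
import numpy as np


def gaussian_blur_periodic(x, sigma):
    """Circular Gaussian blur; sigma == 0 acts as the identity (delta kernel)."""
    if sigma == 0:
        return x.copy()

    def axis_kernel(n):
        idx = np.arange(n)
        dist = np.minimum(idx, n - idx)  # wrap-around distance to index 0
        g = np.exp(-dist**2 / (2.0 * sigma**2))
        return g / g.sum()

    kernel = np.outer(axis_kernel(x.shape[0]), axis_kernel(x.shape[1]))
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kernel)).real


def gkm_blur(x, betas, sigmas):
    """Assumed forward model: pixel-wise mixture of Gaussian-blurred copies."""
    return sum(b * gaussian_blur_periodic(x, s) for b, s in zip(betas, sigmas))


def gkm_fixed_point_step(x, y, gammas, sigmas):
    """One update x <- y + sum_j gamma_j * (G(sigma_j) conv x)."""
    return y + sum(g * gaussian_blur_periodic(x, s) for g, s in zip(gammas, sigmas))


sigmas = [0.0, 0.6]
rng = np.random.default_rng(0)
sharp = rng.random((32, 32))

# Oracle weight maps: left half in focus, right half blurred with sigma = 0.6.
beta_blur = np.zeros((32, 32))
beta_blur[:, 16:] = 1.0
betas = [1.0 - beta_blur, beta_blur]
gammas = [1.0 - betas[0], -betas[1]]

blurred = gkm_blur(sharp, betas, sigmas)
estimate = blurred.copy()
for _ in range(200):
    estimate = gkm_fixed_point_step(estimate, blurred, gammas, sigmas)
max_error = np.max(np.abs(estimate - sharp))
```

In this toy setting the coefficient maps are zero wherever the image is already sharp, so the update leaves those pixels unchanged and concentrates the correction on the blurred half. In the network, the maps are not known in advance and are predicted by the attention module from the input.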
5. Experimental Results and Analysis
5.1. Selection of the Dataset
Currently, the defocus deblurring field lacks benchmark datasets designed specifically for cases with indistinct boundaries between blurred and sharp regions. We therefore selected the Dual-Pixel Defocus (DPD) dataset as our benchmark. DPD is well established within the community and offers three advantages for this challenge: (1) it provides 500 high-quality, diverse pairs of defocused and fully focused images, enabling models to learn varied boundary features; (2) its 16-bit color images support high dynamic range image restoration, which helps differentiate blurred from sharp regions; and (3) the dataset is partitioned into 350 training, 74 validation, and 76 test samples, giving a fixed and reproducible evaluation protocol. Furthermore, because this work deblurs regular defocused images without leveraging dual-pixel characteristics, DPD is a suitable benchmark for training and testing models such as GKMANet under ambiguous blur–sharp boundaries.
5.2. Selection of Evaluation Criteria
To assess GKMANet on defocus deblurring with indistinct blur–sharp boundaries, this study employs Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). PSNR quantifies reconstruction fidelity through the mean squared error between restored and reference images and serves as a basic, widely comparable benchmark. SSIM compares luminance, contrast, and structure, and so aligns more closely with human visual perception than pixel-wise error alone. LPIPS, a learned perceptual metric, measures distance in a deep feature space and is more sensitive to subtle texture differences, including those near ambiguous boundaries. Together, the three metrics give a more complete picture of deblurring quality.
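As a brief illustration of the first metric, PSNR follows directly from the mean squared error and the dynamic range of the data. The function name `psnr` and the `data_range` parameter below are illustrative; for 16-bit images stored as integers, `data_range` would be 65535.

```python
import numpy as np


def psnr(restored, reference, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    restored = np.asarray(restored, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((restored - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range**2 / mse))


reference = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)
value_db = psnr(restored, reference)
```

Here every pixel is off by 0.1 on a unit range, so the MSE is 0.01 and the PSNR is 20 dB. Because the scale is logarithmic, a gain of a fraction of a decibel corresponds to a few percent reduction in MSE.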
5.3. Experimental Results
For comprehensive benchmarking, we compare against several state-of-the-art methods: DMPHN [29], AIFNet [38], MDP [39], KPAC [40], IFANet [41], RDPD [42], DDDNet [43], DeepRFT [44], BAMBNet [45], JNB [16], EBDB [18], DMENet [46], DRBNet [47], DPDNet [47], RGSIDD [48], EFNet [49], and CauSiam [50]. Among these, JNB, EBDB, and DMENet [46] employ a two-stage pipeline (blur kernel estimation → deblurring), while DRBNet and DPDNet [47] utilize deep neural networks (DNNs). We further evaluate two dynamic scene deblurring networks, SRN [26] and AttNet [30], both of which adopt multi-scale architectures and were retrained on our dataset. For fair comparison, GKMNet was retrained under identical conditions. Experimental results confirm the superior performance of our approach over GKMNet. The results are shown in Figure 4, which presents four scenes from top to bottom. In all four scenes, the output of our method is sharper than that of the comparison algorithms, whose results retain varying degrees of blur. In the first scene (wall textures), the comparison algorithms produce hazy, blurred textures. In the second scene (green fences), their fence edges show blurred trailing. In the third scene (walls with text), the text edges are blurred and hard to read. In the fourth scene (posters), the text is blurred and smeared, and fine details are not resolved. By contrast, our method recovers clearer wall textures, fence outlines, text, and poster details, and its output is closest to the sharp target image.
5.4. Ablation Experiment
To quantify the contribution of individual components in GKMANet, we performed an ablation study measuring their performance gains. The comparison is reported in Table 1 and Table 2.
5.5. Method Comparison Analysis
This paper addresses the challenge of removing defocus blur in scenarios characterized by indistinct boundaries between blurred and clear regions. The proposed method introduces several key improvements over the original approach, significantly enhancing deblurring performance in these specific cases. Compared to DRBNet [38], our model demonstrates distinct advantages in tackling this core problem.
Firstly, increasing the network depth significantly improved the model’s ability to represent features in regions with ambiguous boundaries. When processing images exhibiting gradual blur transitions, the refined model captures subtle boundary variations more accurately, enhancing its adaptability to varying blur levels. Experimental results on the relevant test set show our model outperforming DRBNet by approximately 0.18 dB in PSNR, indicating superior recovery of boundary details and overall image quality optimization.
Secondly, the integration of a self-attention mechanism enables the model to precisely locate critical areas with low-contrast, indistinct boundaries. In such scenarios, this mechanism focuses computational resources on the junctions between blurred and clear regions. This focus mitigates the misclassification of clear areas as blurred and prevents the loss of valid information within blurred zones. Comparative experiments confirm our model’s superior performance in restoring detail within these complex images, particularly in accurately recovering true details within blurred regions, thereby reducing negative impacts on overall quality.
Finally, incorporating a frequency-domain reconstruction loss function further enhanced the model’s capacity to recover high-frequency features in boundary details. These subtle transition zones between blurred and clear areas typically contain rich high-frequency information, which traditional methods often fail to preserve. Consequently, our approach more effectively restores high-frequency boundary details, resulting in more natural and realistic transitions compared to DRBNet.
For the problem of defocus blur removal with indistinct boundaries, our enhanced method surpasses DRBNet in critical metrics such as boundary detail restoration and adaptability to complex scenes. It consequently demonstrates greater robustness and reliability, making it more suitable for addressing this challenging image deblurring task.
6. Conclusions
Defocus blur represents a common type of degradation in image processing, exhibiting distinct characteristics compared to motion blur. When the spatial distribution of defocus blur is uneven, restoring the affected image presents significant challenges. A key difficulty lies in accurately distinguishing the boundary between clear and blurred regions, which often leads to problems in recovering fine details. This paper introduces a new deep neural network method specifically designed to address the problem of single image defocus blur removal. The core innovation of this approach involves integrating a defocus blurring process based on the Gaussian Kernel Mixture model. Furthermore, the method employs fixed-point iteration expansion to introduce a scale recursive structure. This design enables the model to effectively capture global image features while also flexibly handling local blurry areas. Significantly, in situations where boundaries between blurry and clear regions are hard to distinguish, the proposed structure enhances the model’s boundary recognition capability and refines the overall image restoration process. Experimental results demonstrate that the improved network proposed in this paper achieves an average increase of 0.71% in PSNR and 0.51% in SSIM compared to existing methods. The method showed stronger capabilities, particularly in images with uneven defocus blur distribution and in the restoration of high-frequency details. It also successfully addressed the challenge of distinguishing boundaries between blurred and clear regions. Although this method achieved good results on multiple standard datasets, further optimization is possible. Future research can focus on extending the approach to solve a wider range of real-world defocus point spread function problems, especially concerning deblurring for relatively rare and highly anisotropic blur types. 
Additionally, combining multiple training datasets could further enhance the model’s generalization ability. In particular, defocus deblurring with indistinct boundaries between blurred and clear regions still lacks sufficient public, high-quality benchmark datasets. To address this gap, we plan to construct a larger and more diverse dedicated dataset for this scenario and to conduct comprehensive cross-dataset verification of the generalization and robustness of the proposed model. We acknowledge that the proposed method incurs higher computational overhead than lightweight baseline methods, because it introduces additional modules dedicated to fine-grained feature modeling to improve restoration accuracy. At this stage, our method is better suited to offline applications with stringent requirements for restoration quality and relaxed constraints on real-time performance. To enable real-time edge deployment, future work will focus on reducing computational complexity through model compression, lightweight architecture design, and operator-level inference optimization.