1. Introduction
Image deblurring is a crucial task in low-level vision, especially in the digital age, where cameras have become ubiquitous across a wide array of personal electronic devices such as smartphones and tablets. These devices can easily capture blurred images due to various factors, including camera shake, which the limited space and budget for anti-shake hardware cannot fully compensate, as well as poor focusing.
Deblurring algorithms have been studied for decades with the aim of recovering clear and sharp images from those with indistinct or blurred details. Mathematically, deblurring is an ill-posed inverse problem, which requires strong priors about the nature of the images to be recovered in order to be effectively regularized. With the recent successes of data-driven methods based on neural networks, learning-based deblurring algorithms have also evolved rapidly, from convolutional and recurrent neural networks [1,2] to Transformers [3,4,5] and from Generative Adversarial Networks [6] to diffusion models [7]. The success of all of these can be largely attributed to their ability to learn sophisticated image features from training data. However, comparatively fewer works [8,9,10] have focused on ways to incorporate side information, mainly in the form of event cameras, segmentation information, or optical flow, as an alternative way to help regularize the deblurring inverse problem. Historically, guided filters [11] have been used to modulate the filtering process with a guidance signal for this purpose.
In this regard, multimodal imaging platforms, which combine multiple kinds of imaging devices, are currently gaining popularity. In particular, recent mobile devices, such as the Apple iPhone and iPad [12], are now equipped with Lidar sensors that provide depth information for 3D scanning. Such Lidars are time-of-flight sensors that emit a grid of light pulses and measure the return time to estimate the distance at multiple points, thus providing a depth map of the scene. Active sensing instruments are particularly interesting because they can complement passive optical cameras. In particular, passive cameras are prone to producing blurry images in situations requiring even slightly longer exposures, since integrating the received light over a longer period makes them sensitive to camera shake on handheld devices. Likewise, focusing errors under challenging conditions may result in blurred images. This raises the question of whether true depth information from Lidar sensors, particularly smartphone ones, can be effectively used to regularize the deblurring problem and improve image quality.
Several challenges need to be overcome in order to answer this question. First, our focus will be on smartphone Lidars and cameras, as this is, possibly, the most widespread multimodal sensing platform at the moment. However, mobile Lidars have significant limitations in spatial resolution due to their size and cost, so it is not obvious whether they can provide sufficient information. Moreover, state-of-the-art image restoration models based on neural networks require large datasets to be effectively trained. At the moment, there is no existing dataset of blurry images with associated Lidar depth maps captured by smartphones, and assembling one of large size to enable effective training from scratch is indeed a challenging task.
In this work, we aim to answer the question of whether smartphone Lidar can boost image deblurring performance. We propose a novel approach that integrates depth maps with blurred RGB images in a way that is able to address the aforementioned challenges. In particular, we propose a continual learning approach where conventional encoder–decoder deblurring models are augmented and finetuned with adapters to incorporate depth information. We design a novel adapter neural network inspired by the classic guided filter to process depth maps and use their features to modulate the features extracted by any state-of-the-art image restoration model. The adapter also deals with the limited resolution of mobile Lidar depth maps by including a super-resolution operation that is capable of preserving their piecewise constant nature when upscaling them to the target resolution.
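To make the proposed mechanism concrete, the following is a minimal PyTorch sketch of a guided-filter-style depth adapter of the kind described above; the module structure, channel counts, and the exact form of the modulation are illustrative assumptions rather than the exact architecture used in our models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAdapter(nn.Module):
    """Illustrative depth adapter: a shallow depth encoder predicts per-pixel
    scale and shift maps that modulate the backbone features, loosely mirroring
    the guided filter's local affine model y = a * x + b."""
    def __init__(self, feat_channels: int, depth_channels: int = 16):
        super().__init__()
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, depth_channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(depth_channels, depth_channels, 3, padding=1), nn.PReLU(),
        )
        self.to_scale = nn.Conv2d(depth_channels, feat_channels, 1)
        self.to_shift = nn.Conv2d(depth_channels, feat_channels, 1)

    def forward(self, feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Resize the (super-resolved) depth map to the feature resolution.
        depth = F.interpolate(depth, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
        g = self.depth_encoder(depth)
        # Residual modulation keeps the pretrained backbone behavior reachable
        # when the adapter outputs are close to zero.
        return feat + self.to_scale(g) * feat + self.to_shift(g)
```

In the continual learning setting, such adapters would be attached to the decoder stages of a pretrained backbone and trained jointly with it during finetuning.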
In summary, our main contributions can be regarded as follows:
We propose a novel approach to image deblurring that augments state-of-the-art models with depth information from smartphone Lidar sensors.
We propose a novel continual learning strategy that finetunes state-of-the-art deblurring models with adapters that process depth information and use it to modulate the main model features.
We propose a design of the adapter architecture that is inspired by the classic guided filter to effectively use depth to modulate image features.
We show that the true depth information obtained by mobile Lidar sensors improves image deblurring performance, as experimentally verified with real-world Lidar data.
This paper is organized as follows:
Section 2 reviews the relevant background and related work on image deblurring and continual learning.
Section 3 presents the proposed framework in detail, including depth super-resolution, depth adapters, and continual learning strategies.
Section 4 reports the experimental results, including quantitative comparisons, qualitative analysis, and ablation studies. Finally,
Section 6 concludes this paper and outlines potential directions for future work.
4. Experimental Results
This section reports the experimental results that validate several points of interest. First and foremost, we seek to answer the question of whether image quality is improved by providing mobile Lidar depth maps. We do so by presenting results for several state-of-the-art deblurring architectures adapted following the proposed approach. Next, we validate the design of the proposed approach, particularly the need for depth super-resolution and the adapter design.
4.1. Datasets
In our experiments, we use a subset of the ARKitScenes dataset [43], specifically the portion used for RGB-D-guided upsampling, which contains 29,264 image-depth pairs in the training set. For validation, we randomly sample 500 pairs from the original validation set. Image blur is simulated by randomly choosing a blur kernel from a set of standard benchmark kernels, following the approach from [39]. The kernels from [39] were designed for rescaled scenes that are considerably smaller than our input images, so we rescale them in order to preserve the relative ratio between the blur diameter and the image dimensions. This adjustment ensures that the simulated blur maintains a strength similar to the original setting in terms of spatial frequency attenuation, resulting in more realistic and perceptually consistent blur patterns across the different resolutions. In addition, a novel dataset of real blurred images with associated mobile depth maps (the LICAM dataset [63]) is used for further evaluation; it contains 200 training images and 180 test images of the same size.
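As an illustration, a possible sketch of this blur simulation step is shown below; the kernel source, the rescaling factor, and the interpolation used to resize the kernel are assumptions, since only the general procedure is described here.

```python
import numpy as np
import torch
import torch.nn.functional as F

def rescale_kernel(kernel: np.ndarray, factor: float) -> np.ndarray:
    """Resize a blur kernel by `factor` and renormalize it to sum to one, so
    that the blur-diameter-to-image-size ratio of the benchmark is preserved."""
    k = torch.from_numpy(kernel).float()[None, None]  # shape 1 x 1 x kh x kw
    new_hw = (max(1, round(k.shape[-2] * factor)),
              max(1, round(k.shape[-1] * factor)))
    k = F.interpolate(k, size=new_hw, mode="bilinear",
                      align_corners=False).clamp(min=0)
    return (k / k.sum()).squeeze().numpy()

def apply_blur(image: torch.Tensor, kernel: np.ndarray) -> torch.Tensor:
    """Blur a (C, H, W) image by depthwise convolution with a single kernel."""
    k = torch.from_numpy(kernel).float()
    c = image.shape[0]
    weight = k[None, None].repeat(c, 1, 1, 1)       # C x 1 x kh x kw
    pad = (k.shape[-2] // 2, k.shape[-1] // 2)      # (pad_h, pad_w)
    return F.conv2d(image[None], weight, padding=pad, groups=c)[0]
```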
The depth super-resolution network is pretrained on the same ARKitScenes dataset, using the low-resolution depth maps from the iPad Lidar as input and the high-resolution depth maps from the Faro Focus S70 Lidar as ground truth. The ground truth data contain some pixels with invalid measurements, which are masked out and discarded in the loss computation.
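A minimal sketch of the masked loss implied by this invalid-pixel handling is given below; the validity criterion (here, nonzero depth) is an assumption.

```python
import torch

def masked_l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 loss between the super-resolved depth map and the ground truth,
    ignoring ground-truth pixels with invalid measurements."""
    valid = (target > 0).float()  # assumed validity criterion: nonzero depth
    return ((pred - target).abs() * valid).sum() / valid.sum().clamp(min=1)
```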
4.2. Implementation Details
We selected four main state-of-the-art deblurring models to be tested with and without Lidar augmentation: Restormer [41], NAFNet [26], Stripformer [5], and DeblurDiNATL [62]. The versions with the proposed Lidar depth improvements, using adapters and the continual learning strategy, are denoted with the prefix “Depth-*”. Hyperparameters for our experiments are shown in Table 1; in particular, this table reports the number of feature channels in the depth adapters and the SR network. Training generally follows the protocols outlined in the original papers in terms of the image patch sizes used for Restormer, Stripformer, NAFNet, and DeblurDiNATL. In the training process, the SR network is first pretrained on the ARKitScenes data using the L1 loss

$\mathcal{L}_{\mathrm{SR}} = \lVert \hat{D} - D \rVert_{1},$

where $\hat{D}$ is the super-resolved depth map estimated by the network and $D$ is the high-resolution ground truth from the Faro S70 Lidar. For this pretraining, the Adam optimizer is used with a fixed learning rate for 50 epochs.
The backbone model weights are fully finetuned from the values pretrained on the GoPro dataset, as provided with the original implementations of the backbones, and the adapters are trained with a loss that combines the L1 distance and the cosine distance between the deblurred image and the ground truth image. The learning rate starts from its initial value and gradually decays with a cosine annealing policy over a 50-epoch period, and the Adam optimizer is used. The published version of each method and the depth-enhanced one are trained on the same data and with the same protocol to ensure a fair comparison.
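The combined loss and learning rate schedule can be sketched as follows; the relative weight of the cosine term, the way the cosine distance is computed, and the initial learning rate are assumptions and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def l1_plus_cosine_loss(pred: torch.Tensor, target: torch.Tensor,
                        cos_weight: float = 1.0) -> torch.Tensor:
    """L1 distance plus a cosine-distance term between the deblurred image and
    the ground truth; the cosine distance is computed on flattened images and
    the relative weight is an assumption."""
    l1 = F.l1_loss(pred, target)
    cos_sim = F.cosine_similarity(pred.flatten(1), target.flatten(1), dim=1).mean()
    return l1 + cos_weight * (1.0 - cos_sim)

def build_optimizer(model: torch.nn.Module, initial_lr: float):
    """Adam with a cosine-annealing learning rate schedule over a 50-epoch
    period, matching the finetuning protocol described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    return optimizer, scheduler
```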
For the experiment on the LICAM dataset, the entire model is finetuned from the weights trained on the ARKitScenes data using the LPIPS loss for 25 epochs with the Adam optimizer and a fixed learning rate. This choice is motivated by slight misalignments between the ground truth and blurred images in this dataset, which make the L1 or cosine distances unreliable.
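A sketch of this perceptual finetuning loss using the public lpips package is given below; the choice of the AlexNet backbone for LPIPS is an assumption.

```python
import lpips  # pip install lpips
import torch

lpips_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption

def lpips_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Perceptual loss for LICAM finetuning; LPIPS expects inputs in [-1, 1],
    so images in [0, 1] are rescaled first. Robust to slight misalignments."""
    return lpips_fn(pred * 2 - 1, target * 2 - 1).mean()
```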
The deblurring results on the ARKitScenes dataset are evaluated in terms of the widely used Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics, as well as the LPIPS distance [64] as a more perception-oriented metric. For the LICAM dataset, only the LPIPS distance is used to avoid the aforementioned issues.
Experiments were performed on four Nvidia A6000 GPUs. Depending on the specific neural network model, training requires approximately 2–3 days on the ARKitScenes dataset and two hours on the LICAM dataset. The code is available at
https://github.com/diegovalsesia/lidardeblurring (accessed on 30 October 2025).
4.3. Main Results
We first assess whether mobile Lidar data can improve the deblurring results on the selected state-of-the-art architectures. The results are shown in
Table 2. We can notice that the use of Lidar data generally provides a significant improvement in deblurring performance. The only exception is the NAFNet architecture, where we still observe an improvement, albeit a more modest one. This could be explained both by the unusual network design of NAFNet and by the fact that a saturation point has been reached in the ability of any model to deblur (indeed, NAFNet achieves a baseline quality significantly better than that of the other models). We can also observe that the increase in the number of parameters is modest with respect to the size of the original models. The runtime results report the inference latency measured on an Nvidia A6000 GPU and show that integrating depth via adapters only marginally increases complexity. Further work is needed on low-complexity models that could be run directly on smartphone devices. The qualitative results are reported in
Figure 4 and
Figure 5. In these figures, for each scene, the top row shows the results of the four state-of-the-art conventional deblurring models without depth information, while the bottom row shows the results of the depth-enhanced models. It can be noticed that, in correspondence to object boundaries, the depth-enhanced models significantly reduce ghosting effects.
Figure 4 reports the same comparison for the Restormer architecture while also showing the depth information.
Additionally, we present an experiment on a recently introduced dataset of low-light smartphone images affected by motion blur and noise, with registered Lidar depth maps and ground truth images [63]. For this experiment, we finetune the models pretrained on ARKitScenes using the LPIPS perceptual loss for 25 epochs on the images in the training split. Evaluation on the test split also uses the LPIPS distance as a perceptual distortion measure, since it is more robust to slight misalignments with respect to the ground truth data. The results are reported in
Table 3 and confirm the effectiveness of depth information.
Finally, we present an experiment aimed at evaluating the quality of the deblurred images by means of performance on a downstream semantic segmentation task. Since the available datasets do not have ground truth segmentation maps, we devise a procedure where a pretrained Segment Anything Model (SAM) [65] is used on the sharp ground truth images of the ARKitScenes test set to generate ground truth segmentation maps. Then, SAM is used to estimate segmentation maps from the deblurred images, with and without depth guidance, using the Restormer architecture. We report an accuracy of 99.37% for the model without Lidar depth information and 99.68% for the model with depth information. Notice that the accuracy values are quite high because blur only marginally affects semantic segmentation. Nonetheless, a measurable improvement on the downstream task confirms the effectiveness of depth information.
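A possible sketch of this evaluation using the public segment_anything package is shown below; the checkpoint path is hypothetical, and the way masks are collapsed into a label map for computing pixel accuracy is only one possible instantiation of the procedure described above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def label_map_from_sam(generator: SamAutomaticMaskGenerator,
                       image_rgb: np.ndarray) -> np.ndarray:
    """Collapse SAM's automatic masks into an integer label map, painting
    larger masks first so that smaller ones can overwrite them."""
    labels = np.zeros(image_rgb.shape[:2], dtype=np.int32)
    masks = sorted(generator.generate(image_rgb),
                   key=lambda m: m["area"], reverse=True)
    for idx, m in enumerate(masks, start=1):
        labels[m["segmentation"]] = idx
    return labels

# Hypothetical checkpoint path; `sharp` and `deblurred` are HxWx3 uint8 arrays.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(sam)
# accuracy = (label_map_from_sam(generator, sharp) ==
#             label_map_from_sam(generator, deblurred)).mean()
```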
Overall, these results demonstrate that mobile Lidar depth maps, despite their relatively low resolution, can successfully regularize the deblurring process when properly used with the proposed scheme.
4.4. Ablation Study
In the ablation study, we carefully analyze our design decisions to validate their effectiveness. This study concerns the quality of the depth maps and the continual learning strategy used to fuse them with the deblurring model. All the ablation results use the Restormer architecture as the baseline.
4.4.1. Impact of Real Lidar Depth Maps
Some approaches in the literature [32,33] attempt to use estimated depth maps to aid image deblurring, while [66] utilized both real and estimated depth maps, achieving good performance. However, the literature generally lacks comparisons between the use of real and estimated depth maps, and mobile Lidars have not yet been considered for image restoration problems. We argue that depth map estimation from a blurry image can only provide additional image features that might have some use in the reconstruction process, but it does not really provide additional side information, as an independent Lidar would. Therefore, in this study, we compare the PSNR of the deblurred image obtained when real Lidar depth maps are used and when, instead, a depth map is estimated from the blurry image. The state-of-the-art Depth Anything [35] model is used to estimate the depth maps. Because the model generates depth maps at the same resolution as the blurred images, the depth map super-resolution block is not used. From Figure 6, we can see that the depth map estimated from the blurred image lacks the explicit geometric information, particularly regarding object edges, that is present both in the high-resolution depth map from the high-end Lidar and in the super-resolved depth map from the mobile Lidar. The quantitative results in Table 4 confirm the findings of the previous literature [32,33] in that even estimating the depth map from blurry images provides some degree of regularization to the deblurring process, leading to some improvement. However, Lidar depth maps provide a more significant performance improvement, proving that the independent side information captured by the Lidar instrument, even if at modest resolution, can boost image quality.
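For reference, the estimated-depth baseline can be obtained with a few lines using the Hugging Face depth-estimation pipeline; the specific checkpoint name below is illustrative, as any Depth Anything variant exposed through the pipeline can serve as the monocular baseline.

```python
from PIL import Image
from transformers import pipeline

# Checkpoint name is illustrative; any Depth Anything variant exposed through
# the depth-estimation pipeline can serve as the monocular baseline.
depth_estimator = pipeline(task="depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")

def estimate_depth(path: str) -> Image.Image:
    """Monocular depth estimated directly from the blurry RGB image; no Lidar
    and no depth super-resolution block are involved in this baseline."""
    return depth_estimator(Image.open(path).convert("RGB"))["depth"]
```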
4.4.2. Impact of Lidar Depth Super-Resolution
While we observed that a super-resolved mobile Lidar depth map can increase deblurring quality more than a depth map estimated from the blurred image, we still need to analyze the sensitivity of the process to the resolution of the depth map. Therefore, we conducted an experiment with different depth map super-resolution scales, specifically four times and eight times, and with a different upsampling method, namely bicubic interpolation in place of the proposed SR network; we also compared the results with the high-resolution depth maps provided by the Faro Focus S70 Lidar. The results are shown in Table 5. We first notice that the neural network approach to depth super-resolution significantly outperforms bicubic interpolation. Bicubic interpolation is mainly limited by being a general interpolation method fitting cubic polynomials, and as such, it does not exploit priors specific to depth data, such as the fact that depth maps are approximately piecewise constant. However, we notice that the proposed approach with bicubically interpolated depth maps still improves over the model not using depth maps. We also confirmed this result by running the experiment on the Stripformer architecture, where bicubic upsampling achieves a PSNR of 35.89 dB instead of the 35.17 dB achieved by the model without depth, although this is still lower than the 36.34 dB achieved with the SR neural network. These results suggest that, despite its limitations, bicubic interpolation could be used as an effective baseline in the absence of a dedicated SR network. As an example, this could happen for new Lidar sensors that might differ significantly from the current iPhone ones and for which paired training data with HR Lidar depth maps might not be available. We also notice that the super-resolution factor that matches the ratio between the RGB images and the iPad depth maps provides the best results. Interestingly, the depth maps processed with this super-resolution network achieve deblurring performance equivalent to that of the high-resolution depth maps acquired with the Faro Focus S70 Lidar. A visualization of the super-resolved depth maps against bicubic interpolation is shown in Figure 7.
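For completeness, a minimal sketch of the bicubic baseline is given below, assuming the input is a (B, 1, H, W) tensor of metric depth.

```python
import torch
import torch.nn.functional as F

def upsample_depth_bicubic(depth_lr: torch.Tensor, scale: int) -> torch.Tensor:
    """Bicubic baseline for depth upsampling: generic polynomial interpolation
    that ignores the piecewise-constant structure of depth maps, which is why
    it smears object boundaries compared with the learned SR network."""
    return F.interpolate(depth_lr, scale_factor=scale, mode="bicubic",
                         align_corners=False).clamp(min=0)
```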
4.4.3. Impact of Depth Fusion and Continual Learning Adapter Design
In this study, we evaluated the design of the depth map fusion method and the continual learning strategy to create a joint deblurring model. In particular, we first assessed whether adapters are more effective than concatenating the super-resolved depth maps as an extra input channel. The results are reported in
Table 6. As explained in
Section 3.2.1, this is not as effective as the use of deep feature modulation and, in fact, results in a PSNR loss of 1.33 dB.
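The concatenation baseline evaluated here can be sketched as follows; widening the backbone's first convolution to accept four input channels is assumed.

```python
import torch
import torch.nn.functional as F

def concat_depth_as_channel(rgb: torch.Tensor,
                            depth_sr: torch.Tensor) -> torch.Tensor:
    """Ablation baseline: append the super-resolved depth map to the blurry RGB
    image as a fourth input channel, instead of modulating decoder features
    through the adapters. The backbone's first convolution must accept 4 channels."""
    if depth_sr.shape[-2:] != rgb.shape[-2:]:
        depth_sr = F.interpolate(depth_sr, size=rgb.shape[-2:],
                                 mode="bilinear", align_corners=False)
    return torch.cat([rgb, depth_sr], dim=1)  # (B, 4, H, W)
```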
We then ablate the adapter design in
Table 7. We first consider a variant where the convolutional attention operation at the beginning of the adapter is replaced with a sequence of convolution, LayerNorm, and PReLU. Notice that this replacement loses the analogy with the guided filter, in that it uses first-order features instead of the second-order features of the convolutional attention operation and the guided filter. Indeed, we can see that this design is not as effective as the one proposed in analogy to the guided filter. Moreover, we validate the use of PReLU activations instead of ReLU, since the naive use of ReLU activations might lead to suboptimal results due to the truncation of negative features and the resulting directional bias. Indeed, we see that ReLUs are not as effective. Finally, we assess whether having adapters at both the encoder and decoder is more effective than the proposed decoder-only solution. As mentioned in
Section 3, the experiment confirms that adding encoder-side adapters is not effective and even degrades the original performance.
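The first-order variant examined in this ablation can be sketched as below; channel counts and the use of GroupNorm as a stand-in for LayerNorm on image tensors are assumptions.

```python
import torch
import torch.nn as nn

class FirstOrderStem(nn.Module):
    """Ablation variant of the adapter stem: plain convolution + normalization +
    activation. Unlike the convolutional attention it replaces, which multiplies
    feature maps and thus captures second-order statistics (as in the guided
    filter), this stem only applies a first-order transform."""
    def __init__(self, channels: int, use_prelu: bool = True):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm-like, per sample
        # PReLU avoids the hard truncation of negative features caused by ReLU.
        self.act = nn.PReLU() if use_prelu else nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.conv(x)))
```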
4.5. Analysis of Lidar Effectiveness
In this section, we analyze how the depth map influences image quality and which image regions benefit the most. For this purpose,
Figure 8 presents a heatmap of local PSNR values, where the PSNR at each pixel is computed over a patch centered around it. We observe that in boundary-rich areas (highlighted by yellow bounding boxes), the depth-integrated model yields a higher PSNR. In contrast, regions with rich texture but lacking strong geometric boundaries (such as the striped carpet and the picture on the card) show little improvement. This confirms that depth information is particularly helpful in preserving sharp edges and structural transitions, while offering limited gains in textured areas.
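The local PSNR heatmap can be computed as sketched below; the patch size used in Figure 8 is not reproduced here, so the default value is arbitrary.

```python
import torch
import torch.nn.functional as F

def local_psnr_map(pred: torch.Tensor, target: torch.Tensor,
                   patch: int = 31, max_val: float = 1.0) -> torch.Tensor:
    """Per-pixel PSNR computed from the MSE averaged over a square patch
    centered on each pixel; inputs are (B, C, H, W) images in [0, max_val]."""
    se = ((pred - target) ** 2).mean(dim=1, keepdim=True)
    mse = F.avg_pool2d(se, kernel_size=patch, stride=1, padding=patch // 2)
    return 10.0 * torch.log10(max_val ** 2 / mse.clamp(min=1e-12))
```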
6. Conclusions
We proposed using Lidar depth maps to further enhance the performance of deep deblurring models. In particular, we showed that inexpensive mobile Lidar devices can provide useful side information that improves the quality of deblurred images, especially thanks to information about object edges. The experimental results showed significant image quality improvement on synthetic and real blurred images.
While our current study demonstrates the effectiveness of depth-guided deblurring with efficient continual learning mechanisms, several challenges remain for future exploration. In particular, the focus of the current study on mobile Lidar data from iPhone devices, where the depth data are preprocessed on the device, limits the possibility of analyzing robustness to incomplete depth data or to the responses of different Lidar sensors. Moreover, this study focused on scenes where the depth data could be the most useful, i.e., static, indoor scenes. Dynamic scenes could be an interesting extension but would require dedicated data. Further investigation of the performance on long-range scenes could also be of interest. Further potential avenues for future work include developing zero-shot approaches to avoid the need for extensive amounts of paired training data. Additionally, lower-complexity Lidar-guided deblurring models could be developed to enable real-time and low-memory inference on smartphones.