MWR-Net: An Edge-Oriented Lightweight Framework for Image Restoration in Single-Lens Infrared Computational Imaging

Qian, Xuanyu; Wang, Xuquan; Xing, Yujie; Yang, Guishuo; Dun, Xiong; Wang, Zhanshan; Cheng, Xinbin

doi:10.3390/rs17173005


by Xuanyu Qian ^1,2,3,†, Xuquan Wang ^1,2,3,*,†, Yujie Xing ^1,2,3, Guishuo Yang ^1,2,3, Xiong Dun ^1,2,3, Zhanshan Wang ^1,2,3,4 and Xinbin Cheng ^1,2,3,4

¹ MOE Key Laboratory of Advanced Micro-Structured Materials, Shanghai 200092, China
² Institute of Precision Optical Engineering, School of Physics Science and Engineering, Tongji University, Shanghai 200092, China
³ Shanghai Frontiers Science Center of Digital Optics, Shanghai 200092, China
⁴ Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China

* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Remote Sens. 2025, 17(17), 3005; https://doi.org/10.3390/rs17173005

Submission received: 17 July 2025 / Revised: 24 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025

(This article belongs to the Special Issue Advances in Remote Sensing Video Data Processing: Theories, Technologies and Applications)


Abstract

Infrared video imaging is a cornerstone technology for environmental perception, particularly in drone-based remote sensing applications such as disaster assessment and infrastructure inspection. Conventional systems, however, rely on bulky optical architectures that limit deployment on lightweight aerial platforms. Computational imaging offers a promising alternative by integrating optical encoding with algorithmic reconstruction, enabling compact hardware while maintaining imaging performance comparable to sophisticated multi-lens systems. Nonetheless, achieving real-time video-rate computational image restoration on resource-constrained unmanned aerial vehicles (UAVs) remains a critical challenge. To address this, we propose Mobile Wavelet Restoration-Net (MWR-Net), a lightweight deep learning framework tailored for real-time infrared image restoration. Built on a MobileNetV4 backbone, MWR-Net leverages depthwise separable convolutions and an optimized downsampling scheme to minimize parameters and computational overhead. A novel wavelet-domain loss enhances high-frequency detail recovery, while the modulation transfer function (MTF) is adopted as an optics-aware evaluation metric. With only 666.37 K parameters and 6.17 G MACs, MWR-Net achieves a PSNR of 37.10 dB and an SSIM of 0.964 on a custom dataset, outperforming a pruned U-Net baseline. Deployed on an RK3588 chip, it runs at 42 FPS. These results demonstrate MWR-Net’s potential as an efficient and practical solution for UAV-based infrared sensing applications.

Keywords:

infrared video imaging; computational imaging; lightweight deep learning; MobileNet; real-time image restoration

1. Introduction

Infrared imaging technology provides unique advantages for environmental perception in complex scenes due to its ability to capture the thermal radiation emitted by objects. It has been widely used in military and civilian fields, such as autonomous driving [1], search and rescue [2], and industrial inspection [3], playing a particularly critical role in remote sensing platforms based on unmanned aerial vehicles (UAVs). As a key method for acquiring surface thermal information, UAV-mounted infrared imaging systems are required to deliver high-precision environmental sensing under strict payload constraints. This necessitates the miniaturization and lightweight design of the imaging hardware. However, traditional infrared systems often rely on complex optical components, which increase the overall size and weight of the system. As a result, there is a significant trade-off between system performance and compactness [4], which presents challenges for applications requiring both portability and high image quality.

Computational imaging provides a promising approach to overcome the physical limitations of conventional imaging systems by adopting a co-design framework that integrates optical encoding with algorithmic decoding. The core concept involves actively modulating light field information at the optical front end—including phase, spectrum, polarization and depth of field—followed by reconstructing multidimensional data through backend algorithms [5,6,7,8,9]. Through this hardware–software co-design paradigm, imaging systems can achieve performance comparable to that of traditional complex optical systems while reducing optical complexity. As a result, computational imaging offers a viable path toward realizing both lightweight and high-performance imaging. With the advancement of micro-nano optical devices, emerging technologies such as metasurfaces and diffractive optical elements (DOEs) have substantially improved the flexibility and efficiency of light field manipulation, promoting the transition of computational imaging from theoretical research to practical engineering applications. Notably, single-lens cameras based on the deep integration of micro-nano optics and computational imaging have already been realized [9,10,11,12,13,14]. These systems feature simple structures and low manufacturing costs, demonstrating strong adaptability in applications with budget constraints and requirements for flexible deployment. By effectively integrating computational imaging methods, it becomes feasible to enhance the overall performance of imaging systems without increasing hardware complexity, thereby offering strong support for the future development of compact optical systems.

However, these computational imaging approaches still encounter new challenges. The sophisticated algorithms they rely on often require substantial computational resources, a limitation that becomes especially critical in dynamic scene perception. As UAVs are increasingly deployed in real-time decision-making tasks, such as disaster monitoring and border patrol, video-rate infrared imaging has become an essential requirement. Existing methods struggle to meet the demands of latency-sensitive video processing applications. Figure 1 presents a comparative analysis of different imaging methods. Although computational imaging methods offer reduced hardware complexity compared to traditional multi-lens imaging systems, the extra latency can limit their practical applications. To address this challenge, lightweight network design has emerged as a research direction, aiming to reduce computational complexity and accelerate inference through model compression techniques such as model pruning [15,16] and quantization [17,18,19]. Our team previously implemented video-rate infrared imaging using a U-Net-based architecture combined with model pruning [20], and further enhanced performance through sensitivity analysis [21], validating the effectiveness of traditional strategies. Nevertheless, experimental results suggest that model compression techniques are approaching their performance limits, making further improvements in inference speed difficult. Moreover, most current network designs focus primarily on statistical image quality metrics such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), with insufficient attention paid to key optical evaluation parameters like the modulation transfer function (MTF). This discrepancy leads to a mismatch between algorithmic outputs and the actual requirements of imaging systems.

In recent years, lightweight convolutional neural networks (CNNs), exemplified by MobileNet, have achieved an effective balance between performance and computational efficiency on resource-constrained edge devices. This success is attributed to innovative architectural components such as depthwise separable convolutions, linear bottlenecks and inverted residual blocks. These network designs have been widely adopted in high-level vision tasks such as image classification [22] and object detection [23,24]. Unlike high-level tasks that prioritize semantic understanding, low-level tasks require precise pixel-level reconstruction. This transformation requires selective modifications based on the core principles of MobileNet. For example, depthwise separable convolution modules have been utilized to construct efficient feature extractors, and multiscale feature fusion mechanisms have been incorporated to improve the recovery of fine image details [25,26,27,28,29,30,31,32]. This modular design philosophy not only inherits the efficiency advantages of lightweight networks, but also provides greater architectural flexibility for complex low-level vision tasks. As a result, it facilitates the practical deployment of high-precision, lightweight image restoration models on mobile and embedded platforms.

Building on these advances, this work proposes Mobile Wavelet Restoration-Net (MWR-Net), a lightweight deep learning framework tailored for real-time image restoration in single-lens infrared computational imaging. The network adopts MobileNetV4 [33] as the encoder backbone and pairs it with a compact decoder based on residual blocks, reducing computational complexity while maintaining restoration accuracy. A modular architecture is designed to flexibly balance performance and efficiency across different stages. To better preserve high-frequency details such as edges, we introduce a wavelet-domain loss function that enhances the network’s sensitivity to fine structures. Furthermore, the MTF is integrated as a complementary metric to evaluate perceptual and optical fidelity, addressing the limitations of conventional pixel-wise metrics. The proposed framework is implemented on the RK3588 edge computing platform, enabling real-time video processing via its NPU acceleration. This study demonstrates the feasibility of combining lightweight network design with frequency-aware constraints for practical, high-performance infrared image restoration under strict resource constraints.

Specifically, the main contributions of this work are summarized as follows:

We propose a lightweight image restoration network based on MobileNet, designed for real-time infrared computational imaging with high efficiency and strong restoration performance;
A wavelet-domain loss function is introduced to explicitly preserve high-frequency details, particularly edges, and the MTF is adopted as a complementary metric to better evaluate perceptual quality;
The network is optimized for edge deployment and demonstrates real-time inference on the RK3588 embedded NPU platform, showing strong potential for practical applications.

2. Related Works

2.1. Image Restoration

Image restoration is typically modeled as a degradation process:

y = k * x + n,

(1)

where k denotes the degradation kernel, x represents the clear image, ∗ represents convolution, and n is additive noise. Traditional image restoration methods focus on degradation caused by external disturbances, such as motion blur, haze, rain, and sensor noise [34,35,36]. These methods usually rely on physical prior knowledge to model the degradation mechanisms. Although they perform well in specific scenarios, their effectiveness depends heavily on the accuracy of these assumptions, which limits their applicability to more complex degradation patterns in real optical systems.
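As a minimal illustration of Equation (1), the degradation can be simulated with NumPy as sketched below. The function name `degrade`, the circular (FFT-based) boundary handling, the 5 × 5 box kernel and the Gaussian noise level are all assumptions chosen for the example; they do not represent the calibrated PSFs or the noise model of the actual camera.

```python
import numpy as np

def degrade(x, k, noise_sigma=0.0, rng=None):
    """Simulate y = k * x + n using circular 2-D convolution via the FFT."""
    rng = np.random.default_rng() if rng is None else rng
    # Zero-pad the kernel to the image size and shift its centre to the origin
    k_pad = np.zeros_like(x, dtype=float)
    kh, kw = k.shape
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k_pad)))
    # Additive Gaussian noise n (hypothetical noise model)
    noise = rng.normal(0.0, noise_sigma, size=x.shape) if noise_sigma > 0 else 0.0
    return blurred + noise

# Toy example: a 5x5 box blur applied to a random "sharp" image
rng = np.random.default_rng(0)
x = rng.random((32, 32))
k = np.full((5, 5), 1.0 / 25.0)
y = degrade(x, k, noise_sigma=0.01, rng=rng)
```

With a delta kernel and no noise the function returns the input unchanged, which is a convenient sanity check before substituting a measured PSF.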

In contrast, image restoration in computational imaging faces unique challenges: its degradation stems from the intrinsic physical limitations of the optical system and manifests as resolution loss and aberrations due to non-ideal point spread functions (PSFs). Early research focused on combining PSF estimation with nonblind deconvolution algorithms. For example, Schuler et al. [37] proposed a PSF estimation method based on a single-lens setup and subsequently used nonblind deconvolution for image restoration. Heide et al. [38] further improved the performance by introducing cross-channel prior information in the restoration process, and Zhan et al. [39] proposed a normal Sinh–Arcsinh model based on noisy image pairs for PSF estimation of a single-lens camera. In addition, Cai et al. [40] introduced a circular segmentation strategy for PSF estimation and achieved high-quality image recovery through nonblind deconvolution.

In recent years, deep learning has introduced a new paradigm for image restoration, enabling data-driven models to learn an end-to-end mapping from coded measurements to high-quality images. Li et al. [41] proposed a PSF-aware deep neural network. Peng et al. [12] introduced a generative adversarial model specifically designed for correcting aberrations in high-resolution, large field-of-view (FoV) lens systems. Gong et al. [42] designed a deep neural network based on orthogonal non-negative matrix decomposition for efficient compensation of optical aberrations in a low-dimensional space.

Although these deep learning-based methods significantly outperform traditional algorithms in quantitative metrics such as PSNR and SSIM, they tend to have higher computational complexity. This raises memory bandwidth bottlenecks and inference latency issues when deployed on embedded platforms, making it difficult to meet the stringent real-time requirements in edge computing scenarios.

2.2. Lightweight Neural Networks

The design of lightweight CNNs emphasizes balancing efficiency and performance for deployment in resource-constrained environments. Among classic lightweight designs, the MobileNet series [33,43,44,45] was one of the earliest to adopt depthwise separable convolutions and inverted residual blocks. A later version, MobileNetV3, integrates neural architecture search, which automatically finds the best connection pattern within a predefined search space. However, inference latency can vary significantly across hardware platforms due to memory constraints. ShuffleNets [46,47] reduce computational costs through group convolutions and channel shuffling, yet the shuffle operation itself can become a bottleneck on highly parallel architectures like GPUs. GhostNets [48,49,50] generate a large number of “ghost” feature maps via inexpensive linear transformations and introduce hardware-friendly attention modules, making them particularly suitable for deployment on common edge devices like ARM CPUs. MobileOne [51] re-evaluated the relationship between parameter count, FLOPs, and model efficiency, identified ReLU as the activation function with the lowest latency, and transformed multi-branch training structures into single-path inference models via structural reparameterization, which makes it well suited to modern mobile devices. However, the reparameterization process can lead to accuracy degradation under quantization, a common trade-off in many lightweight architectures designed for real-world deployment.
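To make the saving from depthwise separable convolutions concrete, the short sketch below counts the weights of a standard convolution and of its depthwise separable counterpart. The layer size (3 × 3 kernel, 32 input and 64 output channels) is an arbitrary example rather than a layer taken from MWR-Net, and biases are ignored.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k=3):
    """Depthwise k x k conv (one filter per input channel) plus pointwise 1 x 1 conv."""
    return k * k * c_in + c_in * c_out

standard = conv_params(32, 64)            # 9 * 32 * 64 = 18432
separable = dw_separable_params(32, 64)   # 288 + 2048 = 2336
reduction = standard / separable          # roughly 7.9x fewer weights
```

For a fixed kernel size the ratio approaches k² as the number of output channels grows, which is why the factorization pays off most in wide layers.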

Meanwhile, Vision Transformers (ViTs) and their variants have achieved state-of-the-art performance across various computer vision tasks due to their long-range modeling capabilities. Lightweight ViT variants have also made significant progress toward edge deployment. MobileFormer [52] combines local detail extraction and global semantic understanding through a parallel architecture integrating MobileNet and Transformer components. MobileViT [53] enhances the inverted residual structure of MobileNetV2 by inserting compact MobileViT modules at strategic locations. However, such hybrid designs can introduce operational heterogeneity, such as alternating between convolutional and self-attention operators, which may reduce deployment efficiency on runtimes heavily optimized for CNNs. EfficientFormer [54] simplifies the computation graph by maintaining consistent token dimensions and determines optimal configurations through latency-aware architecture search. EfficientViT [55] reduces mobile inference latency by employing multi-scale linear attention mechanisms. Despite these advances, contemporary edge hardware and compiler stacks tend to offer more mature support for convolutional operators. Consequently, well-optimized convolutional architectures often maintain practical advantages in scenarios demanding low latency and high throughput. It is partly for this reason that we select MobileNetV4 as our backbone, not only owing to its contemporary design but also due to its explicit optimization across diverse hardware platforms, including CPUs, GPUs, DSPs, and NPUs.

2.3. Frequency-Domain Learning

Frequency-domain analysis offers a global perspective on image restoration that complements spatial-domain modeling. In this context, high-frequency components correspond to edges and fine textures, while low-frequency components represent smooth regions and gradually varying regions of an image. Advances in tools such as wavelet transforms and Fourier analysis have significantly contributed to the development of more effective techniques in this area.

In terms of frequency-domain constraints, Jiang et al. [56] proposed the Focal Frequency Loss, inspired by the class imbalance handling mechanism in Focal Loss. This method enhances the reconstruction of critical frequency bands through adaptive frequency weighting. Fuoli et al. [57] introduced a Fourier-domain discriminator, which encourages the generator to align its output with real data in the frequency domain, thereby improving perceptual quality in super-resolution tasks. Korkmaz et al. [58] explored a GAN-based framework that focuses on detail sub-bands while omitting the low–low (LL) sub-band, effectively reducing image artifacts during reconstruction.

Hybrid-domain architectural innovations have also emerged to better exploit frequency characteristics. For instance, DeepRFT [59] proposes a frequency-domain residual convolution framework. Spatial feature maps are transformed into the frequency domain, where 1 × 1 convolutions are applied to decouple high- and low-frequency components for targeted learning. LoFormer [60] introduces a local channel self-attention mechanism in the frequency domain, capturing cross-covariance across frequency bands within localized windows. Another study [61] proposes multi-branch and content-aware modules that dynamically and locally decompose features into independent frequency sub-bands, selectively emphasizing the most informative components for restoration. Building upon this work, MFSNet [62] modulates frequency information in skip connections to improve information flow efficiency.

Despite their theoretical advantages, frequency-domain network modules often face practical deployment challenges on edge devices. For example, complex FFT/IFFT operators require strong hardware support, which is typically not well optimized on NPUs or other embedded accelerators. Therefore, this paper incorporates frequency-domain learning into the loss function rather than the network architecture. This approach aims to enhance the recovery of high-frequency details while maintaining hardware compatibility and ensuring efficient deployment in edge computing environments.

In summary, current deep learning-based restoration methods often face a trade-off between accuracy and efficiency, making them difficult to deploy in real-time edge applications. Moreover, frequency-domain techniques, while effective for detail preservation, are typically computationally intensive and poorly optimized for embedded hardware. These limitations highlight the need for lightweight, hardware-compatible approaches capable of high-quality infrared image restoration, particularly in preserving critical high-frequency information.

3. Proposed Method

This study proposes a lightweight image restoration framework tailored for edge computing scenarios, based on the MobileNetV4-Small architecture. From the perspective of lightweight design, the encoder employs depthwise separable convolutions as the fundamental operators, reducing computational complexity while maintaining effective feature representation. A four-level feature pyramid is constructed through the optimized distribution of convolutional layers. In the first two stages, we use fewer layers while the resolution is still high, and in the last two stages we use more layers. This design reduces the memory usage of activation tensors, thereby alleviating potential memory bottlenecks in edge devices. In the decoder design, a multiscale feature fusion strategy is adopted to integrate information from the former block, pooled original input and encoded features from skip connections. The channel dimension is expanded to enhance decoding capability. Furthermore, a detail refiner module is introduced before the final output to enhance texture preservation. The overall network architecture is illustrated in Figure 2. To improve perceptual consistency, we propose a multi-level joint optimization objective. During training, the Charbonnier loss, perceptual loss, and wavelet loss are jointly incorporated to form a cross-domain collaborative optimization mechanism. Experimental results demonstrate that this multi-dimensional loss function significantly enhances high-frequency detail recovery and global structural fidelity, leading to notable improvements in the MTF metric.

3.1. Feature Extraction Encoder

The proposed method adopts MobileNetV4-Small, a general-purpose lightweight architecture developed by Google, as the backbone for feature extraction. This network achieves strong feature representation capabilities and low inference latency across different platforms, so we consider it a suitable basis for our design. Its core design philosophy focuses on optimizing operational intensity and balancing computation with memory bandwidth, enabling near-Pareto-optimal performance with carefully designed modules.

To better adapt the architecture to image restoration tasks, we first follow common practices in low-level vision tasks [63,64] by removing the batch normalization (BN) layers from the original architecture. BN tends to normalize fine-grained details, which can negatively impact feature expressiveness in restoration tasks. Subsequently, a multiscale feature pyramid is constructed across four resolution levels: 1/2, 1/4, 1/8, and 1/16 of the original image size, with corresponding output channel dimensions set to 32, 32, 64, and 96, respectively. Additionally, a bottleneck layer is added at the 1/16 resolution level to further enhance feature abstraction.
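The feature pyramid described above can be tabulated with a few lines of Python. The sketch assumes that the 480 × 640 input resolution quoted in the experimental setup means 480 rows by 640 columns; the helper name `pyramid_shapes` is hypothetical.

```python
def pyramid_shapes(height, width, strides=(2, 4, 8, 16), channels=(32, 32, 64, 96)):
    """Return (channels, height, width) of each encoder level."""
    return [(c, height // s, width // s) for s, c in zip(strides, channels)]

shapes = pyramid_shapes(480, 640)
# Number of activation values stored at each level
sizes = [c * h * w for c, h, w in shapes]
```

Under this assumption the 1/2-resolution level holds 2,457,600 activation values and the 1/16 level only 115,200, which illustrates why placing fewer layers at high resolution reduces activation memory.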

This memory-aware architecture improves the model’s ability to capture both global structures and local details at a lightweight scale. As a result, it is better equipped to address complex degradation patterns commonly encountered in computational image restoration tasks.

3.2. Decoder

The decoder adopts a lightweight multiscale fusion architecture, where each decoding stage consists of three input branches:

Upsampled feature: the output from the preceding decoder block;
Multi-scale pooled feature: global semantic information extracted by applying multi-scale average pooling to the original input image;
Encoder skip connection: shallow features preserved and fused from corresponding encoder layers to retain fine-grained spatial details.

In implementation, bilinear interpolation is first used to upsample the output of the preceding decoder block to twice its resolution. Meanwhile, multi-scale average pooling is applied to the original input image to extract global features at different levels. Then, the features from the three sources are concatenated along the channel dimension and mixed through a 1 × 1 convolutional layer. This is followed by a plain residual block composed of two 3 × 3 convolutional layers, which performs feature fusion and detail reconstruction. Across the four decoding stages, the numbers of output channels in these blocks are set to 80, 64, 40, and 32, respectively. In addition, we use the ReLU6 activation function instead of standard ReLU to improve quantization performance, because it bounds activations to the range [0, 6] and thus limits the dynamic range that must be quantized.
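The fusion step of one decoding stage can be sketched in NumPy as follows. This is only an illustration under several assumptions: the input is treated as a single-channel image, the pooled branch uses one pooling factor matched to the stage resolution, the tensor sizes are toy values, the weights are random, and the residual block of two 3 × 3 convolutions is omitted. All function names are hypothetical.

```python
import numpy as np

def avg_pool(img, factor):
    """Average-pool a (C, H, W) array by an integer factor."""
    c, h, w = img.shape
    return img.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def upsample2x_bilinear(feat):
    """Bilinear 2x upsampling of a (C, H, W) array (half-pixel centres, edge clamp)."""
    c, h, w = feat.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[None, :, None], (xs - x0)[None, None, :]
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bottom = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def relu6(x):
    """ReLU6 clips activations to the range [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def fuse_stage(prev, image, skip, weight):
    """Concatenate the three branches and mix them with a 1 x 1 convolution."""
    up = upsample2x_bilinear(prev)                      # branch 1: upsampled feature
    pooled = avg_pool(image, image.shape[1] // up.shape[1])  # branch 2: pooled input
    stacked = np.concatenate([up, pooled, skip], axis=0)     # branch 3: encoder skip
    mixed = np.einsum("oc,chw->ohw", weight, stacked)   # 1 x 1 conv as channel mixing
    return relu6(mixed)

# Toy shapes: 96-channel previous feature, 64-channel skip, 80 output channels
rng = np.random.default_rng(0)
prev = rng.standard_normal((96, 6, 8))
skip = rng.standard_normal((64, 12, 16))
image = rng.random((1, 96, 128))
weight = rng.standard_normal((80, 96 + 1 + 64)) * 0.05
out = fuse_stage(prev, image, skip, weight)
```

In a deployed model each of these operations would map onto standard convolution, pooling and resize operators, which is consistent with the emphasis on NPU-friendly building blocks.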

In the final decoding stage, an additional detail refinement module consisting of three convolutional layers is introduced at the original image resolution. This module further refines the restored image, enhancing texture quality and overall visual fidelity.

3.3. Loss Function Design

We adopt a composite loss function that integrates three components: the Charbonnier loss, perceptual loss, and wavelet loss. This combination enables the model to achieve higher reconstruction quality at the pixel level, feature level, and frequency domain, respectively.

The Charbonnier loss serves as the primary loss term. As a robust variant of the L1 loss, it is particularly effective in handling noise and outliers. It introduces a small constant ϵ to smooth the gradient computation, and is defined as:

L_{charbonnier} (x, y) = \sqrt{(x - y)^{2} + ϵ^{2}},

(2)

where x and y denote the predicted and ground-truth images, respectively, and ϵ is typically set to 1 \times 10^{-3}.
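A minimal NumPy sketch of the Charbonnier loss in Equation (2) is given below. Averaging the per-pixel values over the image is an assumption; the text does not state the reduction that is used.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((pred - target)^2 + eps^2) over all pixels (mean reduction assumed)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.sqrt(diff ** 2 + eps ** 2)))

# Toy example on random images
rng = np.random.default_rng(0)
target = rng.random((16, 16))
pred = target + 0.05 * rng.standard_normal((16, 16))
loss = charbonnier_loss(pred, target)
```

For identical images the loss equals ϵ rather than zero, and for large errors it behaves like the L1 loss, which is where its robustness to outliers comes from.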

In addition, we incorporate a VGG-based perceptual loss to enhance semantic consistency and visual quality. The perceptual loss measures the discrepancy between high-level features of the predicted and ground-truth images in the VGG feature space. It is formulated as:

L_{perceptual} (x, y) = \sum_{i} λ_{i} {∥ϕ_{i} (x) - ϕ_{i} (y)∥}_{1},

(3)

where ϕ_{i} (\cdot) denotes the feature map extracted from the i-th layer of the VGG network, and λ_{i} represents the weight assigned to that layer. By adjusting the weights across different layers, this loss guides the model to reconstruct more realistic and visually pleasing results.

Furthermore, considering that image restoration, especially deblurring, relies heavily on accurate recovery of high-frequency information, we propose a stationary wavelet transform (SWT)-based wavelet loss to explicitly constrain differences in the frequency domain, as shown in Figure 3. Compared to the conventional discrete wavelet transform (DWT), SWT avoids downsampling and preserves spatial resolution consistency across sub-bands.

Specifically, we apply symlet wavelets to decompose both the predicted and ground-truth images into four sub-bands: low–low (LL), low–high (LH), high–low (HL), and high–high (HH). An L1 distance is then computed for each sub-band:

L_{wavelet} (x, y) = \sum_{s \in \{LL, LH, HL, HH\}} α_{s} {∥{SWT}_{s} (x) - {SWT}_{s} (y)∥}_{1},

(4)

where α_{s} denotes the weight assigned to each sub-band. This frequency-aware loss helps the model better recover fine textures and edge structures, which are critical for high-quality image restoration.
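The structure of Equation (4) can be illustrated with a single-level undecimated transform written in plain NumPy. Note the substitutions made for the sketch: a Haar filter pair replaces the symlet wavelet (whose order is not specified in the text), boundaries are handled circularly, the L1 norm is normalized by the number of pixels, and all sub-band weights α_s are set to 1. The sub-band naming convention (first letter for the horizontal filter, second for the vertical filter) is also an assumption.

```python
import numpy as np

def haar_swt2(img):
    """Single-level undecimated Haar transform; every sub-band keeps the input size."""
    s = np.sqrt(2.0)

    def lo(a, axis):  # low-pass: scaled sum of neighbouring samples
        return (a + np.roll(a, 1, axis=axis)) / s

    def hi(a, axis):  # high-pass: scaled difference of neighbouring samples
        return (a - np.roll(a, 1, axis=axis)) / s

    return {
        "LL": lo(lo(img, 1), 0),
        "LH": hi(lo(img, 1), 0),
        "HL": lo(hi(img, 1), 0),
        "HH": hi(hi(img, 1), 0),
    }

def wavelet_loss(pred, target, weights=None):
    """Weighted sum of per-sub-band mean absolute differences."""
    weights = weights or {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}
    p, t = haar_swt2(pred), haar_swt2(target)
    return float(sum(w * np.mean(np.abs(p[s] - t[s])) for s, w in weights.items()))

# Toy example on random images
rng = np.random.default_rng(0)
target = rng.random((16, 16))
pred = target + 0.05 * rng.standard_normal((16, 16))
loss = wavelet_loss(pred, target)
```

Because a constant brightness offset only affects the LL sub-band, raising the weights of LH, HL and HH relative to LL shifts the emphasis of the loss toward edges and textures.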

3.4. Experimental Setups

The experiments are based on a single-lens infrared camera newly designed by our team, with a focal length of 55 mm and an aperture of f/1.0. The system employs a hybrid refractive–diffractive design: the front surface features an aspheric profile with a diffractive zone, and the rear surface is fully diffractive [20]. This configuration provides sufficient focusing ability while maintaining a simple, monolithic structure. We calibrated the PSF of the camera at nine different fields of view through a series of detailed experiments. To construct the training dataset, we selected the public dataset provided by Raytron, which contains 7224 images. These images were randomly split into a training set with 6498 images and a test set with 726 images, maintaining a ratio of approximately 9:1. The dataset covers nine categories of scenes: seascapes, animals, industrial scenes, cityscapes, landscapes, indoor scenes, portraits, surveillance views, and vehicle-mounted views, as shown in Figure 4.
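A reproducible random split with the stated sizes could look like the sketch below. The seed value and the use of integer indices in place of image files are assumptions made for illustration; the actual split procedure is not described beyond being random.

```python
import random

def split_indices(n_total=7224, n_test=726, seed=0):
    """Shuffle image indices with a fixed seed and split them into train and test."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices()
train_fraction = len(train_idx) / (len(train_idx) + len(test_idx))
```

With 6498 training and 726 test images the training fraction is about 0.8995, which matches the stated ratio of approximately 9:1.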

Using the calibrated PSF data, we simulated the degradation process during imaging through the lens and image signal processor (ISP), thereby generating a high-precision aligned dataset. This dataset serves as a solid foundation for model training, ensuring that the model can learn to recover sharp images from blurred inputs. During training, we employed the Adam optimizer with hyperparameters set to β_{1} = 0.9 and β_{2} = 0.999, using a batch size of 8. The entire training process lasted 400 epochs, with the initial learning rate set to 2 \times 10^{-4} and gradually decayed to 1 \times 10^{-7}. This learning rate adjustment strategy facilitates rapid convergence in the early stages of training while enabling fine-tuning of model parameters in later stages.
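The text specifies only the start and end values of the learning rate, not the shape of the decay. As one plausible choice, the sketch below uses cosine annealing; this schedule is an assumption and may differ from the one actually used.

```python
import math

def cosine_lr(epoch, total_epochs=400, lr_max=2e-4, lr_min=1e-7):
    """Cosine decay from lr_max at epoch 0 to lr_min at total_epochs (assumed schedule)."""
    t = min(max(epoch, 0), total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

schedule = [cosine_lr(e) for e in range(401)]
```

Any monotonically decreasing schedule between the two stated values would fit the description equally well; the cosine form simply keeps the rate high early on and flattens out near the end.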

To enhance the generalization capability of the model and prevent overfitting, we applied horizontal and vertical flipping as data augmentation techniques during training, along with multiscale Gaussian noise injection. Additionally, we used the Thop toolkit to measure the model’s parameter count and MACs at an input resolution of 480 × 640. The inference speed was evaluated using RKNN-Toolkit2 v2.3.0 on the RK3588 platform.
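The augmentation described above can be sketched as follows. The flip probability of 0.5, the set of noise levels, and the choice to add noise only to the degraded input are assumptions for illustration; the function name `augment_pair` is hypothetical.

```python
import numpy as np

def augment_pair(blurred, sharp, rng, noise_sigmas=(0.0, 0.005, 0.01, 0.02)):
    """Apply the same random flips to both images, then add noise to the input only."""
    if rng.random() < 0.5:  # horizontal flip
        blurred, sharp = blurred[:, ::-1], sharp[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        blurred, sharp = blurred[::-1, :], sharp[::-1, :]
    sigma = rng.choice(noise_sigmas)  # pick one noise scale per sample
    noisy = blurred + rng.normal(0.0, sigma, size=blurred.shape)
    return noisy, sharp.copy()

# Toy example on a small image pair
rng = np.random.default_rng(0)
sharp = rng.random((8, 10))
blurred = 0.5 * (sharp + np.roll(sharp, 1, axis=1))
noisy_input, label = augment_pair(blurred, sharp, rng)
```

Applying identical flips to the input and the label keeps the pair pixel-aligned, which matters because the dataset is generated to be precisely registered.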

All quantitative restoration metrics (PSNR and SSIM) reported in the results are averaged over the entire test set (N = 726), ensuring statistical reliability. The MTF values are obtained from measurements of slit targets averaged across 9 field-of-view positions. Inference speed is computed as the average of 100 independent runs on the embedded platform after 5 warm-up iterations to eliminate system initialization overhead and stabilize hardware performance.
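The timing protocol (5 warm-up iterations followed by 100 timed runs) can be expressed as a small harness. Here `run_once` stands in for a single inference call and is a hypothetical placeholder; reporting FPS as the inverse of the mean latency is also an assumption about how the average is formed.

```python
import time

def measure_fps(run_once, warmup=5, runs=100):
    """Return (fps, mean latency in seconds) after discarding warm-up iterations."""
    for _ in range(warmup):
        run_once()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    mean_latency = sum(durations) / len(durations)
    return 1.0 / mean_latency, mean_latency

# Dummy workload standing in for one inference pass
fps, latency = measure_fps(lambda: sum(range(10000)))
```

On the target device the lambda would be replaced by the runtime's inference call, with input preparation kept outside the timed region if only the network latency is of interest.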

3.5. Evaluation Metrics

We adopted PSNR, SSIM, and MTF as quantitative evaluation metrics. PSNR and SSIM are widely used in image restoration tasks and primarily assess pixel-level fidelity between reconstructed and ground-truth images. However, these metrics do not always correlate well with perceptual quality; in some cases, excessively high PSNR values may even result in visually over-smoothed or blurred outputs. The PSNR is defined as

PSNR = 10 \cdot \log_{10} \left( \frac{MAX_{I}^{2}}{MSE} \right),

(5)

where MAX_{I} denotes the maximum possible pixel value of the image and MSE denotes the mean squared error between the restored image x and the ground-truth image y.
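Equation (5) translates directly into NumPy, as sketched below. The default `max_val=1.0` assumes images normalized to [0, 1]; for 8-bit data the value would be 255.

```python
import numpy as np

def psnr(restored, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    restored = np.asarray(restored, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mse = np.mean((restored - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB
value = psnr(np.zeros((8, 8)), np.full((8, 8), 0.1))
```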

The SSIM index evaluates structural similarity by considering luminance, contrast, and structural information, and is defined as

SSIM (x, y) = \frac{(2 μ_{x} μ_{y} + C_{1}) (2 σ_{x y} + C_{2})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1}) (σ_{x}^{2} + σ_{y}^{2} + C_{2})},

(6)

where μ_{x}, μ_{y}, σ_{x}, σ_{y} and σ_{x y} represent the local means, standard deviations, and cross-covariance of x and y, respectively. C_{1} and C_{2} are small constants to stabilize the division.
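As a simplified illustration of Equation (6), the sketch below evaluates the SSIM expression once over the whole image instead of over local windows, so it is not equivalent to the windowed SSIM normally reported. The constants C_1 = (0.01 L)² and C_2 = (0.03 L)², with L the data range, are the commonly used defaults and are assumed here because the text does not give their values.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula applied with global (whole-image) statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

# Toy example: an image compared with itself and with its negative
rng = np.random.default_rng(0)
img = rng.random((16, 16))
same = ssim_global(img, img)
inverted = ssim_global(img, 1.0 - img)
```

A windowed implementation computes the same expression inside a sliding window and averages the resulting map, which makes the score sensitive to local structure rather than only to global statistics.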

Because PSNR and SSIM do not always reflect perceived sharpness, we introduce MTF at the Nyquist frequency as a supplementary evaluation criterion, which is defined as

MTF = \frac{π}{4} \cdot \frac{I_{max} - I_{min}}{I_{max} + I_{min}},

(7)

where I_{max} and I_{min} represent the maximum and minimum intensity values in the high-contrast edge or texture regions of the image. This formulation quantifies the system’s ability to preserve contrast at the highest resolvable spatial frequency. MTF is a standard metric in optical system evaluation and effectively characterizes the system’s ability to preserve high-frequency information. By incorporating MTF into the evaluation framework, we obtain a more comprehensive and physically meaningful assessment of image restoration performance, particularly in terms of perceptual sharpness and detail recovery.
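Equation (7) can be evaluated from an intensity profile taken across a target, as in the sketch below. Taking the extrema directly from a one-dimensional profile is an assumption made for illustration; the measurement procedure on the slit targets is not detailed in the text, and the function names are hypothetical.

```python
import math

def mtf_from_extrema(i_max, i_min):
    """MTF estimate (pi / 4) * (I_max - I_min) / (I_max + I_min)."""
    if i_max + i_min == 0:
        raise ValueError("I_max + I_min must be non-zero")
    return (math.pi / 4.0) * (i_max - i_min) / (i_max + i_min)

def mtf_from_profile(profile):
    """Use the maximum and minimum of an intensity profile across the target."""
    return mtf_from_extrema(max(profile), min(profile))

# Toy profile across alternating bright and dark bars
example = mtf_from_profile([0.9, 0.2, 0.88, 0.21, 0.9, 0.2])
```

With this definition a fully modulated pattern (I_min = 0) yields π/4 ≈ 0.785, so values computed this way are bounded by that figure.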

4. Experiments and Results

4.1. Quantitative Evaluation

To evaluate the effectiveness of the proposed method, we conducted comparative experiments with a previously developed model from our team. The original model achieved strong performance on conventional image quality metrics, yielding a PSNR of 36.97 dB and an SSIM of 0.962. However, it was relatively heavy, with 8.63 million parameters and a computational complexity of 68.12 G MACs, which makes it unsuitable for real-time or edge deployment. To create a more deployable baseline, we applied structured pruning with a reduction ratio of 50%, resulting in a lightweight variant of the original model. This pruned version reduced the parameter count to 3.92 million and lowered MACs to 28.44 G. Despite these reductions, the pruned model still maintained competitive performance, achieving a PSNR of 36.37 dB and an SSIM of 0.959. Notably, its MTF performance was also well preserved: at 0.5001, it remains close to the 0.5190 of the original model. However, visual evaluations and frequency-domain analysis revealed that the pruned baseline still exhibited limitations in reconstructing high-frequency details, indicating room for improvement in perceptual and structural fidelity.

MWR-Net integrates the lightweight architecture of MobileNetV4 with a wavelet-domain constrained loss function, achieving both high efficiency and strong restoration performance. At FP32 precision, MWR-Net outperforms the baseline model on every reported metric (Table 1). It contains only 666.34 K parameters and requires 6.17 G MACs, making it well suited to edge deployment. Without the wavelet-domain loss, MWR-Net achieves a PSNR of 36.63 dB, which is 0.26 dB higher than that of the baseline, and an SSIM of 0.962, demonstrating efficient utilization of model parameters. With the wavelet-domain constrained loss function, performance improves further to a PSNR of 37.10 dB and an SSIM of 0.964, and the model delivers the best MTF among all tested models: 0.6903, compared with the baseline’s 0.5001. This improvement highlights MWR-Net’s superior capability in reconstructing high-frequency details, which is critical for high-quality image restoration in real-world applications.

In practical deployment tests (Table 2), MWR-Net demonstrates robust performance under model quantization. Both models were evaluated on the RK3588 chip, whose NPU provides 6 TOPS of computing power. At INT8 precision, the baseline model suffers a noticeable performance degradation: its PSNR drops to 35.44 dB, its SSIM decreases to 0.951, and its MTF value is 0.5144. In contrast, MWR-Net maintains a PSNR of 35.52 dB and an SSIM of 0.948 after quantization, with the MTF value remaining at 0.6767. Particularly noteworthy is that the inference speed of MWR-Net reaches 42 FPS, a 27.3% improvement over the baseline’s 33 FPS. This performance gain is achieved while still preserving superior capability in high-frequency information restoration, underscoring MWR-Net’s strong hardware compatibility and computational efficiency.
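To illustrate where INT8 degradation comes from, the following sketch applies a generic symmetric per-tensor quantization to a few values and measures the round-trip error. This is an assumption made purely for illustration: the text does not state which quantization scheme the RK3588 toolchain applies, and the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical weights; the reconstruction error per value is bounded by scale / 2.
weights = [0.5, -1.27, 0.3333, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, worst_error <= scale / 2 + 1e-12)
```

Tensors with a few large outliers force a coarse scale, so the remaining values are represented with less relative precision, which is one common reason a network loses accuracy after quantization.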

Experimental results demonstrate that MWR-Net effectively balances accuracy and efficiency under FP32 precision and retains its MTF and speed advantages under INT8 quantization. These findings indicate that the proposed framework is better suited to video-rate, resource-constrained deployment than the pruned baseline. The architectural innovations introduced in this study not only validate the synergistic benefits of lightweight network design and wavelet-domain constraints, but also provide a promising technical pathway for the development of efficient, real-time video processing systems in remote sensing and related applications.

4.2. Visual Evaluation

To compare restoration performance visually, we selected representative samples from the test set and reconstructed them with three models: the baseline model, MWR-Net without wavelet loss, and MWR-Net with wavelet loss. The visual results are presented in Figure 5, Figure 6 and Figure 7. They indicate that MWR-Net provides better overall reconstruction quality than the baseline. Moreover, incorporating the wavelet-domain loss function further improves the recovery of high-frequency details such as edges and textures, enhancing perceptual sharpness and structural fidelity.

Specifically, as shown in Figure 5, MWR-Net demonstrates superior performance in reconstructing fine textures on human subjects, accurately recovering clothing patterns and fabric details. In contrast, the baseline output suffers from noticeable blurring and residual noise. The incorporation of the wavelet-domain loss further enhances edge sharpness and the rendering of fine structures, contributing to improved perceptual quality. Moreover, as illustrated in Figure 6 and Figure 7, in scenes containing building structures and large-scale urban environments, the proposed method better preserves structural boundaries and textural features. This yields a clearer representation of high-frequency elements such as signage, window grids, and architectural contours. These improvements matter not only for visual realism but also for downstream vision tasks such as object detection and semantic segmentation.

Figure 8 presents the visual results of MTF testing conducted at a room temperature of 20 °C. The MTF values exceed 0.5 across all fields of view (FoVs), further confirming MWR-Net’s ability to recover high-frequency details.

In summary, the proposed method not only achieves superior subjective visual quality but also demonstrates significant improvements in objective evaluation metrics, particularly in edge reconstruction accuracy and fine-detail restoration while maintaining higher inference speed.

5. Discussion

5.1. Interpretation of Key Results

This study validates the integration of a lightweight architecture with frequency-domain constraints in MWR-Net, achieving superior image restoration quality alongside significantly enhanced efficiency. Compared to the pruned baseline model, MWR-Net reduces parameter count by 83% and computational complexity by 78%, while attaining a 0.26 dB higher PSNR even without the wavelet-domain loss (0.73 dB higher with it). This “lighter yet stronger” performance demonstrates that employing MobileNetV4 as a natively lightweight backbone provides greater architectural advantages than post hoc pruning of large models, enabling more efficient feature extraction. Crucially, the wavelet-based loss function directly addresses the baseline’s deficiency in reconstructing high-frequency details. MWR-Net’s MTF value of 0.6903 represents a 38% improvement over the baseline’s 0.5001, objectively confirming that frequency-domain constraints effectively guide the model toward sharper image reconstruction. Visual results consistently demonstrate this loss function’s dual benefit of enhancing textures and edges while suppressing artifacts and noise.
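The percentages quoted above can be reproduced directly from the values in Table 1. The short sketch below does this arithmetic; the helper names are hypothetical, and the inputs are the FP32 figures reported for the pruned baseline and MWR-Net.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# FP32 values from Table 1 (pruned baseline vs. MWR-Net).
baseline = {"params": 3.92e6, "macs": 28.44e9, "psnr": 36.37, "mtf": 0.5001}
mwr_net = {"params": 666.34e3, "macs": 6.17e9, "psnr_no_wavelet": 36.63,
           "psnr_wavelet": 37.10, "mtf": 0.6903}

print(f"{percent_reduction(baseline['params'], mwr_net['params']):.1f}")  # 83.0
print(f"{percent_reduction(baseline['macs'], mwr_net['macs']):.1f}")      # 78.3
print(f"{percent_increase(baseline['mtf'], mwr_net['mtf']):.1f}")         # 38.0
print(f"{mwr_net['psnr_no_wavelet'] - baseline['psnr']:.2f}")             # 0.26
print(f"{mwr_net['psnr_wavelet'] - baseline['psnr']:.2f}")                # 0.73
```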

5.2. Comparison with Lightweight Design Strategies

Our comparative analysis reveals distinctive advantages and limitations of different lightweight design approaches. The pruned baseline model represents a common strategy: compressing a complex model such as U-Net. While metrically competitive, it exhibits inherent limitations. Its architecture remains suboptimal because pruning cannot fundamentally alter an inefficient computational graph. This structural deficiency manifests in pronounced quantization vulnerability, with a 0.93 dB PSNR drop (from 36.37 dB to 35.44 dB) under INT8 quantization, indicating sensitive weight distributions. Furthermore, because its original design lacks high-frequency optimization, its outputs are perceptually smooth but deficient in detail. In contrast, MWR-Net embodies a “natively lightweight” philosophy by directly incorporating MobileNetV4, which is architecturally optimized for mobile deployment. This foundational difference enables superior quantization robustness and measurable inference speed advantages, demonstrating that intrinsic lightweight design outperforms post hoc compression for edge deployment scenarios.

5.3. Limitations and Further Analysis

Despite its advantages, MWR-Net’s performance boundaries reflect inherent trade-offs in lightweight design. The compact architecture inevitably constrains representational capacity, particularly in extreme low signal-to-noise ratio scenarios or exceptionally complex noise patterns where larger models would maintain superiority—a deliberate trade-off of generalization capability for efficiency. Additionally, the method’s effectiveness remains contingent on accurate PSF modeling, with imperfect PSF estimation leading to inferior restoration and limiting plug-and-play adaptability across diverse imaging systems. The current empirical selection of general-purpose wavelet bases presents another limitation, as these may not optimally represent infrared spectral characteristics, thereby constraining the full potential of frequency-domain optimization.

5.4. Future Work Directions

Building on these insights, future research will pursue two primary directions to overcome current limitations. First, we will develop learnable PSF estimation modules to integrate optical characterization with image restoration within an end-to-end optimized framework, enhancing adaptability to varying imaging systems while reducing dependency on specialized hardware calibration. Second, we will investigate task-driven wavelet learning mechanisms to adaptively generate optimal wavelet bases specific to infrared image characteristics, transforming frequency-domain constraints from empirically designed to data-driven components for improved effectiveness. Furthermore, we intend to develop a visible-light image restoration variant of MWR-Net by adapting the network architecture to account for the intrinsic differences between infrared and visible imaging modalities. This includes refining feature representation modules and reconstruction mechanisms to better accommodate the distinct characteristics of visible images. Such structural adaptation is expected to improve the model’s flexibility and broaden its applicability in different imaging scenarios. These directions aim to advance both the theoretical foundation and practical applicability of lightweight computational imaging systems.

6. Conclusions

This study addresses the trade-off between system compactness and performance, as well as the challenge of restoring high-frequency details in single-lens infrared imaging systems. We propose a lightweight, end-to-end image restoration network, MWR-Net, which integrates MobileNetV4 as the encoder backbone and incorporates a wavelet-domain loss function for frequency-domain constraints.

Experimental results show that MWR-Net maintains reconstruction performance despite substantial reductions in both parameter count and computational cost, overcoming the conventional trade-off between accuracy and efficiency in lightweight model design. The wavelet-based loss function and the MTF metric further reveal that the network performs well in the frequency domain, leading to more detailed image output. This approach achieves high scores in both traditional objective evaluation indicators and subjective visual quality.

MWR-Net proves to be highly suitable for video-level signal processing. It is well equipped to handle high-throughput data streams while maintaining performance in resource-limited environments. Deployment on embedded platforms confirms its real-time inference capability, making it a practical choice for time-critical applications like UAV-based remote sensing.

Overall, this study highlights the value of co-optimizing lightweight architectures with frequency-domain guidance, offering a promising direction for future real-time infrared computational imaging systems. Future work will focus on enhancing the network’s adaptability to extreme degradation scenarios and expanding its applicability in real-world imaging tasks.

Author Contributions

Conceptualization, X.Q., X.W. and X.D.; methodology, X.Q., X.W. and X.D.; software, X.Q. and Y.X.; validation, X.W. and X.D.; formal analysis, X.W. and G.Y.; resources, X.D., Z.W. and X.C.; writing – original draft preparation, X.Q.; writing – review and editing, X.W. and G.Y.; visualization, X.Q.; funding acquisition, X.W., X.D. and X.C.; supervision, Z.W. and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (Grant Nos. 62305250, 61925504, 62105243, 62205248).

Data Availability Statement

The dataset used in this study is available at http://openai.iraytek.com/apply/High_resolution.html/ (accessed on 16 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAVs	Unmanned Aerial Vehicles
MWR-Net	Mobile Wavelet Restoration-Net
MTF	Modulation Transfer Function
MACs	Multiply–Accumulate Operations
DOEs	Diffractive Optical Elements
BN	Batch Normalization
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index Measure
NPU	Neural Processing Unit
CNNs	Convolutional Neural Networks
FOV	Field of View
FFT	Fast Fourier Transform
IFFT	Inverse Fast Fourier Transform

Figure 1. A comparative illustration of distinct infrared imaging methodologies. Line 1: traditional multi-lens imaging. Line 2: single-lens computational imaging. Real-time, resource-constrained UAV-based applications demand a substantial increase in both the inference speed of the reconstruction network and the optical performance.

Figure 2. The overall structure of MWR-Net. The network adopts an encoder–decoder framework with multi-level skip connections. The input image is first passed through a series of encoder blocks, while spatial average pooling operations are applied in parallel. At the deepest stage, a bottleneck block encodes high-level features using a lightweight structure. The decoder blocks reconstruct the image by integrating features from three corresponding sources. A final Refiner Block is added to further enhance image fidelity. The network is optimized using a combination of Charbonnier loss, perceptual loss, and a wavelet-domain loss.

Figure 3. The process of SWT decomposition. A set of low-pass and high-pass filters is first applied to the image along the column direction. The same filtering operations are then performed on the intermediate results from the first step, this time along the row direction, resulting in four sub-bands: LL, LH, HL, and HH.
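The two-step filtering described in the Figure 3 caption can be sketched as a single-level undecimated decomposition. The snippet below is illustrative only: it assumes a Haar filter pair, periodic boundary handling, a 1/2 normalization, and a particular sub-band naming order (first letter for the column-direction filter, second for the row-direction filter), none of which are stated in the caption. The function names are hypothetical.

```python
def _filter_1d(signal, high):
    """Undecimated Haar filter with periodic boundary: average or half-difference."""
    n = len(signal)
    sign = -1.0 if high else 1.0
    return [(signal[i] + sign * signal[(i + 1) % n]) / 2.0 for i in range(n)]


def _filter_columns(img, high):
    """Apply the 1-D filter down each column of a list-of-rows image."""
    cols = [_filter_1d(list(col), high) for col in zip(*img)]
    return [list(row) for row in zip(*cols)]


def _filter_rows(img, high):
    """Apply the 1-D filter along each row."""
    return [_filter_1d(row, high) for row in img]


def swt_haar_level1(img):
    """Column-direction filtering first, then row-direction filtering.

    No downsampling is performed, so every sub-band keeps the input size.
    """
    low = _filter_columns(img, high=False)
    high = _filter_columns(img, high=True)
    return {
        "LL": _filter_rows(low, high=False),
        "LH": _filter_rows(low, high=True),
        "HL": _filter_rows(high, high=False),
        "HH": _filter_rows(high, high=True),
    }


# A vertical edge: intensity changes only along the row direction.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
bands = swt_haar_level1(image)
print(bands["LH"][0])  # [0.0, -0.5, 0.0, 0.5]
print(bands["HL"][0])  # [0.0, 0.0, 0.0, 0.0]
```

Because the image is constant down each column, only the band that applies the high-pass filter along the row direction responds to the edge; the wrap-around response at the last position is a consequence of the periodic boundary assumed here.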

Figure 4. The composition of the dataset. The dataset contains 9 categories with a total of 7224 images. We divided the training and testing sets in a 9:1 ratio.

Figure 5. Reconstruction results of human subjects. The baseline model produces overly smooth results with blurred clothing edges. The result of MWR-Net without wavelet loss preserves more textures but introduces some noise. When the wavelet loss is applied, it suppresses the noise and improves the fidelity.

Figure 6. Reconstruction results of cityscapes. The baseline model produces overly smooth results with blurred signage on the hotel. The result of MWR-Net without wavelet loss preserves more textures but introduces some noise. When the wavelet loss is applied, it suppresses the noise and improves the fidelity.

Figure 7. Reconstruction results of seascapes. The baseline model produces overly smooth results with blurred sail structures. The result of MWR-Net without wavelet loss preserves more textures but introduces some noise. When the wavelet loss is applied, it suppresses the noise and improves the fidelity.

Figure 8. The MTF results of MWR-Net. The MTF values across various fields at the Nyquist frequency all exceed 0.5, showing excellent high-frequency performance.

Table 1. Quantitative evaluation results at FP32 precision.

Models	Params	MACs	PSNR (dB)	SSIM	MTF
Original Model	8.63 M	68.12 G	36.97	0.962	0.5190
Baseline	3.92 M	28.44 G	36.37	0.959	0.5001
MWR-Net w/o Wavelet Loss	666.34 K	6.17 G	36.63	0.962	0.6573
MWR-Net w/ Wavelet Loss	666.34 K	6.17 G	37.10	0.964	0.6903

Table 2. Quantitative evaluation results at INT8 precision.

Models	PSNR (dB)	SSIM	MTF	Inference Speed (FPS)
Baseline	35.44	0.951	0.5144	33
MWR-Net w/o Wavelet Loss	35.30	0.945	0.6618	42
MWR-Net w/ Wavelet Loss	35.52	0.948	0.6767	42

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
