DSConv+LR: A Minimalist Lightweight Network for Image Super-Resolution

Hu, Qiuxia; Tian, Jie; Jiang, Guangyi; Xue, Shan; Wang, Jingxuan

doi:10.3390/electronics15122637


by Qiuxia Hu, Jie Tian *, Guangyi Jiang, Shan Xue and Jingxuan Wang

School of Computer and Artificial Intelligence, Xihang University, Xi’an 710077, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2637; https://doi.org/10.3390/electronics15122637 (registering DOI)

Submission received: 22 April 2026 / Revised: 27 May 2026 / Accepted: 4 June 2026 / Published: 15 June 2026

(This article belongs to the Section Computer Science & Engineering)


Abstract

Deep learning has significantly advanced image super-resolution (SR), yet many state-of-the-art models remain too computationally expensive for resource-constrained devices. This paper demonstrates that a highly parameter-efficient design can achieve performance comparable to the very deep super-resolution network (VDSR) with a small fraction of its parameters. Starting from the classic VDSR architecture (2016), we systematically evaluate three design choices: depthwise separable convolution (DSConv), a lightweight channel attention module (HAT), and a local residual connection (LR). HAT provides no performance gain, a negative result supported by controlled experiments (longer training, different reduction ratios, and a standard convolution baseline). In contrast, LR alone yields a 0.20 dB improvement without introducing any extra parameters. Consequently, we discard HAT and propose DSConv+LR. Our model contains only 49,217 parameters, about 7.4% of VDSR, yet attains a peak signal-to-noise ratio (PSNR) of 35.21 dB on Set5 (×2), which is 99.7% of VDSR’s PSNR (35.33 dB). On additional benchmarks (Set14, BSD100, and Urban100), DSConv+LR stays within 0.12 dB of VDSR. Its perceptual loss (L1 distance between AlexNet features; lower is better) is 0.2556, slightly better than that of VDSR (0.2717). We acknowledge that modern lightweight networks such as the cascading residual network (CARN) and the information multi-distillation network (IMDN) achieve 2–3 dB higher PSNR at the cost of 9–14× more parameters. This work advocates a minimalist approach while reporting both its strengths and its limitations.

Keywords:

image super-resolution; VDSR; depthwise separable convolution; local residual; lightweight network

1. Introduction

Reconstructing a high-resolution image from a low-resolution one is called single-image super-resolution (SISR). Deep learning has dominated this field since the super-resolution convolutional neural network (SRCNN) [1]. The subsequent VDSR [2] pushed the depth to 20 layers and achieved impressive accuracy. However, VDSR still needs about 665 k parameters, making it impractical for most mobile or embedded devices.

VDSR’s success stems from two key ideas: a very deep network (20 layers) enlarges the receptive field, and residual learning eases training. Yet depth comes at the cost of a large parameter count. Each of the 18 intermediate convolutional layers contains 64 filters of size 3 × 3 × 64, that is, 64 × 64 × 9 = 36,864 weights per layer. The resulting total of about 665 k parameters is acceptable on high-end GPUs but prohibitive for real-time inference on mobile phones, drones, or surveillance cameras. Moreover, VDSR uses standard convolutions that treat all channels equally, which may be suboptimal for extracting diverse features.

To reduce model complexity, researchers have developed various lightweight SR architectures, including fast super-resolution convolutional neural network (FSRCNN) [3], efficient sub-pixel convolutional neural network (ESPCN) [4], and cascading residual network (CARN) [5]. These methods typically adopt strategies such as compact hourglass structures, depthwise separable convolutions, sub-pixel convolution, or cascading small residual blocks. Depthwise separable convolutions are particularly effective, reducing parameters by an order of magnitude with modest accuracy loss. However, a known drawback is that they weaken cross-channel interactions, hurting representational power. To compensate, many modern lightweight SR networks introduce attention mechanisms—channel or spatial attention—that attempt to recalibrate feature responses. Yet the effectiveness of such attention modules in ultra-lightweight regimes (<100 k parameters) remains unclear. Some studies report gains, while others find negligible improvements. This inconsistency motivated us to systematically investigate depthwise separable convolutions and lightweight attention on top of a well-established baseline—VDSR.

In this paper, we systematically evaluate three design choices on top of VDSR:

Depthwise separable convolution (DSConv)—for drastic parameter reduction.
Lightweight channel attention (HAT)—to potentially compensate for information loss.
Local residual connection (LR)—to improve gradient flow.

Our experiments lead to an unexpected conclusion. HAT provides no performance gain under our training protocol, which we verify through controlled experiments (longer training, different reduction ratios, and applying HAT on standard convolutions). In contrast, LR alone yields a notable improvement of +0.20 dB with zero extra parameters. Consequently, we discard HAT and propose DSConv+LR. It uses only 49,217 parameters (7.4% of VDSR) and achieves a PSNR of 35.21 dB on Set5 (×2), which is 99.7% of VDSR’s PSNR (35.33 dB). On Set14, BSD100, and Urban100, DSConv+LR performs within 0.12 dB of VDSR, which suggests that the result generalises beyond Set5. For ×4 SR, the same architecture achieves 32.26 dB on Set5.

The main contributions of this work are:

A minimalist yet highly effective lightweight SR network derived from VDSR, along with a clear statement of its limitations.
A rigorous ablation study that isolates the contributions of depthwise separable convolution, channel attention, and local residual connections.
An honest negative result showing that a widely used lightweight attention module does not improve performance in this ultra-lightweight setting, supported by controlled experiments (increased training, reduction ratio r = 2, and standard convolution baseline).
A strong emphasis on local residual connections as a simple, parameter-free tool for enhancing lightweight SR models, while acknowledging that such simplifications come at a cost.

2. Related Work

2.1. Deep SR Networks

SRCNN [1] first applied a three-layer CNN to SR, achieving a significant improvement over traditional interpolation methods. However, its shallow architecture limited large-scale context. VDSR [2] deepened the network to 20 layers and introduced residual learning, allowing the model to learn high-frequency residuals rather than the full image. This residual formulation eased training and accelerated convergence, enabling a high learning rate (0.1) and gradient clipping. Deeply-recursive convolutional network (DRCN) [6] further increased depth using recursive supervision, but at higher computational cost. Laplacian pyramid super-resolution network (LapSRN) [7] adopted a progressive reconstruction approach with Laplacian pyramids, producing high-quality results at multiple scales simultaneously. Despite their impressive performance, these deep models contain hundreds of thousands to millions of parameters, limiting deployment on edge devices. Our work builds directly upon VDSR because of its clean and modular design, making it an ideal baseline for studying lightweight modifications.

2.2. Lightweight SR Networks

FSRCNN [3] (2016) introduced a compact hourglass architecture with a “shrink” and “expand” design, reducing parameters to about 12 k. Our re-implementation (58.0 K parameters, trained on the 291-image set) achieves 34.07 dB on Set5 (×2). ESPCN [4] (2016) proposed sub-pixel convolution, which avoids explicit up-sampling layers and reduces computation. CARN [5] (2018) employed cascading residual blocks and group convolutions, achieving a good balance between performance and efficiency. We re-implemented CARN on the same 291-image training set with reduced channels (32 vs. 64), resulting in 456.8 K parameters and 37.25 dB on Set5. Depthwise separable convolutions, popularised by MobileNet [8], have been used in lightweight SR models such as MobileSR [9]. IMDN [10] (2019) originally proposed information multi-distillation blocks with contrast-aware channel attention, achieving 38.00 dB on Set5 with 694 K parameters when trained on DIV2K. Under our protocol, our re-implemented IMDN obtains 37.35 dB with the same parameter count. Hybrid attention separable network (HASN) [11] (2024) combined depthwise separable convolutions with both channel and spatial attention but deliberately avoided residual connections to minimise cost. In contrast, our work shows that a simple local residual connection can achieve the same or better effect without any attention, though we also observe that the resulting model is not flawless.

Concurrently, Frequency Regularisation [12] explores compressing CNNs by retaining low-frequency components; although their method differs, both share the philosophy of structural convolution reformulation for parameter reduction.

2.3. Attention Mechanisms in SR

Channel attention (CA) [13] and its efficient variant, efficient channel attention (ECA) [14], are widely used to enhance feature representation. Residual channel attention network (RCAN) [13] and residual dense network (RDN) [15] show significant improvements with attention. However, most of these works focus on large models with millions of parameters. The effectiveness of lightweight attention in extremely compact settings (fewer than 100 k parameters) has not been fully explored. Our work addresses this gap and finds that attention is ineffective in this regime.

3. Proposed Method

3.1. Baseline: VDSR

We start from the original VDSR architecture [2]. It consists of an input convolutional layer (3 × 3, 1 → 64), 18 intermediate convolutional layers (3 × 3, 64 → 64, each with ReLU), an output convolutional layer (3 × 3, 64 → 1), and a global residual connection. Thus the network learns only the high-frequency residual. This design allows a high learning rate (0.1). We retrained VDSR from scratch using the same 291-image training set and data augmentation protocol as our lightweight models, obtaining a baseline PSNR of 35.33 dB on Set5 (×2) (mean ± std over 3 runs: 35.33 ± 0.05 dB).

3.2. Building Blocks

We design three types of basic blocks. In each model, a stack of N = 10 blocks of a single type replaces the 18 intermediate convolutional layers of VDSR. The three blocks are:

DSConv block: a single depthwise separable convolution followed by ReLU, as defined in Equation (1):

y = ReLU(DSConv(x))    (1)

A depthwise separable convolution factorises a standard convolution into two separate operations. First, a depthwise convolution applies a single filter per input channel. Second, a pointwise convolution (1 × 1) combines the outputs. This reduces the parameter count from Cin × Cout × k² to Cin × k² + Cin × Cout. For our setting (Cin = Cout = 64, k = 3), a standard convolution has 64 × 64 × 9 = 36,864 parameters. A depthwise separable convolution has 64 × 9 + 64 × 64 = 576 + 4096 = 4672 parameters—a reduction of about 87%. Despite this efficiency, the decoupling of channel mixing can reduce representational power.
HAT block: DSConv + ReLU + a lightweight channel attention module (HAT), given by Equation (2). HAT uses global average pooling to squeeze spatial information. Then it uses two linear layers with a reduction ratio r = 4 to produce channel-wise weights. This is followed by a Sigmoid activation. The weights are multiplied element-wise with the feature map to recalibrate channels. The structure is: global average pooling (GAP) → a fully connected layer reducing dimension from 64 to 16 → ReLU → a fully connected layer expanding back to 64 → Sigmoid.

y = HAT(ReLU(DSConv(x)))    (2)

This block adds only a few thousand extra parameters. The two linear layers have 64 × 16 + 16 × 64 = 2048 weights. The total per block becomes 4672 + 2048 = 6720 parameters. The intention is to compensate for the loss of cross-channel information caused by depthwise separable convolution.
LR block (proposed): DSConv + ReLU + a local residual connection that adds the input of the block to its output, as shown in Equation (3). No attention is used.

y = ReLU(DSConv(x)) + x    (3)

This block has exactly the same number of parameters as the DSConv block (4672 per block). The local residual connection does not introduce any trainable weights. It provides a direct gradient highway from the output back to the input. This mitigates the vanishing gradient problem and encourages the block to learn a residual function that is easier to optimise.
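The per-block figures quoted above can be re-derived with a few lines of Python. This is a minimal sketch that counts weights only (biases are excluded, which is the convention that matches the per-block numbers in the text); the helper names are illustrative and are not taken from the authors' code.

```python
def standard_conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (biases excluded)."""
    return c_in * c_out * k * k


def dsconv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution plus a pointwise 1 x 1 convolution."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def channel_attention_weights(channels: int, reduction: int) -> int:
    """Weights of the two bias-free linear layers: channels -> channels // r -> channels."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


C, K, R = 64, 3, 4
standard = standard_conv_weights(C, C, K)
dsconv_block = dsconv_weights(C, C, K)
hat_block = dsconv_block + channel_attention_weights(C, R)
lr_block = dsconv_block           # the skip connection adds no trainable weights
saving = 1 - dsconv_block / standard

print(standard, dsconv_block, hat_block, lr_block, f"{saving:.1%}")
# prints: 36864 4672 6720 4672 87.3%
```

The LR block and the plain DSConv block come out identical because the addition in Equation (3) has no trainable weights.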

3.3. Network Architecture

Figure 1 presents the architecture of our proposed DSConv+LR. As illustrated in Figure 1a, the network inherits the global residual learning scheme from VDSR. It consists of an input convolutional layer (3 × 3, mapping 1 to 64 channels), a stack of N LR blocks, an output convolutional layer (3 × 3, reducing 64 channels back to 1), and a final global skip connection that adds the original input to the output. Both input and output are single-channel luminance (Y) images. The input and output convolutions are standard (not depthwise separable). We set N = 10 in our experiments.

The internal structure of an LR block is shown in Figure 1b. Each block comprises a depthwise separable convolution (3 × 3, 64 → 64 channels) followed by a ReLU activation; a local residual connection then adds the block’s input to its output. We intentionally omit the channel attention module (HAT) because our experiments show that it yields no performance gain in this lightweight setting. The local residual connection improves gradient flow and enables each block to learn more expressive residuals without increasing the parameter count. The design is inspired by the residual blocks in ResNet, adapted to the ultra-lightweight regime: we use depthwise separable convolutions and remove batch normalisation (which is unnecessary for small models).

Overall, DSConv+LR contains only 49,217 parameters—approximately 7.4% of the original VDSR. The combination of depthwise separable convolution—which slashes parameter count—and the local residual connection—which adds no parameters—yields a highly efficient SR model. Yet this efficiency comes at a small price. Reconstruction of very fine details may be slightly worse than that of VDSR.
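The architecture described in this section can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' released code: the class names are hypothetical, the ReLU after the input convolution follows VDSR and is an assumption, and so is the use of bias terms on every convolution (it is the setting under which this sketch reproduces the reported total of 49,217 parameters). The input is assumed to be a Y-channel image already interpolated to the target size, as in VDSR.

```python
import torch
from torch import nn


class LRBlock(nn.Module):
    """Depthwise separable convolution + ReLU + local residual, as in Equation (3)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Depthwise 3x3: groups == channels gives one filter per channel.
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Pointwise 1x1: mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.pointwise(self.depthwise(x))) + x


class DSConvLR(nn.Module):
    """Input conv -> N LR blocks -> output conv, plus a global skip connection."""

    def __init__(self, num_blocks: int = 10, channels: int = 64):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)   # standard convolution
        self.relu = nn.ReLU()                              # assumed, as in VDSR
        self.body = nn.Sequential(*[LRBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)   # standard convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.tail(self.body(self.relu(self.head(x))))
        return residual + x                                # global residual learning


model = DSConvLR()
num_params = sum(p.numel() for p in model.parameters())

x = torch.rand(1, 1, 32, 32)        # dummy Y-channel input
with torch.no_grad():
    y = model(x)

print(num_params, tuple(y.shape))
```

With these assumptions, the two standard convolutions contribute 640 + 577 parameters and each LR block contributes 4,800 (4,672 weights plus 128 biases), giving 49,217 in total.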

3.4. Training Details

We trained all models using the same protocol to ensure a fair comparison. The loss function is the L1 loss (mean absolute error) between the network output and the ground-truth high-resolution (HR) image. We found that L1 loss produces slightly higher PSNR (+0.04 dB) and sharper edges than mean squared error (MSE). We used the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 10⁻⁴. The initial learning rate was set to 0.1, the same as in the original VDSR, and was multiplied by 0.1 at epochs 20, 40, and 60 (80 training epochs in total). To prevent gradient explosion, we applied gradient clipping with a dynamic threshold of 0.01/lr, as proposed in the VDSR paper. Data augmentation, applied online during training, included random rotations (90°, 180°, 270°) and horizontal flips. We used a batch size of 64 and trained on an NVIDIA RTX4090D GPU. We also experimented with label smoothing (0.1) and MixUp (α = 0.2), but both lowered PSNR by 0.05–0.08 dB, likely because of the model’s small capacity and the sufficient amount of training data.
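The step schedule and the dynamic clipping threshold described above can be written out as a short sketch. The function names are hypothetical, and the code assumes zero-based epoch indices with each drop taking effect at the start of the milestone epoch; the text does not state these details.

```python
BASE_LR = 0.1
MILESTONES = (20, 40, 60)   # epochs at which the learning rate drops
DECAY = 0.1                 # multiplicative factor applied at each milestone
CLIP_CONSTANT = 0.01        # numerator of the dynamic clipping threshold


def learning_rate(epoch: int) -> float:
    """Step schedule: multiply the base rate by DECAY for every milestone reached."""
    drops = sum(1 for milestone in MILESTONES if epoch >= milestone)
    return BASE_LR * DECAY ** drops


def clip_threshold(epoch: int) -> float:
    """Dynamic gradient-clipping bound 0.01 / lr, which grows as the rate decays."""
    return CLIP_CONSTANT / learning_rate(epoch)


for epoch in (0, 20, 40, 60):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.0e}, "
          f"clip = {clip_threshold(epoch):.1f}")
```

The threshold starts at 0.1 and loosens by a factor of 10 each time the learning rate drops, so the effective step size per update stays bounded throughout training. In PyTorch such a bound would typically be passed to `torch.nn.utils.clip_grad_value_` or `torch.nn.utils.clip_grad_norm_`; which of the two the authors use is not stated here.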

4. Experiments

4.1. Datasets and Preprocessing

We used the widely adopted 291-image training set, which combines the 91-image set of Yang et al. (T91) with 200 images from the Berkeley Segmentation Dataset (BSD200). This is the same training set used in the original VDSR paper. To generate training patches, we first converted each image to the YCbCr colour space and kept only the luminance (Y) channel, because human perception is most sensitive to luminance. We then extracted patches of size 41 × 41 with a stride of 21, which yielded approximately 545,000 patches. This patch size is standard in many SR works and balances context against computational cost. We applied min–max normalisation to scale pixel values from the original range (0–255) to [0, 1] before feeding them into the network. Because the global minimum and maximum of the training set are 0 and 255, this amounts to dividing each pixel value by 255. For testing, we used four standard benchmarks: Set5, Set14, BSD100, and Urban100. All evaluations are performed on the Y channel using PSNR and the structural similarity index measure (SSIM).
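The patch grid, the normalisation, and the PSNR metric can be illustrated in plain Python. This is a sketch under stated assumptions: patches that would cross the image border are dropped, PSNR is computed on values already scaled to [0, 1], and the 321 × 481 image size in the usage lines is only an example, not a figure from this paper.

```python
import math


def patch_origins(height: int, width: int, patch: int = 41, stride: int = 21):
    """Top-left corners of all full patch x patch windows on a regular grid."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def normalise(pixel: int) -> float:
    """Min-max scaling with min = 0 and max = 255, i.e. division by 255."""
    return pixel / 255.0


def psnr(reference, estimate, peak: float = 1.0) -> float:
    """PSNR in dB between two equally long sequences of values in [0, peak]."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Example: a 321 x 481 image yields a 14 x 21 grid of 41 x 41 patches.
print(len(patch_origins(321, 481)))                  # 294
# A uniform error of 0.1 on a [0, 1] scale corresponds to about 20 dB.
print(round(psnr([0.0, 0.5], [0.1, 0.6]), 2))        # 20.0
```

Since PSNR is logarithmic in the mean squared error, the 0.12 dB gap between VDSR and DSConv+LR on Set5 corresponds to a difference in MSE of under 3%.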

4.2. Implementation

Framework: PyTorch 2.0.
GPU: NVIDIA RTX4090D.
Batch size: 64.
Number of blocks: N = 10 for all lightweight models.
Training epochs: 80.

We computed FLOPs using the thop library with an input size of 1 × 256 × 256 for ×2 SR.

4.3. Ablation Study

We evaluate five models, as summarised in Table 1. Values are mean ± std over 3 runs. The purpose of this ablation study is to isolate the contribution of each design component: depthwise separable convolution (DSConv), channel attention (HAT), and local residual connection (LR). The models are:

VDSR (baseline): the original 20-layer standard convolution network, without any of our modifications. It serves as the performance upper bound in terms of PSNR. It also has the largest parameter count (665,921).
DSConv: replaces the 18 intermediate standard convolutional layers with DSConv blocks (no attention, no local residual). This model has only 49,217 parameters. Its PSNR drops to 35.01 dB, a loss of 0.32 dB compared to VDSR.
DSConv+HAT: adds the lightweight channel attention module to each DSConv block. This increases parameters to 69,697. Surprisingly, the PSNR remains 35.01 dB—identical to DSConv. This shows that HAT does not help in this ultra-lightweight setting.
Efficient Hybrid Attention Super-Resolution (Eff-HASR): adds local residual connections to the HAT blocks (i.e., DSConv + HAT + local residual). Its parameter count is still 69,697. PSNR improves to 35.23 dB. This +0.22 dB gain over DSConv+HAT is entirely due to the local residual connection.
DSConv+LR (ours): removes HAT but keeps the local residual connection. It uses only 49,217 parameters (same as DSConv). It achieves a PSNR of 35.21 dB. Compared to DSConv, this is a +0.20 dB gain with no extra parameters. Compared to Eff-HASR, it is slightly lower (35.21 vs. 35.23) but uses about 29% fewer parameters.

Observations: DSConv alone reduces parameters by 92.6% but loses 0.32 dB. Adding HAT does not improve PSNR (35.01 dB), and it remains ineffective with r = 2 or with 160 training epochs. Adding a local residual to HAT (Eff-HASR) gives +0.22 dB but uses 69.7 K parameters. Our DSConv+LR removes HAT, keeps the local residual, uses 49.2 K parameters, and achieves a +0.20 dB gain over DSConv. A paired t-test between VDSR and DSConv+LR over the three runs yields p = 0.08 (>0.05), that is, the difference is not statistically significant at the 5% level.

These results lead to three key conclusions. (i) Depthwise separable convolution alone drastically reduces parameters but hurts accuracy. (ii) Channel attention does not compensate for this loss under our training protocol. (iii) The local residual connection is both effective and parameter-free. It is the best choice for lightweight SR. Therefore, we select DSConv+LR as our final model.
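The whole-model parameter totals used in this ablation can be reproduced by simple counting. The sketch below assumes bias terms on every convolution and bias-free linear layers in HAT; these conventions are not stated explicitly in the text, but they are the ones under which the totals of 665,921, 49,217 and 69,697 come out exactly. Function names are illustrative.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1, bias: bool = True) -> int:
    """Parameters of a (possibly grouped) k x k convolution."""
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


def dsconv_params(c: int) -> int:
    """Depthwise 3 x 3 followed by pointwise 1 x 1, both with biases (assumed)."""
    return conv_params(c, c, 3, groups=c) + conv_params(c, c, 1)


def attention_params(c: int, r: int) -> int:
    """Two bias-free linear layers (assumed): c -> c // r -> c."""
    return 2 * c * (c // r)


C, N = 64, 10
ends = conv_params(1, C, 3) + conv_params(C, 1, 3)    # input and output layers

vdsr = ends + 18 * conv_params(C, C, 3)
dsconv = ends + N * dsconv_params(C)
dsconv_hat = dsconv + N * attention_params(C, 4)      # r = 4
dsconv_lr = dsconv                                    # local residual is parameter-free

print(vdsr, dsconv, dsconv_hat, dsconv_lr)
# prints: 665921 49217 69697 49217
print(f"{dsconv_lr / vdsr:.1%} of VDSR, "
      f"{1 - dsconv_lr / dsconv_hat:.1%} fewer than the HAT variants")
# prints: 7.4% of VDSR, 29.4% fewer than the HAT variants
```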

4.4. Comparison with State-of-the-Art Lightweight Methods

We re-implemented FSRCNN, CARN, and IMDN under the same training protocol (291 images, 80 epochs) for fair comparison. Table 2 reports their results on Set5, and Table 3 extends the comparison to all four datasets.

Key observations. DSConv+LR stays within 0.12 dB of VDSR on all four datasets, outperforms FSRCNN by 0.45–1.14 dB (Table 3), and offers a better trade-off than CARN/IMDN when parameter count is the primary constraint.

Parameter Efficiency. Instead of a simple PSNR/Params ratio, we provide a Pareto frontier plot (Figure 2) showing PSNR vs. parameter count (log scale). DSConv+LR dominates the region below 100 K parameters.
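For readers who want to check the frontier numerically, the following sketch computes the Pareto-optimal models from the (parameters, PSNR) pairs in Table 2. A model is treated as dominated when another model has no more parameters and no lower PSNR, and is strictly better in at least one of the two; this dominance rule is the standard definition and is assumed to match the one used for Figure 2.

```python
def pareto_front(models: dict) -> list:
    """Names of models not dominated in (fewer parameters, higher PSNR)."""
    front = []
    for name, (params, psnr) in models.items():
        dominated = any(
            p <= params and q >= psnr and (p < params or q > psnr)
            for other, (p, q) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the front by increasing parameter count.
    return sorted(front, key=lambda n: models[n][0])


# (Params in K, PSNR in dB) on Set5 x2, copied from Table 2.
TABLE_2 = {
    "VDSR": (665.9, 35.33),
    "DSConv+LR": (49.2, 35.21),
    "FSRCNN": (58.0, 34.07),
    "CARN": (456.8, 37.25),
    "IMDN": (694.0, 37.35),
}

front = pareto_front(TABLE_2)
print(front)   # ['DSConv+LR', 'CARN', 'IMDN']
```

Under this rule FSRCNN is dominated by DSConv+LR (fewer parameters and higher PSNR), and VDSR is dominated by the reduced-channel CARN.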

Perceptual Quality. Using a pre-trained AlexNet to compute feature L1 distance (perceptual loss) on Set5, DSConv+LR achieves 0.2556, slightly better than VDSR (0.2717). This indicates that the lightweight design does not harm visual quality.

4.5. Extension to ×4 Super-Resolution

DSConv+LR achieves 32.26 dB on Set5 with 49.2 K parameters and 1.288 G FLOPs. Table 4 summarises the ×4 performance of DSConv+LR on all four benchmark datasets. As expected, the PSNR and SSIM values drop compared to ×2 SR, especially on Urban100 which contains rich textures.

4.6. Qualitative Results

Figure 3 presents visual comparisons on three representative images from Set5: ‘butterfly’, ‘woman’, and ‘head’. Visually, VDSR and DSConv+LR produce almost identical details, although DSConv+LR uses an order of magnitude fewer parameters. The per-image PSNR values show a small gap: on ‘head’, DSConv+LR achieves 34.29 dB against 34.32 dB for VDSR, a negligible difference, and on ‘woman’ it achieves 34.86 dB against 34.94 dB. The average PSNR on Set5 (35.21 dB) is marginally lower than VDSR’s 35.33 dB. This confirms that extreme parameter reduction incurs a small but measurable performance penalty, which is the price of the lightweight design. We also note that DSConv+LR is far behind larger lightweight networks such as CARN and IMDN in absolute PSNR, with a gap of 2–3 dB.

5. Discussion

5.1. Why Did Channel Attention Not Help?

We conducted three controlled experiments to verify the ineffectiveness of HAT, which are summarised in Table 5.

These results confirm that channel attention provides no measurable benefit in our ultra-lightweight setting (channel dimension 64), regardless of training length, reduction ratio, or base convolution type. Therefore, we do not recommend using attention in extremely compact SR models.

5.2. The Importance of Local Residual Connections

The local residual connection improves performance by +0.20 dB with zero additional parameters. We also analysed the effect of the number of LR blocks N. As shown in Figure 4, PSNR saturates at N = 5 (35.23 dB); increasing N to 10, 15, or 20 yields no further gain. We choose N = 10 as a safe margin.

To verify that the local residual connection consistently improves performance regardless of network depth, we trained DSConv (without LR) models with N = 5, 10, 15, 20 under the same protocol and compared them with DSConv+LR. As shown in Table 6, LR provides a positive gain of +0.20–0.34 dB for all tested depths. The gain is slightly larger for shallower networks (N = 5, +0.34 dB) but remains substantial for the deeper ones. These results confirm that the local residual connection is a robust, parameter-free enhancement for lightweight SR models, and its effectiveness does not degrade as depth increases.

5.3. Fast Convergence

DSConv+LR reaches its peak performance within 5–10 epochs, much faster than VDSR (≈80 epochs). We attribute this to the reduced parameter count and the local residual connection.

5.4. Limitations of the Lightweight Design

Despite its parameter efficiency, DSConv+LR has several limitations:

It cannot fully match VDSR’s PSNR (0.12 dB lower on Set5).
It lags behind larger lightweight networks like CARN and IMDN by 2–3 dB, but uses 9–14× fewer parameters.
Compared to FSRCNN (34.07 dB), DSConv+LR achieves 35.21 dB with fewer parameters (49.2 K vs. 58.0 K) and better perceptual quality (perceptual loss 0.2556 vs. 0.295).
The model is not state-of-the-art; it is intended for extreme resource-constrained scenarios where memory and computation are the primary constraints.

These limitations should be considered when deploying DSConv+LR in applications that require the highest possible reconstruction fidelity.

6. Conclusions

We have presented DSConv+LR, an extremely lightweight image super-resolution network derived from VDSR. By using depthwise separable convolutions and adding local residual connections—while removing an ineffective channel attention module—our model uses only 49,217 parameters (7.4% of VDSR). It achieves a PSNR of 35.21 dB on Set5 (×2), which is 99.7% of VDSR’s performance. On Set14, BSD100, and Urban100, DSConv+LR maintains similar relative performance (within 0.12 dB). Perceptual loss (0.2556) is slightly better than VDSR. The model is not state-of-the-art; larger lightweight networks like CARN and IMDN outperform it by 2–3 dB but require far more parameters. Our work demonstrates that a minimalist design can achieve high parameter efficiency while honestly reporting its limitations. Future work includes evaluating on more diverse datasets and exploring hybrid architectures to recover fine details without increasing parameters.

Author Contributions

Conceptualization, Q.H.; methodology, Q.H. and J.T.; software, S.X. and J.T.; validation, G.J. and J.W.; formal analysis, S.X.; investigation, J.T.; resources, J.T.; data curation, J.W.; writing—original draft preparation, Q.H. and J.T.; writing—review and editing, Q.H. and S.X.; visualisation, G.J.; supervision, S.X.; project administration, Q.H.; funding acquisition, Q.H. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by Shaanxi NSFC [grant number 2023JCYB194, 2024JCYBMS169].

Data Availability Statement

The dataset and codes used in this study are openly available at https://github.com/tianblank/VDSR_improved (accessed on 22 April 2026).

Acknowledgments

During the preparation of this work, the authors used a language assistance tool for grammar and style improvements. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
2. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1646–1654.
3. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 391–407.
4. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1874–1883.
5. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 252–268.
6. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1637–1645.
7. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 624–632.
8. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
9. Zhang, L.; Li, H.; Liu, X.; Niu, J.; Wu, J. MobileSR: Efficient Convolutional Neural Network for Super-resolution. In Proceedings of the 2020 IEEE Global Communications Conference (GLOBECOM), Taipei, Taiwan, 7–11 December 2020; pp. 1–6.
10. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), Nice, France, 21–25 October 2019; ACM: New York, NY, USA, 2019; pp. 2024–2032.
11. Cao, W.; Lei, X.; Shi, J.; Liang, W.; Liu, J.; Bai, Z. HASN: Hybrid attention separable network for efficient image super-resolution. Vis. Comput. 2025, 41, 3423–3435.
12. Zhao, C.; Dong, G.; Zhang, S.; Tan, Z.; Basu, A. Frequency regularization: Reducing information redundancy in convolutional neural networks. IEEE Access 2023, 11, 106793–106802.
13. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 286–301.
14. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542.
15. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2472–2481.

Figure 1. Architecture of the proposed DSConv+LR. (a) Overall network structure with global residual learning. (b) Internal structure of an LR block, consisting of a depthwise separable convolution, ReLU, and a local residual connection.

Figure 2. Pareto frontier of parameter efficiency vs. reconstruction quality on Set5 (×2). Each point represents a model (our reimplementations: VDSR, DSConv+LR, FSRCNN, CARN, IMDN). The dashed line indicates the Pareto front among our models. DSConv+LR offers the best trade-off in the ultra-lightweight regime (<50 K parameters).

Figure 3. Qualitative comparison on Set5 (×2). For each image (butterfly, woman, head), from left to right: low-resolution input, Bicubic, VDSR, DSConv+LR, and the high-resolution (HR) ground truth. DSConv+LR achieves comparable visual quality to VDSR with only 7.4% of the parameters. PSNR values are shown in parentheses.

Figure 4. Effect of the number of LR blocks (N) on Set5 (×2) PSNR. Performance saturates at N = 5; further increases do not improve accuracy. The dashed line indicates the selected N = 10.

Table 1. Ablation study on Set5 (×2).

Model	Params (K)	PSNR (dB)	Note
VDSR (baseline)	665.9	35.33 ± 0.05
DSConv (no LR)	49.2	35.01 ± 0.04
DSConv+HAT (r = 4)	69.7	35.01 ± 0.05	No gain
DSConv+HAT (r = 2)	71.3	35.01 ± 0.04	Still no gain
DSConv+HAT (160 epochs)	69.7	35.02 ± 0.04	Longer training not helpful
Eff-HASR (DSConv+HAT+LR)	69.7	35.23 ± 0.04	Gain from LR, not HAT
DSConv+LR (ours)	49.2	35.21 ± 0.04	Best efficiency
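
The parameter counts in Table 1 can be checked with simple arithmetic. The sketch below assumes a layout that the table itself does not spell out: a single-channel input, 64 feature channels, 3×3 kernels, biases on every convolution, a 20-layer VDSR, and a DSConv+LR network made of one standard head convolution, 10 depthwise separable blocks, and one standard tail convolution. Under these assumptions the totals match the reported 665.9 K and 49.2 K; the helper names are illustrative only.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def dsconv_params(channels, k=3, bias=True):
    """Depthwise k x k conv followed by a pointwise 1 x 1 conv (same width)."""
    depthwise = channels * k * k + (channels if bias else 0)
    pointwise = conv_params(channels, channels, 1, bias)
    return depthwise + pointwise


def vdsr_params(depth=20, channels=64):
    """Assumed VDSR layout: 1->C head, (depth - 2) C->C convs, C->1 tail."""
    head = conv_params(1, channels, 3)
    body = (depth - 2) * conv_params(channels, channels, 3)
    tail = conv_params(channels, 1, 3)
    return head + body + tail


def dsconv_lr_params(n_blocks=10, channels=64):
    """Assumed DSConv+LR layout: standard head/tail, n_blocks DSConv blocks.

    The local residual is an addition, so it contributes no parameters.
    """
    head = conv_params(1, channels, 3)
    body = n_blocks * dsconv_params(channels)
    tail = conv_params(channels, 1, 3)
    return head + body + tail


vdsr = vdsr_params()
ours = dsconv_lr_params()
print(vdsr, ours, round(100 * ours / vdsr, 1))  # 665921 49217 7.4
```

Each 64-channel depthwise separable block costs 4,800 parameters under these assumptions, against 36,928 for a standard 3×3 convolution of the same width, which is where most of the saving comes from.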

Table 2. Performance on Set5 (×2) under identical protocol.

Model	Params (K)	PSNR (dB)	SSIM	Source
VDSR	665.9	35.33	0.9359	Our reimpl.
DSConv+LR (ours)	49.2	35.21	0.9369	Our reimpl.
FSRCNN	58.0	34.07	0.9105	Our reimpl.
CARN	456.8	37.25	0.9580	Our reimpl.
IMDN	694.0	37.35	0.9587	Our reimpl.

Note: CARN uses reduced channels (32 vs. 64). IMDN uses default channels (64). Original paper results (trained on DIV2K) are 37.88 dB for CARN and 38.00 dB for IMDN, shown for reference only.
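
The Pareto front referred to in Figure 2 can be recomputed from the Table 2 values. The following is a minimal sketch, assuming the usual two-objective definition (fewer parameters and higher PSNR are both preferred); the `pareto_front` helper is hypothetical and is not taken from the authors' code.

```python
# (parameters in K, PSNR in dB) copied from Table 2.
models = {
    "VDSR": (665.9, 35.33),
    "DSConv+LR": (49.2, 35.21),
    "FSRCNN": (58.0, 34.07),
    "CARN": (456.8, 37.25),
    "IMDN": (694.0, 37.35),
}


def pareto_front(points):
    """Return the names of models not dominated by any other model.

    A model is dominated if another one has no more parameters and no lower
    PSNR, and is strictly better in at least one of the two.
    """
    front = []
    for name, (p, q) in points.items():
        dominated = any(
            (p2 <= p and q2 >= q) and (p2 < p or q2 > q)
            for other, (p2, q2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the front by increasing parameter count.
    return sorted(front, key=lambda n: points[n][0])


front = pareto_front(models)
print(front)  # ['DSConv+LR', 'CARN', 'IMDN']
```

With these five points, FSRCNN is dominated by DSConv+LR and VDSR is dominated by CARN, so DSConv+LR is the only non-dominated model below 50 K parameters.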

Table 3. Quantitative comparison on multiple benchmarks (×2).

Model	Set5	Set14	BSD100	Urban100
VDSR	35.33/0.9359	33.42/0.8868	33.08/0.8676	32.30/0.8566
DSConv+LR (ours)	35.21/0.9369	33.41/0.8899	33.10/0.8722	32.34/0.8609
FSRCNN	34.07/0.9105	32.73/0.8640	32.65/0.8495	31.76/0.8288
CARN (reimpl.)	37.25/0.9580	34.72/0.9138	33.93/0.8971	33.86/0.9143
IMDN (reimpl.)	37.35/0.9587	34.69/0.9129	33.95/0.8972	33.91/0.9134

Note: CARN uses reduced channels (32 vs. 64), Params = 456.8 K. IMDN uses default channels (64), Params = 694 K (original).

Table 4. DSConv+LR ×4 results on multiple benchmarks.

Dataset	Set5	Set14	BSD100	Urban100
PSNR (dB)	32.26	31.48	31.45	30.92
SSIM	0.8191	0.7209	0.6864	0.6759

Table 5. Controlled experiments to verify the ineffectiveness of HAT (Set5 ×2).

Experiment	Model	Params (K)	PSNR (dB)
Baseline	DSConv+LR	49.2	35.21
Original HAT	DSConv+HAT (r = 4)	69.7	35.01
(1) Longer training	DSConv+HAT (160 epochs)	69.7	35.02
(2) Less aggressive reduction	DSConv+HAT (r = 2)	71.3	35.01
(3) HAT on standard conv	Conv+HAT (no DSConv)	665.9	35.45 (baseline 35.40)

Table 6. Performance comparison of DSConv with and without LR under different depths (N) on Set5 (×2).

N	DSConv (without LR) PSNR (dB)	DSConv+LR (with LR) PSNR (dB)	Gain (dB)
5	34.89	35.23	+0.34
10	35.01	35.21	+0.20
15	34.92	35.21	+0.29
20	34.92	35.23	+0.31
