RepECN: Making ConvNets Better Again for Efficient Image Super-Resolution

Traditional Convolutional Neural Network (ConvNet, CNN)-based image super-resolution (SR) methods have low computation costs, making them friendly to real-world scenarios. However, they suffer from lower performance. On the contrary, Vision Transformer (ViT)-based SR methods have recently achieved impressive performance, but they often suffer from high computation costs and model storage overhead, making it hard for them to meet the requirements of practical application scenarios, where an SR model should reconstruct an image with high quality and fast inference. To handle this issue, we propose RepECN, a novel CNN-based Efficient Residual ConvNet enhanced with structural Re-parameterization, for a better trade-off between performance and efficiency. A stage-to-block hierarchical architecture design paradigm inspired by ViT is used to keep state-of-the-art performance, while efficiency is ensured by abandoning the time-consuming Multi-Head Self-Attention (MHSA) and by re-designing the block-level modules based on CNNs. Specifically, RepECN consists of three structural modules: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. The deep feature extraction module comprises multiple ConvNet Stages (CNS), each containing 6 Re-Parameterization ConvNet Blocks (RepCNB), a head layer, and a residual connection. The RepCNB uses large kernel convolutions rather than MHSA to enhance the capability of learning long-range dependencies. In the image reconstruction module, an upsampling module consisting of nearest-neighbor interpolation and pixel attention is deployed to reduce parameters and maintain reconstruction performance, while bicubic interpolation on another branch allows the backbone network to focus on learning high-frequency information.
The extensive experimental results on multiple public benchmarks show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance, indicating that our RepECN can reconstruct high-quality images with fast inference.


Introduction
Single Image Super-Resolution (SISR), which aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) image, is an ill-posed problem without a unique solution. As an efficient data-driven technology, deep learning-based SISR methods have shown promising results and achieved better quantitative and qualitative performance than traditional methods. These super-resolution (SR) models can be divided into three categories: convolutional neural network-based SR methods [1,2], Transformer-based SR methods [3,4], and generative adversarial network-based SR methods [5,6].
However, deep learning-based methods require significant computation and storage resources to provide high reconstruction accuracy, hindering their deployment on resource-limited platforms or in scenarios such as live streaming [7], phone imaging [8], etc. Therefore, an SR model with high super-resolving performance and fast inference is urgently required to meet the requirements of resource-limited scenarios.
Lightweight SR models have recently been proposed, but they still face challenges in making a better trade-off between inference speed and reconstruction performance. Transformer-based methods, such as SwinIR [4], ESRT [9], and LBNet [10], have shown better performance than CNN-based lightweight models, like ESRN [11], LBFN [12], and ShuffleMixer [13]. However, the multi-head self-attention and encoder-decoder designs overlook the actual inference latency caused by the large memory access cost (MAC) and the degree of parallelism of the network structure. Our statistical experiments demonstrate that Transformer-based methods suffer from high latencies even with small parameter sizes, as illustrated in Figure 1. In contrast, CNN-based methods with simple structures infer much faster than other designs but suffer from lower reconstruction performance. Thus, ConvNets are often adopted to build efficient and lightweight models for improving inference speed. SR-LUT [14] and SPLUT [15] can reconstruct images faster at the expense of severe performance degradation. Wu et al. [16] explored a compiler-aware SR neural architecture search (NAS) framework to achieve real-time inference on GPU/DSP platforms for mobile devices. However, this work faces difficulties in deploying or directly transferring pre-trained models to different hardware platforms with varying instruction architectures. With these considerations, RepSR [17] aims to improve the performance of VGG-like [18] CNN-based models but still has a low performance cap.
To make a better trade-off between reconstruction performance and inference latency for practical scenarios, we propose a pure CNN-based Efficient Residual ConvNet with structural Re-parameterization (RepECN). The architecture follows the stage-to-block hierarchical design of ViT-based models to offer both fast speed and high-quality image reconstruction. RepECN has three key structural components: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. The deep feature extraction module comprises several ConvNet Stages (CNS), each containing six Re-Parameterization ConvNet Blocks (RepCNB), a head layer, and a residual connection. By employing the Transformer-like stage-to-block design, this module learns channel and spatial information with different convolution structures, enabling faster processing speeds while maintaining similar parameter numbers and performance compared to Transformer-based models. In addition, we propose a novel image reconstruction module based on nearest-neighbor interpolation and pixel attention to save parameters and maintain reconstruction performance. Extensive experimental results show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance, indicating that our RepECN achieves a better trade-off between super-resolution quality and inference latency for resource-limited scenarios.
In summary, the main contributions of this paper are as follows:

CNN-Based Efficient SR
FSRCNN [2] uses upsampling at the end of the model and optimizes the width and depth of the convolutional layers of the pioneering model SRCNN [1]. However, its performance is not competitive nowadays. Inspired by residual learning, VDSR [23] and EDSR [24] were proposed to allow deeper networks and avoid gradient vanishing and degradation problems. Later, a series of SR methods built by increasing the depth and width of the network (e.g., RCAN [25], RDN [26]) achieved state-of-the-art (SOTA) performance. However, huge Multiply-Accumulates (MACs) and parameter counts limit their deployment on hardware-limited platforms. To solve this problem, some SR methods [19-21,27] focus on improving efficiency. IDN [19] and IMDN [20] use a channel-splitting strategy to reduce computational complexity, at the cost of redundant parameters. Luo et al. [21] use the proposed lattice block to combine residual blocks and introduce LatticeNet for fast and accurate SR. MIPN [27] aggregates multi-scale image features extracted by convolutions with different kernel sizes. The MAI 2021 Challenge [28] brought some extremely lightweight models [29,30] with real-time inference latency. However, most are optimized for specific NPU mobile platforms, and their SR performance is insufficient. Wu et al. [16] use a neural architecture search (NAS) framework with adaptive SR blocks to find an appropriate model for real-time SR inference. However, it needs to retrain the model when the environment changes and thus cannot be used on new devices directly. Unlike these methods that mainly focus on efficiency, we aim at the trade-off between latency and accuracy.

Transformer-Based Efficient SR
Dosovitskiy et al. [31] first applied a vision transformer to image recognition. Since then, high-accuracy image SR methods based on Transformers have become popular. IPT [3] uses a vanilla Vision Transformer (ViT) pre-trained on the ImageNet dataset. SwinIR [4] brings the Swin Transformer [32] to image restoration tasks and achieves state-of-the-art performance. However, having fewer parameters and MACs does not necessarily result in lower inference latency, because other factors, such as memory access cost and the degree of parallelism, also affect latency. Transformer-based methods suffer from time-consuming and memory-intensive operations, including quadratic-complexity Multi-Head Self-Attention (MHSA) and inefficient non-parallelized window partitioning. Therefore, some works focus on designing lightweight Transformer-based methods [10,33,34]. A2N [33] becomes lightweight by studying the effectiveness of the attention mechanism. LBNet [10] uses a hybrid network of CNN and Transformer to build an efficient model. SMN [34] simplifies MHSA by separating spatial modulation from channel aggregation, making the long-range interaction lightweight. However, there is still potential for improvement in terms of accuracy.

Large Kernel ConvNet
After the introduction of VGG [18], large kernel ConvNets lost popularity due to the higher number of parameters and MACs they require, which is inappropriate for lightweight model designs. However, large kernel convolutions have regained importance with the development of novel efficient techniques and structures such as Transformers and MLPs. ConvMixer [35], ConvNeXt [36], and RepLKNet [37] use large kernel depth-wise convolutions to redesign ConvNets, achieving performance competitive with Transformers. In addition, LKASR [38] explores the use of large kernels in lightweight models for the image SR task. However, there is still potential for improvement in terms of SR performance. In this paper, we explore the combination of large kernel convolution and the structural re-parameterization technique to further improve performance without additional computational cost at the inference phase.

Structural Re-Parameterization
Structural re-parameterization [39-41] equivalently converts model structures by transforming the parameters between training and inference time. These structures enhance off-the-shelf models without modifying the CNN architecture. Specifically, Ding et al. [39] improve performance without any inference-time cost by using the Asymmetric Convolutional Block (ACB). ACB uses 1D asymmetric convolutions to strengthen the square convolution kernels within a single convolution block. It also uses batch normalization (BN) [42] at training time to reduce overfitting and accelerate training on high-level vision tasks. Besides, Ding et al. [40] design a more complex version (DBB) that uses symmetric square kernels in its branches during training. DBB performs better on high-level tasks but worse on SR tasks than ACB. RepSR [17] and RMBN [43] use variants of DBB on VGG-like CNNs for SR. However, the SR quality of RepSR is much lower than that of Transformer-based models. RepSR also introduces an artifacts problem when using BN in a VGG-like SR model. This paper explores asymmetric structural re-parameterization with BN on large kernel convolutions for image SR.

Methods
In this section, we first outline the architecture of the proposed Efficient Residual ConvNet with structural Re-parameterization (RepECN) and then introduce the ConvNet Stages (CNS), Re-Parameterization ConvNet Blocks (RepCNB), and the lightweight upsampling module.

Network Architecture
We leverage the high-performance, Transformer-like stage-to-block design paradigm and the lower computation cost of a pure convolution structure to build an efficient and high-accuracy network for image super-resolution. As shown in Figure 2, RepECN mainly consists of three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. Networks of different sizes share the same structure and differ only in the number of CNSs and backbone channels. The architecture is also expected to perform well on other image restoration tasks.

Shallow and Deep Feature Extraction
Given a low-resolution (LR) image input I_LR ∈ R^(H×W×C_in) (H, W, and C_in are the LR image height, width, and number of input channels, respectively), we use A_SF(·) to denote an ACB with a 3 × 3 kernel size. The corresponding shallow feature O_0 ∈ R^(H×W×C) is extracted as

O_0 = A_SF(I_LR), (1)

where C is the number of output feature channels. Such an ACB enhances the standard square-kernel convolution layer, so it provides a better and simpler way to map the low-dimensional image space to a high-dimensional feature space than conventional shallow feature extraction. In the next module, we extract the deep feature as

O_F = F_DF(O_0) + O_0, (2)

F_DF(O_0) = A_DF(F_CNS_K(· · · F_CNS_1(O_0) · · ·)), (3)

where F_CNS_i is the i-th CNS (i = 1, . . . , K) and A_DF is an ACB with a 3 × 3 kernel at the end of the module. Such an ACB brings an inductive bias into the depth-wise-ConvNet-based network, which helps aggregate shallow and deep features. Meanwhile, the long skip connection in Equation (2) aggregates the shallow and deep features, bringing the low-frequency information directly to the next module.

Image Reconstruction
The input LR image contains the most primitive information, which should guide the reconstruction output. Additionally, bicubic interpolation can upsample the LR image directly while maintaining the original information. Considering this, we reconstruct the super-resolution (SR) image I_SR as

I_SR = U_F(O_F) + U_LR(I_LR), (4)

where U_F(·) and U_LR(·) denote the upsampling of the extracted feature and the bicubic interpolation of the LR image, respectively. The benefit of this aggregation is that the backbone network can focus on learning the high-frequency information that turns the conventional upsampling of the LR image into a high-quality SR image. The upsampling of the extracted feature is implemented by nearest-neighbor interpolation, ACBs, and pixel attention (PA), as described in Section 3.3.

Loss Function
The parameters of our network are optimized with the smooth L1 loss

L = (1/N) ∑_p smooth_L1(I_SR(p) − I_HR(p)), smooth_L1(x) = { 0.5x², |x| < 1; |x| − 0.5, otherwise }, (5)

where I_HR denotes the corresponding ground-truth HR image, and I_SR is the output of RepECN that takes I_LR as the input. The smooth L1 loss converges faster than the naive L1 pixel loss.
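The smooth L1 loss above can be sketched in a few lines of NumPy; this is a minimal illustration, with the standard transition point of 1.0 assumed (the paper does not state its value):

```python
import numpy as np

def smooth_l1(sr, hr, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear elsewhere.
    `beta` is the quadratic-to-linear transition point (assumed 1.0)."""
    d = np.abs(sr - hr)
    loss = np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
    return loss.mean()

# Small errors are penalized quadratically, large ones linearly.
assert smooth_l1(np.zeros(4), np.zeros(4)) == 0.0
```

The quadratic region gives smooth gradients near convergence, while the linear region keeps large outlier errors from dominating, which is one reason it converges faster than a plain L1 loss.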

ConvNet Stages
The ConvNet Stage (CNS) is a residual block consisting of six Re-Parameterization ConvNet Blocks (RepCNBs), a LayerNorm, and an ACB, as shown in Figure 2a. Each CNS of Equation (3) takes a feature as the input. For the specific i-th CNS, we use O_i,0 in place of the input O_i−1 for convenience. Inside such a CNS, we obtain the intermediate outputs

O_i,j = F_RepCNB_i,j(O_i,j−1), j = 1, 2, . . . , 6, (6)

where F_RepCNB_i,j(·) denotes the j-th RepCNB. Then, a LayerNorm and a head ACB are applied before the residual connection. The total output of the i-th CNS is formulated as

O_i = A_CNS_i(LN(O_i,6)) + O_i,0, (7)

where A_CNS_i is the ACB at the end of the i-th CNS. The ACB can be treated as a standard convolution, while the RepCNB consists of depth-wise and point-wise convolutions. The standard convolution, with a small and spatially invariant filter, brings a different view that benefits translational equivariance. In addition, the residual connection aggregates different hierarchies of features, letting the block fit more complex feature mappings.
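The data flow of one CNS can be sketched as follows; the RepCNBs and the head ACB are stand-in callables here (the real blocks are convolutional), so only the chain-then-normalize-then-residual wiring is illustrated:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Channel-last LayerNorm over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def cns(x, rep_blocks, head_acb):
    """One ConvNet Stage: a chain of RepCNBs, LayerNorm, a head layer,
    and a long residual connection back to the stage input."""
    out = x
    for block in rep_blocks:          # intermediate outputs of the six RepCNBs
        out = block(out)
    return head_acb(layer_norm(out)) + x   # head ACB, then residual

# Toy stand-ins for the six RepCNBs and the head ACB.
blocks = [lambda t: t * 0.9 + 0.01 for _ in range(6)]
feat = np.ones((4, 4, 8))
out = cns(feat, blocks, lambda t: t)
assert out.shape == feat.shape
```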

Re-Parameterization ConvNet Blocks
The Re-Parameterization ConvNet Block (RepCNB) is a residual block inspired by ConvNeXt [36]. The main difference is that we use an ACB to enhance the square convolution kernel inside RepCNB. As shown in Figure 2b, given an input with x channels, a RepCNB first uses a depth-wise ACB with a 7 × 7 kernel to extract a feature with x channels. A layer normalization (LN) layer follows it. Then, two point-wise convolutional layers, with a GELU non-linearity between them, learn features across channels before the residual connection. The first point-wise layer takes the output of LN with x channels as input and produces a feature with 4x channels. The second point-wise layer takes that feature as input and produces the final output with x channels.
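The RepCNB computation can be sketched in NumPy as below. This is an illustrative single-block forward pass with random weights: the depth-wise ACB is simplified to a plain depth-wise 7 × 7 convolution, the point-wise convolutions are expressed as channel matrix multiplications, and the tanh approximation of GELU is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def depthwise_conv7(x, k):
    """Depth-wise 7x7 'same' convolution; x: (H, W, C), k: (7, 7, C)."""
    H, W, C = x.shape
    xp = np.pad(x, ((3, 3), (3, 3), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i+7, j:j+7] * k).sum(axis=(0, 1))
    return out

def repcnb(x, dw_k, w1, w2):
    """RepCNB sketch: depth-wise 7x7 conv -> LN -> 1x1 conv (C -> 4C)
    -> GELU -> 1x1 conv (4C -> C) -> residual. ACB enhancement omitted."""
    out = layer_norm(depthwise_conv7(x, dw_k))
    out = gelu(out @ w1) @ w2        # point-wise convs as channel matmuls
    return out + x

C = 8
x = rng.normal(size=(6, 6, C))
y = repcnb(x, rng.normal(size=(7, 7, C)) * 0.1,
           rng.normal(size=(C, 4 * C)) * 0.1,
           rng.normal(size=(4 * C, C)) * 0.1)
assert y.shape == x.shape
```

The 4× channel expansion between the two point-wise layers mirrors the inverted-bottleneck design ConvNeXt uses.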

Asymmetric Convolutional Block
An Asymmetric Convolutional Block (ACB) is a block built with the structural re-parameterization technique [39]: it is identical to a standard convolution at inference time but differs at training time. Figure 3 compares a standard convolution (Conv) and an ACB with a kernel size of 3 × 3. The ACB or Conv takes a feature I_ACB as the input. At training time, ACB uses three bias-free convolutional layers {F_conv_1, F_conv_2, F_conv_3} with kernel sizes of 3 × 3, 1 × 3, and 3 × 1, respectively. After batch normalization (BN) of each branch, ACB obtains the output O_ACB by element-wise summation of the three branch outputs:

O_ACB = ∑_{c=1}^{3} ( γ_c (F_conv_c(I_ACB) − μ_c) / σ_c + β_c ), (8)

where μ_c, σ_c, γ_c, and β_c denote the channel-wise mean, standard deviation, learned scaling factor, and bias term of the c-th branch, respectively, and ∑_{c=1}^{3} denotes element-wise summation over the three branches. At inference time, ACB first merges each channel-wise BN into its Conv kernel by BN fusion and then merges the three Convs by branch fusion:

K_inf = ∑_{c=1}^{3} (γ_c / σ_c) K_c, b_inf = ∑_{c=1}^{3} ( β_c − μ_c γ_c / σ_c ), (9)

where K_c denotes the kernel of the bias-free convolutional layer F_conv_c (the 1 × 3 and 3 × 1 kernels are zero-padded to 3 × 3 before summation). The ACB is thereby converted to a single standard convolutional layer with kernel K_inf and bias b_inf.
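The BN fusion and branch fusion can be verified numerically. The single-channel NumPy sketch below builds the three training-time branches, fuses them into one 3 × 3 kernel and bias, and checks that the fused convolution reproduces the training-time output exactly; the conv helper and the random BN statistics are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_same(x, k, bias=0.0):
    """2D correlation with zero 'same' padding; x: (H, W), k: (kh, kw)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i+kh, j:j+kw] * k).sum()
    return out + bias

# Three training-time branches: 3x3, 1x3, 3x1 kernels, each followed by BN.
kernels = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3)), rng.normal(size=(3, 1))]
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mu, sigma = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)

def acb_train(x):
    # Element-wise sum of the three BN-normalized branch outputs.
    return sum(gamma[c] * (conv2d_same(x, kernels[c]) - mu[c]) / sigma[c] + beta[c]
               for c in range(3))

def pad_to_3x3(k):
    # Zero-pad an asymmetric kernel to 3x3, centered.
    out = np.zeros((3, 3))
    kh, kw = k.shape
    out[(3 - kh) // 2:(3 - kh) // 2 + kh, (3 - kw) // 2:(3 - kw) // 2 + kw] = k
    return out

# BN fusion then branch fusion: a single 3x3 conv replaces the whole block.
K_inf = sum(gamma[c] / sigma[c] * pad_to_3x3(kernels[c]) for c in range(3))
b_inf = sum(beta[c] - mu[c] * gamma[c] / sigma[c] for c in range(3))

x = rng.normal(size=(8, 8))
assert np.allclose(acb_train(x), conv2d_same(x, K_inf, b_inf))
```

Because convolution is linear, scaling a kernel by γ/σ and folding the remaining BN terms into a bias is exact, so the fused layer is not an approximation of the training-time block.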


Lightweight Upsampling Module
As shown in Figure 4, we choose nearest-neighbor interpolation to upsample the input feature, followed by an ACB. Compared with sub-pixel convolution such as pixel shuffle, this choice saves parameters without performance degradation. We first use an upsampling operation to transform the feature O_F from the entire feature extraction module in Equation (3). The upsampling operation consists of several pairs of nearest-neighbor interpolation and ACB. Each pair upsamples only by a scale factor of 2 or 3, limiting the whole module to scale factors of 2^N or 3. The module can support other scale factors by adapting the interpolation scale factor. Then, inspired by PAN [44], we employ a pixel attention (PA) layer and an ACB to reconstruct the SR feature. The PA can enhance the reconstruction and improve the SR quality. Finally, a second ACB layer generates the output U_F(O_F) of the upsampling module in Equation (4).
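The two key operations of this module can be sketched in NumPy. Nearest-neighbor upsampling is parameter-free (unlike pixel shuffle, which needs a preceding convolution to expand channels), and pixel attention is a sigmoid-gated per-pixel, per-channel mask; the 1 × 1 convolution inside PA is reduced here to a channel-mixing matrix `w`, and the ACBs are omitted:

```python
import numpy as np

def nearest_upsample(x, scale):
    """Parameter-free nearest-neighbor interpolation on an (H, W, C) map."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def pixel_attention(x, w):
    """Pixel attention: a 1x1 conv (here the matrix `w`) followed by a
    sigmoid produces a gate that rescales every pixel and channel."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w)))
    return x * gate

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))
up = nearest_upsample(feat, 2)                      # (8, 8, 8)
out = pixel_attention(up, rng.normal(size=(8, 8)))  # same shape, gated
assert out.shape == (8, 8, 8)
```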

Experiments
This section uses several commonly used benchmark datasets to compare the proposed network with efficient and state-of-the-art SISR models. In addition, ablation studies are used to analyze the rationality of our proposed modules.

Datasets and Indicators
We train the proposed network on the DIV2K dataset [45] and validate it on the Set5 [46] dataset. The 800 training and 100 validation image pairs in DIV2K are used as the training dataset. The evaluation indicators for SISR performance are the peak signal-to-noise ratio (PSNR) [47] and the structural similarity index (SSIM) [48] on the benchmark datasets Set5, Set14 [49], B100 [50], Urban100 [51], and Manga109 [52]. We use MATLAB to calculate them on the Y channel of the YCbCr space converted from the RGB space of the image.
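The Y-channel PSNR protocol can be sketched as below; the BT.601 conversion coefficients that MATLAB's rgb2ycbcr uses for images in [0, 255] are assumed, and SSIM is omitted for brevity:

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of YCbCr for an RGB image in [0, 255]
    (ITU-R BT.601 coefficients, as in MATLAB's rgb2ycbcr; assumed)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 1 gray level gives MSE = 1, i.e. ~48.13 dB at peak 255.
assert abs(psnr(np.full((4, 4), 101.0), np.full((4, 4), 100.0)) - 48.13) < 0.01
```

In practice the PSNR would be computed on `rgb_to_y(sr)` versus `rgb_to_y(hr)` after cropping the border by the scale factor, as is conventional in SR evaluation.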

Training Details
We group the efficient models into three size levels according to the number of parameters: the extremely tiny, small, and base sizes have fewer than 100 K, 500 K, and 1500 K parameters, respectively. The training hyperparameters for our RepECN-T (tiny), RepECN-S (small), and RepECN (base) models are described in Table 1. In the table, RepCNB and channel denote the number of RepCNBs in each CNS and the number of channels of each intermediate feature, while patch denotes the size of the RGB patches cropped from LR images as the input. The total training epochs of RepECN-T, RepECN-S, and RepECN are set to 3000, 2000, and 1500, respectively. Each mini-batch comprises 32 patches for training all three models. The learning rate is set to 2 × 10^−4 and halved at [1/2, 4/5, 9/10, 19/20] of the total epochs. The inference latency on the CPU and GPU platforms is measured for generating a 720P SR image (1280 × 720) on an Intel Xeon Gold 5118 CPU (12 cores, 2.30 GHz, and 6 data-loading threads) and an Nvidia Titan V GPU (12 GB of HBM2 memory and 5120 CUDA cores), respectively. Each latency is averaged over 50 runs. The multiply-accumulates (MACs) are also measured for generating a 720P SR image (1280 × 720).
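A latency protocol like the one described (averaging 50 runs) can be sketched with the standard library; the warmup count is an assumption, added because first calls are typically slower due to caching and, on GPU platforms, kernel compilation:

```python
import time

def average_latency(fn, runs=50, warmup=5):
    """Average wall-clock latency of `fn` over `runs` calls after warmup."""
    for _ in range(warmup):
        fn()                          # discard warmup runs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs
```

For GPU measurements, asynchronous kernel launches must also be synchronized before reading the clock, or the measured time reflects only launch overhead.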

Performance and Latency Comparison
To fairly show the effectiveness of our RepECN, we chose state-of-the-art Transformer-based models with similar parameter numbers that are trained on the same DIV2K dataset. Table 2 shows the quantitative performance comparisons between the proposed RepECN and the state-of-the-art Transformer-based models SwinIR [4], ESRT [9], and LBNet [10]. Among the models with fewer than 1500 K parameters, RepECN achieves the best or second-best performance on five benchmark datasets for the three standard scale factors with much lower latency. Specifically, compared to the state-of-the-art SwinIR-S with similar PSNR/SSIM, RepECN needs only one-fifth of the latency for scale factor 2 on the GPU platform. Notably, LBNet and ESRT cannot run inference for scale factor 2 on our GPU platform because of memory limitations. To show the high SR quality of our RepECN structure, we chose current CNN-based models with different parameter sizes. Note that the training dataset of ShuffleMixer and LAPAR is DF2K (a merged dataset of DIV2K [45] and Flickr2K [53]), which contains many more image pairs. Table 3 shows the quantitative performance comparisons between the proposed RepECN and the CNN-based models SRCNN [1], FSRCNN [2], ShuffleMixer [13], IDN [19], IMDN [20], LatticeNet [21], LapSRN [22], EDSR [24], DRRN [54], and LAPAR [55]. Our RepECN family achieves state-of-the-art performance in all tiny, small, and base sizes. Specifically, RepECN-T (fewer than 100 K parameters) outperforms ShuffleMixer-Tiny with a 0.45 dB gain on Urban100 (2×). RepECN-S (fewer than 500 K parameters) outperforms ShuffleMixer with a 0.41 dB gain on Urban100 (2×) using a similar parameter number. In addition, RepECN-S also outperforms LatticeNet with a 0.06 dB gain on Urban100 (2×) using about half the parameters. This shows that our ConvNet design outperforms the previous designs. In conclusion, our model achieves state-of-the-art performance with a better trade-off between inference speed and performance.

To evaluate our RepECN qualitatively, we also show visual comparisons in Figure 5, including the three sizes of RepECN and the corresponding state-of-the-art models of each size (ESRT [9], LBNet [10], IMDN [20], LatticeNet [21], and EDSR-baseline [24]) for scale factor 4 SISR on images from the Set14 [49] and Urban100 [51] benchmark datasets (zoom in for the best view). All three sizes of RepECN restore higher-frequency detailed textures and alleviate blurring artifacts, producing more visually pleasing images. In contrast, most other models produce incorrect textures with blurry artifacts. Furthermore, we evaluate our model on real LR images from a historical dataset [22], as shown in Figure 6. RepECN generates smoother details with a clearer structure and fewer artifacts than the other models (LBNet [10], LatticeNet [21], EDSR [24], and CARN [56]). This indicates the high effectiveness of our proposed RepECN.

Ablation Study and Analysis
For the ablation study, we train RepECN family models on DIV2K [45] with 1000 epochs for 2× SISR in Sections 4.3.1, 4.3.3, and 4.3.4, progressively adding useful elements to construct RepECN-T. Then, we train RepECN with the numbers of CNSs, RepCNBs, channels, and epochs set to 4, 6, 60, and 1500, respectively, for 2× SISR as the baseline model and modify the first three hyperparameters individually in Section 4.3.5. In addition, we train FSRCNN variants on DIV2K [45] with 3000 epochs for 2× SISR in Section 4.3.2. In all sections, the performance comparison uses the PSNR on the benchmark dataset Set5 [46].

Impact of Normalization in CNS and ACB
To explore the effect of layer normalization (LN) in each CNS, we remove the head layer in CNS and the batch normalization (BN) inside ACB to exclude their effects. Table 4 first shows that LN is necessary for better performance, as the SR quality of RepECN-T-A is lower than that of RepECN-T-B and RepECN-T-C. The table then illustrates that LayerNorm before the residual connection in CNS improves the PSNR further than LN after the residual connection. In addition, we compare using batch normalization (BN) inside ACB. The no-BN variant RepECN-T-C skips normalization and adds a bias to each convolutional layer in ACB during training; when switching to inference, the weights and biases of the three convolutional layers in ACB are merged into the single convolutional layer used as the inference-time ACB. Table 4 shows that the normalization inside ACB is important, as RepECN-T-D improves the PSNR by 0.01 dB. Apart from that, training with normalization in ACB does not converge when the residual connection from the LR input to the output is removed while the pixel shuffle upsampling is used.

Impact of Structural Re-Parameterization
To demonstrate the effectiveness of structural re-parameterization for image super-resolution (SR), we trained multiple variants of FSRCNN, a model with ample room for improvement. We first replace the upsampling module of FSRCNN with our proposed lightweight upsampling module in the variant FSRCNN-N, which improves the PSNR by 0.31 dB. Then, we apply DBB [40], a symmetric square-kernel structural re-parameterization technique similar to but more complex than the one used in RepSR [17], to each ConvNet layer in FSRCNN-N. FSRCNN-N-DBB improves the SR performance by a 0.16 dB gain in PSNR. Finally, we replace DBB with the asymmetric-kernel structural re-parameterization technique ACB. FSRCNN-N-ACB further improves the SR quality by a 0.09 dB gain in PSNR. In conclusion, structural re-parameterization can improve the performance of CNN-based SR models, and the asymmetric-kernel technique is better than the symmetric square one.

Impact of the Head Layer in CNS
The effect of using a head layer (the last ACB before the residual connection) in CNS is shown in Table 4. The base version RepECN-T uses one 3 × 3 ACB as the head layer, which improves the PSNR by 0.4 dB. Furthermore, the table shows that one 3 × 3 ACB is better than three 3 × 3 ACBs (where the channel number of the second layer is one-fourth of the input and output channel numbers): RepECN-T-E saves a few parameters (5 K) at the cost of a 0.02 dB PSNR degradation compared to RepECN-T. To achieve higher performance, we finally choose one 3 × 3 ACB as the head layer in CNS.

Impact of Nearest-Neighbor Interpolation with Pixel Attention in Upsampling Module
Tables 4 and 5 show the performance improvement of the proposed upsampling module of Section 3.3 with pixel attention (PA). In Table 4, the pixel shuffle of the variant RepECN-T-G is the same as the image reconstruction module in SwinIR [4]. The nearest-neighbor-without-PA variant RepECN-T-F removes the PA block from the proposed upsampling module. The table shows that the nearest-neighbor interpolation saves parameters while improving performance, and that PA is necessary, as it improves the PSNR by 0.02 dB. Table 5 shows that the proposed upsampling module significantly improves the performance of FSRCNN, with a 0.31 dB gain in PSNR.

The effects of the number of CNSs, the number of RepCNBs in each CNS, and the number of channels of each layer are shown in Figure 7. We observe that the performance is positively correlated with all three hyperparameters. In addition, as the settings increase, the performance growth tends to flatten out, so there is a trade-off between performance and model size. To achieve high performance and fast inference, we choose the point with the maximum change in slope as the setting. In particular, the RepCNB number of each CNS is fixed to 6, as the performance is more sensitive to reducing it than to the others.

Figure 2 .
Figure 2. The architecture of the Efficient Residual ConvNet with structural Re-parameterization (RepECN).

Figure 3 .
Figure 3. The comparison between the Asymmetric Convolutional Block (ACB) and a standard convolution.

Figure 4 .
Figure 4. Illustration of the proposed upsampling module.

• We propose an efficient and high-accuracy SR model, RepECN, to offer fast speed and high-quality image reconstruction capabilities using the Transformer-like stage-to-block design paradigm.
• To further improve performance, we employ a large kernel Conv module inspired by ConvNeXt and an asymmetric re-parameterization technique, which is proven to perform better than symmetric square re-parameterization techniques.
• To save parameters and maintain reconstruction performance, we propose a novel image reconstruction module based on nearest-neighbor interpolation and pixel attention.
• Extensive experimental results show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance.

Table 2 .
PSNR (dB) and SSIM performances on standard benchmark datasets for our RepECN models trained on DIV2K, compared with Vision Transformer-based models. The best and second-best SR performances are marked in red and blue, respectively. Blank entries denote unavailable results.

Table 3 .
PSNR (dB) and SSIM performances on standard benchmark datasets for CNN-based models. The best and second-best SR performances are marked in red and blue, respectively. Blank entries denote unavailable results.

Table 4 .
Ablation study on several designs of RepECN, including the layer normalization in CNS, the batch normalization in ACB, the head layer in CNS, and the upsampling design. The best SR performances are marked in red.

Table 5 .
Ablation study on the structural re-parameterization and upsampling design for the simple 3 × 3 ConvNet model FSRCNN to prove their effectiveness. The best SR performances are marked in red.