1. Introduction
With the rapid development of digital imaging and information processing, video has become a primary vehicle for information acquisition and dissemination. Nevertheless, constrained by sensor performance, imaging distance, and transmission bandwidth, particularly in infrared scenarios including short-wave infrared (SWIR), captured videos inevitably suffer from degradations such as insufficient resolution, blurred details, and severe noise. Computational imaging techniques are therefore urgently required to enhance visual quality. Super-resolution reconstruction, which recovers high-resolution structural details from low-resolution observations, has demonstrated critical value in medical imaging [1,2], remote-sensing surveillance [3,4], and infrared object perception [5,6].
Compared to Single Image Super-Resolution (SISR), Video Super-Resolution (VSR) faces the more complex challenge of aggregating information from multiple unaligned yet highly correlated frames. Wang et al. proposed multi-scale deformable alignment and attention mechanisms to achieve high-precision inter-frame alignment and fusion, significantly improving reconstruction performance [7]. However, this method does not incorporate an explicit motion estimation mechanism, which limits its alignment accuracy in scenes with intense motion. Haris et al. progressively modeled inter-frame differences through multiple residual projections, enhancing the utilization of temporal information [8], but the sequential projection process may accumulate errors that affect the final reconstruction. Chan et al. proposed BasicVSR, which divides the VSR task into four modules (propagation, alignment, fusion, and upsampling), simplifying model design while improving reconstruction quality [9]. Nevertheless, this method still suffers from significant alignment errors when handling large-scale or non-rigid motion, and it is particularly prone to artifacts or structural distortions in complex backgrounds or occluded regions.
It should be emphasized that infrared videos differ markedly from their visible-light counterparts in imaging mechanism, texture representation, and noise characteristics; directly transferring VSR pipelines designed for RGB scenes therefore frequently fails to simultaneously guarantee motion-alignment stability and high-quality detail recovery. Consequently, how to enhance motion modeling robustness under complex motion and occlusion while effectively reconstructing the limited yet critical structural and textural cues present in infrared videos remains a pivotal and unresolved challenge in current video super-resolution research.
To address these issues in infrared videos, namely blurred small-target edges, sparse texture, and severe occlusion, this paper proposes a video super-resolution network combining Dual Motion Compensation and a Multi-scale Structure–Texture Prior (DCST-Net). The method improves reconstruction quality and structural fidelity from two key perspectives: motion alignment and high-frequency detail modeling.
Specifically, to enhance the modeling of moving objects, this paper designs a Dual Motion Compensation Module (DMCM) that performs direct warping and progressive warping compensation strategies in parallel. This design combines the sensitivity of direct warping to fine local motion with the stability of progressive warping under large-scale displacement, thereby mitigating the alignment errors that easily arise in scenes with intense motion. An occlusion mask further suppresses invalid regions during motion compensation, enhancing the stability and robustness of alignment. To avoid ambiguity, we explicitly distinguish the proposed method from existing bidirectional propagation or recurrent alignment frameworks. Although these approaches also perform multi-frame alignment, their core idea is to propagate features through recurrent structures, where alignment merely facilitates temporal propagation. In contrast, our method abandons feature propagation. Instead, for each input frame, we explicitly build two complementary motion-compensated results: single-flow direct warping and step-by-step progressive warping. The former performs one-shot global alignment, while the latter refines alignment locally and progressively, forming dual motion hypotheses that supply more robust cues for subsequent fusion. We stress that the progressive warping is not used for recurrent feature propagation; rather, it serves as a complementary alignment to the direct warping, operates in parallel with it, and jointly participates in the subsequent fusion stage. Thanks to this explicit dual-hypothesis modeling, the proposed network delivers more stable texture alignment under occlusion and complex displacement, yielding consistently lower LPIPS and higher perceptual quality.
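To make the dual-hypothesis alignment concrete, the following PyTorch-style sketch illustrates one way the idea could be realized; the helper names (flow_warp, fuse) and the way the two warped results are masked and concatenated are illustrative assumptions, not the exact implementation of DMCM.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Backward-warp a feature map (B, C, H, W) with optical flow (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow                    # displaced sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0          # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def dual_motion_compensation(neighbors, flows_direct, flows_step, occ_masks, fuse):
    """
    neighbors:    features of frames t-k, ..., t-1, ordered toward the reference frame t
    flows_direct: flow from each neighbor directly to frame t (single-flow hypothesis)
    flows_step:   flow from each frame to the next one (progressive hypothesis)
    occ_masks:    occlusion masks in [0, 1]; 0 suppresses pixels invalid after warping
    fuse:         a small conv block merging the two hypotheses (an assumed component)
    """
    aligned, prog = [], None
    for feat, f_dir, f_stp, occ in zip(neighbors, flows_direct, flows_step, occ_masks):
        direct = flow_warp(feat, f_dir)                          # one-shot global alignment
        prog = flow_warp(feat if prog is None else prog, f_stp)  # step-by-step refinement
        # both hypotheses are masked and fused in parallel; no recurrent feature propagation
        aligned.append(fuse(torch.cat((direct * occ, prog * occ), dim=1)))
    return aligned
```

The key point of the sketch is that the direct and progressive warps are computed in parallel and only meet at the fusion step, mirroring the parallel dual-hypothesis design described above.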
In terms of detail restoration, this paper further proposes a Multi-scale Structure–Texture Prior Module (MSTPM). It introduces a pre-trained network to extract multi-scale feature information, guiding the model to perceive structural and textural details at different levels. This module effectively compensates for high-frequency details by upsampling, concatenating, and compressively fusing multi-scale feature maps, thereby alleviating structural blurring and textural degradation caused by alignment errors during feature fusion. DCST-Net demonstrates superior reconstruction performance on multiple datasets, effectively restoring object contours while recovering rich textural details, with overall visual quality and geometric consistency surpassing existing methods.
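As a rough illustration of this prior, the sketch below extracts multi-scale features from a frozen, pre-trained VGG16 (the backbone adopted in Section 3.6), upsamples them to a common resolution, concatenates them, and compresses the result with a 1×1 convolution. The selected layers, output channel count, and the single compression convolution are assumptions for illustration rather than the exact MSTPM configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class StructureTexturePrior(nn.Module):
    """Multi-scale structure-texture prior from frozen VGG16 features (illustrative)."""
    def __init__(self, out_channels=64, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features
        self.slices, prev = nn.ModuleList(), 0
        for idx in layer_ids:                          # relu1_2, relu2_2, relu3_3, relu4_3
            self.slices.append(nn.Sequential(*list(features.children())[prev:idx + 1]))
            prev = idx + 1
        for p in self.slices.parameters():             # the prior network stays frozen
            p.requires_grad_(False)
        self.compress = nn.Conv2d(64 + 128 + 256 + 512, out_channels, kernel_size=1)

    def forward(self, frame):
        feats, x = [], frame                           # frame: (B, 3, H, W)
        for s in self.slices:
            x = s(x)
            # upsample every scale back to the input resolution before concatenation
            feats.append(F.interpolate(x, size=frame.shape[-2:], mode="bilinear",
                                       align_corners=False))
        return self.compress(torch.cat(feats, dim=1))  # compressive fusion
```

For single-channel infrared frames, the input would need to be replicated to three channels before being passed to the VGG16 backbone.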
3. Experimental Results and Analysis
To validate the effectiveness of the proposed method, we conducted qualitative and quantitative comparisons against a range of baseline methods, and verified the contribution of each module through ablation studies and visual analysis.
3.1. Datasets
Since available infrared video datasets are small and lack variety, we pre-train on RGB data and fine-tune on infrared clips. The public Vimeo-90K dataset [17] was used to pre-train the network, and a short-wave infrared dataset (SWIRD) was used for fine-tuning and testing. Note that we pre-train on Vimeo-90K not to learn RGB appearance, but to let the network fully learn complex motion and temporal alignment: optical flow estimation and motion compensation depend on temporal structure and displacement patterns rather than on the spectral band. Large-scale RGB clips teach the model to handle large displacements and non-rigid motion, and the subsequent infrared fine-tuning adapts it to the radiometric signal and noise characteristics, enabling effective infrared super-resolution. In addition, to guarantee an impartial comparison, we evaluate on the public benchmarks REDS4 [18] and Vid4 [19] to verify the superiority of the proposed method. Detailed information about the datasets is shown in Table 1.
The Vimeo-90K dataset contains 91,701 video clips covering various scenes and motion types. Each clip contains seven consecutive frames at a resolution of 448 × 256, making it well suited for video denoising and super-resolution tasks. REDS4 is a subset of the full REDS dataset, comprising four 100-frame sequences captured in diverse scenarios at a resolution of 1280 × 720. Vid4, a classical benchmark for video enhancement tasks, consists of four clips of 40 to 50 frames (calendar, city, foliage, and walk) with spatial resolutions ranging from 480 × 720 to 576 × 720.
The short-wave infrared dataset contains six video sequences at a resolution of 640 × 512, covering various scenes and aircraft targets in different poses. In our experimental setup, we adopt a staged training scheme. First, the model is pre-trained on the Vimeo-90K dataset to acquire generic spatio-temporal reconstruction and motion modeling capacity. Next, the longest sequence in SWIRD is divided into 987 short subsequences of seven frames each, which are used for fine-tuning so that the network can adapt to infrared imaging characteristics. Each of the remaining five long sequences, labeled IR1–IR5, contains 200 frames and is reserved exclusively for testing; these sequences never participate in training or fine-tuning, ensuring an impartial and objective evaluation.
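For clarity, a minimal sketch of the seven-frame subsequence extraction is given below; the sliding-window stride of 1 is an assumption, since only the subsequence count and length are stated.

```python
def make_subsequences(frame_paths, length=7, stride=1):
    """Return overlapping windows of `length` consecutive frame paths (sliding window)."""
    return [frame_paths[i:i + length]
            for i in range(0, len(frame_paths) - length + 1, stride)]
```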
During the data preprocessing stage, all video frames were downsampled by a factor of 4 using bicubic interpolation to generate low-resolution input data. For an original high-resolution frame $I^{HR}$, the corresponding low-resolution frame $I^{LR}$ can be expressed as
$$I^{LR} = \mathcal{D}_{\downarrow s}\left(I^{HR}\right),$$
where $\mathcal{D}_{\downarrow s}(\cdot)$ represents the bicubic interpolation downsampling operation and $s = 4$ represents the downsampling factor. The low-resolution data generated in this way effectively simulates the degradation process in real-world scenarios, providing high-quality input–output pairs for the video super-resolution task.
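A minimal sketch of this ×4 bicubic degradation is shown below; generating the low-resolution frames with torch.nn.functional.interpolate is an assumption about tooling rather than the authors' exact preprocessing code.

```python
import torch
import torch.nn.functional as F

def degrade(hr_frames, scale=4):
    """Bicubically downsample a stack of HR frames (T, C, H, W) by `scale`."""
    return F.interpolate(hr_frames, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
```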
3.2. Implementation Details
The hardware and software configurations used in the experiments are shown in
Table 2.
The following optimization strategies and parameter settings were adopted for the training process. Data was read using a sliding window approach, with seven consecutive video frames input each time as a training sample. To enhance the model’s local feature extraction capability, each frame was cropped into local patches of size
, which were used as the network’s input. To keep results reproducible, all experiments use a fixed random seed: before training, we set the Python, NumPy, and PyTorch seeds to 42 and enabled the deterministic computing mode. The ADAM optimizer [
20] was used for model training, with momentum parameters set to
and
. The batch size was set to 1, the initial learning rate was set to
, and a cosine annealing strategy was employed to adjust the learning rate dynamically; namely, the learning rate decays every 5 epochs down to
, enabling finer parameter optimization in the later stages of training. To constrain the training process and enhance reconstruction quality, a composite loss function combining Charbonnier loss and perceptual loss was used. Charbonnier loss [21] was used to minimize the difference between the reconstructed frame and the real high-resolution frame. Perceptual loss [22] supervised the retention of high-frequency detail information, with its weight set to 0.1. During training, a pre-trained SpyNet model was used to initialize the optical flow estimation module, while the other modules were trained from scratch. To keep training stable and prevent early optical-flow estimation errors from degrading the reconstruction network, we freeze SpyNet for the first two epochs; once the model reaches a stable state, we unfreeze it and train end-to-end with the remaining modules. This strategy keeps alignment accurate and improves overall quality. The model was trained for a total of 30 epochs. Under these settings, the model converged quickly and achieved its best performance.
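The composite objective can be sketched as follows; the Charbonnier epsilon and the VGG16 layer used for the perceptual term are assumptions, as they are not specified above.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CharbonnierLoss(nn.Module):
    """Smooth L1-like loss: mean of sqrt((x - y)^2 + eps)."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, sr, hr):
        return torch.mean(torch.sqrt((sr - hr) ** 2 + self.eps))

class CompositeLoss(nn.Module):
    """Charbonnier reconstruction loss plus a 0.1-weighted VGG perceptual loss."""
    def __init__(self, perceptual_weight=0.1):
        super().__init__()
        self.charbonnier = CharbonnierLoss()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # up to relu3_3
        for p in self.vgg.parameters():                                  # frozen feature net
            p.requires_grad_(False)
        self.w = perceptual_weight

    def forward(self, sr, hr):
        # sr, hr: (B, 3, H, W); single-channel infrared frames would be replicated to 3 channels
        perceptual = torch.mean((self.vgg(sr) - self.vgg(hr)) ** 2)
        return self.charbonnier(sr, hr) + self.w * perceptual
```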
3.3. Evaluation Metrics
Network performance was primarily evaluated using three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [23]. For each image sequence, the PSNR, SSIM, and LPIPS values were computed frame by frame, and the per-frame values were then averaged to obtain the evaluation metrics for the reconstruction result of that sequence.
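The per-sequence evaluation protocol described above can be sketched as follows; the specific packages (scikit-image for PSNR/SSIM and the lpips package with an AlexNet backbone) are assumptions about tooling.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric with an AlexNet backbone

def evaluate_sequence(sr_frames, hr_frames):
    """sr_frames, hr_frames: lists of HxWx3 uint8 arrays for one sequence."""
    psnrs, ssims, lpips_vals = [], [], []
    for sr, hr in zip(sr_frames, hr_frames):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=255))
        ssims.append(structural_similarity(hr, sr, channel_axis=-1, data_range=255))
        # LPIPS expects NCHW float tensors scaled to [-1, 1]
        to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
        lpips_vals.append(lpips_fn(to_t(sr), to_t(hr)).item())
    # sequence-level metrics are the averages over all frames
    return np.mean(psnrs), np.mean(ssims), np.mean(lpips_vals)
```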
It should be emphasized that PSNR and SSIM primarily quantify pixel-level reconstruction error and often fail to faithfully reflect perceptual quality in scenes with large-scale motion, occlusion, or non-rigid deformation. In contrast, LPIPS concentrates on high-level semantic and textural perceptual consistency, offering a more effective assessment of the visual realism and naturalness of reconstructed videos. Consequently, we adopt LPIPS as a key metric for perceptual quality and focus on analyzing the performance of different methods under this indicator.
3.4. Quantitative Results
Details of the compared models are shown in
Table 3, from which it can be seen that DCST-Net has the smallest parameter scale among the compared methods.
All methods were trained on the same dataset and evaluated under identical conditions. Comparative results of different methods on the test set are shown in Table 4, Table 5 and Table 6; red and blue indicate the best and second-best results in Table 6. It should be noted that the best and second-best rankings exclude the without-fine-tuning comparison. The PSNR, SSIM, and LPIPS values of DCST-Net's reconstruction results on the five test sequences are significantly better than those of EDVR [7] and DUF [24]. From the perspective of perceptual quality, DCST-Net achieves PSNR and SSIM values comparable to BasicVSR [9], IART [25], and ST-AVSR [26], but with lower LPIPS values, indicating superior perceptual similarity while maintaining reconstruction accuracy. From Table 3, Table 4, Table 5 and Table 6, it can be seen that DCST-Net has a parameter count comparable to DUF yet significantly outperforms it in all metrics. Compared to ST-AVSR, the parameter count is only about one-twentieth, yet the reconstructed PSNR and SSIM values are comparable and the LPIPS is lower, demonstrating the superior performance of DCST-Net. It is worth pointing out that the proposed method demonstrates stronger stability in scenes characterized by complex motion patterns or pronounced occlusion, whereas its advantage is less pronounced relative to reconstruction approaches centered on long-term feature propagation when displacement variation is small or the texture is highly repetitive.
Although the proposed method does not consistently achieve the highest PSNR or SSIM scores across all test sequences, it delivers more consistent perceptual quality and exhibits superior robustness against motion-alignment uncertainty and occlusion. For sequences with gentle motion and negligible deformation, recurrent propagation-based approaches such as BasicVSR may retain an advantage in pixel-level fidelity. This indicates the complementarity of different motion modeling strategies in video reconstruction tasks and further clarifies the applicable scope of the present method.
To further investigate the generalization capability of the model, Table 4, Table 5 and Table 6 present the test results obtained by directly deploying the Vimeo-90K pre-trained weights on the infrared dataset without any infrared fine-tuning. The experiments demonstrate that the proposed method still maintains a perceptual advantage in terms of LPIPS, yet suffers from noticeable degradation in PSNR and SSIM, further corroborating the necessity of domain-adaptive fine-tuning for infrared images.
The comparative result curves of all methods on the test set are illustrated in Figure 3. DCST-Net delivers PSNR and SSIM values comparable to BasicVSR, IART, and ST-AVSR across all test sequences, while markedly outperforming EDVR and DUF, and its LPIPS scores are lower (better) than those of every compared method.
We additionally quantify the performance distribution across the test sequences and report the corresponding means and standard deviations in Table 7. The results reveal that the proposed method exhibits only small performance variation across the different infrared video sequences, corroborating its cross-sequence consistency and stability.
Beyond the aforementioned comparisons,
Table 8 further reports the quantitative results obtained on the public test benchmarks to ensure an impartial evaluation.
As evidenced in
Table 8, DCST-Net achieves the lowest LPIPS on both public benchmarks, corroborating its superiority in perceptual quality enhancement. On REDS4, although PSNR and SSIM do not attain the best or second-best ranks, the performance gap is marginal. On Vid4, DCST-Net obtains the second-best PSNR. This gain is attributed to the fact that Vid4 contains more intense motion and richer textural details, thereby fully exercising the motion modeling and texture restoration capacities of the proposed network.
3.5. Qualitative Results
Qualitative comparison results of different methods on the test set are shown in
Figure 4.
The reconstruction results for large-scale targets (ten-thousand-pixel level) are shown in Figure 4a. EDVR, DUF, and BasicVSR perform poorly in restoring textured regions, with significant loss of detail. IART and ST-AVSR reconstruct some details but still differ considerably from the ground truth. In contrast, DCST-Net performs better in detail restoration, accurately recovering high-frequency textures such as the white stripes on the aircraft, demonstrating stronger detail modeling capability. The reconstruction results for small-scale targets (hundred-pixel level) are shown in Figure 4b. For small-scale targets with little texture, EDVR produces artifacts that blur the contours, whereas DCST-Net reconstructs clear contour boundaries in this scenario, presenting superior visual quality.
We further present qualitative comparisons on the REDS4 and Vid4 datasets.
As illustrated in Figure 5, DCST-Net delivers visually faithful results in repetitive-texture regions. Regular structures such as window grids are reconstructed without observable distortion or artifacts. Nevertheless, on this particular sample the visual gap between the proposed method and several state-of-the-art competitors remains relatively limited, implying that in scenes dominated by regular textures and mild motion, DCST-Net and the compared approaches converge to a comparable performance level.
As illustrated in Figure 6, the city sequence is dedicated to evaluating the capability of algorithms to model large-scale repetitive structures and high-frequency textures. Methods with inferior reconstruction quality tend to produce aliasing between low- and high-frequency components, such as conspicuous stripe artifacts. In contrast, DCST-Net preserves the structural consistency of repetitive patterns: the window grids and edges remain sharp and regularly arranged, without any observable stripe or structural distortion, underscoring its superiority in high-frequency texture restoration.
3.6. Ablation Studies
To verify the effectiveness of the Dual Motion Compensation Module and the Multi-scale Structure–Texture Prior Module, ablation experiments were conducted on each module without changing the training and test sets. The experimental results are shown in Table 9, Table 10 and Table 11, and red and blue indicate the best and second-best results.
w/o Motion denotes the model equipped only with direct warping motion compensation and the texture prior. w/o Texture denotes the model equipped only with dual motion compensation. From Table 9, Table 10 and Table 11, it can be seen that the w/o Motion model achieves LPIPS values on the test set significantly better than the w/o Texture model and close to the full model’s performance, indicating that the multi-scale structure–texture prior plays a significant role in enhancing perceptual quality. The w/o Texture model performs better than w/o Motion in PSNR and SSIM, indicating that the dual motion compensation module improves the accuracy of inter-frame alignment, thereby enhancing overall reconstruction performance.
The result curves of ablation studies on the test set are shown in
Figure 7. It can be seen that removing either DMCM or MSTPM consistently degrades model performance, corroborating the effectiveness of both the Dual Motion Compensation Module and the Multi-scale Structure–Texture Prior Module.
The ablation study results for large-scale targets (ten-thousand-pixel level) are shown in Figure 8a. The reconstruction result of the w/o Motion model exhibits obvious artifacts and blurred wing contours, while the w/o Texture model alleviates the artifacts to some extent but loses high-frequency information.
The ablation study results for small-scale targets (hundred-pixel level) are shown in Figure 8b. The reconstruction result of the w/o Motion model shows blurred edges along the aircraft body contour, while the w/o Texture model yields a clearer contour but suffers from significant detail loss, for example failing to reconstruct the stripes and highlight regions on the aircraft body.
Overall, removing either the Dual Motion Compensation Module or the Multi-scale Structure–Texture Prior Module leads to a performance drop, validating the effectiveness of DCST-Net.
To justify the adoption of VGG16 in MSTPM, we conduct additional ablation studies on the feature extractor. Concretely, while keeping the overall architecture, training protocol, and loss functions intact, we replace the VGG16 backbone with ResNet-50 and ConvNeXt-Tiny, respectively, and retrain under the same dataset and evaluation protocol. The quantitative comparisons are summarized in Table 12, Table 13 and Table 14.
The experimental results reveal that, although PSNR and SSIM exhibit marginal variations across different backbones, the VGG16-equipped MSTPM achieves the lowest LPIPS, indicating a clear advantage in encoding perceptually relevant features and maintaining textural consistency. Further inspection shows that the intermediate activations of VGG16 furnish more stable multi-scale texture representations and local structure alignment, effectively alleviating texture mismatch caused by occlusion and complex displacement during motion compensation. This characteristic aligns precisely with our core objective of boosting perceptual quality. Consequently, considering overall performance, stability, and perceptual reconstruction fidelity, VGG16 is selected as the feature extractor in MSTPM.