1. Introduction
With the rapid development of digital imaging and information processing, video has become a primary vehicle for information acquisition and dissemination. Nevertheless, constrained by sensor performance, imaging distance, and transmission bandwidth, particularly in infrared scenarios including short-wave infrared (SWIR), captured videos inevitably suffer from degradations such as insufficient resolution, blurred details, and severe noise. Computational imaging techniques are therefore urgently required to enhance visual quality. Super-resolution reconstruction, which recovers high-resolution structural details from low-resolution observations, has demonstrated critical value in medical imaging [1,2], remote-sensing surveillance [3,4], and infrared object perception [5,6].
Compared to Single Image Super-Resolution (SISR), Video Super-Resolution (VSR) faces the more complex challenge of aggregating information from multiple unaligned yet highly correlated frames. Wang et al. proposed multi-scale deformable alignment and attention mechanisms to achieve high-precision inter-frame alignment and fusion, significantly improving reconstruction performance [7]. However, this method does not incorporate an explicit motion estimation mechanism, which limits its alignment accuracy in scenes with intense motion. Haris et al. progressively modeled inter-frame differences through multiple residual projections, enhancing the utilization of temporal information [8], but the sequential projection process may accumulate errors that affect the final reconstruction. Chan et al. proposed BasicVSR, which divides the VSR task into four modules (propagation, alignment, fusion, and upsampling), simplifying model design while improving reconstruction quality [9]. Nevertheless, this method still suffers from significant alignment errors when handling large-scale or non-rigid motion, and it is particularly prone to artifacts or structural distortions in complex backgrounds or occluded regions.
It should be emphasized that infrared videos differ markedly from their visible-light counterparts in imaging mechanism, texture representation, and noise characteristics; directly transferring VSR pipelines designed for RGB scenes therefore frequently fails to simultaneously guarantee motion-alignment stability and high-quality detail recovery. Consequently, how to enhance motion modeling robustness under complex motion and occlusion while effectively reconstructing the limited yet critical structural and textural cues present in infrared videos remains a pivotal and unresolved challenge in current video super-resolution research.
To address these issues in infrared videos, namely blurred small-target edges, sparse texture, and severe occlusion, this paper proposes a video super-resolution network combining Dual Motion Compensation and a Multi-scale Structure–Texture Prior (DCST-Net). The method improves reconstruction quality and structural fidelity from two key perspectives: motion alignment and high-frequency detail modeling.
Specifically, to enhance the modeling of moving objects, this paper designs a Dual Motion Compensation Module (DMCM) that performs direct warping and progressive warping compensation strategies in parallel. This design combines the sensitivity of direct warping to fine local motion with the stability of progressive warping under large-scale displacement, thereby mitigating the alignment errors that easily arise in scenes with intense motion. An occlusion mask further suppresses invalid regions during motion compensation, enhancing the stability and robustness of alignment. To avoid ambiguity, we explicitly distinguish the proposed method from existing bidirectional propagation or recurrent alignment frameworks. Although these approaches also perform multi-frame alignment, their core idea is to propagate features through recurrent structures, where alignment merely facilitates temporal propagation. In contrast, our method abandons feature propagation. Instead, for each input frame, we explicitly build two complementary motion-compensated results: single-flow direct warping and step-by-step progressive warping. The former performs one-shot global alignment, while the latter refines alignment locally and progressively, forming dual motion hypotheses that supply more robust cues for subsequent fusion. We stress that the progressive warping is not used for recurrent feature propagation; rather, it serves as a complementary alignment to the direct warping, operates in parallel with it, and jointly participates in the subsequent fusion stage. Thanks to this explicit dual-hypothesis modeling, the proposed network delivers more stable texture alignment under occlusion and complex displacement, yielding consistently lower LPIPS and higher perceptual quality.
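To make the dual-hypothesis alignment concrete, the following PyTorch-style sketch illustrates one way the idea could be realized; the helper names (flow_warp, fuse) and the way the two warped results are masked and concatenated are illustrative assumptions, not the exact implementation of DMCM.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Backward-warp a feature map (B, C, H, W) with optical flow (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow                    # displaced sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0          # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def dual_motion_compensation(neighbors, flows_direct, flows_step, occ_masks, fuse):
    """
    neighbors:    features of frames t-k, ..., t-1, ordered toward the reference frame t
    flows_direct: flow from each neighbor directly to frame t (single-flow hypothesis)
    flows_step:   flow from each frame to the next one (progressive hypothesis)
    occ_masks:    occlusion masks in [0, 1]; 0 suppresses pixels invalid after warping
    fuse:         a small conv block merging the two hypotheses (an assumed component)
    """
    aligned, prog = [], None
    for feat, f_dir, f_stp, occ in zip(neighbors, flows_direct, flows_step, occ_masks):
        direct = flow_warp(feat, f_dir)                          # one-shot global alignment
        prog = flow_warp(feat if prog is None else prog, f_stp)  # step-by-step refinement
        # both hypotheses are masked and fused in parallel; no recurrent feature propagation
        aligned.append(fuse(torch.cat((direct * occ, prog * occ), dim=1)))
    return aligned
```

The key point of the sketch is that the direct and progressive warps are computed in parallel and only meet at the fusion step, mirroring the parallel dual-hypothesis design described above.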
In terms of detail restoration, this paper further proposes a Multi-scale Structure–Texture Prior Module (MSTPM). It introduces a pre-trained network to extract multi-scale feature information, guiding the model to perceive structural and textural details at different levels. This module effectively compensates for high-frequency details by upsampling, concatenating, and compressively fusing multi-scale feature maps, thereby alleviating structural blurring and textural degradation caused by alignment errors during feature fusion. DCST-Net demonstrates superior reconstruction performance on multiple datasets, effectively restoring object contours while recovering rich textural details, with overall visual quality and geometric consistency surpassing existing methods.
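As a rough illustration of this prior, the sketch below extracts multi-scale features from a frozen, pre-trained VGG16 (the backbone adopted in Section 3.6), upsamples them to a common resolution, concatenates them, and compresses the result with a 1×1 convolution. The selected layers, output channel count, and the single compression convolution are assumptions for illustration rather than the exact MSTPM configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class StructureTexturePrior(nn.Module):
    """Multi-scale structure-texture prior from frozen VGG16 features (illustrative)."""
    def __init__(self, out_channels=64, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features
        self.slices, prev = nn.ModuleList(), 0
        for idx in layer_ids:                          # relu1_2, relu2_2, relu3_3, relu4_3
            self.slices.append(nn.Sequential(*list(features.children())[prev:idx + 1]))
            prev = idx + 1
        for p in self.slices.parameters():             # the prior network stays frozen
            p.requires_grad_(False)
        self.compress = nn.Conv2d(64 + 128 + 256 + 512, out_channels, kernel_size=1)

    def forward(self, frame):
        feats, x = [], frame                           # frame: (B, 3, H, W)
        for s in self.slices:
            x = s(x)
            # upsample every scale back to the input resolution before concatenation
            feats.append(F.interpolate(x, size=frame.shape[-2:], mode="bilinear",
                                       align_corners=False))
        return self.compress(torch.cat(feats, dim=1))  # compressive fusion
```

For single-channel infrared frames, the input would need to be replicated to three channels before being passed to the VGG16 backbone.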
3. Experimental Results and Analysis
To validate the effectiveness of the proposed method, we conducted qualitative and quantitative comparisons against a range of baseline methods, and verified the contribution of each module through ablation studies and visual analysis.
3.1. Datasets
Since available infrared video datasets are small and lack variety, we pre-train on RGB data and fine-tune on infrared clips. The public Vimeo-90K dataset [17] was used to pre-train the network, and a short-wave infrared dataset (SWIRD) was used for fine-tuning and testing. Note that we pre-train on Vimeo-90K not to learn RGB appearance, but to let the network fully learn complex motion and temporal alignment: optical flow estimation and motion compensation depend on temporal structure and displacement patterns rather than on the spectral band. Large-scale RGB clips teach the model to handle large displacements and non-rigid motion, and the subsequent infrared fine-tuning adapts it to the radiometric signal and noise characteristics, enabling effective infrared super-resolution. In addition, to guarantee an impartial comparison, we evaluate on the public benchmarks REDS4 [18] and Vid4 [19] to verify the superiority of the proposed method. Detailed information about the datasets is shown in Table 1.
The Vimeo-90K dataset contains 91,701 video clips covering various scenes and motion types. Each clip contains seven consecutive frames at a resolution of 448 × 256, making it well suited for video denoising and super-resolution tasks. REDS4 is a subset of the full REDS dataset, comprising four 100-frame sequences captured in diverse scenarios at a resolution of 1280 × 720. Vid4, a classical benchmark for video enhancement tasks, consists of four clips of 40 to 50 frames (calendar, city, foliage, and walk) with spatial resolutions ranging from 480 × 720 to 576 × 720.
The short-wave infrared dataset contains six video sequences at a resolution of 640 × 512, covering various scenes and aircraft targets in different poses. In our experimental setup, we adopt a staged training scheme. First, the model is pre-trained on the Vimeo-90K dataset to acquire generic spatio-temporal reconstruction and motion modeling capacity. Next, the longest sequence in SWIRD is divided into 987 short subsequences of seven frames each, which are used for fine-tuning so that the network can adapt to infrared imaging characteristics. Each of the remaining five long sequences, labeled IR1–IR5, contains 200 frames and is reserved exclusively for testing; these sequences never participate in training or fine-tuning, ensuring an impartial and objective evaluation.
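For clarity, a minimal sketch of the seven-frame subsequence extraction is given below; the sliding-window stride of 1 is an assumption, since only the subsequence count and length are stated.

```python
def make_subsequences(frame_paths, length=7, stride=1):
    """Return overlapping windows of `length` consecutive frame paths (sliding window)."""
    return [frame_paths[i:i + length]
            for i in range(0, len(frame_paths) - length + 1, stride)]
```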
During the data preprocessing stage, all video frames were downsampled by a factor of 4 using bicubic interpolation to generate low-resolution input data. For an original high-resolution frame $I^{HR}$, the corresponding low-resolution frame $I^{LR}$ can be expressed as
$$I^{LR} = \mathcal{D}_{\downarrow s}\left(I^{HR}\right),$$
where $\mathcal{D}_{\downarrow s}(\cdot)$ represents the bicubic interpolation downsampling operation and $s = 4$ represents the downsampling factor. The low-resolution data generated in this way effectively simulates the degradation process in real-world scenarios, providing high-quality input–output pairs for the video super-resolution task.
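A minimal sketch of this ×4 bicubic degradation is shown below; generating the low-resolution frames with torch.nn.functional.interpolate is an assumption about tooling rather than the authors' exact preprocessing code.

```python
import torch
import torch.nn.functional as F

def degrade(hr_frames, scale=4):
    """Bicubically downsample a stack of HR frames (T, C, H, W) by `scale`."""
    return F.interpolate(hr_frames, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
```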
3.2. Implementation Details
The hardware and software configurations used in the experiments are shown in
Table 2.
The following optimization strategies and parameter settings were adopted for the training process. Data was read using a sliding window approach, with seven consecutive video frames input each time as a training sample. To enhance the model’s local feature extraction capability, each frame was cropped into local patches of size
, which were used as the network’s input. To keep results reproducible, all experiments use a fixed random seed: before training, we set the Python, NumPy, and PyTorch seeds to 42 and enabled the deterministic computing mode. The ADAM optimizer [
20] was used for model training, with momentum parameters set to
and
. The batch size was set to 1, the initial learning rate was set to
, and a cosine annealing strategy was employed to adjust the learning rate dynamically; namely, the learning rate decays every 5 epochs down to
, enabling finer parameter optimization in the later stages of training. To constrain the training process and enhance reconstruction quality, a composite loss function combining Charbonnier loss and perceptual loss was used. Charbonnier loss [21] was used to minimize the difference between the reconstructed frame and the real high-resolution frame. Perceptual loss [22] supervised the retention of high-frequency detail information, with its weight set to 0.1. During training, a pre-trained SpyNet model was used to initialize the optical flow estimation module, while the other modules were trained from scratch. To keep training stable and prevent early optical-flow estimation errors from degrading the reconstruction network, we freeze SpyNet for the first two epochs; once the model reaches a stable state, we unfreeze it and train end-to-end with the remaining modules. This strategy keeps alignment accurate and improves overall quality. The model was trained for a total of 30 epochs. Under these settings, the model converged quickly and achieved its best performance.
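The composite objective can be sketched as follows; the Charbonnier epsilon and the VGG16 layer used for the perceptual term are assumptions, as they are not specified above.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CharbonnierLoss(nn.Module):
    """Smooth L1-like loss: mean of sqrt((x - y)^2 + eps)."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, sr, hr):
        return torch.mean(torch.sqrt((sr - hr) ** 2 + self.eps))

class CompositeLoss(nn.Module):
    """Charbonnier reconstruction loss plus a 0.1-weighted VGG perceptual loss."""
    def __init__(self, perceptual_weight=0.1):
        super().__init__()
        self.charbonnier = CharbonnierLoss()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # up to relu3_3
        for p in self.vgg.parameters():                                  # frozen feature net
            p.requires_grad_(False)
        self.w = perceptual_weight

    def forward(self, sr, hr):
        # sr, hr: (B, 3, H, W); single-channel infrared frames would be replicated to 3 channels
        perceptual = torch.mean((self.vgg(sr) - self.vgg(hr)) ** 2)
        return self.charbonnier(sr, hr) + self.w * perceptual
```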
3.3. Evaluation Metrics
Network performance was primarily evaluated using three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [23]. For each image sequence, the PSNR, SSIM, and LPIPS values were computed frame by frame, and the per-frame values were then averaged to obtain the evaluation metrics for the reconstruction result of that sequence.
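The per-sequence evaluation protocol described above can be sketched as follows; the specific packages (scikit-image for PSNR/SSIM and the lpips package with an AlexNet backbone) are assumptions about tooling.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric with an AlexNet backbone

def evaluate_sequence(sr_frames, hr_frames):
    """sr_frames, hr_frames: lists of HxWx3 uint8 arrays for one sequence."""
    psnrs, ssims, lpips_vals = [], [], []
    for sr, hr in zip(sr_frames, hr_frames):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=255))
        ssims.append(structural_similarity(hr, sr, channel_axis=-1, data_range=255))
        # LPIPS expects NCHW float tensors scaled to [-1, 1]
        to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
        lpips_vals.append(lpips_fn(to_t(sr), to_t(hr)).item())
    # sequence-level metrics are the averages over all frames
    return np.mean(psnrs), np.mean(ssims), np.mean(lpips_vals)
```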
It should be emphasized that PSNR and SSIM primarily quantify pixel-level reconstruction error and often fail to faithfully reflect perceptual quality in scenes with large-scale motion, occlusion, or non-rigid deformation. In contrast, LPIPS concentrates on high-level semantic and textural perceptual consistency, offering a more effective assessment of the visual realism and naturalness of reconstructed videos. Consequently, we adopt LPIPS as a key metric for perceptual quality and focus on analyzing the performance of different methods under this indicator.
3.4. Quantitative Results
Details of the compared models are shown in
Table 3, from which it can be seen that DCST-Net has the smallest parameter scale among the compared methods.
All methods were trained on the same dataset and evaluated under identical conditions. Comparative results of different methods on the test set are shown in Table 4, Table 5 and Table 6; red and blue indicate the best and second-best results in Table 6. It should be noted that the best and second-best rankings exclude the without-fine-tuning comparison. The PSNR, SSIM, and LPIPS values of DCST-Net's reconstruction results on the five test sequences are significantly better than those of EDVR [7] and DUF [24]. From the perspective of perceptual quality, DCST-Net achieves PSNR and SSIM values comparable to BasicVSR [9], IART [25], and ST-AVSR [26], but with lower LPIPS values, indicating superior perceptual similarity while maintaining reconstruction accuracy. From Table 3, Table 4, Table 5 and Table 6, it can be seen that DCST-Net has a parameter count comparable to DUF yet significantly outperforms it in all metrics. Compared to ST-AVSR, the parameter count is only about one-twentieth, yet the reconstructed PSNR and SSIM values are comparable and the LPIPS is lower, demonstrating the superior performance of DCST-Net. It is worth pointing out that the proposed method demonstrates stronger stability in scenes characterized by complex motion patterns or pronounced occlusion, whereas its advantage is less pronounced relative to reconstruction approaches centered on long-term feature propagation when displacement variation is small or the texture is highly repetitive.
Although the proposed method does not consistently achieve the highest PSNR or SSIM scores across all test sequences, it delivers more consistent perceptual quality and exhibits superior robustness against motion-alignment uncertainty and occlusion. For sequences with gentle motion and negligible deformation, recurrent propagation-based approaches such as BasicVSR may retain an advantage in pixel-level fidelity. This indicates the complementarity of different motion modeling strategies in video reconstruction tasks and further clarifies the applicable scope of the present method.
To further investigate the generalization capability of the model, Table 4, Table 5 and Table 6 present the test results obtained by directly deploying the Vimeo-90K pre-trained weights on the infrared dataset without any infrared fine-tuning. The experiments demonstrate that the proposed method still maintains a perceptual advantage in terms of LPIPS, yet suffers from noticeable degradation in PSNR and SSIM, further corroborating the necessity of domain-adaptive fine-tuning for infrared images.
The comparative result curves of all methods on the test set are illustrated in Figure 3. DCST-Net delivers PSNR and SSIM values comparable to BasicVSR, IART, and ST-AVSR across all test sequences, while markedly outperforming EDVR and DUF, and its LPIPS scores are lower (better) than those of every compared method.
We additionally quantify the performance distribution across the test sequences and report the corresponding means and standard deviations in Table 7. The results reveal that the proposed method exhibits only small performance variation across the different infrared video sequences, corroborating its cross-sequence consistency and stability.
Beyond the aforementioned comparisons,
Table 8 further reports the quantitative results obtained on the public test benchmarks to ensure an impartial evaluation.
As evidenced in
Table 8, DCST-Net achieves the lowest LPIPS on both public benchmarks, corroborating its superiority in perceptual quality enhancement. On REDS4, although PSNR and SSIM do not attain the best or second-best ranks, the performance gap is marginal. On Vid4, DCST-Net obtains the second-best PSNR. This gain is attributed to the fact that Vid4 contains more intense motion and richer textural details, thereby fully exercising the motion modeling and texture restoration capacities of the proposed network.
3.5. Qualitative Results
Qualitative comparison results of different methods on the test set are shown in
Figure 4.
The reconstruction results for large-scale targets (ten-thousand-pixel level) are shown in Figure 4a. EDVR, DUF, and BasicVSR perform poorly in restoring textured regions, with significant loss of detail. IART and ST-AVSR reconstruct some details but still differ considerably from the ground truth. In contrast, DCST-Net performs better in detail restoration, accurately recovering high-frequency textures such as the white stripes on the aircraft, demonstrating stronger detail modeling capability. The reconstruction results for small-scale targets (hundred-pixel level) are shown in Figure 4b. For small-scale targets with little texture, EDVR produces artifacts that blur the contours, whereas DCST-Net reconstructs clear contour boundaries in this scenario, presenting superior visual quality.
We further present qualitative comparisons on the REDS4 and Vid4 datasets.
As illustrated in Figure 5, DCST-Net delivers visually faithful results in repetitive-texture regions. Regular structures such as window grids are reconstructed without observable distortion or artifacts. Nevertheless, on this particular sample the visual gap between the proposed method and several state-of-the-art competitors remains relatively limited, implying that in scenes dominated by regular textures and mild motion, DCST-Net and the compared approaches converge to a comparable performance level.
As illustrated in Figure 6, the city sequence is dedicated to evaluating the capability of algorithms to model large-scale repetitive structures and high-frequency textures. Methods with inferior reconstruction quality tend to produce aliasing between low- and high-frequency components, such as conspicuous stripe artifacts. In contrast, DCST-Net preserves the structural consistency of repetitive patterns: the window grids and edges remain sharp and regularly arranged, without any observable stripe or structural distortion, underscoring its superiority in high-frequency texture restoration.
3.6. Ablation Studies
To verify the effectiveness of the Dual Motion Compensation Module and the Multi-scale Structure–Texture Prior Module, ablation experiments were conducted on each module without changing the training and test sets. The experimental results are shown in Table 9, Table 10 and Table 11, and red and blue indicate the best and second-best results.
w/o Motion denotes the model equipped only with direct warping motion compensation and the texture prior. w/o Texture denotes the model equipped only with dual motion compensation. From Table 9, Table 10 and Table 11, it can be seen that the w/o Motion model achieves LPIPS values on the test set significantly better than the w/o Texture model and close to the full model’s performance, indicating that the multi-scale structure–texture prior plays a significant role in enhancing perceptual quality. The w/o Texture model performs better than w/o Motion in PSNR and SSIM, indicating that the dual motion compensation module improves the accuracy of inter-frame alignment, thereby enhancing overall reconstruction performance.
The result curves of ablation studies on the test set are shown in
Figure 7. It can be seen that removing either DMCM or MSTPM consistently degrades model performance, corroborating the effectiveness of both the Dual Motion Compensation Module and the Multi-scale Structure–Texture Prior Module.
The ablation study results for large-scale targets (ten-thousand-pixel level) are shown in Figure 8a. The reconstruction result of the w/o Motion model exhibits obvious artifacts and blurred wing contours, while the w/o Texture model alleviates the artifacts to some extent but loses high-frequency information.
The ablation study results for small-scale targets (hundred-pixel level) are shown in Figure 8b. The reconstruction result of the w/o Motion model shows blurred edges along the aircraft body contour, while the w/o Texture model yields a clearer contour but suffers from significant detail loss, for example failing to reconstruct the stripes and highlight regions on the aircraft body.
Overall, removing either the Dual Motion Compensation Module or the Multi-scale Structure–Texture Prior Module leads to a performance drop, validating the effectiveness of DCST-Net.
To justify the adoption of VGG16 in MSTPM, we conduct additional ablation studies on the feature extractor. Concretely, while keeping the overall architecture, training protocol, and loss functions intact, we replace the VGG16 backbone with ResNet-50 and ConvNeXt-Tiny, respectively, and retrain under the same dataset and evaluation protocol. The quantitative comparisons are summarized in Table 12, Table 13 and Table 14.
The experimental results reveal that, although PSNR and SSIM exhibit marginal variations across different backbones, the VGG16-equipped MSTPM achieves the lowest LPIPS, indicating a clear advantage in encoding perceptually relevant features and maintaining textural consistency. Further inspection shows that the intermediate activations of VGG16 furnish more stable multi-scale texture representations and local structure alignment, effectively alleviating texture mismatch caused by occlusion and complex displacement during motion compensation. This characteristic aligns precisely with our core objective of boosting perceptual quality. Consequently, considering overall performance, stability, and perceptual reconstruction fidelity, VGG16 is selected as the feature extractor in MSTPM.