Frequency-Aware Enhancement Network for Satellite Video Super-Resolution

Lang, Xiujuan; Zhang, Jin; Lu, Tao; Yao, Yuan; Wang, Yu; Wang, Liwei

doi:10.3390/rs17243994

Open AccessArticle

Frequency-Aware Enhancement Network for Satellite Video Super-Resolution

by

Xiujuan Lang

^1,2,

Jin Zhang

^1,*,

Tao Lu

¹

,

Yuan Yao

¹,

Yu Wang

³

and

Liwei Wang

¹

Hubei Provincial Key Laboratory of Intelligent Robot, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China

²

School of Computer Science and Information Technology, Daqing Normal University, Daqing 163111, China

³

State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(24), 3994; https://doi.org/10.3390/rs17243994

Submission received: 17 October 2025 / Revised: 4 December 2025 / Accepted: 8 December 2025 / Published: 11 December 2025

(This article belongs to the Special Issue Advances in Methods and Techniques for Satellite Image Processing and Analysis)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A novel Frequency-Aware Enhancement Network (FAENet) is proposed, which incorporates a frequency-domain distribution alignment function to compensate for alignment in satellite videos.
A Frequency Prompt Enhancement Block (FPEB) is designed to utilize learnable visual prompts, effectively distinguishing genuine object edges from high-frequency artifacts and achieving state-of-the-art performance on two satellite video datasets.

What are the implications of the main findings?

The proposed frequency alignment strategy provides a robust, plug-and-play solution for aligning tiny, slow-moving objects in a satellite scene, alleviating the limitations of traditional spatial alignment methods like optical flow.
By improving edge fidelity, this approach enhances the quality of reconstructed satellite videos, thereby facilitating more accurate downstream tasks such as object segmentation and tracking.

Abstract

The lower quality of frames in satellite videos compared to natural videos poses significant challenges in capturing detailed information for alignment and fusion in the image space. In this paper, we introduce a novel frequency-aware enhancement network (FAENet) for satellite video super-resolution (SVSR), which tackles these challenges from a frequency-domain perspective. By leveraging frequency components, FAENet amplifies the distinctions between frames and between objects, thereby improving alignment and reconstruction accuracy. Firstly, the proposed Frequency Alignment Compensation Mechanism (FACM) incorporates a frequency-domain distribution alignment function to enable effective alignment compensation. This mechanism can be seamlessly integrated into existing alignment methods designed for natural video, thereby enhancing their applicability to SVSR tasks. Secondly, we introduce the Frequency Prompt Enhancement Block (FPEB), which facilitates edge reconstruction by leveraging frequency-domain prompts to distinguish objects from artifacts, thereby improving the clarity and accuracy of reconstructed edges. The proposed FAENet achieves 35.33 dB PSNR on the Jilin-189 dataset and 40.57 dB on the SAT-MTB-VSR dataset, outperforming other state-of-the-art compared methods and demonstrating its effectiveness and robustness in addressing the unique challenges of SVSR.

Keywords:

satellite video; super-resolution; recurrent neural network; frequency

1. Introduction

In recent years, the remote sensing field has developed rapidly [1], with video satellites emerging as a novel type of Earth observation satellite. These satellites are capable of persistent regional staring imaging, acquiring continuous video data of designated areas. This regional staring imaging capability allows the satellite to maintain observation over a specific region for an extended duration. As a result, they are particularly well-suited for monitoring regional dynamic changes [2], including situational shifts, dynamic target reconnaissance and surveillance [3], and assessing attack impacts. However, due to constraints in the data transmission process, satellite videos are often compressed and downsampled, leading to unintended loss of high-frequency information and limiting their utility across various applications. To address this issue, developing a practical approach, such as super-resolution (SR) techniques, is crucial for enhancing the spatial resolution of satellite videos.

SR, a fundamental low-level task in computer vision, aims to reconstruct high-resolution (HR) data from a single low-resolution (LR) observation [4,5,6]. As an ill-posed problem, SR primarily focuses on enhancing spatial information, offering a practical solution to mitigate hardware constraints. Video super-resolution (VSR) extends image SR by reconstructing HR videos from LR video sequences, aiming to enhance perceptual quality. Unlike image SR, VSR is inherently more complex, as it requires simultaneous modeling of spatial and temporal information across misaligned video frames. Various approaches have been developed for this task. Early methods, inspired by image SR, often employed a sliding-window approach [7,8,9,10], reconstructing a target frame using 3–7 adjacent frames. However, because video sequences typically consist of dozens of frames, this approach captures only local temporal dependencies, leading to suboptimal performance. To model global temporal dependencies, recurrent neural networks (RNNs) have gained increasing attention in VSR. Among these, bidirectional propagation neural networks have become a widely adopted paradigm. Notably, BasicVSR [11] and BasicVSR++ [12] have achieved remarkable performance by decomposing the VSR pipeline into four key components: propagation, alignment, aggregation (fusion), and upsampling.

Despite significant advancements in VSR methods, these approaches are not directly applicable to satellite videos. Compared to natural videos, satellite videos, characterized by diverse scenarios and small-scale targets, pose significantly more complex challenges, rendering the satellite video super-resolution (SVSR) task particularly demanding [13]. First, satellite videos typically lack the rich texture and detail of natural videos, making feature extraction more challenging. Second, moving objects in satellite videos, which often occupy only a few pixels and exhibit subtle motion changes, pose challenges for capturing fine-grained motion information. These factors collectively make it challenging for mainstream alignment methods designed for natural videos to achieve precise alignment in SVSR. Specifically, mainstream alignment techniques are broadly categorized into explicit alignment (based on optical flow [14]) and implicit alignment (based on deformable convolution network (DCN) [15]). However, optical flow-based alignment relies on gradient information from frames, which is often insufficient for small and subtle motion moving objects in satellite video scenes. Meanwhile, it is sensitive to multi-scale small objects and complex boundary variations [9]. Similarly, accurately estimating the subtle motion offsets of DCN for alignment becomes exceptionally challenging in such scenes, resulting in suboptimal feature alignment. Meanwhile, DCN-based alignment methods have been widely criticized for their unstable training dynamics, which often lead to inconsistent performance when applied in complex and highly varied scenarios. Consequently, an alignment compensation strategy that accounts for satellite video characteristics is essential to improve alignment accuracy in SVSR. On the other hand, satellite videos often feature numerous densely packed small-scale building structures, which readily generate artifacts. Meanwhile, the small size of objects makes it challenging to distinguish objects from artifacts, which is crucial for downstream tasks such as object detection and tracking. Therefore, enhancing edge reconstruction is crucial to improve the performance in SVSR.

In this paper, we address the aforementioned challenges from a frequency-domain perspective, leveraging the global representation capabilities of frequency-based models [16]. To this end, we propose a novel Frequency-Aware Enhancement Network (FAENet) for SVSR, that focuses on alignment compensation and edge reconstruction in the frequency domain. Specifically, the proposed Frequency Alignment Compensation Mechanism (FACM) incorporates a frequency-domain distribution alignment function to achieve precise alignment compensation. In contrast to methods that perform alignment directly in the frequency domain [17], the FACM is designed to compensate and enhance the output of existing alignment methods, thereby enabling them to effectively adapt to the unique motion characteristics of satellite video scenes. Notably, FACM is designed for seamless integration into existing VSR frameworks, eliminating the need for fine-tuning of the architecture, which significantly facilitates the adoption of VSR techniques for challenging satellite video applications. Additionally, the proposed Frequency Prompt Enhancement Block (FPEB) combines prompts with frequency-domain components not only to extract features [16], but also to distinguish objects from artifacts, thereby enhancing edge reconstruction accurately. Extensive experiments demonstrate that FAENet significantly outperforms existing methods across two SVSR datasets. The main contributions of this paper are summarized as follows:

The proposed FACM employs frequency-domain alignment compensation for fine-grained frame alignment, enhancing the model’s ability to capture subtle motion. Additionally, it integrates seamlessly as a plug-and-play module into RNN-based natural VSR methods, enabling effective adaptation to satellite videos.
The proposed FPEB incorporates learnable prompts directly into the frequency domain. This design enhances the model’s capacity to differentiate genuine objects from high-frequency artifacts. As a result, it facilitates adaptive edge reconstruction in the spatial image domain.
Comprehensive evaluations on two satellite video datasets show that our method surpasses existing approaches in both quantitative and qualitative metrics. These results underscore the robustness and efficacy of the proposed framework in handling complex dynamic scenes, establishing a new benchmark for satellite video analysis.

The remainder of this paper is organized as follows. Section 2 reviews related work on VSR and SVSR. Section 3 provides a detailed description of our proposed method. Section 4 presents the datasets, training details, experimental results, and analysis. Finally, Section 6 summarizes the conclusions.

2. Related Work

2.1. Natural Video SR

VSR builds upon single-image SR by jointly exploiting spatial and temporal information processing that poses a central challenge to the task. Generally, existing VSR methods can be categorized into two primary paradigms: sliding-window methods and RNN-based methods.

Sliding-window methods aggregate information from multiple neighboring LR frames to reconstruct a target frame, typically following the alignment, fusion, and reconstruction paradigm [18]. Kappeler et al. [19] developed models that leverage both spatial and temporal dimensions of videos to enhance spatial resolution and systematically evaluated multiple strategies for integrating video frames within a unified convolutional neural network (CNN). Caballero et al. [20] proposed a spatio-temporal sub-pixel CNN that effectively utilizes temporal redundancies to optimize reconstruction accuracy while ensuring real-time computational efficiency. Jo et al. [8] proposed an end-to-end DNN that generates dynamic upsampling filters and a residual image based on local spatio-temporal pixel neighborhoods, eliminating explicit motion compensation. Yi et al. [21] proposed a novel progressive fusion network for VSR that leverages spatio-temporal information via an improved non-local operation, avoiding complex motion estimation and compensation, and proving more efficient and effective than existing direct fusion, slow fusion, or three-dimensional (3D) convolution strategies. In summary, while sliding-window-based methods have demonstrated considerable success in various video analysis tasks, they inherently suffer from a critical limitation: the reliance solely on local window information. This constraint prevents them from effectively capturing long-range dependencies across more extended temporal sequences. Consequently, this inherent architectural bottleneck has led to a noticeable performance plateau in these approaches. To mitigate this deficiency and facilitate the modeling of global context, RNNs are introduced as a compelling alternative.

RNN-based models seek to effectively leverage long-range dependencies by recurrently propagating comprehensive information from both past and future frames [22]. For example, RBPN [23] considered each context frame an independent information source, employing a recurrent encoder–decoder module to fuse spatial and temporal contexts derived from consecutive video frames. RSDN [24] presented a recurrent unit composed of multiple two-stream structure-detail blocks, which facilitates selective utilization of the current frame’s hidden state information to bolster robustness against appearance variations and error accumulation. Liu et al. [25] proposed a deformable motion alignment module to precisely estimate offsets, thereby efficiently aligning adjacent image frames using a small number of input frames. BasicVSR [11] presented a streamlined pipeline that reconsiders four key components for VSR: propagation, alignment, aggregation, and upsampling. BasicVSR++ [12] introduced second-order grid propagation and flow-guided deformable alignment, building upon the foundation of BasicVSR. In this work, we select these methods (BasicVSR, IconVSR, and BasicVSR++) as the compared methods. They represent the classics in RNN-based VSR frameworks, and since our proposed FAENet is also built upon the RNN architecture, benchmarking against them allows for a direct validation of our proposed methods.

In brief, RNN-based methods leverage long-range frame sequences to achieve superior performance in natural scenes. However, their direct application to satellite scenarios faces significant challenges due to distinct imaging characteristics. Mainstream approaches assume that moving objects exhibit large-scale motion and strong gradients, facilitating the capture of motion. However, this assumption is invalid in satellite videos, where the bird’s-eye view results in tiny object sizes (often occupying only a few pixels) and minimal motion, which severely impedes robust motion capture and alignment.

2.2. Satellite Video SR

As a subfield of VSR, SVSR has recently attracted increasing attention but remains underexplored compared to VSR due to challenges in data acquisition. Compared to the general scenario, satellite video suffers from greater degradation, and its wide scene characteristic also poses greater challenges for SVSR in reconstructing high-frequency details [26]. Zhang et al. [27] presented one of the first studies to utilize satellite video for SVSR, integrating both single- and multi-frame data. The multi-frame network was adapted from the classic general VSR network, EDVR [7], and employs deformable convolutions for feature alignment. Furthermore, standard models such as VDSR [28] and ESPCN [29] were retrained on satellite video data to suit the SVSR task. Subsequently, greater efforts have been devoted to leveraging the distinct characteristics of remote sensing imagery to improve feature representation capabilities. For example, to address temporal redundancy in satellite video [30], Liu et al. [31] proposed an efficient framework that leverages local prior knowledge and nonlocal spatial similarity. He et al. [32] proposed an OFEnet that employs 3D convolution to achieve temporal compensation. To optimize the reconstruction of high-frequency components, Jiang et al. [33] proposed a deep distillation recursive network that performs feature distillation and high-frequency compensation across multiple network stages. Furthermore, Jiang et al. [34] introduced a GAN-based SVSR approach to enhance high-frequency edge features and mitigate noise. To improve the detail of small objects in satellite video, Chen et al. [35] modeled the LR airplane with a new reflective symmetry shape prior to extracting more complete features for VSR. To mitigate errors resulting from inaccurate motion estimation and alignment, Xiao et al. [36] developed a novel fusion approach that integrates temporal grouping projection with multiscale deformable convolution alignment to optimize spatial resolution and generalization. Additionally, Xiao et al. [9] employed local and global temporal differences to facilitate efficient and robust temporal compensation.

In short, recent SVSR methods have primarily focused on feature extraction and reconstruction in the image domain. However, satellite videos lack sufficient detail and exhibit lower image quality than natural ones. These approaches overlook fine-grained features from a frequency-domain perspective, which could amplify distinctions between objects and artifacts, thereby enhancing the capture of motion information and the reconstruction of detail.

2.3. Learning in Frequency Domain

Recently, frequency-based models have garnered significant attention in image and video processing due to their capability to capture global representations effectively. Commonly employed techniques for extracting frequency-domain information are the Fast Fourier Transform (FFT) and the Discrete Wavelet Transform (DWT). The inherent multi-resolution property of the DWT enables it to effectively capture multi-scale features, while the FFT is well-suited for extracting global features [37]. The FFT decomposes data into amplitude and phase components, while the DWT separates data into four distinct frequency sub-bands: a low-frequency component and three high-frequency components in the horizontal, vertical, and diagonal directions. Leveraging the unique properties of these transforms, researchers have extended their applications to various domains, including low-light image enhancement, SR, and image segmentation. Approaches that leverage frequency information can be broadly categorized into transform-domain modeling and frequency-domain-guided recovery approaches, based on whether interaction between feature and frequency domains. Approaches leveraging frequency information can be broadly categorized into transform-domain modeling and frequency-domain-guided recovery, depending on the interaction mechanism between the feature and frequency domains.

Transform-domain modeling involves transforming features into the frequency domain within the intermediate network layers, operating on the resulting frequency components, and then transforming them back. For example, Zhang et al. [38] introduced a hierarchical feature restoration block to model wavelet frequency features, which considers the unique characteristics of distinct frequency sub-bands, establishing a robust model that fully exploits frequency domain features. Xiao et al. [39] incorporated high-frequency cues into Mamba by dynamically selecting more informative frequency signals, thereby reducing computational complexity. Leveraging the global modeling capacity of the FFT [40], Li et al. [16] aggregated spatial-temporal information directly within the frequency domain, thereby enabling CNNs to effectively capture long-range dependencies. Zhu et al. [17] captured motion relationships in the frequency domain to generate fine-grained details from aligned features, and utilized a frequency-aware contrastive loss that supervises reconstruction via separate high- and low-frequency groups. Yan et al. [41] leveraged the generative capability of diffusion models to produce frequency priors, which are then used to reconstruct more reliable details. It is a wavelet-based Transformer block that significantly accelerates inference because DWT decomposes features into smaller and more compact frequency components for processing. Li et al. [42] introduced multi-level wavelet transforms to extract pyramidal features and reduce the encoding cost of latent video diffusion models.

In contrast, frequency-domain-guided methods primarily perform recovery in the feature domain, utilizing frequency analysis to provide auxiliary guidance or constraints. For example, Xu et al. [43] integrated high-frequency components with U-Net feature maps, using this connection to guide the precise insertion of details, which effectively mitigates output blurriness. Zhao et al. [44] combined the original low-light image and its low-frequency component to search a new image space, achieving noise suppression and reducing its impact on feature encoding and interaction. Liu et al. [45] generated amplitude residuals to bridge the gaps between hazy and clear domains without adding extra parameters, and proposed a phase correction module to eliminate unwanted artifacts.

As previously mentioned, several existing SR methods have incorporated frequency-domain information. However, they often process it in a non-discriminative manner, applying uniform operations across all frequency features without targeted modulation, thereby limiting performance in challenging satellite scenarios, such as those involving small, slow-moving targets and dense small buildings. Therefore, we incorporate learnable prompts to enhance the model’s sensitivity to discriminative information, improving edge reconstruction and artifact discrimination.

3. Methods

In this section, we first provide an overview of the proposed FAENet. Next, we introduce its key components, FACM and FPEB, which address alignment challenges in SVSR and enhance object-edge reconstruction, respectively. Finally, we present the loss function for this pipeline.

3.1. Overview

RNNs are a widely recognized framework in VSR. We select the comprehensive VSR framework [46] as our baseline model. This framework is particularly advantageous as it integrates several classic VSR and SVSR models into a single architecture, providing a unified environment for training and testing to facilitate fair comparisons. Building upon this established architecture, we propose a novel SVSR model, termed FAENet, which integrates the proposed FACM and FPEB modules. Specifically, FACM includes a frequency-domain distribution alignment function designed to adapt alignment methods for satellite videos in a plug-and-play manner. Furthermore, FPEB is introduced to distinguish high-frequency textures from artifacts and enhance edge reconstruction. Let

2 N + 1

frames of LR video sequence be denoted as

{x_{t - N}, \dots, x_{t}, \dots, x_{t + N}}

, and its corresponding reconstructed HR video sequence as

{y_{t - N}, \dots, y_{t}, \dots, y_{t + N}}

, where

L = 2 N + 1

is the length of the video. The FAENet architecture is illustrated in Figure 1: (1) The model extracts features from the LR satellite video sequence

X \in R^{L \times C \times m \times w}

, where C, m, and w represent channel, height, and width, respectively. (2) Alignment and feature fusion modules align neighboring frames and integrate forward or backward propagation features (

f_{F}

or

f_{B}

) with the current frame. (3) FACM combines current features (

f_{1} / f_{2}

) and propagation features (

f_{F} / f_{B}

) to compensate for alignment inaccuracies. (4) FPEB leverages frequency components and prompts to enhance edge reconstruction. Finally, the reconstruction module upsamples the LR sequence to an HR video sequence

Y \in R^{L \times C \times M \times W}

, where

s = M / m = W / w

denotes the upscale factor. Moreover, the pixel shuffle layer upsamples the LR input video sequence.

3.2. Frequency Alignment Compensation Mechanism

As previously noted, mainstream alignment methods are primarily designed for natural videos with frontal perspectives. In such scenarios, moving objects typically exhibit large motion scales relative to the background, providing strong gradient information or distinct offsets that allow optical flow or DCN to capture motion effectively. In contrast, satellite videos capture information from a bird’s-eye view, where moving objects occupy only a few pixels and exhibit subtle, slow-motion characteristics. This results in weak gradient information, making it challenging for conventional spatial alignment methods to achieve high precision. To address this “insufficient alignment accuracy,” we propose alleviating the problem from a global perspective in the frequency domain. We introduce the FACM, a plug-and-play module that leverages a frequency-domain distribution alignment function to compensate for the limitations of spatial alignment in satellite scenarios.

In detail, the FACM takes the current features

f_{1}

(or

f_{2}

) and the propagation features

f_{F}

(or

f_{B}

) as inputs. For clarity, this section describes the process using the forward propagation pair

f_{1}

and

f_{F}

, noting that backward feature pair (

f_{2}

and

f_{B}

) undergo analogous processing. First, we utilize the 2D Haar DWT for feature decomposition. In the proposed FAENet, ‘DWT’ implies the Haar wavelet transform. The DWT yields four subbands: one low-frequency component (

L L

) and three high-frequency components (

H L

,

L H

, and

H H

). Since the

L L

primarily contain structural information that remains relatively stable between frames, the FACM focuses exclusively on the high-frequency components to capture subtle motion details and minimize computational complexity. Consequently,

L L

is excluded from the process, as shown in Figure 1. To characterize the global statistical distribution of these high-frequency details [16,17,45], we calculate the mean values

μ^{1}

and

μ^{F}

for each subband, formulated as:

μ_{{H H, H L, H H}}^{{1, F}} = \frac{1}{C \cdot M^{'} \cdot W^{'}} \sum_{i = 1}^{C} \sum_{j = 1}^{M^{'}} \sum_{k = 1}^{W^{'}} f_{{H H, H L, L H}}^{{1, F}} (i, j, k),

(1)

where

M^{'}

and

W^{'}

represent the spatial dimensions of the frequency subbands. Here,

μ_{{H H, H L, H H}}^{{1, F}}

denotes the set of frequency-domain coefficients (e.g.,

μ_{H H}^{1}, μ_{H L}^{1}, μ_{L H}^{1}

, and

μ_{H H}^{F}, μ_{H L}^{F}, μ_{L H}^{F}

) for the current and forward frames. Similarly,

f_{{H H, H L, L H}}^{{1, F}}

represents the corresponding feature maps.

To compensate for spatial misalignment, we propose a frequency-domain distribution alignment function

ψ (\cdot)

. The

ψ (\cdot)

acts as a statistical calibration function:

A_{H H}^{F} = ψ (μ_{H H}^{1}, μ_{H H}^{F}) .

(2)

Specifically, we utilize the relationship between the global means to calibrate the reference feature. The alignment formulation is defined as:

A_{H H}^{F} = (1 - \frac{μ_{H H}^{1}}{μ_{H H}^{F} + ϵ}) \times μ_{H H}^{F} + μ_{H H}^{1}

(3)

where

A_{H H}^{F}

represents the aligned high-frequency component

H H

. The term

(1 - \frac{μ^{1}}{μ^{F}}) μ^{F}

functions as a residual correction. When the frames are statistically well-aligned (i.e.,

μ^{1} \approx μ^{F}

), the correction term approaches 0, and the function naturally simplifies to preserving the target distribution

μ^{1}

. Moreover, FACM degenerates into an optimization module for features, preventing signal degradation when alignment is already accurate. This formulation leverages the ratio of means to implicitly adjust the intensity of the propagation features to match the distribution of the current frame, thereby enhancing alignment accuracy globally.

ϵ

is set to

10^{- 8}

for numerical stability. The components

A_{H L}^{F}

and

A_{L H}^{F}

are obtained similarly.

After compensating the three high-frequency components, they are concatenated and refined through a series of convolutional layers, average pooling, ReLU, and a sigmoid activation to generate a spatial modulation map. This map is then combined with the input features to produce the compensated output

f_{o u t}

. Finally, to robustly integrate this compensation with the original propagation features, we introduce a learnable parameter

α

. The final output of FACM is formulated as:

f_{F A C M} = α f_{o u t} + (1 - α) f_{F},

(4)

where

α

is a learning weight optimized end-to-end alongside the network parameters. This enables the model to adaptively determine the optimal balance between frequency-compensated features and original propagation features.

3.3. Frequency Prompt Enhancement Block

In this section, we introduce a key component of our proposed FAENet framework, the FPEB, designed to enhance edge reconstruction for small moving objects and background buildings in SVSR. It is widely recognized that satellite videos provide the advantage of continuous monitoring of specific areas, enabling adequate support for tasks such as object tracking and detection. However, the inherent characteristics of satellite videos restrict the accuracy of these applications. Specifically, (1) satellite videos typically capture a wide field of view, resulting in a dense arrangement of objects; (2) the frame quality is generally lower than that of natural videos; (3) moving objects occupy only a small portion of pixels. Such factors make SR results prone to artifacts and poor edges, making it difficult to distinguish objects clearly. To address this, we propose the FPEB module, which leverages visual prompts in the frequency domain to guide the network in prioritizing the restoration of object and building edges, thereby reducing interference in downstream tasks.

In general, edges correspond to high-frequency information, while content corresponds to low-frequency components. Therefore, high-frequency information is key to enhancing edge reconstruction. However, traditional frequency methods are sensitive to scene noise [16], often failing to distinguish true object details from artifacts or noise. To overcome this, the proposed module incorporates visual prompts [47,48] to facilitate automatic learning of critical edge patterns. As illustrated in Figure 1, the FPEB utilizes a prompt pool

P \in R^{T \times r}

to store frequency-domain priors. Here, T denotes the number of prompts in the pool, and r represents the inner rank of each prompt.

Specifically, after feature refinement (FR), we apply the DWT to extract high-frequency components. The process is formulated as follows:

_, H L, L H, H H = D W T ((F R (f_{3}))),

(5)

where

F R (\cdot)

comprises convolutional and LeakyReLU activation layers. The extracted high-frequency components

C a t (H L, L H, H H)

are flattened and passed through a routing strategy to generate a selection matrix. Subsequently, a LogSoftmax function is applied to compute the log probabilities, indicating the likelihood of each prompt in

P

being selected by the frequency components or the relevance of each prompt in

P

to the current inputs. To enable differentiable prompt selection, we apply the Gumbel-Softmax trick [49] to the log probabilities, yielding a one-hot routing matrix

R \in R^{D \times T}

, where

D = m n

is the flattened spatial resolution of LR frames. The specific frequency prompt

P \in R^{D \times r}

is then retrieved through matrix multiplication, acting as a lookup mechanism:

P = R P .

(6)

Finally, we concatenate the selected frequency prompt P with the output of

F R (\cdot)

. This learned prompt serves as a guiding cue, dynamically emphasizing frequency regions corresponding to genuine object edges while suppressing artifacts or noise. This mechanism significantly strengthens the network’s ability to reconstruct fine-grained structures in complex satellite scenes.

3.4. Loss Function

We adopt the Charbonnier penalty function [50] to optimize our network. This function provides a differentiable, smooth approximation of the

L_{2}

loss while offering robustness comparable to that of the

L_{1}

loss, effectively reducing the influence of large errors and emphasizing smaller ones. This makes it particularly effective for handling outliers in SR tasks [51]. The loss function is as follows:

L (y_{t}, {\hat{y}}_{t}) = \sqrt{∥ y_{t} - {\hat{y}}_{t} ∥^{2} + ϵ^{2}}

(7)

where

{\hat{y}}_{t}

is the ground truth (GT) frame and

ϵ

is a constant which is set to

10^{- 6}

.

4. Results

This section first introduces two satellite video datasets and evaluation metrics. It then elaborates on the training procedure, followed by a comparison of our proposed approach with state-of-the-art (SOTA) VSR and SVSR models and result analysis. Finally, we conduct comprehensive ablation experiments to validate the rationale and effectiveness of our method.

4.1. Satellite Video Dataset and Evaluation Metrics

Satellite Video Dataset

To thoroughly assess our SVSR method against SOTA approaches, we select two publicly accessible satellite video datasets for experimentation. The Jilin-189 dataset, derived from Jilin-1 satellites, offers a spatial resolution of 1 m, a frame rate of 25 frames per second, and video durations of 20 to 30 s, with each video containing 100 frames at a resolution of

640 \times 640

pixels. Accessible via Jilin-189 (https://github.com/XY-boy/LGTD?tab=readme-ov-file, 10 October 2025), the dataset comprises 189 training clips and 12 testing clips. For clarity, we label these testing clips as sequence 000 to sequence 011. Meanwhile, we set the HR and LR patch sizes for training to

256 \times 256

and

64 \times 64

, respectively, and for testing to

640 \times 640

and

160 \times 160

, respectively.

The SAT-MTB-VSR (https://github.com/Alioth2000/RASVSR, 11 September 2025) dataset, derived from the SAT-MTB satellite video multitask benchmark dataset acquired by the Jilin-1 satellite, comprises 431 video clips, each consisting of 100 consecutive frames with a resolution of

640 \times 640

pixels. Of these, 413 clips are allocated for training and 18 for validation, sourced from distinct original videos. For clarity, we label these testing clips as sequence 000 to sequence 017. The HR and LR patch sizes for training and testing settings are the same as the previous dataset.

To rigorously evaluate the quality of VSR outputs, we employ the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) to quantify the performance of all methods objectively. The learned perceptual image patch similarity (LPIPS) [52] is used to evaluate the perceptual quality of reconstructed videos. Notably, the PSNR and SSIM metrics are calculated across all three RGB channels rather than the Y channel of the YCbCr color space. Higher PSNR and SSIM values indicate better performance, and the lower LPIPS values indicate better perceptual performance.

4.2. Training Details

Our network is implemented using the PyTorch 2.7.1 framework. Following previous work, we employ bicubic interpolation to downsample video frames and generate LR frames. Our approach builds upon the open-source RASVSR framework [46]. The network is trained on four NVIDIA RTX 4090d GPUs. For fair comparison, we set the batch size to 4 for both our method and all compared methods. The length of the video sequences is fixed at 60 frames, and the compared methods adhere to their official code settings. We use the Adam optimizer [53] with an initial learning rate of

3 \times 10^{- 4}

and adopt the Cosine Annealing Warm Restarts strategy [54] to adjust the learning rate. Training is conducted for a total of

10^{5}

iterations. For the optical flow network, SpyNet, we utilize pretrained weights with fixed parameters for the first

5 \times 10^{3}

iterations, after which they are fine-tuned with a learning rate one-quarter of the initial value. All experiments are conducted exclusively at a

4 \times

scaling factor.

4.3. Comparison with State-of-the-Art Methods

To further illustrate the reasonableness and effectiveness of the algorithms in this paper, comparisons are made with some popular and currently available algorithms, including Bicubic, BasicVSR [11], IconVSR [11], BasicVSR++ [12], RASVSR [46], DVSRNet [55], and MADNet [18]. For a fair comparison, we utilize the baseline source code from [46] to train BasicVSR [11], IconVSR [11], BasicVSR++ [12], and RASVSR [46]. The remaining models are obtained from their official repositories. All results are evaluated at their best epoch to ensure optimal performance.

4.4. Results on Jilin-189 Dataset

Table 1 presents the quantitative results of all compared methods across 12 scenes, along with their average performance on the Jilin-189 test set. Our FAENet outperforms existing approaches, achieving the highest average values across two metrics, with improvements of 0.08 dB in PSNR and 0.0007 in SSIM over the second-best method. These results demonstrate that our method effectively restores high-quality, realistic details from LR sequences. DVSRNet introduces a new alignment method that fails to account for the specific characteristics of satellite videos, leading to suboptimal performance. When compared to methods using the same number of input frames and RNN-based architectures, such as RASVSR and MADNet, which perform alignment via DCN and optical flow, respectively, they are challenging to capture motion information in satellite scenarios featuring tiny moving objects and dense building backgrounds. In contrast, our approach integrates alignment compensation information, achieving a superior balance between computational efficiency and performance, as evidenced in Table 5. This demonstrates the effectiveness of FACM in capturing subtle motion information.

To comprehensively demonstrate the effectiveness of our models, we have provided visual comparison results in Figure 2. The first and fourth rows display the scenes from test clips 001 and 005, respectively. Meanwhile, the second and fifth rows present randomly selected local regions with enlarged results for enhanced visualization, while the third and last rows display their corresponding error maps computed relative to the ground truth. It can be seen that all results are better with bicubic interpolation. BasicVSR, IconVSR, and DVSRNet present various distortions in structural details. Although our method’s performance is comparable to that of MADNet, it produces sharper edges. This improved edge reconstruction is facilitated by the FPEB module, which consistently prioritizes edge sharpness across diverse scenarios.

To better visualize the results, we present absolute error maps of locally enlarged regions. Specifically, red regions indicate maximum deviation from the ground truth, while blue regions denote close alignment with it. Analysis of the error maps reveals that BasicVSR and IconVSR exhibit larger red areas, corresponding to structural distortions in the enlarged RGB regions. As optical flow-based alignment methods, these approaches confirm that their direct application is unsuitable for satellite videos due to challenges in capturing motion information. BasicVSR++ and RASVSR exhibit fewer red regions in their error maps; however, their overall colors are lighter than ours, indicating inferior reconstruction in global structural fidelity. These methods rely on DCN-based alignment, which often results in accumulated misalignment errors in SVSR. Notably, our method significantly reduces the number of red regions compared to others, demonstrating that our alignment compensation mechanism and frequency-domain prompts effectively reconstruct ground object contours with enhanced textures and structures, thereby achieving closer visual alignment with the ground truth.

4.5. Results on SAT-MTB-VSR Dataset

Table 2 presents the quantitative results for all video clips and their average performance on the SAT-MTB-VSR dataset. This dataset is larger than previous and includes more scenes featuring dense buildings and objects. Notably, the dataset is derived from satellite video tracking of small objects, leading to a higher prevalence of tiny moving objects and introducing significant alignment challenges. Under these demanding conditions, our proposed method outperforms existing approaches in 13 of the 18 evaluated scenes in terms of PSNR. BasicVSR++ and RASVSR rely on DCN-based alignment methods that perform alignment at the feature or image level. This strategy proves unsuitable for lower-quality satellite videos, primarily because DCN may extract inaccurate offset estimates from potentially noisy or LR inputs. In such conditions, misalignment artifacts amplify during propagation, leading to ghosting effects and degraded temporal consistency in SR sequences. BasicVSR and IconVSR rely on optical flow-guided alignment to aggregate temporal information. This design assumes access to reliable motion cues derived from high-contrast, large-displacement scenarios typical in videos. However, satellite video often contains small, slowly moving objects, rendering optical flow estimators prone to failure. In contrast, our method underscores an alignment compensation mechanism at a finer, more distinctive level in the frequency domain to reduce misalignment accumulation in complex satellite scenes, achieving superior average PSNR and SSIM, demonstrating its effectiveness and robust generalization capabilities. Furthermore, compared to the second-best model, MADNet, our approach improves PSNR by 0.1 dB and SSIM by 0.0004, which demonstrates the superiority of our FAENet.

Figure 3 illustrates the visual results of various compared methods on test clips 003 (first row) and 010 (fourth row) from the SAT-MTB-VSR test set. The second and fifth rows present their corresponding enlarged local regions, with error maps computed for these regions displayed in the third and last rows to enhance visualization. From the enlarged results, we can see that ours are sharper and the details are more accurate. In the visual results, both IconVSR and DVSRNet demonstrate a noticeable loss of fine details, failing to preserve intricate features in the reconstructed images. In contrast, MADNet, while maintaining overall structural integrity, tends to produce blurred edges, resulting in a less sharp appearance of object boundaries. In summary, our proposed FAENet delivers exceptional performance in reconstructing objects and background structures, demonstrating the high effectiveness of frequency-domain alignment compensation and prompts for detailed reconstruction.

To better visualize fine-grained differences in visual results, we present absolute error maps that highlight reconstruction errors relative to the ground truth. The third row, derived from the 003 clip, reveals varying degrees of reconstruction errors across methods. Featuring a complex scene with dense trees, a highway, and landmarks, the 003 clip poses significant challenges to accurate edge reconstruction. Other methods exhibit inferior performance, with expanded red and yellow regions in locally enlarged images indicating suboptimal structural reconstruction. In contrast, our approach achieves the fewest errors, demonstrating superior reconstruction quality. This advantage stems from our proposed FPEB’s ability to guide the model to prioritize edge capture, regardless of background complexity or the presence of moving objects, via frequency-domain prompts, thereby further validating the robust generalization capability of our method across diverse and complex scenes.

4.6. Evaluation on Perceptual Quality (LPIPS)

The quantitative results in terms of LPIPS are reported in Table 3. As shown in Table 3, the proposed FAENet achieves the best average performance on both test sets. Specifically, FAENet attains average scores of 0.0257 on the Jilin-189 dataset and 0.0238 on the SAT-MTB-VSR dataset, outperforming the second-best method, MADNet, by margins of 0.0005 and 0.0002, respectively. These observations highlight the superior capability of FAENet in recovering satellite videos with high perceptual quality and structural details.

4.7. Ablation Studies

In this section, we first perform ablation studies to evaluate the effectiveness of the core modules, FACM and FPEB. Next, we assess the generalization capability of FACM. We then investigate additional factors influencing the network’s performance and discuss their implications. Finally, we validate the effectiveness of our SVSR methods through downstream tasks. Notably, all our ablation studies were conducted on the Jilin-189 dataset, using a consistent batch size of 4 across all experiments. The number of frame inputs for each algorithm is specified in tables.

4.7.1. The Effectiveness of FACM and FPEB

In this section, we validate the effectiveness of the two proposed key modules: the FACM and the FPEB. The average results are presented in Table 4. The FACM is designed to enhance alignment accuracy, which is a critical operation in VSR. For RNN, effective alignment across the entire video sequence is essential, as misalignment can lead to significant error accumulation, rendering the results unusable. The FACM addresses this by compensating for alignment inaccuracies, which is particularly important given the discrepancies between natural videos and satellite videos, where the latter often present unique challenges due to their distinct characteristics. As shown in the comparison between the first and second rows of the Table 4, the inclusion of FACM improves the PSNR by 0.56 dB and the SSIM by 0.0046, demonstrating its effectiveness in enhancing alignment accuracy through the frequency-domain distribution alignment function, which provides fine-grained alignment compensation. The FPEB uses frequency-domain prompts to differentiate between artifacts and objects, thereby improving edge reconstruction. As indicated in Table 4, the incorporation of FPEB results in a performance improvement of 0.54 dB in PSNR and 0.0046 in SSIM, confirming the effectiveness of FPEB in enhancing the quality of reconstructed edges. Finally, integrating the FACM and FPEB significantly enhances the overall performance of the proposed method. By combining FACM’s ability to improve alignment accuracy and FPEB’s capability to refine edge reconstruction through frequency-domain prompt utilization, the model achieves superior results in both PSNR and SSIM, as evidenced by the quantitative improvements reported in Table 4. This synergistic incorporation of FACM and FPEB underscores their contributions to robust SVSR, particularly in challenging satellite video scenarios.

4.7.2. Complexity on Model Complexity

We evaluate model complexity based on two metrics: floating-point operations (Flops) and parameters (Para). Specifically, Flops represent the total number of addition and multiplication operations performed by the model, with higher counts indicating greater computational complexity. Para refers to the number of trainable parameters in the model. Table 5 presents the complexity comparison of our method against six SOTA methods. For a fair comparison, we first consider RASVSR, MADNet, and our proposed FAENet, all using an input sequence length of 60 frames. Our method achieves the best performance metrics while maintaining moderate complexity among these three, striking an effective balance between performance and computational efficiency. This superior performance is attributed to the FACM of the proposed FAENet. The FACM is specifically designed to account for the unique characteristics of satellite video scenarios (i.e., subtle motion and tiny moving objects) to implement effective alignment compensation. In contrast, the alignment methods employed by RASVSR and MADNet did not incorporate such a scenario-specific consideration, leading to suboptimal results. Compared to methods with shorter input sequence lengths, our approach ranks third in terms of Para while delivering the best performance, demonstrating clear advantages in both efficacy and efficiency.

4.7.3. Generalization of FACM

As previously mentioned, FACM is a plug-and-play module that can be integrated into methods derived from natural VSR algorithms, enabling their adaptation for SVSR tasks by compensating for alignment errors, thus enhancing alignment accuracy and overall performance. Therefore, we conduct the ablation study to evaluate the generalization of FACM. Specifically, we integrated FACM into four approaches without any fine-tuning of their frameworks. The results, including Flops, Para, PSNR, and SSIM, are presented in Table 6. The paired comparisons demonstrate that adding FACM improves PSNR and SSIM to varying degrees, accompanied by a marginal increase in Flops and Para. BasicVSR and IconVSR rely on optical flow estimation for alignment, while BasicVSR++ and RAVSR employ DCN to achieve alignment. The incorporation of FACM into these two mainstream alignment methods results in consistent performance improvements, as evidenced by enhanced PSNR and SSIM metrics, demonstrating the effectiveness and strong generalization capability of the proposed alignment compensation mechanism.

4.7.4. The Length of Input Sequence

To ensure the best effectiveness of input sequence length, we conducted experiments with sequence lengths ranging from 7 to 75 frames, as reported in Table 7. For a fair comparison, we maintained a consistent batch size across all experiments. The results demonstrate that increasing the lengths of the training and testing sequences improves SVSR performance. Models trained on 7- and 15-frame sequences perform similarly, whereas extending the sequence to 60 frames significantly improves PSNR, indicating that SVSR benefits from aggregating information across more extended sequences because of its inherently lower image quality compared to natural videos. However, an excessively long sequence may lead to information redundancy, resulting in increased computational complexity with minimal performance gains, as observed with a sequence length of 75 frames. To achieve a trade-off between performance and complexity, we ultimately set the input sequence length to 60 frames.

4.7.5. The Hyperparameters of the Number of Prompts in FPEB

One of the core components of FPEB is the learnable prompts, which enhance object edges in conjunction with the high-frequency domain. Two hyperparameters govern their performance: the prompt pool size, T, and the inner rank, r. In this section, we conduct ablation experiments to evaluate the impact of varying T and r on SVSR performance. To accelerate the training process, we obtain all results using an input sequence length of 7 frames, which maintains trend consistency with a 60-frame sequence. As shown in Table 8, when r is small, performance exhibits a slight downward trend. Conversely, when T is small, reducing the prompt pool size sometimes leads to a slight performance increase but presents instability. Therefore, we select

r \times T = 32 \times 64

, as it yields the optimal PSNR.

4.7.6. Temporal Consistency Analysis

In the context of VSR, temporal consistency plays a critical role in determining reconstruction quality. To illustrate this, we compare temporal profiles between our method and other approaches in Figure 4, generated by horizontally stacking pixel rows across consecutive frames from the Jilin-189 dataset. The results reveal that Bicubic, IconVSR, and DVSRNet exhibit discontinuities in the output video, while MADNet, BasicVSR, and RASVSR display slight flickering artifacts. Benefiting from our FACM, FAENet effectively aggregates richer information from video frames, achieving smoother temporal transitions.

4.7.7. Evaluation on Downstream Segmentation Task

To rigorously verify the quality of edge reconstruction, we conducted an evaluation on a downstream segmentation task, as segmentation necessitates higher edge fidelity than detection or tracking. Using an unsupervised segmentation method [56] on the Jilin-189 dataset, we measured performance via the Dice Similarity Coefficient (DSC) and Pixel Accuracy (PA), as shown in Table 9. Our method outperforms comparison methods in both metrics, confirming that the FPEB module effectively preserves intricate edges and suppresses artifacts. Furthermore, as illustrated in Figure 5, our generated segmentation maps most closely approximate the GT. In contrast, other compared methods exhibit noticeable segmentation errors.

5. Discussion

The primary objective of this study was to address the unique challenges of SVSR, specifically the difficulties in aligning tiny, slow-moving objects and reconstructing sharp edges. Our experimental results on the Jilin-189 and SAT-MTB-VSR datasets demonstrate that the proposed FAENet significantly outperforms existing SOTA methods.

5.1. Relationship Between Previous Studies and the Working Hypotheses

On one hand, previous studies, such as BasicVSR and IconVSR, rely heavily on optical flow for alignment. While effective for natural videos, our results indicate that these methods struggle in satellite scenarios, leading to unsatisfactory performance. This confirms our working hypothesis that motion estimation in the image/feature space is insufficient for tiny and slow-moving targets. In contrast, our FACM operates on the global statistics of frequency subbands. The superior performance of FAENet suggests that alignment in the frequency domain provides a more robust, “coarse-to-fine” compensation strategy that effectively corrects the misalignment errors prone to occur in spatial-domain methods.

On the other hand, satellite videos inherently lack rich texture and feature dense buildings, often leading to blurring or artifacts. While traditional frequency-domain methods are susceptible to high-frequency noise/artifacts, our FPEB mitigates this by introducing a dynamic selection mechanism implemented through visual prompts. The significant improvement in PSNR and the superior performance in downstream segmentation tasks validate that FPEB is successful. It effectively captures critical structural details even in the presence of noise/artifacts, thereby enhancing edge reconstruction fidelity.

5.2. Limitations and Future Directions

While our method generalizes well across diverse satellite scenes, it relies on supervised training with large-scale paired HR and LR data. Future work could investigate unsupervised adaptation techniques to fine-tune the frequency prompts for unseen sensor data without requiring paired ground truth, thereby enhancing cross-sensors/scenes generalization. In addition, although FAENet achieves a favorable trade-off between performance and complexity compared to advanced models like MADNet, the total computational cost remains non-negligible for real-time processing. Future research will focus on optimizing the architecture, potentially through network quantization or distillation, to enable efficient deployment on resource-constrained satellite edge devices.

6. Conclusions

In this paper, we propose FAENet, a novel recurrent neural network designed for SVSR. The core innovation lies in two key components: the FACM and FPEB. The FACM is a plug-and-play alignment compensation module that leverages a frequency-domain distribution alignment function, enabling seamless integration with existing alignment methods to significantly improve alignment accuracy for SVSR. To enhance object edge reconstruction, the FPEB module utilizes frequency-domain prompts to guide the model in distinguishing artifacts from objects, thereby focusing on accurate edge reconstruction. Extensive experiments conducted on two distinct satellite video datasets validate the effectiveness and efficiency of our proposed method. In future work, we plan to focus on enhancing the computational efficiency of our SVSR model from a lightweight perspective. This crucial improvement will facilitate the adoption and deployment of the model in real-world scenarios, particularly those constrained by limited computational resources (e.g., edge devices or satellite platforms).

Author Contributions

Validation, Y.Y.; Investigation, L.W.; Writing—original draft, X.L.; Writing—review & editing, J.Z.; Visualization, Y.W.; Funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (62171328, 62401410); by the Hubei Key Research and Development Program (2025BAB0004); by the Hubei Science and Technology Innovation Team (T2023009).

Data Availability Statement

All data used in this study are publicly available. The Jilin-189 and SAT-MTB-VSR datasets are available at the following links. Jilin-189 dataset: https://github.com/XY-boy/LGTD?tab=readme-ov-file (10 October 2025) and SAT-MTB-VSR dataset: https://github.com/Alioth2000/RASVSR (11 September 2025).

Acknowledgments

We sincerely appreciate the editors and reviewers of this journal for their valuable suggestions on this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, Y.; Shao, Z.; Lu, T.; Wu, C.; Wang, J. Remote sensing image super-resolution via multiscale enhancement network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5000905. [Google Scholar]
Li, D.; Wang, M.; Jiang, J. China’s high-resolution optical remote sensing satellites and their mapping applications. Geo-Spat. Inf. Sci. 2021, 24, 85–94. [Google Scholar]
Huang, X.; Wang, D.; Zhu, Q.; Zheng, Y.; Guan, Q. Single target tracking in high-resolution satellite videos: A comprehensive review. Geo-Spat. Inf. Sci. 2025, 28, 966–995. [Google Scholar]
Xiao, Y.; Yuan, Q.; He, J.; Zhang, L. Remote sensing image super-resolution via cross-scale hierarchical transformer. Geo-Spat. Inf. Sci. 2024, 27, 1914–1930. [Google Scholar]
Dong, Y.; Yang, B.; Liu, C.; Geng, Z.; Wang, T. Hyperspectral image super-resolution via joint network with spectral-spatial strategy. Geo-Spat. Inf. Sci. 2025, 1–19. [Google Scholar] [CrossRef]
Bing, T.; Qing, X.; Yan, Z.; Shuai, X. Dynamic data updating algorithm for image superresolution reconstruction. Geo-Spat. Inf. Sci. 2006, 9, 196–200. [Google Scholar] [CrossRef]
Wang, X.; Chan, K.C.; Yu, K.; Dong, C.; Change Loy, C. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Pecognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3224–3232. [Google Scholar]
Xiao, Y.; Yuan, Q.; Jiang, K.; Jin, X.; He, J.; Zhang, L.; Lin, C. Local-global temporal difference learning for satellite video super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2789–2802. [Google Scholar] [CrossRef]
Yu, J.; Liu, J.; Bo, L.; Mei, T. Memory-augmented non-local attention for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17834–17843. [Google Scholar]
Chan, K.C.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4947–4956. [Google Scholar]
Chan, K.C.; Zhou, S.; Xu, X.; Loy, C.C. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5972–5981. [Google Scholar]
Wang, Y.; Shao, Z.; Zuo, X.; Lu, T.; Wang, J.; Wang, Y.; Wang, S.; Zhang, Z.; Zhao, X. NSBRNet: Non-Local Spatio-Temporal Bidirectional Recurrent Network for Satellite Video Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 2019, 127, 1106–1125. [Google Scholar] [CrossRef]
Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3360–3369. [Google Scholar]
Li, F.; Zhang, L.; Liu, Z.; Lei, J.; Li, Z. Multi-frequency representation enhancement with privilege information for video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12814–12825. [Google Scholar]
Zhu, Q.; Zhang, F.; Chen, F.; Zhu, S.; Bull, D.; Zeng, B. FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution. arXiv 2025, arXiv:2502.06431. [Google Scholar] [CrossRef]
Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Wang, S.; Lin, C.W. Multi-Axis Feature Diversity Enhancement for Remote Sensing Video Super-Resolution. IEEE Trans. Image Process. 2025, 34, 1766–1778. [Google Scholar] [CrossRef]
Kappeler, A.; Yoo, S.; Dai, Q.; Katsaggelos, A.K. Video super-resolution with convolutional neural networks. IEEE Trans. Comput. Imaging 2016, 2, 109–122. [Google Scholar] [CrossRef]
Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4778–4787. [Google Scholar]
Yi, P.; Wang, Z.; Jiang, K.; Jiang, J.; Ma, J. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3106–3115. [Google Scholar]
Jeelani, M.; Sadbhawna; Cheema, N.; Illgner-Fehns, K.; Slusallek, P.; Jaiswal, S. Expanding synthetic real-world degradations for blind video super resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1199–1208. [Google Scholar]
Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3897–3906. [Google Scholar]
Isobe, T.; Jia, X.; Gu, S.; Li, S.; Wang, S.; Tian, Q. Video super-resolution with recurrent structure-detail network. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 645–660. [Google Scholar]
Liu, H.; Huang, X.; Chu, Y.; Ye, M.; Liu, T. Improving bi-directional recurrent network for video super-resolution with deformable motion alignment structure. Int. J. Mach. Learn. Cybern. 2025, 16, 4783–4795. [Google Scholar] [CrossRef]
Liu, H.; Gu, Y. Deep joint estimation network for satellite video super-resolution with multiple degradations. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5621015. [Google Scholar] [CrossRef]
Zhang, S.; Yuan, Q.; Li, J. Video satellite imagery super resolution for ‘Jilin-1’ via a single-and-multi frame ensembled framework. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2731–2734. [Google Scholar]
Luo, Y.; Zhou, L.; Wang, S.; Wang, Z. Video satellite imagery super resolution via convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2398–2402. [Google Scholar] [CrossRef]
Xiao, A.; Wang, Z.; Wang, L.; Ren, Y. Super-resolution for “Jilin-1” satellite video imagery via a convolutional network. Sensors 2018, 18, 1194. [Google Scholar] [CrossRef]
Xiao, Y.; Su, X.; Yuan, Q. A recurrent refinement network for satellite video super-resolution. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 3865–3868. [Google Scholar]
Liu, H.; Gu, Y.; Wang, T.; Li, S. Satellite video super-resolution based on adaptively spatiotemporal neighbors and nonlocal similarity regularization. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8372–8383. [Google Scholar] [CrossRef]
He, Z.; Li, J.; Liu, L.; He, D.; Xiao, M. Multiframe video satellite image super-resolution via attention-based residual learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5605015. [Google Scholar] [CrossRef]
Jiang, K.; Wang, Z.; Yi, P.; Jiang, J.; Xiao, J.; Yao, Y. Deep distillation recursive network for remote sensing imagery super-resolution. Remote Sens. 2018, 10, 1700. [Google Scholar] [CrossRef]
Jiang, K.; Wang, Z.; Yi, P.; Wang, G.; Lu, T.; Jiang, J. Edge-enhanced GAN for remote sensing image superresolution. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5799–5812. [Google Scholar] [CrossRef]
Chen, D.L.; Zhang, L.; Huang, H. Robust extraction and super-resolution of low-resolution flying airplane from satellite video. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4700916. [Google Scholar] [CrossRef]
Xiao, Y.; Su, X.; Yuan, Q.; Liu, D.; Shen, H.; Zhang, L. Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610819. [Google Scholar] [CrossRef]
Tan, J.; Pei, S.; Qin, W.; Fu, B.; Li, X.; Huang, L. Wavelet-based mamba with fourier adjustment for low-light image enhancement. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3449–3464. [Google Scholar]
Zhang, T.; Liu, P.; Lu, Y.; Cai, M.; Zhang, Z.; Zhang, Z.; Zhou, Q. CWNet: Causal Wavelet Network for Low-Light Image Enhancement. arXiv 2025, arXiv:2507.10689. [Google Scholar] [CrossRef]
Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-assisted mamba for remote sensing image super-resolution. IEEE Trans. Multimed. 2024, 27, 1783–1796. [Google Scholar] [CrossRef]
Wang, C.; Jiang, J.; Zhong, Z.; Liu, X. Spatial-frequency mutual learning for face super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22356–22366. [Google Scholar]
Yan, Q.; Hu, T.; Wu, P.; Dai, D.; Gu, S.; Dong, W.; Zhang, Y. Efficient image enhancement with a diffusion-based frequency prior. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8452–8465. [Google Scholar] [CrossRef]
Li, Z.; Lin, B.; Ye, Y.; Chen, L.; Cheng, X.; Yuan, S.; Yuan, L. WF-VAE: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 17778–17788. [Google Scholar]
Xu, Y.; Park, T.; Zhang, R.; Zhou, Y.; Shechtman, E.; Liu, F.; Huang, J.B.; Liu, D. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 2139–2149. [Google Scholar]
Zhao, M.; Qin, X.; Du, S.; Bai, X.; Lyu, J.; Liu, Y. Low-light stereo image enhancement and de-noising in the low-frequency information enhanced image space. Expert Syst. Appl. 2025, 265, 125803. [Google Scholar] [CrossRef]
Liu, C.; Qi, L.; Pan, J.; Qian, X.; Yang, M.H. Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing. arXiv 2025, arXiv:2507.01275. [Google Scholar] [CrossRef]
Wang, H.; Li, S.; Zhao, M. A lightweight recurrent aggregation network for satellite video super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 685–695. [Google Scholar] [CrossRef]
Liu, W.; Cai, C.; Gao, J.; Wu, K.; Wang, Y.; Yap, K.H.; Chau, L.P. Promptsr: Cascade prompting for lightweight image super-resolution. arXiv 2025, arXiv:2507.04118. [Google Scholar] [CrossRef]
Guo, H.; Guo, Y.; Zha, Y.; Zhang, Y.; Li, W.; Dai, T.; Xia, S.T.; Li, Y. Mambairv2: Attentive state space restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 28124–28133. [Google Scholar]
Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of the 1st International Conference on Image Processing, Austin, TX, USA, 13–16 November 1994; Volume 2, pp. 168–172. [Google Scholar]
Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2599–2613. [Google Scholar] [CrossRef] [PubMed]
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
Kingma, D.P.; Ba, J. A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
Zhu, Q.; Chen, F.; Zhu, S.; Liu, Y.; Zhou, X.; Xiong, R.; Zeng, B. Dvsrnet: Deep video super-resolution based on progressive deformable alignment and temporal-sparse enhancement. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 3258–3272. [Google Scholar] [CrossRef]
Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Van Gool, L. Unsupervised semantic segmentation by contrasting object mask proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10052–10062. [Google Scholar]

Figure 1. The proposed overall architecture of FAENet, which reconstructs HR satellite video

{y_{t - N}, \dots, y_{t}, \dots, y_{t + N}}

from LR satellite input video

{x_{t - N}, \dots, x_{t}, \dots, x_{t + N}}

. Here, Frequency Alignment Compensation Mechanism (FACM) and Frequency Prompt Enhancement Block (FPEB) are the core modules. The green and purple arrows denote forward and backward propagation, respectively. Moreover, their features are denoted by

f_{F}

and

f_{B}

.

μ^{*}

denotes the means of the frequency domain.

Figure 1. The proposed overall architecture of FAENet, which reconstructs HR satellite video

{y_{t - N}, \dots, y_{t}, \dots, y_{t + N}}

from LR satellite input video

{x_{t - N}, \dots, x_{t}, \dots, x_{t + N}}

. Here, Frequency Alignment Compensation Mechanism (FACM) and Frequency Prompt Enhancement Block (FPEB) are the core modules. The green and purple arrows denote forward and backward propagation, respectively. Moreover, their features are denoted by

f_{F}

and

f_{B}

.

μ^{*}

denotes the means of the frequency domain.

Figure 2. Visual results on the Jilin-189 dataset are presented for enhanced observation. We randomly selected a local region to enlarge its details. The first and fourth rows display clips 001 and 005, respectively, for various compared methods, while the second and fifth rows present their corresponding enlarged local regions (red box). The third and last rows show the error maps of these enlarged local regions. Best viewed zoomed in.

Figure 3. Visual results on the SAT-MTB-VSR dataset are presented for enhanced observation. We randomly selected a local region to enlarge its details. The first and fourth rows display clips 003 and 010, respectively, for various compared methods, while the second and fifth rows present their corresponding enlarged local regions (red box). The third and last rows show the error maps of these enlarged local regions, best viewed zoomed in.

Figure 4. Visual comparison of the temporal profile with different compared approaches. We select a row of pixels (red solid lines) and observe the changes across time (Zoom in for the best view and observe the green arrows).

Figure 5. Visual comparison of segmentation results on the Jilin-189 dataset. The leftmost column displays the GT, while the remaining columns show local regions produced by different methods, corresponding to the area marked by the red line in the GT.

Table 1. Quantitative comparison (PSNR↑/SSIM↑) on the Jilin-189 dataset. The best and second-best results are highlighted in bold and underlined, respectively. ↑ indicates that higher values indicate better performance.

Methods	Bicubic	BasicVSR [11]	IconVSR [11]	BasicVSR++ [12]	RASVSR [46]	DVSRNet [55]	MADNet [18]	Our
Venue	–	2020	2020	2022	2023	2024	2025	–
Metrics	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM
000	29.66/0.8979	34.39/0.9549	34.71/0.9590	34.38/0.9549	35.25/0.9620	34.54/0.9561	36.20/0.9679	36.15/0.9676
001	28.38/0.8690	32.11/0.9400	32.17/0.9442	32.32/0.9410	32.93/0.9496	32.10/0.9399	33.64/0.9550	33.73/0.9562
002	30.51/0.9119	35.34/0.9585	35.92/0.9642	35.39/0.9595	36.29/0.9657	35.24/0.9586	36.96/0.9699	36.94/0.9698
003	30.86/0.9139	35.34/0.9608	35.57/0.9639	35.56/0.9612	36.34/0.9660	35.45/0.9614	36.83/0.9699	36.80/0.9695
004	29.13/0.8856	32.70/0.9380	33.47/0.9493	33.26/0.9439	34.05/0.9521	32.79/0.9386	34.78/0.9587	34.90/0.9597
005	27.26/0.8516	30.76/0.9258	31.01/0.9331	30.90/0.9261	31.94/0.9406	30.89/0.9275	32.58/0.9474	32.78/0.9487
006	26.76/0.8337	29.50/0.8999	30.40/0.9183	30.08/0.9091	31.64/0.9360	29.61/0.9014	32.11/0.9416	32.27/0.9438
007	29.21/0.8904	32.92/0.9457	33.85/0.9556	33.32/0.9485	34.54/0.9603	33.07/0.9472	35.03/0.9643	35.16/0.9651
008	27.90/0.8620	30.39/0.9168	31.38/0.9335	31.09/0.9258	32.27/0.9439	30.45/0.9181	32.82/0.9495	33.02/0.9516
009	32.40/0.9302	37.77/0.9716	38.71/0.9761	37.80/0.9711	38.88/0.9761	37.83/0.9719	39.46/0.9789	39.43/0.9789
010	31.07/0.9140	35.27/0.9589	36.17/0.9668	35.54/0.9608	36.85/0.9696	35.43/0.9606	37.38/0.9730	37.44/0.9734
011	29.43/0.8872	33.01/0.9413	33.61/0.9495	33.39/0.9443	34.54/0.9575	33.07/0.9421	35.19/0.9625	35.35/0.9639
Average	29.38/0.8873	33.29/0.9427	33.91/0.9511	33.58/0.9455	34.63/0.9566	33.37/0.9436	35.25/0.9616	35.33/0.9623

Table 2. Quantitative comparison (PSNR↑/SSIM↑) on the SAT-MTB-VSR dataset. The best and second-best results are highlighted in bold and underlined, respectively. ↑ indicates that higher values indicate better performance.

Methods	Bicubic	BasicVSR [11]	IconVSR [11]	BasicVSR++ [12]	RASVSR [46]	DVSRNet [55]	MADNet [18]	Our
Venue	–	2020	2020	2022	2023	2024	2025	–
Metrics	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM	PSNR (dB)/SSIM
000	33.96/0.9157	35.92/0.9479	36.50/0.9517	37.36/0.9591	39.26/0.9676	35.19/0.9349	39.84/0.9719	39.91/0.9716
001	30.66/0.8848	31.42/0.9141	32.28/0.9225	32.04/0.9229	34.05/0.9406	32.35/0.9199	34.64/0.9472	34.59/0.9462
002	34.44/0.9311	36.30/0.9566	37.02/0.9608	37.13/0.9607	39.66/0.9700	36.55/0.9530	40.26/0.9739	40.33/0.9742
003	42.48/0.9778	42.65/0.9799	43.57/0.9828	41.04/0.9777	45.96/0.9862	44.42/0.9839	46.16/0.9873	46.57/0.9879
004	35.80/0.9271	35.67/0.9383	36.61/0.9420	36.41/0.9495	39.67/0.9667	37.09/0.9467	40.40/0.9712	40.44/0.9712
005	37.01/0.9405	39.35/0.9655	39.63/0.9664	40.20/0.9696	41.81/0.9747	39.04/0.9600	42.58/0.9783	42.68/0.9786
006	34.84/0.9175	36.01/0.9416	37.00/0.9494	36.79/0.9489	39.78/0.9649	36.38/0.9382	40.25/0.9685	40.38/0.9690
007	31.91/0.9058	35.34/0.9541	36.90/0.9644	36.73/0.9633	38.21/0.9698	35.22/0.9493	38.98/0.9740	38.97/0.9743
008	35.96/0.9445	38.99/0.9700	39.69/0.9730	39.41/0.9705	41.07/0.9740	37.27/0.9599	41.73/0.9780	41.73/0.9774
009	32.55/0.8965	34.65/0.9402	35.81/0.9478	35.88/0.9522	37.63/0.9615	34.04/0.9246	38.34/0.9664	38.32/0.9667
010	33.90/0.9239	35.67/0.9506	36.18/0.9551	36.67/0.9591	39.06/0.9698	35.72/0.9476	39.57/0.9731	39.78/0.9739
011	36.93/0.9392	38.93/0.9636	39.96/0.9694	40.02/0.9686	41.41/0.9738	39.03/0.9605	42.27/0.9779	42.30/0.9782
012	34.94/0.9299	37.03/0.9567	37.23/0.9556	37.23/0.9606	40.13/0.9703	38.18/0.9603	40.82/0.9739	40.96/0.9750
013	34.40/0.9214	35.90/0.9452	36.55/0.9498	37.22/0.9542	38.53/0.9626	36.79/0.9495	39.13/0.9669	39.16/0.9670
014	36.88/0.9275	38.83/0.9553	38.00/0.9467	39.69/0.9614	40.93/0.9666	38.03/0.9432	41.70/0.9713	41.78/0.9721
015	40.44/0.9633	41.28/0.9715	42.24/0.9732	42.17/0.9738	42.72/0.9752	42.18/0.9731	42.98/0.9765	42.87/0.9763
016	39.28/0.9568	40.42/0.9706	40.94/0.9686	40.26/0.9681	42.51/0.9747	41.26/0.9693	42.56/0.9757	42.81/0.9762
017	34.12/0.9003	34.14/0.9131	34.68/0.9155	34.96/0.9232	36.36/0.9363	34.90/0.9167	36.41/0.9375	36.70/0.9406
Average	35.58/0.9280	37.14/0.9519	37.82/0.9553	37.85/0.9580	39.93/0.9670	37.42/0.9495	40.47/0.9705	40.57/0.9709

Table 3. Quantitative comparison (LPIPS↓) on the Jilin-189 and SAT-MTB-VSR datasets. The best and second-best results are highlighted in bold and underlined, respectively. ↓ indicates that lower values indicate better performance.

Methods	Bicubic	BasicVSR [11]	IconVSR [11]	BasicVSR++ [12]	RASVSR [46]	DVSRNet [55]	MADNet [18]	Our
Jilin-189	0.0726	0.0336	0.0293	0.0359	0.0285	0.0338	0.0262	0.0257
SAT-MTB-VSR	0.0555	0.0381	0.0307	0.0309	0.0267	0.0409	0.0240	0.0238

Table 4. Ablation study on the effectiveness of FACM and FPEB. The best results are highlighted in bold. ↑ indicates that the larger the value, the better the performance.

Model Type	FACM	FPEB	PSNR↑	SSIM↑
Baseline			34.63	0.9566
Baseline + FACM	√		35.19	0.9612
Baseline + FPEB		√	35.17	0.9612
Baseline + FACM + FPEB	√	√	35.33	0.9623

Table 5. Complexity comparison on the Jilin-189 dataset. The best and second-best results are highlighted in bold and underlined, respectively. ↑ indicates that higher values indicate better performance.

Methods	Venue	Publications	Frames	Flops (G)	Para (M)	PSNR↑	SSIM↑
BasicVSR	2020	CVPR	15	396.95	6.29	33.29	0.9427
IconVSR	2020	CVPR	15	498.50	8.55	33.91	0.9511
BasicVSR++	2022	CVPR	30	879.96	7.03	33.58	0.9455
RASVSR	2023	JSTAR	60	1065.26	4.36	34.63	0.9566
DVSRNet	2024	TNNLS	7	181.68	4.55	33.37	0.9436
MADNet	2025	TIP	60	1497.52	6.67	35.25	0.9616
Our	2025	RS	60	1379.52	5.44	35.33	0.9623

Table 6. Generalization of FACM on the Jilin-189 dataset. “+ FACM” means the addition of the FACM module. ↑ indicates that higher values indicate better performance.

Methods	Alignment	Frames	Flops (G)	Para (M)	PSNR↑	SSIM↑
BasicVSR	Flow	15	396.948887	6.291311	33.29	0.9427
BasicVSR + FACM	Flow	15	397.055255	6.294184	33.39	0.9436
IconVSR	Flow	15	498.497246	8.547279	33.91	0.9511
IconVSR + FACM	Flow	15	498.709984	8.553025	34.53	0.9555
BasicVSR++	DCN	30	879.958354	7.027759	33.58	0.9455
BasicVSR++ + FACM	DCN	30	879.959958	7.028840	33.62	0.9447
RASVSR	DCN	7	118.815453	4.355663	32.63	0.9330
RASVSR + FACM	DCN	7	118.815640	4.356744	33.31	0.9424

Table 7. The ablation experiments are about the length of the input sequence. The best results are highlighted in bold. ↑ indicates that higher values indicate better performance.

Frames	Flops (G)	Batch	PSNR↑	SSIM↑
7	155.48	4	33.58	0.9450
15	340.24	4	33.59	0.9456
30	686.67	4	33.80	0.9494
60	1379.52	4	35.33	0.9623
75	1725.95	4	35.23	0.9616

Table 8. Ablation study on the hyper-parameters of the number of prompts T in the prompt pool, and the inner rank r on the Jilin-189 dataset. The best results are highlighted in bold. ↑ indicates that higher values indicate better performance.

$r \times T$	Frames	PSNR↑	SSIM↑
$8 \times 64$	7	33.52	0.9448
$16 \times 64$	7	33.54	0.9447
$32 \times 64$	7	33.58	0.9450
$32 \times 32$	7	33.54	0.9449
$32 \times 16$	7	33.57	0.9449
$32 \times 8$	7	33.56	0.9451

Table 9. Segmentation results on the Jilin-189 dataset. ↑ means higher value, better performance. The best and second-best results are highlighted in bold and underlined, respectively.

Methods	Bicubic	BasicVSR [11]	IconVSR [11]	BasicVSR++ [12]	RASVSR [46]	DVSRNet [55]	MADNet [18]	Our
DSC (%) ↑	88.39	74.89	92.75	90.00	94.09	91.03	93.01	94.74
PA (%) ↑	90.92	85.49	95.04	93.31	95.86	93.97	95.20	96.24

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lang, X.; Zhang, J.; Lu, T.; Yao, Y.; Wang, Y.; Wang, L. Frequency-Aware Enhancement Network for Satellite Video Super-Resolution. Remote Sens. 2025, 17, 3994. https://doi.org/10.3390/rs17243994

AMA Style

Lang X, Zhang J, Lu T, Yao Y, Wang Y, Wang L. Frequency-Aware Enhancement Network for Satellite Video Super-Resolution. Remote Sensing. 2025; 17(24):3994. https://doi.org/10.3390/rs17243994

Chicago/Turabian Style

Lang, Xiujuan, Jin Zhang, Tao Lu, Yuan Yao, Yu Wang, and Liwei Wang. 2025. "Frequency-Aware Enhancement Network for Satellite Video Super-Resolution" Remote Sensing 17, no. 24: 3994. https://doi.org/10.3390/rs17243994

APA Style

Lang, X., Zhang, J., Lu, T., Yao, Y., Wang, Y., & Wang, L. (2025). Frequency-Aware Enhancement Network for Satellite Video Super-Resolution. Remote Sensing, 17(24), 3994. https://doi.org/10.3390/rs17243994

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Frequency-Aware Enhancement Network for Satellite Video Super-Resolution

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Natural Video SR

2.2. Satellite Video SR

2.3. Learning in Frequency Domain

3. Methods

3.1. Overview

3.2. Frequency Alignment Compensation Mechanism

3.3. Frequency Prompt Enhancement Block

3.4. Loss Function

4. Results

4.1. Satellite Video Dataset and Evaluation Metrics

Satellite Video Dataset

4.2. Training Details

4.3. Comparison with State-of-the-Art Methods

4.4. Results on Jilin-189 Dataset

4.5. Results on SAT-MTB-VSR Dataset

4.6. Evaluation on Perceptual Quality (LPIPS)

4.7. Ablation Studies

4.7.1. The Effectiveness of FACM and FPEB

4.7.2. Complexity on Model Complexity

4.7.3. Generalization of FACM

4.7.4. The Length of Input Sequence

4.7.5. The Hyperparameters of the Number of Prompts in FPEB

4.7.6. Temporal Consistency Analysis

4.7.7. Evaluation on Downstream Segmentation Task

5. Discussion

5.1. Relationship Between Previous Studies and the Working Hypotheses

5.2. Limitations and Future Directions

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI