Video SAR Enhanced Imaging Using a Self-Supervised Super-Resolution Reconstruction Network

Huang, Xuejun; Zhang, Yan; Zhong, Chao; Ding, Jinshan; Wen, Liwu

doi:10.3390/rs18050670


by Xuejun Huang ¹, Yan Zhang ², Chao Zhong ², Jinshan Ding ² and Liwu Wen ²,*

¹ School of Electronic Engineering, Xidian University, Xi’an 710071, China

² National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 670; https://doi.org/10.3390/rs18050670

Submission received: 27 December 2025 / Revised: 12 February 2026 / Accepted: 17 February 2026 / Published: 24 February 2026

(This article belongs to the Special Issue Advances in Remote Sensing Video Data Processing: Theories, Technologies and Applications)


Highlights

What are the main findings?

A deep learning-based super-resolution reconstruction framework is proposed for video synthetic aperture radar (SAR) to resolve the conflict between high-resolution imaging and high-frame-rate imaging.
We present a mathematical model for video SAR image super-resolution reconstruction. Building on this model, we design a self-supervised super-resolution reconstruction network, achieving good image reconstruction performance and strong generalization ability.

What are the implications of the main findings?

By formulating the problem of achieving high-frame-rate and high-resolution imaging as a deep learning-based image super-resolution reconstruction task, the proposed approach opens a new route to high-frame-rate microwave video SAR imaging with low system complexity and low cost.
Thanks to the proposed self-supervised learning strategy, the designed network does not rely on unavailable high-resolution images with unblurred shadows as ground truth for training, which makes it suitable for real-world applications.

Abstract

Video synthetic aperture radar (SAR) enables observation of moving targets by leveraging temporal information across successive frames. In particular, dynamic shadows in video SAR image sequences provide critical cues for detecting moving objects whose energy is smeared or Doppler-shifted. To achieve high-resolution imaging at a high frame rate for effective dynamic scene monitoring, video SAR systems typically operate at extremely high frequencies or even in the terahertz band, rather than the microwave band. However, terahertz video SAR suffers from significant signal attenuation due to atmospheric absorption. We present a deep learning framework to achieve high-frame-rate and high-resolution imaging for microwave video SAR systems. In this framework, the problem of microwave video SAR imaging is formulated as an image super-resolution reconstruction task for low-resolution yet high-frame-rate image sequences from microwave video SAR. We develop a simple yet effective image super-resolution reconstruction network that is completely built upon convolutional neural networks. The designed network takes a low-resolution image sequence and the corresponding high-resolution image with blurred shadows as input, and then produces a high-resolution image sequence where shadows are clearly visible. Furthermore, the network is trained in a self-supervised manner and thus does not require high-resolution image sequences with unblurred shadows as ground truth, which is appealing to practical applications. Processing results of real data from two different video SAR systems have shown good performance of the proposed approach with convincing generalization ability.

Keywords:

synthetic aperture radar (SAR); video SAR; radar imaging; deep learning; image super-resolution reconstruction

1. Introduction

Video synthetic aperture radar (SAR) preserves the all-weather, day-and-night sensing capability of conventional SAR while providing a video-like dynamic perception ability similar to optical sensors, which has attracted considerable research interest in recent years [1,2]. By augmenting two-dimensional spatial information with a temporal dimension, video SAR produces high-frame-rate sequential SAR images, enabling continuous observation of the region of interest [3]. These image sequences visually capture dynamic changes, facilitating both intuitive human interpretation and effective detection and tracking of moving targets.

In SAR imagery, the energy of a moving target often appears defocused and positionally shifted, yet a distinct shadow is usually observable near the actual position of the target [4]. Consequently, the dynamic shadows in video SAR image sequences can provide valuable cues for estimating a moving target’s position, velocity, and other motion parameters. Moreover, the shadow-to-clutter contrast is independent of the target’s radar cross section (RCS) [5], making dynamic shadows particularly useful for detecting low-RCS moving targets. Accordingly, shadow-assisted moving target detection and tracking in video SAR has received considerable attention [6,7,8].

Video SAR systems are expected to provide both high-frame-rate and high-resolution imaging. A straightforward way to meet this requirement is to increase the radar carrier frequency, and many video SAR systems therefore operate at extremely high frequencies [9] or even in the terahertz band [10,11]. However, such systems are costly and suffer from a limited operating range due to atmospheric attenuation of signals, which is unacceptable in many practical scenarios. For long-range imaging, some video SAR systems operate at relatively low frequencies, such as the Ku or Ka band, and adopt alternative strategies to produce high-frame-rate and high-resolution sequential SAR images. One strategy is to increase the platform speed. A higher speed shortens the synthetic aperture time required to achieve a given azimuth resolution, thereby increasing the frame rate. As a side effect, the azimuth Doppler bandwidth expands, which often results in Doppler ambiguity. Moreover, the platform speed is constrained in many applications. Another strategy is overlapping aperture processing, which increases the frame rate through data reuse. Although this approach generates more sequential SAR images, it does not reduce the synthetic aperture time of each frame. Consequently, when the synthetic aperture time remains long for high-resolution imaging, no-return areas in the SAR scene may quickly be washed out, causing moving target shadows to become blurred or even vanish. In other words, overlapping aperture processing offers no substantial improvement in dynamic target detection capability. Most recently, a collaborative imaging framework based on distributed radar systems has been proposed for high-frame-rate imaging in the common microwave band, where the spatial degrees of freedom are employed to shorten the synthetic aperture time and thus improve the frame rate. However, this approach significantly increases system complexity and cost. Overall, it remains a challenge for microwave video SAR to improve both resolution and frame rate simultaneously.

We propose a deep learning-based super-resolution reconstruction network for microwave video SAR to address the inherent conflict between high-resolution imaging and high-frame-rate imaging. Our main contributions are as follows:

To the best of our knowledge, this is the first work to formulate the problem of achieving high-frame-rate and high-resolution imaging as a deep learning-based image super-resolution reconstruction task, pioneering the high-frame-rate microwave video SAR imaging with low system complexity and low cost.
We propose a mathematical model for video SAR image super-resolution reconstruction, which simplifies the reconstruction task as the nonlinear mapping from a low-resolution image sequence to the desired high-resolution sequence with assistance from a shadow-blurred high-resolution image. Building on this model, we design a simple yet effective network that can be trained in a self-supervised manner. Therefore, unavailable high-resolution images with unblurred shadows are not required as ground truth for the training.
Experiments on real data recorded by two different video SAR systems demonstrate that the proposed approach achieves excellent reconstruction performance and generalizes well to data unseen during training.

The paper is organized as follows. Section 2 briefly introduces the related work. Section 3 details the complete framework for video SAR image super-resolution reconstruction, including the presented mathematical model, the designed network and its loss function. Experimental results are presented in Section 4, and Section 5 discusses the advantages and innovations of the proposed approach. Section 6 concludes this paper.

2. Related Work

2.1. Microwave Video SAR Imaging

When operating at the common microwave bands, video SAR systems struggle to maintain high azimuth resolution without sacrificing imaging frame rates. To alleviate the issue, the overlapping aperture processing technique was developed to achieve higher frame rates through data reuse. The backprojection (BP) algorithm, suitable for arbitrary trajectories and inherently conducive to data reuse, has been successfully applied in video SAR imaging. For instance, Hawley et al. [12] improved imaging efficiency through weighted fusion of sub-apertures based on the BP algorithm, enabling effective data reuse. Meanwhile, Moses et al. [13] introduced a recursive BP algorithm that models the current frame as a linear combination of previous frames and newly acquired backprojection data, significantly improving data reuse efficiency and reducing memory consumption. Similarly, Miller [14] employed circular shift registers for sub-aperture data reuse. Song et al. [15] divided imaging areas into regions of interest and general regions, applying the fast factorized BP algorithm to maintain high resolution in critical areas while sacrificing quality in less important regions to boost computational efficiency. Cheng et al. [16] proposed an improved Cartesian factorized BP algorithm, achieving data reuse while avoiding the interpolation required by traditional fast BP algorithms. In fact, the overlapping aperture processing reuses partial echo data among adjacent frames, resulting in more video SAR images. However, each frame produced by overlapping or non-overlapping apertures shares an equivalent synthetic aperture length when the azimuth resolution is fixed, offering no improvement in the shadow quality of moving targets.

Recently, research efforts have focused on achieving practically effective high-frame-rate and high-resolution imaging in the microwave band. Kim et al. [17] proposed a multi-input multi-output-based video SAR imaging technique, where the platform speed is very high for high-frame-rate imaging and multi-channel joint processing is then employed to mitigate the Doppler spectrum ambiguities induced by the high platform speed, successfully validating both high-frame-rate and high-resolution imaging through simulations. Nevertheless, the radar platform speed is constrained by multiple real-world factors. Ding et al. [18] proposed a collaborative high-frame-rate imaging framework based on swarm UAV-borne radar systems, where each UAV-borne radar obtains short aperture data in a short time, and then multiple aperture data are fused to achieve the desired azimuth resolution and thus improve the frame rate.

In conclusion, achieving realistic and effective microwave video SAR imaging with low system complexity remains challenging.

2.2. SAR Image Super-Resolution Based on Deep Learning

Deep neural networks, capable of modeling complex nonlinear mappings through extensive training data, have demonstrated remarkable success in optical image super-resolution reconstruction tasks [19,20,21]. Motivated by these advances, researchers have extended these deep learning techniques to SAR image super-resolution reconstruction. Building on the FSRCNN [22], a fast optical image super-resolution network, Luo et al. presented the IFSRCNN [23] where the structural similarity index (SSIM) is incorporated into the original loss function to improve the performance and applied it to SAR images. Similarly, Sun et al. [24] employed convolutional neural networks to reconstruct high-resolution images from low-resolution images generated by the conventional BP algorithm. Wei et al. [25] combined alternating direction method of multipliers with deep neural networks to achieve fast super-resolution reconstruction of SAR images. Wang et al. proposed SRGAN [26] to utilize the generative adversarial network (GAN) for SAR image super-resolution reconstruction, achieving significant breakthroughs in reconstruction accuracy and computational efficiency compared with traditional super-resolution algorithms. Addressing the issue of low spatial resolution in polarimetric SAR images, Shen et al. [27] proposed a polarimetric SAR super-resolution reconstruction framework using residual convolutional neural networks to enhance spatial resolution. Li et al. [28] proposed a novel optical-guided super-resolution reconstruction network (OGSRN) for SAR images with large scaling factors. Guided by the corresponding optical images of SAR images, OGSRN achieved excellent performance in both quantitative evaluation metrics and visual quality. Recently, Zhang et al. proposed an unsupervised blind super-resolution reconstruction (BSR) framework that introduces SAR priors into CycleGAN and jointly learns a probabilistic degradation model to better match real SAR degradations [29]. 
Kong and Liu developed a conditional GAN with deformable self-attention and multi-scale feature fusion to enhance texture and structural detail recovery [30]. To support practical deployment, Jiang et al. designed a lightweight SRGAN (LSRGAN) by compressing residual blocks with depth-wise separable convolutions while maintaining competitive reconstruction performance [31]. More recently, Dong et al. explored complex-valued SR via subaperture learning and fusion to better preserve amplitude–phase consistency in SAR reconstruction [32].

High-frame-rate and high-resolution imaging of microwave video SAR can be viewed as a low-resolution image sequence super-resolution reconstruction task. Although deep learning methods demonstrate significant potential for high-quality image reconstruction, deep-learning-based super-resolution technology specifically for video SAR has yet to be explored.

3. Methodology

Although microwave video SAR systems enable long-range detection and imaging, achieving high-resolution and high-frame-rate imaging simultaneously remains challenging. Higher resolution demands a longer synthetic aperture. Because the speed of SAR platforms is limited, higher resolution means a longer synthetic aperture time, resulting in decreased frame rates. In addition, moving target shadows result from the obstruction of electromagnetic waves by targets on the ground, and thus the shadow-to-clutter contrast is associated with the ratio of occlusion time to synthetic aperture time [33]. As the azimuth resolution increases, the synthetic aperture time also increases, but the occlusion time remains unchanged since it depends on the speed of the moving target. As a result, the shadow quality is degraded, adversely affecting subsequent moving target detection and tracking. As shown in Figure 1, moving target shadows often appear blurred in high-resolution SAR imagery while they remain visible in low-resolution SAR imagery.
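The trade-off described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes the textbook approximation $T \approx \lambda R / (2 v \rho_a)$ for the synthetic aperture time, non-overlapping frames (so the frame rate is $1/T$), and a simple occlusion ratio $\min(1, t_{occ}/T)$ as a proxy for shadow visibility. All parameter values are hypothetical.

```python
def synthetic_aperture_time(wavelength_m, range_m, speed_mps, azimuth_res_m):
    """Textbook approximation T = lambda * R / (2 * v * rho_a) (assumed, not from the paper)."""
    return wavelength_m * range_m / (2.0 * speed_mps * azimuth_res_m)


def occlusion_ratio(occlusion_time_s, aperture_time_s):
    """Fraction of the aperture during which a ground patch stays occluded (capped at 1)."""
    return min(1.0, occlusion_time_s / aperture_time_s)


# Hypothetical parameters chosen only for illustration.
WAVELENGTH_M = 0.0032      # roughly W band
SLANT_RANGE_M = 1000.0
PLATFORM_SPEED_MPS = 30.0
OCCLUSION_TIME_S = 0.1     # set by the moving target, independent of resolution

for rho_a in (0.28, 0.07):  # coarse and fine azimuth resolution in metres
    t_syn = synthetic_aperture_time(WAVELENGTH_M, SLANT_RANGE_M, PLATFORM_SPEED_MPS, rho_a)
    print(
        f"rho_a={rho_a:.2f} m  T={t_syn:.3f} s  "
        f"frame rate={1.0 / t_syn:.2f} Hz  "
        f"occlusion ratio={occlusion_ratio(OCCLUSION_TIME_S, t_syn):.3f}"
    )
```

Under these assumptions, a four-fold finer azimuth resolution requires a four-fold longer aperture time, which lowers both the frame rate and the occlusion ratio by the same factor.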

For a microwave video SAR system, high-resolution imaging not only conflicts with high-frame-rate imaging but also leads to blurred shadows of moving targets. In contrast, low-resolution imaging can satisfy the high-frame-rate requirement and produce clearer shadows of moving targets, but it inherently lacks the detailed information necessary for subsequent image interpretation and target identification. To resolve this contradiction, we formulate the challenge of achieving high-frame-rate and high-resolution imaging in microwave video SAR as a super-resolution reconstruction task for the high-frame-rate but low-resolution image sequence.

3.1. Mathematical Model for Super-Resolution Reconstruction

Existing deep learning-based super-resolution reconstruction methods typically model the reconstruction task as a complex nonlinear mapping from low-resolution images to the corresponding high-resolution images, training the network with massive one-to-one correspondence image pairs. However, in the video SAR image super-resolution reconstruction task, high-resolution images with unblurred shadows, which could serve as ground truth for the network training, are unavailable. Furthermore, due to the presence of speckle noise, it is difficult to train the network to learn the complex nonlinear mapping between low-resolution and high-resolution SAR images. Fortunately, in video SAR imaging, moving target shadows are more visible in low-resolution SAR images while the background exhibits higher resolution in high-resolution images. Leveraging this characteristic, our proposed mathematical model formulates video SAR super-resolution reconstruction as a nonlinear mapping from low-resolution image sequences to desired high-resolution sequences with assistance from high-resolution but shadow-blurred images. This can be mathematically expressed as:

$$\{I_{S}^{1}, \dots, I_{S}^{i}, \dots, I_{S}^{N}\} = f(I_{L}^{1}, \dots, I_{L}^{i}, \dots, I_{L}^{N}; I_{H}; \Phi) \tag{1}$$

where $\{I_{L}^{1}, \dots, I_{L}^{i}, \dots, I_{L}^{N}\}$ represents the low-resolution image sequence with $N$ frames, $\{I_{S}^{1}, \dots, I_{S}^{i}, \dots, I_{S}^{N}\}$ denotes the desired high-resolution image sequence with clear shadows of moving targets, and $I_{H}$ indicates the high-resolution but shadow-blurred image. The nonlinear mapping parameters $\Phi$ correspond to the network weights when approximated using a deep neural network.

3.2. Imaging Framework

The proposed framework for high-frame-rate and high-resolution imaging in microwave video SAR is illustrated in Figure 2. In order to meet the requirement of high azimuth resolution, existing video SAR imaging techniques generally divide the raw data into multiple frames of long aperture echoes to obtain high-resolution image sequences. In contrast, as shown in Figure 2, during the imaging preprocessing of the proposed framework, one frame of long aperture echo is further split into N short apertures, producing a high-frame-rate but low-resolution image sequence. Moreover, we use the accelerated fast back-projection (AFBP) algorithm [34] to efficiently generate the low-resolution image sequence and the high-resolution image in the same coordinate system. The height and width of the high-resolution image are denoted by H and W, respectively. It should be pointed out that the high-resolution image corresponding to the long aperture and the low-resolution image sequence generated by N short-aperture frames should share consistent information in the background areas, except for differences in the dynamic areas. This physical relationship establishes the feasibility of the presented mathematical model, wherein the high-resolution image guides the super-resolution reconstruction of the low-resolution sequence, specifically enhancing details in the background area.
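The sub-aperture split in the preprocessing step can be sketched as follows. This is a minimal illustration, assuming the raw echo of one long-aperture frame is stored as a pulse-by-range-sample array and that the split is into contiguous, equal-length pulse blocks. The function name, array layout, and pulse count are hypothetical, and the imaging step itself (AFBP in the paper) is not shown.

```python
import numpy as np


def split_long_aperture(echo, n_sub):
    """Split one long-aperture echo block into n_sub contiguous short apertures.

    echo: array of shape (num_pulses, num_range_samples); layout is an assumption.
    Returns a list of n_sub arrays, each with num_pulses // n_sub pulses.
    """
    num_pulses = echo.shape[0]
    if num_pulses % n_sub != 0:
        raise ValueError("number of pulses must be divisible by n_sub")
    return np.split(echo, n_sub, axis=0)


# Hypothetical frame: 2000 pulses with 64 range samples each.
long_aperture = np.zeros((2000, 64), dtype=np.complex64)
short_apertures = split_long_aperture(long_aperture, n_sub=4)
print([block.shape for block in short_apertures])
```

Each short aperture would then be imaged on the same grid as the long aperture, so that the low-resolution frames and the high-resolution image share one coordinate system.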

Guided by the proposed mathematical model, we design a super-resolution reconstruction network to transform the high-frame-rate but low-resolution image sequence into a high-resolution and high-frame-rate image sequence while preserving clear moving target shadows. Furthermore, to learn the nonlinear mapping effectively, we design a loss function for the network training. The loss function regionally maps the network’s output to the corresponding low-resolution input sequence and the shadow-blurred high-resolution image, enabling self-supervised learning. Finally, the trained network can be directly applied to high-frame-rate, but low-resolution video SAR image sequences unseen in the network training, yielding high-frame-rate and high-resolution image sequences.

3.3. Super-Resolution Reconstruction Network

As illustrated in Figure 3, the designed super-resolution reconstruction network consists of two encoders, a spatiotemporal learning module and a fusion and reconstruction module. The two encoders are the low-resolution image sequence encoder and the high-resolution image encoder, which respectively extract features from the low-resolution image sequence and the shadow-blurred yet high-resolution image. The spatiotemporal learning module learns both spatial dependencies and temporal variations of features from the low-resolution image sequence encoder. The fusion and reconstruction module integrates features from the outputs of the high-resolution image encoder and the spatiotemporal learning module at multiple scales and reconstructs the high-resolution image sequence based on the fused features.

3.3.1. High-Resolution Image Encoder

The encoder takes the high-resolution image with blurred shadows as input to extract features through four convolution blocks, where each block contains a $3 \times 3$ convolution, group normalization, and leaky ReLU activation. Group normalization stabilizes training via feature re-scaling, while leaky ReLU preserves gradient flow. Spatial downsampling is triggered every two blocks via the convolution with $2 \times 2$ stride, where channels double at each downsampling stage.

3.3.2. Low-Resolution Image Sequence Encoder

For a low-resolution image sequence of length $N$ as input, the low-resolution image sequence encoder comprises $N$ branches with shared weights for feature extraction. Each branch consists of four cascaded convolution blocks, where each block includes a $3 \times 3$ convolution, group normalization, and a leaky ReLU activation function. Spatial downsampling is also triggered every two blocks while channels double at each downsampling stage. To ensure that the spatial dimensions of the output feature maps from the low-resolution image sequence encoder match those of the high-resolution image encoder, the stride of the second and fourth convolutional layers is set to $2 \times 1$.
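The effect of the two stride settings on the feature-map size can be checked with simple shape arithmetic. The sketch below assumes a padding of 1 for every $3 \times 3$ convolution and assumes that the high-resolution encoder, like the low-resolution one, applies its strided convolution in the second and fourth blocks. Neither detail is stated explicitly in the text. The input sizes are only example values in which the low-resolution frame is four times narrower than the high-resolution image.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_shape(height, width, strides):
    """Propagate a (height, width) pair through a stack of 3x3 convolutions."""
    for stride_h, stride_w in strides:
        height = conv_out(height, stride=stride_h)
        width = conv_out(width, stride=stride_w)
    return height, width


# Per-block (height, width) strides; the block positions are an assumption.
HR_STRIDES = [(1, 1), (2, 2), (1, 1), (2, 2)]
LR_STRIDES = [(1, 1), (2, 1), (1, 1), (2, 1)]

hr_shape = encoder_output_shape(1024, 1024, HR_STRIDES)
lr_shape = encoder_output_shape(1024, 256, LR_STRIDES)
print(hr_shape, lr_shape)
```

With these settings both encoders end at one quarter of the high-resolution height and width, which is what allows their features to be fused element-wise later on.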

3.3.3. Spatiotemporal Learning Module

The spatiotemporal learning module employs two stacked gated spatiotemporal attention (gSTA) blocks [35] to enhance the features extracted by the low-resolution image sequence encoder. Note that the features from the $N$ frames in the low-resolution image sequence are first concatenated along the channel dimension and then fed into the spatiotemporal learning module for feature enhancement.

The gSTA block is introduced for its ability to efficiently capture spatiotemporal dependencies while maintaining computational efficiency, offering a critical advantage over traditional recurrent or transformer-based architectures. By leveraging a decomposed large-kernel convolution and a gating mechanism, the gSTA block dynamically filters and emphasizes informative spatiotemporal features, suppressing noise and irrelevant details. Internally, each gSTA block consists of three key components: a depth-wise convolution for local receptive fields, a depth-wise dilated convolution for distant connections, and a $1 \times 1$ convolution for channel interactions. Furthermore, the gating mechanism splits the features into two streams, applying a sigmoid activation to one stream to generate adaptive attention weights, which are then multiplied element-wise with the other stream to perform feature selection. This architecture enables our module to effectively model both spatial details and temporal evolution in video SAR sequences, crucial for high-quality super-resolution reconstruction. The stacked gSTA blocks further enhance this capability by progressively refining spatiotemporal representations.
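The gating step alone can be sketched in a few lines of NumPy. This is only an illustration of the split, sigmoid, and multiply pattern described above. The convolutions that precede the gate are omitted, and the choice of which half of the channels acts as the gate is an assumption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_selection(features):
    """Split (2*C, H, W) features into two streams and gate one with the other.

    The first half of the channels is treated as the value stream and the
    second half as the gate stream (an assumed ordering).
    """
    value, gate = np.split(features, 2, axis=0)
    return sigmoid(gate) * value


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16, 16))
selected = gated_selection(features)
print(selected.shape)
```

Because the sigmoid output lies between 0 and 1, the gate can only attenuate the value stream, which is what makes it act as a soft feature selector.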

3.3.4. Fusion and Reconstruction Module

The proposed fusion and reconstruction module reconstructs the desired high-resolution image sequence by fusing enhanced spatiotemporal features and the spatial features from the high-resolution image with blurred shadows. This module consists of N parallel branches with shared weights, and each branch is built by stacking two sum-difference attention (SDA) blocks and two up-sampling blocks. Two SDA blocks fuse the features at two different scales, where the first SDA block directly integrates the enhanced features from the spatiotemporal learning module and the spatial features from the high-resolution image encoder and another SDA block fuses the up-sampling features with features of the same size from the high-resolution image encoder. Notably, the enhanced spatiotemporal features are first split into N parts along the channel dimension, and each part is then fed into the corresponding branch for fusion and reconstruction.

As shown in Figure 3, the SDA block takes two feature maps $X \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 2C}$ and $Y \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 2C}$ as input for efficient feature fusion, where $2C$, $\frac{H}{4}$, and $\frac{W}{4}$ denote the number of channels, the spatial height, and the spatial width, respectively. The SDA block performs element-wise addition and subtraction to generate a sum branch and a difference branch, respectively. The sum branch enhances features corresponding to static background regions, while the difference branch captures dynamic variations between the input features. Inspired by the large-kernel attention mechanism [36], both branches are processed by two decomposed large-kernel convolution branches, and then the features from both branches are multiplied element-wise, followed by a sigmoid activation to produce the gated attention map $W_{a} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 2C}$. Finally, the fused features $Z \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 2C}$ are obtained as follows:

$$Z = W_{a} \odot X + (\mathbf{1} - W_{a}) \odot Y \tag{2}$$

where $\mathbf{1}$ denotes an all-ones tensor with the same dimensions as the input feature maps and $\odot$ represents the Hadamard product.
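The fusion rule in Equation (2) can be sketched as below. The decomposed large-kernel convolutions of the sum and difference branches are replaced here by optional callables that default to the identity, which is a simplification for illustration and not the network's actual layers. The function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sda_fuse(x, y, sum_branch=None, diff_branch=None):
    """Sum-difference gated fusion: Z = W_a * X + (1 - W_a) * Y.

    sum_branch / diff_branch stand in for the large-kernel convolution
    branches; they default to the identity mapping (an assumption).
    """
    sum_branch = sum_branch or (lambda t: t)
    diff_branch = diff_branch or (lambda t: t)
    sum_features = sum_branch(x + y)      # emphasizes content shared by X and Y
    diff_features = diff_branch(x - y)    # emphasizes where X and Y disagree
    attention = sigmoid(sum_features * diff_features)
    return attention * x + (1.0 - attention) * y


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
y = rng.normal(size=(4, 8, 8))
z = sda_fuse(x, y)
print(z.shape)
```

Since the attention map lies strictly between 0 and 1, every fused value is a convex combination of the corresponding entries of the two inputs.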

The proposed SDA block offers three key advantages. First, the sum-difference decomposition explicitly models both static and dynamic components of the input features, enabling a more comprehensive feature representation. Second, the decomposed large-kernel convolution retains the benefits of global receptive fields while significantly reducing computational overhead compared to standard large-kernel operations. Third, the gating mechanism provides adaptive feature selection capability. This design is particularly effective for video SAR super-resolution tasks where both spatial details and temporal coherence need to be preserved.

The two up-sampling blocks perform spatial up-sampling on fused features, and each block doubles the two-dimensional scale of the input features. Each up-sampling block incorporates a convolution layer with the kernel size of $3 \times 3$, followed by group normalization and leaky ReLU activation. Spatial up-sampling is implemented by a $3 \times 3$ convolution layer and a pixel-shuffle layer for efficiency. Finally, the $1 \times 1$ convolution layer projects the up-sampling features to the desired video SAR image sequence.
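The pixel-shuffle rearrangement used for up-sampling can be written directly in NumPy. The sketch below follows the common channel-to-space convention in which output pixel $(hr + i, wr + j)$ of channel $c$ is taken from input channel $c r^2 + i r + j$ at position $(h, w)$. A channel-first layout is assumed, and the preceding $3 \times 3$ convolution that expands the channels is not shown.

```python
import numpy as np


def pixel_shuffle(x, r=2):
    """Rearrange features of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)    # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)      # axes: (c, h, i, w, j)
    return x.reshape(c_out, h * r, w * r)


features = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
upsampled = pixel_shuffle(features, r=2)
print(upsampled.shape)
```

With $r = 2$ each block doubles both spatial dimensions, so two such blocks restore the quarter-size fused features to the full high-resolution grid.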

3.4. Loss Function

Existing deep learning-based image super-resolution reconstruction methods generally define the loss function by minimizing the difference between the network’s output and the ground truth. However, for video SAR image super-resolution reconstruction, high-resolution SAR images generated from long synthetic aperture often contain blurred moving target shadows, rendering them unsuitable as ground truth for the training. In contrast, moving target shadows are clearer in low-resolution SAR images. Therefore, we define a loss function that regionally maps the network output to the input low-resolution sequence and the shadow-blurred high-resolution image.

Given the input low-resolution image sequence and the high-resolution image, two-dimensional regions within the output images are classified into background, potential moving-target shadow regions, and uncertain areas. Specifically, since the background regions in both the low-resolution and high-resolution images are highly similar, we perform SAR image similarity analysis [37] between each frame of the low-resolution image sequence and the high-resolution image, obtaining $N$ binary mask matrices $\{M_{b}^{1}, \dots, M_{b}^{i}, \dots, M_{b}^{N}\}$, where 1 represents background pixels and 0 otherwise. Note that before the similarity analysis, the low-resolution images are interpolated to the same size as the high-resolution image. Subsequently, morphological processing is employed to mitigate the effect of speckle noise on the identification of background areas. Additionally, by further leveraging the grayscale information of the low-resolution image sequence, $N$ binary mask matrices $\{M_{s}^{1}, \dots, M_{s}^{i}, \dots, M_{s}^{N}\}$ indicating potential shadow regions are generated, as described below:

$$M_{s}^{i} = (\mathbf{1} - M_{b}^{i}) \odot (\hat{I}_{L}^{i} < \gamma) \tag{3}$$

where $\hat{I}_{L}^{i}$ is the $i$-th frame of the low-resolution image sequence processed by the interpolation and the Lee filter [38], $\mathbf{1}$ denotes an all-ones tensor with the same spatial dimensions as the corresponding mask, and $\gamma$ is a threshold of shadow intensities, which is empirically set as 0.25.
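Equation (3) translates almost directly into array code. The sketch below assumes that the background mask has already been obtained (the similarity analysis and morphological processing are not shown) and that the interpolated, Lee-filtered frame is normalized so that the threshold of 0.25 is meaningful. The normalization range and the function name are assumptions.

```python
import numpy as np


def shadow_mask(background_mask, lr_filtered, gamma=0.25):
    """Potential-shadow mask: non-background pixels darker than gamma.

    background_mask: binary array, 1 for background pixels.
    lr_filtered: interpolated and despeckled low-resolution frame,
                 assumed to be normalized to [0, 1].
    """
    not_background = 1 - background_mask.astype(np.uint8)
    dark = (lr_filtered < gamma).astype(np.uint8)
    return not_background * dark


background = np.array([[1, 0], [0, 0]])
frame = np.array([[0.10, 0.10], [0.50, 0.25]])
print(shadow_mask(background, frame))
```

Pixels that are neither background nor darker than the threshold belong to the uncertain areas and end up in neither mask.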

The loss function is mathematically defined as:

$$\begin{aligned} \mathrm{Loss} = {} & \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{q=1}^{W} \frac{\| M_{b}^{i}(j, q) \odot (I_{S}^{i}(j, q) - I_{H}(j, q)) \|_{1}}{\sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{q=1}^{W} M_{b}^{i}(j, q) + \epsilon} \\ & + \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{q=1}^{W} \beta \, \frac{\| M_{s}^{i}(j, q) \odot (\bar{I}_{S}^{i}(j, q) - \bar{I}_{L}^{i}(j, q)) \|_{1}}{\sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{q=1}^{W} M_{s}^{i}(j, q) + \epsilon} \end{aligned} \tag{4}$$

where $\beta$ is a weighting factor balancing reconstruction quality and shadow preservation, which is set as 0.2 in this paper, and $\epsilon$ is a small constant (e.g., $1 \times 10^{-8}$). $\bar{I}_{S}^{i}$ is the $i$-th frame of the output image sequence processed by a mean filter. The kernel size of the mean filter is $5 \times 9$, which reduces speckle effects and promotes regional gray level consistency between the network outputs and the interpolated low-resolution images in potential shadow areas. The kernel size is larger along azimuth because the proposed method performs azimuth super-resolution. $\bar{I}_{L}^{i}$ is the $i$-th frame of the low-resolution image sequence processed by the interpolation and the mean filtering. The first term aligns output images to the high-resolution image in the background regions, and the second term ensures shadow consistency with low-resolution inputs, preserving shadow clarity. Pixels in uncertain regions are excluded during training.
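A NumPy sketch of Equation (4) is given below for a single sample. It is a non-differentiable illustration rather than training code. It assumes arrays of shape $(N, H, W)$ for the sequences and masks, takes the second image axis as azimuth (the axis with the 9-pixel kernel side), and uses edge replication at the image borders for the mean filter. The border handling and all function names are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def mean_filter(img, kernel=(5, 9)):
    """Box filter with edge-replicated borders (border handling is an assumption)."""
    kh, kw = kernel
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = sliding_window_view(padded, (kh, kw))  # shape (H, W, kh, kw)
    return windows.mean(axis=(-2, -1))


def self_supervised_loss(outputs, hr_image, lr_interp, bg_masks, shadow_masks,
                         beta=0.2, eps=1e-8):
    """Masked L1 loss of Equation (4).

    outputs, lr_interp, bg_masks, shadow_masks: arrays of shape (N, H, W).
    hr_image: shadow-blurred high-resolution image of shape (H, W).
    """
    # Background term: outputs follow the high-resolution image where M_b = 1.
    background_term = np.sum(bg_masks * np.abs(outputs - hr_image)) / (bg_masks.sum() + eps)
    # Shadow term: smoothed outputs follow the smoothed low-resolution frames where M_s = 1.
    out_smooth = np.stack([mean_filter(frame) for frame in outputs])
    lr_smooth = np.stack([mean_filter(frame) for frame in lr_interp])
    shadow_term = np.sum(shadow_masks * np.abs(out_smooth - lr_smooth)) / (shadow_masks.sum() + eps)
    return background_term + beta * shadow_term


rng = np.random.default_rng(0)
hr = rng.random((8, 12))
outputs = np.stack([hr, hr])
loss = self_supervised_loss(outputs, hr, outputs.copy(),
                            np.ones((2, 8, 12)), np.zeros((2, 8, 12)))
print(loss)
```

Pixels that belong to neither mask contribute nothing to either term, which is how the uncertain regions are excluded from the supervision.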

4. Experiments

We demonstrate the effectiveness of the proposed approach by using the real video SAR data recorded by two different video SAR systems. The generalization ability of the proposed approach is evaluated and discussed in detail.

4.1. Datasets and Training Strategy

We use real video SAR data recorded by two different systems to build the datasets, which include one training set and two test sets. Both video SAR systems operate at W band but with different parameters, as listed in Table 1. In addition, Figure 4 presents the optical images of the scenes imaged by the two video SAR systems, clearly indicating substantial differences between the corresponding imaging scenarios. It should be noted that due to the lack of real K-band radar data, only W-band radar data are available and used to equivalently validate the proposed high-frame-rate imaging for microwave video SAR. On the W-band radar data, we perform a four-fold ($N = 4$) super-resolution reconstruction to generate images with an ultra-high azimuth resolution of approximately 0.07 m while maintaining clearly visible shadows of moving targets. The W-band equivalent experiments are used as surrogate real-data verification of the proposed high-frame-rate microwave video SAR imaging, and they are not intended to demonstrate strict radiometric equivalence to K-band. As discussed in Section 3, the shadow contrast to clutter is associated with the ratio of occlusion time to synthetic aperture time. When higher azimuth resolution requires a longer synthetic aperture time while the occlusion time remains unchanged, the moving target shadow becomes blurred in the high-resolution image but remains visible in the low-resolution image. This relationship provides a physical basis for using W-band data to validate the proposed shadow-preserving high-frame-rate reconstruction under comparable acquisition settings.

The real video SAR data from radar A contains 1,600,000 pulses and is processed by the AFBP algorithm to generate 800 samples, where each sample consists of one high-resolution image and four corresponding low-resolution images. The high-resolution image has a spatial size of

2048 \times 4096

with a resolution of

0.15 m \times 0.07 m

, while each low-resolution image has a spatial size of

2048 \times 1024

with a resolution of

0.15 m \times 0.28 m

. The first 80 samples are used to form the testing set, and the remaining 720 samples are used to construct the training set. To increase the diversity of the training data, a data augmentation strategy based on random spatial cropping is adopted. For each of the 720 original training samples, the high-resolution image is randomly cropped using a window of size

1024 \times 1024

, while each corresponding low-resolution image is cropped using a window of size

1024 \times 256

. For each cropping operation, the cropped regions in the high-resolution and low-resolution images are spatially aligned, ensuring that they correspond to the same scene area. This random cropping process is repeated 60 times for each original training sample, resulting in a total of

720 \times 60 = 43{,}200

training samples. After cropping, the high-resolution images have a size of

1024 \times 1024

, and the low-resolution images have a size of

1024 \times 256

.
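The aligned cropping step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, azimuth is taken to be the last axis, and alignment is obtained by drawing the column offset on the low-resolution grid and multiplying it by the four-fold scale factor, which is one simple way to make both crops cover the same scene area.

```python
import numpy as np

def aligned_random_crop(hr, lr_seq, crop_h=1024, crop_w_hr=1024, scale=4, rng=None):
    """Crop one HR image (H, W) and its LR frames (N, H, W // scale) consistently."""
    rng = np.random.default_rng() if rng is None else rng
    crop_w_lr = crop_w_hr // scale  # 1024 // 4 = 256 for the sizes in the text

    # Draw the row offset once; rows are shared because only azimuth is downsampled.
    top = int(rng.integers(0, hr.shape[0] - crop_h + 1))
    # Draw the column offset on the LR grid so the HR offset is an exact multiple of scale.
    left_lr = int(rng.integers(0, lr_seq.shape[-1] - crop_w_lr + 1))
    left_hr = left_lr * scale

    hr_crop = hr[top:top + crop_h, left_hr:left_hr + crop_w_hr]
    lr_crop = lr_seq[:, top:top + crop_h, left_lr:left_lr + crop_w_lr]
    return hr_crop, lr_crop
```

Calling this function 60 times for each of the 720 original training samples gives the 43,200 training samples reported above.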

In addition, the real video SAR data from radar B contains 145,000 pulses and is processed to generate 31 samples, where each sample also consists of one high-resolution image and four low-resolution images. The high-resolution image has a size of 2048 × 4096 with a resolution of 0.17 m × 0.08 m, while the low-resolution images have a size of 2048 × 1024 with a resolution of 0.17 m × 0.32 m. These 31 samples are used only for testing, to evaluate the generalization ability of the proposed approach.

The super-resolution reconstruction network is trained with the Adam optimizer and a batch size of 1. The learning rate is set to

1 \times 10^{-4}

over the 20 training epochs.

4.2. Evaluation Metrics

To comprehensively evaluate the proposed super-resolution reconstruction network for video SAR imaging, we use two specialized metrics that assess different aspects of reconstruction quality. First, the mask peak signal-to-noise ratio (MPSNR) is employed to quantify the similarity between the reconstructed image and the input high-resolution image in background regions only, which can be expressed as follows:

\mathrm{MPSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_{I}^{2}}{\| M \odot (I_{S} - I_{H}) \|_{F}^{2} / \| M \|_{1}} \right)

(5)

where M represents the binary mask matrix with 1 for background pixels and 0 otherwise. In addition,

\mathrm{MAX}_{I}

is the maximum pixel value. The mask operation ensures that the evaluation focuses exclusively on static background regions. Therefore, a higher MPSNR indicates smaller reconstruction error and better fidelity in background regions, implying better super-resolution reconstruction performance.
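Equation (5) can be written as a short NumPy function. This is a sketch only: the function name and the default maximum pixel value of 1.0 are assumptions, and the case of a zero masked error (which makes the ratio infinite) is not handled.

```python
import numpy as np

def mpsnr(i_s, i_h, mask, max_i=1.0):
    """Mask PSNR of Equation (5); mask is 1 for background pixels and 0 otherwise."""
    mask = mask.astype(np.float64)
    # Squared Frobenius norm of the masked error, divided by the number of
    # background pixels (the L1 norm of a binary mask).
    masked_mse = np.sum((mask * (i_s - i_h)) ** 2) / np.sum(np.abs(mask))
    return 10.0 * np.log10(max_i ** 2 / masked_mse)
```

For example, a uniform error of 0.1 over the background with a maximum pixel value of 1 gives a masked mean squared error of 0.01, i.e., an MPSNR of 20 dB, regardless of how many pixels the mask selects.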

Additionally, we introduce the average intensity of fast-moving shadow regions to specifically evaluate the preservation of moving target shadows, which is defined as follows:

\mathrm{AISR} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} I_{S}(i,j)

(6)

where

Ω

denotes the predefined moving target shadow regions. Since moving-target shadows correspond to low-intensity regions in SAR imagery, a lower AISR indicates darker and better-preserved shadows. Together, the two metrics assess reconstruction fidelity in static background regions and the preservation of dynamic features.
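Equation (6) reduces to a mean over the selected pixels. The sketch below assumes that the shadow region is supplied as a boolean mask of the same size as the image; the function name is hypothetical.

```python
import numpy as np

def aisr(i_s, omega):
    """Average intensity of shadow regions; omega is a boolean mask of the region."""
    omega = omega.astype(bool)
    # Mean of the reconstructed image over the predefined shadow pixels only.
    return float(i_s[omega].mean())
```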

4.3. Comparison Methods

To validate the effectiveness of the proposed approach, we compare it with conventional video SAR imaging methods and with state-of-the-art deep-learning-based image super-resolution reconstruction methods. The comparison algorithms can be grouped as follows.

Video SAR imaging based on overlapping aperture processing (VSAR-OAP) and low-resolution video SAR imaging (LR-VSAR) are traditional high-frame-rate imaging methods. VSAR-OAP obtains high-frame-rate and high-resolution SAR images by overlapping aperture processing, while LR-VSAR achieves high-frame-rate imaging at the expense of azimuth resolution.
IFSRCNN and lightweight SRGAN (LSRGAN) are supervised learning-based algorithms for SAR image super-resolution reconstruction. IFSRCNN is derived from the fast super-resolution convolutional neural network and is adapted and optimized for SAR imagery. LSRGAN follows SRGAN to learn the mapping from low-resolution to high-resolution SAR images and introduces depth-wise separable convolution to compress the residual blocks in SRGAN, aiming to reduce computational and storage costs.
ZSR [39] and BSR are unsupervised learning-based image super-resolution reconstruction algorithms. ZSR was originally developed for optical images and leverages the internal self-similarity within an image to train the network. BSR explicitly considers real SAR degradations and speckle noise by introducing SAR priors and a learnable probabilistic degradation model within a CycleGAN framework, which translates low-resolution SAR images to high-resolution images without requiring paired training data.

4.4. Super-Resolution Imaging Results

The proposed approach is tested on the first test set to validate its effectiveness and practicality. The processing results are compared with those of the above six algorithms, and representative imaging results obtained by the different approaches are shown in Figure 5. As illustrated in Figure 5a, although the overlapping aperture method obtains both high-frame-rate and high-resolution sequential SAR images, it blurs the moving target shadows. Low-resolution imaging meets the high-frame-rate requirement, but the imaging result shown in Figure 5b suffers from poor quality. As shown in Figure 5c,d, IFSRCNN and LSRGAN show limited super-resolution performance because supervised learning requires paired low-resolution and high-resolution images, yet the only high-resolution reference available in practice is typically a shadow-blurred high-resolution image. Using such an image as the training target biases the learned mapping toward reproducing the blurred shadow appearance, which leads to insufficient moving target shadow preservation in the outputs. In addition, speckle noise introduces strong intensity fluctuations that hinder supervised learning of the complex nonlinear mapping from low-resolution SAR images to high-resolution ones, which further degrades the recovery of fine structures in background areas and results in poor reconstruction quality. Similarly, since speckle noise severely degrades the self-similarity of SAR imagery, ZSR produces unsatisfactory super-resolution reconstruction results, as seen in Figure 5e. Although BSR produces higher-quality reconstructions than ZSR, its unsupervised learning formulation still makes it difficult to recover subtle structures faithfully, which leads to the loss of fine details, as depicted in Figure 5f.
In contrast, as can be observed from Figure 5g, our algorithm not only shows superior super-resolution reconstruction performance but also preserves clear moving target shadows in the reconstructed images. Furthermore, Table 2 lists the assessment results for the imaging results obtained on the first test set by the different approaches. As revealed by the MPSNR and AISR values in Table 2, compared with the other algorithms, the proposed approach achieves high-quality image super-resolution reconstruction while preserving clear moving target shadows, which effectively resolves the conflict between high-frame-rate imaging and high-resolution imaging in microwave video SAR.

Additionally, the efficiency of the proposed approach is evaluated in terms of frames per second (FPS), as listed in Table 3. Since our method is a deep-learning-based post-processing super-resolution reconstruction network, we compare the inference speed only against representative deep-learning super-resolution methods, including IFSRCNN, LSRGAN, ZSR, and BSR. All inference speeds are measured on a workstation equipped with an NVIDIA RTX 5090 GPU, where the inference time for each method is measured as the time required to generate a high-resolution frame with the size of 2048 × 4096. IFSRCNN achieves 18.748 FPS, LSRGAN achieves 2.27 FPS, and BSR achieves 2.067 FPS. ZSR is substantially slower at 0.002 FPS because it follows a zero-shot learning paradigm and requires thousands of iterative updates before producing the super-resolved result. Our method takes 386.8 ms to generate four high-resolution frames in a single forward pass, which corresponds to 96.7 ms per frame and 10.341 FPS. This efficiency mainly benefits from using the shadow-blurred high-resolution image as assistance, which provides strong reconstruction priors while keeping the model complexity moderate.
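The per-frame latency and FPS quoted for the proposed method follow directly from the measured time of one forward pass. The short check below reproduces that arithmetic; the function name is illustrative.

```python
def fps_from_batch(batch_ms, frames):
    """Convert the time of one forward pass (ms) into per-frame latency and FPS."""
    per_frame_ms = batch_ms / frames
    return per_frame_ms, 1000.0 / per_frame_ms

# One forward pass yields four high-resolution frames in 386.8 ms.
per_frame_ms, fps = fps_from_batch(386.8, 4)
print(f"{per_frame_ms:.1f} ms per frame, {fps:.3f} FPS")
# prints "96.7 ms per frame, 10.341 FPS"
```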

4.5. Ablation Study

We conduct a series of ablation studies on the first test set to clarify how each key component contributes to the proposed reconstruction framework. In particular, we evaluate the designed loss function, the auxiliary high-resolution guidance image together with the feature fusion strategy in the fusion and reconstruction module, and the spatiotemporal learning module, since these elements jointly determine the balance between overall reconstruction fidelity and shadow preservation.

To demonstrate that the designed loss function effectively steers the network toward faithful reconstruction while maintaining clear moving target shadows, we vary its key parameters and report the corresponding reconstruction performance. We first compare the performance of the proposed approach when the weight parameter

β

is set to 0, 0.2, and 0.4, respectively. Table 4 summarizes the quantitative comparison of all processing results on the first test set, while Figure 6 shows the processing results on the 32nd sample from the first test set. As the

β

value increases, the shadows in the network output images become clearer, but the overall image quality decreases. Therefore, in this paper, we set the

β

value to 0.2 to achieve a good trade-off between overall image quality and shadow clarity. Notably, when

β = 0

, the second term in the designed loss function is inactive, and thus the network produces high-resolution images with noticeably blurred shadows. This shows that the designed loss function, which regionally maps the network output to the input low-resolution sequence and to the shadow-blurred high-resolution image, is effective. Additionally, we investigate the influence of the kernel size used in the mean filter

M_{p}

in our loss function. The corresponding quantitative results and representative reconstruction images are also given in Table 4 and Figure 6, respectively. With a

1 \times 1

kernel,

M_{p}

degenerates to an identity mapping, so the shadow-related term effectively imposes a pixel-level constraint that is sensitive to speckle. This weakens the intended region-wise gray-level consistency in potential shadow areas and leads to degraded performance. Increasing the kernel size to

5 \times 9

substantially improves both metrics, indicating that moderate local averaging effectively suppresses speckle-induced fluctuations and stabilizes the shadow-related constraint through region-wise consistency. With a larger kernel of

7 \times 13

, the MPSNR value increases to 49.5854, but the AISR value also increases to 0.2109, suggesting that overly strong averaging may relax the shadow preservation constraint even though it benefits overall reconstruction fidelity. Based on this trade-off, we adopt the

5 \times 9

mean filter kernel as the default setting.
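The speckle-suppression argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions that do not come from the paper: speckle is modeled as independent, unit-mean, exponentially distributed single-look intensity, whereas real video SAR speckle is spatially correlated and the images may be amplitude or log scaled, so the actual reduction is expected to be smaller. Under this idealized model, averaging K independent pixels lowers the standard deviation by a factor of about the square root of K.

```python
import numpy as np

def box_filter_valid(img, kh, kw):
    """Mean filter over kh x kw windows, keeping only fully covered pixels."""
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for dy in range(kh):
        for dx in range(kw):
            out += img[dy:dy + h - kh + 1, dx:dx + w - kw + 1]
    return out / (kh * kw)

rng = np.random.default_rng(0)
# Assumption: independent single-look intensity speckle with unit mean.
speckle = rng.exponential(1.0, size=(256, 256))

for kh, kw in [(1, 1), (5, 9), (7, 13)]:
    std = box_filter_valid(speckle, kh, kw).std()
    ideal = 1.0 / np.sqrt(kh * kw)
    print(f"{kh} x {kw}: measured std = {std:.3f}, ideal 1/sqrt(K) = {ideal:.3f}")
```

A larger kernel suppresses fluctuations further, but it also averages across shadow boundaries, which is consistent with the weaker shadow preservation reported above for the 7 × 13 kernel.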

Additionally, we analyze the influence of the number of gSTA blocks on the reconstruction performance. Table 5 reports the results for numbers ranging from 0 to 3, and Figure 7 shows the corresponding reconstruction results on the 64th sample from the first test set. When the gSTA blocks are removed, namely when the number is 0, the reconstruction performance degrades noticeably, with an MPSNR value of 44.0777 and an AISR value of 0.2819. Increasing the number to 1 improves both metrics, yielding an MPSNR value of 46.5003 and an AISR value of 0.2125. When the number is increased to 2, the MPSNR value further rises to 48.2291 and the AISR value decreases to 0.2057, indicating a better balance between reconstruction fidelity and shadow preservation. Further increasing the number to 3 brings only a marginal gain in MPSNR, to 48.4652, while the AISR value slightly increases to 0.2067, suggesting limited benefit and slightly weaker shadow preservation compared with the two-block setting. Considering the trade-off between reconstruction performance and model complexity, we set the number of stacked gSTA blocks to 2 in our default configuration.

Furthermore, to validate the effectiveness of using the shadow-blurred high-resolution image as auxiliary guidance and to assess the rationality of the SDA-based fusion design in the fusion and reconstruction module, we perform an ablation study by modifying the guidance branch and the fusion strategy. The quantitative results for all the processing results on the first test set are presented in Table 6, and Figure 8 shows the processing results on the 48th sample from the first test set. Removing the guidance branch for the shadow-blurred high-resolution image leads to a dramatic performance drop, indicating that relying solely on the low-resolution sequence is insufficient to recover high-quality background details. In contrast, incorporating the shadow-blurred high-resolution image into the reconstruction process substantially enhances the reconstruction performance. This confirms the feasibility and effectiveness of the proposed strategy of using the shadow-blurred high-resolution image to assist the reconstruction task. Moreover, when only one SDA block is used for feature fusion, retaining only the second SDA block yields a much higher MPSNR value than retaining only the first, which reveals that injecting the guidance features at the fine scale is more critical for restoring high-frequency details. Using both SDA blocks for feature fusion yields the best MPSNR value, demonstrating complementary benefits from multi-scale fusion. In addition, replacing the SDA blocks with simple channel concatenation leads to a lower MPSNR value at approximately the same AISR value, indicating that SDA blocks provide a more effective feature interaction mechanism than naive fusion.

4.6. Generalization Ability

The generalization ability of deep learning-based methods depends on a large number of training samples covering different scenarios. However, unlike optical images, which are easy to obtain, video SAR images with sufficient diversity are difficult and expensive to acquire. A key challenge in applying deep learning to video SAR is therefore limited generalization ability caused by scarce and insufficiently diverse training data.

To check the generalization ability of the proposed approach, the network trained on the first video SAR dataset is directly applied to the second video SAR dataset. The processing results are compared with those obtained by the conventional video SAR imaging methods and the four deep-learning-based image super-resolution reconstruction methods. Table 7 lists the assessment results for all the processing results obtained by the different approaches. In addition, some representative results obtained by the different approaches are shown in Figure 9. As confirmed by both the quantitative assessment and the visual performance, the proposed approach achieves satisfactory super-resolution reconstruction performance when applied to video SAR data unseen during training. Therefore, it can be concluded that the proposed approach has a satisfactory generalization ability, which may be attributed to its self-supervised training strategy.

5. Discussion

This work formulates the challenge of achieving high frame rate and high azimuth resolution in microwave video SAR as a super-resolution reconstruction task for a high-frame-rate but low-resolution image sequence. The presented mathematical model simplifies the reconstruction as a nonlinear mapping from the low-resolution sequence to the desired high-resolution sequence, with assistance from a high-resolution but shadow-blurred image. This formulation is physically reasonable in video SAR because the long-aperture high-resolution image and the short-aperture low-resolution sequence generally share consistent information in background areas, while their differences mainly concentrate in dynamic areas. Under this model, the shadow-blurred high-resolution image can serve as an effective guidance prior to enhance background details, and the low-resolution sequence can provide clearer moving-target shadow information.

A key factor behind the reported performance is the self-supervised learning strategy. High-resolution SAR images generated from long synthetic apertures often contain blurred moving-target shadows, which makes them unsuitable as direct ground truth for training. The proposed loss function addresses this issue by regionally mapping the network output to the shadow-blurred high-resolution image in background regions, and to the low-resolution inputs in potential shadow regions, while excluding uncertain areas during training. The ablation results further clarify this mechanism. When the shadow-related term is disabled, the network produces high-resolution images with noticeably blurred shadows, and increasing the weight of the shadow-related term improves shadow clarity but reduces the overall image quality, which confirms that the designed loss effectively steers the balance between reconstruction fidelity and shadow preservation. In addition, introducing moderate local averaging in the loss improves stability by suppressing speckle-induced fluctuations and enforcing region-wise gray level consistency in potential shadow areas, whereas overly weak or overly strong averaging leads to inferior trade-offs.

The network design is also consistent with the task characteristics. The spatiotemporal learning module with stacked gSTA blocks improves reconstruction quality and shadow preservation compared with removing the module, and the results indicate that a moderate stacking depth provides a better balance between performance and complexity. In the fusion and reconstruction module, the use of the shadow-blurred high-resolution image as auxiliary guidance is critical, since removing the guidance branch leads to a dramatic performance drop, showing that relying solely on the low-resolution sequence is insufficient to recover high-quality background details. The SDA-based fusion further improves feature interaction compared with naive fusion, and multi-scale fusion yields complementary benefits, with guidance injection at the fine scale being more important for restoring high-frequency details. Beyond reconstruction quality, the reported inference speed indicates that the proposed approach can achieve a favorable efficiency level among deep-learning-based post-processing methods, which is mainly attributed to using the shadow-blurred high-resolution image as assistance that provides strong reconstruction priors while keeping the model complexity moderate. The cross-dataset evaluation also suggests that the proposed approach can achieve satisfactory performance on unseen video SAR data, which is consistent with the expected advantage of the self-supervised training strategy under limited data availability.

6. Conclusions

It is challenging for microwave video SAR to achieve a high frame rate and high resolution simultaneously, particularly with low system complexity and low cost. This paper formulates the problem as an image super-resolution reconstruction task for low-resolution yet high-frame-rate image sequences, paving a new way for microwave video SAR imaging. We simplify the reconstruction task as a nonlinear mapping from the low-resolution image sequence to the desired high-resolution sequence, aided by the shadow-blurred high-resolution image. Furthermore, we design a self-supervised deep neural network to learn the nonlinear mapping for high-quality super-resolution reconstruction. Processing results on real video SAR data show that the proposed approach achieves good performance in image super-resolution reconstruction and moving target shadow preservation. In addition, the proposed approach shows a convincing generalization ability when applied to different video SAR data.

Although the above validation is physically motivated, differences between W-band and K-band may still introduce modest variations in image appearance and quantitative metrics. Therefore, the W-band experiments are mainly regarded as framework-level verification of the proposed reconstruction strategy under the current data availability. Further evaluation on authentic K-band microwave video SAR data will be conducted when such datasets become available. In addition, we will consider extending the method to multi-scale and arbitrary-scale super-resolution by redesigning the upsampling strategy and training scheme, and by validating on datasets that support a broader range of resolution gaps.

Author Contributions

Conceptualization, X.H.; methodology, X.H. and L.W.; software, X.H. and Y.Z.; validation, X.H. and Y.Z.; formal analysis, X.H. and C.Z.; investigation, X.H. and C.Z.; resources, J.D.; writing (original draft preparation), X.H. and L.W.; supervision, L.W. and J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation of China under Grant 62401436 and in part by the China Postdoctoral Science Foundation under Grant 2023M742753.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Damini, A.; Balaji, B.; Parry, C.; Mantle, V. A VideoSAR mode for the x-band wideband experimental airborne radar. In Proceedings of the SPIE, Algorithms for Synthetic Aperture Radar Imagery XVII, Orlando, FL, USA, 5–9 April 2010; pp. 135–145.
2. Wallace, H.B. Development of a video SAR for FMV through clouds. In Proceedings of the SPIE, Algorithms for Synthetic Aperture Radar Imagery XXII, Baltimore, MD, USA, 20–24 April 2015; pp. 64–65.
3. Wells, L.; Sorensen, K.; Doerry, A.; Remund, B. Developments in SAR and IFSAR systems and technologies at Sandia National Laboratories. In Proceedings of the 2003 IEEE Aerospace Conference, Big Sky, MT, USA, 8–15 March 2003; pp. 1085–1095.
4. Raynal, A.M.; Bickel, D.L.; Doerry, A.W. Stationary and moving target shadow characteristics in synthetic aperture radar. In Proceedings of the SPIE, Algorithms for Synthetic Aperture Radar Imagery XXI, Baltimore, MD, USA, 5–9 May 2014; pp. 413–427.
5. Wang, H.; Chen, Z.; Zheng, S. Preliminary Research of Low-RCS Moving Target Detection Based on Ka-Band Video SAR. IEEE Geosci. Remote Sens. Lett. 2017, 14, 811–815.
6. Ding, J.; Wen, L.; Zhong, C.; Loffeld, O. Video SAR Moving Target Indication Using Deep Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7194–7204.
7. Zhong, C.; Ding, J.; Zhang, Y. Joint Tracking of Moving Target in Single-Channel Video SAR. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5212718.
8. Tian, X.; Liu, J.; Mallick, M.; Huang, K. Simultaneous detection and tracking of moving-target shadows in ViSAR imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 1182–1199.
9. Palm, S.; Sommer, R.; Janssen, D.; Tessmann, A.; Stilla, U. Airborne Circular W-Band SAR for Multiple Aspect Urban Site Monitoring. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6996–7016.
10. Kim, S.H.; Fan, R.; Dominski, F. ViSAR: A 235 GHz radar for airborne applications. In Proceedings of the 2018 IEEE Radar Conference (RadarConf18), Oklahoma City, OK, USA, 23–27 April 2018; pp. 1549–1554.
11. Zuo, F.; Li, J.; Hu, R.; Pi, Y. Unified Coordinate System Algorithm for Terahertz Video-SAR Image Formation. IEEE Trans. Terahertz Sci. Technol. 2018, 8, 725–735.
12. Hawley, R.W.; Garber, W.L. Aperture weighting technique for video synthetic aperture radar. In Proceedings of the SPIE, Algorithms for Synthetic Aperture Radar Imagery XVIII, Orlando, FL, USA, 25–29 April 2011; pp. 67–73.
13. Moses, R.L.; Ash, J.N. An autoregressive formulation for SAR backprojection imaging. IEEE Trans. Aerosp. Electron. Syst. 2011, 47, 2860–2873.
14. Miller, J.; Bishop, E.; Doerry, A. An application of backprojection for video SAR image formation exploiting a subaperature circular shift register. In Proceedings of the SPIE, Algorithms for Synthetic Aperture Radar Imagery XX, Baltimore, MD, USA, 29 April–3 May 2013; pp. 66–79.
15. Song, X.; Yu, W. Processing video-SAR data with the fast backprojection method. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 2838–2848.
16. Cheng, Y.; Ding, J.; Sun, Z.; Zhong, C. Processing of Airborne Video SAR Data Using the Modified Back Projection Algorithm. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
17. Kim, S.; Yu, J.; Jeon, S.-Y.; Dewantari, A.; Ka, M.-H. Signal processing for a multiple-input, multiple-output (MIMO) video synthetic aperture radar (SAR) with beat frequency division frequency-modulated continuous wave (FMCW). Remote Sens. 2017, 9, 491.
18. Ding, J.; Zhang, K.; Huang, X.; Xu, Z. High Frame-Rate Imaging Using Swarm of UAV-Borne Radars. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5204912.
19. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
20. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481.
21. Moser, B.B.; Raue, F.; Frolov, S.; Palacio, S.; Hees, J.; Dengel, A. Hitchhiker’s Guide to Super-Resolution: Introduction and Recent Advances. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9862–9882.
22. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 8–16 October 2016; pp. 391–407.
23. Luo, Z.; Yu, J.; Liu, Z. The super-resolution reconstruction of SAR image based on the improved FSRCNN. J. Eng. 2019, 2019, 5975–5978.
24. Sun, G.; Zhang, F. Convolutional neural network (CNN)-based fast back projection imaging with noise-resistant capability. IEEE Access 2020, 8, 117080–117085.
25. Wei, Y.; Li, Y.; Ding, Z.; Wang, Y.; Zeng, T.; Long, T. SAR parametric super-resolution image reconstruction methods based on ADMM and deep neural network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10197–10212.
26. Wang, L.; Zheng, M.; Du, W.; Wei, M.; Li, L. Super-resolution SAR image reconstruction via generative adversarial network. In Proceedings of the 12th International Symposium on Antennas, Propagation and EM Theory (ISAPE), Hangzhou, China, 3–6 December 2018; pp. 1–4.
27. Shen, H.; Lin, L.; Li, J.; Yuan, Q.; Zhao, L. A residual convolutional neural network for polarimetric SAR image super-resolution. ISPRS J. Photogramm. Remote Sens. 2020, 161, 90–108.
28. Li, Y.; Zhou, L.; Xu, F.; Chen, S. OGSRN: Optical-guided super-resolution network for SAR image. Chin. J. Aeronaut. 2022, 35, 204–219.
29. Zhang, C.; Zhang, Z.; Deng, Y.; Zhang, Y.; Chong, M.; Tan, Y.; Liu, P. Blind Super-Resolution for SAR Images with Speckle Noise Based on Deep Learning Probabilistic Degradation Model and SAR Priors. Remote Sens. 2023, 15, 330.
30. Kong, Y.; Liu, S. DMSC-GAN: A c-GAN-Based Framework for Super-Resolution Reconstruction of SAR Images. Remote Sens. 2024, 16, 50.
31. Jiang, N.; Zhao, W.; Wang, H.; Luo, H.; Chen, Z.; Zhu, J. Lightweight Super-Resolution Generative Adversarial Network for SAR Images. Remote Sens. 2024, 16, 1788.
32. Dong, G.; Wang, Y.; Liu, H.; Liu, S. Complex-Valued SAR Image Super-Resolution via Subaperture Learning and Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5209714.
33. Yan, S. Review of the development status for ViSAR techniques. Syst. Eng. Electron. 2024, 46, 2650–2666.
34. Zhang, L.; Li, H.; Qiao, Z.; Xu, Z. A Fast BP Algorithm With Wavenumber Spectrum Fusion for High-Resolution Spotlight SAR Imaging. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1460–1464.
35. Tan, C.; Gao, Z.; Li, S.; Li, S.Z. SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning. IEEE Trans. Multimed. 2025, 27, 5170–5184.
36. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752.
37. Conradsen, K.; Nielsen, A.A.; Schou, J.; Skriver, H. A test statistic in the complex Wishart distribution and its application to change detection in polarimetric SAR data. IEEE Trans. Geosci. Remote Sens. 2003, 41, 4–19.
38. Lee, J.S. Speckle analysis and smoothing of synthetic aperture radar images. Comput. Graph. Image Process. 1981, 17, 24–32.
39. Shocher, A.; Cohen, N.; Irani, M. “Zero-Shot” Super-Resolution Using Deep Internal Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3118–3126.

Figure 1. Video SAR images with different azimuth resolutions and corresponding zoomed-in views. From left to right in each row: the full scene, a magnified view of the red-box region in the left image, and a magnified view of the white-box region in the left image. (a) 0.28 m. (b) 0.14 m. (c) 0.07 m.

Figure 2. High-frame-rate and high-resolution imaging framework for microwave video SAR based on deep learning, where H and W are respectively the height and width of the high-resolution image with blurred shadows.

Figure 3. Architecture of the designed super-resolution reconstruction network, where cuboids denote feature maps.

Figure 4. Optical images of the scenes imaged by (a) Radar A and (b) Radar B.

Figure 5. Results obtained by different methods on the 16th sample from the first video SAR dataset. (a) VSAR-OAP. (b) LR-VSAR. (c) IFSRCNN. (d) LSRGAN. (e) ZSR. (f) BSR. (g) Proposed approach.

Figure 6. Representative results of the proposed approach with different values of the parameter β and the kernel size k_c. (a) β = 0.2 and k_c = 5 × 9. (b) β = 0 and k_c = 5 × 9. (c) β = 0.4 and k_c = 5 × 9. (d) β = 0.2 and k_c = 1 × 1. (e) β = 0.2 and k_c = 7 × 13.

Figure 7. Representative results of the proposed approach with different stacking depths of the gSTA block. (a) 0. (b) 1. (c) 2. (d) 3.

Figure 8. Ablation results of different fusion cases. (a) Fusion by two SDA blocks. (b) Without the shadow-blurred high-resolution image as assistance. (c) Only using the first SDA block for fusion. (d) Only using the second SDA block for fusion. (e) Fusion by channel concatenation.

Figure 9. Results obtained by different methods on the 16th sample from another video SAR dataset. (a) VSAR-OAP. (b) LR-VSAR. (c) IFSRCNN. (d) LSRGAN. (e) ZSR. (f) BSR. (g) Proposed approach.

Table 1. Parameters of two different video SAR systems.

Parameters	Radar A	Radar B
Center frequency	94 GHz	92.92 GHz
Bandwidth	1000 MHz	900 MHz
Pulse width	12 μs	20 μs
Platform velocity	46 m/s	75 m/s
Platform height	1263 m	2470 m
Number of pulses	1,600,000	145,000
Radius of circular flight path	2400 m	6650 m
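
As a rough worked example, a few derived quantities can be computed from the parameters in Table 1. The sketch below is not taken from the paper: it assumes a flat-earth geometry with the scene center at the center of the circular flight path, the textbook range resolution c/(2B), and the small-angle spotlight approximation ρ_a ≈ λR/(2vT) for azimuth resolution. The dictionary keys and function names are hypothetical, and the 0.07 m target is borrowed from Figure 1 purely for illustration, since the caption does not say which radar produced those images.

```python
import math

C = 299_792_458.0  # speed of light in m/s

# Values copied from Table 1; the key names are illustrative only.
RADARS = {
    "Radar A": {"fc_hz": 94e9, "bw_hz": 1000e6, "height_m": 1263.0,
                "radius_m": 2400.0, "v_mps": 46.0},
    "Radar B": {"fc_hz": 92.92e9, "bw_hz": 900e6, "height_m": 2470.0,
                "radius_m": 6650.0, "v_mps": 75.0},
}

def derived(p):
    """Return (wavelength, slant-range resolution, slant range) in meters."""
    wavelength = C / p["fc_hz"]
    range_res = C / (2.0 * p["bw_hz"])  # textbook c / (2B)
    # Assumption: flat earth, scene center at the center of the flight circle.
    slant_range = math.hypot(p["height_m"], p["radius_m"])
    return wavelength, range_res, slant_range

def aperture_time(p, azimuth_res_m):
    """Approximate aperture time in seconds for a target azimuth resolution.

    Uses the small-angle approximation rho_a ~ lambda * R / (2 * v * T).
    """
    wavelength, _, slant_range = derived(p)
    return wavelength * slant_range / (2.0 * p["v_mps"] * azimuth_res_m)

if __name__ == "__main__":
    for name, p in RADARS.items():
        lam, rr, sr = derived(p)
        t = aperture_time(p, 0.07)  # 0.07 m is an illustrative target only
        print(f"{name}: wavelength {lam * 1e3:.3f} mm, range resolution {rr:.3f} m, "
              f"slant range {sr:.0f} m, aperture time for 0.07 m: {t:.2f} s")
```

Under these assumptions, a finer azimuth resolution requires a proportionally longer aperture time, which is the resolution versus frame-rate trade-off the paper sets out to address.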

Table 2. Assessment of different methods on the first test set. Higher MPSNR ↑ means better super-resolution reconstruction quality, and lower AISR ↓ indicates stronger shadow preservation.

Methods	MPSNR (dB) ↑	AISR ↓
VSAR-OAP	-	0.3317
LR-VSAR	14.1413	0.2002
IFSRCNN	15.5998	0.2302
LSRGAN	15.8491	0.2268
ZSR	12.9469	0.2329
BSR	13.5244	0.2026
Proposed approach	48.2291	0.2057

Table 3. Inference speed of different methods.

Methods	IFSRCNN	LSRGAN	ZSR	BSR	Proposed Approach
FPS	18.748	2.270	0.002	2.067	10.341
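
To make the speed comparison easier to read, the frame rates in Table 3 can be converted to per-frame processing times. The short sketch below only inverts the reported FPS values; the variable and function names are hypothetical. Because the table reports FPS to three decimals, the time derived for ZSR in particular is only a coarse estimate.

```python
# FPS values copied from Table 3.
FPS = {
    "IFSRCNN": 18.748,
    "LSRGAN": 2.270,
    "ZSR": 0.002,
    "BSR": 2.067,
    "Proposed approach": 10.341,
}

def seconds_per_frame(fps):
    """Convert a frame rate in frames per second to seconds per frame."""
    return 1.0 / fps

if __name__ == "__main__":
    for name, fps in FPS.items():
        print(f"{name}: {seconds_per_frame(fps):.3f} s per frame")
```

Read this way, the proposed approach needs roughly a tenth of a second per frame, while ZSR, which the rounded figure puts at several hundred seconds per frame, is far from video rate.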

Table 4. Quantitative assessment of the influence of the parameter β and the kernel size on the reconstruction performance. Higher MPSNR ↑ means better super-resolution reconstruction quality, and lower AISR ↓ indicates stronger shadow preservation.

Parameter β	Kernel Size	MPSNR (dB) ↑	AISR ↓
0.2	5 × 9	48.2291	0.2057
0	5 × 9	51.8546	0.3416
0.4	5 × 9	41.6733	0.2028
0.2	1 × 1	44.8726	0.2070
0.2	7 × 13	49.5854	0.2109

Table 5. Quantitative assessment of the impact of the number of gSTA blocks on reconstruction performance. Higher MPSNR ↑ means better super-resolution reconstruction quality, and lower AISR ↓ indicates stronger shadow preservation.

Number of gSTA Blocks	MPSNR (dB) ↑	AISR ↓
0	44.0777	0.2819
1	46.5003	0.2125
2	48.2291	0.2057
3	48.4652	0.2067

Table 6. Ablation study on the effectiveness of using the shadow-blurred high-resolution image as assistance and the SDA-based fusion strategy in the fusion-and-reconstruction module. Higher MPSNR ↑ means better super-resolution reconstruction quality, and lower AISR ↓ indicates stronger shadow preservation.

Cases	MPSNR (dB) ↑	AISR ↓
Fusion by two SDA blocks	48.2291	0.2057
Without the shadow-blurred high-resolution image as assistance	16.1459	0.2067
Only using the first SDA block for fusion	30.9863	0.2043
Only using the second SDA block for fusion	46.7100	0.2077
Fusion by channel concatenation	46.4654	0.2055

Table 7. Assessment of different methods on the second test set. Higher MPSNR ↑ means better super-resolution reconstruction quality, and lower AISR ↓ indicates stronger shadow preservation.

Methods	MPSNR (dB) ↑	AISR ↓
VSAR-OAP	-	0.1542
LR-VSAR	15.6533	0.0967
IFSRCNN	17.4123	0.1505
LSRGAN	17.4237	0.1096
ZSR	13.8451	0.1016
BSR	14.7652	0.1016
Proposed approach	44.1817	0.0999

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
