Article

PC-YOLO: Moving Target Detection in Video SAR via YOLO on Principal Components

1 State Key Laboratory of Space Information System and Integrated Application, Space Star Technology Co., Ltd., Beijing 100086, China
2 State Key Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 510; https://doi.org/10.3390/rs18030510
Submission received: 21 November 2025 / Revised: 24 December 2025 / Accepted: 24 December 2025 / Published: 5 February 2026

Highlights

What are the main findings?
  • We reduce the heavy clutter in ViSAR by a principal component decomposition technique, improving the SNR of the SAR images and thereby benefiting the subsequent task of moving target detection. A hierarchical learning mechanism composed of self-attention, spatial attention, and channel attention is presented, allowing much more effective features to be learned for the detection task.
  • A cross-scale feature fusion strategy composed of temporal fusion and spatial fusion is presented within the look once detection framework, so that moving targets with mutable scales can be located accurately. Multiple rounds of experiments were performed, and the results demonstrate the advantage of the proposed method over both classical methods and state-of-the-art detection algorithms.
What are the implications of the main findings?
  • A new clutter removal strategy specific to the task of moving target detection. Unlike preceding works, we regard the complicated imaging scenario as a combinatorial scene, in which the moving targets contribute the sparse components, while the background clutter forms the low-rank components. To the best of our knowledge, this assumption is presented for ViSAR for the first time in this paper.
  • A new detection framework for shadow extraction. Unlike generic methods, a hierarchical attention mechanism and cross-scale feature fusion are presented, so that targets of various scales can be located accurately.

Abstract

Video synthetic aperture radar could provide more valuable information than static images. However, it suffers from several difficulties, such as strong clutter, low signal-to-noise ratio, and variable target scale. The task of moving target detection is therefore difficult to achieve. To solve these problems, this paper proposes a model and data co-driven learning method called look once on principal components (PC-YOLO). Unlike preceding works, we regarded the imaging scenario as a combination of low-rank and sparse scenes in theory. The former models the global, slowly varying background information, while the latter expresses the localized anomalies. These were then separated using the principal component decomposition technique to reduce the clutter while simultaneously enhancing the moving targets. The resulting principal components were then handled by an improved version of the look once framework. Since the moving targets featured various scales and weak scattering coefficients, the hierarchical attention mechanism and the cross-scale feature fusion strategy were introduced to further improve the detection performance. Finally, multiple rounds of experiments were performed to verify the proposed method, with the results proving that it could achieve more than 30% improvement in mAP compared to classical methods.

1. Introduction

Synthetic aperture radar (SAR) sensors have been widely applied across various civilian and military fields. Two-dimensional (2-D) high-resolution SAR images can be generated by pulse compression and coherent accumulation techniques along the range and azimuth directions, respectively. Phase history data can be jointly compressed and focused using imaging algorithms such as the range Doppler algorithm, the polar format algorithm, the chirp scaling algorithm, and the ω-K algorithm [1,2]. Given the high resolution of SAR images, valuable information can be extracted by image interpretation for target detection and recognition [3].
In realistic scenarios, both static and moving targets are commonly present. Compared with static targets, the scattering characteristics of moving targets differ significantly, as they are displaced over time and occupy varying spatial locations. The motion phenomenology causes the target imagery to be defocused by spatially variant phase errors [4]. Moving targets are therefore much more difficult for SAR sensors to discern, especially in complex, cluttered environments. A set of exemplars is shown in Figure 1. The current solutions for moving target detection can be summarized from two perspectives: SAR images and video SAR (ViSAR).

1.1. Moving Target Detection via SAR Images

Early studies typically achieved the task of moving target detection using both single-channel and multi-channel SAR images. The single-channel data were collected using radar with a single antenna and provided very limited information. The moving target detection performance was therefore poor, especially for slow-moving targets on the ground. For the multi-channel SAR system, a series of radar antenna arrays was deployed on airborne or spaceborne platforms to allow significant amounts of higher-dimensional information to be collected by the phase center antenna array. The detection performance was improved accordingly. Typical multi-channel methods for moving target detection include the Displaced Phase Center Antenna (DPCA) [5], the Along-Track Interferometry (ATI) [6,7], and the Space-Time Adaptive Processing (STAP) [8] methods.
  • The DPCA technique leverages the identical scattering characteristics of stationary clutter across channels. By subtracting the channel echo signals, stationary clutter can be filtered out, whereas moving targets, due to their distinct motion characteristics, are not canceled by this subtraction, enabling their detection.
  • The ATI technique operates on the principle that stationary targets have a consistent echo phase between channels, whereas moving targets exhibit a phase variation. By comparing this inter-channel phase information, moving targets can be detected, and these phase data can also be used to estimate the target’s radial velocity.
Additionally, the multi-channel radar echo data can also be processed by jointly applying DPCA and ATI to achieve enhanced performance [9,10]. However, DPCA and ATI impose strict constraints on the configuration of the radar antenna, and their detection performance deteriorates in strong clutter environments. Another technique, STAP, was then proposed [11], which employs adaptive algorithms to estimate the statistical properties of clutter in the joint space-time domain. A two-dimensional filter that is precisely matched to the clutter spectrum can then be formed, suppressing the clutter while preserving the signal of the moving target. Even so, the detection performance of STAP for moving targets remains limited in practice.
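To make the two channel-domain ideas concrete, the following minimal NumPy sketch simulates a two-channel measurement in which the stationary clutter repeats across channels while a mover acquires an inter-channel phase. It is an illustration only; the wavelength, time lag, and velocity are assumed values, not parameters from the cited systems.

```python
import numpy as np

# Two-channel toy: clutter is identical across channels, while a mover picks
# up an extra inter-channel phase proportional to its radial velocity v_r.
rng = np.random.default_rng(0)
n = 1024                      # range cells
wavelength = 0.018            # m (Ku band, assumed)
v_r = 5.0                     # radial velocity of the mover, m/s (assumed)
dt = 0.01                     # time lag between phase centers, s (assumed)

clutter = rng.normal(size=n) + 1j * rng.normal(size=n)
target = np.zeros(n, dtype=complex)
target[500] = 10.0            # a single moving scatterer

ch1 = clutter + target
ch2 = clutter + target * np.exp(1j * 4 * np.pi * v_r * dt / wavelength)

dpca = ch1 - ch2                          # DPCA: clutter cancels, mover survives
ati_phase = np.angle(ch1 * np.conj(ch2))  # ATI: phase peaks at the mover

print(np.abs(dpca).argmax())  # -> 500, the mover's range cell
print(ati_phase[500])         # nonzero phase usable for velocity estimation
```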
Recent studies have therefore paid more attention to advanced signal processing techniques, such as principal component analysis. Yang et al. proposed a fast RPCA-based (robust principal component analysis) detection method for multichannel SAR under a strong clutter background [12]. They applied RPCA in the range-Doppler domain to improve STAP performance, and their experiments showed that the method remained effective under channel imbalance or platform motion error. Guo et al. presented a moving target detection method based on RPCA for SAR systems [13], in which an atomic norm-based optimization program was constructed to transform the data sparsity requirement into a moving target sparsity requirement. Li et al. developed an along-track interferometry GoDec approach for ground moving target indication (GMTI) in multichannel SAR systems under a strong clutter background [14]. The proposed method was separated into predetection and postdetection: the former reduced the number of missed targets using an efficient ATI-RPCA, while the latter reduced the number of false alarms using a magnitude and phase processing technique. Ramos et al. presented two ground scene estimation methods for SAR images [15], where RPCA via principal component pursuit and tensor RPCA via the tensor nuclear norm were employed to produce the ground scene estimates. On this basis, a combination of RPCA and STAP techniques was proposed for moving target detection in Ref. [16].

1.2. Moving Target Detection via Video SAR

Unlike in SAR images, targets can be observed continuously using video SAR technology, which makes much more information available for subsequent tasks such as target detection. Popular methods for target detection in video SAR include signal processing techniques and shadow discernment.
  • Signal processing techniques. Many kinds of signal processing techniques have been proposed to achieve target detection in video SAR. These methods identify moving targets through the statistical variations in the radar echoes. Since the scattering signatures of moving targets change significantly over time, the amplitude, phase, and frequency of the received echoes are altered accordingly. A natural idea is therefore to extract motion information by comparing the amplitude, phase, and frequency of radar echoes, enabling target detection and tracking. Fan et al. proposed an echo-driven motion compensation framework to improve THz-ViSAR images [17]. Airborne motion errors were modeled as both radial and vibration errors and were further estimated from the echoes. Yan et al. derived the mathematical model of moving target echoes in a video SAR system [18], allowing typical system parameters for motion trajectory estimation to be simulated. Li et al. developed a curvilinear moving target refocusing method for ViSAR images [19]. The localized phase-gradient autofocus trick was used to compensate for Doppler chirp-rate inconsistencies, while additional spatial-domain information was used for geometric deformation correction. Gou et al. proposed a circular-track video SAR detection method using sub-aperture segmentation [20], which achieved a high detection probability in high-frame-rate scenes. However, methods based on signal processing are sensitive to ground clutter and multipath interference, which reduce accuracy in complex environments.
  • Shadow discernment. Moving targets exhibit a Doppler shift in video SAR, and the energy leaking across range cells causes their imagery to be defocused. The target energy is displaced along the azimuth direction and spread over a certain angular extent, so little echo energy is mapped back to the target's true position, which therefore appears as a distinct shadow. These characteristics can be used to approximate the true position and state information of targets, so another idea is to detect targets from their shadows. Early methods include the inter-frame difference method and the background difference method. Luo et al. applied image binary operations after differencing the obtained background model [21]. Morphological processing was then imposed on the resulting binary image to extract moving target shadows and suppress false alarms. Zhang et al. combined the background difference model with the three-frame difference method for shadow extraction [22]. Zhang et al. proposed extracting the target by subtracting three sequential frames [23]; the extracted images were then handled using mathematical morphology. Wang et al. performed several rounds of simulation experiments on low Radar Cross Section (RCS) moving target signals [24], which were then transferred into real Ka-band airborne video SAR images; the detection method was effective even for stealth targets. He et al. introduced a ViSAR generation method based on fast factorization back-projection to address the problem of speed limitations [25], which is applicable to both video SAR generation and moving target detection. Yin et al. presented a decomposition framework based on a low-rank sparse decomposition model [26], combined with trajectory area extraction to achieve fast and accurate detection in specific regions.
Likewise, recent studies have gradually relied on the learning technique for target detection. Zhang et al. proposed a moving target tracking approach with shadow detection and tracking [27]. The CNN tracking classification was applied to potential moving target candidates extracted from a sequence of temporal and spatial sub-aperture SAR images. Ding et al. presented a framework for shadow-aided moving target detection with the combination of a faster region-based convolutional neural network (Faster-RCNN) and bi-directional long-short-term memory (Bi-LSTM) [28]. The former was used to detect shadows in a single frame, while the latter was employed to suppress missing alarms. Yan et al. developed a deep framework composed of a 3-D convolutional encoding path, a 2-D convolutional decoding path, and a bridge path to capture the target’s shadow information [29]. The spatiotemporal features and high-level semantics of raw continuous images can be exploited accordingly. Yang et al. proposed a fast multiscale feature extraction module embedded with triplet attention to tackle the SAR shadow tracking problem [30]. Wen et al. combined shadow detection in SAR images and Doppler energy detection in the range-Doppler spectrum using a dual faster region-based convolutional neural network [31].

1.3. Our Solution

Although many approaches devoted to moving target detection have been presented in prior work, they still suffer from some problems. Their typical difficulties can be summarized as follows.
  • Manual operations for feature extraction. Traditional methods, such as the inter-frame difference and the background difference, rely on many manual engineering operations, such as frame registration, de-speckling, and background estimation. These intermediate operations have a significant impact on the final detection performance.
  • Strong clutter environments. Unlike common optical images, the frames of ViSAR are formed by coherent imaging, where both diffuse and specular reflections are present. The signal-to-noise ratio (SNR) of a SAR image is therefore much lower. Low-SNR environments make shadow extraction much more difficult, and how to address the clutter remains an open problem.
  • Variable target scales. In radar images, shadows form because the electromagnetic echoes are occluded by the target itself, so the scale of a shadow depends on the size of the target, the incidence angle, and the resolution. These factors make the scale of shadows mutable in ViSAR. A further problem is how to extract shadows of various scales, especially those that cover only a few resolution cells.
To address these problems, this paper proposes a model and data co-driven cross-learning method. Unlike prior work, we regard the imaging scenario in ViSAR as a combination of low-rank and sparse scenes. Specifically, the moving targets (local anomalous patterns) contribute the sparse scene, while the global, slowly varying background (clutter) forms the low-rank scene. A sparse reconstruction and low-rank decomposition technique is then introduced to separate them accurately. The task of moving target detection in ViSAR can thus be decoupled into two sub-tasks: clutter suppression via an empirical model, and shadow extraction via a data-driven model. The first sub-task is achieved using the sparse and low-rank decomposition technique, and the resulting components are then delivered to the look once neural network. Since the moving targets have varying scales and weak scattering coefficients, a hierarchical attention learning mechanism and a cross-scale feature fusion strategy are proposed, so that the shadows of moving targets can be extracted accurately. To verify the proposed method, multiple rounds of experiments were performed. The results prove that the proposed method achieves more than a 30% improvement in mAP compared to classical methods.

2. The Signal Model

The primary operational modes for video SAR are stripmap and spotlight. In comparison, the spotlight mode can continuously observe a fixed area to obtain richer and more precise information. In this section, the video SAR mechanism is demonstrated using the spotlight imaging mode.
The spatial model for spotlight SAR imaging is illustrated in Figure 2. Suppose the aircraft flies along the positive direction of the X axis at a velocity of v, with a flight altitude of h. The scene center point is denoted as O, and the size of the swath is $W_a \times W_r$. The notation used in the following sections is summarized in Table 1.
Suppose the radar transmits the linear frequency modulation (LFM) signal
$$s_t(t) = \sigma(t) \cdot \mathrm{rect}\left(\frac{t}{T_r}\right) \exp\left(j\left(2\pi f_c t + \pi K t^2\right)\right),$$
where $f_c$ is the carrier frequency, $K$ is the chirp rate, $T_r$ is the pulse duration, and $\sigma$ denotes the scattering coefficient.

2.1. Range Compression

Given the time delay $\tau$ from the scene, the received signal is
$$s_r(t) = \sigma(t) \cdot \mathrm{rect}\left(\frac{t - \tau}{T_r}\right) \exp\left(j\left(2\pi f_c (t - \tau) + \pi K (t - \tau)^2\right)\right).$$
Range compression is achieved by matched filtering, $s_c(t) = s_r(t) \ast h(t)$, where the reference function is the conjugated, time-reversed replica of the transmitted signal, $h(t) = s_t^*(-t)$. The range resolution is determined by the bandwidth of the transmitted signal, $\rho_r = \frac{c}{2B}$, where $c$ is the speed of light.
For a simple pulse-modulated transmitted signal, the range resolution depends on the pulse width, whose setting is in turn constrained by the required detection range and transmitted energy. Therefore, SAR typically employs signals with a large time-bandwidth product, combined with pulse compression technology, to achieve high range resolution while meeting the transmission power requirements.
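A toy NumPy sketch of this pulse-compression chain is given below; the sampling rate, bandwidth, pulse duration, and target delay are assumed values chosen only for illustration.

```python
import numpy as np

# Baseband LFM pulse and matched-filter range compression (toy parameters).
fs = 200e6                 # sampling rate, Hz (assumed)
B = 100e6                  # chirp bandwidth, Hz (assumed)
Tp = 10e-6                 # pulse duration, s (assumed)
K = B / Tp                 # chirp rate
t = np.arange(-Tp / 2, Tp / 2, 1 / fs)

s_t = np.exp(1j * np.pi * K * t**2)        # transmitted LFM pulse (baseband)
h = np.conj(s_t[::-1])                     # matched filter: conjugated, time-reversed

tau = 300                                  # point-target delay in samples
echo = np.concatenate([np.zeros(tau, dtype=complex), s_t])

s_c = np.convolve(echo, h, mode="valid")   # range-compressed output
print(np.abs(s_c).argmax())                # -> 300, i.e., the target delay
print(3e8 / (2 * B))                       # rho_r = c/(2B) = 1.5 m
```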

2.2. Azimuth Resolution

According to the spatial model of spotlight imaging, the instantaneous Doppler frequency of the echo signal at time $t_s$ is
$$f_{as} = \frac{2 v \cos\left(\theta_c - \Delta\theta / 2\right)}{\lambda}.$$
Likewise, the instantaneous Doppler frequency of the received signal at time $t_e$ is
$$f_{ae} = \frac{2 v \cos\left(\theta_c + \Delta\theta / 2\right)}{\lambda}.$$
So, the Doppler bandwidth of the echo signal over the entire observation time $T$ is
$$B_a = f_{as} - f_{ae} = \frac{4 v}{\lambda} \sin\left(\Delta\theta / 2\right) \sin\left(\theta_c\right).$$
The azimuth resolution can then be obtained as
$$\rho_a = \frac{v \sin\theta_c}{B_a} = \frac{\lambda}{4 \sin\left(\Delta\theta / 2\right)} \approx \frac{\lambda}{2 \Delta\theta}.$$
It depends on the wavelength $\lambda$ and the rotation angle $\Delta\theta$ relative to the scene center, and is independent of the size of the radar antenna $D$. However, a long spotlight duration leads to a large squint angle, which produces range-azimuth coupling and range cell migration, thereby deteriorating the focusing quality. Thus, a circular trajectory is typically used for ViSAR.
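As a quick numerical check of the azimuth resolution expression above, the exact and small-angle forms agree closely; the wavelength below is the Ku-band value from Table 2, while the 5° rotation angle is an assumption.

```python
import numpy as np

wavelength = 0.018              # m, Ku band (Table 2)
d_theta = np.deg2rad(5.0)       # assumed rotation angle

rho_exact = wavelength / (4 * np.sin(d_theta / 2))
rho_approx = wavelength / (2 * d_theta)
print(rho_exact, rho_approx)    # both ~0.103 m, close to the 0.1 m in Table 2
```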

2.3. The Generation of ViSAR

In the small-angle spotlight mode, the synthetic aperture length can be approximately expressed as $L = r_c \Delta\theta$. Thus, the synthetic aperture time is
$$T = \frac{r_c \Delta\theta}{v}.$$
The frame rate of SAR imaging can then be defined as
$$F = \frac{1}{T} = \frac{v}{r_c \Delta\theta} = \frac{2 v \rho_a}{\lambda r_c} = \frac{2 v \rho_a f_c}{c\, r_c}.$$
So the SAR imaging frame rate is related to the platform velocity $v$, the azimuth resolution $\rho_a$, the operating frequency $f_c$, and the slant range $r_c$. For a given radar system, $\rho_a$ and $r_c$ are predetermined [32]. Likewise, the platform velocity $v$ can only be tuned within a limited range; furthermore, an increase in platform velocity leads to Doppler spectrum aliasing and clutter spectrum broadening. A reasonable way to boost the ViSAR frame rate is therefore to increase the carrier frequency of the radar.
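A back-of-the-envelope evaluation of this relation is sketched below. The slant range is not stated in the paper; it is assumed here from the platform height and incidence angle in Table 2.

```python
# Frame rate F = 2 * v * rho_a * f_c / (c * r_c) with Table 2 values.
v = 245 / 3.6          # platform speed, m/s
rho_a = 0.1            # cross-range resolution, m
f_c = 16.7e9           # carrier frequency, Hz
c = 3e8                # speed of light, m/s
r_c = 2e3 / 0.4226     # assumed slant range: 2 km altitude / cos(65 deg)

F = 2 * v * rho_a * f_c / (c * r_c)
print(F)               # ~0.16 Hz per independent full aperture
# Display rates such as 30 FPS are typically obtained by overlapping
# successive sub-apertures rather than by independent full apertures.
```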

2.4. The Formation of Shadow

According to the ViSAR imaging mechanism, it is easy to conclude that shadows form in significantly different ways for stationary and moving targets. For a stationary target, the shadow accompanies the target, and the concept is similar to a shadow in an optical image. When the radar beam illuminates a target of a certain height at a specific angle, the front side produces a reflected signal and therefore exhibits scattering characteristics in the image, while the area behind the target is occluded by the target itself. The echo energy from this area is lost, forming a dark region in the SAR image.
Conversely, when the target moves with a certain velocity along some direction, the shadow forms differently. If there is movement along the range direction, the layover area shifts away from its true position; if there is movement along the azimuth, the imaging result of the target is defocused. In both circumstances, the shadow of the target is highlighted further. It is therefore convenient to detect the target through its shadow.

3. The Proposed Method

One of the major advantages of ViSAR is its ability to monitor targets continuously, so that much more information can be provided. However, it also suffers from some issues, such as low SNR, heavy clutter, and the mutable scale of shadows. In this paper, a model and data co-driven learning method, look once on principal components (PC-YOLO), is proposed. We decouple the detection of moving targets into two subtasks: clutter removal and shadow extraction. The former is achieved using a sparse, low-rank constrained decomposition model, while the latter is achieved using an improved detection network.

3.1. Method Overview and Motivation

Inspired by previous work, this paper proposes target detection via look once on principal components. We first suppose that the background (composed of clutter and speckle) of the frames in video SAR is usually strongly correlated; the frames therefore span a low-rank linear subspace. Conversely, the foreground (composed of targets) exhibits different motion modes and texture features from the background, and can be regarded as a small amount of gross error that deviates from the low-rank subspace. The stationary and moving components can thus be separated using low-rank and sparsity constraints, and the task of moving target detection can be decoupled into clutter removal and shadow extraction.
  • Clutter removal. To suppress the clutter, the imaging scenarios were viewed as the combination of low-rank and sparse components. The former models the components with strong correlations, while the latter represents the motion modes. Each ViSAR frame was then decomposed into a sparse component and a low-rank component. The background clutter was suppressed, while the targets were enhanced.
  • Shadow extraction. To extract shadows of various sizes, a new detection algorithm was proposed, where the cross-scale feature fusion strategy and the hierarchical attention mechanism were deployed. The sequential representations of the multi-scale features generated from the backbone were formed to obtain the features at different levels of detail. They were further combined to enhance the perception ability of multi-scale targets so that the detection performance could be improved accordingly.
The overall implementation of the proposed method is shown in Figure 3. The video SAR is first rearranged into frame sequences, and each frame is normalized to ensure scattering consistency. Each frame is then decomposed into the low-rank and sparse components $P_L$ and $P_S$. The former models the stationary components, such as clutter and speckle, while the latter highlights the motion modes against the background. This decomposition effectively suppresses the background interference and enhances the saliency of motion-induced shadows. The results are fed into an improved target detection network that integrates a hierarchical attention mechanism and multi-scale feature fusion for shadow extraction.

3.2. The Sparse and Low-Rank Decomposition Model

In complex imaging or sensing environments, the raw ViSAR data are usually contaminated by random noise, clutter, and interference, which makes the detection of moving targets difficult. To overcome these problems, we regard the imaging scenario as the combination of sparse and low-rank components. The sparse and low-rank decomposition technique is then introduced to jointly model the two structures, so that an accurate separation between the structured background and the sparse components can be achieved. The former expresses the global, slowly varying background information, while the latter captures localized anomalies, such as moving objects or salient targets. Suppose the observed ViSAR data are $D \in \mathbb{R}^{m \times n}$; we aim to solve the following decomposition,
$$D = P_L + P_S,$$
where $P_L$ and $P_S$ denote the low-rank and sparse components, respectively. The underlying structure of the observed data can thereby be recovered. Equation (9) can be solved as a convex optimization that minimizes a combination of the nuclear norm and the $\ell_1$-norm,
$$\min_{P_L, P_S} \|P_L\|_* + \lambda \|P_S\|_1 \quad \mathrm{s.t.} \quad D = P_L + P_S,$$
where $\|P_L\|_*$ denotes the nuclear norm that enforces low-rankness, $\|P_S\|_1$ is the $\ell_1$-norm that promotes sparsity, and $\lambda$ is a trade-off parameter. This formulation can be solved efficiently by the augmented Lagrange multiplier method or the Alternating Direction Method of Multipliers (ADMM). Likewise, the principal component pursuit problem can also be relaxed as
$$\min_{P_L, P_S} \|D - P_L - P_S\|_F^2 \quad \mathrm{s.t.} \quad \mathrm{rank}(P_L) \le r, \;\; \mathrm{card}(P_S) \le k,$$
where $r$ denotes the maximum rank of $P_L$, and $k$ bounds the number of nonzero elements in $P_S$. Stochastic optimization can be used to approximate the decomposition under these constraints and is central to robust principal component analysis. Through low-rank and sparse decomposition, the heavy background clutter can be suppressed while the moving targets are highlighted. In the context of ViSAR, $P_L$ captures the stable terrain or scene background, whereas $P_S$ enhances the dynamic or anomalous components, such as moving objects or shadow regions. A high signal-to-noise representation that significantly benefits the subsequent processing tasks can thus be generated.
In video SAR imagery, consecutive frames exhibit strong temporal correlations and a dominant stationary background, whereas moving targets (shadows) primarily exhibit subtle intensity anomalies. By applying Equation (9) to a sequence of video SAR frames, the low-rank component P L models the stable background clutter and correlated speckle noise, whereas the sparse component P S isolates the transient or non-stationary features associated with moving targets and their shadows. The decomposition operations not only suppress the background clutter and the coherent speckle noise, but also enhance the saliency and discriminability of small shadow targets.
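The following NumPy sketch illustrates one standard solver for the principal component pursuit above, the inexact augmented Lagrange multiplier iteration; the step-size heuristic and toy data are assumptions for illustration, not the exact implementation used in this paper.

```python
import numpy as np

def rpca(D, lam=None, mu=None, n_iter=100, tol=1e-7):
    """Inexact ALM sketch for min ||P_L||_* + lam*||P_S||_1 s.t. D = P_L + P_S.
    Here each row of D could be one vectorized ViSAR frame."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.abs(D).sum()  # heuristic
    L = np.zeros_like(D); S = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(n_iter):
        # Low-rank update: singular-value thresholding
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: elementwise soft thresholding
        R = D - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y += mu * (D - L - S)  # dual ascent on the equality constraint
        if np.linalg.norm(D - L - S) <= tol * np.linalg.norm(D):
            break
    return L, S

# Toy usage: 50 "frames" of a rank-1 background plus a few sparse movers;
# expect a small recovered rank and roughly 30 sparse entries.
rng = np.random.default_rng(0)
background = np.outer(rng.normal(size=50), rng.normal(size=400))
movers = np.zeros((50, 400))
movers[rng.integers(0, 50, 30), rng.integers(0, 400, 30)] = 8.0
L, S = rpca(background + movers)
print(np.linalg.matrix_rank(L, tol=1e-3), np.count_nonzero(np.abs(S) > 1.0))
```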

3.3. The Look Once Detection Network

According to the presented decomposition model, ViSAR can be decomposed into sparse and low-rank components. The extracted sparse component was employed to achieve the subsequent sub-task, moving target detection. In this paper, it was realized via shadow extraction [33,34]. An improved version of the look once algorithm, incorporating cross-scale feature fusion and a hierarchical attention mechanism, was presented. The overall framework for shadow extraction is shown in Figure 4.
In SAR video imagery, the shadows of moving targets are typically small in size and dim in intensity. Their features closely resemble the background clutter of SAR images. Due to the low contrast and small size of shadow targets, generic methods often fail to detect them, especially in complex backgrounds, where shadows overlap with clutter and noise. To solve this problem, two types of improvements are presented in this section: the hierarchical attention mechanism and the cross-scale feature fusion.

3.3.1. Hierarchical Attention Mechanism

The hierarchical attention mechanism includes self-attention, channel attention, and spatial attention. The self-attention first decouples the channels through a multi-head structure.
Self-attention. Take the input feature map $F \in \mathbb{R}^{C \times H \times W}$, for example; the self-attention learning includes four steps.
Step 1. Obtain the query, key, and value vectors using the feature mapping,
$$Q, K, V = \phi_s\left(\varphi_{\mathrm{conv}1\times1}(F)\right),$$
where $\phi_s$ denotes the feature split, and $\varphi_{\mathrm{conv}1\times1}$ is the $1 \times 1$ convolution.
Step 2. Calculate the similarity measurement between the query and the key, $S = \frac{Q K^T}{\sqrt{d_k}}$, where $d_k$ is the scale factor.
Step 3. Normalize the similarity measurement using the softmax activation, $A = \mathrm{softmax}(S)$.
Step 4. Obtain the output of self-attention using the weighted summation, $F_a = A V$.
Spatial attention. Given the self-attention features $F_a \in \mathbb{R}^{C \times H \times W}$, the spatial attention mechanism partitions them into several non-overlapping sub-features along the spatial dimensions, yielding the top-left ($F_a^{(s_1)}$), top-right ($F_a^{(s_2)}$), bottom-left ($F_a^{(s_3)}$), and bottom-right ($F_a^{(s_4)}$) parts of the original feature map. Two types of attention are then applied to the sub-features, local attention and global attention,
$$F_s^{(l)} = \oplus\left(\kappa_c\left(F_a^{(s_i)}\right) \otimes F_a^{(s_i)}\right) \;\; \text{(local)}, \qquad F_s^{(g)} = \kappa_c(F_a) \otimes F_a \;\; \text{(global)},$$
where $\kappa_c$ is the channel attention mechanism, $\oplus$ is the tensor concatenation operation, and $\otimes$ is the element-wise multiplication. The resulting features are further combined, $F_s = F_s^{(l)} + F_s^{(g)}$.
Channel attention. Similar operations can be applied to the original features, forming the channel attention mechanism. The only difference is that the self-attention features are decomposed along the channel dimension rather than the spatial dimensions. The resulting features are denoted as $F_c = F_c^{(l)} + F_c^{(g)}$.
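A minimal PyTorch sketch of the self-attention steps above is shown below; the module layout and tensor sizes are illustrative assumptions rather than the exact network configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2D(nn.Module):
    """Steps 1-4: a 1x1 convolution produces Q, K, V, which are combined by
    scaled dot-product attention over the spatial positions."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1)  # Step 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)       # split into Q, K, V
        q = q.flatten(2).transpose(1, 2)            # (b, h*w, c)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        s = q @ k.transpose(1, 2) / (c ** 0.5)      # Step 2: scaled similarity
        a = F.softmax(s, dim=-1)                    # Step 3: normalization
        f_a = a @ v                                 # Step 4: weighted summation
        return f_a.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 32, 20, 20)
print(SelfAttention2D(32)(x).shape)  # torch.Size([1, 32, 20, 20])
```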

3.3.2. Cross-Scale Feature Fusion

Although the generic YOLO algorithm performs well on optical images, it has limitations when applied to small targets, particularly shadows. To address the challenge of scale variation in video SAR, a cross-scale feature fusion strategy is presented.
Spatial fusion. Spatial fusion refers to summing or concatenating the multi-level features, such as $F_1, F_2, F_3, F_4$. The workflow of cross-scale feature fusion is illustrated in Figure 5.
Suppose the coarsest feature is $F_4$; it is first up-sampled to be aligned with the finer-scale feature $F_3$,
$$F_{43} = \phi_{\mathrm{conv}1\times1}\left(\oplus\left(\phi^{\uparrow}_{2}\left(\phi_{\mathrm{conv}1\times1}(F_4)\right), F_3\right)\right).$$
The obtained tensor is further combined with $F_2$ and the down-sampled version of $F_1$,
$$F_{23} = \phi_{\mathrm{conv}1\times1}\left(\oplus\left(\phi^{\uparrow}(F_{43}), F_2, \phi^{\downarrow}(F_1)\right)\right),$$
where $\phi^{\uparrow}$ and $\phi^{\downarrow}$ denote the up-sampling and down-sampling operations along the spatial dimensions.
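The following PyTorch sketch illustrates the two spatial-fusion equations above; the channel widths and the choice of max pooling for down-sampling are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFusion(nn.Module):
    """Sketch of the spatial fusion: the coarsest map F4 is up-sampled and
    concatenated with F3, then merged with F2 and a down-sampled F1 through
    1x1 convolutions. Channel widths are illustrative assumptions."""
    def __init__(self, c1, c2, c3, c4, out_c):
        super().__init__()
        self.reduce4 = nn.Conv2d(c4, c3, kernel_size=1)
        self.merge43 = nn.Conv2d(2 * c3, c2, kernel_size=1)
        self.merge23 = nn.Conv2d(c2 + c2 + c1, out_c, kernel_size=1)

    def forward(self, f1, f2, f3, f4):
        up4 = F.interpolate(self.reduce4(f4), scale_factor=2)  # align F4 with F3
        f43 = self.merge43(torch.cat([up4, f3], dim=1))        # forms F43
        up43 = F.interpolate(f43, scale_factor=2)              # align F43 with F2
        down1 = F.max_pool2d(f1, kernel_size=2)                # align F1 with F2
        return self.merge23(torch.cat([up43, f2, down1], dim=1))  # forms F23

f1 = torch.randn(1, 16, 80, 80); f2 = torch.randn(1, 32, 40, 40)
f3 = torch.randn(1, 64, 20, 20); f4 = torch.randn(1, 128, 10, 10)
out = SpatialFusion(16, 32, 64, 128, 64)(f1, f2, f3, f4)
print(out.shape)  # torch.Size([1, 64, 40, 40])
```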
Temporal fusion. Temporal fusion focuses on integrating shallow and deep features across multiple scales. It captures the structural and semantic differences induced by scale changes and performs fine-grained fusion across hierarchical feature representations, enabling more accurate detection of small shadow targets under varying spatial resolutions and temporal dynamics. Given the feature maps $F$, the series of reconstructed features is expressed as
$$F_r = F \ast h(\sigma) = F \ast \frac{1}{2\pi\sigma^2} \exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right),$$
where $h(\sigma)$ is the Gaussian filter with standard deviation $\sigma$, and $(u, v)$ denotes the spatial index. This series of features is then fused in a similar way to the spatial fusion; the only difference is that the down-sampling and up-sampling operations are not implemented.
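A small PyTorch sketch of the Gaussian reconstruction step in the temporal fusion is given below; the kernel size and the σ values are assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma: float, ksize: int = 7) -> torch.Tensor:
    """2-D Gaussian filter h(sigma) as in the temporal-fusion equation."""
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

def reconstruct_series(feat: torch.Tensor, sigmas=(0.5, 1.0, 2.0)):
    """Build the series F_r = F * h(sigma) by depthwise Gaussian smoothing;
    the sigma values and kernel size are illustrative assumptions."""
    c = feat.shape[1]
    series = []
    for s in sigmas:
        k = gaussian_kernel(s).repeat(c, 1, 1, 1)              # one kernel per channel
        series.append(F.conv2d(feat, k, padding=3, groups=c))  # spatial size kept
    return series  # fused subsequently without up-/down-sampling

series = reconstruct_series(torch.randn(1, 32, 40, 40))
print([f.shape for f in series])  # three tensors of shape (1, 32, 40, 40)
```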

4. Experiments and Discussion

To verify the proposed method, multiple rounds of experiments were performed on the measured ViSAR. The experimental setting was first delineated, followed by comparative studies.

4.1. Experimental Settings

We first delineate the experimental setup, including the imaging, algorithm, and evaluation metrics.

4.1.1. The Imaging Setting

The video SAR dataset used in this paper was publicly released by Sandia National Laboratories [35]. The available imaging parameters of the radar system are shown in Table 2. The frame rate of the ViSAR is 30 frames per second (FPS), and the resolution of each SAR frame is 720 × 660 pixels. To test the presented algorithm, almost 900 continuous SAR images were extracted from the ViSAR and randomly partitioned into training and test sets with a ratio of 8:2.

4.1.2. The Algorithm Setting

All of the experiments were performed on a computational platform with an NVIDIA RTX 3060 GPU. During training, the model was optimized by Adam with an initial learning rate of 1 × 10−3 and a minimum learning rate of 1 × 10−5. A cosine annealing learning rate schedule was adopted to gradually reduce the learning rate and stabilize the training process. In the inference phase, non-maximum suppression (NMS) was used to suppress duplicate detection boxes with an IoU threshold of 0.50.
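For reference, a minimal PyTorch sketch of this configuration is shown below; the model stub, epoch count, and toy boxes are placeholders, and the forward/backward pass is omitted.

```python
import torch
from torchvision.ops import nms

# Stand-in model; the real detector and data loading are omitted here.
model = torch.nn.Conv2d(3, 16, kernel_size=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)  # cosine decay from 1e-3 down to 1e-5

for epoch in range(100):
    # ... forward pass, loss computation, and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()

# Inference: suppress duplicate boxes at an IoU threshold of 0.50.
boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores, iou_threshold=0.5))  # tensor([0, 2]) keeps two boxes
```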

4.1.3. The Evaluation Metrics

In the evaluation framework, the positive samples are defined as the moving targets (shadows), while the negative samples comprise everything else, such as natural and man-made clutter, which constitute false alarms in the detection task. All of the inference combinations are summarized in Table 3, where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
Typical metrics for measuring detection performance include precision, recall, F1 score, and mean average precision (mAP). They were defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
We could therefore conclude that recall reflects the missing alarms, while precision delineates the false alarms.
The mAP can be explained as the average precision across the target classes,
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$
where $AP_i$ represents the average precision of the $i$-th category, and $N$ is the total number of categories.
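These metrics follow directly from the confusion counts. In the sketch below, the counts are hypothetical values chosen to roughly reproduce the proposed method's row in Table 4.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from the confusion counts in Table 3."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts approximating the proposed method's scores
# (P ~ 0.828, R ~ 0.716, F1 ~ 0.768).
p, r, f1 = detection_metrics(tp=715, fp=149, fn=284)
print(round(p, 4), round(r, 4), round(f1, 4))
```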

4.2. The Ablation Study

In this paper, the problem of moving target detection is solved using a model and data co-driven learning method (PC-YOLO). The model-driven part refers to the principal component decomposition technique that produces the low-rank and sparse components from the ViSAR, so that the heavy clutter can be suppressed. On this basis, a detection framework comprising a hierarchical attention mechanism and cross-scale feature fusion is presented, which enables moving targets (shadows) with mutable scales to be located accurately. To further test the presented techniques, several ablation experiments were performed. The results are shown in Table 4.
  • “Raw” refers to the performance of the original YOLO algorithm. The detection performance was very poor, with an mAP of 0.3562, due to the high level of clutter. The problem of moving-target detection in ViSAR could not be solved by applying generic methods directly.
  • “Det (PCs)” means applying the generic detection algorithm to the principal components. We found that the performance was clearly improved. The results prove that the principal component decomposition can effectively suppress the complex background clutter and simultaneously enhance the saliency of motion-induced shadows, and is hence advantageous to the subsequent feature extraction and detection. Although the mAP was improved from 0.3562 to 0.7239, the recall rate, 0.5163, is not satisfactory, indicating that many moving targets are still missed.
  • The “Det (F)” denotes the improved target detection algorithm. It improves the mAP from 0.3562 to 0.7504 using a combination of a hierarchical attention mechanism and cross-scale feature fusion. Thus, shadows at various scales can be located.
The proposed method, the improved detection algorithm operating on the principal components, achieves the best performance. This is because the heavy clutter has been reduced by the principal component decomposition, and targets of multiple scales have been located by the improved detection technique. The mAP of the proposed method is 1.48%, 4.13%, and 40.90% higher than that of Det (F), Det (PCs), and Raw, respectively. Similar improvements are obtained for Precision, Recall, and F1-score. In brief, the ablation results validate the complementary advantages of the proposed PC-YOLO framework for moving-target detection.

4.3. The Verification of Principal Component Decomposition

This paper employed the principal component decomposition technique to achieve the task of clutter suppression. The resulting sparse components were used for the subsequent task of target detection, while the low-rank components were ignored. The effectiveness of principal component decomposition therefore needs to be tested.

4.3.1. The Visualization of Principal Components

We first visualize a set of experiments for the decomposition of the principal component. The methods to be compared include background difference, mean difference, contrast enhancement, principal component analysis, and sparse and low-rank decomposition. The results are displayed in Figure 6.
  • “Raw” refers to the original image in the video SAR. We found that the heavy clutter is widespread in the imaging plane due to the coherent imaging mechanism. The low-SNR image makes target observations much more difficult.
  • The “bkDIFF” is the background difference method. It is typically used to reduce clutter. Here, the current frame of ViSAR is subtracted from the previous one to cancel the clutter and noise. As can be seen, the results of the difference are much poorer due to the change of view.
  • “bkMEAN” subtracts the average background from the current ViSAR frame so that the clutter cancellation can be achieved. It is also widely used in previous work. The results are not satisfactory. The moving targets are mixed with the background clutter.
  • “PCA-R” and “PCA-E” represent the image reconstruction via 95% of the eigenvalue energy and the corresponding residual, respectively. PCA prioritizes the directions in which the data vary the most, because more variation usually indicates more useful information.
In this paper, we decompose the image of ViSAR as the low-rank and sparse components by constraining their ranks so that the heavy level of clutter can be suppressed. As can be seen from the final two rows in Figure 6, the clutter has been removed, while the targets have simultaneously been enhanced. On the other hand, it can be observed that the “PC-LR” reflects the global, slowly varying background information (such as clutter and noise), and the “PC-SP” expresses the localized mutable information (such as moving objects, or salient targets). These results also corroborate the analysis in Section 3.2.

4.3.2. Quantitative Experiments

We further verify the effectiveness of the clutter suppression technique from the perspective of target detection. Several rounds of experiments were performed. The methods commonly used in prior work were employed as the baseline. The results are shown in Table 5.
The results of the quantitative experiments are consistent with the visualizations. While the classical methods “bkMEAN” and “bkDIFF” are commonly used in previous work [36], their performance here is much poorer. This is because the imaging scenes of ViSAR differ from those of multichannel SAR systems: the imaging view changes dynamically, so consecutive frames must be registered before the difference operations. These problems can be circumvented using principal component decomposition. For example, the results of PCA are much better than those of the classical methods: introducing principal component analysis raises the mAP to 66.22%, an improvement of roughly 60 percentage points over the classical techniques. On this basis, the presented method further introduces the low-rank and sparse constraints, improving the mAP to 76.52%, which is 10.30 percentage points better than the classical principal component analysis technique.

4.4. The Comparative Studies

In this section, the overall detection performance of the proposed method is verified by several rounds of comparative studies. Some state-of-the-art target detection algorithms are used as baselines. These methods include YOLO5, YOLO7, and Faster-RCNN. Two versions of the detection method, “L” and “X”, are considered. The former refers to the slim network, while the latter refers to the full network.

4.4.1. Quantitative Experiments

The results of the quantitative experiments on moving target detection are shown in Table 6. The proposed method is compared with the baseline algorithms. The two-stage detection method, Faster R-CNN, demonstrates poorer performance, with a mAP measurement of 0.2846. This proves that the generic method is not suitable for the task of target detection in heavy clutter environments. The accuracy obtained by the two versions of the slim network, YOLO5L and YOLO7L, is poorer than that of the full network, because the feature learning ability has been limited by the trimmed network. For methods with the full network configured, YOLO5X and YOLO7X, the recall rates are poor. The results demonstrate that many more false alarms are present in their detection results.
The proposed method, PC-YOLO, achieves the best scores across all metrics. Precision, Recall, F1-score, and mAP are 3.02%, 13.86%, 14.00%, and 4.80% higher than those of the nearest competitors, respectively. This is because the clutter has been suppressed, while the moving targets have simultaneously been enhanced.

4.4.2. The Visualization of Detection Results

We then show the visual detection performance obtained by the methods under study. The results are displayed in Figure 7. We find that both false-positive predictions and false-negative inferences are present in these maps.
  • In Figure 7b,c, many false alarms denoted by the green rectangle are produced.
  • In Figure 7d,e,g, some moving targets (shadow) have been missed.
  • Although all of the targets have been located in Figure 7f,h, the confidence values for (f) are much lower than those for (h).
In traditional background-based methods, numerous false alarms are produced in cluttered cells. They are represented by the green rectangle. The results prove that these methods cannot achieve the task of detecting a moving target in complex environments. For the deep learning strategies—Faster-RCNN, YOLO5, and YOLO7—some missing targets are produced. They fail to detect many targets with low SNR. In contrast, the proposed method (PC-YOLO) achieves the most accurate detection results, indicating its strong detection capability in highly cluttered environments.
To verify the proposed method further, ViSAR frames at different angles were considered. The set of visualization results is shown in Figure 8. As can be seen, all of the moving targets in different scenarios can be located accurately by the proposed method. This is because both the empirical model and the historical data were unified in the proposed framework. An explainable learning architecture can be formed accordingly.

5. Conclusions

This paper presents a model and data co-driven learning method (PC-YOLO) to address moving-target detection in video SAR. The task is decoupled into two sub-tasks: clutter removal and shadow extraction. The former is achieved through low-rank and sparse decomposition, while the latter is solved using the hierarchical attention mechanism and cross-scale feature fusion. Multiple rounds of experiments were performed, from which some conclusions can be drawn.
  • It is not effective to apply the generic methods to video SAR data for moving target detection.
  • The clutter removal plays an important role in the proposed method, because the clutter in the SAR image makes the detection task difficult.
In the future, an intriguing question is how to combine clutter suppression and shadow extraction into a unified framework so that clutter removal and target detection can be boosted synchronously. On the other hand, how to exploit information across frames remains to be studied.

Author Contributions

Conceptualization, Y.H. and X.W.; experiments, J.J.; methodology, C.X.; validation, R.Q.; writing, G.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61971324 and 62571400.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to thank the handling Associate Editor and the anonymous reviewers for their great contributions to this paper. We would also like to thank the Sandia Laboratory for providing the video SAR data.

Conflicts of Interest

Authors Yu Han, Xinrong Wang, Chao Xue and Rui Qin were employed by the company Space Star Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SAR: Synthetic Aperture Radar
ViSAR: Video Synthetic Aperture Radar
GMTI: Ground Moving Target Indication
DPCA: Displaced Phase Center Antenna
ATI: Along-Track Interferometry
STAP: Space-Time Adaptive Processing
PCA: Principal Component Analysis
RPCA: Robust Principal Component Analysis
mAP: mean Average Precision

References

  1. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43.
  2. Li, S.; Dong, G.; Liu, H. ImagingNet: A New Learnable SAR Imaging Method via Hierarchical U-Shaped Network. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 12007–12022.
  3. Li, J.; Xu, C.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712.
  4. Li, S.; Wang, Y.; Dong, G.; Wang, P.; Liu, H. SAR Missing Echo Imaging via Hierarchical Learning Deployed on Hybrid Network. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 18833–18849.
  5. Cerutti-Maori, D.; Sikaneta, I. A generalization of DPCA processing for multichannel SAR/GMTI radars. IEEE Trans. Geosci. Remote Sens. 2012, 51, 560–572.
  6. Carande, R.E. Dual baseline and frequency along-track interferometry. In IGARSS’92: Proceedings of the 12th Annual International Geoscience and Remote Sensing Symposium, Houston, TX, USA, 26–29 May 1992; Institute of Electrical and Electronics Engineers, Inc.: Piscataway, NJ, USA, 1992; Volume 2, pp. 20–43.
  7. Chen, J.; Miao, X.; Wan, Y.; Zhang, J.; Miao, H. Simulation Study of the Effect of Multi-Angle ATI-SAR on Sea Surface Current Retrieval Accuracy. Remote Sens. 2025, 17, 3383.
  8. Brennan, L.E.; Reed, L. Theory of adaptive radar. IEEE Trans. Aerosp. Electron. Syst. 1973, AES-9, 237–252.
  9. Deming, R.W.; MacIntosh, S.; Best, M. Three-channel processing for improved geo-location performance in SAR-based GMTI interferometry. In Proceedings of the Algorithms for Synthetic Aperture Radar Imagery XIX, Baltimore, MD, USA, 7 May 2012; SPIE: Bellingham, WA, USA, 2012; Volume 8394, pp. 100–116.
  10. Deming, R.; Best, M.; Farrell, S. Simultaneous SAR and GMTI using ATI/DPCA. In Proceedings of the Algorithms for Synthetic Aperture Radar Imagery XXI, Baltimore, MD, USA, 13 June 2014; Zelnio, E., Garber, F.D., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2014; Volume 9093, p. 90930U.
  11. Ward, J. Space-time adaptive processing for airborne radar. In Proceedings of the IEE Colloquium on Space-Time Adaptive Processing, London, UK, 6 April 1998; IET: Stevenage, UK, 1998.
  12. Yang, D.; Yang, X.; Liao, G.; Zhu, S. Strong Clutter Suppression via RPCA in Multichannel SAR/GMTI System. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2237–2241.
  13. Guo, Y.; Liao, G.; Li, J.; Chen, X. A Novel Moving Target Detection Method Based on RPCA for SAR Systems. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6677–6690.
  14. Li, J.; Huang, Y.; Liao, G.; Xu, J. Moving Target Detection via Efficient ATI-GoDec Approach for Multichannel SAR System. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1320–1324.
  15. Ramos, L.P.; Alves, D.I.; Duarte, L.T.; Pettersson, M.I.; Machado, R. Robust Principal Component Analysis Techniques for Ground Scene Estimation in SAR Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9697–9710.
  16. Liu, K.; He, X.; Liao, G.; Zhu, S.; Zeng, C.; Lan, L. Multichannel Ground Moving Target Detection Based on the Block Space-Time RPCA Method. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5217415.
  17. Fan, L.; Wang, H.; Yang, Q.; Deng, B. High-Quality Airborne Terahertz Video SAR Imaging Based on Echo-Driven Robust Motion Compensation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 2001817.
  18. Yan, H.; Zhang, J.; Mao, X.; Zhu, D.; Gao, W. Moving target echo simulation of Video Synthetic Aperture Radar (ViSAR). In Proceedings of the 2017 International Symposium on Antennas and Propagation (ISAP), Phuket, Thailand, 30 October–2 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–2.
  19. Li, Z.; Luo, C.; Wang, H.; Yang, Q.; Zhang, H.; Liang, C. Subspectrum Division-Based Imaging Method for Curvilinear Moving Target in Terahertz SAR. IEEE Geosci. Remote Sens. Lett. 2025, 22, 4011905.
  20. Gou, L.; Zhu, D.; Li, Y. A novel moving target detection method for VideoSAR. In Proceedings of the 2019 International Applied Computational Electromagnetics Society Symposium-China (ACES), Miami, FL, USA, 14–18 April 2019; IEEE: Piscataway, NJ, USA, 2019; Volume 1, pp. 1–2.
  21. Luo, X. Videosar Moving Target Detection from Geometric Distortion Image Sequence. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2825–2828.
  22. Zhang, Y.; Zhu, D.; Yu, X.; Mao, X. Approach to Moving Targets Shadow Detection for VideoSAR. J. Electron. Inf. Technol. 2017, 39, 2197–2202.
  23. Zhang, Y.; Wang, X.; Qu, B. Three-Frame Difference Algorithm Research Based on Mathematical Morphology. Procedia Eng. 2012, 29, 2705–2709.
  24. Hui, W.; Chen, Z.; Zheng, S. Preliminary Research of Low-RCS Moving Target Detection Based on Ka-Band Video SAR. IEEE Geosci. Remote Sens. Lett. 2017, 14, 811–815.
  25. He, Z.; Chen, X.; Yi, T.; He, F.; Dong, Z.; Zhang, Y. Moving target shadow analysis and detection for ViSAR imagery. Remote Sens. 2021, 13, 3012.
  26. Yin, Z.; Zheng, M.; Ren, Y. A ViSAR shadow-detection algorithm based on LRSD combined trajectory region extraction. Remote Sens. 2023, 15, 1542.
  27. Zhang, Y.; Yang, S.; Li, H.; Xu, Z. Shadow Tracking of Moving Target Based on CNN for Video SAR System. In Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4399–4402.
  28. Ding, J.; Wen, L.; Zhong, C.; Loffeld, O. Video SAR Moving Target Indication Using Deep Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7194–7204.
  29. Yan, S.; Zhang, F.; Fu, Y.; Zhang, W.; Yang, W.; Yu, R. A Deep Learning-Based Moving Target Detection Method by Combining Spatiotemporal Information for ViSAR. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4014005.
  30. Yang, X.; Shi, J.; Chen, T.; Hu, Y.; Zhou, Y.; Zhang, X.; Wei, S.; Wu, J. Fast Multi-Shadow Tracking for Video-SAR Using Triplet Attention Mechanism. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224212.
  31. Wen, L.; Ding, J.; Loffeld, O. Video SAR Moving Target Detection Using Dual Faster R-CNN. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2984–2994.
  32. Kim, S.H.; Fan, R.; Dominski, F. ViSAR: A 235 GHz radar for airborne applications. In Proceedings of the 2018 IEEE Radar Conference (RadarConf18), Oklahoma City, OK, USA, 23–27 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1549–1554.
  33. Yan, S.; Fu, Y.; Yu, R.; Luo, C.; Zhang, W.; Yang, W. High-Precision Moving Target Shadow Detection Algorithm for ViSAR Based on Information Geometry. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12728–12739.
  34. Fang, H.; Liao, G.; Liu, Y.; Zeng, C. Shadow-Assisted Moving Target Tracking Based on Multi-discriminant Correlation Filters Network in Video SAR. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4006205.
  35. Wells, L.; Sorensen, K.; Doerry, A.; Remund, B. Developments in SAR and IFSAR systems and technologies at Sandia National Laboratories. In Proceedings of the 2003 IEEE Aerospace Conference, Big Sky, MT, USA, 8–15 March 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 2, pp. 1085–1095.
  36. He, Z.; Chen, X.; Yu, C.; Li, Z.; Yu, A.; Dong, Z. A Robust Moving Target Shadow Detection and Tracking Method for VideoSAR. J. Electron. Inf. Technol. 2022, 44, 3882.
Figure 1. Five frames of a video SAR are shown from left to right in (a–e). It can be seen that the level of clutter differs across frames.
Figure 2. Spotlight mode video SAR model.
Figure 3. The overall framework of the proposed method. The robust principal component analysis is described in Section 3.2, while the target detection network is detailed in Figure 4.
Figure 4. The overall framework for shadow extraction.
Figure 5. The illustration of cross-scale feature fusion.
Figure 6. Visualization of different decomposition methods. In each row, five frames (decompositions) of a video SAR are delineated from left to right. The results obtained from four decomposition methods are shown from top to bottom. For the PCA method, the principal components (PCA-R) and the reconstruction error (PCA-E) are displayed. For the robust PCA method, the low-rank component (PC-LR) and the sparse component (PC-SP) are shown.
Figure 7. Comparison of detection results using different methods. (a) Raw image, (b) Background subtraction, (c) Background elimination, (d) YOLO5X, (e) YOLO7X, (f) Improved detection method, (g) Faster-RCNN, (h) PC-YOLO.
Figure 8. The additional detection results of the proposed method in different frames of the video SAR shown in Figure 7. (a) 18th frame, (b) 130th frame, (c) 249th frame, (d) 356th frame, (e) 818th frame.
Table 1. The notation for the ViSAR signal model.
$r_c$: the shortest slant range from the scene center to the aircraft
$t_s$, $t_e$: the start and terminate times of the observation
$t_c$: the center time of the full aperture
$T$: the synthetic aperture duration
$\Delta\theta$: the rotation angle of the aircraft relative to the scene center
$\theta_c$: the angle between the flight direction and the line of sight at $t_c$
Table 2. The imaging parameters of the SNL radar used to collect the video SAR data.
Parameter | Value
Mode | Spotlight
Center Frequency | 16.7 GHz (Ku band)
Wavelength | 1.8 cm
Incidence Angle | 65°
Platform Height | 2 km
Platform Speed | 245 km/h
Cross-Range Resolution | 0.1 m
Total Rotation Angle | 200°
Table 3. The prediction matrix for the task of target detection.
Truth \ Prediction | Yes | No | Total
Yes | TP | FN (missing alarm) | Positive
No | FP (false alarm) | TN | Negative
The associated rates are the precision $P = \frac{N_{tp}}{N_{tp} + N_{fp}}$ (false alarms), the recall $R = \frac{N_{tp}}{N_{tp} + N_{fn}}$ (missing alarms), and $F_1 = \frac{2 P R}{P + R}$.
Table 4. Ablation study.
Method | Det | PCs | Fusion | Precision | Recall | F1 | mAP
Raw | ✓ | – | – | 0.3967 | 0.4451 | 0.4195 | 0.3562
Det (PCs) | ✓ | ✓ | – | 0.7587 | 0.5163 | 0.6145 | 0.7239
Det (F) | ✓ | – | ✓ | 0.8328 | 0.6972 | 0.7590 | 0.7504
Our (PCs + F) | ✓ | ✓ | ✓ | 0.8280 | 0.7156 | 0.7677 | 0.7652
Table 5. Comparative experiments for clutter suppression.
Method | Precision | Recall | F1 | mAP
bkMEAN | 0.0937 | 0.0567 | 0.0706 | 0.0436
bkDIFF | 0.0831 | 0.0736 | 0.0781 | 0.0352
PCA | 0.7863 | 0.5586 | 0.6532 | 0.6622
Proposed | 0.8280 | 0.7156 | 0.7677 | 0.7652
Table 6. Comparative Experiment Based on SNL Video-SAR.
Method | Precision | Recall | F1 | mAP
Faster-RCNN | 0.3219 | 0.2917 | 0.3617 | 0.2846
YOLO5L | 0.4367 | 0.3951 | 0.4149 | 0.3770
YOLO5X | 0.5458 | 0.4951 | 0.5192 | 0.4432
YOLO7L | 0.4936 | 0.3676 | 0.4214 | 0.4130
YOLO7X | 0.7987 | 0.5586 | 0.6574 | 0.7172
PC-YOLO | 0.8280 | 0.7156 | 0.7677 | 0.7652
Improvement | 3.02% | 13.86% | 14.00% | 4.80%
