1. Introduction
Synthetic aperture radar (SAR) is an active microwave imaging system that uses coherent signal processing to achieve high-resolution terrain observation, synthesizing a virtual, large-aperture antenna from the relative motion between the radar platform and targets [1,2]. Notably, SAR exhibits unique penetration characteristics, allowing microwave signals to pass through cloud cover, vegetation canopies, and certain man-made obstacles and to interact with subsurface layers. These are capabilities unattainable in optical imaging systems. Current SAR applications predominantly focus on raw image generation, yet critical challenges persist in automated target interpretation. The effective utilization of SAR data requires the systematic extraction of discriminative target features from complex scattering signatures, followed by an accurate characterization of physical properties through advanced image interpretation techniques. This paper specifically addresses the technical bottlenecks in SAR target recognition, with emphasis on robust feature representation and reliable information extraction under operational constraints.
Deep learning-based SAR image target recognition methods automatically extract deeper, more detailed, and more abstract feature representations, whereas traditional target recognition algorithms rely on human-designed features with relatively simplistic attribute characteristics. Furthermore, deep learning frameworks use end-to-end neural network (NN) structures that unify feature extraction and classification under a single optimization objective, whereas traditional approaches usually divide the two into discrete stages. In recent years, deep learning-based approaches have become the mainstream of SAR target recognition research [3].
Chen et al., from Fudan University, developed an all-convolutional neural network architecture (A-ConvNets) [4]. By eliminating the fully connected layers in conventional convolutional neural network (CNN) models, this design significantly reduces the number of trainable parameters, thereby mitigating overfitting issues inherent in conventional architectures. Addressing this challenge as well, Fu et al. proposed a random deactivation mechanism integrated into deep residual networks, combining the center loss function with the Softmax classifier for joint optimization [5]. To enhance model robustness, Ding et al. at Xidian University implemented data augmentation techniques, including rotation and translation, to diversify training samples [6]. In [7], a generative adversarial network (GAN) was used to construct realistic SAR imagery through adversarial training between generator and discriminator modules, effectively enlarging datasets for SAR target recognition. Building upon this framework, Ledig et al. enhanced GAN performance by adding a geometric feature-aware loss function, achieving higher recognition accuracy [8]. Reference [9] applies regularization strategies to suppress noise-induced feature variations, thereby reducing speckle noise interference in SAR target recognition.
In [10], a capsule network architecture is proposed to preserve pose-encoded features while retaining intrinsic spatial links, demonstrating better performance in SAR applications. Building on this, Ren et al. created a dilated convolutional capsule network that uses hierarchical capsule units to extract multi-scale features, greatly increasing recognition robustness by explicitly modeling spatial dependencies among features [11]. To facilitate the extraction of prominent visual patterns from SAR imagery and to increase computational efficiency and classification accuracy, the integration of attention pathways into CNN architectures is studied in [12]. Interestingly, although deep learning and conventional approaches have distinct advantages, current research directions highlight hybrid frameworks that combine deep neural representations with handcrafted feature engineering. This paradigm leverages the complementary strengths of both methodologies, positioning itself as a pivotal developmental trajectory in SAR target recognition technology.
Reference [13] proposed a Gabor transform-based CNN architecture using multi-scale, multidirectional Gabor filters to construct enriched training datasets, thereby increasing data variety and reducing overfitting. In [14], rotation-invariant multi-scale features (encoding local texture and edge semantics) are combined with CNN-derived representations, achieving hierarchical feature representations for SAR target recognition. Reference [15] introduced a CNN-driven dictionary learning framework, employing a ConvNet as an automated feature extractor and building a discriminative dictionary for sparse sample representation. This method overcomes the dependence of traditional dictionary learning on handcrafted feature engineering. Broadening this work, the research group in [16] addressed nonlinear dictionary training dynamics by using a nonlinear mapping to project CNN features into a reproducing kernel Hilbert space for improved separability. Reference [17] proposes a CNN model that processes covariance and polarimetric coherence matrices, allowing the extraction of spatial features with hierarchical polarization, and demonstrates the potential of deep learning for polarimetric SAR target recognition. Pioneering feature fusion research, Zhang et al. developed a hybrid framework integrating scattering center attributes with CNN-derived representations. Their methodology applies discriminative correlation analysis (DCA) to enhance feature complementarity, significantly boosting SAR recognition robustness [18]. However, this approach inevitably compromises the geometric integrity of feature maps through the vectorization of spatial hierarchies. Addressing this limitation, the research team implemented adaptive fusion within 2D feature embedding spaces, preserving geometric–semantic relationships while optimizing cross-modal feature synergy. This advancement establishes new benchmarks for scattering–CNN feature fusion efficacy [19]. In contrast to optical imagery, SAR images exhibit inherent speckle noise that severely degrades visual quality and compromises downstream target recognition accuracy. To address this dual challenge, we propose a novel joint convolutional neural network (J-CNN) with a dual-task architecture, enabling simultaneous speckle suppression and target classification through end-to-end optimization. This framework demonstrates robust adaptability across SAR datasets with heterogeneous noise levels, establishing a unified solution for enhanced recognition reliability [20].
As deep neural networks have developed, increases in network depth have brought corresponding gains in recognition accuracy. However, this architectural growth entails a drastic increase in parameter volume and, consequently, in storage requirements, which significantly restricts the practical use of deep models. Model compression techniques, such as pruning, quantization, knowledge distillation, and low-rank decomposition, have therefore become important research areas for alleviating this bottleneck. Within this paradigm, model pruning removes redundant components identified through structural saliency assessments, quantified by metrics including parameter magnitude, weight density, and convolutional kernel entropy. This process simultaneously reduces both parameter count and computational load, while post-pruning accuracy is recovered through systematic fine-tuning. The conceptual foundation of model pruning was established as early as 1989, when LeCun introduced a parameter saliency criterion based on loss function Hessian analysis, enabling connection reduction through second-order optimization principles [21]. This pioneering work laid the theoretical groundwork for subsequent pruning research paradigms. Significantly advancing the field, Han et al. developed deep compression, a unified framework synergistically integrating pruning, quantization, and Huffman coding [22]. Through repeated magnitude-based thresholding, treating parameters as statistically independent entities, their approach performed unstructured pruning, with sparsity-aware retraining maintaining baseline accuracy. Building on these foundations, Molchanov et al. framed pruning as a combinatorial optimization problem, using gradient-informed importance scoring to find low-cost parameter subsets that maintain model fidelity after pruning [23]. However, because these methods are unstructured pruning approaches, the sparsity constraints enforced to reduce memory footprints leave neural architectures with irregular connection patterns. Such structural irregularities call for explicit indexing mechanisms to identify non-zero elements during the inference phase, which makes the resulting networks incompatible with parallel computation architectures. Furthermore, the practical use of sparse models requires specialized hardware acceleration libraries to effectively exploit sparsity-aware computation primitives.
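To make this paradigm concrete, the following is a minimal PyTorch sketch of magnitude-based unstructured pruning in the spirit of [22]; the threshold value and the mask bookkeeping are illustrative assumptions, not the exact procedure of deep compression:

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, threshold: float = 1e-2) -> dict:
    """Zero every weight whose magnitude falls below a fixed threshold.

    Real pipelines iterate prune -> retrain and store the survivors in a
    sparse format; here we only build and apply the binary masks.
    """
    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" in name:
                mask = param.abs() >= threshold  # salient weights survive
                param.mul_(mask)                 # zero the rest in place
                masks[name] = mask               # reapply after each retraining step
    return masks
```

Because the surviving weights form an irregular pattern, inference must index the non-zero entries explicitly, which is exactly the hardware-unfriendliness discussed above.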
Recent advances in pruning methodology have prioritized structured pruning paradigms, where convolutional weight elimination is governed by regularization-driven structural constraints. For instance, Wen et al. implement group Lasso regularization to enforce group-wise sparsity, systematically driving parameter clusters towards numerical insignificance [24]. Similarly, Jin et al. introduce an L0 norm penalty approximated by iterative reparameterization, though this regularization augmentation incurs considerable convergence overhead [25]. Notably, coarse-grained pruning methods, spanning convolutional filter removal, input channel truncation, and feature map suppression, exhibit layer-wise interdependency: pruning the i-th layer's filters necessitates a proportional elimination of the dependent feature maps and of the corresponding kernels in the (i + 1)-th layer, thereby maintaining dimensional compatibility across network hierarchies, as the sketch below illustrates.
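The following is a schematic PyTorch sketch of that coupling, assuming two directly connected Conv2d layers; handling BatchNorm layers and residual connections is omitted:

```python
import torch.nn as nn

def prune_coupled_convs(conv_i: nn.Conv2d, conv_next: nn.Conv2d, keep: list):
    """Keep only the filters in `keep` for layer i, and drop the matching
    input channels of layer i+1 so the dimensions still agree."""
    new_i = nn.Conv2d(conv_i.in_channels, len(keep), conv_i.kernel_size,
                      stride=conv_i.stride, padding=conv_i.padding,
                      bias=conv_i.bias is not None)
    new_i.weight.data = conv_i.weight.data[keep].clone()       # [kept, in, kH, kW]
    if conv_i.bias is not None:
        new_i.bias.data = conv_i.bias.data[keep].clone()

    new_next = nn.Conv2d(len(keep), conv_next.out_channels, conv_next.kernel_size,
                         stride=conv_next.stride, padding=conv_next.padding,
                         bias=conv_next.bias is not None)
    new_next.weight.data = conv_next.weight.data[:, keep].clone()  # prune inputs
    if conv_next.bias is not None:
        new_next.bias.data = conv_next.bias.data.clone()
    return new_i, new_next
```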
Current pruning methodologies exhibit task-agnostic limitations, failing to incorporate domain-specific prior knowledge. This deficiency becomes critically pronounced in SAR image processing, where spatial domain variance stemming from radar parameter sensitivity introduces non-stationary input distributions that compromise input consistency. Paradoxically, data-driven pruning frameworks demonstrate pathological fragility to SAR domain shifts, casting doubt on the operational integrity of pruned architectures in real-world reconnaissance scenarios. To address this dual challenge, we propose a task-oriented pruning framework that explicitly optimizes parameter saliency against mission-critical objectives. Our methodology strategically eliminates task-irrelevant parameters through the joint optimization of model sparsity and target recognition fidelity, while preserving model stability and reliability to the greatest extent possible. Our work advances the practical deployment of deep learning models for SAR target recognition.
The remainder of our paper is structured as follows: Section 2 describes related studies on SAR images, including a classical CNN model used for SAR target recognition, the speckle noise characteristics of SAR images, and the J-CNN model structure. Then, the strategy of the proposed TDP-SAR is outlined in Section 3. In Section 4, we provide experiments to verify the effectiveness of TDP-SAR. Finally, the research content of this paper is summarized in Section 5.
4. Experiments and Discussions
In this section, we analyze the performance of TDP-SAR. This includes an adaptability analysis of multiple SAR target recognition models on SAR images of varying quality prior to pruning, as well as an analysis of the amplitude and phase spectra of their convolutional kernels. Finally, we compare performance before and after pruning, demonstrating that TDP-SAR not only achieves model lightweighting but also maintains strong adaptability.
4.1. Experimental Setup
(1) Dataset
To validate the effectiveness of the proposed method, our experiments employed the MSTAR database, jointly funded by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). The MSTAR database contains 10-class SAR target images with 0.3 m × 0.3 m resolution. Most SAR images in the dataset are sized 128 px × 128 px. To standardize input dimensions, we followed Reference [4] in cropping the original SAR images to 88 px × 88 px, which served as the ground truth for subsequent experiments. The training and test datasets were constructed from SAR images at 17° and 15° pitch angles, respectively, containing 2746 and 2425 samples.
To validate adaptability to speckle noise, we augmented the original training samples with Gamma-distributed simulated speckle noise (L = 1, 2, and 5) to create a synthetic training dataset with varying quality levels. During testing, we evaluated samples containing Gamma-distributed noise at 11 noise levels (L = 0.2, 0.5, 1, 1.5, 2, 3, 5, 10, 20, 40, and 80). One common way to synthesize such speckle is sketched below.
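A minimal NumPy sketch, assuming fully developed multiplicative speckle on intensity images; the exact simulation routine used for our dataset is not reproduced here, so the implementation details are illustrative:

```python
import numpy as np

def add_speckle(intensity: np.ndarray, L: float, rng=None) -> np.ndarray:
    """Multiply a SAR intensity image by unit-mean Gamma speckle.

    For an L-look image, fully developed speckle follows
    Gamma(shape=L, scale=1/L), with mean 1 and variance 1/L,
    so smaller L means stronger noise.
    """
    rng = rng or np.random.default_rng()
    return intensity * rng.gamma(shape=L, scale=1.0 / L, size=intensity.shape)

# Training-set augmentation at the three levels used above:
# noisy_variants = [add_speckle(img, L) for L in (1, 2, 5)]
```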
(2) Experimental details
The experimental server was configured with an Intel Core i7 CPU (Intel, Santa Clara, CA, USA), the Ubuntu 20.04 operating system, and an NVIDIA RTX 3090 Ti GPU (NVIDIA, Santa Clara, CA, USA). Training used the PyTorch 1.12.0 deep learning framework with CUDA 11.7 and the Adam optimizer, with a weight decay of 0.1, a batch size of 64, a learning rate of 1.5, and early stopping within 500 training epochs.
4.2. Comparison of SAR Target Recognition in Different Models
Section 2 highlights that SAR images inherently contain speckle noise, which lowers image quality, and that its intensity varies in real-world situations. As such, we examine the effect of speckle noise on CNN-based SAR target recognition models. Four SAR target recognition models are trained and assessed.
Figure 5 depicts recognition accuracies at various noise levels; the speckle noise level (L) is represented by the x-axis. (Smaller L values indicate stronger speckle noise and thus worse SAR image quality.)
As described in the dataset section, the model training samples consist of SAR images with simulated speckle noise at L = 1, 2, and 5. The accuracies are notably higher when test noise levels approximate these training values (L ≈ 1, 2, and 5). Even with high-quality SAR images (large L values), accuracy decreases significantly. This indicates that model performance degrades when testing conditions deviate from training parameters, revealing an overfitting problem in SAR target recognition. Nevertheless, the J-CNN demonstrates superior adaptability to strong speckle noise compared to AlexNet, VGG16, and ResNet50. Particularly under L < 2 conditions (highlighted results), the accuracy advantage of the J-CNN becomes pronounced. Confusion matrices for all four CNN models at L = 0.2 are shown in Figure 6, where the J-CNN achieves the highest accuracy (69.38%), reaffirming its exceptional noise robustness.
4.3. Analysis of Amplitude Spectrum in Convolution Kernel
According to the results in Figure 5, the J-CNN demonstrates the strongest adaptability to SAR images with severe speckle noise compared to classical CNN models. In this section, we analyze the reasons for this superior performance using the methodology described in Section 3.
Figure 7 displays the amplitude spectra of all convolutional kernels in the first layer of each of the four networks, standardized to 128 px × 128 px. Yellow and blue regions denote high- and low-energy spectral components, respectively. Granular artifacts in SAR images, i.e., speckle noise, predominantly reside in high-frequency ranges. Crucially, only 4 of the 16 kernels in Figure 7d suppress low-band signals while emphasizing high-band feature extraction. By comparison, the amplitude spectra of the first convolutional layers of AlexNet and ResNet50, shown in Figure 7a,c, have significantly smaller high-energy regions, and their extracted frequency ranges are narrowly targeted. However, for SAR images with strong speckle noise, such narrow frequency selectivity may lose the overall information of the SAR target while accentuating granular noise. Moreover, in Figure 7b, the amplitude spectrum of the first convolutional layer of VGG16 has a large high-energy area, but it retains obviously less low-frequency-band information than the J-CNN, thereby losing target contour and shape information in SAR images.
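To ground this analysis, a PyTorch sketch of how such kernel amplitude spectra can be computed is given below; the per-channel averaging and the absence of normalization are assumptions, since the exact visualization pipeline is not specified here:

```python
import torch

def kernel_amplitude_spectra(conv_weight: torch.Tensor, size: int = 128) -> torch.Tensor:
    """Zero-padded 2-D FFT amplitude spectra of a layer's kernels.

    conv_weight: [out_ch, in_ch, kH, kW] weight tensor of a Conv2d layer.
    Kernels are averaged over input channels and padded to size x size,
    mirroring the standardized 128 px x 128 px display in Figure 7.
    """
    kernels = conv_weight.detach().mean(dim=1)               # [out_ch, kH, kW]
    spectra = torch.fft.fft2(kernels, s=(size, size))        # zero-pad before FFT
    return torch.fft.fftshift(spectra.abs(), dim=(-2, -1))   # center low frequencies

# Example (hypothetical layer name): spectra = kernel_amplitude_spectra(net.conv1.weight)
```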
For a quantitative comparison of convolutional kernel amplitude spectra across the four models, Figure 8 presents the high frequency ratios computed using Equation (13), with hyperparameters thr = 0.7 and R = 70. The axes denote convolutional layer indices (horizontal) and high frequency ratios (vertical); all 1 × 1 convolutional layers are excluded from this analysis. As shown in Figure 8, the J-CNN maintains elevated high frequency ratios across its first five convolutional layers, followed by a marked decline starting from the sixth layer. This aligns with the J-CNN's hierarchical design: its first five layers perform speckle reduction, while the final three layers constitute the classification stage. This structural dichotomy demonstrates that the J-CNN's speckle reduction phase actively preserves SAR target details, whereas the classification stage leverages generalized features for target identification, achieving a balanced optimization of both objectives. Comparatively, AlexNet and ResNet50 prioritize general target information extraction in shallow layers (notably the first convolutional layer) while reserving detailed feature analysis for deeper layers. For SAR images affected by strong speckle noise, their shallow convolutional layers therefore extract generalized speckle patterns, and the deep layers then attend to the detailed information of the speckle noise. As a result, these models adapt poorly to SAR images affected by strong speckle noise. The high frequency ratio of VGG16 changes little as the convolutional layers deepen and always remains high, showing that VGG16 consistently attends to the extraction of detailed features in the high-frequency band; under strong speckle noise, these details degrade the recognition results. On this basis, the reasons for the strong adaptability of the J-CNN to speckle noise are explained from the perspective of the convolution kernel amplitude spectrum.
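A sketch of a high frequency ratio in the spirit of Equation (13) follows. Since the equation itself is not reproduced in this section, the energy-outside-radius-R formulation with threshold thr is a plausible reading rather than the verbatim definition:

```python
import torch

def high_freq_ratio(spectra: torch.Tensor, thr: float = 0.7, R: int = 70) -> float:
    """Share of a layer's kernels dominated by high-frequency energy.

    spectra: [n_kernels, S, S] centered amplitude spectra (see above).
    A kernel counts as high-frequency when more than `thr` of its spectral
    energy lies outside a radius-R disc around the spectrum center.
    """
    n, S, _ = spectra.shape
    coords = torch.arange(S, dtype=torch.float32)
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    dist = ((yy - S // 2) ** 2 + (xx - S // 2) ** 2).sqrt()
    high = dist > R                                   # high-frequency region mask
    energy = spectra ** 2
    frac = energy[:, high].sum(dim=1) / energy.flatten(1).sum(dim=1)
    return (frac > thr).float().mean().item()         # e.g., 4/16 kernels -> 0.25
```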
4.4. Analysis of Phase Spectrum in Convolution Kernel
This section analyzes the J-CNN's exceptional noise adaptability through phase spectrum characteristics. A SAR target image with zeroed clutter regions (Figure 9a) serves as the analytical basis. Reconstruction via Equation (15) yields the output in Figure 9b, where the phase spectrum distinctly encodes target morphology and positional data.
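A standard phase-only reconstruction conveys the idea; this sketch is an assumed form of Equation (15), not its verbatim statement:

```python
import torch

def phase_only_reconstruction(image: torch.Tensor) -> torch.Tensor:
    """Rebuild an image from its Fourier phase alone (unit amplitude).

    Strong edges and target position survive, while amplitude-carried
    intensity information is discarded.
    """
    spectrum = torch.fft.fft2(image)
    unit = torch.exp(1j * spectrum.angle())   # keep phase, force amplitude to 1
    return torch.fft.ifft2(unit).real
```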
We select convolutional kernels from the first layers of AlexNet, VGG16, ResNet50, and the J-CNN, along with the sixth layer of the J-CNN. Reconstruction results generated via Equation (17) are shown in Figure 10. The phase spectra of VGG16's first layer and the J-CNN's sixth layer exhibit superior contour preservation, whereas the other models lack this capability. Notably, the J-CNN's first layer prioritizes target-generalized features over shape-specific details, explaining its enhanced adaptability to SAR images of varying quality. Specifically, the J-CNN emphasizes holistic image information during speckle reduction while suppressing positional and shape details; these features are selectively reactivated only in the later recognition stages.
We further quantify the phase spectra of convolutional kernels using correlation coefficients derived from Equation (18). Figure 11 displays the layer-wise correlation coefficients for all four models. Notably, the J-CNN exhibits marked spectral divergence between its despeckling and recognition stages. During despeckling, its convolutional phase spectrum exerts minimal influence on target geometry, a strategic suppression of shape and positional encoding. This design intentionally avoids extracting spurious spatial features from speckle-dominated regions, where noise artifacts frequently mimic target characteristics. In the recognition stage, however, the correlation coefficient increases considerably, showing that target position and shape encoding receive more attention. The well-established AlexNet and ResNet50 models show elevated correlation coefficients in shallow layers, suggesting a greater influence of the phase spectrum at early processing stages: these layers prioritize the extraction of positional and shape features, while deeper layers move toward generalized representations for subsequent tasks. This architecture may propagate speckle-induced artifacts in SAR imagery, as low-level convolutions can misinterpret granular noise as structural characteristics. VGG16, on the other hand, maintains relatively high correlation coefficients across all layers, indicating that its reconstruction of SAR target geometry depends consistently on the phase spectrum.
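As an assumed form of the Equation (18) statistic, the sketch below computes a Pearson correlation between a kernel's phase-based reconstruction and the clutter-zeroed target image:

```python
import torch

def phase_correlation(recon: torch.Tensor, target: torch.Tensor) -> float:
    """Pearson correlation between a phase-based reconstruction and the
    clutter-zeroed SAR target image; higher values mean the phase spectrum
    encodes more of the target's position and shape."""
    x = recon.flatten() - recon.mean()
    y = target.flatten() - target.mean()
    return (x @ y / (x.norm() * y.norm() + 1e-12)).item()
```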
4.5. Comparison of Model Performance Pre- and Post-Pruning
The proposed method employs the J-CNN as its baseline model, explicitly decoupling the despeckling and recognition stages. We implement pruning via TDP-SAR (detailed in Section 3) to validate its effectiveness. Ablation experiments compare pruned model accuracies across three configurations: amplitude spectrum-only pruning, phase spectrum-only pruning, and full TDP-SAR. Table 3 evaluates performance on SAR images with varying noise levels (L = 0.2, 0.5, 1, 2, 3, and 5), including the accuracy drop ratios between TDP-SAR and the original J-CNN. The threshold hyperparameters for TDP-SAR are set to thr = 0.7 and thc = 0.5 by the Neyman–Pearson lemma.
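For intuition, the following schematic sketches one plausible kernel-retention rule consistent with the ablation settings above; the actual TDP-SAR criterion is defined in Section 3 and may differ:

```python
def keep_kernel(stage: str, high_freq_frac: float, corr: float,
                thr: float = 0.7, thc: float = 0.5) -> bool:
    """Illustrative kernel-retention rule (an assumption, not the exact method).

    Despeckling-stage kernels are judged by the high-frequency fraction of
    their amplitude spectrum; recognition-stage kernels by their phase-spectrum
    correlation with target geometry. Kernels failing the stage-relevant test
    are pruned.
    """
    if stage == "despeckle":
        return high_freq_frac >= thr   # preserve detail-retaining kernels
    return corr >= thc                 # preserve shape/position-encoding kernels
```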
As shown in Table 3, the J-CNN achieves the highest recognition accuracy across most SAR image quality levels. When pruned via the amplitude spectrum, phase spectrum, or TDP-SAR strategies, its accuracy exhibits only marginal declines, with all degradation rates remaining below 1%, a practically acceptable threshold. Notably, at specific noise levels (L = 1 and L = 3), the pruned models retain or even slightly surpass baseline accuracy. These results confirm that TDP-SAR introduces minimal degradation to the J-CNN's recognition performance while significantly reducing model complexity, thereby enabling effective compression.
We further apply TDP-SAR pruning to AlexNet, VGG16, ResNet50, and the J-CNN. Figure 12 compares their recognition accuracies pre- and post-pruning, where solid lines denote baseline performance and dashed lines represent pruned results. For classical CNNs without dedicated despeckling stages (AlexNet, VGG16, and ResNet50), phase spectrum-based pruning is applied directly. The near-overlapping curves in Figure 12 demonstrate minimal accuracy variation across all pruned models, confirming that TDP-SAR preserves recognition capability while achieving parameter reduction.
We quantified the parameter counts and inference times of several classical models pre- and post-pruning, including AlexNet, VGG16, ResNet50, EfficientNet-B0 [33], and MobileNet [34], with the results summarized in Table 4. Pruning reduces parameter volumes significantly across all evaluated SAR recognition networks. Most notably, the J-CNN achieves a 17.7% parameter reduction relative to its original size, demonstrating the effectiveness of TDP-SAR in model lightweighting. Execution times also decrease measurably for all pruned networks, further validating the practical utility of the strategy for deployment scenarios.
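For reference, a minimal sketch of how such measurements are commonly obtained in PyTorch is shown below; the exact measurement protocol used for Table 4 is not specified, so this is a generic illustration:

```python
import time
import torch

def count_params(model: torch.nn.Module) -> int:
    """Total number of parameters in the model."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def mean_inference_ms(model: torch.nn.Module, x: torch.Tensor,
                      warmup: int = 10, runs: int = 100) -> float:
    """Average single-batch inference time in milliseconds."""
    model.eval()
    for _ in range(warmup):
        model(x)                      # warm up kernels and caches
    if x.is_cuda:
        torch.cuda.synchronize()      # drain queued GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3
```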
4.6. Comparison of Pruning Methods
To further validate the pruning performance of TDP-SAR, we conducted comparative experiments with different pruning methods, including the L1 norm [35], HRank [36], and FiltDivNet [37]. The results are presented in Table 5. All baseline models employed the J-CNN, with test data comprising synthetic SAR images containing simulated noise at L = 5.
Table 5 shows the variations in model recognition accuracy and in the number of weight parameters before and after applying the different pruning methods. Comparative analysis reveals that while our method does not achieve the largest parameter reduction, it maintains recognition accuracy after pruning because it focuses on retaining the effective features extracted at different processing stages. In contrast, other classical pruning methods, particularly FiltDivNet, achieve greater parameter reduction but exhibit reduced adaptability to SAR images under strong noise interference, resulting in a 2% degradation in recognition accuracy post-pruning.