Article

ConFAS-Net: Few-Shot SAR Target Recognition via Confusion-Aware Attention and Adaptive Decision Scaling

School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou 121001, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(10), 1482; https://doi.org/10.3390/rs18101482
Submission received: 31 March 2026 / Revised: 6 May 2026 / Accepted: 7 May 2026 / Published: 9 May 2026

Highlights

What are the main findings?
  • We propose ConFAS-Net, which integrates three innovative modules (the MS-CA multi-scale channel attention, the CACL confusion-aware cosine loss, and the CADA class-adaptive decision adjustment) to systematically address the core issues of insufficient feature utilisation and severe class confusion in few-shot SAR target recognition.
  • On the MSTAR dataset under 5/10/15/30-shot settings, the model achieved recognition accuracies of 73.25%, 87.43%, 94.97%, and 96.87%, respectively, a maximum improvement of 2.93 percentage points over baseline methods, whilst remaining parameter-efficient and balancing accuracy with computational cost.
What are the implications of the main findings?
  • The establishment of a full-chain optimisation paradigm comprising ‘feature enhancement—loss optimisation—decision adjustment’ provides an innovative and practical technical solution for small-sample target recognition tasks.
  • The model’s lightweight design is tailored to the application requirements of resource-constrained scenarios, offering a viable approach for the engineering implementation of SAR target recognition under limited-sample conditions.

Abstract

Synthetic aperture radar (SAR) target recognition under few-shot scenarios faces the challenges of insufficient feature extraction and severe inter-class confusion. To address these issues, a confusion-aware few-shot attention and scaling network (ConFAS-Net) is proposed. The method introduces a multi-scale channel attention (MS-CA) module to enhance the adaptive extraction of multi-scale features, designs a confusion-aware cosine loss (CACL) module to guide discriminative feature learning using inter-class confusion information, and employs a class-adaptive decision adjustment (CADA) module to dynamically adjust classification boundaries to few-shot distribution characteristics. Extensive experiments on the standard MSTAR dataset demonstrated that ConFAS-Net achieved recognition accuracies of 73.25%, 87.43%, 94.97%, and 96.87% under 5-, 10-, 15-, and 30-shot settings, respectively. To rigorously substantiate the generalization capability and robustness of the proposed model across different data domains, additional validation was conducted on the public SAMPLE dataset, where ConFAS-Net consistently achieved state-of-the-art performance across all K-shot settings. Ablation studies and visualization analyses further validated the effectiveness of each proposed module. Comparisons with state-of-the-art methods demonstrate that the proposed method maintains high recognition accuracy while retaining a lightweight architecture of only 2.32 M parameters, providing an effective solution for SAR target recognition in resource-constrained environments.

1. Introduction

Radar target recognition technology has made significant progress over the past few decades [1]. Synthetic aperture radar (SAR), with its all-weather, all-time active imaging capabilities [2], has been widely applied in both civilian and military fields, such as topographic mapping and geological exploration [3], marine monitoring and vessel identification [4], disaster emergency response and damage assessment [5], military target reconnaissance and battlefield situational awareness [6], and automatic target recognition [7,8,9,10]. However, with the rapid growth in the volume of SAR image data and the increasing complexity of application scenarios, traditional SAR target recognition methods face severe challenges in terms of feature extraction capabilities and generalisation performance. Traditional methods primarily rely on manually designed features, such as scatter centre matching [11] and geometric feature extraction [12]; however, these methods exhibit limited recognition accuracy under complex background conditions and varying target characteristics, making it difficult to meet the demands of practical applications.
In recent years, deep learning-based SAR target recognition methods have attracted widespread attention [13]. Convolutional neural networks (CNNs) have achieved significant progress in the field of SAR target recognition due to their powerful feature learning capabilities. Reference [14] proposed an improved deep convolutional neural network algorithm that enhanced the performance of SAR image target recognition by optimising the network architecture and training strategies. Reference [15] designed the EFTL network, utilising electromagnetic feature transfer learning techniques to address SAR target recognition tasks, thereby effectively improving the model’s generalisation ability across different scenarios. Reference [16] proposed the EMI-Net, an end-to-end mechanism-driven interpretable network, which enhances the model’s recognition capabilities under extended operational conditions through the incorporation of interpretability design. However, most of these methods require a large number of labelled samples for training, and in practical applications, they often face the problem of sample scarcity.
As an effective approach to addressing data scarcity, few-shot learning has attracted significant attention from researchers in the field of SAR target recognition. To enhance feature discrimination, attention mechanisms, a key technique for improving the representational capacity of deep neural networks, have demonstrated great potential in SAR target recognition [17,18]. The squeeze-and-excitation (SE) attention mechanism [19] enhances the network’s focus on important feature channels by adaptively recalibrating channel feature responses. Reference [20] proposed a multi-scale time-frequency representation fusion network which, through a coordinate attention mechanism and an adaptive feature concatenation strategy, effectively fuses frequency-domain features at different scales, thereby improving SAR target recognition performance. Reference [21] proposed a dual-branch spatial-frequency domain fusion method which, through a cross-attention mechanism, achieves complementary fusion of spatial and frequency domain features, enhancing the network’s ability to distinguish multi-scale features of SAR targets. ECA-Net [22] avoids the information loss caused by dimensionality reduction through an efficient channel attention mechanism. However, existing attention mechanisms primarily focus on feature enhancement and lack targeted designs to address class confusion in few-shot scenarios. Furthermore, the design of the loss function is crucial for the success of few-shot learning. Traditional cross-entropy loss is prone to overfitting under sample imbalance and class similarity; focal loss [23] mitigates class imbalance by dynamically adjusting the weights of hard and easy samples; center loss [24] enhances feature compactness by minimising intra-class distances. However, existing loss function designs rarely exploit inter-class confusion information, failing to fully use the relational information between samples to guide network learning. In recent years, researchers have conducted in-depth explorations into few-shot SAR target recognition. Geng et al. [25] utilised causal inference to eliminate background interference, Zhou et al. [26] employed evidential deep learning to estimate uncertainty, and Wang et al. [27] adopted feature generation for data augmentation. These works have provided valuable insights into SAR few-shot recognition.
More broadly, existing few-shot learning methods can be grouped into five categories: metric learning-based methods that classify by comparing sample similarity in learned embedding spaces, such as Siamese Networks [28], Prototypical Networks [29], and Relation Networks [30]; optimisation-based meta-learning methods that learn transferable initialisation parameters for rapid adaptation, such as MAML [31]; data augmentation and generative methods that expand limited training sets through sample synthesis [32]; pre-trained foundation models that leverage large-scale datasets to learn transferable visual representations, which have recently demonstrated promising performance in SAR target recognition tasks [33]; and contrastive learning-based methods that learn discriminative and generalizable representations through self-supervised pairwise instance comparison [34]. Despite their respective merits, most existing methods focus on cross-domain generalisation, uncertainty quantification, or feature generation, without jointly addressing the three core challenges in few-shot SAR recognition: insufficient multi-scale feature extraction, underutilisation of inter-class confusion information, and lack of adaptive classification boundary adjustment.
To address these issues, this paper proposes ConFAS-Net, a method for small-sample target recognition in SAR imagery that integrates multi-scale attention with confusion-aware learning. The main contributions are as follows:
(1) We design the multi-scale channel attention (MS-CA) module, which adaptively learns channel weights by fusing global and local context through dual pathways, thereby enhancing the network’s ability to select key feature channels;
(2) We design the confusion-aware cosine loss (CACL) module, which identifies easily confused class pairs by dynamically constructing a class confusion matrix and imposes additional separation constraints in the feature space, enhancing the inter-class separability and intra-class compactness of the features;
(3) We design the class-adaptive decision adjustment (CADA) module, which dynamically generates scaling factors for each class based on confusion information, adjusting classification boundaries and confidence distributions to mitigate class imbalance in few-shot scenarios.

2. Materials and Methods

2.1. Overview of the Overall Structure of ConFAS-Net

Existing deep learning methods for few-shot SAR target recognition suffer from inadequate multi-scale feature extraction, difficulty in distinguishing similar classes, and rigid classification boundaries, and are particularly prone to overfitting and misclassification when data are scarce. To address these issues, this paper first adopts the TMDC network as its foundational architecture and customises and optimises it for few-shot scenarios, deriving the multi-scale dense connection (MSDC) backbone network tailored for few-shot SAR target recognition. Building upon this backbone, three innovative modules are introduced to form the ConFAS-Net method, whose network architecture is shown in Figure 1.
First, in the feature extraction stage, a multi-scale channel attention module (MS-CA) is introduced. This module is embedded following each dense block and adaptively enhances the discriminative power of features at different scales by fusing dual-path feature statistics derived from global average pooling and local adaptive pooling. Second, in the loss optimisation stage, a confusion-aware cosine loss (CACL) module is designed. By dynamically constructing a class confusion matrix to identify easily confused class pairs, and applying additional cosine similarity constraints to these samples, the network is guided to widen the distance between easily confused classes in the feature space. Third, a class-adaptive decision adjustment module (CADA) is introduced during the classification decision stage. This module dynamically generates adaptive scaling factors based on the confusion levels of each class, and by adjusting classification boundaries and confidence distributions, alleviates the issue of class imbalance in few-shot scenarios. Finally, a two-stage training strategy is adopted: the first stage employs standard cross-entropy loss for baseline training, whilst the second stage incorporates CACL and CADA for fine-tuning, thereby enhancing the model’s recognition accuracy and generalisation capabilities under low-sample-size conditions.
The data flow within the network is as follows: an 84 × 84 single-channel SAR image is input; following an initial convolutional layer and max-pooling, it sequentially passes through three multi-scale dense blocks (MSDC Block 1, Block 2, Block 3), with an MS-CA module appended to each block for feature enhancement, outputting low-level (128-dimensional), mid-level (256-dimensional), and high-level (1024-dimensional) features respectively. The features from these three levels are combined via global average pooling (GAP) and feature concatenation (Concat) to produce a 1408-dimensional fused feature vector, which is then mapped to 10-dimensional classification logits via a fully connected layer. During the second training stage, the CADA module performs adaptive scaling on the logits, whilst the CACL module calculates the confusion-aware cosine loss and updates the network parameters via backpropagation.
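To make this data flow concrete, the following PyTorch sketch shows the fusion and classification head; the class name and interface are illustrative assumptions, while the dimensions (128/256/1024-dimensional inputs, a 1408-dimensional fused vector, 10-class logits) follow the description above.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Pools and concatenates the three MSDC block outputs, then classifies."""

    def __init__(self, dims=(128, 256, 1024), num_classes=10):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)           # GAP per feature level
        self.fc = nn.Linear(sum(dims), num_classes)  # 1408 -> 10 logits

    def forward(self, feats):
        # feats: three maps, e.g. (B,128,h1,w1), (B,256,h2,w2), (B,1024,h3,w3)
        pooled = [self.gap(f).flatten(1) for f in feats]
        fused = torch.cat(pooled, dim=1)             # (B, 1408) fused vector
        return self.fc(fused)                        # (B, 10) classification logits
```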

2.2. MS-CA Module

In SAR target recognition, discriminative cues are often distributed across both global structures and local scattering details. However, conventional channel attention modules mainly rely on global statistical information and may overlook local spatial structures, limiting their ability to highlight critical target features. To address this issue, we introduce the multi-scale channel attention (MS-CA) module, as illustrated in Figure 2.
MS-CA adopts a dual-path design to extract global and local channel descriptors and fuses them to generate adaptive channel attention weights. Specifically, given an input feature map (F), the global and local descriptors are computed as follows:
$d_g = \mathrm{MLP}\left(\mathrm{GAP}(F)\right)$
$d_l = \mathrm{MLP}\left(\mathrm{Conv}\left(\mathrm{Pool}_{2\times 2}(F)\right)\right)$
$\alpha = \sigma\left(d_g + d_l\right)$
In particular, $d_g$ is the global channel descriptor, which obtains global statistical information via global average pooling (GAP). $d_l$ is the local channel descriptor, which captures local structural information through 2 × 2 adaptive pooling and convolution operations, preserving spatially sensitive scattering details. When fused with the global descriptor, this forms a multi-scale attention mechanism that combines global statistical information with local spatial features. $\sigma$ is the sigmoid activation function, used to transform the fused features into attention weights within the range [0, 1]. The weights resulting from the fusion of the dual-path features take into account both global context and local details, thereby achieving multi-scale channel importance modelling.
To enhance the ability to select feature channels, the MS-CA module employs a pixel-wise weight application mechanism based on a dual-branch architecture. The global branch provides an overall assessment of channel importance, whilst the local branch supplements this with spatially sensitive channel information; the attention weights generated by fusing these two are applied to each spatial location of the original features via a broadcast mechanism. The final weighted features are calculated as follows:
$F' = \alpha \odot F$
$F_{\mathrm{out}} = F + F'$
Here, $F'$ denotes the attention-weighted feature map, and $\odot$ represents element-wise multiplication. $F_{\mathrm{out}}$ represents the final output features; the residual connection preserves the original information, ensuring training stability. This design not only enhances important channels but also prevents information loss, thereby effectively improving the ability to distinguish SAR targets.
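A minimal PyTorch sketch of the MS-CA mechanism defined by the equations above; the reduction ratio of the shared MLP and the exact form of the local convolution are assumptions, as they are not specified in the text.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-scale channel attention: global (GAP) and local (2x2 pooling + conv)
    descriptors pass through a shared MLP and are fused into channel weights."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)           # global path
        self.local_pool = nn.AdaptiveAvgPool2d(2)    # local 2x2 adaptive pooling
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=2)  # 2x2 -> 1x1
        self.mlp = nn.Sequential(                    # shared across both paths
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, f):
        d_g = self.mlp(self.gap(f))                          # d_g = MLP(GAP(F))
        d_l = self.mlp(self.local_conv(self.local_pool(f)))  # d_l = MLP(Conv(Pool(F)))
        alpha = torch.sigmoid(d_g + d_l)                     # fused weights in [0, 1]
        return f + alpha * f                                 # F_out = F + alpha ⊙ F
```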
Comparison with existing attention mechanisms. It is worth clarifying the distinctions between MS-CA and two representative attention methods. CBAM [35] adopts a sequential two-stage design that first generates channel attention via global average pooling and global max pooling and then applies an independent spatial attention branch using pooling along the channel axis. The two stages serve different purposes—channel selection and spatial localisation—without mutual information exchange. SCNet [36] employs a self-calibration mechanism that splits feature channels into two groups processed by convolutions with different kernel sizes, enabling cross-scale interaction at the convolutional level. In contrast, MS-CA integrates spatial structural information directly into the channel attention weights through multi-resolution pooling, achieving a unified mechanism without requiring a separate spatial branch or channel splitting. Moreover, the shared MLP across the global and local branches enforces a common transformation space, enabling the network to learn complementary global–local relationships rather than processing independent channel subsets.
Suitability for SAR imagery. This design is particularly motivated by two distinct types of discriminative information in SAR images: (1) localised strong scattering centres (e.g., corner reflectors on turrets and engine compartments), which produce high-intensity returns in small spatial regions; (2) distributed contextual features (e.g., shape contours and shadow patterns), which provide global structural cues.
Conventional channel attention relying solely on global average pooling tends to dilute localised scattering signatures across the entire spatial extent. The local branch of MS-CA, operating at 2 × 2 resolution, preserves the spatial distribution of these strong scatterers, allowing the channel weights to reflect not only how much average energy each channel contains but also where critical scattering occurs. This dual sensitivity is essential for distinguishing targets that share similar global silhouettes but differ in fine-grained scattering details (e.g., T-72 vs. T-62, BMP-2 vs. BTR-60).

2.3. CACL Module

In few-shot SAR target recognition, visually similar classes are easily confused due to limited training samples and similar scattering characteristics. Conventional cross-entropy and standard cosine-based losses do not explicitly account for class-dependent confusion, which limits their ability to enforce targeted discrimination between easily confused categories.
To address this issue, we propose the confusion-aware cosine loss (CACL), which introduces class-specific adaptive margins based on confusion statistics. By explicitly modelling the degree of inter-class confusion, CACL imposes stronger constraints on easily confused categories and thus provides more targeted supervision in the feature space. Specifically, the confusion degree of each class is estimated from the confusion matrix and then used to generate adaptive margins for subsequent cosine-based discrimination. The corresponding formulation is given as follows:
$S_i = \dfrac{\sum_{j=1,\, j \neq i}^{C} M_{ij}}{\sum_{k=1}^{C} M_{ik}}$
$m_i = m_0 + \lambda_m S_i$
$\hat{Z} = \dfrac{Z}{\left\| Z \right\|_2}$
In particular, $S_i$ denotes the confusion degree of class i, defined as the proportion of samples from class i that are misclassified into other classes according to the confusion matrix $M$. Based on $S_i$, the confusion-aware margin $m_i$ is adaptively generated, where $m_0$ is the base margin and $\lambda_m$ controls the strength of the margin adjustment. $\hat{Z}$ denotes the L2-normalised feature vector, which projects features onto the unit hypersphere for subsequent cosine-based discrimination.
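These statistics can be computed directly from a confusion matrix; the sketch below illustrates the computation, with the default values of $m_0$ and $\lambda_m$ as placeholders (the actual values are those listed in Table 2).

```python
import torch

def confusion_margins(conf_mat, m0=0.1, lam_m=0.5):
    """conf_mat: (C, C) counts with rows as true classes, columns as predictions.
    Returns per-class confusion degrees S_i and adaptive margins m_i."""
    row_sum = conf_mat.sum(dim=1).clamp(min=1)      # samples per true class
    s = (row_sum - conf_mat.diagonal()) / row_sum   # S_i: misclassified fraction
    m = m0 + lam_m * s                              # m_i = m_0 + lambda_m * S_i
    return s, m
```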
As illustrated in Figure 3, the confusion matrix is used to estimate class-specific confusion degrees and generate corresponding adaptive margins. These margins impose stronger discrimination constraints on easily confused categories. Based on the margin-adjusted normalised features, the CACL loss is formulated as follows:
$\mathcal{L}_{\mathrm{CACL}} = \dfrac{1}{N} \sum_{n=1}^{N} \log\left( \dfrac{\left\| \tilde{z}_n - \tilde{y}_n \right\|_2}{B} + 0.5 \right)$
$B = \dfrac{\sqrt{C + (C-1)\,C\,\tau^2}}{C}$
$\mathcal{L} = \begin{cases} \mathcal{L}_{\mathrm{CACL}}, & t \le \alpha T \\ \mathcal{L}_{\mathrm{CACL}} + \beta\, \mathcal{L}_{\mathrm{CE}}\left(z_{\mathrm{CADA}}, y\right), & t > \alpha T \end{cases}$
In particular, $\tilde{z}_n$ denotes the margin-adjusted and L2-normalised prediction vector of the n-th sample, $\tilde{y}_n$ denotes the softened target vector, and $B$ is the normalisation boundary determined by the number of classes $C$ and the label-shift factor $\tau$. Under this formulation, $\mathcal{L}_{\mathrm{CACL}}$ enforces confusion-aware discrimination by enhancing the alignment between predictions and class targets while incorporating adaptive class-specific margins.
The overall loss $\mathcal{L}$ is defined in a stage-wise manner, where $t$ denotes the current training epoch, $T$ the total number of epochs, and $\alpha$ the stage-switching ratio. In Stage 1 ($t \le \alpha T$), only $\mathcal{L}_{\mathrm{CACL}}$ is used to learn a discriminative feature space. In Stage 2, an additional weighted cross-entropy term with CADA-adjusted logits is introduced to further refine the final decision boundaries, where $\beta$ controls the contribution of the cross-entropy term.
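The following sketch illustrates the stage-wise objective, assuming the reconstructed forms of $\mathcal{L}_{\mathrm{CACL}}$ and $B$ above; the application of the adaptive margins $m_i$ is omitted for brevity, and the softened targets are treated as precomputed.

```python
import math
import torch
import torch.nn.functional as F

def cacl_loss(z, y_soft, num_classes, tau):
    """CACL term: L2-normalised predictions pulled towards softened targets,
    with the distance normalised by the boundary B."""
    z_hat = F.normalize(z, p=2, dim=1)  # project onto the unit hypersphere
    b = math.sqrt(num_classes + (num_classes - 1) * num_classes * tau ** 2) / num_classes
    dist = torch.linalg.vector_norm(z_hat - y_soft, dim=1)
    return torch.log(dist / b + 0.5).mean()

def total_loss(z, z_cada, y_soft, y, epoch, num_classes, tau,
               total_epochs=300, alpha=0.6, beta=0.05):
    """Stage 1 (epoch <= alpha*T): CACL only; Stage 2: add weighted CE on
    CADA-adjusted logits, matching the stage-wise definition above."""
    loss = cacl_loss(z, y_soft, num_classes, tau)
    if epoch > alpha * total_epochs:
        loss = loss + beta * F.cross_entropy(z_cada, y)
    return loss
```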

2.4. CADA Module

In few-shot SAR target recognition, different classes exhibit different levels of recognition difficulty, and some categories are more easily confused under complex imaging conditions. Conventional cross-entropy loss treats all classes equally and cannot explicitly enhance decision confidence for difficult categories. To address this issue, we propose the class-adaptive decision adjustment (CADA) module.
As shown in Figure 4, CADA uses the confusion matrix to estimate class-wise confusion scores and transforms them into adaptive scaling factors. During training, the scaling factor corresponding to the ground-truth class is selected and applied to adjust the true-class logit, thereby strengthening decision confidence for difficult categories.
The key idea of CADA is to assign larger scaling factors to classes with higher confusion scores, while keeping the logits of well-recognised classes nearly unchanged. Integrated into the second stage of training, CADA complements CACL by refining decision boundaries in logit space, whereas CACL enhances inter-class discrimination in feature space. Together, the two modules improve the recognition of hard classes under low-data conditions.
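A minimal sketch of this adjustment is given below; the linear form of the scaling factor and the default strength are assumptions consistent with the description above, with the actual $\lambda_s$ value given in Table 2.

```python
import torch

def cada_adjust(logits, labels, s, lam_s=0.2):
    """Scales the ground-truth logit of each sample by a class-adaptive factor;
    classes with higher confusion scores s receive larger amplification."""
    gamma = 1.0 + lam_s * s                  # per-class scaling factors, shape (C,)
    adjusted = logits.clone()
    idx = torch.arange(logits.size(0))
    adjusted[idx, labels] = logits[idx, labels] * gamma[labels]
    return adjusted
```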

3. Results

3.1. Experimental Dataset

This paper employs the MSTAR (moving and stationary target acquisition and recognition) dataset for experimental validation [37]. The MSTAR dataset is the most widely used public benchmark in the field of SAR target recognition, jointly released by the US Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). The dataset comprises SAR images of various types of military vehicles acquired at different depression angles, featuring real-world imaging conditions and a wide variety of target variations. This paper employs a 10-class target recognition task under standard operating conditions (SOC) [38], comprising 10 types of military vehicles: the BMP-2, BTR-70, T-72, BTR-60, 2S-1, BRDM-2, D-7, T-62, ZIL-131, and ZSU-23/4. The training set comprises 2747 images captured at a 17° depression angle, whilst the test set comprises 2723 images captured at a 15° depression angle, as shown in Table 1. This training-test split across different depression angles better reflects real-world application scenarios and effectively evaluates the model’s generalisation ability. All images were centre-cropped and uniformly resized to 84 × 84 pixels for use as network inputs.
To further validate the generalizability of ConFAS-Net across different data domains, we additionally conduct experiments on the synthetic and measured paired labeled experiment (SAMPLE) dataset [39]. The SAMPLE dataset was developed by AFRL and comprises both synthetic and measured SAR imagery for 10 classes of military ground vehicles, corresponding to the same target categories as MSTAR: 2S1, BMP-2, BRDM-2, BTR-60, BTR-70, D-7, T-62, T-72, ZIL-131, and ZSU-23/4. Unlike MSTAR, the SAMPLE dataset provides paired synthetic-measured SAR image samples, making it particularly suitable for evaluating model robustness and cross-domain generalization. In our few-shot experiments, only the measured SAR images are utilized to maintain consistency with real-world recognition scenarios. For each K-shot setting (K = 5, 10, 15, 30), K images per class are randomly sampled to construct the training set, with the remaining measured images reserved for testing. All images are uniformly resized to 84 × 84 pixels, consistent with the preprocessing pipeline applied to the MSTAR dataset.
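The K-shot construction described above reduces to a per-class random split; a minimal sketch follows, assuming samples are represented as (image path, label) pairs.

```python
import random
from collections import defaultdict

def k_shot_split(samples, k, seed=42):
    """samples: list of (image_path, label) pairs of measured SAR images.
    Returns K randomly chosen images per class for training, the rest for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    train, test = [], []
    for label in by_class:
        items = by_class[label][:]
        rng.shuffle(items)          # per-class random sampling
        train.extend(items[:k])     # K shots per class for training
        test.extend(items[k:])      # remaining measured images for testing
    return train, test
```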

3.2. Experimental Setup

The experiments in this paper were implemented using the PyTorch 2.4.1 deep learning framework on an NVIDIA GeForce RTX 4090 GPU with 24 GB of video memory. The model was trained using the SGD optimiser, with an initial learning rate of 0.1, a momentum coefficient of 0.9, and a weight decay coefficient of 0.0001. A step-down learning rate schedule was employed, multiplying the learning rate by 0.3 every 150 epochs. Training ran for 300 epochs in total, with a batch size of 32 and a fixed random seed of 168 to ensure the reproducibility of the experiment. Classification accuracy was adopted as the evaluation metric, defined as the proportion of correctly classified samples out of the total number of test samples. Given the relatively balanced distribution of samples across categories in the MSTAR dataset, this metric effectively reflects the model’s overall recognition performance. For experiments on the SAMPLE dataset, three different random seeds (42, 168, 233) are applied to account for the variance introduced by random training set sampling, and the results are reported as mean accuracy.
ConFAS-Net adopts a two-stage training strategy with a total budget of 300 epochs. In the first stage (Epochs 1–180), the network is trained exclusively with the confusion-aware cosine loss (CACL) to establish a well-structured feature space with enlarged inter-class separation for easily confused category pairs. In the second stage (Epochs 181–300), the CADA module is activated, and the training objective transitions to a joint loss combining the CACL loss and a weighted cross-entropy loss with CADA-scaled logits, where the weighting coefficient β is set to 0.05. The CADA module dynamically scales the logit outputs based on confusion statistics, amplifying decision confidence for difficult classes. All network parameters remain fully trainable throughout both stages without any parameter freezing. The learning rate decays from 0.1 to 0.03 at epoch 150 via a step schedule, facilitating a smooth transition between stages. This staged design ensures that CACL first establishes a mature feature space before CADA operates at the decision level, avoiding the instability of simultaneously optimising all objectives from scratch with limited training data.
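The optimisation setup translates directly into PyTorch; the sketch below uses a stand-in module in place of the full network.

```python
import torch
import torch.nn as nn

def build_optim(model: nn.Module):
    # SGD with the reported settings: lr 0.1, momentum 0.9, weight decay 1e-4;
    # the step schedule multiplies the lr by 0.3 every 150 epochs (0.1 -> 0.03).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.3)
    return optimizer, scheduler

torch.manual_seed(168)  # fixed random seed, as reported above
optimizer, scheduler = build_optim(nn.Linear(1408, 10))  # stand-in module for demo
for epoch in range(1, 301):
    # Stage 1 (epochs 1-180): CACL only; Stage 2 (epochs 181-300): CACL + beta*CE
    # ... one pass over mini-batches of 32 would go here ...
    scheduler.step()  # epoch-level learning rate decay
```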
Table 2 lists the key hyperparameters of ConFAS-Net. The label-shift factor $\tau$ controls the degree of soft-label smoothing in CACL. The margin parameters $m_0$ and $\lambda_m$ determine the base margin and the scaling intensity of the confusion-aware adaptive margins, respectively. The CADA scaling strength $\lambda_s$ controls how strongly the class-wise confusion scores influence the logit scaling. These hyperparameters were determined through grid search on the 15-shot validation setting.

3.3. Comparative Experiment

3.3.1. Recognition Performance Under Different K-Shot Settings on MSTAR Dataset

As shown in Table 3, the test accuracy of the proposed ConFAS-Net method improves markedly as the number of training samples increases across all K-shot settings, validating the method’s learning ability and adaptability.
Specifically, under the extremely low 5-shot setting, the model achieved an accuracy of 73.25%, indicating that the proposed confusion-aware and attention mechanisms can extract key features even under extreme sample scarcity. When the sample size increased to 10-shot, the accuracy rose sharply to 87.43%, an increase of 14.18 percentage points, indicating that the model learns more robust feature representations from the additional samples. Under the 15-shot and 30-shot settings, the model achieved high accuracies of 94.97% and 96.87%, respectively, indicating that ConFAS-Net fully exploits the potential of the data when the sample size is moderate, approaching the performance of fully supervised learning.

3.3.2. Comparative Experiment on the MSTAR Dataset

To comprehensively evaluate the proposed method, Table 4 provides a detailed comparison with three categories of representative existing methods, selected to cover different research directions and technical routes: (1) classic deep convolutional neural networks, the backbone models widely used in image recognition tasks, including ResNet-18, Inception, and DenseNet; (2) classic few-shot learning methods, the mainstream approaches to the few-shot problem, including Prototypical Networks and DeepEMD; and (3) state-of-the-art methods specifically designed for SAR target recognition, including Dens-CapsNet, FTL-dis, Prior-EDL, and PD Network.
The experimental results show that ConFAS-Net achieved the best performance across all K-shot settings.
Compared with classical CNN models, ResNet-18, Inception, and DenseNet exhibit limited recognition performance under few-shot conditions, with accuracies hovering around 60% in the 5-shot setting. Our method improves on DenseNet by 13.73 and 12.27 percentage points in the 5-shot and 10-shot settings, respectively, demonstrating the necessity of a design specifically tailored to few-shot SAR tasks. Among classical few-shot methods, Prototypical Networks achieved significant improvements over the baseline CNNs through a metric learning strategy; however, our method still outperforms Prototypical Networks across all settings, for example by 4.97 and 3.41 percentage points in the 10-shot and 15-shot settings, respectively. DeepEMD employs the Earth Mover’s Distance for feature matching but performs poorly on SAR images, achieving only 52.24% in the 5-shot setting. This indicates that few-shot methods designed for natural images are difficult to transfer directly to the SAR domain; feature enhancement and confusion-aware designs tailored to domain-specific characteristics are necessary. Compared with state-of-the-art methods in the SAR domain, our method achieves improvements across all settings. Under the extremely scarce 5-shot setting, ConFAS-Net achieves a marginal gain of 0.08 percentage points over the MSDC baseline, indicating that the proposed modules do not degrade performance even under extreme data scarcity. As the number of available samples increases, the improvements become more pronounced, reaching 2.43, 2.93, and 1.28 percentage points in the 10-shot, 15-shot, and 30-shot settings, respectively, validating the consistent effectiveness of the three proposed modules across varying data conditions.
In summary, ConFAS-Net not only demonstrates greater robustness under conditions of extremely limited data but also achieves a higher upper bound for accuracy as the dataset size increases, thereby demonstrating its superiority and effectiveness in small-sample SAR target recognition tasks.

3.3.3. Comparative Experiment on the SAMPLE Dataset

To further evaluate the generalizability of ConFAS-Net across SAR datasets, additional comparative experiments are conducted on the SAMPLE dataset. Since most of the SAR-specific methods compared in Table 4 have not been evaluated on SAMPLE and their source code is not publicly available for reproduction, representative baselines covering classical CNN architectures, metric-based few-shot learning, and the direct predecessor method are selected for this comparison. As shown in Table 5, ConFAS-Net consistently achieves the highest recognition accuracy across all K-shot settings, demonstrating that the proposed method maintains its advantage beyond the MSTAR benchmark.
Among classical CNN backbones, ResNet-18 and DenseNet achieve relatively competitive performance on SAMPLE, yet ConFAS-Net still surpasses them by 7.78 and 8.35 percentage points in the 5-shot setting, respectively. Inception, by contrast, achieves only 60.45% in the 5-shot setting, indicating its susceptibility to overfitting under severe data scarcity. Compared with the few-shot learning baseline Prototypical Networks and the TMDC-CNNs backbone, ConFAS-Net achieves improvements of 15.12 and 11.82 percentage points in the 5-shot setting, respectively. Across all shot settings, ConFAS-Net maintains a consistent performance advantage, validating that the proposed MS-CA, CACL, and CADA modules contribute stable improvements even when applied to a different SAR dataset with distinct imaging characteristics. These results demonstrate the cross-dataset generalizability of ConFAS-Net and its robustness under varying few-shot data conditions.

3.4. Ablation Experiment

3.4.1. A Comparison of Strategies for Updating the CACL Confusion Matrix

To address the response lag associated with offline confusion matrix updates, we designed an online exponential moving average (EMA) update strategy.
As shown in Table 6, online updates outperform the offline strategy across all K-shot settings, with particularly notable gains in the 5-shot and 10-shot scenarios (0.33 and 0.29 percentage points, respectively). This indicates that, under few-shot conditions, confusion patterns change rapidly, and online updates adjust the loss weights more promptly. In practical deployment, should data distribution drift occur (such as new scenarios or new noise levels), the online EMA strategy can be employed to adapt the confusion statistics dynamically; during the training phase, the offline strategy is sufficient whilst balancing computational overhead and performance.
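A minimal sketch of the online update, assuming a row-normalised confusion matrix; the EMA momentum value is a placeholder, as the paper does not report the coefficient.

```python
import torch

def batch_confusion(preds, labels, num_classes):
    # Row-normalised confusion matrix from one mini-batch of predictions.
    cm = torch.zeros(num_classes, num_classes)
    for p, t in zip(preds.tolist(), labels.tolist()):
        cm[t, p] += 1
    return cm / cm.sum(dim=1, keepdim=True).clamp(min=1)

def ema_confusion_update(conf_mat, batch_conf, momentum=0.9):
    # Online EMA update: old statistics decay while new batch evidence flows in.
    return momentum * conf_mat + (1.0 - momentum) * batch_conf
```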

3.4.2. Complete Module Ablation Experiment

To validate the effectiveness of the modules proposed in this paper, ablation experiments were conducted under a 15-shot setting, with the results shown in Table 7. Using MSDC as the baseline, the three modules—MS-CA, CACL, and CADA—were added sequentially to analyse the independent contributions of each module and their combined effects.
As shown in Figure 5, when combined in pairs, the MS-CA and CACL combination achieved the second-best performance (94.72%, +2.68 percentage points), markedly higher than the sum of the individual contributions (0.53 + 1.16 = 1.69 percentage points), indicating a strong synergy between the attention-enhanced feature representations and the confusion-aware loss. The combination of MS-CA + CADA (93.61%, +1.57) also exceeded the sum of individual gains (0.53 + 0.91 = 1.44), further validating the complementary nature of feature enhancement and decision adjustment. The CACL + CADA combination (93.28%, +1.24) yielded gains below the arithmetic sum of the individual contributions (1.16 + 0.91 = 2.07), which is expected, as both modules operate on confusion matrix information and share overlapping optimisation objectives. Nevertheless, when all three modules are enabled, ConFAS-Net achieves a maximum accuracy of 94.97%, a 2.93-percentage-point improvement over the baseline. The complete model outperforms all two-module combinations, confirming that the three modules are complementary overall, despite the partial overlap between CACL and CADA.

4. Discussion

4.1. Module Interoperability Analysis

To conduct an in-depth analysis of the coupling and synergy between modules, we further compared the training convergence behaviour and changes in class confusion following the gradual introduction of each module. The three core modules of ConFAS-Net operate at different hierarchical levels within the network and have clearly defined functional roles. MS-CA operates at the feature-extraction stage (feature-level), enhancing discriminative feature representations through multi-scale channel attention; CACL operates at the metric-learning stage (metric-level), optimising inter-class distances and intra-class compactness via confusion-aware loss; CADA operates at the decision-making stage (decision-level), adjusting classification boundaries through class-adaptive scaling. These three modules correspond respectively to the complete recognition chain of ‘feature-metric-decision’, and in theory, there is no direct conflict between them. To verify this hypothesis, we conducted an analysis across three dimensions: training stability, feature quality, and gradient flow.
Regarding training stability: The introduction of CADA slightly accelerated training convergence (reaching stability approximately 12–15 epochs earlier) without oscillation or divergence. Under the 15-shot setting, the baseline (MSDC) converged at epoch 180, MS-CA + CACL at epoch 165, and full ConFAS-Net at epoch 150, confirming that CADA and CACL are aligned in gradient direction and jointly accelerate convergence.
Regarding changes in feature quality and confusion: To further validate the synergy between modules, we analysed the changes in the diagonal elements of the confusion matrix under different module combinations. Taking the 15-shot setting as an example, the mean of the confusion matrix diagonal for the baseline model was 0.82, which increased to 0.87 after introducing MS-CA, indicating enhanced feature separability. Upon further integration of CACL, the mean diagonal value increased to 0.91, whilst the maximum off-diagonal value decreased from 0.31 to 0.18, indicating a significant improvement in the separation of easily confused classes. Finally, after introducing CADA, although the mean of the diagonal remained at 0.91 (indicating no degradation in feature quality), the final classification accuracy increased from 94.72% to 94.97%, a rise of 0.25 percentage points. This suggests that CADA did not alter the feature space structure constructed by CACL, but rather further optimised the classification boundaries through adaptive adjustments at the decision layer, with the two exhibiting a complementary and synergistic relationship.
Regarding gradient flow: CACL primarily influences the feature extractor via confusion-aware loss weights, whilst CADA acts on classifier logit outputs with gradients fed back mainly to the classifier layer, resulting in a degree of decoupling between the two. Experiments confirmed this—the variation in gradient norm across feature extractor layers following CADA incorporation was less than 5%.
The above analysis demonstrates that the three modules of ConFAS-Net have a clear division of labour in terms of their roles, optimisation objectives, and gradient flow, with no instances of mutual cancellation or conflict. MS-CA enhances the quality of feature representations, CACL optimises the metric learning process, and CADA further improves the decision boundary whilst maintaining feature quality. Together, these three components form a complete optimisation chain, spanning features, metrics, and decision-making, while collectively enhancing the model’s performance in the small-sample SAR target recognition task. Furthermore, the generalizability of this modular design was additionally validated by experiments on the SAMPLE dataset, where ConFAS-Net achieves the best recognition accuracy among all compared methods across K-shot settings, suggesting that the synergistic interaction of MS-CA, CACL, and CADA is not dataset-specific but transferable across SAR datasets with different imaging characteristics.

4.2. Performance Analysis

To conduct an in-depth analysis of the training characteristics and generalisation capabilities of the ConFAS-Net framework, Figure 6 shows the training curves for four settings: 5-shot, 10-shot, 15-shot, and 30-shot.
As shown by the training accuracy curves, ConFAS-Net converges stably under all K-shot settings, mainly owing to the multi-scale feature extraction of the MS-CA module, which captures key discriminative features from limited samples. Under the 5-shot setting, the training accuracy stabilises after about 80 epochs, and convergence accelerates as samples increase, reaching stability at around 120 epochs for 30-shot. The gap between the training and validation curves reflects the model’s generalisation ability. In the 5-shot setting, the accuracy gap is about 30 percentage points due to overfitting, a common issue in few-shot learning, but CACL alleviates inter-class confusion via confusion-aware boundary adjustment, keeping validation accuracy at around 70%. As samples increase to 15-shot and 30-shot, the two curves gradually converge, demonstrating strong generalisation. The fluctuation of the validation curve indicates the stability of model prediction: under low-sample conditions, validation accuracy fluctuates significantly, while the CADA module’s class-adaptive decision strategy dynamically adjusts classification boundaries according to class confusion, enhancing robustness. As observed from the curves, validation accuracy fluctuates far less under the 30-shot setting, yielding a smoother and more stable curve.
In summary, the analysis of the training curves indicates that the three core modules within the ConFAS-Net framework work in concert: MS-CA provides stable feature representations, CACL optimises the inter-class decision boundary, and CADA enhances decision robustness, collectively achieving excellent performance in the task of SAR target recognition with limited data.

4.3. Visual Analysis

To provide an in-depth analysis of the mechanisms underlying the performance improvements of the ConFAS-Net framework compared to the baseline TMDC method, this section conducts a visual comparative analysis across three dimensions—confusion matrices, feature distributions, and class activation heatmaps—to intuitively validate the synergistic optimisation effects of each module.

4.3.1. Comparative Analysis of Confusion Matrices

Figure 7 compares the confusion matrices of the TMDC baseline model and ConFAS-Net on the 15-shot test set. A comparison of Figure 7a,b reveals that ConFAS-Net achieves significant improvements across several easily confused categories: the number of samples misclassified between 2S1 and D7 fell from 33 to 5, a reduction of 84.8%; between T72 and T62 from 10 to 3, a reduction of 70.0%; and between ZIL131 and D7 from 14 to 10, a reduction of 28.6%. These improvements are primarily attributed to the synergistic interaction of two modules: the CACL module identifies the aforementioned easily confused category pairs by dynamically constructing a category confusion matrix and applies targeted separation constraints in the feature space, effectively widening the feature distance between confused categories; the CADA module dynamically generates larger scaling factors for categories with higher confusion at the decision level, thereby enhancing the classifier’s discriminative confidence in these categories. Together, the two modules mitigate category confusion in few-shot scenarios through both metric learning and decision adjustment.

4.3.2. Analysis of the Evolution of T-SNE Feature Distributions

As shown in Figure 8, the t-SNE feature distributions of ConFAS-Net at four key training stages illustrate the evolution of feature separability. At Epoch 1 (11.29%), the 10 target classes were randomly scattered with no clusters formed, reflecting the network’s initial lack of discriminative capability. By Epoch 50 (36.31%), the MS-CA module began to take effect, with initial clustering trends emerging. By Epoch 150 (83.22%), distinct clustering structures formed as CACL’s confusion-aware boundary optimisation took effect, with inter-class separation improving significantly. By Epoch 261 (94.97%), the model reached convergence, with all 10 classes forming compact, well-separated clusters with clear boundaries. This evolution reflects the synergistic mechanism of the three modules: MS-CA provides rich multi-scale feature representations in the early stage; CACL widens inter-class boundaries during the middle phase; and CADA further enhances decision robustness in the late phase. The t-SNE visualisation validates, at the representation level, the effectiveness of the proposed method in improving feature separability under few-shot conditions.

4.3.3. Class-Based Activation Heatmap Visualisation Analysis

To provide an intuitive explanation of the model’s decision-making process, the Grad-CAM technique was used to generate class activation heatmaps for 10 SAR target classes, as shown in Figure 9. Red-highlighted regions indicate the feature areas of highest interest to the model, whilst blue regions denote background areas of lower interest. The heatmaps reveal that ConFAS-Net accurately localises the core scattering regions of each target class: for the 2S1, activated regions focus on the turret and hull structures; for the T-72, they cover the gun barrel and upper hull. Thanks to the MS-CA module’s multi-scale channel attention, background noise is greatly suppressed across all heatmaps. For easily confused classes such as T-72 and T-62, ConFAS-Net produces distinct activation patterns—high activation concentrates on the front gun barrel for T-72 versus the middle hull for T-62—demonstrating that CACL and CADA jointly guide the model to learn fine-grained discriminative features between similar classes. In summary, the visualisation of class-specific activation heatmaps provides intuitive confirmation, from the perspective of the model’s decision-making process, of ConFAS-Net’s advantages in three key areas: precision of feature attention, robustness to background noise, and the ability to distinguish between easily confused classes, thereby offering compelling visual evidence for the model’s interpretability.
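For reference, such heatmaps can be reproduced with a standard hook-based Grad-CAM implementation; the sketch below is generic rather than the authors' exact tooling, and the choice of target_layer (e.g., the last MSDC block) is an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Standard Grad-CAM: channel weights are the spatially averaged gradients
    of the class score w.r.t. the target layer's activations."""
    store = {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0]))
    logits = model(image)                      # image: (1, 1, 84, 84) SAR chip
    model.zero_grad()
    logits[0, class_idx].backward()            # gradient of the target class score
    h1.remove(); h2.remove()
    w = store['grad'].mean(dim=(2, 3), keepdim=True)           # GAP of gradients
    cam = F.relu((w * store['act']).sum(dim=1, keepdim=True))  # weighted activations
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1]
```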

5. Conclusions

To address the challenge of SAR target recognition under small-sample conditions, this paper proposes ConFAS-Net, a SAR target recognition method based on multi-scale discriminative features and confusion-aware learning. The MS-CA module integrates a multi-scale channel attention mechanism, utilising a multi-branch parallel structure to extract feature representations across different receptive fields, providing rich discriminative features for subsequent classification. The confusion-aware cosine loss (CACL) dynamically adjusts loss boundaries for easily confused category pairs based on confusion matrix statistics, effectively increasing inter-class separation in the feature space. The class-adaptive decision adjustment module (CADA) calculates adaptive scaling factors based on the confusion level of each category and applies targeted adjustments to the logits, thereby enhancing decision robustness. Experimental results demonstrated that ConFAS-Net achieves a recognition accuracy of 94.97% on the MSTAR dataset under the 15-shot setting, representing an improvement of 2.93 percentage points over the baseline. Additional experiments on the SAMPLE dataset further validated the cross-dataset generalizability of the proposed method, with ConFAS-Net consistently outperforming all compared methods across K-shot settings. Ablation experiments confirmed the independent and complementary contributions of each module. In the future, we plan to investigate the integration of ConFAS-Net with pre-trained foundation models to further enhance few-shot generalisation across diverse SAR scenarios. Additionally, we will explore the extension of the proposed confusion-aware learning framework to cross-domain SAR target recognition tasks, where significant distribution shifts arise from variations in sensor parameters, imaging geometry, and environmental conditions, with the aim of developing a more robust and broadly applicable recognition system.

Author Contributions

Conceptualization, X.Z.; methodology, Y.T.; software, J.Y.; validation, X.Z.; investigation, B.L. and W.Z.; data curation, W.W. and X.Z.; writing—original draft preparation, X.Z. and Y.T.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Plan Project (2021JH2/10200023) of Liaoning Province, China, and the Key Scientific Research Project (LJZZ212410154029) of the Education Department of Liaoning Province, China.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare there are no conflicts of interest.

References

  1. Li, J.; Yu, Z.; Yu, L.; Cheng, P.; Chen, J.; Chi, C. A comprehensive survey on SAR ATR in deep-learning era. Remote Sens. 2023, 15, 1454. [Google Scholar] [CrossRef]
  2. Wen, Y.; Wang, X.; Peng, L.; Qiao, Y. A coarse-to-fine hierarchical feature learning for SAR automatic target recognition with limited data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13646–13656. [Google Scholar] [CrossRef]
  3. Moreira, A.; Prats-Iraola, P.; Younis, M.; Krieger, G.; Hajnsek, I.; Papathanassiou, K.P. A tutorial on synthetic aperture radar. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–43. [Google Scholar] [CrossRef]
  4. Deng, H.; Pi, D.; Zhao, Y. Ship target detection based on CFAR and deep learning SAR image. J. Coast. Res. 2019, 94, 161–164. [Google Scholar] [CrossRef]
  5. Liu, B.; He, K.; Han, M.; Hu, X.; Ma, G.; Wu, M. Application of UAV and GB-SAR in mechanism research and monitoring of Zhonghaicun landslide in southwest China. Remote Sens. 2021, 13, 1653. [Google Scholar] [CrossRef]
  6. Clemente, C.; Pallotta, L.; Gaglione, D.; De Maio, A.; Soraghan, J.J. Automatic target recognition of military vehicles with Krawtchouk moments. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 493–500. [Google Scholar] [CrossRef]
  7. Pei, J.; Huang, Y.; Huo, W.; Zhang, Y.; Yang, J.; Yeo, T.-S. SAR automatic target recognition based on multiview deep learning framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2196–2210. [Google Scholar] [CrossRef]
  8. Wang, L.; Bai, X.; Xue, R.; Zhou, F. Few-shot SAR automatic target recognition based on Conv-BiLSTM prototypical network. Neurocomputing 2021, 443, 235–246. [Google Scholar] [CrossRef]
  9. Li, Y.; Chen, W.; Hu, X.; Chen, B.; Wang, D.; Qu, C.; Meng, F.; Wang, P.; Liu, H. AOT: Aggregation optimal transport for few-shot SAR automatic target recognition. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 5088–5103. [Google Scholar] [CrossRef]
  10. Li, W.; Yang, W.; Liu, T.; Hou, Y.; Li, Y.; Liu, Z.; Liu, Y.; Liu, L. Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture. ISPRS J. Photogramm. Remote Sens. 2024, 218, 326–338. [Google Scholar] [CrossRef]
  11. Ding, B.; Wen, G.; Huang, X.; Ma, C.; Yang, X. Target recognition in synthetic aperture radar images via matching of attributed scattering centers. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3334–3347. [Google Scholar] [CrossRef]
  12. Rui, J.; Wang, C.; Zhang, H.; Jin, F. Multi-sensor SAR image registration based on object shape. Remote Sens. 2016, 8, 923. [Google Scholar] [CrossRef]
  13. Liu, Z.; Wang, L.; Wen, Z.; Li, K.; Pan, Q. Multilevel scattering center and deep feature fusion learning framework for SAR target recognition. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5227914. [Google Scholar] [CrossRef]
  14. Gao, F.; Huang, T.; Sun, J.; Wang, J.; Hussain, A.; Yang, E. A new algorithm for SAR image target recognition based on an improved deep convolutional neural network. Cogn. Comput. 2019, 11, 809–824. [Google Scholar] [CrossRef]
15. Liu, J.; Xing, M.; Yu, H.; Sun, G. EFTL: Complex convolutional networks with electromagnetic feature transfer learning for SAR target recognition. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5209811.
16. Liao, L.; Du, L.; Chen, J.; Cao, Z.; Zhou, K. EMI-Net: An end-to-end mechanism-driven interpretable network for SAR target recognition under EOCs. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205118.
17. Li, R.; Wang, X.; Wang, J.; Song, Y.; Lei, L. SAR target recognition based on efficient fully convolutional attention block CNN. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4005905.
18. Shi, B.; Zhang, Q.; Wang, D.; Li, Y. Synthetic aperture radar (SAR) image target recognition algorithm based on attention mechanism. IEEE Access 2021, 9, 140512–140524.
19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
20. Lin, H.; Xie, Z.; Zeng, L.; Yin, J. Multi-scale time-frequency representation fusion network for target recognition in SAR imagery. Remote Sens. 2025, 17, 2786.
21. Li, C.; Ni, J.; Luo, Y.; Wang, D.; Zhang, Q. A dual-branch spatial-frequency domain fusion method with cross attention for SAR image target recognition. Remote Sens. 2025, 17, 2378.
22. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
23. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
24. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 499–515.
25. Geng, J.; Ma, W.; Jiang, W. Causal intervention and parameter-free reasoning for few-shot SAR target recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12702–12714.
26. Zhou, X.; Tang, T.; He, Q.; Zhao, L.; Kuang, G.; Liu, L. Simulated SAR prior knowledge guided evidential deep learning for reliable few-shot SAR target recognition. ISPRS J. Photogramm. Remote Sens. 2024, 216, 1–14.
27. Wang, S.; Wang, Y.; Liu, H.; Sun, Y.; Zhang, C. A few-shot SAR target recognition method by unifying local classification with feature generation and calibration. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5200319.
28. Aladwani, T.; Pantazi-Kypraiou, M.; Goudroumanis, G.R.; Floros, G.; Anagnostopoulos, C. A federated few-shot learning siamese network framework with data label imbalance. In Proceedings of the 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW), Glasgow, UK, 21–23 July 2025; pp. 56–62.
29. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017.
30. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
31. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv 2017, arXiv:1703.03400.
32. Xu, J.; Liu, B.; Xiao, Y. A multitask latent feature augmentation method for few-shot learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 6976–6990.
33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
34. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 1597–1607.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
36. Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Wang, C.; Feng, J. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10093–10102.
37. Ross, T.; Worrell, S.; Velten, V.; Mossing, J.; Bryant, M. Standard SAR ATR evaluation experiments using the MSTAR public release data set. Proc. SPIE 1998, 3370, 566–573.
38. Keydel, E.R.; Lee, S.W.; Moore, J.T. MSTAR extended operating conditions. Proc. SPIE 1996, 2757, 228–242.
39. Lewis, B.; Scarnati, T.; Sudkamp, E.; Nehrbass, J.; Rosencrantz, S.; Zelnio, E. A SAR dataset for ATR development: The Synthetic and Measured Paired Labeled Experiment (SAMPLE). Proc. SPIE 2019, 10987, 109870H.
40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
41. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
42. Huang, G.; Liu, Z.; Van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
43. Guan, J.; Liu, J.; Feng, P.; Wang, W. Multiscale deep neural network with two-stage loss for SAR target recognition with small training set. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4011405.
44. Wang, Q.; Xu, H.; Yuan, L.; Wen, X. Dense capsule network for SAR automatic target recognition with limited data. Remote Sens. Lett. 2022, 13, 533–543.
45. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Differentiable earth mover’s distance for few-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5632–5648.
46. Zhang, C.; Dong, H.; Deng, B. Improving pre-training and fine-tuning for few-shot SAR automatic target recognition. Remote Sens. 2023, 15, 1709.
47. Zhang, L.; Leng, X.; Feng, S.; Ma, X.; Ji, K.; Kuang, G.; Liu, L. Optimal azimuth angle selection for limited SAR vehicle target recognition. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103707.
Figure 1. The proposed network architecture of ConFAS-Net for few-shot SAR target recognition. The MS-CA, CACL, and CADA modules are integrated to enhance multi-scale feature discriminability, mitigate inter-class confusion, and adaptively adjust classification boundaries, respectively.
Figure 2. The proposed multi-scale channel attention (MS-CA) module for SAR target recognition. The dual-branch structure captures both global and local channel importance, and adaptively weights input feature maps to enhance discriminative feature selection.
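To make Figure 2 concrete, the following is a minimal PyTorch sketch of a dual-branch channel attention block in the spirit of MS-CA. The pooling sizes, the depthwise local squeeze, and the sum-then-sigmoid fusion are illustrative assumptions rather than the exact published design; only the channel reduction ratio r = 16 is taken from Table 2.

```python
# Minimal dual-branch channel attention sketch (assumed structure, not the
# authors' exact MS-CA implementation).
import torch
import torch.nn as nn

class DualBranchChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Global branch: squeeze the whole feature map to one value per channel.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # Local branch: pool to a small grid, then a depthwise conv keeps
        # coarse spatial context while producing one value per channel.
        self.local_pool = nn.AdaptiveAvgPool2d(4)
        self.local_squeeze = nn.Conv2d(channels, channels, kernel_size=4, groups=channels)
        # Shared bottleneck MLP with reduction ratio r = 16 (Table 2).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.mlp(self.global_pool(x))                     # (B, C, 1, 1) global importance
        l = self.mlp(self.local_squeeze(self.local_pool(x)))  # (B, C, 1, 1) local importance
        w = torch.sigmoid(g + l)                              # fused channel weights
        return x * w                                          # reweight input feature maps
```

Because the block maps a (B, C, H, W) tensor to a tensor of the same shape, it can be inserted between convolutional stages without altering the backbone's dimensions.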
Figure 3. The proposed confusion-aware cosine loss (CACL) module for few-shot SAR target recognition. The dual-path structure dynamically generates adaptive scaling factors based on the confusion matrix to impose differential cosine similarity constraints on easily confused classes.
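The dual-path idea of Figure 3 can be sketched as a cosine-margin loss whose per-class margin grows with how often that class is currently confused. The margin rule m_c = m0 + λm(1 - recall_c), the logit scale s, and the row-normalized confusion statistics are our own assumptions; m0 = 0.0 and λm = 0.15 follow Table 2, and the label shift factor τ is omitted for brevity.

```python
# Hedged sketch of a confusion-aware cosine-margin loss; the margin formula
# and scale s are assumptions, only m0 and lam_m come from Table 2.
import torch
import torch.nn.functional as F

def confusion_aware_cosine_loss(features, prototypes, labels, confusion,
                                s=16.0, m0=0.0, lam_m=0.15):
    # Cosine logits between L2-normalized features and class prototypes.
    logits = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()  # (B, C)
    # Row-normalize the confusion matrix; 1 - diagonal = per-class error rate.
    conf = confusion / confusion.sum(dim=1, keepdim=True).clamp(min=1)
    margin = m0 + lam_m * (1.0 - conf.diagonal())                               # (C,)
    # Subtract the class-dependent margin from the target-class cosine only,
    # demanding a larger separation for easily confused classes.
    one_hot = F.one_hot(labels, logits.size(1)).float()
    logits = logits - one_hot * margin[labels].unsqueeze(1)
    return F.cross_entropy(s * logits, labels)
```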
Figure 4. The proposed class-adaptive dynamic adjustment (CADA) module for few-shot SAR target recognition. The dual-dataflow structure dynamically generates adaptive scaling factors based on the confusion matrix to adjust logit outputs and enhance the discriminability of hard-to-classify categories.
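Figure 4's logit adjustment can likewise be approximated by scaling each class logit in proportion to that class's current error rate, so hard-to-classify categories receive a boost at decision time. The multiplicative rule below is an illustrative assumption; only the scaling strength λs = 0.1 comes from Table 2.

```python
# Hedged sketch of confusion-driven logit scaling; the boost rule is assumed.
import torch

def class_adaptive_logit_scaling(logits, confusion, lam_s=0.1):
    conf = confusion / confusion.sum(dim=1, keepdim=True).clamp(min=1)
    error_rate = 1.0 - conf.diagonal()        # per-class error rate in [0, 1]
    scale = 1.0 + lam_s * error_rate          # amplify hard-to-classify classes
    return logits * scale.unsqueeze(0)        # (B, C) * (1, C)
```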
Figure 5. Ablation study results of the proposed ConFAS-Net. (a) Test accuracy comparison of different module configurations against the baseline. (b) Corresponding performance gain (%) relative to the baseline.
Figure 6. Training curves of the proposed ConFAS-Net under different few-shot settings. (a) 5-shot, (b) 10-shot, (c) 15-shot, (d) 30-shot, showing train and validation accuracy over epochs.
Figure 7. Confusion matrix comparison between the baseline TMDC model and the proposed ConFAS-Net on the 15-shot test set. (a) Baseline TMDC; (b) ConFAS-Net.
Figure 8. Evolution of t-SNE feature distribution of the proposed ConFAS-Net during training. (a) Epoch 1 (Acc = 11.29%); (b) Epoch 50 (Acc = 36.31%); (c) Epoch 150 (Acc = 83.22%); (d) Epoch 261 (Acc = 94.97%).
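The projections in Figure 8 can be reproduced from penultimate-layer embeddings with scikit-learn's t-SNE; the perplexity and PCA initialization below are assumed settings, not values reported in the paper.

```python
# Generic t-SNE visualization of feature embeddings (assumed settings).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, out_path: str) -> None:
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    plt.figure(figsize=(6, 6))
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.legend(*sc.legend_elements(), title="Class", fontsize=7)
    plt.savefig(out_path, dpi=200)
    plt.close()
```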
Figure 9. Input SAR images and corresponding Grad-CAM class activation maps. (a) 2S1. (b) BMP2. (c) BRDM2. (d) BTR60. (e) BTR70. (f) D7. (g) T62. (h) T72. (i) ZIL131. (j) ZSU23/4.
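The activation maps in Figure 9 follow the standard Grad-CAM recipe, which can be written compactly with forward and backward hooks: backpropagate the target-class score, average the gradients spatially to weight the activations, and upsample the result. The sketch below is generic Grad-CAM for a single image and a user-chosen final convolutional layer, not the authors' code.

```python
# Generic single-image Grad-CAM sketch (standard recipe, assumed layer choice).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    score = model(image.unsqueeze(0))[0, class_idx]   # logit of the class of interest
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)     # channel weights = GAP of gradients
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()  # heatmap in [0, 1], image-sized
```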
Table 1. Number of samples per class in the MSTAR dataset.

Class      Training (Depression / Number)   Test (Depression / Number)
2S1        17° / 299                        15° / 274
BMP2       17° / 232                        15° / 195
BRDM2      17° / 298                        15° / 274
BTR60      17° / 256                        15° / 195
BTR70      17° / 233                        15° / 196
D7         17° / 299                        15° / 274
T62        17° / 299                        15° / 273
T72        17° / 232                        15° / 196
ZIL131     17° / 299                        15° / 274
ZSU23/4    17° / 299                        15° / 274
Table 2. Hyperparameter settings of ConFAS-Net.

Module      Parameter                  Symbol   Value
CACL        Label shift factor         τ        0.3
CACL        Base margin                m0       0.0
CACL        Margin scaling factor      λm       0.15
CADA        Scaling strength           λs       0.1
Two-stage   CE loss weight             β        0.05
Two-stage   Stage transition ratio     α        0.6
MS-CA       Channel reduction ratio    r        16
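For reproduction, the settings of Table 2 can be collected into a single configuration object; the field names below are our own shorthand for the symbols in the table.

```python
# Table 2 hyperparameters as one config; field names are our shorthand.
from dataclasses import dataclass

@dataclass
class ConFASConfig:
    tau: float = 0.3        # CACL label shift factor
    m0: float = 0.0         # CACL base margin
    lam_m: float = 0.15     # CACL margin scaling factor
    lam_s: float = 0.1      # CADA scaling strength
    beta: float = 0.05      # two-stage CE loss weight
    alpha: float = 0.6      # two-stage stage transition ratio
    reduction: int = 16     # MS-CA channel reduction ratio
```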
Table 3. Test accuracy of ConFAS-Net under different K-shot settings on the MSTAR dataset.

K-Shot    Test Accuracy (%)   Training Samples    Total Training Samples
5-shot    73.25               5 × 10 classes      50
10-shot   87.43               10 × 10 classes     100
15-shot   94.97               15 × 10 classes     150
30-shot   96.87               30 × 10 classes     300
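The K-shot protocol behind Table 3 amounts to drawing K training chips per class, so the 10-class MSTAR split yields 10K training samples in total (e.g., 15-shot gives 150). A minimal sampler sketch, with a fixed seed assumed for repeatability:

```python
# Minimal K-shot subset sampler (assumed helper, fixed seed for repeatability).
import random
from collections import defaultdict

def sample_k_shot(samples, k, seed=0):
    """samples: list of (image_path, class_label); returns a K-shot subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    subset = []
    for label in sorted(by_class):
        subset.extend(rng.sample(by_class[label], k))  # K chips per class
    return subset
```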
Table 4. Comparison of recognition accuracy (%) of different methods on the MSTAR dataset.

Modelling Methods            5-Shot   10-Shot   15-Shot   30-Shot
ResNet-18 [40]               55.92    72.01     80.59     90.44
Inception [41]               58.52    74.64     82.14     91.32
DenseNet [42]                59.52    75.16     82.78     92.05
Prototypical Networks [29]   70.37    82.46     91.56     94.92
TMDC-CNNs [43]               73.17    85.00     92.04     95.59
Dens-CapsNet [44]            66.90    80.26     -         94.56
DeepEMD [45]                 52.24    56.04     -         -
FTL-dis [46]                 72.13    81.21     -         -
Prior-EDL [26]               60.05    71.62     86.50     92.70
PD Network [47]              70.15    83.73     -         94.63
ConFAS-Net (ours)            73.25    87.43     94.97     96.87
Table 5. Comparison of recognition accuracy (%) of different methods on the SAMPLE dataset.

Modelling Methods            5-Shot   10-Shot   15-Shot   30-Shot
ResNet-18 [40]               84.72    86.39     88.16     90.27
Inception [41]               60.45    72.20     75.37     82.57
DenseNet [42]                84.15    85.64     87.22     89.13
Prototypical Networks [29]   77.38    82.46     88.56     90.92
TMDC-CNNs [43]               80.68    81.42     83.54     85.68
ConFAS-Net (ours)            92.50    93.40     95.44     96.33
Table 6. A comparative study of confusion matrix update strategies.

Update Strategy   Update Frequency   5-Shot   10-Shot   15-Shot   30-Shot   Relative Training Time
Offline           300 iterations     73.25    87.43     94.97     96.87     1.0×
Semi-online       50 iterations      73.41    87.61     95.08     96.95     1.02×
Online EMA        1 iteration        73.58    87.72     95.15     97.01     1.05×
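Of the strategies compared in Table 6, the online EMA variant folds each batch's confusion counts into a running matrix after every iteration. A minimal sketch follows; the decay rate is an assumed value, since Table 6 reports only the update frequencies.

```python
# Online EMA confusion-matrix update sketch (decay rate is an assumption).
import torch

def ema_confusion_update(confusion, preds, labels, num_classes, decay=0.9):
    # Batch confusion counts: rows are true labels, columns are predictions.
    idx = labels * num_classes + preds
    batch = torch.bincount(idx, minlength=num_classes ** 2)
    batch = batch.reshape(num_classes, num_classes).float().to(confusion.device)
    # Exponential moving average keeps the matrix current at negligible cost.
    return decay * confusion + (1.0 - decay) * batch
```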
Table 7. Results of the complete module ablation study under the 15-shot setting.

No.   MS-CA   CACL   CADA   Accuracy (%)   Gain (%)
1     ×       ×      ×      92.04          -
2     √       ×      ×      92.57          +0.53
3     ×       √      ×      93.20          +1.16
4     ×       ×      √      92.95          +0.91
5     √       √      ×      94.72          +2.68
6     √       ×      √      93.61          +1.57
7     ×       √      √      93.28          +1.24
8     √       √      √      94.97          +2.93

Note: “×” indicates that the corresponding module is absent, while “√” indicates that the module is present.