MSMC: Multi-Scale Embedding and Meta-Contrastive Learning for Few-Shot Fine-Grained SAR Target Classification

Chen, Bowen; Yang, Minjia; Wang, Yue; Bai, Xueru

doi:10.3390/rs18030415


by Bowen Chen ¹, Minjia Yang ¹, Yue Wang ² and Xueru Bai ¹,*

¹ National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China

² School of Aerospace Science and Technology, Xidian University, Xi’an 710071, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 415; https://doi.org/10.3390/rs18030415

Submission received: 20 December 2025 / Revised: 17 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026

(This article belongs to the Section AI Remote Sensing)


Highlights

What are the main findings?

We propose MSMC, which is a dual-branch framework that fuses metric-based meta-learning and unsupervised contrastive learning, and can outperform mainstream methods in few-shot SAR target classification on the MSTAR dataset.
We design MSEN, which can capture fine-grained discriminative information of SAR targets via adaptive candidate region generation and global–local feature fusion.

What are the implications of the main findings?

We offer a new paradigm for few-shot fine-grained SAR target classification, mitigating the challenges of scarce labeled data and high target similarity.
We enhance the model robustness to depression angle variation, providing a practical solution for target recognition under complex observation conditions.

Abstract

Constrained by observation conditions and high inter-class similarity, effective feature extraction and classification of synthetic aperture radar (SAR) targets in few-shot scenarios remains a persistent challenge. To address this issue, this article proposes a few-shot fine-grained SAR target classification method based on a multi-scale embedding network and meta-contrastive learning (MSMC). Specifically, MSMC integrates two complementary training pipelines: the first employs metric-based meta-learning to facilitate few-shot classification, while the second adopts an auxiliary training strategy to enhance feature diversity through contrastive learning. Furthermore, a shared multi-scale embedding network (MSEN) is designed to extract discriminative multi-scale features via adaptive candidate region generation and joint multi-scale embedding. The experimental results on the MSTAR dataset demonstrate that the proposed method achieves superior few-shot fine-grained classification performance compared to existing methods.

Keywords:

synthetic aperture radar (SAR); meta-learning; contrastive learning; few-shot classification; fine-grained classification

1. Introduction

Unlike optical and infrared imaging systems, synthetic aperture radar (SAR) provides unique advantages including all-day, all-weather, long-distance, and high-resolution usability, making it invaluable for both military and civilian applications [1]. Within this context, achieving accurate and robust SAR target classification remains a crucial yet challenging research frontier in the field [2,3].

In recent years, synthetic aperture radar (SAR) target classification technology has witnessed remarkable advancement [4]. It has evolved progressively from traditional methodologies [5,6,7,8,9,10], which rely on similarity assessment of manually engineered features and suffer from inherent limitations such as poor generalization, toward modern intelligent approaches. The rapid proliferation of deep learning [11] has ushered in a novel paradigm for this field: its data-driven end-to-end architecture enables the autonomous learning of complex high-level target features, obviating the need for cumbersome manual feature engineering and achieving concurrent improvements in both performance and efficiency. Among these advanced techniques, convolutional neural networks (CNNs) leverage the properties of sparse interaction and weight sharing to extract geometric and structural information of SAR image targets in a lightweight yet efficient manner [12]. State-of-the-art models rooted in classical CNN frameworks, such as SDF-Net [13], GLMnet [14], and LcFGC [15], have further enhanced the efficacy of target feature extraction. In recent times, attention-based architectures, exemplified by the Transformer [16,17], have emerged as promising solutions. By capturing global representations through the key–query–value mechanism, these architectures mitigate local inductive biases and deliver superior classification performance.

The success of the aforementioned models hinges on the comprehensive feature distribution of large-scale, high-quality training samples [18]. However, constrained by observational conditions and the professionalism required for annotation in practical scenarios, SAR target classification confronts a prominent few-shot learning dilemma. On one hand, the number of accessible samples for rare targets is extremely limited; on the other hand, labeled data have become increasingly scarce due to the high threshold of annotation. These two factors collectively render models highly susceptible to overfitting [19]. Concurrently, fine-grained (model-level) classification tasks demand precise differentiation of targets with similar structures. This requirement further exacerbates the learning challenges in few-shot scenarios and underscores the urgency of investigating few-shot fine-grained SAR target classification methods.

Although current methods have achieved rapid adaptation capabilities in few-shot classification through meta-learning [20] and enhanced the efficiency of capturing subtle features in fine-grained classification via local region attention [21], they still suffer from significant limitations, i.e., the models lack synergistic perception ability to integrate global semantic correlations and local discriminative features of targets, resulting in limited classification robustness and generalization performance.

To overcome the aforementioned limitations, this paper introduces MSMC, a multi-scale embedding and meta-contrastive learning framework for few-shot fine-grained SAR target classification. The main innovations lie in two aspects. Firstly, to simultaneously enhance rapid adaptation capability for few-shot classification tasks while improving feature diversity and multi-scale feature representation capacity, a dual-branch training pipeline integrating meta-learning and contrastive learning is designed. Furthermore, a parameter-shared multi-scale embedding network (MSEN) is constructed to extract discriminative features through adaptive candidate region generation and joint multi-scale embedding mechanisms. The core contributions can be outlined as follows:

We design a dual-branch training pipeline combining metric-based meta-learning and unsupervised contrastive learning. The former facilitates effective few-shot classification, while the latter maximizes the utility of base samples and enhances feature diversity by an auxiliary training strategy. By combining these two complementary branches, the model strengthens feature representations in the embedding space while reducing reliance on labeled data, significantly boosting generalization for unseen-class tasks.
We propose an MSEN that hierarchically analyzes the semantic information across different CNN layers and generates multi-scale sub-images, from which a fused feature representation incorporating fine-grained details is produced. Compared with existing methods, MSEN combines global and local features and offers enhanced physical interpretability, leading to improved few-shot fine-grained SAR target classification accuracy.
Classification experiments on the MSTAR dataset demonstrate that MSMC achieves superior novel-class average accuracy over state-of-the-art few-shot SAR target classification approaches across various configurations. Ablation studies and visualization analyses further validate the effectiveness of each module.

2. Prior Work

2.1. Few-Shot Classification

The few-shot classification problem persists as a critical research challenge in the field of SAR image interpretation. Mainstream methods include transfer learning [22], sample augmentation [23], and meta-learning [24].

Transfer learning: Transfer learning enhances the learning performance of target tasks by leveraging knowledge acquired from relevant source tasks [25]. Model fine-tuning [26] constitutes a classical transfer learning strategy: a base model is first pre-trained on large-scale source data, followed by fine-tuning of specific network layers on the target few-shot dataset. While this approach is efficient and straightforward, it is prone to performance degradation on target tasks due to overfitting when there exists a significant domain shift between the distributions of source and target data.

Sample augmentation: Sample augmentation addresses the few-shot dilemma by generating additional samples to enhance data diversity, and it falls into two categories: data augmentation [27], which modifies original SAR images through pixel- or image-level transformations such as translation and rotation [28]; and sample generation [29], which leverages generative adversarial networks (GANs) [30] and their variants [31,32] to synthesize pseudo-samples that approximate the real data distribution. Both approaches exhibit inherent limitations when samples are extremely scarce, failing to compensate for the lack of representative training data.

Meta-learning: Meta-learning is the predominant approach for few-shot learning. Its core idea is to let models acquire transferable meta-knowledge from multiple prior tasks so that they can adapt rapidly to new few-shot tasks [33]. It achieves this via a two-stage framework of meta-training and meta-testing: during meta-training, the model extracts task-generalizable meta-knowledge from a fully annotated base-class dataset by simulating numerous few-shot classification scenarios, while in meta-testing, the acquired meta-knowledge is directly transferred to few-shot tasks with novel classes for rapid classification [34]. Meta-learning exhibits strong generalization and adaptability because it captures cross-task structural relationships and correlations. Among meta-learning approaches, distance metric-based methods [35] stand out: they efficiently utilize limited sample information through simple induction in the metric space and deliver excellent few-shot classification performance. By focusing on sample distance or similarity measurement, they also reduce dependence on feature dimensions, which enhances robustness and scalability when processing high-dimensional data with redundant features.

Despite exhibiting remarkable few-shot adaptation capabilities for novel tasks, meta-learning approaches predominantly concentrate on global semantic representations in their feature extraction, demonstrating significant deficiencies in acquiring local discriminative features (particularly texture variations, edge characteristics, and microscopic configurations). Such a limitation hinders the model from extracting highly discriminative fine-grained features when dealing with highly similar subcategories, thereby reducing the classification accuracy.

2.2. Fine-Grained Classification

Fine-grained classification confronts unique challenges distinctly different from those of traditional classification tasks, with its core lying in achieving precise differentiation between visually similar subcategories within a broad category, such as vehicles and aircraft of varying models [36]. Characterized by “high inter-class similarity and large intra-class variability,” namely the subtle visual discrepancies among different subcategories and the significant variations within the same subcategory caused by factors like observational conditions and background occlusion [37], this task substantially increases the difficulty of classification. Since discriminative features are concentrated in small local regions [38], the accurate localization and utilization of these critical details constitute the core to performance improvement. Current state-of-the-art methods can be categorized into three types: multi-layer feature extraction, high-order information encoding, and localization-recognition.

Multi-layer feature extraction: Multi-layer feature extraction methods improve classification by hierarchically capturing and fusing discriminative information across spatial scales, using multi-level extractors for cross-scale interaction and joint representation [39]. HENC [40] designs a hierarchical embedding network (HEN) that selectively fuses feature maps from different layers, integrating local details from low-level maps with global semantics from high-level maps to enable multi-scale joint embedding and enhance fine-grained classification performance.

High-order information encoding: High-order information encoding methods boost feature discrimination by modeling complex relationships in visual representations, often via high-dimensional transformations to capture intricate image patterns. Bilinear-CNN (B-CNN) [41] utilizes a dual-pathway convolutional architecture to compute translation-invariant pairwise feature interactions, effectively encoding second-order statistics of local features for more discriminative representations in fine-grained classification. Its bilinear formulation preserves spatial coherence while capturing subtle feature correlations critical for distinguishing minor inter-class differences.

Localization-recognition: Localization-recognition methods mimic human visual processing by splitting fine-grained classification into two sequential stages: discriminative region localization, followed by feature extraction and classification. The Recurrent Attention Convolutional Neural Network (RACNN) [42] is a representative example; it adopts a multi-scale attention mechanism in which an attention proposal subnetwork conducts hierarchical localization, with each scale optimizing and expanding the previous scale’s attention region. These methods deliver strong classification performance and model interpretability but have notable limitations. They rely heavily on the accuracy of the initial region localization, so a flawed attention mechanism leads to error propagation. They also scale poorly because they focus on a single salient region while ignoring other complementary discriminative features, which restricts their applicability in scenarios that require comprehensive feature analysis or must handle variations in the input distribution.

3. MSMC

The overall framework of the proposed MSMC is shown in Figure 1. It integrates four core stages: (1) task sampling and data augmentation; (2) sample embedding by MSEN; (3) category inference by the Meta-Classification Module (MCM); and (4) similarity measurement by the Auxiliary Contrastive Learning Module (ACLM).

Owing to the high sensitivity of fine-grained SAR targets to pose variations and depression angle changes, pronounced fluctuations in scattering characteristics are inherently induced. Specifically, pose adjustments can modify the spatial distribution of key scattering centers (e.g., barrels, wheels) of the target, while variations in depression angles may give rise to intensity discrepancies between global contours and local components. These factors often lead to local discriminative details being obscured by interference from global features. To address these intrinsic challenges associated with SAR data, MSMC achieves adaptive compatibility through the synergistic integration of MSEN and meta-contrastive learning: MSEN leverages semantic features across different CNN layers to generate adaptive multi-scale key scattering regions, thereby enhancing the local discriminative capability under varying pose conditions; meanwhile, the dual-branch meta-contrastive learning architecture acquires pose-invariant metric knowledge via the meta-learning branch and optimizes feature distinguishability through sample-pair learning by the contrastive branch.

During the training phase, an N-way K-shot sampling strategy is applied to the training dataset to generate training tasks comprising support and query sets. The images within these tasks are then augmented to produce auxiliary sample sets containing positive and negative sample pairs. After data preprocessing, both the training tasks and the auxiliary sample sets are fed into the MSEN to generate feature vectors. Based on these feature representations, the MCM computes the distance between feature vectors to derive the meta-classification loss $L_{meta}$, while the ACLM evaluates the similarity between positive and negative sample pairs to compute the contrastive loss $L_{contrast}$. The overall MSMC is optimized using a joint training loss $L_{total}$, i.e., a weighted sum of $L_{meta}$ and $L_{contrast}$. During the test phase, the test tasks are processed directly by MSEN with shared parameters, and only the MCM is used to infer the categories of the query samples. The following subsections describe the structure of each module in detail.

3.1. Task Sampling and Data Augmentation

In MSMC, task sampling is adopted in the meta-learning pipeline, while data augmentation is adopted in the contrastive learning pipeline. Following common practice in few-shot learning [43], we construct the base-class set $C_{base}$ and the novel-class set $C_{novel}$, which satisfy $C_{base} \cap C_{novel} = \varnothing$. Few-shot SAR target classification aims to train a model on the base-class dataset $D_{base} = \{(I_{i}, y_{i}) \mid y_{i} \in C_{base}\}_{i = 1}^{N_{base}}$ and to transfer the learned classification capability to the novel-class dataset $D_{novel} = \{(I_{i}, y_{i}) \mid y_{i} \in C_{novel}\}_{i = 1}^{N_{novel}}$, in which only a small number of labeled samples is available. In the above definitions, $I_{i} \in \mathbb{R}^{W \times H}$ represents the i-th SAR image in the dataset, with width W and height H; $y_{i}$ denotes the corresponding label of $I_{i}$; and $N_{base}$ and $N_{novel}$ denote the number of samples in the base-class dataset $D_{base}$ and the novel-class dataset $D_{novel}$, respectively.

The proposed method leverages a meta-learning framework for few-shot SAR target classification. Adopting a task-driven episodic training paradigm (distinct from traditional learning), it enables models to acquire class-agnostic general knowledge for effective generalization to novel-class tasks. Both training and testing involve sampled subtasks $T_{base}$ (from $C_{base}$) and $T_{novel}$ (from $C_{novel}$), with the goal of using abundant $T_{base}$ to build a robust model that generalizes well to $T_{novel}$.

Each task $T = \{S, Q\}$ consists of a support set $S$ and a query set $Q$, where the samples in $Q$ are classified using the labeled samples in $S$. Under the N-way K-shot strategy (Figure 2a), $T_{novel}$ is constructed by selecting N categories from $C_{novel}$, each contributing K labeled samples to $S$, and is evaluated on $Q$, which contains unlabeled samples from the same N categories. During training, $T_{base}$ is constructed in the same way from $C_{base}$ samples.
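The N-way K-shot task construction described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sample_episode`, the per-class query count `n_query`, and the toy label list are assumptions introduced here, and class labels are remapped to episode-local indices 0 to N-1.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task; returns (support, query) as lists of
    (sample_index, episode_label) pairs. Hypothetical helper for illustration."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough samples can fill both the support and query sets.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy base-class label list (assumed): 4 classes with 10 samples each.
labels = [c for c in range(4) for _ in range(10)]
support, query = sample_episode(labels, n_way=3, k_shot=2, n_query=5,
                                rng=random.Random(0))
```

The support and query sets of one task are disjoint by construction, and the same routine would serve for $T_{base}$ and $T_{novel}$ by passing labels from the corresponding class set.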

Data augmentation supports the ACLM in constructing positive/negative sample pairs for contrastive loss computation and MSEN training. For single-channel grayscale SAR images, we design a hybrid augmentation module $aug(\cdot)$ integrating image-level and pixel-level operations. Specifically, image-level operations include random horizontal flipping and full-range rotation to construct an azimuth perturbation space; pixel-level operations encompass power transformation (adjusting dynamic range distribution), Gaussian noise addition (simulating speckle effects), and $3 \times 3$ Gaussian kernel blurring (introducing multi-scale scattering distortions). A stochastic augmentation combination mechanism (Figure 2b) applies two parallel independent augmentation paths to each original sample, with each path randomly selecting a sequence of the aforementioned operations to ensure two distinct augmentations per sample. For the augmented dataset $A = \{A^{1}, A^{2}\} = \{aug(S \cup Q)\}$, two images augmented from the same original image form a positive sample pair, while two images augmented from different original images form a negative sample pair.
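A simplified sketch of the two-path stochastic augmentation is given below. It is an illustration under stated assumptions rather than the module used in the article: images are nested lists with pixel values in [0, 1], only flipping, power transformation, and Gaussian noise are included (rotation and blurring are omitted for brevity), the parameter ranges are arbitrary, and the two paths are drawn independently without an explicit check that they differ.

```python
import random


def hflip(img):
    """Image-level operation: horizontal flip."""
    return [row[::-1] for row in img]


def power_transform(img, gamma):
    """Pixel-level operation: power (gamma) transform of values in [0, 1]."""
    return [[p ** gamma for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Pixel-level operation: additive Gaussian noise, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def aug(img, rng):
    """One augmentation path: a random sequence of the operations above."""
    ops = [
        hflip,
        lambda x: power_transform(x, rng.uniform(0.7, 1.3)),  # assumed range
        lambda x: add_gaussian_noise(x, 0.05, rng),           # assumed sigma
    ]
    for op in rng.sample(ops, rng.randint(1, len(ops))):
        img = op(img)
    return img


def two_views(images, rng):
    """Return 2M views; views 2m and 2m+1 come from the same original image."""
    views = []
    for img in images:
        views.append(aug(img, rng))
        views.append(aug(img, rng))
    return views


images = [[[0.1, 0.5], [0.9, 0.3]], [[0.2, 0.4], [0.6, 0.8]]]
views = two_views(images, random.Random(0))
```

Storing the two views of each image at adjacent positions keeps positive pairs at indices $2m$ and $2m + 1$, which is the indexing used by the contrastive loss later in the article.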

3.2. Multi-Scale Embedding Network

The core challenge in fine-grained SAR target classification lies in high inter-class similarity and large intra-class variance. Targets of the same class exhibit significant scattering variations due to changes in pose and viewing angle, while differences between classes often reside in subtle local details such as component shapes and scattering center distributions. Traditional single-scale feature extraction suffers from two main limitations: it either overlooks critical local discriminative information while relying solely on global features, or fails to adapt to the multi-level scattering characteristics of SAR targets, such as intensity differences between overall contours and local components. To address these issues, MSEN is designed around global–local collaboration and multi-scale adaptation. Its dual-branch structure simultaneously captures global structural information and, guided by the semantic strength of different CNN layers, adaptively generates multi-scale candidate regions to locate and fuse key local scattering features. This enables accurate characterization of fine-grained differences and overcomes the shortcomings of single-scale methods.

As shown in Figure 3, MSEN is built upon a two-channel CNN backbone. The first channel extracts the global features of the input image and collects the output features of different layers, which jointly guide the region parameter generator $g(\cdot)$ in perceiving key regions. The second channel integrates and compresses the extracted key regions to obtain local features of the target. Finally, the global and local features are weighted and fused to obtain the final feature vector.

Specifically, an input image $I_{i}$ of size $(W, H)$ is first fed into ConvBlock 1, which consists of four convolutional layers, each employing a $3 \times 3$ kernel with 64 channels. Each convolutional layer is followed by a $2 \times 2$ max pooling layer that reduces the size of the feature map, and the Mish activation function [44] is applied to enhance non-linearity. For convenience, the output feature map of the convolutional layer $Layer_{i}$ $(i = 1, 2, 3, 4)$ in ConvBlock 1 is denoted as $Map_{i}$, with size $(W_{i}, H_{i}) = (W / 2^{i}, H / 2^{i})$. Finally, by flattening the feature map of the last layer and passing it through a fully connected layer, we obtain a global feature vector $f_{global}$ of dimension $d = 128$.

To enhance fine-grained feature representation, three rectangular sampling regions of progressively decreasing size are generated from ConvBlock 1. First, adaptive max-pooling is applied to the feature maps $Map_{i}$ $(i = 1, 2, 3)$ to obtain standardized feature representations $Map_{i}^{*}$ $(i = 1, 2, 3)$ with a consistent size of $(W_{S}, H_{S})$. Then, these features are flattened into vectors and processed by a shared-weight region parameter generator $g(\cdot)$, implemented as a two-layer MLP, to predict the center coordinates $(t_{x}^{j}, t_{y}^{j})$ of each sampling region. Mathematically, this can be expressed as $g(flat(Map^{*})) = [t_{x}^{1}, t_{y}^{1}; t_{x}^{2}, t_{y}^{2}; t_{x}^{3}, t_{y}^{3}]$, where $flat(\cdot)$ denotes unfolding the feature map into a vector. Furthermore, let the side lengths $t_{l}^{j}$ $(j = 1, 2, 3)$ of the three rectangular regions be $(W / 2, H / 2)$, $(W / 3, H / 3)$, and $(W / 6, H / 6)$, respectively. Then, taking the j-th sampling region as an example and omitting the superscript j for brevity, the coordinates of its top-left and bottom-right corners can be expressed as follows:

$$\begin{matrix} t_{x(lu)} = t_{x} - \frac{1}{2} t_{l}, & t_{y(lu)} = t_{y} - \frac{1}{2} t_{l} \\ t_{x(rb)} = t_{x} + \frac{1}{2} t_{l}, & t_{y(rb)} = t_{y} + \frac{1}{2} t_{l} \end{matrix} \tag{1}$$

where “$lu$” denotes the left-upper corner, and “$rb$” denotes the right-bottom corner. Subsequently, the regions defined by the aforementioned rectangular boxes are selected from the original image, as follows:

$$\dot{I}_{i}^{j} = I_{i} \odot R(t_{x}^{j}, t_{y}^{j}, t_{l}^{j}) \tag{2}$$

where $\odot$ denotes element-wise multiplication, $\dot{I}_{i}^{j}$ represents the cropped area, and $R(\cdot)$ is a two-dimensional boxcar attention mask:

$$R(\cdot) = [\Gamma(x - t_{x(lu)}) - \Gamma(x - t_{x(rb)})] \cdot [\Gamma(y - t_{y(lu)}) - \Gamma(y - t_{y(rb)})] \tag{3}$$

where $\Gamma(\cdot)$ is the logistic function $\Gamma(x) = 1 / (1 + \exp\{-kx\})$ with slope parameter k. When k is sufficiently large (set to 100 in this article), the logistic function approximates a step function, so the mask is approximately 1 inside the specified region, that is, for $t_{x(lu)} \leqslant x \leqslant t_{x(rb)}$ and $t_{y(lu)} \leqslant y \leqslant t_{y(rb)}$, and approximately 0 elsewhere, thereby enabling precise cropping of the specified image region.
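The boxcar mask built from two logistic functions per axis can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the article's code: it uses pure Python lists, a square region with a single side length `tl` as in Eq. (1), integer pixel coordinates, and a clamped exponent to avoid numerical overflow; the function names are hypothetical.

```python
import math


def logistic(x, k=100.0):
    """Logistic function 1 / (1 + exp(-k x)); the exponent is clamped so that
    math.exp cannot overflow for large |k x|."""
    z = max(-60.0, min(60.0, k * x))
    return 1.0 / (1.0 + math.exp(-z))


def boxcar_mask(width, height, tx, ty, tl, k=100.0):
    """Soft 2-D boxcar mask, close to 1 inside the square of side tl centred
    at (tx, ty) and close to 0 outside. Returned as mask[y][x]."""
    x_lu, x_rb = tx - tl / 2.0, tx + tl / 2.0
    y_lu, y_rb = ty - tl / 2.0, ty + tl / 2.0
    mask = []
    for y in range(height):
        gy = logistic(y - y_lu, k) - logistic(y - y_rb, k)
        mask.append([(logistic(x - x_lu, k) - logistic(x - x_rb, k)) * gy
                     for x in range(width)])
    return mask


def crop(image, mask):
    """Element-wise product of the image with the mask, as in Eq. (2)."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


mask = boxcar_mask(width=8, height=8, tx=4.0, ty=4.0, tl=3.0)
```

Because the mask is a smooth function of the box parameters, gradients can flow back to the predicted center coordinates, which is what makes the region selection learnable.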

To extract an effective feature representation from the aforementioned highly localized region, bilinear interpolation is employed to upscale the cropped region $\dot{I}_{i}^{j}$ to the original size, resulting in the transformed image $\ddot{I}_{i}^{j}$, i.e.,

$$\ddot{I}_{(e, f)} = \sum_{\alpha, \beta = 0}^{1} |1 - \alpha - \{e / \lambda\}| \, |1 - \beta - \{f / \lambda\}| \, \dot{I}_{(m, n)} \tag{4}$$

where $m = \alpha + [e / \lambda]$, $n = \beta + [f / \lambda]$, $\lambda$ is the upsampling factor, and $[\cdot]$ and $\{\cdot\}$ denote the integer part and the fractional part, respectively.
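The interpolation rule can be checked on a small example. The sketch below is an illustration only: it assumes an integer upsampling factor, treats the first index as the row, and clamps neighbour indices at the image border, a boundary choice the article does not specify. The function name `bilinear_upsample` is hypothetical.

```python
def bilinear_upsample(src, lam):
    """Upsample the 2-D list `src` by an integer factor `lam` using the
    four-neighbour weighted sum of Eq. (4)."""
    h, w = len(src), len(src[0])
    out = [[0.0] * (w * lam) for _ in range(h * lam)]
    for e in range(h * lam):
        for f in range(w * lam):
            int_e, frac_e = divmod(e / lam, 1.0)  # integer and fractional parts
            int_f, frac_f = divmod(f / lam, 1.0)
            value = 0.0
            for alpha in (0, 1):
                for beta in (0, 1):
                    weight = abs(1 - alpha - frac_e) * abs(1 - beta - frac_f)
                    # Clamp at the border (assumption, not stated in the text).
                    m = min(int(int_e) + alpha, h - 1)
                    n = min(int(int_f) + beta, w - 1)
                    value += weight * src[m][n]
            out[e][f] = value
    return out


patch = [[0.0, 1.0], [2.0, 3.0]]
zoomed = bilinear_upsample(patch, 2)
```

For each output pixel the four weights sum to one, so the result is a convex combination of neighbouring source pixels; for instance, the output pixel midway between all four source pixels receives their average.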

After obtaining $\ddot{I}_{i}^{j}$ $(j = 1, 2, 3)$, the three images are concatenated along the channel dimension, and local features are extracted by ConvBlock 2, which has the same structure as ConvBlock 1 except for the number of input channels. Finally, a 128-dimensional local feature representation $f_{local}$ is obtained. After conducting both global and local perception on the original SAR image, the global feature $f_{global}$ and the local feature $f_{local}$ are weighted and fused as follows:

$$f = \sigma f_{global} + (1 - \sigma) f_{local} \tag{5}$$

where $\sigma$ is the feature weighting factor, set to 0.7 in this article. In contrast to merely leveraging the coarse-grained feature information from the original image, the output feature $f$ of the MSEN encompasses both the global and the local target features, thereby enabling comprehensive utilization of multi-scale fine-grained image information. Additionally, the selection of local image regions is learnable, allowing the network to autonomously focus on fine-grained features that are highly distinctive for target classification. As a result, the classification accuracy is further improved. After feeding the support set $S$, the query set $Q$, and the augmented set $A$ into the MSEN, the support feature set $F_{S} = \{f_{S}^{i} \mid i = 0, \dots, N \times K - 1\}$, the query feature set $F_{Q} = \{f_{Q}^{i} \mid i = 0, \dots, N \times q - 1\}$, and the augmented feature set $F_{A} = \{f_{A}^{i} \mid i = 0, \dots, 2 \times N \times (K + q) - 1\}$ are obtained, respectively.

3.3. Meta-Contrastive Learning

The MSEN provides feature representations for SAR targets that integrate global semantics with local details. However, effectively utilizing these features requires adaptation to the needs of rapid generalization and feature discrimination in few-shot scenarios. Therefore, this section proposes a Meta-Contrastive Learning dual-branch training framework, which combines the multi-scale features extracted by MSEN with the cross-task generalization capability of meta-learning and the feature diversity optimization of contrastive learning, thus achieving further improvement in few-shot fine-grained classification performance.

The proposed MSMC adopts a dual-branch training pipeline that integrates meta-learning and contrastive learning to jointly optimize the weight parameters of MSEN. The meta-learning branch follows a prototype-based distance metric: it extracts features from support samples via MSEN, computes class prototype vectors, and classifies query samples by measuring feature-prototype distances in the embedding space. Its meta-classification loss $L_{meta}$ enables generalizable feature mapping and cross-task knowledge transfer through episodic training, supporting efficient few-shot inference. To address meta-learning’s limitations of insufficient cross-task feature discriminability and limited intra-class diversity with few samples, an auxiliary contrastive learning branch is introduced, which leverages explicit instance-level similarity optimization and implicit data augmentation guided by the auxiliary contrastive loss $L_{contrast}$. This forces the feature space to balance task generalizability and sample discriminability, enhancing prototype construction robustness in few-shot scenarios. Overall, MSMC optimizes the parameters through the weighted sum of $L_{meta}$ and $L_{contrast}$; the two branches are described in detail in the following subsections.

3.3.1. Meta-Learning Loss Based on Distance Measurement

The meta-learning methods based on distance measurement effectively utilize limited sample information through simple induction in the metric space and exhibit excellent performance in solving the few-shot learning problem. Therefore, this paper selects the meta-classification module based on distance measurement as the classification module of the proposed method.

After feeding the image samples into the MSEN and mapping them into the embedding space, the prototype representation of each category can be obtained. Concretely, for the k-th class $(k \in \{0, 1, \dots, N - 1\})$ of the selected N classes, a d-dimensional ($d = 128$ in this paper) vector representation in the embedding space is obtained, denoted as $c_{k}$ and called the class prototype of the k-th class. Let $S_{k}$ denote the set of all samples of class k in the support set $S$, and let $f_{S_{k}}^{j}$ denote the feature of its j-th sample; then, $c_{k}$ is computed as the mean of the K support features:

$$c_{k} = \frac{1}{K} \sum_{j = 0}^{K - 1} f_{S_{k}}^{j} \tag{6}$$

The samples in the query set $Q$ are categorized in the embedding space based on the distance of each sample to the prototype of every class. The distance measurement function is denoted as $\gamma(\cdot)$. Then, for each feature $f_{Q}^{i}$ $(i = 0, \dots, N \times q - 1)$ in the feature set $F_{Q}$, the classification module generates a probability distribution over the label $y_{i}$ using the Softmax function, as presented in (7):

$$p(y_{i} = k \mid f_{Q}^{i}) = \frac{\exp(-\gamma(f_{Q}^{i}, c_{k}))}{\sum_{k'} \exp(-\gamma(f_{Q}^{i}, c_{k'}))} \tag{7}$$

In the above probability distribution, the class with the maximum predicted probability is taken as the predicted class of feature $f_{Q}^{i}$. The distance metric function $\gamma(\cdot)$ used in this article is the Euclidean distance [45]. Then, the meta-classification loss $L_{meta}$ is defined as follows:

$$L_{meta} = \frac{1}{N \times q} \sum_{i = 0}^{N \times q - 1} -\log p(y_{i} = k \mid f_{Q}^{i}) \tag{8}$$

where k denotes the true label.
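The prototype computation, distance-based Softmax, and meta-classification loss can be traced on toy data. The following sketch is illustrative only: features are plain Python lists standing in for MSEN embeddings, the Euclidean distance is taken unsquared as the text states, and the function names and the 2-way 2-shot toy values are assumptions.

```python
import math


def prototypes(support_feats, support_labels, n_way):
    """Class prototypes: mean of the support features of each class."""
    dim = len(support_feats[0])
    sums = [[0.0] * dim for _ in range(n_way)]
    counts = [0] * n_way
    for feat, y in zip(support_feats, support_labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += feat[d]
    return [[v / counts[k] for v in sums[k]] for k in range(n_way)]


def class_probs(query_feat, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    logits = [-math.dist(query_feat, c) for c in protos]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [v / total for v in exps]


def meta_loss(query_feats, query_labels, protos):
    """Mean negative log-probability of the true class over the query set."""
    losses = [-math.log(class_probs(f, protos)[y])
              for f, y in zip(query_feats, query_labels)]
    return sum(losses) / len(losses)


# Toy 2-way 2-shot task with 2-D features (assumed values).
support_feats = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_feats, support_labels, n_way=2)
loss = meta_loss([[1.0, 1.0], [3.0, 1.0]], [0, 1], protos)
```

In this toy task each query lies at distance 1 from its own prototype and 3 from the other, so the true-class probability is $1 / (1 + e^{-2})$ for both queries.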

3.3.2. Auxiliary Contrastive Loss Based on Similarity Measurement

Contrastive learning is employed to learn effective feature representations by comparing the similarities among samples in the absence of labels. It emphasizes the learning of common features among samples in the same class and differentiates the disparities among samples in different classes. The proposed MSMC ulitizes auxiliary unsupervised contrastive learning module to generate contrastive loss that assists the optimization of MSEN. Following the task generation method, i.e., N-way K-shot q-query, the total number of samples in a task should be

M = N \times (K + q)

, which is then increased to

2 M

through data augmentation. For a given sample in the augmented dataset

A = \{A^{1}, A^{2}\} = \{aug (S \cup Q)\}

, it has one positive sample and

2 M - 2

negative samples.

Given the feature set

F_{A}

for samples in

A

, we define the contrastive loss as follows:

ℓ (i, j) = - \log \frac{\exp (\mathrm{sim} (i, j) / τ)}{\sum_{r = 0}^{2 M - 1} \mathbb{1}_{[r \neq i]} \exp (\mathrm{sim} (i, r) / τ)}

(9)

where

\mathrm{sim} (\cdot, \cdot)

denotes the negative Euclidean distance between features of the two samples; and

τ

denotes the temperature coefficient, which is set to 0.5 in this article. Then, the overall auxiliary unsupervised contrastive loss

L_{contrast}

is defined as follows:

L_{contrast} = \frac{1}{2 M} \sum_{m = 0}^{M - 1} [ℓ (f_{A}^{2 m}, f_{A}^{2 m + 1}) + ℓ (f_{A}^{2 m + 1}, f_{A}^{2 m})]

(10)

Here, for all

m \in \{0, \dots, M - 1\}

,

2 m

and

2 m + 1

represent the indices of a pair of positive samples in

F_{A}

.
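Eqs. (9) and (10) can be sketched as below, assuming the augmented features are ordered so that indices 2m and 2m + 1 hold the two views of the same sample, with similarity defined as the negative Euclidean distance and τ = 0.5 as stated above. The function names and toy features are illustrative assumptions.

```python
import math
from typing import List, Sequence


def sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity defined as the negative Euclidean distance."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_loss(features: List[Sequence[float]], i: int, j: int, tau: float = 0.5) -> float:
    """l(i, j) of Eq. (9): j is the positive of i; every r != i enters the denominator."""
    numerator = math.exp(sim(features[i], features[j]) / tau)
    denominator = sum(
        math.exp(sim(features[i], features[r]) / tau)
        for r in range(len(features)) if r != i
    )
    return -math.log(numerator / denominator)


def contrastive_loss(features: List[Sequence[float]], tau: float = 0.5) -> float:
    """Eq. (10): average of both directions over the M positive pairs (2m, 2m + 1)."""
    num_pairs = len(features) // 2
    total = 0.0
    for m in range(num_pairs):
        total += pair_loss(features, 2 * m, 2 * m + 1, tau)
        total += pair_loss(features, 2 * m + 1, 2 * m, tau)
    return total / (2 * num_pairs)


# Two samples, two views each (hypothetical 2-D features).
aligned = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]     # views of a sample are close
misaligned = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.0, 5.1]]  # "positives" are far apart
loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
```

The loss is small when the two views of each sample are embedded close together and far from the other 2M − 2 features, and large otherwise, which is the behavior the auxiliary branch relies on to shape the embedding space.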

3.3.3. Joint Training Loss

MSMC adopts the joint training loss to optimize the network parameters of MSEN. In each training task

T_{base}

, after acquiring the above meta-classification loss

L_{meta}

and

L_{contrast}

, the total loss function

L_{total}

of meta-contrastive joint learning can be defined as the weighted sum of the above two losses:

L_{total} = ρ L_{meta} + (1 - ρ) L_{contrast}

(11)

Here,

ρ

is the weighting factor, which is set to 0.8 in this article. Then, backpropagation is carried out epoch by epoch to update the network parameters of MSEN, as shown in (12):

θ_{φ}^{new} = θ_{φ}^{old} - η \frac{\partial L_{total}}{\partial θ_{φ}^{old}}

(12)

Here,

θ_{φ}^{old}

and

θ_{φ}^{new}

, respectively, represent the weight parameters of the multi-scale embedding network before and after the update, and

η

represents the learning rate. When the total loss converges, the model training is considered to be completed.
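The interaction of Eqs. (11) and (12) can be illustrated with a one-parameter toy problem. The two quadratic functions below are stand-ins chosen for illustration only and are not the paper's losses; the update is the plain gradient-descent form of Eq. (12), whereas Section 4.2 reports that training actually uses the Adam optimizer.

```python
def joint_loss(l_meta: float, l_contrast: float, rho: float = 0.8) -> float:
    """Weighted sum of the two losses, as in Eq. (11)."""
    return rho * l_meta + (1.0 - rho) * l_contrast


def gradient_step(theta: float, grad: float, lr: float) -> float:
    """Plain gradient-descent update, as in Eq. (12)."""
    return theta - lr * grad


# Toy stand-ins (hypothetical): two quadratics with minima at +1 and -1.
def toy_meta(theta: float) -> float:
    return (theta - 1.0) ** 2


def toy_contrast(theta: float) -> float:
    return (theta + 1.0) ** 2


rho, lr, theta = 0.8, 0.1, 0.0
for _ in range(200):
    # The gradient of the weighted sum is the weighted sum of the gradients.
    grad = rho * 2.0 * (theta - 1.0) + (1.0 - rho) * 2.0 * (theta + 1.0)
    theta = gradient_step(theta, grad, lr)

total = joint_loss(toy_meta(theta), toy_contrast(theta), rho)
print(theta, total)
```

In this toy setting the parameter settles at 2ρ − 1, between the two individual minima and closer to the one weighted more heavily, which mirrors how ρ trades off the two objectives.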

4. Experiments

In this section, a series of classification experiments are conducted to verify the effectiveness of the proposed MSMC. Firstly, the MSTAR dataset and specific experimental settings are introduced. Then, comparative experiments with various classical few-shot SAR classification methods and ablation experiments of each module of the proposed method are carried out. All experiments are performed on an NVIDIA GeForce RTX 4090 GPU, utilizing the PyTorch 1.12.0 framework (CUDA 11.3).

4.1. Datasets

The data employed in the experiments are taken from the MSTAR dataset [46], which is commonly adopted for performance evaluation of SAR target classification algorithms. Specifically, measured data of ground vehicles are recorded by a high-resolution spotlight SAR operating in the X-band. There are a total of ten categories of targets, namely BMP2, 2S1, BTR60, BTR70, BRDM-2, D7, T62, T72, ZIL131, and ZSU-234, with an image resolution of 0.3 m × 0.3 m. Typical optical images of the ten categories and their corresponding SAR images are presented in Figure 4. It should be emphasized that the SAR targets in the MSTAR dataset have similar shapes and dimensions, satisfying the conditions for fine-grained target classification.

While all the ten target categories in the MSTAR dataset include samples at

17^{\circ}

depression angle, comprehensive multi-angle coverage (

15^{\circ}

,

17^{\circ}

and

30^{\circ}

) is only available for three distinct vehicle types: 2S1, BRDM-2, and ZSU-234. To validate the MSMC’s adaptability to depression angle variations, we establish the following experimental configuration: all samples from three target classes (2S1, BRDM-2, and ZSU-234) constitute the test set, while the remaining seven classes form the training set. Considering the inherent class imbalance, we randomly select 200 samples per training class to ensure balanced representation and prevent overfitting. The test set incorporates all available samples from the three designated classes to enable comprehensive performance evaluation. Based on these considerations, the original MSTAR dataset is partitioned into the training and test sets shown in Table 1 [47].

In the experiments, we adopt the N-way K-shot q-query task generation format described in Section 3.2. Since there are three categories in the test set, we first randomly select three arbitrary categories from the seven-class training set in Table 1 for training task generation. Then, K samples from each selected category are randomly chosen to form the support set, while q samples from the remaining instances of each category are randomly selected to compose the query set. For test task construction, five types of support–query pairs are generated by combining depression angles, following a procedure analogous to that utilized in training tasks. The final configurations of the training and test tasks are summarized in Table 2 [47].
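The training task generation described above can be sketched as follows. The helper `sample_task` and the use of integer sample indices in place of SAR images are assumptions made for illustration; only the class names, the 200 samples per class, and the 3-way K-shot 30-query format come from the text.

```python
import random
from typing import Dict, List, Tuple


def sample_task(dataset: Dict[str, List[int]], n_way: int, k_shot: int, q_query: int,
                rng: random.Random) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Draw one N-way K-shot q-query task with disjoint support and query sets."""
    classes = rng.sample(sorted(dataset), n_way)          # N random classes
    support, query = {}, {}
    for name in classes:
        picked = rng.sample(dataset[name], k_shot + q_query)  # sampling without replacement
        support[name] = picked[:k_shot]                   # K support samples
        query[name] = picked[k_shot:]                     # q query samples from the remainder
    return support, query


# Seven training classes with 200 samples each; indices stand in for images.
train_classes = ["BMP2", "BTR60", "BTR70", "D7", "T62", "T72", "ZIL131"]
dataset = {name: list(range(200)) for name in train_classes}

support, query = sample_task(dataset, n_way=3, k_shot=1, q_query=30, rng=random.Random(0))
```

Drawing the K + q samples of a class in a single call without replacement guarantees that no query sample also appears in the support set.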

4.2. Experimental Setups

Depending on the experimental scenario, the number K of samples per class in the support set is set to either 1 or 5, while the query set has a fixed size of q = 30 samples per class. The feature weighting factor

σ

of the MSEN is set to 0.7, and the exponential parameter of the power transformation in the data augmentation module is set within the range from 0.7 to 1.1. During training, we employ the Adam optimizer with an initial learning rate of 0.001 and exponential decay rate of 0.997. The loss weighting factor

ρ

in the joint training loss is set to 0.8, and the number of training tasks is 2000. In the testing stage, the learned model weights from the training stage are utilized, and only the forward inference part of the meta-classification module is implemented to predict labels of test samples.
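The learning-rate schedule implied by these settings can be sketched as below. The text does not state how often the decay is applied, so the step granularity here (one decay step per training task) is an assumption, and `learning_rate` is a hypothetical helper rather than part of the authors' code.

```python
def learning_rate(step: int, initial_lr: float = 0.001, decay: float = 0.997) -> float:
    """Exponentially decayed learning rate after `step` decay steps."""
    return initial_lr * decay ** step


# Assumption: one decay step per training task, 2000 tasks in total.
schedule = [learning_rate(t) for t in range(2000)]
print(schedule[0], schedule[-1])
```

Under this assumption the learning rate shrinks by several orders of magnitude over the 2000 tasks, so late tasks only fine-tune the embedding; with a coarser decay step the reduction would be far milder.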

To address potential quality fluctuations in support samples from single experiments, which can introduce significant randomness and hinder accurate model performance evaluation, multiple random samplings are performed during testing. Specifically, 1000 different test tasks are constructed following the N-way K-shot criterion, and 1000 independent repeated experiments are conducted to mitigate the impact of such randomness on performance evaluation.
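The aggregation over repeated test tasks can be sketched as follows. The three correct-answer counts are made-up placeholders (not results from the paper), and `summarize` is a hypothetical helper; in the described protocol the list would hold 1000 entries, one per test task with 3 × 30 query samples.

```python
import statistics
from typing import List, Tuple


def summarize(correct_counts: List[int], queries_per_task: int) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-task accuracy."""
    accuracies = [c / queries_per_task for c in correct_counts]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder counts of correctly classified queries in three tasks (hypothetical).
counts = [81, 90, 72]
mean_acc, std_acc = summarize(counts, queries_per_task=3 * 30)
```

Reporting the spread alongside the mean makes the effect of support-sample quality visible, which is what the histograms in Figure 6 convey for the proposed method.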

4.3. Algorithm Comparison

To validate the advantages of the proposed method, comparative experiments are conducted with classical few-shot learning approaches, including ProtoNet [48], RelationNet [49], TPN [50], MGA-Net [51], HENC [40] and MCL-DMM [52]. Among these methods, ProtoNet obtains feature vectors through an embedding network and infers categories based on the distances between the query features and category centers of the support set; RelationNet concatenates support features and query features first, and then determines the target category based on output scores of the relation projection network; TPN builds an undirected graph with unlabeled and labeled data, and achieves category inference through label propagation; MGA-Net enhances sample diversity through data augmentation, utilizes multi-layer GAT to capture inter-sample relationships, and improves inter-class discriminability via a hybrid loss; HENC implements hierarchical embedding for target-level and part-level fine-grained feature extraction, explicitly calibrating novel class centers toward their true distribution centers; MCL-DMM integrates multi-level contrastive learning to enhance intra-class tightness and inter-class separateness, and employs dependency matrix-based measurement to model inter-channel correlations of feature matrices.

All experiments are conducted under identical conditions. The average classification accuracies of each method over 1000 test trials under the one-shot and five-shot scenarios are summarized in Table 3 and Table 4, respectively. For a more intuitive comparison of classification performance across subtasks, Figure 5 presents the accuracy curves of all models. Additionally, to further analyze the distribution of classification outcomes in the 1000 test trials, Figure 6 illustrates the accuracy distribution histogram of the proposed method.

As evident from Table 3 and Table 4, the proposed method consistently outperforms the six baseline methods in average classification accuracy across diverse few-shot SAR image classification scenarios. Moreover, the classification accuracy curves in Figure 5 demonstrate that our method achieves high precision across all subtasks with varying depression angle combinations, indicating its superior classification performance and resilience to depression angle variations. These results comprehensively validate the effectiveness and generalizability of the proposed method. The histogram results in Figure 6 show that MSMC achieves high classification accuracy across all subtasks in both three-way one-shot and three-way five-shot experimental scenarios, demonstrating excellent robustness to large depression angle variations. Moreover, it can be observed that for the same subtask, the three-way five-shot scenario exhibits higher mean accuracy and lower variance compared to the three-way one-shot setting. Therefore, even a modest increase in the number of support samples can lead to significant improvements in the model’s classification capability.

Furthermore, the visualization results in Figure 7 demonstrate that the proposed MSEN effectively captures fine-grained, critical target information such as the barrel in raw SAR images through multi-scale magnification, thereby enhancing classification accuracy. To further investigate the distribution of novel classes in the feature space, the t-SNE visualization results are presented in Figure 8. Specifically, subfigures (a–f) illustrate the embedded feature distributions of novel classes at a

15^{\circ}

depression angle for raw SAR images, ProtoNet, MGA-Net, HENC, MCL-DMM and the proposed MSMC, respectively. Among the comparative methods, the visualization results of RelationNet and TPN are not presented, as their decision-making processes rely on implicit implementation of subsequent modules (relation computation/transductive inference), which cannot be visualized by intermediate features. As seen from Figure 8a, the raw SAR image features demonstrate significant inter-class overlap and scattered intra-class distributions, highlighting the inherent difficulties in SAR target discrimination. As shown in Figure 8b–f, both the comparative methods and MSMC achieve superior inter-class separability to the original data. Notably, MSMC outperforms them by explicitly implementing intra-class feature aggregation and inter-class feature divergence through meta-contrastive learning. Such a dual mechanism enhances feature discriminability in the embedding space, ultimately leading to significantly improved classification performance for novel classes.

4.4. Ablation Study

To verify the effectiveness of each module in MSMC, i.e., the MSEN and the ACLM, ablation experiments are carried out. The ProtoNet is selected as the baseline, where the feature extraction network is the standard Conv4 structure. Each module is sequentially added or replaced in the corresponding part of the baseline. Specifically, when the status of MSEN is ✔, it means that the Conv4 of the baseline is replaced with MSEN; when the status of ACLM is ✔, it means that the ACLM module is added to the baseline.

The results of the ablation experiments are recorded in Table 5. For intuitive performance comparison of each module, ↑ is utilized in the table to represent the performance improvement with respect to the baseline (ProtoNet+Conv4) after employing the current module. As evidenced in Table 5, adopting MSEN as the feature extraction network or incorporating ACLM into the framework significantly improves the classification accuracy across all subtasks. Furthermore, the optimal recognition performance is attained when MSEN and ACLM are jointly implemented, conclusively demonstrating the efficacy of the designed MSEN and ACLM modules.

As a classic method for fine-grained feature extraction, the Recurrent Attention Convolutional Neural Network (RACNN) has demonstrated excellent performance in fine-grained optical image recognition. Below, we compare the performance of RACNN and the proposed MSEN. RACNN first feeds the original image into the convolutional layers to extract region-based feature representations. Subsequently, it outputs final features through fully connected layers and predicts regional attention distributions via an attention proposal network. The generated attention parameters are employed for region cropping and scaling to generate multi-scale sub-images. Unlike the recursive attention sampling of RACNN, the proposed MSEN derives attention-sampled information from the varying strength of semantic information across convolutional layers of different depths, and generates the multi-scale sub-images directly from the original image.

Considering that MSEN generates sub-images at three scales, we select RACNN with a three-scale subnetwork as the comparative embedding network. The baseline model is the combination of “ProtoNet+Conv4+ACLM”, denoted as

{Baseline}^{†}

, where Conv4 represents a standard four-layer convolutional network and ACLM indicates the use of auxiliary contrastive loss during training. The dataset and task generation methodology are consistent with Section 4.1 and Section 4.2.

The experimental results are shown in Table 6. Specifically, when the status of MSEN is ✔, it means that the Conv4 of the baseline is replaced with MSEN; when the status of RACNN is ✔, it means that the Conv4 of the baseline is replaced with RACNN. In the table, one ↑ marker represents the performance improvement of MSEN and RACNN over the baseline, and a second ↑ marker represents the performance improvement of MSEN over RACNN.

The results demonstrate that replacing Conv4 with either the proposed MSEN or RACNN leads to a clear performance improvement. In RACNN’s recursive feature extraction mode, the generation and feature extraction of subsequent sub-images are prone to being influenced by the initial attention sampling. In contrast, the proposed MSEN samples the original image independently at each scale, based on feature maps from convolutional layers of different depths, which mitigates the influence of a deviation at any single scale on the fused features. Accordingly, MSEN demonstrates superior performance across various classification subtasks in different experimental scenarios.

5. Discussion

In this section, few-shot classification experiments are firstly carried out on the MSMC under varying feature weighting factor

σ

to determine the optimal feature fusion weights, where the range of

σ

is defined as

[0, 1]

. Specifically,

σ

= 0.5 implies that the MSMC equally exploits the global features of the original image and the local features of the enlarged sub-images, whereas

σ

= 1 denotes that only the original image is employed for global information extraction. The experimental results are presented in Table 7. It is observed that as

σ

increases, the recognition rates of all subtasks first increase and then decrease, attaining the optimal performance at

σ

= 0.7. Therefore, this parameter configuration is best suited to the target recognition pipeline of the MSMC.
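The role of the weighting factor can be illustrated with a small sketch. The exact fusion rule of MSEN is defined in Section 3 and is not restated here, so the convex combination below, together with the averaging of the local features and the name `fuse`, is an assumption chosen only to be consistent with the two cases described above (equal weighting at 0.5, global features only at 1).

```python
from typing import List, Sequence


def fuse(global_feat: Sequence[float], local_feats: List[Sequence[float]],
         sigma: float = 0.7) -> List[float]:
    """Assumed fusion: sigma * global + (1 - sigma) * mean of local features."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    dim = len(global_feat)
    # Assumption: the local features of the enlarged sub-images are averaged.
    local_mean = [sum(f[d] for f in local_feats) / len(local_feats) for d in range(dim)]
    return [sigma * g + (1.0 - sigma) * l for g, l in zip(global_feat, local_mean)]


# Hypothetical 2-D features: one global embedding and two sub-image embeddings.
global_feat = [2.0, 0.0]
local_feats = [[0.0, 4.0], [0.0, 2.0]]

only_global = fuse(global_feat, local_feats, sigma=1.0)
balanced = fuse(global_feat, local_feats, sigma=0.5)
default = fuse(global_feat, local_feats)  # sigma = 0.7
```

Under this assumed rule, a value of 0.7 keeps the global embedding dominant while still letting the magnified sub-images contribute fine-grained detail.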

The loss weight

ρ

balances the meta-classification loss

L_{meta}

and the contrastive loss

L_{contrast}

, and its optimal value is determined by grid search in a way similar to the choice of

σ

. Specifically,

ρ

ranges from 0.5 to 1.0 in steps of 0.1 under the 3-way 1-shot and 3-way 5-shot configurations on the MSTAR dataset, and the results are summarized in Table 8. Experiments have shown that the model achieves the maximum average classification accuracy at

ρ

= 0.8. With a smaller

ρ

, the contrastive loss dominates, and the model overemphasizes the differences between individual samples while ignoring intra-class cohesion, thereby leading to misclassification in fine-grained tasks. With a larger

ρ

, the meta-classification loss dominates, which reduces the feature diversity and impairs the model’s ability to capture subtle structural differences between classes, thereby decreasing the average classification accuracy.

6. Conclusions

In this article, the MSMC framework was proposed for few-shot fine-grained SAR target classification, which has presented significant advancement over conventional few-shot classification methods based on meta-learning. By integrating unsupervised contrastive learning with metric-based meta-learning in a unified training paradigm, the MSMC effectively combined supervised class label information with unsupervised inherent image features. Such a dual learning mechanism can enable more comprehensive data representation, thereby resulting in substantially improved data understanding and feature discrimination. The framework was further enhanced by a multi-scale embedding network that extracts complementary SAR image features at global and local granularities, which significantly enhances the model’s ability to capture intra-class similarities and inter-class differences. The experimental results on the MSTAR dataset demonstrate the superior performance of MSMC, i.e., it can achieve state-of-the-art classification performance while maintaining excellent robustness across different experimental scenarios and diverse depression angles.

Future work will focus on designing multi-static few-shot ISAR recognition methods by exploiting complementary features and suppressing redundant features across all the observation stations within the framework of meta-learning. Compared with monostatic ISAR, multi-static ISAR can compensate for feature loss caused by self-occlusion and limited viewing angle through multi-perspective observation, thereby boosting the performance of few-shot classification. To address the challenges of distributed data, which include accurate feature alignment and effective feature fusion, we will extend the MSMC framework to multi-static few-shot ISAR target classification, focusing on designing a cross-station feature alignment module, a multi-station information fusion mechanism, and extending the meta-training paradigm to cross-station tasks. By this means, we can unlock the advantages of multi-static ISAR and achieve more accurate and robust fine-grained classification in complex scenarios.

Author Contributions

Conceptualization, B.C. and X.B.; Methodology, B.C., M.Y., Y.W. and X.B.; Software, B.C.; Validation, B.C.; Data curation, X.B.; Writing—original draft, B.C.; Writing—review & editing, X.B.; Visualization, M.Y. and Y.W.; Supervision, X.B.; Project administration, X.B.; Funding acquisition, X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grants No. 62425113, No. 62531020, and No. 62131020.

Data Availability Statement

The original data presented in this study are openly available in Air Force Research Laboratory Scientific Data Management System (AFRL SDMS) at https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 22 January 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lv, B.; Ni, J.; Luo, Y.; Zhao, S.; Liang, J.; Yuan, H.; Zhang, Q. A multiview interclass dissimilarity feature fusion SAR images recognition network within limited sample condition. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 17820–17836.
2. Cao, R.; Wang, Y.; Giusti, E.; Martorella, M. 3-D reconstruction of ship target based on SAR images sequence and scatterer tracking technique. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5200415.
3. Zhang, H.; Zhao, X.; Hao, X.; Li, W.; Hu, H.; Ni, J.; Luo, Y. SAR super-resolution imaging and recognition integrated network based on deep learning framework. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 20896–20913.
4. Zhao, S.; Zhang, Z.; Zhang, T.; Guo, W.; Luo, Y. Transferable SAR image classification crossing different satellites under open set condition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4506005.
5. Liu, M.; Chen, S.; Wu, J.; Lu, F.; Wang, X.; Xing, M. SAR target configuration recognition via two-stage sparse structure representation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2220–2232.
6. He, Z.; Xiao, H.; Gao, C.; Tian, Z.; Chen, S.-W. Fusion of sparse model based on randomly erased image for SAR occluded target recognition. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7829–7844.
7. Huang, Y.; Liao, G.; Zhang, Z.; Xiang, Y.; Li, J.; Nehorai, A. SAR automatic target recognition using joint low-rank and sparse multiview denoising. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1570–1574.
8. Wang, C.; Shi, J.; Zhou, Y.; Li, L.; Yang, X.; Zhang, T.; Wei, S.; Zhang, X.; Tao, C. Label noise modeling and correction via loss curve fitting for SAR ATR. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5216210.
9. Zhou, Z.; Cao, Z.; Pi, Y. Subdictionary-based joint sparse representation for SAR target recognition using multilevel reconstruction. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6877–6887.
10. Dong, G.; Kuang, G. Classification on the monogenic scale space: Application to target recognition in SAR image. IEEE Trans. Image Process. 2015, 24, 2527–2539.
11. Hatcher, W.G.; Yu, W. A survey of deep learning: Platforms, applications and emerging research trends. IEEE Access 2018, 6, 24411–24432.
12. Ghanbari, H.; Mahdianpari, M.; Homayouni, S.; Mohammadimanesh, F. A meta-analysis of convolutional neural networks for remote sensing applications. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 3602–3613.
13. Liu, Z.; Wang, L.; Wen, Z.; Li, K.; Pan, Q. Multilevel scattering center and deep feature fusion learning framework for SAR target recognition. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5227914.
14. Zheng, J.; Li, M.; Li, X.; Zhang, P.; Wu, Y. Revisiting local and global descriptor-based metric network for few-shot SAR target classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205814.
15. Wang, S.; Wang, Y.; Liu, H.; Sun, Y.; Zhang, C. A few-shot SAR target recognition method by unifying local classification with feature generation and calibration. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5200319.
16. Geng, J.; Zhang, Y.; Jiang, W. Polarimetric SAR image classification based on hierarchical scattering-spatial interaction transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205014.
17. Dong, H.; Ma, W.; Jiao, L.; Liu, F.; Liu, X.; Zhu, H. Contrastive learning with context-augmented transformer for change detection in SAR images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 17710–17724.
18. Yin, J.; Duan, C.; Wang, H.; Yang, J. A review on the few-shot SAR target recognition. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 16411–16425.
19. Fu, K.; Zhang, T.; Zhang, Y.; Wang, Z.; Sun, X. Few-shot SAR target classification via meta-learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 2000314.
20. Lin, L.; Zhang, S.; Fu, S.; Liu, Y.; Suo, S.; Hu, G. Prototype matching-based meta-learning model for few-shot fault diagnosis of mechanical system. Neurocomputing 2025, 617, 129012.
21. Xu, W.; Wan, Y. ELA: Efficient local attention for deep convolutional neural networks. arXiv 2024, arXiv:2403.01123.
22. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76.
23. Cheung, T.-H.; Yeung, D.-Y. A survey of automated data augmentation for image classification: Learning to compose, mix, and generate. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13185–13205.
24. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5149–5169.
25. Niu, S.; Liu, Y.; Wang, J.; Song, H. A decade survey of transfer learning (2010–2020). IEEE Trans. Artif. Intell. 2021, 1, 151–166.
26. Yang, K.; Tao, J.; Lyu, J.; Ge, C.; Chen, J.; Shen, W.; Zhu, X.; Li, X. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 8941–8951.
27. Zhu, Q.; Fan, L.; Weng, N. Advancements in point cloud data augmentation for deep learning: A survey. Pattern Recognit. 2024, 153, 110532.
28. Bao, J.; Yu, W.M.; Yang, K.; Liu, C.; Cui, T.J. Improved few-shot SAR image generation by enhancing diversity. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 3394–3408.
29. Akkem, Y.; Biswas, S.K.; Varanasi, A. A comprehensive review of synthetic data generation in smart farming by using variational autoencoder and generative adversarial network. Eng. Appl. Artif. Intell. 2024, 131, 107881.
30. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
31. Roy, A.; Dasgupta, D. A distributed conditional Wasserstein deep convolutional relativistic loss generative adversarial network with improved convergence. IEEE Trans. Artif. Intell. 2024, 5, 4344–4353.
32. Bhandari, A.; Tripathy, B.; Adate, A.; Saxena, R.; Gadekallu, T.R. From beginning to BEGANing: Role of adversarial learning in reshaping generative models. Electronics 2022, 12, 155.
33. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Differentiable earth mover’s distance for few-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5632–5648.
34. Huisman, M.; Van Rijn, J.N.; Plaat, A. A survey of deep meta-learning. Artif. Intell. Rev. 2021, 54, 4483–4541.
35. Gharoun, H.; Momenifar, F.; Chen, F.; Gandomi, A.H. Meta-learning approaches for few-shot learning: A survey of recent advances. ACM Comput. Surv. 2024, 56, 1–41.
36. Wei, X.-S.; Song, Y.-Z.; Mac Aodha, O.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8927–8948.
37. Huang, Z.; Li, Y. Interpretable and accurate fine-grained recognition via region grouping. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8662–8672.
38. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5209–5217.
39. Wang, S.; Wang, Y.; Zhang, X.; Zhang, C.; Liu, H. Visual-semantic cooperative learning for few-shot SAR target classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 6532–6550.
40. Yang, M.; Bai, X.; Wang, L.; Zhou, F. HENC: Hierarchical embedding network with center calibration for few-shot fine-grained SAR target classification. IEEE Trans. Image Process. 2023, 32, 3324–3337.
41. Lin, T.-Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
42. Fu, J.; Zheng, H.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446.
43. Wang, L.; Bai, X.; Gong, C.; Zhou, F. Hybrid inference network for few-shot SAR automatic target recognition. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9257–9269.
44. Wang, X.; Zhao, H.; He, Y.; Hu, P.; Shao, S. A simple neural network for nonlinear self-interference cancellation in full-duplex radios. IEEE Trans. Veh. Technol. 2024, 73, 10817–10822.
45. An, X.; Cui, X.; Zhao, S.; Liu, G.; Lu, M. Efficient rigid body localization based on Euclidean distance matrix completion for AGV positioning under harsh environment. IEEE Trans. Veh. Technol. 2022, 72, 2482–2496.
46. Chen, D.; Xiong, G.; Wang, L.; Yu, W. Variable length sequential iterable convolutional recurrent network for UWB-IR vehicle target recognition. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5102311.
47. Chen, B.; Yang, M.; Bai, X. Few-shot SAR target classification with CLPN: A prototypical network combining unsupervised contrastive learning. In Proceedings of the IET International Radar Conference (IRC 2023); IET: London, UK, 2023; Volume 2023, pp. 2028–2035.
48. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4080–4090.
49. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1199–1208.
50. Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.J.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002.
51. Yang, M.; Bai, X.; Wang, L.; Zhou, F. Mixed loss graph attention network for few-shot SAR target classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5216613.
52. Tan, H.; Zhang, Z.; Shi, X.; Yang, X.; Li, Y.; Bai, X.; Zhou, F. Few-shot SAR ATR via multi-level contrastive learning and dependency matrix-based measurement. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 8175–8188.

Figure 1. The overall framework of the MSMC.

Figure 2. The task sampling and augmentation module. (a) N-way K-shot task sampling; (b) data augmentation.

Figure 3. The schematic diagram of MSEN.

Figure 4. Optical images and SAR images of the ten categories in the MSTAR dataset. (a) 2S1, (b) BMP2, (c) BRDM-2, (d) BTR60, (e) BTR70, (f) D7, (g) T62, (h) T72, (i) ZIL131, (j) ZSU-234.

Figure 5. Comparisons of the classification accuracy curves on the MSTAR dataset. (a) 3-way 1-shot; (b) 3-way 5-shot.

Figure 6. Statistical histograms of the test results for the proposed method under different experimental scenarios. (a–e) 3-way 1-shot scenario: (a) 15s/15q. (b) 17s/15q. (c) 17s/17q. (d) 30s/30q. (e) 17s/30q. (f–j) 3-way 5-shot scenario: (f) 15s/15q. (g) 17s/15q. (h) 17s/17q. (i) 30s/30q. (j) 17s/30q.

Figure 7. Visualization results of MSEN for the MSTAR dataset.

Figure 8. Visualizations of the t-SNE features. (a) Original SAR image; (b) ProtoNet; (c) MGA-Net; (d) HENC; (e) MCL-DMM; (f) MSMC.

Table 1. Division of the training set and the test set.

Set	Target Category	Depression Angle	Sample Number
Training Set	BMP2, BTR60, BTR70, D7, T62, T72, ZIL131	$17^{\circ}$	200
Testing Set	2S1, BRDM-2, ZSU-234	$15^{\circ}$	274
Testing Set	2S1, BRDM-2, ZSU-234	$17^{\circ}$	298
Testing Set	2S1, BRDM-2, ZSU-234	$30^{\circ}$	287

Table 2. Division of the training and test tasks.

No.	Settings	Training Task		Testing Task	
		Support Set	Query Set	Support Set	Query Set
1	15s/15q	$17^{\circ}$	$17^{\circ}$	$15^{\circ}$	$15^{\circ}$
2	17s/15q	$17^{\circ}$	$17^{\circ}$	$17^{\circ}$	$15^{\circ}$
3	17s/17q	$17^{\circ}$	$17^{\circ}$	$17^{\circ}$	$17^{\circ}$
4	30s/30q	$17^{\circ}$	$17^{\circ}$	$30^{\circ}$	$30^{\circ}$
5	17s/30q	$17^{\circ}$	$17^{\circ}$	$17^{\circ}$	$30^{\circ}$
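The setting names in Table 2 encode the depression angles of the test-time support set ("s") and query set ("q"). As a reading aid, the sketch below stores the five settings in a small data structure and lists those in which the two test-time angles differ. The class name `TaskSetting` and its field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSetting:
    """One row of Table 2; all angles are depression angles in degrees."""
    name: str
    train_support: int
    train_query: int
    test_support: int
    test_query: int


# Values transcribed from Table 2: training always uses 17-degree data.
SETTINGS = [
    TaskSetting("15s/15q", 17, 17, 15, 15),
    TaskSetting("17s/15q", 17, 17, 17, 15),
    TaskSetting("17s/17q", 17, 17, 17, 17),
    TaskSetting("30s/30q", 17, 17, 30, 30),
    TaskSetting("17s/30q", 17, 17, 17, 30),
]

# Settings whose test support and query sets come from different angles.
cross_angle = [s.name for s in SETTINGS if s.test_support != s.test_query]
print(cross_angle)  # ['17s/15q', '17s/30q']
```

Only 17s/17q tests at the same angle as training; 15s/15q and 30s/30q shift both test sets away from it, and the two cross-angle settings additionally separate the support set from the query set.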

Table 3. Experimental results of various methods on the MSTAR dataset (3-way 1-shot).

Method	15s/15q	17s/15q	17s/17q	30s/30q	17s/30q
ProtoNet [48]	77.63%	75.36%	75.40%	73.60%	68.06%
RelationNet [49]	73.86%	72.10%	72.51%	70.53%	69.20%
TPN [50]	80.42%	80.43%	82.16%	79.80%	77.51%
MGA-Net [51]	91.88%	91.37%	91.43%	88.85%	86.79%
HENC [40]	87.04%	86.80%	89.83%	86.70%	86.12%
MCL-DMM [52]	93.23%	92.25%	91.55%	89.88%	87.98%
MSMC	95.02%	93.40%	94.27%	91.47%	89.44%

Table 4. Experimental results of various methods on the MSTAR dataset (3-way 5-shot).

Method	15s/15q	17s/15q	17s/17q	30s/30q	17s/30q
ProtoNet [48]	87.59%	83.05%	84.79%	82.34%	74.07%
RelationNet [49]	81.07%	77.44%	78.11%	78.47%	75.62%
TPN [50]	91.43%	91.75%	92.75%	89.57%	88.15%
MGA-Net [51]	96.21%	95.78%	95.88%	94.20%	90.49%
HENC [40]	93.50%	92.80%	95.25%	92.67%	90.26%
MCL-DMM [52]	96.68%	96.15%	96.32%	95.12%	92.27%
MSMC	98.56%	97.60%	97.83%	96.46%	94.98%

Table 5. Results of ablation experiments for the MSTAR dataset.

Module		15s/15q		17s/15q		17s/17q		30s/30q		17s/30q	
MSEN	ACLM	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot
Baseline (ProtoNet+Conv4)		77.63%	87.59%	75.36%	83.05%	75.40%	84.79%	73.60%	82.34%	68.06%	74.07%
✔		84.15%	91.68%	81.36%	89.44%	82.87%	89.92%	78.32%	86.47%	73.14%	79.58%
		↑6.52%	↑4.09%	↑6.03%	↑6.39%	↑7.47%	↑5.13%	↑4.72%	↑4.13%	↑5.08%	↑5.51%
	✔	93.07%	96.53%	92.12%	95.52%	92.16%	95.90%	88.85%	93.19%	87.15%	90.53%
		↑15.44%	↑8.94%	↑16.76%	↑12.47%	↑16.76%	↑11.11%	↑15.25%	↑10.85%	↑19.09%	↑16.46%
✔	✔	95.02%	98.56%	93.40%	97.60%	94.27%	97.83%	91.47%	96.46%	89.44%	94.98%
		↑17.39%	↑10.97%	↑18.04%	↑14.55%	↑18.87%	↑13.04%	↑17.87%	↑14.12%	↑21.38%	↑20.91%
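The arrow rows in Table 5 report absolute gains, in percentage points, over the ProtoNet+Conv4 baseline. The short sketch below recomputes them for the 15s/15q columns only. The accuracies are transcribed from the table, while the variable names and variant labels are illustrative.

```python
# Accuracies (%) from the 15s/15q columns of Table 5.
baseline = {"1-shot": 77.63, "5-shot": 87.59}
variants = {
    "MSEN only": {"1-shot": 84.15, "5-shot": 91.68},
    "ACLM only": {"1-shot": 93.07, "5-shot": 96.53},
    "MSEN + ACLM": {"1-shot": 95.02, "5-shot": 98.56},
}

# Absolute gain over the baseline, rounded to two decimals as in the table.
gains = {
    name: {shot: round(acc[shot] - baseline[shot], 2) for shot in acc}
    for name, acc in variants.items()
}

for name, g in gains.items():
    print(f"{name}: +{g['1-shot']:.2f} (1-shot), +{g['5-shot']:.2f} (5-shot)")
```

For this setting the computed gains are 6.52/4.09, 15.44/8.94, and 17.39/10.97 percentage points, matching the arrow rows.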

Table 6. Experimental results with MSEN and RACNN as embedding networks.

Module		15s/15q		17s/15q		17s/17q		30s/30q		17s/30q	
RACNN	MSEN	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot
Baseline†		93.07%	96.53%	92.12%	95.52%	92.16%	95.90%	88.85%	93.19%	87.15%	90.53%
✔		94.12%	97.68%	92.41%	96.78%	92.52%	97.05%	89.55%	93.74%	88.06%	92.35%
Gain over Baseline†		↑1.05%	↑1.15%	↑0.29%	↑1.26%	↑0.36%	↑1.15%	↑0.70%	↑0.55%	↑0.91%	↑1.82%
	✔	95.02%	98.56%	93.40%	97.60%	94.27%	97.83%	91.47%	96.46%	89.44%	94.98%
Gain over Baseline†		↑1.95%	↑2.03%	↑1.28%	↑2.08%	↑2.11%	↑1.93%	↑2.62%	↑3.27%	↑2.29%	↑4.45%
Gain over RACNN		↑0.90%	↑0.88%	↑0.99%	↑0.82%	↑1.75%	↑0.78%	↑1.87%	↑2.72%	↑1.48%	↑2.63%

Table 7. The impact of the feature weight coefficient σ on the classification results.

σ	15s/15q		17s/15q		17s/17q		30s/30q		17s/30q	
	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot
0.5	92.75%	95.82%	91.12%	94.88%	91.64%	95.12%	88.02%	93.12%	86.17%	90.13%
0.6	93.32%	97.64%	92.88%	96.82%	93.06%	97.05%	88.23%	94.22%	87.15%	92.25%
0.7	95.02%	98.56%	93.40%	97.60%	94.27%	97.83%	91.47%	96.46%	89.44%	94.98%
0.8	94.68%	98.13%	93.02%	97.25%	94.15%	97.34%	90.68%	96.22%	89.12%	94.65%
0.9	93.75%	97.75%	92.94%	96.88%	93.74%	97.12%	89.54%	94.56%	87.48%	92.44%
1.0	93.62%	96.38%	91.98%	95.26%	91.92%	96.22%	88.63%	93.74%	87.25%	92.03%
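One simple way to read the sweep in Table 7 is to rank each value of σ by its mean accuracy across the ten columns. The sketch below does this with the table's numbers. Averaging over all settings is an assumption made here for illustration and is not a selection rule stated in the paper.

```python
from statistics import mean

# Rows of Table 7: sigma -> accuracies (%) in column order
# (15s/15q, 17s/15q, 17s/17q, 30s/30q, 17s/30q; 1-shot then 5-shot each).
sweep = {
    0.5: [92.75, 95.82, 91.12, 94.88, 91.64, 95.12, 88.02, 93.12, 86.17, 90.13],
    0.6: [93.32, 97.64, 92.88, 96.82, 93.06, 97.05, 88.23, 94.22, 87.15, 92.25],
    0.7: [95.02, 98.56, 93.40, 97.60, 94.27, 97.83, 91.47, 96.46, 89.44, 94.98],
    0.8: [94.68, 98.13, 93.02, 97.25, 94.15, 97.34, 90.68, 96.22, 89.12, 94.65],
    0.9: [93.75, 97.75, 92.94, 96.88, 93.74, 97.12, 89.54, 94.56, 87.48, 92.44],
    1.0: [93.62, 96.38, 91.98, 95.26, 91.92, 96.22, 88.63, 93.74, 87.25, 92.03],
}

# Rank sigma values by mean accuracy over all ten columns.
best_sigma = max(sweep, key=lambda s: mean(sweep[s]))

# Check whether the best sigma also wins every individual column.
wins_every_column = all(
    sweep[best_sigma][i] == max(row[i] for row in sweep.values())
    for i in range(10)
)
print(best_sigma, wins_every_column)
```

With these numbers σ = 0.7 has the highest mean and is also the best value in every individual column.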

Table 8. The impact of the loss weight coefficient ρ on the classification results.

ρ	15s/15q		17s/15q		17s/17q		30s/30q		17s/30q	
	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot
0.5	92.15%	96.32%	90.08%	93.15%	91.83%	95.77%	88.24%	93.51%	86.05%	91.22%
0.6	93.47%	97.25%	91.56%	95.33%	92.96%	96.54%	89.68%	94.67%	87.33%	92.69%
0.7	94.23%	98.01%	92.64%	96.78%	93.75%	97.21%	90.55%	95.82%	88.51%	93.87%
0.8	95.02%	98.56%	93.40%	97.60%	94.27%	97.83%	91.47%	96.46%	89.44%	94.98%
0.9	94.16%	97.89%	92.38%	96.52%	93.58%	97.05%	90.33%	95.24%	87.96%	93.45%
1.0	92.89%	96.93%	90.77%	94.86%	92.14%	95.92%	88.76%	94.01%	86.58%	91.83%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, B.; Yang, M.; Wang, Y.; Bai, X. MSMC: Multi-Scale Embedding and Meta-Contrastive Learning for Few-Shot Fine-Grained SAR Target Classification. Remote Sens. 2026, 18, 415. https://doi.org/10.3390/rs18030415

