Fine-grained SAR targets are highly sensitive to pose variations and depression angle changes, which induce pronounced fluctuations in their scattering characteristics. Specifically, pose adjustments can modify the spatial distribution of key scattering centers of the target (e.g., barrels, wheels), while variations in depression angle may give rise to intensity discrepancies between global contours and local components. As a result, local discriminative details are often obscured by interference from global features. To address these intrinsic challenges of SAR data, MSMC integrates MSEN with meta-contrastive learning. MSEN leverages semantic features from different CNN layers to generate adaptive multi-scale key scattering regions, thereby enhancing local discriminative capability under varying pose conditions. Meanwhile, the dual-branch meta-contrastive learning architecture acquires pose-invariant metric knowledge through the meta-learning branch and improves feature distinguishability through sample-pair learning in the contrastive branch.
During the training phase, an N-way K-shot sampling strategy is applied to the training dataset to generate training tasks comprising support and query sets. The images within these tasks are then augmented to produce auxiliary sample sets containing positive and negative sample pairs. After data preprocessing, both the training tasks and the auxiliary sample sets are fed into the MSEN to generate their respective feature vectors. Based on these feature representations, the MCM computes the distance between feature vectors to derive the meta-classification loss, while the ACLM evaluates the similarity between positive and negative sample pairs to compute the contrastive loss. The overall MSMC is optimized using a joint training loss, i.e., a weighted summation of the meta-classification loss and the contrastive loss. During the test phase, the test tasks are processed directly by the MSEN with shared parameters, and only the MCM is used to infer the categories of the query samples. The following subsections describe the structure of each module of the proposed network in detail.
3.1. Task Sampling and Data Augmentation
In MSMC, task sampling is adopted in the meta-learning pipeline, while data augmentation is adopted in the contrastive learning pipeline. Following common practice in few-shot learning [43], we construct the base-class set $C_{base}$ and the novel-class set $C_{novel}$, which satisfy $C_{base} \cap C_{novel} = \varnothing$. Few-shot SAR target classification aims to train a model on the base-class dataset $D_{base} = \{(x_i, y_i)\}_{i=1}^{N_{base}}$ and transfer the learned classification capabilities to the novel-class dataset, where only a small number of labeled samples $D_{novel} = \{(x_i, y_i)\}_{i=1}^{N_{novel}}$ is available. In the above definitions, $x_i \in \mathbb{R}^{H \times W}$ represents the i-th SAR image in the dataset, with height H and width W; $y_i$ denotes the corresponding label of $x_i$; and $N_{base}$ and $N_{novel}$ denote the number of samples in the base-class dataset $D_{base}$ and the novel-class dataset $D_{novel}$, respectively.
The proposed method leverages a meta-learning framework for few-shot SAR target classification. It adopts a task-driven episodic training paradigm which, unlike conventional training, enables the model to acquire class-agnostic general knowledge for effective generalization to novel-class tasks. Both training and testing operate on sampled subtasks: training tasks are sampled from the base-class dataset and test tasks from the novel-class dataset, with the goal of using the abundant base-class data to build a robust model that generalizes well to the novel classes.
Each task $T$ consists of a support set $S$ and a query set $Q$, where the samples in $Q$ are classified using the labeled samples in $S$. Constructed via the N-way K-shot strategy (Figure 2a), a test task $T_{test}$ selects N categories from the novel-class dataset $D_{novel}$ (each contributing K labeled samples to $S$) and is evaluated on $Q$ (unlabeled samples from the same N categories). During training, a training task $T_{train}$ is constructed similarly using samples from the base-class dataset $D_{base}$.
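The N-way K-shot construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_episode`, the per-class query count `n_query`, and the toy label list are hypothetical, and class labels are re-indexed to 0..N-1 inside each episode.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (support, query) lists of (sample_index, episode_label) pairs."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough samples for both sets are eligible.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw disjoint support and query samples from the same class.
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query

# Toy dataset (hypothetical): 6 classes with 10 samples each.
labels = [c for c in range(6) for _ in range(10)]
support, query = sample_episode(labels, n_way=3, k_shot=2, n_query=4,
                                rng=random.Random(0))
print(len(support), len(query))  # 6 12
```

Training and test episodes would differ only in the label list passed in (base-class versus novel-class data).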
Data augmentation supports the ACLM in constructing positive and negative sample pairs, which are used to compute the contrastive learning loss and train the MSEN. For single-channel grayscale SAR images, we design a hybrid augmentation module that integrates image-level and pixel-level operations. Specifically, image-level operations include random horizontal flipping and full-range rotation to construct an azimuth perturbation space; pixel-level operations encompass power transformation (adjusting the dynamic range distribution), Gaussian noise addition (simulating speckle effects), and Gaussian kernel blurring (introducing multi-scale scattering distortions). A stochastic augmentation combination mechanism (Figure 2b) applies two parallel, independent augmentation paths to each original sample, with each path randomly selecting a sequence of the aforementioned operations so that every sample yields two distinct augmentations. In the augmented dataset, two images augmented from the same original image form a positive sample pair, while two images augmented from different original images form a negative sample pair.
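A rough sketch of the two-path stochastic augmentation is given below. It assumes images are nested lists with pixel values normalized to [0, 1]; the parameter ranges (gamma, noise standard deviation), the 3x3 blur kernel, and the nearest-neighbour rotation are illustrative choices that the text does not specify. Unlike the described mechanism, the sketch does not explicitly check that the two views differ.

```python
import math
import random

def hflip(img, rng):
    return [row[::-1] for row in img]

def rotate(img, rng):
    """Rotate by a random angle in [0, 360) degrees, nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel of each output pixel.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

def power_transform(img, rng):
    gamma = rng.uniform(0.7, 1.5)  # assumed range
    return [[p ** gamma for p in row] for row in img]

def gaussian_noise(img, rng):
    # Additive noise with an assumed standard deviation, clipped to [0, 1].
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, 0.05))) for p in row]
            for row in img]

def gaussian_blur(img, rng):
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # 3x3 approximation, sum = 16
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(h - 1, max(0, y + dy))  # replicate borders
                    xx = min(w - 1, max(0, x + dx))
                    acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out

OPS = [hflip, rotate, power_transform, gaussian_noise, gaussian_blur]

def random_view(img, rng):
    # Each path applies a random subset of the operations in random order.
    for op in rng.sample(OPS, rng.randint(1, len(OPS))):
        img = op(img, rng)
    return img

def two_views(img, rng):
    """Two independent augmentation paths: the outputs form a positive pair."""
    return random_view(img, rng), random_view(img, rng)

rng = random.Random(0)
image = [[(x + y) / 14.0 for x in range(8)] for y in range(8)]
view_a, view_b = two_views(image, rng)
print(len(view_a), len(view_a[0]))  # 8 8
```

Views produced from different original images would then serve as negative pairs.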
3.2. Multi-Scale Embedding Network
The core challenge in fine-grained SAR target classification lies in high inter-class similarity and large intra-class variance. Targets of the same class exhibit significant scattering variations due to changes in pose and viewing angle, while differences between classes often reside in subtle local details such as component shapes and scattering center distributions. Traditional single-scale feature extraction suffers from two main limitations: it either overlooks critical local discriminative information while relying solely on global features, or fails to adapt to the multi-level scattering characteristics of SAR targets, such as intensity differences between overall contours and local components. To address these issues, MSEN is designed around global–local collaboration and multi-scale adaptation. Its dual-branch structure simultaneously captures global structural information and, guided by the semantic strength of different CNN layers, adaptively generates multi-scale candidate regions to locate and fuse key local scattering features. This enables accurate characterization of fine-grained differences and overcomes the shortcomings of single-scale methods.
As shown in Figure 3, MSEN is built upon a two-channel CNN backbone. The first channel extracts the global features of the input image and collects the output features of different layers, which jointly guide the region parameter generator in perceiving key regions. The second channel then integrates and compresses the extracted key regions to obtain local features of the target. Finally, the global and local features are weighted and fused to obtain the final feature vector.
Specifically, an input image $x$ of size $H \times W$ is first fed into ConvBlock 1, which consists of four convolutional layers, each with 64 channels. Each convolutional layer is followed by a max pooling layer that reduces the size of the feature map, and the Mish activation function [44] is applied to enhance non-linearity. For convenience, the output feature map of the $i$-th convolutional layer in ConvBlock 1 is denoted as $F_i$. Finally, by flattening the feature map of the last layer and passing it through a fully connected layer, we obtain a global feature vector $f_g$.
To enhance fine-grained feature representation, three rectangular sampling regions of progressively decreasing size are generated from ConvBlock 1. First, adaptive max-pooling is applied to the feature maps $F_j$ that guide the three regions to obtain standardized feature representations $\hat{F}_j$ of a consistent size. Then, these features are flattened into vectors and processed by a shared-weight region parameter generator $g(\cdot)$, implemented as a two-layer MLP, to predict the center coordinates $(t_x^j, t_y^j)$ of each sampling region. Mathematically, this can be expressed as $(t_x^j, t_y^j) = g(\mathrm{vec}(\hat{F}_j))$, where $\mathrm{vec}(\cdot)$ denotes unfolding the feature map into a vector. Furthermore, let the side lengths of the three rectangular regions be $l_1$, $l_2$, and $l_3$, respectively. Then, taking the $j$-th sampling region as an example, the coordinates of its top-left and bottom-right corners can be expressed as follows:

$$t_{x(tl)}^j = t_x^j - \frac{l_j}{2}, \quad t_{y(tl)}^j = t_y^j - \frac{l_j}{2}, \quad t_{x(br)}^j = t_x^j + \frac{l_j}{2}, \quad t_{y(br)}^j = t_y^j + \frac{l_j}{2}, \tag{1}$$

where "$tl$" denotes the top-left corner and "$br$" denotes the bottom-right corner. Subsequently, the regions defined by these rectangular boxes are selected from the original image, as follows:

$$x_j^{att} = x \odot M_j, \tag{2}$$

where ⊙ denotes element-wise multiplication, $x_j^{att}$ represents the cropped area, and $M_j$ is a two-dimensional boxcar attention mask:

$$M_j(u, v) = \left[h\big(u - t_{x(tl)}^j\big) - h\big(u - t_{x(br)}^j\big)\right] \cdot \left[h\big(v - t_{y(tl)}^j\big) - h\big(v - t_{y(br)}^j\big)\right], \tag{3}$$

where $h(z) = 1/(1 + e^{-kz})$ is a logistic function with an exponent of $k$. When $k$ is sufficiently large (set to 100 in this article), the logistic function approximates a step function, thereby enabling precise cropping of the specified image region, that is, $M_j(u, v) \approx 1$ inside the rectangle and $M_j(u, v) \approx 0$ outside it.
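The behaviour of the boxcar attention mask for large k can be checked numerically with the short sketch below. It assumes that the region is parameterized by its center and full side length, so that the corners lie half a side length from the center; the function names and the 8x8 toy grid are hypothetical.

```python
import math

def logistic(x, k=100.0):
    """h(x) = 1 / (1 + exp(-k x)), evaluated without overflow for large |k x|."""
    z = k * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def boxcar_mask(height, width, cx, cy, side, k=100.0):
    """Soft rectangular mask: close to 1 inside the box, close to 0 outside."""
    x_tl, x_br = cx - side / 2.0, cx + side / 2.0
    y_tl, y_br = cy - side / 2.0, cy + side / 2.0
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            mx = logistic(x - x_tl, k) - logistic(x - x_br, k)
            my = logistic(y - y_tl, k) - logistic(y - y_br, k)
            row.append(mx * my)
        mask.append(row)
    return mask

def crop(image, mask):
    """Element-wise product of image and mask (the cropping step)."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

# A 4x4 box centered at (3.5, 3.5) on an 8x8 grid covers pixels 2..5.
mask = boxcar_mask(8, 8, cx=3.5, cy=3.5, side=4.0)
image = [[1.0] * 8 for _ in range(8)]
cropped = crop(image, mask)
print(round(mask[3][3], 6), round(mask[0][0], 6))  # 1.0 0.0
```

Because the mask is a smooth function of the center coordinates, gradients can flow back to the region parameter generator, which is what makes the region selection learnable.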
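To extract an effective feature representation from the aforementioned highly localized region, bilinear interpolation is employed to upscale the cropped region $x_j^{att}$ to the original size, resulting in the transformed image $x_j^{amp}$, i.e.,

$$x_j^{amp}(u, v) = \sum_{\alpha, \beta = 0}^{1} \left|1 - \alpha - \{u/\lambda\}\right| \left|1 - \beta - \{v/\lambda\}\right| \, x_j^{att}(m, n), \tag{4}$$

where $m = [u/\lambda] + \alpha$, $n = [v/\lambda] + \beta$, $\lambda$ is the upsampling factor, and $[\cdot]$ and $\{\cdot\}$ are the integer part and the fractional part, respectively.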
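The bilinear upsampling step can be sketched directly from the integer-part and fractional-part formulation. The sketch assumes a small patch stored as nested lists and clamps neighbour indices at the border, a detail the text does not specify; the function name is hypothetical.

```python
import math

def bilinear_upsample(patch, factor):
    """Upsample a 2-D patch by `factor` using the four nearest source pixels."""
    h, w = len(patch), len(patch[0])
    out_h, out_w = int(h * factor), int(w * factor)
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        frac_i, int_i = math.modf(i / factor)  # fractional and integer parts
        for j in range(out_w):
            frac_j, int_j = math.modf(j / factor)
            value = 0.0
            for a in (0, 1):
                for b in (0, 1):
                    # Clamp neighbour indices at the border (assumption).
                    m = min(int(int_i) + a, h - 1)
                    n = min(int(int_j) + b, w - 1)
                    weight = abs(1 - a - frac_i) * abs(1 - b - frac_j)
                    value += weight * patch[m][n]
            out[i][j] = value
    return out

patch = [[0.0, 1.0],
         [2.0, 3.0]]
up = bilinear_upsample(patch, 2)
print(up[0])      # [0.0, 0.5, 1.0, 1.0]
print(up[1][1])   # 1.5
```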
After the three upsampled regions are obtained, they are concatenated along the channel dimension, followed by local feature extraction using ConvBlock 2, which has the same structure as ConvBlock 1 except for the number of input channels. Finally, a 128-dimensional local feature representation $f_l$ is obtained. After conducting both global and local perception on the original SAR image, the global feature $f_g$ and the local feature $f_l$ are fused by a weighted combination to give the output feature $f$, where the feature weighting factor is set to 0.7 in this article. In contrast to merely leveraging the coarse-grained feature information from the original image, the output feature $f$ of the MSEN encompasses both the global and the local target features, thereby enabling comprehensive utilization of multi-scale fine-grained image information. Additionally, the selection of local image regions is learnable, allowing the network to autonomously focus on fine-grained features that are highly distinctive for target classification. As a result, the classification accuracy is further improved. After feeding the support set, the query set, and the augmented set into the MSEN, the support feature set, the query feature set, and the augmented feature set are obtained, respectively.
3.3. Meta-Contrastive Learning
The MSEN provides feature representations for SAR targets that integrate global semantics with local details. However, effectively utilizing these features requires adaptation to the needs of rapid generalization and feature discrimination in few-shot scenarios. Therefore, this section proposes a Meta-Contrastive Learning dual-branch training framework, which combines the multi-scale features extracted by MSEN with the cross-task generalization capability of meta-learning and the feature diversity optimization of contrastive learning, thus achieving further improvement in few-shot fine-grained classification performance.
The proposed MSMC adopts a dual-branch training pipeline that integrates meta-learning and contrastive learning to jointly optimize the weight parameters of MSEN. The meta-learning branch follows a prototype-based distance metric: it extracts features from support samples via MSEN, computes class prototype vectors, and classifies query samples by measuring feature-prototype distances in the embedding space. Through episodic training, its meta-classification loss enables generalizable feature mapping and cross-task knowledge transfer, supporting efficient few-shot inference. To address the limitations of meta-learning, namely insufficient cross-task feature discriminability and limited intra-class diversity with few samples, an auxiliary contrastive learning branch is introduced. Guided by the auxiliary contrastive loss, this branch leverages explicit instance-level similarity optimization and implicit data augmentation. This forces the feature space to balance task generalizability and sample discriminability, enhancing the robustness of prototype construction in few-shot scenarios. Overall, MSMC optimizes the parameters through the weighted sum of the meta-classification loss and the auxiliary contrastive loss; the two branches are described in detail in the following subsections.
3.3.1. Meta-Learning Loss Based on Distance Measurement
The meta-learning methods based on distance measurement effectively utilize limited sample information through simple induction in the metric space and exhibit excellent performance in solving the few-shot learning problem. Therefore, this paper selects the meta-classification module based on distance measurement as the classification module of the proposed method.
After feeding the image samples into the MSEN and mapping them into the embedding space, the prototype representation of each category can be obtained. Concretely, for the $k$-th class of the selected $N$ classes, a $d$-dimensional (set to 128 in this paper) vector representation in the embedding space is obtained, denoted as $c_k$, which is called the class prototype of the $k$-th class. Let $S_k$ be the set of all samples of class $k$ in the support set $S$; then, $c_k$ can be computed as follows:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i), \tag{6}$$

where $f_\theta(\cdot)$ denotes the MSEN embedding with weight parameters $\theta$.
The samples in the query set $Q$ are categorized in the embedding space based on the distance of each sample to the prototype of every class. The distance measurement function is defined as $d(\cdot, \cdot)$. Then, for a feature $f_q$ in the query feature set, the classification module generates a probability distribution over the label $y$ based on the Softmax function, as presented in (7):

$$p(y = k \mid f_q) = \frac{\exp\big(-d(f_q, c_k)\big)}{\sum_{k'=1}^{N} \exp\big(-d(f_q, c_{k'})\big)}, \tag{7}$$

where $c_k$ is the prototype of class $k$. In the above probability distribution, the class with the maximum predicted probability is taken as the predicted class of feature $f_q$. The distance metric function $d(\cdot, \cdot)$ used in this article is the Euclidean distance [45]. Then, the meta-classification loss $L_{meta}$ is defined as follows:

$$L_{meta} = -\log p(y = k \mid f_q), \tag{8}$$

where $k$ denotes the true label.
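The prototype computation and distance-based classification can be illustrated with a small sketch. It assumes the features have already been produced by the embedding network, and it averages the negative log-likelihood over the query samples of an episode, which is an assumption about how the per-sample loss is aggregated; all function names and the 2-D toy features are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def prototypes(support_feats, support_labels, n_way):
    """Class prototype = mean of the support features of that class."""
    protos = []
    for k in range(n_way):
        members = [f for f, y in zip(support_feats, support_labels) if y == k]
        dim = len(members[0])
        protos.append([sum(f[d] for f in members) / len(members)
                       for d in range(dim)])
    return protos

def class_probs(query_feat, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    logits = [-euclidean(query_feat, c) for c in protos]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def meta_loss(query_feats, query_labels, protos):
    """Mean negative log-probability of the true class (aggregation assumed)."""
    losses = [-math.log(class_probs(f, protos)[y])
              for f, y in zip(query_feats, query_labels)]
    return sum(losses) / len(losses)

# Toy 2-way 2-shot episode with 2-D features.
support_feats = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_feats, support_labels, n_way=2)
probs = class_probs([1.0, 1.0], protos)
loss = meta_loss([[1.0, 1.0], [3.0, 1.0]], [0, 1], protos)
print(protos)  # [[0.0, 1.0], [4.0, 1.0]]
```

In this toy episode the query at (1, 1) lies at distance 1 from the first prototype and 3 from the second, so it is assigned to class 0.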
3.3.2. Auxiliary Contrastive Loss Based on Similarity Measurement
Contrastive learning learns effective feature representations by comparing the similarities among samples in the absence of labels. It emphasizes the features shared by samples of the same class and accentuates the differences between samples of different classes. The proposed MSMC utilizes an auxiliary unsupervised contrastive learning module to generate a contrastive loss that assists the optimization of MSEN. Following the task generation method, i.e., N-way K-shot q-query, the total number of samples in a task is $N(K+q)$, which is then increased to $2N(K+q)$ through data augmentation. Each sample in the augmented dataset therefore has one positive sample and $2N(K+q) - 2$ negative samples.
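Given the feature set $\{z_i\}_{i=1}^{2M}$ for the samples in the augmented dataset, where $M = N(K+q)$, we define the contrastive loss as follows:

$$\ell(i, j) = -\log \frac{\exp\big(s(z_i, z_j)/\tau\big)}{\sum_{k=1,\, k \neq i}^{2M} \exp\big(s(z_i, z_k)/\tau\big)}, \tag{9}$$

where $s(\cdot, \cdot)$ denotes the negative Euclidean distance between the features of the two samples, and $\tau$ denotes the temperature coefficient, which is set to 0.5 in this article. Then, the overall auxiliary unsupervised contrastive loss $L_{con}$ is defined as follows:

$$L_{con} = \frac{1}{2M} \sum_{m=1}^{M} \big[\ell(2m-1, 2m) + \ell(2m, 2m-1)\big], \tag{10}$$

where, for all $m \in \{1, \dots, M\}$, $2m-1$ and $2m$ represent the indices of a positive sample pair in the augmented dataset.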
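A minimal sketch of a contrastive loss of this kind is shown below. It uses the negative Euclidean distance as similarity and a temperature of 0.5 as stated in the text, and assumes an NT-Xent-style normalization over all other samples in the batch. The code uses zero-based indices, so views `2m` and `2m + 1` form the positive pair; the function names and toy features are hypothetical.

```python
import math

def neg_euclidean(u, v):
    """Similarity measure: negative Euclidean distance."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_loss(feats, i, j, tau=0.5):
    """Loss for anchor i with positive j; all other samples act as negatives."""
    logits = [neg_euclidean(feats[i], feats[k]) / tau
              for k in range(len(feats)) if k != i]
    peak = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denom - neg_euclidean(feats[i], feats[j]) / tau

def contrastive_loss(feats, tau=0.5):
    """Average over both directions of every positive pair (2m, 2m + 1)."""
    n_orig = len(feats) // 2
    total = 0.0
    for m in range(n_orig):
        total += pair_loss(feats, 2 * m, 2 * m + 1, tau)
        total += pair_loss(feats, 2 * m + 1, 2 * m, tau)
    return total / (2 * n_orig)

# Views of the same sample close together: low loss.
aligned = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
# Same points, but paired with the wrong partner: high loss.
misaligned = [[0.0, 0.0], [5.0, 5.0], [0.0, 0.1], [5.0, 5.1]]
loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
print(loss_aligned < loss_misaligned)  # True
```

The comparison illustrates the intended effect: the loss is small when the two views of each sample are embedded near each other and far from the other samples.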
3.3.3. Joint Training Loss
MSMC adopts the joint training loss to optimize the network parameters of MSEN. In each training task, after acquiring the meta-classification loss and the auxiliary contrastive loss described above, the total loss $L$ of meta-contrastive joint learning is defined as the weighted sum of the two losses, with a weighting factor whose value is set to 0.8 in this article. Then, backpropagation is carried out in each epoch to update the network parameters of MSEN, as shown in (12):

$$\theta' = \theta - \eta \nabla_\theta L, \tag{12}$$

where $\theta$ and $\theta'$ represent the weight parameters of the multi-scale embedding network before and after the update, respectively, and $\eta$ represents the learning rate. When the total loss converges, the model training is considered complete.