1. Introduction
Synthetic Aperture Radar (SAR) systems have become indispensable tools for military surveillance, reconnaissance, and civilian monitoring applications because of their unique capability to provide high-resolution imagery regardless of weather conditions, cloud cover, or illumination. Unlike electro-optical (EO) sensors, SAR systems actively transmit electromagnetic signals and measure the backscattered energy, enabling consistent performance under diverse environmental conditions [1,2,3]. This all-weather capability makes SAR particularly valuable for modern defense systems and enables the rapid identification and classification of targets in complex operational environments [4,5].
The development of Automatic Target Recognition (ATR) systems for SAR imagery presents unique challenges that distinguish them from conventional computer vision tasks. SAR images exhibit “non-literal” characteristics where electromagnetic scattering properties dominate visual appearance, making human interpretation and traditional computer vision approaches less effective. The radar backscatter depends on complex interactions between electromagnetic waves and target geometry, material properties, and viewing angles. This results in imagery that can be fundamentally different from expectations in the optical domain [3].
A fundamental challenge in SAR ATR stems from the vast operating condition (OC) space that encompasses sensor parameters (frequency, polarization, resolution), target configurations (orientation, articulation), and environmental factors (weather, terrain, clutter) [6]. This multidimensional parameter space makes comprehensive data collection impractical, leading to the persistent problem of limited labeled training data for machine learning approaches [7]. Traditional datasets like MSTAR represent only a small fraction of the possible OC space, potentially limiting the generalizability of trained models [8].
To address data scarcity, researchers have extensively explored synthetic data generation using electromagnetic modeling and computational tools [7]. Sophisticated simulators such as RaySAR, CohRaS, and SARViz can generate realistic SAR imagery through ray tracing, full-wave electromagnetic modeling, and scattering-center prediction [9,10,11]. However, despite careful Computer-Aided Design (CAD) model truthing and parameter matching, persistent domain gaps between synthetic and measured imagery continue to limit ATR performance [3,12].
The SAMPLE dataset introduced by Lewis et al. represents a comprehensive effort to understand and quantify these domain gaps through carefully matched synthetic and measured SAR image pairs [3]. Their extensive experiments demonstrated that classification accuracy degrades significantly when synthetic data comprises more than 40% of the training set, even with meticulously truthed CAD models and matched collection parameters. This finding has highlighted the fundamental limitations of synthetic data approaches and motivated the exploration of alternative paradigms [3,12].
Traditional SAR ATR approaches rely heavily on supervised learning methods that require large amounts of labeled training data [13,14,15]. However, obtaining labeled SAR data presents significant challenges due to security considerations, expert annotation requirements, and the high cost of data collection campaigns. This data scarcity problem has motivated researchers to explore alternative learning paradigms, with self-supervised learning (SSL) emerging as a promising approach [15].
Self-supervised learning has emerged as a powerful paradigm for learning meaningful representations from unlabeled data across various domains. Recent advances in SSL have demonstrated remarkable success in computer vision through contrastive learning, masked autoencoding, and pretext task learning [16]. The core principle of SSL, deriving supervision signals directly from the data structure without external labels, aligns well with the rich internal structure present in SAR images [15,17]. Radar data contains inherent geometric relationships, electromagnetic scattering patterns, and coherent imaging properties that can be exploited for representation learning [16].
Despite the growing interest in SSL for SAR applications, there exists a significant gap in comprehensive evaluation frameworks that systematically assess the effectiveness of different SSL approaches across diverse downstream tasks and data availability scenarios. Most existing studies focus on specific model architectures or limited experimental settings, making it difficult to draw generalizable conclusions about the optimal deployment strategies for SSL-based SAR ATR systems [18].
This paper presents a comprehensive, improved self-supervised learning (SSL) framework specifically designed for synthetic aperture radar automatic target recognition (SAR ATR) applications. Our approach addresses the limitations of existing methods through several key innovations:
Multi-Task Pretext Learning Architecture: We develop a sophisticated SSL framework employing nine complementary pretext tasks specifically tailored to SAR imagery characteristics, including geometric transformations (rotations, flips), signal processing operations (denoising, blurring), and multi-scale analysis (zoom transformations).
Comprehensive Downstream Evaluation: We provide extensive comparison across diverse architectural families, including traditional machine learning approaches (SVM, XGBoost, Random Forest, Gradient Boosting) and modern deep learning architectures (ResNet18, U-Net, MobileNet variants, EfficientNet variants, and GAN-based classifiers) to demonstrate the versatility and effectiveness of our learned representations.
Elimination of Synthetic Data Dependency: Our framework demonstrates competitive performance using exclusively measured SAR data, directly addressing the fundamental domain gap problem that has limited previous approaches [3].
Operational Performance Analysis: We provide a comprehensive evaluation including timing analysis, false positive rate characterization, and computational efficiency metrics essential for operational deployment considerations.
Validation and Robustness: We perform a comprehensive analysis of performance across varying data availability scenarios (5% to 100% of training data), and cross-validation ensures reliable performance assessment.
The remainder of this paper is organized as follows.
Section 2 provides a comprehensive review of related work in SAR ATR and self-supervised learning.
Section 3 details our enhanced SSL framework, including architecture design, pretext tasks selection, and downstream evaluation methodology.
Section 4 presents extensive experimental results and comparative evaluation.
Section 5 discusses practical implications, limitations, and theoretical contributions as well as future research directions and broader impact considerations.
3. Materials and Methods
3.1. Problem Formulation and Framework Overview
Let D = {x_1, …, x_N} represent a dataset of N measured SAR images, where each x_i ∈ R^(H×W) is a single-channel image of spatial dimensions H × W. Our goal is to learn a feature representation function f_θ, parameterized by θ, that maps input images to 2048-length feature vectors suitable for downstream target classification. The self-supervised learning framework consists of three main components:
Pretext Learning Phase: Learn feature representations by training on multiple pretext tasks using unlabeled SAR imagery.
Feature Extraction Phase: Apply the learned encoder to extract features from training and test data.
Downstream Classification Phase: Train classifiers on the extracted features for target recognition.
Our proposed self-supervised learning-based SAR ATR architecture is depicted in Figure 1.
3.2. Dataset Description and Preprocessing
Our experiments utilize the Synthetic and Measured Paired and Labeled Experiment (SAMPLE) dataset [3], a comprehensive SAR ATR benchmark designed to facilitate research on bridging the domain gap between synthetic and real-world SAR imagery. The dataset exhibits a paired structure where each measured (real) SAR image is accompanied by a corresponding synthetic image generated from high-fidelity computer-aided design (CAD) models. Our study focuses exclusively on measured imagery to evaluate SSL performance under realistic operational conditions without synthetic data dependency. The dataset contains ten military vehicle classes with both synthetic and measured SAR images for each vehicle. Synthetic data were generated using meticulously truthed CAD models closely matched to real vehicle images from the publicly available Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset [33]. Images were collected at elevation angles between 14° and 17° and over a range of azimuth angles.
We obtained the data from the publicly available GitHub repository [34] where the dataset is hosted. The repository contains two kinds of files: MATLAB files and PNG images. The PNG images reside in two distinct directories, ‘decibel’ and ‘qpm’, whose contents are identical. Within each directory, one subfolder holds the measured images, while the other holds the synthetic counterpart of each measured image. Each image filename follows a standardized naming convention encoding parsable metadata: target class identifier, elevation angle (14°, 15°, 16°, or 17°), azimuth angle (sampled at variable intervals), and data type (synthetic/measured). This paired structure enables one-to-one correspondence between synthetic and measured images for the same target at matching viewing geometries.
Our Study’s Data Usage Rationale: We utilize only measured (real) images from the SAMPLE dataset, totaling 1345 images across 10 classes, to evaluate SSL performance under authentic operational conditions without synthetic data augmentation. This choice is motivated by three considerations: (1) Realism: measured images contain authentic sensor noise, atmospheric effects, and target variability absent in synthetic data, (2) SSL evaluation: self-supervised learning should demonstrate effectiveness on limited real data, and (3) Operational relevance: deployed SAR ATR systems process real sensor acquisitions. Following [3], we adopt an elevation-based train-test split where all images at 17° elevation (539 samples) serve as the held-out test set, while images at 14°, 15°, and 16° elevations (806 samples) constitute the training pool.
Table 1 reports the complete distribution: 1345 measured (real) SAR images across 10 military vehicle classes. We use only these measured images, excluding their synthetic pairs, to evaluate SSL under realistic operational conditions.
The preprocessing pipeline involves two key steps. First, intensity normalization is performed by scaling pixel values to a range of [0, 1] using min-max scaling, ensuring consistent intensity distributions across images. Second, spatial standardization is applied by center-cropping and resizing the images to a fixed square size, maintaining spatial consistency while preserving target details. We employ a systematic experimental protocol to ensure reproducible and reliable results. The experimental framework evaluates SSL approaches using the SAMPLE dataset with both held-out test data and rigorous cross-validation protocols to assess performance reliability and generalizability.
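The two preprocessing steps can be sketched as follows. The output size (64 × 64 here) and the nearest-neighbor resize are illustrative assumptions, since the exact resize target and interpolation method are not restated above:

```python
import numpy as np

def preprocess(img, out_size=64):
    """Min-max normalize to [0, 1], then center-crop to a square and
    nearest-neighbor resize to out_size x out_size (sizes are assumptions)."""
    img = img.astype(float)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    h, w = img.shape
    side = min(h, w)                               # square center crop
    top, left = (h - side) // 2, (w - side) // 2
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size   # nearest-neighbor index map
    return crop[np.ix_(idx, idx)]
```

Applying the same function to every image guarantees consistent intensity ranges and spatial dimensions across the dataset.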
Data Leakage Prevention Protocol: To ensure rigorous evaluation and prevent data leakage between training and testing phases, we implement a strict elevation-based data separation strategy. Following the SAMPLE dataset protocol [3], all images at 17° elevation (539 test samples across 10 classes) are held out from both the SSL pretext training and downstream classifier training phases. Only images at 14°, 15°, and 16° elevation angles (806 training samples) are available for pretext task learning and downstream training. This elevation-based split ensures that test data remains completely unseen throughout the entire training pipeline, preventing any form of data leakage. Additionally, during downstream classifier training, we employ stratified train-validation splitting where 15% of the training data is reserved for validation, with class proportions preserved to maintain balanced representation. This approach ensures that validation performance accurately reflects generalization capability across all target classes, providing reliable early stopping criteria without introducing bias toward majority classes. This rigorous protocol, combining a held-out test elevation, proportionate validation, and consistent data partitioning across all k-value experiments, substantiates that reported performance metrics represent genuine generalization to unseen data rather than memorization of training examples.
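The stratified 15% validation split described above can be sketched in pure Python (scikit-learn's `train_test_split` with `stratify` would be the usual tool; this stand-alone version makes the per-class logic explicit, and the seed handling is an assumption):

```python
import random
from collections import defaultdict

def stratified_val_split(labels, val_frac=0.15, seed=0):
    """Reserve val_frac of each class's indices for validation, the rest for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, val = [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)
        n_val = max(1, round(val_frac * len(idxs)))  # at least one sample per class
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return sorted(train), sorted(val)
```

Because the split is performed class by class, every target class keeps (approximately) the same proportion in the validation set as in the training pool.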
3.3. Multi-Task Pretext Learning Framework
3.3.1. Two-Stage Hierarchical SSL Pipeline with Multi-Task Pretext Learning
Our framework implements a two-stage hierarchical SSL pipeline. Stage 1 (Pretext Training) learns general-purpose representations from unlabeled data through pretext tasks. Stage 2 (Downstream Evaluation) transfers these learned features to target classification by freezing the pretrained encoder and training only the downstream classifier. This hierarchical separation ensures that pretext task learning captures domain-invariant features without overfitting to specific downstream labels.
Within the first stage (pretext training), we employ multi-task learning where a single shared encoder is trained simultaneously on nine pretext tasks. Our multi-task pretext objective jointly optimizes all nine tasks through a unified loss function, which enables the network to learn a unified representation space that captures multiple complementary aspects of SAR imagery simultaneously. In the second stage (downstream evaluation), we freeze the pretrained encoder and extract 2048-length feature vectors for all training samples. These extracted features are then used to train various downstream classifiers for the target classification task.
We adopt multi-task learning within the pretext stage (rather than training nine separate single-task models) for three key advantages: (1) Training efficiency: training all nine pretext tasks simultaneously requires only 2–3 min on the SAMPLE dataset, (2) Feature complementarity: ablation studies confirm that multi-task synergy outperforms the best single-task performance through simultaneous gradient contributions from all tasks, and (3) Data efficiency: each of the 806 images provides nine training signals (7254 effective examples), compensating for limited labeled data with minimal computational overhead.
3.3.2. Pretext Tasks Design and Theoretical Justification
We propose a comprehensive self-supervised learning framework comprising nine methodically constructed pretext tasks, each designed to exploit distinct structural characteristics inherent to synthetic aperture radar (SAR) imagery. Our approach is grounded in rigorous theoretical principles derived from established radar phenomenology and signal processing theory. The foundation of our methodology rests upon three key observations from the radar remote sensing literature. First, electromagnetic scattering mechanisms in SAR targets demonstrate invariance properties under specific geometric and radiometric transformations [35,36], providing a theoretical basis for transformation-based self-supervision. Second, operational SAR systems are inherently subject to multiplicative speckle noise, atmospheric propagation effects, and system-induced artifacts [37], which require robust feature representations that can model these degradation processes. Third, SAR backscatter signatures exhibit multi-scale spatial dependencies arising from the complex interplay between target geometry and radar viewing parameters [38,39], motivating the development of hierarchical feature extraction mechanisms. Building upon these theoretical foundations, our pretext task design systematically addresses each aspect of SAR data complexity while maintaining computational scalability for large-scale applications. The pretext tasks can be classified into these broad categories:
Original Image Prediction (T0): Classification of unmodified images serves as a baseline task and helps maintain original SAR signature characteristics.
Geometric Invariance Tasks (T1–T5): Based on the principle that SAR target signatures should be recognizable regardless of platform orientation or viewing geometry, these tasks force the network to learn rotation- and reflection-invariant features essential for multi-aspect target recognition. These tasks include:
90° Rotation (T1): Rotates the image counterclockwise by 90°, leveraging the viewpoint-invariant nature of SAR targets.
180° Rotation (T2): Rotation by 180°, helping to learn orientation-invariant features.
270° Rotation (T3): Counterclockwise rotation by 270° (equivalent to 90° clockwise), supporting rotational feature learning for multi-aspect recognition.
Horizontal Flip (T4): Flips the image horizontally, creating a mirrored version to learn geometric invariances.
Vertical Flip (T5): Flips the image vertically, producing an upside-down version, enhancing robustness to target orientation changes.
Signal Quality Robustness Tasks (T6–T7): Motivated by operational requirements where SAR imagery may be degraded by atmospheric effects, sensor noise, or processing artifacts, these tasks ensure the learned representations are robust to signal quality variations commonly encountered in operational scenarios. These transformations include:
Denoising (T6): The network learns to identify images processed with Gaussian smoothing (implemented as a “denoising” task). This applies a 3 × 3 Gaussian kernel with fixed standard deviation σ = 0.5, which reduces noise and high-frequency details in SAR imagery:
T6(x) = x * G_{3×3, σ},
where * denotes 2D convolution and G_{3×3, σ} is a 3 × 3 Gaussian kernel with standard deviation σ = 0.5.
Blur Prediction (T7): The network applies Gaussian blur transformations with a 5 × 5 kernel, simulating atmospheric effects and resolution variations. The standard deviation σ is drawn uniformly from [0.5, 1.0] to create varying levels of blur:
T7(x) = x * G_{5×5, σ}, σ ∼ U(0.5, 1.0),
where G_{5×5, σ} is a 5 × 5 Gaussian kernel with standard deviation σ.
Multi-Scale Analysis Task (T8): To address the multi-resolution nature of SAR phenomenology, where target features manifest at different spatial scales depending on sensor parameters and target geometry, we implement a zoom-in transformation. The network predicts zoom-in transformations that upscale the image by a factor between 1.2 and 1.5 and then crop the center region back to the original size, creating a zoom effect. This corresponds to 20% to 50% zoom levels, enabling multi-scale feature learning important for variable-resolution scenarios:
T8(x) = CenterCrop_{H×W}(Resize(x, s)), s ∼ U(1.2, 1.5),
where s is the scaling factor.
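The nine transformations T0–T8 can be sketched in NumPy as below. The kernel construction, edge padding mode, and nearest-neighbor resize are implementation assumptions, not details taken from the paper:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian kernel (separable outer product)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    k = np.outer(g, g)
    return k / k.sum()

def convolve2d_same(img, kernel):
    """Naive same-size 2D convolution with edge padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def zoom_in(img, s):
    """Zoom by factor s: central crop of size H/s, resized back (nearest-neighbor)."""
    h, w = img.shape
    ch, cw = int(round(h / s)), int(round(w / s))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    return crop[np.ix_(np.arange(h) * ch // h, np.arange(w) * cw // w)]

def pretext_views(img, rng):
    """Return the nine (transformed_image, task_label) pairs for T0..T8."""
    return [
        (img, 0),                                                        # T0: original
        (np.rot90(img, 1), 1),                                           # T1: 90° CCW
        (np.rot90(img, 2), 2),                                           # T2: 180°
        (np.rot90(img, 3), 3),                                           # T3: 270° CCW
        (np.fliplr(img), 4),                                             # T4: horizontal flip
        (np.flipud(img), 5),                                             # T5: vertical flip
        (convolve2d_same(img, gaussian_kernel(3, 0.5)), 6),              # T6: denoise
        (convolve2d_same(img, gaussian_kernel(5, rng.uniform(0.5, 1.0))), 7),  # T7: blur
        (zoom_in(img, rng.uniform(1.2, 1.5)), 8),                        # T8: zoom
    ]
```

Each image thus yields nine labeled examples for the pretext classifier, consistent with the effective-sample-count argument above.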
Figure 2 illustrates the complete set of pretext transformations applied to representative samples from each target class in our dataset.
3.3.3. Pretext Network Architecture
Our CNN architecture is specifically designed to capture the characteristics of SAR imagery, following the implementation of Lewis et al. [3] to ensure fair comparison and reproducibility. The network employs progressive channel expansion (16 → 32 → 64 → 128) to enable feature learning. Spatial resolution is systematically reduced using max-pooling operations that preserve essential spatial relationships while lowering computational complexity. The pretext classifier consists of a fully connected head (2048 → 1000 → 500 → 250 → N) with ReLU activations, where N denotes the number of pretext transformation classes. All convolutional layers use square kernels with same-padding to preserve spatial dimensions prior to pooling. Pretext training is formulated as a single-label classification problem over the N transformation classes and is optimized using the categorical cross-entropy loss
L_pretext = −Σ_{c=1}^{N} y_c log(p_c),
where p represents the predicted probability distribution over the transformation classes and y is the corresponding one-hot ground-truth label. Optimization is performed using the Adam optimizer with early stopping on validation loss to mitigate overfitting and encourage generalizable representations for downstream tasks. The architecture is shown in Table 2.
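As a consistency check on the reported dimensions: with four same-padded convolution blocks each followed by 2 × 2 max-pooling, the flattened encoder output is (last channel count) × (H/16) × (W/16). The input resolution is not restated above; a 64 × 64 input is one size consistent with the 2048-length feature vector, since 128 × 4 × 4 = 2048:

```python
def flattened_feature_length(input_hw, channels=(16, 32, 64, 128)):
    """Feature length after the conv stack: spatial size halves at each of the
    four pooling stages, then the last feature map is flattened."""
    h = w = input_hw
    for _ in channels:   # one 2x2 max-pool per conv block
        h //= 2
        w //= 2
    return channels[-1] * h * w
```

For example, `flattened_feature_length(64)` returns 2048, matching the first layer of the fully connected head.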
Computational Resources: Pretext training requires approximately 2–3 min on a single GPU (NVIDIA GeForce RTX 3090, Santa Clara, CA, USA) for the complete SAMPLE dataset. Feature extraction for all downstream classifiers adds an average of 15.5 ms per image. The framework is designed for computational efficiency, making it practical for research and operational deployment scenarios.
3.4. SimCLR Baseline Implementation
To establish a rigorous comparison with established SSL methods, we implement SimCLR (Simple Framework for Contrastive Learning of Visual Representations) [30] as a strong baseline, following the original paper’s methodology as closely as possible while making necessary adaptations for SAR imagery and our experimental conditions. SimCLR represents a fundamentally different SSL paradigm from our task-based approach: while our method employs explicit pretext tasks, SimCLR learns representations through instance discrimination, where the model learns to distinguish between different images while recognizing augmented views of the same image as equivalent.
Our implementation faithfully replicates the core architectural and algorithmic components of the original SimCLR paper [30]. We preserve the fundamental two-phase training protocol (contrastive pretraining followed by linear evaluation), the exact projection head architecture with batch normalization, the NT-Xent loss formulation with cosine similarity, and the linear evaluation protocol for representation quality assessment. We employ the LARS optimizer with learning-rate scaling (lr = 0.3 × BatchSize/256), linear warmup for the first 10 epochs, and cosine-decay learning-rate scheduling. The temperature parameter τ, weight decay (10⁻⁶), momentum (0.9), and LARS trust coefficient (0.001) are set exactly as recommended in the original work. This faithful implementation ensures our comparison reflects the true performance characteristics of SimCLR rather than implementation-specific variations.
The SimCLR framework consists of two main phases. During contrastive pretraining, each SAR image x undergoes random augmentation to generate two correlated views (x̃_i, x̃_j). For each image, we randomly select two different augmentations from our SAR-adapted transformation set (T0–T8). Each view applies exactly one randomly selected transformation, with the constraint that the two views use different transformations to ensure meaningful view diversity while preserving target-specific SAR signatures.
Both augmented views are processed through a shared ResNet-18 encoder to extract 512-dimensional feature representations. Following the original SimCLR protocol exactly, we employ a two-layer MLP projection head with the architecture specified in the original paper: Linear (512 → 512) → BatchNorm → ReLU → Linear (512 → 128). This produces 128-dimensional normalized embeddings for contrastive learning. The projection head, including batch normalization between layers as specified in the original work, is crucial for effective contrastive learning but is discarded during downstream evaluation.
The contrastive loss encourages agreement between positive pairs (augmented views of the same image) while maximizing disagreement with negative pairs (views from different images). For a batch of N images producing 2N augmented views, we apply the normalized temperature-scaled cross-entropy (NT-Xent) loss exactly as defined in the original paper. For a positive pair (i, j):
ℓ_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k ≠ i]} exp(sim(z_i, z_k)/τ) ],
where sim(·, ·) denotes cosine similarity between L2-normalized embeddings, τ is the temperature parameter (original paper’s recommended value), and the indicator 1_{[k ≠ i]} excludes self-comparisons. The final loss averages over all 2N positive pairs in the batch, precisely following the original implementation.
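For concreteness, the NT-Xent loss can be sketched in NumPy as below; the pair ordering (rows 2k and 2k + 1 form a positive pair) and the τ value are illustrative assumptions:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over 2N embeddings z (shape (2N, d)), ordered so that rows
    2k and 2k+1 form a positive pair; tau is the temperature."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize -> cosine similarity
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)                     # 1_[k != i]: drop self-comparisons
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    return -log_prob[idx, idx ^ 1].mean()              # partner of row 2k is row 2k+1
```

Setting the diagonal to −∞ before the softmax reproduces the indicator term in the equation above, since exp(−∞) = 0.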
We replicate SimCLR’s training protocol with dataset-appropriate adaptations. Following the original paper’s optimizer recommendations, we use LARS (Layer-wise Adaptive Rate Scaling) with base learning rate 0.3, momentum 0.9, weight decay 10⁻⁶, and trust coefficient 0.001. The learning rate is scaled linearly with batch size (lr = 0.3 × BatchSize/256), maintaining the original paper’s linear scaling rule. We implement the original paper’s learning-rate schedule: linear warmup for the first 10 epochs, followed by cosine decay to zero over the remaining epochs. The original SimCLR employs batch sizes of 256–8192 to provide sufficient negative samples; for our dataset constraints, we use an adaptive batch-sizing strategy that chooses the batch size to guarantee at least two batches per epoch while maximizing negative-sample diversity. This adaptation is necessary for low k-values where training samples are limited, but preserves the core contrastive learning principle. We train for 200 epochs, consistent with one of the pretraining durations explored in the original SimCLR paper.
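The warmup-plus-cosine schedule combined with the linear scaling rule can be written as follows; the per-epoch granularity and zero-based epoch indexing are assumptions:

```python
import math

def simclr_lr(epoch, total_epochs=200, warmup_epochs=10,
              base_lr=0.3, batch_size=256):
    """Linear warmup to the scaled peak LR, then cosine decay toward zero."""
    peak = base_lr * batch_size / 256          # linear scaling rule: 0.3 x BatchSize/256
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * 0.5 * (1.0 + math.cos(math.pi * t))
```

The schedule reaches its peak at the end of warmup and decays smoothly to (near) zero by the final epoch.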
The second phase, linear evaluation, follows the standard protocol established by the original SimCLR paper. We freeze the trained ResNet-18 encoder and train only a single linear layer (512-length input to 10 target classes) using supervised labels and cross-entropy loss. This evaluation protocol, widely adopted in the SSL literature, tests whether the encoder has learned features that are linearly separable for the downstream task, providing a fair comparison between SSL methods. Following the original paper’s specifications, we train the linear classifier for 100 epochs using SGD with learning rate 0.1, momentum 0.9, weight decay 0.0, and early stopping (patience = 10 epochs) on validation loss.
We modified ResNet-18’s initial convolutional layer to accept grayscale input (1 channel instead of 3) to accommodate the single-channel SAR data. Our adaptive batch sizing ensures stable contrastive learning even with limited training data at low k-values, preventing training failures while maintaining meaningful negative sampling.
3.5. Downstream Classification Approaches
To evaluate the quality of learned representations from our SSL framework, we conduct comprehensive downstream classification experiments using both traditional machine learning and deep learning approaches. All classifiers operate on feature vectors of length 2048 extracted from the pre-trained SSL encoder network.
3.5.1. Traditional Machine Learning Classifiers
We implement four well-established machine learning algorithms, each configured following standard practices and widely used default hyperparameter settings for high-dimensional feature classification.
Support Vector Machine (SVM): We employ an SVM with linear kernel optimized for high-dimensional features commonly encountered in deep learning representations. The implementation utilizes scikit-learn’s SVC with probability estimation enabled for comprehensive performance analysis.
XGBoost: Our XGBoost configuration implements an ensemble of 100 decision trees. The classifier is configured for multi-class soft probability prediction with 10 output classes and automatic label encoding disabled.
Random Forest: The Random Forest classifier employs 100 estimators. Bootstrap sampling and other ensemble parameters maintain default scikit-learn configurations.
Gradient Boosting: The Gradient Boosting implementation utilizes 100 estimators. The implementation follows scikit-learn’s gradient boosting approach with learning rate, maximum tree depth, and other regularization parameters configured according to the hyperparameter optimization results to prevent overfitting while maintaining classification performance.
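The four configurations above can be summarized as keyword-argument dictionaries. This is a configuration sketch: the keys mirror scikit-learn and XGBoost constructor arguments, and any parameter not mentioned in the text is left at its library default:

```python
# Hypothetical hyperparameter summary for the traditional classifiers.
# Values follow the text (linear-kernel SVM with probability estimates;
# 100-estimator ensembles); everything else stays at library defaults.
CLASSIFIER_CONFIGS = {
    "svm": {"kernel": "linear", "probability": True},
    "xgboost": {"n_estimators": 100, "objective": "multi:softprob",
                "num_class": 10, "use_label_encoder": False},
    "random_forest": {"n_estimators": 100},
    "gradient_boosting": {"n_estimators": 100},
}
```

Each dictionary would be unpacked into the corresponding constructor, e.g. `SVC(**CLASSIFIER_CONFIGS["svm"])`, and the models fit on the 2048-length SSL features.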
3.5.2. Deep Learning Architectures
We evaluate several state-of-the-art deep neural network architectures, each adapted for SAR imagery classification.
ResNet18: We implement a modified ResNet18 architecture that accepts both raw images and feature-vector inputs. The implementation includes adaptive input processing that reshapes 2048-length feature vectors into spatial representations suitable for convolutional processing.
U-Net: We adapt the U-Net encoder-decoder architecture for classification tasks by incorporating global average pooling in the decoder bottleneck. The encoder extracts hierarchical features, while skip connections preserve fine-grained spatial information, making this architecture particularly suitable for SAR imagery, where spatial context is crucial for target recognition.
MobileNet variants (v1 with different width multipliers: 1.0, 0.75, 0.5, 0.25): We implement the MobileNet v1 architecture with different width multipliers to explore the trade-offs between computational efficiency and classification performance. The architecture employs depthwise separable convolutions to reduce computational cost while maintaining feature extraction capability.
EfficientNet variants (B0, B1, B2, B3): Our EfficientNet implementation leverages compound scaling to systematically balance network depth, width, and resolution. We evaluated EfficientNet-B0 through B3 variants. The balanced scaling approach of EfficientNet makes it particularly suitable for SAR applications where computational efficiency is crucial while maintaining high accuracy requirements.
GAN Classifier: We implement a discriminator-based classification approach that leverages generative adversarial network (GAN) principles for target recognition. The GAN classifier consists of a discriminator network trained to distinguish between different target classes rather than real versus synthetic data. This approach utilizes the representational power of adversarial training to learn robust feature discriminations.
CNN: Additionally, we employ the same CNN architecture used to train the pretext features, adapted to process one-dimensional feature vectors while keeping the architecture parameters unchanged: for downstream classification, the 2D convolutions and filters are replaced with their 1D counterparts.
Table 3 provides the hyperparameter specifications used to implement the downstream classifiers.
3.6. Experimental Design and Evaluation Methodology
3.6.1. Evaluation Metrics
Accuracy: (TP + TN) / (TP + TN + FP + FN), representing the overall percentage of correctly classified samples and providing a general measure of classifier effectiveness.
Precision: TP / (TP + FP), macro-averaged across all classes to account for potential class imbalance effects.
Recall: TP / (TP + FN), macro-averaged to ensure equal consideration of all target classes regardless of frequency.
F1-Score: 2 · (Precision · Recall) / (Precision + Recall), representing the harmonic mean of precision and recall to provide a balanced performance measure.
Area Under the Curve (AUC): the area under the ROC curve, ∫₀¹ TPR(FPR) dFPR, providing a threshold-independent measure of classifier discriminative ability across all possible decision boundaries.
True Positive Rate at Fixed False Positive Rates: TPR evaluated at strategically chosen FPR thresholds of 1%, 5%, 10%, 15%, and 20% to assess performance under varying operational constraints typical in SAR target recognition applications.
Pretext Training Time: Total wall-clock time required for SSL pretext tasks training, measured across multiple epochs until convergence.
Downstream Classifier Training Time: Training duration for each downstream classifier using extracted feature representations.
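The TPR-at-fixed-FPR metric can be computed directly from classifier scores without committing to a threshold in advance. A NumPy sketch for the binary (one-vs-rest) case, with tie handling left as an assumption:

```python
import numpy as np

def tpr_at_fpr(y_true, scores, fpr_target):
    """Highest TPR achievable by any score threshold whose FPR <= fpr_target
    (binary one-vs-rest labels: 1 = target class, 0 = other)."""
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]  # sort by descending score
    tpr = np.cumsum(y == 1) / max((y == 1).sum(), 1)
    fpr = np.cumsum(y == 0) / max((y == 0).sum(), 1)
    feasible = fpr <= fpr_target
    return float(tpr[feasible].max()) if feasible.any() else 0.0
```

Evaluating this function at fpr_target values of 0.01, 0.05, 0.10, 0.15, and 0.20 yields the operating points listed above.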
3.6.2. Data Availability Scenarios and k-Value Selection
Our experimental framework evaluates performance across varying fractions of measured training data, denoted as k ranging from 0.05 to 1.00, where k represents the percentage of available measured (real) training data used for both pretext and downstream classifier training. The k parameter constitutes a critical experimental design element that enables systematic assessment of SSL framework effectiveness under different data availability constraints commonly encountered in operational SAR ATR scenarios. The significance of k-value analysis extends beyond academic evaluation, providing essential guidance for operational deployment decisions where data acquisition costs, labeling requirements, and rapid deployment constraints directly impact system viability.
While our framework supports comprehensive evaluation across all
k-values for traditional machine learning and most deep learning architectures, generative adversarial network (GAN) based classifiers require special consideration at extremely low
k-values. At
k = 0.05 (5% of training data), the available sample size (43 training samples from the SAMPLE dataset) falls below the minimum requirements for stable GAN training. With so few samples, GANs exhibit several critical failure modes: numerical instability due to insufficient batch statistics for BatchNorm layers, gradient instability leading to mode collapse and vanishing gradients, and outright training convergence failure. These technical limitations align with established findings in the GAN literature [
40,
41], where minimum dataset requirements are necessary for stable training. Consequently, our evaluation presents GAN-based classifier results only for k ≥ 0.10 (at least 85 training samples).
The
k-value methodology provides valuable insights into the relationship between data availability and model performance across diverse architectural families. This systematic approach enables identification of operational thresholds where specific architectures become viable, quantification of performance degradation under data constraints, and optimization of resource allocation for data collection efforts. For operational SAR ATR systems,
k-value analysis directly informs critical deployment decisions, including minimum data collection requirements, expected performance under constrained scenarios, and cost-benefit analysis of additional data acquisition investments. The distribution of data for each
k (fraction of real data used) is shown in
Table 4.
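The k-fraction sampling described above can be sketched as a class-stratified draw; the loader and class counts here are hypothetical stand-ins, not the paper's SAMPLE pipeline:

```python
# Class-stratified selection of a fraction k of the measured training set.
import numpy as np

def stratified_fraction(labels, k, seed=0):
    """Return sorted indices covering a fraction k of each class."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(k * len(idx))))  # keep at least one sample per class
        keep.extend(rng.choice(idx, size=n, replace=False))
    return np.sort(np.array(keep))

labels = np.repeat(np.arange(10), 80)  # illustrative: 10 classes, 80 samples each
sub = stratified_fraction(labels, k=0.05)
```

Stratifying per class keeps the class balance of the subsample consistent across k-values, so accuracy differences reflect data volume rather than shifting class priors.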
3.6.3. Cross-Validation Protocol
To ensure robust and generalizable results, we also implement a rigorous cross-validation protocol. The cross-validation experiments validate our findings using independent data splits and provide confidence intervals for performance estimates.
Addressing the inherent data scarcity challenges in SAR ATR applications, our cross-validation methodology employs a conservative approach with reduced data usage to better reflect real-world operational constraints. We implement a 5-fold stratified cross-validation protocol using exclusively real SAR imagery, selecting the first 92 images per class lexicographically to ensure reproducible dataset composition and eliminate potential selection bias. This results in a total dataset of 920 images (92 images × 10 classes).
The cross-validation architecture differs fundamentally from our k-value experimental framework in several key aspects: (1) Data Volume: Cross-validation uses 920 real images compared to the full dataset of 1345 images (the original train and test sets combined), representing evaluation under data-scarce conditions; (2) Pretext Training: All 920 images are used for self-supervised pretext training without fold splits, ensuring the SSL feature extractor learns from the complete available dataset; (3) Downstream Evaluation: The same 920 images are subjected to 5-fold cross-validation where each fold uses approximately 736 images for training and 184 for testing, with 25% of training data reserved for validation; (4) Statistical Rigor: This protocol provides confidence intervals and standard deviations across folds, offering more reliable performance estimates than single train-test splits used in the main k-value experiments.
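The fold arithmetic above (736 train / 184 test per fold, 25% of training data held out for validation) can be sketched with scikit-learn; the label array is a synthetic stand-in for the 920-image dataset:

```python
# 5-fold stratified protocol: per fold, 736 train / 184 test,
# with 25% of the training portion reserved for validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

y = np.repeat(np.arange(10), 92)  # 920 labels, 92 per class
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in skf.split(np.zeros_like(y), y):
    tr, val = train_test_split(train_idx, test_size=0.25,
                               stratify=y[train_idx], random_state=0)
    fold_sizes.append((len(tr), len(val), len(test_idx)))
```

Stratification keeps all ten classes proportionally represented in every train, validation, and test partition.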
4. Experimental Results and Analysis
4.1. Overall Performance Analysis
Our comprehensive experimental evaluation encompasses 16 diverse downstream classifier architectures systematically evaluated across four primary computational paradigms: traditional machine learning (SVM, Random Forest, XGBoost, Gradient Boosting), convolutional neural networks (CNN, ResNet18, U-Net), efficient architectures (MobileNet variants with width multipliers 1.0, 0.75, 0.5, 0.25; EfficientNet variants B0–B3), and generative adversarial networks (GAN classifier). This systematic evaluation across values of k (from 0.05 to 1.0) provides comprehensive insights into SSL framework performance under diverse data availability constraints, generating 320 experimental configurations that thoroughly validate our approach across operational scenarios typical of real-world SAR ATR deployments.
The experimental results reveal striking performance patterns that fundamentally challenge conventional assumptions about deep learning superiority in computer vision tasks. Traditional machine learning approaches consistently demonstrate exceptional performance, with SVM achieving a remarkable 99.63% accuracy, 99.63% precision, and 99.66% recall at full training data availability (k = 1.0), accompanied by near-perfect ROC AUC scores of 1.0000 for all target classes. Random Forest maintains robust second-best performance with 99.26% accuracy, 99.30% precision, and 99.22% recall, achieving ROC AUC values between 0.9990 and 1.0000 across all target classes. XGBoost delivers competitive 93.88% accuracy with consistently high per-class discrimination above 0.9930. These results establish traditional ML methods as the optimal choice for SSL-based SAR ATR applications, contradicting the common bias toward deep neural architectures in computer vision research.
Among deep learning architectures, ResNet18 emerges as the top performer with 97.40% accuracy at k = 1.0, achieving exceptional per-class ROC AUC values between 0.9960 and 1.0000 across classes, demonstrating robust discrimination across all target types. CNN provides competitive baseline performance at 94.25% accuracy, while U-Net achieves 84.04% accuracy despite higher computational requirements. The efficient architecture family demonstrates compelling performance-efficiency trade-offs, with EfficientNet B3 reaching peak efficient architecture performance at 95.92% accuracy and per-class ROC AUC values between 0.9733 and 1.0000, EfficientNet B2 achieving 89.42%, EfficientNet B1 attaining 95.18%, and EfficientNet B0 delivering baseline 92.95% accuracy.
MobileNet variants showcase scalability across computational constraints with remarkable performance. Standard MobileNet achieves an exceptional 98.70% accuracy with per-class ROC AUC values ranging from 0.9995 to 1.0000. MobileNet 0.75 reaches 97.77%, MobileNet 0.5 attains 95.73%, and MobileNet 0.25 delivers 92.21% accuracy. GAN-based classifiers demonstrate promising generative learning capabilities with 96.85% accuracy and consistently high per-class discrimination above 0.9785, validating the framework’s applicability to adversarial learning paradigms while maintaining competitive performance across diverse target classes.
4.2. Data Availability Impact and k-Value Analysis
The k-value experimental design provides critical insights into SSL framework robustness under varying data constraints, systematically evaluating performance degradation and recovery patterns across data availability scenarios from extreme scarcity (k = 0.05, 5% of training data with only 43 total samples) to full availability (k = 1.0, 100% with 806 samples). Our systematic evaluation reveals dramatic performance variations that directly inform operational deployment strategies across diverse resource-constrained environments typical of real-world SAR ATR applications.
At extreme data scarcity (k = 0.05), traditional machine learning approaches demonstrate remarkable resilience that far exceeds random chance performance. SVM achieves 51.95% accuracy under these severe constraints with per-class ROC AUC values ranging from 0.7015 to 0.9970, while Random Forest maintains 46.01% accuracy with consistent discrimination above 0.5385 across all target classes. These performance levels remain operationally relevant for preliminary target screening applications, demonstrating the framework’s capability to extract meaningful features even from minimal training data. In contrast, deep learning architectures exhibit significant performance degradation, with ResNet18 achieving only 24.86% accuracy and CNN reaching 27.27% accuracy at k = 0.05.
The performance trajectory from
k = 0.05 to
k = 1.0 reveals distinct architectural families’ sensitivity to data availability. Traditional ML methods exhibit graceful degradation under data constraints while maintaining consistent relative performance across all
k-values, with SVM improving from 51.95% to 99.63% accuracy and Random Forest advancing from 46.01% to 99.26%. Deep learning architectures show higher variance at low
k-values but demonstrate rapid improvement as data availability increases: ResNet18 advancing from 24.86% accuracy at
k = 0.05 to 97.40% at
k = 1.0, and CNN improving from 27.27% to 94.25%. EfficientNet variants display particularly interesting scaling behavior, with B3 improving from 15.96% at
k = 0.05 to 95.92% at
k = 1.0, while B0 advances from 19.48% to 92.95%, indicating architectural depth’s relationship to data requirements. The accuracy on the full test set in varying training data availability scenarios is shown in
Figure 3, while the accuracy curves with different
k values for top downstream classifiers are depicted in
Figure 4.
4.3. Computational Efficiency and Operational Metrics
The timing analysis reveals highly favorable computational characteristics essential for operational deployment across diverse hardware configurations and real-time processing requirements. SSL feature extraction from the trained pretext model incurs a computational cost of 15.35 ms per image. Inference times with the extracted features vary dramatically across architectural families, enabling precise performance-efficiency optimization. Traditional ML methods achieve exceptional computational efficiency: Gradient Boosting requires only 0.01 ms per image, XGBoost needs 0.02 ms, Random Forest demands 0.05 ms, and SVM requires 0.08 ms per image despite its superior accuracy. Deep learning architectures require 0.05–0.37 ms per image, with ResNet18 achieving a competitive 0.13 ms per image, CNN requiring 0.05–0.06 ms per image, and U-Net demanding 0.35–0.37 ms per image due to its complex encoder-decoder architecture.
Efficient architectures maintain competitive timing profiles optimized for resource-constrained deployments. EfficientNet variants require 0.16–0.28 ms per image across B0–B3 configurations, with B0 achieving 0.16 ms per image and B3 reaching 0.27–0.28 ms per image, demonstrating excellent scalability. MobileNet variants need 0.05–0.11 ms per image depending on the width multiplier, with the 0.25 width multiplier variant achieving a fast 0.09 ms per image while maintaining 92.21% accuracy. GAN classifiers provide balanced performance at 0.08–0.09 ms per image, making them suitable for specialized applications requiring generative model capabilities.
Total processing times range from 15.36 ms (Gradient Boosting: 15.35 ms feature extraction and 0.01 ms inference) to 15.72 ms (U-Net: 15.35 ms feature extraction and 0.37 ms inference), enabling real-time processing capabilities for operational SAR ATR systems. The minimal variance in total processing time (±0.36 ms) across architectural families confirms that feature extraction dominates computational cost, making downstream classifier selection primarily a performance consideration rather than a computational constraint. This characteristic enables deployment flexibility where high-accuracy classifiers like SVM (99.63% accuracy, 15.43 ms total) can be deployed when computational resources permit, while fast alternatives like Gradient Boosting (92.76% accuracy, 15.36 ms total) provide viable options for extreme real-time constraints or battery-powered platforms.
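Per-image latencies of this kind are typically measured as batch wall-clock time divided by batch size, averaged over repeats; a minimal sketch (the classifier here is a stand-in, not the paper's SVM or timing harness):

```python
# Per-image inference latency: time a batch, divide by batch size,
# average over several repeats to smooth scheduler noise.
import time
import numpy as np

def per_image_ms(predict_fn, features, repeats=5):
    """Average per-image latency of predict_fn in milliseconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        predict_fn(features)
        times.append((time.perf_counter() - t0) / len(features) * 1e3)
    return float(np.mean(times))

feats = np.random.default_rng(0).normal(size=(256, 2048))  # mock SSL features
latency = per_image_ms(lambda X: X.sum(axis=1), feats)     # mock classifier
```

Amortizing over a batch is what makes sub-0.1 ms per-image figures meaningful for classifiers whose fixed call overhead would otherwise dominate.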
4.4. Per-Class Performance Analysis
Class-wise ROC AUC analysis reveals exceptional discrimination capability across all target types, with particularly strong performance for specific vehicle classes that demonstrate distinct feature characteristics in the SSL representation space. The per-class analysis provides critical insights into target-specific recognition capabilities and potential operational deployment considerations for diverse mission requirements.
SVM achieves near-perfect per-class discrimination with ROC AUC values demonstrating exceptional consistency. This performance demonstrates the SSL framework’s ability to learn discriminative features for diverse target types, from tracked tanks to wheeled vehicles. Random Forest maintains robust per-class performance with ROC AUC values ranging from 0.9990 to 1.0000 across all target classes. The minimal performance variation indicates consistent feature quality across vehicle types. XGBoost demonstrates reliable per-class discrimination with ROC AUC values consistently above 0.9941. ResNet18 exhibits competitive per-class discrimination with ROC AUC values between 0.9974 and 1.0000, achieving perfect discrimination for Classes 5 (M35) and 6 (M548).
Class-specific insights reveal interesting patterns in target discrimination. Class 6 (M548 tracked carrier) exhibits perfect discrimination (ROC AUC = 1.0000) across all top-performing classifiers (SVM, Random Forest, XGBoost), indicating highly distinctive SSL-learned features that separate this tracked utility vehicle from combat platforms. Classes 3 (M1 tank) and 5 (M35 truck) consistently achieve perfect or near-perfect discrimination, suggesting that main battle tanks and wheeled logistics vehicles possess easily distinguishable radar signatures in the learned feature space. The minimal interclass performance variation (typically within 0.001–0.002 ROC AUC) across top classifiers demonstrates the SSL framework’s balanced feature learning capability for diverse military vehicle types encountered in operational SAR ATR scenarios.
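The class-wise ROC AUC values discussed above follow the standard one-vs-rest construction; a sketch with synthetic scores standing in for the classifiers' outputs:

```python
# One-vs-rest per-class ROC AUC for a 10-class problem.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n, n_classes = 300, 10
y = rng.integers(0, n_classes, size=n)
scores = rng.random((n, n_classes))
scores[np.arange(n), y] += 1.0                 # boost the true class score
scores /= scores.sum(axis=1, keepdims=True)    # normalize to a score simplex

y_bin = label_binarize(y, classes=np.arange(n_classes))
per_class_auc = [roc_auc_score(y_bin[:, c], scores[:, c])
                 for c in range(n_classes)]
```

Each class's AUC treats that class as the positive label against all others, which is why a single distinctive class (like the M548 here) can reach 1.0000 even when others do not.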
4.5. Operational Performance Thresholds
The analysis of True Positive Rate (TPR) at fixed False Positive Rate (FPR) thresholds provides operationally critical performance metrics that directly inform deployment decisions across diverse mission requirements and operational constraints. These metrics represent real-world performance under varying tolerance levels for false alarms, allowing precise system configuration for specific operational scenarios from high-precision reconnaissance to rapid threat assessment.
At the stringent 1% FPR threshold, representing high-precision applications where false alarms carry significant operational costs, SVM achieves an exceptional 99.89% TPR, demonstrating near-perfect target detection with minimal false alarm rates suitable for critical decision-making scenarios. Random Forest attains 99.65% TPR at this threshold, while XGBoost reaches 96.10% TPR, and ResNet18 achieves 98.70% TPR. These performance levels indicate excellent discrimination capability suitable for high-stakes applications such as threat identification in densely populated areas or precision targeting scenarios where misclassification consequences are severe.
At the moderate 5% FPR threshold, providing operational flexibility for scenarios tolerating moderately higher false alarm rates in exchange for enhanced target detection, SVM, Random Forest, and XGBoost all achieve perfect 100% TPR, while ResNet18 reaches 99.26%. MobileNet standard also achieves perfect 100% TPR, while EfficientNet B3 reaches 99.44% and Gradient Boosting attains 97.40%. This threshold represents optimal operating points for many operational scenarios where some false alarms are acceptable to ensure comprehensive target detection.
The 10% FPR threshold demonstrates robust detection performance under permissive operational constraints. Perfect 100% TPR is achieved by SVM, XGBoost, ResNet18, MobileNet standard, and multiple EfficientNet variants. CNN reaches 98.33%, U-Net achieves 97.96%, and even challenging scenarios maintain TPR above 94%. The consistent achievement of 100% TPR at this threshold across top-tier classifiers demonstrates robust detection performance suitable for surveillance applications where comprehensive area monitoring is prioritized over precision.
At higher FPR thresholds (15% and 20%), all classifiers achieve TPR values above 98%, with many reaching perfect 100% detection. SVM maintains perfect 100% TPR at both 15% and 20% FPR thresholds, as do Random Forest, XGBoost, and most deep learning architectures. This performance consistency across varying operational constraints validates the SSL framework’s robustness and provides confidence for deployment across diverse mission requirements.
Operational deployment guidelines emerge from threshold analysis. For high-precision applications requiring minimal false alarms (≤1% FPR), SVM provides optimal performance, making it ideal for critical identification tasks, precision targeting, or operations in sensitive areas where false positives have severe consequences. For balanced operational requirements (5% FPR), multiple classifiers achieve perfect or near-perfect TPR, enabling selection based on computational constraints, with SVM and Random Forest providing optimal accuracy while Gradient Boosting offers fast processing for time-critical applications. For surveillance and area monitoring scenarios tolerating higher false alarm rates (≥10% FPR), the framework provides exceptional flexibility with multiple high-performance options enabling deployment optimization based on hardware limitations, power constraints, or specific mission requirements rather than accuracy considerations.
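TPR at a fixed FPR operating point can be read off the ROC curve by linear interpolation; a sketch on synthetic binary scores (one-vs-rest view, not the paper's data):

```python
# TPR at fixed FPR thresholds via interpolation on the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
score = y * 0.6 + rng.random(500) * 0.8  # positives score higher on average
fpr, tpr, _ = roc_curve(y, score)        # fpr is returned in increasing order

thresholds = [0.01, 0.05, 0.10, 0.15, 0.20]
tpr_at_fpr = {f: float(np.interp(f, fpr, tpr)) for f in thresholds}
```

Because the ROC curve is monotone, TPR can only stay flat or rise as the tolerated FPR increases, which is the pattern reported across the 1–20% thresholds above.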
4.6. Cross-Validation Results and Statistical Validation
Our rigorous cross-validation methodology employs a conservative 5-fold class-proportionate approach using exclusively real SAR imagery. To ensure reproducible dataset composition and eliminate potential selection bias, the first 92 images per class are selected lexicographically. This results in a total dataset of 920 images (92 images × 10 classes), representing a constrained data regime characteristic of operational SAR ATR scenarios, where labeled training data are inherently limited and costly to acquire. Unlike the main k-value experiment, this cross-validation setup uses the full 920-image dataset for SSL pretext training while subjecting the same data to a rigorous 5-fold class-proportionate partitioning for downstream classification evaluation.
The cross-validation results demonstrate exceptional consistency and statistical significance across all evaluated classifiers, providing robust validation of our main experimental findings. SVM maintains the best and most consistent performance with 99.13% mean accuracy across folds, indicating exceptional stability across different data partitions and confirming its position as the optimal classifier for SSL-based SAR ATR. The narrow confidence interval demonstrates statistical significance of performance differences between methods and validates the reliability of SVM’s superiority across diverse data configurations.
Random Forest achieves a robust 96.96% mean accuracy, while ResNet18 delivers a competitive 96.52% mean performance among deep learning approaches. The cross-validation results for additional architectures demonstrate consistent performance patterns: XGBoost reaches 95.65% accuracy, Gradient Boosting achieves 94.13%, and efficient architectures maintain their relative performance hierarchies observed in the experiments. CNN attains 90.22% accuracy, U-Net reaches 87.93%, and MobileNet variants achieve 93.37–95.54% accuracy depending on width multiplier configuration.
Key findings from cross-validation analysis provide compelling evidence for framework robustness. First, consistent high performance across multiple classifiers (99.13% for SVM, 96.96% for Random Forest, 96.52% for ResNet18) indicates results are not attributable to classifier-specific optimization but reflect genuine feature quality from our SSL framework. Second, low standard deviations across all experiments demonstrate reproducibility and reliability, confirming that performance variations are minimal across different data partitions. Third, traditional machine learning approaches exhibit exceptional stability compared to deep learning models while maintaining competitive mean performance, reinforcing their suitability for operational deployment where consistency is crucial.
The cross-validation framework validates that results generalize reliably across different data splits, providing confidence in deployment scenarios where data composition may vary from training conditions. The consistent performance patterns between main k-value experiments and cross-validation results demonstrate framework robustness and validate our SSL approach’s effectiveness across diverse experimental protocols, supporting the reliability of our findings for real-world SAR ATR applications where data availability and composition constraints are common operational challenges.
4.7. Comprehensive Architecture Performance Analysis
The systematic evaluation across 16 downstream classifier architectures reveals distinct performance patterns that inform optimal deployment strategies. Traditional ML classifiers demonstrate remarkable consistency: SVM achieves 99.63% accuracy with exceptional precision (99.63%) and recall (99.66%), Random Forest maintains 99.26% accuracy with balanced precision (99.30%) and recall (99.22%), XGBoost delivers competitive 93.88% accuracy, and Gradient Boosting provides rapid inference time of 0.01 ms with 92.76% accuracy.
Deep learning architectures exhibit varied performance characteristics. CNN achieves 94.25% accuracy with moderate computational requirements (0.06 ms inference on extracted features), ResNet18 delivers top deep learning performance at 97.40% accuracy (0.13 ms inference on extracted features), and U-Net provides specialized capabilities with 84.04% accuracy despite higher computational cost (0.37 ms inference on extracted features). The efficient architecture family demonstrates compelling performance-efficiency trade-offs: EfficientNet B3 reaches peak efficient performance at 95.92% accuracy, EfficientNet B2 achieves 89.42%, EfficientNet B1 attains 95.18%, and EfficientNet B0 delivers baseline 92.95% performance.
MobileNet variants showcase scalability across computational constraints: standard MobileNet achieves exceptional 98.70% accuracy, MobileNet 0.75 reaches 97.77%, MobileNet 0.5 attains 95.73%, and MobileNet 0.25 delivers 92.21%. GAN-based classifiers demonstrate promising generative learning capabilities with 96.85% accuracy, validating the framework’s applicability to adversarial learning paradigms.
4.8. Contrastive Learning Baseline: SimCLR Performance Analysis
To establish a rigorous comparison with established contrastive learning methods, we evaluated SimCLR under identical experimental conditions (same dataset and k-value range, 0.05–1.00). SimCLR represents a fundamentally different SSL paradigm: while our multi-task approach learns through explicit transformation classification, SimCLR employs instance discrimination with contrastive loss, making this comparison particularly informative for understanding optimal SSL strategies for SAR ATR.
The SimCLR baseline achieves 92.02% accuracy at full training data availability (
k = 1.00), demonstrating that contrastive learning produces viable representations for SAR target recognition. However, this performance falls substantially short of our multi-task SSL framework, which achieves 99.63% accuracy (SVM), 99.26% (Random Forest), and 97.40% (ResNet18) under identical conditions. The performance gap persists across all
k-values, with SimCLR reaching only 15.77% accuracy at
k = 0.05 compared to 51.95% for our approach. Comprehensive performance metrics, including accuracy, precision, recall, and F1-score across all
k-values are provided in
Appendix A Table A5 (SimCLR Performance Metrics). Results reveal that SimCLR requires approximately
k ≈ 0.40 (327 training samples) to achieve performance comparable to our method at
k = 0.25 (203 samples), indicating superior data efficiency for task-based SSL in the SAR domain.
The performance differential can be attributed to several domain-specific factors. First, SAR imagery exhibits structured transformations (rotations, flips) that are semantically meaningful for target recognition, making explicit task-based learning more effective than generic instance discrimination. Second, contrastive learning relies heavily on negative sampling diversity, requiring large batch sizes for optimal performance [
42]; our SimCLR adaptation uses batch size 64 due to dataset constraints, potentially limiting contrastive signal strength. Third, the limited visual variability within SAR target classes (compared to natural images) may reduce the effectiveness of view-based contrastive learning, as augmented views may not provide sufficient discrimination signal.
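SimCLR's NT-Xent (normalized temperature-scaled cross-entropy) objective can be sketched in NumPy; this is an illustrative implementation under assumed shapes and temperature, not the training code used for the baseline:

```python
# NT-Xent contrastive loss over 2n embeddings (two augmented views per image).
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """Contrastive loss: each embedding's positive is its paired view."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / tau                               # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
loss_random = nt_xent(z1, rng.normal(size=(8, 16)))         # unrelated views
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
```

Every other embedding in the batch acts as a negative, which is why the batch-size constraint noted above (64 due to dataset size) can weaken the contrastive signal.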
4.9. Task Ablation Study and Pretext Task Analysis
To address fundamental questions about task contribution and framework complexity, we conducted comprehensive ablation studies systematically evaluating 15 task combinations across all 16 downstream classifiers at full data availability (
k = 1.00) for our best-performing classifier (SVM). The nine pretext tasks were organized into four semantic groups:
T0_Original (identity transformation),
T1–T5_Geometric (rotations: 90°, 180°, 270°; flips: horizontal, vertical),
T6–T7_SignalQuality (denoise, blur), and
T8_MultiScale (zoom). We evaluated individual groups, two-group combinations, and three-group combinations to identify optimal task selection strategies and quantify individual task contributions. The findings of the ablation studies are presented in
Table 5.
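The 15 evaluated configurations are exactly the non-empty subsets of the four task groups; a sketch of enumerating that grid (group names follow the paper's labels):

```python
# Enumerate all non-empty combinations of the four pretext task groups.
from itertools import combinations

groups = ["T0_Original", "T1-T5_Geometric",
          "T6-T7_SignalQuality", "T8_MultiScale"]
combos = [c for r in range(1, len(groups) + 1)
          for c in combinations(groups, r)]  # 2^4 - 1 = 15 subsets
```

Enumerating subsets at the group level, rather than over all nine individual tasks, keeps the ablation tractable (15 runs instead of 511).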
The ablation study reveals that geometric transformation tasks (T1–T5) are unequivocally the most critical component, achieving 96.22% average accuracy across all downstream classifiers independently, substantially outperforming all other task groups. Individual task group performance demonstrates a clear hierarchy: T1–T5_Geometric (96.22%), T0_Original (89.46%), T8_MultiScale (88.71%), and T6–T7_SignalQuality (59.80%). The geometric tasks’ superiority stems from SAR domain-specific characteristics: radar target recognition inherently requires aspect-angle invariance, and rotation/flip transformations directly address arbitrary viewing angles in overhead imaging while preserving electromagnetic scattering properties and target structure.
The ablation study identifies tasks contributing minimally: T6–T7_SignalQuality alone achieves only 59.80% accuracy, indicating poor standalone discrimination. However, when combined with geometric tasks, signal quality tasks contribute positively (T1–T5 + T6–T7 achieves the best performance), suggesting complementary robustness rather than primary discriminative features. Similarly, T8_MultiScale (88.71% alone) provides moderate independent contribution but enhances specific architectures when combined with geometric tasks.
The task contribution analysis for the task groups is depicted in
Figure 5. As shown in the heatmap, T0_Original, T1–T5_Geometric, and T8_MultiScale maintain consistently high scores (0.981–0.994) across accuracy, precision, and F1-score, demonstrating robust discriminative capacity. In stark contrast, T6–T7_SignalQuality exhibits substantially degraded performance (0.720–0.732), confirming its limited standalone contribution while validating its role as a complementary robustness enhancer when combined with geometric transformations. The near-uniform performance of geometric tasks across all metrics (0.981–0.982) reinforces their position as the primary feature learning mechanism, while the marginal performance between T8_MultiScale (0.992–0.993) and geometric tasks suggests scale invariance provides incremental rather than fundamental improvements to the learned representations. Overall, achieving 99.63% accuracy using all the tasks shows that all tasks contribute in gaining performance.
4.10. Baseline Comparison and State-of-the-Art Performance
Our SSL framework demonstrates significant improvements over synthetic data approaches reported by Lewis et al. [
3], achieving superior performance using exclusively measured data compared to their hybrid synthetic-measured approaches across all data availability scenarios. The elimination of synthetic data dependency while maintaining exceptional performance (99.63% accuracy for SVM) represents a fundamental advancement over previous methods that struggled with domain gap limitations.
The comparative analysis reveals our approach’s superiority across multiple dimensions: (1) elimination of synthetic data requirements removes domain gap constraints that limited previous approaches; (2) exclusive use of real SAR imagery ensures operational relevance and eliminates modeling assumptions inherent in synthetic generation; (3) superior performance metrics across diverse architectural families demonstrate framework generalizability; (4) robust performance under extreme data constraints (51.95% accuracy at k = 0.05) provides viable solutions for rapid deployment scenarios.
Our SSL framework substantially outperforms prior work across all evaluation metrics. While Lewis et al. [
3] achieved 94.5% accuracy using CNN-based models on full measured datasets, our approach delivers superior performance across diverse architectural families.
Table 6 presents comprehensive results for all classifiers achieving ≥94.50% accuracy at k = 1.00, demonstrating the framework’s effectiveness across traditional machine learning (SVM: 99.63%, Random Forest: 99.26%), efficient deep architectures (MobileNet variants: 95.73–98.70%, EfficientNet: 95.18–95.92%), standard deep networks (ResNet18: 97.40%), and generative models (GAN Classifier: 96.85%).
The results establish clear performance hierarchies: traditional ML approaches achieve exceptional accuracy when leveraging SSL-extracted features, with SVM demonstrating state-of-the-art performance (99.63%) that substantially exceeds previous benchmarks. Deep learning architectures maintain competitive performance, with ResNet18 leading at 97.40% accuracy, while efficient architectures provide compelling accuracy-efficiency trade-offs suitable for resource-constrained deployment scenarios.
Figure 6 presents ROC curves for the highest-performing classifiers at
k = 1.00, demonstrating exceptional discrimination capabilities across all target classes. The visualization reveals near-perfect performance characteristics with AUC values approaching 1.0000 for SVM, Random Forest, ResNet18, and MobileNet variants, validating the superior quality of SSL-extracted features for SAR target recognition.
To visualize the quality of learned representations, we employ t-SNE (t-distributed Stochastic Neighbor Embedding) dimensionality reduction to project the 2048-length SSL-extracted features into 2D space.
Figure 7 presents the t-SNE visualization for the test set, revealing distinct class clusters with minimal overlap. The clear separation between target classes demonstrates that our multi-task SSL framework learns semantically meaningful features that naturally group similar targets while maintaining discriminative boundaries between classes. SVM performs very well because the 2048-length SSL embeddings are close to linearly separable; a linear SVM maximizes inter-class margins in high dimensions with strong implicit regularization, which is advantageous in low-sample measured SAR. In contrast, tree ensembles and deep heads exhibit higher variance and tend to overfit speckle/idiosyncratic artifacts on limited data, while SVM leverages the global, geometry-aligned cues captured by the pretext encoder.
Comparison with SAR-Specific SSL Methods: Pei et al. [14] achieved 90.71% (using 10% of MSTAR data) and 99.34% (using 30% of MSTAR data) with SimCLR-style contrastive learning on ResNet50. Li et al. [15] introduced SAR-JEPA with predictive gradient-based embeddings, demonstrating robust low-data performance on MSTAR. SARATR-X [16] employs Vision Transformer pretraining on 180 K samples from 14 datasets, while SAFE [18] uses a masked Siamese ViT with self-distillation. These foundation models require millions of parameters, extensive multi-dataset curation, and substantial computational resources. Our lightweight CNN (97 K parameters, 2–3 min of single-GPU training) achieves 99.63% accuracy with only 806 samples from the SAMPLE dataset (derived from MSTAR), demonstrating that physics-informed task design delivers exceptional performance at dramatically lower cost and complexity than large-scale foundation approaches.
4.11. Detailed Results
Comprehensive experimental results are provided in the appendix tables.
Appendix A Table A1 (Model Performance Metrics and Timing Results for k-value Experiment) presents detailed performance metrics and timing results for all 16 downstream classifiers across 20 different k-values (0.05–1.00), including accuracy, precision, recall, F1-score, ROC AUC, TPR at various FPR thresholds, training time, and inference time.
Appendix A Table A2 (Class-wise ROC AUC Results for k-value Experiment) provides class-wise ROC AUC results for the k-value experiments, showing per-class discrimination performance for all 10 target classes.
Appendix A Table A3 (Model Performance Metrics and Timing Results for Cross Validation Experiment) contains cross-validation performance metrics as well as the timing results (average of the fold metrics).
Appendix A Table A4 (Class-wise ROC AUC Results for Cross Validation Experiment) presents class-wise ROC AUC results from the cross-validation experiments (average of the fold metrics).
Appendix A Table A5 (SimCLR Performance Metrics) provides complete SimCLR baseline performance metrics across all k-values for comparison with our multi-task SSL approach.
5. Conclusions and Future Work
This work presents a comprehensive self-supervised learning framework for SAR automatic target recognition that advances the state-of-the-art through systematic multi-task pretext training and extensive architectural evaluation. Our approach eliminates dependency on synthetic data while achieving exceptional performance across diverse computational paradigms. Rigorous experimental validation on the SAMPLE dataset, spanning 16 downstream classifiers drawn from traditional machine learning, deep neural networks, efficient architectures, and generative models, provides critical insights for operational SAR ATR deployment.
5.1. Key Findings and Contributions
The experimental results reveal fundamental insights that challenge conventional assumptions about deep learning superiority in SAR applications. Traditional machine learning approaches demonstrate exceptional effectiveness when combined with SSL-extracted features, with SVM achieving 99.63% accuracy, 99.63% precision, and 99.66% recall alongside near-perfect per-class ROC AUC values (1.0000 for six target classes, ≥0.9999 for four classes). Random Forest delivers the second-best performance with 99.26% accuracy and consistent per-class discrimination between 0.9960 and 1.0000, while XGBoost achieves a competitive 93.88% accuracy with per-class ROC AUC above 0.9930. Among deep learning architectures, ResNet18 emerges as the top performer with 97.40% accuracy and per-class ROC AUC values between 0.9974 and 1.0000, demonstrating that convolutional approaches remain highly effective for SSL-based SAR feature classification.
The k-value experimental design provides insights into framework robustness under varying data constraints, systematically evaluating performance from extreme scarcity (k = 0.05, 43 samples) to full availability (k = 1.0, 806 samples). Traditional ML methods exhibit remarkable resilience at extreme data scarcity, with SVM achieving 51.95% accuracy and Random Forest reaching 46.01% accuracy at k = 0.05, far exceeding random-chance performance (10%) and retaining operationally relevant capability for preliminary target screening. Deep learning architectures show higher sensitivity to data availability but improve rapidly as data increases, with ResNet18 advancing from 24.86% accuracy at k = 0.05 to 97.40% at k = 1.0, indicating that deeper architectures require more data to realize their capacity.
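A k-value protocol of this kind can be sketched as a stratified subsample of the labeled training pool, so every class stays represented even at extreme scarcity. The function and synthetic data below are an illustrative reconstruction under stated assumptions, not the paper's actual code.

```python
# Hedged sketch of stratified k-fraction subsampling of a labeled pool.
import numpy as np
from sklearn.model_selection import train_test_split

def subsample(features, labels, k, seed=0):
    """Return a stratified fraction k of (features, labels)."""
    if k >= 1.0:
        return features, labels
    X_k, _, y_k, _ = train_test_split(
        features, labels, train_size=k, stratify=labels, random_state=seed)
    return X_k, y_k

rng = np.random.default_rng(0)
X = rng.normal(size=(806, 2048))  # stand-in for the full pool at k = 1.00
y = np.arange(806) % 10           # 10 classes, roughly balanced

X_k, y_k = subsample(X, y, 0.05)
print(len(y_k), np.unique(y_k).size)
```

Stratification matters at the low end: an unstratified 5% draw can miss entire classes, whereas the stratified split keeps all ten present.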
Computational efficiency analysis reveals highly favorable operational characteristics essential for real-world deployment. SSL feature extraction dominates processing time at 15.35 ms per image consistently across all classifiers, while inference times range from 0.01 ms (Gradient Boosting) to 0.37 ms (U-Net), enabling total processing times of 15.36–15.72 ms/image and supporting real-time operation at 58–65 Hz frame rates. Operational performance metrics demonstrate exceptional discrimination capability with SVM achieving 99.89% TPR at 1% FPR and perfect 100% TPR at 5% FPR, providing flexible deployment options across diverse mission requirements from high-precision reconnaissance to rapid threat assessment scenarios.
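The operating-point metric reported above, TPR at a fixed FPR, can be read off an interpolated ROC curve. The scores below are synthetic placeholders and the helper name is ours, not part of the paper's pipeline.

```python
# Sketch: read TPR at a target FPR from a classifier's ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr):
    """Interpolate the ROC curve to obtain TPR at the given FPR."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
# Well-separated synthetic scores: positives centered higher than negatives.
s = rng.normal(loc=y * 2.0, scale=1.0)

print(tpr_at_fpr(y, s, 0.01), tpr_at_fpr(y, s, 0.05))
```

Because the ROC curve is non-decreasing, TPR at 5% FPR is always at least TPR at 1% FPR, which is why relaxing the false-alarm budget trades precision for detection rate across mission profiles.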
5.2. Theoretical and Practical Implications
The superior performance of traditional ML methods on SSL-extracted features suggests that the learned representations exhibit favorable linear separability characteristics, indicating that complex non-linear transformations may be unnecessary for effective SAR target discrimination when appropriate feature representations are available. This finding has significant practical implications for operational deployment, where traditional ML approaches offer computational efficiency advantages, interpretability benefits, and reduced training complexity compared to deep neural networks while maintaining state-of-the-art accuracy.
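The linear-separability point can be illustrated on a toy stand-in: when class clusters are well separated in a high-dimensional space, a purely linear SVM recovers them with no non-linear transformation. The Gaussian blobs below are an assumption standing in for SSL features, not the actual embeddings.

```python
# Illustration: a linear SVM suffices when high-dimensional features
# form well-separated class clusters.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, per_class, dim = 10, 80, 2048
centers = rng.normal(scale=4.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # close to 1.0 for well-separated clusters
```

In high dimensions the margin between such clusters grows with the embedding norm, so the max-margin objective acts as the implicit regularizer described above.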
The framework’s elimination of synthetic data dependency addresses a critical limitation in SAR ATR research, where domain gap challenges between synthetic and measured imagery have consistently hindered practical deployment. By achieving exceptional performance using exclusively measured SAR imagery, our approach provides a viable path toward operational systems that can be deployed without extensive synthetic data generation infrastructure or domain adaptation techniques.
Cross-validation experiments using a conservative 5-fold protocol with 920 real images validate framework robustness, with SVM achieving 99.13% mean accuracy and Random Forest maintaining 96.96% accuracy. The narrow confidence intervals demonstrate statistical significance and confirm that performance differences between methods reflect genuine capabilities rather than dataset-specific artifacts, providing confidence for deployment across varied operational scenarios.
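A conservative stratified 5-fold protocol of this kind can be sketched as follows; the 920-sample synthetic data and the linear SVC are illustrative stand-ins for the measured images and the tuned classifier.

```python
# Sketch of a stratified 5-fold cross-validation run with fold-wise
# mean and spread, mirroring the protocol described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, per_class, dim = 10, 92, 256   # 920 samples total
centers = rng.normal(scale=3.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print(scores.mean(), scores.std())  # fold mean and spread
```

Reporting the fold mean together with its spread is what supports the narrow-confidence-interval claim: a small standard deviation across folds indicates the accuracy is not an artifact of one particular split.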
5.3. Limitations and Methodological Considerations
Several limitations warrant acknowledgment for a comprehensive evaluation. Our experimental validation is constrained to the SAMPLE dataset (14°–17° elevation angles, 10°–80° azimuth range) with 1345 images, representing a specific operational scenario that has yet to be tested across diverse SAR sensor configurations, environmental conditions, or the significantly larger target inventories encountered in comprehensive military applications. The framework’s performance across different radar frequencies, polarizations, and operational environments requires systematic investigation to establish generalizability bounds.
The k-value experimental design necessarily restricts GAN-based classifiers to the larger data fractions because of the minimum sample requirements for stable adversarial training. While this methodological choice ensures reliable results and prevents training artifacts, it limits insights into extremely low-data scenarios for generative approaches. Additionally, while our computational analysis indicates promising timing characteristics, real-time deployment considerations, including hardware optimization, system integration overhead, and edge computing constraints, require empirical validation across diverse operational platforms.
5.4. Future Research Directions
Several promising research directions emerge from this foundation that could significantly extend framework capabilities and operational applicability. Multi-modal integration incorporating multi-polarization, multi-frequency, and multi-temporal SAR data represents a natural extension that could enhance recognition capabilities while maintaining the synthetic data independence that characterizes our approach. SSL-based domain adaptation techniques could address operational condition variations and sensor differences without requiring extensive paired datasets.
Developing incremental learning frameworks that adapt to new target classes without full retraining would improve operational flexibility in dynamic threat environments where new vehicle types or configurations emerge continuously. Advanced pretext task design incorporating radar phenomenology principles, such as polarimetric signatures, coherence patterns, or aspect-dependent scattering behaviors, could further improve feature representation quality while maintaining computational efficiency.
Federated learning approaches could enable collaborative model development across multiple operational sites while preserving data privacy and security requirements critical for defense applications. Extension to sequential SAR data and video SAR analysis could leverage temporal information for enhanced recognition performance, particularly for moving targets or complex scenarios requiring temporal context.
Investigation of foundation model architectures specifically designed for radar data could establish large-scale pretraining frameworks that leverage diverse SAR datasets while maintaining the measured data focus that characterizes our current approach. Integration with modern transformer architectures and attention mechanisms could enable more sophisticated feature learning while preserving computational efficiency requirements for operational deployment.
As SAR technology evolves with advanced sensor capabilities, the principles established in this work provide a framework for developing adaptive, efficient automatic target recognition systems. The elimination of synthetic data dependency, combined with exceptional performance characteristics and computational efficiency, positions this approach as a foundation for next-generation SAR ATR applications, contributing to advances in remote sensing, computer vision, and artificial intelligence that extend beyond defense applications to scientific remote sensing, environmental monitoring, and civilian surveillance systems.