Article

Multi-Scale Time-Frequency Representation Fusion Network for Target Recognition in SAR Imagery

1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China
3 School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China
4 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2786; https://doi.org/10.3390/rs17162786
Submission received: 22 June 2025 / Revised: 18 July 2025 / Accepted: 26 July 2025 / Published: 11 August 2025

Abstract

This paper proposes a multi-scale time-frequency representation fusion network (MTRFN) for target recognition in synthetic aperture radar (SAR) imagery. Leveraging the spectral characteristics of six radar sub-views, the model incorporates a multi-scale representation fusion (MRF) module to extract discriminative frequency-domain features from two types of radar sub-views with high learnability. Additionally, physical scattering characteristics in SAR images are captured via time-frequency domain analysis. To enhance feature integration, a gated fusion network performs adaptive feature concatenation. The MRF module integrates a lightweight residual block to reduce network complexity and employs a coordinate attention mechanism to prioritize salient targets in the frequency spectrum over background noise, aligning the model’s focus with physical scattering principles. Furthermore, the model introduces an angular additive margin loss function during classification to enhance intra-class compactness and inter-class separability while reducing computational overhead. Compared with existing interpretable methods, the proposed approach combines architectural transparency with physical interpretability, thereby lowering the risk of recognition errors. Extensive experiments conducted on four public datasets demonstrate that the proposed MTRFN significantly outperforms existing benchmark methods. Comparative experiments using heat maps further confirm that the proposed physical feature-guided module effectively directs the model’s attention toward the target rather than the background.

1. Introduction

Synthetic Aperture Radar (SAR) has emerged as a pivotal technology in remote sensing, offering all-weather, day-night imaging capabilities crucial for applications such as military reconnaissance, autonomous navigation, and disaster monitoring [1,2]. Among its many applications, Automatic Target Recognition (ATR) stands out as a core technology for enhancing the intelligence and operational autonomy of military platforms. SAR ATR aims to automatically identify and classify targets in SAR imagery, providing decision-makers with timely and accurate information.
The development of SAR ATR methodologies has been driven by the unique advantages of SAR imagery and the demand for reliable target recognition systems. Early approaches primarily relied on handcrafted features and template matching. These methods, while effective in controlled environments, often struggle with the variability and complexity of real-world SAR imagery. Attributed scattering center (ASC) models exploit high-frequency electromagnetic reflections from canonical structures [3,4,5]. Zhang et al. [6] investigated the sensitivity of azimuth angles in limited SAR target recognition and selected the most representative samples based on the trend of feature similarity. As computational power grew and machine learning techniques evolved, more sophisticated methods emerged. Sparse representation classification (SRC), which represents test images as sparse linear combinations of training templates, gained prominence [7,8,9,10,11]. SRC leverages the assumption that if a test image belongs to a specific class, it can be sparsely represented using the training samples of that class. This approach has proven effective in capturing the intrinsic structures of SAR images and achieving robust classification performance. In [12], Lin et al. conducted the first quantitative evaluation of the contribution of polarimetric information to SAR target recognition and introduced a simple polarimetric feature that effectively captures the target’s physical scattering characteristics.
The advent of deep learning has revolutionized the field of SAR ATR. Deep learning models, particularly convolutional neural networks (CNNs), have demonstrated remarkable capability in automatically learning hierarchical features from SAR imagery [13,14,15]. Chen et al. [16] pioneered all-convolutional networks (A-ConvNets) that omit fully connected layers to reduce overfitting. Wang et al. [17] introduced despeckling CNNs as preprocessing modules, while Kwak et al. [18] designed speckle-invariant architectures via regularization. Transfer learning from ImageNet-pretrained models (e.g., VGG, AlexNet) has proven effective for limited SAR data [19,20]. Attention mechanisms remain underexplored; existing works focus on spatial or channel attention in the time domain [21,22,23], neglecting frequency-domain transformations.
Transform-based methods remap SAR data into alternative domains [24]. Dong et al. [25] applied Discrete Fourier Transforms (DFT) with SRC, while Grassmann manifolds encoded monogenic signals via Riesz transforms [26,27]. These methods trade computational complexity for marginal accuracy gains, limiting their adoption [28].
Despite significant progress over recent decades, SAR ATR still faces key challenges, especially in dealing with the complexity and variability of real-world scenarios [29,30,31,32,33]. Traditional and deep learning-based approaches have improved recognition performance, but often fail to fully exploit the rich information embedded in the frequency domain, leading to suboptimal feature extraction and limited robustness in practical deployments.
To address these shortcomings, we propose a novel SAR target recognition network, namely the multi-scale time-frequency representation fusion network (MTRFN). First, a coordinate attention mechanism is integrated with the radar subview spectrogram generation process within the proposed multi-scale representation fusion (MRF) module to dynamically capture the spatial-spectral feature distribution of targets. This combination enables the MRF module to achieve synergistic enhancement of both local details and global structural features. Second, a time-frequency domain encoder is developed to separately extract time-domain waveform and frequency-domain spectrogram features. These multimodal representations are then adaptively integrated using a gated fusion network, which effectively suppresses noise interference. Furthermore, the classic angular additive margin loss (AAMLoss) is employed to enhance the discriminative power of the model by increasing the boundary certainty between different target classes. The contributions of this work are threefold:
  • To overcome the limitations of single-view features, we propose a multi-subview fusion architecture using an MRF module, combining frequency-domain feature extraction of radar spectrum subviews with time-frequency domain physical scattering analysis.
  • We introduce a gated fusion module for multi-subview feature fusion and network complexity reduction, and further combine it with the classic AAMLoss function to optimize intra-class compactness and inter-class discrimination.
  • Extensive experiments performed on four public datasets show that the proposed MTRFN outperforms reference methods significantly, and the heat map comparison enables an intuitive correlation between the network’s decision patterns and the feature extraction.
This paper is organized as follows. Section 2 provides a comprehensive review of related work in SAR ATR. A detailed introduction to the proposed MTRFN is presented in Section 3. Section 4 shows the experimental results and analysis. Finally, Section 5 summarizes the key findings and outlines future research directions.

2. Related Work

In this section, we delve deeper into the methodologies and advancements that have shaped the field of SAR ATR. We categorize the related work into several key areas: traditional feature-based methods, SRC, low-rank matrix factorization (LMF), deep learning approaches, and frequency-domain methods.
Traditional feature-based methods have laid the foundation for SAR ATR research. These methods rely on handcrafted features such as geometric shapes, intensity distributions, and texture patterns to represent SAR images. Belloni et al. [34] employed Gaussian Mixture Models (GMMs) to segment targets from the background and used BRISK features to encode SAR images. Amrani et al. [35] generated graph-based visual saliency maps and combined Gabor and HOG features for target encoding. Bolourchi et al. [36] proposed a moment fusion strategy to describe SAR images, ranking moments based on their Fisher scores and inputting them into an SVM classifier. While these methods have achieved promising results, their reliance on manually designed features limits their adaptability to diverse and complex scenarios.
SRC represents a significant advancement in SAR ATR by leveraging sparse representation theory. SRC assumes that a test image can be represented as a sparse linear combination of training templates. Ding et al. [3] developed a robust similarity measure for attributed scattering center sets with application to SAR ATR. Zhang et al. [37] introduced a joint classification of multiresolution representations with discrimination analysis for SAR ATR. Wei et al. [7] proposed a fast DDL classification for SAR images with an L_inf constraint. SRC methods have demonstrated effectiveness in capturing the intrinsic structures of SAR images but may suffer from high computational complexity, especially when dealing with large-scale datasets.
LMF techniques further enhance feature representation by decomposing data matrices into low-rank components. Dang et al. [38] introduced incremental non-negative matrix factorization (INMF) with sparse constraints to update the trained model incrementally as new samples arrive. Zhang et al. [39] combined Gabor, PCA, and wavelet features with NMF for SAR target recognition. LMF effectively reduces data dimensionality but may lose some discriminative information during the factorization process.
Deep learning has transformed the field of SAR ATR by enabling automatic feature learning from data [40]. Gao et al. [41] improved CNNs by incorporating cross-entropy and class separability information into the cost function. Wang et al. [17] introduced a despeckling CNN as a preprocessing step to reduce the impact of speckle noise on SAR ATR. Liu et al. [42] proposed a multibranch expert network combined with a dual-environment sampling strategy to effectively address long-tail distribution challenges in both interclass and intraclass recognition tasks. Yue et al. [43] proposed a semi-supervised CNN that combines supervised and unsupervised learning. To enhance multiscale feature representation, Wang et al. [22] proposed a SAR ship recognition method that integrates multiscale feature attention with an adaptive-weighted classifier. This approach strengthens feature representations at each scale and adaptively selects the most informative scale for accurate recognition. In a related effort, Cui et al. [44] introduced a multidimensional feature joint learning (MFJL) framework for SAR target recognition, enabling the joint learning of traditional pattern features and deep features. These deep learning approaches have achieved state-of-the-art performance on multiple benchmarks but often require substantial labeled data for training.
To address the problem of sparse labeled samples, Gao et al. [45] proposed a convolutional block attention module for ship classification transfer learning from the optical domain to the SAR domain. Wan et al. [46] presented a YOLOX-based multi-scale enhancement representation learning method to balance accuracy and learning speed, which specifically developed a channel-spatial attention enhancement module. Shao et al. [47] imposed spatial attention to enhance feature extraction ability, while Lang et al. [48] introduced a lightweight cascaded multidomain attention network (LW-CMDANet) to study the effectiveness of attention mechanisms in few-shot learning scenarios. Class-specific feature extraction from both the frequency and wavelet transform domains was then performed via an attention module embedded into the CNN model, which effectively compensates for the limited feature extraction capacity of the network.
Frequency-domain methods offer a unique perspective by remapping SAR images into the frequency domain to extract features. Dong et al. [25] applied the discrete Fourier transform (DFT) to SAR images and utilized SRC for classification. Zhou et al. [27] proposed a scale selection method based on weighted multi-task joint sparse representation to reduce information redundancy between scales. Dong et al. [28] developed a classification method via sparse representation of steerable wavelet frames on Grassmann manifold. Previous studies [25,49] have shown that signal energy in the frequency domain is primarily concentrated in a small subset of low-frequency components, which carry highly important information for target characterization and recognition. These methods leverage the frequency characteristics of SAR images but may face challenges in effectively integrating frequency-domain features with time-domain features. Existing methods usually concatenate frequency domain features with time domain features in the dimensional direction, while time-frequency analysis, a natural tool that combines time domain and frequency domain features, has not received enough attention.
In summary, while each category of methods has contributed significantly to the advancement of SAR ATR, they also exhibit certain limitations: traditional feature-based methods lack adaptability, SRC methods may incur high computational complexity, LMF may lose discriminative information, and deep learning requires substantial labeled data. To overcome these limitations, this paper proposes a novel approach that combines multi-scale time-frequency representation with a coordinate attention mechanism. Our goal is to fully exploit the rich information in SAR images and achieve more accurate and robust target recognition.

3. Proposed Method

3.1. Multi-Scale Representation Fusion

To alleviate the trade-off between recognition accuracy and model interpretability, this paper introduces an MRF module. MRF enhances the model’s capacity to capture global contextual information without sacrificing interpretability. As illustrated in Figure 1, the MRF module is composed of three key components: radar subview spectrogram generation (TFA), a coordinate attention mechanism, and a lightweight residual structure. To reduce network complexity, the residual block within MRF comprises a MaxPooling layer, a 1 × 1 convolutional layer, and a 3 × 3 convolutional layer. The MaxPooling operation performs downsampling on the frequency-domain features, while the 3 × 3 convolution enlarges the receptive field and facilitates detailed feature extraction. This design ensures both efficiency and expressive power in local feature representation. The details of the subview spectrogram generation and coordinate attention are presented in the following.
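For concreteness, a minimal sketch of such a lightweight residual block is given below, written in TensorFlow/Keras (the framework used in Section 4.2). The filter count, pooling size, and the pooled 1 × 1 shortcut are illustrative assumptions rather than the paper's exact configuration.

```python
from tensorflow.keras import layers

def lightweight_residual_block(x, filters, pool_size=2):
    """MaxPooling -> 1x1 conv -> 3x3 conv with a pooled 1x1 shortcut (a sketch)."""
    # Downsample the frequency-domain feature map to cut computation.
    pooled = layers.MaxPooling2D(pool_size=pool_size)(x)
    # 1x1 convolution mixes channel information at low cost.
    y = layers.Conv2D(filters, kernel_size=1, padding="same", activation="relu")(pooled)
    # 3x3 convolution enlarges the receptive field for detailed feature extraction.
    y = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(y)
    # Shortcut: match the channel count of the pooled input before the residual sum.
    shortcut = layers.Conv2D(filters, kernel_size=1, padding="same")(pooled)
    return layers.Add()([shortcut, y])

# Example: apply to a 6-channel stack of 64 x 64 subview spectrograms.
inputs = layers.Input(shape=(64, 64, 6))
outputs = lightweight_residual_block(inputs, filters=32)   # (None, 32, 32, 32)
```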

3.1.1. Radar Subview Spectrogram Generation

Inspired by the time-frequency analysis (TFA) scheme employed in [24,50], this work adopts TFA to extract physical scattering characteristics from SAR images for target recognition purposes. Specifically, a two-dimensional TFA is applied to the extended 2D SAR image spectrum, which contains Doppler bandwidth in the azimuth direction and chirp bandwidth in the range direction. The objective is to characterize target properties by extracting variations in backscattering from the 2D spectral data. TFA effectively transforms SAR images from the spatial domain into the time-frequency domain, revealing backscattering characteristics that are not observable in the original spatial domain. This approach enables the generation of sub-aperture SAR images along the azimuth direction and decomposes complex SAR images into subbands along the range direction. The resulting echo signals, observed at different frequencies, are useful for characterizing or distinguishing targets sensitive to transmission frequency.
The core principle of TFA is based on the short-time Fourier transform (STFT) and band-pass filtering. Given a pixel location $(x_0, y_0)$ in a complex SAR image, the full-band signal $S(x_0 - x, y_0 - y)$ centered at $(x_0, y_0)$ is extracted and transformed into the Fourier domain. A series of band-pass filters $w$ are applied to $S(x, y, f_r, f_a)$, and the filtered signals are transformed back into the spatial domain to obtain multiple subband images. These subband images reveal variations in backscattering more clearly, especially for man-made objects that may be visible in certain subbands and invisible in others.
For each center location $(x_0, y_0)$, we generate a radar spectrogram $S(x, y, f_r, f_a)$, where the amplitude $|r|$ at $r(x_0, y_0)$ is defined as the subband scattering pattern of the pixel at $(x_0, y_0)$. The implementation framework is illustrated in Figure 2. By fixing any two dimensions, two-dimensional projections of the four-dimensional data matrix can be created, enabling three-dimensional visualization through animation by varying the fixed parameters. There are 12 possible 2D projections, but only six of them are unique.
As an example, we use a SAR image of a cargo ship target to generate the radar spectrogram $S(x, y, f_r, f_a)$ (shown in Figure 3) and its six sub-projections (shown in Figure 4). The projections centered on $(f_r, f_a)$ and $(x_0, y_0)$ exhibit clearer frequency-domain characteristics. Therefore, these two types of subband projection images are selected for subsequent frequency-domain feature learning. This approach addresses the limitations of relying solely on single-dimensional frequency-domain features, which often result in insufficient feature representation and degraded recognition performance, particularly for ship targets.
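The following sketch illustrates the underlying idea of the subview generation on a complex SLC patch: mask the 2D range/azimuth spectrum with band-pass windows and transform each masked copy back to the spatial domain. The boxcar masks and the 2 × 3 subband grid are simplifying assumptions; the paper's implementation uses an STFT with a 32 × 32 window and an overlap of 16 (Section 4.2).

```python
import numpy as np

def subband_projections(slc, n_range=2, n_azimuth=3):
    """Decompose a complex SLC patch into sub-aperture/subband amplitude images
    by masking its 2-D spectrum (a simplified sketch of the TFA scheme)."""
    rows, cols = slc.shape
    spectrum = np.fft.fftshift(np.fft.fft2(slc))             # range/azimuth spectrum
    r_edges = np.linspace(0, rows, n_range + 1, dtype=int)
    a_edges = np.linspace(0, cols, n_azimuth + 1, dtype=int)
    subviews = []
    for i in range(n_range):                                  # chirp (range) subbands
        for j in range(n_azimuth):                            # Doppler (azimuth) sub-apertures
            mask = np.zeros_like(spectrum)
            mask[r_edges[i]:r_edges[i + 1], a_edges[j]:a_edges[j + 1]] = 1.0
            sub = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
            subviews.append(np.abs(sub))                      # subband scattering amplitude
    return np.stack(subviews, axis=0)                         # (n_range * n_azimuth, H, W)

# Example: six subviews of a random 64 x 64 complex patch.
patch = (np.random.randn(64, 64) + 1j * np.random.randn(64, 64)).astype(np.complex64)
views = subband_projections(patch)                            # shape (6, 64, 64)
```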

3.1.2. Coordinate Attention Mechanism

Conventional channel attention mechanisms often lose positional information after 2D global pooling, which is crucial for generating effective spatial attention maps. To address this limitation, this paper adopts the coordinate attention mechanism, which embeds precise positional information into channel attention. The core ideas can be summarized as follows:
  • Joint Spatial-Channel Modeling: By decomposing spatial coordinates along the horizontal (X) and vertical (Y) directions, coordinate attention mechanism enables more precise localization of spatial features while maintaining inter-channel dependencies.
  • Long-Range Dependency Capture: Global pooling operations performed separately along the horizontal and vertical axes help capture long-range contextual information, thereby enhancing the model’s understanding of global spatial structures.
  • Lightweight Design: Compared to standard global 2D pooling, the decomposed pooling strategy significantly reduces computational cost, making it suitable for integration into deep neural networks without sacrificing efficiency.
The detailed architecture and workflow of the coordinate attention mechanism are illustrated in Figure 5. The coordinate attention module consists of three stages: coordinate information embedding, attention generation, and feature recalibration.
  • Coordinate Information Embedding: Traditional channel attention mechanisms typically use global average pooling (GAP) to encode spatial information. However, GAP compresses the global spatial context into a single scalar per channel, which leads to the loss of fine-grained positional information. To overcome this, Coordinate Attention performs one-dimensional pooling operations along the height (H) and width (W) dimensions independently, thereby preserving direction-aware spatial information.
  • Coordinate Attention Generation: Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the CA module first applies average pooling with kernel sizes of $(H, 1)$ and $(1, W)$ to capture vertical and horizontal contextual information, respectively. This operation encodes each channel separately along the vertical (y-axis) and horizontal (x-axis) directions.
  • Feature Recalibration: The pooled features are then used to generate attention maps along each coordinate direction, which are subsequently applied to recalibrate the original feature map X, enhancing the representation of informative regions.
The detailed architecture of the Coordinate Attention module is illustrated in Figure 6. Let $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ be the input feature map. For channel $k$, the outputs of the pooling operations along the height ($Z_k^h$) and width ($Z_k^w$) can be formulated as follows:
$$Z_k^h(y) = \frac{1}{W} \sum_{i=1}^{W} X_k(y, i), \quad y = 1, 2, \ldots, H$$
$$Z_k^w(x) = \frac{1}{H} \sum_{j=1}^{H} X_k(j, x), \quad x = 1, 2, \ldots, W$$
These one-dimensional encodings capture long-range dependencies in each spatial direction while preserving the positional context, which is critical for spatially aware feature enhancement. For a given channel $k$ in the input feature map $X \in \mathbb{R}^{C \times H \times W}$, the vertical and horizontal contextual descriptors, denoted as $Z_k^h$ and $Z_k^w$, are obtained through one-dimensional pooling operations along the height and width, respectively. These are defined as:
$$Z_k^h(h) = \frac{1}{W} \sum_{i=0}^{W-1} x_k(h, i), \quad h = 0, 1, \ldots, H-1$$
$$Z_k^w(w) = \frac{1}{H} \sum_{j=0}^{H-1} x_k(j, w), \quad w = 0, 1, \ldots, W-1$$
These pooled features are aggregated along the spatial axes to produce a pair of direction-aware feature maps. The vertical descriptor captures long-range dependencies along the height while preserving precise positional information along the width, and vice versa. This directional encoding enables the model to localize salient regions more effectively.
The two pooled features $Z^h$ and $Z^w$ are concatenated to form a unified representation of shape $C \times 1 \times (H + W)$, which facilitates subsequent processing. This concatenated feature is passed through a $1 \times 1$ convolution layer with reduced channel dimension $C/r$ (where $r$ is a reduction ratio), followed by batch normalization and a non-linear activation function $\delta(\cdot)$. The feature embedding is then split and transformed along each spatial direction using two separate convolutional layers to produce the attention weights $g^h$ and $g^w$. These steps can be formulated as:
$$f = \delta\left(F_1\left([Z^h, Z^w]\right)\right)$$
$$g^h = \sigma\left(F_h(f^h)\right)$$
$$g^w = \sigma\left(F_w(f^w)\right)$$
Here, $F_1$, $F_h$, and $F_w$ denote convolution operations, $\delta(\cdot)$ is typically a ReLU activation, and $\sigma(\cdot)$ is the sigmoid function.
The attention weights $g^h$ and $g^w$ are applied to the original feature map $X$ through element-wise multiplication along the corresponding dimensions. For channel $k$, the recalibrated output $o_k(i, j)$ is given by:
$$o_k(i, j) = x_k(i, j) \cdot g_k^h(i) \cdot g_k^w(j)$$
This operation enhances the response of informative regions in both spatial directions while maintaining the original spatial layout, thus improving both the discriminative power and interpretability of the model.
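A compact sketch of the coordinate attention block described by the equations above is given below, assuming channels-last tensors of shape (B, H, W, C); the reduction ratio and the minimum bottleneck width are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def coordinate_attention(x, reduction=8):
    """Directional average pooling (Z^h, Z^w), shared 1x1 transform F_1, and
    per-direction gates g^h, g^w (a sketch of the CA block)."""
    _, H, W, C = x.shape
    z_h = tf.reduce_mean(x, axis=2, keepdims=True)                 # Z^h: (B, H, 1, C)
    z_w = tf.reduce_mean(x, axis=1, keepdims=True)                 # Z^w: (B, 1, W, C)
    # Concatenate the two descriptors along a single spatial axis: (B, H + W, 1, C).
    z = tf.concat([z_h, tf.transpose(z_w, [0, 2, 1, 3])], axis=1)
    f = layers.Conv2D(max(C // reduction, 8), 1, use_bias=False)(z)  # shared F_1
    f = tf.nn.relu(layers.BatchNormalization()(f))                 # delta(.)
    f_h, f_w = tf.split(f, [H, W], axis=1)                         # split per direction
    g_h = tf.sigmoid(layers.Conv2D(C, 1)(f_h))                     # g^h: (B, H, 1, C)
    g_w = tf.sigmoid(layers.Conv2D(C, 1)(tf.transpose(f_w, [0, 2, 1, 3])))  # g^w: (B, 1, W, C)
    return x * g_h * g_w                                           # o_k(i, j) = x_k(i, j) * g^h_k(i) * g^w_k(j)

# Example: recalibrate a random frequency-domain feature map.
feat = tf.random.normal([2, 32, 32, 64])
out = coordinate_attention(feat)                                   # same shape as feat
```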

3.2. Network Architecture

As illustrated in Figure 7, the proposed MTRFN architecture consists of two main components: a time-frequency domain encoder and a gated fusion network. The input raw data $X_{\text{slc}}$ undergoes TFA to generate $X_{\text{img}}$ and $X_{\text{fre}}$, which are independently processed by convolutional encoders to extract multi-scale spatial and spectral features. Coordinate attention is applied to $X_{\text{fre}}$ to extract attention-aware subband features, yielding $F_{\text{fre}}$. Fused with the time-domain feature $F_{\text{img}}$, the gated module adaptively merges the two feature domains into $F_M$. Finally, $F_M$ is fed into a classifier trained with AAMLoss for improved recognition and generalization in SAR scenarios.
The time-frequency domain encoder module enhances the model’s capability of capturing global context without compromising interpretability. It includes a series of multi-scale convolution operations applied to $X_{\text{fre}}$, with each subband spectrum extracted using different center frequencies and bandwidths. These subbands are denoted as $S_{\text{fre}}$ and $S_{\text{pos}}$. Coordinate attention is integrated to separately encode horizontal and vertical spatial dependencies, guiding the model to focus on informative regions. The fused frequency-domain feature map is denoted as $F_{\text{fre}}$.
In the gated fusion network, features extracted from $X_{\text{img}}$ and $X_{\text{fre}}$ are processed via dedicated encoders to yield $F_{\text{img}}$ and $F_{\text{fre}}$, respectively. These are then fused through a gated mechanism. Unlike static fusion methods, the gated module adaptively learns a weighting coefficient $S$ via a shallow neural network that takes both $F_{\text{img}}$ and $F_{\text{fre}}$ as input. The fusion result is obtained by:
$$F_M = S \odot F_{\text{img}} + (1 - S) \odot F_{\text{fre}},$$
where $\odot$ denotes element-wise multiplication. This design allows the model to dynamically balance spatial detail (from $F_{\text{img}}$) and frequency-domain texture information (from $F_{\text{fre}}$), effectively suppressing background noise and improving feature discrimination.

3.2.1. Time-Frequency Domain Encoder

Due to the fundamental differences between optical and SAR image formation mechanisms, it is necessary to perform separate processing in the time and frequency domains for SAR data. To this end, we design a time-frequency domain encoder, as illustrated in Figure 8, in which distinct convolutional architectures are employed to extract features from each domain. The temporal branch adopts a simplified residual structure, while the frequency branch leverages multi-scale fused features.
The left half of Figure 8 corresponds to the temporal encoder, which takes as input the intensity image $X_{\text{img}}$ derived from the magnitude of the complex-valued single-look complex (SLC) SAR image. This branch primarily focuses on extracting spatial information from the time domain to alleviate information loss for small-scale or low-backscatter targets. The temporal residual block is composed of one max pooling layer followed by two consecutive 1 × 1 convolution layers. The pooling layer suppresses redundant background information, while the two point-wise convolutions perform a linear combination of spatial features to integrate inter-channel information, similar in spirit to the coordinate attention (CA) mechanism.
The right half of the figure corresponds to the frequency encoder, which elaborates on the MRF module. The input to this branch is the frequency-domain feature map $X_{\text{fre}}$ obtained through the MRF structure. With the aid of the coordinate attention module, the frequency encoder can better localize salient regions. Within the joint time-frequency encoder, feature extraction and fusion are performed simultaneously across both domains, enabling complementary representation learning.
The residual structure is crucial for both branches. It allows the network to preserve identity mappings while recovering lost information through convolutional refinement in subsequent layers. This design is particularly beneficial in scenarios involving small targets or weak backscatter, where precise localization and robust texture representation are challenging.
The time-domain image provides high spatial resolution and captures more intuitive representations of target characteristics under real-world conditions. However, its descriptive power diminishes under small-scale and low-backscatter settings. Conversely, the frequency-domain image provides abstract, less intuitive spectral features but, when processed through CA, becomes capable of capturing critical texture and edge information that complements time-domain representations.
Finally, the features from both branches, denoted as $F_{\text{img}}$ (time domain) and $F_{\text{fre}}$ (frequency domain), are concatenated to form the fused feature representation $X_F$:
$$X_F = \mathrm{Concat}(F_{\text{img}}, F_{\text{fre}})$$
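The two-branch encoder can be summarized by the following sketch, which combines the temporal residual block (max pooling followed by two 1 × 1 convolutions) with a simplified frequency branch that reuses the coordinate_attention sketch from Section 3.1.2; filter counts and the spatial-alignment pooling step are assumptions, not the paper's exact design.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_residual_block(x, filters):
    """Temporal branch: one max pooling layer and two 1x1 convolutions, with a
    pooled shortcut (a sketch)."""
    pooled = layers.MaxPooling2D(pool_size=2)(x)                # suppress redundant background
    y = layers.Conv2D(filters, 1, activation="relu")(pooled)    # point-wise channel mixing
    y = layers.Conv2D(filters, 1, activation="relu")(y)
    shortcut = layers.Conv2D(filters, 1)(pooled)
    return layers.Add()([shortcut, y])

def time_frequency_encoder(x_img, x_fre, filters=32):
    """Two-branch encoder producing F_img and F_fre, concatenated into X_F."""
    f_img = temporal_residual_block(x_img, filters)
    f_fre = coordinate_attention(layers.Conv2D(filters, 3, padding="same")(x_fre))
    f_fre = layers.MaxPooling2D(pool_size=2)(f_fre)             # align spatial size with f_img
    return tf.concat([f_img, f_fre], axis=-1)                   # X_F = Concat(F_img, F_fre)

# Example: a 64 x 64 intensity patch and a 6-channel subview stack.
x_f = time_frequency_encoder(tf.random.normal([2, 64, 64, 1]),
                             tf.random.normal([2, 64, 64, 6]))  # (2, 32, 32, 64)
```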

3.2.2. Gated Fusion Network

Gated fusion is a feature integration method that adaptively adjusts the contribution of different inputs via learnable gating mechanisms. The core idea is to utilize gate units—typically implemented using Sigmoid or Softmax functions—to dynamically generate fusion weights based on the input, enabling flexible and selective information aggregation. The concept of gating originates from recurrent neural networks (RNNs), particularly advanced variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). LSTM uses input, forget, and output gates to regulate the flow of information, whereas GRU simplifies the design by retaining only reset and update gates.
We propose a gated fusion network as shown in Figure 9. This module processes two types of features through three branches. Initially, the features are concatenated and passed through fully connected layers to reduce dimensionality and project them into a shared latent space. A hyperbolic tangent activation function (tanh) is applied to introduce non-linearity and allow feature-specific encoding and mapping.
The activated features from the two branches and the concatenated feature representation are fed into a gated neural unit, which assigns adaptive weights to each feature map. This allows the network to learn the optimal contribution of each input dynamically. To compensate for potentially suppressed but relevant features, two max-pooling layers are applied to the separate feature branches, ensuring that significant patterns are preserved.
The gated unit not only facilitates the effective integration of spatial edge features from the time-domain image and textural patterns from the frequency-domain image, but also enables robust learning of deep semantic features while suppressing irrelevant background noise. After adaptive weighting, the final representation is obtained through global pooling and channel-wise concatenation, as described by:
$$F_M = \mathrm{Concat}\big(F, \mathrm{MaxPool}(F_{\text{img}}), \mathrm{MaxPool}(F_{\text{fre}})\big)$$
The gated fusion design addresses the static nature and redundancy issues found in conventional fusion methods. Its dynamic and adaptive nature makes it particularly suitable for complex tasks involving noisy data, multiple modalities, or multi-task scenarios.
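A minimal sketch of the gated fusion computation is shown below, assuming flattened branch features of shape (B, D); the latent dimension, gate parameterization, and pooling size are illustrative assumptions rather than the paper's exact design.

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_fusion(f_img, f_fre, latent_dim=256):
    """Learned gate S weighs the time- and frequency-domain branches
    (F = S * F_img' + (1 - S) * F_fre'), and max-pooled branch features are
    concatenated back in (a sketch)."""
    concat = tf.concat([f_img, f_fre], axis=-1)
    # Project each branch into a shared latent space with tanh encoding.
    h_img = layers.Dense(latent_dim, activation="tanh")(f_img)
    h_fre = layers.Dense(latent_dim, activation="tanh")(f_fre)
    # Gate in [0, 1] computed from both encoded branches and the concatenation.
    gate_in = tf.concat([h_img, h_fre,
                         layers.Dense(latent_dim, activation="tanh")(concat)], axis=-1)
    s = layers.Dense(latent_dim, activation="sigmoid")(gate_in)
    fused = s * h_img + (1.0 - s) * h_fre                        # adaptive weighting
    # Preserve strong responses from each branch via max pooling over feature groups.
    pool_img = layers.MaxPooling1D(pool_size=4)(f_img[..., None])
    pool_fre = layers.MaxPooling1D(pool_size=4)(f_fre[..., None])
    return tf.concat([fused,
                      tf.squeeze(pool_img, axis=-1),
                      tf.squeeze(pool_fre, axis=-1)], axis=-1)   # F_M

# Example with random 512-D branch features.
f_m = gated_fusion(tf.random.normal([2, 512]), tf.random.normal([2, 512]))
```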

3.3. Loss Function

To enhance intra-class compactness and inter-class separability in SAR image recognition, we adopt the classic AAMLoss as the loss function. For readability and completeness, we briefly review the AAMLoss. The commonly used softmax loss function can be formulated as:
$$L_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}},$$
where $x_i \in \mathbb{R}^d$ is the deep feature of the $i$-th sample before the final fully connected layer, $y_i \in \{1, \ldots, n\}$ is its ground-truth label, $W \in \mathbb{R}^{d \times n}$ is the weight matrix with $W_j$ denoting the $j$-th class vector, $b$ is the bias term, $N$ is the batch size, and $n$ is the number of classes.
To better model the angular relationship among features, we first set $b = 0$ and rewrite the inner product using cosine similarity:
$$W_j^T x_i = \|W_j\| \cdot \|x_i\| \cdot \cos(\theta_j),$$
where $\theta_j$ is the angle between $x_i$ and $W_j$. By applying $L_2$ normalization to both $x_i$ and $W_j$, and scaling the result by a constant $s > 0$, the softmax loss becomes:
$$L_{\cos} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i})}}{\sum_{j=1}^{n} e^{s \cos(\theta_j)}}.$$
To further strengthen the decision boundary between classes, AAMLoss introduces an angular margin $m$ on the target-class angle $\theta_{y_i}$, enforcing the network to satisfy a stricter classification criterion. The AAMLoss is defined as:
$$L_{\text{AAM}} = \begin{cases} -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \log \dfrac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos(\theta_j)}}, & \text{if } \theta_{y_i} + m \leq \pi, \\[2ex] -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \log \dfrac{e^{s \cos(\theta_{y_i} - m)}}{e^{s \cos(\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{s \cos(\theta_j)}}, & \text{otherwise.} \end{cases}$$
The correction term in the second case ensures that $L_{\text{AAM}}$ remains monotonically increasing and numerically stable when the angular sum $\theta_{y_i} + m > \pi$.
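For reference, a minimal sketch of the AAMLoss computation is given below. The scale $s = 30$ and margin $m = 0.5$ are typical values from the literature rather than the paper's settings, and the piecewise correction for $\theta_{y_i} + m > \pi$ is replaced here by a simple clipping of the angle.

```python
import math
import tensorflow as tf

def aam_loss(features, labels, class_weights, s=30.0, m=0.5):
    """Additive angular margin loss on L2-normalised embeddings and class
    prototypes (a simplified sketch)."""
    x = tf.math.l2_normalize(features, axis=1)                  # (B, d)
    w = tf.math.l2_normalize(class_weights, axis=0)             # (d, n)
    cos_theta = tf.clip_by_value(tf.matmul(x, w), -1.0 + 1e-7, 1.0 - 1e-7)
    theta = tf.acos(cos_theta)                                  # angles theta_j
    # Add the margin m only to the target-class angle, keeping theta + m <= pi.
    one_hot = tf.one_hot(labels, depth=tf.shape(class_weights)[1], dtype=theta.dtype)
    theta_m = tf.clip_by_value(theta + m * one_hot, 0.0, math.pi - 1e-7)
    logits = s * tf.cos(theta_m)                                # scaled cosine logits
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Example: 4 samples, 16-D embeddings, 3 classes.
loss = aam_loss(tf.random.normal([4, 16]),
                tf.constant([0, 1, 2, 1]),
                tf.Variable(tf.random.normal([16, 3])))
```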

4. Experiments and Results

4.1. Datasets

4.1.1. Sentinel-1 Dataset

The Sentinel-1 dataset [51] is derived from single-look complex (SLC) SAR images acquired in StripMap mode by the Sentinel-1 satellite. Operated by the European Space Agency (ESA), the Sentinel-1 mission employs C-band SAR and supports both horizontal and vertical polarizations, offering data in four imaging modes. In this study, we use StripMap mode imagery from beams S3 and S4. We manually annotated eight categories by cropping patches of size 64 × 64 pixels from the SLC images. These categories include three natural surface types—forest, agriculture, and water—and five artificial land-use types—industrial buildings, storage tanks, containers, residential buildings, and skyscrapers. The resulting dataset comprises 2550 well-balanced image patches, with approximately 300 samples per class.

4.1.2. Open-SARShip Dataset

The Open-SARShip dataset [52] consists of SAR ship images extracted from 41 Sentinel-1 scenes using VV and VH polarizations. It contains 11,346 samples spanning 17 ship categories. Due to the relatively low spatial resolution, the dataset presents high intra-class variability and low inter-class separability. Moreover, it suffers from significant class imbalance, with the cargo ship category alone accounting for 72.47% of the total samples. To improve class balance and ensure valid training and testing splits, underrepresented categories were excluded. Ultimately, three representative classes—cargo ships, tankers, and other ship types—were retained, with 1750, 489, and 267 samples, respectively. A total of 2506 64 × 64 VV/VH SAR patches were selected for experiments.

4.1.3. FUSAR-Ship Dataset

The FUSAR-Ship dataset [53], acquired by the Gaofen-3 satellite, contains a richer variety of ship types than the Open-SARShip dataset. Following [40], our experiments use the ship categories BulkCarrier, ContainerShip, Fishing, GeneralCargo, Othercargo, Tanker, and Others, with the same data preprocessing procedures and training–testing ratios applied.

4.1.4. SAR-AIRcraft-1.0 Dataset

SAR-AIRcraft-1.0 [54] is a high-resolution dataset derived from the Gaofen-3 satellite, comprising 4368 images and 16,463 instances of aircraft targets across seven fine-grained identification categories: A220, A320/321, A330, ARJ21, Boeing 737, Boeing 787, and Other.

4.2. Experimental Settings

All experiments are implemented in the TensorFlow deep learning framework, and the proposed MTRFN is trained on a single NVIDIA GeForce RTX 3090 GPU. The window size of the time-frequency analysis is set to 32 × 32, and the window overlap is set to 16 [24]. ResNet18 is used as the network backbone. AAMLoss is used for training, and the weight decay applied to all convolutional layers is set to 0.0005. To optimize the network parameters, we adopt the stochastic gradient descent (SGD) optimizer with a mini-batch strategy. We train the network with a batch size of 4 and a maximum epoch number of 100. Hyperparameters are set based on experience with classic network parameter settings [16,17,18,19,20]. Detailed parameter settings are provided in Table 1.

4.3. Performance Comparison with Reference Methods

To verify the effectiveness of the proposed MTRFN, four baseline methods are employed for comparison: the deep SAR network (DSN) proposed in [24], complex-valued CNN (CV-CNN) [13], scattering topology network (ST-Net) [55], and multidimensional feature joint learning framework (MFJL) [44]. To ensure the evaluation is more robust and convincing, we randomly split the dataset into training and testing sets ten times for each experiment, recorded all results and reported the average performance metrics.
Table 2 lists the target recognition accuracy of the test methods on the Open-SARShip dataset with different percentages of training samples. We can see that MTRFN consistently outperforms both DSN and CV-CNN across all training sample ratios. Specifically, as the training ratio increases from 30% to 100%, MTRFN achieves a performance improvement of 14.15%. In comparison, DSN improves by 18.20% and CV-CNN by 19.78%, but the overall accuracy of CV-CNN remains lower than that of DSN and MTRFN. Moreover, MTRFN achieves an overall accuracy of 78.65%, a 2.05% absolute improvement over DSN (76.60%). This demonstrates that MTRFN effectively enhances recognition performance, particularly in scenarios with small-scale and low-scattering targets, by integrating the MRF module and the gated fusion mechanism. It is also observed that all models suffer a performance drop as the training sample size decreases, confirming the challenge of limited-data SAR target recognition. Nevertheless, MTRFN achieves the highest accuracy under all training ratios owing to its ability to capture both spatial- and spectral-domain information, even in small-sample learning situations.
Table 3 reports the recognition performance of the test methods on the four datasets. Four evaluation metrics are used: recall, precision, F1 score, and accuracy. We can clearly see that the proposed MTRFN achieves the highest metrics on almost all four datasets. For example, the improvement in precision over MFJL [44] can be as high as 3.06%. Both the proposed MTRFN and MFJL outperform the other methods significantly and achieve a nearly 100% recognition rate on the SAR-AIRcraft-1.0 dataset.
We can conclude that the proposed MTRFN achieves accuracies of 93.58%, 78.65%, 90.03%, and 99.65% on the Sentinel-1, Open-SARShip, FUSAR-Ship, and SAR-AIRcraft-1.0 datasets, respectively, outperforming traditional CNN-based models. This validates the effectiveness of the multi-scale fusion and gated mechanisms in enhancing SAR image classification.

4.4. Ablation Study

Table 4 shows the classification accuracy when gated fusion is not included. It can be observed that accuracy generally improves as the number of training samples increases, with the best performance (78.33% on average) achieved when 100% of the samples are used. The Cargo class attains the highest accuracy (81.00%). Comparing the rows with and without AAMLoss, the use of AAMLoss improves the average accuracy, demonstrating its effectiveness in enhancing intra-class compactness and inter-class separation.
As shown in Table 5, when MRF is excluded, the highest overall accuracy (77.73%) occurs with 100% training samples and the use of Gate Fusion only (without AAMLoss). Interestingly, introducing AAMLoss in this case slightly reduces performance in some scenarios. This indicates that while AAMLoss is generally beneficial for temporal domain features (e.g., image-based intensity), it may degrade performance when only frequency-domain features are used. This could be due to the difficulty in optimizing angular margins in high-dimensional spectral representations, which are less separable than spatial textures in SAR imagery.
Comparing Table 2 with Table 4, we can see that the proposed gated fusion helps improve target recognition performance. Comparing Table 2 with Table 5, similar results can also be observed, and the improvements are even larger. Comparing Table 4 and Table 5, it is evident that the MRF module contributes more significantly to performance gains than the gated fusion module. The best performance is obtained when all modules—MRF, gated fusion, and AAMLoss—are jointly utilized, validating the design of the full MTRFN framework.

4.5. Model Parameters and Computation Complexity

The model parameters and computational complexity directly determine its operating efficiency. In this section, the number of training parameters, floating-point operations (FLOPs), and average inference time are used as evaluation metrics for comparison. The specific comparisons are shown in Table 6. From these comparisons, it can be seen that MFJL has the largest number of parameters, the highest FLOPs, and the slowest inference speed among the test methods. The other methods demonstrate comparable performance across the evaluation metrics, while the proposed MTRFN offers clear advantages, including significantly fewer parameters, relatively lower FLOPs, and reduced inference time.

4.6. Visualization Analysis and Interpretability Discussion

To validate that the proposed MTRFN model effectively guides the network to focus on the salient target regions within SAR images, Figure 10a presents the visual explanation results for Other Type ship targets from the Open-SARShip dataset using three popular interpretability techniques: Class Activation Mapping (CAM), Grad-CAM, and Score-CAM. In each visualization, the red regions denote areas that contribute positively to the final prediction. We compare three network configurations: (1) Ablated Network: the baseline DSN model without the proposed MRF module or the gated fusion mechanism; (2) Concatenation Network: a variant that incorporates the MRF module using simple concatenation fusion; (3) Gated Fusion Network (MTRFN): the full model integrating both MRF and the gated fusion mechanism.
We can see that the ablated DSN model exhibits activation maps with scattered attention, primarily focusing on irrelevant background regions rather than the target. The concatenation-based MRF fusion network partially improves this by shifting attention toward both target and background regions, indicating some enhancement in feature learning. The MTRFN network, with gated fusion, concentrates its attention strongly on the actual target areas, demonstrating that the gated mechanism effectively suppresses noise and enhances discriminative region learning.
Similar conclusions can be drawn from the Grad-CAM and Score-CAM visualizations. The baseline network tends to overemphasize background textures, while the proposed feature fusion strategies help the network localize and prioritize relevant target structures. These results indicate that the proposed MTRFN model significantly improves the interpretability of the SAR classification process and enhances the model’s ability to extract meaningful target features. However, the same visualization analysis applied to the Container class from the Sentinel-1 dataset reveals less consistent results, as shown in Figure 10b. Compared to Open-SARShip, the distinction between targets and background in Sentinel-1 is much lower, leading to ambiguous attention maps. The network struggles to localize texture-specific regions accurately in this dataset.
To further analyze the effectiveness of frequency-domain features extracted from subband SAR spectrograms, we perform qualitative interpretability analysis by jointly examining the network prediction results and corresponding SAR images. This approach enables an intuitive correlation between the network’s decision patterns and the original data across different network architectures.
We selected samples from forest and agricultural areas for interpretability analysis. The x and y axes of the Doppler frequency-domain spectrum represent the center frequencies in the range and azimuth directions, and the z axis represents the frequency-domain amplitude at that location. Figure 11 shows the spectrograms of forest (two samples in the top panel) and agriculture (two samples in the bottom panel). The Doppler-based frequency-domain spectrograms of natural targets in Figure 11 reveal several key characteristics: (1) the feature boundaries between agricultural and forest areas are relatively ambiguous; (2) both target types exhibit similar backscattering patterns in their frequency spectra, making them difficult to distinguish based on amplitude alone; and (3) the DSN model lacks the capacity to extract sufficient environmental context purely from frequency features, which limits its ability to disambiguate natural surface patterns in SAR imagery.
We selected three distinct man-made targets: residential buildings, industrial buildings, and containers. Although these targets differ in semantic category and structural intent, their frequency-domain characteristics are notably similar, as illustrated in Figure 12. The red circles in the figure highlight rectangular-shaped textures that may confuse CNNs relying on spatial intensity features. All three exhibit elongated stripe-like structures. Rectangular textures in the residential scenes indicate urban block patterns, and bright regions within the rectangles reflect strong backscatter from dense housing structures. Rectangular patterns in the industrial-building scenes represent flat rooftops; brighter linear features on the left denote strong backscatter at specific incidence angles due to rooftop edges. The bright rectangular boundaries in the container scenes likely stem from multipath reflections between stacked containers.
Although intensity images exhibit visually similar shapes (highlighted by red circles), the corresponding spectral magnitude distributions differ significantly across the three targets when plotted in 3D space. In residential areas, consistent frequency response suggests isotropic backscattering, possibly due to multiple two-bounce reflections from vertical surfaces such as walls or lamp posts on rough terrain. Conversely, industrial buildings with highly regular structures exhibit range-varying spectral patterns. Containers also demonstrate distinct scattering behaviors, unlike residential targets.
These findings suggest that in the domain of man-made targets, frequency-domain analysis significantly enhances class separability. The MTRFN model, leveraging both physical priors and learned features, achieves superior recognition performance compared to the purely data-driven CV-CNN model, thereby validating the interpretability advantages of hybrid (physics-aware + data-driven) deep learning frameworks.

5. Conclusions

This paper proposes a target recognition network based on multi-scale time-frequency domain feature fusion, aiming to improve the accuracy of ship target recognition and network interpretability in SAR images by jointly modeling time-frequency domain information. First, in the multi-scale feature fusion module, the radar subview spectrogram generation method is combined with the coordinate attention mechanism to dynamically capture the spatial-spectral feature distribution of targets, and the MRF is used to achieve synergistic enhancement of local details and global structure. Second, a time-frequency domain encoder is constructed to extract time-domain waveform and frequency-domain spectrogram features separately, and the gated fusion network is used to adaptively integrate multimodal information and suppress noise interference. Furthermore, the classic AAMLoss is employed to sharpen the decision boundaries between different target classes.
In the comparative experimental analysis, the effectiveness of the proposed method is verified on the Open-SARShip dataset. Compared with the reference methods, the recognition accuracy for ship targets on the Open-SARShip dataset is increased from 76.60% to 78.65%. In the ablation experiments, the contribution of each component is verified by ablating the MRF, gated fusion, and AAMLoss structures. Finally, the CAM heat-map visualization results further demonstrate the network’s ability to focus on the key frequency-domain components of the target. The network structure with gated fusion shows the strongest interpretability, but its generalization on the Sentinel-1 dataset still needs to be improved.

Author Contributions

Conceptualization, H.L.; methodology, H.L., Z.X. and J.Y.; software, H.L.; validation, H.L., Z.X. and L.Z.; formal analysis, H.L., Z.X. and L.Z.; investigation, H.L.; resources, L.Z.; data curation, L.Z. and J.Y.; writing—original draft preparation, Z.X.; writing—review and editing, H.L., Z.X. and J.Y.; visualization, Z.X. and L.Z.; supervision, H.L. and J.Y.; project administration, H.L. and L.Z.; funding acquisition, H.L. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Chongqing Natural Science Foundation under Grant CSTB2025NSCQ-GPX0743, in part by the National Natural Science Foundation of China under Grants 62301164, 62222102 and 62171023, and in part by the National Key Research and Development Program of China under Grant 2024YFB3909800.

Data Availability Statement

The data are contained within the paper.

Acknowledgments

The authors would also like to thank the reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kechagias-Stamatis, O.; Aouf, N. Fusing deep learning and sparse coding for SAR ATR. IEEE Trans. Aerosp. Electron. Syst. 2019, 55, 785–797. [Google Scholar] [CrossRef]
  2. Kechagias-Stamatis, O.; Aouf, N.; Richardson, M.A. 3D automatic target recognition for future LIDAR missiles. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 2662–2675. [Google Scholar] [CrossRef]
  3. Ding, B.; Wen, G.; Zhong, J.; Ma, C.; Yang, X. A robust similarity measure for attributed scattering center sets with application to SAR ATR. Neurocomputing 2017, 219, 130–143. [Google Scholar] [CrossRef]
  4. Li, T.; Du, L. SAR automatic target recognition based on attribute scattering center model and discriminative dictionary learning. IEEE Sens. J. 2019, 19, 4598–4611. [Google Scholar] [CrossRef]
  5. Ding, B.; Wen, G.; Huang, X.; Ma, C.; Yang, X. Target recognition in synthetic aperture radar images via matching of attributed scattering centers. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2017, 10, 3334–3347. [Google Scholar] [CrossRef]
  6. Zhang, L.; Leng, X.; Feng, S.; Ma, X.; Ji, K.; Kuang, G.; Liu, L. Optimal azimuth angle selection for limited SAR vehicle target recognition. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103707. [Google Scholar] [CrossRef]
  7. Wei, Y.; Jiao, L.; Liu, F.; Yang, S.; Wu, Q.; Sanga, G. Fast DDL classification for SAR Images with L_inf constraint. IEEE Access 2019, 7, 68991–69006. [Google Scholar] [CrossRef]
  8. Zhang, Z.; Liu, S. Joint sparse representation for multi-resolution representations of SAR images with application to target recognition. J. Electromagn. Waves Appl. 2018, 32, 1342–1353. [Google Scholar] [CrossRef]
  9. Liu, S.; Zhan, R.; Zhai, Q.; Wang, W.; Zhang, J. Multiview radar target recognition based on multitask compressive sensing. J. Electromagn. Waves Appl. 2015, 29, 1917–1934. [Google Scholar] [CrossRef]
  10. Huang, Y.; Liao, G.; Zhang, Z.; Xiang, Y.; Li, J.; Nehorai, A. SAR automatic target recognition using joint low-rank and sparse multiview denoising. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1570–1574. [Google Scholar] [CrossRef]
  11. Ren, H.; Yu, X.; Zou, L.; Zhou, Y.; Wang, X. Class-oriented local structure preserving dictionary learning for SAR target recognition. Proc. IEEE Int. Geosci. Remote Sens. Symp. 2019, 1338–1341. [Google Scholar]
  12. Lin, H.; Wang, H.; Xu, F.; Jin, Y.Q. Target recognition for SAR images enhanced by polarimetric information. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5204516. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Wang, H.; Xu, F.; Jin, Y.Q. Complex-valued convolutional neural network and its application in polarimetric SAR image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7177–7188. [Google Scholar] [CrossRef]
  14. Lin, H.; Yang, J.; Xu, F. PolSAR Target Recognition with CNNs Optimizing Discrete Polarimetric Correlation Pattern. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5212914. [Google Scholar] [CrossRef]
  15. Lin, H.; Yin, J.; Yang, J.; Xu, F. Interpreting Neural Network Pattern with Pruning for PolSAR Target Recognition. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5227114. [Google Scholar] [CrossRef]
  16. Chen, S.; Wang, H.; Xu, F.; Jin, Y.Q. Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4806–4817. [Google Scholar] [CrossRef]
  17. Wang, J.; Zheng, T.; Lei, P.; Bai, X. Ground target classification in noisy SAR images using convolutional neural networks. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 2018, 11, 4180–4192. [Google Scholar] [CrossRef]
  18. Kwak, Y.; Song, W.J.; Kim, S.E. Speckle-noise-invariant convolutional neural network for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2019, 16, 549–553. [Google Scholar] [CrossRef]
19. Zhong, C.; Mu, X.; He, X.; Wang, J.; Zhu, M. SAR target image classification based on transfer learning and model compression. IEEE Geosci. Remote Sens. Lett. 2019, 16, 412–416.
20. Amrani, M.; Jiang, F. Deep feature extraction and combination for synthetic aperture radar target classification. J. Appl. Remote Sens. 2017, 11, 1–13.
21. Pei, J.; Huang, Y.; Huo, W.; Zhang, Y.; Yang, J.; Yeo, T.S. SAR automatic target recognition based on multiview deep learning framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2196–2210.
22. Wang, C.; Pei, J.; Luo, S.; Huo, W.; Huang, Y.; Zhang, Y.; Yang, J. SAR ship target recognition via multiscale feature attention and adaptive-weighed classifier. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4003905.
23. Zhao, C.; Zhang, S.; Luo, R.; Feng, S.; Kuang, G. Scattering features spatial-structural association network for aircraft recognition in SAR images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4006505.
24. Huang, Z.; Datcu, M.; Pan, Z.; Lei, B. Deep SAR-Net: Learning objects from signals. ISPRS J. Photogramm. Remote Sens. 2020, 161, 179–193.
25. Dong, G.; Liu, H.; Kuang, G.; Chanussot, J. Target recognition in SAR images via sparse representation in the frequency domain. Pattern Recognit. 2019, 96, 106972.
26. Dong, G.; Wang, N.; Kuang, G.; Qiu, H. Sparsity and low-rank dictionary learning for sparse representation of monogenic signal. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 141–153.
27. Zhou, Z.; Wang, M.; Cao, Z.; Pi, Y. SAR image recognition with monogenic scale selection-based weighted multi-task joint sparse representation. Remote Sens. 2018, 10, 504.
28. Dong, G.; Kuang, G. Classification on the monogenic scale space: Application to target recognition in SAR image. IEEE Trans. Image Process. 2015, 24, 2527–2539.
29. Ding, B.; Wen, G.; Huang, X.; Ma, C.; Yang, X. Data augmentation by multilevel reconstruction using attributed scattering center for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2017, 14, 979–983.
30. Kechagias-Stamatis, O.; Aouf, N. Automatic target recognition on synthetic aperture radar imagery: A survey. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 56–81.
31. Tian, Z.; Wang, W.; Zhou, K.; Song, X.; Shen, Y.; Liu, S. Weighted pseudo-labels and bounding boxes for semisupervised SAR target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5193–5203.
32. Deng, J.; Wang, W.; Zhang, H.; Zhang, T.; Zhang, J. PolSAR ship detection based on superpixel-level contrast enhancement. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4008805.
33. Wang, J.; Quan, S.; Xing, S.; Li, Y.; Wu, H.; Meng, W. PSO-based fine polarimetric decomposition for ship scattering characterization. ISPRS J. Photogramm. Remote Sens. 2025, 220, 18–31.
34. Belloni, C.; Aouf, N.; Le Caillec, J.M.; Merlet, T. Comparison of descriptors for SAR ATR. In Proceedings of the 2019 IEEE Radar Conference (RadarConf), Boston, MA, USA, 22–26 April 2019; pp. 1–6.
35. Amrani, M.; Jiang, F.; Xu, Y.; Liu, S.; Zhang, L. SAR oriented visual saliency model and directed acyclic graph support vector metric based target classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3794–3810.
36. Bolourchi, P.; Moradi, M.; Demirel, H.; Uysal, S. Improved SAR target recognition by selecting moment methods based on Fisher score. Signal Image Video Process. 2020, 14, 39–47.
37. Zhang, X.; Qin, J.; Li, G. SAR target classification using Bayesian compressive sensing with scattering centers features. PIER 2013, 136, 385–407.
38. Dang, S.; Cui, Z.; Cao, Z.; Liu, N. SAR target recognition via incremental nonnegative matrix factorization. Remote Sens. 2018, 10, 374.
39. Zhang, X.; Wang, Y.; Li, D.; Tan, Z.; Liu, S. Fusion of multifeature low-rank representation for synthetic aperture radar target configuration recognition. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1402–1406.
40. Zheng, H.; Hu, Z.; Yang, L.; Xu, A.; Zheng, M.; Zhang, C.; Li, K. Multifeature collaborative fusion network with deep supervision for SAR ship classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5212614.
41. Gao, F.; Huang, T.; Sun, J.; Wang, J.; Hussain, A.; Yang, E. A new algorithm for SAR image target recognition based on an improved deep convolutional neural network. Cogn. Comput. 2019, 11, 809–824.
42. Liu, Y.; Zhang, F.; Ma, L.; Ma, F. Long-tailed SAR target recognition based on expert network and intraclass resampling. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4010405.
43. Yue, Z.; Gao, F.; Xiong, Q.; Wang, J.; Huang, T.; Yang, E.; Zhou, H. A novel semi-supervised convolutional neural network method for synthetic aperture radar image recognition. Cogn. Comput. 2021, 13, 795–806.
44. Cui, Z.; Mou, L.; Zhou, Z.; Tang, K.; Yang, Z.; Cao, Z.; Yang, J. Feature joint learning for SAR target recognition. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5216420.
45. Gao, G.; Dai, Y.; Zhang, X.; Duan, D.; Guo, F. ADCG: A cross-modality domain transfer learning method for synthetic aperture radar in ship automatic target recognition. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5109114.
46. Wan, H.; Chen, J.; Huang, Z.; Xia, R.; Wu, B.; Sun, L.; Yao, B.; Liu, X.; Xing, M. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5219514.
47. Shao, J.; Qu, C.; Li, J.; Peng, S. A lightweight convolutional neural network based on visual attention for SAR image target classification. Sensors 2018, 18, 3039.
48. Lang, P.; Fu, X.; Feng, C.; Dong, J.; Qin, R.; Martorella, M. LW-CMDANet: A novel attention network for SAR automatic target recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6615–6630.
49. Hwang, W.; Wang, H.; Kim, H.; Kee, S.C.; Kim, J. Face recognition system using multiple face model of hybrid Fourier feature under uncontrolled illumination variation. IEEE Trans. Image Process. 2010, 20, 1152–1165.
50. Spigai, M.; Tison, C.; Souyris, J.C. Time-frequency analysis in high-resolution SAR imagery. IEEE Trans. Geosci. Remote Sens. 2011, 49, 2699–2711.
51. Torres, R.; Snoeij, P.; Geudtner, D.; Bibby, D.; Davidson, M.; Attema, E.; Potin, P.; Rommen, B.; Floury, N.; Brown, M.; et al. GMES Sentinel-1 mission. Remote Sens. Environ. 2012, 120, 9–24.
52. Huang, L.; Liu, B.; Li, B.; Guo, W.; Yu, W.; Zhang, Z.; Yu, W. OpenSARShip: A dataset dedicated to Sentinel-1 ship interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 11, 195–208.
53. Hou, X.; Ao, W.; Song, Q.; Lai, J.; Wang, H.; Xu, F. FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition. Sci. China Inf. Sci. 2020, 63, 140303.
54. Zhirui, W.; Yuzhuo, K.; Xuan, Z.; Yuelei, W.; Ting, Z.; Xian, S. SAR-AIRcraft-1.0: High-resolution SAR aircraft detection and recognition dataset. J. Radars 2023, 12, 906–922.
55. Kang, Y.; Wang, Z.; Zuo, H.; Zhang, Y.; Yang, Z.; Sun, X.; Fu, K. ST-Net: Scattering topology network for aircraft classification in high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5202117.
Figure 1. Multi-scale representation fusion.
Figure 2. Subband scattering pattern generation.
Figure 3. SAR image and radar spectrogram of a cargo ship target. (a) SAR image. (b) Radar spectrogram.
Figure 4. Six subband 2D projections of the radar spectrogram.
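As a point of reference for Figures 2–4, the sketch below illustrates one common way to obtain sub-band 2D projections from a complex-valued SAR chip: an azimuth (Doppler) FFT, followed by splitting the spectrum into equal sub-bands and inverting each one. This is a minimal NumPy sketch under stated assumptions, not the authors' exact processing chain; the function name, the number of bands, and the choice of the azimuth axis are illustrative.

```python
import numpy as np

def subband_projections(chip: np.ndarray, n_bands: int = 6, axis: int = 0):
    """Split the azimuth (Doppler) spectrum of a complex SAR chip into
    n_bands sub-bands and return the magnitude image of each sub-band."""
    spec = np.fft.fftshift(np.fft.fft(chip, axis=axis), axes=axis)  # azimuth FFT
    edges = np.linspace(0, chip.shape[axis], n_bands + 1, dtype=int)
    views = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spec)
        sl = [slice(None)] * chip.ndim
        sl[axis] = slice(lo, hi)
        band[tuple(sl)] = spec[tuple(sl)]            # keep one Doppler sub-band
        sub = np.fft.ifft(np.fft.ifftshift(band, axes=axis), axis=axis)
        views.append(np.abs(sub))                    # sub-band 2D projection
    return views

# Example: six sub-views of a 128 x 128 complex chip (random data as a stand-in)
chip = (np.random.randn(128, 128) + 1j * np.random.randn(128, 128)).astype(np.complex64)
views = subband_projections(chip, n_bands=6)
print(len(views), views[0].shape)  # -> 6 (128, 128)
```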
Figure 5. Diagram of the coordinate attention mechanism.
Figure 6. Network layer structure of the coordinate attention mechanism.
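For readers unfamiliar with the block sketched in Figures 5 and 6, the following is a compact PyTorch sketch of a generic coordinate attention module: directional average pooling along height and width, a shared 1 × 1 encoding, and per-axis sigmoid re-weighting. The channel count, reduction ratio, and activation choice are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Pool along H and W separately, encode the two direction-aware
    descriptors jointly, then re-weight the input feature map per axis."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                           # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

# Example: re-weight a 64-channel feature map
feat = torch.randn(4, 64, 32, 32)
print(CoordinateAttention(64)(feat).shape)  # torch.Size([4, 64, 32, 32])
```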
Figure 7. Workflow of the proposed MTRFN.
Figure 8. Network architecture of the time-frequency domain encoder.
Figure 9. Network architecture of the gated fusion network.
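The gated fusion network of Figure 9 adaptively combines the image-domain and time-frequency-domain features before classification. The snippet below is a minimal sketch of one common gating pattern (a learned sigmoid gate that re-weights the two feature streams before concatenation); it is an assumption-level illustration rather than the exact architecture in Figure 9, and the feature dimension is a placeholder.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptive feature concatenation: a learned sigmoid gate re-weights the
    two feature streams before they are concatenated for classification."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_img, feat_tf):
        g = self.gate(torch.cat([feat_img, feat_tf], dim=-1))          # (B, dim), values in [0, 1]
        return torch.cat([g * feat_img, (1.0 - g) * feat_tf], dim=-1)  # (B, 2 * dim)

# Example with 256-dimensional features from the two encoders
fused = GatedFusion(256)(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 512])
```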
Figure 10. Visualization results of different fusion strategies on (a) the OpenSARShip and (b) Sentinel-1 datasets.
Figure 11. Interpretable analysis of the spectral features of natural targets. In the Doppler frequency-domain spectrum, the x and y axes denote the center of the frequency range and the azimuth, respectively, and the z axis denotes the frequency-domain amplitude at that location.
Figure 12. Interpretable analysis of three different categories of targets sharing similar spectral features. In the Doppler frequency-domain spectrum, the x and y axes denote the center of the frequency range and the azimuth, respectively, and the z axis denotes the frequency-domain amplitude at that location.
Table 1. Parameter settings for the proposed MTRFN.

Parameter       Value
Batch size      4
Base_lr         1 × 10⁻⁴
Weight decay    5 × 10⁻⁴
Epochs          100
Optimizer       SGD
Loss function   AAMLoss
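A minimal sketch of how the settings in Table 1 translate into code is given below. The AAMLoss shown is a generic ArcFace-style additive angular margin loss; the scale s, margin m, the stand-in backbone, and the class and feature dimensions are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMLoss(nn.Module):
    """Additive angular margin loss: adds a margin m to the target angle
    before a scaled softmax cross-entropy (ArcFace-style formulation)."""
    def __init__(self, feat_dim: int, n_classes: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

# Optimizer settings mirroring Table 1 (model and dimensions are placeholders)
model = nn.Linear(512, 128)                    # stand-in for the MTRFN feature extractor
criterion = AAMLoss(feat_dim=128, n_classes=3)
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(criterion.parameters()),
    lr=1e-4, weight_decay=5e-4)

# One dummy step with a batch size of 4, as in Table 1; training runs for 100 epochs
feats = model(torch.randn(4, 512))
loss = criterion(feats, torch.tensor([0, 1, 2, 0]))
loss.backward()
optimizer.step()
```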
Table 2. Target recognition accuracy of the test methods on the Open-SARShip dataset with different percentages of training samples (%).

Training Samples   DSN [24]   CV-CNN [13]   ST-Net [55]   MFJL [44]   Proposed
100%               76.60      75.82         76.30         78.02       78.65
70%                76.00      75.54         73.18         75.30       77.50
50%                72.40      68.41         71.28         71.55       75.20
30%                64.80      63.30         67.55         65.97       68.90
Table 3. Comparison with the reference methods on the four datasets.

Sentinel-1 dataset
Method        Recall (%)   Precision (%)   F1 (%)   Accuracy (%)
DSN [24]      92.13        92.06           92.04    92.21
CV-CNN [13]   84.09        85.01           84.56    84.21
ST-Net [55]   90.11        89.54           89.80    89.95
MFJL [44]     91.89        93.96           92.90    92.94
Proposed      92.66        94.67           93.63    93.58

Open-SARShip dataset
Method        Recall (%)   Precision (%)   F1 (%)   Accuracy (%)
DSN [24]      73.86        71.52           72.77    76.60
CV-CNN [13]   74.72        69.70           72.14    75.82
ST-Net [55]   76.89        72.38           74.61    76.30
MFJL [44]     77.86        73.42           75.58    78.02
Proposed      78.53        76.46           77.51    78.65

FUSAR-Ship dataset
Method        Recall (%)   Precision (%)   F1 (%)   Accuracy (%)
DSN [24]      83.19        83.34           83.31    83.21
CV-CNN [13]   80.59        80.83           80.83    80.57
ST-Net [55]   82.43        82.29           82.51    82.46
MFJL [44]     88.54        87.86           88.23    87.59
Proposed      89.67        89.89           89.84    90.03

SAR-AIRcraft-1.0 dataset
Method        Recall (%)   Precision (%)   F1 (%)   Accuracy (%)
DSN [24]      95.94        97.88           96.89    97.09
CV-CNN [13]   91.97        92.24           92.20    93.42
ST-Net [55]   93.35        96.20           94.84    96.47
MFJL [44]     99.61        99.57           99.60    99.51
Proposed      99.57        99.71           99.75    99.65
Table 4. Recognition accuracy (%) without the gated fusion network on the Open-SARShip dataset.

Samples   Method   AAMLoss   Cargo   Tanker   Other Type   Average
100%      MRF      ×         77.80   77.00    75.20        76.67
100%      Ours     ✓         81.00   78.50    76.30        78.33
70%       MRF      ×         78.93   77.50    76.50        77.69
70%       Ours     ✓         79.15   76.21    76.20        77.30
50%       MRF      ×         76.40   73.71    72.83        74.39
50%       Ours     ✓         75.71   74.78    73.89        74.90
30%       MRF      ×         69.18   66.42    65.74        67.07
30%       Ours     ✓         68.33   67.61    66.56        67.46
Table 5. Recognition accuracy (%) without the MRF module on the Open-SARShip dataset.

Samples   Method        AAMLoss   Cargo   Tanker   Other Type   Average
100%      Gate Fusion   ×         80.20   77.50    75.50        77.73
100%      Ours          ✓         77.30   76.40    76.80        76.73
70%       Gate Fusion   ×         76.82   75.31    73.07        75.21
70%       Ours          ✓         77.33   74.87    75.51        76.03
50%       Gate Fusion   ×         75.81   74.10    73.28        74.40
50%       Ours          ✓         76.01   74.10    73.22        74.53
30%       Gate Fusion   ×         63.31   62.22    61.18        62.13
30%       Ours          ✓         66.40   66.91    64.05        65.10
Table 6. Comparison with the reference methods in terms of model parameters, computational complexity, and inference time.

Method        Params (M)   FLOPs (G)   Average Inference Time (ms)
DSN [24]      2.75         6.72        8.91
CV-CNN [13]   1.22         5.10        7.31
ST-Net [55]   1.95         7.09        8.32
MFJL [44]     5.08         17.57       11.20
MTRFN         1.28         5.46        7.19
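For reproducibility, the parameter counts and average inference times reported in Table 6 can in principle be measured with utilities such as the following PyTorch sketch; FLOPs counting is usually delegated to a dedicated profiler and is omitted here. The toy network and the input size are placeholders, not the models compared in the table.

```python
import time
import torch
import torch.nn as nn

def count_params_millions(model: nn.Module) -> float:
    """Total number of trainable and non-trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def average_inference_ms(model: nn.Module, x: torch.Tensor, runs: int = 100) -> float:
    """Average forward-pass latency in milliseconds over `runs` repetitions."""
    model.eval()
    model(x)                                   # warm-up pass
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) * 1000.0 / runs

# Example with a toy CNN as a stand-in; the input size is illustrative
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.Linear(16 * 64 * 64, 3))
x = torch.randn(1, 1, 64, 64)
print(count_params_millions(net), average_inference_ms(net, x))
```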
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
