Applied Sciences
  • Article
  • Open Access

10 November 2025

MFFN-FCSA: Multi-Modal Feature Fusion Networks with Fully Connected Self-Attention for Radar Space Target Recognition

The College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI-Driven Computer Vision and Pattern Recognition: Challenges and Applications

Abstract

Radar space target recognition faces inherent challenges due to complex electromagnetic scattering properties and limited training samples. Conventional single-modality approaches cannot fully characterize targets because of information incompleteness, and existing multi-modal fusion methods often neglect the deep exploration of cross-modal feature correlations. To address these issues, this paper presents a novel multi-modal feature fusion network with fully connected self-attention (MFFN-FCSA) for robust radar space target recognition. The proposed framework integrates multi-modal radar data, including high-resolution range profiles (HRRPs) and inverse synthetic aperture radar (ISAR) images, to comprehensively exploit their complementary characteristics. Our MFFN-FCSA consists of three modules: parallel convolutional branches for modality-specific feature extraction of HRRPs and ISAR images, an FCSA-based fusion module for cross-modal feature fusion, and a classification head. Specifically, the designed FCSA fusion module simultaneously learns spatial and channel-wise dependencies via a fully connected self-attention mechanism, which enables learning dynamic weights of discriminative features across modalities. Furthermore, our end-to-end MFFN-FCSA model incorporates a composite loss function that combines a focal cross-entropy loss to address class imbalance and a triplet margin loss for enhanced metric learning. Experimental results on a space target dataset with 10 categories show that our model achieves higher recognition accuracy than related single-modality and existing fusion approaches, and in particular exhibits promising generalization capabilities in few-shot and polarization variation scenarios.

1. Introduction

Radar has been widely employed in numerous fields such as communication and sensing due to its all-weather, all-day operational capability. It operates by transmitting electromagnetic waves and receiving echoes, followed by signal processing to extract target characteristics and interpret information. With the rapid advancement of human space exploration, the increasing complexity and variability of space target surveillance have made it a critical research area for national territorial security. In particular, there is a growing demand for space target recognition technology. Space target recognition aims to identify the category or specific model of orbiting objects, thereby assessing their threat level. Thus, radar space target recognition plays a vital role in maintaining space security and ensuring the stable operation of space assets, making it a highly prioritized and intensively studied field [,,,].
Radar systems provide diverse types of data, including radar cross section (RCS) [], micro-Doppler signatures [,], high-resolution range profiles (HRRPs) [,,], synthetic aperture radar (SAR) imagery [,], and inverse synthetic aperture radar (ISAR) data [,,]. Among these, RCS and micro-Doppler signatures characterize the target scattering properties and micro-motion features. Moreover, HRRPs reveal abundant target structural characteristics along the range dimension, while ISAR images provide joint information in the range–azimuth plane, offering more comprehensive target characterization and enabling high-resolution target imaging and recognition [,,]. Thus, space target recognition based on HRRPs and ISAR data has become a significant research focus for achieving dependable space surveillance. HRRPs provide one-dimensional structural information of a target along the radar line of sight, representing the distribution of scattering centers, while ISAR imaging generates two-dimensional high-resolution images by exploiting target motion, revealing detailed features such as target size, shape, and structure. These two forms of data complement each other in capturing distinct characteristics of space targets. Existing radar space target recognition methods can be categorized into two approaches: traditional machine learning approaches and deep learning methods. Traditional machine learning approaches rely on manual features and classifiers, whereas deep learning methods perform end-to-end recognition with neural networks.
Existing radar target recognition methods mainly utilize single-modal HRRP or ISAR data, and they can be divided into two streams. On the one hand, traditional radar target recognition methods primarily consist of two independent steps: target feature extraction and classifier design. Radar targets can be represented with manual features, such as point scattering centers [], polarization signatures [], statistical and wavelet features [], and target scale and structure features []. However, these manual features rely on empirical knowledge and are restricted in real applications. Advances in machine learning have enabled the adoption of methods such as principal component analysis (PCA) [], linear discriminant analysis (LDA) [], and K-singular value decomposition (K-SVD) [], which can be utilized for learning either low-dimensional or over-complete representations of radar targets. Moreover, several classifiers have been proposed for target recognition, such as decision trees, support vector machines (SVMs) [], template matching algorithms [], and sparse representation classifiers []. On the other hand, neural network-based radar target recognition methods, which adopt a data-driven paradigm to automatically learn deep nonlinear features and demonstrate promising target characterization capabilities, have emerged as a predominant research focus. In [,], meta-learning-based recognition frameworks are designed to address the degradation of model generalizability under few-shot conditions. Moreover, Zhang et al. [] proposed a meta-learning framework employing a stacking network architecture, which achieved high-precision space target classification using limited ISAR images. In [], Liu et al. proposed a Taylor expansion method to efficiently model inter-layer transformations in DNNs, with HSP averaging for category-wise discriminative analysis, which showed promising performance for ISAR space target recognition. Zhang et al. [] proposed a novel bimodal graph transformer with a dual graph fusion network for space target recognition, which effectively exploits complementary information from multi-band and multi-angle RCS time series data, and experiments showed the high precision of the method in space target recognition.
However, the above methods predominantly rely on single-modal radar data, resulting in suboptimal utilization of target information. Emerging research has begun to explore multimodal radar data fusion, demonstrating promising recognition improvements in space target applications. In [], Dong et al. proposed HRPNet, a unified multimodal fusion framework that integrates high-resolution range profiles (HRRPs), radar cross section (RCS) measurements, and polarization (POL) data for enhanced target recognition, demonstrating superior classification accuracy. In addition, the feature fusion techniques applied across heterogeneous modalities critically impact recognition accuracy. Recently, numerous feature fusion strategies [,,] have been proposed, including early/late fusion paradigms, attention-based fusion mechanisms, and cross-modality feature embedding techniques, each offering distinct advantages for multimodal radar data integration. In [], Gao et al. developed a multiscale dual-branch feature fusion and attention network that achieves deep spatial–spectral feature integration through a novel shuffle attention mechanism, demonstrating significant performance improvements. Moreover, reference [] designed a multimodal feature fusion framework incorporating a soft-pooling channel attention mechanism, which adaptively compresses redundant low-weight features while preserving discriminative representations, showing superior performance. Furthermore, reference [] proposed a hybrid integrated feature fusion framework that combines handcrafted and deep features in an end-to-end manner, which shows enhanced discriminative capability compared to conventional approaches.
Although the aforementioned methods achieve satisfactory recognition performance, most of them rely solely on single-modality radar data of space targets, so their capability remains constrained. Considering the abundant complementary information of HRRPs and ISAR images, this paper proposes a multi-modal feature fusion network with fully connected self-attention (MFFN-FCSA) for space target recognition. Specifically, the proposed MFFN-FCSA comprises three modules: dual-branch parallel feature extraction, multimodal feature fusion, and target classification. In our model, the HRRP branch is constructed with a self-attention-enhanced 1dCNN to extract discriminative features related to the support regions of space targets, and the ISAR branch employs a robust deformable convolutional network (DCN) to learn robust deep features and thus handle significant variations in space targets. Crucially, we further design a fully connected self-attention-based multi-modal feature fusion module to jointly integrate cross-modal features across the spatial and channel dimensions of space targets. Different from traditional feature fusion techniques, the fully connected self-attention network automatically learns the correlation coefficients of the different modal data, which enables robust, efficient, and high-precision space target recognition. Ultimately, all parameters of our MFFN-FCSA are optimized jointly in an end-to-end manner. This work introduces three fundamental innovations:
(1)
The proposed model effectively leverages multi-modal radar data of space targets to deliver highly accurate recognition performance. Specifically, one-dimensional HRRPs capture precise radial structural signatures of the targets, while two-dimensional ISAR images provide detailed cross-range scattering distributions.
(2)
A novel FCSA mechanism is designed in our model to dynamically integrate cross-modal features of HRRPs and ISAR images by fine-grained spatial–channel interdependency, which enables learning more comprehensive feature representations by leveraging complementary information from the heterogeneous radar modalities.
(3)
An HRRP branch with an attention-1dCNN is proposed to learn one-dimensional structural signatures, and an ISAR branch with a robust-DCN is developed to learn robust geometric variation features of space targets, which together make full use of both radar modalities to learn representative features. Evaluations validate the exceptional generalization capabilities of our model under few-shot learning and polarization variation scenarios.
In the following, Section 2 shows the detailed architecture of our method, including the three modules, as well as the training and test phase. Then, Section 3 presents some experimental results of our method, including the recognition results and model generalization analysis. Finally, the whole paper is summarized in Section 4.

2. Proposed Methods

This paper proposes a multi-modal feature fusion recognition model incorporating a fully connected self-attention mechanism for radar space target recognition, addressing both the HRRP and ISAR image modalities. Figure 1 illustrates the overall framework of the proposed methodology. In detail, for one-dimensional HRRP signals, we design a specialized 1D-CNN architecture for feature learning. Since the support region of HRRP signals encapsulates critical structural information about targets, the self-attention mechanism effectively captures these discriminative features. Regarding two-dimensional ISAR images, we employ a fully convolutional network with a deformable convolution network (DCN) for feature extraction. Considering the substantial size variations among different types of space targets and the consequent scale differences in ISAR images, the DCN can robustly extract structural characteristics across varying target dimensions. Ultimately, inspired by [], we design a feature fusion recognition module based on fully connected self-attention. This module implements cross-modal interaction through parallel channel–spatial attention pathways, simultaneously modeling relationships in both channel and spatial dimensions to autonomously learn the relative importance of different modalities across various dimensions. Furthermore, we incorporate residual fusion connections to preserve original features and prevent information loss. The subsequent sections provide detailed explanations of each component and branch network in the proposed framework, as well as the training and test procedures of the whole model.
Figure 1. The comprehensive framework of the proposed method.

2.1. Attention-1dCNN for HRRPs

In our proposed model, we design a self-attention-based one-dimensional convolutional neural network (attention-1dCNN) for HRRP feature learning, since 1dCNNs offer several key advantages for processing one-dimensional signals such as HRRPs, time-series data, and biomedical signals. Figure 2 shows the overall architecture of the attention-1dCNN for HRRP feature learning.
Figure 2. The overall architecture of the attention-1dCNN for HRRP feature learning.
Given the HRRPs $\mathbf{x} \in \mathbb{R}^{P \times 1}$, with $P$ denoting the dimension, we assume the 1dCNN contains $K$ convolutional blocks with $C$ channels and $1 \times 3$ convolution kernels. After the $K$ convolutional blocks, the features $f$ are projected into the self-attention module to learn the attention coefficients $A$, which are calculated as follows:
$$A = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) = \left[a_{1}, a_{2}, \ldots, a_{T}\right] \quad (1)$$
$$Q = fW_{Q}, \quad K = fW_{K}, \quad V = fW_{V} \quad (2)$$
where $W_{Q}$, $W_{K}$, and $W_{V}$ denote the learnable projection matrices trained with the network, $(\cdot)^{T}$ is the matrix transpose operation, and $d$ is the dimension of the projected features. Thus, the output features of this module are represented as
$$z = A \oplus f \quad (3)$$
where $\oplus$ denotes the element-wise sum of the learned attention coefficients and the original features.
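As an illustrative sketch (not the authors' exact implementation), the attention step of Equations (1)–(3) can be expressed in PyTorch as follows; the projection dimension, layer sizes, and the residual combination are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention1D(nn.Module):
    """Self-attention over 1D-CNN feature maps (cf. Eqs. (1)-(3)); sizes are illustrative."""
    def __init__(self, channels: int, d_k: int = 64):
        super().__init__()
        self.w_q = nn.Linear(channels, d_k, bias=False)       # learnable projection W_Q
        self.w_k = nn.Linear(channels, d_k, bias=False)       # learnable projection W_K
        self.w_v = nn.Linear(channels, channels, bias=False)  # learnable projection W_V
        self.d_k = d_k

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, length, channels) -- conv features transposed to length-first layout
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        a = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # attention coefficients A
        return a @ v + f                                                  # attended features plus residual

# usage: features from K conv blocks with 32 channels over 401 range cells
feats = torch.randn(8, 32, 401).transpose(1, 2)   # -> (batch, length, channels)
out = SelfAttention1D(channels=32)(feats)
print(out.shape)  # torch.Size([8, 401, 32])
```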
In detail, for the 1dCNN for HRRP feature learning, we first need to process the HRRPs. Due to the amplitude sensitivity and translation sensitivity inherent in HRRPs, some proper normalization and centroid alignment methods must be applied to process HRRPs with amplitude variations and positional shifts. In this paper, we adopt Gaussian normalization to process the HRRPs. Due to the inherent characteristics of HRRP signals, the target’s support region contains strong scattering points with significant amplitudes, while the non-support region consists of weak scatterers with near-zero magnitude. If raw HRRP echoes are directly fed into a neural network for training, the gradient during backpropagation may vanish in near-zero amplitude regions, leading to optimization failure. To mitigate this issue while preserving the discriminative power between support and non-support regions, HRRP signals must undergo amplitude normalization. This preprocessing ensures numerically stable training by preventing vanishing gradients while maintaining the critical amplitude contrast between target and background regions.
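The preprocessing described above can be sketched as follows; this is a minimal example assuming zero-mean, unit-variance Gaussian normalization and circular-shift centroid alignment, which may differ from the exact procedure applied to the dataset.

```python
import numpy as np

def preprocess_hrrp(x: np.ndarray) -> np.ndarray:
    """Gaussian-normalize and centroid-align one HRRP (illustrative sketch)."""
    # amplitude normalization: zero mean, unit variance, avoiding near-zero regions
    # that would otherwise stall gradient-based training
    x = (x - x.mean()) / (x.std() + 1e-8)
    # centroid alignment: circularly shift the profile so its amplitude centroid
    # sits at the middle range cell, reducing translation sensitivity
    mag = np.abs(x)
    centroid = int(np.round(np.sum(np.arange(len(x)) * mag) / (mag.sum() + 1e-8)))
    return np.roll(x, len(x) // 2 - centroid)

hrrp = np.abs(np.random.randn(401))   # placeholder 1 x 401 profile
aligned = preprocess_hrrp(hrrp)
```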

2.2. Robust-DCN for ISAR Images

DCNs offer significant advantages over traditional CNNs in handling geometric variations and irregular object structures, such as deformations, viewpoint changes, or scale variations, which are common challenges in space target ISAR images. Thus, our method contains a DCN branch for robust feature extraction of ISAR images.
The overall architecture of the DCN for ISAR feature learning is presented in Figure 3. As illustrated in Figure 3, the space targets in the input ISAR images appear at vastly different scales due to distance and imaging conditions, a situation in which DCNs are particularly effective: they can dynamically adjust the sampling positions, allowing the network to adapt to varying object sizes.
Figure 3. The overall architecture of the DCN for ISAR image feature learning.
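A deformable convolution block of the kind used in the ISAR branch can be sketched with torchvision.ops.DeformConv2d as below; the channel sizes and the offset-prediction convolution are illustrative assumptions rather than the authors' exact robust-DCN configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """One deformable-convolution block for ISAR feature maps (illustrative sketch)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # a plain conv predicts 2*k*k sampling offsets (an x and y shift per kernel tap)
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dconv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset(x)                     # per-pixel sampling offsets
        return torch.relu(self.bn(self.dconv(x, offsets)))

# usage: a batch of single-channel ISAR feature maps
x = torch.randn(4, 1, 128, 128)
y = DeformBlock(1, 16)(x)   # -> (4, 16, 128, 128)
```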
In addition, we develop an enhanced version of the robust-DCN by incorporating triplet loss to improve intra-class compactness and inter-class separability of learned features. Given an anchor sample x a , the positive sample x p from the same class as an anchor, and a negative sample x n from a different class, the loss function enforces the following constraint:
$$L_{triplet} = \max\left(0,\; \left\|f_{\varphi}(x_{a}) - f_{\varphi}(x_{p})\right\|_{2}^{2} - \left\|f_{\varphi}(x_{a}) - f_{\varphi}(x_{n})\right\|_{2}^{2} + \alpha\right) \quad (4)$$
where $f_{\varphi}(x)$ denotes the features of $x$ extracted by the DCN, with $\varphi$ representing the network parameters, $\|\cdot\|_{2}^{2}$ denotes the squared Euclidean distance in the embedding space, and $\alpha$ is an adjustable hyperparameter defining the minimum required margin between positive and negative pairs. In Equation (4), the term $\|f_{\varphi}(x_{a}) - f_{\varphi}(x_{p})\|_{2}^{2}$ penalizes large distances between embeddings of the same class, and the term $\|f_{\varphi}(x_{a}) - f_{\varphi}(x_{n})\|_{2}^{2}$ ensures that negatives are at least $\alpha$ units farther from the anchor than positives. The max operation in $L_{triplet}$ ensures that the loss is zero if the constraint is already satisfied.
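A direct implementation of Equation (4) with squared Euclidean distances is sketched below; the margin value and batch shapes are assumed settings used only for illustration.

```python
import torch

def triplet_loss(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor,
                 alpha: float = 0.2) -> torch.Tensor:
    """Triplet margin loss of Eq. (4) with squared Euclidean distances.

    f_a, f_p, f_n: embeddings of anchor, positive, and negative samples, shape (B, D).
    alpha is the margin hyperparameter (0.2 is an assumed value, not from the paper).
    """
    d_pos = ((f_a - f_p) ** 2).sum(dim=1)   # ||f(x_a) - f(x_p)||_2^2
    d_neg = ((f_a - f_n) ** 2).sum(dim=1)   # ||f(x_a) - f(x_n)||_2^2
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()

emb = lambda: torch.randn(16, 128)          # placeholder embeddings
loss = triplet_loss(emb(), emb(), emb())
```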

2.3. Feature Fusion Networks with FCSA

Inspired by reference [], fully connected self-attention (FCSA) is designed to simultaneously model the spatial and channel-wise dependencies, and the overall architecture of the feature fusion networks with FCSA is presented in Figure 4. As shown in Figure 4, the feature fusion network contains spatial attention, which enhances target structure preservation by focusing on salient regions of the support area of HRRPs and target scattering centers area of ISAR images. The FCSA module addresses the limitation of insufficient cross-modal interaction in existing methods. By constructing a fully connected attention matrix, it models the correlation between every spatial position and channel of HRRP and ISAR features—this is in stark contrast to the DCN’s local convolution operations. This global modeling allows FCSA to effectively leverage complementary information (e.g., HRRP’s precise radial details and ISAR’s comprehensive cross-range distribution), laying a solid foundation for high-accuracy target recognition.
Figure 4. The overall architecture of FCSA for cross-modal feature fusion. $F_{1}$ denotes the features of HRRPs, $F_{1}'$ denotes the resized tensor of $F_{1}$, and $F_{2}$ represents the features of ISAR images.
Given HRRP features $F_{1} \in \mathbb{R}^{B \times C_{1} \times W_{1}}$ and ISAR image features $F_{2} \in \mathbb{R}^{B \times C_{2} \times H_{2} \times W_{2}}$, with $B$ denoting the batch size, $C_{1}$ and $C_{2}$ denoting the numbers of channels, and $W_{1}$, $H_{2}$, and $W_{2}$ denoting the sizes of the feature maps, data processing of the two feature sets is first carried out to obtain features of the same size. Then, feature concatenation is performed to obtain the features $F \in \mathbb{R}^{B \times C \times H \times W}$, which are fed into the networks to learn the attention coefficients $\alpha_{c}$ and $\alpha_{s}$. Ultimately, the fused features are obtained via the element-wise sum:
$$F_{fusion} = W_{A}(F_{1}', F_{2}; \alpha_{c}) \oplus W_{A}(F_{1}', F_{2}; \alpha_{s}) \quad (5)$$
where $\otimes$ and $\oplus$ denote the element-wise multiplication and sum, respectively, and $W_{A}(X, Y; \alpha) = \alpha \otimes X \oplus (1 - \alpha) \otimes Y$. The fused features are then projected into the classifier to obtain the predicted results.
FCSA bridges the domain gap between HRRPs and ISAR by unifying their complementary strengths: HRRPs’ precise range resolution and ISAR’s 2D spatial discrimination. This is achieved without manual feature engineering, enabling end-to-end fusion for space target recognition.
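To make the fusion step of Equation (5) concrete, the following PyTorch sketch combines same-sized HRRP and ISAR feature maps with channel and spatial coefficients; the small attention heads that produce the coefficients here are placeholders, whereas the actual FCSA module learns them through fully connected self-attention over the concatenated features.

```python
import torch
import torch.nn as nn

class FCSAFusion(nn.Module):
    """Simplified sketch of Eq. (5): channel- and spatial-weighted fusion of
    resized HRRP features F1' and ISAR features F2 of matching shape (B, C, H, W)."""
    def __init__(self, channels: int):
        super().__init__()
        # placeholder heads producing alpha_c (per channel) and alpha_s (per position)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    @staticmethod
    def weighted(x, y, alpha):
        # W_A(X, Y; alpha) = alpha * X + (1 - alpha) * Y  (element-wise)
        return alpha * x + (1.0 - alpha) * y

    def forward(self, f1, f2):
        f = torch.cat([f1, f2], dim=1)       # concatenated features F
        alpha_c = self.channel_att(f)        # (B, C, 1, 1) channel coefficients
        alpha_s = self.spatial_att(f)        # (B, 1, H, W) spatial coefficients
        return self.weighted(f1, f2, alpha_c) + self.weighted(f1, f2, alpha_s)

f1 = torch.randn(4, 64, 16, 16)   # resized HRRP features F1'
f2 = torch.randn(4, 64, 16, 16)   # ISAR features F2
fused = FCSAFusion(64)(f1, f2)    # -> (4, 64, 16, 16)
```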

2.4. Training and Test Procedures

Based on the aforementioned descriptions of each network module, our methodology comprises three key components: an HRRP branch with parameters $\phi$, an ISAR branch with parameters $\varphi$, and a feature fusion module with parameters $\theta$. The entire network architecture follows an end-to-end design paradigm, with the parameters $\phi$, $\varphi$, and $\theta$ being jointly optimized during training. The composite loss function for the complete model is formulated as follows:
$$Loss = \sum_{i=1}^{N} H\left(y_{i}, \hat{y}_{i}\right) + \lambda L_{triplet} \quad (6)$$
where $N$ denotes the number of training sample pairs of HRRPs and ISAR images, and $\lambda$ represents the adjustable trade-off parameter. $H(y_{i}, \hat{y}_{i})$ denotes the cross-entropy loss of the $i$-th pair of HRRPs and ISAR images, which is written as
$$H\left(y_{i}, \hat{y}_{i}\right) = -\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c} \quad (7)$$
where $C$ denotes the number of categories in the training data, and $L_{triplet}$ is the triplet loss expressed in Equation (4). The parameter $\lambda \in (0, 1)$ is configured to balance the weighting between the cross-entropy and triplet losses, enabling dynamic adjustment of the model's parameter learning process.
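A minimal sketch of the composite objective in Equations (6) and (7) is given below; the value of λ and the tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits: torch.Tensor, labels: torch.Tensor,
                   l_triplet: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Composite objective of Eq. (6): cross-entropy summed over N HRRP-ISAR pairs
    plus a lambda-weighted triplet term; lam = 0.5 is an assumed setting."""
    ce = F.cross_entropy(logits, labels, reduction='sum')   # sum_i H(y_i, y_hat_i)
    return ce + lam * l_triplet

logits = torch.randn(16, 10)                 # 10 target categories
labels = torch.randint(0, 10, (16,))
loss = composite_loss(logits, labels, torch.tensor(0.3))
```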
During the model training phase, paired ISAR–HRRP data samples are systematically constructed. Then, the parameters $\phi$, $\varphi$, and $\theta$ are jointly optimized with respect to the loss function in Equation (6). In the test phase, the predicted class of a test ISAR–HRRP data pair is obtained via forward propagation, ensuring computationally efficient inference while maintaining the model's end-to-end prediction capability.

3. Results

To validate the effectiveness of the proposed method, this section conducts experiments on the space object broadband RCS sequence dataset provided in the 18th “Challenge Cup” competition []. The simulated space target dataset is first introduced, including the radar parameters and data samples. Then, the recognition results of our method are presented and compared with those of several related methods. Moreover, to analyze the robustness of our method, we further present the few-shot results, the polarimetry generalization results, and the ablation results for different fusion strategies.

3.1. Data Description and Settings

The dataset [] comprises 10 space target categories, with a total of 5000 data blocks. Each data block contains broadband RCS data with dual polarization modes (Eh and Ev), and the simulated echoes in each data block cover about 30. The broadband RCS data covers a frequency range of 8–10 GHz with a sampling interval of 0.005 GHz, yielding 401 sampling frequency points. Each frequency point is characterized by 512 feature values.
In our experiments, we utilize the HRRPs and ISAR images obtained from the dataset: 512 HRRPs and one corresponding ISAR image can be obtained from each data block in the Eh or Ev polarization mode. Thus, there are 5000 ISAR images and 2,560,000 HRRPs of 10 classes in total. In Figure 5, we present the ISAR images and the corresponding HRRPs of each space target, which show notable differences between targets, especially in target size and structure. Moreover, Figure 6 illustrates the ISAR images and corresponding HRRP samples from the Eh and Ev polarization modes of four targets. As can be seen from Figure 6, significant differences exist in the HRRPs of the same target under different polarization modes, particularly manifested as substantial fluctuations in the intensity of dominant scattering centers. Moreover, the scattering intensity in the ISAR images of the targets varies across polarization modes, resulting in discernible structural discrepancies in the reconstructed target signatures.
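HRRPs are conventionally obtained from wideband frequency-sweep echoes by an inverse Fourier transform over frequency (the IFT preprocessing referenced in Section 3.2.1); the sketch below illustrates this standard conversion for a 401-point sweep and is an assumption about the processing chain, not the dataset's documented generation pipeline.

```python
import numpy as np

def hrrp_from_sweep(echo_freq: np.ndarray) -> np.ndarray:
    """Standard range-profile formation: inverse FFT of a complex frequency sweep
    (8-10 GHz, 401 points here), followed by magnitude. Illustrative only."""
    profile = np.fft.ifft(echo_freq, axis=-1)        # frequency domain -> range domain
    return np.abs(np.fft.fftshift(profile, axes=-1)) # center the range profile

sweep = np.random.randn(401) + 1j * np.random.randn(401)   # placeholder echo
hrrp = hrrp_from_sweep(sweep)                               # 1 x 401 range profile
```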
Figure 5. The ISAR images and HRRP examples of ten space targets. The first row (a–e) shows the ISAR images, and the second row (f–j) shows the corresponding HRRP examples. These examples exhibit distinct inter-class characteristics (e.g., varied HRRP peak distributions and ISAR scattering patterns), which are critical for target recognition; the performance analysis in Section 3.2.1 links the model's accuracy to these characteristics to verify its effectiveness in leveraging such feature differences.
Figure 6. The ISAR images and HRRP examples with Ev and Eh polarimetry of space targets.
In this section, the 5000 ISAR images with size 512 × 401 and the corresponding 2,560,000 HRRPs with size 1 × 401 are divided into training and test data with a 4:1 ratio (80% for training and 20% for testing) of the Ev polarization data, ensuring statistically sound performance evaluation while maintaining data representativeness. In the few-shot experiments of Section 3.3.1, the percentage of training samples is set to 0.8, 0.7, 0.6, 0.5, or 0.4. In the polarimetry generalization experiments, the Ev polarization data are taken as training data, and the Eh polarization data are taken as test data. We adopt the average classification accuracy across all categories as the quantitative metric for target recognition. In addition, the learning rate is set to $10^{-4}$, the batch size is set to 64, and the optimizer used during training is the stochastic gradient descent (SGD) algorithm.
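The reported training configuration (SGD, learning rate 1e-4, batch size 64) can be set up as in the sketch below; the model and data are stand-in placeholders, not the actual MFFN-FCSA or dataset.

```python
import torch

# assumed placeholders; the reported settings are lr = 1e-4, batch size = 64, SGD optimizer
model = torch.nn.Linear(128, 10)                      # stand-in for MFFN-FCSA
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,))),
    batch_size=64, shuffle=True)

for x, y in loader:                                   # one epoch of the training loop
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```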
Moreover, all experiments are performed with the PyTorch 2.0.1+cu118 framework on an NVIDIA GeForce RTX 4080 Ti GPU with 128 GB of memory, running the Ubuntu 18.04 Linux system.

3.2. Recognition Results

3.2.1. Target Recognition Performance

First, we analyze the impact of different data preprocessing methods on recognition performance. In reference [], four approaches are compared: original data, inverse Fourier transform (IFT), wavelet transform (WT) combined with IFT, and WT + IFT + logarithmic (log) processing; their recognition results are taken from reference []. In our paper, we employ HRRPs, ISAR images, and their multimodal fusion for target recognition. The recognition results of the different data processing methods are listed in Table 1. The results for HRRPs were obtained with the attention-1dCNN, the results for ISAR images were obtained with the robust-DCN, and the fusion results were obtained with our proposed method. As presented in Table 1, compared with the other single-modality recognition results, the one-dimensional HRRP data achieved promising performance, since HRRPs capture the structural information of targets along the range dimension. While two-dimensional ISAR images contain joint range–azimuth information of targets, they suffer from insufficient data quantity, resulting in limited recognition accuracy during testing. The proposed multi-modal fusion method effectively combines complementary information from HRRP samples and ISAR images, consequently achieving enhanced recognition performance.
Table 1. The recognition results of different data pre-processing methods for the test data from 20% Ev polarization mode data.
Furthermore, to analyze the results of HRRPs, ISAR images, and fused data, Figure 7 shows their recognition confusion matrices. The comparison of the confusion matrices in Figure 7 reveals that HRRPs achieve high classification accuracy across almost all categories, with only minor misclassifications between Class 6 and Class 8. Figure 7b shows that the recognition results of ISAR images are relatively lower for Classes 2, 6, 8, and 10. The fusion results in Figure 7c show that our method significantly improves the classification accuracy for all categories, demonstrating its capability to effectively integrate complementary information from both modalities and enhance overall recognition performance.
Figure 7. The recognition confusion matrices of the HRRP branch, ISAR image branch, and our fusion method; their average recognition rates (a–c) are 0.9643, 0.9101, and 0.9852, respectively.
For a comprehensive performance evaluation, Table 2 compares the proposed method with a conventional machine learning baseline, the support vector machine (SVM) [], and several neural network models, including the MLP, CNN, VGG16 [], ResNet50 [], and CNN-LSTM []. The experimental results in Table 2 show that our method achieves better recognition accuracy than both the traditional and deep learning baselines. While CNN-LSTM attains relatively high accuracy by exploiting temporal dependencies in radar data, our framework further outperforms it by jointly leveraging the features of both HRRPs and ISAR images.
Table 2. The recognition results of different methods for the test data from 20% Ev polarization mode data.

3.2.2. Ablation Study of Fusion Strategies

In our proposed method, we adopt the fully connected self-attention mechanism to fuse the features of HRRPs and ISAR images, which captures cross-scale and cross-domain feature relationships through a generalized self-attention mechanism.
Table 3 lists the recognition results obtained using different fusion strategies for the features of HRRPs and ISAR images. From the results in Table 3, we can see that attention-based methods achieve better performance than statistical-function-based and union-function-based methods. Meanwhile, feature concatenation also produces good results, since it retains abundant information from the multi-modal data. Attention-based methods can weight the importance of features and thus achieve better performance.
Table 3. The recognition results of different fusion strategies in our method for the test data from 20% Ev polarization mode data.
Moreover, to analyze the feature importance of HRRPs and ISAR images in the proposed fusion framework, Figure 8 shows the attention coefficients learned by a traditional attention mechanism and by the FCSA in our feature fusion method. Figure 8 presents the evolution of the attention coefficients during training, demonstrating how the model dynamically adjusts the weighting between the two branches across iterations. It is worth noting that the attention coefficients of both branches are initialized at 0.5. As can be seen from Figure 8, during model training, the attention weight of the HRRP branch progressively increased, while that of the ISAR branch correspondingly decreased. At convergence, the HRRP branch maintained a higher weighting than the ISAR branch, indicating its greater contribution to the final recognition performance. Nevertheless, the complementary information from ISAR images still enhanced the final model's discriminative capability. As shown in Figure 8, the coefficient value of HRRPs learned by our method is higher than that learned by the traditional attention mechanism and also leads to higher recognition results. Since the HRRP branch achieves higher rates than the ISAR branch, it is reasonable that our model achieves better performance by learning larger coefficient values for HRRPs.
Figure 8. The attention coefficients of the HRRP and ISAR image branches learned by the traditional attention mechanism and by the FCSA in our feature fusion method. The blue line is the coefficient of HRRPs, the red line is the coefficient of ISAR images, and their sum equals 1.

3.3. Model Generalization Analysis

In Section 3.2, the recognition experiments are conducted under cooperative conditions with sufficient training data, and the test data covers a comprehensive azimuthal distribution consistent with the training dataset. Additionally, both polarization modes (Eh and Ev) are included in the training and test data, leading to high recognition accuracy under these idealized conditions. However, in real battlefield scenarios, radar data acquisition faces two significant challenges: limited training samples and polarization mismatch. Thus, the model's few-shot and polarization generalization capabilities are key factors affecting its practicability in real applications. Moreover, we also conducted an ablation study analyzing the recognition performance of HRRPs and ISAR images under different feature fusion strategies.

3.3.1. Few-Shot Results

In practice, radar data acquisition faces significant challenges due to environmental clutter and interference, resulting in limited available radar data and consequently a small training sample size. In this case, the performance of deep learning methods is restricted. In the following, we conduct experiments with limited training samples, with the percentage of training samples set to 0.8, 0.7, 0.6, 0.5, or 0.4. Figure 9 shows the test recognition results under different percentages of training samples during model training. As presented in Figure 9, as the percentage of training samples increases, the recognition results of our fusion model increase. Table 4 lists the final recognition results of reference [], the HRRP branch, the ISAR image branch, and our method under different percentages of training samples. In particular, with only 40% of the training samples, our model achieves recognition results of more than 95%, which indicates that our fusion model performs well in few-shot cases.
Figure 9. The recognition results of test data under different percentages of training samples during model training.
Table 4. The recognition results with different training samples for the test data from 20% Ev polarization mode data.

3.3.2. Polarimetry Variations Results

In this subsection, we analyze model generalization across polarimetry modes. The data in the Ev polarimetry mode is utilized to train the model, while the data in the Eh polarimetry mode is taken as test data. Thus, there are 250 ISAR images and a corresponding 128,000 HRRPs in each category.
In the following experiments, the HRRP branch (attention-1dCNN) is independently trained with 1,280,000 HRRPs, the ISAR branch (robust-DCN) is independently trained with 2500 ISAR images, and our fusion model is trained with both HRRPs and ISAR images. After model training, the test data is input into the model, and Figure 10 first shows the t-SNE visualizations of features from the HRRP branch, ISAR image branch, and our fusion method. As shown in Figure 10a, the feature separability of the HRRP branch is unsatisfactory, since many features from Class 8 are mixed with those from Class 6. In Figure 10b, feature separability is slightly better than in Figure 10a, but many features from Class 5 are mixed with those from Class 1. Compared with Figure 10a,b, the features in Figure 10c show much better separability, with more compact intra-class clustering and clearer inter-class boundaries. Correspondingly, the polarimetry-robust recognition confusion matrices of the HRRP branch, ISAR branch, and our fusion method are presented in Figure 11. Figure 11a shows that the recognition rate of Class 6 in the HRRP branch is only 53%, with 43% of its samples misclassified as Class 8, while Figure 11b shows that the recognition rate of Class 5 is only 56%, with 37% misclassified as Class 1. Figure 11c shows that almost all classes are well classified, with recognition rates of more than 80%. The quantitative results in Figure 11 are consistent with the qualitative feature visualizations in Figure 10. The average recognition results of the HRRP branch, ISAR image branch, and our fusion method are summarized in Table 5. The HRRP and ISAR branches only achieve recognition rates of 75.06% and 81.37%, respectively, with the ISAR branch slightly outperforming the HRRP branch. As shown in Figure 6, HRRPs exhibit greater sensitivity to polarization variations than ISAR imagery, leading to more pronounced degradation in recognition performance. Our fusion model achieves much better performance, since it leverages complementary features from the multi-modal data and mitigates the significant variations in target scattering characteristics caused by polarization diversity. Thus, our model shows promising polarimetry generalization ability.
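The t-SNE visualizations in Figure 10 can be reproduced with a sketch like the one below, applying scikit-learn's TSNE to the penultimate-layer features of each branch; the perplexity and other parameters are assumed values, not settings reported in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats: (N, D) penultimate-layer features from a branch; labels: (N,) class ids
feats = np.random.randn(1000, 128)            # placeholder embeddings
labels = np.random.randint(0, 10, size=1000)  # placeholder labels

emb = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap='tab10')
plt.title('t-SNE of learned features (sketch)')
plt.show()
```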
Figure 10. The t-SNE visualizations of features from the HRRP branch, ISAR branch, and our fusion method.
Figure 11. The polarimetry-robust recognition confusion matrices of the HRRP branch, ISAR image branch, and our fusion method; the average recognition rates (a–c) are 0.7506, 0.8188, and 0.9076, respectively.
Table 5. The polarimetry recognition results of different methods.
Supplementary analysis: Compared with Figure 7c (the fusion results under normal conditions), the fusion model in Figure 11c maintains similar false-positive/false-negative (FP/FN) suppression capabilities. Regarding FPs, Classes 6 and 8 have FP rates of 4.3% and 3.8% here, slightly higher than the 1.2% and 0.9% in Figure 7c, but far lower than those of the single-modal branches. Regarding FNs, Classes 2 and 7 have FN rates of 2.1% and 1.7%, close to the near-zero FN rates in Figure 7c, confirming the model's stable ability to reduce misclassification and missed detection across scenarios.
The above FP/FN analysis further validates that the proposed fusion model not only achieves high accuracy under normal conditions (Figure 7c) but also maintains robust FP/FN performance under polarization variations (Figure 11c), demonstrating its practical applicability in complex radar scenarios.

3.4. Discussion

3.4.1. Analysis of Results and Implications

Beyond the quantitative results presented above, this section provides a deeper analysis of the implications and practical considerations of our proposed fusion method.
The computational complexity and test time of the different methods are listed in Table 6. According to the experimental results, the proposed unimodal recognition branches exhibit lower computational complexity than the baseline VGG16. Although the multimodal fusion method is more complex than the unimodal branches, the measured test times show that its comparatively low overall complexity still yields high time efficiency. Therefore, the proposed method demonstrates favorable time efficiency and practical value.
Table 6. The computation complexity and test time of different methods.

3.4.2. Limitations and Future Work

The experimental findings confirm the core hypothesis that multimodal fusion can enhance recognition accuracy. Nevertheless, a deeper discussion of these results and their implications necessitates a candid acknowledgment of the associated costs and constraints. Although the proposed method achieves a higher recognition rate, it does so with increased computational complexity and longer inference time compared to unimodal baselines, which may limit its practicality in real-time scenarios.
Furthermore, this study’s limitations, particularly those related to data, must be considered. Our models were trained and evaluated on a simulated dataset. While this provides a solid benchmark, the performance gap when transitioning to noisy, heterogeneous real-world radar data remains an open question and a key limitation. Potential domain shift and the risk of overfitting to the characteristics of our current dataset, despite employing regularization techniques, underscore the need for validation on genuine operational data in the future.

4. Conclusions

This paper proposes a multimodal feature fusion framework based on fully connected self-attention for space target recognition, effectively integrating complementary radar HRRPs and ISAR imagery. The model employs parallel branches including a 1dCNN for HRRPs and DCNs for ISAR images, which extract discriminative features from each modality. Moreover, the proposed model designs a fully connected self-attention fusion module, which deeply integrates cross-modal information across both channel and spatial dimensions, enabling dynamic feature refinement. Experiments demonstrate that the proposed method achieves notable performance advantages. It significantly outperforms unimodal baselines and conventional fusion strategies in recognition accuracy, particularly under low-sample conditions, where it reduces dependency on large-scale labeled data. Furthermore, the model exhibits exceptional robustness to polarization variations, maintaining stable performance where traditional methods often degrade. The fusion mechanism also enhances feature discriminability in noisy and variable scenarios, confirming its practical suitability for real-world space surveillance applications.

Author Contributions

Conceptualization, L.L.; methodology, L.L.; software, L.L. and Y.J.; validation, L.L.; data curation, L.L.; writing—original draft preparation, L.L.; writing—review and editing, L.L., Y.J., G.Z. and Z.L.; supervision, G.Z. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China (No. 62401279, No. 62471242), in part by the Basic Science General Project of Colleges and Universities in Jiangsu Province (No. KZ0021524014), and in part by the Natural Science Research Start-up Foundation of Recruiting Talents of Nanjing University of Posts and Telecommunications (Grant No. NY223135).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the fact that the data were obtained through cooperation with a specific institution for research purposes and are not in the category of publicly shareable data as stipulated by the cooperation agreement.

Acknowledgments

We sincerely thank the Editor for handling our manuscript and the reviewers for their valuable comments and suggestions, which have significantly strengthened this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Luo, Y.; Zhang, Q.; Yuan, N.; Zhu, F.; Gu, F. Three-dimensional precession feature extraction of space targets. IEEE Trans. Aerosp. Electron. Syst. 2014, 50, 1313–1329. [Google Scholar] [CrossRef]
  2. Shi, Y.; Du, L.; Guo, Y. Unsupervised Domain Adaptation for SAR Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6372–6385. [Google Scholar] [CrossRef]
  3. Ezuma, M.; Anjinappa, C.K.; Semkin, V.; Guvenc, I. Comparative analysis of radar-cross-section-based UAV recognition techniques. IEEE Sens. J. 2022, 22, 17932–17949. [Google Scholar] [CrossRef]
  4. Persico, A.R.; Clemente, C.; Pallotta, L.; De Maio, A.; Soraghan, J. Micro-Doppler classification of ballistic threats using Krawtchouk moments. In Proceedings of the IEEE Radar Conference (RadarConf), Philadelphia, PA, USA, 2–6 May 2016; pp. 1–6. [Google Scholar] [CrossRef]
  5. Du, C.; Xie, P.; Zhang, L.; Ma, Y.; Tian, L. Conditional prior probabilistic generative model with similarity measurement for ISAR imaging. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4013205. [Google Scholar] [CrossRef]
  6. Wang, J.; Liu, Z.; Ran, L.; Xie, R. Feature extraction method for DCP HRRP-based radar target recognition via m−χ decomposition and sparsity-preserving discriminant correlation analysis. IEEE Sens. J. 2020, 20, 4321–4332. [Google Scholar] [CrossRef]
  7. Li, C.; Li, Y.; Zhu, W. Semisupervised space target recognition algorithm based on integrated network of imaging and recognition in radar signal domain. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 506–524. [Google Scholar] [CrossRef]
  8. Li, H.; Li, X.; Xu, Z.; Jin, X.; Su, F. MSDP-Net: A Multi-Scale Domain Perception Network for HRRP Target Recognition. Remote Sens. 2025, 17, 2601. [Google Scholar] [CrossRef]
  9. Malmgren-Hansen, D. A convolutional neural network architecture for Sentinel-1 and AMSR2 data fusion. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1890–1902. [Google Scholar] [CrossRef]
  10. Deng, J.; Su, F. Deep Hybrid Fusion Network for Inverse Synthetic Aperture Radar Ship Target Recognition Using Multi-Domain High-Resolution Range Profile Data. Remote Sens. 2024, 16, 3701. [Google Scholar] [CrossRef]
  11. Li, G.; Sun, Z.; Zhang, Y. ISAR target recognition using Pix2pix network derived from cGAN. In Proceedings of the International Radar Conference (RADAR), Toulon, France, 23–27 September 2019; pp. 1–4. [Google Scholar] [CrossRef]
  12. Yang, Q.; Wang, H.; Fan, L.; Li, S. A Category–Pose Jointly Guided ISAR Image Key Part Recognition Network for Space Targets. Remote Sens. 2025, 17, 2218. [Google Scholar] [CrossRef]
  13. Yang, L.; Wang, H.; Zeng, Y.; Liu, W.; Wang, R.; Deng, B. Detection of Parabolic Antennas in Satellite Inverse Synthetic Aperture Radar Images Using Component Prior and Improved-YOLOv8 Network in Terahertz Regime. Remote Sens. 2025, 17, 604. [Google Scholar] [CrossRef]
  14. Tian, B. Review of high-resolution imaging techniques of wideband inverse synthetic aperture radar. J. Radars 2020, 9, 765–802. [Google Scholar] [CrossRef]
  15. Lin, Z.; Ji, K.; Leng, X.; Kuang, G. Squeeze and Excitation Rank Faster R-CNN for Ship Detection in SAR Images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 751–755. [Google Scholar] [CrossRef]
  16. Bai, X.; Zhou, X.; Zhang, F.; Wang, L.; Xue, R.; Zhou, F. Robust pol-ISAR target recognition based on ST-MC-DCNN. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9912–9927. [Google Scholar] [CrossRef]
  17. Liao, L.; Du, L.; Chen, J.; Cao, Z.; Zhou, K. EMI-Net: An end-to-end mechanism-driven interpretable network for SAR target recognition under EOCs. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205118. [Google Scholar] [CrossRef]
  18. Bickel, S.H. Some invariant properties of the polarization scattering matrix. Proc. IEEE 1965, 53, 1070–1072. [Google Scholar] [CrossRef]
  19. Chen, J.; Xu, S.; Chen, Z. Convolutional neural network for classifying space target of the same shape by using RCS time series. IET Radar Sonar Navig. 2018, 12, 1268–1275. [Google Scholar] [CrossRef]
  20. Song, J.; Wang, Y.; Chen, W.; Li, Y.; Wang, J. Radar HRRP recognition based on CNN. J. Eng. 2019, 2019, 7766–7769. [Google Scholar] [CrossRef]
  21. Zhao, Q.; Principe, J.C. Support vector machines for SAR automatic target recognition. IEEE Trans. Aerosp. Electron. Syst. 2001, 37, 643–654. [Google Scholar] [CrossRef]
  22. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; pp. 559–599. [Google Scholar]
  23. Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process 2006, 54, 4311–4322. [Google Scholar] [CrossRef]
  24. Wang, C.; Zhang, L.; Wei, W.; Zhang, Y. When Low Rank Representation Based Hyperspectral Imagery Classification Meets Segmented Stacked Denoising Auto-Encoder Based Spatial-Spectral Feature. Remot. Sens. 2018, 10, 284. [Google Scholar] [CrossRef]
  25. Xi, Y.; Dechen, K.; Dong, Y.; Miao, W. Domain-aware generalized meta-learning for space target recognition. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5638212. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Yuan, H.; Li, H.; Chen, J.; Niu, M. Meta-learner-based stacking network on space target recognition for ISAR images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 12132–12148. [Google Scholar] [CrossRef]
  27. Liu, J.; Xing, M.; Tang, W. Visualizing Transform Relations of Multilayers in Deep Neural Networks for ISAR Target Recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7052–7064. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Wang, Z.H.; Liu, T.; Xie, Y.; Luo, Y. Space target recognition based on radar network systems with BiGRU-transformer and dual graph fusion network. IEEE Trans. Radar Syst. 2024, 2, 950–965. [Google Scholar] [CrossRef]
  29. Dong, J.; She, Q.; Hou, F. HRPNet: High-dimensional feature mapping for radar space target recognition. IEEE Sens. J. 2024, 24, 11743–11758. [Google Scholar] [CrossRef]
  30. Gao, H.; Zhang, Y.; Chen, Z.; Li, C. A multiscale dual-branch feature fusion and attention network for hyperspectral images classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 8180–8192. [Google Scholar] [CrossRef]
  31. Wu, Y.; Guan, X.; Zhao, B.; Ni, L.; Huang, M. Vehicle detection based on adaptive multimodal feature fusion and cross-modal vehicle index using RGB-T images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 16, 8166–8177. [Google Scholar] [CrossRef]
  32. Zhang, P.; Zhou, Z.; Huang, H.; Yang, Y.; Hu, X.; Zhuang, J.; Tang, Y. Hybrid integrated feature fusion of handcrafted and deep features for Rice blast resistance identification using UAV imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 7304–7317. [Google Scholar] [CrossRef]
  33. Wu, X.; Cao, Z.H.; Huang, T.Z.; Deng, L.J.; Chanussot, J.; Vivone, G. Fully-Connected Transformer for Multi-Source Image Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2071–2088. [Google Scholar] [CrossRef] [PubMed]
  34. Ma, T.; Zhou, L.; Li, J. Space object recognition method based on wideband radar RCS data. J. Ordnance Equip. Eng. 2024, 45, 275–282. [Google Scholar] [CrossRef]
  35. Wang, S.; Yu, J.; Lapira, E.; Lee, J. A modified support vector data description based novelty detection approach for machinery components. Appl. Soft Comput. 2013, 13, 1193–1205. [Google Scholar] [CrossRef]
  36. Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368. [Google Scholar] [CrossRef]
  37. Hu, J.; Shen, L.; Albanie, S. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  38. Cao, X.; Ma, H.; Jin, J.; Wan, X.; Yi, J. A Novel Recognition-Before-Tracking Method Based on a Beam Constraint in Passive Radars for Low-Altitude Target Surveillance. Appl. Sci. 2025, 15, 9957. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
