Point-HRRP-Net: A Deep Fusion Framework via Bi-Directional Cross-Attention for Space Object Classification Using HRRP and Point Cloud

Zhao, Zhenou; Yang, Zhuoyi; Zhang, Haitao; Wang, Yanwei; Meng, Kuo

doi:10.3390/rs18060868


by Zhenou Zhao ¹,†, Zhuoyi Yang ²,†, Haitao Zhang ²,*, Yanwei Wang ² and Kuo Meng

¹ School of Instrumentation Science and Opto-Electronics Engineering, Beijing Information Science and Technology University, Beijing 100101, China

² State Key Laboratory of Precision Space-Time Information Sensing Technology, Department of Precision Instrument, Tsinghua University, Beijing 100084, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2026, 18(6), 868; https://doi.org/10.3390/rs18060868

Submission received: 7 January 2026 / Revised: 23 February 2026 / Accepted: 4 March 2026 / Published: 11 March 2026

(This article belongs to the Special Issue Intelligent Processing and Analysis of Multi-Modal Remote Sensing Data)


Highlights

What are the main findings?

We propose Point-HRRP-Net to fuse 1D High-Resolution Range Profiles (HRRP) and 3D LiDAR point clouds via a Bi-Directional Cross-Attention (Bi-CA) mechanism.
The framework consistently outperforms single-modality baselines. Furthermore, benchmarks reveal that Mamba-based backbones offer superior inference speeds.

What are the implications of the main findings?

Fusing HRRP with 3D LiDAR point clouds effectively mitigates the aspect sensitivity limitations of radar-based classification.
We validate the framework in simulated environments and discuss its potential for real-world deployment.

Abstract

High-Resolution Range Profile (HRRP)-based space object classification is severely limited by aspect sensitivity. Inspired by the intrinsic complementarity between HRRP and LiDAR point clouds, this work investigates the feasibility and effectiveness of fusing these two modalities to address this limitation. We propose the Point-HRRP-Net framework. This framework employs dual-stream extractors to independently encode HRRP electromagnetic signatures and 3D point cloud geometric topologies. Subsequently, a Bi-Directional Cross-Attention (Bi-CA) mechanism is designed to fuse the two modalities. To enable information interaction, this mechanism utilizes point-to-point attention to correlate radar scattering features with 3D geometric points, thereby constructing a comprehensive target representation. Due to data scarcity, we constructed a paired simulation dataset for evaluation. Experimental results demonstrate that the proposed framework consistently outperforms its constituent single-modality baselines. The model achieves 57.67% accuracy on the 180° split and demonstrates generalization capability to unseen viewpoints. Ablation studies further validate the efficacy of the Bi-CA mechanism and the selected feature extractors. Finally, we assess the potential sim-to-real discrepancies and evaluate deployment feasibility across various hardware platforms.

Keywords:

space target; object classification; High-Resolution Range Profile (HRRP); point cloud; multi-modal fusion; cross-attention

1. Introduction

Radar object classification is critical for remote sensing [1,2]. Space object classification is a significant branch of this field [3]. As a classic data source, one-dimensional High-Resolution Range Profile (HRRP) is characterized by its ease of acquisition, low dimensionality, and efficient processing [4,5,6]. However, HRRP suffers from severe aspect sensitivity, causing significant signal variations that make it difficult to classify objects from unseen viewpoints [7].

Early HRRP recognition technologies relied heavily on “hand-crafted features” [8,9]. Researchers attempted to extract feature vectors from HRRP using methods like template matching and statistical models (e.g., Hidden Markov Models) [10,11]. However, restricted by shallow feature extraction and heavy reliance on prior knowledge, these methods often struggle to generalize to unseen orientations [12,13,14]. Deep learning addressed these limitations by enabling automatic feature extraction [15]. Researchers initially applied Convolutional Neural Networks (CNNs) [16] to HRRP, treating them as 1D signals to extract local scattering features. To address the inherent sequential dependencies of radar echoes, scholars subsequently employed Recurrent Neural Networks (RNNs) [17,18] and variants like LSTM [19] and GRU [20]. However, these architectures often fail to capture long-range dependencies [21]. Consequently, Transformer-based approaches have been widely adopted to capture the global context of HRRP sequences [22,23,24]. To further improve recognition performance, recent studies have introduced various advanced mechanisms to model this global information. For instance, some methods combine local-aggregated units with self-attention to effectively perceive global information [25]. Meanwhile, dynamic graph neural networks are utilized to model the global topological relationships of scattering centers [26]. Furthermore, global-local Transformer modules have been designed to dynamically allocate attention between global contexts and local features [27]. Recently, the field has witnessed the emergence of advanced architectures: the Conformer [28] integrates CNNs and Transformers to simultaneously capture local and global features, while the Mamba framework [29], based on State Space Models (SSMs), offers linear computational complexity. 1D-Mamba [30] has shown great potential in efficiently processing long HRRP sequences. 
Meanwhile, to enhance model interpretability and structural awareness, studies have explored aligning attention mechanisms with physical scattering characteristics [31], or incorporating geometric constraints [32]. However, generating HRRP inevitably collapses 3D structures into 1D signals. This compression causes significant information loss, which severely limits generalization to unseen angles. Existing single-modality approaches fail to fundamentally address this limitation.

Multi-modal fusion has thus become a key research direction. For instance, researchers have combined Synthetic Aperture Radar (SAR) [33,34] with optical images to enable all-weather, high-precision ground target recognition. Others have integrated HRRP with micro-Doppler features to refine the distinction of target motion states [35]. For airborne target identification, studies have explored fusing HRRP with Infrared (IR) images [36,37]. In autonomous driving, fusing millimeter-wave radar with LiDAR is widely adopted [38,39]. However, a fundamental distinction exists: millimeter-wave radar provides 3D spatial data, whereas HRRP consists solely of 1D signatures. As a result, fusing LiDAR with HRRP remains largely unexplored.

Advancements in LiDAR technology [40,41] have made acquiring 3D LiDAR point clouds from distant targets a reality. Point clouds preserve intrinsic geometric topology that remains invariant under rigid motion [42], which makes them an ideal candidate for fusion with HRRP. Regarding processing methods, PointNet [43] and PointNet++ [44] pioneered direct learning on unordered points, bypassing inefficient voxelization [45,46,47]. Subsequently, Dynamic Graph CNN (DGCNN) [48] introduced dynamic graph convolution to capture local topology, while Point Transformer [22] and PointMLP [49] introduced self-attention and pure residual designs, respectively. Most recently, the Mamba framework has extended its success to the 3D domain. Specifically, PointMamba [50] adapts the SSM mechanism to point clouds, achieving a superior balance between performance and computational efficiency.

Conversely, HRRP provides scattering information that resolves geometric ambiguities caused by point cloud occlusions. As shown in Figure 1, when the line of sight directly faces the base, the cone, cylinder, and cone–cylinder assembly all appear as identical circles [Figure 1a–c]. This makes them indistinguishable based on shape alone. However, their corresponding HRRP remains distinct due to specific scattering mechanisms like edge diffraction [Figure 1d–f].

Traditional strategies like early and late fusion fail to capture deep feature interactions due to their shallow interaction mechanisms [51,52]. To address this, the field shifted toward intermediate fusion to enable complex feature interactions [53,54,55]. Specifically, cross-attention mechanisms were introduced for fine-grained feature alignment [56,57]. Concurrently, to alleviate the computational burden of standard attention, efficiency-oriented designs have emerged, including Linear-Attention [58], Efficient-Attention [59], and the recent Mamba architecture [60]. These models utilize linear-complexity mechanisms to accelerate inference. In this work, to establish explicit, point-to-point interactions between HRRP and 3D point clouds for generalized recognition, we propose Point-HRRP-Net. This framework leverages a Bi-Directional Cross-Attention (Bi-CA) mechanism to improve classification accuracy and generalization capability.

The main contributions of this paper are summarized as follows:

To the best of our knowledge, this is the first framework to fuse HRRP with 3D point clouds for space object classification. We propose Point-HRRP-Net, a fusion framework that incorporates a Bi-CA mechanism to integrate HRRP and 3D point clouds.
We have constructed and publicly released a paired point cloud-HRRP dataset through electromagnetic and optical simulation.
We benchmark inference latency across diverse hardware, ranging from data center GPUs to embedded edge devices, providing a reference for deployment feasibility.
Extensive experiments demonstrate that the proposed framework significantly outperforms single-modality baselines, particularly in generalizing to unseen viewpoints.

This paper is structured as follows: Section 2 details our proposed method. Section 3 covers dataset creation and experimental results. Section 4 provides a discussion. Finally, Section 5 offers a conclusion.

2. Methods

2.1. Point-HRRP-Net Overview

The overall architecture of our proposed network is illustrated in Figure 2. The data processing pipeline can be conceptually divided into three main stages: (1) dual-branch feature extraction, (2) Bi-CA fusion, and (3) classification.

The process begins by normalizing the HRRP and point cloud inputs. Subsequently, these modalities are processed in parallel by specialized feature extractors. Specifically, the HRRP branch processes features in both the time and frequency domains using a hybrid CNN-Transformer architecture, while the point cloud is encoded by a DGCNN-based extractor to capture geometric features. The extracted feature sequences are then fused via the Bi-CA mechanism to enable explicit cross-modal interaction. Finally, the fused features are aggregated through pooling and concatenation and subsequently fed into an MLP for classification.

2.2. HRRP Feature Extractor

As shown in Figure 3, we propose a dual-stream extractor to capture target characteristics from both time and frequency domains. One stream processes the raw HRRP to analyze time-domain scattering distributions, while the other processes the amplitude spectrum (derived via FFT) to encode frequency-domain attributes. The extracted features from both branches are then projected and concatenated into a unified HRRP feature sequence, which serves as the input for the subsequent fusion module.

The time-domain branch processes the raw HRRP directly. Let the input be represented as a vector $x \in \mathbb{R}^{L}$, where $L = 512$ denotes the number of range bins. To capture local scattering patterns, we employ a hierarchical 1D-CNN backbone. This network consists of stacked convolutional blocks, each comprising 1D convolution layers followed by Max-Pooling. This design enables the extraction of features at varying scales, efficiently encoding both detailed peaks and global structural semantics.

The data flow through the time-domain 1D-CNN is detailed in Figure 4. Specifically, the hierarchical architecture comprises three main blocks with increasing channel depths (8, 16, and 32 filters, respectively). The input tensor, with a shape of $(B, 1, 512)$, where $B$ is the batch size, is first processed by two convolutional layers with 8 filters, followed by a Max-Pooling operation that halves the sequence length. This process is repeated with deeper feature maps. The 1D convolution operation at a layer $k$ can be formally expressed as follows:

$$z_{j}^{(k)} = \sigma\left(\sum_{i=1}^{C_{in}} w_{ij}^{(k)} * h_{i}^{(k-1)} + b_{j}^{(k)}\right) \tag{1}$$

where $h_{i}^{(k-1)}$ is the input feature map from the previous layer, $w_{ij}^{(k)}$ and $b_{j}^{(k)}$ are the learnable filter weights and biases, $*$ denotes the 1D convolution with padding, and $\sigma$ is the ReLU activation function. The subsequent Max-Pooling operation reduces the dimensionality and provides local translational invariance. After passing through the 1D-CNN backbone, the raw HRRP is transformed into a compact sequence of high-level feature vectors $Z_{t} \in \mathbb{R}^{B \times L' \times C_{out}}$, where $L'$ is the reduced sequence length and $C_{out}$ is the final channel dimension (32 in this case).
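The convolution block in Equation (1) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the kernel size, the toy weights, and the helper names `conv1d_same` and `max_pool1d` are assumptions made for illustration, and the sliding product is computed as a cross-correlation, as is common in deep learning frameworks.

```python
def conv1d_same(h, w, b):
    """One layer of Eq. (1): h is [C_in][L], w is [C_out][C_in][K], b is [C_out].

    Zero padding keeps the sequence length unchanged for odd kernel sizes K.
    """
    length, k = len(h[0]), len(w[0][0])
    pad = k // 2
    out = []
    for j, bias in enumerate(b):
        row = []
        for n in range(length):
            acc = bias
            for i, channel in enumerate(h):
                for t in range(k):
                    idx = n + t - pad
                    if 0 <= idx < length:  # positions outside the signal count as zero
                        acc += w[j][i][t] * channel[idx]
            row.append(max(acc, 0.0))  # ReLU activation (sigma in Eq. 1)
        out.append(row)
    return out


def max_pool1d(h, size=2):
    """Non-overlapping max-pooling; with size=2 the sequence length is halved."""
    return [[max(c[n:n + size]) for n in range(0, len(c) - size + 1, size)] for c in h]


# Toy input: one channel, eight range bins (illustrative values only).
x = [[0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 0.0, 0.0]]
w = [[[1.0, 0.0, -1.0]], [[0.5, 0.5, 0.5]]]  # two hypothetical filters, K = 3
b = [0.0, 0.0]
z = max_pool1d(conv1d_same(x, w, b))
```

Here `z` has two channels of length 4, half the input length. If each of the three blocks ends with such a pooling step (an assumption; the text states this explicitly only for the first block), a 512-bin input would yield $L' = 64$.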

Following local feature extraction by the 1D-CNN, the sequence $Z_{t}$ is fed into a Transformer Encoder. Unlike 1D-CNNs, which are limited by local receptive fields, the Transformer utilizes multi-head self-attention to capture global dependencies among scattering centers. This mechanism dynamically weights the importance of all elements in the sequence, effectively modeling the target’s global context. The output is an enriched time-domain feature sequence $H_{t} \in \mathbb{R}^{B \times L' \times D_{t}}$, where $D_{t} = 32$.

Concurrently, the second stream processes the frequency-domain information. The amplitude spectrum of the HRRP is first obtained via Fast Fourier Transform (FFT) and by taking the absolute value:

$$x_{f} = |\mathcal{F}(x)| \tag{2}$$

where $\mathcal{F}(\cdot)$ denotes the FFT operator. This spectrum $x_{f}$ is then processed by an analogous architecture consisting of a hierarchical 1D-CNN and a Transformer Encoder. We utilize the amplitude spectrum to ensure translational invariance. Since phase is sensitive to range shifts, discarding it allows the model to focus on the intrinsic scattering structure.
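The translational invariance claimed for Equation (2) can be checked numerically. The sketch below uses a direct discrete Fourier transform from the standard library (an FFT returns the same values faster) and a hypothetical eight-bin profile. The invariance is exact for circular shifts; a real range shift only approximates this when the target stays inside the range window.

```python
import cmath
import math


def amplitude_spectrum(x):
    """|DFT(x)| computed directly in O(L^2); equivalent to abs(FFT(x))."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def circular_shift(x, s):
    """Shift the profile by s range bins with wrap-around."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


# Hypothetical profile with two scattering peaks (illustrative values only).
profile = [0.0, 0.2, 1.0, 0.4, 0.0, 0.0, 0.7, 0.1]
shifted = circular_shift(profile, 3)

# The shift only changes the phase of each DFT coefficient, not its magnitude.
max_diff = max(
    abs(a - b)
    for a, b in zip(amplitude_spectrum(profile), amplitude_spectrum(shifted))
)
```

`max_diff` stays at floating-point rounding level even though `shifted` differs from `profile` bin by bin.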

For efficiency, we employ a lightweight 1D-CNN architecture ((2, 4), (2, 8), (1, 16)) to extract salient spectral features. This yields a frequency-domain feature sequence $H_{f} \in \mathbb{R}^{B \times L'' \times D_{f}}$, where $D_{f}$ is set to 16.

The final step is Feature Unification. Since the feature sequences $H_{t}$ and $H_{f}$ have differing dimensions ($D_{t} = 32$ and $D_{f} = 16$), they are projected into a unified feature space of dimension $D_{seq} = 64$ using separate linear layers:

$$H_{t}' = H_{t} W_{t} + b_{t} \tag{3}$$

$$H_{f}' = H_{f} W_{f} + b_{f} \tag{4}$$

where $W_{t}, W_{f}$ are learnable weight matrices and $b_{t}, b_{f}$ are learnable bias vectors. This projection ensures dimensional compatibility. Finally, the mapped sequences are concatenated along the length dimension to produce the unified HRRP representation $H_{hrrp} \in \mathbb{R}^{B \times (L' + L'') \times D_{seq}}$. This sequence serves as the input for the subsequent fusion module.

2.3. 3D Point Cloud Feature Extractor: DGCNN

To mitigate the aspect sensitivity limitation in HRRP, we utilize point clouds to provide rotation-invariant geometric context. We employ a DGCNN [48] as the feature backbone. Unlike architectures operating on fixed grids, DGCNN dynamically constructs local neighborhood graphs in the feature space, enabling the effective capture of fine-grained topological structures regardless of the target’s pose.

For a given point cloud input with $N = 256$ points, we represent each point $p_{i}$ by a feature vector (initially its 3D coordinates). At each layer, we construct a local geometric structure by identifying the $k = 20$ nearest neighbors (k-NN) for every point. Crucially, this graph is dynamically updated at each network depth, allowing the model to group points based on learned semantic similarities rather than just physical proximity.

The core operation is the EdgeConv block, which computes “edge features” describing the relationship between the central point $p_{i}$ and its neighbors $p_{j}$. To capture both local geometry and global position, we formulate the edge feature $e_{ij}$ as follows:

$$e_{ij} = (p_{j} - p_{i},\; p_{i}) \tag{5}$$

where $(\cdot, \cdot)$ denotes concatenation. Here, $p_{j} - p_{i}$ encodes the local neighborhood structure, while $p_{i}$ preserves absolute spatial information. These features are processed by a shared-weight Multi-Layer Perceptron (MLP), implemented efficiently as a 1 × 1 convolution. Finally, a channel-wise symmetric function (Max-Pooling) aggregates information from the local neighborhood, ensuring permutation invariance. This operation is defined as follows:

$$p_{i}' = \max_{j : (i, j) \in E} h_{\Theta}(p_{j} - p_{i},\; p_{i}) \tag{6}$$

where $E$ represents the set of edges in the dynamically constructed graph, and $h_{\Theta}$ denotes the learnable MLP.
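Equations (5) and (6) can be sketched in a few lines of pure Python. This is a simplified illustration rather than the DGCNN reference code: $h_{\Theta}$ is reduced to a single linear layer with ReLU, the weights are random, each point is excluded from its own neighbor list (an assumption), and the names `knn_indices` and `edge_conv` are hypothetical.

```python
import random


def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point (the point itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def edge_conv(points, k, weight, bias):
    """Eq. (5)-(6) with h_Theta simplified to one linear layer followed by ReLU.

    weight is [C_out][2 * C_in] and bias is [C_out].
    """
    out = []
    for i, neighbours in enumerate(knn_indices(points, k)):
        p_i = points[i]
        responses = []
        for j in neighbours:
            # Edge feature of Eq. (5): concatenation of (p_j - p_i) and p_i.
            edge = [a - b for a, b in zip(points[j], p_i)] + list(p_i)
            responses.append([
                max(sum(w * e for w, e in zip(row, edge)) + b, 0.0)
                for row, b in zip(weight, bias)
            ])
        # Channel-wise max over the neighbourhood, as in Eq. (6).
        out.append([max(col) for col in zip(*responses)])
    return out


rng = random.Random(0)
cloud = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(8)]
weight = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
bias = [0.0] * 4
features = edge_conv(cloud, 3, weight, bias)
```

Because the aggregation is a maximum over an unordered neighbor set, reordering the input points only reorders the rows of `features`, which is the permutation invariance the text refers to.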

To preserve geometric information across different abstraction levels, we aggregate the outputs from all stacked EdgeConv blocks. Specifically, the intermediate feature maps are concatenated along the channel dimension. This skip-connection design effectively integrates fine-grained local details from shallow layers with global semantic contexts from deep layers.

Following feature aggregation, a shared MLP (implemented as a 1D convolution) projects the combined features into a high-dimensional embedding space. The final output of the point cloud branch is a feature sequence $H_{pc} \in \mathbb{R}^{B \times N \times D_{pc}}$, where $N$ is the number of points and $D_{pc} = 1024$. This sequence provides the rotation-invariant geometric representation required for the subsequent Bi-CA fusion module.

2.4. Bi-CA Fusion Module

To effectively fuse the heterogeneous HRRP and point cloud features, we introduce a Bi-CA mechanism. This approach allows each modality to dynamically query the other, thereby selectively integrating complementary information.

Prior to fusion, we must map the input features into a shared latent space. Let the output sequence from the point cloud extractor be denoted as $H_{pc} \in \mathbb{R}^{B \times N \times D_{pc}}$ (where $D_{pc} = 1024$), and the unified sequence from the HRRP extractor as $H_{hrrp} \in \mathbb{R}^{B \times (L' + L'') \times D_{seq}}$ (where $D_{seq} = 64$). To align these features within a shared semantic space, we project both modalities to a common dimension $D_{fusion} = 64$ using separate learnable linear layers.

We set $D_{fusion} = 64$ to align with the HRRP feature dimension. Since the cross-modal information capacity is bounded by HRRP, we project the high-dimensional point cloud down to this shared space. This approach not only prevents spurious correlations arising from artificial up-sampling but also maintains an optimal balance between feature sufficiency and computational efficiency.

Formally, this projection is defined as follows:

$$H_{pc}' = H_{pc} W_{pc} + b_{pc} \tag{7}$$

$$H_{hrrp}' = H_{hrrp} W_{hrrp} + b_{hrrp} \tag{8}$$

where $W_{pc} \in \mathbb{R}^{D_{pc} \times D_{fusion}}$ and $W_{hrrp} \in \mathbb{R}^{D_{seq} \times D_{fusion}}$ are learnable weight matrices, $b_{pc}$ and $b_{hrrp}$ are the corresponding bias vectors, and $H_{pc}'$ and $H_{hrrp}'$ are the aligned feature sequences serving as inputs for the fusion module.

As illustrated in Figure 5, the first stream utilizes the aligned point cloud features $H_{pc}'$ to generate the Query vectors $Q_{pc}$, while the HRRP features $H_{hrrp}'$ are projected to produce the Key $K_{hrrp}$ and Value $V_{hrrp}$ vectors. The attention mechanism dynamically aggregates electromagnetic information for each geometric point:

$$\mathrm{Attention}(Q_{pc}, K_{hrrp}, V_{hrrp}) = \mathrm{Softmax}\left(\frac{Q_{pc} K_{hrrp}^{\top}}{\sqrt{d_{k}}}\right) V_{hrrp} \tag{9}$$

The geometric representation is then updated via a residual connection and Layer Normalization:

$$H_{pc}'' = \mathrm{LayerNorm}\left(H_{pc}' + \mathrm{Attention}(Q_{pc}, K_{hrrp}, V_{hrrp})\right) \tag{10}$$

This process effectively enriches the point cloud representation by selectively integrating complementary HRRP cues, yielding more discriminative geometric features. Symmetrically, the HRRP-enrichment stream employs $H_{hrrp}'$ as the Query and $H_{pc}'$ as the Keys and Values to ground abstract radar signals into the 3D geometric context, following the same formulation as Equations (9) and (10).
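One bi-directional layer following Equations (9) and (10) might look like the sketch below. It is deliberately simplified and should not be read as the authors' module: it uses a single attention head, identity Query/Key/Value projections, no learnable LayerNorm parameters, and tiny random token sequences. The function names are hypothetical.

```python
import math
import random


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def cross_attention(queries, context):
    """Single-head Eq. (9)-(10); Q, K and V projections are identity here."""
    d_k = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[c] for w, v in zip(weights, context)) for c in range(d_k)]
        # Post-norm design: residual addition first, then Layer Normalization.
        updated.append(layer_norm([a + b for a, b in zip(q, attended)]))
    return updated


def bi_directional_layer(h_pc, h_hrrp):
    """Each stream queries the other; both read the same (un-updated) inputs."""
    return cross_attention(h_pc, h_hrrp), cross_attention(h_hrrp, h_pc)


rng = random.Random(0)
h_pc = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]    # 5 point tokens
h_hrrp = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 HRRP tokens
pc_out, hrrp_out = bi_directional_layer(h_pc, h_hrrp)
```

Each stream keeps its own sequence length (5 and 3 tokens here), so the two modalities never need to be resampled to a common length. No causal mask is applied, so every query attends to the full sequence of the other modality.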

To facilitate deep feature interaction, we stack $L = 3$ layers with 8 attention heads, a choice validated by the sensitivity analysis in Appendix C (Figure A1). The refined feature sequences are aggregated via Global Average Pooling (GAP) and concatenated to form a unified vector $v_{final} \in \mathbb{R}^{B \times 2 D_{fusion}}$, which is fed into an MLP classifier. Regarding the attention configuration, we adopt a post-norm design (normalization after residual connection) to ensure training stability and forgo causal masking to maintain a global receptive field for effective bidirectional modeling.

2.5. Experimental Setup and Implementation Details

All model training and accuracy evaluations were conducted on a Windows platform equipped with an NVIDIA RTX 5070 GPU (Blackwell Architecture).

To ensure fair efficiency comparisons, we evaluated inference latency, FLOPs, and parameter counts on a Linux workstation powered by an NVIDIA RTX 4090. This hardware transition was necessitated by the Mamba-based baselines, which rely on the mamba-ssm library’s optimized CUDA kernels. These kernels require specific environment configurations (Linux OS and mature CUDA versions) that currently face compatibility constraints on the RTX 50 series architecture.

For model optimization, we employed the Adam optimizer with a weight decay of 1 × 10⁻⁵. A differential learning rate strategy was adopted to accommodate the distinct characteristics of different network components. Specifically, the learning rates for the HRRP feature extractor and the point cloud feature extractor were set to 3 × 10⁻⁴ and 5 × 10⁻⁴, respectively. The remaining modules, including the feature mapping layers, the cross-attention layers, and the final classifier, were assigned a learning rate of 1 × 10⁻⁴. All models were trained with a batch size of 32. For a fair comparison, we employed an early stopping protocol to ensure convergence. The training process was terminated if the validation loss did not decrease for 20 consecutive epochs. The model checkpoint corresponding to the lowest validation loss was then selected for the final evaluation.
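The early stopping protocol can be expressed as a small framework-independent routine. The sketch below is an illustration under stated assumptions: the dictionary keys and the function name are hypothetical, and the validation losses are synthetic. Only the learning rates and the patience of 20 epochs come from the text.

```python
# Differential learning rates from the text; the group names are illustrative.
LEARNING_RATES = {
    "hrrp_extractor": 3e-4,
    "point_cloud_extractor": 5e-4,
    "mapping_fusion_classifier": 1e-4,
}


def early_stopping_epoch(val_losses, patience=20):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not decreased for `patience` consecutive
    epochs; the checkpoint from `best_epoch` is the one kept for evaluation.
    """
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # training ended before patience ran out


# Synthetic loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 30
best_epoch, stop_epoch = early_stopping_epoch(losses)
```

With this curve the best checkpoint is epoch 2 (zero-indexed) and training stops 20 epochs later, at epoch 22.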

3. Results

3.1. Dataset Setup

3.1.1. Target Geometry and Parameters

Direct acquisition of measured electromagnetic data for space objects is difficult [61,62]. To the best of our knowledge, no public dataset currently exists that contains paired point cloud and HRRP. Adhering to standard research paradigms, we constructed a high-fidelity simulation dataset to address this gap. We selected three representative classes of Perfect Electrical Conductor (PEC) targets: cones, cylinders, and cone–cylinder composites.

The geometric configuration and dimensional parameters of the targets are illustrated in Figure 6 and Figure 7. Figure 6 defines the observation geometry, illustrating the coordinate system and the incident angle, $\theta$, which represents the line-of-sight direction relative to the target’s primary axis. The detailed longitudinal cross-sections and specific dimensions of these targets are provided in Figure 7. To ensure numerical stability and avoid meshing singularities during the electromagnetic simulation, the mathematically sharp tips of the cone and the cone–cylinder composite were regularized by replacing them with small paraboloids. The paraboloid for the single cone target was defined with a focal length of 0.0025 m, while that of the composite object’s tip utilized a focal length of 0.004 m.

3.1.2. Multimodal Data Simulation

1D HRRP Data: We employed the electromagnetic simulation software Altair FEKO (version 2023.1) to generate the HRRP data. In the simulation, the target surfaces were set as PEC. To focus on the intrinsic scattering characteristics of the space targets, the simulation was conducted in a free-space environment, meaning that background noise and multipath effects were not modeled in this initial phase. Subsequently, a wideband signal with a center frequency of 6 GHz and a bandwidth of 4 GHz (ranging from 4 to 8 GHz) was used for excitation. The signal was a stepped-frequency waveform with a frequency step of 50 MHz. The simulation initially generated 81 frequency points. To obtain a smoother range profile, we zero-padded the data to 512 points before the Inverse Fast Fourier Transform (IFFT). This oversampling increases density without altering the physical resolution.

The theoretical range resolution, $\Delta R$, is determined by the signal bandwidth, $B$, according to the following:

$$\Delta R = \frac{c}{2B}, \tag{11}$$

where $c$ is the speed of light. The maximum unambiguous range, $R_{un}$, is determined by the frequency step, $\Delta f$, as follows:

$$R_{un} = \frac{c}{2 \Delta f}. \tag{12}$$

According to Equations (11) and (12), the calculated range resolution is 3.75 cm, which is sufficient to capture the fine structural information of the targets, and the unambiguous range is 3 m, which completely encompasses the targets with an adequate margin. Considering that all three target types are bodies of revolution, we sampled them uniformly along the elevation angle ($\theta$) from 0° to 180° at 1° intervals, generating a total of 543 raw HRRP samples (181 angles for each of the three target types).
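The figures quoted above can be reproduced with a short calculation. The variable names are illustrative. The bin spacing after zero-padding is a derived quantity not stated in the text: it assumes the 512-point IFFT spans exactly one unambiguous range.

```python
C = 299_792_458.0  # speed of light in m/s

bandwidth_hz = 4e9    # 4 to 8 GHz sweep
freq_step_hz = 50e6   # stepped-frequency increment
n_fft = 512           # IFFT length after zero-padding

n_freq_points = round(bandwidth_hz / freq_step_hz) + 1  # both band edges included
range_resolution_m = C / (2 * bandwidth_hz)             # Eq. (11)
unambiguous_range_m = C / (2 * freq_step_hz)            # Eq. (12)

# Zero-padding interpolates the profile: the bins get denser, but the physical
# resolution stays at range_resolution_m.
bin_spacing_m = unambiguous_range_m / n_fft

n_angles = len(range(0, 181))  # 0 to 180 degrees in 1 degree steps
n_raw_samples = 3 * n_angles   # three target classes
```

This gives 81 frequency points, a resolution of about 3.75 cm, an unambiguous range of about 3 m, and 543 raw samples, matching the text. The interpolated bin spacing is roughly 0.59 cm under the stated assumption.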

We designed a Python 3.13 algorithm to generate the 3D point clouds of the targets by simulating LiDAR illumination on their 3D surfaces and capturing the reflected sparse point clouds. The viewpoints for point cloud generation were strictly aligned with the observation angles of the HRRP simulations. The raw point clouds were then uniformly down-sampled to a set of 256 points, resulting in 543 raw point cloud samples that are strictly paired with the HRRP.

3.1.3. Data Augmentation and Dataset Splitting

Data Augmentation and Scaling: To enhance model robustness and expand the dataset, we designed a joint data augmentation scheme comprising 15 distinct strategies, such as Gaussian noise injection, coordinate jittering, and global scaling. Detailed configurations for each strategy are provided in Appendix A (Table A1). Through these operations, the original 543 data pairs were expanded to a total of 8688 samples (543 × 16). Prior to training, all HRRP data underwent Min–Max normalization, while point clouds were centered and normalized to fit within a unit cube.
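The two normalization steps can be sketched as follows. The exact conventions are assumptions: the text does not say whether the unit cube is $[-0.5, 0.5]^3$ or another range, nor whether centering uses the centroid or the bounding-box center. This sketch uses centroid centering and scales the largest absolute coordinate to 0.5.

```python
def min_max_normalize(profile):
    """Rescale an HRRP amplitude sequence to the range [0, 1]."""
    lo, hi = min(profile), max(profile)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in profile]


def normalize_point_cloud(points):
    """Centre on the centroid, then scale so every coordinate lies in [-0.5, 0.5]."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    centred = [[p[d] - centroid[d] for d in range(3)] for p in points]
    half_extent = max(abs(c) for p in centred for c in p)
    scale = 0.5 / half_extent if half_extent > 0 else 1.0
    return [[c * scale for c in p] for p in centred]


# Dataset size: each original pair plus its 15 augmented versions.
n_total = 543 * (1 + 15)

hrrp = min_max_normalize([2.0, 4.0, 3.0])
cloud = normalize_point_cloud([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 1.0)])
```

A single uniform scale factor is used for all three axes so that the target's aspect ratio is preserved.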

Dataset Splitting: To rigorously evaluate extrapolation capability, we adopted a structured, angle-based splitting strategy. The observation range from 1° to 180° was partitioned into contiguous, non-overlapping angular blocks. Within each block, data were divided into training, validation, and test sets according to a 5:2:2 ratio. To assess generalization capability to unseen viewpoints under varying degrees of angular separation, we established six experimental configurations with block sizes ranging from 9° to 180°.

For instance, in the 90° split configuration, the 180° observation range is divided into two blocks (i.e., 1–90° and 91–180°). For the first block, the 5:2:2 ratio allocates angles 1–50° to training, 51–70° to validation, and 71–90° to testing. A similar division is applied to the second block. A larger block size imposes greater angular separation, presenting a more challenging generalization task. The sample corresponding to the 0° observation angle was consistently included in the training set across all configurations. Supplementary analysis (Appendix D, Figure A2) indicates that excluding 0° samples causes a negligible accuracy drop (<0.8%), ruling out potential data leakage from back-scattering symmetry.
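The block-wise splitting rule can be written as a short function. This is a sketch of the procedure as described, not the authors' code. The function name is hypothetical, and it assumes the block size divides 180 and is a multiple of 9 so that the 5:2:2 ratio yields whole angles.

```python
def angle_block_split(block_size, ratio=(5, 2, 2), max_angle=180):
    """Split angles 1..max_angle into contiguous blocks, each divided train/val/test."""
    total = sum(ratio)
    n_train = block_size * ratio[0] // total
    n_val = block_size * ratio[1] // total
    train, val, test = [0], [], []  # the 0 degree sample always goes to training
    for start in range(1, max_angle + 1, block_size):
        block = list(range(start, start + block_size))
        train += block[:n_train]
        val += block[n_train:n_train + n_val]
        test += block[n_train + n_val:]
    return train, val, test


train, val, test = angle_block_split(90)
```

For the 90° configuration the first block yields training angles 1–50°, validation angles 51–70°, and test angles 71–90°, as in the example above. With a block size of 180° the test angles are 141–180°, separated from every training angle by the 40° validation band.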

3.1.4. Evaluation Metrics

To comprehensively evaluate the classification performance, we employ two standard metrics: Overall Accuracy (OA) and F1-score. Accuracy serves as the primary metric for the general performance comparisons presented in Table 1. Additionally, given the potential for geometric confusion between target classes at specific viewpoints, we utilize the F1-score (the harmonic mean of Precision and Recall) in our ablation studies (Table 2 and Table 3). This metric provides a more robust assessment of the model’s ability to balance precision and recall, ensuring that the performance improvements are not biased towards specific classes.
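Both metrics are simple to compute from predicted and true labels. The sketch below uses macro averaging for the F1-score, which is an assumption: the text does not state how per-class scores are combined. The labels are made up for illustration.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (harmonic mean of precision and recall)."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


y_true = ["cone", "cone", "cylinder", "cylinder", "composite", "composite"]
y_pred = ["cone", "cylinder", "cylinder", "cylinder", "composite", "cone"]
oa = overall_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case four of six predictions are correct, and the per-class F1 values are 0.5 (cone), 0.8 (cylinder), and 2/3 (composite), so the macro F1 falls slightly below the accuracy.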

We also evaluate the model’s efficiency from three perspectives:

Parameters (Params): We calculate the number of trainable parameters of the entire model, measured in millions (M), to quantify the model’s size and memory footprint.

FLOPs (G): This metric quantifies the number of operations required for a single forward pass, measured in GFLOPs.

Latency (ms): Latency metrics in this paper represent the inference time of a single sample (batch size = 1). Unless otherwise specified, results are reported on an NVIDIA GeForce RTX 4090 GPU.
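A generic way to obtain a single-sample latency figure is sketched below. The warm-up and repeat counts, the function name, and the stand-in workload are assumptions; the paper does not describe its timing procedure. On a GPU, asynchronous execution must be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch), which this CPU-only sketch omits.

```python
import statistics
import time


def measure_latency_ms(fn, sample, warmup=10, repeats=100):
    """Mean wall-clock time of fn(sample) in milliseconds for one sample."""
    for _ in range(warmup):  # warm-up runs exclude one-time setup costs
        fn(sample)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(timings)


# Stand-in workload; a real benchmark would call the model's forward pass.
latency_ms = measure_latency_ms(lambda s: sum(v * v for v in s), list(range(1000)),
                                warmup=2, repeats=5)
```

Reporting the mean over many repeats after a warm-up phase reduces the influence of caching and scheduler noise on the result.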

We conducted comprehensive inference latency benchmarks across a wide range of hardware to assess deployment feasibility. As detailed in Table A2, our tests spanned from data center accelerators to consumer-grade GPUs.

For edge deployment, we specifically utilized the NVIDIA Jetson Orin Nano (8 GB). The experimental environment on this embedded platform was configured with Ubuntu 22.04 LTS and CUDA 12.6.68. Operating under the 15 W power mode, the embedded platform achieved an average inference latency of 33.29 ms. This result indicates that the proposed model possesses favorable deployment capabilities, with acceptable inference speeds on embedded platforms.

3.2. Experimental Results

In this section, we evaluate Point-HRRP-Net from three perspectives: (1) comparison with single-modality baselines to assess generalization to unseen viewpoints; (2) ablation studies on fusion strategies to validate the cross-attention mechanism; and (3) analysis of the feature extractor to justify our architectural choices.

3.2.1. Performance Comparison Against Single-Modality Methods

Table 1 compares the generalization performance of Point-HRRP-Net on unseen viewpoints against eight representative single-modality methods under six angle-based split configurations. These include our baseline HRRP network (HRRP-only), MSDP-Net, Point-Transformer, and DGCNN, as well as the recent advanced Transformer-style network (Conformer) and Mamba-style networks (1D-Mamba and PointMamba). A vertical comparison of accuracy reveals that the multi-modal framework consistently outperformed single-modality methods across both small and large angle dataset splits. Specifically, under the 9° split configuration, our multi-modal model achieved a peak accuracy of 97.51%.

As the angular separation of the dataset splits increased from 9° to 180°, the overall recognition performance of all models exhibited an expected downward trend. A horizontal comparison indicates that Point-HRRP-Net demonstrated excellent generalization capability under large viewpoint changes. In the 180° split, Point-HRRP-Net maintained an accuracy of 57.67%, surpassing all baseline models. It outperformed the HRRP-only baseline by 12.05 percentage points and the best-performing point cloud baseline, PointMamba, by 3.87 percentage points.

Regarding the single-modality HRRP baselines, performance variations among different architectures were pronounced. Leveraging its powerful sequence modeling capabilities, Conformer achieved the best results among HRRP-based methods on our dataset. For instance, it attained 52.45% accuracy under the 180° split, outperforming both MSDP-Net and 1D-Mamba. Notably, 1D-Mamba showed a distinct advantage on the 90° split, achieving the highest accuracy among HRRP-based methods in this configuration.

Regarding single-modality point cloud methods, PointMamba significantly outperformed the traditional Point-Transformer and DGCNN across all angle splits. Specifically, it achieved an accuracy of 53.80% under the 180° split, the highest among the single-modality point cloud baselines.

3.2.2. Ablation Study on Fusion Strategies

To verify the effectiveness of the proposed Bi-CA mechanism, we conducted comparative experiments by replacing this module with eight fusion strategies: Concatenation, Addition, Product, Gating, Self-Attention, Linear Attention, Efficient Attention, and Bi-Mamba. Detailed ablation study results are listed in Table 2.

For the simple fusion strategies (Concatenation, Addition, and Product), experiments showed they maintained a low inference latency between 5.34 ms and 6.00 ms. However, they performed poorly in large-angle scenarios. These methods achieved respectable F1-scores of approximately 93–96% on the 9° split. While this suggests that basic aggregation is viable when viewpoint discrepancies are minimal, their performance declined significantly as the disparity expanded to 45° and 90°. On the 90° split, Concatenation, Product, and Addition achieved F1-scores of 51.77%, 54.69%, and 60.26%, respectively. Even the best among them lagged behind our proposed method by over 5%. This performance indicates that although simple strategies are fast and parameter-efficient, they lack sufficient capacity to fit complex features. Consequently, they suffer from severe generalization deficiencies under extreme conditions.
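As a rough illustration of the three parameter-light strategies discussed above, the following NumPy sketch fuses two pooled feature vectors. The feature width (128), the random inputs, and the function name `simple_fuse` are hypothetical and are not taken from our implementation.

```python
import numpy as np

def simple_fuse(f_hrrp, f_pc, mode):
    """Fuse two global feature vectors of equal length (illustrative only)."""
    if mode == "concat":
        return np.concatenate([f_hrrp, f_pc], axis=-1)  # doubles the feature width
    if mode == "add":
        return f_hrrp + f_pc                            # element-wise sum
    if mode == "product":
        return f_hrrp * f_pc                            # element-wise (Hadamard) product
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
f_hrrp = rng.standard_normal(128)  # hypothetical pooled HRRP feature
f_pc = rng.standard_normal(128)    # hypothetical pooled point cloud feature
fused = {m: simple_fuse(f_hrrp, f_pc, m) for m in ("concat", "add", "product")}
```

None of these operations lets a feature from one modality re-weight individual tokens of the other, which is the kind of interaction the attention-based strategies add.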

Regarding the classic Gating and standard Self-Attention mechanisms, Gating introduced weight modulation but only achieved an F1-score of 58.44% on the 90° test set, failing to surpass the 60% threshold. Self-Attention demonstrated high accuracy under the simple 9° angle but suffered a marked decline on the 90° split, with the F1-score dropping to 47.81%. This represents a gap of 17.73% compared to our method (65.54%). Although Self-Attention theoretically possesses strong fitting capabilities due to its large parameter count, this sharp decline suggests that overfitting occurred in this task. Furthermore, the model complexity increased the latency to 6.99 ms yet failed to provide the expected accuracy gains.

Subsequently, we evaluated fast attention mechanisms: Linear Attention and Efficient Attention. Standard attention scales quadratically ($O(N^{2})$) with sequence length. In contrast, these fast variants reduce the complexity to $O(N)$ through approximation techniques. Experiments showed that while these methods maintained F1-scores of 91–94% on the 9° split, their performance degraded significantly on the 90° split, dropping to 53.70% and 48.54%, respectively. This indicates that while approximation reduced complexity, it also failed to preserve detailed feature information. Moreover, with latencies ranging from 7.6 ms to 8.1 ms, they offered only a marginal speed advantage over our method (8.76 ms).
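The complexity difference can be sketched as follows. This is a minimal NumPy illustration, assuming a kernelized formulation with the `elu(x) + 1` feature map as one common choice; the exact Linear Attention and Efficient Attention variants used in the ablation may differ, and the token counts and widths below are placeholders.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard attention: materializes an (N_q, N_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    """Kernelized attention: never forms the (N_q, N_k) matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # (d, d_v) summary, cost linear in N_k
    z = qf @ kf.sum(axis=0)       # (N_q,) row normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32))   # placeholder: 256 query tokens
k = rng.standard_normal((512, 32))   # placeholder: 512 key tokens
v = rng.standard_normal((512, 32))
out_full = softmax_attention(q, k, v)    # builds a 256 x 512 weight matrix
out_linear = linear_attention(q, k, v)   # builds only a 32 x 32 summary
```

The two functions are not numerically equivalent: the kernelized form replaces the softmax with a different similarity function, which is one plausible source of the lost detail noted above.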

Finally, we compared the Bi-Mamba fusion strategy. Theoretically, the Mamba architecture has advantages in parameter efficiency and inference speed. Our experimental results showed that Bi-Mamba has 0.8275 M parameters, which is close to simple concatenation. However, its inference latency was 8.53 ms, showing only a negligible advantage over our method (8.76 ms). We speculate that Mamba’s advantage is maximized in ultra-long sequences, whereas our inputs are relatively short: the HRRP has a length of 512 and the point cloud contains 256 points. Moreover, although Bi-Mamba achieved an F1-score of 95.68% on the 9° split, its score dropped to 46.09% on the 90° split. We attribute this to a structural difference: unlike Bi-CA’s explicit token-to-token interaction, Mamba relies on a recurrent scan. We hypothesize that this mechanism compresses cross-modal interactions into hidden states, which may hinder the model’s ability to capture the correlations between HRRP and point clouds.
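The compression effect hypothesized above can be illustrated with a deliberately simplified recurrence. The sketch below uses fixed coefficients, whereas Mamba's selective scan makes them input-dependent; the function name `linear_scan` and the decay value are hypothetical.

```python
import numpy as np

def linear_scan(tokens, decay=0.9):
    """h_t = decay * h_{t-1} + (1 - decay) * x_t; returns the final hidden state."""
    h = np.zeros(tokens.shape[-1])
    for x in tokens:
        h = decay * h + (1.0 - decay) * x
    return h

rng = np.random.default_rng(0)
short_seq = rng.standard_normal((256, 64))
long_seq = rng.standard_normal((4096, 64))
state_short = linear_scan(short_seq)  # shape (64,)
state_long = linear_scan(long_seq)    # shape (64,), same size despite 16x more tokens
```

Whatever the sequence length, all tokens are summarized in a state of fixed size, so no explicit pairwise weight between an HRRP token and a point token is ever formed.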

Through bi-directional interaction modeling, our method achieved an F1-score of 65.54% on the 90° split, outperforming the second-best method (Addition) by 5.28% and the Self-Attention mechanism by 17.73%. Although this mechanism increased inference latency to 8.76 ms (an increase of approximately 3.4 ms compared to the fastest simple fusion), we consider this latency cost acceptable for practical deployment in real-time recognition systems.
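For intuition, bi-directional cross-attention can be sketched in a few lines of NumPy. This is a single-head simplification without learned projections, and the residual connection, mean pooling, and token counts (512 HRRP tokens, 256 point tokens, width 64) are assumptions made for illustration rather than a description of the actual Bi-CA module.

```python
import numpy as np

def cross_attend(query_tokens, context_tokens):
    """Single-head cross-attention without learned projections (illustrative)."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)   # (N_q, N_ctx)
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context_tokens, weights

def bi_directional_fusion(hrrp_tokens, pc_tokens):
    # Direction 1: point cloud tokens query the HRRP tokens.
    pc_enriched, w_pc2hrrp = cross_attend(pc_tokens, hrrp_tokens)
    # Direction 2: HRRP tokens query the point cloud tokens.
    hrrp_enriched, w_hrrp2pc = cross_attend(hrrp_tokens, pc_tokens)
    # Residual connection, mean pooling, and concatenation (assumed design).
    fused = np.concatenate([(pc_tokens + pc_enriched).mean(axis=0),
                            (hrrp_tokens + hrrp_enriched).mean(axis=0)])
    return fused, w_pc2hrrp, w_hrrp2pc

rng = np.random.default_rng(0)
hrrp_tokens = rng.standard_normal((512, 64))
pc_tokens = rng.standard_normal((256, 64))
fused, w_pc2hrrp, w_hrrp2pc = bi_directional_fusion(hrrp_tokens, pc_tokens)
```

The two weight matrices, of shape 256 × 512 and 512 × 256 here, are the explicit token-to-token interactions that the simpler fusion strategies lack.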

3.2.3. Ablation Study on Feature Extractors

We conducted an ablation study on feature extractors to validate the rationale behind our final architecture. The experimental results are presented in Table 3. By substituting different components, we evaluated the contribution of each part to the overall performance and efficiency.

Regarding HRRP feature extraction, our proposed dual-domain CNN-Transformer demonstrated the best overall performance, achieving an F1-score of 97.47% on the 9° split. Experimental data showed that 1D-Mamba minimized system latency to 6.52 ms, leveraging its selective scan mechanism. However, this speed advantage came at a cost under large viewing-angle changes: its F1-score on the 90° split was only 50.10%, well below our model’s 65.54%. Similarly, the Transformer-based Conformer performed well in sequence modeling (58.02% F1-score on 90°), yet it failed to outperform our architecture. Traditional recurrent neural networks (RNN, LSTM, etc.) performed the worst, as they struggled to capture complex spatial scattering features in HRRP. Therefore, the CNN-Transformer was a robust choice for balancing high F1-scores and real-time performance in this study.

For point cloud feature extraction, we compared DGCNN against classic methods and emerging lightweight architectures. Experimental results revealed a notable discrepancy between theoretical and actual efficiency. Although PointMamba possessed extremely low theoretical FLOPs (only 0.0871 G), its actual inference latency in the multi-modal framework (13.81 ms) was higher than that of our method (8.76 ms). This indicates that the theoretical efficiency of the Mamba architecture failed to materialize as actual inference acceleration within the current framework. We attribute this discrepancy to three primary factors. First, point clouds are unordered sets, so PointMamba must impose a point ordering before the SSM can be applied. This index reordering of unstructured data on the GPU can disrupt memory coalescing, leading to latency overhead. Second, Mamba is still under active development, and its underlying CUDA kernels may not be as mature as those used by DGCNN. Third, Mamba’s linear complexity advantage only becomes significant with very long sequences. In this experiment, the point cloud contained only 256 points, rendering the latency benefit insignificant. Regarding generalization performance, DGCNN maintained a leading position. Its F1-score on the 90° split (65.54%) was significantly better than those of PointMLP (54.90%) and PointMamba (57.19%). This result demonstrated the advantage of DGCNN’s EdgeConv operation in capturing the local geometric structure of 3D targets.
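To make the EdgeConv operation concrete, the sketch below implements one layer in NumPy: each point is connected to its k nearest neighbours, edge features `[x_i, x_j - x_i]` are passed through a shared linear map with ReLU, and the result is max-pooled over the neighbours. The value k = 8, the output width, and the random weights are placeholders, and the graph is built from coordinates only, whereas DGCNN recomputes it in feature space at every layer.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every point (excluding itself)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]

def edge_conv(features, points, weight, k=8):
    """One EdgeConv-style layer: max over neighbours of ReLU([x_i, x_j - x_i] W)."""
    idx = knn_indices(points, k)                                  # (N, k)
    center = np.repeat(features[:, None, :], k, axis=1)           # (N, k, C)
    neighbor = features[idx]                                      # (N, k, C)
    edge = np.concatenate([center, neighbor - center], axis=-1)   # (N, k, 2C)
    return np.maximum(edge @ weight, 0.0).max(axis=1)             # (N, C_out)

rng = np.random.default_rng(0)
points = rng.standard_normal((256, 3))
weight = rng.standard_normal((6, 32)) * 0.1    # random stand-in for learned weights
out = edge_conv(points, points, weight, k=8)   # first layer: features are the coordinates
```

Because every output feature depends on explicit differences between a point and its neighbours, local geometric structure is encoded directly rather than through a serialized scan.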

Finally, it is worth noting that while PointMamba showed strong potential in single-modality tasks (as shown in Table 1), its accuracy dropped within our multi-modal fusion framework. This suggests a challenge in heterogeneous feature alignment when fusing PointMamba-generated features with HRRP features. We attribute this decline to structural incompatibility. Specifically, our Bi-CA module requires spatially explicit features to construct a global attention matrix for point-to-point interaction. However, PointMamba processes data via implicit state transitions along linear sequences. This mechanism inherently lacks the explicit geometric structure that Bi-CA demands. Given our focus on validating the fusion paradigm rather than pursuing single-modality SOTA, we selected DGCNN. Its explicit geometric modeling provides a baseline to verify the feasibility of the proposed framework.

4. Discussion

4.1. Visual Analysis of Cross-Modal Interactions

We visualized the attention weights in Figure 8 to examine how the two modalities interact. In the heatmaps, yellow regions denote higher attention weights, whereas blue regions indicate lower weights.

When point cloud features serve as the query (Figure 8a,b), the model assigns higher weights to high amplitudes in the HRRP time-domain and low-to-mid frequencies in the frequency domain. These high-amplitude peaks correspond to the dominant scattering centers of the object. Notably, extremely low frequencies receive almost no weight. This suggests that even if these components possess high spectral amplitudes, the model does not consider them useful for target classification. Physically, extremely low frequencies correspond to the slowest variations in the range profile structure, representing merely coarse outlines. This indicates that the network can effectively identify and utilize dominant scattering features while suppressing background noise from non-informative low-frequency components.

Conversely, when HRRP features are used as the query (Figure 8c), attention is not uniformly distributed. Instead, it focuses on geometric edges and discontinuities. Compared to smooth surfaces, edges typically provide more discriminative information for classification. It appears that the model has learned to focus on these geometrically significant regions. By weighting these features more heavily, the model achieves more stable classification performance. This explains why the cross-attention mechanism is more effective than simple feature concatenation.

We employ t-SNE to visualize feature distributions at different network stages, as illustrated in Figure 9. Both original HRRP and point cloud signals exhibit severe overlapping. After processing by the dual-stream extractors, the two modalities show different results. The extracted HRRP features remain highly dispersed with severe class overlap; this intuitively reflects the limitation of radar signatures caused by aspect sensitivity. The extracted point cloud features display clear primary clusters, but their distributions remain relatively loose with ambiguous boundaries between classes. Following the Bi-CA fusion, the fused features demonstrate enhanced intra-class compactness and wider inter-class margins.

4.2. Robustness Analysis

4.2.1. Analysis of Rotational Offset Scenarios

In real-world applications, multi-modal inputs are rarely perfectly aligned. To quantify the impact of such imperfections, we conducted a misalignment test using the pre-trained model on the full test set. Specifically, during the testing phase we applied a rotational offset $\Delta\theta$ ranging from 0° to 90° to one modality to simulate varying degrees of misalignment.
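One way to inject such an offset is sketched below. It assumes that the offset is applied to the point cloud and that the rotation is about the z-axis; both choices, and the function name `rotate_about_z`, are assumptions made for illustration.

```python
import numpy as np

def rotate_about_z(points, delta_theta_deg):
    """Rotate an (N, 3) point cloud by delta_theta_deg degrees about the z-axis."""
    theta = np.deg2rad(delta_theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

rng = np.random.default_rng(0)
cloud = rng.standard_normal((256, 3))
# Sweep the offset while the paired HRRP sample is left unchanged.
misaligned = {dt: rotate_about_z(cloud, dt) for dt in range(0, 91, 10)}
```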

Figure 10 illustrates the results under the 18° split configuration. We observe that when the rotational offset is within the range of 0° to 10°, the accuracy decline is negligible. Subsequently, the accuracy exhibits slight fluctuations but remains consistently above 98%. This suggests that since both modalities describe the same target, the model retains sufficient discriminatory information even with minor misalignments. When the error exceeds 50°, the fluctuations increase and a noticeable decline occurs, which can be attributed to feature conflicts arising from significant discrepancies between the modalities. Nevertheless, the accuracy remains above 96.2%.

Figure 11 compares the performance across different dataset split configurations. The results for the 9°, 18°, 36°, and 45° splits are similar, showing a generally flat trend with minimal degradation. However, for the 90° split, the decline in accuracy becomes pronounced as the rotational offset increases. In this configuration, the angular separation between the training and test sets is maximal, placing the highest demand on the model’s generalization capability. We infer that the model likely relies on consistent geometric–physical correspondences for inference in these unseen views. When this correspondence is disrupted by misalignment, the performance drop is consequently more significant compared to less challenging configurations.

4.2.2. Model Stress Test Analysis

To quantify, to a certain extent, the gap between simulation and real-world data, we conducted progressive stress tests on the model trained with the 9° split. Specifically, we defined four stress scenarios: (1) HRRP SNR deterioration; (2) random point cloud jitter; (3) random point cloud point dropping; and (4) the combination of the previous three scenarios. We tested the model under these four conditions and observed the accuracy decline curves.

The test results are shown in Figure 12. The first row on the x-axis represents the Signal-to-Noise Ratio (SNR) of the HRRP, decreasing gradually from 30 dB. When the HRRP SNR is $-15$ dB, we consider the signal to be almost obscured by noise. The second row represents the magnitude of random point cloud jitter, whose numerical value directly applies to the normalized point cloud coordinates. The third row represents the degree of random point cloud loss. When the point cloud loss reaches 70%, we can consider the point cloud to be almost unrecognizable. Curves of different colors represent the addition of different noise conditions, and the maroon curve represents the superposition of all three conditions.
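The three degradations can be sketched as follows. The code assumes a real-valued HRRP amplitude vector, white Gaussian noise, and isotropic Gaussian jitter; the function names are hypothetical, and how the reduced point set is brought back to the fixed input size after dropping is not specified here.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def jitter_points(points, sigma, rng):
    """Perturb normalized coordinates with zero-mean Gaussian jitter."""
    return points + rng.normal(0.0, sigma, size=points.shape)

def drop_points(points, drop_ratio, rng):
    """Randomly discard a fraction of the points (at least one point is kept)."""
    keep = max(1, int(round(points.shape[0] * (1.0 - drop_ratio))))
    idx = rng.choice(points.shape[0], size=keep, replace=False)
    return points[idx]

rng = np.random.default_rng(0)
hrrp = np.abs(rng.standard_normal(512))      # placeholder range profile
cloud = rng.standard_normal((256, 3))        # placeholder normalized point cloud
# Combined scenario at the most severe setting reported in the text.
hrrp_bad = add_noise_at_snr(hrrp, -15.0, rng)
cloud_bad = drop_points(jitter_points(cloud, 0.15, rng), 0.70, rng)
```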

The y-axis represents the average accuracy. When noise is injected solely into the HRRP, the accuracy remains at 95%, even when the signal is nearly overwhelmed by noise. This suggests that the model may rely more on the geometric branch of the point cloud. Notably, however, even when the HRRP is already buried in noise, the multi-modal accuracy is still higher than the single-modality accuracy in Table 1. We speculate that this is because, during training, the point cloud branch has already learned more robust features through its interaction with the HRRP. The accuracy decline caused by point cloud jitter and point cloud loss is more severe. We speculate that this is because DGCNN, adopted as the point cloud extraction branch, is relatively sensitive to the geometric relationship between points and their neighbors.

When the three degradation conditions are superimposed (solid maroon line), the model’s accuracy changes very little within the range from the initial state ($\mathrm{SNR} = 30$ dB, $\sigma = 0.01$, $\mathrm{Drop} = 5\%$) down to the intermediate state ($\mathrm{SNR} = 15$ dB, $\sigma = 0.05$, $\mathrm{Drop} = 20\%$). This indicates that the system possesses a certain degree of noise resistance. Subsequently, as the severity increases, the performance gradually declines. However, even under the most extreme condition ($\mathrm{SNR} = -15$ dB, $\sigma = 0.15$, $\mathrm{Drop} = 70\%$), the accuracy remains at 66.2%. This indicates that the model did not collapse. We believe that although the effective information available is limited, the model remains capable of making valid classification predictions.

Through this experiment, we can quantify the impact of different levels of noise on model performance. It helps us predict the model’s inference ability on real-world data.

4.3. Limitations and Future Directions

We acknowledge several limitations in the current work.

First is the “Sim-to-Real” gap. Current model validation relies on high-fidelity electromagnetic simulation data. However, real-world space objects often exhibit complex electromagnetic scattering characteristics due to non-PEC materials. Furthermore, real-world targets may exhibit complex micro-motions (e.g., spinning, tumbling), which challenges the model’s robustness in dynamic scenarios.

Second, regarding robustness, stress test results indicate that the model tends to rely on the geometric features provided by the point cloud for decision-making. When point cloud quality degrades severely, the model’s recognition performance declines significantly.

Third, the system requires strict synchronization. Point-HRRP-Net is a strict end-to-end model, requiring input data to be fixed in dimension and strictly paired. However, in practical sensor systems, radar and LiDAR sampling rates are asynchronous. The current architecture does not yet support asynchronous inputs, nor does it possess inference capabilities when a single modality is missing.

Fourth, a scalability bottleneck exists within the cross-attention mechanism. While the model meets real-time requirements at the current data resolution, we must acknowledge that the computational complexity of cross-attention scales quadratically with sequence length. In more complex real-world scenarios, such as those requiring the processing of high-density point clouds or ultra-high-resolution HRRP, the model’s memory consumption and computational load will therefore grow quadratically rather than linearly.
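A back-of-the-envelope count makes the scaling concrete. The sketch assumes one attention token per HRRP range cell and per point, which may not match the tokenization actually used, and it counts score entries per head across the two attention directions.

```python
def attention_entries(n_hrrp, n_points, directions=2):
    """Number of attention-score entries per head across both directions."""
    return directions * n_hrrp * n_points

base = attention_entries(512, 256)       # sequence lengths reported in Section 3.2.2
doubled = attention_entries(1024, 512)   # both sequence lengths doubled
print(base, doubled, doubled / base)     # 262144 1048576 4.0
```

Doubling both input resolutions therefore quadruples the size of the attention maps.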

Finally, we identified a compatibility issue with SSM. While PointMamba demonstrated superior performance in single-modality tasks, its accuracy degraded within our framework. The structural incompatibility between serialized 1D features and HRRP representations hinders the effective fusion of these two modalities.

Our future research will focus on the following directions: (1) we will attempt to use real-world measured data for training and testing; (2) we will investigate new training strategies or loss functions to reduce the over-reliance on a single modality; (3) we will improve the model framework to adapt to the asynchronous sampling rates of radar and LiDAR and explore single-modality inference mechanisms; (4) we will attempt to introduce lightweight techniques, such as efficient attention variants or model pruning, to balance efficiency and accuracy for resource-constrained scenarios; and (5) given the potential demonstrated by PointMamba in single-modality comparisons, we plan to design a fusion scheme better suited to its serialized features for space object classification.

5. Conclusions

The aspect sensitivity of HRRP severely restricts the classification capability of space objects under unseen viewpoints. In this work, we presented Point-HRRP-Net, a multi-modal fusion framework designed to mitigate this limitation. By integrating HRRP with 3D LiDAR point clouds via a Bi-CA mechanism, our approach effectively synthesizes electromagnetic scattering signatures with rotation-invariant geometric topology, demonstrating superior generalization capabilities to unseen viewpoints. Ablation studies validated the effectiveness of the proposed design. Given the scarcity of paired experimental data, evaluations were conducted on a constructed simulation dataset. To assess real-world applicability, we evaluated potential sim-to-real discrepancies in Section 4.2.1 and Section 4.2.2. We acknowledge that a performance degradation of ≥10% on real data is conceivable. Furthermore, we provided extensive latency comparisons across various hardware platforms in Appendix B to evaluate deployment feasibility. Given PointMamba’s demonstrated potential in single-modality tasks, we plan to design a specialized multi-modal framework for it in future work. Additionally, we aim to explore optimization schemes for asynchronous inputs and lightweight deployment.

Author Contributions

Conceptualization, Z.Z. and Z.Y.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z., Z.Y. and Y.W.; formal analysis, Z.Z., Z.Y. and Y.W.; investigation, Z.Z.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z., Z.Y., H.Z., Y.W. and K.M.; visualization, Z.Z. and Z.Y.; supervision, H.Z. and K.M.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported in part by the Open Foundation of the State Key Laboratory of Precision Space-time Information Sensing Technology (No. STSL2025-B-04-01(L)) and the Sichuan Science and Technology Program (No. 2024YFHZ0002).

Data Availability Statement

The dataset presented in this study is openly available on GitHub at https://github.com/zzo-zhao/HRRP-PC-Paired-Dataset (accessed on 2 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HRRP	High-Resolution Range Profile
LiDAR	Light Detection and Ranging
Bi-CA	Bi-Directional Cross-Attention
DGCNN	Dynamic Graph Convolutional Neural Network
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
GRU	Gated Recurrent Unit
SSM	State Space Model
PEC	Perfect Electrical Conductor
FFT	Fast Fourier Transform
IFFT	Inverse Fast Fourier Transform
OA	Overall Accuracy
SNR	Signal-to-Noise Ratio
MLP	Multi-Layer Perceptron
SAR	Synthetic Aperture Radar
IR	Infrared
GAP	Global Average Pooling

Appendix A. Data Augmentation Strategies

Table A1. Detailed list of the 15 joint data augmentation strategies.

Strategy ID	HRRP Augmentation	Point Cloud (PC) Augmentation
1	Gaussian noise ( $σ = 0.01$ )	Gaussian jitter to coordinates ( $σ = 0.01$ )
2	Gaussian noise ( $σ = 0.03$ )	Gaussian jitter to coordinates ( $σ = 0.03$ )
3	Gaussian noise ( $σ = 0.05$ )	Gaussian jitter to coordinates ( $σ = 0.05$ )
4	Gaussian noise ( $σ = 0.1$ )	Gaussian jitter to coordinates ( $σ = 0.1$ )
5	Amplitude scaling (range: 0.9–1.1)	Global scaling (range: 0.9–1.1)
6	Linear shifting (zero-padded)	No operation
7	No operation	Random rotation
8	Gaussian Noise ( $σ = 0.01$ ) + Linear shifting	Rotation + Global scaling (0.9–1.1)
9	Gaussian Noise ( $σ = 0.03$ ) + Linear shifting	Rotation + Global scaling (0.9–1.1)
10	Gaussian Noise ( $σ = 0.05$ ) + Linear shifting	Rotation + Global scaling (0.9–1.1)
11	Gaussian Noise ( $σ = 0.1$ ) + Linear shifting	Rotation + Global scaling (0.9–1.1)
12	Amplitude Scaling (0.9–1.1) + Linear shifting	Gaussian Jitter ( $σ = 0.01$ ) + Rotation
13	Amplitude Scaling (0.9–1.1) + Linear shifting	Gaussian Jitter ( $σ = 0.03$ ) + Rotation
14	Amplitude Scaling (0.9–1.1) + Linear shifting	Gaussian Jitter ( $σ = 0.05$ ) + Rotation
15	Amplitude Scaling (0.9–1.1) + Linear shifting	Gaussian Jitter ( $σ = 0.1$ ) + Rotation
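As an illustration of how one row of Table A1 might be applied to a paired sample, the sketch below follows the pattern of Strategy 8 (Gaussian noise plus zero-padded linear shifting for the HRRP; rotation plus global scaling for the point cloud). The maximum shift, the rotation axis and range, and the function names are assumptions, since Table A1 does not specify them.

```python
import numpy as np

def shift_zero_padded(hrrp, shift):
    """Shift a 1D profile by `shift` range cells, filling vacated cells with zeros."""
    out = np.zeros_like(hrrp)
    if shift > 0:
        out[shift:] = hrrp[:-shift]
    elif shift < 0:
        out[:shift] = hrrp[-shift:]
    else:
        out[:] = hrrp
    return out

def augment_pair(hrrp, points, rng, sigma=0.01, max_shift=10):
    """Strategy-8-style joint augmentation (shift range and rotation axis assumed)."""
    hrrp_aug = hrrp + rng.normal(0.0, sigma, size=hrrp.shape)
    hrrp_aug = shift_zero_padded(hrrp_aug, int(rng.integers(-max_shift, max_shift + 1)))
    theta = rng.uniform(0.0, 2.0 * np.pi)     # assumed: rotation about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.9, 1.1)             # global scaling range from Table A1
    return hrrp_aug, scale * (points @ rot.T)

rng = np.random.default_rng(0)
hrrp_aug, cloud_aug = augment_pair(np.ones(512), rng.standard_normal((256, 3)), rng)
```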

Appendix B. Hardware Efficiency and Deployment Analysis

Table A2 presents the inference latency of our model on different devices. To ensure a fair comparison, all tests listed below were conducted in a Linux environment. The testing methodology involved measuring the inference time with a batch size of 1. Specifically, we first performed 50 warm-up runs. Then, we repeated the inference 100 times to calculate the final average result.
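The measurement protocol described above can be sketched as a small timing harness. The function name `measure_latency_ms` and the stand-in workload are hypothetical; `infer` represents any callable that runs the model on a single sample.

```python
import time
import statistics

def measure_latency_ms(infer, sample, warmup=50, runs=100):
    """Average single-sample latency in milliseconds after warm-up runs."""
    for _ in range(warmup):            # warm-up runs are executed but not timed
        infer(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(timings)

# Stand-in workload; a real test would pass the model's forward function.
latency = measure_latency_ms(lambda x: sum(v * v for v in x), list(range(1000)))
```

On a GPU, kernel launches are asynchronous, so a faithful measurement would also synchronize the device before reading the timer (for example with `torch.cuda.synchronize()` in PyTorch).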

It is important to note that for lightweight models performing single-sample inference, the GPU load remains relatively low; consequently, CPU performance significantly impacts the overall latency. Since the host CPUs varied across different testing platforms, the data is for reference only.

Table A2. Inference latency comparison of our model on different devices (Batch Size = 1).

Category	Device Name	Architecture	Latency (ms)
Consumer GPU	NVIDIA RTX 5090	Blackwell	5.75
	NVIDIA RTX 5070	Blackwell	6.87
	NVIDIA RTX 4090	Ada Lovelace	8.76
	NVIDIA RTX 4090 D	Ada Lovelace	6.52
	NVIDIA RTX 3080 Ti	Ampere	9.82
Workstation/Data Center	NVIDIA RTX 6000 Ada	Ada Lovelace	5.20
	NVIDIA H800	Hopper	6.26
	NVIDIA H20	Hopper	6.33
	NVIDIA A800 (80 G)	Ampere	7.51
	NVIDIA L20	Ada Lovelace	5.98
	NVIDIA Tesla V100 (32 G)	Volta	23.28
	NVIDIA RTX A4000	Ampere	9.68
CPU (x86)	Intel Xeon Gold 6459C	Sapphire Rapids	8.77
	AMD Ryzen 7 9700X	Zen 5	12.31
	AMD EPYC 9654	Zen 4	19.62
	AMD EPYC 9754	Zen 4c	24.65
	Intel Xeon Platinum 8352V	Ice Lake	32.90
Embedded GPU	NVIDIA Jetson Orin Nano (8 GB)	Ampere	33.29
NPU	Huawei Ascend 910B2	Da Vinci	241.28 *

* Note: The high latency on the NPU is primarily attributed to the non-optimized kernel within the CANN framework.

Appendix C. Sensitivity Analysis

We conducted a sensitivity analysis to investigate the impact of the number of attention heads on model accuracy. By employing different random seeds, we repeated the experiments ten times to calculate the mean classification accuracy on the 90° split. As shown in Figure A1, we compared configurations with 4, 8, and 16 heads. The results indicate that the model achieves optimal accuracy with 8 attention heads. Furthermore, to evaluate the stability, we reported the variance for each configuration (0.258 for 4 heads, 0.305 for 8 heads, and 0.268 for 16 heads).

Figure A1. Sensitivity analysis of the number of attention heads versus classification accuracy on the 90° split test set.

Appendix D. Analysis of Potential Data Leakage from Back-Scattering Symmetry

We designed an experiment to exclude 0° samples from the training set to investigate whether back-scattering symmetry poses a risk of data leakage. We calculated the mean classification accuracy over ten repeated experiments on the 180° split. As shown in Figure A2, the mean accuracy after excluding 0° samples is 56.88%. Compared to the baseline (57.67%), the decrease in accuracy is only 0.79%, which falls within the 2% tolerance threshold. This result indicates that the model does not significantly rely on viewpoint symmetry for classification.

Figure A2. Impact of excluding 0° samples on classification accuracy on the 180° split.

References

Kechagias-Stamatis, O.; Aouf, N. Automatic Target Recognition on Synthetic Aperture Radar Imagery: A Survey. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 56–81. [Google Scholar] [CrossRef]
Obaideen, K.; McCafferty-Leroux, A.; Hilal, W.; AlShabi, M.; Gadsden, S.A. Analysis of deep learning in automatic target recognition: Evolution and emerging trends. In Proceedings of the Automatic Target Recognition XXXV; Chen, K., Hammoud, R.I., Overman, T.L., Eds.; International Society for Optics and Photonics, SPIE: Orlando, FL, USA, 2025; Volume 13463, p. 134630E. [Google Scholar] [CrossRef]
Zhang, Y.P.; Zhang, L.; Kang, L.; Wang, H.; Luo, Y.; Zhang, Q. Space Target Classification with Corrupted HRRP Sequences Based on Temporal–Spatial Feature Aggregation Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5100618. [Google Scholar] [CrossRef]
Chen, B.; Liu, H.; Chai, J.; Bao, Z. Large margin feature weighting method via linear programming. IEEE Trans. Knowl. Data Eng. 2008, 21, 1475–1488. [Google Scholar] [CrossRef]
Fu, Z.; Li, S.; Li, X.; Dan, B.; Wang, X. A Neural Network with Convolutional Module and Residual Structure for Radar Target Recognition Based on High-Resolution Range Profile. Sensors 2020, 20, 586. [Google Scholar] [CrossRef]
Wang, J.; Liu, Z.; Xie, R.; Ran, L. Radar HRRP target recognition based on dynamic learning with limited training data. Remote Sens. 2021, 13, 750. [Google Scholar] [CrossRef]
Liao, X.; Bao, Z.; Xing, M. On the aspect sensitivity of high resolution range profiles and its reduction methods. In Proceedings of the Record of the IEEE 2000 International Radar Conference [Cat. No. 00CH37037]; IEEE: New York, NY, USA, 2000; pp. 310–315. [Google Scholar] [CrossRef]
Jacobs, S.; O’Sullivan, J. Automatic target recognition using sequences of high resolution radar range-profiles. IEEE Trans. Aerosp. Electron. Syst. 2000, 36, 364–381. [Google Scholar] [CrossRef]
Du, L.; Liu, H.; Bao, Z. Radar HRRP statistical recognition: Parametric model and model selection. IEEE Trans. Signal Process. 2008, 56, 1931–1944. [Google Scholar] [CrossRef]
Liao, X.; Runkle, P.; Carin, L. Identification of ground targets from sequential high-range-resolution radar signatures. IEEE Trans. Aerosp. Electron. Syst. 2003, 38, 1230–1242. [Google Scholar] [CrossRef]
Du, L.; Wang, P.; Liu, H.; Pan, M.; Chen, F.; Bao, Z. Bayesian Spatiotemporal Multitask Learning for Radar HRRP Target Recognition. IEEE Trans. Signal Process. 2011, 59, 3182–3196. [Google Scholar] [CrossRef]
Meng, Y.; Wang, L.; Zhou, Q.; Zhang, X.; Zhang, L.; Wang, Y. Sparse View HRRP Recognition Based on Dual-Task of Generation and Recognition Method. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP); IEEE: New York, NY, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
Zhou, Q.; Yu, B.; Wang, Y.; Zhang, L.; Zheng, L.; Zou, D.; Zhang, X. Generative Multi-View HRRP Recognition Based on Cascade Generation and Fusion Network. In Proceedings of the 2024 International Radar Conference (RADAR); IEEE: New York, NY, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
Li, X.; Ouyang, W.; Pan, M.; Lv, S.; Ma, Q. Continuous learning method of radar HRRP based on CVAE-GAN. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5107819. [Google Scholar] [CrossRef]
Feng, B.; Chen, B.; Liu, H. Radar HRRP target recognition with deep networks. Pattern Recognit. 2017, 61, 379–393. [Google Scholar] [CrossRef]
Yin, H.; Guo, Z. Radar HRRP target recognition with one-dimensional CNN. Telecommun. Eng. 2018, 58, 1121–1126. [Google Scholar]
Xu, B.; Chen, B.; Wan, J.; Liu, H.; Jin, L. Target-Aware Recurrent Attentional Network for Radar HRRP Target Recognition. Signal Process. 2019, 155, 268–280. [Google Scholar] [CrossRef]
Liu, J.; Chen, B.; Chen, W.; Yang, Y. Radar HRRP Target Recognition with Target Aware Two-Dimensional Recurrent Neural Network. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC); IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2013; pp. 1310–1318.
Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 16259–16268.
Zhang, L.; Han, C.; Wang, Y.; Li, Y.; Long, T. Polarimetric HRRP recognition based on feature-guided Transformer model. Electron. Lett. 2021, 57, 705–707.
Diao, Y.; Liu, S.; Gao, X.; Liu, A. Position Embedding-Free Transformer for Radar HRRP Target Recognition. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2022; pp. 1896–1899.
Wang, X.; Wang, P.; Song, Y.; Xiang, Q.; Li, J. RLAT: Lightweight Transformer for High-Resolution Range Profile Sequence Recognition. Comput. Syst. Sci. Eng. 2024, 48, 217.
Chen, L.; Pan, Z.; Liu, Q.; Hu, P. HRRPGraphNet++: Dynamic Graph Neural Network with Meta-Learning for Few-Shot HRRP Radar Target Recognition. Remote Sens. 2025, 17, 2108.
Song, Y.; Wang, Y. Multi-frame radar HRRP target recognition using MFA-Net. J. Southeast Univ. (Engl. Ed.) 2025, 41, 384–391.
Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040.
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling; PMLR: Cambridge, MA, USA, 2024.
Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is mamba effective for time series forecasting? Neurocomputing 2025, 619, 129178.
Gao, F.; Lang, P.; Yeh, C.; Li, Z.; Ren, D.; Yang, J. An Interpretable Target-Aware Vision Transformer for Polarimetric HRRP Target Recognition with a Novel Attention Loss. Remote Sens. 2024, 16, 3135.
Liu, Q.; Zhang, X.; Liu, Y. Target-Aspect-Guided Neural Network with Geometric Constraints for Imbalanced Radar Target Recognition. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 12122–12142.
Rajah, P.; Odindi, J.; Mutanga, O. Feature level image fusion of optical imagery and Synthetic Aperture Radar (SAR) for invasive alien plant species detection and mapping. Remote Sens. Appl. Soc. Environ. 2018, 10, 198–208.
Lin, Y.; Zhang, H.; Lin, H.; Gamba, P.E.; Liu, X. Incorporating synthetic aperture radar and optical images to investigate the annual dynamics of anthropogenic impervious surface at large scale. Remote Sens. Environ. 2020, 242, 111757.
Chu, Z.; Luo, H.; Zhang, T.; Zhao, C.; Lin, B.; Gao, F. Micro-Doppler and HRRP Enabled UAV and Bird Recognition Scheme for ISAC System. In Proceedings of the 2025 IEEE/CIC International Conference on Communications in China (ICCC); IEEE: New York, NY, USA, 2025; pp. 1–6.
Yang, L.; Feng, W.; Wu, Y.; Huang, L.; Quan, Y. Radar-infrared sensor fusion based on hierarchical features mining. IEEE Signal Process. Lett. 2023, 31, 66–70.
Zhang, F.; Bi, X.; Zhang, Z.; Xu, Y. HIFR-Net: A HRRP-Infrared Fusion Recognition Network Capable of Handling Modality Missing and Multisource Data Misalignment. IEEE Sens. J. 2025, 25, 5769–5781.
Wang, Y.; Deng, J.; Li, Y.; Hu, J.; Liu, C.; Zhang, Y.; Ji, J.; Ouyang, W.; Zhang, Y. Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 13394–13403.
Xu, R.; Xiang, Z. RLNet: Adaptive Fusion of 4D Radar and Lidar for 3D Object Detection. In Proceedings of the Computer Vision—ECCV 2024 Workshops; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 181–194.
Li, Z.P.; Ye, J.T.; Huang, X.; Jiang, P.Y.; Cao, Y.; Hong, Y.; Yu, C.; Zhang, J.; Zhang, Q.; Peng, C.Z.; et al. Single-photon imaging over 200km. Optica 2021, 8, 344–349.
Trummer, N.M.; Reza, A.; Steindorfer, M.A.; Helling, C. Machine learning-based classification for Single Photon Space Debris Light Curves. Acta Astronaut. 2025, 226, 542–554.
Widdowson, D.; Kurlin, V. Recognizing rigid patterns of unlabeled point clouds by complete and continuous isometry invariants with no false negatives and no false positives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 1275–1284.
Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017.
Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Long Beach, CA, USA, 2017; pp. 5099–5108. Available online: https://proceedings.neurips.cc/paper/2017/hash/d8bf84be3800d12f74d8b05e9b89836f-Abstract.html (accessed on 2 March 2026).
Guo, B.; Huang, X.; Zhang, F.; Sohn, G. Classification of airborne laser scanning data using JointBoost. ISPRS J. Photogramm. Remote Sens. 2015, 100, 71–83.
Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 965–975.
Hsu, P.H.; Zhuang, Z.Y. Incorporating handcrafted features into deep learning for point cloud classification. Remote Sens. 2020, 12, 3713.
Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. (Tog) 2019, 38, 1–12.
Ma, X.; Qin, C.; You, H.; Ran, H.; Fu, Y. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv 2022, arXiv:2202.07123.
Liang, D.; Zhou, X.; Xu, W.; Zhu, X.; Zou, Z.; Ye, X.; Tan, X.; Bai, X. Pointmamba: A simple state space model for point cloud analysis. Adv. Neural Inf. Process. Syst. 2024, 37, 32653–32677.
Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394.
Boulahia, S.Y.; Amamra, A.; Madi, M.R.; Daikh, S. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 2021, 32, 121.
Dietz, S.; Altstidl, T.; Zanca, D.; Eskofier, B.; Nguyen, A. How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2024; pp. 1–8.
Guarrasi, V.; Aksu, F.; Caruso, C.M.; Di Feola, F.; Rofena, A.; Ruffini, F.; Soda, P. A systematic review of intermediate fusion in multimodal deep learning for biomedical applications. Image Vis. Comput. 2025, 158, 105509.
Tan, H.; Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv 2019, arXiv:1908.07490.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 2048–2057.
Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the 37th International Conference on Machine Learning; Daumé, H., III, Singh, A., Eds.; Proceedings of Machine Learning Research; PMLR: Cambridge, MA, USA, 2020; Volume 119, pp. 5156–5165.
Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient Attention: Attention with Linear Complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2021; pp. 3531–3539.
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning; JMLR: Cambridge, MA, USA, 2024; ICML’24.
Chen, J.; Xu, S.; Chen, Z. Convolutional neural network for classifying space target of the same shape by using RCS time series. IET Radar Sonar Navig. 2018, 12, 1268–1275.
Zhang, Y.P.; Zhang, Q.; Kang, L.; Luo, Y.; Zhang, L. End-to-end recognition of similar space cone–cylinder targets based on complex-valued coordinate attention networks. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5106214.
Li, H.; Li, X.; Xu, Z.; Jin, X.; Su, F. MSDP-Net: A Multi-Scale Domain Perception Network for HRRP Target Recognition. Remote Sens. 2025, 17, 2601.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 2 March 2026).

Figure 1. Complementarity of point clouds and HRRP. The top row shows that three distinct targets ((a) cone, (b) cylinder, (c) cone–cylinder) appear identical from the base view. In contrast, the bottom row (d–f) displays distinct HRRP.

Figure 2. Overview of Point-HRRP-Net. The framework operates in three stages: (1) dual-branch feature extraction for HRRP and point clouds, (2) cross-attention fusion via the Bi-CA mechanism, and (3) classification. (HRRP: High Resolution Range Profile; PC: Point Cloud; FFT: Fast Fourier Transform; Bi-CA: Bi-Directional Cross-Attention; DGCNN: Dynamic Graph CNN).

Figure 3. Structure of the dual-domain HRRP extractor. The module processes the raw HRRP and its amplitude spectrum in parallel. Each branch employs a 1D-CNN for local feature extraction and a Transformer Encoder for global dependency modeling. Finally, the features are projected and concatenated to form a unified HRRP feature sequence.
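
As a rough illustration of the two inputs this caption describes, the sketch below derives an amplitude spectrum from a raw HRRP using a plain discrete Fourier transform. The function name `amplitude_spectrum`, the 16-sample test signal, and the decision to keep every frequency bin are assumptions made for illustration; the source does not state the normalization, bin selection, or FFT routine actually used.

```python
import cmath
import math

def amplitude_spectrum(hrrp):
    """Return |DFT(hrrp)| for every frequency bin (direct O(n^2) evaluation)."""
    n = len(hrrp)
    spectrum = []
    for k in range(n):
        # k-th DFT coefficient: sum over samples of x[t] * exp(-2*pi*i*k*t/n)
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(hrrp))
        spectrum.append(abs(coeff))
    return spectrum

# Hypothetical toy profile: a pure cosine with 3 cycles over 16 range cells.
hrrp = [math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = amplitude_spectrum(hrrp)

# Strongest bin in the non-redundant half of the spectrum.
peak_bin = max(range(8), key=lambda k: spectrum[k])
print("peak bin:", peak_bin)
```

In a dual-domain setup like the one in the caption, `hrrp` and `spectrum` would each feed their own branch; a library FFT would replace the direct sum for 512-sample profiles.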

Figure 4. Data flow of the time-domain 1D-CNN branch. The input tensor of shape $(B, 1, 512)$ traverses three stacked 1D-CNN blocks for local feature extraction and downsampling via Max-Pooling. Subsequently, a Transformer Encoder captures global dependencies to produce the final HRRP feature sequence.
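
The shape changes implied by this caption can be traced with a small helper. The sketch below assumes same-length convolutions (kernel 3, padding 1), max-pooling with kernel and stride 2, and hypothetical channel widths of 32, 64, and 128. None of these hyperparameters appear in the caption; only the input shape (B, 1, 512) and the count of three blocks come from the source.

```python
def out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def trace_branch(batch, length=512, channels=(32, 64, 128), kernel=3, pool=2):
    """List the tensor shape after each assumed Conv1d + MaxPool1d block."""
    shapes = [(batch, 1, length)]
    for width in channels:
        length = out_len(length, kernel, padding=kernel // 2)  # conv keeps length
        length = out_len(length, pool, stride=pool)            # pooling halves it
        shapes.append((batch, width, length))
    return shapes

shapes = trace_branch(batch=4)
for shape in shapes:
    print(shape)
```

Under these assumptions the sequence handed to the Transformer Encoder has 64 positions; a different pooling factor or padding scheme would change that length.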

Figure 5. Mechanism of the cross-attention module. The geometric query $Q_{pc,i}$ computes attention weights $\alpha$ with electromagnetic keys $K_{hrrp}$. These weights aggregate the values $V_{hrrp}$ to generate the enriched output $Q_{pc,new}$.
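
The query, key, and value roles in this caption can be made concrete with a minimal single-query sketch. It assumes standard scaled dot-product attention with a softmax over the HRRP positions; the scaling by the square root of the feature dimension, the absence of a residual connection, and the toy vectors are assumptions for illustration, not details confirmed by the caption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(q_pc_i, k_hrrp, v_hrrp):
    """One point-cloud query attends over all HRRP keys and values."""
    d = len(q_pc_i)
    # Similarity of the geometric query with each electromagnetic key.
    scores = [sum(q * k for q, k in zip(q_pc_i, key)) / math.sqrt(d)
              for key in k_hrrp]
    alpha = softmax(scores)
    # Weighted sum of HRRP values gives the enriched point feature.
    q_pc_new = [sum(a * value[c] for a, value in zip(alpha, v_hrrp))
                for c in range(len(v_hrrp[0]))]
    return alpha, q_pc_new

# Hypothetical toy features: one point query, three HRRP positions.
q_pc_i = [1.0, 0.0]
k_hrrp = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_hrrp = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
alpha, q_pc_new = cross_attend(q_pc_i, k_hrrp, v_hrrp)
print(alpha, q_pc_new)
```

A bi-directional variant would run the same routine a second time with HRRP features as queries and point features as keys and values.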

Figure 6. Illustration of the simulation coordinate system and the definition of the incident angle $\theta$. The angle represents the line-of-sight for both point cloud generation and electromagnetic simulations, shown here with the cylinder target.

Figure 7. Longitudinal cross-sections showing the dimensional parameters of the three simulated Perfect Electrical Conductor (PEC) targets. (a) Cone, (b) Cylinder, and (c) Cone–cylinder composite. The points $O_{1}$, $O_{2}$, and $O_{3}$ denote the origin of each target’s local coordinate system. All dimensions are in meters. To ensure numerical stability during simulation, the sharp tips in (a,c) are replaced with small paraboloids, as detailed in the text.

Figure 8. Visualization of the attention weights in the Bi-CA module. (a,b) Attention distribution on the HRRP time-domain signal and frequency-domain spectrum, respectively, when queried by point cloud features. (c) Attention distribution on the 3D point cloud when queried by HRRP features. In the heatmaps, yellow regions denote higher attention weights (high relevance), whereas blue regions indicate lower weights.

Figure 9. t-SNE visualization of feature distributions at different network stages. (a,b) show the original input signals. (c,d) display the extracted single-modality features. (e) presents the fused features. Different colors denote the three target classes: Cone (purple), Cylinder (teal), and Cone–Cylinder (yellow).

Figure 10. Impact of rotational offset on classification accuracy (18° split). The solid green line represents the average accuracy, while the light green shaded area indicates the accuracy fluctuation range across test samples. The x-axis denotes the rotational offset ($\Delta\theta$) between the two modalities, and the y-axis represents the classification accuracy.

Figure 11. Sensitivity analysis of rotational offset across different dataset splits. The graph compares average accuracy trends for dataset splits ranging from 9° to 90° under increasing rotational offset.

Figure 12. Performance degradation under progressive stress tests. The x-axis represents degradation severity across three metrics: HRRP SNR (dB), Point Cloud Jitter ($\sigma$), and Point Cloud Drop rate (%). The curves illustrate the average accuracy of the model trained on the 9° split under four test scenarios: HRRP noise only, PC jitter only, PC point drop only, and the aggregate of all three.
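
The three degradations named in this caption could be applied to a test sample roughly as follows. This is a sketch under stated assumptions: white Gaussian noise scaled to a target SNR, isotropic Gaussian jitter with standard deviation sigma, and uniform random point removal. The function names, the toy signal, and the random point cloud are hypothetical, and the source does not confirm that the stress tests were implemented this way.

```python
import math
import random

def add_noise_at_snr(hrrp, snr_db, rng):
    """Add white Gaussian noise so signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in hrrp) / len(hrrp)
    noise_sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_sigma) for x in hrrp]

def jitter_points(points, sigma, rng):
    """Perturb every coordinate with zero-mean Gaussian noise of std sigma."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]

def drop_points(points, drop_rate, rng):
    """Keep a uniformly random subset, removing about drop_rate of the points."""
    keep = max(1, round(len(points) * (1.0 - drop_rate)))
    return rng.sample(points, keep)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hrrp = [math.sin(0.1 * t) for t in range(512)]
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
         for _ in range(100)]

noisy_hrrp = add_noise_at_snr(hrrp, snr_db=10.0, rng=rng)
jittered = jitter_points(cloud, sigma=0.01, rng=rng)
thinned = drop_points(cloud, drop_rate=0.25, rng=rng)
print(len(noisy_hrrp), len(jittered), len(thinned))
```

The aggregate scenario in the caption would correspond to applying all three functions to the same paired sample before inference.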

Table 1. Generalization capability to unseen viewpoints against representative single-modality methods, reported as classification accuracy (%).

Model	Modality	9° Split	18° Split	36° Split	45° Split	90° Split	180° Split
Ours (HRRP-only)	HRRP	54.22	60.26	56.15	54.22	49.74	45.62
1D-Mamba [30]	HRRP	65.05	60.78	58.49	60.16	51.35	42.34
Conformer [28]	HRRP	74.43	75.00	68.54	61.93	47.45	52.45
MSDP-Net [63]	HRRP	73.70	71.56	67.66	69.48	49.48	43.96
Point-Transformer [22]	Point Cloud	83.12	77.50	67.55	71.82	39.79	37.81
DGCNN [48]	Point Cloud	89.74	90.16	82.92	76.25	51.77	41.56
PointMLP [49]	Point Cloud	79.84	70.52	71.41	66.93	46.88	42.97
PointMamba [50]	Point Cloud	93.33	91.51	81.93	79.90	60.57	53.80
Point-HRRP-Net (Ours)	Point Cloud + HRRP	97.51	93.84	85.57	87.29	66.34	57.67
Note: Bold values indicate the best performance.
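
To read Table 1 at a glance, the margin between the fused model and the strongest single-modality entry can be computed per split. The short script below only re-uses the numbers printed in the table; the variable names are illustrative.

```python
splits = ["9", "18", "36", "45", "90", "180"]

# Rows copied from Table 1 (single-modality methods only).
single_modality = {
    "Ours (HRRP-only)": [54.22, 60.26, 56.15, 54.22, 49.74, 45.62],
    "1D-Mamba": [65.05, 60.78, 58.49, 60.16, 51.35, 42.34],
    "Conformer": [74.43, 75.00, 68.54, 61.93, 47.45, 52.45],
    "MSDP-Net": [73.70, 71.56, 67.66, 69.48, 49.48, 43.96],
    "Point-Transformer": [83.12, 77.50, 67.55, 71.82, 39.79, 37.81],
    "DGCNN": [89.74, 90.16, 82.92, 76.25, 51.77, 41.56],
    "PointMLP": [79.84, 70.52, 71.41, 66.93, 46.88, 42.97],
    "PointMamba": [93.33, 91.51, 81.93, 79.90, 60.57, 53.80],
}
fusion = [97.51, 93.84, 85.57, 87.29, 66.34, 57.67]  # Point-HRRP-Net row

# Best single-modality score in each column, then the fusion margin over it.
best_single = [max(row[i] for row in single_modality.values())
               for i in range(len(splits))]
margins = [round(f - b, 2) for f, b in zip(fusion, best_single)]

for split, best, margin in zip(splits, best_single, margins):
    print(f"{split} deg split: best single-modality {best:.2f}, margin {margin:+.2f}")
```

Every margin is positive, which matches the claim that the fused model outperforms the single-modality entries on all six splits.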

Table 2. Ablation study on different fusion strategies.

Fusion Strategy	Params (M)	9° Split (F1)	45° Split (F1)	90° Split (F1)	Latency (ms)	FLOPs (G)
Concatenation	0.8271	93.12	80.83	51.77	5.9959	0.6273
Addition	0.7942	94.06	83.44	60.26	5.3409	0.6272
Product	0.7942	95.83	82.40	54.69	5.3470	0.6272
Gating	0.8436	95.73	81.77	58.44	5.9708	0.6273
Self-Attention [64]	0.8271	95.94	84.17	47.81	6.9912	0.6273
Linear-Attention [58]	0.9277	93.91	81.72	53.70	8.0915	0.6448
Efficient-Attention [59]	0.9277	91.72	84.58	48.54	7.6187	0.6448
Bi-Mamba [60]	0.8275	95.68	82.29	46.09	8.5317	0.6275
Bi-CA (Ours)	1.0272	97.47	85.09	65.54	8.7573	0.6624
Note: Bold values indicate the best performance.

Table 3. Comprehensive ablation study on feature extractors. Each row represents a variation from our final model (last row), where one component is replaced to evaluate its contribution to the overall performance and efficiency.

HRRP Extractor	PC Extractor	Params (M)	FLOPs (G)	9° Split (F1)	45° Split (F1)	90° Split (F1)	Latency (ms)
CNN	DGCNN	1.0272	0.6570	96.51	85.36	46.77	6.7711
RNN	DGCNN	1.1203	0.7194	85.31	66.15	49.06	7.1118
LSTM [19]	DGCNN	1.3705	0.7683	92.40	74.38	56.77	7.2894
GRU [20]	DGCNN	1.5002	0.8020	95.00	79.58	46.51	6.7926
1D-Mamba [30]	DGCNN	1.0247	0.7072	96.56	74.32	50.10	6.5243
Conformer [28]	DGCNN	1.1850	0.7888	95.89	83.13	58.02	7.5473
CNN-Transformer	PointNet [43]	0.6191	0.0950	92.81	82.19	47.60	9.0498
CNN-Transformer	PointNet++ [44]	0.5733	1.2934	88.70	80.10	48.02	113.3443
CNN-Transformer	PointMLP [49]	1.6522	0.4754	84.53	73.65	54.90	14.7948
CNN-Transformer	PointMamba [50]	0.5422	0.0871	91.87	83.54	57.19	13.8087
CNN-Transformer (Ours)	DGCNN (Ours)	1.0272	0.6624	97.47	85.09	65.54	8.7573
Note: Bold values indicate the best performance.

Share and Cite

MDPI and ACS Style

Zhao, Z.; Yang, Z.; Zhang, H.; Wang, Y.; Meng, K. Point-HRRP-Net: A Deep Fusion Framework via Bi-Directional Cross-Attention for Space Object Classification Using HRRP and Point Cloud. Remote Sens. 2026, 18, 868. https://doi.org/10.3390/rs18060868
