4.1. Datasets
To evaluate RadarSSM, experiments were performed on two publicly available mmWave radar point cloud datasets: MMActivity and a subset of MiliPoint.
MMActivity [26]: This benchmark dataset was collected with a TI IWR1443BOOST FMCW radar. It covers five exercises (Walking, Jumping, Jumping Jacks, Squats and Boxing) performed by two subjects. The training set contains 12,097 samples and the test set contains 3538 samples. For model development, the training set was further divided into 9677 training examples and 2420 validation examples (8:2 ratio).
MiliPoint [27]: The MiliPoint dataset, recorded with a TI IWR1843 radar, allows performance to be evaluated under more diverse conditions. To keep the basis of assessment consistent, a subset consisting of the first five unique activities was selected: marching with bent arms (Marching), bent-arm chest expansion with side stepping (Chest Exp), straight-arm raises with side stepping (Arm Raise), bent-arm raises with left knee lift (L-Knee) and bent-arm raises with right knee lift (R-Knee). This subset contains 2822 training examples, 314 validation examples and 406 test examples.
4.2. Implementation Details
Data Preprocessing: Voxelization converts the raw point cloud into a regular 4D grid with one temporal axis and three spatial axes, where D, H and W correspond to the depth (range), height (elevation) and width (azimuth) axes of the physical Region of Interest, respectively. Instead of a simple binary occupancy, the scalar value of each voxel directly encodes the spatial density of the radar reflections; specifically, it is the absolute count of radar points falling within that 3D spatial boundary at time step t. Furthermore, to mitigate the inherent sparsity of mmWave radar point clouds, points from two adjacent temporal frames are accumulated prior to voxelization. Consequently, empty voxels are assigned a value of 0, while active voxels contain positive integers reflecting the local density of human body reflections. This configuration was determined empirically to achieve a practical trade-off between recognition accuracy and computational overhead [26]. A channel dimension is then added, yielding a 5D input tensor.
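The count-based voxelization with two-frame accumulation can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the coordinate order (depth, height, width), the non-overlapping pairing of frames and the position of the channel axis are all assumptions made for the example.

```python
import numpy as np

def voxelize_counts(points, roi_min, roi_max, grid_shape):
    """Count radar points per voxel inside an axis-aligned region of interest.

    points: (N, 3) coordinates ordered as (depth, height, width) [assumed order].
    grid_shape: (D, H, W). Returns an int32 array of that shape.
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    roi_min = np.asarray(roi_min, dtype=np.float64)
    roi_max = np.asarray(roi_max, dtype=np.float64)
    shape = np.asarray(grid_shape)

    # Keep only points inside the half-open box [roi_min, roi_max).
    inside = np.all((points >= roi_min) & (points < roi_max), axis=1)
    pts = points[inside]

    # Map each coordinate to an integer voxel index along its axis.
    idx = np.floor((pts - roi_min) / (roi_max - roi_min) * shape).astype(np.int64)
    idx = np.minimum(idx, shape - 1)  # guard against rounding at the upper edge

    grid = np.zeros(grid_shape, dtype=np.int32)
    # np.add.at accumulates correctly when several points share a voxel.
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return grid

def voxelize_sequence(frames, roi_min, roi_max, grid_shape):
    """Merge adjacent frame pairs, voxelize each pair, then add a channel axis.

    Assumption: frames are paired without overlap (0+1, 2+3, ...).
    """
    grids = []
    for i in range(0, len(frames) - 1, 2):
        merged = np.concatenate(
            [np.asarray(frames[i]).reshape(-1, 3),
             np.asarray(frames[i + 1]).reshape(-1, 3)],
            axis=0,
        )
        grids.append(voxelize_counts(merged, roi_min, roi_max, grid_shape))
    return np.stack(grids)[np.newaxis]  # shape (1, T, D, H, W) [assumed layout]
```

Empty voxels stay at 0 and occupied voxels hold positive integer counts, which matches the density encoding described above.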
Training Configuration: Experiments were conducted on a workstation with an Intel Core Ultra 9 CPU, 128 GB of RAM and an NVIDIA GeForce RTX 5070 Ti GPU. The models were implemented in PyTorch 2.8.0. No learning rate warm-up or early stopping was applied; training ran for the full 150 epochs in all experiments, and the checkpoint with the lowest validation loss across these epochs was selected for final evaluation. The hyperparameters are listed in Table 2.
Regularization Strategy: To reduce overfitting on sparse radar observations, a composite regularization strategy was used, consisting of (1) Label Smoothing; (2) a hierarchical Dropout scheme applied to the spatial encoder, temporal module and projection/classification heads (rates summarised in Table 2); and (3) data-level augmentation using Mixup [28] and VoxelAug [29]. VoxelAug performs random circular spatial shifting to simulate subject displacement and temporal frame masking (random frame dropout) to improve robustness against intermittent sensor data loss.
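The data-level parts of this strategy can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions rather than the configuration used in the experiments: the function names are hypothetical, and `max_shift`, `mask_prob`, `eps` and `alpha` are placeholder parameters whose values are not taken from the paper.

```python
import numpy as np

def voxel_aug(clip, rng, max_shift=2, mask_prob=0.1):
    """Circular spatial shift plus random frame masking on a (T, D, H, W) clip.

    max_shift and mask_prob are illustrative placeholders.
    """
    out = np.array(clip, copy=True)
    # One random circular shift per spatial axis, shared by all frames,
    # to mimic a displaced subject.
    shifts = rng.integers(-max_shift, max_shift + 1, size=3)
    out = np.roll(out, shift=tuple(int(s) for s in shifts), axis=(1, 2, 3))
    # Zero out whole frames to mimic intermittent sensor dropouts.
    drop = rng.random(out.shape[0]) < mask_prob
    out[drop] = 0
    return out

def smooth_labels(labels, num_classes, eps):
    """One-hot targets softened by label smoothing with factor eps."""
    onehot = np.eye(num_classes)[labels]
    return onehot * (1.0 - eps) + eps / num_classes

def mixup(x, y_soft, rng, alpha):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)      # one mixing coefficient per batch
    perm = rng.permutation(len(x))    # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_soft + (1.0 - lam) * y_soft[perm]
    return x_mix, y_mix
```

Because the shift is circular, the total point count of a clip is preserved unless frames are masked, so the augmentation changes where reflections appear without inventing new ones.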
Reproducibility: Results are reported as the mean over 5 independent runs with different random seeds. The training and validation splits were re-randomized for each seed, while the test set was held fixed to keep the comparisons fair. To further examine statistical robustness on the smaller MiliPoint dataset, training strategy variants are evaluated over 10 independent runs in Section 4.4.
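The per-seed protocol can be sketched as below. The helper names are hypothetical, and two details are assumptions: the training share is truncated to an integer, and the reported spread is taken to be the sample standard deviation (ddof=1).

```python
import numpy as np

def split_train_val(num_samples, seed, train_fraction=0.8):
    """Seed-dependent shuffle of the training pool into train/validation indices.

    The test set is not touched here, so it stays identical across seeds.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    n_train = int(num_samples * train_fraction)  # truncation is an assumption
    return order[:n_train], order[n_train:]

def summarize(accuracies):
    """Mean and sample standard deviation (ddof=1, assumed) over runs."""
    acc = np.asarray(accuracies, dtype=np.float64)
    return acc.mean(), acc.std(ddof=1)

# Example: one split per seed on a pool of 12,097 training samples.
for seed in range(5):
    train_idx, val_idx = split_train_val(12097, seed)
```

With 12,097 samples and an 8:2 ratio, this truncation gives 9677 training and 2420 validation indices, the same sizes as the MMActivity split described in Section 4.1.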
4.3. Comparative Analysis with State-of-the-Art Models
Table 3 presents a quantitative evaluation of the proposed RadarSSM against representative baseline models.
(1) Accuracy and Robustness: RadarSSM achieves 76.26 ± 2.59% on the MiliPoint dataset, substantially outperforming the second-best CNN-LSTM baseline (71.28 ± 5.46%), and 92.74 ± 0.90% on MMActivity, surpassing CNN-Transformer (91.27 ± 0.84%) and CNN-LSTM (91.03 ± 0.77%). Notably, RadarSSM achieves these results with a substantially smaller parameter footprint than the baselines, which indicates a good trade-off between model complexity and performance. Moreover, the consistently low Expected Calibration Error (ECE) values (0.09 on MiliPoint and 0.07 on MMActivity) indicate reliable confidence calibration.
(2) Computational Efficiency: RadarSSM is also efficient in terms of resource utilization. As shown in Table 3, the model requires only 87.1 MMACs per inference, a significant reduction in computation compared with most baseline models. Although ultra-lightweight baselines such as 1D-CNN (40.6 MMACs) and Bi-LSTM (43.4 MMACs) have smaller operation counts, they cannot match its recognition accuracy. RadarSSM balances this trade-off, achieving high accuracy with a minimal parameter footprint of about 108.3 K.
(3) Statistical Significance: To assess the reliability of the performance margins, Table 4 reports per-seed accuracy, standard deviation and Welch’s t-test results for the top-performing baselines. On MiliPoint, RadarSSM achieves a higher mean accuracy than CNN-LSTM (76.26% vs. 71.28%) with a smaller standard deviation (2.59 vs. 5.46), although the corresponding p-value in Table 4 indicates that this margin should be interpreted cautiously given the limited subset size. CNN-Transformer exhibits both a lower mean accuracy and larger variability. On MMActivity, RadarSSM maintains the highest mean accuracy, and its margins over CNN-LSTM and CNN-Transformer are statistically significant.
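As an illustration of the test used here, Welch's t statistic and the Welch-Satterthwaite degrees of freedom can be computed directly from summary statistics. The sketch below is not the authors' evaluation script; the function name is hypothetical, and it assumes that the reported ± values are sample standard deviations over 5 runs.

```python
import math

def welch_from_summary(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Inputs are per-group means, sample standard deviations and run counts.
    """
    va = std_a ** 2 / n_a  # squared standard error of group A
    vb = std_b ** 2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# MiliPoint summary values from Table 3: RadarSSM vs. CNN-LSTM, 5 runs each (assumed).
t, df = welch_from_summary(76.26, 2.59, 5, 71.28, 5.46, 5)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the MiliPoint comparison gives t ≈ 1.84 with about 5.7 degrees of freedom. Converting this to a p-value requires the Student t distribution; when per-seed accuracies are available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.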
4.4. Ablation Study
An ablation study is performed to assess the contribution of individual components; the results are summarized in Table 5.
(1) Impact of Architecture Components: Removing the Bi-SSM module causes a severe deterioration in performance, with accuracy decreasing to 65.08% on MiliPoint and 91.07% on MMActivity. This confirms the essential contribution of the SSM to representing long-range temporal correlations. Removing the Temporal Mixer likewise decreases accuracy, confirming its utility in reducing local feature jitter. Notably, replacing DS-Conv3D with standard 3D convolutions increases the computational cost by more than 5 times (536.3 MMACs as opposed to 87.1 MMACs). This change is accompanied by an even sharper drop in accuracy (e.g., 89.26% on MMActivity and 70.49% on MiliPoint), indicating that DS-Conv3D is not merely efficient but also serves as a useful regularizer that limits overfitting on sparse voxel data.
To further validate this structural synergy, two alternative paradigms are evaluated. Replacing the SSM temporal module with attention-based global pooling (DS-Conv3D + AttnPooling) yields lower accuracy on both MiliPoint and MMActivity, confirming that attention pooling is both computationally inferior and less effective for capturing long-range activity dynamics. Applying a Direct Voxel-Mamba backbone to flattened per-frame features also falls short on both datasets, despite its larger parameter count (1.08 M). This validates the design rationale in Section 3.1: a monolithic sequence model cannot compensate for the loss of structured spatial encoding, and the two-stage decoupled design of RadarSSM is essential.
(2) Impact of Training Strategies: The integration of VoxelAug and Mixup further improves the proposed model on MMActivity. Starting from the vanilla baseline of 91.78%, applying VoxelAug or Mixup individually increases the accuracy to 92.27% and 92.32%, respectively, while their combination yields the best result of 92.74%.
On the smaller MiliPoint subset, the effect is more nuanced. Relative to the vanilla baseline, applying VoxelAug or Mixup individually reduces the mean accuracy. We interpret this behavior as a small-data regularization-balance effect. In this low-data regime, VoxelAug broadens the spatial–temporal support of sparse voxel patterns, while Mixup generates interpolated samples across class boundaries; when only 2822 training samples are available, either effect alone can become overly aggressive and degrade generalization. By contrast, combining the two strategies yields 76.26 ± 2.59%, which is the best 5-run mean among the compared training strategies. In particular, the combined setting outperforms VoxelAug in 5/5 seeds and Mixup in 4/5 seeds, with a positive mean gain over each. This suggests that VoxelAug increases within-class diversity, whereas Mixup smooths the decision boundary through vicinal interpolation, and that their combination is more complementary than either strategy alone.
To assess whether this trend is statistically robust, Table 6 further reports an extended evaluation over 10 independent runs on MiliPoint. The 10-run results confirm that RadarSSM remains stronger in mean accuracy than the VoxelAug-only and Mixup-only settings, while also showing lower variability than the vanilla and Mixup-only settings. At the same time, the 10-run analysis does not support a statistically significant superiority over the vanilla baseline under Welch’s t-test. Therefore, the combined strategy is better interpreted as providing a more balanced and reliable trend than either single augmentation, rather than as a universally superior replacement for the vanilla setting on this sparse dataset.
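The cost gap between standard and depthwise-separable 3D convolutions noted in the architecture ablation follows from a simple operation count. The sketch below assumes that DS-Conv3D denotes the usual depthwise convolution followed by a 1×1×1 pointwise convolution; the function names and the layer sizes in the example are hypothetical and are not taken from the paper.

```python
def conv3d_macs(c_in, c_out, kernel, out_voxels):
    """Multiply-accumulates of a standard 3D convolution (bias ignored)."""
    return kernel ** 3 * c_in * c_out * out_voxels

def ds_conv3d_macs(c_in, c_out, kernel, out_voxels):
    """Multiply-accumulates of a depthwise 3D conv plus a 1x1x1 pointwise conv."""
    depthwise = kernel ** 3 * c_in * out_voxels  # one k^3 filter per input channel
    pointwise = c_in * c_out * out_voxels        # channel mixing at each voxel
    return depthwise + pointwise

# Hypothetical layer: 16 -> 32 channels, 3x3x3 kernel, 1000 output voxels.
standard = conv3d_macs(16, 32, 3, 1000)
separable = ds_conv3d_macs(16, 32, 3, 1000)
print(standard, separable, standard / separable)
```

For a single layer the reduction factor is k³·C_out / (k³ + C_out), which is independent of the input channel count and of the output size. The whole-model ratio reported above (536.3 vs. 87.1 MMACs) is smaller than such per-layer factors, presumably because the remaining modules contribute the same cost in both variants.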
4.5. Visualization Analysis
Figure 2 illustrates the training loss and validation accuracy curves. Unlike the CNN-based baselines, which saturate sharply and peak within the first ten epochs, RadarSSM converges gradually and steadily. On the MMActivity dataset, the model reaches a high validation accuracy (>90%) in fewer than 20 epochs and follows a relatively stable optimization curve with only slight fluctuations. On the more challenging MiliPoint dataset, the training loss of RadarSSM decreases steadily and smoothly over a longer period, in contrast to the earlier stalling of the baseline methods.
Figure 3 shows the confusion matrices, which provide a per-class view of the recognition results. The model performs very well on the MMActivity dataset (Figure 3b): three activities (Jumping Jacks, Squats and Walking) exceed 97% accuracy. “Jumping” is recognized less reliably (83.4%) and is mainly confused with “Walking” (11.7%).
On the more challenging MiliPoint dataset (Figure 3a), bent-arm raises with left knee lift achieves the highest recognition rate (90.1%). However, the other categories exhibit distinct confusion patterns. Straight-arm raises with side stepping proves the most difficult to classify (59.8%) and is heavily misclassified as bent-arm raises with left knee lift (28.0%). Marching with bent arms (65.4%) also shows ambiguity, being confused with straight-arm raises (16.0%) and left knee lifts (14.8%). Furthermore, significant mutual confusion is observed between bent-arm chest expansion (75.0%) and bent-arm raises with right knee lift (70.7%): 20.0% of chest expansion samples are mislabeled as right knee lifts, and 17.1% of right knee lift samples are mislabeled as chest expansion. This suggests that these pairs share highly similar spatiotemporal signatures in the sparse radar domain, particularly regarding the bent-arm dynamics.
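Per-class percentages of this kind are typically obtained by normalizing each row of a raw-count confusion matrix. The sketch below assumes that rows index the true class and columns the predicted class; the helper names and the toy counts are illustrative and are not data from Figure 3.

```python
import numpy as np

def row_normalize(confusion):
    """Convert raw counts (rows = true class, assumed) to row percentages."""
    cm = np.asarray(confusion, dtype=np.float64)
    totals = cm.sum(axis=1, keepdims=True)
    # Avoid division by zero for classes that have no samples.
    return 100.0 * cm / np.where(totals == 0, 1.0, totals)

def top_confusions(confusion, class_names, k=3):
    """Largest off-diagonal entries as (true, predicted, percent) tuples."""
    pct = row_normalize(confusion)
    np.fill_diagonal(pct, 0.0)  # ignore correct predictions
    flat = np.argsort(pct, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, pct.shape)
    return [(class_names[r], class_names[c], float(pct[r, c]))
            for r, c in zip(rows, cols)]

# Toy counts for three placeholder classes.
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 5, 5]]
print(top_confusions(toy, ["A", "B", "C"], k=2))
```

The diagonal of the normalized matrix gives the per-class recognition rate, and the largest off-diagonal entries identify the class pairs that are confused most often.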