1. Introduction
The rapid evolution of the Internet of Things (IoT) and ubiquitous computing has led to the emergence of Human Activity Recognition (HAR) as a foundational technology for various applications, such as smart healthcare, human-computer interaction, and elderly monitoring [1]. Traditional HAR systems rely mainly on optical cameras or wearable sensors. However, vision-based approaches are highly sensitive to variations in lighting and raise serious privacy concerns, which limits their use in private spaces. Although wearable devices offer reliable measurements, they suffer from limited battery life and low user compliance during long-term monitoring. Therefore, millimeter-wave (mmWave) radar has gained significant attention as a device-free, privacy-preserving sensing modality, as shown in Figure 1. As noted in work on multimodal fusion sensing [2], mmWave radar is robust to adverse lighting conditions. Furthermore, its ability to penetrate non-metallic obstructions makes the invisible visible [3], demonstrating its great potential for through-wall HAR [4].
Recently, the intersection of deep learning and radar sensing has significantly advanced the field of HAR. Related learning-based neural structures have also been explored in constrained sensing and reconstruction tasks, such as phase retrieval and image reconstruction [5]. Early approaches usually converted raw radar echoes into either 2D spectrograms or 3D voxel grids. However, these methods introduce quantization errors, and their computational overhead grows cubically with resolution, which limits real-time edge deployment. To address this, point-based architectures that directly process sparse radar point clouds have emerged. Researchers have applied classic visual and point-cloud networks, such as PointNet++ [6], Graph Neural Networks (GNNs) [7], and advanced point Transformers like PTv3 [8], to radar applications. These efforts have made significant progress in fine-grained human semantic segmentation tasks [9].
Despite these advances, effective deployment and maintenance of robust HAR systems in real-world environments remain challenging due to two primary obstacles. On the one hand, the inherent sparsity and noise of radar data limit the upper bound of its representation. Unlike dense RGB images, which contain rich semantic textures, radar point clouds are sparse and unstructured, making it difficult for single-modal radar networks to construct accurate human topological priors. On the other hand, traditional multimodal fusion suffers from severe negative transfer and computational redundancy. Recent works attempt to fuse radar with cameras [10,11], Wi-Fi [12], or spatial-temporal features in gait recognition [13], but these works often rely on heavy network backbones. More importantly, corrupted visual features often contaminate the radar representations when the visual modality is impaired (e.g., in dim lighting or under camera obstruction), resulting in a significant decline in performance. Although many multimodal foundation models have emerged, their large parameter counts make them unsuitable for edge devices.
Resolving the conflict between multimodal dependency and single-modal fragility requires a lightweight framework that absorbs visual knowledge during training yet operates robustly on radar during inference. Lately, State Space Models (SSMs) like Mamba have become popular alternatives to complex Transformers. They offer linear computational complexity while maintaining a global receptive field. Motivated by these insights, we propose Tac-Mamba, a lightweight cross-modal distillation framework for mmWave radar. It utilizes a Spatial Mamba teacher to distill structural topologies into a PTv3 radar student with modality dropout, establishing environment-invariant human priors. To prevent negative transfer, a Trust-Aware Consistency Gate (TACG) assesses visual reliability and suppresses noise, ensuring a smooth fallback to pure radar sensing. Additionally, a Lightweight Temporal Mamba Block (LTMB) efficiently captures long-range dependencies.
The main contributions of this paper are summarized as follows:
We propose a novel, lightweight spatial-temporal architecture that leverages explicit topological knowledge transfer to overcome the intrinsic sparsity of mmWave radar signals. Specifically, we design a dual-stream spatial encoder where a Spatial Mamba teacher extracts structural priors from visual skeletons via a discrete spatial scanning mechanism. Driven by a modality dropout strategy and an auxiliary 3D pose regression task, this environment-invariant topological knowledge is distilled into a serialized PTv3 student network. This forces the radar encoder to construct accurate human models independently, ensuring robust sensing even under severe visual occlusion.
To mitigate the prevalent issue of negative transfer in multimodal fusion, we propose the TACMA module, which incorporates a novel TACG. Unlike static fusion mechanisms, TACG evaluates the reliability of incoming visual features through a SiLU-activated cross-modal bilinear interaction. This mathematical constraint acts as a dynamic confidence filter. It achieves optimal feature synergy under ideal conditions, while seamlessly degrading to a deterministic, purely radar-driven fallback projection when visual inputs are severely corrupted. This strictly prevents the pollution of the radar representation.
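To make the gating idea concrete, the following minimal Python sketch illustrates how a SiLU-activated bilinear score can act as a trust gate. This is an illustrative sketch, not the paper's implementation: the vector dimensions, the single weight matrix `W`, and the additive fusion rule are all assumptions. The key property it demonstrates is the deterministic fallback: an all-zero (masked) visual input yields a zero gate, so the output reduces exactly to the radar feature.

```python
import math

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def tacg_fuse(radar, vision, W):
    """Trust-aware gate (illustrative): a SiLU-activated bilinear score
    r^T W v measures cross-modal consistency and scales how much of the
    visual feature is admitted into the fused representation."""
    # bilinear interaction score between the two modality vectors
    score = sum(
        radar[i] * sum(W[i][j] * vision[j] for j in range(len(vision)))
        for i in range(len(radar))
    )
    g = silu(score)  # trust gate; exactly zero for an all-zero vision input
    # radar passes through untouched (the radar-driven fallback path),
    # vision is scaled by the trust gate before fusion
    return [r + g * v for r, v in zip(radar, vision)]
```

With a masked visual input, `tacg_fuse(radar, [0.0, 0.0], W)` returns `radar` unchanged, mirroring the "seamless degradation to a purely radar-driven fallback" described above.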
We develop the LTMB to efficiently capture long-range kinematic dependencies with pure linear complexity. This module first employs a large-kernel 1D depthwise convolution to extract local temporal micro-motions, followed by parallel Forward and Backward SSMs. Crucially, instead of utilizing additional dense layers for bidirectional fusion, we introduce a Zero-Parameter Cross-Gating (ZPCG) mechanism. It leverages the Sigmoid-activated hidden sequence of the forward SSM to explicitly gate the backward SSM, and vice versa. This explicitly filters future noise using historical states, achieving deep non-linear temporal context fusion with minimal parameter overhead.
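The cross-gating rule can be written down directly from the description above: each direction's hidden sequence is gated by the sigmoid of the opposite direction, using no learnable parameters. The sketch below uses scalar hidden states per time step and sums the two gated streams; the summation is an assumption, and the paper's exact fusion may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def zpcg(h_fwd, h_bwd):
    """Zero-Parameter Cross-Gating (illustrative sketch): the forward
    hidden sequence is gated by sigmoid of the backward one and vice
    versa, then the two gated streams are combined. No dense layers
    or learnable weights are involved."""
    return [f * sigmoid(b) + b * sigmoid(f) for f, b in zip(h_fwd, h_bwd)]
```

Because `sigmoid` saturates near zero for strongly negative opposite-direction states, each direction can suppress the other's contribution, which is the "filtering future noise using historical states" behavior described above.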
The remainder of this paper is organized as follows.
Section 2 reviews the related work.
Section 3 presents the proposed Tac-Mamba framework.
Section 4 reports the experimental results, including robustness, generalization, and deployment analyses.
Section 5 provides the ablation studies. Finally,
Section 6 concludes the paper.
4. Experimental Results
4.1. Datasets and Experimental Settings
To comprehensively evaluate the modality-missing robustness and environmental generalizability of the proposed Tac-Mamba, we conducted extensive experiments on the large-scale public MM-Fi dataset [18].
4.1.1. MM-Fi Dataset
We use the MM-Fi dataset to provide a fair and rigorous benchmark against state-of-the-art (SOTA) methods. Unlike traditional vision-centric datasets, MM-Fi provides synchronized multimodal sensor data that supports versatile, non-intrusive human sensing. The mmWave radar point clouds were collected using a TI IWR6843AOP mmWave radar with 3 transmit (Tx) and 4 receive (Rx) antennas. This device operates in the 60–64 GHz frequency band. The radar was strategically positioned 3 m from the human subjects during data recording. To establish the visual teacher modality, a calibrated dual-camera setup (Intel RealSense D435 RGB-D cameras) was synchronously deployed to generate precise 3D pose annotations. High-quality 2D skeleton keypoints were then extracted from these annotations using the HRNet algorithm. All multimodal data streams were temporally aligned using software synchronization and uniformly sampled at 10 FPS.
The dataset includes 27 human activities, 14 of which are daily activities and 13 of which are rehabilitation exercises. To evaluate the model's ability to capture long-term dynamics, each action sequence is either padded or truncated to fit a fixed time window of 297 frames. For the radar modality, we extract a 5D physical state vector $(x, y, z, v, I)$ for each point, representing the 3D Cartesian coordinates, radial Doppler velocity, and signal reflection intensity. The visual modality provides structural keypoints to supervise the topological learning.
Table 1 summarizes the specific action categories and comprehensive evaluation protocols of the MM-Fi dataset.
4.1.2. Evaluation Protocol (Cross-Environment)
To rigorously evaluate the model’s robustness against environmental variations and background clutter, we employ the strict Cross-Environment (Protocol 3) evaluation strategy defined in the MM-Fi benchmark. Unlike random splits, this protocol ensures that the environments in the test set are completely unseen during training. Specifically, data collected in three distinct environments (E01, E02, E03) are used for training, while data from a completely different environment (E04) is reserved exclusively for validation and testing.
4.2. Implementation Details
All experiments were implemented in PyTorch (version 2.2.0) on a computer equipped with an Intel Core i7-13600KF CPU and an NVIDIA GeForce RTX 4090 GPU.
In this work, we directly use the mmWave radar point clouds released in the public MM-Fi benchmark. Therefore, the proposed framework does not start from raw radar ADC signals or redesign the low-level radar signal processing pipeline. Instead, it focuses on downstream spatial-temporal representation learning and HAR based on the benchmark-provided point cloud data.
To preserve the geometric information of the radar point clouds, we do not use a fixed point count. Instead, the data loader keeps all valid radar reflection points in each frame and applies dynamic batch-wise padding. In each batch, the point cloud sequences are padded to the maximum number of valid points in that batch, which usually reaches up to about 96 points. This variable-length input is directly supported by the PTv3 radar encoder. Therefore, the model is trained with dynamic point clouds rather than a fixed setting. The embedding dimensions for both the radar student encoder (PTv3) and the vision teacher encoder (Spatial Mamba) were unified to . For the Spatial Mamba network, the state expansion factor was set to 2 and the hidden state dimension to 16. In the LTMB, the 1D depthwise convolution was configured with a kernel size of , a stride of , and a padding of .
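The dynamic batch-wise padding described above can be sketched as a simple collate step. This is an illustrative stand-in for the data loader, not the actual implementation; each point is a 5D vector as defined for the radar modality, and the padding value and mask convention are assumptions.

```python
def pad_batch(point_clouds, feat_dim=5, pad_value=0.0):
    """Pad every frame's point cloud to the maximum valid point count
    in the batch (dynamic, per-batch padding rather than a fixed count).
    Returns the padded batch and a boolean mask marking valid points."""
    max_pts = max(len(pc) for pc in point_clouds)
    padded, mask = [], []
    for pc in point_clouds:
        n = len(pc)
        # append fresh zero points up to the batch maximum
        padded.append(pc + [[pad_value] * feat_dim for _ in range(max_pts - n)])
        mask.append([True] * n + [False] * (max_pts - n))
    return padded, mask
```

A point-based encoder such as PTv3 can then consume the variable-length input via the validity mask, so no geometric information is discarded by a fixed point budget.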
The model was trained end-to-end for 150 epochs using the Adam optimizer with a batch size of 32. The initial learning rate was set to and was progressively decayed to ensure stable convergence. The network was supervised by a joint loss function, with the balance hyperparameter empirically set to 0.01. During training, the modality dropout probability was strictly set to .
During the inference phase, to validate the effectiveness of our TACG and Modality-Missing Robust Fusion mechanism, we designed a dual-scenario testing strategy: the Multimodal Setting (Radar + Vision) and the Radar-Only Setting (vision masked).
Figure 5 presents the confusion matrices for the Tac-Mamba model under both testing scenarios. Overall, the model demonstrated strong classification performance, achieving an average Top-1 accuracy of 95.37% under the multimodal fusion setting (
Figure 5a) and 87.54% under the pure radar setting (
Figure 5b).
4.3. Comparison with SOTA Methods
To comprehensively demonstrate the superiority of Tac-Mamba, we compare it against several recent SOTA architectures on the MM-Fi dataset. The baselines encompass single-modal networks, the latest multimodal foundation models, and advanced cross-modal distillation frameworks. Specifically, we adopt the following baselines for comparison:
AGCN [15]: The Adaptive Graph Convolutional Network (AGCN) is a classic skeleton-based action recognition model. Rather than relying on a fixed physical structure, it introduces a dynamic graph construction mechanism that adaptively learns the graph topology for different action samples.
CTRGCN [16]: The Channel-wise Topology Refinement Graph Convolution (CTRGCN) is an advancement in graph modeling that learns different network topologies for different feature channels. This enables the model to capture finer and more diverse spatial correlations among joints.
MM-Fi Baseline [18]: The official multimodal benchmark model provided by the MM-Fi dataset. It uses a flattened MLP architecture to extract spatial features and employs recurrent units (e.g., Bi-GRU) or standard Transformers to fuse multimodal sequences over time.
X-Fi [20]: A recently proposed, modality-invariant foundation model designed for multimodal human sensing. It uses a flexible Transformer structure to handle different input sizes and incorporates an X-fusion mechanism, enabling the model to handle any combination of sensor modalities without extensive retraining.
MMSense [21]: A multi-task foundation model that adapts vision-based architectures for wireless sensing. It incorporates a modality gating mechanism and uses a vision-based large language model (LLM) backbone to align features from multiple sensors (radar, vision, and LiDAR) into a unified semantic space.
SkeFi [24]: A cutting-edge cross-modal knowledge transfer framework, explicitly designed to transfer high-quality topological knowledge from data-rich visual modalities to noisy wireless sensors (e.g., mmWave). It employs an enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) to mitigate the noise caused by missing frames in wireless sensing.
We conduct the benchmark tests under the strict Cross-Environment setting (Protocol 3) of the MM-Fi dataset. This protocol ensures that the background clutter in the test set is never seen during training. Furthermore, to comprehensively evaluate the performance upper bound under ideal conditions and the modality-missing robustness in extreme environments, we report the Top-1 Accuracy (%) under two distinct inference scenarios:
Multimodal Setting: Both radar point clouds and visual skeletons are provided to the network during both training and inference phases. This setting assesses the upper bound of cross-modal fusion performance when comprehensive sensor information is available.
Radar-Only Setting: The model is jointly optimized using multimodal data during the training phase, leveraging the topological supervision from the vision teacher. However, during validation and inference, the visual input is completely masked (i.e., replaced by all-zero tensors). This setting explicitly simulates real-world modality-missing conditions caused by low nighttime illumination, severe physical occlusions, or visual sensor failures. It rigorously validates the network’s robustness for environment-invariant sensing, relying exclusively on the mmWave radar modality and entirely decoupled from visual dependencies.
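The two inference settings and the training-time modality dropout can be summarized in one small helper. This is a hedged sketch, not the paper's code: `mask_vision` is a hypothetical name, and the per-sequence (rather than per-frame) masking granularity is an assumption. It captures the stated behavior that masked visual input is replaced by all-zero tensors.

```python
import random

def mask_vision(vision_seq, p_drop, training, radar_only=False, rng=random):
    """Return the visual skeleton sequence, or an all-zero stand-in of
    the same shape. During training the vision teacher is masked with
    probability p_drop (modality dropout); at Radar-Only inference it is
    always masked, simulating camera failure or occlusion."""
    drop = radar_only or (training and rng.random() < p_drop)
    if not drop:
        return vision_seq
    # zero tensor of identical shape, exactly decoupling from vision
    return [[0.0] * len(frame) for frame in vision_seq]
```

During training, random masking forces the radar student to carry the prediction alone; at inference, setting `radar_only=True` reproduces the modality-missing evaluation above.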
The quantitative results are summarized in
Table 2.
As demonstrated in
Table 2, the proposed Tac-Mamba establishes a new SOTA performance, exhibiting superiority in both multimodal and single-modal inference scenarios. The fundamental reasons driving this superior performance can be attributed to our specific architectural innovations:
A critical observation from
Table 2 is the severe negative transfer observed in traditional multimodal models under the Cross-Environment protocol. Because the visual modality is highly sensitive to environmental and illumination changes, it introduces significant noise in unseen testing environments. Consequently, the multimodal fusion accuracies of the MM-Fi Baseline (66.90%) and the foundation model X-Fi (73.40%) are paradoxically much lower than their pure Radar-Only counterparts (85.00% and 85.70%, respectively). The simple concatenation or attention mechanisms in these baselines fail to isolate the corrupted visual features, leading to polluted joint representations.
In stark contrast, Tac-Mamba avoids this degradation entirely and achieves a competitive accuracy of 95.37% in the multimodal setting, outperforming the massive MMSense model (87.66%). This advantage is driven by our proposed TACG. Unlike static fusion, TACG utilizes bilinear cross-modal interactions to dynamically evaluate the reliability of incoming visual features. When the visual input is corrupted by environmental shifts, the gate deterministically suppresses the noisy visual branch, ensuring that the fusion process is constructively enhanced rather than polluted.
In the strictly Radar-Only scenario, Tac-Mamba retains a robust accuracy of 87.54%. Traditional graph-based networks, such as AGCN and CTRGCN, fail to capture the unstructured geometry of radar clouds. This results in accuracy rates below 66%. Furthermore, the dedicated cross-modal distillation framework SkeFi achieves only 62.98% on single-modal inference. This indicates that its graph-based transfer has difficulty overcoming the inherent sparsity of mmWave data.
Tac-Mamba's pure-radar accuracy of 87.54% also slightly surpasses the foundation model X-Fi's 85.70%. This is primarily attributed to the modality dropout strategy combined with the PTv3 encoder. During training, the random masking of the vision teacher forces the PTv3 student to independently map sparse 3D coordinates to high-dimensional topological representations. This explicit distillation, supervised by the Spatial Mamba teacher, guarantees that the radar branch learns an environment-invariant human structural prior, ensuring that Tac-Mamba maintains reliable sensing even when cameras are disabled in real-world edge deployments.
4.4. Robustness to Radar Point Cloud Sparsity
In practical deployments, the number of valid mmWave radar points may change with distance, clutter, and body motion. To examine the robustness of Tac-Mamba to sparse radar point clouds, we conducted an additional inference-time sparsity test.
During training, the model used the original dynamic point clouds with batch-wise padding, where the maximum number of valid points in each batch generally reached about 96. During evaluation, we randomly downsampled the valid radar points in each frame to several fixed densities $N$. We then tested both the Radar-Only and Multi-Modal settings. The results are summarized in Table 3.
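The inference-time sparsity test can be sketched as a per-frame random downsampling step. This is an illustrative helper under assumed conventions (uniform sampling without replacement; frames already sparser than the target are kept whole), not the evaluation script itself.

```python
import random

def downsample_points(points, n, seed=None):
    """Randomly keep n radar points from a frame to simulate sparser
    returns; frames with fewer than n valid points are left unchanged."""
    if len(points) <= n:
        return list(points)
    rng = random.Random(seed)
    # uniform sampling without replacement over the valid points
    return rng.sample(points, n)
```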
As shown in Table 3, the recognition accuracy decreases as the radar point cloud becomes sparser. When the input is downsampled to 64 points, Tac-Mamba still maintains 81.94% accuracy in the Radar-Only setting and 92.13% in the Multi-Modal setting. When the point cloud becomes much sparser, the performance drops more noticeably in both settings.
In low-density point cloud scenarios, the Multi-Modal setting consistently outperforms the Radar-Only setting. For example, at the lowest tested density $N$, the Radar-Only accuracy drops to 25.46%, while the Multi-Modal accuracy remains at 53.70%. This suggests that the fusion framework, including TACG, helps compensate for the loss of radar spatial information under sparse input conditions. Overall, these results show that the proposed framework is not tied to one fixed point count and exhibits reasonable robustness to varying radar point densities.
4.5. Robustness to Visual Pose Estimator Quality
The proposed framework uses a Spatial Mamba teacher to provide visual structural guidance during training. To examine whether the final radar-only performance is sensitive to the quality of this visual teacher, we conducted a visual noise injection test.
Specifically, during training, we added Gaussian noise
to the 3D visual skeleton coordinates to simulate degraded pose estimation quality. Two degradation levels were considered:
and
. All models were trained under the Cross-Environment protocol, and the final Radar-Only accuracy was evaluated on the clean test set. For comparison, we also trained the official MM-Fi baseline under the same noisy visual conditions. The results are summarized in
Table 4.
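The noise injection itself is simple to sketch: independent Gaussian perturbations are added to every joint coordinate. The helper name and per-coordinate i.i.d. assumption are illustrative, not taken from the paper's code.

```python
import random

def corrupt_skeleton(joints, sigma, seed=None):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every 3D joint
    coordinate, simulating a degraded visual pose estimator used as
    the teacher signal during training."""
    rng = random.Random(seed)
    return [[c + rng.gauss(0.0, sigma) for c in joint] for joint in joints]
```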
As shown in Table 4, the radar-only performance of the MM-Fi baseline is strongly affected by degraded visual supervision during training, whereas Tac-Mamba remains relatively stable. Under the stronger noise level, the MM-Fi baseline drops from 85.00% to 68.35%, while Tac-Mamba only decreases from 87.54% to 84.85%. These results show that Tac-Mamba is less sensitive to imperfect visual supervision and can maintain robust radar-only performance even when the quality of the visual teacher degrades.
4.6. Statistical Significance and Stability Analysis
To further examine the stability of the proposed framework, we repeated the evaluation under multiple independent runs with different random seeds. Specifically, we retrained the full Tac-Mamba model from scratch using three random seeds. We report the sample mean $\bar{x}$, the sample standard deviation $s$, and the 95% confidence interval of the mean accuracy:

$$\mathrm{CI}_{95\%} = \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}},$$

where $x_i$ denotes the Top-1 accuracy under seed $i$, $n = 3$, and $t_{0.975,\,n-1}$ is the critical value of the Student's $t$ distribution.
The quantitative results are summarized in
Table 5. In the Radar-Only setting, the mean accuracy is 87.36% with a standard deviation of 0.58%, and the 95% confidence interval is [85.93%, 88.79%]. In the Multi-Modal setting, the mean accuracy is 95.31% with a standard deviation of 0.30%, and the 95% confidence interval is [94.56%, 96.05%]. These results show that the proposed framework achieves stable performance across repeated runs.
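As a sanity check, the interval can be recomputed from the mean, standard deviation, and the $t$ critical value ($t_{0.975,2} \approx 4.303$ for $n = 3$). The three accuracy values in the usage below are constructed to match the reported Radar-Only mean (87.36%) and standard deviation (0.58%); they are not the actual per-seed results.

```python
import math

def mean_std_ci95(accs, t_crit=4.303):
    """Sample mean, (n-1)-normalized standard deviation, and 95%
    t-based confidence interval of the mean for a small set of runs.
    t_crit defaults to t_{0.975, 2}, the critical value for n = 3."""
    n = len(accs)
    mean = sum(accs) / n
    var = sum((a - mean) ** 2 for a in accs) / (n - 1)
    std = math.sqrt(var)
    half = t_crit * std / math.sqrt(n)  # half-width of the interval
    return mean, std, (mean - half, mean + half)
```

For constructed values `[86.78, 87.36, 87.94]`, this yields a mean of 87.36, a standard deviation of 0.58, and an interval of roughly [85.92, 88.80], matching the reported Radar-Only interval up to rounding.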
4.7. Cross-Environment Generalization Analysis
Evaluating the model under unseen physical environments is important for examining its generalization ability. Although cross-dataset validation is often used for this purpose, it is not directly feasible for Tac-Mamba under the current training setting. The proposed cross-modal distillation framework requires synchronized mmWave point clouds and 3D visual skeletons during training. At present, MM-Fi is one of the few public benchmarks that provides this dual-modal pairing.
To further examine cross-environment generalization, we conducted a Leave-One-Environment-Out (LOEO) evaluation across the four environments (E01 to E04) in the MM-Fi dataset. According to the dataset specification, these environments cover two different rooms and two sensor deployment orientations. As a result, they involve different spatial layouts and multipath conditions. In each fold, the model was trained on three environments and tested on the remaining unseen one. The results are summarized in
Table 6.
As shown in
Table 6, Tac-Mamba maintains stable recognition performance across all unseen environments. In the Multi-Modal setting, the accuracy ranges from 93.60% to 95.37%. In the Radar-Only setting, the accuracy ranges from 80.31% to 87.54%. When tested on E03, the Radar-Only accuracy drops to 80.31%, which may be related to stronger background multipath interference in that environment. Even so, the Multi-Modal framework still achieves 93.98% accuracy on E03. These results show that the proposed framework has good robustness across different unseen environments and does not simply overfit to one specific environment.
4.8. Training Efficiency and Complexity Analysis
While Tac-Mamba is designed for efficient edge inference, training multimodal networks on 4D point cloud sequences typically requires substantial computational resources. To empirically validate the training efficiency and the theoretical linear complexity ($\mathcal{O}(L)$) of the proposed architecture, we conducted a controlled profiling experiment comparing our Temporal Mamba block with a standard Multi-Head Self-Attention Transformer Encoder.
To isolate the temporal modeling efficiency from the spatial feature extraction overhead, we evaluated the standalone temporal modules under identical parameter settings (identical hidden dimension, 4 layers, batch size 16) on an NVIDIA RTX 4090 GPU. We recorded the peak GPU memory allocation (VRAM) and the average training time per iteration across varying sequence lengths $L$. The results are summarized in Table 7.
As shown in Table 7, at the standard MM-Fi sequence length of $L = 297$, the Mamba module reduces memory consumption by 53.2% and training time by 39.3% compared to the Transformer baseline. As the sequence length scales to 4096 frames (representing continuous monitoring scenarios), the performance gap widens significantly. The self-attention mechanism in Transformers incurs quadratic computational complexity ($\mathcal{O}(L^2)$), leading to a rapid increase in memory usage (8043 MB) and training time (468.1 ms).
In contrast, the State Space Model (SSM) processes the temporal dynamics with linear complexity ($\mathcal{O}(L)$). At $L = 4096$, the Mamba module requires only 3262 MB of VRAM and 40.2 ms per iteration, achieving an 11.6× speedup over the Transformer. This scaling behavior confirms that Tac-Mamba provides a highly memory- and time-efficient foundation for processing long-term continuous human sensing data, lowering the hardware threshold required for training.
4.9. Edge Deployment Validation on Jetson Nano
To evaluate the practical deployment feasibility of the proposed framework, we conducted on-device inference benchmarking on an NVIDIA Jetson Nano B01 developer kit, as shown in
Figure 6. The device is equipped with a 128-core NVIDIA Maxwell GPU, a quad-core ARM Cortex-A57 CPU, and 4 GB LPDDR4 memory. During testing, the power mode was set to MAXN (10 W mode).
We compared Tac-Mamba with our reimplementation of the official MM-Fi dual-modal baseline under the same hardware setting. According to the original MM-Fi paper, this baseline uses a 3-layer 1D convolutional network for radar feature extraction, a 2-layer MLP for visual skeleton embedding, feature concatenation, and a 2-layer BiGRU for temporal modeling, followed by an MLP classifier.
For both models, the input was a dual-modal sequence of 297 frames. The models were evaluated with batch size 1. The latency was measured with CUDA synchronization (torch.cuda.synchronize()) for accurate timing. We report the average latency over 100 test runs after 20 warm-up runs. The results are summarized in
Table 8.
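The measurement protocol (untimed warm-up runs followed by averaged timed runs) can be sketched with a device-independent harness. On the Jetson, each timed call would additionally be followed by `torch.cuda.synchronize()` so that asynchronous GPU work is included in the measurement; that call is omitted here to keep the sketch runnable without a GPU, and the harness itself is illustrative rather than the paper's benchmarking script.

```python
import time

def benchmark(fn, warmup=20, runs=100):
    """Average wall-clock latency in ms (plus sample std) over `runs`
    timed calls, after `warmup` untimed calls to stabilize caches,
    clocks, and JIT/kernel autotuning."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    mean = sum(times) / runs
    var = sum((t - mean) ** 2 for t in times) / (runs - 1)
    return mean, var ** 0.5
```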
As shown in
Table 8, the MM-Fi baseline achieves 0.93 GFLOPs and 164.19 ± 0.56 ms inference latency, while Tac-Mamba has 3.82 GFLOPs and 148.23 ± 1.01 ms latency. Although Tac-Mamba has higher theoretical complexity, it achieves lower actual latency on the Jetson Nano. Both models have fewer than 1M parameters, and Tac-Mamba contains 0.86 M parameters.
This result shows that Tac-Mamba is suitable for edge deployment. The lower latency mainly comes from the parallel computation pattern of the state space model, while the BiGRU baseline is limited by its sequential dependency. Since a 297-frame sequence corresponds to about 10 s of human activity, the measured latency is sufficient for real-time inference in this task.
5. Ablation Study
To thoroughly validate the effectiveness of our proposed modules and architectural choices, we conduct extensive ablation studies on the MM-Fi dataset under the strict Cross-Environment setting (Protocol 3). In this section, the primary evaluation metric is the Top-1 Accuracy (%) under both the Multimodal and Radar-Only inference scenarios. Furthermore, to evaluate the deployment feasibility on resource-constrained edge devices, we also report the model parameter size (Params in Millions), computational complexity (GFLOPs), and inference latency (ms). Specifically, GFLOPs are calculated using the thop library on a standardized input tensor, and inference latency is measured by averaging the execution times of 100 independent forward passes after a 20-iteration warm-up on a single NVIDIA RTX 4090 GPU with a batch size of 1. All ablated versions are trained and evaluated under identical settings to ensure fairness.
5.1. Effectiveness of Key Structural Components
In this section, we examine how the components developed in the Tac-Mamba framework affect the model's overall performance. We systematically degrade the full model by replacing our innovative modules with traditional counterparts. Specifically, we implement four ablated versions for comparison: (1) replacing the vision Spatial Mamba encoder with a standard MLP (denoted as w/o Spatial Mamba (replace with MLP)); (2) substituting the LTMB with a bidirectional GRU (denoted as w/o LTMB (replace with Bi-GRU)); (3) removing the TACG in favor of simple addition fusion (denoted as w/o TACG (replace with Addition)); and (4) removing the auxiliary coordinate regression loss during training (denoted as w/o HPE Loss). The results are reported in Table 9.
Table 9 clearly shows that modifying or removing any of the core modules leads to noticeable performance degradation, validating the effectiveness of each developed module. In detail, we analyze the impact of these design components as follows. In the multimodal setting, replacing the LTMB with dual GRUs results in a significant drop in accuracy (from 95.37% to 86.57%). This confirms that our linear, global-receptive-field design is essential for refining hierarchical feature representations across hundreds of frames, successfully overcoming the gradient vanishing and memory forgetting inherent in traditional recurrent networks. Moreover, removing the TACG module decreases the pure-radar accuracy by more than 5% (from 87.54% to 82.19%). This confirms that adaptive modality weighting is critical for managing variations in modality reliability under extreme sensing conditions: during single-modal inference, the masked visual branch introduces zero-tensor noise that degrades radar representations, and TACG effectively filters out this interference to preserve feature integrity. Likewise, replacing the Spatial Mamba with an MLP causes a performance drop, indicating that explicitly modeling the topological sequence of human joints in a hardware-friendly way provides much stronger structural priors than naive flattened feature mapping. Finally, when the auxiliary HPE loss is removed (w/o HPE Loss), the Multi-Modal accuracy drops from 95.37% to 93.60%, and the Radar-Only accuracy drops from 87.54% to 80.90%. This shows that the auxiliary pose regression loss is not redundant: it provides useful structural supervision during training and improves the discriminative ability of the radar branch, especially when visual inputs are degraded or unavailable.
Overall, the Tac-Mamba model performs consistently well in both scenarios, showing that its innovative components work together to improve cross-modal fusion capacity and robustness when one modality is missing.
5.2. Impact of Modality Dropout Probability
The modality dropout strategy is used to prevent the network from over-relying on the visual modality. Randomly masking the vision teacher during training forces the radar student to learn human topological priors independently. We systematically evaluate the impact of the dropout probability $p$. The comparative results are summarized in Table 10.
As illustrated in Table 10, the dropout probability strictly governs the trade-off between the multimodal fusion upper bound and single-modal inference robustness. When $p = 0$ (no dropout), the network achieves a near-perfect multimodal accuracy of 98.15%, but its radar-only accuracy collapses to 14.81%. This reveals a severe modality-laziness issue: the network heavily overfits the structured visual features and entirely ignores the radar representations during joint optimization. In contrast, aggressive dropout ($p = 1$) completely severs the cross-modal interactions, suppressing the multimodal upper bound to 83.80%.
An intermediate dropout probability strikes the optimal balance. It successfully distills topological knowledge from the vision teacher while forcing the radar branch to maintain independent sensing capabilities. Consequently, Tac-Mamba achieves the highest modality-missing robustness (87.54%) alongside a strong multimodal accuracy (95.37%).
5.3. Sensitivity Analysis of the Auxiliary Loss Weight
In Equation (19), the weighting hyperparameter controls the balance between the primary HAR objective and the auxiliary pose estimation objective. To justify the chosen setting, we conducted a sensitivity analysis over a range of weight values, where a weight of 0 denotes the model without the auxiliary HPE loss. For a fair comparison, all other training settings were kept unchanged. We report the final action recognition accuracy under both the Multi-Modal and Radar-Only settings. The results are shown in Table 11.
As shown in Table 11, the HAR performance follows a clear inverted-U trend as the auxiliary loss weight increases. When the auxiliary HPE loss is removed (a weight of 0), the Radar-Only accuracy drops to 80.90%, showing that the radar branch benefits from the structural supervision introduced during training. As the weight increases from 0 to 0.01, both the Multi-Modal and Radar-Only accuracies improve steadily; the best result is achieved at a weight of 0.01, where the Multi-Modal and Radar-Only accuracies reach 95.37% and 87.54%, respectively. When the weight is further increased to 0.05 or 0.1, the recognition accuracy declines, indicating that an overly large auxiliary loss weight over-emphasizes the pose regression objective and weakens the optimization of the primary HAR task. Therefore, a weight of 0.01 was selected as the default setting in the proposed framework.
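The objective being swept here reduces to a weighted sum of the two task losses; the following sketch (with hypothetical names `total_loss`, `l_har`, `l_hpe`, and `lam`) shows how a weight of 0 recovers the no-auxiliary baseline:

```python
def total_loss(l_har, l_hpe, lam):
    """Primary HAR loss plus a lambda-weighted auxiliary pose (HPE) term;
    lam = 0 recovers training without the auxiliary objective."""
    return l_har + lam * l_hpe

# with fixed per-task loss values, the sweep only rescales the auxiliary term
sweep = {lam: total_loss(1.0, 0.5, lam) for lam in (0.0, 0.01, 0.05, 0.1)}
```

The inverted-U trend in accuracy arises from the optimization dynamics, not from this formula itself: a small weight injects structural gradients cheaply, while a large one lets the pose term dominate the shared encoder.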
5.4. Comparison of Fusion Mechanisms
To further justify the design of TACG, we evaluate its performance against mainstream feature fusion strategies, including Simple Concatenation, Simple Addition, and Standard Cross-Attention (without consistency gating). The quantitative comparisons are presented in
Table 12.
Table 12 reveals an essential trade-off between multimodal overfitting and single-modal robustness. Naive fusion approaches, such as Simple Concatenation and Simple Addition, tend to overfit the dominant visual modality during training. While they yield high multimodal accuracies (98.15% and 97.69%, respectively), their performance drops significantly to 80.56% and 83.80% when visual input is denied. This indicates that without a dynamic gating constraint, the radar representations fail to learn sufficient topological structures independently.
Moreover, the Standard Cross-Attention mechanism performs poorly in the radar-only scenario (82.19%). Since it relies on visual features as Keys and Values, masking the vision branch introduces zero-tensor noise, disrupting the computation of the attention matrix.
In contrast, our proposed TACG leverages bilinear cross-modal interactions to dynamically assess the reliability of the incoming visual features. It filters out unreliable modalities and ensures a smooth fallback to the pure radar representations. Consequently, TACG achieves the most robust radar-only accuracy of 87.54% without sacrificing the multimodal fusion capability (95.37%).
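A minimal sketch of consistency-gated fusion in the spirit of TACG is given below; the scalar bilinear form, sigmoid gate, and additive fallback are simplifying assumptions rather than the published TACG definition:

```python
import numpy as np

def gated_fusion(radar, vision, W):
    """Bilinear cross-modal consistency score drives a scalar gate on the
    vision contribution; a zero-masked vision tensor contributes nothing,
    so the output falls back to the pure radar representation."""
    score = float(radar @ W @ vision)      # bilinear consistency score
    gate = 1.0 / (1.0 + np.exp(-score))    # sigmoid gate in (0, 1)
    return radar + gate * vision

rng = np.random.default_rng(1)
d = 8
W = 0.1 * rng.standard_normal((d, d))      # learnable bilinear weights
radar = rng.standard_normal(d)
vision = rng.standard_normal(d)
fused = gated_fusion(radar, vision, W)
fallback = gated_fusion(radar, np.zeros(d), W)  # simulated masked vision
```

Unlike standard cross-attention, which must still normalize over zero-valued Keys and Values when vision is masked, this gated form degrades gracefully: the masked branch simply contributes nothing to the fused output.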
5.5. Superiority over Latest Spatial and Temporal Modules
To further validate the architectural advancement of Tac-Mamba, we replace our key spatial and temporal components with several SOTA architectures. Specifically, we divide the ablation into three aspects: (1) replacing the radar spatial encoder, PTv3, with Point-Mamba [
27] and DuoMamba [
28]; (2) replacing the vision spatial encoder, Spatial Mamba, with BlockGCN [
29] and SkateFormer [
30]; and (3) substituting the temporal modeler, LTMB, with Temporal xLSTM [
31] and Mamba-2 (SSD) [
32]. The comprehensive performance and efficiency comparisons are presented in
Table 13.
As detailed in
Table 13, substituting our carefully chosen components with these generic SOTA modules results in suboptimal performance in both accuracy and computational efficiency.
In the spatial domain, replacing the PTv3 encoder with modern point-cloud architectures (e.g., Point-Mamba or DuoMamba) results in noticeable accuracy drops, particularly under the radar-only setting (decreasing to 75.46% and 73.15%). This indicates that these generalized modules struggle to extract sufficient topological features from sparse mmWave signals. Interestingly, when the vision teacher module (Spatial Mamba) is replaced by BlockGCN or SkateFormer, not only does the multimodal accuracy degrade, but the radar-only accuracy also drops below 82%. This demonstrates that a stronger vision teacher explicitly built on the Mamba architecture provides superior topological supervision, which directly translates into a more robust PTv3 radar student during the knowledge distillation process.
In the temporal domain, the xLSTM module demonstrates strong temporal modeling, achieving a multimodal accuracy of 92.13%. However, its recurrent nature introduces substantial computational overhead, resulting in 4.05 GFLOPs and a notably high inference latency of 3.73 ms. In contrast, Mamba-2 (SSD) successfully minimizes the parameter count to 0.54 M. However, its accuracy drops to 83.22%. This suggests that generalized state-space duality struggles to capture fine-grained kinematic dependencies over long temporal sequences.
Overall, Tac-Mamba achieves the highest classification accuracies while maintaining the lowest computational cost (3.82 GFLOPs) and a low latency (1.89 ms). This proves that our customized spatial-temporal design provides the best trade-off between sensing reliability and edge deployment efficiency.
5.6. Controlled Profiling of Temporal Modules
To further examine the efficiency of the temporal modeling module, we conducted a controlled profiling experiment on an NVIDIA RTX 4090 GPU. We compared a standard 4-layer Transformer encoder with our 4-layer Temporal Mamba module under matched parameter counts. Both modules contain 1.75M parameters. The hidden dimension was set to , and the batch size was set to 16.
This experiment isolates the temporal modeling stage from the front-end point cloud encoder. Therefore, it can directly reflect the memory and time cost of the sequence modeling module itself. We recorded the peak GPU memory allocation (VRAM) and the average training time per iteration under different sequence lengths. The results are shown in
Table 14.
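A simplified timing harness for such per-iteration profiling might look as follows; this CPU stand-in omits the GPU-specific steps (synchronizing the device before reading the timer and querying peak memory), and the function name `avg_iteration_time` is ours:

```python
import time

def avg_iteration_time(step_fn, n_warmup=3, n_iters=20):
    """Average wall-clock seconds per iteration, excluding warm-up steps.
    On GPU, each iteration should end with a device synchronization before
    the timer is read, and peak memory can be queried separately."""
    for _ in range(n_warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    return (time.perf_counter() - start) / n_iters

# toy workload standing in for one training iteration
t = avg_iteration_time(lambda: sum(i * i for i in range(20000)))
```

Discarding warm-up iterations is important in practice, since the first steps include kernel compilation and allocator warm-up that would otherwise bias the average.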
As shown in Table 14, the Mamba module is more efficient than the Transformer baseline at all tested sequence lengths. At the standard sequence length, Mamba reduces VRAM from 288 MB to 271 MB and the training time per iteration from 6.2 ms to 4.0 ms. When the sequence length increases to 4096, the Transformer requires 3460 MB of VRAM and 381.9 ms per iteration, whereas Mamba requires 3221 MB and 40.3 ms.
These results show that the time efficiency advantage of Mamba becomes much more evident as the sequence length increases. This trend is consistent with the difference between self-attention-based temporal modeling and state space sequence modeling. In the full Tac-Mamba framework, the front-end point cloud encoder still accounts for a large part of the total training cost. However, this controlled experiment confirms that the Temporal Mamba block provides a more scalable temporal modeling choice than a standard Transformer encoder.
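The widening gap is consistent with a back-of-the-envelope FLOP comparison between quadratic self-attention and a linear-time state-space scan; the constants and state size below are illustrative assumptions, not measured values:

```python
def attention_flops(seq_len, dim):
    """Self-attention score and value mixing: two T x T x d matrix products."""
    return 2 * seq_len * seq_len * dim

def ssm_flops(seq_len, dim, state=16):
    """Linear-time selective scan: one pass over T with a size-`state` SSM."""
    return seq_len * dim * state

dim = 256
ratio_short = attention_flops(512, dim) / ssm_flops(512, dim)    # 64x at T=512
ratio_long = attention_flops(4096, dim) / ssm_flops(4096, dim)   # 512x at T=4096
```

Because the attention/SSM cost ratio itself grows linearly in T, an 8x longer sequence makes the relative advantage 8x larger, matching the trend observed in Table 14.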
6. Conclusions
In this paper, we propose Tac-Mamba, a lightweight spatial-temporal framework for human activity recognition from sparse mmWave radar point clouds. The proposed topology-guided cross-modal distillation scheme leverages a Spatial Mamba teacher to supervise a PTv3 radar student. This design enables the network to construct accurate geometric topologies independently, effectively overcoming the inherent sparsity of radar point clouds. To address the negative transfer issue in visually denied environments, the TACMA module incorporates a TACG mechanism. This mechanism uses cross-modal bilinear consistency to adaptively adjust the feature fusion and ensures that unreliable visual noise does not affect the radar representations. Furthermore, the LTMB utilizes bidirectional state-space operators and a Zero-Parameter Cross-Gating (ZPCG) mechanism to efficiently capture long-term kinematic dependencies. Experimental results on the MM-Fi dataset demonstrate that Tac-Mamba achieves strong performance in both multimodal and single-modal inference scenarios with a small parameter count of 0.86 M and 3.82 GFLOPs.
However, the current study still has several limitations. First, the proposed framework is mainly evaluated under the single-person setting of the MM-Fi benchmark, and its effectiveness in more complex multi-person scenarios remains to be further validated. Second, the current study is validated on a single public dataset, MM-Fi. Although we added cross-environment experiments to strengthen the generalization evidence, the cross-dataset generalization of Tac-Mamba still needs further investigation. Third, the proposed training framework requires synchronized visual skeletons and mmWave radar data, which limits direct evaluation on many existing single-modal radar HAR datasets. In addition, some baseline comparisons are not fully equivalent because existing methods differ in modality setting, supervision form, and deployment objective.
Therefore, these comparison results should be interpreted within the scope of the current benchmark and protocol. Future work will focus on developing techniques to separate spatial instances to address multi-person interference and occlusion issues in complex environments. Additionally, we aim to extend the trust-aware gating mechanism to accommodate more heterogeneous sensor data, such as WiFi CSI or LiDAR, further enhancing the system’s generalized perception capabilities on edge devices.