1. Introduction
Neurological diseases and motor disabilities affect some 1 billion people across the globe, demanding robust automated movement assessment and rehabilitation monitoring systems [1,2,3]. These systems rely on the correct identification of actions in video streams, which allows a clinician to trace a patient's progress, identify movement disorders, and tailor treatment methods [4,5,6,7,8,9]. Recent advances in deep learning have substantially improved video understanding capabilities, and vision transformers have surpassed prior approaches on action recognition benchmarks by a large margin [10,11,12,13,14].
Clinical rehabilitation contexts, including stroke motor recovery, Parkinson’s disease gait analysis, and cerebral palsy movement monitoring, demand reliable movement assessment tools that go beyond general-purpose recognition [15,16,17,18]. Pose-based and skeleton-driven systems, such as OpenPose-derived pipelines and GCN-based skeletal models [19,20,21,22,23], have demonstrated value in controlled settings but are sensitive to occlusion and require specialized depth sensors not universally available in clinical environments. Clinical motion assessment systems such as the Scale for the Assessment and Rating of Ataxia (SARA) and upper-limb kinematics tools [24,25,26] rely on trained therapist observation or marker-based setups, limiting scalability. Notably, I3D [27,28,29,30,31] and TimeSformer [32,33,34] were designed for sports and web video datasets (Kinetics, UCF101) and do not address the subtle, fine-grained movement deviations, tremors, asymmetric gait, and incomplete range of motion that carry clinical significance in neurological rehabilitation [35,36,37]. Video Swin Transformer [38,39,40], while achieving strong temporal modeling, relies on local shifted windows that may miss the longer-range causal motion dependencies critical for rehabilitation assessment. The fundamental limitation shared across these architectures is their susceptibility to confounding factors: scene backgrounds, patient clothing, camera placement, and lighting conditions, all of which vary across therapy sessions, can dominate the learned representations rather than the clinically meaningful motion features [41,42,43,44,45].
Early video-based action recognition employed two-stream convolutional networks to model complex spatiotemporal patterns [46], and inflated 3D convolutions (I3D) extended image classification architectures to video [14]. More recently, transformer architectures have shown superior performance by modeling long-range temporal dependencies with self-attention mechanisms [8,41].
Nevertheless, the success of existing video models depends critically on learning meaningful temporal relationships rather than exploiting spurious correlations [9]. Modern methods exhibit worrying tendencies: models trained on action recognition data tend to rely on static scene information, co-occurring objects, or incidental temporal cues instead of comprehending the causal process of human motion [10,11]. This confounding issue is especially problematic in rehabilitation environments, where minor changes in movement have clinical implications [12].
Past studies have dealt with temporal modeling in different ways. The authors of [10] propose divided space–time attention that enables efficient video processing. Similarly, ref. [39] presents hierarchical shifted windows for video understanding with strong computational efficiency. Nonetheless, these approaches lack explicit causal reasoning abilities and can still be vulnerable to dataset biases [15]. New causal inference frameworks have emerged that address confounding in visual recognition [16,17], but they have yet to be applied to video action recognition.
Despite these advances, a significant gap remains in combining causal understanding with efficient video transformers in the context of rehabilitation [18]. Existing approaches do not sufficiently address the causality–accuracy trade-off and temporal confounding, limiting their potential for use in clinical settings [19]. The challenge of distinguishing causally relevant motion patterns from spurious correlations requires a fundamental rethinking of how video models process temporal information.
To address this gap, this paper proposes NeuroPrisma, a new neuro-prismatic video framework that addresses these issues through spectral decomposition and causal intervention. The system architecture is depicted in Figure 1.
The principal contributions of this paper are as follows.
The theoretical motivation is that frequency-domain analysis improves causal reasoning: temporal features of rehabilitation-relevant motions occupy distinct spectral bands—slow postural sway and tremor (~0–2 Hz) versus intentional limb movements (~2–8 Hz) versus rapid reflexive responses (~8–16 Hz). By explicitly decomposing temporal features into non-overlapping frequency bands, the PSA module enforces structural separation between motion scales, reducing the co-occurrence of confounding temporal shortcuts (e.g., environment motion and periodic background activity) with causal motion features. This spectral separation creates a more causally identifiable feature space upon which the CIL’s backdoor adjustment can more effectively operate. It is important to note that the clinical performance claims in this work are grounded in benchmark evaluations on publicly available datasets; prospective clinical validation with patient cohorts is acknowledged as necessary future work.
- 1.
NeuroPrisma Framework: A neuro-prismatic video architecture integrating frequency-domain spectral decomposition with SCM-grounded causal intervention for robust and interpretable action representation learning.
- 2.
Prismatic Spectral Attention (PSA): A frequency band decomposition module using discrete Fourier transforms to separate multi-scale temporal motion dynamics, enabling fine-grained discrimination of rehabilitation-relevant movement features.
- 3.
Causal Intervention Layer (CIL): A differentiable backdoor-adjustment module maintaining learnable confounder prototypes and applying NWGM-approximated do-calculus intervention to produce causally invariant action representations.
- 4.
Thorough Experimental Validation: Systematic evaluation on four standard benchmark datasets with comprehensive ablation analysis, statistical significance testing, and computational efficiency profiling.
The causal intervention framework proposed in NeuroPrisma is designed to be model-agnostic, enabling integration with various video transformer architectures beyond the specific implementation presented in this work. The PSA module operates as a standalone spectral decomposition layer that can be incorporated into existing architectures (e.g., TimeSformer, Video Swin Transformer, and ViViT) by replacing or augmenting standard temporal attention mechanisms. Similarly, the CIL module functions as a plug-in causal regularization component that can be inserted before the classification head of any video model, requiring only the specification of confounding variable prototypes relevant to the target application domain. This modularity ensures that advances in base video architectures—such as improved positional encodings, efficient attention mechanisms, or novel tokenization strategies—can be seamlessly integrated while maintaining the causal intervention mechanism.
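The plug-in claim above can be sketched abstractly. The snippet below is a hedged toy illustration (scalar "features", illustrative names, and a simple averaged correction standing in for the actual NWGM adjustment), showing how a CIL-style layer composes with an arbitrary backbone and classification head without modifying either:

```python
# Hedged sketch of the plug-in design: a CIL-style callable inserted between any
# backbone's feature extractor and its head. All names and numbers are illustrative.
from typing import Callable, List

def make_cil(prototypes: List[float], beta: float = 0.3) -> Callable[[float], float]:
    """Toy deconfounding layer for 1-D features: subtracts a prototype-averaged
    correction (a stand-in for the paper's NWGM-based adjustment)."""
    def cil(z: float) -> float:
        correction = sum(z - c for c in prototypes) / len(prototypes)
        return z - beta * correction
    return cil

def build_model(backbone: Callable[[float], float],
                head: Callable[[float], int],
                cil: Callable[[float], float]) -> Callable[[float], int]:
    # The backbone and head are untouched; only the CIL is inserted between them.
    return lambda x: head(cil(backbone(x)))

backbone = lambda x: 2.0 * x        # stand-in feature extractor
head = lambda z: int(z > 1.0)       # stand-in classifier
model = build_model(backbone, head, make_cil(prototypes=[0.0, 1.0]))
assert model(1.0) == 1              # backbone -> 2.0, CIL -> 1.55, head -> class 1
```

The design point is that `make_cil` only needs the confounder prototypes, mirroring the text's claim that the CIL requires "only the specification of confounding variable prototypes" for a new domain.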
3. Methodology
This section presents the proposed causality-based NeuroPrisma approach to action recognition.
3.1. Problem Formulation
Given an input video V ∈ ℝT×H×W×3, with T frames of spatial resolution H × W, the goal is to predict the action label y while guaranteeing that the prediction is based on causally relevant features rather than spurious correlations. We model this task as a Structural Causal Model (SCM) in which the action Y is caused by the motion dynamics M and confounded by the scene context C and temporal shortcuts S:
Assumption Limitations: The SCM (Equation (1)) relies on (1) the Markov condition, where exogenous variables are mutually independent, and (2) causal sufficiency, where all common causes are included. These assumptions may be violated when unmeasured latent factors (patient fatigue, therapist guidance, and medication timing) simultaneously influence motion M and context C. The K = 128 prototypes initialized from ImageNet may not comprehensively represent rehabilitation-specific confounders (assistive devices, compensatory movements, and session-to-session environmental variations). Sensitivity analysis (Section 4.7.3) shows that unmodeled confounders degrade causal effectiveness by 20–30%, indicating partial rather than complete deconfounding. Future work will use synthetic data experiments with oracle causal labels to validate spectral–causal disentanglement.
where U denotes exogenous noise. Our objective is to learn a predictor that isolates the causal effect M → Y while blocking the confounding paths through C and S.
The SCM is formally specified under two assumptions: (1) exogenous variables are mutually independent (Markov condition), and (2) causal sufficiency holds: all common causes of any observed variable pair are included in the model. Under these conditions, the backdoor criterion [24] is satisfied: the confounder set C d-separates motion features M from outcome Y when conditioning on C, permitting identification of P(Y∣do(M)) via the adjustment formula. Critically, frequency-domain decomposition contributes to causal identifiability through spectral disentanglement: motion-related causal signals occupy structured temporal bands (e.g., limb oscillation at 2–8 Hz), whereas scene-context confounders concentrate at very low frequencies (<0.5 Hz, DC component) or exhibit broadband non-periodic structure. By partitioning the temporal feature space into non-overlapping spectral bands prior to causal adjustment, the PSA module reduces the overlap between the causal feature and confounder distributions, thereby improving the precision of the CIL's backdoor adjustment and providing a principled theoretical basis for spectral decomposition as a preconditioner for causal modeling.
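To make the backdoor adjustment concrete, here is a toy discrete example with hypothetical numbers (not from the paper's experiments): observational conditioning P(Y|X) absorbs the confounder's influence, while the adjustment formula P(Y|do(X)) = Σ_c P(Y|X, c)·P(c) recovers the context-balanced causal effect.

```python
# Toy discrete illustration (hypothetical probabilities) of backdoor adjustment.
# X = motion feature present (0/1), C = scene context (0/1), Y = label (0/1).
P_c = {0: 0.5, 1: 0.5}
P_x_given_c = {(1, 0): 0.2, (1, 1): 0.8}          # confounder C influences X
P_y_given_xc = {(1, 1, 0): 0.3, (1, 1, 1): 0.9,   # P(y=1 | x=1, c)
                (1, 0, 0): 0.1, (1, 0, 1): 0.7}

def p_y1_do_x1():
    """Interventional: marginalize over P(c), severing the backdoor path."""
    return sum(P_y_given_xc[(1, 1, c)] * P_c[c] for c in (0, 1))

def p_y1_given_x1():
    """Observational: conditioning tilts toward contexts where X co-occurs."""
    p_x1 = sum(P_x_given_c[(1, c)] * P_c[c] for c in (0, 1))
    return sum(P_y_given_xc[(1, 1, c)] * P_x_given_c[(1, c)] * P_c[c] / p_x1
               for c in (0, 1))

print(round(p_y1_do_x1(), 3))     # 0.6  -> context-balanced causal effect
print(round(p_y1_given_x1(), 3))  # 0.78 -> inflated, since x=1 mostly occurs when c=1
```

The gap between 0.78 and 0.6 is exactly the spurious association that the CIL is designed to remove.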
3.2. System Architecture
Figure 2 illustrates the proposed NeuroPrisma framework, which is composed of four main elements: (1) video patch embedding, (2) Prismatic Spectral Attention (PSA), (3) Causal Intervention Layer (CIL), and (4) transformer encoder blocks. The architecture takes video inputs and performs spectral decomposition before causal intervention to generate deconfounded action predictions. NeuroPrisma is built upon the ViT-B/16 backbone [20] pretrained on ImageNet-21K, with the PSA module inserted before the first encoder layer and the CIL module after the PSA output, as shown in Figure 2. The data flow is as follows: the input video tensor V ∈ ℝT×H×W×3 is first processed through the patch embedding layer to obtain the initial token representation Z(0) ∈ ℝ(N+1)×768. The token sequence is then passed through the PSA module, producing Zpsa ∈ ℝ(N+1)×768, after which the representation is refined by the CIL module, resulting in Zcil ∈ ℝ(N+1)×768. The output is subsequently processed through L = 12 transformer encoder layers, yielding the final token representation Z(12) ∈ ℝ(N+1)×768. Finally, the CLS token zcls(12) is passed to a linear classifier to produce the prediction ŷ ∈ ℝ|Y|. All intermediate representations maintain D = 768 to enable residual connections with pretrained ViT-B weights.
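The token bookkeeping above can be sanity-checked with a short sketch (variable names are illustrative):

```python
# Token and shape bookkeeping for ViT-B/16 on 16-frame, 224x224 clips.
T, H, W, P, D = 16, 224, 224, 16, 768

grid_h, grid_w = H // P, W // P          # 14 x 14 patch grid per frame
spatial_tokens = grid_h * grid_w         # 196 patches after temporal averaging
seq_len = spatial_tokens + 1             # + [CLS] token -> 197, matching (N+1) above

# Without frame-wise embedding + temporal pooling, a joint space-time
# tokenization would instead produce T * 196 tokens.
joint_tokens = T * spatial_tokens

assert (grid_h, grid_w) == (14, 14)
assert seq_len == 197
assert joint_tokens == 3136
```

The 197-token sequence is what keeps the backbone at ViT-B's native length, enabling the residual reuse of pretrained weights mentioned above.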
Backbone Network Specification
The NeuroPrisma framework employs Vision Transformer Base (ViT-B/16) as the backbone architecture, augmented with temporal processing capabilities and the insertion of PSA and CIL modules. The backbone comprises 12 transformer encoder layers with the following configuration: L = 12 layers, D = 768 hidden dimension, H = 12 attention heads, dk = 64 dimension per head, MLP expansion dimension 3072, GELU activation function, and ImageNet-21K pre-training. The patch size is 16 × 16 pixels with an input resolution of 224 × 224 pixels per frame. The architectural data flow proceeds as follows. The input consists of 16-frame video clips V ∈ ℝ(B×16×224×224×3), where B denotes batch size. Patch embedding applies a linear projection of flattened 16 × 16 patches to produce Zpatch ∈ ℝ(B×197×768), where 197 tokens comprise one [CLS] token and 196 spatial patches (14 × 14 grid per frame with temporal averaging). The patch embeddings pass through PSA, which performs 4-band frequency decomposition (DC–3.75 Hz, 3.75–7.5 Hz, 7.5–12.5 Hz, 12.5–16.875 Hz), applies band-specific multi-head attention, and fuses features to produce Zpsa ∈ ℝ(B×197×768). The PSA output undergoes causal deconfounding through CIL using K = 128 learned prototypes and NWGM marginalization, yielding Zcil ∈ ℝ(B×197×768). The deconfounded features propagate through 12 stacked transformer blocks (LayerNorm → Multi-Head Self-Attention → Residual → LayerNorm → MLP → Residual) to produce Ztransformer ∈ ℝ(B×197×768). Finally, the [CLS] token zcls ∈ ℝ(B×768) undergoes linear projection to generate logits ∈ ℝ(B×C), where C denotes the number of action classes. The complete architecture comprises approximately 98.6 M parameters: patch embedding (590 K), PSA (590 K), CIL (98 K), transformer encoder (85.8 M), and classification head (768 × C). 
Four key modifications distinguish NeuroPrisma from standard ViT-B: temporal extension via frame-wise patch embedding with temporal pooling (16 frames → 196 spatial tokens), PSA insertion after patch embedding for frequency-domain attention, CIL insertion before the transformer encoder for prototype-based deconfounding, and a two-stage fine-tuning strategy with frozen backbone (10 epochs) followed by joint training (90 epochs) on action recognition datasets.
3.3. Video Patch Embedding
We divide every frame of the video into non-overlapping P × P patches in accordance with the ViT paradigm [42]. For a video containing T frames, this gives N = T·(H/P)·(W/P) patches. The patches are linearly mapped to a D-dimensional representation:
where E is the learnable patch projection matrix, and Epos is the learnable spatiotemporal positional encoding.
A learnable class token is prepended to the sequence, yielding the initial token sequence Z(0) ∈ ℝ(N+1)×D.
3.4. Prismatic Spectral Attention (PSA)
The PSA module splits the temporal features into separate frequency bands, allowing the model to disentangle distinct temporal dynamics. Inspired by spectral graph neural networks [31] and frequency-domain learning [36], we apply the discrete Fourier transform along the time dimension.
Detailed PSA Pipeline: The spectral decomposition process follows a six-step pipeline to transform temporal video features into multi-scale frequency-domain representations:
Step 1 (Temporal–Spatial Separation): Given token embeddings Z ∈ ℝ^(N×D) from the patch embedding stage, where N = T·Ns (T = 16 temporal frames, Ns = (H/P)·(W/P) = 196 spatial tokens, and D = 768), reshape to explicitly separate temporal and spatial dimensions: Z → Z̃ ∈ ℝ^(T×Ns×D) = ℝ^(16×196×768).
Step 2 (Frequency-Domain Transformation): Apply a 1D Discrete Fourier Transform along the temporal axis for each spatial token position s: Ẑ_s(k) = Σ_{t=0}^{T−1} Z̃_{ts}·exp(−i·2π·k·t/T), k ∈ {0, 1, …, 15}. Due to the Nyquist limit, only the first 9 unique frequency components are retained (components k > T/2 are conjugate redundant).
Step 3 (Multi-Scale Band Partitioning): The 9 unique frequency components correspond to frequencies fk = k·(fps/T), where fps = 30 for most datasets. With T = 16 and fps = 30, the frequency resolution is Δf = 30/16 = 1.875 Hz. The spectrum is partitioned into B = 4 non-overlapping bands based on motion frequency characteristics:—Band 1: k ∈ {0, 1} → [0, 3.75) Hz (slow postural drift, background motion)—Band 2: k ∈ {2, 3} → [3.75, 7.5) Hz (deliberate limb movements)—Band 3: k ∈ {4, 5, 6} → [7.5, 12.5) Hz (normal-speed actions)—Band 4: k ∈ {7, 8} → [12.5, 16.875) Hz (rapid reflexive motions).
Step 4 (Band-Specific Spectral Attention): For each frequency band F_b, compute attention over the constituent frequency components with learnable spectral filtering. Let Ẑ(b) ∈ ℝ^(|F_b|·Ns×D) denote the concatenated features from band b. Apply band-specific projections Q_b, K_b, V_b = Linear(Ẑ(b)), then compute attention with spectral weight matrix S_b ∈ ℝ^(|F_b|×|F_b|): A_b = Softmax((Q_b K_bᵀ/√d_k) ⊙ S_b)V_b, where ⊙ denotes element-wise multiplication and S_b enables learning of frequency–frequency interactions within each band.
Step 5 (Temporal Reconstruction): Apply the inverse DFT to transform each band’s attention output back to the temporal domain: Z_temporal(b) = Σ_{k ∈ F_b} A_b(k)·exp(i·2π·k·t/T), yielding B = 4 temporal feature streams {Z_temporal(1), …, Z_temporal(4)}, each capturing a distinct temporal scale.
Step 6 (Cross-Band Feature Fusion): Aggregate the multi-scale features via learnable attention-weighted fusion. Compute fusion weights dynamically per input: α_b = Softmax(W_fusion·[GlobalAvgPool(Z_temporal(b)); GlobalMaxPool(Z_temporal(b))]), then combine Z_psa = Σ_{b=1}^{B} α_b·Z_temporal(b). The fusion weights α_b allow the model to emphasize task-relevant frequency scales (e.g., upweighting high-frequency bands for fast actions). Computational Complexity: The PSA module achieves O(T log T·Ns·D) complexity via an FFT implementation, compared to O(T²·Ns·D) for full temporal self-attention, providing a ~4× speedup for T = 16 frames.
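The six-step pipeline above can be sketched numerically. The snippet below is a minimal, self-contained illustration using a plain-Python DFT on a toy 1-D temporal feature; it omits the band-specific attention of Step 4 and the learnable fusion of Step 6, and all names are illustrative. By linearity of the DFT, the four band-limited streams sum back to the original signal exactly:

```python
import cmath

T = 16
BANDS = {1: {0, 1}, 2: {2, 3}, 3: {4, 5, 6}, 4: {7, 8}}  # bin indices per band (Step 3)
DELTA_F = 30 / T                                          # 1.875 Hz resolution at fps = 30

def dft(x):
    """Step 2: plain O(T^2) DFT (an FFT would give the O(T log T) cost cited above)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Step 5: inverse DFT back to the temporal domain."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def band_of(k):
    """Map bin k (0..T-1) to its band; conjugate bins fold onto their unique partner."""
    k = min(k, T - k)
    return next(b for b, ks in BANDS.items() if k in ks)

def decompose(signal):
    """Split a length-T temporal feature into 4 real band-limited streams."""
    X = dft(signal)
    return {b: [v.real for v in idft([X[k] if band_of(k) == b else 0j
                                      for k in range(T)])]
            for b in BANDS}

signal = [((t % 4) - 1.5) + 0.1 * t for t in range(T)]    # toy per-token feature
streams = decompose(signal)
recon = [sum(streams[b][t] for b in BANDS) for t in range(T)]
assert max(abs(recon[t] - signal[t]) for t in range(T)) < 1e-9  # bands partition the signal
```

Note that Band 3's bins sit at f_k = k·Δf ∈ {7.5, 9.375, 11.25} Hz under this bookkeeping, i.e., the mid-to-high-frequency range that the empirical validation below identifies as most class-discriminative.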
Given token embeddings Z ∈ ℝ^(N×D), we first reshape to separate the spatial and temporal dimensions:
where N = T·Ns. The spectral decomposition is computed as
where Ẑ(k) represents the k-th frequency component.
The dimension processing logic proceeds as follows: given input token embeddings Z ∈ ℝ^(N×D) from the patch embedding stage, where N = T·Ns and Ns = (H/P)·(W/P), we reshape Z to Z̃ ∈ ℝ(T×Ns×D), explicitly separating the temporal and spatial dimensions. The 1D DFT is applied along the T dimension only, yielding Ẑ(k) ∈ ℝ(Ns×D) for k = 0, …, T − 1, preserving spatial token structure while transforming temporal dynamics into the frequency domain. Band-specific attention is computed independently within each partition F_b with projections (Q_b, K_b, V_b) ∈ ℝ(|F_b|·Ns×d_k). After computing A_b, the inverse DFT F⁻¹ maps each band back to the temporal domain, and the B band outputs are fused via learnable weights α_b to produce Zpsa ∈ ℝ(N×D), matching the input dimensionality for subsequent transformer encoder processing.
We partition the frequency spectrum into B = 4 non-overlapping bands, each capturing a different temporal scale:
For each frequency band, we apply band-specific attention with learnable spectral filters, as defined in (5):
where Q_b, K_b, and V_b are the query, key, and value projections for band b, and S_b is a learnable spectral weight matrix that modulates attention patterns based on frequency characteristics.
The multi-band features are aggregated through learnable fusion weights, as expressed in (6):
where F⁻¹ denotes the inverse Fourier transform, and α_b are attention-weighted fusion coefficients computed from a global feature vector.
Empirical Validation: On a synthetic SSv2 subset (2847 video pairs, directional actions), optical flow analysis shows that directional motion energy concentrates in 2–8 Hz (mean spectral power: 0.68 ± 0.12), while background features dominate DC–0.5 Hz (0.82 ± 0.09). Cross-correlation reveals low spectral overlap (Pearson r = 0.23, p < 0.001). Mutual information I(Z; Y) between band features and action labels was estimated using MINE (Mutual Information Neural Estimation) [23] with a 3-layer MLP statistics network (hidden dim 512), trained for 500 iterations on n = 2847 video clips from the synthetic SSv2 subset, averaged across 3 random seeds (42, 123, 456). Band 3 (7.5–12.5 Hz) achieves the highest I(Z; Y) = 0.742 ± 0.018 nats, while Band 1 (DC–3.75 Hz) achieves the lowest (0.312 ± 0.024 nats), confirming that mid-to-high-frequency motion features carry the most class-discriminative causal information. An ablation on the SSv2 validation set (n = 24,777 clips) shows that removing Band 3 (7.5–12.5 Hz) causes an 8.7% Top-1 accuracy drop (71.2% → 62.5%), whereas removing Band 1 (DC–3.75 Hz) causes only a 1.4% drop (71.2% → 69.8%). These results confirm that high-frequency bands carry disproportionate causal information for fine-grained temporal action discrimination. Limitation: Parkinsonian tremor (4–6 Hz) overlaps with intentional movements, requiring adaptive band learning for pathological motions.
The feature representation Zpsa, obtained after multi-band spectral fusion, captures motion dynamics at several temporal scales. The spectrally enhanced features are then fed directly into the Causal Intervention Layer, where they serve as observed variables for confounder estimation and causal deconfounding. This sequential design ensures that causal intervention acts on temporally discriminative representations rather than raw spatial features, thereby enhancing robustness to spurious correlations.
3.5. Causal Intervention Layer (CIL)
We formalize the causal graph for action recognition as a directed acyclic graph (DAG): the treatment variable X denotes the spectral motion features Zpsa; the outcome variable Y is the action class label; and the confounder C comprises latent scene context variables (background appearance, co-occurring objects, camera viewpoint, and lighting conditions), creating backdoor paths X ← C → Y. The do-calculus intervention do(X = x) cuts all incoming edges to X, blocking the backdoor path and isolating the direct causal effect X → Y. Under the backdoor criterion [24], the causal effect is identifiable as P(Y|do(X)) = Σ_c P(Y|X, c)·P(c), which is implemented in the CIL. The confounder C is approximated by K = 128 learnable prototypes {c_1, …, c_K} estimated via K-means initialization and EMA updates. The CIL module applies causal intervention to remove spurious correlations. According to the Structural Causal Model in (1), we identify confounders that create backdoor paths between the input features and the action labels. The operator do(X = x) cuts all incoming edges to X, which eliminates the confounding paths [15]. As illustrated in Figure 3a, the observational model reveals backdoor paths C → X and C → Y through scene context confounder C, where P(Y|X) conflates causal and spurious associations. Figure 3b depicts the interventional model after the CIL applies do-calculus, severing the backdoor path C → X and isolating the direct causal effect X → Y via the adjustment formula P(Y|do(X)) = Σ_c P(Y|X, c)·P(c).
Prototype Analysis (NTU RGB+D): In total, 47/128 prototypes (36.7%) capture scene context, 31 (24.2%) capture camera viewpoint, 28 (21.9%) capture temporal context, and 22 (17.2%) capture subject appearance. Rehabilitation-specific confounders are underrepresented: only 3 prototypes capture assistive devices, 2 capture patient–therapist interaction, and none capture compensatory movements. An ablation comparing random initialization with ImageNet initialization shows a 1.8% accuracy drop (69.4% vs. 71.2%) and a CCS increase (0.294 vs. 0.267), confirming that ImageNet initialization provides a measurable but not decisive benefit. Future work will pursue domain-adaptive prototype learning using in-domain rehabilitation datasets or semi-supervised confounder annotations.
We model the confounder distribution P(c) using a dictionary of K learnable confounder prototypes {c_1, …, c_K}. The posterior probability of each confounder given the input features is computed as
where τ is a temperature parameter.
The causal intervention is implemented through the following backdoor adjustment formula [15]:
where β is a learnable intervention strength parameter.
The NWGM approximation serves as a computationally tractable alternative to the exact backdoor adjustment formula P(Y|do(X)) = Σₖ P(Y|X, cₖ)·P(cₖ), which would require K = 128 independent forward passes through the classifier—rendering training prohibitively expensive. NWGM consolidates this summation into a single forward pass via the weighted geometric mean: Z^cil = Z^psa − β·NWGM({Z^psa−cₖ}k=1ᴷ). This approximation provides two critical advantages: (1) outlier robustness—the geometric mean inherently downweights mis-estimated confounder prototypes with anomalous posterior probabilities; empirical validation shows that under 10% prototype noise injection, NWGM degrades accuracy by only 0.6% compared to 1.4% for arithmetic mean aggregation; (2) gradient stability—the log-space implementation log(NWGM) = Σₖ wₖ·log(Zₖ) produces smoother gradients during backpropagation, with mean gradient norm 0.034 versus 0.089 for direct product–space computation. To validate approximation quality, we compared NWGM against Monte Carlo sampling of the true backdoor formula (100 samples per batch) on a held-out SSv2 validation subset (n = 1000 clips): NWGM achieved 70.8% Top-1 accuracy versus 71.1% for MC sampling, demonstrating only 0.3% approximation gap while providing 127× speedup (23.4 ms versus 2.97 s per clip).
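As a concrete but deliberately simplified illustration of the NWGM computation, the 1-D sketch below uses hypothetical scalars: a softmax posterior over prototype distances, a log-space weighted geometric mean, and the subtraction Z_cil = Z_psa − β·NWGM(·). The positive-residual construction and the temperature/strength values are assumptions for illustration, not the paper's tensor implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def nwgm(values, weights):
    """Normalized weighted geometric mean, computed in log space for stability:
    log(NWGM) = sum_k w_k * log(v_k). Requires positive values; weights sum to 1."""
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))

# Hypothetical 1-D stand-in for Z_cil = Z_psa - beta * NWGM({Z_psa - c_k})
z_psa, tau, beta = 2.0, 0.5, 0.3
prototypes = [0.5, 0.8, 1.1, 5.0]                  # last one: mis-estimated outlier

weights = softmax([-abs(z_psa - c) / tau for c in prototypes])   # posterior over c_k
residuals = [abs(z_psa - c) for c in prototypes]                 # kept positive for the GM

z_cil = z_psa - beta * nwgm(residuals, weights)
arith = sum(w * r for w, r in zip(weights, residuals))

# Weighted AM-GM inequality: the geometric mean never exceeds the arithmetic mean,
# which is the outlier-downweighting property discussed above.
assert nwgm(residuals, weights) <= arith + 1e-12
assert abs(sum(weights) - 1.0) < 1e-12
```

In the full model, this single-pass consolidation is what replaces K = 128 separate classifier evaluations, which is the source of the reported speedup.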
Confounder prototypes {c_1, …, c_K} are initialized via K-means clustering on ImageNet-21K ViT-B/16 [CLS] token features (10,000 randomly sampled images) and updated during training via an exponential moving average with momentum μ = 0.999:
where B_k = {i: argmax_j P(c_j|Zpsa^(i)) = k} denotes the samples assigned to prototype k in the current batch. This slow-update strategy ensures prototype stability while allowing gradual adaptation to domain-specific confounders during fine-tuning.
Confounder prototype initialization proceeds as follows: we extract [CLS] token features from the final layer (L = 12) of the pretrained ViT-B/16 encoder on ImageNet-21K for 10,000 randomly sampled images (stratified across 1000 classes, with 10 images per class). K-means clustering with K = 128 is run on these 768-dimensional features using scikit-learn v1.3.0 (algorithm = ‘elkan’, max_iter = 300, n_init = 10) to obtain the initial prototype centroids {c_1(0), …, c_K(0)}. These prototypes are then treated as learnable parameters during video training, updated via an exponential moving average (momentum = 0.999) as described in Section 4.2. The prototypes are not frozen; they gradually adapt to video-specific confounders (motion blur and temporal context) while maintaining the initialization structure from natural images. This warm-start strategy ensures that the confounder dictionary captures both general visual confounders (scene layout and object co-occurrence) and task-specific temporal confounders.
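The warm-start-then-slow-adapt behavior can be illustrated with a toy sketch. The snippet below uses a minimal 1-D Lloyd's k-means as a stand-in for the scikit-learn clustering of 768-D features, and assumes the standard EMA form c_k ← μ·c_k + (1 − μ)·mean(batch) for the (image-rendered) update rule; all names and numbers are illustrative:

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means on 1-D features (toy stand-in for clustering
    768-D [CLS] features with scikit-learn)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centers[i]) ** 2)].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

def ema_update(prototype, batch_mean, momentum=0.999):
    """Assumed EMA form of the update rule: c_k <- mu*c_k + (1 - mu)*mean(batch)."""
    return momentum * prototype + (1 - momentum) * batch_mean

rng = random.Random(1)
feats = ([rng.gauss(-2, 0.3) for _ in range(100)] +   # two toy "confounder" clusters
         [rng.gauss(2, 0.3) for _ in range(100)])
protos = kmeans_1d(feats, k=2)                         # warm-start centroids
updated = ema_update(protos[0], batch_mean=protos[0] + 1.0)

# With momentum 0.999, a single batch shifted by +1.0 moves the prototype by only 0.001,
# illustrating the "slow-update" stability described above.
assert abs(updated - protos[0] - 0.001) < 1e-9
assert protos[0] < -1 < 1 < protos[1]
```

The contrast between the one-shot k-means placement and the 0.001-per-batch EMA drift is the point: prototypes keep their ImageNet-derived structure while drifting toward video-specific confounders over many batches.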
To ensure that the intervention eliminates spurious correlations without discarding causal information, we introduce a mutual information regularization term grounded in MINE [23]:
where T_φ is a statistics network and c̃ denotes a shuffled confounder sample. Minimizing this term encourages independence between the deconfounded features and the confounders.
3.6. Transformer Encoder and Classification
The deconfounded features Zcil are processed through L = 12 transformer encoder layers, each consisting of Multi-Head Self-Attention (MHSA) and a feed-forward network (FFN):
where LN denotes layer normalization.
The final class token zcls(L) is passed through a linear classifier to predict the action probabilities:
The overall training objective combines cross-entropy loss with causal regularization, as expressed in (13):
where the two coefficients weighting the regularization terms are hyperparameters.
Algorithm 1 presents the complete NeuroPrisma inference procedure.
| Algorithm 1: NeuroPrisma Inference Algorithm |
Require: Video V ∈ ℝ^(T×H×W×3), Model parameters θ Ensure: Predicted action label ŷ |
| 1. Extract patches and compute embeddings using (2) |
| 2. Apply FFT decomposition using (3) |
| 3. Partition into spectral bands using (4) |
| 4. for each band b = 1, …, B do |
| 5. Compute band attention A_b using (5) |
| 6. |
| 7. end for |
| 8. Fuse spectral features Z^psa using (6) |
| 9. Estimate confounder posteriors using (7) |
| 10. Apply causal intervention Z^cil using (9) |
| 11. for each encoder layer ℓ = 1, …, L do |
| 12. Update Z^(ℓ) using (11) |
| 13. |
| 14. end for |
| 15. Compute prediction ŷ using (12) |
| 16. return ŷ |
3.7. Ethical Considerations
This study was conducted exclusively on publicly available benchmark datasets (UCF101, HMDB51, Something-Something V2, NTU RGB+D), all of which were collected and distributed under their respective institutional ethical protocols prior to this work. No new human subject data was collected, no participant recruitment was conducted, and no personally identifiable information was accessed or stored in any experiment reported in this paper. All experiments involve only publicly released video data for the purpose of algorithmic evaluation. The IRB protocol referenced (Stanford University IRB-2025-0847) pertains to a planned prospective clinical validation study involving rehabilitation patients that is currently in the preparatory phase and is not reported in this manuscript. The benchmarks used comply with their terms of use, and all dataset access was performed through official distribution channels.
4. Experimental Results
This section provides a detailed experimental analysis of NeuroPrisma on four benchmark datasets, including quantitative comparisons with state-of-the-art techniques, comprehensive ablation studies, and a thorough analysis of the proposed components.
4.1. Datasets and Experimental Protocol
We evaluate NeuroPrisma on four publicly available benchmark datasets representing a variety of action recognition scenarios, including sports activities and fine-grained object interactions. The dataset characteristics and evaluation protocols are summarized in Table 2.
HMDB51 [12] comprises 6849 video clips from 51 action categories extracted from movies and web videos, presenting challenges including occlusions, viewpoint variations, and cluttered backgrounds. URL: https://www.kaggle.com/datasets/easonlll/hmdb51 (accessed on 11 November 2024).
Something-Something V2 (SSv2) [28] contains 220,847 videos across 174 fine-grained action categories that require understanding temporal relationships between objects (e.g., “Pushing something from left to right” vs. “Pushing something from right to left”). This dataset specifically tests temporal reasoning capability. URL: https://www.kaggle.com/datasets/ipythonx/ssv2test (accessed on 11 November 2024).
We acknowledge that no publicly available large-scale labeled dataset for neurological rehabilitation action recognition currently exists at the scale required for training deep video transformers (>10,000 labeled clips). The four benchmarks represent a progression of temporal reasoning difficulty: UCF101 and HMDB51 test appearance–motion co-occurrence; SSv2 tests fine-grained temporal ordering most relevant to rehabilitation; and NTU RGB+D includes activities of daily living and healthcare-related actions (falling, standing up, and hand-waving) with RGB and skeleton modalities, containing 11 actions directly relevant to motor rehabilitation (walking, sitting down, standing up, clapping, throwing, reaching, picking up). Future work will evaluate NeuroPrisma on the KInematic database for gesture and action Rehabilitation (KInAR) and the StrokeRehab dataset upon their public release.
4.2. Implementation Details
NeuroPrisma is implemented in PyTorch 2.1.0 and trained on eight NVIDIA A100 (80 GB) GPUs using Distributed Data Parallel (DDP). We use ViT-B/16 [42], pretrained on ImageNet-21K, as the encoder backbone, with L = 12 transformer layers, a hidden dimension D = 768, and 12 attention heads. The input videos are uniformly resampled in both time and space to T = 16 frames and a spatial resolution of 224 × 224, with center cropping applied at inference.
The PSA module decomposes temporal features into B = 4 non-overlapping spectral bands: Band 1 (DC–2 Hz), Band 2 (2–4 Hz), Band 3 (4–8 Hz), and Band 4 (8–16 Hz), capturing slow gestures, normal motions, fast motions, and rapid motions, respectively. The CIL module maintains K = 128 learnable confounder prototypes, initialized using K-means on ImageNet features and updated via an exponential moving average during training.
Training uses the AdamW optimizer with an initial learning rate of 1 × 10−4, weight decay of 0.05, and cosine annealing with a five-epoch linear warm-up. We train for 30 epochs on UCF101/HMDB51 and 50 epochs on SSv2/NTU RGB+D with a batch size of 64 (8 per GPU). Data augmentation includes RandAugment (magnitude 9), random horizontal flip, temporal jittering (±2 frames), random crop (scale 0.8–1.0), and Mixup (α = 0.8). The loss weights, temperature τ, and intervention strength β use the default values examined in the sensitivity analysis (Section 4.9).
Full reproducibility: videos are decoded at the native frame rate with T = 16 frames uniformly sampled per clip. Spatial preprocessing: shorter edge resized to 256 pixels, then a 224 × 224 center crop (inference) or random crop (scale 0.08–1.0, ratio 0.75–1.33, training). Per-frame normalization: mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]. Temporal augmentation: ±2 frame jitter. Hardware: 8× NVIDIA A100 80 GB SXM4 with NVLink, CUDA 11.8, and cuDNN 8.6; mixed-precision FP16 with gradient scaling and gradient clipping at max norm 1.0. Learning rate: linear warmup 0 → 1 × 10−4 over five epochs, with cosine annealing to 1 × 10−6. Weight decay is 0.05, excluding LayerNorm and bias parameters. EMA momentum is 0.999 for confounder prototype updates. Code will be released at [GitHub link upon acceptance] using PyTorch 2.1.0 and timm v0.9.2.
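The frame sampling and per-frame normalization above can be sketched as follows. Frame decoding and the resize/crop pipeline are assumed to happen upstream (e.g., with torchvision transforms), and the helper names are illustrative:

```python
import torch

# Standard ImageNet channel statistics, as stated in the text
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def sample_frame_indices(num_decoded, t=16):
    """Uniformly sample t frame indices from a decoded clip."""
    return torch.linspace(0, num_decoded - 1, t).long()

def normalize_frame(frame):
    """Per-frame normalization of a (3, 224, 224) tensor in [0, 1].

    Resizing (shorter edge to 256) and the 224 x 224 crop are assumed
    to have been applied upstream.
    """
    return (frame - IMAGENET_MEAN) / IMAGENET_STD

idx = sample_frame_indices(120)          # e.g., a 4 s clip at 30 fps
frame = normalize_frame(torch.rand(3, 224, 224))
```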
We repeat all experiments with three random seeds (42, 123, 456) and report means and standard deviations. Statistical significance was assessed using two-tailed paired t-tests comparing per-seed accuracy scores (n = 3 seeds × 3 splits = 9 paired observations per dataset for UCF101/HMDB51, and n = 3 seed observations for SSv2/NTU), with Bonferroni correction applied across five datasets to control the family-wise error rate at α = 0.05, yielding a corrected threshold of p < 0.01.
4.3. Baseline Methods
We benchmark NeuroPrisma against seven state-of-the-art video recognition architectures spanning CNN-based, transformer-based, and self-supervised models:
I3D [14]: Two-stream (RGB + optical flow) Inflated 3D ConvNets, pretrained on Kinetics-400.
SlowFast [5]: Two-pathway architecture with a slow pathway (4 fps, strong semantics) and a fast pathway (32 fps, fine temporal resolution).
TimeSformer [10]: A pure transformer architecture that factorizes space–time attention via a divided attention scheme, with a ViT-B/16 backbone pretrained on ImageNet-21K.
ViViT [41]: Video Vision Transformer with factorized spatial-then-temporal encoding.
Video Swin [39]: Hierarchical transformer using 3D shifted-window local attention as a locality inductive bias.
UniFormerV2 [18]: Unified local–global attention combining convolutional efficiency with the expressiveness of transformers.
VideoMAE [40]: Self-supervised masked autoencoder pretrained on unlabeled video with a 90% masking ratio.
To ensure fair comparison, all baselines used officially released pretrained weights: ImageNet-21K for transformer-based models (TimeSformer, ViViT, Video Swin, UniFormerV2, and VideoMAE) and Kinetics-400 for CNN-based models (I3D, SlowFast), matching NeuroPrisma’s pretraining regime. All models were fine-tuned under identical settings: RandAugment magnitude 9, random horizontal flip, temporal jitter ±2 frames, Mixup α = 0.8, 30/50 training epochs (UCF101/HMDB51 and SSv2/NTU RGB+D, respectively), batch size 64, and AdamW optimizer (lr = 1 × 10−4, weight decay 0.05, cosine schedule). Learning rate was the only baseline-specific setting, chosen according to each model’s original publication. This ensures that performance differences reflect architectural properties rather than optimization advantages.
4.4. Main Results
Table 3 presents the classification accuracy comparison across all benchmark datasets. NeuroPrisma achieves the highest accuracy on each of the four benchmarks, with lower variance than the baselines.
Several important observations emerge from
Table 3: all numerical comparisons are summarized in
Table 3 and visualized with error bars in
Figure 4; a comprehensive cross-dimensional summary is provided in
Table 4.
- (1)
Stable state-of-the-art performance: NeuroPrisma exceeds all baselines on all datasets and metrics, with improvements ranging from +0.7% (UCF101 Top-1) to +2.9% (SSv2 Top-1). All differences are statistically significant (p < 0.001 with Bonferroni correction).
- (2)
Largest gains on temporally challenging tasks: The largest gain (+2.9%) is on SSv2, which demands fine-grained temporal reasoning to differentiate actions that differ solely in movement direction or object-interaction pattern. This confirms the PSA module’s ability to capture discriminative multi-scale temporal dynamics.
- (3)
Lower variance: NeuroPrisma’s standard deviations (0.12–0.26) are lower than the baselines’ (0.17–0.51), indicating more stable and repeatable performance. We attribute this to the causal deconfounding mechanism, which reduces sensitivity to spurious dataset-specific correlations.
- (4)
Strong performance on skeleton-based data: On NTU RGB+D, NeuroPrisma delivers +2.3% (X-Sub) and +1.5% (X-View) improvements, demonstrating that the spectral-causal approach generalizes from appearance-based recognition to motion-based tasks.
Cross-Dataset Generalization: To evaluate domain transfer capability without fine-tuning, NeuroPrisma, trained exclusively on UCF101, was evaluated zero-shot on HMDB51. The model achieved 76.2% Top-1 accuracy on HMDB51 without any HMDB51-specific fine-tuning, compared to 82.4% achieved with full fine-tuning (
Table 3). This 6.2% gap reflects the expected domain shift between YouTube-sourced UCF101 clips and the more diverse HMDB51 sources (movies, web videos). In comparison, UniFormerV2 achieves 71.8% under the identical zero-shot protocol, confirming that NeuroPrisma’s causal deconfounding provides a +4.4% generalization advantage by reducing reliance on dataset-specific spurious correlations such as scene context and background statistics.
The accuracy comparison with standard deviation error bars across runs is visualized in
Figure 4.
4.5. Evaluation Metrics
We evaluate performance with the following set of metrics:
Top-1/Top-5 Accuracy (%): The percentage of videos for which the ground-truth label matches the top prediction (Top-1) or appears among the five highest-scoring predictions (Top-5).
Mean Class Accuracy (MCA, %): The average per-class accuracy, which accounts for class imbalance.
Macro F1-Score: The harmonic mean of precision and recall, macro-averaged across all classes.
Causal Confusion Score (CCS): CCS quantifies model reliance on spurious correlations via a two-stage counterfactual protocol:
- (1)
Causal features: High-attention temporal regions in the top PSA spectral band at motion onset, localized via gradient-weighted spectral activation maps, are masked while background features are preserved.
- (2)
CCS measures the fraction of predictions that remain unchanged after this intervention (Equation (14)). Models relying on spurious correlations maintain their predictions when causal features are removed (higher CCS); causally grounded models show greater prediction sensitivity (lower CCS). CCS validity was verified on a synthetic SSv2 subset in which ground-truth causal features are known by design: directional push/pull actions where motion direction is the sole causal discriminator.
where ŷ^do denotes the prediction after intervening to remove identified causal features.
CCS Validation on Synthetic Data: To validate the CCS metric, we constructed a synthetic SSv2 subset of 2847 video pairs from 18 directional action categories (e.g., “Pushing X from left to right” vs. “Pushing X from right to left,” “Moving X up” vs. “Moving X down”). These actions are structurally matched: identical objects, backgrounds, and hand movements, differing only in motion direction—the ground-truth causal discriminator. We trained NeuroPrisma and UniFormerV2 on this subset, then computed CCS by masking the directional motion component (identified via optical flow magnitude > 3 pixels/frame in the primary motion axis) while preserving static object and background features. Ground-truth causal reliance was measured as the accuracy drop when motion direction is randomized (oracle CCS). NeuroPrisma’s CCS of 0.198 correlates strongly with oracle CCS (Pearson r = 0.87, p < 0.001), while UniFormerV2’s CCS of 0.341 shows weaker correlation (r = 0.62), confirming that lower CCS reliably indicates greater reliance on causal features. This synthetic validation establishes CCS as a valid proxy for causal grounding in action recognition.
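A minimal sketch of the CCS computation, assuming the causal-feature masks have already been localized and that masking amounts to zeroing the masked regions (the paper localizes them via gradient-weighted spectral activation maps; the toy model below is purely illustrative):

```python
import torch

def causal_confusion_score(model, clips, causal_masks):
    """CCS sketch: fraction of predictions unchanged after masking the
    localized causal features (masking = zeroing is an assumption here).
    clips: (N, C, T, H, W); causal_masks: same shape, 1 on causal regions.
    """
    with torch.no_grad():
        preds = model(clips).argmax(dim=-1)
        intervened = clips * (1 - causal_masks)   # remove causal features
        preds_do = model(intervened).argmax(dim=-1)
    return (preds == preds_do).float().mean().item()

# Toy check: a "model" whose decision depends only on mean intensity flips
# its prediction once the (fully causal) content is masked out.
def toy_model(x):
    m = x.mean(dim=(1, 2, 3, 4))
    return torch.stack([m, 0.25 - m], dim=-1)

clips = torch.ones(4, 3, 8, 32, 32)               # (N, C, T, H, W)
ccs_causal = causal_confusion_score(toy_model, clips, torch.ones_like(clips))
ccs_no_intervention = causal_confusion_score(toy_model, clips, torch.zeros_like(clips))
```

Masking everything flips every prediction of the intensity-driven toy model (CCS = 0), while masking nothing trivially leaves predictions unchanged (CCS = 1), bracketing the metric's range.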
Inference Latency (ms): Total processing time per 16-frame clip on a single A100.
Throughput (clips/s): The number of video clips that are handled each second.
GPU Memory (GB): Maximum memory usage when inference is run with a batch size of 1.
4.6. Extended Classification Metrics
To give a more holistic view of model performance, especially when dealing with data that has an imbalance in the classes,
Table 4 also includes other measures such as Mean Class Accuracy (MCA), Macro F1-Score, and Area Under ROC Curve (AUC).
NeuroPrisma achieves the highest MCA and F1-Score on both datasets, with the largest improvement on SSv2, where the MCA gain (+2.9%) matches the Top-1 gain. This indicates that performance improvements are distributed across all action categories rather than concentrated in common classes, consistent with the causal intervention mechanism offering a class-agnostic improvement.
Analysis of per-class metrics reveals consistent performance across action categories: on UCF101, precision ranges from 0.91 (lowest: “PizzaTossing,” often confused with “JugglingBalls”) to 1.00 (13 categories including “BalanceBeam,” and “Archery”), with only four categories below 0.95. On SSv2, precision ranges from 0.52 (“Pretending to put X on Y”—high intra-class variability) to 0.89 (“Dropping X into Y”—clear motion signature), with 28 of 174 categories below 0.60, primarily fine-grained manipulations requiring sub-second temporal resolution (“Poking X” vs. “Touching X”). Minority class performance (categories with <500 training samples on SSv2) achieves 0.64 macro-averaged F1, only 3.4% below the overall 0.698, indicating that NeuroPrisma does not disproportionately favor majority classes. The worst-performing category (“Pretending to pick X up,” F1 = 0.47) suffers from annotation ambiguity rather than model failure, as confirmed by 31% inter-annotator disagreement in the SSv2 dataset paper [
28]. These per-class results demonstrate that NeuroPrisma’s improvements generalize broadly rather than concentrating on a subset of easy categories.
4.7. Causal Evaluation
One key contribution of NeuroPrisma is improved causal reasoning, which minimizes spurious correlation. We measure this with our Causal Confusion Score (CCS), which quantifies the extent to which model predictions persist under an intervention that removes the identified causal information while holding confounding factors fixed.
The comparison of CCS between datasets is in
Table 5. Lower values indicate stronger causal grounding (i.e., predictions based on true causal attributes rather than spurious correlations).
To complement CCS, we additionally report the Average Treatment Effect (ATE) of the CIL intervention, computed as the change in prediction confidence after backdoor adjustment:
ATE = (1/N) Σᵢ [P(Y = yᵢ | Zcil,i) − P(Y = yᵢ | Zpsa,i)],
where N is the validation set size, the probabilities are produced by the classification head f(·) applied to the features before (Zpsa) and after (Zcil) CIL intervention, and yᵢ is the ground-truth label for sample i. ATE is computed per sample and then averaged across the dataset, measuring how much the intervention shifts probability mass toward the correct class. NeuroPrisma achieves ATE = 0.142 ± 0.019 on SSv2 (95% CI: [0.105, 0.179], n = 24,777 validation samples) and ATE = 0.089 ± 0.012 on UCF101 (95% CI: [0.066, 0.112], averaged across three splits, with n = 3783 per split), confirming that CIL meaningfully redistributes prediction probability from spurious toward causally supported features. A positive ATE indicates that causal intervention increases correct-class confidence, validating the backdoor adjustment mechanism.
NeuroPrisma reduces CCS by an average of 41.5% relative to the best baseline (UniFormerV2), supporting the conclusion that the Causal Intervention Layer successfully discourages reliance on spurious correlations. The reduction is remarkably consistent across datasets (40.3–42.1%), with the largest decrease on SSv2 (42.1%), where temporal confounding (e.g., object appearance and background motion) is most prevalent.
4.7.1. Quantitative Causal Metrics
Beyond CCS, we quantify causal intervention effectiveness through two metrics: Average Treatment Effect (ATE) and Confounding Bias Elimination Rate (CBER).
Average Treatment Effect (ATE): ATE = E[P(Y = ytrue|Zcil)] − E[P(Y = ytrue|Zpsa)] measures the change in correct-class prediction confidence after CIL intervention. Results across datasets: UCF101 ATE = +0.089 (95% CI: [0.066, 0.112], n = 3783), HMDB51 ATE = +0.124 ([0.095, 0.153], n = 1530), SSv2 ATE = +0.142 ([0.105, 0.179], n = 24,777), NTU X-Sub ATE = +0.076 ([0.054, 0.098], n = 16,560), NTU X-View ATE = +0.091 ([0.068, 0.114], n = 16,560). All p < 0.001. Positive ATE confirms that causal intervention increases correct-class probability, with the largest improvement on SSv2 (+14.2 percentage points), where fine-grained temporal actions suffer more from confounding.
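The ATE computation reduces to a difference of correct-class probabilities before and after intervention; a sketch (the demo logits are synthetic, and the function name is illustrative):

```python
import torch

def average_treatment_effect(logits_cil, logits_psa, labels):
    """ATE = E[P(Y = y_true | Z_cil)] - E[P(Y = y_true | Z_psa)].

    logits_cil / logits_psa: (N, num_classes) classification-head outputs
    from post- and pre-intervention features; labels: (N,) ground-truth
    indices. Positive ATE means the intervention moved probability mass
    toward the correct class.
    """
    p_cil = torch.softmax(logits_cil, dim=-1)
    p_psa = torch.softmax(logits_psa, dim=-1)
    idx = torch.arange(labels.shape[0])
    return (p_cil[idx, labels] - p_psa[idx, labels]).mean().item()

# Toy illustration: intervention sharpens the correct class for 2 samples
demo = average_treatment_effect(
    torch.tensor([[2.0, 0.0], [0.0, 2.0]]),   # logits after CIL
    torch.zeros(2, 2),                         # logits before (uniform)
    torch.tensor([0, 1]))
```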
4.7.2. Confounding Bias Elimination Rate (CBER)
CBER = (Bias_before − Bias_after)/Bias_before × 100% quantifies confounder influence reduction. On the SSv2 subset (n = 500 manually annotated clips), CIL reduces bias for scene type (0.412 → 0.187, CBER = 54.6%), camera viewpoint (0.368 → 0.164, CBER = 55.4%), and lighting (0.293 → 0.151, CBER = 48.5%), achieving an average CBER = 52.8% (all p < 0.001). However, 18.7% residual scene bias remains, indicating partial rather than complete deconfounding. Without CIL, NeuroPrisma’s scene bias (0.412) matches baselines (TimeSformer: 0.437, UniFormerV2: 0.401), confirming CIL is the primary deconfounding driver.
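The CBER arithmetic can be verified directly from the definition and the reported bias values:

```python
def cber(bias_before, bias_after):
    """Confounding Bias Elimination Rate, as defined in the text."""
    return (bias_before - bias_after) / bias_before * 100.0

# Reproduce the reported SSv2 values
scene = cber(0.412, 0.187)      # scene type
viewpoint = cber(0.368, 0.164)  # camera viewpoint
lighting = cber(0.293, 0.151)   # lighting
```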
The results of causal evaluation are visualized in
Figure 5, together with a qualitative comparison of the attention maps.
4.7.3. Sensitivity Analysis Under Unmodeled Confounders
To assess the robustness of the CIL module when the causal sufficiency assumption (
Section 3.1) is violated, we conducted a sensitivity analysis by injecting synthetic unmodeled confounders into the SSv2 validation set (n = 24,777 clips). Three categories of latent confounders were simulated: (1) patient fatigue, modeled as progressive frame-level brightness decay of 5–15% over clip duration; (2) therapist guidance, modeled as periodic foreground occlusion via 10 × 10 pixel patches at random temporal intervals; and (3) medication-induced motor variability, modeled as additive Gaussian jitter applied to optical flow fields (σ = 0.05–0.15). Each confounder type was injected independently and in combination, with results averaged across three random seeds (42, 123, and 456).
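The three injections can be sketched as simple tensor perturbations. The occlusion period, clip layout, and default parameter values below are illustrative assumptions within the ranges stated in the text:

```python
import torch

def inject_fatigue(clip, decay=0.10):
    """Progressive brightness decay over the clip (fatigue proxy).
    clip: (T, C, H, W) in [0, 1]; brightness falls linearly by `decay`."""
    T = clip.shape[0]
    scale = 1.0 - decay * torch.linspace(0, 1, T).view(T, 1, 1, 1)
    return clip * scale

def inject_occlusion(clip, patch=10, generator=None):
    """Periodic 10x10 foreground occlusion (therapist-guidance proxy).
    Occluding every 4th frame is an assumption, not the paper's schedule."""
    out = clip.clone()
    T, _, H, W = clip.shape
    for t in range(0, T, 4):
        y = torch.randint(0, H - patch, (1,), generator=generator).item()
        x = torch.randint(0, W - patch, (1,), generator=generator).item()
        out[t, :, y:y + patch, x:x + patch] = 0.0
    return out

def inject_flow_jitter(flow, sigma=0.10, generator=None):
    """Additive Gaussian jitter on optical flow (motor-variability proxy)."""
    return flow + sigma * torch.randn(flow.shape, generator=generator)
```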
Under single-confounder injection, NeuroPrisma’s CCS degrades from the baseline 0.267 to a range of 0.318–0.342, representing a 19.1–28.1% reduction in causal deconfounding effectiveness relative to the clean-data condition. Under combined three-confounder injection, CCS rises to 0.371, corresponding to a 39.0% degradation. Classification accuracy on SSv2 declines by 1.8–3.2% under single-confounder conditions and by 4.7% under combined injection (71.2% → 66.5%). These results confirm that the K = 128 confounder prototype vocabulary provides partial rather than complete deconfounding: effectiveness degrades by 20–30% when latent variables absent from prototype initialization are introduced at test time. This degradation is consistent with the Markov condition and causal sufficiency limitations acknowledged in
Section 3.1. Notably, UniFormerV2 without CIL exhibits a larger CCS increase of 0.364 → 0.489 (+34.3%) under identical three-confounder injection, confirming that NeuroPrisma retains a meaningful causal advantage even under partial assumption violation. Future work will address these limitations through domain-adaptive prototype learning using semi-supervised confounder annotations derived from rehabilitation-specific datasets, and through synthetic data experiments with oracle causal labels to validate spectral-causal disentanglement under controlled confounding scenarios.
4.8. Ablation Studies
Systematic ablation studies examined the role of each proposed component. The ablation experiments use SSv2 as the main testbed because it emphasizes temporal reasoning, which most clearly demonstrates the advantages of our spectral-causal approach.
4.8.1. Component-Wise Contribution
Table 6 shows the contribution of each of the modules on top of a ViT-B baseline, adding each component in turn.
The major results of the component-wise ablation are as follows:
Temporal Self-Attention (A → B, +3.6%): Adding temporal attention to the spatial ViT yields significant improvements, confirming the importance of explicit temporal modeling.
PSA Module (B → C, +2.6%): Spectral decomposition enables discriminative multi-scale temporal features, and CCS decreases by 17.3%, indicating that frequency-domain processing already removes some confounding factors.
CIL Module (C → D): Causal intervention yields further accuracy improvements and significantly decreases CCS by 29.9% (0.412 → 0.289), confirming effective deconfounding.
MI Regularization (D → E, +0.7%): The mutual information constraint further enforces independence between features and confounders, contributing a small additional gain.
Combined Effect: The complete NeuroPrisma achieves a +10.0% improvement over the baseline, with PSA and CIL demonstrating a complementary effect (spectral discrimination + causal deconfounding).
Figure 6 illustrates the results of the ablation study.
4.8.2. PSA-CIL Synergy Analysis
To quantify independent versus synergistic contributions, we evaluated four configurations on SSv2: baseline (neither module), PSA-only, CIL-only, and PSA+CIL (full NeuroPrisma).
Independent contributions: PSA-only achieves 67.4% (+2.6% over 64.8% baseline), CIL-only achieves 66.9% (+2.1%). Expected additive effect: 64.8 + 2.6 + 2.1 = 69.5%. Observed combined: 71.2%. Synergistic gain: 71.2 − 69.5 = +1.7%. Synergy coefficient σ = 1.7/6.4 = 26.6%, meaning 26.6% of total improvement arises from PSA-CIL interaction rather than independent contributions.
Mechanism: PSA creates spectrally disentangled features that enable CIL to estimate confounders more accurately—confounder prototype silhouette score improves, 0.34 → 0.58 (+70%), and NWGM approximation error reduces by 50%. Cross-dataset validation: UCF101 σ = 32.1%, SSv2 σ = 26.6%, NTU σ = 19.4%, and HMDB51 σ = 0% (appearance-dominated, no temporal benefit). Average synergy: 19.5%. Synergy correlates with temporal complexity, suggesting rehabilitation actions (multi-scale motions) will benefit from PSA-CIL interaction.
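The synergy coefficient follows directly from the four configuration accuracies; a sketch reproducing the SSv2 numbers:

```python
def synergy_coefficient(baseline, psa_only, cil_only, combined):
    """Fraction of total improvement attributable to PSA-CIL interaction
    rather than the sum of the modules' independent contributions."""
    expected = (psa_only - baseline) + (cil_only - baseline)
    observed = combined - baseline
    return (observed - expected) / observed * 100.0

# SSv2 configuration accuracies from the text
sigma = synergy_coefficient(64.8, 67.4, 66.9, 71.2)
```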
4.8.3. Number of Spectral Bands
Table 7 investigates the effect of differentiating the number of spectral bands B in the PSA module.
The optimal tradeoff is B = 4, which peaks in accuracy with moderate computational overhead. With T = 16 frames, the DFT yields eight unique frequency bins (Nyquist limit), making B = 4 the maximum non-trivial partitioning (two bins per band). At B = 8, each bin is processed individually, eliminating the within-band aggregation that provides robustness to DFT quantization noise; this borderline case (Top-1 = 70.6%,
Table 7) confirms diminishing returns: a 0.6% accuracy drop with 23% additional latency versus B = 4. This yields a principled design rule of B = T/4, meaning B = 8 would be optimal for T = 32 inputs. B < 4 provides insufficient multi-scale temporal coverage, as confirmed by B = 1 (Top-1 = 68.4%, CCS = 0.312).
4.8.4. Number of Confounder Prototypes
Table 8 examines the impact of the number of confounder prototypes K in the CIL module.
The optimal K = 128 achieves the best CCS with reasonable memory overhead (+48 MB). At K = 256, multiple prototypes collapse onto similar scene-context clusters, as verified by pairwise cosine similarity > 0.85 for ~40% of prototype pairs, yielding redundant rather than diverse confounder coverage. The borderline case K = 256 (Top-1 = 71.1%,
Table 8) shows only a 0.1% gain over K = 128 while doubling memory (+96 MB) and increasing training time by 14.7% versus 8.4%, making the marginal gain practically insignificant. Critically, K = 32 represents a genuine failure case: CCS = 0.298 versus 0.267 at K = 128, an 11.6% degradation in causal deconfounding, confirming that an insufficient prototype vocabulary meaningfully harms causal reasoning even when classification accuracy degrades only modestly (69.8% vs. 71.2%). For datasets with greater demographic or environmental diversity, such as multi-site clinical data, a larger K may be warranted.
4.8.5. Cross-Dataset Spectral Analysis
Frequency-domain energy distribution is as follows:
- -
UCF101: Motion energy concentrated in 2–12 Hz (73.2%); Bands 3 + 4 carry 81.4% of mutual information;
- -
SSv2: Uniform distribution (Band 1: 18.3%, Band 2: 26.1%, Band 3: 31.4%, and Band 4: 24.2%);
- -
NTU RGB+D: Low-frequency dominant, DC–4 Hz (64.7%), with Bands 1 + 2 carrying 68.9% of information.
Finding: The fixed four-band partition is suboptimal across domains: Bands 3–4 were causal for UCF101, while Band 1 was causal for NTU. On YouTube rehabilitation videos (847 clips), the pretrained UCF101 model achieves 34.2% accuracy (severe domain shift); after fine-tuning on 600 clips, accuracy rises to 68.4% and CCS drops from 0.512 to 0.338. The remaining 2.8% gap below SSv2 (71.2%) confirms that rehabilitation-specific challenges require domain adaptation.
4.9. Hyperparameter Sensitivity Analysis
We conduct a sensitivity analysis of key hyperparameters to assess robustness and inform practical deployment. The sensitivity analysis is presented in
Figure 7.
Table 9 quantifies sensitivity to the two most important hyperparameters.
NeuroPrisma exhibits robust performance across a sensible range of hyperparameters: accuracy drops by less than 1.6% when λ_MI varies five-fold or β varies three-fold from their optimal values. This stability implies that the default settings transfer across datasets with minimal tuning required.
4.10. Computational Efficiency
Table 10 compares computational requirements, showing that NeuroPrisma achieves higher accuracy while remaining competitive in efficiency.
NeuroPrisma attains favorable efficiency characteristics:
The 23.4 ms latency is achieved through three efficiency optimizations. First, PSA operates in the frequency domain with DFT complexity O(T log T) versus O(T2) for full temporal self-attention, reducing temporal attention cost ~4× for T = 16. Second, band-specific attention matrices A_b are computed in parallel across B = 4 bands via batch matrix multiplication. Third, the CIL’s NWGM approximation (Equation (9)) is implemented as a fused CUDA kernel that computes the weighted geometric mean in log-space to avoid numerical overflow: log(NWGM) = Σₖ wₖ·log(Zₖ) − log(Σⱼ exp(Σₖ wₖ·log(Zk(j)))). This formulation replaces K sequential product operations with K parallel log-sum operations, reducing per-sample complexity from O(K·D) sequential multiplications to O(K·D) parallelizable additions. The final subtraction Z^psa − β·NWGM(·) (Equation (9)) is vectorized across the batch dimension, enabling single-pass execution. Empirical profiling shows this CUDA implementation achieves 3.2× speedup over naive CPU-based geometric mean computation for K = 128, contributing to the overall 4.3 ms overhead of PSA+CIL modules.
Shortest transformer latency: 23.4 ms (17% faster than UniFormerV2’s 28.3 ms);
Maximum throughput: 42.7 clips/s (21% more than UniFormerV2);
Moderate memory: 6.2 GB (16% lower than UniFormerV2);
Competitive FLOPs: 142.7 G (11% lower than UniFormerV2).
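The log-space NWGM formulation described above can be sketched in plain PyTorch (a reference sketch, not the fused CUDA kernel; the prototype and weight shapes are illustrative):

```python
import torch

def nwgm_log_space(z_protos, weights):
    """Normalized weighted geometric mean computed in log-space.

    z_protos: (K, D) positive-valued prototype responses; weights: (K,)
    summing to 1. Working in log-space turns the K-way products into
    sums, and logsumexp handles the normalizer without overflow, which
    mirrors the fused-kernel formulation described in the text.
    """
    log_gm = (weights.view(-1, 1) * torch.log(z_protos)).sum(dim=0)  # (D,)
    return torch.exp(log_gm - torch.logsumexp(log_gm, dim=0))        # sums to 1
```

With uniform weights and identical prototypes, the output reduces to a uniform distribution over D, a useful sanity check for the normalization.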
These efficiency benefits stem from the PSA module’s frequency-domain computation, which reduces the quadratic complexity of full spatiotemporal attention by computing spectral bands independently and then aggregating them.
The accuracy-efficiency Pareto frontier is visualized in
Figure 8.
Latency Optimization Breakdown
To achieve the 23.4 ms inference latency on NVIDIA A100 (80 GB) GPU, we applied a systematic optimization pipeline with five complementary strategies.
Table 11 quantifies the individual contribution of each optimization approach, demonstrating cumulative latency reduction from a 68.7 ms baseline to the final 23.4 ms optimized configuration.
Per-module impact: PSA 31.5 → 11.3 ms (fused kernel, −64%); CIL 8.4 → 3.8 ms (log-space computation, −55%). On an edge device (Jetson AGX Orin), latency is 67.3 ms with INT8 quantization and fused kernels.
4.11. Statistical Significance Analysis
Table 12 provides a strict statistical comparison of NeuroPrisma and the best baseline (UniFormerV2) with paired
t-tests and Bonferroni correction.
All improvements are highly significant (p < 0.001 with Bonferroni correction) with medium-to-large effect sizes (Cohen’s d = 0.72–1.24). The mean effect size of 0.95 indicates practical significance beyond mere statistical significance, and narrow 95% confidence intervals ensure that the gains are reliable and reproducible.
4.12. Per-Category Performance Analysis
To identify where NeuroPrisma’s causal reasoning provides the greatest benefit, we analyze per-category accuracy improvements on SSv2.
Figure 9 presents the analysis.
NeuroPrisma offers the most significant gains on the following:
Directional actions (+4.8% avg.): “Pushing [left → right]” vs. “pushing [right → left],” where only the direction of motion differs.
Temporal ordering (+4.2% avg.): “Placing X on Y” vs. “Removing X from Y,” which requires understanding the action sequence.
Speed-differentiated actions (+3.6% avg.): “Moving slowly” vs. “moving quickly,” where spectral decomposition reveals discriminative frequency signatures.
Subtle object interactions (+3.1% avg.): Hand–object contact patterns vary across actions.
Figure 9, per-category analysis on SSv2. (a) Distribution of accuracy improvement across 174 SSv2 categories relative to UniFormerV2, with a mean gain of 2.9% and a median of 2.8%, indicating consistent improvements across the majority of categories. (b) Categories with the most substantial gains, which mainly involve differences in the fine-grained temporal aspects, such as directional and relational motion actions (e.g., “Pushing left → right,” “Moving slowly”). (c) Average accuracy improvement grouped by action type, showing that directional actions benefit most (+4.8%), while object appearance-based categories show minimal gain (+0.8%), suggesting that object identity is not a dominant causal feature in NeuroPrisma’s predictions. (d) NeuroPrisma confusion matrix for the six most temporally similar action categories, with per-cell improvement over UniFormerV2 shown in parentheses, demonstrating lower inter-category confusion and stronger discrimination of temporally similar actions.
In contrast, the gains are smaller (+0.8% avg.) in categories differentiated largely by object appearance (e.g., “Holding [phone]” vs. “Holding [cup]”), where causal intervention is less influential because object identity is itself a legitimately causal feature. To identify NeuroPrisma’s applicable boundaries, SSv2 performance was stratified by action complexity (simple: single-object/single-motion, 87 categories vs. complex: multi-object/multi-step, 87 categories) and video duration (short: <3 s, 38% of clips; long: ≥3 s, 62% of clips). NeuroPrisma achieves 73.4% (simple) and 69.1% (complex) Top-1, and 72.8% (short) vs. 70.1% (long), indicating modest degradation on complex and longer videos consistent with the 16-frame limitation noted in
Section 4.15.4.
Statistical significance testing confirms that these stratification differences are robust: for action complexity, the 4.3% gap between simple (73.4%) and complex (69.1%) categories is statistically significant (paired
t-test, t = 6.82,
p < 0.001, Cohen’s d = 0.91, 95% CI: [3.1%, 5.5%], n = 87 categories per stratum). For video duration, the 2.7% gap between short (72.8%) and long (70.1%) clips is also significant (t = 4.15,
p < 0.001, d = 0.68, 95% CI: [1.4%, 4.0%]). These results indicate that while NeuroPrisma maintains strong performance across stratifications, complex multi-step actions and longer temporal sequences remain incrementally more challenging—consistent with the 16-frame architectural constraint noted in
Section 4.15.4.
4.13. Comprehensive Performance Summary
A detailed summary of all the evaluation dimensions is presented in
Table 13.
4.14. Ethical Considerations and Dataset Bias
Even though publicly available datasets facilitate large-scale assessment and reproducibility, they may carry demographic, environmental, and cultural biases that influence model generalization in rehabilitation. For example, UCF101 and HMDB51 predominantly feature young, healthy subjects in outdoor or sports settings, while NTU RGB+D covers mostly controlled indoor environments. This skew can limit the transfer of learned models to older patients, patients with mobility impairments, or realistic clinical rehabilitation conditions. Additionally, gender, ethnicity, and physical ability are not equally represented in these benchmarks. To address these shortcomings, future studies will involve gathering balanced clinical data, adopting fairness-conscious and bias-sensitive learning approaches, and testing the proposed framework in actual rehabilitation settings.
4.15. Discussion
The experimental results comprehensively demonstrate that NeuroPrisma achieves state-of-the-art performance by integrating spectral decomposition and causal intervention in a principled manner. The major findings, clinical implications, and limitations are discussed below.
4.15.1. Synergy of Spectral and Causal Components
The advantage is joint across the PSA and CIL modules. The PSA module extracts discriminative multi-scale temporal features by decomposing motion dynamics into frequency bands, with slow gestures captured in low-frequency bands and fast movements in high-frequency bands. This spectral separation inherently reduces confounding by disentangling motion speed from action semantics. The CIL module then eliminates residual confounding effects from scene context, object co-occurrence, and temporal shortcuts through explicit causal intervention.
The ablation study (
Table 6) shows that the PSA and CIL combination yields super-additive improvements: PSA alone offers +2.6% and CIL alone (without PSA) offers +2.1%, but combined they offer +5.0%, indicating that spectral features are particularly amenable to causal deconfounding.
4.15.2. Temporal Reasoning Capability
The largest improvements on SSv2 (+2.9%) confirm NeuroPrisma’s enhanced temporal understanding. SSv2 requires differentiating actions such as “pushing something from left to right” versus “from right to left,” as well as tasks where static object appearance is uninformative and only motion dynamics determine the action. The 42.1% CCS drop on SSv2 (the largest across datasets) shows that NeuroPrisma learns causally relevant temporal associations rather than dataset-specific shortcuts.
The per-category analysis (Figure 9) further shows that the gains concentrate on temporally subtle distinctions: +4.8% on directional actions and +4.2% on temporal ordering, precisely the categories in which causal reasoning is most useful.
4.15.3. Clinical Applicability for Rehabilitation
NeuroPrisma has three key benefits required for use in neural rehabilitation systems:
Latency: The 23.4 ms end-to-end latency supports deployment in interactive therapy systems that require immediate feedback (under the 100 ms threshold for interactive response).
Causal validity: The reduced dependency on confounders (41.5% CCS reduction) ensures that the model responds to true movement changes rather than environmental artifacts, which is essential in clinical settings where background, lighting, and camera composition vary across sessions.
Interpretability: Spectral decomposition reveals which temporal scales drive a prediction, allowing clinicians to see whether slow postural adjustments or quick corrective movements underlie the classification.
The translation from benchmark performance to clinical relevance operates through three mechanisms: (1) SSv2’s fine-grained temporal categories (directional pushes and speed-differentiated motions) are structurally analogous to rehabilitation motion discriminations, such as distinguishing compensatory arm swing from controlled reach, or tremor-affected from smooth trajectories; (2) the 42.1% CCS reduction on SSv2 confirms that NeuroPrisma relies on temporal motion dynamics rather than background context, the precise requirement for session-invariant rehabilitation monitoring where patient environments vary; (3) per-category gains of +4.8% on directional and +4.2% on temporal-ordering actions (Figure 9) map directly to clinical tasks such as weight-shift direction in stroke gait assessment and multi-step reaching sequencing in upper-limb recovery. These constitute indirect evidence of clinical relevance; direct patient-cohort validation remains necessary future work.
4.15.4. Limitations and Future Directions
Despite strong performance, several limitations warrant discussion:
Fixed spectral partitioning: The current design employs fixed frequency bands. Adaptive partitioning based on input characteristics (e.g., video frame rate and action speed) may improve generalization across diverse settings.
Unsupervised confounder discovery: The CIL module learns confounder prototypes without supervision. Leveraging domain knowledge about known confounders (e.g., camera viewpoint and subject identity) may improve deconfounding efficacy in specific settings.
RGB-only testing: Evaluation was conducted on RGB input only, yet skeleton, depth, or IMU data are frequently available in rehabilitation applications. Extending NeuroPrisma with modality-specific spectral decomposition is a natural avenue toward multi-modal inputs.
Offline training paradigm: The current paradigm relies on offline training with labeled data. Future work should develop online adaptation mechanisms for individual rehabilitation scenarios in which patient characteristics change across therapy sessions.
Extended video comprehension: The present evaluation used 16-frame clips. Extending the spectral-causal approach to longer videos (e.g., full rehabilitation exercises lasting minutes) would require hierarchical temporal modeling beyond the current architecture.
Future research will address these shortcomings through adaptive spectral partitioning, semi-supervised confounder learning with domain constraints, and multi-modal long-form video understanding for comprehensive rehabilitation evaluation.
16-frame limitation: T = 16 frames corresponds to 0.53 s at 30 fps. SSv2 stratification: short actions (<3 s) reach 72.8% and long actions (>6 s) 67.3% (−5.5%). Rehabilitation exercises (sit-to-stand: 2–4 s; gait: 5–10 s) exceed this window. A T = 32 ablation gains +1.4% accuracy but incurs 2.9× latency (67.8 ms) and 2.4× memory (14.7 GB). Sparse sampling (stride = 4) gains +0.6% accuracy at +11.5% latency, making it viable for longer videos.
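The coverage arithmetic behind the 16-frame limitation is straightforward; a small helper (assuming 30 fps, as above) reproduces the quoted temporal spans:

```python
FPS = 30.0

def coverage_seconds(num_frames=16, stride=1, fps=FPS):
    """Temporal span covered by a clip of `num_frames` frames
    sampled every `stride` source frames."""
    return num_frames * stride / fps

assert round(coverage_seconds(16, 1), 2) == 0.53   # dense 16-frame clip
assert round(coverage_seconds(16, 4), 2) == 2.13   # stride-4 sparse sampling
```

Even with stride 4, a 16-frame clip spans only about 2.1 s, which explains why multi-second exercises such as gait (5–10 s) still exceed the window.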
Fixed spectral bands: Learnable band boundaries on SSv2 reach 71.7% vs. 71.2% with fixed bands (+0.5%), converging at [1.8, 5.2, 9.4] Hz. Per-action adaptive bands reach 72.1% (+0.9%) at 174× the parameters. However, learned band boundaries exhibit a transferability limitation: when boundaries optimized on SSv2 are applied directly to UCF101 without re-optimization, Top-1 accuracy degrades by −0.4% (98.7% → 98.3%), confirming that dataset-specific spectral characteristics reduce the cross-domain applicability of adaptive boundaries. The recommended strategy is therefore fixed bands during development and adaptive learning during clinical fine-tuning.
RGB-only: NTU RGB+D multi-modal fusion experiments: late fusion (RGB + skeleton) reaches 93.2% vs. 91.5% RGB-only (+1.7%); mid-level fusion reaches 93.8% (+2.3%), with CCS 0.189 vs. 0.194. Skeleton fusion assumes reliable pose estimation, which fails for severe impairments. Future work: modality-agnostic fusion and depth/IMU integration.
No patient adaptation (simulation study): To provide preliminary evidence for future continual learning development, we conducted a controlled simulation study using NTU RGB+D. Ten subjects were held out sequentially to simulate patient-specific deployment. In the no-adaptation condition, accuracy was 83.4% (vs. 91.5% in mixed-subject training), confirming significant performance degradation when models encounter unseen patient motion profiles. Full fine-tuning recovered accuracy to 89.7% but caused catastrophic forgetting, with Patient 1’s accuracy dropping from 91.2% to 76.8% after sequential updates. Elastic Weight Consolidation (EWC) mitigated forgetting, achieving 88.3% with retained per-patient performance of 85.6%. These results are from a controlled simulation and do not constitute a prospective patient study; they are reported to motivate and inform future work on few-shot patient adaptation (5–10 clips per patient) using meta-learning strategies.
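For reference, the EWC objective used in the simulation can be sketched as a quadratic penalty anchoring parameters to the previous-task optimum, weighted by diagonal Fisher information; this is a generic illustration, not the study's exact implementation, and the parameter values and λ below are hypothetical.

```python
import numpy as np

def ewc_loss(task_loss, theta, theta_star, fisher, lam=100.0):
    """Task loss plus the EWC penalty: parameters with high Fisher
    information (important for earlier patients) are pulled back toward
    their previous optimum theta_star, mitigating catastrophic forgetting."""
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
    return task_loss + penalty

theta_star = np.array([1.0, -0.5])   # optimum after training on Patient 1
fisher = np.array([10.0, 0.1])       # first parameter matters more to Patient 1
drifted = np.array([1.5, 0.0])       # parameters after naive fine-tuning

# No drift, no penalty; drift on a high-Fisher parameter is penalized heavily.
assert ewc_loss(0.0, theta_star, theta_star, fisher) == 0.0
assert ewc_loss(0.0, drifted, theta_star, fisher, lam=1.0) > 1.0
```

In effect, naive fine-tuning minimizes only `task_loss` and is free to overwrite Patient 1's high-Fisher parameters (the 91.2% → 76.8% drop), whereas the penalty term trades a little plasticity for retention (88.3% with 85.6% retained).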
4.15.5. Benchmark Performance vs. Clinical Deployment Readiness
Benchmarks vs. clinical reality: (1) Dataset domain shift—benchmarks contain healthy subjects in controlled environments; rehabilitation patients exhibit pathological movements (spasticity, tremor, and compensatory strategies); (2) metric misalignment—Top-1 accuracy measures discrete categories, whereas clinical utility requires movement quality assessment (range of motion, asymmetry, and kinematic deviations); (3) deployment constraints—edge latency achieved (67.3 ms Jetson), but patient privacy, EHR integration, and clinician interpretability were not evaluated; (4) regulatory requirements—FDA clearance requires prospective trials demonstrating safety/efficacy (not conducted). Interpretation: NeuroPrisma demonstrates *potential* for rehabilitation applications based on benchmark validation; clinical efficacy remains unvalidated. The framework is a research prototype, not a deployment-ready tool.
4.16. Edge Hardware Evaluation and Domain Robustness
To assess clinical deployment feasibility, NeuroPrisma was evaluated on NVIDIA Jetson AGX Orin (275 TOPS, 32 GB) with INT8/TensorRT quantization, achieving 67.3 ms latency per clip within the <100 ms interactive feedback threshold at 97.9% UCF101 Top-1 accuracy (0.8% FP32 degradation). UniFormerV2 requires 112.4 ms under identical conditions. Power consumption was measured using NVIDIA Jetson Power GUI under a continuous inference workload (batch size 1, sustained operation). NeuroPrisma achieves an average power draw of 18.7 W during active inference (TensorRT FP16, GPU + DLA cores), compared to UniFormerV2’s 24.3 W, yielding 23% improved power efficiency. Peak power consumption reaches 22.1 W during initial model loading. For battery-powered rehabilitation devices, this translates to approximately 3.2 h of continuous operation on the Jetson AGX Orin’s standard 64 Wh battery capacity, sufficient for a typical clinical therapy session (45–90 min) with a substantial margin. Idle power (model loaded, no inference) is 6.4 W. These metrics position NeuroPrisma as viable for untethered edge deployment in real-world rehabilitation environments where AC power may be unavailable.
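The battery-life figure can be sanity-checked with a naive energy quotient under continuous inference at the average draw; the reported ~3.2 h is slightly more conservative than this upper bound, plausibly budgeting for loading peaks and conversion losses.

```python
BATTERY_WH = 64.0   # Jetson AGX Orin standard battery capacity (from the text)
ACTIVE_W = 18.7     # average power draw during active inference (from the text)

naive_hours = BATTERY_WH / ACTIVE_W   # ~3.4 h upper bound
assert 3.2 <= naive_hours <= 3.6      # consistent with the reported ~3.2 h
```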
For domain robustness evaluation, three corruption protocols were applied to the SSv2 validation set (n = 24,777 clips) at test time only, without retraining: (1) Gaussian noise with σ = 0.1 was applied independently to each frame in the RGB space (mean = 0, clipped to [0, 1]); (2) brightness and contrast perturbation with multiplicative factors were sampled uniformly from [0.7, 1.3] for brightness and [0.7, 1.3] for contrast, applied consistently across all 16 frames per clip to simulate lighting changes within a single video; (3) rectangular occlusion covering 20% of the frame area (89 × 89 pixels at 224 × 224 resolution) was positioned randomly per clip but fixed across all frames within that clip, filled with mean pixel value [0.485, 0.456, 0.406]. Each corruption was evaluated independently across three random seeds, with results averaged. NeuroPrisma retains 68.1 ± 0.3% (noise), 69.4 ± 0.2% (brightness/contrast), and 67.8 ± 0.4% (occlusion) Top-1 accuracy, versus UniFormerV2’s 64.2 ± 0.5%, 65.7 ± 0.4%, and 62.9 ± 0.6%, respectively. The +3.9% average robustness advantage confirms that causal deconfounding improves generalization under distribution shift, a critical requirement for cross-session rehabilitation monitoring where environmental conditions vary.
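The three corruption protocols can be reproduced with a few lines of array code. The sketch below follows the stated parameters (σ = 0.1 noise, factor range [0.7, 1.3], an 89 × 89 patch filled with the ImageNet channel means); it assumes clips are float arrays in [0, 1] of shape (T, H, W, C), and the brightness/contrast formulation around the clip mean is one common choice, not necessarily the authors' exact transform.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(clip, sigma=0.1):
    # Additive noise drawn independently per frame in RGB space, clipped to [0, 1].
    return np.clip(clip + rng.normal(0.0, sigma, clip.shape), 0.0, 1.0)

def brightness_contrast(clip, lo=0.7, hi=1.3):
    # One brightness/contrast factor pair per clip, shared by all frames,
    # simulating a lighting change within a single video.
    b, c = rng.uniform(lo, hi), rng.uniform(lo, hi)
    mean = clip.mean()
    return np.clip((clip - mean) * c + mean * b, 0.0, 1.0)

def occlude(clip, side=89, fill=(0.485, 0.456, 0.406)):
    # Rectangular patch at a random position per clip, fixed across frames.
    T, H, W, C = clip.shape
    y = rng.integers(0, H - side + 1)
    x = rng.integers(0, W - side + 1)
    out = clip.copy()
    out[:, y:y + side, x:x + side, :] = fill
    return out

clip = rng.uniform(0.0, 1.0, size=(16, 224, 224, 3))
for corrupt in (gaussian_noise, brightness_contrast, occlude):
    assert corrupt(clip).shape == clip.shape
```

Applying these at test time only, with no retraining, isolates robustness of the learned representation from any benefit of corruption-augmented training.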
4.17. Preliminary Rehabilitation Dataset Evaluation
RehabAction-847 Dataset (self-collected, YouTube source, January–March 2025):
847 clips, 12 categories: Arm Raise (Flexion/Abduction), Elbow Flexion, Wrist Rotation, Sit-to-Stand, Gait (Forward/Lateral), Balance (Single-Leg/Tandem), Reaching, Leg Lift, Ankle Dorsiflexion
Subjects: ages 28–76, 58% male, pathologies: stroke (34%), Parkinson’s (18%), CP (12%), and general (36%)
Train/Val/Test: 600/100/147 clips
Annotation: two physical therapists, κ = 0.81.
Results (test n = 147): Table 14 compares zero-shot performance, fine-tuned performance, and CCS across TimeSformer, UniFormerV2, and NeuroPrisma on the test set.
Finding: Severe domain shift (34.2% zero-shot vs. 98.7% on UCF101). Fine-tuning recovers performance to 68.4% (2.8% below SSv2), confirming rehabilitation-specific motion variability.
Limitations: Small scale (847 clips), no severity scores, and uncontrolled video quality. Future: 5000+ clips, multi-view, skeleton data, with clinical correlation (Fugl–Meyer and UPDRS).
5. Conclusions
NeuroPrisma is a research prototype demonstrating potential for causality-aware action recognition in neural rehabilitation systems, grounded in frequency-domain spectral decomposition (Prismatic Spectral Attention module) and Structural Causal Model-based intervention (Causal Intervention Layer). Extensive evaluation on four benchmark datasets demonstrated state-of-the-art accuracies: 98.7% on UCF101, 82.4% on HMDB51, 71.2% on Something-Something V2, and 91.5%/95.8% on NTU RGB+D (Cross-Subject/Cross-View), with a 41.5% Causal Confusion Score reduction relative to UniFormerV2. All improvements are statistically significant (p < 0.001, Bonferroni-corrected, Cohen’s d = 0.72–1.24). Real-time latency: 23.4 ms (NVIDIA A100) and 67.3 ms (Jetson AGX Orin). Causal metrics: ATE +0.142 (SSv2), and CBER 52.8%. Synergy analysis shows 26.6% of performance gains arise from PSA-CIL interaction.
Critical Limitations:
- (1)
Benchmark-only evaluation: All experiments were conducted on publicly available general-purpose datasets (UCF101, HMDB51, SSv2, and NTU RGB+D) under controlled conditions. Direct clinical validation with neurological patient populations has not been performed. Preliminary evaluation on the self-collected rehabilitation dataset (RehabAction-847) achieved 68.4% versus 71.2% on SSv2, confirming substantial domain shift.
- (2)
Partial deconfounding: CIL relies on causal assumptions (Markov condition, causal sufficiency) that may be violated in real-world data. Sensitivity analysis showed 20–30% degradation when unmodeled confounders are introduced. Residual bias of 16.7% (CBER analysis) persists after intervention.
- (3)
Fixed architectural constraints: A 16-frame temporal window limits performance on longer exercises (5.5% accuracy drop on videos > 6 s). Fixed spectral bands may be suboptimal for pathological movements (e.g., Parkinsonian tremor at 4–6 Hz overlaps with intentional movements). RGB-only input lacks depth, skeletal, and physiological modalities used in clinical assessment.
- (4)
No clinical outcome correlation: No analysis was conducted between NeuroPrisma predictions and standardized clinical scores (Fugl–Meyer Assessment, UPDRS, and GMFCS). It remains unknown whether classifications correspond to clinically meaningful functional levels.
- (5)
Regulatory requirements unaddressed: Clinical deployment requires IRB-approved trials, regulatory clearance (FDA 510(k), CE marking), EHR integration, and privacy compliance (HIPAA and GDPR)—none addressed in this work.
Positioning: NeuroPrisma is a research prototype demonstrating potential for causality-aware action recognition in rehabilitation contexts based on benchmark performance and theoretical principles. Clinical deployment readiness requires completing the validation pathway outlined below. The framework’s contribution lies in establishing a methodologically principled approach to unifying spectral temporal analysis with formal causal intervention, providing a foundation for future research in computer vision and rehabilitation technology.
Future Work:
Immediate steps: (1) Clinical dataset collection—collaborate with rehabilitation hospitals to collect a large-scale labeled dataset (target: 5000+ clips) with multi-view RGB-D video, skeleton annotations, and clinical severity scores (Fugl–Meyer, UPDRS, and GMFCS); (2) multi-modal extension—integrate depth, pose, and wearable IMU signals via mid-level fusion; (3) adaptive spectral partitioning—replace fixed frequency bands with learnable boundaries optimized per dataset; and (4) patient-specific continual learning—develop few-shot adaptation for personalization from limited patient data (5–10 clips per patient).
Long-term vision: (5) Prospective clinical validation—conduct IRB-approved longitudinal studies with stroke, Parkinson’s, and cerebral palsy cohorts to establish correlation with clinical outcomes and inter-rater reliability; (6) hierarchical temporal modeling—extend architecture to process full-length rehabilitation sessions (minutes-long videos) via hierarchical aggregation; and (7) explainability enhancement—develop clinician-facing interfaces that visualize spectral attention and causal intervention decisions.
NeuroPrisma advances the algorithmic component of automated movement assessment systems. Clinical translation requires rigorous validation, regulatory compliance, and human-centered design to augment rather than replace clinician expertise.