Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems

Alshaya, Hend

doi:10.3390/math14081341

Open AccessArticle

Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems

by

Hend Alshaya

Applied College, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia

Mathematics 2026, 14(8), 1341; https://doi.org/10.3390/math14081341

Submission received: 14 February 2026 / Revised: 6 April 2026 / Accepted: 8 April 2026 / Published: 16 April 2026

(This article belongs to the Special Issue Advances in Computer Vision and Image Processing with Applications to Bioinformatics)

Download

Browse Figures

Versions Notes

Abstract

Video-based action recognition for neural rehabilitation—spanning stroke recovery, Parkinsonian gait assessment, and cerebral palsy monitoring—faces critical challenges, including temporal ambiguity, non-causal motion correlations, and the absence of causally grounded dynamics modeling. While transformer-based architectures achieve strong performance, they often exploit spurious temporal and environmental cues, limiting reliability in safety-critical clinical settings. We propose NeuroPrisma, a neuro-prismatic video framework that integrates frequency-domain spectral decomposition with causal intervention under Structural Causal Models (SCMs) via the backdoor criterion. NeuroPrisma introduces (i) a Prismatic Spectral Attention (PSA) module, which applies discrete Fourier transforms to decompose temporal features into multi-scale frequency bands, disentangling slow postural dynamics from rapid corrective movements, and (ii) a Causal Intervention Layer (CIL), which performs do-calculus-based backdoor adjustment to remove confounding influences and produce causally invariant representations. PSA preconditions representations prior to intervention, improving confounder estimation and causal robustness. Extensive evaluation against seven state-of-the-art models (I3D, SlowFast, TimeSformer, ViViT, Video Swin Transformer, UniFormerV2, and VideoMAE) demonstrates that NeuroPrisma achieves 98.7% Top-1 accuracy on UCF101, 82.4% on HMDB51, 71.2% on Something-Something V2, and 91.5%/95.8% on NTU RGB+D (Cross-Subject/Cross-View), consistently outperforming prior methods. It further reduces the Causal Confusion Score (CCS) by 42.3%, indicating substantially lower reliance on spurious correlations, while maintaining real-time performance with 23.4 ms latency per 16-frame clip on an NVIDIA A100 GPU. All improvements are statistically significant (p < 0.001, Cohen’s d = 0.72–1.24). Evaluation was conducted exclusively on benchmark datasets (UCF101, HMDB51, Something-Something V2, and NTU RGB+D) under controlled conditions, without direct clinical validation on neurological patient cohorts. Overfitting was mitigated using three random seeds (42, 123, 456), RandAugment, Mixup (α = 0.8), weight decay (0.05), and early stopping. Cross-dataset generalization from UCF101 to HMDB51 without fine-tuning achieved 76.2% Top-1 accuracy. Future work will focus on prospective clinical validation across stroke, Parkinson’s disease, and cerebral palsy populations, including correlation with standardized clinical assessment scales such as Fugl–Meyer, UPDRS, and GMFCS. These results establish NeuroPrisma as a causally grounded and computationally efficient framework for reliable, real-time movement assessment in clinical rehabilitation systems.

Keywords:

action recognition; causal inference; video transformer; spectral attention; neural rehabilitation; deep learning

MSC:

68T45; 68T07; 62H30; 92C55

1. Introduction

Neurological diseases and motor disabilities affect some 1 billion people across the globe, demanding robust automated movement assessment and rehabilitation monitoring systems [1,2,3]. These systems rely on the correct identification of actions in video streams, which allows a clinician to trace the progress of a patient, identify movement disorders, and tailor treatment methods [4,5,6,7,8,9]. Recent advances in deep learning have substantially improved video understanding capabilities, and vision transformers have surpassed other action recognition metrics by a large margin [10,11,12,13,14].

Clinical rehabilitation contexts, including stroke motor recovery, Parkinson’s disease gait analysis, and cerebral palsy movement monitoring, demand reliable movement assessment tools that go beyond general-purpose recognition [15,16,17,18]. Pose-based and skeleton-driven systems such as OpenPose-derived pipelines and GCN-based skeletal models [19,20,21,22,23] have demonstrated value in controlled settings but are sensitive to occlusion and require specialized depth sensors not universally available in clinical environments. Clinical motion assessment systems like the SARA scale (Spinocerebellar Ataxia Rating) and upper-limb kinematics tools [24,25,26] rely on trained therapist observation or marker-based setups, limiting scalability. Notably, I3D [27,28,29,30,31] and TimeSformer [32,33,34] were designed for sports and web video datasets (Kinetics, UCF101) and do not address the subtle, fine-grained movement deviations, tremors, asymmetric gait, and incomplete range of motion that carry clinical significance in neurological rehabilitation [35,36,37]. Video Swin Transformer [38,39,40], while achieving strong temporal modeling, relies on local shifted windows that may miss longer-range causal motion dependencies critical for rehabilitation assessment. The fundamental limitation shared across these architectures is their susceptibility to confounding factors: scene backgrounds, patient clothing, camera placement, and lighting conditions, all of which vary across therapy sessions, can dominate the learned representations rather than the clinically meaningful motion features [41,42,43,44,45].

Early video-based action recognition employed two-stream convolutional networks to model complex spatiotemporal patterns [46], and inflated 3D convolutions (I3Ds) were applied to image classification architectures for video [14]. Most recently, transformer architectures have shown better performance, building long-range temporal dependencies with self-attention mechanisms [8,41].

Nevertheless, meaningful learning of temporal relationships is critically important to the success of the existing video models, instead of taking advantage of spurious correlations [9]. Modern methods have worrying tendencies in their behavior: the models that are trained on action recognition data tend to use the information available in the static scene, co-occurring objects, or temporal clues instead of comprehending the causal process of human motions [10,11]. This confounding issue is especially problematic in rehabilitation environments when such minor changes in movement have clinical implications [12].

Past studies have dealt with temporal modeling in different ways. In [10], the authors suggest a divided space–time attention that attains efficient processing of video. Similarly, ref. [39] presents hierarchical shifted windows for video understanding with strong computational efficiency. Nonetheless, the approaches do not have clear causal logic abilities and can still be vulnerable to biases in the datasets [15]. New frameworks of causal inference have appeared that address confounding in visual recognition [16,17], but there are still no applications to video action recognition.

Although these improvements were made, there is a significant gap in applying causal understanding together with effective video transformers in the context of rehabilitation [18]. The existing approaches do not sufficiently address the causality–accuracy trade-off and temporal confounding, thereby limiting their potential for use in clinical settings [19]. The challenge of distinguishing causally relevant motion patterns from spurious correlations requires a fundamental rethinking of how video models process temporal information.

To address this gap, this paper proposes NeuroPrisma, a new neuro-prismatic video framework that solves such issues using spectral decomposition and causal intervention. The system architecture is depicted in Figure 1.

The principal insights of this paper are as follows.

Frequency-domain analysis improves causal reasoning for the following theoretical motivation: temporal features of rehabilitation-relevant motions occupy distinct spectral bands—slow postural sway and tremor (~0–2 Hz) versus intentional limb movements (~2–8 Hz) versus rapid reflexive responses (~8–16 Hz). By explicitly decomposing temporal features into non-overlapping frequency bands, the PSA module enforces structural separation between motion scales, reducing the co-occurrence of confounding temporal shortcuts (e.g., environment motion and periodic background activity) with causal motion features. This spectral separation creates a more causally identifiable feature space upon which the CIL’s backdoor adjustment can more effectively operate. It is important to note that the clinical performance claims in this work are grounded in benchmark evaluations on publicly available datasets; prospective clinical validation with patient cohorts is acknowledged as necessary future work.

1.: NeuroPrisma Framework: A neuro-prismatic video architecture integrating frequency-domain spectral decomposition with SCM-grounded causal intervention for robust and interpretable action representation learning.
2.: Prismatic Spectral Attention (PSA): A frequency band decomposition module using discrete Fourier transforms to separate multi-scale temporal motion dynamics, enabling fine-grained discrimination of rehabilitation-relevant movement features.
3.: Causal Intervention Layer (CIL): A differentiable backdoor-adjustment module maintaining learnable confounder prototypes and applying NWGM-approximated do-calculus intervention to produce causally invariant action representations.
4.: Thorough Experimental Validation: Systematic evaluation on four standard benchmark datasets with comprehensive ablation analysis, statistical significance testing, and computational efficiency profiling.

The causal intervention framework proposed in NeuroPrisma is designed to be model-agnostic, enabling integration with various video transformer architectures beyond the specific implementation presented in this work. The PSA module operates as a standalone spectral decomposition layer that can be incorporated into existing architectures (e.g., TimeSformer, Video Swin Transformer, and ViViT) by replacing or augmenting standard temporal attention mechanisms. Similarly, the CIL module functions as a plug-in causal regularization component that can be inserted before the classification head of any video model, requiring only the specification of confounding variable prototypes relevant to the target application domain. This modularity ensures that advances in base video architectures—such as improved positional encodings, efficient attention mechanisms, or novel tokenization strategies—can be seamlessly integrated while maintaining the causal intervention mechanism.

2. Related Work

This section provides a review of the current literature on video transformers in action recognition and causal inference in deep learning.

2.1. Video Transformers for Action Recognition

The ViT has transformed image recognition by considering the images as arrangements of patches [42]. The video extension has brought about several powerful architectures. ViViT [41] proposes factorized spatiotemporal attention, which decouples the processing of space and time to compute it efficiently. TimeSformer [10] proposes divided space–time attention, which establishes that pure transformer architectures can match or surpass convolutional baselines on video benchmarks.

Video Swin Transformer [39] generalizes the hierarchical shifted-window method to the 3D video volumes, reaching a state-of-the-art performance with manageable computational complexity. Most recently, UniFormerV2 [21] combines convolution with self-attention to learn effective spatiotemporal learning. These approaches show a very impressive performance but do not have clear mechanisms for how to reason causally, and this is vulnerable to spurious correlations of the training data [22].

Recent papers have delved into attention machinery in better temporal modelling. The k-NN attention model [3] makes use of local neighborhood data to gain video knowledge. Multi-scale temporal transformers [4] transform videos into several temporal resolutions. Temporal variations are explicitly modeled by the long-short temporal difference networks [2]. These approaches, however, optimize classification accuracy without considering the causal validity of learned representations.

2.2. Causal Inference in Deep Learning

Causal inference provides formal frameworks for reasoning about cause–effect relationships [15]. Recent papers have incorporated causal logic into deep learning to deal with confounding and enhance generalization [16,17]. Deconfounded image captioning [34] uses causal intervention to eradicate spurious relationships between visual and language descriptions. Causal inference is used in out-of-distribution recognition [26] and cross-modal inference [27].

Causal methods have been investigated in the context of video understanding: video question answering [28] and video summarization [29]. The article in [30] revisits the topic of video-language understanding with a causal approach. Causal adversarial examples [31] offer information about the weaknesses of the model. Despite these developments, action recognition integrating causal inference with video transformers is an open challenge.

The transition from video transformer models to causal inference approaches reveals a critical structural gap: video transformers optimize for pattern discrimination under the i.i.d. training assumption, while causal inference addresses the fundamentally different problem of invariant prediction under distribution shift. This distinction is practically significant for rehabilitation: a patient’s background, lighting, and camera angle change across therapy sessions (distribution shift), while the clinically relevant motion features should remain stable. Spectral decomposition addresses this transition by creating a structured latent space that separates motion dynamics from environmental signals before causal deconfounding is applied. This is not merely a combination of two techniques but a principled pipeline where spectral separation creates the necessary feature disentanglement that makes subsequent causal adjustment both identifiable and computationally tractable. No existing work in the reviewed literature has addressed this combined pipeline in either general video understanding or rehabilitation-specific action recognition.

2.3. Research Gaps

Table 1 summarizes the reviewed literature and identifies key gaps.

According to Table 1, the gaps that will be driving this work are the following:

Gap 1: No existing video transformer architecture combines structured frequency-domain temporal decomposition with formal causal intervention based on do-calculus. Methods with spectral components (e.g., spectral GNNs [33]) operate on graph-structured data and do not address spurious temporal correlations in video transformers.
Gap 2: Existing video transformers, including TimeSformer [13], VideoSwin [14], and UniFormerV2 [21], lack explicit confounder modeling mechanisms; their attention weights are optimized for classification accuracy under training-set statistics, making them vulnerable to dataset-specific spurious correlations such as scene context and background activity.
Gap 3: Rehabilitation-specific action recognition imposes clinical requirements for interpretability of temporal features, causal validity under session-to-session covariate shift, and latency compatible with interactive therapy that are not addressed by any benchmark-optimized architecture reviewed here.

3. Methodology

This section outlines the suggested NeuroPrisma approach to action recognition based on causality.

3.1. Problem Formulation

Given an input video

V \in R^{T \times H \times W \times 3}

, with

T

frames of spatial resolution

H \times W

, the goal is to predict action label

y \in Y

and to guarantee the prediction, which is not a spurious correlation based on causally relevant features. We model these terms as a Structural Causal Model (SCM) in which action Y is brought about by the dynamics of motion M, by confounding these by scene context C and shortcuts in time S:

Y \leftarrow f (M, C, S, ϵ)

(1)

Assumption Limitations: The SCM (Equation (1)) relies on (1) the Markov condition, where exogenous variables are mutually independent, and (2) causal sufficiency, where all common causes are included. These assumptions may be violated when unmeasured latent factors (patient fatigue, therapist guidance, and medication timing) simultaneously influence motion M and context C. K = 128 prototypes initialized from ImageNet may not comprehensively represent rehabilitation-specific confounders (assistive devices, compensatory movements, and session-to-session environmental variations). Sensitivity analysis (Section 4.7.3) shows that unmodeled confounders degrade causal effectiveness by 20–30%, indicating partial rather than complete deconfounding. Future work: synthetic data experiments with oracle causal labels to validate spectral-causal disentanglement.

Where

ϵ

represents exogenous noise. Our objective is to learn a predictor that isolates the causal effect

M \to Y

while blocking confounding paths through

C

and

S

.

The SCM is formally specified under two assumptions: (1) exogenous variables are mutually independent (Markov condition), and (2) causal sufficiency holds all common causes of any observed variable pair are included in the model. Under these conditions, the backdoor criterion [24] is satisfied: confounder set C d-separates motion features M from outcome Y when conditioning on C, permitting identification of P(Y∣do(M)) via the adjustment formula. Critically, frequency-domain decomposition contributes to causal identifiability through spectral disentanglement; motion-related causal signals occupy structured temporal bands (e.g., limb oscillation at 2–8 Hz), whereas scene-context confounders concentrate at very low frequencies (<0.5 Hz, DC component) or exhibit broadband non-periodic structure. By partitioning the temporal feature space into non-overlapping spectral bands prior to causal adjustment, the PSA module reduces overlap between causal feature and confounder distributions, thereby improving CIL backdoor adjustment precision and providing a principled theoretical basis for spectral decomposition as a preconditioner for causal modeling.

3.2. System Architecture

Figure 2 illustrates the NeuroPrisma framework proposed here, which is composed of four main elements, namely, (1) video patch embedding, (2) Prismatic Spectral Attention (PSA), (3) Causal Intervention Layer (CIL), and (4) transformer encoder blocks. The entire architecture takes video inputs and performs spectral decomposition before causal intervention to generate deconfounded action predictions. NeuroPrisma is built upon the ViT-B/16 backbone [20] pretrained on ImageNet-21K, with the PSA module inserted before the first encoder layer and the CIL module after the PSA output, as shown in Figure 2: The data flow is as follows: (1) The input video tensor is represented as V ∈ ℝ^T×H×W×3 and is first processed through the patch embedding layer to obtain the initial token representation Z⁽⁰⁾ ∈ ℝ^(N+1)×768. The token sequence is then passed through the PSA module, producing Z_psa ∈ ℝ^(N+1)×768. After that, the representation is refined using the CIL module, resulting in Zc_il ∈ ℝ^(N+1)×768. The output is subsequently processed through L = 12 transformer encoder layers, yielding the final token representation Z⁽¹²⁾ ∈ ℝ^(N+1)×768. Finally, the CLS token zc_ls⁽¹²⁾ is passed to a linear classifier to produce the prediction

\hat{y}

∈ ℝ|Y|. All intermediate representations maintain D = 768 to enable residual connections with pretrained ViT-B weights.

Backbone Network Specification

The NeuroPrisma framework employs Vision Transformer Base (ViT-B/16) as the backbone architecture, augmented with temporal processing capabilities and the insertion of PSA and CIL modules. The backbone comprises 12 transformer encoder layers with the following configuration: L = 12 layers, D = 768 hidden dimension, H = 12 attention heads, d_k = 64 dimension per head, MLP expansion dimension 3072, GELU activation function, and ImageNet-21K pre-training. The patch size is 16 × 16 pixels with an input resolution of 224 × 224 pixels per frame. The architectural data flow proceeds as follows. The input consists of 16-frame video clips V ∈ ℝ^{(B×16×224×224×3)}, where B denotes batch size. Patch embedding applies a linear projection of flattened 16 × 16 patches to produce Z^patch ∈ ℝ^{(B×197×768)}, where 197 tokens comprise one [CLS] token and 196 spatial patches (14 × 14 grid per frame with temporal averaging). The patch embeddings pass through PSA, which performs 4-band frequency decomposition (DC–3.75 Hz, 3.75–7.5 Hz, 7.5–12.5 Hz, 12.5–16.875 Hz), applies band-specific multi-head attention, and fuses features to produce Z^psa ∈ ℝ^{(B×197×768)}. The PSA output undergoes causal deconfounding through CIL using K = 128 learned prototypes and NWGM marginalization, yielding Z^cil ∈ ℝ^{(B×197×768)}. The deconfounded features propagate through 12 stacked transformer blocks (LayerNorm → Multi-Head Self-Attention → Residual → LayerNorm → MLP → Residual) to produce Z^transformer ∈ ℝ^{(B×197×768)}. Finally, the [CLS] token z^cls ∈ ℝ^(B×768) undergoes linear projection to generate logits ∈ ℝ^(B×C), where C denotes the number of action classes. The complete architecture comprises approximately 98.6 M parameters: patch embedding (590 K), PSA (590 K), CIL (98 K), transformer encoder (85.8 M), and classification head (768 × C). Four key modifications distinguish NeuroPrisma from standard ViT-B: temporal extension via frame-wise patch embedding with temporal pooling (16 frames → 196 spatial tokens), PSA insertion after patch embedding for frequency-domain attention, CIL insertion before the transformer encoder for prototype-based deconfounding, and a two-stage fine-tuning strategy with frozen backbone (10 epochs) followed by joint training (90 epochs) on action recognition datasets.

3.3. Video Patch Embedding

We divide every frame in the video into non-overlapping P-by-P-sized patches in accordance with the ViT paradigm [42]. In case of a video containing T frames, this gives N = T (H/P) (W/P) patches. The patches are linearly mapped to a D-dimensional representation:

z_{i}^{(0)} = E_{p} \cdot flatten (p_{i}) + e_{i}^{p o s}

(2)

where

E_{p} \in R^{D \times P^{2} \cdot 3}

is the patch projection matrix, and

e_{i}^{p o s}

is the learnable spatiotemporal positional encoding.

A learnable class token

z_{c l s}

is prepended to the sequence, yielding the initial token sequence

Z^{(0)} = [z_{c l s}; z_{1}^{(0)}; \dots; z_{N}^{(0)}]

.

3.4. Prismatic Spectral Attention (PSA)

The PSA module splits the temporal features into various frequency bands and, in so doing, allows the model to pick apart various temporal dynamics. Based on spectral graph neural networks [31] and frequency-domain learning [36], we use the discrete Fourier transform on the time dimension.

Detailed PSA Pipeline: The spectral decomposition process follows a six-step pipeline to transform temporal video features into multi-scale frequency-domain representations:

Step 1 (Temporal–Spatial Separation): Given token embeddings Z ∈ ℝ^(N×D) from the patch embedding stage where N = T·N_s (T = 16 temporal frames, N_s = (H/P)·(W/P) = 196 spatial tokens, and D = 768), reshape to explicitly separate temporal and spatial dimensions: Z →

\tilde{Z}

∈ ℝ^(T×N_s×D) = ℝ^(16×196×768).

Step 2 (Frequency-Domain Transformation): Apply 1D Discrete Fourier Transform along the temporal axis for each spatial token position s:

\hat{Z}

_s(k) = Σ_t=0^(T − 1)

\tilde{Z}

_ts·exp(−i·2π·k·t/T), k ∈ {0, 1, …, 15}. Due to the Nyquist limit, only the first 9 unique frequency components are retained (components k > T/2 are conjugate redundant).

Step 3 (Multi-Scale Band Partitioning): The 9 unique frequency components correspond to frequencies f_k = k·(fps/T), where fps = 30 for most datasets. With T = 16 and fps = 30, the frequency resolution is Δf = 30/16 = 1.875 Hz. The spectrum is partitioned into B = 4 non-overlapping bands based on motion frequency characteristics:—Band 1: k ∈ {0, 1} → [0, 3.75) Hz (slow postural drift, background motion)—Band 2: k ∈ {2, 3} → [3.75, 7.5) Hz (deliberate limb movements)—Band 3: k ∈ {4, 5, 6} → [7.5, 12.5) Hz (normal-speed actions)—Band 4: k ∈ {7, 8} → [12.5, 16.875) Hz (rapid reflexive motions).

Step 4 (Band-Specific Spectral Attention): For each frequency band F_β, compute attention over constituent frequency components with learnable spectral filtering. Let Ẑ^(b) ∈ ℝ^(|F_β|·N_s×D) denote concatenated features from band b. Apply band-specific projections Q_β, K_β, V_β = Linear(Ẑ^(b)), then compute attention with spectral weight matrix S_β ∈ ℝ^(|F_β|×|F_β|): A_β = Softmax((Q_βK_βᵀ/√d_k) ⊙ S_β)V_β, where ⊙ denotes element-wise multiplication and S_β enables learning of frequency–frequency interactions within each band.

Step 5 (Temporal Reconstruction): Apply inverse DFT to transform each band’s attention output back to the temporal domain: Z_temporal^(b) = Σ_k ∈ F_β Aᵦ(k)·exp(i·2π·k·t/T), yielding B = 4 temporal feature streams {Z_temporal⁽¹⁾, …, Z_temporal⁽⁴⁾}, each capturing a distinct temporal scale.

Step 6 (Cross-Band Feature Fusion): Aggregate multi-scale features via learnable attention-weighted fusion. Compute fusion weights dynamically per input: α_β = Softmax(W_fusion·[GlobalAvgPool(Z_temporal^(b)); GlobalMaxPool(Z_temporal^(b))]), then combine Z^psa = Σ_β=1^B α_β·Z_temporal^(b). The fusion weights α_β allow the model to emphasize task-relevant frequency scales (e.g., upweighting high-frequency bands for fast actions). Computational Complexity: The PSA module achieves O(T log T·N_s·D) complexity via FFT implementation, compared to O(T²·N_s·D) for full temporal self-attention, providing ~4× speedup for T = 16 frames

Given token embeddings

Z \in R^{N \times D}

, we first reshape separate spatial and temporal dimensions:

\tilde{Z} \in R^{T \times N_{s} \times D}

where

N_{s} = N / T

. The spectral decomposition is computed as

\hat{Z} (k) = \sum_{t = 0}^{T - 1} \tilde{Z} (t) \cdot e^{- i 2 π k t / T}, k = 0, 1, \dots, T - 1

(3)

where

\hat{Z} (k)

represents the

k

-th frequency component.

The dimension processing logic proceeds as follows: given input token embeddings Z ∈ ℝ^(N×D) from the patch embedding stage, where N = T • N_s and N_s = (H/P) • (W/P), we reshape Z to

\tilde{Z}

∈ ℝ^(T×N_s×D), explicitly separating temporal and spatial dimensions. The 1D DFT is applied along the T dimension only, yielding

\hat{Z}

(k) ∈ ℝ^(N_s×D) for k = 0, …, T − 1, preserving spatial token structure while transforming temporal dynamics into the frequency domain. Band-specific attention is computed independently within each partition F_b with projections (Q_b, K_b, V_b) ∈ ℝ^{(|F_b|•Ns×d_k)}. After computing A_b, inverse DFT F⁻¹ maps each band back to the temporal domain, and B band outputs are fused via learnable weights α_b to produce Z_psa ∈ ℝ^(N×D), matching input dimensionality for subsequent transformer encoder processing.

We partition the frequency spectrum into

B

non-overlapping bands, each capturing different temporal scales:

F_{b} = {\hat{Z} (k) : k \in [f_{b}^{l o w}, f_{b}^{h i g h}]}, b = 1, \dots, B

(4)

For each frequency band, we apply band-specific attention with learnable spectral filters, as defined in (5):

A_{b} = softmax (\frac{Q_{b} K_{b}^{⊤}}{\sqrt{d_{k}}} ⊙ W_{b}^{s p e c}) V_{b}

(5)

where

Q_{b}, K_{b}, V_{b}

the query, key, and value projections for band

b

, and

W_{b}^{s p e c}

is a learnable spectral weight matrix that modulates attention patterns based on frequency characteristics.

The multi-band features are aggregated through learnable fusion weights, as expressed in (6):

Z^{p s a} = \sum_{b = 1}^{B} α_{b} \cdot F^{- 1} (A_{b})

(6)

where

F^{- 1}

denotes the inverse Fourier transform, and

α_{b} = softmax (w_{b}^{⊤} g)

are attention-weighted fusion coefficients with global feature vector

g

.

Empirical Validation: Synthetic SSv2 subset (2847 video pairs, directional actions): Optical flow analysis shows directional motion energy concentrates in 2–8 Hz (mean spectral power: 0.68 ± 0.12), while background features dominate DC–0.5 Hz (0.82 ± 0.09). Cross-correlation: low spectral overlap (Pearson r = 0.23, p < 0.001). Mutual information I(Z; Y) between band features and action labels was estimated using MINE (Mutual Information Neural Estimation) [23] with a 3-layer MLP statistics network (hidden dim 512), trained for 500 iterations on n = 2847 video clips from the synthetic SSv2 subset, averaged across 3 random seeds (42, 123, 456). Band 3 (7.5–12.5 Hz) achieves the highest I(Z; Y) = 0.742 ± 0.018 nats, while Band 1 (DC–3.75 Hz) achieves the lowest (0.312 ± 0.024 nats), confirming that mid-to-high-frequency motion features carry the most class-discriminative causal information. Ablation on SSv2 validation set (n = 24,777 clips): removing Band 3 (7.5–12.5 Hz) → 8.7% Top-1 accuracy drop (71.2% → 62.5%); removing Band 1 (DC–3.75 Hz) → 1.4% drop (71.2% → 69.8%). These confirm that high-frequency bands carry disproportionate causal information for fine-grained temporal action discrimination. Limitation: Parkinsonian tremor (4–6 Hz) overlaps with intentional movements, requiring adaptive band learning for pathological motions.

The feature representation Zpsa, after multi-band spectral fusion, represents the motion dynamics at several temporal scales. The spectrally enhanced features are then directly fed into the Causal Intervention Layer, where they become visible variables to be used in confounder estimation and causal deconfounding. This sequential design also makes causal intervention act on the temporally discriminative representations instead of the raw spatial features and thus enhances its resistance to the spurious correlations.

3.5. Causal Intervention Layer (CIL)

We formalize the causal graph for action recognition as a directed acyclic graph (DAG): Treatment variable X = spectral motion features Z^psa; Outcome variable Y = action class label; and Confounder C = latent scene context variables (background appearance, co-occurring objects, camera viewpoint, and lighting conditions), creating backdoor paths X ← C → Y. The do-calculus intervention do(X = x) cuts all incoming edges to X, blocking the backdoor path and isolating the direct causal effect X → Y. Under the backdoor criterion [24], the causal effect is identifiable as P(Y|do(X)) = Σ_c P(Y|X, c) • P(c), implemented in the CIL. Confounder C is approximated by K = 128 learnable prototypes {c_1, …, c_K} estimated via K-means initialization and EMA updates. The CIL module uses causal intervention so that spurious correlations are removed. According to the Structural Causal Model in (1), we identify confounders that create backdoor paths between the input features and the action labels. The operator, do(X = x), cuts all the incoming edges to X, which prevents the presence of confounding paths [15]. As illustrated in Figure 3a, the observational model reveals backdoor paths C → X and C → Y through scene context confounder C, where P(Y|X) conflates causal and spurious associations. Figure 3b depicts the interventional model after CIL applies do-calculus, severing the backdoor path C → X and isolating the direct causal effect X → Y via the adjustment formula P(Y|do(X)) = Σ_c P(Y|X, c)·P(c).

Prototype Analysis (NTU RGB+D): In total, 47/128 prototypes (36.7%) capture scene-context, 31 (24.2%) camera viewpoint, 28 (21.9%) temporal context, 22 (17.2%) subject appearance. Rehabilitation-specific underrepresentation: only 3 prototypes capture assistive devices, 2 capture patient-therapist interaction, and none capture compensatory movements. Ablation: Random initialization vs. ImageNet → 1.8% accuracy drop (69.4% vs. 71.2%), CCS increase (0.294 vs. 0.267), confirming ImageNet provides a measurable but not decisive benefit. Future: Domain-adaptive prototype learning uses in-domain rehabilitation datasets or semi-supervised confounder annotations.

We model the confounder distribution

P (C)

using a dictionary of learnable confounder prototypes

C = {c_{1}, \dots, c_{K}}

. The posterior probability of each confounder given input features is computed as

P (c_{k} | Z^{p s a}) = \frac{e x p (Z^{p s a} \cdot c_{k} / τ)}{\sum_{j = 1}^{K} e x p (Z^{p s a} \cdot c_{j} / τ)}

(7)

where

τ

is a temperature parameter.

The causal intervention is implemented through the following backdoor adjustment formula [15]:

P (Y | d o (X)) = \sum_{c \in C} P (Y | X, c) P (c)

(8)

Z^{c i l} = NWGM ({Z^{p s a} - β \cdot P (c_{k} | Z^{p s a}) \cdot c_{k}}_{k = 1}^{K})

(9)

where

β

is a learnable intervention strength parameter.

The NWGM approximation serves as a computationally tractable alternative to the exact backdoor adjustment formula P(Y|do(X)) = Σₖ P(Y|X, cₖ)·P(cₖ), which would require K = 128 independent forward passes through the classifier—rendering training prohibitively expensive. NWGM consolidates this summation into a single forward pass via the weighted geometric mean: Z^cil = Z^psa − β·NWGM({Z^psa−cₖ}_k=1ᴷ). This approximation provides two critical advantages: (1) outlier robustness—the geometric mean inherently downweights mis-estimated confounder prototypes with anomalous posterior probabilities; empirical validation shows that under 10% prototype noise injection, NWGM degrades accuracy by only 0.6% compared to 1.4% for arithmetic mean aggregation; (2) gradient stability—the log-space implementation log(NWGM) = Σₖ wₖ·log(Zₖ) produces smoother gradients during backpropagation, with mean gradient norm 0.034 versus 0.089 for direct product–space computation. To validate approximation quality, we compared NWGM against Monte Carlo sampling of the true backdoor formula (100 samples per batch) on a held-out SSv2 validation subset (n = 1000 clips): NWGM achieved 70.8% Top-1 accuracy versus 71.1% for MC sampling, demonstrating only 0.3% approximation gap while providing 127× speedup (23.4 ms versus 2.97 s per clip).

Confounder prototypes {c₁, …, cₖ} are initialized via K-means clustering on ImageNet-21K ViT-B/16 [CLS] token features (10,000 randomly sampled images) and updated during training via exponential moving average with momentum μ = 0.999:

cₖ^(t+1) = μ·cₖ^(t) + (1 − μ)·(1/|Bₖ|)·Σ_i ∈ Bₖ Z_psaⁱ

where Bₖ = {i: argmaxⱼ P(c_j|Z_psa^(i)) = k} denotes samples assigned to prototype k in the current batch. This slow-update strategy ensures prototype stability while allowing gradual adaptation to domain-specific confounders during fine-tuning.

Confounder prototype initialization proceeds as follows: we extract [CLS] token features from the final layer (L = 12) of the pretrained ViT-B/16 encoder on ImageNet-21K for 10,000 randomly sampled images (stratified across 1000 classes, with 10 images per class). K-means clustering with K = 128 is run on these 768-dimensional features using scikit-learn v1.3.0 (algorithm = ‘elkan’, max_iter = 300, n_init = 10) to obtain initial prototype centroids {c₁⁽⁰⁾, …, c_k⁽⁰⁾}. These prototypes are then treated as learnable parameters during video training, updated via exponential moving average (momentum = 0.999) as described in Section 4.2. The prototypes are NOT frozen—they gradually adapt to video-specific confounders (motion blur and temporal context) while maintaining initialization structure from natural images. This warm-start strategy ensures the confounder dictionary captures both general visual confounders (scene layout and object co-occurrence) and task-specific temporal confounders.

In order to make sure that the intervention eliminates the spurious correlations, yet the causal information is not lost, we present a mutual information regularization term, grounded on MINE [23]:

L_{M I} = E [Z^{c i l} \cdot T_{θ} (c)] - l o g E [e x p (Z^{c i l} \cdot T_{θ} (c'))]

(10)

where

T_{θ}

is a statistics network and

c'

is a shuffled confounder sample. Minimizing

L_{M I}

encourages independence between deconfounded features and confounders.

3.6. Transformer Encoder and Classification

The deconfounded features are

Z^{c i l}

processed through

L

transformer encoder layers, each consisting of Multi-Head Self-Attention (MHSA) and feed-forward network (FFN):

Z^{(l)} = FFN (LN (Z^{(l - 1)} + MHSA (LN (Z^{(l - 1)}))))

(11)

where LN denotes layer normalization.

The final class token

z_{c l s}^{(L)}

is passed through a linear classifier to predict action probabilities:

\hat{y} = softmax (W_{c l s} z_{c l s}^{(L)} + b_{c l s})

(12)

The overall training objective combines cross-entropy loss with causal regularization, as expressed in (13):

L_{t o t a l} = L_{C E} (y, \hat{y}) + λ_{M I} L_{M I} + λ_{r e g} ||| θ {|||}_{2}^{2}

(13)

where

λ_{M I}

and

λ_{r e g}

are hyperparameters.

Algorithm 1 presents the complete NeuroPrisma inference procedure.

Algorithm 1: NeuroPrisma Inference Algorithm

Require: Video V ∈ ℝ^(T×H×W×3), Model parameters θ
Ensure: Predicted action label ŷ

1. Extract patches and compute embeddings using (2)

2. Apply FFT decomposition using (3)

3. Partition into spectral bands using (4)

4. for each band b = 1, …, B do

5. Compute band attention A_b using (5)

6.

7. end for

8. Fuse spectral features Z^psa using (6)

9. Estimate confounder posteriors using (7)

10. Apply causal intervention Z^cil using (9)

11. for each encoder layer ℓ = 1, …, L do

12. Update Z^(ℓ) using (11)

13.

14. end for

15. Compute prediction ŷ using (12)

16. return ŷ

3.7. Ethical Considerations

This study was conducted exclusively on publicly available benchmark datasets (UCF101, HMDB51, Something-Something V2, NTU RGB+D), all of which were collected and distributed under their respective institutional ethical protocols prior to this work. No new human subject data was collected, no participant recruitment was conducted, and no personally identifiable information was accessed or stored in any experiment reported in this paper. All experiments involve only publicly released video data for the purpose of algorithmic evaluation. The IRB protocol referenced (Stanford University IRB-2025-0847) pertains to a planned prospective clinical validation study involving rehabilitation patients that is currently in the preparatory phase and is not reported in this manuscript. The benchmarks used comply with their terms of use, and all dataset access was performed through official distribution channels.

4. Experimental Results

This part provides a detailed experimental analysis of NeuroPrisma on four benchmark datasets, such as quantitative comparisons to the state-of-the-art techniques, comprehensive ablation studies, and a thorough study of the proposed components.

4.1. Datasets and Experimental Protocol

We evaluate NeuroPrisma on four publicly available benchmark datasets representing a variety of action recognition scenarios, including sports activities and fine-grained object interactions. The data characteristics and evaluation protocols are summarized in Table 2.

UCF101 [20] has 13,320 videos in 101 action categories crawled on YouTube with a wide range of backgrounds and camera movements. We report results averaged across three official train/test splits. URL: https://www.kaggle.com/datasets/matthewjansen/ucf101-action-recognition (accessed on 11 November 2024).

HMDB51 [12] comprises 6849 video clips from 51 action categories extracted from movies and web videos, presenting challenges including occlusions, viewpoint variations, and cluttered backgrounds. URL: https://www.kaggle.com/datasets/easonlll/hmdb51 (accessed on 11 November 2024).

Something-Something V2 (SSv2) [28] contains 220,847 videos across 174 fine-grained action categories that require understanding temporal relationships between objects (e.g., “Pushing something from left to right” vs. “Pushing something from right to left”). This dataset specifically tests temporal reasoning capability. URL: https://www.kaggle.com/datasets/ipythonx/ssv2test (accessed on 11 November 2024).

NTU RGB+D [43] provides 56,880 action samples with RGB, depth, and skeleton modalities for 60 action classes. We evaluate on both Cross-Subject (X-Sub: different subjects in train/test) and Cross-View (X-View: different camera views) protocols. URL: https://www.kaggle.com/datasets/hungkhoi/skeleton-data-of-ntu-rgbd-60-dataset (accessed on 11 November 2024).

We acknowledge that no publicly available large-scale labeled dataset for neurological rehabilitation action recognition currently exists at the scale required for training deep video transformers (>10,000 labeled clips). The four benchmarks represent a progression of temporal reasoning difficulty: UCF101 and HMDB51 test appearance–motion co-occurrence; SSv2 tests fine-grained temporal ordering most relevant to rehabilitation; and NTU RGB+D includes activities of daily living and healthcare-related actions (falling, standing up, and hand-waving) with RGB and skeleton modalities, containing 11 actions directly relevant to motor rehabilitation (walking, sitting down, standing up, clapping, throwing, reaching, picking up). Future work will evaluate NeuroPrisma on the KInematic database for gesture and action Rehabilitation (KInAR) and the StrokeRehab dataset upon their public release.

4.2. Implementation Details

NeuroPrisma is implemented in PyTorch 2.1.0 and trained on eight NVIDIA A100 (80 GB) GPUs using Distributed Data Parallel (DDP). We use ViT-B/16 [42], pretrained on ImageNet-21K, as the encoder backbone, with L = 12 transformer layers, a hidden dimension D = 768, and 12 attention heads. The input videos are uniformly resampled in both time and space to T = 16 frames and a spatial resolution of 224 × 224, with center cropping applied at inference.

The PSA module breaks down temporal attributes into B = 4 spectral bands with non-overlapping frequency bands: Band 1 (DC2 Hz), Band 2 (2–4 Hz), Band 3 (4–8 Hz), and Band 4 (8–16 Hz) represent slow gestures, normal motions, fast motions, and quick motions, respectively. The CIL module has K = 128 learnable confounder prototypes, which are initialized using K-means on ImageNet features and updated using exponential moving average during training.

Training makes use of AdamW optimizer with initial learning rate of 1 × 10⁻⁴, weight decay of 0.05, and cosine annealing with a five-epoch linear warm-up. We train for 30 epochs on UCF101/HMDB51 and 50 epochs on SSv2/NTU RGB+D with a batch size of 64 (8 per GPU). Data augmentation includes RandAugment (magnitude 9), random horizontal flip, and temporal jittering (

\pm 2

frames), random crop (scale 0.8–1.0), and Mixup (

α = 0.8

). Loss weights are

λ_{M I} = 0.1

and

λ_{r e g} = 0.01

, with temperature

τ = 0.07

and intervention strength

β = 0.5

.

Full reproducibility: videos are decoded at the native frame rate with T = 16 frames uniformly sampled per clip. Spatial preprocessing: Shorter edge resized to 256 pixels, then 224 × 224 center crop (inference) or random crop (scale 0.08–1.0, ratio 0.75–1.33, training). Per-frame normalization: Mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]. Temporal augmentation: A ±2 frame jitter. Hardware: In total, 8× NVIDIA A100 80GB SXM4, NVLink, CUDA 11.8, and cuDNN 8.6, as well as mixed-precision FP16 with gradient scaling, where gradient clipping at max norm is 1.0. Learning rate: Linear warmup 0 → 1 × 10⁻⁴ over five epochs, with cosine annealing to 1 × 10⁻⁶. Weight decay is 0.05, excluding LayerNorm and bias. EMA momentum is 0.999 for confounder prototype updates. Code will be released at [GitHub link upon acceptance] using PyTorch 2.1.0 and timm v0.9.2.

We repeat all the experiments using three random seeds (42, 123, 456) and report means and standard deviation. Statistical significance was compared by means of paired t-tests with Bonferroni multiple comparison adjustment. Statistical significance was assessed using two-tailed paired t-tests comparing per-seed accuracy scores (n = 3 seeds × 3 splits = 9 paired observations per dataset for UCF101/HMDB51, and n = 3 seed observations for SSv2/NTU), with Bonferroni correction applied across five datasets to control family-wise error rate at α = 0.05, yielding the corrected threshold p < 0.01.

4.3. Baseline Methods

We benchmark NeuroPrisma with the top seven state-of-the-art video recognition architectures, which consist of CNNs, transformer-based, and self-supervised models:

I3D [14]: RGB + optical flow two-stream fusion Inflated 3D convents, pretrained on Kinetics-400.
SlowFast [5]: Bi-directional architecture with slow (4 fps, strong semantics) and fast (32 fps, fine time resolution) branches.
TimeSformer [10]: A pure transformer architecture that decomposes space–time attention via a divided attention scheme, operating on 8 NVIDIA V100 GPUs with ViT-B/16 backbone pretrained on ImageNet-21K.
ViViT [41]: Video Vision Transformer, which has factorized spatial-then-temporal encoding.
Video Swin [39]: Multifamily local attention inductive-based 3D shifted window transformer.
The following one is UniFormerV2 [18]: The UniFormerV2 model, built as a unified local–global attention that encompasses convolution efficiency with the expressiveness of transformers.
Video MAE [40]: Self-supervised masked autoencoder, pretrained on unlabeled video with a masking ratio of 90.

To ensure fair comparison, all baselines used officially released pretrained weights ImageNet-21K for transformer-based models (TimeSformer, ViViT, Video Swin, UniFormerV2, and VideoMAE) and Kinetics-400 for CNN-based models (I3D, SlowFast), matching NeuroPrisma’s pretraining regime. All models were fine-tuned under identical settings: RandAugment magnitude 9, random horizontal flip, temporal jitter ±2 frames, Mixup α = 0.8, 30/50 training epochs (UCF101/HMDB51 and SSv2/NTU RGB+D, respectively), batch size 64, and AdamW optimizer (lr = 1 × 10⁻⁴, weight decay 0.05, cosine schedule). Learning rate was the only baseline-specific setting, set per each model’s original publication. This ensures performance differences reflect architectural properties rather than optimization advantages.

4.4. Main Results

Table 3 presents the classification accuracy comparison across all benchmark datasets. NeuroPrisma performs at the highest level on each of the four benchmarks, and the variance is lower than that of the baselines.

Several important observations emerge from Table 3: all numerical comparisons are summarized in Table 3 and visualized with error bars in Figure 4; a comprehensive cross-dimensional summary is provided in Table 4.

(1): Stable state-of-the-art performance: NeuroPrisma exceeds all the baselines on all the datasets and metrics, improving by a range of +0.7 per cent (UCF101 Top-1) to +2.9 per cent (SSv2 Top-1). The differences are all statistically significant (p < 0.001 using Bonferroni).
(2): Largest gains on temporally challenging tasks: The largest gain (+2.9) is on SSv2, which demands fine-temporal reasoning to differentiate efforts that differ solely on the direction of movement or the interaction pattern of objects with each other. This confirms the ability of the PSA module to capture discriminative multi-scale time dynamics.
(3): Lower variance: The average standard deviation of NeuroPrisma (0.12–0.26) is lower than baselines (0.17–0.51), which means that the performance is more stable and repeatable. We attribute this to the causal deconfounding mechanism that makes us less sensitive to the spurious dataset-specific correlations.
(4): Good performance on skeletal basis: Over NTU RGB+D, NeuroPrisma gives +2.3% (X-Sub) and +1.5% (X-View) improvements, proving that the spectral-causal method makes the generalization of the appearance-based recognition to motion-based tasks.

Cross-Dataset Generalization: To evaluate domain transfer capability without fine-tuning, NeuroPrisma, trained exclusively on UCF101, was evaluated zero-shot on HMDB51. The model achieved 76.2% Top-1 accuracy on HMDB51 without any HMDB51-specific fine-tuning, compared to 82.4% achieved with full fine-tuning (Table 3). This 6.2% gap reflects the expected domain shift between YouTube-sourced UCF101 clips and the more diverse HMDB51 sources (movies, web videos). In comparison, UniFormerV2 achieves 71.8% under the identical zero-shot protocol, confirming that NeuroPrisma’s causal deconfounding provides +4.4% generalization advantage by reducing reliance on dataset-specific spurious correlations such as scene context and background statistics.

The accuracy comparison with standard deviation error bars across runs is visualized in Figure 4.

4.5. Evaluation Metrics

We use a set of metrics in order to examine them carefully:

Top-1/Top-5 Accuracy (%): This may be defined as the percentage of correct classification of videos in which the ground-truth label is associated with the Top-1 prediction or is found in the Top-5 prediction.

Mean Class Accuracy (MCA, percent): The average per-class accuracy to take care of class imbalance.
F1-Score: MACRO-Score: The harmonic mean of the precision and the recall score when averaged across all the classes.
Causal Confusion Score (CCS): CCS quantifies model reliance on spurious correlations via a two-stage counterfactual protocol:
(1)
Causal features: High-attention temporal regions in the top PSA spectral band at motion onset, localized via gradient-weighted spectral activation maps, are masked while background features are preserved.
(2)
CCS measures the fraction of predictions changing after this intervention (Equation (14)). Models relying on spurious correlations maintain predictions when causal features are removed (higher CCS); causally grounded models show greater prediction sensitivity. CCS validity was verified on a synthetic SSv2 subset, where ground-truth causal features are known by designing directional push/pull actions where motion direction is the sole causal discriminator.

$CCS = 1 - \frac{1}{N} \sum_{i = 1}^{N} I [{\hat{y}}_{i} = {\hat{y}}_{i}^{c f}]$

(14)

where ${\hat{y}}_{i}^{c f}$ denotes the prediction after intervening to remove identified causal features.
CCS Validation on Synthetic Data: To validate the CCS metric, we constructed a synthetic SSv2 subset of 2847 video pairs from 18 directional action categories (e.g., “Pushing X from left to right” vs. “Pushing X from right to left,” “Moving X up” vs. “Moving X down”). These actions are structurally matched: identical objects, backgrounds, and hand movements, differing only in motion direction—the ground-truth causal discriminator. We trained NeuroPrisma and UniFormerV2 on this subset, then computed CCS by masking the directional motion component (identified via optical flow magnitude > 3 pixels/frame in the primary motion axis) while preserving static object and background features. Ground-truth causal reliance was measured as the accuracy drop when motion direction is randomized (oracle CCS). NeuroPrisma’s CCS of 0.198 correlates strongly with oracle CCS (Pearson r = 0.87, p < 0.001), while UniFormerV2’s CCS of 0.341 shows weaker correlation (r = 0.62), confirming that lower CCS reliably indicates greater reliance on causal features. This synthetic validation establishes CCS as a valid proxy for causal grounding in action recognition.
Inference Latency (ms): The total processed time per 16-frame video blocks on one A100.
Throughput (clips/s): The number of video clips that are handled each second.
GPU Memory (GB): Maximum memory usage when inference is run with a batch size of 1.

4.6. Extended Classification Metrics

To give a more holistic view of model performance, especially when dealing with data that has an imbalance in the classes, Table 4 also includes other measures such as Mean Class Accuracy (MCA), Macro F1-Score, and Area Under ROC Curve (AUC).

NeuroPrisma achieves the highest MCA and F1-Score in both datasets, with a significant improvement on SSv2, where MCA improved by a percentage of +2.9% compared to Top-1. This indicates that performance improvements are distributed across all action categories rather than concentrated on common classes, which is in line with the fact that the causal action mechanism is offering a class-agnostic improvement.

Analysis of per-class metrics reveals consistent performance across action categories: on UCF101, precision ranges from 0.91 (lowest: “PizzaTossing,” often confused with “JugglingBalls”) to 1.00 (13 categories including “BalanceBeam,” and “Archery”), with only four categories below 0.95. On SSv2, precision ranges from 0.52 (“Pretending to put X on Y”—high intra-class variability) to 0.89 (“Dropping X into Y”—clear motion signature), with 28 of 174 categories below 0.60, primarily fine-grained manipulations requiring sub-second temporal resolution (“Poking X” vs. “Touching X”). Minority class performance (categories with <500 training samples on SSv2) achieves 0.64 macro-averaged F1, only 3.4% below the overall 0.698, indicating that NeuroPrisma does not disproportionately favor majority classes. The worst-performing category (“Pretending to pick X up,” F1 = 0.47) suffers from annotation ambiguity rather than model failure, as confirmed by 31% inter-annotator disagreement in the SSv2 dataset paper [28]. These per-class results demonstrate that NeuroPrisma’s improvements generalize broadly rather than concentrating on a subset of easy categories.

4.7. Causal Evaluation

One key contribution of NeuroPrisma is better causal reasoning, which minimizes spurious correlation. We measure this by our Causal Confusion Score (CCS), which measures the extent of model predictions in the presence of intervention to eliminate the identified causal information, but holds fixed the confounding factors.

The comparison of CCS between datasets is in Table 5. Decreasing values imply greater causal knowledge (i.e., predictions are made based on true causal attributes, and not spurious correlations).

To complement CCS, we additionally report the Average Treatment Effect (ATE) of the CIL intervention, computed as the change in prediction confidence after backdoor adjustment:

\begin{array}{l} A T E = E [P (Y & = ŷ | Z^c i l)] - E [P (Y = ŷ | Z^p s a)] \\ = (1 / N) \cdot Σ_{i} = 1 ᴺ [s o f t m a x (f (Z_{i}^c i l))_y_{i} \\ - s o f t m a x (f (Z_{i}^p s a))_y_{i}] \end{array}

where N is the validation set size, f(·) is the classification head, and y_i is the ground-truth label for sample i. ATE is computed per-sample and then averaged across the dataset, measuring how much the intervention shifts probability mass toward the correct class. NeuroPrisma achieves ATE = 0.142 ± 0.019 on SSv2 (95% CI: [0.105, 0.179], n = 24,777 validation samples) and ATE = 0.089 ± 0.012 on UCF101 (95% CI: [0.066, 0.112], averaged across three splits, where n = 3783 per split), confirming that CIL meaningfully redistributes prediction probability from spurious toward causally supported features. Positive ATE indicates that causal intervention increases correct-class confidence, validating the backdoor adjustment mechanism.

NeuroPrisma has an average reduction of CCS by 41.5% relative to the optimal baseline (UniFormerV2), and it supports the conclusion that the Causal Intervention Layer is successful in disregarding the use of spurious correlations. This increase is exceptionally steady among datasets (40.3%42.1%), and the most significant decrease on SSv2 (42.1%), where temporal confounding (e.g., object appearance and motion on background) is most common.

4.7.1. Quantitative Causal Metrics

Beyond CCS, we quantify causal intervention effectiveness through two metrics: Average Treatment Effect (ATE) and Confounding Bias Elimination Rate (CBER).

Average Treatment Effect (ATE): ATE = E[P(Y = y_true|Z^cil)] − E[P(Y = y_true|Z^psa)] measures the change in correct-class prediction confidence after CIL intervention. Results across datasets: UCF101 ATE = +0.089 (95% CI: [0.066, 0.112], n = 3783), HMDB51 ATE = +0.124 ([0.095, 0.153], n = 1530), SSv2 ATE = +0.142 ([0.105, 0.179], n = 24,777), NTU X-Sub ATE = +0.076 ([0.054, 0.098], n = 16,560), NTU X-View ATE = +0.091 ([0.068, 0.114], n = 16,560). All p < 0.001. Positive ATE confirms that causal intervention increases correct-class probability, with the largest improvement on SSv2 (+14.2 percentage points), where fine-grained temporal actions suffer more from confounding.

4.7.2. Confounding Bias Elimination Rate (CBER)

CBER = (Bias_before − Bias_after)/Bias_before × 100% quantifies confounder influence reduction. On the SSv2 subset (n = 500 manually annotated clips), CIL reduces bias for scene type (0.412 → 0.187, CBER = 54.6%), camera viewpoint (0.368 → 0.164, CBER = 55.4%), and lighting (0.293 → 0.151, CBER = 48.5%), achieving an average CBER = 52.8% (all p < 0.001). However, 18.7% residual scene bias remains, indicating partial rather than complete deconfounding. Without CIL, NeuroPrisma’s scene bias (0.412) matches baselines (TimeSformer: 0.437, UniFormerV2: 0.401), confirming CIL is the primary deconfounding driver.

The results of causal evaluation are visualized in Figure 5, and the comparison of the attention maps is presented in a qualitative way.

4.7.3. Sensitivity Analysis Under Unmodeled Confounders

To assess the robustness of the CIL module when the causal sufficiency assumption (Section 3.1) is violated, we conducted a sensitivity analysis by injecting synthetic unmodeled confounders into the SSv2 validation set (n = 24,777 clips). Three categories of latent confounders were simulated: (1) patient fatigue, modeled as progressive frame-level brightness decay of 5–15% over clip duration; (2) therapist guidance, modeled as periodic foreground occlusion via 10 × 10 pixel patches at random temporal intervals; and (3) medication-induced motor variability, modeled as additive Gaussian jitter was applied to optical flow fields (σ = 0.05–0.15). Each confounder type was injected independently and in combination, with results averaged across three random seeds (42, 123, and 456).

Under single-confounder injection, NeuroPrisma’s CCS degrades from the baseline 0.267 to a range of 0.318–0.342, representing a 19.1–28.1% reduction in causal deconfounding effectiveness relative to the clean-data condition. Under combined three-confounder injection, CCS rises to 0.371, corresponding to a 39.0% degradation. Classification accuracy on SSv2 declines by 1.8–3.2% under single-confounder conditions and by 4.7% under combined injection (71.2% → 66.5%). These results confirm that the K = 128 confounder prototype vocabulary provides partial rather than complete deconfounding: effectiveness degrades by 20–30% when latent variables absent from prototype initialization are introduced at test time. This degradation is consistent with the Markov condition and causal sufficiency limitations acknowledged in Section 3.1. Notably, UniFormerV2 without CIL exhibits a larger CCS increase of 0.364 → 0.489 (+34.3%) under identical three-confounder injection, confirming that NeuroPrisma retains a meaningful causal advantage even under partial assumption violation. Future work will address these limitations through domain-adaptive prototype learning using semi-supervised confounder annotations derived from rehabilitation-specific datasets, and through synthetic data experiments with oracle causal labels to validate spectral-causal disentanglement under controlled confounding scenarios.

4.8. Ablation Studies

The role of each of the proposed components was studied in systematic ablation studies. The ablation experiments utilize SSv2 as the main testbed, as it focuses on the temporal reasoning, which most effectively shows the advantages of our spectral-causal approach.

4.8.1. Component-Wise Contribution

Table 6 shows the contribution of each of the modules on top of a ViT-B baseline, adding each component in turn.

The major results of the component-wise ablation are as follows:

Temporal Self-Attention (A → B, +3.6%): Temporal attention applied to spatial ViT gives significant improvements, which proves the significance of explicit temporal modeling.
PSA Module (B → C, +2.6%): Spectral decomposition allows distinguishing multi-scale temporal features with discrimination, and CCS has decreased 17.3%, which means that the processing in the frequency domain biased to remove some of the confounding factors.
CIL Module (C → D): Causal intervention gives high accuracy improvements, but also significantly decreases CCS by 29.9% (0.4120289), confirming that deconfounding also occurs.
MI Regularization (D → E, +0.7%): The mutual information constraint makes another significant step to ensure that the features are independent of confounders, although with a small gain.
Combined Effect: The complete NeuroPrisma achieves a +10.0% improvement on baseline, and the PSA and CIL demonstrate a complementary effect (spectral discrimination + causal deconfounding).

Figure 6 illustrates the results of the ablation study.

4.8.2. PSA-CIL Synergy Analysis

To quantify independent versus synergistic contributions, we evaluated four configurations on SSv2: baseline (neither module), PSA-only, CIL-only, and PSA+CIL (full NeuroPrisma).

Independent contributions: PSA-only achieves 67.4% (+2.6% over 64.8% baseline), CIL-only achieves 66.9% (+2.1%). Expected additive effect: 64.8 + 2.6 + 2.1 = 69.5%. Observed combined: 71.2%. Synergistic gain: 71.2 − 69.5 = +1.7%. Synergy coefficient σ = 1.7/6.4 = 26.6%, meaning 26.6% of total improvement arises from PSA-CIL interaction rather than independent contributions.

Mechanism: PSA creates spectrally disentangled features that enable CIL to estimate confounders more accurately—confounder prototype silhouette score improves, 0.34 → 0.58 (+70%), and NWGM approximation error reduces by 50%. Cross-dataset validation: UCF101 σ = 32.1%, SSv2 σ = 26.6%, NTU σ = 19.4%, and HMDB51 σ = 0% (appearance-dominated, no temporal benefit). Average synergy: 19.5%. Synergy correlates with temporal complexity, suggesting rehabilitation actions (multi-scale motions) will benefit from PSA-CIL interaction.

4.8.3. Number of Spectral Bands

Table 7 investigates the effect of differentiating the number of spectral bands B in the PSA module.

The optimal tradeoff is B = 4, peaking in accuracy with moderate computational overhead. With T = 16 frames, the DFT yields eight unique frequency bins (Nyquist limit), making B = 4 the maximum non-trivial partitioning (two bins per band). At B = 8, each bin is processed individually, eliminating within-band aggregation that provides robustness to DFT quantization noise. The borderline case, B = 16 (Top-1 = 70.6%, Table 7), confirms diminishing returns, a 0.6% accuracy drop with 23% additional latency versus B = 4. This yields a principled design rule: B = T/4, meaning B = 8 would be optimal for T = 32 inputs. B < 4 provides insufficient multi-scale temporal coverage, as confirmed by B = 1 (Top-1 = 68.4%, CCS = 0.312).

4.8.4. Number of Confounder Prototypes

Table 8 examines the impact of the number of confounders’ prototypes K in the CIL module.

The optimal K = 128 achieves the best CCS with reasonable memory overhead (+48 MB). At K = 256, multiple prototypes collapse onto similar scene-context clusters verified by pairwise cosine similarity > 0.85 for ~40% of prototype pairs, yielding redundant rather than diverse confounder coverage. The borderline case K = 256 (Top-1 = 71.1%, Table 8) shows only 0.1% gain over K = 128 while doubling memory (+96 MB) and increasing training time by 14.7% versus 8.4%, making the marginal gain practically insignificant. Critically, K = 32 represents a genuine failure case: CCS = 0.298 versus 0.267 at K = 128 an 11.6% causal deconfounding degradation, confirming that insufficient prototype vocabulary meaningfully harms causal reasoning even when classification accuracy degrades only modestly (69.8% vs. 71.2%). For datasets with greater demographic or environmental diversity, such as multi-site clinical data, a larger K may be warranted.

4.8.5. Cross-Dataset Spectral Analysis

Frequency-domain energy distribution is as follows:

-: UCF101: Motion energy 2–12 Hz (73.2%), Band 3 + 4 = 81.4% mutual information;
-: SSv2: Uniform distribution (Band 1: 18.3%, Band 2: 26.1%, Band 3: 31.4%, and Band 4: 24.2%);
-: NTU RGB+D: Low-frequency dominant DC–4 Hz (64.7%), with Bands 1 + 2 = 68.9% information.

Finding: Fixed four-band partition is suboptimal across domains—Band 3–4 were causal for UCF101, and Band 1 was causal for NTU. YouTube rehabilitation videos (847 clips): Pre-trained UCF101 model → 34.2% accuracy (severe domain shift); after fine-tuning (600 clips) → 68.4%, where CCS 0.512 → 0.338. Gap: In total, 2.8% below SSv2 (71.2%), confirming that rehabilitation-specific challenges require domain adaptation.

4.9. Hyperparameter Sensitivity Analysis

We use the sensitivity analysis of important hyperparameters to determine the robustness and inform the practical implementation. The sensitivity analysis is presented in Figure 7.

Table 9 is a quantification of the sensitivity to the two most important hyperparameters.

NeuroPrisma also exhibits strong results in a sensible range of hyperparameters, with a reduction in accuracy below 1.6 percent when λ MI changes by five-fold or three-fold compared to the optimal levels or when changing three betas. The same stability implies that default settings are applicable across datasets with minimal tuning required.

4.10. Computational Efficiency

Table 10 makes a comparison of computational requirements, which shows that NeuroPrisma is more accurate and competitive in terms of efficiency.

NeuroPrisma attains positive efficiency attributes:

The 23.4 ms latency is achieved through three efficiency optimizations. First, PSA operates in the frequency domain with DFT complexity O(T log T) versus O(T²) for full temporal self-attention, reducing temporal attention cost ~4× for T = 16. Second, band-specific attention matrices A_b are computed in parallel across B = 4 bands via batch matrix multiplication. Third, the CIL’s NWGM approximation (Equation (9)) is implemented as a fused CUDA kernel that computes the weighted geometric mean in log-space to avoid numerical overflow: log(NWGM) = Σₖ wₖ·log(Zₖ) − log(Σⱼ exp(Σₖ wₖ·log(Z_k^(j)))). This formulation replaces K sequential product operations with K parallel log-sum operations, reducing per-sample complexity from O(K·D) sequential multiplications to O(K·D) parallelizable additions. The final subtraction Z^psa − β·NWGM(·) (Equation (9)) is vectorized across the batch dimension, enabling single-pass execution. Empirical profiling shows this CUDA implementation achieves 3.2× speedup over naive CPU-based geometric mean computation for K = 128, contributing to the overall 4.3 ms overhead of PSA+CIL modules.

Shortest transformer latency: In total, 23.4 ms (17× faster than UniFormerV2 28.3 ms);
Maximum throughput: In total, 42.7 clips/s (21% more than UniFormerV2);
Moderate memory: In total, 6.2 GB (16 percent lower than UniFormerV2);
Competitive FLOPs: In total, 142.7 G (11 percent lower than UniFormerV2).

Their efficiency benefits are due to the frequency domain computation of the PSA module that lessens the quadratic complexity of a full spatiotemporal attention by independently computing spectral bands and subsequently assembling them.

The accuracy-efficiency Pareto frontier is visualized in Figure 8.

Latency Optimization Breakdown

To achieve the 23.4 ms inference latency on NVIDIA A100 (80 GB) GPU, we applied a systematic optimization pipeline with five complementary strategies. Table 11 quantifies the individual contribution of each optimization approach, demonstrating cumulative latency reduction from a 68.7 ms baseline to the final 23.4 ms optimized configuration.

Per-module impact: PSA 31.5 → 11.3 ms (fused kernel −64%), CIL 8.4 → 3.8 ms (log-space −55%). The edge device (Jetson AGX Orin) was 67.3 ms via INT8 quantization + fused kernels.

4.11. Statistical Significance Analysis

Table 12 provides a strict statistical comparison of NeuroPrisma and the best baseline (UniFormerV2) with paired t-tests and Bonferroni correction.

The effects of all the improvements are highly significant (p < 0.001 with Bonferroni correction) with medium-to-large effect sizes (Cohen d = 0.721.24). The mean effect size of 0.95 represents practically significant changes over statistical significance. Reliable and reproducible gains ensured by narrow 95% confidence intervals.

4.12. Per-Category Performance Analysis

To determine the areas of the most significant benefits of the causal reasoning of NeuroPrisma, we compare the improvements in terms of per-category accuracy on SSv2. Figure 9 presents the analysis.

As can be analyzed, NeuroPrisma offers the most significant changes on

Directional actions (+4.8% avg.): Pushing [left → right] vs. pushing [right → left] in which the direction of motion only varies.
Time order (+4.2 avg.): “Placing X on Y vs. Removing X on Y” requires knowledge of the action sequence.
Actions, which occur faster (+3.6% avg.): “Moving slowly” vs. “Moving quickly”, where spectral decomposition reveals signatures of discriminative frequencies.
Subtleties of object-interaction (+3.1% avg.): Patterns of hand–object contact vary in action.

Figure 9, per-category analysis on SSv2. (a) Distribution of accuracy improvement across 174 SSv2 categories relative to UniFormerV2, with a mean gain of 2.9% and a median of 2.8%, indicating consistent improvements across the majority of categories. (b) Categories with the most substantial gains, which mainly involve differences in the fine-grained temporal aspects, such as directional and relational motion actions (e.g., “Pushing left → right,” “Moving slowly”). (c) Average accuracy improvement grouped by action type, showing that directional actions benefit most (+4.8%), while object appearance-based categories show minimal gain (+0.8%), suggesting that object identity is not a dominant causal feature in NeuroPrisma’s predictions. (d) NeuroPrisma confusion matrix for the six most temporally similar action categories, with per-cell improvement over UniFormerV2 shown in parentheses, demonstrating lower inter-category confusion and stronger discrimination of temporally similar actions.

On the other hand, the difference is not as significant (+0.8% avg.) in categories that are largely differentiated by the appearance of the object (e.g., Holding [phone] vs. Holding [cup]), where causal intervention is not as much influential because object identity is an acceptably causal feature. To identify NeuroPrisma’s applicable boundaries, SSv2 performance was stratified by action complexity (simple: single-object/single-motion, 87 categories vs. complex: multi-object/multi-step, 87 categories) and video duration (short: <3 s, 38% of clips; long: ≥3 s, 62% of clips). NeuroPrisma achieves 73.4% (simple) and 69.1% (complex) Top-1, and 72.8% (short) vs. 70.1% (long), indicating modest degradation on complex and longer videos consistent with the 16-frame limitation noted in Section 4.15.4.

Statistical significance testing confirms that these stratification differences are robust: for action complexity, the 4.3% gap between simple (73.4%) and complex (69.1%) categories is statistically significant (paired t-test, t = 6.82, p < 0.001, Cohen’s d = 0.91, 95% CI: [3.1%, 5.5%], n = 87 categories per stratum). For video duration, the 2.7% gap between short (72.8%) and long (70.1%) clips is also significant (t = 4.15, p < 0.001, d = 0.68, 95% CI: [1.4%, 4.0%]). These results indicate that while NeuroPrisma maintains strong performance across stratifications, complex multi-step actions and longer temporal sequences remain incrementally more challenging—consistent with the 16-frame architectural constraint noted in Section 4.15.4.

4.13. Comprehensive Performance Summary

A detailed summary of all the evaluation dimensions is presented in Table 13.

4.14. Ethical Considerations and Dataset Bias

Even though publicly available datasets can facilitate large-scale assessment and reproducibility, they might be demographically, environmentally, and culturally biased, influencing model generalization during rehabilitation. An example is that UCF101 and HMDB51 have subjects that primarily comprise young and healthy individuals and an outdoors or sport-based setting, and NTU RGB+D mostly surrounds controlled indoor settings. This type of skew can restrict the extrapolation of learned models to older patients or patients with mobility problems or realistic clinical rehabilitation conditions. Additionally, gender, ethnicity, and physical ability are not equally represented in these benchmarks. To address these shortcomings, future studies will involve gathering balanced clinical data, adopting fairness-conscious and bias-sensitive learning approaches, and testing the framework proposed in practice in actual rehabilitation settings.

4.15. Discussion

The experimental results comprehensively demonstrate that NeuroPrisma achieves state-of-the-art performance by integrating spectral decomposition and causal intervention in a principled manner. The major findings, clinical implications, and limitations are discussed below.

4.15.1. Synergy of Spectral and Causal Components

It is a joint advantage of both PSA and CIL modules. The PSA module facilitates discriminative multi-scale temporal features by decomposing motion dynamics into frequency bands, wherein slow-motion gestures are captured in low-frequency bands and fast movements are captured in high-frequency bands. Such spectral separation necessarily confounds less in that motion speed and action semantics will be unconfounded. The CIL module subsequently eliminates residual confounding effects from scene context, object co-occurrence, and shortcuts of time using explicit causal intervention.

The ablation study (Table 6) shows that the PSA and CIL combination exhibits super-additive improvements: PSA alone offers +2.6%, and CIL alone (without PSA) offers +2.1%, but when combined, it offers +5.0%, which shows that spectral features are the ones that are most susceptible to causal deconfounding.

4.15.2. Temporal Reasoning Capability

The most significant improvements on SSv2 (+2.9) confirm the improved temporal understanding of NeuroPrisma. SSv2 requires differentiating actions such as “pushing something from left to right” versus “from right to left”, as well as the tasks where the appearance of the object in the still position does not reveal anything, and only the dynamics of motion can help to determine what to do. The 42.1% CCS drop on SSv2 (best of all datasets) shows that NeuroPrisma can learn features of causally relevant temporal association and is not taking shortcuts on specific datasets.

The per-category analysis (Figure 9) further shows the changes to be focused on the temporally subtle differences, which are improved by +4.8 percent in directional actions and +4.2 percent in temporal ordering, the areas in which causal reasoning is most useful.

4.15.3. Clinical Applicability for Rehabilitation

NeuroPrisma has three key benefits required for use in neural rehabilitation systems:

Latency: In total, 23.4 ms latency can be deployed in interactive therapy systems that need immediate feedback (less than 100 ms latency to respond to interaction).
Causal validity: This is because reduced dependency on confounders (41.5% CCS reduction) guarantees that the model is responsive to true movement changes and not artifacts of the environment, which is necessary under clinical conditions where the background, lighting, and camera composition can be different among sessions.
Interpretability: Spectral decomposition gives an insight into which temporal scales contribute to prediction, which allows clinicians to know whether slow postural adjustments or quick corrective movements lead to the classification of actions.

The translation from benchmark performance to clinical relevance operates through three mechanisms: (1) SSv2’s fine-grained temporal-category directional pushes, where speed-differentiated motions are structurally analogous to rehabilitation motion discriminations such as distinguishing compensatory arm swing from controlled reach or tremor-affected versus smooth trajectories; (2) the 42.1% CCS reduction on SSv2 confirms that NeuroPrisma relies on temporal motion dynamics rather than background context, the precise requirement for session-invariant rehabilitation monitoring where patient environments vary; (3) per-category gains of +4.8% on directional and +4.2% on temporal-ordering actions (Figure 9) map directly to clinical tasks such as weight-shift direction in stroke gait assessment and multi-step reaching sequencing in upper-limb recovery. These constitute indirect evidence of clinical relevance; direct patient-cohort validation remains necessary future work.

4.15.4. Limitations and Future Directions

Still, even with good performance, there are several limitations that need to be discussed:

Fixed spectral partitioning: The current design employs fixed frequency bands. Adaptive partitioning according to the nature of the input (e.g., video frame rate and speed of action) may enhance generalization to a variety of situations.
Unsupervised confounder discovery: The CIL module is an unsupervised confounder prototype learner. Leveraging domain knowledge on confounders known (e.g., camera viewpoint and subject identity) may increase the efficacy of deconfounding in particular settings.
RGB-only testing: Evaluation was conducted on RGB only, but skeleton, depth, or IMU data are frequently available to rehabilitation applications. Modality-specific spectral decomposition of NeuroPrisma can be an avenue to extend to multi-modal inputs.
Offline training paradigm: The existing paradigm involves offline training of labeled data. Future work should develop online adaptation mechanisms for individual rehabilitation situations when characteristics of patients change during therapy sessions.
Extended video comprehension: The present assessment was conducted with 16-frame clips. Hierarchical temporal modeling would be needed to extend the spectral-causal approach to longer videos (e.g., full rehabilitation exercises lasting minutes) than the current architecture.

These shortcomings will be addressed in future research by adaptive spectral partitioning, semi-supervised confounder learning with domain constraints, and with multi-modes long-form video understanding to provide a full rehabilitation evaluation.

16-frame limitation: T = 16 frames = 0.53 s at 30 fps. SSv2 stratification: Short actions (<3 s) 72.8% and long actions (>6 s) 67.3% (−5.5%). Rehabilitation exercises (sit-to-stand: 2–4 s, gait: 5–10 s) exceed the window. T = 32 ablation: In total, +1.4% accuracy but 2.9× latency (67.8 ms), and 2.4× memory (14.7 GB). Sparse sampling (stride = 4): In total, +0.6% accuracy, +11.5% latency—viable for longer videos.

Fixed spectral bands: Learnable band boundaries (SSv2): In total, 71.7% vs. 71.2% fixed (+0.5%), converged at [1.8, 5.2, 9.4] Hz. Per-action adaptive: In total, 72.1% (+0.9%), with 174× parameters. However, learned band boundaries exhibit a transferability limitation: when boundaries optimized on SSv2 (converged at [1.8, 5.2, 9.4] Hz) are applied directly to UCF101 without re-optimization, Top-1 accuracy degrades by −0.4% (98.7% → 98.3%), confirming that dataset-specific spectral characteristics reduce cross-domain applicability of adaptive boundaries. Fixed bands during development, adaptive learning during clinical fine-tuning, is therefore the recommended strategy. Recommendation: Fixed bands should be used during development, and adaptive learning during clinical fine-tuning.

RGB-only: NTU RGB+D multi-modal fusion experiments: Late fusion (RGB + skeleton) 93.2% vs. 91.5% RGB-only (+1.7%); mid-level fusion 93.8% (+2.3%), CCS 0.189 vs. 0.194. Skeleton fusion assumes reliable pose estimation (fails for severe impairments). Future: Modality-agnostic fusion, depth/IMU integration.

No patient adaptation (simulation study): To provide preliminary evidence for future continual learning development, we conducted a controlled simulation study using NTU RGB+D. Ten subjects were held out sequentially to simulate patient-specific deployment. In the no-adaptation condition, accuracy was 83.4% (vs. 91.5% in mixed-subject training), confirming significant performance degradation when models encounter unseen patient motion profiles. Full fine-tuning recovered accuracy to 89.7% but caused catastrophic forgetting, with Patient 1’s accuracy dropping from 91.2% to 76.8% after sequential updates. Elastic Weight Consolidation (EWC) mitigated forgetting, achieving 88.3% with retained per-patient performance of 85.6%. These results are from a controlled simulation and do not constitute a prospective patient study; they are reported to motivate and inform future work on few-shot patient adaptation (5–10 clips per patient) using meta-learning strategies.

4.15.5. Benchmark Performance vs. Clinical Deployment Readiness

Benchmarks vs. clinical reality: (1) Dataset domain shift—benchmarks contain healthy subjects in controlled environments; rehabilitation patients exhibit pathological movements (spasticity, tremor, and compensatory strategies); (2) metric misalignment—Top-1 accuracy measures discrete categories, where clinical utility requires movement quality assessment (range of motion, asymmetry, and kinematic deviations); (3) deployment constraints—edge latency achieved (67.3 ms Jetson), but patient privacy, EHR integration, and clinician interpretability were not evaluated; (4) regulatory requirements—FDA clearance requires prospective trials demonstrating safety/efficacy (not conducted). Interpretation: NeuroPrisma demonstrates *potential* for rehabilitation applications based on benchmark validation; clinical efficacy remains unvalidated. The framework is a research prototype, not a deployment-ready tool.

4.16. Edge Hardware Evaluation and Domain Robustness

To assess clinical deployment feasibility, NeuroPrisma was evaluated on NVIDIA Jetson AGX Orin (275 TOPS, 32 GB) with INT8/TensorRT quantization, achieving 67.3 ms latency per clip within the <100 ms interactive feedback threshold at 97.9% UCF101 Top-1 accuracy (0.8% FP32 degradation). UniFormerV2 requires 112.4 ms under identical conditions. Power consumption was measured using NVIDIA Jetson Power GUI under a continuous inference workload (batch size 1, sustained operation). NeuroPrisma achieves an average power draw of 18.7 W during active inference (TensorRT FP16 and GPU+DLCORES), compared to UniFormerV2’s 24.3 W, yielding 23% improved power efficiency. Peak power consumption reaches 22.1 W during initial model loading. For battery-powered rehabilitation devices, this translates to approximately 3.2 h of continuous operation on the Jetson AGX Orin’s standard 64 Wh battery capacity, sufficient for a typical clinical therapy session (45–90 min) with a substantial margin. Idle power (model loaded, no inference) is 6.4 W. These metrics position NeuroPrisma as viable for untethered edge deployment in real-world rehabilitation environments where AC power may be unavailable.

For domain robustness evaluation, three corruption protocols were applied to the SSv2 validation set (n = 24,777 clips) at test time only, without retraining: (1) Gaussian noise with σ = 0.1 was applied independently to each frame in the RGB space (mean = 0, clipped to [0, 1]); (2) brightness and contrast perturbation with multiplicative factors were sampled uniformly from [0.7, 1.3] for brightness and [0.7, 1.3] for contrast, applied consistently across all 16 frames per clip to simulate lighting changes within a single video; (3) rectangular occlusion covering 20% of the frame area (89 × 89 pixels at 224 × 224 resolution) was positioned randomly per clip but fixed across all frames within that clip, filled with mean pixel value [0.485, 0.456, 0.406]. Each corruption was evaluated independently across three random seeds, with results averaged. NeuroPrisma retains 68.1 ± 0.3% (noise), 69.4 ± 0.2% (brightness/contrast), and 67.8 ± 0.4% (occlusion) Top-1 accuracy, versus UniFormerV2’s 64.2 ± 0.5%, 65.7 ± 0.4%, and 62.9 ± 0.6%, respectively. The +3.9% average robustness advantage confirms that causal deconfounding improves generalization under distribution shift, a critical requirement for cross-session rehabilitation monitoring where environmental conditions vary.

4.17. Preliminary Rehabilitation Dataset Evaluation

RehabAction-847 Dataset (self-collected, YouTube source, January–March 2025):

847 clips, 12 categories: Arm Raise (Flexion/Abduction), Elbow Flexion, Wrist Rotation, Sit-to-Stand, Gait (Forward/Lateral), Balance (Single-Leg/Tandem), Reaching, Leg Lift, Ankle Dorsiflexion
Subjects: ages 28–76, 58% male, pathologies: stroke (34%), Parkinson’s (18%), CP (12%), and general (36%)
Train/Val/Test: 600/100/147 clips
Annotation: two physical therapists, κ = 0.81. Results (Test n = 147) Table 14 shows the Comparison of zero-shot, fine-tuned performance, and CCS across TimeSformer, UniFormerV2, and NeuroPrisma on the test set (n = 147).

Finding: Severe domain shift (34.2% vs. 98.7% UCF101). Fine-tuning recovers up to 68.4% (2.8% below SSv2), confirming rehabilitation-specific motion variability.

Limitations: Small scale (847 clips), no severity scores, and uncontrolled quality. Future: In total, 5000+ clips, multi-view, skeleton, with clinical correlation (Fugl–Meyer and UPDRS).

5. Conclusions

NeuroPrisma is a research prototype demonstrating potential for causality-aware action recognition in neural rehabilitation systems, grounded in frequency-domain spectral decomposition (Prismatic Spectral Attention module) and Structural Causal Model-based intervention (Causal Intervention Layer). Extensive evaluation on four benchmark datasets demonstrated state-of-the-art accuracies: 98.7% on UCF101, 82.4% on HMDB51, 71.2% on Something-Something V2, and 91.5%/95.8% on NTU RGB+D (Cross-Subject/Cross-View), with a 41.5% Causal Confusion Score reduction relative to UniFormerV2. All improvements are statistically significant (p < 0.001, Bonferroni-corrected, Cohen’s d = 0.72–1.24). Real-time latency: 23.4 ms (NVIDIA A100) and 67.3 ms (Jetson AGX Orin). Causal metrics: ATE +0.142 (SSv2), and CBER 52.8%. Synergy analysis shows 26.6% of performance gains arise from PSA-CIL interaction.

Critical Limitations:

(1): Benchmark-only evaluation: All experiments were conducted on publicly available general-purpose datasets (UCF101, HMDB51, SSv2, and NTU RGB+D) under controlled conditions. Direct clinical validation with neurological patient populations has not been performed. Preliminary evaluation on self-collected rehabilitation dataset (RehabAction-847) achieved 68.4% versus 71.2% on SSv2, confirming substantial domain shift.
(2): Partial deconfounding: CIL relies on causal assumptions (Markov condition, causal sufficiency) that may be violated in real-world data. Sensitivity analysis showed 20–30% degradation when unmodeled confounders are introduced. Residual bias of 16.7% (CBER analysis) persists after intervention.
(3): Fixed architectural constraints: A 16-frame temporal window limits performance on longer exercises (5.5% accuracy drop on videos > 6 s). Fixed spectral bands may be suboptimal for pathological movements (e.g., Parkinsonian tremor at 4–6 Hz overlaps with intentional movements). RGB-only input lacks depth, skeletal, and physiological modalities used in clinical assessment.
(4): No clinical outcome correlation: No analysis was conducted between NeuroPrisma predictions and standardized clinical scores (Fugl–Meyer Assessment, UPDRS, and GMFCS). It remains unknown whether classifications correspond to clinically meaningful functional levels.
(5): Regulatory requirements unaddressed: Clinical deployment requires IRB-approved trials, regulatory clearance (FDA 510(k), CE marking), EHR integration, and privacy compliance (HIPAA and GDPR)—none addressed in this work.

Positioning: NeuroPrisma is a research prototype demonstrating potential for causality-aware action recognition in rehabilitation contexts based on benchmark performance and theoretical principles. Clinical deployment readiness requires completing the validation pathway outlined below. The framework’s contribution lies in establishing a methodologically principled approach to unifying spectral temporal analysis with formal causal intervention, providing a foundation for future research in computer vision and rehabilitation technology.

Future Work:

Immediate steps: (1) Clinical dataset collection—collaborate with rehabilitation hospitals to collect large-scale labeled dataset (target: 5000+ clips) with multi-view RGB-D video, skeleton annotations, and clinical severity scores (Fugl–Meyer, UPDRS, and GMFCS); (2) multi-modal extension—integrate depth, pose, as well as wearable IMU signals via mid-level fusion; (3) adaptive spectral partitioning—replace fixed frequency bands with learnable boundaries optimized per-dataset; and (4) patient-specific continual learning—develop few-shot adaptation for personalization from limited patient data (5–10 clips per patient).

Long-term vision: (5) Prospective clinical validation—conduct IRB-approved longitudinal studies with stroke, Parkinson’s, and cerebral palsy cohorts to establish correlation with clinical outcomes and inter-rater reliability; (6) hierarchical temporal modeling—extend architecture to process full-length rehabilitation sessions (minutes-long videos) via hierarchical aggregation; and (7) explainability enhancement—develop clinician-facing interfaces that visualize spectral attention and causal intervention decisions.

NeuroPrisma advances the algorithmic component of automated movement assessment systems. Clinical translation requires rigorous validation, regulatory compliance, and human-centered design to augment rather than replace clinician expertise.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2601).

Data Availability Statement

The datasets used in this study are publicly available from their official sources (accessed on 15 December 2025): UCF101: https://www.crcv.ucf.edu/data/UCF101.php; HMDB51: https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/; Something-Something V2: https://www.kaggle.com/datasets/ipythonx/ssv2test; and NTU RGB+D: https://rose1.ntu.edu.sg/dataset/actionRecognition/. The implementation code is publicly available at https://github.com/Hend-Alshaya/NeuroPrism.

Acknowledgments

The author would like to thank the Applied College, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia, for supporting this study.

Conflicts of Interest

The author declares no conflict of interest.

References

World Health Organization. “Rehabilitation”, WHO Fact Sheet, 2023. [Online]. Available online: https://www.who.int/news-room/fact-sheets/detail/rehabilitation (accessed on 11 November 2024).
Xu, M.; Xiong, Y.; Chen, H.; Li, X.; Xia, W.; Tu, Z.; Soatto, S. Long short-term transformer for online action detection. Adv. Neural Inf. Process. Syst. 2021, 34, 1086–1099. [Google Scholar]
Sun, W.; Ma, Y.; Wang, R. k-NN attention-based video vision transformer for action recognition. Neurocomputing 2024, 574, 127264. [Google Scholar] [CrossRef]
Chen, C.; Zhang, C.; Dong, X. A multi-scale video longformer network for action recognition. Appl. Sci. 2024, 14, 1061. [Google Scholar] [CrossRef]
Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2019; pp. 6202–6211. [Google Scholar] [CrossRef]
Zang, C.; Wang, H.; Pei, M.; Liang, W. Discovering the real association: Multimodal causal reasoning in video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 19027–19036. [Google Scholar] [CrossRef]
Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; et al. Spectral temporal graph neural network for multivariate time-series forecasting. Adv. Neural Inf. Process. Syst. 2020, 33, 17766–17778. [Google Scholar]
Chen, D.; Wu, P.; Chen, M.; Wu, M.; Zhang, T.; Li, C. LS-VIT: Vision transformer for action recognition based on long and short-term temporal difference. Front. Neurorobot. 2024, 18, 1457843. [Google Scholar] [CrossRef]
Guo, D.; Li, K.; Hu, B.; Zhang, Y.; Wang, M. Benchmarking micro-action recognition: Dataset, methods, and applications. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5493–5507. [Google Scholar] [CrossRef]
Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Volume 139, pp. 813–824. [Google Scholar]
Han, H.; Zeng, H.; Kuang, L.; Han, X.; Xue, H. A human activity recognition method based on vision transformer. Sci. Rep. 2024, 14, 15310. [Google Scholar] [CrossRef]
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563. [Google Scholar] [CrossRef]
Berrevoets, J.; Kacprzyk, K.; Qian, Z.; van der Schaar, M. Causal deep learning: Encouraging impact on real-world problems through causality. Found. Trends Signal Process. 2024, 18, 200–309. [Google Scholar] [CrossRef]
Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 6299–6308. [Google Scholar] [CrossRef]
Pearl, J. Causality: Models, Reasoning, and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Huang, J.-H.; Yang, C.-H.H.; Chen, P.-Y.; Brown, A.; Worring, M. Causal video summarizer for video exploration. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, Y.; Wang, L.; Qiao, Y. UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12375–12391. [Google Scholar]
Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild; Tech. Rep. CRCV-TR-12-01. [Online]; The Center for Research in Computer Vision (CRCV) at the University of Central Florida: Orlando, FL, USA, 2012; Available online: https://www.crcv.ucf.edu/data/UCF101.php (accessed on 7 April 2026).
Jiao, L.; Wang, Y.; Liu, X.; Li, L.; Liu, F.; Ma, W.; Guo, Y.; Chen, P.; Yang, S.; Hou, B. Causal inference meets deep learning: A comprehensive survey. Research 2024, 7, 0467. [Google Scholar] [CrossRef]
Elnady, M.; Abdelmunim, H.E. A novel YOLO LSTM approach for enhanced human action recognition in video sequences. Sci. Rep. 2025, 15, 17036. [Google Scholar] [CrossRef]
Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 531–540. [Google Scholar]
Jin, M.; Zheng, Y.; Li, Y.-F.; Chen, S.; Yang, B.; Pan, S. Multivariate time series forecasting with dynamic graph neural ODEs. IEEE Trans. Knowl. Data Eng. 2023, 35, 9168–9180. [Google Scholar] [CrossRef]
Tancik, M.; Srinivasan, P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.; Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. Adv. Neural Inf. Process. Syst. 2020, 33, 7537–7547. [Google Scholar]
Guan, Q.-L.; Zheng, Y.; Meng, L.; Dong, L.-Q.; Hao, Q. Improving the generalization of visual classification models across IoT cameras via cross-modal inference and fusion. IEEE Internet Things J. 2023, 10, 15835–15846. [Google Scholar] [CrossRef]
Cai, R.; Zhu, Y.; Qiao, J.; Liang, Z.; Liu, F.; Hao, Z. Where and how to attack? A causality-inspired recipe for generating counterfactual adversarial examples. Proc. AAAI Conf. Artif. Intell. 2024, 38, 11132–11140. [Google Scholar] [CrossRef]
Goyal, R.; Kahou, S.E.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The ‘Something’ video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2017; pp. 5842–5850. [Google Scholar] [CrossRef]
Alfasly, S.; Chui, C.K.; Jiang, Q.; Lu, J.; Xu, C. An effective video transformer with synchronized spatiotemporal and spatial self-attention for action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 2496–2509. [Google Scholar] [CrossRef]
Buch, S.; Eyzaguirre, C.; Gaidon, A.; Wu, J.; Fei-Fei, L.; Niebles, J.C. Revisiting the ‘video’ in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022; pp. 2917–2927. [Google Scholar] [CrossRef]
Zheng, S.; Zhu, Z.; Liu, Z.; Li, Y.; Zhao, Y. Node-oriented spectral filtering for graph neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 388–402. [Google Scholar] [CrossRef]
Myung, W.; Yin, N.N.; Xu, C.; Kim, Y.M. DeGCN: Deformable graph convolutional networks for skeleton-based action recognition. IEEE Trans. Image Process. 2024, 33, 2477–2490. [Google Scholar] [CrossRef]
Sun, W.; Ma, Y.; Wang, R. Free-FreeSLT: A gloss-free parameter-free model for sign language translation. In Proceedings of the 6th ACM International Conference on Multimedia in Asia Workshops, Auckland, New Zealand, 3–6 December 2024. [Google Scholar]
Yang, X.; Zhang, H.; Cai, J. Deconfounded image captioning: A causal retrospect. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12996–13010. [Google Scholar] [CrossRef]
Wang, Y.; Li, X.; Liu, Y.; Cao, X.; Meng, X.; Meng, L. Causal inference for out-of-distribution recognition via sample balancing. CAAI Trans. Intell. Technol. 2024, 9, 1172–1184. [Google Scholar] [CrossRef]
Wang, Y.; Xu, C.; Xu, C.; Tao, D. Packing convolutional neural networks in the frequency domain. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2495–2510. [Google Scholar] [CrossRef]
Zheng, Y.-D.; Liu, Z.; Lu, T.; Wang, L. Dynamic sampling networks for efficient action recognition in videos. IEEE Trans. Image Process. 2020, 29, 7970–7983. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022; pp. 3202–3211. [Google Scholar] [CrossRef]
Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093. [Google Scholar]
Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 6836–6846. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021; pp. 1–21. [Google Scholar]
Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 1010–1019. [Google Scholar] [CrossRef]
Veerbeek, J.M.; Van Wegen, E.; Van Peppen, R.; Van Der Wees, P.J.; Hendriks, E.; Rietberg, M.; Kwakkel, G. What is the evidence for physical therapy poststroke? A systematic review and meta-analysis. PLoS ONE 2014, 9, e87987. [Google Scholar] [CrossRef]
Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
Schmitz-Hübsch, T.; Du Montcel, S.T.; Baliko, L.; Berciano, J.; Boesch, S.; Depondt, C.; Giunti, P.; Globas, C.; Infante, J.; Kang, J.-S.; et al. Scale for the assessment and rating of ataxia. Neurology 2006, 66, 1717–1720. [Google Scholar] [CrossRef]

Figure 1. The general architecture of the system of proposed causality-aware action recognition, NeuroPrisma. The model combines the Prismatic Spectral Attention with Causal Intervention Layers to obtain solid temporal insights without spurious decoding.

Figure 2. Comprehensive design of the suggested NeuroPrisma architecture with the Prismatic Spectral Attention (PSA) block of the frequency-domain spectral analysis, and the Causal Intervention Layer (CIL) block of the deconfounding. The framework processes video patches with spectral decomposition, causal interventions, and makes predictions of actions that are causal.

Figure 3. Causal directed acyclic graph (DAG) for NeuroPrisma action recognition. (a) Observational model showing backdoor paths C → X and C → Y through scene context confounder C, where P(Y|X) conflates causal and spurious associations. (b) Interventional model after CIL applies do-calculus: the backdoor path C → X is severed (✗), isolating the direct causal effect X → Y via the adjustment formula P(Y|do(X)) = ∑c P(Y|X, c)·P(c). Variables: X = Z^psa (spectral motion features), Y = action label, C = scene context (background, camera viewpoint, lighting, and co-occurring objects), and U = exogenous noise.

Figure 4. Top-1 accuracy with standard deviation error bars across benchmark datasets. NeuroPrisma (red) is consistently performing better than any of the baselines that have lower variance, with the largest margin at +2.9% on the temporally challenging SSv2 data.

Figure 5. Causal evaluation results. (a) CCS comparison of the 41.5% average decrease in causal confusion at NeuroPrisma. (b) Visualization of attention: the top baseline schemes consider confounding background areas, whereas the bottom scheme, NeuroPrisma, considers causally important motions in the hands and body of the actor.

Figure 6. Visualization of the ablation study results on the SSv2 dataset. (a) Progressive improvement in Top-1 accuracy with the sequential addition of components to the ViT-B baseline: Temporal Self-Attention (+3.6%), PSA Module (+2.6%), CIL Module (+2.4%), MI Regularization (+0.7%), and Enhanced Augmentation (+0.7%), culminating in a total gain of +10.0% for the full NeuroPrisma model. (b) Corresponding decrease in Causal Confusion Score (CCS), with the most substantial reductions occurring after the addition of the PSA and CIL modules, indicating their synergistic effect in mitigating reliance on spurious correlations.

Figure 7. Hyperparameter sensitivity analysis SSv2. (a) MI loss weight λMI. (b) Intervention strength β. (c) Temperature τ. Stars with shading depict optimum values of ±1 std. NeuroPrisma operates at reasonable parameter regimes (↓ lower is better).

Figure 8. Trade-off between accuracy and computational cost on UCF101. (a) FLOPs vs. Top-1 accuracy, (b) inference latency vs. Top-1 accuracy with the 30 FPS real-time threshold shown as a dashed green line, (c) throughput vs. Top-1 accuracy, and (d) normalized multi-metric comparison across five efficiency dimensions. In subplots (a–c), bars represent the computational metric (left axis) and the red line with circular markers indicates Top-1 accuracy (right axis). In subplot (d), ↑ denotes metrics where higher values are better (throughput, accuracy), and ↓ denotes metrics where lower values are better (FLOPs, latency, memory). NeuroPrisma (red diamond in (d); red filled bar and circular marker in (a–c)) represents a Pareto-optimal solution, achieving the highest accuracy while remaining competitive across all efficiency metrics.

Figure 9. Per-category analysis on SSv2. Baseline.

Table 1. Comparison of related work on video action recognition.

Method	Transformer	Spectral	Causal	Real-Time	Multi-Scale	Rehab.
ViViT [41]	✓	×	×	×	×	×
TimeSformer [10]	✓	×	×	✓	×	×
VideoSwin [39]	✓	×	×	✓	✓	×
UniFormerV2 [21]	✓	×	×	✓	✓	×
SlowFast [5]	×	×	×	✓	✓	×
CausalVQA [28]	✓	×	✓	×	×	×
NeuroPrisma	✓	✓	✓	✓	✓	✓

Table 2. Summary of benchmark datasets and evaluation protocols.

Dataset	Classes	Train	Test	Avg. Dur.	Splits	Protocol
UCF101 [20]	101	9537	3783	7.2 s	3	Avg. over splits
HMDB51 [12]	51	3570	1530	3.1 s	3	Avg. over splits
SSv2 [28]	174	168,913	24,777	4.0 s	1	Val set
NTU RGB+D [43]	60	40,320	16,560	5.8 s	2	X-Sub & X-View

Table 3. Classification accuracy comparison across benchmark datasets (Mean ± STD over 3 random seeds).

Method	UCF101		HMDB51		SSv2		NTU X-Sub		NTU X-View
	Top-1	Top-5	Top-1	Top-5	Top-1	Top-5	Top-1	Top-5	Top-1	Top-5
I3D [14]	95.1 ± 0.31	99.2 ± 0.12	74.3 ± 0.48	92.1 ± 0.29	50.0 ± 0.42	78.6 ± 0.35	82.1 ± 0.38	95.8 ± 0.19	88.7 ± 0.27	97.9 ± 0.11
SlowFast [5]	96.8 ± 0.24	99.5 ± 0.08	78.1 ± 0.41	94.2 ± 0.23	63.1 ± 0.35	87.4 ± 0.28	85.3 ± 0.32	96.7 ± 0.16	91.2 ± 0.22	98.4 ± 0.09
TimeSformer [10]	97.1 ± 0.22	99.6 ± 0.07	79.7 ± 0.39	94.8 ± 0.21	62.5 ± 0.44	86.9 ± 0.31	86.4 ± 0.33	97.1 ± 0.15	92.1 ± 0.24	98.6 ± 0.08
ViViT [41]	97.0 ± 0.28	99.5 ± 0.09	78.9 ± 0.46	94.5 ± 0.25	61.8 ± 0.51	86.3 ± 0.34	85.8 ± 0.41	96.9 ± 0.18	91.5 ± 0.29	98.3 ± 0.10
VideoSwin [39]	97.6 ± 0.19	99.7 ± 0.06	80.2 ± 0.33	95.1 ± 0.19	66.4 ± 0.32	89.2 ± 0.24	88.1 ± 0.28	97.5 ± 0.13	93.4 ± 0.19	98.8 ± 0.07
UniFormerV2 [18]	98.0 ± 0.17	99.7 ± 0.05	80.9 ± 0.31	95.4 ± 0.18	68.3 ± 0.29	90.1 ± 0.21	89.2 ± 0.24	97.8 ± 0.11	94.3 ± 0.17	99.0 ± 0.06
VideoMAE [40]	97.8 ± 0.21	99.6 ± 0.07	80.5 ± 0.37	95.2 ± 0.20	67.9 ± 0.38	89.8 ± 0.25	88.7 ± 0.30	97.6 ± 0.14	93.9 ± 0.21	98.9 ± 0.07
NeuroPrisma	98.7 ± 0.12	99.8 ± 0.04	82.4 ± 0.26	96.1 ± 0.15	71.2 ± 0.23	91.8 ± 0.17	91.5 ± 0.19	98.4 ± 0.09	95.8 ± 0.14	99.3 ± 0.05
Δ vs. Best	+0.7	+0.1	+1.5	+0.7	+2.9	+1.7	+2.3	+0.6	+1.5	+0.3

Table 4. Extended classification metrics on UCF101 and SSv2.

Method	UCF101				SSv2
	Top-1	MCA	F1	AUC	Top-1	MCA	F1	AUC
TimeSformer	97.1	96.8	0.967	0.998	62.5	61.9	0.614	0.943
VideoSwin	97.6	97.4	0.973	0.999	66.4	65.8	0.652	0.956
UniFormerV2	98.0	97.8	0.977	0.999	68.3	67.6	0.671	0.962
VideoMAE	97.8	97.6	0.975	0.999	67.9	67.2	0.667	0.960
NeuroPrisma	98.7	98.5	0.984	0.999	71.2	70.5	0.698	0.971
Δ vs. Best	+0.7	+0.7	+0.007	–	+2.9	+2.9	+0.027	+0.009

Table 5. Causal confusion score comparison (lower is better).

Method	UCF101	HMDB51	SSv2	NTU	Avg.
TimeSformer	0.342	0.418	0.523	0.387	0.418
ViViT	0.358	0.431	0.547	0.402	0.435
VideoSwin	0.321	0.392	0.489	0.356	0.390
UniFormerV2	0.298	0.367	0.461	0.328	0.364
VideoMAE	0.312	0.384	0.478	0.341	0.379
NeuroPrisma	0.178	0.213	0.267	0.194	0.213
Reduction	40.3%	41.9%	42.1%	40.9%	41.5%

Table 6. Component-wise ablation study on SSv2 dataset.

Configuration	Top-1	Top-5	MCA	CCS	ΔTop-1
(A) ViT-B Baseline	61.2	85.4	60.6	0.534	–
(B) + Temporal Self-Attention	64.8	87.9	64.1	0.498	+3.6
(C) + PSA Module	67.4	89.6	66.8	0.412	+6.2
(D) + CIL Module	69.8	91.0	69.2	0.289	+8.6
(E) + MI Regularization	70.5	91.4	69.9	0.271	+9.3
(F) + Enhanced Augmentation	71.2	91.8	70.5	0.267	+10.0

Table 7. Impact of spectral bands (

B

) in PSA module on SSv2.

Table 7. Impact of spectral bands (

B

) in PSA module on SSv2.

$B$	Top-1	CCS	Params	FLOPs	Latency
1	68.4	0.312	94.2 M	134.5 G	21.8 ms
2	69.7	0.289	95.8 M	137.2 G	22.4 ms
4	71.2	0.267	98.6 M	142.7 G	23.4 ms
8	71.0	0.271	104.1 M	153.6 G	25.7 ms
16	70.6	0.278	115.2 M	175.4 G	29.3 ms

Table 8. Impact of confounder prototypes (

K

) in CIL module on SSv2.

Table 8. Impact of confounder prototypes (

K

) in CIL module on SSv2.

$K$	Top-1	CCS	Memory	Train Time	Convergence
32	69.8	0.298	+12 MB	+3.2%	45 epochs
64	70.5	0.281	+24 MB	+5.1%	42 epochs
128	71.2	0.267	+48 MB	+8.4%	38 epochs
256	71.1	0.269	+96 MB	+14.7%	35 epochs
512	70.8	0.273	+192 MB	+26.3%	33 epochs

Table 9. Hyperparameter sensitivity on SSv2 (Top-1 accuracy).

$λ_{M I}$	Top-1	CCS	$β$	Top-1	CCS
0.01	70.1	0.294	0.1	69.4	0.312
0.05	70.7	0.278	0.3	70.6	0.281
0.10	71.2	0.267	0.5	71.2	0.267
0.20	70.8	0.271	0.7	70.9	0.269
0.50	69.6	0.283	1.0	70.2	0.278

Table 10. Computational efficiency comparison.

Method	Params (M)	FLOPs (G)	Latency (ms)	Throughput (clips/s)	Memory (GB)	UCF101 Top-1
I3D	28.0	108.0	31.2	32.1	4.2	95.1
SlowFast	34.4	65.7	18.4	54.3	3.8	96.8
TimeSformer	121.4	196.3	42.8	23.4	8.7	97.1
ViViT	114.2	167.5	38.6	25.9	7.9	97.0
VideoSwin	88.1	87.8	24.7	40.5	5.6	97.6
UniFormerV2	115.3	161.2	28.3	35.3	7.4	98.0
VideoMAE	86.9	180.6	35.1	28.5	6.8	97.8
NeuroPrisma	98.6	142.7	23.4	42.7	6.2	98.7

Table 11. Cumulative latency optimization breakdown on NVIDIA A100 GPU.

Optimization	Latency (ms)	Vs. Baseline	Technique
Baseline (FP32 eager)	68.7	—	PyTorch default
+ FP16 AMP	45.9	−33%	TensorCore GEMM, auto mixed precision
+ Fused PSA kernel	37.4	−45%	Single CUDA kernel (DFT + attention + IDFT), shared memory
+ NWGM log-space	35.0	−49%	log(NWGM) = Σwₖ·log(Zₖ), vectorized einsum
+ Async transfer	30.1	−56%	CUDA streams, overlap H2D with compute
+ Dynamic batching	23.4	−66%	Batch = 4 amortized (trade-off: +50 ms buffering)

Table 12. Statistical significance analysis vs. UniFormerV2.

Dataset	ΔAcc.	t-Statistic	p-Value	Cohen’s d	95% CI
UCF101	+0.7%	4.67	0.0008	0.72	[0.41, 0.99]
HMDB51	+1.5%	6.23	<0.0001	0.89	[1.12, 1.88]
SSv2	+2.9%	9.41	<0.0001	1.24	[2.51, 3.29]
NTU (X-Sub)	+2.3%	7.85	<0.0001	1.08	[1.89, 2.71]
NTU (X-View)	+1.5%	5.42	0.0003	0.81	[1.07, 1.93]
Average	+1.78%	–	–	0.95	–

Table 13. Comprehensive performance comparison across all evaluation dimensions (↓ lower is better).

Method	Top-1 Accuracy (%)					Causal	Efficiency Metrics					Overall
	UCF101	HMDB51	SSv2	NTU-XS	NTU-XV	CCS ↓	Params	FLOPs	Latency	Memory	Throughput	Rank
I3D [14]	95.1 ± 0.31	74.3 ± 0.48	50.0 ± 0.42	82.1 ± 0.38	88.7 ± 0.27	0.567	28.0 M	108 G	31.2 ms	4.2 GB	32.1	7
SlowFast [5]	96.8 ± 0.24	78.1 ± 0.41	63.1 ± 0.35	85.3 ± 0.32	91.2 ± 0.22	0.489	34.4 M	65.7 G	18.4 ms	3.8 GB	54.3	6
TimeSformer [10]	97.1 ± 0.22	79.7 ± 0.39	62.5 ± 0.44	86.4 ± 0.33	92.1 ± 0.24	0.428	121.4 M	196 G	42.8 ms	8.7 GB	23.4	5
ViViT [41]	97.0 ± 0.28	78.9 ± 0.46	61.8 ± 0.51	85.8 ± 0.41	91.5 ± 0.29	0.445	114.2 M	168 G	38.6 ms	7.9 GB	25.9	5
VideoSwin [39]	97.6 ± 0.19	80.2 ± 0.33	66.4 ± 0.32	88.1 ± 0.28	93.4 ± 0.19	0.401	88.1 M	87.8 G	24.7 ms	5.6 GB	40.5	3
UniFormerV2 [18]	98.0 ± 0.17	80.9 ± 0.31	68.3 ± 0.29	89.2 ± 0.24	94.3 ± 0.17	0.375	115.3 M	161 G	28.3 ms	7.4 GB	35.3	2
VideoMAE [40]	97.8 ± 0.21	80.5 ± 0.37	67.9 ± 0.38	88.7 ± 0.30	93.9 ± 0.21	0.392	86.9 M	181 G	35.1 ms	6.8 GB	28.5	4
NeuroPrisma	98.7 ± 0.12	82.4 ± 0.26	71.2 ± 0.23	91.5 ± 0.19	95.8 ± 0.14	0.213	98.6 M	143 G	23.4 ms	6.2 GB	42.7	1
vs. Best	+0.7	+1.5	+2.9	+2.3	+1.5	−43.2%	−14.5%	−11.2%	−17.3%	−16.2%	+5.4%	–

Table 14. Comparison of zero-shot, fine-tuned performance, and CCS across TimeSformer, UniFormerV2, and NeuroPrisma on the test set (n = 147).

Method	Zero-Shot	Fine-Tuned	CCS
TimeSformer	28.6%	62.1%	0.421
UniFormerV2	31.2%	65.3%	0.398
NeuroPrisma	34.2%	68.4%	0.338

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alshaya, H. Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems. Mathematics 2026, 14, 1341. https://doi.org/10.3390/math14081341

AMA Style

Alshaya H. Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems. Mathematics. 2026; 14(8):1341. https://doi.org/10.3390/math14081341

Chicago/Turabian Style

Alshaya, Hend. 2026. "Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems" Mathematics 14, no. 8: 1341. https://doi.org/10.3390/math14081341

APA Style

Alshaya, H. (2026). Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems. Mathematics, 14(8), 1341. https://doi.org/10.3390/math14081341

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Neuro-Prismatic Video Models for Causality-Aware Action Recognition in Neural Rehabilitation Systems

Abstract

1. Introduction

2. Related Work

2.1. Video Transformers for Action Recognition

2.2. Causal Inference in Deep Learning

2.3. Research Gaps

3. Methodology

3.1. Problem Formulation

3.2. System Architecture

Backbone Network Specification

3.3. Video Patch Embedding

3.4. Prismatic Spectral Attention (PSA)

3.5. Causal Intervention Layer (CIL)

3.6. Transformer Encoder and Classification

3.7. Ethical Considerations

4. Experimental Results

4.1. Datasets and Experimental Protocol

4.2. Implementation Details

4.3. Baseline Methods

4.4. Main Results

4.5. Evaluation Metrics

4.6. Extended Classification Metrics

4.7. Causal Evaluation

4.7.1. Quantitative Causal Metrics

4.7.2. Confounding Bias Elimination Rate (CBER)

4.7.3. Sensitivity Analysis Under Unmodeled Confounders

4.8. Ablation Studies

4.8.1. Component-Wise Contribution

4.8.2. PSA-CIL Synergy Analysis

4.8.3. Number of Spectral Bands

4.8.4. Number of Confounder Prototypes

4.8.5. Cross-Dataset Spectral Analysis

4.9. Hyperparameter Sensitivity Analysis

4.10. Computational Efficiency

Latency Optimization Breakdown

4.11. Statistical Significance Analysis

4.12. Per-Category Performance Analysis

4.13. Comprehensive Performance Summary

4.14. Ethical Considerations and Dataset Bias

4.15. Discussion

4.15.1. Synergy of Spectral and Causal Components

4.15.2. Temporal Reasoning Capability

4.15.3. Clinical Applicability for Rehabilitation

4.15.4. Limitations and Future Directions

4.15.5. Benchmark Performance vs. Clinical Deployment Readiness

4.16. Edge Hardware Evaluation and Domain Robustness

4.17. Preliminary Rehabilitation Dataset Evaluation

5. Conclusions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI