Article

Dual-Stream Transformer with Kalman-Based Sensor Fusion for Wearable Fall Detection

by Abheek Pradhan, Sana Alamgeer, Rakesh Suvvari, Syed Tousiful Haque and Anne H. H. Ngu *
Department of Computer Science, Texas State University, San Marcos, TX 78666, USA
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(3), 90; https://doi.org/10.3390/bdcc10030090
Submission received: 3 February 2026 / Revised: 7 March 2026 / Accepted: 13 March 2026 / Published: 17 March 2026

Abstract

Wearable fall detection systems face a fundamental challenge: while gyroscope data provide valuable orientation cues, naively combining raw gyroscope and accelerometer signals can degrade performance due to noise contamination. To overcome this challenge, we present a dual-stream transformer architecture that incorporates (i) Kalman-based sensor fusion, which converts noisy gyroscope angular velocities into stable orientation estimates (roll, pitch, yaw) while maintaining an internal state of body pose, and (ii) separate encoder pathways that process the accelerometer and orientation streams independently before fusion to prevent cross-modal interference. Our architecture further integrates Squeeze-and-Excitation channel attention and Temporal Attention Pooling to focus on fall-critical temporal patterns. Evaluated on the SmartFallMM dataset using 21-fold leave-one-subject-out cross-validation, the dual-stream Kalman transformer achieves 91.10% F1, outperforming single-stream Kalman transformers (89.80% F1) by 1.30% and single-stream baseline transformers (88.96% F1) by 2.14%. We further evaluate the model in real time using a watch-based SmartFall App on five participants, maintaining an average F1 score of 83% and an accuracy of 90%. These results indicate robust performance in both offline and real-world deployment settings, establishing a new state-of-the-art for inertial-measurement-unit-based fall detection on commodity smartwatch devices.

1. Introduction

Falls represent a critical health concern for older adults, accounting for over 38 million medically treated injuries annually worldwide [1]. Wearable fall detection systems offer a promising solution for continuous, non-intrusive, anytime, anywhere monitoring using commodity devices such as smartwatches. However, developing accurate fall detection models that operate within the computational and power constraints of wearable devices remains challenging.
Modern smartwatches include inertial measurement units (IMUs) with triaxial accelerometers (to measure linear acceleration and capture impact events) and gyroscopes (to measure angular velocity and capture rotational motion that often precedes or accompanies falls). Intuitively, combining both modalities, accelerometer and gyroscope, should improve detection performance. However, preliminary analyses revealed an unexpected pattern: adding raw gyroscope data to accelerometer-only models degraded performance by 6.2% in F1 score [2]. This degradation occurs because angular velocity signals exhibit similar rotational patterns in both falls and certain daily activities, which confuses the model. A concrete example arises in the SmartFallMM dataset [3]: as illustrated in Figure 1, the angular velocity patterns produced when sitting down quickly can resemble those of back falls despite fundamentally different physical outcomes.
The second challenge arises from the characteristics of consumer-grade IMU sensors themselves. In wearable devices, gyroscope measurements exhibit noise and bias drift and do not provide an absolute orientation reference [4]. As a result, angular velocity alone describes how fast the body rotates but does not directly indicate the body’s orientation in space, which is more closely related to fall events. Extracting orientation from angular velocity requires temporal integration with drift correction, a physics-based transformation that data-driven models struggle to learn reliably from the limited training data typically available in fall detection datasets.
The third challenge is related to the architecture. When accelerometer and gyroscope channels are concatenated and processed through shared early layers in a single-stream neural network, the model tends to overemphasize rotational similarities between daily activities and falls. This can increase false positives in activities involving strong trunk rotation, while noisy gyroscope signals may dilute impact-related acceleration features, increasing false negatives. These limitations suggest that naively fusing raw IMU signals may degrade rather than improve fall detection performance.
To overcome these challenges, we present a dual-stream neural network that is built on a transformer backbone and leverages both accelerometer and gyroscope data through two key innovations: (1) Kalman-based Sensor Fusion as Domain Knowledge Injection: Rather than expecting the neural network to learn IMU sensor physics from limited data, we apply a Linear Kalman Filter [5] to transform raw measurements into body orientation angles (roll, pitch, yaw). This injects domain knowledge about how accelerometer and gyroscope measurements relate to body pose, knowledge that, if learned only from data, would require far more training examples than are currently available in public fall detection datasets. In this sense, the Kalman layer serves as a stateful front end, maintaining an internal estimate of body orientation that guides subsequent learning. The resulting orientation representation captures what position the body is in rather than how fast it is rotating, aligning the input features with the physical phenomenon of interest.
(2) Dual-Stream Architecture for Modality Isolation: Accelerometer and orientation signals have fundamentally different characteristics: acceleration captures high-frequency impact dynamics, while orientation captures smooth body pose changes. We process these streams through separate encoder pathways with symmetric capacity allocation. This separation prevents cross-modal interference, where noisy or uninformative channels corrupt useful features when processed together. Critically, dual-stream processing only improves performance when combined with Kalman fusion; with raw gyroscope inputs, the dedicated orientation pathway learns noise patterns rather than useful features, actually degrading performance. Our architecture further incorporates Squeeze-and-Excitation (SE) channel attention [6] to weight the importance of fused feature channels, and Temporal Attention Pooling (TAP) to focus classification on the most discriminative time steps within each window, typically the impact phase of falls. Together, these attention modules implement a selective focus over channels and time, enabling the model to emphasize salient motion cues while down-weighting background activity. We evaluate our approach on three comprehensive fall detection datasets using Leave-One-Subject-Out cross-validation, achieving state-of-the-art performance for wearable IMU-based systems. The main contributions of this paper are:
1. We empirically show that using raw gyroscope data together with linear accelerometer data hurts fall detection performance because neural networks cannot reliably learn the physics-based transformation from sensed angular velocity to orientation from limited training data. By applying Kalman filtering, we inject domain knowledge that converts gyroscope data from a potential source of noise into a complementary modality.
2. We show that the dual-stream architecture and Kalman fusion exhibit a synergistic interaction: dual-stream processing improves performance by +1.30% with Kalman inputs but degrades performance by −1.38% with raw inputs. This asymmetry suggests that modality isolation amplifies the effect of input quality.
3. We provide a systematic component-wise analysis that isolates the effects of Kalman fusion, architectural decoupling, and attention mechanisms. Kalman fusion enables a +3.52% F1 score improvement in the dual-stream setting (87.58% → 91.10%), dual-stream processing contributes an additional +1.30% F1 score over the best single-stream Kalman baseline, and SE+TAP attention mechanisms yield a further +1.28% F1 score improvement in single-stream transformers.
4. To analyze the robustness and architectural sensitivity of the proposed method, we perform ablation studies, which include: (a) evaluation on three datasets, demonstrating consistent performance gains; (b) evaluation of different embedding allocations, revealing that a balanced and moderate dimensional split between the acceleration and orientation streams yields superior performance.
5. Finally, we validate the proposed method in a real-world smartwatch-based fall detection app, demonstrating that it maintains strong detection performance (83% F1 score and 90% accuracy) and practical viability under live operating conditions.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed methodology, including a preprocessing pipeline and a dual-stream network architecture. Section 4 describes implementation details, and Section 5 presents experimental results including ablation studies. Finally, Section 6 concludes with future directions.

2. Related Work

Fall detection using wearable IMUs has evolved through several methodological paradigms, each addressing specific limitations while introducing new challenges that motivate our proposed approach.
  • Deep Learning Approaches for IMU-Based Fall Detection: Early deep learning approaches for fall detection, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), focused on single-stream architectures processing raw accelerometer data. For example, our initial work [7] evaluated a Gated Recurrent Unit (GRU) model using wrist-worn accelerometer data from the SmartFall2018 dataset [8], achieving 87% F1 score offline, but only 73% in real-world deployment, which revealed a critical 14% performance gap. To address this degradation, subsequent work [9] explored ensemble learning with user feedback, but the computational overhead made it unsuitable for resource-constrained smartwatches.
To rigorously evaluate architectural alternatives for single wrist-worn sensing, we conducted a comparative study [10] using single and hybrid deep learning models (LSTM, LSTM-CNN, LSTM with self-attention, transformer [11], convolutional transformer). We observed that while LSTM-CNN achieved the best offline F1 score (87.6%), it suffered severe real-world degradation (70.2%). Interestingly, the basic transformer model showed better generalization: despite a lower offline F1 score (82.6%), it maintained higher real-world performance (77.1%) in comparison with the LSTM-CNN model. This indicates that attention mechanisms may capture more robust temporal patterns than recurrent architectures, yet substantial performance gaps persist between offline and real-world deployment.
Other transformer-based approaches have similarly explored self-attention mechanisms for fall detection, leveraging their ability to capture both local and global temporal dependencies in parallel. For example, Zafar et al. [12] achieved over 98% accuracy using sliding window segmentation with majority voting and predictive smoothing on the MobiAct dataset [13]. Similarly, Yhdego et al. [14] proposed a transformer attention network with Time2Vec positional encoding, achieving 98% specificity and 97% sensitivity for gait analysis-based fall detection. However, these studies share critical limitations: (1) They used torso-mounted sensors (waist, chest, shank) rather than wrist-worn devices, which exhibit fundamentally different motion patterns than wrist movements. (2) Both acknowledged difficulties distinguishing ADLs with similar vertical transitions (sitting, standing). (3) The models were evaluated exclusively offline using simulated falls without real-world deployment validation.
Recognizing that single-location sensing provides insufficient context to distinguish falls from ADLs, we systematically explored multi-location sensing [2] using a single-stream transformer-based architecture. We demonstrated that combining wrist and hip accelerometers improved offline F1 score to 88% and real-world F1 score to 80% on the SmartFallMM dataset [3]. However, this approach faced critical usability limitations: requiring users to carry two devices proved impractical for long-term adoption. Moreover, simply concatenating gyroscope data with accelerometer signals failed to improve performance and sometimes degraded results, suggesting that naive multimodal fusion does not address the fundamental challenge of extracting discriminative features from heterogeneous sensor modalities.
  • Dual-Stream Architectures and Multimodal Fusion: The recognition that different sensor modalities require separate processing pathways has motivated the development of dual-stream architectures. For example, a three-stream spatio-temporal graph convolutional network (GCN) for fall recognition [15] demonstrated the benefits of processing multiple feature representations through separate pathways. In the scenarios of multimodal sensing, decision-level fusion architectures consisting of distinct processing streams, such as vision-based streams exploiting skeletal landmarks and inertial-based streams using LSTM autoencoders, have shown that late fusion is more robust than early fusion by enabling better calibration, fault isolation, and resilience against modality-specific failures [16]. However, these approaches typically combine fundamentally different sensing modalities (vision and inertial) rather than addressing the specific challenge of fusing complementary measurements from the same IMU device. Moreover, vision modality is not available from a wrist-worn watch. Additionally, these dual-stream approaches fail to address data quality at the signal level, i.e., they assume that architectural separation alone suffices to handle signal interference in multimodal sensing.
  • Sensor Fusion and Kalman Filtering for Orientation Estimation: Kalman filtering [5] has been widely adopted in fall detection systems primarily for noise suppression and signal smoothing of accelerometer data. Liu and Lin [17] applied a first-order Kalman filter to extract slow-varying residual components from offline triaxial accelerometer signals, achieving 96.21% accuracy and a 93.24% F1 score using a support vector machine (SVM) with handcrafted features; the model was not evaluated in real-world deployment. Similarly, in [18], complementary filtering techniques, such as the Madgwick algorithm, were employed to fuse accelerometer gravity references with gyroscope angular velocities for orientation estimation in motion-tracking applications. However, these approaches employed filtering solely for single-modality noise reduction or as preprocessing steps for threshold-based detection algorithms, with evaluations limited to offline settings. Critically, no prior work has applied Kalman filtering to noisy gyroscope data to obtain stable orientation angles and processed them through separate neural pathways to improve fall classification performance.
  • Attention Mechanisms for Time-Series Classification: Squeeze-and-Excitation (SE) networks [6,19] and temporal attention mechanisms [20] have demonstrated effectiveness in dynamically allocating weights to the features in human activity recognition (HAR). SE mechanisms perform channel-wise recalibration to amplify discriminative channels while suppressing less informative ones in the spatial/channel dimension. Addressing the temporal dimension, Wang et al. [21] showed that temporal attention could capture long-term dependencies without RNNs by integrating dilated CNNs with modified temporal attention mechanisms. These networks further demonstrated that jointly modeling spatial and temporal dependencies improved HAR performance. However, the application of combined SE and temporal attention mechanisms to IMU-based fall detection remains underexplored. Critically, no prior work has examined whether attention mechanisms designed for single-stream architectures transfer effectively to dual-stream designs, where modality-specific fusion introduces different channel dynamics, particularly in real-world deployment scenarios.

3. Methodology

Figure 2 illustrates the complete pipeline of the proposed method. First, raw accelerometer ( a k ) and gyroscope ( ω k ) measurements from a smartwatch undergo modality-specific preprocessing, where acceleration features are augmented with signal magnitude and then standardized using z-score normalization. In contrast, Kalman fusion transforms angular velocities into orientation angles, which are kept in their original scale. The processed streams are then segmented into overlapping windows and fed to the dual-stream network.

3.1. Kalman Fusion for Orientation Features

Our Kalman fusion technique introduces two key modifications to vanilla approaches. First, we adaptively reduce the influence of accelerometer measurements during high-acceleration events, so that when external forces distort the accelerometer’s gravity-based tilt estimates, the filter relies more on the gyroscope-driven prediction rather than noisy accelerometer updates. Second, we use Kalman filtering to transform raw gyroscope angular velocities into stable orientation angles as semantically meaningful input features for deep learning. The filter operates sequentially over the incoming IMU sensor stream, maintaining a continuous state estimate of orientation across time. Sliding windows used for model training and real-time inference are extracted from this filtered signal rather than performing independent filtering within each window. In the real-time smartwatch deployment, the filter state is initialized at the start of each sensor recording and is then updated continuously as new IMU samples arrive. The following steps detail our implementation.
  • Step 1: Sensor Acquisition: At discrete time step k, the inertial measurement unit (IMU) provides an accelerometer sample a_k = [a_x, a_y, a_z] ∈ R³, representing linear accelerations along the sensor axes, and a gyroscope sample ω_k = [ω_x, ω_y, ω_z] ∈ R³, representing angular velocities in rad/s.
  • Step 2: State Definition: We apply Kalman fusion exclusively to orientation estimation, with the state vector defined as x_k = [ϕ_k, θ_k, ψ_k, ϕ̇_k, θ̇_k, ψ̇_k]^⊤, where ϕ_k, θ_k, and ψ_k denote roll, pitch, and yaw angles, respectively, and ϕ̇_k, θ̇_k, and ψ̇_k denote the corresponding angular rates. This state encapsulates the filter’s internal belief about body orientation and angular motion at time k.
  • Step 3: State Prediction: The prediction step uses gyroscope readings to estimate the system state at the next time step. Starting from the previous state estimate x_{k−1}, the Kalman filter projects the orientation and angular motion forward in time according to x_k^− = F x_{k−1} + w_k, where w_k ∼ N(0, Q) represents zero-mean Gaussian noise that accounts for modeling uncertainty and sensor imperfections. The state transition matrix F ∈ R^{6×6} governs this temporal propagation and is defined as
    F = \begin{bmatrix} 1 & 0 & 0 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & 0 & \Delta t & 0 \\ 0 & 0 & 1 & 0 & 0 & \Delta t \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix},
    where Δt denotes the sampling interval.
  • Step 4: Accelerometer-based Orientation Observation: Roll and pitch observations are obtained from the accelerometer via a gravity-based mapping (ϕ_k^acc, θ_k^acc) = f(a_k) using the four-quadrant inverse tangent:
    ϕ_k^acc = atan2(a_y, a_z),
    θ_k^acc = atan2(−a_x, √(a_y² + a_z²)).
Yaw remains unobservable from accelerometer measurements because gravity is invariant to rotations about the vertical axis. Therefore, in the implementation, yaw is computed solely by integrating the gyroscope-measured angular velocity about the vertical axis during the prediction step:
ψ_k = ψ_{k−1} + ψ̇_{k−1} Δt,
where ψ_k denotes the yaw angle at time step k, ψ̇_{k−1} is the gyroscope-measured angular rate about the vertical axis, and Δt is the sampling interval.
  • Step 5: Measurement Model Correction and Adaptive Noise Scaling: The measurement vector is defined as z_k = [ϕ_k^acc, θ_k^acc, ω_x, ω_y, ω_z]^⊤ and is modeled as a noisy linear observation of the state through:
    z_k = H x_k + v_k,
    where v_k ∼ N(0, R) is measurement noise and the observation matrix H ∈ R^{5×6} is defined as:
    H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}.
    The state estimate is updated using the standard Kalman correction:
    x_k = x_k^− + K_k (z_k − H x_k^−),
    where x_k denotes the updated state estimate at time step k, x_k^− is the predicted (prior) state estimate, and the matrix K_k is the Kalman gain, which weights the influence of the innovation (difference between observed and predicted measurements) on the state update:
    K_k = P_k^− H^⊤ (H P_k^− H^⊤ + R)^{−1},
    where P_k^− is the predicted state covariance matrix, and H^⊤ denotes the transpose of the observation matrix. R is the measurement noise covariance matrix, which encodes the uncertainty associated with sensor-derived measurements. In this work, the accelerometer-related noise component R_acc is adaptively scaled based on the magnitude of the measured acceleration to reduce the influence of unreliable gravity estimates during high-dynamic events. Specifically, at each time step, R_acc is adjusted as
    \tilde{R}_{acc} = \begin{cases} R_{acc} \cdot \min\!\big((\lVert a \rVert / g)^2,\; R_{\max}\big), & \lVert a \rVert > \tau g, \\ R_{acc}, & \text{otherwise}, \end{cases}
    where ‖a‖ denotes the acceleration magnitude, g is the gravitational acceleration, τ is an activation threshold, and R_max limits the maximum scaling factor. The modified noise term R̃_acc is incorporated into R before computing the Kalman gain.
  • Step 6. Final Output: The final output of the Kalman fusion is the orientation vector o k = [ ϕ k , θ k , ψ k ] = x k [ 0 : 3 ] , which is subsequently used as the orientation input to the learning model.
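To make Steps 1–6 concrete, the following NumPy sketch implements a linear Kalman filter of this form. It is an illustrative reconstruction, not the authors' code: the noise variances and sampling interval follow the values reported in Section 4.4, while the threshold `TAU` and cap `R_MAX` are assumed placeholder values.

```python
import numpy as np

DT = 1.0 / 32.0   # sampling interval at 32 Hz (Section 4.4)
G = 9.81          # gravitational acceleration (m/s^2)

# State transition F (Step 3) and observation matrix H (Step 5).
F = np.eye(6)
F[0, 3] = F[1, 4] = F[2, 5] = DT
H = np.zeros((5, 6))
H[0, 0] = H[1, 1] = 1.0            # roll, pitch observed from accelerometer
H[2, 3] = H[3, 4] = H[4, 5] = 1.0  # angular rates observed from gyroscope

Q = np.diag([0.005] * 3 + [0.01] * 3)          # process noise (Section 4.4)
R_BASE = np.diag([0.05, 0.05, 0.1, 0.1, 0.1])  # measurement noise (Section 4.4)
TAU, R_MAX = 1.2, 10.0  # adaptive-scaling threshold/cap (assumed values)

def kalman_orientation(acc, gyro):
    """Filter an IMU stream; acc, gyro: (N, 3). Returns (N, 3) roll/pitch/yaw."""
    x = np.zeros(6)           # initial state x_0 = 0
    P = np.eye(6)             # initial covariance P_0 = I
    out = np.zeros((len(acc), 3))
    for k in range(len(acc)):
        ax, ay, az = acc[k]
        # Predict with the constant-rate model (Step 3).
        x = F @ x
        P = F @ P @ F.T + Q
        # Gravity-based roll/pitch observation (Step 4).
        roll_acc = np.arctan2(ay, az)
        pitch_acc = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
        z = np.array([roll_acc, pitch_acc, *gyro[k]])
        # Adaptive accelerometer noise scaling (Step 5).
        R = R_BASE.copy()
        a_norm = np.linalg.norm(acc[k])
        if a_norm > TAU * G:
            R[:2, :2] *= min((a_norm / G) ** 2, R_MAX)
        # Standard Kalman correction.
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (z - H @ x)
        P = (np.eye(6) - K @ H) @ P
        out[k] = x[:3]        # orientation vector o_k (Step 6)
    return out
```

Because the filter runs over the whole stream, windows extracted downstream inherit a consistent orientation state rather than re-initializing per window.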

3.2. Acceleration Features

From the triaxial accelerometer measurement a_k = [a_x, a_y, a_z] ∈ R^{1×3}, we compute the signal magnitude (SM) to capture overall acceleration intensity independent of sensor orientation. The signal magnitude is defined as SM_k = √(a_x² + a_y² + a_z²), which is appended to the original a_k components to form the signal magnitude vector (SMV)
a_k^SMV = [SM_k, a_x, a_y, a_z] ∈ R^{1×4}.
This a k SMV vector representation combines directional acceleration with overall impact intensity and is used as the input to the second stream of the neural network.
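A minimal sketch of this feature computation (illustrative, not the authors' code):

```python
import numpy as np

def smv(a):
    """a: (T, 3) accelerometer window -> (T, 4) SMV features [SM, ax, ay, az]."""
    sm = np.linalg.norm(a, axis=1, keepdims=True)  # sqrt(ax^2 + ay^2 + az^2)
    return np.hstack([sm, a])
```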

3.3. Feature-Specific Normalization

We apply different normalization strategies to acceleration and orientation features to preserve their physical interpretation. Specifically, we standardize a k SMV using z-score normalization computed from the training set. We keep orientation features ( ϕ , θ , ψ ) in raw radians because these angles carry inherent physical interpretation that normalization would distort. By preserving the original angular scale, the network learns meaningful pose-related patterns, such as large pitch deviations during forward falls.
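The asymmetric treatment of the two feature groups can be sketched as follows (a minimal illustration; in practice the mean and standard deviation are fit on the training split only, as stated above):

```python
import numpy as np

def normalize_features(a_smv_train, a_smv, orientation):
    """Z-score SMV channels with training-set statistics; keep angles in radians."""
    mu = a_smv_train.mean(axis=0)
    sigma = a_smv_train.std(axis=0) + 1e-8  # guard against zero variance
    return (a_smv - mu) / sigma, orientation  # orientation passed through unchanged
```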

3.4. Dual-Stream Network Architecture

  • Step 1: Window-Level Input Definition: The proposed model performs window-by-window fall detection, where T denotes the window length in samples, and k ∈ {1, …, T} represents individual time steps within a window (details about window length and overlap are provided in Section 4.2). Stacking samples from a_k^SMV and the Kalman-fused o_k over the window yields two sequences:
    O = [o_1, …, o_T] ∈ R^{T×3},
    A = [a_1^SMV, …, a_T^SMV] ∈ R^{T×4}.
  • Step 2: Dual-Stream Temporal Projections: The O and A sequences are processed by two parallel temporal projection streams with identical structure but independent parameters. Each stream applies a one-dimensional temporal convolution (Conv1D) with kernel size 8 and same padding to preserve temporal resolution, followed by batch normalization (BN), a Sigmoid Linear Unit (SiLU) activation function ( φ ), and dropout (Drop).
For the acceleration stream, the projection is given by:
U_acc = Drop_acc(φ(BN(Conv1D_acc(A)))) ∈ R^{T×32},
where Conv1D_acc : R^{T×4} → R^{T×32} projects the 4-channel acceleration input, and dropout is applied with rate 0.1 (i.e., 0.2·p, where p = 0.5 is the global dropout probability).
Similarly, the orientation stream is projected as:
U_ori = Drop_ori(φ(BN(Conv1D_ori(O)))) ∈ R^{T×32},
where Conv1D_ori : R^{T×3} → R^{T×32} projects the 3-channel orientation input, and dropout is applied with rate 0.15 (i.e., 0.3·p).
  • Step 3: Feature Fusion and Normalization: The projected features U acc and U ori are fused by concatenation, followed by layer normalization (LN) to stabilize the combined representation across modalities. The fusion operation is defined as:
    U_0 = LN([U_acc ‖ U_ori]) ∈ R^{T×64},
    where [· ‖ ·] denotes concatenation along the feature dimension. At this point, the two streams are fully merged into a shared temporal representation.
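As a shape-level illustration, this fusion step can be sketched in NumPy (the model itself uses a learned PyTorch layer-normalization; this is not the authors' code):

```python
import numpy as np

def fuse_and_norm(u_acc, u_ori, eps=1e-5):
    """u_acc: (T, 32), u_ori: (T, 32) -> (T, 64), layer-normalized per time step."""
    u = np.concatenate([u_acc, u_ori], axis=1)   # [U_acc || U_ori]
    mu = u.mean(axis=1, keepdims=True)
    var = u.var(axis=1, keepdims=True)
    return (u - mu) / np.sqrt(var + eps)         # LN over the feature dimension
```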
  • Step 4: Transformer Encoder for Temporal Modeling: The fused representation U 0 is further processed by a stack of transformer encoder layers to model long-range temporal dependencies across the window. The encoder applies pre-normalized multi-head self-attention (MSA) and a position-wise feed-forward network (FFN) at each layer. The encoder output is computed by:
    U = E(U_0),
    where E(·) denotes a stack of L transformer encoder layers and U = [U_1, …, U_T] is the resulting sequence of contextualized representations.
Each encoder layer l { 1 , , L } performs the following operations:
Ũ^(l) = U^(l) + MSA(LN(U^(l))),
U^(l+1) = Ũ^(l) + FFN(LN(Ũ^(l))).
In our configuration, the encoder uses L = 2 layers, each with 4 attention heads and a feed-forward dimension of 128.
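The pre-norm residual structure above can be sketched with a minimal single-head NumPy encoder layer (the actual model uses 4 heads and learned PyTorch parameters; the random matrices here merely stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_FF = 64, 128  # model width and feed-forward dimension

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random weights stand in for learned parameters (illustration only).
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.05 for _ in range(4))
W1 = rng.standard_normal((D, D_FF)) * 0.05
W2 = rng.standard_normal((D_FF, D)) * 0.05

def encoder_layer(u):
    """u: (T, D). One pre-norm block: self-attention then FFN, both residual."""
    h = layer_norm(u)                              # LN before attention (pre-norm)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(D)) @ v       # scaled dot-product attention
    u = u + attn @ Wo                              # residual after attention
    h = layer_norm(u)                              # LN before FFN
    return u + np.maximum(h @ W1, 0) @ W2          # residual after ReLU FFN
```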
  • Step 5: Channel Attention via Squeeze–Excitation: The contextualized representation U from the transformer encoder is recalibrated using a squeeze–excitation (SE) mechanism to emphasize informative feature channels while suppressing less relevant ones. The SE module first aggregates temporal information via global average pooling:
    c = (1/T) Σ_{k=1}^{T} U_k ∈ R^64,
    yielding a global channel descriptor. Channel importance weights are then computed through a two-layer bottleneck with reduction ratio r = 4, yielding a bottleneck dimension of 64/r = 16:
    s = σ(W_2 δ(W_1 c)) ∈ R^64,
    where W_1 ∈ R^{16×64} reduces dimensionality, W_2 ∈ R^{64×16} restores it, δ(·) denotes ReLU activation, and σ(·) denotes the sigmoid function. The recalibrated features are obtained by channel-wise scaling:
    Ũ_k = U_k ⊙ s, k = 1, …, T,
    where ⊙ denotes element-wise multiplication, producing the sequence Ũ = [Ũ_1, …, Ũ_T] ∈ R^{T×64}.
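A compact NumPy sketch of this squeeze-excitation recalibration (random matrices stand in for the learned W_1, W_2; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
C, R_RATIO = 64, 4
W1 = rng.standard_normal((C // R_RATIO, C)) * 0.1  # squeeze: 64 -> 16
W2 = rng.standard_normal((C, C // R_RATIO)) * 0.1  # excite: 16 -> 64

def se_recalibrate(u):
    """u: (T, C). Returns channel-rescaled features of the same shape."""
    c = u.mean(axis=0)                                        # global average pool over time
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ c, 0))))   # sigmoid(W2 ReLU(W1 c))
    return u * s                                              # broadcast channel weights
```

Since each weight in s lies in (0, 1), the module can only attenuate channels, never amplify them beyond their original magnitude.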
  • Step 6: Temporal Attention Pooling: To aggregate the recalibrated sequence U ˜ into a fixed-length window-level representation, Temporal Attention Pooling (TAP) is applied. Unlike global average pooling, TAP learns to selectively weight time steps based on their relevance to fall detection, focusing on transient impact events while down-weighting the surrounding background motion.
The attention mechanism computes normalized importance scores via a two-layer network:
α_k = exp(v^⊤ tanh(W_a Ũ_k)) / Σ_{t=1}^{T} exp(v^⊤ tanh(W_a Ũ_t)), k = 1, …, T,
where W_a ∈ R^{32×64} projects features to an intermediate attention space, tanh(·) applies nonlinearity, and v ∈ R^32 maps to scalar attention scores that are normalized via softmax. The window-level representation is obtained by weighted aggregation:
z = Σ_{k=1}^{T} α_k Ũ_k ∈ R^64,
yielding a fixed-dimensional descriptor that encodes both temporal dynamics and spatial patterns discriminative for fall detection.
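The pooling step can be sketched as follows (random matrices stand in for the learned W_a and v; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
D, D_ATT = 64, 32
Wa = rng.standard_normal((D_ATT, D)) * 0.1  # projection to attention space
v = rng.standard_normal(D_ATT) * 0.1        # maps to scalar scores

def temporal_attention_pool(u):
    """u: (T, D) -> (D,) window descriptor, plus the (T,) attention weights."""
    scores = np.tanh(u @ Wa.T) @ v       # v^T tanh(Wa u_k) per time step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over time steps
    return alpha @ u, alpha              # weighted aggregation z = sum_k alpha_k u_k
```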
  • Step 7: Window-Level Classification: The pooled representation z is passed through a dropout layer with a rate p = 0.5 for regularization, then mapped to a scalar logit via a fully connected layer (FC):
    o = w_o^⊤ Drop(z) + b_o ∈ R,
    where w_o ∈ R^64 and b_o ∈ R are learnable parameters of the FC layer. A sigmoid activation function produces the final window-level fall probability:
    ŷ = σ(o) ∈ (0, 1),
    where y ^ 0.5 indicates a predicted fall, and y ^ < 0.5 indicates a predicted ADL.
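The decision rule at inference time reduces to a single linear map plus a sigmoid (a sketch with a random stand-in weight vector; dropout is disabled at inference and therefore omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
w_o = rng.standard_normal(64) * 0.1  # stand-in for the learned FC weights
b_o = 0.0                            # stand-in for the learned bias

def classify_window(z):
    """z: (64,) pooled descriptor -> (fall probability, predicted fall flag)."""
    logit = w_o @ z + b_o
    p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
    return p, p >= 0.5                # threshold at 0.5
```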

4. Implementation Details

We implemented all models using the PyTorch 2.4 framework and trained them offline on a multi-GPU server with up to eight NVIDIA RTX A5000 GPUs (24 GB memory each). The rest of the implementation details are provided below.

4.1. Datasets

We evaluated the proposed method using SmartFallMM [3], which comprises 51 participants (30 young adults ages 18–35, 21 elderly adults ages 65+) performing 14 activities: nine ADLs (drinking water, picking up objects, putting on a jacket, sweeping, hand washing, waving, walking, sitting, standing) and five fall types (forward, backward, left, right, rotational). The SmartFallMM dataset was collected under IRB 7846 and 9461. Falls were performed only by the younger participants for safety reasons, while all participants performed the ADLs. We used triaxial accelerometer and gyroscope data captured from a commodity smartwatch (Google Pixel 3 Watch) at a sampling rate of 32 Hz.

4.2. Data Segmentation

We employed a Leave-One-Subject-Out Cross-Validation (LOSO-CV) protocol using data from the 30 young participants. In each fold, one young subject was designated as the test set. From the remaining 29 young subjects, we selected one subject as a validation set for hyperparameter tuning and early stopping, while the remaining 28 young subjects were reserved for the training set. This subject-level partition ensured that no individual appeared simultaneously in training, validation, and testing within the same fold, thereby preventing data leakage and enabling robust cross-subject generalization. To further enhance ADL diversity during training, we included data from all 21 elderly participants in the training set of each fold. These elderly participants contributed ADL samples only and were never used for validation or testing.
Then, we segmented the samples into fixed-length overlapping windows of 128 samples (≈4 s at 32 Hz) for classification. To address class imbalance, we applied class-aware striding: 64 samples for ADLs and 16 samples for falls. The smaller stride for falls captured more training examples from the minority class, partially mitigating imbalance before applying loss-level weighting. The class-aware stride yielded approximately 3200 ADL windows and 1800 fall windows per fold.
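The class-aware windowing can be sketched as follows (a minimal illustration of the stride logic, not the authors' pipeline code):

```python
import numpy as np

def segment(signal, is_fall, window=128, stride_adl=64, stride_fall=16):
    """signal: (N, C) recording of one activity trial -> (num_windows, window, C).

    Falls use a smaller stride (16 vs. 64 samples) so the minority class
    yields more overlapping training windows.
    """
    stride = stride_fall if is_fall else stride_adl
    starts = range(0, len(signal) - window + 1, stride)
    return np.stack([signal[s:s + window] for s in starts])
```

For a 256-sample trial, an ADL yields 3 windows while a fall yields 9, illustrating how the stride ratio oversamples the fall class.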

4.3. Problem Formulation and Window Labeling

We formulated fall detection as a binary classification problem at the window level. Each segmented window containing a fall event was assigned label 1, while windows containing only ADLs were assigned label 0. This formulation directly matches the primary deployment objective of wearable fall detectors, which is to reliably distinguish falls from non-falls rather than differentiate between specific fall or ADL subtypes.

4.4. Training Configuration

We trained all models using Focal Loss [22] (a loss function that prioritizes learning from misclassified examples by reducing the influence of correctly classified ones) with α = 0.75 and γ = 2.0 . We chose these hyperparameters empirically and held them fixed across all experiments. For optimization, we employed the AdamW optimizer with a learning rate of 10 3 , weight decay of 5 × 10 4 , and a batch size of 64. Regularization was applied before the classification layer and batch normalization within the stream encoders.
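For reference, binary Focal Loss with the stated hyperparameters can be sketched in NumPy (an illustration of the loss formula, not the authors' PyTorch training code):

```python
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """p: predicted fall probabilities in (0, 1); y: binary labels in {0, 1}.

    The (1 - p_t)^gamma factor shrinks the contribution of well-classified
    examples, concentrating the gradient on hard or misclassified ones.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```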
For the Kalman fusion approach, we empirically set the process noise variances to Q_φ = 0.005 for the orientation angles (φ, θ, ψ) and Q_φ̇ = 0.01 for the angular rates, while the measurement noise variances were R_acc = 0.05 for accelerometer-based orientation and R_gyro = 0.1 for gyroscope measurements. The initial state was x₀ = 0 with covariance P₀ = I₆, and the sampling interval was Δt = 1/32 s (32 Hz).

4.5. Evaluation Protocol

For LOSO-CV, we used 21 folds for the SmartFallMM dataset, and for each fold, we tuned hyperparameters (learning rate, weight decay, and dropout rate) using a validation set (as described in Section 4.2) and applied early stopping during training to prevent overfitting. We used 21 folds because only 21 young participants contained both fall and ADL samples in sufficient quantity to form a valid LOSO validation and test fold for computing classification metrics.
For evaluation metrics, we used the F1 score and accuracy (in percentage) for a comprehensive performance assessment. Accuracy represents the overall proportion of correct predictions across both classes, fall (1) and no fall (0). In the results section, we report the average performance across all folds from multiple LOSO-CV sessions to ensure robust and reliable performance estimates.

5. Results

5.1. Performance Comparison with Baselines

We first established baseline performance using a single-stream transformer architecture (M_base), in which the overall architecture was identical to the proposed dual-stream model (see Figure 3), except that a single Conv1D projection with 64 dimensions replaced the dual Conv1D projections. This single stream processed the concatenated Kalman-fused inputs [a_k^SMV | o_k] ∈ ℝ⁷. We compared the baseline with three additional single-stream transformer variants: a model with Squeeze–Excitation (SE, reweighting feature channels by importance, denoted M_SE), a model with Temporal Attention Pooling (TAP, emphasizing fall-relevant time steps, denoted M_TAP), and a model combining SE and TAP (denoted M_SE+TAP). Table 1 reports average performance over 21 folds of LOSO-CV using the SmartFallMM dataset with standard deviations (std).
The results show that the model M_SE+TAP yields the greatest improvement, with an F1 score gain of +1.28 and an accuracy gain of +1.29 relative to the baseline transformer without attention. The model M_SE with SE alone provides a smaller but consistent benefit, improving the F1 score by +0.63 and the accuracy by +0.56. In contrast, the model M_TAP results in a slight performance drop, reducing the F1 score by 0.18 and the accuracy by 0.26. In terms of variability across LOSO-CV folds, M_SE and M_SE+TAP both achieve lower standard deviations than the baseline, indicating more consistent performance across subjects, whereas TAP alone does not improve consistency in M_TAP. This trend indicates that TAP is most effective when applied after channel-wise reweighting, where SE has already suppressed weak or noisy channels, rather than directly on unweighted feature representations that mix informative and noisy signals.
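To make the two attention blocks concrete, the following NumPy sketch shows SE channel reweighting and TAP applied to a (T, C) feature map. The weight shapes and reduction ratio are illustrative assumptions, not those of the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feats, w1, w2):
    """SE channel attention on (T, C) features.

    Squeeze: average over time to a (C,) descriptor.
    Excite: FC-ReLU-FC-sigmoid produces a per-channel gate in (0, 1),
    which rescales each channel of the feature map.
    """
    s = feats.mean(axis=0)
    gate = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))
    return feats * gate

def temporal_attention_pool(feats, v):
    """TAP: score each time step, softmax, and return the weighted sum (C,)."""
    scores = feats @ v                   # (T,) relevance of each time step
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a /= a.sum()
    return a @ feats
```

The ordering matters for the interaction observed in Table 1: applying TAP after SE means the temporal scores are computed on channel-reweighted features, whereas TAP alone scores time steps on a mix of informative and noisy channels.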
Having identified M_SE+TAP as the best-performing baseline configuration in Table 1, we now compare this model against the proposed dual-stream architecture (M_τ), illustrated in Figure 3, using O ∈ ℝ³ and A ∈ ℝ⁴ inputs (see step 2 in Section 3.4).
Table 2 summarizes the comparison. M_τ yields a +1.30 improvement in F1 score and a +2.34 increase in accuracy relative to the M_SE+TAP model. In addition, the F1 std decreases from 8.99 to 5.42 and the accuracy std from 8.70 to 7.16, indicating more consistent performance across subjects. This gain indicates that separating acceleration and orientation inputs into independent projection streams reduces cross-modal interference and enables more effective feature extraction from each modality, offering a clear advantage over single-stream architectures with shared early projections. Additionally, the larger improvement in accuracy (+2.34) than in F1 score (+1.30) further suggests that dual-stream processing mainly sharpens rejection of challenging non-fall events, rather than correcting only a small number of missed falls.
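At the tensor level, the stream separation is simple: each modality gets its own projection before the representations are concatenated for the shared transformer encoder. The sketch below uses plain linear maps as stand-ins for the Conv1D projections, with a 32:32 embedding split; the helper name and weight shapes are illustrative.

```python
import numpy as np

def dual_stream_embed(acc, ori, W_acc, W_ori):
    """Project each modality with its own weights, then fuse by concatenation.

    acc: (T, 4) accelerometer channels [ax, ay, az, SMV].
    ori: (T, 3) Kalman-estimated orientation [roll, pitch, yaw].
    Separate per-stream projections keep accelerometer noise from mixing
    into the orientation embedding before the shared encoder sees it.
    """
    return np.concatenate([acc @ W_acc, ori @ W_ori], axis=1)  # (T, d_acc + d_ori)
```

In a single-stream baseline, by contrast, one shared projection operates on the concatenated 7-channel input, so every embedding dimension mixes both modalities from the first layer onward.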
To further validate the performance improvement observed in Table 2, we conducted statistical significance testing across the LOSO-CV folds comparing the best-performing single-stream baseline (M_SE+TAP) with the proposed dual-stream architecture (M_τ). For each fold, both models were evaluated on the same test subject, producing paired fold-level performance measurements. Specifically, we applied three complementary statistical tests. First, a paired t-test [23] evaluated whether the mean difference across folds differed significantly from zero under the assumption of normally distributed differences. Second, the Wilcoxon signed-rank test [24] provided a non-parametric alternative that tested whether the median difference was significantly different from zero without requiring normality assumptions. Third, the Nadeau–Bengio corrected t-test [25] adjusted the variance estimate of the paired t-test to account for the dependency between cross-validation folds caused by overlapping training sets. Table 3 reports the resulting p-values along with the mean performance differences (ΔF1 score = M_τ − M_SE+TAP) and 95% confidence intervals (CI).
The results indicate that M_τ consistently outperforms the M_SE+TAP baseline, achieving a mean improvement of +1.30 in F1 score and +1.49 in accuracy across LOSO-CV folds. The 95% confidence intervals of the paired performance differences remain largely positive, indicating that the observed improvements are stable across folds rather than driven by a small number of outliers. Statistical testing further supports this trend. For the primary evaluation metric (F1 score), the Wilcoxon signed-rank test indicates a statistically significant improvement (p = 0.042), while the paired t-test shows a marginally significant trend (p = 0.078). The Nadeau–Bengio corrected test produces a larger p-value due to its conservative adjustment for the dependency between cross-validation folds. This pattern is expected in LOSO-CV settings where folds share training data. Overall, these results indicate that the proposed dual-stream architecture provides consistent performance improvements over the single-stream baseline.
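The fold-dependence correction can be illustrated as follows. We assume a per-fold test/train size ratio of roughly 1/20 for LOSO over 21 subjects; this ratio is an approximation of the actual split, and the sketch covers only the two t statistics (the Wilcoxon test is omitted).

```python
import numpy as np
from math import sqrt

def paired_t(diffs):
    """Standard paired t statistic over per-fold performance differences."""
    d = np.asarray(diffs, dtype=float)
    return d.mean() / sqrt(d.var(ddof=1) / len(d))

def nadeau_bengio_t(diffs, ratio=1.0 / 20.0):
    """Nadeau-Bengio corrected t statistic.

    The 1/K variance term is inflated by the test/train size ratio to account
    for the dependence between folds induced by overlapping training sets,
    making the test more conservative than the plain paired t-test.
    """
    d = np.asarray(diffs, dtype=float)
    K = len(d)
    return d.mean() / sqrt(d.var(ddof=1) * (1.0 / K + ratio))
```

Because the corrected denominator is strictly larger, the corrected statistic is always smaller in magnitude than the plain paired t, which is why the corrected test yields the largest p-value in Table 3.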

5.2. Comparison Across Architectures

To provide a comparative reference for the proposed method, we compared it against alternative dual-stream architectures evaluated under the same LOSO-CV protocol and dataset split, using O ∈ ℝ³ and A ∈ ℝ⁴ as inputs from the SmartFallMM dataset. Specifically, we considered the DSCS method [26], which uses a dual-stream CNN with a self-attention module, a dual-stream CNN-Mamba model [27], which combines convolutional feature extraction with Mamba-inspired state-space temporal modeling, and two variants of a dual-stream LSTM baseline [28]: the original architecture and an augmented version incorporating SE and TAP attention modules.
Table 4 compares the proposed dual-stream Kalman transformer (M_τ) against alternative dual-stream architectures under the same LOSO-CV evaluation protocol, with ΔF1 score computed relative to the proposed method. The proposed model achieves the highest performance, exceeding CNN-Mamba and LSTM-based architectures by margins ranging from +2.54 to +3.60 in F1 scores.
In particular, the dual-stream CNN-Mamba exhibits a 3.04-point drop in F1 score, while the dual-stream LSTM shows a 2.54-point reduction, indicating that neither convolutional state-space modeling nor recurrent temporal modeling matches the effectiveness of transformer-based attention when combined with Kalman-fused inputs. The consistent advantage of M_τ over the CNN-Mamba and LSTM baselines suggests that transformer-based self-attention is better able to benefit from dual-stream Kalman-fused IMU signals than convolutional or recurrent models in this setting. Adding SE and TAP attention modules to the LSTM further degrades performance (a 3.60-point F1 drop), suggesting that attention mechanisms designed for transformer architectures do not transfer effectively to recurrent models. Compared to the prior dual-stream fall-detection approach, the performance gap is substantially larger: DSCS lags behind the proposed method by 24.69 F1 points, highlighting the combined benefit of Kalman-based input stabilization and transformer-based dual-stream modeling.
In terms of variability, M_τ exhibits an F1 std of 5.42 and an accuracy std of 7.16. Although the dual-stream LSTM baseline achieves lower F1 variability (88.84 ± 3.84) than M_τ, its mean F1 score is substantially lower, indicating that it is consistently under-performing across subjects. Notably, DSCS demonstrates exceptionally low variability (66.69 ± 0.19 F1 and 79.61 ± 0.07 accuracy), which is remarkably lower than all other architectures. However, this minimal variance is accompanied by substantially degraded mean performance (24.69 points lower F1), suggesting that the model fails to adapt effectively to subject-specific variations under LOSO-CV and instead produces consistently poor predictions across all test subjects. In contrast, M_τ attains a clearly higher average F1 score (91.38 ± 5.42) while maintaining moderate variability, offering a better overall trade-off between accuracy and stability for fall detection.

5.3. Computational Cost

In this section, we analyze the computational cost of the proposed method against representative dual-stream baselines. We measured model size (parameter count) in thousands (K), computational complexity (FLOPs) in millions (M), inference time per batch on an NVIDIA RTX A5000 GPU in milliseconds (ms), and preprocessing overhead with and without Kalman fusion in ms. We used the LSTM variant with Squeeze–Excitation and Temporal Attention Pooling blocks to remain consistent with the Transformer implementation. Table 5 summarizes the results obtained using the SmartFallMM dataset.
The results show that all three dual-stream architectures are lightweight and support real-time inference, but they differ in how they trade off capacity, cost, and accuracy. The proposed model M_τ uses 42K parameters and 2.8M FLOPs with an inference time of 3.0 ms per batch of 64 windows. CNN-Mamba is slightly slower and heavier (54K parameters, 7.0M FLOPs, 3.1 ms), so it incurs a higher computational cost despite achieving lower F1 scores. The dual-stream LSTM is the most compact (14K parameters, 1.7M FLOPs, 2.4 ms), but its accuracy lags behind M_τ, indicating that its lower cost comes at the expense of reduced detection performance. DSCS, in contrast, is the heaviest model at 85K parameters with 1.50M FLOPs and 3.5 ms inference time, combining the largest memory footprint and the slowest inference among the considered methods while still delivering substantially inferior F1 scores, making it the least efficient option in terms of accuracy–cost trade-off.
In terms of preprocessing, Kalman fusion adds an overhead of about 14 ms per 128-sample window for all models, whereas the preprocessing of the raw signals is effectively negligible (≈0.01 ms). Since a 128-sample window spans roughly 4 s at 32 Hz, spending about 14 ms on Kalman fusion and another 2–3 ms on model inference adds only a very small delay compared with the time needed to collect the window itself. This makes the overall processing latency per window negligible for online fall detection.

5.4. Ablation Studies

  • Consistency Across Multiple Datasets: To further analyze the robustness of the proposed method, we conducted an ablation study based on evaluation across multiple datasets. Unlike earlier experiments that focused on architectural and input-level variations within SmartFallMM, this study examined model behavior across datasets with substantially different sensor characteristics, sampling rates, and noise profiles.
For this purpose, we evaluated the model on two additional datasets. UP-FALL [29] was collected using MBIENT IMU sensors at 18 Hz from 17 young participants and represents a low-sampling-rate, research-grade sensing setup. WEDA-FALL [30] was collected using consumer-grade Fitbit devices at 50 Hz from 14 participants and exhibits higher sensor noise typical of real-world wearables. All datasets used wrist-mounted sensors and were evaluated independently under a subject-independent LOSO-CV protocol. Dataset-specific configurations were adjusted to account for differences in sampling rate across datasets. In particular, the window length was set to 160 samples (≈8.9 s) for UP-FALL at 18 Hz, and 250 samples (≈5.0 s) for WEDA-FALL at 50 Hz to ensure that each window captured the full temporal dynamics of fall events. The embedding allocation in the dual-stream architecture remained consistent across datasets, with approximately 65% of the embedding capacity assigned to acceleration features and 35% to orientation features. For the configuration details of the SmartFallMM dataset, please refer to Section 4.
Table 6 shows that the proposed method achieves the best performance across all evaluated datasets in terms of both F1 score and accuracy. Specifically, the proposed method attains an F1 score of 91.38 with an accuracy of 88.44 on SmartFallMM, 95.18 with 96.53 accuracy on UP-FALL, and 95.41 with 94.57 accuracy on WEDA-FALL. In each case, these results exceed the corresponding CNN-Mamba and LSTM baselines. Additionally, the relatively low F1 std across datasets, typically under 6% for the proposed method, indicates stable performance under the LOSO-CV protocol and limited subject-to-subject variation. In contrast, baseline models such as LSTM exhibit higher variance, particularly on UP-FALL (F1 std of 18.02), reflecting less consistent generalization across participants. The preservation of performance gains and reduced variability across the SmartFallMM, UP-FALL, and WEDA-FALL datasets supports three key mechanisms: (1) transforming noisy gyroscope readings into semantic orientation, (2) isolating modalities in early layers, and (3) the dependence of dual-stream capacity benefits on input quality; all three remain valid under different sampling rates and sensor noise profiles.
  • Performance Across Different Variants of Dual-Stream Architecture: In this section, we repeated previous experiments (Table 2) to isolate the effect of the dual-stream architecture from the benefits of Kalman fusion. For this purpose, we used raw gyroscope (ω_k = [ω_x, ω_y, ω_z] ∈ ℝ³) and accelerometer (a_k = [a_x, a_y, a_z] ∈ ℝ³) measurements while keeping all other components fixed. This experiment allowed us to evaluate whether architectural stream separation alone was sufficient to improve performance, or whether its effectiveness depended on the quality of the input signals.
Table 7 summarizes the results comparing the single-stream (M_SE+TAP) and dual-stream (M_τ) architectures using Kalman-fused orientation inputs and raw IMU measurements. ΔF1 denotes the performance difference between Kalman-fused and raw inputs for both models. Bold values indicate the highest performance. These results show that, relative to the Kalman-fused baseline, both raw-input configurations exhibit a decline in performance, with the degradation being more pronounced for the dual-stream architecture. The single-stream model M_SE+TAP shows a moderate F1 score reduction of 0.84, whereas the dual-stream model M_τ incurs a larger drop of 1.38. This additional 0.54-point loss in F1 score for the dual-stream setting indicates that separating raw accelerometer and gyroscope signals into independent streams exacerbates the negative impact of sensor noise, rather than alleviating it. Consequently, when orientation information is not stabilized through Kalman fusion, architectural stream separation leads to diminished performance compared to single-stream processing.
In addition to mean performance, the Kalman-fused configurations also exhibit lower standard deviations in both F1 score and accuracy, particularly for the M_τ model (5.42 vs. 7.27 for F1 score). This reduction in variance suggests that Kalman-based preprocessing not only boosts average detection performance but also enhances training stability and consistency across subjects. By contrast, the higher variability observed in raw-input settings reflects greater sensitivity to sensor noise and participant-specific motion patterns.
Next, we analyzed the effect of embedding capacity allocation between streams in the dual-stream architecture. Specifically, we varied the dimensional split between the acceleration and orientation (or gyroscope) streams while allowing the total embedding dimension to change accordingly. Understanding how embedding capacity should be distributed across streams is important because the acceleration and orientation modalities differ in both input dimensionality (four versus three channels) and signal characteristics.
Table 8 summarizes the results for Kalman-fused inputs. A balanced allocation of 32:32 with a total embedding dimension of 64 achieves the highest F1 score (91.10%). Although this configuration uses a smaller total embedding dimension, it also attains the lowest standard deviation (±4.77), indicating more stable performance across subjects compared to the higher-capacity 48:48 and 48:24 settings. Increasing the total dimensionality to 96 with a balanced split reduces performance, and shifting the allocation toward the acceleration stream (48:24) further degrades the F1 score. These results indicate that for stabilized orientation input, balanced capacity allocation (32:32) with moderate dimensionality is more effective and yields more consistent performance than increasing or skewing embedding capacity.
  • Effect of Yaw Drift on Classification Performance: To examine whether yaw drift affects classification performance, we conducted ablation experiments evaluating the contribution of the yaw channel to the orientation representation. Specifically, we tested three configurations: (1) the full Kalman-based orientation representation including yaw, (2) a drift-free alternative where yaw was replaced by the gyroscope magnitude ‖ω‖₂ = √(ω_x² + ω_y² + ω_z²), and (3) a configuration where yaw was completely excluded from the input. All experiments were performed on the SmartFallMM dataset using the same LOSO-CV protocol.
The results in Table 9 indicate that including yaw provides a modest measurable performance benefit. Removing yaw reduces the mean F1 score by 1.50 percentage points compared with the full Kalman configuration. Moreover, the cross-fold standard deviation nearly doubles when yaw is excluded (from 5.36 to 10.55), suggesting that yaw contributes to more stable performance across subjects. Replacing yaw with the drift-free gyroscope magnitude partially recovers performance, indicating that rotational motion information is important for fall detection. However, the magnitude representation does not fully capture the temporal orientation dynamics encoded by yaw. These results suggest that although yaw is susceptible to drift due to gyroscope integration, the bounded window duration (128 samples ≈ 4 s) limits the magnitude of this drift, and the temporal patterns contained in the yaw signal remain informative for fall detection. Consequently, the integrated yaw channel provides useful rotational context that improves both average performance and cross-subject stability.

5.5. Real-Time Testing

For real-time testing, we had three versions of the SmartFall app developed in our prior work, which include watch-based, phone-based, and server-based, as described in [31]. Because the proposed model M_τ was developed using the PyTorch framework, which is not supported by TensorFlow Lite, direct deployment on the smartwatch was not feasible. Therefore, we used the server-based version of the SmartFall app [32], where the PyTorch model was deployed on the server without conversion. In this setup, a smartwatch worn on the left wrist continuously collects accelerometer and gyroscope data, which are compressed into a binary format and transmitted to a cloud server using the NATS.io messaging framework. On the server, a Kalman filter processes the signals to extract motion features (SMV, roll, pitch, and yaw), which are then input to the fall detection model, and the prediction results are sent back to the smartwatch to trigger a fall notification when needed. Figure 4 shows the complete real-time inference pipeline.
In particular, the SmartFall app follows a three-queue streaming mechanism to aggregate sensor data and generate fall predictions in a continuous setting. The first component, referred to as the Alpha queue, manages multiple Beta queues, each containing 128 sensor samples representing a fixed-length window of accelerometer and gyroscope data, which are transmitted to the server-based model for inference. At any given time, the Alpha queue maintains 10 Beta queues. To enable continuous monitoring, a new Beta window is generated by sliding the window by one sample at a time, producing approximately 99% overlap between consecutive windows. Although this differs from the offline training configuration, the higher overlap is used in real-time deployment to ensure that fall events occurring between window boundaries are still captured during inference, and no obvious latency is experienced by the user. The prediction outputs are stored in a third queue, the Heuristics queue, where the average probability across the 10 Beta windows within the Alpha queue is computed and compared with a predefined threshold to classify the event as fall or no fall. The content of the Alpha queue is archived to the backend NoSQL database for feedback and continuous evaluation of the model in a process called personalization, which is not within the scope of this paper.
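The Heuristics-queue decision rule reduces to averaging the most recent per-window fall probabilities before thresholding. The sketch below illustrates this logic; the class name and the 0.5 threshold are our own illustrative choices, not values from the deployed app.

```python
from collections import deque

class FallAggregator:
    """Average the last n_windows per-window fall probabilities and fire
    only when the mean crosses a threshold, suppressing isolated positives."""

    def __init__(self, n_windows=10, threshold=0.5):
        self.probs = deque(maxlen=n_windows)
        self.threshold = threshold

    def push(self, p):
        """Add one window's fall probability; return True if a fall is declared."""
        self.probs.append(p)
        if len(self.probs) < self.probs.maxlen:
            return False  # still warming up: fewer than n_windows predictions
        return sum(self.probs) / len(self.probs) >= self.threshold
```

Because the mean over ten overlapping windows must exceed the threshold, a single spurious high-probability window cannot trigger an alarm, at the cost of a short decision delay while the queue fills.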
We recruited five participants (males, aged between 21 and 26 years) from the Texas State University community under IRB 9461. The inclusion criteria required participants to be healthy adults with no known mobility impairments, balance disorders, or recent musculoskeletal injuries that could increase the risk of performing simulated fall activities. All participants were physically capable of safely performing the fall and daily activity protocols defined in the SmartFallMM dataset.
Table 10 presents the real-time performance of the M_τ model when tested on five participants, performing all 14 activities (nine ADLs and five fall types as listed in the SmartFallMM dataset), each repeated five times. The average F1 score achieved is 83%, which shows an improvement over the previous version of our study [2], where the real-world F1 score reached a maximum of 80%. This indicates that Kalman fusion with a dual-stream architecture, which incorporates both accelerometer and gyroscope signals with advanced preprocessing, enhances the model’s ability to distinguish falls from daily activities. Furthermore, the average real-time accuracy of 90% demonstrates that the model maintains strong predictive performance during live operation, confirming its robustness and reliability for continuous fall detection in practical use.
The common failure across all participants occurred in lateral or rotational falls and in ADLs with large posture changes. Leftfall, Rightfall, and Rotatefall each showed a missed-detection ratio of approximately 2/5, i.e., about two out of every five such falls were not detected. For ADLs, Put On and Take Off Jacket corresponded to an approximate false-positive (FP) ratio of 1/5, while Sit/Stand showed a higher FP ratio of about 2/5, indicating that sit-to-stand transitions are particularly prone to misclassifications as falls.
Figure 5 provides examples corresponding to the failure patterns reported above. Here, p denotes the predicted probability of the fall class obtained from the model output after the sigmoid activation, t indicates the time index within the 128-sample window used for inference, and GT denotes the ground-truth activity label. For additional visualization, the gyroscope magnitude (gyro mag) is also shown in the plots to highlight rotational motion, although this feature was not used as an input to the proposed method.
In the false-negative example (Figure 5a), the missed fall reflects the behavior observed in Rotatefall, one of the fall types with the highest missed-detection ratio. In this case, the acceleration magnitude does not exhibit a strong impact peak, while the Kalman-estimated orientation evolves smoothly with gradual roll and pitch variation. This distributed motion reduces the sharp vertical deceleration cues typically used by the model to identify fall events. In contrast, the false-positive example (Figure 5b) illustrates the type of ADL misclassifications reported for Sit/Stand. Here, the rapid posture transition produces noticeable acceleration fluctuations together with substantial orientation variation, indicating strong rotational motion at the wrist. These dynamics resemble fall-like motion and lead the model to incorrectly trigger a fall prediction.
Based on a detailed analysis of the failure patterns, we conclude that they arise from three main factors: (1) lateral and rotational falls often produce lower peak acceleration with more distributed impact, reducing the sharp vertical deceleration cues that the model relies on, and therefore leading to missed detections; (2) sit-to-stand transitions induce rapid changes in orientation and vertical acceleration that closely resemble fall signatures, causing the model to trigger false positives; and (3) arm-intensive ADLs such as putting on or taking off a jacket generate high wrist-level angular velocities that are misinterpreted as fall-like motion, again resulting in false positives.
Based on our prior deployment studies [2,33], experiments with five participants can provide a reasonable initial estimate of system performance, although we acknowledge that this scale is limited. The reason is that recruiting a large number of participants for real-world testing is time-consuming and labor-intensive. Therefore, to analyze real-world behavior before conducting larger user studies, we are currently developing a real-time testing simulator to study factors that influence deployment performance.
Regarding the reason for the performance gap between offline evaluation and real-time deployment (see Table 2 and Table 10), one important factor is the dynamic nature of streaming sensor data. The model is trained offline using fixed windows of 128 samples (≈4 s), assuming that the fall event is fully captured within a window. In real-time deployment, however, the incoming sensor stream may not perfectly align with this window due to transmission delays, network variability, or sensor noise. Another factor arises from how the model decisions must be produced in real time while avoiding user-perceived latency. Predictions begin once the number of collected sensor samples reaches the window size and then continue as the model slides over incoming data, reusing most of the previously observed samples. To prevent isolated positive outputs from triggering false alarms while still capturing consistent fall patterns, the system aggregates several consecutive prediction probabilities before issuing the final decision. This additional stabilization step, which is necessary in real-time operation, can introduce differences from the offline evaluation pipeline.

6. Conclusions and Future Work

This paper addresses a fundamental challenge in wearable fall detection, where raw gyroscope data from MEMS smartwatches degrade classification performance when naively fused with accelerometer signals in single-stream neural architectures. To overcome this challenge, we presented a dual-stream transformer architecture, whose first stream processed acceleration features comprising triaxial measurements and their signal magnitude, and the second stream processed orientation features (roll, pitch, and yaw angles). The orientation features were computed by applying a Kalman filter that maintained an internal state estimate by fusing raw accelerometer and gyroscope measurements. Each stream employed dedicated convolutional projection layers to extract modality-specific representations before concatenation and transformer-based temporal modeling. We further incorporated Squeeze-and-Excitation channel attention and Temporal Attention Pooling to selectively emphasize discriminative features and temporal regions.
Our key findings demonstrate the synergistic interaction between signal preprocessing and architectural design. Evaluated on the SmartFallMM dataset using Leave-One-Subject-Out cross-validation, our dual-stream transformer achieved a 91.10% F1 score when the input was preprocessed using a Kalman filter. When raw gyroscope data were used directly, performance degraded to 87.58% F1, confirming that architectural sophistication requires appropriately preprocessed inputs. Further evaluation on two other datasets, UP-FALL (95.18% F1, 96.53% accuracy) and WEDA-FALL (95.41% F1, 94.57% accuracy), demonstrated consistent performance of the Kalman-filtered input across datasets with different sensor types and sampling rates. Compared to other dual-stream architectures, our approach exceeded the dual-stream CNN-Mamba by +3.04% F1 and the dual-stream LSTM by +2.54% F1, establishing state-of-the-art performance for wearable IMU-based fall detection.
To assess real-world performance, we tested the best-performing version of the dual-stream model using the server version of the SmartFall App on five participants. The system achieved an average F1 score of 83%, showing an improvement over the best real-world F1 score of 80% reported in our previous study [2] using a single-stream transformer model without Kalman filtering, but with data sensed from both wrist and hip locations. This confirms that Kalman-fused dual-stream processing transfers effectively from offline evaluation to a practical deployment setting.
Despite strong overall performance, occasional false alarms were observed during activities with pronounced posture transitions (e.g., sit-to-stand) and complex upper-body movements (e.g., putting on and taking off a jacket), as well as missed detections for lateral (left and right side falls) and rotational falls, which showed error ratios of approximately 2/5. To address these challenges, future work should pursue several directions. First, we intend to refine the Kalman fusion pipeline by maintaining a persistent filter state per user and tuning its parameters for different sampling rates, aiming to reduce both missed lateral/rotational falls and false alarms during posture transitions. Second, we intend to conduct a comprehensive real-time sensitivity evaluation using real-world feedback data from diverse groups using the smartwatch system to systematically quantify sensitivity, specificity, and robustness under realistic everyday conditions. Finally, we intend to explore uncertainty-aware prediction strategies that can flag low-confidence events for secondary checking, which may further reduce false positives without sacrificing sensitivity to true falls.

Author Contributions

Conceptualization, A.P.; Methodology, A.P.; Software, A.P., R.S. and A.H.H.N.; Validation, A.P., S.A., R.S. and A.H.H.N.; Formal Analysis, A.P., S.A., R.S. and A.H.H.N.; Investigation, A.P., S.A., R.S., S.T.H. and A.H.H.N.; Resources, A.P., S.A., S.T.H. and A.H.H.N.; Data Curation, A.P., S.A., S.T.H. and A.H.H.N.; Writing—Original Draft, A.P., S.A. and A.H.H.N.; Writing—Review and Editing, A.P., S.A., R.S., S.T.H. and A.H.H.N.; Visualization, A.P. and S.A.; Supervision, S.A. and A.H.H.N.; Project Administration, A.P., S.A. and A.H.H.N.; Funding Acquisition, A.H.H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation (NSF) under the Smart and Connected Health (SCH) Program, grant number 21223749.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The implementation code of this study is openly available in DualStreamKalmanTransformer at https://github.com/txst-cs-smartfall/DualStreamKalmanTransformer (accessed on 12 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. WHO. Falls: Fact Sheet. 2021. Available online: https://www.who.int/news-room/fact-sheets/detail/falls (accessed on 1 December 2024).
  2. Yasmin, A.; Mahmud, T.; Haque, S.T.; Alamgeer, S.; Ngu, A.H.H. Enhancing Real-World Fall Detection Using Commodity Devices: A Systematic Study. Sensors 2025, 25, 5249. [Google Scholar] [CrossRef] [PubMed]
  3. SmartFall Group, Texas State University. SmartFallMM: A Multimodal Dataset Collected with Commodity Devices. 2025. Available online: https://github.com/txst-cs-smartfall/SmartFallMM-Dataset (accessed on 13 January 2026).
  4. Xuan, J.; Zhu, T.; Peng, G.; Sun, F.; Dong, D. A Review on the Inertial Measurement Unit Array of Microelectromechanical Systems. Sensors 2024, 24, 7140. [Google Scholar] [CrossRef] [PubMed]
  5. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  6. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  7. Mauldin, T.R.; Canby, M.E.; Metsis, V.; Ngu, A.H.; Rivera, C.C. SmartFall: A Smartwatch-Based Fall Detection System Using Deep Learning. Sensors 2018, 18, 3363. [Google Scholar] [CrossRef]
  8. SmartFall Group, Texas State University. SmartFall Dataset. 2018. Available online: https://userweb.cs.txstate.edu/~hn12/data/SmartFallDataSet/ (accessed on 13 January 2026).
  9. Mauldin, T.R.; Ngu, A.H.; Metsis, V.; Canby, M.E. Ensemble Deep Learning on Wearables Using Small Datasets. ACM Trans. Comput. Healthcare 2021, 2, 5. [Google Scholar] [CrossRef]
  10. Haque, S.T.; Debnath, M.; Yasmin, A.; Mahmud, T.; Ngu, A.H.H. Experimental Study of Long Short-Term Memory and Transformer Models for Fall Detection on Smartwatches. Sensors 2024, 24, 6235. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  12. Zafar, R.O.; Zafar, F. Real-time activity and fall detection using transformer-based deep learning models for elderly care applications. BMJ Health Care Informatics 2025, 32, e101439. [Google Scholar] [CrossRef]
  13. Vavoulas, G.; Chatzaki, C.; Malliotakis, T.; Pediaditis, M.; Tsiknakis, M. The MobiAct Dataset: Recognition of Activities of Daily Living using Smartphones. In Proceedings of the 2nd International Conference on Information and Communication Technologies for Ageing Well and e-Health, Rome, Italy, 21–22 April 2016. [Google Scholar]
  14. Yhdego, H.; Li, J.; Paolini, C.; Audette, M. Wearable Sensor Gait Analysis of Fall Detection using Attention Network. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Virtual, 9–12 December 2021; pp. 3137–3141. [Google Scholar] [CrossRef]
  15. Shin, J.; Miah, A.S.M.; Egawa, R.; Hirooka, K.; Hasan, M.A.M.; Tomioka, Y.; Hwang, Y.S. Fall recognition using a three stream spatio temporal GCN model with adaptive feature aggregation. Sci. Rep. 2025, 15, 10635. [Google Scholar] [CrossRef]
  16. Rehouma, H.; Boukadoum, M. Fall Detection by Deep Learning-Based Bimodal Movement and Pose Sensing with Late Fusion. Sensors 2025, 25, 6035. [Google Scholar] [CrossRef]
  17. Liu, K.C.; Lin, Y.D. Efficient fall detection using Kalman filter-enhanced triaxial accelerometer signals and machine learning. Biomed. Signal Process. Control 2026, 114, 109304. [Google Scholar] [CrossRef]
  18. Shi, Y.; Zhang, Y.; Li, Z.; Yuan, S.; Zhu, S. IMU/UWB Fusion Method Using a Complementary Filter and a Kalman Filter for Hybrid Upper Limb Motion Estimation. Sensors 2023, 23, 6700. [Google Scholar] [CrossRef]
  19. An, G.; Zhou, W.; Wu, Y.; Zheng, Z.; Liu, Y. Squeeze-and-Excitation on Spatial and Temporal Deep Feature Space for Action Recognition. In Proceedings of the 2018 14th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 12–16 August 2018; pp. 648–653. [Google Scholar] [CrossRef]
  20. Essa, E.; Abdelmaksoud, I.R. Temporal-channel convolution with self-attention network for human activity recognition using wearable sensors. Knowl.-Based Syst. 2023, 278, 110867. [Google Scholar] [CrossRef]
  21. Wang, Z.; Kang, K. Adaptive temporal attention mechanism and hybrid deep CNN model for wearable sensor-based human activity recognition. Sci. Rep. 2025, 15, 33389. [Google Scholar] [CrossRef]
  22. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  23. Student. The Probable Error of a Mean. Biometrika 1908, 6, 1–25. [Google Scholar] [CrossRef]
  24. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
  25. Nadeau, C.; Bengio, Y. Inference for the Generalization Error. Mach. Learn. 2003, 52, 239–281. [Google Scholar] [CrossRef]
  26. Zhang, J.; Li, Z.; Liu, Y.; Li, J.; Qiu, H.; Li, M.; Hou, G.; Zhou, Z. An Effective Deep Learning Framework for Fall Detection: Model Development and Study Design. J. Med. Internet Res. 2024, 26, e56750. [Google Scholar] [CrossRef] [PubMed]
  27. Liu, C.P.; Li, J.H.; Chu, E.P.; Hsieh, C.Y.; Liu, K.C.; Chan, C.T.; Tsao, Y. Deep Learning-based Fall Detection Algorithm Using Ensemble Model of Coarse-fine CNN and GRU Networks. In Proceedings of the 2023 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Jeju, Republic of Korea, 14–16 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  28. Wu, J.; Wang, J.; Zhan, A.; Wu, C. Fall Detection with CNN-Casual LSTM Network. Information 2021, 12, 403. [Google Scholar] [CrossRef]
  29. Martínez-Villaseñor, L.; Ponce, H.; Brieva, J.; Moya-Albor, E.; Núñez-Martínez, J.; Peñafort-Asturiano, C. UP-fall detection dataset: A multimodal approach. Sensors 2019, 19, 1988. [Google Scholar] [CrossRef]
  30. Fula, V.; Moreno, P. Wrist-based fall detection: Towards generalization across datasets. Sensors 2024, 24, 1679. [Google Scholar] [CrossRef]
  31. Yasmin, A.; Mahmud, T.; Debnath, M.; Ngu, A.H. An empirical study on ai-powered edge computing architectures for real-time iot applications. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; pp. 1422–1431. [Google Scholar]
  32. SmartFall Group, Texas State University. Optimizing Real-Time Fall Detection: Integrating NATS.io for Low-Latency IoT Edge Applications. 2024. Available online: https://smartfall.github.io/assets/docs/SayaliNATS.pdf (accessed on 29 January 2026).
  33. Ngu, A.H.; Metsis, V.; Coyne, S.; Srinivas, P.; Salad, T.; Mahmud, U.; Chee, K.H. Personalized Watch-Based Fall Detection Using a Collaborative Edge-Cloud Framework. Int. J. Neural Syst. 2022, 32, 2250048. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration of the challenge introduced by raw gyroscope signals from the SmartFallMM dataset [3]: The left column shows an ADL (sitting down on a chair followed by standing up), while the right column shows a back fall. The bottom row displays full triaxial gyroscope signals (rad/s) over time, with red boxes highlighting windows containing similar rotational patterns, while the top row presents expanded 50-sample views of these windows.
Figure 2. End-to-end pipeline of the proposed method: raw IMU data undergo modality-specific feature extraction (SMV for acceleration a_k^SMV and Kalman fusion for orientation o_k). Acceleration features are z-score normalized, while orientation features remain in their original scale. The streams are then segmented into overlapping windows and fed to the dual-stream network.
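As a concrete reading of the Figure 2 pipeline, the modality-specific preprocessing can be sketched as follows. The window length of 128 samples with a stride of 32 and the function and variable names are our illustrative assumptions, not the released code.

```python
import numpy as np

def preprocess(acc, orient, win=128, stride=32):
    """Build the two input streams: SMV-augmented acceleration and orientation."""
    smv = np.linalg.norm(acc, axis=1, keepdims=True)   # signal magnitude vector
    a = np.hstack([acc, smv])                          # a_k^SMV in R^4
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-8)  # z-score acceleration only
    starts = range(0, len(a) - win + 1, stride)
    A = np.stack([a[s:s + win] for s in starts])       # acceleration windows
    O = np.stack([orient[s:s + win] for s in starts])  # orientation kept unscaled
    return A, O
```

Keeping the orientation stream in its original scale, as the caption states, preserves the physical meaning of the roll/pitch/yaw angles for the orientation encoder.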
Figure 3. Architecture diagram of the proposed dual-stream transformer model, which is composed of separate Conv1D projections for acceleration and orientation features, followed by transformer encoder, SE channel attention, TAP temporal pooling, and classification head.
Figure 4. Illustration of the pipeline for smartwatch-based real-time testing of the proposed method using the SmartFall application [31].
Figure 5. Representative failure cases of the M_τ model: (a) a false negative where a fall is missed due to weak impact cues, and (b) a false positive where an ADL movement is misclassified as a fall due to a large posture change and high wrist angular velocity.
Table 1. Average performance (mean ± std) of baseline single-stream transformer models using the Kalman-fused 7-channel input ([a_k^SMV ∥ o_k] ∈ R^7). ✓ indicates that the corresponding module is included in the model, while ✗ indicates that it is not used. Bold values represent the highest performance gain.
| Model | SE | TAP | F1 Score | Accuracy | Δ F1 |
|---|---|---|---|---|---|
| M_base (no attention) | ✗ | ✗ | 88.52 ± 11.10 | 83.67 ± 14.21 | |
| M_SE | ✓ | ✗ | 89.15 ± 5.77 | 84.23 ± 12.15 | +0.63 |
| M_TAP | ✗ | ✓ | 88.34 ± 9.62 | 83.41 ± 9.34 | −0.18 |
| M_SE+TAP | ✓ | ✓ | **89.80 ± 8.99** | **84.96 ± 8.70** | **+1.28** |
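The two attention modules ablated in Table 1 can be sketched as plain forward passes over a single window of encoder features. The layer sizes, random weights, and the tanh-free ReLU excitation are illustrative assumptions for this sketch; in the actual model the weights are learned end to end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation [6]: gate channels by their global temporal context."""
    z = x.mean(axis=0)                          # squeeze: (C,) average over time
    s = sigmoid(w2 @ np.maximum(0.0, w1 @ z))   # excitation: channel gates in (0, 1)
    return x * s                                # reweight each channel

def tap(x, v):
    """Temporal Attention Pooling: softmax-weighted sum over time steps."""
    scores = x @ v                              # (T,) relevance of each time step
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ x                                # (C,) pooled feature vector

rng = np.random.default_rng(0)
T, C, r = 128, 64, 16                           # window length, channels, SE reduction
x = rng.standard_normal((T, C))                 # stand-in for encoder output
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
v = 0.1 * rng.standard_normal(C)
pooled = tap(se_block(x, w1, w2), v)            # per-window embedding for the classifier
```

TAP replaces mean pooling with a learned weighting, which is what lets the model concentrate on the short impact phase of a fall rather than averaging it away.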
Table 2. Performance comparison (mean ± std) between the single-stream architecture with input [a_k^SMV ∥ o_k] ∈ R^7 and the dual-stream architecture with A ∈ R^4 and O ∈ R^3 inputs. Bold values indicate the highest performance.
| Architecture | F1 Score | Accuracy | Δ F1 Score |
|---|---|---|---|
| M_SE+TAP | 89.80 ± 8.99 | 84.96 ± 8.70 | |
| M_τ (Proposed) | **91.10 ± 5.42** | **87.30 ± 7.16** | +1.30 |
Table 3. Statistical significance analysis comparing the proposed dual-stream architecture (M_τ) with the best single-stream baseline (M_SE+TAP) across LOSO-CV folds of the SmartFallMM dataset.
| Metric | Δ | 95% CI | Paired t-Test | Wilcoxon | Nadeau–Bengio |
|---|---|---|---|---|---|
| F1 Score | +1.30 | [+0.36, +6.50] | p = 0.078 | p = 0.042 | p = 0.383 |
| Accuracy | +1.49 | [−0.18, +6.55] | p = 0.120 | p = 0.094 | p = 0.441 |
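The Nadeau–Bengio column applies the variance-corrected resampled t statistic [25], which inflates the fold-to-fold variance to account for overlapping training sets across cross-validation folds; for 21-fold LOSO, each fold tests one subject and trains on twenty, so the test/train ratio is 1/20. A minimal sketch (the function name is ours; the p-value would then come from a t distribution with k − 1 degrees of freedom):

```python
import numpy as np

def nadeau_bengio_t(diffs, test_frac):
    """Variance-corrected resampled t statistic (Nadeau & Bengio, 2003).

    diffs: per-fold metric differences (model A - model B)
    test_frac: n_test / n_train per fold (1/20 for 21-fold LOSO)
    """
    k = len(diffs)
    d_bar = diffs.mean()
    var = diffs.var(ddof=1)
    corrected_var = (1.0 / k + test_frac) * var   # NB correction term
    return d_bar / np.sqrt(corrected_var)

# Illustrative call with made-up per-fold differences
t_stat = nadeau_bengio_t(np.array([1.0, 2.0, 3.0]), test_frac=0.5)
```

Because the correction enlarges the variance estimate, the Nadeau–Bengio p-values in Table 3 are more conservative than the uncorrected paired t-test, which is consistent with the pattern shown.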
Table 4. Performance comparison (mean ± std) across dual-stream architectures on the SmartFallMM dataset. Bold values indicate the highest performance.
| Architecture | F1 Score | Accuracy | Δ F1 |
|---|---|---|---|
| M_τ (Proposed) | **91.38 ± 5.42** | **88.44 ± 7.16** | |
| Dual-Stream CNN-Mamba [27] | 88.34 ± 7.66 | 83.73 ± 9.89 | −3.04 |
| Dual-Stream LSTM [28] | 88.84 ± 3.84 | 85.28 ± 5.76 | −2.54 |
| Dual-Stream LSTM [28] + SE + TAP | 87.78 ± 3.29 | 84.12 ± 12.47 | −3.60 |
| DSCS [26] | 66.69 ± 0.19 | 79.61 ± 0.07 | −24.69 |
Table 5. Comparison of computational cost in terms of parameter counts, FLOPs, and inference times measured with batch size 64 on an NVIDIA RTX A5000 GPU. Preprocessing times are measured per 128-sample window on the CPU.
| Method | Params (K) | FLOPs (M) | Inference (ms/batch) | Preproc. Raw (ms) | Preproc. Kalman (ms) |
|---|---|---|---|---|---|
| M_τ (Proposed) | 422.8 | 3.0 | 0.01 | 1 | 4.26 |
| CNN-Mamba [27] | 547.0 | 3.1 | 0.01 | 1 | 3.97 |
| LSTM [28] + SE + TAP | 141.7 | 2.4 | 0.01 | 1 | 4.12 |
| DSCS [26] | 851.5 | 3.5 | 0.01 | 1 | 4.01 |
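Inference latencies of the kind reported in Table 5 are typically measured with warm-up iterations followed by an averaged timed loop. A minimal sketch of that measurement harness; the `model` here is a stand-in callable, not the actual network, and the shapes are illustrative.

```python
import time
import numpy as np

def time_inference(fn, x, warmup=10, iters=100):
    """Average wall-clock latency (ms) per batch for a callable model."""
    for _ in range(warmup):           # warm caches/allocators before timing
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters * 1e3

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64))
model = lambda inp: np.maximum(0.0, inp @ w)   # stand-in for the real network
batch = rng.standard_normal((64, 128))          # batch of 64 windows
ms_per_batch = time_inference(model, batch)
```

On a GPU, one would additionally synchronize the device before reading the clock so that queued kernels are included in the measured interval.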
Table 6. Performance comparison across multiple datasets using Kalman-fused inputs. Results are reported as mean ± std F1 score and accuracy in percentages under LOSO-CV. Bold values denote the highest performance for each dataset.
| Dataset | M_τ (Proposed) F1 | M_τ Accuracy | CNN-Mamba [27] F1 | CNN-Mamba Accuracy | LSTM [28] + SE + TAP F1 | LSTM Accuracy |
|---|---|---|---|---|---|---|
| SmartFallMM | **91.38 ± 5.42** | **88.44 ± 7.16** | 88.34 ± 7.66 | 83.73 ± 9.89 | 88.84 ± 3.84 | 85.28 ± 5.76 |
| UP-FALL | **95.18 ± 3.03** | **96.53 ± 2.29** | 91.61 ± 8.21 | 94.47 ± 6.56 | 82.53 ± 18.02 | 87.92 ± 9.56 |
| WEDA-FALL | **95.41 ± 2.39** | **94.57 ± 2.95** | 91.09 ± 4.14 | 88.36 ± 5.94 | 90.22 ± 10.45 | 87.58 ± 10.72 |
Table 7. Performance comparison (mean ± std) between single-stream (M_SE+TAP) and dual-stream (M_τ) architectures using Kalman-fused orientation inputs and raw IMU measurements. Δ F1 denotes the performance difference between Kalman-fused and raw inputs. Bold values indicate the highest performance.
| Method | Input | F1 Score | Accuracy | Δ F1 |
|---|---|---|---|---|
| Single-Stream (M_SE+TAP) | Raw IMU | 88.96 ± 7.66 | 84.55 ± 9.89 | −0.84 |
| Single-Stream (M_SE+TAP) | Kalman-fused | 89.80 ± 8.99 | 84.96 ± 8.70 | |
| Dual-Stream (M_τ) | Raw IMU | 87.58 ± 7.27 | 83.12 ± 9.50 | −3.52 |
| Dual-Stream (M_τ) | Kalman-fused | **91.10 ± 5.42** | **87.30 ± 7.16** | |
Table 8. Effect of embedding capacity allocation in dual-stream architectures using Kalman-fused orientation input with mean ± std. Bold values indicate the highest performance obtained from the best configuration.
| Input | Embed (acc:gyro) | Total Dim | F1 Score |
|---|---|---|---|
| Kalman-fused | 32:32 | 64 | **91.10 ± 4.77** |
| Kalman-fused | 48:48 | 96 | 89.58 ± 9.03 |
| Kalman-fused | 48:24 | 72 | 89.05 ± 7.34 |
Table 9. Effect of yaw drift on classification performance using the SmartFallMM dataset (mean ± std over LOSO-CV folds).
| Configuration | Orientation Channels | F1 Score (%) |
|---|---|---|
| Full Kalman (with yaw) | [φ, θ, ψ] | 91.65 ± 5.36 |
| Gyro magnitude (replacing yaw) | [φ, θ, ‖ω‖₂] | 90.76 ± 8.97 |
| No yaw (excluded) | [φ, θ] | 90.15 ± 10.55 |
Table 10. Performance comparison in percentages (%) per participant.
| Participant | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|
| Participant 1 | 80 | 80 | 80 | 89 |
| Participant 2 | 80 | 84 | 82 | 90 |
| Participant 3 | 87 | 84 | 85 | 92 |
| Participant 4 | 84 | 86 | 85 | 90 |
| Participant 5 | 82 | 88 | 84 | 88 |
| Average | 83 | 84 | 83 | 90 |

Share and Cite

MDPI and ACS Style

Pradhan, A.; Alamgeer, S.; Suvvari, R.; Haque, S.T.; Ngu, A.H.H. Dual-Stream Transformer with Kalman-Based Sensor Fusion for Wearable Fall Detection. Big Data Cogn. Comput. 2026, 10, 90. https://doi.org/10.3390/bdcc10030090
