During the operation of large-scale amusement rides, passengers are exposed to complex dynamic conditions—such as impulsive acceleration shocks, rotational motion, and rapid attitude changes—which may induce diverse physiological abnormalities and, in severe cases, pose health risks. Therefore, developing a reliable and efficient intelligent diagnostic framework is essential for enhancing rider safety and improving operational management. In this chapter, leveraging the physical quantities provided by a strapdown inertial navigation system (SINS) and the Chinese national standard GB 8408-2018, we establish a real-time, broadly applicable threshold-based recognition mechanism for kinematic variables, and further design a CNN–LSTM dynamic threshold-enhanced model (TAPNet) to enable accurate identification and early warning of abnormal passenger states. For physiological signals, we propose HS-BANet, which integrates a convolutional neural network with a bidirectional long short-term memory network (BiLSTM) to deeply mine discriminative patterns from multi-channel time series data, including heart rate, electrodermal activity, and acceleration, thereby effectively recognizing complex conditions such as tension, dizziness, and loss of consciousness. Comprehensive evaluations and comparative analyses demonstrate that the proposed methods achieve superior performance in terms of accuracy, robustness, and practical deployability.
4.1. Heart Rate Estimation Results and Comparative Analysis
Under high-dynamic motion conditions, photoplethysmography (PPG) signals are highly susceptible to vigorous body movements, variations in contact pressure, and tissue deformation, which introduce pronounced motion artifacts and non-stationary noise. As a result, conventional heart rate estimation methods often suffer from spectral-peak drift, abrupt estimation jumps, or discontinuous tracking trajectories during moderate-to-high intensity activities. To address these challenges, this study develops a deep learning–based heart rate estimation model and conducts systematic evaluations across multiple datasets. This section first presents the implementation details, training protocol, and evaluation metrics, and then benchmarks the proposed approach against several representative heart rate estimation methods. The results substantiate the superiority of the proposed model in terms of accuracy, temporal stability, and cross-subject generalization, with particular emphasis on its robustness under severe motion artifact conditions.
4.1.1. Experimental Implementation and Training Strategy
The proposed heart rate estimation model was implemented in PyTorch 2.1.0 and trained/evaluated under a standardized workflow using PyTorch Lightning 2.1.2. NumPy 1.26.4 was employed for data organization and statistical computation, while Matplotlib 3.8.2 was used to visualize training curves and experimental results. The model was trained using the AdamW optimizer implemented in PyTorch, with a learning rate of and a weight-decay coefficient of . The batch size was set to 8, and training was accelerated using an NVIDIA GeForce RTX 3090 GPU with 24 GB memory (ASUSTeK Computer Inc., Taipei, Taiwan). To mitigate gradient explosion, gradient clipping was applied with a clipping threshold of 1.0, which stabilized convergence.
To comprehensively assess the model’s generalization capability for heart rate estimation, five-fold cross-validation was conducted on HSSH-I (augmented training set) for model selection and hyperparameter tuning. Notably, because samples were constructed via a sliding window scheme (e.g., an 8 s window length with a 2 s step size), substantial overlap exists between adjacent windows. A naive random split at the window level would therefore introduce information leakage between training and validation sets. To avoid this issue, we adopted a subject-wise Group K-Fold strategy, ensuring that data from the same subject never appeared in both the training and validation folds, so that overlapping windows could not leak across the split. In each fold, the model was trained on the training split, and the best epoch was selected according to the mean loss on the validation split. The final model was then trained on the full HSSH-I dataset using the optimal configuration.
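The subject-wise splitting idea can be illustrated with a short sketch. This is a plain-Python stand-in written for illustration, not the authors' implementation; the function name `subject_wise_folds`, the greedy balancing rule, and the toy subject identifiers are all assumptions. scikit-learn's `GroupKFold` offers comparable behavior.

```python
from collections import defaultdict


def subject_wise_folds(subject_ids, n_splits=5):
    """Yield (train_idx, val_idx) pairs in which no subject appears on both sides."""
    # Collect the window indices that belong to each subject.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    # Greedy balancing (an assumption): assign subjects, largest first,
    # to whichever fold currently holds the fewest windows.
    folds = [[] for _ in range(n_splits)]
    for sid in sorted(by_subject, key=lambda s: (-len(by_subject[s]), str(s))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_subject[sid])

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, val_idx


# Hypothetical data: 10 subjects, each contributing 6 overlapping windows.
subject_ids = [f"S{n:02d}" for n in range(10) for _ in range(6)]
splits = list(subject_wise_folds(subject_ids, n_splits=5))
```

Because whole subjects are assigned to folds, overlapping windows from one recording can never straddle the training/validation boundary, which is the property the window-level random split lacks.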
For final evaluation, a one-shot independent test was performed on HSSH-II (independent validation set) to examine the model’s practical cross-subject generalization performance.
The primary accuracy metric was the root mean square error (RMSE, in bpm). In addition, correlation and agreement analyses were reported: the Pearson correlation coefficient was used to quantify the linear association between the estimated and reference heart rates, and Bland–Altman analysis was conducted to provide the mean bias and limits of agreement (LOA), enabling assessment of systematic bias and stability across different heart rate ranges.
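A minimal sketch of the three metrics is given below, using only the Python standard library. The function names and the sample heart rate values are hypothetical, and the 1.96 multiplier for the limits of agreement is the conventional choice rather than a value stated in the text.

```python
import math


def rmse(est, ref):
    """Root mean square error between estimated and reference HR (bpm)."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref))


def pearson_r(est, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(est, ref))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_e * var_r)


def bland_altman(est, ref, k=1.96):
    """Return (mean bias, lower LOA, upper LOA); k = 1.96 is an assumption."""
    diffs = [e - r for e, r in zip(est, ref)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))  # sample SD
    return bias, bias - k * sd, bias + k * sd


# Hypothetical values in bpm, for illustration only.
ref_hr = [72, 85, 98, 110, 121]
est_hr = [73, 84, 99, 111, 120]
error = rmse(est_hr, ref_hr)
r = pearson_r(est_hr, ref_hr)
bias, loa_low, loa_high = bland_altman(est_hr, ref_hr)
```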
4.1.2. Comparison Results with Mainstream Methods
To evaluate the accuracy of the proposed approach for dynamic heart rate estimation, we compared RMSE against several representative methods on both HSSH-I and HSSH-II. The compared baselines include a PPG-only estimator (PPG only), weighted frequency-peak voting (WFPV), frequency spectrum method (FSM), kernel-enhanced frequency spectrum method (Kernel-FSM), sparse spectrum method (SPF), and multi-spectrum method (MPF). The proposed approach is denoted as “Proposed Model.”
On HSSH-I, the conventional PPG-only method yielded the largest error, with an average RMSE of 16.31 bpm. WFPV achieved an average RMSE of 8.75 bpm. The average RMSEs of FSM, Kernel-FSM, SPF, and MPF were 3.04 bpm, 2.12 bpm, 7.06 bpm, and 3.33 bpm, respectively. In contrast, the proposed model achieved an average RMSE of only 1.18 bpm across all subjects, with multiple subjects exhibiting errors below 1 bpm. These results indicate that the proposed approach maintains high accuracy and stability even under severe motion artifacts and high-dynamic perturbations.
On HSSH-II, the overall trend remained consistent with that observed on HSSH-I. The average RMSEs of the PPG-only method and WFPV were 7.77 bpm and 4.72 bpm, respectively. MPF achieved an average RMSE of 3.00 bpm, while FSM and Kernel-FSM achieved 1.49 bpm and 2.04 bpm, respectively. The proposed model again delivered the best performance, with an average RMSE of 1.24 bpm. Notably, for several low-SNR samples, the proposed method still maintained RMSE within the 1–1.5 bpm range, demonstrating strong robustness and cross-subject generalization.
Figure 15 visualizes heart rate tracking trajectories for different subjects under high-dynamic conditions and illustrates the relationship between the estimated heart rate and changes in acceleration intensity. Specifically, Figure 15a–d present representative subjects from HSSH-I, while Figure 15e,f show representative results from the independent HSSH-II test set. It can be observed that the PPG-only method is more prone to fluctuations, abrupt jumps, or pronounced deviations during moderate-to-high intensity motion. By comparison, the proposed method remains more closely aligned with the reference heart rate curve across different motion phases, enabling stable tracking of rapid increases, oscillations, and recovery dynamics, which indicates stronger resistance to motion artifacts and improved trajectory continuity.
Figure 16 presents the Pearson correlation analysis and Bland–Altman visualizations between the estimated heart rate (HR) and the ground truth HR on both the training and test datasets. As shown in Figure 16a,b, the Pearson correlation coefficients of the proposed model are 0.9928 and 0.9865 for the training and test datasets, respectively. Figure 16c,d depict the Bland–Altman plots for each dataset. For the training dataset, the mean bias is 0.03 bpm with a standard deviation of 1.89 bpm; for the test dataset, the mean bias is 0.08 bpm with a standard deviation of 2.45 bpm. The limits of agreement (LOA) in both plots follow from these bias and standard deviation values.
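As a worked illustration, and assuming the conventional Bland–Altman definition in which the limits of agreement equal the mean bias plus or minus 1.96 standard deviations of the differences (the multiplier is an assumption, since the text does not state it), the reported bias and standard deviation would imply approximately the following limits. These are derived values, not values read from Figure 16.

```latex
\begin{aligned}
\text{LOA} &= \bar{d} \pm 1.96\,\sigma_d \\
\text{Training:}\quad & 0.03 \pm 1.96 \times 1.89 = 0.03 \pm 3.70
  \;\Rightarrow\; [-3.67,\ 3.73]\ \text{bpm} \\
\text{Test:}\quad & 0.08 \pm 1.96 \times 2.45 = 0.08 \pm 4.80
  \;\Rightarrow\; [-4.72,\ 4.88]\ \text{bpm}
\end{aligned}
```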
Through an in-depth analysis of the heart rate estimation results, the proposed approach demonstrates superior accuracy and robustness in dynamic environments, maintaining high estimation fidelity even under pronounced motion artifact interference. In the following, this chapter shifts to passenger state recognition based on physical quantity thresholds, further investigating how abnormal passenger conditions can be effectively identified in complex high-dynamic scenarios and how the developed models can be leveraged for timely warning and risk mitigation.
4.2. State Recognition Results Based on Physical Quantity Thresholds
This section provides an in-depth investigation of the state recognition capability enabled by the physical quantity threshold rules proposed in Section 3 and the dynamic-threshold-enhanced model TAPNet. By modeling the temporal characteristics of multi-channel kinematic variables, TAPNet effectively identifies abnormal behaviors across different passenger states and is further benchmarked against several conventional classifiers. Experimental results demonstrate that TAPNet achieves clear improvements in both recognition accuracy and recall for abnormal states and exhibits strong robustness when handling class-imbalanced data, underscoring its practicality for reliable detection in high-dynamic scenarios.
4.2.1. Model Evaluation and Experimental Analysis
After the preprocessing and attitude determination procedures described in Section 3, the sampling frequency was set to 100 Hz. Each sample was constructed using a fixed-length window of 200 frames, i.e., 2 s per sample. Following data cleaning, window segmentation, and automatic labeling, 2370 valid motion segments were retained. Based on the safety thresholds specified in GB 8408-2018, the resulting windows were categorized into two classes: normal and abnormal. Specifically, the dataset contains 45,352 normal samples and 3266 abnormal samples. This distribution indicates a pronounced class imbalance: normal samples account for the vast majority (approximately 93.3%), whereas abnormal samples represent only 6.7%, which is consistent with real-world operations where abnormal states occur relatively infrequently. Despite the imbalance, the two classes exhibit clear separability in terms of signal magnitude, rate of change, and duration-related characteristics, which facilitates the learning of discriminative patterns and supports accurate abnormal state recognition, thereby providing a solid data foundation for subsequent model training.
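The windowing and labeling step can be sketched as follows. This is an illustrative simplification: the non-overlapping step size, the single-channel signal, the helper names, and in particular `HYPOTHETICAL_LIMIT` are assumptions and do not reproduce the actual GB 8408-2018 thresholds or the authors' labeling module. The class proportions at the end are computed from the sample counts reported above.

```python
FS_HZ = 100          # sampling frequency stated in the text
WINDOW_FRAMES = 200  # 200 frames at 100 Hz = 2 s per sample
HYPOTHETICAL_LIMIT = 2.0  # placeholder threshold, NOT taken from GB 8408-2018


def segment(signal, window=WINDOW_FRAMES, step=WINDOW_FRAMES):
    """Cut a 1D signal into fixed-length windows (non-overlapping by assumption)."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


def label_window(window, limit=HYPOTHETICAL_LIMIT):
    """Return 1 (abnormal) if any sample magnitude exceeds the limit, else 0."""
    return int(any(abs(x) > limit for x in window))


# Toy signal: 10 s of steady readings with one spike at frame 450.
signal = [0.5] * 1000
signal[450] = 3.1
windows = segment(signal)
labels = [label_window(w) for w in windows]

# Class proportions from the counts reported in the text.
n_normal, n_abnormal = 45352, 3266
normal_pct = round(100 * n_normal / (n_normal + n_abnormal), 1)
abnormal_pct = round(100 * n_abnormal / (n_normal + n_abnormal), 1)
```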
During model training, the proposed TAPNet employs a CNN–LSTM architecture to model and classify the temporal dynamics of multi-channel physical quantities. The input is a tensor of shape 200 × 12, where 200 denotes the time-window length and 12 corresponds to the channels of tri-axial acceleration, angular velocity, velocity, and attitude. The feature-extraction stage consists of two 1D convolutional layers with pooling, designed to capture local transients and frequency-related patterns. A single LSTM layer is then used to model temporal dependencies, followed by fully connected layers and a Softmax classifier to output the final state label.
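The architecture described above can be approximated by the following PyTorch sketch. Only the overall layout (two 1D convolutions with pooling, one LSTM layer, fully connected layers, Softmax) and the 200 × 12 input come from the text; the class name `TAPNetSketch`, the kernel sizes, channel widths, and hidden size are assumptions chosen for illustration and may differ from the actual model.

```python
import torch
from torch import nn


class TAPNetSketch(nn.Module):
    """Illustrative CNN-LSTM classifier; all layer sizes are assumptions."""

    def __init__(self, n_channels=12, n_classes=2, hidden=64):
        super().__init__()
        # Two 1D convolutional layers with pooling for local feature extraction.
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Single LSTM layer for temporal dependencies.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        # Fully connected head producing class logits.
        self.head = nn.Sequential(
            nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x):
        # x: (batch, 200, 12); Conv1d expects (batch, channels, time).
        z = self.features(x.transpose(1, 2))        # (batch, 64, 50)
        _, (h_n, _) = self.lstm(z.transpose(1, 2))  # h_n: (1, batch, hidden)
        return self.head(h_n[-1])                   # logits: (batch, n_classes)


model = TAPNetSketch()
logits = model(torch.randn(4, 200, 12))  # random batch of 4 windows
probs = torch.softmax(logits, dim=1)     # Softmax over the two states
```

Returning logits and applying Softmax outside the module is a common design choice, since PyTorch's cross-entropy loss expects unnormalized logits during training.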
For optimization, TAPNet is trained using the Adam optimizer with an initial learning rate of
, a batch size of 32, and a maximum of 100 epochs. To mitigate overfitting, early stopping is applied with a patience of 10 epochs based on the validation loss. Labels are generated by the aforementioned GB-standard-based automatic labeling module, and the dataset is split into 70%/10%/20% for training, validation, and testing, respectively. For performance benchmarking, several classical classifiers (including Random Forest, Support Vector Machine, and Decision Tree) are also evaluated. The detailed hyperparameter settings for all models are provided in Table 4.
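The early-stopping rule with a patience of 10 epochs can be sketched as below. The function name and the synthetic loss curve are hypothetical, and the exact stopping criterion used by the authors (for example, whether a minimum improvement margin is required) is not stated, so a strict "lower than the best so far" test is assumed.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on its best value so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never exhausted: training ran for every recorded epoch.
    return len(val_losses) - 1, best_epoch


# Synthetic curve: the loss improves for 20 epochs, then plateaus at a worse value.
val_losses = [1.0 / (e + 1) for e in range(20)] + [0.2] * 30
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=10)
```

In this toy curve the best epoch is index 19, and training stops at index 29 after ten epochs without improvement; the checkpoint from the best epoch would then be restored.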
4.2.2. Model Comparison and Analysis
After training the proposed TAPNet (Threshold-Aware Physical Network) on the same dataset, its performance was evaluated on both the validation and test sets. TAPNet achieved an accuracy of 96.9% on the validation set, which further increased to 98.2% on the test set. The corresponding confusion matrices are shown in Figure 17.
As can be observed, on the validation set, the model correctly identified 454 abnormal samples while misclassifying 670 abnormal instances as normal. It also produced 330 false positives, i.e., normal samples incorrectly predicted as abnormal, and correctly classified 45,381 normal samples. On the test set, TAPNet correctly detected 837 abnormal samples and misclassified 524 abnormal samples as normal; meanwhile, 463 normal samples were incorrectly predicted as abnormal, and 45,352 normal samples were correctly recognized.
Overall, TAPNet maintains consistently high recognition performance on both splits, demonstrating strong capability in detecting abnormal states while preserving a high classification accuracy for normal states—indicative of robust generalization and practical reliability. Compared with conventional methods, TAPNet substantially improves the recall of abnormal conditions and the overall accuracy. Moreover, its low-latency and lightweight design makes it a promising candidate for real-world deployment on embedded platforms.
To evaluate the performance of conventional classifiers for passenger state recognition, five representative machine-learning methods were introduced for comparative experiments: Random Forest (RF), Artificial Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DT), and Naive Bayes (NB). All models were trained using the same feature inputs, with the recognition target defined as a binary label (normal/abnormal). Their corresponding confusion matrices are presented in Figure 18. The training data were collected from 46 randomly selected participants, whereas the validation and test sets were drawn from an additional 25 independent subjects. The data split protocol was kept identical to that used for TAPNet to ensure a fair comparison.
As shown in Figure 18, except for RF, the other models did not incorporate explicit class-balancing strategies during training, which led to consistently weak recognition performance for the minority (abnormal) class. Specifically, RF and ANN exhibited relatively balanced overall behavior: both achieved high prediction accuracy for normal samples and comparatively stronger capability in detecting abnormal states, correctly identifying 1766 and 2297 abnormal samples, respectively. However, both still suffered from non-negligible missed detections: RF misclassified 8307 abnormal samples as normal, while ANN misclassified 8542.
Although SVM produced almost no errors for the normal class (only 16 normal samples were incorrectly predicted), its ability to recognize abnormal conditions was clearly insufficient. It correctly detected only 2440 abnormal samples, while misclassifying the remaining 8896 as normal, indicating severe bias toward the majority class. The DT model was relatively stable for normal-state recognition but remained limited in abnormal detection: it correctly identified 2633 abnormal samples and misclassified 8710, suggesting susceptibility to overfitting in complex scenarios. Overall, NB delivered the weakest performance, correctly detecting only 214 abnormal samples and assigning 11,128 abnormal instances to the normal class, demonstrating clearly inadequate accuracy and robustness.
Figure 19 reports a comparative evaluation of six classification models on the passenger state recognition task using four standard performance metrics: accuracy, precision, recall, and F1-score. The horizontal axis lists the evaluated models (TAPNet, Random Forest, Artificial Neural Network, Support Vector Machine, Decision Tree, and Naive Bayes), while the vertical axis indicates the metric scores ranging from 0 to 1.
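For reference, the four metrics can be computed from binary confusion-matrix counts as sketched below, treating the abnormal class as positive. The function name and the counts in the example are hypothetical and are not taken from Figures 17 to 19.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 with 'abnormal' as the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # share of predicted-abnormal that is truly abnormal
    recall = tp / (tp + fn)      # share of truly abnormal samples that is detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced example: 100 abnormal and 900 normal samples.
metrics = binary_metrics(tp=80, fn=20, fp=10, tn=890)
```

Under class imbalance, accuracy is dominated by the majority class, which is why recall and F1-score for the abnormal class are the more informative columns when comparing the models.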
Overall, TAPNet achieves the best performance across all four metrics: its accuracy is close to 1, and its precision, recall, and F1-score also remain consistently high, highlighting its clear advantage in balancing the recognition of normal and abnormal states. In comparison, Random Forest and ANN exhibit relatively balanced behavior with high accuracy and precision; however, their recall and F1-score are noticeably lower than those of TAPNet, indicating that their ability to detect abnormal states remains limited.
The performance gaps among SVM, DT, and NB are more pronounced. Although these models achieve comparable—or even relatively high—precision, their recall and F1-score are extremely low. This is particularly evident for Naive Bayes and SVM, whose recall values fall below 0.1, suggesting a severe deficiency in detecting abnormal samples and a clear majority-class bias that leads to overall model imbalance.
Taken together, Figure 19 provides an intuitive visualization of the performance disparities among competing methods under class imbalance and multi-channel dynamic time series conditions, further substantiating the comprehensive superiority and practical deployment potential of TAPNet for this task.
Although the physical quantity threshold–based recognition approach can effectively identify abnormal states, passengers’ physiological responses, such as heart rate variations and electrodermal activity, provide particularly critical evidence for state inference in highly dynamic environments. Therefore, building upon the results in Section 4.2, the following section introduces a physiological-signal–driven framework for state classification and diagnosis. By leveraging deep learning models to exploit multi-channel physiological time series patterns, the proposed approach aims to further improve the accuracy and reliability of abnormal state recognition.
4.3. State Classification and Diagnostic Results Based on Physiological Signals
In Section 4.2, abnormal passenger states were preliminarily categorized and identified using physical quantity thresholding. However, fine-grained diagnosis of physiological conditions (such as heart rate abnormalities, tension, or dizziness) requires dedicated analysis of physiological signals. To this end, this section proposes a PPG-based deep learning model, HS-BANet, which integrates a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and a multi-head self-attention mechanism to further extract discriminative abnormal patterns from physiological signals, enabling more effective and reliable abnormal state diagnosis.
For model development, the dataset was first randomly shuffled and then split into training, validation, and test sets at a 6:2:2 ratio, yielding 28,097, 9365, and 9365 samples, respectively. During training, cross-entropy was used as the loss function, and network parameters were optimized via backpropagation. As illustrated in Figure 20, the initial learning rate was set to a predefined value and gradually decayed as training progressed. Once the validation loss failed to decrease for several consecutive epochs, an early-stopping mechanism was triggered to prevent overfitting and reduce unnecessary computation.
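The shuffle-and-split step can be sketched as follows. The function name, the seed, and the rounding rule (validation and test each take one fifth of the samples, rounded down, with the remainder going to training) are assumptions; this rule happens to reproduce the reported split sizes for a total of 46,827 samples.

```python
import random


def split_622(n_samples, seed=0):
    """Shuffle sample indices and split them 6:2:2 into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = n_samples // 5                # 20%, rounded down (assumption)
    n_test = n_samples // 5
    n_train = n_samples - n_val - n_test  # remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Total implied by the reported split sizes: 28,097 + 9365 + 9365.
train_idx, val_idx, test_idx = split_622(28097 + 9365 + 9365)
```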
After training, the model was evaluated on the test set using a comprehensive set of metrics, including classification accuracy, precision, recall, specificity, sensitivity, and F1-score, so as to assess its recognition performance across different rhythm categories from multiple perspectives. As summarized in Table 5, the model achieves overall strong performance in discriminating among the six arrhythmia types, while a small number of misclassifications remain. For instance, approximately 2.1% of normal sinus rhythm (SR) samples are incorrectly classified as premature ventricular contractions (PVC), and 6.35% of premature atrial contractions (PAC) are misidentified as PVC. In addition, 3.63% of SR samples are mistakenly predicted as supraventricular tachycardia (SVT), whereas 7.11% of SVT samples are assigned to other categories.
When examining the learning curves of the loss and accuracy throughout training, we observe that the loss decreases progressively and then stabilizes, while the accuracy continues to increase and eventually converges. Notably, by incorporating adaptive learning-rate decay and early stopping, overfitting is substantially mitigated, and the validation accuracy becomes more stable across epochs.
A class-wise evaluation further reveals heterogeneous recognition performance across rhythm categories, including normal sinus rhythm (SR), premature atrial contractions (PAC), premature ventricular contractions (PVC), ventricular tachycardia (VCT/VT), supraventricular tachycardia (SVT), and atrial fibrillation (AF). Among them, AF exhibits the most prominent performance, achieving an average accuracy of 98.4%, with consistently high precision, recall, and F1-score. In contrast, although the overall detection rate for VT exceeds 90%, non-negligible misclassifications persist, suggesting that the model still faces challenges in distinguishing certain rhythm types with similar morphological patterns.
Overall, the model attains a six-class rhythm recognition accuracy of 91.38%, with a mean precision and recall of 86.5% and 83.5%, respectively, and an average F1-score of approximately 84.9%. The mean specificity reaches 96.4%, indicating high reliability in rejecting non-target classes (i.e., avoiding false positives). From the perspective of class distribution, the model performs best on SR (96.0% precision/97.5% recall) and AF (93.6% precision/92.6% recall), whereas performance is relatively weaker for PAC (75.9% precision/72.5% recall) and VT (71.3% recall). This degradation is likely attributable to feature overlap across classes. Future improvements could focus on incorporating richer temporal-context representations and/or multi-modal signals for low-frequency or clinically complex rhythms (e.g., PAC and VT) in order to enhance classification robustness.
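The per-class and averaged figures above follow the usual one-vs-rest reading of a multi-class confusion matrix, which can be sketched as below. The function names and the 3 × 3 matrix are hypothetical (the real task has six classes), and unweighted macro-averaging is an assumption about how the mean values were obtained.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics per class; cm[i][j] = true class i predicted as j.

    Assumes every class has at least one true and one predicted sample.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)                     # also called sensitivity
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": tn / (tn + fp),
            "f1": 2 * precision * recall / (precision + recall),
        })
    return results


def macro_average(results, key):
    """Unweighted mean of one metric across classes."""
    return sum(r[key] for r in results) / len(results)


# Hypothetical 3-class confusion matrix, for illustration only.
cm = [[50, 3, 2],
      [4, 40, 6],
      [1, 5, 39]]
results = per_class_metrics(cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)
mean_specificity = macro_average(results, "specificity")
```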
Figure 21 reports the confusion matrix on the test set. The diagonal entries correspond to correctly classified samples for each rhythm category. The overall test accuracy reaches 92.38%, indicating favorable overall discrimination among the six rhythm types. The confusion matrix also provides additional diagnostic insights: the model achieves the highest class-level accuracy for SR and AF, with accuracies of 99.53% and 97.30%, respectively. However, classification performance is notably lower for PAC and VT, with accuracies of 74.53% and 71.35%. More specifically, 9.36% of PAC samples are misclassified as AF and 10.18% as PVC, while 7.94% of PVC samples are incorrectly predicted as PAC. Moreover, 16.15% of VT samples are misclassified as SVT, whereas 2.78% of SVT samples are incorrectly assigned to VT, indicating non-trivial confusion between these two classes.
To provide a more intuitive overview of the model’s overall performance on the test set, Figure 22a illustrates the ROC curves for the six rhythm categories, where the x-axis denotes the false positive rate (FPR) and the y-axis denotes the true positive rate (TPR). In general, a larger area under the curve (AUC) indicates higher classification accuracy and stronger discriminability among rhythm types. Notably, the proposed model achieves a mean AUC of 0.986 across the six classes, and the AUC for AF reaches 0.985, further highlighting the model’s high reliability and precision in multi-class arrhythmia detection.
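As an illustration of how a per-class AUC and its mean can be obtained, the sketch below computes the ROC AUC through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one) and averages it one-vs-rest across classes. The function names, scores, and labels are hypothetical, and the authors may have used a different averaging scheme or library routine.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def macro_auc(score_matrix, true_classes, n_classes):
    """One-vs-rest AUC per class and their unweighted mean."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in score_matrix]
        labels = [1 if t == c else 0 for t in true_classes]
        aucs.append(roc_auc(scores, labels))
    return sum(aucs) / n_classes, aucs


# Hypothetical binary example: one positive is ranked below one negative.
binary_auc = roc_auc([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0])

# Hypothetical 3-class example with perfectly separated scores.
score_matrix = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
mean_auc, class_aucs = macro_auc(score_matrix, [0, 1, 2, 0], n_classes=3)
```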
Figure 22b presents the precision–recall (PR) curves for the same six classes, where recall is plotted on the x-axis and precision on the y-axis, thereby visualizing the trade-off between precision and recall under different decision thresholds. Similar to ROC analysis, a larger area under the PR curve also reflects better performance. In this evaluation, the model attains a micro-averaged AUC of 0.986 across the six rhythm types, demonstrating its strong capability to jointly maintain high precision and high recall. In other words, these results not only confirm the model’s excellent overall accuracy but also indicate that it preserves robust detection performance even under class-imbalanced data distributions.
To more comprehensively assess the potential and practical utility of HS-BANet, additional comparative experiments were conducted. The motivation for this comparison is twofold. First, the publicly available PPG subset accounts for only about 40% of the full dataset, implying that the data volume is inherently incomplete and makes it difficult to perform an absolutely fair, like-for-like evaluation of multiple models under fully identical conditions. Second, to the best of our knowledge, no prior work has reported a systematic benchmark of classical deep learning architectures using this publicly available subset.
Accordingly, Figure 23 compares the proposed HS-BANet with four representative deep learning baselines (AlexNet, VGG16, ResNet18, and BiLSTM) on the rhythm classification task using five key performance metrics: precision, sensitivity, specificity, F1-score, and accuracy. In addition to convolution-based baselines, a sequence-aware BiLSTM model is included to provide a fairer comparison for physiological time series classification.
As shown in Figure 23, HS-BANet achieves the best overall results across all five metrics, attaining 86.68% precision, 86.90% sensitivity, 98.46% specificity, an 86.75% F1-score, and 92.22% accuracy, which indicates its overall advantage in both classification performance and robustness. In contrast, AlexNet shows the weakest overall performance, with an F1-score of only 61.41%, indicating substantial limitations in handling multi-class rhythm patterns. VGG16 improves upon AlexNet across all metrics, likely benefiting from its deeper architecture and stronger hierarchical feature extraction capability. ResNet18 further improves the overall performance and achieves results relatively close to those of HS-BANet, demonstrating the effectiveness of residual learning for physiological signal classification.
Compared with the convolution-based baselines, the BiLSTM model provides a more task-relevant sequence modeling reference for physiological time series diagnosis. Its performance is higher than that of AlexNet and VGG16 and remains competitive with ResNet18, indicating that temporal dependency modeling is beneficial for rhythm classification. However, HS-BANet still outperforms BiLSTM across all reported metrics, suggesting that the combined design of residual local feature extraction, bidirectional temporal modeling, and attention-based feature refinement is more effective for multi-class physiological diagnosis under complex dynamic disturbances.
In summary, the proposed HS-BANet achieves the best overall performance among the evaluated deep learning models for multi-class classification of physiological time series signals. In particular, it improves both precision and sensitivity while maintaining high specificity, demonstrating its effectiveness and practicality as an intelligent diagnostic model for physiological signal analysis under high-dynamic conditions.
In this study, HS-BANet demonstrates outstanding performance in arrhythmia detection and is further benchmarked against representative prior methods. Table 6 summarizes the similarities and differences between the proposed approach and existing studies, comparing key characteristics and performance of several typical arrhythmia-detection methods reported in recent years, including the data source (database), number of subjects, total data volume, detection targets, adopted methodology, and test-set evaluation metrics. The compared studies span tasks ranging from single-class atrial fibrillation (AF) detection to multi-class rhythm classification (e.g., PVC, PAC, VT, SVT, and AF). The employed datasets include both self-collected databases and public resources (e.g., MIMIC-II, MIMIC-III, and PhysioNet), as well as synthetic data, with data sizes varying from several hundred to tens of thousands.
From the perspective of detection targets, Method 1, Method 2, and Method 3 primarily focus on AF detection and generally achieve high recognition rates and AUC values. In particular, Method 2 reports the best performance, supported by multi-source data, with the highest accuracy (98.20%) and AUC (0.9959). In contrast, Method 4 and Method 5 extend the task to multiple rhythm types such as PVC and PAC, reflecting stronger multi-class recognition capability. Compared with these approaches, the proposed method not only supports five-category arrhythmia recognition (PVC, PAC, VT, SVT, and AF), but also achieves strong detection performance with a moderate data scale.
In terms of methodology, prior studies predominantly employ deep learning architectures such as CNN, RNN, 2D-CNN + LSTM, and DCNN. The proposed HS-BANet integrates residual learning, BiLSTM, and a multi-head attention mechanism, thereby jointly enhancing spatial feature extraction and temporal dependency modeling. As reflected by the five key performance metrics, HS-BANet attains competitive results in precision (88.45%), sensitivity (88.14%), specificity (89.42%), F1-score (86.87%), and accuracy (92.37%), substantiating the effectiveness and superiority of the proposed model for multi-class arrhythmia recognition.