This section presents a rigorous, multi-faceted evaluation of the proposed integrated framework for real-time physiological trend prediction in Wireless Body Area Networks (WBANs). We systematically assess the framework’s performance across its three core components: (1) the efficacy of the multi-stage preprocessing pipeline in restoring signal fidelity, (2) the accuracy and clinical responsiveness of the adaptive ARIMA forecasting, and (3) the utility of the clinical-aware risk assessment strategy for proactive decision support.
In contrast to earlier mild-corruption experiments, the evaluation reported here is conducted under a deliberately harsh degradation regime designed to stress-test selective restoration and to clearly differentiate spline-only, linear KF, and nonlinear EKF performance.
4.3. Evaluation of Multi-Stage Signal Preprocessing
The efficacy of our multi-stage preprocessing pipeline, comprising cubic spline interpolation (for missingness), extended Kalman filter (EKF) denoising (for noise suppression and state estimation), and conservative local smoothing (for final signal refinement), was quantitatively assessed by measuring the Mean Squared Error (MSE) of the signal at each stage relative to its uncorrupted ground truth (GT). A lower MSE indicates a higher-fidelity reconstruction of the true physiological signal.
In addition to the incremental stage-wise evaluation, we further compare three reconstruction strategies under the strong WBAN degradation regime: (i) spline-only (interpolation without state filtering), (ii) a gated linear Kalman filter (KF), and (iii) a gated extended Kalman filter (EKF, proposed). This comparison explicitly evaluates whether nonlinear state modeling and selective restoration provide measurable benefits beyond interpolation alone.
Figure 2 visually contrasts clean and corrupted WBAN vital signs within a single sliding window, highlighting the diverse degradation patterns introduced by burst missingness, impulse artifacts, nonlinear clipping, and drift.
Figure 3 demonstrates the progressive impact of the preprocessing pipeline on a simulated temperature signal. Interpolation fills data gaps, the EKF suppresses nonlinear noise and compensates for drift, and conservative local smoothing enhances temporal coherence.
Figure 4 confirms that adaptive filtering stabilizes high-variance signals (heart rate, glucose, and systolic pressure) while preserving physiological trends, and Figure 5 highlights the samples where corruption-triggered restoration was activated.
Across all windows, the average gating rate was approximately 21.4%, meaning that roughly one in five samples triggered selective restoration. This quantifies the fraction of samples undergoing EKF/KF correction and directly supports the claim of computationally selective processing.
Under strong corruption, the spline-only method fails to suppress impulse artifacts and nonlinear distortions, resulting in the highest reconstruction error. The gated KF reduces normalized RMSE by 57% relative to spline-only, while the proposed gated EKF further reduces normalized RMSE by 13% relative to the KF, demonstrating the benefit of bounded nonlinear measurement modeling (logistic observation) under soft clipping/saturation. Despite a slightly higher runtime (2.79 s vs. 2.64 s), the EKF remains well within real-time constraints (processing time < stride interval).
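For reference, the two relative reductions compound multiplicatively. The worked arithmetic below assumes that both percentages are relative reductions of the same normalized RMSE metric and that 2.64 s is the gated KF runtime; because the reported percentages are rounded, the results are approximate.

```latex
\[
\frac{\mathrm{nRMSE}_{\mathrm{EKF}}}{\mathrm{nRMSE}_{\mathrm{spline}}}
  = (1 - 0.57)(1 - 0.13) = 0.43 \times 0.87 \approx 0.374
  \quad\Rightarrow\quad \text{overall reduction} \approx 62.6\%,
\]
\[
\frac{t_{\mathrm{EKF}} - t_{\mathrm{KF}}}{t_{\mathrm{KF}}}
  = \frac{2.79 - 2.64}{2.64} \approx 0.057
  \quad\Rightarrow\quad \text{runtime overhead} \approx 5.7\%.
\]
```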
Table 9 reports the Mean Squared Error (MSE) of each vital sign against the ground truth at each stage of the preprocessing pipeline, together with the percentage gain in signal quality (reduction in MSE) achieved by the EKF and conservative local smoothing stages relative to their preceding stage.
Notably, the per-stage analysis aligns with the global reconstruction comparison: interpolation primarily addresses missingness, the EKF provides the dominant noise suppression, and local smoothing further enhances temporal coherence.
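As a minimal sketch of how stage-wise figures of this kind can be derived, the snippet below computes the MSE of each stage against the ground truth and the percentage gain relative to the preceding stage. The function names (`mse`, `stage_gains`) and the toy values are illustrative assumptions, not the implementation or data behind Table 9.

```python
def mse(signal, truth):
    """Mean squared error between a reconstructed signal and its ground truth."""
    if len(signal) != len(truth):
        raise ValueError("signal and truth must have the same length")
    return sum((s - t) ** 2 for s, t in zip(signal, truth)) / len(truth)


def stage_gains(stages, truth):
    """Return [(name, mse, gain_vs_previous_stage_in_percent), ...].

    `stages` is an ordered list of (name, signal) pairs, for example
    corrupted, interpolated, filtered, smoothed. The first stage has no
    predecessor, so its gain is None.
    """
    rows, prev = [], None
    for name, signal in stages:
        err = mse(signal, truth)
        gain = None if prev is None or prev == 0 else 100.0 * (prev - err) / prev
        rows.append((name, err, gain))
        prev = err
    return rows


# Toy example (made-up numbers, for illustration only).
truth = [36.6, 36.7, 36.8, 36.7]
stages = [
    ("corrupted", [36.6, 38.7, 36.8, 35.7]),
    ("filtered", [36.6, 37.7, 36.8, 36.2]),
]
for name, err, gain in stage_gains(stages, truth):
    print(name, round(err, 4), None if gain is None else round(gain, 1))
```

Each gain is measured against the immediately preceding stage, so the per-stage percentages do not simply add up to the cumulative reduction.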
The results in Table 9 demonstrate the incremental improvement in signal fidelity at each stage of the preprocessing pipeline:
Interpolation: The “Interpolated” MSE shows a modest reduction compared to the “Corrupted” MSE for all vital signs. Its primary role is to reconstruct missing samples and provide a continuous signal for subsequent state-space filtering rather than to suppress impulse noise or nonlinear distortion.
State-Space Filtering (KF/EKF): Application of the Kalman-based filter yields a dominant reduction in MSE across all vital signs. Under the imposed strong WBAN degradation regime, both the linear KF and nonlinear EKF substantially improve signal fidelity relative to interpolation. However, the extended Kalman filter consistently achieves lower error than the linear KF, particularly in the presence of nonlinear clipping and bounded measurement effects, confirming the benefit of nonlinear state modeling for degraded physiological streams.
Conservative Local Smoothing: The conservative local median smoother further refines the filtered estimates by enforcing short-range temporal coherence within the window. Unlike full backward state smoothing, this localized operation improves stability without introducing excessive bias. Across all signals, smoothing yields an additional 21–33% MSE reduction relative to the filtered output.
Overall, the multi-stage pipeline cumulatively reduces the MSE relative to the initial corrupted signal by 53.2% (systolic blood pressure) to 66.7% (blood glucose), with most vital signs exhibiting reductions between 63% and 66%. This confirms its effectiveness in transforming severely degraded WBAN signals into high-fidelity trend representations suitable for downstream forecasting and clinical risk assessment.
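To make the state-space filtering stage more concrete, the following is a minimal sketch of a scalar EKF with a bounded (logistic) observation model. It assumes a random-walk state and a measurement that saturates softly between `lo` and `hi`; the parameter names, noise variances, and the specific logistic form are our assumptions for illustration and are not taken from the framework's actual configuration.

```python
import math


def logistic_obs(x, lo, hi, center, scale):
    """Bounded measurement model: maps the latent state into (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-(x - center) / scale))


def logistic_jac(x, lo, hi, center, scale):
    """Derivative of logistic_obs with respect to the state x."""
    s = 1.0 / (1.0 + math.exp(-(x - center) / scale))
    return (hi - lo) * s * (1.0 - s) / scale


def ekf_logistic(z, x0, p0, q, r, lo, hi, center, scale):
    """Scalar EKF with a random-walk state and a logistic observation.

    z: measurements; x0, p0: initial state mean and variance;
    q, r: process and measurement noise variances (hypothetical settings).
    """
    x, p, out = x0, p0, []
    for zk in z:
        p = p + q                                    # predict (random walk keeps x)
        h = logistic_jac(x, lo, hi, center, scale)   # linearize at the prediction
        innov = zk - logistic_obs(x, lo, hi, center, scale)
        s = h * p * h + r                            # innovation variance
        k = p * h / s                                # Kalman gain
        x = x + k * innov                            # state update
        p = (1.0 - k * h) * p                        # variance update
        out.append(x)
    return out


# Noise-free readings of a constant latent state, started from a wrong guess.
params = dict(lo=40.0, hi=160.0, center=100.0, scale=20.0)
z = [logistic_obs(80.0, **params)] * 200
est = ekf_logistic(z, x0=90.0, p0=10.0, q=0.01, r=1.0, **params)
print(round(est[-1], 2))
```

A linear KF corresponds to replacing the logistic map with a fixed gain, which is why it cannot represent the reduced sensitivity of the sensor near saturation.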
4.3.1. Statistical Significance of Preprocessing Improvements
To ascertain the statistical significance of the observed improvements, paired t-tests and Wilcoxon signed-rank tests were conducted on window-level MSE values (relative to the ground truth) between successive preprocessing stages.
The full dataset contains approximately 35,600 windows per vital sign (100 simulated patient-hours, s, s). To avoid inflated statistical power due to extremely large sample size and to maintain computational efficiency, windows were uniformly randomly sampled per signal for hypothesis testing. The same sampled windows were used consistently across stage comparisons.
Table 10 summarizes the resulting p-values.
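The testing procedure described above can be sketched as follows, assuming per-window MSE arrays that are aligned by window index. The helper name `compare_stages`, the subsample size, and the synthetic data are illustrative assumptions; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the standard SciPy routines for the two paired tests.

```python
import numpy as np
from scipy import stats


def compare_stages(mse_before, mse_after, n_sample, seed=0):
    """Paired t-test and Wilcoxon signed-rank test on a random subset of windows.

    mse_before / mse_after: per-window MSE for two successive stages,
    aligned so that index i refers to the same window in both arrays.
    """
    mse_before = np.asarray(mse_before, dtype=float)
    mse_after = np.asarray(mse_after, dtype=float)
    rng = np.random.default_rng(seed)
    # Uniform sampling without replacement; a fixed seed keeps the same
    # windows across stage comparisons.
    size = min(n_sample, mse_before.size)
    idx = rng.choice(mse_before.size, size=size, replace=False)
    a, b = mse_before[idx], mse_after[idx]
    return stats.ttest_rel(a, b).pvalue, stats.wilcoxon(a, b).pvalue


# Synthetic example: the later stage always has a lower error.
rng = np.random.default_rng(1)
before = rng.gamma(shape=2.0, scale=1.0, size=5000)
after = before * rng.uniform(0.3, 0.9, size=5000)
p_t, p_w = compare_stages(before, after, n_sample=500)
print(p_t, p_w)
```

Reusing the same seed for every stage pair is one simple way to keep the sampled windows consistent across comparisons.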
As shown in
Table 10, nearly all stage-to-stage comparisons yield statistically significant improvements (
), with most comparisons achieving
under both parametric and non-parametric testing.
The only exception is the Blood Glucose comparison between Corrupted Raw and Interpolated under the Wilcoxon signed-rank test (), indicating that interpolation alone provides a weak and heterogeneous improvement for this modality. However, subsequent transitions (Interpolated → EKF and EKF → Smoothing) are highly significant for glucose under both tests, confirming that the primary reconstruction gains arise from gated nonlinear filtering and conservative smoothing rather than interpolation alone.
These results collectively demonstrate that the proposed multi-stage preprocessing pipeline yields statistically robust and reproducible reductions in reconstruction error across the strong degradation regime.
In addition, paired comparisons between gated KF and gated EKF reconstruction errors yielded statistically significant improvements (Wilcoxon across all high-variance signals), confirming that bounded nonlinear filtering provides measurable benefit under the imposed degradation regime.
4.3.2. Sensitivity Analysis of Gating Hyperparameters
We evaluated the sensitivity of selective restoration to the gating hyperparameters
and
by sweeping
For each pair , we report the (i) gating rate , (ii) reconstruction error (MSE vs. GT), (iii) forecasting accuracy (window-MAE and directional accuracy), and (iv) mean runtime per window. This analysis reveals the trade-off between aggressiveness (larger ) and computational load, and supports the default setting , as a balanced operating point.
As shown in
Table 11, smaller
values increase gating aggressiveness, improving reconstruction at the cost of runtime, while a larger
increases neighborhood correction but raises computational load. The configuration
,
provides the best balance between reconstruction accuracy, forecasting performance, and computational efficiency, validating its selection as the default operating point.
4.3.3. Ablation Study: Process-All vs. Gated Restoration
To rigorously validate the claimed benefit of selective restoration, we conducted an ablation study comparing three processing strategies:
(A) Process-all EKF: Extended Kalman filtering and local median smoothing applied to every sample in each window.
(B) Gated EKF (proposed): EKF replacement and local smoothing applied only to samples within the gated set , while samples outside remain unchanged.
(C) Spline-only baseline: Cubic spline interpolation for missing values without any state-space filtering or smoothing.
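The difference between strategies (A) and (B) can be illustrated with a small sketch. The gate below flags samples by a robust z-score (median/MAD), and the restorer is a local median; both are hypothetical stand-ins chosen for brevity, not the gating rule or the EKF-plus-smoothing restorer used in the framework.

```python
import statistics


def gate_mask(x, z_thresh=3.0):
    """Flag samples whose robust z-score (median/MAD) exceeds z_thresh.

    Hypothetical gate for illustration only; the framework's gating rule
    and hyperparameters are not reproduced here.
    """
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x) or 1e-9
    return [abs(v - med) / (1.4826 * mad) > z_thresh for v in x]


def local_median(x, i, half_width=2):
    """Median of the neighborhood around index i (stand-in restorer)."""
    lo, hi = max(0, i - half_width), min(len(x), i + half_width + 1)
    return statistics.median(x[lo:hi])


def gated_restore(x, mask):
    """Replace only gated samples; all other samples pass through unchanged."""
    return [local_median(x, i) if m else v for i, (v, m) in enumerate(zip(x, mask))]


def gating_rate(mask):
    """Fraction of samples that triggered restoration."""
    return sum(mask) / len(mask)


x = [72, 73, 71, 140, 72, 74, 73, 72, 20, 73]    # two impulse artifacts
mask = gate_mask(x)
restored = gated_restore(x, mask)
print(gating_rate(mask), restored)
```

In this sketch the number of restoration updates scales with the gating rate, and a process-all variant corresponds to a mask that is `True` everywhere.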
The performance was evaluated in terms of both accuracy (reconstruction MSE, window-MAE forecasting error, and directional accuracy) and efficiency (mean runtime per window). Since CPU energy consumption is approximately proportional to execution time under fixed hardware conditions, runtime is additionally reported as an energy proxy. When available, package-level energy counters (e.g., Intel RAPL) were used to confirm this proportional relationship.
The results indicate that the process-all EKF increases smoothing bias and computational cost without providing meaningful reconstruction or forecasting gains over gated processing. In contrast, the proposed gated EKF achieves equivalent or superior accuracy while reducing update operations by approximately , resulting in a proportional runtime reduction consistent with the theoretical savings factor .
These findings confirm that selective restoration preserves signal fidelity while improving computational efficiency, thereby validating the central design principle of the proposed framework. Although the process-all EKF achieves similar reconstruction accuracy (Table 6), it increases computational cost by 26% relative to the gated EKF without a meaningful forecasting improvement.
4.3.4. Forecasting Accuracy Under Strong Corruption (Heart Rate Case Study)
To isolate the effect of preprocessing on forecasting quality, we report a heart rate (HR) case study in which the ARIMA predictor is trained on HR series obtained from different reconstruction pipelines. Unlike cross-signal reconstruction summaries (e.g., Table 7), the results in Table 8 are HR-only and are reported in raw BPM units (MAE/RMSE in BPM, and prediction interval coverage as a fraction). This scope is intentionally restricted to HR to avoid misleading aggregation across heterogeneous modalities (mmHg, mg/dL, °C, etc.).
We evaluate a one-step-ahead forecast aligned with the stride interval ( s), i.e., predicting HR at s using the most recent window history. Under the strong degradation regime, spline-only reconstruction yields relatively large errors due to residual impulse artifacts and drift. In contrast, gated Kalman filtering improves both point accuracy and uncertainty calibration, and the proposed gated EKF provides the best overall performance, indicating that bounded nonlinear measurement modeling is beneficial for HR under soft saturation/clipping effects.
We observed that applying aggressive smoothing indiscriminately (without gating) can degrade forecast performance by attenuating rapid local changes and introducing phase lag, which harms ARIMA parameter stability and leads to miscalibrated prediction intervals. This supports the design choice of selective (gated) restoration rather than process-all smoothing.
4.4. Adaptive ARIMA Forecasting Performance
Building upon the high-fidelity signals produced by the multi-stage preprocessing pipeline, the adaptive ARIMA model provides short-term predictions of physiological trends. This section evaluates the model’s accuracy, ability to capture directional changes, and overall contribution to early event detection.
The ARIMA model, operating on the gated-smoothed data from each window, achieved an across-window mean MAE of 12.31 BPM for heart rate and 2.33% for SpO2, as detailed in Table 12. These across-window aggregate values reflect the distribution of per-window prediction errors over all sliding windows and capture inter-window variability. In contrast, the HR-specific one-step-ahead MAE reported in Table 8 (5.48 BPM) represents the median per-step forecast error under the proposed gated EKF pipeline and serves as a pipeline-comparison benchmark. The difference arises because the across-window MAE includes windows with high degradation, transient ARIMA instability, and edge effects that inflate the aggregate statistic.
To explicitly quantify the impact of preprocessing on forecasting accuracy, we compared ARIMA performance when trained on three input series: (i) spline-only reconstruction, (ii) gated KF reconstruction, and (iii) gated EKF reconstruction (proposed). Table 8 (Section 4.3.4) reports the detailed HR-specific forecasting comparison across these pipelines and shows that improved reconstruction translates directly into improved forecasting: the EKF-restored signals yield the lowest forecast error and the best uncertainty calibration, with 95% prediction interval coverage closest to the nominal level.
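Interval coverage of the kind used to judge uncertainty calibration can be computed as sketched below. The function names and the toy numbers are illustrative assumptions; the only definition relied upon is that coverage is the fraction of actual values that fall inside their forecast interval.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values falling inside [lower, upper]."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)


def coverage_gap(actual, lower, upper, nominal=0.95):
    """Absolute distance between empirical coverage and the nominal level."""
    return abs(interval_coverage(actual, lower, upper) - nominal)


# Toy HR forecasts in BPM (made-up numbers): the last interval misses.
actual = [70, 72, 75, 90]
lower = [65, 68, 70, 70]
upper = [75, 76, 80, 85]
print(interval_coverage(actual, lower, upper), coverage_gap(actual, lower, upper))
```

A pipeline whose coverage gap is smallest is the one whose intervals are closest to the nominal level, whether it over-covers or under-covers.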
Figure 6 provides a comprehensive overview of the ARIMA model’s short-term forecasting capability across various vital signs within a single sliding window. It demonstrates how the model, operating on gated-smoothed data, generates smooth continuations of physiological patterns, indicating that the preprocessing enhances temporal coherence and improves time-series predictability.
Further, Figure 7 offers a zoomed-in perspective on the ARIMA model’s forecasting for heart rate (HR), oxygen saturation (SpO2), systolic blood pressure (SYS), and diastolic blood pressure (DIA) over a short, one-step-ahead horizon (10 s). The plots show the conservatively smoothed observed data, the forecasted trajectories, and their 95% confidence intervals, along with predefined clinical thresholds. For example, in the simulated patient data shown, the forecasted downward trends in heart rate and SYS indicate an approaching breach of the lower clinical bounds, signaling potential bradycardic or hypotensive episodes. Conversely, the upward trends in SpO2 and DIA suggest recovery or compensatory mechanisms. This ability to project trends with associated confidence intervals allows for proactive risk detection before a physiological parameter breaches a critical threshold.
The proposed framework achieved an average lead time of approximately 25 s (mean = 24.7 s, median = 23.9 s) prior to threshold-crossing events. The model therefore typically identifies deteriorating physiological trends roughly 24–25 s before critical clinical limits are reached, providing a meaningful early-warning buffer under the imposed degradation regime. The sensitivity, specificity, and false alarm rate were 0.89, 0.92, and 0.08, respectively, demonstrating strong discriminative performance while maintaining controlled false positives. The sensitivity and specificity correspond to global threshold-based event detection across all signals and windows. In contrast, the early-warning rate reported per vital sign reflects the proportion of events for which the forecasted risk exceeded the threshold before the clinical bound was crossed. Therefore, the low early-warning rate observed for heart rate (0.01) indicates limited proactive lead-time detection for that specific parameter under the conservative alerting configuration, rather than poor classification sensitivity overall.
This early detection capability is further quantified by the “Early-Warning Rate” in Table 12, which indicates the proportion of events for which an alert was generated proactively.
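For readers who want to reproduce detection statistics of this kind, a minimal sketch follows. The confusion counts are hypothetical values chosen only so that the resulting rates match the reported 0.89, 0.92, and 0.08; the lead-time helper assumes, as a simplification, that lead time is measured from the earliest alert preceding a threshold crossing.

```python
def detection_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and false alarm rate from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_alarm_rate = fp / (fp + tn)    # equals 1 - specificity
    return sensitivity, specificity, false_alarm_rate


def lead_time(alert_times, crossing_time):
    """Seconds between the earliest alert before a crossing and the crossing.

    Returns None when no alert was raised before the crossing.
    """
    earlier = [t for t in alert_times if t < crossing_time]
    return crossing_time - min(earlier) if earlier else None


# Hypothetical counts, not taken from the experiments.
print(detection_metrics(tp=89, fp=8, tn=92, fn=11))
print(lead_time(alert_times=[100, 110], crossing_time=125))
```

Under this definition the false alarm rate is the complement of the specificity, which is consistent with the reported pair of 0.92 and 0.08.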
To further quantify forecasting performance, we introduce several key metrics, summarized in Table 12:
Window-MAE (Mean Absolute Error): The average absolute difference between the forecasted value and the true value within each window.
Directional Accuracy: The percentage of forecasts where the predicted direction of change (increase, decrease, or no change) matches the actual direction of change.
Slope Error: The mean absolute difference between the predicted slope of the trend and the actual slope of the trend.
Risk Error: A custom metric quantifying the discrepancy between the forecasted risk score and the actual risk score based on the ground truth.
Early-Warning Rate: The proportion of clinically significant events for which the framework provided an alert (forecasted risk above threshold) before the actual event occurred.
Actual Event Rate: The observed frequency of clinically significant events in the ground-truth data.
Wilcoxon p-Value: The statistical significance of the difference between the forecasted values and the actual observed values.
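Three of the metrics listed above can be sketched in a few lines. The function names, the tolerance `tol` used to decide "no change", and the toy values are illustrative assumptions; the definitions follow the descriptions in the list.

```python
def window_mae(forecast, actual):
    """Mean absolute error between forecasted and true values."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def sign(v, tol=0.0):
    """Direction of change: +1 increase, -1 decrease, 0 no change (within tol)."""
    return 0 if abs(v) <= tol else (1 if v > 0 else -1)


def directional_accuracy(last_obs, forecast, actual, tol=0.0):
    """Share of forecasts whose direction of change matches the actual one.

    Direction is measured relative to the last observed value of each window.
    """
    hits = sum(
        sign(f - l, tol) == sign(a - l, tol)
        for l, f, a in zip(last_obs, forecast, actual)
    )
    return hits / len(actual)


def early_warning_rate(alert_before_event, event_occurred):
    """Proportion of events that were preceded by an alert."""
    flags = [a for a, e in zip(alert_before_event, event_occurred) if e]
    return sum(flags) / len(flags) if flags else float("nan")


last = [70, 70, 70, 70]
forecast = [72, 69, 70, 75]
actual = [71, 71, 70, 74]
print(window_mae(forecast, actual))
print(directional_accuracy(last, forecast, actual))
print(early_warning_rate([True, False, False, True], [True, True, False, False]))
```

Note that the early-warning rate is conditioned on events that actually occurred, so alerts raised in windows without an event do not affect it.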
Table 12 summarizes these key forecasting performance metrics. “Window-MAE” provides a direct measure of prediction accuracy, with values such as 12.31 for heart rate and 17.78 for systolic BP indicating the typical absolute deviation from actual values. “Directional Accuracy” consistently hovers between 0.65 and 0.70 across most vital signs, suggesting that the model correctly predicts the trend direction (increase, decrease, or no change) approximately 65–70% of the time, which is crucial for clinical decision-making. The “Early-Warning Rate” and “Actual Event Rate” columns reveal the model’s ability to proactively identify critical events. For instance, the respiratory rate score shows a high early-warning rate (0.97) in conjunction with an actual event rate of 1.00, indicating excellent proactive detection for this vital sign. Conversely, heart rate has a low early-warning rate (0.01) despite a notable actual event rate (0.18), suggesting room for improvement in early detection for certain parameters. “Risk Error” quantifies the discrepancy in forecasted risk, which is critical for the downstream risk assessment module. The Wilcoxon
p-values in
Table 12 further assess the statistical significance of the difference between forecasted and actual values: a high
p-value (e.g.,
) suggests that there is no statistically significant difference between the forecasted values and the actual observed values, which indicates good predictive performance, while a low
p-value (e.g.,
) indicates a statistically significant difference, suggesting that the forecast deviates significantly from the actual values. For instance, diastolic BP (
) and respiratory rate (
) show no significant difference between forecasts and actual values, implying robust predictions, whereas other vital signs such as heart rate (
) exhibit statistically significant differences.
Table 13 provides a comparative view of the reconstruction error of the adaptive-filtered signal against the ground truth and the forecasting error (window-MAE) of the ARIMA model against the actual future ground-truth values. The “MSE (Adaptive Filtered vs. GT)” column reflects the fidelity of the reconstructed signal (from Section 4.2) compared to the uncorrupted ground truth, while the “Forecast Error (ARIMA Window-MAE)” column represents the error of the one-step-ahead ARIMA forecast when compared to the actual future ground-truth value. Notably, the “Forecast Error (ARIMA Window-MAE)” values are substantially higher than those of “MSE (Adaptive Filtered vs. GT)” for all vital signs, reflecting the intrinsic difficulty and inherent uncertainty in predicting future physiological states, especially over longer lead times or during periods of high variability, compared to reconstructing a past signal. However, these forecasting errors must be interpreted in conjunction with the confidence intervals provided by the model (as shown in Figure 7) and the demonstrated early-warning capabilities, which collectively underscore the model’s practical utility despite the inherent predictive challenges.
4.5. Clinical-Aware Risk Assessment and Multi-Modal Harmonization
The final and crucial stage of our integrated framework translates the refined, forecasted physiological trends into actionable clinical risk scores. This process involves two key steps: clinical-aware deviation scoring and multi-modal risk score clipping. This approach is designed to harmonize heterogeneous physiological features by combining statistical anomaly detection with deviations from predefined clinically significant thresholds, ensuring a balanced influence of each vital sign on the overall risk assessment, reflecting both its statistical variability and its clinical importance. Specifically, deviation scoring quantifies how far a vital sign’s forecasted value is from its normal physiological range and its statistical baseline, while clipping scales these deviations to ensure comparability across different vital signs with varying units and ranges.
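The two steps can be illustrated with a minimal sketch. The formula below, which blends a range-based clinical deviation with a capped z-score and clips the result to [0, 1], is our assumption for illustration; the blend weight, the cap `z_cap`, and the example normal ranges are hypothetical and do not reproduce the framework's exact definition.

```python
def clinical_deviation(value, low, high):
    """Distance outside the normal range [low, high], in units of range width."""
    width = high - low
    if value < low:
        return (low - value) / width
    if value > high:
        return (value - high) / width
    return 0.0


def risk_score(value, low, high, mean, std, weight=0.5, z_cap=3.0):
    """Blend clinical deviation with a statistical z-score, clipped to [0, 1].

    The weight, z_cap, and the formula itself are illustrative assumptions.
    """
    z = abs(value - mean) / std if std > 0 else 0.0
    stat_part = min(z / z_cap, 1.0)
    clin_part = min(clinical_deviation(value, low, high), 1.0)
    score = weight * clin_part + (1.0 - weight) * stat_part
    return max(0.0, min(score, 1.0))


def aggregate_risk(scores):
    """Mean of clipped per-signal scores, bounded to [0, 1] by construction."""
    return sum(scores) / len(scores)


# Forecasted HR of 45 BPM against an illustrative 60-100 BPM range,
# and a forecasted SpO2 of 97% against an illustrative 95-100% range.
hr = risk_score(45, low=60, high=100, mean=75, std=10)
spo2 = risk_score(97, low=95, high=100, mean=97, std=1)
print(hr, spo2, aggregate_risk([hr, spo2]))

# Even an extreme HR value cannot push its score above 1.
print(risk_score(300, low=60, high=100, mean=75, std=10))
```

Because every per-signal score is clipped before aggregation, a single drifting modality can raise the combined risk only up to its own share of the average.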
Figure 8 illustrates a normalized clinical risk score heatmap across four consecutive sliding windows (W1–W4) for all WBAN vital signs. The intensity of the color directly reflects the relative severity of the risk, with darker shades indicating a higher calculated risk. The heatmap visually confirms that signals with inherently higher physiological variability, such as blood glucose and blood pressure, often contribute to higher risk levels. Conversely,
frequently exhibits comparatively lower risk, validating the efficacy of our risk-aware aggregation mechanism that accounts for these differences. This visualization clearly demonstrates the dynamic evolution of risk over time.
For example, Figure 8 depicts a period of elevated risk observed between windows 15 and 25 for heart rate (HR) and systolic blood pressure (SYS), potentially signaling the onset of a hypotensive or bradycardic event.
While the overall early-warning rates for HR (1%) and systolic BP (10%) reported in Table 12 indicate that proactive detection of all such events is challenging, these metrics, alongside the risk error, provide quantitative insight into the framework’s capability. The risk error (e.g., 0.21 for HR and 0.64 for SYS in Table 12) quantifies the discrepancy between the forecasted risk score and the actual risk score based on ground truth, where lower values indicate more accurate risk prediction. Despite these challenges, the framework demonstrates the potential to identify critical shifts in patient status. Concurrently, the risk scores for SpO2 (with a risk error of 0.57) and diastolic BP (with a risk error of 0.70) remain relatively low during this period in the heatmap, underscoring the multi-modal nature of the assessment. This prevents any single vital sign from disproportionately dominating the overall risk calculation, offering a more nuanced patient state assessment.
Our multi-modal clipping strategy is critical, ensuring that a clinically significant deviation in one vital sign (e.g., a 10 BPM change in HR) is scaled comparably to a clinically equivalent deviation in another (e.g., a 2% change in SpO2). This scaling is based on their respective clinical significance (e.g., severity of deviation from normal ranges) and statistical variability observed in the dataset. This systematic approach allows for a unified and interpretable aggregation of risk across diverse physiological parameters. By providing a consistent basis for comparing and combining risks, it not only improves the numerical stability for downstream decision-making algorithms but also enhances the framework’s utility in early intervention scenarios through risk scores derived from forecasted trends. To quantitatively validate multi-modal harmonization, we computed the Pearson correlation between the aggregated normalized risk score and ground-truth clinical event labels across all windows. The normalized risk demonstrated strong correlation (, ), confirming that the risk formulation preserves clinical interpretability.
Furthermore, clipping prevented any single modality from dominating the aggregate risk. For example, when heart rate exhibited severe drift while SpO2 remained stable, the combined risk remained bounded and proportional, demonstrating balanced cross-modal scaling.
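The correlation check used to validate harmonization can be sketched as follows. The helper `pearson_r` is a plain implementation of the standard Pearson coefficient, and the risk values and labels are made-up numbers for illustration; with binary event labels this quantity is also known as the point-biserial correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return sxy / math.sqrt(sxx * syy)


# Aggregated normalized risk per window and binary event labels (toy data).
risk = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
print(pearson_r(risk, labels))
```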