Article

Signal-Derived Feature Analysis for Cuffless Blood Pressure Estimation: Comparing Machine Learning and Deep Learning on ICU Physiological Waveforms

1
Department of Mathematics, University of Architecture, Civil Engineering and Geodesy, 1 Hristo Smirnenski Blvd., 1164 Sofia, Bulgaria
2
Department of Applied Computer Science and Mathematical Modelling, Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn, 54 Słoneczna Str., 10-710 Olsztyn, Poland
3
Department of Statistics and Econometrics, Faculty of Economics and Business Administration, Sofia University St. Kliment Ohridski, 125 Tsarigradsko Shosse Blvd., bl. 3, 1113 Sofia, Bulgaria
4
Faculty of Mechanical Engineering, Mechanical and Instrument Engineering, Technical University of Sofia, Branch Plovdiv, 4000 Plovdiv, Bulgaria
*
Authors to whom correspondence should be addressed.
Submission received: 31 December 2025 / Revised: 11 February 2026 / Accepted: 21 February 2026 / Published: 9 March 2026

Abstract

Continuous non-invasive blood pressure monitoring holds significant promise for cardiovascular disease management, yet cuff-based methods remain limited by their intermittent nature. Machine learning approaches leveraging photoplethysmography (PPG) and electrocardiography (ECG) signals present compelling alternatives, though questions persist about which signal type contributes more predictive value. This study compares traditional machine learning models, ensemble methods, and deep learning architectures for estimating systolic blood pressure from physiological waveforms. We extracted 55 features from PPG and ECG recordings of 100 subjects in the MIMIC-III Waveform Database, yielding 3000 segments with invasive arterial blood pressure as ground truth. Data splitting was performed at the subject level (70/15/15 train/validation/test) to prevent data leakage. Evaluation included regression metrics, British Hypertension Society grading, SHAP-based explainability, and ablation studies. Among all models, LightGBM achieved the best performance with mean absolute error of 15.97 mmHg, placing it at BHS Grade D. While SHAP analysis showed ECG features contributing 54.7% of importance versus 45.3% for PPG, our ablation study revealed that PPG-only models achieved comparable performance (MAE 15.97 vs. 16.23 mmHg), with the difference not statistically significant (p = 0.226). These results suggest that PPG-only wearable devices are viable for blood pressure estimation, as adding ECG features provides no statistically significant improvement. However, all configurations achieved only BHS Grade D, indicating that personalized calibration may be necessary for clinical acceptability.

1. Introduction

1.1. The Clinical Imperative for Continuous Blood Pressure Monitoring

Few health metrics carry as much prognostic weight as blood pressure. Hypertension affects an estimated 1.28 billion adults globally and contributes to roughly 10.4 million deaths each year, making it the single most important modifiable risk factor for cardiovascular disease [1,2]. The consequences extend across the full spectrum of cardiovascular pathology, including ischemic heart disease, stroke, chronic kidney disease, and heart failure, placing healthcare systems under considerable strain regardless of economic development level [3]. What makes this particularly concerning is the dose–response relationship: every 20 mmHg increase in systolic pressure approximately doubles cardiovascular mortality risk [4]. Yet despite decades of research establishing these connections, effective monitoring and control of hypertension continue to elude clinical practice.
The standard approach to blood pressure measurement, cuff-based sphygmomanometry, remains the clinical gold standard, but its limitations are increasingly apparent [5]. By its very nature, cuff measurement provides only intermittent snapshots, missing the dynamic fluctuations that carry independent prognostic value: nocturnal dips, morning surges, and stress-induced spikes all escape detection in typical clinical encounters [6]. The situation is further complicated by measurement artifacts inherent to clinical settings. White coat hypertension and its inverse, masked hypertension, affect up to 30% of patients, leading to both over- and under-treatment [5]. Beyond these diagnostic challenges, practical constraints such as the need for patient cooperation, proper cuff sizing, and correct positioning limit the utility of traditional measurement in ambulatory and intensive care contexts where continuous monitoring would be most valuable.

1.2. Signal-Based Approaches for Cuffless Blood Pressure Estimation

The proliferation of wearable sensor technology has opened new avenues for cuffless blood pressure estimation, with photoplethysmography (PPG) and electrocardiography (ECG) emerging as the most promising signal sources [7,8]. These physiological signals can now be captured unobtrusively through consumer devices such as smartwatches, fitness trackers, and smartphone sensors, offering the tantalizing possibility of continuous blood pressure monitoring without the familiar squeeze of an inflating cuff.
Each signal type captures different aspects of cardiovascular physiology. PPG works by shining light through the skin and measuring volumetric changes in peripheral blood vessels, producing a characteristic waveform shaped by cardiac output, vascular compliance, and autonomic regulation [9]. Researchers have extensively mined the PPG waveform for blood pressure correlates: pulse wave morphology, timing intervals between systolic and diastolic phases, and various derivative-based indices all show relationships with arterial pressure [10,11]. ECG, meanwhile, provides a window into cardiac electrical activity. Features such as heart rate variability, R-wave amplitude, and conduction intervals reflect autonomic tone and cardiovascular state in ways that may complement PPG-derived information [12].
Machine learning has proven particularly well-suited to exploiting these signal features for blood pressure prediction [13,14]. The literature now contains a rich array of methodologies, from straightforward regression models to elaborate deep learning architectures, with accuracy varying considerably across datasets and populations [15,16]. Yet the field remains somewhat fragmented. Standardized evaluation protocols are lacking, external validation is rare, and surprisingly few studies have seriously engaged with the clinical accuracy standards that would be required for regulatory approval [17]. This disconnect between research achievements and clinical translation represents a significant obstacle to real-world deployment.

1.3. The PPG Versus ECG Question: A Relatively Unexplored Dimension

A curious gap exists in the cuffless blood pressure literature. While PPG-based and ECG-based estimation have each attracted substantial research attention, surprisingly few studies have directly compared the predictive contributions of features from both signal types when recorded simultaneously [7,18]. The field has largely gravitated toward PPG, understandably so, given the simplicity of optical sensors and their near-universal presence in consumer wearables. But this practical consideration may have obscured a more fundamental question: does ECG offer superior or complementary predictive power that PPG alone cannot match?
The answer carries real consequences for wearable device design and clinical recommendations. If ECG-derived features meaningfully outperform their PPG counterparts, then the growing category of smartwatches with ECG capability deserves particular attention for blood pressure monitoring applications. If PPG proves sufficient on its own, the added complexity and cost of ECG integration may not be justified. Either way, clinicians and device manufacturers need clearer guidance than the literature currently provides.

1.4. Clinical Validation Standards and Explainability

Two additional requirements for clinical translation have received surprisingly little attention in the machine learning blood pressure literature. The first concerns regulatory standards. Any device claiming to measure blood pressure must demonstrate compliance with established validation protocols, most notably the British Hypertension Society (BHS) grading system and the Association for the Advancement of Medical Instrumentation (AAMI) standards [17,19]. These protocols set concrete accuracy thresholds: BHS Grade A, for instance, requires that 60% of readings fall within 5 mmHg, 85% within 10 mmHg, and 95% within 15 mmHg of reference values. Most machine learning papers either ignore these standards entirely or quietly acknowledge that their models fall short.
The second requirement is explainability. Clinicians are understandably reluctant to trust predictions from models they cannot interpret. When a machine learning algorithm estimates blood pressure, healthcare providers want to know which physiological features are driving that estimate and whether the underlying logic makes medical sense [20]. Techniques like SHAP (SHapley Additive exPlanations) offer a principled approach to this problem, providing rigorous, game-theoretically grounded measures of feature importance [21]. Yet SHAP analysis remains underutilized in the blood pressure prediction literature, leaving a gap between what models can do and what clinicians need to know.

1.5. Study Objectives

This study was designed to address these gaps through a comprehensive evaluation of machine learning approaches for blood pressure estimation from physiological waveforms. Four specific objectives guided the research.
First, the study extracted and systematically compared 55 features from PPG (31 features) and ECG (24 features) signals for systolic blood pressure prediction, drawing on data from critically ill patients in the MIMIC-III database. Second, 10 traditional machine learning models were benchmarked alongside a deep learning architecture (ResNet-Transformer) against both standard regression metrics and the clinical validation standards required for real-world deployment. Third, the relative predictive importance of ECG versus PPG features was quantified using SHAP analysis, settling, at least for this dataset, the question of which signal type contributes more to accurate blood pressure estimation. Finally, the investigation examined whether deep learning offers any advantage over tree-based ensemble methods when working with pre-engineered features rather than raw waveforms.
The answers to these questions should provide useful guidance both for researchers developing cuffless blood pressure algorithms and for clinicians evaluating whether these emerging technologies are ready for patient care.

2. Materials and Methods

2.1. Dataset

Data were drawn from the MIMIC-III Waveform Database, a publicly accessible repository hosted by PhysioNet and developed through collaboration between MIT’s Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center [22]. This database offers something particularly valuable for blood pressure research: high-resolution physiological signals from intensive care unit patients who have invasive arterial pressure monitoring in place. The arterial line provides continuous, beat-by-beat blood pressure measurements that serve as near-perfect ground truth, a luxury rarely available outside the ICU setting.
From this database, recordings from 1524 unique subjects were extracted, yielding 3000 segments for systolic blood pressure analysis and 61,192 for diastolic. Each sample included three synchronized signals: a PPG waveform capturing peripheral vascular pulsations, an ECG tracing cardiac electrical activity, and the reference arterial blood pressure signal. All waveforms were sampled at 125 Hz, providing adequate temporal resolution for the intended morphological feature extraction.
The blood pressure distribution in the cohort showed the wide variability characteristic of critically ill populations. Systolic pressure averaged 108.79 mmHg with a standard deviation of 18.63 mmHg, while diastolic pressure ran lower at 57.26 ± 11.14 mmHg. This spread, encompassing both hypotensive and hypertensive ranges, is actually advantageous for model development: it ensures that the algorithms learned from clinically relevant extremes rather than just the comfortable middle of the blood pressure spectrum. Table 1 summarizes these dataset characteristics, and Figure 1 shows representative waveform examples.

Subject-Level Data Splitting

To prevent data leakage, all train/validation/test splits were performed at the subject level, ensuring that all 30 segments from any given subject appear exclusively in one partition.
The split distribution for our 100-subject analysis:
  • Training: 70 subjects (70%) → 2100 segments;
  • Validation: 15 subjects (15%) → 450 segments;
  • Test: 15 subjects (15%) → 450 segments.
We verified that no subject appears in more than one partition. All preprocessing transformations (imputation statistics, scaling parameters) were fitted exclusively on training data.
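As an illustration only, the subject-level split described above can be sketched in plain Python. The function name `split_by_subject`, the fixed seed, and the synthetic subject identifiers are assumptions made for this sketch; they are not taken from the study's code.

```python
import random

def split_by_subject(subject_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Assign whole subjects (not individual segments) to train/validation/test."""
    unique_subjects = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed: arbitrary choice for repeatability
    rng.shuffle(unique_subjects)
    n_train = round(train_frac * len(unique_subjects))
    n_val = round(val_frac * len(unique_subjects))
    train = set(unique_subjects[:n_train])
    val = set(unique_subjects[n_train:n_train + n_val])
    test = set(unique_subjects[n_train + n_val:])
    return train, val, test

# Synthetic stand-in for the cohort: 100 subjects x 30 segments = 3000 segments.
segment_subjects = [subject for subject in range(100) for _ in range(30)]
train_subj, val_subj, test_subj = split_by_subject(segment_subjects)

# Each segment follows its subject, so no subject can straddle two partitions.
train_idx = [i for i, s in enumerate(segment_subjects) if s in train_subj]
val_idx = [i for i, s in enumerate(segment_subjects) if s in val_subj]
test_idx = [i for i, s in enumerate(segment_subjects) if s in test_subj]

print(len(train_idx), len(val_idx), len(test_idx))  # 2100 450 450
```

Splitting the list of unique subjects first, and only then mapping segments to partitions, makes the no-overlap property hold by construction rather than by a check after the fact.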

2.2. Feature Extraction

A total of 55 features were engineered from the PPG and ECG waveforms, aiming to capture the diverse physiological characteristics that might relate to arterial pressure. Rather than relying on a single feature type, the analysis deliberately cast a wide net across statistical properties, waveform morphology, and frequency-domain information.
PPG Features (21 features). The PPG signal proved to be the richer source of features, contributing 21 measurements organized into three broad categories. Statistical features characterized the overall distribution and variability of the waveform: mean amplitude, standard deviation, variance, minimum and maximum values, range, median, interquartile range, skewness, kurtosis, and root mean square. Morphological features captured the timing and shape of the pulse wave itself: heart rate statistics derived from peak detection, RR interval measures including the RMSSD that reflects short-term heart rate variability, pulse amplitude, pulse width at half-maximum, the fall time from systolic peak to dicrotic notch, and the dicrotic index that quantifies the relative prominence of the reflected wave. A single frequency-domain feature, the dominant oscillatory frequency, rounded out the PPG feature set.
ECG Features (9 features). The ECG contributed a more focused set of 9 features. Heart rate characteristics accounted for five of these: mean heart rate, heart rate variability (standard deviation), mean RR interval, RR interval variability, and the coefficient of variation that normalizes variability to mean rate. Three morphological features captured aspects of the QRS complex: total signal power and both the mean and standard deviation of R-wave amplitude, the latter potentially reflecting beat-to-beat variations in ventricular depolarization strength. Finally, a signal quality index provided a measure of recording reliability, which we suspected might prove important for model performance.
Taken together, these 55 features were intended to capture complementary windows into cardiovascular physiology. The PPG features primarily reflect what is happening in the peripheral vasculature: pulse wave propagation, vascular compliance, and the interplay between forward and reflected pressure waves. The ECG features, by contrast, characterize central cardiac activity, specifically the electrical events driving each heartbeat and the variability in cardiac rhythm that reflects autonomic modulation. By including both, we positioned ourselves to determine which source of information contributes more to blood pressure prediction. Table 2, Table 3 and Table 4 provide the complete breakdown of preprocessing parameters and features extracted from each signal type.
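To make the statistical feature group concrete, the following is a minimal sketch of how such descriptors could be computed for one waveform segment. It is not the study's extraction code: the function name, the `ppg_` naming prefix, the use of population (rather than sample) moments, and the synthetic sine-wave segment are all assumptions.

```python
import math
import statistics

def statistical_features(signal, prefix="ppg"):
    """Distribution descriptors of one waveform segment (population moments assumed)."""
    n = len(signal)
    mean = statistics.fmean(signal)
    variance = statistics.pvariance(signal, mu=mean)
    std = math.sqrt(variance)  # a constant segment (std == 0) would need special handling
    q1, _, q3 = statistics.quantiles(signal, n=4)
    m3 = sum((x - mean) ** 3 for x in signal) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in signal) / n  # fourth central moment
    features = {
        "mean": mean,
        "std": std,
        "variance": variance,
        "min": min(signal),
        "max": max(signal),
        "range": max(signal) - min(signal),
        "median": statistics.median(signal),
        "iqr": q3 - q1,
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / variance ** 2 - 3.0,  # excess kurtosis
        "rms": math.sqrt(sum(x * x for x in signal) / n),
    }
    return {f"{prefix}_{name}": value for name, value in features.items()}

# Synthetic segment: 1000 samples of a sine wave with a 100-sample period,
# used here only as a placeholder for a real PPG recording.
segment = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
feats = statistical_features(segment)
print(sorted(feats))
```

Morphological and frequency-domain features (peak timing, dicrotic index, dominant frequency) require beat detection and spectral estimation and are deliberately left out of this sketch.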

2.3. Machine Learning Models

Ten machine learning algorithms representing four distinct methodological families were evaluated in this study. The goal was not to exhaustively test every possible algorithm, but rather to include sufficient diversity in learning paradigms to draw meaningful conclusions about which approaches work best for this particular problem.
Linear models served as our baseline and included ordinary least squares regression along with its regularized variants: Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (combining both). These models assume that blood pressure relates linearly to the input features, an assumption that may be violated by the complex physiology underlying pressure regulation, but one that provides interpretable coefficients and a useful performance floor against which to compare more sophisticated approaches.
Instance-based learning was represented by K-Nearest Neighbors, which makes predictions by averaging the blood pressure values of training samples that are most similar to a new observation in feature space. KNN makes no assumptions about functional form and can capture arbitrarily complex relationships, though it may struggle with high-dimensional feature spaces.
Kernel-based methods were represented by Support Vector Regression with a radial basis function kernel. SVR can model nonlinear relationships by implicitly mapping features into a high-dimensional space where a linear function can fit the data. This is a powerful approach, though one that can be computationally demanding on large datasets.
Tree-based ensemble methods formed the largest category, including Random Forest (which averages predictions from many independent trees), Gradient Boosting (which builds trees sequentially, with each tree correcting the errors of its predecessors), and two highly optimized gradient boosting implementations: XGBoost and LightGBM [23,24]. These methods have earned their popularity in tabular data competitions for good reasons: they naturally capture nonlinear relationships and feature interactions, handle mixed feature types gracefully, and are relatively robust to hyperparameter choices.
Table 5 lists the complete model set. Default hyperparameters were deliberately used throughout, both to ensure reproducibility and to provide a fair comparison that does not favor algorithms that happen to have been more extensively tuned [25] (see Supplementary File S1 for detailed training configurations).

2.4. Deep Learning Architecture

Alongside the feature-based machine learning models, it was important to test whether a modern deep learning architecture could extract useful information directly from raw waveforms, potentially discovering patterns that hand-crafted features might miss. For this purpose, a ResNet-Transformer hybrid was implemented that combines the hierarchical feature extraction strengths of residual networks with the attention mechanisms that have proven so powerful in sequence modeling tasks [26,27].
Architecture. The model takes raw PPG and ECG waveforms as input and processes them through a series of 1D convolutional residual blocks, followed by transformer encoder layers that can attend to different parts of the signal sequence. The idea is that the convolutional layers learn local patterns, such as the shape of individual beats, while the transformer layers capture longer-range dependencies that might relate to blood pressure. With over 6 million trainable parameters, this is a substantially more complex model than any of the feature-based approaches.
Training configuration. We trained the model using Adam with a learning rate of 0.0001 and weight decay of 1 × 10⁻⁵, running for up to 100 epochs with early stopping (patience of 15 epochs) to prevent overfitting. As it turned out, the model hit its best validation performance at epoch 5, remarkably early, suggesting either rapid convergence to a useful solution or, less optimistically, an architecture that quickly exhausts its ability to improve on this particular task (see Supplementary File S2 for the complete deep learning implementation).
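The early-stopping rule can be illustrated with a small sketch. The loss values below are hypothetical, and the convention used (stop once `patience` epochs have passed without a new best validation loss) is an assumption, since the exact stopping criterion is given in the supplementary material rather than here.

```python
def run_with_early_stopping(val_losses, patience=15):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: patience window restarts
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses)

# Hypothetical curve: improves for 5 epochs, then drifts upward for the remaining 95.
losses = [20.0 - epoch for epoch in range(1, 6)] + [16.0 + 0.01 * k for k in range(95)]
best_epoch, stop_epoch = run_with_early_stopping(losses)
print(best_epoch, stop_epoch)  # 5 20
```

Under this convention, a best epoch of 5 with a patience of 15 means training halts at epoch 20, well short of the 100-epoch budget.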
Table 6 summarizes the model configuration, and Figure 2 shows the training dynamics over the course of optimization.

2.5. Evaluation Framework

Model performance was assessed using a comprehensive evaluation framework encompassing regression metrics, clinical validation standards, and statistical comparison tests. This multi-faceted approach ensures that results are interpretable from both machine learning and clinical perspectives.
Regression Metrics. Standard regression metrics were computed to quantify prediction accuracy. Mean Absolute Error (MAE) measures the average magnitude of prediction errors in mmHg. Root Mean Square Error (RMSE) penalizes larger errors more heavily, providing sensitivity to outliers. The coefficient of determination (R²) indicates the proportion of blood pressure variance explained by the model. Bias (mean error) and standard deviation of errors characterize systematic and random error components. Bootstrap resampling with 1000 iterations provided 95% confidence intervals for MAE, RMSE, and R², enabling statistical comparison of model performance.
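A minimal sketch of these metrics and of a percentile bootstrap interval is given below. The percentile method, the seed, and the synthetic reference and predicted values are assumptions made for illustration; the study does not state which bootstrap variant was used.

```python
import math
import random

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when predictions are worse than the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (true, predicted) pairs with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    values.sort()
    lower = values[round(alpha / 2 * n_boot)]
    upper = values[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic reference pressures and predictions (not study data).
y_true = [90.0 + (i * 37) % 60 for i in range(200)]
y_pred = [t + ((i % 7) - 3) for i, t in enumerate(y_true)]

point_mae = mae(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, mae)
print(f"MAE = {point_mae:.2f} mmHg, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the interval is built from resamples of the same test set, a point estimate is expected to fall inside its own bootstrap interval; a value outside it usually signals a reporting mismatch.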
Clinical Validation Standards. Two established clinical protocols were applied to assess suitability for medical device deployment. The British Hypertension Society (BHS) protocol grades devices based on the cumulative percentage of readings within 5, 10, and 15 mmHg of reference values: Grade A requires 60%, 85%, and 95% respectively; Grade B requires 50%, 75%, and 90%; Grade C requires 40%, 65%, and 85%; Grade D indicates failure to meet Grade C thresholds. The Association for the Advancement of Medical Instrumentation (AAMI) standard requires mean error not exceeding ±5 mmHg with standard deviation not exceeding 8 mmHg. Table 7 summarizes these clinical accuracy thresholds.
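The two protocols reduce to simple threshold checks, sketched below using the standard BHS cut-offs (Grade A: 60/85/95%, Grade B: 50/75/90%, Grade C: 40/65/85%, Grade D otherwise) and the AAMI limits of 5 mmHg mean error and 8 mmHg standard deviation. Function names and the example errors are hypothetical, and treating "within 5 mmHg" as an inclusive bound is an assumption.

```python
import math

BHS_THRESHOLDS = {
    "A": (60.0, 85.0, 95.0),
    "B": (50.0, 75.0, 90.0),
    "C": (40.0, 65.0, 85.0),
}

def cumulative_percentages(errors, limits=(5.0, 10.0, 15.0)):
    """Percentage of absolute errors at or below each limit (mmHg)."""
    n = len(errors)
    return tuple(100.0 * sum(abs(e) <= limit for e in errors) / n for limit in limits)

def bhs_grade(percentages):
    """Best grade whose three minimums are all met; 'D' if even Grade C fails."""
    for grade, minimums in BHS_THRESHOLDS.items():  # dicts keep insertion order: A, B, C
        if all(p >= m for p, m in zip(percentages, minimums)):
            return grade
    return "D"

def meets_aami(errors):
    """Mean error within +/-5 mmHg and sample standard deviation of errors <= 8 mmHg."""
    n = len(errors)
    mean_error = sum(errors) / n
    sd = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / (n - 1))
    return abs(mean_error) <= 5.0 and sd <= 8.0

# Hypothetical prediction errors in mmHg (predicted minus reference).
errors = [-12, -4, 3, 7, 16, -2, 1, 9, -6, 4]
pct = cumulative_percentages(errors)
print(pct, bhs_grade(pct), meets_aami(errors))
```

All three percentages must clear a grade's minimums simultaneously, so a model can satisfy a grade at two levels and still be assigned the grade below because of the third.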
Statistical Testing. Pairwise model comparisons were conducted using the Wilcoxon signed-rank test, a non-parametric method appropriate for comparing paired prediction errors without assuming normal distribution. The significance level was set at α = 0.05. This approach enabled rigorous assessment of whether performance differences between models were statistically meaningful rather than attributable to random variation.
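For readers unfamiliar with the test, a minimal sketch of a two-sided Wilcoxon signed-rank comparison of paired absolute errors follows. It uses the normal approximation with a tie correction and drops zero differences; these choices, the function name, and the example data are assumptions, and a statistical library routine would normally be preferred in practice.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, ties averaged)."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (statistic - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * Phi(-|z|)
    return statistic, p_value

# Hypothetical absolute errors (mmHg) of two models on the same 20 test segments.
abs_err_a = [float(k) for k in range(1, 21)]
abs_err_b = [e + 1.0 for e in abs_err_a]  # model B is 1 mmHg worse on every segment
stat, p = wilcoxon_signed_rank(abs_err_a, abs_err_b)
print(stat, p < 0.05)

n_pairs = math.comb(10, 2)  # number of unordered model pairs among 10 models
```

The test is paired: both error vectors must refer to the same test segments in the same order, which is what allows small but consistent differences to reach significance.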
Explainability Analysis. SHAP (SHapley Additive exPlanations) values were computed using TreeExplainer for the best-performing gradient boosting model. SHAP provides theoretically grounded feature importance scores based on cooperative game theory, quantifying each feature’s contribution to individual predictions. Global feature importance was derived by averaging absolute SHAP values across all test samples, enabling comparison of predictive contributions from PPG versus ECG features.
The evaluation framework was designed to provide actionable insights for both algorithm developers (through regression metrics and statistical tests) and clinical stakeholders (through BHS grading).

2.6. Explainability Analysis

Understanding which features drive model predictions is essential for clinical acceptance and algorithm refinement. Multiple complementary explainability techniques were employed to characterize feature contributions and validate importance rankings.
SHAP TreeExplainer. SHAP (SHapley Additive exPlanations) values were computed using the TreeExplainer algorithm optimized for tree-based ensemble models. For each prediction, SHAP decomposes the output into additive contributions from each feature, grounded in cooperative game theory. The expected value serves as the baseline prediction, with each feature’s SHAP value representing its marginal contribution. Global feature importance was calculated as the mean absolute SHAP value across all test samples, providing a robust measure of each feature’s overall predictive influence. Figure 3 presents the SHAP summary plot showing feature importance rankings and value distributions.
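The aggregation step can be sketched as follows, assuming a matrix of SHAP values (rows are test samples, columns are features) has already been produced by an explainer. The toy SHAP values, the name `ppg_hr_mean`, and the convention of grouping features by their `ecg_`/`ppg_` name prefix are assumptions made for this illustration.

```python
def mean_abs_shap(shap_matrix):
    """Global importance per feature: mean absolute SHAP value over samples."""
    n_samples = len(shap_matrix)
    n_features = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) / n_samples for j in range(n_features)]

def share_by_signal(feature_names, importances):
    """Percentage of total importance per signal type, keyed by name prefix."""
    totals = {}
    for name, importance in zip(feature_names, importances):
        signal = name.split("_", 1)[0]  # 'ecg_r_amp_std' -> 'ecg'
        totals[signal] = totals.get(signal, 0.0) + importance
    grand_total = sum(totals.values())
    return {signal: 100.0 * value / grand_total for signal, value in totals.items()}

# Toy SHAP matrix for two samples and three features (values are made up).
feature_names = ["ecg_r_amp_std", "ecg_signal_quality", "ppg_hr_mean"]
shap_matrix = [
    [0.6, -0.2, 0.2],
    [-0.4, 0.2, -0.2],
]
importances = mean_abs_shap(shap_matrix)
shares = share_by_signal(feature_names, importances)
print(importances, shares)
```

Taking absolute values before averaging matters: positive and negative contributions of the same feature would otherwise cancel and understate its influence.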
Permutation Importance Validation. To validate SHAP-derived importance rankings, we computed permutation importance by measuring the decrease in model performance when each feature’s values are randomly shuffled. This model-agnostic approach provides an independent assessment of feature relevance. Permutation importance was calculated using 10 random shuffles per feature, with performance measured by negative mean absolute error. Figure 4 compares SHAP and permutation importance rankings, demonstrating consistency between the two methods.
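A minimal, model-agnostic sketch of this procedure is shown below. Importance is expressed as the mean increase in MAE after shuffling, which is equivalent to the decrease in negative MAE described above. The toy model, the data, and the function names are hypothetical.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when one feature column is shuffled, per feature."""
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)  # breaks the link between this feature and the target
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mae(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy setup: the "model" uses only feature 0, so feature 1 should score zero.
X = [[float(i), float((i * 7) % 13)] for i in range(50)]
y = [2.0 * row[0] for row in X]

def toy_predict(row):
    return 2.0 * row[0]

scores = permutation_importance(toy_predict, X, y)
print(scores)
```

Unlike SHAP, this measure depends only on model predictions, which is why it serves as an independent check on the ranking.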
Partial Dependence Plots. Partial dependence plots (PDPs) were generated to visualize the marginal effect of individual features on predicted blood pressure, averaging over the distribution of other features. These plots reveal the functional relationship between feature values and model output, identifying nonlinear patterns and threshold effects. Figure 5 shows partial dependence plots for the top predictive features.
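The averaging behind a partial dependence curve can be sketched in a few lines. The linear toy model and the grid values are hypothetical; with a fitted tree ensemble, `predict` would be replaced by the model's prediction function.

```python
def partial_dependence(predict, X, col, grid):
    """Average prediction when feature `col` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        # Overwrite one feature for every sample, keep all other features as observed.
        preds = [predict(row[:col] + [value] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model and data (made up): prediction = 3 * x0 + x1.
def toy_predict(row):
    return 3.0 * row[0] + row[1]

X = [[0.0, 1.0], [0.0, 3.0]]
curve = partial_dependence(toy_predict, X, col=0, grid=[0.0, 1.0, 2.0])
print(curve)  # [2.0, 5.0, 8.0]
```

For this linear toy model the curve is a straight line with slope 3; nonlinear patterns and threshold effects show up as bends or steps in the same curve.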
Feature Interaction Analysis. SHAP interaction values were computed to identify feature pairs with synergistic or antagonistic effects on predictions. The interaction matrix quantifies how the effect of one feature depends on the value of another, revealing complex dependencies not captured by main effects alone. Figure 6 presents the feature interaction heatmap for the most important features.
The multi-method explainability analysis provides converging evidence that ECG-derived features, particularly R-wave amplitude variability and signal quality metrics, contribute most strongly to blood pressure predictions. This finding has direct implications for wearable device design, suggesting that ECG-capable devices may achieve superior estimation accuracy.

3. Results

This section presents comprehensive evaluation results across all models, including regression performance metrics, clinical validation outcomes, statistical comparisons, and explainability analysis findings. Additional analyses, including stochastic differential equation modeling, are provided in Supplementary Files S3 and S4.

3.1. Model Performance Comparison

Table 8 presents the complete performance comparison across all evaluated models, including regression metrics with 95% bootstrap confidence intervals and clinical validation outcomes. Models are ranked by mean absolute error, with XGBoost achieving the best overall performance.
Performance hierarchy. XGBoost achieved the lowest MAE of 15.97 mmHg (95% CI: 6.59–8.07), followed by KNN (8.47 mmHg) and Gradient Boosting (8.77 mmHg). Tree-based ensemble methods consistently outperformed linear models, with the best linear model (Linear Regression) achieving MAE of 13.66 mmHg, nearly double that of XGBoost. The deep learning ResNet-Transformer model (MAE: 12.78 mmHg) performed comparably to linear models despite its substantially larger parameter count.
Explained variance. XGBoost explained 62.1% of blood pressure variance (R² = 0.621), the highest among all models. KNN and Gradient Boosting achieved similar explanatory power (R² ≈ 0.56). Regularized linear models (Lasso, ElasticNet) showed negative R² values, indicating predictions worse than the mean baseline. This suggests that L1 regularization eliminated predictive features, likely due to multicollinearity among physiological measurements.
Systematic bias. All models exhibited small negative bias (underestimation), ranging from −1.10 mmHg (Gradient Boosting) to −2.94 mmHg (SVR). These biases are clinically acceptable, falling well within established thresholds for mean error. Error standard deviations ranged from 11.60 mmHg (XGBoost) to 18.95 mmHg (Lasso/ElasticNet), indicating opportunities for further variance reduction through methodological refinements. Figure 7 compares model performance with 95% confidence intervals.

3.2. Clinical Accuracy

Clinical utility requires assessment against established validation standards. Table 9 presents the British Hypertension Society grading for all models, comparing the percentage of predictions within clinical accuracy thresholds.
XGBoost was the only model whose reported accuracy percentages satisfy the BHS Grade C thresholds (at least 40%, 65%, and 85%), with 56.7% of predictions within 5 mmHg, 77.2% within 10 mmHg, and 87.2% within 15 mmHg. These values clear the Grade B thresholds at the 5 and 10 mmHg levels (50% and 75%) but miss the 90% Grade B requirement at 15 mmHg, and they fall short of the 60% Grade A threshold for the ±5 mmHg criterion. All other models received Grade D, indicating insufficient accuracy for clinical deployment under current standards.
Variance characteristics. All models demonstrated acceptable mean error (bias within ±5 mmHg), meeting the bias component of clinical standards. Error standard deviations ranged from 11.60 mmHg (XGBoost) to 18.95 mmHg (Lasso/ElasticNet), suggesting that personalized calibration approaches may be beneficial for further improving prediction consistency. Figure 8 presents the Bland–Altman plot for XGBoost predictions.

3.3. Statistical Comparison

Pairwise Wilcoxon signed-rank tests were performed to assess statistical significance of performance differences between models. A total of 45 pairwise comparisons were conducted across the 10 machine learning models.
XGBoost significantly outperformed all other models (p < 0.0001 for all comparisons). The performance advantage of XGBoost over Gradient Boosting and KNN was statistically robust despite relatively small absolute differences in MAE (approximately 1–1.5 mmHg).
Among mid-tier models, the comparison between KNN and Gradient Boosting did not reach statistical significance (p = 0.0802), suggesting comparable performance between these approaches. Similarly, Lasso and ElasticNet showed identical performance, as expected given that ElasticNet with default parameters closely approximates Lasso behavior. Figure 9 visualizes the full pairwise statistical comparison matrix.

3.4. Feature Importance Analysis

SHAP analysis revealed a clear hierarchy of feature importance, with ECG-derived features dominating the top rankings. Table 10 presents the top 15 features by mean absolute SHAP value. Figure 10 shows the complete feature importance ranking (see Supplementary File S5 for the complete SHAP explainability analysis).
ECG features dominated the importance rankings. R-wave amplitude variability (ecg_r_amp_std) emerged as the single most important feature with SHAP importance of 0.654, followed by ECG signal quality (0.482). Five of the top six features derive from ECG, with only PPG heart rate mean (0.418) breaking into the top tier at rank 3.
PPG morphological features showed moderate importance. Among PPG features, statistical distribution measures (skewness, kurtosis, IQR) ranked higher than traditional pulse wave features (dicrotic index, pulse width). This suggests that waveform shape variability may be more informative than absolute morphological measurements for blood pressure estimation.
Signal quality critically influences predictions. The prominence of ecg_signal_quality (rank 2) indicates that data quality substantially affects model outputs, highlighting the importance of signal preprocessing and quality control in cuffless blood pressure systems.

3.5. Ablation Study: PPG vs. ECG Feature Contributions

To quantify the relative predictive contributions of PPG versus ECG features, we conducted a formal ablation study with three experimental conditions: (1) PPG only using 31 features, (2) ECG only using 24 features, and (3) combined using 55 features. Table 11 presents the results, and Figure 11 compares the MAE across ablation conditions.
Key Finding: The difference between PPG-only and ECG-only models was NOT statistically significant for any model tested (all p > 0.05), as confirmed by Wilcoxon signed-rank tests (Table 12). Figure 12 shows the SHAP summary for SBP prediction.
While SHAP analysis showed ECG features having higher importance (54.7% vs. 45.3%), the ablation study revealed PPG-only models achieved comparable MAE (Table 13). Figure 13 illustrates the aggregate importance split by signal type, and Figure 14 shows the prediction error distributions across configurations.
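The paired comparison behind such an ablation can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the predictions are synthetic, the test-set size of 200 segments is hypothetical, and the test is applied to per-segment absolute errors of two configurations evaluated on the same segments.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_paired_errors(y_true, pred_a, pred_b, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired absolute errors."""
    err_a = np.abs(np.asarray(y_true) - np.asarray(pred_a))
    err_b = np.abs(np.asarray(y_true) - np.asarray(pred_b))
    statistic, p_value = wilcoxon(err_a, err_b)
    return {
        "mae_a": float(err_a.mean()),
        "mae_b": float(err_b.mean()),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }

# Synthetic stand-ins for reference SBP and two model configurations.
rng = np.random.default_rng(0)
n_segments = 200  # hypothetical test-set size
y_true = rng.normal(110.0, 18.0, n_segments)
pred_ppg_only = y_true + rng.normal(0.0, 15.0, n_segments)
pred_ecg_only = y_true + rng.normal(0.0, 15.0, n_segments)

result = compare_paired_errors(y_true, pred_ppg_only, pred_ecg_only)
print(result)
```

A non-significant p-value indicates only that the data do not distinguish the two configurations at the chosen level; it does not by itself establish that they are equivalent.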

4. Discussion

The results of this study converge on several findings with implications for both the research community and for the eventual clinical deployment of cuffless blood pressure monitoring. Each of the major findings is discussed in turn, placing them in the context of the existing literature and drawing out practical implications where appropriate.

4.1. Principal Findings: ECG Dominance in Signal-Based BP Estimation

The most striking finding. Perhaps the most unexpected result of our analysis was the clear dominance of ECG-derived features over PPG features for blood pressure prediction. The SHAP analysis placed R-wave amplitude variability at the top of the importance rankings (SHAP value: 0.654), outperforming all 21 PPG features. This runs counter to the direction the field has taken, as PPG has received far more research attention, largely because optical sensors are easier to integrate into consumer wearables. Our data suggest that this emphasis may have been misplaced, at least from a pure predictive accuracy standpoint.
Why might this be? From a physiological perspective, the prominence of R-wave amplitude variability makes intuitive sense. The R-wave reflects ventricular depolarization, and beat-to-beat variations in its amplitude may track changes in stroke volume and myocardial contractility, factors that directly drive systolic pressure. PPG features, by contrast, capture what is happening in the peripheral vasculature: the downstream consequences of central hemodynamic changes filtered through arterial compliance and wave reflections. It appears that for blood pressure prediction, information closer to the source matters more.
What this means for wearable devices. The practical implications are significant. If ECG features truly offer superior predictive value, then PPG-only devices, which dominate the current consumer wearable market, may face a fundamental accuracy ceiling. The growing number of smartwatches with built-in ECG capability suddenly becomes more interesting from a blood pressure monitoring perspective. And the fact that signal quality ranked as the second most important feature (SHAP: 0.482) suggests that any serious blood pressure monitoring system needs to include robust quality assessment and be willing to reject measurements from degraded signals.

SHAP Importance Versus Predictive Performance

Our ablation study reveals a nuanced relationship between feature importance and predictive performance. While SHAP analysis identified ECG features (54.7% of importance) as more important than PPG features (45.3%), PPG-only models achieved slightly lower MAE (15.97 vs. 16.23 mmHg for LightGBM, p = 0.226, not significant).
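This suggests that high SHAP importance does not necessarily indicate indispensability: a model can rely heavily on a feature group when it is available and still find comparable information elsewhere when that group is removed. The practical implications are that (1) PPG-only wearable devices are viable, (2) adding ECG sensors provides only marginal benefit, and (3) cross-subject generalization remains the key challenge.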

4.2. Tree-Based Machine Learning Versus Deep Learning

The performance gap between tree-based methods and deep learning was larger than anticipated. XGBoost achieved a mean absolute error of 7.32 mmHg (Table 8); the ResNet-Transformer came in at 12.78 mmHg, nearly 75% worse despite having orders of magnitude more parameters. Table 14 summarizes this comparison. This is not a subtle difference that could be attributed to random variation or suboptimal hyperparameters.
Understanding this result. Why would a 6-million-parameter neural network underperform a gradient boosting model on this task? The answer likely lies in the nature of the input data. Our features are classic tabular data, structured measurements with interpretable ranges and meaningful boundaries. Tree-based methods have evolved to excel on exactly this type of data, building decision rules that respect the natural discontinuities in feature space. Deep learning, by contrast, shines when there are complex hierarchical patterns to be learned from raw, unstructured inputs. Give a transformer raw ECG waveforms and it might discover subtle morphological patterns that no human engineer would think to extract. Give it a table of 30 pre-computed features and it has no particular advantage.
A caveat about deep learning. We want to be careful not to over-interpret these results as evidence against deep learning for blood pressure estimation in general. What we have shown is that deep learning provides no advantage, and in fact substantial disadvantage, when applied to hand-crafted features. Studies achieving BHS Grade A accuracy have typically used end-to-end deep learning on raw waveforms, allowing the network to learn its own feature representations [28,29]. The right comparison might not be XGBoost versus ResNet-Transformer on features, but rather XGBoost on features versus ResNet-Transformer on raw signals. That comparison awaits future work.

4.3. Comparison with Literature

How do our results compare with what others have achieved? Table 15 places our findings alongside several recent blood pressure estimation studies, and a pattern emerges that is worth discussing.
An interesting pattern. The studies achieving BHS Grade A have something in common: they use end-to-end deep learning on raw waveforms rather than pre-computed features. Our XGBoost approach substantially outperforms previous feature-based work (Kachuee and colleagues [7], for instance, achieved only 11.17 mmHg MAE using PPG features), but it cannot match the best raw-signal results. This suggests that feature engineering, however careful, may face a ceiling that direct waveform learning can transcend. The features we choose to extract, no matter how thoughtfully selected, inevitably discard some information. Neural networks operating on raw signals have the opportunity to find patterns we would never think to look for.

4.4. The Signal Quality Factor

An unexpected finding. The prominence of signal quality in the feature importance rankings came as something of a surprise. ECG signal quality ranked second overall (SHAP: 0.482), ahead of nearly all the carefully engineered physiological features. The implication is straightforward but important: garbage in, garbage out. Motion artifacts, poor electrode contact, and electromagnetic interference do not just add noise but actively mislead the model.
A practical recommendation. Any serious blood pressure estimation system needs to incorporate real-time signal quality assessment and be willing to reject measurements from degraded recordings. This may mean fewer blood pressure estimates per hour, but the ones produced will be more reliable. It is better to say ‘I cannot estimate blood pressure right now’ than to produce a confident but erroneous reading.
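The reject-on-poor-quality behavior recommended above can be sketched as a simple gate in front of the regressor. Everything here is an assumption for illustration: the threshold value of 0.8, the function names, and the stand-in model are hypothetical, and a real cut-off would need to be tuned on validation data.

```python
from typing import Callable, Optional, Sequence

SQI_THRESHOLD = 0.8  # hypothetical cut-off, not a value reported in this study

def gated_estimate(
    features: Sequence[float],
    sqi: float,
    predict: Callable[[Sequence[float]], float],
    threshold: float = SQI_THRESHOLD,
) -> Optional[float]:
    """Return a blood pressure estimate only when signal quality is adequate.

    Returns None when the signal quality index falls below the threshold,
    which the caller can surface as "no estimate available".
    """
    if sqi < threshold:
        return None
    return predict(features)

def dummy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained regressor (illustrative only)."""
    return 100.0 + 10.0 * features[0]

accepted = gated_estimate([1.5], sqi=0.9, predict=dummy_model)
rejected = gated_estimate([1.5], sqi=0.5, predict=dummy_model)
print(accepted, rejected)
```

Returning `None` instead of a number forces downstream code to handle the missing estimate explicitly, rather than displaying a value the system has reason to distrust.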

4.5. Clinical Implications

Clinical standing. XGBoost achieved BHS Grade C (Table 8), not what one might hope for, but meaningful progress nonetheless. The model came close to Grade B thresholds and showed that feature-based approaches can produce clinically relevant accuracy. Whether this is good enough depends on the application. For population screening or trend monitoring, Grade C may be acceptable. For clinical decision-making that requires high precision, further improvements are needed.
Opportunities for improvement. The best model meets the bias criterion easily (mean error of −1.27 mmHg), demonstrating that systematic estimation errors are well controlled. The next frontier lies in reducing prediction variance (SD of 11.60 mmHg). The model is not systematically over- or under-estimating; it is simply inconsistent across certain patient states or signal conditions. Understanding which patient states and signal characteristics lead to larger errors would be a valuable direction for future work, potentially enabling personalized calibration strategies.
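The relationship between the reported bias, the standard deviation, and the AAMI criteria can be checked with a few lines of arithmetic. The sketch below uses the mean error (−1.27 mmHg) and SD (11.60 mmHg) quoted above, the conventional 1.96 multiplier for 95% limits of agreement, and the AAMI limits of 5 mmHg (mean error) and 8 mmHg (SD). The function names are illustrative.

```python
def limits_of_agreement(bias, sd, z=1.96):
    """95% Bland-Altman limits of agreement: bias ± z * SD."""
    return bias - z * sd, bias + z * sd

def aami_check(bias, sd, max_bias=5.0, max_sd=8.0):
    """Evaluate the two AAMI criteria separately."""
    return {"bias_ok": abs(bias) <= max_bias, "sd_ok": sd <= max_sd}

bias, sd = -1.27, 11.60  # mmHg, values reported for the best model
lower, upper = limits_of_agreement(bias, sd)
status = aami_check(bias, sd)

print(f"Limits of agreement: {lower:.2f} to {upper:.2f} mmHg")
print(status)
```

The bias criterion is satisfied with a wide margin, while the SD exceeds the 8 mmHg limit, which is why variance reduction rather than bias correction is the relevant target.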
The value of explainability. One potential advantage of this approach over black-box deep learning is interpretability. The SHAP analysis reveals that ECG R-wave amplitude and signal quality drive predictions, a finding that makes physiological sense and might increase clinician trust. If a doctor understands why an algorithm is making a particular prediction, they are better positioned to know when to trust it and when to be skeptical.

4.6. Limitations

Several limitations of this work warrant candid acknowledgment.
First, the data come from critically ill ICU patients, a population with hemodynamic instability, vasoactive medications, and mechanical ventilation. Whether models trained on such data will generalize to healthy individuals wearing smartwatches remains an open question. ICU patients may actually be an easier population to study (continuous arterial monitoring provides excellent ground truth) but a harder population to generalize from.
Second, the study deliberately focused on hand-crafted features rather than end-to-end learning. This was a methodological choice, not an oversight, but it means the conclusions about deep learning are limited. A transformer trained directly on raw waveforms might achieve better results.
Third, MIMIC-III comes from a single institution, Beth Israel Deaconess Medical Center. The demographic and clinical characteristics of that patient population may not represent the broader population. Multi-center validation is essential before any of these methods could be deployed clinically.
Fourth, the analysis emphasized systolic blood pressure, which is easier to estimate than diastolic. The feature importance patterns might look different for DBP prediction.
Finally, population-level models were used without individual calibration. A brief calibration period with cuff measurements for each new user could substantially improve accuracy but would also complicate deployment.

4.7. Future Directions

Where should this work go next? Several directions seem promising.
Hybrid architectures that combine the interpretability of feature-based models with the representation learning of deep networks might offer the best of both worlds. One could imagine using XGBoost on hand-crafted features alongside a neural network processing raw waveforms, then combining their predictions.
Multi-center validation is essential. Testing on datasets from different institutions, different populations, and different recording conditions would reveal how robust these findings are.
Ambulatory validation with healthy subjects would bridge the gap between ICU research and consumer health applications. The physiological signals from someone sitting at a desk differ from those of a patient in septic shock.
Joint estimation of blood pressure and signal quality confidence would improve practical reliability. Rather than providing a single point estimate, a system could provide a confidence interval that widens when signal quality degrades.
And finally, personalized calibration deserves systematic study. Initial cuff measurements could anchor a model to each individual’s baseline, potentially reducing prediction variance and enabling compliance with stringent clinical standards such as AAMI requirements.

5. Conclusions

This study provides the first systematic comparison of PPG versus ECG feature importance for cuffless blood pressure estimation using SHAP analysis on MIMIC-III ICU waveforms. Through comprehensive evaluation of 10 machine learning models and one deep learning architecture on 100 subjects (3000 segments), several key findings emerge:
  • ECG features dominate blood pressure prediction. R-wave amplitude variability (SHAP: 0.654) and signal quality (0.482) are the two most important features, both derived from ECG rather than PPG.
  • XGBoost achieves best performance on signal-derived features (MAE 7.32 mmHg, BHS Grade C; Table 8), significantly outperforming deep learning (ResNet-Transformer: MAE 12.78 mmHg, Grade D).
  • Signal quality is a critical factor. It ranked second overall in feature importance, suggesting that quality-aware preprocessing is essential for reliable predictions.
  • Feature-based approaches have ceiling limitations. End-to-end deep learning on raw waveforms may be required to achieve BHS Grade A clinical accuracy.
Practical Implications: Wearable devices should incorporate ECG capability alongside PPG for improved accuracy. Signal quality monitoring should be integrated into blood pressure estimation pipelines. Tree-based machine learning remains optimal for signal-derived feature analysis, while end-to-end deep learning may be preferable for raw waveform processing.
This work advances understanding regarding which physiological signals contribute most to non-invasive blood pressure estimation and provides a foundation for next-generation cuffless monitoring systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7030098/s1, Supplementary File S1: Machine Learning Model Training—Experimental setup, model configurations, training procedures, and cross-validation results for the ten ML algorithms evaluated; Supplementary File S2: Deep Learning Architecture—ResNet-Transformer implementation, training configuration, and raw waveform processing results; Supplementary File S3: Stochastic Differential Equation Analysis—SDE-based signal modeling and parameter estimation; Supplementary File S4: Model Evaluation—Comprehensive performance metrics, BHS grading analysis, and comparative results across all models; Supplementary File S5: SHAP Explainability Analysis—Feature importance rankings, SHAP value distributions, and interpretability results for the XGBoost model

Author Contributions

Conceptualization, I.N. and M.K.; data curation, I.N. and P.M.; formal analysis, I.N. and M.K.; methodology, I.N., M.K., M.M. and P.M.; software, I.N.; validation, I.N., and M.K.; visualization, I.N. and P.M.; writing—original draft, I.N. and M.K.; writing—review and editing, M.K., M.M., and P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MIMIC-III Waveform Database used in this study is publicly available through PhysioNet (https://physionet.org/content/mimic3wdb/ (accessed on 30 December 2025)) after completing the required CITI training program and signing the data use agreement.

Acknowledgments

The authors thank the MIT Laboratory for Computational Physiology and collaborating institutions for providing open access to the MIMIC database through the PhysioNet platform. The presentation and dissemination of these research results are supported by National Science Fund Project KP-06-N85-7/05.12.2024 “Significance and Potential Risks of the Fast Integration of Artificial Intelligence (AI) Technologies into the Economy and Financial Sector”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Whelton, P.K.; Carey, R.M.; Aronow, W.S.; Casey, D.E., Jr.; Collins, K.J.; Dennison Himmelfarb, C.; Wright, J.T., Jr. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Hypertension 2018, 71, 1269–1324.
  2. Forouzanfar, M.H.; Liu, P.; Roth, G.A.; Ng, M.; Biryukov, S.; Marczak, L.; Alexander, L.; Estep, K.; Abate, K.H.; Akinyemiju, T.F.; et al. Global Burden of Hypertension and Systolic Blood Pressure of at Least 110 to 115 mm Hg, 1990–2015. JAMA 2017, 317, 165–182.
  3. Lawes, C.M.M.; Vander Hoorn, S.; Rodgers, A.; International Society of Hypertension. Global Burden of Blood-Pressure-Related Disease, 2001. Lancet 2008, 371, 1513–1518.
  4. Lewington, S.; Clarke, R.; Qizilbash, N.; Peto, R.; Collins, R.; Prospective Studies Collaboration. Age-Specific Relevance of Usual Blood Pressure to Vascular Mortality: A Meta-Analysis of Individual Data for One Million Adults in 61 Prospective Studies. Lancet 2002, 360, 1903–1913.
  5. Muntner, P.; Shimbo, D.; Carey, R.M.; Charleston, J.B.; Gaillard, T.; Misra, S.; Myers, M.G.; Ogedegbe, G.; Schwartz, J.E.; Townsend, R.R.; et al. Measurement of Blood Pressure in Humans: A Scientific Statement From the American Heart Association. Hypertension 2019, 73, e35–e66.
  6. Parati, G.; Stergiou, G.S.; Dolan, E.; Bilo, G. Blood Pressure Variability: Clinical Relevance and Application. J. Clin. Hypertens. 2018, 20, 1133–1137.
  7. Kachuee, M.; Kiani, M.M.; Mohammadzade, H.; Shabany, M. Cuffless Blood Pressure Estimation Algorithms for Continuous Health-Care Monitoring. IEEE Trans. Biomed. Eng. 2017, 64, 859–869.
  8. Slapničar, G.; Mlakar, N.; Luštrek, M. Blood Pressure Estimation from Photoplethysmogram Using a Spectro-Temporal Deep Neural Network. Sensors 2019, 19, 3420.
  9. Elgendi, M. On the Analysis of Fingertip Photoplethysmogram Signals. Curr. Cardiol. Rev. 2012, 8, 14–25.
  10. Takazawa, K.; Tanaka, N.; Fujita, M.; Matsuoka, O.; Saiki, T.; Aikawa, M.; Tamura, S.; Ibukiyama, C. Assessment of Vasoactive Agents and Vascular Aging by the Second Derivative of Photoplethysmogram Waveform. Hypertension 1998, 32, 365–370.
  11. Elgendi, M.; Jost, E.; Alian, A.; Fletcher, R.R.; Bomberg, H.; Eichenberger, U.; Menon, C. Photoplethysmography Features Correlated with Blood Pressure Changes. Diagnostics 2024, 14, 2309.
  12. Li, Y.-H.; Harfiya, L.N.; Purwandari, K.; Lin, Y.-D. Real-Time Cuffless Continuous Blood Pressure Estimation Using Deep Learning. Sensors 2020, 20, 5606.
  13. Chowdhury, M.Z.I.; Naeem, I.; Quan, H.; Leung, A.A.; Sikdar, K.C.; O’Beirne, M.; Turin, T.C. Prediction of Hypertension Using Traditional Regression and Machine Learning Models: A Systematic Review and Meta-Analysis. PLoS ONE 2022, 17, e0266334.
  14. Martínez-Ríos, E.; Montesinos, L.; Alfaro-Ponce, M.; Pecchia, L. Review of Machine Learning for Hypertension Detection and Blood Pressure Estimation (Clinical and Physiological Data). Biomed. Signal Process. Control 2021, 68, 102813.
  15. Liang, Y.; Chen, Z.; Liu, G.; Elgendi, M. A Benchmark for Machine-Learning-Based Non-Invasive Blood Pressure Estimation Using PPG. Sci. Data 2023, 10, 149.
  16. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 507–520.
  17. Stergiou, G.S.; Alpert, B.; Mieke, S.; Asmar, R.; Atkins, N.; Eckert, S.; Frick, G.; Friedman, B.; Graßl, T.; Ichikawa, T.; et al. A Universal Standard for the Validation of Blood Pressure Measuring Devices: AAMI/ESH/ISO Collaboration Statement. J. Hypertens. 2018, 36, 472–478.
  18. Argüello-Prada, E.J.; Castaño Mosquera, C.D. Exploring Supervised Machine Learning Models to Estimate Blood Pressure Using Non-Fiducial Features of the Photoplethysmogram (PPG) and Its Derivatives. Phys. Eng. Sci. Med. 2025, 48, 1399–1414.
  19. O’Brien, E.; Petrie, J.; Littler, W.; de Swiet, M.; Padfield, P.L.; Altman, D.G.; Bland, M.; Coats, A.; Atkins, N. The British Hypertension Society Protocol for the Evaluation of Blood Pressure Measuring Devices. J. Hypertens. 1993, 11, S43–S62.
  20. Shickel, B.; Tighe, P.J.; Bihorac, A.; Rashidi, P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J. Biomed. Health Inform. 2018, 22, 1589–1604.
  21. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 4765–4774.
  22. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Moody, B.; Gow, B.; Lehman, L.H.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 219.
  23. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  24. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 3146–3154.
  25. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008.
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  28. Arjomand, A.; Boudesh, A.; Bayatmakou, F.; Kent, K.B.; Mohammadi, A. TransfoRhythm: A Transformer Architecture Conductive to Blood Pressure Estimation via Solo PPG Signal Capturing. arXiv 2024, arXiv:2404.15352.
  29. Mohammadi, H.; Tarvirdizadeh, B.; Alipour, K.; Ghamari, M. Cuff-Less Blood Pressure Monitoring via PPG Signals Using a Hybrid CNN-BiLSTM Deep Learning Model with Attention Mechanism. Sci. Rep. 2025, 15, 22229.
Figure 1. Representative physiological waveform samples from the MIMIC-III database showing synchronized PPG, ECG, and arterial blood pressure (ABP) signals recorded at 125 Hz sampling frequency.
Figure 2. Training dynamics of the ResNet-Transformer model showing loss curves and performance metrics across epochs. The model achieved optimal performance at epoch 5, with subsequent training showing signs of overfitting.
Figure 3. SHAP summary plot showing global feature importance rankings. Each point represents a sample, with color indicating feature value (red = high, blue = low) and horizontal position showing SHAP impact on prediction. ECG-derived features (ecg_r_amp_std, ecg_signal_quality) dominate the top rankings.
Figure 4. Comparison of SHAP-based and permutation-based feature importance rankings. Both methods identify ECG R-wave amplitude variability and signal quality as the most predictive features, validating the robustness of importance estimates.
Figure 5. Partial dependence plots for top predictive features showing the marginal effect on predicted systolic blood pressure. The plots reveal nonlinear relationships between feature values and blood pressure predictions. Blue lines represent partial dependence values, shaded regions indicate ±1 standard deviation confidence bands, and vertical tick marks show feature value deciles.
Figure 6. Feature interaction heatmap showing SHAP interaction values between top features. Darker colors indicate stronger interactions, revealing dependencies between ECG and PPG features in blood pressure prediction.
Figure 7. Model performance comparison with 95% confidence intervals across all evaluated models. XGBoost achieves significantly lower error than all other models. Salmon bars represent the linear models (Ridge, Lasso, ElasticNet, Linear Regression) together with SVR and LightGBM, and orange bars represent the remaining tree-based and instance-based models (XGBoost, Gradient Boosting, Random Forest, KNN). Grey dashed lines indicate the AAMI threshold for reference. Subplots show: (A) Mean Absolute Error (MAE), (B) Root Mean Square Error (RMSE), and (C) Coefficient of Determination (R²).
Figure 8. Bland–Altman plot for XGBoost predictions showing the relationship between prediction error and mean blood pressure. The plot reveals acceptable bias (−1.27 mmHg) but substantial variability (limits of agreement: −24.01 to 21.47 mmHg). The dark solid line indicates mean bias, red dashed lines indicate the 95% limits of agreement, and green dotted lines mark the ±5 mmHg clinical threshold.
Figure 9. Statistical comparison heatmap showing p-values from pairwise Wilcoxon signed-rank tests. Darker colors indicate more significant differences. XGBoost (top row) shows highly significant differences from all other models.
Figure 10. SHAP feature importance bar plot showing mean absolute SHAP values for all 55 features. ECG-derived features (blue bars) and PPG-derived features (orange bars) are distinguished by color. ECG features dominate the top rankings, with R-wave amplitude variability and signal quality emerging as the most predictive features.
Figure 11. Ablation study results comparing MAE for PPG-only, ECG-only, and combined feature sets.
Figure 12. SHAP summary plot showing global feature importance for SBP prediction.
Figure 13. Aggregate SHAP importance by signal type: ECG (54.7%) vs. PPG (45.3%).
Figure 14. Prediction error distribution across ablation study configurations.
Table 1. Dataset characteristics and blood pressure distribution.
Characteristic | Value
Source | MIMIC-III Waveform Database (PhysioNet)
Number of subjects | 1524
Number of samples (SBP) | 61,232
Number of samples (DBP) | 61,192
Sampling frequency | 125 Hz
SBP (mean ± SD) | 108.79 ± 18.63 mmHg
DBP (mean ± SD) | 57.26 ± 11.14 mmHg
Table 2. Signal preprocessing parameters.
Parameter | Value | Description
Sampling rate | 125 Hz | MIMIC waveform acquisition rate
Segment duration | 30 s | Pre-segmented in MIMIC-BP dataset
Bandpass filter (PPG) | 0.5–8.0 Hz | 3rd-order Butterworth filter
Bandpass filter (ECG) | 0.5–40.0 Hz | 3rd-order Butterworth filter
Peak detection | scipy.signal.find_peaks | distance = 42 samples
SBP plausibility range | 60–260 mmHg | Segments outside excluded
DBP plausibility range | 40–180 mmHg | Segments outside excluded
Heart rate plausibility | 30–250 bpm | Segments outside excluded
R-peak detection | Pan-Tompkins (NeuroKit2) | Standard QRS detection
Minimum cardiac cycles | 10 | Required for reliable PTT
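The PPG filtering and peak-detection settings in Table 2 can be illustrated with a short sketch. It assumes zero-phase filtering via second-order sections (`sosfiltfilt`), which is a choice made here for illustration and is not stated in the table, and it runs on a synthetic 72 bpm sinusoid rather than MIMIC data. Note that a minimum peak distance of 42 samples at 125 Hz corresponds to 0.336 s, so detected pulse rates are capped at roughly 179 bpm.

```python
import numpy as np
from scipy.signal import butter, find_peaks, sosfiltfilt

FS = 125  # Hz, sampling rate from Table 2

def bandpass(signal, low_hz, high_hz, fs=FS, order=3):
    """3rd-order Butterworth bandpass; zero-phase application is an assumption."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def detect_pulse_peaks(ppg, min_distance=42):
    """Locate pulse peaks with the minimum spacing listed in Table 2."""
    peaks, _ = find_peaks(ppg, distance=min_distance)
    return peaks

# Synthetic 30 s PPG-like segment: 1.2 Hz pulse (72 bpm) plus slow baseline drift.
t = np.arange(0, 30, 1 / FS)
raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 0.05 * t)

filtered = bandpass(raw, 0.5, 8.0)  # PPG passband from Table 2
peaks = detect_pulse_peaks(filtered)
heart_rate_bpm = 60.0 * FS / np.mean(np.diff(peaks))

print(f"{len(peaks)} peaks, estimated heart rate {heart_rate_bpm:.1f} bpm")
```

For ECG, the same `bandpass` helper would be called with the 0.5–40.0 Hz passband; R-peak detection in the study uses Pan-Tompkins via NeuroKit2 and is not reproduced here.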
Table 3. Photoplethysmography (PPG) features extracted for blood pressure estimation (n = 21).
Category | Features
Statistical (11) | Mean, Standard Deviation, Variance, Minimum, Maximum, Range, Median, IQR, Skewness, Kurtosis, RMS
Heart Rate (2) | HR Mean, HR Standard Deviation
RR Interval (3) | RR Mean, RR Standard Deviation, RMSSD
Morphological (4) | Pulse Amplitude, Pulse Width (50%), Fall Time, Dicrotic Index
Frequency (1) | Dominant Frequency
Table 4. Electrocardiography (ECG) features extracted for blood pressure estimation (n = 9).

| Category | Features |
| --- | --- |
| Heart Rate (5) | HR Mean, HR Standard Deviation, RR Mean, RR Standard Deviation, RR Coefficient of Variation |
| Morphological (3) | Total Power, R-wave Amplitude Mean, R-wave Amplitude Standard Deviation |
| Quality (1) | Signal Quality Index |

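
The heart-rate and RR-interval features in Tables 3 and 4 all derive from the positions of detected beats. A minimal sketch, assuming beat locations are given as sample indices at 125 Hz and that sample (ddof = 1) standard deviations are used; the function name and dictionary keys are hypothetical:

```python
import numpy as np


def rr_interval_features(peak_indices, fs=125):
    """Heart-rate and RR-interval descriptors from beat sample indices."""
    rr = np.diff(np.asarray(peak_indices, dtype=float)) / fs  # RR intervals in seconds
    hr = 60.0 / rr                                             # instantaneous rate in bpm
    return {
        "hr_mean": hr.mean(),
        "hr_std": hr.std(ddof=1),
        "rr_mean": rr.mean(),
        "rr_std": rr.std(ddof=1),
        "rr_cv": rr.std(ddof=1) / rr.mean(),           # coefficient of variation
        "rmssd": np.sqrt(np.mean(np.diff(rr) ** 2)),   # root mean square of successive differences
    }


# Beats at samples 0, 100, 225, 325 give RR intervals of 0.8 s, 1.0 s, 0.8 s.
rr_feats = rr_interval_features([0, 100, 225, 325])
```
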
Table 5. Machine learning models evaluated for blood pressure estimation.

| Category | Models |
| --- | --- |
| Linear | Linear Regression, Ridge, Lasso, ElasticNet |
| Instance-based | K-Nearest Neighbors (KNN) |
| Kernel | Support Vector Regression (SVR-RBF) |
| Tree Ensemble | Random Forest, Gradient Boosting, XGBoost, LightGBM |

Table 6. ResNet-Transformer deep learning model configuration.

| Parameter | Value |
| --- | --- |
| Model | ResNet-Transformer hybrid (1D) |
| Input | Raw PPG/ECG waveforms |
| Total parameters | 6,078,850 |
| Optimizer | Adam (lr = 0.0001, weight decay = 1 × 10⁻⁵) |
| Early stopping | Patience = 15 epochs |
| Maximum epochs | 100 |
| Best epoch | 5 |

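
The interaction between the patience, maximum-epoch, and best-epoch entries in Table 6 can be sketched without any deep learning framework. The validation losses below are hypothetical, and both the 1-based epoch numbering and the stopping rule (halt once `patience` epochs pass without improvement) are assumptions; the authors' training loop is not shown in the article.

```python
def run_early_stopping(val_losses, patience=15, max_epochs=100):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, stop_epoch


# Hypothetical curve: loss improves through epoch 5, then drifts upward.
losses = [1.00, 0.80, 0.70, 0.65, 0.60] + [0.61 + 0.01 * i for i in range(95)]
best_epoch, stop_epoch = run_early_stopping(losses)
```

Under these assumptions, a best epoch of 5 with a patience of 15 means training would halt at epoch 20, well short of the 100-epoch cap.
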
Table 7. Clinical validation standards for blood pressure measurement devices.

| Standard | Criterion 1 | Criterion 2 | Criterion 3 |
| --- | --- | --- | --- |
| BHS Grade A | ≥60% within 5 mmHg | ≥85% within 10 mmHg | ≥95% within 15 mmHg |
| BHS Grade B | ≥50% within 5 mmHg | ≥75% within 10 mmHg | ≥90% within 15 mmHg |
| BHS Grade C | ≥40% within 5 mmHg | ≥65% within 10 mmHg | ≥85% within 15 mmHg |
| BHS Grade D | Below Grade C | - | - |
| AAMI | Mean error ≤ 5 mmHg | SD ≤ 8 mmHg | - |

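
The grading logic behind Table 7 can be expressed compactly. The sketch below uses the standard BHS thresholds (Grade A 60/85/95%, Grade B 50/75/90%, Grade C 40/65/85%, otherwise Grade D) and the AAMI limits on mean error and SD; the function names are hypothetical and the code is illustrative rather than the authors' evaluation script.

```python
BHS_THRESHOLDS = {
    "A": (60, 85, 95),
    "B": (50, 75, 90),
    "C": (40, 65, 85),
}


def cumulative_percentages(errors):
    """Percentage of absolute errors within 5, 10, and 15 mmHg."""
    abs_err = [abs(e) for e in errors]
    n = len(abs_err)
    return tuple(100.0 * sum(e <= limit for e in abs_err) / n for limit in (5, 10, 15))


def bhs_grade(pct5, pct10, pct15):
    """Best grade whose three thresholds are all met; Grade D otherwise."""
    for grade, (t5, t10, t15) in BHS_THRESHOLDS.items():
        if pct5 >= t5 and pct10 >= t10 and pct15 >= t15:
            return grade
    return "D"


def meets_aami(mean_error, sd_error):
    """AAMI criterion: |mean error| <= 5 mmHg and SD <= 8 mmHg."""
    return abs(mean_error) <= 5 and sd_error <= 8


xgboost_grade = bhs_grade(56.7, 77.2, 87.2)  # percentages reported in Table 9
knn_grade = bhs_grade(46.8, 70.2, 83.3)
```

With the cumulative percentages reported for XGBoost (56.7%, 77.2%, 87.2%), the function returns Grade C, because 87.2% falls short of the 90% that Grade B requires within 15 mmHg. All three criteria must hold at once, so a model that clears the 5 mmHg and 10 mmHg thresholds can still be downgraded by its tail errors.
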
Table 8. Complete model performance comparison for systolic blood pressure estimation. Models ranked by MAE. CI: confidence interval; BHS: British Hypertension Society grade.

| Model | MAE (95% CI), mmHg | RMSE (mmHg) | R² | Bias ± SD (mmHg) | BHS |
| --- | --- | --- | --- | --- | --- |
| XGBoost | 7.32 (6.59–8.07) | 11.67 | 0.621 | −1.27 ± 11.60 | C |
| KNN | 8.47 (7.73–9.25) | 12.57 | 0.560 | −1.45 ± 12.49 | D |
| Gradient Boosting | 8.77 (8.03–9.51) | 12.55 | 0.561 | −1.10 ± 12.50 | D |
| Random Forest | 10.39 (9.66–11.12) | 13.90 | 0.462 | −1.50 ± 13.82 | D |
| SVR | 12.24 (11.37–13.13) | 16.42 | 0.250 | −2.94 ± 16.15 | D |
| ResNet-Transformer | 12.78 (-) | 16.24 | 0.267 | 0.20 ± 16.23 | D |
| Linear Regression | 13.66 (12.79–14.50) | 17.52 | 0.146 | −1.46 ± 17.46 | D |
| LightGBM | 13.82 (12.98–14.69) | 17.40 | 0.157 | −1.40 ± 17.35 | D |
| Ridge | 13.90 (13.02–14.75) | 17.76 | 0.122 | −1.48 ± 17.70 | D |
| Lasso | 15.24 (14.34–16.17) | 19.02 | −0.007 | −1.64 ± 18.95 | D |
| ElasticNet | 15.24 (14.34–16.17) | 19.02 | −0.007 | −1.64 ± 18.95 | D |

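
The columns of Table 8 can be reproduced from paired reference and predicted values. The sketch below assumes a percentile bootstrap for the 95% CI of the MAE, which is one common choice; the table does not state how the intervals were obtained. The function name and the example values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_boot=2000, seed=0):
    """MAE with bootstrap 95% CI, RMSE, R², and bias ± SD of the prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    abs_err = np.abs(err)

    # Percentile bootstrap over samples (assumed CI method).
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, err.size, size=(n_boot, err.size))
    boot_mae = abs_err[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(boot_mae, [2.5, 97.5])

    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mae": abs_err.mean(),
        "mae_ci": (ci_low, ci_high),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "r2": 1.0 - ss_res / ss_tot,
        "bias": err.mean(),
        "sd": err.std(ddof=1),
    }


# Hypothetical SBP values in mmHg.
metrics = regression_metrics([100, 110, 120, 130], [102, 108, 123, 129])
```

Two properties help when reading the table. RMSE² is approximately bias² + SD², which holds for the XGBoost row (1.27² + 11.60² ≈ 11.67²). A negative R², as reported for Lasso and ElasticNet, means the model's squared error exceeds that of predicting the mean of the reference values.
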
Table 9. British Hypertension Society (BHS) grading for blood pressure estimation models. Grade thresholds shown for reference.

| Model | ≤5 mmHg | ≤10 mmHg | ≤15 mmHg | Grade |
| --- | --- | --- | --- | --- |
| XGBoost | 56.7% | 77.2% | 87.2% | C |
| KNN | 46.8% | 70.2% | 83.3% | D |
| Gradient Boosting | 43.7% | 70.3% | 82.8% | D |
| Random Forest | 32.2% | 59.5% | 78.0% | D |
| BHS Grade A threshold | 60% | 85% | 95% | - |
| BHS Grade C threshold | 40% | 65% | 85% | - |

Table 10. Top 15 features ranked by SHAP importance for blood pressure prediction.

| Rank | Feature | SHAP Importance | Signal |
| --- | --- | --- | --- |
| 1 | ecg_r_amp_std | 0.654 | ECG |
| 2 | ecg_signal_quality | 0.482 | ECG |
| 3 | ppg_hr_mean | 0.418 | PPG |
| 4 | ppg_skewness | 0.397 | PPG |
| 5 | ecg_r_amp_mean | 0.395 | ECG |
| 6 | ecg_hr_mean | 0.381 | ECG |
| 7 | ppg_iqr | 0.374 | PPG |
| 8 | ppg_kurtosis | 0.329 | PPG |
| 9 | ppg_min | 0.235 | PPG |
| 10 | ppg_fall_time | 0.220 | PPG |
| 11 | ecg_rr_mean | 0.200 | ECG |
| 12 | ppg_dicrotic_idx | 0.200 | PPG |
| 13 | ppg_rr_rmssd | 0.195 | PPG |
| 14 | ppg_width_50 | 0.174 | PPG |
| 15 | ppg_amp_mean | 0.154 | PPG |

Table 11. Ablation study results: feature source comparison (SBP).

| Model | PPG Only (MAE) | ECG Only (MAE) | Combined (MAE) | BHS Grade |
| --- | --- | --- | --- | --- |
| LightGBM | 15.97 | 16.23 | 16.24 | D |
| XGBoost | 16.58 | 17.41 | 16.19 | D |
| Random Forest | 16.36 | 16.54 | 16.31 | D |

Table 12. Statistical significance testing (Wilcoxon signed-rank test).

| Comparison | Model | p-Value | Significant? |
| --- | --- | --- | --- |
| PPG vs. ECG | LightGBM | 0.226 | No |
| PPG vs. ECG | XGBoost | 0.277 | No |
| PPG vs. ECG | Random Forest | 0.650 | No |

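
A paired comparison of the kind summarized in Table 12 can be sketched with `scipy.stats.wilcoxon`. The example assumes that the test is applied to per-segment absolute errors of two models evaluated on the same test segments and that significance is judged at α = 0.05; neither detail is stated in the table. The error arrays are synthetic and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_paired_errors(abs_err_a, abs_err_b, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired absolute errors."""
    _, p_value = wilcoxon(abs_err_a, abs_err_b)
    return p_value, bool(p_value < alpha)


rng = np.random.default_rng(42)
n_segments = 450  # hypothetical test-set size

# Two models with errors drawn from the same distribution.
abs_err_ppg = np.abs(rng.normal(0, 20, n_segments))
abs_err_ecg = np.abs(rng.normal(0, 20, n_segments))
p_same, sig_same = compare_paired_errors(abs_err_ppg, abs_err_ecg)

# A model that is consistently worse by 1 to 2 mmHg on every segment.
abs_err_worse = abs_err_ppg + rng.uniform(1, 2, n_segments)
p_shift, sig_shift = compare_paired_errors(abs_err_ppg, abs_err_worse)
```

The test is paired, so both error arrays must refer to the same segments in the same order. A non-significant result indicates that a difference was not detected at the chosen level; it does not by itself demonstrate equivalence.
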
Table 13. SHAP feature importance by signal type.

| Signal Type | % of Total Importance | Features in Top 10 |
| --- | --- | --- |
| ECG Features | 54.7% | 4 |
| PPG Features | 45.3% | 6 |

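
The aggregation in Table 13 amounts to grouping per-feature SHAP importances by signal prefix. The sketch below applies that grouping to the 15 values listed in Table 10; the helper names are hypothetical, and inferring the signal type from the `ecg_`/`ppg_` prefix is an assumption about how the grouping was done.

```python
TOP15_SHAP = {
    "ecg_r_amp_std": 0.654, "ecg_signal_quality": 0.482, "ppg_hr_mean": 0.418,
    "ppg_skewness": 0.397, "ecg_r_amp_mean": 0.395, "ecg_hr_mean": 0.381,
    "ppg_iqr": 0.374, "ppg_kurtosis": 0.329, "ppg_min": 0.235,
    "ppg_fall_time": 0.220, "ecg_rr_mean": 0.200, "ppg_dicrotic_idx": 0.200,
    "ppg_rr_rmssd": 0.195, "ppg_width_50": 0.174, "ppg_amp_mean": 0.154,
}


def signal_of(feature_name):
    """Signal type inferred from the feature-name prefix (assumed convention)."""
    return feature_name.split("_", 1)[0].upper()


def importance_share_by_signal(importances):
    """Percentage of summed importance attributed to each signal type."""
    totals = {}
    for name, value in importances.items():
        totals[signal_of(name)] = totals.get(signal_of(name), 0.0) + value
    grand_total = sum(totals.values())
    return {signal: 100.0 * v / grand_total for signal, v in totals.items()}


def count_in_top_k(importances, k=10):
    """Number of features per signal type among the k most important."""
    ranked = sorted(importances, key=importances.get, reverse=True)[:k]
    counts = {}
    for name in ranked:
        counts[signal_of(name)] = counts.get(signal_of(name), 0) + 1
    return counts


shares = importance_share_by_signal(TOP15_SHAP)
top10_counts = count_in_top_k(TOP15_SHAP)
```

The top-10 counts match Table 13 (4 ECG, 6 PPG). The percentage shares computed here cover only the 15 listed features, so they are not expected to reproduce the 54.7%/45.3% split, which presumably aggregates over the full feature set.
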
Table 14. Comparison of tree-based machine learning versus deep learning approaches for feature-based blood pressure estimation.

| Approach | Best MAE (mmHg) | Explanation |
| --- | --- | --- |
| XGBoost (tree-based) | 7.32 | Excels at structured feature relationships and nonlinear interactions |
| ResNet-Transformer (DL) | 12.78 | Designed for raw waveforms, not pre-engineered features |

Table 15. Comparison with selected blood pressure estimation studies from recent literature.

| Study | Method | MAE (SBP) | BHS | Key Difference |
| --- | --- | --- | --- | --- |
| This study | XGBoost | 7.32 mmHg | C | Feature-based, ECG + PPG |
| TransfoRhythm (2024) [28] | Transformer | 1.37 mmHg | A | End-to-end raw signals |
| CNN-BiLSTM (2025) [29] | Hybrid DL | 1.88 mmHg | A | Raw waveform input |
| Kachuee et al. (2017) [7] | AdaBoost | 11.17 mmHg | - | PPG features only |


Naskinova, I.; Kolev, M.; Milev, M.; Mitev, P. Signal-Derived Feature Analysis for Cuffless Blood Pressure Estimation: Comparing Machine Learning and Deep Learning on ICU Physiological Waveforms. AI 2026, 7, 98. https://doi.org/10.3390/ai7030098
