1. Introduction
1.1. The Challenge of Blood Pressure Management
Elevated blood pressure constitutes a silent yet profound threat to public health, affecting millions of individuals annually and contributing substantially to cardiovascular morbidity [1,2,3]. The World Health Organization estimates that approximately 1.28 billion adults worldwide suffer from hypertension, establishing it as a principal risk factor for heart disease, stroke, and premature mortality [4,5,6]. Notwithstanding extensive clinical research and the availability of efficacious pharmacological interventions, early detection of hypertension remains a considerable challenge, frequently resulting in severe health complications that might otherwise have been prevented through timely intervention.
Contemporary clinical practice typically relies upon periodic blood pressure assessments conducted during outpatient consultations, an approach characterized by notable limitations. Blood pressure exhibits considerable diurnal variation, and phenomena such as ‘white coat hypertension’ may substantially distort readings obtained in clinical settings [4,7]. Moreover, a significant proportion of the population does not attend regular medical appointments, thereby complicating timely diagnosis. This raises the pertinent question of whether it might be feasible to predict blood pressure utilizing existing medical records encompassing demographic, anthropometric, and clinical variables.
Recent advances in artificial intelligence have demonstrated considerable promise across diverse medical prediction tasks [8,9]. Investigators have successfully employed machine learning methodologies to diagnose various pathological conditions, forecast treatment responses, and stratify patient risk. Nevertheless, the specific application of machine learning to blood pressure prediction has received comparatively limited attention, particularly regarding systematic comparisons of different algorithmic approaches evaluated under consistent experimental conditions. This gap presents an opportunity for rigorous investigation.
1.2. What Others Have Tried
Research examining machine learning applications for hypertension risk assessment has yielded heterogeneous outcomes [10,11,12,13]. Certain investigations have concentrated on discriminating between hypertensive and normotensive patients through ensemble methods, achieving moderate classification accuracy in the range of 70–80% [14,15]. Neural networks have additionally been employed to forecast blood pressure trajectories from continuous monitoring data, with reported prediction errors typically ranging from 8 to 12 mmHg.
Notwithstanding these efforts, several notable gaps persist within the extant literature [16]. The binary classification of hypertension, whilst methodologically straightforward, fails to provide the nuanced blood pressure values that clinicians require for informed therapeutic decision-making. Furthermore, systematic comparisons amongst diverse machine learning approaches, spanning elementary linear models through to sophisticated neural architectures and utilizing consistent datasets and evaluation protocols, remain conspicuously absent. This deficiency renders it challenging to ascertain which methodological approaches are most efficacious for this particular predictive task.
1.3. What We Set Out to Do
The present study pursued four principal objectives:
Firstly, we endeavored to develop a comprehensive set of predictive features extending beyond elementary demographic variables. We constructed derived variables reflecting clinically meaningful patterns, including body mass index, composite cardiovascular risk scores, metabolic syndrome indicators, and mathematical interaction terms designed to capture potential synergistic effects amongst risk factors.
Secondly, we systematically evaluated an extensive array of machine learning algorithms, encompassing simple baseline predictors, classical statistical approaches such as linear and ridge regression, tree-based ensemble methods including Random Forest and gradient boosting variants (LightGBM, XGBoost, CatBoost), and more sophisticated methodologies. This comprehensive evaluation framework enabled rigorous assessment of relative algorithmic performance.
Thirdly, we assessed model performance employing both conventional statistical metrics and clinically relevant thresholds, recognizing that practitioners frequently prioritize predictions falling within ±5 or ±10 mmHg of actual measurements—tolerances reflecting meaningful clinical accuracy.
Finally, we sought to provide practical guidance for future investigators and clinicians regarding model selection, realistic accuracy expectations, and considerations pertinent to real-world healthcare deployment.
1.4. Why This Matters
The capacity to accurately predict blood pressure holds potential to transform multiple facets of cardiovascular care. Primary care settings could proactively screen patients for elevated risk during routine consultations, even in the absence of contemporaneous blood pressure measurements. Telehealth providers could more effectively identify individuals warranting urgent face-to-face evaluation, whilst public health programs could direct resources towards high-risk communities for targeted intervention.
The fundamental challenge lies in determining whether machine learning can yield predictions of sufficient accuracy for clinical utility, whilst maintaining interpretability adequate to engender trust amongst healthcare professionals. Achieving this balance is essential for integrating advanced predictive analytics into routine clinical workflows and ultimately improving cardiovascular outcomes.
2. Methods
2.1. The Data
2.1.1. Where It Came from
The present investigation utilizes a publicly available dataset from Kaggle comprising health information for 70,000 individuals [17]. This dataset encompasses demographic attributes including age and sex, anthropometric measurements (height and weight), cardiovascular parameters (systolic and diastolic blood pressure), metabolic indicators (cholesterol and glucose levels), and lifestyle factors (smoking status, alcohol consumption, and physical activity). Additionally, it includes cardiovascular disease diagnosis status for each participant.
Originally compiled for cardiovascular disease classification research, the dataset was repurposed here for blood pressure prediction. Whilst specific details regarding measurement protocols and population demographics are unavailable, its substantial size and comprehensive coverage of established cardiovascular risk factors render it suitable for our comparative methodological analysis.
2.1.2. Cleaning and Quality Checks
Clinical datasets frequently contain erroneous or physiologically implausible values, necessitating rigorous quality control procedures informed by established physiological parameters and clinical expertise.
For blood pressure measurements, we excluded values falling outside clinically plausible ranges: systolic pressures below 60 mmHg or exceeding 260 mmHg, and diastolic pressures below 40 mmHg or exceeding 180 mmHg. We additionally removed records where systolic pressure was documented as lower than diastolic pressure—a physiologically impossible scenario indicative of measurement or recording error.
Regarding anthropometric data, we excluded extreme values unlikely to represent genuine measurements: heights below 120 cm or exceeding 250 cm, and weights below 30 kg or exceeding 200 kg. Whilst rare individuals may exist outside these parameters, such outliers more plausibly represent data entry errors than genuine physiological values.
This exclusion process resulted in removal of 1325 blood pressure records (1.89% of the sample) and 59 anthropometric records (0.08%), yielding 68,616 high-quality records for analysis—a retention rate of 98%. This rigorous screening enhances confidence that subsequent models are founded upon genuine physiological relationships rather than measurement artefacts.
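The exclusion criteria above can be expressed as a simple pandas filter. This is an illustrative sketch: the column names (`ap_hi` for systolic pressure, `ap_lo` for diastolic, following the Kaggle cardiovascular dataset conventions) and the function name are assumptions, not the study's actual code.

```python
import pandas as pd

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop physiologically implausible rows using the thresholds in Section 2.1.2."""
    bp_ok = (
        df["ap_hi"].between(60, 260)       # systolic within 60-260 mmHg
        & df["ap_lo"].between(40, 180)     # diastolic within 40-180 mmHg
        & (df["ap_hi"] > df["ap_lo"])      # systolic must exceed diastolic
    )
    anthro_ok = df["height"].between(120, 250) & df["weight"].between(30, 200)
    return df[bp_ok & anthro_ok].copy()
```

Applying each criterion as a boolean mask, as here, also makes it straightforward to count how many records each rule removes, as reported in Table 3.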
The cleaned dataset was partitioned into training (64%, n = 43,913), validation (16%, n = 10,979), and test (20%, n = 13,724) sets employing stratified sampling to ensure comparable blood pressure distributions across all three partitions.
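Because blood pressure is continuous, stratified sampling requires binning the target first. The sketch below (an assumption about implementation, not the study's code) bins systolic pressure into deciles and applies a two-stage 80/20 split, yielding the 64/16/20 proportions described above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_bp_split(df: pd.DataFrame, target: str = "ap_hi", seed: int = 42):
    """64/16/20 train/validation/test split, stratified on decile-binned systolic BP."""
    bins = pd.qcut(df[target], q=10, labels=False, duplicates="drop")
    # First carve off the 20% test set, then 20% of the remainder (16% overall).
    train_val, test = train_test_split(df, test_size=0.20, stratify=bins, random_state=seed)
    bins_tv = pd.qcut(train_val[target], q=10, labels=False, duplicates="drop")
    train, val = train_test_split(train_val, test_size=0.20, stratify=bins_tv, random_state=seed)
    return train, val, test
```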
Table 1 summarizes the key characteristics of the cleaned dataset.
The dataset was partitioned into three subsets to facilitate robust model development and unbiased performance evaluation, as summarized in Table 2.
Table 3 details the specific exclusion criteria applied during data quality screening, along with the number of records affected by each criterion.
2.2. Creating Predictive Features
Raw clinical data rarely presents in an optimal format for machine learning applications. Age expressed in days, as recorded in this dataset, conveys less clinical meaning than age in years. Similarly, weight and height considered independently provide less insight than their derived ratio, body mass index. Clinical evidence suggests that cardiovascular risk factors may interact synergistically; for instance, the combined effect of obesity and advancing age on blood pressure may exceed the sum of their individual contributions.
We constructed engineered features organized into six conceptual categories:
2.2.1. Demographic Features (3 Features)
To enhance clinical interpretability, we transformed age from days to years and established age decade groupings (30s, 40s, 50s, 60s) alongside broader categorical age bands. This approach enables models to capture potential non-linear age effects, particularly the accelerated cardiovascular risk observed beyond the fifth decade of life.
Anthropometric Features (6 Features)
Beyond converting height to meters, we calculated several derived anthropometric indices:
Body Mass Index (BMI): weight in kilograms divided by height in meters squared, representing the standard clinical measure of adiposity.
BMI categories: employing World Health Organization classifications (underweight, normal weight, overweight, obese).
BMI z-scores: standardized BMI values, facilitating population-level comparisons.
Body surface area: calculated using the Du Bois formula, a parameter of relevance to cardiovascular physiology.
Ponderal index: an alternative weight-to-height ratio less commonly employed than BMI but potentially informative for certain body compositions.
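The derived anthropometric indices above follow standard formulae: BMI as weight over height squared, body surface area via Du Bois (which takes weight in kilograms and height in centimetres), and the ponderal index as weight over height cubed. A minimal sketch, with an illustrative function name:

```python
def anthropometric_features(weight_kg: float, height_cm: float) -> dict:
    """Derived indices from Section 2.2: BMI, WHO category, Du Bois BSA, ponderal index."""
    height_m = height_cm / 100.0
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        category = "underweight"
    elif bmi < 25.0:
        category = "normal"
    elif bmi < 30.0:
        category = "overweight"
    else:
        category = "obese"
    # Du Bois formula: BSA = 0.007184 * weight^0.425 * height^0.725 (kg, cm).
    bsa = 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725
    ponderal = weight_kg / height_m ** 3
    return {"bmi": bmi, "bmi_category": category, "bsa": bsa, "ponderal_index": ponderal}
```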
2.2.2. Clinical Features (Excluding BP-Derived Variables)
A critical methodological decision merits emphasis: we explicitly excluded features derived from blood pressure values, specifically mean arterial pressure (MAP = (SBP + 2 × DBP)/3), pulse pressure (PP = SBP − DBP), and hypertension stage classifications. Including such features would constitute target leakage, artificially inflating model performance by allowing predictors that mathematically encode the very values being predicted. The retained clinical features include:
Cardiovascular disease diagnosis: binary indicator of previously diagnosed cardiovascular disease.
Elevated cholesterol indicator: binary flag denoting above-normal cholesterol levels.
Elevated glucose indicator: binary flag denoting above-normal glucose levels.
High cholesterol and glucose indicators: binary flags denoting values in the highest (‘well above normal’) category.
Risk factor count: aggregate count of present risk factors encompassing smoking, alcohol consumption, elevated cholesterol, elevated glucose, and physical inactivity.
2.2.3. Interaction Features (8 Features)
Clinical observations indicate that risk factors do not merely accumulate additively; rather, they may interact multiplicatively. To explore such phenomena, we constructed interaction terms through multiplication of clinically relevant variable pairs:
Age with BMI: adiposity may exert differential effects across the lifespan.
Age with sex: cardiovascular risk trajectories differ between males and females.
BMI with smoking status: two major modifiable risk factors with potential synergistic effects.
Analogous interactions with physical activity levels and cholesterol status.
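Constructing these terms amounts to elementwise multiplication of the relevant columns. The sketch below is illustrative; the column names (`age_years`, `bmi`, `sex`, `smoke`, `active`, `high_chol`) are assumptions about the engineered feature table.

```python
import pandas as pd

def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Multiplicative interaction terms from Section 2.2.3 (column names illustrative)."""
    out = df.copy()
    pairs = [
        ("age_years", "bmi"),   # adiposity may act differently across the lifespan
        ("age_years", "sex"),   # sex-specific risk trajectories
        ("bmi", "smoke"),       # two modifiable risk factors with possible synergy
        ("bmi", "active"),
        ("bmi", "high_chol"),
    ]
    for a, b in pairs:
        out[f"{a}_x_{b}"] = out[a] * out[b]
    return out
```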
2.2.4. Polynomial Features (3 Features)
To accommodate potential non-linear relationships, we incorporated polynomial terms (squared values) for age, BMI, and their interaction. This approach enables models to capture accelerating effects—for instance, the relationship between age and blood pressure may steepen in older individuals.
2.2.5. Risk Scores (2 Features)
We constructed composite risk scores aggregating multiple individual risk factors:
Cardiovascular risk score: integrating age, BMI, cholesterol, glucose, smoking status, and physical inactivity into a unified risk metric.
Metabolic syndrome score: enumerating components of metabolic syndrome (elevated glucose, elevated cholesterol, obesity), with blood pressure explicitly excluded to prevent target leakage.
To examine associations between predictor features and blood pressure outcomes, we calculated Pearson correlation coefficients as presented in Table 4.
Cardiovascular disease diagnosis emerged as the strongest predictor, reflecting its clinical basis in blood pressure assessment criteria. This finding, whilst anticipated, underscores the inherent challenge of predicting blood pressure from truly independent features.
The complete feature set commences with the ten original variables from the cardiovascular disease dataset, described in Table 5.
Extending the original variables, we derived 23 engineered features through domain-specific transformations, as detailed in Table 6.
To capture potential synergistic effects amongst risk factors, we constructed interaction and polynomial features as presented in Table 7.
Table 8 provides a summary of all feature categories, confirming that none of the 75 input features are derived from blood pressure measurements, thereby avoiding target leakage.
2.3. Data Preprocessing
Following feature engineering, we employed standard preprocessing techniques to prepare the data for machine learning algorithms. We distinguished between continuous features (age, BMI, laboratory values) and categorical features (sex, cholesterol categories).
Continuous features underwent median imputation (notwithstanding the absence of missing values in our cleaned dataset) followed by standardization to ensure uniform scaling [18,19] across all features, thereby preventing variables with larger numeric ranges from disproportionately influencing model training.
Categorical features underwent one-hot encoding, transforming variables such as cholesterol level (with three ordinal categories: normal, above normal, well above normal) into separate binary indicator variables.
Crucially, all preprocessing transformations were fitted exclusively to the training set prior to application to validation and test sets, thereby precluding information leakage that could compromise model evaluation integrity.
2.4. The Algorithms We Tested
We evaluated nine models spanning the spectrum from elementary baseline approaches to sophisticated ensemble methods.
2.4.1. Baseline Models
Every predictive modelling study requires baseline comparisons—elementary approaches establishing minimum performance thresholds against which more sophisticated methods can be evaluated.
Global Mean Baseline: For each patient, predict the mean systolic and diastolic pressure derived from the training set. This baseline addresses the fundamental question: can any model surpass simply predicting typical population blood pressure values?
Global Median Baseline: Analogous to the mean baseline, but employing median values. Medians demonstrate greater robustness to outliers than means, potentially providing marginally improved baseline performance.
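Both baselines reduce to predicting a single constant pair (SBP, DBP) for every patient. A minimal sketch with an illustrative class name:

```python
import numpy as np

class GlobalStatBaseline:
    """Predict the training-set mean (or median) SBP/DBP for every patient."""

    def __init__(self, stat: str = "mean"):
        self.stat = stat

    def fit(self, X, y):
        fn = np.mean if self.stat == "mean" else np.median
        # y has one column per target (systolic, diastolic).
        self.constants_ = fn(np.asarray(y), axis=0)
        return self

    def predict(self, X):
        return np.tile(self.constants_, (len(X), 1))
```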
2.4.2. Classical Regression Models
Linear Regression [20]: The foundational approach to statistical prediction. Linear regression identifies the optimal hyperplane relationship between features and blood pressure. Despite its simplicity, linear regression frequently performs surprisingly well and yields interpretable coefficients quantifying each feature’s contribution to the prediction.
Ridge Regression: Linear regression incorporating L2 regularisation, imposing a penalty on large coefficients. This shrinkage mitigates overfitting, particularly important when features exhibit collinearity. We employed regularisation strength α = 1.0.
2.4.3. Tree-Based Ensemble Models
Random Forest [20]: Constructs an ensemble of decision trees using random subsets of features and observations, subsequently averaging their predictions. Random Forest naturally accommodates non-linear relationships and feature interactions, requires minimal hyperparameter tuning, and provides feature importance rankings. We employed 100 trees with default hyperparameters.
LightGBM (Light Gradient Boosting Machine [21]): A contemporary gradient boosting algorithm optimized for computational efficiency and predictive accuracy. Unlike Random Forest’s parallel tree construction, LightGBM builds trees sequentially, with each tree correcting its predecessor’s errors. We optimized hyperparameters using Optuna 3.3.0, a Bayesian optimization framework, conducting 30 trials to identify the optimal configuration [22,23]:
Learning rate: 0.0998
Number of leaves: 275
Tree depth: 14
Minimum samples per leaf: 18
Feature and sample subsampling rates: 0.774 and 0.784
L1 and L2 regularization: 4.623 and 4.297
Maximum iterations: 1000 with early stopping criterion.
The scikit-learn Gradient Boosting implementation was configured as presented in Table 10 [20].
LightGBM hyperparameters, optimized for the leaf-wise tree growth strategy, are presented in Table 11.
Table 12 details the XGBoost configuration, including regularization parameters to prevent overfitting [24].
The best-performing CatBoost model employed the hyperparameters presented in Table 13, leveraging ordered boosting and native categorical feature handling [25].
2.5. Training Procedure
All models were trained on the training set (n = 43,913), utilizing the validation set (n = 10,979) for early stopping and hyperparameter selection. The test set (n = 13,724) remained entirely sequestered until final evaluation—a strict separation ensuring that performance estimates reflect genuine generalization capability rather than optimistic overfitting.
To optimize hyperparameters for gradient boosting models (CatBoost, XGBoost, LightGBM), we employed the validation set to identify optimal configurations [26]. Specifically, we implemented Bayesian optimization through Optuna, which efficiently navigates the hyperparameter space rather than exhaustively evaluating all combinations, conducting 30 trials per model.
All models predicted both systolic and diastolic blood pressure simultaneously via multi-output regression. Random seed 42 was employed throughout to ensure reproducibility.
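In scikit-learn terms, joint prediction of SBP and DBP can be achieved either with estimators that natively support two-column targets (such as `RandomForestRegressor`) or by wrapping single-output learners in `sklearn.multioutput.MultiOutputRegressor`. A minimal sketch of the native route, with an illustrative function name:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_multioutput(X, y_sbp_dbp, seed=42):
    """Fit one model that jointly predicts the two target columns (SBP, DBP).
    random_state fixes the seed for reproducibility, mirroring the study's seed of 42."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y_sbp_dbp)
    return model
```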
2.6. Evaluation Metrics
Model performance was assessed using both statistical measures familiar to the research community and clinically meaningful thresholds of relevance to practicing clinicians.
2.6.1. Statistical Metrics
Root Mean Squared Error (RMSE): The square root of the mean squared prediction error. RMSE penalises large errors more severely than small errors and shares the same units as blood pressure (mmHg), facilitating clinical interpretation. Lower values indicate superior performance, with values below 10 mmHg generally considered acceptable for blood pressure prediction.
Mean Absolute Error (MAE): The arithmetic mean of absolute differences between predictions and actual values. MAE demonstrates greater robustness to outliers than RMSE and offers intuitive interpretation—it represents the typical prediction error in mmHg.
Coefficient of Determination (R-squared): The proportion of blood pressure variance explained by the model, ranging from negative infinity (performance inferior to predicting the mean) to 1.0 (perfect prediction). R-squared values exceeding 0.8 indicate strong predictive power.
Bias: The mean prediction error (without absolute value). Bias reveals systematic under-prediction (positive bias) or over-prediction (negative bias). Unbiased models should exhibit bias approaching zero.
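The four statistical metrics above can be computed directly from the error vector; note that with error defined as actual minus predicted, positive bias corresponds to under-prediction, as stated above. The helper name is illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R-squared, and bias as defined in Section 2.6.1.
    Bias = mean(y_true - y_pred), so positive values indicate under-prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    bias = float(np.mean(err))
    return {"rmse": rmse, "mae": mae, "r2": r2, "bias": bias}
```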
2.6.2. Clinical Accuracy Metrics
Beyond statistical measures, we assessed clinically relevant accuracy thresholds:
Within ±5 mmHg: The proportion of predictions falling within 5 mmHg of actual blood pressure. Clinical guidelines suggest that errors within 5 mmHg rarely affect therapeutic decisions. We targeted 80% or more of predictions meeting this threshold.
Within ±10 mmHg: The proportion within 10 mmHg, a more permissive yet clinically meaningful threshold. Most clinicians would consider predictions within 10 mmHg acceptable for screening and risk stratification purposes. We targeted 90% or more.
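Both thresholds reduce to the fraction of absolute errors at or below a tolerance. A one-function sketch (the name is illustrative):

```python
import numpy as np

def within_tolerance(y_true, y_pred, tol_mmhg):
    """Fraction of predictions within ±tol_mmhg of the measured value."""
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return float(np.mean(err <= tol_mmhg))
```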
2.6.3. Uncertainty Quantification
For the best-performing model, we computed bootstrap confidence intervals [27] using 1000 resamples with replacement from the test set, providing uncertainty estimates around RMSE values and addressing the question of performance estimate stability.
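A percentile bootstrap of this kind resamples (actual, predicted) pairs with replacement and recomputes RMSE on each replicate. A sketch under those assumptions:

```python
import numpy as np

def bootstrap_rmse_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for RMSE: resample test-set pairs with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample indices with replacement
        stats[b] = np.sqrt(np.mean((y_true[idx] - y_pred[idx]) ** 2))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```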
2.7. Interpretability Methods
Achieving high predictive accuracy alone is insufficient for clinical deployment; model interpretability is crucial for clinical acceptance, regulatory approval, and healthcare professionals’ trust in automated decision support systems. Clinicians require transparent explanations of how individual predictions are derived to evaluate algorithmic recommendations against their clinical judgement.
The interpretability framework implemented in this study operates at three complementary analytical levels. Global interpretability methods identify which features exert the strongest influence on model predictions averaged across all patients, revealing population-level patterns in blood pressure determinants.
2.7.1. Feature Importance and Global Attribution
For tree-based ensemble models including Random Forest and LightGBM, which construct predictions through hierarchical decision rules applied across multiple trees, built-in feature importance metrics provide a natural mechanism for quantifying each predictor’s contribution to model performance. These importance scores quantify the aggregate contribution of each feature to splits across all trees.
Mathematically, for a tree-based model with $T$ trees and a feature $x_j$, the importance score $I_j$ is computed as:

$$I_j = \sum_{t=1}^{T} \sum_{s \in S_t} \Delta \mathrm{Error}(s) \cdot \mathbb{1}[v(s) = x_j]$$

where $S_t$ represents the set of all splits in tree $t$, $\Delta \mathrm{Error}(s)$ quantifies the error reduction achieved by split $s$, $v(s)$ denotes the feature used by split $s$, and the indicator function $\mathbb{1}[v(s) = x_j]$ equals 1 when the split uses feature $x_j$ and 0 otherwise. These raw importance values were then normalized to sum to 1.0 across all features and ranked to identify the strongest predictors of systolic and diastolic blood pressure.
Consistent with physiological expectations and the engineered feature design described in Section 2.2, the most influential predictors identified through global importance analysis included age (capturing vascular stiffening and atherosclerotic burden), body mass index (reflecting adiposity and metabolic load), and cardiovascular disease diagnosis.
2.7.2. SHAP Value Analysis
To achieve model-agnostic and directionally consistent explanations applicable across diverse algorithmic architectures, we utilized SHAP (SHapley Additive exPlanations). This interpretability framework, grounded in cooperative game theory, provides a theoretically principled method for attributing model output to individual feature contributions.
Formally, the SHAP value $\varphi_j$ for feature $j$ quantifies its contribution to the prediction $f(x)$ for patient $x$:

$$\varphi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(x) - f_S(x) \right]$$

where $F$ represents the full feature set, $S$ denotes a subset of features excluding $j$, $|S|$ is the subset size, $f_S(x)$ is the model prediction using only the features in $S$, and the sum iterates over all possible feature subsets. The factorial terms weight each marginal contribution by the number of orderings in which feature $j$ could be added to subset $S$, ensuring fair credit attribution.
SHAP analysis in this study operated at two complementary levels. For global interpretation, mean absolute SHAP values were calculated across all test patients to rank overall predictor influence on model output. This analysis revealed that age and body mass index consistently emerged as dominant predictors.
For local explanation, individual-level SHAP plots were generated to visualize the direction and magnitude of each feature’s contribution for representative test samples spanning diverse demographic and physiological profiles. These waterfall plots display how baseline predictions are systematically modified by individual feature values.
All SHAP computations utilize tailored algorithms optimized for specific model architectures: the TreeExplainer algorithm efficiently computes exact Shapley values for tree-based ensembles by leveraging their hierarchical structure. This methodological uniformity enabled valid comparisons across different model types.
2.7.3. Stability and Consistency Checks
To verify the robustness of interpretability findings and ensure that identified feature importance rankings reflect stable patterns rather than artefacts of particular test set composition, comprehensive stability analysis was conducted through bootstrap resampling and cross-model comparison.
The Spearman rank correlation $\rho$ between two feature rankings $R_1$ and $R_2$ is computed as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i = R_1(i) - R_2(i)$ is the rank difference for feature $i$ and $n$ is the number of features ranked.
Across 30 bootstrap replicates, Spearman correlations between successive feature rankings exceeded 0.95 for the top ten features, indicating highly stable importance orderings that would yield consistent clinical recommendations.
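A stability check of this form can be sketched with `scipy.stats.spearmanr`, comparing importance vectors from two replicates restricted to the top-ranked features (the function name and top-k selection rule are assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_stability(importances_a, importances_b, top_k=10):
    """Spearman correlation between two importance vectors, restricted to the
    features ranked in the top-k of the first replicate."""
    a, b = np.asarray(importances_a), np.asarray(importances_b)
    top = np.argsort(a)[::-1][:top_k]
    rho, _ = spearmanr(a[top], b[top])
    return float(rho)
```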
Additionally, cross-model comparisons were performed by extracting SHAP-derived feature attributions from three algorithmically diverse models: Ridge Regression (linear with L2 regularisation), Random Forest (non-linear tree ensemble), and LightGBM (gradient boosted trees). Despite fundamentally different algorithmic approaches, the top features exhibited remarkable consistency.
2.7.4. Clinical Interpretation Context
The interpretability analysis elucidated not only the mathematical structure of each model but also provided clinically relevant insights bridging statistical pattern recognition and physiological understanding. The monotonic positive SHAP gradients observed for age and body mass index align with established cardiovascular physiology.
The prominence of interaction terms—particularly age multiplied by BMI—in feature importance rankings reveals that blood pressure regulation exhibits substantial heterogeneity across patient subgroups, with risk factors exerting synergistic rather than purely additive effects. This finding carries important clinical implications for risk stratification.
The interpretability patterns observed in these predictive models bolster their physiological credibility and enhance likelihood of acceptance in clinical decision-making environments where physicians must have confidence that algorithmic recommendations are grounded in established medical knowledge.
The comprehensive interpretability framework developed in this research integrates multiple complementary methodologies: tree-based feature importance metrics, SHAP analysis for model-agnostic attributions, bootstrap stability verification, and cross-model consistency validation. This multifaceted approach provides robust evidence regarding the reliability of identified predictive patterns.
3. Results
3.1. Overall Model Performance
Evaluation of the held-out test set (n = 13,724 individuals never encountered during training or validation) revealed modest yet consistent performance patterns across models. CatBoost achieved superior performance with systolic blood pressure RMSE of 14.37 mmHg and R-squared of 0.265, representing a 14.3% improvement over predicting the global mean (16.76 mmHg). The gradient boosting variants (XGBoost, LightGBM) followed closely, with performance differences of less than 0.2 mmHg RMSE.
Notably, classical linear models—specifically ordinary least squares, Ridge, and Lasso regression—trailed the boosting methods by similarly narrow margins. This convergence suggests that the predictive signal within our engineered features is predominantly linear, with tree-based methods capturing only modest additional non-linear structure.
The R-squared values, ranging between 0.25 and 0.27, convey a sobering reality: pre-measurement characteristics account for approximately one-quarter of blood pressure variance. The remaining three-quarters presumably reflects measurement timing, acute physiological states, and factors not captured within our feature set.
The complete regression performance metrics for all models on the held-out test set are presented in Table 14.
3.2. Clinical Accuracy
CatBoost achieved the highest clinical accuracy, with 26.4% of systolic blood pressure predictions falling within ±5 mmHg and 47.6% within ±10 mmHg of actual values. Whilst these percentages may appear modest, they represent the genuine predictive signal available from pre-measurement features, a more candid assessment than artificially inflated metrics derived from blood pressure-dependent predictors.
The baseline models (Global Mean, Global Median, Group Mean) achieved comparable clinical accuracy to the trained models, underscoring the inherent difficulty of substantially surpassing simple prediction strategies with the available feature set.
Clinical accuracy metrics, quantifying the proportion of predictions within clinically meaningful thresholds, are summarized in Table 15.
Clinical utility targets: ≥80% within ±5 mmHg, ≥90% within ±10 mmHg.
3.3. Classification Performance
Reframing the prediction task as binary classification, distinguishing hypertensive from normotensive individuals according to ACC/AHA 2017 thresholds (SBP ≥ 130 mmHg or DBP ≥ 80 mmHg), yields complementary insights into model utility [28].
ROC-AUC (area under the receiver operating characteristic curve) values clustered within the 0.78–0.79 range across trained models, indicating moderate capacity to discriminate hypertensive from normotensive individuals. CatBoost achieved the highest AUC of 0.787, although the margin over simpler models remained narrow.
The pattern of high specificity (97–99%) coupled with low recall (7–10%) warrants interpretation. The models operate conservatively, confidently classifying an individual as hypertensive only when features strongly suggest elevated pressure. This conservative approach minimizes false positives but fails to identify many true cases—a trade-off potentially suitable for screening applications where confirmatory testing follows positive predictions.
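The classification reframing described above derives binary labels from both measured and predicted pressures and then computes recall and specificity. A minimal sketch (the function name is illustrative; the thresholds follow ACC/AHA 2017 as stated above):

```python
import numpy as np

def hypertension_metrics(sbp_true, dbp_true, sbp_pred, dbp_pred):
    """Binary hypertension labels under ACC/AHA 2017 (SBP >= 130 or DBP >= 80),
    with recall and specificity computed from predicted pressures."""
    actual = (np.asarray(sbp_true) >= 130) | (np.asarray(dbp_true) >= 80)
    flagged = (np.asarray(sbp_pred) >= 130) | (np.asarray(dbp_pred) >= 80)
    tp = np.sum(actual & flagged)
    tn = np.sum(~actual & ~flagged)
    fn = np.sum(actual & ~flagged)
    fp = np.sum(~actual & flagged)
    recall = tp / max(tp + fn, 1)          # sensitivity: true cases identified
    specificity = tn / max(tn + fp, 1)     # true negatives correctly cleared
    return {"recall": float(recall), "specificity": float(specificity)}
```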
When reframing blood pressure prediction as a binary hypertension classification task utilizing ACC/AHA 2017 thresholds, model performance metrics are presented in Table 16.
3.4. Ablation Study: Feature Group Contributions
To elucidate which feature categories drive predictive performance, we conducted ablation experiments, systematically removing each feature group and quantifying the resultant change in CatBoost systolic blood pressure RMSE.
The results proved unexpectedly uninformative—or rather, informative in their uniformity. Removing any single feature category altered RMSE by less than 0.1%, with some removals paradoxically yielding slight performance improvements. This pattern suggests substantial redundancy amongst feature groups, whereby information captured by one category overlaps significantly with that captured by others.
Cardiovascular disease diagnosis alone accounts for the majority of the predictive signal: when this feature was removed, RMSE increased by approximately 0.5%, the largest single-category effect observed. This dominance aligns with the feature importance rankings.
To assess the contribution of each feature category to predictive performance, we conducted a systematic ablation study, with results presented in
Table 17.
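The ablation loop itself is straightforward to sketch. The toy example below substitutes an ordinary least squares fit on synthetic data for the study's CatBoost pipeline (the feature group names, column assignments, and coefficients are invented for the demonstration), but the drop-a-group-then-compare-RMSE logic is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature groups mapped to column indices (invented for the demo).
groups = {"demographic": [0, 1], "anthropometric": [2, 3], "clinical": [4, 5]}
X = rng.normal(size=(500, 6))
true_beta = np.array([2.0, 0.5, 1.0, 0.0, 3.0, 0.2])
y = X @ true_beta + rng.normal(scale=1.0, size=500)

def rmse_without(excluded_cols):
    """Fit least squares on the remaining columns; return in-sample RMSE."""
    keep = [c for c in range(X.shape[1]) if c not in excluded_cols]
    Xk = np.column_stack([X[:, keep], np.ones(len(X))])  # intercept term
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    resid = y - Xk @ beta
    return float(np.sqrt(np.mean(resid ** 2)))

baseline = rmse_without([])
# RMSE increase when each group is dropped; larger delta = more important group.
deltas = {g: rmse_without(cols) - baseline for g, cols in groups.items()}
```

In a real run the model would be retrained on a held-out split for each ablation; in-sample refitting is used here purely to keep the sketch short.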
3.5. Confidence Intervals
To quantify uncertainty in our performance estimates, we computed bootstrap confidence intervals for systolic blood pressure RMSE using 1000 resamples (
Table 18).
Bootstrap analysis (1000 resamples) yielded 95% confidence intervals for systolic blood pressure RMSE estimates. CatBoost achieved [14.21, 14.53] mmHg, whilst XGBoost demonstrated [14.25, 14.57] mmHg. The narrow confidence intervals across all models indicate stable, reliable performance estimates.
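A percentile bootstrap over prediction pairs, as used here, can be sketched as follows. The resample count (1000) matches the text; the synthetic data merely mimic a residual spread of roughly 14 mmHg and are not the study's predictions.

```python
import numpy as np

def bootstrap_rmse_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for RMSE."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample pairs with replacement
        err = y_true[idx] - y_pred[idx]
        stats[b] = np.sqrt(np.mean(err ** 2))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic predictions with a residual spread of roughly 14 mmHg.
rng = np.random.default_rng(0)
actual = rng.normal(120, 15, size=2000)
predicted = actual + rng.normal(0, 14, size=2000)
lo, hi = bootstrap_rmse_ci(actual, predicted)
```

Resampling paired (actual, predicted) values, rather than residuals alone, preserves any dependence between error magnitude and the underlying blood pressure level.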
3.6. Visualizing Model Performance
To provide comprehensive visual representation of model performance, we present several complementary visualizations.
Figure 1 compares RMSE across all evaluated models, whilst
Figure 2,
Figure 3,
Figure 4 and
Figure 5 provide detailed analyses of clinical accuracy, prediction scatter plots, and residual distributions for the best-performing CatBoost model.
The gradient boosting methods form a distinct cluster at the lower end of the RMSE scale, demonstrating their consistent superiority over both classical linear methods and baseline predictors. Notably, the 95% confidence intervals for all gradient boosting implementations overlap substantially, indicating that observed performance differences may not be statistically significant. This visual representation reinforces the quantitative findings presented in
Table 14,
Table 15,
Table 16,
Table 17 and
Table 18.
The final optimized hyperparameters for the CatBoost model, selected through 30 trials of Bayesian optimization, are summarized in
Table 19.
Table 20 provides a comprehensive overview of all machine learning models evaluated in this study, highlighting their key algorithmic characteristics.
Beyond aggregate error metrics, clinical accuracy—the proportion of predictions falling within acceptable thresholds—provides a more clinically interpretable performance measure.
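One minimal way to compute such a metric is shown below. The 5/10/15 mmHg bands follow the BHS-style convention and are an assumption, since the exact thresholds behind the figures are not restated here.

```python
import numpy as np

def clinical_accuracy(y_true, y_pred, thresholds=(5, 10, 15)):
    """Fraction of predictions whose absolute error falls within each
    threshold (mmHg). The 5/10/15 bands are a BHS-style assumption."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return {t: float(np.mean(err <= t)) for t in thresholds}

# Toy example: absolute errors of 4, 12, 1, and 9 mmHg.
acc = clinical_accuracy([120, 130, 145, 160], [124, 142, 146, 151])
```

A dictionary of threshold-to-fraction pairs translates directly into the stacked-bar style of clinical accuracy plot described in the text.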
Scatter plots of predicted versus actual values illustrate the relationship between model outputs and true measurements, with the diagonal line representing perfect prediction.
Figure 4 presents the corresponding scatter plot for diastolic blood pressure predictions.
Examination of prediction error distributions provides insight into potential systematic biases in model predictions.
Figure 5 displays the distribution of prediction residuals (predicted minus actual) for systolic blood pressure. The approximately normal distribution centered near zero (mean bias = 0.11 mmHg) indicates unbiased predictions without systematic over- or under-estimation across the blood pressure range.
The symmetrical distribution of residuals confirms that the model does not systematically favor particular blood pressure ranges, an important property for clinical applications where both under-prediction and over-prediction carry meaningful consequences. These visualizations collectively support the statistical findings presented in the preceding tables, providing intuitive confirmation of model behavior.
4. Discussion
4.1. Interpreting Realistic Prediction Performance
The modest R-squared values (0.27 for systolic blood pressure, 0.19 for diastolic blood pressure) warrant interpretation rather than apology. These figures indicate that demographic, anthropometric, and clinical features available in routine epidemiological datasets explain approximately one-quarter of blood pressure variance. The remaining three-quarters reflects factors absent from our feature set: acute physiological states, measurement timing, environmental conditions, and inherent biological variability.
This finding carries several important implications. Firstly, it establishes realistic expectations for what machine learning can achieve with commonly available data. Secondly, it underscores that blood pressure prediction fundamentally differs from classification tasks where accuracies exceeding 90% are routinely achieved. Thirdly, it suggests that substantial improvements would require data sources beyond those typically available in epidemiological repositories.
4.2. Feature Leakage: A Critical Methodological Consideration
A cautionary observation emerged from our preliminary analyses. When blood pressure-derived variables—including mean arterial pressure, pulse pressure, and hypertension staging—remained within the feature set, models achieved spectacular performance: R-squared values approaching unity and RMSE near zero. Such results, whilst internally consistent within the flawed analytical framework, represent circular reasoning rather than genuine prediction.
Removing these contaminated features yielded a dramatic correction. R-squared declined precipitously from inflated values to 0.26, exposing the true and considerably more modest predictive signal available from genuinely pre-measurement features. This experience underscores a critical lesson for the field: any blood pressure prediction study reporting exceptionally high accuracy warrants careful scrutiny for potential feature leakage.
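The leakage is easy to see algebraically: mean arterial pressure and pulse pressure are deterministic functions of the two targets, so a model given both can reconstruct systolic and diastolic pressure exactly. A short sketch using the standard formulas:

```python
import numpy as np

def mean_arterial_pressure(sbp, dbp):
    """Standard MAP approximation: (SBP + 2*DBP) / 3."""
    return (np.asarray(sbp, dtype=float) + 2 * np.asarray(dbp, dtype=float)) / 3

def pulse_pressure(sbp, dbp):
    """Pulse pressure: SBP - DBP."""
    return np.asarray(sbp, dtype=float) - np.asarray(dbp, dtype=float)

sbp, dbp = 138.0, 84.0
map_ = mean_arterial_pressure(sbp, dbp)   # 102.0
pp = pulse_pressure(sbp, dbp)             # 54.0

# Given MAP and PP, both targets are recoverable in closed form,
# which is exactly why these features leak the labels:
sbp_back = map_ + 2 * pp / 3              # recovers SBP = 138.0
dbp_back = map_ - pp / 3                  # recovers DBP = 84.0
```

Any model flexible enough to learn these two linear combinations will therefore achieve near-perfect scores without using any genuine pre-measurement information.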
4.3. Gradient Boosting Methods: Modest but Consistent Advantages
The gradient boosting implementations (CatBoost, XGBoost, LightGBM) consistently occupied the top performance positions, although their advantage over classical linear methods remained modest [
21,
24,
25]. CatBoost’s 14.37 mmHg RMSE represented only a 2.4% improvement over Ridge regression’s 14.72 mmHg. This narrow margin suggests that the predictive signal within our engineered features is predominantly linear in nature.
The convergence of diverse algorithms to similar performance levels carries methodological significance. When simple and complex models achieve comparable results, the limiting factor is likely feature informativeness rather than model expressiveness. Additional algorithmic sophistication yields diminishing returns once the feature set has been thoroughly exploited.
4.4. The Dominance of Cardiovascular Disease Diagnosis
Feature importance analysis revealed cardiovascular disease diagnosis as the dominant predictor, contributing the majority of explained variance across all models. This finding, whilst unsurprising given that cardiovascular disease diagnosis frequently incorporates blood pressure assessment, raises interpretive subtleties.
The cardiovascular disease diagnosis variable represents legitimate pre-measurement information—a patient’s diagnostic history exists prior to any new blood pressure measurement. However, its predictive power derives substantially from prior blood pressure readings that informed the diagnosis. This relationship, whilst not constituting direct target leakage, illustrates the complex temporal dependencies inherent in medical prediction tasks.
The top 10 most important features identified by the CatBoost model, together with their clinical interpretations, are presented in
Table 21.
4.5. Clinical Implications
If successfully validated on external populations, these predictive models could enable several meaningful clinical applications, enhancing patient care and public health efforts. In primary care settings, these tools could transform routine screening workflows, enabling risk stratification even when blood pressure measurement equipment is unavailable.
Telehealth represents another promising application domain where these models address a practical clinical challenge. During virtual consultations, healthcare providers often lack access to blood pressure measurement equipment. In such scenarios, the model could estimate blood pressure from available patient information, assisting providers in identifying patients requiring urgent in-person evaluation.
Public health departments could utilize these models to identify high-risk communities or demographic groups warranting focused screening initiatives. By adopting a data-driven strategy for resource allocation, limited public health resources can be optimized, ensuring that screening efforts effectively target populations most likely to benefit.
Many epidemiological studies lack complete blood pressure measurements for all participants, constraining certain analyses. Predicted values could enable research questions otherwise impossible to address, provided investigators appropriately acknowledge the uncertainty inherent in predicted rather than measured values.
Despite these potential applications, several challenges stand between retrospective performance and responsible clinical deployment. Models frequently experience substantial performance degradation when applied to populations with characteristics differing from training data, rendering external validation essential rather than optional. The feature leakage concerns discussed previously carry direct implications for prospective deployment.
Ensuring fairness and addressing potential bias requires systematic evaluation across demographic subgroups. We must verify whether models predict equally well for males and females, across different age ranges, and amongst various racial and ethnic groups. Differential performance could inadvertently exacerbate existing health disparities if models prove more accurate for certain populations than others.
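A subgroup audit of this kind reduces to computing error metrics per stratum. The sketch below does so for RMSE with an invented sex variable and toy values; in practice the same loop would run over age bands and racial and ethnic groups.

```python
import numpy as np

def rmse_by_group(y_true, y_pred, groups):
    """Per-subgroup RMSE; large gaps flag potential differential performance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        out[str(g)] = float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
    return out

# Toy example with a deliberately worse-served subgroup "M".
actual = np.array([120.0, 130, 140, 125, 150, 135])
predicted = np.array([122.0, 128, 141, 140, 165, 120])
sex = np.array(["F", "F", "F", "M", "M", "M"])
per_group = rmse_by_group(actual, predicted, sex)
```

A gap of this magnitude between strata would warrant investigation before any deployment, per the fairness concerns raised above.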
Clinical decision support systems fall under regulatory oversight by health agencies internationally. Demonstrating safety, effectiveness, and genuine clinical utility demands rigorous validation extending well beyond retrospective dataset analysis, including prospective studies documenting real-world performance and clinical impact.
Finally, even technically accurate models may fail to improve patient care if clinicians cannot or do not utilise them effectively. Successful clinical integration requires thoughtful attention to user experience through intuitive interfaces, seamless electronic health record integration, and clear presentation of predictions with associated uncertainty measures.
4.6. Comparison to Previous Studies [29]
Our results align with realistic expectations for blood pressure prediction from routine clinical features. Previous studies employing ensemble methods reported RMSE values of 10–15 mmHg with R-squared approximately 0.15–0.35 [
1,
30,
31,
32] when properly excluding blood pressure-derived features. Studies reporting substantially superior performance typically included features with target leakage.
Several factors contextualize our performance relative to prior work:
Feature engineering scope: We constructed 39 features, compared with the 10–20 typical of prior investigations [
33,
34,
35]. This comprehensive feature set captures additional predictive information. While addressing a different class of engineering problems, earlier contributions by Mitev [
36,
37] similarly illustrate how well-defined system architectures can support reliable analytical outcomes.
Data quality: Our exclusion of 1.98% of records to ensure physiological plausibility may have produced an unusually clean dataset. Prior studies frequently omit reporting their data quality procedures, potentially training on noisier data.
Dataset size: With 68,616 samples, we possessed more data than many comparable studies (12,000–45,000 samples). Larger datasets enable superior model training, particularly for complex algorithms.
Systematic comparison: Most studies evaluate 2–3 models; we systematically assessed nine models. This breadth revealed performance patterns—such as tree-based models excelling whilst neural networks struggled—obscured in narrower comparisons.
However, our single-dataset focus constrains the generalisability claims that multi-dataset studies can make [
38,
39]. Cross-disciplinary work—for instance in automated part-orientation systems [
36,
37]—demonstrates how rigorous system-level methodologies can be valuable even in fields far removed from cardiovascular data analysis.
4.7. Limitations
Several important limitations must be considered when interpreting these findings.
Firstly, our findings derive from a single dataset, highlighting technical feasibility and performance metrics achievable within this specific context. However, these results may not extend to blood pressure prediction in varied clinical environments and populations. External validation across diverse healthcare settings and demographic groups is essential to confirm wider applicability.
The demographic profile of our study population raises additional concerns regarding specificity. With participants aged between 30 and 65 years, 50% prevalence of cardiovascular disease, and 66% female composition, this dataset may not fully capture all clinically relevant demographics.
Consequently, model effectiveness in pediatric patients, older adults, or individuals with particular disease characteristics remains uncertain and necessitates targeted research prior to clinical implementation in these populations.
Single-time-point measurements cannot capture blood pressure dynamics, track individual changes over time, or predict future hypertension development—all clinically important questions requiring longitudinal data collection and analysis. Future research incorporating temporal information would substantially enhance clinical utility.
Feature engineering decisions significantly influence model interpretability and clinical validity. Our deliberate exclusion of blood pressure-derived features represents a critical methodological choice, prioritizing genuine predictive utility over artificially inflated metrics.
The absence of detailed measurement protocol information represents another limitation. We lack essential details regarding how blood pressure was measured in the source dataset, including device type, measurement conditions, whether readings were averaged across multiple measurements, and quality control procedures implemented.
Furthermore, the complete absence of medication information presents a significant interpretive challenge. Many patients’ measured blood pressures reflect pharmacological treatment effects rather than underlying physiological status. Without knowing which patients received antihypertensive medications, we cannot distinguish between genuinely normotensive individuals and treated hypertensive patients.
Finally, limited temporal context information constrains our ability to account for blood pressure variability. We lack data regarding measurement timing, including time of day, seasonal factors, or acute circumstances surrounding measurement. Given that blood pressure varies substantially based on these contextual factors, our models may not generalise optimally to measurements obtained under different conditions.
4.8. Future Research Directions
Building upon the current findings, several promising research directions could advance clinical utility and real-world applicability of machine learning-based blood pressure prediction models. External validation represents a critical next step, testing these models on independent datasets from different healthcare systems, countries, and populations.
Prospective clinical trials offer another valuable evaluation avenue. Structuring studies wherein models predict blood pressure prior to actual measurements within standard clinical workflows would enable analysis of prediction accuracy against measured values whilst investigating whether these predictions influence clinical decision-making.
Addressing potential feature leakage requires developing models using only pre-measurement features routinely available in clinical care, excluding blood pressure-derived variables. Assessing whether prediction accuracy remains clinically useful under these constraints would strengthen confidence in real-world deployment.
Deeper interpretability analysis through SHAP values, attention mechanisms, or rule extraction methods would explain why models make specific predictions for individual patients. Clinical adoption fundamentally requires understanding model reasoning; this transparency is essential for building clinician trust and enabling appropriate clinical oversight.
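Where SHAP is unavailable or too costly, permutation importance offers a model-agnostic starting point for such analysis. The sketch below uses a toy linear predictor standing in for a trained model; the helper name and data are illustrative.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each column is shuffled: a model-agnostic
    importance measure and a lighter-weight relative of SHAP values."""
    rng = np.random.default_rng(seed)
    base = np.sqrt(np.mean((y - predict(X)) ** 2))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(Xp))
            Xp[:, j] = Xp[perm, j]        # break column j's link with y
            importances[j] += np.sqrt(np.mean((y - predict(Xp)) ** 2)) - base
    return importances / n_repeats

# Toy "model": a fixed linear predictor standing in for a trained regressor.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = 5.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=400)
predict = lambda X: 5.0 * X[:, 0] + 0.2 * X[:, 1]
importances = permutation_importance(predict, X, y)
```

Unlike SHAP, this yields only global (not per-patient) attributions, but it requires nothing beyond a prediction function and so works identically for CatBoost, Ridge, or any other fitted model.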
Multimodal data integration presents opportunities to improve both prediction accuracy and mechanistic understanding. Incorporating additional information sources—wearable device data, genetic markers, imaging features, and social determinants of health—could reveal biological mechanisms underlying blood pressure variation whilst enhancing predictive performance.
5. Conclusions
This comprehensive comparative investigation demonstrates that machine learning models encounter fundamental limitations when predicting blood pressure from routinely available clinical and demographic features. Our best-performing model achieved a root mean squared error of 14.37 mmHg and an R-squared value of 0.265 for systolic blood pressure—figures that, whilst representing genuine predictive signal, fall substantially short of clinical utility for individual patient management.
The gradient boosting implementations (CatBoost, XGBoost, LightGBM) consistently outperformed classical linear approaches, although the performance margins remained modest throughout our analyses [
40]. This convergence of diverse algorithmic architectures towards similar performance levels carries important methodological implications: the limiting factor appears to be feature informativeness rather than model expressiveness. When simple and sophisticated models achieve comparable results, additional algorithmic complexity yields diminishing returns.
A critical methodological lesson emerged from our preliminary analyses: blood pressure-derived features—including mean arterial pressure, pulse pressure, and hypertension staging—must be rigorously excluded to prevent target leakage. Studies reporting exceptionally high prediction accuracy warrant careful scrutiny for this common but often unrecognized source of artificially inflated performance metrics.
Cardiovascular disease diagnosis emerged as the strongest predictor, contributing the majority of explained variance across all evaluated models. This finding reflects the clinical reality that hypertension represents both a cause and consequence of cardiovascular disease, creating complex temporal dependencies that complicate interpretation but do not invalidate the predictive utility of this feature for prospective risk assessment.
These findings counsel realistic expectations for machine learning applications to blood pressure prediction. With only approximately one-quarter of variance explained by available pre-measurement features, there exists substantial room for improvement through incorporation of additional data sources: continuous physiological monitoring, genetic markers, detailed lifestyle information, and environmental factors. The remaining three-quarters of blood pressure variance reflects factors absent from typical epidemiological datasets.
Prior to clinical deployment, several critical steps remain essential:
External validation across diverse populations and healthcare settings to confirm generalisability beyond the training dataset.
Prospective validation studies integrating predictive models within real-world clinical workflows to assess practical utility.
Collection of richer feature sets incorporating continuous physiological monitoring, genetic markers, and detailed lifestyle data.
Systematic fairness assessment across demographic subgroups to ensure equitable performance and avoid exacerbating existing health disparities.
Regulatory clearance through appropriate pathways, demonstrating safety and effectiveness for intended clinical applications.
Implementation research investigating effective strategies for clinical integration, including user interface design and electronic health record interoperability.
Should these challenges be successfully addressed, machine learning-based blood pressure prediction could meaningfully enhance cardiovascular care through earlier detection of high-risk individuals, improved resource allocation for screening programs, and support for clinical decision-making in settings where traditional measurement is impractical or unavailable.
Perhaps the most valuable contribution of this investigation lies not in achieving superior predictive performance, but in demonstrating the paramount importance of methodological rigour in medical machine learning research. Honest reporting of modest results advances the field more effectively than spectacular findings that cannot withstand scrutiny or replicate in practice. Future research should build upon this foundation of transparent methodology whilst pursuing the richer data sources necessary for clinically meaningful blood pressure prediction.