1. Introduction
Coronary heart disease (CHD) remains one of the leading causes of death and disability worldwide, imposing a substantial clinical and socio-economic burden on healthcare systems [1,2]. In inpatient practice, patients admitted with clinical suspicion of CHD or acute coronary syndromes (ACS) are a particularly challenging category, since diagnostic and therapeutic tactics must be chosen within a short time and on the basis of a limited set of primary examination data [3,4,5]. Current recommendations emphasize the need for an early structured assessment of the risk and likelihood of coronary lesions, as well as timely determination of indications for invasive management and revascularization [6,7]. At the same time, standardization of terms and endpoints (including myocardial infarction) is essential for the correct interpretation of clinical trial results and the construction of reproducible prognostic models [8].
Despite the high diagnostic accuracy of coronary angiography and other coronary imaging modalities, their use may be limited by resource availability, organizational factors, and the clinical stability of the patient [9,10]. For this reason, the role of approaches based on routine non-invasive data available at admission (clinical characteristics, standard laboratory parameters, and electrocardiography) is increasing. These data form the basis of primary triage and are collected in almost all clinics, which makes them attractive for the development of scalable diagnostic decision-support tools. At the same time, it remains unclear how informative such an "early" set of indicators is for predicting not only the anatomical severity of coronary atherosclerosis, but also the actual clinical decision to perform percutaneous coronary intervention (PCI) during the current hospitalization.
Type 2 diabetes mellitus is one of the key cardiometabolic risk factors and substantially modifies the course of coronary heart disease, the incidence of complications, and patient prognosis [11]. Diabetes is associated with more extensive and diffuse coronary artery disease, a higher incidence of adverse events, and potentially worse revascularization outcomes compared with patients without impaired carbohydrate metabolism [12]. Accordingly, clinical guidelines (ESC/EASD) identify patients with diabetes as a high-risk group requiring systematic consideration of metabolic factors in the diagnosis and treatment of cardiovascular disease [13].
Beyond the presence of diabetes itself, increasing attention is being paid to parameters of glycemic control and metabolic instability. HbA1c has traditionally been used in clinical practice as an indicator of chronic hyperglycemia. However, HbA1c reflects average glucose exposure and does not always capture fluctuations in glycemia, which may have independent pathophysiological and prognostic significance [14]. Glycemic control parameters such as HbA1c and admission glucose are commonly used in clinical practice [15]. At the same time, the additional diagnostic value of glycemic control and glucose variability metrics specifically for early prediction of the need for PCI remains insufficiently studied, especially in the context of retrospective inpatient data from real-world practice.
To support clinical decisions, scoring systems and risk stratification scales (GRACE, HEART, TIMI) are traditionally used; they demonstrate acceptable predictive ability in a number of populations but may transfer poorly across clinics, populations, and clinical scenarios [16]. In particular, universal scales are not always optimal for predicting the "tactical" endpoint of PCI performance, since revascularization is the result of a comprehensive decision that depends on clinical manifestations, biomarker dynamics, ECG findings, angiography availability, and expert assessment by the treating team. This creates the need for models capable of integrating heterogeneous features and accounting for nonlinear interactions between them.
Machine learning (ML) methods are considered a promising tool for analyzing clinical tabular data and ECG features, making it possible to increase the discriminative ability of models by identifying complex dependencies and interactions among risk factors [17,18,19]. A number of studies have shown that ML approaches can improve the prediction of coronary pathology from clinical, laboratory, and electrocardiographic parameters compared with traditional methods [20]. At the same time, introducing ML models into the clinic requires not only high accuracy but also transparency, interpretability, and reproducibility of results. This has driven interest in explainable AI (XAI) methods, including approaches based on Shapley values, as well as compliance with modern recommendations on reporting and risk-of-bias assessment in prediction-model studies [21,22].
A retrospective single-center analysis of routine clinical, laboratory, and ECG data was performed for patients hospitalized with suspected coronary heart disease. PCI during the current hospitalization was considered a clinically significant binary endpoint and a proxy for real-world patient management tactics. This approach allows the potential usefulness of diagnostic models to be assessed at an early stage of hospitalization.
The aim of the study was to evaluate the diagnostic performance of machine learning models based on a basic set of primary examination features, and to assess the added value of glycemic control indicators, including glucose variability metrics, for improving prediction of the need for PCI. The results may be useful for developing non-invasive tools for early risk stratification and clinical decision support in patients with suspected coronary artery disease in a hospital setting.
2. Materials and Methods
2.1. Study Design and Sample Characteristics
This was a retrospective, single-center observational study based on clinical data collected at the National Hospital of the Medical Center of the Administrative Department of the President of the Republic of Kazakhstan (Almaty, Kazakhstan), aimed at evaluating the feasibility of predicting percutaneous coronary intervention (PCI) from clinical, laboratory, and electrocardiographic data.
The initial cohort included 138 patients hospitalized with suspected coronary artery disease between October 2024 and December 2025. Inclusion criteria were hospitalization with suspected coronary artery disease and availability of the clinical, laboratory, and electrocardiographic data required for analysis. Patients with missing information on the primary endpoint (PCI performed during the current hospitalization) were excluded; one patient was excluded on this basis, leaving a final analytical sample of 137 patients.
The primary endpoint was performance of PCI during the current hospitalization (PCI: 1 = performed, 0 = not performed). It should be emphasized that PCI reflects a clinical decision based on a combination of factors, including the results of coronary angiography, the physician's clinical assessment, and organizational features, and is not a direct indicator of the anatomical severity of coronary artery damage.
Additional information on coronary angiography was obtained from available clinical records. Coronary angiography was performed according to clinical indications and was not available for all patients.
Patients were categorized based on both PCI status and angiography status. All patients who underwent PCI had previously undergone coronary angiography. In the non-PCI group, patients represented a heterogeneous population and included: (1) patients who underwent coronary angiography without subsequent indication for PCI, and (2) patients who did not undergo angiography due to clinical stability, contraindications, or organizational factors.
Thus, the non-PCI group does not represent a homogeneous population without significant coronary artery disease, but rather reflects real-world clinical practice, where invasive diagnostics are applied selectively.
Baseline characteristics of the study population are presented in Table 1.
The average age of the patients was 65.1 ± 8.5 years, while the proportion of men reached 62.8%. Type 2 diabetes mellitus was detected in 51.1% of patients. The average values of HbA1c and glucose levels at admission were 7.23 ± 1.89% and 7.86 ± 3.62 mmol/L, respectively.
Among the electrocardiographic signs, ST-segment elevation was observed in 22.6% of patients, ST-segment depression in 12.4%, and a pathological Q wave in 21.2%, while T-wave inversion was recorded in the majority of patients (87.6%).
The overall study workflow, including patient selection, data preprocessing, model development, and evaluation, is illustrated in Figure 1.
2.2. Research Variables
The selection of variables for this study was carried out taking into account their clinical significance, as well as their availability at various stages of the patient’s examination. The analysis included indicators reflecting demographic characteristics, data from the initial clinical examination, laboratory parameters, glycemic status and electrocardiographic signs.
Particular attention was paid to the temporal consistency of the variables used, in particular, the differentiation of the parameters available at the admission stage and the indicators obtained during hospitalization.
All variables used in the analysis are presented and described in detail in Table 2. As shown there, the variable set covers the main aspects of the patient's clinical assessment: demographic and clinical indicators reflect baseline characteristics, while laboratory and glycemic parameters characterize metabolic state. Electrocardiographic signs were used to identify objective markers of myocardial ischemia and electrical disturbances. PCI performed during hospitalization served as the primary endpoint and reflects a clinical decision rather than a direct diagnostic measure of disease severity. This structure of variables allowed several models of varying complexity to be formed: from a basic early assessment model (SAFE) to more advanced models including clinical and laboratory (CLINICAL) and glycemic (EXTENDED) parameters.
2.3. Data Preprocessing
Data preprocessing included several sequential steps aimed at improving the quality and reproducibility of the analysis (Figure 2).
At the first stage, the completeness of the data was checked. One patient with a missing target variable value (PCI performance) was excluded from the analysis. Thus, the final analytical sample consisted of 137 patients.
Next, value plausibility was checked. Physiologically implausible or clearly erroneous values (for example, extreme HbA1c values or abnormal ECG interval values) were treated as missing and replaced with NaN. This approach avoided distortions of the results associated with data entry errors.
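As a minimal sketch, such a plausibility check can be implemented by masking out-of-range values rather than dropping patients. The ranges below are hypothetical illustrations; the study's exact clinical cut-offs are not reported.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility ranges; the study's exact cut-offs are not reported.
PLAUSIBLE_RANGES = {
    "hba1c": (3.0, 20.0),       # %
    "glucose": (2.0, 40.0),     # mmol/L
    "qt_interval": (200, 700),  # ms
}

def mask_implausible(df: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Replace out-of-range values with NaN instead of dropping rows."""
    out = df.copy()
    for col, (lo, hi) in ranges.items():
        if col in out.columns:
            out[col] = out[col].where(out[col].between(lo, hi), np.nan)
    return out

demo = pd.DataFrame({"hba1c": [7.2, 58.0], "glucose": [9.1, 0.1]})
cleaned = mask_implausible(demo, PLAUSIBLE_RANGES)
```

Masking to NaN (rather than deleting rows) keeps every patient in the sample, so the subsequent imputation step can handle these cells like any other missing value.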
Categorical variables were converted to binary format (0/1), which ensured their correct use in machine learning models. In particular, signs such as the presence of diabetes mellitus, as well as electrocardiographic changes (ST segment elevation, ST depression, pathological Q wave, inversion of the T wave), were presented as binary indicators.
The missing values were processed using median imputation for quantitative variables and imputation by the most frequent value for categorical features. It is fundamentally important that the imputation procedure was performed separately in the training subsamples as part of cross-validation, which made it possible to avoid data leakage and ensure the correctness of model evaluation.
Variables with a high proportion of missing values (>50%) were not included in the main analysis, as their use could destabilize the models and increase systematic error.
Additionally, several feature sets (SAFE, CLINICAL, and EXTENDED) were formed to account for data availability at different stages of the clinical examination and to minimize the impact of temporal ambiguity.
2.4. Criteria for Choosing Machine Learning Algorithms
The selection of machine learning algorithms was based on methodological and practical considerations related to the clinical context of the study and the characteristics of the dataset.
Logistic regression was chosen as a baseline model due to its high interpretability and its widespread use in clinical research. This approach enables transparent estimation of feature effects and facilitates the assessment of independent associations between predictors and the outcome.
Random Forest was selected as a nonlinear ensemble method capable of capturing complex interactions between variables and effectively handling heterogeneous clinical data without requiring strict distributional assumptions. In addition, it is relatively robust to noise and less prone to overfitting when applied to small- and medium-sized datasets.
Gradient boosting was additionally included as an alternative ensemble approach to evaluate whether sequential learning strategies could further improve predictive performance.
Given the relatively limited sample size, preference was given to algorithms known to perform reliably under such conditions.
The use of multiple algorithms allows comparison between interpretable linear models and more flexible nonlinear methods, providing a more comprehensive assessment of model performance and robustness.
2.5. Building Models
To address the binary classification task (predicting whether PCI is performed), several machine learning approaches were applied, including both interpretable statistical models and ensemble algorithms.
Three feature sets were defined according to data availability at different stages of patient assessment. The SAFE feature set included age, sex, type 2 diabetes status, heart rate, and ECG findings (ST elevation, ST depression, Q wave, and T-wave inversion). The CLINICAL feature set included all SAFE variables plus body mass index, creatinine, and eGFR. The EXTENDED feature set further included glycemic parameters, namely HbA1c and admission glucose.
Missing data were handled using median imputation within the machine learning pipeline. Physiologically implausible values were treated as missing. Specifically, predefined thresholds were applied for key variables (e.g., HbA1c, ECG intervals, and laboratory parameters), and values outside these ranges were excluded from analysis.
All preprocessing steps, including imputation and scaling of continuous variables, were performed within the training folds as part of the machine learning pipeline to prevent data leakage during cross-validation. Model performance was evaluated using repeated stratified 5-fold cross-validation with 10 repetitions, which allowed for more stable estimates and helped reduce the risk of overfitting in a relatively small dataset. Logistic regression, Random Forest, and Gradient Boosting models were used. Model hyperparameters were predefined based on methodological considerations and were selected to limit model complexity and reduce the risk of overfitting.
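The leakage-safe evaluation scheme described above can be sketched as follows, using a synthetic stand-in for the patient table (the real features and preprocessing details are simplified):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the 137-patient table (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(137, 10))
X[rng.random(X.shape) < 0.05] = np.nan   # sprinkle missing values
y = rng.integers(0, 2, size=137)         # binary PCI endpoint

# Imputation and scaling sit inside the pipeline, so they are re-fitted
# on each training fold; this is what prevents data leakage.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the imputer and scaler are pipeline steps, `cross_val_score` fits them only on the training portion of each of the 50 splits (5 folds x 10 repeats), never on the held-out fold.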
Logistic regression was used as a baseline model for predicting the probability of performing PCI. The model estimates the probability of the binary outcome using a logistic function applied to a linear combination of predictors:

P(y = 1 | x) = 1 / (1 + exp(-(β0 + β1x1 + … + βpxp))),

where x1, …, xp represent the predictors and β0, β1, …, βp are the model coefficients estimated by maximum likelihood. The predicted probability was used for classification with a threshold of 0.5.
The coefficients of the logistic regression were interpreted in terms of odds ratios:

ORi = exp(βi).

A positive coefficient (for example, βdiabetes > 0) indicates an increased probability of PCI in the presence of the corresponding feature. All the relationships obtained were interpreted as associative and do not imply causality.
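For illustration, odds ratios follow directly from the fitted coefficients. The coefficient values below are hypothetical, chosen so that the diabetes coefficient reproduces the OR of 7.36 reported later in Table 6:

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale; OR = exp(beta).
betas = {"diabetes": 1.996, "pathological_q_wave": 1.0, "age_per_year": -0.03}
odds_ratios = {name: float(np.exp(b)) for name, b in betas.items()}
# exp(1.996) is approximately 7.36, i.e. the OR reported for type 2 diabetes
```

A coefficient above zero maps to an OR above 1 (increased odds of PCI), while a negative coefficient maps to an OR below 1.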
2.5.1. Random Forest
The Random Forest algorithm was used as a nonlinear ensemble machine learning method in the study. This approach is based on constructing a set of decision trees and then combining their predictions.
The hyperparameters of the Random Forest model were selected a priori based on methodological considerations and prior experience with similar clinical datasets. In particular, the number of trees (n_estimators = 300) was chosen to ensure stable ensemble performance, while limiting tree depth (max_depth = 4) and increasing the minimum number of samples required for splitting and leaf nodes (min_samples_split = 10, min_samples_leaf = 5) were intended to reduce model complexity and mitigate overfitting, which is especially important given the relatively small sample size.
A formal hyperparameter optimization procedure (e.g., grid search) was not performed in order to avoid overfitting to the limited dataset and to preserve model generalizability. Each tree was trained on a random subsample of data using a random subset of features, which made it possible to reduce the correlation between the trees and increase the generalizing ability of the model.
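The a priori configuration described above translates directly into scikit-learn; the `random_state` is an added assumption for reproducibility:

```python
from sklearn.ensemble import RandomForestClassifier

# A priori hyperparameters as described in the text.
rf = RandomForestClassifier(
    n_estimators=300,      # enough trees for stable ensemble estimates
    max_depth=4,           # shallow trees limit model complexity
    min_samples_split=10,  # conservative splitting for a 137-patient sample
    min_samples_leaf=5,    # larger leaves smooth the probability estimates
    random_state=42,       # assumed, for reproducibility
)
```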
The final prediction was formed by voting across the ensemble of trees. The probability of PCI was estimated as the proportion of trees voting for the corresponding class:

p̂(x) = (1/T) Σ I(ht(x) = 1), with the sum taken over t = 1, …, T,

where T is the number of trees, ht(x) is the prediction of tree t, and I(·) is the indicator function.
Thus, the random forest model made it possible to take into account nonlinear dependencies and interactions between features, which is especially important when analyzing clinical data with a complex structure.
2.5.2. Gradient Boosting
As an alternative ensemble method, the study considered the Gradient Boosting algorithm, based on the sequential construction of weak models in order to minimize the loss function.
Unlike a random forest, where trees are trained independently, in gradient boosting, each subsequent model is built taking into account the errors of previous models. Thus, the model gradually refines the predictions, reducing the cumulative error.
The final model is represented as the sum of base learners:

FM(x) = Σ γm hm(x), with the sum taken over m = 1, …, M,

where hm(x) is the base model (usually a decision tree), γm is the weight of the corresponding model, and M is the total number of iterations (trees).

At each iteration, the new base model is fitted to the negative gradient of the loss function, which makes it possible to reduce the prediction error in a targeted manner:

rim = -∂L(yi, F(xi)) / ∂F(xi),

where L is the loss function and F(x) is the current prediction of the model.

The model is updated according to the rule:

Fm(x) = Fm−1(x) + η γm hm(x),

where η is the learning rate, which determines the contribution of each subsequent model.
This study used the implementation of gradient boosting with limited tree depth and reduced learning rate, which partially controlled the risk of overfitting. However, given the relatively small sample size, the results of this model were interpreted with caution and used primarily for comparative analysis with more stable models. All analyses were performed in Python (version 3.10, Python Software Foundation, Wilmington, DE, USA) using the following libraries: pandas (version 1.5.3), numpy (version 1.23.5), statsmodels (version 0.13.5), and scikit-learn (version 1.2.2).
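A conservative configuration of this kind might look as follows; the exact hyperparameter values used in the study are not reported, so these are illustrative assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative conservative settings (assumed values): shallow trees and a
# reduced learning rate to curb overfitting on a small dataset.
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,  # eta in the update rule F_m = F_{m-1} + eta*gamma_m*h_m
    max_depth=2,         # weak learners: very shallow trees
    random_state=42,
)
```

Lowering `learning_rate` while keeping `max_depth` small forces each iteration to make only a small correction, which is the overfitting-control strategy described above.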
2.6. Evaluating Models
The assessment of the quality of the constructed models was carried out using Repeated Stratified K-Fold Cross-Validation. This approach makes it possible to ensure the stability and reproducibility of the results, especially in conditions of a limited sample size. The initial data was divided into 5 equal subsamples (folds) while maintaining the initial distribution of the target variable (stratification). At each iteration, one subsample was used as a test sample, and the remaining four were used to train the model. The procedure was repeated 10 times with various random splits, which allowed us to reduce the variability of estimates and obtain more reliable results.
All stages of data preprocessing, including imputation of missing values and scaling of features, were performed exclusively within training subsamples as part of cross-validation. This approach eliminated information leakage and ensured a correct assessment of the generalizing ability of the models.
To quantify model quality, the following metrics were used: the area under the ROC curve (ROC-AUC), which characterizes the ability of the model to discriminate between classes; the area under the precision–recall curve (PR-AUC), which is especially informative for imbalanced data; accuracy, the proportion of correctly classified observations; and the F1 score, the harmonic mean of precision and recall.
For each metric, the mean and standard deviation were calculated across all cross-validation iterations. Additionally, models based on the different feature sets (SAFE, CLINICAL, and EXTENDED) were compared, which made it possible to evaluate the contribution of each group of variables to predicting PCI performance.
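The four metrics can be computed per fold with scikit-learn; the toy predictions below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score, f1_score)

# Toy fold predictions to show how each reported metric is computed.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.6, 0.3, 0.7, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 classification threshold

metrics = {
    "roc_auc": roc_auc_score(y_true, y_prob),            # threshold-free ranking
    "pr_auc": average_precision_score(y_true, y_prob),   # PR-AUC analogue
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
```

Note that ROC-AUC and PR-AUC are computed from the predicted probabilities, while accuracy and F1 depend on the chosen threshold.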
Thus, the chosen evaluation strategy ensured reliable and objective verification of the models, minimizing the risk of overfitting and overestimating their effectiveness.
2.7. Statistical Analysis
Statistical analysis was performed to quantify the characteristics of the sample and identify associations between clinical, laboratory, and electrocardiographic parameters and the likelihood of percutaneous coronary intervention (PCI).
Continuous variables were analyzed with regard to their distribution. Normality was assessed with the Shapiro–Wilk test. Normally distributed data are presented as mean and standard deviation (mean ± SD); otherwise, as median and interquartile range (median [IQR]).
Categorical variables are presented as absolute and relative frequencies:

f = (n / N) × 100%,

where n is the number of observations in the category and N is the total sample size.
To compare the two independent groups (PCI vs. non-PCI), Student's t-test was used for normally distributed data and the Mann–Whitney U test for non-normally distributed data.
The χ² test was used for categorical variables:

χ² = Σ (Oi − Ei)² / Ei,

where Oi is the observed frequency and Ei is the expected frequency for category i. For small expected counts, Fisher's exact test was additionally applied.
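Both tests are available in SciPy. The 2×2 table below is an approximation back-calculated from the reported diabetes prevalences in the PCI and non-PCI groups (an assumption, used here only for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Approximate 2x2 table of PCI status by type 2 diabetes, reconstructed from
# the reported percentages (assumed counts, for illustration only).
#                 diabetes  no diabetes
table = np.array([[36, 11],    # PCI group
                  [34, 56]])   # non-PCI group

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)  # preferred when expected counts are small
```

`chi2_contingency` also returns the expected counts, which is how one checks whether Fisher's exact test is needed in the first place.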
Statistical significance was assessed using two-sided tests (Table 3). A p-value < 0.05 was considered statistically significant.
The study was reported in accordance with the TRIPOD guidelines.
3. Results
3.1. Comparison Between PCI and Non-PCI Groups
Patients who underwent PCI differed significantly from those who did not in several clinical, metabolic, and electrocardiographic parameters.
The PCI group was younger (62.6 ± 8.7 vs. 66.2 ± 8.2 years, p = 0.024) and had a significantly higher prevalence of type 2 diabetes mellitus (76.6% vs. 37.8%, p < 0.001). In addition, both HbA1c levels (7.82 ± 1.90 vs. 6.72 ± 1.74%, p = 0.006) and admission glucose levels (9.17 ± 4.63 vs. 7.08 ± 2.58 mmol/L, p < 0.001) were significantly higher in patients undergoing PCI.
Coronary angiography was performed in a subset of patients as part of routine clinical evaluation. All patients who underwent PCI had previously undergone coronary angiography, as this procedure is required prior to revascularization. In the non-PCI group, only a proportion of patients underwent angiographic assessment, while the remaining patients received conservative treatment without invasive diagnostics. This reflects real-world clinical practice, in which the decision to perform coronary angiography depends on a combination of factors, including clinical presentation, risk stratification, and the treating physician’s judgment.
The inclusion of patients without angiographic confirmation in the non-PCI group may introduce heterogeneity, as some of these patients could have had significant coronary artery disease that remained undetected due to the absence of invasive evaluation.
Electrocardiographic abnormalities were also more frequent in the PCI group, including ST-segment elevation (39.1% vs. 14.6%, p = 0.003), ST-segment depression (21.3% vs. 7.9%, p = 0.048), and pathological Q waves (39.1% vs. 12.4%, p < 0.001).
No statistically significant differences were observed in BMI, renal function parameters (creatinine and eGFR), or T-wave inversion.
These findings suggest that both metabolic factors and ECG abnormalities are strongly associated with the likelihood of undergoing PCI. The comparison of clinical characteristics between the PCI and non-PCI groups is presented in Table 4.
3.2. Machine Learning Model Results
The results of evaluating the machine learning models on the different feature sets (SAFE, CLINICAL, and EXTENDED) are presented in Table 5.
Values are presented as mean ± standard deviation based on repeated stratified 5-fold cross-validation. The models built on the basis of a SAFE set of features (available at the initial examination stage) demonstrated stable predictive ability. The logistic regression showed a ROC-AUC of 0.734 ± 0.092, while the random forest model showed 0.724 ± 0.092.
The expansion of the model to include standard laboratory parameters (the CLINICAL model) led to a slight improvement in the results. The best performance in this group was demonstrated by a random forest (ROC-AUC = 0.739 ± 0.080), while the logistic regression showed a ROC-AUC of 0.713 ± 0.109.
The highest values were achieved with the EXTENDED model, which includes glycemic parameters. The random forest model showed the maximum ROC-AUC (0.755 ± 0.079), as well as higher accuracy (0.718 ± 0.065) and F1 score (0.571 ± 0.100).
Thus, the addition of glycemic parameters provided a moderate improvement in the predictive ability of the model.
3.3. Interpretation of the Model and Significance of the Features
The results of the multivariable logistic regression analysis are presented in Table 6.
The strongest statistically significant association with PCI was observed for the presence of type 2 diabetes mellitus (OR = 7.36; 95% CI: 2.79–19.40; p < 0.001). This finding highlights the important role of metabolic factors in clinical decision making regarding revascularization (Figure 3).
Other variables, including age, sex, heart rate, electrocardiographic parameters, and renal function indicators, did not demonstrate statistically significant associations with PCI (p > 0.05). In particular, the presence of a pathological Q wave showed a positive but non-significant association (OR = 2.72; p = 0.118), indicating a possible trend that requires further investigation in larger cohorts.
It should be emphasized that the identified relationships are associative and should not be interpreted as causal.
To assess potential multicollinearity among predictors, the variance inflation factor (VIF) was calculated for all variables included in the EXTENDED model (Table 7).
All variables demonstrated low VIF values (range: 1.07–2.02), indicating the absence of substantial multicollinearity. In particular, the glycemic variables (type 2 diabetes, HbA1c, and admission glucose), although clinically related, did not exhibit a level of interdependence that could significantly affect the stability of regression estimates.
These findings suggest that the differences observed between the logistic regression results and the Random Forest feature importance are unlikely to be explained by multicollinearity. Rather, they may reflect inherent differences between modeling approaches, including the sensitivity of logistic regression to variable representation and the ability of ensemble methods to capture nonlinear relationships and interactions between predictors.
Analysis of feature importance in the Random Forest model demonstrated that the greatest contribution to PCI prediction was provided by metabolic parameters, particularly HbA1c and admission glucose levels. Additionally, renal function (eGFR), the presence of type 2 diabetes, and heart rate showed notable importance within the model.
These results suggest that metabolic and systemic factors play a substantial role in the integrated prediction of PCI when using machine learning approaches (Figure 4).
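Impurity-based importances of the kind discussed above can be extracted as follows; the data are simulated, with the outcome driven mainly by the glycemic features by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Simulated data; feature names mirror the study's variables for illustration.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(137, 4)),
                 columns=["hba1c", "glucose", "egfr", "heart_rate"])
# Outcome driven mainly by the glycemic columns, by construction.
y = (X["hba1c"] + 0.5 * X["glucose"] + rng.normal(0, 1, 137) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, max_depth=4, random_state=42)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
```

Impurity-based importances sum to 1 and should be read as relative contributions within the model, not as effect sizes comparable to odds ratios.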
Figure 4 presents the results of the multivariable logistic regression analysis, showing the odds ratios (OR) and 95% confidence intervals for each predictor associated with the likelihood of undergoing PCI.
Type 2 diabetes mellitus demonstrated the strongest association with PCI, indicating a substantially increased probability of intervention. Electrocardiographic parameters, including Q wave and ST-segment abnormalities, also contributed to the model, although with wider confidence intervals.
From the perspective of model interpretability, this analysis provides a transparent assessment of the direction and magnitude of the effect of each variable. Such representation aligns with explainable artificial intelligence (XAI) principles, as it enables direct clinical interpretation of the model outputs.
3.4. The Relationship of Glycemic Status with PCI
Analysis of PCI frequency as a function of HbA1c level revealed a clear trend toward more frequent intervention as glycemic control worsened. Patients with elevated HbA1c values underwent PCI significantly more often than patients with normal values (Figure 5).
Additionally, the presence of type 2 diabetes mellitus was also associated with a higher frequency of PCI: the proportion of interventions performed was higher in patients with diabetes than in those without (Figure 6). This is consistent with a significant role of metabolic disorders in coronary pathology.
3.5. Evaluation of Machine Learning Models
The comparative analysis of machine learning models demonstrated that all evaluated approaches showed moderate discriminative ability in predicting PCI.
Models based on the SAFE feature set, which includes only variables available at the time of admission, also demonstrated stable performance. In particular, logistic regression achieved a ROC-AUC of 0.734 ± 0.092, indicating that even early non-invasive data can provide clinically relevant predictive information.
The analysis of mean ROC curves obtained using repeated stratified cross-validation (Figure 7) demonstrated moderate and consistent discriminative performance across the selected models, without clear dominance of any single approach.
Precision–recall analysis (Figure 8), which is particularly informative in the presence of class imbalance, showed comparable performance across the selected models. The Random Forest model demonstrated slightly higher PR-AUC values within the CLINICAL feature set, while logistic regression exhibited stable performance across feature configurations.
Overall, these findings indicate that while more complex models and extended feature sets provide incremental improvements, simpler models based on early available data may still offer clinically useful predictive performance.
Calibration analysis was additionally performed for the best-performing EXTENDED Random Forest model. The calibration plot (Figure 9) and calibration table (Table 8) demonstrated generally acceptable agreement between predicted probabilities and observed event rates, with minor deviations in the lower probability range.
The Brier score was 0.192, indicating a moderate level of overall prediction accuracy. The Hosmer–Lemeshow test showed no statistically significant lack of fit (χ2 = 7.49, p = 0.058), suggesting acceptable calibration of the model. Overall, the results indicate that the predicted probabilities are broadly consistent with the observed event rates.
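Both calibration metrics can be computed from a vector of predicted probabilities and observed outcomes. The sketch below uses illustrative synthetic data (not the study's predictions) and a common decile-based variant of the Hosmer–Lemeshow statistic with df = g − 2; the study's grouping scheme may differ.

```python
# Sketch: Brier score and a decile-based Hosmer–Lemeshow goodness-of-fit test
# (illustrative synthetic data; not the study's predicted probabilities).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=400)  # predicted probabilities
y = rng.binomial(1, p)                 # outcomes drawn from p (well calibrated)

# Brier score: mean squared error between predicted probability and outcome.
brier = np.mean((p - y) ** 2)

# Hosmer–Lemeshow: group by deciles of predicted risk, then compare observed
# event counts with expected counts in each group.
deciles = np.quantile(p, np.linspace(0, 1, 11))
groups = np.digitize(p, deciles[1:-1])  # group index 0..9
hl = 0.0
for g in range(10):
    mask = groups == g
    n, obs, exp = mask.sum(), y[mask].sum(), p[mask].sum()
    hl += (obs - exp) ** 2 / (exp * (1 - exp / n))
p_value = chi2.sf(hl, df=10 - 2)  # df = groups - 2 (common convention)

print(f"Brier = {brier:.3f}, HL chi2 = {hl:.2f}, p = {p_value:.3f}")
```

Because the synthetic outcomes are drawn from the predicted probabilities themselves, the test should usually not detect lack of fit here.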
4. Discussion
The present study demonstrated the feasibility of predicting PCI performance from noninvasive clinical, laboratory, and electrocardiographic data. The machine learning models showed moderate predictive ability, with the best results obtained using the expanded feature set that includes glycemic parameters.
One of the key results is that even the basic model (SAFE), based solely on data available at the patient admission stage, demonstrates stable discriminative ability (ROC-AUC of about 0.73). This indicates the potential informative value of standard clinical and ECG indicators for early risk stratification and support for clinical decision making in a time-limited environment.
An interesting finding of this study is that the Gradient Boosting model consistently demonstrated lower performance compared to both Random Forest and logistic regression across all feature sets. This finding is somewhat unexpected, given the theoretical advantages of boosting algorithms.
Several factors may explain this result. First, the relatively small sample size and limited number of events may reduce the effectiveness of sequential learning approaches, which typically require larger datasets to achieve stable performance. Second, the use of conservative hyperparameters (e.g., reduced learning rate and limited model complexity) to mitigate overfitting may have constrained the model’s ability to capture informative patterns. Third, class imbalance may have affected the optimization process, as boosting algorithms can be sensitive to the distribution of the target variable.
In contrast, Random Forest, as a bagging-based method, is generally more robust to small sample sizes and noise, while logistic regression benefits from its simplicity and lower variance. These factors may explain the relatively better performance of these models in the present study.
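The contrast described above can be made concrete. The sketch below shows the kind of conservative boosting configuration discussed (reduced learning rate, limited complexity) next to a bagging-based counterpart; the hyperparameter values are illustrative, not the study's actual settings.

```python
# Sketch: a deliberately conservative gradient-boosting configuration versus a
# bagging-based Random Forest (illustrative hyperparameter values only).
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

gb = GradientBoostingClassifier(
    learning_rate=0.05,  # reduced learning rate slows sequential fitting
    n_estimators=200,
    max_depth=2,         # shallow trees limit model complexity
    subsample=0.8,       # stochastic boosting as additional regularization
    random_state=0,
)

# Bagging averages many decorrelated trees trained in parallel, which tends to
# reduce variance and be more robust to noise in small samples.
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # feature subsampling decorrelates the trees
    random_state=0,
)
```

With few events, such conservative boosting settings trade bias for variance and may simply stop short of the patterns a variance-averaging forest still captures.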
The addition of laboratory and metabolic parameters (CLINICAL and EXTENDED models) was accompanied by a moderate improvement in predictive performance. In particular, the inclusion of glycemic parameters, including HbA1c, improved the effectiveness of the models, which is consistent with literature data on the significant role of metabolic disorders in the development and progression of coronary heart disease.
The results of the multivariable analysis showed that the presence of type 2 diabetes mellitus was the factor most strongly associated with PCI (OR = 7.36; p < 0.001). This underscores the clinical significance of diabetes as a factor influencing not only the course of the disease, but also the choice of therapeutic tactics.
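For context, an odds ratio in multivariable logistic regression is the exponential of the corresponding model coefficient, so the reported OR for diabetes corresponds to a coefficient of about 2.0 on the log-odds scale:

```python
# Relation between a logistic-regression coefficient (beta) and an odds ratio:
# OR = exp(beta), hence beta = ln(OR). Using the reported OR = 7.36:
import math

beta = math.log(7.36)
print(f"beta = {beta:.3f}, OR = {math.exp(beta):.2f}")
```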
The findings of the present study are consistent with previously published data demonstrating the significant role of metabolic disorders, particularly type 2 diabetes mellitus, in the progression of coronary artery disease and the need for revascularization [23]. Prior studies have shown that patients with diabetes tend to have more diffuse and complex coronary lesions, which increases the likelihood of percutaneous coronary intervention.
In addition, the observed associations between electrocardiographic abnormalities (such as ST-segment changes and pathological Q waves) and PCI are in line with established clinical criteria used in the diagnosis and management of acute coronary syndromes [24]. These ECG markers are routinely incorporated into clinical decision-making algorithms and current cardiology guidelines, where they serve as key indicators for urgent invasive strategies.
Importantly, the endpoint used in this study—PCI performed during the current hospitalization—reflects a clinical management decision rather than a direct measure of disease severity. This decision is influenced not only by anatomical findings, but also by a combination of clinical presentation, physician judgment, and organizational factors. While this may limit the use of PCI as a surrogate for obstructive coronary artery disease, it also provides an opportunity to model real-world clinical decision-making processes. From this perspective, the proposed approach should be interpreted not as a tool for diagnosing coronary artery disease or predicting hard clinical outcomes, but rather as a decision-support framework aimed at estimating the likelihood of revascularization based on early available data. Such models may be particularly useful in the initial stages of hospitalization, where rapid triage and prioritization are required, and where access to advanced diagnostics may be limited or delayed.
A comparison of the algorithms showed that ensemble methods, in particular Random Forest, offered greater stability and a better ability to capture nonlinear relationships between features than logistic regression. At the same time, the differences between the models remained moderate, which may be due to the limited sample size.
In general, the results obtained should be considered preliminary and hypothesis-generating. Despite the moderate accuracy of the models, the study demonstrates the promise of machine learning methods for integrating heterogeneous clinical data and developing decision-support tools.
Further research should aim to validate the proposed models on larger, multicenter datasets that include more diverse patient populations, which would improve the reliability and generalizability of the results. It would also be important to explore the incorporation of additional data sources, such as longitudinal clinical observations or continuous ECG monitoring, which may provide further improvement in predictive performance. Prospective studies are needed to better understand how such models can be integrated into routine clinical practice and whether they can meaningfully support clinical decision making.
From a clinical standpoint, the potential utility of the proposed approach may vary depending on the healthcare setting. In developed countries, these models could complement existing diagnostic pathways by supporting early risk stratification and helping to optimize workflow in high-demand clinical environments. In developing countries, where access to invasive procedures and advanced imaging may be limited, the use of readily available non-invasive data in combination with machine learning may offer a practical tool for preliminary assessment and prioritization of patients.
5. Limitations
This study has several important limitations. First, it was conducted as a retrospective single-center study with a relatively small sample size, which limits the generalizability of the findings. In particular, the number of PCI events (n = 47) is limited. In multivariable modeling, the number of events per predictor variable (EPV) in the EXTENDED model falls below commonly recommended thresholds, which may lead to instability of coefficient estimates and an increased risk of overfitting. The relatively small sample size also limits the ability to capture more complex patterns in the data. Therefore, the findings should be interpreted with caution and considered exploratory.
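The EPV concern can be made concrete: with 47 events and the commonly cited rule of thumb of at least roughly 10 events per candidate predictor, only fairly small models clear the threshold. The predictor counts below are illustrative; the exact size of the EXTENDED feature set is not restated here.

```python
# Sketch: events-per-variable (EPV) check against the ~10 rule of thumb.
# 47 PCI events are from the study; the predictor counts are hypothetical.
events = 47
for n_predictors in (4, 8, 12):  # illustrative model sizes
    epv = events / n_predictors
    verdict = "ok" if epv >= 10 else "below the ~10 threshold"
    print(f"{n_predictors} predictors -> EPV = {epv:.1f} ({verdict})")
```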
Second, the primary endpoint (PCI performance) reflects a clinical decision rather than a direct measure of the anatomical severity of coronary artery disease. Since PCI is typically performed following coronary angiography, patients who did not undergo angiographic assessment could not receive PCI regardless of the true severity of their condition. As a result, the non-PCI group may include a heterogeneous population comprising both patients without significant coronary lesions and those who were not evaluated invasively. This may introduce endpoint-related selection bias and affect both the estimated associations and predictive performance of the models.
Third, information on coronary angiography was not available in a structured format for all patients and therefore was not included as a variable in the main analysis. Although angiographic procedures were documented in individual clinical records, systematic extraction of this information was beyond the scope of the present study. Future studies should incorporate angiography status as a structured variable and perform sensitivity analyses restricted to patients who underwent coronary angiography.
Fourth, although internal model validation was performed using repeated cross-validation and calibration was assessed using calibration plots, calibration tables, Brier score, and the Hosmer–Lemeshow test, no external validation was conducted. This limits the assessment of model generalizability and stability in independent cohorts.
Fifth, some variables were characterized by a substantial proportion of missing values, which could affect model stability despite the use of imputation procedures.
Sixth, model hyperparameters were predefined rather than optimized using systematic search procedures. Although this approach was chosen to reduce the risk of overfitting in a relatively small dataset, it may have limited the ability to identify optimal model configurations.
Finally, a formal risk of bias assessment using the PROBAST framework was not performed, which represents an additional limitation of the study.
Overall, the proposed approach should be considered preliminary and hypothesis-generating. The results should be interpreted as reflecting real-world clinical decision making rather than purely anatomical disease severity. Further studies should include larger and multicenter cohorts, as well as external validation and incorporation of angiographic data.
6. Conclusions
This study demonstrates the feasibility of predicting the likelihood of a clinical decision to perform percutaneous coronary intervention (PCI) based on noninvasive clinical, laboratory, and electrocardiographic data available at early stages of hospitalization. The best-performing model, based on an EXTENDED set of features and a Random Forest algorithm, achieved moderate discriminative performance. In addition to discrimination metrics, calibration analysis indicated an overall acceptable agreement between predicted probabilities and observed event rates.
Importantly, the proposed models are intended to support the assessment of clinical decision-making processes rather than to directly predict the presence or severity of coronary artery disease. The findings also suggest a potential role of metabolic factors, particularly type 2 diabetes and glycemic parameters, in shaping the likelihood of revascularization decisions.
Given the retrospective single-center design, relatively small sample size, absence of external validation, and remaining methodological limitations, the results should be interpreted as preliminary and hypothesis-generating.
Further research using larger, independent cohorts, with external validation and comprehensive calibration assessment, is required before any potential clinical implementation.