Leveraging Machine Learning Techniques to Forecast Chronic Total Occlusion before Coronary Angiography

Background: Chronic total occlusion (CTO) remains the most challenging procedure in coronary artery disease (CAD) for interventional cardiology. Although some clinical risk factors for CAD have been identified, no personalized prognostic test is available to confidently identify patients at high or low risk for CTO CAD. This investigation aimed to apply a machine learning algorithm to routinely collected clinical features to develop a precision medicine tool that predicts CTO before coronary angiography (CAG). Methods: Data from 1473 CAD patients were obtained, including 1105 in the training cohort and 368 in the testing cohort. The baseline clinical characteristics were collected. Univariate and multivariate logistic regression analyses were conducted to identify independent risk factors that impact the diagnosis of CTO. A CTO prediction model was established and validated based on the independent predictors using a machine learning algorithm. The area under the curve (AUC) was used to evaluate the model. Results: The CTO prediction model was developed with the training cohort using the machine learning algorithm. Eight variables were confirmed as ‘important’: gender (male), neutrophil percentage (NE%), hematocrit (HCT), total cholesterol (TC), high-density lipoprotein cholesterol (HDL), ejection fraction (EF), troponin I (TnI), and N-terminal pro-B-type natriuretic peptide (NT-proBNP). The model achieved good concordance indices of 0.724 and 0.719 in the training and testing cohorts, respectively. Conclusions: An easy-to-use tool to predict CTO in patients with CAD was developed and validated. More research with larger cohorts is warranted to improve the prediction model, which can support clinicians' decisions on the early identification of CTO in CAD patients.


Introduction
Chronic total occlusion (CTO) is defined as complete luminal obstruction without antegrade flow for an estimated duration of at least 3 months, according to historical clinical data [1]. It is diagnosed in 16% to 20% of coronary artery disease (CAD) patients undergoing coronary angiography (CAG) [2]. Despite the high prevalence of CTO, CTO recanalization has historically made up a small portion of the total volume of percutaneous coronary intervention (PCI). This is probably because of a combination of factors, including low success rates, a high incidence of complications, lengthy procedures, high costs, and a perceived lack of clinical benefit relative to those costs [3]. Therefore, early recognition of CTO through easily measured risk factors offers the greatest potential for early intervention.
In order to improve risk stratification in patients, efforts have been made to develop prognostic predictive tools or risk scores to identify CTO before CAG [4]. Although some previous studies have analyzed potential predictors related to the high incidence rate of CAD and established a relevant model for CAD in patients before CAG, no research has attempted to conduct a rigorous analysis of the CTO data contained within clinical features as a means to produce more personalized CTO risk assessments [5].
In this study, a preoperative CTO risk model based on clinical demographics, echocardiography, and laboratory indexes was constructed using a machine learning algorithm. This addresses an unmet need and an emerging opportunity, as computational analysis has shown promise in providing clinically valuable predictions of CTO risk and beyond.

Study Cohort
In total, 1105 patients with chest pain and suspected CAD who underwent first-time CAG between January 2019 and December 2019 at Beijing Anzhen Hospital, Capital Medical University, were consecutively enrolled and formed the training cohort for this study. Patients with previous CAG or coronary artery bypass grafting (CABG) were excluded. Between January 2020 and December 2020, an independent cohort of 368 consecutive patients was enrolled prospectively with the same inclusion and exclusion criteria; they formed the testing cohort for this study. All patients were assigned to the CTO group (n = 317) or the non-CTO group (n = 1156) based on the procedural outcome. CTO was defined as an occlusion with thrombolysis in myocardial infarction (TIMI) grade 0 flow lasting longer than 3 months, with the duration estimated from the first occurrence of angina pectoris, a previous angiogram, or a previous infarction. Non-CTO was defined as a stenosis of at least 50% of the luminal diameter in at least one branch of the main coronary artery. CTO was diagnosed independently by two experienced cardiac interventionists (Yongxin Wu and Yanci Liu) and independently checked by another experienced cardiac interventionist (Ze Zheng). Any disagreements between the cardiac interventionists were resolved by discussion with a fourth investigator (Jinghua Liu). The study protocol was approved by the Human Research Ethics Committee of Beijing Anzhen Hospital, Capital Medical University, Beijing, China, and the study adhered to all the principles of the Declaration of Helsinki. The Ethics Committee Approval Code is 2022177X. Given the retrospective study design, informed consent was waived by the Human Research Ethics Committee of Beijing Anzhen Hospital, Capital Medical University, Beijing, China.

Data Collection
Baseline clinical demographics, laboratory indexes, and angiography characteristics were obtained for each enrolled patient. Age, sex, and medical history information were acquired using questionnaires. Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in meters. Current tobacco use was defined as the ongoing use of more than one cigarette per day for 6 months. Alcohol consumption was defined as consuming more than 40 g of ethanol per day for men or 20 g of ethanol per day for women. Blood pressure in the seated position was measured using an aneroid sphygmomanometer after resting for a minimum of 5 min. The average of three measurements was recorded as the final measurement. Hypertension was defined as systolic blood pressure (SBP) ≥ 140 mmHg, diastolic blood pressure (DBP) ≥ 90 mmHg, or self-reported use of any antihypertensive medication. Diabetes was defined as a fasting plasma glucose level of over 126 mg/dL, a non-fasting plasma glucose level of over 200 mg/dL, or self-reported use of any antidiabetic medication. Blood samples were taken in a fasting state. White blood cell (WBC), red blood cell (RBC), hemoglobin (Hb), platelet (PLT), hematocrit (HCT), neutrophil (NE), neutrophil percentage (NE%), lymphocyte (LYM), monocyte (MONO), glucose (Glu), HbA1C, triglyceride (TG), total cholesterol (TC), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), free fatty acids (FFA), non-HDL, sdLDL, homocysteine (Hcy), uric acid (UA), creatinine (Cr), brain natriuretic peptide (BNP), NT-proBNP, high-sensitivity C-reactive protein (hs-CRP), creatine kinase-MB (CK-MB), and troponin I (TnI) were measured from blood samples by standardized laboratory methods. The estimated glomerular filtration rate (eGFR) was calculated using the CKD-EPI study equation.
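The categorical definitions above can be written down as simple rules. The following Python sketch is only an illustration of those thresholds; the `Patient` record and its field names are hypothetical and are not taken from the study database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # Hypothetical record layout, used only for this illustration.
    weight_kg: float
    height_m: float
    sbp_mmhg: float
    dbp_mmhg: float
    on_antihypertensive: bool = False
    on_antidiabetic: bool = False
    fasting_glucose_mg_dl: Optional[float] = None
    nonfasting_glucose_mg_dl: Optional[float] = None


def bmi(p: Patient) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return p.weight_kg / p.height_m ** 2


def has_hypertension(p: Patient) -> bool:
    """SBP >= 140 mmHg, DBP >= 90 mmHg, or self-reported antihypertensive use."""
    return p.sbp_mmhg >= 140 or p.dbp_mmhg >= 90 or p.on_antihypertensive


def has_diabetes(p: Patient) -> bool:
    """Fasting glucose > 126 mg/dL, non-fasting > 200 mg/dL, or antidiabetic use."""
    fasting_high = (p.fasting_glucose_mg_dl is not None
                    and p.fasting_glucose_mg_dl > 126)
    nonfasting_high = (p.nonfasting_glucose_mg_dl is not None
                       and p.nonfasting_glucose_mg_dl > 200)
    return fasting_high or nonfasting_high or p.on_antidiabetic


# Example with made-up values: DBP alone is enough to meet the hypertension rule.
example = Patient(weight_kg=70, height_m=1.75, sbp_mmhg=138, dbp_mmhg=92,
                  fasting_glucose_mg_dl=110)
print(round(bmi(example), 1), has_hypertension(example), has_diabetes(example))
```

Missing glucose values are treated as "criterion not met" here, which is an assumption of this sketch rather than a rule stated in the study.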
Angiography characteristics included the number of vessel lesions and the location of the CTO lesion, including the left main coronary artery (LM), left anterior descending branch (LAD), left circumflex artery (LCX), right coronary artery (RCA), and diagonal branch (D1). All angiography characteristics were assessed by two experienced interventional cardiologists.

Statistical Analysis
Continuous variables are expressed as mean ± standard deviation, and differences between groups were examined using Student's t-test. Categorical variables are expressed as frequencies and percentages, and differences between groups were analyzed using the chi-square test or Fisher's exact test.
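As a worked illustration of the categorical comparison, the sketch below computes the Pearson chi-square statistic for a 2 × 2 table in plain Python. The counts are invented for the example and are not study data, and the calculation omits the continuity correction; the study itself used SPSS, whose exact options are not stated here.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Table layout:
                     group 1   group 2
        exposed         a         b
        not exposed     c         d
    Returns (chi-square statistic, two-sided p-value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = chi_square_2x2(30, 10, 10, 30)
print(f"chi-square = {chi2:.2f}, p = {p:.2g}")
```

When any expected cell count is small, Fisher's exact test is the usual alternative, as noted in the text above.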
Univariate logistic regression analysis was performed to identify factors related to CTO prediction. Multivariate logistic regression analysis was then performed on the variables selected by the univariate analysis, with the p-value thresholds for both inclusion and exclusion set at 0.05. SPSS version 23.0 (SPSS Inc., Chicago, IL, USA) was used to perform the statistical analyses in this study.

Machine Learning Model Construction and External Testing
The R packages eXtreme Gradient Boosting (XGBoost), XGBoostExplainer, and Machine Learning in R were used to create the predictive models. The training cohort was formed by patients who met the inclusion and exclusion criteria between January 2019 and December 2019 at Beijing Anzhen Hospital, Capital Medical University. The testing cohort was formed by patients who met the same inclusion and exclusion criteria between January 2020 and December 2020.
The Gini impurity was used to measure the importance of variables, which provided the contribution of the predictive variables to the model [6]. The relative importance score was used to demonstrate the significance of the predictive variables [7]. A receiver operating characteristic curve (ROC) was used to assess the predictive model [8]. Machine learning model construction and external testing were performed using the R software version 4.2.0 (R Foundation for Statistical Computing, Vienna, Austria) [9].
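For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen CTO patient receives a higher predicted score than a randomly chosen non-CTO patient. The Python sketch below computes it directly from that definition; the labels and scores are toy values, and the study's own analysis was carried out in R rather than with this code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: 1 = CTO, 0 = non-CTO; scores are model outputs.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of patients, which is acceptable for cohorts of this size; rank-based formulas give the same value more efficiently.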

Baseline Demographics of the Study
The baseline clinical demographics of the study population are shown in Table 1. There were no differences in age, alcohol consumption, BMI, DBP, hypertension, diabetes mellitus, dyslipidemia, prior stroke, or history of proton pump inhibitor (PPI) use between the CTO group and the non-CTO group. Compared to the non-CTO group, the CTO group had a higher proportion of men and of current tobacco users, a larger left ventricular end-diastolic diameter (LVEDD), and more frequent use of clopidogrel, ticagrelor, β-blockers, nitrates, and diuretics. SBP, ejection fraction (EF), and the use of angiotensin-converting enzyme inhibitors (ACEI)/angiotensin receptor blockers (ARB), statins, and calcium channel blockers (CCB) were lower in the CTO group than in the non-CTO group.

Baseline Laboratory Indexes of the Study
Compared to non-CTO controls, the CTO group had higher WBC, NE%, MONO, TG, non-HDL, Hcy, UA, Cr, BNP, NT-proBNP, CK-MB, and TnI, but lower PLT and HDL. There were no differences in RBC, Hb, HCT, LYM, Glu, HbA1C, TC, LDL, FFA, sdLDL, eGFR, and hs-CRP. The detailed data and units for all variables are shown in Table 2.
Table 3 shows the baseline clinical angiography characteristics of the study. The CTO group had a higher rate of multivessel lesions than the non-CTO group. More than half of the patients in the CTO group had 3-vessel lesions. In contrast, nearly half of the patients in the non-CTO group had 1-vessel lesions. CTO lesions occurred in the RCA, LAD, LCX, and other vessels.
Table 4 summarizes the clinical characteristics of the training cohort (n = 1105) and the testing cohort (n = 368). Clinical demographics, including sex, age, current tobacco use, alcohol consumption, BMI, SBP, and DBP, were not significantly different between the two cohorts (all p > 0.05). Additionally, the incidences of hypertension, diabetes mellitus, dyslipidemia, and previous stroke did not differ significantly between the two cohorts (all p > 0.05). Similarly, there were no statistically significant differences in echocardiography and laboratory indexes (all p > 0.05).

Baseline Clinical Characteristics of the Training Cohort between the CTO and Non-CTO Groups
The baseline clinical characteristics of the training cohort are shown in Table 2. Compared to non-CTO controls, patients with CTO were more likely to be male and smokers. Surprisingly, they had lower SBP and DBP, but the prevalence of hypertension was comparable. They more frequently had a decreased EF and an increased LVEDD. Regarding baseline laboratory indexes, the CTO group demonstrated significantly higher WBC, NE, HbA1C, TG, HDL, Hcy, Cr, BNP, NT-proBNP, and Mb, but lower PLT, LYM%, HDL, and TnI. There were no differences in age, alcohol consumption, BMI, diabetes mellitus, dyslipidemia, prior stroke, RBC, Hb, LYM, MONO, Glu, TC, LDL, FFA, sdLDL, UA, eGFR, and hs-CRP between the two groups.

Multivariate Logistic Regression Analysis
Multivariate logistic regression analysis was used to determine the significant independent predictors of CTO. Nine predictors selected by the univariate logistic regression analysis were entered into the multivariate logistic regression analysis to determine the independent factors that predict CTO before CAG. Eight variables were included in the final model, as shown in Table 6.

Machine Learning Algorithm Model
After tuning, the final parameters of the XGBoost model were max_depth = 4, subsample = 0.63, and colsample_bytree = 0.51. Based on the Gini impurity, the important predictors in the CTO prediction model are summarized in Figure 1. Three predictors in the model, male sex, HDL, and TC, had relative importance scores of 0.84, −0.71, and 0.29, respectively.

Performance of the Machine Learning Model in the Training and Testing Cohorts
To further assess the performance of the model, the ROC curves and the corresponding AUCs were evaluated (Figure 2). As shown in Figure 2A, the machine learning model achieved an AUC of 0.724 with an accuracy of 0.800 (95% CI, 0.755-0.8398), sensitivity of 80.57%, and specificity of 66.67% in the training cohort. A total of 368 patients were included in the testing cohort between January 2020 and December 2020. In the testing cohort, the AUC was 0.719 with an accuracy of 0.786 (95% CI, 0.761-0.810) (Figure 2B). There was no statistically significant difference in AUC between the training cohort and the testing cohort (p > 0.05).
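The accuracy, sensitivity, and specificity reported above all derive from a confusion matrix. The sketch below shows the arithmetic in Python with made-up counts. The Wilson score interval used for the confidence interval is an assumption chosen for illustration, since the method behind the reported 95% CIs is not stated.

```python
import math
from typing import Dict, Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, not the study's confusion matrix.
tp, fp, tn, fn = 80, 40, 60, 20
metrics = classification_metrics(tp, fp, tn, fn)
low, high = wilson_interval(tp + tn, tp + fp + tn + fn)
print(metrics, f"accuracy 95% CI: {low:.3f}-{high:.3f}")
```

Because the interval width shrinks roughly with the square root of the sample size, a larger cohort yields a narrower interval for the same accuracy.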

The Machine Learning Model Score System for Clinical Utility
Based on the weight coefficient score of each predictor in the model (Figure 3A), we implemented the machine learning model scoring system as a web-based calculator (http://www.empowerstats.net/pmodel/?m=29484_CTOBeforeCAG, accessed on 4 October 2022). As an illustrative example (Figure 3B), for a male patient with NE% of 80%, HCT of 35%, TC of 5.6 mmol/L, HDL of 0.8 mmol/L, EF of 35%, TnI of 0 µg/L, and NT-proBNP of 1300 pg/mL, the predicted probability of CTO is approximately 70.5%.

Discussion
In this study, we developed a model based on machine learning algorithms to predict CTO using only basic patient information such as clinical demographics, echocardiography, and laboratory indexes. We found eight easily captured variables, including gender (male), NE%, HCT, TC, HDL, EF, TnI, and NT-proBNP, with relatively accurate predictive ability. To the best of our knowledge, this is the first attempt to develop a predictive model for diagnosing CTO based on machine learning.
Although the occurrence of CAD has been related to various traditional risk factors, as well as male sex and lipid metabolism, their role in the etiology of CTO remains less well understood [10]. To date, most research has centered on stable CAD or acute coronary syndrome; there has been no information on risk factors for CTO patients [11]. Additionally, reliable tools for predicting CTO in a timely manner before CAG are lacking. By converting the total score into a continuum of individual scores using a machine learning algorithm, we developed a relatively accurate model to predict the probability of CTO before CAG. Since recanalization of a totally occluded vessel requires considerable time and expense, it is necessary to construct a model that can be implemented easily by cardiology residents and staff alike [12].
The advent of machine learning methodologies makes it possible to exploit the large amount of information available in databases and encourages the development and testing of better predictive models [13]. Machine learning methods use computational algorithms to identify patterns in large datasets with multiple variables and can be used to construct predictive models [14]. Machine learning has demonstrated the potential to improve diagnostic accuracy and prognostic outcomes over conventional statistical methods [15]. Machine learning algorithms also have a number of other advantages over traditional statistical modelling. They take into account all potential interactions and do not rely on pre-defined hypotheses, making them less likely to ignore unexpected predictive variables [16]. Predictive models using machine learning algorithms may thus facilitate the recognition of clinically important risk in patients with multiple marginal risk factors that would otherwise not raise clinical concerns [17]. Furthermore, machine learning algorithms can easily integrate new clinical data to continually update and optimize algorithms with minimal oversight [18].
The primary machine learning model of our study was the XGBoost gradient boosted tree model. XGBoost builds an ensemble of trees sequentially, with each weak decision tree base learner fitted to reduce the loss left by the trees before it [19]. It can learn quickly and efficiently from large amounts of data, and its great flexibility makes it possible to learn even from missing data [20]. The XGBoost model had a much higher predictive accuracy than the generalized linear model, being able to capture complex associations in the data without requiring explicit high-order interactions and non-linear functions [21]. Using these features, predictive models could be developed from clinical demographics, echocardiography, and laboratory indexes, which are readily accessible and reproducible at admission.
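The sequential idea behind gradient boosting can be illustrated with a deliberately simplified Python sketch: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of it to the ensemble. This is not XGBoost itself, which additionally uses second-order gradients, regularization, and row and column subsampling; the single feature, squared-error loss, and toy data are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]        # (threshold, value if x <= threshold, value otherwise)
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Fit a one-split regression tree (weak learner) to the residuals."""
    mean = sum(residuals) / len(residuals)
    best_err = float("inf")
    best: Stump = (float("inf"), mean, mean)  # fallback when no valid split exists
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    return best


def boost(xs: Sequence[float], ys: Sequence[float],
          n_rounds: int = 50, learning_rate: float = 0.3) -> Model:
    """Build the ensemble one stump at a time on the remaining residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model: Model, x: float) -> float:
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


# Toy data: the outcome switches from 0 to 1 between x = 2 and x = 3.
model = boost([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
print(round(predict(model, 1.0), 3), round(predict(model, 4.0), 3))
```

The learning rate controls how much each new tree is allowed to correct, so smaller values need more rounds but tend to overfit less.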
Male sex and dyslipidemia have previously been linked to the incidence of CTO in patients [22]. Consistent with previous research, sex and lipid variables were significant predictors of CTO in our machine learning model. For HDL in particular, it has been proposed that a reduction in HDL could alter the stability of HDL particles, which would disrupt reverse cholesterol transport and, at least partially, enhance the progression of CAD to CTO [23].
Besides risk factors for CAD, our study revealed that EF and NT-proBNP were independently positively related to CTO prediction. Some CTO lesions develop from an undetected acute myocardial infarction [24]. Acute myocardial ischemia and its early pathological changes often damage left ventricular systolic function, and progress to ischemic cardiomyopathy after 3 months [25]. In contrast, TnI was independently negatively associated with the prediction of CTO. In particular, TnI is a biomarker of acute myocardial ischemia [26]. Given this, the increase in TnI may imply a total occlusion lesion of less than 3 months. Taken together, EF, NT-proBNP, and TnI may predict the probability of CTO in patients with CAD.
In addition to non-modifiable variables (male sex) and traditional risk factors for CAD (TC, HDL, EF, NT-proBNP, and TnI) of the individual patient, routine blood indexes, including HCT and NE%, were also predictors in the machine learning model. HCT is an indicator that reflects the extent of hemoconcentration [27]. It appears plausible that a higher HCT with increased blood viscosity predicts a higher risk of first incident acute myocardial ischemia than a lower HCT [28]. In contrast, a lower HCT may point to CTO rather than acute total occlusion. As for NE%, CTO has long been considered a low-grade, subclinical, systemic, chronic inflammatory disease [29]. Inflammation is at the forefront in initiating and developing the entire course of CTO [30]. CTO develops from the total luminal obstruction of an artery by a thrombus, with subsequent organization and varying degrees of recanalization; often these events are clinically silent [31]. The process of thrombus organization coincides with the development of intraluminal microvessels accompanied by inflammatory cells, infiltrating smooth muscle cells, and the deposition of the proteoglycan matrix [32]. In CTOs of all ages, there was a close relationship, both in location and intensity, between cellular inflammation and vascular wall neovascularization [33]. NE% is an indicator of inflammatory state. Therefore, it may predict the probability of CTO in CAD patients.
On the other hand, testing is also a significant aspect of predictive model construction, since the performance of regression models is usually much higher in the training cohort than in the testing cohort. In our study, the model was validated in a testing cohort and the results were relatively stable, which reflects the generalizability of the model to a certain extent.
Taken together, non-invasive tests have been increasingly used for risk stratification and to facilitate clinical decision making in patients with suspected diseases [34]. Subsequently, machine learning models have been developed for a variety of diseases and have been shown to be more accurate in predicting prognosis than traditional staging systems [35]. Machine learning models may improve the detection rate of CTO. However, CAG is the gold standard for diagnosing CTO. CAG, including CCTA, is invasive, difficult to apply widely, and expensive for patients. Moreover, neither CAG nor CCTA is recommended for routine screening of populations with suspected chest pain according to medical guidelines. This model may provide physicians and primary care doctors with an effective and applicable method, based on data from routine physical examinations, to screen patients with suspected chest pain in the general population and improve the detection rate of CTO without redundant examinations. In addition, the combined use of machine learning and patients' electronic health records could provide clinicians with a more convenient and quicker method to identify CTO in busy clinical work. Furthermore, we not only built a predictive model but also sought the risk factors for CTO. As mentioned above, although the occurrence of CAD has been related to various traditional risk factors, the risk factors in the etiology of CTO remain less well understood. To date, most research has centered on stable CAD or acute coronary syndrome; there has been no information on risk factors for CTO patients. Using the machine learning model, we sought the risk factors for CTO and found that eight easily captured variables, including gender (male), NE%, HCT, TC, HDL, EF, TnI, and NT-proBNP, might be risk factors that distinguish CTO from CAD. With knowledge of some important details of the classifiers, we might achieve a much more effective result.
For example, when we used the machine learning model, the model had already selected the most important features according to their information gain values before splitting, and the ranking of the selected features may provide us with additional information. By focusing on and controlling those high-risk predictors, we would expect a more favorable course throughout the individual's entire treatment. Delaying the progression of CAD to CTO would be a tremendous relief for individuals, clinicians, and healthcare systems. It is worth noting that the results of the present study represent only a first step for CTO and must be validated in a larger cohort in the future. An advantage of machine learning over conventional statistical methods is that the model can incorporate new data as others use it, so its predictive accuracy for CTO may improve with wider use. In addition, other factors such as genetic predisposition, best medical therapy, and sports and activity might also play a role in CTO progression. Larger cohort studies that validate more factors are expected.
The parameters used were necessary, quantitative, and easily available and reproducible at admission. To our knowledge, this is the first study to investigate the predictability of CTO in patients with CAD, and the model performed well. These results support the notion that a machine learning model could improve the understanding of risk factors for CTO and offer an easy-to-use, automated assistive tool to predict the diagnosis of CTO before CAG.

Limitations
There are several limitations to our study. First, we only included patients from the Chinese population, so the trained models should be further tested in broader groups of patients to confirm that their performance generalizes. Second, because of the retrospective design, randomized controlled trials are needed to fully examine the model's efficacy. Moreover, the results of the present study represent only a first step for CTO and must be validated in a larger cohort with more variables, such as genetic predisposition, best medical therapy, and sports and activity. Finally, there is still room to optimize this predictive model using other advanced statistical methods, such as the least absolute shrinkage and selection operator (LASSO), which may help develop a more accurate prognostic prediction model.

Conclusions
In summary, we constructed and validated a relatively accurate machine learning model based on clinical demographics, echocardiography, and laboratory indexes to help predict the probability of CTO early in patients with CAD. More research with larger cohorts that include more patients and more variables, such as genetic predisposition, best medical therapy, and sports and activity, is warranted to improve the prediction model, which can support clinicians' decisions on the early identification of CTO in CAD patients.