LASSO Regression-Based Diagnosis of Acute ST-Segment Elevation Myocardial Infarction (STEMI) on Electrocardiogram (ECG)

Electrocardiogram (ECG) is an important tool for the detection of acute ST-segment elevation myocardial infarction (STEMI). However, machine learning (ML) for the diagnosis of STEMI complicated with arrhythmia and infarct-related arteries is still underdeveloped based on real-world data. Therefore, we aimed to develop an ML model using the Least Absolute Shrinkage and Selection Operator (LASSO) to automatically diagnose acute STEMI based on ECG features. A total of 318 patients with STEMI and 502 control subjects were enrolled from Jan 2017 to Jun 2019. Coronary angiography was performed. A total of 180 automatic ECG features of 12-lead ECG were input into the model. The LASSO regression model was trained and validated by the internal training dataset and tested by the internal and external testing datasets. A comparative test was performed between the LASSO regression model and different levels of doctors. To identify the STEMI and non-STEMI, the LASSO model retained 14 variables with AUCs of 0.94 and 0.93 in the internal and external testing datasets, respectively. The performance of LASSO regression was similar to that of experienced cardiologists (AUC: 0.92) but superior (p < 0.05) to internal medicine residents, medical interns, and emergency physicians. Furthermore, in terms of identifying left anterior descending (LAD) or non-LAD, LASSO regression achieved AUCs of 0.92 and 0.98 in the internal and external testing datasets, respectively. This LASSO regression model can achieve high accuracy in diagnosing STEMI and LAD vessel disease, thus providing an assisting diagnostic tool based on ECG, which may improve the early diagnosis of STEMI.


Introduction
ST-segment elevation myocardial infarction (STEMI) is the leading cause of heart failure and death [1]. Early diagnosis of STEMI can effectively shorten the revascularization time, which helps doctors adopt precise treatment strategies, thereby reducing the incidence of heart failure and mortality [2]. Coronary angiography (CAG) is the gold standard for diagnosing STEMI, but it is invasive, time-consuming, and expensive. Electrocardiography (ECG) is a noninvasive and effective screening tool to detect STEMI in patients with chest pain [3]. However, faced with a large number of ECGs, the diagnosis of STEMI has become a great challenge for clinical physicians [4,5].
Although ST elevation on the ECG most often indicates myocardial ischemia, many nonischemic conditions can also produce ST elevation, such as bundle branch block, ventricular hypertrophy, ventricular preexcitation, premature ventricular beats, and pacemaker rhythm [6]. These changes can mask STEMI-related ST-segment elevation and cause true STEMI to be missed. In addition, conditions that reduce ECG amplitude, such as pulmonary disease, effusion, or anasarca, can lead to missed diagnoses of STEMI [7]. Moreover, the diagnostic accuracy of ECG interpretation varies with the experience of the doctor, especially in primary and community hospitals. Therefore, the rapid and accurate diagnosis of STEMI based on ECG remains an urgent unresolved issue.
With the rapid growth of machine learning technologies, several automatic ECG diagnosis algorithms have achieved positive results in detecting STEMI [8]. Machine learning algorithms for ECG analysis have addressed noise reduction, feature extraction, detection of arrhythmia, and detection of left ventricular hypertrophy [9-12]. For instance, an artificial intelligence (AI) network can analyze STEMI ECGs through signal transformation and analysis, as well as automated ECG feature extraction [13,14]. However, these models have important limitations, as most of them were trained and validated using data from the MIT-BIH database (PhysioNet) and the PTB database (PhysioBank) [14,15]. Moreover, some studies excluded arrhythmias that may affect QRS morphology and ST-segment changes. Recently, a machine learning model was built on real-world ECG data to detect acute coronary syndrome (ACS), but its accuracy was not confirmed by comparison with CAG [16]. For these reasons, few machine learning models can effectively detect STEMI in the presence of arrhythmias and identify infarct-related arteries in myocardial infarction.
In this study, we established a real-world ECG database confirmed by gold-standard CAG. We then built and trained a LASSO regression model to diagnose STEMI and determine the location of infarct-related arteries, and compared the diagnostic performance of the model with that of doctors.

Study Design
We enrolled patients who underwent CAG at the Third Affiliated Hospital of Sun Yat-sen University (Cohort 1) and the Guangzhou First People's Hospital (Cohort 2) from Jan 2017 to Jun 2019. The inclusion criteria were as follows: older than 18 years; no prior history of myocardial infarction, percutaneous coronary intervention (PCI), or coronary artery bypass graft (CABG); and CAG performed for any reason. The exclusion criteria were as follows: excessive ECG noise, multivessel disease, no CAG performed within the first 24 h after the onset of symptoms (such as angina, chest pain, backache, shoulder pain, and stomach ache), and incomplete baseline data. This study was approved by the Human Ethics Boards of the Third Affiliated Hospital of Sun Yat-sen University and Guangzhou First People's Hospital.
We designed two stages to classify STEMI and the location of the infarct-related arteries. The first stage (Model 1) was to establish a model to distinguish between control and STEMI patients. The second stage (Model 2) was to establish a model to identify the control, LAD, LCX, and RCA.
For model development, we randomly allocated the data into training, validation, and testing datasets at a ratio of 3:1:1. The performance of the model was evaluated on the internal and external testing datasets. The flow chart for collecting ECGs and constructing the LASSO regression model is shown in Figure 1.
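The 3:1:1 allocation can be illustrated with a short sketch. This is not the authors' code: the function name `split_3_1_1`, the fixed seed, and the use of integer study IDs are assumptions made for illustration, and the split is unstratified because the text does not state whether stratification was used.

```python
import random

def split_3_1_1(ids, seed=0):
    """Randomly allocate items to training, validation, and testing sets (3:1:1)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle (seed is an assumption)
    n = len(shuffled)
    n_train = (3 * n) // 5                  # three fifths for training
    n_valid = n // 5                        # one fifth for validation
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # the remainder is used for testing
    return train, valid, test

# Hypothetical integer IDs standing in for the 734 Cohort 1 subjects
train, valid, test = split_3_1_1(range(734))
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the allocation reproducible, which matters when the same testing set is later shown to human readers for comparison.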

Study Setting and Data Collection
Data were collected using a standard protocol covering demographics, complications, laboratory tests, 12-lead resting ECG reports, and 180 ECG features, along with the corresponding CAG reports. Twelve-lead ECG data were collected prior to thrombolysis or PCI and then stored for analysis. A cardiologist committee was composed of two board-certified practicing cardiac electrophysiologists and one board-certified practicing cardiologist. All ECG data were reviewed by two committee members, who made the main diagnosis of STEMI and arrhythmia. The third committee member reassessed the ECG data when the first two members disagreed. Cases without a majority opinion after committee review were excluded. Acute STEMI was diagnosed according to clinical manifestations, ECG changes, and myocardial enzyme changes based on the Fourth Universal Definition of Myocardial Infarction [5]. The definitions of ST elevation at the J point are based on the American College of Cardiology/American Heart Association and European Society of Cardiology STEMI guidelines. Ischemic ECG changes were defined according to the Fourth Universal Definition of Myocardial Infarction consensus statement: (1) ST elevation in V2-V3 ≥ 2 mm in men ≥ 40 years, ≥ 2.5 mm in men < 40 years, or ≥ 1.5 mm in women, or ST elevation ≥ 1 mm in other leads; (2) ST depression ≥ 0.5 mm; or (3) T-wave inversion ≥ 1 mm in leads with a prominent R wave or R/S ratio ≥ 1. For cases with left bundle branch block, the Smith-modified Sgarbossa criteria were used to define STEMI [17]. The location of infarct-related arteries was confirmed by CAG.
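The sex-, age-, and lead-specific ST-elevation cut-points in criterion (1) can be written as a small lookup. The sketch below is illustrative only: the function names and the string encoding of sex and lead are assumptions, it checks a single lead, and it does not cover criteria (2) and (3) or the Smith-modified Sgarbossa criteria.

```python
def st_elevation_threshold_mm(lead, sex, age):
    """Return the J-point ST-elevation cut-point (mm) for one lead, per criterion (1)."""
    if lead in ("V2", "V3"):
        if sex == "female":
            return 1.5                       # women, any age
        return 2.0 if age >= 40 else 2.5     # men >= 40 years vs. men < 40 years
    return 1.0                               # all other leads

def meets_st_elevation(lead, st_mm, sex, age):
    """True if the measured ST elevation reaches the cut-point for this lead."""
    return st_mm >= st_elevation_threshold_mm(lead, sex, age)

# Hypothetical measurement: 2.0 mm in V2 for a 45-year-old man and a 35-year-old man
print(meets_st_elevation("V2", 2.0, "male", 45), meets_st_elevation("V2", 2.0, "male", 35))
```

The same 2.0 mm reading is therefore interpreted differently depending on age and sex, which is one reason fixed amplitude features alone do not reproduce the guideline definition.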

ECG Data
All subjects underwent a resting surface ECG performed by a physician, with the subjects lying in the supine position (paper speed: 25 mm/s, calibration: 1 mV = 10 mm, ECGNET Vision 3.0, SanRui Electronic Technology, Guangdong, China). All ECG data were digital, standard, 10-s, 12-lead ECGs. The ECG data of each patient were marked with the study ID. Poor-quality ECG data were excluded by two independent doctors according to the flow chart. The sampling rate of the ECG was 1000 Hz. Raw ECG data were stored in The Third Affiliated Hospital of Sun Yat-sen University Clinic cloud database. A total of 180 ECG features were automatically obtained by an ECG management system (ECGNET Vision 3.0, SanRui Electronic Technology, Guangdong, China). The interpretation of ECG features is shown in Figure 2 and Supplementary Table S1. The two major components of the features are the distance between waves and the amplitude of each wave.

LASSO Regression
To avoid overfitting and to simplify the model, LASSO regression was used to automatically screen ECG features by shrinking the coefficient estimates toward zero (Figure 1C). The tuning parameter (λ) of the LASSO model was selected via the minimum criteria. The receiver operating characteristic (ROC) curve was plotted, and the area under the curve (AUC) was calculated. A coefficient profile plot was produced against the log(λ) sequence, with a vertical line drawn at the selected optimal value of λ. Dotted vertical lines were drawn at the optimal values given by the minimum criteria and by 1 standard error of the minimum criteria (the 1-SE criteria). Finally, the variables remaining after multivariable analysis were regarded as potential risk factors and included in the final model. Model performance was evaluated by the ROC curve, AUC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The accuracy was obtained at the best cutoff point of the ROC curve, defined by the maximum Youden index.
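How an L1 penalty shrinks coefficients and removes features can be seen in a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent. This is not the authors' implementation (the analysis was performed in R, and the fitting routine is not stated); the function names, step size, iteration count, penalty values, and synthetic data are all assumptions, and features are assumed to be on comparable scales.

```python
import math
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent (illustrative)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(beta, xi))
            z = max(-30.0, min(30.0, z))               # guard against overflow in exp
            resid = 1.0 / (1.0 + math.exp(-z)) - yi    # predicted probability minus label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        intercept -= step * grad0                      # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta

# Synthetic data (not study data): one informative feature and two pure-noise features
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    X.append([rng.gauss(2.0 * label - 1.0, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)])
    y.append(label)

_, beta = fit_lasso_logistic(X, y, lam=0.1)
_, beta_heavy = fit_lasso_logistic(X, y, lam=10.0, n_iter=50)  # very large penalty
print(beta, beta_heavy)
```

The soft-thresholding step is what distinguishes LASSO from ridge-type shrinkage: coefficients whose gradient signal is weaker than the penalty are set to exactly zero, so feature selection and model fitting happen in one procedure.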

Comparative Test
The internal testing data of ECG images were stored in JPG format and tested in a comparative test. The diagnosis was performed blindly and independently. Four levels of doctors were included: experienced cardiologists, emergency physicians, internal medicine residents, and medical interns. Each level contained four doctors. Experienced cardiologists referred to those who had been engaged in the cardiovascular field for more than five years. Emergency doctors referred to those who had worked in the emergency department for more than two years. Internal medicine residents were those with medical licenses but who had not majored in cardiology. Medical interns had completed the theoretical study of cardiology and electrocardiography.

Statistical Analysis
Continuous data are shown as the mean ± standard deviation, and categorical data are displayed as absolute numbers and percentages. Independent two-sample t-tests were used to compare normally distributed variables between two groups. Pearson's chi-square test was used to analyze categorical data. Values of p < 0.05 were considered significant. Statistical analysis was performed in R (version 3.6.2: R Core Team, Vienna, Austria).
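As a worked illustration of Pearson's chi-square test on a 2 × 2 table, the sketch below compares the proportion of STEMI between the two cohorts using the enrolment counts reported in this paper. It is a sketch rather than the authors' analysis: the function name is an assumption, and no continuity correction is applied (R's `chisq.test` applies Yates' correction to 2 × 2 tables by default, so its statistic would differ slightly).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test without continuity correction for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Rows: Cohort 1 and Cohort 2; columns: STEMI and control (counts from the enrolment summary)
stat, p_value = chi_square_2x2(259, 475, 59, 27)
print(round(stat, 2), p_value < 0.05)
```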

Clinical Characteristics
In total, 820 subjects were enrolled in this study: 475 control individuals and 259 STEMI patients from the Third Affiliated Hospital of Sun Yat-sen University (Cohort 1, Figure 1A), and 59 STEMI patients and 27 control individuals from the Guangzhou First People's Hospital for external validation (Cohort 2, Figure 1B). Baseline characteristics are shown in Tables 1 and 2 and Supplementary Table S2. Control subjects were younger than STEMI patients (55.1 ± 12.4 vs. 60.6 ± 12.8 years). There was no significant difference in age between Cohort 1 (57.0 ± 13.0 years) and Cohort 2 (59.2 ± 10.6 years). Male patients accounted for 44.6% and 18.2% of the control and STEMI groups and 35.1% and 27.9% of Cohort 1 and Cohort 2, respectively. The prevalence rates of diabetes, chronic kidney disease, and family history did not differ significantly between the STEMI and control groups. Although the proportion of STEMI differed between Cohort 1 and Cohort 2, the proportion of each infarct-related artery was balanced between the two cohorts. The interpretation of variables is shown in Supplementary Table S1.

Model Performance
The details of the 180 raw ECG features in the control and STEMI groups are shown in Supplementary Table S2. ECG features compared between control and STEMI, between Cohort 1 and Cohort 2, among locations of infarct-related arteries, and among the different datasets are shown in Supplementary Tables S3-S6. The significant ECG features were subjected to LASSO regression analysis to construct the diagnosis model. To avoid overfitting and keep the model simple, we selected the model with fewer variables whose error was within one standard error of the minimum error; the confidence interval of the overall model error was also narrower. The diagnosis model performed best when 14 features were included for STEMI screening (Table 3, Figure 3A) and 4 features were included for LAD location (Table 3, Figure 3C). The calculation of the regression coefficients is visualized in Figure 3B for the detection of STEMI and in Figure 3D for the location of LAD vessel disease. In short, most of these ECG features were closely related to the amplitude of the J point, ST-segment, and Q wave. The model performance is shown in Table 3. After training and model optimization, the AUCs for STEMI were 0.94 (95% CI: 0.90-0.98) in the internal testing dataset and 0.93 (95% CI: 0.88-0.98) in the external testing dataset. The accuracies were 0.85 and 0.84 in the internal and external testing datasets, respectively. In model 2, we established LASSO regression and logistic regression models to distinguish LAD from RCA/LCX. After optimizing the model, the AUCs were 0.92 (95% CI: 0.83-0.99) and 0.98 (95% CI: 0.95-1) in the internal and external testing datasets, respectively. The accuracies for distinguishing LAD from RCA/LCX in the internal and external testing datasets were 0.84 and 0.95, respectively.
The sensitivities in the internal and external testing datasets were 0.88 and 0.97, respectively. The specificities in the internal and external testing datasets were 0.79 and 0.94, respectively. Furthermore, we identified STEMI in patients with bundle branch block, ventricular hypertrophy, preexcitation, pacing, and arrhythmias. The proportion of STEMI cases combined with these abnormal ECG phenomena was 10% (82/821). The performance of our model in these cases is shown in Supplementary Table S7.
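The selection rule used in this section, choosing the sparser model whose error lies within one standard error of the minimum, can be sketched as follows. The function name and the cross-validation numbers are hypothetical and serve only to show the mechanics; they are not results from this study.

```python
def select_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validated error estimates."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_se[best]
    # The 1-SE choice is the largest penalty (sparsest model) whose mean error
    # is still within one standard error of the minimum.
    candidates = [lam for lam, m in zip(lambdas, cv_mean) if m <= limit]
    return lambdas[best], max(candidates)

# Hypothetical cross-validation results (illustrative numbers only)
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_mean = [0.62, 0.58, 0.57, 0.60, 0.75]
cv_se = [0.04, 0.04, 0.04, 0.04, 0.05]
lam_min, lam_1se = select_lambda(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Because a larger penalty retains fewer features, the 1-SE choice trades a small, statistically indistinguishable loss in fit for a simpler model.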

The Interpretation of ECG Features
The nomogram to estimate STEMI and LAD vessel disease was built using the training dataset and validated on the internal and external datasets using the LASSO model (Table 4, Figure 4). The final diagnosis model was well calibrated. Calibration curves were drawn for the detection of STEMI and the localization of LAD vessel disease for visual comparison (Figure 5). Compared with that of the control group, the amplitude of the ST-segment was significantly different at different distances from the J point. In the first stage, the 14 retained features contributed to the diagnostic model of STEMI (Figure 4A). These features represent the Q waves of Lead V1 and Lead V6, the beginning of the ST-segment of Lead II, Lead III, Lead AVL, and Lead V5, the ST-segment of Lead III, the beginning of the P wave of V4, the amplitude of the R wave in Lead II, the R-Q interval, and the QT interval. These features focused on the ST change, the Q wave, the amplitude of the R wave, and the QT interval. The leads were mainly concentrated in the inferior and left chest leads.
In the second stage, V1 (Q), V2 (Q), V2 (TB), and V3 (ST40) contributed to the diagnostic model of LAD (Table 4, Figure 4B). These features represented the Q wave in Lead V1 and Lead V2, the beginning of the ST-segment of V2 and the ST-segment of V3. These ECG features focused on the change in the ST-segment. The lead position was located in the anterior septal leads.
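The calibration curves mentioned in this section compare predicted probabilities with observed event rates. A minimal sketch of the underlying table is given below; the function name, the equal-width binning, and the example probabilities are assumptions for illustration and do not reproduce the authors' calibration procedure.

```python
def calibration_table(probs, labels, n_bins=5):
    """Return (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, label))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(label for _, label in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Hypothetical predicted probabilities and true labels (1 = STEMI)
probs = [0.05, 0.15, 0.25, 0.45, 0.55, 0.65, 0.85, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
rows = calibration_table(probs, labels)
for mean_p, rate, count in rows:
    print(round(mean_p, 2), rate, count)
```

A well-calibrated model yields rows in which the mean predicted probability and the observed rate are close, which corresponds to points lying near the diagonal of a calibration plot.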

Comparative Test
In model 1, experienced cardiologists, emergency physicians, internal medicine residents, and medical interns achieved AUCs of 0.92 (0.90-0.95), 0.86 (0.82-0.89), 0.83 (0.80-0.86), and 0.76 (0.72-0.80), respectively, suggesting that more experienced doctors had higher accuracy in diagnosing STEMI. Our model surpassed all levels of doctors. Among the doctors, cardiologists achieved the highest sensitivity (0.85), specificity (0.86), PPV (0.76), and NPV (0.91) (Table 5). In identifying the infarct-related arteries in model 2, the trend in performance across the different levels of doctors was similar to that in model 1. The LASSO model achieved sensitivities of 0.85 and 0.88 in model 1 and model 2, respectively, which could compensate for the lower sensitivity of diagnosis by doctors (Table 5).
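The measures used in this comparison (AUC, sensitivity, specificity, PPV, NPV, and the cutoff at the maximum Youden index described in the methods) can be computed as in the sketch below. The function names and the example scores are hypothetical and are not taken from Table 5.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case scores higher than a negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5              # ties count as half
    return wins / (len(pos) * len(neg))

def safe_ratio(num, den):
    """Avoid division by zero when a cutoff yields no predicted positives or negatives."""
    return num / den if den else float("nan")

def confusion_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV, NPV, and accuracy when score >= cutoff is positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l == 1)
    tn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l == 0)
    return {
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
        "accuracy": safe_ratio(tp + tn, len(labels)),
    }

def best_youden_cutoff(scores, labels):
    """Cutoff maximizing J = sensitivity + specificity - 1 (first maximum on ties)."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        m = confusion_metrics(scores, labels, cut)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Hypothetical model outputs and true labels (1 = STEMI)
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, youden = best_youden_cutoff(scores, labels)
metrics = confusion_metrics(scores, labels, cutoff)
print(auc(scores, labels), cutoff, youden, metrics)
```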

Discussion
In this study, we reported a machine learning algorithm based on 12-lead ECG to detect STEMI, which showed high sensitivity and specificity in distinguishing STEMI, with an AUC of 0.94. In addition, we demonstrated that the LASSO model improved the diagnostic accuracy of detecting LAD lesions, with a low false positive rate and a high NPV.
The first finding of this study was that the LASSO method was able to shrink the regression coefficients and cut 180 candidate ECG features down to 14 potential predictors in model 1 and 4 potential predictors in model 2. This approach is preferable to the traditional method of choosing ECG indices according to the strength of their univariable association with the outcome.
The innovation of data science, especially machine learning and AI, has brought revolutionary changes to ECG diagnosis, breaking through previous diagnostic concepts [18]. In previous work, ECG signal acquisition, filtering, and processing were handled by artificial neural network (ANN), support vector machine (SVM), AdaBoost, and naive Bayes classifiers, with accuracies reaching 99.7% [19]. These algorithms extracted the signal from the original ECG tracing and detected the peak of the QRS waveform using a peak-detection algorithm. However, identifying ST-segment and T wave changes is much more complex than identifying QRS waveforms. To avoid overfitting, random forest can be utilized in practical ECG applications, especially wearable and implanted medical devices, for wave detection and arrhythmia classification [20,21]. Many neural networks use a convolution process to mimic how the visual cortex processes images. Unlike many other machine learning methods, deep learning models not only associate input features with outputs of interest but also learn the features from the original data [18]. Recently, a new model, STA-CRNN, has been reported to recognize most arrhythmias, reaching an average F1 score of 0.835. Visualization showed that the features learned by STA-CRNN are consistent with clinical judgment [22].
AI technology is becoming smarter and more accurate in detecting arrhythmia, but it remains limited in the diagnosis of acute myocardial infarction. Zhao et al. proposed a ResNet-based model to differentiate STEMI ECGs from control ECGs, with an AUC of 0.99, which was similar to that of cardiologists [8]. However, these models cannot identify the infarct-related arteries of STEMI.
The second advantage of this study is that we used real-world ECG data, which were further confirmed by CAG in both the control and STEMI groups. Most previous AI algorithms were based on the MIT-BIH database (PhysioNet) [19] or the PTB database (PhysioBank) [23], both of which have small sample sizes. For instance, the PTB Diagnostic ECG Database consists of 549 records from 290 subjects, including 148 cases of myocardial infarction and 52 healthy controls, whereas the MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings.
Unlike previous databases, our datasets included abnormal ECG phenomena that affect ST-segment changes, such as complete left bundle branch block, complete right bundle branch block, ventricular pre-excitation, premature ventricular beats, and ventricular tachycardia. In our study, these abnormal ECG phenomena accounted for 9.5% (70/734) of Cohort 1 and 30% (26/86) of Cohort 2, with high proportions of premature ventricular beats, complete right bundle branch block, and left ventricular hypertrophy. Nestelberger et al. found that AMI occurred in approximately 30% of patients with complete left bundle branch block. Using the modified Sgarbossa criteria combined with 0/1 h or 0/2 h hs-cTnT could increase the diagnostic rate to above 90% [24]. Although previous studies have suggested that a new complete left bundle branch block should be cautiously extrapolated to AMI, it is still necessary to identify STEMI in patients with left bundle branch block accompanied by chest pain [25]. Ventricular pre-excitation may manifest as false myocardial infarction with abnormal Q waves and ST-segment elevation, or may produce other changes that mask true myocardial infarction, leading to clinical misdiagnosis and missed diagnosis [26,27]. Patients with left ventricular hypertrophy have a higher incidence of myocardial infarction and stroke [28]. In survivors of myocardial infarction, left ventricular hypertrophy suggests more severe structural and functional damage to the heart [29]. In this study, our algorithm still achieved good accuracy in a dataset containing several kinds of abnormal ECG phenomena.
Compared with deep learning, the ECG features screened by LASSO regression were more interpretable. V1 (Q) and V6 (Q) suggested a pathological Q wave, AVL (TB) and II (TB) suggested J-point elevation, and III (ST80) suggested ST-segment elevation. These abnormal indices compose the diagnostic model of STEMI. Pathological Q waves and ST-segment elevation are important indicators of STEMI [30]. Another new finding of this study was that prolongation of the QT interval and a decrease in the R wave peak were important markers of ECG changes in STEMI. Interestingly, we also noticed that V1 (Q), V2 (Q), V2 (TB), and V3 (ST40) contributed to the diagnosis of LAD disease, consistent with the LAD supplying the anterior ventricular septum, the left ventricular anterior wall, and the right ventricular anterior wall.
There were some limitations of this study. First, our LASSO model can only discriminate between LAD and RCA/LCx as the infarct-related artery. Because occlusion of either the LCx or the RCA is the major cause of acute inferior myocardial infarction (AIMI), it is difficult to determine from the 12-lead ECG whether the infarct was caused by RCA or LCx occlusion. Several ECG criteria have been proposed to address this problem, and we will explore a new ML model with knowledge fusion. Second, the sample size of this retrospective study was small, especially for the external testing dataset. Third, patients with multivessel lesions were excluded from this study, although such patients account for 40-50% or more of patients with myocardial infarction [31]. The ECG pattern of STEMI with multivessel disease is variable and atypical, and the ECG changes depend on the infarction area and the contribution of each vessel. Our model only explores the differential diagnosis of infarct-related arteries in patients with single-vessel disease. Identifying the infarct-related arteries in patients with multivessel coronary artery disease is still a major challenge for clinical physicians. We will explore the diagnostic efficacy of ECG in real-world patients with multivessel disease using the LASSO method in further studies. Moreover, further research is needed to clarify the location of the lesion (proximal versus distal) and the size of the infarct-related arteries. In real-world data, the incidence of STEMI combined with abnormal ECG phenomena, such as bundle branch block (left and right) or arrhythmias (such as AF and VT), is low. Because real-world data were used in our study, the proportion of STEMI combined with these abnormal ECG phenomena was 10% (82/821), and the AUC in this subgroup was 0.879 (0.797-0.961). To verify the accuracy and robustness of the algorithm, we plan to conduct a prospective study.
In the future, we will embed this model into the application system so that clinicians can directly import ECG data and output results.
In this study, we constructed a machine learning model that provided good performance for detecting STEMI based on 12-lead ECG features, which were autoextracted from a real-world database. This machine learning model performed exceptionally with high diagnostic accuracy similar to that of experienced cardiologists, especially in the location of LAD vessel disease.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11185408/s1, Table S1: The interpretation of ECG features. Table S2: Baseline characteristics between patients with or without STEMI. Table S3: ECG features between control and STEMI.

Institutional Review Board Statement:
The studies involving human participants were reviewed and approved by the Human Ethics Boards of The Third Affiliated Hospital of Sun Yat-sen University and Guangzhou First People's Hospital.
Informed Consent Statement: Not applicable. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

Data Availability Statement:
The custom code for the diagnosis and discrimination of STEMI by LASSO regression is available.