The Prediction Model of Medical Expenditure Applying Machine Learning Algorithm in CABG Patients

Most patients face expensive healthcare management after coronary artery bypass grafting (CABG) surgery, which places a substantial financial burden on the government. The National Health Insurance Research Database (NHIRD) is a comprehensive database containing the medical information of over 99% of individuals in Taiwan. Using the latest available data, we selected patients who underwent their first CABG surgery between January 2014 and December 2017 (n = 12,945) to identify the factors that affect medical expenses, and we built prediction models using different machine learning algorithms. Our results showed that the surgical expenditure (X4), the 1-year medical expenditure before the CABG operation (X14), and the number of hemodialysis (X15) were the key factors affecting the 1-year medical expenses of CABG patients after discharge. Furthermore, the XGBoost and SVR methods yielded the best predictive models. Thus, our research suggests enhancing healthcare management for patients with kidney-related diseases to avoid costly complications. We provide helpful information for medical management, which may decrease health insurance burdens in the future.


Introduction
Coronary artery bypass grafting (CABG) is the most common cardiac surgery for treating patients with severe coronary artery blockages. After CABG, patients follow one of two courses: gradual recovery, or complications that lead to rehospitalization [1]. Readmission is therefore an essential outcome of CABG surgery, and its incidence is high within 30 and 90 days [2][3][4]. It is also a severe problem because it is directly related to the medical expenses that patients and hospitals incur, substantially increasing healthcare costs and creating a large economic burden. However, expenditure after CABG surgery remains poorly predicted. Various studies point out that preoperative comorbidities, multiple complications, and medical expenses are essential variables that can affect the survival of CABG surgery patients [5][6][7].
This research used the National Health Insurance Research Database (NHIRD) to address this issue. The NHIRD has been used widely in many academic studies [8], and research based on it has gradually become a reference for clinical decisions by both physicians and the government. This research has three aims. First, we use feature selection to identify the essential variables that affect postoperative expenditures. Second, we use different feature selection methods to rank the essential variables. Last, we use different machine learning methods to build an appropriate medical expenditure prediction model for patients who underwent CABG. This information could help reduce medical expenditures, improve the quality of healthcare institutions, and provide essential references for medical management policy.

Data Source
Taiwan's NHIRD was established in 1995, and its coverage rate is nearly 99%. The NHIRD provides personal medical information on Taiwanese residents, including primary demographic data and previous diseases. In addition, the NHIRD covers extensive healthcare data, including patients' original outpatient and inpatient records, treatments, expenditures, diagnosis codes, and admission dates. The codes were based on the International Classification of Disease, 9th Revision, Clinical Modification (ICD-9-CM); the 10th Revision was added to the database on 1 January 2016. This study was designed as a population-based study of the 23 million national health insurance beneficiaries enrolled in Taiwan [9]. The NHIRD provides comprehensive long-term follow-up of all claims records under the NHI program. All personal information in the NHIRD was anonymized and deidentified. Thus, the institutional review board of Fu-Jen University in Taiwan exempted this study from ethical review (C108121), and the requirement to obtain informed consent was waived.

Study Population
This research selected patients who had undergone CABG surgery (procedure codes 68023A, 68023B, 68024A, 68024B, 68025A, 68025B) between 1 January 2014 and 31 December 2017 from the Taiwan NHIRD (n = 13,078). The date of the first CABG surgery was the index date. A total of 133 patients were not eligible for the study. To ensure that this study included only patients who received a CABG operation for the first time, patients who had CABG surgery before the initial surgery year (n = 81) were excluded; we also excluded patients who were under 18 years old (n = 21) or had missing information (n = 31). After these exclusions, 12,945 first-time CABG surgery patients from 1 January 2014 to 31 December 2017 were included in our research, and all were followed up until 31 December 2018 (Figure 1).
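The cohort counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the variable names are illustrative and not part of the original analysis.

```python
# Counts reported in the Study Population section.
initial_cohort = 13_078

# Exclusion criteria and the number of patients removed by each one.
exclusions = {
    "CABG before the initial surgery year": 81,
    "under 18 years old": 21,
    "missing information": 31,
}

total_excluded = sum(exclusions.values())
final_cohort = initial_cohort - total_excluded

print(f"excluded: {total_excluded}, included: {final_cohort}")
```

The three exclusion counts sum to the 133 ineligible patients, and subtracting them from the initial cohort gives the 12,945 patients analyzed.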

Variable and Outcome Definitions
This study used the total surgical expenditures of the corresponding CABG surgery and the patient's medical expenditures in the previous year as predictive variables. The total surgical expenditures of each patient were calculated from the claims records, including examination, anesthesia, treatment, drug, operation-related expenses, and other medical services during CABG hospitalization.
The primary outcome was the one-year cumulative medical expenditure after discharge. We summed the total outpatient and hospitalization expenses for one year after discharge, and this sum is the predicted (outcome) variable (Y) (Figure 2). All expenses are expressed in New Taiwan dollars (NT$).
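As a rough illustration of how the outcome Y could be assembled from claims, the sketch below sums claim amounts dated within one year after discharge. The record layout, the 365-day window, and the function name `one_year_expenditure` are assumptions made for this example; the source does not describe the actual NHIRD processing code.

```python
from datetime import date, timedelta


def one_year_expenditure(discharge_date, claims, window_days=365):
    """Sum claim amounts dated within `window_days` after discharge.

    `claims` is an iterable of (claim_date, amount_ntd) pairs covering both
    outpatient visits and hospitalizations (a hypothetical layout).
    """
    window_end = discharge_date + timedelta(days=window_days)
    return sum(
        amount
        for claim_date, amount in claims
        if discharge_date < claim_date <= window_end
    )


# Toy claims for one hypothetical patient (amounts in NT$).
claims = [
    (date(2015, 3, 1), 1200),    # outpatient visit, inside the window
    (date(2015, 9, 15), 45000),  # readmission, inside the window
    (date(2016, 6, 1), 800),     # more than one year after discharge
]
y = one_year_expenditure(date(2015, 2, 1), claims)
```

Only the first two toy claims fall inside the window, so the third is left out of Y.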

Feature Selection and Prediction Models Implementation
When doctors make clinical decisions, they must review the patient's past medical records and current examination results one by one. This consumes time and slows down precise decision-making. Thus, feature selection (FS) is an essential preprocessing step before model prediction. By using different machine learning algorithms to remove irrelevant factors, we could reduce errors in clinical decisions and improve accuracy [12,13].
Medical expense is a continuous variable. Linear regression (LR), a model established by finding the relationship between the independent and dependent variables, is often used for continuous numerical estimation. On the training set, this research used five machine learning methods, namely LR, classification and regression tree (CART), support vector regression (SVR), multi-variate adaptive regression splines (MARS), and XGBoost (extreme gradient boosting), trained on the selected relevant features for medical expense prediction. To avoid overfitting, we used five-fold cross-validation during training.
In more detail, we partitioned the training data into five stratified subsets; in each round, four subsets (80% of the training data) were used for training and the remaining subset (20%) was used for validation. We repeated this process five times so that each subset was used once as the validation dataset. We then averaged the estimated results and used five different indicators to evaluate each prediction model.
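The five-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it uses simple shuffled folds rather than the stratified subsets mentioned in the text, and the names `k_fold_indices` and `cross_validate`, as well as the toy mean predictor, are assumptions for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists; each sample validates exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def cross_validate(xs, ys, fit, predict, metric, k=5):
    """Average a validation metric over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predictions = [predict(model, xs[i]) for i in val_idx]
        scores.append(metric([ys[i] for i in val_idx], predictions))
    return sum(scores) / len(scores)


# Toy usage: a baseline that always predicts the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


def mean_absolute_error(actual, predicted):
    return sum(abs(b - a) for b, a in zip(actual, predicted)) / len(actual)


xs = list(range(100))
ys = [2.0 * x + 1.0 for x in xs]
cv_mae = cross_validate(xs, ys, fit_mean, predict_mean, mean_absolute_error)
```

With 100 samples and k = 5, each round trains on 80 samples and validates on the remaining 20, matching the 80%/20% split in the text.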

Linear Regression (LR)
Linear regression models the association between the dependent variable and one or more independent variables. Once the regression model is established, the dependent variable (y) can be predicted. Before building a prediction model, the data must be normally distributed.
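For a single independent variable, the regression line has a closed-form least-squares solution. The sketch below shows that special case only; the function name `fit_line` and the toy data are assumptions, and the study's own LR model used many predictors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_y = intercept + slope * 5
```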

Classification and Regression Tree (CART)
CART can solve regression and classification problems with multi-dimensional output. It has a flowchart-like tree structure: each internal node represents an attribute variable, each branch represents a test outcome, and each leaf presents a classification [14]. CART uses the Gini index as its selection criterion. The Gini index is a measure of inequality; it is usually used to measure income imbalance but can measure any uneven distribution. It is a number between 0 and 1, where 0 is entirely equal and 1 is entirely unequal.
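In tree implementations, the Gini criterion is usually computed as the Gini impurity of the class labels at a node, which is related to but distinct from the income-inequality coefficient. The sketch below shows that impurity and the weighted impurity of a candidate split. The function names are illustrative. For a continuous target such as expenditure, regression trees typically use a squared-error criterion instead, and the source does not state which criterion was configured.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(left_labels, right_labels):
    """Size-weighted impurity of the two child nodes of a candidate split."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )


pure = gini_impurity(["high", "high", "high", "high"])
mixed = gini_impurity(["high", "high", "low", "low"])
perfect_split = split_impurity(["high", "high"], ["low", "low"])
```

A node containing one class has impurity 0, an even two-class mix has impurity 0.5, and a split that separates the classes completely brings the weighted impurity back to 0.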

Support Vector Regression (SVR)
The core of the SVM algorithm is the "kernel". When data cannot be separated linearly in a lower-dimensional space, the kernel maps them to a higher-dimensional space where they can be separated linearly. SVR is an extension of SVM to regression. To handle nonlinear problems, SVR considers structural risk, minimizing the generalization error and maximizing the hyperplane margin subject to a tolerated error [15,16].
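The "tolerated error" in SVR is commonly expressed through the epsilon-insensitive loss, which ignores deviations smaller than a tolerance. The sketch below shows only that loss function; the tolerance value and the function name are assumptions for illustration, and the source does not report the epsilon used.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Zero inside the tolerance tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


inside_tube = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05 is tolerated
outside_tube = epsilon_insensitive_loss(1.0, 1.5)   # error 0.5 exceeds the tube
```

Predictions within epsilon of the actual value cost nothing, so only points outside the tube influence the fitted function.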

Multi-variate Adaptive Regression Splines (MARS)
Friedman proposed the MARS method in 1991 [17]. MARS is a flexible, non-parametric regression model that consists of a weighted sum of piecewise polynomial basis (spline) functions. The optimal variables may be hidden in high-dimensional data; by modeling variable interactions, MARS can find the best variables more easily [18].
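The basis functions in MARS are typically hinge functions that are zero on one side of a knot and linear on the other. The sketch below evaluates a weighted sum of such hinges for one variable. The knot, weights, and intercept are hypothetical values chosen for the example, and the fitting procedure that selects knots is not shown.

```python
def hinge(x, knot, direction=1):
    """Hinge basis: max(0, x - knot) for direction=+1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Weighted sum of hinge terms; `terms` holds (weight, knot, direction) triples."""
    return intercept + sum(
        weight * hinge(x, knot, direction) for weight, knot, direction in terms
    )


# Hypothetical model: slope 2.0 above the knot at 3.0 and slope -0.5 below it.
terms = [(2.0, 3.0, 1), (0.5, 3.0, -1)]
above_knot = mars_predict(5.0, 1.0, terms)
below_knot = mars_predict(1.0, 1.0, terms)
at_knot = mars_predict(3.0, 1.0, terms)
```

Pairing the two hinge directions at one knot lets the fitted curve change slope at that point, which is how the model stays piecewise linear yet flexible.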

XGBoost (Extreme Gradient Boosting)
XGBoost was proposed by Chen et al. in 2016 [19]. It is an ensemble method based on decision trees. Its framework is gradient boosting, in which models are built sequentially. It can therefore minimize errors, maximize model performance, and reduce tree construction time. The central idea of XGBoost is to build each new model to correct the errors of the previously trained model and then make the prediction [20].
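The idea of each new model correcting the errors of the previous ones can be sketched with plain gradient boosting on squared error, using one-split trees (stumps) on a single feature. This is a simplified illustration and not XGBoost itself, which adds regularization, second-order gradients, and engineering optimizations. All names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split by squared error (needs two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left_value if x <= threshold else right_value)
        for threshold, left_value, right_value in stumps
    )


def mse(actual, predicted):
    return sum((b - a) ** 2 for b, a in zip(actual, predicted)) / len(actual)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)
baseline_mse = mse(ys, [model[0]] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

On this toy data the boosted training error falls below that of the constant baseline, because each stump is fitted to what the previous rounds left unexplained.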

Validation Index
This study used different machine learning methods to predict one-year medical expenses after discharge. The validation indices of a model serve as the reference for determining its quality and accuracy, and they depend on the model attributes.
To evaluate model performance, this study used five widely used and easily understood indicators to measure the prediction results. These five performance metrics represent three types: absolute error, scaled error, and percentage error. The absolute error group contains the mean absolute error (MAE), mean square error (MSE), and root mean squared error (RMSE); the scaled error group contains the mean absolute scaled error (MASE); and the percentage error group contains the mean absolute percentage error (MAPE) [21,22].
The mathematical formulas of these validation metrics are listed in Table 1.
Table 1. Error measures for the performance metrics equations (columns: Type of Error, Metrics, Equations).
These indicators are frequently and widely used as performance indices among different prediction models [23]. The lower the deviation, the better the accuracy of the prediction model. In each metric, n is the total number of patients, b is the actual medical expense, and a is the predicted medical expense.
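Four of the five metrics have standard textbook definitions, which the sketch below implements with b as the actual value and a as the predicted value. These are the conventional formulas rather than a transcription of Table 1. MAPE is returned as a fraction rather than multiplied by 100, which is an assumption. MASE is omitted because it needs a scaling baseline that the text does not specify.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |b - a|."""
    return sum(abs(b - a) for b, a in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean square error: average of (b - a)^2."""
    return sum((b - a) ** 2 for b, a in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; assumes nonzero actual values."""
    return sum(abs((b - a) / b) for b, a in zip(actual, predicted)) / len(actual)


# Toy expenses (not study data).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = {
    "MAE": mae(actual, predicted),
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```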

Statistical Analysis
This research selected new CABG patients between 2014 and 2017 and summarized their demographic characteristics and disease history. All results were expressed as numbers and percentages, N (%), for categorical variables. Continuous variables were presented as mean ± SD.

Hardware Equipment
MOHW provides an environment for data analysis. The main analysis computer has an Intel i7-8700 CPU, 128 GB of main memory, and a 1 TB Western Digital (WD10EZEX) system disk.
Research data were provided by the NHIRD, which holds the largest volume of data in Taiwan. All analysis data are stored on separate replacement hard disks (disk type: WD (DC HC310), 6 TB), which are kept by the Health and Welfare Data Science Center (HWDC).

X42 to X44 were hospital variables. The distribution of hospital area type (X42) was 15.13%, 62.10%, 20.54%, and 2.23% in the central, northern, southern, and eastern areas, respectively. For hospital ownership (X43), 35.21% were public and 64.79% were private hospitals. For hospital accreditation (X44), 61.89% were medical centers and 38.11% were non-medical centers.

The Ranking Number of Feature Selection on CABG
After feature selection, we ranked the importance of each variable under the different machine learning models, which provides helpful information for model building. Because every algorithm calculates importance differently, the selected variables also differed. For example, in determining the risk factors related to the one-year medical expense after discharge, each feature selection method could highlight different important variables. Huang et al. [5] point out that using fewer features is more efficient in model building.
This research used 44 variables [4,5,7,26-31], chosen on the basis of physicians' clinical experience and a literature review. Five different machine learning methods were used to score the filtered factors: the most crucial factor received the highest score (10 points) and ranked first, whereas the lowest-ranked predictor received 1 point. The ranking scores and averages of each variable are listed in Table 3.
Table 3 (partially recovered rows: the number of HD; the number of PD; the number of PCI vessels, with scores 7, 0, 0, 0, 0 and average 1.4).
After screening and analysis, the variables with higher scores were selected as predictors in this research. Because the machine learning algorithms calculate importance differently, each variable has a different relative importance rank under each method.
In the LR model, the most crucial variable was the surgical expenditure (X4). The other two top variables, the number of HD (X15) and medical expenditure (X14), were both from the year before surgery. The top three essential variables of SVR, CART, and MARS were the same as those of LR. For XGBoost, the top two essential variables were still X4 and X14, but the third most important variable was CKD. Overall, the essential variables across the LR, CART, SVR, MARS, and XGBoost models were surgical expenditure (X4; average: 9.8 points), one-year medical expenditure before surgery (X14; average: 9 points), and the number of HD (X15; average: 8 points).
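The scoring scheme can be illustrated with a short sketch: each method assigns 10 points to its top-ranked variable down to 1 point for its tenth, and the points are averaged across the five methods. Giving 0 points to variables a method did not select is an assumption inferred from the zeros in the surviving Table 3 row, and the function names are illustrative. The only data used are the scores for the number of PCI vessels reported in Table 3.

```python
def points_from_rank(rank, top_k=10):
    """Map a 1-based rank to points: rank 1 -> top_k points, rank top_k -> 1 point.

    A variable the method did not select (None or beyond top_k) gets 0 points.
    """
    if rank is None or rank > top_k:
        return 0
    return top_k - rank + 1


def average_score(points_per_method):
    """Average a variable's points over all feature selection methods."""
    return sum(points_per_method) / len(points_per_method)


# Points for the number of PCI vessels across the five methods (Table 3 row).
pci_vessels_points = [7, 0, 0, 0, 0]
pci_average = average_score(pci_vessels_points)
```

Under this scheme a score of 7 corresponds to a fourth-place rank in one method, and the average over five methods reproduces the 1.4 listed in Table 3.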
In general, these three variables (X4, X14, X15) could affect one-year medical expenditures after discharge in CABG patients.
To clarify and simplify the predictors, we averaged the scores of each important variable across methods, as shown in Figure 3. The result depicts the variables that possibly affect one-year medical expenditure after discharge.
The top five critical variables were surgical expenditure (X4), medical expenditure in the year before surgery (X14), the number of HD in the year before surgery (X15), CKD (X30), and mechanical ventilation use during the CABG surgery (X7).

Performance of 5 Different Prediction Models
After feature selection, we built LR, CART, SVR, MARS, and XGBoost prediction models. To identify the lowest value of each indicator, we evaluated the following metrics: MAE, MSE, RMSE, MASE, and MAPE. For example, in the overall results in Table 4, the combination of CART feature selection and XGBoost prediction gave the lowest MSE (0.0490) and RMSE (0.2214).
Using the variables selected by MARS and building the prediction model with SVR gave the lowest values for three indicators, namely MAE (0.1302), MASE (0.2067), and MAPE (0.0094). Thus, SVR built on the three variables selected by MARS was the best predictive model in this research compared with the other combinations.
Figure 3. The average score after feature selection using five methods.

Discussion
The NHIRD provides a large amount of medical information, and each patient can be traced over a long follow-up time. Therefore, we used the NHIRD for medical expense prediction.
The latest year of the NHIRD database is 2018. Therefore, we selected new CABG surgery patients between 2014 and 2017. The primary purpose of our study was to evaluate which factors could predict the one-year medical expenses after discharge of CABG patients and to build an expense prediction model. Most research discusses mortality, readmission, and the relationship between diseases and surgery [1,4,5,15,28,32-34]. However, only a few studies have explored medical expenses, and fewer still have forecast them. For example, Mehaffey et al. [29] found in 2018 that each additional complication would cause an exponential cost increase. Baciewicz et al. [28] reported in 2018 that sicker patients needed more blood transfusions, which led to increased expense. From the above results, we could infer that the baseline variables, including age (X1), CHA2DS score (X2), CCI score (X3), CKD (X30), AKF (X35), ESRD (X38), renal disease (X39), and major illness (X44); the variables from one year before surgery (total medical expense (X14), blood transfusion (X12), mechanical ventilation use (X13), the number of HD (X15), PD (X16), and PCI vessels (X17)); and the surgical variables (surgical expenditure (X4), blood transfusion (X6), and mechanical ventilation use (X7)) all positively influenced one-year medical expense after discharge.
In this study, we used multiple stages to analyze and predict the one-year medical expense after discharge. First, we used feature selection to find the essential variables that affect the medical expense. Second, after finding the important variables, we selected five different machine learning models to build prediction models and evaluate their performance. In addition, through feature selection, we found the following interesting variables: CKD (X30), AKF (X35), ESRD (X38), and renal disease (X39). They are all associated with renal conditions; although these variables do not rank exceptionally high and are therefore easily overlooked, they are topics worthy of further study. For example, Chou et al. [35] reported in 2014 that dialysis patients who underwent CABG surgery had better survival than those who underwent PCI; Chen et al. [36] found that dialysis is associated with higher risk and mortality in CABG patients. Furthermore, Liao et al. [7] found that ESRD patients have a higher medical expense after CABG surgery. From the above results, it can be inferred that for kidney disease patients who underwent their first CABG surgery, the one-year expense after discharge would be relatively high.
Surgical expenditure (X4), medical expenditure in the preoperative year (X14), and the number of HD (X15) were the most critical medical expense predictors. Furthermore, the 3 variables selected by MARS or the 10 variables selected by CART can be used with the SVR and XGBoost methods, respectively, to achieve a better medical expense prediction model.

Conclusions
Our study developed a multiple-stage model to evaluate the one-year medical expense after discharge for first-time CABG patients. Our model found that the variables of the corresponding operation could predict one-year medical expenditure after CABG. Furthermore, postoperative complications increase the medical expense [28]. In our results, kidney problems, including previous HD, PD, ESRD, renal disease, and CKD, were all closely connected with the forecast medical expenses after CABG surgery. Therefore, hospitals should enhance healthcare management for specific disease prevention, especially for CABG patients with kidney-related diseases.
Our study suggests that the SVR and XGBoost models, combined with MARS and CART feature selection, are adequate tools for building a medical expense prediction model. This research can provide references for the medical management of specific diseases, which could reduce expenses through effective control and thereby decrease the government's burden.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are not available on request from the corresponding author. Due to the General Data Protection Regulation, the data presented in this research are not publicly available.