Predicting Continuity of Asthma Care Using a Machine Learning Model: Retrospective Cohort Study

Continuity of care (COC) has been shown to offer numerous health benefits for patients with chronic diseases. Specifically, establishing its level can facilitate clinical decision-making and improve the allocation of healthcare resources. However, the use of a generalizable predictive methodology to determine the COC of patients has been underinvestigated. To fill this research gap, this study aimed to develop a machine learning model to predict the future COC of asthma patients and explore the associated factors. We included 31,724 adult outpatients with asthma who received care from the University of Washington Medicine between 2011 and 2018, and examined 138 features to build the machine learning model. Under 10-fold cross-validation, the proposed model yielded an accuracy of 88.20%, an average area under the receiver operating characteristic curve of 0.96, and an average F1 score of 0.86. Further analysis revealed that the severity of asthma, comorbidities, insurance, and age were highly correlated with the COC of patients with asthma. This study used predictive methods to obtain the COC of patients, and the modeling strategy achieved high performance. After further optimization, the model could facilitate future clinical decisions and hospital management and improve outcomes.


Background
Continuity of care (COC) is a mode of structured care delivery. It has been shown to offer numerous health benefits for chronic disease management, with fewer adverse outcomes [1,2] and reduced costs [3,4]. A patient with a high level of COC receives better ongoing healthcare management. Thus, knowing the level of COC is essential for implementing care interventions. Thus far, its measurements have varied and have mostly focused on quantifying the "interpersonal relationship" between patients and collaborators, such as physicians, caregivers, and patients themselves [5][6][7][8]. Despite the availability of different measurements, obtaining the COC with a generalizable methodology has been underinvestigated. Notably, a predictive model is an artificial intelligence method that can be deployed in the clinic to facilitate decisions prospectively [9]. An effective technique for predicting the COC of patients would therefore be a substantial advance. However, the high dependence on multidisciplinary knowledge and massive data collection has limited progress [10]. To identify the degree of COC, we used a machine learning classification model to predict the future level of COC of patients and targeted one of the major chronic diseases, asthma.
Asthma is a common chronic disease that can lead to poor outcomes when it is not kept under continuous control. In the United States, 7.8% of the people have asthma, causing 1,629,469 emergency department (ED) visits; 178,530 hospitalizations; and thousands of deaths annually [11]. Unlike other chronic diseases, it affects a broader age range, indicating that any effective improvement in asthma care would benefit more patients. In addition, younger patients with asthma have a lower COC and more often experience episodic exacerbations [12]. By knowing the COC beforehand and improving it using practical methods, as many as 60-75% of the future ED visits and 25% of the hospitalizations by patients with asthma can be avoided [13][14][15][16].

Current Research Gap
Previous studies have focused on finding an association between the COC and outcomes in patients with asthma. However, as demonstrated by the literature [17], proper gauging of COC and outcomes should be prioritized before exploring the relationship between them. As identifying the outcomes of patients is much easier than ascertaining the COC, many studies have developed a predictive model for the former [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34], while limited research has been performed for the latter.
Most importantly, thus far, assessing the COC of patients has relied on historical data [5][6][7][8]. Studies have extracted patients with longitudinal visit records spanning several years to compute COC measures such as the continuity of care index (COCI). The existing quantitative methods all used historical data and had a limited sample pool. Moreover, the COC cannot be obtained for a new patient who has never been seen in a specific healthcare system. In the University of Washington Medicine (UWM), approximately 40% of the patients with asthma who receive medical care each year are new, as shown in Table 1. Because no prior COC exists for these patients, using the past COC to predict the future level would yield a prediction accuracy of less than 60% at best. Furthermore, the COC of patients is likely to change with time; thus, the prior COC cannot represent the future one with 100% accuracy even for existing patients. We calculated the prediction accuracy using the historical COC for outpatients with asthma who received care from the UWM for 5 years. The highest accuracy was 57.94%, as shown in Table 1. Thus, in this study, the baseline prediction accuracy was set to 57.94%. Although this method is simple, it is insufficient. Furthermore, additional barriers would affect this approach:

1. Clinical research has mostly relied on claims data, electronic health records (EHRs), patient surveys, and consultations to collect data. Claims data and EHRs usually contain complete historical data; however, extracting and cleaning such massive raw data is complicated for clinical researchers. Thus, patient surveys and consultations are often preferred [7][35][36][37][38][39]. However, consultation data are generally collected through computer systems, including telephone calls or emails, and therefore depend on the system or the person operating it. The results could be misleading if various researchers shared the same data or if the computer system changed. Although patient surveys could avoid such misrepresentative findings, their small sample size would limit the study. Therefore, the simple method that directly uses the previous COC to represent the future one is restricted to specific studies and does not generalize.

2. Several studies have shown that patient demographics and comorbidities are associated with the COC of patients [37,38]. Using these attributes could facilitate the evaluation of COC in some new patients. Current studies mainly investigate the probability that these characteristics affect the COC; however, they do not use them to indicate its specific level. In this research, models were developed to explore the feasibility of using demographic and comorbidity attributes to assess the COC. Notably, a critical intervention for patients after an asthma attack is enrollment in care management. It costs over $5000 per person annually [40] and generally enrolls fewer than 3% of the patients owing to resource limitations [41]. The COC is a part of care management. Earlier intervention for patients with low COC would likely improve the quality and cost-effectiveness of care management. Thus, it is worthwhile to investigate a generalizable predictive methodology to determine the future COC.

Objective
This study was designed to fill the aforementioned research gap. We proposed a machine learning model to predict the future COC for outpatients with asthma. Our final model integrated the EHRs and the administrative data to estimate three possible categorical COC levels: high, moderate, and low.

Data Source
This retrospective cohort study used the EHRs and the administrative data extracted from the UWM, the most extensive academic healthcare system in the State of Washington. The data warehouse has been collecting complete adults' data from 12 clinics and 3 hospitals since 2011. This study's patient population included all outpatient visits from 2011 to 2018.

Data Collection and Patient Cohort
The enterprise data warehouse of the UWM contains the original and uncleaned EHRs and administrative data. To ensure data validity, we implemented a data collection and cleaning process before building the predictive model. We identified patients with asthma in a specific year using a minimum of one diagnosis code of asthma in that year: the International Classification of Diseases, Ninth Revision codes 493.9x, 493.8x, 493.1x, and 493.0x; and the International Classification of Diseases, Tenth Revision codes J45.x [20,42,43]. The patient cohort included 31,724 adult outpatients (age ≥ 18 years) with asthma between 1 January 2011 and 31 December 2018; 5057 outpatients (age < 18 years) were excluded. The distribution of this study's dataset is presented in Figure 1.

Prediction Target
The prediction target in this study was the class of the COC score, which represents the level of COC of the patient. To calculate the COC score of the patients, the most common COC measurement algorithm, the continuity of care index (COCI) [44], was chosen and divided into three levels following the classification strategy of a prior study [45]. The COCI is computed from the number of visits to each physician and the number of distinct physicians consulted [44]. The following general equation represents the COCI of the outpatient visits:

$$\mathrm{COCI} = \frac{\sum_{j=1}^{M} n_j^{2} - N}{N(N-1)}$$

where $N$ refers to the total number of visits to the physicians, $n_j$ denotes the number of visits to physician $j$, and $M$ refers to the total number of different physicians. The COCI ranges from 0 to 1, with a higher score indicating a higher level of COC. In this study, the COCI was classified into three levels: high (0.34-1.00), moderate (0.17-0.33), and low (0.00-0.16). For model building, we assigned the numbers 3, 2, and 1 to the high, moderate, and low levels, respectively.
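As an illustration of the definition above, the following minimal sketch computes the COCI from a list of physician identifiers (one entry per visit) and maps the score to the three classes. The function names are hypothetical. Treating 0.17 and 0.34 as lower cut points for scores that fall between the stated ranges, and returning `None` for fewer than two visits (where the denominator is zero), are assumptions rather than details given in the text.

```python
from collections import Counter


def coci(visits):
    """Continuity of care index from a list of physician IDs, one per visit."""
    n_total = len(visits)                 # N: total number of visits
    if n_total < 2:
        return None                       # N(N-1) == 0, index undefined (assumption)
    counts = Counter(visits)              # n_j: visits to each physician j
    sum_sq = sum(n * n for n in counts.values())
    return (sum_sq - n_total) / (n_total * (n_total - 1))


def coci_class(score):
    """Map a COCI score to 3 (high), 2 (moderate), or 1 (low)."""
    if score >= 0.34:
        return 3
    if score >= 0.17:
        return 2
    return 1


concentrated = coci(["A", "A", "A", "B"])    # (9 + 1 - 4) / 12
mixed = coci(["A", "A", "B", "B", "C"])      # (4 + 4 + 1 - 5) / 20
dispersed = coci(["A", "B", "C", "D"])       # (4 - 4) / 12
print(concentrated, mixed, dispersed)
```

A patient who sees a different physician at every visit scores 0, and a patient who always sees the same physician scores 1.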

Preprocessing Feature Values
The quality of data and features determines the performance and reliability of a machine learning model; specifically, feature preprocessing is essential before training. In this study, a total of 138 features were examined, describing a large variety of characteristics. Table A1 in the Appendix A describes the details of these features. In addition to the demographic features (such as age, gender, race, and ethnicity), features related to medication, insurance, comorbidity, family location, and types of visits were included in this study. Typically, we utilized standardization to process complicated features. We adopted a uniform counting standard to calculate the structured attributes. The medication features offer an example: consider a patient who was prescribed medications twice in a specific year. Medications A and B were prescribed the first time, and A and C the second time; for this year, the total number of prescribed medications was four, and the number of distinct prescribed medications was three. In addition, binarization was introduced to quantify broad domain features, including those associated with the family location of patients. Our prior study [46] showed that a 5-mile radius from the patient's home to the UWM was the threshold distance within which patients mostly tended to receive care from it. Therefore, we set the value of this attribute to 1 or 0 to indicate whether the distance was less than 5 miles.
Every input feature in the predictive model had to be independent of the outcome. Because the COCI is computed from visit counts, the features corresponding to the number of visits to the physicians were not considered, such as "number of outpatient visits to the patient's primary care providers", "number of differing providers the patient saw in outpatient visits," and "number of differing primary care providers of the patient". In addition, if some features described similar items, they were integrated into one category. For instance, "primary asthma diagnosis" and "priority asthma diagnosis" were categorized as one entity under "primary asthma diagnosis".
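The medication-counting convention and the distance binarization described in this section can be sketched as follows. The function names and the input layout (one inner list per prescribing event) are assumptions made for illustration; the source does not describe how these features were implemented.

```python
def medication_counts(prescriptions):
    """Return (total, distinct) medication counts for one patient-year.

    prescriptions: list of lists, one inner list of medication names
    per prescribing event.
    """
    all_meds = [med for event in prescriptions for med in event]
    return len(all_meds), len(set(all_meds))


def within_radius(distance_miles, threshold=5.0):
    """Binarize home-to-UWM distance: 1 if under the threshold, else 0."""
    return 1 if distance_miles < threshold else 0


# The example from the text: A and B first, then A and C.
total, distinct = medication_counts([["A", "B"], ["A", "C"]])
print(total, distinct)
```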

Data Preparation
Most classification algorithms accept only numerical features. Thus, we applied one-hot encoding to transform the categorical features into numerical ones before they were passed to the classifiers. Furthermore, as the COCI is a longitudinal prediction target, the corresponding values were computed starting from the time the patient first appeared in the UWM dataset. The entire 8-year period of this study was from January 2011 to December 2018.
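One-hot encoding, as mentioned above, replaces a categorical column with one 0/1 indicator column per category. The sketch below uses plain Python dictionaries. The column name `insurance` and its categories are hypothetical placeholders, not values from the study's feature list.

```python
def one_hot(rows, column):
    """Replace a categorical column with 0/1 indicator columns.

    rows: list of dicts sharing the same keys; the input is not modified.
    """
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records for illustration only.
rows = [
    {"age": 34, "insurance": "private"},
    {"age": 61, "insurance": "public"},
    {"age": 47, "insurance": "private"},
]
encoded = one_hot(rows, "insurance")
print(encoded[0])
```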

Performance Metrics
For a multiclass classification problem, the prediction accuracy and the area under the receiver operating characteristic curve (AUROC) are two important metrics for evaluating the performance of a predictive model; however, they are not the sole measures for selecting a proper classifier. We further chose three additional standard metrics, namely precision, recall, and F1 score, for a more precise evaluation. Precision refers to the proportion of instances predicted as a given class that truly belong to that class, recall refers to the proportion of instances truly belonging to a class that the model predicted correctly, and the F1 score combines precision and recall as their harmonic mean. The equations for the metrics are as follows:

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 \, P_i \, R_i}{P_i + R_i}$$

Here, $P_i$ refers to precision for class $i$, $R_i$ denotes recall for class $i$, $F1_i$ refers to the F1 score for class $i$, $TP_i$ denotes true-positive classifications for class $i$, $FP_i$ refers to false-positive classifications for class $i$, and $FN_i$ refers to false-negative classifications for class $i$. The confusion matrix for multi-class classification is presented in Table 2.
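The per-class metrics defined above can be derived directly from a multiclass confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the matrix values are invented for illustration and are not the study's results.

```python
def per_class_metrics(confusion, labels):
    """Precision, recall, and F1 per class.

    confusion[i][j] = number of instances of true class labels[i]
    that were predicted as class labels[j].
    """
    k = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(k)) - tp   # column total minus TP
        fn = sum(confusion[i]) - tp                        # row total minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(confusion):
    """Overall accuracy: correctly classified instances over all instances."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical 3-class confusion matrix (classes 1 = low, 2 = moderate, 3 = high).
confusion = [
    [8, 1, 1],
    [2, 2, 0],
    [0, 1, 9],
]
metrics = per_class_metrics(confusion, [1, 2, 3])
overall = accuracy(confusion)
print(metrics[2], overall)
```

In this toy matrix the small middle class has the weakest precision and recall even though overall accuracy is fairly high, which is why per-class metrics matter for imbalanced targets.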

Classification Algorithms
Machine learning classification algorithms learn from labeled data (supervised learning) to predict the probability of each class of a target variable. Our prediction target, the COCI of patients with asthma, was divided into three groups: high (3), moderate (2), and low (1). Machine learning classifiers are well suited to this multiclass classification problem. To build the predictive model, this study proposed the use of the extreme gradient boosting (XGBoost) algorithm [47], an efficient and distributed implementation of gradient boosting. Six classification algorithms widely recognized in the data mining and machine learning literature are typically employed to develop advanced predictive models [47,48]: random forest, k-nearest neighbor (k-NN), support vector machine (SVM), C4.5 decision tree, XGBoost, and Naive Bayes. Specifically, tree-based algorithms (e.g., random forest, C4.5, and XGBoost) and the SVM are both high-performance tools for classification. The former divide the input space into hyper-rectangles according to the target. The latter uses the kernel trick to map a linearly nonseparable problem into a space where it becomes linearly separable, which can prolong training. All six algorithms were tested in preliminary experiments, and XGBoost was selected owing to its superior performance.
The study sample was divided into 80% and 20% for training and internal validation, respectively. We fit the six algorithms to the training data and applied 10-fold cross-validation to find the best parameters. The parameters tuned in the experiments for each model are as follows: whether class weights were balanced, the number of trees, and the split criterion in random forest; the number of neighbors in the k-NN; whether class weights were balanced, the regularization strength, and the kernel function in the SVM; the class weight and the maximum tree depth in the C4.5 and the XGBoost; and the prior probabilities and likelihoods of the different classes in the Naive Bayes. The other parameters were left at each algorithm's default settings.
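The 80/20 split followed by 10-fold cross-validation on the training portion can be sketched with index lists alone. This is a simplified, unstratified illustration with hypothetical function names and an arbitrary random seed; the source does not state whether the authors stratified by COCI class or which tooling they used.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(100)
splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(splits))
```

Each candidate parameter setting would be scored on the ten validation folds and averaged, and the held-out 20% would be used only for the final evaluation.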

Evaluating the Superiority of the Final Model
Overall, 138 features were used to build the final model. Examining many types of features was an essential part of the modeling strategy. As no comparable predictive model existed, it was necessary to investigate whether simply using patients' demographic or comorbidity features to predict the future COC would also be effective. We constructed two additional models using the same patient cohort, prediction target, feature preprocessing method, and machine learning algorithm. The only difference between these two models and the final one was the set of features. We named the second model, which used only demographic features, "model_2", and the third model, which used demographic and comorbidity features, "model_3". The details of the features are listed in Tables A2 and A3 of the Appendix A.
The purpose of these two models was to examine whether the final model was superior to the simpler models. It was unnecessary to use as many comorbidity features as the final model when building model_3. Furthermore, most clinical studies cannot obtain complete comorbidity information from patient surveys or consultations. Thus, we chose 10 asthma-related comorbidity features to build model_3.
Table 3 presents the distributions of the COCI classes and the data instances. During the entire study period, 40.68% (12,905/31,724), 5.69% (1804/31,724), and 53.63% (17,015/31,724) of the data instances indicated low (COCI class = 1), moderate (COCI class = 2), and high COC levels (COCI class = 3), respectively. Table 4 shows the characteristics of the patient cohort. We computed the p-value using the chi-square test [42] to evaluate the statistical differences of the data instances. As displayed in Table 4, most characteristics of the patients presented statistically significantly different distributions (p < 0.001) among the three COCI classes, with the exception of the occurrence of bronchopulmonary dysplasia (p = 0.99) and cystic fibrosis (p = 0.02).

Performance Results of Various Machine Learning Models
In this study, the dataset was randomly divided into 80% and 20% as a training and test set, respectively. For comparison purposes, five additional models with the random forest, k-NN, SVM, C4.5, and Naive Bayes were evaluated using 10-fold cross-validations under the same sample of the training and test sets. The average values of accuracy, precision, recall, F1 score, and the AUROC of the six models are listed in Table 5. The baseline accuracy calculated using the direct method mentioned previously is listed in Table 5 for an improved comparison.
For a multiclass classification problem, high accuracy and a high AUROC indicate good performance of a predictive model. Furthermore, recall and precision are critical indicators for imbalanced datasets. Higher recall and precision indicate that more instances were identified correctly. Notably, the F1 score is the harmonic mean of recall and precision. Thus, we considered the F1 score, accuracy, and the AUROC as assessments of the prediction performance. Across the models, our final model using the XGBoost classifier yielded the highest accuracy (88.20%), the highest F1 score (0.86), and the highest AUROC (0.96). Figure 2 presents the ROC curves of the model. The model achieved a microaverage AUROC of 0.96 and a macroaverage AUROC of 0.90. Specifically, the per-class AUROCs of the final model were 0.98, 0.80, and 0.93 for classes 1, 2, and 3, respectively. The confusion matrix of the final model is presented in Figure 3.
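As a quick consistency check on the reported ROC figures, assuming the macroaverage is the unweighted mean of the three per-class AUROCs (the usual definition, which the text does not spell out):

$$\mathrm{AUROC}_{\text{macro}} = \frac{0.98 + 0.80 + 0.93}{3} = \frac{2.71}{3} \approx 0.90$$

The microaverage instead pools the decisions for all instances and classes, so the two large classes (low and high COC) dominate it. This would be consistent with the microaverage (0.96) exceeding the macroaverage (0.90), given that the moderate class has the lowest AUROC and accounts for only 5.69% of the data instances.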
In addition, we fed all 138 features into the XGBoost classifier, which automatically computes each feature's importance value based on its contribution to the model [49]. Our final model was built with the 127 features selected by the XGBoost, as listed in Table A4 of the Appendix A in descending order of importance. The XGBoost automatically filtered out noncontributing features.

Superiority Evaluation Results
The superiority evaluation study examined the performance of two simpler models built with fewer features. With the exception of the features, the dataset, prediction target, and modeling strategy were all consistent with the final model. The comparison results are presented in Table 6 and Figure 4. The final model yielded the best performance among all the metrics.

Principal Findings
In this study, a machine-learning model was developed to predict the future COC of patients with asthma. To identify and calculate its level, the COCI was selected, as it is the most commonly used measure of continuity between patients and physicians. The XGBoost model yielded the best performance, including the highest accuracy, AUROC, and F1 score. XGBoost performed best in this study, likely because of its capacity to process large datasets. Nevertheless, other algorithms, such as random forest, also performed well on the UWM data, which we attribute to the feature engineering in our modeling strategy. Furthermore, solely using demographic or comorbidity features to assess the COC was inadequate, further supporting the modeling strategy. Overall, this study fills the research gap on the use of predictive methods to obtain the COC of patients with asthma, which could facilitate clinical decision-making and the allocation of resources, eventually improving patient outcomes.
Overall, 138 features were assessed, and 92.03% (127/138) were used in the final model. Notably, most of the top 30 features in Table A4 of the Appendix A were related to the severity of asthma, comorbidities, insurance, and age, consistent with prior research on the factors associated with care continuity [50].

Comparison with Prior Work
This study fills the research gap of predictive model construction for estimating patients' COCs; thus, prior works relevant to it are limited. Nevertheless, the use of machine learning to improve patient outcomes, such as disease or poor outcome prediction, has been studied broadly to date. In the research on predicting future outcomes of patients with asthma [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34], the AUROC ranged from 0.70 to 0.90. The highest class-level AUROC in this study, 0.98 for patients with low COC, exceeded this range. When building a clinical machine learning model, a similar modeling strategy is usually chosen; although the prediction targets are not directly comparable, the extensive and effective features and the massive data fed into the model help yield a higher AUROC. Moreover, our precise data extraction strategy for identifying the prediction target contributed to the performance.
Notably, different models have been built on varying patient cohorts to predict similar targets. Some studies have used data from patient surveys and self-reported outcomes to analyze the COC of those with asthma. This study employed the EHRs and the administrative data, which contain more clinical characteristics for enhanced profiling. The final predictive model was built using the XGBoost, a state-of-the-art machine-learning algorithm. Compared with statistical approaches based on linear models, such as logistic regression, the XGBoost (an ensemble model) can improve the prediction performance with fewer fundamental assumptions on data distribution [51,52]. As partial evidence for this, we built two additional simpler models to validate our modeling strategy's superiority and generalizability. The resulting performance demonstrated the feasibility of using our final model to predict the COC of patients with asthma.

Clinical Significance and Potential Use
Our model showed high performance in predicting the level of COC for patients with asthma. After working with the healthcare system's Information Technology (IT) team, we could deploy the model by publishing it as a web service, and the model would benefit both patients and hospital management. Knowing the level of future COC could facilitate the design of improved, objective interventions for patients with asthma. In addition, maintaining high COC among patients and providing long-term health services has been a goal of nearly all policymakers and healthcare organizations as a way to reduce costs substantially.
Furthermore, once patients are identified as having low COC, interventions can be implemented to improve it. In the clinical environment, interventions such as adding the COC score to the medical record, enrolling the patient in care management, and increasing the frequency of follow-up should be considered. Moreover, research has found that adjusting insurance policies, roles, and care delivery strategies can improve the COC [53]. However, such interventions for continuity of asthma care are multifaceted, as they consist of several components [54]: (1) interdisciplinary cooperation including interdisciplinary care standards, case conferences, and shared patient management tools; (2) the education of patients and their caregivers and the decision-making involved; (3) implementation of measurable goals of a care plan; (4) allocation of supplemental resources; and (5) coordination of care in the transition. These various components must be considered before designing the interventions.
The literature [55,56] has demonstrated that reimbursement and copayments are associated with improving COC; thus, if insurance policies are designed reasonably, such as by offering higher reimbursement or lower copayments, both patients and physicians could benefit.
Care delivery strategies can be flexible because, thus far, no standard has been used uniformly, and numerous factors should be considered. Enrolling patients in case management is an effective strategy for improving the COC [57]. Typically, case management is a client-centered approach for promoting cooperation among services, benefits, and opportunities. Activities are designed by case managers to optimize the functioning of people with multiple needs [57]. Regarding asthma care, nurses could be case managers who devote themselves to improving the COC of patients [58]. The literature [59] has indicated that scheduling nurse-led follow-up care appointments increases the COC. A study [60] that recruited 1000 patients (including those with asthma) found that making follow-up care appointments earlier (ideally right after discharge) improved appointment attendance. Similar results were found in research [61] showing that providing patients with asthma with a free 5-day course of medication such as prednisone, a 2-day telephone reminder to make an appointment, and travel vouchers for revisiting their providers significantly increased the COC.
Moreover, follow-up care appointments made by providers achieved better adherence than appointments scheduled by patients themselves. The literature [62] has shown that 29% of patients did not return once the care facility stopped scheduling follow-up appointments. Therefore, education is necessary for patients and their caregivers. Numerous studies [58][63][64][65][66] have shown that education programs in various forms, such as home-, web-, and telephone-based programs, have positive influences on asthma control. These programs strengthen the connection between patients or their caregivers and healthcare facilities [67]. Although no evidence shows that education is directly associated with the COC, the stronger connection fostered by education programs supports this viewpoint, as the COC is essentially a mode of care delivery coordinated by patients and healthcare facilities. Thus, future research could investigate the association between the COC and asthma education.

Limitations
This study has several limitations that could be potential topics for future research:
1. This study chose the COCI, an algorithm that mainly focuses on the relationship between patients and physicians, to assess the COC of patients. In the future, the COC could be evaluated using other methods that consider interpersonal, geographical, socioeconomic, educational, and cultural aspects;
2. The UWM is an academic healthcare system located in an urban area, and we could not access data outside it. Thus, the generalizability of this study's method to other healthcare systems and rural areas remains to be examined;
3. This study's model was built using a machine learning algorithm, which is a black-box approach that gives no explanation for its predictions. In the future, implementing a rule-based method to explain the predictions would benefit clinical use.
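The first limitation refers to the COCI. As a point of reference, the sketch below computes the index in its commonly used Bice-Boxerman form, COCI = (sum over providers j of n_j^2 - N) / (N(N - 1)), where N is a patient's total number of visits and n_j is the number of visits to provider j. It assumes this standard form applies here; the function name `coci` and the list-of-provider-IDs input are illustrative and not taken from the study.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def coci(provider_ids: Sequence[Hashable]) -> Optional[float]:
    """Bice-Boxerman continuity of care index for one patient.

    provider_ids holds one entry per visit (hypothetical input format).
    Returns None when fewer than two visits exist, because the index
    is undefined in that case.
    """
    n_visits = len(provider_ids)
    if n_visits < 2:
        return None
    visits_per_provider = Counter(provider_ids)
    sum_of_squares = sum(n * n for n in visits_per_provider.values())
    return (sum_of_squares - n_visits) / (n_visits * (n_visits - 1))


# All visits to a single provider give 1.0;
# every visit to a different provider gives 0.0.
print(coci(["dr_a", "dr_a", "dr_a"]))
print(coci(["dr_a", "dr_b", "dr_c"]))
```

Because the index only counts how visits are distributed across providers, it cannot reflect the geographical, socioeconomic, educational, or cultural aspects mentioned in the first limitation.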

Conclusions
This study fills the research gap in building a predictive model on massive, longitudinal data to estimate patients' COC. The modeling strategy of assessing many features and precisely identifying the prediction target achieved high performance. This methodology has the potential to be generalized to other diseases. After further optimization, the model could facilitate future clinical decisions and hospital management and improve outcomes.

The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Institutional Review Board Statement:
The institutional review boards of the UWM approved this study on the EHR and the administrative data.

Informed Consent Statement: Not applicable.
Acknowledgments: The authors would like to thank Katy Atwood for helping with the retrieval of the UWM data set and Michael D Johnson for useful discussions. Y.T. did the work at the University of Washington when she was a visiting student.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A

Table A1. The list of candidate features considered in the final model.
Features that are concerning diagnoses and calculated based on ICD-10 and ICD-9 diagnosis codes
No. of diagnoses of asthma; no. of ICD-10 and ICD-9 diagnosis codes; no. of primary asthma diagnoses; no. of years since first diagnosed with asthma in the data set; no. of diagnoses of status asthmaticus; whether the latest diagnosis of asthma is a primary one; the severity of the latest asthma diagnosis; the highest exacerbation severity among all of the asthma diagnoses; no. of diagnoses of acute asthma; no. of days since the latest asthma diagnosis; the highest severity among all of the asthma diagnoses; no. of diagnosis codes of noncompliance with the medication regimen; no. of days since the latest diagnosis of acute asthma or status asthmaticus; and the exacerbation severity indicated by the latest diagnosis of asthma (uncomplicated, exacerbation, or asthmaticus).

Features concerning medications
The sum of medications ordered; the sum of various medications ordered; no. of medication orders; the sum of medication refills authorized; the sum of asthma medications ordered; the sum of units of medications ordered; no. of medication orders concerning asthma; the sum of asthma medication refills authorized; the sum of various asthma medications ordered; the sum of units of medications ordered concerning asthma; no. of medication prescribers; no. of medication prescribers concerning asthma; the sum of short-acting beta-2 agonists (SABAs) ordered; the sum of units of SABAs ordered; the sum of refills authorized for SABAs; the sum of systemic corticosteroids ordered; the sum of units of systemic corticosteroids ordered; the sum of refills authorized for systemic corticosteroids; no. of reliever orders concerning asthma; the sum of refills authorized for asthma relievers; the sum of relievers ordered concerning asthma; the sum of diverse asthma relievers ordered; the sum of units of relievers ordered concerning asthma that are neither SABAs nor systemic corticosteroids; the sum of units of relievers ordered concerning asthma; the sum of relievers ordered concerning asthma that are neither SABAs nor systemic corticosteroids; the sum of controllers ordered concerning asthma; no. of controller orders concerning asthma; the sum of various asthma controllers ordered; the sum of units of controllers ordered concerning asthma; the sum of refills authorized for asthma controllers; the sum of refills authorized for inhaled corticosteroids; the sum of inhaled corticosteroids ordered; the sum of units of inhaled corticosteroids ordered; the sum of refills authorized for mast cell stabilizers; the sum of mast cell stabilizers ordered; the sum of units of mast cell stabilizers ordered; the sum of nebulizer medications ordered; no. of nebulizer medication orders; the sum of various nebulizer medications ordered; the sum of units of nebulizer medications ordered; the sum of refills authorized for nebulizer medications; whether a spacer was used; and whether a nebulizer was used.

Features concerning insurance
Whether the patient was enrolled in any public insurance; whether the patient's care was paid by charity or self-paid; and whether the patient was enrolled in any private insurance. We calculated the insurance-related features on the last day of the specific year.

Features concerning the visit types of the patient
No. of ED visits; the length of stay of the latest ED visit; no. of ED visits concerning asthma; the average length of stay of an ED visit; no. of outpatient visits; no. of visits of all types (ED visit, hospital stay, and outpatient visit); no. of outpatient visits with asthma as the primary diagnosis; the total length of hospital stay; no. of hospitalizations; the average length of stay of a hospitalization; the latest visit's admission type (trauma, urgent, elective, or emergency); the most emergent hospital admission type among all of the visits; no. of prime asthma visits; and the latest visit's type (ED visit, hospital stay, or outpatient visit). Following our prior paper [34], we defined a prime asthma visit as an ED visit with a diagnosis of asthma, a hospitalization with a diagnosis of asthma, or an outpatient visit with a primary diagnosis of asthma. An outpatient visit with only a secondary diagnosis of asthma was classified as a minor asthma visit.

Features concerning appointment and visit status
No. of no shows; and no. of canceled appointments.

Features concerning the home location of the patient
Whether the distance from the patient's home to the UWM is less than 5 miles.

Features that are concerning diagnoses and calculated based on ICD-10 and ICD-9 diagnosis codes (Comorbidity features)
Allergic rhinitis; sleep apnea; cystic fibrosis; COPD; anxiety or depression; eczema; obesity; gastroesophageal reflux; bronchopulmonary dysplasia; and sinusitis.
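
The definition of prime and minor asthma visits given with the visit type features in Table A1 can be expressed as a small classification rule. The following is a minimal sketch under stated assumptions: the `Visit` record, its field names, and the visit type labels are hypothetical and do not reflect the layout of the UWM data set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Visit:
    # Hypothetical record layout; field names are illustrative only.
    visit_type: str         # "ED", "hospitalization", or "outpatient"
    asthma_primary: bool    # asthma is the primary diagnosis
    asthma_secondary: bool  # asthma appears as a secondary diagnosis


def classify_asthma_visit(visit: Visit) -> str:
    """Return "prime", "minor", or "none" following the Table A1 definition."""
    has_asthma = visit.asthma_primary or visit.asthma_secondary
    # ED visits and hospitalizations count as prime with any asthma diagnosis.
    if visit.visit_type in ("ED", "hospitalization"):
        return "prime" if has_asthma else "none"
    # Outpatient visits are prime only when asthma is the primary diagnosis;
    # with only a secondary asthma diagnosis they are minor.
    if visit.visit_type == "outpatient":
        if visit.asthma_primary:
            return "prime"
        if visit.asthma_secondary:
            return "minor"
    return "none"


def count_prime_asthma_visits(visits: List[Visit]) -> int:
    """Count prime asthma visits, matching the 'no. of prime asthma visits' feature."""
    return sum(classify_asthma_visit(v) == "prime" for v in visits)
```

For example, an outpatient visit with asthma recorded only as a secondary diagnosis is classified as minor and does not contribute to the prime asthma visit count.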