Big Data Analytics to Reduce Preventable Hospitalizations—Using Real-World Data to Predict Ambulatory Care-Sensitive Conditions

The purpose of this study was to develop a prediction model to identify individuals and populations with a high risk of being hospitalized due to an ambulatory care-sensitive condition who might benefit from preventative actions or tailored treatment options to avoid subsequent hospital admission. In 2019, 4.8% of all individuals observed had an ambulatory care-sensitive hospitalization, and 6389.3 hospital cases per 100,000 individuals were observed. Based on real-world claims data, the predictive performance of a machine learning model (Random Forest) was compared with that of a statistical logistic regression model. Both models achieved a generally comparable performance with c-values above 0.75, with the Random Forest model reaching slightly higher c-values. The prediction models developed in this study reached c-values comparable to those reported in the literature for prediction models of (avoidable) hospitalization. The models were designed so that, where claims data are available, they can support integrated care or public and population health interventions with little effort as an additional risk assessment tool. For the regions analyzed, the logistic regression revealed that switching to a higher age class or to a higher level of long-term care, as well as a unit increase in prior hospitalizations (all-cause and due to an ambulatory care-sensitive condition), increases the odds of having an ambulatory care-sensitive hospitalization in the upcoming year. This is also true for patients with prior diagnoses from the diagnosis groups of maternal disorders related to pregnancy, mental disorders due to alcohol/opioids, alcoholic liver disease and certain diseases of the circulatory system. Further model refinement activities and the integration of additional data, such as behavioral, social or environmental data, would improve both model performance and the individual risk scores.
The implementation of risk scores identifying populations potentially benefitting from public health and population health activities would be the next step to enable an evaluation of whether ambulatory care-sensitive hospitalizations can be prevented.


Introduction
Health systems in developed countries face a variety of challenges, including a rising demand for health services due to demographic changes, increasing multi-morbidity, unhealthy behaviors and financial constraints [1]. These challenges are reinforced by highly fragmented processes of healthcare delivery, which may be overcome in care models and settings that focus on creating value for individuals and also incorporate preventative action [2]. Putting people rather than siloed provider structures or diseases in the center, integrated health systems are fueled by an integration of health information technology infrastructure and can benefit from advanced models of health data analytics [3,4]. Big data analytical capabilities are recognized as one of the most important innovations in healthcare in the past decade [5,6], and advances in prediction models provide great opportunities, e.g., in the identification of risk groups or in the prediction of hospitalization. One field of specific political interest is the analysis and reduction of ambulatory care-sensitive hospitalizations (ACSH), i.e., inpatient hospital cases that are at least partly considered avoidable with improved care in the outpatient sector, in the context of nursing homes or through prevention achieved, e.g., by public health activities [7-9]. Reductions in ACSH can both improve the patient experience and avoid unnecessary usage of health system resources, so that the ACSH rate is also used as a measure of healthcare quality [10,11]. A study analyzing the cost associated with ACSH in the German health insurance system estimated a cost of EUR 3.5 billion per year (increasing per year by 0.9%) based on the mean costs of such hospital cases from the German Diagnosis Related Group (DRG) system [12].
To support action towards the reduction of unnecessary hospital cases, the aim of this study was to develop a prediction model based on real-world claims data to identify individuals or populations with a high risk of being hospitalized due to an ambulatory care-sensitive condition who then might receive special attention or benefit from tailored prevention activities or treatment options. This is comparable to an approach of the Veterans Health Administration providing patient-specific care assessment need scores based on data from the corporate data warehouse that can be accessed by healthcare providers and population health managers [13]. Several studies exist predicting (re-)hospitalizations in general [7,14-19], but only a few specifically predict ACSH in the context of the health systems of the USA, Canada and Italy [10,20-22]. While the methodologies are comparable to a certain extent, this study extends the context to Germany. On the one hand, this is valuable since ACSH definitions are most often adapted to the specific health system characteristics, and therefore models and results from other contexts cannot be directly transferred or put into practice. On the other hand, the fact that the results of this model are actually implemented in regional population health and integrated care interventions in Germany is another special feature of this work. Based on risk scores and predefined thresholds, warning signs could be implemented in the information systems of responsible medical and non-medical experts who can then suggest certain measures or adjust their actions. The action derived from such risk assessments would ideally lead to improved prevention, better healthcare quality for those affected by or being at risk of certain diseases and reduced cost for the community [23]. To achieve a reliable prediction, a statistical model based on a logistic regression was compared to a machine learning model based on the Random Forest method.

Materials and Methods
In the following section, the concept of ambulatory care-sensitive hospitalizations is described followed by a description of the database and the analytical method of model construction. The section closes with a definition of the outcome variable and the independent variables of the prediction models.

Ambulatory-Care Sensitive Conditions/Hospitalizations
Due to inconsistent definitions and varying national health system characteristics, there is no scientific consensus on which conditions are understood to be ambulatory care-sensitive conditions (ACSC) or what defines an ambulatory care-sensitive hospitalization (ACSH). Generally, an ACSC is a diagnosis for which timely and effective activities "can help to reduce the risk of hospitalization by either preventing the onset of an illness or condition, controlling an acute episodic illness or condition, or managing a chronic disease or condition", and an ACSH is a hospitalization due to an ambulatory care-sensitive condition [7]. The international statistical classification of diseases and related health problems (ICD) helps to make definitions comparable, but coding and care provision may differ at the regional or country level [24]. Due to specific health system characteristics, some diseases might be treated as inpatient cases in one context and as ambulatory cases in another. Since this study is built on German data, a definition developed for the German healthcare system was used. This categorization of ACSC contains 258 singular ICD-10 diagnoses, summarized in 40 groups of which 22 groups constitute a core list. The 22 groups of the core list have a relatively high preventability score of more than 50%, varying between 58% for gonarthrosis and 94% for dental diseases [8]. See Sundmacher et al. for the full list of ICD-10 codes of ambulatory care-sensitive conditions used for this study [25]. At the least, the core list includes chronic diseases that are also commonly included in definitions of ACSC in the context of other countries [10].

Database
The database used for this study is deidentified insured-level claims data (n = 69,392) from two regional integrated care networks of OptiMedis AG, an integrated care management organization [26]. The regions are set as one rural and one urban area, each accounting for nearly half of the population size. Data were fully available for the years 2016-2019. The data set itself does not fulfil the 3-V characteristics of big data [27,28]. However, it has been shown that claims data are valuable in assessing quality and efficiency of care and have the advantage of being easily accessible in an electronic format without needing additional documentation [29]. The database contains information on patient demographics, in-and outpatient care, work incapacity, drugs, nonmedicinal remedies and aids, rehabilitation and long-term care services [30]. To account for country specifics in the data, the German guideline for claims data analysis was considered [31].

Big Data Analytics and Prediction Models
For big data analytics, there is also no agreed-upon definition. Performing predictive or explorative analytics (taken together also labelled as advanced analytics) on sets meeting the definition of big data is one approach to define big data analytics [5,6]. Another refers to the usage of inductive machine learning approaches suited for high-dimensional data sets [32]. As the database available for this study did not fulfil the 3-V characteristics, the second definition is adapted, and the term big data analytics therefore refers to the method instead. Most of the models in the literature rely on statistical methods, especially the logistic regression, and machine learning methods, such as Random Forests, Neural Networks or Support Vector Machines [14,15]. In this study, the predictive performance of a statistical model (logistic regression) is compared to that of a machine learning model (Random Forest). Supervised machine learning, such as the Random Forest model, is flexibly applicable on complex data of various structures. During the model building process, assumptions about the data distribution can be adapted, whereas most Random Forest algorithms assume a Gaussian distribution per default. Furthermore, the outcome variable has to be human-labelled, and the prediction is deduced based on three stages in a causal chain: training, validation and testing [33,34]. To train the model, a data set is analyzed to identify discriminating features of the predictor and optimization algorithms are performed to reproduce the outcome [35]. The Random Forest model randomly selects a predefined number of distribution criteria and grows several trees that categorize the individual observations. A majority vote over all trees then defines the class. There is not one specific Random Forest algorithm, rather many different algorithms exist. This analysis was performed in R statistics using the ranger package [36]. 
The number of variables tested at each node was the square root of the number of numerical variables. The number of iterations, i.e., the number of trees in the forest, was set to 500 [37].
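The two Random Forest ingredients described above, the number of candidate variables per split and the majority vote over trees, can be illustrated in a few lines. This is a minimal sketch in Python rather than the R/ranger setup used in the study; the function names are hypothetical, rounding the square root down is an assumption, and the vote share is only one simple way to turn tree votes into a score.

```python
import math
from collections import Counter
from typing import Hashable, Sequence

N_TREES = 500  # number of trees stated in the text


def candidate_variables_per_split(n_features: int) -> int:
    """Square root of the number of variables, rounded down (rounding is an assumption)."""
    return max(1, math.floor(math.sqrt(n_features)))


def majority_vote(tree_votes: Sequence[Hashable]) -> Hashable:
    """Class predicted by most trees; ties go to the class encountered first."""
    return Counter(tree_votes).most_common(1)[0][0]


def vote_share(tree_votes: Sequence[Hashable], positive: Hashable = 1) -> float:
    """Fraction of trees voting for the positive class (illustrative risk score)."""
    return sum(1 for vote in tree_votes if vote == positive) / len(tree_votes)


# Hypothetical example: 150 of 500 trees predict an ACSH (class 1).
votes = [1] * 150 + [0] * (N_TREES - 150)
predicted_class = majority_vote(votes)
share = vote_share(votes)
```

With these hypothetical votes, the forest predicts no ACSH, while the vote share could still serve as a graded score for ranking individuals.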

Outcome Variable and Independent Variables
The outcome variable was defined similar to prior studies focusing on ACSH prediction [10,20,21]. It is the event of an individual being hospitalized with an ACSC in the prediction year. The full list model of ACSC comprises the above-mentioned 258 singular ICD-10 diagnoses. To assess whether it improves the model performance, an outcome variable was also defined, focusing only on the core list of ACSC with only 164 diagnoses (core list model) [8]. Death was not investigated as no information regarding the cause of death was available.
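The outcome definition could be derived from case-level claims roughly as follows. This is a minimal sketch under stated assumptions: the record fields (`person_id`, `year`, `discharge_icd`) and the function name are hypothetical, the two ICD-10 categories are placeholders for illustration and not the list from [25], and matching on the three-character category is likewise an assumption.

```python
from typing import Dict, Iterable, Set

# Placeholder ICD-10 categories for illustration only (assumption);
# the study uses the full ACSC list published in [25].
EXAMPLE_ACSC_CATEGORIES: Set[str] = {"I50", "J44"}


def acsh_outcome(cases: Iterable[Dict], acsc_categories: Set[str],
                 prediction_year: int) -> Set[str]:
    """Return the ids of persons with at least one ACSH in the prediction year."""
    flagged = set()
    for case in cases:
        # Compare on the three-character ICD-10 category (assumption).
        category = case["discharge_icd"][:3].upper()
        if case["year"] == prediction_year and category in acsc_categories:
            flagged.add(case["person_id"])
    return flagged


# Hypothetical hospital cases.
cases = [
    {"person_id": "A", "year": 2019, "discharge_icd": "I50.1"},
    {"person_id": "B", "year": 2019, "discharge_icd": "S72.0"},
    {"person_id": "C", "year": 2018, "discharge_icd": "J44.1"},
]
outcome = acsh_outcome(cases, EXAMPLE_ACSC_CATEGORIES, 2019)
```

Switching between the full list model and the core list model would then only require passing a different set of codes.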
Independent variables with a high predictive value in previous studies were medical diagnoses and prescribed medications, prior healthcare utilization as well as multimorbidity and polypharmacy measures [16]. The following variables were used for the construction of the prediction models: age as a categorical variable in 16 classes (0-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, ≥85), gender (male vs. female), insurance status (employees, pensioners, children, unemployed, others), number of physician visits (GP and specialists), days of incapacity for work, number of hospitalizations (all-cause and ACSH); length of hospital stays in days, mean number of drug prescriptions per quarter (drug count), a polypharmacy measure (max amount prescribed on a given day), a multimorbidity measure (modified Charlson score [38]), enrollment in a German disease management program (coronary heart disease, asthma, type 2 diabetes, COPD), long-term care level (categorical variable in 4 classes: 0 = no care level, 1 = lowest care level, 2 = medium care level and 3 = highest care level including special hardship cases), days in any long-term care level (except 0) per year (0-365) and an inpatient and outpatient medical disease history of ACSC (distinct ACSC groups based on the International Statistical Classification Of Diseases And Related Health Problems, 10th revision, German Modification, discharge diagnoses in the inpatient setting and diagnoses with the feature "ensured" in the outpatient setting). All variables cover a time horizon of four years.
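As an illustration of how the categorical age variable might be derived, the following Python sketch maps an age in years to one of the 16 classes listed above. The names `AGE_CLASS_LABELS` and `age_class` are hypothetical, and the study's own preprocessing (done in R) may differ.

```python
from bisect import bisect_right

# Lower bounds of the 16 age classes: 0, 15, 20, ..., 85.
LOWER_BOUNDS = [0] + list(range(15, 90, 5))

# Labels: "0-14", "15-19", ..., "80-84", ">=85".
AGE_CLASS_LABELS = (
    ["0-14"]
    + [f"{lower}-{lower + 4}" for lower in range(15, 85, 5)]
    + [">=85"]
)


def age_class(age: int) -> str:
    """Return the label of the age class containing the given age in years."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Index of the last lower bound that does not exceed the age.
    return AGE_CLASS_LABELS[bisect_right(LOWER_BOUNDS, age) - 1]
```

In a logistic regression, each class other than a chosen reference class would typically enter as a dummy variable, which is why the results speak of "switching to a higher age class".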

Results

Model Construction and Descriptive Cohort Analysis
The process of model construction distinguishes between training and test data sets. In this study, 2019 was set as the prediction year. Thus, model building was conducted on a training set from 2018, whereas the disease history was observed from 2016 to 2018. Model evaluation was performed based on the test set with the outcomes being observed in 2019. The exclusion of certain variables is a common step in designing risk prediction models. The insurance duration was a major exclusion criterion. In order not to include individuals that were not insured with their current health insurance company for a considerable amount of time and thus had missing data, a threshold was determined. Individuals had to be insured for 360 days or more in the prediction year as well as for at least 300 days in each of the previous four years. Thereby, deceased individuals were indirectly excluded which was considered as unproblematic as it is doubtful whether the respective hospital cases might have been preventable in the sense of the ACSH concept. Of the list of ACSC, the group "rare diseases with 5000 cases each" was also excluded as not enough cases were documented in the data set.
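The insurance-duration criterion can be expressed as a simple filter. The sketch below is illustrative only: the function name, the dictionary layout (insured days per calendar year) and the choice of history years passed in the example are assumptions, while the 360-day and 300-day thresholds are taken from the text.

```python
from typing import Dict, Iterable


def is_eligible(insured_days: Dict[int, int],
                prediction_year: int,
                history_years: Iterable[int],
                min_days_prediction: int = 360,
                min_days_history: int = 300) -> bool:
    """Check the insurance-duration thresholds for one individual."""
    # Missing years count as zero insured days.
    if insured_days.get(prediction_year, 0) < min_days_prediction:
        return False
    return all(insured_days.get(year, 0) >= min_days_history
               for year in history_years)


# Hypothetical individuals (insured days per year).
continuously_insured = {2016: 366, 2017: 365, 2018: 365, 2019: 365}
joined_late = {2017: 120, 2018: 365, 2019: 365}

eligible_a = is_eligible(continuously_insured, 2019, (2016, 2017, 2018))
eligible_b = is_eligible(joined_late, 2019, (2016, 2017, 2018))
```

Because an individual who dies during the prediction year cannot reach 360 insured days, this filter also reproduces the indirect exclusion of deceased individuals mentioned above.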
To better understand the characteristics of the population with an ACSH, descriptive analyses of the underlying demographics were performed. Results are presented in Table 1. A rate of 4.8% of all individuals had an ACSH in 2019, and 6389.3 hospital cases per 100,000 individuals could be observed. As expected, the population with an ACSH is older, has a higher comorbidity score and higher utilization measures in nearly all sectors.  Table 2 displays the ACSH cases from the core list (22 diagnosis groups) per 100,000 individuals in the prediction year 2019. The most common ACSC disease groups in the study population were cardiovascular diseases, bronchitis and chronic obstructive pulmonary disease (COPD), mental disorders and infectious diseases. Independent variables with a significant effect on the outcome prediction for having an ACSH in the subsequent year according to the logistic regression models are displayed in Table 3. See Tables A1 and A2 in Appendix A for the regression coefficients, odds ratio (OR) and confidence intervals (CI; 95%) of all variables of the logistic regressions. A significant negative correlation with an odds ratio below 1 was found for being female and having an outpatient diagnosis for diseases of the skin. The latter finding might be due to the fact that these conditions in the regions observed are treated most often in an outpatient setting. Besides switching to a higher age-class, which has a strong positive correlation, the strongest feature for having an ACSH was having a previous outpatient diagnosis from the disease group "maternal disorders related to pregnancy", pointing to the fact that expectant mothers with health problems during their pregnancy take advantage of hospital care at an above average rate and thereby have an increased risk of subsequently receiving a discharge diagnosis included on the ACSC list. The birth itself or related complications during birth are of course not part of the ACSC list. 
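For readers less familiar with logistic regression output, the relation between a regression coefficient, its odds ratio and a 95% confidence interval can be sketched as follows. This assumes a Wald-type interval (coefficient plus or minus 1.96 standard errors on the log-odds scale); the source does not state how its intervals were computed, and the function name and input values are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(coefficient: float, std_error: float,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower CI bound, upper CI bound) for one coefficient."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return odds_ratio, lower, upper


# Hypothetical coefficient: a log-odds increase of ln(2) doubles the odds.
or_value, ci_low, ci_high = odds_ratio_with_ci(math.log(2), 0.1)
```

A negative coefficient yields an odds ratio below 1, which is what the text describes for being female; an effect is conventionally called significant at the 5% level when the interval excludes 1.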
Further significant positive correlations were found for switching to a higher level of long-term care, a unit increase in the number of prior hospitalizations (all-cause and cases due to an ACSC), and unit increases of the drug count and the number of specialist visits. Specific previously documented disease groups with a significant effect were, e.g., alcohol-related disorders, circulatory diseases, ear nose throat infections and diabetes in the outpatient setting, heart failure and hypertension in the inpatient setting or depressive disorders in both settings. Due to the rather small number of persons with long-term care and sick leaves in the sample, small but significant effects were also found for a unit increase (numerical variables ranging from 0-365) of the days in a high long-term care level or a unit increase of the duration of sick leaves in days. Having a diagnosis of heart failure was only significant in the core list model. Quite surprisingly, the number of GP visits did not show a significant effect. The fact that the Charlson comorbidity score did not show a significant effect with ACSH might be because this index was originally developed to predict one-year-mortality rates in hospital [39] so that the conditions taken into consideration might be severe rather than preventable as defined by the ACSH concept.

Table 3. Odds ratio of significant independent variables (except age classes) of the logistic regression models for predicting ACSH in the two scenarios.

With respect to the Random Forests, variable importance values were calculated using the impurity-corrected mode based on the Gini Index as part of the ranger package [36]. In the core list scenario, drug count, previous hospitalizations (all-cause, due to an ACSC, due to diabetes or due to hypertension) and the duration of a hospital stay in the previous year were the variables with the highest predictive value.

Comparison of the Predictive Model Performances
The performance of the models was evaluated and compared based on the c-statistic. The c-statistics indicate that the Random Forest model performs slightly better than the logistic regression model in predicting the outcome variable of having an ACSH in the prediction year, both in the full list and in the core list scenario (see Table 4). For a subset of the data from one health insurance company (n = 29,275), further evaluation criteria in the form of sensitivity, specificity and the positive and negative predictive values were applied [23]. Related to the outcome variable, sensitivity is defined as the percentage of individuals with an ACSH who are correctly identified as having an ACSH in the upcoming year. Specificity, on the other hand, is the percentage of individuals without an ACSH who are correctly identified as such. Additional risk thresholds also used by Louis et al. [21] were implemented. The category "high risk" includes individuals with a predicted probability of 15% to 24%; the category "very high risk" includes individuals with a predicted probability of 25% and higher to have an ACSH in the prediction year. For the core list scenario, this categorization results in the values summarized in Table 5. Generally speaking, for these two cut-off points, the Random Forest achieved higher sensitivity scores but lower specificity scores, i.e., from the very high risk cohort it identifies more individuals who actually have an ACSH in the upcoming year than the logistic regression (50.0% versus 42.9% for the core list model). However, it also identifies more individuals erroneously (1 minus the specificity, i.e., 11.1% versus 8.9% of the population not having an ACSH). Conversely, the positive predictive value for the logistic regression is higher.
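The evaluation measures named above can be made concrete with a small Python sketch. It computes the c-statistic as the share of concordant pairs (ties counted as one half), assigns the two risk categories from the text, and derives sensitivity, specificity and predictive values at a cut-off. The function names, the label "lower risk" for probabilities below 15% and the toy data are assumptions for illustration and do not reproduce the study's results.

```python
from typing import Dict, Optional, Sequence


def c_statistic(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Share of (case, non-case) pairs in which the case has the higher score."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_cases = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not non_cases:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p_case in cases:
        for p_non in non_cases:
            if p_case > p_non:
                concordant += 1.0
            elif p_case == p_non:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(non_cases))


def risk_category(prob: float) -> str:
    """Risk categories from the text; 'lower risk' is a hypothetical label."""
    if prob >= 0.25:
        return "very high risk"
    if prob >= 0.15:
        return "high risk"
    return "lower risk"


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def threshold_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                      cutoff: float) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity, PPV and NPV when flagging prob >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        flagged = p >= cutoff
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Toy data (hypothetical): 1 = ACSH in the prediction year.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.30, 0.10, 0.20, 0.05, 0.02]
auc = c_statistic(y_true, y_prob)
metrics = threshold_metrics(y_true, y_prob, cutoff=0.15)
```

Lowering the cut-off flags more individuals, which raises sensitivity at the expense of specificity; this is the trade-off visible in the comparison between the two models.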

Discussion
In the course of efforts to improve value in health systems, tools such as prediction models for ACSH can provide a valuable contribution to better steer interventions and allocate resources. In this paper, a risk prediction model with good reliability and wide applicability based on routinely collected administrative data was developed that can be used to improve not only primary care but also population health management and public health prevention by supporting providers with additional information. The fact that age has a strong positive correlation with ACSH is in line, e.g., with a population-based analysis of ACSC in Ireland showing that 69.1% of all ACSCs were found in adults over 65 [40]. The diagnosis groups with a high odds ratio, such as maternal disorders related to pregnancy, mental disorders due to alcohol or opioids, alcoholic liver diseases, certain diseases of the circulatory system or depressive disorders, could give hints for population health managers about which risk groups to address with intensified effort in a region. The individually calculated risk scores could be implemented in clinical or non-clinical information systems within the integrated care systems as an extension of the information base of the providers. Conversely, if further data, e.g., extracted directly from electronic health records, were also incorporated into the prediction models, not only more accurate, but also more up-to-date results could be calculated.
The models developed in this publication achieved c-statistics comparable to Billings et al.  [10], indicating a good model fit above the median of 0.68 of a systematic review of prediction models for rehospitalization [10]. However, perhaps due to the smaller sample size, the model performance did not reach that of Louis et al. (0.856) [21] or Gao et al. (0.833) [20]. In contrast to other studies in the field of hospital care, in this study we did not discriminate between emergency and elective admissions following the argument that an elective inpatient episode can also be a sign of unforeseen deterioration. One special feature in this study is that the Random Forest model outperforms the logistic regression model in both scenarios. The differences are not very pronounced and seem to decrease when the ACSC diagnoses are specified via the core list. Although reaching slightly higher c-values, a substantial benefit of the machine learning technique over the logistic regression model could not be found. In this specific use case, this might have been due to the fact that the database did not meet the 3-V criteria of big data. It seems understandable that a machine learning methodology alone does not lead to a superior outcome prediction because such methods applied to rather small data sources are limited in their ability to optimize the inductive feature selection process they are designed for [41,42]. Compared to a statistical regression model, it is more difficult for a machine learning model, such as Random Forest, to elucidate why one independent variable is more important than another in the feature selection process. While this may be negligible in a result-oriented perspective of calculating individualized risk scores, a link to causality and deliberations about the meaningfulness of the results should nevertheless be part of a comprehensive data mining approach [41]. 
Aspiring to the task of supporting providers with additional information on risk groups, in this regional context there seems to be no clear advantage of the Random Forest model. In general to date, big data analytics in healthcare found little evidence of anything surprisingly new that can effectively improve decision making or medical outcomes [43]. This does not mean that such methods do not have the potential to do so. Rather, data exchange and people-centered data collection may need to be further developed first [4]. Although predictions were meant to be derived for people in the context of the integrated care systems so that training and test sets contained the same persons, it might be valuable to test the predictive performance in populations which were not part of the training set, which was not possible in this context due to limited data availability.
A general limitation with respect to claims data is that it is collected for billing purposes, rendering it vulnerable to changes in the remuneration system, specific coding schemes or documentation errors, thus affecting the prediction results [44]. In addition, the decision of which ACSC to consider in the model building process affects the results, hampers cross-country comparisons and should be part of an ongoing model refinement process. Model refinement activities, such as hyperparameter tuning, would be useful extensions which were not applied in this study as a split of the training set into various subsets would most likely have led to subsets being too small for cross validation. Generally, most prediction models would likely benefit if a bigger data set and more independent variables were available for model optimization. Potentially valuable variables not covered in claims data would be, e.g., specific medications and dosages, ethnicity, marital status, behavioral data, lab test results, environmental data such as pollution or neighborhood characteristics, information on social support, living arrangements, the availability and proximity of hospitals as well as ambulatory treatment options [45,46], socioeconomic data, biomarker data, data from health sensors or patient-reported (outcome) data [4]. However, if additional data were to be integrated, other challenges such as interoperability would likely occur [47]. Usage of data directly extracted from primary systems, such as electronic health records, or from health platforms could enable timelier predictions as claims data encompass a certain time lag due to billing procedures.
To avoid underperforming models misinforming clinical decision makers, analytical modelling standards and an agreed-upon framework for transparent evaluation would be needed [48]. This also raises ethical issues, e.g., if a prediction model provides seriously harmful recommendations for some individuals. This ethical concern is not applicable in the current use case because the risk scores are only meant to support public health and population health managers or clinicians in deciding on additional or intensified interventions, without any proposal or judgement about the different options. Nevertheless, an appropriate framework for privacy protection and patient consent is indispensable. A subsequent general challenge for prediction models and the resulting risk scores is their actual application in the daily routines of public health professionals or clinicians [49]. From an organizational perspective, resistance against expanding electronic data exchange between different stakeholders and redesigning workflows with data-driven feedback needs to be overcome [13,50] so that pilot interventions seeking to reduce ACSH can have measurable effects. Transferring the model to new regions could show how these differ from the ones analyzed in this study. In all likelihood, other disease groups or continuous variables will show significant effects, leading to adapted intervention planning and allowing a cross-regional comparison based on the same outcome definition.

Conclusions
The risk score predictions presented in this study might be a starting point for reducing the number of ACSH on a regional level within an integrated care model incorporating public and population health activities and clinical process improvements. To proactively prevent ACSH, the results of such prediction models could steer interventions to those individuals with the highest risks and support decision making for which preventative action might be appropriate to deliver the best care or who might benefit from extra attention outside of the inpatient sector. Important next steps include continuously updating and refining the model with new data. Multidisciplinary teams will be involved to build practical and feasible solutions that engage stakeholders in the care process to use the results of such models, provided that the scores prove to be reliable. Once the accuracy of the risk scores presented here has been further tested, the next question is whether it can prevent future hospital admissions or at least delay them and thus reduce the overall number of admissions. To answer this question, further studies and evaluations would be needed that focus on gaining impact with such prediction models.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data used for this paper were provided from the data warehouse of the OptiMedis AG. It comprises deidentified claims data from three different health insurance companies from one rural and one urban area in Germany. With specific permission and in an aggregated format, the data can be used for care improvement and research purposes, but publication or provision of the original raw data of individuals is contractually prohibited.

Acknowledgments:
We would like to thank Pascal Wendel for technical support, Laura Lange for methodological support and Sophie Wang for linguistic proof-reading.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A1. Regression coefficients of the independent variables in the logistic regression models for predicting ACSH in the full list model.