A Simple Free-Text-like Method for Extracting Semi-Structured Data from Electronic Health Records: Exemplified in Prediction of In-Hospital Mortality

Abstract: The Epic electronic health record (EHR) is a commonly used EHR in the United States. This EHR contains large semi-structured "flowsheet" fields. Flowsheet fields lack a well-defined data dictionary and are unique to each site. We evaluated a simple free-text-like method to extract these data. As a use case, we demonstrate this method in predicting mortality during emergency department (ED) triage. We retrieved demographic and clinical data for ED visits from the Epic EHR (1/2014–12/2018). Data included structured fields, semi-structured flowsheet records, and free-text notes. The study outcome was in-hospital death within 48 h. Most of the data were coded using a free-text-like Bag-of-Words (BoW) approach. Two machine-learning models were trained: gradient boosting and logistic regression. Term frequency-inverse document frequency was employed in the logistic regression model (LR-tf-idf). An ensemble of LR-tf-idf and gradient boosting was also evaluated. Models were trained on years 2014–2017 and tested on year 2018. Among 412,859 visits, the 48-h mortality rate was 0.2%. LR-tf-idf showed an AUC of 0.98 (95% CI: 0.98–0.99). Gradient boosting showed an AUC of 0.97 (95% CI: 0.96–0.99). An ensemble of both showed an AUC of 0.99 (95% CI: 0.98–0.99). In conclusion, a free-text-like approach can be useful for extracting knowledge from large amounts of complex semi-structured EHR data.


Introduction
In the last decade, the medical world has been exposed to two important concepts related to digital information: Big-Data and Artificial Intelligence (AI) [1]. Bringing these two concepts together enables the creation of increasingly accurate prediction models.
One setting that can benefit from decision support tools is emergency department (ED) triage. EDs are becoming increasingly crowded, impairing patient outcomes [2][3][4][5][6]. Decision support tools can aid the expedited assessment of patients during initial ED triage. Several clinical triage acuity scores have been developed; the most commonly used is the five-level emergency severity index (ESI). In recent years, studies have evaluated different decision support tools at triage [7][8][9]. Yet, most of these studies used a limited number of variables.
Today, the electronic health record (EHR) stores a wealth of information for each patient, both as tabular data and as free text. Patient cohort data are usually stored in a two-dimensional matrix: rows represent individual patients, and columns represent the data for each patient. While many machine-learning models strive to use a large number of rows, usually only a limited number of columns are utilized.
The Epic EHR (Epic Systems Corporation, Verona, WI) is one of the most commonly used EHRs in the United States. It is estimated that more than 250 million patients have a current electronic record in Epic [10]. Epic stores a majority of the data inside documents or structures called "flowsheets". These fields contain vast amounts of semi-structured items that pertain to patient assessment. Flowsheet data lack a well-defined external ontology or data dictionary and are often unique to each implementation of Epic. This makes it difficult to utilize the information contained within them, which may include valuable clinical observations. We hypothesized that a free-text approach could help utilize the semi-structured Epic data.
We evaluated a simple free-text-like method to extract semi-structured EHR data. We tested this method on two machine-learning models and on an ensemble of both. First, we trained a logistic regression model. Logistic regression is a well-established model; it is easy to implement, easy to interpret, and does not require significant resources. Second, we trained the XGBoost implementation of the gradient boosting algorithm. Gradient boosting is a machine-learning algorithm in which multiple weak learners are trained to augment each other and together produce superior results. At each stage, a new decision tree is trained to correct the errors made by the existing trees. As a non-linear method, it often outperforms linear models when higher-order relationships exist in the data. Gradient boosting has also surpassed other machine-learning algorithms in a number of data challenges.
As a use case, we demonstrate this method in predicting in-hospital mortality during ED triage.

Materials and Methods
The Mount Sinai Hospital institutional review board (IRB) approval was granted for this retrospective study. Informed consent was waived by the IRB committee.
The study was conducted at the Mount Sinai Hospital (MSH) in New York City, a large academic tertiary center with approximately 110,000 annual ED visits. The study time frame was 1 January 2014 to 31 December 2018.
We retrieved records of consecutive adult (age ≥ 18) patients admitted to the ED. Erroneously created and duplicate charts were excluded. We also excluded visits without triage notes and patients who died within 30 min from the triage note.
Both structured and free-text time-stamped data were retrieved from the EHR. All items were limited to those documented up to 30 min from the triage note. Data points included: demographics (age, sex, ethnicity); arrival mode (walk-in, by ambulance, or by intensive care ambulance); chief complaints; comorbidities, coded using the International Classification of Diseases (ICD-10) and grouped using the diagnostic clinical classification software (CCS); first vital sign measurements; acuity level (ESI); laboratory orders; nursing and physician free-text notes; and all of Epic's flowsheet records from the visit.
The primary outcome was in-hospital death within 48 h. As a secondary outcome we evaluated overall in-hospital death.

Data Representation
Both semi-structured and free-text data were encoded using a Bag of Words (BoW) approach [11]. BoW is a commonly used approach in natural language processing (NLP). In BoW, a text paragraph is represented as an unordered collection (bag) of its words. A classifier then classifies paragraphs based on the frequency of words in the "bags". A sparse matrix representation was used for the BoW collections.
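As an illustration, the BoW encoding with sparse row storage can be sketched in plain Python (whitespace tokenization is assumed here; the study's actual tokenization rules are not specified at this level of detail):

```python
from collections import Counter

def bow_encode(docs):
    """Encode documents as Bag-of-Words rows, storing only non-zero counts.

    Illustrative sketch only; assumes simple lowercase whitespace
    tokenization, which the paper does not spell out.
    """
    vocab = {}                      # term -> column index
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = {}                    # column index -> count (sparse row)
        for term, n in counts.items():
            col = vocab.setdefault(term, len(vocab))
            row[col] = n
        rows.append(row)
    return rows, vocab

rows, vocab = bow_encode(["chest pain severe", "pain in leg"])
```

Storing only the non-zero entries per row is what makes the sparse matrix representation feasible at scale.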
BoW collections were used to represent the following data: nursing and physician free-text notes, flowsheet records, comorbidities, chief complaints and lab orders. For each of these items we also encoded the time in minutes from triage note to the item as a separate BoW collection.
We also created a BoW container to represent "past stories". This encoded the number of previous ED visits and hospitalizations, number of days to previous visits, type of ward if hospitalized and chief complaints during the previous visits.
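For illustration, such "past story" tokens could be generated along these lines (the field names and token formats are hypothetical; the paper does not publish its exact schema):

```python
def past_story_tokens(prev_visits):
    """Turn a patient's visit history into BoW-style tokens.

    `prev_visits` is a hypothetical list of dicts; the field names
    here are illustrative, not the study's actual schema.
    """
    tokens = [f"n_prev_visits:{len(prev_visits)}"]
    for v in prev_visits:
        tokens.append(f"days_ago:{v['days_ago']}")
        tokens.append(f"prev_cc:{v['chief_complaint']}")
        if v.get("admitted_ward"):
            tokens.append(f"prev_ward:{v['admitted_ward']}")
    return tokens

toks = past_story_tokens(
    [{"days_ago": 30, "chief_complaint": "chest_pain", "admitted_ward": "cardiology"}]
)
```

The resulting tokens can then be fed into the same BoW pipeline as any other text.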
All other variables (demographics, mode of arrival, vital signs) were concatenated to the BoW collections.

Machine Learning Models
Two machine-learning methods were trained: gradient boosting and logistic regression. We tested logistic regression with term frequency-inverse document frequency (LR-tf-idf) [11]. An ensemble of LR-tf-idf and gradient boosting was also evaluated. Figure 1 presents the schematics of the models. Continuous variables were normalized (Z-scores) for the logistic regression model. Normalization was not used for the gradient boosting model, as this model "cuts" above and below the desired value and is thus not affected by linear transformations.
Models were trained on data from the years 2014-2017 and tested on data from the year 2018. This ensures no chronological leakage of information.

Logistic Regression
A term frequency-inverse document frequency (tf-idf) approach was applied to the BoW collections. Tf-idf balances the importance of a word to the document (tf) against the frequency of the word in the corpus (idf).
The tf-idf weight for each word (w) in a document is:

tf-idf(w) = tf(w) × log(Total number of documents / Number of documents containing w)
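This weighting can be sketched in plain Python (a minimal version; library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization not shown here):

```python
import math

def tf_idf(docs):
    """Weight each word w in each document d by tf(w, d) * log(N / df(w)).

    Minimal sketch of the formula above, without the smoothing and
    normalization that production implementations typically add.
    """
    tokenized = [d.split() for d in docs]
    n_docs = len(docs)
    df = {}                                   # document frequency per word
    for toks in tokenized:
        for w in set(toks):
            df[w] = df.get(w, 0) + 1
    weighted = []
    for toks in tokenized:
        tf = {w: toks.count(w) / len(toks) for w in set(toks)}
        weighted.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weighted

weights = tf_idf(["sepsis fever", "fever cough", "cough"])
```

In this toy corpus, "sepsis" appears in only one document, so within the first document it outweighs the more common "fever".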
The logistic regression hyperparameters included the default L2 regularization with C = 1.0 and number of iterations = 2000. Variables with missing data were not included in the logistic regression model, as experimentation with imputation did not show benefit. Data balancing was not used for the logistic regression, as it did not improve the results.

Gradient Boosting
We used the XGBoost implementation of the gradient boosting algorithm [12]. This model uses multiple tree-based classifiers, each trained to correct the errors made by the previous trees. The default hyperparameters were used for the model: eta = 0.3, max depth = 3. We set n_estimators = 1000. Missing values were handled natively by the XGBoost model. Scale balancing was left at the default, scale_pos_weight = 1, since weight balancing did not affect the gradient boosting model.

Ensemble
Ensemble averaging is the process of averaging multiple models' predictions to improve the final output, as opposed to relying on a single model. An ensemble of several models frequently performs better than any individual model, since the errors of the models "average out". We evaluated an ensemble averaging of the LR-tf-idf and gradient boosting outputs.
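A minimal sketch of this ensemble averaging (the paper does not report ensemble weights, so an unweighted mean is assumed):

```python
def ensemble_average(p_lr, p_gb):
    """Average two models' predicted probabilities sample by sample.

    Unweighted mean; an assumption, since the paper does not
    specify ensemble weights.
    """
    return [(a + b) / 2 for a, b in zip(p_lr, p_gb)]

# Hypothetical per-visit probabilities from the two models
probs = ensemble_average([0.9, 0.1], [0.7, 0.3])
```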

Statistical Analysis
All development and statistical analyses were carried out using Python (version 3.6.5). Continuous variables are reported as medians with the spread reported as the interquartile range (IQR). Categorical variables are reported as percentages. Continuous variables were compared using one-way analysis of variance (ANOVA). Categorical variables were compared using the chi-square test. The area under the receiver operating characteristic curve (AUC) was used to compare model performance on the testing data (the year 2018).
To analyze the importance of single terms/words in the flowsheet and free-text BoW collections, we used the mutual information formula [13]. This formula measures the joint mutual information between the mortality class (C) and the term/word (W):

Mutual Information = Σ_C Σ_W P(C, W) × log( P(C, W) / (P(C) P(W)) )
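The formula can be sketched in plain Python for a small joint probability table (the class/term labels and probabilities are illustrative):

```python
import math

def mutual_information(joint):
    """Mutual information between class C and term W.

    `joint[c][w]` holds the joint probability P(C=c, W=w); all values
    must sum to 1. Implements sum over (c, w) of
    P(c, w) * log(P(c, w) / (P(c) * P(w))).
    """
    classes = list(joint)
    words = list(next(iter(joint.values())))
    p_c = {c: sum(joint[c].values()) for c in classes}
    p_w = {w: sum(joint[c][w] for c in classes) for w in words}
    mi = 0.0
    for c in classes:
        for w in words:
            p = joint[c][w]
            if p > 0:
                mi += p * math.log(p / (p_c[c] * p_w[w]))
    return mi

# Independent class and term: mutual information is zero
mi_indep = mutual_information({"died":     {"cpr": 0.25, "no_cpr": 0.25},
                               "survived": {"cpr": 0.25, "no_cpr": 0.25}})
# Term strongly associated with the class: mutual information is positive
mi_dep = mutual_information({"died":     {"cpr": 0.4, "no_cpr": 0.1},
                             "survived": {"cpr": 0.1, "no_cpr": 0.4}})
```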
Youden's index was used to find an optimal sensitivity-specificity cutoff point on the receiver operating characteristic (ROC) curve. Sensitivity, specificity, false-positive rate (FPR), negative predictive value (NPV), positive predictive value (PPV) and F1-score were also evaluated for fixed specificities of 90% and 99%. Bootstrapping validations (1000 bootstrap resamples) were used to calculate 95% confidence intervals (CI) for all metrics.
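Youden's index selection can be sketched as follows (the ROC points here are illustrative, not the study's actual curve):

```python
def youden_cutoff(roc_points):
    """Pick the ROC point maximizing Youden's J = sensitivity + specificity - 1.

    `roc_points` is a list of (threshold, sensitivity, specificity)
    tuples; the values below are hypothetical.
    """
    best = max(roc_points, key=lambda p: p[1] + p[2] - 1)
    return best[0]

cutoff = youden_cutoff([(0.1, 0.99, 0.50),
                        (0.3, 0.95, 0.95),
                        (0.5, 0.74, 0.99)])
```

Here the middle point has the highest J (0.90), so its threshold is selected.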

Results

Study Cohort
During the five-year study period, the MSH recorded 546,186 ED visits. After exclusions, the cohort consisted of 412,901 ED visits (Figure 2). Overall, 2803 in-hospital mortality cases (0.7%) were identified. Of them, 703 (0.2%) died within 48 h of ED admission. The median time to death was 7 days (IQR: 2–16 days). Forty-two patients died within 30 min of the triage note and were excluded. Patient characteristics of both the training and testing datasets are presented in Table 1. Significant differences were observed between patients who died and those who survived (Table 1). Of note, about half of the mortality cases had known cardiovascular and oncological diseases. Table 2 describes how the data were distributed across the training and testing datasets.
Figure 3 presents the distribution of the sizes of the different data types. Semi-structured flowsheet data were the largest, followed by free text and, lastly, structured data. On average, each patient had 37.9 (±18.4) flowsheet records accumulated within 30 min from triage. Flowsheet term combinations associated with 48-h mortality include "acuity level: 1" and terms related to mechanical ventilation and sepsis (Table A1 in Appendix A). Terms associated with overall in-hospital mortality include "acuity level: 1", "acuity level: 2", "sepsis" and "supine position" (Table A1).

Data Analysis
Free-text words associated with 48-h mortality include resuscitation measures such as "CPR", "ACLS", "arrest" and "intubation" (Table A2). Free-text words associated with overall mortality include words related to transfer and disposition such as "EMS", "from", "admission" and also words related to resuscitation such as "CPR", and "arrest" (Table A2).

Machine-Learning Models
Adding tf-idf markedly improved the logistic regression model. The ensemble model showed only a small increase over the single models. For 48-h mortality, the AUC increased to 0.99 (95% CI: 0.98–0.99) and for overall mortality to 0.96. Still, the ensemble showed better results than gradient boosting for predicting 48-h mortality and better results than logistic regression for predicting overall mortality (Figure 4a,b). Calibration plots of the models are presented in Figure 5a,b. Figure 6a,b show the precision-recall (PR) curves for in-hospital mortality within 48 h and overall in-hospital mortality. Figure 7a,b demonstrate the confusion matrices of the ensemble model predictions for in-hospital mortality within 48 h and overall in-hospital mortality, using the Youden's index cut-off. For 48-h mortality, the ensemble showed a sensitivity of 95%, a specificity of 95% (FPR 1:20) and a PPV of 0.03 (Tables 4 and 5). For an FPR of 1:100, the ensemble showed a sensitivity of 74% and a PPV of 0.11. Figure 8a,b present word clouds of term importance.

Discussion
A simple method can be used for grasping complex Epic EHR data. Specifically, semi-tabular data can be represented as free text, using the BoW approach. A model trained on this representation showed potential for predicting in-hospital mortality.
In the ED, many life-and-death decisions are made under stressful conditions. Decision support tools can be helpful in this setting. Importantly, ED physicians have shown low accuracy (AUC 0.73) in predicting 30-day mortality [14].
Raita et al. trained several machine-learning models for predicting mortality and ICU admission at triage. They used data from 135k ED visits. Their variables were structured EHR features, including demographics, vital signs, chief complaints and comorbidities. The best model was a neural network with AUC 0.86 [9]. Klug et al. used gradient boosting to predict mortality at triage. They trained the model on 800k ED visits. Structured features included demographics, arrival mode, chief complaint, vital signs, ESI, previous visits and comorbidities. Their model showed AUC 0.96 for 48-h mortality [8].
In our study, an ensemble model showed an AUC of 0.99 for 48-h mortality and an AUC of 0.96 for overall in-hospital mortality, albeit at the price of a low PPV. Predicting in-hospital mortality at triage is a "needle in a haystack" problem. Typically, 2:1000 patients die within 48 h and 7:1000 die overall. For a sensitivity of 95%, the PPV was only 0.03, because of the low incidence of mortality. Yet, the FPR is 1:20. This means that in an ED with 200 patients per day, the algorithm will alert only 10 times a day. For an average physician who sees 20 patients per shift, the model will alert on one patient per shift. This is important for preventing alert fatigue [15].
For FPR of 1:100, the sensitivity lowers to 74%, and the PPV rises to 0.11. With this threshold, the model will alert in the entire ED only twice a day. It will alert once per five physician shifts. Yet, the model will still detect approximately three quarters of early mortality cases.
We employed a common NLP technique to handle semi-structured data. Using BoW collections made it easy to represent the data as a sparse matrix. This is a memory-efficient solution for utilizing Big Data. The virtual/dense size of the dataset was 400k visits × 120k items per visit (including all possible unique free-text words and flowsheet term-value combinations). This amounts to 48 × 10⁹ (about 50 billion) memory points. In the sparse matrix representation, each row physically occupied about 200 items. This amounts to 400k × 200 = 80 × 10⁶ (80 million) actual memory points. The sparse representation is thus about 0.2% of the volume of the dense representation.
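The arithmetic above can be checked directly:

```python
# Back-of-the-envelope check of the dense vs. sparse sizes quoted above.
n_visits = 400_000          # rows (ED visits)
n_features = 120_000        # unique words + flowsheet term-value pairs
nnz_per_row = 200           # non-zero entries actually stored per visit

dense_cells = n_visits * n_features      # full dense matrix: 48 billion cells
sparse_cells = n_visits * nnz_per_row    # sparse storage: 80 million cells
fraction = sparse_cells / dense_cells    # sparse volume as share of dense
```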
A previous study by Rajkomar et al. presented a method to extract the content of the EHR using the FHIR (Fast Healthcare Interoperability Resources) format [16]. FHIR was developed to represent clinical data in a consistent, hierarchical and extensible container format. Their study included a cohort of 216k hospitalized patients. The authors used a neural network to predict different outcomes [17]. In contrast with Rajkomar et al., we used a completely free-text-like representation of the data. This has the advantage of utilizing the entire dataset, as none of the data needs to be strictly encoded. The disadvantage of this approach is that it is site-specific.
We believe that the paradigm of "one solution to fit all sites" should be re-examined. At our site, there are almost 65k unique flowsheet items with many possible values. The free-text records also contain abbreviations and terms that are unique. We hypothesize that flexible solutions tailored to each medical institution are needed. This requires simple, efficient and flexible methods that can be easily implemented on-site. The presented method is an example of such a simple, flexible solution. Future studies can elaborate on similar methods. Of note, the current method shows high predictive ability using logistic regression with tf-idf. This model is simple, fast, easy to implement and interpretable.
In a sense, converting data to "text," provides a very flexible way to "tell a story" about the data. For example, using this method, it was straightforward to create "past stories" for each patient. There are many possible ways to experiment with such stories, possibly improving results.

Limitations
This was a retrospective single-center study. A prospective applicative study is needed to prove the usefulness of the models. Yet, the essence of this study is the presentation of a simple method for harvesting data. Second, only in-hospital mortality was evaluated; out-of-hospital death records were not available. Third, many different outcomes can be explored, for example, ICU admission or the need for medical intervention. Fourth, we predicted all-cause mortality without stratifying by pathology. Fifth, there are multiple available machine- and deep-learning models. This study implemented two: XGBoost, which shows state-of-the-art results in tabular tasks, and LR-tf-idf, which is a simple, robust algorithm for NLP BoW data. Other models can be explored. Sixth, a cut-off time of 30 min was used. We believe this is enough time for data to accumulate while still being clinically relevant for an early support tool. Other cut-off times may be chosen. Seventh, a large number of variables (free-text words and flowsheet terms) were used. Even though data exploration such as that performed in Tables 3 and 4 can give insight, and although both logistic regression and gradient boosting are interpretable, the models may still be considered black boxes.

Conclusions
A free-text-like approach can be useful for grasping large amounts of complex semi-structured EHR data.

Data Availability Statement: Anonymized participant data are held in a secure research server and will be handled in accordance with the ethical approval for this project.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A

Table A1. Flowsheet terms associated with in-hospital mortality within 48 h and overall in-hospital mortality.