Application of Machine Learning in Intensive Care Unit (ICU) Settings Using MIMIC Dataset: Systematic Review

Abstract: Modern Intensive Care Units (ICUs) provide continuous monitoring of critically ill patients susceptible to many complications affecting morbidity and mortality. ICU settings require a high staff-to-patient ratio and generate a large volume of data. For clinicians, the real-time interpretation of data and decision-making is a challenging task. Machine Learning (ML) techniques in ICUs are making headway in the early detection of high-risk events due to increased processing power and freely available datasets such as the Medical Information Mart for Intensive Care (MIMIC). We conducted a systematic literature review to evaluate the effectiveness of applying ML in ICU settings using the MIMIC dataset. A total of 322 articles were reviewed and a quantitative descriptive analysis was performed on 61 qualified articles that applied ML techniques in ICU settings using MIMIC data. We assembled the qualified articles to provide insights into the areas of application, clinical variables used, and treatment outcomes that can pave the way for further adoption of this promising technology and possible use in routine clinical decision-making. The lessons learned from our review can provide guidance to researchers on the application of ML techniques to increase their rate of adoption in healthcare.


Introduction
Artificial intelligence (AI) encompasses a broad spectrum of technologies that aim to imitate cognitive functions and intelligent behavior of humans [1]. Machine Learning (ML) is a subfield of AI that focuses on algorithms that allow computers to define a model for complex relationships or patterns from empirical data without being explicitly programmed [2]. ML, powered by the increasing availability of healthcare data, is being used in a variety of clinical applications ranging from diagnosis to outcome prediction [1,3].
The predictive power of ML improves as the number of samples available for learning increases [4,5].
ML algorithms can be supervised or unsupervised based on the type of learning rule employed. In supervised learning, an algorithm is trained using well-labeled data. Thereafter, the machine predicts on unseen data by applying knowledge gained from the training data [6]. The most widely adopted supervised ML models are Random Forest (RF), Support Vector Machines (SVM), and Decision Tree algorithms [6]. In unsupervised learning, no ground truth labeling is required. Instead, the machine learns from the inherent structure of the unlabeled data [7]. Either type of ML is an iterative process in which the algorithm tries to find the optimal combination of both model variables and variable weights with the goal of minimizing error in the predicted outcome [5,6]. If the algorithm performs with a reasonably low error rate, it can be employed for making predictions where outputs are not known. However, while developing an ML model, an appropriate bias-variance tradeoff should be selected to minimize the prediction error rate [8]. A poor tradeoff results in one of two problems: (1) underfitting (high bias) or (2) overfitting (high variance) [9]. Finding the "sweet spot" between bias and variance is crucial to avoid both underfitting and overfitting [8,10].
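The bias-variance tradeoff described above can be illustrated with a small sketch. The example below is an assumption made purely for illustration: it uses a k-nearest-neighbour regressor on synthetic data, which is not a model or dataset taken from the reviewed studies. With k = 1 the model memorizes the training set (zero training error, a sign of overfitting), while with k equal to the training set size it predicts the same average everywhere (underfitting).

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error of the k-NN regressor on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

def make_points(n, rng):
    """Synthetic data: a smooth signal plus Gaussian noise (illustrative only)."""
    pts = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        pts.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return pts

rng = random.Random(0)
train = make_points(60, rng)
valid = make_points(60, rng)

# k=1: very flexible (low bias, high variance); k=60: rigid (high bias, low variance).
errors = {k: (mse(train, train, k), mse(train, valid, k)) for k in (1, 5, 60)}
for k, (train_err, valid_err) in errors.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  validation MSE={valid_err:.3f}")
```

Comparing training and validation error across values of k is one simple way to look for the "sweet spot"; a large gap between the two suggests overfitting, while high error on both suggests underfitting.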
Deep learning (DL), a subcategory of machine learning, achieves great power and flexibility compared to conventional ML models by drawing inspiration from biological neural networks to solve a wide variety of complex tasks, including the classification of medical imaging and Natural Language Processing (NLP) [10,11,12,13,14]. The most widely used DL models are variants of the Artificial Neural Network (ANN) and the Multi-Layer Perceptron (MLP). In general, ML models are data-driven and rely on a deep understanding of the system for prediction, thereby empowering users to make informed decisions.
To provide better patient care and facilitate translational research, healthcare institutions are increasingly leveraging clinical data captured from Electronic Health Records (EHR) systems [15]. Of these systems, the Intensive Care Unit (ICU) generates an immense volume of data and requires a high staff-to-patient ratio [16,17]. To avoid adverse events and prolonged ICU stays, early detection of and intervention for patients vulnerable to complications is crucial; for these reasons, the ML literature is increasingly using ICU patient data for secondary usage and for predicting clinical events such as sepsis and septic shock [18]. ML techniques in ICUs are making headway in the early detection of high-risk events due to increased processing power and freely available datasets such as the Medical Information Mart for Intensive Care (MIMIC) [19]. The data available in the MIMIC database include highly structured data from time-stamped, nurse-verified physiological measurements made at the bedside, as well as unstructured data, including free-text interpretations of imaging studies provided by the radiology department [13].
The primary aim of this study was to conduct a systematic literature review on the effectiveness of applying ML technologies using the MIMIC dataset. Specifically, we summarized the clinical area of application, disease type, clinical variables, data type, ML methodology, scientific findings, and challenges experienced across the existing ICU-ML literature.

Methods
This systematic literature review followed the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework for preparation and reporting [20].

Eligibility Criteria
This study focused on peer-reviewed publications that applied ML techniques to analyze retrospective ICU data from the publicly available MIMIC dataset, which includes discrete structured clinical data, physiological waveform data, free-text documents, and radiology imaging reports.

Data Sources and Search Strategy
Three search engines were used: PubMed, Web of Science, and CINAHL. We restricted our search to research articles published in English and in peer-reviewed journals or conferences available from the inception of each database through 30 October 2020. The search syntax was built with the guidance of a professional librarian and included the search terms: "Machine Learning", "Deep Learning", "Artificial Intelligence", "Neural Network", "Supervised Learning", "Support Vector Machine", "SVM", "Intensive Care Unit", "ICU", "Critical Care", "Intensive Care", "MIMIC", "MIMIC-II", "Medical Information Mart for Intensive Care", "Beth Israel Deaconess Medical Center". Figure 1 illustrates the process of identifying eligible publications.

Study Selection
Following the systematic search process, 322 publications were retrieved. Of these, 113 duplicate publications were removed, leaving 209 potentially relevant articles for title and abstract screening. Two teams (HB, SS and MS, SB) screened these articles independently, leading to the removal of another 89 publications, and 120 publications were retained for a full-text assessment. These were assessed for eligibility, resulting in a total of 61 publications that were included in the final analysis. Disagreements were resolved through an independent review by a third person (AS).
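The selection counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this paragraph; the number of articles excluded at the full-text stage is derived here and is not stated explicitly in the text.

```python
# Counts reported in the Study Selection paragraph.
retrieved = 322
duplicates_removed = 113
excluded_at_screening = 89
included = 61

# Articles entering title and abstract screening.
screened = retrieved - duplicates_removed
# Articles retained for full-text assessment.
full_text_assessed = screened - excluded_at_screening
# Derived value (not stated in the text): exclusions at the full-text stage.
excluded_at_full_text = full_text_assessed - included

print(f"screened={screened}, full-text assessed={full_text_assessed}, "
      f"excluded at full text={excluded_at_full_text}, included={included}")
```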

Data Collection and Analysis
A quantitative descriptive analysis was performed on the qualified studies that had applied ML techniques in ICU settings using the MIMIC database. Data elements extracted for this analysis included: (a) type of disease, (b) type of data, (c) clinical area of application, (d) ML techniques, and (e) year of publication. After the extraction and analysis, we summarized and reported the findings in tables in accordance with the aim of the study.

Results
The search strategy yielded a total of 322 articles, which were published and made available as of 30 October 2020. Of these, 61 publications were selected for further analysis. These publications were categorized into seven themes based on the effectiveness of applying ML techniques in various ICU settings. The themes were identified based on the ML algorithms used to predict, monitor, and improve patient outcomes. Descriptions of each theme and related publications are listed in Table 1. The largest group of studies in our review focused on predicting mortality (21 studies), followed by risk stratification (10 studies). Multiple studies focused on predicting the onset date of specific diseases such as sepsis and septic shock (eight studies), cardiac diseases (seven studies), and acute kidney injury (AKI) (six studies).
Details of each qualified study, including clinical applications, ML models and clinical variables used, sample size, and model performance, are provided in Supplementary Table S1.
In our review, both traditional ML and DL models were used in the classification tasks. Based on the best-performing models reported by the studies, traditional ML was used in 36 studies and DL was used in 25 studies. Among traditional ML algorithms, SVM and RF were most commonly used, whereas among DL models, Long Short-Term Memory (LSTM) was most commonly employed. The majority of the studies used discrete clinical variables, and eight publications used both discrete and unstructured data, such as discharge summaries, nursing notes, and radiology reports, as input.
Thirty-six studies applied imputation methods, and the remaining 25 either completely removed the records with missing data or did not mention how missing data were handled. Fifty-two studies used cross-validation to evaluate model performance. Forty-eight studies used feature identification to improve accuracy.

Discussion
The aim of this systematic review was to provide an up-to-date and holistic view of the current ML applications in ICU settings using MIMIC data in the attempt to predict clinical outcomes. Our review revealed ML application was widely adopted in areas such as mortality, risk stratification, readmission, and infectious disease in critically ill patients using retrospective data. This review may be used to provide insights for choosing key variables and best performing models for further research.
The application of ML techniques within the ICU domain is rapidly expanding with improvement of modern computing, which has enabled the analysis of huge volumes of complex and diverse data [1]. ML expands on existing statistical techniques, utilizing methods that are not based on a priori assumptions about the distribution of the data, but deriving insights directly from the data [80,81].
With ICUs being complex settings that generate a variety of time-sensitive data, more and more ML-based studies have begun tapping the openly available, large tertiary care hospital data (MIMIC). Our screening resulted in 61 publications that utilized MIMIC data to train and test ML models enabling reproducibility. The majority of these publications focused on predicting mortality, sepsis, AKI, and readmissions.

Mortality Prediction
Mortality prediction for ICU patients is crucial for assessing severity of illness, adjudicating the value of treatments, and guiding timely interventions. ML algorithms developed for predicting mortality in ICUs focused mainly on in-hospital mortality and 30-day mortality at discharge. Studies by Marafino et al. [22], Pirracchio et al. [23], Hoogendoorn et al. [24], Awad et al. [26], Davoodi et al. [29], and Weissman et al. [31] predicted in-hospital mortality, whereas Du et al. [25] predicted 28-day mortality at discharge, and Zahid et al. [30] predicted both 30-day and in-hospital mortality. Most studies focusing on predicting in-hospital mortality looked at mortality after 24 h of ICU admission. However, one in particular, Awad et al. [26], predicted mortality within 6 h of admission. Marafino et al. [22] predicted mortality using only nursing notes from the first 24 h of ICU admission, whereas Weissman et al. [31] improved mortality prediction by combining structured and unstructured data generated within the first 48 h of the ICU stay. Davoodi et al. [29] and Hoogendoorn et al. [24] predicted after 24 h and within a median of 72 h, respectively. Studies by Tang et al. [33], Caicedo-Torres et al. [36], Sha et al. [38], and Zhang et al. [41] predicted in-hospital mortality irrespective of the admission or discharge time.
For mortality prediction, all of the studies used three main categories of clinical variables: (1) demographics, (2) vital signs, and (3) laboratory test variables. In addition to the most commonly used data elements, other clinical information such as medications, intake/output variables, risk scores, and comorbidities were also utilized. Weissman et al. [31] and Zhang et al. [41] used clinical variables from both structured and unstructured data types for mortality prediction.
Multiple studies predicted mortality in disease-specific patient cohorts. Celi et al. [21] and Lin et al. [35] predicted in-hospital mortality in AKI patients. Lin et al. [35] predicted mortality based on five important variables (urine output, systolic blood pressure, age, serum bicarbonate level, and heart rate). In addition, the study by Lin et al. [35] noted that the effect of kidney injury markers such as cystatin C and neutrophil gelatinase-associated lipocalin on subclinical injury, which can provide AKI prognostic information, had not yet been analyzed because these data are not available in MIMIC. Garcia-Gallo et al. [37] and Kong et al. [40] predicted mortality in sepsis patients, and specifically, Garcia-Gallo et al. [37] identified patients on a 1-year mortality trajectory. Anand et al. [34] claimed that the risk of mortality in diabetic patients could be better predicted using a combination of a limited set of variables: HbA1c, mean glucose during stay, diagnoses upon admission, age, and type of admission. To compute the "diagnosis upon admission" variable, the study utilized the Charlson Comorbidity Index, the Elixhauser Comorbidity Index, and the Diabetic Severity Index. The authors further claimed that combining diabetic-specific metrics and using the fewest possible variables would result in better mortality risk prediction in diabetic patients.
In our review, studies used both traditional ML (10 studies) and DL methods (11 studies) to predict mortality. Among traditional ML techniques, Random Forest, Decision Tree, and Logistic Regression were the most commonly used algorithms. However, recent studies by Caicedo-Torres et al. [36], Du et al. [25], and Zahid et al. [30] used DL methods for mortality prediction with a promising accuracy ranging from 0.86 to 0.87, as reported in Supplementary Table S1. Traditional ML models are more easily interpretable than DL models, which have many levels of features and hidden layers to predict outcomes. Understanding the features that contribute towards the prediction plays an important role in clinical decision-making [82,83]. For example, one of the most cited studies, by Pirracchio et al. [23], developed a mortality prediction algorithm (Super Learner) using a combination of traditional ML models, the results of which were easily interpretable by clinical researchers. In general, DL techniques are employed to improve prediction accuracy by training on large volumes of data [12]. Zahid et al. [30] developed a DL model, a Self-Normalizing Neural Network (SNN), that performed marginally better than the Pirracchio et al. [23] model (Area Under the Receiver Operating Characteristic curve (AUROC) of SNN: 0.86 and Super Learner: 0.85). However, interpreting the results of DL models is challenging because of their multiple hidden layers, and they are often treated as black-box models. To address this limitation, Caicedo-Torres et al. [36] and Sha et al. [38] demonstrated the interpretability of their models through visualizations that allow clinicians to make informed decisions.
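Because AUROC is the metric used to compare models throughout this review, a minimal sketch of how it can be computed may be helpful. The function below uses the pairwise (Mann-Whitney) interpretation of AUROC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores are made-up illustrative values, not results from any reviewed study, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Illustrative values only: 1 = died in hospital, 0 = survived.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(f"AUROC = {auroc(labels, scores):.3f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why a difference such as 0.86 versus 0.85 is described as marginal.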

Acute Kidney Injury (AKI) Prediction
AKI is one of the common complications among adult patients in the intensive care unit (ICU). AKI patients are at risk for adverse clinical outcomes such as prolonged ICU and hospitalization stays, high morbidity, and mortality. Application of ML in AKI care has been mainly focused on early prediction of an AKI event and risk stratification. In our review, studies employed traditional ML techniques to predict AKI events and XGBoost was the most commonly used algorithm.
Using the MIMIC dataset, Zimmerman et al. [68], Sun et al. [69], and Li et al. [84] predicted AKI after 24 h of ICU admission. Sun et al. [69] and Li et al. [84] used clinical unstructured notes generated during the first 24 h of ICU stay, whereas Zimmerman et al. [68] used structured clinical variables for prediction. The AUROC of predicting AKI within the first 24 h in Sun et al. [69], Zimmerman et al. [68], and Li et al. [84] was reported as 0.83, 0.783, and 0.779, respectively. Additional details on the type of clinical variables, sample size, and ML model are listed in Supplementary Table S1.
To define and classify AKI, three standard guidelines have been published and used in clinical settings: (1) Risk, Injury, Failure, Loss, End-Stage (RIFLE), (2) Acute Kidney Injury Network (AKIN), and (3) Kidney Disease: Improving Global Outcomes (KDIGO). In our results, most studies used the KDIGO guidelines, which are based on serum creatinine (SCr) and urine output, to create ground truth labels. SCr is one of the important predictor variables in AKI; however, it is a late marker of AKI, which delays diagnosis and care [85]. In clinical settings, it is highly desirable to predict the AKI event early to enable better intervention strategies. To address this clinical need, Zimmerman et al. [68] predicted SCr values at 48 and 72 h based on 24 h SCr values and other clinical variables. Li et al. [84] used NLP to extract key features from clinical notes, such as diuretic and insulin medications, instead of depending completely on SCr. Even though urine output is one of the defined metrics of AKI, Zimmerman et al. [68] reported that it was not a significant predictor [86]. Further investigation should focus on the effect of urine output on predicting AKI and its impact.
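As an illustration of how SCr-based ground truth labels might be derived, the sketch below stages AKI from serum creatinine alone using thresholds as commonly summarized from the KDIGO guideline. It is a simplification under stated assumptions: urine output and renal replacement therapy criteria are omitted, the function name and parameters are hypothetical, and the reviewed studies may have implemented their labels differently. The thresholds should be checked against the guideline itself before any use.

```python
def kdigo_scr_stage(baseline_scr, current_scr, max_rise_48h):
    """Return a simplified, SCr-only KDIGO stage (0 means no AKI by these criteria).

    baseline_scr and current_scr are in mg/dL; max_rise_48h is the largest
    SCr increase (mg/dL) observed within any 48 h window.
    Assumption: urine output and renal replacement therapy are ignored.
    """
    if baseline_scr <= 0:
        raise ValueError("baseline_scr must be positive")
    ratio = current_scr / baseline_scr
    meets_aki_definition = ratio >= 1.5 or max_rise_48h >= 0.3
    # Stage 3: SCr at least 3.0x baseline, or SCr >= 4.0 mg/dL in a patient
    # who already meets the AKI definition (assumed interpretation).
    if ratio >= 3.0 or (current_scr >= 4.0 and meets_aki_definition):
        return 3
    # Stage 2: SCr 2.0x to 2.9x baseline.
    if ratio >= 2.0:
        return 2
    # Stage 1: SCr 1.5x to 1.9x baseline, or a rise of >= 0.3 mg/dL within 48 h.
    if meets_aki_definition:
        return 1
    return 0

# Illustrative values only (mg/dL), not patient data.
print(kdigo_scr_stage(baseline_scr=1.0, current_scr=1.35, max_rise_48h=0.35))
print(kdigo_scr_stage(baseline_scr=1.0, current_scr=2.4, max_rise_48h=1.4))
```

Because a label of this kind depends on SCr having already risen, it inherits the late-marker limitation noted above, which is part of the motivation for forecasting SCr or using features from clinical notes.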

Sepsis and Septic Shock
Sepsis is one of the leading causes of death among ICU patients and hospitalized patients overall [87]. As sepsis progresses, patients in the pre-shock state are highly likely to develop septic shock. Early recognition of sepsis and initiation of treatment reduce mortality and morbidity [88]. In our review, eight studies applied ML techniques to predict sepsis or septic shock events. Of these, four applied traditional ML algorithms and the other four used DL methods. XGBoost and LSTM were the most commonly used algorithms; details of the variables, sample sizes, and model performance are provided in Supplementary Table S1.
The Scherpf et al. [55] model predicted sepsis 3 h prior to onset with an AUROC of 0.81. The results of our review also reveal that most studies focused on early prediction of the pre-shock state using hemodynamic measurements. The common variables used in ML models are arterial pressure, heart rate, labs, respiratory rate, and risk scores including the Glasgow Coma Scale (GCS) and Sequential Organ Failure Assessment (SOFA) scores.
For predicting pre-shock state, Liu et al. [54] and Kam et al. [53] used a combination of these variables along with lab findings with the Area under the Curve (AUC) performance reported as 0.93 and 0.929, respectively. One of the interesting findings of the Liu et al. [54] study was that serum lactate was the primary predictor variable indicating a patient's risk level of entering septic shock, and is used as a biomarker for sepsis patient risk stratification. The study also reported, "A patient with serum lactate concentration one standard deviation above the population mean is approximately five times as likely to transition into shock than a patient with average serum lactate concentration" [54]. The hemodynamic measurements can be derived from waveform data or can be extracted as discrete data elements from EHR. Ghosh et al. [52] used three waveforms: mean arterial pressure, heart rate, and respiratory rate to derive hemodynamic predictor variables, whereas Liu et al. [54] and Kam et al. [53] used discrete measurements.

ICU Readmission
Intensive Care Units (ICUs) provide care to critically ill patients, which is often costly and labor-intensive. Prolonged ICU stays increase the cost burden on both patients and hospitals. Early prediction of unplanned readmissions may help in ICU resource allocation and improve patient health outcomes. Details of the studies qualified in this theme are listed in Supplementary Table S1. Desautels et al. [76] identified patients who are likely to suffer unplanned ICU readmission; their model reported an AUROC of 0.71. Rojas et al. [78] and Lin et al. [79] focused on identifying patients who were readmitted within 30 days of discharge. The best AUROCs reported by Lin et al. [79] and Rojas et al. [78] are 0.791 and 0.78, respectively. The common predictor variables used in all three studies include vital signs, demographics, comorbidities, and labs. Our findings revealed that limited research has been done on predicting readmissions using MIMIC data and that the model AUROCs reported in the literature are not promising (below 0.80).

ML Model Optimization
The performance of a given model depends heavily on data pre-processing, feature identification, and model validation. The missing data problem is arguably the most common issue encountered by machine learning practitioners when analyzing real-world healthcare data [89]. Researchers in general choose to address missing data by either imputing or removing the observations [89]. Imputation techniques range from simple to complex: for example, in the study by Lin et al. [35], missing observations were imputed using the mean value of the variable, whereas Davoodi et al. [29] and Zhang et al. [67] used more sophisticated techniques, Gaussian imputation and Multivariate Imputation by Chained Equations (MICE), respectively. Substituting missing values with estimated observations introduces bias that may distort the data distribution or introduce spurious associations that influence model accuracy. To minimize this, imputation methods should be carefully selected, especially for prospective data. The choice of imputation methodology depends on the aim of the study, the importance of the data elements, the percentage of missing data, and the ML model used.
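The two basic options described above, removing incomplete records and simple mean imputation, can be sketched as follows. The records, variable names, and helper functions are hypothetical and chosen only for illustration; they are not drawn from MIMIC or from any reviewed study, and more sophisticated methods such as MICE are not shown.

```python
from statistics import mean

# Hypothetical records; None marks a missing measurement.
records = [
    {"age": 64, "heart_rate": 88, "lactate": 2.1},
    {"age": 71, "heart_rate": None, "lactate": 3.4},
    {"age": 58, "heart_rate": 102, "lactate": None},
    {"age": 80, "heart_rate": 95, "lactate": 1.5},
]

def complete_cases(rows):
    """Option 1: drop every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def mean_impute(rows):
    """Option 2: replace each missing value with the mean of that variable."""
    columns = list(rows[0].keys())
    fill = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: (r[c] if r[c] is not None else fill[c]) for c in columns}
            for r in rows]

kept = complete_cases(records)
imputed = mean_impute(records)
print(f"complete-case analysis keeps {len(kept)} of {len(records)} records")
print(f"mean imputation keeps {len(imputed)} records")
```

The tradeoff is visible even in this toy case: complete-case analysis discards half of the records, while mean imputation keeps all of them at the cost of shrinking the variance of the imputed variables.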
Feature importance techniques are often employed to identify the highest-ranked features. Restricting an ML model to the most important features can improve accuracy and reduce computing time [90]. Cross-validation (CV) of an ML algorithm is vital to estimate a model's predictive power and generalized performance on unseen data [91,92]. K-fold CV is often used to reduce pessimistic bias by using more training data to teach the model. Our analysis found that 52 studies used various validation techniques. Five-fold and 10-fold CV were the most common validation methods used.
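The K-fold procedure mentioned above can be sketched as a plain index-splitting routine. This is a minimal illustration with a hypothetical function name, not the implementation used by any of the reviewed studies: the samples are shuffled once, divided into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=23, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train_idx)} test={len(test_idx)}")
```

With five folds each model is trained on roughly 80% of the data and with 10 folds on roughly 90%, which is why a larger k reduces the pessimistic bias of the performance estimate at the cost of more model fits.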
This review has some inherent limitations. First, there is the possibility of studies missed due to the search methodology. Second, we removed sixteen publications where full text was not available, and this may have introduced bias. Finally, a comparison of ML model performance was not possible in the quantitative analysis even though the studies used the MIMIC dataset for training and validating ML models. This is due to the fact that ML performance is dependent on the data elements selected for prediction, model parameters used, and size of the dataset.

Key Points and Recommendations
The aim of the study was to perform a comprehensive literature review on ML applications in ICU settings using the MIMIC dataset. The key points of our review and our recommendations for future research are listed below.
Recent proliferation of publicly available MIMIC datasets allowed researchers to provide effective ML-based solutions in an attempt to solve complex healthcare problems. However, reproducibility of ML models is lacking due to inconsistent reporting of clinical variables selected, data pre-processing, and model specifications during the development. Future studies should follow standard reporting guidelines to accurately disclose model specifications.
Significant work has been done in predicting mortality within 6 to 72 h of hospital admission on retrospective data. However, prospective implementation is lacking. To adapt to the dynamics of clinical events, we recommend exposing these models to prospective trials before moving them into routine clinical practice.
ML model performance heavily depends on the clinical variables utilized. We identified and summarized the variables used by different models across the themes. Future studies should focus on performing a detailed analysis of these variables for improved performance.
Unstructured clinical notes contain valuable and time-sensitive information critical for decision-making. Eight studies in our review tapped into clinical notes to mine important information. However, recent advancements in NLP techniques such as Bidirectional Encoder Representations from Transformers (BERT) and Embeddings from Language Models (ELMo) have not been explored.
Interpretable ML models allow clinicians to understand and improve model performance. However, only two studies have resorted to visualization-based interpretations in the review.

Conclusions
ML is gaining traction in the ICU setting. This systematic review aimed to assemble the current ICU literature that utilized ML methods to provide insights into the areas of application and treatment outcomes using the MIMIC dataset. Our work can pave the way for further adoption and help overcome hurdles in employing ML technology in clinical care. This study identified the most important clinical variables used in the design and development of ML models for predicting mortality and infectious disease in critically ill patients, which can provide insights for choosing key variables for further research. We also discovered that predicting disease classification and treatment outcomes using supervised and unsupervised ML is possible with high predictive value on retrospective data. Prospective validation is still lacking, possibly due to the limitations with implementation and real-time disparate data processing.
Supplementary Materials: The following are available online at https://www.mdpi.com/2227-9709/8/1/16/s1, Table S1: Summary of 61 studies qualified for quantitative descriptive analysis.
Funding: This study was supported in part by the Translational Research Institute (TRI), grant UL1 TR003107 received from the National Center for Advancing Translational Sciences of the National Institutes of Health (NIH). The content of this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Conflicts of Interest:
The authors declare no conflict of interest.