A Stochastic Multivariate Irregularly Sampled Time Series Imputation Method for Electronic Health Records

: Electronic health records (EHRs) can be very difﬁcult to analyze since they usually contain many missing values. To build an efﬁcient predictive model, a complete dataset is necessary. An EHR usually contains high-dimensional longitudinal time series data. Most commonly used imputation methods do not consider the importance of temporal information embedded in EHR data. Besides, most time-dependent neural networks such as recurrent neural networks (RNNs) inherently consider the time steps to be equal, which in many cases, is not appropriate. This study presents a method using the gated recurrent unit (GRU), neural ordinary differential equations (ODEs), and Bayesian estimation to incorporate the temporal information and impute sporadically observed time series measurements in high-dimensional EHR data.


Introduction and Background
One of the biggest challenges to work with electronic health record (EHR) data is that there are many missing values. This issue incorporates uncertainty in the predictive model if the missing instances are imputed. Common imputation methods usually do not consider the temporal information, which is crucial for time series analysis. Moreover, most time series analysis methods ignore the time gap between measurements or assume that the time gaps are equal. In this study, we investigated time series imputation with irregular time gaps and propose a method based on neural ordinary differential equations (ODEs), recurrent neural networks (RNNs), and Bayesian estimation. This method offers a robust imputation of sporadically sampled multivariate time series measurements obtained from different patients.
Many data imputation techniques have been developed over the years. In reallife problems, it is very common to have multiple missing attributes for a particular dataset. In the literature, most datasets have 30% to 50% missing values, and they have been imputed using various techniques [1]. There are mostly two techniques widely used for data imputation. These are statistical techniques and machine-learning-based techniques. Among the statistical techniques, expectation minimization (EM), the Gaussian mixture model (GMM), Markov chain Monte Carlo (MCMC), naive Bayes (NB), principal component analysis (PCA), etc., have been used frequently [1]. Among the machinelearning-based techniques, the Gaussian process for machine learning (GPML) (see [2]), support vector machines (SVMs) [3,4], k-nearest neighbors (k-NNs) [5], decision trees (DTs) [6], and artificial neural networks (ANNs) [7] have been heavily used in the literature.
In recent years, longitudinal data imputation has been necessary specially in EHRs. However, many imputation methods only consider the data without the very important element-temporal information. However, there are many time series imputation methods that only consider equal time steps. Our research focuses on a time series imputation method that can deal with sporadically observed time series measurements obtained from EHRs. The major key components to implement this imputation method are neural ODEs [8], which parameterize the derivative of a neural network's hidden state. As compared to the popular residual neural networks, neural ODEs have superior memory and parameter efficiency. Neural ODEs can easily deal with continuous time series, unlike recurrent neural networks, which require discretization. In a follow-up study, latent ODEs were proposed for irregularly sampled (e.g., sporadic) time series [9]. This method (called ODE-RNN) is presented as an alternative to autoregressive models. However, neural ODEs [8] use RNNs as the recognition network to estimate the posterior probabilities. However, that approach is more appropriate for continuous time series modeling with regularly sampled data. Therefore, the ODE-RNN [9] has been introduced as the recognition network to deal with irregularly sampled continuous time series analysis. Combined with neural ODEs and the GRU, a Bayesian update [10] is proposed that uses a predictive (with observation masking) method (called the GRU-ODE-Bayes method) to include only the available observations to update the predicted values along the multivariate time series. However, this method does not impute the missing values; rather, it performs zero-or mean-value padding. The GRU-ODE-Bayes method assumes that the observations are sampled from a multidimensional stochastic process whose dynamics can be explained by a Weiner process. The examples include the stochastic Brusselator process [11], the double-Ornstein-Uhlenbeck (OU) stochastic differential equations [12], etc. The authors showed that their method achieved better results than the GRU-D [13,14], minimal gated unit or minimal GRU [15], and other popular methods. In another recent study [16], a bidirectional recurrent imputation for time series (BRITS) was proposed. This algorithm uses both a forward and backward feeding of inputs to the RNN and simultaneously imputes the missing values. However, the BRITS does not allow the stochastic imputation of time series data. The BRITS is composed of a recurrent component and a regression component for imputation. The authors also presented a unidirectional approach called the RITS and claimed that the process was slower compared to the BRITS. Among other RNN-based models, the multidirectional RNN (M-RNN) provides good imputation results [17]. This study aims to develop a robust multivariate stochastic imputation technique for irregular time series that will fill this research gap.

Data Processing
To develop a good predictive model, a reliable database is very crucial. The data need to be authentic and mostly accurate. Any big-data-driven research highly depends on the quality of the data being used. In this study, a very popular dataset [13,[18][19][20] named the Medical Information Mart for Intensive Care (MIMIC) clinical database was explored and analyzed. This dataset contains intensive care unit (ICU) admission records of patients admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts, from 2001 to 2012. This database has several versions. In this study, MIMIC III Version 1.4 was used, which is the latest. It contains de-identified electronic medical records, demographic information, and billing information for ICU-admitted patients. Many of these records contain timestamps of clinical events, nurse-verified physiological measurements, routine vital signs, check-up information, etc. However, this database is only available after completing a required course and acknowledging a data use agreement.
After accessing the electronic records from the MIMIC III clinical database, a clean dataset needs to be formed that can be useful for analysis. As most other databases, MIMIC III provides its users with different types of structured and unstructured data records that must be cleaned prior to performing any data-driven analysis. Furthermore, the electronic records sometimes have different artifacts associated with them. Data cleaning tends to be more difficult in the case of large databases such as MIMIC III. The search algorithms for the desired attributes need to designed in a way that they can extract the necessary information efficiently. There are hardware and software limitations due to which it becomes very difficult to carry out large matrix operations in traditional computers. Since the analysis highly depends on the data quality, a proper cleaning process should be selected and performed carefully. There might be anomalies in the chosen dataset that should be properly dealt with before any analysis.

Data Description
The MIMIC-III database is an information storehouse for critical care patients. Therefore, it needs to deal with proper care and privacy. To access the data, one needs to request access formally through the Physionet website (https://www.physionet.org) (last accessed on 15 November 2021). Two important steps need to be followed to access the data. The first one is to take a recognized course and comply with the Health Insurance Portability and Accountability Act (HIPAA) requirements. The second step is to sign a data use agreement that includes the appropriate data usage policy, security standards, and preventing identification efforts. Once the request is submitted, the approval comes within a week. Then, the data can be accessed from the server or can be stored in local storage. More information can be obtained by visiting the official MIMIC website (https://physionet.org/content/mimiciii/1.4/) (last accessed on 15 November 2021).
There are 26 tables in total in the MIMIC-III database (see Appendix A, Table A1). In this subsection, we mainly discuss the tables that were directly used for the analysis. Since our goal was to extract as many CHF-related variables as possible, we needed to explore different tables and match records with mostly unique patient IDs and sometimes with admission IDs. Different tables are linked together to compile the dataset that we needed. The tables are described in the following paragraphs.
The "Admissions" table provides unique hospital admission information for each patient. It reports the admission ID, ICUstay ID, date of admission, admission type, discharge location, diagnosis, insurance, language, religion, marital status, age, ethnicity, etc. This can be linked with other tables via the admission ID and patient ID.
The "Patients" table provides demographic information for 46,520 unique patients. It contains the patient ID, admission ID, date of birth, date of death, gender, hospital expire flag, etc.
The "Chartevents" table is the largest table in the entire MIMIC III database. This has about 330,712,483 records. This table provides the patient ID, admission ID, item ID, and the corresponding routine physiological measurements of each patient from time to time.
The "D_ICD_diagnosis" table contains unique patient IDs, unique hospital admission IDs, and International Coding Definitions Version 9 (ICD-9) for each of the 14,567 diagnosis categories. The code for CHF diagnosis is 4280.
The "D_items" table has 12,487 records of items used to treat different patients. Routine vital signs such as blood pressure, heart rate, white blood cell count (WBC), respiratory rate, and other numerical variables are listed here with distinct item IDs.

Data Cleaning
Before any type of analysis, the subset of the entire database, that is of interest, needs to be extracted. Data cleaning can be a very tedious process, especially in the case of such huge clinical databases as MIMIC III. Figure 1 shows the data-cleaning steps. The data-cleaning steps are quite challenging at times. At first, all patient records (56,320) were assessed. These records mostly contain the demographic information such as age, gender, religion, etc. There is some hospital-related information available as well, such as admission type, discharge location, length of stay, etc. Since this study focuses on CHF (and some related diagnoses)-diagnosed patients only, those were extracted using the ICD-9 diagnosis codes. This totaled 13,295 patients. Among these patients, many of them had been readmitted multiple times with the maximum number of readmissions as high as 13. If a patient is never readmitted or followed up, we kept their admission record as is. In the case of multiple readmitted patients, we only considered their first time readmission record and discarded the subsequent ones. Then, all the numerical and categorical features were joined to the patient list (total 10,027 patients) according to their unique ICU stay IDs.

Patient Cohort Selection
As mentioned earlier, mostly CHF patients were considered for this study. Along with CHF, the following diagnoses were also considered, as shown in Table 1. Table 1. All the diagnoses considered for patient cohort selection.

4280
Congestive heart failure, unspecified 4281 Left heart failure 4289 Heart failure, unspecified 42820 Systolic heart failure, unspecified 42821 Acute systolic heart failure 42822 Chronic systolic heart failure 42823 Acute or chronic systolic heart failure 42830 Diastolic heart failure, unspecified 42831 Acute diastolic heart failure 42832 Chronic diastolic heart failure 42833 Acute or chronic diastolic heart failure 42840 Combined systolic and diastolic heart failure, unspecified 42841 Acute combined systolic and diastolic heart failure 42842 Chronic combined systolic and diastolic heart failure 42843 Acute or chronic combined systolic and diastolic heart failure

Numerical Variables
EHRs contain time series measurements of different patients. It is sometimes very difficult to understand the contributing features of a certain outcome. Table 2 shows the chosen predictor numerical variables [21] for this analysis. The variables were selected based on multiple studies [22][23][24] and their importance to CHF readmission prediction.

Time Series Extraction
This section presents the filtering criteria and the personalized time series extraction process from the MIMIC-III v1.4 database. As mentioned in earlier sections, there were about 10,000 unique patients having CHF and other related diagnoses. For each of these patients, a unique time series was extracted that contained different types of measurements obtained for different items (e.g., heart rate, glucose, BUN, etc.). Figure 2 shows a sample patient with different item measurements taken at irregular intervals along the horizontal axis. Although it shows measurements beyond 48 h from discharge, this study only considered measurements up to 48 h from discharge. It is important to note that all the item measurements might not be available for all patients. Therefore, the time series imputation became more challenging due to the lack of data.
First, the patients were sorted using their unique ICU stay IDs. This ID distinguishes every single ICU stay of a patient. There were some patients who had been readmitted to the ICU more than once during the same hospital admission. In these cases, they usually had a the same admission ID, but different ICU stay IDs. That is why we chose to identify patients by their ICU stay IDs. However, their subject IDs and admission IDs are also stored in the patient database. A search algorithm was deployed to find each patient's measurement during his/her unique ICU stay.

Methodology
This section describes and implements the multivariate irregularly sampled time series imputation method that was based on neural ODEs, GRU, LSTM, and Bayesian estimation. It is important to discuss the useful technical details of this method so that it can be explained easily. The flowchart and mathematical notations are used for the explanation as necessary. are applicable here, as well. In the following sections, the imputation problem is introduced, and the methods for overcoming this challenge are discussed.
Any EHR database contains numerous vital measurement information for ICU patients. Due to many physiological factors, these measurements are not taken at the same time intervals. For predictive modeling and other data scientific procedures, regular intervals are usually expected. Therefore, EHR data can be challenging due to these irregular measurements.
The problem is to develop an imputation method that can perform two challenging tasks: capture the temporal information from irregularly sampled time series measurements and impute time series with high missing ratios. However, the standard imputation methods hardly consider temporal information and are mostly suitable for regular time series. This section describes the technical and mathematical details for the multivariate imputation method. The important components of the method-neural the ODE, GRU and Bayesian estimation-are discussed sequentially. To summarize the steps, the algorithm is presented in a compact version that is easier to understand. From this point, the proposed method is called the GRU LSTM ODE BAYES Imputation (GOBI) method.

Recurrent Neural Networks
Recurrent neural networks are a special type of artificial neural network that are able to exhibit temporal dynamic behavior in sequence data. They can process temporal information from the current state to the next state using hidden layers. However, the conventional RNN suffers from the vanishing gradient problem. This means that the weights of the neural network are more difficult to train further down the sequence because the loss function tends to be very close to zero. To avoid this problem, mostly two types of gated RNNs are used-long short-term memory (LSTM) and the gated recurrent unit (GRU). These two special types of RNNs were used in this study. Defining X as the input and h t and h t+1 as the previous and current hidden layers, the simple structure of an RNN layer is shown in Figure 3.

Neural ODEs
Neural ODEs [8] are useful for continuous-depth neural networks. This method involves performing a reverse-mode differentiation technique (e.g., backpropagation), which is quite difficult to train. While solving the ODEs, the adjoint sensitivity method is used. This approach usually takes less time and calculation effort. A scalar-valued loss function L [8] as described in Equation (1) is minimized, whose input comes from the ODE solver. Here, • h(t 1 ) = current hidden state; • h(t 0 ) = initial hidden state; • t 0 = initial time; • t 1 = current time; • θ = weight of neurons at synapses; • f = hidden unit dynamics function.
The loss function L incorporates all the time points (e.g., states) in the time series. The adjoint states help to calculate the gradients with respect to θ, as needed along the time sequence. Defining the adjoint a(t) = δL δt and taking the derivative yield [8], Note that Equation (2) allows the computation of gradients along the time sequence. However, the gradient of L with respect to θ is calculated by propagating backwards. This yields (See [8] Therefore, the above two equations can be efficiently calculated by any regular autodifferentiation packages. If the observations y i are sampled from the realizations of a D-dimensional stochastic process Y(t), the internal dynamics can be expressed as an unknown stochastic differential equation (SDE) as the following: Here, • dW(t) = Weiner process; • µ = mean of the probability density function of Y(t); • σ = covariance of probability density function of Y(t).
Then, the distribution of Y(t) evolves according to the Fokker-Planck equation [10].

GRU and the Differential Form of Its Hidden State
The GRU is a type of RNN that requires less time and calculation effort than its counterpart LSTM. However, there is no clear indication of the superiority of their performance on real-world datasets. As reported by many authors [13,15,17], the GRU and LSTM usually perform head to head with a slight edge over each other on different datasets and in different configurations. Figure 4 shows a standard GRU cell configuration. There are two main gates in a GRU: reset gate and update gate. This configuration makes the GRU very simple to train and take less time to compute all the parameters. As seen in Figure 4, a standard GRU cell contains the following elements [10]: Here, r t , z t , and g t denote the reset, update, and forget gate, respectively. Furthermore, corresponds to the elementwise product. Two matrices W ∈ IR H×D and b ∈ IR H×H denoting the weight and bias vectors are the cell parameters. H and D are the dimensions of the hidden process and given inputs. Therefore, the hidden state, h, of the GRU can be updated as follows [10]: In order to construct a first-order differential equation similar to Equation (1), the following can be obtained from Equation (8): Taking the differential with respect to t, the following can be obtained readily: Equation (10) can be solved using any regular first-order ODE solvers such as Euler, midpoint, Dormand-Prince (Dopri), etc.

Bayesian Estimation
Bayesian estimation allows the updating of prediction after the model is run on GRU cells. The most important feature of Bayesian estimation is the ability to incorporate new information as it becomes available. In this imputation method, only the available values can be used as a source of information since the missing values play no part. After the initial estimation from the GRU cells, a Bayesian estimation is necessary to include the information from observed values. This helps reduce the gaps between the observed and estimated values, which, in turn, improves the imputation performance.
In the original version, the model in [10] uses an integrated GRU cell to incorporate the Bayesian update with observed values. However, in this study, we used an LSTM cell to perform the Bayesian update since it provides a more accurate estimation [25]. However, the missing values are not updated since they do not have any effect on the hidden layers. The proposed hidden layer can be described as follows: Here, • h(t + ) = hidden state after Bayesian update; • h(t − ) = hidden state before Bayesian update; • f prep = perception layer; • y = observations; • k = observation mask. Figure 5 shows the Bayesian estimation effects on the prediction. Two loss functions are used to evaluate the updating performance as shown in Figure 5. The first function is the negative log-likelihood, described as follows: Here, • D = number of samples; • m = mask; • y j = current sample; • f obs = observed hidden layers.
The second function is the Kullback-Leibler (KL) divergence, which basically compares two probability distributions. Here, • D KL = KL divergence; • p Bayes,j = probability distribution after update; • p post,j = posterior distribution of observations.
The imputed mean and variances are calculated using the following equations: Since, the algorithm has many parameters and steps, it might be difficult to understand sometimes. Therefore, the algorithm sequences are outlined [10] in Figure 6 (λ = trade-off parameter between post-and pre-losses) below.

Experiment Design
The experiments were designed according to the needs and objectives of this study. As mentioned before, the missing ratio is very high in most of the numeric features. This scenario is not suitable for any ML-based techniques since they require complete datasets with much information available prior to training.
The training and testing ratio was 9:1, which means 90% of the samples were used for training and validation, while the rest were used for testing. To validate the results, 10-times 10-fold cross-validation was used. The following hyperparameters, shown in Table 3, were used to conduct all the experiments. Since this is an imputation task, the performance measures were the mean-squared error and mean absolute error. The corresponding formulae are described in the next section. All the experiments were performed on a workstation with the specifications as shown in Table 4.

Results
The following results were obtained from the five folds of the dataset with 1345 samples each. In the following

Comparative Analysis
The proposed method was compared with some baseline methods and RNNs. The results for both training and testing are shown in Tables 5 and 6, respectively. It is clearly seen that the GOBI method works better than most baseline and RNN-based methods. Two metrics are reported for comparison: root-mean-squared error (RMSE) and mean absolute error (MAE), as described by the following two equations: Here, • y i = true value for instance i; •ŷ i = imputed value for instance i; • N = number of instances.

Conclusions
The proposed model was based on neural ODEs, RNN units, and Bayesian estimation and is suitable for imputation tasks involving temporal information. In most real-life scenarios, temporal information is very sensitive and determines many aspects of our day to day lives. Therefore, it is not always right to ignore the time-sensitive information found in EHRs or other similar databases. Furthermore, it makes more sense to have a probabilistic imputation rather than a deterministic one since there is always some level of uncertainty associated with the imputed data.
The novel contribution of this model is that it is tailored specifically for multivariate irregularly sampled time series imputation. As mentioned earlier, most other imputation methods deal with regular time series and provide deterministic imputation. Both of these issues are addressed by the GOBI method in this study.
The performance of the GOBI method was satisfactory, as shown in the comparative analysis section. Many state-of-the-art methods have been developed for classification tasks, but RNN-inspired stochastic imputation is still a growing area of data scientific research. In this study, only EHR data in the MIMIC databases were used for imputation. However, the GOBI method works well for many other datasets, as well, and provides fairly good estimation [10]. As seen in Table 6, the GOBI model has greater accuracy and lower variance.
GOBI method has several advantages, as well. It takes less time to train since GRU cells are simpler to compute and have fewer parameters. Even with very large datasets, it works relatively faster than most RNN-based techniques [15,26]. Besides, the GOBI method is quite accurate as compared to many algorithms currently available. Although the performance might vary from dataset to dataset, still it should be fairly competitive overall.
The GOBI method has high potential in data science sectors. It should be very useful for analyzing large datasets with a high amount of missing values. It might have broad impacts in healthcare, manufacturing industries, process improvement, etc., because most of these sectors typically deal with missing data or have physical constraints for time-sensitive data collection. The GOBI method can deal with these types of problems, providing a good solution with an acceptable error margin.
The imputation algorithm presented here is of great importance due to the increasing number of missing values in EHRs. It is very common to have more than 50-60% missing values per channel in EHR time series data. Each patient has some set of demographics, which vary greatly from one patient to the other. Therefore, it is difficult to impute records of one patient based on another. There is hardly any way around this since some patients might not have any measurement for a particular channel or item. As mentioned earlier, the EHR time series data might have some underlying dynamics that might not be approximated well by the stochastic Weiner process. This opens up a great opportunity for further research. As for standard oscillating behaviors, there are many well-established equations to represent the internal dynamics. As shown in many recent articles, the simulated data for standard oscillations can be accurately estimated by the Weiner process. However, this might not be the case for most real-life EHR time series data. The proposed method not only imputes the time series data, but also provides an estimation of the level of uncertainty that the imputed values represent. In the proposed model, the Bayesian estimation is coupled with an LSTM cell. This allows for the update to happen only when there are available values. The missing values are then inferred from the resulting imputed time series. The target is to predict the mean imputed values as close as possible to the actual values. The log variance needs to be smaller as well, since higher levels of uncertainty are much more difficult to propagate and can easily jeopardize the entire predictive model. In most common imputation methods, the mean-squared error is minimized. The proposed method uses two losses (pre and post) before and after Bayesian estimation, which are useful to decide whether the new observational update reduces the error. The two loss functions (negative log likelihood and KL divergence) are well established and widely used in ML techniques. Considering these issues, further research can be concentrated on developing an imputation method that could efficiently learn the dynamics of the underlying process instead of assuming a Weiner process in every case. This would significantly boost performance in complex EHR data. Data Availability Statement: The data will be available only after completing the requirements from the MIMIC website (https://physionet.org/content/mimiciii/1.4/ (last accessed on 15 November 2021)).

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. All tables of MIMIC III and their total records.  Tables   Table name  Total records  Table name  Total