Predicting Blood Glucose Concentration after Short-Acting Insulin Injection Using Discontinuous Injection Records

Diabetes is an increasingly common disease that poses an immense challenge to public health. Hyperglycemia is also a common complication in clinical patients in the intensive care unit, increasing the rate of infection and mortality. The accurate and real-time prediction of blood glucose concentrations after each short-acting insulin injection has great clinical significance and is the basis of all intelligent blood glucose control systems. Most previous prediction methods require long-term continuous blood glucose records from specific patients to train the prediction models, resulting in these methods not being used in clinical practice. In this study, we construct 13 deep neural networks with different architectures to atomically predict blood glucose concentrations after arbitrary independent insulin injections without requiring continuous historical records of any patient. Using our proposed models, the best root mean square error of the prediction results reaches 15.82 mg/dL, and 99.5% of the predictions are clinically acceptable, which is more accurate than previously proposed blood glucose prediction methods. Through the re-validation of the models, we demonstrate the clinical practicability and universal accuracy of our proposed prediction method.


Introduction
Diabetes mellitus is an increasingly common chronic disease, which is mainly characterized by hyperglycemia and often induces multiple complications [1]. Hyperglycemia is also commonly encountered in the intensive care unit (ICU). Stress hyperglycemia (SH) often occurs in critically ill patients, even without a past diagnosis of diabetes [2][3][4]. Numerous observational studies demonstrate that elevated blood glucose is significantly associated with increased ICU mortality [5][6][7][8][9][10], and the aggressive correction of hyperglycemia with exogenous insulin can reduce the morbidity and mortality of ICU patients [11][12][13][14][15][16]. Continuously tracking blood glucose values is the best way to understand the effects of insulin on patients' blood glucose concentrations. Point sample test and continuous glucose monitoring (CGM) are currently the most commonly used blood glucose detecting methods [17]. The point sample test requires patients to draw blood with a lancet several times daily. Such discrete tests are prone to miss hyperglycemic or hypoglycemic events. The CGM system uses a microsensor inserted beneath the skin and held for a period to assess glucose levels in tissue fluids. However, limitations in detection accuracy and sensor lifetime make it still necessary for patients to order point sample tests for continuous calibration. The inconvenience, discontinuity, and limited accuracy of these two blood glucose detection methods have brought great hidden dangers to patients in clinical practice. Therefore, accurately predicting the blood glucose concentrations after exogenous insulin injections is of great significance for maintaining the stability of blood glucose, preventing the side effect of hypoglycemia, and reducing the mortality of patients in clinical practice. dataset from virtual patients [44,47,48]. The networks trained using datasets from a few patients can only simulate the physiological environment of a few specific people, and have very limited accuracy in predicting for patients that are not involved in the training data, which leads to the difficulty of predicting blood glucose after insulin injection with deep neural networks, and cannot be widely applied in clinical practice for the patients without continuous data records.
In this study, we try to explore whether the three commonly used deep learning regression models, including fully connected neural network (FCN), progressive neural network (CNN), and long short-term memory network (LSTM), could predict the effect of short-term insulin doses on blood glucose concentrations without the need for continuous past records. We extract independent discontinuous short-acting insulin-injection events occurring in the ICU from the Medical Information Mart for Intensive Care-IV (MIMIC-IV) database [53], and match each insulin injection record to the glucose values that triggered and related to the insulin treatment. Moreover, we also integrate some clinical characteristics that have been proven to be associated with insulin efficacy as predictive references, including the patient's age, gender, weight, ethnicity, blood pressure, and renal function indexes [54][55][56][57][58][59][60]. Using these atomic records of discrete insulin injection events to train deep neural networks, and taking these demographic characteristics and basic examination results of ICU patients as the network input, we construct various deep neural networks with different architectures to predict the blood glucose concentrations of ICU patients about 4 h after short-acting insulin injections. We re-validate the predictive accuracy of the network models using the K-fold test and the Chi-square test to demonstrate the universal accuracy of our proposed networks.

Dataset Extraction and Integration
We extracted our dataset from the MIMIC-IV database version 1.0. MIMIC-IV is an extensive, freely available database containing real medical information during hospital stays and ICU stays for patients admitted to a tertiary academic medical center in Boston, MA, USA. The data include the patient demographics, bedside monitoring of vitals, administration of medications, results of laboratory measurements, as well as diagnoses, procedures, and prescriptions for billing.
The primary objectives of our data extraction are as follows: 1.
Extract discontinuous short-acting insulin bolus injections occurring in the ICU from the electronic medical records in the MIMIC-IV database.

2.
Match the blood glucose values before insulin injection and around the peak time of insulin efficacy for each insulin injection event.

3.
Query and match the relevant characteristic values for each insulin injection event as the prediction basis.

4.
Remove all missing values and possible erroneous values.
For that purpose, we formulated data filtering and integration rules based on data analysis, clinician recommendations, and physiological and pharmacological standards. The overall process of data extraction is shown in Figure 1. Firstly, we extracted the records of short-acting insulin injection events (containing regular insulin, Humalog, and Novolog) from the table mimic_icu.INPUTEVENTS. After removing erroneous and incomplete data, we obtained 234,358 bolus injection records of short-acting insulin from 24,750 patients and 31,920 ICU stays (step A in Figure 1). Then we extracted the blood glucose values measured by laboratory chemistry analyzers from table mimic_hosp.LABEVENTS, and the blood glucose values measured by bedside fingerstick glucometers from table mimic_icu. CHARTEVENTS (4,450,933 in total). Posteriorly, glucose data from patients who did not receive any insulin injection were excluded (step B in Figure 1). After removing duplicates, errors, and outliers, we obtained a total of 1,298,679 blood glucose test results (step C in Figure 1). We matched the occurrence time of all blood glucose readings and insulin injection events to the ICU admission records from table mimic_icu.ICUSTAYS, and all events that did not occur in the ICU were removed (step D in Figure 1).
Secondly, to predict the effect of insulin doses on blood glucose concentrations, we tried to match patients' blood glucose values before and after insulin injection (glc_before and glc_after) for each insulin injection event according to time (step E in Figure 1). We calculated the time interval between each insulin injection and the closest blood glucose detection before and after the injection event (timediff_before and timediff_after). Figure 2a shows the density distribution of timediff_before. A total of 89.36% of the insulin injection events had a blood glucose record within 90 min before the insulin injection. Therefore, we selected the blood glucose value closest to the insulin injection event as glc_before from all blood glucose records within 90 min before and 5 min after the insulin injection. If there were no blood glucose records within 90 min before insulin injection, we would extend the backward search time and select the closest blood glucose results within 10 min after insulin injection as glc_before. Figure 2b shows the density distribution of timediff_after. The mean value of timed-iff_after was 3 h 43 min, and the median was 3 h 59 min. A total of 92.29% of insulin injection events have blood glucose records within 6 h after the insulin injection. Considering that the peak effect of short-acting insulin is about 4 h after injection, we selected the blood glucose value closest to 4 h and within 6 h after insulin injection as glc_after. Finally, we obtained 174,280 insulin injection events that were successfully matched to glc_before and glc_after. In addition, to further improve the prediction accuracy of the insulin effect, we extracted many related characteristics from the MIMIC-IV database (step F in Figure 1). We obtained the patients' weight and nutritional intake from table mimic_icu.INPUTEVENTS, patients' ethnicity from table mimic_core.ADMISSIONS, patients' gender and age from table mimic_core.PATIENTS, the average systolic and diastolic blood pressure (sbp and dbp) within two hours around insulin injection from table mimic_icu.CHARTEVENTS, and the values of creatinine and blood urea nitrogen (bun) from table mimic_hosp.LABEVENTS. Since the total amount of patient sugar intake could not be accurately measured according to the records from the MIMIC-IV database, we removed all the insulin injection records with nutrient solutions ingesting between the two blood glucose measurements before and after insulin injection. We also removed all the insulin injection records with glc_before less than glc_after to avoid the unknown nutrient intake affecting our prediction of the insulin efficacy. Finally, after removing all records with missing values and outliers, we obtained 86,833 insulin injection records as the dataset for predicting the effect of insulin doses on patients' blood glucose concentrations (step G in Figure 1).

Neural Network Training and Hyperparameter Tuning
For data preprocessing, we performed one-hot encoding for two discrete input characteristics (gender and ethnicity). For other numerical input characteristics, we performed min-max normalization and added the square and root of the normalized values into the input. We ended up with a 31-dimensional input dataset for all deep neural networks.
We trained three commonly used numerical regression neural networks for glucose prediction, including FCN, CNN and LSTM. We trained all the network models with different architectures using 80% of our dataset as the same training set. To explore the model architectures with high prediction accuracy, we generated network architectures with different depths and widths for these three kinds of neural network models.
We first tried to train FCNs with different numbers of fully connected layers (FL). When the number of network layers is less than 10, the network prediction accuracy improves with the increase in the network depth, indicating that there is still room for improvement. The prediction accuracy of the 10-layer and 11-layer networks is very close, the accuracy of 12-layer networks has a notable increase, and the 13-layer networks show a decrease in the testing accuracy and have an overfitting tendency. Through experiments, we found that CNN and LSTM networks have similar rules in the number of network layers and the prediction accuracy, so most of the layer numbers we tried were between 10 and 12. We also performed similar experiments to decide the width of the network layers, and we chose the layer size that improves prediction accuracy without causing overfitting while minimizing the computational cost. We tried to use different numbers of convolution layers (CL) and pooling layers (PL) in CNN. For the LSTM networks, we experimented with different numbers of LSTM layers and dropout layers. The complete architectures of the 13 networks with the highest prediction accuracy (RMSE below 16.90 mg/dL) are shown in the Appendix A in Table A2.
The kernel initializer for all Dense layers was RandomNormal, and for all Conv1D layers and LSTM layers was GlorotUniform. The bias initializer for all layers was Zeros. ReLU was used as the activation function of all the network layers. We used the Adam optimizer to update the weights of the networks. The learning rate of the Adam optimizer was 0.001, the beta1 was 0.9, and the beta2 was 0.999. Mean square error (MSE) was used as the loss function in the training process. Therefore, the goal of all the network training processes is to generate F that minimizes the MSE: The training epochs of deep neural networks are also important for prediction accuracy. To prevent insufficient training, we did not use the setting of earlystop. We tried different training epochs in 500 epochs increments. With the increase in training epochs, the prediction accuracy of the training set will gradually increase, while the prediction accuracy of the testing set will initially increase and then decrease. The decreased testing accuracy indicates that the network is overfitting to the training set, and the network training should be stopped. The training epochs with the highest test accuracy of each model are finally picked, as shown in the Appendix A in Table A2.
All the networks were trained with the Python Keras framework on four NVIDIA TITAN Xp GPUs with 12 G memory size. To maximize the throughput of the utilization machine, we set the training batchsize to 1000. Figure 3 shows the complete process of our predicting methods.

Network Evaluation Metrics
We used the following four metrics to measure the prediction accuracy of all network models. (y represents the actual blood glucose value after insulin injection, y represents the predicted blood glucose value after insulin injection, y represents the average value of y, and n represents the total number of the testing set): 1.
MAE: mean absolute error, which is calculated as the sum of absolute errors divided by the sample size. Lower values indicate better model fitting and more accurate prediction.
RMSE: root mean square error, which is the square root of the average squared errors.
Lower values indicate better model fitting and more accurate prediction.
R2_Score: The coefficient of determination, denoted R 2 or r 2 , can represent the proportion of the dependent variable that is predictable from the independent variables.
Higher values indicate better model fitting and more accurate prediction. MAE, RMSE, and R2_ Score are all common metrics for measuring the performance of regression learning models. EGA is one of the "gold standards" for determining the clinical accuracy of blood glucose prediction. These four evaluation metrics provide us with different helpful information. MAE is a linear score, meaning that all individual differences are equally weighted in the average and can visually represent the absolute prediction error. In comparison, since the errors are squared before being averaged in the calculation process of RMSE, RMSE gives relatively high weight to the larger errors and is more sensitive to outliers, which means that RMSE is most useful when significant errors are especially undesirable. Unlike MAE and RMSE for calculating offset distance, R2_ Score indicates how well the predictor variables can explain the variation in the response variable. R2_ Score is conveniently scaled between 0 and 1, independent of the actual value of the regression task, which is obviously easier to interpret and compare across different regression tasks. As a widely recognized indicator, EGA can provide a reference for the clinical reliability of our prediction network.
Our study is dedicated to predicting blood glucose concentrations after short-acting insulin injections. Since our prediction results will serve as an important reference for clinicians or intelligent monitoring devices in determining insulin doses, our prediction network must avoid large prediction errors that may affect the clinical treatment. From the perspectives of clinical accuracy, safety, and reliability, we choose RMSE, which is more sensitive to large prediction errors, as the primary metric to compare the prediction accuracy of all network models, and use the other evaluation metrics as auxiliary references. Most previous glucose prediction studies have similarly used RMSE as the primary assessment criterion [45][46][47][48][49][50]52].

Machine Learning for Comparison
Regression prediction is one of the most common applications of machine learning. We used scikit-learn to build and train five commonly used regression machine learning models: support vector regression (SVR), K-nearest neighbors regression (KNN), classification and regression tree (CART), random forest regression (RF), and extreme gradient boosting regression (XGBoost). Scikit-learn is a Python module that integrates various state-of-theart machine-learning algorithms for solving medium-scale supervised and unsupervised problems [62].

Model Retest
In order to verify the universal effectiveness of our network models and avoid the evaluation bias caused by the single testing set, we used K-fold cross validation [63] to re-validate the prediction accuracy of each network, which is a typical method for the reliability assessment of deep neural networks.
In our study, we relied on 5-fold cross validation to retrain and retest the 13 most accurate deep neural network models. We divided the entire dataset into five non-intersecting parts on average. For each training, we take four of the data parts as the training set and the remaining one as the testing set. After training and testing each model five times, all instances of the entire dataset were tested for each network model, and we obtained the mean prediction accuracy of the five testing sets for each network model. Furthermore, we performed the Chi-squared test [64] on the predicted RMSE values of the 13 networks on the 5 testing sets to prove that the networks have universal accuracy in predicting different testing sets.

Dataset Extracted from MIMIC-IV Database
From 244,000 short-acting insulin bolus injection events from the MIMIC-IV database, through data screening, glucose matching, and characteristic integration, we obtained 86,833 independent insulin injection events that occurred in the ICU without missing values. These insulin injection records and the relevant information come from 25,168 ICU admissions of 20,426 adult patients. We used 80% of the entries in the dataset as the training set for deep neural networks, and the remaining 20% were used as the testing set.
Each insulin injection record contains 11 items, including the patient's blood glucose before and after insulin injection, injected insulin dose, gender, age, weight, ethnicity, creatinine, blood urine nitrogen, systolic blood pressure, and diastolic blood pressure. The blood glucose before insulin injection is the closest glucose detection result to the injection event within 90 min before and 10 min after insulin injection. The blood glucose after insulin injection is determined according to the blood glucose detection results closest to 4 h after each injection event, which is the peak time of the efficacy of short-acting insulin. Creatinine and blood urine nitrogen are the basic vital signs of ICU patients, which are the main indicators reflecting the renal function of patients. Table 1 summarizes the statistical descriptions of 11 characteristics of the entire dataset that we created for this study, including the maximum, minimum, average, standard deviation and median of 9 digital characteristics, and the distribution of 2 discrete characteristics.

Deep Neural Networks and Performance Evaluation
In contrast to previous studies on blood glucose prediction shown in Appendix A Table A1, our study is not aimed at predicting the real-time blood glucose concentrations in the daily life of diabetic patients, but focuses on predicting the clinical effect of insulin dose on blood glucose in each insulin injection event. Most previous studies [45,47,48,50,52] require the relevant records of patients from tens of consecutive days for network training, and the trained models cannot predict the blood glucose for any patient without a long-term blood glucose history. However, in our study, we regarded each insulin injection as an independent atomic event unrelated to the patients' historical state, and used the patients' real-time characteristics at the time of insulin injection to predict the blood glucose after the short-acting insulin injection.
We built three kinds of deep neural networks, including FCN, CNN, and LSTM, to atomically predict the ICU patients' blood glucose concentration after the short-acting insulin injection. We tried dozens different network model architectures to explore whether these three commonly used numerical regression networks can be used for glucose prediction based on discrete insulin injection records and tried to find network models with high accuracy. The same training set was used to train all network models and the same testing set for network evaluation. The testing results of each network were evaluated using MAE, RMSE, and R2_Score, which are the common metrics to measure the prediction accuracy. Furthermore, we used Clarke error grid analysis (EGA) to evaluate the network performance, which is dedicated to measuring the accuracy of blood glucose predictions.
After experiments and comparisons, we obtained 13 network architectures with excellent prediction accuracy (RMSE less than 16.9 mg/dL). The detailed architectures of the 13 networks are shown in Appendix A Table A2, and the testing accuracy of these 13 network models is shown in Table 2. As in previous studies [45][46][47][48][49][50]52], taking RMSE as the primary evaluation criterion, the testing RMSEs of the five most accurate networks are all about 16 mg/dL, and the smallest RMSE is as low as 15.82 mg/dL. The accuracy of predicting using stochastically independent insulin injection records in this study is much better than that of previous studies [45][46][47][48][49][50]52] (RMSE more than 20 mg/dL, shown in Appendix A Table A1) predicted by continuous glucose monitoring data. Clarke error grid analysis of the prediction results of the five most accurate models shows that about 94% of the predicted values are located in Zone A of the error grid, and 5.5% are in Zone B, which means that over 99.5% of testing results could be regarded as being clinically acceptable. The Clarke error grid of the prediction results of the testing set for these five network models with the lowest RMSE is shown in Figure 4.
The testing accuracy of our proposed models demonstrates that the three most commonly used deep neural regression models (FCN, CNN, and LSTM) can accurately predict the blood glucose concentration after any atomically independent insulin-injection event without requiring any long-term relevant records. Our proposed method makes blood glucose prediction more effective, convenient and feasible in clinical practice.

Comparison with Machine Learning
To demonstrate the superiority of deep neural networks, we tried to use several machine learning models for blood glucose prediction. We trained five nonlinear regression machine learning models, including SVR, KNN, CART, RF, and XGBoost. The five models were trained using the same training set as the deep neural networks, and the same testing set was used for accuracy evaluation. Table 3 shows the prediction accuracy of the five machine learning models used for comparison. Among the five machine learning models, RF has the lowest RMSE and the highest R2_Score (RMSE = 17.07 mg/dL, R2_Score = 0.8869), and the CART model has the lowest MAE (MAE = 5.97 mg/dL). RF also performs best in EGA, with approximately 92.80% of the prediction results located in Zone A and Zone B. The comparison results show that all of the 13 individual neural networks perform better than the best machine learning models in all accuracy evaluation metrics.
It is worth discussing that the RMSE of RF is very close to our proposed deep neural network models, but its MAE is much higher. RF is an ensemble learning method based on decision trees, which integrates the prediction results of many decision trees by regressing the average prediction value. Since RF averages the predictions of multiple CART models, RF has a significant effect in reducing variance, which results in better testing RMSE and R2_Score. However, the MAE value of our RF model is as high as 8.29 mg/dL, which indicates that the RF model tends to overfit during the training process. In pursuit of the accurate prediction of some noise data and reducing the RMSE, which is sensitive to outliers, RF sacrifices the absolute prediction error of the overall dataset in the training process.

Test-Retest Reliability
We retested 13 deep neural networks using 5-fold cross validation to verify the accuracy of the network models. The 86,833 records in the dataset were randomly divided into five parts called folds, and all of the folds have instances of equal size (17,366 for each fold). Each model was trained and tested five times, each time using one part as the testing set and the other four parts as the training set. The final prediction accuracy of the 5-fold cross validation for each network model was the average test performance of the five parts, which is presented in Table 4. For all the deep neural network models proposed in this study, the RMSE values after retesting does not increase by more than 0.5 mg/dL, demonstrating that our models are generally effective in predicting the effect of insulin dose on blood glucose for ICU patients. We performed the Chi-squared test on the RMSE values of each testing fold for these 13 networks. The RMSE values of 13 models against five testing folds and the p-value of the Chi-squared test are shown in the Appendix A in Table A3. The p-value of the Chi-squared test is greater than 0.995, indicating that there is almost no statistical difference in the prediction accuracy for the five different testing sets, which further proves that the prediction results of the proposed models are generally accurate.

Conclusions
In this study, we developed 13 deep neural networks with different architectures, using 10 related characteristics as the network input to atomically predict the blood glucose concentrations of ICU patients after stochastically short-acting insulin injections without requiring any long-term history records..
We extracted 86,833 short-acting insulin bolus injection records discretely occurring in the ICU from the MIMIC-IV dataset, each record containing 11 characteristics associated with insulin efficacy. Using these records as the training and testing sets, we trained 13 deep neural network models with different architectures (shown in the Appendix A in Table A2). All of these 13 networks achieve good prediction accuracy. The testing RMSEs of these 13 deep neural networks are all below 16.90 mg/dL, reaching a minimum of 15.82 mg/dL, which is superior to five machine learning regression models (shown in Table 3) and the methods proposed in previous blood glucose prediction studies (shown in Appendix A Table A1.) The predicting EGA of these 13 network models shows that more than 99.5% of the prediction results are clinically acceptable.
In order to prevent the evaluation bias of model accuracy caused by the single testing set, we retested the 13 individual network models using 5-fold cross validation, and we performed the Chi-squared test on the prediction accuracy of the five different testing sets. The retesting results demonstrate that the network architecture proposed in this study is universally accurate in predicting the effect of insulin dose on blood glucose values.
In summary, our study demonstrates that it is a feasible, effective, and accurate method to atomically predict the effect of insulin doses on glucose concentrations using deep neural networks and discontinuous insulin injection records. The blood glucose concentration predicted by the deep neural networks has reference significance in clinical drug dosage. The various deep neural network architectures proposed in this study provide a reference for subsequent studies related to automatic insulin injection and continuous glucose monitoring. Our network models also provide basic simulation environments for further study on insulin dose recommendations using deep reinforcement learning.

Conflicts of Interest:
The authors declare no conflict of interest. Tables   Table A1. Summary of previous studies addressing blood glucose prediction using deep nueral networks: model type, number of participants in the cohort, mean number of monitored days per patient, methods and frequency of blood glucose monitoring, whether continuous recording is required (C), whether manual record is required (M), prediction horizon (PH), test performance of the model, citation and publish year.  3 Blood glucose risk index. 4 Long short-term memory network. 5 Type 2 diabetes. 6 Self-monitoring blood glucose device. 7 Root mean square error. 8 Continuous glucose monitoring devices. 9 Convolutional neural network. 10 Type 1 diabetes. 11 Convolutional recurrent neural network. 12 Gated recurrence unit. 13 Prediction-error grid analysis, an extension of the Clark error grid analysis.