In-Silico Evaluation of Glucose Regulation Using Policy Gradient Reinforcement Learning for Patients with Type 1 Diabetes Mellitus

In this paper, we test and evaluate policy gradient reinforcement learning for automated blood glucose control in patients with Type 1 Diabetes Mellitus. Recent research has shown that reinforcement learning is a promising approach to accommodate the need for individualized blood glucose level control algorithms. The motivation for using policy gradient algorithms comes from the fact that adaptively administering insulin is an inherently continuous task. Policy gradient algorithms are known to be well suited to continuous, high-dimensional control tasks. Previously, most approaches to automated blood glucose control using reinforcement learning have used a finite set of actions. In this work, we use the Trust-Region Policy Optimization algorithm, which represents the state of the art among deep policy gradient algorithms. The experiments are carried out in-silico using the Hovorka model, and stochastic behavior is modeled through simulated carbohydrate counting errors to illustrate the full potential of the framework. Furthermore, we use a model-free approach where no prior information about the patient is given to the algorithm. Our experiments show that the reinforcement learning agent is able to compete with and sometimes outperform state-of-the-art model predictive control in blood glucose regulation.


Introduction
Type 1 Diabetes Mellitus (T1DM) is a metabolic disease caused by the autoimmune destruction of insulin-producing beta cells in the pancreas [1]. The role of insulin is to enable the transport and utilization of glucose [2]. T1DM patients need life-long external insulin therapy to regulate their blood glucose concentrations. Without insulin, T1DM patients suffer from chronic high blood glucose levels (hyperglycemia); conversely, too much insulin causes hazardously low blood glucose levels (hypoglycemia). In fact, fear of hypoglycemia is a major limiting factor in glucose regulation for T1DM [3].
There are currently two dominant artificial pancreas controller algorithm paradigms: proportional-integral-derivative (PID) control [11,21] and MPC [22,23]. Model predictive control, in particular, uses a dynamic model with patient-specific parameters to predict the blood glucose curve into the future, where the prediction window is typically four hours, after which the fast-acting insulin's effect has mostly subsided [24]. If the predicted blood glucose curve and its final value are off the glucose target, MPC calculates an optimized sequence of basal rate actions on the model to correct the prediction towards the target while avoiding hypoglycemia. The first action of this sequence is then applied to adjust the basal rate temporarily, and the whole process is repeated at regular intervals, usually every five or ten minutes. MPC approaches require a good model of the dynamics. In the artificial pancreas system, MPC algorithms are based on glucose-insulin regulatory models that are not able to capture external perturbations, so these algorithms are limited in their ability to compensate for the incomplete model used in the artificial pancreas application [25].
In addition to PID control and MPC, there have been investigations into fuzzy logic [26] and, more recently, techniques from machine learning and statistics, such as Aiello et al. [27], who proposed a blood glucose forecasting approach based on recurrent neural networks. Similarly, Li et al. [28] created a deep learning forecasting framework built on convolutional neural networks. The control algorithm used in the artificial pancreas system has to learn models that are rich enough and adapt to the system as a whole [25]. In particular, reinforcement learning (RL), a branch of machine learning based on interactive learning from an unknown environment [29], has in recent years gained increased attention in artificial pancreas research [30][31][32][33][34][35][36][37][38][39]. A complete systematic review of reinforcement learning applications in diabetes blood glucose control can be found in [40]. Outside of diabetes-related research, RL has been particularly successful in achieving performance that exceeds the level of top human players in strategy games. The examples range from Backgammon in the early 1990s to, more recently, the game of Go in 2015, where RL was combined with deep neural networks and Monte Carlo tree search [41,42]. RL allows us to introduce model-free and data-driven algorithms that can enable another level of patient individualization [25]. Finally, previous work by the authors has shown promise for the use of RL in the artificial pancreas [32]. In that work, the amount of infused insulin was selected from a fixed and finite list of values, while the blood sugar level was treated as a continuous variable. In addition, there are several recent works using similar methodology [30,33,34,36-39].
In this work, we extend the evaluation of RL algorithms for the artificial pancreas and study the performance of policy gradient RL algorithms. It is well known in the RL literature that policy gradient algorithms are the most suitable for problems where the action space is continuous. This is an important step in the intersection between RL and diabetes research. Furthermore, we focus on deep policy gradient methods due to the flexibility, power, and availability of modern neural network approaches [43][44][45][46].
We perform in-silico experiments while using the Hovorka model [22] and the trust-region policy optimization of Schulman et al. [45]. Our experiments demonstrate that RL can adapt to carbohydrate counting errors and that RL is flexible enough to treat a population of 100 patients using a single set of training hyperparameters. We consider MPC to be the current state-of-the-art approach and, thus, we compare the performance of the RL agents to that of MPC. Performance is measured through time-in-range (time spent on healthy blood glucose levels), time in hypo-/hyperglycemia, as well as blood glucose level plots for visual inspection.

Related Work
We include a quick overview of the most recent developments in deep reinforcement learning for the artificial pancreas. Sun et al. [35] used reinforcement learning to learn the parameters of the insulin pump, specifically the insulin-to-carbohydrate ratio, and not the insulin action itself; they do not use neural networks in the process. The work of Zhu et al. [38] is quite similar to ours; however, they use PPO, a simpler version of TRPO, and their state space consists of the blood glucose level, its rate of change, and an estimate of insulin-on-board. The main difference between their work and this work is that they design a reward function that mimics the natural behaviour of the missing beta-cells, whereas our work focuses on a reward that encodes a more direct approach towards well-established performance measures for T1D therapy (time-in-range, etc.). Finally, Lee et al. [39] proposed a Q-learning approach, where a discrete number of actions modify the given basal rate. They also operate in a dual-hormone setting, where the infusion of glucagon is one of the actions. Their reward function, however, is quite similar to ours. Lastly, they provide an alternative approach to training, where a population-level policy is first learned, followed by individual adaptation to each in-silico patient.

Structure of Paper
We begin with a short introduction to RL in Section 2 followed by a section about in-silico simulation for T1DM in Section 3. In Section 4 we present results and discussions. Section 5 provides concluding remarks and directions of possible future work.

Theoretical Background
In this section, we present the relevant theoretical background. We start with an introduction to RL, followed by a short section on MPC.

Reinforcement Learning
Informally, RL concerns the behavior of a decision-making agent interacting with its unknown environment. In this framework, the goal is to train an agent to take actions that result in preferable states. Figure 1 shows the agent-environment interaction, where at each time step the agent observes the current state of the environment and performs an action based on that state. As a consequence of this action, the environment transitions to a new state. In the next time step, the agent receives a positive or negative reward from the environment due to the previous action taken [29]. The mathematical basis of reinforcement learning is the Markov decision process, represented by the tuple (S, A, P, R, γ). S and A are the state and action spaces, respectively; P contains the state transition probabilities p(s'|s, a), the probability of transitioning to state s' from s using action a; R contains the rewards, represented by the reward function r(s, a, s'), which defines the goal of the problem; and 0 < γ ≤ 1 is a discount factor. The mapping from state to action is called the policy, which can be either a deterministic function π : S → A or a set of conditional distributions π(a|s), depending on the environment the agent is interacting with. The goal of any RL algorithm is to learn an optimal policy π* that maximizes the expected return it receives over time, i.e., the accumulated reward G_t = ∑_{k=0}^{∞} γ^k R_{t+k+1}, where R_t = r(s_t, a_t, s_{t+1}). The expected return assuming that the agent starts from the state s and thereafter follows the policy π is called the value function v_π(s).
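As a concrete illustration of the return, the following minimal sketch (with made-up reward values) computes G_t from a finite reward sequence by accumulating back to front:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k R_{t+k+1}, accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.5, -1.0, 2.0]  # hypothetical per-step rewards
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9*0.5 + 0.81*(-1) + 0.729*2
```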
Concretely, the value function specifies the long-term desirability of states, indicating the total amount of reward that is expected by the agent:

v_π(s) = E_π[G_t | S_t = s].

Similarly, the expected return assuming that the agent starts from the state s, takes action a, and thereafter follows the policy π is called the action-value function q_π(s, a):

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]. (1)

The ultimate goal of RL is to find an optimal policy, a policy that is better than or equal to all other policies based on the values of the states. Realizing this goal in practice has led to two main branches of RL algorithms, value based algorithms and policy based algorithms, see e.g., [47].
Value based algorithms aim to estimate the value of each state the agent observes. Decisions are then made such that the agent spends as much time as possible in valuable states. A policy in value based RL is often simply a greedy search over each action in the given state, where the action that gives the highest value is chosen. In the case of a RL agent controlling e.g., an insulin pump in the T1DM case, such states could be safe blood glucose levels, while states with lower value would be either high or low blood glucose values.
Policy based algorithms change the viewpoint from looking at how valuable each separate state is, to evaluating how good the policy itself is. Given some parametric policy, a performance measure for the policy is defined-most commonly how much reward the agent can get over a certain amount of time. This measure is then optimized using gradient-based methods. For the T1DM case, this performance measure could for example be time-in-range.

Policy Gradient Methods
Policy gradient algorithms consider a parametric policy, π(a|s, θ) = P(a|s, θ), and the goal is to optimize this policy using gradient ascent with a given performance measure J(θ), with parameter updates θ_{t+1} = θ_t + α∇J(θ_t) [29]. The most common choice for the performance measure is the expected return of the initial state s_0, given as

J(θ) = v_{π_θ}(s_0). (2)

This is equivalent to optimizing the value of the initial state: a policy is thus considered to be good if it can generate a lot of reward during the course of an episode.
There are multiple benefits to using policy gradient algorithms: they can be applied directly to continuous action spaces; the policy gradient theorem, introduced below, shows that any differentiable parametric policy can be used; and, in the limit, deterministic policies can be modeled by policy gradients, which is useful if we do not want stochastic actions in an online setting, such as in the diabetes case.
One of the key points of policy gradient algorithms is the policy gradient theorem [48]:

∇J(θ) ∝ ∑_s μ(s) ∑_a q_π(s, a) ∇π(a|s, θ), (3)

where the distribution μ is the stationary distribution of the states succeeding s_0 when following π. This theorem states that the gradient of the performance measure is proportional to the gradient of the policy itself. This is of great benefit, as it allows the use of any differentiable policy parameterization. The policy gradient theorem allows, with some simple modifications to Equation (3), the formulation of a simple sample-based algorithm, called REINFORCE. Instead of updating based on summing over all actions, the policy gradient is rewritten using a single sample S_t, A_t, and the gradient update rule becomes

θ_{t+1} = θ_t + α G_t ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t).

The complete derivation can be found in Sutton and Barto [29] and the entire algorithm is shown in Algorithm 1.
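To make the REINFORCE update concrete, here is a minimal toy sketch (not the paper's implementation): one-step episodes with a 1-D Gaussian policy whose mean is the only parameter, and a hypothetical reward that peaks at an unknown target action.

```python
import random

# Toy REINFORCE: the policy is N(mu, sigma^2) with learnable mean mu and
# fixed sigma; the (invented) reward is highest when the action hits `target`.
def reinforce(target=1.0, sigma=0.5, alpha=0.01, episodes=5000, seed=0):
    rng = random.Random(seed)
    mu = 0.0                                  # the policy parameter theta
    for _ in range(episodes):
        a = rng.gauss(mu, sigma)              # sample action from pi(.|theta)
        g = -(a - target) ** 2                # return of this one-step episode
        grad_log_pi = (a - mu) / sigma ** 2   # d/d_mu of log N(a; mu, sigma)
        mu += alpha * g * grad_log_pi         # theta <- theta + alpha * G * grad
    return mu

print(reinforce())  # the mean drifts toward the target action
```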
The REINFORCE algorithm has been well studied, and a number of improvements have been proposed [45,46,49]. The current state of the art in model-free policy gradient algorithms is Trust-Region Policy Optimization by Schulman et al. [45] and a simplified version of the same algorithm called Proximal Policy Optimization [46]. In this work, we restrict our attention to the former.
Trust-region policy optimization (TRPO) is based on the fact that if the policy gradient update is constrained by the total variation divergence between the old and the new policy, D_TV(π_1, π_2) = max_{s∈S} |π_1(·|s) − π_2(·|s)|, the performance of the policy is guaranteed to increase monotonically [45]. Rewriting the total variation divergence using the Kullback-Leibler divergence and introducing approximations using importance sampling, trust-region policy optimization reduces to solving the following optimization problem:

maximize_θ E[ (π_θ(a|s) / π_{θ_old}(a|s)) q_{θ_old}(s, a) ]
subject to E[ D_KL(π_{θ_old}(·|s) ‖ π_θ(·|s)) ] ≤ δ,

where q_{θ_old}(s, a) is the action-value function, i.e., the value of taking action a in state s when following the policy π_{θ_old}, D_KL is the Kullback-Leibler divergence, and δ is the bound on the Kullback-Leibler divergence. See Schulman et al. [45] for a complete description of the algorithm.
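The two quantities TRPO balances can be illustrated with a small sketch using 1-D Gaussian policies and made-up sampled actions and advantages; the closed-form Gaussian KL stands in for the averaged KL constraint.

```python
import math

def gauss_pdf(a, mu, sigma):
    return math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def surrogate(actions, advantages, old, new):
    """Importance-sampled surrogate: mean of pi_new(a)/pi_old(a) * A(s,a)."""
    ratios = [gauss_pdf(a, *new) / gauss_pdf(a, *old) for a in actions]
    return sum(r * adv for r, adv in zip(ratios, advantages)) / len(actions)

def kl_gauss(old, new):
    """KL(N_old || N_new) for 1-D Gaussians, in closed form."""
    mu0, s0 = old
    mu1, s1 = new
    return math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

old_policy = (0.0, 1.0)          # (mean, std) of the behavior policy
new_policy = (0.2, 1.0)          # a candidate updated policy
acts = [-0.5, 0.1, 0.8]          # illustrative sampled actions
advs = [-1.0, 0.5, 2.0]          # illustrative advantage estimates
delta = 0.05                     # trust-region radius
print(surrogate(acts, advs, old_policy, new_policy), kl_gauss(old_policy, new_policy))
# A candidate update is admissible only if kl_gauss(old, new) <= delta.
```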

Parameterized Policies
The most common way to generate a parametric policy in a continuous action space is to use the Gaussian density function

π(a|s, θ) = (1 / (σ(s, θ)√(2π))) exp( −(a − μ(s, θ))² / (2σ(s, θ)²) ),

where μ(s, θ) and σ(s, θ) are both state-dependent parametric feature extractors. We use neural network feature extractors for both μ and σ in this work. In this way, μ = nn_μ(s, θ) is a multilayer perceptron with three hidden layers of 100, 50, and 25 hidden neurons, respectively, where θ are the weights of the neural network, mapping the state space to the mean of the Gaussian function. We decided to use this neural network architecture following [43], where a feedforward neural network policy with the same number of layers and hidden neurons is used to test and evaluate several tasks with continuous action spaces. σ can either be a fixed vector, σ = r ∈ R^d, where d is the dimension of the action space, or the output of a different neural network, σ = nn_σ(s, θ). In this case, the multilayer perceptron used for σ consists of two hidden layers, each with 32 hidden neurons. It is common to take the exponential of σ to ensure a positive standard deviation [29,45]. In the multivariate case, a diagonal covariance matrix is used. For both neural networks, μ and σ, we used non-linear tanh activation functions in the intermediate layers, while linear activation functions are used in the output layers. Thus, the action is a sample from N(μ, σ²). An illustration is shown in Figure 2.
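A minimal sketch of this parameterization, with untrained random weights and the layer sizes stated above (100, 50, 25 hidden units for the mean network); the fixed log-std variant is shown, and the exponential guarantees a positive standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random (untrained) weights and zero biases for each layer."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(layers, x):
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:      # tanh on hidden layers, linear output
            x = np.tanh(x)
    return x

state_dim, action_dim = 10, 1        # illustrative dimensions
mu_net = init_mlp([state_dim, 100, 50, 25, action_dim])
log_std = np.zeros(action_dim)       # exp() keeps sigma strictly positive

def sample_action(state):
    mu = mlp_forward(mu_net, state)  # state-dependent mean
    sigma = np.exp(log_std)
    return rng.normal(mu, sigma)     # action ~ N(mu, sigma^2)

print(sample_action(np.ones(state_dim)))
```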

Model Predictive Control
Model predictive control (MPC) is currently the state-of-the-art for artificial pancreas systems [50][51][52], and is used in commercial systems including the recently FDA approved Control-IQ TM advanced hybrid closed loop technology [53]. In general, MPC is a collection of algorithms where a model of the process is used to predict the system's future behavior. Optimal actions are then computed, while using an objective function, to ensure that the predicted behavior matches the optimal desired behaviour [54]. Algorithms typically differ in the type of model and the objective function used [54]. MPC has the advantage of incorporating constraints in the objective function. This is particularly beneficial for the artificial pancreas system case that is characterized by long delay times [54]. Because the quality of the results is completely driven by the ability of the model to describe the true process state, most MPC algorithms are used in conjunction with state estimation techniques, such as the Kalman filter [24]. The main drawback of MPC is that adapting the model to each patient individually and accounting for intra-day variability is completely dependent on the structure of the predictive model [25]. If the model is not expressive enough to capture the situation, the algorithm will fail. Several recent works have tried to lessen this burden by using multiple predictive models [55][56][57].
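As a purely illustrative sketch of the receding-horizon idea, not the clinical MPC used in the paper, consider a toy one-state linear glucose model and a small grid of candidate basal rates; the drift and insulin-effect constants are invented for illustration.

```python
def predict(g, basal, steps, drift=1.0, insulin_effect=2.0):
    """Toy model: glucose drifts upward and is pulled down by insulin."""
    traj = []
    for _ in range(steps):
        g = g + drift - insulin_effect * basal
        traj.append(g)
    return traj

def mpc_step(g, target=108.0, horizon=8, candidates=None):
    """Pick the first action of the best constant basal-rate sequence."""
    candidates = candidates or [0.0, 0.25, 0.5, 0.75, 1.0]
    def cost(basal):
        return sum((gp - target) ** 2 for gp in predict(g, basal, horizon))
    return min(candidates, key=cost)

print(mpc_step(150.0))  # high glucose -> the highest candidate basal rate
```

In a real receding-horizon controller this selection is re-run every few minutes on fresh measurements, so only the first action of each optimized sequence is ever applied.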

In-Silico Simulation
Most in-silico T1DM research is centered around three physiological models: the Bergman (minimal) model [58], the Hovorka model [22] and the UVA/Padova model [59], see also [60]. The minimal model is a simplified model consisting of two equations describing the internal dynamics of glucose and insulin and does not account for the significant delay involved in subcutaneous insulin infusion. The Hovorka model and the UVA/Padova both include this delay as well as the delay in the subcutaneous glucose measurement. In this work, we use the Hovorka model.

Simulator
The Hovorka model consists of five compartments that describe the dynamics of glucose kinetics and insulin action [61], two external, and three internal compartments. The three internal compartments describe insulin action, glucose kinetics and glucose absorption from the gastrointestinal tract. The two external compartments describe subcutaneous insulin absorption and interstitial glucose kinetics. The original model includes one virtual patient, which we use in our experiments. In addition, we follow Boiroux et al. [24] and use model equations, parameters and distributions as given in Hovorka et al. [22] and Wilinska et al. [62] to simulate further virtual patients. Unconstrained sampling from these distributions can lead to unrealistic virtual patients, as was also pointed out in Boiroux et al. [24]. To cope with this, the samples were constrained to the following set of rules [63].
• Patient weight is sampled from a uniform distribution between 55-95 kg.
• When the basal rate is delivered and the patient is in fasting conditions, glucose levels are constant and between 110-180 mg/dL.
• The patient's basal rates were sampled from a uniform distribution between 0.2-2.5 U.
• The patient's carbohydrate ratios were sampled from a uniform distribution between 3-30 g/U.
• Each patient is characterized by a unique insulin sensitivity factor (ISF) S_i mg/dL/U, i.e., if an insulin bolus of size 1 U is delivered, glucose levels will drop by ISF mg/dL.
• The patient's insulin sensitivities were sampled from a uniform distribution between 0.5-6.5 mmol/L.
• A theoretical total daily dose (TDD) of insulin is computed assuming a daily diet of carbohydrates between 70-350 g. This value is then compared to the sampled insulin sensitivity to ensure that the 1800 rule holds: ISF = 1800/TDD.
• A theoretical total fraction of basal insulin is computed and compared to the TDD to ensure that the proportion of basal insulin is between 25-75% of the TDD.
• All of Hovorka's parameters [62] are sampled using a log-normal distribution (to avoid negative values) around the published parameters.
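The rules above can be sketched as a rejection-sampling loop: draw candidate parameters from the stated uniform ranges and keep a candidate only if the consistency checks hold. In this sketch the basal rate is interpreted as U/h, the ISF is converted from mmol/L to mg/dL (multiplying by 18), and the 1800-rule tolerance is an assumption.

```python
import random

def sample_patient(rng, tolerance=0.5, max_tries=10000):
    """Rejection-sample one admissible virtual patient (illustrative only)."""
    for _ in range(max_tries):
        weight = rng.uniform(55, 95)          # kg
        basal = rng.uniform(0.2, 2.5)         # U/h (assumed unit)
        carb_ratio = rng.uniform(3, 30)       # g/U
        isf = rng.uniform(0.5, 6.5) * 18.0    # mmol/L -> mg/dL (assumed conversion)
        daily_carbs = rng.uniform(70, 350)    # g/day
        bolus = daily_carbs / carb_ratio      # theoretical daily bolus insulin
        tdd = 24 * basal + bolus              # theoretical total daily dose
        basal_fraction = 24 * basal / tdd
        rule_isf = 1800 / tdd                 # the "1800 rule"
        if (abs(isf - rule_isf) / rule_isf < tolerance
                and 0.25 <= basal_fraction <= 0.75):
            return {"weight": weight, "basal": basal, "carb_ratio": carb_ratio,
                    "isf": isf, "tdd": tdd}
    raise RuntimeError("no admissible patient found")

print(sample_patient(random.Random(42)))
```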

Reinforcement Learning, T1DM and the Artificial Pancreas
Because applying reinforcement learning to any problem assumes an underlying Markov decision process, we need to take this into account when designing the state and action spaces for the T1DM case. Several factors influence whether or not we can interpret the glucose-insulin dynamics as a Markov decision process, most notably the delayed action caused by the use of subcutaneous insulin infusion. Depending on the type of insulin used, the maximum effect of insulin is delayed and can last up to four hours [64]. On top of this comes the delay between the subcutaneous CGM measurements and the true blood glucose values, which is typically between 5 and 15 min [65]. One of the fundamental properties of reinforcement learning algorithms is that they can control systems with delayed rewards [66]. This implies that an action in a state can still be considered good even if the immediate reward from taking that action is not. Furthermore, we note that, since the insulin infusion is the action taken by the RL agent, the environment will not change its state immediately because of the delayed insulin effect. Therefore, in this work we use 30 min time intervals as the time between each updated state from the environment. The insulin basal rate is kept constant during these 30 min. This allows the environment enough time to change significantly between time steps.
The final component is the reward function. In this work, we used two different reward functions, a symmetric Gaussian reward function and an asymmetric reward function, previously introduced in [67]. The Gaussian reward is given as

r(g) = exp( −(g − g_ref)² / (2h²) ), (4)

where g is the current blood glucose value, h is a smoothing parameter, and g_ref is the reference blood glucose value, fixed at 108 mg/dL. The asymmetric reward function was, in [67], designed to reduce hypoglycemia while, at the same time, encouraging time-in-range. It is built as a piecewise smooth function and gives a strong negative reward for severe hypoglycemia, followed by an exponentially decaying negative reward for hypoglycemic events starting at severe hypoglycemia, and zero reward when hyperglycemia occurs. Positive rewards from a symmetric linear function are given for glucose values in the normoglycemic range. Concretely, the function is given as

r(g) = −c if g < g_hypo−,
r(g) = −c exp(−κ(g − g_hypo−)) if g_hypo− ≤ g < g_hypo,
r(g) = 1 − |g − g_ref| / d if g_hypo ≤ g ≤ g_hyper,
r(g) = 0 if g > g_hyper, (5)

where hyperglycemia is defined as values above g_hyper = 180 mg/dL, hypoglycemia as values below g_hypo = 72 mg/dL, and severe hypoglycemia as values below g_hypo− = 54 mg/dL. Thus, the normoglycemic range covers values in [g_hypo, g_hyper] mg/dL. The parameters of the reward (here, the positive constants c, κ, and d) were found experimentally. Figure 3 shows a graphical representation of the reward function.

Figure 3. The asymmetric reward function. Low blood glucose levels (a) are more penalized than (b) high blood glucose levels.
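The two reward shapes can be sketched as follows; the numeric constants in the asymmetric reward (the plateau penalty, decay rate, and linear half-width) are illustrative stand-ins, since the paper's parameters were tuned experimentally.

```python
import math

G_REF, G_HYPO_SEVERE, G_HYPO, G_HYPER = 108.0, 54.0, 72.0, 180.0

def gaussian_reward(g, h=30.0):
    """Symmetric Gaussian reward around the target; h is a smoothing width."""
    return math.exp(-((g - G_REF) ** 2) / (2 * h ** 2))

def asymmetric_reward(g):
    if g < G_HYPO_SEVERE:                 # severe hypoglycemia: flat strong penalty
        return -100.0
    if g < G_HYPO:                        # hypoglycemia: exponentially decaying penalty
        return -100.0 * math.exp(-(g - G_HYPO_SEVERE) / 5.0)
    if g > G_HYPER:                       # hyperglycemia: zero reward
        return 0.0
    # normoglycemia: symmetric linear reward, peaking at the target
    half_width = max(G_REF - G_HYPO, G_HYPER - G_REF)
    return 1.0 - abs(g - G_REF) / half_width

for g in (50, 60, 108, 150, 200):         # illustrative glucose values, mg/dL
    print(g, round(asymmetric_reward(g), 3))
```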

Experiment Setup
The reinforcement learning agent controls the basal insulin rate of the pump, which is updated every 30 min. In this work, we use two different action spaces during our experiments. Following Boiroux et al. [24], we define the action space of the agent as ranging from zero, where the controller stops the insulin pump, to twice the optimal basal insulin rate (designated as TRPO in the results section). In addition, we use an extended version of the action space, which ranges from zero to three times the optimal basal insulin rate (designated as TRPOe in the results section). It is further assumed that the patient estimates and manually announces the amount of carbohydrates taken at each meal, and a bolus is given according to the patient's individual carbohydrate-to-insulin ratio (the number of grams of carbohydrates that one unit of insulin will counteract). The state space of the RL agent is defined as the concatenation of the last 30 min of blood glucose values, the last 2 h of insulin values (the last 4 actions/basal rates given in 30 min intervals), the insulin bolus on board, which is an estimate of how much insulin is still active in the patient's body from previous bolus doses, and the size of the last given bolus if a bolus was given during the last 30 min. The insulin on board was calculated using an exponential decay model, as described in Loop (https://github.com/LoopKit/Loop). We note that using the previously taken insulin violates the MDP assumption. We chose to accept this compromise for two reasons: (1) the insulin and carbohydrate dynamics operate on fundamentally different time scales, see e.g., [62], and (2) information about previous insulin and insulin on board is essential knowledge that the agent cannot do without.
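A sketch of how such a state vector might be assembled, assuming 1-min CGM samples; the exponential decay time constant is an assumption for illustration, not the Loop implementation.

```python
import numpy as np

def insulin_on_board(bolus_events, now_min, tau=120.0):
    """bolus_events: list of (time_min, units). Simple exponential decay."""
    return sum(u * np.exp(-(now_min - t) / tau)
               for t, u in bolus_events if t <= now_min)

def build_state(cgm_30min, last_4_basals, bolus_events, now_min, last_bolus):
    """Concatenate CGM history, recent basal actions, IOB, and last bolus."""
    iob = insulin_on_board(bolus_events, now_min)
    return np.concatenate([cgm_30min, last_4_basals, [iob, last_bolus]])

cgm = np.full(30, 120.0)                  # hypothetical glucose trace, mg/dL
basals = np.array([0.8, 0.8, 1.0, 0.9])   # last 2 h of basal actions, U/h
state = build_state(cgm, basals, [(0, 5.0)], now_min=120, last_bolus=0.0)
print(state.shape)  # 30 + 4 + 2 = 36 features
```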
To measure the performance of our simulations, we use time-in-range (TIR) and time-in-hypoglycemia (TIH) as performance measures, where we want to maximize the former and minimize the latter. We consider the normoglycemic range to be values between 72-180 mg/dL and hypoglycemia to be values below 72 mg/dL; see Danne et al. [68] for further details. (We ended up using 72 mg/dL as the threshold instead of 70 due to converting from the local standard of using 4 mmol/L as the hypoglycemia threshold.) In addition, we use the coefficient of variation (CoV), σ/μ, defined as the ratio of the standard deviation to the mean of the blood glucose, to measure glycemic variability [69], and the risk index (RI), including the high and low blood glucose risk indices, as described in Clarke and Kovatchev [70]. The RI measures the overall glucose variability and the risks of hyper- and hypoglycemic events, while the high and low blood glucose indices (HBGI and LBGI) measure the frequency and extent of high and low blood glucose readings, respectively.
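These measures can be computed from a CGM trace as sketched below; the LBGI/HBGI formula used here is the commonly published Kovatchev symmetrization for mg/dL readings, and the example trace is invented.

```python
import numpy as np

def metrics(bg):
    """Performance measures for a blood glucose trace in mg/dL."""
    bg = np.asarray(bg, dtype=float)
    tir = np.mean((bg >= 72) & (bg <= 180)) * 100     # % of samples in range
    tih = np.mean(bg < 72) * 100                      # % in hypoglycemia
    cov = bg.std() / bg.mean()                        # coefficient of variation
    f = 1.509 * (np.log(bg) ** 1.084 - 5.381)         # Kovatchev symmetrization
    lbgi = np.mean(10 * np.minimum(f, 0) ** 2)        # low BG index
    hbgi = np.mean(10 * np.maximum(f, 0) ** 2)        # high BG index
    return {"TIR%": tir, "TIH%": tih, "CoV": cov,
            "LBGI": lbgi, "HBGI": hbgi, "RI": lbgi + hbgi}

trace = [60, 90, 110, 140, 190, 250]                  # illustrative values
print(metrics(trace))
```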
To train the algorithms, we use a standard reinforcement learning setup: (1) the agent collects episodes from the environment, then (2) the agent updates its parameters based on the rewards ((4) and (5)), and the process repeats until training is done (e.g., when the policy stops improving or stops changing) or the maximum number of iterations is reached. Inspired by the experiments in the original TRPO work [45], where between 50 and 200 iterations were used, we fix the number of policy update iterations to 100. This was also empirically found to provide convergence for the policies involved in most experiments. Furthermore, each episode is defined as starting at 00:00 and ending the next day at 12:00, for a total of 36 h. For each episode during training, the virtual patient is given meals from a fixed-seed random meal generator to ensure that each agent is trained on the same data set. This meal generator creates four virtual meals at ±30 min of 08:00, 12:00, 18:00, and 22:00 with 40, 80, 60, and 30 g of carbohydrates, respectively. U[−20, 20] uniform noise is added to simulate meal variation. For simplicity, the meal times are kept concurrent with the start times of each state, i.e., every 30 min. Because of the delayed meal response and the generally high variation in the blood glucose curve, we assume that this will generalize well to meals that are taken within a state-space time interval.
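A sketch of such a fixed-seed meal generator, applying the stated timing jitter and carbohydrate noise and snapping meal times to the 30-min state grid:

```python
import random

# Nominal daily meals: (time in minutes from midnight, carbohydrates in g).
MEALS = [(8 * 60, 40), (12 * 60, 80), (18 * 60, 60), (22 * 60, 30)]

def generate_meals(seed):
    """Same seed -> same scenario, so every agent trains on identical data."""
    rng = random.Random(seed)
    meals = []
    for t, carbs in MEALS:
        jitter = rng.uniform(-30, 30)                  # +-30 min timing jitter
        time = int(round((t + jitter) / 30.0)) * 30    # align to 30-min steps
        amount = max(0.0, carbs + rng.uniform(-20, 20))  # U[-20, 20] g noise
        meals.append((time, amount))
    return meals

print(generate_meals(seed=1))
```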
To test the agents, we use a fixed set of 100 episodes with 100 daily meal scenarios, sampled from the meal generator with a different seed than the training meals. Finally, to simulate carbohydrate counting errors, all meals, both training and testing, have a counting error of ±30% of the exact carbohydrate count. The reinforcement learning agent was implemented using the open source reinforcement learning toolbox garage (https://github.com/rlworkgroup/garage) [43]. The in-silico simulator was wrapped in the OpenAI Gym framework for simplified testing [71].
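The carbohydrate counting error can be sketched as follows: the announced amount, which drives the bolus calculation, is drawn within ±30% of the true carbohydrates, while the simulator digests the true amount.

```python
import random

def announced_carbs(true_carbs, rng):
    """Miscounted announcement within +-30% of the true carbohydrates."""
    return true_carbs * rng.uniform(0.7, 1.3)

def meal_bolus(true_carbs, carb_ratio, rng):
    """Bolus units computed from the (miscounted) announcement."""
    return announced_carbs(true_carbs, rng) / carb_ratio

rng = random.Random(7)
print(meal_bolus(true_carbs=80.0, carb_ratio=10.0, rng=rng))  # within 5.6-10.4 U
```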

Results
We now present the results and discuss the performance of a simulated artificial pancreas running the TRPO algorithm described in Section 2.2 in-silico. We show the results on the original Hovorka simulated patient [22,62], as well as on a cohort of 100 simulated patients generated according to the parameter distributions given in Wilinska et al. [62]. To illustrate its potential, we compare its performance to a standard basal-bolus strategy and a model predictive control algorithm, as described in [6]. We begin by comparing the TRPO agent to a simple basal-bolus treatment strategy on the original Hovorka patient.

TRPO versus Open Loop Basal-Bolus Treatment-Hovorka Patient and Carbohydrate Counting Errors
In this simulation, we consider open loop basal-bolus therapy, i.e., a fixed optimal basal insulin rate with manually administered meal-time bolus insulin, where the optimal basal rate is calculated as the minimum amount of insulin required to manage normal daily blood glucose fluctuations for this particular patient, while keeping the patient at the target blood glucose value during steady state. We compared the basal-bolus therapy with a hybrid closed loop system in which the TRPO agent controls the basal insulin rate while meal-time bolus insulin is manually administered, with both strategies running the same 100 test meal scenarios. Figure 4 shows the two treatments superimposed over each other, where we can see the blood glucose levels for the 100 test meal scenarios. The average blood glucose values for the TRPO and basal-bolus strategies are highlighted as dashed and continuous curves, respectively. The dark gray shaded area shows the maximum and minimum values at each individual step of the simulation for the TRPO agent, while the light gray shaded area does likewise for the basal-bolus regimen. We use the maximum and minimum blood glucose values instead of a confidence interval to include all possible curves in the envelope. This is due to the severe clinical implications of even a single blood glucose curve going too low.
We see in Figure 4 that the baseline performance of the basal-bolus controller is quite good, with a high portion of time being spent within range. Still, there are several hypoglycemic events, especially after the second meal, and the variation is high, as seen in the point-wise maximum and minimum band.
In the case of the TRPO controller, we see that the hypoglycemic events after meals and the overall variance have been reduced. When comparing the two results, we see that the TRPO agent has improved on the baseline, reducing variance in general and showing better overall within-range performance, especially with respect to hypoglycemia and the glucose levels after the second meal. The TRPO agent is able to get back to the optimal blood glucose level much quicker and with less variation than the basal-bolus strategy. An interesting observation from Figure 4 is that the TRPO agent chooses to keep the steady state blood glucose value slightly higher than the desired value of 108 mg/dL (this can be observed from approximately minute 250 to 500 and from minute 1700 onward). This helps to avoid the hypoglycemic incident that often happens after the second meal during the basal-bolus regimen.
The max-min envelope of Figure 4 does not show the full picture with respect to the standard deviation of the two treatment options. To further illustrate this, we include kernel density plots in Figure 5, showing the distribution of the blood glucose shortly after meals, between meals, and during the steady state long after any meals (equivalent to nighttime). The kernel density estimates were calculated using the seaborn Python package (https://seaborn.pydata.org/). It is clear, especially between meals and during nighttime, that the TRPO agent treatment is superior to the basal-bolus strategy in this case.
To conclude the comparison of the TRPO agent and the basal-bolus controller, we test the performance of the treatments in the case where the patient forgets to take the bolus insulin during a meal. To simulate this, we keep the 100 test meal scenarios, but let each meal during testing have a 0.1 probability of containing a skipped bolus. Table 1 shows a summary of all the performance measures for the experiments with the random skipped boluses (RSB) for both the TRPO and basal-bolus treatments. We also include additional experiments, as shown in Table 1, where the agent, now denoted TRPOe, was trained with an extended action space (from zero to three times the optimal basal rate) and tested on the RSB scenarios as well as on the ordinary 100 test scenarios.
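The random-skipped-bolus condition can be sketched as follows: each planned meal bolus in a scenario is independently replaced by zero with probability 0.1.

```python
import random

def apply_skipped_boluses(boluses, rng, p_skip=0.1):
    """Each bolus is independently skipped (set to 0) with probability p_skip."""
    return [0.0 if rng.random() < p_skip else b for b in boluses]

rng = random.Random(0)
scenario = [4.0, 8.0, 6.0, 3.0]   # planned meal boluses for one day, in U
print(apply_skipped_boluses(scenario, rng))
```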
Observing Table 1, we again see that the TRPO agent is superior to the basal-bolus treatment, increasing time-in-range while decreasing time spent in hypoglycemia. It has lower variation and risk indices, and it is overall more robust towards skipped boluses. We note that the low LBGI in the skipped-bolus experiment is most likely an artifact of the blood glucose level being higher in general when boluses are skipped; the same applies to the overall percentage of time spent in hypoglycemia.
Table 1. Summary of basal-bolus, TRPO and TRPOe results for the Hovorka patient. LBGI and HBGI are the low and high blood glucose indices, respectively; RI is the risk index; Std is the overall standard deviation; and CoV is the coefficient of variation. All 100 test meal scenarios are included in the performance measures. A lower score is better for all measures, except time-in-range.

When it comes to the results using the extended action space (TRPOe), we found that the results using 100 policy gradient iterations were inferior to the other results. We therefore extended the number of training iterations to 300, which led to an improvement over the original action space. The extended action space also leads to a treatment that is more robust to skipped boluses. However, increasing the number of policy gradient iterations from 100 to 300 significantly increases the amount of data used for training the policy. There is thus a trade-off between the size of the action space and the amount of training data/simulations needed.
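For reference, the LBGI, HBGI, and RI reported in Table 1 can be computed from the glucose traces with Kovatchev's symmetrized risk function. The sketch below assumes glucose in mg/dL and follows the standard formulation; Std and CoV are simply the sample standard deviation and its ratio to the mean:

```python
import numpy as np

def risk_indices(bg_mg_dl):
    """Kovatchev low/high blood glucose indices and total risk index.

    The symmetrizing transform f maps the clinically relevant BG range
    onto a scale where ~112.5 mg/dL is risk-neutral (f = 0); risk grows
    quadratically on either side.
    """
    bg = np.asarray(bg_mg_dl, dtype=float)
    f = 1.509 * (np.log(bg) ** 1.084 - 5.381)  # symmetrizing transform
    r = 10.0 * f ** 2                           # per-sample risk value
    lbgi = np.mean(np.where(f < 0, r, 0.0))     # low-BG contributions only
    hbgi = np.mean(np.where(f > 0, r, 0.0))     # high-BG contributions only
    return lbgi, hbgi, lbgi + hbgi
```

A hypoglycemic sample (say 50 mg/dL) contributes heavily to LBGI and nothing to HBGI, which is why a generally elevated glucose trace, as in the skipped-bolus runs, yields a deceptively low LBGI.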

Virtual Population Experiment: Undertreated Patients
The virtual population, as described in Section 3, has basal and bolus rates that are sub-optimal, keeping the patients within 110-180 mg/dL at steady state. We consider virtual patients with high steady-state glucose values to be undertreated, i.e., their current treatment regimen does not give the desired blood glucose levels. We show a random sample of four patients in Figure 6 to illustrate the improvements made by letting a TRPO agent train on and control each virtual patient. Each panel shows the original sub-optimal basal-bolus treatment and the TRPO agent results superimposed on each other.
In all four cases, the TRPO agent improves on the sub-optimal basal-bolus treatment. For virtual population patient #4, the performance of the basal-bolus treatment is already close to optimal, but we still see a reduction in variance, especially later in the episode, during nighttime.
Figure 6. A random sample from the 100 virtual patients: (a) patient #0, (b) patient #4, (c) patient #17, and (d) patient #38. All panels show results from the sub-optimal basal-bolus treatment and the TRPO agent trained on each patient individually.

Virtual Population Experiment: TRPO versus Model Predictive Control
We compare the TRPO agent to the open-source MPC implementation (https://github.com/McGillDiabetesLab/artificial-pancreas-simulator) provided by the McGill Diabetes Lab (https://www.mcgill.ca/haidar/). The TRPO agent is individually trained for each virtual patient. The MPC controller is adapted to each patient using the total daily insulin, basal rate, and carb-ratio. As many of the patients are undertreated, some of these parameters might represent poor choices. We note that this leaves MPC at a disadvantage from the outset, since it is not able to tune these parameters during training.
In Figure 7, we show a scatterplot of the mean of the minimum and the mean of the maximum blood glucose for the 100 virtual patients controlled by MPC, the TRPO agent, and the basal-bolus strategy. This is similar to a control-variability grid analysis (CVGA) plot [72], which is often used to measure the quality of closed-loop glucose control on a group of subjects; see, e.g., [24]. The undertreated patients fall out of bounds for a standard CVGA, thus requiring a different kind of analysis, as shown here. The virtual population moves from high mean maximum and minimum values in the basal-bolus case to lower mean maxima under MPC, and lower still under the TRPO agent. We see that, in general, MPC stays at higher blood glucose levels than TRPO, but conversely the TRPO agent is in some cases on the borderline low side.
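The quantities plotted in Figure 7 are simple per-patient summaries over the test scenarios; a sketch of how they can be computed (the trace array here is hypothetical, standing in for one patient's simulated episodes):

```python
import numpy as np

def mean_min_max(bg_episodes):
    """Per-patient CVGA-style summary.

    bg_episodes: array of shape (n_scenarios, n_timesteps) holding one
    patient's blood glucose traces. Returns the mean of the per-episode
    minima and the mean of the per-episode maxima.
    """
    bg = np.asarray(bg_episodes, dtype=float)
    return float(bg.min(axis=1).mean()), float(bg.max(axis=1).mean())

# Hypothetical traces for one patient over three test scenarios (mg/dL)
traces = [[110, 90, 160],
          [120, 85, 170],
          [105, 95, 150]]
print(mean_min_max(traces))  # → (90.0, 160.0)
```

One such (mean min, mean max) pair per patient and per controller gives the point cloud in the scatterplot.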
To obtain a more complete picture, kernel density estimates of the same maxima and minima are shown for the entire population in Figure 8.
The TRPO agent again clearly outperforms the basal-bolus strategy: it shows tighter overall control and lower maximum values, while most minima stay above the hypoglycemia threshold. MPC also tightens control and improves over basal-bolus, but its mean maximum values are, in general, higher. In addition, some of its mean minimum values are quite high, which indicates a generally high mean blood glucose.
Finally, Table 2 shows the mean performance measures over the entire virtual population for basal-bolus, MPC, and the TRPO agent, together with the best and worst cases for all three treatments in terms of time-in-range (TIR) and time-in-hypo (TIH). TRPO improves the time spent in normoglycemia while reducing the overall risk of hypo- and hyperglycemic events. However, MPC is more robust towards hypoglycemic events. Note that, in this case, the in-silico patients spend less time in hypoglycemia under the basal-bolus strategy than under the TRPO control algorithm. This is because these patients follow a sub-optimal basal-bolus treatment and therefore have higher steady-state glucose values, spending most of their time close to hyperglycemia with almost no risk of hypoglycemic excursions. The TRPO agent learns new basal rates that compensate for the undertreatment, improving the time spent in the target range but also slightly increasing the risk of hypoglycemia. Although the latter is generally considered negative, this comes down to how the control goals are designed: there will always be a trade-off between better time-in-range and the risk of hypoglycemia. A future study, with, e.g., a parametric reward function, could help determine the exact trade-off for each patient and take advantage of it; however, this is beyond the scope of this work.
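The TIR and TIH measures in Table 2 reduce to threshold counts over the simulated traces. A sketch, assuming the usual 70-180 mg/dL consensus targets (the exact cut-offs used in a given study may differ):

```python
import numpy as np

def time_in_ranges(bg_mg_dl, low=70.0, high=180.0):
    """Fractions of samples in range [low, high] (TIR), below low
    (time-in-hypo, TIH), and above high (time-in-hyper).

    Thresholds default to the common 70-180 mg/dL consensus targets;
    pass different cut-offs if the study defines the range otherwise.
    """
    bg = np.asarray(bg_mg_dl, dtype=float)
    tir = float(np.mean((bg >= low) & (bg <= high)))
    tih = float(np.mean(bg < low))       # time in hypoglycemia
    ti_hyper = float(np.mean(bg > high)) # time in hyperglycemia
    return tir, tih, ti_hyper
```

Averaging these fractions over all episodes of all 100 virtual patients yields the population-level TIR and TIH of Table 2.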

Conclusions and Future Work
In this work, we have shown that policy gradient reinforcement learning using TRPO outperforms standard basal-bolus treatment and compares favourably to MPC in our experiments. We consider this work a strong proof of concept for the use of policy gradient algorithms in the artificial pancreas framework: the TRPO agent is able to cope with carbohydrate counting errors and, to a certain degree, skipped boluses. Furthermore, the control is tighter than with a fixed optimal basal rate, and the risk indices are, in general, lower than for both MPC and basal-bolus insulin therapy.
The main disadvantage of using RL, which has not been fully explored in this work, is the computational complexity of training. In this work, we fixed the number of policy gradient iterations to 100 for all experiments, but we empirically observed that, in many cases, far fewer iterations were required for convergence. Finally, we observed that a larger action space can lead to better control, but the increase in training data needed for convergence is significant.
All of the TRPO agents were trained model-free, so from the agent's perspective the diabetes simulator is simply a black box that returns a reward when given an input. Since T1DM is a well-studied disease and multiple treatment strategies already exist, a lot of domain knowledge is lost in a model-free setting. An obvious direction for research is to include such domain knowledge in the RL framework for T1DM, as in, e.g., [73,74].
Finally, state-of-the-art RL offers a plethora of directions that can be explored; perhaps the most important for the artificial pancreas framework are inverse reinforcement learning [75], safe reinforcement learning (safe exploration) [76], and hierarchical reinforcement learning [77].

Conflicts of Interest:
The authors declare no conflict of interest.