Improving Convergence in Therapy Scheduling Optimization: A Simulation Study

The infusion times and drug quantities are two primary variables to optimize when designing a therapeutic schedule. In this work, we test and analyze several extensions to the gradient descent equations in an optimal control algorithm conceived for therapy scheduling optimization. The goal is to provide insights into the best strategies, in terms of convergence speed, when implementing our method in models for dendritic cell immunotherapy. The method gives a pulsed-like control that models a series of bolus injections and aims to minimize a cost function that penalizes tumor size and keeps the tumor under a threshold. Additionally, we introduce a stochastic iteration step in the algorithm, which serves to reduce the number of gradient computations, similar to a stochastic gradient descent scheme in machine learning. Finally, we apply the algorithm to two therapy schedule optimization problems in dendritic cell immunotherapy and contrast our method's stochastic and non-stochastic optimizations.


Introduction
Therapy schedules comprise previously planned protocols of injection times with their respective vaccine quantities. Such protocols are usually proposed following therapist traits such as intuition and experience. We aim to provide rational therapy planning that relies on mathematical modeling, optimal control, and simulations.
Optimal control is a popular tool to improve schedules for tumor treatment, and it is frequently used to rationalize issues like infusion times and drug quantities. For instance, in [1], the authors address optimal control for a model with mixed immunotherapy and chemotherapy. Numerical and analytic control techniques for continuous and bang-bang controls can be consulted in [2]. Therapeutic protocol improvement by simulation has been carried out in [3] for Cytotoxic T Cells (CTL) and in [4] for dendritic cell transfections.
Apart from immunotherapy applications, optimal control is used in a variety of medical applications, such as control of Dengue Fever [5], timing valuation in chronic myeloid leukemia [6], and finding the best therapy schedule for metastatic castrate-resistant prostate cancer using abiraterone [7]. In [8], an optimal control scheme boosts the effector cell and Interleukin-2 (IL-2) intensities while lessening the tumor cells. A mathematical model for chronic myeloid leukemia and optimal control are used in [9] to obtain optimal piecewise-constant regimens. For a combination of therapies, optimal control can be used to find the optimal treatment among several drug options; in [10], the dominant treatment is found for the combination of chemotherapy, IL-2, and tumor-infiltrating lymphocytes, considering the tumor size. The works [8,11,12] study models for tumor growth, the immune system, and controlled drug therapies given in a bang-bang manner.
The standard procedures for administering anti-tumor medication are an intramuscular shot or an intravenous injection. In both cases, the period of drug intake is on the order of minutes, or hours for intravenous intake. Hence, implementing continuous and bang-bang controls, which usually require the treatment to run continually over days, is unrealistic. In this paper, we do not consider a continuous-infusion therapy but pulse-like doses of short duration (bolus injections) that can be considered instantaneous. This more accurately resembles the procedure followed in real immunotherapy treatments. Periodically pulsed immunotherapy is studied in [13,14], where the bolus injections have a very small time-span in comparison with the overall therapy. A therapy-optimizing algorithm for pulsed dendritic immunotherapy is given in [15], which also lays the foundation for computing the gradient of a cost function that measures tumor outgrowth. Schedule optimization for pulsed dendritic immunotherapy that encounters several impediments in the host environment, such as immunosuppression and poor transference to the lymph nodes, is addressed in [16]. Deciding when and how much to dose is therefore a typical question when dealing with immunotherapy.
The gradient descent (GD) optimization algorithm is commonly used for schedule optimization when there are pulsed controls [15,16]. The algorithm has many good qualities, but convergence speed is not one of them. Considering that, we change the GD design in our method and use adaptive optimizations that are well known in the area of deep learning, and we show that they can also be useful in the area of therapy scheduling for speeding up convergence. This is possible since such approaches rely on the gradient calculation given in [15]. Moreover, we include a stochastic step to decrease the number of gradient computations.
The road map for this work is as follows. Section 2 explains gradient descent and its variants. Section 3 contains the optimal control problem and the optimization algorithm. Section 4 compares the GD variants in an application to a dendritic vaccine schedule model, and also compares the deterministic and stochastic cases of the optimization algorithm. Section 5 applies the algorithm to a model calibrated to match tumor growth in mice, showing that the stochastic case can give reasonable results at a lower computational cost. Finally, Section 6 offers some conclusions and future work.

Gradient Descent Optimization Algorithms
In this section, we describe the background for the gradient descent algorithm and some well-known optimization algorithms; each version aims to speed up convergence.
Gradient descent (GD) is an algorithm to find a minimum of a smooth function J through the iterative process

s^(n+1) = s^(n) − h ∇J(s^(n)),   (1)

which takes small steps against the gradient ∇J. The update of the vector s ∈ R^d follows the opposite direction of the gradient, and the parameter h sets the size of the update step. For a sufficiently small h and a convex function J, Equation (1) converges to a global minimum; for a non-convex function, it is only guaranteed to converge to a local minimum.
Step size h plays a vital role in the optimization process; h must balance speed and convergence in the optimization. A common procedure is to test some value h* for a few optimization steps; if the loss function decreases fast enough, h* is selected, and if not, h is increased a little. Usually, h starts small; values in [10^−6, 10^−1] are common in the literature, although the useful range of h differs between problems. For example, in Section 4 we find that h = 10 provides acceptable convergence for our problem. For its versatility and simplicity, GD is the workhorse algorithm for a vast number of applications in areas such as mechanics, economics, physics, and machine learning. Equation (1) is sometimes called vanilla GD for its simplicity.
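As an illustration, a minimal sketch of the vanilla GD update of Equation (1); the quadratic objective is only a toy example, not one of the cost functions used later in this work:

```python
import numpy as np

def gd_step(s, grad, h):
    """One vanilla GD update (Equation (1)): move s against the gradient by step size h."""
    return s - h * grad

# Toy example: minimize J(s) = ||s||^2, whose gradient is 2s, from s = (3, 4).
s = np.array([3.0, 4.0])
for _ in range(200):
    s = gd_step(s, grad=2 * s, h=0.1)
# s is now essentially at the global minimum (the origin).
```

With h = 0.1 each step multiplies s by 0.8, so the iterates contract geometrically toward the minimizer.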

Variants
Next, we summarize some adaptive optimization techniques that aim to improve the convergence speed of the GD algorithm. Most of them are widely used and recognized in deep learning for fitting the parameters of a neural network. The following GD variants (usually called optimizers) have helped deep learning improve the state of the art in computer vision, speech recognition, and drug discovery [17]. All of them tweak the GD algorithm to speed up and smooth convergence in distinct ways. We use ⊙ (Hadamard product) for element-wise vector multiplication; square root and division on vectors are likewise element-wise operations, for the sake of simplicity.

Momentum
GD has difficulty reaching an optimum because of its inclination to oscillate on steep surface curves [18]. Momentum intends to add inertia and speed while also dampening oscillations, like a snowball that passes through minor bumps but stays in the valley. The method of momentum [19] accelerates iterations in such cases. Now the gradient ∇J affects the position indirectly through a so-called velocity vector v:

v^(n+1) = µ v^(n) + h ∇J(s^(n)),
s^(n+1) = s^(n) − v^(n+1).

In practical terms, the updated position now depends on the past iteration of v and on the current gradient ∇J(s^(n)). The term µv^(n) has a dampening and speeding-up effect on s, similar to a snowball rolling down a valley: on a pronounced slope the velocity increases until the snowball reaches the bottom, where the velocity decays.
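A sketch of one common formulation of the momentum update described above, run on the same toy quadratic:

```python
import numpy as np

def momentum_step(s, v, grad, h, mu=0.9):
    """One momentum update: the velocity v carries inertia from past gradients."""
    v = mu * v + h * grad      # the mu*v term damps oscillations and adds speed
    return s - v, v

# Toy run on J(s) = ||s||^2 (gradient 2s).
s, v = np.array([3.0, 4.0]), np.zeros(2)
for _ in range(500):
    s, v = momentum_step(s, v, grad=2 * s, h=0.1)
```

The iterates spiral toward the minimum (complex eigenvalues of the update map), which is the oscillatory behavior noted above.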

Nesterov
Similar to momentum, this method looks not only at the past velocity but also at the possible next iterate s^(n+1) = s^(n) − µv^(n) of the momentum update. This takes the form

v^(n+1) = µ v^(n) + h ∇J(s^(n) − µ v^(n)),
s^(n+1) = s^(n) − v^(n+1).

The reasoning is that, since momentum is about to push s by µv^(n), it makes sense to approximate ∇J at the point s^(n) − µv^(n).
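The only change relative to momentum is where the gradient is evaluated, which a sketch makes explicit (the gradient is passed as a function so it can be queried at the look-ahead point):

```python
import numpy as np

def nesterov_step(s, v, grad_fn, h, mu=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point s - mu*v."""
    v = mu * v + h * grad_fn(s - mu * v)
    return s - v, v

# Toy run on J(s) = ||s||^2, whose gradient function is s -> 2s.
s, v = np.array([3.0, 4.0]), np.zeros(2)
for _ in range(500):
    s, v = nesterov_step(s, v, lambda x: 2 * x, h=0.1)
```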

Adagrad
The Adagrad optimizer adapts the step size using past gradient values. At each step, a buffer variable b accumulates the squared sum of past gradients:

b^(n+1) = b^(n) + ∇J(s^(n)) ⊙ ∇J(s^(n)),
s^(n+1) = s^(n) − h ∇J(s^(n)) / (√b^(n+1) + ε).

Now each element of s experiences a different step size: elements with large gradients have their step size reduced, and those with smaller gradients have it increased. It has been observed that using the square root in the denominator makes the algorithm perform better [19]. The term ε serves as a smoothing term that prevents division by zero. Notice that at each optimization step b adds a positive term, which could make b grow so large that the current step size becomes too small to allow further progress.
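A sketch of the Adagrad update on the toy quadratic; the run records the update magnitudes to illustrate the vanishing-step issue just described:

```python
import numpy as np

def adagrad_step(s, b, grad, h, eps=1e-8):
    """Adagrad update: b accumulates squared gradients, shrinking each element's step."""
    b = b + grad * grad                            # b never decreases
    return s - h * grad / (np.sqrt(b) + eps), b

# Toy run on J(s) = ||s||^2: watch the update magnitude decay as b grows.
s, b = np.array([3.0, 4.0]), np.zeros(2)
sizes = []
for _ in range(100):
    s_new, b = adagrad_step(s, b, grad=2 * s, h=0.5)
    sizes.append(np.linalg.norm(s_new - s))
    s = s_new
```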

RMSprop
To cope with the diminishing step size of the Adagrad optimizer, the buffer variable includes a decay rate γ that smooths the impact of large gradients and of the cumulative sum:

b^(n+1) = γ b^(n) + (1 − γ) ∇J(s^(n)) ⊙ ∇J(s^(n)),
s^(n+1) = s^(n) − h ∇J(s^(n)) / (√b^(n+1) + ε).

Popular default values for the decay rate γ are 0.9, 0.99, and 0.999. Now the effective step size does not get monotonically smaller [19].
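The only change relative to Adagrad is the buffer update, as a sketch shows:

```python
import numpy as np

def rmsprop_step(s, b, grad, h, gamma=0.9, eps=1e-8):
    """RMSprop update: b is a decaying average of squared gradients, not a sum."""
    b = gamma * b + (1 - gamma) * grad * grad
    return s - h * grad / (np.sqrt(b) + eps), b

s, b = np.array([3.0, 4.0]), np.zeros(2)
s, b = rmsprop_step(s, b, grad=2 * s, h=0.01)
# After one step, b = (1 - gamma) * grad**2 element-wise, i.e., (3.6, 6.4).
```

Because old entries of b decay geometrically, the effective step size can recover after a burst of large gradients, unlike in Adagrad.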

Adam and Adam-Bias
These optimizers combine a smoothed version of momentum with RMSprop to adjust the update step h element-wise. We now keep a decaying mean (first moment) of past gradients,

m^(n+1) = β_1 m^(n) + (1 − β_1) ∇J(s^(n)).

The RMSprop part is the same and relates to the uncentered variance (second moment),

v^(n+1) = β_2 v^(n) + (1 − β_2) ∇J(s^(n)) ⊙ ∇J(s^(n)).

The Adam optimizer is usually presented as

s^(n+1) = s^(n) − h m^(n+1) / (√v^(n+1) + ε),

and the one we call the Adam-bias optimizer as

m̂ = m^(n+1) / (1 − β_1^(n+1)),   (12)
v̂ = v^(n+1) / (1 − β_2^(n+1)),   (13)
s^(n+1) = s^(n) − h m̂ / (√v̂ + ε),

where the variables m̂, v̂ correct the potential bias toward zero of m, v. The name Adam comes from Adaptive Moment Estimation [20], where values of 0.9 for β_1 and 0.999 for β_2 are proposed.
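A sketch of the Adam-bias step, assuming Equations (12) and (13) are the bias corrections; dropping the two correction lines gives the plain Adam variant as used in this work:

```python
import numpy as np

def adam_bias_step(s, m, v, grad, n, h, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-bias update; n is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (decaying mean)
    v = beta2 * v + (1 - beta2) * grad * grad      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** n)                   # bias correction of m
    v_hat = v / (1 - beta2 ** n)                   # bias correction of v
    return s - h * m_hat / (np.sqrt(v_hat) + eps), m, v

# On the very first step m_hat / sqrt(v_hat) = grad / |grad| element-wise,
# so the update magnitude is close to h regardless of the gradient's scale.
s, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
s, m, v = adam_bias_step(s, m, v, grad=2 * s, n=1, h=0.1)
```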

GD-Normalized
In our experiments, the elements of ∇J tend to differ by several orders of magnitude, so the step size h could be either significant or negligible in the update. To smooth that effect, we include a version that normalizes the gradient before making the schedule update; we call this GD-Normalized. For further discussion of the above methods, we refer to [21].
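A sketch of one natural reading of GD-Normalized, assuming normalization to unit Euclidean length (the small ε guard is an added assumption):

```python
import numpy as np

def gd_normalized_step(s, grad, h, eps=1e-12):
    """GD-Normalized: rescale the gradient to unit length before updating,
    so h alone controls how far the schedule moves at each step."""
    return s - h * grad / (np.linalg.norm(grad) + eps)

# A tiny gradient and a huge gradient now produce steps of the same length h.
s = np.zeros(2)
small = gd_normalized_step(s, np.array([1e-6, 0.0]), h=10.0)
large = gd_normalized_step(s, np.array([1e+6, 0.0]), h=10.0)
```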

The Optimal Control Algorithm for Optimizing Therapy Schedules
In this section, we define the optimal control problem for therapy scheduling and an algorithm for optimizing therapy schedules using GD. This algorithm includes a stochastic step that only updates a random part of the schedule at each iteration. This simple technique gives results similar to updating the entire schedule at each step when minimizing a cost function J, with the advantage of a lower computational cost.

The Optimal Control Problem Definition
The clinical effectiveness of immunotherapy vaccines, studied via simulations of mathematical representations, is a theme of open-ended research in mathematical oncology [22]. Ordinary differential equations (ODEs) are among the most used tools for modeling the complex interactions between the immune system, tumor cells, and external therapies. With a mathematical model, hypothetical protocols can be simulated for several therapy combinations of dose timing and size; such protocols could lead to new therapy dosing programs [4]. In this section, we improve such hypothetical protocols using an optimization algorithm based on optimal control. The treatment is modeled by a system of ODEs that depends on the control u,

dx/dt = F(x(t)) + u(t) e_i,   (20)

where x is the state variable and u measures the therapy effect. For this work, u acts only on one of the independent variables of (20), and e_i denotes the coordinate vector with i equal to the index of that variable; for example, we have e_4 for the system (24)-(28). Now, let a schedule of injections be given as

s = (t_0, t_1, . . . , t_{n−1}),   (21)

such that t_0 < t_1 < · · · < t_{n−1} < t_f, with t_0 as the fixed therapy starting point and t_f as the fixed therapy end. A therapeutic protocol consists of n doses of size V given at the times in s. Let L be the space of schedules; then, for a particular schedule s ∈ L, the control variable takes the form

u(t) = V Σ_i δ(t − t_i).   (22)

Notice that we approximate the doses as impulses given by the Dirac function δ(·). For this work, we focus on a cost function of the form

J(u) = ∫_{t_0}^{t_f} L(x(t, u)) dt + Φ(x(t_f, u)),   (23)

where L measures how effective the running therapy is and Φ measures the cost at the end of the therapy. The optimal control problem consists in:

(P) Determine the schedule s ∈ L of n injections that solves min_{u ∈ U} J(u), with U the class of admissible controls defined as in (22), L the running cost, and Φ the final cost.
Notice that x(·, u) is the solution of (20) using the control u. To solve (P), we use the following algorithm, which relies on calculating the gradient of the cost function.
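To make the pulsed control concrete, the toy simulation below treats each bolus as an instantaneous jump in the dosed state, which is how the Dirac impulses of (22) behave in a forward-Euler integration. The two-state dynamics and all rate constants here are invented for illustration only; they are not the system (24)-(28) used later:

```python
import numpy as np

def simulate(schedule, V, t_f, dt=0.01):
    """Integrate a hypothetical tumor/dendritic-cell model with bolus doses
    of size V at the times in `schedule` (toy dynamics, illustration only)."""
    T, D = 0.5, 0.0                        # tumor load, dendritic-cell level
    doses, k, tumor = sorted(schedule), 0, []
    for step in range(int(t_f / dt)):
        t = step * dt
        while k < len(doses) and doses[k] <= t:
            D += V                         # Dirac impulse -> instantaneous jump
            k += 1
        dT = T * (1.0 - T) - 0.8 * T * D   # logistic growth minus immune kill
        dD = -0.3 * D                      # dendritic cells decay
        T, D = T + dt * dT, D + dt * dD
        tumor.append(T)
    return np.array(tumor)

untreated = simulate([], V=1.0, t_f=10.0)
treated = simulate([1.0, 2.0, 3.0], V=1.0, t_f=10.0)
# The treated trajectory ends with a smaller tumor load than the untreated one.
```

A gradient-based optimizer such as Algorithm 1 below would then move the injection times in `schedule` to reduce a cost evaluated on this trajectory.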

Optimization Algorithm
Algorithm 1 uses the method reported in [15] to compute the gradients of J. Since the methods in Section 2 are guaranteed to converge only for convex problems, and the applications in this work involve nonlinear (hence non-convex) dynamics, Algorithm 1 acts as a heuristic. Additionally, we study the case where a sub-schedule s* is randomly chosen from s.
Algorithm 1: Schedule optimization with stochastic step.
S1 Fix the time horizon t_0 and t_f, the number N of vaccine administrations, the value V of the vaccine quantity, an initial value x_0 of the cell populations, and an initial schedule s_0. Also, let 0 < p ≤ 1 and s = s_0.
S2 Integrate the system (20) with initial value x_0 to obtain the trajectory x_s(t). Define the sub-schedule s* by choosing m = Np values at random from the schedule s, and compute the gradient component ∇J_i for each t_i of s*.
S3 For each t_i of s*, update the schedule s as t_i = t_i − h∇J_i. Go to S2.
In step S3 of Algorithm 1, the correct choice of h is essential. If h is too big, we may see no progress; if h is too small, the optimization may proceed slowly. Step S3 is the vanilla GD update indicated by (1).
Step S3 can be interchanged with any optimization algorithm of Section 2. The choice of optimization algorithm has a significant impact on the number of optimization steps needed to reach the minimum, as we will observe in the next section. Vanilla GD is the optimization algorithm found in the literature for schedule optimization using optimal pulsed control [15,23].
Notice that at each optimization step (S1-S3) we leave (1 − p)N injection-time values t_i without updating, which lightens the number of gradient computations per step. From our simulations, whenever Algorithm 1 converges to a minimum using p = 1, it also converges with 0.3 < p < 1, but further exploration is needed to establish the lower bound on p that guarantees reaching a minimum. Using p = 1 and the vanilla GD equation, we recover the algorithm used in [15,16]. Algorithm 1 also resembles the Stochastic Gradient Descent (SGD) algorithms adopted in deep learning applications. SGD updates a parameter vector θ from a randomly chosen subset of training samples (mini-batch) instead of the whole training set. A drawback of SGD is that the cost function fluctuates strongly while updating; the smaller the mini-batch, the higher the fluctuations. We observe analogous behavior in our schedule evolution.
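A sketch of the stochastic selection in steps S2-S3; the gradient function here is a trivial stand-in for the adjoint-based computation of [15], which would require integrating the system (20):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def stochastic_update(s, grad_fn, h, p=0.5):
    """Update only a random fraction p of the injection times (Algorithm 1, S2-S3)."""
    m = max(1, int(round(p * len(s))))
    idx = rng.choice(len(s), size=m, replace=False)  # the sub-schedule s*
    s = s.copy()
    s[idx] -= h * grad_fn(s)[idx]   # only m gradient components are needed
    return s

# Placeholder gradient: in the real algorithm this comes from the trajectory x_s(t).
toy_grad = lambda s: np.ones_like(s)
s0 = np.arange(10, dtype=float)
s1 = stochastic_update(s0, toy_grad, h=1.0, p=0.5)
# Exactly 5 of the 10 injection times moved by h; the rest are untouched.
```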

Application to a Dendritic Vaccine Schedule for Tumor Cells Model
This section presents the optimization of two initial therapy schedules, one with a few injection times and another with daily doses over six months. We compare the gradient descent variants presented in Section 2, with attention to convergence speed-up. Algorithm 1 uses a mathematical model that accounts for tumor-immune and dendritic-vaccine interactions, given by the system (24)-(28); further details and parameters can be found in [23]. In this section, Algorithm 1 aims to minimize the cost function (29), where x(t, u) is the solution of (24)-(28). Notice that the running cost measures how much the cancer cell count T goes beyond the threshold T_max, and the final cost T(x(t_f, u)) measures the number of cancer cells at the end. The optimizer thus penalizes schedules that let T(t) go above T_max while making T(x(t_f, u)) as small as possible.

Optimization and Comparison of GD Variants
In this subsection, we compare the GD extensions from Section 2 on Algorithm 1 to minimize the cost function in Equation (29).
We do two optimization runs, one for T_max = 0.80 (Figure 1) and another for T_max = 0.85 (Figure 2), with the initial schedule and doses of V = 0.5. From the inset of Figure 1a, GD has the fastest first decay, but the Adam variant quickly outperforms it. Most of the optimizers oscillate after step 50, except for GD and GD-Normalized, which have slower convergence. Adam is the first to converge, to the minimum 0.10; Momentum also converges to 0.10 but stays in a decreasing oscillatory behavior. Adam-bias has a moderate initial convergence, but after step 225 it decreases fast, going below 0.05, although it converges to approximately 0.14 by step 400 (Figure 1c). From Figure 1b, we can see that Adam and Momentum converge to a local minimum of roughly 0.10. The cases T_max = 0.80 and T_max = 0.85 present many oscillations, which are very common in algorithms using momentum-like optimization [18]; this could probably be adjusted with a different parameter choice for Adam, Momentum, and Nesterov. Vanilla GD and GD-Normalized do not present prominent oscillations but converge more slowly. Table 1 shows the parameter choices for each optimization variant.
Figure 3 shows the schedule evolution for each GD variant. At each iteration, each optimizer moves s^(0) to try to minimize the cost function J. The abscissa shows the number of optimization steps, and the ordinate shows the administration times in hours. On the whole, each GD variant favors an accumulation of injections at the end of the therapy to minimize T(x(t_f, u)); there are also equally spaced injections that contribute to decreasing the running cost. When two or more injection times collapse at roughly the same value, they can be interpreted as a single injection whose dose size is the sum of the collapsing doses. Figure 3 also reveals differences in convergence between the GD variants: the Adam variant moves the injection times to the end of the therapy much faster.
Conversely, GD and GD-Normalized have little effect on the initial schedule; nonetheless, they also show a tendency to move injection times to the end. Nesterov and Momentum have a similar evolution, as expected since the methods are very similar. Unexpectedly, Adam-bias evolves much slower than Adam; probably Equations (12) and (13) do not start biased toward zero, and the adaptive step size coming from dividing by √v̂ + ε hinders the convergence. Next, we explore the case of daily doses over six months.

Daily Doses over Six Months
Now, we optimize a more extensive initial schedule s^(0) over six months (4320 h). It comprises daily injections of dose size V = 0.01666 . . . , with a total dose of roughly 3. Here we use the same cost function of Equation (29). We run Algorithm 1 deterministically with p = 1 and stochastically with p = 0.5, 0.3, for 300 optimization steps using the Adam variant with h = 10, β_1 = 0.9, β_2 = 0.999. Notice that smaller values of p mean less computational cost. Let N be the fixed number of optimization steps and s^(0) ∈ R^d; then the total number of gradient calculations is Npd. For this example, using p = 0.5, 0.3 makes the cost function converge to the same minima. Figure 4 displays the minimization of the cost function J using p = 1, 0.5, 0.3. For the three cases, the running cost and the total cost look very similar; on the other hand, the final cost (the insets) does not. The running cost for each instance decays quickly to 0.5, but in the p = 1 case the running cost fluctuates around optimization steps 10-20 and then converges near 0.
Figure 5 displays the corresponding schedule dynamics for the three cases. Therapy starts at t = 200 h and ends after six months at t = 4520 h. The deterministic case (p = 1) displays a well-defined initial movement toward t = 200 h, 1000 h, 1400 h, 4300 h, 4520 h, where injections continue to accumulate until collapsing into single time points. Not all injection times combine; between [1400 h, 4300 h] some injection times stay uncombined until the optimization ends. Although the interval [1400 h, 4300 h] accumulates fewer injections, in general it shows low and frequent doses, similar to what is called a metronomic therapy [24]. The accumulation at the start (t = 200 h) and end (t = 4520 h) occurs because Algorithm 1 lets injections go neither above the fixed end nor below the start.
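The total number of gradient computations follows from updating a fraction p of the d injection times at each of the N steps, giving N·p·d. A quick check for this example (the schedule length d = 180, i.e., daily doses over six months, is our reading of the setup):

```python
# Gradient computations for N optimization steps on a schedule of d injection
# times when a fraction p is updated per step: N * p * d.
N, d = 300, 180   # 300 steps; ~180 daily doses over six months (an assumption)
for p in (1.0, 0.5, 0.3):
    print(f"p = {p}: {int(N * p * d)} gradient computations")
```

So p = 0.3 needs less than a third of the gradient work of the deterministic p = 1 run.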
Such a restriction was required to obtain a successful optimization; Algorithm 1 failed to converge to a minimum when the therapy start and end were variable. Optimizing a schedule with a variable period therefore requires further analysis.
For the stochastic cases p = 0.5, 0.3 in Figure 5a,b, the schedule evolution is less smooth and looks like a perturbed version of the deterministic one. At injection times t = 200 h, 1000 h, 1400 h, 4300 h, 4520 h there is a tendency to accumulate, and further steps could produce single injections. Notice that inside [1400 h, 4300 h] the stochastic and deterministic optimizations have sparse injection times of frequent and low dose. Both cases accumulate injections, but the deterministic case appears much smoother. Both optimization cases present injection times that move in the same direction, forming groups; for the stochastic versions, that movement looks noisy and delayed. This can be explained by considering that some injection times have a deferred update: at each optimization step, Algorithm 1 picks only p × 100% of the injection times for updating, so for the (1 − p) × 100% not chosen, the system is in a different state than if they had been updated at once. Since the deterministic and stochastic variants yield similar schedules, perhaps the schedule from the deterministic variant is stable to small perturbations.

Application for Therapy Schedule Improvement in Murine Model
In this section, we examine a murine mathematical model that deals with the effects of dendritic cell bolus doses and obstacles in the host environment, such as immunosuppression and vaccine-effect reduction generated by insufficient transference to the lymph nodes. We have calibrated the model parameters to the tumor growth data of mice inoculated with melanoma cells [25]; the model equations, parameters, and a deeper discussion are in [16]. We apply Algorithm 1 to improve a starting schedule s^(0), so that the eventual schedule drives the tumor mass to stay under the threshold T_max = 3.7 × 10^9. We run 300 optimization steps using the Adam variant with h = 10^−2, β_1 = 0.9, β_2 = 0.999 and a fixed horizon t_f = 5000 h. We start with the schedule s^(0) = (80, 160, . . . , 4920), with injection times every 80 h, dose size V = e_f 1.93548 × 10^5, and overall dosage e_f 1.31613 × 10^7, where e_f = 0.05 is the portion of dendritic cells that reach the lymph nodes. Figure 7 shows the deterministic and stochastic schedule evolution. The deterministic case renews the full schedule at every optimization step, and the stochastic one updates 50% of the time points (p = 0.5). Both optimizations distribute the injection times so the therapy keeps the number of tumor cells below T_max (Figure 6).
For the deterministic case, the first optimization steps in Figure 7a illustrate a trend for injection times t_i > 1000 to move to later times, and for t_i < 1000 to move to earlier times. This tendency remains until the injection times stop evolving, and the resulting schedule has cycles of frequent doses and periods without doses. For the stochastic optimization (Figure 7b), the initial evolution is like the deterministic one, but with smaller updates. After 300 optimization steps, the doses do not stay fixed; the injection times can evolve further. The resulting schedule comprises doses scattered over the treatment duration and a few dose fusions.
Figure 6. Panel (a) shows the tumor growth using an initial 80-h periodic schedule with dose size V = 1.93548 × 10^5 over 5000 h, using a total dose of 1.31613 × 10^7. Panel (b) shows the tumor growth after optimization of the cost function using p = 1; the optimized schedule makes the tumor cell count stay below T_max = 3.7 × 10^9. Panel (c) shows the tumor growth after optimization of the cost function using p = 0.5; the optimized schedule makes the tumor cell count go below T_max = 3.7 × 10^9 for most of the 5000 h. The stochastic case has a lower final tumor cell count.
Figure 6a displays the tumor expansion for the first dose protocol. There is considerable tumor growth that surpasses T_max in the period [0 h, 1000 h]; later, when the therapy takes effect, the tumor decreases and stays around 5 × 10^8. Figure 6b illustrates the tumor growth reacting to the therapy after schedule optimization with p = 1. The optimized schedule drives T to stay below T_max; however, T keeps rising and fluctuating toward the therapy end. Figure 6c (the stochastic optimization) shows that T resides mostly below T_max, except for two instances around t = 600 h, 1800 h. Nonetheless, tumor growth keeps decreasing toward the end. Moreover, the stochastic optimization computed half the gradients of the deterministic optimization.
Figure 7. Panel (a), the deterministic case, uses Algorithm 1 with p = 1. Panel (b) shows the stochastic case with p = 0.5, so at each optimization step the algorithm picks 50% of the injection times from the schedule to update. The deterministic case progresses to well-marked schedules; the stochastic version produces a more scattered schedule than the deterministic one.

Concluding Remarks
In this study, we introduced a stochastic gradient descent technique for optimization problems with pulsed controls, together with a comparison between several gradient descent extensions. The comparison shows different convergence speeds between optimizers. These methods are well known but have convergence guarantees only for convex problems; the problems in Sections 4 and 5 involve nonlinear, hence non-convex, dynamics, which makes a rigorous convergence study difficult. We aim to highlight empirically the importance of adaptive optimization algorithms in these kinds of applications. In Sections 4 and 5, we show that not every optimization approach performs as expected; for example, Adam-bias here has slower convergence than Adam, whereas Adam-bias regularly converges faster in machine learning applications.
In the applications, we employed the method to identify therapy schedules that keep the tumor volume under a threshold while minimizing the tumor volume toward the therapy end. For both applications, the stochastic variant uses fewer gradient computations than the deterministic variant, with comparable results. We applied the method in the cancer-treatment scope, but it can be transferred to ODE models with pulsed control that aim to minimize a cost function of the form of Equation (23).
The method presented in this manuscript rationalizes the construction of immunotherapy schedules, which could complement therapist intuition and experience in therapy planning. Future work will advance the technique toward real clinical practice. We aim to optimize dosage and to introduce a penalty term that accounts for timing restrictions, such as weekends and holidays. Experiments with non-fixed time boundaries failed to converge, which calls for further research.