Approximately Optimal Control of Nonlinear Dynamic Stochastic Problems with Learning: The OPTCON Algorithm

Abstract: OPTCON is an algorithm for the optimal control of nonlinear stochastic systems which is particularly applicable to economic models. It delivers approximate numerical solutions to optimum control (dynamic optimization) problems with a quadratic objective function for nonlinear economic models with additive and multiplicative (parameter) uncertainties. The algorithm was first programmed in C# and then in MATLAB. It allows for deterministic and stochastic control, the latter with open-loop (OPTCON1), passive learning (open-loop feedback, OPTCON2), and active learning (closed-loop, dual, or adaptive control, OPTCON3) information patterns. The mathematical aspects of the algorithm with open-loop feedback and closed-loop information patterns are presented in more detail in this paper. Contributions: conceptualization, R.N., D.B., and V.B.-N.; methodology, V.B.-N., R.N., and D.B.; software, D.B.; formal analysis, V.B.-N.; writing—original draft preparation, V.B.-N., D.B., and R.N.; writing—review and editing, R.N., D.B., and V.B.-N.; visualization, D.B. and V.B.-N.; supervision, R.N.; funding acquisition, V.B.-N.


Introduction
Optimal control of stochastic processes is a topic which occurs in many contexts of applied mathematics such as engineering, biology, chemistry, economics, and management science. These techniques are used to steer the behavior of a plant, natural, human or social system over a certain (finite or infinite) time period where the system is driven by differential or difference equations whose time path can be influenced by an external controller. The controller (decision maker, policy maker) aims at achieving the best trajectory of his/her controls according to some criterion (performance measure, objective function) over that time horizon.
Unfortunately, only a few special cases can be solved explicitly, in particular the linear-quadratic-Gaussian problem, where the dynamic system is linear, the objective function is quadratic, and the stochastics are confined to additive normally distributed noise affecting the system dynamics (e.g., [1]). In this case, it is possible to separate the estimation of the state and the optimization of its behavior due to the property of certainty equivalence (the separation theorem). For most practical applications, however, especially those where the parameters of the system are not known precisely, an exact solution to the optimal stochastic control problem is not available and some kind of approximation is required. One of the reasons is the curse of dimensionality, which prevents the solving of dynamic functional equations for multivariable systems [2,3]. This is particularly complicated by the so-called dual effect of controls preventing the separation of estimation and control in adaptive processes; it was discovered by [4] that "the control must have a reciprocal, 'dual' character: to a certain extent they must be of a probing nature, but also controlling to a known degree" ([4], p. 31). The dynamic functional equations for general problems of this kind are known (see, e.g., [5]) but cannot be solved except in very special cases; hence approximations are required (see [6][7][8]).
In this paper, we present the OPTCON algorithm for calculating numerically (approximately) optimal control solutions to nonlinear dynamic stochastic optimization problems without and with learning. The algorithm was developed in three stages (OPTCON1, OPTCON2, and OPTCON3) over a long period. OPTCON1 determines optimal controls for deterministic problems and stochastic problems based on the open-loop information pattern [9]. OPTCON2 solves problems with passive learning about the unknown parameters of the system model [10], so-called open-loop feedback policies (see [11] for the terminology). Finally, OPTCON3 also includes active learning (dual, adaptive control) according to an approximation procedure initiated by [12][13][14], and adapted to linear economic systems by [15,16], cf. [17]. The present paper provides mathematical details for OPTCON3, which is the most sophisticated version of the OPTCON algorithm; in [18] we concentrate on computational aspects and applications of OPTCON3.
The OPTCON algorithm is applicable for stochastic control problems with the following properties: The model of the process is multivariable, formulated in discrete time, and described by a system of nonlinear difference equations with known functional form but additive noise and (possibly) unknown parameters. The state is stochastically observable and controllable. No inequality constraints on states or controls are given. Open-loop, open-loop feedback, and closed-loop information structures can be considered. The objective function is quadratic and perfectly known; it is formulated in tracking form but can easily be transformed to the general quadratic form. Quadratic functions can be interpreted as second-order Taylor approximations to more general functional forms. There is only one decision maker; for decentralized problems, see the OPTGAME algorithm [19].
The paper has the following structure. In Section 2 the relevant basic background is provided, namely the class of problems to be solved by the algorithm as well as some basic information about the linear-quadratic framework and nonlinear system solving algorithms. Section 3 presents the OPTCON algorithm stepwise, starting with the basic version (OPTCON1) and then introducing the more advanced versions OPTCON2 and OPTCON3 for handling stochastic components in the optimal control process. In Section 4 some details on computational time and accuracy of the OPTCON algorithm are presented. Section 5 concludes.

Optimal Control Problem
We consider optimal control problems with a quadratic objective function and a nonlinear multivariate discrete-time dynamic system under additive and parameter uncertainties. The basis for an optimal control problem is a deterministic dynamic system to be controlled (plant, firm, economy, etc.) in discrete time in the form:

x_t = f(x_{t-1}, x_t, u_t, θ, z_t), t = 1, ..., T, (1)

where:
- x_t ∈ R^n is a vector of state variables that describes the state of the system at any point in time t,
- u_t ∈ R^m is a vector of control variables; we assume that the decision maker determines the values of the control variables exactly (without randomization) according to the approximately optimal solution of the problem,
- z_t ∈ R^l denotes a vector of non-controlled deterministic exogenous variables, whose values are known to the decision maker at time t,
- T denotes the terminal time period of the finite planning horizon.
The function f is assumed to be twice continuously differentiable with respect to all of its arguments.
We assume that there is some true law of motion given by Equation (1) in the background which is at least partially unknown to the policy maker, while the function f is known to him/her. The policy maker faces two sources of uncertainty, additive and parameter uncertainty, where:
- θ_t ∈ R^p denotes a vector of stochastic system parameters (parameter uncertainty),
- θ ∈ R^p is the vector of true parameters whose values are assumed to be constant but unknown to the policy maker,
- ε_t ∈ R^n is a vector of additive stochastic disturbances (system error).
θ_t and ε_t are assumed to be independent Gaussian random vectors with expectations θ̂ and 0_n respectively and covariance matrices Σ^θθ and Σ^εε respectively.
Including uncertainty, the system (1) can be written as:

x_t = f(x_{t-1}, x_t, u_t, θ_t, z_t) + ε_t, t = 1, ..., T. (2)

The optimal control problem to be solved approximately assumes that there is a modeler who wishes to control the system (2). This means that the controller wishes to bring the state variables, using the control variables, as close as possible to some pre-defined desired time path, according to a given optimality criterion. The OPTCON algorithm allows for the optimal control of the system (2) using a quadratic objective function. To this end the modeler needs to define the following variables:
- x̃_t ∈ R^n are given target ('ideal') values (for t = 1, ..., T) of the state variables,
- ũ_t ∈ R^m are given target ('ideal') values (for t = 1, ..., T) of the control variables,
- W_t is an ((n + m) × (n + m)) symmetric matrix defining the relative weights of the state and control variables in the objective function. In a frequent special case, W_t includes a discount factor α with W_t = α^{t-1} W.
The resulting intertemporal objective function is given in quadratic tracking form:

J = E[ Σ_{t=1}^T L_t(x_t, u_t) ] → min, (3)

with

L_t(x_t, u_t) = (1/2) [(x_t − x̃_t)', (u_t − ũ_t)'] W_t [(x_t − x̃_t)', (u_t − ũ_t)']'. (4)

The OPTCON algorithm allows us to find approximate optimal control solutions which minimize the functions (3) and (4) and satisfy the system dynamics (2).
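The quadratic tracking criterion stacks the state and control deviations from their targets in each period and weights them with W_t. A minimal numerical sketch (all names are illustrative, not from the OPTCON code; a scalar discount factor α is assumed, as in the special case W_t = α^{t−1} W):

```python
import numpy as np

def tracking_objective(x, u, x_tilde, u_tilde, W, alpha=1.0):
    """Quadratic tracking objective: sum over t of
    0.5 * alpha^(t-1) * d_t' W d_t, where d_t stacks the state and
    control deviations from their target ('ideal') values."""
    J = 0.0
    for t in range(len(x)):
        d = np.concatenate([x[t] - x_tilde[t], u[t] - u_tilde[t]])
        J += 0.5 * alpha**t * d @ W @ d
    return J
```

With one state and one control variable and W = I, a state path (2, 3) against targets (1, 1) gives J = 0.5 + 2.0 = 2.5.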

Linear-Quadratic Optimal Control (LQ) Framework
The novel feature of the OPTCON algorithm is the combination of a nonlinear dynamic system and the above-mentioned stochastics when solving optimal control problems. Before we can discuss this technique in detail, a few words should be said about the underlying linear-quadratic optimal control framework. The dynamic system is given in linear form as

x_t = A_t x_{t-1} + B_t u_t + c_t, t = 1, ..., T.

The corresponding optimal control objective function can be written in the same form as in (3) and (4). Using well-known LQ optimization techniques, the rules for defining control variables can be found iteratively backwards in time using dynamic Riccati equations. We start from the familiar LQ framework (e.g., [20]), which was adapted to economic models by [15,21,22]. The same standard technique is applied in the OPTCON algorithm whenever an optimal control for a linearized system is to be obtained. Furthermore, for computational reasons it is useful to transform the objective function from the quadratic tracking form as used in Equation (4) into the general quadratic form

L_t(x_t, u_t) = (1/2) x_t' W^xx_t x_t + x_t' W^xu_t u_t + (1/2) u_t' W^uu_t u_t + w^x_t' x_t + w^u_t' u_t + w^c_t,

where W^xx_t, W^xu_t, and W^uu_t are the blocks of W_t, and w^x_t, w^u_t, and w^c_t collect the linear and constant terms generated by the targets x̃_t and ũ_t. The equivalence between the quadratic tracking form and the general quadratic form is shown, for instance, in [23].
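For the linear(ized) system, the feedback rules follow from a backward Riccati recursion. The following is a stripped-down sketch for a time-invariant system without tracking targets or cross terms (assumptions made for brevity; OPTCON itself works with time-varying matrices and the general quadratic form):

```python
import numpy as np

def lq_feedback(A, B, Q, R, T):
    """Backward Riccati recursion for the deterministic LQ problem
    min sum_{t=1}^T 0.5*(x_t'Q x_t + u_t'R u_t),  x_t = A x_{t-1} + B u_t.
    Returns feedback gains G_t with u_t = G_t x_{t-1}."""
    P = np.zeros_like(Q)              # value-function matrix beyond T
    gains = [None] * T
    for t in reversed(range(T)):
        S = Q + P                      # cost-to-go weight on x_t
        G = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        gains[t] = G
        P = A.T @ S @ (A + B @ G)      # Riccati update
    return gains
```

In the scalar case with A = B = Q = R = 1 and one period, the recursion yields the gain G = −0.5.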

Solving Nonlinear Dynamic Systems
The OPTCON algorithm allows us to find approximately optimal control solutions for nonlinear stochastic systems. In the solution process, the nonlinear dynamic system (2) has to be solved in a large number of iterations. For this reason the choice of an appropriate solver is important to guarantee reliable solutions while taking the computational costs into account. The OPTCON algorithm therefore includes several solution methods, which differ in how they trade off reliability against computational speed. At the moment the following methods can be used: Levenberg-Marquardt [24], trust region [25], Newton-Raphson [26], or Gauss-Seidel [27].
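Because x_t appears on both sides of the state equation, each period requires solving an implicit system. As an illustration, here is a minimal Newton-Raphson solver for x = f(x_prev, x, u) with a finite-difference Jacobian (one of the four methods listed; the tolerances and the finite-difference scheme are our illustrative choices, not the OPTCON implementation):

```python
import numpy as np

def solve_state(f, x_prev, u, x0, tol=1e-10, max_iter=50, eps=1e-7):
    """Newton-Raphson for the implicit state equation x = f(x_prev, x, u),
    i.e., the root of g(x) = x - f(x_prev, x, u)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        g = x - f(x_prev, x, u)
        if np.linalg.norm(g) < tol:
            break
        J = np.empty((n, n))          # finite-difference Jacobian of g
        for i in range(n):
            xp = x.copy()
            xp[i] += eps
            gp = xp - f(x_prev, xp, u)
            J[:, i] = (gp - g) / eps
        x = x - np.linalg.solve(J, g)  # Newton step
    return x
```

For a linear test function such as f = 0.5·x + x_prev + u, a single Newton step already lands on the solution x = 2·(x_prev + u).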

The OPTCON1 Algorithm
The first version of the OPTCON algorithm, the OPTCON1 algorithm, was developed by [9]. It allows us to calculate an open-loop (OL) solution to a nonlinear stochastic dynamic optimal control problem with a quadratic objective function under additive and parameter uncertainties. Open-loop controls either do not take account of the effect of uncertainties in the system or assume the stochastics (expectation and covariance matrices of additive and multiplicative disturbances) to be given for all time periods at the beginning of the planning horizon. The basic idea behind this algorithm is that it extends linear-quadratic stochastic optimal control techniques (see, e.g., [1]) to nonlinear problems using an iterative method of linear approximations.
The OPTCON algorithm solves the nonlinear optimal control problem iteratively by a sequence of linear approximations, where the next approximation is derived using the optimal control solution of the previous one. The (optimal) solutions of the intermediate linear approximations are iterated from one time path to another until the algorithm converges. The criterion for convergence is that the difference in the values of the state and control variables between the current and the previous iteration is smaller than a pre-specified number, or that the maximal number of iterations is reached. In each iteration the same procedure is conducted, namely linearizing the system and optimizing the linearized system. The system is linearized around the previous iteration's result and the problem is solved for the resulting time-varying linearized system. The solution of the linearized model is obtained via Bellman's principle of optimality and is used as the tentative path for the next iteration, starting off the procedure all over again. If the OPTCON algorithm converges, the solution of the last iteration is taken as the approximately optimal solution to the nonlinear problem and the value of the objective function is calculated. Figure 1 shows the structure of the OPTCON1 algorithm.
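The nonlinearity loop just described can be sketched as follows; `linearize` and `lq_solve` are placeholders for the two inner stages (computing the time-varying linearization and solving the resulting LQ problem), which are assumptions of this sketch:

```python
import numpy as np

def optcon1_loop(linearize, lq_solve, x_init, u_init, tol=1e-6, max_iter=100):
    """Skeleton of the OPTCON1 iteration: linearize around the previous
    path, solve the time-varying LQ problem, and repeat until the state
    and control paths change by less than `tol` or `max_iter` is hit."""
    x, u = x_init, u_init
    for _ in range(max_iter):
        model = linearize(x, u)           # time-varying linear approximation
        x_new, u_new = lq_solve(model)    # Riccati-based LQ solution
        diff = max(np.max(np.abs(x_new - x)), np.max(np.abs(u_new - u)))
        x, u = x_new, u_new
        if diff < tol:                    # convergence criterion
            break
    return x, u
```

The convergence test compares the full state and control paths of successive iterations, exactly as described in the text.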

The OPTCON2 Algorithm
The second version of the algorithm (called OPTCON2, see [10,23] for more details) extends the first version with a passive learning strategy (open-loop feedback (OLF)). OLF uses the idea of re-estimating the model at the end of each time period t = 1, ..., T. For that purpose, the model builder (policy maker) observes the current situation and uses the new information, namely the current values of the state variables, to improve his/her knowledge of the system.
Recall that the stochastic system includes two kinds of uncertainties, namely additive (random system errors) and multiplicative ('structural' errors in parameters). We assume that there is some true law of motion in the background which is unknown to the policy maker. The passive learning strategy deals with the 'true' parameters θ which generate the model. However, the policy maker does not know these true parameters θ and works with the 'wrong' estimated parameters θ̂. The idea of the OLF strategy is to observe the outcome of the model in each time period and use this information to bring the estimated parameters θ̂ closer to the true values θ.
The OLF strategy of the OPTCON2 algorithm consists of the following (rough) steps running in a forward loop.
In each time period S (S = 1, ..., T) do the following:
• Find an (approximately) optimal open-loop solution for the remaining subproblem (for the time periods from S to T) using the OPTCON1 algorithm.
• Fix the predicted solution for time period S, namely x*_S and u*_S.
• Observe the realized state variables x^{a*}_S which result from the chosen control variables u*_S, the true law of motion with parameter vector θ, and the realization of the random disturbances ε_S. The main difference between x^{a*}_S and x*_S is that the former is driven by the true dynamic process with parameter vector θ and the latter by the estimated dynamic system with θ̂.
• Update the parameter estimate θ̂ (via the Kalman filter, using the difference between x^{a*}_S and x*_S) and use it in the next iteration step S + 1.
The last step is carried out using the idea of the Kalman filter (see, e.g., [28] or [29]). The updating procedure according to the Kalman filter consists of two distinct phases, prediction and correction. First (the prediction phase), the predicted values of the state variables x̂_{S/S−1}, the vector of parameters θ̂_{S/S−1}, and the covariances of the parameters Σ^θθ_{S/S−1} are calculated using the estimates from the previous time period. Next (the correction phase), these 'a priori' values of the state variables, the vector of the parameters, and the values of the covariances are corrected using the current observations of the state variables.
The phase of calculating the predicted values x̂_{S/S−1} and θ̂_{S/S−1} is integrated in the previous steps of the OPTCON2 algorithm. At the end of time period S the predicted values x̂_{S/S−1} = x*_S are calculated and θ̂_{S/S−1} = θ̂_{S−1} is known. The correction phase is reduced to the calculation of the parameter estimates θ̂_{S/S} and the parameter covariances Σ^θθ_{S/S}, because the 'corrected' values of the state variables x^{a*}_S = f(x^{a*}_{S−1}, x^{a*}_S, u*_S, θ, z_S) + ε_S are calculated in the previous step of the algorithm or simply observed. As explained before, the updating procedure is based on the difference between x*_S = f(x^{a*}_{S−1}, x*_S, u*_S, θ̂, z_S) and x^{a*}_S at the end of each time period. The OPTCON2 algorithm allows us to find an approximate solution of the optimal control problem using a passive learning strategy. The policy maker uses real observations in each time period to update his/her knowledge about the stochastic dynamic system. In many cases this allows him/her to obtain more reliable results (see [30] for a performance study on optimal control with a passive learning strategy).
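The correction phase for the parameter vector is a standard Kalman filter update driven by the innovation x^{a*}_S − x*_S. A textbook sketch (F_theta stands for the derivative of f with respect to θ; the function name and interface are ours, not the OPTCON code):

```python
import numpy as np

def kalman_parameter_update(theta_hat, sigma_tt, F_theta, sigma_ee,
                            x_obs, x_pred):
    """One Kalman correction step for the parameter estimate:
    innovation covariance, gain, corrected estimate, corrected covariance."""
    S = F_theta @ sigma_tt @ F_theta.T + sigma_ee   # innovation covariance
    K = sigma_tt @ F_theta.T @ np.linalg.inv(S)     # Kalman gain
    theta_new = theta_hat + K @ (x_obs - x_pred)    # corrected estimate
    sigma_new = (np.eye(len(theta_hat)) - K @ F_theta) @ sigma_tt
    return theta_new, sigma_new
```

Since the true parameters are assumed constant, no process noise enters the prediction phase; the covariance is simply carried forward and then shrunk by the correction.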

The OPTCON3 Algorithm
In this section we present a detailed description of the OPTCON3 algorithm. As mentioned above, this latest version of the OPTCON algorithm includes an active learning strategy. In the literature this strategy is also called closed-loop, adaptive, or dual control. The passive learning method in the OPTCON2 algorithm uses current observations to update the information about the dynamic system. The active learning strategy in OPTCON3 actively uses system perturbations to improve the optimal control performance by reducing the uncertainty in the future.
The procedure for finding the closed-loop solution in the OPTCON3 algorithm is based on the linear active learning control method in [15]. Similar to the iterative structure in the OPTCON2 algorithm, the optimization runs in a forward loop (S = 1, ..., T). However, in each time period, more information about future measurements is used. To this end the objective function J (as defined in Equations (3) and (4)) is extended by stochastic components. We define in each time period J_d as the approximate total cost-to-go with T − S periods remaining. The approximate cost-to-go is broken down into three terms: J_d = J_D + J_C + J_P. J_D is called the deterministic term and incorporates only non-stochastic components. J_C is the cautionary term that includes the stochastic elements in the current period. J_P is called the probing term and captures the effect of dual learning on the uncertainty in the future periods. J_C and J_P constitute a separate quadratic minimization problem constrained by the nonlinear system. The original system needs to be expanded to the perturbation form δx_t. The optimization problem has to be solved for the perturbed system, where ΔJ*_t is the perturbed objective function. Using a Taylor expansion of the nonlinear system and the Bellman optimization technique, we obtain the solution ΔJ*_t as a quadratic function of δx_{t−1}. The original J*_t can be derived from ΔJ*_t. To take the uncertain parameters into account, all the terms and formulas need to be adjusted to the augmented system (treating the unknown parameters as additional state variables). Next, a schematic structure of the OPTCON3 algorithm is presented; it is visualized in Figure 2. The optimization is carried out in a forward loop from 1 to T, conducting the following procedure in each time period S (S = 1, ..., T). The subproblem from S to T is solved via the open-loop (OL) strategy (see Figure 1 in Section 3.1). The OL solution (x*_S, u*_S) for the time period S is fixed.
After that the core part of the dual control strategy starts, where the system is actively perturbed to learn about it in order to minimize the uncertainty and the objective function values in the future. In the OPTCON3 algorithm a grid search method is used (instead of grid search many other methods can be applied, such as gradient optimization or a more advanced heuristic approach). In the so-called "π-loop" we create a grid of possible solutions around the existing path (x * S , u * S ). In each iteration (π = 1, ..., Π), i.e., in each grid point, the optimal control solution is obtained and evaluated using the objective function. Inside the π-loop (for each π), the following steps are carried out.
An optimal open-loop solution for the subproblem from S + 1 to T is determined. Then the OL solution (x*π_{S+1}, u*π_{S+1}) for the time period S + 1 is fixed. Next, after some auxiliary calculations (including, among others, Riccati matrices), the deterministic J_D, cautionary J_C, and probing J_P terms of the cost-to-go are determined in a forward loop from S + 1 to T. Once the iterations of the π-loop have been completed, the total approximate objective function J_d = J_D + J_C + J_P can be obtained. When the search is completed, i.e., the approximately optimal path with min J_d has been found, the obtained optimal control in time period S + 1 is carried out by the policy maker. As a result of these controls and the true dynamic system the actual values of the state variables can be obtained. This new information is used by the policy maker to update and adjust the parameter estimate θ̂, whereby the Kalman filter is used. Using the updated stochastic parameters, the same active learning procedure is applied for the remaining time periods from S + 2 to T.
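At its core, the π-loop is a grid search over candidate controls, each evaluated with the approximate cost-to-go. A schematic stand-in (the grid width, grid size, and the `cost_to_go` evaluator are illustrative placeholders for Steps I-2A to I-2F):

```python
import numpy as np

def pi_loop(u_nominal, cost_to_go, width=0.1, n_grid=11):
    """Grid search over control perturbations around the nominal u*_S.
    `cost_to_go` evaluates the approximate J_d = J_D + J_C + J_P for a
    candidate control; the candidate with minimal J_d is returned."""
    grid = u_nominal + np.linspace(-width, width, n_grid)
    costs = [cost_to_go(u) for u in grid]
    best = int(np.argmin(costs))
    return grid[best], costs[best]
```

As the text notes, the grid search could be replaced by gradient-based optimization or a heuristic search without changing the surrounding structure.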
The OPTCON3 algorithm basically uses the technique presented in [13,15] but is augmented by approximating, in each step, the nonlinear system in a series of linear systems (iterative procedure as described in Section 3.1).
The following steps (I-IV) of the OPTCON3 algorithm describe how to obtain an approximately optimal dual control solution to a stochastic problem.
The algorithm needs the following input values (a legitimate question is how to obtain the required (tentative) inputs; this obviously depends on the research question. In the case of some numerical experiments, one can resort to Monte Carlo simulations. In the case of a real-world application, one can estimate the system parameters by an appropriate method or simply guess some values):
- f: the system function,
- a tentative path of the control variables,
- θ̂_0 = θ̂: expected values of the system parameters,
- Σ^θθ_0 = Σ^θθ: covariance matrix of the system parameters,
- Σ^εε: covariance matrix of the system noise (ε_t),
- T: the terminal period of the planning horizon.
As a result of the algorithm, an approximately optimal solution (x^{a*}_t)_{t=1}^T, (u*_t)_{t=1}^T and the corresponding value of the objective function J* are obtained.
Step I-2: Run a grid search of size Π around (x*_S, u*_S), i.e., perform steps (A)-(G) for each π = 1, ..., Π.
Step I-2A: Find the OL solution (x*π_t, u*π_t)_{t=S+1}^T for the subproblem (S + 1, ..., T).
• The nonlinearity loop is run until the stop criterion is fulfilled (the stop criterion is fulfilled when the difference between the values of the current and the previous iteration is smaller than a pre-specified number or the maximum number of iterations is reached). As a result the approximately optimal solution (x*π_t, u*π_t)_{t=S+1}^T has been found. Then go to the next step, I-2B.
Notes:
- After several runs of the nonlinearity loop, only the solution (x*π_{S+1}, u*π_{S+1}) for the time period S + 1 will be taken as the optimal (nominal) solution. The calculations of the pairs (x*π_t, u*π_t) for the other periods (t > S + 1) have to be done again, taking into account the re-estimated parameters for all periods.
- After linearization the parameter matrices of the linearized system are obtained, where F_{x_t}, F_{x_{t−1}}, and F_{u_t} are the derivatives of the system function f with respect to x_t, x_{t−1}, and u_t respectively.
Step I-2B: Do a backward recursion to obtain auxiliary matrices for calculating components of the objective function.
Initialize the following auxiliary matrices for the backward recursion and calculate the Riccati matrices K^xx, where F_θ is the derivative of the system function with respect to θ.
Step I-2C: Calculate the deterministic component J_{D,S} and the cautionary component J_{C,S} of the approximate objective function.
Step I-2D: Repeat steps [a]-[c] for each j = S + 1, ..., T:
[a]: Calculate the components J_{D,j} and J_{C,j}.
[b]: Calculate the matrix Σ^θθ_{j/j}.
[c]: Compute the probing component J_{P,j}.
Step I-2E: Calculate the sum of the deterministic, cautionary, and probing terms over the periods S, ..., T.
Step I-2F: Take a new control from the grid and repeat from Step I-2A.
Step I-3: Choose the optimal u*_S with min J_d = J*_d(u*_S). End of grid search. Fix the corresponding x*_S.
Step II: Calculate the quantities (a) and (b) for the single time period S.
Step III: Update the parameter estimates θ̂_{S/S} and Σ^θθ_{S/S}.
Step IV: Set θ̂ = θ̂_{S/S} and Σ^θθ = Σ^θθ_{S/S}, go to Step I and run the procedure for the time period S + 1.
The OPTCON3 algorithm is finished when S = T and the approximately optimal dual control and state variables have been found for all periods.

Computational Details for the AL Procedure
This subsection includes the technical computations for the AL procedure of the OPTCON3 algorithm.
The basic tasks are applying Bellman's principle of optimality, linearizing the nonlinear system in perturbation form, minimizing the objective function, and inserting the obtained feedback rule for the controls back into the objective function. In this way the (temporary) components of the optimal objective function can be derived. After that, due to the stochastic nature of the parameters (θ), all the components have to be adjusted to the extended system.
Start with Bellman's principle of optimality. The last term of (19) is expanded around a nominal path, where L_{jx} and L_{ju} are the gradients and L_{jxx}, L_{jxu}, and L_{juu} are the Hessian matrices; δx_j = x_j − x_{oj}, δu_j = u_j − u_{oj}, and (x_{oj}, u_{oj}) is a nominal path.
Next, apply the Taylor expansion to the nonlinear system, i.e., linearize the system function f in (2) around the reference values, and rewrite the result in perturbation form. After this transformation, (21) and (22) constitute the problem in perturbation form.
Next, we have to prove that the term (21) can be expressed in quadratic form as a function of δx_t only. Assume (23) and apply the method of induction; by doing so the parameters g, h^x, and H are determined. The next step is to prove that rule (23) also holds for ΔJ*. By second-order expansion and taking expectations over P_{j+1}, put (22) into (24). Then, taking expectations over P_j, the terms of higher than second order are dropped. Using the notations introduced above, apply the rule E(x'Ax) = x̄'Ax̄ + tr(AΣ^xx) and take expectations over P_{j−1}, where E(ξ) = 0. Next, minimize the objective function ΔJ*_{T−j} with respect to δu_j and put the resulting feedback rule (29) into (27). Collecting the terms shows that ΔJ*_{T−j} is a quadratic function of δx_{j−1}.

QED
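The induction hypothesis (23) used in the proof above asserts that the optimal perturbation cost-to-go is quadratic in the preceding state perturbation. In terms of the parameters g, h^x, and H introduced above, it takes the form (the index placement here is our assumption, chosen to match the closing statement of the proof):

```latex
\Delta J^{*}_{T-j} \;=\; g_{j} \;+\; (h^{x}_{j})^{\top}\,\delta x_{j-1}
\;+\; \tfrac{1}{2}\,\delta x_{j-1}^{\top} H_{j}\,\delta x_{j-1}
```

The induction step verifies that minimizing over δu_j reproduces this form one period earlier, which is what yields the backward recursions for g, h^x, and H.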
Next, using the results of [31] we find the solution of the g recursion. Define the deterministic term M_j = (1/2) λ^u_j' (Λ^uu)^{−1} λ^u_j and the stochastic term N_j = (1/2) tr(K_j Σ^ξξ_{j/j} + Ω^xx Σ^xx_j), so that g_j = g_{j+1} − M_j + N_j. By working backwards from period T the explicit form of g can be derived. Next, solve it partially for the stochastic term N_j. To that end define γ_j = γ_{j+1} − M_j with γ_{T+1} = 0. Using the backward recursion again, obtain the corresponding equation and put it into (33). Use the fact that ΔJ*_{T−(t+1)} and ΔJ*_{t+1} are equivalent; thus put (34) into (32) for t + 1. The resulting objective function can be split into three parts: a deterministic, a cautionary, and a probing part. In this way the optimal control problem is solved and the components of the objective function are found. However, these are not the final results. In the next step, the stochastic parameters have to be taken into account, i.e., the extension of the system has to be considered.

Extension of the System
Recall that the algorithm deals with a quadratic criterion function when the parameters of the system equations are unknown. One technique to incorporate the uncertain parameters is to treat the random parameters as additional state variables.
Thus, apply the equations of the objective function to the augmented system. In order to calculate the adjusted components of the objective function (especially (40) and (41)) we need the terms H and Ω. First, we calculate the adapted Ω^ΨΨ using Definition (30). We know that F^θ_{x_{t−1}} = 0, F^x_{θ_{t−1}} = 0, F^θ_{θ_{t−1}} = I, and L_{xu} = W^xu. Using the definition Λ^xu = A'K^xx B + A'W^xu, the adapted expressions follow. For the calculation of H we need the term Λ^ΨΨ; with it, the term H for the extended system can be calculated. After that we go back to the components of the objective function. According to (41) we need to calculate the term Ω^ΨΨ Σ^ΨΨ_{t/t}. Next we calculate J_{C,t}; according to (40) we need to calculate tr(K^ΨΨ_j Σ^ξξ_{j/j−1}) and tr(H^ΨΨ_{t+1} Σ^xx_{t/t−1}).

Computational Time and Accuracy

The OPTCON algorithm is programmed in MATLAB (MATLAB R2020a); the MATLAB computer code is available on request. The MacRae model is a simple linear model with one state and one control variable. One parameter is stochastic and the optimal control was calculated for two periods only. Table 1 summarizes the computational performance of the OPTCON algorithm for the MacRae model. In the case of a linear-quadratic deterministic optimal control framework convergence can be shown analytically. This is true for the MacRae model, at least for the deterministic and the OL case. Moreover, as a check for the special case of a linear system, we compared the results of OPTCON with those of the same model calculated by available software and found them to be the same apart from round-off errors. The ATOPT model is a small nonlinear model consisting of three equations, one control variable, and two stochastic parameters. The optimal control is calculated for five periods. Table 2 summarizes the computational performance of the OPTCON algorithm for the ATOPT model.

Concluding Remarks
In this paper, we described the OPTCON algorithm for the approximately optimal control of stochastic processes under a quadratic objective function. For systems with complete information, either deterministic or stochastic with known statistical characteristics of the disturbances, the open-loop version OPTCON1 is suitable. OPTCON2 assumes partial information, in particular uncertain parameters of the system, with passive use of the information accruing during the control horizon, i.e., new observations update the parameter estimates but are not deliberately generated for control purposes. If the active generation of information is possible, the dual control policy of OPTCON3 is appropriate. All three versions of OPTCON were programmed first in C# and then in MATLAB. We have applied OPTCON1 to various economic policy problems ([32,33], among others); Monte Carlo experiments with OPTCON2 and OPTCON3 have also yielded satisfactory results in terms of computing time. The results of the approximately optimal policies were also as expected from the point of view of the economic problems under consideration. Due to the numerical nature of the algorithm, we cannot prove convergence in general; however, no problems of non-convergence have occurred so far. For future research, more applications are desirable to gain further insight into the different policies resulting from the various assumptions about the information structure of adaptive stochastic policy problems.