Optimization of Constrained Stochastic Linear-Quadratic Control on an Infinite Horizon: A Direct-Comparison Based Approach

Abstract: In this paper, we study the optimization of the discrete-time stochastic linear-quadratic (LQ) control problem with conic control constraints on an infinite horizon, considering multiplicative noises. Stochastic control systems can be formulated as Markov Decision Problems (MDPs) with continuous state spaces, and therefore we can apply the direct-comparison based optimization approach to solve the problem. We first derive the performance difference formula for the LQ problem by utilizing the state separation property of the system structure. Based on this, we derive the optimality conditions and the stationary optimal feedback control. By introducing the LQ optimization problem, we establish a general framework for infinite horizon stochastic control problems. The direct-comparison based approach is applicable to both linear and nonlinear systems. Our work provides a new perspective on LQ control problems; based on this approach, learning-based algorithms can be developed without identifying all of the system parameters.


Introduction
In this paper we study the discrete-time stochastic linear-quadratic (LQ) optimal control problem with conic control constraints and multiplicative noises on an infinite horizon. There exist in the literature various studies on the estimation and control of systems with multiplicative noise [1,2]. As for LQ-type stochastic optimal control problems with multiplicative noise, investigations have focused on the LQ formulation with indefinite penalty matrices on control and state variables, for both continuous-time and discrete-time models (see, e.g., [3,4]).
In an LQ optimal control problem, the system dynamics are linear in both the state and control variables, and the cost function is quadratic in these two variables [5]. One attractive quality of the LQ type of optimal control model is its explicit control policy, which can be derived by solving the corresponding Riccati equation. Due to this elegant structure, the LQ problem has long been a central topic in optimal control research. Since the fundamental work on deterministic LQ problems by Kalman [6], there have been a great number of studies on the subject; see [5,7,8]. In the past few years, stochastic LQ problems have drawn increasing attention, due to promising applications in different fields, including dynamic portfolio management, financial derivative pricing, population models, and nuclear heat transfer problems; see [9][10][11]. This paper is motivated by two recent developments: LQ optimal control and Markov decision problems (MDPs). First, the constrained LQ problem is significant in both theory and applications.
Due to the constraints on state and control variables, it is hard to obtain an explicit control policy by solving the Riccati equation [5]. Recently, there have been studies on constrained LQ optimal control problems, such as [12][13][14]. Meanwhile, in real applications, practical limits, such as risk or economic regulations, force us to take constraints on the control variables into consideration. For LQ control problems with a positivity constraint on the control, some works [15,16] propose optimality conditions and numerical methods to characterize the optimal control policy. In this paper, motivated by such applications, we model these limits as conic control constraints.
Work by Cao [17] and Puterman [18] demonstrates that stochastic control problems can be viewed as Markov decision problems. Therefore, the constrained stochastic LQ control problem can be formulated as an MDP; see, e.g., [19]. A direct-comparison based approach (or relative optimization), which originated in the area of discrete event systems, has been developed in the past years for the optimization of MDPs [17].
With this approach, optimization is based on the comparison of the performance measures of the system under any two policies. It is intuitively clear, and it can provide new insights, leading to new results for many problems, such as [20][21][22][23][24][25][26]. This approach is well suited to performance optimization problems, leading to results including the property of under-selectivity in time-nonhomogeneous Markov processes [24]. In this paper, we show that the special features of the constrained stochastic LQ optimal control problem make it solvable by the direct-comparison based approach, leading to some new insights for the problem.
In our work, we consider the stochastic LQ control problem through an MDP formulation on an infinite horizon. Through the direct-comparison based approach [17], we first derive the performance potentials for the LQ problem by utilizing the state separation property of the system structure. Based on this, we derive the optimality conditions and the stationary optimal feedback control. We show that the optimal control policy is a piecewise affine function of the state variables. In real applications, the proposed methodology can be used in many fields, such as system risk contagion [26] and power grid systems [27].
Our work provides a new perspective on LQ control problems. Compared with the existing literature, such as [5,13], we additionally consider multiplicative noises. We establish a general framework for studying infinite horizon stochastic control problems. With the direct-comparison based approach, which is applicable to both linear and nonlinear systems, we obtain further results for performance optimization problems, and these results can be extended easily. In addition, since this approach does not require identifying all the system parameters, it can be implemented on-line, and learning-based algorithms can be developed.
The paper is organized as follows. Section 2 introduces an MDP formulation of the constrained stochastic LQ problem with multiplicative noises; some preliminary knowledge on MDP and the state separation property is also provided. In Section 3, we derive the performance difference formula, which is the foundation of the performance optimization; based on it, the Poisson equation and Dynkin's formula can be obtained. Then we derive the optimality condition and the optimal policy through the performance difference formula. In Section 4, we illustrate the results by numerical examples. Finally, we conclude the paper in Section 5.

Problem with Infinite Time Horizon
In this section, we study the infinite horizon discrete-time stochastic LQ optimal control problem, in which conic control constraints are also considered; see [5,14]. For simplicity, we consider a one-dimensional dynamic system with a multiplicative noise, described by

x_{l+1} = (Ax_l + Bu_l)(1 + ξ_l), (1)

for time l = 0, 1, ···. Denoting by R (R_+) the set of real (nonnegative real) numbers, in this system, A ∈ R and B ∈ R^{1×m} are deterministic; x_l ∈ R is the state, with x_0 given; and u_l ∈ R^m is the feedback control at time l. The ξ_l, l = 0, 1, ···, are independent and identically distributed one-dimensional multiplicative noises, each following a normal distribution with mean 0 and variance σ², σ ≥ 0. Now, we consider the conic control constraint sets (cf. [5])

u_l ∈ C_l, u_l adapted to F_l, (2)

for l = 0, 1, ···, where F_l is the filtration of the information available at time l, and C_l := {u ∈ R^m | Hu ∈ R^n_+} ⊂ R^m is a given closed cone determined by a deterministic matrix H ∈ R^{n×m}; i.e., αu_l ∈ C_l whenever u_l ∈ C_l and α ≥ 0. The goal of optimization is to minimize the total reward performance measure in quadratic form, subject to (1) and (2):

η = E[ Σ_{l=0}^{∞} (Qx_l² + u_l' R u_l) | x_0 ], (3)

where Q ∈ R_+ and R ∈ R_+^{m×m} are deterministic. Here we denote the transpose operation by a prime in the superscript, such as u_l'. {u_l} denotes the control sequence {u_0, u_1, ···}. We also assume that (3) exists.
Therefore, the performance function of (3) is

f(x, u) = Qx² + u' R u. (4)

In this paper, we will show that the direct-comparison based approach leads to new results for the total rewards problem [7], and that the results can be easily extended.
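As a concrete illustration of the performance measure (3) under the dynamics (1), the following sketch simulates the scalar system under a simple linear feedback u_l = Kx_l (ignoring the conic constraint for the moment) and estimates the total quadratic cost by Monte Carlo. All numerical parameters (A, B, K, Q, R, σ) are hypothetical and chosen so that the closed loop is mean-square stable.

```python
import random

def rollout_cost(A, B, K, Q, R, x0, sigma, horizon, rng):
    """Simulate x_{l+1} = (A*x_l + B*u_l)*(1 + xi_l) under u_l = K*x_l
    and accumulate the quadratic cost Q*x_l^2 + R*u_l^2."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = K * x
        cost += Q * x * x + R * u * u
        x = (A * x + B * u) * (1.0 + rng.gauss(0.0, sigma))
    return cost

rng = random.Random(0)
# Hypothetical parameters; the closed-loop gain A + B*K = 0.5 keeps E[x_l^2] decaying.
samples = [rollout_cost(A=1.0, B=1.0, K=-0.5, Q=1.0, R=0.1, x0=1.0,
                        sigma=0.2, horizon=200, rng=rng) for _ in range(2000)]
eta = sum(samples) / len(samples)
print(round(eta, 3))
```

For these parameters, the estimate should be close to the analytic total cost W x_0²/(1 − C²(1+σ²)) with W = Q + K²R and C = A + BK, i.e., about 1.39.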

MDPs with Continuous State Spaces
For a stationary control law u_l = u(x_l) at time l = 0, 1, ···, the constraint (2) can be written as u(x_l) ∈ C_l, l = 0, 1, ···. Then the above stochastic control problem can be viewed as an MDP with a continuous state space. More precisely, u(x) plays a role similar to that of an action in MDPs, and the control law u corresponds to a policy.
Consider a discrete-time Markov chain X := {x_l}_{l=0}^{∞} with a continuous state space R. The transition probability can be described by a transition operator P as

(Ph)(x) = ∫_R h(y) P(dy|x), (5)

where P(dy|x) is the transition probability function, with x, y ∈ R, and h(y) is any measurable function on R. As the ξ_l are independent Gaussian noises, given the current state x_l = x, under the stationary control u(x), y = x_{l+1} follows a normal distribution with mean μ_y = Ax + Bu(x) and variance σ_y² = [Ax + Bu(x)]²σ². Then the transition function of this system is

P(dy|x) = (1/(√(2π) σ_y)) exp{−(y − μ_y)²/(2σ_y²)} dy. (6)

Let B be the σ-field of R containing all the (Lebesgue) measurable sets. For any set B ∈ B, we can define the identity transition function I(B|x): I(B|x) = 1 if x ∈ B, and I(B|x) = 0 otherwise. For any function h and x ∈ R, we have (Ih)(x) = h(x).
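The Gaussian transition function above can be checked directly by sampling: drawing x_{l+1} = (Ax + Bu(x))(1 + ξ) and comparing the empirical mean and standard deviation with μ_y and |μ_y|σ. The parameters and the linear policy below are hypothetical.

```python
import random
import statistics

A, Bc, sigma = 0.9, 0.5, 0.3     # hypothetical scalar system parameters
u = lambda x: -0.4 * x           # an arbitrary stationary linear policy

def step(x, rng):
    # x_{l+1} = (A x + B u(x)) (1 + xi),  xi ~ N(0, sigma^2)
    return (A * x + Bc * u(x)) * (1.0 + rng.gauss(0.0, sigma))

rng = random.Random(1)
x = 2.0
mu_y = A * x + Bc * u(x)         # theoretical mean of the next state
sd_y = abs(mu_y) * sigma         # theoretical standard deviation
ys = [step(x, rng) for _ in range(20000)]
print(round(statistics.mean(ys), 3), round(statistics.stdev(ys), 3))
```

The printed empirical moments match μ_y = 1.4 and σ_y = 0.42 up to Monte Carlo error.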
The product of two transition functions P_1(B|x) and P_2(B|x) is defined as a transition function

(P_1P_2)(B|x) := ∫_R P_2(B|y) P_1(dy|x).

For any transition function P, we can define the kth power, k = 0, 1, ···, as P⁰ = I, P¹ = P, and P^k = PP^{k−1}, k = 2, ···. Suppose that the Markov chain X is time-homogeneous with transition function P(B|x), x ∈ R, B ∈ B. Then the k-step transition probability functions, denoted as P^{(k)}(B|x), k = 1, 2, ···, are given by the 1-step transition function P^{(1)}(B|x) = P(B|x) and P^{(k)}(B|x) = ∫_R P^{(k−1)}(B|y) P(dy|x). For any function h(x), we have (P^{(k)}h)(x) = (P(P^{(k−1)}h))(x); that is, as an operator, P^{(k)} = P(P^{(k−1)}). Recursively, we can prove that P^{(k)} = P^k. Suppose that a Markov chain X with a continuous state space R has a steady-state distribution π satisfying π = πP. Define the function e(x) = 1 for all x ∈ R. We denote by g the performance potential, a function which satisfies the Poisson equation (cf. [17])

(I − P)g = f − ηe, (7)

where I and P are the two transition functions defined above and η = πf is the steady-state average performance, so that η(x) = (πf)e(x) = ηe(x). Then if g is a solution to (7), so is g + ce, with any constant c. We define

g_K(x) := Σ_{k=0}^{K−1} [P^k(f − ηe)](x), K = 1, 2, ···, (8)

and assume the limit g(x) := lim_{K→∞} g_K(x) exists for x ∈ R. Then we have the following lemma.

Lemma 1 (Solution to Poisson Equations [17]). For any transition function P and performance function f, if the limit g(x) = lim_{K→∞} g_K(x) exists for every x ∈ R, then

g = Σ_{k=0}^{∞} P^k(f − ηe) (9)

is a solution to the Poisson Equation (7).
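Lemma 1 can be illustrated on a small discrete analogue: for a three-state chain (an arbitrary illustrative choice, not the LQ system), the truncated sums g_K of P^k(f − ηe) converge geometrically, and the limit solves the Poisson equation (7).

```python
# A three-state discrete analogue of Lemma 1.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
f = [1.0, 2.0, 4.0]
n = 3

def apply_P(h):
    # (Ph)(x) = sum_y P(y|x) h(y)
    return [sum(P[x][y] * h[y] for y in range(n)) for x in range(n)]

# steady-state distribution pi = pi*P, by power iteration
pi = [1.0 / n] * n
for _ in range(200):
    pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
eta = sum(pi[x] * f[x] for x in range(n))

# g = sum_k P^k (f - eta*e), truncated at K = 200 (geometric convergence)
d = [fx - eta for fx in f]
g, term = [0.0] * n, d[:]
for _ in range(200):
    g = [g[x] + term[x] for x in range(n)]
    term = apply_P(term)

# verify the Poisson equation (I - P)g = f - eta*e
Pg = apply_P(g)
residual = max(abs(g[x] - Pg[x] - d[x]) for x in range(n))
print(residual < 1e-9)
```

The residual is zero up to floating-point error, confirming that the truncated-sum construction of (8)–(9) indeed solves (7) in this discrete setting.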

State Separation Property
In order to derive an explicit solution of the stochastic LQ control problem with conic constraints, Reference [14] gives the following lemma on the state separation property of the LQ problem.

Lemma 2 (State Separation [14]). In the system (1), for any x ∈ R, the optimal solution of problem (3) at time l is a piecewise linear feedback policy

u*_l(x) = K̂*x 1_{x≥0} − K̄*x 1_{x<0}, l = 0, 1, ···, (10)

where K := {K ∈ R^m | HK ∈ R^n_+} is associated with the control constraint sets C_l; K̂*, K̄* ∈ K are the optimal solutions of two corresponding auxiliary optimization problems; and the superscript "*" denotes the optimal control.
Based on (10) in Lemma 2, the stationary control can be written as u(x) = K̂x 1_{x≥0} − K̄x 1_{x<0}, where 1_B is an indicator function, such that 1_B = 1 if the condition B holds true and 1_B = 0 otherwise, and K̂, K̄ ∈ K. Applying this control, the system dynamics (1) become

x_{l+1} = [Ĉx_l 1_{x_l≥0} + C̄x_l 1_{x_l<0}](1 + ξ_l), (11)

for l = 0, 1, ···, where Ĉ := A + BK̂ and C̄ := A − BK̄. Moreover, the performance measure (3) becomes

η = E[ Σ_{l=0}^{∞} (Ŵx_l² 1_{x_l≥0} + W̄x_l² 1_{x_l<0}) | x_0 ], (12)

where Ŵ = Q + K̂'RK̂ and W̄ = Q + K̄'RK̄. Therefore, the performance function (4) becomes

f(x) = Ŵx² 1_{x≥0} + W̄x² 1_{x<0}. (13)

It is easy to verify that Ŵ and W̄ are positive semi-definite. We assume that this one-dimensional state system is stable, and then the spectral radii of Ĉ and C̄ are less than 1, i.e., C_max := max(Ĉ, C̄) < 1. In the next section, we will derive the performance potentials for the LQ problem, which are the foundation of the performance optimization. Based on these, the Poisson equation and Dynkin's formula can be derived. The direct-comparison based approach provides a new perspective for this problem, and the results can be extended easily.
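The piecewise-linear closed loop (11) can be simulated directly. The sketch below uses hypothetical stable gains Ĉ, C̄ (both with magnitude below 1) and estimates E[x_l²] over many sample paths, showing the second moment decaying to zero.

```python
import random

def closed_loop_traj(C_hat, C_bar, sigma, x0, horizon, rng):
    """x_{l+1} = C_hat*x_l*(1+xi) if x_l >= 0, else C_bar*x_l*(1+xi):
    the closed loop under the piecewise-linear policy."""
    xs = [x0]
    for _ in range(horizon):
        x = xs[-1]
        C = C_hat if x >= 0 else C_bar
        xs.append(C * x * (1.0 + rng.gauss(0.0, sigma)))
    return xs

rng = random.Random(2)
n, T = 3000, 40
second_moment = [0.0] * (T + 1)   # Monte Carlo estimate of E[x_l^2]
for _ in range(n):
    xs = closed_loop_traj(C_hat=0.6, C_bar=0.5, sigma=0.2, x0=5.0,
                          horizon=T, rng=rng)
    for t, x in enumerate(xs):
        second_moment[t] += x * x / n
print(round(second_moment[0], 3), second_moment[T] < 1e-6)
```

Here E[x_{l+1}²] ≤ C_max²(1+σ²)E[x_l²], so with C_max²(1+σ²) < 1 the second moment decays geometrically from x_0² = 25 to essentially zero by l = 40.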

Performance Optimization
In this section, utilizing the state separation property, we derive the performance difference formula, which compares the performance measures of the system under any two policies, and then derive the optimality condition and the optimal policy with the direct-comparison based approach.
We have proved that the closed-loop system (11) is L²-asymptotically stable, i.e., lim_{l→∞} E[(x_l)²] = 0. Therefore, the total reward η(x) exists and is a piecewise quadratic function,

η(x) = Ĝx² 1_{x≥0} + Ḡx² 1_{x<0}, (19)

with positive semi-definite matrices Ĝ and Ḡ. Now, we define the discrete version of the generator, A, for any function h(x), x ∈ R, such that

(Ah)(x) := (Ph)(x) − h(x). (20)

Taking h(x) as η(x), and by the definition of η(x) in (3), we have the Poisson equation as follows:

(Aη)(x) + f(x) = 0. (21)

By (5) and (20), we obtain the discrete version of Dynkin's formula,

E[h(x_K)|x_0] − h(x_0) = E[ Σ_{k=0}^{K−1} (Ah)(x_k) | x_0 ], (22)

and, if the limit as K → ∞ exists, then

lim_{K→∞} E[h(x_K)|x_0] − h(x_0) = Σ_{k=0}^{∞} E[(Ah)(x_k)|x_0]. (23)

Now, we consider two policies u, u' ∈ U_0, resulting in two independent Markov chains X and X' on the same state space R, with P, f, η, A, E and P', f', η', A', E', respectively. Let x_0 = x'_0. Applying Dynkin's Formula (22) with h = η to the chain X', noting that η'(x_0) = lim_{K→∞} Σ_{k=0}^{K−1} E'[f'(x_k)|x_0], and that lim_{K→∞} E'[η(x_K)|x_0] = 0 due to asymptotic stability, we obtain from (23) the performance difference formula:

η'(x_0) − η(x_0) = Σ_{k=0}^{∞} E'{ (A'η)(x_k) + f'(x_k) | x_0 }. (24)

Note that, by (21), the summand can equivalently be written as [(A'η)(x_k) + f'(x_k)] − [(Aη)(x_k) + f(x_k)].
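The Poisson equation (21) for the piecewise-quadratic potential (19) can be verified numerically. The sketch below uses hypothetical closed-loop gains and piecewise cost weights; it estimates the truncated noise moments E[(1+ξ)²; 1+ξ ≷ 0], solves for Ĝ, Ḡ (here G_hat, G_bar) by fixed-point iteration, and checks that E[η(x_{l+1})|x_l = x] − η(x) + f(x) ≈ 0 at sample states.

```python
import random

# Hypothetical closed-loop parameters: C_hat, C_bar are the piecewise gains
# A + B*K_hat and A - B*K_bar; W_hat, W_bar are the piecewise cost weights.
sigma, C_hat, C_bar = 0.25, 0.6, 0.5
W_hat, W_bar = 1.2, 1.1

rng = random.Random(3)
zs = [1.0 + rng.gauss(0.0, sigma) for _ in range(400000)]
a_p = sum(z * z for z in zs if z >= 0) / len(zs)   # E[(1+xi)^2; 1+xi >= 0]
a_m = sum(z * z for z in zs if z < 0) / len(zs)    # E[(1+xi)^2; 1+xi < 0]

# Fixed-point iteration for eta(x) = G_hat x^2 1_{x>=0} + G_bar x^2 1_{x<0}:
# splitting the expectation by the sign of the next state couples G_hat, G_bar.
G_hat = G_bar = 0.0
for _ in range(200):
    G_hat = W_hat + C_hat**2 * (a_p * G_hat + a_m * G_bar)
    G_bar = W_bar + C_bar**2 * (a_p * G_bar + a_m * G_hat)

eta = lambda x: (G_hat if x >= 0 else G_bar) * x * x
f = lambda x: (W_hat if x >= 0 else W_bar) * x * x

# Poisson equation check: E[eta(x_{l+1}) | x_l = x] - eta(x) + f(x) = 0
res = []
for x in (2.0, -3.0):
    C = C_hat if x >= 0 else C_bar
    m = sum(eta(C * x * z) for z in zs) / len(zs)
    res.append((m - eta(x) + f(x)) / eta(x))       # relative residual
print([round(r, 6) for r in res])
```

Both relative residuals are effectively zero, confirming that the piecewise-quadratic form (19) is closed under the transition operator and satisfies (21).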

Optimal Policy
Based on the performance difference Formula (24), we have the following optimality condition.

Theorem 1 (Optimality Condition).
A policy u* ∈ U_0 is optimal if, and only if,

(P^{u*}η^{u*})(x) + f^{u*}(x) ≤ (P^{u}η^{u*})(x) + f^{u}(x), for all x ∈ R and all u ∈ U_0. (25)

From (25), the optimality equation is:

η^{u*}(x) = min_{u∈U_0} { (P^{u}η^{u*})(x) + f^{u}(x) }, x ∈ R. (26)

Proof. First, the "if" part follows from the performance difference Formula (24) and the Poisson Equation (21). Next, we prove the "only if" part: let u* be an optimal policy. We need to prove that (25) holds. Suppose that this is not true. Then, there must exist one policy, denoted as u', for which (25) does not hold. That is, there must be at least one state, denoted as y, such that (P^{u*}η^{u*})(y) + f^{u*}(y) > (P^{u'}η^{u*})(y) + f^{u'}(y).
Then we can create a policy ũ by setting ũ = u' when x = y, and ũ = u* when x ≠ y. We have η^{u*} > η^{ũ}. This contradicts the fact that u* is an optimal policy. Based on the optimality condition, the optimal control u* can be obtained by developing policy iteration algorithms. Roughly speaking, we start with any policy u_0. At the kth step, k = 0, 1, ···, given a piecewise linear policy u_k(x) = K̂x 1_{x≥0} − K̄x 1_{x<0}, where K̂, K̄ ∈ K, we want to find a better policy by (26). We consider any policy u(x). Setting h(x) = η^{u_k}(x) = Ĝx² 1_{x≥0} + Ḡx² 1_{x<0}, by (5), (12), and (14), we obtain the expression (27) for (Ph)(x), where a_1 and a_2 satisfy Equations (15) and (16), respectively. Then, from (4) and (27), we obtain the improved policy u_{k+1}(x) = K̂_{k+1}x 1_{x≥0} − K̄_{k+1}x 1_{x<0}. It can be seen that if the policy u_k(x) is a piecewise linear control, then the improved policy u_{k+1}(x) is also piecewise linear. Moreover, if K̂_{k+1} = K̂ and K̄_{k+1} = K̄, i.e., u_{k+1} = u_k, then the iteration stops; the policy u_k then satisfies the optimality condition (26) in Theorem 1 and is therefore an optimal control. Therefore, we can obtain the optimal policy

u*(x) = K̂*x 1_{x≥0} − K̄*x 1_{x<0}, (28)

and the original problem (3) is transformed into the two auxiliary optimization problems (29) and (30). Under the optimal control u* in (28), the closed-loop system (11) is L²-asymptotically stable. From (19), with the initial condition x_0 = x, the optimal total reward performance is

η*(x) = Ĝ*x² 1_{x≥0} + Ḡ*x² 1_{x<0},

where Ĝ* and Ḡ* satisfy (31) and (32), respectively.
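The policy iteration loop can be sketched for a scalar instance (m = n = 1, H = 1, i.e., the constraint u ≥ 0). All parameters below are hypothetical; the sketch assumes the closed-loop gains A + BK̂ and A − BK̄ stay nonnegative so that the sign bookkeeping in the evaluation step is valid. Each improvement step minimizes a scalar quadratic over K ≥ 0 in closed form (unconstrained minimizer, clamped at zero).

```python
import random

# Hypothetical scalar problem: dynamics x_{l+1} = (A x + B u)(1 + xi),
# cost Q x^2 + R u^2, conic constraint u >= 0 (H = 1, m = n = 1).
A, B, Q, R, sigma = 0.7, 0.5, 1.0, 0.2, 0.25

rng = random.Random(4)
zs = [1.0 + rng.gauss(0.0, sigma) for _ in range(200000)]
a_p = sum(z * z for z in zs if z >= 0) / len(zs)   # E[(1+xi)^2; 1+xi >= 0]
a_m = sum(z * z for z in zs if z < 0) / len(zs)    # E[(1+xi)^2; 1+xi < 0]

def evaluate(Kh, Kb):
    """Potentials eta(x) = Gh x^2 (x>=0) + Gb x^2 (x<0) for the policy
    u(x) = Kh*x 1_{x>=0} - Kb*x 1_{x<0}, by fixed-point iteration
    (assumes the closed-loop gains A+B*Kh and A-B*Kb are nonnegative)."""
    Ch, Cb = A + B * Kh, A - B * Kb
    Wh, Wb = Q + R * Kh**2, Q + R * Kb**2
    Gh = Gb = 0.0
    for _ in range(300):
        Gh = Wh + Ch**2 * (a_p * Gh + a_m * Gb)
        Gb = Wb + Cb**2 * (a_p * Gb + a_m * Gh)
    return Gh, Gb

Kh = Kb = 0.0
for _ in range(50):                        # policy iteration
    Gh, Gb = evaluate(Kh, Kb)
    ch, cb = a_p * Gh + a_m * Gb, a_p * Gb + a_m * Gh
    # improvement: minimize R*K^2 + (A +/- B*K)^2 * c over K >= 0
    Kh = max(0.0, -A * B * ch / (R + B * B * ch))
    Kb = max(0.0,  A * B * cb / (R + B * B * cb))
print(round(Kh, 3), round(Kb, 3))
```

With A, B > 0 and the constraint u ≥ 0, the positive-state gain is clamped to K̂ = 0 (pushing a positive state down would need u < 0), while K̄ converges to a strictly positive gain, illustrating how the conic constraint shapes the piecewise-linear optimum.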
Policy iteration can also be implemented on-line: the performance potentials can be learned along a sample path without knowing all the transition probabilities. In on-line algorithms, the computation of policy evaluation is O(n), where n is the length of the sample path. Additionally, Reference [14] provides algorithms for calculating the optimal policy.
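A minimal sketch of such model-free policy evaluation: the estimator below uses only observed states and stage costs (no system parameters appear in it), regressing the observed cost-to-go on x² separately for the two sign regions to recover Ĝ, Ḡ in (19). The simulator's parameters and the fixed policy are hypothetical.

```python
import random

# Model-free estimate of eta(x) = Gh x^2 1_{x>=0} + Gb x^2 1_{x<0}
# from simulated episodes under a fixed stabilizing policy.
A, B, Q, R, sigma = 0.7, 0.5, 1.0, 0.2, 0.25       # hypothetical system
Kh, Kb = 0.0, 0.9                                  # fixed piecewise gains
rng = random.Random(5)

num, den = {1: 0.0, -1: 0.0}, {1: 0.0, -1: 0.0}
for _ in range(2000):
    x = rng.uniform(-5.0, 5.0)
    path = []
    for _ in range(60):                  # episode long enough that x ~ 0 at the end
        u = Kh * x if x >= 0 else -Kb * x
        path.append((x, Q * x * x + R * u * u))    # (state, observed stage cost)
        x = (A * x + B * u) * (1.0 + rng.gauss(0.0, sigma))
    cg = 0.0
    for x_l, f_l in reversed(path):      # cost-to-go from each visited state
        cg += f_l
        s = 1 if x_l >= 0 else -1
        num[s] += cg * x_l * x_l         # least squares through the origin:
        den[s] += x_l ** 4               # G = sum(cg * x^2) / sum(x^4)
Gh, Gb = num[1] / den[1], num[-1] / den[-1]
print(round(Gh, 2), round(Gb, 2))
```

For these parameters the estimates land near the analytic values (about 2.09 for x ≥ 0 and 1.24 for x < 0); truncating the cost-to-go at the episode end adds negligible bias because the state has essentially decayed to zero by then.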

Simulation Examples
In this section, we use two numerical examples to illustrate the optimal policy for the constrained LQ control problem (3). For time l = 0, 1, ···, the variance of the 0-mean i.i.d. Gaussian noise ξ_l is σ² = 0.25. We consider the conic constraint u ≥ 0. By applying Theorem 1, the stationary optimal control is u*_l(x_l) = K̂*x_l 1_{x_l≥0} − K̄*x_l 1_{x_l<0}, for l = 0, 1, ···, where K̂* = (0.574, 0, 0)', K̄* = (0, 0.250, 0.270)', Ĝ* = 2.773, and Ḡ* = 3.473. Furthermore, the optimal reward performance is η*(x) = Ĝ*x² 1_{x≥0} + Ḡ*x² 1_{x<0}. Figure 1a plots the outputs Ḡ* and Ĝ* with respect to the iteration time K; Figure 1b plots the state trajectories of 50 samples obtained by setting x_0 = 10 and implementing the stationary optimal control u*. It can be observed that x*_l converges to 0 after time l = 20, and thus this closed-loop system is asymptotically stable.

Conclusions
In this paper, we apply the direct-comparison based optimization approach to study the rewards optimization of the discrete-time stochastic linear-quadratic control problem with conic constraints on an infinite horizon. We derive the performance difference formula by utilizing the state separation property of the system structure. Based on this, the optimality condition and the stationary optimal feedback control can be obtained. The direct-comparison based approach is applicable to both linear and nonlinear systems. By introducing the LQ optimization problem, we establish a general framework for studying infinite horizon control problems with total rewards. We verify that the proposed approach can solve the LQ problem, and we illustrate our results by two simulation examples.
The results can easily be extended to the cases of non-Gaussian noises and average rewards. Most significantly, our methodology can deal with a very general class of linear constraints on state and control variables, which includes cone constraints, positivity and negativity constraints, and state-dependent upper and lower bound constraints as special cases. In addition to the problem with an infinite control horizon, our results also apply to problems with a finite horizon. Moreover, since this approach does not require identifying all the system structure parameters, it can be implemented on-line, and learning-based algorithms can be developed.
Finally, this work focuses on the discrete-time stochastic LQ control problem. Our next step is to investigate continuous cases. As the constrained LQ problem has a wide range of applications, we hope to apply our approach in more areas, such as dynamic portfolio management, security optimization of cyber-physical systems, and financial derivative pricing, in our future research.
Funding: This work was supported by the National Natural Science Foundation of China under Grants 61573244 "Stochastic control optimization of uncertain systems based on offset with multiplicative noises and its applications in the financial optimization" and 61521063 "Control theory and techniques: design, control and optimization of network systems".