1. Introduction
In this paper, we study the discrete-time stochastic linear-quadratic (LQ) optimal control problem with conic control constraints and multiplicative noise on an infinite horizon. The literature contains various studies on estimation and control problems for systems with multiplicative noise [1,2]. As for LQ-type stochastic optimal control problems with multiplicative noise, investigations have focused on LQ formulations with indefinite penalty matrices on the control and state variables, for both continuous-time and discrete-time models (see, e.g., [3,4]).
In an LQ optimal control problem, the system dynamics are linear in both the state and control variables, and the cost function is quadratic in these two variables [5]. One attractive feature of LQ optimal control models is the explicit control policy, which can be derived by solving the corresponding Riccati equation. Owing to this elegant structure, the LQ problem has long been an active topic in optimal control research. Since Kalman's fundamental work on deterministic LQ problems [6], a great number of studies have followed; see [5,7,8]. In recent years, stochastic LQ problems have drawn increasing attention due to promising applications in various fields, including dynamic portfolio management, financial derivative pricing, population models, and nuclear heat transfer problems; see [9,10,11].
This paper is motivated by two recent developments: LQ optimal control and Markov decision problems (MDPs). First, the constrained LQ problem is significant in both theory and applications. Due to the constraints on the state and control variables, it is difficult to obtain an explicit control policy by solving the Riccati equation [5]. Recently, there have been studies on constrained LQ optimal control problems, such as [12,13,14]. Meanwhile, in real applications, practical limits such as risk or economic regulations force us to take constraints on the control variables into consideration. For LQ control problems with a positivity constraint on the control, some works [15,16] propose optimality conditions and numerical methods to characterize the optimal control policy. In this paper, motivated by such applications, we model these limits as conic control constraints.
Work by Cao [17] and Puterman [18] demonstrates that stochastic control problems can be viewed as Markov decision problems. Therefore, the constrained stochastic LQ control problem can be formulated as an MDP, as in [19]. A direct-comparison based approach (also called relative optimization), which originated in the area of discrete event systems, has been developed over the past years for the optimization of MDPs [17].
With this approach, optimization is based on comparing the performance measures of the system under any two policies. It is intuitively clear and can provide new insights, leading to new results for many problems, such as [20,21,22,23,24,25,26]. The approach is convenient and well suited to performance optimization problems, yielding results such as the under-selectivity property of time-nonhomogeneous Markov processes [24]. In this paper, we show that the special features of the constrained stochastic LQ optimal control problem make it solvable by the direct-comparison based approach, which leads to new insights into the problem.
In our work, we consider the stochastic LQ control problem through an MDP formulation on the infinite horizon. Using the direct-comparison based approach [17], we first derive the performance potentials for the LQ problem by utilizing the state separation property of the system structure. Based on this, we derive the optimality conditions and the stationary optimal feedback control. We show that the optimal control policy is a piecewise affine function of the state variables. In real applications, the proposed methodology can be used in many fields, such as system risk contagion [26] and power grid systems [27].
Our work provides a new perspective on LQ control problems. Compared with the existing literature, such as [5,13], we retain the multiplicative noises and establish a general framework for studying infinite horizon stochastic control problems. Since the direct-comparison based approach is applicable to both linear and nonlinear systems, we obtain further results for performance optimization problems, and these results can be extended easily. In addition, the approach can be implemented online without identifying all the system parameters, and learning-based algorithms can be developed.
The paper is organized as follows. Section 2 introduces an MDP formulation of the constrained stochastic LQ problem with multiplicative noises; some preliminary knowledge on MDPs and the state separation property is also provided. In Section 3, we derive the performance difference formula, which is the foundation of the performance optimization; based on it, the Poisson equation and Dynkin's formula can be obtained, and we then derive the optimality condition and the optimal policy through the performance difference formula. In Section 4, we illustrate the results with numerical examples. Finally, we conclude the paper in Section 5.
2. Problem Formulation
2.1. Problem with Infinite Time Horizon
In this section, we study the infinite horizon discrete-time stochastic LQ optimal control problem with conic control constraints; see [5,14]. For simplicity of the parameters, we consider a one-dimensional dynamic system with a multiplicative noise, described by the linear dynamics (1) for time $l = 0, 1, 2, \ldots$. Denoting by $\mathbb{R}$ ($\mathbb{R}_+$) the set of real (nonnegative real) numbers, the system coefficients in (1) are deterministic; the state is one-dimensional, with the initial state given; and the control is a feedback control law at time $l$. For each $l$, the multiplicative noise is independent and identically distributed, one-dimensional, and normally distributed with mean 0 and a given variance; the noise terms entering the system at time $l$ are one-dimensional and mutually independent for every $l$.
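To make the setup concrete, the following is a minimal simulation sketch of a one-dimensional system with multiplicative noise. The specific dynamics $x_{l+1} = (A + C w_l)x_l + (B + D w_l)u_l$, the coefficient values, and the variable names are illustrative assumptions only, not the exact form of system (1).

```python
import numpy as np

# Assumed illustrative dynamics: x_{l+1} = (A + C*w_l) * x_l + (B + D*w_l) * u_l,
# where w_l ~ N(0, sigma^2) is the i.i.d. multiplicative noise.
# All coefficient values are placeholders, not parameters from the paper.
A, B, C, D, sigma = 0.9, 0.5, 0.2, 0.1, 0.3

def step(x, u, rng):
    """Advance the assumed one-dimensional dynamics by one time step."""
    w = rng.normal(0.0, sigma)
    return (A + C * w) * x + (B + D * w) * u

rng = np.random.default_rng(0)
x = 1.0  # given initial state
for l in range(10):
    u = -0.5 * x  # an arbitrary feedback law, for illustration only
    x = step(x, u, rng)
    print(l, x)
```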
Now, we consider the conic control constraint sets (cf. [5]) defined in (2) for $l = 0, 1, 2, \ldots$, where the constraint matrix is deterministic and the filtration represents the information available at time $l$. The underlying set is a given closed cone; i.e., the sum of any two of its elements remains in the cone, and any nonnegative scalar multiple of an element remains in the cone.
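As an illustration of such a constraint set, the sketch below checks membership in a simple polyhedral cone of the form $\{u : Mu \ge 0\}$; the matrix $M$, the dimensions, and the helper name are hypothetical and not taken from (2).

```python
import numpy as np

def in_cone(u, M, tol=1e-9):
    """Check membership in the polyhedral cone {u : M @ u >= 0} (illustrative form only)."""
    return bool(np.all(M @ np.atleast_1d(u) >= -tol))

# Hypothetical one-dimensional example: the cone of nonnegative controls, u >= 0.
M = np.array([[1.0]])
print(in_cone(0.7, M))   # True
print(in_cone(-0.3, M))  # False

# Closed-cone properties: sums and nonnegative scalings of members stay in the cone.
u1, u2 = 0.7, 1.2
assert in_cone(u1 + u2, M) and in_cone(3.0 * u1, M)
```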
The goal of the optimization is to minimize the total reward performance measure in the quadratic form (3), where the state and control weighting coefficients are deterministic. Here we denote the transpose operation by a prime in the superscript, and the control sequence collects the controls over the whole horizon. We also assume that the total reward in (3) exists. The performance function associated with (3) is then given by (4).
In this paper, we will show that the direct-comparison based approach leads to new results for the total reward problem [7], and that these results can be easily extended.
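As a rough illustration of this kind of performance measure, the following sketch estimates a truncated total quadratic cost of the form $E[\sum_l (Q x_l^2 + R u_l^2)]$ by Monte Carlo under the placeholder dynamics introduced above; the quadratic form, the weights $Q$ and $R$, and the feedback gain are illustrative assumptions, not (3) itself.

```python
import numpy as np

# Placeholder dynamics and weights (not parameters from the paper).
A, B, C, D, sigma = 0.9, 0.5, 0.2, 0.1, 0.3
Q, R = 1.0, 0.5

def total_cost(x0, gain, horizon=100, n_paths=1000, seed=0):
    """Monte Carlo estimate of a truncated total quadratic cost under u_l = -gain * x_l."""
    rng = np.random.default_rng(seed)
    costs = np.zeros(n_paths)
    for i in range(n_paths):
        x = x0
        for _ in range(horizon):
            u = -gain * x
            costs[i] += Q * x**2 + R * u**2
            w = rng.normal(0.0, sigma)
            x = (A + C * w) * x + (B + D * w) * u
    return costs.mean()

print(total_cost(x0=1.0, gain=0.8))
```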
2.2. MDPs with Continuous State Spaces
For a stationary control law, the constraint (2) at time $l$ can be written as in (5). The above stochastic control problem can then be viewed as an MDP with a continuous state space. More precisely, the control plays a role similar to that of an action in an MDP, and the control law corresponds to a policy.
Consider a discrete-time Markov chain with a continuous state space on $\mathbb{R}$. Its transition probability can be described by a transition operator $P$ acting on functions as $(Ph)(x) = \int_{\mathbb{R}} P(x, \mathrm{d}y)\, h(y)$, where $P(x, \cdot)$ is the transition probability function and $h$ is any measurable function on $\mathbb{R}$. Since the noise is an independent Gaussian sequence, given the current state, the next state under the stationary control follows a normal distribution whose mean and variance are determined by the current state and control. We thus obtain the transition function of this system in the form given in (6).
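For intuition, the sketch below applies a transition operator of this Gaussian type to a test function by Monte Carlo. The particular mean and variance functions used here are placeholders standing in for the closed-loop quantities, not the actual expressions in (6).

```python
import numpy as np

# Illustrative transition: given state x, the next state is N(m(x), v(x)^2).
# The mean/variance functions below are placeholders, not Equation (6).
def m(x):
    return 0.8 * x

def v(x):
    return np.sqrt(0.09 * x**2 + 0.01)

def apply_P(h, x, n_samples=100_000, seed=0):
    """Estimate (P h)(x) = E[h(next state) | current state x] by Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    y = rng.normal(m(x), v(x), size=n_samples)
    return h(y).mean()

h = lambda y: y**2          # a test function
print(apply_P(h, x=1.0))    # approximately m(1)^2 + v(1)^2 = 0.74
```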
Let $\mathcal{B}$ be the $\sigma$-field on $\mathbb{R}$ containing all the (Lebesgue) measurable sets. For any set $B \in \mathcal{B}$, we can define the identity transition function $I$ by $I(x, B) = 1$ if $x \in B$, and $I(x, B) = 0$ otherwise. For any function $h$ and any $x \in \mathbb{R}$, we have $(Ih)(x) = h(x)$.
The product of two transition functions $P_1$ and $P_2$ is defined as the transition function $P_1 P_2$ with $(P_1 P_2)(x, B) = \int_{\mathbb{R}} P_1(x, \mathrm{d}y)\, P_2(y, B)$, where $x \in \mathbb{R}$ and $B \in \mathcal{B}$.
For any transition function $P$, we can define its $k$th power, $P^k$, recursively by $P^k = P\,P^{k-1}$, with $P^0 = I$. Suppose that the Markov chain is time-homogeneous with transition function $P$. Then the $k$-step transition probability functions are generated by the 1-step transition function, and for any measurable function $h$, $(P^k h)(x)$ equals the conditional expectation of $h$ at the state reached $k$ steps ahead, given the current state $x$. That is, as an operator, the $k$-step transition function coincides with $P^k$; this can be proved recursively.
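The following sketch illustrates the operator powers numerically: $(P^k h)(x)$ is estimated by simulating $k$ steps of the same placeholder Gaussian transition used above and averaging $h$ over the sampled endpoints. All numerical values are assumptions for illustration.

```python
import numpy as np

# Same illustrative Gaussian transition as before: next state ~ N(0.8*x, 0.09*x^2 + 0.01).
def sample_next(x, rng):
    return rng.normal(0.8 * x, np.sqrt(0.09 * x**2 + 0.01))

def apply_P_power(h, x, k, n_paths=50_000, seed=0):
    """Estimate (P^k h)(x) = E[h(state k steps ahead) | current state x] by simulation."""
    rng = np.random.default_rng(seed)
    y = np.full(n_paths, float(x))
    for _ in range(k):
        y = sample_next(y, rng)   # vectorized: loc and scale broadcast over the sample array
    return h(y).mean()

h = lambda y: y**2
print(apply_P_power(h, x=1.0, k=1))   # one-step value, matches (P h)(x)
print(apply_P_power(h, x=1.0, k=5))   # five-step value via P^5 = P P^4
```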
Suppose that a Markov chain with a continuous state space on $\mathbb{R}$ has a steady-state distribution that is invariant under its transition function, and define the associated performance function for all states. We denote by $g$ the performance potential, a function which satisfies the Poisson Equation (7) (cf. [17]), in which $I$ and $P$ are the two transition functions defined above. If $g$ is a solution to (7), then so is $g$ shifted by any constant $c$. We define the potential through a limit of partial sums of the $k$-step performance terms and assume that this limit exists for every state. Then we have the following lemma.

Lemma 1 (Solution to Poisson Equations [17]). For any transition function $P$ and performance function, if the corresponding limit conditions hold for every state, then the resulting function is a solution to the Poisson Equation (7).
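To give numerical intuition for potentials of this kind, the sketch below estimates a performance potential as a truncated sum of $k$-step expected performance values minus a steady-state level. The transition, the quadratic performance function, and the steady-state value are the illustrative placeholders from the previous sketches; the exact form of the potential and of Equation (7) in the paper may differ.

```python
import numpy as np

# Placeholder transition and performance function (not the paper's model).
def sample_next(x, rng):
    return rng.normal(0.8 * x, np.sqrt(0.09 * x**2 + 0.01))

def f(x):
    return x**2  # assumed quadratic performance function

def potential(x, eta, K=200, n_paths=20_000, seed=0):
    """Truncated-sum estimate g(x) ~= sum_{k=0}^{K} ( E[f(X_k) | X_0 = x] - eta )."""
    rng = np.random.default_rng(seed)
    y = np.full(n_paths, float(x))
    total = f(y).mean() - eta
    for _ in range(K):
        y = sample_next(y, rng)
        total += f(y).mean() - eta
    return total

# Steady-state mean of f under this transition: solves s = 0.73*s + 0.01.
eta = 0.01 / 0.27
print(potential(1.0, eta))
```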
2.3. State Separation Property

In order to derive an explicit solution of the stochastic LQ control problem with conic constraints, Reference [14] gives the following lemma on the state separation property of the LQ problem.

Lemma 2 (State Separation [14]). In the system (1), for any time $l$, the optimal solution of problem (3) at time $l$ is the piecewise linear feedback policy (10), where the two feedback gains associated with the control constraint sets are the optimal values of two corresponding auxiliary optimization problems, and the superscript "*" denotes the optimal control.

Based on (10) in Lemma 2, the stationary control can be written with indicator functions, where an indicator function equals 1 if its condition holds true and 0 otherwise, so that different feedback gains apply depending on the sign of the state. Applying this control, the system dynamics (1) become a closed-loop piecewise linear system for $l = 0, 1, 2, \ldots$, with closed-loop coefficients defined accordingly. Moreover, the performance measure (3) and the performance function (4) become the corresponding closed-loop quadratic forms with deterministic weighting coefficients. It is easy to verify that these weighting coefficients are positive semi-definite. We assume that this one-dimensional state system is stable, so that the spectral radii of the closed-loop coefficients are less than 1. In the next section, we will derive the performance potentials for the LQ problem, which are the foundation of the performance optimization. Based on them, the Poisson equation and Dynkin's formula can be derived. The direct-comparison based approach provides a new perspective on this problem, and the results can be extended easily.
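As an illustration of a feedback law of this piecewise linear form, the sketch below applies different gains on the two half-lines of the state and simulates the resulting closed loop; the gains and the dynamics are the same placeholders used earlier, not the optimal values from Lemma 2.

```python
import numpy as np

# Placeholder dynamics and hypothetical gains for the two half-lines of the state.
A, B, C, D, sigma = 0.9, 0.5, 0.2, 0.1, 0.3
k_pos, k_neg = 0.8, 0.6

def control(x):
    """Piecewise linear feedback: u = -k_pos * x if x >= 0, and u = -k_neg * x otherwise."""
    return -(k_pos if x >= 0 else k_neg) * x

def closed_loop_trajectory(x0, horizon=100, seed=0):
    rng = np.random.default_rng(seed)
    xs = [x0]
    for _ in range(horizon):
        x = xs[-1]
        u = control(x)
        w = rng.normal(0.0, sigma)
        xs.append((A + C * w) * x + (B + D * w) * u)
    return np.array(xs)

print(closed_loop_trajectory(1.0)[:10])
```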
4. Simulation Examples
In this section, we use two numerical examples to illustrate the optimal policy for the constrained LQ control problem (3).
Example 1. We consider a stochastic LQ system with given deterministic coefficients and cost matrix. For each time $l$, the variance of the 0-mean i.i.d. Gaussian noise is fixed, and we impose a conic constraint on the control. By applying Theorem 1, the stationary optimal control is a piecewise linear feedback with two gains, one for each half-line of the state, and the optimal reward performance is obtained. Figure 1a plots the outputs with respect to the iteration time $K$; Figure 1b plots the state trajectories of 50 samples obtained by fixing the initial state and implementing the stationary optimal control. It can be observed that the state converges to 0 and that this closed-loop system is asymptotically stable.
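A sketch of how Figure 1b-style sample trajectories could be reproduced is given below, using the placeholder dynamics and hypothetical gains from the earlier sketches rather than the actual parameters and optimal gains of Example 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder closed-loop system with piecewise gains (not Example 1's parameters).
A, B, C, D, sigma = 0.9, 0.5, 0.2, 0.1, 0.3
k_pos, k_neg = 0.8, 0.6

def simulate(x0, horizon, rng):
    xs = [x0]
    for _ in range(horizon):
        x = xs[-1]
        u = -(k_pos if x >= 0 else k_neg) * x
        w = rng.normal(0.0, sigma)
        xs.append((A + C * w) * x + (B + D * w) * u)
    return xs

rng = np.random.default_rng(1)
for _ in range(50):                      # 50 sample state trajectories
    plt.plot(simulate(1.0, 60, rng), linewidth=0.7)
plt.xlabel("time step l")
plt.ylabel("state")
plt.title("Sample closed-loop trajectories (illustrative parameters)")
plt.show()
```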
Example 2. In the second case, we assume that the system coefficients, including B, follow an identical discrete distribution with five cases, each of which has the same probability 0.2, and we specify the cost matrix accordingly. For each time $l$, the variance of the 0-mean i.i.d. Gaussian noise is fixed, and we impose a conic constraint on the control. By applying Theorem 1, the stationary optimal control is again a piecewise linear feedback with two gains, and the optimal reward performance is obtained. Figure 2a plots the outputs with respect to the iteration time $K$; Figure 2b plots the state trajectories of 50 samples obtained by fixing the initial state and implementing the stationary optimal control. It can be observed that the state converges to 0 and that this closed-loop system is asymptotically stable.
5. Conclusions
In this paper, we apply the direct-comparison based optimization approach to study the reward optimization of the discrete-time stochastic linear-quadratic control problem with conic constraints on an infinite horizon. We derive the performance difference formula by utilizing the state separation property of the system structure. Based on this, the optimality condition and the stationary optimal feedback control can be obtained. The direct-comparison based approach is applicable to both linear and nonlinear systems; by treating the LQ optimization problem within it, we establish a general framework for studying infinite horizon control problems with total rewards. We verify that the proposed approach indeed solves the constrained LQ problem, and we illustrate our results with two simulation examples.
The results can easily be extended to the cases of non-Gaussian noises and average rewards. Most significantly, our methodology can deal with a very general class of linear constraints on the state and control variables, which includes cone constraints, positivity and negativity constraints, and state-dependent upper and lower bound constraints as special cases. In addition to the infinite control horizon, our results also apply to problems with a finite horizon. Moreover, without identifying all the system structure parameters, the approach can be implemented online, and learning-based algorithms can be developed.
Finally, this work focuses on the discrete-time stochastic LQ control problem; our next step is to investigate the continuous-time case. As the constrained LQ problem has a wide range of applications, we hope to apply our approach to more areas, such as dynamic portfolio management, security optimization of cyber-physical systems, and financial derivative pricing, in our future research.