Abstract
We consider an optimal control problem with discounted and average payoff. The reward rate (or cost rate) may be unbounded from above and from below, and the state variable evolves according to a stochastic differential equation with Markovian switching. The switching is driven by a hidden continuous-time Markov chain that can only be observed in Gaussian white noise. Our general aim is to give conditions for the existence of optimal stationary Markov controls. This generalizes the conditions that ensure the existence of optimal control policies for completely observed optimal control problems. We use standard dynamic programming techniques and hidden Markov model filtering to achieve our goals. As applications of our results, we study the discounted linear quadratic regulator (LQR) problem, the ergodic LQR problem for a quarter-car suspension model, the average LQR problem for a quarter-car suspension model with a damper, and an explicit application to optimal pollution control.
MSC:
49N05; 49N10; 49N30; 49N90; 93C41
1. Introduction
In recent years, increasing attention has been paid to a class of optimal control problems in which the dynamic systems are governed by switching diffusions whose switching is modeled by a continuous-time Markov chain () with unobservable hidden states (also known as partially observed optimal control problems). In these problems, one assumes an observable process y whose outcomes are “influenced” by the outcomes of the hidden chain in a known way. Since the chain cannot be observed directly, the goal is to learn about it by observing y. In line with the above, this article is concerned with an optimal control problem with discounted and ergodic payoff in which the dynamic system evolves according to a Markovian regime-switching diffusion for given continuous functions f and . The reward rate is allowed to be unbounded from above and from below. In this paper, the Wonham filter is used to estimate the states of the Markov chain from the observable evolution of a given process (y). As a result, the original system is converted into a completely observable one.
Our main results extend the dynamic programming technique to this family of stochastic optimal control problems with a reward (or cost) rate per unit of time that may be unbounded and with Markovian regime-switching diffusions. The regime switching is modeled by a continuous-time Markov chain () with unobservable states. Early works include research on an optimal control problem with an ergodic payoff in which the dynamic system evolves according to Markovian switching diffusions; however, that diffusion does not depend on a hidden Markov chain [1]. Research has also been proposed on deriving the dynamic programming principle for a partially observed optimal control problem in which the dynamic system is governed by a discrete-time Markov control process taking values in a finite-dimensional space [2]. Finally, one paper studied optimal control with completely observable Markovian switching and an unbounded reward rate [3]. As applications of our results, we study the discounted linear quadratic regulator (LQR) problem, the ergodic LQR problem for the quarter-car suspension model, the average (ergodic) LQR problem for the quarter-car suspension model with a damper, and an explicit application to optimal pollution control. Other applications with bounded payoffs, different from those studied in this work, can be found in [4,5,6].
The objective of the theory of controlled regime-switching diffusions is to model controlled diffusion systems whose dynamics are affected by discrete phenomena. In these systems, the discrete phenomena are modeled by a continuous-time Markov chain whose states represent the discrete phenomenon involved. There is an extensive list of references dealing with the completely observable stochastic optimal control case, in which the stochastic systems are governed by a switching diffusion. A literature review includes the textbooks [7,8] and the papers [9,10,11,12,13,14], with several applications, including portfolio optimization, wireless communication systems, and wind turbines, among others.
Generally, to solve partially observed optimal control problems, in which the dynamic systems are governed by a hidden Markovian switching diffusion, it is necessary to transform them into completely observed ones; in our case, this is done using a Wonham filter.
This Wonham filter estimates the hidden state of the Markov chain from the observable evolution of the process y. When these estimates are substituted into the original system, it becomes a completely observable system [15,16] and ([17], Section 22.3). Numerical results for Wonham’s filter are given in [18].
The paper is organized as follows: in Section 1, an introduction is given. In Section 2, the main assumptions are given, and the partially observable system is converted into an observable one. The conditions that ensure the existence of optimal solutions for the optimal control problem with discounted payoff are given in Section 3. In Section 4, the conditions that ensure the existence of optimal solutions for the optimal control problem with average payoff are deduced. To illustrate our results, four applications are developed: an application to a linear quadratic regulator (LQR) with discounted payoff (Section 5); an LQR model of a quarter-car suspension with average payoff (Section 6); the study of the optimal control of a vehicle active suspension system with a damper (Section 7); and an explicit application to optimal pollution control (Section 8).
2. Formulation of the Problem
This work focuses on controlled hybrid stochastic differential equations (HSDEs) under partial observation. To explain this, we first consider stochastic differential equations of the form:
where the coefficients b and σ in (1) depend on a finite-state, continuous-time, irreducible and aperiodic Markov chain taking values in . For all , the transition probabilities are given by:
where the constants are the transition rates from i to j and satisfy ; the transition matrix is denoted by . The control component is , with a compact subset of , and W is a d-dimensional standard Brownian motion independent of the Markov chain. Throughout this work, we consider that both the Markov chain and the Brownian motion W are defined on a complete filtered probability space satisfying the usual conditions.
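For intuition about the role of the generator, the following minimal sketch (the two-state generator and the elapsed time are illustrative choices, not data from this paper) computes the matrix of transition probabilities over an interval of length t as the matrix exponential of tQ.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative two-state generator: off-diagonal entries are the transition rates
# q_ij >= 0 (i != j), and each row sums to zero.
Q = np.array([[-0.5,  0.5],
              [ 0.3, -0.3]])

t = 2.0                 # elapsed time (arbitrary choice for the example)
P_t = expm(Q * t)       # P_t[i, j] = P(chain is in state j at time s+t | in state i at s)

print(P_t)              # each row of P_t is a probability vector (sums to 1)
```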
Until now, the switching diffusion (1) appears to be formulated as a classical switching diffusion, as in [11,12,13,14,19], among others. However, we assume that the process is a hidden Markov chain, i.e., at any given time instant, the exact state of the Markov chain cannot be observed directly. Instead, we can only observe the process y given by:
whose dynamics depend on the value of . In Equation (2), is a bounded function, whereas B is a one-dimensional Brownian motion independent of W and of the Markov chain, and is a positive constant.
Under partial observation, the natural framework is nonlinear filtering. This technique studies the conditional distribution of the hidden state given the observed data accumulated up to time t, namely:
where is the -algebra generated by the process and . Taking into account the following notation:
and using Wonham filtering techniques, we know that the process in (3) satisfies the following equation (see, for instance, [15] or ([17], Section 22.3)):
where is the identity matrix. If we introduce the process:
then Equation (4) can be rewritten as:
Remark 1.
Note that the unique solution of (5) exists up to an explosion time τ (see, for instance, [20]). However, τ = ∞ a.s., since for all and .
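To illustrate how the filter Equation (4) can be propagated in practice, the following sketch implements one Euler-type step of a Wonham filter for a hidden chain observed as in (2), together with a short simulation; the generator Q, the observation levels g, the noise intensity sigma, and the step size dt are illustrative assumptions, and the filter vector is clipped and renormalized for numerical stability (more careful schemes are discussed in [18]).

```python
import numpy as np

def wonham_filter_step(pi, dy, Q, g, sigma, dt):
    """One Euler step of the Wonham filter for pi_i(t) = P(theta_t = i | y_s, s <= t),
    using the standard form d pi_i = (Q^T pi)_i dt + sigma^{-2} pi_i (g_i - g_bar)(dy - g_bar dt)."""
    g_bar = float(g @ pi)                              # filtered estimate of g(theta_t)
    innovation = dy - g_bar * dt                       # observation innovation
    pi_new = pi + (Q.T @ pi) * dt + pi * (g - g_bar) / sigma**2 * innovation
    pi_new = np.clip(pi_new, 1e-12, None)              # keep the vector a probability
    return pi_new / pi_new.sum()

# Illustrative data: a two-state hidden chain observed through y with levels g.
rng = np.random.default_rng(1)
Q = np.array([[-0.5, 0.5], [0.3, -0.3]])               # generator of the hidden chain
g = np.array([1.0, -1.0])                              # observation drift per regime
sigma, dt, n_steps = 0.5, 1e-3, 5000

theta, pi = 0, np.array([0.5, 0.5])                    # true state and its filter
for _ in range(n_steps):
    if rng.random() < -Q[theta, theta] * dt:           # simulate a jump of the hidden chain
        theta = 1 - theta
    dy = g[theta] * dt + sigma * np.sqrt(dt) * rng.normal()   # observation increment
    pi = wonham_filter_step(pi, dy, Q, g, sigma, dt)

print("filter:", pi, "true state:", theta)
```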
At this point, we have defined the controlled HSDE with partial observation. To fulfill the objective of this work, that is, to solve an optimal control problem with discounted and average payoff under partial observation, we will transform this problem into one with complete observation (see, for instance, [5,6,16]). First, we establish the following notational convention.
For the coefficients and
we have their filtered estimates:
and with equalities (6)–(7), we establish the new coefficients:
With the use of the above functions and Equation (1), we introduce the components of a new diffusion process as:
and therefore, we obtain from (5) and (8) the following controlled system with complete observation:
where with:
Throughout this work, we will use the following Assumption 1.
Assumption 1.
- (a)
- The control set is compact.
- (b)
- is a continuous function that satisfies the Lipschitz property in x uniformly in , that is, there exists a constant such that:
- (c)
- There exist constants such that satisfies: for all and for all .
- (d)
- There exists with: for and .
Under Assumption 1 and taking into account Remark 1, we know that the system (9) has a unique solution.
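To make the construction of the filtered coefficients in (6)–(8) and the observable system (9) concrete, the following sketch replaces the regime-dependent drift and diffusion by their conditional expectations under the filter, i.e., by averages of the per-regime coefficients weighted by the components of the filter vector; the per-regime functions b(x, i, u) and sigma(x, i, u) and their numerical values are illustrative assumptions, and averaging the diffusion coefficient in this way is one common convention rather than the exact form of (8).

```python
import numpy as np

def filtered_drift(x, pi, u, b):
    """b_bar(x, pi, u) = sum_i pi_i b(x, i, u): drift of the observable system."""
    return sum(pi[i] * b(x, i, u) for i in range(len(pi)))

def filtered_diffusion(x, pi, u, sigma):
    """sigma_bar(x, pi, u) = sum_i pi_i sigma(x, i, u); one common convention,
    not necessarily the exact expression used in (8)."""
    return sum(pi[i] * sigma(x, i, u) for i in range(len(pi)))

# Purely illustrative per-regime coefficients of a scalar diffusion.
b = lambda x, i, u: (-1.0 if i == 0 else -2.0) * x + u
sigma = lambda x, i, u: 0.2 + 0.1 * i

x, pi, u = 1.0, np.array([0.7, 0.3]), 0.5
print(filtered_drift(x, pi, u, b), filtered_diffusion(x, pi, u, sigma))
```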
For , we denote by and the gradient and the Hessian matrix with respect to x, respectively, and by the scalar product. For a sufficiently smooth real-valued function , let:
with
the operator associated with Equation (9). In order to carry out the aim of this work, we define the control policies.
Definition 1.
A function of the form , for some measurable function , is called a Markov policy, whereas , for some measurable function , is said to be a stationary Markov policy. The set of stationary Markov policies is denoted by .
The following assumption represents a Lyapunov-like condition.
Assumption 2.
There exist a function , and constants , such that:
- (i)
- and
- (ii)
- for each and .
It is important to point out that, since the Markov chain is irreducible and aperiodic, we can ensure the existence of a unique invariant measure for the Markov–Feller process (see [21,22]). Moreover, Assumption 2 allows us to conclude that the Markov process , where , is positive recurrent and that there exists a unique invariant probability measure for which the following is satisfied:
Note that for every , the measure belongs to the space defined as follows.
Definition 2.
The w-norm is defined as:
where ν is a real-valued measurable function on and w is the Lyapunov function given in Assumption 2. The normed linear space of real-valued measurable functions ν with finite w-norm is denoted by . Moreover, the normed linear space of finite signed measures μ on such that:
where is the total variation of μ, is denoted by .
Remark 2.
For each and , we get:
that is, the integral is finite.
The next result will be useful later.
Lemma 1.
The condition in Assumption 2 implies that:
- (a)
- ;
- (b)
- for all , , and ;
- (c)
- for all.
Proof.
(a) After applying Dynkin’s formula to the function , we use case of Assumption 2 to get:
Finally, if we multiply inequality (12) by , we obtain the result. To prove , it is enough to take the limit in inequality (12). Integrating both sides of (12) with respect to the invariant probability , we obtain , i.e., ; thus, the result follows. □
In this work, the reward rate is a measurable function that satisfies the following conditions:
Assumption 3.
- (a)
- The function is continuous on ; moreover, for each , there exists a constant such that: i.e., r is locally Lipschitz in x uniformly with respect to and .
- (b)
- is in the normed linear space of real-valued functions uniformly in u; that is, there exists such that for all :
Notation. The reward rate in vector form is given by:
and its estimation is:
Henceforth, for each stationary Markov policy , we write:
3. The Discounted Case
The objective of this section is to give conditions that guarantee the existence of discounted optimal policies for the -discounted payoff criterion we are concerned with.
Definition 3.
Let r be as in Assumption 3 and α a positive constant. Given a stationary Markov policy and an initial state , the total expected discounted payoff (or discounted payoff, for short) is defined as:
Observe that the value function does not depend on the initial time at which the optimal control problem is posed, which yields the stationarity of the problem.
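For intuition, the discounted payoff of Definition 3 can be approximated by Monte Carlo simulation of the observable system (9) under a fixed stationary Markov policy. The following sketch truncates the infinite-horizon integral at a finite horizon T and assumes a one-dimensional state together with user-supplied callables for the policy, the filtered drift and diffusion, the filter update, and the reward rate; all names and default values are illustrative.

```python
import numpy as np

def discounted_payoff_mc(x0, pi0, policy, b_bar, sigma_bar, filter_update, r,
                         alpha=0.1, T=20.0, dt=1e-2, n_paths=100, seed=0):
    """Monte Carlo estimate of E[ int_0^T e^{-alpha t} r(x_t, pi_t, f(x_t, pi_t)) dt ],
    a finite-horizon proxy for the alpha-discounted payoff of Definition 3."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_paths):
        x, pi, J = float(x0), np.array(pi0, dtype=float), 0.0
        for k in range(int(T / dt)):
            t = k * dt
            u = policy(x, pi)                         # stationary Markov policy f(x, pi)
            J += np.exp(-alpha * t) * r(x, pi, u) * dt
            dW = rng.normal(scale=np.sqrt(dt))        # Brownian increment
            x += b_bar(x, pi, u) * dt + sigma_bar(x, pi, u) * dW
            pi = filter_update(pi, x, dt, rng)        # e.g., one Wonham-filter step
        estimates.append(J)
    return float(np.mean(estimates))
```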
The following result gives a bound on the total expected discounted payoff defined in Definition 3. We omit its proof because it is a direct consequence of Assumption 3 and the inequality in Lemma 1a.
Proposition 1.
Suppose that Assumptions 2 and 3b hold. Then, for each x in , and , we have:
implying that the α-discounted payoff belongs to the space . Here, q and p are as in Assumption 2, and M is the constant in Assumption 3b.
-discounted optimal problem. The optimal control problem with discounted payoff consists of finding a policy such that:
The function is referred to as the optimal discounted payoff, whereas the policy is called discounted optimal.
Definition 4.
We say that a function and a policy verify (are a solution of) the -discounted payoff optimality equation (or Hamilton–Jacobi–Bellman (HJB) equation) if, for every and :
Proposition 2.
If Assumptions 1, 2, and 3 hold, then:
Proof.
- (a)
- (b)
- By Dynkin’s formula, for all , and : Observe that: This yields: Now, as a consequence of v being in and of Lemma 1(a),(b), we have that: Therefore: Thus, . In particular, if we take satisfying (14) and proceed as above, we get:
- (c)
- The if part. Suppose that satisfies Equations (14) and (15). Then, proceeding as in part (b), we obtain that is an optimal policy. The only if part. Mimicking the procedure of part (b), we can obtain that, for any fixed : On the other hand, by part (b), we can assert that:
□
4. Average Optimality Criteria
As in (10), let for every .
Assumption 4.
Next, we define the long-run average optimality criterion.
Definition 5.
For each, , and, let:
The long-run expected average reward given the initial state is:
The function:
is referred to as the optimal gain or the optimal average reward. If there is a policy for which for all , then is called average optimal.
Remark 4.
In some optimal control problems, the limit of as might not exist. To avoid this difficulty, the average payoff is defined as a liminf, as in (21), which can be interpreted as the worst average payoff, and it is this quantity that is to be maximized.
For each , let:
with as in (10). Now, observe that defined in (20) can be expressed as:
therefore, multiplying (23) by and letting we obtain, by (19):
Therefore, by Lemma 1c:
thus, the reward is uniformly bounded on . From (24) and (25) we obtain that the following:
has a finite value.
Thus, under Assumptions 1, 2, and 4, it follows from (19) (w-exponential ergodicity) and (22) that the long-run expected average reward (21) coincides with the constant for every . Indeed, note that defined in (20) can be expressed as:
Definition 6.
(a) A pair consisting of a constant and a function is said to be a solution of the average reward HJB-equation if:
then f is called a canonical policy.
The following theorem shows that if a policy satisfies the average reward HJB-equation, then it is an optimal average policy.
Theorem 1.
If Assumptions 1, 2, and 3 hold, then:
Proof.
The steps of the proof of this part are essentially the same as those given in the proof of Theorem 6.4 in [24]; thus, we omit them.
Since and are continuous functions on the compact set , we obtain that is a continuous function on ; thus, the existence of a canonical policy follows from standard measurable selection theorems; see [25] (Theorem 12.2).
Observe that, by (27):
Therefore, for any , using Dynkin’s formula and (29) we obtain:
Thus, multiplying by in (30) we have:
To obtain the reverse inequality, similar arguments show that if:
then for all . This last inequality, together with (29), yields that if is a canonical policy satisfying (28), then , and by (26):
Arguments similar to those given in lead us to conclude that if is a canonical policy, then it is average optimal. □
Theorem 1 indicates that if a policy satisfies the HJB Equation (27), then this policy is optimal for the optimal control problem associated with the HJB equation. The difficulty with this approach is how to obtain a solution of the HJB equation. The most common way to solve the HJB equation is based on variants of the vanishing discount approach (see [11,24,26] for details).
Remark 5
([1]). In the optimality criteria known as bias optimality, overtaking optimality, sensitive discount optimality, and Blackwell optimality, the early returns and the asymptotic returns are both relevant; thus, to study them, we need first to analyze the discounted and average optimality criteria. These optimality criteria will be studied in future work.
Remark 6.
- On Assumption 1, ([7], Theorems 3.17 and 3.18). The uniform Lipschitz and linear growth conditions on b and σ ensure the existence and uniqueness of the global solution of the SDE with Markovian switching (1). The uniform Lipschitz condition implies that the rates of change of the functions b and σ are less than or equal to the rate of change of a linear function of x. This gives, in particular, the continuity of b and σ in x for all . Thus, the uniform Lipschitz condition excludes functions b and σ that are discontinuous with respect to x. It is important to note that continuity of a function does not guarantee that it satisfies the uniform Lipschitz condition; for example, the continuous function does not satisfy this condition. The uniform Lipschitz condition can be replaced by a local Lipschitz condition. In fact, the local Lipschitz condition allows us to include a great variety of functions, such as functions. However, the linear growth condition (Assumption 1(d)) also excludes some important functions, such as . Assumption 1(d) is quite standard but may be restrictive for some applications. As far as the results of this paper are concerned, the uniform Lipschitz condition may be replaced by the weaker condition: where K is a positive constant. This last condition allows us to include many functions as the coefficients b and σ. For example: with such that and for some given continuous function . It is possible to check that a diffusion process with the parameters given above satisfies the local Lipschitz condition but not the linear growth condition. On the other hand, note that: with and a compact control set U; that is, condition (33) is fulfilled. Thus, ([7], Theorem 3.18) guarantees that the SDE with Markovian switching with these coefficients has a unique global solution on .
- On Assumption 2, ([7], Theorem 5.2). This assumption guarantees the positive recurrence and the existence of an invariant measure for the Markov–Feller process . Moreover, if this assumption holds together with the inequality for positive numbers , then the diffusion process (1) satisfies: that is, is asymptotically bounded in th moment. Some Lyapunov functions are, for example: considering that the coefficients b and σ in (1) satisfy the Lipschitz condition and: with and constants. In fact, using the inequality and (35), we get: where . If we set: then . Now, taking the Lyapunov function (34), we define: Considering that , , and , a similar procedure to that given in (37) allows us to obtain that W is also a Lyapunov function. That is:
- On Assumption 3. This assumption allows the reward rate (or cost rate) to be unbounded from above and below. For the Lyapunov function , a reward rate of the form: for some continuous function satisfies Assumption 3. In fact: with and U a compact set.
- On Assumption 4. This assumption concerns the asymptotic behavior of as t goes to infinity. Sufficient conditions for the w-exponential ergodicity of the process can be found in ([1], Theorem 2.8). In fact, in the proof of that theorem, Assumptions 1 and 2 are required. Note that, for the optimal control problem with the discounted optimality criterion, the w-exponential ergodicity of the process is not required; this assumption is only necessary to study the average reward optimality criterion.
Remark 7.
In the following sections, our theoretical results are implemented in three applications. The dynamic system in these three applications evolves according to linear stochastic differential equations, so Assumption 1 holds. The number of states of the Markov chain is 2, that is, . The payoff rate is of the form with and , . Taking , we get:
with ; thus, Assumption 3 also holds. A few calculations allow us to verify Assumption 2 with . In fact:
Let , and rewrite as:
where
where the last inequality is obtained from the fact that the function is continuous on the compact set U for all and that the term is negative. Thus, , and Assumption 2(ii) follows.
5. Application 1: Discounted Linear Quadratic Regulator (LQR)
In this section, we consider the -discounted linear quadratic regulator. To this end, we suppose that the dynamic system evolves according to the linear stochastic differential equations:
with , , , is an m-dimensional Brownian motion, and is a positive constant. The expected cost is:
where , , and . The optimality equation or HJB-equation for the -discounted partially observed LQR-optimal control problem is:
where the infinitesimal generator for the process applied to is:
where
By Proposition 2, if there exist a function and a policy such that (14) and (15) hold, then v coincides with the value function and is the -discount optimal policy. Thus, we propose that the function that solves the HJB-Equation (40) has the form:
where is a twice continuously differentiable function, c is a constant, and K is a positive definite matrix. Inserting the derivative of in (43), we get the optimal control:
where the equality (40) holds if the matrix K satisfies the algebraic Riccati equation:
and satisfies the partial differential equation:
where is as in (42), is the identity matrix of , and and are the gradient and the Hessian of the function , respectively.
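The exact displayed form of the algebraic Riccati equation is omitted above, but in the standard formulation of the discounted LQR problem the discount rate can be absorbed by shifting the drift matrix, so that K solves an ordinary continuous-time algebraic Riccati equation with A replaced by A - (alpha/2)I. The following sketch (the matrices and the discount rate are illustrative, not the simulation data given below) computes K and the corresponding feedback gain in this standard form.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative data (not the values used in the simulation results below).
A = np.array([[0.0, 1.0], [-2.0, -0.5]])
B = np.array([[0.0], [1.0]])
Qc = np.diag([1.0, 0.1])          # state weight in the quadratic cost
R = np.array([[0.05]])            # control weight
alpha = 0.2                       # discount rate

# The discounted ARE  A'K + KA - alpha*K - K B R^{-1} B' K + Qc = 0  equals the standard
# ARE with the shifted drift matrix, since (A - (alpha/2)I)'K + K(A - (alpha/2)I) = A'K + KA - alpha*K.
A_shift = A - 0.5 * alpha * np.eye(A.shape[0])
K = solve_continuous_are(A_shift, B, Qc, R)

gain = np.linalg.solve(R, B.T @ K)    # optimal feedback u*(x) = -R^{-1} B' K x
print(K)
print(gain)
```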
Simulation results. In the following figures, we assume that the Markov chain has two states, namely, and the dynamic system . We have computed the Wonham filter, the states of the dynamic system (39) with initial condition , the value function (44), and the optimal control (45) for the following data: , , , , , , , :
and the transition matrix:
To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain can only be observed through .
Figure 1 shows the solution of the Wonham filter equation and the states of the hidden Markov chain . As can be noted, at s, , implying that, with higher probability, the Markov chain is in state 2 at (). The evolution of the dynamic system (39) is given in Figure 2 (top); in this figure, we can see that the optimal control (45) moves the initial point to the point in s, indicating the good performance of the optimal control (45). The asymptotic behavior of the optimal control (45) is given in Figure 2 (bottom); this control stabilizes at zero around s, since also stabilizes at zero around s.
Figure 1.
Wonham filter for the -discounted LQR.
Figure 2.
Asymptotic behavior of the state of dynamic system (top) and optimal control -discount LQR (bottom).
6. Application 2: Average LQR: Modeling of a Quarter-Car Suspension
In this section, the basic quarter-car suspension model analyzed in [27] is considered; see Figure 3. The parameters are: the sprung mass (), the unsprung mass (), the suspension spring constant (), and the tire spring constant (k). Let , , and be the vertical displacements of the sprung mass, the unsprung mass, and the road profile, respectively. The equations of motion for this model are given by:
Figure 3.
Schematic of a quarter-car suspension.
Now, defining , , , and , the equations of motion (46) and (47) can be expressed in matrix form as:
where , and in the time domain, the road profile, , can be represented as the output of a linear first-order filter driven by white noise as follows:
where V is the vehicle speed (assumed constant), is a positive constant, and a is the road roughness coefficient, which depends on the type of road. Here, we assume that a depends on a hidden Markov chain, that is, with . In our case, we consider that the dynamic system (48) evolves with additional white noise, that is:
The following performance index is introduced in order to trade off ride comfort and handling while maintaining the constraint on suspension deflection:
Defining and , we can rewrite (50) as:
Now, from the equations of motion in (46) and (47), note that with and . Thus, replacing this matrix form of y in (51), we can rewrite (50) again as:
where , , .
The optimal control problem (OCP). The OCP in this application consists of finding such that it minimizes the performance index (52) considering that the dynamic system evolves according to the stochastic differential Equation (49).
In the dynamic programming technique, we need the infinitesimal generator of the process applied to ; in this case, this generator is:
where , whereas the Hamilton–Jacobi–Bellman Equation (or dynamic programming equation) associated with this problem is:
see [28] for more details.
Proposition 3.
Assume that evolves according to (49). Then, the control that minimizes the long-run cost (52) is:
whereas the corresponding function v that solves the HJB Equation (54) is given by:
where K is a positive semi-definite matrix that satisfies the Riccati differential equation:
and satisfies the differential equation:
and satisfies the partial differential equation:
where is as in (41), and and denote the gradient and the Hessian of the function , respectively. The optimal cost is given by:
Proof.
The HJB-equation for the partially observed LQR optimal control problem, in which evolves according to (49) with cost (52), is (54), where is the infinitesimal generator given in (53). We look for a candidate solution to (54) of the form:
for some continuous functions , and K a positive semi-definite matrix. We assume that for all and that is positive definite, so that the function is convex.
Now, the function is strictly convex on the compact set U, and thus, attains its minimum at:
Inserting and the partial derivatives of v with respect to x, , and in the HJB-Equation (54), we obtain:
For equality (61) to hold, it is necessary that the functions g and h satisfy (57) and (58), respectively, and that the matrix K satisfies the Riccati differential Equation (56), whereas the constant . Finally, from Theorem 1, it follows that is an optimal Markovian control and that the value function is equal to (59). That is:
□
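To complement the proposition, the following sketch assembles one common state-space form of the quarter-car Equations (46)–(48) and solves the associated continuous-time algebraic Riccati equation for the feedback gain; the state ordering, the weighting matrices, and the parameter values are assumptions made for illustration and are not the data used in the simulation below.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative parameters: sprung mass, unsprung mass, suspension and tire stiffness.
m_s, m_u, k_s, k_t = 300.0, 40.0, 15000.0, 150000.0

# One common state vector: x = (z_s - z_u, z_s', z_u - z_r, z_u'); the road input z_r'
# would enter the third equation and is omitted here (treated as a disturbance).
A = np.array([[0.0,        1.0,  0.0,        -1.0],
              [-k_s / m_s, 0.0,  0.0,         0.0],
              [0.0,        0.0,  0.0,         1.0],
              [ k_s / m_u, 0.0, -k_t / m_u,   0.0]])
B = np.array([[0.0], [1.0 / m_s], [0.0], [-1.0 / m_u]])

# Illustrative trade-off weights between comfort, handling, and control effort.
Qc = np.diag([1.0e3, 1.0, 1.0e3, 1.0])
R = np.array([[1.0e-4]])

K = solve_continuous_are(A, B, Qc, R)    # stabilizing Riccati solution
gain = np.linalg.solve(R, B.T @ K)       # optimal feedback u*(x) = -R^{-1} B' K x
print(gain)
```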
Simulation results. To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain has two states that can only be observed through . The following data were used: , , , , , , , , × , , , kg, kg, N/m, N/m, m/s, , , and:
The solution of the Wonham filter equation and the states of the hidden Markov chain are shown in Figure 4. As can be noted, at s, , implying that, with a probability greater than , the Markov chain is in state 1 at .
Figure 4.
Wonham filter and hidden Markov chain (in t = 1 s).
The asymptotic behavior of the optimal control (55) is given in Figure 5 (bottom). It is interesting to note that this control minimizes the magnitudes of the sprung mass velocity and the unsprung mass velocity after s; see Figure 5 (top). This behavior implies that the magnitudes of the sprung mass acceleration and the unsprung mass acceleration are also minimized, considering that the stochastic differential equation that models the road profile depends on a hidden Markov chain. These results agree with those obtained by the authors in [27], who mention that two important objectives of a suspension system are ride comfort and handling performance. Ride comfort requires that the car body be isolated from road disturbances as much as possible to provide a good feeling for passengers. In practice, we seek to minimize the acceleration of the sprung mass.
Figure 5.
Asymptotic behavior of the state of dynamic system (top) and optimal control (bottom).
7. Application 3: Optimal Control of a Vehicle Active Suspension System with a Damper
The model analyzed in this section is given in [29]. In this application, a damper is added to the quarter-car suspension presented in Section 6; see Figure 6. The parameters in Figure 6 are: the sprung mass (), the unsprung mass (), the suspension spring constant (), and the tire spring constant (k). Let , , and r be the vertical displacements of the sprung mass, the unsprung mass, and the road disturbance, respectively. The equations of motion are given by:
Figure 6.
Quarter vehicle model of active suspension system.
Now, defining , , , and , the equations of motion in (62) and (63) can be expressed in matrix form as:
where , and we assume that the road profile is represented by a function with hidden Markovian switchings:
where (road bump height is 10 cm), (road bump height is 16 cm), and , are the random jump times of . In our case, we consider that the dynamic system (64) evolves with additional white noise, that is:
and we wish to minimize the discounted expected cost:
subject to (66) and (65). Considering the infinitesimal generator given in (53) with , and the Hamilton–Jacobi–Bellman equation associated with this problem:
arguments similar to those given in Section 5 and Section 6 allow us to find the optimal control and the value function for this setting. In fact:
where is a twice continuously differentiable function, c is a constant, is a twice continuously differentiable function, and K is a positive definite matrix. Inserting the derivative of in (43), we get the optimal control:
where the matrix K satisfies the algebraic Riccati equation:
the function satisfies the differential equation:
and satisfies the partial differential equation:
where is as in (42), is the identity matrix of , and and are the gradient and the Hessian of the function , respectively.
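For intuition about the hidden switching in the road disturbance (65), the following sketch simulates a two-state continuous-time Markov chain by sampling exponential holding times from an illustrative generator and switches the bump height between 0.10 m and 0.16 m accordingly; the generator, horizon, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.array([[-0.4, 0.4], [0.6, -0.6]])   # illustrative generator of the hidden chain
bump_height = {0: 0.10, 1: 0.16}           # road bump height (m) in each regime

T, t, state = 20.0, 0.0, 0
jump_times, regimes = [0.0], [state]
while t < T:
    # The holding time in the current state is exponential with rate -Q[state, state].
    t += rng.exponential(1.0 / -Q[state, state])
    state = 1 - state                      # with two states, every jump switches regime
    jump_times.append(min(t, T))
    regimes.append(state)

for s, i in zip(jump_times, regimes):
    print(f"t = {s:6.2f} s  ->  bump height {bump_height[i]:.2f} m")
```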
Simulation results. To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain has two states and can only be observed through . The following data were used: , , , , , , , , , kg, kg, N/m, N/m, N/m, and:
Figure 7 shows the solution of the Wonham filter equation and the states of the hidden Markov chain . As can be seen, in the time interval , , implying that, with a probability greater than , the Markov chain is in state 1.
Figure 7.
Wonham filter and hidden Markov chain (time interval [2, 4]).
The asymptotic behavior of the optimal control (67) is given in Figure 8 (bottom). It is interesting to note that this control minimizes the magnitudes of the sprung mass displacement, , and the unsprung mass displacement, , as well as their velocities, and , after s; see Figure 8 (top).
Figure 8.
Asymptotic behavior of the state of the dynamic system (top) and optimal control (bottom).
8. Application 4: Optimal Pollution Control with Average Payoff
This application studies the pollution accumulation caused by the consumption of a certain product, such as gas or petroleum; see [30]. The stock of pollution is governed by the controlled diffusion process:
where represents the pollution flow generated by an entity due to the consumption of the product, represents the decay rate of pollution, chosen at each time by nature, and k is a positive constant. We shall assume that is bounded and the parameter represents the consumption/production restriction. Let be a Markov chain with two states and a generator Q given by:
The reward rate in this example represents the social welfare and is defined as:
where and are the social utility of the consumption u and the social disutility of the pollution , respectively. We assume that the function F in (69) satisfies:
Clearly, (68) is a linear stochastic differential equation and satisfies Assumption 1.
Now, we define the Banach space and use , . Hence, , and Assumption 2(i) holds. On the other hand, since the utility function is continuous on the compact interval , then:
where ; thus, Assumption 3 holds. Note that:
Thus, taking and we obtain:
Therefore, Assumption 2(ii) holds. It can be proven that the process (68) satisfies Assumption 2.6 in [1]; thus, by ([1], Theorem 2.8), is exponentially ergodic (Assumption 4). In this application, we seek a policy u that maximizes the long-run average welfare :
We propose , where and , as a solution that verifies the HJB Equation (27) associated with this pollution control problem. Simple calculations allow us to conclude that the policy on consumption/pollution takes the form:
where is the inverse function of the derivative , .
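As a purely illustrative instance (the utility, the bound $\kappa$ on consumption, and the sign convention below are assumptions for this sketch, not data taken from [30]), suppose the interior first-order condition of the HJB maximization reads $F'(u^{*})=-v'(x)$ and take $F(u)=\log(1+u)$ on $[0,\kappa]$, so that $(F')^{-1}(y)=1/y-1$ and

\[
u^{*}(x)\;=\;\min\Big\{\kappa,\;\max\big\{0,\;(F')^{-1}\big(-v'(x)\big)\big\}\Big\}
\;=\;\min\Big\{\kappa,\;\max\Big\{0,\;-\tfrac{1}{v'(x)}-1\Big\}\Big\}.
\]

Under these assumptions, the projection onto $[0,\kappa]$ encodes the consumption/production restriction, and the policy depends on the pollution stock only through the marginal value $v'(x)$.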
9. Concluding Remarks
Under hypotheses such as the uniform ellipticity in Assumption 1c, the Lyapunov-like conditions in Assumption 2, and the w-exponential ergodicity in Assumption 4 for the average criterion, this work shows the existence of optimal controls for control problems with discounted and average payoffs in which the dynamic system evolves according to a switching diffusion with hidden states. To conclude, we conjecture that the results obtained in this work still hold (with obvious changes) if the hidden Markov chain () in (1) is replaced with another diffusion process. Furthermore, these results can be extended to constrained and unconstrained nonzero-sum stochastic differential games with additive structure, which will allow us to model a larger class of practical systems. This will be a topic of future work.
Author Contributions
Conceptualization, B.A.E.-T.; Formal analysis, B.A.E.-T. and J.G.-M.; Investigation, B.A.E.-T., J.G.-M. and G.A.; Methodology, B.A.E.-T., J.G.-M. and J.D.R.-A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Escobedo-Trujillo, B.A.; Hernández-Lerma, O. Overtaking optimality for controlled Markov-modulated diffusions. J. Optim. 2011, 61, 1405–1426.
- Borkar, V.S. The value function in ergodic control of diffusion processes with partial observations. Stoch. Stoch. Rep. 1999, 67, 255–266.
- Borkar, V.S. Dynamic programming for ergodic control with partial observations. Stoch. Process. Their Appl. 2003, 103, 293–310.
- Rieder, U.; Bäuerle, N. Portfolio optimization with unobservable Markov-modulated drift process. J. Appl. Probab. 2005, 362–378.
- Tran, K. Optimal exploitation for hybrid systems of renewable resources under partial observation. Nonlinear Anal. Hybrid Syst. 2021, 40, 101013.
- Tran, K.; Yin, G. Stochastic competitive Lotka–Volterra ecosystems under partial observation: Feedback controls for permanence and extinction. J. Frankl. Inst. 2014, 351, 4039–4064.
- Mao, X.; Yuan, C. Stochastic Differential Equations with Markovian Switching; World Scientific Publishing Co.: London, UK, 2006. Available online: https://www.worldscientific.com/doi/pdf/10.1142/p473 (accessed on 20 March 2022).
- Yin, G.G.; Zhu, C. Hybrid Switching Diffusions: Properties and Applications; Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2010; Volume 63, p. xviii+395.
- Yin, G.; Mao, X.; Yuan, C.; Cao, D. Approximation methods for hybrid diffusion systems with state-dependent switching processes: Numerical algorithms and existence and uniqueness of solutions. SIAM J. Math. Anal. 2009, 41, 2335–2352.
- Yu, L.; Zhang, Q.; Yin, G. Asset allocation for regime-switching market models under partial observation. Dynam. Syst. Appl. 2014, 23, 39–61.
- Ghosh, M.K.; Arapostathis, A.; Marcus, S.I. Optimal control of switching diffusions with application to flexible manufacturing systems. SIAM J. Control Optim. 1993, 31, 1183–1204.
- Ghosh, M.K.; Marcus, S.I.; Arapostathis, A. Controlled switching diffusions as hybrid processes. In Proceedings of the International Hybrid Systems Workshop, New Brunswick, NJ, USA, 22–25 October 1995; Springer: Berlin/Heidelberg, Germany, 1995; pp. 64–75.
- Zhang, X.; Zhu, Z.; Yuan, C. Asymptotic stability of the time-changed stochastic delay differential equations with Markovian switching. Open Math. 2021, 19, 614–628.
- Zhu, C.; Yin, G. Asymptotic properties of hybrid diffusion systems. SIAM J. Control Optim. 2007, 46, 1155–1179.
- Wonham, W.M. Some applications of stochastic differential equations to optimal nonlinear filtering. J. SIAM Control Ser. A 1965, 2, 347–369.
- Elliott, R.J.; Aggoun, L.; Moore, J.B. Hidden Markov Models: Estimation and Control; Springer: Berlin/Heidelberg, Germany, 1995.
- Cohen, S.N.; Elliott, R.J. Stochastic Calculus and Applications, 2nd ed.; Probability and Its Applications; Springer: Cham, Switzerland, 2015; p. xxiii+666.
- Yin, G.; Zhang, Q. Discrete-Time Markov Chains: Two-Time-Scale Methods and Applications; Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2006.
- Yin, G.G.; Zhu, C. Hybrid Switching Diffusions: Properties and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009; Volume 63.
- Protter, P.E. Stochastic Integration and Differential Equations, 2nd ed.; Stochastic Modelling and Applied Probability; Version 2.1, Corrected Third Printing; Springer: Berlin/Heidelberg, Germany, 2005; Volume 21, p. xiv+419.
- Chigansky, P. An ergodic theorem for filtering with applications to stability. Syst. Control Lett. 2006, 55, 908–917.
- Kunita, H. Asymptotic behavior of the nonlinear filtering errors of Markov processes. J. Multivar. Anal. 1971, 1, 365–393.
- Lu, X.; Yin, G.; Guo, X. Infinite horizon controlled diffusions with randomly varying and state-dependent discount cost rates. J. Optim. Theory Appl. 2017, 172, 535–553.
- Ghosh, M.K.; Arapostathis, A.; Marcus, S.I. Ergodic control of switching diffusions. SIAM J. Control Optim. 1997, 35, 1962–1988.
- Schäl, M. Conditions for optimality and for the limit of n-stage optimal policies to be optimal. Z. Wahrs. Verw. Geb. 1975, 32, 179–196.
- Ghosh, M.K.; Marcus, S.I. Stochastic differential games with multiple modes. Stoch. Anal. Appl. 1998, 16, 91–105.
- Nguyen, L.H.; Seonghun, P.; Turnip, A.; Hong, K.S. Application of LQR control theory to the design of modified skyhook control gains for semi-active suspension systems. In Proceedings of the ICROS-SICE International Joint Conference 2009, Fukuoka, Japan, 18–21 August 2009; pp. 4698–4703.
- Escobedo-Trujillo, B.; Garrido-Meléndez, J. Stochastic LQR optimal control with white and colored noise: Dynamic programming technique. Rev. Mex. Ing. Química 2021, 20, 1111–1127.
- Maurya, V.K.; Bhangal, N.S. Optimal control of vehicle active suspension system. J. Autom. Control Eng. 2018, 6, 1111–1127.
- Kawaguchi, K.; Morimoto, H. Long-run average welfare in a pollution accumulation model. J. Econom. Dynam. Control 2007, 31, 703–720.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).