Abstract
In this paper, we study a constrained optimal control problem for pollution accumulation, where the dynamic system is governed by a diffusion process that depends on unknown parameters, which need to be estimated. Since the true values are unknown, we determine (adaptive) policies that maximize a discounted reward criterion with constraints; that is, we use Lagrange multipliers to find optimal (adaptive) policies for the unconstrained version of the optimal control problem. In the present context, the dynamic system evolves as a diffusion process, and the cost function is required to be dominated by another function (typically a constant), which plays the role of a constraint in the control model. We offer solutions to this problem using standard dynamic programming tools under the constrained discounted payoff criterion on an infinite horizon and the so-called principle of estimation and control. To illustrate our results, we use maximum likelihood estimators, obtained by means of a least square error approximation, in a pollution accumulation model. One of the advantages of our approach compared to others is the intuition behind it: find optimal policies for an estimated version of the problem and let this estimation tend toward the real version of the problem. However, most risk analysts will not be as familiar with our methods as they are with, for instance, model predictive control, MATLAB’s robust control toolbox, or the polynomial chaos expansion method, which have been used in the literature to address similar issues.
Keywords:
consistent estimators; discounted cost; control with restrictions; maximum likelihood estimators; least square errors
MSC:
93E10; 93E20; 93E24; 60J60
1. Introduction
This work studies the problem of the optimal control of pollution accumulation with an unknown parameter, which needs to be statistically estimated, in a constrained context. We aim to construct adaptive policies for the discounted reward criterion on an infinite horizon. We assume that the stock of pollution is driven by an Itô diffusion, and we use the discounted criterion. We also consider the presence of constraints on the reward function. We estimate the unknown parameter and follow the principle of estimation and control (PEC), which allows us to use standard dynamic programming tools to find optimal solutions.
We keep the presentation at the level of the application to the pollution accumulation problem while attempting to maximize a utility function; however, the theory is general enough to be exploited in other contexts. The goal of pollution accumulation models is to examine how certain goods are managed for society’s consumption. It is commonly acknowledged that this consumption produces two by-products: pollution and social welfare. The latter term refers to the difference between the benefits and harms connected with pollution. The theory studied here enables the decision-maker to identify a consumption policy that maximizes the anticipated social welfare of society, subject to a limitation that may reflect, for example, that some environmental clean-up expenditures will not surpass a certain amount over time, while (for instance) the rate at which nature cleans itself is unknown. One of the features of the discounted optimality criterion used in this paper is that emphasis is placed on the utility of consumption for present generations, which is mirrored by the value functions that we obtain. This characteristic renders the problem a rather flexible one and enables us to use standard dynamic programming tools.
We employ the PEC, which has its roots in Kurano (1972) [1] and Mandl (1974) [2], to analyze the adaptive control problem with constraints. The idea of the PEC is to estimate the unknown parameter θ, replace the unknown value with its estimate, and then solve the resulting optimal control problem with constraints. We refer the reader to [1,3,4] and the references therein for studies developing asymptotically optimal adaptive strategies. For instance, Kurano and Mandl introduced the idea of estimation and control when considering Markov decision models with constrained rewards and a finite state space. Based on a consistent estimator of the unknown parameter that is uniformly optimal in the parameter, they demonstrated the existence of an optimal policy. Reference [4] works with discrete-time stochastic control systems.
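For readers who prefer an algorithmic view, the following is a minimal Python sketch of a single PEC iteration. The function names (`estimate_parameter`, `solve_constrained_problem`) and the calling convention are ours and purely illustrative; the concrete estimator and solver used in this paper are described in Sections 3 and 4.

```python
def pec_step(observations, estimate_parameter, solve_constrained_problem):
    """One iteration of the principle of estimation and control (PEC).

    observations              : data collected from the controlled system so far
    estimate_parameter        : callable returning an estimate of the unknown parameter
    solve_constrained_problem : callable returning an optimal policy for a given parameter value
    """
    theta_hat = estimate_parameter(observations)     # 1. estimate the unknown parameter
    policy = solve_constrained_problem(theta_hat)    # 2. solve the constrained problem as if theta_hat were true
    return theta_hat, policy                         # 3. apply `policy` until the next estimation epoch
```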
In the late 1990s and early 2000s, the stochastic optimization problem with constraints was addressed under the assumption that all the coefficients—the diffusion itself, the reward function, and the restrictions—are bounded (see, for example, [5,6,7,8]). Related publications include Borkar and Ghosh’s foundational publication on constrained optimal control under discounted and ergodic criteria, the work of Mendoza-Pérez, Jasso-Fuentes, Prieto-Rumeau, and Hernandez-Lerma (see [9,10]), and the paper of Jasso-Fuentes, Escobedo-Trujillo, and Mendoza-Pérez [11].
The adaptive control of a linear diffusion process (of the sort we use here) with respect to the discounted cost criterion is studied in [12]. In [13], an adaptive optimal control for continuous-time linear systems is investigated. The use of statistical tools is common when modeling control problems. For instance, Bayesian adaptive techniques for ergodic diffusions are considered in [14], and ref. [15] uses the method of maximum likelihood estimation of parameters to study a self-tuning scheme for a diffusion process with the long-term average cost criterion. The main idea is to estimate the unknown parameter online, so that the most recent estimate is used in place of the true parameter when choosing each control (see [16]). The estimation of parameters for diffusion processes using discrete observations has been studied in several papers, including [17,18,19,20,21] and the references therein.
The issue of pollution accumulation has previously been studied from the perspective of dynamic optimization; for instance, refs. [22,23] use a linear-quadratic model to explain this phenomenon, ref. [24] deals with the average payoff in a deterministic framework, refs. [25,26] extend the former’s approach to a stochastic context, ref. [27] uses a robust stochastic differential game to characterize the situation, ref. [28] studies the problem from the perspective of constrained stochastic optimization, and ref. [29] is a statistical survey of the effects of air pollution on public health. The main contribution of our paper is as follows: a statistical estimation procedure is used to find the unknown parameter θ, and we construct adaptive policies that are almost surely optimal for the constrained optimization problem under the discounted payoff on an infinite horizon. These adaptive policies are obtained by substituting the estimates into optimal stationary controls (or PEC); see [1,2]. In this sense, our findings resemble those presented in [30], since most risk analysts will not be as familiar with our methods as they are with, for instance, model predictive control, the robust control toolbox, or the polynomial chaos expansion method, which have been used in the literature to address similar issues.
Our work lies at the intersection of three important classes of optimal control problems. The first is the problem of controlling pollution accumulation, as presented in [22,29,30,31,32]; the second class deals with constrained optimal control problems (references [11,33,34] study this type of problem, considering that all parameters are known); the third class concerns adaptive optimal control problems, as presented in [14,15,20,35,36]. Reference [30] belongs to the first and third classes. Our work is an extension of [11,22,28,31,32,36] to the adaptive constrained optimal control framework. Reference [22] builds a robust control to explore the normative maxim that, in the presence of uncertainty, society must guard against worst-case outcomes; [11] studies a constrained optimal control problem where all the parameters are known, while [28,32] study the same in the context of pollution accumulation; refs. [31,36] study an unconstrained adaptive optimal control problem. In addition, a numerical example is given for demonstration purposes.
The rest of the paper is organized as follows. We present the theoretical preliminaries in the next section. Then, we devote Section 3 to our main results, which we illustrate in Section 4. We provide our conclusions in Section 5. Please note that, for the sake of self-containedness of the presentation and acknowledgement of our sources, we have included references at the end of this article. However, we recognize that these might distract the reader, and we apologize for this inconvenience.
Throughout our work, we use the following notation. For vectors $x\in\mathbb{R}^n$ and matrices $A\in\mathbb{R}^{n\times m}$, we denote by $|\cdot|$ the usual Euclidean norms, namely $|x|^2:=\sum_i x_i^2$ and $|A|^2:=\mathrm{Tr}(AA^{\top})$, where $A^{\top}$ and $\mathrm{Tr}(\cdot)$ denote the transpose and the trace of a square matrix, respectively. Additional standard notation is introduced where it is first needed.
2. Problem Statement
In our model, the stock of pollution is modeled as an n-dimensional controlled stochastic differential equation (SDE) of the form
$$dx(t) = b\bigl(x(t),u(t),\theta\bigr)\,dt + \sigma\bigl(x(t)\bigr)\,dW(t),\qquad x(0)=x_0,\quad t\ge 0, \tag{1}$$
where $b$ and $\sigma$ are given functions, and $W(\cdot)$ is an adapted d-dimensional Wiener process, such that $x_0$ and $W(\cdot)$ are pairwise independent. The compact set U is assumed to be contained in a suitable metric space and is called the control set. Here, $u(t)$ represents the flow of consumption at time t; this is a stochastic process that takes values in U, which, in turn, is bounded to represent the consumption restrictions imposed by worldwide protocols. In this work, we assume that the pollution decay rate $\theta$ is an unknown parameter taking values in a compact set $\Theta$ called the parameter set. Assumption A1 in Appendix A ensures the existence and uniqueness of a strong solution to (1).
2.1. Control Policies and Stability Assumptions
Although we eventually illustrate our theoretical developments using the so-called stationary Markovian policies, we need to introduce the concept of randomized policies (also known as relaxed controls). To this end, we use the following nomenclature:
- is the Borel σ-algebra generated by the Borel set B.
- is, as usual, the space of all real-valued continuous functions on a bounded, open and connected subset .
- stands for the space of all real-valued continuous bounded functions f on the bounded, open and connected subset .
- is the space of all real-valued continuous functions f on the bounded, open and connected subset , with continuous derivatives up to order .
- is, as is customary, the Lebesgue space of functions g on such that , with a suitable measure space, and .
- is the family of probability measures on B endowed with the topology of weak convergence.
Definition 1.
A randomized policy is a family of stochastic kernels on , satisfying:
- (a)
- for each and , , and for each and , is a Borel function on ;
- (b)
- for each and , the function is Borel-measurable in .
A randomized policy is said to be stationary if there is a probability measure , such that for all and . The set of randomized stationary policies is denoted by Π.
Let be the family of measurable functions . A strategy for some is said to be a stationary Markov policy.
For each randomized stationary policy $\pi\in\Pi$, we write the drift coefficient b defined in (1) as
$$b(x,\pi,\theta):=\int_U b(x,u,\theta)\,\pi(du). \tag{2}$$
Note that $b(\cdot,\pi,\theta)$ inherits the same continuity and Lipschitz properties from b, given in Assumption A1.
Remark 1.
Under Assumption A1, for each policy and , there exists a weak solution of (1) which is a Markov–Feller process in the probability space . See Theorem 2.2.6 in [37].
Topology of relaxed controls. We will need limit and continuity concepts on the set of policies. For this reason, we topologize the set of randomized stationary policies Π, as in [38]. This topology renders Π a compact metric space, and it is determined by the following convergence criterion (see [37,38,39]).
Definition 2.
A sequence in Π converges to , if
for all , and , where
We denote this type of convergence as .
For $v\in\mathcal{C}^2$, $u\in U$ and $\theta\in\Theta$, let
$$L^{u,\theta}v(x):=\sum_{i=1}^{n} b_i(x,u,\theta)\,\partial_{x_i} v(x)+\frac{1}{2}\sum_{i,j=1}^{n} a_{ij}(x)\,\partial^2_{x_i x_j} v(x), \tag{3}$$
where $b_i$ is the i-th component of b, and $a_{ij}$ is the $(i,j)$-component of the matrix $a$ defined in Assumption A1(d). Again, for each randomized stationary policy $\pi\in\Pi$, we write the infinitesimal generator defined in (3) as
$$L^{\pi,\theta}v(x):=\int_U L^{u,\theta}v(x)\,\pi(du).$$
Note that the application of Dynkin’s formula to the function , and Assumption A2(b) yields
where stands for the conditional expectation of ·, given that (1) starts at x, the controller uses the randomized stationary policy , and the unknown parameter is fixed at . That is, is the expectation of · taken with respect to the probability measure when starts at x.
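To make the role of inequality (4) concrete, the display below is a sketch of the kind of bound obtained this way, written under the assumption that Assumption A2(b) takes the usual Lyapunov form $L^{u,\theta}w(x)\le -c\,w(x)+d$, and with $\mathbb{E}^{\pi,\theta}_x$ our shorthand for the expectation just described; the constants are those of Assumption A2.
$$\mathbb{E}^{\pi,\theta}_x\bigl[w(x(t))\bigr]\;\le\; e^{-ct}\,w(x)\;+\;\frac{d}{c}\bigl(1-e^{-ct}\bigr),\qquad t\ge 0.$$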
2.2. Reward, Cost and Constraint Rates
We will consider that the reward and the cost rates, along with the constraints of our model, can be unbounded from above and below, but are dominated by the Lyapunov function w given in Assumption A2. Namely, they are in the Banach space of real-valued measurable functions on with the finite w-norm, which is defined as follows.
Definition 3.
Let denote the Banach space of real-valued measurable functions v on with finite w-norm, which is defined as
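The w-norm referred to in Definition 3 is the usual weighted supremum norm; a standard way of writing it (cf. [40,41]), which we take as the intended definition, is
$$\|v\|_w:=\sup_{x}\frac{|v(x)|}{w(x)},$$
where the supremum runs over the state space.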
Let be measurable functions that will be identified as the social welfare (also called payoff or reward) rate and the cost rate, respectively, and let be another measurable function that will be referred to as the constraint rate. In the present context, this restriction stands for the fact that, in some situations, due to each country’s legal framework, the cost of cleaning the environment must not exceed a given quantity. All of these functions are supposed to meet Assumption A3.
When the controller uses policy , we write the reward and cost rates in a similar way as (2); that is,
3. Main Results
3.1. Discounted Control with Constraints
In the sequel, we work in the space , where stands for the Sobolev space of real-valued measurable functions on the open and connected subset whose generalized derivatives up to order are in for .
Definition 4.
Given the initial state x, a parameter value and a discount rate , we define the total expected α-discounted reward and cost when the controller uses a policy π in Π, as
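For orientation, a standard form of these two functionals is displayed below, with $\mathbb{E}^{\pi,\theta}_x$ as above and with $r$ and $c$ the reward and cost rates of Section 2.2; the symbols $V_\alpha$ and $C_\alpha$ are our own shorthand.
$$V_\alpha(x,\pi,\theta):=\mathbb{E}^{\pi,\theta}_x\!\left[\int_0^\infty e^{-\alpha t}\,r\bigl(x(t),\pi\bigr)\,dt\right],\qquad
C_\alpha(x,\pi,\theta):=\mathbb{E}^{\pi,\theta}_x\!\left[\int_0^\infty e^{-\alpha t}\,c\bigl(x(t),\pi\bigr)\,dt\right].$$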
Propositions 1–3 below state the properties of these functionals for all policies in Π. Under Assumption A3 and inequality (4), a direct computation yields the following result, which establishes that the expected α-discounted reward and cost are dominated by the Lyapunov function w.
Proposition 1.
If Assumptions A1–A3 hold, the functions and belong to the space for each π in Π; in fact, for each x in and , we have
Here, c and d are as in Assumption A2, and M is the constant in Assumption A3(b).
The following result is an extension of Proposition 3.1.5 in [40] to the topology of relaxed controls. It shows that both the expected discounted reward and the expected discounted cost are solutions of the linear version of the dynamic programming partial differential equation; this can be regarded as a necessary condition for the optimality of the value function. Its proof mimics the steps of the original, replacing the control sets with those used here while keeping θ fixed.
Proposition 2.
Let Assumptions A1–A3 hold, and let be a measurable function satisfying Assumption A3. Then, for every , the associated expected α-discounted function is in , and is such that
Conversely, if some function verifies (6), then
Moreover, if the equality in (6) is replaced by “≤” or “≥”, then (7) holds, with the respective inequality.
Definition 5.
Let . The total expected α-discounted constraint when the controller uses a policy , given the initial state and , is defined by
Remark 2.
The function belongs to the space for each . Moreover, for each , we have
For each , and , assume we are given a constraint function satisfying Assumption A3(c). In this way, we define the set
We assume that is nonempty.
Definition 6 (The discounted problem with constraints (DPC)).
We say that a policy is optimal for the DPC with initial state , given that is the true parameter value if and, in addition,
In this case, is called the -discount optimal reward for the DPC.
3.2. Lagrange Multipliers Approach
We mimic the technique we used in [32] to transform the original DPC into an unconstrained problem. To this end, take a Lagrange multiplier λ and consider the new reward rate
Using the same notation of (5), we can write (8) as
Observe that, for each and , is in uniformly in and . In fact,
where , and M, as in Assumption A3(b).
For all and , define
The discounted unconstrained problem is defined as follows.
Definition 7 (λ-Discounted unconstrained problem (λ-DUP)).
A policy for which
is called discounted optimal for the λ-DUP, and is referred to as the optimal discounted reward for the λ-DUP.
Let be a measurable function satisfying similar conditions to those given in Assumption A3. The following is called a verification result in the literature. It shows that the value function is the unique solution of the Hamilton–Jacobi–Bellman (HJB) Equation (11), and it also proves the existence of optimal stationary policies. Observe that, by virtue of Definition 7, the functional to which it refers can be the optimal discounted reward for the λ-DUP. Its proof can be found in [11,15,41], considering θ as fixed.
Proposition 3.
Suppose that Assumptions A1–A3 hold. Then:
Remark 3.
- (a)
- Notice that
- (b)
- By Definitions 4 and 5,
- (c)
- Given that the cost and constraint rates satisfy Assumption A3, we deduce the following. Thus,
- (d)
- The function is locally Lipschitz on . In fact, for each , and for all . The last inequality in (12) is met since we assume that Assumption A3 holds.
- (e)
- Parts (c) and (d) imply that the function satisfies Assumption A3. Thus, the rate is Lipschitz-continuous and . Furthermore, by virtue of (9) and Proposition 1, implying that .
3.3. Convergence of Value Functions and
Definition 8.
A sequence of measurable functions is said to be a sequence of uniformly strongly consistent (USC) estimators of if, as ,
where is the probability measure referred to by Remark 1.
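One common way to formalize this convergence (cf. [36]), writing $\theta_m$ for the m-th estimator and $\mathbb{P}^{\pi,\theta}_x$ for the probability measure of Remark 1 (both symbols are our shorthand), is
$$\theta_m \longrightarrow \theta \quad\text{as } m\to\infty,\qquad \mathbb{P}^{\pi,\theta}_x\text{-a.s., for every policy } \pi \text{ and every initial state } x.$$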
For ease of notation, we write . Let be a measurable function satisfying similar conditions to those given in Assumption A3.
Remark 4.
- (a)
- If Assumptions A1–A3 hold, then by Proposition 3.4 in [11], the mappings , and are continuous on Π for each and
- (b)
- Let be a sequence of USC estimators of . Then, using Theorem 4.5 in [36], for every measurable function that satisfies the Assumptions A1–A3, the sequence converges to -a.s., for each and .
- (c)
- Let be a sequence in Π. Since Π is a compact set, there exists a subsequence such that ; thus, combining parts (a) and (b), and using the following triangle inequality: we deduce that, for every measurable function satisfying Assumption A3, we have that
- (d)
- The optimal discount reward for the -DUP, satisfies Proposition 3. In addition, Proposition 3(ii) ensures the existence of stationary policy .
- (e)
- For each , and , we denoteSince can be seen as an embedding of Π, Proposition 3(ii) ensures that the set is nonempty.
- (f)
- Under the hypotheses of Proposition 3, Lemma 3.15 in [11] ensures that for each fixed and any sequence in converging to some ; if there exists a sequence for each , such that it converges to a policy , then . That is, π satisfies
- (g)
- Lemma 3.16 in [11] ensures that the mapping is differentiable on , for any and ; in fact, for each and
3.4. Estimation Methods for Our Application
Pedersen [42] describes the approximate maximum likelihood estimator in the following manner. The unknown parameter θ is estimated by means of some function that measures the likelihood of different values of θ. If, for each fixed m, this function has a unique maximum point, then θ is estimated by that maximizer.
Under the assumption that, for and , is a measurable function of and is also continuously differentiable in for all and almost all , it is proven that the likelihood function is continuous and has a unique maximum point for each fixed m. The number m is the index of a sequence of random experiments on the measurable space .
In our application, the outcomes of the random experiments will be represented by a sequence of observations of a trajectory of (1) at discrete times, and the function to be optimized will be called the least square function (LSE), i.e.,
In practice, the state process in (1) can only be observed on a finite horizon; say, [0, T]. Actually, this is one of the hypotheses of the so-called model predictive control. However, at least from a theoretical point of view, our version of the PEC makes no such assumption, but still chooses T to be as large as practically possible (with regard to computer power, measurement instruments, computation time, etc.), so that we can define the LSE as:
with b as in (1). The LSE function generates the least square estimator, ,
Consistency and asymptotic normality of are studied in [18,19,42]. Shoji [18] demonstrates that the optimization based on the LSE function is identical to the optimization based on the discrete approximate likelihood ratio function when a one-dimensional stochastic differential equation with a constant diffusion coefficient is taken into account:
with b and σ as in (1). The MLR function generates the discrete approximate likelihood ratio estimator:
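As a concrete illustration of how the LSE-based estimation just described can be carried out numerically, the following Python sketch minimizes an Euler-discretized least square criterion over a grid of candidate parameter values. It assumes a one-dimensional state, equally spaced observations, and a generic drift passed in as a callable; the function and variable names are ours and are not part of the formal development.

```python
import numpy as np

def lse_estimate(x, u, dt, drift, theta_grid):
    """Least square estimate of the unknown parameter from discrete observations,
    based on an Euler discretization of the SDE (1).

    x          : observed states x(t_0), ..., x(t_m)
    u          : applied controls u(t_0), ..., u(t_{m-1})
    dt         : sampling step t_{k+1} - t_k
    drift      : callable b(x, u, theta), the drift of (1)
    theta_grid : candidate values covering the parameter set
    """
    increments = np.diff(x)                          # observed increments x(t_{k+1}) - x(t_k)
    errors = []
    for theta in theta_grid:
        predicted = drift(x[:-1], u, theta) * dt     # Euler prediction of each increment
        errors.append(np.sum((increments - predicted) ** 2))
    return theta_grid[int(np.argmin(errors))]        # minimizer of the least square criterion
```

Minimizing over a grid keeps the sketch agnostic to the form of the drift; when the drift is linear in the parameter, as in the illustration of Section 4, the minimizer can also be computed in closed form.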
We establish our main result considering Remarks 3 and 4.
Theorem 1.
Let be a sequence of USC estimators of , and let be a critical point of . Assume that there is a sequence that converges to . Then, π is optimal for the DPC. Moreover, the equalities and hold -a.s.
Proof.
-a.s., as .
On the other hand, note that
thus, by Remark 4(a), we obtain that the term in (19) converges to zero as ; whereas, using Remark 4(b), we can deduce that the term in (20) converges to zero -a.s. So,
As is a critical point of , we obtain, from (14), that for every ,
Therefore, from (21) and (22), we obtain
This last result, along with (18) yields
Using Remark 4(g), is in ,
We have that, for all ,
This implies that , which, in turn, together with (10), (24), (25) and Remark 3(a), leads to
Thus,
Finally, by (25) we obtain that . Therefore, (26), along with (24) and (25) show that
in other words, is optimal for the DPC, and coincides with the optimal reward for the DPC -a.s. □
4. Numeric Illustration
To exemplify our results, we replace (1) with
where . We assume that the reward and cost rates , as well as the constraint rate , are defined by
with , , satisfying and where q is a positive constant. Here, represents the social welfare, where and represent the social utility of the consumption u and the social disutility of the pollution x, respectively.
Remark 5.
Assumptions A1–A3 given in this work hold for the controlled diffusion (27); see Lemma 5.2 in [11]. In fact, the Lyapunov function in Assumption A2 is taken as .
4.1. The -DUP
Lemma 5.3 in [11] ensures that, under the conditions imposed on the constants , , , and q given above, for every and , the optimal reward in (10) with , becomes
where
and the discounted optimal policy for the λ-DUP, which maximizes the right-hand side of (11) for this example, is the constant function given by
4.2. The DPC
Using Theorem 5.5 in [11], for a fixed point such that
If , then the mapping admits a critical point satisfying
Hence, every is -optimal for the DPC and ; in particular, the corresponding -optimal policy for the DPC is
and the -optimal value for the DPC is given by
4.3. Numerical Results for the Optimal Accumulation Problem
To implement the optimal controller (28), we estimate the unknown parameter θ with the LSE (15) and (16). By replacing (27) in (15), a direct computation yields:
where
Assume that the true value of the parameter is , and take , , , , , , , , and . We next obtained discrete observations of the stochastic differential Equation (27) by simulating the equation using the Euler–Maruyama technique on . Based on this information, we obtained observations with different values of the diffusion coefficient.
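To indicate how such observations and estimates can be produced, the following Python sketch simulates a pollution-stock diffusion with the Euler–Maruyama scheme and then computes the least square estimate of the decay rate. The specific drift form (consumption u entering additively and the stock decaying at rate θ), the constant control, and all numerical values below are illustrative assumptions of ours and are not the exact specification (27) or the parameter values used for Tables 1–4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the controlled pollution dynamics: the consumption u
# feeds the stock, which decays at the unknown rate theta; sigma is a constant
# diffusion coefficient. All values below are assumptions for demonstration only.
theta_true, sigma, u, x0, dt, m = 0.5, 0.1, 1.0, 1.0, 0.01, 1200

# Euler-Maruyama simulation of m discrete observations of the state process
x = np.empty(m + 1)
x[0] = x0
for k in range(m):
    x[k + 1] = x[k] + (u - theta_true * x[k]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

# Least square estimator of theta: minimize sum_k (dx_k - (u - theta*x_k)*dt)^2.
# For this drift, which is linear in theta, the minimizer has a closed form.
residuals = np.diff(x) - u * dt
theta_hat = -np.sum(residuals * x[:-1]) / (dt * np.sum(x[:-1] ** 2))
print(f"LSE estimate of theta: {theta_hat:.4f} (true value {theta_true})")
```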
4.3.1. Numerical Results for the -DUP
In Table 1 and Table 2, we denote the root mean square error (RMSE) between the predicted process and the real process by , and the RMSE between the predicted optimal discount cost and the real optimal discount cost , by .
Table 1.
RMSE and absolute error between the estimated processes and the real processes () with .
Table 2.
RMSE and absolute error between the estimated processes and the real processes () with .
Table 1 and Table 2 take and 1200 and display the information on the different values of . As can be seen, as m increases, the estimator approaches the true parameter value , and the RMSEs between the predicted processes , , and the real processes , decrease, thus implying a good fit (see Figure 1 and Figure 2). We can also see that, as the amount of data increases, the absolute error between the predicted optimal control , and the real optimal control decreases. Therefore, the predicted optimal control approaches the true optimal control.
Figure 1.
Asymptotic behavior of with (left) and (right).
Figure 2.
Asymptotic behavior of the optimal discount cost for (left) and (right).
The diffusion process (27) with showed the best fit because, with 1200 data, and its RMSE is , which suggests that the lower the noise in the measured data, the more accurate the least square estimator.
4.3.2. Numerical Results for the DPC
Table 3 and Table 4 show the predicted optimal controls defined in (29), as well as the predicted -optimal rewards for the DPC given in (30) and denoted by . As m increases, the estimator approaches the true parameter value , and the predicted optimal controls and rewards converge to the real optimal control and reward , respectively, implying a good fit. Again, showed the best fit, which suggests that the lower the noise in the measured data, the more accurate the LSE.
Table 3.
Estimated processes and the real processes () with .
Table 4.
Estimated processes and the real processes () with .
5. Concluding Remarks
This paper concerns controlled stochastic differential equations of the form (1), where the drift coefficient depends on an unknown parameter θ. Using a statistical estimation procedure to find θ, we constructed adaptive policies that are almost surely optimal for the constrained optimization problem under the discounted payoff on an infinite horizon. To this end, we let and be the optimal discounted rewards for the DUP and the DPC, respectively. Our results constitute our own version of the PEC, and can be summarized as follows:
- 1.
- For each m, there are optimal control policies for the -DUP and -DPC.
- 2.
- For each initial state , and , and is almost surely .
- 3.
- For the DUP, there is a subsequence of and a policy , such that converges to in the topology of relaxed controls, and is optimal for the -DUP. Moreover, if is a critical point of , then is optimal for the DPC.
Some of the techniques we use are standard in the context of dynamic programming, and our use of the discounted payoff criterion on the infinite horizon renders the problem a rather flexible one. This criterion emphasizes the weight of the rewards and costs on the present generations, while tending to overlook their effects on future generations. One way to prevent this from happening is to use the ergodic criterion by means of the so-called vanishing discount technique, which has been used in, for instance, [43]. To obtain some insight as to how this method would alter the value function and the overall results presented here, we invite the reader to let the discount rate α tend to zero.
There are many ways to obtain a sequence of USC estimators of the unknown parameter (see [18]). However, when implementing the approximation algorithms, one needs to check the type of numerical approximation of the derivative that is required. In our case, we replaced the derivative with its central difference, instead of the backward difference, because the former yields more accurate approximations in our applications.
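To illustrate why the central difference is preferred, the short Python comparison below contrasts the first-order backward difference with the second-order central difference on a function whose derivative is known; the example function and step size are arbitrary choices of ours.

```python
import numpy as np

def backward_diff(f, x, h):
    """First-order backward difference: truncation error O(h)."""
    return (f(x) - f(x - h)) / h

def central_diff(f, x, h):
    """Second-order central difference: truncation error O(h**2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Derivative of sin at x = 1; the exact value is cos(1).
x, h = 1.0, 1e-3
exact = np.cos(x)
print("backward difference error:", abs(backward_diff(np.sin, x, h) - exact))
print("central difference error: ", abs(central_diff(np.sin, x, h) - exact))
```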
The PEC requires knowledge and storage of the optimal policies for all values of the parameter, which may require considerable off-line computation and considerable storage. Therefore, the PEC works well for optimal control problems that admit closed-form solutions, such as, for example, LQ problems (linear systems with quadratic costs). In this sense, our model resembles model predictive control. However, the fact that the horizon has to be finite in the latter is a serious limitation that is overcome by our proposal. In fact, the numeric illustration in Section 4 is another example of the distinction between our version of the PEC and the polynomial chaos expansion method. While the latter aims to approximate the probability densities of finite-variance random variables, our goal is to obtain an optimal control while estimating processes of infinite total variation, such as (1). This is particularly true in the case of (27), regardless of how small the diffusion coefficient is in our illustration. There, the key point is that the lower the noise in the measured data, the more accurate the LSE.
One of the drawbacks of our method is that a closed-form solution is virtually impossible to obtain for many optimal control problems, with or without constraints. Another limitation is that, for each application, there is a large number of assumptions and constraints that need to be verified. We believe this deviation from the main problem could be eased by the inclusion of our method in (for instance) MATLAB’s robust control toolbox.
The second part of this project will approximate the adaptive original problem using a sequence of discrete-time adaptive optimal control problems of controlled Markov switching diffusions.
Author Contributions
Conceptualization, methodology, and writing/original draft preparation of this research are due to B.A.E.-T., F.A.A.-H. and J.D.L.-B.; software, validation, visualization, and data curation are due to F.A.A.-H.; formal analysis, investigation, writing/review and editing are due to C.G.H.-C.; project administration and funding acquisition are due to J.D.L.-B. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by Universidad Anáhuac México grant number 00100575.
Data Availability Statement
Not applicable.
Acknowledgments
The authors wish to sincerely thank Ekaterina Viktorovna Gromova for her kind invitation to publish this work.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. Technical Assumptions
The following hypothesis ensures the existence and uniqueness of a strong solution to (1). For more details, see Theorem 3.1 in [43] and Chapter III.2 in [44].
Assumption A1.
- (a)
- The random process (1) belongs to a complete probability space . Here, is a filtration on , such that each is complete relative to ; and is the law of the state process given the parameter and the control .
- (b)
- The drift coefficient in (1) is continuous and locally Lipschitz in the first and third arguments, uniformly in u; that is, for each , there exist nonnegative constants and such that, for all , all and , the corresponding local Lipschitz inequality holds. Moreover, is continuous on U.
- (c)
- The diffusion coefficient satisfies a local Lipschitz condition; that is, for each , there is a positive constant such that, for all ,
- (d)
- The coefficients b and σ satisfy a global linear growth condition of the form, where is a positive constant.
- (e)
- (Uniform ellipticity). The matrix satisfies that, for some constant ,
The following hypothesis is a standard Lyapunov stability condition for the solution of the dynamic system (1) (see [37,41]). It gives the following inequality (4).
Assumption A2.
There exists a function in and constants , such that
- (a)
- .
- (b)
- for all , and x in .
The reward and cost functions are supposed to meet the next hypothesis.
Assumption A3.
- (a)
- The payoff rate, the cost rate, and the constraint rate are continuous on and , respectively. Moreover, they are locally Lipschitz on , uniformly on U and Θ; that is, for each , there are positive constants and such that, for all ,
- (b)
- The rates , and are in uniformly on U and Θ; in other words, there exists such that, for all
References
- Kurano, M. Discrete-time Markovian decision processes with an unknown parameter-average return criterion. J. Oper. Res. Soc. Jpn. 1972, 15, 67–76. [Google Scholar]
- Mandl, P. Estimation and control in Markov chains. Adv. Appl. Probab. 1974, 6, 40–60. [Google Scholar] [CrossRef]
- Hernández-Lerma, O.; Marcus, S. Technical note: Adaptive control of discounted Markov Decision chains. J. Optim. Theory Appl. 1985, 46, 227–235. [Google Scholar] [CrossRef]
- Hilgert, N.; Minjárez-Sosa, A. Adaptive control of stochastic systems with unknown disturbance distribution: Discounted criteria. Math. Methods Oper. Res. 2006, 63, 443–460. [Google Scholar] [CrossRef]
- Broadie, M.; Cvitanic, J.; Soner, H.M. Optimal replication of contingent claims under portfolio constraints. Rev. Fin. Stud. 1998, 11, 59–79. [Google Scholar] [CrossRef]
- Cvitanic, J.; Pham, H.; Touzi, N. A closed-form solution for the super-replication problem under transaction costs. Financ. Stochastics 1999, 3, 35–54. [Google Scholar] [CrossRef]
- Cvitanic, J.; Pham, H.; Touzi, N. Superreplication in stochastic volatility models under portfolio constraints. J. Appl. Probab. 1999, 36, 523–545. [Google Scholar] [CrossRef]
- Soner, M.; Touzi, N. Super replication under gamma constraints. SIAM J. Control Optim. 2000, 39, 73–96. [Google Scholar] [CrossRef]
- Mendoza-Pérez, A.; Jasso-Fuentes, H.; Hernández-Lerma, O. The Lagrange approach to ergodic control of diffusions with cost constraints. Optimization 2015, 64, 179–196. [Google Scholar] [CrossRef]
- Prieto-Rumeau, T.; Hernández-Lerma, O. The vanishing discount approach to constrained continuous-time controlled Markov chains. Syst. Control Lett. 2010, 59, 504–509. [Google Scholar] [CrossRef]
- Jasso-Fuentes, H.; Escobedo-Trujillo, B.A.; Mendoza-Pérez, A. The Lagrange and the vanishing discount techniques to controlled diffusion with cost constraints. J. Math. Anal. Appl. 2016, 437, 999–1035. [Google Scholar] [CrossRef]
- Bielecki, T. Adaptive control of continuous-time linear stochastic systems with discounted cost criterion. J. Optim. Theory Appl. 1991, 68, 379–383. [Google Scholar] [CrossRef]
- Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M.; Lewis, F. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484. [Google Scholar] [CrossRef]
- Di Masi, G.; Stettner, L. Bayesian ergodic adaptive control of diffusion processes. Stochastics Stochastics Rep. 1997, 60, 155–183. [Google Scholar] [CrossRef]
- Borkar, V.; Ghosh, M. Ergodic Control of Multidimensional Diffusions II: Adaptive Control. Appl. Math. Optim. 1990, 21, 191–220. [Google Scholar] [CrossRef]
- Borkar, V.; Bagchi, A. Parameter estimation in continuous-time stochastic processes. Stochastics 1982, 8, 193–212. [Google Scholar] [CrossRef]
- Huzak, M. Estimating a class of diffusions from discrete observations via approximate maximum likelihood method. Statistics 2018, 52, 239–272. [Google Scholar] [CrossRef]
- Shoji, I. A note on asymptotic properties of the estimator derived from the Euler method for diffusion processes at discrete times. Stat. Probab. Lett. 1997, 36, 153–159. [Google Scholar] [CrossRef]
- Ralchenko, K. Asymptotic normality of discretized maximum likelihood estimator for drift parameter in homogeneous diffusion model. Mod. Stochastics Theory Appl. 2015, 2, 17–28. [Google Scholar] [CrossRef]
- Duncan, T.; Pasik-Duncan, B.; Stettner, L. Almost self-optimizing strategies for the adaptive control of diffusion processes. J. Optim. Theory Appl. 1994, 81, 479–507. [Google Scholar] [CrossRef]
- Durham, G.; Gallant, A. Numerical Techniques for Maximum Likelihood Estimation of Continuous-Time Diffusion Processes. J. Bus. Econ. Stat. 2002, 20, 297–316. [Google Scholar] [CrossRef]
- Athanassoglou, S.; Xepapadeas, A. Pollution control with uncertain stock dynamics: When, and how, to be precautious. J. Environ. Econ. Manag. 2012, 63, 304–320. [Google Scholar] [CrossRef]
- Jiang, K.; You, D.; Li, Z.; Shi, S. A differential game approach to dynamic optimal control strategies for watershed pollution across regional boundaries under eco-compensation criterion. Ecol. Indic. 2019, 105, 229–241. [Google Scholar] [CrossRef]
- Kawaguchi, K. Optimal Control of Pollution Accumulation with Long-Run Average Welfare. Environ. Resour. Econ. 2003, 26, 457–468. [Google Scholar] [CrossRef]
- Kawaguchi, K.; Morimoto, H. Long-run average welfare in a pollution accumulation model. J. Econ. Dyn. Control 2007, 31, 703–720. [Google Scholar] [CrossRef]
- Morimoto, H. Optimal Pollution Control with Long-Run Average Criteria. In Stochastic Control and Mathematical Modeling: Applications in Economics; Encyclopedia of Mathematics and its Applications, Cambridge University Press: Cambridge, UK, 2010; pp. 237–251. [Google Scholar] [CrossRef]
- Jasso-Fuentes, H.; López-Barrientos, J.D. On the use of stochastic differential games against nature to ergodic control problems with unknown parameters. Int. J. Control 2015, 88, 897–909. [Google Scholar] [CrossRef]
- Zhang, G.; Zhang, Z.; Cui, Y.; Yuan, C. Game Model of Enterprises and Government Based on the Tax Preference Policy for Energy Conservation and Emission Reduction. Filomat 2016, 30, 3963–3974. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, G.; Su, B. The spatial impacts of air pollution and socio-economic status on public health: Empirical evidence from China. Socio-Econ. Plan. Sci. 2022, 83, 101167. [Google Scholar] [CrossRef]
- Cox, L.A.T., Jr. Confronting Deep Uncertainties in Risk Analysis. Risk Anal. 2012, 32, 1607–1629. [Google Scholar] [CrossRef]
- López-Barrientos, J.D.; Jasso-Fuentes, H.; Escobedo-Trujillo, B.A. Discounted robust control for Markov diffusion processes. Top 2015, 23, 53–76. [Google Scholar] [CrossRef]
- Escobedo-Trujillo, B.A.; López-Barrientos, J.D.; Garrido-Meléndez, J. A Constrained Markovian Diffusion Model for Controlling the Pollution Accumulation. Mathematics 2021, 9, 1466. [Google Scholar] [CrossRef]
- Borkar, V.; Ghosh, M. Controlled diffusions with constraints. J. Math. Anal. Appl. 1990, 152, 88–108. [Google Scholar] [CrossRef]
- Borkar, V. Controlled diffusions with constraints II. J. Math. Anal. Appl. 1993, 176, 310–321. [Google Scholar] [CrossRef]
- Duncan, T.; Pasik-Duncan, B. Adaptive control of continuous time linear stochastic systems. Math. Control. Signals Syst. 1990, 3, 45–60. [Google Scholar] [CrossRef]
- Escobedo-Trujillo, B.; Hernández-Lerma, O.; Alaffita-Hernández, F. Adaptive control of diffusion processes with a discounted criterion. Appl. Math. 2020, 47, 225–253. [Google Scholar] [CrossRef]
- Arapostathis, A.; Borkar, V.; Ghosh, M. Ergodic control of diffusion processes. In Encyclopedia of Mathematics and its Applications; Cambridge University Press: Cambridge, UK, 2012; Volume 143. [Google Scholar]
- Warga, J. Optimal Control of Differential and Functional Equations; Academic Press: New York, NY, USA, 1972. [Google Scholar]
- Fleming, W.; Nisio, M. On the stochastic relaxed control for partially observed diffusions. Nagoya Math. J. 1984, 93, 71–108. [Google Scholar] [CrossRef]
- Jasso-Fuentes, H.; Yin, G. Advanced Criteria for Controlled Markov-Modulated Diffusions in an Infinite Horizon: Overtaking, Bias, and Blackwell Optimality; Science Press: Beijing, China, 2013. [Google Scholar]
- Jasso-Fuentes, H.; Hernández-Lerma, O. Characterizations of overtaking optimality for controlled diffusion processes. Appl. Math. Optim. 2007, 57, 349–369. [Google Scholar] [CrossRef]
- Pedersen, A.R. Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli 1995, 1, 257–279. [Google Scholar] [CrossRef]
- Ghosh, M.K.; Arapostathis, A.; Marcus, S.I. Optimal control of switching diffusions with application to flexible manufacturing systems. SIAM J. Control Optim. 1993, 31, 1183–1204. [Google Scholar] [CrossRef]
- Rogers, L.; Williams, D. Diffusions, Markov Processes and Martingales, Vol.1, Foundations; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]