Qualitative Properties of Randomized Maximum Entropy Estimates of Probability Density Functions

Yuri S. Popkov

doi:10.3390/math9050548

¹

Federal Research Center “Computer Science and Control” of Russian Academy of Sciences, 119333 Moscow, Russia

²

Institute of Control Sciences of Russian Academy of Sciences, 117997 Moscow, Russia

³

Department of Software Engineering, ORT Braude College, 2161002 Karmiel, Israel

Mathematics2021, 9(5), 548;https://doi.org/10.3390/math9050548

This article belongs to the Special Issue Control, Optimization, and Mathematical Modeling of Complex Systems

Version Notes

Order Reprints

Abstract

The problem of randomized maximum entropy estimation for the probability density function of random model parameters with real data and measurement noises was formulated. This estimation procedure maximizes an information entropy functional on a set of integral equalities depending on the real data set. The technique of the Gâteaux derivatives is developed to solve this problem in analytical form. The probability density function estimates depend on Lagrange multipliers, which are obtained by balancing the model’s output with real data. A global theorem for the implicit dependence of these Lagrange multipliers on the data sample’s length is established using the rotation of homotopic vector fields. A theorem for the asymptotic efficiency of randomized maximum entropy estimate in terms of stationary Lagrange multipliers is formulated and proved. The proposed method is illustrated on the problem of forecasting of the evolution of the thermokarst lake area in Western Siberia.

Keywords:

randomized maximum entropy estimation; probability density functions; Lagrange multipliers; Lyapunov-type problems; implicit function; rotation of vector field; asymptotic efficiency; thermokarst lakes; forecasting

1. Introduction

Estimating the characteristics of models is a very popular and, at the same time, important problem of science. This problem arises in applications with unknown parameters, which have to be estimated somehow using real data sets. In particular, such problems have turned out to be fundamental in machine learning procedures [1,2,3,4,5]. The core of these procedures is a parametrized model trained by statistically estimating the unknown parameters based on real data. Most of the econometric problems associated with reconstructing functional relations and forecasting also reduce to estimating the model parameters; for example, see [6,7].

The problems described above are solved using traditional mathematical statistics methods, such as the maximum likelihood method and its derivatives, the method of moments, Bayesian methods, and their numerous modifications [8,9].

Among the mathematical tools for parametric estimation mentioned, a special place is occupied by entropy maximization methods for finite-dimensional probability distributions [10,11].

Consider a random variable x taking discrete values

x_{1}, \dots, x_{n}

with probabilities

p_{1}, \dots, p_{n},

respectively, and r functions

f_{1} (x), \dots, f_{r} (x)

of this variable with discrete values. The discrete probability distribution function

p (x) = {p_{1} (x_{1}), \dots, p_{n} (x_{n})}

is defined as the solution of the problem

H (p) = - \sum_{i = 1}^{n} p_{i} ln p_{i}, \sum_{i = 1}^{n} p_{i} f_{k} (x_{i}) \leq q_{k}, k = 1, \dots, r,

where

q_{1}, \dots, q_{r}

are given constants.

If

f_{k} (x_{i}) \equiv x_{i}^{k},

then the system of equalities specifies constraints on the kth moments of the discrete random variable

x .

In the case of equality constraints, some modifications of this problem adapted to different applications were studied in [10,11,12,13]. Since this problem is conditionally extremal, it can be solved using the Lagrange method, which leads to a system of equations for Lagrange multipliers. The latter often turn out to be substantially nonlinear functions, and hence, rather sophisticated techniques are needed for their numerical calculation [14,15].

In the case of inequality constraints, this problem belongs to the class of mathematical programming problems [16].

The entropy maximization principle is adopted to estimate the parameters of a priori distributions when constructing Bayesian estimates [17,18] or maximum likelihood estimates.

The parameters of probability distributions (continuous or discrete) can be estimated using various mathematical statistics methods, including the method of entropy maximization. Their efficiency in hydrological problems was compared in [19]. Apparently, the method of entropy maximization yields the best results in such problems due to the structure of hydrological data.

The problem of estimating some model characteristics on real data was further developed in connection with the appearance of new machine learning methods, called randomized machine learning (RML) [20]. They are based on models with random parameters, and it is necessary to estimate the probability density functions of these parameters. The estimation algorithm (RML algorithm) is formulated in terms of functional entropy-linear programming [21].

The original statement of this problem was to estimate probability density functions (PDFs) in RML procedures. However, in recent times, a more general context has been assumed—the method of maximizing entropy functionals for constructing estimates of continuous probability density functions using real data (randomized maximum entropy (RME) estimation).

In this paper, the general RME estimation problem is formulated; its solutions, numerical algorithms, and the asymptotic properties of the solutions are studied. The theoretical results are illustrated by an important application—estimating the evolution of the thermokarst lake area in Western Siberia.

2. Statement of the RME Estimation Problem

Consider a scalar continuous function

φ (x, θ)

with parameters

θ = {θ_{1}, \dots, θ_{n}} .

Assume that this function is a characteristic of an object’s model with an input x and an output

\hat{y}

. Let

x^{(r)} = {x [1], \dots, x [r]}

and

y^{(r)} = {y [1], \dots, y [r]}

be given measurements at time

t = 1, \dots, r

. Note that the latter measurements are obtained with random vector errors

ξ = {ξ [1], \dots, ξ [r]},

which are generally different for different time points.

Thus, after r measurements, the model and observations are described by the equations

\begin{matrix} \hat{y} & = & Γ (x^{(r)}, θ), \\ \hat{v} & = & \hat{y} + ξ, \end{matrix}

(1)

where the vector function

Γ (x^{(r)}, θ)

has the components

φ (x [t], θ)

, where

t = 1, \dots, r

are the time points;

\hat{v}

denotes the observed output of the model containing measurement noises of the object’s output.

Let us introduce a series of assumptions necessary for further considerations.

The random parameters are $θ \in Θ \subset R^{n}, Θ = [θ^{-}, θ^{+}]$ , where $[•, •]$ is a vectorial segment in the space $R^{n}$ [22].
The PDF $P (θ)$ of the parameters is continuously differentiable on its support $Θ .$
The random noise is $ξ \in Ξ \subset R^{r},$ where

$Ξ = ⨂_{t = 1}^{r} Ξ_{t}, Ξ_{t} = [ξ_{t}^{-}, ξ_{t}^{+}] .$

(2)
The PDF $Q (ξ)$ of the measurement noises is continuously differentiable on the support $Ξ$ and also has the multiplicative structure

$Q (ξ) = \prod_{t = 1}^{r} Q_{t} (ξ [t]) .$

(3)

The estimation problem is stated as follows: Find the estimates of the PDFs

P^{*} (θ)

and

Q^{*} (ξ)

that maximize the generalized information entropy functional

H [P (θ), Q (ξ)] = - \int_{Q} P (θ) ln P (θ) d θ - \sum_{t = 1}^{r} \int_{Ξ_{t}} Q_{t} (ξ [t]) ln Q_{t} (ξ [t]) \Rightarrow max

(4)

subject to

—the normalization conditions of the PDFs given by

\int_{Θ} P (θ) d θ = 1; \int_{Ξ_{t}} Q_{t} (ξ [t]) d ξ [t] = 1, t = 1, \dots, r;

(5)

and

—the empirical balance conditions

\begin{matrix} Φ [P (θ), Q (ξ)] & = & y^{(r)}, \\ Φ [P (θ), Q (ξ)] & = & {Φ_{1} [P (θ), Q (ξ)], \dots, Φ_{r} [P (θ), Q (ξ)]} \\ Φ_{t} [P (θ), Q (ξ)] & = & \int_{Θ} φ (x [t], θ) P (θ) d θ + \int_{Ξ_{t}} Q_{t} (ξ [t]) ξ [t] d ξ [t], t = 1, \dots, r, \end{matrix}

(6)

where

y^{(r)} = {y [1], \dots, y [r]}

are the measured data on the object’s output. We will denote the problems (4)–(6) as the RME estimate.

Problems (4)–(6) are of the Lyapunov type [23,24], as they have an integral functional and also integral constraints.

3. Optimality Conditions

The optimality conditions in optimization problems of the Lyapunov type are formulated in terms of Lagrange multipliers. In addition, the Gâteaux derivatives of the problem’s functionals are used [25].

The Lagrange functional is defined by

\begin{matrix} L [P (θ), Q (ξ), μ, η, λ] & = & H [P (θ), Q (ξ)] + μ (1 - \int_{Θ} P (θ) d θ) + \\ + & \sum_{t = 1}^{r} η_{t} (1 - \int_{Ξ_{t}} Q_{t} (ξ [t]) d ξ [t]) + \\ + & \sum_{t = 1}^{r} λ_{t} (y [t] - \int_{Θ} P (θ) φ (x [t], θ) d θ - \int_{Ξ_{t}} Q_{t} (ξ [t]) ξ [t] d ξ [t]) . \end{matrix}

(7)

Let us recall the technique for obtaining optimality conditions in terms of the Gâteaux derivatives [26].

The PDFs

P (θ)

and

Q_{t} (ξ [t])

,

(t = 1, \dots, r),

are continuously differentiable, i.e., belong to the class

C^{1} .

Choosing arbitrary functions

h (θ)

and

w_{t} (ξ [t])

,

(t = 1, \dots, r),

from this class, we represent the PDFs as

P (θ) = P^{*} (θ) + α h (θ); Q_{t} (ξ [t]) = Q_{t}^{*} (ξ [t]) + β_{t} w_{i} (ξ [t]), t = 1, \dots, r,

where the PDFs

P^{*} (θ)

and

Q_{t}^{*} (ξ [t])

are the solutions of problems (4)–(6), and

α

and

β_{1}, \dots, β_{r}

are parameters.

Next, we substitute the above representations of the PDFs into (7). If all functions from

C^{1}

are assumed to be fixed, the Lagrange functional depends on the parameters

α

and

β_{1}, \dots, β_{r}

. Then, the first-order optimality conditions for the functional (7) in terms of the Gâteaux derivative take the form

\frac{d L}{d α} |_{(α, β) = 0} = 0, \frac{\partial L}{\partial β_{t}} |_{(α, β) = 0} = 0, t = 1, \dots, r .

These conditions lead to the following system of integral equations:

\int_{Θ} h (θ) Ω (θ) d θ = 0, \int_{Ξ_{t}} w_{t} (ξ [t]) Υ_{t} (ξ [t]) d ξ [t] = 0, t = 1, \dots, r,

which are satisfied for any functions

h (θ)

and

w_{1} (ξ [1]), \dots, w_{r} (ξ [r])

from

C^{1}

if and only if

Ω (θ) = 0, Υ_{t} (ξ [t]) = 0, t = 1, \dots, r .

The optimality conditions for problems (4)–(6) are given by

\begin{matrix} Ω (θ) = ln P^{*} (θ) + 1 - μ - \sum_{t = 1}^{s} λ_{t} φ (x [t], θ) = 0, \end{matrix}

(8)

\begin{matrix} Υ_{t} (ξ [t]) = ln Q_{t}^{*} (ξ [t]) + 1 - η_{t} - λ_{t} ξ [t] = 0, t = 1, \dots, r . \end{matrix}

(9)

Hence, the entropy-optimal PDFs of the model parameters and measurement noises have the form

\begin{matrix} P^{*} (θ | y^{(r)}, x^{(r)}) & = & \frac{exp (- \sum_{j = 1}^{r} λ_{j} (y^{(r)}, x^{(r)}) φ (x [j], θ))}{P (λ (y^{(r)}, x^{(r)})}, \\ Q_{t}^{*} (ξ [t] | y^{(r)}, x^{(r)}) & = & \frac{exp (λ_{t} (y^{(r)}, x^{(r)}) ξ [t])}{Q_{t} (λ_{t} (y^{(r)}, x^{(r)})}, t = 1, \dots, r, \end{matrix}

(10)

where

\begin{matrix} P (λ (y^{(r)}, x^{(r)}) & = & \int_{Θ} exp (- \sum_{j = 1}^{r} λ_{j} (y^{(r)}, x^{(r)}) φ (x [j], θ)) d θ, \\ Q_{t} (λ_{t} (y^{(r)}, x^{(r)}) & = & \int_{Ξ_{t}} exp (λ_{t} (y^{(r)}, x^{(r)}) ξ [t]) d ξ [t], t = 1, \dots, r . \end{matrix}

(11)

Due to equalities (10) and (11), the entropy-optimal PDFs are parametrized by the Lagrange multipliers

λ_{1}, \dots, λ_{r},

which represent the solutions of the empirical balance equations

\frac{G (λ (y^{(r)}, x^{(r)}))}{P (λ (y^{(r)}, x^{(r)}))} + \frac{E_{t} (λ_{t} (y^{(r)}, x^{(r)}))}{Q_{t} (λ_{t} (y^{(r)}, x^{(r)}))} = y [t], t = 1, \dots, r,

(12)

where

\begin{matrix} G (λ (y^{(r)}, x^{(r)})) = \int_{Θ} φ (x [t], θ) exp (- \sum_{j = 1}^{r} λ_{j} (y^{(r)}, x^{(r)}) φ (x [j], θ)) d θ, \\ E_{t} (λ_{t} (y^{(r)}, x^{(r)})) = \int_{Ξ_{t}} ξ [t] exp (- λ_{t} (y^{(r)}, x^{(r)}) ξ [t]) d ξ [t], t = 1, \dots, r . \end{matrix}

(13)

The solution

λ^{*} (y^{(r)}, x^{(r)})

of these equations depends on the sample

(y^{(r)}, x^{(r)})

used for constructing the RME estimates of the PDFs.

4. Existence of an Implicit Function

The second term in the balance Equations (12) and (13) is the mean value of the noise in each measurement t. The noises and their characteristics are often assumed to be equal over the measurements:

ξ^{-} \leq ξ [t] \leq ξ^{+}, t = 1, \dots, r .

(14)

Therefore, the mean value of the noise is given by

\bar{ξ} = \frac{E_{t} (λ_{t} (y^{(r)}, x^{(r)}))}{Q_{t} (λ_{t} (y^{(r)}, x^{(r)}))}, ξ^{-} \leq \bar{ξ} \leq ξ^{+} .

(15)

The balance equations can be written as

\begin{matrix} W_{t} (λ | \tilde{y} [t], x^{(r)}) & = & \int_{Θ} (φ (x [t], θ) - \tilde{y} [t]) exp (- \sum_{j = 1}^{r} λ_{j} ({\tilde{y}}^{(r)}, x^{(r)}) φ (x [j], θ)) d θ = 0, \\ t & = & 1, \dots, r, \end{matrix}

(16)

where

\tilde{y} [t] = y [t] - \bar{ξ}, {\tilde{y}}^{(r)} = {\tilde{y} [1], \dots, \tilde{y} [r]} .

(17)

In the vector form, Equation (16) is described by

W (λ | {\tilde{y}}^{(r)}, x^{(r)}) = 0 .

(18)

Equation (21) defines an implicit function

λ ({\tilde{y}}^{(r)}, x^{(r)})

. The existence and properties of this implicit function depend on the properties of the Jacobian matrix

J_{λ} (λ | {\tilde{y}}^{(r)}, x^{(r)}) = [\frac{\partial W_{t}}{\partial λ_{i}} | (t, i) = 1, \dots, r],

(19)

which has the elements

\begin{matrix} \frac{\partial W_{t}}{\partial λ_{i}} = \int_{Q} (φ (x [t], θ) - \tilde{y} [t]) φ (x [i], θ) \sum_{j = 1}^{r} exp (- \sum_{j = 1}^{r} λ_{j} φ (x [j], θ)) d θ . \end{matrix}

(20)

Theorem 1.

Let the next conditions be valid (assume that):

The function $φ (x^{(r)}, θ)$ is continuous in all variables.
For any $(x^{(r)}, {\tilde{y}}^{(r)}) \in R^{r} \times R^{r},$

$det J_{λ} (λ | {\tilde{y}}^{(r)}, x^{(r)}) \neq 0,$

(21)

$lim_{∥ λ ∥ \to \infty} W (λ | {\tilde{y}}^{(r)}, x^{(r)}) = \pm \infty .$

(22)

Then, there exists a unique implicit function

λ ({\tilde{y}}^{(r)}, x^{(r)},)

defined on

R^{r} \times R^{r}

.

Proof of Theorem 1.

Due to the first assumption, the continuous function

W (λ | {\tilde{y}}^{(r)}, x^{(r)})

induces the vector field

Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) = W (λ | {\tilde{y}}^{(r)}, x^{(r)})

in the space

R^{r} \times R^{r} .

We choose an arbitrary vector

u

in

R^{r}

and define the vector field

Π_{u} (λ) = Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) - u .

By condition (22), the field

Π_{u} (λ)

with a fixed vector

u

has no zeros on the spheres

∥ λ ∥ = ϱ

of a sufficiently large radius

ϱ .

Hence, a rotation is well defined on the spheres

∥ λ ∥ = ϱ

of a sufficiently large radius

ϱ .

For details, see [27].

Consider the two vector fields

Π_{u^{(1)}} (λ) = Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) - u^{(1)}, Π_{u^{(2)}} (λ) = Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) - u^{(2)} .

These vector fields are homotopic on the spheres of a sufficiently large radius, i.e., the field

Ω (λ) = α Π_{u^{(1)}} (λ) + (1 - α) Π_{u^{(2)}} (λ) = Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) - [α u^{(1)} + (1 - α) u^{(2)}]

has no zeros on the spheres of a sufficiently large radius for any

α \in [0, 1] .

Homotopic fields have identical rotations [27]:

γ (Π_{u^{(1)}} (λ)) = γ (Π_{u^{(2)}} (λ)) .

The vector fields

Π_{u^{(1)}} (λ)

and

Π_{u^{(2)}} (λ)

are nondegenerate on the spheres of a sufficiently large radius; in the ball

∥ λ ∥ \leq ϱ_{1} < ϱ,

however, each of them may have a number of singular points. We denote by

κ (u^{(1)})

and

κ (u^{(2)})

the numbers of singular points of the vector fields

Π_{u^{(1)}} (λ)

and

Π_{u^{(2)}} (λ),

respectively. As the vector fields are homotopic,

κ (u^{(1)}) = κ (u^{(2)}) = κ .

In view of (21), these singular points are isolated.

Now, let us utilize the index of a singular point introduced in [27]:

ind (λ^{0}) = {(- 1)}^{β (λ^{0})},

where

β (λ^{0})

is the number of eigenvalues of the matrix

Π_{u}^{'} (λ^{0}) = J_{λ} (λ^{0} |, {\tilde{y}}^{(r)}, x^{(r)})

with the negative real part. By definition, the value of this index depends not on the magnitude of

β (λ^{0})

, but on its parity. Due to condition (21), all singular points have the same parity. Really,

J_{λ} (λ^{0} | {\tilde{y}}^{(r)}, x^{(r)}) \neq 0

, and hence, for any

{\tilde{y}}^{(r)}, x^{(r)} \in R^{r} \times R^{r}

, the eigenvalues of the matrix

J_{λ} (λ^{0} | {\tilde{y}}^{(r)}, x^{(r)})

may move from the left half-plane to the right one in pairs only: Real eigenvalues are transformed into pairs of complex–conjugate ones, passing through the imaginary axis.

In view of this fact, the rotation of the homotopic fields (20) is given by

γ (Π_{u}) = κ {(- 1)}^{β},

where

β

is the number of eigenvalues of the matrix

Π_{u}^{'} (λ)

for some singular point.

It remains to demonstrate that the vector field

Π_{u} (λ)

has a unique singular point in the ball

∥ λ ∥ \leq ϱ_{1} < ϱ .

Consider the equation

Π_{u} (λ) = Φ_{({\tilde{y}}^{(r)}, x^{(r)})} (λ) - u = 0 .

Assume that for each fixed pair

({\tilde{y}}^{(r)}, x^{(r)}),

this equation has

κ

singular points, i.e., the functions

λ^{(1)} ({\tilde{y}}^{(r)}, x^{(r)}), \dots, λ^{(κ)} ({\tilde{y}}^{(r)}, x^{(r)}) .

Therefore, it defines a multivalued function

λ ({\tilde{y}}^{(r)}, x^{(r)})

, whose

κ

branches are isolated (the latter property follows from the isolation of the singular points). Due to condition (21), each of the branches

λ^{(i)} ({\tilde{y}}^{(r)}, x^{(r)})

defines an open set in the space

R^{r},

and

⋃_{i = 1}^{κ} λ^{(i)} ({\tilde{y}}^{(r)}, x^{(r)}) = R^{r} .

This is possible if and only if

κ = 1 .

Hence, for each pair

({\tilde{y}}^{(r)}, x^{(r)})

from

R^{r} \times R^{r},

there exists a unique function

λ^{*} ({\tilde{y}}^{(r)}, x^{(r)})

for which the function

W (λ | {\tilde{y}}^{(r)}, x^{(r)})

will vanish. □

Theorem 2.

Under the assumptions of Theorem 1, the function

λ ({\tilde{y}}^{(r)}, x^{(r)})

is real analytical in all variables.

Proof of Theorem 2.

From (15), it follows that the function

W (λ | {\tilde{y}}^{(r)}, x^{(r)})

is analytical in all variables. Therefore, the left-hand side of Equation (15) can be expanded into the generalized Taylor series [26], and the solution can be constructed in the form of the generalized Taylor series as well. The power elements of this series are determined using a recursive procedure. □

5. Asymptotic Efficiency of RME Estimates

The RME estimate yields the entropy-optimal PDFs (10) for the arrays of input and output data, each of size

r .

For the sake of convenience, consider the PDFs parametrized by the exponential Lagrange multipliers

z = exp (- λ) .

Then, equalities (10) take the form

\begin{matrix} P^{*} (θ, z (y^{(r)}, x^{(r)})) & = & \frac{\prod_{j = 1}^{r} {[z_{j} (y^{(r)}, x^{(r)})]}^{φ (x [j], θ)}}{\int_{Θ} \prod_{j = 1}^{r} {[z_{j} (y^{(r)}, x^{(r)})]}^{φ (x [j], θ)} d θ}, \\ Q_{t}^{*} (ξ [t], z_{t} (y^{(r)}, x^{(r)})) & = & \frac{{[z_{t} (y^{(r)}, x^{(r)})]}^{ξ [t]}}{\int_{Ξ_{t}} {[z_{t} (y^{(r)}, x^{(r)})]}^{ξ [t]} d ξ [t]}, t = 1, \dots, r . \end{matrix}

(23)

Consequently, the structure of the PDF significantly depends on the values of the exponential Lagrange multipliers

z,

which, in turn, depend on the data arrays

y^{(r)}

and

x^{(r)} .

Definition 1.

The estimates

P^{*} (θ, z^{*})

and

Q_{t}^{*} (ξ [t], z_{t}^{*})

are said to be asymptotically efficient if

\begin{matrix} lim_{r \to \infty} P^{*} (θ, z (y^{(r)}, x^{(r)})) & = & P^{*} (θ, z^{*}), \\ lim_{r \to \infty} Q_{t}^{*} (ξ [t], z_{t} (y^{(r)}, x^{(r)})) & = & Q_{t}^{*} (ξ [t], z_{t}^{*}), t = 1, \dots, r; \end{matrix}

(24)

where

z^{*} = lim_{r \to \infty} z (y^{(r)}, x^{(r)}) .

(25)

Consider the empirical balance Equation (21), written in terms of the exponential Lagrange multipliers:

\begin{matrix} Φ_{t} (z, {\tilde{y}}^{(r)}, x^{(r)}) = \int_{Θ} \prod_{j = 1}^{r} {[z_{j} ({\tilde{y}}^{(r)}, x^{(r)})]}^{φ (x [j], θ)} (φ (x [t], θ) - \tilde{y} [t]) d θ = 0, t = 1, \dots, r . \end{matrix}

(26)

As has been demonstrated above, Equation (26) defines an implicit analytical function

z = z ({\tilde{y}}^{(r)}, x^{(r)})

for

({\tilde{y}}^{(r)}, x^{(r)}) \in R^{r} \times R^{r} .

Differentiating the left- and right-hand sides of these equations with respect to

{\tilde{y}}^{(r)}

and

x^{(r)}

yields

\begin{matrix} \frac{\partial z}{\partial {\tilde{y}}^{(r)}} & = & - {[\frac{\partial Φ}{\partial z}]}^{- 1} \frac{\partial Φ}{\partial {\tilde{y}}^{(r)}}, \\ \frac{\partial z}{\partial x^{(r)}} & = & - {[\frac{\partial Φ}{\partial z}]}^{- 1} \frac{\partial Φ}{\partial x^{(r)}} . \end{matrix}

(27)

Then, passing to the norms and using the inequality for the norm of the product of matrices [28], we obtain the equalities

\begin{matrix} 0 \leq ∥\frac{\partial z}{\partial {\tilde{y}}^{(r)}}∥ & \leq & ∥{[\frac{\partial Φ}{\partial z}]}^{- 1}∥ ∥\frac{\partial Φ}{\partial {\tilde{y}}^{(r)}}∥, \\ 0 \leq ∥\frac{\partial z}{\partial x^{(r)}}∥ & \leq & ∥{[\frac{\partial Φ}{\partial z}]}^{- 1}∥ ∥\frac{\partial Φ}{\partial x^{(r)}}∥ . \end{matrix}

(28)

Both of the inequalities incorporate the norm of the inverse matrix

∥{[\frac{\partial Φ}{\partial z}]}^{- 1}∥ .

Lemma 1.

Let a square matrix A be nonsingular, i.e.,

det A \neq 0 .

Then, there exists a constant

α > 1

such that

\frac{1}{∥ A ∥} \leq ∥ A^{- 1} ∥ \leq \frac{α}{∥ A ∥} .

(29)

Proof of Lemma 1.

Since the matrix A is nondegenerate, the elements

a_{i k}^{(- 1)}

of the inverse matrix

A^{- 1}

can be expressed in terms of the algebraic complement (adjunct) of the element

a_{k i}

in the determinant of the matrix A [28]:

a_{i k}^{(- 1)} = \frac{A_{k i}}{det A}, (k, i) = 1, \dots, r,

and they are bounded:

a_{i k}^{(- 1)} \leq M < \infty, ∥ A^{- 1} ∥ < \infty .

Hence, there exists a constant

α > 1

for which inequality (29) is satisfied. □

Lemma 1 can be applied to the norm

∥{[\frac{\partial Φ}{\partial z}]}^{- 1}∥

of the inverse matrix. As a result,

{(∥\frac{\partial Φ}{\partial z}∥)}^{- 1} \leq ∥{[\frac{\partial Φ}{\partial z}]}^{- 1}∥ \leq α {(∥\frac{\partial Φ}{\partial z}∥)}^{- 1},

(30)

where

∥\frac{\partial Φ}{\partial z}∥ = r max_{t, j} |\frac{\partial Φ_{t}}{\partial z_{j}}| .

(31)

Lemma 2.

Let

∥ \frac{\partial Φ}{\partial {\tilde{y}}^{(r)}} ∥ \leq ϱ < \infty, ∥ \frac{\partial Φ}{\partial x^{(r)}} ∥ \leq ω < \infty .

(32)

Then,

lim_{r \to \infty} ∥ \frac{\partial z}{\partial {\tilde{y}}^{(r)}} ∥ = lim_{r \to \infty} ∥ \frac{\partial z}{\partial x^{(r)}} ∥ = 0 .

(33)

Proof of Lemma 2.

According to (28), (31), and (32) we have:

∥ \frac{\partial z}{\partial {\tilde{y}}^{(r)}} ∥ \leq \frac{1}{r} (\frac{ϱ}{b}), ∥ \frac{\partial z}{\partial {\tilde{x}}^{(r)}} ∥ \leq \frac{1}{r} (\frac{ω}{b}),

where

b = {max}_{t, j} |\frac{\partial Φ_{t}}{\partial z_{j}}|

.

Whence, it follows that for the sample length

r \to \infty

, the norms of relevant Jacobians tend to zero, and function

z = z ({\tilde{y}}^{(r)}, x^{(r)})

tends to the vector

z^{*}

(25). □

6. Thermokarst Lake Area Evolution in Western Siberia: RME Estimation and Testing

Permafrost zones, which occupy a significant part of the Earth’s surface, are the locales of thermokarst lakes, which accumulate greenhouse gases (methane and carbon dioxide). These gases make a considerable contribution to global climate change.

The source data in studies of the evolution of thermokarst lake areas are acquired through remote sensing of the Earth’s surface and ground measurements of meteorological parameters [29,30].

The state of thermokarst lakes is characterized by their total area

S [t]

in a given region, measured in hectares (ha), and the factors influencing thermokarst formations—the average annual temperatures

T [t],

measured in Celsius (C

^{\circ}

), and the annual precipitation

R [t],

measured in millimeters (mm), where t denotes the calendar year.

We used the remote sensing data and ground measurements of the meteorological parameters for a region of Western Siberia between

65^{\circ}

N–

70^{\circ}

N and

65^{\circ}

E–

95^{\circ}

E that were presented in [31]. We divided the available time series into two groups, which formed the training collection

L

(

t = 0, \dots, 24)

and the testing collection

T

(

t = 25, \dots, 35)

.

6.1. RME Estimation of Model Parameters and Measurement Noises

The temporal evolution of the lake area

S [t]

is described by the following dynamic regression equation with two influencing factors, the average annual temperature

T [t]

and the annual precipitation

R [t]

:

\begin{matrix} \hat{S} [t] & = & a_{0} + \sum_{k = 1}^{p} a_{k} \hat{S} [t - k] + a_{(p + 1)} T [t] + a_{(p + 2)} (R [t], \\ \hat{v} [t] & = & \hat{S} [t] + ξ [t] . \end{matrix}

(34)

The model parameters and measurement noises are assumed to be random and of the interval type:

\begin{matrix} a_{k} \in A_{k} = [a_{k}^{-}, a_{k}^{+}], k = 0, d o t s, p + 2, \\ a = {a_{0}, \dots, a_{p}, a_{p + 1}, a_{p + 2}} \in A = ⋃_{k = 0}^{p + 2} A_{k} . \end{matrix}

The probabilistic properties of the parameters are characterized by a PDF

P (a) .

The variable

\hat{v} [t]

is the observed output of the model, and the values of the random measurement noise

ξ [t]

at different time instants t may belong to different ranges:

ξ [t] \in Ξ_{t} = [ξ^{-} [t], ξ^{+} [t]],

(35)

with a PDF

Q_{t} (ξ [t])

,

(t = 0, \dots, N)

, where N denotes the length of the observation interval. The order

p = 4

and the parameter ranges for the dynamic randomized regression model (34) (see Table 1 below) were calculated based on real data using the empirical correlation functions and the least-square estimates of the residual variances.

Table 1. Parameter ranges for the model.

For the training collection

L,

the model can be written in the vector–matrix form

\begin{matrix} \hat{S} & = & \hat{S} a + a_{5} T + a_{6} R, \\ \hat{v} & = & \hat{S} + ξ, \end{matrix}

(36)

with the matrix

\hat{S} = (\begin{matrix} 1 & \hat{S} [3] & \dots & \hat{S} [0] \\ 1 & \hat{S} [4] & \dots & \hat{S} [1] \\ \dots & \dots & \dots & \dots \\ 1 & \hat{S} [23] & \dots & \hat{S} [20] \end{matrix})

(37)

and the vectors

\hat{S} = [\hat{S} [4], \dots, \hat{S} [24]], T = [T [4], \dots, T [24]], R = [R [4], \dots, R [24]],

and

\hat{v} = [v [4], \dots, v [24]]; ξ = [ξ [4], \dots, ξ [24]]

.

The RME estimation procedure yielded the following entropy-optimal PDFs of the model parameters (36) and measurement noises:

\begin{matrix} P^{*} (a, λ) = \prod_{k = 0}^{6} \frac{exp (- q_{k} a_{k})}{P_{k} (λ)}, P_{k} (λ) = \int_{A_{∥}} exp (- q_{k} a_{k}) d a_{k}, \\ q_{0} = \sum_{t = 4}^{24} λ_{n}, q_{k} = \sum_{t = p}^{24} λ_{n} S [t - k], k = 1, \dots, 4, \\ q_{5} = \sum_{t = 4}^{24} λ_{t} T [t], q_{6} = \sum_{t = p}^{24} λ_{t} R [t], \\ Q^{*} (ξ, \bar{λ}) = \frac{exp (- \bar{λ} ξ)}{Q}, Q = \int_{Ξ} exp (- \bar{λ} ξ) d ξ, \bar{λ} = \frac{q_{0}}{20} . \end{matrix}

(38)

Note that

S [t - k], T [t],

and

R [t]

are the data from the collection

L

. The two-dimensional sections of the function

P^{*} (a)

and the function

Q^{*} (ξ)

are shown in Figure 1.

Figure 1. Two-dimensional section of the function P* and the function Q*.

6.2. Testing

Testing was performed using the data from the collection

T

, which included the lake area

S [t]

, the average annual temperature

T [t]

, and the annual precipitation

R [t]

,

t = 25, \dots, 35

. An ensemble of trajectories of the model’s observed output

v [t]

was generated using Monte Carlo simulations and sampling of the entropy-optimal PDFs

P^{*} (a)

,

Q^{*} ξ

on the testing interval. In addition, the trajectory of the empirical means

\bar{v} [t]

and the dimensions of the empirical standard deviation area were calculated.

The quality of RME estimation was characterized by the absolute and relative errors:

\begin{matrix} A b s E r r = \sqrt{\sum_{t = 26}^{35} {(S [t] - \bar{v} [t])}^{2}} = 0.3446, \end{matrix}

(39)

\begin{matrix} R e l E r r = \frac{\sqrt{\sum_{t = 26}^{35} {(S [t] - \bar{v} [t])}^{2}}}{\sqrt{\sum_{t = 26}^{35} S^{2} [t]} + \sqrt{\sum_{t = 26}^{35} {\bar{v}}^{2} [t]}} = 0.0089 . \end{matrix}

(40)

The generated ensemble of the trajectories is shown in Figure 2.

Figure 2. Ensemble of the trajectories (gray domain), the standard deviation area (dark gray domain), the empirical mean trajectory, and the lake area data.

7. Discussion

Given an available data collection, the RME procedure allows estimation of the PDFs of a model’s random parameters under measurement noises corresponding to the maximum uncertainty (maximum entropy). In addition, this procedure needs no assumptions about the structure of the estimated PDFs or the statistical properties of the data and measurement noises.

An entropy-optimal model can be simulated by sampling the PDFs to generate an empirical ensemble of a model’s output trajectories and to calculate its empirical characteristics (the mean and median trajectories, the standard deviation area, interquartile sets, and others).

The RME procedure was illustrated with an example of the estimation of the parameters of a linear regression model for the evolution of the thermokarst lake area in Western Siberia. In this example, the procedure demonstrated a good estimation accuracy.

However, these positive features of the procedure were achieved with computational costs. Despite their analytical structure, the RME estimates of the PDFs depend on Lagrange multipliers, which are determined by solving the balance equations with the so-called integral components (the mathematical expectations of random parameters and measurement noises). Calculating the values of multidimensional integrals may require appropriate computing resources.

8. Conclusions

The problem of randomized maximum entropy estimation of a probability density function based on real available data has been formulated and solved. The developed estimation algorithm (the RME algorithm) finds the conditional maximum of an information entropy functional on a set of admissible probability density functions characterized by the empirical balance equations for Lagrange multipliers. These equations define an implicit dependence of the Lagrange multipliers on the data collection. The existence of such an implicit function for any values in a data collection has been established. The function’s behavior for a data collection of a greater size has been studied, and the asymptotic efficiency of the RME estimates has been proved.

The positive features of RME estimates have been illustrated with an example of estimation and testing a linear dynamic regression model of the evolution of the thermokarst lake area in Western Siberia with real data.

Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation, project no. 075-15-2020-799.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
Witten, I.H.; Eibe, F. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Heidelberg, Germany, 2005. [Google Scholar]
Bishop, C.M. Pattern Recognition and Machine Learning. Series: Information Theory and Statistics; Springer: New York, NY, USA, 2006. [Google Scholar]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2001. [Google Scholar]
Vorontsov, K.V. Mathematical Methods of Learning by Precedents: A Course of Lectures; Moscow Institute of Physics and Technology: Moscow, Russia, 2013. [Google Scholar]
Goldberger, A.S. A Course in Econometrics; Harvard University Press: Cambridge, UK, 1991. [Google Scholar]
Aivazyan, S.A.; Enyukov, I.S.; Meshalkin, L.D. Prikladnaya Statistika: Issledovanie Zavisimostei (Applied Statistics: Study of Dependencies); Finansy i Statistika: Moscow, Russia, 1985. [Google Scholar]
Lagutin, M.B. Naglyadnaya Matematicheskaya Statistika (Visual Mathematical Statistics); BINOM, Laboratoriya Znanii: Moscow, Russia, 2013. [Google Scholar]
Roussas, G. A Course of the Mathematical Statistics; Academic Press: San Diego, CA, USA, 2015. [Google Scholar]
Malouf, R. A comparison of algorithms for maximum entropy parameters estimation. In Proceedings of the 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Taipei, Taiwan, 31 August–1 September 2002; Volume 20, pp. 1–7. [Google Scholar]
Borwein, J.; Choksi, R.; Marechal, P. Probability distribution of assets inferred from option prices via principle of maximum entropy. SIAM J. Optim. 2003, 14, 464–478. [Google Scholar] [CrossRef]
Golan, A.; Judge, G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1997. [Google Scholar]
Golan, A. Information and Entropy econometrics—A review and synthesis. Found. Trends Econom. 2008, 2, 1–145. [Google Scholar] [CrossRef]
Csiszar, I.; Matus, F. On minimization of entropy functionals under moment constraints. In Proceedings of the IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008. [Google Scholar]
Loubes, J.-M. Approximate maximum entropy on the mean for instrumental variable regression. Stat. Probab. Lett. 2012, 82, 972–978. [Google Scholar] [CrossRef]
Borwein, J.M.; Lewis, A.S. Partially-finite programming in L₁ and existence of maximum entropy estimates. SIAM J. Optim. 1993, 3, 248–267. [Google Scholar] [CrossRef]
Burg, J.P. The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics 1972, 37, 375–376. [Google Scholar] [CrossRef]
Christakos, G. A Bayesian/maximum entropy view to the spatial estimation problem. Math. Geol. 1990, 22, 763–777. [Google Scholar] [CrossRef]
Singh, V.P.; Guo, H. Parameter estimation for 3-parameter generalized Pareto distribution by the principle of maximum entropy. Hydrol. Sci. J. 1994, 40, 165–181. [Google Scholar] [CrossRef]
Popkov, Y.S.; Dubnov, Y.A.; Popkov, A.Y. Randomized machine learning: Statement, solution, applications. In Proceedings of the 2016 IEEE 8th International Conference on Intelligent Systems (IS), Sofia, Bulgaria, 4–6 September 2016. [Google Scholar] [CrossRef]
Popkov, A.Y.; Popkov, Y.S. New methods of entropy-robust estimation for randomized models under limited data. Entropy 2014, 16, 675–698. [Google Scholar] [CrossRef]
Krasnosel’skii, M.A.; Vainikko, G.M.; Zabreyko, R.P.; Ruticki, Y.B.; Stet’senko, V.V. Approximate Solutions of Operator Equations; Wolters-Noordhoff Publishing: Groningen, The Netherlands, 1972. [Google Scholar] [CrossRef]
Ioffe, A.D.; Tikhomirov, V.M. Theory of Extremal Problems; Elsevier: New York, NY, USA, 1974. [Google Scholar]
Alekseev, V.M.; Tikhomirov, V.M.; Fomin, S.V. Optimal Control; Springer: Boston, MA, USA, 1987. [Google Scholar]
Kaashoek, M.A.; van der Mee, C. Recent Advances in Operator Theory and Its Applications; Birkhäuser Basel: Basel, Switzerland, 2006. [Google Scholar]
Kolmogorov, A.N.; Fomin, S.V. Elements of the Theory of Functions and Functional Analysis; Dover Publication: New York, NY, USA, 1999. [Google Scholar]
Krasnoselskii, M.A.; Zabreiko, P.P. Geometrical Methods of Nonlinear Analysis; Springer: Berlin, Germany; New York, NY, USA, 1984. [Google Scholar]
Gantmacher, F.R.; Brenner, J.L. Applications of the Theory of Matrices; Dover: New York, NY, USA, 2005. [Google Scholar]
Riordan, B.; Verbula, D.; McGruire, A.D. Shrinking ponds in subarctic Alaska based on 1950–2002 remotely sensed images. J. Geophys. Res. 2006, 111, G04002. [Google Scholar] [CrossRef]
Kirpotin, S.; Polishchuk, Y.; Bruksina, N. Abrupt changes of thermokarst lakes in Western Siberia: Impacts of climatic warming on permafrost melting. Int. J. Environ. Stud. 2009, 66, 423–431. [Google Scholar] [CrossRef]
Western Siberia Thermokarsk Lakes Dataset. Available online: https://cloud.uriit.ru/index.php/s/0DOrxL9RmGqXsV0 (accessed on 20 February 2021).

Figure 1. Two-dimensional section of the function P* and the function Q*.

Figure 2. Ensemble of the trajectories (gray domain), the standard deviation area (dark gray domain), the empirical mean trajectory, and the lake area data.

Table 1. Parameter ranges for the model.

a	$a_{0}$	$a_{1}$	$a_{2}$	$a_{3}$	$a_{4}$	$a_{5}$	$a_{6}$
$a^{-}$	−0.50	−0.14	−0.49	−0.53	−0.44	0.46	0.19
$a^{+}$	0.07	0.52	0.20	0.19	0.19	1.14	0.88

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Qualitative Properties of Randomized Maximum Entropy Estimates of Probability Density Functions

Abstract

1. Introduction

2. Statement of the RME Estimation Problem

3. Optimality Conditions

4. Existence of an Implicit Function

5. Asymptotic Efficiency of RME Estimates

6. Thermokarst Lake Area Evolution in Western Siberia: RME Estimation and Testing

6.1. RME Estimation of Model Parameters and Measurement Noises

6.2. Testing

7. Discussion

8. Conclusions

Funding

Institutional Review Board Statement

Informed Consent Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics