Qualitative Properties of Randomized Maximum Entropy Estimates of Probability Density Functions

Abstract: The problem of randomized maximum entropy estimation of the probability density function of random model parameters from real data with measurement noises is formulated. This estimation procedure maximizes an information entropy functional on a set of integral equalities depending on the real data set. The technique of Gâteaux derivatives is developed to solve this problem in analytical form. The probability density function estimates depend on Lagrange multipliers, which are obtained by balancing the model's output with the real data. A global theorem on the implicit dependence of these Lagrange multipliers on the length of the data sample is established using the rotation of homotopic vector fields. A theorem on the asymptotic efficiency of the randomized maximum entropy estimate in terms of stationary Lagrange multipliers is formulated and proved. The proposed method is illustrated on the problem of forecasting the evolution of the thermokarst lake area in Western Siberia.


Introduction
Estimating the characteristics of models is a common and, at the same time, important problem in science. It arises in applications with unknown parameters that have to be estimated from real data sets. In particular, such problems have turned out to be fundamental in machine learning procedures [1][2][3][4][5]. The core of these procedures is a parametrized model trained by statistically estimating the unknown parameters based on real data. Most econometric problems associated with reconstructing functional relations and forecasting also reduce to estimating model parameters; see, for example, [6,7].
The problems described above are solved using traditional mathematical statistics methods, such as the maximum likelihood method and its derivatives, the method of moments, Bayesian methods, and their numerous modifications [8,9].
Among the mathematical tools for parametric estimation mentioned, a special place is occupied by entropy maximization methods for finite-dimensional probability distributions [10,11].
Consider a random variable x taking discrete values x_1, …, x_n with probabilities p_1, …, p_n, respectively, and r functions f_1(x), …, f_r(x) of this variable with discrete values. The discrete probability distribution function p(x) = {p_1(x_1), …, p_n(x_n)} is defined as the solution of the problem

$$
H(p) = -\sum_{i=1}^{n} p_i \ln p_i \to \max, \qquad \sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i f_k(x_i) = q_k, \quad k = 1, \dots, r,
$$

where q_1, …, q_r are given constants.
If f_k(x_i) ≡ x_i^k, then the system of equalities specifies constraints on the kth moments of the discrete random variable x. In the case of equality constraints, modifications of this problem adapted to different applications were studied in [10][11][12][13]. Since this problem is conditionally extremal, it can be solved by the Lagrange method, which leads to a system of equations for the Lagrange multipliers. The latter often turn out to be substantially nonlinear functions, and hence rather sophisticated techniques are needed for their numerical calculation [14,15].
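As a concrete illustration, the discrete maximum entropy problem above can be solved numerically through its exponential-family form. The sketch below (using `numpy` and `scipy`; all names and the single-constraint instance are illustrative, not taken from the paper) solves the nonlinear balance equations for the Lagrange multipliers with one first-moment constraint.

```python
import numpy as np
from scipy.optimize import fsolve

def maxent_discrete(x, f_list, q):
    """Maximum entropy pmf p on points x subject to sum_i p_i * f_k(x_i) = q_k.
    The Lagrange method gives p_i proportional to exp(-sum_k lam_k f_k(x_i));
    the multipliers lam solve the (generally nonlinear) balance equations."""
    F = np.array([[f(xi) for xi in x] for f in f_list])   # r x n matrix f_k(x_i)

    def balance(lam):
        w = np.exp(-lam @ F)        # unnormalized exponential weights
        p = w / w.sum()
        return F @ p - q            # residuals of the moment constraints

    lam = fsolve(balance, np.zeros(len(f_list)))
    w = np.exp(-lam @ F)
    return w / w.sum(), lam

# One first-moment constraint: E[x] = 1.0 on the support {0, 1, 2, 3}
x = np.arange(4)
p, lam = maxent_discrete(x, [lambda t: float(t)], np.array([1.0]))
```

Because the required mean 1.0 is below the uniform mean 1.5, the multiplier comes out positive and the probabilities decay in x, as expected for an exponential-family solution.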
In the case of inequality constraints, this problem belongs to the class of mathematical programming problems [16].
The entropy maximization principle is adopted to estimate the parameters of a priori distributions when constructing Bayesian estimates [17,18] or maximum likelihood estimates.
The parameters of probability distributions (continuous or discrete) can be estimated using various mathematical statistics methods, including the method of entropy maximization. Their efficiency in hydrological problems was compared in [19]. Apparently, the method of entropy maximization yields the best results in such problems due to the structure of hydrological data.
The problem of estimating some model characteristics on real data was further developed in connection with the appearance of new machine learning methods, called randomized machine learning (RML) [20]. They are based on models with random parameters, and it is necessary to estimate the probability density functions of these parameters. The estimation algorithm (RML algorithm) is formulated in terms of functional entropy-linear programming [21].
The original statement of this problem concerned estimating probability density functions (PDFs) in RML procedures. More recently, however, a more general context has been adopted: the method of maximizing entropy functionals to construct estimates of continuous probability density functions using real data (randomized maximum entropy (RME) estimation).
In this paper, the general RME estimation problem is formulated; its solutions, numerical algorithms, and the asymptotic properties of the solutions are studied. The theoretical results are illustrated by an important application: estimating the evolution of the thermokarst lake area in Western Siberia.
Thus, after r measurements, the model and observations are described by the equations

$$
\hat{v} = \Gamma(x^{(r)}, \theta) + \xi,
$$

where the vector function Γ(x^{(r)}, θ) has the components ϕ(x[t], θ), with t = 1, …, r the time points, and v̂ denotes the observed output of the model containing the measurement noises of the object's output. Let us introduce a series of assumptions necessary for further considerations.

•	The random measurement noise is ξ ∈ Ξ ⊂ R^r, where Ξ = Ξ_1 × ⋯ × Ξ_r.
•	The PDF Q(ξ) of the measurement noises is continuously differentiable on the support Ξ and has the multiplicative structure Q(ξ) = ∏_{t=1}^{r} Q_t(ξ[t]).

The estimation problem is stated as follows: find the estimates of the PDFs P*(θ) and Q*(ξ) that maximize the generalized information entropy functional

$$
H[P, Q] = -\int_{\Theta} P(\theta) \ln P(\theta)\, d\theta - \sum_{t=1}^{r} \int_{\Xi_t} Q_t(\xi[t]) \ln Q_t(\xi[t])\, d\xi[t] \to \max \tag{4}
$$

subject to

- the normalization conditions of the PDFs,

$$
\int_{\Theta} P(\theta)\, d\theta = 1, \qquad \int_{\Xi_t} Q_t(\xi[t])\, d\xi[t] = 1, \quad t = 1, \dots, r; \tag{5}
$$

- the empirical balance conditions,

$$
\int_{\Theta} \phi(x[t], \theta)\, P(\theta)\, d\theta + \int_{\Xi_t} \xi[t]\, Q_t(\xi[t])\, d\xi[t] = y[t], \quad t = 1, \dots, r, \tag{6}
$$

where y^{(r)} = {y[1], …, y[r]} are the measured data on the object's output. Problem (4)-(6) will be referred to as the RME estimation problem.
Problems (4)-(6) are of the Lyapunov type [23,24], as they involve an integral functional and integral constraints.

Optimality Conditions
The optimality conditions in optimization problems of the Lyapunov type are formulated in terms of Lagrange multipliers. In addition, the Gâteaux derivatives of the problem's functionals are used [25].
The Lagrange functional is defined by

$$
L[P, Q] = H[P, Q] + \alpha \Big(1 - \int_{\Theta} P(\theta)\, d\theta\Big) + \sum_{t=1}^{r} \beta_t \Big(1 - \int_{\Xi_t} Q_t(\xi[t])\, d\xi[t]\Big) + \sum_{t=1}^{r} \lambda_t \Big( y[t] - \int_{\Theta} \phi(x[t], \theta)\, P(\theta)\, d\theta - \int_{\Xi_t} \xi[t]\, Q_t(\xi[t])\, d\xi[t] \Big). \tag{7}
$$

Let us recall the technique for obtaining optimality conditions in terms of the Gâteaux derivatives [26].
Next, we substitute the perturbed representations of the PDFs, P(θ) + εh(θ) and Q_t(ξ[t]) + εg_t(ξ[t]), into (7). If all functions from C^1 are assumed to be fixed, the Lagrange functional depends on the parameters α and β_1, …, β_r. Then, the first-order optimality conditions for the functional (7) in terms of the Gâteaux derivative take the form

$$
\frac{d}{d\varepsilon}\, L[P + \varepsilon h,\, Q + \varepsilon g]\Big|_{\varepsilon = 0} = 0. \tag{8}
$$

These conditions lead to the following system of integral equations:

$$
\int_{\Theta} \Big( \ln P(\theta) + 1 + \alpha + \sum_{t=1}^{r} \lambda_t \phi(x[t], \theta) \Big) h(\theta)\, d\theta = 0, \qquad
\int_{\Xi_t} \Big( \ln Q_t(\xi[t]) + 1 + \beta_t + \lambda_t \xi[t] \Big) g_t(\xi[t])\, d\xi[t] = 0, \quad t = 1, \dots, r, \tag{9}
$$

which are satisfied for any functions h(θ) and g_1(ξ[1]), …, g_r(ξ[r]). Hence, the entropy-optimal PDFs of the model parameters and measurement noises have the form

$$
P^*(\theta) = \frac{\exp\big(-\sum_{t=1}^{r} \lambda_t \phi(x[t], \theta)\big)}{\mathcal{P}(\lambda)}, \qquad \mathcal{P}(\lambda) = \int_{\Theta} \exp\Big(-\sum_{t=1}^{r} \lambda_t \phi(x[t], \theta)\Big) d\theta, \tag{10}
$$

$$
Q_t^*(\xi[t]) = \frac{\exp(-\lambda_t \xi[t])}{\mathcal{Q}_t(\lambda_t)}, \qquad \mathcal{Q}_t(\lambda_t) = \int_{\Xi_t} \exp(-\lambda_t \xi[t])\, d\xi[t], \quad t = 1, \dots, r. \tag{11}
$$

Due to equalities (10) and (11), the entropy-optimal PDFs are parametrized by the Lagrange multipliers λ_1, …, λ_r, which are the solutions of the empirical balance equations

$$
\frac{G_t(\lambda)}{\mathcal{P}(\lambda)} + \frac{S_t(\lambda_t)}{\mathcal{Q}_t(\lambda_t)} = y[t], \quad t = 1, \dots, r, \tag{12}
$$

where

$$
G_t(\lambda) = \int_{\Theta} \phi(x[t], \theta) \exp\Big(-\sum_{s=1}^{r} \lambda_s \phi(x[s], \theta)\Big) d\theta, \qquad S_t(\lambda_t) = \int_{\Xi_t} \xi[t] \exp(-\lambda_t \xi[t])\, d\xi[t]. \tag{13}
$$

The solution λ*(y^{(r)}, x^{(r)}) of these equations depends on the sample (y^{(r)}, x^{(r)}) used for constructing the RME estimates of the PDFs.
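To make the structure of the entropy-optimal PDFs (10)-(11) and the balance Equations (12)-(13) concrete, the following sketch solves a deliberately small synthetic instance: a scalar parameter θ on [−1, 1], a linear model ϕ(x[t], θ) = θ·x[t], and uniform-support noise. All numeric values are hypothetical and chosen only for illustration; the normalizers and moments are computed by quadrature, and the multipliers λ by a root finder.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.integrate import quad

# Hypothetical toy instance (not from the paper): scalar parameter theta on
# Theta = [-1, 1], linear model phi(x[t], theta) = theta * x[t], and noise
# xi[t] with support Xi_t = [-0.5, 0.5].
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.3, 0.6, 0.9])          # synthetic "measured" outputs

def balance(lam):
    """Residuals of the empirical balance equations (12)."""
    # normalizer P(lam) over Theta
    P = quad(lambda th: np.exp(-np.dot(lam, x) * th), -1, 1)[0]
    res = []
    for t in range(len(x)):
        # parameter moment G_t(lam) from (13)
        G = quad(lambda th: x[t] * th * np.exp(-np.dot(lam, x) * th), -1, 1)[0]
        # noise normalizer and moment S_t(lam_t) for Q*_t ~ exp(-lam_t * xi)
        Qt = quad(lambda xi: np.exp(-lam[t] * xi), -0.5, 0.5)[0]
        S = quad(lambda xi: xi * np.exp(-lam[t] * xi), -0.5, 0.5)[0]
        res.append(G / P + S / Qt - y[t])
    return res

lam = fsolve(balance, np.zeros(3))     # Lagrange multipliers lambda_1..lambda_3

# entropy-optimal parameter PDF P*(theta) from (10), as a callable
P_norm = quad(lambda th: np.exp(-np.dot(lam, x) * th), -1, 1)[0]
p_star = lambda th: np.exp(-np.dot(lam, x) * th) / P_norm
```

The returned `p_star` is automatically a proper density (it integrates to one by construction), and the residuals of `balance` vanish at the solution, mirroring the parametrization of P* by the multipliers described above.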

Existence of an Implicit Function
The second term on the left-hand side of the balance Equations (12) and (13) is the mean value of the noise in each measurement t. The noises and their characteristics are often assumed to be identical over the measurements:

$$
\Xi_t = \tilde{\Xi}, \qquad \mathcal{Q}_t = \tilde{\mathcal{Q}}, \qquad S_t = \tilde{S}, \quad t = 1, \dots, r.
$$

Therefore, the mean value of the noise is given by

$$
\bar{\xi}(\lambda_t) = \frac{\tilde{S}(\lambda_t)}{\tilde{\mathcal{Q}}(\lambda_t)}. \tag{15}
$$

The balance equations can be written as

$$
\frac{G_t(\lambda)}{\mathcal{P}(\lambda)} + \bar{\xi}(\lambda_t) = y[t], \quad t = 1, \dots, r. \tag{16}
$$

In vector form, Equation (16) is described by

$$
W(\lambda \,|\, \tilde{y}^{(r)}, x^{(r)}) = 0. \tag{21}
$$

Equation (21) defines an implicit function λ(ỹ^{(r)}, x^{(r)}). The existence and properties of this implicit function depend on the properties of the Jacobian matrix

$$
J_\lambda(\lambda \,|\, \tilde{y}^{(r)}, x^{(r)}) = \frac{\partial W(\lambda \,|\, \tilde{y}^{(r)}, x^{(r)})}{\partial \lambda},
$$

which has the elements ∂W_t/∂λ_s, t, s = 1, …, r.

Theorem 1. Assume that:
•	the function ϕ(x^{(r)}, θ) is continuous in all variables;
•	det J_λ(λ | ỹ^{(r)}, x^{(r)}) ≠ 0. (22)
Then, there exists a unique implicit function λ(ỹ^{(r)}, x^{(r)}) defined on R^r × R^r.
Proof of Theorem 1. Due to the first assumption, the continuous function W(λ | ỹ^{(r)}, x^{(r)}) induces the vector field Φ_{(ỹ^{(r)}, x^{(r)})}(λ) = W(λ | ỹ^{(r)}, x^{(r)}) in the space R^r × R^r. We choose an arbitrary vector u in R^r and define the vector field Π_u(λ) = Φ_{(ỹ^{(r)}, x^{(r)})}(λ) − u. By condition (22), the field Π_u(λ) with a fixed vector u has no zeros on the spheres ‖λ‖ = ρ of sufficiently large radius ρ.
Hence, a rotation is well defined on the spheres ‖λ‖ = ρ of sufficiently large radius ρ. For details, see [27].
Consider the two homotopic vector fields (20).
In view of (21), these singular points are isolated. Now, let us utilize the index of a singular point introduced in [27]:

$$
\operatorname{ind}(\lambda_0) = (-1)^{\beta(\lambda_0)},
$$

where β(λ_0) is the number of eigenvalues of the matrix Π'_u(λ_0) = J_λ(λ_0 | ỹ^{(r)}, x^{(r)}) with negative real part. By definition, the value of this index depends not on the magnitude of β(λ_0), but on its parity. Due to condition (21), all singular points have the same parity. Indeed, det J_λ(λ_0 | ỹ^{(r)}, x^{(r)}) ≠ 0, and hence, for any ỹ^{(r)}, x^{(r)} ∈ R^r × R^r, the eigenvalues of the matrix J_λ(λ_0 | ỹ^{(r)}, x^{(r)}) may move from the left half-plane to the right one only in pairs: real eigenvalues are transformed into pairs of complex-conjugate ones before passing through the imaginary axis.
In view of this fact, the rotation of the homotopic fields (20) is given by

$$
\gamma(\Pi_u) = (-1)^{\beta},
$$

where β is the number of such eigenvalues of the matrix Π'_u(λ) at some singular point. It remains to demonstrate that the vector field Π_u(λ) has a unique singular point in the ball ‖λ‖ ≤ ρ_1 < ρ. Consider the equation Π_u(λ) = 0. Assume that for each fixed pair (ỹ^{(r)}, x^{(r)}), this equation has κ singular points, i.e., the functions λ^{(1)}(ỹ^{(r)}, x^{(r)}), …, λ^{(κ)}(ỹ^{(r)}, x^{(r)}). Therefore, it defines a multivalued function λ(ỹ^{(r)}, x^{(r)}) whose κ branches are isolated (the latter property follows from the isolation of the singular points). Due to condition (21), each of the branches λ^{(i)}(ỹ^{(r)}, x^{(r)}) defines an open set in the space R^r, and ⋃_{i=1}^{κ} λ^{(i)}(ỹ^{(r)}, x^{(r)}) = R^r. This is possible if and only if κ = 1. Hence, for each pair (ỹ^{(r)}, x^{(r)}) from R^r × R^r, there exists a unique function λ*(ỹ^{(r)}, x^{(r)}) at which the function W(λ | ỹ^{(r)}, x^{(r)}) vanishes.

Theorem 2.
Under the assumptions of Theorem 1, the function λ(ỹ^{(r)}, x^{(r)}) is real-analytic in all variables.
Proof of Theorem 2. From (15), it follows that the function W(λ | ỹ^{(r)}, x^{(r)}) is analytic in all variables. Therefore, the left-hand side of Equation (15) can be expanded into a generalized Taylor series [26], and the solution can be constructed in the form of a generalized Taylor series as well. The power elements of this series are determined by a recursive procedure.

Asymptotic Efficiency of RME Estimates
The RME estimate yields the entropy-optimal PDFs (10) for the arrays of input and output data, each of size r. For convenience, consider the PDFs parametrized by the exponential Lagrange multipliers z = exp(−λ). Then, equalities (10) take the form

$$
P^*(\theta, z) = \frac{\prod_{t=1}^{r} z_t^{\phi(x[t], \theta)}}{\mathcal{P}(z)}, \qquad Q_t^*(\xi[t], z_t) = \frac{z_t^{\xi[t]}}{\mathcal{Q}_t(z_t)}, \quad t = 1, \dots, r.
$$
Consequently, the structure of the PDF significantly depends on the values of the exponential Lagrange multipliers z, which, in turn, depend on the data arrays y^{(r)} and x^{(r)}.

Definition 1. The estimates P*(θ, z*) and Q*_t(ξ[t], z*_t) are said to be asymptotically efficient if

$$
\lim_{r \to \infty} P^*\big(\theta, z(y^{(r)}, x^{(r)})\big) = P^*(\theta, z^*), \qquad \text{where } z^* = \lim_{r \to \infty} z(y^{(r)}, x^{(r)}). \tag{25}
$$

Consider the empirical balance Equation (21), written in terms of the exponential Lagrange multipliers:

$$
\Phi(z \,|\, \tilde{y}^{(r)}, x^{(r)}) = 0. \tag{26}
$$

As demonstrated above, Equation (26) defines an implicit analytic function z = z(ỹ^{(r)}, x^{(r)}) for (ỹ^{(r)}, x^{(r)}) ∈ R^r × R^r.
Differentiating the left- and right-hand sides of these equations with respect to ỹ^{(r)} and x^{(r)} yields

$$
\frac{\partial z}{\partial \tilde{y}^{(r)}} = -\Big(\frac{\partial \Phi}{\partial z}\Big)^{-1} \frac{\partial \Phi}{\partial \tilde{y}^{(r)}}, \qquad \frac{\partial z}{\partial x^{(r)}} = -\Big(\frac{\partial \Phi}{\partial z}\Big)^{-1} \frac{\partial \Phi}{\partial x^{(r)}}.
$$

Then, passing to the norms and using the inequality for the norm of a product of matrices [28], we obtain the inequalities

$$
\Big\| \frac{\partial z}{\partial \tilde{y}^{(r)}} \Big\| \le \Big\| \Big(\frac{\partial \Phi}{\partial z}\Big)^{-1} \Big\| \, \Big\| \frac{\partial \Phi}{\partial \tilde{y}^{(r)}} \Big\|, \qquad \Big\| \frac{\partial z}{\partial x^{(r)}} \Big\| \le \Big\| \Big(\frac{\partial \Phi}{\partial z}\Big)^{-1} \Big\| \, \Big\| \frac{\partial \Phi}{\partial x^{(r)}} \Big\|.
$$

Both inequalities incorporate the norm of the inverse matrix ‖(∂Φ/∂z)^{-1}‖.

Lemma 1.
Let a square matrix A be nonsingular, i.e., det A ≠ 0. Then, there exists a constant α > 1 such that

$$
\| A^{-1} \| \le \frac{\alpha}{|\det A|}. \tag{29}
$$

Proof of Lemma 1. Since the matrix A is nondegenerate, the elements of the inverse matrix have the form ã_{ij} = A_{ji}/det A, where A_{ji} are the cofactors of A. The cofactors are polynomial in the elements of A and hence bounded on any bounded set of matrices. Therefore, there exists a constant α > 1 for which inequality (29) is satisfied.
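The bound (29) admits a quick numerical sanity check (a sketch using `numpy`; the matrix is arbitrary and illustrative): since A^{-1} = adj(A)/det(A), any α exceeding both 1 and the norm of the adjugate satisfies the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))          # a generic (almost surely nonsingular) matrix
det = np.linalg.det(A)
inv = np.linalg.inv(A)

# adj(A) = det(A) * A^{-1}: its entries are the cofactors of A
adj = det * inv
alpha = max(np.linalg.norm(adj, 2), 1.0) + 1e-9   # any alpha > max(1, ||adj A||)

# Lemma 1's bound: ||A^{-1}|| <= alpha / |det A|
assert np.linalg.norm(inv, 2) <= alpha / abs(det)
```

The assertion holds because ‖A^{-1}‖ = ‖adj A‖/|det A| exactly, so the bound is tight up to the choice of α.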

Thermokarst Lake Area Evolution in Western Siberia: RME Estimation and Testing
Permafrost zones, which occupy a significant part of the Earth's surface, are the locales of thermokarst lakes, which accumulate greenhouse gases (methane and carbon dioxide). These gases make a considerable contribution to global climate change.
The source data in studies of the evolution of thermokarst lake areas are acquired through remote sensing of the Earth's surface and ground measurements of meteorological parameters [29,30].
The state of thermokarst lakes is characterized by their total area S[t] in a given region, measured in hectares (ha), and by the factors influencing thermokarst formation: the average annual temperature T[t], measured in degrees Celsius (°C), and the annual precipitation R[t], measured in millimeters (mm), where t denotes the calendar year.
We used the remote sensing data and ground measurements of the meteorological parameters for a region of Western Siberia between 65°N-70°N and 65°E-95°E that were presented in [31]. We divided the available time series into two groups, forming the training collection L (t = 0, …, 24) and the testing collection T (t = 25, …, 35).

RME Estimation of Model Parameters and Measurement Noises
The temporal evolution of the lake area S[t] is described by the following dynamic regression equation with two influencing factors, the average annual temperature T[t] and the annual precipitation R[t]:

$$
S[t] = \sum_{k=1}^{p} a_k S[t-k] + a_{p+1} T[t] + a_{p+2} R[t]. \tag{34}
$$

The model parameters and measurement noises are assumed to be random and of the interval type: a_i ∈ [a_i^-, a_i^+], i = 1, …, p + 2. The probabilistic properties of the parameters are characterized by a PDF P(a).
The variable v̂[t] is the observed output of the model, and the values of the random measurement noise ξ[t] at different time instants t may belong to different ranges ξ[t] ∈ [ξ_t^-, ξ_t^+], with a PDF Q_t(ξ[t]), t = 0, …, N, where N denotes the length of the observation interval. The order p = 4 and the parameter ranges for the dynamic randomized regression model (34) (see Table 1 below) were calculated from the real data using empirical correlation functions and least-squares estimates of the residual variances.
The RME estimation procedure yielded the entropy-optimal PDFs (36) of the model parameters and measurement noises. Note that S[t − k], T[t], and R[t] are the data from the collection L. The two-dimensional sections of the function P*(a) and the function Q*(ξ) are shown in Figure 1.

Testing
Testing was performed using the data from the collection T, which included the lake area S[t], the average annual temperature T[t], and the annual precipitation R[t], t = 25, …, 35. An ensemble of trajectories of the model's observed output v[t] was generated using Monte Carlo simulations and sampling from the entropy-optimal PDFs P*(a) and Q*(ξ) on the testing interval. In addition, the trajectory of the empirical means v̄[t] and the boundaries of the empirical standard deviation area were calculated.
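The testing step can be sketched as a Monte Carlo simulation of the randomized model (34). In the minimal illustration below, uniform sampling on hypothetical parameter and noise intervals stands in for sampling the entropy-optimal PDFs P*(a) and Q*(ξ); none of the numeric ranges or data values are taken from the paper's Table 1.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 4                                  # autoregressive order, as in the paper
# Hypothetical interval-type parameter ranges (placeholders, not Table 1 values)
a_lo = np.full(p + 2, -0.1)
a_hi = np.full(p + 2, 0.6)
xi_lo, xi_hi = -5.0, 5.0               # hypothetical noise range, ha

def simulate(S_init, T, R, n_traj=1000):
    """Generate an ensemble of observed-output trajectories by sampling the
    random parameters a and noises xi; uniform sampling stands in for the
    entropy-optimal PDFs P*(a) and Q*(xi)."""
    n = len(T)
    out = np.empty((n_traj, n))
    for j in range(n_traj):
        a = rng.uniform(a_lo, a_hi)            # one random parameter draw
        S = list(S_init)                       # last p known lake areas
        for t in range(n):
            # model output: AR part plus temperature and precipitation factors
            s = a[:p] @ S[-p:][::-1] + a[p] * T[t] + a[p + 1] * R[t]
            out[j, t] = s + rng.uniform(xi_lo, xi_hi)   # observed output
            S.append(s)
    return out

# Illustrative initial areas (ha) and testing-interval factors
ens = simulate([120.0, 118.0, 121.0, 119.0],
               T=np.array([-8.0, -7.5, -9.0]),
               R=np.array([450.0, 470.0, 430.0]))
mean_traj = ens.mean(axis=0)    # empirical mean trajectory
std_traj = ens.std(axis=0)      # empirical standard deviation band
```

The empirical mean trajectory and standard deviation band computed from `ens` are the same characteristics reported for the testing interval in Figure 2; replacing the uniform draws with samples from P*(a) and Q*(ξ) yields the RME ensemble proper.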
The quality of RME estimation was characterized by the absolute and relative errors. The generated ensemble of trajectories is shown in Figure 2.

Discussion
Given an available data collection, the RME procedure allows estimation of the PDFs of a model's random parameters under measurement noises corresponding to the maximum uncertainty (maximum entropy). In addition, this procedure needs no assumptions about the structure of the estimated PDFs or the statistical properties of the data and measurement noises.
An entropy-optimal model can be simulated by sampling the PDFs to generate an empirical ensemble of a model's output trajectories and to calculate its empirical characteristics (the mean and median trajectories, the standard deviation area, interquartile sets, and others).
The RME procedure was illustrated with an example of estimating the parameters of a linear regression model for the evolution of the thermokarst lake area in Western Siberia. In this example, the procedure demonstrated good estimation accuracy.
However, these positive features of the procedure come at a computational cost. Despite their analytical structure, the RME estimates of the PDFs depend on Lagrange multipliers, which are determined by solving the balance equations with so-called integral components (the mathematical expectations of the random parameters and measurement noises). Computing these multidimensional integrals may require substantial computing resources.

Conclusions
The problem of randomized maximum entropy estimation of a probability density function based on real available data has been formulated and solved. The developed estimation algorithm (the RME algorithm) finds the conditional maximum of an information entropy functional on a set of admissible probability density functions characterized by the empirical balance equations for Lagrange multipliers. These equations define an implicit dependence of the Lagrange multipliers on the data collection. The existence of such an implicit function for any values in a data collection has been established. The function's behavior for a data collection of a greater size has been studied, and the asymptotic efficiency of the RME estimates has been proved.
The positive features of RME estimates have been illustrated with an example of estimating and testing a linear dynamic regression model of the evolution of the thermokarst lake area in Western Siberia with real data.