Minimax Estimation in Regression under Sample Conformity Constraints

Abstract: The paper is devoted to guaranteed estimation of parameters in an uncertain stochastic nonlinear regression. The loss function is the conditional mean square of the estimation error given the available observations. The distribution of the regression parameters is partially unknown, and the uncertainty is described by a subset of probability distributions with a known compact support. The essential feature is the use of additional constraints describing the conformity of the uncertain distribution to the realized observation sample. The paper contains various examples of such conformity indices. The estimation task is formulated as a minimax optimization problem, which, in turn, is solved in terms of saddle points. The paper presents a characterization of both the optimal estimator and the set of least favorable distributions. The saddle points are found via the solution to a dual finite-dimensional optimization problem, which is simpler than the initial minimax problem. The paper proposes a numerical mesh procedure for the solution of the dual optimization problem. The interconnection between the least favorable distributions under the conformity constraint and their Pareto efficiency in the sense of a vector criterion is also indicated. The influence of various conformity constraints on the estimation performance is demonstrated by illustrative numerical examples.


Introduction
The problems of heterogeneous parameter estimation in regression under model uncertainty are studied intensively from various points of view. The guaranteeing (or minimax) approach provides one of the most promising tools for solving these problems. A proper formulation of an estimation problem in minimax terms usually requires:
• a description of the uncertainty set in the observation model;
• a class of admissible estimators;
• an optimality criterion (a loss function) as a function of the argument pair "estimator–uncertain parameter value".
The problem is to find the estimator that minimizes the maximal losses over the whole uncertainty set.
In the related literature, the parametric uncertainty set is specified either by geometric [1–7] or by statistical [8–15] constraints. In the former case, the uncertain parameters are treated as nonrandom but unknown, lying within the fixed uncertainty set. In the latter case, the parameters are supposed to be random with an unknown distribution, and the uncertainty set is formed by all the admissible distributions. In both cases, guaranteeing estimation presumes a solution to a two-person game problem: the first player is "a statistician", and the performer of the second, "external" player role is dictated by the problem statement: it might be nature, another human, or a device. Nevertheless, the guaranteeing approach suggests the unified prescription: find the best estimator under the worst behavior of the uncertainty. In practice, such universality leads to a loss of some prior information.
Let us explain this point by an example: the statistician knows that the source of the uncertainty is nature. This means he/she "should bear in mind that nature, as a player, is not aiming for a maximal win (that is, does not want us to suffer a maximal loss), and in this sense, it is 'impartial' in the choice of strategies" [12]. Hence, in this case, the minimax approach is too pessimistic and leads to cautious and coarse estimates. Even if we know the second player is a human, this does not imply his/her "bad will" towards the statistician. Hopefully, the second player has a goal other than maximizing the loss of the statistician. If the goal of the second player is known, one can change the estimation criterion and transform the initial problem into a non-antagonistic game [16]. Otherwise, the statistician can identify the goal indirectly, relying on the available observations. Hence, in the latter case, it seems natural to introduce additional constraints on the uncertainty set, depending on the realized observations. The paper aims to present a solution to the minimax estimation problem under additional constraints, which are determined by a conformity index of the uncertain parameters to the available observations. The paper is organized as follows. Section 2 contains the formal problem statement with the conformity index based on the likelihood function. The section presents the assumptions concerning the observation model, which guarantee the correctness of the stated estimation problem and the existence of its solution. It also contains a comparison of the problem with recent investigations.
Section 3 provides the main result: the initial estimation problem is reformulated as a game problem, which has a saddle point defining the minimax estimator completely. Moreover, the point is a solution to a dual finite-dimensional constrained optimization problem, which is simpler than the initial minimax problem. The form of the minimax estimator and the properties of the least favorable distributions (LFD) are also included in the section. Section 4 is devoted to the analysis of the obtained results. First, a numerical algorithm for the solution of the dual optimization problem is presented along with its accuracy characterization. Second, some other conformity indices, based on the empirical distribution function (EDF) and on the sample mean, are also introduced. Third, a new concept of the uncertain distribution choice under a vector criterion is considered. The first criterion component, being the loss function introduced in Section 2, describes the influence of the uncertainty on the estimation quality. The second component is the conformity index, which characterizes the accordance between the unknown distribution of γ and the realized observations Y = y. We present an assertion that the LFD in the minimax estimation problem is Pareto-efficient in the sense of the introduced vector criterion.
Section 5 presents the numerical examples, which illustrate the influence of various conformity constraints on the estimation performance. Section 6 contains concluding remarks.
The following notation is used in this manuscript:
• B(S) is the Borel σ-algebra of the topological space S (if S is the whole space) or its restriction to the set S (if S is a subset of the topological space);
• col(A_1, . . . , A_n) is a column vector formed by the ordinary or block components A_1, . . . , A_n;
• row(A_1, . . . , A_n) is a row vector formed by the ordinary or block components A_1, . . . , A_n;
• ⟨a, b⟩ is the scalar product of two finite-dimensional vectors;
• C(X) is the set of all continuous real-valued functions with the domain X;
• ‖x‖ is the Euclidean norm of the vector x;
• P_F{A} is the probability of the event A corresponding to the distribution F;
• E_F{X} is the mathematical expectation of the random vector X with the distribution F;
• conv(S) is the convex hull of the set S.

Formulation
Let us consider the following observation model:
Y = A(γ, X) + B(γ, X)V. (1)
Here:
• γ ∈ C ∈ B(R^m) is an unobservable random vector having an unknown cumulative distribution function (cdf) F;
• X ∈ R^n is a random unobservable vector with a known cdf Ψ(dx|γ) dependent on the value of γ;
• Y ∈ R^k is a vector of observations;
• V ∈ R^k is a random vector of observation errors with the known probability density function (pdf) φ_V(v);
• A(·) : C × R^n → R^k is a nonrandom function characterizing the observation plant;
• B(·) : C × R^n → R^{k×k} is a nonrandom function characterizing the observation error intensity.
The observation model is defined on the family of probability triplets {(Ω, F, P_F)}_{F∈F}, where the measures P_F are generated by the distribution F of γ, the conditional distribution Ψ(dx|γ) of X, and the noise pdf φ_V. Using the generalized Bayes rule [17], it is easy to verify that the function
L(y|q) ≜ ∫_{R^n} φ_V(B^{−1}(q, x)(y − A(q, x))) |det B(q, x)|^{−1} Ψ(dx|q) (3)
is the conditional pdf of the observation Y given γ: P_F{Y ∈ dy|γ = q} = L(y|q)dy. Furthermore, the function
L(y, F) ≜ ∫_C L(y|q) F(dq) (4)
defines the pdf of the observation Y under the assumption that the distribution law of γ equals F. Below in the paper we refer to the function L(y, F) as the sample conformity index based on the likelihood function.
Our aim is to estimate the function h(γ, X) : C × R^n → R^l of the pair (γ, X); the admissible estimators are functions ĥ(Y) : R^k → R^l of the available observations.
The loss function is the conditional mean square of the estimation error given the available observations:
J(ĥ, F|y) ≜ E_F{‖h(γ, X) − ĥ(y)‖² | Y = y}, (6)
and the corresponding estimation criterion
J*(ĥ|y) ≜ sup_{F∈F_L} J(ĥ, F|y) (7)
characterizes the maximal loss for a fixed estimator ĥ within the class F_L of the uncertain distributions of γ, for which L(y, F) ≥ L. The minimax estimation problem for the vector h is to find an estimator ĥ(·) such that
ĥ(·) ∈ Argmin_{h̃∈H} J*(h̃|y), (8)
where H is a class of admissible estimators.

Necessary Assumptions Concerning Observation Model
To state the minimax estimation problem (8) properly and to guarantee the existence of its solution, we make additional assumptions concerning the uncertainty of γ, the observation model (1) and the estimated vector h.
(i) The set C ⊂ R^m is compact.
(ii) Let F be the family of all probability distributions with support lying within the set C. The set F_L is itself a convex, ∗-weakly compact [18] subset of F.
(iii) The inequality
L(y, F) ≥ L (9)
holds for all F ∈ F_L. The inequality (9) is called the conformity constraint of the level L based on the likelihood function (or, shortly, the likelihood constraint).
(iv) The set F_L is nonempty.
Among the remaining conditions (v)–(ix): the functions A, B and h are continuous; the observation noise is uniformly non-degenerate, i.e., there exists β > 0 such that B(q, x)B^⊤(q, x) ≥ βI for all (q, x); and the set of admissible estimators H contains only the functions ĥ(·) : R^k → R^l for which sup_{F∈F_L} E_F{‖ĥ(Y)‖²} < ∞.

Argumentation
First, we discuss the sense of the assumptions in the subsection above. Conditions (i)-(iv), describing the set F L , have the following interpretation.
The requirement for C to be compact (i.e., the fulfillment of condition (i)) is standard for minimax estimation problems (see, e.g., [2,3]). In the case when the prior information about the vector γ is limited to the knowledge of its domain C only, it is rather natural to treat γ as a random vector with an unknown distribution F ∈ F. In practice we often have some additional prior information concerning the moment characteristics of γ; hence, the entire uncertainty set F can be significantly reduced. If, for example, µ(q) = col(µ_1(q), . . . , µ_N(q)) : C → R^N is a vector of convex moment functions, and we know the vector µ̄ ≜ col(µ̄_1, . . . , µ̄_N) ∈ R^N of their upper bounds, then the set of admissible distributions takes the form {F ∈ F : ∫_C µ_j(q)F(dq) ≤ µ̄_j, j = 1, …, N}. The ∗-weak compactness and convexity can easily be verified for this subset. Further in the presentation, we do not stress the explicit form of the "total" constraints other than (9) forming the subset F_L: they should just guarantee the closedness and convexity of F_L. That is the sense of condition (ii).
The conditional pdf L(y|q) (3) can also be treated as the likelihood function of the parameter γ, calculated at the point q given the observed sample Y = y. This likelihood value reflects the relevance of the parameter value q to the realized observation y. By analogy, the function L(y, F) can be considered as a generalization of the likelihood function that evaluates the correspondence between the uncertain distribution F and the realized observation y. The following lower and upper bounds for this value are obvious:
L̲(y) ≜ min_{q∈C} L(y|q) ≤ L(y, F) ≤ max_{q∈C} L(y|q) ≜ L̄(y).
Below in the paper we suppose that the likelihood level L lies in [L̲(y), L̄(y)]. The subset formed by the constraint {F ∈ F : ∫_C L(y|q)F(dq) ≥ L} is called the distribution subset satisfying the likelihood conformity constraint of the level L. It is nonempty because it contains at least all distributions with support lying within the set {q ∈ C : L(y|q) ≥ L}.
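As a concrete illustration, the conformity index L(y, F) and its bounds can be computed on a mesh. The scalar model Y = q + V with V ~ N(0, 1) used below is a hypothetical stand-in for the general model (1), chosen only so that L(y|q) has a closed form:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical scalar model Y = q + V, V ~ N(0, 1):
# the conditional pdf L(y|q) is the standard normal density at y - q.
def likelihood(y, q):
    return norm.pdf(y - q)

mesh = np.linspace(0.0, 2.0, 201)              # discretization of the compact C
y_obs = 0.7                                    # realized observation Y = y
weights = np.full(mesh.size, 1.0 / mesh.size)  # a candidate discrete F on the mesh

L_vals = likelihood(y_obs, mesh)
L_yF = weights @ L_vals                        # conformity index L(y, F)
L_lo, L_hi = L_vals.min(), L_vals.max()        # the obvious lower/upper bounds
```

Any admissible level L must then be chosen inside the interval [L_lo, L_hi]; otherwise the constrained set is empty or coincides with the whole of F.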
Adjusting the level L, we can vary the uncertainty set F_L, choosing the distributions F that are more or less relevant to the realized observations Y = y. That is the essence of condition (iii). Condition (iv) is obvious: all the constraints defining the set F_L should be feasible.
Condition (v) is technical: it ensures the correctness of a subsequent change of measure. The condition is not restrictive, because a broad class of functions A, B and h can be approximated by continuous functions. Conditions (vi) and (vii) guarantee the correct utilization of the Fubini theorem and of an abstract variant of the Bayes formula [19]. In practice these conditions are usually valid. Condition (viii) guarantees finite variance of both the observations and the estimated vector independently of the distribution F. Condition (ix) guarantees finite variance of the estimate ĥ(Y) independently of F ∈ F_L.
The solution to (8) is obvious in the case of the one-point set F_L = {F}. In this case the distribution F of the parameter γ is known, and the initial problem reduces to the traditional optimal-in-the-mean-square-sense (MS-optimal) estimation problem. The case of the one-point set C = {q} is quite similar. In both cases the optimal estimator is completely defined by the conditional expectation (CE): ĥ(y) = E_F{h(γ, X)|Y = y} in the case of a known distribution F, and ĥ(y) = E_{q}{h(q, X)|Y = y} in the "one-point" case.
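For a known discrete prior, the MS-optimal estimator is simply the posterior mean produced by the generalized Bayes rule. The sketch below uses a hypothetical scalar Gaussian model Y = q + V, V ~ N(0, 1), with h(γ, X) = γ, on a mesh discretization of C:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical model: Y = q + V, V ~ N(0, 1); estimated function h(q) = q.
def likelihood(y, q):
    return norm.pdf(y - q)

mesh = np.linspace(0.0, 2.0, 201)              # discretized domain C
prior = np.full(mesh.size, 1.0 / mesh.size)    # known discrete prior F

def bayes_estimate(y):
    # Generalized Bayes rule: posterior weights proportional to L(y|q) F(dq).
    post = likelihood(y, mesh) * prior
    post /= post.sum()
    return post @ mesh                         # E_F{gamma | Y = y}

est = bayes_estimate(0.7)
```

With the uniform prior on [0, 2] this is just the mean of a normal density truncated to C, pulled slightly above the observation y = 0.7 by the mass cut off at the left endpoint.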
In the general case of F L this result is inapplicable, because the CE E F {h(γ, X)|Y = y} is a functional of the unknown distribution F.
The stated estimation problem has a transparent interpretation. First, under the prior uncertainty of the distribution F, the replacement of the loss function (6) by its guaranteeing analog looks natural. Second, the utilization of the CE in the criterion means that the desired estimate should be calculated optimally for each observed sample. Criteria in the form of the CE appear often in estimation and control problems [11,17,20–22]. Mostly, the estimation is a preliminary stage in the solution of an optimization and/or control problem under incomplete information. The random disturbances/noises in such observation systems represent:
• a result of natural (non-human) impacts;
• a randomized (or generalized) control [23,24] used in the dynamic system;
• a result of some uncontrollable (parasitic) input signals of "the external players".
The impact of the two latter types is not necessarily a nonrandom function of the available observations, but rather some "extra generated" random processes with distributions dependent on the observations. This type of control is used in telecommunications [25,26], cellular networks [27], technical systems [28], etc. The proposed minimax criterion allows inhibiting the negative effect of the "additional randomness" in the external signals (the third type of disturbance mentioned above) on the estimation quality.
Additional comprehension of the natural gaps inherent to the minimax estimation paradigm, and of the ways of their partial coverage, can be revealed by the following interpretation. It is well known that, in the case when a minimax estimation problem can be reduced to a two-person game with a saddle point, the minimax estimator is the best one calculated for the LFD. The form of the LFD can be very strange and artificial. Moreover, the conformity degree of the LFD to the realized observations can be too low. Thus, the utilization of various sample conformity indices (particularly the ones based on the likelihood function) makes it possible to describe this degree, bound it from below, implicitly reduce the distribution uncertainty set, and exclude "exotic" variants of the LFD.
Minimax estimation of regression parameters has been investigated in various settings. Mostly, the observation model is a linear function of the estimated parameters corrupted by an additive Gaussian noise, and the optimality criterion is the mathematical expectation of some loss function. In [29], the problem is solved by engaging the framework of fuzzy sets. The authors of [30,31] used a criterion other than the traditional mean-square one, and the estimated vector was random with an uncertain discrete distribution. In [32], the Gaussian noises have an uncertain but bounded covariance matrix. The papers [33–35] are also devoted to the minimax Bayesian estimation in regression under various geometric and moment constraints on the estimated parameters. The criterion functions are ℓ_p norms of the estimation errors.
The optimality criterion in the form of the CE and the admissibility of nonlinear estimates distinguish the proposed estimation problem from the recently considered ones [2,3,5–7,9]. A closely related problem considered in [11] has an essential difference: the uncertain parameter in [11] was treated as unknown and nonrandom, and hence the initial minimax problem could not be solved in terms of saddle points. Moreover, the statistical uncertainty in [11] gave no possibility to take into account any additional prior and posterior information about the moment characteristics, conformity indices, etc. The paper [14] was devoted to the particular case of the likelihood constraints only. An idea to use confidence sets, calculated from the available statistical data, as the uncertainty sets of the distribution moments was used in [36] for the conditionally-minimax prediction.

The Main Result
As is known, the CE is determined in a non-unique way; hence, we should specify a version of the CE to be used in the further inferences. If the distribution F of the vector γ is known, then the CE of an integrable random value h(γ, X) : C × R^n → R^l can be calculated by the abstract variant of the Bayes formula:
E_F{h(γ, X)|Y = y} = (∫_C E{h(q, X)|Y = y, γ = q} L(y|q) F(dq)) / (∫_C L(y|q) F(dq)). (11)
Below in the presentation we use the CE version defined by (11). It should also be noted that if ĥ(·) is the desired minimax estimator, then the inclusion (8) must be satisfied point-wise for any sample y ∈ R^k. Further in the paper, the function
J_*(F|y) ≜ inf_{ĥ∈H} J(ĥ, F|y) (12)
is called the dual criterion for J* (7); all CEs in (12) are calculated by (11). Using (3) for the calculation of L, the notation (13) and the CE version (11), the loss function (6) can be rewritten in the form (14). As can be seen from (14), the function J(ĥ, F|y) is neither convex nor concave in F, which complicates the solution of the estimation problem (8). Moreover, the argument F lies in an abstract infinite-dimensional space of probability measures. Nevertheless, the problem can be reduced to a standard finite-dimensional minimax problem with a convex-concave criterion.
First, we introduce a new reference measure F̃ and verify that the loss function (14) represents a functional which is linear in F̃. Let the measure F̃(F, dq|y) be defined by the transformation (15).
Lemma 1. If conditions (i)–(ix) are satisfied, then the following assertions are true.
1. F̃(F, dq|y) is a probability measure for every y ∈ R^k, and F̃(F, ·|y) ∼ F(·). The transformation (15) is a bijection of F onto itself, and its inverse has the form (16).
2. The set F̃_L of all distributions obtained from F_L by the transformation (15), given in (17), is convex and ∗-weakly closed.
The proof of Lemma 1 is given in Appendix A.
Applying the Fubini theorem and keeping in mind (11) and (15), we can rewrite the loss function (14) in the form (18). To reduce the initial problem to a finite-dimensional equivalent, we also introduce the vectors w(q|y) ≜ col(w_1(q|y), w_2(q|y)) ∈ R^{l+1} and w(F̃|y) ≜ col(w_1(F̃|y), w_2(F̃|y)) ∈ R^{l+1}, defined in (19) and (20), together with their collections W(C|y) and W(F̃_L|y) generated by the subsets C and F̃_L in (21). Here and below, the notation H(y) also stands for the whole set of the estimate values ĥ(y), ĥ ∈ H, calculated at the fixed argument y.
On the set R^l × R^{l+1} we define the new loss function J(η, w) (22). It is easy to verify that the loss function (18) can be expressed via (22), see (23). The corresponding guaranteeing criterion and its dual are determined accordingly, and the finite-dimensional minimax problem (25) is to find the corresponding minimax pair. From the definitions of W(F̃_L|y), H(y) and the criterion (23) it follows that the problem (25) is equivalent to the initial minimax estimation problem (8) for every y ∈ R^k. The following theorem characterizes the solution to the finite-dimensional minimax problem in terms of a saddle point of the loss function J.
Theorem 1. For every y ∈ R^k, the loss function J(η, w) (22) has a unique saddle point (ĥ(y), ŵ(y)) on the set H(y) × W(F̃_L|y). The second block subvector ŵ(y) = col(ŵ_1(y), ŵ_2(y)) ∈ W(F̃_L|y) of the saddle point is the unique solution to the finite-dimensional dual problem (28), and ĥ(y) = ŵ_2(y) is the second sub-vector of this optimum ŵ(y).
The proof of Theorem 1 is given in Appendix B.
By the definition of W(F̃_L|y), for any vector ŵ(y) there exists at least one distribution F̂ satisfying (29); F̂ is an LFD, and the whole set of distributions satisfying (29) is denoted by F̂_L. Theorem 1 allows us to obtain a solution to the initial minimax estimation problem; the result is formulated as Corollary 1. The following assertion characterizes the key property of the LFD set F̂_L.

Corollary 2. There exists a variant of the LFD F̂ ∈ F̂_L concentrated at most at dim(W(F̃_L|y)) + 1 points of the set C.
The proof of Corollary 2 is given in Appendix C.
The existence of a discrete version of the LFD is a remarkable fact. Let us remind the reader that initially we supposed the uncertain vector γ to lie in the set C. The deterministic hypothesis concerning γ hopelessly obstructed the solution to the minimax estimation problem. To overcome this obstacle, we postulated the randomness of γ: the vector stays constant during an individual observation experiment, and the stochastic nature of γ appears from experiment to experiment only. The existence of a discrete LFD partially returns us to the primordial situation: there exists a finite set of γ values that are the most difficult for estimation. Tuning to these parameters, we can obtain estimates of γ with guaranteed quality.
Theorem 1 and Corollary 1 simplify the solution to the initial problem (8), reducing it to the maximization of the finite-dimensional quadratic function (28) over the convex compact set.

Dual Problem: A Numerical Solution
To simplify the presentation of the numerical algorithm for the solution of problem (28), we suppose that the uncertainty set F_L takes the form F_L = {F ∈ F : L(y, F) ≥ L}, i.e., it is restricted by the conformity constraint only.
Let us consider the case C ≜ {q_j}_{j=1,…,M} ⊂ R^m, which corresponds to the practical problem of Bayesian classification [10,38]. Here the dual problem (28) has the form ŵ(y) = Argmax_{w∈conv(W(C|y))} J_*(w). Its solution can be represented as ŵ(y) = Σ_{j=1}^M P̂_j(y) w(q_j|y), where P̂(y) ≜ row(P̂_1(y), . . . , P̂_M(y)) is a solution to the standard quadratic programming problem (QP problem) (30). Consequently, in the case of a finite C the minimax estimation problem can be reduced to a standard QP problem with well-investigated properties and advanced numerical procedures.
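The reduction above can be sketched numerically as a QP over the probability simplex with the likelihood conformity constraint Σ_j P_j L(y|q_j) ≥ L. The paper's explicit quadratic objective is given in (28); here a generic concave quadratic stands in for it, and the likelihoods and the level L are synthetic:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M = 5
L_vals = rng.uniform(0.1, 1.0, M)   # hypothetical likelihoods L(y|q_j)
L_level = 0.5 * L_vals.max()        # conformity level L (feasible by construction)

# Stand-in concave objective -(p - c)' Q (p - c); the paper's actual
# quadratic function comes from (28).
c = rng.uniform(size=M)
A = rng.uniform(size=(M, M))
Q = A @ A.T + np.eye(M)             # positive definite

def neg_obj(p):
    return (p - c) @ Q @ (p - c)    # minimizing this maximizes the concave objective

cons = [{"type": "eq",   "fun": lambda p: p.sum() - 1.0},        # simplex
        {"type": "ineq", "fun": lambda p: p @ L_vals - L_level}] # conformity
res = minimize(neg_obj, np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M, constraints=cons)
p_hat = res.x                       # discrete LFD weights P_hat_j(y)
```

The point mass at the argmax of L_vals always satisfies the constraint, so the feasible set is nonempty for any level L ≤ L̄(y).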
Utilization of finite subsets C_n ⊂ C instead of the original domain C allows one to calculate "mesh" approximations of the solution to (8). Let:
• δ_n ↓ 0 be a decreasing nonrandom sequence characterizing the approximation accuracy;
• ω_1(·) and ω_2(·) be the moduli of continuity of w_1(q|y) and w_2(q|y).
The assertion below characterizes the divergence rate of the approximating solutions from the initial minimax estimate.
Lemma 2. If the mesh C_n approximates the set C with the accuracy δ_n, then the corresponding sequences of approximate solutions converge as δ_n ↓ 0, with the rate determined by ω_1, ω_2 and some constant 0 < K < ∞.
The proof of Lemma 2 is given in Appendix D.

The Least Favorable Distribution in the Light of the Pareto Efficiency
The minimax estimation problem under the conformity constraints is tightly interconnected with the choice of the distribution F that is optimal in the sense of a vector-valued criterion. On the one hand, the solution to the considered estimation problem is based on finding the distribution F maximizing the dual criterion (12): I_1(F|y) ≜ J_*(F|y) → max_F. On the other hand, the distribution F should conform to the realized sample Y = y, and the maximization of the conformity index leads to the following optimization problem: I_2(F|y) ≜ L(y, F) → max_F. Obviously, the criteria I_1 and I_2 are conflicting; hence, the proper choice of F requires the application of vector optimization techniques.
Lemma 3 below asserts that the LFD is Pareto-efficient with respect to the vector criterion col(I_1, I_2); its proof follows directly from the Germeyer theorem [16]. Consideration of the constrained minimax estimation problem in the light of optimization by a vector criterion is somewhat close to the one investigated in [31], where the estimation quality is characterized by the ℓ_2 norm of the error, and the Shannon entropy serves as a measure of the statistical uncertainty of the estimated vector.

Other Conformity Indices
First, we consider the conformity constraint (9) more thoroughly. It admits the following treatment. Let F̄ ∈ F be some reference distribution. The constraint L(y, F) ≥ L(y, F̄) is a specific case of (9): the feasible distributions F should be relevant to the available observations Y = y no less than the reference distribution F̄ is. One more treatment is also acceptable. Let q̄ ∈ C be some "guess" value of the uncertain parameter γ, and let α > 0 be a fixed value. The constraint
L(y, F) ≥ α L(y|q̄) (33)
is a specific case of (9): it means that the likelihood ratio of any feasible distribution F to the one-point distribution at q̄ should be no less than the level α. Obviously, the guess value q̄ could be chosen among the maxima of the function L, i.e., q̄ ∈ Argmax_{q∈C} L(y|q), but the calculation of these maxima is itself a nontrivial problem of likelihood function maximization. In Section 5 we use the following modification of (33):
L(y, F) ≥ r max_{q∈C_n} L(y|q), (34)
where C_n ⊆ C is a known subset, and r ∈ (0, 1) is a fixed parameter. This form is important because, in the case C = C_n, it guarantees that the constraint (34) is active in the considered minimax optimization problem for each r ∈ (0, 1). Furthermore, the proposed conformity index L(y, F) is not the unique numerical characteristic describing the interconnection between F and Y. For example, an alternative conformity index can be defined as ∫_C f(L(y|q)) F(dq), where f(·) : R → R is some continuous nondecreasing function. Another way to introduce such an index is to set it as ∫_{S(y)} L(y′, F) dy′ = P_F{Y ∈ S(y)}, i.e., as the probability that the observation Y lies in the confidence set S(y) ∈ B(R^k).
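Both the constraint (34) and the f-transformed index can be checked numerically on a mesh. The scalar Gaussian model Y = q + V, V ~ N(0, 1) below is a hypothetical illustration, with f = log as the nondecreasing transform:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical scalar model Y = q + V, V ~ N(0, 1), on a mesh of C = [0, 2].
mesh = np.linspace(0.0, 2.0, 201)
weights = np.full(mesh.size, 1.0 / mesh.size)   # candidate distribution F
y_obs = 0.7

L_vals = norm.pdf(y_obs - mesh)                 # L(y|q_j) on the mesh

# Constraint (34): L(y, F) >= r * max over the mesh, with ratio r in (0, 1).
r = 0.5
conforms = weights @ L_vals >= r * L_vals.max()

# Alternative index: integral of f(L(y|q)) dF with nondecreasing f (here: log).
alt_index = weights @ np.log(L_vals)
```

For the uniform candidate distribution the averaged likelihood stays well above half of the mesh maximum, so the constraint (34) with r = 0.5 is satisfied here.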
For a particular case of the observation model (1) we can propose one more conformity index, based on the EDF. Let us consider the observation model (35) with the "purely uncertain" estimated parameter γ. Here:
• Y_T ≜ col(Y_1, . . . , Y_T) are the available observations;
• γ ∈ C ⊂ R^m is a random vector with an unknown distribution F;
• V_T ≜ col(V_1, . . . , V_T) are the observation errors, i.i.d. centered normalized random values with the pdf φ_V(v).
If the value of γ is known, the observations {Y_t}_{t=1,…,T} can be considered as i.i.d. random values whose pdf equals φ_V(v) after some shifting and scaling. The EDF of the sample {Y_t}_{t=1,…,T} has the form
F*_T(y) ≜ (1/T) Σ_{t=1}^T 1{Y_t ≤ y}. (36)
On the other hand, the cdf F_Y(y) of any observation Y_t for a fixed distribution F can be calculated by (37). The sample conformity index based on the EDF is the value
M(Y_T, F) ≜ sup_{y∈R} |F*_T(y) − F_Y(y)|. (38)
The new uncertainty set F_M describing all admissible distributions F satisfies conditions (i), (ii) and (iv) above, but condition (iii) is replaced by the following one:
M(Y_T, F) ≤ M, (39)
which holds for all F ∈ F_M and some fixed level M > 0. It is called the constraint based on the EDF.
The proposed conformity index represents the well-known Kolmogorov distance used in goodness-of-fit tests; the asymptotic distribution of the normalized statistic √T M(Y_T, F) is characterized by the Kolmogorov theorem. Furthermore, the value M(Y_T, F) can be easily calculated, because the function F*_T is piecewise constant while F_Y is continuous:
M(Y_T, F) = max_{t=1,…,T} max( |F_Y(Y_(t)) − t/T|, |F_Y(Y_(t)) − (t − 1)/T| ), (40)
where the cdf F_Y is calculated by (37). The distribution set determined by (39) takes the form (41). Using the variational series Y_(1) ≤ … ≤ Y_(T) of the sample Y_T and recalling (40), the set (41) can be rewritten in the form (42). It can be seen that this set is a convex closed polyhedron lying in F with at most 2T facets. All assertions formulated in Section 3 remain valid after replacing the uncertainty set F_L, generated by the likelihood function, by the set F_M, generated by the EDF. Moreover, the mesh algorithm for the solution of the dual optimization problem, presented in Section 4.1, can also be applied to this case.
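The order-statistics formula (40) can be sketched directly. The shift model Y_t = γ + V_t with standard Gaussian noise and a one-point distribution F concentrated at q0 are illustrative assumptions, chosen so that F_Y has a closed form:

```python
import numpy as np
from scipy.stats import norm

def kolmogorov_distance(sample, cdf):
    # (40): the EDF is piecewise constant and F_Y is continuous, so the
    # supremum over y is attained at the order statistics of the sample.
    y_sorted = np.sort(sample)
    T = sample.size
    F_vals = cdf(y_sorted)
    t = np.arange(1, T + 1)
    return max(np.max(np.abs(F_vals - t / T)),
               np.max(np.abs(F_vals - (t - 1) / T)))

# Hypothetical shift model Y_t = q0 + V_t, V_t ~ N(0, 1), with F = delta_{q0},
# so that F_Y(y) = Phi(y - q0).
rng = np.random.default_rng(1)
q0 = 1.0
sample = q0 + rng.standard_normal(200)
M_stat = kolmogorov_distance(sample, lambda y: norm.cdf(y - q0))
```

For a correctly specified F the statistic is small (of order 1/√T), which is what makes (39) an effective way to exclude distributions that disagree with the realized sample.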
Let us consider the observation model (35) again. We can use the sample mean Ȳ ≜ (1/T) Σ_{t=1}^T Y_t as one more conformity index. Let us remind the reader that, due to the model property, the random parameter γ(ω) is constant for each sample Y_T. For rather large T, the central limit theorem allows us to treat the normalized value as a standard Gaussian random variable. We then fix the standard Gaussian quantile c_α of the confidence level α and exscind the subset C_α defined in (44). If C_α is compact, then the set F_α of all probability distributions with support lying in C_α is called the set of admissible distributions satisfying the sample mean conformity constraint of the level α.
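The construction of C_α can be sketched as follows, assuming for illustration the location model Y_t = γ + V_t with centered, normalized errors (the paper's model (35) may differ), so that √T(Ȳ − q) is approximately standard Gaussian:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical location model Y_t = gamma + V_t, V_t centered and normalized.
rng = np.random.default_rng(2)
T, gamma_true = 100, 1.3
sample = gamma_true + rng.standard_normal(T)
y_bar = sample.mean()

alpha = 0.95
c_alpha = norm.ppf(0.5 + alpha / 2.0)    # two-sided standard Gaussian quantile

mesh = np.linspace(0.0, 2.0, 401)        # discretized domain C
# C_alpha: the values q consistent with the sample mean at confidence alpha.
C_alpha = mesh[np.abs(y_bar - mesh) <= c_alpha / np.sqrt(T)]
```

The resulting subset shrinks at the rate 1/√T, so for long samples the sample-mean constraint localizes the admissible distributions very tightly around Ȳ.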
The comparison of the minimax estimates, calculated under various types of the conformity constraints, is presented in the next section.

Parameter Estimation in the Kalman Observation System
Let us consider the linear Gaussian discrete-time (Kalman) observation system, where:
• X_T ≜ col(X_0, . . . , X_T) is an unobservable state trajectory (the autoregression X_t is supposed to be stable).
Our goal is to calculate the proposed minimax estimates of the uncertain vector γ and to analyze their performance depending on the specific form of the loss function (6). To vary the loss function, we can either specify the estimated test signal h(·) or choose different weighted Euclidean norms. We choose the second approach and define the following norm ‖·‖_{ξ_X,ξ_γ} for the compound vector Z ≜ col(X_T, γ):
‖Z‖²_{ξ_X,ξ_γ} ≜ ξ_X ‖X_T‖² + ξ_γ ‖γ‖²,
and the corresponding loss function takes the form
J_{ξ_X,ξ_γ}(Z, F|Y_T) = E_F{ ‖Z − Ẑ(Y_T)‖²_{ξ_X,ξ_γ} | Y_T }.
In the case ξ_γ = 1 and ξ_X = 0, we obtain the "traditional" conditional mean-square loss function J_{0,1}(Z, F|Y_T) = E_F{‖γ − γ̂(Y_T)‖² | Y_T}, and the estimation quality of γ̂(·) is determined directly through the loss function. Using ξ_γ = 0 and ξ_X = 1, we transform the loss function into J_{1,0}(Z, F|Y_T) = E_F{‖X_T − X̂(Y_T)‖² | Y_T}, and the estimation of γ appears indirectly via the estimation of the state trajectory X_T.
The minimax estimates are calculated by the numerical procedure introduced in Section 4.1 with the uniform mesh C_{h_a,h_b} of the uncertainty set C; h_a and h_b are the corresponding mesh steps along each coordinate.
We calculate the minimax estimate with the likelihood conformity constraint of the form (34), where r ∈ (0, 1) is a confidence ratio, and compare the proposed minimax estimate with some known alternatives. Note that, in spite of the sufficient observation length, the signal-to-noise ratio is rather small, which prevents high performance of the asymptotic estimation methods. Figure 1 presents the evolution of the minimax estimates â_{0,1}(r) and â_{1,0}(r) of the drift coefficient depending on the confidence ratio r ∈ (0, 1). The minimax estimates are compared with:

• the estimate â_MS(Y_T) calculated by the moment/substitution method [12];
• the Bayesian estimate â_{F_1}(Y_T) (11) calculated under the assumption that the prior distribution F_1 of γ is uniform over the whole uncertainty set C;
• the Bayesian estimate â_{F_2}(Y_T) (11) calculated under the assumption that the prior distribution F_2 of γ is uniform over the vertices of C;
• the estimate â_EKF(Y_T) calculated by the extended Kalman filter (EKF) algorithm [39] with subsequent residual processing;
• the maximum likelihood estimate (MLE) â_MLE(Y_T) calculated by the expectation-maximization (EM) algorithm [17].
The results of this experiment allow us to make the following conclusions.

2. Both minimax estimates are more conservative than the MLE, because they take into account a chance for other points of the LFD domain to be realized.

3. Under an appropriate choice of the confidence ratio r, both minimax estimates become more accurate than the other candidates, except for the MLE.

Parameter Estimation under Additive and Multiplicative Observation Noises
We consider the observations Y_T ≜ col(Y_1, . . . , Y_T) generated by a model with both additive and multiplicative observation noises. Here:
• a is the estimated value;
• X_T ≜ col(X_1, . . . , X_T) is a vector of i.i.d. unobservable random values (multiplicative noise).
We assume that the parameter a is random with an unknown distribution whose support lies within the known set C ≜ [c_1, c_2]. The loss function has the form J(F|Y_T) = E_F{(a − â(Y_T))² | Y_T}. In this example, our goal is to compare the minimax estimates of the parameter a under the conformity constraint based either on the likelihood function or on the EDF.
The minimax estimates are calculated for the following parameter values: a = 2, T = 20, C = [2, 3], σ = 0.1. We use the proposed numerical procedure with a uniform mesh C_h of the set C with the step h = 0.005. The example has several notable features. First, the observation model contains both additive (V^T) and multiplicative (X^T) heterogeneous noises. Second, the available observation sample is not long enough to provide high quality of the consistent estimates. Third, the exact value of a equals 2; meanwhile, in the absence of the conformity constraint there exists a discrete variant of the LFD with the finite support set {2, 3}. This means that the LFD is realized only in the considered observation model.
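The uniform mesh C_h used by the numerical procedure above can be sketched in a few lines; this is a minimal illustration in Python, where the helper name uniform_mesh is our own and only the values C = [2, 3] and h = 0.005 come from the text:

```python
import numpy as np

def uniform_mesh(c1: float, c2: float, h: float) -> np.ndarray:
    """Nodes c1, c1 + h, ..., c2 of a uniform mesh over [c1, c2]."""
    n_nodes = int(round((c2 - c1) / h)) + 1
    return np.linspace(c1, c2, n_nodes)

# Mesh C_h of C = [2, 3] with step h = 0.005, as in the experiment.
C_h = uniform_mesh(2.0, 3.0, 0.005)
print(len(C_h))  # 201 nodes
```

A minimax procedure over C_h then reduces to optimization over discrete distributions supported on these 201 nodes.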
The likelihood conformity constraint is similar to the one from the previous subsection, where r ∈ (0, 1) is a confidence ratio. Figure 3 compares the minimax estimate a(r) with its actual value a, the (consistent, asymptotically Gaussian) M-estimate a_sub ≜ (2/T) ∑_{t=1}^T Y_t, obtained by the moment/substitution method [12], and the MLE a_MLE.
Next, we investigate the minimax posterior estimates under the conformity constraint based on the EDF. The constraint has the form (46), where r ∈ (0, 1) is some fixed confidence ratio, and F_{C_h} is a "mesh" approximation of the set F_C corresponding to the uniform mesh C_h. The form (46) of the conformity constraint guarantees that the constraint is active in the minimax optimization problem for any r ∈ (0, 1).
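For intuition, an EDF-based conformity index can be built from a Kolmogorov-Smirnov-type distance between the sample EDF and a candidate model cdf; the concrete index below is an illustrative assumption, not the paper's definition (46):

```python
import numpy as np
from math import erf, sqrt

def ks_distance(sample: np.ndarray, model_cdf) -> float:
    """Sup-distance between the sample EDF and a candidate model cdf."""
    xs = np.sort(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The EDF jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - model_cdf(x)), abs(i / n - model_cdf(x)))
    return d

# Illustrative check against the standard normal cdf.
std_normal_cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
y = np.random.default_rng(0).normal(size=200)
d = ks_distance(y, std_normal_cdf)
conformity = 1.0 - d  # close to 1 when the sample conforms to the model
```

A constraint of the kind used in the paper would then require the conformity index of a candidate distribution to be at least r.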
[Figure: densities of Y corresponding to the one-point distributions concentrated at the points q = 2 and q = 3.]

The results of this experiment allow us to make the following conclusions.

1. The minimax estimate a(r) under the conformity constraint based on the EDF does not converge to the MLE a_MLE as r → 1.

2. Under an appropriate choice of the confidence ratio r, the minimax estimate under the EDF constraint becomes more accurate than the other candidates, including the MLE.

Conclusions
The paper states and solves a new minimax estimation problem for uncertain stochastic regression parameters. The optimality criterion is the conditional mean square of the estimation error given the realized observations. The class of admissible estimators contains all (linear and nonlinear) statistics with finite variance. The a priori information concerning the estimated vector is incomplete: the vector is random, and part of its components has a distribution supported in a known compact set. The key feature of the considered problem is the presence of additional constraints on the statistical uncertainty, bounding from below the degree of correspondence between the uncertainty description and the realized observations. The paper presents various indices characterizing this conformity via the likelihood function, the EDF, and the sample mean.
We propose a reduction of the initial optimization problem in abstract infinite-dimensional spaces to a standard finite-dimensional QP problem with convex constraints, along with an algorithm of its numerical realization and a precision analysis.
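To give a flavor of the resulting finite-dimensional problem, here is a toy QP with convex constraints solved numerically; the quadratic objective and the simplex feasible set below are illustrative placeholders, not the paper's actual dual problem:

```python
import numpy as np
from scipy.optimize import minimize

# Toy QP: minimize 0.5*||w||^2 - b^T w over the probability simplex,
# mimicking a search over mixture weights of a discrete distribution.
b = np.array([0.3, 0.9, 0.5])

res = minimize(
    lambda w: 0.5 * w @ w - b @ w,
    x0=np.full(len(b), 1.0 / len(b)),
    bounds=[(0.0, None)] * len(b),
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
print(res.x.round(3))  # Euclidean projection of b onto the simplex
```

The strict convexity of the objective guarantees a unique solution, mirroring the uniqueness argument for the saddle point in the paper.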
The minimax estimation problem is solved in terms of saddle points; i.e., besides the estimators with guaranteed quality, we obtain a description of the LFDs. First, the investigation of the LFDs' domains allowed us to detect the uncertain parameter values that are the worst for estimation. Second, the consideration of the performance index pair "conformity index / guaranteed estimation quality" uncovered a rather new concept of parameter estimation under a vector optimality criterion. The paper contains an assertion stating that the LFDs are Pareto-optimal with respect to this vector-valued criterion.
The paper focuses mostly on the conformity indices related to the likelihood function; thus, it is natural to compare the performance of the minimax estimate with that of the MLE. In general, the MLE has several remarkable properties, in particular the asymptotic minimaxity under some additional restrictions [12]. However, this estimate is not robust to prior statistical uncertainty. The proposed minimax estimate can be considered a robustified version of the MLE, ready for application in the cases of short non-asymptotic samples or violation of the conditions for the asymptotic minimaxity of the MLE.
The conformity constraints are not exhausted by the likelihood function. In the paper, we present other conformity indices based on the EDF and the sample mean. We demonstrate that the minimax estimates with the EDF conformity constraint can outperform the MLE. One of the points of the paper is that a flexible choice of the conformity indices and the design of additional conformity constraints for each individual applied estimation problem allow obtaining a tradeoff between the prior uncertainty and the available observations. The reason to choose one or another conformity index depends not only on the conditions of the specific practical estimation problem solved in the minimax setting. An essential consideration is the possibility of its quick computation for the subsequent verification of the conformity constraint. For example, calculation of the likelihood conformity constraint (33) with the guess value L(y|q̂) = max_q L(y|q) necessarily requires solving an auxiliary maximization problem for the likelihood function, which is nontrivial in itself. Thus, the conformity indices based on the EDF or sample moments look more promising from the computational point of view.
The applicability of the proposed minimax estimate also depends on the availability of an analytical formula for the estimates w(y|q) or of fast numerical algorithms for their calculation. In turn, this possibility is the basis for the subsequent effective solution to the QP problem and the specification of the LFD.
Finally, the key indicator affecting the estimate calculation process and its precision is the number of mesh nodes in the approximation C_h of the uncertainty set C. It is a function of the "size of C / mesh step h" ratio and of the dimensionality m of C.
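The dependence of the node count on the dimensionality m can be sketched directly; the box-shaped set and the count formula (L/h + 1)^m below are our simplifying assumptions:

```python
def mesh_node_count(side_length: float, h: float, m: int) -> int:
    """Nodes of a uniform mesh with step h on an m-dimensional box."""
    per_axis = int(round(side_length / h)) + 1
    return per_axis ** m

print(mesh_node_count(1.0, 0.005, 1))  # 201 (e.g., C = [2, 3], h = 0.005)
print(mesh_node_count(1.0, 0.005, 2))  # 40401: quadratic blow-up for m = 2
```

The geometric growth in m is the usual curse of dimensionality of mesh-based procedures.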
All of the factors above characterize the limits of applicability of the proposed minimax estimation method to particular practical problems.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

cdf  cumulative distribution function
CE  conditional expectation
EDF  empirical distribution function
EKF  extended Kalman filter
EM algorithm  expectation/maximization algorithm
LFD  least favorable distribution
MLE  maximum likelihood estimate
MS-optimal  optimal in the mean square sense
pdf  probability density function
QP problem  quadratic programming problem

Appendix A

Furthermore, for any ε (0 < ε < 1) there exists a compact set S(ε) ∈ B(R^n) such that ∫_{S(ε)} Ψ(dx|q) ≥ 1 − ε, and by the Weierstrass theorem m(y) ≜ min_{(q,x)∈C×S(ε)} ν(q, x|y) > 0. Each measure F ∈ F can be associated with the measure µ_F(dq|y) ≜ L(y|q)F(dq). Obviously, µ_F ∼ F, and µ_F is finite, i.e., 0 < m(y) ≤ ∫_C µ_F(dq|y) ≤ M < ∞; hence, these bounds hold ∀ y ∈ R^k and ∀ F ∈ F. The measure F̃(F, dq|y) (15) is probabilistic; moreover, F̃ ∼ F. The measure F̂(F, dq|y) (16) is also a probability measure defined on (C, B(C)), F̂ ∼ F, and the denominator in (16) admits positive lower and upper bounds. From (15) and (16) it follows that F̃ ∼ F̂, and the corresponding measure transformations are mutually inverse, i.e., ∀ F ∈ F the identities F̃(F̂(F)) ≡ F̂(F̃(F)) ≡ F hold, and, moreover, {F̃(F) : F ∈ F} = {F̂(F) : F ∈ F} = F. Assertion (1) of Lemma 1 is proven.
The set F_L is *-weakly closed, because the set F is, and the function L(y|q) is nonnegative, continuous, and bounded in q ∈ C.

Appendix B
Proof of Theorem 1. The set H(y) = R by condition (ix); thus it is convex and closed. The set F_L is convex and *-weakly closed due to Lemma 1. From this fact and (20) it follows that W(F_L|y) is also a convex closed set. Moreover, it is bounded due to condition (viii). The function J (22) is strictly convex in η and concave (affine) in w. These conditions are sufficient for the existence of a saddle point [40]. It should be noted that both the set H(y) × W(F_L|y) and the saddle point (ĥ(y), ŵ(y)) depend on the observed sample y. Now we prove the uniqueness of the saddle-point component ŵ(y). Let w′(y) = col(w′_1(y), w′_2(y)) and w″(y) = col(w″_1(y), w″_2(y)) be two different saddle points, let J°(y) ≜ J*(w′(y)) = J*(w″(y)), and let w‴(y) ≜ αw′(y) + (1 − α)w″(y) be an arbitrary convex combination of the chosen points (0 < α < 1). After elementary algebraic transformations we have J*(w‴(y)) = J°(y) + α(1 − α)‖w′_2(y) − w″_2(y)‖² > J°(y), which contradicts the assumption that w′(y) and w″(y) are two different solutions to the finite-dimensional dual problem. Theorem 1 is proven.
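The "elementary algebraic transformations" rely on the following quadratic identity, which holds for any vectors u, v and α ∈ (0, 1) and is verified by direct expansion of the squared norms:

```latex
\|\alpha u + (1-\alpha)v\|^{2}
  = \alpha\|u\|^{2} + (1-\alpha)\|v\|^{2} - \alpha(1-\alpha)\|u-v\|^{2}
```

Applied to the quadratic part of J*, it produces the α(1 − α) term in the uniqueness argument above.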

Appendix C
Proof of Corollary 2. The set W(F_L|y) is compact, and W(F_L|y) ⊆ conv(W(C|y)). By the Krein-Milman theorem [37], each point of the set W(F_L|y) can be represented as a convex combination of at most dim(W(F_L|y)) + 1 extreme points of the set W(F_L|y).
Obviously, all extreme points of W(F_L|y) belong to the set W(C|y). Hence, for the point ŵ(y), which is a solution to the finite-dimensional dual problem (28), there exist a finite set {q_s(y)}_{s=1,...,S} ⊆ C of parameters, with 1 ≤ S ≤ dim(W(F_L|y)) + 1, and weights {P_s(y)}_{s=1,...,S} (P_s(y) ≥ 0, ∑_{s=1}^S P_s(y) = 1) such that ŵ(y) = ∑_{s=1}^S P_s(y) w(q_s(y)|y).
These parameters and weights define the reference measure (15) on the space (C, B(C)): F̃(dq|y) ≜ ∑_{s=1}^S P_s(y) δ_{q_s(y)}(dq).
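Such a discrete reference measure, a finite mixture of Dirac measures δ_{q_s}, is directly representable; a minimal sketch with illustrative support points and weights (not derived from any actual sample):

```python
import numpy as np

q = np.array([2.0, 3.0])  # hypothetical support points q_s(y)
P = np.array([0.6, 0.4])  # hypothetical weights P_s(y)

# A valid probability measure: nonnegative weights summing to one.
assert np.all(P >= 0.0) and abs(P.sum() - 1.0) < 1e-12

# Expectation of g(q) under the measure sum_s P_s * delta_{q_s}.
g = lambda x: x ** 2
expectation = float(P @ g(q))
print(expectation)  # 0.6 * 4 + 0.4 * 9 = 6.0
```

In the paper, the support points and weights come from the solution of the dual problem rather than being chosen by hand.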