## 1. Introduction

#### 1.1. Motivation

The paper presents a new density estimator motivated by problems of population modeling, where the interest is in estimating the probability distribution π_{θ}, θ ∈ Θ, of the parameters of a mathematical model M(·|θ) characterizing the response y(t|θ) of individuals to applied stimuli x(t). The ultimate goal is in general to be able to predict the dispersion of the response of the population to an arbitrary future stimulus x(t), rather than to make a “tomography” of the population itself. These types of problems are frequent in domains like biomedical engineering, insurance studies or environmental management.

If the parameter θ can be estimated from each observation y(t|θ) and each individual’s parameter is chosen independently from π_{θ}, the problem of estimating π_{θ} from a collection of responses ${\{\left({y}_{i}\left(t|{\theta}_{i}\right),\,{x}_{i}\left(t\right)\right)\}}_{i=1}^{N}$ is formally equivalent to the usual density estimation problem from a set of independent and identically distributed samples ${\{{\theta}_{i}\}}_{i=1}^{N}\sim {\pi}_{\theta}$ and can be solved using standard parametric or non-parametric methods; see the abundant literature on non-linear mixed-effects models. The situation considered in this paper is more complex, in that the response y(·|θ) of the model is not observable, and we only have access to the label assigned to it, among a finite number (L + 1) of possible labels, by a known classifier C(·).

Figure 1 illustrates the structural modeling/observation framework that we consider.

In this setup, each observation can no longer be related to a single point θ ∈ Θ, the same label z being assigned, for the same stimulus x(t), to all responses inside a subset R ⊂ Θ. The set R is completely determined by the pair (z, x(t)) together with knowledge of the model M(·|θ) and of the classifier rule C(·). This situation, when a single observation does not give information with respect to the individual value θ, but only the indication that it belongs to a set, is commonly known in the statistical literature as “censored observations”. While in general studies of the density estimation under censored observations have assumed that the censoring sets R are intervals, the geometry of our censoring regions is determined by the structure of the (possibly highly non-linear) operators M(·|θ) and C(·) and can have an arbitrary morphology, requiring modification of the existing methods.

In Section 4, we detail a particular instance of the problem formally presented above, relevant in the context of the prevention of decompression sickness in hyperbaric diving. Readers may want to read the material in Section 4.1 to have a concrete instantiation of the generic stimuli and operators used in the presentation above.

#### 1.2. Notation and Problem Formulation

Consider the notation introduced in Section 1.1 (see also Figure 1), and let ${\{\left({z}_{n},{x}_{n}(\cdot )\right)\}}_{n=1}^{N}$ denote the available set of observations, where label z_{n} ∈ {0, …, L} has been observed for input X^{(n)} = {x_{n}(t), t ∈ T_{n}}, where T_{n} is the duration of the stimulus. Denote by R_{n} ⊂ Θ the set of all individual parameters whose response to X^{(n)} receives label z_{n}:

$${R}_{n}=\{\theta \in \Theta : C(M({X}^{(n)}|\theta ))={z}_{n}\}.$$

We assume that for all possible stimuli X^{(n)}, the composition C(M(X^{(n)}|·)) (of the model and the classifier) is a measurable function from Θ to {0, …, L} with respect to the restriction of the Lebesgue measure to the set Θ. Under this assumption, the probability of the sets ${M}_{{X}^{(n)}}^{-1}({C}^{-1}(\ell ))$ is well defined for all 0 ≤ ℓ ≤ L and all stimuli, for any distribution absolutely continuous with respect to the Lebesgue measure.

Usually, in population studies, the same stimulus is applied to several individuals. We assume here that stimuli X^{(j)} are chosen in a finite set $\mathrm{X}=\{{X}^{(1)},\dots ,{X}^{(J)}\}$. Each possible input function X^{(j)} in $\mathrm{X}$ determines a partition of Θ in L + 1 sets, which we denote by ${\mathcal{Q}}^{\left(j\right)}=\{{R}_{0}^{(j)},\dots ,{R}_{L}^{(j)}\}$:

$${R}_{\ell}^{(j)}=\{\theta \in \Theta : C(M({X}^{(j)}|\theta ))=\ell \},\quad \ell =0,\dots ,L.$$

The top row of Figure 2 schematically illustrates partitions that correspond to classification in two (L_{1} = 1) and three (L_{2} = 2) classes of the response to two distinct stimuli.

Let n_{j} be the number of times that stimulus X^{(j)} has been used in the N observations and ${n}_{\ell}^{(j)}$ the number of times label ℓ occurred in these n_{j} experiments. The observed dataset determines J empirical laws ${\tilde{f}}^{(j)}$, each one associated with a distinct partition ${\mathcal{Q}}^{(j)}$:

$${\tilde{f}}_{\ell}^{(j)}=\frac{{n}_{\ell}^{(j)}}{{n}_{j}},\quad \ell =0,\dots ,L,\quad j=1,\dots ,J.\qquad (1)$$

When we want to emphasize the number of observations on which these empirical laws are based, we will call ${\tilde{f}}^{(j)}$ an n_{j}-type. With the notation defined above, we can finally state the problem addressed in this paper in full generality.
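As a minimal sketch of how the n_{j}-types could be tabulated in practice, the snippet below assumes that the observations are stored as a list of hypothetical `(j, z)` pairs (stimulus index and observed label). The function name and data layout are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def empirical_types(observations, num_labels):
    """Return {j: [f_0, ..., f_L]} with f_l = n_l^(j) / n_j.

    observations: list of (j, z) pairs, where j indexes the stimulus X^(j)
    and z in {0, ..., L} is the observed label (hypothetical layout).
    num_labels: L + 1.
    """
    counts = Counter(observations)                 # n_l^(j) for each (j, l)
    totals = Counter(j for j, _ in observations)   # n_j for each stimulus
    return {
        j: [counts[(j, l)] / n_j for l in range(num_labels)]
        for j, n_j in sorted(totals.items())
    }

# Toy data: stimulus 1 applied 4 times, stimulus 2 applied twice, L = 1.
obs = [(1, 0), (1, 1), (1, 0), (1, 0), (2, 1), (2, 1)]
types = empirical_types(obs, num_labels=2)
print(types)
```

Each returned list is one empirical law over the L + 1 labels of the corresponding partition, so its entries sum to one.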

**Problem 1. (Density estimation from region-censored data)**

Find the non-parametric estimate of π_{θ} from the set of J n_{j}-types ${\tilde{f}}^{(j)}$, j = 1, …, J (see Equation (1)) of the discrete random variables associated with the known partitions ${\{{\mathcal{Q}}^{(j)}\}}_{j=1}^{J}$ (see Equation (4)).

Before initiating the study of this estimation problem, we show below how a set of constraints can be related to the observations (1), leading to an alternative problem formulation.

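Let 1_{A}(θ) be the indicator function of set A ⊂ Θ and ${\tilde{\pi}}_{\theta}^{({n}_{j})}$ the (non-observed) empirical distribution:

$${\tilde{\pi}}_{\theta}^{({n}_{j})}(\theta )=\frac{1}{{n}_{j}}\sum _{i=1}^{{n}_{j}}\delta (\theta -{\theta}_{i}^{(j)}),$$

where δ(·) is the Dirac delta and ${\theta}_{i}^{(j)}$, i = 1, …, n_{j}, is the parameter of the i-th individual to whom stimulus X^{(j)} has been applied. It is immediate that ${\tilde{f}}_{\ell}^{(j)}$ in Equation (1) can be written as the statistical expectation of the indicator function of ${R}_{\ell}^{(j)}$ with respect to ${\tilde{\pi}}_{\theta}^{({n}_{j})}$:

$${\tilde{f}}_{\ell}^{(j)}={E}_{{\tilde{\pi}}_{\theta}^{({n}_{j})}}\left[{1}_{{R}_{\ell}^{(j)}}(\theta )\right]=\frac{1}{{n}_{j}}\sum _{i=1}^{{n}_{j}}{1}_{{R}_{\ell}^{(j)}}({\theta}_{i}^{(j)}).\qquad (2)$$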

We stress that in our context, the (virtual) datasets ${\theta}^{(j)}={\left\{{\theta}_{i}^{(j)}\right\}}_{i=1}^{{n}_{j}}$ are distinct for different values of j ∈ {1, …, J}, since they correspond to statistically-independent samples from π_{θ}.

The remarks above allow us to relate Problem 1 to two alternative problems: Problem 2 formulated below and Problem 3 presented in the next subsection.

**Problem 2. (Density estimation under moment constraints)**

Consider a set of partitions ${\mathcal{R}}^{(j)}$, j ∈ {1, …, J}, all of size L + 1, and let ${\{{g}_{m}(\cdot )\}}_{m=1}^{M}$, with M = (L + 1)J, be the set of indicator functions ${\{{1}_{{R}_{\ell}^{(j)}}(\cdot )\}}_{j=1,\ell =0}^{J,L}$. Denote by ${\tilde{g}}_{m}$, m = 1, …, M, the corresponding empirical moments as in (2). Find the non-parametric estimate of π_{θ} that satisfies the set of constraints:

$${E}_{{\pi}_{\theta}}\left[{g}_{m}(\theta )\right]={\tilde{g}}_{m},\quad m=1,\dots ,M.$$

Note that the existence and uniqueness of the solution to this problem are not guaranteed: depending on the set of partitions and empirical moments, the problem may have no solution, or it may admit a solution that is possibly non-unique.

The next subsection summarizes the present background on the two problems formulated above. Prior to that, we present three definitions that will be useful in the sequel.

**Definition 1.** Let $\mathcal{Q}$ be the smallest partition of Θ whose generated σ-algebra, $\sigma (\mathcal{Q})$, contains all partitions ${\{{\mathcal{Q}}^{(j)}\}}_{j=1}^{J}$ (elements of $\mathcal{Q}$ are the minimal elements of the closure of the union of all partitions ${\mathcal{Q}}^{(j)}$ with respect to set intersection). The size $Q=|\mathcal{Q}|$ is necessarily finite. We denote by E_{m}, m ∈ {1, …, Q}, a generic element of $\mathcal{Q}$.

The bottom row of Figure 2 shows the partition $\mathcal{Q}$ generated by the two partitions in the top row.
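The construction of $\mathcal{Q}$ in Definition 1 can be sketched numerically on a discretized Θ: two points belong to the same element of $\mathcal{Q}$ exactly when they receive the same label under every stimulus. The sketch below is only an illustration under stated assumptions: the two labeling functions are hypothetical stand-ins for θ ↦ C(M(X^{(j)}|θ)), Θ is replaced by a finite grid on [0, 1]², and elements of $\mathcal{Q}$ that contain no grid point would be missed.

```python
from collections import defaultdict

def refine(points, labelers):
    """Group points of a discretized Theta by their joint label signature.

    points: list of parameter vectors (a finite grid standing in for Theta).
    labelers: list of J functions theta -> label, one per stimulus; each
              plays the role of theta -> C(M(X^(j) | theta)).
    Returns {signature: [points]}; each key identifies one element of Q.
    """
    regions = defaultdict(list)
    for theta in points:
        signature = tuple(label(theta) for label in labelers)
        regions[signature].append(theta)
    return dict(regions)

def elements_of(regions, j, ell):
    """Signatures of the elements of Q contained in R_ell^(j)."""
    return [sig for sig in regions if sig[j] == ell]

# Hypothetical labelers on Theta = [0, 1]^2: one binary, one three-class.
labelers = [
    lambda th: int(th[0] + th[1] >= 1.0),
    lambda th: 0 if th[0] < 0.3 else (1 if th[0] < 0.7 else 2),
]
grid = [((i + 0.5) / 20, (k + 0.5) / 20) for i in range(20) for k in range(20)]
regions = refine(grid, labelers)
print(len(regions), sorted(regions))
```

With these two assumed partitions (two and three classes), all six label combinations occur, so the refinement has Q = 6 elements. The helper `elements_of` (zero-based stimulus index) lists the elements of $\mathcal{Q}$ that make up one censoring region.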

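**Definition 2.** ${\mathbf{E}}_{\ell}^{(j)}$ is the set of elements of $\mathcal{Q}$ that intersect ${R}_{\ell}^{(j)}$, such that:

$${R}_{\ell}^{(j)}=\bigcup _{E\in {\mathbf{E}}_{\ell}^{(j)}}E.$$

**Definition 3.** Let π_{θ} be a probability distribution over Θ and $\mathcal{Q}$ a finite partition of Θ. We denote by ${\pi}_{\theta ,\mathcal{Q}}$ the probability law induced by π_{θ} over the elements of $\mathcal{Q}$:

$${\pi}_{\theta ,\mathcal{Q}}(m)={\pi}_{\theta}({E}_{m})={\int}_{{E}_{m}}{\pi}_{\theta}(\theta )\,d\theta ,\quad m=1,\dots ,|\mathcal{Q}|.$$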

#### 1.3. Background

#### 1.3.1. Density Estimation from Region-Censored Data

Determination of ${\widehat{\pi}}_{\theta}$, the NPMLE (non-parametric maximum likelihood estimate) of π_{θ} from censored observations, i.e., the solution of Problem 1, has been studied by many authors, starting with the pioneering formulation of the Kaplan–Meier product-limit estimator [1]. Several types of censoring (one-sided, interval, etc.) have been considered since, first for scalar and more recently for multivariate distributions.

The problem addressed here departs from previous studies in that our (multi-dimensional) censoring regions ${R}_{\ell}^{(j)}$ ⊂ Θ can have arbitrary geometry. To emphasize this, we speak of “region-censoring” instead of the more common term “interval-censoring.” Another important difference is that our regions are elements of a known set of partitions and are in general observed several times, whereas the literature usually assumes no relation between the censoring intervals, each one typically being applied once.

Several facts are known about the NPMLE for censored observations.

**Proposition 1.**

(i) The support of ${\widehat{\pi}}_{\theta}$, ${\mathcal{S}}_{\text{NPMLE}}=\{\theta :{\widehat{\pi}}_{\theta}(\theta )>0\}$, is confined to a finite number K ≤ Q of elements of $\mathcal{Q}$, the so-called “elementary regions”:

$${\mathcal{S}}_{\text{NPMLE}}\subseteq \bigcup _{k=1}^{K}{E}_{k}.$$

This set necessarily has a non-empty intersection with all observed lists ${\mathbf{E}}_{\ell}^{(j)}$, i.e., with every ${\mathbf{E}}_{\ell}^{(j)}$ for which ${n}_{\ell}^{(j)}>0$;

(ii) all distributions that put the same probability mass w_{k} = π_{θ}(E_{k}), k = 1, …, K, in the elementary regions have the same likelihood;

(iii) there is in general no unique assignment of probabilities ${\{{\widehat{w}}_{k}\}}_{k=1}^{K}$ that maximizes the likelihood.

Turnbull [2] first demonstrated (i), giving an algorithm to find the pairs ${\{({E}_{k},{w}_{k})\}}_{k=1}^{K}$ for the scalar case. Gentleman and Vandal [3] addressed the multivariate interval-censored case, showing that the E_{k}’s are the intersections of the elements of the maximal cliques of the intersection graph of the set of observed intervals; see Figure 3a for a bi-dimensional example. We have shown elsewhere [4] that (i) also holds when the censoring sets have arbitrary geometry, but that some elementary regions are now associated with non-maximal cliques of the intersection graph, as shown in Figure 3b, requiring a slightly more complex identification of the sets E_{k}, which we do not detail here.
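Once the elementary regions are known, the masses w_{k} can be computed iteratively. The sketch below shows the standard self-consistency (EM) iteration commonly associated with Turnbull's approach. It is an illustration under stated assumptions, not the algorithm used in the paper: the elementary regions are taken as given, and the membership matrix and counts form a hypothetical toy dataset with K = 3 and two binary partitions.

```python
def self_consistency(counts, membership, iterations=1000):
    """EM / self-consistency iteration for region-censored counts.

    counts[r]: number of observations falling in censoring region r.
    membership[r][k]: 1 if elementary region E_k is contained in region r.
    Returns the probability masses w_k of the elementary regions.
    """
    K = len(membership[0])
    N = sum(counts)
    w = [1.0 / K] * K                      # start from the uniform vector
    for _ in range(iterations):
        new_w = [0.0] * K
        for n_r, row in zip(counts, membership):
            p_r = sum(a * wk for a, wk in zip(row, w))   # P(region r)
            if n_r == 0 or p_r == 0.0:
                continue
            for k in range(K):
                # expected share of the n_r observations lying in E_k
                new_w[k] += n_r * row[k] * w[k] / p_r
        w = [v / N for v in new_w]
    return w

# Hypothetical example: K = 3, two binary partitions of Theta.
membership = [
    [1, 0, 0],   # R_0^(1) = E_1
    [0, 1, 1],   # R_1^(1) = E_2 u E_3
    [1, 1, 0],   # R_0^(2) = E_1 u E_2
    [0, 0, 1],   # R_1^(2) = E_3
]
counts = [12, 8, 10, 10]
w_hat = self_consistency(counts, membership)
print([round(v, 4) for v in w_hat])
```

For these assumed counts, the iteration drives the mass of E_{2} to zero and converges towards w = (0.55, 0, 0.45), a small instance of the support concentration described in fact (i).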

Facts (i) and (ii) together imply that the NPMLE problem can be studied in the K-dimensional probability simplex ${\mathbb{S}}^{K}$, since ${\widehat{\pi}}_{\theta}(\cdot )$ is determined only up to the probability vector **ŵ** = {ŵ_{1}, …, ŵ_{K}}. The two types of “non-uniqueness” of the NPMLE, (ii) and (iii), were first pointed out by Turnbull [2]. More recently, they were studied in detail for the multivariate case in [3], where the authors coined the terms representational (ii) and mixture (iii) non-uniqueness, further showing that the set of probability laws ${\widehat{\pi}}_{\theta}$ defining NPMLEs is a polytope.

The NPMLE under censored observations retains the typical consistency properties of maximum likelihood estimates; in particular, ${\widehat{\pi}}_{\theta}({\mathcal{R}}_{\ell}^{(j)})$ tends to ${\pi}_{\theta ,{\mathcal{R}}^{(j)}}(\ell )$ (see Equation (4)) when n_{j} → ∞. It is not possible to guarantee the consistency of the estimate of the distribution ${\pi}_{\theta ,\mathcal{Q}}$ over the finer partition $\mathcal{Q}$. However, the simulation studies presented in Section 3 show that as the number of partitions J tends to infinity and the σ-algebra $\sigma (\mathcal{Q})$ gets finer, while keeping each n_{j} fixed (and thus, N → ∞ with J), the distance between the true and estimated probability laws decreases to zero.

Facts (i)–(iii) seriously hinder application of NPMLEs in many domains, in particular when, as is the case in our study, they provide a model of the diversity of the population under analysis that will be used for subsequent risk assessment. Besides being affected by some degree of arbitrariness (Facts (ii) and (iii)), the concentration of the probability mass in a small number of bounded regions reveals a tendency to underestimate population diversity, which may result in strong biases when estimating risk under unobserved stresses. The simulation studies presented in Section 4 illustrate to what extent the lack of identifiability and the tendency of the NPMLE to concentrate its support compromise the ability to predict the empirical laws corresponding to stimuli that were not used in the available dataset.

#### 1.3.2. Density Estimation under Moment Constraints

Potential non-uniqueness problems in density estimation under constraints on moments, as in Problem 2, have most often been solved by relying on the maximum entropy (MaxEnt) principle [5] to select the most uninformative density that can match the observed moments ${\{{\tilde{g}}_{m}\}}_{m=1}^{M}$. Several information entropies have been considered in this context. The original Shannon entropy H_{1}(·) remains the most commonly used, due to its simple interpretation in terms of coding theory and its intimate link to fundamental results in estimation theory. Amongst generalized entropies, the Rényi entropy H_{α}(·), which coincides with the Shannon entropy when α → 1, is often chosen due to its appealing numerical and analytical tractability for α = 2:

$${H}_{\alpha}(\pi )=\frac{1}{1-\alpha}\log {\int}_{\Theta}{\pi}^{\alpha}(\theta )\,d\theta ,\quad \alpha >0,\ \alpha \ne 1.$$
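To make the relation between the two entropies concrete, the following sketch evaluates the discrete analogue of the Rényi entropy, H_{α}(p) = log(Σ_k p_k^α)/(1 − α), for a probability vector p. It is an illustration only: the paper works with densities over Θ, and the vector `p` below is an arbitrary example.

```python
import math

def shannon_entropy(p):
    """H_1(p) = -sum p_k log p_k (natural logarithm)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def renyi_entropy(p, alpha):
    """H_alpha(p) = log(sum p_k^alpha) / (1 - alpha), alpha > 0, alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(pk ** alpha for pk in p if pk > 0)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
h2 = renyi_entropy(p, 2.0)             # equals -log(sum p_k^2)
h_near_1 = renyi_entropy(p, 1.000001)  # numerically close to Shannon
print(h2, h_near_1, shannon_entropy(p))
```

For α = 2 the entropy is a monotone function of the quadratic form Σ_k p_k², so maximizing H_{2} amounts to minimizing a quadratic criterion, which is the tractability argument invoked above.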

**Problem 3. (H-MaxEnt density estimator)**

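Let H(·) be a generalized entropy. The H-MaxEnt estimate ${\widehat{\pi}}_{\theta}^{H}$ of Problem 2 is the solution of:

$${\widehat{\pi}}_{\theta}^{H}=\arg \underset{\pi \in \mathcal{G}}{\max}\,H(\pi ),\quad \mathcal{G}=\{\pi :{E}_{\pi}\left[{g}_{m}(\theta )\right]={\tilde{g}}_{m},\ m=1,\dots ,M\}.$$

When $\mathcal{G}$ is non-empty (i.e., the constraints are compatible), the MaxEnt density can be analytically determined for some choices of H(·).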

**Proposition 2. (Equivalence to ML estimation in the exponential family)**

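Assume that the constraints ${\{{\tilde{g}}_{m}\}}_{m=1}^{M}$ of Problem 2 are statistical averages with respect to the empirical distribution of a common dataset ${\theta}^{(N)}={\{{\theta}_{n}\}}_{n=1}^{N}$, i.e., ${\tilde{\pi}}_{\theta}^{({n}_{j})}={\tilde{\pi}}_{\theta}^{(n)}$ in Equation (2), such that:

$${\tilde{g}}_{m}=\frac{1}{N}\sum _{n=1}^{N}{g}_{m}({\theta}_{n}),\quad m=1,\dots ,M.$$

Then the H_{1}-MaxEnt estimate coincides with the maximum likelihood estimate, given ${\theta}^{(N)}$, in the exponential family with sufficient statistics ${\{{g}_{m}\}}_{m=1}^{M}$:

$${\pi}_{\lambda}(\theta )\propto \exp \left(\sum _{m=1}^{M}{\lambda}_{m}{g}_{m}(\theta )\right).\qquad (6)$$

Note that the H_{1}-MaxEnt/ML equivalence is lost when the empirical averages ${\tilde{g}}_{m}$ are not all obtained from the same dataset, as is the case in our problem, where (see Equation (2)) constraints associated with distinct stimuli are derived from distinct empirical distributions.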

When the constraints are not compatible, i.e., $\mathcal{G}=\varnothing $ and Problem 2 has no solution, ${\widehat{\pi}}_{\theta}^{H}$ is not defined, and only a relaxed version of the original problem can be solved.

**Problem 4. (Relaxed H-MaxEnt density estimator)**

Let H be a generalized entropy, and **ϵ** ∈ (ℝ^{+})^{M}. The **ϵ**-relaxed H-MaxEnt density estimate ${\widehat{\pi}}_{\theta}^{ME,\epsilon}$ is the solution of:

$${\widehat{\pi}}_{\theta}^{ME,\epsilon}=\arg \underset{\pi}{\max}\,H(\pi )\quad \text{s.t.}\quad {\Vert {E}_{\pi}\left[\mathbf{g}(\theta )\right]-\tilde{\mathbf{g}}\Vert}_{\pi}\le \boldsymbol{\epsilon},$$

where **g** is the M-dimensional vector function with m-th component g_{m}(·), $\tilde{\mathbf{g}}$ is the M-dimensional vector of empirical expectations of **g**, ║·║_{π} is a vector of norms depending on π and the inequality is understood component-wise.

This estimator has been studied in detail in [8,9] for the Shannon entropy and moment constraints derived from a single empirical distribution, where the authors fully exploit the equivalence between regularized MaxEnt as formulated above and ℓ_{1}-penalized maximum likelihood in the exponential family, showing that Proposition 2 holds in a more generic sense.

**Proposition 3. (Equivalence of ℓ_{1}-regularized H_{1}-MaxEnt and penalized log-likelihood [9])**

Problem 4 with H = H_{1} (Shannon entropy) and ║·║_{π} the ℓ_{1} norm for the expected value:

$$\left|{E}_{\pi}\left[{g}_{m}(\theta )\right]-{\tilde{g}}_{m}\right|\le {\epsilon}_{m},\quad m=1,\dots ,M,$$

where the constraints $\tilde{\mathbf{g}}$ are empirical averages computed using a dataset Θ, is equivalent to the maximization of the log-likelihood of Θ for the exponential family (6) penalized by the term ${\sum}_{m}{\epsilon}_{m}|{\lambda}_{m}|$, where ϵ_{m} is the m-th element of **ϵ**.

By linking the relaxation level (the parameter **ϵ** in Problem 4) to the expected level of accuracy of the empirical averages ${\tilde{g}}_{m}$, the authors of [8,9] are able to establish performance guarantees for the resulting density estimate, in terms of log-likelihood loss.

As before, this regularized-MaxEnt/penalized-ML equivalence only holds when all constraints are on the empirical moments with respect to the same underlying empirical distribution. This is not true in population analysis, where an individual is observed only through one of the partitions, and we cannot invoke the properties of maximum likelihood estimators to characterize the properties of regularized MaxEnt estimators, as is done in [8].

We remark that the regularized MaxEnt estimates are unique for strictly concave entropy functionals and always exist for sufficiently large **ϵ**. They suffer neither from representational non-uniqueness, the optimal continuous distribution being constant inside each element of $\mathcal{Q}$, nor from mixture non-uniqueness, being the solution of a concave criterion under linear inequality constraints.

#### 1.4. Contributions

As largely documented in the literature, the NPMLE using censored data frequently exhibits a singular behavior. By concentrating probability mass in a subset of Θ of small Lebesgue measure, it favors “over-homogeneous” population models that may lead to dangerous biases in the context of risk assessment, by masking the existence of individuals for which risk can be large. As shown above, the problem of density estimation from censored observations addressed in the paper can be recast as the problem of density estimation under a set of constraints derived from the censored observations, each constraint being associated with one of the censoring regions.

While MaxEnt has been frequently used for density estimation from the joint observation of empirical moments of a set of features, its use for region-censored data arising from strongly quantized data, as we consider in this paper, violates the conditions under which the equivalence to maximum likelihood estimation in the Gibbs family has previously been established. In these circumstances, guarantees on the likelihood of the original data can no longer be given.

We propose a novel estimator that explicitly relies on the two criteria, the most likely maximum entropy estimator (MLME), where the degree of regularization of a MaxEnt estimate (i.e., of the solutions to Problem 4) is chosen such that the resulting estimate has maximum likelihood. The duality of the two criteria is exploited to allow suppression of singularities that are due to inconsistent or small datasets, and the resulting solution converges to the non-parametric maximum likelihood solution as the size of the datasets associated with each constraint (censoring region) grows. By using the Rényi entropy of order two instead of the Shannon entropy, we are led to a quadratic optimization problem with linear inequality constraints that has an efficient numerical implementation.

While no theoretical performance guarantees are given, the paper presents numerical studies of the performance of the proposed MLME estimator in real and simulated data, comparing it to the NPMLE and to the best fitting MaxEnt solutions. The results of cross-validation on a real dataset show that our novel estimator is better than the NPMLE or the minimally-regularized MaxEnt estimator, leading to better predictions of the population risk under unobserved stress conditions.

The paper is organized as follows. Section 2 illustrates the poor behavior of the NPMLE using simulated data. We show (Section 2.4) that even the most uncertain of the NPMLEs still presents singularities that are unlikely to occur in a natural population. The section starts by presenting the likelihood function and defining the polytope of NPMLE solutions. It also addresses the numerical determination of the NPMLE, and two optimization algorithms are presented.

Section 3 presents the main contribution of the paper, introducing the most likely Rényi MaxEnt estimator (MLME; see Definition 4). We compare our estimator to the NPMLE, demonstrating using simulated datasets that it performs better. We also present numerical studies of its asymptotic behavior as the number J of different stimuli becomes large, revealing a remarkably better behavior.

In Section 4, the proposed estimator is applied to the real problem that motivated this study, in the context of the prevention of decompression sickness in hyperbaric deep sea diving. The new estimator is compared to classical maximum likelihood and maximum entropy estimators on real and simulated data, illustrating the superior performance of the new estimator in a realistic situation.

## 3. Most Likely Rényi-MaxEnt

To avoid the singular behavior of the NPMLE, we must estimate π_{θ} with a criterion other than maximum likelihood. Relying on the link of our problem with density estimation under constraints, we propose to estimate π_{θ} through the maximum entropy principle.

If there exists a π that can satisfy all constraints, i.e., if there exists a solution to Problem 2, the corresponding **w** belongs to the NPMLE polytope $\mathcal{P}$. However, being derived from J distinct empirical distributions, the J constraints are in general inconsistent, and as in [9], we consider entropy maximization under relaxed constraints, i.e., Problem 4. For reasons of numerical efficiency, we consider the Rényi entropy H_{2}.

**Problem 5. (Relaxed ME estimator)**

For ϵ ∈ ℝ^{+}, define the ϵ-relaxed MaxEnt estimator as:

$${\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon}=\arg \underset{\pi}{\max}\,{H}_{2}(\pi )\quad \text{s.t.}\quad {\left\Vert {\left({\Sigma}^{(j)}\right)}^{-1/2}\left({\mathbf{f}}_{+}^{(j)}-{\tilde{\mathbf{f}}}_{+}^{(j)}\right)\right\Vert}_{\infty}\le \epsilon ,\quad j=1,\dots ,J,\qquad (13)$$

where Σ^{(j)} is the covariance of the empirical estimate ${\tilde{\mathbf{f}}}_{+}^{(j)}$ and ${\mathbf{f}}_{+}^{(j)}$ is obtained from **f**^{(j)} by retaining all but one of its non-zero elements.

We remark that the constraints in Problem 5, the relaxed MaxEnt problem that we solve, take into account the correlation between the observed frequencies, contrary to what is done in [9], where the degrees of relaxation of each constraint are fixed independently, as in Problem 4. As we will verify in Section 4 (see also the discussion around Figure 8), use of an inappropriate metric in the constraints directs the estimator towards sets of solutions that have lower likelihood, resulting in a poor ability to reproduce the observed empirical moments.

Denote by ϵ^{⋆} ≥ 0 the smallest value of ϵ for which there exists a solution to Problem 5. Since in (13) we use the ℓ_{∞} metric to evaluate the deviation of a model π with respect to the empirical moments, and ℓ_{∞} is not equivalent to the (Riemannian) metric induced by maximum likelihood in the simplex ${\mathbb{S}}^{K}$, we cannot guarantee that likelihood is monotonically decreasing with the degree of relaxation, i.e., that $\mathcal{L}\left({\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon}\right)<\mathcal{L}\left({\widehat{\pi}}_{\theta}^{{H}_{2},{\epsilon}^{\star}}\right)$ for ϵ > ϵ^{⋆}. In fact, as the plot of the log-likelihood of ${\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon}$ as a function of ϵ/ϵ^{⋆} in Figure 7 shows, this is not necessarily true for values of ϵ close to ϵ^{⋆}. More importantly, this figure shows that a suitable choice of the relaxation term can lead to a likelihood loss with respect to the NPMLE that is minimal, improving the fit to the data. These remarks motivate the definition of the new estimator proposed in this paper.

**Definition 4. (MLME: the most likely MaxEnt estimator)**

Let ${\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon}$ denote the solution of Problem 5 for a generic ϵ ≥ ϵ^{⋆}. The most likely Rényi-MaxEnt estimator is:

$${\widehat{\pi}}_{\theta}^{{H}_{2},ml}=\arg \underset{\pi \in \{{\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon},\ \epsilon \ge {\epsilon}^{\star}\}}{\max}\,\mathcal{L}(\pi ).$$

**Proposition 5.** (ϵ^{⋆} = 0)

If ϵ^{⋆} = 0, then the feasible set of the constrained optimization Problem 5 coincides with the NPMLE polytope. Since the likelihood of all solutions with ϵ > 0 will be smaller, the MLME estimate coincides in this case with the MaxEnt NPMLE: ${\epsilon}^{\star}=0\Rightarrow {\widehat{\pi}}_{\theta}^{{H}_{2},ml}={\widehat{\pi}}_{\theta}^{{H}_{2},{\epsilon}^{\star}}={\widehat{\pi}}_{\theta}^{\mathcal{L}}$.

Since the probability that ϵ^{⋆} = 0 is small for finite datasets, the solution space of our constrained optimization problem is in general larger than the NPMLE polytope $\mathcal{P}$. We now illustrate the geometry of the MLME ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ using the following simple example, for which L = 1, J = 2, K = 3 and:

$${R}_{0}^{(1)}={E}_{1},\quad {R}_{1}^{(1)}={E}_{2}\cup {E}_{3},\quad {R}_{0}^{(2)}={E}_{1}\cup {E}_{2},\quad {R}_{1}^{(2)}={E}_{3}.$$

This choice allows us to represent graphically the elements of ${\mathbb{S}}^{3}$; see Figure 8. The empirical moments (${\tilde{f}}_{0}^{(1)}$, ${\tilde{f}}_{1}^{(1)}$, ${\tilde{f}}_{0}^{(2)}$, ${\tilde{f}}_{1}^{(2)}$) have been chosen such that the constraints are incompatible, avoiding the trivial case where ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$, ${\widehat{\pi}}_{\theta}^{{H}_{2},{\epsilon}^{\star}}$ and ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ all coincide.

Figure 8 illustrates in ${\mathbb{S}}^{3}$ the geometry behind the MLME. Black lines ${w}_{1}={\tilde{f}}_{0}^{(1)}$ and ${w}_{3}={\tilde{f}}_{1}^{(2)}$ correspond to the constraints, which do not intersect since they are incompatible. For this example, the NPMLE (orange dot on the boundary of ${\mathbb{S}}^{3}$, its second component being zero) is unique. All distributions that satisfy the minimally-relaxed constraints (i.e., with ϵ = ϵ^{⋆}) belong to the two gray areas, their intersection defining ${\widehat{\pi}}_{\theta}^{{H}_{2},{\epsilon}^{\star}}$ (the green dot, also on the boundary of ${\mathbb{S}}^{3}$). The dashed green line is the curve defined by ${\widehat{\pi}}_{\theta}^{{H}_{2},\epsilon}$ in ${\mathbb{S}}^{3}$ for ϵ ≥ ϵ^{⋆}, which has an accumulation point at the uniform distribution ${w}_{1}={w}_{2}={w}_{3}=\frac{1}{3}$, reached once ϵ becomes sufficiently large for the uniform distribution to satisfy the constraints. Our estimator, the MLME, is the point on this green curve at which the value of the likelihood is the largest, that is, the contact point with the highest level set of the likelihood function over ${\mathbb{S}}^{3}$ whose intersection with the green curve is a single point. The orange curve shows this level set, the contact point (red dot) being the MLME ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$.
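The selection rule behind the MLME can be sketched on a three-region toy problem of this kind. Everything specific in the snippet is an assumption made for illustration: the counts are invented, the regions E_{1}, E_{2}, E_{3} are taken to have equal volume (so that maximizing H_{2} amounts to minimizing Σ_k w_k²), the partitions are taken as {E_{1}}, {E_{2} ∪ E_{3}} and {E_{1} ∪ E_{2}}, {E_{3}}, and each binary constraint is relaxed as |w − f̃| ≤ ϵ·s, with s the binomial standard deviation of the empirical frequency. The quadratic program is solved by a one-dimensional ternary search, a shortcut that only works for this small case.

```python
import math

def maxent_h2(f1, s1, f3, s3, eps):
    """Minimize w1^2 + w2^2 + w3^2 subject to |w1 - f1| <= eps*s1,
    |w3 - f3| <= eps*s3 and w in the simplex (assumed constraint form).
    Returns (w1, w2, w3), or None if the feasible set is empty."""
    lo_a, hi_a = max(0.0, f1 - eps * s1), min(1.0, f1 + eps * s1)
    lo_b, hi_b = max(0.0, f3 - eps * s3), min(1.0, f3 + eps * s3)
    hi = min(hi_a, 1.0 - lo_b)
    if lo_a > hi + 1e-12:
        return None
    hi = max(hi, lo_a)                      # guard against rounding

    def best_b(a):  # closed-form minimizer over w3 for fixed w1 = a
        return min(max((1.0 - a) / 2.0, lo_b), min(hi_b, 1.0 - a))

    def cost(a):
        b = best_b(a)
        return a * a + b * b + (1.0 - a - b) ** 2

    left, right = lo_a, hi
    for _ in range(200):                    # cost is convex in a
        m1 = left + (right - left) / 3.0
        m2 = right - (right - left) / 3.0
        if cost(m1) <= cost(m2):
            right = m2
        else:
            left = m1
    a = 0.5 * (left + right)
    b = best_b(a)
    return (a, max(0.0, 1.0 - a - b), b)

def log_likelihood(w, counts):
    """Counts ordered as (R_0^(1), R_1^(1), R_0^(2), R_1^(2))."""
    n10, n11, n20, n21 = counts
    w1, w2, w3 = w
    return (n10 * math.log(w1) + n11 * math.log(w2 + w3)
            + n20 * math.log(w1 + w2) + n21 * math.log(w3))

counts = (12, 8, 10, 10)                    # hypothetical, incompatible data
n1, n2 = 20, 20
f1, f3 = counts[0] / n1, counts[3] / n2     # constraints w1 = f1, w3 = f3
s1 = math.sqrt(f1 * (1 - f1) / n1)
s3 = math.sqrt(f3 * (1 - f3) / n2)
# Smallest feasible relaxation (closed form valid for this toy case only).
eps_star = max(0.0, (f1 + f3 - 1.0) / (s1 + s3))
grid = [eps_star * (1.0 + 0.05 * i) for i in range(200)]
path = [(eps, maxent_h2(f1, s1, f3, s3, eps)) for eps in grid]
eps_ml, w_ml = max(path, key=lambda t: log_likelihood(t[1], counts))
print(eps_star, eps_ml, w_ml)
```

The list `path` plays the role of the green curve: it starts on the boundary w_{2} = 0 at ϵ^{⋆} and moves towards the uniform vector as ϵ grows. With these assumed counts and this scalar form of the constraints, the log-likelihood decreases along the whole path, so the scan returns ϵ^{⋆} itself; other data or constraint metrics can place the maximum at some ϵ > ϵ^{⋆}.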

The MLME estimator ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ corresponds in general to an ϵ > ϵ^{⋆} in the constraints (13). In terms of the vector **w** of probabilities of the elementary regions E_{k}, the set of distributions that satisfy these constraints is a polytope ${\mathcal{P}}_{\epsilon}$, defined by its linear boundaries, which characterizes all solutions compatible with the data. Although the determination of its vertices is a difficult task, approximation of ${\mathcal{P}}_{\epsilon}$ by the maximum-volume interior ellipsoid is feasible at a reasonable computational cost [19], providing directly a lower bound on the volume of ${\mathcal{P}}_{\epsilon}$.

Figure 9 shows the proposed estimator ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ for the same dataset as in Figure 6. Note that the distribution of the probability mass is much smoother than in Figure 6 and that the support of ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ is now the entire Θ. This example shows that the new estimator ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ is able to exploit the dual characteristics of the ML and MaxEnt criteria to produce an estimate that is not too informative while still fitting the observed data reasonably well.

Two common measures of the difference between two distributions are the Kolmogorov and the total variation distances. The Kolmogorov distance d_{K} is the maximum value of the absolute difference between the two cumulative distributions, while the total variation distance d_{TV} is the sum of all absolute differences [20].
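For discrete laws over a common ordered support, both distances reduce to a few lines. The sketch below follows the definitions given above; the function names and the two example vectors are illustrative assumptions, and the ordering of the support (needed for the cumulative sums) is taken as given.

```python
from itertools import accumulate

def kolmogorov_distance(p, q):
    """Largest absolute gap between the two cumulative distributions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def total_variation(p, q):
    """Sum of absolute differences, as defined in the text above.
    Many references include an extra factor 1/2."""
    return sum(abs(a - b) for a, b in zip(p, q))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kolmogorov_distance(p, q), total_variation(p, q))
```

The Kolmogorov distance depends on how the elements of the support are ordered, whereas the total variation distance does not.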

Figure 10 addresses the performance of the estimation of the true probability law over $\mathcal{Q}$, showing box-plots of the Kolmogorov (left) and total variation (right) distances between ${\pi}_{\theta ,\mathcal{Q}}$ and the NPMLE and MLME estimates observed in 200 simulations, each for N = 10^{3} observations. In each plot, the box on the left corresponds to the MaxEnt-NPMLE estimator ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$ and the one on the right to the proposed estimator ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$. This clearly demonstrates the superiority of the estimator proposed in the paper. Note that the difference is more pronounced for the total variation, which is the criterion that best indicates the predictive power of the identified population model.

Finally, Figure 11 shows the behavior under an increasing number of randomly-generated binary partitions. The total number of observations grows with J: N = 100 J. The plots show the empirical average of the two Kullback–Leibler divergences D(·║π_{θ}) (Figure 11a) and D(π_{θ}║·) (Figure 11b) over 100 randomly-generated datasets for each value of J, with J varying from 10 to 100 in steps of 10. Here, the probability of “dangerous” partitions has been increased to 10^{−2}, to guarantee a sufficient number of samples censored by them.

Figure 11a suggests that ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ may be consistent, whereas the behavior observed for the NPMLE strongly suggests that it is not. The divergence $D({\pi}_{\theta}\Vert {\widehat{\pi}}_{\theta}^{\mathcal{L}})$ was infinite in all simulations (due to ${\widehat{\pi}}_{\theta}^{\mathcal{L}}({E}_{k})=0$ for some E_{k} ∈ $\mathcal{Q}$) and, thus, cannot be presented in Figure 11b.
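The asymmetry between the two divergences, and the reason one of them becomes infinite for an estimate with empty regions, can be checked on a small discrete example. The sketch below is illustrative only; the two probability vectors are invented and stand for a true law and a singular estimate over three elements of $\mathcal{Q}$.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_k p_k log(p_k / q_k); infinite if q_k = 0 < p_k."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue            # convention: 0 log 0 = 0
        if qk == 0.0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total

true_law = [0.5, 0.25, 0.25]    # hypothetical true probabilities
singular = [0.5, 0.5, 0.0]      # hypothetical estimate with an empty region
print(kl_divergence(singular, true_law))   # finite
print(kl_divergence(true_law, singular))   # infinite
```

An estimate that assigns zero mass to a region with positive true probability yields an infinite D(π_{θ}║·), while the reverse divergence stays finite.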