Most Likely Maximum Entropy for Population Analysis with Region-Censored Data †

The paper proposes a new non-parametric density estimator from region-censored observations with application in the context of population studies, where standard maximum likelihood is affected by over-fitting and non-uniqueness problems. It is a maximum entropy estimator that satisfies a set of constraints imposing a close fit to the empirical distributions associated with the set of censoring regions. The degree of relaxation of the data-fit constraints is chosen, such that the likelihood of the inferred model is maximal. In this manner, the estimator is able to overcome the singularity of the non-parametric maximum likelihood estimator and, at the same time, maintains a good fit to the observations. The behavior of the estimator is studied in a simulation, demonstrating its superior performance with respect to the non-parametric maximum likelihood and the importance of carefully choosing the degree of relaxation of the data-fit constraints. In particular, the predictive performance of the resulting estimator is better, which is important when the population analysis is done in the context of risk assessment. We also apply the estimator to real data in the context of the prevention of hyperbaric decompression sickness, where the available observations are formally equivalent to region-censored versions of the variables of interest, confirming that it is a superior alternative to non-parametric maximum likelihood in realistic situations. Entropy 2015, 17 3964


Motivation
The paper presents a new density estimator motivated by problems of population modeling, where the interest is in estimating the probability distribution π θ , θ ∈ Θ, of the parameters of a mathematical model M (•|θ) characterizing the response y(t|θ) of individuals to applied stimuli x(t).The ultimate goal is in general to be able to predict the dispersion of the response of the population to an arbitrary future stimulus x(t), rather than to make a "tomography" of the population itself.These types of problems are frequent in domains like biomedical engineering, insurance studies or environmental management.
If the parameter θ can be estimated from each observation y(t|θ) and each individual's parameter is chosen independently from π θ , the problem of estimating π θ from a collection of responses {(y i (t|θ i ), x i (t))} N i=1 is formally equivalent to the usual density estimation problem from a set of independent and identically distributed samples {θ i } N i=1 ∼ π θ and can be solved using standard parametric or non-parametric methods; see the abundant literature on non-linear mixed-effects models.The situation considered in this paper is more complex, in that the response y(•|θ) of the model is not observable, and we only have access to the result of the classification of its assignment to a finite number (L + 1) of possible labels by a known classifier C(•). Figure 1 illustrates the structural modeling/observation framework that we consider.In this setup, each observation can no longer be related to a single point θ ∈ Θ, the same label z being assigned, for the same stimulus x(t), to all responses inside a subset R ⊂ Θ.The set R is completely determined by the pair (z, x(t)) together with knowledge of the model M (•|θ) and of the classifier rule C(•).This situation, when a single observation does not give information with respect to the individual value θ, but only the indication that it belongs to a set, is commonly known in the statistical literature as "censored observations".While in general studies of the density estimation under censored observations have assumed that the censoring sets R are intervals, the geometry of our censoring regions is determined by the structure of the (possibly highly non-linear) operators M (•|θ) and C(•) and can have an arbitrary morphology, requiring modification of the existing methods.
In Section 4, we detail a particular instance of the problem formally presented above, relevant in the context of the prevention of decompression sickness in hyperbaric diving.Readers may want to read the material in Section 4.1 to have a concrete instantiation of the generic stimuli and operators used in the presentation above.

Notation and Problem Formulation
Consider the notation introduced in Section 1.1 (see also Figure 1), and let {(z n , x n (•))} N n=1 denote the available set of observations, where label z n ∈ {0, . . ., L} has been observed for input X (n) = {x n (t), t ∈ T n }, where T n is the duration of the stimulus.Denote by R n ⊂ Θ the set of all individual parameters whose response to X (n) receives label z n : We assume that for all possible stimuli X (n) , the composition C M (X (n) |•) (of the model and the classifier) is a measurable function from Θ to {0, . . ., L} with respect to the restriction of the Lebesgue measure to the set Θ.Under this assumption, the probability of the sets M −1 X (n) (C −1 ( )) is well defined for all 0 ≤ ≤ L and all stimuli for any distribution absolutely continuous with respect to the Lebesgue measure.
Usually, in population studies, the same stimulus is applied to several individuals.We assume here that stimuli X (j) are chosen in a finite set X = X (1) , . . ., X (J) .Each possible input function X (j) in X determines a partition of Θ in L + 1 sets, that we denote by Q (j) = {R (j) 0 , . . ., R (j) L }: The top row of Figure 2 illustrates schematically partitions that correspond to classification in two (L 1 = 1) and three (L 2 = 2) classes of the response to two distinct stimuli.Let n j be the number of times that stimulus X (j) has been used in the N observations and n (j) the number of times label occurred in these n j experiences.The observed dataset determines J empirical laws f (j) , each one associated with a distinct partition Q (j) : n j , = 0, . . ., L, j = 1, . . ., J, When we want to emphasize the number of observations on which these empirical laws are based, we will call f (j) an n j -type.With the notation defined above, we can finally state the problem addressed in this paper with full generality.

Problem 1. (Density estimation from region-censored data)
Find the non-parametric estimate of π θ from the set of J n j -types f (j) , j = 1, . . ., J (see Equation ( 1)) of the discrete random variables associated with the known partitions {Q (j) } J j=1 (see Equation ( 4)).
Before initiating the study of this estimation problem, we show below how a set of constraints can be related to the observations (1) leading to an alternative problem formulation.
Let 1 A (θ) be the indicator function of set A ⊂ Θ and π(n j ) θ the (non-observed) empirical distribution: i , i = 1, . . ., n j , is the parameter of the i-th individual to whom stimulus X (j) has been applied.It is immediate that f (j) in Equation ( 1) can be written as the statistical expectation of the indicator function of R (j) with respect to π(n j ) θ : We stress that in our context, the (virtual) datasets θ (j) = θ are distinct for different values of j ∈ {1, . . ., J}, since they correspond to statistically-independent samples from π θ .
The remarks above allow us to relate Problem 1 to two alternative problems: Problem 2 formulated below and Problem 3 presented in the next subsection.

Problem 2. (Density estimation under moment constraints)
Consider a set of partitions R (j) , j ∈ {0 . . .L} all of size L+1, and let {g m (•)} M m=1 , with M = (L+1)J, be the set of indicator functions {1 R (j) (•)} J,L j=1, =0 .Denote by gm , m = 1, . . ., M , the corresponding empirical moments as in (2).Find the non-parametric estimate of π θ that satisfies the set of constraints: Note that the existence and unicity of the solution to this problem is not guaranteed: depending on the set of partitions and empirical moments, the problem may have no solution or admit a solution (possibly non-unique).
The next subsection summarizes the present background on the two problems formulated above.Prior to that, we present three definitions that will be useful in the sequel.Definition 1.Let Q be the smallest partition of Θ whose generated σ-algebra, σ(Q), contains all partitions {Q (j) } J j=1 (elements of Q are the minimal elements of the closure of the union of all partitions Q (j) with respect to set intersection).The size Q = |Q| is necessarily finite.We denote by E m , m ∈ {1, . . ., Q} a generic element of Q.
The bottom row of Figure 2 shows the partition Q generated by the two partitions in the top.

Definition 2. E
(j) is the set of elements of Q that intersect R (j) , such that: Definition 3. Let π θ be a probability distribution over Θ and Q a finite partition of Θ.We denote by π θ,Q the probability law induced by π θ over the elements of Q: 1.3.Background

Density Estimation from Region-Censored Data
Determination of πθ , the NPMLE (non-parametric maximum likelihood estimate) of π θ from censored observations, i.e., the solution of Problem 1, has been studied by many authors, starting with the pioneering formulation of the Kaplan-Meier product-limit estimator [1].Several types of censoring (one-sided, interval, etc.) have been considered since, first for scalar and more recently for multivariate distributions.
The problem assessed here departs from previous studies in that our (multi-dimensional) censoring regions R (j) ⊂ Θ can have arbitrary geometry.To emphasize this, we speak of "region-censoring", instead of the more common term "interval-censoring." Another important difference concerns the fact that our regions are elements of a known set of partitions, being in general observed several times, while in general, no relation between the censoring intervals is assumed in the literature, each one being usually applied once.Several facts are known about the NPMLE for censored observations.Proposition 1.
(i) The support of πθ , S NPMLE = {θ, : πθ (θ) > 0} is confined to a finite number K ≤ Q of elements of Q, the so-called "elementary regions": This set necessarily has a non-empty intersection with all observed lists E (j) , i.e., (ii) all distributions that put the same probability mass w k = {π θ (E k )}, k = 1, . . ., K in the elementary regions have the same likelihood; (iii) there is in general no unique assignment of probabilities { ŵk } K k=1 that maximizes the likelihood.
Turnbull [2] has first demonstrated (i) giving an algorithm to find the pairs {(E k , w k )} K k=1 for the scalar case.Gentleman and Vandal [3] addressed the multivariate interval-censored case, showing that the E k 's are the intersections of the elements of the maximal cliques of the intersection graph of the set of observed intervals; see Figure 3a for a bi-dimensional example.We have shown elsewhere [4] that (i) also holds when the censoring sets have arbitrary geometry, but that some elementary regions are now associated with non-maximal cliques of the intersection graph, as shown in Figure 3b, requiring a slightly more complex identification of the sets E k , which we do not detail here.
Facts (i) and (ii) together imply that the NPMLE problem can be studied in the K-dimensional probability simplex S K , since πθ (•) is determined only up to the probability vector ŵ = { ŵ1 , . . ., ŵK }.The two types of "non-uniqueness" of the NPMLE, (ii) and (iii), have been first pointed out by Turnbull [2].More recently, they were studied in detail for the multi-variate case in [3], where the authors coined the terms representational (ii) and mixture (iii) non-uniqueness, further showing that the set of probability laws πθ defining NPMLEs is a polytope.The NPMLE under censored observations retains the typical consistency properties of the maximum likelihood estimates, in particular πθ (R (j) ) tends to π θ,R (j) ( ) (see Equation ( 4)) when n j → ∞.It is not possible to guarantee the consistency of the estimate of the distribution of π θ,Q over the finer partition Q.However, the simulations studies presented in Section 3 show that as the number of partitions J tends to infinity and this σ-algebra gets finer, while keeping fixed each n j (and thus, n → ∞ with J), the distance between the true and estimated probability laws decreases to zero.Facts (i)-(iii) seriously hinder application of NPMLEs in many domains, in particular when, as is the case in our study, they provide a model of the diversity of the population under analysis that will be used for subsequent risk assessment.Besides being affected by some degree of arbitrariness (Facts (ii) and (iii)), the concentration of the probability mass in a small number of bounded regions reveals a tendency to underestimate population diversity, which may result in strong biases when estimating risk under unobserved stresses.The simulation studies that will be presented in Section 4 illustrates to what extent a lack of identifiability and a tendency to concentrate its support compromise the ability to predict the empirical laws corresponding to stimuli that were not used in the available dataset.

Density Estimation under Moment Constraints
Eventual non-unicity problems in density estimation under constraints on moments, like Problem 2, have been most often solved by relying on the maximum entropy (MaxEnt) principle [5] to select the most un-informative density that can match the observed moments {g m } M m=1 .Several information entropies have been considered in this context, the original Shannon entropy H 1 (•) remaining the most commonly used due to its simple interpretation in terms of coding theory and its intimate link to fundamental results in estimation theory, while amongst generalized entropies, the Rényi entropy H α (•), coinciding with Shannon when α → 1, is often chosen due to its appealing numerical and analytical tractability for α = 2:

Problem 3. (H-MaxEnt density estimator)
Let H(•) be a generalized entropy.The H-MaxEnt estimate πH θ of Problem 2 is the solution of: When G is non-empty (i.e., the constraints are compatible) the MaxEnt density can be analytically determined for some choices of H(•).

Proposition 2. (Equivalence to ML estimation in the exponential family)
Assume that the constraints {g m } M m=1 of Problem 2 are statistical averages with respect to the empirical distribution of a common dataset θ 2), such that: Then: (1) (Boltzmann theorem [6]) the H 1 -MaxEnt estimate πH 1 θ maximizes the likelihood of the observations in the exponential family, where Z λ is a normalizing constant (the partition function), and the {λ m } M m=1 are determined such that the M constraints are satisfied.In short, the MaxEnt (non-parametric) estimate coincides with the maximum likelihood parametric estimate inside the exponential distributions.
Note that the H 1 -MaxEnt/ML equivalence is lost when the empirical averages gm are not all obtained from the same dataset, as is the case in our problem, where (see Equation ( 2)) constraints associated with distinct stimuli are being derived from distinct empirical distributions.
When the constraints are not compatible, i.e., G = ∅ and Problem 2 has no solution, πH θ is not defined, and only a relaxed version of the original problem can be solved.
where g is the M -dimensional vector function with m-th component g m (•), g is the M -dimensional vector of empirical expectations of g, • π is a vector of norms depending on π and inequality is understood component-wise.
This estimator has been studied in detail in [8,9] for the Shannon entropy and moment constraints derived from a single empirical distribution, where the authors fully exploit the equivalence between regularized MaxEnt as formulated above and 1 -penalized maximum likelihood in the exponential family, showing that Proposition 2 holds in a more generic sense.Proposition 3. (Equivalence of 1 -regularized H 1 -MaxEnt and penalized log-likelihood [9]) Problem 4 with H = H 1 (Shannon entropy) and • π the 1 norm for the expected value: where the constraints g are empirical averages computed using a dataset Θ, is equivalent to the maximization of the sum of the log-likelihood of Θ for the exponential family (6) penalized by the term m m |λ m |, where m is the m-th element of .
By linking the relaxation level (the parameter in Problem 4) to the expected level of accuracy of the empirical averages gm , in [8,9], the authors are able to establish performance guarantees for the resulting density estimate, in terms of log-likelihood loss.
As before, this regularized-MaxEnt/penalized-ML equivalence only holds when all constraints are on the empirical moments with respect to the same underlying empirical distribution.This is not true in population analysis, where an individual is observed only through one of the partitions, and we cannot invoke the properties of maximum likelihood estimators to characterize the properties of regularized MaxEnt estimators, as is done in [8].
We remark that the regularized MaxEnt estimates are unique for strictly concave entropy functionals and always exist for sufficiently large .They do not suffer from neither representational non-unicity, the optimal continuous distribution being constant inside each element of Q, nor from mixture non-uniqueness, being the solution of a concave criterion under linear inequality constraints.

Contributions
As largely documented in the literature, the NPMLE using censored data frequently exhibits a singular behavior.By concentrating probability mass in a subset of Θ of a small Lebesgue measure, they favor "over-homogenous" population models that may lead to dangerous biases in the context of risk assessment, by masking the existence of individuals for which risk can be large.As shown above, the problem of density estimation from censored observations addressed in the paper can be recast as the problem of density estimation under a set of constraints derived from the censored observations, each constraint being associated with one of the censoring regions.
While MaxEnt has been frequently used for density estimation from the joint observation of empirical moments of a set of features, its use for region-censored data arising from strongly quantified data, as we consider in this paper, violates the conditions under which previous equivalence to maximum likelihood estimation in the Gibbs family can be established.In these circumstances, guarantees on the likelihood of the original data can be no longer given.
We propose a novel estimator that explicitly relies on the two criteria, the most likely maximum entropy estimator (MLME), where the degree of regularization of a MaxEnt estimate (i.e., of the solutions to Problem 4) is chosen such that the resulting estimate has maximum likelihood.The duality of the two criteria is exploited to allow suppression of singularities that are due to inconsistent or small datasets, and the resulting solution converges to the non-parametric maximum likelihood solution as the size of the datasets associated with each constraint (censoring region) grows.By using the Rényi entropy of order two instead of the Shannon entropy, we are led to a quadratic optimization problem with linear inequality constraints that has an efficient numerical implementation.
While no theoretical performance guarantees are given, the paper presents numerical studies of the performance of the proposed MLME estimator in real and simulated data, comparing it to the NPMLE and to the best fitting MaxEnt solutions.The results of cross-validation on a real dataset show that our novel estimator is better than the NPMLE or the minimally-regularized MaxEnt estimator, leading to better predictions of the population risk under unobserved stress conditions.
The paper is organized as follows.Section 2 illustrates the poor behavior of the NPMLE using simulated data.We show (Section 2.4) that even the most uncertain of the NPMLEs still presents singularities that are unlikely to occur in a natural population.The section starts by presenting the likelihood function and defining the polytope of NPMLE solutions.It also addresses the numerical determination of the NPMLE, and two optimization algorithms are presented.
Section 3 presents the main contribution of the paper, introducing the most likely Rényi MaxEnt estimator (MLME; see Definition 4).We compare our estimator to the NPMLE, demonstrating using simulated datasets that it performs better.We also present numerical studies of its asymptotic behavior as the number J of different stimuli becomes large, revealing a remarkably better behavior.
In Section 4, the proposed estimator is applied to the real problem that motivated this study, in the context of the prevention of decompression sickness in hyperbaric deep sea diving.The new estimator is compared to classical maximum likelihood and maximum entropy estimators on real and simulated data, illustrating the superior performance of the new estimator in a realistic situation.

Simulated Data Generation Mechanism
Before presenting the NPMLE and discussing its determination, we present the data generation mechanism that will be used to illustrate the different estimators presented in this and the next sections.The simulated population distribution π θ is the restriction of the joint distribution of two independent and identically distributed normal variables of mean µ = 0.5 and variance σ 2 = 0.2 to the unit square Partitions R (j) are randomly generated by considering random unions of the elements of the Voronoi tessellation of S = 50 points uniformly drawn in Θ; see Figure 4.The partition Q induced by 10 random binary splits of Θ is shown in Figure 5a, and Figure 5b is a color-coded representation of the probability law p π,Q , where the fine partition Q is easily recognizable (black delimited polygonal regions).We remark that the size of the elements of the partitions generated by our simulation mechanism tends have low dispersion, following approximately a gamma distribution with both parameters equal to (7/2)λ −2 , where λ is the intensity of the homogenous Poisson process [10] (λ = 1/50 in our simulations).In Section 4, we will see that this may not be the case in practical applications.Observations are then generated by independently sampling n j times from each of the probability laws associated with the individual partitions R (j) , for j = 1, . . ., J.In the numerical studies presented in this section, J = 10.To simulate the situation when some stimuli are seldom applied (for instance, if they may have compromised the safety of the individual to which they are applied), the partitions are divided into two groups, representing "safe" and "dangerous" stimuli, of sizes seven and three, respectively.The probability that a dangerous partition is chosen is 10 −3 , and inside each group, partitions are chosen uniformly.Except when indicated otherwise, we will consider a total of N = 10 4 observations.

Likelihood Function
The log-likelihood function for Problem 1 is: Consider the partition Θ = S NPMLE ∪S NPMLE , with A the complement of set A, and where S NPMLE is the union of the elementary regions Note that since the elementary regions introduced in Definition 2 is well defined.

(j)
. is the -th row of B (j) , the (L + 1) × K binary matrix, with B , and w ∈ S K is the vector of probabilities of the elementary regions E k : S K the K-dimensional probability simplex: Equations ( 7) and (8) show that (Proposition 1 (iii)) all π θ leading to the same w have the same likelihood.Proposition 4.There is in general no single w maximizing ( 7) and all elements of: where ŵ is a NPMLE are also NPMLEs.We call P the NPMLE polytope.
Note that the non-unicity statement above concerns w the probabilities of the elementary regions E k , but that the probability of the censoring regions R (j) is uniquely estimated, all w ∈ P assigning the same probabilities to the elements of the partitions Q (j) .It is obvious that the estimator is consistent for these, but no stronger statement seems to be possible.

Optimizing the Likelihood
Several algorithms have been proposed to maximize (7); see e.g., [11].Gentleman and Vandal [3] discussed several methods and summarized them in two categories: those based on convex optimization and those based on finite mixture estimation.Two algorithms are compared in [11]: the iterative convex minorant (ICM), initially presented by Groeneboom and Wellner [12], and the vertex exchange method [13].
We show below that a multiplicative algorithm, known as the Richardson-Lucy algorithm [14] in the framework of image deconvolution, can be used to maximize L. This follows from the fact that maximization of L is equivalent to an optimal design problem, enabling application of a vast collection of efficient algorithms originating from optimal design theory.As far as we know, this link of NPMLE estimation using censored observations to a D-optimal design problem has not been remarked on before.
Consider the n j (L + 1) × K matrices B (j) , obtained from the ((L + 1) × K) matrix B (j) by repeating n (j) times line , the N × K matrix B that stacks all B (j) , j = 1, . . ., J, the N × N diagonal matrix Then, it is easy to show that L can be written as: demonstrating that the determination of ŵ maximizing L(w; { f (j) , Q (j) }) with respect to w ∈ S K corresponds to a D-optimal design problem for the matrix M(w), with w considered as a design measure allocating weight w k to the elementary design matrix H k (see, e.g., [15]).A number of important properties follow from this equivalence with a D-optimal design problem.In particular, see [16,17], the iterations: initialized at some strictly positive w (0) converge to a maximizer of (7).This multiplicative algorithm is easy to implement, but the following vertex exchange method (VEM) [13] ensures a faster convergence to the optimum.The VEM updating rule is: where: and α is chosen to maximize a quadratic approximation of the log-likelihood evaluated at w (t+1) .In the multiplicative and VEM algorithms, we use the stopping condition max k∈{1,...,K} Our numerical studies show that the VEM Algorithm ( 11) is faster than the multiplicative Algorithm (10) requiring on average three-times less iterations to converge.We stress that these optimization algorithms can be applied to all Q elements of the complete partition Q and automatically sets to zero the entries of w that do not correspond to the elementary sets {E k } K k=1 (in particular, when using the result in [18] to detect the entries of w that can be set to zero), so that the computationally-expensive analysis of the intersection graph presented in [4] is not required.
Figure 6a shows one of the NPMLE estimates (i.e., one element of the NPMLE polytope) for a simulated dataset produced, as we explained in the beginning of the section.This example clearly displays the NPMLE singularities that have been mentioned before: while π θ is strictly positive inside the complete unit square, significant regions of Θ are assigned zero probability mass (the white regions in the figure), and the support of πθ is strictly contained in Θ.

Least Informative NPMLE
As stated in Proposition 1 (iii), the NPMLE is not unique, and we have seen (Proposition (4)) that the set of solutions is the polytope P defined in Equation ( 9), associated with the matrix B, B =    B (1)  . . .
Motivated by the ultimate goal of capturing the largest possible diversity of the underlying population, we select from the NPMLE polytope P the distribution that is least informative, i.e., that has maximum entropy.
Let ŵ be an NPMLE, and define f (j) = B (j) ŵ, with ŵ the vector of probabilities πθ (E k ), k = 1, . . ., K. We denote by πL θ the distribution in P maximizing the entropy H; it satisfies: Determining πL θ for the Shannon entropy, i.e., for H = H 1 , is a non-trivial non-linear constrained optimization problem.However, for H = H 2 , the Rényi-MaxEnt NPMLE probability vector w is the solution to the following quadratic program with linear equality constraints: with ν(E k ) the volume of E k , for which efficient solutions exist.The Rényi-MaxEnt NPMLE for the same dataset that leads to the NPMLE in Figure 6a is displayed in Figure 6b.We can see that the restriction to the NPMLE polytope still forces the density to be concentrated in a strict subset of Θ, with areas of zero measure (white zones in Figures 6a,b).This is inherent to the likelihood criterion, which favors the most concentrated densities that are able to explain the observed data.
Indeed, it is easy to see that the support of a NPMLE density may importantly shrink when a stimulus that is applied only once is added to the dataset, confirming the ill-conditioning of the NPMLE for small datasets.Suppose a new stimulus X (J+1) applied only once with resulting label is added to a dataset already containing J stimuli: Let Q be the partition of Θ corresponding to "old" stimuli j ≤ J and Q the new partition, which also integrates (X (J+1) , n (J+1) ).If R (J+1) intersects an elementary set E k ∈ Q, such that: then E k \ E k will no longer be an elementary set, showing that the support of the NPMLE will shrink.
Note that we may have ν(E k ) ν(E k ), with ν(•) the Lebesgue measure.

Most Likely Rényi-MaxEnt
To avoid the singular behavior of the NPMLE, we must estimate π θ with a criterion other than maximum likelihood.Relying on the link of our problem with density estimation under constraints, we propose to estimate π θ through the maximum entropy principle.
If there exists a π that can satisfy all constraints, i.e., if there exists a solution to Problem 2, the corresponding w belongs to the NPMLE polytope P.However, being derived from J distinct empirical distributions, the J constraints are in general inconsistent, and as in [9], we consider entropy maximization under relaxed constraints, i.e., Problem 4. For reasons of numerical efficiency, we consider the Rényi entropy H 2 .

Problem 5. (Relaxed ME estimator)
For ∈ R + , define the -relaxed MaxEnt estimator as: where Σ (j) is the covariance of the empirical estimate f (j) + and f (j) + is obtained from f (j) by retaining all but one of its non-zero elements.
We remark that the constraints in Problem 5, the relaxed MaxEnt problem that we solve, take into account the correlation between the observed frequencies, contrary to what is done in [9], where the degrees of relaxation of each constraint are fixed independently, as in Problem 4. As we will verify in Section 4 (see also the discussion around Figure 8), use of an inappropriate metric in the constraints directs the estimator towards sets of solutions that have have lower likelihood, resulting in a poor ability to reproduce the observed empirical moments.
Denote by ≥ 0 the smallest value of for which there exists a solution to Problem 5. Since in (13) we use the ∞ metric to evaluate the deviation of a model π with respect to the empirical moments and ∞ is not equivalent to the (Riemannian) metric induced by maximum likelihood in the simplex S K , we cannot guarantee that likelihood is monotonically decreasing with the degree of relaxation, i.e., that ), for > .In fact, as the plot of the log-likelihood of πH 2 , θ as a function of / in Figure 7 shows, this is not necessarily true for values of close to .More importantly, this figure shows that a suitable choice of the relaxation term can lead to a likelihood loss with respect to the NPMLE that is minimal, improving the fit to the data.These remarks motivate the definition of the new estimator proposed in this paper.
Proposition 5. ( = 0) If = 0, then the feasible set of the constrained optimization Problem 5 coincides with the NPMLE polytope.Since the likelihood of all solutions with > 0 will be smaller, the MLME estimate coincides in this case with the MaxEnt NPMLE: Since the probability that = 0 is small for finite datasets, the solution space of our constrained optimization problem is in general larger than the NPMLE polytope P. We illustrate now the geometry of the NLME πH 2 ,ml θ using the following simple example for which L = 1, J = 2, K = 3 and: This choice allows us to represent graphically the elements of S 3 ; see Figure 8.The empirical moments ( ) have been chosen such that the constraints are incompatible, avoiding the trivial case where πL θ , πH 2 , θ and πH 2 ,ml θ all coincide.Figure 8 illustrates in S 3 the geometry behind the MLME.Black lines w 1 = f (1) 1 correspond to the constraints, which do not intersect since they are incompatible.For this example, the NPMLE (orange dot on the boundary of S 3 , its second component being zero) is unique.All distributions that satisfy the minimally-relaxed constraints (i.e., with = ) belong to the two gray areas, their intersection defining πH 2 , θ (the green dot, also on the boundary of S 3 ).The dashed green line is the curve defined by πH 2 , θ in S 3 for ≥ , which has an accumulation point in the uniform distribution w 1 = w 2 = w 3 = 1 3 as becomes sufficiently large for the uniform distribution to satisfy the constraints.Our estimator MLME is the point in this green curve at which the value of the likelihood is the largest, that is the highest level set of the likelihood function over S 3 whose intersection with the green curve is a single point.The orange curve shows this level set, the contact point (red dot) being the MLME πH 2 ,ml θ .The MLME estimator πH 2 ,ml θ corresponds in general to an > in the constraints (13).In terms of vector w of probabilities of the elementary regions E k , this set is a polytope P , defined by its linear boundaries, which characterizes all solutions compatible with the data.One may notice that although the determination of its vertices is a difficult task, approximation of P by the maximum-volume interior ellipsoid is feasible at a reasonable computational cost [19], providing directly a lower bound on the volume of P .
Figure 9 shows the proposed estimator πH 2 ,ml θ for the same dataset as in Figure 6.Note that the distribution of the probability mass is much smoother than in Figure 6 and that the support of πH 2 ,ml θ is now the entire Θ.This example shows that the new estimator πH 2 ,ml θ is able to exploit the dual characteristics of the ML and MaxEnt criteria to produce an estimate that is not too informative while still fitting the observed data reasonably well.Two common measures of the difference between two distributions are the Kolmogorov and the total variation distances.The Kolmogorov distance d K is the maximum value of the absolute difference between the two cumulative distributions, while the total variation distance d TV is the sum of all absolute differences [20].Figure 10 addresses the performance of the estimation of the true probability law over Q, showing box-plots of the Kolmogorov-Smirnov (left) and total variation (right) distances between π θ,Q and the NPMLE and the MLME estimates observed in 200 simulations, each for N = 10 3 observations.In each plot, the box in the left corresponds to the MaxEnt-NPMLE estimator πL θ and the one on the right to the the proposed estimator πH 2 ,ml θ .This clearly demonstrates the superiority of the estimator proposed in the paper.Note that the difference is more pronounced for the total variation, which is the criterion that best indicates the predictive power of the identified population model.
Finally, Figure 11 shows the behavior under an increasing number of randomly-generated binary partitions.The total number of observations grows with J: N = 100J.The plots show the empirical average of the two Kullback-Leibler divergences D(• π θ ) (Figure 11a) and D(π θ •) (Figure 11b) over 100 randomly-generated datasets for each value of J, with J varying from 10 to 100 in steps of 10.Here, the probability of "dangerous" partitions has been increased to 10 −2 , to guarantee a sufficient number of samples censored by them.Figure 11a suggests that πH 2 ,ml θ may be consistent, which is strongly contradicted by the behavior observed for the NPMLE.The divergence D(π θ ||π L θ ) was infinite in all simulations (due to πL θ (E k ) = 0 for some E k ∈ Q) and, thus, cannot be presented in Figure 11b.).

Application to a Real Problem: Modeling Decompression Sickness
The density estimation problem studied in this paper has been motivated by a problem of population analysis in the context of the prevention of decompression sickness (DCS) in deep sea diving, which is known to be highly correlated with the presence of gas bubbles in the diver's blood.The ability to correctly predict the probability that this volume becomes exceedingly high can thus be exploited to establish safe diving rules, avoiding diving profiles (duration/depth) that may be dangerous for a non-negligible part of the population.
More precisely, we are interested in estimating the distribution π θ of the biophysical parameters θ of a mathematical model [21] for the instantaneous volume B(t) of micro-bubbles flowing through the right ventricle of a diver's heart when executing a decompression profile P (t) (see Figure 12a): Gas presence in the diver's circulatory system is only observed through "bubble grades", which are strongly quantified samples of B(t) (the red horizontal lines in Figure 12b indicate the quantification levels τ = {τ } L =1 applied to B(t), represented by the blue curve).In our case L = 4, as shown in Figure 12b, and thresholds τ 0 = 0 < τ 1 < • • • < τ L < τ L+1 = ∞ are assumed known.Since it is usually accepted that DCS is related to the maximum observed grade, only the grade corresponding to the peak volume: b(θ, P ) = max is retained, such that for each executed dive D n where (the known) profile P n has been followed by a diver with (unknown) bio-physical parameter θ n , a single grade measure G n is recorded: In Figure 12, a simplified model with θ ∈ Θ ⊂ R 2 , with Θ the rectangular colored region in Figure 12c, has been used, all other parameters of model ( 15) being held fixed.Note that all biophysical parameters θ in region R n : yield the same grade G n for all dives that use profile P n .Each diving profile P induces in this manner a partition 12c displays the regions corresponding to the L + 1 = 5 possible grade values for the profile P in Figure 12a.In this example, observation of a grade G = 3 indicates that the diver's biophysical parameters θ belong to the orange region.
The dataset available for this study contains records of the bubble grades observed over a total of J = 19 distinct profiles, repeated a number n j of times ranging from 12 to 41 (see Table 1; the most dangerous profiles have been executed less often) and leads to the partition Q shown in Figure 13.We remark on the strong dispersion of the sizes of the elements of Q in this case, in particular the presence of very narrow regions that are contained in the elements of several partitions.The elements of Q have in this case strongly elongated shapes, markedly different from the partitions built from Voronoi cells used in the simulations of the previous sections.Before showing the results obtained in the real dataset of grade measures for the set of profiles available, we study the performance of the method proposed on the set of partitions corresponding to the set of observed profiles using simulated data.We considered the simulation of two normally and independent random variables restricted to a (biologically motivated) rectangular domain Θ.We kept the same n j as shown in Table 1 and, thus, the same total N = 433., represented in Figure 14, is very strong in this case, the probability mass being concentrated in a subset of Θ of small Lebesgue measure.On the contrary, even for a partition of complex geometry like this one, the proposed MLME estimator (see Figure 15b) is able to overcome the shortcomings of the maximum-likelihood based estimates, producing an estimate that resembles the simulated law (in Figure 15a).The resulting population model has limited complexity while still retaining a superior predictive power, as is obvious from these plots.Figure 16 shows the evolution of the likelihood along the curve πH 2 , θ , > .We can see that the likelihood loss is larger than for the random partitions and that πH 2 ,ml θ is obtained for .The larger likelihood loss can be explained by a smaller number of observations (N = 433 here, while for the previous simulation study N = 10 4 ) and also by the more irregular geometry of the partition Q, with a large number of small elongated sets, which can produce over-optimistic values of the likelihood by concentrating mass over those sets., leads to a much smoother solution, resembling π θ,Q and covering nearly all of the domain, which seems to provide a more natural model of a biological population than the solution found by the two other estimators.
For the dataset sizes of our study with Q = 665, we observed very fast convergence of (11) for the complete Q (35 iterations for δ = 10 −4 ), confirming the applicability of the proposed algorithm.
We now assess the likelihood loss of our solution.Figure 18 shows the variation of L(π H 2 , θ ) with / for this real dataset.Compared to what we observed with random partitions in Figure 7, there is now a significant likelihood loss, the blue curve staying well below the maximum likelihood value for all values of the regularization parameter.This is natural, being an expected consequence of eventual misfits of the biophysical/classification model, which induce errors in the definition of the partitions Q (j) associated with the distinct profiles P (j) and, thus, compromise the ability to closely fit the data.
Finally, we show, for this real dataset, the importance of accounting for the correlation of the empirical distributions in the constrained optimization problem. Figure 19 shows the estimates πH 2 , θ (top) and πH 2 ,ml θ (bottom) obtained using the entire matrix Σ (left) or just its diagonal elements (right).We can see that simple independent relaxation of the empirical laws is not able to prevent the estimates from becoming highly concentrated, indicating an over-homogeneous population., full Σ (j) −1/2 ; (d) πH 2 ,ml θ , diag(Σ (j) −1/2 ).

Assessing Predictive Power
To assess the predictive power of the three estimators πL θ , πH 2 , θ and πH 2 ,ml θ , we performed leave-one-out cross-validation, removing at each time all observations relative to one profile P (j) and computing the three estimators using the data for the remaining 18 profiles.We then compare the estimated and observed grades' frequencies f (j) for the retained profile.
Figure 20 represents boxplots of the total variation distance for each of the 19 profiles in the dataset, confirming the superiority of the estimator proposed for prediction purposes.In the sense of the total variation distance, πH 2 ,ml θ is on average the closest distribution to the empirical frequencies in the dataset.

Conclusions
The paper studied the estimation of a probability density from region-censored observations, with application to the prevention of decompression sickness during hyperbaric diving.We show that the NPMLE is intrinsically ill-posed, leading to unstable solutions that are biologically implausible.Expressing counts of the censored observations as empirical means of a set of features, we derive the MaxEnt solution that best approximates the empirical distributions.The degree of fitting to the observed frequencies is chosen by selecting the MaxEnt solution that has the largest likelihood.The tests conducted show that the proposed most likely Rényi-MaxEnt estimator has superior behavior compared to the minimally-relaxed MaxEnt estimator, being able to approximate the observed dataset while at the same time being plausible as a description of a natural population.In particular, our numerical experiments show that our construction leads to a distribution estimate with good generalization properties, being able to predict grade probabilities for unseen profiles well, and can thus be used to detect profiles with a high risk of decompression sickness.

Figure 1 .
Figure 1.Partial response observation: z is the classification of the response to stimulus x(t) in a finite set.

Figure 3 .
Figure 3. Definition of elementary regions from the cliques of the intersection graph.(a) Three intervals: maximal clique and corresponding elementary region E k (shaded region); (b) three regions with empty intersection resulting in three disjoint elementary regions E k (the shaded regions).

Figure 4 .
Figure 4. Two randomly-generated partitions of the unit square.

Figure 5 .
Figure 5. (a) Partition Q determined by J = 10 random binary partitions of Θ.(b) Probability law π θ,Q induced over the elements of the partition Q.

Figure 8 .
Figure 8. Illustration of the three proposed estimators in a simple example.

Figure 11 .
Figure 11.Kullback-Leibler divergence for an increasing number J of partitions.(a) Empirical average of D(π H 2 ,ml θ

Figure 12 .
Figure 12.Definition of bubble grades G and regions R P .(a) Diving profile P (t); (b) blue: gas volume B; red: thresholds τ ; (c) regions corresponding to the L+1 = 5 bubble grades G.

Figure 13 .
Figure 13.Partition induced by the 19 profiles in the real diving dataset.

Figures 14 and 15
Figures 14 and 15 show the results obtained with a total of N = 10 4 observations.The singularity of both πL θ and πH 2 , θ

Figure 20 .
Figure 20.Boxplots of the total variation distance d TV for the 19 datasets in the leave-on-out cross-validation study.

Table 1 .
Number of experiments by profile in a real dataset.