# Most Likely Maximum Entropy for Population Analysis with Region-Censored Data
## Abstract


## 1. Introduction

#### 1.1. Motivation

This paper addresses the estimation of the probability distribution π_θ, θ ∈ Θ, of the parameters of a mathematical model M(·|θ) characterizing the response y(t|θ) of individuals to applied stimuli x(t). The ultimate goal is in general to predict the dispersion of the response of the population to an arbitrary future stimulus x(t), rather than to make a "tomography" of the population itself. These types of problems are frequent in domains like biomedical engineering, insurance studies or environmental management.

When the individual responses are fully observed, the problem of estimating π_θ from a collection of responses ${\{\left({y}_{i}\left(t|{\theta}_{i}\right),\phantom{\rule{0.2em}{0ex}}{x}_{i}\left(t\right)\right)\}}_{i=1}^{N}$ is formally equivalent to the usual density estimation problem from a set of independent and identically distributed samples ${\{{\theta}_{i}\}}_{i=1}^{N}\sim {\pi}_{\theta}$ and can be solved using standard parametric or non-parametric methods; see the abundant literature on non-linear mixed-effects models. The situation considered in this paper is more complex, in that the response y(·|θ) of the model is not observable: we only have access to the result of its classification, i.e., its assignment to one of a finite number (L + 1) of possible labels by a known classifier C(·). Figure 1 illustrates the structural modeling/observation framework that we consider.
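As a toy illustration of this observation scheme, the pipeline θ → y(·|θ) → label can be sketched as follows. All model details here are hypothetical (a first-order linear response integrated by forward Euler, and a two-threshold classifier); they only stand in for the abstract M(·|θ) and C(·) of the text:

```python
import numpy as np

def model_response(theta, x, t):
    """Hypothetical individual model M(x|theta): first-order response
    y'(t) = -theta[0]*y(t) + theta[1]*x(t), integrated by forward Euler;
    the final value is used as the summary response."""
    y = 0.0
    dt = t[1] - t[0]
    for k in range(len(t) - 1):
        y = y + dt * (-theta[0] * y + theta[1] * x[k])
    return y

def classify(y, thresholds=(0.5, 1.0)):
    """Known classifier C(.): maps the response to a label in {0, ..., L}."""
    return int(np.searchsorted(thresholds, y))

t = np.linspace(0.0, 1.0, 101)
x = np.ones_like(t)          # a constant stimulus x(t)
theta = (1.0, 2.0)           # one individual's (unobserved) parameters
z = classify(model_response(theta, x, t))   # only z is observed
```

Only the label `z` is recorded; the trajectory y(t|θ) and the parameter θ itself remain hidden, which is precisely the region-censoring mechanism studied below.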

#### 1.2. Notation and Problem Formulation

Assume that for each of N individuals, a label z_n ∈ {0, …, L} has been observed for input X^{(n)} = {x_n(t), t ∈ T_n}, where T_n is the duration of the stimulus. Denote by R_n ⊂ Θ the set of all individual parameters whose response to X^{(n)} receives label z_n:

We assume that for each stimulus X^{(n)}, the composition C(M(X^{(n)}|·)) (of the model and the classifier) is a measurable function from Θ to {0, …, L} with respect to the restriction of the Lebesgue measure to the set Θ. Under this assumption, the probability of the sets ${M}_{{x}^{(n)}}^{-1}({C}^{-1}(\ell ))$ is well defined for all 0 ≤ ℓ ≤ L and all stimuli, for any distribution absolutely continuous with respect to the Lebesgue measure.

We assume that the stimuli X^{(j)} are chosen in a finite set $\mathrm{X}=\{{X}^{(1)},\dots ,{X}^{(J)}\}$. Each possible input function X^{(j)} in $\mathrm{X}$ determines a partition of Θ in L + 1 sets, which we denote by ${\mathcal{Q}}^{\left(j\right)}=\{{R}_{0}^{(j)},\dots ,{R}_{L}^{(j)}\}$:

Figure 2 illustrates this for two (L_1 = 1) and three (L_2 = 2) classes of the response to two distinct stimuli.

Let n_j be the number of times that stimulus X^{(j)} has been used in the N observations and ${n}_{\ell}^{(j)}$ the number of times label ℓ occurred in these n_j experiments. The observed dataset determines J empirical laws ${\tilde{f}}^{(j)}$, each one associated with a distinct partition ${\mathcal{Q}}^{(j)}$:

Each ${\tilde{f}}^{(j)}$ is an n_j-type. With the notation defined above, we can finally state the problem addressed in this paper with full generality.

**Problem 1. (Density estimation from region-censored data)** Estimate π_θ from the set of J n_j-types ${\tilde{f}}^{(j)}$, j = 1, …, J (see Equation (1)) of the discrete random variables associated with the known partitions ${\{{\mathcal{Q}}^{(j)}\}}_{j=1}^{J}$ (see Equation (4)).
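A minimal sketch of how the empirical laws ${\tilde{f}}^{(j)}$ are assembled from raw (stimulus index, label) pairs; the toy data below are hypothetical:

```python
import numpy as np

def empirical_types(stimulus_idx, labels, J, L):
    """Empirical laws f_tilde^(j): f[j, l] = n_l^(j) / n_j, the fraction
    of the n_j uses of stimulus j that received label l."""
    f = np.zeros((J, L + 1))
    for j, l in zip(stimulus_idx, labels):
        f[j, l] += 1
    n_j = f.sum(axis=1, keepdims=True)
    return f / np.where(n_j > 0, n_j, 1)   # avoid division by zero

# toy data: stimulus 0 used 4 times, stimulus 1 used 2 times (J=2, L=1)
f = empirical_types([0, 0, 0, 0, 1, 1], [0, 0, 1, 1, 1, 1], J=2, L=1)
# each row of f is one n_j-type and sums to one
```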

Let 1_A(θ) be the indicator function of a set A ⊂ Θ and ${\tilde{\pi}}_{\theta}^{({n}_{j})}$ the (non-observed) empirical distribution:

where ${\theta}_{i}^{(j)}$, i = 1, …, n_j, is the parameter of the i-th individual to whom stimulus X^{(j)} has been applied. It is immediate that ${\tilde{f}}_{\ell}^{(j)}$ in Equation (1) can be written as the statistical expectation of the indicator function of ${R}_{\ell}^{(j)}$ with respect to ${\tilde{\pi}}_{\theta}^{({n}_{j})}$:

Our problem can thus be restated as a density estimation problem under moment constraints on π_θ.

**Problem 2. (Density estimation under moment constraints)** Consider the J partitions ${\mathcal{Q}}^{(j)}$, j = 1, …, J, all of size L + 1, and let ${\{{g}_{m}(\cdot )\}}_{m=1}^{M}$, with M = (L + 1)J, be the set of indicator functions ${\{{1}_{{R}_{\ell}^{(j)}}(\cdot )\}}_{j=1,\ell =0}^{J,L}$. Denote by ${\tilde{g}}_{m}$, m = 1, …, M, the corresponding empirical moments as in (2). Find the non-parametric estimate of π_θ that satisfies the set of constraints:

**Definition 1.** Let $\mathcal{Q}$ be the smallest partition of Θ whose generated σ-algebra, $\sigma (\mathcal{Q})$, contains all partitions ${\{{\mathcal{Q}}^{(j)}\}}_{j=1}^{J}$ (the elements of $\mathcal{Q}$ are the minimal elements of the closure of the union of all partitions ${\mathcal{Q}}^{(j)}$ with respect to set intersection). The size $Q=|\mathcal{Q}|$ is necessarily finite. We denote by E_m, m ∈ {1, …, Q}, a generic element of $\mathcal{Q}$.
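The common refinement of Definition 1 can be computed numerically by labeling a dense sample of Θ under each partition and grouping points with identical label tuples; the two partitions below (a vertical binary split and three horizontal bands) are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_grid = rng.random((1000, 2))               # dense sample of Theta = [0,1]^2

# two hypothetical partitions Q^(1), Q^(2), given as labelings of the sample
q1 = (theta_grid[:, 0] > 0.5).astype(int)        # labels in {0, 1}
q2 = np.digitize(theta_grid[:, 1], [1/3, 2/3])   # labels in {0, 1, 2}

# elements E_m of Q = distinct joint label combinations across all partitions
joint = np.stack([q1, q2], axis=1)
elements, membership = np.unique(joint, axis=0, return_inverse=True)
Q = len(elements)                                # here at most 2 * 3 = 6 elements
```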

**Definition 2**. ${\mathbf{E}}_{\ell}^{(j)}$ is the set of elements of$\mathcal{Q}$ that intersect${R}_{\ell}^{(j)}$, such that:

**Definition 3.** Let π_θ be a probability distribution over Θ and $\mathcal{Q}$ a finite partition of Θ. We denote by ${\pi}_{\theta ,\mathcal{Q}}$ the probability law induced by π_θ over the elements of $\mathcal{Q}$:

#### 1.3. Background

#### 1.3.1. Density Estimation from Region-Censored Data

The non-parametric estimation of π_θ from censored observations, i.e., the solution of Problem 1, has been studied by many authors, starting with the pioneering formulation of the Kaplan–Meier product-limit estimator [1]. Several types of censoring (one-sided, interval, etc.) have been considered since, first for scalar and more recently for multivariate distributions.

**Proposition 1**.

- (i) The support of ${\widehat{\pi}}_{\theta}$, ${\mathcal{S}}_{\text{NPMLE}}=\{\theta :{\widehat{\pi}}_{\theta}(\theta )>0\}$, is confined to a finite number K ≤ Q of elements of $\mathcal{Q}$, the so-called "elementary regions": $${\mathcal{S}}_{\text{NPMLE}}={\cup}_{k=1}^{K}{E}_{k},\quad {E}_{k}\in \mathcal{Q}$$ This set necessarily has a non-empty intersection with all observed lists ${\mathbf{E}}_{\ell}^{(j)}$, i.e., $${n}_{\ell}^{(j)}\ne 0\Rightarrow {\mathbf{E}}_{\ell}^{(j)}\cap {\mathcal{S}}_{\text{NPMLE}}\ne \varnothing $$
- (ii) all distributions that put the same probability mass ${w}_{k}={\pi}_{\theta}({E}_{k})$, k = 1, …, K, in the elementary regions have the same likelihood;
- (iii) there is in general no unique assignment of probabilities ${\{{\widehat{w}}_{k}\}}_{k=1}^{K}$ that maximizes the likelihood.

For interval-censored scalar data, the E_k's are the intersections of the elements of the maximal cliques of the intersection graph of the set of observed intervals; see Figure 3a for a bi-dimensional example. We have shown elsewhere [4] that (i) also holds when the censoring sets have arbitrary geometry, but that some elementary regions are then associated with non-maximal cliques of the intersection graph, as shown in Figure 3b, requiring a slightly more complex identification of the sets E_k, which we do not detail here.

In the sequel, **ŵ** = {ŵ_1, …, ŵ_K} denotes a maximizing assignment of probabilities to the elementary regions. The two types of "non-uniqueness" of the NPMLE, (ii) and (iii), were first pointed out by Turnbull [2]. More recently, they were studied in detail for the multivariate case in [3], where the authors coined the terms representational (ii) and mixture (iii) non-uniqueness, further showing that the set of probability laws ${\widehat{\pi}}_{\theta}$ defining NPMLEs is a polytope.

The NPMLE is a consistent estimator of ${\pi}_{\theta ,{\mathcal{Q}}^{(j)}}(\ell )$ (see Equation (4)) when n_j → ∞. It is not possible to guarantee the consistency of the estimate of the distribution ${\pi}_{\theta ,\mathcal{Q}}$ over the finer partition $\mathcal{Q}$. However, the simulation studies presented in Section 3 show that as the number of partitions J tends to infinity and this σ-algebra gets finer, while keeping each n_j fixed (and thus, n → ∞ with J), the distance between the true and estimated probability laws decreases to zero.

#### 1.3.2. Density Estimation under Moment Constraints

Several entropy measures may be used, Shannon entropy H_1(·) remaining the most commonly used due to its simple interpretation in terms of coding theory and its intimate link to fundamental results in estimation theory, while amongst generalized entropies, the Rényi entropy H_α(·), coinciding with Shannon entropy when α → 1, is often chosen due to its appealing numerical and analytical tractability for α = 2:

**Problem 3. (H-MaxEnt density estimator)**

**Proposition 2. (Equivalence to ML estimation in the exponential family)**

- (Boltzmann theorem [6]) The H_1-MaxEnt estimate ${\tilde{\pi}}_{\theta}^{{H}_{1}}$ maximizes the likelihood of the observations in the exponential family, $${\tilde{\pi}}_{\theta}^{{H}_{1}}(\theta )=\frac{1}{{Z}_{\lambda}}{\displaystyle \prod _{m=1}^{M}\mathrm{exp}({\lambda}_{m}{g}_{m}(\theta ))}$$ where Z_λ is a normalizing constant (the partition function), and the ${\{{\lambda}_{m}\}}_{m=1}^{M}$ are determined such that the M constraints are satisfied. In short, the MaxEnt (non-parametric) estimate coincides with the maximum likelihood parametric estimate inside the exponential family.
- The H_2-MaxEnt estimate [7] ${\tilde{\pi}}_{\theta}^{{H}_{2}}$ is $${\tilde{\pi}}_{\theta}^{{H}_{2}}(\theta )={\left[-\frac{1}{2}{\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{g}_{m}(\theta )}\right]}_{+}$$ where [·]_+ = max(·, 0) and the ${\{{\lambda}_{m}\}}_{m=1}^{M}$ are such that the M constraints are satisfied.

The H_1-MaxEnt/ML equivalence is lost when the empirical averages ${\tilde{g}}_{m}$ are not all obtained from the same dataset, as is the case in our problem, where (see Equation (2)) constraints associated with distinct stimuli are derived from distinct empirical distributions.

**Problem 4. (Relaxed H-MaxEnt density estimator)** Let **ϵ** ∈ ℝ_+^M. The ϵ-relaxed H-MaxEnt density estimate ${\widehat{\pi}}_{\theta}^{ME,\u03f5}$ is the solution of:

where **g** is the M-dimensional vector function with m-th component g_m(·), $\tilde{\mathbf{g}}$ is the M-dimensional vector of empirical expectations of **g**, ║·║_π is a vector of norms depending on π, and the inequality is understood component-wise.

In [9], relaxed MaxEnt estimation is shown to be equivalent to ℓ_1-penalized maximum likelihood in the exponential family, showing that Proposition 2 holds in a more generic sense.

**Proposition 3. (Equivalence of ℓ_1-regularized H_1-MaxEnt and penalized log-likelihood [9])** Let H = H_1 (Shannon entropy) and ║·║_π be the ℓ_1 norm for the expected value:

where ϵ_m is the m-th element of **ϵ**.

Relaxed MaxEnt estimates exist for any choice of **ϵ**. They suffer neither from representational non-uniqueness, the optimal continuous distribution being constant inside each element of $\mathcal{Q}$, nor from mixture non-uniqueness, being the solution of a concave criterion under linear inequality constraints.

#### 1.4. Contributions

## 2. The NPMLE

#### 2.1. Simulated Data Generation Mechanism

The true distribution π_θ is the restriction to the unit square Θ = [0, 1]² of the joint distribution of two independent and identically distributed normal variables of mean µ = 0.5 and variance σ² = 0.2.

The partitions ${\mathcal{Q}}^{(j)}$ are randomly generated by considering random unions of the elements of the Voronoi tessellation of S = 50 points uniformly drawn in Θ; see Figure 4. The partition $\mathcal{Q}$ induced by 10 random binary splits of Θ is shown in Figure 5a, and Figure 5b is a color-coded representation of the probability law ${p}_{\pi ,\mathcal{Q}}$, where the fine partition $\mathcal{Q}$ is easily recognizable (black-delimited polygonal regions). We remark that the size of the elements of the partitions generated by our simulation mechanism tends to have low dispersion, following approximately a gamma distribution with both parameters equal to (7/2)λ^{−2}, where λ is the intensity of the homogeneous Poisson process [10] (λ = 1/50 in our simulations). In Section 4, we will see that this may not be the case in practical applications.
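The partition-generation mechanism can be sketched as follows. A nearest-seed assignment is used in place of an explicit Voronoi construction (both define the same cells), and the grid size and random seed are arbitrary choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 50
seeds = rng.random((S, 2))           # uniform points in Theta = [0,1]^2
grid = rng.random((5000, 2))         # evaluation points of Theta

# Voronoi cell of each point = index of its nearest seed
d2 = ((grid[:, None, :] - seeds[None, :, :]) ** 2).sum(-1)
cell = d2.argmin(axis=1)

def random_binary_partition(rng, S):
    """A random union of Voronoi cells defines one region R_0^(j);
    its complement is R_1^(j) (L = 1, a binary split of Theta)."""
    return rng.random(S) < 0.5       # True: cell belongs to R_0

part = random_binary_partition(rng, S)
labels = part[cell].astype(int)      # label in {0, 1} per grid point
```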

Observations are generated by sampling n_j times from each of the probability laws associated with the individual partitions ${\mathcal{Q}}^{(j)}$, for j = 1, …, J. In the numerical studies presented in this section, J = 10. To simulate the situation when some stimuli are seldom applied (for instance, if they may compromise the safety of the individual to whom they are applied), the partitions are divided into two groups, representing "safe" and "dangerous" stimuli, of sizes seven and three, respectively. The probability that a dangerous partition is chosen is 10^{−3}, and inside each group, partitions are chosen uniformly. Except when indicated otherwise, we will consider a total of N = 10^4 observations.
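The safe/dangerous stimulus selection described above can be sketched as follows (group sizes and probability as in the text; the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(2)
J_safe, J_danger = 7, 3
p_danger = 1e-3      # probability of drawing a dangerous stimulus

def draw_stimulus(rng):
    """Pick a group, then a stimulus uniformly inside it (indices 0..9)."""
    if rng.random() < p_danger:
        return J_safe + rng.integers(J_danger)   # dangerous: 7, 8, 9
    return rng.integers(J_safe)                  # safe: 0..6

draws = [draw_stimulus(rng) for _ in range(10_000)]
```

With N = 10^4 draws, only about ten observations are expected to use a dangerous stimulus, which is the data-scarcity regime the simulations target.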

#### 2.2. Likelihood Function

The support ${\mathcal{S}}_{\text{NPMLE}}$ is the union of the elementary regions ${\{{E}_{k}\}}_{k=1}^{K}$ in Proposition 1 (i), such that ${p}_{{\pi}_{\theta},{\mathcal{Q}}^{(j)}}(\ell )={\pi}_{\theta}({R}_{\ell}^{(j)}\cap {\mathcal{S}}_{\text{NPMLE}})+{\pi}_{\theta}({R}_{\ell}^{(j)}\cap \overline{{\mathcal{S}}_{\text{NPMLE}}})$.

Define **B**^{(j)}, the (L + 1) × K binary matrix with ${\mathbf{B}}_{\ell k}^{(j)}=1\iff {E}_{k}\in {\mathbf{E}}_{\ell}^{(j)}$, and let $\mathbf{w}\in {\mathbb{S}}^{K}$ be the vector of probabilities of the elementary regions E_k: w_k = π_θ(E_k), k = 1, …, K, with ${\mathbb{S}}^{K}$ the K-dimensional probability simplex:

All distributions π_θ leading to the same **w** have the same likelihood.

**Proposition 4.** There is in general no single **w** maximizing (7), and all elements of:

where **ŵ** is an NPMLE, are also NPMLEs. We call $\mathcal{P}$ the NPMLE polytope.

Note that the NPMLE does not uniquely determine the probabilities of the elementary regions E_k, but that the probability of the censoring regions ${R}_{\ell}^{(j)}$ is uniquely estimated, all **w** ∈ $\mathcal{P}$ assigning the same probabilities to the elements of the partitions ${\mathcal{Q}}^{\left(j\right)}$. It is obvious that the estimator is consistent for these, but no stronger statement seems possible.

#### 2.3. Optimizing the Likelihood

Define the n_j × K matrices **B**^{(j)}′, obtained from the (L + 1) × K matrix **B**^{(j)} by repeating ${n}_{\ell}^{(j)}$ times line ℓ; the N × K matrix **B**′ that stacks all **B**^{(j)}′, j = 1, …, J; the N × N diagonal matrices ${\mathbf{H}}_{k}=\text{diag}\left({{\mathbf{B}}^{\prime}}_{\cdot k}\right)$; and the matrix $\mathbf{M}(\mathbf{w})={\displaystyle {\sum}_{k=1}^{K}{w}_{k}}{\mathbf{H}}_{k}$. Then, it is easy to show that $\mathcal{L}$ can be written as:

Finding **ŵ** maximizing $\mathcal{L}(\mathbf{w};\{{\tilde{\mathbf{f}}}^{(j)},{\mathcal{Q}}^{(j)}\})$ with respect to **w** ∈ ${\mathbb{S}}^{K}$ corresponds to a D-optimal design problem for the matrix **M**(**w**), with **w** considered as a design measure allocating weight w_k to the elementary design matrix **H**_k (see, e.g., [15]). A number of important properties follow from this equivalence with a D-optimal design problem. In particular, see [16,17], the iterations:

initialized at any interior point **w**^{(0)} converge to a maximizer of (7). This multiplicative algorithm is easy to implement, but the following vertex exchange method (VEM) [13] ensures a faster convergence to the optimum. The VEM updating rule is:

yielding the next iterate **w**^{(t+1)}. In the multiplicative and VEM algorithms, we use the stopping condition $\underset{k\in \{1,\dots ,K\}}{\mathrm{max}}\frac{d(\mathbf{w},k)}{N}-1<\delta \ll 1$.
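A sketch of the multiplicative algorithm, assuming the update w_k ← w_k · d(w, k)/N with d(w, k) = Σ_i B′_ik/(B′w)_i, which is consistent with the stopping condition above; the toy matrix is hypothetical:

```python
import numpy as np

def npmle_multiplicative(Bp, delta=1e-6, max_iter=10_000):
    """Multiplicative algorithm for the NPMLE weights.
    Bp is the N x K binary matrix B' (one row per observation, with a 1
    in column k when E_k is contained in that observation's censoring set)."""
    N, K = Bp.shape
    w = np.full(K, 1.0 / K)              # interior starting point
    for _ in range(max_iter):
        p = Bp @ w                       # probability of each observation
        d = Bp.T @ (1.0 / p)             # d(w, k) = sum_i B'_ik / (B'w)_i
        if d.max() / N - 1.0 < delta:    # stopping / optimality condition
            break
        w *= d / N                       # multiplicative update (keeps sum w = 1)
    return w

# toy example: 3 observations, 2 elementary regions
Bp = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
w_hat = npmle_multiplicative(Bp)
```

Note that the update preserves the simplex: Σ_k w_k d(w, k)/N = (1/N) Σ_i (B′w)_i/(B′w)_i = 1.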

Figure 6 shows the resulting estimates. Although π_θ is strictly positive inside the complete unit square, significant regions of Θ are assigned zero probability mass (the white regions in the figure), and the support of ${\widehat{\pi}}_{\theta}$ is strictly contained in Θ.

#### 2.4. Least Informative NPMLE

Finding the MaxEnt element of the NPMLE polytope defined by **B**, for the Shannon entropy H_1, is a non-trivial non-linear constrained optimization problem. However, for H = H_2, the Rényi-MaxEnt NPMLE probability vector $\tilde{\mathbf{w}}$ is the solution to the following quadratic program with linear equality constraints, for which efficient solutions exist:

where vol(E_k) denotes the volume of E_k.
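Assuming a piecewise-constant density with weights w_k over cells of volume vol(E_k), maximizing H_2 amounts to minimizing Σ_k w_k²/vol(E_k). A sketch of the resulting quadratic program using a general-purpose solver (the NPMLE constraints are represented here by a single toy equality; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

def renyi_maxent_npmle(B, p_hat, vol):
    """H2-MaxEnt over the NPMLE polytope: minimize sum_k w_k^2 / vol(E_k)
    subject to B w = p_hat and w in the probability simplex."""
    K = len(vol)
    cons = [{"type": "eq", "fun": lambda w: B @ w - p_hat},
            {"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    res = minimize(lambda w: np.sum(w**2 / vol), np.full(K, 1.0 / K),
                   jac=lambda w: 2.0 * w / vol,
                   bounds=[(0.0, 1.0)] * K, constraints=cons, method="SLSQP")
    return res.x

# toy case: one censoring region R_0 = E_0 u E_1 with fixed probability 0.6
B = np.array([[1.0, 1.0, 0.0]])
p_hat = np.array([0.6])
vol = np.array([0.2, 0.2, 0.6])
w = renyi_maxent_npmle(B, p_hat, vol)
# equal volumes of E_0 and E_1 -> the 0.6 mass is split evenly between them
```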

Consider now the situation where a new stimulus X^{(J+1)}, applied only once with resulting label ℓ^{⋆}, is added to a dataset already containing J stimuli:

with the new observation denoted (X^{(J+1)}, n^{(J+1)}). If ${R}_{{\ell}^{\star}}^{(J+1)}$ intersects an elementary set E_k ∈ $\mathcal{Q}$ such that:

## 3. Most Likely Rényi-MaxEnt

The previous section motivated estimating π_θ with a criterion other than maximum likelihood. Relying on the link of our problem with density estimation under constraints, we propose to estimate π_θ through the maximum entropy principle.

The exact moment constraints require that **w** belongs to the NPMLE polytope $\mathcal{P}$. However, being derived from J distinct empirical distributions, the J constraints are in general inconsistent, and as in [9], we consider entropy maximization under relaxed constraints, i.e., Problem 4. For reasons of numerical efficiency, we consider the Rényi entropy H_2.

**Problem 5. (Relaxed ME estimator)** For ϵ ∈ ℝ^+, define the ϵ-relaxed MaxEnt estimator as:

where Σ^{(j)} is the covariance of the empirical estimate ${\tilde{\mathbf{f}}}_{+}^{(j)}$ and ${\mathbf{f}}_{+}^{(j)}$ is obtained from **f**^{(j)} by retaining all but one of its non-zero elements.

Denote by ϵ^{⋆} ≥ 0 the smallest value of ϵ for which there exists a solution to Problem 5. Since in (13) we use the ℓ_∞ metric to evaluate the deviation of a model π with respect to the empirical moments, and ℓ_∞ is not equivalent to the (Riemannian) metric induced by maximum likelihood in the simplex ${\mathbb{S}}^{K}$, we cannot guarantee that the likelihood decreases monotonically with the degree of relaxation, i.e., that $\mathcal{L}\left({\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5}\right)<\mathcal{L}\left({\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}\right)$ for ϵ > ϵ^{⋆}. In fact, as the plot of the log-likelihood of ${\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5}$ as a function of ϵ/ϵ^{⋆} in Figure 7 shows, this is not necessarily true for values of ϵ close to ϵ^{⋆}. More importantly, this figure shows that a suitable choice of the relaxation term can lead to a minimal likelihood loss with respect to the NPMLE, while improving the fit to the data. These remarks motivate the definition of the new estimator proposed in this paper.

**Definition 4. (MLME: the most likely MaxEnt estimator)** Consider ϵ ≥ ϵ^{⋆}. The most likely Rényi-MaxEnt estimator is:

**Proposition 5**. (ϵ^{⋆} = 0) If ϵ^{⋆} = 0, then the feasible set of the constrained optimization Problem 5 coincides with the NPMLE polytope. Since the likelihood of all solutions with ϵ > 0 will be smaller, the MLME estimate coincides in this case with the MaxEnt NPMLE: ${\u03f5}^{\star}=0\Rightarrow {\widehat{\pi}}_{\theta}^{{H}_{2},ml}={\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}={\widehat{\pi}}_{\theta}^{\mathcal{L}}$.

Since the probability that ϵ^{⋆} = 0 is small for finite datasets, the solution space of our constrained optimization problem is in general larger than the NPMLE polytope $\mathcal{P}$. We now illustrate the geometry of the MLME ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ using the following simple example, for which L = 1, J = 2, K = 3 and:

The points satisfying the constraints (for ϵ = ϵ^{⋆}) belong to the two gray areas, their intersection defining ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$ (the green dot, also on the boundary of ${\mathbb{S}}^{3}$). The dashed green line is the curve defined by ${\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5}$ in ${\mathbb{S}}^{3}$ for ϵ ≥ ϵ^{⋆}, which has an accumulation point at the uniform distribution ${w}_{1}={w}_{2}={w}_{3}=\frac{1}{3}$ once ϵ becomes sufficiently large for the uniform distribution to satisfy the constraints. Our estimator MLME is the point of this green curve at which the value of the likelihood is the largest, that is, the highest level set of the likelihood function over ${\mathbb{S}}^{3}$ whose intersection with the green curve is a single point. The orange curve shows this level set, the contact point (red dot) being the MLME ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$.
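The one-dimensional search over ϵ behind the MLME can be sketched as a grid scan over ϵ ≥ ϵ^{⋆}. The relaxed solver and likelihood below are toy stand-ins (not the paper's actual QP): the stand-in solution drifts toward the uniform distribution as ϵ grows, and the stand-in likelihood peaks at w_1 = 0.7:

```python
import numpy as np

def mlme(solve_relaxed, log_lik, eps_star, n_grid=50, eps_max_factor=20.0):
    """MLME sketch: scan epsilon >= eps_star on a log grid, solve the
    relaxed H2-MaxEnt problem at each point, keep the most likely solution.
    `solve_relaxed(eps)` and `log_lik(w)` are problem-specific callables."""
    grid = eps_star * np.logspace(0.0, np.log10(eps_max_factor), n_grid)
    sols = [solve_relaxed(e) for e in grid]
    best = max(range(n_grid), key=lambda i: log_lik(sols[i]))
    return grid[best], sols[best]

# toy stand-ins: relaxation pulls w toward uniform; likelihood peaks at w_1 = 0.7
solve = lambda e: np.array([0.5 + 0.2 / (1 + e), 0.5 - 0.2 / (1 + e)])
ll = lambda w: 0.7 * np.log(w[0]) + 0.3 * np.log(w[1])
eps_best, w_best = mlme(solve, ll, eps_star=0.01)
```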

The uncertainty affecting the estimate can be characterized by the set of all distributions allowed by ϵ = ϵ^{⋆} in the constraints (13). In terms of the vector **w** of probabilities of the elementary regions E_k, this set is a polytope ${\mathcal{P}}_{\u03f5}$, defined by its linear boundaries, which characterizes all solutions compatible with the data. One may notice that although the determination of its vertices is a difficult task, approximation of ${\mathcal{P}}_{\u03f5}$ by the maximum-volume interior ellipsoid is feasible at a reasonable computational cost [19], providing directly a lower bound on the volume of ${\mathcal{P}}_{\u03f5}$.

The Kolmogorov–Smirnov distance d_K is the maximum value of the absolute difference between the two cumulative distributions, while the total variation distance d_TV is the sum of all absolute differences [20]. Figure 10 addresses the performance of the estimation of the true probability law over $\mathcal{Q}$, showing box-plots of the Kolmogorov–Smirnov (left) and total variation (right) distances between ${\pi}_{\theta ,\mathcal{Q}}$ and the NPMLE and MLME estimates observed in 200 simulations, each for N = 10^3 observations. In each plot, the box on the left corresponds to the MaxEnt-NPMLE estimator ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$ and the one on the right to the proposed estimator ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$. This clearly demonstrates the superiority of the estimator proposed in the paper. Note that the difference is more pronounced for the total variation, which is the criterion that best indicates the predictive power of the identified population model.
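The two distances, as defined above (note that, following the text, the total variation is taken here as the full sum of absolute differences rather than half of it):

```python
import numpy as np

def ks_distance(p, q):
    """Kolmogorov-Smirnov: max absolute difference of the cumulatives,
    for a fixed ordering of the elements of Q."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).max()

def tv_distance(p, q):
    """Total variation: sum of absolute differences, as in the text."""
    return np.abs(p - q).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# ks_distance(p, q) -> 0.1, tv_distance(p, q) -> 0.2
```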

Figure 11 plots the empirical averages of the Kullback–Leibler divergences D(·║π_θ) (Figure 11a) and D(π_θ║·) (Figure 11b) over 100 randomly-generated datasets for each value of J, with J varying from 10 to 100 in steps of 10. Here, the probability of "dangerous" partitions has been increased to 10^{−2}, to guarantee a sufficient number of samples censored by them. Figure 11a suggests that ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ may be consistent, in sharp contrast with the behavior observed for the NPMLE. The divergence $D({\pi}_{\theta}\Vert {\widehat{\pi}}_{\theta}^{\mathcal{L}})$ was infinite in all simulations (due to ${\widehat{\pi}}_{\theta}^{\mathcal{L}}({E}_{k})=0$ for some E_k ∈ $\mathcal{Q}$) and, thus, cannot be presented in Figure 11b.

## 4. Numerical Results

#### 4.1. Application to a Real Problem: Modeling Decompression Sickness

We estimate the distribution π_θ of the biophysical parameters θ of a mathematical model [21] for the instantaneous volume B(t) of micro-bubbles flowing through the right ventricle of a diver's heart when executing a decompression profile P(t) (see Figure 12a):

The thresholds τ_0 = 0 < τ_1 < ⋯ < τ_L < τ_{L+1} = ∞ are assumed known. Since it is usually accepted that DCS is related to the maximum observed grade, only the grade corresponding to the peak volume:

is retained. For each dive n where the (known) profile P_n has been followed by a diver with (unknown) biophysical parameter θ_n, a single grade measure G_n is recorded:

A two-dimensional parameterization θ ∈ ℝ², with Θ the rectangular colored region in Figure 12c, has been used, all other parameters of model (15) being held fixed. Note that all biophysical parameters θ in the region R_n:

produce the same grade G_n for all dives that use profile P_n. Each diving profile P induces in this manner a partition ${\mathcal{Q}}^{(P)}$ of Θ:

The dataset comprises J = 19 distinct profiles, each executed a number n_j of times ranging from 12 to 41 (see Table 1; the most dangerous profiles have been executed less often), and leads to the partition $\mathcal{Q}$ shown in Figure 13. We remark on the strong dispersion of the sizes of the elements of $\mathcal{Q}$ in this case, in particular the presence of very narrow regions that are contained in the elements of several partitions. The elements of $\mathcal{Q}$ have in this case strongly elongated shapes, markedly different from the partitions built from Voronoi cells used in the simulations of the previous sections.

#### 4.1.1. Simulated Data

We first study performance on data simulated from a known distribution, using the same numbers n_j as shown in Table 1 and, thus, the same total N = 433.

This is much smaller than the N = 10^4 observations used in the previous section. The singularity of both ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$ and ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$, represented in Figure 14, is very strong in this case, the probability mass being concentrated in a subset of Θ of small Lebesgue measure. On the contrary, even for a partition of complex geometry like this one, the proposed MLME estimator (see Figure 15b) is able to overcome the shortcomings of the maximum-likelihood-based estimates, producing an estimate that resembles the simulated law (in Figure 15a). The resulting population model has limited complexity while still retaining a superior predictive power, as is obvious from these plots.

Figure 16 shows the variation of the log-likelihood of ${\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5}$ with ϵ/ϵ^{⋆}. We can see that the likelihood loss is larger than for the random partitions and that ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ is obtained for ϵ ≃ ϵ^{⋆}. The larger likelihood loss can be explained by the smaller number of observations (N = 433 here, while for the previous simulation study N = 10^4) and also by the more irregular geometry of the partition $\mathcal{Q}$, with a large number of small elongated sets, which can produce over-optimistic values of the likelihood by concentrating mass over those sets.

#### 4.1.2. Real Data (Part of the Material in this Section Has Been Previously Presented in [22])

The relaxation required by the real data remains small (of the order of 10^{−4}), confirming the applicability of the proposed algorithm.

Figure 18 shows the variation of the log-likelihood with ϵ/ϵ^{⋆} for this real dataset. Compared to what we observed with random partitions in Figure 7, there is now a significant likelihood loss, the blue curve staying well below the maximum-likelihood value for all values of the regularization parameter. This is natural, being an expected consequence of possible misfits of the biophysical/classification model, which induce errors in the definition of the partitions ${\mathcal{Q}}^{\left(j\right)}$ associated with the distinct profiles P^{(j)} and, thus, compromise the ability to closely fit the data.

#### 4.2. Assessing Predictive Power

We assess predictive power by leave-one-out cross-validation, retaining in turn each profile P^{(j)} and computing the three estimators using the data for the remaining 18 profiles. We then compare the estimated and observed grade frequencies ${\tilde{f}}^{(j)}$ for the retained profile.
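The leave-one-out protocol can be sketched generically; `fit` and `predict` below are placeholders for the actual estimators, and the toy frequency vectors are hypothetical:

```python
import numpy as np

def leave_one_out_tv(fit, predict, datasets):
    """Leave-one-out over the J profiles: fit on J-1 of them, then compare
    predicted and observed grade frequencies of the held-out profile with
    the total variation distance (sum of absolute differences)."""
    scores = []
    for j in range(len(datasets)):
        train = datasets[:j] + datasets[j + 1:]   # drop held-out profile j
        model = fit(train)
        f_pred = predict(model, j)
        f_obs = datasets[j]
        scores.append(np.abs(f_pred - f_obs).sum())
    return np.array(scores)

# toy stand-ins: the "model" is just the average of the training frequencies
datasets = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.7, 0.3])]
fit = lambda train: np.mean(train, axis=0)
predict = lambda model, j: model
scores = leave_one_out_tv(fit, predict, datasets)
```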

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Kaplan, E.L.; Meier, P. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. **1958**, 53, 457–481.
2. Turnbull, B.W. The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Stat. Soc. Ser. B (Methodol.) **1976**, 38, 290–295.
3. Gentleman, R.; Vandal, A.C. Computational algorithms for censored-data problems using intersection graphs. J. Comput. Graph. Stat. **2001**, 10, 403–421.
4. Bennani, Y. Intersection Graph for Region-Censored Data; Rapport de recherche I3S; I3S: Sophia-Antipolis, France, 2013.
5. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. **1957**, 106, 620–630.
6. Della Pietra, S.; Della Pietra, V.; Lafferty, J. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell. **1997**, 19, 380–393.
7. Grechuk, B.; Molyboha, A.; Zabarankin, M. Maximum entropy principle with general deviation measures. Math. Oper. Res. **2009**, 34, 445–467.
8. Dudik, M. Maximum Entropy Density Estimation and Modeling of Geographic Distributions of Species. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 2007.
9. Dudik, M.; Phillips, S.; Schapire, R. Performance guarantees for regularized maximum entropy density estimation. In Proceedings of the 17th Annual Conference on Computational Learning Theory, Banff, AB, Canada, 1–4 July 2004.
10. Járai-Szabó, F.; Néda, Z. On the size-distribution of Poisson Voronoi cells. Physica A **2007**, 385, 518–526.
11. Liu, X. Nonparametric Estimation with Censored Data: A Discrete Approach. Ph.D. Thesis, McGill University, Montreal, QC, Canada, 2005.
12. Groeneboom, P.; Wellner, J.A. Information Bounds and Nonparametric Maximum Likelihood Estimation; Birkhäuser Verlag: Basel, Switzerland, 1992.
13. Böhning, D.; Schlattmann, P.; Dietz, E. Interval censored data: A note on the nonparametric maximum likelihood estimator of the distribution function. Biometrika **1996**, 83, 462–466.
14. Fish, D.; Brinicombe, A.; Pike, E.; Walker, J. Blind deconvolution by means of the Richardson–Lucy algorithm. JOSA A **1995**, 12, 58–65.
15. Fedorov, V.V. Theory of Optimal Experiments; Academic Press: New York, NY, USA, 1972.
16. Silvey, S.D.; Titterington, D.H.; Torsney, B. An algorithm for optimal designs on a finite design space. Commun. Stat.-Theor. M. **1978**, 7, 1379–1389.
17. Torsney, B. A moment inequality and monotonicity of an algorithm. In Semi-Infinite Programming and Applications; Springer: Berlin, Germany, 1983; pp. 249–260.
18. Harman, R.; Pronzato, L. Improvements on removing nonoptimal support points in D-optimum design algorithms. Stat. Probab. Lett. **2007**, 77, 90–94.
19. Khachiyan, L.G.; Todd, M.J. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Math. Program. **1993**, 61, 137–159.
20. Strassen, V. The existence of probability measures with given marginals. Ann. Math. Stat. **1965**, 36, 423–439.
21. Hugon, J. Vers une Modélisation Biophysique de la Décompression. Ph.D. Thesis, Université Aix-Marseille, Aix-en-Provence, France, 22 November 2010.
22. Bennani, Y.; Pronzato, L.; Rendas, M.J. Nonparametric density estimation with region-censored data. In Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 1098–1102.

**Figure 1.** Partial response observation: z is the classification of the response to stimulus x(t) in a finite set.

**Figure 2.** Two partitions associated with distinct stimuli, ${\mathcal{Q}}^{\left(1\right)}=\{{R}_{1}^{1},{R}_{2}^{1}\}$ (top left) and ${\mathcal{Q}}^{\left(2\right)}=\{{R}_{1}^{2},{R}_{2}^{2},{R}_{3}^{2}\}$ (top right) and the resulting partition $\mathcal{Q}$ (bottom); see Definition 1.

**Figure 3.** Definition of elementary regions from the cliques of the intersection graph. (**a**) Three intervals: maximal clique and corresponding elementary region E_k (shaded region); (**b**) three regions with empty intersection resulting in three disjoint elementary regions E_k (the shaded regions).

**Figure 5.** (**a**) Partition $\mathcal{Q}$ determined by J = 10 random binary partitions of Θ. (**b**) Probability law ${\pi}_{\theta ,\mathcal{Q}}$ induced over the elements of the partition $\mathcal{Q}$.

**Figure 6.** (**a**) ${\widehat{\pi}}_{\theta}$, one non-parametric maximum likelihood estimate (NPMLE) solution found by (10). (**b**) ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$, the Rényi-MaxEnt NPMLE. The white regions have zero probability mass.

**Figure 7.** Log-likelihood variation of ${\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5}$ as a function of ϵ/ϵ^{⋆}. Red line: $\mathcal{L}({\widehat{\pi}}_{\theta})$.

**Figure 10.** Box-plots of the Kolmogorov–Smirnov (top) and total variation (bottom) distances between ${\pi}_{\theta ,\mathcal{Q}}$ and estimates ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$ and ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$ observed in 200 simulations.

**Figure 11.** Kullback–Leibler divergence for an increasing number J of partitions. (**a**) Empirical average of $D({\widehat{\pi}}_{\theta}^{{H}_{2},ml}\Vert {\pi}_{\theta})$ and $D({\widehat{\pi}}_{\theta}^{\mathcal{L}}\Vert {\pi}_{\theta})$; (**b**) empirical average of $D({\pi}_{\theta}\Vert {\widehat{\pi}}_{\theta}^{{H}_{2},ml})$.

**Figure 12.** Definition of bubble grades G and regions ${R}_{\ell}^{P}$. (**a**) Diving profile P(t); (**b**) blue: gas volume B; red: thresholds τ_ℓ; (**c**) regions corresponding to the L + 1 = 5 bubble grades G.

**Figure 14.** (**a**) Rényi-MaxEnt NPMLE ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$. (**b**) Rényi-MaxEnt ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$. White regions have zero probability mass.

**Figure 15.** (**a**) Simulated distribution ${\pi}_{\theta ,\mathcal{Q}}$. (**b**) MLME estimate ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$. White regions have zero probability mass.

**Figure 16.** Variation of $\mathcal{L}({\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5})$ with ϵ/ϵ^{⋆}. Red line: $\mathcal{L}({\widehat{\pi}}_{\theta}^{\mathcal{L}})$.

**Figure 17.** Estimates of π_θ with the real dataset. (**a**) Least informative NPMLE ${\widehat{\pi}}_{\theta}^{\mathcal{L}}$. (**b**) Rényi-MaxEnt ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$. (**c**) MLME ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$. White regions have zero probability mass.

**Figure 18.** Variation of $\mathcal{L}({\widehat{\pi}}_{\theta}^{{H}_{2},\u03f5})$ with ϵ/ϵ^{⋆}. Red line: $\mathcal{L}({\widehat{\pi}}_{\theta}^{\mathcal{L}})$.

**Figure 19.** (**a**) ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$, full ${\mathrm{\Sigma}}^{{(j)}^{-1/2}}$; (**b**) ${\widehat{\pi}}_{\theta}^{{H}_{2},{\u03f5}^{\star}}$, $\mathrm{diag}({\mathrm{\Sigma}}^{{(j)}^{-1/2}})$; (**c**) ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$, full ${\mathrm{\Sigma}}^{{(j)}^{-1/2}}$; (**d**) ${\widehat{\pi}}_{\theta}^{{H}_{2},ml}$, $\mathrm{diag}({\mathrm{\Sigma}}^{{(j)}^{-1/2}})$.

**Figure 20.** Boxplots of the total variation distance d_TV for the 19 datasets in the leave-one-out cross-validation study.

| Profile | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n_j | 31 | 41 | 24 | 31 | 28 | 12 | 18 | 14 | 14 | 17 | 16 | 26 | 14 | 16 | 18 | 30 | 12 | 41 | 30 |

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bennani, Y.; Pronzato, L.; Rendas, M.J.
Most Likely Maximum Entropy for Population Analysis with Region-Censored Data. *Entropy* **2015**, *17*, 3963-3988.
https://doi.org/10.3390/e17063963
