A Maximum Entropy Procedure to Solve Likelihood Equations

In this article, we provide initial findings regarding the problem of solving likelihood equations by means of a maximum entropy (ME) approach. Unlike standard procedures, which require setting the score function of the maximum likelihood problem to zero, we propose an alternative strategy where the score is instead used as an external informative constraint on the maximization of the concave Shannon entropy function. The problem involves reparameterizing the score parameters as expected values of discrete probability distributions whose probabilities need to be estimated. This leads to a simpler situation where parameters are searched in a smaller (hyper) simplex space. We assessed our proposal by means of empirical case studies and a simulation study, the latter involving the most critical case of logistic regression under data separation. The results suggested that the maximum entropy reformulation of the score problem solves the likelihood equations. Similarly, when maximum likelihood estimation is difficult, as in the case of logistic regression under separation, the maximum entropy proposal achieved results (numerically) comparable to those obtained by Firth's bias-corrected approach. Overall, these first findings reveal that a maximum entropy solution can be considered as an alternative technique for solving the likelihood equations.


Introduction
Maximum likelihood is one of the most widely used tools of modern statistics. As a result of its attractive properties, it is suited to a wide class of statistical problems, including modeling, testing, and parameter estimation [1,2]. In the case of regular and correctly specified models, maximum likelihood provides a simple and elegant means of choosing the best asymptotically normal estimators. Generally, the maximum likelihood workflow proceeds by first defining the statistical model thought to generate the sample data, together with the associated likelihood function. The likelihood is then differentiated with respect to the parameters of interest, yielding the likelihood equations (score), which are set to zero and solved to find the final estimates. In the simplest cases, the maximum likelihood solutions are expressed in closed form. However, analytic expressions are not available for more complex problems, and researchers need to solve the likelihood equations numerically. A broad class of such procedures includes Newton-like algorithms, such as the Newton-Raphson, Fisher scoring, and quasi Newton-Raphson algorithms [3]. However, when the sample size is small, or when the optimization is no longer convex, as in the case of more sophisticated statistical models, the standard version of Newton-Raphson may not be optimal; robust versions should instead be used [4]. A typical example of such a situation is the logistic regression for binary data, where maximum likelihood estimates may not exist under data separation.
Notation: f(y; θ) denotes the density function, l(θ) the log-likelihood function, U(θ) the score function, z the K × 1 vector of finite support points for θ, p the K × 1 vector of unknown probabilities for θ, and p̂ the vector of estimated probabilities for θ.

A Maximum Entropy Solution to Score Equations
Let y = {y_1, . . . , y_n} be a random sample of independent observations from the parametric model M = { f(y; θ) : θ ∈ Θ, y ∈ Y }, with f(y; θ) being a density function parameterized over θ, Θ ⊆ R^J the parameter space with J being the number of parameters, and Y the sample space. Let l(θ) = ∑_{i=1}^n log f(y_i; θ) be the log-likelihood of the model and U(θ) = ∇_θ l(θ) = (∂l/∂θ_1, . . . , ∂l/∂θ_j, . . . , ∂l/∂θ_J) the score equation. In the regular case, the maximum likelihood estimate (MLE) θ̂ of the unknown vector of parameters θ is the solution of the score equation U(θ) = 0_J. In simple cases, θ̂ has a closed-form expression but, more often, a numerical solution is required, for instance by using iterative algorithms like Newton-Raphson and expectation-maximization.
In the maximum likelihood setting, our proposal is instead to solve U(θ) = 0_J by means of a maximum entropy approach (for a brief introduction, see [28]). This involves a two-step formulation of the problem, where θ is first reparameterized as a convex combination of a numerical support with some predefined points and probabilities. Next, a non-linear programming (NLP) problem is set with the objective of maximizing the entropy of the unknown probabilities subject to some feasible constraints. More formally, the J × 1 vector of parameters of the model M is reparameterized as

θ_j = z_j^T p_j,   j = 1, . . . , J,   (1)

where z_j is a user-defined K × 1 vector of (finite) support points, whereas p_j is a K × 1 vector of unknown probabilities satisfying p_j^T 1_K = 1. Note that the arrays z_1, . . . , z_J must be chosen to cover the natural range of the model parameters. Thus, for instance, in the case of estimating the population mean µ ∈ R for a normal model N(µ, σ^2) with σ^2 known, z_µ = (−d, . . . , 0, . . . , d)^T with d as large as possible. In practice, as observations y are available, the support vector can be defined using sample information, i.e., z_µ = (min(y), . . . , max(y))^T. Similarly, in the case of estimating the parameter π ∈ [0, 1] of the Binomial model Bin(π, n), the support vector is z_π = (0, . . . , 1)^T. The choice of the number of points K of z can be made via sensitivity analysis, although it has been shown that K ∈ {5, 7, 11} is usually enough for many regular problems (e.g., see [27,29]). Readers may refer to [27,30] for further details. Under the reparameterization in Equation (1), U(θ) = 0_J is solved via the following NLP problem:

maximize_{p_1, . . . , p_J}  H(p) = − ∑_{j=1}^J p_j^T log p_j
subject to  U(θ) = 0_J,  p_j^T 1_K = 1,  j = 1, . . . , J,   (2)

where H(p) is the Shannon entropy function, and the score equation U(θ) is rewritten in terms of the reparameterized parameters. The problem involves recovering K × J quantities, which are defined in a (convex) hyper-simplex region with J (non-)linear equality constraints U(θ_1) = 0, . . . , U(θ_J) = 0 (consistency constraints) and J linear equality constraints p_1^T 1_K = 1, . . . , p_J^T 1_K = 1 (normalization constraints). The latter ensure that the recovered quantities p̂_1, . . . , p̂_J are still probabilities. Note that closed-form solutions for the ME-score problem do not exist in general, and solutions need to be attained numerically.
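The NLP problem in (2) can be sketched numerically. The following Python snippet is an illustrative sketch only (the paper itself relies on R routines): it maximizes the entropy under the score and normalization constraints via scipy's SLSQP solver, using the Binomial example mentioned above with hypothetical counts n = 20 and s = 7 successes.

```python
import numpy as np
from scipy.optimize import minimize

def me_score_solve(score, supports):
    """Solve U(theta)=0 by maximizing Shannon entropy over simplex weights.

    score    : callable mapping the J-vector theta to the J score values
    supports : list of J support vectors z_j (the points of Equation (1))
    """
    Z = [np.asarray(z, float) for z in supports]
    J = len(Z)
    sizes = [len(z) for z in Z]
    splits = np.cumsum(sizes)[:-1]

    def unpack(p):
        return np.split(p, splits)

    def theta_of(p):
        return np.array([z @ pj for z, pj in zip(Z, unpack(p))])

    def neg_entropy(p):                      # minimize -H(p)
        return np.sum(p * np.log(p + 1e-12))

    cons = [{"type": "eq", "fun": lambda p: score(theta_of(p))}]
    for j in range(J):                       # normalization constraints
        lo = 0 if j == 0 else splits[j - 1]
        hi = splits[j] if j < J - 1 else sum(sizes)
        cons.append({"type": "eq",
                     "fun": lambda p, lo=lo, hi=hi: p[lo:hi].sum() - 1.0})

    p0 = np.concatenate([np.full(k, 1.0 / k) for k in sizes])
    res = minimize(neg_entropy, p0, constraints=cons,
                   bounds=[(0.0, 1.0)] * len(p0), method="SLSQP")
    return theta_of(res.x), unpack(res.x)

# Illustration: Binomial score U(pi) = s/pi - (n - s)/(1 - pi)
n, s = 20, 7                                 # hypothetical counts
score = lambda th: np.array([s / th[0] - (n - s) / (1.0 - th[0])])
theta, probs = me_score_solve(score, [np.linspace(0.0, 1.0, 7)])
# theta[0] should match the MLE s/n
```

The consistency constraint forces the convex combination z^T p to the root of the score, while the entropy objective keeps p as close to uniform as the constraints allow.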
In the following examples, we show how the ME-score problem can be formulated in the simplest cases of estimating parameters of normal, Poisson, and gamma models (Examples 1-3), as well as in the more complex case of estimating parameters of a logistic regression (Example 4).

Example 1: The Normal Case
Consider the case of estimating the location parameter µ ∈ R of a Normal density function N(µ, σ^2) with σ^2 known, and let

U(µ) = (1/σ^2) ∑_{i=1}^n (y_i − µ)

be the corresponding score w.r.t. µ. To define the associated ME-score problem for solving U(µ) = 0, first let µ_ME = z^T p, with z and p being K × 1 vectors of support points and unknown probabilities. In this example, K = 7, z_1 = min(y), and z_K = max(y). Given the optimization problem in (2), p can be recovered via the Lagrangean method, as follows. Let
Entropy 2019, 21, 596 5 of 15

L(p, λ_0, λ_1) = −p^T log p + λ_0 (1 − p^T 1_K) + λ_1 (0 − U(z^T p))

be the Lagrangean function, with λ_0 and λ_1 being the usual Lagrangean multipliers. The Lagrangean system of the problem is

∂L/∂p = −log p − (1 + λ_0) 1_K + λ_1 (n/σ^2) z = 0_K   (4)
∂L/∂λ_0 = 1 − p^T 1_K = 0   (5)
∂L/∂λ_1 = −U(z^T p) = 0   (6)

Solving for p in Equation (4), and using Equations (5) and (6), we get the general solution for the ME-score problem (up to constants absorbed into the multiplier):

p̂_k = exp(−λ̂_1 z_k) / ∑_{h=1}^K exp(−λ̂_1 z_h),   k = 1, . . . , K,   (7)

where the quantity in the denominator is the normalization constant. Note that the solution in Equation (7) depends on the Lagrangean multiplier λ̂_1, which needs to be determined numerically [31]. In this particular example, we estimated the unknown Lagrangean multiplier using a grid-search approach, yielding λ̂_1 = −0.024, with µ̂_ME = z^T p̂ = 3.432, which corresponds to the maximum likelihood estimate µ̂_ML = (1/n) y^T 1_n = 3.432, as expected.
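The Gibbs-form solution of Equation (7) can be sketched in a few lines of Python. Since the paper's sample for this example is not reproduced in this excerpt, an illustrative sample is used here, and root finding stands in for the paper's grid search over λ_1.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative sample (the paper's original data are not reproduced here)
y = np.array([2.1, 4.0, 3.3, 2.8, 5.0, 3.6, 3.2])
z = np.linspace(y.min(), y.max(), 7)    # K = 7, z_1 = min(y), z_K = max(y)

def p_of(lam):
    """Gibbs-form ME solution p_k proportional to exp(-lam * z_k) (cf. Eq. (7))."""
    w = np.exp(-lam * z)
    return w / w.sum()

# The score constraint U(mu) = 0 reduces to z^T p = mean(y); solve for lambda_1
g = lambda lam: z @ p_of(lam) - y.mean()
lam1 = brentq(g, -50.0, 50.0)           # root finding in place of a grid search
mu_me = z @ p_of(lam1)                  # equals the MLE, mean(y)
```

The single consistency constraint pins down λ_1 uniquely, because z^T p(λ_1) is monotone in λ_1 over the support's range.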

Example 2: The Poisson Case
Consider the simple case of estimating λ ∈ R+ of a Poisson density function. Let y = (5, 7, 7, 4, 4, 8, 15, 7, 7, 4, 7, 3, 8, 5, 4, 7)^T be a sample of n = 16 observations drawn from a Poisson density Pois(λ) and U(λ) = −n + (y^T 1_n)/λ the score of the model. The reparameterized Poisson parameter is λ_ME = z^T p, with the support defined as z = (0.00, 3.75, 7.50, 11.25, 15.00)^T, where K = 5 and z_K = max(y). Note that, since the Poisson parameter λ is bounded below by zero, we can set z_1 = 0. Unlike the previous case, we cannot determine p̂ analytically. For this reason, we solve the ME-score problem

maximize_p  −p^T log p
subject to  −n + (y^T 1_n)/(z^T p) = 0,  p^T 1_K = 1,

numerically, obtaining λ̂_ME = 6.375, which is equal to the maximum likelihood solution λ̂_ML = (1/n) y^T 1_n = 6.375, as expected.

Example 3: The Gamma Case
Consider now a sample y of n observations drawn from a Gamma density Ga(α, ρ), with α ∈ R+ being the shape parameter and ρ ∈ R+ the rate parameter. The log-likelihood of the model is as follows:

l(α, ρ) = n α log ρ + (α − 1) ∑_i log y_i − ρ ∑_i y_i − n log Γ(α),

where Γ(·) is the well-known gamma function. The corresponding score function equals

U(α) = n log ρ + ∑_i log y_i − n ψ(α),
U(ρ) = n α/ρ − ∑_i y_i,

with ψ(α) = ∂/∂α log Γ(α) being the digamma function, i.e., the derivative of the logarithm of the gamma function evaluated at α. The reparameterized gamma parameters are defined as usual, α_ME = z_α^T p_α and ρ_ME = z_ρ^T p_ρ, whereas the supports can be determined as z_α = (0, . . . , α̃ + δ)^T and z_ρ = (0, . . . , ρ̃ + δ)^T, with δ being a positive constant. Note that the upper limits of the supports can be chosen according to the approximations α̃ = 1/(2M) and ρ̃ = α̃/ȳ, with M = log(ȳ) − (∑_i log y_i)/n and ȳ = (∑_i y_i)/n [33]. In the current example, the supports for the parameters are z_α = (0.00, 1.12, 2.24, 3.35, 4.47)^T and z_ρ = (0.00, 1.91, 3.82, 5.73, 7.64)^T, where K = 5, α̃ = 1.47, ρ̃ = 4.64, and δ = 3. The ME-score problem for the gamma case is

maximize_{p_α, p_ρ}  −p_α^T log p_α − p_ρ^T log p_ρ
subject to  U(z_α^T p_α, z_ρ^T p_ρ) = 0_2,  p_α^T 1_K = 1,  p_ρ^T 1_K = 1,

which is solved via an augmented Lagrangean adaptive barrier algorithm.
The algorithm required only a few iterations to converge and returned the recovered probability vectors p̂_α and p̂_ρ. The estimated parameters under the ME-score formulation are α̂_ME = 1.621 and ρ̂_ME = 5.103, which are equal to the maximum likelihood solutions α̂_ML = 1.621 and ρ̂_ML = 5.103.
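Returning to Example 2, whose data are fully reported above, the Poisson ME-score problem can be sketched numerically as follows; this is a minimal illustration using scipy's SLSQP solver in place of the augmented Lagrangean adaptive barrier algorithm used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Sample and support from Example 2
y = np.array([5, 7, 7, 4, 4, 8, 15, 7, 7, 4, 7, 3, 8, 5, 4, 7], float)
n = y.size
z = np.linspace(0.0, 15.0, 5)              # (0.00, 3.75, 7.50, 11.25, 15.00)

neg_entropy = lambda p: np.sum(p * np.log(p + 1e-12))   # minimize -H(p)
score = lambda p: -n + y.sum() / (z @ p)   # U(lambda) at lambda = z^T p

res = minimize(neg_entropy, np.full(5, 0.2),
               method="SLSQP", bounds=[(0.0, 1.0)] * 5,
               constraints=[{"type": "eq", "fun": score},
                            {"type": "eq", "fun": lambda p: p.sum() - 1.0}])
lam_me = z @ res.x                         # approximately 6.375 = mean(y)
```

The recovered λ̂_ME coincides with the sample mean, the MLE of the Poisson parameter, because the score constraint is satisfied exactly at that value.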

Example 4: Logistic Regression
In what follows, we show the ME-score formulation for logistic regression. We consider both simple situations involving no separation, where maximum likelihood estimates can be easily computed, and those unfortunate situations in which separation occurs. Note that in the latter case, maximum likelihood estimates are no longer available without resorting to a bias-reduction iterative procedure [7]. Formally, the logistic regression model with p continuous predictors is as follows:

π = (1 + exp(−Xβ))^{−1},   (10)

where X is an n × p matrix containing the predictors, β is a p × 1 vector of model parameters, and y is an n × 1 vector of observed responses. Here, the standard maximum likelihood solutions β̂ are usually attained numerically, e.g., using Newton-Raphson-like algorithms [5].
No separation case. As an illustration of the ME-score problem in the optimal situation where no separation occurs, we consider the traditional Finney data on vasoconstriction in the skin of the digits (see Table 2) [34]. In the Finney case, the goal is to predict the vasoconstriction responses as a function of volume and rate, according to the following linear term [34]:

logit(π) = β_0 + β_1 volume + β_2 rate,

with logit : [0, 1] → R being the inverse of the logistic function. In the maximum entropy framework, the model parameters can be reformulated as follows:

β_ME = (I_{p+1} ⊗ z^T) vec(P^T),   (12)

where z is a K × 1 vector of support points, I_{p+1} is an identity matrix of order p + 1 (including the intercept term), P is a (p + 1) × K matrix of probabilities associated with the p parameters plus the intercept, ⊗ is the Kronecker product, and vec(·) is the linear operator that transforms a matrix into a column vector. Note that in this example p = 2 and K = 7, whereas the support z = (−10, . . . , 0, . . . , 10)^T is defined to be the same for both predictors and the intercept (the bounds of the support have been chosen to reflect the maximal variation allowed by the logistic function).
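The Kronecker-product reparameterization of Equation (12) can be checked numerically: each β_j is the convex combination z^T p_j of the shared support with the j-th row of P. A small sketch (with arbitrary probability rows for illustration):

```python
import numpy as np

# Dimensions as in the Finney example: p = 2 predictors plus intercept, K = 7
p, K = 2, 7
z = np.linspace(-10.0, 10.0, K)              # shared support
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(K), size=p + 1)    # (p+1) x K, rows sum to one

# (I_{p+1} kron z^T) vec(P^T): P.ravel() stacks rows, i.e., vec(P^T)
beta_kron = np.kron(np.eye(p + 1), z) @ P.ravel()
beta_rows = P @ z                            # row-by-row combinations z^T p_j
# the two formulations coincide
```

Writing the reparameterization in the Kronecker form is convenient because the whole vector β_ME becomes a single linear map of the flattened probability matrix, which is the decision variable of the NLP problem.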
Finally, the ME-score problem for the Finney logistic regression is:

maximize_{vec(P)}  −vec(P)^T log(vec(P))
subject to  U(β_ME) = 0_{p+1},  P 1_K = 1_{p+1},   (13)

where X is the n × (p + 1) matrix containing the variables rate and volume together with a column of ones for the intercept term, and π = (1 + exp(−X β_ME))^{−1}, with β_ME being defined as in Equation (12). Solutions P̂ were obtained via the augmented Lagrangean adaptive barrier algorithm, yielding estimates equal to those obtained in the original paper [34].
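When no separation is present, the solution of (13) should coincide with the ordinary maximum likelihood estimates, since the score constraint is satisfied exactly at the MLE. The sketch below illustrates this on simulated data (the Finney data of Table 2 are not reproduced in this excerpt), with scipy's SLSQP solver standing in for the augmented Lagrangean adaptive barrier algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated no-separation data: one predictor plus intercept (2 rows of P)
rng = np.random.default_rng(1)
n, K = 200, 7
X = np.column_stack([np.ones(n), rng.normal(size=n)])
pi_true = 1.0 / (1.0 + np.exp(-X @ np.array([0.5, 1.0])))
y = rng.binomial(1, pi_true)
z = np.linspace(-10.0, 10.0, K)            # shared support, as in the text

def beta_of(v):                            # beta_j = z^T p_j, Equation (12)
    return v.reshape(2, K) @ z

def score(v):                              # logistic score X^T (y - pi)
    return X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_of(v))))

res = minimize(lambda v: np.sum(v * np.log(v + 1e-12)),   # minimize -H
               np.full(2 * K, 1.0 / K), method="SLSQP",
               bounds=[(0.0, 1.0)] * (2 * K),
               constraints=[{"type": "eq", "fun": score},
                            {"type": "eq",
                             "fun": lambda v: v.reshape(2, K).sum(axis=1) - 1.0}])
beta_me = beta_of(res.x)

# Direct MLE for comparison, via BFGS on the negative log-likelihood
nll = lambda b: np.sum(np.log1p(np.exp(X @ b))) - y @ (X @ b)
beta_ml = minimize(nll, np.zeros(2), method="BFGS").x
```

With overlapping classes the two solutions agree, because the score has a unique finite root and the ME-score constraints force z^T p_j onto it.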
Separation case. As a typical example of data under separation, we consider the classical Fisher iris dataset [35]. As is generally known, the dataset contains fifty measurements of length and width (in centimeters) of sepal and petal variables for each of three species of iris, namely setosa, versicolor, and virginica [36]. For the sake of simplicity, we keep a subset of the whole dataset containing two species of iris (i.e., setosa and virginica) with the sepal length and width variables only. Inspired by the work of Lesaffre and Albert [35], we study a model where the response variable is a binary classification of iris, with Y = 0 indicating the class virginica and Y = 1 the class setosa, whereas sepal length and width are predictors of Y. The logistic regression for the iris data assumes the following linear term:

logit(π) = β_0 + β_1 sepal length + β_2 sepal width,

where the model parameters can be reformulated as in Equation (12), with K = 7, p = 2, and z being centered around zero with bounds z_1 = −25 and z_K = 25. The ME-score problem for the iris dataset is the same as in (13). The estimated parameters are reported in Table 3 (ME, first column). For the sake of comparison, Table 3 also reports the estimates obtained by solving the score of the model via the bias-corrected Newton-Raphson (NRF, second column) and Newton-Raphson (NR, third column) algorithms. The NRF algorithm uses the Firth correction for the score function [7] as implemented in the R package logistf [37]. As expected, the NR algorithm fails to converge, reporting divergent estimates. By contrast, the NRF procedure converges to non-divergent solutions. Interestingly, the maximum entropy solutions are closer to the NRF estimates, although they differ in magnitude.

Simulation Study
Having examined the ME-score problem with numerical examples for both simple and more complex cases, in this section, we will numerically investigate the behavior of the maximum entropy solutions for the most critical case of logistic regression under separation.
Design. Two factors were systematically varied in a complete two-factorial design: (i) the sample size n at three levels: 15, 50, 200; (ii) the number of predictors p (excluding the intercept) at three levels: 1, 5, 10.
The levels of n and p were chosen to represent the most common cases of simple, medium, and complex models, as usually encountered in many social research studies. Consider the logistic model in Equation (10), and let n_k and p_k be distinct elements of the sets of levels of n and p. The following procedure was repeated Q = 10,000 times for each of the 3 × 3 = 9 combinations of the simulation design:

1. Generate the matrix of predictors X_{n_k × (1+p_k)} = [1_{n_k} | X̃_{n_k × p_k}], where X̃_{n_k × p_k} is drawn from the multivariate standard normal distribution N(0_{p_k}, I_{p_k}) and the column vector of ones 1_{n_k} stands for the intercept term;
2. Generate the vector of parameters β_{1+p_k} from the multivariate centered normal distribution N(0_{1+p_k}, σ I_{1+p_k}), where σ = 2.5 was chosen to cover the natural range of variability allowed by the logistic equation;
3. Compute the vector π_{n_k} via Equation (10) using X_{n_k × (1+p_k)} and β_{1+p_k};
4. For q = 1, . . . , Q, generate the vectors of response variables y^{(q)}_{n_k} from the binomial distribution Bin(π_{n_k}), with π_{n_k} being fixed.
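Steps 1-4 above can be sketched as follows for one replication of a single design cell. Here σ is treated as the standard deviation of the parameter distribution, which is an assumption, since the notation N(0, σI) leaves the scaling of the covariance ambiguous.

```python
import numpy as np

rng = np.random.default_rng(123)
n_k, p_k, sigma = 50, 5, 2.5          # one cell of the simulation design

# Step 1: predictors with a leading column of ones for the intercept
X = np.column_stack([np.ones(n_k), rng.standard_normal((n_k, p_k))])
# Step 2: parameters from a centered normal (sigma taken as the spread)
beta = rng.normal(0.0, sigma, size=1 + p_k)
# Step 3: success probabilities via the logistic model of Equation (10)
pi = 1.0 / (1.0 + np.exp(-X @ beta))
# Step 4: one replication of the binary responses
y = rng.binomial(1, pi)
```

Repeating step 4 for q = 1, . . . , Q with π fixed yields the Q datasets of one design cell.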
The entire procedure involves a total of 10,000 × 3 × 3 = 90,000 generated datasets, along with an equivalent number of vectors of model parameters. For the NR and NRF algorithms, we used the glm and logistf routines of the R packages stats [38] and logistf [37]. By contrast, the ME-score problem was solved via the augmented Lagrangean adaptive barrier algorithm implemented in the constrOptim.nl routine of the R package alabama [32]. Convergence of the algorithms was checked using the built-in criteria of glm, logistf, and constrOptim.nl. For each of the generated datasets {y, X}_{q=1,...,Q}, the occurrence of separation was checked using a linear programming-based routine to find infinite estimates in the maximum likelihood solution [39,40]. The whole simulation procedure was performed on a (remote) HPC machine based on 16 Intel Xeon E5-2630L v3 1.80 GHz CPUs and 16 × 4 GB of RAM.
Measures. The simulation results were evaluated considering the averaged bias of the parameters, B̂ = (1/Q) ∑_{q=1}^{Q} (β̂^{(q)} − β)^T 1, its squared version B̂^2 (the square here is element-wise), and the averaged variance of the estimates, V̂ = (1/Q) ∑_{q=1}^{Q} Var(β̂^{(q)}). These were then combined to form the mean square error (MSE) of the estimates, MSE = V̂ + B̂^2. The relative bias RB_j = (β̂_j − β_{0j}) / |β_{0j}| was also computed for each predictor j = 1, . . . , J, where β_0 indicates the vector of population parameters. The measures were computed for each of the three algorithms and for all the combinations of the simulation design.
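The linear-programming check for separation mentioned above can be sketched as a feasibility problem. The snippet below is a simplified illustration of the idea behind the routines cited in [39,40], not the exact routine used in the paper: complete separation holds exactly when some coefficient vector b satisfies (2y_i − 1) x_i^T b ≥ 1 for every observation (the strict inequalities of separation, rescaled), and an LP solver can decide this feasibility directly.

```python
import numpy as np
from scipy.optimize import linprog

def completely_separated(X, y):
    """LP feasibility check: does some b satisfy (2*y_i - 1) * x_i^T b >= 1
    for all i?  If so, a separating hyperplane exists and the logistic MLE
    is infinite in some direction."""
    s = 2.0 * np.asarray(y, float) - 1.0
    A_ub = -s[:, None] * np.asarray(X, float)   # -(2y-1) x_i^T b <= -1
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=-np.ones(len(s)),
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.status == 0                      # feasible => separated

# Hypothetical one-predictor data: classes split exactly at x = 0 (separated)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones(6), x])
sep_yes = completely_separated(X, (x > 0).astype(float))
# Interleaved classes: no separating hyperplane exists
sep_no = completely_separated(X, np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0]))
```

Because the objective is constant, the solver only decides feasibility, which is all that is needed to flag a dataset as separated before fitting.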
Results. Table 4 reports the proportions of separation present in the data for each level of the simulation design, along with the proportions of non-convergence for the three algorithms. As expected, NR failed to converge when severe separation occurred, for instance, in the case of small samples and a large number of predictors. By contrast, for the NRF and ME algorithms, the convergence criteria were always met. The results of the simulation study with regard to bias, variance, and mean square error (MSE) are reported in Table 5 and Figure 1. In general, the MSE for the three algorithms decreased almost linearly with increasing sample sizes and numbers of predictors. As expected, the NR algorithm showed higher MSE than NRF and ME, except in the simplest case of n = 200 and p = 1. Unlike the NR algorithm, with increasing model complexity (p > 1), ME showed performance similar to that of NRF for both medium (n = 50) and large (n = 200) sample sizes. Interestingly, for the most complex scenario, involving a large sample (n = 200) and higher model complexity (p = 10), the ME algorithm outperformed NRF in terms of MSE. To further investigate the relationship between NRF and ME, we focused on these conditions and analyzed the behavior of ME and NRF in terms of relative bias (RB, see Figure 2). Both the ME and NRF algorithms showed RB distributions centered about 0. Except for the condition N = 200 ∧ P = 10, where ME showed smaller variance than NRF, the two algorithms showed similar variance in the estimates of the parameters. Finally, we also computed the ratio of over- and under-estimation r as the ratio between the number of positive and negative RB values, obtaining the following results: r_ME = 1.18 (over-estimation: 54%) and r_NRF = 0.96 (over-estimation: 49%) for the case N = 200 ∧ P = 5, and r_ME = 1.12 (over-estimation: 53%) and r_NRF = 0.91 (over-estimation: 47%) for the case N = 200 ∧ P = 10.
Overall, the results suggest the following points:
• In the simplest cases with no separation (i.e., N = 50 ∧ P = 1, N = 200 ∧ P = 1, N = 200 ∧ P = 5), the ME solutions to the maximum likelihood equations were the same as those provided by standard Newton-Raphson (NR) and its bias-corrected version (NRF). In all these cases, the bias of the estimates approximated zero (see Table 5);
• In the cases of separation, ME showed performance comparable to NRF, which is known to provide the most efficient estimates in the case of the logistic model under separation: bias and MSE decreased as a function of sample size and predictors, with MSE being lower for ME than NRF in the cases N = 200 ∧ P = 5 and N = 200 ∧ P = 10;
• In the most complex scenarios with a large sample and higher model complexity (N = 200 ∧ P = 5, N = 200 ∧ P = 10), the ME and NRF algorithms showed similar relative bias, with ME estimates being less variable than NRF in the N = 200 ∧ P = 10 condition. The ME algorithm tended to over-estimate the population parameters, whereas NRF tended to under-estimate them.
Figure 1. Simulation study: averaged bias, squared averaged bias, and MSE for the NR, NRF, and ME algorithms. The number of predictors p is represented column-wise, whereas the sample size n is reported on the x-axis. The measures are plotted on a logarithmic scale.

Discussion and Conclusion
We have described a new approach to solving the problem U(θ) = 0_J in order to obtain θ̂ in the context of maximum likelihood theory. Our proposal took advantage of the maximum entropy principle to set up a non-linear programming problem where U(θ) was not solved directly, but was instead used as an informative constraint on the maximization of the Shannon entropy. Thus, the parameter θ was not searched over the parameter space Θ ⊆ R^J; rather, it was reparameterized as a convex combination of a known vector z, which indicated the finite set of possible values for θ, and a vector of unknown probabilities p, which instead needed to be estimated. In so doing, we converted the problem U(θ) = 0_J from one of numerical mathematics to one of inference, where U(θ) was treated as one of the many pieces of (external) information we may have had. As a result, the maximum entropy solution required neither the computation of the Hessian of second-order derivatives of l(θ) (or the expectation of the Fisher information matrix) nor the definition of initial values θ_0, as required by Newton-like algorithms. Instead, the maximum entropy solution revolved around the reduction of the initial uncertainty: as one adds pieces of external information (constraints), a departure from the initial uniform distribution p results, implying a reduction of the uncertainty about θ; a solution is found when no further reduction can be enforced given the set of constraints. We used a set of empirical cases and a simulation study to assess the maximum entropy solution to the score problem. In cases where Newton-Raphson is no longer appropriate for estimating θ (e.g., logistic regression under separation), the ME-score formulation showed results (numerically) comparable with those obtained using the bias-corrected Newton-Raphson, in the sense of having the same or even smaller mean square errors (MSE).
Broadly speaking, these first findings suggest that the ME-score formulation can be considered as a valid alternative to solve U (θ) = 0, although further in-depth investigations need to be conducted to formally evaluate the statistical properties of the ME-score solution.
Nevertheless, we would like to stress that the maximum entropy approach has been used here solely to build a solver for maximum likelihood equations [22,23,26]. In this sense, standard errors, confidence levels, and other likelihood-based quantities can be computed using the usual asymptotic properties of maximum likelihood theory. However, attention should be paid to the definition of the support points z, since their range needs to be sufficiently wide to include the true (hypothesized) parameters we are looking for. Relatedly, our proposal differs from other methods, such as generalized maximum entropy (GME) or generalized cross entropy (GCE) [20,27], in two important respects. First, the ME-score formulation does not provide a class of estimators for the parameters of statistical models. By contrast, GME and GCE are estimators belonging to the exponential family, which can be used in many cases as alternatives to maximum likelihood estimators [28]. Second, the ME-score formulation does not provide an inferential framework for θ. While GME and GCE use information theory to provide the basis for inference and model evaluation (e.g., using Lagrangean multipliers and normalized entropy indices), the ME-score formulation focuses on the problem of finding roots of U(θ) = 0_J. Finally, an open issue that deserves greater consideration in future investigations is the examination of how the ME-score solution can be interpreted in light of the well-known maximum entropy-likelihood duality [41].
Some advantages of the ME-score solution over Newton-like algorithms include the following: (i) model parameters are searched in a smaller and simpler space because of the convex reparameterization of θ; (ii) the function to be maximized requires neither the computation of second-order derivatives of l(θ) nor the search for good initial values θ_0; (iii) additional information on the parameters, such as dominance relations among them, can be added to the ME-score formulation in terms of inequality constraints (e.g., θ_j < θ_t, j ≠ t). Furthermore, the ME-score formulation may be extended to include a priori probability distributions on θ. While in the current proposal the elements of each z_j have the same prior probability of occurring, the Kullback-Leibler entropy might be used to form a Kullback-Leibler-score problem, where the supports z_1, . . . , z_J are adequately weighted by known vectors of probabilities w_1, . . . , w_J. This would offer, for instance, another opportunity to deal with cases involving penalized likelihood estimation.
In conclusion, we think that this work yielded initial findings on the solution of likelihood equations from a maximum entropy perspective. To our knowledge, this is the first time that maximum entropy has been used to define a solver for the score equations. We believe this contribution will be of interest to all researchers working at the intersection of information theory, data mining, and applied statistics.