Composite Likelihood Methods Based on Minimum Density Power Divergence Estimator

In this paper, a robust version of the Wald test statistic for composite likelihood is considered, using the composite minimum density power divergence estimator instead of the composite maximum likelihood estimator. This new family of test statistics will be called Wald-type test statistics. The problems of testing a simple null hypothesis and a composite null hypothesis are considered, and the robustness is studied on the basis of a simulation study. The composite minimum density power divergence estimator is also introduced, and its asymptotic properties are studied.


Introduction
It is well known that the likelihood function is one of the most important tools in classical inference, and the resulting estimator, the maximum likelihood estimator (MLE), has excellent efficiency properties, although its robustness properties are rather poor.
Tests based on the MLE (the likelihood ratio test, Wald test, Rao's score test, etc.) usually have good efficiency properties, but in the presence of outliers their behavior deteriorates. To deal with such situations, many robust estimators have been introduced in the statistical literature, some of them based on distance or divergence measures. In particular, the density power divergence measures introduced in [1] have produced good robust estimators, the minimum density power divergence estimators (MDPDE), and, based on them, robust test statistics have been considered for testing simple and composite null hypotheses. Some of these tests are based on divergence measures (see [2,3]), and others extend the classical Wald test; see [4][5][6] and the references therein.
The classical likelihood function requires exact specification of the probability density function, but in most applications, the true distribution is unknown. In some cases, even where the data distribution is available in an analytic form, the likelihood function is still mathematically intractable due to the complexity of the probability density function. There are many alternatives to the classical likelihood function; in this paper, we focus on the composite likelihood. Composite likelihood is an inference function derived by multiplying a collection of component likelihoods; the particular collection used is often determined by the context. The composite likelihood therefore reduces the computational complexity, so that it is possible to deal with large datasets and very complex models even when the use of standard likelihood methods is not feasible. Asymptotic normality of the composite maximum likelihood estimator (CMLE) still holds, with the Godambe information matrix replacing the expected information in the expression of the asymptotic variance-covariance matrix. This allows the construction of composite likelihood ratio test statistics, Wald-type test statistics, as well as score-type test statistics. For an $m$-dimensional observation $y = (y_1, \dots, y_m)^T$, the composite likelihood has the form:
$$CL(\theta, y) = \prod_{k=1}^{K} f_{A_k}(y_j,\, j \in A_k; \theta)^{w_k},$$
and the corresponding composite log-density has the form:
$$c\ell(\theta, y) = \sum_{k=1}^{K} w_k\, \ell_{A_k}(\theta, y),$$

with:
$$\ell_{A_k}(\theta, y) = \log f_{A_k}(y_j,\, j \in A_k; \theta),$$
where $\{A_k\}_{k=1}^{K}$ is a family of sets of random variables associated either with marginal or conditional distributions involving some $y_j$, $j \in \{1, \dots, m\}$, and $w_k$, $k = 1, \dots, K$, are non-negative and known weights. If the weights are all equal, then they can be ignored; in this case, all the statistical procedures produce equivalent results.
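To make the weighted-sum definition above concrete, the composite log-density can be sketched in Python (the paper's own computations were carried out in R). The trivariate model, the pairwise marginal components and all parameter values below are illustrative assumptions, not taken from the paper:

```python
import math

def bvn_logpdf(u, v, mu1, mu2, rho):
    """Log-density of a bivariate normal with unit variances and correlation rho."""
    zu, zv = u - mu1, v - mu2
    return (-math.log(2 * math.pi) - 0.5 * math.log(1 - rho ** 2)
            - (zu ** 2 - 2 * rho * zu * zv + zv ** 2) / (2 * (1 - rho ** 2)))

def composite_loglik(y, mu, rho, pairs, weights):
    """Composite log-density: sum_k w_k * log f_{A_k}(y_j, j in A_k; theta)."""
    return sum(w * bvn_logpdf(y[i], y[j], mu[i], mu[j], rho)
               for (i, j), w in zip(pairs, weights))

# Pairwise (marginal) components A_1 = {1,2}, A_2 = {2,3}, A_3 = {1,3}.
pairs = [(0, 1), (1, 2), (0, 2)]
weights = [1.0, 1.0, 1.0]
y = (0.3, -0.1, 0.8)
mu = (0.0, 0.0, 0.0)
cl = composite_loglik(y, mu, rho=0.4, pairs=pairs, weights=weights)
```

With $\rho = 0$ each pairwise term factorizes into two univariate normal log-densities, which gives a quick sanity check on such an implementation.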
Let $y_1, \dots, y_n$ be independent and identically distributed replications of $y$. We denote by
$$c\ell(\theta, y_1, \dots, y_n) = \sum_{i=1}^{n} c\ell(\theta, y_i)$$
the composite log-likelihood function for the whole sample. In complete accordance with the classical MLE, the CMLE, $\widehat{\theta}_c$, is defined by:
$$\widehat{\theta}_c = \arg\max_{\theta \in \Theta} c\ell(\theta, y_1, \dots, y_n). \qquad (1)$$
It can also be obtained by solving the system of equations
$$u(\theta, y_1, \dots, y_n) = 0_p, \quad \text{where} \quad u(\theta, y_1, \dots, y_n) = \frac{\partial c\ell(\theta, y_1, \dots, y_n)}{\partial \theta}.$$
We are now going to see how the CMLE, $\widehat{\theta}_c$, can be obtained on the basis of the Kullback-Leibler divergence measure. We shall denote by $g(y)$ the density generating the data, with distribution function denoted by $G$. The Kullback-Leibler divergence between the density function $g(y)$ and the composite density function $CL(\theta, y)$ is given by:
$$d_{KL}\left(g, CL(\theta, \cdot)\right) = \int g(y) \log \frac{g(y)}{CL(\theta, y)}\, dy. \qquad (2)$$
The term $\int g(y) \log g(y)\, dy$ can be removed because it does not depend on $\theta$; hence, we can define the following estimator of $\theta$, based on the Kullback-Leibler divergence:
$$\widehat{\theta}_{KL} = \arg\min_{\theta \in \Theta} \left( -\int g(y) \log CL(\theta, y)\, dy \right),$$
or equivalently:
$$\widehat{\theta}_{KL} = \arg\max_{\theta \in \Theta} \int c\ell(\theta, y)\, dG(y). \qquad (3)$$
If we replace in (3) the distribution function $G$ by the empirical distribution function $G_n$, we have:
$$\widehat{\theta}_{KL} = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} c\ell(\theta, y_i),$$
and this expression is equivalent to Expression (1). Therefore, the estimator $\widehat{\theta}_{KL}$ coincides with the CMLE. Based on this idea, we are going to introduce, in a natural way, the composite minimum density power divergence estimator (CMDPDE). The CMLE, $\widehat{\theta}_c$, obeys asymptotic normality (see [9]); in particular:
$$\sqrt{n}\,\left(\widehat{\theta}_c - \theta\right) \xrightarrow[n \to \infty]{L} N\left(0_p,\, G^{*}(\theta)^{-1}\right),$$
where $G^{*}(\theta)$ denotes the Godambe information matrix, defined by:
$$G^{*}(\theta) = H(\theta) J(\theta)^{-1} H(\theta),$$
with $H(\theta)$ being the sensitivity or Hessian matrix and $J(\theta)$ being the variability matrix, defined, respectively, by:
$$H(\theta) = E\left[ -\frac{\partial u(\theta, y)}{\partial \theta^{T}} \right] \quad \text{and} \quad J(\theta) = \operatorname{Var}\left[ u(\theta, y) \right] = E\left[ u(\theta, y)\, u(\theta, y)^{T} \right],$$
where the superscript $T$ denotes the transpose of a vector or a matrix.
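The argmax characterization of the CMLE can be illustrated numerically. In the sketch below (an illustrative assumption, not the paper's model), three coordinates share a common mean $\mu$, the composite likelihood is the product of the three pairwise bivariate normal marginals, and the maximizer is found by golden-section search; for this exchangeable model the composite score is proportional to $\sum (y_{ij} - \mu)$, so the CMLE should equal the grand mean:

```python
import math

def bvn_logpdf(u, v, mu, rho):
    # Bivariate normal log-density, unit variances, common mean mu, correlation rho.
    zu, zv = u - mu, v - mu
    return (-math.log(2 * math.pi) - 0.5 * math.log(1 - rho ** 2)
            - (zu ** 2 - 2 * rho * zu * zv + zv ** 2) / (2 * (1 - rho ** 2)))

def composite_loglik(mu, data, rho):
    # Sample composite log-likelihood: sum over observations and pairs (1,2),(2,3),(1,3).
    pairs = [(0, 1), (1, 2), (0, 2)]
    return sum(bvn_logpdf(y[i], y[j], mu, rho)
               for y in data for (i, j) in pairs)

def cmle(data, rho, lo=-10.0, hi=10.0, tol=1e-8):
    # Golden-section search; the composite log-likelihood is concave in mu.
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if composite_loglik(c, data, rho) < composite_loglik(d, data, rho):
            a = c
        else:
            b = d
    return (a + b) / 2

data = [(0.5, -0.3, 0.1), (1.2, 0.4, -0.6), (-0.2, 0.9, 0.3)]
mu_hat = cmle(data, rho=0.3)
```

The closed-form check (CMLE = grand mean) holds only because of the symmetry assumed here; in general the estimating equations must be solved numerically.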
The matrix $J(\theta)$ is non-negative definite by definition. In the following, we shall assume that the matrix $H(\theta)$ is of full rank. Since the component score functions can be correlated, in general $H(\theta) \neq J(\theta)$. If $c\ell(\theta, y)$ is a true log-likelihood function, then $H(\theta) = J(\theta) = I_F(\theta)$, with $I_F(\theta)$ the Fisher information matrix of the model. Using a multivariate version of the Cauchy-Schwarz inequality, it follows that the matrix $I_F(\theta) - G^{*}(\theta)$ is non-negative definite, i.e., the full likelihood function is more efficient than any composite likelihood function (cf. [10], Lemma 4A).
We are now going to proceed to the definition of the CMDPDE, which is based on the density power divergence measure, defined as follows. For two densities $p$ and $q$ associated with two $m$-dimensional random variables, the density power divergence (DPD) between $p$ and $q$ was defined in [1] by:
$$d_\beta(p, q) = \int \left\{ q(y)^{1+\beta} - \left(1 + \frac{1}{\beta}\right) q(y)^{\beta}\, p(y) + \frac{1}{\beta}\, p(y)^{1+\beta} \right\} dy \qquad (4)$$
for $\beta > 0$, while for $\beta = 0$, it is defined by the continuous limit:
$$d_0(p, q) = \lim_{\beta \downarrow 0} d_\beta(p, q) = \int p(y) \log \frac{p(y)}{q(y)}\, dy,$$
i.e., the Kullback-Leibler divergence. For $\beta = 1$, Expression (4) reduces to the $L_2$ distance:
$$d_1(p, q) = \int \left( p(y) - q(y) \right)^2 dy.$$
It is also interesting to note that (4) is a special case of the so-called Bregman divergence,
$$\int \left[ T(p(y)) - T(q(y)) - \left\{ p(y) - q(y) \right\} T'(q(y)) \right] dy.$$
If we consider $T(l) = l^{1+\beta}$, we get $\beta$ times $d_\beta(p, q)$. The parameter $\beta$ controls the trade-off between robustness and asymptotic efficiency of the parameter estimates (see the Simulation Study Section), which are the minimizers of this family of divergences. For more details about this family of divergence measures, we refer to [11].
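The defining properties of the DPD can be checked numerically by discretizing two densities on a fine grid. The two normal densities below are purely illustrative choices:

```python
import math

def npdf(x, mu, sigma):
    # Univariate normal density.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Discretize two densities on a fine grid and approximate integrals by Riemann sums.
h = 0.001
grid = [-15 + i * h for i in range(30001)]
p = [npdf(x, 0.0, 1.0) for x in grid]
q = [npdf(x, 0.5, 1.0) for x in grid]

def dpd(p, q, beta, h):
    """Density power divergence d_beta(p, q) of Expression (4), beta > 0."""
    s = sum(qy ** (1 + beta) - (1 + 1 / beta) * qy ** beta * py
            + (1 / beta) * py ** (1 + beta)
            for py, qy in zip(p, q))
    return s * h

d_half = dpd(p, q, beta=0.5, h=h)
d_one = dpd(p, q, beta=1.0, h=h)
l2 = sum((py - qy) ** 2 for py, qy in zip(p, q)) * h
```

In particular, $d_\beta(p, p) = 0$ term by term, and for $\beta = 1$ the integrand collapses algebraically to $(p - q)^2$, matching the $L_2$ distance above.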
The DPD between the true density $g$ and the composite density $CL(\theta, \cdot)$ is:
$$d_\beta\left(g, CL(\theta, \cdot)\right) = \int \left\{ CL(\theta, y)^{1+\beta} - \left(1 + \frac{1}{\beta}\right) CL(\theta, y)^{\beta}\, g(y) + \frac{1}{\beta}\, g(y)^{1+\beta} \right\} dy. \qquad (5)$$
The term $\frac{1}{\beta} \int g(y)^{1+\beta}\, dy$ does not depend on $\theta$, and consequently, the minimization of (5) with respect to $\theta$ is equivalent to minimizing:
$$\int CL(\theta, y)^{1+\beta}\, dy - \left(1 + \frac{1}{\beta}\right) \int CL(\theta, y)^{\beta}\, dG(y).$$
Now, we replace the distribution function $G$ by the empirical distribution function $G_n$, and we get:
$$\int CL(\theta, y)^{1+\beta}\, dy - \left(1 + \frac{1}{\beta}\right) \frac{1}{n} \sum_{i=1}^{n} CL(\theta, y_i)^{\beta}. \qquad (6)$$
As a consequence, for a fixed value of $\beta$, the CMDPDE of $\theta$ can be obtained by minimizing the expression given in (6), or equivalently, by maximizing the expression:
$$\left(1 + \frac{1}{\beta}\right) \frac{1}{n} \sum_{i=1}^{n} CL(\theta, y_i)^{\beta} - \int CL(\theta, y)^{1+\beta}\, dy. \qquad (7)$$
Under differentiability of the model, the maximization of the function in Equation (7) leads to an estimating system of equations of the form:
$$\frac{1+\beta}{n} \sum_{i=1}^{n} CL(\theta, y_i)^{\beta}\, u(\theta, y_i) - (1+\beta) \int CL(\theta, y)^{1+\beta}\, u(\theta, y)\, dy = 0_p. \qquad (8)$$
The system of Equations (8) can be written as:
$$\frac{1}{n} \sum_{i=1}^{n} CL(\theta, y_i)^{\beta}\, u(\theta, y_i) - \int CL(\theta, y)^{1+\beta}\, u(\theta, y)\, dy = 0_p, \qquad (9)$$
and the CMDPDE $\widehat{\theta}^{\beta}_c$ of $\theta$ is obtained as the solution of (9). For $\beta = 0$ in (9), we have:
$$\frac{1}{n} \sum_{i=1}^{n} u(\theta, y_i) = \int u(\theta, y)\, CL(\theta, y)\, dy = 0_p,$$
i.e., the composite likelihood estimating equations.
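The robustness built into objective (7) can be seen in a toy case: take a single-component "composite" likelihood given by the $N(\mu, 1)$ density, for which $\int CL(\theta,y)^{1+\beta} dy$ has a closed form free of $\mu$, and maximize (7) by a grid search. The data, grid and $\beta$ value are illustrative assumptions, not from the paper:

```python
import math

def f(y, mu):
    # Model density N(mu, 1); with a single component, CL(theta, y) = f(y; mu).
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def objective(mu, data, beta):
    # Expression (7): (1 + 1/beta) * (1/n) * sum CL^beta  -  integral CL^{1+beta} dy.
    # For N(mu, 1) the integral equals (2*pi)^(-beta/2) / sqrt(1+beta), free of mu.
    n = len(data)
    term1 = (1 + 1 / beta) * sum(f(y, mu) ** beta for y in data) / n
    term2 = (2 * math.pi) ** (-beta / 2) / math.sqrt(1 + beta)
    return term1 - term2

data = [0.1, -0.2, 0.05, 0.3, -0.15, 0.2, -0.1, 0.0, 0.25, 50.0]  # one gross outlier

def argmax(fun, lo, hi, step):
    best, best_mu = -float('inf'), lo
    mu = lo
    while mu <= hi:
        v = fun(mu)
        if v > best:
            best, best_mu = v, mu
        mu += step
    return best_mu

mu_robust = argmax(lambda m: objective(m, data, beta=0.5), -2.0, 10.0, 0.001)
mu_mle = sum(data) / len(data)  # the beta -> 0 limit: the sample mean
```

The outlier at 50 drags the mean far from zero, while the term $CL(\theta, y_i)^{\beta}$ downweights it in (7), leaving the $\beta = 0.5$ estimate near the bulk of the data.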

Asymptotic Distribution of the Composite Minimum Density Power Divergence Estimator
Equation (9) can be written as:
$$\frac{1}{n} \sum_{i=1}^{n} \Psi_\beta(y_i, \theta) = 0_p, \quad \text{with} \quad \Psi_\beta(y, \theta) = CL(\theta, y)^{\beta}\, u(\theta, y) - \int CL(\theta, t)^{1+\beta}\, u(\theta, t)\, dt.$$
Therefore, the CMDPDE, $\widehat{\theta}^{\beta}_c$, is an M-estimator. In this case, it is well known (cf. [12]) that the asymptotic distribution of $\widehat{\theta}^{\beta}_c$ is given by:
$$\sqrt{n}\,\left(\widehat{\theta}^{\beta}_c - \theta\right) \xrightarrow[n \to \infty]{L} N\left( 0_p,\; H_\beta(\theta)^{-1} J_\beta(\theta) H_\beta(\theta)^{-1} \right),$$
being:
$$H_\beta(\theta) = E\left[ -\frac{\partial \Psi_\beta(y, \theta)}{\partial \theta^{T}} \right] \quad \text{and} \quad J_\beta(\theta) = E\left[ \Psi_\beta(y, \theta)\, \Psi_\beta(y, \theta)^{T} \right].$$
We are going to establish the expressions of $H_\beta(\theta)$ and $J_\beta(\theta)$. In relation to $H_\beta(\theta)$, we have:
$$H_\beta(\theta) = \int CL(\theta, y)^{1+\beta}\, u(\theta, y)\, u(\theta, y)^{T}\, dy.$$
In relation to $J_\beta(\theta)$, we have:
$$J_\beta(\theta) = \int CL(\theta, y)^{1+2\beta}\, u(\theta, y)\, u(\theta, y)^{T}\, dy - \xi_\beta(\theta)\, \xi_\beta(\theta)^{T}, \quad \text{with} \quad \xi_\beta(\theta) = \int CL(\theta, y)^{1+\beta}\, u(\theta, y)\, dy.$$
Based on the previous results, we have the following theorem.

Theorem 1. Let $\widehat{\theta}^{\beta}_c$ be the CMDPDE of $\theta$. Then:
$$\sqrt{n}\,\left(\widehat{\theta}^{\beta}_c - \theta\right) \xrightarrow[n \to \infty]{L} N\left( 0_p,\; H_\beta(\theta)^{-1} J_\beta(\theta) H_\beta(\theta)^{-1} \right).$$
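For the scalar $N(\mu, 1)$ toy case used earlier (a single-component composite likelihood, an illustrative assumption), $u(\theta, y) = y - \mu$ and the integrals defining $H_\beta$, $\xi_\beta$ and $J_\beta$ have closed forms, so a quadrature check of the sandwich ingredients is easy:

```python
import math

mu, beta = 0.7, 0.4

def cl(y):
    # Single-component composite likelihood: the N(mu, 1) density itself.
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def u(y):
    # Composite score for N(mu, 1): d/dmu log f = y - mu.
    return y - mu

# Riemann sums on a wide symmetric grid (tails are negligible at +/- 12).
h = 0.0005
grid = [mu - 12 + i * h for i in range(48001)]

H_num = sum(cl(y) ** (1 + beta) * u(y) ** 2 for y in grid) * h
xi_num = sum(cl(y) ** (1 + beta) * u(y) for y in grid) * h
J_num = sum(cl(y) ** (1 + 2 * beta) * u(y) ** 2 for y in grid) * h - xi_num ** 2

# Closed forms: int phi^{1+b} (y - mu)^2 dy = (2*pi)^(-b/2) * (1+b)^(-3/2),
# and xi_beta = 0 by symmetry of the integrand.
H_exact = (2 * math.pi) ** (-beta / 2) * (1 + beta) ** -1.5
J_exact = (2 * math.pi) ** (-beta) * (1 + 2 * beta) ** -1.5
```

At $\beta = 0$ both closed forms reduce to 1, the Fisher information of $N(\mu, 1)$, consistent with Remark 1 below.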

Remark 1.
If we apply the previous theorem for $\beta = 0$, then we get the CMLE, and the asymptotic variance-covariance matrix coincides with the one based on the Godambe information matrix, because $H_\beta(\theta)$ and $J_\beta(\theta)$ reduce to $H(\theta)$ and $J(\theta)$, respectively, for $\beta = 0$.

Wald-Type Test Statistics Based on the Composite Minimum Density Power Divergence Estimator
Wald-type test statistics based on the MDPDE have been considered, with excellent robustness properties, in different statistical problems; see for instance [4][5][6].
Motivated by those works, we focus in this section on the definition and the study of Wald-type test statistics defined by means of the CMDPDE instead of the MDPDE. In this context, if we are interested in testing:
$$H_0: \theta = \theta_0 \quad \text{against} \quad H_1: \theta \neq \theta_0,$$
we can consider the family of Wald-type test statistics:
$$W^{0}_{n,\beta} = n\,\left(\widehat{\theta}^{\beta}_c - \theta_0\right)^{T} H_\beta(\theta_0) J_\beta(\theta_0)^{-1} H_\beta(\theta_0) \left(\widehat{\theta}^{\beta}_c - \theta_0\right). \qquad (14)$$
For $\beta = 0$, we get the classical Wald-type test statistic considered in the composite likelihood methods (see for instance [7]).
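A scalar sketch of such a Wald-type statistic may help fix ideas. In the illustrative $N(\mu, 1)$ case with $\beta = 0$ (an assumption, not the paper's model), $H_0 = J_0 = 1$ and the statistic reduces to the classical $n(\bar{y} - \mu_0)^2$:

```python
def wald_simple(theta_hat, theta0, n, H, J):
    """Scalar version of (14): n * (theta_hat - theta0) * H * J^-1 * H * (theta_hat - theta0)."""
    d = theta_hat - theta0
    return n * d * (H / J) * H * d

# Illustrative scalar model N(mu, 1) with beta = 0: H_0 = J_0 = 1 (the Fisher
# information), so the statistic reduces to the classical Wald statistic.
data = [0.12, -0.45, 0.33, 0.08, -0.21, 0.5, -0.05, 0.27]
n = len(data)
mean = sum(data) / n
W = wald_simple(mean, 0.0, n, H=1.0, J=1.0)
chi2_1_005 = 3.841458820694124  # upper 0.05 quantile of chi-square with 1 d.f.
reject = W >= chi2_1_005
```

For $\beta > 0$ one would plug in the $H_\beta$ and $J_\beta$ values of the model at $\theta_0$ in place of 1.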
In the following theorem, we present the asymptotic null distribution of the family of Wald-type test statistics $W^{0}_{n,\beta}$.

Theorem 2.
The asymptotic distribution of the Wald-type test statistics given in (14), under the null hypothesis $H_0: \theta = \theta_0$, is a chi-square distribution with $p$ degrees of freedom.
The proof of Theorem 2 is given in Appendix A.1.

Theorem 3.
Let $\theta^{*}$ be the true value of the parameter $\theta$, with $\theta^{*} \neq \theta_0$. Then, it holds:
$$\sqrt{n}\left( \frac{W^{0}_{n,\beta}}{n} - \ell(\theta^{*}) \right) \xrightarrow[n \to \infty]{L} N\left( 0,\, \sigma^{2}_{W,\beta}(\theta^{*}) \right),$$
being:
$$\ell(\theta) = (\theta - \theta_0)^{T} H_\beta(\theta_0) J_\beta(\theta_0)^{-1} H_\beta(\theta_0) (\theta - \theta_0)$$
and:
$$\sigma^{2}_{W,\beta}(\theta^{*}) = \left( \frac{\partial \ell(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{*}} \right)^{T} H_\beta(\theta^{*})^{-1} J_\beta(\theta^{*}) H_\beta(\theta^{*})^{-1} \left( \frac{\partial \ell(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{*}} \right).$$
The proof of the theorem is outlined in Appendix A.2.

Remark 2.
Based on the previous result, we can approximate the power, $\beta_{W^{0}_{n,\beta}}$, of the Wald-type test statistics at $\theta^{*}$ by:
$$\beta_{W^{0}_{n,\beta}}(\theta^{*}) \approx 1 - \Phi_n\left( \frac{\sqrt{n}}{\sigma_{W,\beta}(\theta^{*})} \left( \frac{\chi^{2}_{p,\alpha}}{n} - \ell(\theta^{*}) \right) \right),$$
where $\Phi_n$ is a sequence of distribution functions tending uniformly to the standard normal distribution function $\Phi(x)$ and $\chi^{2}_{p,\alpha}$ is the upper $\alpha$-quantile of a chi-square distribution with $p$ degrees of freedom. It is clear that:
$$\lim_{n \to \infty} \beta_{W^{0}_{n,\beta}}(\theta^{*}) = 1$$
for all $\alpha \in (0, 1)$. Therefore, the Wald-type test statistics are consistent in the sense of Fraser.
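The power approximation of Remark 2 is easy to evaluate, replacing $\Phi_n$ by the standard normal limit $\Phi$. The numerical values of $\ell(\theta^{*})$ and $\sigma_{W,\beta}(\theta^{*})$ below are illustrative, not from the paper:

```python
import math

def std_normal_cdf(x):
    # Phi(x) via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(n, ell_star, sigma_star, chi2_alpha):
    """Approximate power: 1 - Phi( sqrt(n)/sigma * (chi2_alpha/n - ell(theta*)) )."""
    return 1 - std_normal_cdf(math.sqrt(n) / sigma_star * (chi2_alpha / n - ell_star))

# Illustrative values: ell(theta*) = 0.1, sigma = 1, p = 1, alpha = 0.05.
chi2_1_005 = 3.841458820694124
p100 = approx_power(100, 0.1, 1.0, chi2_1_005)
p400 = approx_power(400, 0.1, 1.0, chi2_1_005)
p10000 = approx_power(10000, 0.1, 1.0, chi2_1_005)
```

Since $\ell(\theta^{*}) > 0$ when $\theta^{*} \neq \theta_0$, the argument of $\Phi$ tends to $-\infty$ and the approximate power tends to 1, which is the consistency property noted above.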
In many practical hypothesis testing problems, the restricted parameter space $\Theta_0 \subset \Theta$ is defined by a set of $r$ restrictions of the form:
$$g(\theta) = 0_r \qquad (16)$$
on $\Theta$, where $g: \mathbb{R}^{p} \to \mathbb{R}^{r}$ is a vector-valued function such that the $p \times r$ matrix:
$$G(\theta) = \frac{\partial g(\theta)^{T}}{\partial \theta}$$
exists and is continuous in $\theta$, with $\operatorname{rank}(G(\theta)) = r$; here $0_r$ denotes the null vector of dimension $r$. Now, we are going to consider composite null hypotheses, $\Theta_0 \subset \Theta$, of the form considered in (16), and our interest is in testing:
$$H_0: \theta \in \Theta_0 \quad \text{against} \quad H_1: \theta \notin \Theta_0$$
on the basis of a random sample of size $n$, $X_1, \dots, X_n$.
For testing these composite null hypotheses, we consider the family of Wald-type test statistics:
$$W_{n,\beta} = n\, g(\widehat{\theta}^{\beta}_c)^{T} \left( G(\widehat{\theta}^{\beta}_c)^{T}\, H_\beta(\widehat{\theta}^{\beta}_c)^{-1} J_\beta(\widehat{\theta}^{\beta}_c) H_\beta(\widehat{\theta}^{\beta}_c)^{-1}\, G(\widehat{\theta}^{\beta}_c) \right)^{-1} g(\widehat{\theta}^{\beta}_c). \qquad (19)$$
For $\beta = 0$ and a true log-likelihood function, the matrix $H_0(\theta)^{-1} J_0(\theta) H_0(\theta)^{-1}$ coincides with the inverse of the Fisher information matrix, and then, we get the classical Wald test statistic considered in the composite likelihood methods.
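A minimal sketch of the composite-null Wald-type statistic, for a single illustrative restriction $g(\theta) = \theta_1 - \theta_2$ (so $r = 1$ and $G = (1, -1)^{T}$) and an illustrative variance matrix:

```python
def wald_composite(theta_hat, n, Sigma, g, G):
    """Single-restriction (r = 1) Wald-type statistic: n * g^2 / (G^T Sigma G)."""
    gv = g(theta_hat)
    # Quadratic form G^T Sigma G, with G a length-p vector and Sigma a p x p matrix.
    denom = sum(G[i] * Sigma[i][j] * G[j]
                for i in range(len(G)) for j in range(len(G)))
    return n * gv * gv / denom

# Restriction g(theta) = theta_1 - theta_2 = 0, so G = (1, -1)^T.
g = lambda th: th[0] - th[1]
G = [1.0, -1.0]
# Placeholder for Sigma_beta = H_beta^{-1} J_beta H_beta^{-1}; identity here.
Sigma = [[1.0, 0.0], [0.0, 1.0]]
theta_hat = [0.8, 0.5]
W = wald_composite(theta_hat, n=50, Sigma=Sigma, g=g, G=G)
```

With the identity variance matrix, $G^{T} \Sigma G = 2$ and the statistic is simply $n (\widehat{\theta}_1 - \widehat{\theta}_2)^2 / 2$; in an application, $\Sigma$ would be the estimated sandwich matrix evaluated at $\widehat{\theta}^{\beta}_c$.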
In the next theorem, we present the asymptotic distribution of $W_{n,\beta}$.

Theorem 4.
The asymptotic distribution of the Wald-type test statistics given in (19), under the null hypothesis $H_0: \theta \in \Theta_0$, is a chi-square distribution with $r$ degrees of freedom.
The proof of this theorem is presented in Appendix A.3. Consider the null hypothesis $H_0: \theta \in \Theta_0 \subset \Theta$. By Theorem 4, the null hypothesis should be rejected if $W_{n,\beta} \geq \chi^{2}_{r,\alpha}$. The following theorem can be used to approximate the power function. Assume that $\theta^{*} \notin \Theta_0$ is the true value of the parameter, so that $\widehat{\theta}^{\beta}_c$ converges almost surely to $\theta^{*}$.
Theorem 5. Let $\theta^{*}$ be the true value of the parameter, with $g(\theta^{*}) \neq 0_r$. Then, it holds:
$$\sqrt{n}\left( \frac{W_{n,\beta}}{n} - \ell^{*}(\theta^{*}) \right) \xrightarrow[n \to \infty]{L} N\left( 0,\, \sigma^{2}_{\beta}(\theta^{*}) \right),$$
being:
$$\ell^{*}(\theta) = g(\theta)^{T} \left( G(\theta^{*})^{T}\, H_\beta(\theta^{*})^{-1} J_\beta(\theta^{*}) H_\beta(\theta^{*})^{-1}\, G(\theta^{*}) \right)^{-1} g(\theta)$$
and:
$$\sigma^{2}_{\beta}(\theta^{*}) = \left( \frac{\partial \ell^{*}(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{*}} \right)^{T} H_\beta(\theta^{*})^{-1} J_\beta(\theta^{*}) H_\beta(\theta^{*})^{-1} \left( \frac{\partial \ell^{*}(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{*}} \right).$$

Numerical Example
In this section, we consider an example, studied previously in [8], in order to examine the robustness of the CMLE. The aim of this section is to clarify the different issues discussed in the previous sections.

Consider the random vector $Y = (Y_1, Y_2, Y_3, Y_4)^T$,
which follows a four-dimensional normal distribution with mean vector $\mu = (\mu_1, \mu_2, \mu_3, \mu_4)^T$ and variance-covariance matrix $\Sigma$; i.e., we suppose that the correlation between $Y_1$ and $Y_2$ is the same as the correlation between $Y_3$ and $Y_4$. Taking into account that $\Sigma$ should be positive semi-definite, a corresponding condition on $\rho$ is imposed. In order to avoid several problems regarding the consistency of the CMLE of the parameter $\rho$ (cf. [8]), we shall consider the composite likelihood function:
$$CL(\theta, y) = f_{12}(y_1, y_2; \theta)\, f_{34}(y_3, y_4; \theta),$$
where $f_{12}$ and $f_{34}$ are the densities of the bivariate marginals of $Y$, i.e., bivariate normal distributions with mean vectors $(\mu_1, \mu_2)^T$ and $(\mu_3, \mu_4)^T$, respectively, and common variance-covariance matrix:
$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
By $\theta$, we denote the parameter vector of our model, i.e., $\theta = (\mu_1, \mu_2, \mu_3, \mu_4, \rho)^T$. The system of equations that must be solved in order to obtain the CMDPDE follows from (9). After some heavy algebraic manipulations, specified in Appendix A.5, the sensitivity and variability matrices $H_\beta(\theta)$ and $J_\beta(\theta)$ are obtained in closed form.
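The composite log-likelihood of this example, the sum of the two bivariate marginal log-densities, can be sketched as follows (unit variances assumed, as in the common variance-covariance matrix above; the numerical values of $\theta$ and $y$ are illustrative):

```python
import math

def bvn_logpdf(u, v, m1, m2, rho):
    # Bivariate normal log-density with unit variances and correlation rho.
    zu, zv = u - m1, v - m2
    return (-math.log(2 * math.pi) - 0.5 * math.log(1 - rho ** 2)
            - (zu ** 2 - 2 * rho * zu * zv + zv ** 2) / (2 * (1 - rho ** 2)))

def cloglik(theta, y):
    # log CL(theta, y) = log f12(y1, y2) + log f34(y3, y4), theta = (mu1..mu4, rho).
    mu1, mu2, mu3, mu4, rho = theta
    return (bvn_logpdf(y[0], y[1], mu1, mu2, rho)
            + bvn_logpdf(y[2], y[3], mu3, mu4, rho))

theta = (0.0, 0.0, 0.0, 0.0, 0.15)
y = (0.2, -0.4, 0.1, 0.6)
cl_val = cloglik(theta, y)
```

At $\rho = 0$ the composite log-likelihood factorizes into four univariate normal log-densities, which provides a sanity check.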

Simulation Study
A simulation study, developed using the R statistical programming environment, is presented in order to study the behavior of the CMDPDE, as well as the behavior of the Wald-type test statistics based on them. The theoretical model studied in the previous example is considered. Through R = 10,000 replications of the simulation experiment, we compare, for different values of β, the corresponding CMDPDE through the root mean square error (RMSE), when the true value of the parameters is θ = (0, 0, 0, 0, ρ)^T and ρ ∈ {−0.1, 0, 0.15}. We pay special attention to the problem of the existence of outliers in the sample, generating 5% of the samples with θ̃ = (1, 3, −2, −1, ρ̃)^T and ρ̃ ∈ {−0.15, 0.1, 0.2}, respectively. Notice that, although the case ρ = 0 has been considered, this case is of limited interest for the model under consideration, since with independent observations, composite likelihood theory is not needed. Results are presented in Tables 1 and 2. Two points deserve our attention. The first one is that, as expected, RMSEs for contaminated data are always greater than RMSEs for pure data and that the RMSEs decrease when the sample size n increases. The second is that, while for pure data the RMSEs are greater for large values of β, when working with contaminated data, the CMDPDE with medium-low values of β (β ∈ {0.1, 0.2, 0.3}) present the best behavior in terms of efficiency. These statements remain true for larger levels of contamination, noting that, when larger contamination percentages are considered, larger values of β also become advisable in terms of efficiency (see Tables 3-5 for contamination equal to 10%, 15% and 20%, respectively). Considering the mean absolute error (MAE) for the evaluation of the accuracy, we obtain similar results (Table 6).
Table 6. MAEs for pure and contaminated data (5%, 10%, 15% and 20%), n = 100.
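A miniature version of such an experiment can be sketched as follows (the paper used R; here a univariate, single-component model is used for brevity, and the sample sizes, replication count, seed and fixed contaminating value are all illustrative assumptions). It reproduces the qualitative finding above: under contamination, a moderate β beats β = 0 in RMSE:

```python
import math
import random

def f(y, mu):
    # N(mu, 1) density; single-component composite likelihood.
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def mdpde(data, beta, lo=-2.0, hi=4.0, step=0.01):
    # Grid-search maximizer of objective (7); for N(mu, 1) with known variance,
    # the integral term is free of mu, so maximizing sum f^beta suffices.
    best, best_mu = -float('inf'), lo
    mu = lo
    while mu <= hi:
        v = sum(f(y, mu) ** beta for y in data)
        if v > best:
            best, best_mu = v, mu
        mu += step
    return best_mu

random.seed(1)
R, n, true_mu = 50, 50, 0.0
se_mle, se_rob = 0.0, 0.0
for _ in range(R):
    sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
    sample[:3] = [8.0, 8.0, 8.0]          # ~5% contamination at a fixed outlying value
    mle = sum(sample) / n                  # beta = 0 estimator (sample mean)
    rob = mdpde(sample, beta=0.3)
    se_mle += (mle - true_mu) ** 2
    se_rob += (rob - true_mu) ** 2
rmse_mle = math.sqrt(se_mle / R)
rmse_rob = math.sqrt(se_rob / R)
```

The contaminating points shift the mean by roughly 3 × 8 / 50 ≈ 0.48 per sample, while the β = 0.3 estimator downweights them and stays near the true value.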
For the Wald-type test statistics, the empirical level is obtained as the proportion of replications for which the null hypothesis is rejected, with I(S) being the indicator function (with a value of one if S is true and zero otherwise). Empirical levels with the same previous parameter values are presented in Table 7 (pure data) and Table 8 (5% of outliers). While medium-high values of β are not recommended at all, the CMLE is generally the best choice when working with pure data. However, the lack of robustness of the CMLE-based test is striking, as can be seen in Table 8. The effect of contamination for medium-low values of β is much lighter, while for medium-high values of β it can even appear deceptively beneficial. Empirical powers are presented in Tables 9 and 10. The (simulated) power for the different composite Wald-type test statistics is obtained as the proportion of replications, generated under the alternative, for which the null hypothesis is rejected. As expected, the power decreases when we get closer to the null hypothesis and when the sample size decreases. With pure data, the best behavior is obtained for low values of β, while with this level of contamination (5%), the best results are obtained for medium values of β.
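The empirical-level computation amounts to averaging the rejection indicator over replications simulated under the null. A scalar illustrative sketch (not the paper's model; sizes and seed are assumptions), where the statistic is exactly chi-square distributed, so the empirical level should be near α = 0.05:

```python
import random

random.seed(7)
R, n = 2000, 30
chi2_1_005 = 3.841458820694124  # upper 0.05 quantile of chi-square with 1 d.f.
rejections = 0
for _ in range(R):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # data under H0: mu = 0
    mean = sum(sample) / n
    W = n * mean ** 2                       # classical Wald statistic for N(mu, 1)
    rejections += 1 if W >= chi2_1_005 else 0   # I(W >= chi^2_{1, alpha})
empirical_level = rejections / R
```

The empirical power is computed identically, with the samples generated under an alternative value of the parameter instead of under the null.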

Conclusions
The likelihood function is the basis of the maximum likelihood method in estimation theory, and it also plays a key role in the development of likelihood ratio tests. However, in practice, it is intractable in many cases. Maximum likelihood estimators are based on the likelihood function and can often be easily obtained; however, there are cases where they do not exist or cannot be computed. In such cases, composite likelihood methods constitute an appealing methodology for estimation and testing of hypotheses. On the other hand, distance- or divergence-based methods of estimation and testing have increasingly become fundamental tools in the field of mathematical statistics. The work in [13] is, to the best of our knowledge, the first to link the notion of composite likelihood with divergence-based methods for testing statistical hypotheses.
In this paper, the CMDPDE is introduced and exploited to develop Wald-type test statistics for testing simple and composite null hypotheses in a composite likelihood framework. The validity of the proposed procedures is investigated by means of simulations. The simulation results point out the robustness of the proposed information-theoretic procedures in estimation and testing in the composite likelihood context. There are several areas where the notions of divergence and composite likelihood are crucial, including spatial statistics and time series analysis. These areas of interest will be explored elsewhere.

Appendix A.1. Proof of Theorem 2
The result follows in a straightforward manner from the asymptotic normality of $\widehat{\theta}^{\beta}_c$.

Appendix A.2. Proof of Theorem 3

A first order Taylor expansion of $\ell(\theta)$ at $\widehat{\theta}^{\beta}_c$ around $\theta^{*}$ gives:
$$\ell(\widehat{\theta}^{\beta}_c) - \ell(\theta^{*}) = \left( \frac{\partial \ell(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{*}} \right)^{T} \left( \widehat{\theta}^{\beta}_c - \theta^{*} \right) + o\left( \left\| \widehat{\theta}^{\beta}_c - \theta^{*} \right\| \right).$$
Now, the result follows because the asymptotic distribution of $\sqrt{n}\,\big( \ell(\widehat{\theta}^{\beta}_c) - \ell(\theta^{*}) \big)$ coincides with the asymptotic distribution of $\sqrt{n}\, \big( \partial \ell(\theta^{*}) / \partial \theta \big)^{T} \big( \widehat{\theta}^{\beta}_c - \theta^{*} \big)$.

Appendix A.3. Proof of Theorem 4

We have: