Model Selection in a Composite Likelihood Framework Based on Density Power Divergence

This paper presents a model selection criterion in a composite likelihood framework based on density power divergence measures and on the composite minimum density power divergence estimators, which depend on a tuning parameter α. After introducing such a criterion, some asymptotic properties are established. We present a simulation study and two numerical examples in order to highlight the robustness properties of the introduced model selection criterion.


Introduction
Composite likelihood inference is an important approach for dealing with real situations involving large data sets or very complex models, in which classical likelihood methods are computationally difficult or even infeasible. Composite likelihood methods have been successfully used in many applications concerning, for example, genetics ([1]), generalized linear mixed models ([2]), spatial statistics ([3][4][5]), frailty models ([6]), multivariate survival analysis ([7,8]), etc.
Let us introduce the problem, adopting here the notation of [9]. Let {f(·; θ), θ ∈ Θ ⊆ R^p, p ≥ 1} be a parametric identifiable family of distributions for an observation y = (y_1, ..., y_m)^T, a realization of a random m-vector Y. In this setting, the composite likelihood function based on K different marginal or conditional distributions has the form

CL(θ, y) = ∏_{k=1}^K f_{A_k}(y_j, j ∈ A_k; θ)^{w_k},

and the corresponding composite log-density is

log CL(θ, y) = ∑_{k=1}^K w_k ℓ_{A_k}(θ, y),  (1)

with ℓ_{A_k}(θ, y) = log f_{A_k}(y_j, j ∈ A_k; θ), where {A_k}_{k=1}^K is a family of sets of indices associated either with marginal or conditional distributions involving some y_j, j ∈ {1, ..., m}, and w_k, k = 1, ..., K, are non-negative and known weights. If the weights are all equal, they can be ignored; in this case, all the statistical procedures give equivalent results. The composite maximum likelihood estimator (CMLE), θ̂_c, is obtained by maximizing expression (1) with respect to θ ∈ Θ.
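For illustration, the construction in (1) can be coded directly. The following minimal sketch evaluates log CL(θ, y) for a 4-variate normal model with marginal blocks A_1 = {1, 2}, A_2 = {3, 4} and equal weights; the model, the block choice, the equicorrelation structure of Σ and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def composite_loglik(theta, y, blocks=((0, 1), (2, 3)), weights=(1.0, 1.0)):
    """log CL(theta, y) = sum_k w_k * log f_{A_k}(y_j, j in A_k; theta),
    where each f_{A_k} is the marginal density of the block A_k."""
    mu, Sigma = theta
    ll = 0.0
    for A, w in zip(blocks, weights):
        idx = list(A)
        ll += w * multivariate_normal.logpdf(y[idx], mean=mu[idx],
                                             cov=Sigma[np.ix_(idx, idx)])
    return ll

mu = np.zeros(4)
Sigma = 0.5*np.ones((4, 4)) + 0.5*np.eye(4)   # equicorrelated, rho = 0.5
value = composite_loglik((mu, Sigma), np.array([0.1, -0.2, 0.3, 0.0]))
```

Note that when Σ is diagonal the two blocks are independent, and the composite log-likelihood coincides with the full log-likelihood.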

Composite Minimum Density Power Divergence Estimator
Given two probability density functions g and f, associated with two m-dimensional random variables, the DPD ([25]) measures the statistical distance between g and f, for α > 0, by

d_α(g, f) = ∫_{R^m} { f(y)^{1+α} − (1 + 1/α) g(y) f(y)^α + (1/α) g(y)^{1+α} } dy,  (2)

while for α = 0 it is defined by d_0(g, f) = d_KL(g, f) = ∫_{R^m} g(y) log(g(y)/f(y)) dy, where d_KL(g, f) is the Kullback-Leibler divergence (see, for example, [26]). For α = 1, expression (2) leads to the L_2 distance L_2(g, f) = ∫_{R^m} (f(y) − g(y))^2 dy. It is also interesting to note that (2) is a special case of the so-called Bregman divergence

∫_{R^m} [ T(g(y)) − T(f(y)) − {g(y) − f(y)} T′(f(y)) ] dy.  (3)
If we consider T(l) = 1 α l 1+α in (3), we get d α (g, f ). The parameter α controls the trade-off between robustness and asymptotic efficiency of the parameter estimates which are the minimizers of this family of divergences. For more details about this family of divergence measures we refer to [27].
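As a quick numerical sanity check of these identities, the sketch below evaluates (2) by quadrature for two univariate normal densities (an illustrative choice) and verifies that α = 1 reproduces the L_2 distance.

```python
import numpy as np
from scipy.integrate import quad

def npdf(y, mu, s):
    """Univariate normal density N(mu, s^2)."""
    return np.exp(-(y - mu)**2/(2*s*s)) / (s*np.sqrt(2*np.pi))

g = lambda y: npdf(y, 0.0, 1.0)   # "true" density (illustrative)
f = lambda y: npdf(y, 0.5, 1.0)   # model density (illustrative)

def dpd(g, f, alpha):
    """d_alpha(g, f) of (2), computed by numerical quadrature (alpha > 0)."""
    integrand = lambda y: (f(y)**(1+alpha)
                           - (1 + 1/alpha)*g(y)*f(y)**alpha
                           + (1/alpha)*g(y)**(1+alpha))
    return quad(integrand, -10, 10)[0]

l2 = quad(lambda y: (f(y) - g(y))**2, -10, 10)[0]
d1 = dpd(g, f, 1.0)   # should match the L2 distance
```

The divergence also vanishes when both arguments coincide, as the integrand cancels pointwise.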
Let now Y_1, ..., Y_n be independent and identically distributed replications of Y, which are characterized by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that Ξ = {f(·; θ), θ ∈ Θ ⊆ R^p, p ≥ 1} is a parametric identifiable family of candidate distributions to describe the observations y_1, ..., y_n. Then, the DPD between the true model g and the composite likelihood function CL(θ, ·), associated with the parametric model f(·; θ), is defined, for α > 0, as

d_α(g(·), CL(θ, ·)) = ∫_{R^m} { CL(θ, y)^{1+α} − (1 + 1/α) g(y) CL(θ, y)^α + (1/α) g(y)^{1+α} } dy,  (4)

while for α = 0 we have d_KL(g(·), CL(θ, ·)) = ∫_{R^m} g(y) log(g(y)/CL(θ, y)) dy. In Section 3, we introduce and study the CLDIC criterion based on (4).
Let {M_k}_{k∈{1,...,ℓ}} be a family of candidate models to govern the observations Y_1, ..., Y_n. We shall assume that the true model is included in {M_k}_{k∈{1,...,ℓ}}. For a specific k = 1, ..., ℓ, the parametric model M_k is described by the composite likelihood function

CL_k(θ_k, y).  (6)

In this setting, it is quite clear that the most suitable candidate model to describe the observations is the model that minimizes the DPD in (4). However, the unknown parameter θ is included in it, so this measure cannot be used directly for the choice of the most suitable model. A way to overcome this problem is to plug into (4), in place of the unknown parameter θ, an estimator that obeys some desirable properties, like consistency and asymptotic normality. Based on this point, the CMDPDE, introduced in [12], can be used. This estimator is described in the sequel for the sake of completeness.
If we denote the kernel of (4) by

W_α(θ) = ∫_{R^m} { CL(θ, y)^{1+α} − (1 + 1/α) g(y) CL(θ, y)^α } dy,  (7)

we can write d_α(g(·), CL(θ, ·)) = W_α(θ) + (1/α) ∫_{R^m} g(y)^{1+α} dy, and the term (1/α) ∫_{R^m} g(y)^{1+α} dy does not depend on θ, so it can be ignored in the minimization defining (9) below. A natural estimator of W_α(θ), given in (7), can be obtained by observing that the last integral in (7) can be expressed in the form ∫_{R^m} CL(θ, y)^α dG(y), for G the distribution function corresponding to g. Hence, if the empirical distribution function of Y_1, ..., Y_n is exploited, this last integral is approximated by (1/n) ∑_{i=1}^n CL(θ, Y_i)^α, leading to

W_{n,α}(θ) = ∫_{R^m} CL(θ, y)^{1+α} dy − (1 + 1/α) (1/n) ∑_{i=1}^n CL(θ, Y_i)^α.  (8)

The CMDPDE is then defined by

θ̂_α^c = arg min_{θ∈Θ} W_{n,α}(θ).  (9)

We shall denote the score of the composite likelihood by u(θ, y) = ∂ log CL(θ, y)/∂θ. Let θ_0 be the true value of the parameter θ. In [12], it was shown that the asymptotic distribution of θ̂_α^c is given by

√n (θ̂_α^c − θ_0) → N(0, H_α(θ_0)^{−1} J_α(θ_0) H_α(θ_0)^{−1}),  (10)

with

H_α(θ) = ∫_{R^m} u(θ, y) u(θ, y)^T CL(θ, y)^{1+α} dy  (11)

and

J_α(θ) = ∫_{R^m} u(θ, y) u(θ, y)^T CL(θ, y)^{2α+1} dy − ξ_α(θ) ξ_α(θ)^T, with ξ_α(θ) = ∫_{R^m} u(θ, y) CL(θ, y)^{1+α} dy.  (12)

Remark 1. For α = 0 we get the CMLE of θ, θ̂_c. At the same time, it is well known that √n (θ̂_c − θ_0) → N(0, G*(θ_0)^{−1}), where G*(θ) denotes the Godambe information matrix defined by G*(θ) = H(θ)J(θ)^{−1}H(θ), with H(θ) being the sensitivity or Hessian matrix and J(θ) being the variability matrix, defined, respectively, by H(θ) = E[−∂u(θ, Y)/∂θ^T] and J(θ) = Var[u(θ, Y)].
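As a deliberately minimal illustration of the estimator in (9), the sketch below computes the minimum DPD location estimate for a univariate N(µ, 1) model, the simplest K = 1 case in which the composite likelihood reduces to the ordinary likelihood; the contaminated data, the known σ = 1 and the search interval are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# 95 genuine observations from N(0, 1) plus 5 gross outliers at y = 10
y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])

def w_n_alpha(mu, alpha, sigma=1.0):
    """Empirical objective W_{n,alpha}(mu) of (8) for the model N(mu, sigma^2).
    Uses the closed form  int f^{1+a} dy = (2*pi*sigma^2)^(-a/2) (1+a)^(-1/2)."""
    f = np.exp(-(y - mu)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)
    if alpha == 0.0:                      # Kullback-Leibler / CMLE limit
        return -np.mean(np.log(f))
    int_term = (2*np.pi*sigma**2)**(-alpha/2) / np.sqrt(1 + alpha)
    return int_term - (1 + 1/alpha)*np.mean(f**alpha)

def mdpde(alpha):
    """Minimizer (9); search interval chosen around the bulk of the data."""
    return minimize_scalar(lambda m: w_n_alpha(m, alpha),
                           bounds=(-3.0, 3.0), method="bounded").x

mu_mle = mdpde(0.0)   # alpha = 0: non-robust, pulled toward the outliers
mu_rob = mdpde(0.5)   # alpha = 0.5: downweights the outliers
```

The α = 0 estimate (the sample mean) is dragged toward the contamination, while a moderate α essentially ignores it: this is the robustness/efficiency trade-off governed by α.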

A New Model Selection Criterion
In order to describe the CLDIC criterion, we consider the model M_k given in (6). Following standard methodology (cf. [28], p. 240), the most suitable candidate model to describe the data Y_1, ..., Y_n is the model that minimizes the expected estimated DPD

E[W_α(θ̂_α^c)],  (14)

subject to the assumption that the unknown model g belongs to Ξ, i.e., the true model is included in {M_s}_{s∈{1,...,ℓ}}, and taking into account that θ̂_α^c, defined in (9), is a consistent and asymptotically normally distributed estimator of θ. This expected value still depends on the unknown parameter θ; therefore, the criterion should be based on an asymptotically unbiased estimator of (14), for g ∈ Ξ. The empirical version of W_α(θ) in (7) is W_{n,α}(θ), given in (8), with H_α(θ) and J_α(θ) given in (11) and (12), respectively.
Based on the above theorem, the proof of which is presented in full detail in Appendix A, the quantity W_{n,α}(θ̂_α^c) + (1/n) tr( J_α(θ̂_α^c) H_α(θ̂_α^c)^{−1} ) is an asymptotically unbiased estimator of (14). This ascertainment is the basis and a strong motivation for the next definition, which introduces the model selection criterion.
The next remark summarizes the model selection criterion in the case α = 0 and therefore extends, in a sense, the pioneering classical AIC.

Remark 2.
For α = 0 we have d_KL(g(·), CL(θ, ·)) = W_0(θ) + ∫_{R^m} g(y) log g(y) dy, with W_0(θ) = −∫_{R^m} log CL(θ, y) g(y) dy. Therefore, the most appropriate model to select is the model that minimizes the expected value

E[W_0(θ̂_c)],  (15)

where θ̂_c is the CMLE of θ_0 defined in (9). The expected value (15) still depends on the unknown parameter θ. A natural estimator of W_0(θ̂_c) can be obtained by replacing the distribution function G, of g, by the empirical distribution function based on Y_1, ..., Y_n, namely W_{n,0}(θ̂_c) = −(1/n) ∑_{i=1}^n log CL(θ̂_c, Y_i). Based on it, we select the model M* that minimizes

W_{n,0}(θ̂_c) + (1/n) tr( J(θ̂_c) H(θ̂_c)^{−1} ),

where J(θ̂_c) and H(θ̂_c) are defined in Remark 1. In a manner quite similar to that of the previous theorem, it can be shown that this quantity is an asymptotically unbiased estimator of (15). This is the model selection criterion in a composite likelihood framework based on the Kullback-Leibler divergence. We can observe that this criterion coincides with the criterion given in [22] as a generalization of the classical criterion of Akaike, which will be referred to from now on as the Composite Akaike Information Criterion (CAIC).
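The penalty tr(J H^{-1}) can be estimated directly from per-observation scores. The sketch below does so in the full-likelihood special case (K = 1, univariate normal), where J = H in expectation, so the penalty reduces to the number of parameters p = 2 and the criterion recovers the classical AIC; the data and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, 500)   # illustrative sample
n = len(y)

# MLE of (mu, sigma^2): for the full likelihood (K = 1) this is the CMLE
mu, s2 = y.mean(), y.var()

# per-observation score vectors u(theta, y_i), theta = (mu, sigma^2)
u = np.column_stack([(y - mu)/s2,
                     (y - mu)**2/(2*s2**2) - 1/(2*s2)])

J = np.cov(u, rowvar=False, bias=True)        # variability matrix J(theta)
# sensitivity matrix H(theta): empirical expected negative Hessian
H = np.array([[1/s2,                  np.mean(y - mu)/s2**2],
              [np.mean(y - mu)/s2**2, np.mean((y - mu)**2)/s2**3 - 1/(2*s2**2)]])

penalty = np.trace(J @ np.linalg.inv(H))      # tr(J H^{-1}); approx p = 2 here
loglik = np.mean(-0.5*np.log(2*np.pi*s2) - (y - mu)**2/(2*s2))
caic = -loglik + penalty/n                    # per-observation CAIC value
```

For a genuine composite likelihood, J and H generally differ and tr(J H^{-1}) is no longer the parameter count, which is precisely why the composite criterion is needed.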

Scenario 1: Two-Component Mixed Model
We start with a simulation example, motivated by and following ideas from the paper [29] and Example 4.1 in [20], which compares the behaviour of the proposed criterion with the CAIC criterion, obtained for α = 0 (see Remark 2).
Consider the random vector Y = (Y_1, Y_2, Y_3, Y_4)^T from an unknown density g, and let Y_1, ..., Y_n be independent and identically distributed replications of Y, described by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that {f(·; θ), θ ∈ Θ ⊆ R^p, p ≥ 1} is a parametric identifiable family of candidate distributions to describe the observations y_1, ..., y_n. Let also CL(θ, y) denote the composite likelihood function associated with the parametric model f(·; θ).
We consider the problem of choosing (on the basis of n independent and identically distributed replications of Y, with m = 4) between a 4-variate normal distribution N(µ_N, Σ), with location parameter µ_N = (µ_N1, ..., µ_N4)^T, and a 4-variate t-distribution with ν degrees of freedom, t_ν(µ_tν, Σ*), with a different location parameter µ_tν = (µ_tν1, ..., µ_tν4)^T and the same variance-covariance matrix Σ. Consider the composite likelihood function CL_N(θ, y) = f_N12(y; θ) f_N34(y; θ), where f_N12 and f_N34 are the densities of the bivariate marginals of Y, i.e., bivariate normal distributions with mean vectors (µ_N1, µ_N2)^T and (µ_N3, µ_N4)^T, respectively, and common variance-covariance matrix depending on an unknown parameter ρ. In a similar manner, consider the composite likelihood CL_t(θ, y) = f_tν12(y; θ) f_tν34(y; θ), where f_tν12 and f_tν34 are the densities of the bivariate marginals of Y, i.e., bivariate t-distributions with mean vectors (µ_tν1, µ_tν2)^T and (µ_tν3, µ_tν4)^T, respectively, and the same common variance-covariance matrix. Under this formulation, the simulation study unfolds in the next two scenarios.

Scenario 1a
Following Example 4.1 in [20], the steps of the simulation study are the following:
• Generate 1000 samples of size n = 5, 7, 10, 20, 40, 50, 70, 100 from a two-component mixture of a 4-variate normal distribution and a 4-variate t-distribution, as in [29]; taking into account that Σ should be positive semi-definite, a condition on ρ is imposed.
• Estimate the common parameter ρ, separately in each model, by using the CMDPDE for different values of the tuning parameter α = 0, 0.3. The composite density corresponding to the mixture h_ω(y) is used to obtain the CMDPDE, ρ̂, of ρ.
• Compute, for the two candidate models, the value of the model selection criterion considered in this paper. An explanation of how to obtain this value for both candidate models is given in Appendix B.

• Count the number of times that the 4-variate normal model was selected.
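The generation step above can be sketched as follows; the particular values of the location parameters, ρ, ν and the equicorrelation structure of Σ are illustrative assumptions, since the exact settings of the paper follow [29].

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mixture(n, omega, mu_n, mu_t, Sigma, nu, rng):
    """Draw n observations from h_w = w * N4(mu_n, Sigma) + (1-w) * t_nu(mu_t, Sigma)."""
    labels = rng.random(n) < omega
    n1 = int(labels.sum())
    out = np.empty((n, 4))
    out[labels] = rng.multivariate_normal(mu_n, Sigma, size=n1)
    # multivariate t: a normal draw scaled by sqrt(nu / chi2_nu)
    z = rng.multivariate_normal(np.zeros(4), Sigma, size=n - n1)
    w = rng.chisquare(nu, size=n - n1) / nu
    out[~labels] = mu_t + z / np.sqrt(w)[:, None]
    return out

rho = 0.5
Sigma = rho*np.ones((4, 4)) + (1 - rho)*np.eye(4)   # equicorrelation (illustrative)
y = sample_mixture(1000, 0.5, np.zeros(4), np.ones(4), Sigma, nu=3, rng=rng)
```

Each simulated sample would then be fed to the CMDPDE fits of both candidate models, and the criterion values compared.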
Results are summarized in Table 1. The extreme values ω = 0, 1 represent the number of times that the 4-variate normal model was selected when the data come from the 4-variate t-distribution and from the 4-variate normal distribution, respectively. This means that, for ω = 1, perfect discrimination is achieved when 1000 of the 1000 simulated samples are correctly assigned, while for ω = 0, the closer the count is to 0, the better the discrimination of the criterion. For ω = 0.5, each sample is generated from the normal and the t-distribution in the same proportion. Results for the second setting are summarized in Table 2. In this case, the models under consideration are more similar, so it is understandable that the CLDIC criterion does not discriminate as well.
with Σ being again a common variance-covariance matrix with unknown parameter ρ. Following the same steps as in the first scenario, we generate 1000 samples of the three-component mixture for different sample sizes n = 5, 7, 10, 20, 40, 50, 70, 100 and different values of ω and λ. Then, we consider the problem of choosing among the two 4-variate normal distributions and the 4-variate t-distribution through the CLDIC criterion, for different values of the tuning parameter α = 0, 0.3, 0.5, 0.7. See Table 3 for results. Here, the normal models are denoted by N1 and N2, respectively, while the 4-variate t-distribution is denoted by MT. The first three cases evaluate the selected model under each of these multivariate distributions. In the last two cases, a mixed model is considered as the true distribution.

Discussion of Results
In Scenario 1a, two well-differentiated multivariate models are considered. In this case, the CLDIC criterion works in a very efficient way, with almost perfect discrimination for extreme values of ω. The good behaviour is also observed for less extreme values of ω, such as ω = 0.55 or 0.45. We cannot observe a significant difference across the choices of α.
It seems, therefore, that when we have well-separated models, the CLDIC criterion works very well, independently of the sample size and the tuning parameter α considered. Dealing with closer models leads, as expected, to worse results, especially for α = 0 (CAIC).
Note that the behaviour of Wald-type and Rao tests based on CMDPDEs was studied in [12,13] through extensive simulation studies.

Choice of the Tuning Parameter
In the previous sections, we have seen that the CLDIC criterion generally works very well, independently of α, but that some values present a better behaviour, especially when distinguishing similar models. In these situations, values close to 0.2 or 0.3 appear to work well, while the CAIC criterion presents a worse behaviour. A data-driven approach for the choice of the tuning parameter would therefore be helpful in practice. The approach of [30] was adapted in [13] for the choice of the optimum α in CMDPDEs. This approach consists of minimizing the estimated mean squared error by means of a pilot estimator, θ_P. This approximation is given by

MSE(α) = (θ̂_α^c − θ_P)^T (θ̂_α^c − θ_P) + (1/n) tr( H_α(θ̂_α^c)^{−1} J_α(θ̂_α^c) H_α(θ̂_α^c)^{−1} ),  (16)

where H_α(θ) and J_α(θ) are given in (11) and (12). The optimum α is the one that minimizes expression (16). The choice of the pilot estimator is probably one of the major drawbacks of this approach, as it may lead to a choice of α too close to that used for the pilot estimator. A pilot estimator with α ≈ 0.4 was proposed in [13] after some simulations, in concordance with [30], where the initial choice of the pilot is suggested to be a robust one in order to obtain the best results in terms of robustness.
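A minimal sketch of this tuning-parameter selection for the univariate normal location model (the simplest K = 1 special case): the contaminated data, the α = 0.4 pilot and the grid are illustrative, and H_α, J_α have closed forms for this model, so the variance term of (16) is exact rather than estimated.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# 95 genuine N(0,1) observations contaminated by 5 outliers at y = 5
y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 5.0)])
n = len(y)

def mdpde(alpha):
    """Minimum DPD estimate of the location of a N(mu, 1) model."""
    def obj(mu):
        f = np.exp(-(y - mu)**2/2) / np.sqrt(2*np.pi)
        if alpha == 0.0:
            return -np.mean(np.log(f))
        return (2*np.pi)**(-alpha/2)/np.sqrt(1+alpha) - (1 + 1/alpha)*np.mean(f**alpha)
    # search interval chosen around the bulk of the data
    return minimize_scalar(obj, bounds=(-3.0, 3.0), method="bounded").x

def avar(alpha):
    """Asymptotic variance J_a / H_a^2 for the normal location model, using
    H_a = (2 pi)^(-a/2) (1+a)^(-3/2) and J_a = (2 pi)^(-a) (1+2a)^(-3/2)."""
    H = (2*np.pi)**(-alpha/2) * (1 + alpha)**-1.5
    J = (2*np.pi)**(-alpha) * (1 + 2*alpha)**-1.5
    return J / H**2

mu_pilot = mdpde(0.4)                 # robust pilot, as suggested in [13]
grid = np.linspace(0.05, 1.0, 20)
mse = [(mdpde(a) - mu_pilot)**2 + avar(a)/n for a in grid]
alpha_opt = float(grid[int(np.argmin(mse))])
```

Small α picks up a squared-bias penalty from the contamination, large α pays in variance, and the grid minimum lands at a moderate value, mirroring the trade-off described above.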

Iris Data
The Iris data (Fisher, [31]) include 3 classes of 50 observations each, where each class refers to a type of iris plant: setosa, versicolor and virginica. Each plant is categorized in its class and described by 4 other variables: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. This is one of the best-known data sets for discriminant analysis. In [32], a Gaussian finite mixture was proposed for modeling the Iris data, in which each known class is modeled by a single Gaussian term with the same variance-covariance matrix; the resulting model is given in (17). We propose a composite likelihood approach to modeling (17), in which we suppose independence between the first two and the last two variables; that is, each Gaussian term is replaced by the product of two bivariate normal densities for (y_1, y_2) and (y_3, y_4), with variance-covariance matrices involving correlation parameters ρ_12 and ρ_34, leading to the composite model (18). We are going to evaluate the behavior of the CLDIC criterion proposed in the previous sections. After estimating the parameters ρ_12 and ρ_34 in (18), we consider 10 different subsets of the Iris data, among them the SE subset: the first 50 observations, corresponding to setosa plants (n = 50). In Table 4, the models chosen for each one of the subsets by the proposed CLDIC criterion are reported. When a "pure" subset is considered, all the tuning parameters lead to optimal decisions, but when a "contaminated" subset is under consideration, only α = 0.2, 0.3 have an optimal response in all the cases.
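The factorized mixture just described can be sketched as follows; all parameter values (weights, means, ρ_12, ρ_34) are illustrative placeholders, not estimates from the Iris data.

```python
import numpy as np
from scipy.stats import multivariate_normal

# illustrative parameters: three classes, each bivariate block with
# unit variances and a single correlation parameter
weights = np.array([1/3, 1/3, 1/3])
mus = [np.array([5.0, 3.4, 1.5, 0.25]),
       np.array([5.9, 2.8, 4.3, 1.30]),
       np.array([6.6, 3.0, 5.5, 2.00])]
rho12, rho34 = 0.5, 0.3
S12 = np.array([[1.0, rho12], [rho12, 1.0]])
S34 = np.array([[1.0, rho34], [rho34, 1.0]])

def composite_density(y):
    """Mixture composite density: each Gaussian class term factorizes into
    independent (y1, y2) and (y3, y4) bivariate-normal blocks."""
    val = 0.0
    for w, mu in zip(weights, mus):
        f12 = multivariate_normal.pdf(y[:2], mean=mu[:2], cov=S12)
        f34 = multivariate_normal.pdf(y[2:], mean=mu[2:], cov=S34)
        val += w * f12 * f34
    return val

d = composite_density(np.array([5.8, 3.0, 4.2, 1.3]))
```

The composite density replaces each 4-dimensional Gaussian of the full model by a product of two bivariate blocks, so only ρ_12 and ρ_34 need to be estimated instead of a full 4 × 4 correlation structure.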

(Table 4 column labels: SE(VE), SE(VI), VE(SE), VE(VI), VI(SE), VI(VE), VI(SE+VE).)
We now apply the ad hoc approach presented in Section 5.1 for selecting the tuning parameter α in a composite likelihood framework. Applying this procedure to our data set through a grid search of length 100, and by means of a pilot estimator with α = 0.4, leads to the optimal tuning parameter α = 0.22, which is in concordance with the results obtained (see Table 5). We can see that the use of other pilot estimators would not greatly affect the final decision.

Wine Data
We now work with the Wine data ([33]), which contain a chemical analysis of 178 Italian wines from three different cultivars (Barolo, Grignolino, Barbera), with 13 measurements for each wine. In order to illustrate our criterion, we work with only the first four explanatory variables: Alcohol, Malic, Ash and Alkalinity. As in the previous section, we adjust a Gaussian mixture model, in this case with weights 59/178, 72/178 and 47/178, corresponding to the Barolo, Grignolino and Barbera classes, respectively. We again consider 10 different subsets of the Wine data, constructed as in the previous section. We can observe that, for medium values of α, the discrimination is perfect (see Table 6). Applying the ad hoc tuning-parameter selection procedure, we obtain α_opt ≈ 0.51, again with perfect discrimination (Table 5).

Conclusions and Future Research
In this paper, we have addressed the problem of model selection in the framework of composite likelihood methodology, on the basis of the DPD as a measure of the closeness between the composite density and the true model that drives the data. In this context, an information criterion is introduced and studied, defined by means of composite minimum distance type estimators of the unknown parameters, which are well known for their nice robustness properties. Through a simulation study, we have shown that the model selection criterion proposed here works well in practice, and in particular that the use of the CMDPDE makes the criterion more robust than the criterion based on the classic CMLE and the Kullback-Leibler divergence, given in [22]. The analysis of two real data examples from the literature illustrates how the model selection criterion presented here can be applied in practical cases. This paper is part of a series of papers by the authors in which composite likelihood ideas and methods are woven together with divergence-theoretic methods in order to develop statistical inference (estimation and testing of hypotheses) as well as model selection criteria. We envision future work in several directions. The development of change point methodology on the basis of composite densities, with the CMDPDE and divergence measures, may be an appealing problem for future research on the topic. Moreover, all the information-theoretic methods developed on the basis of the composite likelihood depend on the choice of the family of sets {A_k}_{k=1}^K appearing in formula (1). A question is raised at this point: how are the information-theoretic procedures developed on the basis of the composite likelihood affected by this family of sets? This is an appealing problem which also deserves investigation in future work.

Acknowledgments: The authors would like to thank the Editor and Reviewers for taking their precious time to make several valuable comments on the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: