Article

Model Selection in a Composite Likelihood Framework Based on Density Power Divergence

Elena Castilla, Nirian Martín, Leandro Pardo and Konstantinos Zografos

1 Interdisciplinary Mathematics Institute and Department of Statistics and O.R. I, Complutense University of Madrid, 28040 Madrid, Spain
2 Interdisciplinary Mathematics Institute and Department of Financial and Actuarial Economics & Statistics, Complutense University of Madrid, 28003 Madrid, Spain
3 Department of Mathematics, University of Ioannina, 45110 Ioannina, Greece
* Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 270; https://doi.org/10.3390/e22030270
Submission received: 22 January 2020 / Revised: 17 February 2020 / Accepted: 25 February 2020 / Published: 27 February 2020

Abstract

This paper presents a model selection criterion in a composite likelihood framework based on density power divergence measures and on the composite minimum density power divergence estimators, which depend on a tuning parameter $\alpha$. After introducing such a criterion, some asymptotic properties are established. We present a simulation study and two numerical examples in order to illustrate the robustness properties of the introduced model selection criterion.

1. Introduction

Composite likelihood inference is an important approach for dealing with real situations involving large data sets or very complex models, in which classical likelihood methods are computationally difficult or even impossible to manage. Composite likelihood methods have been successfully used in many applications concerning, for example, genetics ([1]), generalized linear mixed models ([2]), spatial statistics ([3,4,5]), frailty models ([6]), multivariate survival analysis ([7,8]), etc.
Let us introduce the problem, adopting here the notation of [9]. Let $\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq 1\}$ be a parametric, identifiable family of distributions for an observation $\mathbf{y}=(y_1,\dots,y_m)^T$, a realization of a random $m$-vector $\mathbf{Y}$. In this setting, the composite likelihood function based on $K$ different marginal or conditional distributions has the form

$$\mathrm{CL}(\theta,\mathbf{y})=\prod_{k=1}^{K}f_{A_k}(y_j,\ j\in A_k;\theta)^{w_k} \tag{1}$$

and the corresponding composite log-density is

$$\log \mathrm{CL}(\theta,\mathbf{y})=\sum_{k=1}^{K}w_k\,\ell_{A_k}(\theta,\mathbf{y}),$$

with $\ell_{A_k}(\theta,\mathbf{y})=\log f_{A_k}(y_j,\ j\in A_k;\theta)$, where $\{A_k\}_{k=1}^{K}$ is a family of sets of indices associated either with marginal or conditional distributions involving some $y_j$, $j\in\{1,\dots,m\}$, and $w_k$, $k=1,\dots,K$, are non-negative and known weights. If the weights are all equal they can be ignored; in this case, all the statistical procedures give equivalent results. The composite maximum likelihood estimator (CMLE), $\hat{\theta}_c$, is obtained by maximizing expression (1) with respect to $\theta\in\Theta$.
The CMLE is consistent and asymptotically normal and, based on it, hypothesis testing procedures can be established in a manner similar to the classical likelihood ratio test, Wald test or Rao's score test. A development of the asymptotic theory of the CMLE, including its application to composite likelihood ratio statistics, Wald-type tests and Rao score tests in the context of composite likelihood, can be found in [10]. However, it was shown in [11,12,13] that the CMLE and the derived testing procedures suffer from an important lack of robustness. In this sense, [11,12,13] derived new distance-based estimators and tests with good robustness behaviour and without an important loss of efficiency. In this paper, we consider the composite minimum density power divergence estimator (CMDPDE), introduced in [12], in order to present a model selection criterion in a composite likelihood framework.
Model selection criteria, which summarize data evidence in favor of a model, are a very well studied subject in the statistical literature, above all in the context of the full likelihood. The construction of such criteria requires a measure of similarity between two models, which are typically described in terms of their distributions. This can be achieved if an unbiased estimator is found of the expected overall discrepancy, which measures the statistical distance between the true, but unknown, model and the entertained model. The model with the smallest value of the criterion is then the most preferable one. The use of divergence measures, in particular the Kullback–Leibler divergence ([14]), to measure this discrepancy is the main idea behind some of the best-known criteria: the Akaike Information Criterion (AIC, [15,16]), the criterion proposed by Takeuchi (TIC, [17]) and other modifications of the AIC [18]. The DIC criterion, based on the density power divergence (DPD), was presented in [19] and, recently, [20] presented a local BHHJ power divergence information criterion following [21]. In the context of the composite likelihood there are some criteria based on the Kullback–Leibler divergence; see, for instance, [22,23,24] and the references therein. To the best of our knowledge, only the Kullback–Leibler divergence has been used to develop model selection criteria in a composite likelihood framework. To fill this gap, our interest is now focused on the DPD.
In this paper, we present a new information criterion for model selection in the framework of composite likelihood, based on the DPD measure. This divergence measure, introduced and studied in the case of the complete likelihood by [25], has been considered previously in [12,13] in the context of composite likelihood. In those papers a new estimator, the CMDPDE, was introduced, and its robustness in relation to the CMLE, as well as the robustness of some families of test statistics, was studied; the problem of model selection, however, was not considered there. That problem is addressed in this paper. The criterion introduced here will be called the composite likelihood DIC criterion (CLDIC). The motivation for considering a criterion based on the DPD instead of the Kullback–Leibler divergence is the robustness of the procedures based on the DPD in statistical inference, not only in the context of the full likelihood [25,26], but also in the context of composite likelihood [12,13]. In Section 2, the CMDPDE is presented and some properties of this estimator are discussed. The new model selection criterion, CLDIC, based on the CMDPDE, is introduced in Section 3 and some of its asymptotic properties are studied. A simulation study is carried out in Section 4 and some numerical examples are presented in Section 5. Finally, some concluding remarks are presented in Section 6.

2. Composite Minimum Density Power Divergence Estimator

Given two probability density functions g and f, associated with two m-dimensional random variables, the DPD ([25]) measures the statistical distance between g and f by

$$d_\alpha(g,f)=\int_{\mathbb{R}^m}\left\{f(\mathbf{y})^{1+\alpha}-\left(1+\frac{1}{\alpha}\right)f(\mathbf{y})^{\alpha}g(\mathbf{y})+\frac{1}{\alpha}g(\mathbf{y})^{1+\alpha}\right\}d\mathbf{y}, \tag{2}$$

for $\alpha>0$, while for $\alpha=0$ it is defined by

$$d_0(g,f)=\lim_{\alpha\to 0^{+}}d_\alpha(g,f)=d_{KL}(g,f),$$

where $d_{KL}(g,f)$ is the Kullback–Leibler divergence (see, for example, [26]). For $\alpha=1$, expression (2) leads to the $L_2$ distance, $L_2(g,f)=\int_{\mathbb{R}^m}\left(f(\mathbf{y})-g(\mathbf{y})\right)^2 d\mathbf{y}$. It is also interesting to note that (2) is a special case of the so-called Bregman divergence,

$$\int_{\mathbb{R}^m}\left[T(g(\mathbf{y}))-T(f(\mathbf{y}))-\left\{g(\mathbf{y})-f(\mathbf{y})\right\}T'(f(\mathbf{y}))\right]d\mathbf{y}. \tag{3}$$

If we take $T(l)=\frac{1}{\alpha}l^{1+\alpha}$ in (3), we get $d_\alpha(g,f)$. The parameter $\alpha$ controls the trade-off between the robustness and the asymptotic efficiency of the parameter estimates which minimize this family of divergences. For more details about this family of divergence measures we refer to [27].
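Indeed, the claim can be checked directly: substituting $T(l)=\frac{1}{\alpha}l^{1+\alpha}$, so that $T'(l)=\frac{1+\alpha}{\alpha}l^{\alpha}$, into the integrand of (3) gives

$$\frac{1}{\alpha}g^{1+\alpha}-\frac{1}{\alpha}f^{1+\alpha}-(g-f)\,\frac{1+\alpha}{\alpha}f^{\alpha}
=f^{1+\alpha}-\left(1+\frac{1}{\alpha}\right)f^{\alpha}g+\frac{1}{\alpha}g^{1+\alpha},$$

which is exactly the integrand of (2).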
Let now $Y_1,\dots,Y_n$ be independent and identically distributed replications of $\mathbf{Y}$, characterized by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that $\Xi=\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq1\}$ is a parametric, identifiable family of candidate distributions to describe the observations $\mathbf{y}_1,\dots,\mathbf{y}_n$. Then, the DPD between the true model g and the composite likelihood function $\mathrm{CL}(\theta,\cdot)$ associated with the parametric model $f(\cdot;\theta)$ is defined as

$$d_\alpha(g(\cdot),\mathrm{CL}(\theta,\cdot))=\int_{\mathbb{R}^m}\left\{\mathrm{CL}(\theta,\mathbf{y})^{1+\alpha}-\left(1+\frac{1}{\alpha}\right)\mathrm{CL}(\theta,\mathbf{y})^{\alpha}g(\mathbf{y})+\frac{1}{\alpha}g(\mathbf{y})^{1+\alpha}\right\}d\mathbf{y}, \tag{4}$$

for $\alpha>0$, while for $\alpha=0$ we have $d_{KL}(g(\cdot),\mathrm{CL}(\theta,\cdot))$, which is defined by

$$d_{KL}(g(\cdot),\mathrm{CL}(\theta,\cdot))=\int_{\mathbb{R}^m}g(\mathbf{y})\log\frac{g(\mathbf{y})}{\mathrm{CL}(\theta,\mathbf{y})}\,d\mathbf{y}.$$
In Section 3, we are going to introduce and study the CLDIC criterion based on (4).
Let

$$\{M_k\}_{k\in\{1,\dots,\ell\}} \tag{5}$$

be a family of candidate models to govern the observations $Y_1,\dots,Y_n$. We shall assume that the true model is included in $\{M_k\}_{k\in\{1,\dots,\ell\}}$. For a specific $k\in\{1,\dots,\ell\}$, the parametric model $M_k$ is described by the composite likelihood function

$$\mathrm{CL}(\theta,\cdot),\quad \theta\in\Theta_k\subseteq\mathbb{R}^{k}. \tag{6}$$

In this setting, it is quite clear that the most suitable candidate model to describe the observations is the model that minimizes the DPD in (4). However, the unknown parameter θ appears in it, so this measure cannot be used directly for the choice of the most suitable model. A way to overcome this problem is to plug into (4), in place of the unknown parameter θ, an estimator enjoying some nice properties, such as consistency and asymptotic normality. For this purpose the CMDPDE, introduced in [12], can be used. This estimator is described in the sequel for the sake of completeness.
If we denote the kernel of (4) by

$$W_\alpha(\theta)=\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{1+\alpha}d\mathbf{y}-\left(1+\frac{1}{\alpha}\right)\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}g(\mathbf{y})\,d\mathbf{y}, \tag{7}$$

we can write

$$d_\alpha(g(\cdot),\mathrm{CL}(\theta,\cdot))=W_\alpha(\theta)+\frac{1}{\alpha}\int_{\mathbb{R}^m}g(\mathbf{y})^{1+\alpha}d\mathbf{y},$$

and the term $\frac{1}{\alpha}\int_{\mathbb{R}^m}g(\mathbf{y})^{1+\alpha}d\mathbf{y}$ does not depend on θ, so it can be ignored in the minimization. A natural estimator of $W_\alpha(\theta)$, given in (7), can be obtained by observing that the last integral in (7) can be expressed in the form $\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}dG(\mathbf{y})$, with G the distribution function corresponding to g. Hence, if the empirical distribution function of $Y_1,\dots,Y_n$ is exploited, this last integral is approximated by $\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\theta,Y_i)^{\alpha}$, i.e.,

$$W_{n,\alpha}(\theta)=\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}d\mathbf{y}-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\theta,Y_i)^{\alpha}. \tag{8}$$
Definition 1.
The CMDPDE of θ, $\hat{\theta}_{c\alpha}$, is defined, for $\alpha>0$, by

$$\hat{\theta}_{c\alpha}=\arg\min_{\theta\in\Theta}W_{n,\alpha}(\theta). \tag{9}$$

We shall denote the score of the composite likelihood by

$$u(\theta,\mathbf{y})=\frac{\partial\log \mathrm{CL}(\theta,\mathbf{y})}{\partial\theta}.$$
Let $\theta_0$ be the true value of the parameter θ. In [12], it was shown that the asymptotic distribution of $\hat{\theta}_{c\alpha}$ is given by

$$\sqrt{n}\,(\hat{\theta}_{c\alpha}-\theta_0)\xrightarrow[n\to\infty]{\mathcal{L}}\mathcal{N}\left(\mathbf{0}_p,\ H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right), \tag{10}$$

with

$$H_\alpha(\theta)=\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}d\mathbf{y} \tag{11}$$

and

$$J_\alpha(\theta)=\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{2\alpha+1}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}d\mathbf{y}-\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})\,d\mathbf{y}\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})^{T}d\mathbf{y}. \tag{12}$$
Remark 1.
For $\alpha=0$ we get the CMLE of θ,

$$\hat{\theta}_c=\arg\min_{\theta\in\Theta}\left(-\frac{1}{n}\sum_{i=1}^{n}\log \mathrm{CL}(\theta,\mathbf{y}_i)\right).$$

At the same time, it is well known that

$$\sqrt{n}\,(\hat{\theta}_c-\theta)\xrightarrow[n\to\infty]{\mathcal{L}}\mathcal{N}\left(\mathbf{0}_p,\ G(\theta)^{-1}\right),$$

where $G(\theta)$ denotes the Godambe information matrix, defined by $G(\theta)=H(\theta)J(\theta)^{-1}H(\theta)$, with $H(\theta)$ the sensitivity or Hessian matrix and $J(\theta)$ the variability matrix, defined, respectively, by

$$H(\theta)=-E_\theta\left[\frac{\partial u(\theta,\mathbf{y})}{\partial\theta^{T}}\right],\qquad J(\theta)=E_\theta\left[u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}\right]. \tag{13}$$
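To fix ideas, the following is a minimal sketch in R of how the CMDPDE of Definition 1 can be computed numerically for the two-pairs normal composite likelihood used later in Section 4.1 (here assuming, for illustration only, known zero means); the function names `cl`, `int_cl`, `w_n_alpha` and `cmdpde` are our own and not from any package.

```r
library(mvtnorm)  # for dmvnorm

# composite likelihood CL(rho, y) = f12(y1, y2; rho) * f34(y3, y4; rho),
# a product of two bivariate normal densities with correlation rho
cl <- function(rho, y) {
  S <- matrix(c(1, rho, rho, 1), 2, 2)
  dmvnorm(y[, 1:2], sigma = S) * dmvnorm(y[, 3:4], sigma = S)
}

# first integral in (8): for a product of two bivariate normal densities,
# int CL^(1+alpha) dy = ((2*pi)^(-alpha) * det(S)^(-alpha/2) / (1+alpha))^2
int_cl <- function(rho, alpha) {
  ((2 * pi)^(-alpha) * (1 - rho^2)^(-alpha / 2) / (1 + alpha))^2
}

# empirical objective W_{n,alpha}(rho) of Equation (8), for alpha > 0
w_n_alpha <- function(rho, y, alpha) {
  int_cl(rho, alpha) - (1 + 1 / alpha) * mean(cl(rho, y)^alpha)
}

# CMDPDE of rho: one-dimensional minimization of (8), as in (9)
cmdpde <- function(y, alpha) {
  optimize(w_n_alpha, interval = c(-0.99, 0.99), y = y, alpha = alpha)$minimum
}
```

For a multi-dimensional θ, `optimize` would be replaced by a general-purpose optimizer such as `optim`.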

3. A New Model Selection Criterion

In order to describe the CLDIC criterion, we consider the model $M_k$ given in (6). Following standard methodology (cf. [28], p. 240), the most suitable candidate model to describe the data $Y_1,\dots,Y_n$ is the model that minimizes the expected estimated DPD

$$E_{Y_1,\dots,Y_n}\left[d_\alpha(g(\cdot),\mathrm{CL}(\hat{\theta}_{c\alpha},\cdot))\right], \tag{14}$$

subject to the assumption that the unknown model g belongs to $\Xi$, i.e., that the true model is included in $\{M_s\}_{s\in\{1,\dots,\ell\}}$, and taking into account that $\hat{\theta}_{c\alpha}$, defined in (9), is a consistent and asymptotically normal estimator of θ. However, this expected value still depends on the unknown parameter θ, so the criterion should be based on an asymptotically unbiased estimator of (14), for $g\in\Xi$.
The most appropriate model to select is thus the model which minimizes the expected value

$$E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_{c\alpha})\right].$$

Since this expected value also depends on the unknown parameter θ, an asymptotically unbiased estimator of it can serve as the basis of a selection criterion, for $g\in\Xi$. To derive such an estimator, note that the empirical version of $W_\alpha(\theta)$ in (7) is $W_{n,\alpha}(\theta)$, given in (8); it plays a central role in the development of the model selection criterion through the next theorem, which expresses the expected value $E_{Y_1,\dots,Y_n}[W_\alpha(\hat{\theta}_{c\alpha})]$ by means of the corresponding expected value of $W_{n,\alpha}(\hat{\theta}_{c\alpha})$ in an asymptotically equivalent way.
Theorem 1.
If the true distribution g belongs to the parametric family $\Xi$ and $\theta_0$ denotes the true value of the parameter θ, then we have

$$E_{Y_1,\dots,Y_n}\left[W_\alpha(\hat{\theta}_{c\alpha})\right]=E_{Y_1,\dots,Y_n}\left[W_{n,\alpha}(\hat{\theta}_{c\alpha})\right]+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1),$$

with $H_\alpha(\theta)$ and $J_\alpha(\theta)$ given in (11) and (12), respectively.
Based on the above theorem, the proof of which is presented in full detail in Appendix A, an asymptotically unbiased estimator of $E_{Y_1,\dots,Y_n}[W_\alpha(\hat{\theta}_{c\alpha})]$ is given by

$$W_{n,\alpha}(\hat{\theta}_{c\alpha})+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat{\theta}_{c\alpha})H_\alpha(\hat{\theta}_{c\alpha})^{-1}\right).$$

This observation is the basis of, and a strong motivation for, the next definition, which introduces the model selection criterion.
Definition 2.
Let $\{M_k\}_{k\in\{1,\dots,\ell\}}$ be candidate models for the observations $Y_1,\dots,Y_n$. The selected model $M^{*}$ verifies

$$M^{*}=\arg\min_{k\in\{1,\dots,\ell\}}\mathrm{CLDIC}_\alpha(M_k),$$

where

$$\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat{\theta}_{c\alpha})+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat{\theta}_{c\alpha})H_\alpha(\hat{\theta}_{c\alpha})^{-1}\right),$$

$W_{n,\alpha}(\theta)$ was given in (8) and $J_\alpha(\theta)$ and $H_\alpha(\theta)$ were defined in (12) and (11), respectively.
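The criterion of Definition 2 translates directly into code. The sketch below assumes the hypothetical helpers `w_n_alpha()` (as above) and `J_alpha_hat()`, `H_alpha_hat()` returning estimates of the matrices in (12) and (11); none of these names come from an existing package.

```r
# CLDIC_alpha(M_k) for one fitted candidate model
cldic <- function(theta_hat, y, alpha) {
  n <- nrow(y)
  J <- J_alpha_hat(theta_hat, y, alpha)  # estimate of J_alpha, Equation (12)
  H <- H_alpha_hat(theta_hat, y, alpha)  # estimate of H_alpha, Equation (11)
  w_n_alpha(theta_hat, y, alpha) +
    (alpha + 1) / n * sum(diag(J %*% solve(H)))  # trace(J H^{-1}) penalty
}

# model choice over a list `fits` of fitted candidates, each with $theta_hat:
# k_star <- which.min(sapply(fits, function(f) cldic(f$theta_hat, y, alpha)))
```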
The next remark summarizes the model selection criterion in the case $\alpha=0$; it therefore extends, in a sense, the pioneering and classical AIC.
Remark 2.
For $\alpha=0$ we have

$$d_{KL}(g(\cdot),\mathrm{CL}(\theta,\cdot))=W_0(\theta)+\int_{\mathbb{R}^m}g(\mathbf{y})\log g(\mathbf{y})\,d\mathbf{y},$$

with $W_0(\theta)=-\int_{\mathbb{R}^m}\log \mathrm{CL}(\theta,\mathbf{y})\,g(\mathbf{y})\,d\mathbf{y}$. Therefore, the most appropriate model to select is the model which minimizes the expected value

$$E_{Y_1,\dots,Y_n}\left[W_0(\hat{\theta}_c)\right], \tag{15}$$

where $\hat{\theta}_c$ is the CMLE of $\theta_0$ defined in Remark 1.
The expected value (15) still depends on the unknown parameter θ. A natural estimator of $W_0(\hat{\theta}_c)$ can be obtained by replacing the distribution function G of g by the empirical distribution function based on $Y_1,\dots,Y_n$:

$$W_{n,0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\log \mathrm{CL}(\theta,\mathbf{y}_i).$$
Based on it, we select the model $M^{*}$ that verifies

$$M^{*}=\arg\min_{k\in\{1,\dots,\ell\}}\mathrm{CLDIC}_0(M_k),$$

with

$$\mathrm{CLDIC}_0(M_k)=W_{n,0}(\hat{\theta}_c)+\frac{1}{n}\,\mathrm{trace}\left(J(\hat{\theta}_c)H(\hat{\theta}_c)^{-1}\right),$$

where $J(\hat{\theta}_c)$ and $H(\hat{\theta}_c)$ are defined in Remark 1. In a manner quite similar to the proof of the previous theorem, it can be established that $\mathrm{CLDIC}_0(M_k)$ is an asymptotically unbiased estimator of $E_{Y_1,\dots,Y_n}[W_0(\hat{\theta}_c)]$.
This is the model selection criterion in a composite likelihood framework based on the Kullback–Leibler divergence. We can observe that it coincides with the criterion given in [22] as a generalization of the classical criterion of Akaike; it will be referred to from now on as the Composite Akaike Information Criterion (CAIC).

4. Numerical Simulations

4.1. Scenario 1: Two-Component Mixed Model

We start with a simulation example, motivated by and following ideas from [29] and Example 4.1 in [20], which compares the behaviour of the proposed criterion with the CAIC criterion, corresponding to $\alpha=0$ (see Remark 2).
Consider the random vector $\mathbf{Y}=(Y_1,Y_2,Y_3,Y_4)^T$ from an unknown density g, and let $Y_1,\dots,Y_n$ be independent and identically distributed replications of $\mathbf{Y}$ described by the true but unknown distribution g. Taking into account that the true model g is unknown, suppose that $\{f(\cdot;\theta),\ \theta\in\Theta\subseteq\mathbb{R}^p,\ p\geq1\}$ is a parametric, identifiable family of candidate distributions to describe the observations $\mathbf{y}_1,\dots,\mathbf{y}_n$. Let also $\mathrm{CL}(\theta,\mathbf{y})$ denote the composite likelihood function associated with the parametric model $f(\cdot;\theta)$.
We consider the problem of choosing (on the basis of n independent and identically distributed replications $\mathbf{y}_1,\dots,\mathbf{y}_n$ of $\mathbf{Y}=(Y_1,Y_2,Y_3,Y_4)^T$) between a 4-variate normal distribution, $\mathcal{N}(\boldsymbol{\mu}_N,\boldsymbol{\Sigma})$, with $\boldsymbol{\mu}_N=(\mu_{1N},\mu_{2N},\mu_{3N},\mu_{4N})^T$ and

$$\boldsymbol{\Sigma}=\begin{pmatrix}1&\rho&2\rho&2\rho\\ \rho&1&2\rho&2\rho\\ 2\rho&2\rho&1&\rho\\ 2\rho&2\rho&\rho&1\end{pmatrix},$$

and a 4-variate t-distribution with ν degrees of freedom, $t_\nu(\boldsymbol{\mu}_{t_\nu},\boldsymbol{\Sigma})$, with different location parameters $\boldsymbol{\mu}_{t_\nu}=(\mu_{1t_\nu},\mu_{2t_\nu},\mu_{3t_\nu},\mu_{4t_\nu})^T$, the same variance-covariance matrix $\boldsymbol{\Sigma}$, and density

$$C_m\,|\boldsymbol{\Sigma}^{*}|^{-1/2}\left(1+\frac{1}{\nu}(\mathbf{y}-\boldsymbol{\mu}_{t_\nu})^T(\boldsymbol{\Sigma}^{*})^{-1}(\mathbf{y}-\boldsymbol{\mu}_{t_\nu})\right)^{-(\nu+m)/2},$$

with $\boldsymbol{\Sigma}^{*}=\frac{\nu-2}{\nu}\boldsymbol{\Sigma}$, $C_m=(\pi\nu)^{-m/2}\,\frac{\Gamma\left((\nu+m)/2\right)}{\Gamma\left(\nu/2\right)}$ and $m=4$.
Consider the composite likelihood function

$$\mathrm{CL}_N(\rho,\mathbf{y})=f_{A_1}^{N}(\mathbf{y};\rho)\,f_{A_2}^{N}(\mathbf{y};\rho),$$

with $f_{A_1}^{N}(\mathbf{y};\rho)=f_{12}^{N}(y_1,y_2;\mu_{1N},\mu_{2N};\rho)$ and $f_{A_2}^{N}(\mathbf{y};\rho)=f_{34}^{N}(y_3,y_4;\mu_{3N},\mu_{4N};\rho)$, where $f_{12}^{N}$ and $f_{34}^{N}$ are the densities of the marginals of $\mathbf{Y}$, i.e., bivariate normal distributions with mean vectors $(\mu_{1N},\mu_{2N})^T$ and $(\mu_{3N},\mu_{4N})^T$, respectively, and common variance-covariance matrix

$$\boldsymbol{\Sigma}_0=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}.$$

In a similar manner, consider the composite likelihood

$$\mathrm{CL}_{t_\nu}(\rho,\mathbf{y})=f_{A_1}^{t_\nu}(\mathbf{y};\rho)\,f_{A_2}^{t_\nu}(\mathbf{y};\rho),$$

with $f_{A_1}^{t_\nu}(\mathbf{y};\rho)=f_{12}^{t_\nu}(y_1,y_2;\mu_{1t_\nu},\mu_{2t_\nu};\rho)$ and $f_{A_2}^{t_\nu}(\mathbf{y};\rho)=f_{34}^{t_\nu}(y_3,y_4;\mu_{3t_\nu},\mu_{4t_\nu};\rho)$, where $f_{12}^{t_\nu}$ and $f_{34}^{t_\nu}$ are the densities of the marginals of $\mathbf{Y}$, i.e., bivariate t-distributions with mean vectors $(\mu_{1t_\nu},\mu_{2t_\nu})^T$ and $(\mu_{3t_\nu},\mu_{4t_\nu})^T$, respectively, and the same common variance-covariance matrix $\boldsymbol{\Sigma}_0$.
Under this formulation, the simulation study proceeds through the following two scenarios.

4.1.1. Scenario 1a

Following Example 4.1 in [20], the steps of the simulation study are the following:
  • Generate 1000 samples of size n = 5, 7, 10, 20, 40, 50, 70, 100 from a two-component mixture of two 4-variate distributions, namely a 4-variate normal and a 4-variate t-distribution,

$$h_\omega(\mathbf{y})=\omega\,\mathcal{N}(\boldsymbol{\mu}_N,\boldsymbol{\Sigma})+(1-\omega)\,t_\nu(\boldsymbol{\mu}_{t_\nu},\boldsymbol{\Sigma}),\quad 0\leq\omega\leq1,$$

    with $\boldsymbol{\mu}_N=(0,0,0.5,0)$ and $\boldsymbol{\mu}_{t_\nu}=(3.2,1.5,0.5,2)$, for ω = 0, 0.25, 0.45, 0.5, 0.55, 0.75, 1, ν = 5, 10, 30 degrees of freedom, and the specific values ρ = 0.15, 0.10, −0.10. As pointed out in [29], taking into account that $\boldsymbol{\Sigma}$ should be positive semi-definite, the following condition is imposed: $-\tfrac{1}{5}\leq\rho\leq\tfrac{1}{3}$.
  • Estimate the common parameter ρ, separately under each model, by using the CMDPDE for different values of the tuning parameter α = 0, 0.3. The composite density corresponding to the mixture $h_\omega(\mathbf{y})$ is defined by

$$\mathrm{CL}(\rho,\mathbf{y})=\omega\,\mathrm{CL}_N(\rho,\mathbf{y})+(1-\omega)\,\mathrm{CL}_{t_\nu}(\rho,\mathbf{y}),\quad 0\leq\omega\leq1,$$

    and it is used to obtain the CMDPDE, $\hat\rho$, of ρ.
  • Define the mixture composite likelihood function

$$\mathrm{CL}(\hat\rho,\mathbf{y})=\omega\,\mathrm{CL}_N(\hat\rho,\mathbf{y})+(1-\omega)\,\mathrm{CL}_{t_\nu}(\hat\rho,\mathbf{y}),\quad 0\leq\omega\leq1.$$

  • Calculate $\mathrm{CLDIC}_\alpha(M_k)$, the value of the model selection criterion considered in this paper, for the two candidate models, with

$$\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat\rho)+\frac{\alpha+1}{n}\,\mathrm{trace}\left(J_\alpha(\hat\rho)H_\alpha(\hat\rho)^{-1}\right).$$

    An explanation of how to obtain this value for both candidate models is given in Appendix B.
  • Compute the number of times that the 4-variate normal model was selected (a compact sketch of one replication of this loop is given after this list).
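The following R sketch illustrates one replication of the above loop. It is self-contained up to the hypothetical wrappers `cldic_normal()` and `cldic_t()`, which are assumed to compute the CLDIC value for each candidate as described in Appendix B; they are not functions from an existing package.

```r
library(mvtnorm)

# one replication: draw n observations from the omega-mixture h_omega and
# return TRUE if the 4-variate normal model attains the smaller CLDIC
one_rep <- function(n, omega, mu_n, mu_t, Sigma, nu, alpha) {
  from_normal <- runif(n) < omega                     # component indicators
  y <- matrix(NA_real_, n, 4)
  if (any(from_normal))
    y[from_normal, ] <- rmvnorm(sum(from_normal), mean = mu_n, sigma = Sigma)
  if (any(!from_normal))                              # scale matrix (nu-2)/nu * Sigma
    y[!from_normal, ] <- rmvt(sum(!from_normal), delta = mu_t,
                              sigma = (nu - 2) / nu * Sigma, df = nu)
  cldic_normal(y, alpha) < cldic_t(y, alpha)
}

# e.g. number of times the normal model is selected in 1000 samples:
# times_normal <- sum(replicate(1000, one_rep(100, 0.75, mu_n, mu_t, Sigma, 5, 0.3)))
```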
Results are summarized in Table 1. The extreme values ω = 0, 1 give the number of times that the 4-variate normal model was selected when the data were generated from the 4-variate t-distribution and from the 4-variate normal distribution, respectively. This means that, for ω = 1, perfect discrimination is achieved when all 1000 simulated samples are correctly assigned, while for ω = 0, the closer the count is to 0, the better the discrimination of the criterion. For ω = 0.5, each sample is generated from the normal and the t-distribution in equal proportion.

4.1.2. Scenario 1b

The same scenario is evaluated with the closer means $\boldsymbol{\mu}_N=(0,1.5,0.5,0.75)$ and $\boldsymbol{\mu}_{t_\nu}=(0,1.5,0.5,2)$, for moderate to large sample sizes and α ∈ {0, 0.2, 0.4}. Here ν = 5 and ρ = 0.15. Results are shown in Table 2. In this case, the models under consideration are more similar, so it is understandable that the CLDIC criterion does not discriminate as well.

4.2. Scenario 2: Three-Component Mixed Model

Now, we consider a mixed model composed of two 4-variate normal distributions and a 4-variate t-distribution with ν = 10 degrees of freedom. The three distributions have a common variance-covariance matrix, as in the previous scenario, with unknown ρ = 0.15 and different but known means $\boldsymbol{\mu}_{1N}=(0,0,0.5,0)$, $\boldsymbol{\mu}_{2N}=(0,1.5,0.5,0)$ and $\boldsymbol{\mu}_{t}=(0,1.5,0.5,2)$. The model is defined by

$$\omega\,\mathcal{N}(\boldsymbol{\mu}_{1N},\boldsymbol{\Sigma})+\lambda\,\mathcal{N}(\boldsymbol{\mu}_{2N},\boldsymbol{\Sigma})+(1-\omega-\lambda)\,t_{\nu=10}(\boldsymbol{\mu}_{t},\boldsymbol{\Sigma}),\quad 0\leq\omega,\lambda,\omega+\lambda\leq1,$$

with $\boldsymbol{\Sigma}$ being again a common variance-covariance matrix with unknown parameter ρ of the form

$$\boldsymbol{\Sigma}=\begin{pmatrix}1&\rho&2\rho&2\rho\\ \rho&1&2\rho&2\rho\\ 2\rho&2\rho&1&\rho\\ 2\rho&2\rho&\rho&1\end{pmatrix}.$$

Following the same steps as in the first scenario, we generate 1000 samples of the three-component mixture for different sample sizes n = 5, 7, 10, 20, 40, 50, 70, 100 and different values of ω and λ. Then, we consider the problem of choosing among the two 4-variate normal distributions and the 4-variate t-distribution through the CLDIC criterion, for different values of the tuning parameter α = 0, 0.3, 0.5, 0.7. See Table 3 for the results. There, the normal models are denoted by N1 and N2, respectively, while the 4-variate t-distribution is denoted by MT. The first three cases evaluate the selected model when the data are generated from each of these multivariate distributions. In the last two cases, a mixed model is considered as the true distribution.

4.3. Discussion of Results

In Scenario 1a, two well-differentiated multivariate models are considered. In this case the CLDIC criterion works very efficiently, with almost perfect discrimination for extreme values of ω. Good behaviour is also observed for less extreme values of ω, such as ω = 0.55 or 0.45. We cannot observe a significant difference across choices of α.
In Scenario 1b we consider closer models, which affects the discrimination power of the CLDIC. However, in this case, we do observe large differences when considering different α. While the discrimination power of the CLDIC for α = 0 (CAIC) and ω = 1 is around 75%, for α = 0.2 or α = 0.4 the behaviour is excellent. The same happens for large but not extreme values of ω, such as ω = 0.75. However, a medium value of α leads to worse discrimination for low values of ω.
Scenario 2 deals with three different models, two multivariate normal and one multivariate t (N1, N2 and MT, respectively). The second normal distribution is closer to MT in terms of means. While the CLDIC criterion discriminates well between N1 and N2 and between N1 and MT, it has difficulties in distinguishing the N2 and MT distributions, above all for small sample sizes and α = 0.
It seems, therefore, that when the models are well separated, the CLDIC criterion works very well, independently of the sample size and the tuning parameter α considered. Dealing with closer models leads, as expected, to worse results, above all for α = 0 (CAIC).
Note that the behaviour of Wald-type and Rao tests based on CMDPDEs was studied in [12,13] through extensive simulation studies.

5. Numerical Examples

5.1. Choice of the Tuning Parameter

In the previous sections, we have seen that the CLDIC criterion generally works very well, independently of α, but that some values present better behaviour, above all when distinguishing similar models. In these situations, values close to 0.2 or 0.3 appear to work well, while the CAIC criterion behaves worse. A data-driven approach for the choice of the tuning parameter would therefore be helpful in practice. The approach of [30] was adapted in [13] for the choice of the optimal α in CMDPDEs. This approach consists of minimizing the estimated mean squared error by means of a pilot estimator, $\theta_P$. The approximation is given by

$$\widehat{MSE}(\alpha)=(\hat{\theta}_{c\alpha}-\theta_P)^T(\hat{\theta}_{c\alpha}-\theta_P)+\frac{1}{n}\,\mathrm{trace}\left(H_\alpha^{-1}(\hat{\theta}_{c\alpha})J_\alpha(\hat{\theta}_{c\alpha})H_\alpha^{-1}(\hat{\theta}_{c\alpha})\right), \tag{16}$$

where $H_\alpha(\theta)$ and $J_\alpha(\theta)$ are given in (11) and (12). The optimal α is the one that minimizes expression (16). The choice of the pilot estimator is probably the major drawback of this approach, as it may lead to a choice of α too close to that used for the pilot estimator. A pilot estimator with α = 0.4 was proposed in [13] after some simulations, in concordance with [30], where the initial pilot is suggested to be a robust one in order to obtain the best results in terms of robustness.
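A direct transcription of this grid search is sketched below in R; it reuses the hypothetical helpers `cmdpde()`, `J_alpha_hat()` and `H_alpha_hat()` introduced in the earlier sketches, so it is an illustration of Equation (16) rather than a packaged implementation.

```r
# data-driven choice of alpha by minimizing the estimated MSE of (16)
choose_alpha <- function(y, alphas = seq(0.01, 1, length.out = 100),
                         alpha_pilot = 0.4) {
  n <- nrow(y)
  theta_p <- cmdpde(y, alpha = alpha_pilot)       # pilot estimator
  mse <- sapply(alphas, function(a) {
    th    <- cmdpde(y, alpha = a)
    H_inv <- solve(H_alpha_hat(th, y, a))
    sum((th - theta_p)^2) +                        # squared bias part
      sum(diag(H_inv %*% J_alpha_hat(th, y, a) %*% H_inv)) / n  # variance part
  })
  alphas[which.min(mse)]
}
```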

5.2. Iris Data

The Iris data (Fisher, [31]) include 3 categories of 50 observations each, where each category refers to a type of iris plant: setosa, versicolor and virginica. Each plant is assigned to its class and described by 4 further variables: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. This is one of the best-known data sets for discriminant analysis. In [32], the use of a Gaussian finite mixture for modeling the Iris data was proposed, in which each known class is modeled by a single Gaussian term with the same variance-covariance matrix. The resulting model is as follows:

$$f(\mathbf{x})=\frac{1}{3}\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma})+\frac{1}{3}\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma})+\frac{1}{3}\mathcal{N}(\boldsymbol{\mu}_3,\boldsymbol{\Sigma}), \tag{17}$$

with

$$\boldsymbol{\mu}_1=(\mu_{11},\mu_{12},\mu_{13},\mu_{14})^T,\quad \boldsymbol{\mu}_2=(\mu_{21},\mu_{22},\mu_{23},\mu_{24})^T,\quad \boldsymbol{\mu}_3=(\mu_{31},\mu_{32},\mu_{33},\mu_{34})^T$$

and

$$\boldsymbol{\Sigma}=\begin{pmatrix}\sigma_1^2&\sigma_{12}&\sigma_{13}&\sigma_{14}\\ \sigma_{21}&\sigma_2^2&\sigma_{23}&\sigma_{24}\\ \sigma_{31}&\sigma_{32}&\sigma_3^2&\sigma_{34}\\ \sigma_{41}&\sigma_{42}&\sigma_{43}&\sigma_4^2\end{pmatrix}.$$

Exact values can be obtained with the MclustDA() function of the mclust package in R ([32]).
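For instance, assuming the standard mclust interface, a common-covariance Gaussian model per class can be fitted as follows (the "EDDA" model type with "EEE" covariance corresponds to a single Gaussian term per class with a shared variance-covariance matrix, as in (17)):

```r
library(mclust)
fit <- MclustDA(iris[, 1:4], class = iris$Species,
                modelType = "EDDA", modelNames = "EEE")
summary(fit)  # estimated class means and common covariance matrix
```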
We propose a composite likelihood approach to model (17), in which we suppose independence between the first two and the last two variables. That is,

$$f_{CL}(\mathbf{y})=\frac{1}{3}\mathrm{CL}_{N_1}+\frac{1}{3}\mathrm{CL}_{N_2}+\frac{1}{3}\mathrm{CL}_{N_3}, \tag{18}$$

with

$$\mathrm{CL}_{N_i}=f_{A_{i1}}^{N}(\rho_{12},\mathbf{y})\,f_{A_{i2}}^{N}(\rho_{34},\mathbf{y}),$$

where $f_{A_{i1}}^{N}(\rho_{12},\mathbf{y})=f_{A_{i1}}^{N}(\rho_{12},\mu_{i1},\mu_{i2},\boldsymbol{\Sigma}_{A_1},\mathbf{y})$ and $f_{A_{i2}}^{N}(\rho_{34},\mathbf{y})=f_{A_{i2}}^{N}(\rho_{34},\mu_{i3},\mu_{i4},\boldsymbol{\Sigma}_{A_2},\mathbf{y})$, $i=1,2,3$, are bivariate normal densities with variance-covariance matrices

$$\boldsymbol{\Sigma}_{A_1}=\begin{pmatrix}\sigma_1^2&\rho_{12}\sigma_1\sigma_2\\ \rho_{12}\sigma_1\sigma_2&\sigma_2^2\end{pmatrix},\qquad \boldsymbol{\Sigma}_{A_2}=\begin{pmatrix}\sigma_3^2&\rho_{34}\sigma_3\sigma_4\\ \rho_{34}\sigma_3\sigma_4&\sigma_4^2\end{pmatrix}.$$
We are going to evaluate the behavior of the CLDIC criterion proposed in the previous sections. After estimating the parameters $\rho_{12}$ and $\rho_{34}$ in (18), we consider 10 different subsets of the Iris data:
  • SE subset: first 50 observations, corresponding to Setosa plants (n = 50).
  • VE subset: second 50 observations, corresponding to Versicolor plants (n = 50).
  • VI subset: last 50 observations, corresponding to Virginica plants (n = 50).
  • SE(VE) subset: SE subset together with the first 2 observations of the VE subset (n = 52).
    Equivalently: SE(VI), VE(SE), VE(VI), VI(SE) and VI(VE).
  • VI(SE+VE) subset: VI subset together with the first 2 observations of the SE and VE subsets (n = 54).
In Table 4, the models chosen for each of the subsets by the proposed CLDIC criterion are reported. When a "pure" subset is considered, all the tuning parameters lead to optimal decisions, but when a "contaminated" subset is under consideration, only α = 0.2, 0.3 respond optimally in all the cases.
We now apply the ad hoc approach presented in Section 5.1 for selecting the tuning parameter α in a composite likelihood framework. Applying this procedure to our data set through a grid search of length 100 and by means of a pilot estimator with α = 0.4 leads to the optimal tuning parameter α = 0.22, which is in concordance with the results obtained (see Table 5). We can see that the use of other pilot estimators would not greatly affect the final decision.

5.3. Wine Data

We now work with the Wine data ([33]), which contain a chemical analysis of 178 Italian wines from three different cultivars (Barolo, Grignolino, Barbera), with 13 measurements each. In order to illustrate our criterion, we work with only the first four explanatory variables: Alcohol, Malic, Ash and Alkalinity. As in the previous section, we fit a Gaussian mixture model, with weights in this case 59/178, 72/178 and 47/178, corresponding to the Barolo, Grignolino and Barbera classes, respectively. We now consider these 10 different subsets of the Wine data:
  • BO subset: first 20 observations of Barolo wines (n = 20).
  • GR subset: first 20 observations of Grignolino wines (n = 20).
  • BA subset: first 20 observations of Barbera wines (n = 20).
  • BO(GR) subset: BO subset together with the first 5 observations of the GR subset (n = 25).
    Equivalently: BO(BA), GR(BO), GR(BA), BA(BO) and BA(GR).
  • BA(BO+GR) subset: BA subset together with the first 3 observations of the BO and GR subsets (n = 26).
We can observe that, for medium values of α, the discrimination is perfect (see Table 6). Applying the ad hoc tuning-parameter choice procedure we obtain $\alpha_{opt}\approx0.51$, again with perfect discrimination (Table 5).

6. Conclusions and Future Research

In this paper, we have addressed the problem of model selection in the framework of composite likelihood methodology, on the basis of the DPD as a measure of the closeness of the composite density and the true model that drives the data. In this context, an information criterion is introduced and studied, defined by means of composite minimum distance type estimators of the unknown parameters, which are well known for their nice robustness properties. Through a simulation study, we have shown that the proposed model selection criterion works well in practice, and in particular that the use of the CMDPDE makes the criterion more robust than the criterion based on the classical CMLE and the Kullback–Leibler divergence, given in [22]. The analysis of two real data examples from the literature illustrates how the model selection criterion presented here can be applied in practical cases. This paper is part of a series of papers by the authors in which composite likelihood ideas and methods are harmonically weaved with divergence-theoretic methods in order to develop statistical inference (estimation and testing of hypotheses) as well as model selection criteria. We envision future work in several directions. The development of change-point methodology on the basis of the composite density, the CMDPDE and divergence measures would perhaps be an appealing problem for future research on the topic. Moreover, all the information-theoretic methods developed on the basis of the composite likelihood depend on the choice of the family of sets $\{A_k\}_{k=1}^{K}$ appearing in Formula (1). A question is raised at this point: how are the information-theoretic procedures developed on the basis of the composite likelihood affected by this family of sets? This is an appealing problem which also deserves investigation in future work.

Author Contributions

Conceptualization, E.C., N.M., L.P. and K.Z.; Methodology, E.C., N.M., L.P. and K.Z.; Software, E.C., N.M., L.P. and K.Z.; Validation, E.C., N.M., L.P. and K.Z.; Formal Analysis, E.C., N.M., L.P. and K.Z.; Investigation, E.C., N.M., L.P. and K.Z.; Resources, E.C., N.M., L.P. and K.Z.; Data Curation, E.C., N.M., L.P. and K.Z.; Writing—Original Draft Preparation, E.C., N.M., L.P. and K.Z.; Writing—Review & Editing, E.C., N.M., L.P. and K.Z.; Visualization, E.C., N.M., L.P. and K.Z.; Supervision, E.C., N.M., L.P. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by Grant PGC2018-095194-B-I00 and Grant FPU16/03104 from Ministerio de Ciencia, Innovación y Universidades (Spain). E. Castilla, N. Martín and L. Pardo are members of the Instituto de Matemática Interdisciplinar, Complutense University of Madrid.

Acknowledgments

The authors would like to thank the Editor and Reviewers for taking their precious time to make several valuable comments on the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MLE     Maximum likelihood estimator
CMLE    Composite maximum likelihood estimator
CLDIC   Composite likelihood DIC
DPD     Density power divergence
MDPDE   Minimum density power divergence estimator
CMDPDE  Composite minimum density power divergence estimator
AIC     Akaike Information Criterion
CAIC    Composite Akaike Information Criterion
TIC     Takeuchi Information Criterion

Appendix A. Proof of Theorem 1

Proof. 
A Taylor expansion of $W_\alpha(\theta)$ around the true parameter $\theta_0$, evaluated at $\theta=\hat{\theta}_{c\alpha}$, gives

$$W_\alpha(\hat{\theta}_{c\alpha})=W_\alpha(\theta_0)+\left.\frac{\partial W_\alpha(\theta)}{\partial\theta}\right|_{\theta=\theta_0}^{T}(\hat{\theta}_{c\alpha}-\theta_0)+\frac{1}{2}(\hat{\theta}_{c\alpha}-\theta_0)^{T}\left.\frac{\partial^2 W_\alpha(\theta)}{\partial\theta\partial\theta^{T}}\right|_{\theta=\theta_0}(\hat{\theta}_{c\alpha}-\theta_0)+o\!\left(\left\|\hat{\theta}_{c\alpha}-\theta_0\right\|^2\right).$$

Now,

$$\frac{\partial W_\alpha(\theta)}{\partial\theta}=(1+\alpha)\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}\frac{\partial \mathrm{CL}(\theta,\mathbf{y})}{\partial\theta}d\mathbf{y}-\left(1+\frac{1}{\alpha}\right)\alpha\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha-1}\frac{\partial \mathrm{CL}(\theta,\mathbf{y})}{\partial\theta}g(\mathbf{y})\,d\mathbf{y}=(1+\alpha)\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})\,d\mathbf{y}-(1+\alpha)\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}u(\theta,\mathbf{y})\,g(\mathbf{y})\,d\mathbf{y}.$$

It is clear that, if the true distribution g belongs to the parametric family $\{f(\cdot;\theta),\ \theta\in\Theta\}$ and $\theta_0$ denotes the true value of the parameter θ, we get

$$\left.\frac{\partial W_\alpha(\theta)}{\partial\theta}\right|_{\theta=\theta_0}=\mathbf{0}.$$

We now compute

$$\frac{\partial^2 W_\alpha(\theta)}{\partial\theta\partial\theta^{T}}=(1+\alpha)\left[\int_{\mathbb{R}^m}(1+\alpha)\,\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}d\mathbf{y}+\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}\frac{\partial^2\log \mathrm{CL}(\theta,\mathbf{y})}{\partial\theta\partial\theta^{T}}d\mathbf{y}-\alpha\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}g(\mathbf{y})\,d\mathbf{y}-\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha}\frac{\partial^2\log \mathrm{CL}(\theta,\mathbf{y})}{\partial\theta\partial\theta^{T}}g(\mathbf{y})\,d\mathbf{y}\right].$$

If the true distribution g belongs to the parametric family and $\theta_0$ denotes the true value of the parameter θ, this matrix verifies

$$\left.\frac{\partial^2 W_\alpha(\theta)}{\partial\theta\partial\theta^{T}}\right|_{\theta=\theta_0}=(1+\alpha)\int_{\mathbb{R}^m}\mathrm{CL}(\theta_0,\mathbf{y})^{\alpha+1}u(\theta_0,\mathbf{y})u(\theta_0,\mathbf{y})^{T}d\mathbf{y}=(1+\alpha)H_\alpha(\theta_0).$$

Therefore,

$$nW_\alpha(\hat{\theta}_{c\alpha})=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\,\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)^{T}H_\alpha(\theta_0)\,\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)+n\,o\!\left(\left\|\hat{\theta}_{c\alpha}-\theta_0\right\|^2\right).$$

But

$$\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)\xrightarrow[n\to\infty]{\mathcal{L}}\mathcal{N}\left(\mathbf{0},\ H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)$$

and $n\,o\!\left(\|\hat{\theta}_{c\alpha}-\theta_0\|^2\right)=o(O_p(1))=o_p(1)$.
The quadratic form $\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)^{T}H_\alpha(\theta_0)\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)$ verifies

$$\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)^{T}H_\alpha(\theta_0)\,\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)\xrightarrow[n\to\infty]{\mathcal{L}}\sum_{r=1}^{k}\lambda_r Z_r^2,$$

where $\lambda_r$, $r=1,\dots,k$, are the eigenvalues of the matrix

$$H_\alpha(\theta_0)\,H_\alpha(\theta_0)^{-1}J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}=J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}$$

and the $Z_r$ are independent standard normal random variables. Therefore,

$$E_{Y_1,\dots,Y_n}\left[\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)^{T}H_\alpha(\theta_0)\,\sqrt{n}(\hat{\theta}_{c\alpha}-\theta_0)\right]=\sum_{r=1}^{k}\lambda_r+o_p(1)=\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)$$

and

$$E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_{c\alpha})\right]=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).$$

Now, a Taylor expansion of $W_{n,\alpha}(\theta)$ around $\hat{\theta}_{c\alpha}$, evaluated at $\theta=\theta_0$, gives

$$W_{n,\alpha}(\theta_0)=W_{n,\alpha}(\hat{\theta}_{c\alpha})+\left.\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}\right|_{\theta=\hat{\theta}_{c\alpha}}^{T}(\theta_0-\hat{\theta}_{c\alpha})+\frac{1}{2}(\theta_0-\hat{\theta}_{c\alpha})^{T}\left.\frac{\partial^2 W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^{T}}\right|_{\theta=\hat{\theta}_{c\alpha}}(\theta_0-\hat{\theta}_{c\alpha})+o\!\left(\left\|\theta_0-\hat{\theta}_{c\alpha}\right\|^2\right).$$

But

$$\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}=(\alpha+1)\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})\,d\mathbf{y}-(\alpha+1)\frac{1}{n}\sum_{k=1}^{n}\mathrm{CL}(\theta,\mathbf{y}_k)^{\alpha}u(\theta,\mathbf{y}_k),$$

therefore

$$\left.\frac{\partial W_{n,\alpha}(\theta)}{\partial\theta}\right|_{\theta=\hat{\theta}_{c\alpha}}\xrightarrow[n\to\infty]{P}\mathbf{0}.$$

On the other hand,

$$\frac{\partial^2 W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^{T}}=(1+\alpha)\left[\int_{\mathbb{R}^m}(1+\alpha)\,\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}d\mathbf{y}+\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}\frac{\partial u(\theta,\mathbf{y})}{\partial\theta^{T}}d\mathbf{y}-\frac{1}{n}\sum_{i=1}^{n}\alpha\,\mathrm{CL}(\theta,\mathbf{y}_i)^{\alpha}u(\theta,\mathbf{y}_i)u(\theta,\mathbf{y}_i)^{T}-\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\theta,\mathbf{y}_i)^{\alpha}\frac{\partial u(\theta,\mathbf{y}_i)}{\partial\theta^{T}}\right].$$

But

$$\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\theta,\mathbf{y}_i)^{\alpha}u(\theta,\mathbf{y}_i)u(\theta,\mathbf{y}_i)^{T}\xrightarrow[n\to\infty]{P}\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}u(\theta,\mathbf{y})u(\theta,\mathbf{y})^{T}d\mathbf{y}$$

and

$$\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\theta,\mathbf{y}_i)^{\alpha}\frac{\partial u(\theta,\mathbf{y}_i)}{\partial\theta^{T}}\xrightarrow[n\to\infty]{P}\int_{\mathbb{R}^m}\mathrm{CL}(\theta,\mathbf{y})^{\alpha+1}\frac{\partial u(\theta,\mathbf{y})}{\partial\theta^{T}}d\mathbf{y}.$$

Therefore,

$$\left.\frac{\partial^2 W_{n,\alpha}(\theta)}{\partial\theta\partial\theta^{T}}\right|_{\theta=\hat{\theta}_{c\alpha}}\xrightarrow[n\to\infty]{P}(1+\alpha)H_\alpha(\theta_0).$$

We can now write

$$nW_{n,\alpha}(\theta_0)=nW_{n,\alpha}(\hat{\theta}_{c\alpha})+\frac{1+\alpha}{2}\,\sqrt{n}(\theta_0-\hat{\theta}_{c\alpha})^{T}H_\alpha(\theta_0)\,\sqrt{n}(\theta_0-\hat{\theta}_{c\alpha})+o_p(1).$$

It is clear that

$$E_{Y_1,\dots,Y_n}\left[\sqrt{n}(\theta_0-\hat{\theta}_{c\alpha})^{T}H_\alpha(\theta_0)\,\sqrt{n}(\theta_0-\hat{\theta}_{c\alpha})\right]=\sum_{r=1}^{k}\lambda_r+o_p(1)=\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).$$

Then

$$E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\theta_0)\right]=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_{c\alpha})\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)$$

and, on the other hand, it is clear that

$$E_{Y_1,\dots,Y_n}\left[W_{n,\alpha}(\theta_0)\right]=W_\alpha(\theta_0).$$

Therefore,

$$\begin{aligned}
E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_{c\alpha})\right]&=nW_\alpha(\theta_0)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)\\
&=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\theta_0)\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)\\
&=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_{c\alpha})\right]+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+\frac{1+\alpha}{2}\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1)\\
&=E_{Y_1,\dots,Y_n}\left[nW_{n,\alpha}(\hat{\theta}_{c\alpha})\right]+(1+\alpha)\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)+o_p(1).
\end{aligned}$$

Hence $nW_{n,\alpha}(\hat{\theta}_{c\alpha})+(1+\alpha)\,\mathrm{trace}\left(J_\alpha(\theta_0)H_\alpha(\theta_0)^{-1}\right)$ is an asymptotically unbiased estimator of
$E_{Y_1,\dots,Y_n}\left[nW_\alpha(\hat{\theta}_{c\alpha})\right]$.
 □

Appendix B. Computation of the CLDIC in Section 4.1

We have to compute

$$\mathrm{CLDIC}_\alpha(M_k)=W_{n,\alpha}(\hat\rho)+\frac{\alpha+1}{n}\,J_\alpha(\hat\rho)H_\alpha(\hat\rho)^{-1},$$

where

$$W_{n,\alpha}(\hat\rho)=\int_{\mathbb{R}^4}\mathrm{CL}(\hat\rho,\mathbf{y})^{\alpha+1}d\mathbf{y}-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{\alpha}, \tag{A1}$$

$$J_\alpha(\hat\rho)=\int_{\mathbb{R}^4}\mathrm{CL}(\hat\rho,\mathbf{y})^{2\alpha+1}u(\hat\rho,\mathbf{y})^2\,d\mathbf{y}-\left(\int_{\mathbb{R}^4}\mathrm{CL}(\hat\rho,\mathbf{y})^{\alpha+1}u(\hat\rho,\mathbf{y})\,d\mathbf{y}\right)^2, \tag{A2}$$

$$H_\alpha(\hat\rho)=\int_{\mathbb{R}^4}\mathrm{CL}(\hat\rho,\mathbf{y})^{\alpha+1}u(\hat\rho,\mathbf{y})^2\,d\mathbf{y}, \tag{A3}$$

for our candidate models, namely the composite normal and the composite 4-variate t-distribution. As discussed in Section 4.1, we consider a composite likelihood function based on the product of two bivariate distributions with a common variance-covariance matrix. It is therefore necessary, in this example, to obtain the values (A1), (A2) and (A3) for both the composite normal and the composite t-distribution. However, as stated in [10], while the sensitivity and variability matrices can sometimes be evaluated explicitly, it is more usual to use empirical estimates. Following this comment, in the current example we compute (A1), (A2) and (A3) empirically through the sample data using

$$\widehat{W}_{n,\alpha}(\hat\rho)=\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{\alpha+1}-\left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{\alpha},\qquad
\widehat{J}_\alpha(\hat\rho)=\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{2\alpha+1}u(\hat\rho,\mathbf{y}_i)^2-\left(\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{\alpha+1}u(\hat\rho,\mathbf{y}_i)\right)^2,\qquad
\widehat{H}_\alpha(\hat\rho)=\sum_{i=1}^{n}\mathrm{CL}(\hat\rho,\mathbf{y}_i)^{\alpha+1}u(\hat\rho,\mathbf{y}_i)^2.$$
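These empirical quantities transcribe directly into R. The sketch below assumes functions `cl()` and `u()` returning the vectors $\mathrm{CL}(\hat\rho,\mathbf{y}_i)$ and $u(\hat\rho,\mathbf{y}_i)$ for the candidate at hand (illustrative names, not from a package).

```r
# empirical W, J, H of Appendix B for a scalar parameter rho_hat
empirical_parts <- function(rho_hat, y, alpha) {
  cl_i <- cl(rho_hat, y)   # CL(rho_hat, y_i), i = 1, ..., n
  u_i  <- u(rho_hat, y)    # scores u(rho_hat, y_i)
  list(
    W = sum(cl_i^(alpha + 1)) - (1 + 1 / alpha) * mean(cl_i^alpha),
    J = sum(cl_i^(2 * alpha + 1) * u_i^2) - sum(cl_i^(alpha + 1) * u_i)^2,
    H = sum(cl_i^(alpha + 1) * u_i^2)
  )
}
```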
Now, we obtain the score of the composite likelihood, $u(\hat\rho,\mathbf{y}_i)$, explicitly for both cases. By Equation (A.5) in [12],

$$u_N(\hat\rho,\mathbf{y}_i)=\frac{2\hat\rho}{1-\hat\rho^2}+\frac{1}{1-\hat\rho^2}\left(t_{1i}t_{2i}+t_{3i}t_{4i}\right)-\frac{\hat\rho}{(1-\hat\rho^2)^2}\left(t_{1i}^2-2\hat\rho\,t_{1i}t_{2i}+t_{2i}^2\right)-\frac{\hat\rho}{(1-\hat\rho^2)^2}\left(t_{3i}^2-2\hat\rho\,t_{3i}t_{4i}+t_{4i}^2\right),$$

with $t_{ji}=y_{ji}-\mu_j$, $j=1,\dots,4$. On the other hand, we want to compute $u_{t_\nu}(\hat\rho,\mathbf{y}_i)$:

$$u_{t_\nu}(\hat\rho,\mathbf{y}_i)=\frac{\partial\log \mathrm{CL}_{t_\nu}(\hat\rho,\mathbf{y}_i)}{\partial\hat\rho}=\frac{1}{\mathrm{CL}_{t_\nu}(\hat\rho,\mathbf{y}_i)}\frac{\partial \mathrm{CL}_{t_\nu}(\hat\rho,\mathbf{y}_i)}{\partial\hat\rho}=\frac{1}{f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)}\left[\frac{\partial f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)+f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)\frac{\partial f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}\right]=\frac{1}{f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)}\frac{\partial f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}+\frac{1}{f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)}\frac{\partial f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}.$$

Now, differentiating the bivariate t density with scale matrix $\frac{\nu-2}{\nu}\boldsymbol{\Sigma}_0$ with respect to $\hat\rho$, it can be shown that

$$\frac{\partial f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}=f_{12}^{t_\nu}(\mathbf{y}_i;\hat\rho)\,\frac{(\nu-2)\hat\rho(1-\hat\rho^2)-(\nu+1)\hat\rho\left(t_{1i}^2+t_{2i}^2\right)+\left(\nu+2+\nu\hat\rho^2\right)t_{1i}t_{2i}}{(1-\hat\rho^2)\left[(\nu-2)(1-\hat\rho^2)+t_{1i}^2-2\hat\rho\,t_{1i}t_{2i}+t_{2i}^2\right]}$$

and

$$\frac{\partial f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)}{\partial\hat\rho}=f_{34}^{t_\nu}(\mathbf{y}_i;\hat\rho)\,\frac{(\nu-2)\hat\rho(1-\hat\rho^2)-(\nu+1)\hat\rho\left(t_{3i}^2+t_{4i}^2\right)+\left(\nu+2+\nu\hat\rho^2\right)t_{3i}t_{4i}}{(1-\hat\rho^2)\left[(\nu-2)(1-\hat\rho^2)+t_{3i}^2-2\hat\rho\,t_{3i}t_{4i}+t_{4i}^2\right]}.$$

References

  1. Fearnhead, P.; Donnelly, P. Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2002, 64, 657–680.
  2. Renard, D.; Molenberghs, G.; Geys, H. A pairwise likelihood approach to estimation in multilevel probit models. Comput. Stat. Data Anal. 2004, 44, 649–667.
  3. Hjort, N.L.; Omre, H. Topics in spatial statistics. Scand. J. Stat. 1994, 21, 289–357.
  4. Heagerty, P.J.; Lele, S.R. A composite likelihood approach to binary spatial data. J. Am. Stat. Assoc. 1998, 93, 1099–1111.
  5. Varin, C.; Host, G.; Skare, O. Pairwise likelihood inference in spatial generalized linear mixed models. Comput. Stat. Data Anal. 2005, 49, 1173–1191.
  6. Henderson, R.; Shimakura, S. A serially correlated gamma frailty model for longitudinal count data. Biometrika 2003, 90, 355–366.
  7. Parner, E.T. A composite likelihood approach to multivariate survival data. Scand. J. Stat. 2001, 28, 295–302.
  8. Li, Y.; Lin, X. Semiparametric normal transformation models for spatially correlated survival data. J. Am. Stat. Assoc. 2006, 101, 593–603.
  9. Joe, H.; Reid, N.; Song, P.X.; Firth, D.; Varin, C. Composite Likelihood Methods. Report on the Workshop on Composite Likelihood, 2012. Available online: http://www.birs.ca/events/2012/5-day-workshops/12w5046 (accessed on 23 July 2019).
  10. Varin, C.; Reid, N.; Firth, D. An overview of composite likelihood methods. Stat. Sin. 2011, 21, 5–42.
  11. Martín, N.; Pardo, L.; Zografos, K. On divergence tests for composite hypotheses under composite likelihood. Stat. Pap. 2019, 60, 1883–1919.
  12. Castilla, E.; Martín, N.; Pardo, L.; Zografos, K. Composite likelihood methods based on minimum density power divergence estimator. Entropy 2018, 20, 18.
  13. Castilla, E.; Martín, N.; Pardo, L.; Zografos, K. Composite likelihood methods: Rao-type tests based on composite minimum density power divergence estimator. Stat. Pap. 2019.
  14. Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959.
  15. Akaike, H. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory; Petrov, B.N., Csaki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281.
  16. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  17. Takeuchi, K. Distribution of information statistics and criteria for adequacy of models. Math. Sci. 1976, 153, 12–18. (In Japanese)
  18. Murari, A.; Peluso, E.; Cianfrani, F.; Gaudio, P.; Lungaroni, M. On the use of entropy to improve model selection criteria. Entropy 2019, 21, 394.
  19. Mattheou, K.; Lee, S.; Karagrigoriou, A. A model selection criterion based on the BHHJ measure of divergence. J. Stat. Plan. Inference 2009, 139, 228–235.
  20. Avlogiaris, G.; Micheas, A.; Zografos, K. A criterion for local model selection. Sankhya A 2019, 81, 406–444.
  21. Avlogiaris, G.; Micheas, A.; Zografos, K. On local divergences between two probability measures. Metrika 2016, 79, 303–333.
  22. Varin, C.; Vidoni, P. A note on composite likelihood inference and model selection. Biometrika 2005, 92, 519–528.
  23. Gao, X.; Song, P.X.K. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. J. Am. Stat. Assoc. 2010, 105, 1531–1540.
  24. Ng, C.T.; Joe, H. Model comparison with composite likelihood information criteria. Bernoulli 2014, 20, 1738–1764.
  25. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimizing a density power divergence. Biometrika 1998, 85, 549–559.
  26. Pardo, L. Statistical Inference Based on Divergence Measures; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2006.
  27. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; Chapman & Hall/CRC: Boca Raton, FL, USA, 2011.
  28. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: New York, NY, USA, 2002.
  29. Xu, X.; Reid, N. On the robustness of maximum composite likelihood estimate. J. Stat. Plan. Inference 2011, 141, 3047–3054.
  30. Warwick, J.; Jones, M.C. Choosing a robustness tuning parameter. J. Stat. Comput. Simul. 2005, 75, 581–588.
  31. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugenics 1936, 7, 179–188.
  32. Fraley, C.; Raftery, A.E.; Murphy, T.B.; Scrucca, L. mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation; Technical Report 597; Department of Statistics, University of Washington: Seattle, WA, USA, 2012.
  33. Forina, M.; Lanteri, S.; Armanino, C.; Leardi, R. PARVUS: An Extendable Package of Programs for Data Exploration, Classification, and Correlation; Institute of Pharmaceutical and Food Analysis Technologies: Genoa, Italy, 1998.
Table 1. Main results, Scenario 1a: number of times (out of 1000 samples) that the 4-variate normal model was selected.

                     α = 0 (CAIC)                     |                α = 0.3
ω:        0  0.25  0.45  0.5   0.55  0.75  1          |  0  0.25  0.45  0.5   0.55  0.75  1

ν = 5, ρ = 0.15
n = 5     0    1   269   499   713   996   1000       |  0    0   273   498   712   1000  1000
n = 7     0    1   246   504   758   998   1000       |  0    1   220   511   738   999   1000
n = 10    0    0   202   482   775   1000  1000       |  0    0   185   467   771   1000  1000
n = 20    0    0   114   486   871   1000  1000       |  0    0   112   473   866   1000  1000
n = 40    0    0   41    459   947   1000  1000       |  0    0   54    496   954   1000  1000
n = 50    0    0   21    475   964   1000  1000       |  0    0   41    556   986   1000  1000
n = 70    0    0   9     461   985   1000  1000       |  0    0   48    656   995   1000  1000
n = 100   0    0   5     472   992   1000  1000       |  0    0   142   885   1000  1000  1000

ν = 10, ρ = 0.15
n = 5     0    3   222   445   688   996   1000       |  0    3   218   433   688   997   1000
n = 7     0    1   191   439   720   1000  1000       |  0    0   179   431   690   999   1000
n = 10    0    0   163   432   747   1000  1000       |  0    0   152   402   725   1000  1000
n = 20    0    0   59    399   819   1000  1000       |  0    0   49    361   773   1000  1000
n = 40    0    0   19    336   912   1000  1000       |  0    0   12    326   899   1000  1000
n = 50    0    0   6     362   936   1000  1000       |  0    0   10    334   925   1000  1000
n = 70    0    0   1     292   960   999   1000       |  0    0   2     356   973   1000  1000
n = 100   0    0   0     301   983   1000  1000       |  0    0   1     531   992   1000  1000

ν = 30, ρ = 0.15
n = 5     0    4   237   423   677   997   1000       |  0    2   235   413   656   996   1000
n = 7     0    0   155   394   689   1000  1000       |  0    0   141   379   677   999   1000
n = 10    0    0   144   413   719   1000  1000       |  0    0   134   393   701   1000  1000
n = 20    0    0   57    351   801   1000  1000       |  0    0   40    311   764   1000  1000
n = 40    0    0   11    296   904   1000  1000       |  0    0   8     263   882   1000  1000
n = 50    0    0   6     271   918   1000  1000       |  0    0   3     253   903   1000  1000
n = 70    0    0   1     225   942   1000  1000       |  0    0   0     229   941   1000  1000
n = 100   0    0   0     208   978   1000  1000       |  0    0   0     303   989   1000  1000

ν = 10, ρ = 0.10
n = 5     0    4   242   464   680   996   1000       |  0    3   238   459   682   999   1000
n = 7     0    0   187   461   733   997   1000       |  0    0   199   457   731   998   1000
n = 10    0    0   162   445   738   1000  1000       |  0    0   165   407   713   1000  1000
n = 20    0    0   62    378   807   1000  1000       |  0    0   59    354   789   1000  1000
n = 40    0    0   19    357   902   999   1000       |  0    0   14    333   895   1000  1000
n = 50    0    0   6     325   932   1000  1000       |  0    0   8     325   931   1000  1000
n = 70    0    0   2     305   954   1000  1000       |  0    0   6     367   967   1000  1000
n = 100   0    0   0     307   979   1000  1000       |  0    0   2     507   993   1000  1000

ν = 10, ρ = −0.10
n = 5     0    11  268   459   669   991   1000       |  1    11  268   478   680   993   1000
n = 7     0    1   211   456   720   999   1000       |  0    3   207   464   716   998   1000
n = 10    0    0   168   423   704   1000  1000       |  0    0   162   403   702   1000  1000
n = 20    0    0   86    360   789   1000  999        |  0    0   89    357   786   1000  1000
n = 40    0    0   35    367   893   1000  1000       |  0    0   38    398   896   1000  1000
n = 50    0    0   19    331   886   1000  1000       |  0    0   19    360   913   1000  1000
n = 70    0    0   11    311   933   1000  1000       |  0    0   16    379   963   1000  1000
n = 100   0    0   2     276   969   1000  1000       |  0    0   7     490   985   1000  1000
Table 2. Main results, Scenario 1b: number of times (out of 1000 samples) that the 4-variate normal model was selected.

           α = 0 (CAIC)         |     α = 0.2           |     α = 0.4
ω:        0   0.25  0.75  1     |  0   0.25  0.75  1    |  0   0.25  0.75  1
n = 40    0   0     39    731   |  0   0     537   961  |  0   0     580   949
n = 50    0   0     24    732   |  0   0     859   990  |  0   0     944   994
n = 60    0   0     14    772   |  0   0     999   1000 |  0   1     999   1000
n = 70    0   0     9     734   |  0   0     999   1000 |  0   27    999   1000
n = 80    0   0     5     770   |  0   1     1000  1000 |  0   326   1000  1000
n = 90    0   0     4     782   |  0   23    1000  1000 |  2   794   1000  1000
n = 100   0   0     4     802   |  0   173   1000  1000 |  26  978   1000  1000
Table 3. Main results, Scenario 2: number of times (out of 1000 samples) that each candidate model was selected.

           α = 0 (CAIC)     |  α = 0.3          |  α = 0.5          |  α = 0.7
Model:    N1    N2    MT    |  N1    N2    MT   |  N1    N2    MT   |  N1    N2    MT

True model: N(μ1N, Σ)
n = 5     957   24    19    |  950   16    34   |  939   23    38   |  936   28    36
n = 7     970   19    11    |  966   13    21   |  961   13    26   |  950   22    28
n = 10    993   3     4     |  986   4     10   |  979   6     15   |  971   6     23
n = 20    1000  0     0     |  1000  0     0    |  998   0     2    |  997   0     3
n = 40    1000  0     0     |  1000  0     0    |  1000  0     0    |  1000  0     0
n = 50    1000  0     0     |  1000  0     0    |  1000  0     0    |  1000  0     0
n = 70    1000  0     0     |  1000  0     0    |  1000  0     0    |  1000  0     0
n = 100   1000  0     0     |  1000  0     0    |  1000  0     0    |  999   0     1

True model: N(μ2N, Σ)
n = 5     29    638   333   |  34    610   356  |  38    639   323  |  50    646   304
n = 7     15    622   363   |  13    589   398  |  17    599   384  |  28    627   345
n = 10    6     610   384   |  5     540   455  |  5     540   455  |  11    586   403
n = 20    1     612   387   |  1     518   481  |  1     472   527  |  1     527   472
n = 40    0     566   434   |  0     650   350  |  0     590   410  |  0     614   386
n = 50    0     561   439   |  0     804   196  |  0     797   203  |  0     835   165
n = 70    0     584   416   |  0     987   13   |  0     994   6    |  0     998   2
n = 100   0     520   480   |  0     1000  0    |  0     1000  0    |  0     1000  0

True model: t(ν=10)(μt, Σ)
n = 5     2     15    983   |  1     6     993  |  1     8     991  |  3     15    982
n = 7     0     3     997   |  0     1     999  |  2     2     996  |  0     4     996
n = 10    0     1     999   |  0     2     998  |  0     2     998  |  0     3     997
n = 20    0     0     1000  |  0     0     1000 |  0     0     1000 |  0     0     1000
n = 40    0     0     1000  |  0     0     1000 |  0     0     1000 |  0     0     1000
n = 50    0     0     1000  |  0     0     1000 |  0     0     1000 |  0     0     1000
n = 70    0     0     1000  |  0     0     1000 |  0     0     1000 |  0     0     1000
n = 100   0     0     1000  |  0     0     1000 |  0     4     996  |  0     296   704

True model: 0.7 N(μ2N, Σ) + 0.3 t(ν=10)(μt, Σ)
n = 5     6     384   610   |  6     375   619  |  4     401   595  |  11    452   537
n = 7     1     331   668   |  1     294   705  |  1     317   682  |  1     373   626
n = 10    1     261   738   |  1     218   781  |  1     253   746  |  1     306   693
n = 20    0     109   891   |  0     101   899  |  0     107   893  |  0     141   859
n = 40    0     26    974   |  0     126   874  |  0     122   878  |  0     166   834
n = 50    0     13    987   |  0     311   689  |  0     345   655  |  0     445   555
n = 70    0     6     994   |  0     948   52   |  0     982   18   |  0     994   6
n = 100   0     2     998   |  0     1000  0    |  0     1000  0    |  0     999   1

True model: (1/3) N(μ1N, Σ) + (1/3) N(μ2N, Σ) + (1/3) t(ν=10)(μt, Σ)
n = 5     127   377   496   |  121   363   516  |  107   392   501  |  107   424   469
n = 7     87    357   556   |  70    339   591  |  66    356   578  |  63    396   541
n = 10    69    326   605   |  61    314   625  |  56    330   614  |  45    381   574
n = 20    37    259   704   |  25    298   677  |  17    337   646  |  15    349   636
n = 40    7     145   848   |  9     452   539  |  4     508   488  |  1     469   530
n = 50    2     122   876   |  5     744   251  |  3     814   183  |  3     853   144
n = 70    0     99    901   |  4     996   0    |  4     996   0    |  4     996   0
n = 100   0     36    964   |  355   645   0    |  645   355   0    |  856   144   0

Here the model candidates are expressed as N1, N2, MT to denote N(μ1N, Σ), N(μ2N, Σ) and t10(μt, Σ), respectively.
Table 4. Selected model in each of the subsets. Iris data.

α          SE   VE   VI   SE(VE)  SE(VI)  VE(SE)  VE(VI)  VI(SE)  VI(VE)  VI(SE+VE)
0 (CAIC)   CN1  CN2  CN3  CN1     CN1     CN1     CN2     CN1     CN3     CN3
0.2        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.3        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.4        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN1     CN3     CN3
0.5        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN1     CN3     CN3
0.8        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN1     CN3     CN3
0.22       CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
Table 5. Selected α for different pilot estimators, ad hoc tuning-parameter selection procedure. Iris and Wine data.

α (pilot)       0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
Iris α_opt      0.31  0.17  0.20  0.21  0.22  0.23  0.24  0.24  0.25  0.25  0.25
Wine α_opt      0.45  0.46  0.47  0.49  0.51  0.53  0.55  0.56  0.56  0.56  0.57
Table 6. Selected model in each of the subsets. Wine data.

α          BO   GR   BA   BO(GR)  BO(BA)  GR(BO)  GR(BA)  BA(BO)  BA(GR)  BA(BO+GR)
0 (CAIC)   CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN2
0.2        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.3        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.4        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.5        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
0.8        CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN2     CN2     CN3
0.51       CN1  CN2  CN3  CN1     CN1     CN2     CN2     CN3     CN3     CN3
