Article

Selection Criteria for Overlapping Binary Models—A Simulation Study

by Teresa Aparicio and Inmaculada Villanúa *
Department of Economic Analysis, University of Zaragoza, Gran Vía, 2, 50005 Zaragoza, Spain
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(3), 478; https://doi.org/10.3390/math10030478
Submission received: 28 December 2021 / Revised: 25 January 2022 / Accepted: 29 January 2022 / Published: 2 February 2022
(This article belongs to the Special Issue Statistical Methods in Economics)

Abstract: This paper deals with the problem of choosing the optimum criterion for selecting the best model out of a set of overlapping binary models. The criteria we studied were the well-known AIC and SBIC, and a third one called $C_2$. Special attention was paid to the setting where neither of the competing models was correctly specified. This situation has not been studied very much, but it is the most common case in empirical works. The theoretical study we carried out allowed us to conclude that, in general terms, all criteria perform well. A Monte Carlo exercise corroborated those results.

1. Introduction

This work focused on the analysis of model selection criteria within the framework of binary choice models (BCM), where the endogenous variable $Y_i$ is binary, representing the choice of decision-maker $i$ between two options quantified by the values 1 and 0. These models are usually expressed as $p_i = F(x_i'\beta)$, with $F$ the cumulative distribution function (c.d.f.), $x_i$ the vector of regressors and $p_i$ the probability that $Y_i = 1$. The c.d.f. can be normal or logistic, leading to a probit model or a logit model, respectively. Although the common analysis procedure for these models is to apply the maximum likelihood estimation (MLE) method, they can also be implemented in a Bayesian framework using Gibbs sampling Markov Chain Monte Carlo (MCMC) methods [1,2]. Nevertheless, in this work, we considered the conventional context; thus, the MLE procedure was used.
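To make the setting concrete, here is a minimal sketch (our own illustration, not taken from the paper; the data and names are hypothetical) of fitting a probit model by MLE with NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical sample: intercept plus two uniform regressors.
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(size=N), rng.uniform(size=N)])
beta_true = np.array([-2.0, 3.0, 1.0])
y = (X @ beta_true + rng.standard_normal(N) > 0).astype(float)

def neg_loglik(beta, y, X):
    """Negative binary log-likelihood; norm.cdf gives a probit, a logistic c.d.f. would give a logit."""
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_loglik, np.zeros(X.shape[1]), args=(y, X), method="BFGS").x
```

The same helper is reused in the later sketches, replacing `norm.cdf` with a logistic c.d.f. whenever a logit model is wanted.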
This paper compares several models in order to select the “best” of them. In a general context, and following [3], the compared models can be nested, overlapping or non-nested. In the specific framework of BCM, two binary models are nested if they possess the same c.d.f. (both probit or both logit) and the regressors of one of the models are included in the other one. Two binary models are overlapping if both possess the same c.d.f. (both probit or both logit), with some common explanatory variables and other specific variables. Finally, the compared models are non-nested if they possess only specific regressors. Moreover, the models are also non-nested when they possess different c.d.f.s (probit versus logit), even if there are some common regressors. Many works define nested and non-nested models and describe how to proceed in each situation [4,5,6], while overlapping models have been analysed the least. In this paper, we compared overlapping models, and we found either that they were equivalent or that one of them was better than the other (non-equivalent).
Although hypothesis testing procedures (HTP) are widely used to discriminate between models, we can only use them to choose between pairs of models. In comparison, selection criteria allowed us to select the best model from quite a large set. This is an important advantage in empirical econometric works. The latter approach allows researchers to express their objectives in the form of a loss function, or by using the discrepancy concept. As [7] established, the discrepancy concept is a particular case of loss function. For non-linear regression models (our framework), the procedures developed by [3,8,9] belong to the first category (HTP). The second category involves the well-known AIC [10] and SBIC [11], where the discrepancy was obtained from the Kullback–Leibler distance. Additionally, the use of the mean square error (MSE) of prediction as a discrepancy enabled us to derive another criterion, denoted as $C_2$ (see [12]).
Many works have studied the behaviour of selection procedures in linear regression models. However, this subject has been less analysed in a non-linear regression context, and the nested framework is nearly always assumed [13,14,15]. The performance of some selection procedures has also been studied in phylogenetics, where partitioned models were used [16,17,18,19]. Specific references for discrete choice models are [12] for nested models, and [20] for non-nested models.
In this paper, the competing models we selected from were overlapping models. The purpose was to investigate the discriminatory power of certain model selection criteria assuming two situations: (i) at least one of the models was correctly specified; (ii) neither of the models was correctly specified. According to [21], a well-specified model can include irrelevant variables together with the set of regressors of the data generating process (DGP). In our opinion, situation (ii) is the most interesting in practice but the least studied in the literature. Given that, in this case, no model was well-specified, we could not consider consistency as the condition that makes a given selection criterion adequate. The requirement we proposed is that the criterion selects the closest model to the DGP.
The article is organised as follows. In Section 2, we establish the general context and the methodology. Section 3 is dedicated to studying the theoretical behaviour of the criteria. Section 4 presents and discusses the results from a Monte Carlo experiment. Conclusions are presented in Section 5.

2. Materials and Methods

Consider the following DGP:
$M_0: \quad p_i = F(x_i'\beta^0) = F(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) \qquad (i = 1, \ldots, N) \qquad (1)$
and a pair of overlapping models which, in general terms, are defined as follows:
$M_1: \; p_i = F(a_i'\gamma) \qquad M_2: \; p_i = F(b_i'\delta) \qquad (i = 1, \ldots, N) \qquad (2)$
where $F(\cdot)$ is the cumulative distribution function (c.d.f.), which can be normal or logistic, leading to the probit model or the logit model, respectively. The two competing models have the same c.d.f.; $a_i$ and $b_i$ are the $1 \times k_1$ and $1 \times k_2$ vectors of explanatory variables of M1 and M2 for the $i$-th observation, and $\gamma$ and $\delta$ are the corresponding parameter vectors. Given the definition of overlapping models, $a \not\subset b$ and $b \not\subset a$ are satisfied, and both vectors share some common variables.
In order to describe the relationship of each of the competing models with the true model (DGP) we used the Kullback–Leibler distance (KLIC) to the DGP:
$\mathrm{KLIC}(M_0, M_j) = E_0\left[ \ln \frac{f_0}{f_j} \right] \qquad (j = 1, 2) \qquad (3)$
where $f_0$ is the density function of the DGP and $f_j$ the one corresponding to model $M_j$.
From (3), we can write:
$\mathrm{KLIC}(M_0, M_1) - \mathrm{KLIC}(M_0, M_2) = E_0[\ln f(y \mid b, \delta^*)] - E_0[\ln f(y \mid a, \gamma^*)] = E_0[\ell_2^*] - E_0[\ell_1^*] \qquad (4)$
where $\gamma^*$ and $\delta^*$ are the corresponding pseudo-true parameter vectors (see [22]) and $\ell_j^*$ denotes the log-likelihood of model $M_j$ evaluated at them.
It is well known that if this statistic (expression (4)) is positive, then M2 is the preferred model, M1 being preferred if (4) is negative. If it is null, the two models are equivalent.
Given the DGP of (1), and following [21], any model which is correctly specified can be written as:
$p_i = F\left( \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \sum_{j=3}^{k} \gamma_j d_{ji} \right)$
where the $d_j$ are additional regressors, including the particular case in which no $d_j$ is present.
It is worth noting that, for each competing model, the maximum likelihood estimation of the parameter vectors satisfies:
$\hat{\gamma} \stackrel{p}{\longrightarrow} \gamma^* \qquad \hat{\delta} \stackrel{p}{\longrightarrow} \delta^* \qquad (5)$
We can distinguish two cases: the case where at least one of the competing models was correctly specified, and the case where neither of them was correct.

2.1. Case 1: At Least One of the Competing Models Is Correctly Specified

Then, the situations we considered are:
  • Case 1.1: Both models were well-specified (or both models included the DGP):
    $M_1: \; p_i = F\left( \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \sum_{j=3}^{k_1} \gamma_j d_{ji} \right) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{1i} + \delta_2 x_{2i} + \delta_3 z_i) \qquad (6)$
    with $z \neq d_j \; \forall j$.
  • Case 1.2: Only one of them was well-specified (or only one of them included the DGP):
$M_1: \; p_i = F\left( \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \sum_{j=3}^{k_1} \gamma_j d_{ji} \right) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{1i} + \delta_2 z_i) \qquad (7)$
Let $\beta^{0+}$ be the parameter vector extended with elements equal to zero in the places corresponding to the variables that are not included in the DGP, that is, $\beta^{0+} = (\beta^0 \mid 0)$. From the convergence result (5), and according to [23], in case 1.1 the equality $\gamma^* = \delta^* = \beta^{0+}$ holds, implying that both models are equivalent. However, in case 1.2 $\gamma^* = \beta^{0+}$ but $\delta^* \neq \beta^{0+}$, M1 being better than M2.
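To see the convergence result (5) and the case 1.1/1.2 distinction numerically, one can approximate the pseudo-true vectors by estimating each competing model on a very large sample drawn from the DGP. The sketch below is our own illustration (it reuses the `neg_loglik` helper from the earlier probit sketch; the DGP parameters and regressor distributions are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize
# assumes neg_loglik from the earlier probit sketch is available

rng = np.random.default_rng(1)
N_big = 200_000                                   # large N, so the MLEs sit close to the pseudo-true values
x1, x2 = rng.uniform(size=N_big), rng.uniform(size=N_big)
d = rng.normal(3, 1, size=N_big)                  # irrelevant regressor included in M1
z = rng.chisquare(1, size=N_big)                  # regressor of M2 that replaces x2
y = (-2 + 3 * x1 + x2 + rng.standard_normal(N_big) > 0).astype(float)   # probit DGP as in (1)

A = np.column_stack([np.ones(N_big), x1, x2, d])  # M1: correctly specified (case 1.2)
B = np.column_stack([np.ones(N_big), x1, z])      # M2: misspecified
gamma_star = minimize(neg_loglik, np.zeros(4), args=(y, A), method="BFGS").x
delta_star = minimize(neg_loglik, np.zeros(3), args=(y, B), method="BFGS").x
# gamma_star should be close to (-2, 3, 1, 0), i.e. beta0 padded with a zero,
# while delta_star settles on a pseudo-true vector different from beta0+.
```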

2.2. Case 2: Neither of the Models Is Correctly Specified (or Neither of the Models Includes the DGP)

In this situation the compared models are:
$M_1: \; p_i = F(\gamma_0 + \gamma_1 x_{1i} + \gamma_2 w_i) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{2i} + \delta_2 w_i) \qquad (8)$
We again used the convergence result (5) to conclude that, in this case, $\gamma^* \neq \beta^{0+}$ and $\delta^* \neq \beta^{0+}$, so the competing models may or may not be equivalent. Specifically, according to [3], there are two possible situations:
  • Case 2.1: $f(y_i; \gamma^*, a_i) = f(y_i; \delta^*, b_i)$, that is, the density functions of $y_i$ in M1 and M2, evaluated at the corresponding pseudo-true parameter vectors, were observationally identical. It implies that M1 and M2 are equivalent specifications.
  • Case 2.2: $f(y_i; \gamma^*, a_i) \neq f(y_i; \delta^*, b_i)$. In this situation the models can be:
    (a) Equivalent, which means that $E_0[\ell_1^*] = E_0[\ell_2^*]$.
    (b) Non-equivalent, that is, $E_0[\ell_1^*] \neq E_0[\ell_2^*]$.
Now we present the selection criteria whose behaviour is the object of this paper: the well-known information criteria (IC) of Akaike (AIC) and Schwarz (SBIC), and another criterion we call $C_2$. To obtain them, we adopted the discrepancy concept (see [7]). As we can read in [12], “A discrepancy measures the lack of fit between the proposed model and the DGP, in the aspect which the researcher considers the most relevant”. The discrepancy for model M1 can then be written as $\Delta(F_1, F_0)$, and we wish to minimize the “overall discrepancy”, expressed as $\Delta(F_{\hat{\gamma}}, F_0)$, or equivalently $\Delta(\hat{\gamma})$, with $F_{\hat{\gamma}}$ the estimated model M1 (that is, $\hat{p}_i = F(a_i'\hat{\gamma})$). The estimation of the expected overall discrepancy, $\hat{E}_0[\Delta(\hat{\gamma})]$, constitutes the selection criterion. Details about this procedure can be found in [7,12]. We consider two discrepancies, called $\Delta_1$ and $\Delta_2$. The first one is the Kullback–Leibler distance: for a model $M_j$ it is expressed as $\Delta_1(F_j, F_0) = \mathrm{KLIC}(M_j, M_0)$, and it leads to the information criteria (IC) AIC and SBIC:
$IC(M_j) = -\frac{\hat{\ell}_j}{N} + \frac{K_N(M_j)}{N} \qquad (9)$
where $\hat{\ell}_j$ $(j = 1, 2)$ denotes the log-likelihood of model $M_j$, evaluated at the corresponding vector of estimates, $k_j$ is the number of parameters of $M_j$, and $K_N(M_j)$ is the correction factor ($k_j$ for AIC and $k_j \log N / 2$ for SBIC).
The second discrepancy is the mean square error (MSE) of prediction. For model M1, this discrepancy is $\Delta_2(F_1, F_0) = E_0\left[ (Y_{N+1} - F(a_{N+1}'\gamma))^2 \right]$, with “$N+1$” indicating an out-of-sample observation. For any model $M_j$, and following the previously mentioned procedure, the expression of the criterion is:
$C_2(M_j) = \frac{SSD_j}{N} \left( 1 + \frac{2 k_j}{N} \right) \qquad (10)$
where $SSD_j = \sum_{i=1}^{N} (Y_i - \hat{F}_{ji})^2$ is the sum of squared differences between the binary variable and the probability estimated with model $M_j$. The proof of (10) is developed in [24].
The model with the lowest value of the criterion was chosen, so different criteria could lead to different choices. Nevertheless, we were interested in analysing whether the criteria worked well, that is, whether the selection was correct in the sense defined in the following section.
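As a practical illustration (ours, not part of the original paper), the following sketch computes the three criteria for an estimated probit model, directly following expressions (9) and (10); the helper name `criteria` and the numerical clipping constant are our own choices:

```python
import numpy as np
from scipy.stats import norm

def criteria(y, X, beta_hat):
    """AIC and SBIC per expression (9), and C2 per expression (10), for an estimated probit model."""
    N, k = X.shape
    p_hat = np.clip(norm.cdf(X @ beta_hat), 1e-10, 1 - 1e-10)   # estimated probabilities F_ji
    loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    aic = -loglik / N + k / N                      # correction factor K_N(Mj) = k_j
    sbic = -loglik / N + k * np.log(N) / (2 * N)   # correction factor K_N(Mj) = k_j log(N)/2
    ssd = np.sum((y - p_hat) ** 2)                 # SSD_j
    c2 = ssd / N * (1 + 2 * k / N)
    return aic, sbic, c2
```

For two competing overlapping models one would call `criteria` with each design matrix and estimate, and keep the model with the lowest value of the chosen criterion.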

3. Theoretical Results

In this section, we study the theoretical behaviour of the criteria in order to determine whether they perform well. All proofs of the results presented in this section can be seen, in detail, in [24].
We carry out an asymptotic analysis, which requires the initial assumptions, results and definitions that we state below.
Assumption 1.
The $x_i$, $a_i$ and $b_i$ regressor vectors of the models specified in (1) and (2) are non-stochastic. The variables of these vectors have sample means and variances with finite limits.
Lemma 1.
Let $y_i$ be a variable which is not i.i.d. but heterogeneous (non-identical means and non-identical variances). Then:
$N^{-1} \sum_{i=1}^{N} a(y_i, \tilde{\theta}) \stackrel{p}{\longrightarrow} E\left[ \frac{1}{N} \sum_{i=1}^{N} a(y_i, \theta_0) \right] \qquad (11)$
Proof. 
The proof of (11) is based on the law of large numbers for heterogeneous variables, together with a lemma of [25].
The law of large numbers for heterogeneous variables is expressed in the following terms [26]: “Let the sequence $\{y_i - \mu_i\}$ be independent with $E(y_i - \mu_i) = 0$. If $E|y_i - \mu_i|^{1+\delta} \leq B < \infty$ for all $i$, with $\delta > 0$, then $\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu_i) \stackrel{p}{\longrightarrow} 0$.”
The lemma of [25] (p. 2156) is expressed as follows: “If $z_i$ is i.i.d., $a(z, \theta)$ is continuous at $\theta_0$ with probability one, and there is a neighbourhood $\Gamma$ of $\theta_0$ such that $E[\sup_{\theta \in \Gamma} \| a(z, \theta) \|] < \infty$, then for any $\tilde{\theta} \stackrel{p}{\longrightarrow} \theta_0$, $N^{-1} \sum_{i=1}^{N} a(z_i, \tilde{\theta}) \stackrel{p}{\longrightarrow} E[a(z, \theta_0)]$.” This lemma, together with the law of large numbers, allows us to write (11). □
Definition 1.
M1 and M2 are equivalent models if
$E_0 \left[ \log \frac{f(y_i; a_i, \gamma^*)}{f(y_i; b_i, \delta^*)} \right] = 0 \qquad (12)$
which leads to:
$F_{0i} \log \frac{F_{1i}}{F_{2i}} + (1 - F_{0i}) \log \frac{1 - F_{1i}}{1 - F_{2i}} = 0 \qquad (13)$
where $F_{0i} = F(x_i'\beta^0)$, $F_{1i} = F(a_i'\gamma^*)$ and $F_{2i} = F(b_i'\delta^*)$.
Definition 2.
M1 is closer to the DGP than M2 if:
$E_0 \left[ \log \frac{f(y_i; a_i, \gamma^*)}{f(y_i; b_i, \delta^*)} \right] > 0 \qquad (14)$
which leads to:
$F_{0i} \log \frac{F_{1i}}{F_{2i}} + (1 - F_{0i}) \log \frac{1 - F_{1i}}{1 - F_{2i}} > 0 \qquad (15)$
Definition 3.
Let $R(\cdot)$ be a model selection criterion. It is said that $R(\cdot)$ is adequate when:
(i) If M1 and M2 are equivalent, $\mathrm{plim}[R(M_1)] = \mathrm{plim}[R(M_2)]$.
(ii) If M1 is closer than M2 to the DGP, then $\mathrm{plim}[R(M_1)] < \mathrm{plim}[R(M_2)]$.
Now, for every case stated in the previous section, we must prove whether the definition of an “adequate criterion” is satisfied.
Result 1.
The IC criteria behave well in all settings.
Proof. 
The basic tool for achieving this result is the comparison of Definitions 1 and 2 with Definition 3. In this sense, Definitions 1 and 2 establish the condition that must be met when the compared models are equivalent or non-equivalent, respectively. On the other hand, Definition 3 tells us the requirements for determining if a specific criterion is adequate in each context (of equivalence or not).
Expression (9) can be written as:
$IC(M_j) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{F}_{ji} + (1 - y_i) \log (1 - \hat{F}_{ji}) \right] + \frac{K_N(M_j)}{N} \qquad (j = 1, 2) \qquad (16)$
with $\hat{F}_{1i} = F(a_i'\hat{\gamma})$ and $\hat{F}_{2i} = F(b_i'\hat{\delta})$.
Using Lemma 1 and the convergences given in (5) for the first term, we obtain:
$\frac{\hat{\ell}_j}{N} \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} E_0 \left[ y_i \log F_{ji} + (1 - y_i) \log (1 - F_{ji}) \right] \qquad (j = 1, 2) \qquad (17)$
The correction factor $K_N(M_j)/N$ converges to zero for every model.
When the competing models are equivalent (cases 1.1, 2.1 and 2.2.(a)), the IC criteria will be adequate if equality of Definition 3 (i) holds, which, using (17), leads to the following expression:
$\frac{1}{N} \sum_{i=1}^{N} \left[ F_{0i} \log \frac{F_{1i}}{F_{2i}} + (1 - F_{0i}) \log \frac{1 - F_{1i}}{1 - F_{2i}} \right] = 0 \qquad (18)$
This result is always satisfied, given Definition 1, so we can say that the IC criteria performed well.
When the models are non-equivalent, and assuming M1 is always better than M2 (cases 1.2 and 2.2.(b)), the IC criteria will be adequate if the inequality of Definition 3 (ii) holds, which, using (17), leads to:
$\frac{1}{N} \sum_{i=1}^{N} \left[ F_{0i} \log \frac{F_{1i}}{F_{2i}} + (1 - F_{0i}) \log \frac{1 - F_{1i}}{1 - F_{2i}} \right] > 0 \qquad (19)$
this result being identical to Definition 2. Thus, the IC criteria performed well. □
It should be noted that $K_N(M_j)/N$ converges to zero faster for the model with the lower $k_j$. Additionally, expression (19) shows that AIC and SBIC are asymptotically identical, which is not strictly true in finite samples when $k_1 \neq k_2$. In this situation, for a given pair of competing models, the difference between the two criteria is due to the different rates of convergence to zero of $AIC(M_1) - AIC(M_2)$ and $SBIC(M_1) - SBIC(M_2)$. This difference is caused by the correction factor.
Specifically, we can write:
$\frac{SBIC(M_1) - SBIC(M_2)}{AIC(M_1) - AIC(M_2)} = \frac{O\left( \frac{\log N}{N} \right)}{O\left( \frac{1}{N} \right)} = O(\log N) \qquad (20)$
which means that, as N increases, the gap between the convergence rates of the numerator and the denominator of expression (20) becomes larger. This implies that SBIC will tend toward one of the models more strongly than AIC. Which model? It is evident that, if $k_1 > k_2$, the tendency will be toward M2, given that both AIC and SBIC select the model with the lower value of the criterion. It is important to remark that $\mathrm{plim}[IC(M_1) - IC(M_2)] = 0$ does not contradict a stronger tendency toward the more parsimonious model, given that both models are equivalent.
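As a rough numerical illustration (ours, not the authors'): with $k_1 - k_2 = 1$, the penalty difference is $\log N/(2N) \approx 0.0132$ for SBIC against $1/N = 0.005$ for AIC when $N = 200$, and about $0.0019$ against $0.0005$ when $N = 2000$, so the ratio $\log N/2$ grows from roughly 2.6 to 3.8.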
Result 2.
The $C_2$ criterion is adequate, except in a specific situation.
Proof. 
Expression (10) can be written as:
$C_2(M_j) = \frac{\sum_{i=1}^{N} (y_i - \hat{F}_{ji})^2}{N} \left( 1 + \frac{2 k_j}{N} \right) \qquad (j = 1, 2) \qquad (21)$
Applying Lemma 1 together with convergences (5) we obtain:
$\frac{SSD_j}{N} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{F}_{ji})^2 \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} E_0 \left[ (y_i - F_{ji})^2 \right] \qquad (j = 1, 2) \qquad (22)$
Additionally, the term $\left( 1 + \frac{2 k_j}{N} \right)$ ($j = 1, 2$) converges to 1 when $N \to \infty$.
When the compared models are equivalent and well-specified (case 1.1), the probability limit (22) is the same for both models. This implies that Definition 3 (i) is satisfied; in other words, the $C_2$ criterion performed well. Note that the convergence rate differs between M1 and M2 when $k_1 \neq k_2$: the term $2 k_j / N$ is $O(1/N)$ and converges to zero faster for the model with the lower $k_j$.
When the compared models are non-equivalent and only M1 is well-specified (case 1.2), the probability limit (22) is different for each model:
$\frac{SSD_1}{N} \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} F_{0i} (1 - F_{0i}) = h_1 \qquad (23)$
$\frac{SSD_2}{N} \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} F_{0i} (1 - F_{0i}) + \frac{1}{N} \sum_{i=1}^{N} (F_{0i} - F_{2i})^2 = h_1 + h_2 \qquad (24)$
where $h_1$ and $h_2$ are positive terms. It is straightforward to see that Definition 3 (ii) is satisfied, so the $C_2$ criterion performed adequately.
If neither of the competing models is correctly specified (case 2), the probability limits of (22) for each model can be written as:
$\frac{SSD_1}{N} \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} F_{0i} (1 - F_{0i}) + \frac{1}{N} \sum_{i=1}^{N} (F_{0i} - F_{1i})^2 = h_1 + h_3 \qquad (25)$
$\frac{SSD_2}{N} \stackrel{p}{\longrightarrow} \frac{1}{N} \sum_{i=1}^{N} F_{0i} (1 - F_{0i}) + \frac{1}{N} \sum_{i=1}^{N} (F_{0i} - F_{2i})^2 = h_1 + h_2 \qquad (26)$
with $h_i$ ($i = 1, 2, 3$) being positive constants.
Now, the final conclusions depend on the relationship between the density functions. Then, in Case 2.1, where the density functions were observationally identical (equivalent models), $F_{1i} = F_{2i}$ is satisfied. It implies that $h_3 = h_2$, so Definition 3 is verified, and the $C_2$ criterion behaved well.
In Case 2.2.(a), with non-observationally identical density functions but equivalent models, the only possibility for achieving $h_3 = h_2$ is that, on average, $(F_{0i} - F_{1i})^2 = (F_{0i} - F_{2i})^2$ or, equivalently, $2 F_{0i} = F_{1i} + F_{2i}$. Therefore, there can be empirical settings where the criterion $C_2$ does not behave well. The Monte Carlo experiment will allow us a more specific analysis of the behaviour of the criterion.
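For instance (our own numerical illustration), with $F_{0i} = 0.5$, $F_{1i} = 0.3$ and $F_{2i} = 0.7$, both squared deviations equal $0.04$, so such an observation contributes identically to $h_3$ and $h_2$ even though $F_{1i} \neq F_{2i}$.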
Finally, in Case 2.2.(b), where the competing models were non-equivalent, we assumed that M1 was better than M2. In order to study the power of the criterion, we applied a strategy similar to that used in the IC criteria. That is, we related Definitions 2 and 3 (ii). Definition 2 can be written as:
$\left( \frac{F_{1i}}{F_{2i}} \right)^{F_{0i}} > \left( \frac{1 - F_{2i}}{1 - F_{1i}} \right)^{1 - F_{0i}} \qquad (27)$
We wanted to find the combinations of $F_{0i}$, $F_{1i}$ and $F_{2i}$ that satisfy (27). The results we obtained are summarized in Table 1.
Definition 3 (ii) establishes that the $C_2$ criterion behaves well if $h_3 < h_2$, that is to say:
$\frac{1}{N} \sum_{i=1}^{N} (F_{1i} - F_{2i}) (F_{1i} + F_{2i} - 2 F_{0i}) < 0 \qquad (28)$
For every combination presented in Table 1, we get the previous result, so the $C_2$ criterion is adequate. □

4. Simulation Study and Discussion

The objective of the Monte Carlo experiments is twofold: to confirm the theoretical results and to assess the performance of all the criteria in finite samples.
The generation of the binary variable $y_i$ is based on the latent linear model that underlies any binary model:
$y_i^* = x_i'\beta + u_i \qquad (29)$
where $y_i^*$ is a latent (unobservable) variable which generates $y_i$ through:
$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0 \end{cases} \qquad (30)$
Under the assumption established in Section 3, and following the procedure of [27], we obtained the values of $y_i$. We considered different sets of parameter values and different kinds of explanatory variables (continuous and dummy), and the standard normal distribution function was chosen for the error term, implying an exclusive focus on probit models. Two sample sizes, N = 200 and N = 2000, were used, and we carried out 500 replications for each experiment. Additionally, the intercept was fixed at a value of −2 in order to avoid an unbalanced proportion of ones in the sample of $y_i$, which would lead to problems when estimating and interpreting results.
In each of the 500 replications, we estimated M1 and M2 and calculated the value of the IC and $C_2$ criteria. The corresponding tables for every experiment show the number of times that each criterion selected M1. Note that we only present the tables for N = 2000 and comment on the differences with respect to N = 200 where such differences exist. In all cases, the DGP is $p_i = F(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})$.
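A condensed sketch of one such replication loop is shown below (our own illustration, not the authors' code; it reuses the `neg_loglik` and `criteria` helpers defined in the earlier sketches and takes the Case 1.2 layout of Section 4.2 as an example, with hypothetical regressor choices):

```python
import numpy as np
from scipy.optimize import minimize
# assumes neg_loglik and criteria from the earlier sketches are available

def one_experiment(beta0, n=2000, reps=500, seed=123):
    """Counts how many times (out of reps) AIC, SBIC and C2 select M1 over M2."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(3, dtype=int)
    for _ in range(reps):
        x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
        w, z = rng.normal(3, 1, size=n), rng.chisquare(1, size=n)
        y = (beta0[0] + beta0[1] * x1 + beta0[2] * x2 + rng.standard_normal(n) > 0).astype(float)
        A = np.column_stack([np.ones(n), x1, x2, w])      # M1: well-specified plus irrelevant w
        B = np.column_stack([np.ones(n), x1, z])          # M2: misspecified
        g = minimize(neg_loglik, np.zeros(A.shape[1]), args=(y, A), method="BFGS").x
        d = minimize(neg_loglik, np.zeros(B.shape[1]), args=(y, B), method="BFGS").x
        cm1, cm2 = criteria(y, A, g), criteria(y, B, d)
        wins += np.array([cm1[j] < cm2[j] for j in range(3)], dtype=int)
    return wins    # e.g. one_experiment(np.array([-2.0, 3.0, 1.0]))
```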

4.1. Monte Carlo Exercise When Both Models Are Correctly Specified (Case 1.1)

We consider the following well-specified models M1 and M2:
$M_1: \; p_i = F(\gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \gamma_3 w_i + \gamma_4 s_i) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{1i} + \delta_2 x_{2i} + \delta_3 z_i)$
Firstly, we assumed $\gamma_4 = 0$, so we chose between models with the same number of parameters and, afterwards, we assumed $\gamma_4 \neq 0$. These settings are called A and B, respectively; the results are presented in Table 2.
For setting A, we can see that the number presented in each cell was around 250 (50% of the 500 replications), which is the correct behaviour for equivalent models. However, in setting B, where M1 had more irrelevant regressors than M2, the criteria tended to select the model with fewer parameters, that is, the more parsimonious model (M2), and this tendency grew with the sample size. This behaviour was also correct, given that both specifications were equivalent. Specifically, the most parsimonious criterion is SBIC, so the Monte Carlo exercise corroborated this theoretical aspect of the previous section. Additionally, we observed that neither the kind of variables nor the set of values of the DGP parameter vector seemed to affect the behaviour of the criteria.

4.2. Monte Carlo Exercise When Only One of the Models Is Correctly Specified (Case 1.2)

In these experiments, M1 and M2 are expressed as follows:
$M_1: \; p_i = F(\gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \gamma_3 w_i) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{1i} + \delta_2 z_i)$
The theoretical results were corroborated, so all the criteria tended to select M1 for whatever kind of explanatory variables. The corresponding table has been omitted, given that the value in all cells was 500.
However, for N = 200 the results were not so evident, although they tended towards adequate behaviour. Specifically, differences were found when the variables x 1 and x 2 were both uniform and the weight of x 2 was not greater than that of x 1 ; this difference was more evident for the SBIC criterion.

4.3. Monte Carlo Experiment When Neither of the Models Is Correctly Specified (Case 2)

We needed to analyse each of the situations defined in Case 2 of Section 2 separately. The two compared models are:
$M_1: \; p_i = F(\gamma_0 + \gamma_1 x_{1i} + \gamma_2 w_i) \qquad M_2: \; p_i = F(\delta_0 + \delta_1 x_{2i} + \delta_2 w_i)$
The implementation of the experiments for cases 2.1 and 2.2.(a) required using the relationship between the true parameter vector ($\beta^0$) and each of the pseudo-true parameter vectors ($\gamma^*$ and $\delta^*$). In case 2.1, we should have been able to obtain from that relationship the value of $\beta^0$ satisfying the equality of the density functions. In other terms, if $\gamma^* = m_1(\beta^0)$ and $\delta^* = m_2(\beta^0)$, we were interested in obtaining the $\beta^0$ that makes $f(Y_i; m_1(\beta^0), a_i) = f(Y_i; m_2(\beta^0), b_i)$. The same idea could be used in 2.2.(a) in order to make $E_0[\ell_1^*]$ and $E_0[\ell_2^*]$ equal, that is, $E_0[\ell_1^*(m_1(\beta^0))] = E_0[\ell_2^*(m_2(\beta^0))]$. Nevertheless, the non-linear equation system that we needed to solve presented insurmountable problems. Given that we could not obtain the exact relationship, we approximated the equalities of densities and likelihoods. To this end, we generated several DGPs, modifying the value of the parameter vector $\beta^0$ and the kind of explanatory variables. Again, the intercept was fixed at a value of −2, while $\beta_1$ and $\beta_2$ took values in the range (−2, 2) in steps of 0.5; additionally, they took the values 3, 4, 5 and 7. As a result of this strategy, we generated 132 different DGPs. Each of the outlined DGPs led to a specific relationship between models M1 and M2: equivalent models (with identical or non-identical densities) or non-equivalent models.
In order to classify the 132 experiments into the two categories, we used the following indicators:
$absel = \frac{1}{N} \sum_{i=1}^{N} |del_i|$
$AM1 = $ number of times (out of the $N$ observations) that $del_i > 0$
$absdifden = \frac{1}{N} \sum_{i=1}^{N} |difden_i|$
with $del_i = E_{1i} - E_{2i}$ and $difden_i = f(y_i; \gamma^*, x_1, x_2) - f(y_i; \delta^*, x_1, x_2)$, where $E_{ji}$ denotes the expected log-likelihood of observation $i$ in model $M_j$, evaluated at the pseudo-true parameters.
Firstly, we classified the experiments into “containing equivalent models”, or “containing nonequivalent models”. Secondly, in the first group, we distinguished identical from non-identical densities. Finally, we classified the non-equivalent models depending on their closeness to the DGP. The following three stages were carried out:
Step 1.
The two requirements for considering the models as equivalents are:
(R.1)
A value of absel close to zero.
(R.2)
A value of AM1 close to N / 2 .
Taking into account absel, two models will tend to be equivalent if, for most of the observations, $del_i \approx 0$, which should lead to $absel \approx 0$. Could we have used only this measure to affirm that the models were equivalent? The answer is no, because we could find $absel \approx 0$ but with most of the observations satisfying $E_{1i} > E_{2i}$, which means that M1 was closer to the DGP. Using AM1 instead of absel, two models will tend to be equivalent if $AM1 \approx N/2$. Could we have used only AM1 to classify the models? Again, the answer is no, because it could happen that $AM1 \approx N/2$ but with a large value of absel, due to large values of $|del_i|$. Hence, we need the two requirements (R.1) and (R.2).
Step 2.
To consider that two equivalent models have identical densities, a value of difden close to zero is required.
Step 3.
If model M1 is better than M2, $N/2 < AM1 < N$ must be satisfied, while M2 will be better if $0 < AM1 < N/2$.
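A sketch of how these indicators and the three classification steps might be implemented is given below (our own illustration; the tolerance values are arbitrary choices, and the per-observation quantities $F_{0i}$, $F_{1i}$ and $F_{2i}$ are assumed to have been computed from the DGP and from the pseudo-true parameter vectors, e.g. as approximated in the Section 2 sketch):

```python
import numpy as np

def classify_experiment(F0, F1, F2, y, tol_el=0.02, tol_den=0.03, tol_am1=0.05):
    """Compute absel, AM1 and absdifden and apply Steps 1-3 (tolerances are illustrative)."""
    N = len(F0)
    # expected log-likelihood of observation i under each model, at the pseudo-true parameters
    E1 = F0 * np.log(F1) + (1 - F0) * np.log(1 - F1)
    E2 = F0 * np.log(F2) + (1 - F0) * np.log(1 - F2)
    deli = E1 - E2
    difden = F1**y * (1 - F1)**(1 - y) - F2**y * (1 - F2)**(1 - y)   # difference of f(y_i; .)

    absel = np.mean(np.abs(deli))
    AM1 = int(np.sum(deli > 0))
    absdifden = np.mean(np.abs(difden))

    if absel < tol_el and abs(AM1 - N / 2) < tol_am1 * N:     # Step 1: equivalent models
        if absdifden < tol_den:                               # Step 2: identical densities
            return "equivalent (identical densities)", absel, AM1, absdifden
        return "equivalent (non-identical densities)", absel, AM1, absdifden
    if AM1 > N / 2:                                           # Step 3: non-equivalent models
        return "M1 closer to the DGP", absel, AM1, absdifden
    return "M2 closer to the DGP", absel, AM1, absdifden
```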
Each experiment was numbered from 1 to 132 and classified as type A, B or C, as shown in Table 3.
Those experiments with an extreme percentage of zeros in the sample of the binary variable were omitted. Table 4 presents the experiments with the lowest values of absel, the values of AM1 closest to 1000 ($N/2$), and the values of absdifden nearest to zero.
We concluded that experiments 27, 3 and 6 included equivalent models, 27 and 3 having identical densities. The rest of the experiments corresponded to non-equivalent models, and we needed to classify them according to their closeness to the DGP. Given that AM1 was the adequate indicator, Table 5 shows all the experiments sequenced from the highest to the lowest value of this measure.
In Table 5, the variable w is always generated as N(3,1), and the IC column groups AIC and SBIC together because the results of both criteria were identical.
We find the experiments with equivalent models in the middle of this table. Above them (the upper part of the table) are the experiments where M1 was better and, below them (the lower part), those where M2 was the best model. The results showed that, at the upper end of the table, the values in the IC and C2 columns tended towards 500 and, at the lower end, towards zero, corroborating the theoretical conclusions.
The experiments that contained equivalent models with identical densities (27 and 3) corroborated the theoretical results. On the other hand, in experiment 6 (equivalent models with non-identical densities), both IC and $C_2$ performed well. Nevertheless, the theoretical results for $C_2$ concluded that this criterion is adequate only in some situations. We can affirm that experiment 6 belonged to one of these situations, characterized by uniformly distributed DGP variables and similar (but not excessively large) weights for both variables. We think that these characteristics could explain why $C_2$ behaved well.
Finally, we observed non-adequate behaviour of the criteria in some experiments. In the upper part of the table, experiments 62, 48 and 57 showed values far below 500 and large differences between the values of the criteria columns. In the lower part of the table, experiments 21 and 109 have IC and C2 column values far from 0. Could this atypical behaviour be due to anomalous observations? Taking into account that $del_i$ is the main element underlying the indicators used to classify the experiments, we studied whether extreme values of $del_i$ were always associated with the same selection (same sign of $del_i$). We found that this happened in all the experiments except in 21. Eliminating these extreme values, the behaviour of the criteria became adequate, as we show in Table 6.
As a final comment, we observed a very small number of experiments with equivalent models. We understood that this was logical, because the experiments of Case 2 corresponded to pairs of models where the DGP was not nested in M1 or M2. Given that model M1 contained the variable $x_1$ and model M2 contained $x_2$ ($x_1$ and $x_2$ being the only DGP variables), it was very difficult to find cases where the M1 and M2 models were equivalent.
When we re-executed the analysis for a sample size of 200, the results were similar in general terms, although the tendency toward correct behaviour of the criteria was slower. Nevertheless, we could affirm that the three criteria performed quite well for finite sample sizes.

5. Conclusions

Within the framework of overlapping binary models, we have studied the power of model selection criteria: the well-known information criteria AIC and SBIC, and the $C_2$ criterion, based on the mean square error of prediction.
As we previously mentioned, two binary models are overlapping if both have the same functional form (both probit or both logit), with some common explanatory variables and some specific variables. In this article, we distinguished two cases: (i) at least one of the competing models is well-specified, and (ii) neither of them is correctly specified. This last case is an important aspect of our work because it is not commonly considered in empirical works.
From a theoretical point of view, we have classified the competing models as equivalent or non-equivalent. Once this classification had been carried out, the task was to define the requirement that a given criterion must satisfy in order to be considered adequate. Specifically, if two models are equivalent, the probability limits of a given criterion must be the same in both models; if one of them is better, its corresponding limit must be lower than that of the other model. The theoretical analysis confirmed that the criteria perform well in almost every situation; only $C_2$ may occasionally fail to behave well in one specific setting.
These theoretical results have been corroborated by a Monte Carlo experiment. The most complicated situation to simulate was, as we expected, when neither of the two models were well-specified. This situation can lead to three possibilities: equivalent models with identical densities, equivalent models with non-identical densities, and non-equivalent models.
In order to develop this part of the Monte Carlo exercise, we had to generate 132 different DGPs, leading to 132 different experiments. Each of the experiments corresponded to one of the three theoretical relationships mentioned above. To establish the specific relationship, we have defined three indicators:
(a) The average of the absolute differences between the expected log-likelihoods (at the pseudo-true parameters) of both models. We have denoted it as absel.
(b) The number of observations in the sample where the expected log-likelihood in model M1 is larger than in M2 (note that we have assumed that M1 is the closest to the DGP). This indicator is called AM1.
(c) The average of the absolute differences between the density functions (at the pseudo-true parameters) of both models. We have denoted it as absdifden.
The general conclusion is that the three criteria behaved well for overlapping binary models: when neither of the two competing models was well-specified, the criteria tended to choose the best of them, that is, the closest to the DGP. In the most commonly studied case, where at least one of the competing models was correct, our conclusion was that the criteria also performed well, as we expected. Furthermore, when both models were correct, the criteria tended to choose the most parsimonious model.
It is important to note that these criteria can also be used when comparing an extensive set of models, correctly specified or misspecified. The criteria AIC, SBIC and $C_2$ allow us to rank them, with the correctly specified models, which will be equivalent to each other, occupying the first places. Among them, the first one (the selected model) will be the most parsimonious if we use the SBIC criterion. The misspecified models will be at the bottom of the ranking.
This paper has focused on the restricted framework of overlapping binary models. In order to complete this analysis, future work should study the behaviour of AIC, SBIC and $C_2$ in the non-nested framework. Moreover, the wider context of multinomial dependent variables could be the aim of future research. Given that the MLE procedure is also applied to estimate these models, the formal expression of the IC would be quite straightforward, while $C_2$ would require a deeper analysis.

Author Contributions

Conceptualization, T.A. and I.V.; methodology, T.A. and I.V.; software, I.V.; formal analysis, T.A. and I.V.; numerical simulation, I.V.; writing—original draft preparation, T.A. and I.V.; writing—review and editing, I.V.; supervision, I.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by DGA Reference Group S40_20R and Agencia Estatal de Investigación Reference PID2019-106822RB-I00.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data were generated in the simulation study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Aljarallah, R.; Kharroubi, S.A. Use of Bayesian Markov Chain Monte Carlo Methods to Model Kuwait Medical Genetic Center Data: An Application to Down Syndrome and Mental Retardation. Mathematics 2021, 9, 248. [Google Scholar] [CrossRef]
  2. Li, Z.; Wang, E.; Su, J.; Yu, Y. Using MCMC Probit Model to Value Coastal Beach Quality Improvement. J. Environ. Prot. 2011, 2, 109–114. [Google Scholar] [CrossRef] [Green Version]
  3. Vuong, Q.H. Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses. Econometrica 1989, 57, 307–333. [Google Scholar] [CrossRef] [Green Version]
  4. Lewis, F.; Butler, A.; Gilbert, L. A Unified Approach to Model Selection Using the Likelihood Ratio Test. Methods Ecol. Evol. 2011, 2, 155–162. [Google Scholar] [CrossRef]
  5. Hendry, D.F. Econometric Modelling; Department of Economics, University of Oslo: Oslo, Norway, 2000. [Google Scholar]
  6. Hong, H.; Preston, B. Nonnested Model Selection Criteria; Department of Economics, Stanford University: Stanford, CA, USA, 2006; pp. 1–33. [Google Scholar]
  7. Linhart, H.; Zucchini, W. Model Selection; John Wiley and Sons: New York, NY, USA, 1986. [Google Scholar]
  8. Pesaran, M.H.; Pesaran, B. A Simulation Approach to the Problem of Computing Cox’s Statistics for Testing Nonnested Models. J. Econom. 1993, 57, 377–392. [Google Scholar] [CrossRef]
  9. Santos Silva, J.M.C. A Score Test for Non-Nested Hypotheses with Applications to Discrete Data Models. J. Appl. Econom. 2001, 16, 577–592. [Google Scholar] [CrossRef]
  10. Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. In Proceedings of the Second International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, 2–8 September 1971; Petrov, B.N., Csáki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
  11. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  12. Aparicio, T.; Villanúa, I. Some Selection Criteria for Nested Binary Choice Models: A Comparative Study. Comput. Stat. 2007, 22, 635–660. [Google Scholar] [CrossRef]
  13. Kim, H.; Cavanaugh, J.E. Model Selection Criteria Based on Kullback Information Measures for Nonlinear Regression. J. Stat. Plan. Inference 2005, 134, 332–349. [Google Scholar] [CrossRef]
  14. Van Der Hoeven, N. The Probability to Select the Correct Model Using Likelihood-Ratio Based Criteria in Choosing Between Two Nested Models of Which the More Extended One Is True. J. Stat. Plan. Inference 2005, 135, 477–486. [Google Scholar] [CrossRef]
  15. Lalou, P.; Chalikias, M.; Skordoulis, M.; Papadopoulos, P.; Fatouros, S. A Probabilistic Evaluation of Sales Expansion. In Proceedings of the 5th International Symposium and 27th National Conference on Operation Research, Egaleo, Greece, 9–11 June 2016; pp. 109–113, ISBN 978-618-80361-6-1. [Google Scholar]
  16. Seo, T.K.; Thorne, J.L. Information Criteria for Comparing Partition Schemes. Syst. Biol. 2018, 67, 616–632. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Jhwueng, D.C.; Huzurbazar, S.; O’Meara, B.C.; Liu, L. Investigating the Performance of AIC in Selecting Phylogenetic Models. Stat. Appl. Genet. Mol. Biol. 2014, 13, 459–475. [Google Scholar] [CrossRef] [PubMed]
  18. Susko, E.; Roger, A.J. On the Use of Information Criteria for Model Selection in Phylogenetics. Mol. Biol. Evol. 2020, 37, 549–562. [Google Scholar] [CrossRef] [PubMed]
  19. Dziak, J.J.; Coffman, D.L.; Lanza, S.T.; Li, R.; Jermiin, L.S. Sensitivity and Specificity of Information Criteria. Brief. Bioinform. 2020, 21, 553–565. [Google Scholar] [CrossRef] [PubMed]
  20. Monfardini, C. An Illustration of Cox’s Non-Nested Testing Procedure for Logit and Probit Models. Comput. Stat. Data Anal. 2003, 42, 425–444. [Google Scholar] [CrossRef] [Green Version]
  21. Bierens, H.J. Topics in Advanced Econometrics; Cambridge University Press: Cambridge, UK, 1994. [Google Scholar]
  22. White, H. Maximum Likelihood Estimation of Misspecified Models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  23. Kohn, R. Consistent Estimation of Minimal Subset Dimension. Econometrica 1983, 51, 367–376. [Google Scholar] [CrossRef]
  24. Aparicio, T.; Villanúa, I. Selection Criteria for Overlapping Binary Models; Documentos de Trabajo; Facultad de Economía y Empresa, Universidad de Zaragoza: Zaragoza, Spain, 2012; pp. 1–54. [Google Scholar]
  25. Newey, W.K.; McFadden, D. Large sample estimation and hypothesis testing. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands, 1994; Volume 4, pp. 2111–2245. [Google Scholar]
  26. Davidson, J. Econometric Theory; Blackwell: Oxford, UK, 2000. [Google Scholar]
  27. Gourieroux, C.; Monfort, A.; Renault, E.; Trognon, A. Simulated Residuals. J. Econom. 1987, 34, 201–252. [Google Scholar] [CrossRef]
Table 1. Combinations of F0, F1 and F2 which lead to M1 being better than M2.

| Value of F0i | Condition Satisfied When M1 Is Better than M2 |
| F0i = 0 | F1i < F2i and F1i → 0 |
| F0i ∈ (0, 0.5] | F1i < F2i and F1i + F2i > 1; or F1i > F2i and F1i + F2i < 1 and F0i → 0.5; or F1i < F2i and F1i + F2i < 1 and F0i → 0 |
| F0i ∈ (0.5, 1) | F1i > F2i and F1i + F2i < 1; or F1i > F2i and F1i + F2i > 1 and F0i → 1; or F1i < F2i and F1i + F2i > 1 and F0i → 0.5 |
| F0i = 1 | F1i > F2i and F1i → 1 |
Table 2. Behaviour of the selection criteria in Case 1.1, N = 2000 (cells show the number of times, out of 500 replications, that each criterion selected M1).

| Setting | β | x1 | x2 | w | z | s | AIC | SBIC | C2 |
| A | (−2,3,1) | U | U | U | N(3,1) | — | 273 | 273 | 275 |
| A | (−2,1,3) | U | U | U | N(3,1) | — | 236 | 236 | 240 |
| A | (−2,1,1) | U | U | U | N(3,1) | — | 234 | 234 | 244 |
| A | (−2,3,1) | χ² | χ² | U | N(3,1) | — | 240 | 240 | 252 |
| A | (−2,1,3) | χ² | χ² | U | N(3,1) | — | 273 | 273 | 261 |
| A | (−2,1,1) | χ² | χ² | U | N(3,1) | — | 236 | 236 | 238 |
| A | (−2,3,1) | U | χ² | U | N(3,1) | — | 247 | 247 | 245 |
| A | (−2,1,3) | U | χ² | U | N(3,1) | — | 231 | 231 | 245 |
| A | (−2,1,1) | U | χ² | U | N(3,1) | — | 243 | 243 | 260 |
| A | (−2,3,1) | U | dummy | U | N(3,1) | — | 259 | 259 | 269 |
| A | (−2,1,3) | U | dummy | U | N(3,1) | — | 231 | 231 | 245 |
| A | (−2,1,1) | U | dummy | U | N(3,1) | — | 242 | 242 | 261 |
| A | (−2,3,1) | χ² | dummy | U | N(3,1) | — | 255 | 255 | 251 |
| A | (−2,1,3) | χ² | dummy | U | N(3,1) | — | 248 | 248 | 257 |
| A | (−2,1,1) | χ² | dummy | U | N(3,1) | — | 251 | 251 | 234 |
| B | (−2,3,1) | U | U | U | N(3,1) | χ² | 144 | 10 | 179 |
| B | (−2,1,3) | U | U | U | N(3,1) | χ² | 133 | 10 | 151 |
| B | (−2,1,1) | U | U | U | N(3,1) | χ² | 126 | 2 | 144 |
| B | (−2,3,1) | χ² | χ² | U | N(3,1) | χ² | 128 | 12 | 215 |
| B | (−2,1,3) | χ² | χ² | U | N(3,1) | χ² | 140 | 10 | 230 |
| B | (−2,1,1) | χ² | χ² | U | N(3,1) | χ² | 130 | 11 | 185 |
| B | (−2,3,1) | U | χ² | U | N(3,1) | χ² | 128 | 9 | 165 |
| B | (−2,1,3) | U | χ² | U | N(3,1) | χ² | 106 | 2 | 192 |
| B | (−2,1,1) | U | χ² | U | N(3,1) | χ² | 120 | 5 | 161 |
| B | (−2,3,1) | U | dummy | U | N(3,1) | χ² | 144 | 12 | 177 |
| B | (−2,1,3) | U | dummy | U | N(3,1) | χ² | 145 | 7 | 153 |
| B | (−2,1,1) | U | dummy | U | N(3,1) | χ² | 141 | 7 | 162 |
| B | (−2,3,1) | χ² | dummy | U | N(3,1) | χ² | 147 | 10 | 200 |
| B | (−2,1,3) | χ² | dummy | U | N(3,1) | χ² | 136 | 8 | 185 |
| B | (−2,1,1) | χ² | dummy | U | N(3,1) | χ² | 143 | 12 | 185 |
Table 3. Number of every experiment in Case 2, and types of x in the DGP (A, B, C) ¹.

| Number (A,B,C) | DGP (β1, β2) | Number (A,B,C) | DGP (β1, β2) | Number (A,B,C) | DGP (β1, β2) | Number (A,B,C) | DGP (β1, β2) |
| 1,45,89 | (1,−2) | 12,56,100 | (1,7) | 23,67,111 | (3,5) | 34,78,122 | (7,1) |
| 2,46,90 | (1,−1.5) | 13,57,101 | (3,−2) | 24,68,112 | (3,7) | 35,79,123 | (−2,3) |
| 3,47,91 | (1,−1) | 14,58,102 | (3,−1.5) | 25,69,113 | (−2,1) | 36,80,124 | (−1.5,3) |
| 4,48,92 | (1,−0.5) | 15,59,103 | (3,−1) | 26,70,114 | (−1.5,1) | 37,81,125 | (−1,3) |
| 5,49,93 | (1,0.5) | 16,60,104 | (3,−0.5) | 27,71,115 | (−1,1) | 38,82,126 | (−0.5,3) |
| 6,50,94 | (1,1) | 17,61,105 | (3,0.5) | 28,72,116 | (−0.5,1) | 39,83,127 | (0.5,3) |
| 7,51,95 | (1,1.5) | 18,62,106 | (3,1) | 29,73,117 | (0.5,1) | 40,84,128 | (1.5,3) |
| 8,52,96 | (1,2) | 19,63,107 | (3,1.5) | 30,74,118 | (1.5,1) | 41,85,129 | (2,3) |
| 9,53,97 | (1,3) | 20,64,108 | (3,2) | 31,75,119 | (2,1) | 42,86,130 | (4,3) |
| 10,54,98 | (1,4) | 21,65,109 | (3,3) | 32,76,120 | (4,1) | 43,87,131 | (5,3) |
| 11,55,99 | (1,5) | 22,66,110 | (3,4) | 33,77,121 | (5,1) | 44,88,132 | (7,3) |

¹ In type A both variables are U(0,1); in type B x1 is U(0,1) and x2 is χ²(1); in type C both variables are χ²(1).
Table 4. Experiments sequenced by each of the three indicators (absel, AM1 and absdifden).

| Experiments with the lowest value of absel | | | Experiments with AM1 around 1000 | | | Experiments with the lowest value of absdifden | |
| Exp. | absel | AM1 | Exp. | AM1 | absel | Exp. | absdifden |
| 3 | 0.00716535 | 1001 | 57 | 1011 | 0.17489579 | 27 | 0.0224 |
| 27 | 0.00716886 | 1003 | 27 | 1003 | 0.00716886 | 3 | 0.0227 |
| 28 | 0.00731291 | 505 | 3 | 1001 | 0.00716535 | 28 | 0.0257 |
| 4 | 0.00737039 | 1498 | 6 | 1000 | 0.0185438 | 24 | 0.0261 |
| 5 | 0.01279404 | 1514 | 63 | 995 | 0.25329468 | 48 | 0.0325 |
| 29 | 0.0127966 | 510 | 21 | 991 | 0.13463303 | 47 | 0.0341 |
| 48 | 0.01515962 | 1035 | 109 | 980 | 0.44042798 | 46 | 0.0349 |
| | | | | | | 45 | 0.0354 |
| | | | | | | 29 | 0.0510 |
Table 5. Behaviour of the selection criteria in Case 2, N = 2000 ¹.

| Exp. | AM1 | IC | C2 | Exp. | AM1 | IC | C2 | Exp. | AM1 | IC | C2 | Exp. | AM1 | IC | C2 |
| 104 | 1868 | 500 | 500 | 20 | 1393 | 500 | 500 | 129 | 805 | 0 | 0 | 84 | 315 | 0 | 0 |
| 34 | 1861 | 500 | 500 | 76 | 1366 | 500 | 500 | 47 | 786 | 2 | 22 | 37 | 315 | 0 | 0 |
| 17 | 1846 | 500 | 500 | 119 | 1365 | 500 | 500 | 65 | 764 | 0 | 0 | 50 | 261 | 0 | 0 |
| 16 | 1834 | 500 | 500 | 30 | 1359 | 499 | 499 | 112 | 719 | 0 | 0 | 123 | 245 | 0 | 0 |
| 33 | 1827 | 500 | 500 | 13 | 1342 | 500 | 500 | 128 | 708 | 0 | 0 | 124 | 232 | 0 | 0 |
| 92 | 1825 | 500 | 500 | 107 | 1321 | 500 | 500 | 96 | 705 | 0 | 0 | 51 | 226 | 0 | 0 |
| 103 | 1809 | 500 | 500 | 88 | 1292 | 500 | 500 | 22 | 701 | 0 | 0 | 10 | 224 | 0 | 0 |
| 91 | 1807 | 500 | 500 | 42 | 1288 | 500 | 500 | 66 | 686 | 0 | 0 | 114 | 209 | 0 | 0 |
| 90 | 1794 | 500 | 500 | 118 | 1268 | 500 | 500 | 7 | 685 | 2 | 2 | 125 | 208 | 0 | 0 |
| 102 | 1779 | 500 | 500 | 59 | 1252 | 500 | 500 | 46 | 668 | 0 | 5 | 52 | 204 | 0 | 0 |
| 89 | 1779 | 500 | 500 | 132 | 1219 | 500 | 500 | 35 | 662 | 0 | 0 | 11 | 195 | 0 | 0 |
| 101 | 1753 | 500 | 500 | 108 | 1188 | 500 | 500 | 41 | 636 | 0 | 0 | 113 | 194 | 0 | 0 |
| 32 | 1751 | 500 | 500 | 131 | 1156 | 500 | 500 | 67 | 601 | 0 | 0 | 69 | 190 | 0 | 0 |
| 105 | 1691 | 500 | 500 | 62 | 1154 | 67 | 244 | 45 | 589 | 0 | 2 | 115 | 180 | 0 | 0 |
| 18 | 1691 | 500 | 500 | 94 | 1112 | 441 | 451 | 117 | 583 | 0 | 0 | 53 | 170 | 0 | 0 |
| 15 | 1675 | 500 | 500 | 58 | 1112 | 352 | 494 | 97 | 576 | 0 | 0 | 39 | 170 | 0 | 0 |
| 78 | 1641 | 500 | 500 | 130 | 1093 | 500 | 500 | 23 | 537 | 0 | 0 | 38 | 170 | 0 | 0 |
| 122 | 1606 | 500 | 500 | 87 | 1067 | 447 | 496 | 68 | 518 | 0 | 0 | 116 | 167 | 0 | 0 |
| 44 | 1582 | 500 | 500 | 48 | 1035 | 72 | 195 | 98 | 513 | 0 | 0 | 54 | 162 | 0 | 0 |
| 121 | 1556 | 500 | 500 | 57 | 1011 | 41 | 322 | 29 | 510 | 1 | 17 | 70 | 162 | 0 | 0 |
| 19 | 1540 | 500 | 500 | 27 | 1003 | 245 | 240 | 28 | 505 | 10 | 13 | 126 | 161 | 0 | 0 |
| 120 | 1538 | 500 | 500 | 3 | 1001 | 250 | 245 | 8 | 504 | 0 | 0 | 79 | 157 | 0 | 0 |
| 31 | 1524 | 500 | 500 | 6 | 1000 | 249 | 252 | 36 | 494 | 0 | 0 | 12 | 137 | 0 | 0 |
| 77 | 1516 | 500 | 500 | 63 | 995 | 0 | 0 | 74 | 491 | 0 | 0 | 71 | 133 | 0 | 0 |
| 5 | 1514 | 499 | 498 | 21 | 991 | 240 | 223 | 99 | 486 | 0 | 0 | 56 | 131 | 0 | 0 |
| 93 | 1510 | 500 | 500 | 109 | 980 | 357 | 324 | 40 | 475 | 0 | 0 | 55 | 129 | 0 | 0 |
| 14 | 1504 | 500 | 500 | 86 | 932 | 0 | 2 | 85 | 444 | 0 | 0 | 80 | 128 | 0 | 0 |
| 4 | 1498 | 484 | 484 | 110 | 891 | 0 | 0 | 100 | 414 | 0 | 0 | 73 | 104 | 0 | 0 |
| 60 | 1492 | 500 | 500 | 64 | 877 | 0 | 0 | 49 | 390 | 0 | 0 | 81 | 97 | 0 | 0 |
| 106 | 1474 | 500 | 500 | 95 | 835 | 0 | 0 | 24 | 385 | 0 | 0 | 72 | 84 | 0 | 0 |
| 43 | 1438 | 500 | 500 | 111 | 813 | 0 | 0 | 127 | 329 | 0 | 0 | 83 | 80 | 0 | 0 |
| 61 | 1429 | 500 | 500 | 75 | 813 | 0 | 0 | 9 | 325 | 0 | 0 | 82 | 57 | 0 | 0 |

¹ Experiments 1, 2, 25 and 26 are omitted, due to the extreme percentage of ones/zeros in the samples.
Table 6. Atypical experiments in Case 2. Behaviour of the selection criteria.

| Experiment | (β1, β2) | Type of x | N | AM1 | absdel | IC | C2 |
| 62 | (3, 1) | B | 1800 | 1144 | 0.1517 | 500 | 500 |
| 48 | (1, −0.5) | B | 1750 | 1035 | 0.0097 | 377 | 499 |
| 57 | (3, −2) | B | 1800 | 1003 | 0.12 | 499 | 500 |
| 109 | (3, 3) | C | 1925 | 909 | 0.4 | 0 | 0 |