Selection Criteria for Overlapping Binary Models—A Simulation Study

Abstract: This paper deals with the problem of choosing the optimum criterion for selecting the best model out of a set of overlapping binary models. The criteria we studied were the well-known AIC and SBIC, and a third one called $C_2$. Special attention was paid to the setting where neither of the competing models was correctly specified. This situation has received little attention in the literature, yet it is the most common case in empirical work. The theoretical study we carried out allowed us to conclude that, in general terms, all criteria perform well. A Monte Carlo exercise corroborated those results.


Introduction
This work focuses on the analysis of model selection criteria within the framework of binary choice models (BCM), where the probability that $Y_i = 1$ is $p_i = F(x_i' \beta)$, with $x_i$ the regressor vector and $F$ a c.d.f. The c.d.f. can be normal or logistic, leading to a probit model or a logit model, respectively. Although the common analysis procedure for these models is to apply the maximum likelihood estimation (MLE) method, they can also be implemented from a Bayesian framework using Gibbs sampling Markov chain Monte Carlo (MCMC) methods [1,2]. Nevertheless, in this work we considered the conventional context; thus, the MLE procedure was used.
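As a concrete illustration of this conventional MLE setting, the sketch below simulates a binary outcome from a latent-index model and fits it as both a probit and a logit. This is a minimal example with illustrative names and parameter values, not the paper's design; it assumes the statsmodels library.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a binary outcome from a latent-index DGP with standard normal
# errors (so the probit is the correctly specified model).
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 4, size=n)
y = ((-2.0 + 1.0 * x + rng.standard_normal(n)) > 0).astype(int)

# Fit the same data by MLE under the two c.d.f. choices.
X = sm.add_constant(x)
probit_res = sm.Probit(y, X).fit(disp=0)
logit_res = sm.Logit(y, X).fit(disp=0)
print("probit:", probit_res.params, "logL =", probit_res.llf)
print("logit: ", logit_res.params, "logL =", logit_res.llf)
```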
This paper compares several models in order to select the "best" of them. In a general context, and following [3], the compared models can be nested, overlapping or non-nested. In the specific framework of BCM, two binary models are nested if they possess the same c.d.f. (both probit or both logit) and the regressors of one of the models are included in the other. Two binary models are overlapping if both possess the same c.d.f. (both probit or both logit), with some common explanatory variables and other specific variables. Finally, the compared models are non-nested if they possess only specific regressors. Moreover, the models are also non-nested when they possess different c.d.f.s (probit versus logit), even if there are some common regressors. Many works define nested and non-nested models and describe how to proceed in each situation [4][5][6], while overlapping models have been the least analysed. In this paper, we compared overlapping models, and we found that they were either equivalent or that one of them was better than the other (non-equivalent).
Although hypothesis testing procedures (HTP) are widely used to discriminate between models, they can only be used to choose between pairs of models. In comparison, selection criteria allow us to select the best model from quite a large set, which is an important advantage in empirical econometric work. The latter approach allows researchers to express their objectives in the form of a loss function, or by using the discrepancy concept; as [7] established, the discrepancy concept is a particular case of a loss function. For non-linear regression models (our framework), the procedures developed by [3,8,9] belong to the first category (HTP). The second category involves the well-known AIC [10] and SBIC [11], where the discrepancy is obtained from the Kullback-Leibler distance. Additionally, the use of the mean square error (MSE) of prediction as a discrepancy enables us to derive another criterion, denoted as $C_2$ (see [12]).
Many works have studied the behaviour of selection procedures in linear regression models. However, this subject has been less analysed in a non-linear regression context, and the nested framework is nearly always assumed [13][14][15]. The performance of some selection procedures has also been studied in phylogenetics, where partitioned models were used [16][17][18][19]. Specific references for discrete choice models are [12] for nested models, and [20] for non-nested models.
In this paper, the competing models we selected from were overlapping models. The purpose was to investigate the discriminatory power of certain model selection criteria assuming two situations: (i) at least one of the models was correctly specified; (ii) neither of the models was correctly specified. According to [21], a well-specified model can include irrelevant variables together with the set of regressors of the data generating process (DGP). In our opinion, situation (ii) is the most interesting in practice but the least studied in the literature. Given that, in this case, no model was well-specified, we could not consider consistency as the condition that makes a given selection criterion adequate. The requirement we proposed is that the criterion selects the closest model to the DGP.
The article is organised as follows. In Section 2, we establish the general context and the methodology. Section 3 is dedicated to studying the theoretical behaviour of the criteria. Section 4 presents and discusses the results from a Monte Carlo experiment. Conclusions are presented in Section 5.

Materials and Methods
Consider the following DGP:

$p_{0i} = P(Y_i = 1 \mid x_i) = F(x_i' \beta_0), \quad i = 1, \dots, N,$ (1)

and a pair of overlapping models which, in general terms, are defined as follows:

$M_1: \; p_{1i} = F(x_{1i}' \gamma),$ (2)

$M_2: \; p_{2i} = F(x_{2i}' \delta),$ (3)

where the regressor vectors $x_{1i}$ and $x_{2i}$ contain some common explanatory variables and other specific ones. In order to describe the relationship of each of the competing models with the true model (DGP), we used the Kullback-Leibler distance (KLIC) to the DGP,

$\mathrm{KLIC}(f_0, f_j) = E_0[\ln f_0 - \ln f_j],$

where $f_0$ is the density function of the DGP and $f_j$ the one corresponding to model $M_j$. The models are then compared through the difference

$\mathrm{KLIC}(f_0, f_1) - \mathrm{KLIC}(f_0, f_2).$ (4)
It is well-known that if this statistic (expression (4)) is positive, then M2 is the preferred model, M1 being preferred if (4) is negative. If it is null, the two models are equivalent.
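For binary models, the KLIC of each competing model can be approximated by averaging the pointwise Kullback-Leibler divergence over a sample of regressors. The sketch below does this numerically; the probability indexes are illustrative values, not the paper's design.

```python
import numpy as np
from scipy.stats import norm

def klic_binary(p0, pj, eps=1e-12):
    """Average KLIC from the DGP to model Mj for binary data:
    E0[ln f0 - ln fj], with the expectation over Y and the regressors."""
    p0 = np.clip(p0, eps, 1 - eps)
    pj = np.clip(pj, eps, 1 - eps)
    return np.mean(p0 * np.log(p0 / pj) + (1 - p0) * np.log((1 - p0) / (1 - pj)))

rng = np.random.default_rng(1)
x = rng.uniform(0, 4, 100_000)
p0 = norm.cdf(-2.0 + 1.5 * x)   # DGP probabilities
p1 = norm.cdf(-1.8 + 1.4 * x)   # M1 probabilities (illustrative pseudo-true values)
p2 = norm.cdf(-1.0 + 1.0 * x)   # M2 probabilities (illustrative pseudo-true values)
# Statistic (4): positive favours M2, negative favours M1, zero means equivalence.
print(klic_binary(p0, p1) - klic_binary(p0, p2))
```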
Given the DGP of (1), and following [21], any model which is correctly specified can be written as the DGP augmented with irrelevant regressors; its pseudo-true parameter vector is therefore the true vector $\beta_0$ completed with zeros, which we denote $\beta_0^+$. Under standard conditions, the MLE of each model converges in probability to its pseudo-true parameter vector:

$\hat{\gamma} \xrightarrow{p} \gamma_*, \qquad \hat{\delta} \xrightarrow{p} \delta_*.$ (5)

We can distinguish two cases: the case where at least one of the competing models was correctly specified, and the case where neither of them was correct.
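The pseudo-true convergence in (5) is easy to visualize numerically: the MLE of a misspecified probit stabilizes as N grows, but not at the DGP parameters. The following sketch (illustrative values, statsmodels assumed) omits one relevant regressor and prints the estimates for two sample sizes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
for n in (1_000, 100_000):
    x1 = rng.uniform(0, 2, n)
    x2 = rng.uniform(0, 2, n)
    # DGP: probit with intercept -2 and two relevant regressors.
    y = ((-2.0 + x1 + x2 + rng.standard_normal(n)) > 0).astype(int)
    # Misspecified model: x2 is omitted. The estimates converge to a
    # pseudo-true vector, not to the DGP values (-2, 1).
    res = sm.Probit(y, sm.add_constant(x1)).fit(disp=0)
    print(n, res.params)
```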

Case 1: At Least One of the Competing Models Is Correctly Specified
Then, the situations we considered are:
• Case 1.1: Both models are well-specified (or both models include the DGP), so the two models are equivalent.
• Case 1.2: Only one of the models (say M1) is well-specified, so M1 is better than M2.

Case 2: Neither of the Models Is Correctly Specified (or Neither of the Models Includes the DGP)
In this situation, the compared models are the misspecified versions of (2) and (3). We again used the convergence result (5) to conclude that, in this case, $\gamma_* \neq \beta_0^+$ and $\delta_* \neq \beta_0^+$, it being possible that the competing models are equivalent or not. Specifically, according to [3], there are two possible situations:
• Case 2.1: The densities of the two models, evaluated at the pseudo-true values, are identical, so the models are equivalent.
• Case 2.2: The densities are not identical; then the models can be (a) equivalent or (b) non-equivalent.

The criteria we compared are AIC, SBIC and $C_2$. To obtain them, we adopted the discrepancy concept (see [7]). As we can see in [12], "A discrepancy measures the lack of fit between the proposed model and the DGP, in the aspect which the researcher considers the most relevant". More details about this procedure can be found in [7] and [12]. We assume two discrepancies, called $\Delta_1$ and $\Delta_2$. The first one is the Kullback-Leibler distance; for a model $M_j$, it is expressed as $\Delta_1(M_j) = E_0[\ln f_0 - \ln f_j]$, and leads to the information criteria (IC) AIC and SBIC:

$\mathrm{AIC}_j = -\frac{2}{N} \ln L_j + \frac{2 k_j}{N}, \qquad \mathrm{SBIC}_j = -\frac{2}{N} \ln L_j + \frac{k_j \ln N}{N},$ (9)

where $\ln L_j$ is the maximized log-likelihood and $k_j$ the number of parameters of model $M_j$. The second discrepancy is the mean square error (MSE) of prediction; for model M1, $\Delta_2(M_1) = E[(Y_{N+1} - \hat{p}_{1,N+1})^2]$, with "N + 1" indicating an out-of-sample observation. For any model $M_j$, and following the previously mentioned procedure, the expression of the criterion is

$C_2(M_j) = \frac{1}{N} \mathrm{SSR}_j + \text{correction term},$ (10)

with $\mathrm{SSR}_j = \sum_{i=1}^{N} (y_i - \hat{p}_{ji})^2$, the squared sum of the differences between the binary variable and the estimated probability with model $M_j$. The proof of (10) is developed in [24].
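A minimal sketch of how these three quantities could be computed for a fitted probit follows; the correction term of $C_2$ derived in [12,24] is not reproduced in this paper's excerpted formulas, so the code keeps only the $\mathrm{SSR}_j/N$ term (an assumption of this illustration).

```python
import numpy as np
import statsmodels.api as sm

def criteria(y, X):
    """Normalized AIC and SBIC of (9), plus an uncorrected C2 of (10),
    for a probit model fitted by MLE. The C2 correction term is omitted."""
    res = sm.Probit(y, X).fit(disp=0)
    n, k = X.shape
    p_hat = res.predict(X)
    ssr = np.sum((y - p_hat) ** 2)
    return {
        "AIC": (-2.0 * res.llf + 2.0 * k) / n,
        "SBIC": (-2.0 * res.llf + k * np.log(n)) / n,
        "C2": ssr / n,  # correction term omitted in this sketch
    }
```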
The model having the criterion with the lowest value was chosen, so different criteria could have led to a different choice. Nevertheless, we were interested in analysing if the criteria worked well, that is, if the selection was correct, in the sense we define in the following section.

Theoretical Results
In this section, we study the theoretical behaviour of the criteria in order to prove whether they perform well. All the proofs of the results presented in this section can be seen, in detail, in [24].
We carry out an asymptotic analysis, which needs a set of initial assumptions and the results and definitions that we state below. The basic convergence result is that, for each of the models (2) and (3), the average log-likelihood evaluated at the MLE converges in probability to its expectation at the pseudo-true values:

$\frac{1}{N} \ln L_j(\hat{\theta}_j) \xrightarrow{p} E_0[\ln f_j(\theta_{j*})].$ (11)

Proof. The proof of (11) is based on the law of large numbers for heterogeneous variables [26], together with a lemma of [25]. The lemma of [25] (p. 2156) is expressed as follows: "If $z_i$ is i.i.d., $a(z, \theta)$ is continuous at $\theta_0$ with probability one, and there is a neighbourhood $\Gamma$ of $\theta_0$ such that $E[\sup_{\theta \in \Gamma} \lVert a(z, \theta) \rVert] < \infty$, then, for any $\hat{\theta} \xrightarrow{p} \theta_0$, $\frac{1}{N} \sum_{i=1}^{N} a(z_i, \hat{\theta}) \xrightarrow{p} E[a(z, \theta_0)]$" (Lemma 1). □

Definition 1. M1 and M2 are equivalent if their KLIC distances to the DGP coincide, which leads to:

$E_0[\ln f_1(\gamma_*)] = E_0[\ln f_2(\delta_*)].$

Definition 2. M1 is closer to the DGP than M2 if:

$\mathrm{KLIC}(f_0, f_1) < \mathrm{KLIC}(f_0, f_2),$ (14)

which leads to:

$E_0[\ln f_1(\gamma_*)] > E_0[\ln f_2(\delta_*)].$

Definition 3. A selection criterion is adequate if: (i) when the compared models are equivalent, its probability limits are the same for both models; and (ii) when M1 is closer to the DGP than M2, its probability limit for M1 is lower than that for M2.

Now, for every case we stated in the previous section, we must prove whether the definition of "adequate criterion" is satisfied.

Result 1. The IC criteria behave well in all settings.
Proof. The basic tool for achieving this result is the comparison of Definitions 1 and 2 with Definition 3. In this sense, Definitions 1 and 2 establish the condition that must be met when the compared models are equivalent or non-equivalent, respectively. On the other hand, Definition 3 tells us the requirements for determining if a specific criterion is adequate in each context (of equivalence or not).
Expression (9) can be written as:

$\mathrm{IC}_j = -\frac{2}{N} \ln L_j(\hat{\theta}_j) + c_j(N),$

where the correction factor is $c_j(N) = 2 k_j / N$ for AIC and $c_j(N) = k_j \ln N / N$ for SBIC. Using Lemma 1 and the convergences given in (5) for the first term, we obtain:

$\mathrm{IC}_j \xrightarrow{p} -2 E_0[\ln f_j(\theta_{j*})].$ (17)

The correction factor converges to zero for every model.
When the competing models are equivalent (cases 1.1, 2.1 and 2.2.(a)), the IC criteria will be adequate if the equality of Definition 3 (i) holds, which, using (17), leads to the following expression:

$-2 E_0[\ln f_1(\gamma_*)] = -2 E_0[\ln f_2(\delta_*)].$

This equality is always satisfied, given Definition 1, so we can say that the IC criteria performed well.
When the models are non-equivalent, and assuming M1 is always better than M2 (cases 1.2 and 2.2.(b)), the IC criteria will be adequate if the inequality of Definition 3 (ii) holds, which, using (17), leads to:

$-2 E_0[\ln f_1(\gamma_*)] < -2 E_0[\ln f_2(\delta_*)],$ (19)

which is always satisfied, given Definition 2. Note that (19) shows that AIC and SBIC are asymptotically identical, which is not strictly true when $k_1 \neq k_2$. In this situation, for a given pair of competing models, the difference between both criteria is due to the different rates at which their correction factors converge to zero.
Specifically, we can write:

$\frac{c_j^{\mathrm{SBIC}}(N)}{c_j^{\mathrm{AIC}}(N)} = \frac{k_j \ln N / N}{2 k_j / N} = \frac{\ln N}{2},$ (20)

which means that, when N increases, the distance between the convergence rates of the numerator and the denominator of expression (20) becomes larger. This implies that SBIC will tend toward one of the models more strongly than AIC. Which model? It is evident that, if $k_1 > k_2$, the tendency will be toward M2, given that both AIC and SBIC select the model with the lower value of the criterion. It is important to remark that the equality of the probability limits is not contradictory with a higher tendency toward the more parsimonious model, given that both models are equivalent.
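A quick numeric check of this rate comparison, under the normalized correction factors assumed above:

```python
import numpy as np

# The SBIC correction k*ln(N)/N shrinks more slowly than the AIC
# correction 2*k/N; their ratio ln(N)/2 grows with N, so SBIC penalizes
# extra parameters ever more strongly relative to AIC.
for n in (200, 2000, 20_000):
    print(f"N = {n:6d}   ln(N)/2 = {np.log(n) / 2:.2f}")
```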

Result 2. The $C_2$ criterion is adequate, except in a specific situation.
Proof. Expression (10) can be written as the sum of $\mathrm{SSR}_j/N$ and the correction term. Applying Lemma 1 together with the convergences (5) to the first term, and noting that the correction term converges to zero, we obtain:

$C_2(M_j) \xrightarrow{p} E_0[F_{0i}(1 - F_{0i})] + E_0[(F_{0i} - F_{ji})^2],$ (22)

where $F_{0i}$ denotes the DGP probability and $F_{ji}$ the probability of model $M_j$ evaluated at its pseudo-true values. When the compared models are equivalent and well-specified (case 1.1), the probability limit (22) is the same for both models ($F_{1i} = F_{2i} = F_{0i}$), so Definition 3 (i) is satisfied. When the compared models are non-equivalent and only M1 is well-specified (case 1.2), the probability limit (22) is different for each model, that of M2 exceeding that of M1 by positive terms $h_1$ and $h_2$ (see [24]). It is straightforward to see that Definition 3 (ii) is satisfied, so the $C_2$ criterion performed adequately.
If neither of the competing models is correctly specified (case 2), the probability limits of (22) for each model can be written as:

$C_2(M_j) \xrightarrow{p} E_0[F_{0i}(1 - F_{0i})] + E_0[(F_{0i} - F_{ji})^2], \quad j = 1, 2.$

In case 2.2.(b), where the competing models were non-equivalent, we assumed that M1 was better than M2. In order to study the power of the criterion, we applied a strategy similar to that used for the IC criteria; that is, we related Definitions 2 and 3 (ii). Definition 2 can be written as:

$E_0[\ln f_1(\gamma_*)] > E_0[\ln f_2(\delta_*)].$ (27)

We wanted to find the combinations of $F_{0i}$, $F_{1i}$ and $F_{2i}$ that satisfy (27). The results we obtained are summarized in Table 1.
For every combination presented in Table 1, we get the previous result, so the $C_2$ criterion is adequate. □
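The decomposition used in (22) can be verified numerically: for binary data, the mean squared prediction error splits into the DGP variance term plus the squared probability gap. The values below are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Numeric check of E[(y - F_j)^2] = E[F_0(1 - F_0)] + E[(F_0 - F_j)^2].
rng = np.random.default_rng(2)
n = 500_000
x = rng.uniform(0, 2, n)
f0 = norm.cdf(-2.0 + 2.0 * x)               # DGP probabilities
fj = norm.cdf(-1.5 + 1.5 * x)               # a model's (illustrative) pseudo-true probabilities
y = (rng.uniform(size=n) < f0).astype(int)  # binary draws from the DGP
lhs = np.mean((y - fj) ** 2)
rhs = np.mean(f0 * (1 - f0)) + np.mean((f0 - fj) ** 2)
print(lhs, rhs)                             # agree up to simulation noise
```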

Simulation Study and Discussion
The objective of the Monte Carlo experiments is twofold: to confirm the theoretical results and to assess the performance of all the criteria in finite samples.
The generation of the binary variable $y_i$ is based on the latent linear model that underlies any binary model:

$y_i^* = x_i' \beta + \varepsilon_i,$

where $y_i^*$ is a latent (unobservable) variable which generates $y_i$ through:

$y_i = 1$ if $y_i^* > 0$, and $y_i = 0$ otherwise.

Under the assumptions established in Section 3, and following the procedure of [27], we obtained the values of $y_i$. We considered different sets of parameter values and different kinds of explanatory variables (continuous and dummy), and the standard normal distribution function was chosen for the error term, implying an exclusive focus on probit models. Two sample sizes, N = 200 and N = 2000, were used, and we carried out 500 replications for each experiment. Additionally, the intercept was fixed at a value of −2 in order to avoid a non-balanced number of ones in the sample of $y_i$, which would lead to problems when estimating and interpreting results. In each of the 500 replications, we estimated M1 and M2 and calculated the value of the IC and $C_2$ criteria. The corresponding tables for every experiment show the number of times that each criterion selected M1. Note that we only present tables for N = 2000 and comment on the differences from N = 200 where such differences exist. In all cases, the DGP is the probit model described above.
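The sketch below assembles one such experiment end to end for a hypothetical case 1.2-style pair of overlapping probit models (common regressor $x_1$, specific regressors $x_2$ and $x_3$). The regressor distributions, parameter values and the uncorrected $C_2$ term are assumptions of this illustration, not the paper's exact design.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)

def criteria(y, X):
    """Normalized AIC/SBIC of (9) and an uncorrected C2 (SSR/N)."""
    res = sm.Probit(y, X).fit(disp=0)
    n, k = X.shape
    p_hat = res.predict(X)
    return {"AIC": (-2 * res.llf + 2 * k) / n,
            "SBIC": (-2 * res.llf + k * np.log(n)) / n,
            "C2": np.sum((y - p_hat) ** 2) / n}

N, REPS = 2000, 500
wins = {"AIC": 0, "SBIC": 0, "C2": 0}
for _ in range(REPS):
    x1, x2, x3 = (rng.uniform(0, 2, N) for _ in range(3))
    # DGP: probit with intercept -2; x3 plays no role.
    y = ((-2.0 + 1.5 * x1 + 1.5 * x2 + rng.standard_normal(N)) > 0).astype(int)
    X1 = sm.add_constant(np.column_stack([x1, x2]))  # M1: well-specified
    X2 = sm.add_constant(np.column_stack([x1, x3]))  # M2: misspecified
    c1, c2 = criteria(y, X1), criteria(y, X2)
    for name in wins:
        wins[name] += c1[name] < c2[name]            # lower value selects M1

print(wins)  # times each criterion selected M1, out of REPS
```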

Monte Carlo Exercise When Both Models Are Correctly Specified (Case 1.1)
We consider well-specified models M1 and M2, both of which include the DGP regressors plus some irrelevant ones. Firstly, we assumed $\gamma_4 = 0$, so we chose between models with the same number of parameters; afterwards, we assumed $\gamma_4 \neq 0$, so that M1 had more irrelevant regressors than M2. These settings are called A and B, respectively; the results are presented in Table 2. For setting A, we can see that the number presented in each cell was around 250 (50% of 500 times), which is the correct behaviour for equivalent models. However, if we consider setting B, where M1 had more irrelevant regressors than M2, the criteria tended to select the model with fewer parameters, that is, the more parsimonious model (M2), and this tendency grew with the sample size. This behaviour was also correct, given that both specifications were equivalent. The most parsimonious criterion was SBIC, so the Monte Carlo exercise corroborated this theoretical aspect of the previous section. Additionally, we observed that neither the kind of variables nor the set of values of the DGP parameter vector seemed to affect the behaviour of the criteria.

Monte Carlo Exercise When Only One of the Models Is Correctly Specified (Case 1.2)
In these experiments, M1 and M2 are specified so that only M1 is well-specified. For N = 2000, the theoretical results were corroborated: all the criteria tended to select M1 for every kind of explanatory variable. The corresponding table has been omitted, given that the value in all cells was 500.
However, for N = 200, the results were not so clear-cut, although they tended towards adequate behaviour. Specifically, differences were found when the variables $x_1$ and $x_2$ were both uniform and the weight of $x_2$ was not greater than that of $x_1$; this difference was more evident for the SBIC criterion.

Monte Carlo Experiment When Neither of the Models Is Correctly Specified (Case 2)
We needed to analyse each of the situations defined in Case 2 of Section 2 separately, the two compared models being the misspecified versions of (2) and (3). The implementation of the experiments for cases 2.1 and 2.2.(a) required using the relationship between the true parameter vector ($\beta_0$) and each of the pseudo-true parameter vectors ($\gamma_*$ and $\delta_*$). In case 2.1, we should have been able to obtain the value of $\beta_0$ from them, satisfying the equality of density functions. Nevertheless, the nonlinear equation system that we needed to solve posed insurmountable problems. Given that we could not obtain the exact relationship, we needed to approximate the equalities of densities and likelihoods. To this end, we generated several DGPs, modifying the value of the parameter vector $\beta_0$ and the kind of explanatory variables. Again, the intercept was fixed at a value of −2, while $\beta_1$ and $\beta_2$ took values in the range (−2, 2) in steps of 0.5; additionally, they took the values 3, 4, 5 and 7. As a result of this strategy, we generated 132 different DGPs. Each of the outlined DGPs leads to a specific relationship between models M1 and M2: equivalent models (with identical or non-identical densities) or non-equivalent models. A sketch of how such a parameter grid could be enumerated is given below.
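This is only a sketch of the raw grid; how the paper combines the grid with the different kinds of explanatory variables and screens it down to 132 DGPs is not reproduced here.

```python
import numpy as np

# Enumerate the beta values described above: (-2, 2) in 0.5 steps plus
# {3, 4, 5, 7}, with the intercept fixed at -2. The raw grid of
# (beta_1, beta_2) pairs is larger than the 132 retained DGPs.
values = np.concatenate([np.arange(-2.0, 2.01, 0.5), [3.0, 4.0, 5.0, 7.0]])
grid = [(-2.0, b1, b2) for b1 in values for b2 in values]
print(len(values), "values ->", len(grid), "raw (beta_1, beta_2) pairs")
```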
In order to classify the 132 experiments, we used three indicators: absel, the average of the absolute differences between the expected log-likelihoods (at the pseudo-true values) of both models; AM1, the number of observations in the sample for which the expected log-likelihood of M1 is larger than that of M2; and absdifden, the average of the absolute differences between the density functions (at the pseudo-true values) of both models. Firstly, we classified the experiments as containing equivalent models or containing non-equivalent models. Secondly, within the first group, we distinguished identical from non-identical densities. Finally, we classified the non-equivalent models depending on their closeness to the DGP. The following three steps were carried out:

Step 1. The two requirements for considering the models as equivalent are: (R.1) a value of absel close to zero, and (R.2) a value of AM1 close to N/2.

Step 2. To consider that two equivalent models have identical densities, a value of absdifden close to zero is required.
Step 3. If model M1 is better than M2, $N/2 < AM1 < N$ must be satisfied, while M2 will be better if $0 < AM1 < N/2$.

Each experiment was numbered from 1 to 132 and classified as type A, B or C, as we can see in Table 3. Those experiments with an extreme percentage of zeros in the sample of the binary variable were omitted. Table 4 presents the experiments with the lowest values of absel, the values of AM1 closest to 1000, and the values of absdifden nearest to zero. We concluded that experiments 27, 3 and 6 included equivalent models, with 27 and 3 having identical densities. The rest of the experiments corresponded to non-equivalent models, and we needed to classify them according to their closeness to the DGP. Given that AM1 was the adequate indicator for this, Table 5 shows all the experiments ordered from the highest to the lowest value of this measure.
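A sketch of how the three indicators could be computed for one experiment, given the DGP probabilities and each model's probabilities at its pseudo-true values (in practice, these would be approximated from large-sample estimates):

```python
import numpy as np

def indicators(p0, p1, p2):
    """Classification indicators for one experiment (illustrative sketch).

    p0, p1, p2: arrays with the DGP probabilities and the probabilities
    of M1 and M2 (at their pseudo-true values) for each observation.
    """
    # Expected log-likelihood contribution of model Mj at observation i:
    # E0[ln f_j] = p0*ln(p_j) + (1 - p0)*ln(1 - p_j).
    el1 = p0 * np.log(p1) + (1 - p0) * np.log(1 - p1)
    el2 = p0 * np.log(p2) + (1 - p0) * np.log(1 - p2)
    absel = np.mean(np.abs(el1 - el2))    # near 0 -> equivalent models
    am1 = int(np.sum(el1 > el2))          # near N/2 -> equivalent; > N/2 -> M1 better
    absdifden = np.mean(np.abs(p1 - p2))  # near 0 -> identical densities
    return absel, am1, absdifden
```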

Conclusions
Within the framework of overlapping binary models, we have studied the power of three model selection criteria: the well-known information criteria AIC and SBIC, and the $C_2$ criterion, based on the mean square error of prediction. As previously mentioned, two binary models are overlapping if both have the same functional form (both probit or both logit), with some common explanatory variables and some specific variables. In this article, we distinguished two cases: (i) at least one of the competing models is well-specified, and (ii) neither of them is correctly specified. This last case is an important aspect of our work because it is not commonly considered in empirical works.
From a theoretical point of view, we have classified the competing models as equivalent or non-equivalent. Once this classification had been carried out, the task was to define the requirement that a given criterion must satisfy to be considered adequate. Specifically, if two models are equivalent, the probability limits of a given criterion must be the same in both models; if one of them is better, its corresponding limit must be lower than that of the other model. The theoretical analysis confirmed that the criteria performed well in every situation, the only exception being $C_2$, which may fail to behave well in one specific setting. These theoretical results have been corroborated by a Monte Carlo experiment. The most complicated situation to simulate was, as we expected, the one where neither of the two models was well-specified. This situation can lead to three possibilities: equivalent models with identical densities, equivalent models with non-identical densities, and non-equivalent models.
In order to develop this part of the Monte Carlo exercise, we had to generate 132 different DGPs, leading to 132 different experiments. Each of the experiments corresponded to one of the three theoretical relationships mentioned above. To establish the specific relationship, we defined three indicators: (a) the average of the absolute differences between the expected log-likelihoods (at the pseudo-true values) of both models, denoted absel; (b) the number of observations in the sample where the expected log-likelihood in model M1 is larger than in M2, called AM1 (note that we have assumed that M1 is the closest to the DGP); and (c) the average of the absolute differences between the density functions (at the pseudo-true values) of both models, denoted absdifden.
The general conclusion is that the three criteria behaved well for overlapping binary models: when neither of the two competing models was well-specified, the criteria tended to choose the best of them, that is, the closest to the DGP. In the most commonly studied case, where at least one of the competing models was correct, our conclusion was that the criteria also performed well, as we expected. Furthermore, when both models were correct, the criteria tended to choose the most parsimonious model.
It is important to note that these criteria are also used when comparing an extensive set of models, correctly specified or misspecified. The criteria AIC, SBIC and $C_2$ allow us to rank them, with the correctly specified models, which will be equivalent to each other, in the first places. Among them, the first one (the selected model) will be the most parsimonious if we use the SBIC criterion. The misspecified models will be at the bottom of the ranking. This paper has focused on the restricted framework of overlapping binary models. In order to complete this analysis, future work should study the behaviour of AIC, SBIC and $C_2$ in the non-nested framework. Moreover, the wider context of multinomial dependent variables could be the aim of future research. Given that the MLE procedure is also applied to estimate these models, the formal expression of the IC would be quite straightforward, while $C_2$ would require a deeper analysis.