A Comparison Study on Criteria to Select the Most Adequate Weighting Matrix

The practice of spatial econometrics revolves around a weighting matrix, which is often supplied by the user on the basis of previous knowledge. This is the so-called W issue. The aprioristic approach is probably not the best solution although, at present, there are few alternatives for the user. Our contribution focuses on the problem of selecting a W matrix from among a finite set of matrices, all of them considered appropriate for the case at hand. We develop a new and simple method based on the entropy corresponding to the probability distribution estimated from the data. Other alternatives, common in current applied work, are also reviewed. The paper includes a large Monte Carlo study to calibrate the effectiveness of our approach compared to the others. A well-known case study is also included.


Introduction
Let us begin with a mantra: the weighting matrix, usually denoted by W, is the most characteristic element of a spatial model. Most scholars agree with this commonplace. In fact, spatial models deal primarily with phenomena such as spillovers, trans-boundary competition or cooperation, and flows of trade, migration and knowledge in complex networks. Rarely does the user know how these events operate in practice. Indeed, they are mostly unobservable phenomena which are, nevertheless, required to build the model. Traditionally, the gap has been bridged by supplying this information externally, in the form of a weighting matrix. As an additional remark, we should note that W is not the only way of dealing with this kind of unobservable ([1], for example, develop a latent-variables approach that does not need W), but it is the simplest.
Hays et al. [2] give a sensible explanation of the preference for a predefined W. Network analysts are more interested in the formation of networks, taking units' attributes and behaviors as given. This is spatial dependence due to selection, where relations of homophily and heterophily are crucial. Spatial econometricians are more interested in what they call "computing the effects of alters actions on ego's actions through the network"; in this case, the patterns of connectivity are taken as given. This form of spatial dependence is due to the influence between individuals, and the notions of contagion and interdependence are central. Therefore, if the network is predefined, why not supply it externally?
However, beyond this point, the W issue has been a frequent cause of dispute. In the early stages, terms such as "join" or "link" were very common (for instance, in [3] or [4]). The focus at that time was mainly on testing for the presence of spatial effects, for which the specification of a very detailed weighting matrix is not so important; contiguity, nearness, or rough measures of separation may be appropriate. As said, our intention is to offer new evidence to help the user select the most appropriate W matrix for the specification. Section 2 reviews four selection criteria that fit well into the problem of selecting a weighting matrix from among a finite set of them. Section 3 presents the main features of the Monte Carlo experiment, whose results are discussed in the fourth section. Section 5 revisits a well-known case study in the light of our findings. The sixth section concludes.

Criteria to Select a W Matrix from among a Finite Set
A general spatial panel data econometric model can be expressed as:

y_t = ρ1 W y_t + x_t β + W x_t θ + u_t;  u_t = ρ2 M u_t + ε_t,  (1)

where y_t is a (N × 1) vector of data for the explained variable in period t, t = 1, 2, ..., T; x_t is a (N × k) matrix of observed data for the explicative variables, assumed exogenous, in period t; u_t is a (N × 1) vector of error terms in period t and ε_t a vector of random terms, assumed to be normally distributed (this assumption can be relaxed). β, θ, ρ1 and ρ2 are unknown parameters; the last two are called spatial correlation parameters. Finally, W and M are two weighting matrices, specified exogenously, that channel the corresponding spill-over effects. Usually, the two matrices are forced to be equal. The model of (1) is called a Cliff-Ord specification. A Spatial Durbin Model, SDM, results from ρ2 = 0; a Spatial Lag Model, SLM, requires that θ = 0 and ρ2 = 0; a Spatial Durbin Error Model, SDEM, that ρ1 = 0; and a Spatial Error Model, SEM, that θ = 0 and ρ1 = 0. In view of model (1), the critical importance of the weighting matrices is clear. We really need these matrices, but there are few clues on how to build them in applied research; this results in the so-called W issue, partly reviewed in the first section. In recent decades, a very interesting literature has appeared that examines the problem of choosing a matrix from among a finite set of them, which is the target of this paper. First, we review the literature devoted to the J test and then we move to the selection criteria, Bayesian methods and a new procedure based on entropy.
We should recognize that there are other interesting procedures in the literature, such as the model boosting approach of Kostov [23] and the model averaging of Zhang and Yu [24], which are not used in our study for reasons of space and computational burden.
Anselin [25] was the first to raise, formally, the W issue, suggesting a Cox statistic in a framework of non-nested models. Leenders [26], on this basis, elaborates a J-test using classical augmented regressions. Later, Kelejian [27] extends the approach of [26] to a SAC model, with spatial lags of the endogenous variable and in the error terms, using Generalized Method of Moments, GMM, estimates. Piras and Lozano [28] confirm the adequacy of the J-test to compare different weighting matrices, stressing that we should make use of a full set of instruments to increase GMM accuracy. Burridge and Fingleton [29] show that the Chi-square asymptotic approximation for the J-test produces irregular results, excessively liberal or conservative, in a series of leading cases; they suggest a bootstrap resampling approach. Burridge [30] focuses on the propensity of the spatial GMM algorithm to deliver spatial parameter estimates lying outside the invertibility region which, in turn, affects the bootstrap; he suggests the use of a QML algorithm to remove the problem. Kelejian and Piras [31] extend the original version of [27] to account for all the available information, according to the findings of [28]. Finally, Kelejian and Piras [32] adapt the J test to a panel data setting with unobserved fixed effects and additional endogenous variables, which reinforces the adequacy of the GMM framework. Another milestone in the J test literature is Hagemann [33], who copes with the reversion problem arising from the lack of a well-defined null hypothesis in the test. He introduces the minimum J test, MJ. His approach is based on the idea that, if there is a finite set of competing models, only the model with the smallest J statistic can be the correct one. In that case, the J statistic will converge to the Chi-square distribution, but it will diverge if none of the models is correct. The author proposes a wild bootstrap to test whether the model with the minimum J is correct.
This approach has been applied by Debarsy and Ertur [20] to a spatial setting with good results.
In the Monte Carlo experiment that follows, we know that there is a correct model, so we are going to use only the first part of the procedure of [33]. Assuming a collection of M different weighting matrices, W = {W1; W2; ...; WM}, the MJ approach works as follows:

1. We need the estimates of the M models; in each case, the same equation is employed but with a different weighting matrix belonging to W. Following Burridge [30], and given that our interest lies in the small sample case, the models are estimated by ML.

2. For each model, we obtain the battery of J statistics as usual, after estimating, also by ML, the corresponding extended equations.

3. The chosen matrix is the one associated with the minimum J statistic. As said, we stop the procedure here, thus avoiding the wild bootstrap test of whether this matrix is, indeed, the correct one.
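The three steps above can be sketched in code. The fragment below is a deliberately simplified illustration, not the full procedure of [33]: the models are SLMs estimated by a grid search on the concentrated log-likelihood, the J statistic is approximated by the likelihood-ratio statistic for adding the rivals' fitted values to the null model, and the candidate matrices (a one-dimensional chain) and all numerical values are made up.

```python
import numpy as np

rng = np.random.default_rng(5)

def slm_max_ll(y, X, W, grid):
    """Max concentrated SLM log-likelihood over a rho grid, plus the fitted
    mean (I - rho W)^{-1} X beta at the maximizer."""
    n = len(y)
    best = None
    for rho in grid:
        A = np.eye(n) - rho * W
        Ay = A @ y
        beta = np.linalg.lstsq(X, Ay, rcond=None)[0]
        e = Ay - X @ beta
        sig2 = e @ e / n
        ll = -0.5 * n * (np.log(2 * np.pi * sig2) + 1) + np.linalg.slogdet(A)[1]
        if best is None or ll > best[0]:
            best = (ll, rho, beta)
    ll, rho, beta = best
    fitted = np.linalg.solve(np.eye(n) - rho * W, X @ beta)
    return ll, fitted

def min_j_select(y, X, mats, grid=np.linspace(-0.9, 0.9, 91)):
    """Simplified MJ rule: for each null matrix, J is the LR statistic for
    augmenting the null model with the rivals' fitted values; the matrix
    with the smallest J is selected."""
    fits = [slm_max_ll(y, X, W, grid) for W in mats]
    j_stats = []
    for m, W in enumerate(mats):
        rivals = np.column_stack([f[1] for l, f in enumerate(fits) if l != m])
        ll_aug, _ = slm_max_ll(y, np.column_stack([X, rivals]), W, grid)
        j_stats.append(2.0 * (ll_aug - fits[m][0]))
    return int(np.argmin(j_stats)), j_stats

# made-up candidates on a one-dimensional chain; data generated with mats[2]
n = 40
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

def rs(w):
    """Row-standardize a weights matrix."""
    return w / w.sum(axis=1, keepdims=True)

mats = [rs((d == 1).astype(float)),
        rs(np.where(d > 0, 1.0 / np.maximum(d, 1.0), 0.0)),
        rs(np.where((d > 0) & (d <= 2), 1.0 / np.maximum(d, 1.0), 0.0))]
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = np.linalg.solve(np.eye(n) - 0.6 * mats[2],
                    X @ np.array([1.0, 2.0]) + rng.standard_normal(n))
selected, j_stats = min_j_select(y, X, mats)
```

Because each augmented model nests its null on the same ρ grid, every J statistic is non-negative by construction; the matrix with the smallest one is retained.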
Another popular method for choosing between models relies on the so-called Information Criteria. Most are developed around a loss function, such as the Kullback-Leibler, KL, quantity of information, which measures the closeness of two density functions. One of them corresponds to the true probability distribution that generated the data, obviously not known; the other is the distribution estimated from the data. The criteria differ in the role assigned to the priors and in the way of solving the approximation to the unknown true density function [34]. The two most common procedures are the AIC [35] and the Bayesian BIC [36] criteria. The first compares the models on an equal basis whereas the second incorporates the notion of a null model. Both criteria are characterized by their lack of specificity, in the sense that the selected model is the closest to the true model, as measured by KL. We should note that, as indicated by Pötscher [37], a good global fit does not necessarily mean that the model is the best alternative to estimate the parameters of interest. AIC and BIC lead to simple expressions that depend on the accuracy of the ML estimation plus a penalty term related to the number of parameters entering the model; that is:

AIC = 2k − 2l(γ̂);  BIC = k log(n) − 2l(γ̂),  (2)

where l(γ̂) is the log-likelihood evaluated at the ML estimates, γ̂; k is the number of non-zero parameters in the model; and n is the number of observations. For the case that we are considering, the models differ only in the weighting matrix, so k and n are the same in every case. This means that the decision depends on the estimated log-likelihood or, what is the same, on the balance between the estimated variance and the Jacobian term. Please note that, for a standard SLM, we can write:

l(γ̂) ∝ log[(1/σ̂^n) |I − ρ̂W|],

with σ̂ the estimated standard deviation and ρ̂ the corresponding spatial dependence coefficient. To minimize either of the two statistics in (2), the objective is to maximize the concentrated estimated log-likelihood, l(γ̂).
In sum, the Information Criteria approach implies:

1. Estimate by ML the M models corresponding to each weighting matrix in W.

2. For each model, obtain the corresponding AIC statistic (BIC produces the same results).

3. Choose the matrix in the model with the minimum AIC statistic.
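The three steps above can be sketched in a few lines. The following fragment is a toy illustration, assuming an SLM estimated by a grid search over ρ on the concentrated log-likelihood; the three candidate matrices, built on a one-dimensional chain rather than the hexagonal lattice used later in the paper, and all numerical values are made up.

```python
import numpy as np

rng = np.random.default_rng(7)

def slm_conc_ll(y, X, W, grid=np.linspace(-0.9, 0.9, 181)):
    """Concentrated SLM log-likelihood maximized over a grid of rho values
    (beta and sigma^2 are profiled out analytically at each rho)."""
    n = len(y)
    ll_best = -np.inf
    for rho in grid:
        A = np.eye(n) - rho * W
        Ay = A @ y
        beta = np.linalg.lstsq(X, Ay, rcond=None)[0]
        e = Ay - X @ beta
        sig2 = e @ e / n
        ll = -0.5 * n * (np.log(2 * np.pi * sig2) + 1) + np.linalg.slogdet(A)[1]
        ll_best = max(ll_best, ll)
    return ll_best

def aic(y, X, W, k=4):
    # k (intercept, slope, rho, sigma^2) is identical for every candidate
    # matrix, so it does not alter the ranking
    return 2 * k - 2 * slm_conc_ll(y, X, W)

# three toy candidate matrices on a one-dimensional chain
n = 40
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

def rs(w):
    """Row-standardize a weights matrix."""
    return w / w.sum(axis=1, keepdims=True)

Wa = rs((d == 1).astype(float))                                       # contiguity
Wb = rs(np.where(d > 0, 1.0 / np.maximum(d, 1.0), 0.0))               # inverse distance
Wc = rs(np.where((d > 0) & (d <= 2), 1.0 / np.maximum(d, 1.0), 0.0))  # cutoff
mats = [Wa, Wb, Wc]

# simulate an SLM whose true matrix is Wc, then rank the candidates by AIC
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = np.linalg.solve(np.eye(n) - 0.6 * Wc,
                    X @ np.array([1.0, 2.0]) + rng.standard_normal(n))
aics = [aic(y, X, W) for W in mats]
best = int(np.argmin(aics))
```

With k and n constant across candidates, the ranking reduces to the concentrated log-likelihoods, exactly as noted after Equation (2).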
An important part of the recent literature on spatial econometrics has Bayesian grounds, and this extends also to the topic of choosing a weighting matrix. The Bayesian framework is well equipped to cope with this type of problem, using the concept of posterior probability as the basis for taking a decision. There are excellent reviews in [38][39][40], Besag and Higdon [41] and, especially, [42][43][44]. For the sake of completeness, let us highlight the main points of this approach.
The analysis is made conditional on a model, which is not under discussion. Moreover, we have a collection of M weighting matrices in W; a set of k parameters in γ, some of which are of dispersion, σ, others of position, β, and others of spatial dependence, ρ and θ; and a panel data set with nT observations in y. The point of departure is the joint probability of data, parameters and matrices:

P(y, γ, W_m) = L(y | γ; W_m) π(γ | W_m) π(W_m),  (3)

where W_m ∈ W, π(·) are the prior distributions and L(y | γ; W_m) is the likelihood of y conditional on the parameters and the matrix. Bayes' rule leads to the joint posterior probability of matrices and parameters:

P(γ, W_m | y) = P(y, γ, W_m) / P(y),  (4)

whose integration over the space of parameters, γ ∈ Υ, produces the posterior probability of matrix W_m:

P(W_m | y) = ∫_Υ P(γ, W_m | y) dγ.  (5)

The presence of spatial structures in the model complicates the resolution of (5), which requires numerical integration. The priors are always a point of concern and, usually, practitioners prefer diffuse priors. LeSage and Pace [42] (Section 6.3) suggest π(W_m) = 1/M ∀m; a NIG conjugate prior for β and σ, where π(β | σ) ∼ N(β0; σ²(κX′X)⁻¹), with X the matrix of the exogenous variables in the model, and π(σ) an inverse gamma, IG(a, b). For the parameter of spatial dependence, they suggest a Beta(d, d) distribution, with d the amplitude of the sampling space of ρ. The defaults in the MATLAB codes of LeSage [45] are β0 = 0, κ = 10⁻¹² and a = b = 0. In brief, the Bayesian approach implies the following:

1. Specify the priors for all the terms appearing in the equation. On this point, we have followed the suggestions of [42].

2. For each matrix, obtain the corresponding posterior probability of (5), for which we need to (i) solve the ML estimation of the corresponding model and (ii) solve the numerical integration of (5).

3. The matrix chosen will be that associated with the highest posterior probability.
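A minimal numerical illustration of these steps is sketched below. To keep it short, it replaces the NIG/Beta priors of [42] with flat priors and profiles β and σ out of the likelihood, so the integral in (5) collapses to a one-dimensional grid sum over ρ; this is a crude stand-in for the full Bayesian integration, and the candidate matrices and data are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

def slm_ll_grid(y, X, W, grid):
    """Concentrated SLM log-likelihood evaluated on a grid of rho values."""
    n = len(y)
    out = np.empty(len(grid))
    for i, rho in enumerate(grid):
        A = np.eye(n) - rho * W
        Ay = A @ y
        beta = np.linalg.lstsq(X, Ay, rcond=None)[0]
        e = Ay - X @ beta
        sig2 = e @ e / n
        out[i] = -0.5 * n * (np.log(2 * np.pi * sig2) + 1) + np.linalg.slogdet(A)[1]
    return out

def posterior_probs(y, X, mats, grid=np.linspace(-0.9, 0.9, 181)):
    """Crude grid approximation to P(W_m | y): flat priors, with beta and
    sigma profiled out (a rough stand-in for the analytic integration of (5))."""
    step = grid[1] - grid[0]
    log_m = []
    for W in mats:
        lls = slm_ll_grid(y, X, W, grid)
        c = lls.max()                      # subtract the max to avoid overflow
        log_m.append(c + np.log(np.sum(np.exp(lls - c)) * step))
    log_m = np.array(log_m)
    p = np.exp(log_m - log_m.max())
    return p / p.sum()

# toy candidates on a chain; data generated with the third matrix
n = 40
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

def rs(w):
    return w / w.sum(axis=1, keepdims=True)

mats = [rs((d == 1).astype(float)),
        rs(np.where(d > 0, 1.0 / np.maximum(d, 1.0), 0.0)),
        rs(np.where((d > 0) & (d <= 2), 1.0 / np.maximum(d, 1.0), 0.0))]
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = np.linalg.solve(np.eye(n) - 0.6 * mats[2],
                    X @ np.array([1.0, 2.0]) + rng.standard_normal(n))
probs = posterior_probs(y, X, mats)
```

The matrix with the largest entry of `probs` would be the one retained in step 3.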
This paper advocates a selection procedure based on the notion of entropy, which is used as a measure of the information contained in a probability distribution. Let us assume a univariate continuous variable, y, whose probability density function is p(y); then, entropy is defined as:

h(y) = −∫_I p(y) log p(y) dy,

with I the domain of the random variable y. As is well known, higher entropy means less information and more uncertainty about y. Our case fits Shannon's framework [46]: we observe a random variable, y, and there is a finite set of rival distribution functions capable of having generated the data. Our decision problem is well defined because each distribution function differs from the others only in the weighting matrix; the other elements are the same. However, we are not interested in the Laplacian principle of indifference (select the density with maximum entropy, the least informative one, to avoid introducing uncertain information). Quite the opposite: in our case there is no uncertain information and we are looking for the most informative probability distribution, so our objective is to minimize entropy.
As with the other three cases, the application of this principle requires the complete specification of the distribution function, which means that the user knows the form of the model (Equations (8)-(10) below, except for the W matrix); additionally, we assume a Gaussian distribution. Next, we should recall that, for the case of an (n × 1) multivariate normal random variable, y ∼ N(µ; Σ), the entropy is:

h(y) = (1/2) [n + log((2π)^n |Σ|)].

This measure does not depend, directly, on first-order moments (the position parameters of the model) but on second-order moments (the dependence and dispersion parameters). For example, in the case of the SLM of (10) below, the covariance matrix of y is V(y) = B⁻¹V(u)B⁻¹′, with B = (I − ρW). If u is indeed a white noise random term with variance σ², the covariance matrix of y is simply V(y) = σ²(B′B)⁻¹ and the entropy is:

h(y) = (1/2) [n + log((2π)^n |σ²(B′B)⁻¹|)].

Let us note that the covariance matrix of y in the SDM of (8) coincides with that of the SLM case. The covariance matrix for the SDEM equation is V(y) = σ²(B′B)⁻¹, with B built on the error weighting matrix; everything else remains the same.
To apply the entropy criterion, we must go through the following steps:

1. Estimate each one of the M versions of the model that we are considering. As said, each model differs only in the weighting matrix. We obtain the ML estimates for the reasons given above.

2. For each model, obtain the corresponding value of the entropy statistic, h_m; m = 1, 2, ..., M.

3. Choose the weighting matrix corresponding to the model with the minimum value of entropy.
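These steps can be sketched directly from the Gaussian entropy formula above. The fragment below, a toy example with made-up chain matrices, estimates each SLM by a grid search and evaluates h(y) at the ML estimates, working with log-determinants for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(11)

def slm_fit(y, X, W, grid=np.linspace(-0.9, 0.9, 181)):
    """Grid-search ML estimates (rho, sigma^2) for a SLM."""
    n = len(y)
    best = (-np.inf, 0.0, 1.0)
    for rho in grid:
        A = np.eye(n) - rho * W
        Ay = A @ y
        beta = np.linalg.lstsq(X, Ay, rcond=None)[0]
        e = Ay - X @ beta
        sig2 = e @ e / n
        ll = -0.5 * n * (np.log(2 * np.pi * sig2) + 1) + np.linalg.slogdet(A)[1]
        if ll > best[0]:
            best = (ll, rho, sig2)
    return best[1], best[2]

def slm_entropy(y, X, W):
    """h(y) = 0.5*(n + log((2 pi)^n |Sigma|)), Sigma = sig2*(B'B)^{-1},
    B = I - rho*W, evaluated at the ML estimates."""
    n = len(y)
    rho, sig2 = slm_fit(y, X, W)
    logdet_B = np.linalg.slogdet(np.eye(n) - rho * W)[1]
    logdet_Sigma = n * np.log(sig2) - 2.0 * logdet_B
    return 0.5 * (n + n * np.log(2 * np.pi) + logdet_Sigma)

# toy candidates on a chain; data generated with the third matrix
n = 40
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

def rs(w):
    return w / w.sum(axis=1, keepdims=True)

mats = [rs((d == 1).astype(float)),
        rs(np.where(d > 0, 1.0 / np.maximum(d, 1.0), 0.0)),
        rs(np.where((d > 0) & (d <= 2), 1.0 / np.maximum(d, 1.0), 0.0))]
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = np.linalg.solve(np.eye(n) - 0.6 * mats[2],
                    X @ np.array([1.0, 2.0]) + rng.standard_normal(n))
entropies = [slm_entropy(y, X, W) for W in mats]
chosen = int(np.argmin(entropies))
```

Minimizing h(y) plays a role analogous to maximizing the concentrated log-likelihood, but it works through the estimated covariance matrix rather than the fit itself.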

Description of the Monte Carlo Study
This part of the paper is devoted to the design of the Monte Carlo experiment conducted to calibrate the performance of the four criteria presented so far for selecting W: the MJ procedure, the Bayesian approach, the AIC criterion and the entropy measure. The objective of the analysis is to identify the most reliable method for selecting the most adequate weighting matrix for a spatial model, given the data for the variables and the form of the model itself. The parameters are also unknown to the user and must be estimated. In this context, if the matrix is misspecified, the estimated parameters will be biased, which will impact the four criteria described in Section 2 in different ways. Our Monte Carlo study generates sequences of data for the explained and explicative variables, under different scenarios, and applies the four criteria to select the (unknown) W. Moreover, our focus is on small sample sizes. As will become clear soon, the four criteria behave well even in small samples, so it is not necessary to employ very large sample sizes.
We are going to simulate a panel setting with three of the most common Data Generating Processes, DGPs in what follows, in the applied literature on spatial econometrics; namely, the Spatial Durbin Model, SDM, of (8), the Spatial Durbin Error Model, SDEM, in expression (9), and the Spatial Lag Model, SLM, of (10). The main conclusions can be extended to other processes, such as the Spatial Error Model, which are not replicated here (details available on request from the authors).
Only one exogenous regressor, the x variable, appears on the right-hand side of the equations; its observations are drawn from a normal distribution, x_it ∼ N(0; σ²_x), with σ²_x = 1. The same applies to the error terms, ε_it ∼ N(0; σ²_ε). The two variables are unrelated, E(x_it ε_it) = 0. Our space is made of hexagonal cells arranged regularly, one next to the other, without discontinuities or empty spaces.
A weighting matrix appears in the three equations but is not observable, so the user must take decisions in order to continue with the analysis. The problem consists in choosing one matrix from among a finite set of alternatives, which in our simulation comprises three candidates: W1 is built using the traditional contiguity criterion between spatial units; the weights in W2 are the inverse of the distance between the centroids of the spatial units, W2 = {ω_ij = 1/d_ij; i ≠ j}; whereas W3 incorporates a cut-off point to the connections in W2, so that W3 = {ω_ij = 1/d_ij if j ∈ N4(i); 0 otherwise}, with N4(i) the set of the 4 nearest neighbors of i. To keep things simple, the same weighting matrix appears with the endogenous and exogenous variables in (8) and with the exogenous and error terms in (9). Following usual practice, every matrix has been row-standardized, which implies that the three matrices are non-nested. In what follows, we will use W3 as the true matrix.
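As an illustration of how such a candidate set can be constructed, the sketch below builds the three matrices on a small square lattice; the square grid is an assumption made here to keep the code short, since the simulation in the paper uses a hexagonal lattice.

```python
import numpy as np

def candidate_matrices(side=5):
    """Three row-standardized candidate matrices on a side x side square
    lattice (the paper uses a hexagonal lattice; a square grid keeps the
    sketch short)."""
    n = side * side
    # centroid coordinates of each cell
    coords = np.array([(i // side, i % side) for i in range(n)], dtype=float)
    # pairwise Euclidean distances between centroids
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)

    # W1: binary (rook) contiguity -- cells at unit distance on the grid
    w1 = np.isclose(d, 1.0).astype(float)

    # W2: inverse distance with a zero diagonal
    w2 = np.zeros_like(d)
    mask = d > 0
    w2[mask] = 1.0 / d[mask]

    # W3: inverse distance restricted to the 4 nearest neighbours
    # (ties among equidistant cells are broken arbitrarily by argsort)
    w3 = np.zeros_like(d)
    for i in range(n):
        nearest = np.argsort(np.where(d[i] > 0, d[i], np.inf))[:4]
        w3[i, nearest] = 1.0 / d[i, nearest]

    # row-standardization, as in the paper
    return [w / w.sum(axis=1, keepdims=True) for w in (w1, w2, w3)]

W1, W2, W3 = candidate_matrices()
```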
In sum, each case consists of the following steps:

• The spatial equation is assumed to be known, so the model can be estimated by maximum likelihood, ML, once the user supplies a W matrix.

• Compute the four selection criteria, MJ, posterior probability, entropy and AIC, for the three alternative weighting matrices for each draw.

• Select the corresponding matrix according to each criterion and compare the result with the true matrix (W3) in the DGP.

• The process has been replicated 1000 times.
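A miniature version of this experiment can be sketched in a few lines. The fragment below makes several simplifying assumptions relative to the paper: only the AIC criterion, an SLM DGP, a ring lattice instead of hexagons, n = 30 and 20 replications instead of 1000.

```python
import numpy as np

rng = np.random.default_rng(42)

def ring_matrices(n):
    """Three simple row-standardized candidates on a ring (made-up stand-ins
    for the hexagonal-lattice matrices of the paper)."""
    idx = np.arange(n)
    d = np.minimum((idx[:, None] - idx[None, :]) % n,
                   (idx[None, :] - idx[:, None]) % n).astype(float)

    def build(f, mask):
        w = np.zeros((n, n))
        m = mask & (d > 0)
        w[m] = f(d[m])
        return w / w.sum(axis=1, keepdims=True)

    W1 = build(np.ones_like, d == 1)        # contiguity
    W2 = build(lambda x: 1.0 / x, d > 0)    # inverse distance
    W3 = build(lambda x: 1.0 / x, d <= 2)   # cutoff: 4 nearest on a ring
    return [W1, W2, W3]

def aic_slm(y, X, W, grid=np.linspace(-0.9, 0.9, 91)):
    """AIC of an SLM, with the log-likelihood concentrated over a rho grid."""
    n = len(y)
    ll = -np.inf
    for rho in grid:
        A = np.eye(n) - rho * W
        Ay = A @ y
        beta = np.linalg.lstsq(X, Ay, rcond=None)[0]
        e = Ay - X @ beta
        s2 = e @ e / n
        ll = max(ll, (-0.5 * n * (np.log(2 * np.pi * s2) + 1)
                      + np.linalg.slogdet(A)[1]))
    # k is constant across candidates, so it is irrelevant for the ranking
    return 2 * 4 - 2 * ll

n, R, rho_true = 30, 20, 0.7
mats = ring_matrices(n)
true = 2                    # W3 plays the role of the true matrix
hits = 0
for _ in range(R):
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = np.linalg.solve(np.eye(n) - rho_true * mats[true],
                        X @ np.array([1.0, 2.0]) + rng.standard_normal(n))
    hits += int(np.argmin([aic_slm(y, X, W) for W in mats]) == true)
rate = hits / R             # empirical proportion of correct selections
```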
Please note that the selection of the matrix is made conditional on a correct specification of the equation. We are perfectly aware that this dichotomy is artificial; in fact, both decisions are intimately related because the same matrix, in different equations, plays different roles and bears different information. However, this point is not further developed in the present paper. To give some intuition, we include the results corresponding to the case of a wrong specification (i.e., estimating a SDM model whereas the true model in the DGP is a SDEM). MATLAB codes to replicate these simulations are freely downloadable from https://sites.google.com/site/mherreragomez/principal/Codes.

Results of the Monte Carlo Study
This section summarizes the results obtained in the Monte Carlo simulation described previously and, we must admit, they are a bit surprising: in strictly quantitative terms, the AIC and the entropy measures are the best criteria. More strikingly, according to our results the Bayesian approach, although it does well in general, is clearly only the third-best criterion. Finally, the MJ approach is the worst alternative among the four candidates. The last two observations are puzzling given the strong support that the two procedures have received in recent decades. Table 1 presents the percentage of correct selections attained by each criterion after aggregating all the cases in our simulation. Each percentage accumulates 126,000 items. A number in bold indicates that the respective criterion reaches the maximum rate of correct selections. Entropy dominates at the extremes of the range of values for the spatial dependence coefficient, whereas AIC is the best for medium to low values of ρ. The differences between the two are always lower than 3.5 percentage points (in fact, the average proportions of correct selections are statistically equal with a confidence of 99%). Bayes is a good criterion for medium to large values of ρ, but its performance weakens for small values of this parameter (in fact, it is fourth at ±0.2). Finally, the curve of correct selections of the MJ is too flat. Figure 1 disaggregates the accumulated percentages by number of spatial units, left, and number of cross-sections, right. Please note that, in each case, the data represent aggregated percentages (i.e., in the case n = 25 we aggregate the three cross-sections corresponding to T = 1, T = 5 and T = 10). These figures ratify the ordering set out above. The behavior of the MJ criterion is striking: its curves of correct selections are very flat, with unexpected drops at the extremes of the interval for ρ when the sample size (n or T) increases.
The other three criteria, as expected, react positively to the sample size and to higher values of ρ. Apparently, the improvement is more relevant along the time dimension, T, than along the cross-sectional size, n, especially for high values of the spatial coefficient. Finally, there is a certain asymmetry in all the curves. Tables 2-5 present the details by type of DGP. A quick look at the tables reveals that the bold percentages are concentrated mainly in the entropy and AIC columns.
The prevalence of both criteria is quite regular across the four cases shown in the tables. The preference extends to correctly specified models, as in Tables 2-4, and also to misspecified equations, as in Table 5; to negative and, especially, positive values of the spatial coefficient; to small and large numbers of individuals in the sample (n); and to short and long panels (T). Overall, entropy attains the highest rate in 46% of the 144 cases in Tables 2-5, followed by AIC, 35%, Bayes, 12%, and MJ, 7%. The complete set of results for the 756 different experiments in the Monte Carlo study (3 values of n, 3 of T, 6 of ρ, 2 of β, 2 of θ and four configurations for the DGP/estimated equation pair; note that the parameter θ does not intervene in the SLM equation) appears in Tables A1-A12 in the Appendix. We want to stress the good results attained in the case of small samples (n = 25 and T = 1), where the average rate of correct selections for entropy and AIC is usually above 30% (a little worse for the other two criteria). Very often, the percentage exceeds 70% at the extremes of the spatial parameter interval, ±0.8. The average rate increases up to 75-80% for the case of n = 25 and T = 5 and continues improving when T = 10, where most cases have a rate of correct selections above 90%. In general, the rate of correct selections is nearly 100% using 5 to 10 cross-sections.
In a similar vein, increasing the cross-sectional size, n, while the number of cross-sections, T, remains constant also has positive effects on the four criteria. The rate of correct selections for the case of a hundred spatial units is above 70%, on average, for the case of a single cross-section (T = 1). These percentages improve quickly once the time dimension of the panel increases; it is also clear that the improvement depends on the type of DGP (stronger for the SDM and SLM models and weaker for the SDEM and for the misspecified equation case).
The value of parameter β1, as expected, has only a slight impact on the four criteria; on the contrary, the sign of θ1 plays a crucial role in the SDEM and SDM cases. Another interesting feature is the asymmetry of the selection curves, which tends to be diluted as T grows. Negative spatial dependence helps to detect the correct weighting matrix, especially when the number of time cross-sections is small. The asymmetry is evident for entropy, Bayes and AIC, but more diffuse in the MJ case, which remains highly inelastic to the value of ρ.
To complete the picture, we estimate a response surface for each DGP/estimated equation combination, with the aim of modelling the empirical probability of choosing the correct weighting matrix for each criterion, p̂_i. As usual, a logit transformation of the empirical probabilities is carried out, so the estimated equation is:

p*_i = log[(p̂_i + (2r)⁻¹) / (1 − p̂_i + (2r)⁻¹)] = η + z_i φ + ξ_i,

where p*_i is the logit transformation, often known simply as the logit; r is the number of replications of each experiment (1000 in all cases); the (2r)⁻¹ term assures that the logit is defined even when the probability of correct selection is 0 or 1 [47]; η is an intercept term; z_i contains the columns of the design matrix that reflect the conditions of each experiment; φ is a vector of parameters; and ξ_i is the error term, assumed to be independent and identically distributed (this assumption is reasonable if all experiments come from the same study, as ours, and are obtained under identical circumstances; [48]). Recall that the number of observations in the response-surface equations is 216 (so i = 1, 2, ..., 216), except for the SLM case, where the number of observations is 108. Table 6 shows the results for the four DGP/estimated equation combinations.
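The logit transformation and the least-squares fit of the response surface can be sketched as follows; the design points and empirical probabilities below are made up for illustration and do not come from the paper's experiments.

```python
import numpy as np

def logit_transform(p_hat, r=1000):
    """Adjusted logit of an empirical selection probability; the (2r)^-1
    correction keeps the transform finite when p_hat equals 0 or 1."""
    c = 1.0 / (2 * r)
    return np.log((p_hat + c) / (1.0 - p_hat + c))

# tiny illustrative response surface: regress the logits on |rho| and log(T).
# The design values and probabilities are invented for the example.
abs_rho = np.array([0.2, 0.4, 0.8, 0.2, 0.4, 0.8])
log_T = np.log(np.array([1, 1, 1, 10, 10, 10]))
p_hat = np.array([0.35, 0.55, 0.90, 0.60, 0.85, 0.99])

Z = np.column_stack([np.ones(6), abs_rho, log_T])   # intercept + design
p_star = logit_transform(p_hat)
phi, *_ = np.linalg.lstsq(Z, p_star, rcond=None)    # OLS fit of the surface
```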
In general, the estimates confirm the previous findings. The main factor influencing the empirical probability of choosing the correct weighting matrix is the spatial parameter, in absolute value, |ρ|, in Table 6. Its contribution is crucial in all cases, without exception, and occurs in the expected direction: higher values of |ρ| facilitate the selection of the correct weighting matrix. The second most influential factor is the parameter θ, associated with the spatial spillovers. Its impact is also beneficial in all cases, though it appears to be more important for the Bayesian and MJ criteria. Sample size is crucial, and T has a relatively higher impact than n. Finally, as said before, the parameter β1 is not significant in any circumstance except the SLM case; this means that the signal-to-noise ratio should not be a major factor to consider when the problem is selecting the best weighting matrix. Table 7 completes the response-surface analysis with the F tests of equality of the coefficients in the estimates of Table 6. According to the sequence of F tests, the most dissimilar method is the MJ approach, followed by Bayes. On the other hand, entropy and AIC emerge as quasi-similar strategies to compare weighting matrices, almost indistinguishable across the four types of DGPs.

Empirical Application: Ertur and Koch (2007)
The case study in this section is based on a well-known economic growth model estimated by Ertur and Koch [49] using a cross-section of 91 countries for the period 1960-1995. The purpose of this section is to illustrate the use of the selection procedures discussed above.
Ertur and Koch [49] build a growth equation to model technological interdependence between countries using spatial externalities. The main hypotheses of interaction are that the stock of knowledge in one country produces externalities that cross-national borders and spill over into neighboring countries, with an intensity which decreases with distance. The authors use a geographical distance measure.
The benchmark model assumes an aggregate Cobb-Douglas production function with constant returns to scale in labor and physical capital:

Y_i(t) = A_i(t) K_i(t)^α L_i(t)^(1−α),

where Y_i(t) is output, K_i(t) is the level of reproducible physical capital, L_i(t) is the level of labor in period t, and A_i(t) is the aggregate level of technology, specified as:

A_i(t) = Ω(t) k_i(t)^φ ∏_{j≠i} A_j(t)^(δ ω_ij).

The aggregate level of technology A_i(t) in country i depends on three elements. First, a certain proportion of technological progress is exogenous and identical in all countries: Ω(t) = Ω(0)e^(µt), where µ is a constant rate of technological growth. Second, each country's aggregate level of technology increases with the aggregate level of physical capital per worker, k_i(t)^φ = (K_i(t)/L_i(t))^φ, with the parameter φ ∈ [0; 1] capturing the strength of home externalities from physical capital accumulation. Finally, the third term captures the external effects of knowledge embodied in capital located in other countries, whose impact crosses national borders with a diminishing intensity, δ ∈ [0; 1]. The terms ω_ij represent the connectivity between country i and its neighbors; these weights are assumed to be exogenous, non-negative, and finite.
Following Solow, the authors assume that a constant fraction of output, s_i, is saved in every country i, and that labor grows exogenously at the rate l_i. They also assume a constant and identical annual rate of depreciation of physical capital for all countries, denoted τ (usually set equal to 0.05 in the literature). The evolution of output per worker in country i is governed by the usual fundamental dynamics of the Solow equation which, after some manipulation, leads to the steady-state real income per worker given in [49] (p. 1038, Equation (9)). This is a spatially augmented Solow model and coincides with the predictor obtained by Solow once the spill-over effects are added. In terms of spatial econometrics, we have a Spatial Durbin Model, SDM, in which the log of real income per worker depends on the investment rate, the population growth rate, their spatial lags and the spatial lag of income (Equation (15)). Equation (15) is estimated using information on real income, investment and population growth for a sample of 91 countries for the period 1960-1995. Regarding the spatial weighting matrix, [49] consider two geographical distance functions: the inverse of the squared distance (their main hypothesis) and the negative exponential of the squared distance (to check the robustness of the specification). We also consider a third matrix based on the inverse of the distance.
Let us call the three row-standardized weighting matrices W1, W2 and W3, with weights (before standardization):

W1: ω_ij = d_ij⁻²;  W2: ω_ij = e^(−d_ij²);  W3: ω_ij = d_ij⁻¹,  for i ≠ j,

with d_ij the great-circle distance (i.e., the shortest distance between two points on the surface of a sphere) between the capitals of countries i and j. The authors analyze several specifications, checking different theoretical restrictions and alternative spatial equations. We concentrate our revision on the so-called non-restricted equation, in the sense that it includes more coefficients than advised by theory. Table 8 presents the SDM version of this equation using the three alternative weighting matrices specified before (the first two columns coincide with those in Table I, columns 3-4, p. 1047, of [49]). The last four rows of the table show the values of the selection criteria corresponding to each case.
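A sketch of the construction of these distance-based matrices is given below; the haversine formula is used for the great-circle distance, and the four capital coordinates are made-up examples rather than the actual 91-country sample of [49].

```python
import numpy as np

def great_circle(lat1, lon1, lat2, lon2, R=6371.0):
    """Great-circle distance in km via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

# illustrative capital coordinates (lat, lon): a made-up four-city subset
coords = [(48.9, 2.3), (40.4, -3.7), (52.5, 13.4), (41.9, 12.5)]
n = len(coords)
d = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            d[i, j] = great_circle(*coords[i], *coords[j])

dk = d / 1000.0   # thousands of km, so exp(-d^2) does not underflow to zero
off = d > 0

def row_standardized(f):
    """Apply weight function f to off-diagonal distances, then row-standardize."""
    w = np.zeros((n, n))
    w[off] = f(dk[off])
    return w / w.sum(axis=1, keepdims=True)

W1 = row_standardized(lambda x: 1.0 / x ** 2)     # inverse squared distance
W2 = row_standardized(lambda x: np.exp(-x ** 2))  # neg. exponential of squared distance
W3 = row_standardized(lambda x: 1.0 / x)          # inverse distance
```

Note the rescaling of distances before exponentiation: raw distances in km would drive exp(−d²) to zero and make the row-standardization of W2 ill-defined.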
The model preferred by [49] is the SDM/W1, which coincides with the selection made by minimum entropy, the Bayesian posterior probability and AIC. The MJ approach selects W2.
Other results in [49] refer to the Spatial Error Model version of the steady-state equation of (14), or SEM model. The intention of the authors is to visualize the presence of spatial correlation in the traditional non-spatial Solow equations; we use this case as an example of the selection of weighting matrices in misspecified models. The main results appear in Table 9 (which can be compared with columns 2-3 of Table II in [49] (p. 1048)). The selection of the most adequate W matrix does not change. Using the entropy criterion, we select the model involving matrix W1, as with the Bayesian approach and the AIC criterion; MJ continues to select W2.

Conclusions
Much of the applied spatial econometrics literature seems to prefer an exogenous approximation to the W matrix. Implicitly, it is assumed that the user has relevant knowledge of the way individuals in the sample interact. In recent years, a new literature advocates a more data-driven approach to the W issue. We strongly support this tendency, which should become dominant in the future; however, our focus in this study is on the exogenous approach.
The problem posed in the paper is very frequent in applied work: the user has a finite collection of weighting matrices, all of them coherent with the case under study, and needs to select one of them. Which is the best W? We can address this question using different proposals: the Bayesian posterior probability, the J approach with all its variants, simple model selection criteria, such as AIC or BIC, and several other alternatives not used in this study. We add a fourth one, based on the entropy of the estimated distribution function. This new criterion, h(y), is a measure of uncertainty and fits well with the W decision problem. The h(y) statistic depends on the estimated covariance matrix of the corresponding model, offering a more complete picture of the suitability of the distribution function (associated with a particular choice of W) for dealing with the data at hand.
The conclusions of our Monte Carlo experiment are very illuminating. First, we can confirm that it is possible to identify, with confidence, the true weighting matrix (if it really exists); in this sense, the selection criteria do a good job. However, the four criteria should not be regarded as interchangeable, especially in samples of small size (n or T). The ordering is clear: entropy and AIC come first, with the Bayesian posterior probability doing slightly worse; the MJ always appears in the last position. As shown in the paper, the value of the spatial parameter has a great impact on guaranteeing a correct selection, but this aspect is unobservable to the researcher. However, the user effectively controls the amount of information involved in the exercise, and this is also a key factor. The advice is clear: use as much information as you have, because the quality of the decision improves with the amount of information. Once again, the way the information accrues is not neutral: the length of the time series in the panel is more relevant than the number of cross-sectional units in the sample.
Our final recommendation for applied researchers is to attend to the adequacy of the weighting matrix and, when there are several candidates, to take the decision using well-defined criteria such as those examined in the paper. The case study presented in Section 5 illustrates the procedure.
As avenues for future research, let us mention the possibility of combining different matrices into a single one, as pursued in model averaging or in fuzzy logic, which offers new, flexible alternatives. Moreover, this study assumes that the user knows the form of the equation, which is, very often, an unrealistic assumption. This constraint poses new challenges and calls for a more general framework in which both the model and the matrix are chosen. It is clear that not all four criteria are well equipped to work in these new scenarios.

Conflicts of Interest:
The authors declare no conflict of interest.