Modelling and Diagnostics of Spatially Autocorrelated Counts

: This paper proposes a new spatial lag regression model which addresses global spatial autocorrelation arising from cross-sectional dependence between counts. Our approach offers an intuitive interpretation of the spatial correlation parameter as a measurement of the impact of neighbouring observations on the conditional expectation of the counts. It allows for ﬂexible likelihood-based inference based on different distributional assumptions using standard numerical procedures. In addition, we advocate the use of data-coherent diagnostic tools in spatial count regression models. The application revisits a data set on the location choice of single unit start-up ﬁrms in the manufacturing industry in the US.


Introduction
Our paper seeks to address global spatial autocorrelation arising form cross-sectional dependence between counts, a research area which constitutes a fundamental challenge in the spatial econometrics literature according to Billé and Arbia (2019). We propose a spatial autoregressive specification for the expected mean of counts at location i as a function of its j neighbouring region counts, an approach which is in the spirit of the class of observationdriven models of Cox (1981). It allows us to obtain a reduced form of the conditional mean function of the counts as a function of the so-called Leontief inverse or spatial multiplier as in the spatial autoregressive (SAR) model for continuous (dependent) variables. We contrast our model with the well-received SAR-Poisson count model of Lambert et al. (2016), which also employs a spatial autoregressive structure for the expected mean of counts at location i, but relates this to its j neighbouring regions' Poisson log-intensities. We argue that this modelling of the spatial dependence between neighbouring regions is less intuitive in terms of interpretation and is not directly in the spirit of a continuous SAR model, because the intensity is an inherently unobserved measure while the counts are observable. Moreover, in the SAR-Poisson count model, the exogenous regressors have to explain the spatial dependence in the data fully, as there is no unexplained part in the intensity modelled explicitly. Lambert et al. (2016) also derived a full-information maximum likelihood approach for parameter estimation, which, however, suffers from numerical difficulties mainly caused by the need to invert a transformation of the n × n spatial weight matrix, where n denotes the sample size. As an alternative, the authors suggested a two-step estimation framework for their model, with an ad hoc specification to deal with the problem of computing the logarithm from zero counts 1 .
Our proposed spatial count regression model addresses all the issues raised and aims at supplying an alternative, which provides a useful interpretation of the model's parameters for the empirical economist and is straightforwardly to estimate using a likelihood-based approach with standard numerical procedures.
Our contribution extends the literature on econometric models for spatially correlated count in two further directions: a more flexible functional form assumption to cope with the empirically highly relevant phenomenon of overdispersion, and the introduction of diagnostic tools and model validation methods. In non-spatial count data regression, the Poisson model is regularly not able to deal with unobserved heterogeneity in the data. One way to overcome this is to allow for a more flexible variance function as, e.g., in the negative binomial (NB) regression model. We propose to allow for such flexibility in spatial count data models as well and employ a negative binomial variant of our proposed model specification.
Our paper also advocates the routine use of diagnostic tools and model validation methods to assess the adequacy of a fitted model and to compare two or more competing model specifications. For this purpose we propose, i.a., a suitably adjusted variant of the probability integral transform (PIT) and scoring rules.
The reminder of paper is structured as follows: In Section 2, we briefly outline the current state of the literature on econometric models for spatially correlated counts and introduce our proposed alternative model specification. Section 3 contains goodness of fit analysis tools for spatial count data models. In Section 4, we report the results of a small Monte Carlo study to show the appropriateness of our estimation approach. In Section 5, we revisit the start-up births data set of Lambert et al. (2016) and apply our spatial count regression model to it. Section 6 concludes.

Spatial Lag Models for Count Data
As already pointed out, the econometrics literature on models accounting for spatial autocorrelation in count data is quite sparse. The starting point of this literature is deemed to be the widely studied auto-Poisson model of Besag (1974) 2 . However, the model suffers from related problems to the exponential feedback models for time series models for counts 3 . The recent surveys of Glaser (2017) and Billé and Arbia (2019) list the few approaches available so far. Among them, the SAR-Poisson model of Lambert et al. (2016) features prominently, as it allows one to compute marginal effects which, due to the autoregressive nature of the model, can be decomposed in direct and indirect effects and allow for useful econometrics interpretations in many fields of applications.
To provide a starting point for the following discussion, the SAR-Poisson model specification for observed counts y is given here: where w ij is an element of the time invariant (n × n) spatial weight matrix W, which is assumed to be row-standardised, ρ is the spatial autocorrelation parameter, x i is the i-th row of the matrix of exogenous variables X, and β denotes the corresponding parameter vector. The symbol y −i denotes the vector of counts for all neighbours of y i . Note that Equation (11) given in Lambert et al. (2016) differs slightly from (1); after careful checking, we believe our result to be correct. An attractive feature of the SAR-Poisson model is the availability of a reduced form. This can be seen by collecting the µ i 's in the vector µ and employing a log transformation of (1) to obtain ln µ = (I − ρW) −1 Xβ. Written in this reduced form makes it obvious that the SAR-Poisson way of introducing spatial dependence only allows for spatial dependence in the regressors, not in the unexplained part of the observations.
An explicit way to alter this is presented in Liesenfeld et al. (2016), where the SAR-Poisson model is extended with an additional error term, allowing also for spatial dependence in the unexplained part of the variation of the observed counts. However, while the SAR-Poisson model of Lambert et al. (2016) can be estimated via (limited information) maximum likelihood, the estimation of the model proposed by Liesenfeld et al. (2016) is not straightforward, as the likelihood function is not available in closed form. Their proposal is then to use efficient importance sampling estimation which is not routinely available.
To overcome the shortcomings of the approaches discussed above, we propose a model specification that adjusts Besag's auto-Poisson model by introducing spatially lagged counts linearly in the conditional expectation function of a Poisson regression. This is in the spirit of the linear feedback model of Blundell et al. (2002), which has its origins in the integer valued autoregressive models for count time series (see, e.g., Jung and Tremayne 2011). Using the generic notation D for a (discrete) probability model, our spatial count regression model is given by To insure the non-negativity of the conditional mean function in (2), ρ must fulfil a possibly more restrictive condition as compared to SAR models for continuous data. It can be shown that must hold for all i and w i denotes the ith row of the W matrix. The lower bound for ρ will therefore be negative, although it might be greater than −1. In our empirical example below, however, the bound is found to be far away from −1. Accordingly, a change in the observation of a single neighbour j of i causes a change in y i by ρw ij . As argued in the Introduction, this interpretation of the spatial dependence between neighbouring regions is more intuitive as compared to that in the SAR-Poisson specification. The proposal to use neighbouring counts instead of unobservable neighbouring intensities was also pursued in the recent paper by Apardian and Smirnov (2020), who, however, relied on the exponential feedback framework, the drawbacks of which have already been outlined above.
Several count data distributions can be assumed for D in (2), the most prominent being the Poisson. The resulting model will be denoted as the Poisson spatial count regression model below. One way to deal with unobserved heterogeneity in the observed counts is to use the negative binomial (NB) distribution. Employing an NB distribution in (2) leads to the NB spatial count regression model below. Many other extensions of the Poisson distribution which have been introduced in the literature-e.g., the zero-inflated Poisson and hurdle Poisson models-could be used in the modelling framework of (2). As these more flexible specifications have turned out not to substantially improve the model fit in our application below, they are not discussed further here.
In non-linear regression models such as the spatial count regression model (2), marginal effects are regularly employed to support the interpretation of the model parameters. In order to derive these, we restate the conditional mean of our model specification in (2) using the identity y = E[y|y −i , X] + , where represents the (vector of) regression errors, as follows.
Based on the form (3), it is straightforward to derive the total marginal effects for continuous regressors X k for as follows.
Following LeSage and Pace (2009), this can be decomposed into a direct and an indirect marginal effect. The former is given as where a −1 ii is the according element of the A −1 , and the latter as Average marginal effects can be computed in various ways, e.g., by averaging over all observations i.
The specific form of the proposed conditional mean function in (2) does not allow for an easy to implement factorisation of the joint probability density function of the observed counts. Additionally, an operational high-dimensional multivariate count distribution is not available. Therefore, we propose a pseudo-likelihood approach for the estimation of the model parameters and marginal effects.
The idea to compose a pseudo-likelihood function of conditional probability functions such as the ones described in (2) stems from Besag (1975), who proposed this technique for the auto-normal schemes of Besag (1974) and sketched a proof of the estimator's consistency, relating it to the coding technique. Besag pointed out that the obtained estimator can be thought of as a weighted average of coding estimators. Since these are consistent, the estimator of the conditional pseudo-likelihood approach is consistent as well (under suitable regularity conditions).
Pseudo and composite likelihood functions have been either employed in the literature to obtain an approximation for an intractable likelihood function (see for example Jensen and Moller 1991) or to reduce the complexity of the maximisation of the likelihood function either by using the joint distribution of subsets of the data (i.e., 'subsetting methods') or by omitting redundant information (i.e., 'omission methods') in the likelihood (Varin and Vidoni 2005). Such a procedure was first proposed for spatial models by Vecchia (1998), building upon Besag (1974. In the case of a spatial model, one omits irrelevant information and conditions only on the relevant neighbours (in our case the eight nearest neighbours) instead of giving a full decomposition of the likelihood into conditional distributions. See also the description of the data generating process employed in the Monte Carlo experiments in Section 4 below. Finally note that a pseudo-likelihood procedure has also been proposed in the paper by Apardian and Smirnov (2020) in the context of their spatial count regression model.
In case of a Poisson distributional assumption in specification (2), the corresponding pseudo-log likelihood function takes the specific form For the NB model variant, we propose to use the Negbin 2 specification of the NB distribution which is popular in econometric applications due to is quadratic variance function.
where the dispersion parameter is denoted as α. The resulting pseudo-log likelihood function for the NB spatial count regression model is then given as The parameter estimates from the pseudo-likelihood for both models can straightforwardly be obtained using standard numerical optimisation methods. The properties of the estimators were investigated in the Monte Carlo study in Section 4. Robust standard errors were computed from the Godambe sandwich information matrix.

Diagnostics
In this section, we propose the application of several diagnostic tools adopted from the literature on models for non-spatially correlated counts. Although checking the adequacy of a specified model is an important part of any empirical (regression) modelling, diagnostic tools specially designed for count data are rarely considered in the spatial econometrics literature. In an attempt to bridge this gap, we employ the non-randomised probability integral transform histogram, scoring rules and a relative deviations plot.

Non-Randomised Probability Integral Transform
The probability integral transform (PIT) of Rosenblatt (1952) can be used to obtain an informal check for the model calibration. In the case of continuous random variables, the PIT is calculated as the values of the predictive cumulative distribution function (CDF) for the observations. If the predictive CDF equals the data generating process of the observations, the obtained values follow a standard uniform distribution. This can then be checked graphically, for example, by plotting a histogram of the PIT values (see, e.g., Diebold et al. 1998).
For the discrete case, this concept is not directly applicable because the predictive CDF will not be a continuous function. In the case of count data, it is a step function, meaning that the calculated PIT values will not follow a standard uniform distribution.
A popular solution for this problem is provided by the nonrandomised PIT, introduced by Czado et al. (2009). The nonrandomised PIT approach uses the predictive CDF for each observed count y i to obtain the distribution of the PIT values directly as follows: where P y i is the predictive CDF of observation y i , evaluated at y i . The mean PIT is obtained by aggregating over all F(u|y i ) and dividing by n and compared to the cumulative distribution function of standard uniform random variable. A straightforward way to interpret PIT histogram can then be obtained by transforming the average F(u|y i ) into J equally spaced bins j = 1, . . . , J and checking for uniformity. A u-shaped pattern of the histogram indicates overdispersion in the data (compared to the predictive distribution) and an inverse u-shaped underdispersion in the data.

Scoring Rules
With the help of scoring rules, the fit of the predictive distribution to the observed data can be compared among models. Like information criteria which are based on the likelihood values of the estimates, they negatively penalise a worse result, and are therefore to be minimised. We adopted this method of model selection from the work of Czado et al. (2009) andJung et al. (2016), and average over the score of all observations for model selection. Several scores are suggested in the count data literature. We will present the logarithmic, quadratic and ranked probability scores, which place their emphases on different aspects of the estimated distributions.
The logarithmic score centres on the predicted probability of the observed count and penalises predictive distributions for which the observation has a small probability. The average logarithmic score is The quadratic score also considers the whole estimated probability distribution by adding the squared probabilities of all possible outcomes: As a third scoring rule we use the ranked probability score, which especially penalises a flat estimated distribution: An important property of scoring rules is propriety; i.e., the score takes its worst value for the worst possible fit and its best value for its best. This property holds for all three scoring rules presented here (Czado et al. 2009;Jung et al. 2016).

Relative Deviations Plot
A method for evaluating the whole predictive distribution graphically is available by plotting the relative deviations of the estimated probability function together with the observed frequencies of the counts. Similar plots can be found in Long (1997), for example. Such a representation compares the predicted probabilitiesP i (k) for each predictive distribution i = 1, . . . , n and possible outcome k = 0, 1, 2, . . . with the frequencies h(k) observed in the data set, and averages over the n predictive distributions: andμ i being the estimate for the conditional expectation of observation y i in the case of a Poisson distributional assumption.
Other than the logarithmic score, this measure takes the whole predictive distribution into account. In addition to the information in the quadratic score and the ranked probability score, which also take into account the whole estimated probability function, the deviation plot gives a visual impression of how well the predictive distributions display the different features of the data, including modus, tails, etc.

Monte Carlo Study
In this section, we assess the small sample properties of the parameter estimates obtained from the pseudo-log likelihood functions of our proposed spatial count regression model by means of a simulation study.

Data Generating Process
Due to the fact that spatially lagged y i s are directly included into the conditional mean specification of our model and no natural ordering of the data as compared to a time series model is available, we propose to employ a Gibbs sampling approach to generate counts according to our conditional model specification (2) using Poisson and NB distributions, respectively, and the Gibbs sampling algorithm of Geman and Geman (1984). That is, by iteratively drawing from the conditional distributions of y i , i = 1, . . . , n, given all other y j s, we asymptotically obtain draws from the joint distribution of y 1 , . . . y n . E.g., in the k iteration of this procedure we draw from the following distributions: where D is defined as above. The starting values y (0) i were drawn from a non-spatial Poisson and negative binomial distribution, respectively, with µ i = exp(X i β). For the negative binomial case, the dispersion parameter α was set to 0.2 in our experiments to obtain a modest level of overdispersion.
We employed an 8-nearest neighbours inverse distance matrix for the spatial weighting matrix. To calculate this matrix, we first generated n random coordinates using random numbers from a U(0, 1) distribution. Then, we computed the Euclidian inverse distance matrix for these points which represent the n spatial units of our simulated data set. After selecting the 8 nearest neighbours of each unit, we set all other entries of the matrix equal to zero, and finally row-standardised the resulting matrix. X consists of a constant, X 1 ∼ U(0, 2), and X 2 ∼ N(1, 2). The parameter vector β was set [0.5, 0.5, 0.5] , and data were generated for ρ * = {0, 0.2, 0.4, 0.6, 0.8}, to cover a wide range of positive spatial dependence. Sample sizes employed were n = {100, 250, 500, 1000, 5000, 10, 000}; in particular, a range of low sample sizes was used to check for potential finite sample estimation bias. In total, 10,000 replications for each simulation were used to obtain estimates of bias and root mean squared error (RMSE) with a high precision.

Monte Carlo Results
The results of our simulation experiments are presented in Tables 1 and 2. With a few exceptions, the root mean square errors and biases of the parameter estimates are in general small. For both the Poisson and the NB case, the finite sample bias of the parameter estimates decreased quickly, as the sample size was larger than 100. From simulation results not reported in the tables, it emerged that fitting the Poisson spatial count regression models to an even lower sample size than 100 is not recommended. To obtain precise and unbiased estimates of the regression intercept β 0 in both the Poisson and the NB case is more difficult as compared to the slope coefficients β 1 and β 2 . A high level of spatial dependence (lager than 0.6) leads to larger biases and root mean squared errors for all parameter estimates employed. Unsurprisingly, the RMSE and bias results for the NB spatial count regression model are larger than those obtained for the Poisson variant. For overdispersed counts, the minimal sample size recommendations are even higher than in the equidispersed case. In particular, the good performance of the parameter estimators for large sample sizes supports our conjecture that our pseudo-log likelihood function for the spatial count regression model can be seen as compatible with the form provided in Besag (1975). The estimations for small sample sizes showed satisfactory results as well. With regard to our empirical application, which will be presented in Section 5 below, the simulation showed that for our sample size of about 3000 observations, the estimation based on the conditional likelihoods works quite well.

Data
For the application of the proposed spatial count regression models, we revisit the cross-sectional data of Lambert et al. (2016) on firm births in the manufacturing sector of the United States between 2000 and 2004. The data contain the number of start-up firms during this time period for 3078 U.S. counties and variables measuring the location factors, such as market structure, labour market and infrastructure for each county (see Table 3 and Lambert et al. (2016), p. 249, for more details). Figure 1 displays the spatial structure of the number of firm births. The frequency distribution of the number of start-up firms for values up to 100 is shown in Figure 2. It is evident that the counts do not exhibit excess zeros but an extraordinarily large range (from 0 to 6938), which is very difficult to capture for any single index model adequately.
We experimented with several forms of the weighting matrix W. All of these specifications produced very similar results. Therefore, only the results for a row-standardised inverse distance matrix considering only the eight nearest neighbours (which equals the weight matrix used by Lambert et al. 2016) are reported.

Results
Table 4 displays the estimation results for the Poisson and NB models discussed in Section 2, and for comparison, the non-spatial Poisson and non-spatial NB regression, whose conditional expectations are given by In the case of the non-spatial NB model, we chose again the Negbin2 (NB2) specification. In both spatial models, the spatial autocorrelation parameter ρ is highly significant, indicating that the proposed spatial structure exists in the data but is considerably different in size. Additionally, several of the other parameter estimates (and marginal effects, respectively) show clear differences between the Poisson and NB model specifications. Both the log score and the quadratic score clearly favour the negative binomial models and give the spatial variant a slight preference. The ranked probability score, in contrast, prefers the spatial Poisson model because it highly penalises the relatively flat estimated distributions from the negative binomial models. All three scores give evidence in favour of the spatial model specification over their non-spatial counterpart.
To get a first visual impression of the fit of the models, Figure 3 shows the estimated conditional expectations of each county for the four models. The classes equal the deciles of the observations displayed in Figure 1 going from dark blue (0 firm births) to dark red (60 to 6938 firm births). On all four maps, the clusters of high numbers of firm births at west and east coasts and at the Great Lakes of the observations are recognisable, along with the cluster of lower values in the Midwest. The spatial models, however, predicted higher values for many counties, reproducing the map of the observations more properly. However, all models tended to estimate the outcomes of counties with observation 0 (dark blue in Figure 1) too high.
Next, the nonrandomised probability integral transform (PIT) is presented to evaluate the calibration of the models. Figure 4 displays the PIT histograms for the four estimated model specifications. The u-shaped PIT histograms of both Poisson models show clear signs of too little dispersion allowed by the model. The PIT histogram of the non-spatial negative binomial model has an even shape, but shows a decrease for PIT values larger than 0.8. This might be caused by the occurrence of large outliers in the data (>6000), and at the same time relatively high frequencies of small counts (<3). As already mentioned, single index models, as used here, generally have difficulties coping with such a data structure. The model having the PIT histogram closest to a uniform distribution, even though the difference to the non-spatial negative binomial is small, is the negative binomial spatial count regression model. Table 4. Estimation results from Poisson and the NB spatial count data regression and the non-spatial Poisson and NB2 regressions. N = 3078; robust standard errors in brackets. * * and * * * denote a 5% and 1% significance, respectively.

Spatial
Non-Spatial Spatial Non-Spatial  A further way to look at the fit of the model is to predict the probabilities of each possible outcome and compare these visually with the empirical frequencies in a relative deviations plot. The relative differences of predictive probabilities and observed frequency of the four models are displayed in Figure 5. Generally, the two negative binomial models outperform the Poisson models. Small counts are underestimated by the Poisson model; large counts are overestimated by all models. For outcomes between 5 and 25, only small differences between frequencies and predicted distributions of the negative binomial models and the Poisson spatial count regression model can be observed, whereas the non-spatial Poisson model fits the data less well.

Variable
For evaluating the impact of the regressors on the number of firm births in these nonlinear models, the marginal effects have to be considered. Table 5 displays the estimated median marginal effects for all four model specifications considered. We decided to look at the median rather than the mean, since our data set contains some unusually large observations of the dependent variable. As for that, the predicted effects for these regions distort the mean marginal effects. On the other hand, calculating only the marginal effect for a certain (hypothetical) observation would not reflect the spatial nature of the data accurately, since only one particular pattern of neighbours could be considered.
As mentioned in Section 2, the total marginal effects in our spatial count regression model can be separated into the direct effect and indirect effect. The direct effect gives the impact of a change in the regressor x ik on observation i and is therefore comparable to a marginal effect in a non-spatial model. The indirect effect is the sum of the impacts of a change in all other regressors x jk , j = i on observations i which do not exist in a non-spatial model. Following (LeSage and Pace 2009, p. 39), the standard errors for the marginal effects were obtained using 2000 draws from the simulated joint distribution of the parameter estimators. As can be seen in Table 5, the majority of effects are significant. In general, more effects are significant in the negative binomial models than in the Poisson models. In the spatial models, the indirect effects are smaller than the direct ones.   When comparing the impacts between spatial and non-spatial models, we can see that the total marginal effects are in general of the same magnitude as the marginal effects of the corresponding non-spatial model. Therefore, the spatial models separate in some sense the effects in the non-spatial ones into direct and indirect and highlight the influence of neighbouring observations. Again, there are similar patterns for non-spatial and spatial versions of the Poisson and the negative binomial models. Table 5. Median of marginal effects of the Poisson and NB spatial count regressions, and the nonspatial Poisson and non-spatial NB2 regressions. For the dummy variables metro and micro, the effect of a change from 0 to 1 is given. * * and * * * denote 5% and 1% significance, respectively. Standard errors (in brackets) were estimated using their sample counterparts of 2000 draws of the asymptotic joint distribution of the coefficients.

Conclusions
This article proposes a new regression model for spatially autocorrelated count data, denoted as the spatial count regression model. The model is observation-driven in the sense that neighbouring counts linearly affect the conditional expectation function of a Poisson or negative binomial regression model. This avoids the problems caused by exponential feedback specifications and provides a more intuitive interpretation of the spatial autocorrelation than most existing specifications. It also allows one to compute marginal effects, which can be decomposed into direct and indirect effects and allow for useful econometrics interpretations in many fields of application. The model can be estimated via pseudo-maximum likelihood using standard optimisation methods, making it easily adoptable in applied work. Finally, we advocate the usage of several model validation and diagnostics already standard in the regression analysis of non-spatial count data, i.e., PIT histograms, scoring rules and relative deviations plots.
In our empirical example, we revisited the start-up firm births data set of Lambert et al. (2016). With the help of the model validation devices and diagnostic tools, we illustrated the pros and cons of several model specifications. Among these, the best fit for the modelling of firm births in U.S. counties can be obtained with the negative binomial spatial count regression model. Nevertheless, this model considerably overestimated the smallest counts. A future improvement of the fit for this particular application may be obtained by moving away from single-index models, e.g., by employing a mixture of discrete distributions.
Author Contributions: All authors have contributed equally to all aspects of the preparation and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data used in this study are available upon request from the corresponding author.