This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
This paper applies the generalised linear model for modelling geographical variation to esophageal cancer incidence data in the Caspian region of Iran. The data have a complex and hierarchical structure that makes them suitable for hierarchical analysis using Bayesian techniques, but care is required to deal with problems arising from counts of events observed in small geographical areas when overdispersion and residual spatial autocorrelation are present. These considerations lead to nine regression models derived from three probability distributions for count data (Poisson, generalised Poisson and negative binomial) and three different autocorrelation structures. We employ the framework of Bayesian variable selection and a Gibbs sampling based technique to identify significant cancer risk factors. The framework deals with situations where the number of possible models based on different combinations of candidate explanatory variables is so large that calculation of posterior probabilities for all models is difficult or infeasible. The evidence from applying the modelling methodology suggests that modelling strategies based on the generalised Poisson and negative binomial distributions with spatial autocorrelation work well and provide a robust basis for inference.
For count data, the mean and variance are often related and can be estimated using a single parameter, as in the Poisson model, which is the most frequently used model for analysing disease mapping data. Under this model, the mean and variance of the dependent variable are assumed to be equal, conditional on any variables used to explain differences in the mean across primary sampling units (PSU). In practice, however, this assumption is often false, since the variance can either be larger or smaller than the mean,
Statistical methods for analysing spatial patterns of disease incidence or mortality have matured over the past decade or so [
When a count dependent variable’s assumed variance is a function of its mean, one source of overdispersion is due to an inappropriate probability model, for example selecting the Poisson model when the generalised Poisson or negative binomial distribution would better capture the variation [
Intra-PSU heterogeneity may induce overdispersion as follows: individuals within any population subgroup may differ in characteristics that are known to influence the response, and if these characteristics are not included among the covariates in a model specification, population heterogeneity across PSUs can lead to extra-Poisson variation in cancer counts [
Overdispersion is a particular problem in the analysis of geographically correlated data. In addition to misspecification of the mean function and/or misspecification of the probability model, spatial autocorrelation is a third cause of overdispersion in geographically correlated count data [
The purpose of this paper is to consider the problem of modelling cancer counts when overdispersion is likely. We consider spatial regression to estimate the association between relative risk of disease and potential risk factors and map model predicted ratios in which counts in PSUs that are geographically close are assumed to have stronger correlation with each other than counts in PSUs that are geographically dispersed. The development of this work was motivated by our previous study of esophageal cancer (EC) incidence in the Caspian region of Iran during 2001–2005 [
This paper is structured as follows: in
Residents of the Mazandaran and Golestan provinces of Iran constituted the study population. The aims of the analysis were to determine the extent of spatial variability in risk for esophageal cancer in this area, and to assess the degree to which this variability is associated with socioeconomic status (SES) and dietary pattern indices. During the study period, there were 1,693 EC cases in a population of around 4.5 million people. Population and EC counts were available for the 152 agglomerations in the Mazandaran and Golestan provinces. Geographic coordinates that approximately reflected the geographical centroid of each agglomeration were also obtained. The distances between agglomeration centres were measured in kilometres and ranged from 9 to 507 km.
Explanatory variables relating to SES were available for each of the 152 agglomerations and to diet for each of the 26 wards [
Log linear models are often used to describe the dependence of the mean function on
(
The raw data are in the form of disease counts,
Equation (1) is equivalent to a model for agglomeration-level SIRs. Poisson, generalised Poisson and negative binomial distributions are considered for modelling counts at the agglomeration level, and for each of these distributional assumptions, non-spatial, neighbourhood-based and distance-based spatial correlation structures are compared. These analysis approaches are now described in detail.
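As an illustration of what a model for SIRs works with, the sketch below computes expected counts and standardised incidence ratios for a handful of areas. It is a minimal sketch under stated assumptions: the counts and populations are hypothetical, the expected counts use a single region-wide rate rather than any age or sex standardisation the study may have applied, and the function names are ours. The log of the expected count is the quantity that would typically enter a log-linear count model as an offset.

```python
import numpy as np


def expected_counts(population, total_cases):
    """Expected cases per area if every area shared the region-wide rate.

    Assumption: crude indirect standardisation with one overall rate.
    """
    population = np.asarray(population, dtype=float)
    overall_rate = total_cases / population.sum()
    return population * overall_rate


def standardised_incidence_ratio(observed, expected):
    """SIR = observed count / expected count, area by area."""
    return np.asarray(observed, dtype=float) / np.asarray(expected, dtype=float)


# Hypothetical data for four areas (not taken from the study).
observed = np.array([12, 3, 0, 25])
population = np.array([40_000, 15_000, 5_000, 60_000])

expected = expected_counts(population, observed.sum())
sir = standardised_incidence_ratio(observed, expected)

# In a log-linear model the expected count usually enters as an offset:
# log(mean count) = log(expected) + linear predictor.
log_offset = np.log(expected)

print(np.round(expected, 2))
print(np.round(sir, 2))
```

An area with no observed cases has an SIR of exactly zero whatever its population, which is one reason raw SIRs from small areas are unstable and are smoothed through a model.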
The Poisson model is given by:
The Poisson distribution has mean and variance
The negative binomial, NB, distribution can be constructed by adding a hierarchical element to the Poisson distribution through a random effect
The negative binomial model has the property that the variance is always greater than the mean and
The generalized Poisson, GPoisson, model with parameters
Bayesian inference is based on constructing a model
The maximum total number of candidate models given k covariates (considered additively,
Model (1) is a non-spatial model in the sense that it neither recognizes the distance-based relationships among the J agglomerations, nor in area
In the distance-based approach the multivariate normal distribution
Besag
Although other possibilities exist, the simplest and most commonly used neighbourhood structure is defined by the existence of a common border of any length between the areas. In this case, the weights
In order to be consistent across models with specification of prior belief, the prior distributions imposed on common parameters were the same and noninformative priors were used. A Gamma(0.001, 0.001) prior distribution was used for
In the highest level of the hierarchy prior distributions were specified for the prior precisions for hyperparameters
Candidate models can be represented as (
We assume that α is fixed and we concentrate on the estimation of the posterior distribution of
The Gibbs sampling procedure is summarized by the following three steps [
Sample the parameters included in the model from the posterior:
Sample the parameters excluded from the model from the pseudoprior:
Sample each variable indicator j from a Bernoulli distribution with success probability
The algorithm is further simplified by assuming prior conditional independence of all
We considered a normal prior and pseudoprior for the
The Normal prior assumption and Equation (13) result in a prior that is a mixture of two Normal distributions:
Using priors Equation (14) and Equation (9) gives the following full conditional posterior:
When no restrictions on the model space are imposed a common prior for the indicator variables
Consider Ʃ as the constructed prior covariance matrix for the whole parameter vector
Because
Bayesian model averaging (BMA) obtains the posterior inclusion probability of a candidate regressor,
Within the disease mapping context, usually the aim is prediction. In such cases, prediction should be based on the BMA technique, which also accounts for model uncertainty [
The Markov chain Monte Carlo method (MCMC) was employed to obtain a sample from the joint posterior distribution of model parameters, automatically generating samples from the marginal posteriors and hyperparameters. It has been suggested that the Gibbs sampler be run for 100,000 iterations for GVS after discarding the first 10,000 iterations as the burn-in period [
Mean absolute deviance (MAD), mean-squared predictive error (MSPE), pseudo-R^{2} [
Pseudo R^{2} is calculated for model comparison and takes values between zero and one. It is based on
To assess the prediction performance of the models, their mean-squared predictive error and deviance statistic are reported. Mean-squared predictive error is defined as
The deviance statistic, D= 2{
The absolute deviance residuals
A Moran scatterplot depicts standardised Pearson residuals
The methodology described in
The GVS methodology involved covariate selection conditional on the probability distribution and spatial autocorrelation type. With five SES and dietary factors there were 32 covariate models, and hence variable selection was made over the 32 models for a specified probability model type. Posterior summaries of the parameters of interest for the candidate models containing all five covariates are presented in
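The count of 32 covariate models follows from each of the five covariates being either included or excluded, giving 2^5 subsets. A minimal sketch that enumerates them is shown here; the covariate names are taken from the results tables, while the function name and the representation of a model as a tuple of names are illustrative assumptions.

```python
from itertools import product

covariates = [
    "income",
    "urbanisation",
    "literacy",
    "unrestricted food choice",
    "restricted food choice",
]


def enumerate_models(names):
    """Return every subset of covariates, one per inclusion-indicator vector."""
    models = []
    for gamma in product([0, 1], repeat=len(names)):
        # gamma[j] == 1 means covariate j is included in this model.
        models.append(tuple(n for n, g in zip(names, gamma) if g))
    return models


models = enumerate_models(covariates)
print(len(models))  # 2 ** 5 = 32, including the empty (intercept-only) subset
```

With only five covariates full enumeration is cheap; the Gibbs-based search becomes more valuable as the number of candidate covariates grows and the model space doubles with each addition.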
The estimated marginal posterior probabilities were calculated commencing with GVS for all the covariates. Then the covariates were ranked according to the marginal posterior probabilities and factors with marginal posterior inclusion probabilities lower than 0.2 were eliminated, using a rule of thumb [
BMA of these reduced models was used for prediction purposes. Posterior model probabilities of the top two covariate subsets are presented in
Posterior summaries for Poisson, GPoisson and Negative Binomial (NB) regression models, each with the spatial correlation structures: “IN” independence, “N” neighbourhood-based, “D” distance-based.
Model  Posterior median of regression coefficient β_{1}, (95% credible interval)  Random components  
Distribution  Spatial structure  income  urbanisation  literacy  unrestricted food choice  restricted food choice 
Poisson  IN  −0.22, (−0.60, −0.03)  −0.36, (−0.42, −0.15)  −0.16, (−0.26, −0.08)  0.12, (0.08, 0.16)  −0.32, (−0.41, −0.09)  0.78     
Poisson  IN + N  −0.19, (−0.68, 0.02)  −0.36, (−0.51, −0.16)  −0.15, (−0.22, −0.05)  0.07, (−0.04, 0.16)  −0.24, (−0.38, −0.06)  0.35  0.73   
Poisson  IN + D  −0.18, (−0.69, 0.07)  −0.35, (−0.51, 0.03)  −0.15, (−0.22, 0.02)  0.07, (−0.03, 0.16)  −0.23, (−0.38, 0.04)  0.13    
GPoisson  IN  −0.24, (−0.61, −0.09)  −0.38, (−0.51, −0.09)  −0.18, (−0.22, −0.05)  0.11, (0.09, 0.16)  −0.28, (−0.44, −0.11)  0.56     
GPoisson  IN + N  −0.19, (−0.69, −0.04)  −0.35, (−0.51, −0.03)  −0.12, (−0.21, −0.03)  0.07, (−0.02, 0.16)  0.23, (−0.38, −0.04)  0.12  0.66   
GPoisson  IN + D  −0.19, (−0.68, 0.01)  −0.36, (−0.51, −0.07)  −0.15, (−0.22, 0.06)  0.07, (−0.02, 0.16)  −0.24, (−0.39, −0.07)  0.17    
NB  IN  −0.23, (−0.59, −0.10)  −0.39, (−0.58, 0.09)  −0.17, (−0.27, −0.7)  0.17, (0.03, 0.16)  −0.31, (−0.48, −0.12)  0.36     
NB  IN + N  −0.17, (−0.68, −0.06)  −0.35, (−0.51, 0.11)  −0.11, (−0.21, 0.01)  0.07, (−0.04, 0.16)  −0.23, (−0.38, 0.02)  0.12  0.74   
NB  IN + D  −0.20, (−0.68, 0.10)  −0.35, (−0.51, 0.08)  −0.15, (−0.22, 0.08)  0.07, (−0.01, 0.16)  −0.24, (−0.38, 0.09)  0.11   
The top two candidate covariate models (covariate subsets) based on their posterior probabilities: “IN” stands for independence, “N” stands for neighbourhood-based and “D” stands for distance-based structure.
Model distribution  Spatial structure  Subset  Covariates *  f(m|y) ** 

Poisson  IN  1  income, restricted food choice  0.37 
Poisson  IN  2  income, restricted food choice, urbanisation  0.12 
Poisson  IN + N  1  income, restricted food choice, urbanisation  0.31 
Poisson  IN + N  2  income, restricted food choice  0.15 
Poisson  IN + D  1  urbanisation  0.25 
Poisson  IN + D  2  income  0.20 
GPoisson  IN  1  income, restricted food choice  0.28 
GPoisson  IN  2  income, restricted food choice, urbanisation  0.17 
GPoisson  IN + N  1  income, urbanisation, restricted food choice  0.28 
GPoisson  IN + N  2  urbanisation, restricted food choice  0.13 
GPoisson  IN + D  1  restricted food choice  0.19 
GPoisson  IN + D  2  income, urbanisation  0.18 
NB  IN  1  income, restricted food choice  0.21 
NB  IN  2  restricted food choice, urbanisation  0.11 
NB  IN + N  1  income  0.26 
NB  IN + N  2  income, restricted food choice  0.13 
NB  IN + D  1  income  0.18 
NB  IN + D  2  restricted food choice  0.12 
* Covariates are listed in order of decreasing estimated marginal posterior probabilities; ** Posterior probability of the model.
Marginal posterior inclusion probability for the top candidate models (covariate subsets): “IN” stands for independence, “N” stands for neighbourhood-based and “D” stands for distance-based structure.
Model distribution  Spatial structure  Subset  Covariates  Marginal posterior inclusion probability * 
Poisson  IN  1  income  0.67 
restricted food choice  0.42  
Poisson  IN + N  1  income  0.61 
restricted food choice  0.48  
urbanisation  0.37  
Poisson  IN + D  1  urbanisation  0.40 
GPoisson  IN  1  income  0.57 
restricted food choice  0.33  
GPoisson  IN + N  1  income  0.59 
urbanisation  0.43  
restricted food choice  0.25  
GPoisson  IN + D  1  restricted food choice  0.22 
NB  IN  1  income  0.64 
restricted food choice  0.42  
NB  IN + N  1  income  0.47 
NB  IN + D  1  income  0.55 
* marginal posterior inclusion probability.
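Both the posterior model probabilities and the marginal posterior inclusion probabilities reported in these tables can be estimated from the sampled variable indicators. The sketch below shows one common way of doing so, assuming the sampler output is available as a matrix with one row per retained iteration and one 0/1 column per covariate. The draws are hypothetical and the function name is ours; this is not the authors' code.

```python
from collections import Counter

import numpy as np


def summarise_indicators(gamma_draws, names):
    """Estimate inclusion and model probabilities from 0/1 indicator draws.

    gamma_draws: array-like of shape (n_draws, n_covariates).
    Returns (inclusion, model_probs), where model_probs is keyed by the
    tuple of included covariate names.
    """
    gamma_draws = np.asarray(gamma_draws, dtype=int)
    n_draws = gamma_draws.shape[0]

    # Marginal inclusion probability: share of draws with the indicator on.
    inclusion = dict(zip(names, gamma_draws.mean(axis=0)))

    # Posterior model probability: share of draws spent in each model.
    visits = Counter(tuple(row) for row in gamma_draws.tolist())
    model_probs = {
        tuple(n for n, g in zip(names, key) if g): count / n_draws
        for key, count in visits.items()
    }
    return inclusion, model_probs


# Hypothetical indicator draws for three covariates (illustration only).
names = ["income", "urbanisation", "restricted food choice"]
draws = [
    [1, 0, 1], [1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 0, 1],
    [1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 0], [1, 0, 1],
]

inclusion, model_probs = summarise_indicators(draws, names)
best_model = max(model_probs, key=model_probs.get)
print(inclusion)
print(best_model, model_probs[best_model])
```

A covariate can have a high inclusion probability even when no single model containing it dominates, because the inclusion probability sums over every visited model in which the covariate appears.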
The pseudo R^{2} suggested that approximately one third of the total variation in esophageal cancer counts was explained by each of the subset 1 models with slight improvement for joint independence and spatial models.
Goodness of fit measures: “IN” stands for independence, “N” stands for neighbourhood-based and “D” stands for distance-based structure.
Distribution  Spatial structure  MAD ^{a}  MSPE ^{b}  Pseudo-R^{2}  Deviance index ^{c} 
Poisson  IN  4.4  30.3  0.24  3.1 
Poisson  IN + N  3.7  16.6  0.32  2.8 
Poisson  IN + D  2.6  13.8  0.28  2.9 
GPoisson  IN  3.2  14.9  0.30  2.6 
GPoisson  IN + N  2.1  10.1  0.35  1.6 
GPoisson  IN + D  2.3  11.6  0.33  1.7 
NB  IN  3.4  15.8  0.30  2.4 
NB  IN + N  2.2  10.3  0.33  1.7 
NB  IN + D  2.3  13.0  0.35  1.4 
^{a} Mean absolute deviance; ^{b} Mean-squared predictive error; ^{c}
Scatterplots of observed counts (vertical axis) against model predicted counts (horizontal axis) from different models.
For MSPE and MAD, the prediction performances of the spatial models are broadly similar, and each spatial model performs better than the corresponding non-spatial model. These criteria also suggest that the negative binomial and GPoisson models with neighbourhood-based autocorrelation were preferable to the other models.
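To make the comparison criteria concrete, the sketch below computes a mean-squared predictive error and a mean absolute error from observed and predicted counts. The exact definitions used for MSPE and MAD are not recoverable from the text as given here, so both functions are assumptions: MSPE is taken as the average squared difference between observed and predicted counts, and the absolute-error function is a simple stand-in rather than the deviance-based MAD. The data are hypothetical.

```python
import numpy as np


def mspe(observed, predicted):
    """Average squared difference between observed and predicted counts."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean((o - p) ** 2))


def mean_abs_error(observed, predicted):
    """Average absolute difference (a stand-in, not the paper's MAD)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(o - p)))


# Hypothetical observed and model-predicted counts for four areas.
observed = [4, 0, 7, 2]
predicted = [3, 1, 5, 2]

print(mspe(observed, predicted))            # 1.5
print(mean_abs_error(observed, predicted))  # 1.0
```

Because squared error weights large misses more heavily, the two criteria can rank models differently when one model fits most areas well but misses a few badly.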
The deviance statistic is reported in
Absolute deviance residuals
Moran scatterplots in
Moran scatter plot of the residuals from competing models: standardised Pearson residuals against spatially-lagged standardised Pearson residuals.
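The two axes of a Moran scatterplot can be computed directly from a residual vector and a neighbourhood matrix. The sketch below assumes a binary common-border adjacency matrix that is row-standardised, so that each lagged value is the average residual of an area's neighbours; under that assumption Moran's I equals the slope of the line through the scatter. The adjacency matrix and residuals are hypothetical, and the function names are ours.

```python
import numpy as np


def row_standardise(adjacency):
    """Scale each row of a 0/1 adjacency matrix to sum to one."""
    a = np.asarray(adjacency, dtype=float)
    row_sums = a.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every area needs at least one neighbour")
    return a / row_sums


def moran_scatter(residuals, adjacency):
    """Return standardised residuals, their spatial lag, and Moran's I."""
    z = np.asarray(residuals, dtype=float)
    z = (z - z.mean()) / z.std()      # centre and scale the residuals
    w = row_standardise(adjacency)
    lag = w @ z                       # mean residual among neighbours
    moran_i = float(z @ lag / (z @ z))
    return z, lag, moran_i


# Hypothetical example: four areas arranged in a line.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
residuals = [2.0, 1.0, -1.0, -2.0]

z, lag, moran_i = moran_scatter(residuals, adjacency)
print(round(moran_i, 3))  # 0.5
```

A value near zero indicates little residual spatial autocorrelation, which is the pattern one hopes to see once a spatial random effect has been included.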
Bayesian techniques are recognised as powerful tools in disease mapping but little is known about how these methods compare when applied to real data. Reviews and comparison of Bayesian hierarchical and/or nonhierarchical methods suggested for the analysis of aggregate count data in the context of disease mapping and spatial regression can be found in [
Our study aims were to assess the risk factors for EC using an automatic Bayesian covariate selection procedure, and to compare the prediction performance of competing models built from three distributions for count data, to deal with overdispersion, and three spatial correlation structures, to take account of intra- and inter-agglomeration variation. In conclusion, the use of joint models that include both spatial and non-spatial random effects gave a better picture in terms of model goodness of fit and prediction performance. Generalised Poisson and NB models also performed better than Poisson regression. Overall, generalised Poisson or NB models with conditional autoregressive (CAR) correlation structure seemed to provide the most satisfactory basis for inference.
Two spatial structures were considered in our models: the neighbourhood-based autocorrelation structure that borrows strength from neighbouring agglomerations and the distance-based autocorrelation structure that borrows strength from agglomerations over an effective range. The use of the spatial term resulted in more conservative estimates by explicitly modelling the positive inter-agglomeration correlation of the SIRs, compared with the models that ignored this inter-agglomeration correlation. A non-spatial random effect was included along with spatial random effects to take into account agglomeration heterogeneity. The non-spatial term is especially important in the CAR structure, because if the majority of the variability is non-spatial, inference for the CAR model might incorrectly suggest that spatial dependence was present. Results from a simulation study have indicated that if the data are truly independent, a model with CAR random effects and no non-spatial random effects leads to very poor efficiency in the estimation of regression coefficients [
In model selection the uniform prior distribution on model space is typically used by setting
In this paper we have compared Poisson, generalised Poisson and NB distributions for modelling count data when overdispersion is a problem. Results indicate that the Poisson distribution is not adequate to model cancer SIRs in our data setting. The negative binomial and the generalised Poisson distributions are more appropriate than the Poisson distribution, and the two are quite similar for the range of parameters in our study. It must be emphasised that for count data with small counts, various discrete distributions can fit the data sufficiently well [
When competing models exist, information criteria such as the Akaike information criterion (AIC), Bayes information criterion (BIC) and deviance information criterion (DIC) may be useful to select a single “best” model for final inference. However, these standard regression techniques and selection methods do not address the uncertainty associated with model specification. In contrast, BMA considers a set of models built from all available covariates and accounts for uncertainty in model form by averaging the estimated parameters across models, weighted by their posterior probabilities. Moreover, using the Gibbs sampler to search the model space for all possible models is efficient here, owing to the limited number of covariates. We considered BMA in order to control the model uncertainty with respect to covariates. The advantages of using the BMA approach to account for model uncertainty have been assessed for several different classes of models [
The objectives of this study were to evaluate and compare the generalised Poisson and negative binomial models with the Poisson model commonly used for analysing count data. The results indicate that: (i) models with joint independence and spatial random effects were superior to the models with an independence random effect alone; (ii) models with alternative distributions that accommodate overdispersion performed better than Poisson regression. Using a spatial random effect term has the advantage of allocating the overdispersion to spatial and nonspatial components, recognizing the inherently spatial nature of the data. It was found in the case study that generalised Poisson or negative binomial models with conditional autoregressive correlation structure seemed to provide the most satisfactory basis for inference. The methodology presented is not specific to our example and can be applied in a variety of settings to produce more informative results than simple Poisson regression modelling.
The authors declare no conflicts of interest.
We would like to thank the survey team and colleagues of the Babol Cancer Registry.