2.1. Esophageal Cancer Incidence Data in the Caspian Region of Iran
Residents of Mazandaran and Golestan provinces of Iran constituted the study population. The aims of the analysis were to determine the extent of spatial variability in risk for esophageal cancer (EC) in this area, and to assess the degree to which this variability is associated with socioeconomic status (SES) and dietary pattern indices. During the study period, there were 1,693 EC cases in a population of around 4.5 million people. Population and EC counts were available for the 152 agglomerations in the Mazandaran and Golestan provinces. Geographic coordinates were also obtained for each agglomeration, approximately reflecting its geographical centroid. The distances between agglomeration centres were measured in kilometres and ranged from 9 to 507 km.
Figure 1a shows the geographic boundaries of wards, cities and rural agglomerations within wards, in the two provinces. Incidence rates were adjusted for differences in the age structure of agglomerations by calculating standardised incidence ratios (SIRs) with the 2003 population as reference.
Figure 1b shows strong spatial aggregation of SIRs, with a tendency for higher EC rates in the eastern and central agglomerations and lower rates in the west.
Explanatory variables relating to SES were available for each of the 152 agglomerations, and variables relating to diet for each of the 26 wards [14]. Factor analysis was used to summarise the SES and diet variables into a few uncorrelated factors: for SES, "income", "urbanisation" and "literacy", with lower values indicating greater deprivation; and for diet, an "unrestricted food choice diet" characterised by high intake of foods generally thought to be protective against EC, and a "restricted food choice diet" with positive loadings on risky foods. Estimates of the percentage of the population in each ward with diet factor scores in the highest (3rd) tertile were used in the regression models. For the socio-economic components, the factor scores for each agglomeration were included in the regression model as continuous covariates. Further details on how the factors were created and defined for diet and SES can be found elsewhere [14].
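As a rough illustration of this step, the sketch below (Python, using scikit-learn's FactorAnalysis) shows one way such factor scores and highest-tertile percentages could be derived. The variable names, dimensions and simulated data are placeholders following the description above; this is a schematic assumption, not the analysis actually used in [14].

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# --- SES: agglomeration-level variables -> three factor scores used as covariates ---
ses_vars = rng.normal(size=(152, 8))            # stand-in for SES variables, 152 agglomerations
ses_scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(ses_vars)
# the three columns play the role of "income", "urbanisation" and "literacy"

# --- Diet: individual-level variables -> two factors, then % of each ward in the top tertile ---
n_people, n_wards = 2600, 26
ward = rng.integers(0, n_wards, size=n_people)  # ward membership of each respondent
diet_vars = rng.normal(size=(n_people, 12))     # stand-in for food-intake variables
diet_scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(diet_vars)

cut = np.quantile(diet_scores, 2 / 3, axis=0)   # boundary of the highest (3rd) tertile
in_top = diet_scores >= cut                     # indicator, one column per diet factor
pct_top_tertile = np.array(
    [100 * in_top[ward == w].mean(axis=0) for w in range(n_wards)]
)  # 26 x 2 matrix: percentage in the top tertile for each ward and diet factor
```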
Log linear models are often used to describe the dependence of the mean function on k covariates, X1, …, Xk. A general form for this type of model for J geographically-defined units (areas) is given by:

log E(Yj) = log Ej + Xjβ′ + θj,  j = 1, …, J (1)

where Yj is the count for area j, Ej denotes an "expected" count in area j that is assumed known, Xj = (1, Xj1, …, Xjk) is a 1 × (k + 1) vector of area-level risk factors, β = (β0, β1, …, βk) is a 1 × (k + 1) vector of regression parameters and θj represents a residual with no spatial structure (so that θi and θj are independent for i ≠ j).
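The mean structure implied by Equation (1) can be sketched directly; the snippet below (Python/NumPy, with made-up Ej, Xj, β and θj values purely for illustration) simulates counts whose expectation is Ej·exp(Xjβ′ + θj), which is the relationship the models in the following subsections share.

```python
import numpy as np

rng = np.random.default_rng(1)

J, k = 152, 3                                 # areas and covariates (values illustrative)
E = rng.uniform(2, 40, size=J)                # "expected" counts Ej, assumed known
X = np.column_stack([np.ones(J), rng.normal(size=(J, k))])  # design matrix with intercept
beta = np.array([0.1, 0.3, -0.2, 0.15])       # illustrative regression parameters
theta = rng.normal(0, 0.1, size=J)            # unstructured residuals

mu = E * np.exp(X @ beta + theta)             # mean function from Equation (1)
Y = rng.poisson(mu)                           # one realisation of area-level counts
```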
Figure 1.
(a) Geographic boundaries of wards (bold polygons), cities (grey polygons) and rural agglomerations within wards, in Mazandaran and Golestan provinces; (b) Observed spatial pattern; and (c) model adjusted SIR.
2.2. Model & Data Structure
The raw data are in the form of disease counts, Yj, and population counts, Nj, in region j. The expected count Ej, adjusting for the age structure of an agglomeration, was obtained by age-standardisation. Then, using the theoretical relationship SIRj = Yj/Ej, Equation (1) is equivalent to a model for agglomeration-level SIRs. Poisson, generalised Poisson and negative binomial distributions are considered for modelling counts at the agglomeration level and, for each of these distributional assumptions, non-spatial, neighbourhood-based and distance-based spatial correlation structures are compared. These analysis approaches are now described in detail.
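A minimal sketch of this indirect age-standardisation step is given below (Python/NumPy); the age bands, reference rates and agglomeration age structures are invented placeholders, whereas the real analysis used the 2003 population as reference.

```python
import numpy as np

rng = np.random.default_rng(2)

J, n_age = 152, 8                                   # agglomerations and age bands (illustrative)
pop_age = rng.uniform(200, 5000, size=(J, n_age))   # population of each age band in each area
ref_rate = np.array([1, 2, 5, 10, 20, 40, 60, 80]) / 1e5  # reference age-specific rates

E = pop_age @ ref_rate                              # expected counts Ej under reference rates
Y = rng.poisson(E * rng.uniform(0.5, 2.0, size=J))  # observed counts with varying true risk
SIR = Y / E                                         # agglomeration-level SIRj = Yj / Ej
```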
2.3. Distributions for Disease Counts
The Poisson model is given by:

P(Yj = yj) = exp(−λj) λj^yj / yj!,  yj = 0, 1, 2, 3, …

The Poisson distribution has mean and variance E(Yj) = V(Yj) = λj.
The negative binomial, NB, distribution can be constructed by adding a hierarchical element to the Poisson distribution through a random effect εj, specifically:

Yj | εj ~ Poisson(λjεj),  εj ~ Gamma(ϑ, ϑ)

for yj = 0, 1, 2, 3, …, where ϑ > 0. The resulting probability distribution function marginal to εj is:

P(Yj = yj) = [Γ(yj + ϑ) / (Γ(ϑ) yj!)] (ϑ / (ϑ + λj))^ϑ (λj / (ϑ + λj))^yj

for yj = 0, 1, 2, 3, …, with E(Yj) = λj and V(Yj) = λj + λj^2/ϑ.
The negative binomial model has the property that the variance is always greater than the mean; ϑ is the extra-Poisson variation parameter, with large values of ϑ corresponding to variability closer to that of the Poisson distribution. As ϑ → ∞, the distribution of Yj converges to a Poisson random variable.
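The Poisson-gamma construction above can be checked by simulation; the sketch below (Python/NumPy, with arbitrary λj and ϑ) draws counts marginally and confirms that the sample variance approaches λj + λj^2/ϑ.

```python
import numpy as np

rng = np.random.default_rng(3)

lam, theta_disp = 5.0, 2.0                     # illustrative values of λj and dispersion ϑ
n = 200_000

eps = rng.gamma(shape=theta_disp, scale=1 / theta_disp, size=n)  # εj ~ Gamma(ϑ, ϑ), mean 1
y = rng.poisson(lam * eps)                     # Yj | εj ~ Poisson(λj εj)

print(y.mean())                                # ≈ λj = 5.0
print(y.var())                                 # ≈ λj + λj**2 / ϑ = 5 + 25/2 = 17.5
```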
The generalized Poisson, G-Poisson, model with parameters λj and ω is defined for yj = 0, 1, 2, 3, … [9] and has E(Yj) = λj and V(Yj) = λj(1 − ω)^−2. For ω = 0, the generalized Poisson reduces to the Poisson distribution with mean λj.
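The exact probability function is given in [9]; as a hedged illustration, the sketch below (Python) uses the mean-parameterised form of the generalized Poisson that is consistent with the stated moments, P(Y = y) = λ(1 − ω)[λ(1 − ω) + ωy]^(y−1) exp[−λ(1 − ω) − ωy]/y!, which has mean λ and variance λ(1 − ω)^−2. Treat this particular parameterisation as an assumption rather than the paper's exact formula.

```python
import math


def gpoisson_pmf(y: int, lam: float, omega: float) -> float:
    """Mean-parameterised generalized Poisson pmf (assumed form):
    mean = lam, variance = lam / (1 - omega) ** 2; omega = 0 recovers the Poisson pmf."""
    a = lam * (1.0 - omega)                    # plays the role of Consul's first parameter
    log_p = (math.log(a) + (y - 1) * math.log(a + omega * y)
             - a - omega * y - math.lgamma(y + 1))
    return math.exp(log_p)


# Quick numerical check of the moments for illustrative lam = 4, omega = 0.3
lam, omega = 4.0, 0.3
probs = [gpoisson_pmf(y, lam, omega) for y in range(200)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(mean, var)                               # ≈ 4.0 and ≈ 4.0 / 0.7**2 ≈ 8.16
```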
Bayesian inference is based on constructing a model m (which encapsulates distributional assumptions and covariate relationships with the outcome), its likelihood f(Y | γm, m), and the corresponding prior distribution f(γm | m), where γm is the parameter vector under model m and Y is the outcome variable vector. We use the following hierarchical structure on model parameters:

Y | γm, m ~ f(Y | γm, m),  γm | m ~ f(γm | m),  m ~ f(m)

where f(m) is the prior probability for entry of covariates in the specification of the linear predictor part of the bigger model m within a class of one of the three probability assumptions above.
The maximum total number of candidate models given k covariates (considered additively, i.e., no interactions) is 2^k. The usual choice for the prior on model m is the uniform distribution over the model space M generated by the covariates β1, …, βk. We used this uniform distribution because the prior can be thought of as noninformative in the sense of favouring all candidate models equally within the same probability model class.
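For concreteness, a small sketch of this model space and its uniform prior is shown below (Python); the covariate names are placeholders standing in for the SES and diet factors described earlier.

```python
from itertools import product

covariates = ["income", "urbanisation", "literacy", "diet_unrestricted", "diet_restricted"]
k = len(covariates)

# Each model is a vector of inclusion indicators (psi_1, ..., psi_k); there are 2**k of them.
model_space = list(product([0, 1], repeat=k))
prior = {m: 1 / 2 ** k for m in model_space}   # uniform prior over the 2**k candidate models

print(len(model_space))                        # 32 models for k = 5
print(prior[(1, 0, 1, 0, 0)])                  # 1/32 = 0.03125
```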
2.4. Hierarchical Models for Relative Risks
Model (1) is a non-spatial model in the sense that it neither recognizes the distance-based relationships among the J agglomerations, nor allows in area j for any neighbourhood-based effects between adjacent areas that would mean counts in one area might be related to counts in adjacent areas. Suppose the variability in the {Yj}, j = 1, …, J, follows a spatial model that incorporates assumptions about the spatial relationships between areas. We then extend (1) as:

log E(Yj) = log Ej + Xjβ′ + θj + ϕj

where the new parameter ϕj represents a residual with spatial structure, with ϕi and ϕj, i ≠ j, modelled to have positive spatial dependence. Two approaches are used for modelling the J-dimensional random variable ϕ: distance-based and neighbourhood-based spatial correlation structures.
In the distance-based approach the multivariate normal distribution MVN(µ, τΣ) is specified for ϕ, where µ is a 1 × J mean vector, τ > 0 controls the overall variability of the ϕi and Σ is a J × J positive definite matrix. If dij denotes the distance between centroids of agglomerations i and j, then we specify:

Σij = f(dij; ν, κ) = exp[−(νdij)^κ]

In this specification ν > 0 controls the rate of decrease of correlation with distance, with large values representing rapid decay, and τ is a scalar parameter representing the overall precision. The parameter κ ∈ (0, 2] controls the amount by which spatial variations in the data are smoothed. Large values of κ lead to greater smoothing, with κ = 2 corresponding to the Gaussian correlation function [15]. The distance-based parameters ν, κ and τ are referred to jointly in what follows.
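A short sketch of this covariance construction follows (Python/NumPy); the coordinates and the ν, κ and τ values are arbitrary placeholders, chosen only to show how the powered-exponential correlation in Σ decays with the distances dij.

```python
import numpy as np

rng = np.random.default_rng(4)

coords = rng.uniform(0, 500, size=(152, 2))      # illustrative centroid coordinates (km)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # pairwise distances dij

nu, kappa, tau = 0.02, 1.0, 1.0                  # illustrative decay, smoothness and scale
Sigma = np.exp(-(nu * d) ** kappa)               # Sigma_ij = exp[-(nu * dij) ** kappa]

# Positive definiteness holds for kappa in (0, 2]; a Cholesky factor gives one draw of phi.
phi = np.linalg.cholesky(tau * Sigma + 1e-10 * np.eye(152)) @ rng.normal(size=152)
```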
Besag et al. [16] propose modelling the spatial components ϕ via a conditional autoregression (CAR), a zero-centred Gaussian specification describing the spatial variation in the heterogeneity component so that geographically close areas tend to present similar risks. One way of expressing this spatial structure is via Markov random field models, where the distribution of each ϕi given all the other elements {ϕ1, …, ϕi−1, ϕi+1, …, ϕJ} depends only on its neighbourhood [17]. A commonly used form for the conditional distribution of each ϕi is the Gaussian:

ϕi | {ϕj, j ≠ i} ~ N( Σj≠i πijϕj / Σj≠i πij , 1 / (σϕ Σj≠i πij) )  (8)

where the prior mean of each ϕi is defined as a weighted average of the other ϕj, j ≠ i, and the weights πij define the relationship between area i and its neighbours. The precision parameter σϕ controls the amount of variability of the random effect.
Although other possibilities exist, the simplest and most commonly used neighbourhood structure is defined by the existence of a common border of any length between the areas. In this case, the weights πij in Equation (8) are constants and specified as πij = 1 if i and j are adjacent and πij = 0 otherwise. In that case, the conditional prior mean of ϕi is given by the arithmetic average of the spatial effects from its neighbours and the conditional prior variance is proportional to the number of neighbours.
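The sketch below (Python/NumPy) illustrates this adjacency-based specification on a toy adjacency matrix: with 0/1 weights, the conditional prior mean of ϕi is the average of its neighbours' values and the conditional prior variance scales with the inverse of the number of neighbours. The adjacency matrix, effect values and precision are invented for illustration.

```python
import numpy as np

# Toy 0/1 adjacency ("common border") matrix for 5 areas; pi_ij = 1 if i and j are adjacent.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]])

phi = np.array([0.2, -0.1, 0.4, 0.0, -0.3])      # illustrative current values of the spatial effects
sigma_phi = 2.0                                  # precision parameter of the CAR prior
n_neighbours = W.sum(axis=1)

cond_mean = (W @ phi) / n_neighbours             # arithmetic average of neighbouring effects
cond_var = 1.0 / (sigma_phi * n_neighbours)      # variance proportional to 1 / number of neighbours

print(cond_mean, cond_var)
```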
2.7. Gibbs Variable Selection, GVS
Candidate models can be represented as pairs (ψ, α), where ψ ∈ {0, 1}^k is a vector of binary indicator variables ψg (g = 1, …, k), with ψg = 1 or 0 representing respectively the presence or absence of covariate g in the model, and α denotes other structural properties of the model. For the generalised linear models in this study, α describes the distribution, link function, variance function and (un)structured terms, and the linear predictor may be written as:

ηj = β0 + Σg ψgβgXjg,  g = 1, …, k
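A minimal sketch of this indicator-times-coefficient linear predictor is given below (Python/NumPy); ψ, β and the design matrix are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)

J, k = 152, 5
X = rng.normal(size=(J, k))                    # area-level covariates (columns g = 1, ..., k)
beta0, beta = 0.2, np.array([0.4, -0.3, 0.1, 0.25, -0.15])
psi = np.array([1, 0, 1, 1, 0])                # inclusion indicators: covariates 2 and 5 dropped

eta = beta0 + X @ (psi * beta)                 # eta_j = beta0 + sum_g psi_g * beta_g * X_jg
```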
We assume that α is fixed and we concentrate on the estimation of the posterior distribution of β within the class of probability models defined by α. The prior for (β, ψ) is specified as f(β, ψ) = f(β | ψ) f(ψ). Furthermore, β can be partitioned into two vectors, βψ and β\ψ, corresponding to those components of β that are included (ψg = 1) or not included (ψg = 0) in the model. Then, the prior f(β | ψ) may be partitioned into a "model" prior f(βψ | ψ) and a "pseudo" prior f(β\ψ | βψ, ψ) [18]. The full posterior distribution for the model parameters is given by:

f(β, ψ | y) ∝ f(y | β, ψ) f(βψ | ψ) f(β\ψ | βψ, ψ) f(ψ)

and we assume that the actual model parameters βψ and the inactive parameters β\ψ are a priori independent given ψ. This assumption implies that f(βψ | β\ψ, ψ, y) ∝ f(y | β, ψ) f(βψ | ψ) and f(β\ψ | βψ, ψ, y) ∝ f(β\ψ | ψ).
The Gibbs sampling procedure is summarized by the following three steps [19]:
- (1). Sample the parameters included in the model from the posterior:
f(βψ | β\ψ, ψ, y) ∝ f(y | β, ψ) f(βψ | ψ)
- (2). Sample the parameters excluded from the model from the pseudoprior:
f(β\ψ | βψ, ψ, y) ∝ f(β\ψ | ψ)
- (3). Sample each variable indicator ψg from a Bernoulli distribution with success probability Og/(1 + Og), where Og is given by:
Og = [f(y | β, ψg = 1, ψ\g) f(β | ψg = 1, ψ\g) f(ψg = 1, ψ\g)] / [f(y | β, ψg = 0, ψ\g) f(β | ψg = 0, ψ\g) f(ψg = 0, ψ\g)]

where ψ\g denotes all terms of ψ except ψg.
The algorithm is further simplified by assuming prior conditional independence of all βg for each model ψ. Then, each prior for βg | ψ consists of a mixture of the true prior f(βg | ψg = 1, ψ\g) for the parameter and a pseudoprior f(βg | ψg = 0, ψ\g). As a result:

f(βg | ψ) = ψg f(βg | ψg = 1, ψ\g) + (1 − ψg) f(βg | ψg = 0, ψ\g)  (13)
We considered a normal prior and pseudoprior for the βg, resulting in:

f(βg | ψg = 1) = N(0, Ʃg)

and:

f(βg | ψg = 0) = N(µg, Sg)

where µg and Sg are respectively the mean and variance of the corresponding pseudoprior distribution, and Ʃg is the prior variance when covariate g is included in the model. The Normal prior assumption and Equation (13) result in a prior that is a mixture of two Normal distributions:

f(βg | ψ) = ψg N(0, Ʃg) + (1 − ψg) N(µg, Sg)  (14)

Using priors Equation (14) and Equation (9) gives the following full conditional posterior:

f(βg | ψ, β\g, y) ∝ f(y | β, ψ) f(βg | ψg = 1)  if ψg = 1,  and  f(βg | ψ, β\g, y) ∝ f(βg | ψg = 0)  if ψg = 0

indicating that the pseudoprior, f(βg | ψg = 0), does not affect the posterior distribution of the model coefficients.
When no restrictions on the model space are imposed, a common prior for the indicator variables ψg is ψg ~ Bernoulli(0.5) [20]. The Gibbs sampler was started with all ψg = 1, which corresponds to starting with the full model.
Consider Ʃ as the prior covariance matrix constructed for the whole parameter vector β when the multivariate extension of prior distribution (14) is used for each βg. Zellner's g-prior framework was used to define the prior variance structure for Ʃ [21]. The pseudoprior parameters were set to µg = 0, with Sg chosen using p = 10; these choices have also been shown to be adequate [18]. The pseudoprior parameters µg and Sg are only relevant to the behaviour of the MCMC chain and do not affect the posterior distribution [20].
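As a hedged illustration of a Zellner-type prior covariance, the sketch below (Python/NumPy) builds Ʃ proportional to (XᵀX)⁻¹ and a concentrated pseudoprior variance for each coefficient; the scaling constants, including how p = 10 enters, are assumptions made only for illustration and are not the exact choices of the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

J, k = 152, 5
X = rng.normal(size=(J, k))                    # standardised area-level covariates
p = 10                                         # constant referred to in the text

# Zellner-type prior covariance for beta: proportional to (X'X)^{-1} (the scale g is illustrative).
g = J
Sigma = g * np.linalg.inv(X.T @ X)

Sigma_g = np.diag(Sigma)                       # prior variances when covariates are included
mu_g = np.zeros(k)                             # pseudoprior means set to zero
S_g = Sigma_g / p ** 2                         # one common way to shrink the pseudoprior variance
```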
Because α is assumed fixed in our study and we have k covariates, a set of 2^k competing models is considered, M = {m1, m2, m3, …, m2^k}, and the posterior probability of model ma ∈ M is defined as:

f(ma | y) = f(y | ma) f(ma) / Σb f(y | mb) f(mb),  b = 1, …, 2^k
Bayesian model averaging (BMA) obtains the posterior inclusion probability of a candidate regressor, pr(βg ≠ 0 | y), g = 1, …, k, by summing the posterior model probabilities over those models that include the regressor.
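In MCMC output from GVS, this inclusion probability is simply the posterior frequency with which ψg = 1; the sketch below (Python/NumPy) computes it from a matrix of sampled indicators, which here is simulated purely for illustration.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(8)

k, n_draws = 5, 45_000
# Simulated GVS output: each row is one posterior draw of the indicator vector psi.
psi_draws = rng.binomial(1, [0.9, 0.15, 0.6, 0.05, 0.8], size=(n_draws, k))

inclusion_prob = psi_draws.mean(axis=0)        # pr(beta_g != 0 | y) estimated for each covariate

# Posterior model probabilities: relative frequency of each visited indicator pattern.
model_labels = [tuple(row) for row in psi_draws]
model_probs = {m: c / n_draws for m, c in Counter(model_labels).items()}
```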
Within the disease mapping context, the aim is usually prediction. In such cases, prediction should be based on the BMA technique, which also accounts for model uncertainty [22]. Whatever the final intention (prediction using BMA or selection of a single model), we need to evaluate posterior model probabilities.
2.7.1. Fully Bayesian Estimation
The Markov chain Monte Carlo (MCMC) method was employed to obtain a sample from the joint posterior distribution of the model parameters and hyperparameters, automatically generating samples from the marginal posteriors. It has been suggested that the Gibbs sampler be run for 100,000 iterations for GVS, after discarding the first 10,000 iterations as a burn-in period [23]. In our analyses, a total of 500,000 iterations was used, retaining every tenth posterior draw after a burn-in of 50,000 iterations; inference for each parameter was thus based on 45,000 posterior samples. Convergence to the posterior distribution was assessed using time series plots, correlograms and the Gelman-Rubin convergence statistic as implemented in WinBUGS and CODA/BOA [24,25].
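For reference, the potential scale reduction factor used in this convergence check can be computed directly from multiple chains; the sketch below (Python/NumPy, on simulated chains) follows the standard Gelman-Rubin formula rather than the WinBUGS/CODA implementations cited above.

```python
import numpy as np

rng = np.random.default_rng(9)


def gelman_rubin(chains):
    """Potential scale reduction factor for an (m_chains, n_iterations) array of draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance (times n)
    var_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
    return np.sqrt(var_hat / W)


chains = rng.normal(size=(3, 45_000))          # three well-mixed chains for one parameter
print(gelman_rubin(chains))                    # values near 1 indicate convergence
```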