#### 2.1. Esophageal Cancer Incidence Data in the Caspian Region of Iran

Residents of Mazandaran and Golestan provinces of Iran constituted the study population. The aims of the analysis were to determine the extent of spatial variability in risk for esophageal cancer (EC) in this area, and to assess the degree to which this variability is associated with socioeconomic status (SES) and dietary pattern indices. During the study period, there were 1,693 EC cases in a population of around 4.5 million people. Population and EC counts were available for the 152 agglomerations in the Mazandaran and Golestan provinces. Geographic coordinates were also obtained that approximately reflected the geographical centroid of each agglomeration. The distances between agglomeration centres were measured in kilometres and ranged from 9 to 507 km.

Figure 1a shows the geographic boundaries of wards, cities and rural agglomerations within wards, in the two provinces. Adjustment of incidence rates for differences in the age structure of agglomerations was accomplished by calculating SIRs with a 2003 population reference.

Figure 1b shows strong spatial aggregation in the SIRs, with a tendency for higher EC rates in the eastern and central agglomerations and lower rates in the west.

Explanatory variables relating to SES were available for each of the 152 agglomerations, and dietary variables for each of the 26 wards [14]. Factor analysis was used to summarise the SES and diet variables into a few uncorrelated factors: for SES, “income”, “urbanisation” and “literacy”, with lower values indicating greater deprivation; and for diet, an “unrestricted food choice diet”, characterized by high intake of foods generally thought to be preventive against EC, and a “restricted food choice diet”, with positive loadings on risky foods. For each ward, the estimated percentage of the population with diet factor scores in the highest (third) tertile was used in the regression models. For the socioeconomic components, the factor scores for each agglomeration were entered into the regression model as continuous covariates. Further details on how the diet and SES factors were created and defined can be found elsewhere [14].

Log-linear models are often used to describe the dependence of the mean function on k covariates, X_{1}, …, X_{k}. A general form for this type of model for J geographically-defined units (areas) is given by:

log E(Y_{j}) = log(E_{j}) + X_{j}β^{T} + θ_{j},  j = 1, …, J, (1)

where Y_{j} is the count for area j, E_{j} denotes an “expected” count in area j that is assumed known, X_{j} = (1, X_{j1}, …, X_{jk}) is a 1 × (k + 1) vector of area-level risk factors, β = (β_{0}, β_{1}, …, β_{k}) is a 1 × (k + 1) vector of regression parameters, and θ_{j} represents a residual with no spatial structure (so that θ_{i} and θ_{j} are independent for i ≠ j).

**Figure 1.**
(**a**) Geographic boundaries of wards (bold polygons), cities (grey polygons) and rural agglomerations within wards, in Mazandaran and Golestan provinces; (**b**) Observed spatial pattern; and (**c**) model adjusted SIR.


#### 2.2. Model & Data Structure

The raw data are in the form of disease counts, Y_{j}, and population counts, N_{j}, in region j. The expected count E_{j}, which adjusts for the age structure of an agglomeration, was obtained by age-standardisation. SIRs then follow from the relationship SIR_{j} = Y_{j}/E_{j}, so that Equation (1) is equivalent to a model for agglomeration-level SIRs. Poisson, generalised Poisson and negative binomial distributions are considered for modelling counts at the agglomeration level, and for each of these distributional assumptions non-spatial, neighbourhood-based and distance-based spatial correlation structures are compared. These analysis approaches are now described in detail.
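As an illustration of the standardisation step, the sketch below computes expected counts and SIRs for two hypothetical agglomerations. All age strata, reference rates and counts are invented for the example (and `expected_count` is a hypothetical helper, not the study's software); only the relationships E_{j} = Σ (stratum population × reference rate) and SIR_{j} = Y_{j}/E_{j} come from the text.

```python
# Illustrative sketch of indirect age-standardisation and SIR calculation.
import numpy as np

def expected_count(pop_by_age, ref_rates):
    """E_j: sum over age strata of stratum population x reference incidence rate."""
    return float(np.dot(pop_by_age, ref_rates))

# Hypothetical reference incidence rates per person (e.g. a 2003 standard).
ref_rates = np.array([0.0001, 0.0005, 0.002])

# Two hypothetical agglomerations with different age structures (young vs old).
pop_a = np.array([30000, 20000, 10000])
pop_b = np.array([10000, 20000, 30000])

E_a = expected_count(pop_a, ref_rates)  # 3 + 10 + 20 = 33.0 expected cases
E_b = expected_count(pop_b, ref_rates)  # 1 + 10 + 60 = 71.0 expected cases

# SIR_j = Y_j / E_j: observed counts relative to age-adjusted expectation.
Y_a, Y_b = 40, 60
sir_a, sir_b = Y_a / E_a, Y_b / E_b
```

Note how the older population (b) accrues a larger expected count from the same rates, so equal observed counts would translate into a lower SIR.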

#### 2.3. Distributions for Disease Counts

The Poisson model is given by:

f(y_{j}) = exp(−λ_{j}) λ_{j}^{y_{j}}/y_{j}!,  y_{j} = 0, 1, 2, …

The Poisson distribution has mean and variance E(Y_{j}) = V(Y_{j}) = λ_{j}.

The negative binomial, NB, distribution can be constructed by adding a hierarchical element to the Poisson distribution through a random effect ε_{j}; specifically:

Y_{j} ǀ ε_{j} ~ Poisson(λ_{j}ε_{j}),  ε_{j} ~ Gamma(ϑ, ϑ),

for y_{j} = 0, 1, 2, 3, …, where ϑ > 0. The resulting probability distribution function marginal to ε_{j} is:

f(y_{j}) = [Γ(y_{j} + ϑ)/(Γ(ϑ) y_{j}!)] (ϑ/(ϑ + λ_{j}))^{ϑ} (λ_{j}/(ϑ + λ_{j}))^{y_{j}},

for y_{j} = 0, 1, 2, 3, …, with E(Y_{j}) = λ_{j} and V(Y_{j}) = λ_{j} + λ_{j}^{2}/ϑ.

The negative binomial model has the property that the variance is always greater than the mean; ϑ is the extra-Poisson variation parameter, with large values of ϑ corresponding to variability closer to that of the Poisson distribution. As ϑ → ∞ the distribution of Y_{j} converges to a Poisson random variable.
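The Poisson-Gamma construction can be checked by simulation. The sketch below assumes the usual Gamma(ϑ, ϑ) mixing distribution for ε_{j} (mean 1, variance 1/ϑ), which reproduces the stated marginal moments E(Y_{j}) = λ_{j} and V(Y_{j}) = λ_{j} + λ_{j}^{2}/ϑ; all numeric settings are illustrative.

```python
# Sketch: the negative binomial as a Poisson-Gamma mixture.
import numpy as np

rng = np.random.default_rng(0)
lam, theta = 5.0, 2.0
n = 200_000

# eps_j ~ Gamma(shape=theta, rate=theta): mean 1, variance 1/theta.
eps = rng.gamma(shape=theta, scale=1.0 / theta, size=n)
# Y_j | eps_j ~ Poisson(lam * eps_j).
y = rng.poisson(lam * eps)

# Marginal moments should match E(Y) = lam and V(Y) = lam + lam**2 / theta.
print(y.mean())  # ~ 5.0
print(y.var())   # ~ 5 + 25/2 = 17.5
```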

The generalized Poisson, G-Poisson, model with parameters λ_{j} and ω is defined as [9]:

f(y_{j}) = λ_{j}(1 − ω)[λ_{j}(1 − ω) + ωy_{j}]^{y_{j} − 1} exp[−λ_{j}(1 − ω) − ωy_{j}]/y_{j}!,

for y_{j} = 0, 1, 2, 3, …, and has E(Y_{j}) = λ_{j} and V(Y_{j}) = λ_{j}(1 − ω)^{−2}. For ω = 0, the generalized Poisson reduces to the Poisson distribution with mean λ_{j}.
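As a numerical check, the sketch below evaluates an assumed mean-parameterised form of the G-Poisson pmf (a standard form consistent with the stated moments, not necessarily the exact expression of [9]) and verifies that it sums to one with mean λ and variance λ(1 − ω)^{−2}; the parameter values are illustrative.

```python
# Sketch: mean-parameterised generalised Poisson pmf, evaluated on a log
# scale to avoid overflow for large y.
from math import exp, log, lgamma

def gpoisson_pmf(y, lam, omega):
    # theta = lam * (1 - omega) gives E(Y) = lam, V(Y) = lam * (1 - omega)**-2.
    theta = lam * (1.0 - omega)
    if y == 0:
        return exp(-theta)
    logp = (log(theta) + (y - 1) * log(theta + omega * y)
            - theta - omega * y - lgamma(y + 1))
    return exp(logp)

lam, omega = 4.0, 0.3
support = range(400)  # truncation; the tail mass beyond this is negligible
probs = [gpoisson_pmf(y, lam, omega) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(support, probs))
# total ~ 1, mean ~ lam = 4, var ~ lam / (1 - omega)**2 ~ 8.16
```

Setting ω = 0 in this form recovers the Poisson pmf, matching the reduction noted above.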

Bayesian inference is based on constructing a model m (which encapsulates the distributional assumptions and covariate relationships with the outcome), its likelihood f(Y ǀ γ_{m}, m), and the corresponding prior distribution f(γ_{m} ǀ m), where γ_{m} is the parameter vector under model m and Y is the outcome variable vector. We use the following hierarchical structure on the model parameters:

f(γ_{m}, m ǀ Y) ∝ f(Y ǀ γ_{m}, m) f(γ_{m} ǀ m) f(m),

where f(m) is the prior probability for the entry of covariates into the linear predictor of the larger model m within one of the three probability model classes above.

The maximum number of candidate models given k covariates (considered additively, i.e., with no interactions) is 2^{k}. The usual choice of prior on the model m is the uniform distribution over the model space generated by the covariate set M = {β_{1}, …, β_{k}}. We used this uniform distribution because it can be regarded as noninformative in the sense of favouring all candidate models equally within the same probability model class.
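The 2^{k} model space and its uniform prior can be enumerated directly. The sketch below does so for the three SES factors; the covariate names are taken from the text, but the enumeration itself is purely illustrative.

```python
# Sketch: enumerating the 2^k additive candidate models for k covariates.
from itertools import product

covariates = ["income", "urbanisation", "literacy"]  # k = 3 SES factors

# Each model is the subset of covariates with inclusion indicator 1.
models = [tuple(c for c, keep in zip(covariates, inc) if keep)
          for inc in product([0, 1], repeat=len(covariates))]

prior = 1.0 / len(models)  # uniform prior: each model gets 1 / 2^k
print(len(models))  # 8 models, from the null model () to the full model
```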

#### 2.4. Hierarchical Models for Relative Risks

Model (1) is a non-spatial model in the sense that it neither recognizes the distance-based relationships among the J agglomerations, nor allows for any neighbourhood-based effects between adjacent areas, whereby counts in one area might be related to counts in adjacent areas. Suppose the variability in the {Y_{j}}_{j = 1, …, J} follows a spatial model that incorporates assumptions about the spatial relationships between areas. We then extend (1) as:

log E(Y_{j}) = log(E_{j}) + X_{j}β^{T} + θ_{j} + ϕ_{j},  j = 1, …, J,

where the new parameter ϕ_{j} represents a residual with spatial structure, with ϕ_{i} and ϕ_{j}, i ≠ j, modelled to have positive spatial dependence. Two approaches are used for modelling the J-dimensional random variable ϕ: distance-based and neighbourhood-based spatial correlation structures.

In the distance-based approach, the multivariate normal distribution MVN(µ, τΣ) is specified for ϕ, where µ is a 1 × J mean vector, τ > 0 controls the overall variability of the ϕ_{i}, and Σ is a J × J positive definite matrix. If d_{ij} denotes the distance between the centroids of agglomerations i and j, then we specify:

Σ_{ij} = f(d_{ij}; ν, κ),  where f(d_{ij}; ν, κ) = exp[−(νd_{ij})^{κ}].

In this specification ν > 0 controls the rate of decrease of correlation with distance, with large values representing rapid decay, and τ is a scalar representing the overall precision. The parameter κ ϵ (0, 2] controls the amount by which spatial variations in the data are smoothed. Large values of κ lead to greater smoothing, with κ = 2 corresponding to the Gaussian correlation function [15]. The distance-based parameters ν, τ and κ are referred to jointly in what follows.
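To make the distance-based specification concrete, the sketch below builds the powered-exponential correlation matrix exp[−(νd_{ij})^{κ}] for three areas; the distances and parameter values are invented for illustration.

```python
# Sketch: powered-exponential correlation matrix on centroid distances.
import numpy as np

def powered_exp_corr(d, nu, kappa):
    """Correlation exp[-(nu * d)**kappa]; kappa in (0, 2], nu > 0."""
    return np.exp(-(nu * d) ** kappa)

# Hypothetical pairwise distances (km) between three agglomeration centroids.
d = np.array([[0.0, 50.0, 120.0],
              [50.0, 0.0, 90.0],
              [120.0, 90.0, 0.0]])

nu, kappa = 0.02, 1.0  # decay rate and smoothness (kappa = 1: exponential decay)
Sigma = powered_exp_corr(d, nu, kappa)

# Correlation is 1 on the diagonal, decays with distance, and Sigma should
# be positive definite (all eigenvalues strictly positive).
print(np.linalg.eigvalsh(Sigma).min() > 0)  # True
```

Larger ν shrinks the off-diagonal entries toward zero (faster decay); κ = 2 would give the Gaussian correlation function mentioned above.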

Besag et al. [16] propose modelling the spatial components ϕ_{i} via a conditional autoregression (CAR) prior, describing the spatial variation in the heterogeneity component so that geographically close areas tend to present similar risks. One way of expressing this spatial structure is via Markov random field models, in which the distribution of each ϕ_{i} given all the other elements {ϕ_{1}, …, ϕ_{i − 1}, ϕ_{i + 1}, …, ϕ_{J}} depends only on its neighbourhood [17]. A commonly used form for the conditional distribution of each ϕ_{i} is the Gaussian:

ϕ_{i} ǀ ϕ_{\i} ~ N(Σ_{j ≠ i} π_{ij}ϕ_{j} / Σ_{j ≠ i} π_{ij}, 1/(σ_{ϕ} Σ_{j ≠ i} π_{ij})),  (8)

where the prior mean of each ϕ_{i} is defined as a weighted average of the other ϕ_{j}, j ≠ i, the weights π_{ij} define the relationship between area i and its neighbours, and the precision parameter σ_{ϕ} controls the amount of variability of the random effect.

Although other possibilities exist, the simplest and most commonly used neighbourhood structure is defined by the existence of a common border of any length between areas. In this case, the weights π_{ij} in Equation (8) are constants, specified as π_{ij} = 1 if i and j are adjacent and π_{ij} = 0 otherwise. The conditional prior mean of ϕ_{i} is then the arithmetic average of the spatial effects of its neighbours, and the conditional prior variance is inversely proportional to the number of neighbours.
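The adjacency-based conditional moments can be sketched directly: given a 0/1 adjacency matrix, the conditional prior mean of ϕ_{i} is the average of its neighbours' effects and the conditional variance scales inversely with the number of neighbours. The adjacency pattern, ϕ values and variance scale below are invented.

```python
# Sketch: CAR conditional mean and variance under simple adjacency weights.
import numpy as np

# Hypothetical adjacency for 4 areas (1 = shared border, 0 = otherwise).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

phi = np.array([0.2, -0.1, 0.4, 0.0])  # current spatial effects
sigma2 = 1.0                           # conditional variance scale

n_neighbours = W.sum(axis=1)
cond_mean = W @ phi / n_neighbours  # average of each area's neighbours
cond_var = sigma2 / n_neighbours    # inversely proportional to neighbour count

# Area 0 borders areas 1 and 2, so its conditional mean is (-0.1 + 0.4)/2.
print(cond_mean[0])  # ~ 0.15
```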

#### 2.7. Gibbs Variable Selection, GVS

Candidate models can be represented as (ψ, α) ϵ M × {0, 1}^{k}, where ψ is a set of binary indicator variables ψ_{g} (g = 1, …, k), with ψ_{g} = 1 or 0 representing respectively the presence or absence of covariate g in the model, and α denotes other structural properties of the model. For the generalised linear models in this study, α describes the distribution, link function, variance function and (un)structured terms, and the linear predictor may be written as:

η_{j} = β_{0} + Σ_{g = 1}^{k} ψ_{g}β_{g}X_{jg}.

We assume that α is fixed, and we concentrate on estimating the posterior distribution of β within the class of probability models defined by α. The prior for (β, ψ) is specified as f(β, ψ) = f(β ǀ ψ) f(ψ). Furthermore, β can be partitioned into two vectors, β_{ψ} and β_{\ψ}, corresponding to those components of β that are included (ψ_{g} = 1) or not included (ψ_{g} = 0) in the model. The prior f(β ǀ ψ) may then be partitioned into a “model” prior f(β_{ψ} ǀ ψ) and a “pseudo” prior f(β_{\ψ} ǀ β_{ψ}, ψ) [18]. The full posterior distribution for the model parameters is given by:

f(β, ψ ǀ y) ∝ f(y ǀ β, ψ) f(β_{ψ} ǀ ψ) f(β_{\ψ} ǀ β_{ψ}, ψ) f(ψ),

and we assume that the actual model parameters β_{ψ} and the inactive parameters β_{\ψ} are a priori independent given ψ. This assumption implies that f(β_{ψ} ǀ β_{\ψ}, ψ, y) ∝ f(y ǀ β, ψ) f(β_{ψ} ǀ ψ) and f(β_{\ψ} ǀ β_{ψ}, ψ, y) ∝ f(β_{\ψ} ǀ ψ).

The Gibbs sampling procedure is summarized by the following three steps [19]:

- (1). Sample the parameters included in the model from the posterior: f(β_{ψ} ǀ β_{\ψ}, ψ, y) ∝ f(y ǀ β, ψ) f(β_{ψ} ǀ ψ).
- (2). Sample the parameters excluded from the model from the pseudoprior: f(β_{\ψ} ǀ β_{ψ}, ψ, y) ∝ f(β_{\ψ} ǀ ψ).
- (3). Sample each variable indicator ψ_{g} from a Bernoulli distribution with success probability O_{g}/(1 + O_{g}), where O_{g} is given by:

O_{g} = [f(y ǀ β, ψ_{g} = 1, ψ_{\g}) f(β ǀ ψ_{g} = 1, ψ_{\g}) f(ψ_{g} = 1 ǀ ψ_{\g})] / [f(y ǀ β, ψ_{g} = 0, ψ_{\g}) f(β ǀ ψ_{g} = 0, ψ_{\g}) f(ψ_{g} = 0 ǀ ψ_{\g})],

where ψ_{\g} denotes all terms of ψ except ψ_{g}.

The algorithm is further simplified by assuming prior conditional independence of all β_{g} for each model ψ. Each prior for β_{g} ǀ ψ then consists of a mixture of the true prior f(β_{g} ǀ ψ_{g} = 1, ψ_{\g}) for the parameter and a pseudoprior f(β_{g} ǀ ψ_{g} = 0, ψ_{\g}). As a result:

f(β_{g} ǀ ψ) = ψ_{g} f(β_{g} ǀ ψ_{g} = 1, ψ_{\g}) + (1 − ψ_{g}) f(β_{g} ǀ ψ_{g} = 0, ψ_{\g}). (13)

We considered a Normal prior and pseudoprior for the β_{g}, resulting in:

f(β_{g} ǀ ψ_{g} = 1, ψ_{\g}) = N(0, Ʃ_{g}),

and:

f(β_{g} ǀ ψ_{g} = 0, ψ_{\g}) = N(µ_{G}, S_{G}),

where µ_{G} and S_{G} are respectively the mean and variance of the pseudoprior distribution and Ʃ_{g} is the prior variance when covariate g is included in the model.

The Normal prior assumption and Equation (13) result in a prior that is a mixture of two Normal distributions:

f(β_{g} ǀ ψ) = ψ_{g} N(0, Ʃ_{g}) + (1 − ψ_{g}) N(µ_{G}, S_{G}). (14)

Using priors Equation (14) and Equation (9) gives the following full conditional posterior:

f(β_{g} ǀ ψ, y) ∝ f(y ǀ β, ψ) N(0, Ʃ_{g}) if ψ_{g} = 1, and f(β_{g} ǀ ψ, y) ∝ N(µ_{G}, S_{G}) if ψ_{g} = 0,

indicating that the pseudoprior, f(β_{g} ǀ ψ_{g} = 0), does not affect the posterior distribution of the model coefficients.
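As a toy illustration of this mixture structure, the sketch below draws β_{g} from a diffuse “true” prior when ψ_{g} = 1 and from a tight pseudoprior when ψ_{g} = 0; all variances and means are invented, and the point is only that excluded coefficients stay concentrated near the pseudoprior (which aids MCMC mixing without entering the posterior).

```python
# Sketch: sampling beta_g from the GVS mixture prior.
import numpy as np

rng = np.random.default_rng(1)

def draw_beta(psi_g, Sigma_g=4.0, mu_G=0.0, S_G=0.04):
    """True prior N(0, Sigma_g) if psi_g = 1, else pseudoprior N(mu_G, S_G)."""
    if psi_g == 1:
        return rng.normal(0.0, np.sqrt(Sigma_g))
    return rng.normal(mu_G, np.sqrt(S_G))

included = np.array([draw_beta(1) for _ in range(50_000)])
excluded = np.array([draw_beta(0) for _ in range(50_000)])

# Included coefficients are diffuse; excluded ones hug the pseudoprior mean.
print(included.std() > excluded.std())  # True
```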

When no restrictions on the model space are imposed, a common prior for the indicator variables ψ_{g} is f(ψ_{g}) = Bernoulli(0.5) [20]. The Gibbs sampler was started with all ψ_{g} = 1, which corresponds to starting with the full model.

Consider Ʃ as the prior covariance matrix constructed for the whole parameter vector β when the multivariate extension of prior distribution (14) is used for each β_{g}. Zellner’s g-prior framework was used to define the prior variance structure for Ʃ [21]. The choices µ_{G} = 0, and S_{g} based on p = 10, were made as they have been shown to be adequate [18]. The pseudoprior parameters µ_{G} and S_{g} are only relevant to the behaviour of the MCMC chain and do not affect the posterior distribution [20].

Because α is assumed fixed in our study and we have k covariates, a set of 2^{k} competing models is considered, M = {m_{1}, m_{2}, m_{3}, …, m_{2^{k}}}, and the posterior probability of model m_{a} ϵ M is defined as:

f(m_{a} ǀ y) = f(y ǀ m_{a}) f(m_{a}) / Σ_{m_{b} ϵ M} f(y ǀ m_{b}) f(m_{b}).

Bayesian model averaging (BMA) obtains the posterior inclusion probability of a candidate regressor, pr(β_{g} ≠ 0 ǀ y), g = 1, …, k, by summing the posterior model probabilities over those models that include the regressor.
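This summation can be sketched directly: enumerate the model space, assign each model a posterior probability, and sum over the models containing each covariate. The posterior probabilities below are made-up placeholder values (a flat posterior), used only to show the mechanics.

```python
# Sketch: posterior inclusion probabilities from posterior model probabilities.
from itertools import product

covariates = ["income", "urbanisation", "literacy"]
models = list(product([0, 1], repeat=len(covariates)))  # 2^3 = 8 models

# Hypothetical posterior model probabilities f(m | y), summing to 1.
post = {m: 1.0 / len(models) for m in models}

# pr(beta_g != 0 | y): sum f(m | y) over models m that include covariate g.
inclusion = {c: sum(p for m, p in post.items() if m[g] == 1)
             for g, c in enumerate(covariates)}
print(inclusion["income"])  # 0.5 under a flat posterior
```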

Within the disease mapping context, the aim is usually prediction. In such cases, prediction should be based on the BMA technique, which also accounts for model uncertainty [22]. Whatever the final aim (prediction using BMA or selection of a single model), posterior model probabilities must be evaluated.

#### 2.7.1. Fully Bayesian Estimation

The Markov chain Monte Carlo (MCMC) method was employed to obtain a sample from the joint posterior distribution of the model parameters, automatically generating samples from the marginal posteriors of the parameters and hyperparameters. It has been suggested that the Gibbs sampler be run for 100,000 iterations for GVS, discarding the first 10,000 iterations as a burn-in period [23]. In our analyses, a total of 500,000 runs was used, keeping every tenth posterior draw after a burn-in of 50,000 runs, so that inference for every parameter was based on 45,000 posterior samples. Convergence to the posterior distribution was assessed using time-series plots, correlograms and the Gelman-Rubin convergence statistic as implemented in WinBUGS and CODA/BOA [24,25].