Spatially Interpolated Disease Prevalence Estimation Using Collateral Indicators of Morbidity and Ecological Risk

This paper considers estimation of disease prevalence for small areas (neighbourhoods) when the available observations on prevalence are for an alternative partition of a region, such as service areas. Interpolation to neighbourhoods uses a kernel method extended to take account of two types of collateral information. The first is morbidity and service use data, such as hospital admissions, observed for neighbourhoods. Variations in morbidity and service use are expected to reflect prevalence. The second type of collateral information is ecological risk factors (e.g., pollution indices) that are expected to explain variability in prevalence in service areas, but are typically observed only for neighbourhoods. An application involves estimating neighbourhood asthma prevalence in a London health region involving 562 neighbourhoods and 189 service (primary care) areas.


Introduction
Geographic data on disease prevalence are important for assessing health inequities within regions, but are not necessarily collected for small areas or neighbourhoods suitable for health profiling. However, data on disease prevalence may be collected by health service providers for the populations they serve. Thus in the UK, prevalence data on chronic diseases is collected by general practitioner (GP) teams responsible for primary care. The population cared for by each GP team (subsequently called GP areas) is typically dispersed over several neighbourhoods. This raises the question whether data on prevalence for GP areas (source areas) can be used to provide spatially interpolated estimates for neighbourhoods (target areas) [1].

OPEN ACCESS
Many interpolation strategies rely simply on geographic location data for interpolating values to target areas, but additional information relevant to interpolation may be available. There may be observed data for neighbourhoods on morbidity outcomes (e.g., hospital admissions) providing ancillary information on neighbourhood prevalence. Ecological risk factor data (e.g., air quality) relevant to explaining prevalence variations may also be available. In the terminology of structural equation modeling there are both reflexive indicators (morbidity indicators), and formative indicators (ecological covariates), relevant to the interpolation process [2]. The analysis here extends that of Congdon [3] which considered situations involving only reflexive indicators. The analysis here is also distinctive in considering situations where the formative indicators are only observed for one spatial framework, and must be interpolated to the other.
The approach used here to take account of such collateral information is based on a technique known as discrete process convolution [4]. This method involves convolution of a random effect process over a discrete grid with a smoothing kernel to estimate the underlying spatial trend. The random effects are sampled at points in a regular grid covering the region. Whereas geostatistical interpolation using multivariate normal approaches tend to be computationally demanding as the number of source or target regions increases [5], the convolution method can be applied with large numbers of regions (e.g., over 500) in either or both source and target frameworks.
Typically two forms of interpolation are required. Prevalence estimation for neighbourhoods (the target areas) necessitates spatial interpolation, since prevalence is actually only observed for service areas (GP areas). Additionally, ecological covariates are typically only available for neighbourhoods. In order to assess links between prevalence and ecological covariates over GP areas, spatial interpolation of ecological covariates (formative indicators) to these areas is required.
The analysis involves comparison of different forms of kernel and discrete grid random effects. A Bayesian estimation method is applied via repeated sampling (Markov chain Monte Carlo or MCMC) methods. Both fit of the model to the observed data, and the characteristics of the spatially interpolated disease estimates, may be sensitive to kernel choice or the random effect process assumed. In applications (as here) involving formative indicators, there may be sensitivity to additional modeling choices, for example, linear or nonlinear effects of ecological covariates on disease prevalence.

Case Study Application
The case study setting is provided by a particular region within England (the largest country within the UK), where population prevalence registers for a range of chronic diseases are maintained by 8,200 GP teams under a system called the Quality Outcomes Framework (QOF). In the UK, primary care is provided by general practitioners from community based clinics, and either involves groups of GPs (and ancillary staff such as nurses) based in the same clinic, or single practitioners. The chronic prevalence data are provided only for such GP teams, and not for any small area populations. The application is to a health region within London, namely four London boroughs (Barking & Dagenham, Havering, Redbridge, Waltham Forest) collectively denoted as Outer North East London, and with a population of 970,000 in 2012.
There are 189 GP practice areas (source areas), and 562 neighbourhoods (target areas), in this part of London. Neighbourhoods are defined as Lower Level Super Output Areas (LSOAs), of which there are currently 32,844 across England, with an average of 1,500 residents and 650 households. These are derived from smaller Census output areas [6], subject to constraints of proximity (to ensure a compact shape), and social homogeneity within each LSOA (e.g., similar household tenure and dwelling type patterns).
It is important to note that data for GP service areas cannot be obtained by aggregating over neighbourhoods. There is no simple nested area hierarchy of neighbourhoods within GP service areas. In the UK in general, and the study region in particular, a GP service area is typically diffuse and scattered over several neighbourhoods. Conversely people in any particular neighbourhood can choose between GP teams and are typically split between several GP teams.
The analysis here uses data in one or other framework and interpolates to the other. For example, interpolated prevalence estimates of asthma in neighbourhoods use collateral morbidity and covariates observed in neighbourhoods, as well as observed asthma prevalence for GP practice areas. In this application the data consist of (a) counts of patients (in the financial year 2010-2011) with diagnosed asthma for GP practice areas (source areas); these counts are released as a single total for each GP practice area with no breakdown by sex or age; (b) collateral information on asthma hospital admissions for neighbourhoods (also counts) in 2010-2011; morbidity indicators such as this are taken as reflexive of prevalence for neighbourhoods (target areas), and (c) an air quality index for neighbourhoods in 2008 (target areas), based on levels of nitrogen dioxide, particulates, sulphur dioxide and benzene. The index was developed by the Geography Department at Staffordshire University using data from the UK National Air Quality Archive [7].
There is a slight time period discrepancy between the air quality data and the health datasets, but air quality differences between areas are likely to be broadly stable over such a period-as evidenced by stability in the London Atmospheric Emissions Inventory (LAEI) concentrations [8]. Also available are 2010 population estimates for both GP areas and neighbourhoods, so that expected prevalence totals (populations multiplied by the region-wide prevalence rate) can be obtained for both source and target areas. In other applications there might be more than one type of collateral indicator reflecting prevalence (e.g., mortality counts as well as hospitalizations) and more than one ecological predictor.

Model for Prevalence Data
This section considers modelling observed and latent prevalence data, for GP areas and neighbourhoods respectively. We consider first the model assumed for prevalence in GP areas. Consider total observed cases y i of disease in GP areas with geographic centroids (s 1 ,.., s N ) (sometimes called eastings and northings), where s i = (s 1i , s 2i ). These are assumed to be Poisson distributed: where the E i represent expected prevalence totals, and the ρ i then have a relative risk interpretation. Specifically the ρ i are interpretable as relative prevalence rates (with average 1). Variation in relative prevalence is assumed to be partitioned between that due to covariates X i , that due to a spatial process z 1 (s i ) (the specification of which is elaborated below), and a residual random term e i to account for Poisson over-dispersion. Over-dispersion is frequently encountered in practice for health count data [9]. Because relative risks are necessarily positive, a log-link is used so that: where β are regression parameters, and the residuals are assumed to be normally distributed with an area-specific variance V i : Variances may be related to population size in source areas (e.g., greater residual variation for smaller populations) and are modelled as: Note that, as in the present study, ecological predictors of prevalence (e.g., air quality) may not necessarily be observed for source areas (here GP areas). A subscript L (denoting latent data) is added to indicate this, so that Model 1 becomes: The spatial process z 1 (s i ) is distinct from another process used to interpolate predictor values X Li to GP areas (see Section 3.2), and is modelled by a convolution of a random effect and a smoothing kernel. The random effects, denoted w 1j , are sampled at a discrete grid of J < N points at geolocations u j = (u 1j , u 2j ). The interpolated spatial effect at points s i is then obtained as: where κ 1 (|s − u|) is a two-dimensional smoothing kernel, and d = |s − u| represents distance.
An analogous model is assumed for modelling neighbourhood prevalence counts. Although such counts are latent rather than observed, expected neighbourhood prevalence totals E k can be calculated. Then for k = 1,.., K neighbourhoods with centroids t k = (t 1k ,t 2k ), and expected prevalence counts E k , latent neighbourhood prevalence counts are distributed as: with risk model involving observed ecological covariates X k , namely: and with the spatial process z 1 defined with respect to t k as: The same variance model used for source area random effects e i is assumed for neighbourhood random effects e k , so that:

Using Collateral Morbidity and Ecological Covariate Data
There are two sources of collateral information: indicators that reflect prevalence, and indicators that may cause variations in prevalence (i.e., reflexive indicators and formative indicators respectively). Collateral reflexive indicators of morbidity are provided in the case study by neighbourhood asthma hospitalisations. These are taken as reflexive of prevalence in that higher prevalence is likely to be associated with higher asthma hospitalisation rates: if a neighbourhood population of a given size has a higher number of asthmatic patients, then other things equal it will tend to have higher number of inpatient admissions with asthma as the precipitating condition. Of course, in practice, there may be additional factors affecting hospitalisation levels, though these are likely to some extent to influence hospitalisation indirectly by affecting prevalence levels. These factors might include smoking levels, poverty rates, or urban-rural setting.
Consider hospitalisation data in the form of counts M k for neighbourhoods k. Then observed hospitalisation counts are taken as Poisson: where G k are expected hospitalisations, and relative hospitalisation risks ν k are modelled as a function of relative neighbourhood prevalence: It is expected that γ 2 > 0 on the basis that higher neighbourhood asthma prevalence rates ρ k will be associated with higher risks ν k of hospitalisation with asthma diagnosis. This is a substantive expectation, not a constraint explicitly applied in the model.
Reflexive indicators provide additional information relevant to interpolating neighbourhood prevalence. So also do formative indicators, namely area risk factors for prevalence. A number of studies indicate that asthma prevalence may be increased by air pollution [10,11]. The air quality index in the present analysis is a continuous index, and is available (i.e., observed) for the target areas (neighbourhoods) k = 1,.., K, but not available for GP areas.
Thus a second kernel process z 2 (s), defined using the same grid points u j as z 1 (s), is used to spatially interpolate air quality values to GP areas i = 1,.., N. Thus: This spatial process z 2 () can be denoted the risk factor process, as opposed to the prevalence spatial process z 1 ().

Assumptions Regarding the Spatial Process
The fit of the model to the observed data (asthma prevalence counts for GP areas, together with asthma hospitalisations, and air quality data for neighbourhoods) may be affected by assumptions regarding the kernels κ(d) and the random processes w j . Inferences about the spatial pattern of the interpolated neighbourhood asthma prevalence rates may also be sensitive to model specification (see Section 3.4).
While a bivariate normal kernel (with scale parameter α), namely: is often assumed, other bivariate kernel functions have been discussed, for example in the literature on seed dispersal [12,13]. For example, a two dimensional exponential has the form: and a two dimensional Student t has the form: For the random process w j , possible alternatives to a normal process include Student t: (with degrees of freedom ν), or w j taken as spatially correlated over points in the discrete grid [14].
Regarding estimation, it may be noted that if the parameters of the random process w and of the kernel function κ(d) are both taken as unknown, they will be confounded. For unique identification, one can adopt standardised forms in either the kernel or the random process, with the unknown parameter(s) then remaining in the other component.

Assessing Alternative Spatial Process Assumptions
Overall fit to the observed data (prevalence y, admissions M, pollution x) is here based on the Deviance Information Criterion [15]. Sensitivity of inferences to model assumptions is also important. In particular, for assessing spatial pattern in the interpolated neighbourhood prevalence rates, one may compare models in terms of: (a) localised hot spot probabilities of high asthma risk (excess risk in each neighbourhood without regard to risk in surrounding neighbourhoods), (b) clustering of excess asthma risk, namely elevated risk in both neighbourhood k and the surrounding neighbourhoods.
To this end one may define binary indicators: where I(R) = 1 if condition R pertains and I(R) = 0 otherwise. The Bayesian estimation process can be used to provide an estimate of probabilities Pr(ρ k > 1|y,M,x) that relative prevalence risk is above average. The classification of areas as having excess localised risk may be based on a threshold such as 0.9 or 0.8 [16], namely: Pr(ρ k > 1|y,M,x) > 0.8 (23) To assess spatial clustering in excess asthma risk, one may monitor co-occurrence of excess risk in both neighbourhood k, and in the L k neighbourhoods in its surrounding locality A k . Thus consider the indicators: Then C k is a probability indicator of a high risk cluster centred on area k, since higher values for C k occur as more adjacent areas have high risk J l = 1 (for l ∈ A k ) in combination with high risk in area k itself (J k = 1).
To compare how different model specifications (e.g., kernel and grid random process choices) affect the resulting interpolated spatial prevalence pattern, one may cross-tabulate totals of neighbourhoods classified (or not) as localized high risk or as high risk cluster centers.

Analysis Framework
The model set out above is applied to spatial interpolation of asthma and pollution, and to estimation of the impact of pollution on prevalence in Outer NE London. Bayesian estimation is carried out using the WINBUGS package. Observed asthma data (ICD10 codes J45-J46) consist of prevalence counts y i for GP practice areas in 2010-2011. These data provide source area observations to make interpolated prevalence estimates for neighbourhoods (target areas). Additional collateral data are asthma hospitalisations M k for neighbourhoods, and air quality indices x k (log transformed), also for neighbourhoods. Note that higher x values denote worse air quality. Figures 1 and 2 contain maps of hospitalization rates (crude rates per 1,000 population) and the logged air quality indices.  Of importance for practical application of the methodology of Section 3 is stability or otherwise of inferences regarding the distribution of, and spatial patterning in, the interpolated asthma prevalence rates in neighbourhoods. Four alternative models are applied differing in kernel form (bivariate normal vs. bivariate exponential) and in the random process (normal vs. Student t). The same assumption regarding the kernel-process combination is applied both to the prevalence process z 1 and the risk factor process z 2 . There is a wide range of possible model options: different kernels, different form of random grid effect (w j ), and differing regression forms for ρ on x. The selected four models are intended as a representative subset of the possibilities.
All four models assume standard kernels and a linear regression effect of ecological covariates x on prevalence risk ρ. Thus model 1 combines a standard normal kernel (α = 1) with a normal process w j ~ N(0,σ 2 w ) with unknown variance. Model 2 combines a standard exponential kernel ( = 1) with a normal process w j ~ N(0,σ 2 w ). As mentioned by Clark et al. [13], exponential kernels allow for more leptokurtic (more peaked and fat tailed) densities than normal kernels, and so exponential kernels may represent a more flexible assumption for latent prevalence rates ρ and air quality rates x. Model 3 combines a standard normal kernel with a Student t process with five degrees of freedom w j ~ t(0, 5, σ 2 w ), and Model 4 combines a standard exponential kernel with a Student t process with five degrees of freedom. The Student t with low degrees of freedom allows for heavier tails than the normal.
Bayesian inference requires that prior densities on parameters be specified. Since there is no extensive prior evidence, relatively diffuse options are chosen. Gamma priors with scale 1 and index 0.001 are assumed for unknown precisions, and normal priors with mean 0 and variance 1,000 are assumed for regression parameters {β,γ,δ}. Inferences are based on the second halves of two chain runs of 25,000 iterations, with convergence assessed using Brooks-Gelman criteria [17].
While stability of inferences is an important aspect in assessing the interpolated prevalence data, fit to the three observed datasets is, of course, also central. Table 1 uses the DIC criterion to assess how well the method estimates the observed small-area counts or rates, whether obtained on one area frame (GP service areas) or the other (residential neighbourhoods). The DIC values are based on the Poisson deviance for y i (prevalence counts observed over GP service areas) and M k (hospitalisations observed over neighbourhoods) namely: and: The deviance for the air quality data x k (observed over neighbourhoods) is: where µ k = η 1 + z 2 (t k ).

Results
Defining a composite measure of fit is complicated by the different scales of the three sets of observations (and hence deviances), but it can be seen from Table 1 that no model provides a best fit across all observations. The best fit for the y-data (generated by modelled prevalence rates ρ i in GP areas) and M-data is provided by a normal kernel κ combined with a normal process w, while the best fit for the x-data is provided by an exponential kernel combined with Student t errors. Such results illustrate that default choices such as normal kernels may not necessarily provide best performance. Observed data: y, asthma prevalence counts for 189 GP practice areas; M, asthma hospitalisations for 562 neighbourhoods. Table 2 summarises estimates of the main structural parameters across sub-models. These are β 1 and β 2 in the GP area and neighbourhood prevalence equations: the "reflexive effect" parameter γ 2 in the model for neighbourhood hospitalisations: log(ν k ) = γ 1 + γ 2 ρ k (30) and the heteroscedasticity parameters in the log-variance equations: It can be seen that all models agree on the positivity of γ 2 , namely that asthma hospitalisation rates in neighbourhoods increase in line with neighbourhood prevalence of asthma. All models also provide closely similar results on the level of prevalence (the parameter β 1 ).
However, results are more equivocal concerning the effect of pollution (poor air quality) on asthma prevalence. Posterior means for this coefficient are all positive, ranging from 0.22 to 0.31, but a significant effect, namely an entirely positive 90% credible interval, is only apparent under Model 1. There is also no clear evidence that variances of unstructured residuals e i and e k are related to population size.
To assess stability in inferences about the spatial pattern of asthma risk, the four models are compared in terms of co-location of neighbourhoods identified as having high localized risk, namely: Table 4 considers distributional features of neighbourhood asthma prevalence as indicated by posterior mean asthma prevalence rates from the four models (expressed as percentage rates). There is close agreement between Models 2-4, but slightly lower mean prevalence, and lower 95th and 99th percentiles under Model 1. Figure 3 depicts the distribution of asthma prevalence over the Outer NE London region, using Model 1 estimates. There is a close correspondence between prevalence and hospitalization rates ( Figure 1), even though the latter are simple crude rates, so providing consistent evidence on the asthma burden.

Concluding Remarks
Profiles of neighbourhood prevalence of major chronic diseases such as asthma are important for public health priority setting. In the UK, the prevalence of a number of chronic diseases is recorded by General Practitioner teams for the areas they serve, but this data is not provided for neighbourhoods.
To make spatially interpolated estimates of prevalence for neighbourhoods one might simply rely on using geographic location data and apply a multivariate Gaussian process. However, such a procedure neglects subsidiary information about neighbourhood disease prevalence. First, there may be reflexive indicators (e.g., neighbourhood level hospitalisations or deaths due to the disease being considered). Second there may be observations for neighbourhoods on ecological risk factors for prevalence of the disease. This paper has demonstrated how such information can be utilised within an encompassing model for prevalence, reflexive morbidity and ecological risk indices, with a risk interpolation model being necessary if the risk variable (here air quality) is only observed for neighbourhoods. Interpolation is based on a kernel convolution methodology combining different possible assumptions about kernel form and random process.
Among the four models considered, none has a clear best fit to the observed data. There is some sensitivity in inferences about interpolated prevalence in neighbourhoods (the target areas), the main differences being between Model 1 (combining a normal kernel and a normal random process) compared to the other models. However, inferences are broadly stable about the distributional form and spatial patterning between the four models. Such stability supports use of the methodology for other regions and diseases.
Other techniques are available for spatial interpolation, but are not adapted to including the interconnected ancillary variables used here in an overarching interpolation methodology which recognizes their interdependence. GIS packages could be used to interpolate asthma neighbourhood prevalence rates using GP area prevalence data, and to separately interpolate pollution rates to GP service areas using neighbourhood pollution data. For example, the ArcGIS package uses deterministic interpolation techniques such as inverse distance weighted interpolation, as well as geostatistical methods [18]. This type of interpolation would not take into account additional information provided by ancillary variables such as neighbourhood asthma hospitalisations.
Off-the-shelf interpolation techniques will also not typically provide the full range of inferences possible with the Bayesian simulation methods used in the current paper. Thus as well as estimating asthma prevalence, one may use the simulation (see Section 3.4) to estimate the probabilities of elevated risk in a neighbourhood, and also whether or not elevated risk characterizes both a particular area and its neighbours-that is to distinguish between spatial clustering and spatial outliers [19]. In broader terms, Bayesian estimation has benefits in fitting relatively complex random effects models such as the one in this paper, whereas classical estimation would have to involve approximate numerical integration methods (e.g., quadrature), and may not necessarily even be feasible.
It is important for public health priority setting to identify areas with excess risk and also spatial clustering of excess risk, as evidence of either pattern may provide support for targeted interventions, resourcing or health promotion campaigns. One might apply the methodology here to other respiratory diseases where interpolation is relevant, and where ancillary variables are also generally available. Examples are other types of respiratory disease prevalence data available only for GP service areas (such as COPD prevalence), or when incidence data (e.g., for lung cancer) is only released for relatively aggregated areas. For example, in the UK, lung cancer incidence data are released for middle level super output areas (of which there are currently 6,791 across England), rather than LSOAs (of which there are 32,844). The methods in this paper can be adapted to this type of application, namely to interpolate lung cancer incidence to LSOAs. Having identified neighbourhoods with jointly elevated risk across conditions (just as Table 3 and Figure 3 identify excess risk asthma risk), one may then ascertain whether common precipitating influences are present.

Conflicts of Interest
The author declares no conflict of interest.