Multi-Scale Multivariate Models for Small Area Health Survey Data: A Chilean Example

Background: We propose a general approach to the analysis of multivariate health outcome data where geo-coding at different spatial scales is available. We propose multiscale joint models which address the links between individual outcomes and also allow for correlation between areas. The models are highly novel in that they exploit survey data to provide multiscale estimates of the prevalences in small areas for a range of disease outcomes. Results The models incorporate both disease specific, and common disease spatially structured components. The multiple scales envisaged is where individual survey data is used to model regional prevalences or risks at an aggregate scale. This approach involves the use of survey weights as predictors within our Bayesian multivariate models. Missingness has to be addressed within these models and we use predictive inference which exploits the correlation between diseases to provide estimates of missing prevalances. The Case study we examine is from the National Health Survey of Chile where geocoding to Province level is available. In that survey, diabetes, Hypertension, obesity and elevated low-density cholesterol (LDL) are available but differential missingness requires that aggregation of estimates and also the assumption of smoothed sampling weights at the aggregate level. Conclusions: The methodology proposed is highly novel and flexibly handles multiple disease outcomes at individual and aggregated levels (i.e., multiscale joint models). The missingness mechanism adopted provides realistic estimates for inclusion in the aggregate model at Provincia level. The spatial structure of four diseases within Provincias has marked spatial differentiation, with diabetes and hypertension strongly clustered in central Provincias and obesity and LDL more clustered in the southern areas.


Introduction
Often, health outcome data arises where measures are made on individuals who reside within geographical regions of a country. A survey may have been carried out to obtain a 'snapshot' of the health of the study region. In addition, the survey could be useful in providing insight into prevalences of disease within aggregated regions of the study window. Recently, Vanderdijck et al., [1] proposed an approach to the modeling of small area health survey data by utilizing the survey weight as a predictor within a conventional generalized linear mixed model, thereby making allowance for the survey design aspect of the study. Watjou et al. [2] extended this approach to situations where there is non-response. Alternative approaches have been proposed whereby the outcome itself is adjusted via sample weights [3][4][5]. However, assuming a conventional generalized linear mixed model (GLMM) with an unadjusted outcome variable has a variety of advantages not least of which is that conventional software can be used for modeling and hence flexible extension to models is straightforward. Multi-scale models have also been developed where models allow borrowing of information across scales [6][7][8][9]. In the most recent examples, models at higher levels of spatial aggregation can inherit effects from finer resolution levels, individual-level models can inherit contextual effects from aggregate regions. In what follows, we assume individual level models for survey participants and aggregate level models for Provincia estimation. We assume that spatial effects can be captured using Markov Random Field (MRF) models as we have discrete spatial units [10].
The Chilean National Health Survey (CNHS) is a large population cross-sectional study representative of the adult Chilean population, which is performed every 6 years. The main objectives of the CNHS are to assess and monitor the prevalence of adult health problems in the general Chilean population and to describe its variation according to sex, age, socioeconomic level, rural-urban area and geographic region. This is to facilitate the planning and evaluation of preventive and curative strategies. The survey provides information about a range of diseases (physical and mental health) and risk factors. Data are collected by self-reported questionnaires, a physical examination visit and biological samples. In the CNHS, type 2 diabetes, hypertension, obesity and high cholesterol, amongst the major outcomes of the metabolic syndrome, are available. After 1980, with economic development, the prevalence of obesity and other chronic diseases continued to increase in Chile. In the period from 2003 to 2017, both type 2 diabetes and obesity prevalence showed marked increases in the CNHS surveys.
One big challenge of the data collection of the CNHS was the particular Chilean geography with an uneven distribution of population across the country in a central-overcrowded and in remote, isolated and depopulated areas that were very difficult to access [11].

Objectives
Our focus is the modeling of multiple individual survey outcomes and the extension of this to a multi-scale approach to aggregate estimation of disease prevalence or risk. By multi scale, we mean analysis of more than one geographic resolution (scale) level. By aggregate, we mean that data are accumulated into larger geographic regions. This multi-scale objective extends the idea of small area estimation by jointly modeling different geo-scales at the same time adjusting for survey response and biases.

Data Background and Availability
The aim of this study was to present a methodology that can be used in the analysis of multiple disease outcomes and that can provide combined estimates of prevalence at various spatial scales. The data we used are from a cross-sectional study of the 2009-2010 CNHS. The regions used in the analysis were 52 out of 54 provinces of Chile. The population in each province varies from a few thousand to over 4 million in the largest province Santiago. The sampling frame was constituted from the 2002 Population and Housing Census which was the most recent Census to the study period. The study design was based on a random sample of households of complex type (stratified and multistage by clusters) with national, regional and rural/urban representation. The target population was adults older than or equal to 15 years. The survey had a response rate in the eligible population of 85%. Our data consist of a total of 4780 individuals with differing patterns of missingness. The data from the survey consist of individual clinical, demographic, and behavioral variables for survey participants. Diabetes outcome was defined as fasting glucose ≥ 126 mg/dl or self-reported medical diagnosis (not during pregnancy). Obesity was defined as body mass index (BMI) HTA ≥ 30 kg/m 2 , hypertension was defined as systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg or self-report of arterial hypertension (HTA) treatment. Elevated low density lipoprotein (LDL) was defined according to the cardiovascular adult treatment panel (ATP) III Update (> 100 mg/dl (if there is already cardiovascular disease), > 130 mg/dl (if moderate cardiovascular reactivity(CVR)) or > 160 mg/dl (if low CVR)). The 2009-010 CNHS included a total of 5293 individuals, although only 4780 participated in the two examinations and the laboratory test. All outcome definitions are based on the official report [12] and data provided by the CNHS. The demographic variables included from the survey were age, gender and their interaction. In addition, the individuals were geo-referenced to various spatial administrative units (such as regions, provinces and communes). In the following analysis, the province level was used throughout. The data are publicly available at http://epi.minsal.cl/bases-de-datos/.
The basic map of Chile and its provinces was taken from http://www.rulamahue.cl/. It has to be mentioned that the map only represents 52 provinces instead of the current 54. One reason is that Easter Island, which is counted as one province, is not on the map and was not included in the survey. The other missing province is Marga Marga, which was created in 2009 and became operative in 2010. Marga Marga consists of two communes of the Valparaiso province and two communes of the Quillota province. As the survey was carried out during the creation of the new province, it was decided to use the map that was still valid when the survey started.
Ethics approval and consent to participate: The data used in this study are public domain as part of CNHS and the analysis does not require ethical approval as individual cases are not labelled or referenced. Provincial level inference is made only.

Methods
Denote a set of surveyed individual responses as y ik , i = 1, ...., m; k = 1, . . . ., K where the i th person has kth outcome. The vector of outcomes for each person is denoted by y i . The survey consists of m participants and the disease outcomes are defined as y ik i = 1, . . . , m; m = 4780. We assume that this binary outcome for the k th disease is Bernoulli distributed as y ik ∼ Bern(p ik ) and we modeled the logit of the probability of positive outcome. The logit consists of fixed and random effects to account for both observed confounding and unobserved outcome confounding. Hence, logit(p ik ) = x t i β k + z t i γ k will be assumed where x t i β k is a linear predictor including observed confounders, and z t i γ k is a linear combination of random effects. As the individuals were sampled from within a population, we must include within our specification the sampling weight used for each person. Following the proposal in [1,2], we add the sampling weight (sw i ) as a fixed effect within the linear predictor.
The random effects included within the model were chosen to represent the different forms of unobserved effects. First, we assume an individual frailty effect could be present and add an uncorrelated disease specific individual level random effect v ik . Next, we include spatial effects to represent the area within which the individual resides (province only). Two effects are estimated for the province level: an uncorrelated effect v p j (i∈j) , where the subscript represents the j th province within which the i th person resides, and a spatially correlated effect u p j(i∈j) . The overall model used for this analysis is a Bayesian formulation and so all parameters in the model have prior distributions, as follows: The ICAR distribution is a special Markov random field prior distribution which includes spatial correlation (see, e.g., [13]; [10], ch. 5) and essentially captures the clustering tendency of the outcome via neighborhood adjacency. Note that this is a common assumption for the analysis of discrete small area spatial structure, rather than geostatistical models which require distance-based correlation to be specified [10]. The uncorrelated effects have zero mean Gaussian distribution with small precisions which should provide non-informativeness. There are 52 provinces in Chile, so that j = 1, . . . , n where n = 52 and Table A1 provides a basic summary of the regions.

Joint Scale Models
The overall model used for this analysis is a binary logistic model at the individual level for each of the k outcomes. A Bayesian formulation is assumed and so all parameters in the model have prior distributions, as follows: The ICAR distribution is a special prior distribution with u p jk |u p l j,k ∼ N(up δ j k , τ −1 up k /n δ j ) where up δ j k = l∈δ j u p l j,k /n δ j and δ j is the neighborhood set of the j th location with n δ j the number of neighbors. (1) This can model spatial structure (see, e.g., Besag [13]; Lawson [10], chapter 5) and essentially assumes that neighboring areas are positively correlated. It is assumed that each outcome has a different variance of the spatial field and hence, the precisions have a k subscript. Note that while this ICAR formulation is dependent on neighborhood definition, it is an adaptive prior specification as the variance depends on the number of neighbors of any given region. This allows for edge regions to have larger variance and smaller precision. To estimate the probability of the diseases for each province, a binomial model on aggregated province data was used as in Aregay, Lawson, Faes, Kirby, Carroll and Watjou [8], Aregay, Lawson, Faes and Kirby [9]. The binomial model assumed here is an approximation and we assume that the random effect structure employed makes allowance for this misspecification. The outcome was defined as cases of each disease per province out of the number of individuals sampled per province. The mean sampling weight per province (sw p j ), the mean age per province (Age p j ), the percentage of male individuals per province (Sex p j ) and their interaction were defined as adjusting variables. The same uncorrelated and correlated random spatial effects from the individual model were used in the aggregated model. This was done to use the spatial information gained from the individual analysis for the aggregated model. Hence, for the k th outcome, with age x sex adjustment, where n p j is the sample size in the j th province and N p j is the population of the province. p p jk is the probability of the outcome and y p jk = i∈j y ik where the sum is over all the individuals within the j th province. A crude unadjusted estimate of prevalence could be computed as y p jk /n p j .

Missingness and Imputation
All missing data were imputed within our Bayesian models: outcomes were imputed using predictive distribution and predictors were imputed, where appropriate, from an assumed prior distribution. Any missing predictors were given suitable prior distributions and imputed within the Markov Chain Monte Carlo (McMC) algorithms.
Some provinces had no sampling, in which case they must have their prevalence estimated. We did this as follows. Assume restricted prior distributions for and p p j ∼ Beta(1, 1) and use the predictive distribution to yield For missing sampling weights, the following estimation for province level sampling weights was performed. Denote the average sampling weight for a province as sw p j = l ∈j sw l /n p j (9) for provinces sampled and sw where sw is the global mean sampling weight (for those areas sampled). Hence, So that the sampling weight for the missing areas is the global average.

Joint Outcome Modelling
While a multi-scale model is assumed for a given disease outcome, we also have a range of outcomes that are a focus of this study. Outcomes related to the metabolic syndrome were considered important to examine. These include diabetes, obesity, hypertension and elevated LDL. For each of these, a joint multi-scale model was assumed, but to allow linkage between the outcomes at the individual level, a joint model approach was implemented assuming that there is a shared random effect between the k=4 outcome variables. This means that models for each outcome on the individual data level and aggregated data level were run during the same Markov Chain Monte Carlo iterations. The individual data level models included a common random effect vs i for each individual.
Individual Level Models (I1) Note that the random effects are chosen to represent our belief that individuals vary independently but multiple measurements on an individual will have some commonality: Hence, v ik + vs i is a composite effect. We also assume that individuals inherit a contextual effect of province and so v p jk + u p jk are jointly modelled with the aggregate level.
Aggregate (Provincia) Level Models (A1) At the provincial level, we approximate the model by assuming that the sampled count is where there are assumed to be different confounder effects for each outcome. This binomial approximation affects the variance of the aggregate count, but the inclusion of random effects at this level allows for the adjustment of variance. Note that there are eight models fitted jointly with some separate and shared random effects.

Model Fitting and Goodness of fit
Model I1 and AI were fitted jointly for all disease outcomes. Posterior sampling via McMC was chosen as the main tool for estimation. Given that missingness occurs within the outcomes and predictors, we assumed a Bayesian paradigm and decided to use BUGS software as it allows the imputation of missing outcomes using data augmentation, and allows prior specification for missing predictors. All joint models were fitted using WinBUGS14 [14]. Maps of Chile were created using the tmap package in R [15]. Two chains were run with a thinning of 50. Samples of a size of 5000 were taken following burn-in (Lunn, et al. [16], Lunn, et al. [17]). Convergence was visually checked by means of Gelman-Rubin-Brooks plot (Brooks and Gelman [18]), the potential scale reduction factorR (Gelman and Rubin [19]), sample trace and density plots, sample autocorrelations and Markov Chain Monte Carlo error (Lawson,et al. [20]). Convergence can be assumed ifR is close to 1. Model fit was visually checked by means of trace plots of the deviance (Spiegelhalter, et al. [21]), mapping the correlated heterogeneity, up, and uncorrelated heterogeneity, vp. The map of the correlated heterogeneity shows a clustered pattern and the map of uncorrelated heterogeneity shows a random pattern, if an adequate model fit is found.
As is common in other cross-sectional studies the missingness in this study did not display any particular structure and was assumed to be essentially at random (MAR). This form of missingness is handled optimally using predictive inference within the Bayesian modeling framework.

Posterior Risk Exceedance
As an additional diagnostic to help with the delineation of areas of exceptionally high risk, we employed exceedance probability criteria (Lawson [22]). An exceedance probability can be computed from posterior sampled output from a Bayesian model (see, e.g., Lawson and Rotejanaprasert [23]). The exceedance probability is defined as the probability that the estimated posterior value, in this case prevalence, for each sample, was greater than a chosen threshold.
where G is the total number of samples, p is the estimated probability of sample g and c is the chosen threshold. Whenever p exceeds the threshold c, it is recorded as 1, 0 otherwise. Averaged over the sample, this yields an estimate of the upper tail marginal probability of the parameter. This can be used to detect unusually high values and for hot spot clustering by looking for groups of unusually high areas in mapped output (Richardson,et al. [24]). For diabetes, a threshold of 9.4%, for obesity of 25.1%, for hypertension of 26.9% and for elevated LDL a threshold of 22.7% was chosen based on the estimated national prevalence in the CNHS report (http://www.minsal.cl/estudios_encuestas_salud/). Exceedance results are reported in Tables A1-A4.

Descriptive Statistics
The mean age of the study population was 46.3 with no significant difference in age between males (45.7) and females (46.7). More females (59.9%), than males (40.1%) participated in the study (p-value < 0.001). Females showed a significantly higher BMI than males (28.1 vs. 27.4; p-value < 0.001).
Women had a higher, but not significant, diabetes rate (10.7% vs. 10.4%; p-value = 0.8331), a significantly higher obesity rate (32.9% vs. 23.4%; p-value < 0.001), a lower hypertension rate (34.3% vs. 37.2%; p-value = 0.03895) and a lower significant elevated LDL rate (24.5% vs. 34.6%; p-value = < 0.001) than men. Table 1 gives some information concerning which region each province belongs to, square km, Inhabitants per square km, gross domestic product (GDP) per capita and the number of sampled individuals per province. Tables A1-A3 show estimates of posterior probabilities, their standard deviations (SD), median, 2.5% and 97.5% percentiles of the 95% credible interval, and the probability of exceeding the chosen threshold for each province and each outcome. In Figures 1-4, posterior estimates of a range of quantities for the fitted spatial models for diabetes, hypertension, obesity and elevated LDL are found.       It is notable that for all the outcomes the uncorrelated effects are relatively random, whereas the correlated effects display distinct clustering. For diabetes, the clustering is marked in the northern regions, whereas the clustering is marked in the south for obesity. Hypertension displays a clustering in the central regions whereas elevated LDL shows clustering in the most southerly areas of the Magallanes region, with lower central area effects. Table 2 displays the posterior mean estimates for the predictors included in each of the individual and aggregate models for the four outcomes. Some general features of the analysis should be highlighted. First, in the aggregate models, only the intercept and spatial random effects were well estimated. The aggregate survey weights were not well estimated for any outcome, and neither were the age and gender predictors. For the individual level outcomes, different effects emerged. In all the outcomes, the sampling weight was not found to be well estimated. The intercept was well estimated, as was age for all outcomes. Age was found to be positively related to diabetes, obesity, cholesterol and hypertension. Differences arise in the effect of gender and the age x gender interaction. While gender and the age x gender interaction were not well estimated for diabetes, there is a well estimated negative effect of gender (male) on obesity, and a positive effect for cholesterol and hypertension. In

Results of the Joint Modeling
For Diabetes (Table A1), Cauquenes (29) had the highest posterior probability of diabetes, with 16.27%. Provinces with a 95% or higher probability of exceeding the threshold of 9.4% of diabetes were Limari (12), San Felipe de Aconcagua (19), Melipilla (22), Colchagua (28), Cauquenes (29), Curicó (30), Concepción (34) and Cautín (38). Those diabetes hotspots are all located in the central part of Chile. Table A2 provides the results for the analysis of obesity. The highest posterior probability of obesity was estimated in Antártica Chilena (49) with 40.60%, but it did not appear to be significantly higher than the chosen threshold of 25.1% and the estimate actually mainly depends on spatial prior distributions as no samples were taken in Antártica Chilena (49) itself. The highest posterior probability was for Magallanes (51) with 38.52%. Obesity hotspots are mostly found in the central to southern part of Chile. In Table A2, posterior probabilities of hypertension per province are listed. The highest probability of hypertension was found in Cauquenes (29) with 57.14%. Hypertension hotspots are mostly found in the central to southern part of Chile. Table A4 demonstrates that the highest probability of elevated LDL level was estimated in Ultima Esperanza (50) with 41.79%, though estimation relies on assumed prior distributions as no samples were observed in this province itself. The highest exceedance probability was for Copiaco (9), 37.15%, Cautin (38) 40.06%, and Chiloe (41) 40.28% which all have a 100% exceedance of the national rate.
Overall, it would appear that diabetes and hypertension display a similar distribution in that there are concentrations of both in central metropolitan areas and also in Southern Chile. Obesity, on the other hand, is less marked in central area but shows elevation on the south also. Cholesterol (elevated LDL) demonstrates a similar pattern to obesity in terms of the north-south gradient, but displays less prevalence in central areas and is overall less clustered.
In the joint analyses reported here, it is clear that diabetes and hypertension display a similar spatial distribution of risk in that central and southern areas are the most affected and exceedance probabilities > 0.95 are commonly found. In contrast, elevated LDL and obesity are more marked in the southern parts of the country and show fewer examples of exceedance than for diabetes or hypertension. This suggests a more uniform pattern of risk for elevated LDL over the country than for the other outcomes. Figures 1-4 display the multi-scale model heterogeneity effects for each outcome. It is notable that for all the outcomes the uncorrelated effects are relatively random, whereas the correlated effects display distinct clustering. For diabetes, the clustering is marked in the northern regions, whereas the clustering is marked in the south for obesity. Hypertension displays a clustering in the central regions whereas elevated LDL shows clustering in the most southerly areas of the Magallanes region, with lower central area effects. Table 2 displays the posterior mean estimates for the predictors included in each of the individual and aggregate models for the four outcomes. Some general features of the analysis should be highlighted. First, in the aggregate models, only the intercept and spatial random effects were well estimated. The aggregate survey weights were not well estimated for any outcome, and neither were the age and gender predictors. For the individual level outcomes, different effects emerged. In all the outcomes, the sampling weight was not found to be well estimated. The intercept was well estimated, as was age for all outcomes. Age was found to be positively related to diabetes, obesity, cholesterol and hypertension. Differences arise in the effect of gender and the age x gender interaction. While gender and the age x gender interaction were not well estimated for diabetes, there is a well estimated negative effect of gender (male) on obesity, and a positive effect for cholesterol and hypertension. In addition, while most age x gender interactions were not well estimated, it was found well estimated for hypertension. Although hypertension tends to display similar spatial patterning to diabetes, there are marked differences in the individual predictors associated with each outcome. Hypertension has a gender and age x gender association that is not seen in diabetes.

Discussion
There are some notable correlations and disparities between the distributions of diabetes, hypertension, obesity and elevated LDL. Both diabetes and hypertension have marked prevalences in the central provinces, including the metropolitan areas around Santiago. Note that while marked differences arise between outcomes, at the individual level, we have included a person-specific and outcome-specific individual effect. This leaves some allowance for correlation between outcomes in the individual. At the province level, we did not include shared province effects, but individuals had multi-scale sharing of province effects. The regression parameter posterior estimates demonstrate that while similar spatial patterning can arise, there are also differences at the individual level that can be marked. The central regions elevation of diabetes and hypertension could be explained by the urbanicity of the areas around Santiago and the associated lifestyle trends. The concentration of obesity and elevated LDL in the southern regions may reflect differentials with northern comparison regions and in particular dietary practices.
In this example, the sampling weight appears to have little impact on any outcome whether at individual or aggregate level. That said, it is important to include the sampling weight as it represents factors affecting the inclusion of participants in the survey.

Conclusions
The joint analysis of these four metabolic outcomes demonstrates the benefit of considering the correlation between outcomes at the individual level. It also demonstrates the benefit of a multi-scale analysis in that individual outcomes can be modelled contextually within provinces and they can inherit the grouping effect of the province. In addition, the inclusion of survey weights at different levels is an important feature that allows the analysis to proceed, taking into consideration sampling effects. The joint analysis allowed the estimation of prevalences at the aggregate level while also providing contextual effects for the individuals. We did not assume here that the province level outcomes should share effects but in future work, we could explore sharing further especially for diabetes and hypertension (e.g., shared spatial effects), which appear to have similar patterning.

Conflicts of Interest:
There are no competing interests with respect to the submission of this paper.