Bayesian analysis of population health data

The analysis of population-wide datasets can provide insight on the health status of large populations so that public health officials can make data-driven decisions. The analysis of such datasets often requires highly parameterized models with different types of fixed and randoms effects to account for risk factors, spatial and temporal variations, multilevel effects and other sources on uncertainty. To illustrate the potential of Bayesian hierarchical models, a dataset of about 500 000 inhabitants released by the Polish National Health Fund containing information about ischemic stroke incidence for a 2-year period is analyzed using different types of models. Spatial logistic regression and survival models are considered for analyzing the individual probabilities of stroke and the times to the occurrence of an ischemic stroke event. Demographic and socioeconomic variables as well as drug prescription information are available at an individual level. Spatial variation is considered by means of region-level random effects.


Introduction
Population and Public Health officials often require to address complex issues in important health problems with high levels of uncertainty that can affect millions of people. Providing scientific evidence to help decision-making processes in that area is a key issue and statistical analysis becomes an essential tool.
Data on large populations are often difficult to obtain due to confidentiality issues and the technical difficulties and financial resources involved in their design, maintenance and updating as well as its day-to-day management. The existence and availability of population databases for scientific exploitation is a treasure. Having a strong knowledge on the population makes it possible to accurately estimate the parameters of interest in the study, to identify potential risk factors, to detect patterns, outcomes or groups of individuals with special characteristics, and minimize the uncertainty associated with the prediction process. These studies are of great help to Public Health insofar as they contribute to the development of efficient and effective strategies and policies aimed at improving the health of the target population.
This paper deals with population health from a statistical point of view, and concentrates on the prevalence of stroke in Poland. In particular, we aim to identify different patterns that may increase the estimation of the probability associated to the disease in terms of a set of explanatory covariates and random effects is often modeled using mixed logistic regression models [11]. Survival models are statistical models specially designed to learn about time-to-event outcomes and their relationships with regard to relevant risk factors [8]. They also include as a particular issue the assessment of prevalence probabilities by means of particular cases of the survival function. Both approaches and how they are related to each other are described below in a first subsection devoted to logistic regression and a second one for survival models. This section concludes with an introduction to the integrated nested Laplace approximation (INLA) [10] within the framework of Bayesian inferential methods.

Logistic regression
Binomial regression connects probabilities associated to Bernoulli trials with covariates. The outcome of interest is an observable binary response which describes the presence (value 1) or absence (value zero) of a certain individual feature of the population under study. In the case of individual i it is defined as follows being p i the probability of success in the subsequent Bernoulli trial. Probabilities and covariates are not usually in the same scale. For this reason a link function g is defined in order to accommodate the probabilities and the linear predictor η i in the same scale as follows where β = (β 0 , β 1 , . . . , β q ) is the regression coefficient vector associated to covariates x i = (x i0 = 1, x i1 , . . . , x iq ) . The most common link functions when dealing with binary variables are the logit and the probit functions. The logit function is the canonical link function for the Bernoulli distribution in generalized linear models and a binomial regression endowed with the logit link function is called logistic regression. It offers an intuitive interpretation of the relationship between the probability of interest and the linear predictor in terms of odds in logarithmic scale as follows Random effects allow to assess variability associated to the outcome of interest within groups that are not accounted for the covariates. Random effects can be modeled in different ways. The simplest one considers random effects as conditionally independent and identically distributed random variables with Gaussian distribution of zero mean and precision (i.e., the reciprocal of the variance) τ. This condition assumes that given τ there is no prior correlation among the different groups and that differences among them are only due to intrinsic factors. The inclusion of these elements in the regression model forces its reformulation with the addition of a new index to indicate the random effect associated to group j, j = 1, . . . , J, as follows: where γ j | τ ∼ N(0, τ). It is worth noting that the model can include covariates associated with groups. In such scenarios, the value of the corresponding covariate would be the same for all individuals belonging to the same group j.
In the case where it is assumed that the risk varies smoothly along the study region, spatially correlated random effects can be considered. A typical approach considers the Intrinsic Conditional Auto-Regressive (ICAR) model [12] that incorporates information from the neighboring regions. This model specifies a Gaussian distribution for the conditional distribution of the random effect γ j associated to the region j, j = 1 . . . , J given the set of the random effects at its neighbours (denoted by l ∼ j) with mean ∑ l∼j γ l /n j and precision τ/n j , where n j is the number of neighbours of region j. This model is often used in disease mapping models to account for spatial and spatio-temporal risk variation. The joint distribution for γ = (γ 1 , . . . , γ J ) is a multivariate normal random vector where Q is a J × J precision matrix with entries n j , j = 1 . . . , J in the diagonal and entries Q jl equal to -1 if regions j and l are neighbours and 0 otherwise. Given that this is an improper distribution, a sum-to-zero constraint is often added on the values of the random effects, i.e., ∑ J j=1 γ j = 0 [13]. Leroux et al. (1999) [14] propose an alternative specification for the precision matrix that better distinguishes between spatial dependence and overdispersion effects as follows: where I is the identity matrix and parameter φ ∈ [0, 1] determines how matrices I and Q are combined. Values of φ close to 0 indicate that there is a weak spatial pattern, while values close to 1 mean a strong spatial pattern.

Accelerated failure time survival models
Survival analysis is the branch of Statistics dedicated to the study of the length of time between two events, the event that initiates the observation process and the final event, also called the event of interest or final point, which determines the end of the monitoring procedure. From a statistical point of view, the topic focuses on the analysis of samples from random variables with support in the positive real numbers, generally skewed and usually partially observed. In most cases the observation period ends before the event of interest occurs and the actual observation period does not always coincide with its theoretical start. In the first case, the data will be right censored and left truncated in the second one. Both mechanisms, especially censoring, introduce complexity into the statistical analysis due to their important role in the likelihood function.
The key concepts for assessing survival times are the survival and the hazard function. The survival function for the survival random variable T i at t ≥ 0 corresponding to individual i is the probability that this individual survives beyond time t as defined below The hazard function of T i at time t is a nonnegative function that describes the instantaneous rate of occurrence of the event among individuals who have not yet experienced the event of interest at t. It is defined in terms of a conditional probability as follows: The hazard function is very popular in epidemiological contexts where it is known as the incidence function.
Survival regression models assess the variability of the survival times of the different individuals of the target population with regard to relevant covariates. Accelerated failure time (AFT) models are, together with Cox proportional hazards models, the most popular in survival analysis [8]. We start assuming a basic AFT model for the survival time of individual i as follows being x i and β the same as in (1), σ a scale parameter and i i.i.d random variables with a standard Gumbel distribution (standard type I Fisher-Tippett extreme value distribution). This is a non-negative continuous distribution with probability density function f i (t) = e t exp{−e t }, survival function S i (t) = exp{−e t }, and hazard function h i (t) = e t , t > 0. As a result, the distribution of T i is a Weibull distribution with shape parameter 1/σ and scale parameter exp{−x i β/σ}, i.e., it has hazard function The AFT model in (6) is very flexible because it can also be expressed as a Cox proportional hazards model [15,16].
As in the binomial regression model, the inclusion of random effects associated with groups of individuals in the survival model also needs a new definition format. Assuming the same type of random effects γ j that we have considered in the logistic regression model, our accelerated model will be as follows: with the γ j 's modelled according to each of the two proposals, conditionally i.i.d. and spatially correlated, formulated as in the previous sub-section.

Bayesian inference and the integrated nested Laplace approximation
Bayesian inference accounts for uncertainty in terms of probability distributions. The main element of a Bayesian learning process is the likelihood function, which is constructed from the sampling model and the observed data D, and the prior distribution for all unknown elements in the sampling model. The subsequent posterior distribution combines two pieces of information and is computed via Bayes' theorem.
Inference for hierarchical and highly parameterized models is often conducted using a number of tools available. Markov chain Monte Carlo (MCMC) methods can estimate a wide range of models but they are too slow when dealing with large datasets such as those arising from population studies [17].
Alternatively, approximate inference could be carried out so that posterior sampling is not required. In particular, the integrated nested Laplace approximation (INLA) [10] provides accurate approximations of the posterior marginal distribution for the latent effects, parameters and hyperparameters of the model. INLA considers random samples from a common probabilistic population as conditionally independent given a latent Gaussian Markov random field (GMRF) [18] θ with zero mean and precision matrix H that depends on some hyperparameters φ which can include effects of different type (regression coefficients, random effects, seasonal effects, etc). This feature ensures that the structure of H is sparse so that computationally efficient algorithms can be employed for the estimation procedure.
INLA starts the estimation procedure by obtaining a good approximation to the joint posterior distribution of the hyperparameters, i.e., π(φ | D). Then it uses this approximation to compute the posterior marginal of each univariate hyperparameter φ l and the marginal posterior distribution of each latent term θ m in θ as follows These integrals are approximated using numerical integration methods and the Laplace approximation [10,19].
Note that once the posterior marginals are available it is possible to compute quantities of interest about the parameters and hyperparameters such as posterior means or credible intervals.
The INLA procedure is implemented in the R-INLA package [20] for the R statistical software [21]. This package can also be used to compute a number of features for model selection, which include information-based criteria such as the deviance information criterion [DIC,22] and the Watanabe-Akaike information criterion [WAIC, 23].

Analysis of ischemic stroke and risk factors in Poland
In Poland, the incidence of stroke is similar to that in other European countries: approximately 112 strokes per 100 000 inhabitants, which gives about 65 000 new cases of stroke registered annually [24]. The number of strokes in Poland is expected to increase in the coming years, what is mostly related to the aging of the population. This means an increased demand for medical and palliative care, which require both adequate resources and the development of a strategy for the future [5].
As presented in the introduction, the data for the study consist on an anonymized dataset of about 500 000 inhabitants from the Polish National Health Fund that includes individual information about ischemic stroke and other important covariates such as gender, age, administrative region and drug prescriptions. The period of observation is two years, but the actual dates have not been released and they remain unknown. We do not know the reasons for this decision; we can only assume that it is a recent period of two years. The patient's age is given in 5-years-old groups and the gender is a binary variable without clearly indication of which value stands for which gender. However, it is commonly known that women live longer than men and thus we can distinguish the two genders in the data. We decided to analyze only patients older than 38 years old, as in younger age groups stroke had a very low prevalence. As a result, the three age groups finally considered in the analysis are (38-58] years (group Age1), (58, 68] (group Age2), and (68, 108] (group Age3). As we are interested in studying spatial dependencies, we take only patients with known territorial code (no missing values). The final dataset consists on 332 799 patients, among them 2 889 had ischemic stroke (0.9%). This percentage is low, but due to the fact that the sample is probably randomly selected (they are not people with a specific disease or medical history) and the observation period lasted only two years, it seems reasonable.
For almost each patient, territorial code of the place of registration is available. Based on that code, we can classify to which powiat or voivodeship belongs the patient. In Poland, powiat is a second-level local government unit, which is often referred to as 'county', and it is a part of a larger unit-voivodeship. In the dataset, there are 379 powiat-level entities, which can be divided according to the administrative divisions of Poland into 66 city counties (formally 'Cities with powiat rights') and 313 regular counties, which we will be called land counties. Nowadays there are 380 powiats, which have changed in 2013 and therefore we assume that the dataset comes from two consecutive years between 2003 and 2013 [25,26]. Poland is divided in 16 voivodeships, which could also be used instead of county divisions.
The dataset also contains information on prescriptions for reimbursed drugs. For each prescription the three-digits code of the Anatomical Therapeutic Chemical (ATC) Classification System is provided. Based on this code, the drug can be identified on which organ or system it acts. In this classification, there are five different levels to identify the active substances of any drug. In the dataset, the three-digits codes allow to classify the prescription in a pharmacological or therapeutic subgroup. Hence we decided to include also the information of the prescriptions dispensed by patient. The risk factors for stroke are, among others, high blood pressure, atrial fibrillation (AF) and diabetes [27]. Therefore we choose to include in the analyses the use of prescriptions for the cardiovascular system (based on the ATC classification -type 'C'), any antithrombotic agents (used in the prevention or treatment of AF, ATC B01) and drugs used in diabetes (ATC A10), because they appear to be the most relevant when analyzing the occurrence of strokes [28]. In our analysis, it is not possible to detect any association between the stroke and the prescription drug, and its associated disease. This should be beared in mind when interpreting the results, i.e., the coefficients associated to this covariates will assess the relation of suffering from the condition and taking the associated prescription drugs.
The impact of socioeconomic factors cannot be overlooked when talking about such a complex disease as stroke. People with a lower status have limited access to medical care, which may result in the lack of quick diagnosis, which in the event of a stroke may lead to severe disability. Low level of public awareness can be related with the increase of risk factors for stroke and can affect recovery during rehabilitation. This is consistent with studies showing that low socioeconomic status may result in an increased incidence of stroke and mortality [29]. Accordingly, we included in the study the powiat index of deprivation (PID). This index is computed from five components using data from 2013 from another database independent of the one used in our study. [30]: income, employment, living conditions, education and access to goods and services. The values of the index are in the range of −1.8 to +1.1, with a negatively skewed distribution (with zero mean and standard deviation 0.58). A higher value of the index means a higher risk of deprivation to which the population of a given powiat is exposed.
In the final dataset there were less than 1% patients who suffered a stroke. Almost half of the population is over 38 and under 58, and more than half are women and people living in land counties. The vast majority of patients takes drugs for the cardiovascular system, while drugs for diabetes and atrial fibrillation (and others) represent only around 12%. Table 1 shows a short description of the percentage of people who have and have not suffered a stroke with regard to age, gender, county type and group of medicines.

Bayesian logistic and survival modelling
Let p ij be the probability that the individual i living in powiat j will suffer an ischemic stroke, and T ij be the time when that individual suffers a stroke since entering the study. The statistical analysis begins with a basic logistic regression and a basic accelerated failure time survival model for analyzing the probability p ij and the survival time T ij , respectively, in terms of covariates gender, age, prescriptions for reimbursed drugs, and PID as follows: where I A (ij) is an indicator variable for A that takes the value 1 if the individual i from powiat j has the characteristic A and zero if she or he does not, and consequently I Woman (ij), I Age2 (ij), I Age3 (ij), I City (ij), I T.A10 (ij), I T.B01 (ij) and I T.C (ij) are the indicator random variables for being a woman, being in age group Age2, Age3, living in city county and having received diabetes, antithrombotic and cardiovascular treatment in powiat j, respectively. The Depr covariate stands for the deprivation index which is the numerical variable defined for each powiat. To complete the specification of the Bayesian model it is necessary to elicit a prior distribution for the parameters and hyperparameters of the model. In the case of the logistic regression model the set of parameters θ = (β 0 , β 1 , . . . , β 8 ) is a GMRF with diagonal precision matrix 0.001 for all the coefficients except for β 0 whose marginal prior distribution is selected as an improper Gaussian distribution with zero mean and zero precision. The discussion of the marginal prior distribution for the scale parameter σ in the survival model needs a previous comment about INLA and the Weibull distribution. INLA offers two different parameterizations of the Weibull distribution for survival models. We have opted for the so-called first variant, which corresponds to shape parameter α = 1/σ and scale parameter λ = exp{x ij ζ}, so that the hazard function of T ij is This parameterization implies that positive coefficients ζ's of the covariates increase hazard, while negative values reduce it. Note that this parameterization is slightly different from the typical parameterization of this AFT model shown in equation (7). Coefficients ζ's estimated with INLA are equal to coefficients −β/σ's in the accelerated survival model [31].
The shape parameter α of the Weibull distribution has a penalized complexity prior (PC-prior) [32]. In fact, INLA considers α = exp{0.1α } to avoid numerical instabilities and the prior is set on α . PC-priors are defined using the Kullback-Leibler distance between the proposed model and a natural base model, which in this case corresponds to α = 1, that is the exponential distribution. In our model, we have used the default PC-prior for α; see [33] for details.
Random effects associated to the powiats are introduced in the logistic and the survival model in (9) according to the two proposals presented in the previous section: in terms of conditionally i.i.d. random variables and spatially correlated random variables. The marginal prior distribution of the precision τ in the case of both conditionally independent and spatially correlated random effects is an improper uniform distribution in the interval 0 to infinity. The weight parameter φ in the precision of the spatial effect has a prior distribution so that the logit of φ follows a Gaussian distribution with zero mean and precision 0.1. Table 2 presents the posterior mean and posterior credible intervals for the parameters and hyperparameters of the logistic regression model and the accelerated survival model without random effects, and with random effects in terms of conditionally i.i.d random variables and spatially correlated random variables. Moreover all models have been evaluated through the DIC and WAIC criteria. For both types of modelling (i.e., logistic and survival), the model with spatially correlated random effects has the lowest values of DIC and WAIC.   All models have similar estimates of the regression coefficients associated with the covariates, providing evidence of statistical robustness. As expected, a lower risk of stroke is associated with being a woman and age increases the risk of stroke. Naively, this can be regarded as if the results pointed to that being male increases the stroke rate by about 25% and being in the older age group multiplies the stroke rate by about 5-6 times. The estimates of the model indicate that men older than 68 who live in a city county have the highest risk of stroke. It is worth noting the positive relation between the pharmacological prescriptions dispensed to patients and the risk of stroke, specially those related to the cardiovascular system. The analysis of the credible intervals suggests that all the covariates, except the county, are relevant both for the risk of stroke and for time to stroke. The risk of stroke grows in proportion to the deprivation index although the importance of this variable is questionable. The posterior mean of the parameter 1/σ of the survival models is always close to 1. It could suggests that the risk of stroke does increase with time, but not rapidly. This latter may be due to the fact that the data was collected only for a period of two years and relates to people without specific diseases.
The posterior mean of the hyperparameter φ which assesses the strength of the spatial effect in the spatial models is equal to 0.866 and 0.889 for the logistic regression and for the survival model, respectively, with 95% credible intervals that clearly state the relevance of the spatial effect. The posterior mean of the precision τ estimated for counties indicates that there is variation between powiats. It is lower for the spatial models, but still this is an evidence of the dispersion in the data. Figure 1 illustrates the posterior mean of the random effects for both the logistic regression and the survival modelling. As expected, the outcomes associated to the different powiats in the conditional i.i.d models are very similar as well as those for the two spatial effects models. There are, however, differences between the conditional i.i.d and spatial models. The latter show strong spatial patterns, with an southwest-northeast alignment of the smallest values, which can be interpreted as regions with lower probability of stroke. On the contrary, a high-value cluster in the southeast, means that the risk of stroke is higher than in the other parts of the country.
The potential of the models analyzed is enormous because they allow us to study and visualise the outcomes of interest in relation to the population subgroups defined by the different values of the covariates. This information is too long to be included in this article. By way of illustration we present in Figure 2 the posterior expectation of the probability of stroke, by gender and age group, for people who did not take any medication, obtained from the spatial logistic regression model. It is clearly visible that the probability of stroke increases with age and in general women have lower probability than men. The largest difference between the estimated values is in the oldest age group. The spatial pattern is very relevant. In the southeast of Poland (Podkarpackie and Lubelskie Voivodeship) there is a visible spatial cluster with the highest risk of stroke. Among the ten powiats with the lowest estimated probabilities of stroke, nine of them are cities including Wroclaw, Cracow and the capital Warsaw. Similarly, and in accordance with the illustrative objective, Figure 3 shows the posterior probabilities of stroke by gender and age group for people who takes drugs for the cardiovascular system (ATC C). The overall pattern shows higher probabilities of stroke than in Figure 2 due to the effect associated to these drugs (and the underlying condition, i.e., cardiovascular diseases).

Discussion
As previously stated in this paper, health care decisions often involve the collection and analysis of datasets from different sources. Typical health data include mortality and morbidity of certain diseases as well as other information about risk factors, environmental exposure and others [34]. In addition, statistical developments in recent years allow researchers to handle, both methodologically and computationally, large datasets of individual data for population level analyses that involve highly parameterized hierarchical models [35].
Interesting analysis for health care decisions include the estimation of prevalence, assessment of risk factors, estimation of spatial and temporal risk variation, to mention a few. The assessment of risk factors is particularly important because the identification (and prevention) of relevant risk factors may help to reduce morbidity, which in turn may reduce mortality and public health care expenditure. Public health authorities can benefit from these population level analyses in different ways. First of all, insight on a given condition can be gained by conducting a population-wide analysis. Secondly, potential risk factors can be assessed which can help to develop best health policies and practices. In the study developed in this paper, a better understanding of the incidence of stroke in Poland is gained as well as knowledge about potential risk factors, with a particular interest on different conditions and associated prescription drugs. Given that health care decisions by government agencies have an immediate and long-lasting effects on the populations it is important that these decisions are data-driven.
In particular, this paper considers the analysis of population data about stroke disease in Poland in a 2-year period. This is a large dataset that comprises information about 500 000 people on a number of topics, including age, gender, other conditions and drugs prescribed, region and others. In addition to the individual level data, information at the powiat level (such as deprivation index and city/land county indicator) are available to complete the analysis. Given the high burden of stroke, identifying risk factors which can lead to a reduction in the prevalence of stroke will have a significant impact on the overall quality of life of the population and the cost of public health care.
The available data can be approached in a number of different ways. First of all, the probability of suffering from a stroke has been considered, for which a logit analysis has been conducted. However, given that the time-to-stroke is available, survival models can be used as well to tackle an alternative inferential outcome. As individual and area level data are available, multilevel models have been fitted. In addition to the individual and area level covariates, mixed-effect models that include random effects at the area level have been studied in two different ways: conditional independent and spatially correlated random effects.
All these models have been estimated using a Bayesian framework, for which novel computational methods have been used in order to fit the required models. In particular, the integrated nested Laplace approximation [10] has been used to obtain approximations of the posterior marginals of the parameters, random effects and hyperparameters of the model. In addition, the implementation of INLA in the R-INLA package is able to handle the hundreds of thousands of records in the dataset and fit the models in a few minutes.
Relevant risk factors identified by the analysis include age, gender and certain conditions and associated drug prescriptions. In particular, women showed a lower risk, which increased with age. Regarding the prescription drugs, three different types of drugs (associated to relevant health risk factors of stroke) were included in the models and they showed an increase in risk of suffering from a stroke. However, our analysis is not able to disentangle whether this increased risk is due to the condition or the associated treatment. Furthermore, the estimates of both types of random effects showed differences among powiats. Model selection using the DIC and WAIC pointed to the model with fixed effects and spatially correlated random effects as the best one among all the models proposed for both the logit and survival families of models.
These models can be exploited for inference in a number of ways. The spatial logit model can provide estimates of the probability of suffering a stroke for age, gender and area. Similarly, survival models can provide estimates of time-to-stroke for any individual or the median time-to-stroke according to age, gender and area, and include the effect of prescription drugs in the estimates.
Other similar models can be used in the analysis of this dataset but the proposed models provide additional opportunities for inference. As an example, the output from the fitted models can be used for personalized medicine provided that relevant individual-level information (e.g., genetic markers) is available.