Geo-information Geographical Variation of Incidence of Chronic Obstructive Pulmonary Disease in Manitoba, Canada

We aimed to study the geographic variation in the incidence of COPD. We used health survey data (weighted to the population level) to identify 56,944 cases of COPD in Manitoba, Canada from 2001 to 2010. We used five cluster detection procedures, circular spatial scan statistic (CSS), flexible spatial scan statistic (FSS), Bayesian disease mapping (BYM), maximum likelihood estimation (MLE), and local indicator of spatial association (LISA). Our results showed that there are some regions in southern Manitoba that are potential clusters of COPD cases. The FSS method identified more regions than the CSS and LISA methods and the BYM and MLE methods identified similar regions as potential clusters. Most of the regions identified by the MLE and BYM methods were also identified by the FSS method and most of the regions identified by the CSS method were also identified by most of the other methods. The CSS, FSS and LISA methods identify potential clusters but are not able to control for confounders at the same time. However, the BYM and MLE methods can simultaneously identify potential clusters and control for possible confounders. Overall, we recommend using the BYM and MLE methods for cluster detection in areas with similar population and structure of regions as those in Manitoba.


Introduction
Chronic obstructive pulmonary disease (COPD) is a lung disease defined by continuous airflow limitation caused by small airway disease (obstructive bronchiolitis) and parenchymal destruction (emphysema).The small airways narrow in response to chronic inflammation.As well, inflammatory processes cause the deterioration of the lung parenchyma, which leads to a decrease in the elastic recoil of the lung.As a result of these changes, the airways have a decreased ability to remain open during expiration [1].The biggest and most widely known risk factor of COPD is cigarette smoking [2].Other risk factors of COPD include occupational or environmental exposure to dust and hazardous gases, for example when burning biomass fuel [3].A family history (i.e., genetics), low socioeconomic status, poor nutrition, asthma, and recurrent lung infections can also be risk factors for COPD [1,4].Therefore, COPD can be the result of a gene-environment interaction [1].
The impact of COPD is often underestimated by health authorities and government officials [5].In Canada, one of the most overlooked chronic conditions is COPD.Patients suffering from a degenerative lung disease are often misdiagnosed as having bronchitis, a cough or a respiratory tract infection [6].In 2008, COPD was the leading cause of hospitalizations in Canada.As well, 18% of COPD patients were readmitted to a hospital once within the year and 14% were re-admitted twice within the year.These readmission rates were higher than any other chronic illnesses [6,7].According to a Canadian article [8], for severe COPD exacerbations or attacks, the average length of a hospital visit was 10 days with an estimated cost of $10,000.Within a single year, the estimated cost of moderate and severe COPD exacerbations exceeds $730 million.This number is expected to nearly double by 2015 [8].
There are various treatments for COPD including antibiotics and chest physiotherapy.However, early detection of COPD is crucial for a positive outcome [9].Therefore, it is important to identify trends in COPD incidence that may suggest further epidemiological studies to identify risk factors and identify any changes in important factors.Trends may occur over a region and the focus of our paper is to examine geographical variation in the number of people diagnosed as having COPD during 2001 to 2010 in the province of Manitoba, Canada.
A spatial cluster is defined as a limited area within the entire study region which has a high proportion of disease cases [10].Possible factors related to diseases may be determined by discovering disease clusters which may lead to an improved understanding of etiology.In fact, the identification of clusters may lead to further analyses to study how exposures and disease interventions are connected [11].
Spatial cluster detection methods can be classified into two statistical approaches, a focused approach or a non-focused (general) approach.The methodology of focused cluster detection approaches is to locate regions with an excess number of disease cases in an area near a possible cause (i.e., a toxic waste site) [12,13].On the other hand, non-focused cluster detection methods typically use various ways in order to discover areas with a high number of disease cases in the entire study region [14][15][16].The circular spatial scan statistic (CSS) [17], flexible spatial scan statistic (FSS) [18], and Bayesian disease mapping (BYM) [14] are all considered to be focused cluster detection methods, whereas, the Besag and Newell (BN) [19,20] test and the maximizing excess event test (MEET) [21] are classified as non-focused cluster detection procedures.Non-focused tests are used to detect potential clusters in the study area, while focused tests are used to test the null hypothesis of no spatial cluster against the alternative hypothesis that a spatial cluster exists.In other words, the purpose of focused tests (CSS, FSS, BYM) is to find possible clusters in an area of interest and the aim of non-focused tests is to discover any significant cluster without determining a specific area of interest.These approaches were compared by analyzing childhood cancer data in the province of Alberta, Canada [22].Recently, a frequentist approach based on the maximum likelihood estimation (MLE), via data cloning (DC) [23,24], was also proposed to obtain possible clusters [25] in an area of interest.Another cluster detection method is the local indicator of spatial association (LISA) [26].This method is simple and easy to implement.This paper is based on the focused cluster detection methods.In particular, the aforementioned focused approaches (CSS, FSS, BYM, MLE, and LISA) are used to analyze a real dataset of COPD cases in the province of Manitoba, Canada, from 2001 to 2010.

Study Subjects
This study was based on the Canadian Community Health Survey (CCHS) [27] from Statistics Canada.The CCHS is a cross-sectional survey, which gathers information from the Canadian population regarding health status, health care utilization and health determinants.The CCHS collects health related data from individuals aged twelve and older in order to provide reliable estimates at the health region level [27].The information from the CCHS used in this study was the number of COPD cases in the province of Manitoba, Canada, from 2001 to 2010.Eleven Regional Health Authorities, which are further divided into 67 Regional Health Authority Districts (RHADs) are in charge of delivering health care services to individuals in Manitoba.The RHADs are the geographic units used in our models and all of the data used in the study are related to these RHADs which are labeled 1, 2, …, 67 for simplicity.As well, a population-based centroid was provided for each RHAD, however, these centroids were not necessarily geographic centres.Since the data used in the study was from a survey, appropriate weights (see Section 2.2 for more details) established by Statistics Canada [27] were applied to the data, which was then aggregated over the study period from 2001 to 2010.
The population was stable in Manitoba from approximately 1.15 million people in 2001 to 1.20 million people in 2010.Region 38 had the smallest average population size of 920 people while region 62 had the largest average population size of 91,633 people.The mean and median population sizes across the regions were 17,471 and 9466, respectively.The total number of COPD cases in Manitoba was 56,944 with a mean of 850 and median of 504 cases.These observations are based on the weighted results of COPD cases across the 67 regions.
The five focused spatial cluster detection procedures (CSS, FSS, BYM, MLE, and LISA) have different assumptions.Although the CSS, FSS, and LISA approaches are distribution free, it is assumed that the number of disease cases follow a Poisson distribution in the BYM and MLE methods.As well, while the number of regions to be included in the cluster must be specified for the CSS and FSS methods, this is not a requirement for the BYM and MLE approaches.For the model-based cluster identification methods (BYM and MLE), if the model does not fit the data well, the result can be misleading.So, the deviance residual [28] should also be checked.While the expected number of disease cases or the population of each region is required for the above methods, they are not a requirement for the LISA method.
The University of Manitoba's Research Data Centre approved the study, and Statistics Canada approved administrative data access.ArcGIS version 10.0 (Environmental Systems Research Institute, Redlands, CA, USA) was used to produce choropleth maps of risks.

Weighting Process
The weighting was completed by Statistics Canada using a detailed weighting process [27].A brief summary of this procedure is given here.First, the weighting depends on the sampling method (area frame vs. telephone frame) used in each region.In the area frame an initial weight is assigned based on the Labour Force Survey (LFS).Out-of-scope units (i.e., Dwellings that are under construction, vacant, seasonal or secondary and institutions) are removed from the sample.As well, sub-clusters (i.e., Sub-sampling within a selected dwelling), larger sample sizes and non-response units are adjusted for in the weighting process.In the telephone frame (the survey is conducted by telephone) an initial weight is assigned as the probability that phone number will be selected, which depends on the number of units sampled and the number of units available to be sampled.In this method, samples are drawn every two months therefore, an adjustment factor is applied to reduce the weights of each two-month sample so that the total sample is representative of the population only once.Similar to the area frame method, out-of-scope numbers (i.e., Businesses, institutions, out-of-scope dwellings or numbers that are not in service) are removed from the sample.Also, non-response units and dwellings with multiple phone numbers are adjusted for in the weighting process [27].
The weights common to the area frames and telephone frames need to be integrated using an adjustment factor α (0 < α < 1).Then a person-level weight is created by taking the inverse of the probability a person in the selected dwelling will be selected, which depends on the number of people in the household and the ages of those people.After the appropriate adjustment is made, a "winsorization" trimming method is used to decrease any extreme weights that occur.Finally, a calibration approach is used to ensure the weights are representative of the population estimates for the different age groups and genders in each health region [27].

Specific Hypotheses
We specify the alternative hypotheses for the CSS, FSS, BYM, and MLE approaches.We consider multiple alternatives that are tested separately.Further, let RR i indicate the relative risk for the i-th region within a cluster when compared with the region outside a cluster; the latter has RR i = 1.For example for cluster X, the RR i is given by

Results
The results of five different cluster detection techniques when applied to a COPD dataset in the province of Manitoba, Canada, from 2001 to 2010 are shown and compared in this section.
Based on the 67 regions, four different clusters were tested: (1) a case of no clusters (called A); (2) seven regions from the northern part of the province (called B); (3) seven regions from south-central part of the province (called C); and (4) 12 regions which comprise the Winnipeg region (called D).For A, no region was specified as a potential cluster.Moreover, the regions belonging to clusters B, C, and D are: B = {31, 33, 34, 36, 38, 40, 41}, C = {27, 28, 29, 30, 50, 51, 52}, and D = {56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67}.Since the LISA method does not depend on the expected number of observations, it could only be applied to cluster A as the other clusters require the adjustment of the expected number of disease cases for those regions inside the specified cluster.
The areas that are statistically significant (potential clusters) are shown for each cluster and each method separately (Figures 1-4).The summary of cluster A, no region specified as a potential cluster, is presented in Table 1.For the CSS and FSS procedures, the regions that are most likely to constitute a disease cluster are presented, as well as the regions that are second and third most likely to be considered as a cluster.For the BYM and MLE methods, a region is considered (and ranked) to be a significant cluster if the lower limit of the credible/prediction interval follows the specified criteria.For example, in the BYM method region 10 is most likely to be considered as a cluster and region 61 is least likely to be considered as a cluster under the criteria that the lower bound of RR is greater than one.For the LISA method, a region is determined (and ranked) to be significant if the p-value is less than 0.1.
The FSS method identified more regions as potential clusters than the CSS approach for cluster A, although, the regions with potential clusters that were detected by the CSS method were also identified by the FSS approach.The CSS approach detected regions {10, 43, 45, 61, 62} as potential clusters, and the FSS method identified the same regions as the CSS method, as well as regions {1, 11, 12, 13, 14, 20, 21, 27, 46, 50, 51, 54, 56, 60, 64, 65}.The BYM and MLE methods identified regions {1, 3,6,10,11,12,20,21,24,27,43,45, 50, 54, 61, 62, 64, 65} as possible clusters.The only difference between the results of these two procedures was the order of significance for the potential clusters.As well, most of the regions identified using these two approaches were also identified by the FSS approach and the regions identified by the CSS technique were also detected by the BYM and MLE approaches.Note that by evaluating the criterion of the RR values from greater than 1 to 1.5 or even 2, the number of potential clusters decreases (Table 1).Based on the deviance residual plots for both the BYM and MLE methods, we found that there is no serious lack of fit in the model.The LISA method found regions {2, 7, 16, 24, 43, 56, 57, 58, 60, 62, 64, 67} to be possible clusters of COPD.This approach identified some different regions to be potential clusters as compared to the other methods.

LISA CSS FSS RR > 1.0 RR > 1.5 RR > 2.0 BYM MLE BYM MLE BYM MLE
For cluster D, all four methods detected the regions belonging to the D cluster as a potential cluster.In addition to the regions in Winnipeg (cluster D), the BYM and MLE approaches were also able to detect some neighbours of Winnipeg (14 regions) as potential clusters.However, the CSS and FSS methods only detected two regions, 10 and 43 as a potential cluster in addition to cluster D.

Discussion and Conclusions
We used five popular approaches in spatial epidemiology to identify potential clusters of COPD cases in the Canadian province of Manitoba, Canada.These five methods have been used extensively in the literature and are relatively comprehensive.These methods use different approaches (non-parametric to parametric) to test for significant clusters.
We considered four different alternative hypotheses to compare the results of the CSS, FSS, BYM, and MLE methods.All four methods did a good job of identifying potential clusters for dense populations (clusters C and D) but not for a dispersed population (cluster B).In general, the CSS method identified a lower number of regions combined as a potential cluster compared to the FSS method, due to the non-circular shape of some regions in the province of Manitoba.A disadvantage of the LISA method is that the results do not depend on the expected number of cases or the population in each region.This is concerning since regions with high populations will likely have higher observed numbers of disease, however, this is not taken into account when using the LISA method.Therefore, the LISA method could only be applied to cluster A as cluster B, C, and D require the expected numbers to be adjusted for the respective regions in a cluster.This may explain why the LISA approach identified some different regions as potential clusters when compared with the other procedures.
A region was identified as a potential cluster if the lower bound of credible/prediction interval of the estimated relative risk was larger than one for the BYM and MLE approaches.Different decision rules may be defined where the estimated relative risk (in terms of the credible/prediction interval) would be larger or smaller than one [29].One could also consider the exceedance probability Pr RR i > b > c , where b can be 1, 2 or 3 and c may be a large value such as 0.90 [30].For the LISA method, a region was defined to be significant if the associated p-value was less than 0.1.However, different decision rules could be used where the level of significance is smaller than 0.1.
Here, three important factors, age, gender and year were used to adjust the expected number of COPD cases in the province of Manitoba.Unlike the CSS, FSS, and LISA methods, we can extend the model A2-A3 in the Appendix, for both the BYM and MLE methods, to include other covariates directly, which may be required for some applications.
We also note that the methods have different settings and assumptions, which motivate our comparison.User-chosen settings are part of all cluster tests and different choices could lead to different results.All five methods have been proposed for local clusters.Under the null hypothesis, the number of COPD cases follows a Poisson distribution for the BYM and MLE methods, while the test statistic for the CSS and FSS methods has an asymptotically χ 2 distribution and the LISA method uses an empirical distribution.These features motivated us to consider these important methods and apply them to our COPD cases.
As limitations of this study, we assumed that our COPD cases are rare cases to be able to use the Poisson model in the BYM and MLE methods.Also, we used survey data (weighted to the population level) in our study.Strengths of the study include the evaluation of multiple cluster detection methods.
Overall, the potential clusters of COPD were located in the southern part of the province with the exception of region 24, which was identified by the BYM and MLE methods (cluster A).According to findings from Fransoo et al. [31], which are based on the Community Health Survey [27], binge drinking and smoking levels are higher than the Manitoba average in the south-central part of the province.As well, there are a higher percentage of people who consume low levels of fruits and vegetables in this region, although these differences are not statistically significant.Obesity levels are significantly higher than the Manitoba average in this region as well.In the south-eastern part of the province binge drinking, exposure to second hand smoke, and overweight and obesity levels are higher than the Manitoba average although the results are not statistically significant [31].These are some possible health determinants that may explain the clusters of COPD in these regions.We found that the BYM and MLE methods are the best approaches in terms of identifying potential clusters and controlling for possible confounders (if any).These results may represent real increases in COPD or may be due to unmeasured covariates that need to be adjusted for in the model.Further investigation is needed to examine these findings, and also to explore the cause of these increases.paper are those of the authors and do not necessarily reflect the position of Statistics Canada.This study is based in part on data provided by the Research Data Centre at the University of Manitoba.The interpretation and conclusions contained herein are those of the researchers and do not necessarily represent the views of Statistics Canada.each region.By connecting adjacent regions, the flexible scan statistic places an irregularly shaped window S on each region.For any region i , the set of irregularly shaped windows of length j , containing j connected regions including region i, can vary from 1 to the pre-determined maximum J, where J is the maximum length of a cluster.Moreover, to prevent unlikely cluster shapes, the joined regions are restricted to the subsets of the set of regions i and J-1 -th nearest neighbours of region i.The set of all windows to be scanned by the flexible spatial scan statistic is then S 2 = {S i:j k ; i = 1,…,m;j = 1,…,J;k = 1,…,k ij }.The circular spatial scan statistic examines J circles for any region i and the flexible spatial scan statistic examines J circles plus all the sets of connected regions whose centroids are found within the J-th largest concentric circle.Subsequently, the size of S 2 is much larger than S 1 which is at most mJ.The test statistic used in the FSS method under the Poisson assumption is based on the likelihood ratio test given in Equation (A1).Now, the circle defined in Equation (A1) refers to S 2 instead of S 1 .Using the FSS method, circles with high likelihood ratio values are considered to be potential regions of disease clusters [18].The FSS method can also be applied using the FleXScan software [34] with the FleXScan default which is J = 15.

Bayesian Disease Mapping (BYM)
Identifying clusters can also be done through a Bayesian framework using Markov chain Monte Carlo (MCMC) methods [14,15,35,36].First used by Besag et al. [14], Bayesian disease mapping (BYM) is a modeling approach consisting of two parts.In the first part of the model, it is assumed that the cases follow a Poisson distribution with an area specific parameter θ i E i : where the observed and expected number of cases in region i are given by C i and E i , respectively.
The second part of the model is achieved through where the relative risk (RR) in region i is given by θ i , μ represents the overall mean ratio over the entire region and η i represents spatially correlated random effects.These spatial random effects are captured using the usual CAR model.A variety of CAR models can be used by obtaining a collection of mutually compatible conditional distributions p(η i |η -i ), i = 1,…,m, where η -i = {η j :j ≠ i, j ∈ δ i } and δ i refers to a set of neighbours for the i-th region [14].We use the following general model for spatial effects η i : where P is a m × m diagonal matrix with elements P ii = 1/e i ; D is a m × m matrix with elements D ij = (e j /e i ) 1/2 if region i and j are adjacent and D ij = 0 otherwise; e i is the number of regions adjacent to region i ; σ η 2 is the spatial dispersion parameter; λ η measures the spatial autocorrelation, λ min ≤ λ η ≤ λ max , where λ min -1 and λ max -1 are the smallest and largest eigenvalues of P -1/2 DP 1/2 ; and I m is an identity matrix of dimension m.We refer to [37] for details of this proper CAR model.Using vague prior distributions and within the Bayesian framework (MCMC), the parameters may be estimated to produce posterior distributions for the parameters in the model [14].
A cluster is specified as a region where the estimated relative risk (in terms of the lower credible set) is significantly larger than one [38].To apply this method, WinBUGS software [37] is used to calculate the relative risk values.

Frequentist Approach Using Maximum Likelihood Estimation (MLE)
The data cloning (DC) approach is a computational algorithm to obtain the MLE for hierarchical models [23,24].This approach is based on the Bayesian computational method and is used for frequentist purposes.This method involves independently repeating the observations C = ( where the prior distribution on the parameter space is π(α) and H C L = L(α,C) L π(α) dα is the normalizing constant.Also, L(α,C) L represents the likelihood for L copies of the original data.As shown by Lele et al. [23,24], when L is large enough, π L (α|C (L) ) converges to a multivariate Normal distribution with the mean given by the MLE of the model parameters and variance-covariance matrix equal to 1/L times the inverse of the Fisher information matrix for the MLE.Hence, the sample mean vector of the generated random numbers from Equation (A4) acts as an estimate of the MLE and an estimate of the asymptotic variance-covariance matrix for the MLE α is given by L times the sample variance-covariance matrix of the generated random numbers from Equation (A4).Lele et al. [24] also provided various tests to determine the adequate number of clones L.

Prediction of Relative Risk:
The prediction of relative risk (random effects) can be fairly problematic, especially in the frequentist framework.One approach to estimate r using the data is to use π(R=r|C,α ) where R = (RR 1 ,…,RR m )' .However, the variability introduced by the model parameters estimate is not captured in this approach.In the literature [39], it has been suggested to use the following density in order to take into consideration the variation of the estimator, where α 1 = μ , α 2 = (λ η ,σ η 2 )' , f(•) and g(•) are Poisson and Normal distributions, respectively, and ϕ(., ξ, Σ) denotes a multivariate Normal density with mean ξ and variance-covariance Σ .In this paper, the prediction of the r was found using Equation (A5) through MCMC sampling.
A disease cluster is defined as a region where the estimated relative risk (in terms of the lower prediction interval) is significantly larger than one.The dclone package [40] is utilized within the R software [41] in order to calculate the relative risk values.

Local Indicator of Spatial Association (LISA)
Another method for identifying spatial clusters is a local indicator of spatial association (LISA) statistic [26].In general, for an observation, y i in the i th region, the LISA statistic is given by L i =f y i ,y J i where f is a function and the values observed in the J th neighbourhood of region i are given by y J i .
In order to determine the statistical significance of the spatial association at region i, the following must be satisfied where a critical value is given by δ i and α i is a given level of significance.Another condition of a LISA statistic is the total of all LISA statistics in a region must be proportional to a global indicator of spatial association.In other words, where Λ is an indicator of the global indicator with a scale factor defined by γ.In order to test whether there is statistically significant spatial association over all the regions, the following statement must be true [26] Pr Λ > δ ≤ α.
A general LISA statistic may be used to test the null hypothesis of no spatial association against the alternative hypothesis that spatial clustering exists across a region.However, the distribution of the general LISA may be hard to find.For this reason, conditional randomization or a permutation approach is used to find an empirical distribution.The randomization is done by holding the observed value (y i ) in region i constant and the remaining observed values across the entire study region are randomly permuted and the value of L i is computed.This is done for each region in the study area.The result is an empirical distribution function, which expresses the extent to which each observation is considered to be extreme in comparison with the other observed values [26].
The LISA method is usually a simple method to apply, however, it is complicated by the fact that the LISA statistics for individual regions may be correlated.For example, when regions i and k are neighbours or have common elements in their neighbourhood sets, the corresponding LISA statistics, L i and L k will be correlated.Typically, it is extremely hard to derive the marginal distributions of each statistic and therefore, the significance levels must be approximated by Bonferroni inequalities or the method outlined by Sidák [42].Using Bonferroni inequalities, the individual significance levels (α i ) are set to α/m and using Sidák's method, they are equal to 1 -1 -α 1/m , where the overall significance level is set to α and there are m comparisons.It has been suggested that m is taken to be the number of observations n.However, this may result in bounds that are too conservative and in fact very few observations may be deemed to be significant clusters [26].Further investigation is being conducted to determine the best value for m.In our study this method is implemented in R [41] using the ncf package [43].

Figure 1 .Figure 2 .Figure 3 .Figure 4 .
Figure 1.The order of most likely clusters of COPD for the CSS, FSS, and LISA (based on the p-value) methods, and the special effects of the regional COPD risks for the BYM and MLE methods; in the case of cluster A. Major urban centre (Winnipeg region) is incorporated as an inset.(a) CSS; (b) FSS; (c) BYM; (d) MLE; (e) LISA.
C 1 ,…,C m )' for L different individuals.Subsequently, these individuals all have the exact same set of observations C which are represented by C (L) = (C,C,…,C).The posterior distribution of α = (μ,λ η ,σ η 2 )' conditional on the data C (L) is then given by