Spatial Analytics Based on Conﬁdential Data for Strategic Planning in Urban Health Departments

: Spatial data analytics can detect patterns of clustering of events in small geographies across an urban region. This study presents and demonstrates a robust research design to study the longitudinal stability of spatial clustering with small case numbers per census tract and assess the clustering changes over time across the urban environment to better inform public health policy making at the community level. We argue this analysis enables the greater e ﬃ ciency of public health departments, while leveraging existing data and preserving citizen personal privacy. Analysis at the census tract level is conducted in Mecklenburg County, North Carolina, on hypertension during pregnancy compiled from 2011–2014 birth certiﬁcates. Data were derived from per year and per multi-year moving counts by aggregating spatially to census tracts and then assessed for clustering using global Moran’s I. With evidence of clustering, local indicators of spatial association are calculated to pinpoint hot spots, while time series data identiﬁed hot spot changes. Knowledge regarding the geographical distribution of diseases is essential in public health to deﬁne strategies that improve the health of populations and quality of life. Our ﬁndings support that spatial aggregation at the census tract level contributes to identifying the location of at-risk “hot spot” communities to reﬁne health programs, while temporal windowing reduces random noise e ﬀ ects on spatial clustering patterns. With tight state budgets limiting health departments’ funds, using geographic analytics provides for a targeted and e ﬃ cient approach to health resource planning.


Introduction
In 2014, the U.S. Public Health Leadership Forum proposed that local and state health departments act as the Community Chief Health Strategist [1]. One of the practice recommendations in the report specifically calls for analysis and translation of large, real-time data sets to identify trends and hot spots. A key step toward achieving this recommendation involves leveraging in new ways data already routinely available to health departments as secondary data sources, namely birth and death certificate data. The overarching objective of this undertaking is to identify trends in health outcomes in a population, and the associated socio-economic determinants that may be conducive to developing adequate and efficient interventions to enhance public health. This is deemed of particular relevance for the resilience of urban regions, where inter-generational and cross-cultural population dynamics challenges standards and practices in healthy cities.
Increasing the availability of standardized sub-county data is important for enhancing the capability of improving our understanding of public health in urban areas. Geographic variation in health factors and outcomes at the small-area level, including zip codes and census tracts, has been noted in many contexts [2]. Moreover, the use of small area analysis has become an important tool in the effective targeting of limited public health resources [3]. Despite the ability of spatial data analytics to detect patterns of clustering of events in small geographies across an urban region, small geographies may display data on health conditions that are very sparse, which can lead to cross-sectional analysis being biased by small-n designs. The challenge this creates for public health researchers and practitioners is to discern longitudinal trends from noise associated with low frequencies. Thus, given the criticality of small-area public health analysis and the constraints on access to relevant public health data, the purpose of our analysis is to present and demonstrate a robust research design to (a) study the longitudinal stability of spatial clustering with small case numbers per census tract and (b) assess the clustering changes over time across the urban environment to better inform public health policy making at the community level.
To this end, we use a case study of Mecklenburg County, North Carolina, to demonstrate that temporal windowing may effectively smooth out noise, enhance the cross-sectional validity of results, and allow us to trace longitudinal trends in spatial clusters of hypertension during pregnancy compiled from 2011-2014 birth certificates. If applied in an ongoing fashion, this approach would facilitate an important tool in targeting limited public health resources. We argue this analysis enables the greater efficiency of public health departments, while leveraging existing data and preserving citizen personal privacy.

Background
Local health departments increasingly are adopting geographic information system (GIS) software for epidemiological analyses [4,5] and to build analytic capacity [6]. Historically, the geographic unit of analysis most often used has been the county [7][8][9] or zip code [10], often due to availability of geography-related data. Using the county as the aggregation unit, however, precludes identifying specific high and low risk locales within the county, which may covary with the socioeconomics of local population, environmental exposure, or geographic access to health care services. Although analysis within county units, such as zip codes, has been used with publicly available data, analysis at smaller geographic units, such as city blocks or census tracts, has been used less frequently due to concerns about individual privacy and confidentiality [11]. Using the smallest feasible unit of analysis could uncover more distinct or isolated high and low risk locations in need of public health attention. As local health departments seek to act as Community Chief Health Strategists, public health administrators will want geospatially informed analysis. Geospatial analysis begins by assigning a location to each case [12], while balancing considerations of accuracy and anonymity. Street addresses act as the finest level of analysis, yielding the most detailed results. Aggregating cases into larger spatial units, such as zip codes or counties, hides valuable details and reduces variance in the data [13]. In other words, the spatial unit acts as a proxy for individual cases, resulting in a positional discrepancy between points on a map and true home locations [14]. The degree of location accuracy becomes important when planning local interventions or when proximity of cases to a source of exposure needs to be determined [15]. Using the most accurate location provides health program planners with evidence to more reliably identify where increases in resources or interventions are needed [16].
Accuracy in location, however, can lead to cases being identifiable, particularly in small spatial units [17]. The identifiability of cases has been noted as a potential or actual issue of concern in reproductive health [18], birth defects [19], diet [20], environmental health [21], social care planning [22], and geo-privacy studies [23]. The U.S. Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) [24] requires that disclosed health information be restricted to the minimum necessary to satisfy its intended purpose. Data are considered de-identified in accordance with the HIPAA Privacy Rule if the data do not "identify an individual and if the covered entity has no reasonable basis to believe it can be used to identify an individual" [25]. De-identification of geographic information is accomplished by aggregating geographic identifiers to large-population area-based units or applying statistical principles to render information not individually identifiable [26]. However, there is no universal standard for "adequate confidentiality protection" or "acceptable risk" [27].
Thus, selecting an approach to reduce the probability of identifying individuals, while preserving the characteristics of the geographic data for valid inference, depends in part on the nature of the data, acceptable confidentiality risk, and current and future use of the data [28].
Public health administrators and researchers should be aware of the analytic approach used to identify high and low risk areas. Research using census tracts identifies fairly specific spatial clusters of high rates (hot spots) of adverse health conditions, and conversely, spatial clusters of low rates (cold spots) of that condition [29]. Knowing whether a hot spot is statistically significant (that is, it would not have occurred by chance) can be determined through various approaches of spatial statistics [30]. The health geography literature has evolved towards fully recognizing the scientific merit of exploratory analysis [31], and cluster detection in particular. Open source geospatial software (e.g., GeoDa, PYSAL, R code libraries) make geographic statistical methods more accessible to target locations for interventions [32].
The United States Patient Portability and Affordable Care Act requirement that health care organizations conduct community needs assessments [33] has led to new collaborations between health departments and health care organizations, including data sharing. To leverage the value of contemporaneous data from the electronic health records requires not only sharing data elements across health organizations, but overcoming historical and logistical challenges of sharing data among health departments [34]. The geographic mapping of any health condition is a cogent framework to comply with legal requirements as it assumes that the health condition has an underlying spatial pattern [35] and that it is well positioned to capitalize on the extensive toolbox of geospatial methods to identify where underlying spatial patterns exist to direct strategic planning [36].
We argue, however, that using the county level as the spatial unit may smooth out much of the spatial variability conducive to effective strategic planning for certain health conditions. Therefore, the contention is that aggregating confidential health data temporally and spatially yields results that are stable at the census tract level. In particular, we study the use of temporal moving windows spanning multiple years to reduce uncertainty in prevalence rates resulting from small counts in small geographies. Windowing is aimed at detecting the stable spatial patterns embedded in a spatial data series, including possible hot and cold spots, when spatial data series are based on small-n data sets, such as a number of chronic diseases in urban settings. As a corollary, this temporal smoothing reduces the sensitivity of longitudinal analysis to annual fluctuations that could be ascribed to non-systematic causes. Hence, it is anticipated that sensitive temporal windowing smooths out seemly random effects to reveal longitudinal trends.

Data Collection
We conducted a secondary data analysis of 2011-2014 birth certificate data from Mecklenburg County, North Carolina, which is the county in the Southeastern United States that encompasses the City of Charlotte. Mecklenburg County has a population of nearly one million people, mostly living in the urban area of Charlotte. The study was approved by the Institutional Review Board at the University of North Carolina at Charlotte and conducted in cooperation with staff from Mecklenburg County's Health Department. Census tracts are obtained from SimplyAnalytics, which interpolates census tracts though time to 2010 geographic boundary files [37].
The sample consisted of 2011-2014 birth certificates showing hypertension of the mother during pregnancy in Mecklenburg County. Mecklenburg County also has both racial or ethnic and economic diversity, and a sufficient number of census tracts (n = 233) for geographic statistical analysis. At the time of the study, 2014 was the most recent year with complete hypertension birth certificate data.
Mecklenburg County Health Department provided birth certificates showing hypertension at birth (n = 887).

Variables
We selected hypertension during pregnancy as the health outcome because of Mecklenburg County's intent to explore contextual associations of place and behavioral or disease outcomes from chronic diseases. This is a path towards a healthier population that a number of urban jurisdictions across the globe are taking to improve the quality of life offered to their citizens and reduce the incidence of social disparities. The variable was also selected because pregnant women are routinely screened for having elevated blood pressure.
Hypertension while pregnant can lead to acute health problems for the mother, such as seizures and death, or to long-term health problems, particularly with the kidneys and liver [38]. Gestational hypertension also affects the fetus and contributes to infants being born prematurely. Hypertension during pregnancy affects an estimated 5 to 15% of all pregnancies [39,40], with one study finding 20% of women pregnant for the first time developing high blood pressure during the pregnancy [41]. In addition to biological causes of hypertension, various social factors contribute to developing hypertension. Studies have associated racism and segregation with hypertension across populations [42] and among pregnant women [43]. Such studies suggest that place matters for pregnancy health. Overall, the seriousness, prevalence, and social aspects of hypertension during pregnancy makes it worthy of a geographically focused analysis.
To operationalize the analysis of hypertension incidence, we relied on variable definitions used on North Carolina's birth certificate to create an annual, two-year moving average, and three-year moving average of prenatal hypertension rate. The annual prenatal hypertension rate is calculated as: The two-year moving average prenatal hypertension rate is calculated as: The three-year moving average prenatal hypertension rate is calculated as: where X is the annual number of total births per census tract derived from the Mecklenburg County Health Department birth certificates. In total, four annual hypertension rates were calculated, as well as three two-year moving averages and two three-year moving averages.

Analysis
Birth certificate geocoding was conducted at Mecklenburg County's Health Department. Each birth certificate address was coded with longitude and latitude coordinates by matching home addresses against the county's master address file and then, using ArcGIS software, aggregated to the corresponding census tract. We achieved a 98% match rate, which is consistent with the literature [44,45]. We excluded 2% of the final geocoded addresses based on being outside Mecklenburg County. The geocoded birth data were then linked to census tracts using a spatial join yielding a 100% match. Smaller geographic units, such as census blocks and census block groups, could not be used to aggregate individual records to because the dataset was too small, which would have violated the HIPAA requirements.
To investigate the presence and extent of spatial autocorrelation, we calculated the hypertension rate per census tract in ArcGIS. With each census tract having a hypertension rate, the dataset is prepared to run the exploratory spatial data analysis [46]. There are many methods to investigate spatial autocorrelation. One of the most widely used geographic tools to illustrate the uneven spatial distribution of geographic indicators in health studies, especially the clustering of diseases, is Moran's I (both global and local) [47][48][49][50]. This statistic has the advantage to display various types of spatial distribution characteristics [51].
The Global Moran's I is a measure describing the overall relationship of spatial dependence across all geographic units for the study area. Thus, only one value is derived to determine if spatial association exists (that is if similarly valued tracts tend to be neighbors). The global Moran's I model uses a Moran's I value, a z-score, and a p-value to formally test the null hypothesis of spatial randomness in a dataset [52]. The global Moran's I statistic evaluates whether a clustered, dispersed, or random spatial pattern exists [53,54]. The global Moran's I is calculated from the following formula: where x i and x j are the values of hypertension in census tracts i and j in the study area, w ij corresponds to the weight between census tracts i and j as defined in the spatial weight matrix, and n represents the total census tracts in the study area. The spatial weight matrix is formed from weight coefficients. It is the formal expression of spatial dependence between spatial entities that collectively constitute the study area. This paper follows a common strategy to determine a spatial weight based on border sharing. For example, census tracts in Mecklenburg County, NC, are represented by i and j. When two tracts are adjacent to one another (neighbors), the value of w ij will be 1. If the tracts do not share a border, they are not deemed to be neighbors, and w ij will have a value of 0. The criterion of vicinity used to calculate Global Moran's I was the Queen criterion, which considers polygons that share any border (edge or vertex) as neighbors. We estimated the Global Moran's I statistic for each annual, two-year moving average, and three-year moving average using a 999 randomized permutations, and a significance set at p < 0.001.
Also using ArcGIS, we conduct analysis with a local indicator of spatial association (LISA), which is the local version of Moran's I autocorrelation statistic, to assess the degree of difference between each census tract and its own neighbors [55]. The LISA analysis focuses on patterns surrounding individual spatial observations. The Local Moran's I is calculated from the following formula With where the x i is the attribute of spatial feature i, X is the mean of the corresponding attribute, w ij is the spatial weight between features i and j, and n being the total number of features. An I that has a positive value indicates the feature has a neighbor with similar high or low attribute values. This feature is then part of the cluster. An I that has a negative values indicates a feature has a neighbor with dissimilar values. This feature would be an outlier, with negative spatial autocorrelation. The associated p-value calculated must be small enough for the highlighted cluster to be considered statistically significant.
With the LISA analysis, two maps are produced: one shows the statistical significance of local clusters, the LISA Significance Map, and the other the distribution of four potential local spatial outcomes based on a difference gradient between the rates in a given census tract and the average rate in its neighboring census tracts [56]. The LISA output identifies four categories of clusters based on the spatial autocorrelation (Table 1). Hot spots in our analysis refer to census tracts with high prenatal hypertension surrounded by census tracts with high prenatal hypertension. Cold spots refer to census tracts with low prenatal hypertension surrounded by other census tracts with low prenatal hypertension. The high-high and low-low census tracts highlight the core of the clusters, whereas the high-low and low-high census tracts represent spatial outliers. We ran the LISA process for each annual, two-year moving average, and three-year moving average hypertension rate with a queen spatial contiguity weight and 999 randomized permutations, and a significance set at p < 0.001. Table 1. Relationship of high and low rates in focal and surrounding census tracts as revealed though LISA analysis.

High Average Rates in Surrounding Census Tracts
"Hot spots" Spatial outlier

Low Average Rates in Surrounding Census Tracts
Spatial outlier "Cold spots" To this end, our study incorporates global indicators of spatial autocorrelation and LISA to compare spatial clustering of hypertension in Mecklenburg County from 2011 to 2014. The goal is to assess the spatial clustering over time to identify how using temporal windowing reduces the amount of noise with the use of a small-n in our study area.

Results
The geographic variation in prenatal hypertension rates per census tract in Mecklenburg County appears to reveal a non-random spatial pattern in the annual, two-year moving average, and three-year moving average (Figures 1-3). This spatial pattern is formally tested with Moran's I and LISA statistics. Given the geography of census tracts and the series of prenatal hypertension rates, we find the distribution of prenatal hypertension exhibits a significantly positive Moran's I statistic for each of the eight data series tested (Table 2), indicating a level of positive spatial autocorrelation in hypertension across the county. The four categories of hot and cold spot census tracts are seen in the LISA maps of Mecklenburg County (Figures 4-6) and the LISA results are documented in Table 3.           Overall, the incidence of prenatal hypertension does not happen randomly in the urban region under study. When we look at neighborhoods within the Mecklenburg County region, we find that hot spots tend to be loosely found in a crescent to the north of the county center throughout the study years. However, hot spots tend to shift to western portions of Mecklenburg County in 2013 and become more prominent in the east in 2014. These areas are known to be associated with the social geography of the region, particularly with a large proportion of African American and Hispanic populations, with lower educational attainment, and lower socio-economic status (i.e., higher poverty rates, lower household income, higher unemployment) [57]. Prenatal hypertension cold spots are mainly found in an area fanning out south from the county center as well as in the northern section of the county; this spatial pattern carries through the study period. Cold spots are found in neighborhoods with large presence of Caucasian populations, higher educational attainment, and higher socio-economic status [58]. Hot and cold spots fall in these areas with greater consistency as temporal windowing is applied to the yearly series, which suggests this is the discernable trend in prenatal hypertension once noise has been filtered out. Finally, while tracts that are outliers begin to fade away over time, hot and cold spots become more prominent. Hence, when noise caused by small-n annual statistics is controlled for, the same neighborhoods of the city continue to be separated by sharp health disparities epitomized by hot and cold spots in prenatal hypertension.

Discussion
Identifying the location of at-risk hot spot communities provides an opportunity to refine place-based health programs in urban regions with the use of temporal windowing. Geospatial analysis makes it feasible for local health departments to analyze their own data while complying with regulations and ethics related to protection of human subjects for public health surveillance and planning purposes. With a few procedures, existing data and publicly available databases can be leveraged by geocoding birth and death certificates to explore those data from different perspectives such that hot spots of census tracts can be pinpointed for further investigation, including for further confirmatory analysis of spatial epidemiology.
In the case of Mecklenburg County, chronic diseases such as hypertension continue to be cited as high priorities in the 2010-2018 Community Health Assessments (CHA) [59]. Although the CHA allows health departments a place to compile health priorities for the county, one major challenge health departments face is understanding the impact of initiatives aimed at reducing health disparities. With CHA conducted every 4 years, the county intends to track the overall indicator of women who have a history of one or more chronic diseases. However, Charlotte administrators have not tracked where hypertension hot spots are located, and whether these hot spots shrank or expanded over time.
In Charlotte, communities that have the highest priority in the CHA tend to have less education and income and live in neighborhoods which lack access to healthy food and safe places for recreation. Spatially, a crescent-shaped area of poverty and low-educational attainment has formed around the center city of Charlotte. These residents may also be exposed to risk factors that increase their chance for chronic disease. Our results demonstrated that once temporal windowing is used with the hypertension data, especially the two-and three-year windows, hot spot clusters emerged from these traditionally segregated communities in the Charlotte area. The benefit of tracking hypertension in Charlotte with Moran's I and LISA analysis at census tracts is that it is possible to identify whether health departments achieve their goals from assessment to assessment. Although county level spatial aggregation continues to be in use routinely in most urban areas, this level of analysis precludes this type of surveillance and is rather ineffective to fully inform decision makers on where to target their efforts and on whether past efforts were successful.
Studying a series of spatial health data through a finer-scale analysis-such as the neighborhoods-allows a better understanding of local health outcomes and risk factors over time.
Incorporating smaller geographies into a GIS can also aid health administrators by incorporating the location of hospital and medical facilities to address whether proximity influences the spread of the hot or cold spots. Moreover, U.S. Census and American Community Survey demographic data can be linked to see whether trends correlate with socio-economic conditions. Linking these data into the CHA will require collaboration with data stewards and adequate training of public health practitioners so that the benefits of using these data can be fully realized. In this way, a more tailored approach to surveying health priorities can be undertaken.
Our findings support that data spatially aggregated to the tract level lead to results that contribute to increased capacity to identify local clusters [60]. We also show that data stability and greater consistency in the significant spatial patterns can be obtained through a temporal windowing approach, which is an important achievement for strategic planning when rates are based on small numbers. Another type of models, known as Bayesian hierarchical models, have also been used to address sparseness in populations and cases, allowing for an adaptive smoothing approach [61]. This can, however, create overly smoothed maps, masking true risk distribution. The degree of smoothing used is a trade-off between high sensitivity and high specificity [62]. Prior research suggests that a numerator of 20 or more is needed to produce fairly stable estimates [63]. Also, it is well known that denominator data that rely on annual data sampling to apprehend the at-risk population (denominator)-such as annual American Community Survey data-is affected by large and spatially variable margins of error across the urban region. While some methods have been developed to provide unbiased estimates of statistics [64,65], many analyses continue to ignore this important and impactful data uncertainty [66].
We also acknowledge a key limitation in spatial analytics when dealing with aggregated data, such as census tracts. The possibility exists of having a Modifiable Areal Unit Problem [67], in the form of a scale effect or of a zoning effect. The scale effect arises when different results are attained due to variations in the scale of aggregation units. This implies that using census tracts rather than zip codes, for example, can lead to different findings. The zoning effects occur when a constant scale of analysis is used with a variation in the shape of aggregation units. This is the situation for census tracts, as well as zip codes and county boundaries. We did not test for scale or zoning effect because the study was to document how administrators can use confidential data. Further research is needed to understand which of the commonly used geographic scales is more useful for public health planning and under which conditions.

Conclusions
With tight state budgets limiting health department's funds, using LISA on spatially aggregated secondary health data with a GIS provides for a targeted and more efficient approach to health resources planning, enabling monitoring of socio-economic determinants of health at a geographic scale commensurate with policy making and assessment. With the use of temporal windowing, administrators can reduce the effect of random noise while using a health indicators that are based on a small number of cases. The geographic tools used in this study are not intended to draw any causal conclusions about the spatial patterns that emerge from the analysis. Instead, they are powerful, effective, and robust for pattern detection and monitoring to enhance health administrator's understanding of the processes that occur at the neighborhood or community level, such as census tracts, or whichever level administrators deem appropriate for the data to be aggregated too. Over time, the use of these tools will help local health departments identify how heath indicators change and what socio-demographic data associate with those changes.
We successfully demonstrated the application of this geospatial health analytics research design to the case of prenatal hypertension in the urban setting of Mecklenburg County, and because it is easy to reproduce, argue for its broad use in public health departments as part of their standard analytic toolbox. Tracking a system of sub-county data will allow public health officials in different urban contexts to benefit from our research by better understanding local health outcomes and risk factors over time. Incorporating these data into a collaborative urban network is advocated so that the benefits of using these data can be fully realized and identified challenges resolved.