Application of Epidemiological Geographic Information System: An Open-Source Spatial Analysis Tool Based on the OMOP Common Data Model

Background: Spatial epidemiology is used to evaluate geographical variations and disparities in health outcomes; however, constructing geographic statistical models requires a labor-intensive process that limits the overall utility. We developed an open-source software for spatial epidemiological analysis and demonstrated its applicability and quality. Methods: Based on standardized geocode and observational health data, the Application of Epidemiological Geographic Information System (AEGIS) provides two spatial analysis methods: disease mapping and detecting clustered medical conditions and outcomes. The AEGIS assesses the geographical distribution of incidences and health outcomes in Korea and the United States, specifically incidence of cancers and their mortality rates, endemic malarial areas, and heart diseases (only the United States). Results: The AEGIS-generated spatial distribution of incident cancer in Korea was consistent with previous reports. The incidence of liver cancer in women with the highest Moran’s I (0.44; p < 0.001) was 17.4 (10.3–26.9). The malarial endemic cluster was identified in Paju-si, Korea (p < 0.001). When the AEGIS was applied to the database of the United States, a heart disease cluster was appropriately identified (p < 0.001). Conclusions: As an open-source, cross-country, spatial analytics solution, AEGIS may globally assess the differences in geographical distribution of health outcomes through the use of standardized geocode and observational health databases.


Expected count
In AEGIS, the expected count is used in several statistical calculations. Specifically,

$$E_i = \sum_{j} r_j \, n_{ij} \qquad (1)$$

where $r_j$ is the incidence rate in stratum $j$ of the indirectly standardized (age and sex) population of all patients included in the target cohort, and $n_{ij}$ is the population in stratum $j$ of administrative district $i$.
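The expected count can be illustrated with a minimal Python sketch. This is not the AEGIS implementation; the function names, stratum labels, and toy numbers are all assumptions made for illustration. The stratum rates are taken from the pooled population of all districts, then applied to each district's stratum populations.

```python
def reference_rates(cases_by_stratum, population_by_district):
    """r_j: outcome rate in stratum j, pooled over all districts (assumed reference)."""
    totals = {}
    for strata in population_by_district.values():
        for stratum, n in strata.items():
            totals[stratum] = totals.get(stratum, 0) + n
    return {s: cases_by_stratum.get(s, 0) / totals[s] for s in totals}


def expected_counts(rates, population_by_district):
    """E_i = sum over strata j of r_j * n_ij."""
    return {
        district: sum(rates[s] * n for s, n in strata.items())
        for district, strata in population_by_district.items()
    }


# Toy data (invented for illustration): two districts, two age-sex strata.
population = {
    "district_A": {"M_40-49": 1000, "F_40-49": 1000},
    "district_B": {"M_40-49": 2000, "F_40-49": 1000},
}
cases = {"M_40-49": 30, "F_40-49": 10}

rates = reference_rates(cases, population)
expected = expected_counts(rates, population)
```

With indirect standardization of this kind, the expected counts summed over all districts equal the total number of observed cases, which is a convenient sanity check.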

Standardized Incidence Ratio (SIR)
The disease risk is estimated by the SIR, which is calculated as the ratio of the observed number of outcomes to the expected count:

$$\mathrm{SIR}_i = \frac{O_i}{E_i} \qquad (2)$$

where $O_i$ is the number of cases of the outcome in administrative district $i$ and $E_i$ is the expected count of the outcome in administrative district $i$.

Proportion
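The proportion indicates the number of patient outcomes per population in the administrative district. The fraction is a scaling parameter used to express the value of the proportion (for example, per 100,000 people):

$$\mathrm{Proportion}_i = \frac{O_i}{n_i} \times \mathrm{fraction} \qquad (3)$$

where $O_i$ is the number of outcomes and $n_i$ is the population of administrative district $i$.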
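The SIR and the proportion described above can be sketched in a few lines of Python. The function names and the default `fraction` of 100,000 are assumptions for illustration, not AEGIS parameters or defaults, and treating the fraction as a multiplicative scale is likewise an assumption.

```python
def sir(observed, expected):
    """Standardized incidence ratio: observed count divided by expected count."""
    if expected <= 0:
        return float("nan")  # SIR is undefined without a positive expected count
    return observed / expected


def proportion(observed, population, fraction=100_000):
    """Outcomes per `fraction` people (assumed multiplicative scaling)."""
    return observed / population * fraction


# Toy numbers (invented): 18 observed cases, 15 expected, 2,000 residents.
district_sir = sir(18, 15)
district_proportion = proportion(18, 2000)
```

A SIR above 1 means the district has more cases than its age and sex structure would predict; the proportion ignores that structure and only scales the crude rate.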

Scan Statistics
AEGIS identifies clusters using Kulldorff's scan statistics. This method scans for unusual numbers of cases by expanding a large number of windows across the area of interest. Generally, circular windows are used whose radius increases continuously from zero to an upper limit, chosen so that no window contains more than 50% of the total number of patients in the target cohort. A likelihood ratio test is then used to assess the occurrence of clusters statistically. The scan statistics method can quantitatively assess disease occurrence and social inequalities in health by detecting areas where adverse outcomes are concentrated. For the clustering results, a p-value < 0.05 was considered significant. Conditioning on the observed total number of outcomes $N$, the likelihood ratio $\lambda$ is expressed as

$$\lambda = \frac{\max_{Z \in \mathcal{Z}} L(Z)}{L_0}$$

where $\mathcal{Z}$ is the set of all windows, $L(Z)$ is the likelihood for window $Z$, and $L_0$ is the likelihood under the null hypothesis that the probability inside the window is the same as the probability outside.
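The window search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the AEGIS or SaTScan implementation: it assumes a Poisson model, expected counts proportional to population (no age or sex adjustment), windows grown by adding district centroids in order of distance, and it omits the Monte Carlo step that yields the p-value. All names and data are hypothetical.

```python
import math


def poisson_llr(c, e, total):
    """Log-likelihood ratio of one window; zero unless cases exceed expectation."""
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    remaining = total - c
    outside = remaining * math.log(remaining / (total - e)) if remaining > 0 else 0.0
    return inside + outside


def most_likely_cluster(centroids, cases, population, max_fraction=0.5):
    """Return (best log-likelihood ratio, member indices) over circular windows."""
    total_cases = sum(cases)
    total_pop = sum(population)
    best_llr, best_members = 0.0, None
    for i in range(len(centroids)):
        # Grow a circle around centroid i by adding districts nearest first.
        order = sorted(range(len(centroids)),
                       key=lambda j: math.dist(centroids[i], centroids[j]))
        c = pop = 0
        members = []
        for j in order:
            pop += population[j]
            if pop > max_fraction * total_pop:
                break  # window would exceed the population limit
            c += cases[j]
            members.append(j)
            e = total_cases * pop / total_pop  # expected cases under the null
            llr = poisson_llr(c, e, total_cases)
            if llr > best_llr:
                best_llr, best_members = llr, sorted(members)
    return best_llr, best_members


# Toy data (invented): four districts on a line, cases concentrated in the first two.
centroids = [(0, 0), (1, 0), (10, 0), (11, 0)]
cases = [10, 9, 1, 0]
population = [100, 100, 100, 100]
llr, cluster = most_likely_cluster(centroids, cases, population)
```

In practice, significance would be assessed by recomputing the maximum statistic on many datasets simulated under the null hypothesis and ranking the observed value among them.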

Bayesian mapping
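To estimate disease risk in relatively poorly informed areas, AEGIS uses the Besag-York-Mollié (BYM) model, a well-established method for estimation in areas with a small sample size. Specifically,

$$O_i \mid \theta_i \sim \mathrm{Poisson}(E_i \theta_i), \qquad \log \theta_i = \beta_0 + \mathbf{x}_i^{\top} \boldsymbol{\beta} + \xi_i, \qquad \xi_i = u_i + v_i$$

$$u_i \mid u_{j \neq i} \sim \mathcal{N}\!\left(\frac{1}{m_i} \sum_{j \sim i} u_j,\ \frac{\sigma_u^2}{m_i}\right), \qquad v_i \sim \mathcal{N}(0, \sigma_v^2)$$

where $O_i$ and $E_i$ are the observed and expected counts in area $i$, $\theta_i$ is the relative risk, $\mathbf{x}_i$ are covariates, $\xi_i$ is a spatial random effect, $m_i$ is the number of neighbors of area $i$, $\sigma_u^2$ is a spatially structured variance parameter, and $\sigma_v^2$ is a spatially independent variance.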
The R-INLA package was used for Bayesian calculations for small area estimation.

Global Moran's I
Global Moran's I represents the overall spatial autocorrelation of the area covered by the study. The values range from −1 (indicating a dispersed distribution) to 1 (perfect clustering). A value of 0 indicates no autocorrelation. Global Moran's I is expressed as

$$I = \frac{n}{\sum_{i} \sum_{j} w_{ij}} \cdot \frac{\sum_{i} \sum_{j} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i} (x_i - \bar{x})^2}$$

where $x_i$ is the attribute value of the $i$'th object, $\bar{x}$ is the mean of the attribute values, $n$ is the number of objects, and $w_{ij}$ is the weight of the combination of objects $i$ and $j$.

Local Moran's I
AEGIS calculates regional autocorrelation between individual regions using the Local Indicators of Spatial Association (LISA) method to identify clusters. Specifically,

$$I_i = \frac{x_i - \bar{x}}{S^2} \sum_{j \neq i} w_{ij} (x_j - \bar{x}), \qquad S^2 = \frac{1}{n} \sum_{k} (x_k - \bar{x})^2$$

where $n$ represents the number of unit areas and $w_{ij}$ is a spatial weight indicating whether unit areas $i$ and $j$ are spatially adjacent; the weight is assigned based on whether the two unit areas share a boundary. In other words, $w_{ij} = 1$ if units $i$ and $j$ share a boundary; otherwise $w_{ij} = 0$.

Supplementary Table S1. Comparison of the estimated major cancer incidences (age adjusted) from AEGIS with the findings of relevant published reports.

Table columns: Cancer site, Classification, National incidences

All cancer incidence reported by Statistics Korea is within the 95% CI of all cancer incidence estimated by AEGIS. In addition, except for liver cancer incidence in men, the direction of change (increase or decrease) in cancer incidence between the two periods (2004-2008 and 2009-2013) is the same in the AEGIS estimates as in the figures published by Statistics Korea. Geographical variations in major cancer incidence generated by AEGIS were similar to the regional cancer incidence rates reported by Statistics Korea and by Won et al. (Won Y, Jung K, Oh C, Kong H, Lee DH, Lee KH. Geographical Variations and Trends in Major Cancer Incidences throughout Korea during 1999-2013. 2018;50(4):1281-1293). Even studies with the same research objective and data may report different incidences; in such cases, the AEGIS estimates often fell between the reported values.

Supplementary Figure S1: County-level age-adjusted geographical variation in incidence and mortality of major cancers in Korea.
Supplementary Figure S2: Comparison of GADM-level 2 major cancer age-standardized incidence rate from AEGIS and major cancer age-standardized incidence rate from NCC in Korea.
1. Comparison of age-standardized rate (red line) with 95% credible interval (red color) estimated from AEGIS and age-standardized rate reported from NCC (blue line) for major cancer between 2004 and 2008 in men.
2. Comparison of age-standardized rate (red line) with 95% credible interval (red color) estimated from AEGIS and age-standardized rate reported from NCC (blue line) for major cancer between 2004 and 2008 in women.
3. Comparison of age-standardized rate (red line) with 95% credible interval (red color) estimated from AEGIS and age-standardized rate reported from NCC (blue line) for major cancer between 2009 and 2013 in men.
Supplementary Figure S3: Disease mapping and clustering for regional differences in hospitalization rates due to heart-related diseases per 1,000 people in the United States from 2008 to 2010 (age and sex adjusted).
We also designed a spatial study of a variety of heart diseases for comparison with the previously published 'Interactive Atlas of Heart Disease and Stroke' (https://www.cdc.gov/dhdsp/maps/atlas/index.htm) from US sources. The target cohort was defined as the whole population in the database, and the outcome cohorts included patients hospitalized for stroke, acute myocardial infarction, cardiac dysfunction, coronary heart disease, heart failure, and heart disease, respectively, between 2008 and 2010.
Supplementary Information S1: Interactive web application
The AEGIS interactive web application supports spatial analysis design through parameters set in each function tab on the left side (red box). Users define the target cohort and outcome cohort to be analyzed from the population (green box), combine the two defined cohorts with GIS data, select analysis options (including data handling options, method, and parameters; green box), and output the analysis results (blue box). The parameters provided through the interactive interface are shown in Figure A and are described below. A brief video demonstration is also available to help researchers who are new to AEGIS (https://youtu.be/tExqsZU7qYg).
WARNINGS: The results of AEGIS depend on the healthcare database used. AEGIS expresses results in proportion to the units of analysis and supports small area estimation techniques to reduce sampling-location bias and false positives, but it does not guarantee that the results are representative nationwide. Therefore, the results of AEGIS should be interpreted in light of the geographical and medical rationale and the characteristics of the data used.
A. DB connection panel: To configure the CDM server connection, set the server address, user name, password, database management system, and CDM database schema.
B. Cohorts: Design the spatial analysis by setting user parameters: (1) select the target cohort and outcome cohort defined in ATLAS; (2) set the date range to analyze with the select windows parameter; (3) adjust for age and gender differences between regions through the age and gender adjustment parameters; (4) set the time-at-risk parameter for observing the outcome from the index date; and (5) select the country of study from the country list (total = 254 countries).
C. Disease mapping: This panel visualizes disease maps based on the settings designed in the Cohorts tab. Choose an administrative level with the administrative level parameter, and determine how to draw the disease map (count of the target cohort (n), proportion, Standardized Incidence Ratio, or Bayesian mapping). Finally, set the title of the map and the title of the legend.
D. Clustering: In this tab, researchers can choose a statistical method (Local Indicators of Spatial Association or Kulldorff's method) to detect clusters in which disease occurrences are concentrated. The results are visualized according to the selected method.
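For readers who want to see what the Local Indicators of Spatial Association option computes, the Moran's I statistics defined earlier in this supplement can be sketched in plain Python. This is an illustrative sketch, not the AEGIS code; the function names, the binary shared-boundary weight matrix, the use of $n$ (rather than $n - 1$) in the variance term, and the toy values are assumptions.

```python
def global_morans_i(x, w):
    """Global Moran's I for values x and a spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    weight_sum = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / weight_sum) * (cross / sum(d * d for d in dev))


def local_morans_i(x, w):
    """Local Moran's I for each unit area, using a variance with divisor n."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    variance = sum(d * d for d in dev) / n
    return [
        dev[i] / variance * sum(w[i][j] * dev[j] for j in range(n))
        for i in range(n)
    ]


# Toy example (invented): four unit areas in a row; neighbors share a boundary.
values = [10, 10, 0, 0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
global_i = global_morans_i(values, weights)
local_i = local_morans_i(values, weights)
```

With these definitions, the local values sum to the global statistic multiplied by the total weight, so positive local values mark areas that contribute to overall clustering.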