Exploring Spatial Trends and Influencing Factors for Gastric Cancer Based on Bayesian Statistics: A Case Study of Shanxi, China

Gastric cancer (GC) is the fourth most common type of cancer and the second leading cause of cancer-related deaths worldwide. To detect the spatial trends of GC risk based on hospital-diagnosed patients, this study presented a selection probability model and integrated it into a Bayesian spatial statistical model. The spatial pattern of GC risk in Shanxi Province in north central China was then estimated. In addition, factors influencing GC were investigated, mainly using the Bayesian Lasso model. The spatial variability of GC risk in Shanxi has the conspicuous feature of being ‘high in the south and low in the north’. The highest GC relative risk was 1.291 (95% highest posterior density: 0.789–4.002). The univariable analysis and Bayesian Lasso regression results showed that a diverse dietary structure and increased consumption of beef and cow milk were significantly (p ≤ 0.08) and with high probability (greater than 68%) negatively associated with GC risk. Pork production per capita was positively correlated with GC risk. Moreover, four geographic factors, namely, temperature, terrain, vegetation cover, and precipitation, showed significant (p < 0.05) associations with GC risk in the univariable analysis and were associated with GC risk with high probability (greater than 60%) in the Bayesian Lasso regression model.


Introduction
Gastric cancer (GC), or stomach cancer, is a serious health problem. It is the fourth most common type of cancer and the second leading cause of cancer-related deaths worldwide [1]. More than 950,000 new cases are diagnosed annually [1]. According to estimates, approximately 720,000 patients died from stomach cancer in 2012 [2]. In particular, East Asia accounts for more than half of GC cases globally [3]. Furthermore, 679,100 new cases of GC are estimated to be diagnosed in China annually [4], and approximately half of the world's GC cases occur in China [5]. The 5-year survival rate for stomach cancer is low because more than 80% of Chinese patients are diagnosed at an advanced stage [6]. An estimated 498,000 Chinese people died from GC in 2015 [4]. GC is a major contributor to the global burden of disability-adjusted life-years due to cancer in men [7]. The burden of GC is very high in Asia [8], particularly in China. The average expenditure per patient with GC is $9891 ($9606-$10,176), which is surpassed only by colorectal, oesophageal, and lung cancers; moreover, this expenditure is nearly 1.15 times the annual household income [9]. In addition, expenditures increase from stage I to stage IV GC [9]. China is a developing country with a large population, and its total national health care expenditure as a proportion of gross domestic product (GDP) is lower than that of most countries. Given the shortage of government investment, the high [...]
Of these 379 patients, 358 resided in 11 prefecture-level cities of Shanxi Province, e.g., Taiyuan, Datong, and Xinzhou, and 21 were from neighbouring provinces, e.g., Hebei, Henan, and Shaanxi. Shanxi Province is located in north central China and has four neighbouring provinces (Figure 1). It has a population of approximately 30 million. To ensure the integrity of the study region, only the 358 patients residing in Shanxi were included (281 males and 77 females; average age 63 ± 12 years). Of these 358 patients, 346 were diagnosed with gastric adenocarcinomas (GACs), five with GC with signet ring cell carcinomas, and the remaining seven with an undetermined type of GC. These GC diagnoses were histologically confirmed by professional clinical doctors at the FHSMU. Most TNM stages were III and above; only 10 patients were diagnosed as TNM stage I or II. All patients underwent chemotherapy or surgical treatment at the FHSMU. This research was approved by the institutional review boards of the FHSMU, Shanxi Province.

Determinant Variables
Based on reviews of previous research, this paper investigated four types of GC risk factors: socio-economic, dietary structure, medical condition, and geographic environment, which together include most non-genetic risk factors. Since cases were collected between 2014 and 2016, the year 2015 was defined as the baseline timepoint. The four categories of GC risk factors include 22 specific variables (Figure 2). The socio-economic influencing factor is represented by six variables: percentage of rural population (PRP), GDP per capita (GDP-PC), percentage of tertiary industry (PTI), proportion of living expenditures to disposable income per capita of urban households (PLEDI-PC-UH), proportion of living expenditures to disposable income per capita of rural households (PLEDI-PC-RH), and percentage of residents with primary education and below (PRPEB). The dietary structure influencing factor is represented by eight variables: farming-forestry-animal husbandry-fishery total value of output per capita (FFAHFTVOP-PC), wheat sown area per capita (WSA-PC), sown area of grain except for corn and wheat per capita (SAGECW-PC), pork production per capita (PP-PC), beef production per capita (BP-PC), cow milk production per capita (CMP-PC), poultry production per capita (POP-PC), and agricultural consumption of chemical fertilizers per capita (ACCF-PC). The medical condition influencing factor is represented by four variables: medical technology personnel per capita (MTP-PC), number of licensed doctors per capita (NLD-PC), number of country doctors per capita (NCD-PC), and number of hospitals per capita (NH-PC). The geographic environment influencing factor is represented by four variables: annual accumulated temperature greater than 10 degrees (AATGT10), topographic variation (TV), normalized difference vegetation index variation (NDVIV), and mean annual precipitation (MAP) from 1980-2015. These 22 variables were chosen based on data accessibility.
The first three influencing factor categories, which include 18 variables, were collected from the Shanxi statistical yearbook of 2015. The geographic environment influencing factor, which includes four variables, was provided by the Data Center for Resources and Environmental Sciences, Chinese Academy of Sciences (RESDC) (http://www.resdc.cn).


Bayesian Spatial Statistical Model Integrated with a Selection Probability Model
Because of the small sample size, the Bayesian statistical method was used in this paper. The Bayesian spatial statistical model [45,46] has been widely applied to explore the spatial trends of disease. However, the case data in this paper were collected from a single hospital, the FHSMU, so the Bayesian spatial model could not be applied directly: the observed counts reflect only those patients who chose this hospital. To correct this bias, we propose a selection probability model. Its main idea is that a patient's choice of hospital can be regarded as a stochastic process. If the probability that a patient in a given region selects the FHSMU can be determined, then the actual number of patients in that region can be estimated.
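As a minimal numerical illustration of this correction idea (a sketch, not the model fitted in this paper), suppose the selection probability for a region is already known. If each true case independently reaches the study hospital with that probability, the observed count is a thinned version of the true count, and dividing by the probability gives a simple moment estimate. The function name and the numbers below are hypothetical.

```python
def estimate_true_cases(observed_cases, selection_prob):
    """Moment estimate of the true case count under independent selection.

    Assumption: every true case in the region is observed at the study
    hospital with the same probability `selection_prob`, so that
    E[observed] = true * selection_prob.
    """
    if not 0.0 < selection_prob <= 1.0:
        raise ValueError("selection_prob must lie in (0, 1]")
    if observed_cases < 0:
        raise ValueError("observed_cases must be non-negative")
    return observed_cases / selection_prob


if __name__ == "__main__":
    # Hypothetical region: 12 cases seen at the hospital, 4% selection probability.
    print(estimate_true_cases(12, 0.04))
```

A point estimate of this kind ignores the uncertainty of the selection step, which is one reason to embed the selection probability in a full probabilistic model instead.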
For each patient, there are three options when selecting a hospital: local hospitals, hospitals in the provincial capital city, and hospitals in the capital cities of neighbouring provinces. The development level is the primary consideration when patients choose between hospitals in their own city and hospitals elsewhere. In this paper, the development level is represented by the development grade of the city. Given that a patient has chosen a hospital outside their own city, the probability of selecting Taiyuan, the provincial capital of Shanxi Province, is determined by the development level and the traffic distance. Figure 1a shows that there are four neighbouring provinces: Inner Mongolia, Shaanxi, Henan, and Hebei. Because of inconvenient traffic (Figure 1b) and the under-development of Inner Mongolia, patients in Shanxi rarely select hospitals in Inner Mongolia. Notably, although Beijing does not border Shanxi Province, Beijing's hospitals strongly attract Shanxi patients because of the high medical level of the national capital and the convenient traffic links with Beijing (Figure 1b). Therefore, the outside cities likely to be selected by Shanxi's patients are Beijing, Xi'an, Shijiazhuang, and Zhengzhou (Figure 1).
Taken together, the selection model is expressed by equation (1), in which $p_i$ represents the probability of selecting the FHSMU for each GC case in the i-th city. This probability can be decomposed into three parts: the probability of selecting outside cities, the probability of selecting Taiyuan city, and the probability of selecting the FHSMU in Taiyuan. The first two selection probabilities can be determined by the gravity model [47] and an inverse power-law function of traffic distance, following a model of individual mobility [48]. In equation (1), $G_i$, $G_{i\to}$, and $G_{TY}$ represent the development grade of the i-th city, of Beijing or the capital city of the province neighbouring the i-th city, and of Taiyuan city, respectively. $d_{i\to TY}$ is the traffic distance from the i-th city to Taiyuan city, and $d_{i\to}$ is the traffic distance from the i-th city to Beijing or the capital city of the neighbouring province. The coefficient $h$ ($h = 5$) is the number of hospitals in Taiyuan at the same level as the FHSMU; we assumed that GC patients are equally likely to select any of these five top hospitals.
In addition, the random selection process can be regarded as a repeated Bernoulli process. The Bayesian spatial model is expressed in terms of the following quantities: $y_i$ is the number of GC cases in the i-th city of Shanxi collected from the FHSMU, and $C_i$ is the bias-corrected number of GC cases in the i-th city. $N_i$ and $r_i$ are the population and GC morbidity of the i-th city in Shanxi, respectively. In formula (3), $\alpha$ represents the average level of GC risk throughout Shanxi and is assigned a flat prior. $S_i$ represents the overall spatial effect, and $\exp(S_i)$ directly quantifies the relative risk of the i-th city compared to Shanxi's overall risk level, $\exp(\alpha)$ [49]. The BYM model, named after its authors, Besag et al. [49], is a convolution of spatially structured and unstructured random effects and is assigned to the parameter $S_i$. The spatially structured effects are modelled using a conditional autoregressive (CAR) prior [50], and the unstructured effects using a normal distribution. The spatial adjacency matrix adopts the first-order "Queen" form. In the concrete form of the model, $\sigma_i^2$ is the variance of $S_i$, $\delta_i$ represents the spatial random effects, and $\varepsilon$ denotes Gaussian noise. Gaussian priors are assigned to $\delta_i$ and $\varepsilon$.
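The displayed equations of this subsection are not available in the text, so the block below is only one plausible reading assembled from the definitions given; it should not be taken as the authors' exact formulation. It assumes a binomial observation model for the repeated Bernoulli selection, a Poisson model for the corrected counts, and the standard BYM decomposition. The labels $P_i(\text{outside})$ and $P_i(\text{TY} \mid \text{outside})$, the neighbour count $n_i$, the per-city noise $\varepsilon_i$, and the variances $\sigma_\delta^2$ and $\sigma_\varepsilon^2$ are notation introduced here for illustration.

```latex
\begin{align}
p_i &= P_i(\text{outside}) \times P_i(\text{TY} \mid \text{outside}) \times \frac{1}{h},
      \qquad h = 5, \\
y_i \mid C_i &\sim \mathrm{Binomial}(C_i,\, p_i), \\
C_i &\sim \mathrm{Poisson}(N_i\, r_i), \\
\log r_i &= \alpha + S_i, \\
S_i &= \delta_i + \varepsilon_i, \qquad
\delta_i \mid \delta_{j \neq i} \sim \mathcal{N}\!\Big(\tfrac{1}{n_i}\sum_{j \sim i} \delta_j,\ \tfrac{\sigma_\delta^2}{n_i}\Big), \qquad
\varepsilon_i \sim \mathcal{N}(0,\ \sigma_\varepsilon^2).
\end{align}
```

Here $j \sim i$ runs over the first-order Queen neighbours of city $i$. Under this reading, $\exp(S_i) = r_i / \exp(\alpha)$, which matches the interpretation of $\exp(S_i)$ as a relative risk against the provincial level.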

Bayesian Lasso Regression Model
Considering the small sample size relative to the 22 factors, this paper adopted the Bayesian Lasso regression model [51,52], which can alleviate the small-sample problem to some extent. The Bayesian Lasso regression model was developed from Lasso regression. Unlike ordinary least squares (OLS), Lasso regression is a penalized least-squares method: it minimizes the residual sum of squares while controlling the L1-norm of the regression coefficient vector, with a penalty parameter $\lambda \geq 0$ that determines the amount of shrinkage. From the viewpoint of Bayesian statistics, the Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors [53]. In the Bayesian Lasso, the regression parameters are assigned a conditional Laplace prior, where $\sigma^2$ is the variance of the conditional Laplace prior of the Lasso regression coefficient vector, $\beta$, and $k$ is the number of independent variables. The likelihood of the observed data, $y$, is a normal distribution, $y \mid \beta, \lambda, \sigma \sim N(X\beta, \sigma^2 I)$, where $I$ is an identity matrix. The posterior of the regression parameter $\beta$ is then proportional to the product of this likelihood and the prior. Following Park and George's study [52], we regard $\lambda^2$, rather than $\lambda$, as the parameter. This paper also considers the class of gamma priors on $\lambda^2$. The parameter $\sigma^2$ is assigned an inverse gamma prior.
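For reference, the standard textbook forms of the quantities described in this subsection are sketched below. They follow the usual Bayesian Lasso formulation and are assumed, not confirmed, to match the expressions used in this paper; the coefficient index $j$ is notation introduced here.

```latex
\begin{align}
\hat{\beta}_{\mathrm{Lasso}} &= \arg\min_{\beta}\; (y - X\beta)^{\top}(y - X\beta)
    + \lambda \sum_{j=1}^{k} |\beta_j|, \qquad \lambda \geq 0, \\
\pi(\beta \mid \sigma^2) &= \prod_{j=1}^{k} \frac{\lambda}{2\sqrt{\sigma^2}}
    \exp\!\Big(-\frac{\lambda\, |\beta_j|}{\sqrt{\sigma^2}}\Big), \\
\pi(\beta \mid y, \lambda, \sigma^2) &\propto
    \exp\!\Big(-\frac{(y - X\beta)^{\top}(y - X\beta)}{2\sigma^2}\Big)\,
    \pi(\beta \mid \sigma^2).
\end{align}
```

Taking the negative logarithm of the last line and multiplying by $2\sigma^2$ gives the objective of the first line with penalty $2\sigma\lambda$, which is the posterior-mode interpretation of the Lasso.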
To investigate the factors influencing GC relative risk, a regression formula was employed in which $\exp(S_i)$ is the relative risk estimated by the abovementioned Bayesian spatial statistical model, $\beta_n$ is the regression coefficient corresponding to the n-th factor $X_{in}$, and the remaining term represents the Gaussian random effect. Additionally, to remove the effect of differing units, all the variables were normalized by dividing their values by the corresponding provincial average value. The Bayesian estimation in this paper was based on the Markov chain Monte Carlo (MCMC) algorithm. The Bayesian estimate of spatial variability was implemented in WinBUGS software [54], and the Bayesian Lasso regression used PyMC3 [55]. Two MCMC chains were run with different initial values. Each chain was run for 200,000 iterations: the first 150,000 were discarded as burn-in, and the remaining 50,000 were used to estimate the posterior distributions of the parameters. Two chains were used so that convergence could be assessed with the Gelman-Rubin statistic [56]; convergence is better when the statistic is closer to one. The Gelman-Rubin statistics of all parameters in this paper were between 0.9999 and 1.0001, indicating that the chains converged and that the estimates are reliable.
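The convergence check can be sketched as follows. This is a minimal implementation of the classic Gelman-Rubin potential scale reduction factor for one scalar parameter, assuming equal-length chains with burn-in already removed; it is not the routine built into WinBUGS or PyMC3, and the function name and simulated draws are hypothetical.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains.

    `chains` is a list of lists, each holding the post-burn-in draws of the
    same scalar parameter from one MCMC chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    if within == 0:
        raise ValueError("chains have zero within-chain variance")
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical chains sampling the same distribution.
    chain_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    chain_b = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    print(gelman_rubin([chain_a, chain_b]))
```

Values close to one indicate that the between-chain variance is negligible relative to the within-chain variance; chains stuck in different regions give values well above one.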

Spatial Trends
The spatial GC relative risks can be quantitatively described by the posterior median of $\exp(S_i)$, which measures the magnitude of the GC incidence in the i-th city of Shanxi relative to the provincial average incidence, $\exp(\alpha)$. If $\exp(S_i) > 1.0$, the GC incidence in the i-th city is $\exp(S_i)$ times the provincial overall GC incidence and thus higher than average; if $\exp(S_i) < 1.0$, it is correspondingly lower. Figure 3 shows the spatial GC relative risks in Shanxi estimated from the Bayesian spatial statistical model integrated with the selection probability model, based on the collected cases.
The estimated spatial distribution of GC relative risks showed a distinct feature of being 'high in the south and low in the north'. Two regions in the southeast of Shanxi, namely Jincheng and Changzhi in the south section of Taihang Mountain, had the highest GC spatial relative risks, with posterior medians of $\exp(S_i)$ of 1.291 (95% highest posterior density (95% HPD): 0.789-4.002) and 1.248 (95% HPD: 0.789-3.251), respectively. The posterior probabilities that $\exp(S_i) > 1.0$ for these two regions, denoted $P(\exp(S_i) > 1.0 \mid \text{Data})$, were 0.85 and 0.83, respectively. Yuncheng, Taiyuan, and Linfen also showed a higher spatial relative risk, with corresponding $P(\exp(S_i) > 1.0 \mid \text{Data})$ values of 0.60, 0.63, and 0.60, respectively, and posterior medians of $\exp(S_i)$ of 1.070 (95% HPD: 0.514-2.257), 1.039 (95% HPD: 0.652-2.048), and 1.038 (95% HPD: 0.607-1.744), respectively. Lvliang and Jinzhong were close to the provincial average GC risk, with posterior medians of $\exp(S_i)$ of 1.002 (95% HPD: 0.5947-1.688) and 0.9913 (95% HPD: 0.556-1.571), respectively. Yangquan and the three northern cities, Datong, Shuozhou, and Xinzhou, had lower GC spatial relative risks than the provincial average. These four cities with lower GC incidence accounted for 26.5% of Shanxi's population.
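The three summaries reported for each city (posterior median, 95% HPD, and the posterior probability that the relative risk exceeds 1.0) can all be computed directly from MCMC draws of $\exp(S_i)$. The sketch below is illustrative only: the function names are hypothetical, and the HPD is taken as the shortest interval covering the requested share of sorted draws, which is an assumption because the text does not state how the intervals were computed.

```python
import math
from statistics import median


def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains at least `mass` of the draws."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("samples must not be empty")
    k = min(n, max(1, math.ceil(mass * n)))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def exceedance_probability(samples, threshold=1.0):
    """Share of draws above `threshold`, i.e. P(value > threshold | data)."""
    return sum(s > threshold for s in samples) / len(samples)


def summarize_relative_risk(samples):
    """Posterior median, 95% HPD and P(relative risk > 1) for one city."""
    return {
        "median": median(samples),
        "hpd95": hpd_interval(samples, 0.95),
        "p_gt_1": exceedance_probability(samples, 1.0),
    }
```

Because posterior distributions of relative risks are typically right-skewed, the HPD can extend much further above the median than below it, as in the intervals reported for Jincheng and Changzhi.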

Verification of Spatial Trends
The spatial trends of GC risk in Shanxi were estimated from the Bayesian statistical model integrated with the selection probability model, based on hospital-diagnosed case data. If the estimated spatial trends were incorrect, further analysis, e.g., of influencing factors, would not be meaningful. Although the result cannot be strictly verified because recent survey data covering Shanxi Province are unavailable, it can be compared with previous studies based on GC survey data. Han and Zhao [57] investigated the spatial distribution of GC in Shanxi based on disease survey data from the late 20th century across the province. Their study showed that GC risk decreased as latitude increased, i.e., 'high in the south and low in the north', which is consistent with the result of our study. Han and Zhao also noted that regions with high GC risk were located in the south section of Taihang Mountain and the surrounding areas, particularly Changzhi and Jincheng. Our study reached the same conclusion that Changzhi and Jincheng have the highest GC risk in Shanxi. Furthermore, according to the official announcement of cancer epidemic survey data from six sampling areas collected by the Shanxi Cancer Institute in 2009-2012, Taihang Mountain and the surrounding areas have the highest GC risk in Shanxi (http://health.sina.com.cn/news/2013-02-28/105674176.shtml). In addition, Wen et al. [58], Liang et al. [59], and Wen et al. [60] all concluded that the south section of Taihang Mountain and the surrounding areas, including Changzhi and Jincheng of Shanxi, have a higher GC risk. Since these previous studies were all based on epidemiological survey data, their conclusions can be regarded as validation criteria.
In summary, our estimated spatial trends of GC risk over Shanxi coincide with results based on GC epidemiological survey data, which supports the reliability of the method used in this paper.

Univariable Analysis
The GC relative risks of the 11 cities in Shanxi Province, estimated from the Bayesian spatial model integrated with the selection probability model, were regarded as the dependent variable, and the 22 influencing factors (Figure 2) of the 11 cities as the independent variables. The associations between the GC relative risks and the 22 influencing factors were evaluated using Pearson correlation analyses. The p-values of 10 factors were less than 0.10 (Table 1), whereas the other 12 factors were not significantly (p > 0.10) associated with the GC relative risk. The absolute values of the Pearson correlation coefficients (PCCs) between the GC relative risk and these 10 factors were all greater than 0.40. In high risk regions, dietary, agricultural, and geographic environment factors had a more evident influence. In addition, the three dietary or agricultural factors, sown area of grain except for corn and wheat per capita, beef production per capita, and cow milk production per capita, were all negatively associated with GC risk. Amongst the four geographic factors (annual accumulated temperature greater than 10 degrees, topographic variation, NDVI variation, and mean annual precipitation from 1980-2015), only NDVI variation was negatively correlated with GC risk; the other three factors were positively associated with it.
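The univariable screening can be sketched as below. The code computes the Pearson correlation coefficient and the usual t statistic for testing zero correlation; with 11 cities there are 9 degrees of freedom, for which the two-sided critical value at the 0.10 level is approximately 1.833. The function names and the example data are hypothetical and do not reproduce Table 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


if __name__ == "__main__":
    T_CRIT_DF9_TWO_SIDED_10PCT = 1.833  # tabulated value for 9 degrees of freedom
    # Hypothetical normalized factor values and relative risks for 11 cities.
    factor = [0.8, 0.9, 1.1, 1.2, 0.7, 1.0, 1.3, 0.95, 1.05, 0.85, 1.15]
    risk = [1.2, 1.1, 0.9, 0.85, 1.3, 1.0, 0.8, 1.05, 0.95, 1.15, 0.9]
    r = pearson_r(factor, risk)
    t = t_statistic(r, len(factor))
    print(r, t, abs(t) > T_CRIT_DF9_TWO_SIDED_10PCT)
```

With only 11 observations, a correlation must be fairly large in absolute value before it passes this threshold, which is worth keeping in mind when interpreting factors that were screened out.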

Multivariable Regression Results
The univariable analysis cannot describe the combined and interacting effects of multiple factors, and the factors are themselves correlated (multicollinearity), as can be seen from the PCCs between the variables (Table 1). To mitigate the effect of multicollinearity, the Bayesian Lasso regression model was employed to investigate the combined effect of the 10 significant influencing factors. Table 2 lists the estimated results, including the posterior means of the regression coefficients inferred from the Bayesian Lasso regression model, the corresponding 95% HPDs, and the posterior probabilities that each regression coefficient, $\beta_n$, is greater than or less than 0. According to Bayesian hypothesis testing theory [61], one way to decide between $H_0$ and $H_1$ is to compare $P(H_0 \mid y)$ and $P(H_1 \mid y)$ and accept the hypothesis with the higher posterior probability. This is the idea behind the maximum a posteriori test.

Discussion
This paper explored the spatial variability of GC risk in Shanxi in north central China. To our knowledge, this is the first study in recent years to produce a GC disease map of a Chinese province at an urban scale. As mentioned before, disease maps are generally produced from survey data. We attempted to estimate the spatial trends of GC in Shanxi from hospital-diagnosed case data, which must be corrected for bias. In this paper, a selection probability model was presented to correct this bias, and the Bayesian statistical paradigm was used to overcome the problem of small sample size. Although there are no direct survey data from the same period to verify our results, some previous studies [57-60] based on survey data have identified the spatial distribution of GC or the high risk regions of GC in Shanxi. Encouragingly, our estimated spatial trends of GC coincided with these previous results, which supports the reliability and feasibility of our methods. Obtaining disease survey data is difficult for a variety of reasons, including that disease surveys are time-consuming and labour-intensive. Hence, we hope that this paper may contribute to mining hospital-diagnosed data not only for GC but also for other cancers, e.g., lung cancer, oesophageal cancer, and liver cancer.
The GC spatial trends can provide scientific evidence and references for the relevant government health departments to develop GC prevention policies. In clinical practice, most GC cases are diagnosed at late stages, when treatment is substantially less effective [62]. Hence, targeted prevention and early diagnosis of GC are important for reducing GC incidence and mortality and for easing the GC disease burden. Based on the spatial distribution of GC risk, the relevant departments may develop region-specific policies and make better use of limited medical resources. The spatial trends of GC risk in Shanxi Province have the conspicuous feature of being 'high in the south and low in the north', which illustrates that GC risk differs markedly between regions. This phenomenon indicates that GC incidence is related to regional factors, such as regional eating habits, local food structure [57], and geographic environment. This paper quantitatively assessed GC spatial relative risk compared to the provincial average risk level. However, future studies should be conducted with additional case samples, and the spatio-temporal trends of GC risk should also be investigated.
Based on the Bayesian estimates of GC spatial relative risk, we evaluated the factors influencing GC using univariable analyses and a multivariable regression model that can assess the combined influence of various factors. The estimated results show that all 10 influencing factors have the same direction of association (positive or negative) in the multivariable regression as in the univariable analysis (Table 1). Table 3 summarizes the correlations between GC risk and the four categories of factors, i.e., the 22 factors. Amongst the four categories, socio-economic, dietary structure, and geographic environment factors showed significant correlations with GC risk, whereas medical condition factors were not significantly related to GC risk.
Table 3. Summary of the association of risk factors and GC.
Socio-economic conditions are strongly associated with GC risk. Partially consistent with previous studies [40,41], we found evidence of associations between GC risk and several socio-economic factors. Regions with a lower percentage of tertiary industry and a greater PLEDI-PC-UH had a higher GC spatial relative risk than the provincial average. Both factors belong to the socio-economic category: a higher percentage of tertiary industry represents a more developed economy, whereas a higher PLEDI-PC-UH implies lower savings and can therefore be considered an inverse measure of residents' prosperity. Thus, the statistical analysis showed that regions with a less developed economy and less prosperity had a higher GC risk. Nevertheless, other socio-economic factors, such as the percentage of rural population, GDP per capita, PLEDI-PC-RH, and PRPEB, did not show significant associations with GC risk. Regarding education level, previous studies have shown different results. Several previous studies [40,42,43] reported an inverse relationship between GC risk and the level of education, whereas Gao et al. [34] reached the opposite conclusion. In terms of regional epidemiology, we did not find a definite relationship between GC risk and education level.
The associations between GC risk and four dietary structure factors are a specific finding of this paper that can provide a reference for governments when creating regional guidelines for the prevention of GC. Specifically, sown area of grain except for corn and wheat per capita, beef production per capita, and cow milk production per capita were negatively associated with GC risk, whereas pork production per capita was a positive influencing factor. Shanxi is known as the "Minor Coarse Cereal Kingdom" for its geographical position and climate. In particular, the sown area of minor coarse cereals in northern Shanxi, e.g., Datong, Shuozhou, and Xinzhou, is larger than that in southern Shanxi, e.g., Yuncheng and Jincheng. The larger the sown area of minor coarse cereals, the greater the sown area of grain except for corn and wheat per capita. Residents of regions where minor coarse cereals are sown over larger areas, namely the northern regions of Shanxi, have a diverse dietary structure, whereas residents of the southern regions (Yuncheng, Linfen, Jincheng, and Changzhi), which are major wheat-growing areas, have a relatively uniform dietary structure. Considering the 'high in the south and low in the north' pattern of GC risk, we conjecture that diversity in dietary structure may reduce GC risk. Moreover, a review [63] of the nutritional attributes of minor coarse cereals stated that their nutrients help reduce several types of chronic diseases, such as cancer, cardiovascular diseases, and various gastrointestinal disorders. This finding supports the inference in this paper from another perspective. Furthermore, we found a negative association between GC risk and cow milk and beef production. A possible mechanism is that milk contains several components with anticancer potential, as reported in some studies [64,65]. In addition, several previous studies [66-69] reached similar conclusions, namely, an increased risk of GC in populations that consume less milk, whereas Gao et al. [34] reported that milk intake increases the risk of GC.
No consensus has yet been reached on the association between GC risk and beef. Ward et al. [70] and Huang et al. [71] reported that increased beef consumption was associated with a high GC risk. However, Chen et al. [72] drew the opposite conclusion: in a case-control study of upper gastrointestinal cancer (including GC) based on Shanxi GC cases, they found that beef consumption can reduce GC risk, which is consistent with this paper. Consistent with a few previous studies [73,74], we found that pork production per capita was positively associated with GC risk. The mechanisms influencing GC are complex and multi-dimensional, and we argue that they take different forms in different regions; Shanxi, in north central China, displays a mechanism with its own regional characteristics. According to recent cancer survey results from 12 sampling areas of Shanxi in 2009-2012 (http://health.sina.com.cn/news/2013-02-28/105674176.shtml), GC risk is associated with dietary habits and deficient nutrient intake. The nutritive value of beef is generally regarded as higher than that of pork, which may explain the associations between GC risk and beef production per capita and pork production per capita in Shanxi. Although the geographic environment is also a crucial influencing factor for GC [13], relevant research is limited. This paper quantitatively demonstrated the associations between four geographic factors and GC risk. The results showed that all four geographic environment factors (temperature, terrain, vegetation cover, and precipitation) were related to GC risk with high probability (greater than 60%). In Shanxi, the greater the annual accumulated temperature above 10 degrees and the greater the mean annual precipitation from 1980-2015, the higher the GC risk.
This result is in accordance with Han and Zhao's [57] research based on survey GC data from the late 20th century across Shanxi. Topographic variation indicates the variability in terrain, which was positively associated with GC risk with a probability of 73%. NDVI variation indicates the diversity of vegetation cover, which was negatively correlated with GC risk with a probability of 78%. We speculate that the variability in terrain, vegetation cover, and mean annual precipitation from 1980-2015 may shape the local climate, which in turn influences the health of regional inhabitants. Understanding these concrete mechanisms requires further study.
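Probabilities such as the 73% and 78% quoted above can be read as the share of posterior draws of a regression coefficient that fall on one side of zero. The following is a minimal sketch of that calculation, assuming this interpretation; the function name `sign_probabilities` and the simulated draws are hypothetical stand-ins and are not the study's data or code.

```python
import random


def sign_probabilities(draws):
    """Estimate (P(beta > 0), P(beta < 0)) from posterior draws of a coefficient."""
    n = len(draws)
    if n == 0:
        raise ValueError("need at least one posterior draw")
    p_pos = sum(1 for b in draws if b > 0) / n
    p_neg = sum(1 for b in draws if b < 0) / n
    return p_pos, p_neg


# Hypothetical draws standing in for MCMC output of one coefficient
# (assumption: roughly normal posterior with mean 0.3 and sd 0.5).
rng = random.Random(0)
beta_draws = [rng.gauss(0.3, 0.5) for _ in range(10000)]

p_pos, p_neg = sign_probabilities(beta_draws)
print(f"P(beta > 0) = {p_pos:.2f}, P(beta < 0) = {p_neg:.2f}")
```

Under this reading, a factor is reported as positively (or negatively) associated "with high probability" when the corresponding share exceeds a chosen threshold, such as the 60% used in this paper.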

There are some limitations in our study. The patient sample size was relatively small, and the results would be more precise if additional patient data were included. Because data collection was limited, we explored the spatial variability of GC risk using the Bayesian statistical paradigm. Although 22 factors were explored in this paper, other factors, such as the regional consumption of salt and the regional production of vegetables and fruit, were not included. Assessing these factors is the objective of a future study.

Conclusions
First, this paper presented a selection probability model and integrated it into the Bayesian spatial statistical model. This method enables disease mapping from hospital-diagnosed patients. Second, the spatial trend of GC risk in Shanxi, in north central China, showed a 'high in the south and low in the north' pattern. Third, this study employed the Bayesian Lasso regression model to detect the combined effects of the 10 significant (p < 0.10) factors inferred from the univariable analysis, and none of these factors had to be removed. Fourth, this paper highlighted dietary structure and geographic environment as significant (p ≤ 0.08) factors associated with GC risk based on univariable analysis, and the Bayesian Lasso regression model showed similar correlations with high probability (greater than 60%).
Author Contributions: G.Z., S.L., and J.L. contributed to the study design and conception. J.L. and Y.W. collected and extracted patient data. G.Z., J.L., and S.L. contributed to the interpretation of the data and analysis results. G.Z. and J.L. contributed to drafting the manuscript and critically revising it. All authors contributed to the final version. All authors approved the final version to be published.