Spatiotemporal Analysis and Risk Assessment Model Research of Diabetes among People over 45 Years Old in China

Diabetes, which is a chronic disease with a high prevalence in people over 45 years old in China, is a public health issue of global concern. In order to explore the spatiotemporal patterns of diabetes among people over 45 years old in China, to find out diabetes risk factors, and to assess its risk, we used spatial autocorrelation, spatiotemporal cluster analysis, binary logistic regression, and a random forest model in this study. The results of the spatial autocorrelation analysis and the spatiotemporal clustering analysis showed that diabetes patients are mainly clustered near the Beijing–Tianjin–Hebei region, and that the prevalence of diabetes clusters is waning. Age, hypertension, dyslipidemia, and smoking history were all diabetes risk factors (p < 0.05), but the spatial heterogeneity of these factors was weak. Compared with the binary logistic regression model, the random forest model showed better accuracy in assessing diabetes risk. According to the assessment risk map generated by the random forest model, the northeast region and the Beijing–Tianjin–Hebei region are high-risk areas for diabetes.


Background
In the past few decades, diabetes has become one of the most common chronic noncommunicable diseases in both developed and developing countries [1]. It is emerging as an epidemic all over the world and seriously threatens human health [2]. It affects the quality of life of many people around the world [3], including Chinese residents. China has a large and rapidly growing elderly population. Studies have shown that diabetes may also lead to other diseases, such as metabolic-associated fatty liver disease [4][5][6][7][8]. Diabetes has become another serious health hazard, following cardiovascular and cerebrovascular diseases and tumors. Half (50.1%) of people living with diabetes do not know that they have it, which greatly increases the global disease burden [9]. According to data published by the International Diabetes Federation (IDF), the prevalence of diabetes is increasing rapidly around the world, and the prevalence in China is estimated to have reached 10.6%, with the proportion of undiagnosed diabetics as high as 51.7% [10].
Disease mapping has historically been considered one of the most important public health issues, derived from an understanding of the relationship between health and location. Understanding this relationship has been the goal of scientists and researchers for decades [11]. Geographic information systems (GIS) are a type of computer software used for capturing data, thematic mapping, updating, retrieving, structured querying, and analyzing the distribution and differentiation of various phenomena, including communicable and non-communicable diseases across the world, with reference to various periods [12]. The most important characteristic of a geographic information system is its powerful spatial
This study is based on the baseline data of the China Health and Retirement Longitudinal Study (CHARLS), which is part of a worldwide pension tracking survey. It is one of the most commonly used databases in China for studying the health of the middle-aged and older population, and it provides high-quality microdata representing households and individuals aged ≥45 years in China. Many scholars have obtained reliable research results based on CHARLS [25][26][27][28].
CHARLS aims to collect a high-quality, nationally representative sample of Chinese residents aged 45 and older to serve the needs of scientific research on the elderly. The baseline national wave of CHARLS was conducted in 2011 and includes about 10,000 households in 125 prefecture-level cities and 450 villages/resident committees. CHARLS adopts multi-stage stratified probability-proportional-to-size sampling. It is based on the Health and Retirement Study (HRS) and on related aging surveys such as the English Longitudinal Study of Aging (ELSA) and the Survey of Health, Aging and Retirement in Europe (SHARE) [29].

Diabetes Definition
Prevalence refers to the proportion of the total number of people who have the disease at a specific point in time in a given place. Diabetes was defined as: fasting glucose level ≥ 126 mg/dL (7.0 mmol/L), or 2-h glucose level ≥ 200 mg/dL (11.1 mmol/L), or on medications for high blood sugar, or self-reported diagnosis of diabetes by a physician.
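The four-part definition above can be written as a single rule. The following is a minimal sketch, not the processing code used in the study; the function name, the argument names, and the treatment of missing measurements as "criterion not met" are assumptions made for illustration.

```python
from typing import Optional

FASTING_THRESHOLD_MG_DL = 126.0   # fasting glucose, equal to 7.0 mmol/L
TWO_HOUR_THRESHOLD_MG_DL = 200.0  # 2-h glucose, equal to 11.1 mmol/L


def has_diabetes(
    fasting_mg_dl: Optional[float] = None,
    two_hour_mg_dl: Optional[float] = None,
    on_glucose_medication: bool = False,
    self_reported_diagnosis: bool = False,
) -> bool:
    """Return True if any one of the four criteria in the definition is met.

    A missing measurement (None) is treated as "criterion not met";
    this handling of missing data is an assumption of the sketch.
    """
    if fasting_mg_dl is not None and fasting_mg_dl >= FASTING_THRESHOLD_MG_DL:
        return True
    if two_hour_mg_dl is not None and two_hour_mg_dl >= TWO_HOUR_THRESHOLD_MG_DL:
        return True
    return on_glucose_medication or self_reported_diagnosis
```

Because the criteria are joined by "or", a respondent with normal glucose readings is still counted as a case when taking medication for high blood sugar or reporting a physician's diagnosis.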

Spatial Autocorrelation
Global spatial autocorrelation is commonly measured with Moran's I (Equation (1)). The classical Moran's index has been widely used in many fields, such as epidemiology, ecology, and economics [30]. In this study, the index was used to explore the overall spatial pattern of disease prevalence. A Moran's I between 0 and 1 indicates positive spatial correlation between geographical entities, and the larger the value, the stronger the spatial correlation. A Moran's I between −1 and 0 indicates negative spatial correlation, and the smaller the value, the greater the spatial difference. A value of 0 indicates no spatial correlation. The index must also pass a significance test; without one, its value cannot be interpreted.
$$I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} Z_i Z_j}{\sum_{i=1}^{n} Z_i^2} \qquad (1)$$

where $Z_i = y_i - \bar{y}$, $\bar{y}$ is the mean of the variable $y$ representing the observations under study, $n$ is the number of spatial units, $W_{ij}$ is the spatial weight between features $i$ and $j$, and $S_0$ is the sum of all the elements in the spatial weights matrix ($S_0 = \sum_i \sum_j W_{ij}$) [31].
Getis and Ord's G* assesses localized patterns of spatial association. Specifically, it can indicate regions where high values are clustered (G* > 0, hot spots) and regions where low values are clustered (G* < 0, cold spots) [32]. Local spatial autocorrelation can accurately indicate the aggregation mode of each spatial unit [33] and is generally assessed with local indicators of spatial association (LISA). LISA yields five categories: "high-high" (H-H), "low-low" (L-L), "low-high" (L-H), "high-low" (H-L), and not statistically significant [34]. The first four correspond, respectively, to high-prevalence regions surrounded by high-prevalence regions, low-prevalence regions surrounded by low-prevalence regions, low-prevalence regions surrounded by high-prevalence regions, and high-prevalence regions surrounded by low-prevalence regions.
In this study, Moran's I and LISA plots were calculated for the prevalence of diabetes in members of the Chinese population over 45 years old in 2011, 2013, 2015, and 2018. ArcGIS 10.4 software (ESRI Inc., Redlands, CA, USA) was used.
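To make the global Moran's I concrete, the sketch below computes it from a list of prevalence values and a spatial weights matrix. It is an illustration only: the study used ArcGIS, and the four-unit chain and its binary adjacency weights are made-up inputs. The significance test that the index must pass is not included.

```python
from typing import Sequence


def morans_i(y: Sequence[float], w: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for values y and an n-by-n spatial weights matrix w."""
    n = len(y)
    mean = sum(y) / n
    z = [value - mean for value in y]            # deviations from the mean
    s0 = sum(sum(row) for row in w)              # sum of all spatial weights
    cross = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in z)
    return (n / s0) * (cross / variance_sum)


# Hypothetical example: four units in a row, neighbours share a border.
CHAIN_WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth_gradient = morans_i([1.0, 2.0, 3.0, 4.0], CHAIN_WEIGHTS)  # similar neighbours
alternating = morans_i([1.0, 4.0, 1.0, 4.0], CHAIN_WEIGHTS)      # dissimilar neighbours
```

The smooth gradient gives a positive index (similar values sit next to each other), while the alternating pattern gives a negative one, matching the interpretation of the sign described above.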

Spatial Cluster Analysis
Temporal, spatial, and spatiotemporal scan statistics are now commonly used for disease cluster detection and assessment for a variety of diseases, including cancer, Creutzfeldt-Jakob disease, granulocytic ehrlichiosis, sclerosis, and diabetes. Spatial clustering analysis was performed using SaTScan software (Martin Kulldorff, Harvard Medical School, Boston and Information Management Services Inc, Calverton, MD, USA) to detect spatially clustered areas or high-risk areas of diabetes in members of the Chinese population over 45 years old. The "purely spatial analysis" and "space time analysis" were used to test whether the prevalence of diabetes was randomly distributed in space. To avoid preselection bias as described in the SaTScan User Guide (version 9.1) [35], a maximum spatial cluster size of 10% of the population at risk was used.
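For readers unfamiliar with the scan statistic, the sketch below evaluates the Poisson log-likelihood ratio for one candidate zone, which is the quantity a spatial scan maximizes over candidate zones. It is a simplified illustration rather than SaTScan's implementation: the function and parameter names are assumptions, only high-rate clusters are scored, and the Monte Carlo step used to obtain p-values is omitted. The case count inside the zone in the example is hypothetical.

```python
import math


def poisson_llr(cases_in: int, pop_in: float, cases_total: int, pop_total: float) -> float:
    """Log-likelihood ratio for a candidate zone under the Poisson model.

    Returns 0.0 when the zone does not have more cases than expected,
    because only high-rate clusters are of interest here.
    """
    expected_in = cases_total * pop_in / pop_total
    if cases_in <= expected_in:
        return 0.0
    inside = cases_in * math.log(cases_in / expected_in)
    cases_out = cases_total - cases_in
    if cases_out == 0:
        outside = 0.0  # limit of x * log(x) as x approaches 0
    else:
        outside = cases_out * math.log(cases_out / (cases_total - expected_in))
    return inside + outside


# 2018 totals are taken from the survey counts reported in this paper;
# the 180 cases inside the zone are a hypothetical value for illustration.
llr_cluster = poisson_llr(cases_in=180, pop_in=1899, cases_total=1032, pop_total=18174)
llr_no_excess = poisson_llr(cases_in=100, pop_in=1899, cases_total=1032, pop_total=18174)
```

Capping `pop_in` at 10% of `pop_total` when enumerating candidate zones corresponds to the maximum spatial cluster size used in this study.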

Binary Logistic Regression
Binary logistic regression is a generalized linear regression analysis in which the dependent variable is binary. The target probability is first logit-transformed so that, while the probability lies in (0, 1), the transformed value can be any real number, which avoids the structural defects of the linear probability model. Logistic regression can predict the probability of each category of a categorical variable. The dependent variable is categorical, and the independent variables can be interval variables, categorical variables, or a mixture of the two. The binary logistic regression model is a regression model for binary variables, as in Equation (2) [36], and it meets the modeling requirements of categorical data well. It has become a commonly used modeling method for categorical variables and has been widely used in many fields, such as medicine. We used IBM SPSS Statistics 26 software (IBM Corp., Armonk, NY, USA) with a test level of α = 0.05.

$$\operatorname{logit}(P) = \ln\frac{P}{1-P} = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n \qquad (2)$$

Let $P_i = P(Y_i = 1 \mid X_i)$ denote the conditional probability that respondent $i$ has diabetes. Under the binary logistic regression model, this probability is given by Equation (3).

$$P_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in})}} \qquad (3)$$
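The relationship between the logit, the predicted probability, and the odds ratio can be checked with a few lines of code. This is a sketch with hypothetical coefficients, not the model fitted in SPSS; the function names are assumptions.

```python
import math
from typing import Sequence


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def predict_probability(intercept: float, coefficients: Sequence[float], x: Sequence[float]) -> float:
    """Inverse logit of the linear predictor: P(Y = 1 | x)."""
    linear_predictor = intercept + sum(b * v for b, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in a predictor: exp(beta)."""
    return math.exp(coefficient)


# Hypothetical model with a single binary predictor (e.g., a yes/no risk factor).
p_exposed = predict_probability(intercept=-2.0, coefficients=[0.7], x=[1.0])
p_unexposed = predict_probability(intercept=-2.0, coefficients=[0.7], x=[0.0])
```

The difference `logit(p_exposed) - logit(p_unexposed)` recovers the coefficient, and its exponential is the odds ratio, which is how the OR values in a logistic regression table relate to the fitted coefficients.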

Geographically Weighted Regression
Geographically weighted regression (GWR) (Equation (4)) is a statistical technique used to model spatially heterogeneous processes. It is highly accurate in analyzing relationships that vary with location [37].

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{n} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \qquad (4)$$

where $(u_i, v_i)$ are the coordinates of point $i$, $\beta_k(u_i, v_i)$ is the local regression coefficient of the $k$-th independent variable $x_{ik}$ at point $i$, $\beta_0(u_i, v_i)$ is a constant term, $\varepsilon_i$ is the random error term at point $i$, and $n$ is the number of independent variables. GWR is a local modeling tool based on the optimization of global regression models, which complements the global model by providing a set of coefficients for each geographic unit to determine the spatial variability of the observations [38]. GWR was used to explore the spatial heterogeneity of risk factors in this study.

Random Forest Model
The random forest algorithm can deal with nonlinear problems, is robust to noise, and tends to avoid overfitting. Compared with the traditional multiple linear regression model, it does not require the functional form to be specified in advance and can handle complex interactions between covariates [39]. A random forest is built from decision trees, each trained on a bootstrap sample of the data; aggregating the trees' predictions is known as bagging (bootstrap aggregating). At each node, a random subset of features is considered and the split is made on the most informative of them, which improves the model's accuracy without causing overfitting. The random forest model has been widely applied to predict and assess soil moisture, shallow water level, hydrology, and environmental management. In this study, factors that were significant in the logistic regression were included as independent variables in the random forest model [40], and the presence of diabetes was set as the dependent variable. The data were divided into a training set and a test set at a ratio of 7:3. The model parameters were trained on the training set and then assessed on the test set.
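The two resampling steps mentioned above, the 7:3 partition and the bootstrap sample drawn for each tree, can be sketched as follows. The paper does not state how the split was drawn, so the random shuffle, the fixed seeds, and the function names are assumptions; this is not a full random forest implementation.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    records: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the records and cut them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def bootstrap_sample(records: Sequence[T], seed: int = 0) -> List[T]:
    """Draw len(records) items with replacement, as done for each tree in bagging."""
    rng = random.Random(seed)
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


# Hypothetical respondent identifiers stand in for the survey records.
train_ids, test_ids = split_train_test(list(range(100)))
tree_sample = bootstrap_sample(train_ids, seed=1)
```

Each tree sees a different bootstrap sample of the training set, while the held-out 30% is used only for assessing the final model.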

Statistical Analysis and Spatial Distribution
In 2011, a total of 20,525 samples were included, including 1088 cases, with a prevalence of 5.30%. In 2013, a total of 20,525 samples were included, including 1333 cases, with a prevalence of 6.49%. In 2015, a total of 20,525 samples were included, including 1766 cases, with a prevalence of 8.60%. In 2018, a total of 18,174 samples were included, including 1032 cases, with a prevalence of 5.68%.
As shown in Figure 1, the highest prevalence of diabetes was in 2015. The overall prevalence of the respondents was 8.60%, of which, the prevalence of male respondents was 7.44% and the prevalence of female respondents was 9.74%; the lowest prevalence of diabetes was in 2011, when the overall prevalence of the respondents was 5.30%, of which, the prevalence of male respondents was 4.68% and the prevalence of female respondents was 5.91%. In addition, the survey data showed that the prevalence of female respondents was higher than that of male respondents.
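The yearly prevalence figures follow directly from the case counts and sample sizes reported above. The short check below recomputes them; the variable and function names are chosen for illustration.

```python
# (cases, sample size) per survey wave, as reported in the text
SURVEY_COUNTS = {
    2011: (1088, 20525),
    2013: (1333, 20525),
    2015: (1766, 20525),
    2018: (1032, 18174),
}


def prevalence_percent(cases: int, sample_size: int) -> float:
    """Prevalence as a percentage of the sample."""
    return 100.0 * cases / sample_size


PREVALENCE = {
    year: round(prevalence_percent(cases, size), 2)
    for year, (cases, size) in SURVEY_COUNTS.items()
}
peak_year = max(PREVALENCE, key=PREVALENCE.get)
```

The recomputed values agree with the reported percentages, and the peak falls in 2015.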

The survey respondents were stratified by age group, as shown in Figures 2-5. The age group with the lowest prevalence was 45 to 49 years old, the age groups with the highest prevalence were 60 to 64 and 65 to 69 years old, and prevalence was higher among female respondents than among male respondents in almost every age group.
In 2011, the prevalence of diabetes in the respondents was between 0.00% and 14.04%, and the prefecture-level cities with higher prevalence were mainly located in the northeast region and the Beijing-Tianjin-Hebei region.
In 2013, the prevalence of diabetes in the respondents was between 0.00% and 14.74%, and the prefecture-level cities with higher prevalence were mainly located in the central region, the northeast region, and the Beijing-Tianjin-Hebei region. In 2015, the prevalence of diabetes in the respondents was between 1.55% and 22.36%, and the prefecture-level cities with high prevalence were mainly located in the Beijing-Tianjin-Hebei region. In 2018, the prevalence of diabetes in the respondents was between 0.00% and 14.50%, and prefecture-level cities with high prevalence were distributed in the central region and the northeast region. The prevalence of diabetes is generally higher in the north than in the south, and higher in coastal areas than inland [18].

Spatial Autocorrelation Analysis
Hotspot

Analysis of Time and Space
We used SaTScan to conduct a purely spatial analysis of the 2018 respondents in order to locate the spatial clusters of diabetes. A Poisson model was used, with the maximum spatial cluster size set to 10% of the population at risk. The results showed that the most likely cluster was centered on Cangzhou, Hebei Province. Ten cities fell within this cluster (Cangzhou, Tianjin, Dezhou, Baoding, Binzhou, Beijing, Jinan, Shijiazhuang, Liaocheng, Weifang) (Table 2 and Figure 11), with 1899 respondents at risk.

Figure 11. Three clusters were detected by purely spatial analysis.
To explore whether diabetes had clustering characteristics in space and time, a spatiotemporal analysis of the respondents in 2011, 2013, 2015, and 2018 was performed using SaTScan, with a maximum cluster size of 10% of the population at risk. The results showed that the most likely cluster was centered on Dezhou, Shandong Province. Ten cities fell within this cluster (Dezhou, Cangzhou, Jinan, Liaocheng, Binzhou, Shijiazhuang, Baoding, Tianjin, Puyang, Anyang) (Table 3 and Figure 12), with 1931 respondents at risk.

Binary Logistic Regression
In order to explore the factors that affect the occurrence of diabetes and assess the risk of diabetes, binary logistic regression was used for exploration based on the baseline data of 2018. The initial assignment of variables is shown in Table 4.

Table 5 shows the results of the single-factor chi-square tests for age, location of residential address, education, hypertension, dyslipidemia, cancer, liver disease, smoking history, and alcohol use. A total of nine factors passed the chi-square test (p < 0.05) and could be included in the binary logistic regression.
The binary logistic regression took diabetes as the dependent variable and age, location of residential address, education, hypertension, dyslipidemia, cancer, liver disease, kidney disease, smoking history, and alcohol use as independent variables. The p-value of the Hosmer-Lemeshow test was 0.889 (greater than 0.05), indicating that the model fit the data well and that there was no significant difference between the predicted and observed values. Meanwhile, the Omnibus test indicated that the model was statistically significant (p < 0.05). The established binary logistic regression can be expressed as Equation (5), according to Table 6.
The results showed that the occurrence of diabetes was significantly associated with age, hypertension, dyslipidemia, kidney disease, and smoking history. The risk was higher in the 60-64 age group than in other age groups (OR = 1.635, p < 0.001). Patients with hypertension had a significantly higher risk of diabetes than those with other chronic diseases (OR = 2.004, p < 0.001). The highest risk was associated with dyslipidemia (OR = 3.598, p < 0.001).
Figure 13 shows the local R² obtained with GWR (AICc = 640.402523, R² = 0.621877, adjusted R² = 0.609018). Global spatial autocorrelation analysis indicated that the GWR residuals were randomly distributed in space (p = 0.233661). Table 7 shows the statistics of the local coefficients, illustrating that none of the factors exhibited significant spatial heterogeneity.

Disease Risk Assessment
Based on the binary logistic regression, we chose age, hypertension, dyslipidemia, cancer, heart attack, stroke, kidney disease, smoking history, and alcohol use as independent variables and diabetes as the dependent variable to establish the binary logistic model and the random forest model. The area under the ROC curve (AUC) was used to evaluate the assessment models. To verify whether each model's assessed risk is consistent with the actual prevalence of diabetes, ArcGIS 10.4 was used to map the actual diabetes prevalence and the assessed diabetes risk (Figure 14). The high-risk areas are mainly located in the Beijing-Tianjin-Hebei region and the northeast region. The random forest model's assessment results are consistent with the actual prevalence, while the binary logistic regression model's results differ markedly from it. Meanwhile, according to the ROC curves (Figures 15 and 16), the random forest model (AUC = 0.7745) was more accurate than the binary logistic model (AUC = 0.6677). However, the random forest model cannot explain the direction of effect of the independent variables or the relative risk associated with each factor, whereas binary logistic regression describes the model and its variables explicitly.
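The AUC used to compare the two models has a simple interpretation: it is the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case. The sketch below computes it by direct pairwise comparison. It is an illustration with made-up labels and scores, not the evaluation code used in the study, and the quadratic loop is only suitable for small samples.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test-set labels (1 = diabetes) and model risk scores.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values of 0.7745 and 0.6677 both indicate better-than-chance discrimination, with the random forest ranking cases more reliably.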



Innovation in This Study
Traditional data analysis methods do not easily handle interactions between independent variables, whereas the random forest algorithm, an emerging machine learning algorithm, copes well with multicollinearity. It is therefore widely used in the assessment of disease risk. Using a random forest model to establish a concise and accurate diabetes risk assessment model is an innovative way to assess the risk of diabetes among people over 45 years old in China. The dataset does not always contain complete information, the distribution between positive and negative classes is mostly imbalanced, and some parameters are of low importance for the decision class; the random forest model performed better under these conditions. We used the random forest model to produce our diabetes risk assessment map, compared it with the assessment results of logistic regression, and found that the random forest result was consistent with the actual prevalence. Thus, we conclude that the random forest model can achieve greater accuracy in assessing diabetes risk [41]. However, binary logistic regression can explain diabetes risk factors intuitively, which the random forest model cannot. The advantages of the two models should be combined in practical applications so that both contribute to disease risk assessment.

Scale Effect
Selecting different observation and analysis scales results in the detection of different phenomena. This is known as the scale effect [42]. We took this into consideration when conducting our research. Our preliminary experiments showed that the spatial patterns obtained at the prefecture-level city scale and at the provincial scale are basically the same. Therefore, in order to obtain more detailed spatial patterns, our spatiotemporal analysis was based on the prefecture-level city scale.

Spatiotemporal Characteristic of Diabetes Prevalence
Diabetes prevalence remains high in China. According to the report from the International Diabetes Federation, diabetes prevalence in China increased from 8.8% in 2011 to 10.9% in 2018 in adults aged 20-79 years. In the study area where the sample is located, the range of diabetes prevalence among people over 45 years old across prefecture-level cities changed from 0.00%-14.04% in 2011 to 0.00%-14.50% in 2018.
A significant Moran's I test indicates the presence of spatial autocorrelation, and Getis and Ord's G* can then identify the hot or cold spot areas. Identifying hot spots for diseases is important for public health authorities, who should use them for better-targeted interventions [43]. Local indicators of spatial association (LISA) in a GIS environment are very helpful for determining the spatial patterns of a disease. LISA is a set of methods used to describe and visualize spatial distributions, identify atypical locations or spatial outliers, determine patterns of spatial association, clusters, or hot spots, and propose spatial regimes or other forms of spatial heterogeneity [44].
In 2011, 2013, 2015, and 2018, the Moran's I coefficient of diabetes prevalence in China was between 0.025585 and 0.104485, and showed non-random spatial distribution. Getis and Ord's G* showed that hot spots are mostly found in the eastern and central regions, while cold spots are more common in southern regions. Local Spatial Autocorrelation analysis found that the High-High distribution pattern of diabetes is mainly found in cities close to the Beijing-Tianjin-Hebei region.
We also found that the spatial distribution of diabetes was clustered but that the tendency to cluster is waning: Moran's I decreased from 0.103458 in 2011 to 0.025585 in 2018, and the hot and cold spot areas also decreased conspicuously. Many areas also showed no significant High-High or Low-Low distribution.
The spatial scan statistic is a useful and widely used tool for detecting spatial or space-time clusters in disease surveillance. The SaTScan software, which is available for free, makes this method easily accessible to researchers [45]. We used SaTScan to locate the spatial clustering areas of diabetes and to explore whether diabetes had clustering characteristics in space and time.
Spatiotemporal clustering areas were detected by SaTScan software and they were located near the Beijing-Tianjin-Hebei region.
Therefore, diabetes prevalence has obvious spatial distribution characteristics in the population over 45 years old in China: it is higher in the north than in the south, higher on the coast than inland, and higher in economically developed areas than in economically underdeveloped areas. The specific reasons for these patterns need further research, but they are likely related to differences in eating habits and lifestyle changes caused by economic development, and to glycemic control, which varies greatly across geographic regions [46,47].

Diabetes Risk Factors
Binary logistic regression is often used to explore diabetes risk factors [48,49]. Binary logistic regression analysis showed that age, hypertension, dyslipidemia, and smoking history were all diabetes risk factors in this study.
In China, diabetes poses a severe threat to the population. Age is a major risk factor for diabetes [50]. In this study, diabetes risk increased significantly with age, especially after the age of 55. Therefore, middle-aged and elderly residents in China should pay close attention to their health so as not to miss the best time for treatment.
In addition, compared with other chronic diseases, hypertension and dyslipidemia are more likely to lead to diabetes, and diabetes is in turn likely to lead to hypertension or dyslipidemia [51][52][53]. As the main components of metabolic syndrome, diabetes, hyperglycemia, and hyperlipidemia interconnect and influence each other, forming a complex framework of chronic diseases [54]. As the disease's course lengthens, the patient's immune function becomes increasingly abnormal and the function of many body systems is weakened, so multiple diseases are prone to occur [55][56][57][58].
A growing number of studies show that smoking significantly increases the risk of diabetes [59]. In addition, diabetes patients with a history of smoking are reported to be at especially increased risk of infection with, and poor outcomes from, severe acute respiratory syndrome coronavirus [60]. China is one of the countries with the largest number of tobacco consumers in the world [61,62], which may be one of the reasons for the high prevalence of diabetes, and of other chronic diseases, in China.

Spatial Heterogeneity of Diabetes Risk Factors
A GWR model is a simple and effective technique for dealing with spatial heterogeneity. Unlike traditional multiple linear regression, GWR lets regression parameters vary across space [63]. A GWR model was used to explore the spatial heterogeneity of diabetes risk factors. However, the results showed no obvious spatial heterogeneity in the four risk factors (age, hypertension, dyslipidemia, and smoking history). This might be because this study did not incorporate socioeconomic and environmental factors into the model [64,65].

Limitations and Future Research
There are still some deficiencies in this research. For example, environmental factors, which are closely related to the prevalence of diabetes, were not considered. In addition, our approach to spatiotemporal analysis was still traditional, and too few factors were included in the model. There is also room for improvement in the accuracy of the model, and we are trying to add other classification algorithms to our research. We will continue to advance this research, and we believe that it will provide accurate data support for improving the living conditions of people over 45 years old in China.

Conclusions
First, in this paper, spatial autocorrelation and spatiotemporal clustering analysis were used to analyze the spatial distribution characteristics of diabetes. Second, we used the binary logistic regression model to explore the risk factors of diabetes in detail. Finally, the logistic regression model and the random forest model were used to assess the risk of diabetes in people over 45 years old in China. The results showed that the clustering areas of patients with diabetes were mainly in the Beijing-Tianjin-Hebei region. The tendency of diabetes prevalence among people over 45 years old in China to cluster is waning. Age, hypertension, dyslipidemia, and smoking history all had effects on diabetes, but the spatial heterogeneity of these factors was weak. Compared with the binary logistic model, the random forest model showed better fitness in assessing diabetes risk, and showed that the high-risk regions are the northeast region and the Beijing-Tianjin-Hebei region. Therefore, our method can analyze the spatial distribution characteristics and influencing factors of diabetes, but there is still room for improvement in the accuracy of assessing the risk of diabetes. We will continue to follow up on this study after the CHARLS data are updated, and we will explore better methods in subsequent research.
Institutional Review Board Statement: We are using a secondary dataset. It has been procured from a government agency, and they have followed all the ethical protocols in collecting data.