A Fine-Grained Simulation Study on the Incidence Rate of Dysentery in Chongqing, China

Abstract: Dysentery is still a serious global public health problem. In Chongqing, China, there were 37,140 reported cases of dysentery from 2015 to 2021. However, previous research has relied on statistical dysentery incidence rate data aggregated by administrative region, and fine-grained gridded products are lacking. To fill this gap, an initialized gradient-boosted decision trees (IGBDT) hybrid machine learning model was constructed. Socioeconomic, meteorological, topographic, and air quality factors were used as inputs to the IGBDT to map the statistical dysentery incidence rate data of Chongqing, China, from 2015 to 2021 onto the grid scale, generating fine-grained (1 km) dysentery incidence rate products. The products were evaluated against the total incidence of Chongqing and of its districts, with resulting R² values of 0.7369 and 0.5439, respectively, indicating suitable prediction performance. The importance of the factors related to the dysentery incidence rate, and their correlations with it, were then investigated. The results showed that socioeconomic factors had the main impact (43.32%) on the dysentery incidence rate, followed by meteorological factors (33.47%). Nighttime light, the normalized difference vegetation index, and maximum temperature showed negative correlations, while population, minimum and mean temperature, precipitation, and relative humidity showed positive correlations. The impacts of topographic factors and air quality factors were relatively weak.


Introduction
Dysentery is a water- and food-borne infectious disease [1]. It is an intestinal infectious disease transmitted through contaminated water, food, or human contact, and it includes bacillary dysentery and amebic dysentery [1,2]. Dysentery is also a global public health problem with high contagiousness and complex transmission, affecting people of all ages, especially in developing countries [3-5]. According to the World Health Organization (https://www.who.int, accessed on 2 November 2023), diarrhoeal disease is the second leading cause of death among children under the age of five. Approximately 525,000 children under the age of five die from diarrhoeal disease each year, and there are approximately 1.7 billion cases of childhood diarrhoeal disease worldwide every year. Dysentery is a significant subtype of diarrhoeal disease. In China, dysentery is still classified as a Class A or B legally notifiable infectious disease. According to the 2021 report on legally notifiable diseases by the National Health Commission of China (http://www.nhc.gov.cn/jkj/, accessed on 8 July 2023), the number of reported cases of dysentery was 50,403, with a national incidence rate of 3.5752 per 100,000 population (1/10^5).
The United Nations introduced the Sustainable Development Goals in 2015, and the "Good Health and Well-being" goal includes a clear aim to eliminate epidemic diseases by 2030. Given that dysentery remains a prevalent infectious disease worldwide, it is crucial to study and analyze the spatial and temporal clustering areas, development trends, and related factors of dysentery. There are many studies related to dysentery, covering its spatial and temporal distribution [5-9], the prediction of its incidence [10], and related influencing factors [3,11-19], but these studies were based on statistical dysentery incidence rate data for administrative regions. Although such statistical data can intuitively represent geographical phenomena and spatial distributions and are convenient for spatial and statistical analysis, they are limited by the use of predefined statistical units [20]. As such, they cannot express the situation within units in a fine-grained manner and cannot support more detailed research. Grid data can provide this required detail [21,22] and have the advantage of integrating various related factors, such as topographic and meteorological factors. Therefore, there is a demand for mapping dysentery incidence rate data to obtain a fine-grained distribution of dysentery incidence rates.
To obtain the fine-grained distribution of dysentery incidence rates, the factors related to dysentery incidence should first be identified. These can be broadly summarized as socioeconomic factors [1-3,15,23,24], such as population and economic development status; meteorological factors [13,14,16-19,24], such as precipitation and temperature; topographic factors [11,15], such as slope and elevation differences; and air quality factors [25,26], such as PM2.5 and PM10. This study therefore uses these four groups of factors as covariates. However, these factors have strong nonlinear relationships with the dysentery incidence rate, so machine learning methods are a suitable approach for mapping the dysentery incidence rate to the grid scale.
To date, some scholars have used machine learning methods to map statistical data to the grid scale, such as using a random forest (RF) model to map population census data to a 1 km grid scale [21], using an ensemble approach for fine-scale dynamic population mapping [27], using a convolutional neural network to map gross domestic product (GDP) statistics to a grid scale [22], and using an RF model to map NO2 concentration data to a grid scale [28]. Compared to traditional multiple linear regression methods, machine learning algorithms such as RF, gradient-boosted decision trees (GBDT), and neural networks are better at capturing the nonlinear relationships between variables. The initialized gradient-boosted decision trees (IGBDT) model [29] is a hybrid machine learning model that combines RF [30] and GBDT [31,32], thereby integrating the strengths of both algorithms. The model is also interpretable, allowing the impact of each feature on the target variable to be explained. These advantages make IGBDT highly generalizable and capable of capturing nonlinear relationships, thus allowing for more accurate predictions of the dysentery incidence rate.
The main contributions of this paper lie in two aspects. First, an IGBDT hybrid machine learning model is used to downscale the annual dysentery incidence rate data to a 1 km grid scale. Second, the importance levels of the various factors affecting the incidence rate of dysentery, and their correlations with it, are revealed.
The remainder of this article is organized as follows. Section 2 introduces the data sources, data processing, and integration process. Section 3 describes the workflow of this study and the IGBDT model. Section 4 presents the evaluation results of the constructed IGBDT model, describes the 1 km dysentery incidence rate products of Chongqing generated by the model, and discusses the importance levels and correlations of the relevant factors. Section 5 compares the IGBDT model with other models and compares the importance levels and correlations obtained in this study with those of other studies. Finally, the contributions, limitations, and future prospects of this study are summarized.

Study Area
The study area considered in this work is Chongqing, a municipality located in southwestern China and one of the four municipalities directly under the central government of China. Chongqing has 38 districts, covers an area of 82,400 km², and had a total population of approximately 32.12 million in 2021. Chongqing is located between the middle reaches of the Yangtze River and the Sichuan Basin, with a landscape consisting of mountains, hills, and valleys. The terrain is characterized by large differences in elevation, with elevations ranging from 73.1 m to 2723. Chongqing, China, ranked fourth in terms of the mean incidence rate of dysentery from 2015 to 2021, with a rate of 16.91 cases per 100,000 people. During this period, there were a total of 37,140 reported cases of dysentery in Chongqing. In addition, the overall incidence rate of dysentery in Chongqing was higher than the national average during this period (Figure 1c). Therefore, given Chongqing's unique geographical environment and notably high incidence rate of dysentery, it is important to investigate the dysentery incidence rate in Chongqing. Figure 1b also shows that the incidence rate of dysentery is relatively high in the areas surrounding the main city of Chongqing and that this high-incidence area extends northeastward. Moreover, Chengkou, located in the northernmost part of Chongqing, is also a high-risk area.

Materials Sources and Processing
In this study, the incidence rates of dysentery and auxiliary data on socioeconomic, meteorological, topographic, and air quality factors in Chongqing from 2015 to 2021 were collected. The details are presented in Table 1. The incidence rate data of dysentery for all 38 districts in Chongqing from 2015 to 2021 were obtained from the website of the Chongqing Municipal Health Commission (https://wsjkw.cq.gov.cn/, accessed on 8 July 2023). In total, we collected 266 samples of incidence rate data.
The population raster dataset covering the years 2015 to 2021 was obtained from the open-source LandScan Global population database, which is maintained by the Oak Ridge National Laboratory (https://landscan.ornl.gov/, accessed on 8 July 2023). The spatial resolution of the raw data is 0.01745 arc-degrees.
Monthly normalized difference vegetation index (NDVI) datasets were obtained from the National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://www.geodata.cn, accessed on 8 July 2023). Their original resolution is approximately 0.01745 arc-degrees.
Monthly precipitation datasets [38] covering China from 2015 to 2021 were obtained from the National Earth System Science Data Center, National Science & Technology Infrastructure of China. Their original resolution is approximately 0.01745 arc-degrees.
A monthly relative humidity gridded dataset covering the period from 2015 to 2020 with a spatial resolution of 1 km was collected from the National Earth System Science Data Center, National Science & Technology Infrastructure of China. Because relative humidity raster data were not available for 2021, air temperature and dew point temperature data from Chinese regional meteorological stations were collected from the FTP server (ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/, accessed on 4 July 2023) published by the National Climatic Data Center (NCDC). The relative humidity was calculated from the air temperature and dew point temperature using Equation (1) [39]. Afterward, kriging interpolation was performed on the station-level relative humidity values to obtain raster data with a resolution of 1 km.
$$RH = \frac{e^{(17.62 \times T_d)/(T_d + 243.12)}}{e^{(17.62 \times T)/(T + 243.12)}} \times 100\% \qquad (1)$$
where RH represents relative humidity, T_d represents the dew point temperature, and T represents the air temperature.
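As a rough illustration of how Equation (1) could be applied to a single station record, the sketch below computes relative humidity from one pair of temperatures. It assumes both temperatures are in degrees Celsius, which is the usual convention for the Magnus coefficients 17.62 and 243.12, and it assumes the result is expressed as a percentage. The function name `relative_humidity` is hypothetical.

```python
import math


def relative_humidity(t_air_c: float, t_dew_c: float) -> float:
    """Relative humidity (%) from air and dew point temperature.

    Assumption: both inputs are in degrees Celsius.
    """
    a, b = 17.62, 243.12  # coefficients as they appear in Equation (1)
    e_dew = math.exp(a * t_dew_c / (b + t_dew_c))  # numerator term (dew point)
    e_air = math.exp(a * t_air_c / (b + t_air_c))  # denominator term (air temperature)
    return 100.0 * e_dew / e_air


# When the dew point equals the air temperature, the air is saturated.
print(round(relative_humidity(25.0, 25.0), 1))  # 100.0
# A dew point below the air temperature gives a value below 100.
print(relative_humidity(25.0, 15.0) < 100.0)
```

Station values computed this way would then be the inputs to the kriging interpolation step, which is not sketched here.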

Topographic Factor Data
The 12.5 m digital elevation model (DEM) data for Chongqing were obtained from the ALOS PALSAR Dataset of the Earth Science Data Systems (ESDS) (https://search.asf.alaska.edu/, accessed on 4 July 2023). These data were resampled from the original resolution of 12.5 m to 1 km, and the surface slope was calculated from the resampled data.

Air Quality Factor Data
Yearly PM2.5 and PM10 datasets covering China from 2015 to 2021 at a 1 km resolution were obtained from the National Earth System Science Data Center, National Science & Technology Infrastructure of China. The original resolution of these data is approximately 0.01745 arc-degrees.

Data Integration
To ensure the smooth operation of the model, the format of all covariates should be standardized. Therefore, the nighttime light (NTL) data were used as the basis to resample and correct all other grid data, including the DEM, slope, NDVI, PM2.5, PM10, precipitation, temperature, and relative humidity data, so that all data had the same spatial resolution (1 km), spatial extent, and row/column numbers. The NDVI, temperature, relative humidity, and precipitation data were all monthly data. However, this study focused on an annual timescale, so these data needed to be aggregated on a yearly basis. Specifically, the maximum NDVI value across the 12 months was taken as the annual NDVI value, the monthly precipitation values were summed to obtain the annual cumulative precipitation, and the mean relative humidity and mean temperature across the 12 months were taken as the annual relative humidity and temperature values, respectively. This aggregation made it possible to analyze the annual trends and patterns in these environmental variables.
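The aggregation rules described above can be sketched for a single pixel as follows. This is only an illustration: the function name `annual_values`, the dictionary keys, and all sample numbers are hypothetical, and a real workflow would apply the same rules to whole raster arrays rather than to one pixel at a time.

```python
from statistics import mean


def annual_values(monthly: dict) -> dict:
    """Collapse 12 monthly values per variable into one annual value for a pixel."""
    for name, values in monthly.items():
        if len(values) != 12:
            raise ValueError(f"{name}: expected 12 monthly values, got {len(values)}")
    return {
        "ndvi": max(monthly["ndvi"]),                            # annual maximum
        "precipitation": sum(monthly["precipitation"]),          # annual cumulative total
        "relative_humidity": mean(monthly["relative_humidity"]),  # annual mean
        "temperature": mean(monthly["temperature"]),             # annual mean
    }


# Made-up monthly values for one pixel, for illustration only.
pixel = {
    "ndvi": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2],
    "precipitation": [20, 25, 45, 90, 150, 180, 170, 130, 110, 90, 50, 25],
    "relative_humidity": [70] * 12,
    "temperature": [8, 10, 14, 18, 22, 26, 29, 29, 24, 18, 13, 9],
}
annual = annual_values(pixel)
print(annual)
```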

Methods
To explore the nonlinear relationships between the dysentery incidence rate and the covariates, an IGBDT model was established to map the incidence rate of dysentery to a 1 km grid scale. First, the mean values of the covariates in each district (mean NTL, mean NDVI, mean population, mean DEM, mean slope, mean cumulative precipitation, mean temperature, annual maximum temperature, annual minimum temperature, mean relative humidity, and mean PM2.5 and PM10) were used as the explanatory variables, and the annual incidence rate of dysentery in each district was used as the response variable to train the model. Because the annual dysentery incidence rate differs widely among districts, a logarithmic transformation was applied to make the distribution of the data more symmetrical and to reduce the impact of extreme values on the model. Then, the covariate data of Chongqing from 2015 to 2021 were input into the IGBDT model pixel by pixel to predict the incidence rate of dysentery at a 1 km resolution, with the anti-logarithm applied to the model output. The specific workflow is shown in Figure 2.
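The transform and its inverse can be written compactly. The notation below is illustrative and not taken from the source: y_{d,t} denotes the reported rate in district d and year t, x denotes a covariate vector, and \hat{f} denotes the fitted IGBDT. The natural logarithm is assumed, since the base is not stated.

```latex
\begin{aligned}
z_{d,t} &= \ln y_{d,t}
  && \text{(training target for district } d \text{, year } t\text{)} \\
\hat{y}_{p,t} &= \exp\!\big(\hat{f}(\mathbf{x}_{p,t})\big)
  && \text{(predicted rate at pixel } p \text{, year } t\text{)}
\end{aligned}
```

Note that the logarithm is undefined for a rate of zero; how such cases were handled, if any occurred, is not described.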
IGBDT is a method in which the predictions of a trained RF are used to initialize the GBDT, and the boosted learners are then added to the initial RF results. The algorithm proceeds as follows. First, an RF is trained on the training data to obtain the initial predictions. Then, the residuals of the initial predictions are calculated and used as the target variables to train the GBDT. In each iteration of the GBDT, a new decision tree is trained to fit the current residuals, and its output is added to the running prediction, which starts from the initial RF results. This process is repeated until the desired number of iterations is reached or until the residuals are no longer significantly reduced. The specific algorithm is shown in Steps 1-3.
Step 1. Initialize the weak learner f_0(x), where L denotes the loss function.
Step 2. Perform the iterations for m ∈ {1, 2, ..., M}, where M is the number of weak learners:
(a) Calculate the negative gradient of L for each sample (x_i, y_i) (i ∈ {1, 2, ..., n}). The obtained residual r_im is taken as the new true value of the sample, and (x_i, r_im) (i ∈ {1, 2, ..., n}) is used as the training data of the next tree. A new regression tree f_m(x) is then obtained along with its corresponding leaf regions R_jm (j ∈ {1, 2, ..., J}), where J is the number of leaf nodes in the regression tree.
(b) Calculate the best-fitting value for each leaf region R_jm.
(c) Update the strong learner, using an indicator function I: if the sample falls in the leaf node, then I = 1; otherwise, I = 0.
Step 3. Obtain the final learner.
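The steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: it uses a single feature, a squared-error loss (so the negative gradient is simply the residual), two-leaf trees (J = 2), and a shrinkage factor `learning_rate` that the steps above do not mention. The initial predictor `rf_stand_in` is a hypothetical placeholder for the prediction function of a trained RF.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree (J = 2) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # Step 2(b): for squared error, the best leaf value is the leaf mean.
        c_left, c_right = mean(left), mean(right)
        sse = (sum((r - c_left) ** 2 for r in left)
               + sum((r - c_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, c_left, c_right)
    if best is None:  # all x values identical: fall back to a constant
        constant = mean(residuals)
        return lambda x: constant
    _, threshold, c_left, c_right = best
    return lambda x: c_left if x <= threshold else c_right


def boost_from_initial(xs, ys, init_predict, n_trees=50, learning_rate=0.1):
    """Boost regression stumps on top of an initial predictor."""
    trees = []
    preds = [init_predict(x) for x in xs]               # Step 1: initial learner
    for _ in range(n_trees):                            # Step 2: m = 1, ..., M
        residuals = [y - p for y, p in zip(ys, preds)]  # 2(a): negative gradient
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x)            # 2(c): update the learner
                 for p, x in zip(preds, xs)]

    def predict(x):                                     # Step 3: final learner
        return init_predict(x) + learning_rate * sum(t(x) for t in trees)

    return predict


# Toy data; rf_stand_in is a hypothetical placeholder for a trained RF.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
rf_stand_in = lambda x: 9.0 * x - 12.0
model = boost_from_initial(xs, ys, rf_stand_in)


def mse(predict):
    return mean((y - predict(x)) ** 2 for x, y in zip(xs, ys))


print(mse(rf_stand_in), mse(model))
```

In scikit-learn, `GradientBoostingRegressor` accepts an `init` estimator, so passing a `RandomForestRegressor` there would be one way to obtain a comparable RF-initialized model. Whether the study used that route is not stated.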

Dysentery Incidence Rate Grained Scale Product (1 km)
The IGBDT model was used to map the incidence rate of dysentery in Chongqing to the pixel level (1 km). To reflect the spatial distribution of the incidence rate in the 1 km products, the mean incidence rate from 2015 to 2021 was calculated, as shown in Figure 4. The spatial distribution of the mean incidence rate was consistent with the distribution trend in 2019. Using this product, it is possible to understand the spatial distribution of dysentery incidence in the regions of Chongqing at a relatively fine scale, not only at the scale of statistical units. The incidence rate of dysentery is relatively high in the main urban area of Chongqing and in Chengkou, which is located in the northernmost part of Chongqing. High dysentery incidence rates can also be observed in the northwestern and southwestern parts of Chongqing. Additionally, there is a clear zone with high dysentery incidence rates along the Yangtze River basin.
A regression evaluation was carried out between the population data multiplied by the simulated incidence rates and the actual incidence data in Chongqing, as presented in Figure 5. Figure 5a illustrates the regression of the actual and simulated incidence numbers for the entire city from 2015 to 2021, resulting in a correlation coefficient of 0.7360. Figure 5b displays the regression of the actual and simulated incidence numbers for each district in the city from 2015 to 2021, with a correlation coefficient of 0.5439. These results suggest that the model is capable of predicting the incidence rate to a certain extent and can capture some of the factors influencing dysentery incidence.
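One way to read this evaluation is sketched below: pixel-level rates are multiplied by pixel populations and summed to a simulated case count, which is then regressed against the reported count. Summing over the pixels of a unit and using the R² of an ordinary least-squares line are assumptions made here for illustration; the function names and all numbers are hypothetical.

```python
def simulated_cases(rate_per_100k, population):
    """Sum pixel-level cases: rate (per 100,000 people) times pixel population."""
    return sum(rate * pop / 100_000 for rate, pop in zip(rate_per_100k, population))


def r_squared(actual, simulated):
    """R^2 of the least-squares line that regresses actual on simulated values."""
    n = len(actual)
    mean_x = sum(simulated) / n
    mean_y = sum(actual) / n
    sxx = sum((x - mean_x) ** 2 for x in simulated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(simulated, actual))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(simulated, actual))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Two made-up pixels in one district: 10 and 20 cases per 100,000 people.
district_cases = simulated_cases([10.0, 20.0], [50_000, 100_000])
print(district_cases)  # 25.0

# Made-up reported and simulated counts for five district-years.
reported = [30.0, 42.0, 18.0, 55.0, 27.0]
simulated = [25.0, 40.0, 22.0, 50.0, 33.0]
print(r_squared(reported, simulated))
```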

Covariate Importance and Correlation Analysis
Through the IGBDT model, the contribution of each feature to the incidence rate can be obtained (Figure 6). Among the factors analyzed in this study, the socioeconomic factors (population, NTL, and NDVI) had the highest impact weights, together accounting for approximately 43.32% of the total impact weight, with the population accounting for 17.07%, NTL for 14.18%, and NDVI for 12.07%. Among the meteorological factors, the mean temperature accounted for 12.20%, the minimum temperature for 9.55%, and the maximum temperature for 3.46%. The impact weights of cumulative precipitation (4.88%) and relative humidity (3.38%) were relatively small, and the meteorological factors accounted for a total of 33.47%. The impacts of the air quality factors, PM10 and PM2.5, were relatively small, accounting for only 4.83%. Topographic factors, including the slope and DEM, accounted for a total of 14.42%. The year feature, included because this study was based on time-series data from 2015 to 2021, accounted for 3.95% of the total impact weight. Overall, socioeconomic factors had the dominant impact on the dysentery incidence rate, followed by meteorological factors, with temperature being the most important meteorological factor, while relative humidity and cumulative precipitation had relatively small impacts. Topographic factors were more important than air quality factors, but neither reached a significant level of importance.
The 1 km dysentery incidence rate products were used to study the correlations between the various features and the incidence rate of dysentery. For each pixel, the Pearson correlation coefficient r [40] was calculated between the covariate values and the incidence rate at the corresponding location from 2015 to 2021. The results are shown in Figure 7. The r value cannot represent the importance of each factor or the incidence rate of dysentery; it represents only the correlation of the trend between 2015 and 2021. In Chongqing, among the socioeconomic factors, NTL and NDVI were both negatively correlated with the incidence rate of dysentery, and the population was weakly positively correlated with it. Among the meteorological factors, the minimum temperature and mean temperature were positively correlated with the dysentery incidence rate, while the maximum temperature was negatively correlated, contrary to expectations. The cumulative precipitation and relative humidity were positively correlated with the dysentery incidence rate. The air quality factors, PM10 and PM2.5, were both negatively correlated with the incidence rate of dysentery.
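The per-pixel calculation can be illustrated with a short sketch of the Pearson coefficient applied to one pixel's seven annual values. The function name `pearson_r` and both series are made up for illustration; repeating the calculation for every pixel would give a correlation map like the one described above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up values for one pixel, one value per year from 2015 to 2021.
years = list(range(2015, 2022))
min_temperature = [5.1, 5.4, 5.0, 5.8, 6.1, 5.9, 6.3]
incidence_rate = [14.2, 15.0, 13.8, 16.1, 17.0, 16.4, 17.5]
print(pearson_r(min_temperature, incidence_rate))
```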
The impacts of topographic factors on the dysentery incidence rate in Chongqing were investigated using the 2019 dysentery incidence rate product, which has the same spatial distribution trend as the mean incidence rate. The correlations between the incidence rate of dysentery and the DEM and slope were calculated using all the raster values of the dysentery incidence rate in 2019, and the resulting r values for the DEM and slope were −0.2358 and −0.2371, respectively. Thus, the topographic factors were negatively correlated with the incidence rate of dysentery in Chongqing, contrary to our initial expectations.

Comparison with Other Models
We constructed IGBDT, RF, GBDT, linear, and SVM models and tuned them through Bayesian hyperparameter optimization to obtain the optimal model. The IGBDT model achieved the best MAE, RMSE, and R² values on the testing set, at 4.7024 (1/10^5), 6.2830 (1/10^5), and 0.8368, respectively. The MAE, RMSE, and R² values obtained by five-fold cross-validation were 5.0039 (1/10^5), 7.346 (1/10^5), and 0.78, respectively. Table 2 shows the performances of the different models. RF and GBDT performed better than the linear and SVM models, although RF performed worse than GBDT. The IGBDT model performed better than both: its R² on the testing set was 0.1015 higher than that of the RF and 0.0202 higher than that of the GBDT. Its MAE and RMSE values were also better, demonstrating higher accuracy, robustness, and generalization ability than the RF and GBDT models. Mohan et al. noted that the GBDT results were heavily influenced by initialization, while the RF was highly resistant to overfitting and served as an excellent optimization starting point. In their results, the IGBDT also outperformed both RF and GBDT in terms of the RMSE values when M (the number of trees) was less than 1000. This conclusion was confirmed by our experiments, which indicated that providing GBDT with a better initialization can significantly improve its performance, allowing it to surpass the individual GBDT and RF models.
The transmission of dysentery may be influenced by various factors [17]. The 1 km dysentery incidence rate products were used to study the importance levels of socioeconomic, meteorological, air quality, and topographic factors and their correlations with the dysentery incidence rate.
Significant negative correlations were found between the incidence rate of dysentery and the NTL and NDVI variables among the socioeconomic factors, while a positive correlation was observed between the dysentery incidence rate and population density. NTL satellite imagery is highly positively correlated with socioeconomic parameters, including urbanization, economic activity, and population [41-43]. In the first stage of urbanization (an urbanization rate of less than 77.59%), NDVI is positively correlated with the per-capita GDP [44]. The urbanization rate of Chongqing in 2021 was 70.3%, which is still within this first stage. Therefore, the NTL and NDVI results indirectly indicate a negative correlation between the incidence rate of dysentery and socioeconomic development; this finding is consistent with other research results [11,15,23,45,46]. Regarding population density, the denser the population is, the more people the dysentery bacteria can reach within a given time and space, and the more likely clustered outbreaks become. Therefore, we found a positive correlation between the population density and the dysentery incidence rate, consistent with previous research findings [6,8,23,24].
Meteorological factors such as temperature, relative humidity, and precipitation are considered to be among the main environmental predictors of the dysentery transmission risk [47]. Previous studies have shown that temperature is a key meteorological factor affecting the incidence rate of dysentery [8,13,16,17,19,48]. For example, in a study performed around Beijing, the minimum, mean, and maximum temperatures were found to be positively correlated with the dysentery incidence rate [8]; with a 1 °C rise in the highest or lowest temperature, the dysentery incidence rate increased by 12% in Jinan, China, and by 16% in Shenzhen, China [48]; and in Peru, severe diarrhea in children increased by 8% for every 1 °C increase in temperature [49]. Increases in temperature may increase pathogen exposure, promote bacterial growth, and prolong the survival of bacteria in the environment and in contaminated food [49]. In addition, when the temperature rises, behavioral changes may occur in the population, and such changes may increase the demand for drinking water and accelerate the spread of dysentery [50]. In this study, the minimum and mean temperatures were positively correlated with the incidence rate of dysentery, consistent with previous research conclusions. However, the maximum temperature was negatively correlated with the incidence rate, which was unexpected. A possible reason for this result is that high maximum temperatures may promote improved hygiene activities, such as increased hand washing and cleaning of food and water sources, thereby reducing the spread of dysentery. It is also possible that other factors contributed to the decline in the incidence rate of dysentery during the study period, so more refined temperature factors will be needed in future research to examine this question.
Positive correlations were observed between the dysentery incidence rate and cumulative precipitation and between the incidence rate and relative humidity. This was in line with the results anticipated from previous studies. To date, some studies have explored the impacts of relative humidity and precipitation on the dysentery incidence rate, but the results are inconsistent. Some studies have shown positive correlations, such as studies performed in Northeast China, Hunan Province, and Beijing, where the incidence rates of dysentery were positively correlated with relative humidity and precipitation [5,13,14,47]. Similar findings have been reported in Taiwan, China [19], the Pacific islands [51], and Bangladesh [52]. Other studies have shown negative correlations, such as a study performed in Wuhan, which found that relative humidity and precipitation were negatively correlated with the risk of bacterial dysentery [17]. One study indicated that a lack of precipitation during the dry season would increase the incidence rate of dysentery in areas south of the Sahara Desert [18]. Still other studies found no significant impact, such as a study conducted in two cities in North and South China, in which neither relative humidity nor precipitation was significantly correlated with the incidence rate of dysentery [48,53].
Regarding air quality factors, we found that PM2.5 and PM10 had positive correlations with the incidence rate of dysentery. Currently, no research has indicated any clear association between PM10 or PM2.5 and dysentery. However, air pollution may affect the human immune system and health conditions, thereby increasing the risk of infectious diseases such as dysentery [25,54]. The results of this study are therefore in line with expectations, but studying the relationship between PM air pollution and the incidence rate of dysentery will require more detailed data.
The relationships between topographic factors and the incidence rate of bacillary dysentery are complex, with some studies indicating that topographic factors can lead to an increase in the dysentery incidence rate [15,46,55], while others have found that the impacts of topographic factors on the dysentery incidence rate are not significant [11]. These inconsistent results may arise because factors other than topography are more important in determining the dysentery incidence rate, which prevents generalization. In this article, we found that the incidence rate of dysentery in Chongqing was negatively correlated with topographic factors overall. This was not consistent with expectations. The contradiction may stem from the complex terrain of Chongqing, which is low in the west and high in the east, whereas the dysentery incidence rate is generally high in the west and low in the east. For example, the terrain in the main urban area is flat, while the terrain in Chengkou is rugged and steep, yet the incidence rate of dysentery is high in both places. In the eastern part of the urban area, the terrain is also rugged and steep, but the incidence rate is low.
Our findings suggested high incidence rates of dysentery in the southwestern and northernmost parts of Chongqing. In the southwestern region, this high incidence rate is associated with factors such as high urbanization, a large population of migrants, a high population density, and a high per-capita GDP. This finding is consistent with previous studies that suggested that the population and economy are dominant factors affecting dysentery incidence rates [9,23]. On the other hand, Chengkou, located in the northernmost part of Chongqing, has the smallest population (a permanent population of 198,000 in 2021) and the lowest per-capita GDP, yet it has had a consistently high incidence rate since 2012 [9]. This may be because, in a small population, a few cases cause relatively large fluctuations in the incidence rate. However, although Wulong (with a permanent population of 356,500 in 2021) has only a moderately larger population than Chengkou, its incidence rate has not remained high. The specific reasons for these discrepancies need to be further investigated.
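The small-population effect mentioned above can be illustrated with simple arithmetic. The following is a minimal sketch, not part of the study's method: it uses only the two 2021 permanent population figures quoted in the text, and the function names are hypothetical. It shows how much a single additional reported case shifts the incidence rate per 100,000 population in each district.

```python
def incidence_rate_per_100k(cases: int, population: int) -> float:
    """Incidence rate expressed per 100,000 population (1/10^5)."""
    return cases / population * 100_000


def rate_shift_per_case(population: int) -> float:
    """Change in the incidence rate caused by one additional reported case."""
    return incidence_rate_per_100k(1, population)


# 2021 permanent populations quoted in the text.
populations = {"Chengkou": 198_000, "Wulong": 356_500}

for district, population in populations.items():
    shift = rate_shift_per_case(population)
    print(f"{district}: one extra case shifts the rate by {shift:.3f} per 100,000")
```

Under these figures, one extra case moves the rate by roughly 0.51 per 100,000 in Chengkou and roughly 0.28 per 100,000 in Wulong. The difference is real but modest, which is consistent with the text's remark that population size alone may not explain the discrepancy.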
The pronounced area of high dysentery incidence rates along the Yangtze River basin may be due to the low NDVI and DEM values in this area. Therefore, further optimization of, and research on, dysentery incidence rate prediction models should be conducted.

Conclusions
To fill the gap resulting from the lack of fine-grained dysentery incidence rate products at the grid scale, this study constructed an IGBDT hybrid machine learning model using socioeconomic factors (NTL, NDVI, and population), meteorological factors (precipitation, relative humidity, and temperature), topographic factors (DEM and slope), and air quality factors (PM2.5 and PM10) as explanatory variables, and the dysentery incidence rate as the output variable. The model was used to map annual statistical dysentery incidence rate data from the districts of Chongqing from 2015 to 2021 to the grid scale, thereby producing a dysentery incidence rate grid (1 km) product for Chongqing. The grained scale products were evaluated against the total incidence of Chongqing and of each district in Chongqing, with R² values of 0.7369 and 0.5439, respectively. These results indicate that the model has some predictive ability. A comparison with other models, such as the RF and GBDT models, showed that the IGBDT hybrid model performed better than either the RF or the GBDT model individually. Then, the spatial distribution of the incidence of dysentery in Chongqing and the importance and correlation of its related factors were discussed. The results showed that socioeconomic factors had the main impact on the incidence of dysentery, accounting for 43.32% of the impact, of which NTL and NDVI showed negative correlations and the population showed a positive correlation; meteorological factors accounted for 33.47%, of which the minimum temperature, mean temperature, precipitation, and relative humidity were positively correlated, while the maximum temperature was negatively correlated. The effects of topographic factors (the DEM and slope were negatively correlated) and air quality factors (PM2.5 and PM10 were positively correlated) were relatively weak.
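The district-level evaluation summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual procedure: it assumes that grid-cell rates are aggregated to a district by population weighting and that R² is the coefficient of determination, 1 - SS_res/SS_tot (the text does not specify either choice). All function names and numeric values are hypothetical placeholders, not data from the study.

```python
def population_weighted_rate(cell_rates, cell_populations):
    """Aggregate grid-cell incidence rates (per 100,000) to one district rate.

    Weighting by population is equivalent to summing the expected cases in
    each cell and dividing by the district's total population.
    """
    total_population = sum(cell_populations)
    if total_population == 0:
        raise ValueError("district has no population")
    weighted = sum(r * p for r, p in zip(cell_rates, cell_populations))
    return weighted / total_population


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder values, NOT data from the study:
# per district, (grid-cell rates per 100,000, grid-cell populations).
district_cells = {
    "district_a": ([2.0, 4.0], [1_000, 3_000]),
    "district_b": ([10.0, 6.0], [500, 1_500]),
    "district_c": ([1.0, 1.0], [2_000, 2_000]),
}
# Hypothetical reported (statistical) rates for the same districts.
reported = {"district_a": 3.0, "district_b": 8.0, "district_c": 1.5}

names = sorted(district_cells)
predicted = [population_weighted_rate(*district_cells[n]) for n in names]
observed = [reported[n] for n in names]
print(f"R^2 = {r_squared(observed, predicted):.4f}")
```

An R² of 1 would mean the aggregated grid values reproduce the reported rates exactly, while a value near 0 would mean they explain no more variance than the mean of the reported rates.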
The 1 km grained scale dysentery incidence rate products produced for Chongqing fill the gap left by the absence of detailed dysentery incidence rate products. They provide researchers and public health institutions with a more comprehensive foundation for their studies and allow researchers to analyze the factors related to dysentery in greater depth, revealing their importance and correlations. This, in turn, better supports the monitoring and control of dysentery. Furthermore, the grid products can assist government and health institutions in accurately pinpointing potential high-risk areas of dysentery within densely populated and economically underdeveloped regions. They provide a more precise basis for resource allocation, enabling targeted governance and monitoring measures to reduce the spread of dysentery.
According to the Chongqing Municipal Bureau of Statistics (https://tjj.cq.gov.cn/, accessed on 8 July 2023), the urbanization rate in Chongqing reached 70.3% in 2021, and the city's annual GDP was CNY 2.7894 trillion. The mean temperature between 2015 and 2021 was 16-18 °C, and the mean precipitation during this period was approximately 1100-1400 mm (https://ceidata.cei.cn/jsps/, accessed on 12 July 2023).

Figure 1. Incidence rate of dysentery in Chongqing.

Figure 2. The workflow used to map the dysentery incidence rate to a fine-grained grid scale.

Figure 3. IGBDT model. The black dashed line is the perfect fitting line with R² equal to 1, red dots represent data distribution points, and the green solid line is the data fitting line.

Figure 4. Grained scale products of the dysentery incidence rate (1 km) in Chongqing from 2015 to 2021.

Figure 5. Evaluation of morbidity grained scale products. The black dashed line is the perfect fitting line with R² equal to 1, red dots represent data distribution points, and the green solid line is the data fitting line.

Figure 7. Correlation between each feature and the incidence rate of dysentery in Chongqing from 2015 to 2021.

Table 1. Datasets used in this study.

Table 2. Comparison of different models.