Urban Population Distribution Mapping with Multisource Geospatial Data Based on Zonal Strategy

Mapping population distribution at fine resolutions with high accuracy is crucial to urban planning and management. This paper takes Guangzhou city as the study area, illustrates the gridded population distribution map by using machine learning methods based on zoning strategy with multisource geospatial data such as night light remote sensing data, point of interest data, land use data, and so on. The street-level accuracy evaluation results show that the proposed approach achieved good overall accuracy, with determinant coefficient (R2) being 0.713 and root mean square error (RMSE) being 5512.9. Meanwhile, the goodness of fit for single linear regression (LR) model and random forest (RF) regression model are 0.0039 and 0.605, respectively. For dense area, the accuracy of the random forest model is better than the linear regression model, while for sparse area, the accuracy of the linear regression model is better than the random forest model. The results indicated that the proposed method has great potential in fine-scale population mapping. Therefore, it is advised that the zonal modeling strategy should be the primary choice for solving regional differences in the population distribution mapping research.


Introduction
People are the main body of geographical environment and social activities, and their distribution pattern is an important subject in many research fields such as sociology, geography, environmental studies, and so on [1,2]. Common population data mainly include two forms: demographic data and spatialized data. Currently, using demographic data based on administrative units is the main way to obtain population distribution information. Although the census data have the advantages of being comprehensive, accurate, and authoritative [3,4], they have the disadvantages of low spatial resolution, excessively long statistical period, and low update frequency. Therefore, the low spatial and temporal resolution of census data is not conducive to effective urban governance. Compared with demographic data, spatialized population data can express the true population distribution more intuitively and on multiple scales, thereby effectively overcome the shortcomings of demographic data.
In recent years, the emergence of multi-source data such as point of interest (POI for short), location check-in data, sharing-bikes' trajectories, floating car GPS, and night light remote sensing data provide time-efficient, fine-grained, and reliable data source for gridded population distribution estimation, while machine learning technologies provide advanced methods for simulating gridded population distribution. Hence, with the progress of data sources and research methods, it is possible to accurately understand the urban population distribution in multiple scales and recognize spatiotemporal patterns [5][6][7]. Guangzhou is one of the core cities of Guangdong-Hong Kong-Macao Greater Bay Area. Due to the fast urbanization process, it is facing several threats, such as large The principle of land use inversion method is to give different weights to each land use type according to the differences of population density between different land use types [18,[21][22][23]. However, the land use inversion method fails to reflect the population distribution difference in the same type of land parcels, and ignores the randomness of population distribution. Since the launch of the DMSP/OLS satellite in 1970s, the night light modeling method has become one of the mainstream methods for simulation of gridded population distribution [24][25][26]. However, the application of DMSP/OLS data is relatively rare, because the sensor stopped working in 2013 and the spatial resolution is too low(1KM). The emergence of Suomi National Polar-Orbiting Partnership Visible Infrared Imaging Radiometer Suite (NPP-VIIRS) data effectively overcomes the shortcomings of DMSP-OLS data in spatio-temporal resolution [27], and such data are more suitable for the study of human social and economic activities [28,29]. Sutton K. et al. used DMSP/OLS night light data to estimate the global population based on a global scale, and found that the global population is about 6.3 billion [30]. However, spatial resolution of DMSP/OLS and NPP/VIIRS night light data sources are still to coarse for population modelling in local scale. Nowadays, the world's first professional luminous remote sensing satellite designed by Wuhan University (Wuhan, China), Luojia 1-01 (LJ 1-01), was successfully launched in June 2018. Compared to NPP-VIIRS data, LJ 1-01 data have a significant improvement in spatial resolution and quantitative level, which provide a more refined data source for small-scale population distribution simulations [31][32][33]. In essence, the night light modeling method simulates the gridded population distribution in the study area by establishing a linear relationship between the night light intensity and population density. However, the problems in the night light data seriously affect the accuracy of the model outputs, such as pixel saturation, overflow, and low accuracy of population retrieval in weak light areas. In order to improve the simulation accuracy, some scholars have tried to combine night lighting data with land use data for research [34][35][36]. However, the improvement in accuracy was not significant. Then, some scholars attempted to improve the simulation accuracy by combining the night lights data, land use data, Point of interest (POI) data, and other types of data [37][38][39]. With the continuous enrichment of social sensing data, some scholars attempt to depict the dynamics of population distribution by using these data, such as taxi GPS data [40], mobile phone data [41], tencent location service data [42], baidu heat map data [43], and so on. The multi-source data fusion method has become one of the most popular method. Due to the limitations of these data, including poor data availability, data heterogeneity, and so on, the conditions for producing gridded population distribution dataset at a large-scale are not yet mature.
Essentially, the spatial decomposition of census data to grid unit is a regression process. Several regression models have been proposed to create the weight layer, such as linear regression (LR), geographically weighted regression (GWR), random forest regression, and so on [23,44]. In particular, the random forest (RF for short), one machine learning method which was first systematically proposed by Breiman [45], is becoming more and more popular in the study of models to produce the weighting layer of population data, as it has several advantages including suitable for collinearity, avoiding over-fitting, high calculation speed, and so on. The idea of RF algorithm is to use the bootstrap resampling method to extract multiple samples from the original sample, model the decision tree for each bootstrap sample, and then combine the results of multiple decision trees to obtain the final regression result by voting. For the first time, scholars obtained 100 m grid population mapping results in three undeveloped countries [46]. Due to the ease of training and interpretation, the random forest method has received increasing attention in the population mapping research [42,47,48]. In the process of random forest modeling, it is necessary to optimize several parameters for improving the performance and effect of learning, such as the max number of features and the max depth of trees, and so on [49,50]. In general, to avoid the impact of the initial data partition on the results, cross validation (CV) and grid search were usually performed to reduce the occasionality and ensure the validity and accuracy of the model training [51].
The uncertainty of modeling is also an important issue in the study of population distribution mapping, which includes ecological fallacy, modifiable areal unit problem (MAUP), and so on.
The MAUP is a classic problem in the Geographical Information Sciences, which has been long acknowledged and explored. For any spatial resolution, as any set of boundaries, MAUP may seriously hamper the strength of statistical results [52]. Sensitivity Analysis is a useful method to explore the uncertainty of modelling. The authors tested the effect of different geometrical data aggregation schemas-administrative regions and hexagonal surface tessellation-on global spatial autocorrelation statistics, by using several datasets for two study-areas including Continental Portugal (mixed urban-rural) and the Lisbon municipality (urban), and raised an important point, i.e., inferences based on spatial analysis of areal data depend greatly on the method used to quantify the degree of proximity between spatial units [52].
As we know, the relationship between population distribution and influencing factors is an extremely complex nonlinear relationship. A variety of machine learning algorithms, including random forests, neural network, and so on, can handle nonlinear relationship fitting well, while a geographic detector model can handle the spatial heterogeneity of influencing factors. However, most existing studies use a single model to the whole area. For regions with large population density differences, a single or global model cannot accurately explain the mechanism of the spatial distribution of internal population. The zonal strategy can effectively solve the shortcomings of the single model, because it can perform secondary partition modeling on the study area according to the characteristics. To this end, this article intends to illustrated the population map of Guangzhou city based on the idea of zoning modeling and machine learning methods, by using multi-source spatial data such as land use, POI, night lights, and so on. This article attempts to improve the study of mapping population distribution, and provide important basic information for urban management.

Materials and Methods
In this section, the study area and data source are first introduced (see Sections 2.1 and 2.2). Second, we needed to select the appropriate grid cell size to rasterize the initial influence factor data based on various calculation formulas (see Section 2.3.1). Third, we identified main factors affecting population distribution based on a geographic detector model (see Section 2.3.2). Finally, we simulated gridded population distribution information of study area based on machine learning algorithms and zonal strategy (see Sections 2.3.3 and 2.3.4). Moreover, we further evaluated the model accuracy by using several metrics.

Study Area
Guangzhou, the capital of Guangdong province, lies between 112 • 57 E to 114 • 3 E longitude and 22 • 26 N to 23 • 56 N latitude, and has an area about 7434 km 2 . The study case was conducted in six central urban districts of Guangzhou city, including Yuexiu, Liwan, Tianhe, Haizhu, Huangpu, and Baiyun. In order to ensure consistency with the caliber of population statistics in 2013, the Huangpu district does not contain the street-level units of former Luogang district. These districts have the highest population densities in Guangzhou city and serve as the political, cultural, and economic centers of Guangzhou. According to the data provided by Guangzhou statistics bureau, the resident population of Guangzhou was 15.3059 million in 2019 and the population density of the city was close to 2059 person/km 2 . However, the study area covers approximately 20.9% of the total area of Guangzhou city, and the recorded permanent resident population comprising 63.18% of the total resident population of Guangzhou. The street-scale population density map of the study area in 2013 is shown in Figure 1.

Data and Preprocessing
The research data used in the article is shown in Table 1. The spatial resolution of digital elevation model (DEM) data is 30 m, and the spatial reference is GCS_WGS_1984. The land use data include six major categories of cultivated land, forest land, grassland, water area, urban and rural, industrial and mining, residential land, and unused land, with a spatial resolution of 30 m. With consideration of the proportions and potential for contribution to population distribution, the urban construction and rural housing land use type was reclassified into three subtypes including urban Land, rural land, and the other construction land. Road data includes highways, national highways, provincial highways, urban roads, and county highways. There are thirteen types of POI data, including catering facilities, public facilities, companies and enterprises, shopping facilities, transportation facilities, finance and insurance, scientific, educational and cultural facilities, commercial housing, life service facilities, sports and leisure facilities, medical service facilities, government agencies and social groups, and accommodation facilities. After obtaining the POI data, it is necessary to perform data cleaning

Data and Preprocessing
The research data used in the article is shown in Table 1. The spatial resolution of digital elevation model (DEM) data is 30 m, and the spatial reference is GCS_WGS_1984. The land use data include six major categories of cultivated land, forest land, grassland, water area, urban and rural, industrial and mining, residential land, and unused land, with a spatial resolution of 30 m. With consideration of the proportions and potential for contribution to population distribution, the urban construction and rural housing land use type was reclassified into three subtypes including urban Land, rural land, and the other construction land. Road data includes highways, national highways, provincial highways, urban roads, and county highways. There are thirteen types of POI data, including catering facilities, public facilities, companies and enterprises, shopping facilities, transportation facilities, finance and insurance, scientific, educational and cultural facilities, commercial housing, life service facilities, sports and leisure facilities, medical service facilities, government agencies and social groups, and accommodation facilities. After obtaining the POI data, it is necessary to perform data cleaning work such as deduplication and correction before it can be used normally. At last, all datasets were unified to the Albers project coordinate system. The reference ellipsoid is Krasovsky_1940 ellipsoid.

Rasterization of Initial Influence Factor Data
Grid cell size has an important influence on the quality of the population distribution simulation results. The gridded population data with appropriate cell size can reveal the spatial varieties of population distribution pattern more clearly. According to a previous research [53], the most appropriate grid size is almost 10% of the smallest street's area. ShaMian street is the smallest one in our study case, and its area is 30,8348 square meters. Therefore, the cell size was set to 150 m in our study case. The initial impact factor datasets are rasterized into 150 m. The calculation methods of each factor in grid scale are shown in Table 2. Table 2. Data Source Information.

Data Calculation Methods
Road Index (R i ) The average density of each types of POI data in the grid cell i The average house price in the grid cell i

Identification of Main Factors Based on Geographic Detector Model
The geographic detector was developed by Wang Jinfeng et al. [54]. It is a statistical method for detecting spatial differentiation and revealing the driving factors behind it. Because this method does not require linear assumptions, and has an elegant form and clear physical meaning, it is widely used in the study of the influence mechanism of social economic, natural environment factors. The geographic detector mainly includes four sub-detectors: factor detector, interactive detection, risk detector, and ecological detector. This paper mainly uses the factor detector to calculate the explanatory power of the influence factor on the population distribution, which is denoted as q value to measure. The q value ranges from 0 to 1. The larger the q value, the stronger the explanatory power of the factor. The initial influence factors (X) include X1 (Government agencies and  First, the Create Random Points tool in ArcGIS 10.2 (ESRI Inc., Redland, USA) was used to randomly generate 3000 sample points in the study area. Then, the extract multi values to Points tool was used to extract the population density values of the corresponding sample points and the index values of each factor. Finally, the GeoDetector2015 software (Wang Jinfeng et al. [54], Beijing, China) was used to calculate the explanatory power (q value) of the influence factor (X) for the population density (Y).
The analysis results of the geographic detector model are shown in Figure 2. The results show that, except for the water area index and grassland index, the significance level of remain factors is 0.05. In other words, except for the water area index and grassland index, remain factors have obvious impact on population density. From the explanatory power ranking results, socioeconomic factors have a greater impact on the spatial distribution of population in the study area than natural factors. Among socio-economic factors, except for building area and road network density, the explanatory power values of other impact factors are all above 0.1, indicating a significant impact on the population density. Among the natural factors, except for the urban land use index, the explanatory power q of other influencing factors is relatively small. To improve the model accuracy, five factors including grassland index, water area index, urban land index, rural land index, and other construction land indexes are removed. The final impact factors of population distribution include 20 indicators including cultivated land index, forest land index, night light intensity, elevation, POI impact intensity index, building area, road index, and so on.
ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 7 of 22 was used to calculate the explanatory power (q value) of the influence factor (X) for the population density (Y). The analysis results of the geographic detector model are shown in Figure 2. The results show that, except for the water area index and grassland index, the significance level of remain factors is 0.05. In other words, except for the water area index and grassland index, remain factors have obvious impact on population density. From the explanatory power ranking results, socioeconomic factors have a greater impact on the spatial distribution of population in the study area than natural factors. Among socio-economic factors, except for building area and road network density, the explanatory power values of other impact factors are all above 0.1, indicating a significant impact on the population density. Among the natural factors, except for the urban land use index, the explanatory power q of other influencing factors is relatively small. To improve the model accuracy, five factors including grassland index, water area index, urban land index, rural land index, and other construction land indexes are removed. The final impact factors of population distribution include 20 indicators including cultivated land index, forest land index, night light intensity, elevation, POI impact intensity index, building area, road index, and so on. The sample features are presented numerically, and the numerical ranges between some different features are quite different. In addition, the dimensions of different features are inconsistent. In order to eliminate the influence of non-uniform dimension and extreme value, the data are normalized. There are 106 records in the street sample dataset after min-max normalization, and each record includes 20 feature values. The grid sample dataset has a total of 48,648 records, and each record contains 20 feature values.

Simulation of Gridded Population Distribution with Single Modeling Strategy Linear Regression Model
The linear regression model based on land use data and night light remote sensing data is currently a widely used method of spatialization of population data. For this reason, based on NPP/VIIRS night light data and land use data, this paper establishes a linear regression model of the population distribution grid in the study area to realize the spatialization of population data.
The model is built as follows: The sample features are presented numerically, and the numerical ranges between some different features are quite different. In addition, the dimensions of different features are inconsistent. In order to eliminate the influence of non-uniform dimension and extreme value, the data are normalized. There are 106 records in the street sample dataset after min-max normalization, and each record includes 20 feature values. The grid sample dataset has a total of 48,648 records, and each record contains 20 feature values.

Simulation of Gridded Population Distribution with Single Modeling Strategy Linear Regression Model
The linear regression model based on land use data and night light remote sensing data is currently a widely used method of spatialization of population data. For this reason, based on NPP/VIIRS night light data and land use data, this paper establishes a linear regression model of the population distribution grid in the study area to realize the spatialization of population data.
The model is built as follows: where P i denotes the population data of the i-th street (town); a j represents the initial coefficient of population distribution for the j-th land use type; S ij represents the j-th land use type index of the ISPRS Int. J. Geo-Inf. 2020, 9, 654 8 of 21 i-th street (town); n represents the type of land use selected; k represents the coefficient of night light intensity; M nl represents the average night light intensity value of each street; and b is the constant term.

Random Forest Regression Model
In this research, the street scale data set is used as the training set, and the grid scale data set is used as the development set. The machine learning library used for modeling is scikit-learn 0.21.3, which is a suit of free software released and maintained by scikit-learn international community. First, the model is training using the algorithms selected in the article, taking the population density of each street as the dependent variable, and the influence factors identified by the geographic detector model as the independent variable. During the training process, the best model parameters such as the maximum depth and features of the decision tree were obtained by GridSearchCV function in the scikit-learn which is conducted to try every possibility through loop traversal in all candidate parameter selections. To improve the predictive accuracy and control overfitting [55,56], K-fold cross validation (CV) was used to obtain model parameters. Therefore, 10-fold CV was performed, and the determinant coefficient (R 2 ) score was used to determine the overall accuracy of the model. Then, the optimal model is applied to the grid scale dataset to predict the population density value of each grid cell. Next, the population density is multiplied by the cell area to obtain the population of each cell, and the populations of each cell are merged into the street administrative unit. Finally, the accuracy of model result is evaluated through error indicators such as mean relative error (MRE), root mean square error (RMSE), R 2 and so on, taking the demographic data of each street as the true value.

Simulation of Gridded Population Distribution with Zonal Modeling Strategy
Using the population concentration division method, the study area is subdivided to improve the simulation accuracy. The calculation method of population concentration is shown in Equation (2): where JDD i represents the population concentration degree of a certain street (town); P i and P n represents the population of the street (town) and the total population of the area respectively; and A i and A n represent the area of the street (town) and the total area of the area. The study area is divided to two partitions according to the classification criteria of population concentration. In order to ensure that each region has a sufficient sample size for machine learning modeling, referring to the Chinese population agglomeration classification standard [57], the study area is divided into two subareas: dense area and sparse area. The dense area contains 54 streets and the average population density is 37,647 people/km 2 . The non-populated area contains 52 streets and the average population density is 3594 people/km 2 . The specific zoning results are as follows as shown in Figure 3.  Different model factors were selected for two subareas. For dense area and sparse area, linear regression model and random forest model are both executed. Models suitable for each subarea are selected according to the simulation accuracy and merged to population distribution map for the entire study area.

Simulation Results Based on Global Linear Regression Model
The population density of each street (town) in the study area was taken as the dependent variable, while the urban land index (X1), arable land index (X2), forest land index (X3), rural land index (X4), other construction land index (X5), grassland index(X6), and the average night light intensity (X7) were selected as the independent variables. The model is calculated through linear regression analysis, and is shown in Equation (3) The value of coefficient of determination (R 2 ) is 0.59, which means that the population density in one street unit is explained in 59% by the land use indices and night light intensity index. The variance analysis results of model are shown in Table 3.

Simulation Results Based on Global Linear Regression Model
The population density of each street (town) in the study area was taken as the dependent variable, while the urban land index (X1), arable land index (X2), forest land index (X3), rural land index (X4), other construction land index (X5), grassland index(X6), and the average night light intensity (X7) were selected as the independent variables. The model is calculated through linear regression analysis, and is shown in Equation (3): The value of coefficient of determination (R 2 ) is 0.59, which means that the population density in one street unit is explained in 59% by the land use indices and night light intensity index. The variance analysis results of model are shown in Table 3. Based on the multiple linear regression Equation (3), the grid calculator in ARCGIS software is used to generate the population distribution data of the study area. The result is shown in Figure 4. It can be seen from Figure 4 that the population is gradually decreasing from the center of the city to the surrounding area. Because the simulation results are too much affected by the spatial pattern of construction land and the intensity of night lights, the population of the construction land area is overestimated, and the spatial differentiation of population distribution is not obvious. In particular, the northern part of Renhe town is close to Guangzhou Baiyun International Airport (Guangzhou, China) and the industrial area is concentrated, resulting in obvious outliers in population estimates, which are inconsistent with the actual situation. In order to verify the accuracy of the simulation results, a linear fit was performed between the simulated population of the street (town) and the census data. It was found that the goodness of fit (R 2 ) is only 0.0039, and the RMSE is 1,007,871.35. The simulation results in most of the street (town) had a huge deviation from the true value. In a word, the simulation accuracy is very poor. Through comparison of previous studies and analysis, it is found that the reasons for the large deviation of the model estimation may be roughly as follows: (1) The research area is the downtown area of Guangzhou city, where the factors affecting population distribution in the area are very complicated. It is impossible to accurately simulate the characteristics of population distribution only by relying on land use and night light data. (2) The spatial resolution of land use data is low, which is only 30 m. At the same time, the resolution of land use types in this study is not detailed enough, which has only five categories.
Since the study area is the central urban area of Guangzhou city, the construction land is the dominant land use type in the area. Therefore, in the process of gridded population distribution mapping, simulation result is deeply affected by the spatial distribution pattern and area of the construction land, which easily leads to overestimation and large deviations from actual census data. As we known, previous studies also demonstrate that the higher the resolution of land use data and the more detailed land use types can improve the accuracy of population simulation.  It can be seen from Figure 4 that the population is gradually decreasing from the center of the city to the surrounding area. Because the simulation results are too much affected by the spatial pattern of construction land and the intensity of night lights, the population of the construction land area is overestimated, and the spatial differentiation of population distribution is not obvious. In particular, the northern part of Renhe town is close to Guangzhou Baiyun International Airport (Guangzhou, China) and the industrial area is concentrated, resulting in obvious outliers in population estimates, which are inconsistent with the actual situation. In order to verify the accuracy of the simulation results, a linear fit was performed between the simulated population of the street (town) and the census data. It was found that the goodness of fit (R 2 ) is only 0.0039, and the RMSE is 1,007,871.35. The simulation results in most of the street (town) had a huge deviation from the true value. In a word, the simulation accuracy is very poor. Through comparison of previous studies and analysis, it is found that the reasons for the large deviation of the model estimation may be roughly as follows: (1) The research area is the downtown area of Guangzhou city, where the factors affecting population distribution in the area are very complicated. It is impossible to accurately simulate the characteristics of population distribution only by relying on land use and night light data. (2) The spatial resolution of land use data is low, which is only 30 m. At the same time, the resolution of land use types in this study is not detailed enough, which has only five categories. Since the study area is the central urban area of Guangzhou city, the construction land is the dominant land use type in the area. Therefore, in the process of gridded population distribution mapping, simulation result is deeply affected by the spatial distribution pattern and area of the construction land, which easily leads to overestimation and large deviations from actual census data. As we known, previous studies also demonstrate that the higher the resolution of land use data and the more detailed land use types can improve the accuracy of population simulation. (3) The time of night light data, land use data, and statistical population data are inconsistent.
The population census data used in this study case are from 2013, while the land use data and the night light data are from 2015 and from 2016, respectively. The inconsistency between the modeled data and the real data will inevitably lead to a decrease in model accuracy.

Simulation Results Based on Global Random Forest Model
In order to reduce the simulation error and improve the simulation accuracy, the selection of the factors affecting the spatial distribution of the population is optimized, and the random forest model is adopted to realize the spatialization of population data in the study area. Considering the error factors of the global linear regression model, the independent variables of random forest model were updated to 20 indicators including cultivated land index and woodland index, night light intensity, altitude, POI influence intensity, building area, road network density, and average house prices. Based on the street (town) scale, the training feature database was built. Due to the inconsistent dimensions of the influencing factors, the data was standardized to eliminate the influence of inconsistent dimensions and extreme values.
As the number of streets being used to fit the model is only 106, 10-fold CV was chosen. It means that 9-fold samples were adopted into the model to obtain the proper parameters and the accuracy was assessed using the remaining samples. The 10-fold cross validation and grid search were conducted to find the proper parameters in the RF model (Table 4). When the max depth of the decision tree was 9 and the number of estimators was 500, the determinant coefficient (R 2 ) of the CV score reached the maximum value of 0.7250. Finally, we built a random forest model by setting the maximum depth of subtrees to 9, and the number of estimators to 500.  Based on the parameters set above, the model training was carried out with the street (town) data set as the training set. Subsequently, the generated random forest regression model is applied to each grid cell to predict the population density. Finally, the simulation result is obtained by multiplying the population density and cell area. The simulation result is shown in Figure 5. model is adopted to realize the spatialization of population data in the study area. Considering the error factors of the global linear regression model, the independent variables of random forest model were updated to 20 indicators including cultivated land index and woodland index, night light intensity, altitude, POI influence intensity, building area, road network density, and average house prices. Based on the street (town) scale, the training feature database was built. Due to the inconsistent dimensions of the influencing factors, the data was standardized to eliminate the influence of inconsistent dimensions and extreme values. As the number of streets being used to fit the model is only 106, 10-fold CV was chosen. It means that 9-fold samples were adopted into the model to obtain the proper parameters and the accuracy was assessed using the remaining samples. The 10-fold cross validation and grid search were conducted to find the proper parameters in the RF model (Table 4). When the max depth of the decision tree was 9 and the number of estimators was 500, the determinant coefficient (R 2 ) of the CV score reached the maximum value of 0.7250. Finally, we built a random forest model by setting the maximum depth of subtrees to 9, and the number of estimators to 500.  Based on the parameters set above, the model training was carried out with the street (town) data set as the training set. Subsequently, the generated random forest regression model is applied to each grid cell to predict the population density. Finally, the simulation result is obtained by multiplying the population density and cell area. The simulation result is shown in Figure 5.   Figure 5, the population distribution in the study area presents an obvious "core-edge" pattern, i.e., the population is highly concentrated in the center of the study area, while the population density in the edge area is small. Among them, the northeast of Yuexiu district, Liwan district and Tianhe district has the largest population, and an aggregation effect is obvious. It is worth mentioning that the coefficient of determination (R 2 ) of the random forest model result is 0.605, and the RMSE is 6497.02. At the same time, the average relative error of all streets (towns) in the study area is close to 32.16%. Compared with the global linear regression model, it can be seen that the simulation accuracy by using the random forest regression model has been significantly improved. However, for regions with large population density differences, a global model cannot accurately explain the mechanism of the spatial distribution of population.

Simulation Results for Dense Area
Dense area is located in the center of the city, and the population distribution is more affected by social and economic factors than natural environmental factors. To this end, sixteen impact factors were selected for population distribution modeling in dense area, including the road network density (X1), building area (X2), average house price (X3), catering facilities (X4), public facilities (X5), companies (X6), shopping facilities (X7), transportation facilities (X8), financial insurance (X9), scientific, educational and cultural facilities (X10), commercial residential buildings (X11), life service facilities (X12), sports and leisure facilities (X13), medical service facilities (X14), government agencies and society groups (X15), and accommodation facilities (X16).
The stepwise regression method was used to establish a multiple linear regression model for dense area. The model is shown as follows: Y = 51753.187X 11 + 26819.234X 2 − 20028.002X 6 + 19690.791X 5 (4) The random forest model was performed in the same way of global RF modelling as mentioned above. Ten-fold cross validation and grid search were conducted to find the proper parameters in the RF model for dense area. It is found that when the max depth of the decision tree was 10 and the number of estimators was 600, the determinant coefficient (R 2 ) of the CV score reached the maximum value. Therefore, the random forest model was built, by setting the maximum depth of subtrees to 10, and the number of estimators to 600. Subsequently, the generated random forest regression model is applied to each grid cell to predict the population density in dense area. Finally, the simulation result is obtained by multiplying the population density and cell area. The two simulation results for dense area are shown in Figure 6, and their accuracy is shown in Table 5.

Simulation Results for Sparse Area
Compared with dense area, the factors affecting population density distribution in sparse area are more complex. On the one hand, it can be seen from the land use classification map that, due to the wide distribution and large area of cultivated land and forest land in sparse area, they should not be ignored in the modelling of population distribution. On the other hand, it can be seen from the digital elevation model that the elevation spatial differences in sparse area are obvious. Therefore, the natural environment and socio-economic factors should be fully considered in the process of selecting factors affecting the spatial distribution of population. Twenty impact factors are used as the main factors for spatial modeling of population in sparse area, including cultivated land index (X1), forest land index (X2), altitude (X3), building area (X4), road network density (X5), night light intensity (X6), housing price (X7), and point of interest (X8-X20).
The stepwise regression method was used to establish a multiple linear regression model for dense area. The model is shown as follows: y = 3747.097X 19 The random forest model was performed in the same way of global RF modelling as mentioned above. Ten-fold cross validation and grid search were conducted to find the proper parameters in the RF model for sparse area. It is found that when the max depth of the decision tree was 7 and the number of estimators was 600, the determinant coefficient (R 2 ) of the CV score reached the maximum value. Therefore, the random forest model was built, by setting the maximum depth of subtrees to 7, and the number of estimators to 600. Subsequently, the generated random forest regression model is applied to each grid cell to predict the population density in sparse area. Finally, the simulation result is obtained by multiplying the population density and cell area. The two simulation results for sparse area are shown in Figure 7, and their accuracy is shown in Table 6. the simulation result is obtained by multiplying the population density and cell area. The two simulation results for sparse area are shown in Figure 7, and their accuracy is shown in Table 6.
(a) LR model result (b) RF model result

Simulation Results for Whole Area
By merging the random forest simulation results of dense area and the linear regression simulation results of sparse area, the ultimate simulation result for whole study area is obtained (see Figure 8). Comparing the simulation results by using the zonal model ( Figure 8) with the simulation results by using the global LR model and global RF model (Figures 4 and 5), it can be found that the simulation results based on zonal model strategy can reflect the spatial characteristics of the "core-periphery" population distribution in the study area more clearly. Taking the actual street population as the true value, the simulation accuracy of zonal model and global model were evaluated by using the goodness of fit. The results are shown in Table 7. It can be seen from Table 7 that the simulation accuracy of zonal modeling is better than that of non-zonal models. Because the simulation accuracy of the single linear regression model is too low, it will not be discussed here. Next, the relative errors of the zonal model and the single random forest model were analyzed. The results are shown in Table 8.  Comparing the simulation results by using the zonal model ( Figure 8) with the simulation results by using the global LR model and global RF model (Figures 4 and 5), it can be found that the simulation results based on zonal model strategy can reflect the spatial characteristics of the "core-periphery" population distribution in the study area more clearly. Taking the actual street population as the true value, the simulation accuracy of zonal model and global model were evaluated by using the goodness of fit. The results are shown in Table 7. It can be seen from Table 7 that the simulation accuracy of zonal modeling is better than that of non-zonal models. Because the simulation accuracy of the single linear regression model is too low, it will not be discussed here. Next, the relative errors of the zonal model and the single random forest model were analyzed. The results are shown in Table 8. It can be seen from Table 8 that in the results of the zonal model, the proportion of streets with a simulation error within 30% is 51.89%, while the single random forest model is 49.06%. The difference between the two models is not obvious. However, in the modeling results, the proportion of streets with a relative error greater than 50% is 31.13% based on a single random forest model, which is obviously higher than the zonal model. Compared with the global random forest model, the proportion of streets with a relative error percentage greater than 50% in the zoning modeling results has decreased from 31.13% to 16.98%. In summary, the mapping results obtained based on the zonal modeling strategy are more in line with the actual situation, which can significantly improve the accuracy of population distribution mapping. Thematic maps are drawn for the relative error percentage values of each street produced by two models. The thematic maps are shown in Figure 9. It can be seen from Table 8 that in the results of the zonal model, the proportion of streets with a simulation error within 30% is 51.89%, while the single random forest model is 49.06%. The difference between the two models is not obvious. However, in the modeling results, the proportion of streets with a relative error greater than 50% is 31.13% based on a single random forest model, which is obviously higher than the zonal model. Compared with the global random forest model, the proportion of streets with a relative error percentage greater than 50% in the zoning modeling results has decreased from 31.13% to 16.98%. In summary, the mapping results obtained based on the zonal modeling strategy are more in line with the actual situation, which can significantly improve the accuracy of population distribution mapping. Thematic maps are drawn for the relative error percentage values of each street produced by two models. The thematic maps are shown in Figure 9.
(a) RF model (b) Zonal model It can be seen from Figure 9 that, the accuracy of simulation results in the region including Baiyun district, the east of Tianhe district, and Huangpu district are lower than other regions. The relative error percentages of streets in these area are mostly exceeding 50%. Among them, Tonghe street in Baiyun district has the highest MRE in the global random forest model result, while Sanyuanli street in Baiyun district has the highest MRE in the zonal model result. Although different models were applied It can be seen from Figure 9 that, the accuracy of simulation results in the region including Baiyun district, the east of Tianhe district, and Huangpu district are lower than other regions. The relative error percentages of streets in these area are mostly exceeding 50%. Among them, Tonghe street in Baiyun district has the highest MRE in the global random forest model result, while Sanyuanli street in Baiyun district has the highest MRE in the zonal model result. Although different models were applied to different subareas in this paper, due to the obvious differences in the natural environment and social-economic development level between two subareas. However, in the dense area, whether global random forest or zonal model, the model accuracy is relatively low. A reasonable estimation is that as densely populated areas are areas are usually with rapid urbanization rate, the mismatch of temporal and spatial resolutions of different data sources may cause large deviations between model results and statistical data. This inadequacy may reduce the simulation accuracy of the model.

Discussion
As for the census data of administrative units, it shows how many people are distributed in a certain spatial range (e.g., a county or a township), and where the specific distribution cannot be presented. That is, census data is not designed for the purpose of spatialization, and the mapping based on grid units can achieve the fine mapping of population distribution [41]. Therefore, the gridded population mapping is very important for urban management. However, previous studies on fine-scale population mapping are few, and the spatial resolutions of mapping results are mostly based on a grid with a 500 m or 1 km spatial resolution, which is difficult to meet the requirements of fine management. This study produced the 150 m gridded population data in central Guangzhou using the strategy of zoning modeling integrated two machine learning methods. The state-of-the-art random forest model and linear regression model were used to integrate multiple data and predict the population density distribution, which was used for the gridded population mapping. The results provide timely and important information to policy makers for city management and planning [51]. Obviously, the maps with grid units are visually coarse and cannot produce the information consisting of the geographical entities (i.e., the human residential settlements). On the contrary, our mapping result can provide fine population information by using homogeneous polygons with irregular boundaries. On the basis of the above experimental analysis, it is necessary to further discuss the following issues.

Principal Findings and Meaningful Implication
(1) The geographic detector model is a very useful tool for identification of influence factor on population density distribution. As many concerned natural and socioeconomic variables were input in our modeling, the analysis of the geographic detector model has an obvious positive effect on understanding the mechanism of influencing population distribution in this study area. The experimental results of population distribution mapping show that the influence factors selected by the geographic detector model are reasonable. However, the impact factor selected by the geographic detector model is not the final modeling factors for linear regression model. The explanatory power ranking results demonstrated that socioeconomic factors have a greater impact on the spatial distribution of population in the study area than natural factors. This finding is consistent with previous studies. (2) The zoning modeling strategy using machine learning methods can improve the accuracy of fine-scale population mapping in urbanized areas. The results show that the goodness of fit for the zoning model result is close to 0.713, while the goodness of fit for single linear regression model and random forest regression model are 0.0039 and 0.605, respectively. We speculate that the poor accuracy of the single linear regression model is mainly due to the low spatial resolution of land use data and the insufficient classification of types, which leads to large deviations in the simulation results. Compared with zonal model, the reason for the lower accuracy of the global random forest model may be due to the neglect of the natural and human environment differences between dense area and sparse area. For dense area, the accuracy of the random forest model is better than the linear regression model. While for sparse area, the accuracy of the linear regression model is better than the random forest model. Compared with the linear regression model, although the random forest model has many advantages in theory, it does not guarantee that it has an accuracy advantage in population distribution mapping. Therefore, it is advised that the zonal modeling strategy should be the first choice for solving regional differences in the population distribution mapping research. (3) The application of random forest regression in fine-scale population mapping should be cautious.
Despite the random forest model is popularized in many disciplines as its advantages, some major and crucial parameters can affect the results achieved by users [58,59], such as max depth of the decision trees, the max number of features, and so on. Therefore, tuning the parameters is a crucial work in random forest modelling. Grid search combined with K-fold cross-validation is an effective method to find the optimal parameters, especially when the number of sample is small. As the number of streets being used to fit the model is only 106, 10-fold CV was chosen in our study. The prediction result has a good effect with the goodness of fit value of 0.605 in the global random forest model.

Explanations for Further Research
Fine-scale population distribution mapping is a very complicated problem. This paper first uses the geographic detector model to select the influencing factors on population spatial distribution, and then realizes the population distribution mapping using the single model and zoning model, and finally compares the simulation results to prove that the accuracy of the zonal method implemented in this paper has been improved significantly. However, due to the limitation of personal scientific research ability, time, and other factors, there are still the following shortcomings in the research process of this article, which need to be gradually improved in the future.
(1) The quality of data source for modeling can be improved. The spatial resolution of NPP/VIIRS night light data is nearly 500 m, which is too coarse for population mapping in city or commune scale. Compared to NPP-VIIRS data, spatial resolution and quantitative level of LJ 1-01 data have increased from 500 m to 150 m. Despite the finest resolution of the LJ 1-01 data, the complex urban fabric patterns and the uncertainty of the radiation calibration of Luojia1-01 constrain the application of the images in conducting accurate population estimations at a fine scale. Besides that, the spatial and type resolution of land use data used in this paper are also not detailed enough, which may affect the accuracy of the model obviously. Moreover, the time characteristics of the data source do not match. The street (town) demographic data used in the research is from 2013, while the land use, night lighting, POI, and other data used in the modeling are later than 2015. The mismatch of the time characteristics will inevitably affect the accuracy of the model. Therefore, application of more detailed and accurate dataset in population mapping should be further studied as soon as possible. (2) The issue of model uncertainty need be addressed in further studies. Similar to many studies, different model parameters produce different model results, especially in the field of geography science. For example, grid size may affect the model result. The gridded population data with appropriate cell size can reveal the spatial varieties of population distribution pattern more clearly. According to previous research experience, in order to prevent the smallest residential units from falling into the same grid, the grid size is set to 150 m in our study. However, whether there is a more suitable grid size for the model is still to be studied. As we know, sensitivity analysis is a useful method to explore the uncertainty of modelling. Consequently, the issue of uncertainty using sensitivity analysis method in the population mapping is another topic need to be explored. (3) There is still potential for improvement in the selection of modeling factors. In fact, the influence mechanism of population distribution is a tough topic. Although this paper identified the main influence factors by integrating previous experience with geographic detector model analysis, it is still difficult to fully reveal the inherent mechanism. For example, random forest model can also be used to distinguish the importance of different features [60]. Among the impact factors selected in this article, POI data accounted for the vast majority, which has an extremely important impact on the modeling of population distribution. Meanwhile, each category of POI may be correlated with each other, which may cause some feature importance bias. Further studies must deal with the independence of each feature or fitting the model by integrating all the POIs into one layer.
Besides that, only two classic machine learning methods such as linear regression and random forest are used in this paper. The applicability and accuracy evaluation of other machine learning methods such as neural networks, support vector machine, and so on, should to be further studied. In other words, more efforts are needed to apply alternative ways to build new models with proper parameters.

Conclusions
This study provided a zonal method for downscaling the census population to 30 m' grid scale in central Guangzhou, China. Firstly, the geographic detector model is used to select the influencing factors on population spatial distribution. Then, the population distribution mapping with the single model and zoning model are realized respectively. Finally, the accuracies of different model result are compared. The street-level accuracy evaluation results show that, the proposed approach achieved good overall accuracy, with R 2 being 0.713 and RMSE being 5512.9. Meanwhile, the goodness of fit for single linear regression model and random forest regression model are 0.0039 and 0.605 respectively. Our study demonstrated that the accuracy of the population distribution mapping can be effectively improved compared with mapping which does not use the zoning modeling strategy. The results indicated that the proposed method has great potential in fine-scale population mapping. For dense area, the accuracy of the random forest model is better than the linear regression model. For sparse area, the accuracy of the linear regression model is better than the random forest model. Therefore, it is advised that the zonal modeling strategy should be the first choice for solving regional differences in the population distribution mapping research.