A Population Spatialization Model at the Building Scale Using Random Forest

: Population spatialization reveals the distribution and quantity of the population in geographic space with gridded population maps. Fine-scale population spatialization is essential for urbanization and disaster prevention. Previous approaches have used remotely sensed imagery to disaggregate census data, but this approach has limitations. For example, large-scale population censuses cannot be conducted in underdeveloped countries or regions, and remote sensing data lack semantic information indicating the different human activities occurring in a precise geographic location. Geospatial big data and machine learning provide new ﬁne-scale population distribution mapping methods. In this paper, 30 features are extracted using easily accessible multisource geographic data. Then, a building-scale population estimation model is trained by a random forest (RF) regression algorithm. The results show that 91% of the buildings in Lin’an District have absolute error values of less than six compared with the actual population data. In a comparison with a multiple linear (ML) regression model, the mean absolute errors of the RF and ML models are 2.52 and 3.21, respectively, the root mean squared errors are 8.2 and 9.8, and the R 2 values are 0.44 and 0.18. The RF model performs better at building-scale population estimation using easily accessible multisource geographic data. Future work will improve the model accuracy in densely populated areas.


Introduction
Population spatialization data reflect the population distribution in the objective world, and fine-scale population distribution data are essential in public health and urban planning [1][2][3][4][5][6][7] and facilitate mobile population monitoring, resource allocation optimization, and urban structure analysis [3,[8][9][10][11]. Most countries obtain detailed population distribution maps through censuses, which have limited utility and suffer from the following problems: (1) High costs prohibit underdeveloped countries and regions from conducting large-scale population censuses [12,13]. (2) Long census intervals, changing administrative boundaries, and uneven distributions of the population within administrative units do not allow the censusing approach to reflect population changes promptly at a fine spatial resolution [12]. Since census data cannot reveal the spatial heterogeneity of population density in detail [6,14], the use of general multisource GIS datasets for decomposing census data to map population distribution at fine scales has become a research hotspot [2][3][4]13,15].
However, remote sensing data with medium spatial resolution lack semantic information and cannot directly indicate land use or human presence [37]. These data have a limited ability to extract demographic and socioeconomic characteristics associated with human activities in complex urban environments [38][39][40][41][42][43]. Geospatial big data such as point of interest (POI) data can compensate for these drawbacks of remotely sensed imagery. POI data contain location information and textual descriptions that extract detailed information about cities or social systems [44,45]. Many studies [35,38,40,46,47] have used POIs to define functional urban areas and land use types. Moreover, the results have shown a correlation between POI categories and population density [4,40]. Bakillah used voluntary geographic information (VGI) to map the population distribution at the building level in Hamburg; this approach relies only on POI and fine-grained land use/land cover data and does not consider the spatial heterogeneity of the population distribution when calculating the population within buildings [4]. Different geospatial big data can capture different aspects of the ground truth [39]. Additionally, some studies have combined remote sensing products with residential building footprints and census data to build empirically weighted models to map building-scale population distributions [15,48]. These methods are still challenging to apply to fine mapping of the population in China, where the urban spatial structure is diverse and the population distribution is complex [35]. Therefore, this study integrates multiple sources of easily accessible geospatial data to construct a population prediction model, revealing the population distribution at the building scale.
Machine learning, which has advanced considerably in the past few decades, has provided more efficient tools for population spatialization, and the random forest (RF) algorithm is one of the most common and powerful supervised learning algorithms. RF is a classification tree-based machine learning algorithm first proposed in 2001 by Leo Breiman and Cutlery Adele [49]. It has many appealing properties, such as high classification accuracy; the ability to model complex interactions among predictor variables; and the flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning [50][51][52]. Stevens et al. used the RF algorithm to build a nonparametric predictive model to map the fine-scale population distribution in Kenya, Vietnam, and Cambodia at a reduced scale of census data [13]. Methods using RF have been successfully applied to map the population density of China at a 100 m resolution [53]. Yao et al. used the RF algorithm to analyze POIs and real-time user density (RTUD) to reduce the street-level population distribution to the grid level [35]. Ye et al. combined POIs with multisource remote sensing data in an RF model to produce a 100 m spatial resolution gridded population map of China with a higher accuracy than the WorldPop dataset [37]. The results of the above studies all show that the RF algorithm is reliable and has good performance in fine-scale population spatialization; therefore, the RF algorithm was chosen to construct the prediction model in this paper.
Although many researchers have fused remote sensing images with geospatial data such as POI data and used machine learning algorithms to map population distributions at a fine scale, the existing methods still have drawbacks. Countries and regions with poor economic conditions cannot conduct large-scale population censuses, and some high-resolution remote sensing images are confidential data. This study addresses these drawbacks by fusing multiple sources of easily accessible geospatial data, such as building data, land use data, NTL data, administrative district data, road data, water system data, and POI data, using the RF algorithm to train a model to predict the population distribution at the building scale. Section 2 includes an overview of the study area and a description of the multiple sources of easily accessible geospatial data. Section 3 focuses on the methodology; it introduces how to build a multidimensional feature library using seven easily accessible sources of geographic data, filter 30 features related to population distribution by feature engineering, and train the prediction model using the RF algorithm. In Section 4, the prediction results are mainly presented and compared with a multiple linear (ML) regression model trained with the same dataset. The model performance is analyzed by evaluating metrics such as absolute error compared to the actual value, mean absolute error (MAE), root squared mean error (RSME), and R 2 . Section 5 presents the feature importance and contribution analysis, and the results show that building area and self-service POI are the two most essential features in this model. Finally, Section 6 summarizes the paper and discusses the implications for future research in this area.

Data and Preprocessing
The study area in this paper is Lin'an District, Hangzhou city, Zhejiang Province, located from 118 • 21 to 120 • 30 E longitude and 29 • 11 to 30 • 33 N latitude, as shown in Figure 1. Lin'an District contains 18 subareas. In addition, CGCS2000 and the 3-degree Gauss-Kruger zone 40 are used in the experiment. The following datasets are used in this study, Table 1 lists the format and source of these data, Figures 2 and 3 show the visualization of datasets:    Table 2. We calculated the closest Euclidean distance from each building type to the same type of POI and used the results as model inputs.

Methods
The RF concept is based on a bagging algorithm, and a method involving the random selection of independent variables is used in the training process of decision trees [49]. An RF contains multiple decision trees trained with a bagging-based integrated learning technique [54]. When building an individual classification tree, for each splitting point in the tree, a random sample containing q (1 ≤ q ≤ p) independent variables is selected as a candidate from among p total independent variables, and the independent variables associated with a splitting point can only be selected from q variables. The similarity between individual classification trees with highly correlated output prediction results can be reduced by considering only a subset of the independent variables at each splitting point. The independent variables that have the most significant impact on performance do not influence the (p − q)/p ratio because they are not selected as splitting points, and other independent variables have an equal chance of being selected as split points, thus decorrelating the effects of single trees [55].
First, a multidimensional feature library is constructed. Second, feature engineering steps, such as filtering and standardization, are performed for the feature library, and features related to the spatial distribution of the population are selected as explanatory variables and included in model construction. Then, based on the RF regression algorithm, population spatialization is performed, and the grid search method and cross-validation are applied to adjust and optimize the model to improve performance. Finally, the population spatialization results at the building scale in the study area are obtained and compared with the results of an ML model to evaluate the accuracy and performance of each model. The flowchart in Figure 4 shows the entire workflow.  Population spatialization requires the use of feature values related to the population distribution as independent variables for training. For an established feature set, manual screening is performed, and the influence on the spatial distribution of the population is used as the feature selection criterion. Text-based feature values, such as the name of the administrative district in which a building is located, the address of the building, the population of each sex, the names of roads, and the name of the water system, are removed due to the algorithmic factors used in model fitting. Furthermore, numerical-type feature values are used as the feature data. Among these extracted feature values, numerical-type feature values, such as those for road length and the number of floors in a building, are removed due to issues with missing values.
Feature selection is performed to facilitate the evaluation of features, mainly with a screening method; specifically, the importance of each dimensional feature is assessed according to an evaluation index, and then the features are ranked and selected based on their importance scores. To construct the model, if the variance of a dimensional feature is minimal, the feature provides limited information, the variability in the feature is minimal, the contribution to the model is limited, and the impact on model performance is negligible; therefore, these types of features are removed with a low-variance filtering feature selection method. In this method, the variance of each feature in the sample is determined, and the features are ranked according to their magnitude of variance. By default, features with a variance of 0 are removed, i.e., the sample features do not change. In this paper, the threshold value is 0.8, and the number of features is not set. The final result is that all features with variance values between 0.9 and 1 are more significant than the set threshold value, so no features are removed.
In summary, 30 features were selected from the multidimensional feature library to form the auxiliary dataset for model training and prediction. The specific features are shown in Table 3.

Feature Standardization
After filtering of the features, the value data are normalized, and the dataset is scaled to the interval of [0, 1] to eliminate the possible problems caused by the unit differences and different magnitudes among the multidimensional feature values. The standardization process is divided into two main steps: decentering the mean (setting the mean to 0) and scaling the variance (setting the variance to 1). To fairly assess the role of eigenvalues in population spatialization, the eigenvalues are standardized before building the model according to the following equation.
where x mean denotes the average value of the data and n represents the number of data points, which in the formula indicates the amount of data associated with one eigenvalue.

Model Building and Training
A total of 31 feature values in six categories, including building features, NTL features, land use type features, water system features, road features, and POI features, are used as the independent variable dataset, and the number of people in buildings is used as the dependent variable. Python and Scikit-learn [56], a third-party open-source machine-learning algorithm package, are used as the basis for programming, and an RF regression algorithm is used to construct a population spatialization model and perform model training.
In constructing the model, a sampling method with replacement is used. Of the original samples, 85% are used as the training dataset, and the remaining 15% constitute the validation dataset, which is divided by a fixed random number to ensure the randomness of the dataset division and the reproducibility of the experiment for the adjustment and optimization of the model parameters.
When constructing the model, it is necessary to determine the optimal hyperparameters, and in this experiment, a grid search approach was chosen as the parameter optimization method. The grid search method has a relatively slow run time, but after cross-validation, the results are highly reliable. Based on the optimal parameter values returned from the grid search method, the optimal values of each parameter applied in the RF regression algorithm to construct the population spatialization model are shown in Table 4.  20 18 In this RF model, the first three parameters are the RF framework parameters, among which the bootstrap parameter indicates whether bootstrap samples are used when building trees. n_estimators is the number of trees in the forest, which impacts model performance; when this parameter is too small, underfitting will occur, the run time will be long, and the modeling efficiency will be reduced. oob_score indicates whether out-of-bag samples are used to estimate the generalization score and can be used to evaluate the model's strengths and weaknesses, validate the model, reduce time consumption, and improve the modeling accuracy.
The last four parameters are decision tree parameters, which mainly control the growth process of a single decision tree in the RF. max_features is the maximum number of features to consider when constructing a decision tree. max_depth is the maximum depth of the decision trees. min_samples_split controls subtree splitting; when the number of samples at the middle node is lower than the selected parameter value, the tree stops growing, i.e., no more features are selected for division. min_samples_leaf controls the tree depth of decision trees; when the number of samples at a leaf node is lower than a given threshold, the tree is pruned.
When model training and parameter optimization are completed, the population spatialization model is constructed according to the optimal values of the determined model parameters.

RF Population Spatialization Results
The input dataset included 117,116 buildings with 30 features, and after RF model training, the regression model predicted the population in each building. According to the natural breaks grading method, the predicted result was divided into five levels, as shown in Figure 5.  Figure 6a shows the hexagonal bin plot and histogram of the prediction results. In this plot, the horizontal coordinate indicates the predicted population, the vertical coordinate indicates the actual population, and the color indicates the degree of overlap of the scattered points at the location; the darker the color, the higher the degree. Moreover, the histogram reflects the distribution of the number of people at the building scale. Figure 6b shows that the number of buildings with a population in the interval of (0, 3] is 96,625, accounting for 83% of all buildings. The number of buildings with a population in the interval of (3, 6] is 11,474, accounting for 9% of all buildings, and the number of buildings with a population above 15 is 2923, accounting for 2% of all buildings. Some scattered points along the vertical axis indicate that the predicted population is higher than the actual population, and the prediction deviation is within 30 people. Outside the interval of [0, 30), the predicted population is significantly lower than the population denoted by the diagonal line, which suggests that the model seriously underestimates the actual population; the prediction deviation is within the range of [10,70), which indicates that the RF regression algorithm produces a significant deviation when predicting populations for buildings with high population aggregation. The absolute error is calculated to evaluate the performance of the RF model, which is the difference between the predicted and true values: This variable reflects the absolute error in the population estimate for each building and can be used to analyze the sources of error based on local conditions. The error in the population of each building is divided into five levels and visualized by color differences, as shown in Figure 7. The number of buildings with prediction deviations in the interval of [0, 6) is 106,643, accounting for 91% of all buildings. The number of buildings with errors within the interval of [6,20) is 8592, accounting for 7% of all buildings, and the number of buildings with errors above 20 people is 1881, accounting for 2% of all buildings. The buildings with large deviations from the predictions are concentrated in the eastern part of the study area, the core area of population aggregation, indicating that the model yields poor predictions in densely populated areas.
Moreover, three indicators are chosen to evaluate the model and predicted values: the mean absolute error (MAE) is the average of the absolute error, which is used to evaluate the predicted data and actual data directly; the root mean squared error (RMSE) is the square root of the mean of the squared difference between the predicted value and the actual observation, which is often used as a measure of the prediction results of machine learning models; and the goodness-of-fit (R 2 ) is used to evaluate the fit of the model to the observed data, with a range of [0, 1].
The MAE indicator for the RF model is 2.52, indicating that the model is relatively accurate in predicting the population at the building scale in the study area. The RMSE indicator is 8.2, indicating that the variance of the population at the building scale is explained by the characteristics extracted from the multisource data. The R 2 indicator is 0.44, indicating that the model accurately fits 44% of the population data at the building scale in the study area.

Comparison with an ML Regression Model
To investigate the advantages of the RF population spatialization model at the building scale, we constructed an ML regression model [57] with the same feature set. The regularized Lasso method [58] was chosen to construct the ML regression model. Then, we compared and analyzed the population prediction results obtained with the two algorithms. Table 5 shows the variable information and corresponding coefficients used in the model. Based on these parameters, the prediction results of the multiple linear regression model are shown in Figure 8. The hexagonal bin plot of the multiple linear regression results is shown in Figure 9. As seen from the figure, most of the points are concentrated in the interval of [0, 40) and limited to [0,20). Some of the data in the interval are distributed along the vertical axis, indicating that the model overestimates the actual population, and the prediction deviates from the observations by 20-30 people. The scatter points outside this interval are significantly lower than the actual population, and the deviation is in the range of [30,80), indicating that the multiple linear regression algorithm underestimates the population; however, the degree of deviation is slightly higher than that of the RF regression algorithm.  for 84% of all buildings. Additionally, the number of buildings with errors in the [6,20) interval is 12,544, accounting for 11% of all buildings, and the number of buildings with errors of 20 or more people is 5740, accounting for 5% of all buildings. From the perspective of relative error, the prediction deviation for the multiple linear regression model in the [0, 6) interval is similar to that of the RF regression model ( Figure 11). However, the number of buildings with a prediction deviation of more than 20 people is significantly higher than that of the RF model. In the prediction results of the multiple linear regression model, the buildings with large deviations were concentrated in the northeastern part of the study area, and the RF algorithm does not simulate a similar result.  The image above shows the error comparison between the RF model and ML model. The vertical coordinate indicates the number of buildings, while the horizontal coordinates indicate the different intervals of absolute error values. The number of buildings with zero error in the RF model is more than twice the number in the ML model. In addition, based on the selected standard model metrics, the MAE for the ML model was 3.21, the RMSE was 9.8, and the R 2 was 0.18. These results were compared with the RF regression model, as shown in Table 6. The result indicates that the ML regression model performs poorly in terms of data prediction accuracy and model goodness of fit compared to the RF regression model from metric evaluation. In summary, compared with the ML regression algorithm, the population spatialization model of population data constructed based on the RF regression algorithm yields a better fitting result, smaller prediction bias values, and better relatively accurate results for population prediction in densely populated areas, thus reflecting a better overall model performance.

Discussion
We quantified the involvement of features in the model construction process from two perspectives: feature importance and feature contribution. Then, the relationship between features and the model was established to assess the model.

Feature Importance Analysis
The importance of each feature in the constructed population spatialization model was evaluated based on the mean decrease impurity (MDI) or Gini index, a commonly used evaluation index [59]. Suppose there are m features X 1 , X 2 , . . . , X m . The Gini coefficient score VIM j for the jth feature X j at the nodes of each decision tree in the RF is calculated as where GI is the Gini coefficient value, k is the feature category, m is the node, and ρ mk is the proportional contribution of category k at node m. Thus, the amount of change in the Gini index generated by the branching of feature X j at node m is where VIM is the feature importance score and GI l and GI r represent the amount of change in the Gini coefficient after feature j branches at node m, respectively. Assuming that the nodes for which feature X j influences the construction of decision tree i are in set M, the importance of feature X j to decision tree i is On this basis, assuming that there are n trees in the RF, the importance of the influence of feature X j on all decision trees in the RF construction process is After the importance scores of all the features are calculated, the scores are normalized to obtain the final score for each feature, as shown below for the example of feature Xj: According to the MDI method, the importance of the constructed RF model was calculated, and the results are shown in Figure 12. Notably, the building footprint has the highest importance score and has the most significant influence on the features; the next-most-important features are self-service features, sports facilities features, and road features. The three feature types that have the lowest influence on the model are the total water area features, total system length features, and land use type features.

Feature Contribution Analysis
This experiment is based on the Boruta algorithm in the R language [60] and is performed to assess the degree of feature contributions. The features used to construct the RF model are the explanatory variables, and the actual population of buildings in the study area is the explanatory variable. The Boruta algorithm is used to evaluate the correlation of features with the dependent variable, and features with high correlations with the dependent variable are filtered and removed; rather than focusing solely on the impact of features on model performance improvement, this approach enhances the understanding of feature contributions [60].
After the algorithm is used to quantify the degree of feature contributions, the correlations between the explanatory variables and the explained variables are assessed by comparing the median feature Z score (Z score) with the median Z score of the best attributes. The obtained results are visualized in graphical form, as shown in Figure 13.
In the figure, green denotes the correlation between the corresponding feature and the population distribution, and the features with high correlation rankings include the building footprint, POIs, and land use, among others. The features with the slightest influence are the total water area, the total water system length, and the maximum water system length.

Conclusions
Most previous studies on mapping population distributions have fused multisource remote sensing data and big geospatial data such as POI data to decompose census data. These methods have limitations for economically underdeveloped countries and regions, which do not have access to confidential remote sensing data and cannot conduct largescale population censuses. This study addresses this problem by fusing multiple sources of easily accessible geospatial data and successfully mapping the building-level population distribution in Lin'an using the RF algorithm. First, 30 features related to population activities are extracted from the easily accessible multisource geographic dataset and trained using the RF algorithm, and the results are compared with actual values and ML models. The RF model has better performance than the ML model in terms of the absolute error, MAE, RMSE, and R 2 . Furthermore, the results of feature importance and feature contribution analysis show that only a few features contribute significantly to the model prediction results, and other features play a fine-tuning role in the prediction results, among which the building footprint is the most important feature and is highly correlated with the population distribution.
This study also has some shortcomings. The multisource geographic data sources used in the experiments were released at inconsistent times: building data, road network data, water system data, and land use data were collected in 2017; POI data were updated in 2019; and NTL data were collected in 2013. The data will change over time, leading to bias in model prediction in local areas. Future work will introduce additional geospatial data and improve the algorithm to increase the accuracy of the prediction model in densely populated areas.  Data Availability Statement: The Finer Resolution Observation and Monitoring-Global Land Cover data are available at the link http://data.ess.tsinghua.edu.cn/ (accessed on 1 December 2021); the MSP-OLS nighttime lights imagery data are available at the link https://ngdc.noaa.gov/eog/ (accessed on 1 December 2021); actual population data are confidential and cannot be uploaded. Other related data and codes that support the findings of this study are available with identifiers at the link https://doi.org/10.5281/zenodo.6275126 (accessed on 25 February 2022).