Local Population Mapping Using a Random Forest Model Based on Remote and Social Sensing Data: A Case Study in Zhengzhou, China

: High-resolution gridded population data are important for understanding and responding to many socioeconomic and environmental problems. Local estimates of the population allow o ﬃ cials and researchers to make a better local planning (e.g., optimizing public services and facilities). This study used a random forest algorithm, on the basis of remote sensing (i.e., satellite imagery) and social sensing data (i.e., point-of-interest and building footprint), to disaggregate census population data for the ﬁve municipal districts of Zhengzhou city, China, onto 100 × 100 m grid cells. We used a statistical tool to detect areas with an abnormal population density; e.g., areas containing many empty houses or houses rented by more people than allowed, and conducted ﬁeld work to validate our ﬁndings. Results showed that some categories of points-of-interest, such as residential communities, parking lots, banks, and government buildings were the most important contributing elements in modeling the spatial distribution of the residential population in Zhengzhou City. The exclusion of areas with an abnormal population density from model training and dasymetric mapping increased the accuracy of population estimates in other areas with a more common population density. We compared our product with three widely used gridded population products: Worldpop, the Gridded Population of the World, and the 1-km Grid Population Dataset of China. The relative accuracy of our modeling approach was higher than that of those three products in the ﬁve municipal districts of Zhengzhou. This study demonstrated potential for the combination of remote and social sensing data to more accurately estimate the population density in urban areas, with minimum disturbance from the abnormal population density.

to disaggregate census data in the Zhujiang delta area, China. With the increase of both type and amount of ancillary data, machine learning methods such as RF are much more needed to handle these high-dimensional, high-resolution data.
Another important but usually neglected issue in population modeling is that population distribution may be affected by unobserved reasons. In some cases, the fundamental assumption that population density should be similar in similar types of parcels (a type is usually a function of many modelable factors) may be broken by some abnormal and non-modelable reasons, such as vacant houses and houses being rented by more people than allowed. Such issues are difficult to avoid in large-scale population modeling projects, but could and should be avoided in local population modeling activities, as detecting and investigating areas with an abnormal population density is also an important and useful practice in city planning and management. However, to the knowledge of the authors, this issue has not been considered in previous efforts of population modeling.
In this study, we employed RF as a semi-automated dasymetric modeling approach, combining a variety of ancillary data to redistribute census data from the county level (also named "district level" in urban regions) to 100 m grid cells in Zhengzhou, China. We assessed the performance of the building footprint and the POI data in population modeling in our study area, and used a statistical tool to detect areas with an abnormal population density, which was validated by our field work. Such areas were then excluded from model training and dasymetric mapping to increase the accuracy of population estimates in other areas with a more common population density. The accuracy of our modeling approach was assessed by comparing it with three major gridded population products that cover Zhengzhou: Worldpop, the Gridded Population of the World (GPW), and the 1-km Grid Population Dataset of China (GPC) [35]. Zhengzhou is geographically at a pivotal location in China, and, as of 2018, has been ranked 2nd among all Chinese cities regarding the population density in urban areas. The city has also recently been officially named the 8th central city; after Beijing, Shanghai, Guangzhou, Chongqing, Tianjin, Wuhan, and Chengdu. Therefore, there is an urgent demand for establishing a benchmark of high-resolution population distribution, so population allocation could be better considered in the planning of this rapidly growing city. This study would significantly contribute to both methodology and practice of urban planning.

Study Area
Zhengzhou is the capital of Henan province, a key economic region in the central plains and a crucial transportation junction in China. Extending from about 112 • 42 to 114 • 14 E and 34 •

Datasets
The census of 2010 at county/district and township levels (equivalent to level 3 and 4 of the Global Administrative Unit Layer) was obtained from the China National Bureau of Statistics [36] as the population reference, for the purpose of population redistribution and accuracy assessment, respectively. The county/district-and town-level administrative boundaries were obtained from the Henan Administration of Surveying Mapping and Geoinformation. The relative accuracy of our gridded population product was compared to the relative accuracy of three population datasets: Worldpop (mainland China dataset), the GPW version 4, and the GPC. Specifically, WorldPop (https://www.worldpop.org/) used county level census data for population redistribution and provides gridded population products with the finest spatial resolution (i.e., 100 m) in China. GPW version 4 is presently the most widely used gridded population product, which employed township level census data as population input [17]. The GPC [35] held more local perspectives introduced by the Chinese Academy of Sciences and used county level census data for dasymetric mapping. The spatial resolution of both GPW version 4 and GPC are 1km.
The ancillary data used for population modeling included seven satellite-derived raster datasets (i.e., consisting of grid cells). The LULC data showed strong relationships with the population distribution, particularly those types indicating human settlements [1,37,38]. In this study, the BaseVue 2013 (http://www.mdaus.com/products/land-cover-products) 30 m land cover dataset, which contains 14 types of land cover, was used as ancillary data. The Net Primary Productivity (NPP) is a key driving force of human distribution [39]; we included the 2010 MODIS 17 A3 annual NPP as our NPP data [40]. Furthermore, the precipitation, temperature, slope angle, and elevation of an area also affect human settlement patterns to some extent, which may directly relate to the population distribution [41]. In this case, we employed digital elevation data and its derived slope data from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM), as well as the WorldClim/BioClim 1970-2000 mean annual precipitation (BIO12) and mean annual temperature (BIO1) data [42]. Moreover, before the Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) imagery was produced in 1992, the relationship between quantitative population and NTL had been revealed by several researchers [25,43]. NTL is a straighter way to indicate population distribution. We employed the Suomi National Polar-Orbiting Partnership Visible Infrared Imaging Radiometer Suite (NPP-

Datasets
The census of 2010 at county/district and township levels (equivalent to level 3 and 4 of the Global Administrative Unit Layer) was obtained from the China National Bureau of Statistics [36] as the population reference, for the purpose of population redistribution and accuracy assessment, respectively. The county/district-and town-level administrative boundaries were obtained from the Henan Administration of Surveying Mapping and Geoinformation. The relative accuracy of our gridded population product was compared to the relative accuracy of three population datasets: Worldpop (mainland China dataset), the GPW version 4, and the GPC. Specifically, WorldPop (https://www.worldpop.org/) used county level census data for population redistribution and provides gridded population products with the finest spatial resolution (i.e., 100 m) in China. GPW version 4 is presently the most widely used gridded population product, which employed township level census data as population input [17]. The GPC [35] held more local perspectives introduced by the Chinese Academy of Sciences and used county level census data for dasymetric mapping. The spatial resolution of both GPW version 4 and GPC are 1km.
The ancillary data used for population modeling included seven satellite-derived raster datasets (i.e., consisting of grid cells). The LULC data showed strong relationships with the population distribution, particularly those types indicating human settlements [1,37,38]. In this study, the BaseVue 2013 (http://www.mdaus.com/products/land-cover-products) 30 m land cover dataset, which contains 14 types of land cover, was used as ancillary data. The Net Primary Productivity (NPP) is a key driving force of human distribution [39]; we included the 2010 MODIS 17 A3 annual NPP as our NPP data [40]. Furthermore, the precipitation, temperature, slope angle, and elevation of an area also affect human settlement patterns to some extent, which may directly relate to the population distribution [41]. In this case, we employed digital elevation data and its derived slope data from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM), as well as the WorldClim/BioClim 1970-2000 mean annual precipitation (BIO12) and mean annual temperature (BIO1) data [42]. Moreover, before the Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) imagery was produced in 1992, the relationship between quantitative population and NTL had been revealed by several researchers [25,43]. NTL is a straighter way to indicate population distribution. We employed the Suomi National Polar-Orbiting Partnership Visible Infrared Imaging Radiometer Suite (NPP-VIIRS) 500 m resolution NTL data as our ancillary data, which has a higher spatial resolution than DMSP-OLS's 1 km [44].
We also used vector data (i.e., consisting of geometric shapes such as points, lines, and polygons) as ancillary data, including POI, roads and buildings data. The POIs are places of various categories with location information, and some of them are related to human activities [29]. Road networks were obtained from the AutoNavi Map Service, which is one of the largest web map service providers in China [45]. Building footprint and height are very useful for an accurate population estimation, especially in urban regions with a dense population [2,46,47]. In Zhengzhou metropolitan area, these were acquired from AutoNavi Map Services (https://www.amap.com/). Ancillary data used in this study is shown in Table 1.

Data Preparation
The ancillary data were processed to enrich information relevant to population distribution. Since jiedao level census data (transformed into logarithmic population density as response variable) was the finest official population data we could get, the information extracted from ancillary data was summarized at jiedao level and used as independent variables. Different strategies were employed based on the type of data. Areas of different land cover types and the distance to these land cover types can denote human presence from different perspectives. In every jiedao, each type of land cover (e.g., general agricultural land, shrub, urban) was extracted, after which its proportion and the mean value of the Euclidean distance was calculated. For the other RS data (i.e., NPP, NTL, DEM, slope, temperature, and precipitation), the mean value of each dataset in every jiedao was calculated. The distance to buildings, overall building volume (the total of volumes of all buildings in a jiedao), and building volume density (m 3 /km 2 , overall building volume in each jiedao divided by areas of each jiedao) in each jiedao were attained. Moreover, categories of POIs that may relate to population distribution were first selected based on a literature review [24,29,30]. Twenty categories of POIs, including residential communities, educational locations, hospitals (clinics), parking lots, parks, government buildings, airports, railway stations, bus stations, motor passenger stations, gas stations, service zones of highways, toll stations, banks, commercial buildings, retails, hotels, restaurants and entertainments, companies, and others (e.g., small business, museum) were selected. Each category of POI was analyzed by Euclidean distance, since the radial distance from the nearest POI had been found to be related to population density [20,29,48]. In order to attain the road density in each jiedao, the road networks were processed by a road density function [49] (1): where RSD represents the road network density (km/km 2 ), and Nr, Nne, Npe, Ncr, and Ntr stand for the length of railroads, national roads, provincial roads, county roads, and jiedao level roads, respectively, in each jiedao. A is the area of each jiedao and the numbers of 3, 3, 2, 1, and 0.4 are conversion ratios based on transport and traffic capacity of the different types of roads. All independent variables generated in this section were classified into three categories (i.e., RS and road variables, POI variables, and building footprint variables). The whole data preparation procedure is shown in a flowchart (i.e., Figure 2) and processed based on ArcGIS 10.2. and the mean value of the Euclidean distance was calculated. For the other RS data (i.e., NPP, NTL, DEM, slope, temperature, and precipitation), the mean value of each dataset in every jiedao was calculated. The distance to buildings, overall building volume (the total of volumes of all buildings in a jiedao), and building volume density (m 3 /km 2 , overall building volume in each jiedao divided by areas of each jiedao) in each jiedao were attained. Moreover, categories of POIs that may relate to population distribution were first selected based on a literature review [24,29,30]. Twenty categories of POIs, including residential communities, educational locations, hospitals (clinics), parking lots, parks, government buildings, airports, railway stations, bus stations, motor passenger stations, gas stations, service zones of highways, toll stations, banks, commercial buildings, retails, hotels, restaurants and entertainments, companies, and others (e.g., small business, museum) were selected. Each category of POI was analyzed by Euclidean distance, since the radial distance from the nearest POI had been found to be related to population density [20,29,48]. In order to attain the road density in each jiedao, the road networks were processed by a road density function [49] (1): where RSD represents the road network density (km/km 2 ), and Nr, Nne, Npe, Ncr, and Ntr stand for the length of railroads, national roads, provincial roads, county roads, and jiedao level roads, respectively, in each jiedao. A is the area of each jiedao and the numbers of 3, 3, 2, 1, and 0.4 are conversion ratios based on transport and traffic capacity of the different types of roads. All independent variables generated in this section were classified into three categories (i.e., RS and road variables, POI variables, and building footprint variables). The whole data preparation procedure is shown in a flowchart (i.e., Figure 2) and processed based on ArcGIS 10.2.

Random Forest Model
The RF is an ensemble, nonparametric classifier based on a set of Classification and Regression Trees (CARTs) [32]. CARTs in RF are created on the basis of bootstrap samples and a user-defined number of features, which indicate that each CART is independently constructed by drawing samples with replacement from training datasets, and each node of a CART is split using the largest information gain among a subset that is randomly chosen from variables [50]. After the construction of CARTs, a simple majority vote is conducted for classification (arithmetic mean for regression).

Random Forest Model
The RF is an ensemble, nonparametric classifier based on a set of Classification and Regression Trees (CARTs) [32]. CARTs in RF are created on the basis of bootstrap samples and a user-defined number of features, which indicate that each CART is independently constructed by drawing samples with replacement from training datasets, and each node of a CART is split using the largest information gain among a subset that is randomly chosen from variables [50]. After the construction of CARTs, a simple majority vote is conducted for classification (arithmetic mean for regression). Because of the bootstrap, the samples involved in the construction of CARTs normally account for two thirds of the training dataset; the remaining one third is called Out-Of-Bag (OOB) data, which can be used for an unbiased internal cross-validation that estimates the performance of the constructed model. The OOB data can also involve in a measurement of the importance score of each variable in RF [32]. By comparing the OOB accuracy of the model before and after the permutation of the variable, an indicator called Mean Decrease Accuracy (MDA) can be used for variable importance evaluation. Important variables cause a higher OOB accuracy decline than unimportant variables [31]. Another indicator of variable importance is the Mean Decrease Impurity (MDI), which measures the effectiveness of variables at reducing uncertainty when creating CARTs in RF [51]. However, since MDI produces more biases while evaluating continuous variables, we decided to use MDI as an aiding Remote Sens. 2020, 12, 1618 7 of 17 indicator and MDA as the main indicator to rank the importance of variables [52]. The advantages of RF have been explored in several investigations. For example, compared to simple decision trees, it is less affected by noises and can avoid overfitting, because of the way of sampling and the voting of multiple decision trees [53]. In comparison with multi-linear regression, RF is less susceptible to multicollinearity, which makes it more efficient in terms of feature selection [50,54].
To be able to incorporate more training units in the RF algorithm and to include sufficient information indicating population distribution in suburban regions, 28 jiedaos outside the five municipal districts and within the scope of Zhengzhou metropolitan area were included in the study area. Together with the 66 jiedaos in the five municipal districts, a total of 94 jiedaos were used as RF input. While building the RF model, it is important to choose variables that can improve the model accuracy before taking further steps. In this study, we simply selected variables with positive MDA scores. Binary regression was used to understand the direction of the association between each important variable and the population density. After that, two significant parameters named mtry and ntree were adjusted. mtry represents the usage of numbers of variables during a decision tree splitting, and ntree is the number of decision trees to be generated in the model. In this study, mtry was first decided according to the lowest OOB error. Furthering that, ntree was confirmed by the lowest mean squared error and selected mtry. After model establishment, variable values within each 100 m grid were used to predict the population density in each grid. This gridded population density layer was then employed as a weighting layer for population redistribution. To highlight the usefulness of POI and building footprint data in population modeling, we additionally employed three combinations of variables (i.e., RS and road variables; RS, and road and POI variables; RS, and road and building footprint variables) as RF input to generate three other weighting layers through the procedure described above. Each of these weighting layers was then used to redistribute census data and therefore a comparison was made.

Dasymetric Mapping
The weighting layers generated by RF were devoted to disaggregating census data. In this stage, we redistributed county level census data into 100 m resolution raster surfaces by Function (2): where P p is the estimated people count per pixel, PD p is the people count per pixel produced by RF, r represents each county, C r is the census population count in counties, and PD r is the people count in counties summed by PD p . In Function (2), district/county level instead of jiedao level census data was redistributed because of the accuracy assessment requirements. The accuracy assessment requires a finer level census and boundary data than the ones used as inputs to dasymetrically disaggregate the administrative unit-based population counts. However, in China, village and community (a finer level than jiedao) boundaries change frequently due to rapid urbanization, and there are few chances to access precise and real-time village/community boundary data. Therefore, we chose county level census data for the population redistribution and employed jiedao level census data for the accuracy assessment. Additionally, due to this reason, we only redistributed census data for the five municipal districts, rather than the whole Zhengzhou metropolitan area, since some counties outside of the five municipal districts were partly out of the Zhengzhou metropolitan area, which could lead to an incomplete population reference for these counties.

Abnormal Detection
A fundamental assumption of any dasymetric population modeling is that the population density should be similar in similar areas, which could be characterized by the used ancillary data [2,10]. However, this assumption may not always hold true in the real world, as there may be some reasons Remote Sens. 2020, 12, 1618 8 of 17 that could invalidate this assumption in certain areas, e.g., vacant houses, and houses being rented by more people than allowed, which cannot be detected and modelled by the used ancillary data. Including such areas with an extremely abnormal population density at the model training stage could obscurely and unpredictably lead to a distortion of the predicted population density (i.e., the weighting layer) in most of the areas with a more common population density. In addition, since areas with an extremely abnormal population density could not be well modelled by the used ancillary data, an overor under-estimated population density in these areas could lead to a wrongly redistributed population across all areas, and hence unfairly affect the accuracy of the population products, without providing any benefit to local planning and management sectors. Therefore, detecting and excluding such areas with an extremely abnormal population density from model training and dasymetric mapping should be able to not only provide potentially useful information to relevant sectors, but also mitigate over-or under-estimation issues in most of the areas with a normal population density.
Boxplot as a statistical tool is used to visually summarize groups of data, and it could also detect abnormal values among a group of data through their quartiles [55]. The values that are larger than the upper quartile by at least k times the interquartile range (i.e., range of the upper and lower quartile) or smaller than the lower quartile by at least k times the interquartile range, are considered abnormal values [56]. According to Tukey [57], the parameter k can be set as 1.5 and 3 to illustrate abnormal values and extremely abnormal values, respectively. In this study, the values of the differences between the actual and predicted population in each jiedao were mapped in a boxplot. Jiedaos with extremely high values (both positive and negative) were considered areas with an abnormal population density, which were wrongly populated by our ancillary data. Considering the poor quality of ancillary data could also raise the differences between the actual and predicted population in some areas. We finally chose k as 3 to minimize this disturbance and only considered areas with values with an extremely high difference as areas with an abnormal population density. The relationship of this stage in our study is shown in a flowchart (i.e., Figure 3).

Abnormal Detection
A fundamental assumption of any dasymetric population modeling is that the population density should be similar in similar areas, which could be characterized by the used ancillary data [2,10]. However, this assumption may not always hold true in the real world, as there may be some reasons that could invalidate this assumption in certain areas, e.g., vacant houses, and houses being rented by more people than allowed, which cannot be detected and modelled by the used ancillary data. Including such areas with an extremely abnormal population density at the model training stage could obscurely and unpredictably lead to a distortion of the predicted population density (i.e., the weighting layer) in most of the areas with a more common population density. In addition, since areas with an extremely abnormal population density could not be well modelled by the used ancillary data, an over-or under-estimated population density in these areas could lead to a wrongly redistributed population across all areas, and hence unfairly affect the accuracy of the population products, without providing any benefit to local planning and management sectors. Therefore, detecting and excluding such areas with an extremely abnormal population density from model training and dasymetric mapping should be able to not only provide potentially useful information to relevant sectors, but also mitigate over-or under-estimation issues in most of the areas with a normal population density.
Boxplot as a statistical tool is used to visually summarize groups of data, and it could also detect abnormal values among a group of data through their quartiles [55]. The values that are larger than the upper quartile by at least k times the interquartile range (i.e., range of the upper and lower quartile) or smaller than the lower quartile by at least k times the interquartile range, are considered abnormal values [56]. According to Tukey [57], the parameter k can be set as 1.5 and 3 to illustrate abnormal values and extremely abnormal values, respectively. In this study, the values of the differences between the actual and predicted population in each jiedao were mapped in a boxplot. Jiedaos with extremely high values (both positive and negative) were considered areas with an abnormal population density, which were wrongly populated by our ancillary data. Considering the poor quality of ancillary data could also raise the differences between the actual and predicted population in some areas. We finally chose k as 3 to minimize this disturbance and only considered areas with values with an extremely high difference as areas with an abnormal population density. The relationship of this stage in our study is shown in a flowchart (i.e., Figure 3).

Accuracy Assessment
Worldpop, GPW, and GPC were employed to compare with our product. Furthermore, gridded population datasets generated before removing the areas with an abnormal population density from model training and dasymetric mapping were used to assess the improved accuracy of the estimated population in other areas, with the more common population density remaining after the exclusion of areas with an abnormal population density. Gridded population datasets generated by the three combinations of variables were used to examine the performance of POI and building footprint data in the population modeling. The population grids of each dataset were aggregated to jiedao Remote Sens. 2020, 12, 1618 9 of 17 level, after which the difference between each dataset and the census data was calculated. The Root Mean Square Error (RMSE) (3) and the Mean Absolute Error (MAE) (4) were used to quantify the difference [58,59].
In Functions (3) and (4); N represents the number of jiedaos (one level lower than county/district) included in the accuracy assessment, P estimated i is the estimated population in a jiedao i, and P observed i is the population of the census data in jiedao i.

Abnormal Detection
For each gridded population dataset generated by different combinations of variables, a boxplot was drawn to evaluate the differences between the estimated and actual population numbers in each jiedao. Results showed that one jiedao named Jichenglu was an abnormal unit with an extremely high positive difference value ( Figure 4). In addition, since this jiedao was also detected as an area with an abnormal population density without using our building footprint data, temporally mismatched building footprint data were not the main reason for such a severe mis-prediction.
Worldpop, GPW, and GPC were employed to compare with our product. Furthermore, gridded population datasets generated before removing the areas with an abnormal population density from model training and dasymetric mapping were used to assess the improved accuracy of the estimated population in other areas, with the more common population density remaining after the exclusion of areas with an abnormal population density. Gridded population datasets generated by the three combinations of variables were used to examine the performance of POI and building footprint data in the population modeling. The population grids of each dataset were aggregated to jiedao level, after which the difference between each dataset and the census data was calculated. The Root Mean Square Error (RMSE) (3) and the Mean Absolute Error (MAE) (4) were used to quantify the difference [58,59].
In function (3) and (4); N represents the number of jiedaos (one level lower than county/district) included in the accuracy assessment, is the estimated population in a jiedao i, and is the population of the census data in jiedao i.

Abnormal Detection
For each gridded population dataset generated by different combinations of variables, a boxplot was drawn to evaluate the differences between the estimated and actual population numbers in each jiedao. Results showed that one jiedao named Jichenglu was an abnormal unit with an extremely high positive difference value (Figure 4). In addition, since this jiedao was also detected as an area with an abnormal population density without using our building footprint data, temporally mismatched building footprint data were not the main reason for such a severe mis-prediction.

Accuracy Assessment
The estimated population numbers were used to compare with the census population counts at jiedao level to assess the performance of POI and building footprint data. The results showed that whether we exclude the area with an abnormal population density from the study area, the population datasets generated with POI and building footprint data achieved a better RSME and MAE than those datasets only produced by RS and road data ( Figure 5). Higher correlations between the estimated and census population were also found in datasets generated with POI and building footprint data. Temporally mismatched building footprint data performed relatively worse than POI data. Moreover, after excluding the area with an abnormal population density from both model training and dasymetric mapping, the accuracy of the estimated population in most areas with a more common population density was improved. As for the two datasets generated by all variables; the one that excluded the area with an abnormal population density achieved a better accuracy (RMSE = 24,956.93, MAE = 19,420.04) than the other (RMSE = 27,267.04, MAE = 21,352.75). Similar results were also found in datasets generated by other combinations of variables. The result of this study, which were produced after removing the area with an abnormal population density from the study area, was compared with three other widely used products. Results showed that the accuracy of our study was better than that of the others in both RMSE and MAE ( Table 2). road and point of interest variables; all variables).

Accuracy Assessment
The estimated population numbers were used to compare with the census population counts at jiedao level to assess the performance of POI and building footprint data. The results showed that whether we exclude the area with an abnormal population density from the study area, the population datasets generated with POI and building footprint data achieved a better RSME and MAE than those datasets only produced by RS and road data ( Figure 5). Higher correlations between the estimated and census population were also found in datasets generated with POI and building footprint data. Temporally mismatched building footprint data performed relatively worse than POI data. Moreover, after excluding the area with an abnormal population density from both model training and dasymetric mapping, the accuracy of the estimated population in most areas with a more common population density was improved. As for the two datasets generated by all variables; the one that excluded the area with an abnormal population density achieved a better accuracy (RMSE = 24,956.93, MAE = 19,420.04) than the other (RMSE = 27,267.04, MAE = 21,352.75). Similar results were also found in datasets generated by other combinations of variables. The result of this study, which were produced after removing the area with an abnormal population density from the study area, was compared with three other widely used products. Results showed that the accuracy of our study was better than that of the others in both RMSE and MAE ( Table 2).    Four residential communities in the jiedao with an abnormal population density were selected as samples for our field survey of abnormal population density validation ( Figure 6). The field survey was conducted on 16 May, 2017, and was based on the interviews with residents and the staff of property management in each community. According to the property management records and the interviews, we found that few people dwelled in these communities in 2010, since the constructions of these residential communities were just completed in 2008, 2009, and 2010, and people were gradually moved in after decoration. As such, the population density in these communities was extremely low compared to communities in other jiedaos in 2010. Furthering that, according to the literature review, the jiedao with an abnormal population density was a key development area planned by the government and experienced fast urbanization during the period from 1995 to 2010 [60]. Multiple commercial residential buildings and corresponding facilities with a low occupation rate were built [61,62]. Four residential communities in the jiedao with an abnormal population density were selected as samples for our field survey of abnormal population density validation ( Figure 6). The field survey was conducted on 16 May, 2017, and was based on the interviews with residents and the staff of property management in each community. According to the property management records and the interviews, we found that few people dwelled in these communities in 2010, since the constructions of these residential communities were just completed in 2008, 2009, and 2010, and people were gradually moved in after decoration. As such, the population density in these communities was extremely low compared to communities in other jiedaos in 2010. Furthering that, according to the literature review, the jiedao with an abnormal population density was a key development area planned by the government and experienced fast urbanization during the period from 1995 to 2010 [60]. Multiple commercial residential buildings and corresponding facilities with a low occupation rate were built [61,62].

Variable Importance
The thirty most important variables were ranked by MDA and MDI. The direction of the association between each variable and the population density was indicated by the coefficient from binary regression (Figure 7). According to the rank of MDA, distance to the POIs of residential communities, parking lots, government buildings, and distance to roads contributed the most to the population estimation in Zhengzhou, China. A negative association between each of these variables and the population density was found. Building density and overall building volume also contributed to the model accuracy, which were positively associated with population density. Eight

Variable Importance
The thirty most important variables were ranked by MDA and MDI. The direction of the association between each variable and the population density was indicated by the coefficient from binary regression (Figure 7). According to the rank of MDA, distance to the POIs of residential communities, parking lots, government buildings, and distance to roads contributed the most to the population estimation in Zhengzhou, China. A negative association between each of these variables and the population density was found. Building density and overall building volume also contributed to the model accuracy, which were positively associated with population density. Eight of the ten most important variables were originally derived from social sensing data. The proportion of general agricultural land and DEM were the only two RS based variables in the top ten list. Road density, measured by the road density function, was not contributing enough to be included in the list of the thirty most important variables.
Remote Sens. 2019, 11, x FOR PEER REVIEW 12 of 17 of the ten most important variables were originally derived from social sensing data. The proportion of general agricultural land and DEM were the only two RS based variables in the top ten list. Road density, measured by the road density function, was not contributing enough to be included in the list of the thirty most important variables.

Discussion
The study provided an example of small-area population mapping using an RF algorithm, on the basis of RS and social sensing data in Zhengzhou, which is a typical metropolitan area in China. The variables that were important to the population distribution and their associations with population density were demonstrated. A boxplot of the differences between estimated and actual population counts in each jiedao was used to detect jiedaos with an abnormally high/low population density, which was validated in the field work. After excluding that jiedao, the final population product in the other areas that resulted from this study outperformed three widely used population products (i.e., Worldpop, GPW, and GPC).
The distance to the surrounding residential communities, parking lots, government buildings, roads, and hotels were found to be the top five influential factors of population distribution, which were all negatively associated with population density. The proportions of general agricultural land were also useful to characterize urban and suburban areas with a different population density. With some of the RS data contributed to our study, detailed social sensing data, such as categories of POIs and building footprint data, can provide socioeconomic factors and finer human settlement information that are relevant to human presence, and hence accurately depict the local population distribution [29,30]. The negative association between road length and population density in our study area was mainly because jiedaos in urban areas with a higher population density are normally smaller than jiedaos in suburban regions with a lower population density; and larger jiedaos tend to have longer total length of roads. In addition, the positive association between slope and population Figure 7. The thirty most important variables ranked according to their contribution to the random forest modeling, measured by Mean Decrease Accuracy (MDA) and Mean Decrease Impurity (MDI), with direction of the association between each ranked variable and the population density. Variable names: xx_pdst-distance to a category of points of interest; xx_ldst-distance to a landcover type; xx_%-proportion of a landcover type; xx_den-density of ?; buildFootprint_vol-volume of all buildings in a jiedao; xx_len-length of ?; DEM-Digital Elevation Model; NTL-Nighttime Lights; npp-Net Primary Productivity.

Discussion
The study provided an example of small-area population mapping using an RF algorithm, on the basis of RS and social sensing data in Zhengzhou, which is a typical metropolitan area in China. The variables that were important to the population distribution and their associations with population density were demonstrated. A boxplot of the differences between estimated and actual population counts in each jiedao was used to detect jiedaos with an abnormally high/low population density, which was validated in the field work. After excluding that jiedao, the final population product in the other areas that resulted from this study outperformed three widely used population products (i.e., Worldpop, GPW, and GPC).
The distance to the surrounding residential communities, parking lots, government buildings, roads, and hotels were found to be the top five influential factors of population distribution, which were all negatively associated with population density. The proportions of general agricultural land were also useful to characterize urban and suburban areas with a different population density. With some of the RS data contributed to our study, detailed social sensing data, such as categories of POIs and building footprint data, can provide socioeconomic factors and finer human settlement information that are relevant to human presence, and hence accurately depict the local population distribution [29,30]. The negative association between road length and population density in our study area was mainly because jiedaos in urban areas with a higher population density are normally smaller than jiedaos in suburban regions with a lower population density; and larger jiedaos tend to have longer total length of roads. In addition, the positive association between slope and population density was primarily due to the relatively poor quality of DEM data. This is because the buildings in urban areas are influencing the accuracy of DEM data [63]; tall buildings in urban areas lead to higher values of slope. Moreover, the population distribution in some areas may be difficult to be explained by normal reasons. Some abnormal reasons, such as vacant houses and houses being rented by more people than allowed, violate the fundamental assumption of dasymetric population modeling that the population density should be similar in similar areas, which could be characterized by the used ancillary data. This violation can obscurely and unpredictably lead to an over-or under-estimation of the population, and hence unfairly affect the accuracy of the population products without providing any benefit to local planning and management sectors. By using a boxplot, one jiedao in our study area was found to be abnormally populated. It was supposed to have a high population density on the basis of the values of the important variables in this jiedao (e.g., close to residential communities, parking lots, etc.). However, because of the high housing vacancy rate, the actual population density in this jiedao was relatively low and extremely different from our prediction [61,62]. Such detection is useful for improving both model performance and accuracy by removing such abnormal units from modeling, but providing information to relevant sectors to potentially solve some other problems may have caused the abnormal population distribution.
The spatial population distribution can be somewhat reflected by many indicators, such as NTL [25], LULC [64], building volume [46,65], POI [30], slope angle, elevation of an area [20], etc. However, the varying performances of these data is a common problem in terms of predicting population distribution in different geographic and socioeconomic conditions. Therefore, choosing an optimal set of ancillary data is of prime importance. By employing RF, we could identify the variables that could improve the model accuracy with multicollinearity overcome. Additionally, the quality of ancillary data is important to population modeling. Our study employed POIs and building footprint data from the two largest map service companies in China (i.e., Baidu and AutoNavi), which provided more complete and reliable data for our study area than the volunteered OpenStreetMap used in WorldPop [66].
There were several limitations to our study. First, over-fitting may be an issue that affected the predictability of our model for future projection of the population. The limited number of jiedaos (i.e., a small sample size) could only provide a relatively small amount of information in our study area, which could restrict the predictability of our model in a larger area with more complex patterns of population distribution. This needs to be further confirmed in future studies with a larger number of areal units included (i.e., a large sample size), so traditional regression methods could be used to compare with machine learning methods. Additionally, due to the small sample size; the direction, and especially the significance of the associations between independent variables and population density, were just preliminarily explored by binary regression, instead of being estimated in multivariate regression. Second, the accuracy of the detection of areal units with an abnormal population density may be affected by its size. A larger areal unit tends to contain more errors (i.e., differences between estimated and actual population counts) during population modeling, which might be wrongly detected as an outlier by the boxplot. Therefore, results of abnormal detection could be further investigated and validated by other abnormal detection methods, such as the k-nearest neighbors algorithm, especially when field validation is not possible in large or inaccessible study areas. Moreover, since this study only detected one jiedao as an area with an abnormal population density, we were unable to build a model specifically for this area and therefore unable to make a better prediction for this particular area. With a larger study area, more abnormal population density units could be found and therefore modelled accurately. Third, the range of data values at the training stage (at the township level) was smaller than that at the prediction stage (at the 100 m grid cells), as extreme values were averaged in courser units. This may lead to an over-or under-estimation of the population density values [67]. It would be more suitable, whenever possible, to conduct model training at finer scales, such as at village or community levels.

Conclusions
The high-resolution population product developed in this study, on the basis of the RF method, and RS and social sensing data, outperformed three global population products (i.e., Worldpop, the Gridded Population of the World version 4, and the 1 km Grid Population Dataset of China). Building footprint data, combined with height information and categories of POI data, can increase the accuracy of the population estimates. Certain types of POI data, including residential communities, parking lots, government buildings, hotels, banks, hospitals, and gas stations were more useful predictors than building footprint data to model population distribution at a 100 m grid cell level. Our approach of population modeling that exclude areas with an abnormal population density (i.e., areas with an extremely different population density than other similar areas) from model training and dasymetric mapping, increased the accuracy of population estimates in most of the areas with a more common population density. Such an exclusion can be applied to larger areas with a more complex population distribution, in order to model population density with a minimum disturbance by abnormal population densities. This holds promise for many applications, including public health, environmental protection, regional planning and development, and policy making.