Gridded Population Maps Informed by Different Built Settlement Products

The spatial distribution of humans on the earth is critical knowledge that informs many disciplines and is available in a spatially explicit manner through gridded population techniques. While many approaches exist to produce specialized gridded population maps, little has been done to explore how remotely sensed, built-area datasets might be used to dasymetrically constrain these estimates. This study presents the effectiveness of three different high-resolution built area datasets for producing gridded population estimates through the dasymetric disaggregation of census counts in Haiti, Malawi, Madagascar, Nepal, Rwanda, and Thailand. Modeling techniques include a binary dasymetric redistribution, a random forest with a dasymetric component, and a hybrid of the previous two. The relative merits of these approaches and the data are discussed with regards to studying human populations and related spatially explicit phenomena. Results showed that the accuracy of random forest and hybrid models was comparable in five of six countries.


Summary
As of 2017, the global human population is estimated to be near 7.6 billion, demonstrating a global population growth of roughly 200 million since 2015 [1].By 2050, the human population is estimated to increase by at least 2 billion, with the largest global population growth per continent in Africa and Asia [1].This change is implicitly associated with increasing rates of urbanization, which are seen most prominently in highly populated low-and middle-income countries, which together account for 37% of projected population growth into 2050 [2].These global patterns of population change highlight the need for spatially explicit and comparable high-resolution gridded population datasets that accurately depict the spatial distribution of the residential human population and inform many fields, including infectious disease assessment [3][4][5], disaster response [6], adaptive strategies towards climate change mitigation [7,8] and many of the Millennium Development Goals [9].This need is met by a broad variety of gridded population techniques.

Data Description
This dataset provides a set of 54 different high-resolution, gridded population raters produced for the purposes of methodological and built area data product comparison.Gridded products represent population as people per pixel (ppp) at ~100 m resolution for recent census years in select countries.This includes Madagascar, Rwanda, and Malawi from Africa, Nepal, and Thailand from Southeast Asia, and Haiti from the Caribbean.The gridded population datasets depict population distribution under the constraints of 3 different approaches explored in Table 3. Population estimates are presented in GeoTIFF format along with corresponding metadata, covariate importance, explanations of variance, and model accuracy assessment where appropriate.Examples of model outputs are previewed in Figure 4.

Census Data
We use census data that represents the finest spatial resolution and most contemporary data that were publically available at the time of analysis.Retrieval of census data is made on request from country-specific National Statistics Offices.Census data are then matched to a country-specific GIS administrative level from GDAM (https://gadm.org/index.html) that is specific to the region and not comparable to units of the same level in different countries (Table 1) [22].To ensure a level of Data 2018, 3, 33 3 of 11 comparability between countries, the Average Spatial Resolution (ASR) was calculated as the square root of its surface area divided by the number of administrative units, representing the effective resolution units within the country [4].All models were run using a 2/3 aggregate of the finest available census data, in which a 1/3 random selection of units was dissolved with the neighbor sharing the longest border, as outlined in Figure 1.resolution units within the country [4].All models were run using a 2/3 aggregate of the finest available census data, in which a 1/3 random selection of units was dissolved with the neighbor sharing the longest border, as outlined in Figure 1.

Built Area Data
For the purposes of this study, the term Built Area is used to describe both urban and built-up datasets, all of which are assumed to be indicative of human settlement.To test the effectiveness of combined dasymetric and random forest methods, we chose three built area datasets obtained using different remote sensing techniques with different spatial resolutions and criteria under which builtarea is sensed.These publically available datasets include World Settlement Footprint (WSF), Global Human Settlement Layer (GHSL), and the Facebook Connectivity Lab's High-Resolution Settlement Layer (HRSL) (Table 2).
Table 2. Three primary built/human settlement datasets and supporting information.GHSL and HRSL datasets are accessible from their respective portals, while WSF is available upon request [23].

Built Area Data
For the purposes of this study, the term Built Area is used to describe both urban and built-up datasets, all of which are assumed to be indicative of human settlement.To test the effectiveness of combined dasymetric and random forest methods, we chose three built area datasets obtained using different remote sensing techniques with different spatial resolutions and criteria under which built-area is sensed.These publically available datasets include World Settlement Footprint (WSF), Global Human Settlement Layer (GHSL), and the Facebook Connectivity Lab's High-Resolution Settlement Layer (HRSL) (Table 2).
Table 2. Three primary built/human settlement datasets and supporting information.GHSL and HRSL datasets are accessible from their respective portals, while WSF is available upon request [23].The first, World Settlement Footprint (WSF), represents a global coverage of earth's land surface from the German Space Agency (DLR) Earth Observation Center based on Landsat 8 and Sentinel 1 optical and radar imagery for 2014-2015.The initial dataset was retrieved through personal communication with Thomas Esch and Mattia Marconcini and represents an initial version prior to public release [23,27].Second, the Global Human Settlement Layer (GHSL) represents a global built-up dataset that focuses on three primary products: built-up areas, population grids, and urban/rural classification.The derived built area classifications use a combination of supervised and unsupervised procedures on the panchromatic channel of Landsat 8. Three land cover types are identified over four primary epochs, as informed by ancillary data from GHSL partners [25].The global GHSL product is available on a global scale through the European Joint Research Center [28].For the Facebook Connectivity Lab population product, distribution is determined using a combination of supervised classification and computer vision techniques on composited DigitalGlobe imagery [29].Population distribution products may be downloaded for a limited number of countries as GeoTIFFs from CIESIN/FCL's associated High Resolution Settlement Layer (HRSL) project [26].It is worth noting that the proposed built datasets make no distinction between residential and commercial features, as limited by their remotely sensed methodology.

Additional Ancillary Data
A wide range of ancillary data are used as explanatory variables of the random regression forest used in Models 2 and 3, as outlined in Table 3.While the most recent and detailed covariates will produce the best models [20], the best data is often regional and not consistently available across the study area.Thus, the ancillary data products used represent readily available, high-quality data that was present for all countries.Three types of covariate data include categorical rasters, continuous rasters, and converted vector data as outlined in Table 4.

Data Production Workflow
The following section outlines the open-access archive of comparable, high-resolution datasets of gridded population distribution for the countries of Haiti (HTI), Madagascar (MDG), Malawi (MWI), Rwanda (RWA), Nepal (NPL), and Thailand (THA).These countries represent criteria of comparable human distribution, heterogeneous land-cover types, and diverse continental representation.Figure 2 highlights the production of population estimates from the three models, broadly categorized into five stages.

Data Production Workflow
The following section outlines the open-access archive of comparable, high-resolution datasets of gridded population distribution for the countries of Haiti (HTI), Madagascar (MDG), Malawi (MWI), Rwanda (RWA), Nepal (NPL), and Thailand (THA).These countries represent criteria of comparable human distribution, heterogeneous land-cover types, and diverse continental representation.Figure 2 highlights the production of population estimates from the three models, broadly categorized into five stages.The approach utilized here is adapted from published WorldPop random forest methodology that has been altered to suit this study needs [20].For an in-depth analysis of programmatic operation, please refer to the procedural documents stored in [37].The methods and scripts presented in this paper are from R 3.4.1,Python 2.7.8, and ESRI ArcMap 10.3.1.
The covariate selection and data preparation step has three primary phases of preparation, including built data processing, covariate standardization, and hydrofeature mask creation.
First, we process the three built areas mentioned in Table 2 into binary built feature classifications.Resampling via presence/non-presence occurs on the binary masks to create a consistent ~100 m resolution and standardized projection (WGS 84 geographic coordinate system) prior to model application.It is worth noting that the described preparation here applies only for those built areas that will be used to constrain the binary dasymetric and hybrid models (Table 3, Figure 3), and that remaining covariates are manipulated in the parameterization of the random forest model, as described in Forrest et al. 2015 [20].In addition to the independent built area layers, a fourth built area layer representing a combination of WSF, GHSL, and HRSL datasets provided a final dataset for comparison.By combining all built features, we increase the chance of false positives but simultaneously minimize errors of omission present in other built products.

2018, 3, x FOR PEER REVIEW 6 of 12
The approach utilized here is adapted from published WorldPop random forest methodology that has been altered to suit this study needs [20].For an in-depth analysis of programmatic operation, please refer to the procedural documents stored in [37].The methods and scripts presented in this paper are from R 3.4.1,Python 2.7.8, and ESRI ArcMap 10.3.1.
The covariate selection and data preparation step has three primary phases of preparation, including built data processing, covariate standardization, and hydrofeature mask creation.
First, we process the three built areas mentioned in Table 2 into binary built feature classifications.Resampling via presence/non-presence occurs on the binary masks to create a consistent ~100 m resolution and standardized projection (WGS 84 geographic coordinate system) prior to model application.It is worth noting that the described preparation here applies only for those built areas that will be used to constrain the binary dasymetric and hybrid models (Table 3, Figure 3), and that remaining covariates are manipulated in the parameterization of the random forest model, as described in Forrest et al. 2015 [20].In addition to the independent built area layers, a fourth built area layer representing a combination of WSF, GHSL, and HRSL datasets provided a final dataset for comparison.By combining all built features, we increase the chance of false positives but simultaneously minimize errors of omission present in other built products.Next, we cluster covariates into three groups depending on their subsequent transformations (Table 4).For example, the multi-class ESA land cover product classifications were separated into individual feature types and transformed using a distance to outer edge (DTE) calculation in ArcMap [30].To produce the DTE covariate, the target feature is loaded at ~100 m resolution, refined to show the feature class in question if multiple classifications are present and re-projected to a region specific Next, we cluster covariates into three groups depending on their subsequent transformations (Table 4).For example, the multi-class ESA land cover product classifications were separated into individual feature types and transformed using a distance to outer edge (DTE) calculation in ArcMap [30].To produce the DTE covariate, the target feature is loaded at ~100 m resolution, refined to show the feature class in question if multiple classifications are present and re-projected to a region specific UTM.The same distance to outer edge calculation was also used in the preparation of the primary built areas as specified in Table 4. Final covariates products match in regional extent, spatial resolution (100 m resolution), and country-specific UTM projection.
The last component is generating a hydrofeatures mask based on the European Space Agency's land use classification product [30] and processed as a binary raster with an 8 km buffer.By including sufficiently over-estimated borders, we ensure the combined extent of all stacked covariates will be identical and exclude additional features that might occur within the buffered boundary.The mask also acts to exclude a consistent representation of water features across the covariate stack.This is necessary, because while the study area is artificially bounded, the processes are not [38].

Model Types and Construction
We use three different models and the four built area configurations across six different countries to produce 54 models (Table 3).The first model type (Model 1, Figure 3) represents a simple binary dasymetric approach, in which census counts are disaggregated into pixels coincident with built areas defined by a given built product.To address the issue of census units with no built pixels, an iterative set of selections and redistributions mitigate the potential of under-estimating the population [37].The second model (Model 2, Figure 3) creates a population density-weighting surface based on a random forest (RF) statistical model, which is explained further in Stevens et al. 2015 [20].RFs are robust to noise, small sample sizes, and over-fitting, requiring minimal user parameterization [39,40].The three primary parameters include the number of covariates to be selected at each node, the number of trees in the forest, and the number of observations allowed in the terminal nodes of each decision tree [39].Specifically, for the approach outlined in Reed et al. [41], we generated a forest of 500 individual trees, based on the results of multiple experimental runs to produce stable and minimized out-of-bag error predictions [37].The RF model produces a population density estimation grid used to dasymetrically redistribute the population counts across the entire continuous weighting layer.Figure 4, Model 2 demonstrates no visible boundary of built area constraint and shows no stark boundaries between census administrative units.The second model (Model 2, Figure 3) creates a population density-weighting surface based on a random forest (RF) statistical model, which is explained further in Stevens et al. 2015 [20].RFs are robust to noise, small sample sizes, and over-fitting, requiring minimal user parameterization [39,40].The three primary parameters include the number of covariates to be selected at each node, the number of trees in the forest, and the number of observations allowed in the terminal nodes of each decision tree [39].Specifically, for the approach outlined in Reed et al. [41], we generated a forest of 500 individual trees, based on the results of multiple experimental runs to produce stable and Data 2018, 3, 33 7 of 11 minimized out-of-bag error predictions [37].The RF model produces a population density estimation grid used to dasymetrically redistribute the population counts across the entire continuous weighting layer.Figure 4, Model 2 demonstrates no visible boundary of built area constraint and shows no stark boundaries between census administrative units.
Last, the third model (Model 3, Figure 3) uses the population density-weighting surface generated in Model 2 but restricts the redistribution of census data to built area grid cells.In doing so, areas excluded from the built classification are given a population count of 0, constraining where people can be located while maintaining the predictive detail of the random forest (Figure 2). Figure 4, Model 3 shows the same patterning in Model 2 but with the built area distributional constraints of Model 1.

Technical Validation
To assess the accuracy of each model, population based on a two-thirds aggregate of available administrative units at the finest level was resampled in Python 2.7.8 by dissolving boundaries with the longest shared border, sorted randomly without spatial consideration.These final mapping products are then compared to the finest level of census data available for a given country by summing gridded population estimates within each administrative unit [20].The statistical measures include the root mean squared error (RMSE), percent root mean squared error (%RMSE), and the mean absolute error (MAE) [42].

Assessment of Gridded Population Datasets
Accuracy assessment of each map featured a suite of error metrics, including the RMSE and MAE for both population counts and density.Results show a consistent decrease in error relative to model complexity, with a few exceptions (Table 5).Those exceptions, as well as variation in accuracy for the more complex approaches, is ultimately dependent on the quality of the underlying RF model, which is a function of the nominal resolution captured by input census data and covariates.
The random forest model that produces the population density-weighting layer for the RF and Hybrid approaches has a variance explained for each country noted in Table 6.The variance explained fell consistently between 72.3% and 84.5%.The only exception was Haiti, where only 52.4% of variance could be explained due to an already low number of large census administrative units, which is known to decrease the predictive capacity of the models (Table 6) [12,20,43].In terms of covariate importance, the HRSL built area delineations had the greatest covariate importance across all countries (Figure 5).    4.

User Notes
The datasets presented in this paper facilitate comparisons and considerations of different approaches to the production of gridded population data.When producing such data, it is worth assessing the underlying built data and associated population densities to assess whether a binary dasymetric or hybrid approach may be more appropriate than statistical or smart interpolation models.The datasets presented here are endogenous and should not be used to explore relationships and correlations between the ancillary datasets and the resulting population distribution [4].Please see Reed et al. for a full analysis of environmental queues for population model selection [41].The

Figure 1 .
Figure 1.Census unit aggregation procedure in which 1/3 of the finest available units are randomly selected independent of spatial size or any other stratification and merged with its neighbor with the longest shared border until the target 2/3 census count is reached.

Figure 1 .
Figure 1.Census unit aggregation procedure in which 1/3 of the finest available units are randomly selected independent of spatial size or any other stratification and merged with its neighbor with the longest shared border until the target 2/3 census count is reached.

Figure 2 .
Figure 2. Workflow for generating the population distribution maps.Figure 2. Workflow for generating the population distribution maps.

Figure 2 .
Figure 2. Workflow for generating the population distribution maps.Figure 2. Workflow for generating the population distribution maps.

Figure 3 .
Figure 3. Model enumeration and visual representation of feature overlays used to produce output datasets by means of dasymetric redistribution.Ordered by increasing complexity.

Figure 3 .
Figure 3. Model enumeration and visual representation of feature overlays used to produce output datasets by means of dasymetric redistribution.Ordered by increasing complexity.

Figure 4 ,
Model 1, demonstrates the visible boundary of built area constraint, in addition to the visible difference in population along administrative unit boundaries.2018, 3, x FOR PEER REVIEW 7 of 12 population [37].Figure 4, Model 1, demonstrates the visible boundary of built area constraint, in addition to the visible difference in population along administrative unit boundaries.

Figure 4 .
Figure 4.An example of the three primary model types and the rasters they produce for Kigali, Rwanda.Pictured built area extent on models 1 and 3 is the combination layer described in Section 3.1.2.

Figure 4 .
Figure 4.An example of the three primary model types and the rasters they produce for Kigali, Rwanda.Pictured built area extent on models 1 and 3 is the combination layer described in Section 3.1.2.

Figure 5 .
Figure 5. Box plots of global variable importance presented as mean squared error for each covariate class.The median is represented by the black bar, while the whiskers represent the min/max values within 1.5× inter-quartile range.Variables sourced in Table4.

Figure 5 .
Figure 5. Box plots of global variable importance presented as mean squared error for each covariate class.The median is represented by the black bar, while the whiskers represent the min/max values within 1.5× inter-quartile range.Variables sourced in Table4.

Table 1 .
Census data for the six sampled countries and supporting data for finest available and aggregate products.Each model is built using the aggregate data, while finest available census units are reserved for accuracy assessment.

Table 1 .
Census data for the six sampled countries and supporting data for finest available and aggregate products.Each model is built using the aggregate data, while finest available census units are reserved for accuracy assessment.

Table 3 .
Model enumeration and brief descriptions, indicating the number of resulting maps and built area restrictions.Ordered by increasing complexity.

Table 4 .
Covariates and data sources included in the random forest.Nominal resolutions noted with 'as' represent the unit arcseconds.

Table 6 .
Variance explained captured in the random forest models of each sampled country.

Table 5 .
Error metrics for each of the 52 maps.Tables are shaded to indicate increasing methodological complexity.Values highlighted in red represent minimum error.Labeled as follows a: Haiti, b: Madagascar, c: Malawi, d: Nepal, e: Rwanda, f: Thailand.