Linking Synthetic Populations to Household Geolocations: A Demonstration in Namibia

Whether evaluating gridded population dataset estimates (e.g., WorldPop, LandScan) or household survey sample designs, a population census linked to residential locations is needed. Geolocated census microdata, however, are almost never available and are thus best simulated. In this paper, we simulate a close-to-reality population of individuals nested in households geolocated to realistic building locations. Using the R simPop package and ArcGIS, multiple realizations of a geolocated synthetic population are derived from the Namibia 2011 census 20% microdata sample, Namibia census enumeration area boundaries, Namibia 2013 Demographic and Health Survey (DHS), and dozens of spatial covariates derived from publicly available datasets. Realistic household latitude-longitude coordinates are manually generated based on public satellite imagery. Simulated households are linked to latitude-longitude coordinates by identifying distinct household types with multivariate k-means analysis and modelling a probability surface for each household type using Random Forest machine learning methods. We simulate five realizations of a synthetic population in Namibia’s Oshikoto region, including demographic, socioeconomic, and outcome characteristics at the level of household, woman, and child. Variables in the synthetic population were compared with the 2011 census 20% sample and 2013 DHS data by primary sampling unit/enumeration area. We found that synthetic population variable distributions matched the observed data and followed expected spatial patterns. We outline a novel process to simulate a close-to-reality microdata census geolocated to realistic building locations in a low- or middle-income country setting to support spatial demographic research and survey methodological development while avoiding disclosure risk of individuals.


Introduction
The ideal resource to evaluate the accuracy of gridded population datasets and certain household survey methodologies would be a complete set of individual records from a population linked to location of residence, though this is generally not available. Gridded population datasets model counts of human population in small grid cells, often based on census data and spatial covariates such as land cover type [1][2][3][4]. Various gridded population datasets have evaluated the accuracy of population counts at the geographic scale of input census data [3][4][5], and other analyses have evaluated whether cells were accurately classified as populated or not populated [6]; however, accuracy of population count per grid cell has not been evaluated because it requires a geo-located

Setting
Namibia was selected for the simulation because the population varies widely from low-to-high density, and the 2011 Namibia census meets the UN recommendations for high-quality census data [27].
We selected Oshikoto, a region in northern Namibia and one of the country's 13 regions, to demonstrate the simulation methods discussed here because it presents a rich microcosm of conditions and population types (Figure 1). Oshikoto covers an area of 38,653 square kilometres and is home to roughly 182,000 people [28]. The region has an unpopulated desert in the southwest, a rural settled agriculture area in the north, a rural area inhabited mostly by a nomadic population in the southeast, and two cities made up of planned and unplanned neighborhoods. Oshikoto comprises 10 administrative sub-regions called constituencies, for which there are published census population and household totals.

Data
Input data included the 20% microdata sample from the 2011 Namibia Population and Housing Census, available by request from the Namibia NSA [29]; 2011 Namibia census enumeration area boundaries, provided by request from the Namibia NSA [30]; 2013 Namibia Demographic and Health Survey (DHS) recode files and geo-displaced cluster coordinates, available by request from ICF International [31]; high-resolution (30 cm) satellite imagery available through ESRI via ArcGIS 10.5 [32]; and multiple spatial data layers derived from public sources, such as land cover type, nighttime lights intensity, and health facility locations, summarized in Table 1. Building on earlier work [33], spatial covariates were processed as part of the "Global High Resolution Population Denominators" Project by the WorldPop team at the University of Southampton and the Center for International Earth Science Information Network at Columbia University, and are detailed in a forthcoming paper (Alessandro Sorichetta, personal communication, July 2018).

The 2011 Namibia 20% census microdata sample comprises 36,137 individuals in 7536 conventional households selected at random from a complete census enumeration [28], and the DHS survey sample comprises 3316 individuals in 705 households located in 38 primary sampling units (PSUs) [31] (Table 2). In addition to the variables age, sex, relationship, and household size used to simulate household membership configurations, six covariates, common to both the DHS and census microdata, were simulated to support modelling of household type and prediction of outcome variables (Table 2). Four of these covariates are often used to operationalize the UN-Habitat definition of a "slum household": lack of improved toilet, lack of improved water source, inadequate space defined as three or more people per sleeping room, and unimproved structure defined as having an earthen or wood floor [45]. Other characteristics include urban versus rural location, use of solid fuel for cooking, whether the head of household has no formal education, and whether there are any children under age five in the household.
While the microdata provide a large, systematic sample reflecting the distribution of characteristics in the population, they are not a complete census and cannot be linked to local geographic positions (in this case, below the constituency level). The DHS survey, on the other hand, provides geographic coordinates, albeit displaced, for each PSU, allowing us to explore spatial variation in the population. The method developed here leveraged the strengths of each dataset and took advantage of variables common to both datasets in order to link a simulated population to geographic positions.

Simulation
We generated realistic household membership with realistic household point locations and demographic and social characteristics in the following three phases. In phase A, we defined household types and then predicted the spatial distribution of the types in Oshikoto using DHS data, spatial covariates, and visual inspection of satellite imagery. The output was a probability surface for each household type. In phase B, we generated the synthetic population using a census microdata sample and assigned the population to household point locations using the household type probability surfaces generated in phase A. Phase C involved prediction of additional population characteristics in each household. The code was written in R [46] and spatial data were generated in ArcGIS [47]. Each phase is summarized in Figure 2 and described below. Five realizations of the simulated population (Supplement 1), the code (Supplement 2), and interim output (Supplement 3) are provided.

Phase A: Predict Spatial Distribution of Household Types
First, we defined realistic and distinct types of households present in Oshikoto based on the 2013 DHS data of 705 households. We used the kmeans function in R [46] to generate a large number of clusters (k = 20) from eight household demographic and social variables common to both the DHS and census microdata (urban_rural, noedu, any_u5, toilet, water, structure, space, fuel). K-means is a form of unsupervised clustering that seeks to partition observations into groups by minimizing the within-group sum of squares. We then hierarchically clustered the k-means centroids and used the resulting dendrogram to choose a smaller number of statistically distinct household types (long Euclidean distance between parent and child clusters in the dendrogram) that were easily interpretable. In the case of the Namibia 2013 DHS, seven household types were identified. To interpret and label household types, we considered whether the household type values were above, below, or near the Oshikoto average (Table 3). We saved the k-means centroids and hierarchical clustering cut-off points to classify household types in other datasets in steps 3 and 5.
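The reuse of saved centroids and cut-off points to classify households in other datasets can be sketched as a nearest-centroid lookup. This is a minimal illustration in Python rather than the authors' R code; the centroid values, the type labels, the coding of `urban_rural` (1 = urban), and the function name `classify_household` are all hypothetical.

```python
import math

# Hypothetical saved output of step 1. The paper uses k = 20 centroids merged
# into seven types; four centroids and three types are shown for brevity.
VARIABLES = ["urban_rural", "noedu", "any_u5", "toilet",
             "water", "structure", "space", "fuel"]
CENTROIDS = {
    0: [1.0, 0.1, 0.3, 0.1, 0.0, 0.1, 0.2, 0.1],
    1: [1.0, 0.2, 0.5, 0.8, 0.3, 0.7, 0.6, 0.7],
    2: [0.0, 0.6, 0.6, 0.9, 0.5, 0.9, 0.4, 1.0],
    3: [0.0, 0.5, 0.5, 1.0, 0.6, 0.9, 0.5, 1.0],
}
# Hypothetical result of cutting the dendrogram: clusters 2 and 3 were merged.
CLUSTER_TO_TYPE = {0: "urban A", 1: "urban B", 2: "rural A", 3: "rural A"}


def classify_household(record):
    """Return the household type of the centroid nearest to `record`."""
    x = [record[v] for v in VARIABLES]
    nearest = min(CENTROIDS, key=lambda c: math.dist(x, CENTROIDS[c]))
    return CLUSTER_TO_TYPE[nearest]


household = {"urban_rural": 0, "noedu": 1, "any_u5": 1, "toilet": 1,
             "water": 1, "structure": 1, "space": 0, "fuel": 1}
print(classify_household(household))  # rural A
```

Keeping the centroid-to-type mapping separate from the centroids means the same lookup can be applied unchanged to DHS households and to simulated census households.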
Second, we processed 19 spatial covariates from free, public data sources including land cover types, nighttime light intensity, and health facility locations (see Table 1). These datasets were available for the whole region, enabling predictive mapping, and were shown to be related to population density [3,48]. We converted each covariate into a 100 m × 100 m raster, and then, for each cell, calculated the minimum, maximum, and average values within a five-kilometer buffer using the WGS84 geographic projection. This five-kilometer moving window was used because the DHS data used to fit models in the next step were randomly geo-displaced up to five kilometers in rural areas. Further, the average covariate value within a five-kilometer buffer of a displaced DHS PSU location was closer to the real, non-displaced, unpublished covariate value than the published, displaced covariate value [49,50]. Although DHS PSU coordinates were only displaced up to two kilometers in urban areas, a five-kilometer buffer was used for all PSUs, and urban probability surfaces were improved manually in step 4.
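The moving-window summary can be illustrated on a toy raster. The sketch below is in Python, not the ArcGIS workflow used in the paper, and it assumes square cells on a planar grid with the buffer radius expressed in cells (a five-kilometer buffer on 100 m cells would be roughly 50 cells). The function name `buffer_stats` is hypothetical.

```python
def buffer_stats(raster, radius_cells):
    """For each cell, return (min, max, mean) of values in a circular window."""
    n_rows, n_cols = len(raster), len(raster[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Collect values whose cell centre lies within the radius,
            # clipping the window at the raster edge.
            values = [
                raster[rr][cc]
                for rr in range(max(0, r - radius_cells),
                                min(n_rows, r + radius_cells + 1))
                for cc in range(max(0, c - radius_cells),
                                min(n_cols, c + radius_cells + 1))
                if (rr - r) ** 2 + (cc - c) ** 2 <= radius_cells ** 2
            ]
            out[r][c] = (min(values), max(values), sum(values) / len(values))
    return out


toy = [
    [0, 0, 1],
    [0, 5, 1],
    [2, 2, 1],
]
stats = buffer_stats(toy, radius_cells=1)
print(stats[1][1])  # (0, 5, 1.6)
```

Each input covariate therefore yields three derived layers, which is how 19 covariates become 57 predictors.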
Third, using the 2013 DHS data for all of Namibia (N = 550 clusters) and the household types created in step 1, we used the k-means centroids and cut-off points to calculate the most common household type for each PSU. Next, we extracted the five-kilometer buffered spatial covariates created in step 2 at each DHS PSU location, resulting in 550 observations of household type linked to 57 (19 × 3) spatial covariates. The aim of step 3 was to model the relationship between household type and the buffered spatial covariates in order to predict household types over the whole region. To do this, we used a Random Forest model, a non-parametric ensemble machine-learning algorithm that grows a "forest" of decision trees during the modelling process [3], and predicted a 100 m × 100 m probability surface for each household type across Namibia.
Fourth, we manually created household type probabilities for urban EAs. This step was necessary because initial tests found that the household type probability model generated in step 3 could not adequately distinguish household types within urban areas. This was expected given the displacement of the DHS PSU locations and the summary of geospatial covariate data, which are essentially identical across urban household types. Without step 4, simulated households of different socioeconomic types would be evenly spatially integrated in urban areas, which would be unrealistic. Poor and rich households are often segregated in urban areas worldwide [51], and visual inspection of satellite imagery indicated that socioeconomic segregation was present in Oshikoto's urban areas as well. From step 1, we labeled the two urban household types as poor and rich, then manually assigned a proportion of households that we judged to be rich versus poor within each EA based on satellite imagery, such that the probabilities summed to 1. These manually created EA-level urban household type probabilities were multiplied by the predicted household type probability surfaces created in step 3 to create the final 100 m × 100 m household type probability surfaces.
Fifth, we simulated a population of realistic households in Oshikoto using the 20% census microdata sample and multinomial logistic regression techniques proposed by Alfons and colleagues (2011) and operationalized by Templ and colleagues (2017) in the R simPop package [7,15]. This approach involved four sub-steps. First, we calculated the proportion of households to simulate per household size, per stratum (defined by constituency and urban/rural boundary). Second, we drew random resamples from the microdata until the target number of households was reached in each household size and stratum. Third, demographic characteristics of the household members (age, sex, relationship) were replicated from the microdata. Fourth, we added household socioeconomic characteristics to the simulated dataset (education, toilet, water, structure, space, fuel) using multinomial regression. This allowed for the simulation of combinations of demographic characteristics that existed in the population but were not present in the census microdata. For each simulated household, we then assigned the household type by selecting the class from step 1 with the smallest distance (i.e., most similar) between the household record and the k-means centroids.
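The resampling of household structures by size and stratum can be sketched as follows. This is a simplified Python illustration of the idea and not the simPop implementation, which is an R package with its own interface; the data layout and the function name `resample_households` are hypothetical.

```python
import random


def resample_households(microdata, targets, seed=0):
    """Draw donor households with replacement until each target is met.

    microdata: list of dicts with 'stratum' and 'members' (list of persons)
    targets:   dict mapping (stratum, household size) -> households to simulate
    """
    rng = random.Random(seed)
    donors = {}
    for hh in microdata:
        donors.setdefault((hh["stratum"], len(hh["members"])), []).append(hh)

    synthetic = []
    for key, n in targets.items():
        pool = donors.get(key)
        if not pool:
            continue  # no donor of this size in this stratum
        for hh in rng.choices(pool, k=n):
            # Replicate age, sex, and relationship of every member.
            synthetic.append({
                "hh_id": len(synthetic),
                "stratum": hh["stratum"],
                "members": [dict(person) for person in hh["members"]],
            })
    return synthetic


microdata = [
    {"stratum": "rural", "members": [{"age": 34, "sex": "F", "relationship": "head"}]},
    {"stratum": "rural", "members": [{"age": 61, "sex": "M", "relationship": "head"}]},
    {"stratum": "rural", "members": [{"age": 40, "sex": "M", "relationship": "head"},
                                     {"age": 8, "sex": "F", "relationship": "child"}]},
]
targets = {("rural", 1): 5, ("rural", 2): 3}
population = resample_households(microdata, targets)
print(len(population))  # 8
```

Socioeconomic variables would then be added to these replicated structures by a model rather than copied, which is what allows combinations absent from the microdata to appear.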
Sixth, the census microdata sample was provided with a weight equal to five for nearly all conventional households. We recalibrated these weights to the total number of households per constituency in the 2011 census [28]. However, this process could lead to too few observations in some constituency-urban/rural strata, and too many observations in other strata. Therefore, we increased the weights to simulate an extra 5% of households from which a random selection of households was assigned to latitude-longitude coordinates in step 7.
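The reweighting arithmetic can be illustrated with a small example. The sketch assumes a simple ratio calibration (census households divided by sampled households per constituency) inflated by 5%; the paper does not spell out the calibration formula, so this form, the constituency names, and the counts are all assumptions.

```python
def calibrated_weights(sample_counts, census_totals, oversample=0.05):
    """Household weight per constituency, inflated to simulate extra households."""
    return {
        name: census_totals[name] / n_sampled * (1 + oversample)
        for name, n_sampled in sample_counts.items()
    }


# Hypothetical constituencies and counts, for illustration only.
sample_counts = {"Constituency A": 800, "Constituency B": 1000}
census_totals = {"Constituency A": 4200, "Constituency B": 4900}

weights = calibrated_weights(sample_counts, census_totals)
for name, weight in weights.items():
    # Simulated households = sampled households x calibrated weight.
    print(name, round(weight, 4), round(sample_counts[name] * weight))
```

With these numbers the nominal weight of five becomes 5.25 and 4.9 before inflation, so the simulated totals track the published constituency totals plus the 5% surplus.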
Seventh, we joined the reweighted household type probabilities (100 m × 100 m grid cells) created in step 4 to the household latitude-longitude coordinates created in step 6. Finally, for each household simulated in step 5, we randomly sampled one latitude-longitude coordinate within the constituency-urban/rural stratum based on the probability of household type. We repeated the assignments until all coordinates were assigned a simulated household, and then discarded the extra 5% of simulated households that remained unassigned.
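The assignment of simulated households to coordinates can be sketched as weighted sampling without replacement within each stratum. This Python sketch is an illustration under assumptions: the data layout, the random processing order, the uniform fallback when a household type has zero probability everywhere, and the function name `assign_households` are not taken from the paper.

```python
import random


def assign_households(households, coordinates, seed=0):
    """Return {coord_id: hh_id}; surplus households are left unassigned.

    households:  dicts with 'hh_id', 'stratum', 'hh_type'
    coordinates: dicts with 'coord_id', 'stratum', 'prob' ({hh_type: p})
    """
    rng = random.Random(seed)
    free = {}
    for point in coordinates:
        free.setdefault(point["stratum"], []).append(point)

    order = list(households)
    rng.shuffle(order)
    assignment = {}
    for hh in order:
        candidates = free.get(hh["stratum"], [])
        if not candidates:
            continue  # stratum is full, so this surplus household is dropped
        weights = [p["prob"].get(hh["hh_type"], 0.0) for p in candidates]
        if sum(weights) <= 0:
            weights = [1.0] * len(candidates)  # assumed fallback: uniform
        index = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        assignment[candidates.pop(index)["coord_id"]] = hh["hh_id"]
    return assignment


coordinates = [
    {"coord_id": i, "stratum": "s1", "prob": {"poor": 0.8 if i < 2 else 0.2,
                                              "rich": 0.2 if i < 2 else 0.8}}
    for i in range(4)
]
households = [{"hh_id": i, "stratum": "s1", "hh_type": "poor" if i % 2 else "rich"}
              for i in range(5)]
assignment = assign_households(households, coordinates)
print(len(assignment))  # 4
```

Because there are five households and four coordinates in the stratum, one household is left over, mirroring the discarded surplus.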

Phase C: Predict Additional Population Characteristics, Generalize Locations
In step 8, we used the 2013 DHS records in Oshikoto (N = 705 households) to develop multinomial models of socioeconomic and health outcome variables. We stored the coefficients of each model and applied them to our simulated dataset to predict outcomes in each simulated household. The three simulated outcome variables represented different prevalence levels and patterns of dispersion in the population. These outcome variables were measured for children under age five, women of reproductive age, and households in order to support within-household clustering analyses. The outcome variables were: household wealth (expressed in quintiles), women's use of modern contraception (approximately 50% in Namibia and Oshikoto), and child's receipt of the third Diphtheria-Tetanus-Pertussis (DPT) vaccination (approximately 90% in Namibia and Oshikoto) [52]. Multinomial models were used for both multi-category and binary outcomes, where K is the number of categories in the outcome variable, Y_i is the outcome value for individual i, and X_i is a matrix of covariate values belonging to individual i. Model coefficients were applied to covariates of the 37,298 households in the simulated dataset to predict outcome values.
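Applying stored coefficients to simulated records can be sketched with a standard multinomial logit. The functional form below (softmax with the first category as reference) and the choice to draw the outcome at random from the predicted probabilities are assumptions, since the text does not state them; the coefficient values and function names are hypothetical.

```python
import math
import random


def category_probabilities(x, coefs):
    """Multinomial logit probabilities for K categories.

    x:     covariate values for one record, with a leading 1.0 for the intercept
    coefs: K - 1 coefficient vectors; category 0 is the reference (score 0)
    """
    scores = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in coefs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_outcome(x, coefs, rng):
    """Draw one outcome category according to the predicted probabilities."""
    probs = category_probabilities(x, coefs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
# Hypothetical three-category outcome with intercept and one covariate.
coefs = [[0.2, -0.5], [-0.1, 0.8]]
record = [1.0, 1.0]
print(category_probabilities(record, coefs))
print(predict_outcome(record, coefs, rng))
```

A binary outcome is the special case K = 2, where a single coefficient vector gives the usual logistic probability.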

Assessment
We conducted global assessments to evaluate whether each of the five realizations of the simulated population was realistic overall, and a local assessment to evaluate whether the realizations were realistic at an EA level. In the global assessment, we aggregated the DHS records to PSU and the simulated census records to EA, and graphically compared the distributions of simulated covariates and outcomes. We also mapped simulated census records by EA to visually inspect the spatial distributions across Oshikoto. In the local assessment, DHS data were averaged by PSU and compared to the distribution from repeated samples simulating a set of survey respondents. For each of 10,000 simulations, a random EA was selected within 5 km of each DHS PSU coordinate, then the same number of households as the observed DHS cluster was drawn from the simulated population. The characteristics were averaged from the sampled EAs and compared to the observed DHS data.
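The local assessment for a single PSU can be sketched as a Monte Carlo loop. The Python sketch below assumes a binary household indicator, a precomputed list of EAs within 5 km of the PSU, and a simple nearest-rank percentile interval; the function name `simulated_interval` and the toy data are hypothetical.

```python
import random


def simulated_interval(eas_near_psu, n_households, n_sims=10000, seed=0):
    """2.5th and 97.5th percentiles of simulated cluster means.

    eas_near_psu: list of EAs, each a list of 0/1 household indicators
    n_households: number of households in the observed DHS cluster
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        ea = rng.choice(eas_near_psu)  # pick one nearby EA at random
        sample = rng.sample(ea, min(n_households, len(ea)))
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int(0.025 * (n_sims - 1))]
    upper = means[int(0.975 * (n_sims - 1))]
    return lower, upper


# Two hypothetical simulated EAs with 30% and 60% prevalence.
eas = [[1] * 30 + [0] * 70, [1] * 60 + [0] * 40]
lower, upper = simulated_interval(eas, n_households=20, n_sims=2000)
observed = 0.45  # hypothetical observed DHS cluster mean
print(lower, upper, lower <= observed <= upper)
```

An observed PSU mean that falls inside the interval is consistent with having been sampled from the simulated population near that PSU.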

Ethics
Before releasing our simulated data, we closely reviewed papers about privacy of synthetic population data including a paper by Alfons and Templ (2010) who calculated disclosure risk of close-to-reality synthetic data generated with the simPop [R package] algorithm used in this analysis [53]. The authors found extremely low risk of disclosure for five worst case scenarios and concluded that simulations "implemented in simPop are confidential and can be distributed to the public" [53]. Any additional risk in our study due to linking simulated records to realistic building locations is negligible due to random spatial components in the analysis, and as a result of beginning with a random sample of the original census microdata in phase B. Any match between characteristics in a simulation realization of a household at a given building location and a real-world household at that same location is purely by chance.
The main risk in this analysis is misinterpretation and/or misuse of the synthetic population data by users (e.g., believing that the simulated data are from actual households and treating real-world household members, or their communities, with stigma). To minimize misinterpretation, we release five realizations of the synthetic population and label each dataset as "synthetic". To further minimize the risk of maltreatment of real-world people in the case that these data are misinterpreted, we only simulated commonly mapped variables which have been interpolated with real-world survey data to 1 km × 1 km grid cells by the MeasureDHS project [54].
This analysis and the public release of simulated data were reviewed by the University of Southampton Ethics Review Committee (#41006).

Results
Demographic and socioeconomic characteristics of the five simulated populations in Oshikoto (Table 4) were consistent with the 2013 DHS and 20% census distributions presented in Table 2. The distributions of the three outcomes were heaped in the 2013 DHS dataset, perhaps due to small sample size. In the global assessment of the simulated population by PSU/EA in Oshikoto, the distributions of households per wealth quintile, contraceptive use among reproductive-age women, and percent of children who received the third DPT vaccination were consistent between the 2013 DHS PSUs and the synthetic population EAs in all five realizations of the population (Figure 3).
A key difference was that the Oshikoto synthetic populations distributed more households in the lowest wealth quintile, while the DHS measured a greater percent of Oshikoto households in the second lowest wealth quintile.
Maps showing simulated household wealth by EA followed expected spatial patterns, with higher wealth in planned urban neighborhoods and large rural towns, and lowest household wealth in remote rural areas (Figure 4, realization 1). Similarly, higher rates of contraceptive use were located in urban EAs and wealthier rural EAs, as expected. Namibia has greater DPT3 vaccination coverage in rural, rather than urban, populations, which is atypical of LMICs [52]. This atypical pattern was reflected in the maps of DPT3 vaccination coverage among one of the simulated populations.
In the local EA-level assessment, we found that DHS estimates for each of the 38 Oshikoto clusters fell within the 95% confidence interval of repeated random simulated samples from the simulated population EAs near the DHS PSU (Figure 5). This implied that the observed DHS results could potentially have been drawn from the synthetic population.

Discussion
Close-to-reality simulated populations are needed to answer questions at the forefront of spatial demographic research and survey methodological development while reducing disclosure risks of releasing high spatial resolution census data. We outline a novel process to simulate multiple realizations of a population linked to realistic latitude-longitude coordinates in an LMIC setting. Our approach used the strengths of two commonly available population datasets, namely household surveys and census microdata samples. We also drew together computational methods in microsimulation of individuals and households with high-resolution mapping of household characteristics using geospatial data. The result was a full enumeration of a synthetic population with household relations and characteristics, linked to realistic locations. The simulated population was assessed and found to be realistic in terms of socioeconomic and health outcomes at both regional and local (community) levels. We released the code and five realizations of the simulated population to encourage additional simulations of close-to-reality populations linked to realistic latitude-longitude coordinates, and to support development of household surveys and gridded population survey sample frames for LMICs.
One such question is whether one-stage sampling can result in precise and feasible household surveys compared to the classic two-stage sampling design. Nearly every nationally representative multi-topic household survey implemented since the 1980s in LMICs has used a two-stage sampling design, with census enumeration areas comprising the first-stage sample frame and a manual household listing comprising the second-stage sample frame [9]. This has proven to be an effective sample design when census EAs are the only available first-stage sample frame, maximizing statistical power while reducing field costs [55][56][57]. Two-stage sampling, however, requires that two field visits are made to each sampled household several months (or even years) apart, making it more likely that mobile and vulnerable households are excluded from the survey or fail to respond compared to stable long-term households [58]. This problem is of increasing concern in LMIC cities today as rates of urbanization and mobility increase [51], possibly leading to increased bias in standard two-stage household surveys.
Gridded sampling frames open the door for one-stage surveys, in which households are listed and interviewed on the same day. This can theoretically improve the representation of poor and vulnerable households in household surveys; however, one-stage sampling comes at the risk of an increased design effect, requiring a larger sample size. Close-to-reality simulated populations can be used to compare various sample designs under different realistic conditions of population distribution, mobility, and characteristics.
Another application of close-to-reality population simulations is the evaluation of gridded population dataset accuracy at the cell level. Several gridded population datasets are generated at 100 m × 100 m scale from census data [3,4]. Accuracy of these models is often assessed at the geographic scale of the input census data; however, accuracy is never evaluated at the grid-cell level. Microdata located to realistic household locations and aggregated to 100 m × 100 m grid cells provide a first opportunity for this kind of accuracy assessment.
One limitation of this approach is that it resulted in some spatial smoothing of household type probabilities due to the use of buffered covariates in the Random Forest model in step 3. The choice, however, was to either introduce substantial measurement error by training models on covariates at the location of geo-displaced DHS coordinates [50], or to reduce spatial precision in prediction of household type probabilities by using aggregated covariate values within a buffer region around the geo-displaced DHS coordinates. We opted for the latter approach because rural Oshikoto was sparsely populated, and thus we expected minimal impact of spatial imprecision of household type. Furthermore, we manually corrected the spatial distribution of urban household type probabilities during step 4 via manual inspection of satellite imagery and classification of census EAs. However, researchers applying these methods in more densely populated and/or heterogeneous settings might consider smaller buffers in urban areas.
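The trade-off can be illustrated with a minimal sketch of buffer aggregation, in which the covariate value attached to a displaced coordinate is the mean over all grid cells within a given radius rather than the value of the single cell at that coordinate. The data layout (a mapping from cell-center coordinates in meters to covariate values), the use of the mean as the aggregate, and the function name are assumptions for illustration; the buffering in this paper was carried out in GIS software.

```python
import math


def buffer_mean(cells, x, y, radius):
    """Mean covariate value over the cells whose centers lie within
    `radius` of (x, y); returns None if no cell falls in the buffer.
    `cells` maps (center_x, center_y) to a covariate value."""
    values = [value for (cx, cy), value in cells.items()
              if math.hypot(cx - x, cy - y) <= radius]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical covariate values at three cell centers (meters).
cells = {(0, 0): 1.0, (100, 0): 3.0, (1000, 0): 10.0}

# A small buffer averages only nearby cells; a large buffer smooths more.
print(buffer_mean(cells, 50, 0, 200))
print(buffer_mean(cells, 50, 0, 5000))
```

A larger radius is more robust to coordinate displacement but averages over more distant cells, which is the source of the spatial smoothing described above.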
A related consideration when extending these methods is that Random Forest models cannot extrapolate beyond the range of the training data [3]. This could impact the accuracy of the predicted household type maps. To ensure accuracy of predicted household types in step 3, the same geographic unit should be used in both training and prediction datasets (e.g., 5 km buffers), and the range of covariate values in the training data (e.g., all Namibia DHS clusters) must be similar to or larger than the range of covariate values in the region of study (e.g., Oshikoto). We checked that training locations had a wider range of covariate values than the Oshikoto household locations (see Supplement 3).
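A range check of this kind can be sketched as follows. This is a minimal illustration rather than the check documented in Supplement 3; the covariate names, the values, and the function names are hypothetical, and each location is assumed to be a dictionary of covariate values with the same keys.

```python
def covariate_ranges(rows):
    """Minimum and maximum of each covariate across a list of dicts."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def uncovered_covariates(training_rows, prediction_rows):
    """Covariates whose prediction range extends beyond the training
    range, where a Random Forest would have to extrapolate."""
    train = covariate_ranges(training_rows)
    predict = covariate_ranges(prediction_rows)
    flagged = {}
    for name, (p_low, p_high) in predict.items():
        t_low, t_high = train[name]
        if p_low < t_low or p_high > t_high:
            flagged[name] = {"training": (t_low, t_high),
                             "prediction": (p_low, p_high)}
    return flagged


# Hypothetical covariate values at training and prediction locations.
training = [{"elevation": 900, "travel_time": 5},
            {"elevation": 1500, "travel_time": 240}]
prediction = [{"elevation": 1050, "travel_time": 10},
              {"elevation": 1200, "travel_time": 300}]

print(uncovered_covariates(training, prediction))
```

An empty result indicates that every prediction location falls within the training range for each covariate taken separately; it does not rule out unusual combinations of covariate values.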
A second limitation of this work is that it relied on manually digitized building point locations and manual delineation of rich versus poor EAs in urban areas. Manual data creation was manageable for a subnational region but would require substantial time to scale nationally. It took one GIS analyst nearly one week of full-time work to generate building point locations and to classify urban census EAs in Oshikoto for this analysis. However, as coverage of publicly available sub-meter satellite imagery increases globally, so does automated feature extraction of individual buildings in LMICs [59], which promises to help scale this simulation approach to larger geographic areas. Note that if feature extraction is used to generate building locations, additional information or researcher judgement may still be needed to identify multi-household building locations and to remove non-residential buildings. Furthermore, machine learning techniques are showing promise in mapping neighborhood types from very high-resolution imagery [60] and other building datasets [61], which can also help address this limitation.
A third limitation is the potential for errors introduced by temporal differences between datasets. The census and DHS datasets were collected two years apart, so major differences in population totals or demographic distributions were not expected; however, several covariates related to roads, travel time, facilities, and topography were more than a decade older and might not reflect the most recent development. Furthermore, household point locations were digitized from more recent imagery, and thus might include new buildings not reflected in the spatial covariates. The predictive model may be improved with better temporal alignment of the covariate data.
One might wonder why not generate random points for building locations within administrative areas near roads, or by using some other set of simple rules, as other researchers have done to simulate close-to-reality populations [19]. While this would permit certain types of analysis, such as the comparison of one-stage and two-stage sampling, creation of random points for households within large administrative areas is not recommended if the simulated population will be used to evaluate accuracy of gridded population models, particularly gridded populations with real-world spatial covariates at fine geographic scale (e.g., 100 m × 100 m). There is a large amount of heterogeneity in human population distribution, and this must be reflected accurately at a very local level to be able to evaluate gridded population models on a cell-by-cell basis.
This novel method to simulate close-to-reality household records linked to realistic building locations in an LMIC stands to support development of more accurate household survey methods and gridded population datasets as household survey sample frames. These methods are feasible to implement in other LMIC settings and will become globally scalable as feature extraction methods evolve.
Supplementary Materials: The following are available online at http://www.mdpi.com/2306-5729/3/3/30/s1. Data S1: Five realizations of a simulated, geo-located population in Oshikoto, Namibia; Code S2: R code to produce a simulated, geo-located population; Output S3: Interim graphs and maps used to assess accuracy of a simulated, geo-located population in Oshikoto, Namibia.