The less developed regions of the world have reached a symbolic milestone: Half of the population is now living in urban areas [1
]. In the least developed countries, most of which are located in sub-Saharan Africa (SSA), this ratio is much lower (about 33% of the population is urban), but urbanization is increasing rapidly and these countries are expected to face the highest growth rates during the next decades. It is expected that 40% of the population will live in urban areas by 2030 and 50% by 2050 [1
]. As a consequence of these rapid transformations, SSA cities are exposed to increasing urban poverty and intra-urban inequalities [2
], while a large part of the urban population is extremely vulnerable to health and disaster risks. In this context, detailed population data are essential for improving evidence-based decision-making by relevant authorities and organizations [3
], as well as for any application relying on a human population denominator, such as estimating the population at risk, assessing vulnerability, and deriving health or development goals indicators [6
]. However, this knowledge is often very limited in SSA and population data are regularly outdated and criticized regarding their reliability [6
]. While collected at a household or individual level, census data are generally aggregated and released in administrative units for privacy reasons [5
], and do not match the requirements for different fields of research [4
]. Population data aggregated in administrative units raise three further issues: (1) the real spatial patterns of the population distribution are blurred by an impression of homogeneity within entities [10
], (2) the aggregated values and subsequent analysis are very dependent on the choice of the administrative limits, which is also known as the modifiable areal unit problem (MAUP) [11
], and (3) administrative units create subjective spatial discontinuities that sometimes change from one census to another [9
When the spatial extent of a phenomenon does not correspond to any existing administrative limits, official population data often cannot be used directly. In such a situation, a gridded population product (a raster layer in which each pixel value is the estimated number of inhabitants) can provide a more useful estimate of population counts [12
], by summing up all the pixels falling into the area under investigation. Creation of these population grid products is usually achieved using dasymetric mapping [9
]. This modeling technique relies on the assumption that the knowledge of the territory—places more densely populated than others—can be used to spatially disaggregate the official census data provided at the administrative level to a finer scale [5
]. Ancillary geoinformation data, such as land cover (LC) and land use (LU) maps, can provide valuable information for estimating the potential of different locations within the administrative units to be inhabited. Although they differ by nature (LC describes the physical characteristics of earth surface elements, e.g., vegetation, water, or built-up, while LU refers to the functions and activities that humans carry out in certain locations, e.g., agricultural land, residential area, or industrial area), they provide complementary information that is valuable for population modelling. For example, building density (from LC) can be combined with the distinction between residential and commercial areas (from LU) and used as a proxy for population density.
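The pixel summation described above can be illustrated with a minimal sketch. It assumes the gridded product is already loaded as a plain 2D list of estimated inhabitants per pixel and that the area under investigation is given as a boolean mask of the same shape; the function name, the toy grid, and the `flood_zone` mask are all hypothetical and only serve the illustration.

```python
def population_in_area(grid, mask):
    """Sum the estimated inhabitants of every pixel flagged in `mask`.

    grid: 2D list of estimated population counts per pixel.
    mask: 2D list of booleans with the same shape (True = inside the area).
    """
    total = 0.0
    for grid_row, mask_row in zip(grid, mask):
        for value, inside in zip(grid_row, mask_row):
            if inside:
                total += value
    return total


# Hypothetical 3 x 3 gridded population layer (inhabitants per pixel).
grid = [
    [12.0, 30.5, 0.0],
    [8.0, 45.0, 3.5],
    [0.0, 20.0, 1.0],
]

# Hypothetical area of interest that ignores administrative limits.
flood_zone = [
    [False, True, True],
    [False, True, True],
    [False, False, False],
]

population_at_risk = population_in_area(grid, flood_zone)
print(population_at_risk)
```

In practice the mask would come from rasterizing a polygon onto the grid, but the summation itself is no more than this.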
The major challenge in dasymetric mapping resides in the determination, from a set of ancillary data, of the relative distribution of the population within the administrative units. This information can be seen as spatial reallocation “weights”, which are used in dasymetric mapping to disaggregate (redistribute) the population count known for the administrative units into a finer subunit level. When a simple built-settlement layer is available, a common strategy is to homogeneously allocate the population counts of the administrative unit within areas identified as built-up (binary dasymetric method). When the ancillary data are thematically more detailed than just a binary built-up layer—e.g., with a distinction between urban core, periurban, and rural areas—the weights can be adjusted to better correspond with the expected relative distribution of the population. For a long time, these weights were subjectively determined based on expert knowledge, or according to existing information [12
], such as land-use information or household characteristics, combined with the use of quantitative methods, such as correlation analysis and multivariate regression [13
]. Recent research has shifted this paradigm by taking advantage of the power and the efficiency of machine learning algorithms to model the distribution of population densities, without any prior knowledge. In the case of the WorldPop project, the popular Random Forest (RF) algorithm [14
] is used to predict the weights for reallocation of population in 100 × 100 m grid layers [15
]. In this work, the RF algorithm is used in a similar fashion.
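To make the role of the weights concrete, the following is a minimal sketch of the reallocation step only, not of the RF model that produces the weights. It assumes that each pixel carries the identifier of its administrative unit and a non-negative weight; the function name and the toy inputs are hypothetical. With 0/1 weights it reduces to the binary dasymetric method.

```python
from collections import defaultdict


def dasymetric_redistribute(unit_ids, weights, unit_counts):
    """Redistribute each unit's population count to its pixels.

    Each pixel receives count * (pixel weight / sum of weights in its unit),
    so the pixel values of a unit add up to the unit's official count.
    Units whose weights sum to zero receive no population in this sketch.
    """
    # First pass: total weight per administrative unit.
    weight_sums = defaultdict(float)
    for id_row, w_row in zip(unit_ids, weights):
        for uid, w in zip(id_row, w_row):
            weight_sums[uid] += w

    # Second pass: allocate the count proportionally to the weights.
    grid = []
    for id_row, w_row in zip(unit_ids, weights):
        out_row = []
        for uid, w in zip(id_row, w_row):
            total_w = weight_sums[uid]
            share = w / total_w if total_w > 0 else 0.0
            out_row.append(unit_counts[uid] * share)
        grid.append(out_row)
    return grid


# Hypothetical 2 x 3 grid covering two administrative units, A and B.
unit_ids = [["A", "A", "B"],
            ["A", "A", "B"]]
unit_counts = {"A": 100, "B": 60}

# Binary dasymetric method: 1 = built-up pixel, 0 = non-built-up pixel.
binary_weights = [[1, 0, 1],
                  [1, 0, 1]]
binary_grid = dasymetric_redistribute(unit_ids, binary_weights, unit_counts)

# Graded weights, e.g. as predicted by a model (values are made up).
graded_weights = [[3.0, 1.0, 2.0],
                  [0.0, 0.0, 2.0]]
graded_grid = dasymetric_redistribute(unit_ids, graded_weights, unit_counts)
```

Because the weights are normalized within each unit, only their relative values matter, which is why any monotonic proxy of population density can serve as a weighting layer.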
Irrespective of the approach (expert-based or using machine-learning), built-settlement layers are consistently among the most important predictors for population models [16
]. These layers are typically extracted from satellite imagery, and have been commonly used to estimate population densities at large spatial scales. However, both the quality [5
] and the spatial resolution [4
] of ancillary information have a strong influence on the accuracy of the predictions. In an urban context, the potential of finer resolution products for population redistribution is largely unexplored. We hypothesize that, by utilizing high and very-high resolution information (i.e., land cover and land use), the accuracy of the dasymetric reallocation might be significantly improved.
In this paper, we compare the contribution of three data sets with different spatial and thematic resolutions (built-up mask, land cover, and land use) for disaggregating population counts into 1 hectare grid cells. The availability of extremely detailed census data for the city of Dakar (Senegal) enables the assessment of the added value of very-high resolution (0.5 m) data, compared to medium resolution (10 m) data, in the context of a top-down dasymetric approach. Different levels of information are extracted from these data sets to create different weighting layers and perform dasymetric mapping. While very-high resolution data are expected to increase the accuracy of the dasymetric mapping procedure, their acquisition and processing costs might hinder their applicability for large-scale population mapping in Africa. It is, therefore, important to evaluate the loss in accuracy when using freely-available medium resolution data.
The following analysis of the results is based on the validation performed by comparing the reference data against the estimates aggregated at administrative level 1. Several conclusions can be drawn from the results (see Table 3
). First, when relying only on built-up/non-built-up masks, the impact of the very-high resolution data is notable, since it allows the RTAE to decrease by 13% (from 36.7 to 31.7). As highlighted in Figure 7
, VHR-BU also leads to an important reduction of extreme relative errors of prediction. This is consistent with recent research showing that VHR settlement layers systematically have higher feature importance than lower resolution layers in an RF-based dasymetric approach [27
Second, when taking advantage of the detailed spatial and thematic information provided by the VHR-LC layer, the accuracy of the dasymetric reallocation is significantly improved, with the RTAE reduced to 0.308, corresponding to a drop of 16% relative to the results obtained using only the built-up mask at medium resolution (MR-BU). Third, when considered as a single source of ancillary data, the VHR-LU layer performs poorly compared to VHR-LC alone, and even worse than the binary information provided by the VHR-BU layer. It is probable that the lower spatial resolution (land use is characterized at the street block level) and lower classification accuracy of VHR-LU, compared to VHR-LC and VHR-BU, have a strong influence on this result. Fourth, the combined use of VHR-LC and VHR-LU provides better results than using either of them alone, which confirms that these data are complementary. Figure 8 depicts the feature importance provided by the RF model for the best-performing test (J). It supports the conclusion that VHR-LC and VHR-LU are complementary, since land-cover and land-use variables clearly alternate among the six most important features. In addition, it is interesting to see that both the building classes from the LC layer (“Low buildings (< 5 m)” and “Medium buildings (5 to 10 m)”) appear among the most important variables, as does the distinction between planned residential areas and deprived residential areas from the LU layer.
Finally, in our analysis, the best-performing dasymetric reallocation was obtained when using all available data; that is, test J, whose predictions at the grid level are illustrated in Figure 9. Surprisingly, the accuracy is improved when using MR-BU in addition to VHR-LC and VHR-LU. As shown in Figure 10, the majority of the large relative errors (in terms of percentage of the reference population) are located, as could be expected, in less populated administrative units.
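The relative drops quoted above can be re-derived with a few lines of arithmetic. The sketch below makes two assumptions that are not confirmed by this section: that RTAE is the sum of absolute errors over the validation units divided by the total reference population, and that the value 0.308 is on the same scale as 36.7 and 31.7 once these are read as percentages. The function names and the toy validation units are hypothetical.

```python
def rtae(predicted, reference):
    """Relative total absolute error (assumed definition):
    sum of |predicted - reference| divided by the total reference population."""
    abs_err = sum(abs(p - r) for p, r in zip(predicted, reference))
    return abs_err / sum(reference)


def relative_reduction(baseline, improved):
    """Fraction by which an error metric drops relative to a baseline."""
    return (baseline - improved) / baseline


# Toy example: three validation units (made-up counts).
toy = rtae(predicted=[120, 80, 50], reference=[100, 100, 50])

# RTAE values quoted in the text, all expressed as percentages.
drop_bu = relative_reduction(36.7, 31.7)  # MR-BU -> VHR-BU
drop_lc = relative_reduction(36.7, 30.8)  # MR-BU -> VHR-LC (0.308 read as 30.8)

print(f"toy RTAE: {toy:.2f}")
print(f"MR-BU to VHR-BU: {drop_bu:.1%}")
print(f"MR-BU to VHR-LC: {drop_lc:.1%}")
```

From the rounded values quoted in the text, the first drop comes out at about 13.6% and the second at about 16.1%, in line with the reported figures of roughly 13% and 16%.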
Another interesting point that can be highlighted from our analysis relates to the validity of the Out-Of-Bag (OOB) score as a validation metric in dasymetric mapping. The OOB-score (or error) is an accuracy assessment metric computed during the fitting of the Random Forest model. It is computed from the internal cross-validation procedure, and can be interpreted as an average measure of the ability of the model to predict on unseen data in the training set. Since the training set is composed of administrative units of level 0, this metric can be seen as a measure of the ability of the model to predict on unseen units at that same level. Conversely, the external validation used here, as described in the “methods” section, is designed to assess the ability of the dasymetric mapping procedure to accurately redistribute population counts from one geographic scale to a finer one. When studies suffer from a lack of spatially detailed population data, external validation cannot be performed systematically. In such contexts, it may be tempting to consider the OOB-score as a measure of the performance of the dasymetric reallocation. Nevertheless, our results show that there is no straightforward relationship between the external validation metrics and the internal OOB-score. Indeed, as visible in Table 3, the OOB-score identifies test H as the best-performing combination of input data (covariates) (OOB of 0.85 for an RF model built at level 0). However, when considering the RTAE or %RMSE, test J is identified as the best-performing one. Furthermore, fitting RF models on both administrative levels revealed that the best-performing set of covariates identified at one scale is not necessarily the one that performs best at another scale, which could be interpreted as a result of the MAUP effect.
The gridded population layer resulting from the dasymetric mapping procedure presented in this paper is available to anyone interested and for any purpose. The reference is provided in Appendix A.
Medium resolution built-up settlement layers have been commonly used in population modeling [15
]. They have the advantage of being free and providing global coverage, even though their spatial resolution is limited. However, high or very-high spatial resolution data are usually preferred for applications covering a relatively small geographic extent, as they allow the counting of dwelling units or the interpretation of residential land-use types, despite their high cost. To date, little research has explored the potential of built-settlement layers at different spatial resolutions for urban population mapping.
Although the price of VHR remote sensing data is slowly dropping, it remains an important limiting factor for large-scale applications. Usually, VHR imagery is acquired primarily to produce detailed LC and LU maps, which are useful pieces of information by themselves. The gain in dasymetric mapping performance that VHR-LC and VHR-LU information can provide is therefore important for assessing the cost-effectiveness of these data. In this regard, our results show that there is a clear positive impact of using VHR products for population modeling, as well as a complementarity between VHR-LC and VHR-LU products. Future research could involve integrating lower-cost imagery, such as SPOT-6/7 (1.5 m spatial resolution for pan-sharpened images), which could provide an interesting cost-efficient compromise between MR and VHR.
Regarding the performance of the dasymetric mapping procedure presented here, it is important to mention that the accuracy of the predictions at the grid level is probably lower than that reported here (which comes from a validation at level 1, after reaggregation of grid-level estimates). Since gridded population products are commonly used in different fields of research, it is essential to inform end-users about the confidence and limitations of such products, as much as about their advantages. When using such population models, the end-user should always keep in mind that “The most that can be expected from any model is that it can supply a useful approximation to reality: All models are wrong; some models are useful”
] (p. 440).
With regard to top-down dasymetric approaches, such as the one presented here, we should mention that they are completely dependent on official censuses which are frequently criticized regarding their reliability [6
]. Furthermore, in the best case, official censuses are organized once a decade, but in developing countries this frequency is not always maintained. This often forces studies to deal with population data that are asynchronous with the ancillary geoinformation used as covariates. When the official population data are not available or are outdated, remotely-sensed data can be used to support a bottom-up approach [6
]. The latter uses population counts coming from micro-census surveys—i.e., the collection of census information through field surveys on a limited portion of the territory—to extrapolate the population count on the rest of the territory, thus allowing implementation with limited human and financial capacity, compared to a regular census [6
Additionally, we highlighted that, in an RF-based dasymetric mapping procedure, the OOB-score is not guaranteed to identify the best-performing combination of input layers in terms of reallocation accuracy. We therefore recommend that future studies exercise caution when using the OOB-score as an indicator of the performance of dasymetric mapping. More generally, the sensitivity of RF-based top-down population models to the scale factor and the MAUP effect is poorly explored in the literature. It would be beneficial for the field to further investigate the impact of the quantity and the spatial resolution of the administrative units used to train the RF model, and the impact of the difference in spatial resolution between the training and prediction (grid) levels.
Dasymetric mapping has been used to provide estimates of population densities at a finer scale than the official data released in administrative units. To this end, the method relies on ancillary data, such as settlement layers, land cover, or land use maps, that are used as proxies for the real spatial distribution of the population within the administrative units. On the one hand, MR satellite imagery can be used to derive such ancillary data at no cost. Unfortunately, when working at the intra-urban level, these data often fail to provide sufficient detail for the population model. On the other hand, VHR satellite imagery can provide very detailed information and thus improve the quality of the population models, but its acquisition is much more expensive. In this research, we assessed the added value of VHR remote-sensing derived products, compared to MR products, when used as ancillary data in a dasymetric mapping procedure. When using a simple binary built-up/non-built-up mask, we showed that the use of VHR resulted in a drop of 13% in the error (RTAE). Moreover, our results showed that the spatially and thematically detailed information that can be derived from VHR land cover and land use maps enabled a significant improvement in the dasymetric reallocation accuracy, compared to what can be achieved using an MR built-up mask alone.