Multiple Global Population Datasets: Differences and Spatial Distribution Characteristics

Spatial data of regional populations are indispensable in studying the impact of human activities on resource utilization and the ecological environment. Because the differences between datasets and their spatial distribution are still unclear, this has become a puzzle in data selection and application. This study is based on four mainstream spatialized population datasets: the History Database of the Global Environment version 3.2.000 (HYDE), Gridded Population of the World version 4 (GPWv4), Global Human Settlement Layer (GHSL), and WorldPop. In view of possible influences of geographical factors, this study analyzes the differences in accuracy of population estimation by computing relative errors and population spatial distribution consistency in different regions by comparing datasets pixel by pixel. The results demonstrate the following: (1) Source data, spatialization methods, and case area features affect the precision of datasets. As the main data source is statistical data and the spatialization method maintains the population in the administrative region, the populations of GPWv4 and GHSL are closest to the statistical data value. (2) The application of remote sensing, mobile communication, and other geospatial data makes the datasets more accurate in the United Kingdom, with rich information, and the absolute value of relative errors is less than 4%. In the Tibet Autonomous Region of China, where data are hard to obtain, the four datasets have larger relative errors. However, the area where the four datasets are completely consistent is as high as 84.73% in Tibet, while in the UK it is only 66.76%. (3) The areas where the spatial patterns of the four datasets are completely consistent are mainly distributed in areas with low population density, or with developed urbanization and concentrated population distribution. Areas where the datasets have poor consistency are mainly distributed in medium population density areas with high urbanization levels. 
Therefore, in such areas, a more careful assessment should be made during the data application process, and more emphasis should be placed on improving data accuracy when using spatialization methods.


Introduction
Population growth has placed certain pressures on society, resources, and the ecological environment, and even affected ecosystem functions [1,2]. The critical role of population data in the study of social economy, resource utilization, and ecosystem change has been widely recognized [3]. In particular, population density data can be broadly applied in quantifying the intensity of human activities, depicting the spatial patterns of eco-environmental quality, simulating the spatial distribution of pollutant emissions, and evaluating ecological problems brought about by urbanization [4][5][6][7], as well as in other ecological research. With the development of remote sensing technology, population data estimation deviation [43], consistency of spatial population distribution [44], and population density level distribution at the administrative unit and pixel scale, so as to provide a reference for the selection of population datasets in socioeconomic or ecological environment research [41,45].

Case Area Selection
To evaluate the performance of the population datasets in areas with different topographical and urban-rural distribution characteristics, and to explore the relationship between dataset accuracy and geographical factors, we selected four case areas with different characteristics: the United Kingdom, with a high population density of 274.7 persons/km² and an urbanization rate of 83.4%; Argentina, with a high urban population share of 91.9%; Sri Lanka, with flat terrain below 200 m; and the Tibet Autonomous Region of China, with an altitude above 4000 m and a sparse population of fewer than 3 persons/km² (Table 1). In the UK, overall population density is high and the urbanization rate is as high as 83.4%; the population spreads outward from concentration centers in Greater London, Manchester, Birmingham, and other counties in England, and in Glasgow and Edinburgh in Scotland. The overall population density of Argentina is 16.3 persons/km², but much of the rural population has moved into the cities because of lagging economic development, resulting in a large population concentration and high proportions of urban population in Buenos Aires, Cordoba, Mendoza, and other large cities in the north, while the population density of small cities in the south is mostly less than 5 persons/km². Sri Lanka is relatively flat, with altitude less than 200 m, and most of the terrain is plain; the overall population density is as high as 345.6 persons/km². Population density is greater than 500 persons/km² in the west and mostly less than 100 persons/km² in the east; it decreases in all directions from Colombo and Kandy, the areas with the highest population concentration, and is also high in the ports, such as Jaffna in the north, with more than 2000 persons/km². The average altitude of the Tibet Autonomous Region of China is more than 4000 m [45], and the overall population density is less than 3 persons/km². The population is predominantly distributed in Lhasa, Xigaze, and several agricultural counties, while high-altitude areas such as Ali and the north of Naqu are very sparsely populated.

Data Source
The main data used in this paper include four spatial population datasets from 2015, administrative division data, and demographic data. The four spatial population datasets are: (1) popd of the History Database of the Global Environment (HYDE), version 3.2.000 [13][14][15], produced by the PBL Netherlands Environmental Assessment Agency; (2) Population Density of Gridded Population of the World, version 4 (GPWv4) [11], produced by the Center for International Earth Science Information Network (CIESIN), Columbia University; (3) GHS-POP of the Global Human Settlement Layer (GHSL) [12], produced by the European Commission; and (4) WorldPop Population Counts [16], produced with unconstrained top-down methods by multiple organizations and institutions. Administrative division data are derived from the Global Administrative Areas (GADM) database, version 3.6, produced by the Center for Spatial Sciences at the University of California, and from the administrative divisions of the Tibet Autonomous Region provided by the Qinghai Tibet Plateau scientific data center (http://www.tpedatabase.cn). Because WorldPop data are lacking in the southeast part of the Tibet Autonomous Region, data of Longzi County, Cuona County, and Linzhi city are excluded. HYDE and GPWv4 define population density as the number of people per square kilometer, whereas GHSL and WorldPop give the number of people in each grid cell. To reduce the impact of data processing on the spatial statistics of the datasets and to facilitate comparison, this paper unifies the measurement unit of the four datasets as population per square kilometer without changing their resolution. The population statistics came from the website of the World Bank and the Statistical Yearbook of the Tibet Autonomous Region of China from 2015.
Nearly 20 years separate the development of the earliest GPWv4 dataset from the latest WorldPop data. Over this period the data sources shifted from demographic data alone to the integration of digital elevation models (DEM), land cover data and transportation networks, and then to remote sensing data, mobile communication and other new data sources (Table 2). The GPWv4 dataset, with a spatial resolution of 1 km, is based on the 2010 official census and estimated population data, supplemented by administrative boundaries and the United Nations' World Population Prospects, 2015 Revision. The data source of HYDE 3.2 is the United Nations' World Population Prospects and historical estimations from the literature [46][47][48], supplemented with data from the sub-national population statistics of Populstat and other sources. HYDE constructed a continuous population time series with a spatial resolution of 10 km for each country's province or state [11]. Using remote sensing satellite data and volunteered geographic information, GHSL generates fine built-up areas and disaggregates the GPWv4 produced by CIESIN to generate population distribution maps with higher spatial resolution (250 m) and more detailed spatial expression. WorldPop has many input data, including elevation, slope, land cover, infrastructure, satellite data, and mobile phone communication data, in addition to the 2010 national census and official population estimation data. At present, year-on-year time series data with a spatial resolution of 100 m from 2000 to 2020 have been developed.
In terms of production method, GPWv4, based on the area weighting method, is the only one of the four datasets that is not spatialized by modeling. Its production is simple and ensures the accuracy of the total population within each administrative unit; the disadvantage is that it assumes that people are evenly distributed in space. The HYDE 3.2 dataset generates a combined weight layer based on soil suitability, road accessibility, distance from water bodies, night lights and other indicators to spatialize population data. This model is applicable globally, but does not take into account additional region-specific uncertainties. GHSL uses remote sensing satellite data and volunteered geographic information to generate built-up areas with a spatial resolution of 38 m and, according to the proportion of built-up area in each grid cell, disaggregates GPW again using a linear regression method. The modeling method is simple and reflects the fact that the population is mainly distributed in built-up areas, but it ignores administrative boundaries. With the development and application of machine learning and other algorithms, WorldPop uses a random forest model to quantify the relationship between model factors such as land cover, satellite data, mobile phone communication and micro-census data, generating a weight layer with which census data are reallocated. Among the four datasets, WorldPop has the most abundant geographic information data sources, and its random forest model performs well in classification and variable importance ranking [49], which indicates the future development direction of spatialization methods.
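The unit unification described in this section can be illustrated with a short sketch. It assumes that count-type grids (GHSL, WorldPop) are converted to density by dividing each cell's count by the cell's area; the function names, the spherical-Earth area formula, and the equal-area 250 m example grid are illustrative assumptions, not the processing chain used in this study.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def cell_area_km2(lat_center_deg, cell_size_deg):
    """Approximate area (km^2) of a square lat/lon cell centred at a given latitude."""
    half = math.radians(cell_size_deg) / 2.0
    lat = math.radians(lat_center_deg)
    # Area of a lat/lon cell on a sphere: R^2 * dlon * (sin(top) - sin(bottom))
    return (EARTH_RADIUS_KM ** 2) * math.radians(cell_size_deg) * (
        math.sin(lat + half) - math.sin(lat - half)
    )


def counts_to_density(counts, area_km2):
    """Convert persons per cell to persons/km^2; the grid resolution is unchanged."""
    if area_km2 <= 0:
        raise ValueError("cell area must be positive")
    return [[c / area_km2 for c in row] for row in counts]


# Hypothetical counts on an equal-area grid of 250 m cells (0.25 km * 0.25 km = 0.0625 km^2)
density = counts_to_density([[0, 5], [10, 25]], 0.0625)
# density == [[0.0, 80.0], [160.0, 400.0]]
```

For grids stored in geographic coordinates the cell area shrinks toward the poles, so `cell_area_km2` would be evaluated per row of the raster rather than once for the whole grid.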

Analysis Method of Spatial Distribution Consistency
The spatial distribution consistency analysis measures the consistency of population spatial distribution in the four datasets by comparing them pixel by pixel. The process is as follows: (1) The units of the four datasets are converted and unified into persons/km². (2) According to the population density characteristics of each case area (Table 1), population density is reclassified into 9 levels based on the natural breakpoint method [50,51] (Table 3). (3) Raster calculation is performed on the four reclassified datasets to obtain grid data reflecting their consistency. The grid data distinguish pixels where four datasets are consistent, three datasets are consistent, two datasets are consistent, and each dataset is inconsistent with the others, respectively defined as completely consistent, highly consistent, lowly consistent, and completely inconsistent [50]. (4) The datasets are compared pairwise to determine whether they are consistent and in what proportion. (5) Zonal statistics tools in ArcGIS are used to summarize the distribution of population density levels in the two types of consistent regions (Figure 1).
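Steps (2) and (3) can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions: the break values in `EXAMPLE_BREAKS` are made up and are not those of Table 3, and a pixel where the datasets split into two agreeing pairs is treated as lowly consistent because the largest agreeing group has size two, a case the text does not specify.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical upper bounds (persons/km^2) of levels 1-8; level 9 is everything above.
EXAMPLE_BREAKS = [1, 5, 25, 50, 100, 250, 500, 1000]

LABELS = {
    4: "completely consistent",
    3: "highly consistent",
    2: "lowly consistent",
    1: "completely inconsistent",
}


def density_level(density, upper_bounds):
    """Return the 1-based density level; each upper bound is inclusive."""
    return bisect_left(upper_bounds, density) + 1


def consistency_label(levels):
    """Label one pixel from the density levels assigned by the four datasets."""
    if len(levels) != 4:
        raise ValueError("expected one level per dataset (four in total)")
    largest_group = max(Counter(levels).values())
    return LABELS[largest_group]


# One pixel with hypothetical densities from HYDE, GPWv4, GHSL and WorldPop
pixel = [3.0, 4.2, 4.9, 30.0]
levels = [density_level(d, EXAMPLE_BREAKS) for d in pixel]  # [2, 2, 2, 4]
label = consistency_label(levels)  # "highly consistent"
```

Applying `consistency_label` to every pixel of the four reclassified grids yields the consistency raster described in step (3).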

Accuracy of Population Estimation
Taking the World Bank and Statistical Yearbook data of 2015 as the reference for the total population, this paper compares the accuracy of the population estimated by the four spatial datasets (Table 4). The total populations of GPWv4 and GHSL are the closest to the statistical data, with absolute relative errors within 3%. The reason may be that GPWv4 is based on the 2010 census data and allocates the population within each administrative unit so as to keep the population in each unit unchanged, while GHSL is a refined spatial allocation based on GPWv4. The relative error of WorldPop is negative, and its absolute value is the largest in Argentina, Sri Lanka and Tibet, which may be explained by the small amount of regional data in each case area. Although WorldPop and GPWv4 are both based on the 2010 census data, there is a large gap between them in total population. The relative error of WorldPop in Argentina, Sri Lanka and the Tibet Autonomous Region of China is as high as 20%, which shows that differences in the population density and distribution patterns simulated by different spatialization methods make the totals unequal. The accuracy of HYDE varies between regions: in the UK the relative error is only −3.71%, and its absolute value in Argentina and Sri Lanka is less than 10%, while the relative error in the Tibet Autonomous Region is −15.03%. The application of remote sensing, mobile phone communication, and other geospatial data makes the data more accurate in areas with abundant information. In the UK, the absolute relative error of all four datasets is lower than 4%, and that of GPWv4 and GHSL is less than 1%. In Argentina, Sri Lanka and the Tibet Autonomous Region of China, the results of GPWv4 and GHSL are similar. The relative error of GHSL in Argentina, Sri Lanka and the Tibet Autonomous Region of China is 0.66%, −2.41% and −1.15%, respectively; however, the absolute relative errors of HYDE and WorldPop are between 5% and 25%.
Compared with the other three regions, the relative error of the Tibet Autonomous Region in China is generally larger, especially with HYDE using the literature's historical data and WorldPop using multi-source geographic information data such as communication data. The accuracy of the estimation of the Tibet Autonomous Region is far lower than that of other regions, with a deviation of about 20%. There may be two reasons for this: in terms of massive sparsely populated areas at high altitude, the scale of population statistics is not precise enough [8], and/or it is difficult to obtain new auxiliary data such as household survey and mobile phone communication data, which makes the error of the spatial population dataset larger.
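The relative errors discussed in this section can be reproduced with a one-line calculation. The sketch below assumes the usual sign convention, (dataset total − statistical total) / statistical total, so that a negative value means the dataset underestimates the statistical population; the text does not state the formula explicitly, and the totals in the example are hypothetical.

```python
def relative_error_pct(estimated_total, reference_total):
    """Relative error (%) of a dataset's total population against the statistical total."""
    if reference_total <= 0:
        raise ValueError("reference total must be positive")
    return (estimated_total - reference_total) / reference_total * 100.0


# Hypothetical totals: a dataset sums to 962,900 people where statistics report 1,000,000
error = relative_error_pct(962_900, 1_000_000)  # about -3.71 (underestimation)
```

In practice `estimated_total` would be the sum of the gridded population over the case area, obtained by multiplying each density cell by its area for the density-type datasets.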

Consistency Analysis of Population Spatial Distribution
In contrast to the accuracy results, the four datasets are the most consistent in the Tibet Autonomous Region of China, where the lack of geographical information data leaves their spatial distribution characteristics basically the same (Figure 2). There, the proportion of completely/highly consistent regions is as high as 97.01%. In the UK, the proportion of completely/highly consistent regions is the lowest, at only 66.75% (Table 5). The proportion of completely/highly consistent regions is slightly higher in Argentina than in Sri Lanka, at 82.06% and 81.80%, respectively (Table 5), although Sri Lanka's urbanization rate is lower (the proportion of urban population in 2015 was 18.3%) and Argentina's urbanization rate is high (91.5% in 2015). However, Sri Lanka's population differentiation is more complicated, and its overall population density is higher than Argentina's, which indicates that the spatial distribution patterns of the datasets differ considerably in areas with high population density and complex variation.
Figure 2. Spatial distribution consistency of four datasets in each case area (completely consistent: four datasets are consistent; highly consistent: three datasets are consistent; lowly consistent: two datasets are consistent; completely inconsistent: each dataset is inconsistent with the others).
Pairwise comparison and analysis of the data show that the highest consistency exists between WorldPop and the other datasets, which may be related to its abundant data sources and auxiliary data and its reasonable redistribution rules. In the UK and Argentina, the consistency between WorldPop and GHSL is the highest, and is 5-15% higher than that between WorldPop and GPWv4. In Sri Lanka and the Tibet Autonomous Region of China, the consistency between WorldPop and GPWv4 is the highest, and the consistency between WorldPop and GHSL is 4-30% lower than that between WorldPop and GPWv4. This indicates that the portrayal of population distribution characteristics varies with the spatialization method in different case areas, and that GHSL, which integrates built-up areas extracted from remote sensing, is more advantageous in areas with a high urbanization level (Table 6).
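The pairwise comparison can be sketched as the share of pixels on which two datasets fall into the same density level. The sketch assumes that the reclassified grids have already been aligned to a common grid and flattened to lists of equal length; the function name and the toy level values are illustrative only.

```python
from itertools import combinations


def pairwise_agreement(level_grids):
    """Return {(name_a, name_b): % of pixels with identical density level}."""
    result = {}
    for a, b in combinations(sorted(level_grids), 2):
        grid_a, grid_b = level_grids[a], level_grids[b]
        if len(grid_a) != len(grid_b) or not grid_a:
            raise ValueError("grids must be non-empty and of equal length")
        same = sum(1 for x, y in zip(grid_a, grid_b) if x == y)
        result[(a, b)] = 100.0 * same / len(grid_a)
    return result


# Toy example with three hypothetical level grids of four pixels each
toy = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 9], "C": [1, 0, 0, 0]}
agreement = pairwise_agreement(toy)
# agreement == {("A", "B"): 75.0, ("A", "C"): 25.0, ("B", "C"): 25.0}
```

With the four datasets as keys, the function returns the six pairwise proportions, from which comparisons such as WorldPop versus GHSL and WorldPop versus GPWv4 can be read off.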

Consistency of Datasets in Different Population Density Levels
To explore the spatial relationship between population density and consistency, we statistically analyzed the distribution of population density levels in consistent and inconsistent regions. In the completely/highly consistent regions (Figure 3) of the four case areas, each dataset is dominated by low-density population of levels 1-3, with an area proportion of more than 45%, which indicates that data consistency is high in areas of extremely low population density. Especially in Tibet and Argentina, the proportion of areas with population density levels 1 and 2 is as high as 77%, and in the UK it is 51%. In Sri Lanka, where population density is high and the spatial distribution is relatively uniform, highly consistent areas occur at every population density level. In the UK, with its high level of urbanization, 40-82% of the high-density population areas (levels 7-9) are highly consistent (WorldPop, 81.72%; GHSL, 66.94%; GPWv4, 58.90%; HYDE, 40.47%). Because HYDE contains historical data for long time series, its spatial resolution is far lower than that of the others. Therefore, in densely populated and highly heterogeneous areas, its spatial accuracy is reduced by mixed pixels and by the precision of the original data, as reflected in the UK and Sri Lanka (Figure 3). Lowly consistent/completely inconsistent regions are mainly distributed in medium population density areas with a high urbanization level. Among the medium-density population areas in the UK and Argentina, in 62-93% of the regions the four datasets are completely inconsistent or only two are consistent (Figure 4, Table 7).
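The level statistics reported in this section amount to a zonal summary: within a given consistency region, what share of pixels falls into each density level. The following sketch assumes equal-area pixels, so that pixel shares equal area shares, and uses toy values; it illustrates the calculation rather than the ArcGIS workflow used in the study.

```python
from collections import Counter


def level_shares(levels, in_region):
    """Share (%) of each density level among the pixels flagged as inside the region."""
    selected = [lv for lv, keep in zip(levels, in_region) if keep]
    if not selected:
        return {}
    counts = Counter(selected)
    return {lv: 100.0 * n / len(selected) for lv, n in sorted(counts.items())}


# Toy example: density levels of one dataset and a mask of completely/highly consistent pixels
toy_levels = [1, 1, 2, 3, 9, 1]
toy_mask = [True, True, True, False, True, False]
shares = level_shares(toy_levels, toy_mask)
# shares == {1: 50.0, 2: 25.0, 9: 25.0}
```

Summing the shares of levels 1 and 2 gives the kind of low-density proportion quoted in the text; inverting the mask gives the same summary for lowly consistent/completely inconsistent regions.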

Discussion
It can be seen that the spatial patterns of the spatial population datasets produced by different methods and data sources are very similar in Tibet, where data are scarce and the population is sparse.
In the data selection of such regions, the accuracy of population estimation and the time scale needed for research are the main considerations. For regions with high levels of urbanization, we should not only consider spatiotemporal resolution and accurate quantity, but also pay more attention to the uncertainty of data in areas with medium population density. Based on the results, a table is summarized to show the applicability of datasets in different population density areas (Table 8). This study serves as a basis for not only the selection of population data, but also the future development of population spatialization. In areas where data are lacking, improving the accuracy of spatial population datasets depends more on continuously refining demographic data [52][53][54][55][56] and abundant data sources [57]. The difficulty in obtaining data in areas at high altitude and with poor data quality may be the reason for the large relative error in the Tibet Autonomous Region of China [58]. Remote sensing, mobile communication, and other big data will play important roles in improving the accuracy of spatial population data in areas with deficient data. For areas with medium population density, with the development of spatialization methods, from simple interpolation to machine algorithms based on intelligent models such as neural networks, decision trees, genetic algorithms and random forest [9,48,59,60], strengthening the experimental research and verifying such areas will improve the reliability and consistency between datasets. Verifying the accuracy of spatial population datasets is a massive problem in the research. According to the comparison between the population of spatial datasets and census data in this study, not only are there differences in spatial layout, but there is also about 20% deviation in the population. 
Therefore, a necessary direction for population spatialization is to develop standard experimental areas with different geographical characteristics and more detailed statistical units, even at grid scale, and to provide verification data for the population totals and spatiotemporal layout of spatial data designed for various applications. In addition, urban and rural populations are two concepts of population geography corresponding to urban and rural areas. In most countries the urban population generally includes the population of small cities, while in China it usually refers to the population of towns [61]. Although the population scale of towns in China is equivalent to that of small cities in other countries, the difference in the definition of urban/rural population may have a slight effect on the results.

Table 8. Applicability of datasets in different population density areas (columns: HYDE, GPWv4, GHSL, WorldPop). Notes: The number of stars indicates the degree of applicability of datasets.

Conclusions
In order to understand differences in the number and spatial distribution of the main spatial population datasets in the world, four datasets with different spatiotemporal resolutions (HYDE, GPWv4, GHSL and WorldPop), developed based on multiple data sources and spatialization methods, were selected, and Sri Lanka, the UK, Argentina and the Tibet Autonomous Region of China were taken as the case areas. This paper conducted research from the aspects of relative error of population, consistency of population spatial distribution, and the characteristics of population density distribution within consistent and inconsistent regions. Furthermore, this paper analyzed the causes of the differences by combining the data production process and the difficulty of data acquisition, urbanization level and the characteristics of population distribution for the case areas. The results show the following: (1) The differences in source data and spatialization methods between datasets affect their accuracy. The development of remote sensing and deep learning technology promotes the progress of data collection and spatialization methods. Therefore, the accuracy of each dataset in the study is very different. Because GPWv4 is based on 2010 census data for allocation according to the principle that the population in each administrative unit is unchanged, and GHSL is based on GPWv4 for secondary spatialization, their absolute value for the relative error of total population is the smallest, both of which being within 3%. Although WorldPop uses the same data source as GPWv4, the relative error of the former is as high as 20% in Argentina, Sri Lanka and the Tibet Autonomous Region of China, due to different spatialization methods. 
HYDE, for the purpose of producing long time series historical data, has medium accuracy for estimating the population of the UK, Argentina and Sri Lanka; (2) The application of geospatial data makes the datasets more accurate in the UK with abundant information, where the absolute value of the relative error of the four datasets is less than 4%. In other case areas, the absolute value of the relative error of GPWv4 and GHSL is less than 3%, and that of HYDE and WorldPop is between 5% and 25%. Affected by the imprecision of statistical data and the difficulty in obtaining new auxiliary data, the relative error of datasets in the Tibet Autonomous Region of China is relatively large, especially with HYDE using historical literature data and WorldPop using multi-source geographic information data. With regard to the ability to describe spatial distribution, the pairwise consistency between WorldPop and the other three datasets is the highest due to the fusion of multiple data sources, and GHSL, which mixes built-up area distribution information extracted from remote sensing, has more advantages in terms of spatial consistency in areas with a high urbanization level. It is difficult to spatialize population distribution in areas with complex variation, characterized by reduced consistency in spatial distribution. The consistency of population spatial distribution for the four datasets is the highest in the Tibet Autonomous Region of China, where the total proportion of four and three datasets being consistent is as high as 97.01%. On the other hand, in the UK, where the population spatial distribution is complex, only 66.75% of the regions are completely or highly consistent; (3) Areas where the four datasets are completely/highly consistent are mainly distributed in low population density areas. 
In Tibet, Argentina and the UK, the proportions of levels 1 and 2 in completely/highly consistent areas are as high as 89%, 76% and 92%, respectively, indicating that data consistency is high in low-density areas. In addition, in highly urbanized and densely populated areas, the spatial distribution of each dataset is also highly consistent, and 62% of high-density population areas in the UK are completely/highly consistent areas. The lowly consistent/completely inconsistent regions are mainly distributed in medium-density areas with a high urbanization rate, and 62-93% of medium-density population areas in the UK and Argentina are lowly consistent/completely inconsistent regions.