GHS-POP Accuracy Assessment: Poland and Portugal Case Study

Calka, Beata; Bielecka, Elzbieta

doi:10.3390/rs12071105

by Beata Calka * and Elzbieta Bielecka

Faculty of Civil Engineering and Geodesy, Military University of Technology, gen. S. Kaliskiego 2 st., 00-908 Warsaw, Poland

* Author to whom correspondence should be addressed.

Remote Sens. 2020, 12(7), 1105; https://doi.org/10.3390/rs12071105

Submission received: 6 March 2020 / Revised: 26 March 2020 / Accepted: 29 March 2020 / Published: 31 March 2020

(This article belongs to the Special Issue European Remote Sensing-New Solutions for Science and Practice)

Abstract

The Global Human Settlement Population Grid (GHS-POP), the latest released global gridded population dataset based on remotely sensed data and developed by the EU Joint Research Centre, depicts the distribution and density of the total population as the number of people per grid cell. This study aims to assess the GHS-POP data accuracy based on root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the correlation coefficient. The study was conducted for Poland and Portugal, countries characterized by different population distributions, and for the two spatial resolutions of the GHS-POP, 250 m and 1 km. The main findings show that as the size of administrative zones decreases (from NUTS (Nomenclature of Territorial Units for Statistics) to LAU (local administrative unit)) and the GHS-POP cell size increases, the difference between the population counts reported by the European Statistical Office and those estimated by the GHS-POP algorithm becomes larger. At the national level, MAPE ranges from 1.8% to 4.5% for the 250 m and 1 km resolutions of GHS-POP data in Portugal, and from 1.5% to 1.6%, respectively, in Poland. At the local level, however, the error rates range from 4.5% to 5.8% in Poland, for 250 m and 1 km, and from 5.7% to 11.6% in Portugal, respectively. Moreover, the results show that for densely populated regions the GHS-POP underestimates the population number, while for thinly populated regions it overestimates it. The conclusions of this study are expected to serve as a quality reference for potential users and producers of population density datasets.

Keywords:

global population data; accuracy assessment; remote sensing; GHS-POP; Moran statistics; RMSE; MAE; MAPE

1. Introduction

Reliable information on population numbers in a given geographical area is essential for the rational pursuit of spatial planning as well as economic and social policies. Census data on population density are still available, but they have limitations, such as lengthy acquisition and processing and the fairly common difficulty of accessing very detailed data. With the increasing emphasis on real-time and detailed population information, users have switched to more technology-based data sources [1]. Remote sensing data, as an alternative source, are recognized as a cost-effective way of obtaining population information. Remote sensing provides a more time-efficient way of estimating the residential population, particularly for larger areas [2,3,4].

Geographic information system (GIS) and remote sensing literature present numerous datasets and methods for population estimation [5,6,7,8,9,10]. The potential of remote sensing has been investigated since the 1950s when aerial photos were first utilized to count dwelling units [11,12]. Tobler used satellite remote sensing to study urban populations and found a strong statistical correlation between the settlement radius and number of inhabitants of various cities [13]. This approach is applicable at large regional scales with low-resolution imagery, and it is based on the ‘allometric’ modeling of a direct mathematical relationship between the population of an urban area and its size. This approach was also included in studies by Lo and Welch [14] for Chinese cities and Stern [15] for Sudanese villages. Many researchers use data from SPOT, Landsat TM, and Enhanced Thematic Mapper Plus (ETM+) sensors along with census or survey data on population counts to estimate population size [16]. Lo [17] developed methods for extracting population and dwelling units from a SPOT image for Hong Kong. However, the accuracy of the regression model used to link the spectral radiance values of image pixels with population densities was high only for the whole study area. In 1997, Yuan et al. [18] used regression and scaling techniques to develop a population distribution map in a regional-scale study, using Landsat TM images and population counts from census data. Harvey [19] also used these data to allocate population estimates to each pixel of the image and overcome the problem of spatial aggregation of census data. Li and Weng [20] integrated a Landsat ETM+ image and census data for estimating intra-urban variations in population density in the state of Indiana, USA. Liu et al. [21] and Galeon [22] used very high spatial resolution imagery for population estimation purposes. Liu et al. [21] used linear regression to explore the correlation between census population density and Ikonos image texture for Santa Barbara in California, while Galeon [22] used a Quickbird satellite image to estimate population size using a field survey and regression analysis. In the past few years, LiDAR (Light Detection and Ranging) data integrated with high-resolution imagery have been used for population estimation by several researchers, including Qiu et al. [23], Ramesh [24], and Weng [25]. This literature review shows that remote sensing techniques are useful tools for population estimation. Detailed information about the number and spatial distribution of the population is increasingly important, as it improves the understanding of numerous phenomena and processes related to the Earth’s surface. In particular, applications in natural hazards and risk management, ensuring better living conditions, the spread of diseases, as well as people’s impact on the environment and, as a consequence, climate change, are of the utmost importance. Detailed, timely, and reliable information on the spatial distribution of people is also important for local and regional communities and could have a positive impact on their quality of life and health as well as the surrounding environment.

Population data stored in regular grid cells differ in terms of spatial resolution, year of publication, spatial extent, input data used, and method used to assess accuracy. Different global and continental gridded datasets have been widely discussed in the literature by Leyk et al. [26]. Some of the most popular examples of gridded population data are Gridded Population of the World (GPW) and Global Rural-Urban Mapping Project (GRUMP). Those data may be downloaded from the Center for International Earth Science Information Network’s website. The spatial resolution of GPW is 2.5 arc-minutes, while the spatial resolution of GRUMP is 30 arc-seconds [26,27]. GPW data use a water mask based on remote sensing data to exclude areas of water and permanent ice from the population location [28,29]. The allocation mechanism for GRUMP builds on the GPW approach but explicitly considers the populations of urban areas. The urban settlements in cities were identified using the National Oceanic and Atmospheric Administration night-time satellite images [30,31,32]. The LandScan Global Population Database at 30 arc-seconds (1 km or finer) is provided by the Oak Ridge National Laboratory. The data were developed using census demographic data and geographic data with remote sensing imagery analysis techniques within a multivariate dasymetric modeling framework to disaggregate census counts within an administrative boundary [33,34,35,36,37]. In 2013, the WorldPop project, which was a result of merging three regional population mapping projects (AfriPop, AsiaPop, AmeriPop), was initiated [38]. WorldPop uses a weighted dasymetric approach that relies on a random forest model to produce a predictive weighting layer for dasymetrically redistributing population counts into grid cells. WorldPop utilizes satellite imagery, such as 30 m spatial resolution Landsat Enhanced Thematic Mapper imagery, for mapping settlements [26,39].
The Global Human Settlement Population Grid (GHS-POP) is the latest released global gridded population dataset, developed by the Joint Research Centre (JRC), the European Commission’s science and knowledge service in Ispra, Italy. The population in each grid cell was estimated on the basis of the Global Human Settlement Layer (GHSL), which is primarily based on automatic processing of optical imagery from Landsat satellites [40,41].

Considering the extensive use of gridded population data, assessing the accuracy of population estimates in a grid cell is an important yet still challenging issue. The simplest accuracy assessment method is based on comparing the number of people at the pixel level. However, it is a difficult task because it is almost impossible to obtain sufficient true values (i.e., the exact number of people) at the grid scale given that population distributions are highly dynamic. Another method is to conduct a comparative analysis of the population number for administrative units on one selected level, from countries to districts and counties [33,42]. Hay et al. [43] illustrated the accuracy of GRUMP, GPW, and LandScan data in determining at-risk populations of various climate levels prone to malaria infection. Tatem et al. [44] assessed the accuracy of GRUMP, LandScan, and GPW, demonstrating the effects of spatial population dataset choice on estimates of populations at risk of falciparum malaria and their detailed country-level assessments. The literature presents numerous metrics for comparative evaluation of raster population distribution [36,42,43,44,45]. Freire et al. [40] used correlation analysis to assess the accuracy of GHS-POP 2015 data. GEOSTAT 2011 data were used as the reference; however, the results are rather limited due to the lack of other independent reference data. The simpler metrics measure the differences between two datasets, while other metrics are adopted from geostatistics, applied meteorology, and signal processing, and then applied to compare grid-based population datasets [36]. Moreover, errors of the population estimation are generally measured by root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Bai et al. [42] assessed and compared the estimation accuracy of the GPW, GRUMP, WorldPop and CnPop (the China 1 km grid population dataset) at the province level in China using those error values. The results show that, for most of the country, the analyzed data provide good estimation accuracy. Nevertheless, in some regions, like coastal zones, the differences in population numbers were significant (i.e., the MAPEs exceeded 50%).

The aim of our study was to assess the accuracy of a gridded population distribution dataset—the GHS-POP 2015. This research is based on comparative analysis with official census data provided by Eurostat. It was carried out for six administrative levels according to the administrative division, in Poland and Portugal, two countries with different population density and spatial population distribution. The novelty of the article lies in using scale effect in accuracy assessment of GHS-POP data, which allowed us to show the results depending on the adopted reference unit [11,46].

The article is organized as follows: the Introduction deals briefly with recent studies using remote sensing data for population density estimation and with different methods of assessing data accuracy. Section 2 presents the study area together with a description of the data. Section 3 describes the methods used in the research, followed by Section 4, where the results are presented on maps in a legible way. Finally, Section 5 and Section 6 contain the discussion and the conclusions, respectively.

2. Materials and Study Area

2.1. Poland and Portugal

The analyses were carried out for Poland, located in Eastern Europe and for Portugal, located in south-western Europe (Figure 1).

According to data provided by the Central Statistical Office (2018), Poland occupies an area of 312,696 km², and it is inhabited by 38,005,614 people (data for 2019), which account for 5.4% of the European population and 0.5% of the world population. In terms of its area, Poland ranks as the 69th largest country in the world and the 9th in Europe, while in terms of population it is the 36th most populated country in the world. The average population density is 123 people/km² [47,48]. The distribution of the population in Poland has a fairly large spatial diversity, which is due to significant diversity of natural conditions and the geographic environment (i.e., soils, terrain, forests, water bodies, and mineral resources). Southern and central Poland are the most densely populated regions due to the economic development of the areas. North-western and north-eastern Poland are the least populated due to considerable afforestation and the small number of cities.

Portugal is three times smaller than Poland, with an area of 92,391 km². In terms of area, it is the 13th biggest country in Europe and the 110th in the world, while the population of Portugal is 10,524,145 inhabitants. Its population constitutes 0.13% of the total world population, which makes it the 87th most populated country in the world. The average population density in Portugal is 114 people/km² [39]. The highest density is found in the coastal lowlands (700 people per km² in the Lisbon region) and Madeira, while the lowest density is in the south, in the driest part of the country, as well as in the mountains (about 20 people per km²). Most Portuguese, 55% of the population, live in cities (one of the lowest indicators in Europe).

In addition, Portugal comprises two archipelagos: the Azores and Madeira. The autonomous Region of the Azores is an archipelago of nine volcanic islands located in the middle of the Atlantic Ocean. The islands, with an area of about 2346 km² (São Miguel, the largest one, covers 744.55 km²), are inhabited by 246,750 people [49]. Madeira is the largest island of the Madeira Archipelago with an area of 741 km², a length of 57 km (from west to east), a width of 22 km at the widest point, and the length of the coast is about 140 km. The total area of the Madeira Islands is 801 km², with 246,689 inhabitants. Madeira’s population density is 296 people/km², while on the Azores Islands it is 104 people/km².

Poland and Portugal differ not only in number of people per km², but also in spatial distribution of people. Figure 2 presents CORINE Land Cover data (CLC) for Poland and Portugal. CLC data is a collection of information about land cover, which was created by the European Environment Agency (EEA) and its member countries and based on the results of IMAGE2000, a satellite imaging program undertaken jointly by the Joint Research Centre of the European Commission and the EEA. The urban fabric layer shows the population residence. Poland is characterized by highly dispersed settlement; there are several major Polish cities with the largest populations in Poland [50]. In Portugal, the population is mainly concentrated on the western and southern coasts. The northern and eastern parts of Portugal are mountainous areas with a smaller number of localities, and thus, population.

2.2. The GHS-POP Data

The GHS-POP is the newest global raster population data, released in 2018 by the European Commission Joint Research Centre [40]. It depicts the distribution and density of population expressed as the number of people per cell. Those data are produced in an equal-area projection at 250 m (GHS-POP 250 m) and 1 km (GHS-POP 1 km) spatial resolution. Residential population estimates for 1975, 1990, 2000, and 2015 were disaggregated from census or administrative units to grid cells based on the distribution of built-up areas in the Global Human Settlement Layer (GHSL) [51,52]. The disaggregation methodology is described in a conference scientific paper [40].

Figure 3 presents the GHS-POP data for 2015 for Poland and Portugal for two cell sizes (250 m and 1 km), which were used in this analysis.

Descriptive statistics of the GHS-POP data for Poland and Portugal are presented in Table 1. Positive skewness values and high values of kurtosis indicate that population data in Poland and Portugal are characterized by non-normal distributions with right-hand asymmetry. This is also confirmed by the unequal values of means and medians.
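To illustrate how these indicators behave, the following minimal Python sketch computes the mean, median, skewness, and excess kurtosis of a small list of cell values. The function name `moments_summary` and the sample values are hypothetical and are not taken from Table 1; population (biased) moment formulas are assumed, which may differ from the estimator used by the authors.

```python
def moments_summary(values):
    """Mean, median, skewness and excess kurtosis from population moments."""
    n = len(values)
    mean = sum(values) / n
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Central moments of order 2, 3 and 4
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": median,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Hypothetical cell values: many near-empty cells and a few dense urban cells
cells = [0, 0, 0, 1, 1, 2, 3, 5, 40, 250]
summary = moments_summary(cells)
print(summary)
```

For such right-skewed values the mean lies well above the median and both skewness and excess kurtosis are positive, which is the pattern described above.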

2.3. Census Data

Population counts for 2015 for administrative units, together with the unit boundaries for all six NUTS (Nomenclature of Territorial Units for Statistics) levels, were obtained from the Eurostat web page. Eurostat is an institution that collects demographic data at national and regional levels from the national statistical offices of European Union (EU) Member States, European Free Trade Association (EFTA) countries, and EU candidate countries. Population data are provided to Eurostat several times a year, which means that they are constantly updated and can be considered as reference data [53].

The country level (i.e., NUTS 0) corresponds to the areas of Poland and Portugal, respectively. NUTS 1 is the level of macroregions, with seven in Poland, and with each macroregion consisting of several provinces. NUTS 2 includes regions, and NUTS 3 covers 25 units in Portugal and 73 units in Poland with groups of counties (Table 2). In Poland NUTS level 4 (LAU 1, i.e., local administrative unit), consists of 380 counties or towns with county rights, and LAU 2 refers to municipalities [54]. In Portugal municipalities are classified as LAU 1, while civil parishes as LAU 2 [53].

3. Methods

Based on the knowledge obtained from the available literature and the technical specifications of GHS-POP data, we hypothesized that the data are reliable regardless of the adopted reference unit, and that the spatial distribution of the highest overestimation and underestimation of the number of people is dispersed and random. The mean absolute percentage error (MAPE) was chosen as the most appropriate indicator to compare two sets of population data. The MAPE is the summary measure most often used for evaluating the accuracy of population forecasts [55]. As stated by Tayman et al. [56], it measures forecast precision and represents the average percent error over all observations, ignoring its direction.

To prove the hypothesis, the study aims to answer the following research questions:

What is the accuracy of the GHS-POP? Do the results of the data accuracy assessment change depending on the reference unit size? Which administrative level presents the highest errors?
The answers to these questions are based on the analysis of MAE, RMSE, and MAPE calculated for reference units at six administrative levels (from country to municipality).
What are the maximum and minimum MAPE values at each administrative level? Where are the units with maximum underestimation and overestimation located? Are the differences between MAPEs of GHS-POP 1 km and GHS-POP 250 m significant?
The values and the spatial diversity of the MAPE for administrative units are shown on the choropleth maps. The answer to this question allows us to discover administrative units with the maximum overestimations and underestimations of the population for GHS-POP 1 km and GHS-POP 250 m. It also allows us to indicate units with the highest differences between MAPE values for 250 m and 1 km grid cells.
Is the spatial pattern of the MAPE at LAU 2 really random? What is the MAPE structure and diversity?
The answer is based on testing spatial autocorrelation using administrative unit locations and MAPE values simultaneously. The Moran’s I index value as well as the z-score and p-value were calculated. The diversity of the MAPE was evaluated using statistical measures of central tendency, position, and dispersion (i.e., the mean, the median, and the coefficient of variation). Box plot analysis was used to detect extreme MAPE outliers.

For this method, the estimation of GHS-POP data accuracy is based on the determination of the differences between the number of people obtained from census data (via Eurostat [53]) and the number of people from GHS-POP data, calculated for administrative units, using zonal statistics. The absolute estimation error (AEE) was used for the analysis. It measures the total difference between number of people in administrative units and the population number from corresponding spatial grid cells from the GHS-POP; it is expressed as Equation (1) [33]:

\mathrm{AEE} = \mathrm{NPE} - \sum_{k=0}^{n} \mathrm{GHS\_POP}_{k}

(1)

where NPE is the number of people in an administrative unit provided by Eurostat for 2015, and GHS_POP_k is the number of people in grid cell k of the GHS-POP within the corresponding administrative unit.

The AEE_i takes values in the range <−GHS-POPmax; NPEmax>. Values lower than 0 mean data overestimation, and values greater than 0 mean underestimated GHS-POP data.
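The calculation in Equation (1) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each grid cell is assumed to be already assigned to exactly one administrative unit, and the function names (`zonal_sums`, `absolute_estimation_error`), unit labels, and numbers are hypothetical. The GIS zonal statistics tool actually used in the study is not specified here.

```python
from collections import defaultdict

def zonal_sums(cell_population, cell_zone):
    """Sum gridded population per administrative unit (zonal statistics)."""
    totals = defaultdict(float)
    for population, zone in zip(cell_population, cell_zone):
        totals[zone] += population
    return dict(totals)

def absolute_estimation_error(census, gridded_totals):
    """AEE = census count minus summed grid-cell population, per unit."""
    return {zone: census[zone] - gridded_totals.get(zone, 0.0) for zone in census}

# Hypothetical example: five grid cells falling into two units, "A" and "B"
cell_population = [120.0, 80.0, 10.0, 5.0, 0.0]
cell_zone = ["A", "A", "B", "B", "B"]
census = {"A": 210, "B": 12}

aee = absolute_estimation_error(census, zonal_sums(cell_population, cell_zone))
for zone, error in aee.items():
    status = "underestimated" if error > 0 else "overestimated" if error < 0 else "exact"
    print(zone, error, status)
```

In this made-up case unit A has a positive AEE (the grid underestimates the census count) and unit B a negative one (the grid overestimates it), matching the sign convention stated above.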

Root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R²) were used for accuracy estimation. MAE measures the average absolute difference between the GHS-POP and statistical population data (Equation (2)). RMSE refers to the amount by which the population counts predicted by the GHS-POP algorithm differ from the values provided by the statistical office (Equation (3)), while MAPE, one of the most common metrics used to measure forecasting accuracy, is calculated as the mean absolute percent error (Equation (4)).

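\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{AEE}_{i} \right|

(2)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \mathrm{AEE}_{i}^{2}}

(3)

\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\mathrm{AEE}_{i}}{\mathrm{NPE}_{i}} \right| \cdot 100

(4)

where N is the number of administrative units, and AEE_i and NPE_i are the absolute estimation error and the Eurostat population of unit i.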
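A minimal Python sketch of Equations (2) to (4) is given below. The function name `accuracy_metrics` and the three-unit example are hypothetical and serve only to show how the three measures are derived from the same per-unit errors.

```python
import math

def accuracy_metrics(census, gridded):
    """Return (MAE, RMSE, MAPE in percent) for paired per-unit population counts."""
    errors = [npe - ghs for npe, ghs in zip(census, gridded)]  # AEE per unit
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / npe) for e, npe in zip(errors, census)) / n
    return mae, rmse, mape

# Hypothetical census counts and gridded totals for three administrative units
census = [100.0, 200.0, 400.0]
gridded = [110.0, 180.0, 400.0]

mae, rmse, mape = accuracy_metrics(census, gridded)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} MAPE={mape:.2f}%")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a large gap between the two points to a few units with very large errors.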

Based on descriptive statistics of the MAPE, elaborated for the GHS-POP 250 m and 1 km resolutions as well as all administrative levels of Poland and Portugal, it was assumed that overestimation or underestimation of population counts is high (an outlier) when the MAPE value lies more than 1.5 times the interquartile range (IQR) from the first quartile, and significant (an extreme outlier) when it lies more than 3.0 IQR from the first quartile. The results of the accuracy assessment are presented in the form of choropleth maps with manually determined class ranges, in order to obtain an appropriate visualization of the MAPE outliers.
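The outlier rule can be sketched as follows. Note the assumption: the sketch uses the conventional box plot fences, Q1 − k·IQR and Q3 + k·IQR with k = 1.5 and k = 3.0, which may differ in detail from the rule applied by the authors. The quartile interpolation, the function names, and the sample values are likewise hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def classify_outliers(values, mild=1.5, extreme=3.0):
    """Label each value as 'typical', 'outlier' or 'extreme' using IQR fences."""
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q3 - q1
    labels = []
    for value in values:
        if value < q1 - extreme * iqr or value > q3 + extreme * iqr:
            labels.append("extreme")
        elif value < q1 - mild * iqr or value > q3 + mild * iqr:
            labels.append("outlier")
        else:
            labels.append("typical")
    return labels

# Hypothetical signed percentage errors for nine administrative units
signed_errors = [-40.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 9.0]
labels = classify_outliers(signed_errors)
print(list(zip(signed_errors, labels)))
```

Here Q1 = −2 and Q3 = 2, so the fences are ±8 (outliers) and ±14 (extreme outliers): the value −40 is flagged as extreme and 9 as an ordinary outlier.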

The hypothesis of a random pattern of MAPE at LAU 2 is tested through global indices of spatial autocorrelation across six levels of administrative units. High values of the Moran’s I and corresponding z-scores greater than 1.96 indicate that there is statistically significant clustering across the counties (p < 0.05). Low values of Moran’s I and z-scores lower than −1.96 indicate that there is statistically significant regularity (i.e., nearby counties have different MAPEs). Moran’s I can be thought of as a spatially weighted form of Pearson’s correlation coefficient [57]. The value “1” means perfect positive spatial autocorrelation (high values or low values cluster together), while “−1” suggests perfect negative spatial auto-correlation (a checkerboard pattern), and “0” implies perfect spatial randomness [58].
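For readers unfamiliar with the statistic, a minimal Python sketch of global Moran's I with binary contiguity weights is shown below. The function name `morans_i`, the neighbour structure, and the values are hypothetical. The sketch does not compute the z-score or p-value, and the spatial weights actually used in the study are not specified here.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary weights (w_ij = 1 if j is a neighbour of i)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # S0 is the sum of all weights, i.e. the number of directed neighbour links
    s0 = sum(len(adjacent) for adjacent in neighbours.values())
    cross_products = sum(
        deviations[i] * deviations[j]
        for i, adjacent in neighbours.items()
        for j in adjacent
    )
    return (n / s0) * cross_products / sum(d * d for d in deviations)

# Four hypothetical units arranged in a row; each touches its direct neighbours
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1.0, 1.0, 5.0, 5.0], chain)    # similar values adjacent
alternating = morans_i([1.0, 5.0, 1.0, 5.0], chain)  # checkerboard-like pattern
print(clustered, alternating)
```

The clustered arrangement yields a positive index (1/3 in this toy case), while the alternating arrangement yields −1, the checkerboard pattern described above.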

4. Results

4.1. Error Determination

For the NUTS 0 level (i.e., the country level) differences between the population number provided by Eurostat and the GHS-POP were determined using PE (percentage error). For GHS-POP 1 km the PE values in Poland and Portugal are −1.4% and 1.1%, respectively; for GHS-POP 250 m they are −1.5% and 0.4%, respectively. Percentage error values show that the GHS-POP data are overestimated for Poland and underestimated for Portugal. RMSE, MAE, and MAPE values for the other levels of administrative division (NUTS 1 to NUTS 5) were also calculated (Table 3).

The highest RMSE and MAE values were noted for the NUTS 1 level. As the number of reference units increased and their area decreased, the RMSE and MAE decreased as well (Figure 4 and Figure 5). An inverse situation occurred for the MAPE; for small reference units the percentage difference between GHS-POP population data and census data grew (Figure 6). It is noteworthy that the differences between the RMSE and MAE values, especially for the LAU 1 and LAU 2 levels, were significant, which suggests that the variance of the individual errors was large. This confirms the existence of local extremes. The maximum MAPE value, noted for GHS-POP 1 km at LAU 2 in Portugal, was 11.6%.

MAPE for the GHS-POP data at 1 km resolution are larger than for data at 250 m resolution. This is particularly evident for Portugal at the level of civil parishes. The reference units at this level are the most fragmented and have the smallest areas.

4.2. Cartographic Visualization of MAPE

4.2.1. NUTS 1

The MAPE values at the macroregion level in Poland for GHS-POP 1 km and GHS-POP 250 m are similar (see Figure 7a,c); the largest differences in population counts were found in Makroregion południowo-zachodni and Makroregion wschodni (−3.8% and −2.0% for 250 m grid cells, −3.9% and −2.0% for 1 km grid cells, respectively). The lowest MAPE values were obtained in Makroregion województwo mazowieckie, at −0.4% and −0.5% for the 250 m and 1 km resolutions, respectively. These are the areas with the highest population density in Poland and, at the same time, with the largest number of big cities. The results indicate that the GHS-POP 250 m and 1 km are overestimated. All macroregions marked with different shades of red correspond to negative values of the MAPE.

In Portugal, the GHS-POP 1 km and 250 m underestimate the population counts, with the largest underestimation in the Azores at 10.8% and 5.1% for 1 km and 250 m data, respectively (Figure 7c–e). The differences in the population of mainland Portugal take the value of 0.3% and 0.8%, for the 250 m and 1 km GHS-POP data, correspondingly.

4.2.2. NUTS 2

In Poland, all values apart from those for the Warszawski stoleczny region were negative, which means overestimation of the GHS-POP population counts. However, the largest overestimation occurred in the Dolnoslaskie region, with MAPE values at about −4.1% and −3.9% in GHS-POP 250 m and 1 km grids. An underestimation of 0.1% was found in the capital city of Warsaw, the smallest region with a continuous dense urban fabric and a large number of people. At this level, neither the underestimation nor the overestimation of population counts exceed 5% (Figure 8a).

In four out of seven regions in Portugal there was an underestimation of GHS population data at 250 m resolution (Figure 8b). At 1 km resolution, in six out of seven regions the census population was greater than the GHS-POP data (Figure 8d). The Area Metropolitana de Lisboa (i.e., the area with Lisbon, the Portuguese capital) and the Norte were overestimated the most for 1 km cells. These two areas have the highest population number of all the regions. At the same time, the Azores and Madeira Islands were largely underestimated. These are the smallest regions with the lowest populations (246,000 and 258,000 people, respectively). The biggest population overestimation occurred for the touristic, seaside region Algarve. The MAPE values were −3.7% and −4.6% for 250 m and 1 km grid cells.

4.2.3. NUTS 3

In nine sub-regions at 1 km grid cells and in ten at 250 m grid cells the population number in GHS-POP data in Poland was underestimated (Figure 9a,c). However, the highest value of underestimation amounted only to 2.5% in GHS-POP 1 km and 1.1% in GHS-POP 250 m in Miasto Warszawa region. In general, the highest overestimation was observed in the northern part of Poland, mainly the rural and woodland areas. The highest overestimation also occurred in the Walbrzyski, Opolski and Nowotarski sub-regions. However, only in the Walbrzyski sub-region did the overestimation exceed 14.3% for 250 m grid cells and 14.5% for 1 km grid cells, while for the other sub-regions it was about 5%. In Portugal, a substantial underestimation was recorded in Ave and Area metropolitana de Porto, two regions with high population density. The highest overestimation was observed in Algarve, a touristic seaside area (Figure 9b,d).

4.2.4. LAU 1 (NUTS 4)

In 176 Portuguese counties (57% of all counties) and 115 counties in Poland (30% of counties), the population was underestimated (Figure 10a–e). The highest value of underestimation in Poland was in Elbląg (the MAPE value was 94% in the GHS-POP 250 m and 93% in the GHS-POP 1 km), Włocławek (90% and 92% in 250 m and 1 km grid cells, respectively), and Suwałki (88% and 85% in 250 m and 1 km grid cells, respectively). The highest underestimation in Portugal occurred in Santa Cruz, a city in Madeira Island with 5780 people according to Eurostat data; the underestimation was 37% in 250 m grid cells of the GHS-POP, and 16% in 1 km grid cells of the GHS-POP.

MAPE value analysis shows that overestimation had higher values than underestimation. The highest overestimation was in powiat grudziądzki and powiat elbląski, which are located in the northern part of Poland. The overestimation of population count was −212.5% in powiat grudziądzki and −202.2% in powiat elbląski (at 250 m grid cells) and −199.3% and −196.6% in the GHS-POP 1 km. The highest overestimation value in Portugal was in Porto Moniz, with −15.1% and −13.6% of the MAPE value, and Albufeira (−14.0% and −13.0%, respectively).

4.2.5. LAU 2 (NUTS 5)

In 1623 municipalities of Poland (which constitute 65% of all municipalities) at 250 m grid cells and in 1547 municipalities (which constitute 62% of all municipalities) at 1 km grid cells, the population in the GHS-POP data was overestimated. In Portugal, the number of people in the GHS-POP data was underestimated in most civil parishes. In 2582 civil parishes at 250 m grid cells (83% of all civil parishes) of the GHS-POP and in 2238 civil parishes at 1 km grid cells the population was underestimated. The highest value of overestimation in Poland was −661% at 1 km grid cells and −655% at 250 m grid cells in Wiśniowa. In Portugal it was −423% at 250 m grid cells and −424% at 1 km grid cells in Montalegre e Padroso. The highest underestimation was observed in Wieliczka in Poland, with the MAPE amounting to 86% at 250 m grid cells and 82% at 1 km grid cells. The highest underestimation in Portugal was for Angra on the Azores (Figure 11). The class with the highest underestimation values (37.9–98.9) contained 20 objects, with four objects in Portugal and 16 objects in Poland. The class with the highest overestimation values (−654.8 to −395.6) contained three objects (two in Poland, one in Portugal).

4.3. LAU 2: Spatial Distribution and Statistics of Extreme Outliers

The spatial distribution of MAPE values at LAU 2 level was analyzed using Moran statistics. In Poland at GHS-POP 250 m and GHS-POP 1 km the z-score took the values of 0.4870 and −0.1291, respectively, which confirms the fact that the distribution of errors is random. In Portugal the z-score values were 1.6204 and −0.2483 (Table 4).

The statistical analysis of the MAPEs shows that there are high overestimation and underestimation values. Descriptive statistics of the MAPE for LAU 2 in Poland and Portugal are shown in Table 5.

The maximum MAPE value in Poland was 86.44 (250 m grid cells) and 82.44 (1 km grid cells), and in Portugal 83.75 (250 m grid cells) and 198.91 (1 km grid cells). The distribution of the MAPE values was close to normal (Figure 12 and Figure 13). The median and mean values of the MAPEs in Poland are similar, but there are some outliers. This has also been confirmed by the Grubbs test (Table 5). The 10% trimmed mean shows how strongly the mean is affected by outliers; it is better suited to datasets with erratic high or low values, especially skewed distributions. The box plots (see Figure 12 and Figure 13) show distributions of population counts in grid cells and allow outliers to be detected. The number of extreme values (marked in Figure 12 in red) in Poland was 46 (GHS-POP 250 m), of which 22 were overestimated. In Portugal there were nine civil parishes with extreme values of the MAPE; the population in one of them was overestimated, while it was underestimated in the other eight (see Figure 13).
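The effect of trimming can be illustrated with a short Python sketch. It assumes that a "10% trimmed mean" discards 10% of the observations from each tail before averaging; the text does not state which convention was used, and the function name `trimmed_mean` and the sample values are hypothetical.

```python
def trimmed_mean(values, proportion=0.10):
    """Mean after discarding the given proportion of values from each tail."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # number of values dropped per tail
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

# Hypothetical signed percentage errors with one strong overestimation
errors = [-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

plain_mean = sum(errors) / len(errors)
robust_mean = trimmed_mean(errors)
print(plain_mean, robust_mean)
```

In this made-up sample a single extreme value pulls the ordinary mean to −0.5, whereas the trimmed mean stays at 4.5, close to the bulk of the data.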

The spatial distribution of extreme MAPE outliers, marked in Figure 12 and Figure 13 as red circles, is presented in Figure 14. In Poland at 250 m grid cells MAPE values are concentrated around zero, which means that all MAPE values below −16.7% and above 13.8% are considered to be outliers (Figure 14a). There are 29 municipalities for which the population in the GHS-POP is overestimated (1% of all municipalities), with 60 underestimated ones (2% of all municipalities). That means that 3% of all municipalities have extreme outliers of MAPE values. Such municipalities represent 3.9% of the total country area. Municipalities with an underestimated population are those with a large share, even up to 60%, of built-up areas (the average is 36%) and low percentage of forests (10% on average). The GHS-POP algorithm overestimates people count in agricultural and woodland areas (even by 65% in woodlands or 45% in agricultural areas), where the settlement network is spatially dispersed.

For the GHS-POP with 1 km grid cells, the values of extreme outliers of the MAPE are below −20.2% and over 18.7% (Figure 14e). For 54 municipalities the number of people in the GHS-POP is underestimated (2% of all municipalities) and for 48 municipalities the number of people is overestimated (1.9% of all municipalities). The municipalities with underestimation represent 2.2% of the country area, and municipalities with overestimation represent 1.8%. Municipalities with an underestimated population are municipalities with a large share of built-up areas (40% on average) and a low percentage of woodland areas (11%). The correlation coefficient between MAPEs for GHS-POP 250 m grid cells and GHS-POP 1 km grid cells is 0.64 and indicates a positive linear relationship.

For the GHS-POP in Portugal, with 250 m grid cells, MAPE values below −15.9% and above 24.7% are considered as outliers (Figure 14b–d). There are 19 civil parishes for which the population in the GHS-POP is overestimated (0.6% of all civil parishes) and nine with an underestimated population (0.3% of all civil parishes). That means that 0.9% of all civil parishes have extreme outliers of MAPE values. For the GHS-POP with 1 km grid cells, the values of extreme outliers of the MAPE are below −32.4% and above 40.8% (Figure 14f–h). In Portugal, the number of people in 71 civil parishes is underestimated (2.3% of all civil parishes), and for 91 civil parishes the number of people is overestimated (2.9% of all civil parishes). Underestimation occurs in areas with a high population number, especially in the north of the country, which has the highest population density. The correlation coefficient between MAPEs for the GHS-POP 250 m grid cells and the GHS-POP 1 km grid cells is 0.70 and indicates a positive linear relationship.
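The correlation coefficients quoted above for both countries compare, unit by unit, the MAPE obtained at 250 m with the MAPE obtained at 1 km. A minimal sketch of that comparison is given below, assuming the Pearson correlation coefficient was meant (the text only speaks of a linear relationship); the function name `pearson_r` and the sample values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-unit errors (%) for the same five units at both resolutions.
mape_250m = [-2.0, 0.5, 4.0, 10.0, -6.0]
mape_1km = [-3.0, 1.0, 3.0, 14.0, -1.0]

r = pearson_r(mape_250m, mape_1km)
print(f"r = {r:.2f}")
```

A value close to 1 would mean that units with a large error at one resolution also tend to have a large error at the other; values such as 0.64 or 0.70 indicate that this holds only partly.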

5. Discussion

The availability of different data, in particular the increasing volume of population data developed on the basis of remote sensing, has created a need for estimating their accuracy [59,60]. The results of credibility assessment are certainly influenced by numerous factors, such as the type of source data, processing methods, time, spatial resolution, the accepted test area, and the methods used. Dasymetric modeling, which has been widely discussed in literature, is another factor that affects data uncertainty [14,32,34]. Freire et al. [40] described the GHS-POP methodology, which is based on dasymetric mapping, relying on GHS-BUILT as a proxy to limit and refine the distribution of people and inform about the appropriate density. This method certainly has some limitations related to the interpretation of remote sensing data. As a result of the difficulty of interpreting built-up areas, the GHS-POP underestimates the population number for densely populated regions, while for thinly populated ones it overestimates. A similar relation was noticed by Calka and Bielecka [33] analyzing LandScan data.

Populated polygons (administrative units of different levels) were used in the analysis as source zones in zonal statistics. The method assigns the population number from the GHS-POP raster cell to the administrative unit in which the centroid of the given raster cell is contained. However, the method has some limitations, mainly related to the size of both inputs: the zone polygons and the grid cells. If the zone polygons are relatively small (like LAU 2 units) and the GHS-POP grid cell is 1 km, using centroids to sum the number of people gives worse results than for GHS-POP 250 m; the MAPE for a 1 km cell is higher than for a 250 m raster resolution. Moreover, the NUTS and LAU administrative systems in Poland and Portugal are unequal in terms of unit numbers and size, hence the assessment analysis provides different results, regardless of the GHS-POP spatial resolution. A similar dependence was noticed by Stillwell and Thomas [61] when analyzing migration within England in the mid-2000s.
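The centroid rule described above, and the reason it degrades for coarse cells, can be illustrated with a small Python sketch. Everything in it is hypothetical: the two rectangular zones, the toy grids, the function names, and the convention that row 0 is the bottom row of the grid. It is a sketch of the general technique, not of the GIS tool used in the study.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def zonal_sum_by_centroid(grid, cell_size, zones, x0=0.0, y0=0.0):
    """Sum cell populations per zone, assigning each cell by its centroid.

    `grid` is a list of rows; row 0 is assumed to be the bottom row.
    """
    totals = {name: 0.0 for name in zones}
    for r, row in enumerate(grid):
        for c, population in enumerate(row):
            cx = x0 + (c + 0.5) * cell_size
            cy = y0 + (r + 0.5) * cell_size
            for name, polygon in zones.items():
                if point_in_polygon(cx, cy, polygon):
                    totals[name] += population
                    break  # a centroid belongs to at most one zone
    return totals

# Two hypothetical zones sharing a boundary at x = 0.6 (units: km).
zones = {
    "A": [(0.0, 0.0), (0.6, 0.0), (0.6, 1.0), (0.0, 1.0)],
    "B": [(0.6, 0.0), (1.0, 0.0), (1.0, 1.0), (0.6, 1.0)],
}
fine_grid = [[1, 1], [1, 1]]  # four 0.5 km cells, one person each
coarse_grid = [[4]]           # the same four people in a single 1 km cell

fine_totals = zonal_sum_by_centroid(fine_grid, 0.5, zones)
coarse_totals = zonal_sum_by_centroid(coarse_grid, 1.0, zones)
print(fine_totals, coarse_totals)
```

With the fine grid each zone receives two people, whereas the single coarse cell is assigned entirely to zone A because its centroid falls there. This is the mechanism by which small zones combined with large cells inflate the error.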

Although many authors, including Balk et al. [30], Wang et al. [2], and Martin [62], have dealt with the analysis of data accuracy, no studies have taken into account the scale aspect. Analysis of the accuracy of GHS-POP 250 m and 1 km estimation related to administrative units of different levels (from macro level, NUTS 2, to local level, LAU 2) assessed whether the reliability of population estimation depends on the scale, understood as the size of the raster cell and of the administrative unit. Finally, it could be stated that scale matters, as it is a basic concept in spatial analysis, which was confirmed by Ge et al. [63], among others. In particular, analysis of outliers showed that the number of extreme MAPE values was higher when analyzing GHS-POP data with a resolution of 1 km than 250 m. Analysis of the spatial distribution of outliers also demonstrated that the outliers are random, which results from the fact that the GHS-POP data development methodology is not burdened with systematic errors. The study demonstrated good accuracy of GHS-POP data for all levels of reference unit and both spatial resolutions, as MAPEs do not exceed 7.8% in Poland and 11.6% in Portugal (Table 3).

The proposed accuracy assessment of the GHS-POP data is universal and could be adopted for any research area, regardless of whether the population is concentrated, as it is in Portugal, or spatially dispersed, like in Poland. However, this method requires reliable reference population data.

Both traditional statistical methods and GIS methods were used for the analysis. Credibility assessment was carried out using three error evaluation metrics: RMSE, MAE, and MAPE. Both the root mean square error (RMSE) and the mean absolute error (MAE) are regularly employed in model evaluation studies. Analyzing the relation between RMSE and MAE allows outliers to be detected. Willmott and Matsuura [64] suggested that the RMSE is not a good indicator of average model performance and might be misleading, and thus MAE would be a better metric for that purpose. However, one problem with MAE is that the relative size of the error is not always obvious: it can be hard to distinguish a big error from a small one. To deal with this problem, the MAE can be expressed in percentage terms. The resulting MAPE allows forecasts of different series at different scales to be compared. It is arguably the most appropriate indicator for comparing two population datasets. However, the combined use of three error assessment indicators has allowed us to provide the most reliable data evaluation.
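For readers who want to reproduce the three metrics, a minimal Python sketch follows. It uses the standard textbook definitions, which are assumed here to match those in the Methods section; the function names and the sample populations are hypothetical. The signed percentage error is defined as reference minus estimate, so that negative values mean overestimation, in line with the sign convention of the values reported in Section 4.

```python
import math

def error_metrics(reference, estimated):
    """Return (RMSE, MAE, MAPE in %) for paired reference and estimated counts."""
    errors = [est - ref for ref, est in zip(reference, estimated)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / ref for e, ref in zip(errors, reference)) / n
    return rmse, mae, mape

def signed_percentage_errors(reference, estimated):
    """Per-unit error in %, negative when the estimate exceeds the reference."""
    return [100.0 * (ref - est) / ref for ref, est in zip(reference, estimated)]

# Hypothetical populations of three administrative units.
census = [100.0, 200.0, 400.0]      # reference counts
grid_sum = [110.0, 190.0, 400.0]    # counts aggregated from the population grid

rmse, mae, mape = error_metrics(census, grid_sum)
per_unit = signed_percentage_errors(census, grid_sum)
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, MAPE = {mape:.1f}%")
print(per_unit)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a large gap between the two points to a few very large errors. This is the RMSE and MAE relation referred to above.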

The results of the GHS-POP accuracy assessment are presented as a set of choropleth maps. Both the number of classes and the class ranges were adjusted to highlight those administrative units where the overestimation or underestimation of population counts lies more than 1.5 interquartile ranges (IQR) below the first quartile or above the third quartile (outliers), or more than 3.0 IQR beyond those quartiles (strong, or extreme, outliers). Hence, the maps fulfill their main, informative function.
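The outlier rule can be checked numerically. The short Python sketch below computes the fences from the quartiles listed in Table 5 for Portugal at 250 m (Q1 = 1.51, Q3 = 7.30) and classifies a few values; the function names are hypothetical, and the value 18.0 is an invented example.

```python
def fences(q1, q3, k):
    """Lower and upper fence placed k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify(value, q1, q3):
    """Label a value as 'regular', 'outlier' (1.5 IQR) or 'extreme outlier' (3.0 IQR)."""
    low_ext, high_ext = fences(q1, q3, 3.0)
    low_out, high_out = fences(q1, q3, 1.5)
    if value < low_ext or value > high_ext:
        return "extreme outlier"
    if value < low_out or value > high_out:
        return "outlier"
    return "regular"

Q1, Q3 = 1.51, 7.30  # Table 5, MAPE 250 m Portugal

low, high = fences(Q1, Q3, 3.0)
print(f"extreme outliers lie below {low:.1f}% or above {high:.1f}%")

labels = {v: classify(v, Q1, Q3) for v in (4.61, 18.0, 83.75, -423.15)}
print(labels)
```

The resulting fences, about −15.9% and 24.7%, match the thresholds quoted for Portugal at 250 m in Section 4.3.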

This study contributed significantly to assessing the suitability of GHS-POP data for use, demonstrating their high compliance with census data, and thus high suitability for many applications at local and regional levels [65]. This study could serve as a quality reference for potential users and producers of population density datasets.

6. Conclusions

Global population grids are essential for analyses supporting policy-making in a wide range of fields (from environmental assessment through disaster risk analysis to smart city management). Therefore, it is extremely important that population data are current and reliable.

The accuracy of the population data was assessed using simple GIS statistics and analyses. According to the error values, the GHS-POP data are highly credible regardless of the research area. The data assessment for municipalities (i.e., the most detailed level) resulted in a larger MAPE due to some large local outliers. In addition, there was a marked difference between RMSE and MAE values for LAU 2 because of several extremely large differences between the GHS-POP and Eurostat population data values. The minimum MAPE (for both GHS-POP 1 km and GHS-POP 250 m) was observed at NUTS 1 level in Poland and at NUTS 3 level in Portugal (Table 3). The highest MAPE, 11.6%, was observed for GHS-POP 1 km at LAU 2 level in Portugal.

The headline result is that scale matters: as the size of administrative zones gets smaller (from NUTS to LAU) and the GHS-POP grid cell size gets larger, the difference between the population counts reported by Eurostat and those estimated by the GHS-POP algorithm becomes larger.

This research has made a significant contribution to the assessment of the GHS-POP data fitness-for-use, demonstrating its high compliance with census data, and thus high usefulness in many applications at the local and regional levels. It could provide some helpful tips for potential users and producers of population density data sets. Further research is planned to assess the reliability of other population data available for the area of Poland. A comparative analysis of GHS-POP data with other population grid data is planned in our subsequent research project.

Author Contributions

B.C. conceptualized the study and performed all analyses; E.B. and B.C. analyzed the results and drew conclusions; B.C. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Military University of Technology in Warsaw, Faculty of Civil Engineering and Geodesy, Institute of Geospatial Engineering and Geodesy.

Acknowledgments

GHS-POP grids were produced in the frame of the Global Human Settlement Layer (GHSL) project by the European Commission, Joint Research Centre, in Ispra, Italy.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Mennis, J.; Hultgren, T. Intelligent dasymetric mapping and its application to areal interpolation. Cartogr. Geogr. Inf. Sci. 2006, 33, 179–194.
2. Guo, H.; Cao, K.; Wang, P. Population estimation in Singapore based on remote sensing and open data. In Proceedings of the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Wuhan, China, 18–22 September 2017.
3. Dong, P.; Ramesh, S.; Nepali, A. Evaluation of small-area population estimation using LiDAR, Landsat TM and parcel data. Int. J. Remote Sens. 2010, 31, 5571–5586.
4. Douglass, R.W.; Meyer, D.A.; Ram, M.; Rideout, D.; Song, D. High resolution population estimates from telecommunications data. EPJ Data Sci. 2015, 4, 1–13.
5. Novack, T.; Kux, H.; Freitas, C. Estimation of Population Density of Census Sectors Using Remote Sensing Data and Spatial Regression. In Geocomputation, Sustainability and Environmental Planning; Studies in Computational Intelligence; Murgante, B., Borruso, G., Lapucci, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 348.
6. Wang, L.; Wu, C. Population estimation using remote sensing and GIS technologies. Int. J. Remote Sens. 2010, 31, 5569–5570.
7. Wardrop, N.A.; Jochem, W.C.; Bird, T.J.; Chamberlain, H.R.; Clarke, D.; Kerr, D.; Bengtsson, L.; Juran, S.; Seaman, V.; Tatem, A.J. Spatially disaggregated population estimates in the absence of national population and housing census data. Proc. Natl. Acad. Sci. USA 2018, 115, 3529–3537.
8. Deng, C.; Wu, C.; Wang, L. Improving the housing-unit method for small-area population estimation using remote-sensing and GIS information. Int. J. Remote Sens. 2010, 31, 5673–5688.
9. Wu, S.; Qiu, X.; Wang, L. Population Estimation Methods in GIS and Remote Sensing: A Review. GISci. Remote Sens. 2005, 42, 80–96.
10. Pirowski, T.; Bartos, K. Detailed mapping of the distribution of a city population based on information from the national database on buildings. Geod. Vestn. 2018, 62, 458–471.
11. Wu, C.; Murray, A.T. Population estimation using Landsat enhanced thematic mapper imagery. Geogr. Anal. 2007, 39, 26–43.
12. Lo, C.P. Applied Remote Sensing; Longman: London, UK, 1986.
13. Tobler, W.R. Satellite confirmation of settlement size coefficients. Area 1969, 1, 30–34.
14. Lo, C.P.; Welch, R. Chinese urban population estimates. Ann. Assoc. Am. Geogr. 1977, 67, 246–253.
15. Stern, M. Landsat data for population estimates - approaches to inter-censal counts in the rural Sudan. In Remote Sensing from Satellites; Carter, W.D., Engman, E.T., Eds.; Pergamon: New York, NY, USA, 1984; pp. 117–125.
16. Patino, J.E.; Duque, J.C. A review of regional science applications of satellite remote sensing in urban settings. Comput. Environ. Urban. Syst. 2013, 37, 1–17.
17. Lo, C.P. Automated population and dwelling unit estimation from high-resolution satellite images: A GIS approach. Int. J. Remote Sens. 1995, 16, 17–34.
18. Yuan, Y.; Smith, R.M.; Limp, W.F. Remodeling census population with spatial information from Landsat TM imagery. Comput. Environ. Urban. Syst. 1997, 21, 245–258.
19. Harvey, J.T. Population estimation models based on individual TM pixels. Photogramm. Eng. Remote Sens. 2002, 68, 1181–1192.
20. Li, G.; Weng, Q. Using Landsat ETM+ imagery to measure population density in Indianapolis, Indiana, USA. Photogramm. Eng. Remote Sens. 2005, 71, 947–958.
21. Liu, D.; Weng, Q.; Li, G. Residential population estimation using remote sensing derived impervious surface. Int. J. Remote Sens. 2006, 27, 3553–3570.
22. Galeon, F.A. Estimation of population in informal settlement communities using high resolution satellite image. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2008, 37 Pt B4, 1377–1382.
23. Qiu, F.; Sridharan, H.; Chun, Y. Spatial autoregressive model for population estimation at the census block level using LIDAR-derived building volume information. Cartogr. Geogr. Inf. Sci. 2010, 37, 239–257.
24. Ramesh, S. High Resolution Satellite Images and LiDAR Data for Small-Area Building Extraction and Population Estimation. Master’s Thesis, University of North Texas, Denton, TX, USA, 2009.
25. Weng, Q. Remote sensing of impervious surfaces in the urban areas: Requirements, methods, and trends. Remote Sens. Environ. 2012, 117, 34–49.
26. Leyk, S.; Gaughan, A.E.; Adamo, S.B.; de Sherbinin, A.; Balk, D.; Freire, S.; Rose, A.; Stevens, F.R.; Blankespoor, B.; Frye, C.; et al. The spatial allocation of population: A review of large-scale gridded population data products and their fitness for use. Earth Syst. Sci. Data 2019, 11, 1385–1409.
27. Balk, D.L.; Deichman, U.; Yetman, G.; Pozzi, F.; Hay, S.I.; Nelson, A. Determining global population distribution: Methods, applications and data. Adv. Parasitol. 2006, 62, 119–156.
28. CIESIN—Center for International Earth Science Information Network Columbia University. Gridded Population of the World, Version 4 (GPWv4): Data Quality Indicators, Beta Release; NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY, USA, 2015.
29. Doxsey-Whitfield, E.; MacManus, K.; Adamo, S.B.; Pistolesi, L.; Squires, J.; Borkovska, O.; Baptista, S.R. Taking Advantage of the Improved Availability of Census Data: A First Look at the Gridded Population of the World, Version 4. Appl. Geogr. 2015, 1, 226–234.
30. Balk, D.; Yetman, G. The Global Distribution of Population: Evaluating the Gains in Resolution Refinement. Center for International Earth Science Information Network (CIESIN), 2005. Columbia University. Available online: http://beta.sedac.ciesin.columbia.edu/gpw/docs/gpw3_documentation_final.pdf (accessed on 6 June 2019).
31. Bellucci, A.; Tholey, N.; Studer, M.; Goester, J.F.; Fuentes, N. Extrapolation of population grids for risk analysis. J. Space Saf. Eng. 2018, 5, 192–196.
32. Chu, H.-J.; Yang, C.-H.; Chou, C.C. Adaptive Non-Negative Geographically Weighted Regression for Population Density Estimation Based on Nighttime Light. ISPRS Int. J. Geo-Inf. 2019, 8, 26.
33. Calka, B.; Bielecka, E. Reliability Analysis of LandScan Gridded Population Data. The Case Study of Poland. ISPRS Int. J. Geo-Inf. 2019, 8, 222.
34. Bhaduri, B.; Bright, E.; Coleman, P. Development of a High Resolution Population Dynamics Model. Paper Presented at Geocomputation 2005, Ann Arbor, Michigan. Available online: http://www.geocomputation.org/2005/Abstracts/Bhaduri.pdf (accessed on 10 June 2019).
35. Stevens, F.; Gaughan, A.E.; Linard, C.; Tatem, A.J. Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLoS ONE 2015, 10, e0107042.
36. Sabesan, A.; Abercrombie, K.; Ganguly, A.R.; Bhaduri, B.; Bright, E.A.; Coleman, P.R. Metrics for the comparative analysis of geospatial datasets with applications to high-resolution grid-based population data. GeoJournal 2007, 69, 81–91.
37. Calka, B.; Nowak Da Costa, J.; Bielecka, E. Fine scale population density data and its application in risk assessment. Geomat. Nat. Hazards Risk 2017, 8, 1440–1455.
38. Tatem, A.J.; Gaughan, A.E.; Stevens, F.R.; Patel, N.N.; Jia, P.; Pandey, A.; Linard, C. Quantifying the effects of using detailed spatial demographic data on health metrics: A systematic analysis for the AfriPop, AsiaPop, and AmeriPop projects. Lancet N. Am. Ed. 2013, 381, S142.
39. Gaughan, A.E.; Stevens, F.R.; Linard, C.; Jia, P.; Tatem, A.J. High Resolution Population Distribution Maps for Southeast Asia in 2010 and 2015. PLoS ONE 2013, 8, e55882.
40. Freire, S.; MacManus, K.; Pesaresi, M.; Doxsey-Whitfield, E.; Mills, J. Development of new open and free multi-temporal global population grids at 250 m resolution. In Proceedings of the 19th AGILE Conference on Geographic Information Science, Helsinki, Finland, 14–17 June 2016.
41. Pesaresi, M.; Ehrlich, D.; Ferri, S.; Florczyk, A.J.; Freire, S.; Halkia, S.; Julea, A.M.; Kemper, T.; Soille, P.; Syrris, V. Operating Procedure for the Production of the Global Human Settlement Layer from Landsat Data of the Epochs 1975, 1990, 2000, and 2014; EUR 27741 EN; Publications Office of the European Union: Luxembourg, 2016.
42. Bai, Z.; Wang, J.; Wang, M.; Gao, M.; Sun, J. Accuracy Assessment of Multi-Source Gridded Population Distribution Datasets in China. Sustainability 2018, 10, 1363.
43. Hay, S.I.; Guerra, C.A.; Tatem, A.J.; Noor, A.M.; Snow, R.W. The global distribution and population at risk of malaria: Past, present, and future. Lancet Infect. Dis. 2004, 4, 327–336.
44. Tatem, A.J.; Guerra, C.A.; Kabaria, C.W.; Noor, A.M.; Hay, S.I. Human population, urban settlement patterns and their impact on Plasmodium falciparum malaria endemicity. Malar. J. 2008, 7, 218.
45. Uhl, J.H.; Leyk, S. Multi-Scale Effects and Sensitivities in Built-up Land Data Accuracy Assessments. Remote Sens. Environ. 2018, 204, 898–917.
46. Qiao, C.; Sun, R.; Cui, T. Research on scale effect of vegetation net primary productivity. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1333–1336.
47. GUS. Powierzchnia i Ludność w Przekroju Terytorialnym w 2018 Roku; Główny Urząd Statystyczny: Warsaw, Poland, 2012.
48. GUS. Ludność w Gminach Według Stanu w Dniu 31.12.2011 r.—Bilans Opracowany w Oparciu o Wyniki NSP 2011; Główny Urząd Statystyczny: Warsaw, Poland, 2012. Available online: https://geo.stat.gov.pl/imap/?locale=en (accessed on 10 June 2019).
49. Eurostat. Statistical Yearbook. Available online: https://ec.europa.eu/eurostat/web/ess/portugal/statistics (accessed on 15 June 2019).
50. Sleszynski, P. Delimitation of the Functional Urban Areas around Poland’s Voivodship Capital Cities. Przeglad Geograficzny 2013, 85, 173–197.
51. Florczyk, A.J.; Corbane, C.; Ehrlich, D.; Freire, S.; Kemper, T.; Maffenini, L.; Melchiorri, M.; Pesaresi, M.; Politis, P.; Schiavina, M.; et al. GHSL Package 2019; EUR 29788EN; JRC117104; Publications Office of the European Union: Luxembourg, 2019; ISBN 978-92-76-08725-0.
52. Corbane, C.; Pesaresi, M.; Politis, P.; Syrris, V.; Florczyk, A.J.; Soille, P.; Maffenini, L.; Burger, A.; Vasilev, V.; Rodriguez, D.; et al. Big earth data analytics on Sentinel-1 and Landsat imagery in support to global human settlements mapping. Big Earth Data 2017, 1, 118–144.
53. Eurostat. Population Data. Available online: https://ec.europa.eu/eurostat/web/population-demography-migration-projections/data (accessed on 20 June 2019).
54. GUS. Statystyka Regionalna—Regional Statistic. Available online: http://stat.gov.pl/statystyka-regionalna/jednostki-terytorialne/klasyfikacja-nuts/ (accessed on 10 June 2019).
55. Tayman, J.; Swanson, D.A. On the validity of MAPE as a measure of population forecast accuracy. Popul. Res. Policy Rev. 1999, 18, 299–322.
56. Tayman, J.; Schafer, E.; Carter, L. The role of population size in the determination and prediction of population forecast errors: An evaluation using confidence intervals for subcounty areas. Popul. Res. Policy Rev. 1998, 17, 1–20.
57. Waller, L.A.; Gotway, C.A. Applied Spatial Statistics for Public Health Data; John Wiley and Sons: New York, NY, USA, 2004.
58. Tu, J.; Xia, Z.G. Examining spatially varying relationships between land use and water quality using geographically weighted regression I: Model design and evaluation. Sci. Total Environ. 2008, 407, 358–378.
59. Nowak Da Costa, J. Novel Tool to Examine Polygon Features Completeness Based on a Comparative Study of VGI Data and Official Polish Building Datasets. Geodetski Vestnik 2016, 60, 495–508.
60. Nowak Da Costa, J.; Bielecka, E.; Calka, B. Jakość danych OpenStreetMap—Analiza informacji o budynkach na terenie Siedlecczyzny. Ann. Geomat. 2016, 14, 193–203. (In Polish)
61. Stillwell, J.; Thomas, M. How far do internal migrants really move? Demonstrating a new method for the estimation of intra-zonal distance. Reg. Stud. Reg. Sci. 2016, 3, 28–47.
62. Martin, R.G. Accuracy assessment of Landsat-based visual change detection methods applied to the rural-urban fringe. Photogramm. Eng. Remote Sens. 1989, 55, 209–215.
63. Ge, Y.; Jin, Y.; Stein, A.; Chen, Y.; Wang, J.; Wang, J.; Cheng, Q.; Bai, H.; Liu, M.; Atkinson, P.M. Principles and methods of scaling geospatial Earth science data. Earth-Sci. Rev. 2019, 197, 102897.
64. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
65. Mościcka, A.; Pokonieczny, K.; Wilbik, A.; Wabiński, J. Transport Accessibility of Warsaw: A Case Study. Sustainability 2019, 11, 5536.

Figure 1. Location of Poland and Portugal in Europe.

Figure 2. CORINE land cover data for (a) Poland and (b) Portugal.

Figure 3. Population distribution by the Global Human Settlement Population Grid (GHS-POP). (a) Poland, 250 m grid cells; (b) Poland, 1 km grid cells; (c) Portugal, 250 m grid cells; (d) Portugal, 1 km grid cells.

Figure 4. RMSE in Poland and Portugal.

Figure 5. MAE in Poland and Portugal.

Figure 6. MAPE in Poland and Portugal.

Figure 7. NUTS 1: MAPE. (a) GHS-POP 250 m, Poland; (b) GHS-POP 250 m, Portugal; (c) GHS-POP 1 km, Poland; (d) GHS-POP 1 km, Portugal; (e) map legend.

Figure 8. NUTS 2: MAPE. (a) GHS-POP 250 m, Poland; (b) GHS-POP 250 m, Portugal; (c) GHS-POP 1 km, Poland; (d) GHS-POP 1 km, Portugal; (e) map legend.

Figure 9. NUTS 3: MAPE. (a) GHS-POP 250 m, Poland; (b) GHS-POP 250 m, Portugal; (c) GHS-POP 1 km, Poland; (d) GHS-POP 1 km, Portugal; (e) legend.

Figure 10. LAU 1 (NUTS 4): MAPE. (a) GHS-POP 250 m, Poland; (b) GHS-POP 250 m, Portugal; (c) GHS-POP 1 km, Poland; (d) GHS-POP 1 km, Portugal; (e) legend.

Figure 11. LAU 2 (NUTS 5): MAPE. (a) GHS-POP 250 m, Poland; (b) GHS-POP 250 m, Portugal; (c) GHS-POP 250 m, Azores; (d) GHS-POP 250 m, Madeira; (f) GHS-POP 1 km, Poland; (g) GHS-POP 1 km, Portugal; (h) GHS-POP 1 km, Azores; (i) GHS-POP 1 km, Madeira; (j) legend.

Figure 12. MAPE, 250 m, Poland.

Figure 13. MAPE, 250 m, Portugal.

Figure 14. Extreme outliers of MAPE. (a) 250 m grid cells, Poland; (b) 250 m grid cells, Portugal; (c) 250 m grid cells, Azores; (d) 250 m grid cells, Madeira; (f) 1 km grid cells, Poland; (g) 1 km grid cells, Portugal; (h) 1 km grid cells, Azores; (i) 1 km grid cells, Madeira; (j) legend.

Table 1. Descriptive statistics of the GHS-POP for Poland and Portugal.

Descriptive Statistics	GHS-POP 250 m Poland	GHS-POP 1 km Poland	GHS-POP 250 m Portugal	GHS-POP 1 km Portugal
Number of grid cells	4,982,749	311,647	1,423,136	88,956
Min.	0.00	0.00	0.00	0.00
Max.	3849	12,821	2668	19,414
Median	0.00	6.83	0.00	0.31
Mean	7.74	123.74	6.91	109.73
Mode	0.00	0.00	0.00	0.00
The first quartile (Q1)	0.00	0.00	0.00	0.00
The third quartile (Q3)	0.00	63.39	0.01	33.48
Percentile 10	0.00	0.00	0.00	0.00
Percentile 90	11.37	232.11	7.28	188.25
Standard deviation	36.46	472.29	47.21	582.09
Skewness	9.36	8.57	17.23	13.79
Kurtosis	175.59	98.06	423.14	261.14
Variance	1329.49	223,058.4	2229.12	338,835.1
Number of people according to Eurostat	38,005,614		10,374,822

Table 2. NUTS and LAU classification for Poland and Portugal.

	Poland	Portugal
NUTS 0: The area of the whole country	1	1
NUTS 1: Macroregions	7	3
NUTS 2: Regions	17	7
NUTS 3: Groups of counties	73	25
LAU 1 (NUTS 4)	380	308
LAU 2 (NUTS 5)	2482	3092

Table 3. Root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) values for the GHS-POP 1 km and GHS-POP 250 m.

	POLAND				PORTUGAL
	RMSE	MAE	MAPE (%)	R²	RMSE	MAE	MAPE (%)	R²
NUTS 1
GHS-POP 1 km	88,633.42	78,233.33	1.5%	0.9990	49,696.17	37,679.97	4.5%	≈1.000
GHS-POP 250 m	92,590.10	82,353.64	1.6%	0.9990	19,480.26	14,680.60	1.8%	≈1.000
NUTS 2
GHS-POP 1 km	40,728.07	32,494.29	1.6%	0.9993	28,911.45	20,796.51	2.8%	0.9998
GHS-POP 250 m	41,717.26	34,081.67	1.7%	0.9994	20,396.73	12,544.67	1.6%	0.9999
NUTS 3
GHS-POP 1 km	14,728.36	9392.25	1.9%	0.9711	12,461.90	6310.94	1.5%	0.9998
GHS-POP 250 m	14,629.63	9240.49	1.9%	0.9726	7747.00	3993.05	1.0%	0.9999
LAU 1 (NUTS 4)
GHS-POP 1 km	18,525.99	6296.60	7.8%	0.9753	4084.98	1354.32	4.2%	0.9967
GHS-POP 250 m	19,051.75	6078.21	7.5%	0.9739	3156.08	1020.45	3.2%	0.9945
LAU 2 (NUTS 5)
GHS-POP 1 km	2945.81	733.24	5.8%	0.9716	1330.04	407.57	11.6%	0.9650
GHS-POP 250 m	2935.74	608.21	4.5%	0.9699	465.85	163.88	5.7%	0.9958

Table 4. Moran I statistics.

Descriptive Statistics	Poland 250 m	Poland 1 km	Portugal 250 m	Portugal 1 km
Moran’s index	0.0049	−0.0019	0.0129	−0.0007
z-score	0.4870	−0.1291	1.6204	−0.2483
p-value	0.6263	0.8973	0.9760	0.8031
	Random	Random	Random	Random

Table 5. Descriptive statistics for MAPE for LAU 2 (NUTS 5).

Descriptive Statistics	MAPE 250 m Poland	MAPE 1 km Poland	MAPE 250 m Portugal	MAPE 1 km Portugal
Grubbs Test Statistics	36.80 p = 0.000	35.04 p = 0.000	44.40 p = 0.000	22.70 p = 0.000
Min.	−654.83	−660.88	−423.15	−423.96
Max.	86.44	82.44	83.75	98.91
Median	−1.23	−1.32	4.61	5.11
Mean	−1.13	−1.70	1.04	−0.02
Trimmed mean (10%)	−1.18	−0.52	0.63	−0.01
The first quartile (Q1)	−3.96	−3.18	1.51	−1.00
The third quartile (Q3)	1.08	1.71	7.30	9.48
Percentile 10	−5.14	−7.70	−1.91	−12.70
Percentile 90	3.92	5.47	9.33	18.39
Standard deviation	17.77	18.81	9.62	18.84
Skewness	−24.77	−21.63	28.26	−4.58
Kurtosis	842.32	694.36	1259.49	92.32
Variance	315.69	353.83	28.98	10.72
Quartile Range	4.25	5.67	5.79	10.48
Range	741.27	743.32	506.90	522.88
Number of objects	2478		3092

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

