Spatial Distribution Estimates of the Urban Population Using DSM and DEM Data in China

Spatial distribution and population density are important parameters in studies on urban development, resource allocation, emergency management, and risk analysis. High-resolution height data can be used to estimate the total or spatial pattern of the urban population for small study areas, e.g., the downtown area of a city or a community. However, there has been no case of population estimation for large areas. This paper tries to estimate the urban population of prefectural cities in China using building height data. Building height in urban population settlement (Mdiffs) was first extracted using the digital surface model (DSM), digital elevation model (DEM), and land use data. Then, the relationships between the census-based urban population density (CPD) and the Mdiffs density (MDD) for different regions were regressed. Using these results, the urban population for prefectural cities of China was finally estimated. The results showed that a good linear correlation was found between Mdiffs and the census data in each type of region, as all the adjusted R2 values were above 0.9 and all the models passed the significance test (95% confidence level). The ratio of the estimated population to the census population (PER) was between 0.7 and 1.3 for 76% of the cities in China. This is the first attempt to estimate the urban population using building height data for prefectural cities in China. This method produced reasonable results and can be effectively used for spatial distribution estimates of the urban population in large scale areas.


Introduction
The spatial pattern of regional populations is closely related to the geographical, societal, and economic environments, as well as to natural resources.Understanding the population size, density, and spatial distribution is beneficial for urban planning, resource allocation, and disaster risk management [1].Census population data, geographic information system (GIS), and remote sensing data, and social media data are currently major sources for demographic studies.
Census data are relatively accurate; however, they are labor intensive and time consuming to obtain and update.They are based on the administrative unit and have low spatial resolution, and do not reflect dynamic changes in population in a timely manner [2,3].These limitations make it necessary to develop alternative techniques and methods to improve the accuracy, time resolution, and spatial resolution.
Currently, mobile phone data and social media data, such as mobile call data [4][5][6], WiFi data [7,8], and social networking service software data [9][10][11], which have short acquisition periods and high timeliness, are popular methods to estimate population density, distribution, and mobility.Deville et al. used mobile phone data to estimate the urban population density of Portugal, and the accuracy of the estimation results was increased by comparing the results with census data at an administrative division level and remote sensing data at a 100 m×100 m grid scale [4].However, the users of mobile phones and social media service software do not cover the whole country [12], so that the estimated total population might be low if only big data sources are used.Due to privacy, the aforementioned data may be difficult to obtain and thus cannot be widely used, which represents another difficulty when estimating population mobility.
Remote sensing has become an important method of performing population estimates in the past four decades [13] and many methods have been reported in the GIS and remote sensing literature.Depending on the intended goal and the required information, these methods can be grouped into two categories: Areal interpolation and statistical modeling [14].The former is primarily designed for zone transformation that involves transforming data from one set of spatial units to another [13,15].How to define source zones and target zones, the degree of generalization in the interpolation process, and the characteristics of the partitioned surface influence the quality of the interpolation estimates [16].Areal interpolation methods can be further separated into two categories depending on whether ancillary information is used.The statistical modeling approach is more interested in inferring the relationship between the population and other variables, such as urban areas, land use, dwelling units, image pixel characteristics, and other physical or socioeconomic characteristics, in order to estimate the total population of an area [17,18].The quality of the statistical modeling approach depends on how to select the related variables and statistical model.
Landsat satellite data, land use/cover data, nighttime light data, digital elevation model (DEM) data, and some high-resolution remote sensing images are commonly utilized to estimate population.With the development of light detection and ranging (LiDAR) technology, the height data of a building can be obtained and applied to gradually estimate regional population.Some research on population estimation using height data and the statistical modeling approach has been published.In 2010, Lu et al. [19] examined the utility of QuickBird imagery and LiDAR data for estimating population at the census-block level using two approaches: Area-based and volume-based.Residential building footprints were first delineated from the remote-sensing data using classification algorithms.Regression analysis was used to model the relationship between the population and the area or volume of the delineated residential buildings.Both approaches were successful in terms of estimating the population with high accuracy, finding that the root mean square error (RMSR) was equal to 10-30 people.The number, area and volume of buildings for residential and commercial parcels were first calculated using the digital surface model (DSM) and LiDAR-data-based DEM data, and then were regressed using the census population.The result showed an underestimation of the population density due to the lack of high-resolution image data [20].In 2013, Alahmadi et al. [21] classified their study area using Enhanced Thematic Mapper (ETM+) data using the iterative self-organizing data analysis technique (ISODATA) and support vector machine (SVM) algorithms in order to extract building area land cover type; then height data was used to refine the building area, which was regressed using the density of the dwelling units for the population estimation.The result showed that the SVM classification algorithm and building area refined by height data improved the accuracy.Later, in 2016, they improved their previous work using height data, and the results showed higher accuracy [22].Xie et al. [23] extracted the height, area, and volume of buildings using a morphological building detection algorithm; these variables were then utilized to classify and collect residential buildings; finally, the height, area, and volume of residential buildings were regressed using the population.The results showed that the RMSE and mean absolute relative error (MARE) of the population estimation were 13 and 33.52%, respectively.
It can be concluded that high-resolution height data can be well used to estimate the total or spatial pattern of the urban population for small study areas, e.g., the downtown area of a city or a community.However, there is a lack of evidence as to whether height data can be used to estimate the total urban population of large areas.In this paper, height data was used to estimate the urban population of prefectural cities in China.The building height in urban areas was first obtained according to the DSM and DEM datasets.Next, three algorithms were used to classify the whole search area, and then regression models between the census population density and total building height were built and analyzed for the prefectural cities.By comparing the regression results and through comprehensive analysis, an optimum classification algorithm was developed.Subsequently, the population of China was estimated using the regression models from the optimum classification algorithm, and the results were finally compared with the census population.

Data
The data used in this paper mainly include DSM, DEM, land use data, census data, and basic geographic information data.Details are shown in Table 1.The DSM captures natural and built features on the earth's surface.Only 30 m resolution data is available, thus it needed to be resampled to 1 km.DEM is a 3D, computer-generated, graphical representation of a terrain's surface, and only represents the bare-earth height information without any further definition about the surface.Land use data shows the type of a certain district.The land use data used in this paper includes the natural landscape, human activity areas, and unutilized land.The human activity areas include urban land, suburban settlements, and construction land.Urban land refers to large, medium, and small cities, and some built-up areas which rank above counties and towns.The total number of land use types is 26, each of which is recorded by a code.Urban areas are coded as 51.The census population data used in this paper refer to the permanent resident population from the sixth national population census of China.All the data are projected with equal-area Albers.

Method
The technical flow of estimating the population using the proposed method in this study is shown in Figure 1.The method includes the following three steps: Step

1: Calculation of Building Height and Calibration of Mdiffs
The DSM data values represent the elevation of the surface including built up features, and the DEM data values represent the bare-earth elevation of the terrain, thus the average height (diffs) data can be calculated according to Equation (1).
The digital number (DN) value of the diffs data in urban areas was referred to as Mdiffs and was approximately equal to the average height of the buildings.Urban areas were defined by the urban land type in land use data where there was an urban population.
Outliers with very large and negative values were observed in the Mdiffs data.Outliers less than 0 were changed to 0 in theory.As for outliers with very large values, an accurate range could not be obtained.In order to ensure a relatively reasonable range, the 30 m resolution DSM and DEM data for some large urbanized cities may be helpful: The Mdiffs with 30m resolution was able to be obtained directly according to Equation ( 1) and the land use data.Then, the negative values were changed to 0. Next, the Mdiffs of these cities were resampled to 1 km resolution, and their maximum Step 1: Calculation of Building Height and Calibration of Mdiffs The DSM data values represent the elevation of the surface including built up features, and the DEM data values represent the bare-earth elevation of the terrain, thus the average height (diffs) data can be calculated according to Equation (1).
The digital number (DN) value of the diffs data in urban areas was referred to as Mdiffs and was approximately equal to the average height of the buildings.Urban areas were defined by the urban land type in land use data where there was an urban population.
Outliers with very large and negative values were observed in the Mdiffs data.Outliers less than 0 were changed to 0 in theory.As for outliers with very large values, an accurate range could not be obtained.In order to ensure a relatively reasonable range, the 30 m resolution DSM and DEM data for some large urbanized cities may be helpful: The Mdiffs with 30m resolution was able to be obtained directly according to Equation ( 1) and the land use data.Then, the negative values were changed to 0. Next, the Mdiffs of these cities were resampled to 1 km resolution, and their maximum value was Dmax, which was the maximum value of the Mdiffs data with 1km resolution for all of China.
Step 2: Classifying the Study Area In view of the large area and geographic regional diversity in China, the whole study area should be classified into different categories.In urban areas, a linear relationship between building height and population size was assumed if differences in land use were not considered.Population and Mdiffs were the feature variables of each prefectural region or prefectural city in China.Their ratio (Equation ( 2)) could reflect the average population density in the 3D space or the average vacancy rate of a city.It is advantageous to study result if cities with similar features can be divided into a class.
where CPD and MDD represent the urban population density from the census data (unit: People/km 2 ) and the density of Mdiffs at the prefectural city level (unit: Meter/km 2 ), respectively.The K value is reasonable as long as both the CPD and MDD are larger than zero.
The prefectural cities of China were then classified according to the K index and the classification code was recorded as k.Firstly, a box plot was graphed to describe the outliers and interquartile range.Then the k-medoids algorithm was applied to confirm the optimum clustering number (Cnum) and the clustering result.After that, the hierarchical clustering algorithm was also tried in R language to divide the Cnum categories.Finally, the results of the three methods were compared to obtain the optimum method for the population estimate.

Step 3: Urban Population Estimation
Based on the classification results of the different algorithms, the density of the census population (CPD), which was chosen as the dependent variable, was fitted with density of Mdiffs (MDD) using the linear regression model through the origin in each k class.This model was adopted because the population of a region should be zero if its MDD value is zero.Then the statistical information of the regression models based on the different classification results was compared and integrated to find the optimum classification algorithm whose regression coefficients were recorded as a ki .The a ki for all pixels in region k is assumed to be equivalent; thus, the gridded urban population in region k can be calculated according to Equation (3).Thus, the total urban population in China can be obtained by the summation of the estimated total urban population of the different regions.
where GEP ki is the estimated population for each grid i in region k; a ki is the coefficient; and DN ki is the DN value of the Mdiffs data of grid i in region k.

Results
Beijing, Shanghai, Guangzhou are the most urbanized and developed cities in China.Therefore, they were selected to calibrate the outliers.The value of Dmax was set to 100 in this paper.
The K values of 344 cities were available (there are two cities where there was no Mdiffs).The box plot of the K index showed that the maximum, median and minimum values were 6178.840,2214.419, and 267.656, respectively, and the upper and lower quartile values were 3338.423 and 1392.929,respectively.The value >6178.840was an outlier.The result of k-medoid clustering was Cnum = 6; outliers and normal values were divided into two and four categories, respectively.The hierarchical clustering algorithm divides the outliers into three parts and the normal values into three parts.
The classification results of the three algorithms were applied to fit the relationship of the urban census population density (CPD) and Mdiffs density (MDD).
Table 2 shows the classification and regression results.As for the normal values, the adjusted R 2 of the box plot was higher than the k-medoids and hierarchical clustering; the standard deviation of the standard residual (SD of the std.residual) of the box plot was lower than that of the hierarchical clustering, but roughly similar to that of the k-medoids.As for outliers, the k-medoid was similar to the box plot no matter whether the fifth and sixth categories were merged.Hierarchical clustering is more sensitive to outliers, but the regression model was not significant for the fifth category, which might be caused by a lower sample number.Thus, the paper integrated the sixth category into the fifth category for hierarchical clustering, which highly improved the result for the outliers.Therefore, the box plot algorithm was outstanding for the normal value clustering while hierarchical clustering was excellent for the outliers clustering.Considering that the number of normal samples was larger than that of the outlier samples, the results of the outliers for hierarchical clustering were applied to replace the results of the outliers for the box plot.Based on two previous adjustments, an adjusted box plot algorithm (shown in Table 2) was built, which was an optimum algorithm: The clustering number was 6, which is optimum; adjusted R 2 and SD of the std.residual for each category were better than the other results.The paper selected the classification results of the adjusted box plot (Figure 2).The regression equations were available.Therefore, in Equation ( 3), a 1i = 1118.123,a 2i = 1777.533,a 3i = 2684.094,a 4i = 4586.383,a 5i = 6641.906,and a 6i = 8910.301.Figure 3 shows the distribution of the estimated urban population (EP) and census urban population (CP) at the prefectural city level.The spatial pattern of the estimated population was the same as for the census population.Table 3 shows the statistical results of the number and percentage of the prefectural cities with different population sizes and density grades.As for the population size, the blue area with an urban population of less than 100,000 covered five cities; the cyan area with an urban population of 100-500,000 covered 36 cities; the green area with an urban population of 500,000-1 million covered 73 cities; the yellow area with an urban population of 1-2 million covered 112 cities; the orange area with an urban population of 2-5 million covered 94 cities; and the red area with an urban population larger than 5 million covered 24 cities.There is the largest difference between EP and CP in grade (2000, 5000] (orange color), while smallest difference in grade (1000, 2000] (yellow color).As for population density, the 13 cities with a population density larger than 1000 persons/km 2 (red) were mainly distributed in the areas of Beijing-Tianjin, Shanghai, and Guangzhou-Shenzhen.The 84 cities with a population density smaller than 50 persons/km 2 (blue) were mainly distributed in western China.The cities with a high population density were mostly distributed in the eastern coastal region of China.The difference between the EP and CP was smaller in the green and orange areas while larger in the blue areas.Figure 3 shows the distribution of the estimated urban population (EP) and census urban population (CP) at the prefectural city level.The spatial pattern of the estimated population was the same as for the census population.Table 3 shows the statistical results of the number and percentage of the prefectural cities with different population sizes and density grades.As for the population size, the blue area with an urban population of less than 100,000 covered five cities; the cyan area with an urban population of 100,000-500,000 covered 36 cities; the green area with an urban population of 500,000-1 million covered 73 cities; the yellow area with an urban population of 1-2 million covered 112 cities; the orange area with an urban population of 2-5 million covered 94 cities; and the red area with an urban population larger than 5 million covered 24 cities.There is the largest difference between EP and CP in grade (2000, 5000] (orange color), while smallest difference in grade (1000, 2000] (yellow color).As for population density, the 13 cities with a population density larger than 1000 persons/km 2 (red) were mainly distributed in the areas of Beijing-Tianjin, Shanghai, and Guangzhou-Shenzhen.The 84 cities with a population density smaller than 50 persons/km 2 (blue) were mainly distributed in western China.The cities with a high population density were mostly distributed in the eastern coastal region of China.The difference between the EP and CP was smaller in the green and orange areas while larger in the blue areas.

Discussion
The total estimated urban population was 7.03 million, while census urban population was 6.70 million.The overall relative error was 4.9%.In order to evaluate the results, the estimated population was compared with the census population for each city.PER, the ratio of estimated urban population (EP) to census urban population (CP) (dividing the EP by the CP), was considered to be an index to evaluate the results for each city.Figure 4 shows the spatial distribution of PER.Table 4 shows the statistical information for PER.A total of 301 prefectures,

Discussion
The total estimated urban population was 7.03 million, while census urban population was 6.70 million.The overall relative error was 4.9%.In order to evaluate the results, the estimated population was compared with the census population for each city.PER, the ratio of estimated urban population (EP) to census urban population (CP) (dividing the EP by the CP), was considered to be an index to evaluate the results for each city.Figure 4 shows the spatial distribution of PER.Table 4 shows the statistical information for PER.A total of 301 prefectures, distributed across almost all of China and accounting for 87.50% of the total prefectures, had PER values between 0.7~1.3.Only one city had PER values smaller than 0.7.The number of cites with PER values between 1.3 and 2.0 and larger 2.0 were 30 and 12, respectively, which was terribly overestimated, and mainly distributed in the north-east, north-west and south-west of China.Figure 5 shows the histogram of K.The mean = 1.10,SD = 0.38, and the distribution of PER was concentrated in the range 0.7-1.5.There were 151 prefectures with a PER less than 1.0, while 192 prefectures had a PER larger than 1.0.Table 5 shows the number and percentage of cities in each PER range for region k.Generally, the number of cities with a PER between 0.7 and 1.3 reached 100% when k = 2, 3, 5.There were few cities overestimated in categories 1 and 4.   distribution of PER.Table 4 shows the statistical information for PER.A total of 301 prefectures, distributed across almost all of China and accounting for 87.50% of the total prefectures, had PER values between 0.7~1.3.Only one city had PER values smaller than 0.7.The number of cites with PER values between 1.3 and 2.0 and larger 2.0 were 30 and 12, respectively, which was terribly overestimated, and mainly distributed in the north-east, north-west and south-west of China.
Figure 5 shows the histogram of K.The mean = 1.10,SD = 0.38, and the distribution of PER was concentrated in the range 0.7-1.5.There were 151 prefectures with a PER less than 1.0, while 192 prefectures had a PER larger than 1.0.Table 5 shows the number and percentage of cities in each PER range for region k.Generally, the number of cities with a PER between 0.7 and 1.3 reached 100% when k = 2, 3, 5.There were few cities overestimated in categories 1 and 4.

Figure 1 .
Figure 1.Technical flow of the proposed method.

Figure 1 .
Figure 1.Technical flow of the proposed method.

Figure 3 .
Figure 3. Distribution of the urban estimation and census population at the prefectural city level.(a) Estimated urban population size.(b) Census urban population size.(c) Estimated urban population density.(d) Census urban population density.

Figure 3 .
Figure 3. Distribution of the urban estimation and census population at the prefectural city level.Estimated urban population size.(b) Census urban population size.(c) Estimated urban population density.(d) Census urban population density.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 9 of 12 distributed across almost all of China and accounting for 87.50% of the total prefectures, had PER values between 0.7~1.3.Only one city had PER values smaller than 0.7.The number of cites with PER values between 1.3 and 2.0 and larger 2.0 were 30 and 12, respectively, which was terribly overestimated, and mainly distributed in the north-east, north-west and south-west of China.Figure5shows the histogram of K.The mean = 1.10,SD = 0.38, and the distribution of PER was concentrated in the range 0.7-1.5.There were 151 prefectures with a PER less than 1.0, while 192 prefectures had a PER larger than 1.0.

Table 1 .
Data used in this paper.

Table 2 .
The classification and regression results.

Table 3 .
The number (N) and percentage (P) of prefectures with different population size (PZ) and population density (PD) grades.

Table 3 .
The number (N) and percentage (P) of prefectures with different population size (PZ) and population density (PD) grades.

Table 4 .
The number and percentage of prefectures in each PER grade.

Table 4 .
The number and percentage of prefectures in each PER grade.

Table 4 .
The number and percentage of prefectures in each PER grade.