Article

Study on Spatialization and Spatial Pattern of Population Based on Multi-Source Data—A Case Study of the Urban Agglomeration on the North Slope of Tianshan Mountain in Xinjiang, China

1
College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi 830049, China
2
Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi 830017, China
*
Author to whom correspondence should be addressed.
Sustainability 2024, 16(10), 4106; https://doi.org/10.3390/su16104106
Submission received: 10 March 2024 / Revised: 26 April 2024 / Accepted: 12 May 2024 / Published: 14 May 2024
(This article belongs to the Special Issue Spatial Analysis for the Sustainable City)

Abstract

The urban agglomeration on the north slope of the Tianshan Mountains is a pivotal region in Western China; it is essential for the economic growth of Xinjiang and acts as a critical bridge between China’s interior and the Asia–Europe continent. Owing to unique natural conditions, the local population distribution exhibits distinct regional characteristics. This study employs the spatial lag model (SLM) from conventional spatial analysis and the random forest model (RFM) from contemporary machine learning. It integrates traditional geographic data, including land cover data and nighttime light data, with geographical big data, such as POI (points of interest) and OSM (OpenStreetMap) data, to build a comprehensive indicator database, and then simulates the spatial population distribution within the urban agglomeration in 2020. The accuracy of the results is compared with that of other available population raster datasets, and the spatial distribution pattern in 2020 is analyzed. The findings reveal the following: (1) The SLM, combined with multi-source data, predicts the population distribution as a relatively uniform and nearly circular structure, with minimal spatial differentiation. (2) The RFM, employing multi-source data, better captures the spatial population distribution, producing irregular boundaries that are indicative of strong spatial heterogeneity. (3) Both models simulate the population distribution with good accuracy. The SLM is more accurate than the GHS and GPW datasets but less accurate than WorldPop and LandScan, whereas the RFM significantly outperforms all four population raster datasets. (4) The population spatial pattern in the urban agglomeration on the north slope of the Tianshan Mountains predominantly consists of four distinct circles, illustrating a “one axis, one center, and multiple focal points” distribution characteristic. Combining the random forest model with geographic big data for spatialized population simulation offers robust scientific validity and practicality. It holds potential for broader application within the urban agglomeration on the north slope of the Tianshan Mountains and across Xinjiang. This study can offer insights for studies on regional population spatial distributions and inform sustainable development strategies for cities and their populations.

1. Introduction

Population distribution refers to the arrangement of people within a region, characterized by the location and density of individuals at a specific time and place. Generally, spatial distribution is described using population density [1]. The distribution is shaped by a mix of physical and human geographic factors, including the region’s natural geographical advantages and disadvantages, the availability of land resources, and the level of socioeconomic development [2].
Currently, the population census remains the primary method for collecting basic demographic information worldwide, with census sampling being the predominant statistical approach. In China, population statistics derive from a decennial population census. Although the data collected by census workers are the most accurate, systematic, and standardized compared with other data sources, the ten-year interval is too long to satisfy the research demands of many scholars [3]. In the interim, sample surveys targeting one percent of the population offer less precise data and fail to meet the exacting requirements of researchers. Moreover, China’s census takes the township, town, and street levels as its smallest units of analysis, which leads to low spatial and temporal resolution in statistical data. This resolution inadequately captures the nuanced variations in population density and spatial distribution patterns [4], thus posing significant challenges for detailed research.
To address these challenges, scholars and organizations have continuously sought theories and methods capable of accurately depicting population distribution in spatial terms. Based on the goals and methods of related research, population spatialization techniques can be categorized into spatial interpolation, regression models, and machine learning methods [5]. In terms of data sources, these are divided into single data sources represented by nighttime light data, land use data, and building outline data, as well as multi-source data combining points of interest, heatmaps, and mobile phone data, among other geospatial big data. Initially, research in this domain predominantly relied on zonal density mapping to spatialize population data. However, with advances in remote sensing technology, researchers started to explore the potential of correlating the reflective properties of remote sensing imagery pixels with population density. This approach facilitated the development of models to estimate population density based on characteristics observed in remote sensing pixels [6,7]. Consequently, techniques utilizing remote sensing data from nighttime lights [8,9,10,11,12] and land use classifications [13,14,15,16] to spatialize population information have gained traction. Moreover, the advent of computational artificial intelligence has led to the application of geographically weighted regression [17,18,19,20,21] and machine learning methods [22,23] for producing fine-grained visual representations of population distribution using both traditional geographic and geographical big data [24,25,26,27,28,29,30].
A variety of global-scale open-source population raster datasets, such as LandScan, WorldPop, and the Gridded Population of the World (GPW), have been developed based on advanced methodologies. However, these global datasets often face limitations due to their broad scale and reliance on nighttime lighting and statistical data for spatial simulation. Recently, efforts to visualize the spatial distribution of populations by merging multi-source geographic data with big data analytics and machine learning have started to gain traction, particularly in urban studies [31,32,33,34]. Wang [35] used the random forest algorithm to map the population density of China’s mainland at a 100 m × 100 m scale in 2015 and verified its accuracy. Wang [36], based on building models, utilized the random forest algorithm to spatially simulate the population in Lin’an District, Hangzhou City, Zhejiang Province. Ding [37] trained the random forest model using building outlines and POI data to simulate the spatial distribution of the population in Hubei Province in 2010. These studies demonstrate the feasibility of the random forest method in spatializing population research. Liu [38], based on multi-source data, used the random forest method to study the population distribution of Zhengzhou City in 2020 at three grid scales. Chen [39] simulated the population spatial distribution in Sichuan Province in 2020 using POI data and multi-source remote sensing data. However, these studies are mostly concentrated at the national or provincial scale, as well as in areas of high population concentration like Beijing and Hangzhou, making it worthwhile to explore the feasibility of this method in areas with lower and more dispersed population densities like Xinjiang. Academic studies on spatial population patterns have largely centered on analyzing these patterns [40] and their evolution [41,42] using statistical data, or on examining inter-provincial population migration [43] and mobility [44,45].
As a distinctive oasis urban agglomeration in the arid zone of Northwest China, the urban agglomeration on the north slope of the Tianshan Mountains presents unique natural and climatic conditions. It covers a broad geographical area but has a relatively small and dispersed population. The cities within it are widely spaced, with limited transportation connectivity. Consequently, the selection of indicators and methodologies for population spatialization here differs from those applied in large cities. Simulating the spatial and temporal distribution of population in this agglomeration is fundamental for studying human–land relationships in the region. This simulation is crucial for the expansion and reorganization of regional spatial structures and the diffusion and concentration of economic activities, infrastructure development, public service allocation, and ecological impacts. It serves as a vital prerequisite for formulating rational regional policies on population, economy, ecology, and social development.
Building on the studies cited above, this paper selects data indicators that more closely fit the low-density, dispersed population distribution of the study area. By combining multi-source data with spatial regression and machine learning methods, it simulates the spatialized population and explores its spatial distribution and patterns in the urban agglomeration on the north slope of the Tianshan Mountains in 2020.

2. Materials and Methods

2.1. Study Area and Data

2.1.1. Study Area

The urban agglomeration on the north slope of the Tianshan Mountains (hereinafter referred to as the NSTS Urban Agglomeration), located in the northern part of Xinjiang in Northwest China, predominantly spans the oases of the Junggar Basin, the north of the Tianshan Mountains, and the south of the Altai Mountains. The region extends from 84°14′ E to 89°55′ E in longitude and from 42°55′ N to 46°15′ N in latitude. It includes thirteen cities and counties: Urumqi, Karamay, Shihezi, Shawan, Kuitun, Wusu, Changji, Fukang, Wujiaqu, Huyanghe, Manas, Hutubi, and Jimusar. The agglomeration covers an area of approximately 94,000 square kilometers, which is about 5.62% of Xinjiang’s total land area. As of the seventh population census in 2020, the agglomeration’s population was about 7.3 million, representing roughly 28.25% of Xinjiang’s total population. The region’s total GDP in 2020 was around CNY 667,820 million, making up about 48.39% of the province’s GDP. Characterized by a temperate continental climate, this urban agglomeration is a quintessential oasis in an arid zone. It boasts a well-developed transportation infrastructure, including railroads, highways, and civil aviation, enhancing trade and cultural exchanges between Eastern Asia and Europe and providing the region with a unique strategic advantage. Among its cities, Urumqi is the foremost political, economic, and cultural center, playing a critical role in driving the development of the surrounding cities and acting as their pivotal support.
The NSTS Urban Agglomeration (Figure 1), located at the core of the Silk Road Economic Belt, is one of the 19 urban agglomerations promoted during China’s “13th Five-Year Plan” period. It is also one of the two key border area urban agglomerations under construction and the only urban agglomeration in the core area of the Silk Road Economic Belt [46]. Over the years, thanks to its superior geographical location and comprehensive infrastructure, this urban agglomeration has become the region with the highest level of economic development and the highest rate of urbanization within Xinjiang. It is also a crucial strategic area for the national government’s efforts to promote the development of Xinjiang and the western region [47].
The area facilitates the flow of goods and cultural exchanges between East Asia and Europe, creating unique regional advantages. Urumqi, the capital city, serves as the political, economic, and cultural center of the entire region and is the main lifeline driving the development of other cities within the NSTS Urban Agglomeration. Influenced by geographical location, the population distribution of this urban agglomeration exhibits characteristics of “general dispersion with regional aggregation”, which sets it apart from other typical urban agglomerations in the country. The exploration of population spatialization and spatial patterns in this area can provide references for related research.

2.1.2. Data

The study utilizes a range of data sources, including the following: the China Seventh Population Census in 2020; statistical data from the China County Statistical Yearbook (2015–2021); the 2022 Statistical Bulletin for Cities and Counties; the CLCD (Chinese Land Cover Dataset) from 2015 to 2022; POI (points of interest) data from Gaode Maps (2015–2022); building outline data from Baidu Maps (2015–2022); NPP-VIIRS-like nighttime light data (NPP-VIIRS: National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite) from 2015 to 2022; road vector data from OSM (OpenStreetMap) for 2015–2022; and population raster data for 2020 from LandScan, the GHSL (Global Human Settlement Layer), WorldPop, and the GPW (Gridded Population of the World). Among them, the data from 2015 to 2022 are used for the significance and collinearity diagnostics, as well as for the spatialization simulation of the population in 2020; the four population raster datasets from 2020 are used for accuracy verification. Data sources and resolutions are shown in Table 1.

2.2. Methods

2.2.1. Min–Max Standardization

Prior to the analysis, a data normalization process for 24 different indicators is necessary to prevent bias in subsequent data processing due to significant variations in the values of these indicators. The min–max normalization method was chosen for this operation. The specific formula is as follows:
$$G = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
where $G$ is the normalized result value, $x$ is the original value of a single indicator, $x_{\max}$ is the maximum value of that indicator, and $x_{\min}$ is the minimum value of that indicator.
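The min–max formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample nighttime-light values are hypothetical, and the handling of a constant indicator (mapped to zeros) is an assumption, not something the study specifies.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a constant indicator is mapped to zeros
        # rather than triggering a division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical nighttime-light values for five grid cells.
night_light = [0.0, 2.5, 5.0, 7.5, 10.0]
normalized = min_max_normalize(night_light)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After this step every indicator lies in the same range, so indicators measured in large units no longer dominate those measured in small units.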

2.2.2. Ordinary Least Squares (OLS)

The ordinary least squares (OLS) linear analysis model is employed to explore the linear relationships between variables. This model is not only user-friendly but also tolerates a certain level of data redundancy, making it particularly useful for constructing complex spatial models and facilitating the preliminary examination of data before detailed analysis. Furthermore, it aids in investigating the interrelationships among various factors. Consequently, this paper adopts the OLS method for assessing the significance of indicators prior to modeling. The model is expressed as follows:
$$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i \tag{2}$$
where $y_i$ is the dependent variable for sample $i$, $\beta_0$ is the intercept, $\beta_j$ is the partial regression coefficient of the $j$-th explanatory variable, $x_{ij}$ is the value of the $j$-th explanatory variable for sample $i$, $\varepsilon_i$ is the residual for sample $i$, and $k$ is the number of explanatory variables.
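As a minimal sketch of the OLS idea, the snippet below fits the single-variable special case ($k = 1$) in closed form and reports the coefficient of determination. The study itself uses many indicators at once; the function name and the sample data here are hypothetical.

```python
def simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares; return (b0, b1, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx            # slope that minimizes the squared residuals
    b0 = mean_y - b1 * mean_x   # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot  # share of variance explained
    return b0, b1, r2


# Hypothetical indicator values and population counts for five units.
indicator = [1.0, 2.0, 3.0, 4.0, 5.0]
population = [2.9, 5.2, 6.9, 9.1, 10.9]
b0, b1, r2 = simple_ols(indicator, population)
print(f"intercept={b0:.2f}, slope={b1:.2f}, R2={r2:.3f}")
```

With several explanatory variables the same least-squares criterion applies, but the coefficients are obtained by solving the normal equations rather than from this closed form.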

2.2.3. Pearson Correlation

Pearson’s correlation coefficient, also known as the product–moment correlation coefficient, is a statistical measure that quantifies the degree and direction of the linear relationship between two variables. In this study, a correlation coefficient greater than 0.9 is taken to indicate strong collinearity between two indicators, which could lead to data redundancy in subsequent model construction. Pearson’s correlation coefficient is therefore used to evaluate the collinearity among database indicators. The formula for calculation is as follows:
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3}$$
where $x_i$ and $y_i$ are the $i$-th observations of the variables $x$ and $y$, $\bar{x}$ and $\bar{y}$ are their means, $n$ is the number of observations, and $r_{xy}$ is the correlation coefficient between $x$ and $y$.
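The screening rule described above (flag indicator pairs whose correlation exceeds 0.9) can be sketched as follows. The indicator names `A`, `B`, `C` and their values are hypothetical, and comparing the absolute value of the coefficient with the threshold is an assumption of this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def collinear_pairs(indicators, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = sorted(indicators)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(indicators[a], indicators[b])
            if abs(r) > threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical indicator values on five grid cells.
indicators = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],   # nearly proportional to A
    "C": [5, 1, 4, 2, 3],    # weakly related to A and B
}
pairs = collinear_pairs(indicators)
print(pairs)
```

Only the pair (`A`, `B`) is flagged here, mirroring how a single highly correlated pair is singled out for removal while moderately correlated indicators are kept.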

2.2.4. Spatial Lag Model (SLM)

The spatial distribution of the population is influenced by natural environmental conditions, socioeconomic factors, and interactions with neighboring areas, leading to varying degrees of agglomeration and a significant spatial dependence. This spatial correlation challenges the classic linear regression model’s ability to address the spatialization needs of population studies, thus underscoring the advantages of employing spatial regression models. Compared to basic spatial regression models, advanced models like the spatial lag model (SLM) and the spatial error model (SEM) feature a greater number of non-zero spatial regression coefficients. These models more effectively harness the spatial clustering characteristics of the data for regression analysis. To contrast with the outcomes derived from machine learning methods, this paper aims to simulate the population spatialization of the NSTS Urban Agglomeration in 2020 using the SLM approach within a multiple regression modeling framework. The formula is specified as follows:
$$y = \rho W y + X \beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I) \tag{4}$$
where $y$ is the vector of the dependent variable, i.e., population size; $X$ is the matrix of independent variables; $\rho$ is the spatial effect coefficient; $W$ is the spatial weight matrix; $\beta$ is the vector of parameters; and $\varepsilon$ is the vector of independent errors.
Utilizing the spatial weight matrix, the population spatialization model was constructed using the following equation:
$$\mathrm{Spop}_i = \rho\, W\, \mathrm{Spop}_i + \sum_{i=1}^{n} \sum_{k=1}^{m} (a_k x_{ik}) + \mu \tag{5}$$
where $\mathrm{Spop}_i$ is the estimated population size of grid $i$ generated by spatial lag regression; $W$ is the spatial weight matrix; $\rho$ is the regression coefficient of the spatially lagged variable $W\,\mathrm{Spop}_i$, which determines the spatial effect of the population distribution; $x_{ik}$ is the value of the $k$-th independent variable in grid $i$, and $a_k$ is the corresponding regression coefficient; and $\mu$ is the vector of independent errors.
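To illustrate how the spatial lag term couples neighboring cells, the sketch below solves $y = \rho W y + t$ for given coefficients, where $t_i$ stands for the non-spatial part $\sum_k a_k x_{ik}$. It is not the estimation procedure used in the study: the row-standardized weights, the fixed-point solver, the four-cell layout, and all numeric values are assumptions made for illustration.

```python
def row_standardize(neighbors):
    """Convert a neighbor list into row-standardized weights (rows sum to 1)."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}


def solve_spatial_lag(weights, trend, rho, tol=1e-10, max_iter=1000):
    """Solve y = rho * W y + trend by fixed-point iteration.

    trend[i] plays the role of sum_k a_k * x_ik for cell i. With
    row-standardized weights the iteration converges whenever |rho| < 1.
    """
    y = dict(trend)  # start from the non-spatial prediction
    for _ in range(max_iter):
        new_y = {
            i: rho * sum(w * y[j] for j, w in weights[i].items()) + trend[i]
            for i in trend
        }
        if max(abs(new_y[i] - y[i]) for i in trend) < tol:
            return new_y
        y = new_y
    raise RuntimeError("fixed-point iteration did not converge")


# Four hypothetical grid cells in a row; each cell neighbors the adjacent ones.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
weights = row_standardize(neighbors)
trend = {0: 100.0, 1: 50.0, 2: 20.0, 3: 10.0}  # made-up sum_k a_k * x_ik
rho = 0.5                                       # made-up spatial lag coefficient

y = solve_spatial_lag(weights, trend, rho)
for i in sorted(y):
    print(i, round(y[i], 2))
```

Because $\rho$ is positive, every cell's estimate is pulled upward by its neighbors, which is the smoothing effect that gives spatial lag predictions their rounded, gradual appearance.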

2.2.5. Random Forest Model (RFM)

Random forest (RF) is an ensemble learning method derived from bagging. It creates multiple distinct datasets through bootstrap sampling and trains a decision tree on each one. The final prediction of the random forest is obtained by aggregating the predictions of the individual trees: by majority vote for classification and by averaging for regression. This method essentially enhances the decision tree algorithm by generating a multitude of decision trees and combining their outcomes into a single, more stable result [48]. The training process of the random forest model (RFM) can be highly parallelized, which markedly accelerates model training. Moreover, RFM can rank the importance of variables, aiding in identifying the impact of each modeling factor on the spatial distribution of the population.
The principle of random forests endows it with several advantages, including the following: high accuracy as it performs well with large datasets and high-dimensional data; resistance to overfitting due to its method of building multiple decision trees and using voting or averaging for predictions; effectiveness with large datasets through parallel processing of multiple trees; the ability to handle missing data without the need for imputation. In this study, utilizing Python 3.12, RFM is applied to a curated database to model the population’s spatial distribution in 2020.

2.2.6. Error Correction

Because the population simulated by the spatial lag model deviates from the census data, the grid-level estimates must be adjusted so that their totals match the resident population recorded at the district and county scales. The correction formula is as follows:
$$\begin{cases} \mathrm{Spop}'_{ji} = \mathrm{Spop}_{ji} \times C_j \\ C_j = \mathrm{Real}_j / \mathrm{Predict}_j \end{cases} \tag{6}$$
where $\mathrm{Spop}'_{ji}$ denotes the corrected spatial lag regression estimate of the population in grid $i$ of district $j$; $\mathrm{Spop}_{ji}$ denotes the uncorrected estimate; $C_j$ is the correction factor for district $j$; $\mathrm{Real}_j$ denotes the statistical population of district $j$; and $\mathrm{Predict}_j$ denotes the estimated population of district $j$.
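The correction step can be sketched as follows, assuming the correction factor is the ratio of the census population to the predicted population of each district, so that corrected grid values sum to the census total. Cell identifiers, district names, and all numbers are hypothetical.

```python
def correct_grid_estimates(grid_pred, grid_district, census):
    """Rescale grid estimates so each district sums to its census total.

    grid_pred:     {cell_id: predicted population}
    grid_district: {cell_id: district name}
    census:        {district name: census population}
    """
    predicted_total = {}
    for cell, value in grid_pred.items():
        district = grid_district[cell]
        predicted_total[district] = predicted_total.get(district, 0.0) + value

    # Assumed form of the correction factor: census total / predicted total.
    factor = {d: census[d] / predicted_total[d] for d in predicted_total}
    corrected = {
        cell: value * factor[grid_district[cell]]
        for cell, value in grid_pred.items()
    }
    return corrected, factor


# Hypothetical example: two districts with two grid cells each.
grid_pred = {"g1": 100.0, "g2": 300.0, "g3": 50.0, "g4": 150.0}
grid_district = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
census = {"A": 500.0, "B": 100.0}

corrected, factor = correct_grid_estimates(grid_pred, grid_district, census)
print(factor)     # {'A': 1.25, 'B': 0.5}
print(corrected)  # {'g1': 125.0, 'g2': 375.0, 'g3': 25.0, 'g4': 75.0}
```

The rescaling preserves the relative pattern within each district and only changes the overall level, which is why it can be applied to the output of either model.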

2.2.7. Accuracy Verification

In this paper, three indicators are employed to evaluate and compare the accuracy of the population estimates from the SLM and RFM against census data at the township and street levels: mean absolute error (MAE), root mean square error (RMSE), and relative root mean square error (%RMSE). A smaller value for these metrics indicates higher accuracy. The formulas for these indicators are as follows:
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| predict_i - real_i \right| \tag{7}$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (predict_i - real_i)^2} \tag{8}$$
$$\%RMSE = RMSE \Big/ \left( \frac{1}{N} \sum_{i=1}^{N} real_i \right) \tag{9}$$
where MAE is the mean absolute error, which reflects the numerical difference between the estimated and actual population sizes; RMSE is the root mean square error, which measures how far the estimated population size deviates from the actual population size; and %RMSE is the relative root mean square error, which measures the overall accuracy of the model estimation. $predict_i$ is the population estimated by a given model or dataset for township, town, or street $i$; $real_i$ is the census population of township, town, or street $i$; and $N$ is the number of townships, towns, and streets.
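The three accuracy indicators can be computed together as in the sketch below. The function name and the sample populations are hypothetical; %RMSE is returned as a fraction of the mean census population, following the formula as written.

```python
import math


def accuracy_metrics(predicted, real):
    """Return (MAE, RMSE, %RMSE) for paired predicted and census populations."""
    n = len(real)
    errors = [p - r for p, r in zip(predicted, real)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    pct_rmse = rmse / (sum(real) / n)  # RMSE relative to the mean census value
    return mae, rmse, pct_rmse


# Hypothetical populations for four townships.
predicted = [110, 190, 330, 380]
real = [100, 200, 300, 400]

mae, rmse, pct_rmse = accuracy_metrics(predicted, real)
print(f"MAE={mae:.1f}, RMSE={rmse:.2f}, %RMSE={pct_rmse:.4f}")
```

RMSE penalizes large errors more heavily than MAE, so a dataset with a few badly estimated townships can have a modest MAE but a large RMSE.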

2.2.8. Standard Deviation Ellipse and Center of Gravity

The standard deviation ellipse is a technique in spatial statistics for describing the distribution patterns of spatial datasets. Unlike traditional panel data analysis, the standard deviation ellipse reveals the directional trends in the spatial distribution of geographic elements, showcasing their dispersion and concentration [49]. The formula for calculating its centroid is as follows:
$$(\bar{x}_{wmc}, \bar{y}_{wmc}) = \left( \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i},\ \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \right) \tag{10}$$
where $(\bar{x}_{wmc}, \bar{y}_{wmc})$ is the weighted mean center of the study object, i.e., the center of gravity; $w_i$ is the weight of point $i$; $x_i$ and $y_i$ are the coordinates of point $i$; and $n$ is the number of points.
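A minimal sketch of the weighted mean center follows, using population as the weight. The coordinates and weights are hypothetical, and only the center of gravity is computed here, not the full standard deviation ellipse.

```python
def weighted_mean_center(points):
    """Return the weighted mean center of (x, y, weight) tuples."""
    total_weight = sum(w for _, _, w in points)
    x_bar = sum(x * w for x, _, w in points) / total_weight
    y_bar = sum(y * w for _, y, w in points) / total_weight
    return x_bar, y_bar


# Hypothetical grid-cell centroids (x, y) with population as the weight.
cells = [
    (0.0, 0.0, 1.0),
    (10.0, 0.0, 3.0),
    (0.0, 10.0, 1.0),
    (10.0, 10.0, 3.0),
]
center = weighted_mean_center(cells)
print(center)  # (7.5, 5.0)
```

The center is pulled toward the heavier cells on the right-hand side, which is how a dominant city shifts the population center of gravity of a whole agglomeration.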

2.2.9. Polycentricity

The local Moran’s I method is utilized to identify the core areas of the urban agglomeration. The degree of urban polycentric development is quantified by the ratio of the population in the sub-centers to the total population across all centers [50]. This ratio is employed to assess the significance of the sub-centers within the entire urban agglomeration. A larger value indicates a higher proportion of the population in the sub-centers, suggesting that these areas play a more crucial role. Consequently, a higher ratio signifies a more pronounced tendency towards polycentric development in the urban agglomeration. The formula is as follows:
$$poly = \frac{pop_{sub}}{pop_{sub} + pop_{main}} \tag{11}$$
where $poly$ is the degree of polycentricity, $pop_{sub}$ is the population of the sub-centers, and $pop_{main}$ is the population of the main center.

2.3. Database Indicator

2.3.1. Data Preprocessing

OSM data from 2015 to 2022 were categorized into five types: railroads; elevated roads and expressways (encompassing highways and arterial roads); urban main roads (including major and secondary roads); secondary urban arterial roads (covering tertiary roads); and urban feeder roads (comprising residential area roads and unclassified roads). The buffer width was determined based on the minimum red line width specified in the “Urban Comprehensive Transportation System Planning Standard (CJJ75-97)” by the Ministry of Housing and Urban–Rural Development of the People’s Republic of China. Utilizing this classification, a road network distribution map (Figure 2) was created to depict the evolution of the road network in the study area over the years.
This study preprocesses the POI data from Gaode Maps spanning from 2015 to 2022. The dataset encompasses a wide range of categories, including food and beverage services, road accessory facilities, place names and addresses, scenic spots, public facilities, businesses and enterprises, shopping, transportation, financial and insurance services, education and cultural services, motorcycle and automobile services (including maintenance and sales), commercial and residential services, lifestyle services, event venues, indoor facilities, sports and leisure, entry points, healthcare, government and community organizations, and lodging services, resulting in a total of 23 major categories. Due to the nature of their collection, POI data may exhibit overlaps and null values, necessitating thorough data cleaning and processing. Post-cleaning, the data are reorganized into 13 categories for analysis: transportation, travel, science and education, finance, healthcare, shopping, residential, governance, employment, public amenities, food services, leisure and entertainment, and kernel density estimation. This categorization facilitates the subsequent step of gridding the data based on Euclidean distance calculations (Figure 3).
Ultimately, data for nighttime light intensity, land cover type Euclidean distance, POI Euclidean distance, building area ratio, road Euclidean distance, and road network area ratio are allocated to a 1000 m × 1000 m grid to create a multi-source database. This database encompasses 24 secondary indicators grouped under 5 primary categories (Table 2). The next step evaluates the significance of these 24 indicators with respect to population, and the collinearity among them, in order to refine the final set of indicators. The refined indicator system is then used for the subsequent spatialization of population data.

2.3.2. Significance and Collinearity Diagnostics

Before proceeding with spatialized simulation, it is crucial to assess the database’s indicators for significance and collinearity. First, the ordinary least squares (OLS) method is used to examine the significance of each indicator in relation to population distribution. This step identifies and eliminates indicators that have no statistically significant relationship with population distribution, preventing them from skewing the subsequent spatialization. Next, the Pearson correlation method is used to detect and filter out collinearity among indicators, leading to a robust indicator system. These measures ensure that the indicator system chosen for the spatialization simulation is both significant and free from severe collinearity, thereby enhancing the accuracy and reliability of the findings.
In the OLS significance analysis (Table 3), the R² value of the indicator database is 0.635, indicating that the 24 indicators explain 63.5% of the variation in population distribution. The analysis reveals that indicators N01, R01, R02, U03, U06, U07, P02, P08, and P13 have a significant positive impact on the population (POP). Conversely, U02, P04, P05, P07, and P12 have a significant negative impact on POP. The impact of indicators B01, U01, U04, U05, P01, P03, P06, P09, P10, and P11 on population distribution is not statistically significant, so these ten indicators are omitted from further analysis.
In the Pearson correlation analysis, a coefficient greater than 0.9 indicates strong collinearity between two indicators. The heatmap of pairwise relationships (Figure 4) shows that the collinearity among the database’s indicators is generally moderate. However, the correlation coefficient between P02 and P08 is 0.91, which exceeds the 0.9 threshold and indicates substantial collinearity. Therefore, both indicators are excluded from further analysis.
Considering the results of the significance and collinearity diagnostics, a total of 12 indicators were excluded: B01, U01, U04, U05, P01, P02, P03, P06, P08, P09, P10, and P11. Consequently, 12 indicators were retained for the final indicator system: N01, R01, R02, U02, U03, U06, U07, P04, P05, P07, P12, and P13 (Table 4). This refined indicator system forms the basis for subsequent studies on population spatialization.
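The bookkeeping of the two screening steps can be checked with a short sketch. The indicator codes are those reported in the text; the variable names are illustrative.

```python
# The 24 indicator codes named in the significance analysis.
all_indicators = (
    ["N01", "B01", "R01", "R02"]
    + [f"U{i:02d}" for i in range(1, 8)]    # U01 ... U07
    + [f"P{i:02d}" for i in range(1, 14)]   # P01 ... P13
)

# Step 1: indicators reported as not statistically significant in OLS.
not_significant = {"B01", "U01", "U04", "U05", "P01",
                   "P03", "P06", "P09", "P10", "P11"}

# Step 2: the pair reported as collinear (Pearson r = 0.91 > 0.9).
collinear = {"P02", "P08"}

excluded = not_significant | collinear
retained = [name for name in all_indicators if name not in excluded]

print(len(all_indicators), len(excluded), len(retained))  # 24 12 12
print(retained)
```

The retained list reproduces the 12 indicators of the final system, confirming that the ten non-significant indicators and the collinear pair account for all exclusions.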

3. Results

3.1. Spatialization of the Population

3.1.1. Spatial Autocorrelation Test

Before conducting spatialization analysis, it is necessary to perform a spatial autocorrelation analysis on the population distribution to confirm that it is spatially dependent within the region. This analysis, carried out using GeoDa software v1.20, provides a foundation for the subsequent spatialized simulation. The global Moran’s I index was selected to evaluate the spatial correlation of population density at the township and street levels based on data from the 2020 Seventh Population Census within the NSTS Urban Agglomeration. A spatial weight matrix must be established first; rook contiguity was used to generate a first-order spatial weight matrix. Using this matrix, the global Moran’s I index was calculated to be 0.587, with a p-value of 0.001 and a z-value of 13.7836, indicating significant positive spatial autocorrelation in the population distribution. The Moran scatter plot (Figure 5) shows the observations primarily clustered in the first and third quadrants, indicating both “high–high” and “low–low” population concentrations within the study area.
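The global Moran's I statistic with first-order rook contiguity can be sketched on a small regular grid as follows. The study computes it in GeoDa on township polygons; the 4 × 4 grid, the row-standardized weights, and the two test patterns below are assumptions made purely for illustration.

```python
def rook_neighbors(n_rows, n_cols):
    """First-order rook contiguity: cells sharing an edge are neighbors."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbors[(r, c)] = [
                (i, j) for i, j in candidates
                if 0 <= i < n_rows and 0 <= j < n_cols
            ]
    return neighbors


def global_morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights (an assumption here)."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {cell: v - mean for cell, v in values.items()}
    cross, weight_sum = 0.0, 0.0
    for i, js in neighbors.items():
        w = 1.0 / len(js)  # each row of the weight matrix sums to 1
        for j in js:
            cross += w * dev[i] * dev[j]
            weight_sum += w
    return (n / weight_sum) * (cross / sum(d * d for d in dev.values()))


nb = rook_neighbors(4, 4)

# Clustered pattern: high values on the left half, low values on the right.
clustered = {(r, c): 10.0 if c < 2 else 0.0 for r in range(4) for c in range(4)}
# Checkerboard pattern: every cell differs from all of its rook neighbors.
checker = {(r, c): 10.0 if (r + c) % 2 == 0 else 0.0
           for r in range(4) for c in range(4)}

i_clustered = global_morans_i(clustered, nb)
i_checker = global_morans_i(checker, nb)
print(round(i_clustered, 3), round(i_checker, 3))
```

The clustered pattern yields a clearly positive index and the checkerboard a strongly negative one, which is the contrast that a positive value such as the reported 0.587 should be read against.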

3.1.2. Construction of Population Spatialization Model Based on SLM

The first-order rook contiguity weight matrix was selected for spatializing the 2020 population, using the number of enumerated individuals as the dependent variable and the 12 retained indicators as explanatory variables. The analysis yielded an R² value of 0.890965 and an adjusted R² value of 0.889068, indicating a favorable overall fit. In comparison, the ordinary least squares (OLS) regression produced an R² of 0.597331 and an adjusted R² of 0.597280. The better fit of the spatial lag model (SLM) suggests that it is preferable to OLS for analyzing population spatialization. The regression coefficients for each indicator are detailed in Table 5.
The indicator values for each grid were inputted into SLM Equation (5) to predict the spatial distribution of the population across the grids. Considering the varying developmental stages of cities, it was necessary to adjust the uniformly calculated values to match the actual census population of districts and counties. This adjustment was made by calculating correction coefficients using Equation (6). Subsequently, the adjusted grid data were visualized to produce a map showing the population distribution in the NSTS Urban Agglomeration, as simulated by the SLM (Figure 6).
The results indicate that, in 2020, the population in the study area predominantly clustered in the cities of Urumqi, Shihezi, Karamay, and Kuitun, with Urumqi hosting the largest and most dense population aggregation, particularly pronounced in the southern part of the city. Although the sub-population clusters in Shihezi and Karamay are smaller in size, their agglomeration is distinctly visible. In contrast, the population center in Kuitun is only beginning to form and remains relatively dispersed, lacking noticeable concentration. The SLM’s simulation, augmented by various geographic data indicators, reveals a clear center-and-circle structure in the population’s spatial distribution, offering a more intuitive representation than traditional statistical data and providing a solid foundation for diverse spatial analyses.
Among all 95,214 grids (Figure 7), the SLM results contain 94,681 grids with a population of less than 2000, accounting for 99.44% of all grids. This indicates that the population of the study area is, overall, small and dispersed. Grids with a population greater than 2000 and up to 4000 number 249, and grids with a population greater than 4000 and up to 8000 number 166; they represent 46.71% and 31.14% of the remaining grids, respectively, and together make up the bulk of the population concentration areas. Grids with a population of more than 8000 are extremely rare, accounting for only 0.12% of all grids, which indicates that highly concentrated areas are scarce and mostly located in urban and town centers.

3.1.3. Construction of Population Spatialization Model Based on RFM

In Python, the RandomForestRegressor class from the scikit-learn (sklearn) library was used to train a model on the indicator database and predict the 2020 population; the parameter settings are listed in Table 6. After training, the model achieved an R² of 0.954834, which is higher than that of the SLM.
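A minimal sketch of this training step is given below. It uses synthetic data with 12 columns standing in for the 12 indicators, and the hyperparameters are placeholders, not the values of Table 6. The out-of-bag score is added here only as an illustration of how a training R² can be cross-checked; the article does not state that it was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the indicator database: 500 samples, 12 indicators.
rng = np.random.default_rng(0)
X = rng.random((500, 12))
y = 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 50, 500)

# Placeholder hyperparameters (assumptions, not those of the study).
model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)

r2_train = r2_score(y, model.predict(X))  # fit on the training samples
r2_oob = model.oob_score_                 # R² on samples left out of each tree
```

The training R² of a random forest is usually optimistic, so comparing it with the out-of-bag value (or a held-out set) gives a more cautious view of the fit.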
The population values simulated by the random forest model were corrected with the same method as for the SLM. The correction shows that the initial estimates of the model are much better for counties with large, clearly aggregated populations than for regions with smaller, spatially dispersed populations. The corrected grid data were then mapped to obtain the spatial distribution of population in the NSTS Urban Agglomeration in 2020 (Figure 8).
The RFM result for 2020 likewise shows four distinct aggregation areas: Urumqi and its surroundings, Shihezi, Kuitun, and Karamay. The aggregation in Urumqi and Changji is the largest and most pronounced, with a clear center of gravity surrounded by five smaller aggregation areas. Shihezi shows clear aggregation in its center, but its driving effect on the surroundings is weak. In Kuitun and Karamay, populated areas are evident but more dispersed, and aggregation is not pronounced. The population edges simulated by the RFM are irregular in shape, which represents population aggregation and distribution more realistically than the rounded edges produced by the SLM.
To further examine how population is distributed in the study area, all 95,214 grids were grouped by population density (Figure 9). Approximately 94,483 grids have a population density of less than 2000 people per square kilometer, accounting for 99.2% of the total, which indicates that the population of the NSTS Urban Agglomeration is small overall, with low density and a dispersed spatial distribution. There are 435 grids with a permanent population density of 2000 to 4000 people per square kilometer and 164 grids with a density of 4000 to 8000 people per square kilometer; together they account for 81.94% of the grids with a density greater than 2000 people per square kilometer. Grids with a density greater than 8000 people per square kilometer are few and mainly concentrated in urban and rural settlements. Compared with the SLM results, the RFM results show larger extremes and more pronounced variation between populated areas.
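The class statistics above can be reproduced with a few lines of code. The sketch below first counts grids per density class on a small hypothetical sample, then recomputes the two shares quoted for the RFM from the reported grid counts. The function name `class_counts` and the sample values are assumptions; the class limits follow the text, with each class including its upper limit.

```python
import numpy as np


def class_counts(values, upper_limits=(2000, 4000, 8000)):
    """Count values per class: <=2000, (2000, 4000], (4000, 8000], >8000."""
    idx = np.digitize(values, bins=np.asarray(upper_limits), right=True)
    return np.bincount(idx, minlength=len(upper_limits) + 1)


# Hypothetical grid densities (people per square kilometer).
sample = [0, 1500, 2000, 2500, 4000, 7999, 8000, 12000]
counts = class_counts(sample)

# Shares recomputed from the grid counts reported for the RFM.
total_grids = 95_214
low_grids = 94_483
mid_grids = 435 + 164
share_low = 100 * low_grids / total_grids                  # share of all grids
share_mid = 100 * mid_grids / (total_grids - low_grids)    # share of grids above 2000
```

Rounded to the precision used in the text, `share_low` is 99.2 and `share_mid` is 81.94.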

3.2. Accuracy Comparison

The four widely used population raster datasets for 2020 (GHSL, WorldPop, LandScan, and GPW) were resampled to a resolution of 1000 m × 1000 m, and the estimated population of each dataset was summarized at the level of townships, towns, and streets. The accuracy of each dataset was calculated using Equations (7)–(9) and compared with that of the SLM and RFM; the results are shown in Table 7.
The SLM population estimates are more accurate than the GHSL and GPW datasets on all three accuracy metrics (MAE, RMSE, and %RMSE), but they still fall short of WorldPop and LandScan. This partly validates the feasibility of using multi-source data, especially POI big data, to estimate fine-grained population distribution, although the overall accuracy of the SLM method still needs improvement. By comparison, the RFM far outperforms the four population datasets in the accuracy evaluation, with a %RMSE as low as 0.85, which indicates that the RFM combined with multi-source data achieves excellent accuracy in population spatialization. The RFM also far exceeds the SLM on all three accuracy metrics, which further confirms its numerical accuracy in simulating the spatial distribution of population in the study area.
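For readers who want to recompute the three metrics, a minimal sketch follows. Equations (7)–(9) are not reproduced in this section, so the definitions are assumptions: MAE and RMSE in their standard form, and %RMSE taken as RMSE divided by the mean observed population (expressed as a ratio). The function name and the three example units are hypothetical.

```python
import numpy as np


def accuracy_metrics(observed, estimated):
    """Return (MAE, RMSE, %RMSE) for unit-level population estimates.

    %RMSE is assumed here to be RMSE / mean(observed), as a ratio.
    """
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    error = estimated - observed
    mae = np.mean(np.abs(error))
    rmse = np.sqrt(np.mean(error ** 2))
    pct_rmse = rmse / np.mean(observed)
    return mae, rmse, pct_rmse


# Hypothetical census and estimated populations of three township-level units.
census = [100.0, 200.0, 300.0]
estimate = [110.0, 190.0, 330.0]
mae, rmse, pct_rmse = accuracy_metrics(census, estimate)
```

Because %RMSE is normalized by the mean population, it allows comparison between areas of very different size, which MAE and RMSE alone do not.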
While correcting the results, it was found that the correction coefficient of the RFM tends to be close to 1 in areas of high population density, which suggests that the model may reproduce the actual population better in these areas. To examine the performance of the model across different population densities, the population density at the level of townships, towns, and streets was divided into three categories (high, medium, and low density) using the natural breaks method, and the accuracy was calculated for each category. Low-density areas range from 0.373392 to 6863.783147 people per square kilometer, medium-density areas from 6863.783148 to 19,012.225642 people per square kilometer, and high-density areas from 19,012.225643 to 41,588.494104 people per square kilometer. The results obtained with Equations (7)–(9) are shown in Figure 10.
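The idea behind the natural breaks classification can be sketched for the three-class case. The code below searches all possible pairs of break positions and keeps the pair that minimizes the within-class sum of squared deviations, which is the objective of the Jenks natural breaks method. It is an illustrative exhaustive search, not the GIS routine presumably used in the study; the function name and the density values are hypothetical.

```python
import numpy as np


def natural_breaks_3(values):
    """Upper limits of the low and medium classes for a 3-class natural breaks split."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if n < 3:
        raise ValueError("need at least three values for three classes")
    # Prefix sums allow the squared deviation of any slice to be computed in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def ssd(i, j):
        """Sum of squared deviations from the mean for x[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best_cost, best_a, best_b = np.inf, None, None
    for a in range(1, n - 1):          # first class is x[:a]
        for b in range(a + 1, n):      # second class is x[a:b], third is x[b:]
            cost = ssd(0, a) + ssd(a, b) + ssd(b, n)
            if cost < best_cost:
                best_cost, best_a, best_b = cost, a, b
    return x[best_a - 1], x[best_b - 1]


# Hypothetical densities (people per square kilometer) with three clear groups.
densities = [5, 12, 30, 7000, 7500, 8200, 20000, 21000, 22000]
low_limit, medium_limit = natural_breaks_3(densities)
```

Units with a density up to `low_limit` fall in the low class, those up to `medium_limit` in the medium class, and the rest in the high class.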
Overall, the RFM is more accurate than the GHSL, WorldPop, LandScan, and GPW datasets in all three density classes. In terms of MAE and RMSE, the RFM is more accurate in low- and high-density areas and less accurate in medium-density areas, while its %RMSE accuracy improves notably with increasing density. Regardless of population density, therefore, the RFM holds a clear advantage in population spatialization, and it represents the spatial distribution of population in high-density areas particularly well. This is relevant for tasks that require an accurate spatial distribution of population, such as public resource allocation and infrastructure development.

3.3. Patterns of Spatial Distribution of the Population

3.3.1. Cold and Hot Spots and Center of Gravity Distribution

The population distribution simulated by the RFM, which has the higher accuracy, was used for the cold/hot spot and center of gravity analysis (Figure 11). The long axis of the standard deviation ellipse of the population distribution runs SE–NW (from east of Urumqi to north of Shawan), while the short axis runs NE–SW (from central Wujiaqu to north of Changji). The center of gravity lies in the middle of the boundary between Changji and Hutubi, offset from Urumqi and its circle. The orientation of the long axis indicates that the population of the NSTS Urban Agglomeration is distributed along a northwest–southeast direction; the ellipse is relatively flat, indicating that the population is concentrated mainly in the central part of the study area.
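The center of gravity and the orientation of the ellipse can be illustrated with a short sketch. It computes the population-weighted mean center and derives the axes from the eigen-decomposition of the weighted covariance matrix. This is an assumed, simplified formulation: GIS packages report the same orientation but may scale the axis lengths differently. The function name and coordinates are hypothetical.

```python
import numpy as np


def weighted_ellipse(x, y, weights):
    """Weighted mean center, semi-axis lengths and major-axis angle.

    The angle is measured counter-clockwise from the east (positive x) direction,
    in degrees within [0, 180).
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, weights))
    cx = np.average(x, weights=w)
    cy = np.average(y, weights=w)
    dx, dy = x - cx, y - cy
    cov_xy = np.average(dx * dy, weights=w)
    cov = np.array([
        [np.average(dx * dx, weights=w), cov_xy],
        [cov_xy, np.average(dy * dy, weights=w)],
    ])
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    minor, major = np.sqrt(np.clip(eigvals, 0.0, None))
    vx, vy = eigvecs[:, 1]                        # direction of the major axis
    angle = np.degrees(np.arctan2(vy, vx)) % 180.0
    return (cx, cy), major, minor, angle


# Hypothetical grid centers lying on a northwest-southeast line, with populations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [4.0, 3.0, 2.0, 1.0, 0.0]
pop = [1.0, 2.0, 3.0, 2.0, 1.0]
center, major, minor, angle = weighted_ellipse(xs, ys, pop)
```

In this example the major axis points 135° counter-clockwise from east, that is, along the northwest–southeast direction, and the minor axis is close to zero because all points lie on one line. A large ratio of major to minor axis corresponds to the "flat" ellipse described in the text.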
In 2020, there are 10 distinct population hotspots in the study area. Urumqi, Wujiaqu, Changji, and Hutubi contain pronounced high-population hotspots, which form the primary circle of population aggregation. Shihezi and Shawan, as well as Kuitun and its surroundings, show clear hotspots that form secondary circles of population concentration. Small sub-hotspot areas in the southern part of Jimusar and the western part of Karamay form a tertiary circle of population aggregation. The population of the NSTS Urban Agglomeration has thus formed a “one center, multiple focal points” pattern, but there are noticeable cold-spot gaps between the clusters, indicating weak connectivity within the urban agglomeration.

3.3.2. Identification of Urban Agglomeration Centers

Based on the local Moran’s I results, the centers of the NSTS Urban Agglomeration were identified and extracted. The “high–high” clustering areas were spatially extracted and merged following the method of a previous study [51]. Regions with a population of at least 100,000 people and an area greater than 3 km² were selected as the centers of the urban agglomeration. The results are shown in Figure 12.
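The selection rule can be sketched as follows. The code groups edge-adjacent “high–high” grids into regions and keeps the regions that meet the two thresholds stated above. How regions were merged in [51] is not described in this section, so the 4-neighbour grouping, the function name, and the example raster are assumptions; each grid is taken to cover 1 km².

```python
from collections import deque

import numpy as np


def extract_centers(hh_mask, population, min_pop=100_000,
                    min_area_km2=3.0, cell_area_km2=1.0):
    """Group adjacent high-high grids and keep regions passing both thresholds."""
    hh_mask = np.asarray(hh_mask, dtype=bool)
    population = np.asarray(population, dtype=float)
    n_rows, n_cols = hh_mask.shape
    seen = np.zeros_like(hh_mask, dtype=bool)
    centers = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not hh_mask[r, c] or seen[r, c]:
                continue
            # Breadth-first search over edge-sharing neighbours.
            queue = deque([(r, c)])
            seen[r, c] = True
            cells = []
            while queue:
                i, j = queue.popleft()
                cells.append((i, j))
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < n_rows and 0 <= nj < n_cols
                            and hh_mask[ni, nj] and not seen[ni, nj]):
                        seen[ni, nj] = True
                        queue.append((ni, nj))
            area = len(cells) * cell_area_km2
            total_pop = float(sum(population[i, j] for i, j in cells))
            # Thresholds from the text: population >= 100,000 and area > 3 km².
            if total_pop >= min_pop and area > min_area_km2:
                centers.append({"cells": cells, "area_km2": area,
                                "population": total_pop})
    return centers


# Hypothetical raster: a 4-grid region and a 2-grid region of high-high grids.
pop_grid = np.zeros((4, 5))
pop_grid[:2, :2] = 30_000    # 4 km², 120,000 people: kept
pop_grid[1:3, 4] = 80_000    # 2 km², 160,000 people: rejected, area too small
centers = extract_centers(pop_grid > 0, pop_grid)
```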
The NSTS Urban Agglomeration has four levels of centers; their extent, area, and population are given in Table 8. The Level I center lies in the middle of Urumqi and at the junction of Changji and Wujiaqu, and it has the largest area and the highest population density in the study area. It radiates outward from Urumqi to the north, southeast, and northwest. The Level II center is located in the northern part of Shihezi and at its junction with Manas. It lies mostly within the city limits, with minimal expansion, mainly to the north. The Level III center is situated at the junction of Kuitun, Karamay, and Wusu and covers a considerable area in all three, indicating a certain degree of exchange and integration among the three cities. Its core lies in the northern part of the Dushanzi District of Karamay and extends to the northeast and northwest. The Level IV center is located in the northwestern part of Karamay, within the administrative district, and has a smaller area and population. It is at an early stage of formation and expands slightly outward from its core in an irregular diamond shape.
Applying Equation (11) to the four centers of the urban agglomeration gives a population polarization degree of 0.187946, indicating relatively pronounced polarization. The primary pole is the Level I center, which attracts the vast majority of the population in the study area. Such a unipolar structure can lead to problems such as unequal resource distribution and imbalanced industrial development across the region, and ultimately to unbalanced and unsustainable development among cities.

3.3.3. Exploration of Spatial Patterns of Population

As shown in Figure 13, the NSTS Urban Agglomeration exhibits a spatial pattern of “one axis, one center, and multiple focal points”. “One center” refers to the first-tier circle in the eastern part of the urban agglomeration, represented by Urumqi, Wujiaqu, and Changji, together with the “high–high” population concentration areas within Manas, Fukang, and Jimusar. This circle is characterized by a large land area and a concentrated population, with Urumqi in a clearly dominant position. Its overall shape is an irregular yan-shaped distribution with north–south as the main axis and east–west as the two wings. “Multiple focal points” comprise two second-tier circles and one third-tier circle in the central, western, and northwestern parts of the urban agglomeration. The central second-tier circle is represented by Shihezi, Shawan, and Manas, along with the “high–high” concentration areas within them. It is characterized by a relatively small area, more dispersed concentration areas, and smaller size differences between the center and the concentration areas. The western second-tier circle includes the Level III center represented by Karamay (Dushanzi District), Wusu, and Kuitun, as well as the “high–high” population concentration area within Wusu. It is characterized by a small land area, few internal concentration areas, and weak concentration. The northwestern third-tier circle consists of the Kelamayi area, Baijiantan District, and Uerhe District of Karamay, with its center in the northern part of the Kelamayi area. It is characterized by its small scale and by existing independently within the administrative region, without association with other cities.
Based on these four layers, “one axis” runs from northwest to southeast, connecting the internal parts of the region and linking the “one center, multiple focal points” together to form the overall spatial distribution pattern of the population in the NSTS Urban Agglomeration.

4. Discussion

4.1. Feature Importance Analysis

The importance of each feature in the RFM was measured using MDI (mean decrease in impurity) to assess its contribution to the simulated spatial population pattern. As shown in Figure 14, the highest importance value, 0.6812, belongs to the kernel density of POIs (the “core density” indicator). It is followed by U02, N01, U06, R02, R01, U03, U07, and P04, with scores of 0.0760, 0.0682, 0.0406, 0.0272, 0.0267, 0.0245, 0.0220, and 0.0124, respectively. The other three POI indicators had importance values below 0.01, which suggests that the explanatory role of POI data in the population spatialization model is concentrated in the density indicator, while the importance of the natural cover and urban construction indicators is more stable. This supports the use of POI data, a big data source on human activities, in population spatialization. Surprisingly, the Euclidean distance to forest ranked second in importance. A possible reason is that human economic activities typically occur in areas without forest cover and that human expansion tends to reduce forest coverage [52]. Nighttime lights, commonly used as base data for population spatialization [53,54], ranked third in this study. Bare land, which is usually unutilized land without human activities, ranked fourth. Road network area and Euclidean distance to roads, as indicators of urban construction, are related to population distribution and activity and ranked fifth and sixth, respectively. Grasslands, as a significant indicator of water source distribution, also reflect population distribution to some extent. Impervious surfaces, typically covered with low-permeability materials such as roofs, parking lots, and roads, are one of the most prominent features of urbanization [55].
The distance to impervious surfaces reflects the boundaries of human-made buildings in the study area and thus influences the simulation of population distribution. The overall importance of the reclassified POI service indicators was minimal, with only financial services exceeding 0.01, which indicates that reclassified POIs need further refinement or a new classification system before they can serve as a basis for population spatialization in future research.
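In scikit-learn, the impurity-based importance (MDI) of a fitted random forest is exposed through the `feature_importances_` attribute, whose values sum to 1. The sketch below ranks four synthetic features in this way; the feature names and the data are hypothetical and only mimic a situation in which one indicator dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical indicator names and synthetic data in which the first
# feature explains most of the variation in the target.
names = ["poi_kernel_density", "dist_to_forest", "night_lights", "bare_land"]
rng = np.random.default_rng(42)
X = rng.random((400, len(names)))
y = 5000 * X[:, 0] + 600 * X[:, 1] + rng.normal(0, 100, 400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity, normalized so that all importances sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(names, importances), key=lambda item: item[1], reverse=True)
```

Because MDI values are normalized, a single score such as 0.6812 means that one indicator accounts for roughly two thirds of the total impurity reduction, which leaves little for the remaining indicators to share.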

4.2. Significance and Shortcomings

In recent years, some studies have combined POI data with machine learning methods for population spatialization. In line with this research, the present study confirms the feasibility of combining geospatial big data with the random forest method for population spatialization, and the results are significantly better than those of a traditional spatial econometric model. An advantage of this study is that eight years of data were used to screen out factors that were not significant or were severely collinear in the study area, which tailors the indicator set to the specific conditions of the NSTS Urban Agglomeration and, by extension, the Xinjiang region. The goodness of fit and the accuracy validation support this point. In population spatialization, selecting appropriate indicator factors is as important as continued innovation in methods and tools. Methods that are better adapted to the region make the results more scientifically valid and effective, which is one of the contributions of this study.
To further understand how populations of different densities are distributed within urban centers, the gridded population of one center was examined. Figure 15 depicts the population density areas of Urumqi, a Level I center. In high-density areas, represented by Tianshan District (Figure 15b), the higher grid values fall predominantly in areas densely packed with buildings. Grids in medium-density areas correspond to factory zones, and low-density areas coincide with mountainous or other high-altitude, sparsely populated areas. Although the 1000 m resolution can represent only contiguous built-up areas and not individual buildings, the method still has considerable practical value for urban-scale studies. It can be applied in future research within the study area to resource allocation [56,57], urban built-up area extraction [58], and disaster risk assessment [59], among other fields.
However, this study also has limitations. First, the POI data, an important data source in this research, are restricted by the year of acquisition: data prior to 2012 cannot be obtained from the platform, so population spatialization before that year is similarly constrained. Second, because of the large data volume, hardware limitations, and the size of the study area, the spatial resolution of this research is only 1000 m × 1000 m, which requires further improvement. Third, the random forest model is already mature in related research, and subsequent studies could update the method by incorporating deep learning. Fourth, the exploration of the spatial structure of the urban agglomeration is based solely on population spatial patterns simulated from multi-source data, whereas the actual spatial structure is influenced by many more factors and requires more in-depth research. More importantly, because the indicators were selected for the study area, this research may be particularly applicable to modeling population distribution in Xinjiang, which occupies one-sixth of China’s land area. Whether the method can be extended to other regions with similar natural and socioeconomic conditions requires further exploration.

4.3. Recommendations

As an oasis urban agglomeration in the arid region of northwest China, the NSTS Urban Agglomeration features unique natural and climatic conditions. It spans a vast geographical area but has a relatively sparse and dispersed population. The cities within the agglomeration are far apart, with relatively weak transportation accessibility, making the choice of indicators and methods for population spatialization simulation distinct from those used in major cities like Beijing and Shanghai. This study, based on indicators that reflect regional characteristics, explores regional population spatialization and can provide references for research on spatial structure expansion, the diffusion and agglomeration of economic activities, and the allocation of public services. Based on the existing population spatial patterns discussed above, the following recommendations are proposed:
  • Strengthen the radiating and driving effect: It is necessary to enhance the connection between the eastern and central–western parts of the urban agglomeration, further improve the information gathering and radiating capacity of the Urumqi–Shihezi–Kuitun–Karamay development axis both within and outside the region, strengthen the construction of transportation infrastructure, promote linkages between industries, and provide active guidance and support from the government to promote a more balanced and diversified new pattern of regional development.
  • Coordinate regional communication and management: There is a need to strengthen the capacity for unified regional planning and management. Currently, there is a significant development gap between the “center” and “focal points” within the NSTS Urban Agglomeration. Further improvements in planning management and coordinated development mechanisms at various levels are required, with a focus on strengthening resource allocation towards peripheral cities, such as Wusu and Fukang, and accelerating the integration process.
  • Strengthen internal industry collaboration: By encouraging cooperation between small- and medium-sized cities and the core city of Urumqi, an urban agglomeration structure with multiple active centers and extensive network connections should be constructed. Important linkages within the city network should be formed to optimize resource allocation and enhance the overall strength of the region. Ultimately, an economically strong, environmentally friendly, and culturally diverse urban agglomeration should be built.
  • Respect the objective laws of development: Given the large distances between cities and counties within the urban agglomeration, internal policies should not be detached from the actual economic development of the cities, and the development paths of other large urban agglomerations should not be imitated blindly. Instead, policies should adapt to local geographic characteristics and set new standards for measuring the development status of the urban agglomeration, thereby promoting sustainable regional development.

5. Conclusions

The study is based on an indicator database constructed from traditional geographic data and emerging geographic big data. The significance and collinearity of all 24 categories of data with respect to population numbers were tested for the period 2015–2022, and 12 categories were finally selected to build the indicator database for population spatialization. Using the SLM, a spatial econometric model, and the RFM, a machine learning method, the spatial distribution of the population in the NSTS Urban Agglomeration in 2020 was simulated and its accuracy verified. The findings are as follows:
  • Compared with statistical data, the population simulated by the SLM shows the spatial distribution of population more intuitively and effectively, but the result has a pronounced near-circular layered structure.
  • The spatial distribution of population under RFM simulation exhibits obvious spatial differentiation with irregular and expressive edges.
  • The SLM is more accurate than the GHSL and GPW population raster datasets but still needs improvement. The RFM is far more accurate than all four selected datasets.
  • The spatial pattern of the population of the NSTS Urban Agglomeration is mainly divided into four distinct circles, showing the distribution characteristics of “one center, multiple focal points”.
This study demonstrates that machine learning, combined with traditional geographic data and big geographic data tailored to local conditions, achieves good simulation results and accuracy in population spatialization. This shows that not only the choice of methods but also the selection of indicator factors is crucial in such research. This approach can be applied more deeply to population-related studies in the NSTS Urban Agglomeration and even the entire Xinjiang region.

Author Contributions

Conceptualization, Y.Z. and H.W.; Data curation, Y.Z.; Formal analysis, Y.Z.; Funding acquisition, H.W.; Investigation, Y.Z.; Methodology, Y.Z.; Project administration, H.W.; Resources, H.W.; Software, Y.Z.; Supervision, H.W.; Validation, Y.Z.; Visualization, Y.Z.; Writing—original draft, Y.Z.; Writing—review and editing, Y.Z., K.L., C.W. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Third Xinjiang Scientific Expedition Program of Ministry of Science and Technology of the People’s Republic of China, grant number 2021xjkk0902.

Institutional Review Board Statement

The study did not require ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

Thank you to the hard-working editors and reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Feng, Z.; Yang, Y.; You, Z.; Zhang, J. Research on the suitability of population distribution at the county level in China. Acta Geogr. Sin. 2014, 69, 723–737.
  2. Li, J.; Lu, D.; Xu, C.; Li, Y.; Chen, M. Spatial heterogeneity and its changes of population on the two sides of Hu Line. Acta Geogr. Sin. 2017, 72, 148–160.
  3. Ye, T.T.; Zhao, N.Z.; Yang, X.C.; Ouyang, Z.T.; Liu, X.P.; Chen, Q.; Hu, K.J.; Yue, W.Z.; Qi, J.G.; Li, Z.S.; et al. Improved population mapping for China using remotely sensed and points-of-interest data within a random forests model. Sci. Total Environ. 2019, 658, 936–946.
  4. Chen, Z. Research on Demographic Statistical Data Spatialization Based on Residential Area Classifying. Geospat. Inf. 2016, 14, 47–48+52+47.
  5. Guo, H.; Zhu, W. A review on the spatial disaggregation of socioeconomic statistical data. Acta Geogr. Sin. 2022, 77, 2650–2667.
  6. Lung, T.; Lübker, T.; Ngochoch, J.K.; Schaab, G. Human population distribution modelling at regional level using very high resolution satellite imagery. Appl. Geogr. 2013, 41, 36–45.
  7. He, M.; Xu, Y.M.; Li, N. Population Spatialization in Beijing City Based on Machine Learning and Multisource Remote Sensing Data. Remote Sens. 2020, 12, 1910.
  8. Briggs, D.J.; Gulliver, J.; Fecht, D.; Vienneau, D.M. Dasymetric modelling of small-area population distribution using land cover and light emissions data. Remote Sens. Environ. 2007, 108, 451–466.
  9. Zhuo, L.; Chen, J.; Shi, P.; Gu, Z.; Fan, Y.; Ichinose, T. Modeling Population Density of China in 1998 Based on DMSP/OLS Nighttime Light Image. Acta Geogr. Sin. 2005, 60, 266–276.
  10. Gao, Q.; Alimujiang, K.-S. Modeling the Population Spatial Distribution of Tianshan North-slope Urban Agglomeration Based on DMSP/OLS Night Lighting Data. Northwest Popul. J. 2017, 38, 113–120.
  11. Li, H.; Zhang, H.; Wang, M. A Comparative Study of Population Spatialization Based on NPP/VIIRS and LJ1-01 Night Light Data: Taking Beijing for an Example. Remote Sens. Inf. 2021, 36, 90–97.
  12. Meiling, W.; Hesheng, Z. Research on Population Spatialization Based on Luojia-1 Nighttime Light Data. Geospat. Inf. 2021, 19, 53–56+57.
  13. Gaughan, A.E.; Stevens, F.R.; Linard, C.; Jia, P.; Tatem, A.J. High Resolution Population Distribution Maps for Southeast Asia in 2010 and 2015. PLoS ONE 2013, 8, e55882.
  14. Ye, Q.; Yang, X.; Jiang, D. The Grid Scale Effect Analysis on Town leveled Population Statistical Data Spatialization. J. Geo-Inf. Sci. 2010, 12, 40–47.
  15. Hu, L.J.; He, Z.Y.; Liu, J.P. Adaptive Multi-Scale Population Spatialization Model Constrained by Multiple Factors: A Case Study of Russia. Cartogr. J. 2017, 54, 265–282.
  16. Zhuang, D.F.; Liu, M.L.; Deng, X.Z. Spatialization model of population based on dataset of land use and land cover change in China. Chin. Geogr. Sci. 2002, 12, 114–119.
  17. Wang, K.; Cai, H.; Yang, X. Multiple scale spatialization of demographic data with multi-factor linear regression and geographically weighted regression models. Prog. Geogr. 2016, 35, 1494–1505.
  18. Xiong, J.N.; Li, K.; Cheng, W.M.; Ye, C.C.; Zhang, H. A Method of Population Spatialization Considering Parametric Spatial Stationarity: Case Study of the Southwestern Area of China. ISPRS Int. J. Geo-Inf. 2019, 8, 495.
  19. Tan, C.D.; Tang, Y.H.; Wu, X.F. Evaluation of the Equity of Urban Park Green Space Based on Population Data Spatialization: A Case Study of a Central Area of Wuhan, China. Sensors 2019, 19, 2929.
  20. Guo, W.; Liu, J.K.; Zhao, X.S.; Hou, W.; Zhao, Y.X.; Li, Y.X.; Sun, W.B.; Fan, D.Q. Spatiotemporal dynamics of population density in China using nighttime light and geographic weighted regression method. Int. J. Digit. Earth 2023, 16, 2704–2723.
  21. Chen, M.; Xian, Y.; Huang, Y.; Zhang, X.; Hu, M.; Guo, S.; Chen, L.; Liang, L. Fine-scale population spatialization data of China in 2018 based on real location-based big data. Sci. Data 2022, 9, 624.
  22. Gao, P.; Wu, T.J.; Ge, Y.; Li, Z.H. Improving the accuracy of extant gridded population maps using multisource map fusion. GIScience Remote Sens. 2022, 59, 54–70.
  23. Zhao, S.; Liu, Y.X.; Zhang, R.; Fu, B.J. China’s population spatialization based on three machine learning models. J. Clean. Prod. 2020, 256, 120644.
  24. Chun, J.; Zhang, X.-C.; Huang, J.-F.; Zhang, P.-C. A Gridding Method of Redistributing Population Based on POIs. Geogr. Geo-Inf. Sci. 2018, 34, 83–89+124+122.
  25. Li, K.N.; Chen, Y.H.; Li, Y. The Random Forest-Based Method of Fine-Resolution Population Spatialization by Using the International Space Station Nighttime Photography and Social Sensing Data. Remote Sens. 2018, 10, 1650.
  26. Cui, X.; Zhang, J.; Wu, F.; Zhang, Q.; Wu, Y. Spatio-temporal Analysis of Population Dynamics based on Multi-source Data Integration for Beijing Municipal City. J. Geo-Inf. Sci. 2020, 22, 2199–2211.
  27. Mei, Y.; Gui, Z.; Wu, J.; Peng, D.; Li, R.; Wu, H.; Wei, Z. Population spatialization with pixel-level attribute grading by considering scale mismatch issue in regression modeling. Geo-Spat. Inf. Sci. 2022, 25, 365–382.
  28. Peipei, D.; Xiyong, H. Spatial Simulation of Population in China’s Coastal Zone based on Multi-source Data. J. Geo-Inf. Sci. 2020, 22, 207–217.
  29. Wang, X.; Ning, X.; Zhang, H.; Wang, H.; Hao, M. Population spatialization by integrating LJ1-01 nighttime light and WeChat positioning data: Taking Beijing city as an example. Sci. Surv. Mapp. 2022, 47, 173–183.
  30. Guo, W.; Zhang, J.; Zhao, X.; Li, Y.; Liu, J.; Sun, W.; Fan, D. Combining Luojia1-01 Nighttime Light and Points-of-Interest Data for Fine Mapping of Population Spatialization Based on the Zonal Classification Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1589–1600.
  31. Liu, Z. Research on Fine Population Spatialization Method Based on Multi-Source Geographic Data; Wuhan University: Wuhan, China, 2019; Available online: https://kns-cnki-net-s.webvpn.xju.edu.cn:8040/kcms2/article/abstract?v=FC2wxXHna7rhn0nl9d8IdtSskzdnzLE30RL0OFDmNKjUhWCyNMYWubAypu7MsyZCJXPkT3RaK4-XS5F9DI0CM_49anu0ivgowgTRD9MVcGEeekzGMC5B8136eIj0sZp37PhjpAXMQZo=&uniplatform=NZKPT&language=CHS (accessed on 12 December 2023).
  32. Zou, Y. Research on Population Spatialization Based on Multi-Source Data; China University of Mining and Technology: Beijing, China, 2020; Available online: https://link.cnki.net/doi/10.27623/d.cnki.gzkyu.2020.000269 (accessed on 12 December 2023).
  33. Bao, W.X.; Gong, A.D.; Zhao, Y.R.; Chen, S.Q.; Ba, W.R.; He, Y. High-Precision Population Spatialization in Metropolises Based on Ensemble Learning: A Case Study of Beijing, China. Remote Sens. 2022, 14, 3654. [Google Scholar] [CrossRef]
  34. Sinha, P.; Gaughan, A.E.; Stevens, F.R.; Nieves, J.J.; Sorichetta, A.; Tatem, A.J. Assessing the spatial sensitivity of a random forest model: Application in gridded population modeling. Comput. Environ. Urban Syst. 2019, 75, 132–145. [Google Scholar] [CrossRef]
  35. Wang, Y.; Huang, C.; Zhao, M.; Hou, J.; Zhang, Y.; Gu, J. Mapping the Population Density in Mainland China Using NPP/VIIRS and Points-of-Interest Data Based on a Random Forests Model. Remote Sens. 2020, 12, 3645. [Google Scholar] [CrossRef]
  36. Wang, M.; Wang, Y.; Li, B.; Cai, Z.; Kang, M. A Population Spatialization Model at the Building Scale Using Random Forest. Remote Sens. 2022, 14, 1811. [Google Scholar] [CrossRef]
  37. Jianjun, D.; Chunqiao, S.; Ruifan, L.; Guohua, Z. Spatial prediction of population based on random forest. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 17–19 June 2022; Volume 10, pp. 1360–1363. [Google Scholar] [CrossRef]
  38. Liu, L.; Cheng, G.; Yang, J.; Cheng, Y. Population spatialization in Zhengzhou city based on multi-source data and random forest model. Front. Earth Sci. 2023, 11, 1092664. [Google Scholar] [CrossRef]
  39. Chen, Y.; Wang, S.; Gu, Z.; Yang, F. Modeling the Spatial Distribution of Population Based on Random Forest and Parameter Optimization Methods: A Case Study of Sichuan, China. Appl. Sci. 2024, 14, 446. [Google Scholar] [CrossRef]
  40. Liu, Y.; Yang, J.; Liu, Q. Study on Spatial Pattern of Population Development in China: An Empirical Study Based on the Data of the Seventh National Census. Stat. Decis. 2024, 40, 78–82. [Google Scholar] [CrossRef]
  41. Wang, H.; Zhang, W.; Song, H.; Yuan, Z.; Zhou, G.; Li, Q.; Chen, Y.; Zhu, P. Spatial Evolution of Urban Population in Changsha and Its Simulation: Based on the Multi-Source Data. Econ. Geogr. 2023, 43, 49–61. [Google Scholar] [CrossRef]
  42. Zhang, H.; Zhang, S.M.; Liu, Z.D. Evolution and influencing factors of China’s rural population distribution patterns since 1990. PLoS ONE 2020, 15, e233637. [Google Scholar] [CrossRef]
  43. Liu, Y.; Zhang, X.; Xu, M.; Zhang, X.; Shan, B.; Wang, A. Spatial Patterns and Driving Factors of Rural Population Loss Under Urban–Rural Integration Development: A Micro-Scale Study on the Village Level in a Hilly Region. Land 2022, 11, 99. [Google Scholar] [CrossRef]
  44. Lao, X.; Gu, H.; Lu, L.; Wang, S.-T.; Wen, F.-H. The Changes of Spatial Pattern and Influence Factors of Interprovincial Migration between 2 National Census Periods. Popul. Dev. 2023, 29, 15–30. [Google Scholar]
  45. Ke, W.; Xiao, B.; Lin, L.; Zhu, Y.; Wang, Y. Interprovincial urban and rural floating population evolution of China and its relationship with regional economic development. Acta Geogr. Sin. 2023, 78, 2041–2057. [Google Scholar]
  46. Fang, C. Strategic thinking and spatial layout for the sustainable development of urban agglomeration in northern slope of Tianshan Mountains. Arid Land Geogr. 2019, 42, 1–11. [Google Scholar]
  47. Xu, J.-H.; Kasimu, A.; Xu, H.; Reheman, R.; Wei, B.-H. Identification of the Spatial Pattern and Analysis of Spatial and Temporal Changes in the Urban Agglomeration on the Northern Slope of the Tianshan Mountains. J. Northwest For. Univ. 2024, 39, 237–246. [Google Scholar]
  48. Zhiyong, Z. Python Machine Learning Algorithm; Publishing House of Electronics Industry: Beijing, China, 2017. [Google Scholar]
  49. Zhao, L.; Zhao, Z.; Wang, W. (Eds.) The Spatial Pattern of Economy in Coastal Area of China. Econ. Geogr. 2014, 34, 14–18+27. [Google Scholar]
  50. Wang, Q.; Liu, X.-Y.; Li, Y.-C. Spatial Structure, City Size and Innovation Performance of Chinese Cities. China Ind. Econ. 2021, 5, 114–132. [Google Scholar] [CrossRef]
  51. Li, Y.; Liu, X. How did urban polycentricity and dispersion affect economic productivity? A case study of 306 Chinese cities. Landsc. Urban Plan. 2018, 173, 51–59. [Google Scholar] [CrossRef]
  52. Dong, J.; Zhou, C.; Liang, W.; Lu, X. Determination Factors for the Spatial Distribution of Forest Cover: A Case Study of China’s Fujian Province. Forests 2022, 13, 2070. [Google Scholar] [CrossRef]
  53. Lu, D.; Wang, Y.; Yang, Q.; Su, K.; Zhang, H.; Li, Y. Modeling Spatiotemporal Population Changes by Integrating DMSP-OLS and NPP-VIIRS Nighttime Light Data in Chongqing, China. Remote Sens. 2021, 13, 284. [Google Scholar] [CrossRef]
  54. You, H.; Jin, C.; Sun, W. Spatiotemporal Evolution of Population in Northeast China During 2012–2017: A Nighttime Light Approach. Complexity 2020, 2020, 3646145. [Google Scholar] [CrossRef]
  55. Tao, Y.; Liu, W.; Chen, J.; Gao, J.; Li, R.; Ren, J.; Zhu, X. A Self-Supervised Learning Approach for Extracting China Physical Urban Boundaries Based on Multi-Source Data. Remote Sens. 2023, 15, 3189. [Google Scholar] [CrossRef]
  56. Thomson, D.R.; Rhoda, D.A.; Tatem, A.J.; Castro, M.C. Gridded population survey sampling: A systematic scoping review of the field and strategic research agenda. Int. J. Health Geogr. 2020, 19, 34. [Google Scholar] [CrossRef] [PubMed]
  57. Hierink, F.; Boo, G.; Macharia, P.M.; Ouma, P.O.; Timoner, P.; Levy, M.; Tschirhart, K.; Leyk, S.; Oliphant, N.; Tatem, A.J.; et al. Differences between gridded population data impact measures of geographic access to healthcare in sub-Saharan Africa. Commun. Med. 2022, 2, 117. [Google Scholar] [CrossRef] [PubMed]
  58. Wang, H.; Yu, X.; Luo, L.; Li, R. Urban–Rural Boundary Delineation Based on Population Spatialization: A Case Study of Guizhou Province, China. Sustainability 2024, 16, 1787. [Google Scholar] [CrossRef]
  59. Calka, B.; Nowak Da Costa, J.; Bielecka, E. Fine scale population density data and its application in risk assessment. Geomat. Nat. Hazards Risk 2017, 8, 1440–1455. [Google Scholar] [CrossRef]
Figure 1. Diagram of the study area.
Figure 2. Road network display map in 2015 (partial).
Figure 3. POI data classification system.
Figure 4. Heat map of indicator correlation.
Figure 5. Moran’s I of population density at study area in 2020.
Figure 6. Population distribution of the NSTS Urban Agglomeration in 2020 based on the SLM.
Figure 7. Histogram of population density grid statistics.
Figure 8. Population distribution of the NSTS Urban Agglomeration in 2020 based on the RFM.
Figure 9. Histogram of population density grid statistics.
Figure 10. Comparison of RFM partitioning accuracy.
Figure 11. Cold and hot spots and center of gravity distribution.
Figure 12. Identification of the centers in NSTS Urban Agglomeration.
Figure 13. Spatial pattern of population in the NSTS Urban Agglomeration.
Figure 14. MDI of each indicator.
Figure 15. Comparison with actual features.
Table 1. Name and source of data.
| Number | Name | Resolution | Sources |
| --- | --- | --- | --- |
| 1 | Land cover datasets (CLCD) | 30 m | Team of Prof. Jie Yang and Xin Huang, Wuhan University (https://zenodo.org/record/8176941, accessed on 5 December 2023) |
| 2 | Points of interest (POI) data for the Xinjiang region ¹ | N/A | Gao De Map (https://ditu.amap.com/, accessed on 20 December 2023) |
| 3 | Outline data of buildings in Xinjiang ¹ | N/A | Baidu’s online map (https://map.baidu.com/, accessed on 26 November 2023) |
| 4 | Global nighttime light dataset (NPP-VIIRS-like) | 500 m | The team of Prof. Yu Bailang from East China Normal University, Associate Researcher Chen Zuoqi from Fuzhou University, and Associate Professor Shi Kaifang from Southwest University (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YGIVCD, accessed on 5 July 2023) |
| 5 | Xinjiang road data (OSM) ¹ | N/A | OpenStreetMap (https://www.openstreetmap.org/, accessed on 25 November 2023) |
| 6 | Xinjiang Population Raster Dataset: LandScan | 1000 m | Developed by the U.S. Department of Energy’s Oak Ridge National Laboratory (https://landscan.ornl.gov/, accessed on 3 January 2024) |
| 7 | WorldPop | 1000 m | Created and maintained jointly by the Department of Geography and Environmental Sciences at the University of Southampton, the University of Louisville, the Center for International Earth Science Information Network (CIESIN) at Columbia University, and others (https://hub.worldpop.org/, accessed on 3 January 2024) |
| 8 | GHSL | 250 m | Published by the European Commission (https://ghsl.jrc.ec.europa.eu/, accessed on 6 January 2024) |
| 9 | GPW | approx. 1000 m | Released by the Center for International Earth Science Information Network (CIESIN), Columbia University, United States (https://sedac.ciesin.columbia.edu/, accessed on 3 January 2024) |
| 10 | Population Statistics Yearbook Data of Xinjiang Municipalities and Counties | N/A | Xinjiang Uygur Autonomous Region Statistical Yearbook and the Statistical Bulletin for Cities and Counties |
¹ Vector data.
Table 2. Indicator database establishment.
| Name | Description | Name | Description |
| --- | --- | --- | --- |
| N01 | Nighttime Lighting Index | P02 | Distance to travel service POI |
| B01 | Building area ratio | P03 | Distance from Science, Education and Culture Services POI |
| R01 | Distance from road | P04 | Distance to financial services POI |
| R02 | Road network area ratio | P05 | Distance to healthcare POI |
| U01 | Distance from farmland | P06 | Distance to Shopping Services POI |
| U02 | Distance to forest | P07 | Distance to residential service POI |
| U03 | Distance to grass | P08 | Distance to political services POI |
| U04 | Distance from water | P09 | Distance to Employment Services POI |
| U05 | Distance to snowfields | P10 | Distance to Convenience Services POI |
| U06 | Distance from bare ground | P11 | Distance to Food Service POI |
| U07 | Distance from impervious surface | P12 | Distance to Leisure and Entertainment Services POI |
| P01 | Distance to transportation services POI | P13 | Kernel Density of POI |
Table 3. Significance and covariance diagnostic results with population distribution.
| Name | Regression Coefficient | Standard Error | t | p |
| --- | --- | --- | --- | --- |
| B01 | 35,060.387 | 20,577.011 | 1.704 | 0.088 |
| N01 | 36,763.11 | 6744.1 | 5.451 | 0.000 ² |
| R01 | 520.412 | 230.504 | 2.258 | 0.024 ¹ |
| R02 | 60,174.107 | 12,641.641 | 4.76 | 0.000 ² |
| U01 | −156.973 | 85.346 | −1.839 | 0.066 |
| U02 | −612.757 | 165.096 | −3.712 | 0.000 ² |
| U03 | 659.51 | 126.504 | 5.213 | 0.000 ² |
| U04 | 20.591 | 142.509 | 0.144 | 0.885 |
| U05 | −73.96 | 169.388 | −0.437 | 0.662 |
| U06 | 690.802 | 175.422 | 3.938 | 0.000 ² |
| U07 | 430.998 | 102.147 | 4.219 | 0.000 ² |
| P01 | 65.488 | 209.397 | 0.313 | 0.754 |
| P02 | 656.333 | 176.535 | 3.718 | 0.000 ² |
| P03 | −230.746 | 171.433 | −1.346 | 0.178 |
| P04 | −880.354 | 109.058 | −8.072 | 0.000 ² |
| P05 | −367.234 | 149.638 | −2.454 | 0.014 ¹ |
| P06 | 189.724 | 152.453 | 1.244 | 0.213 |
| P07 | −626.009 | 152.614 | −4.102 | 0.000 ² |
| P08 | 931.317 | 231.433 | 4.024 | 0.000 ² |
| P09 | 204.867 | 113.5 | 1.805 | 0.071 |
| P10 | 0.823 | 86.007 | 0.01 | 0.992 |
| P11 | −116.589 | 107.328 | −1.086 | 0.277 |
| P12 | −648.513 | 119.457 | −5.429 | 0.000 ² |
| P13 | 190,141.435 | 13,326.573 | 14.268 | 0.000 ² |
R² = 0.635; Adjusted R² = 0.635; F(24, 31847) = 175.894, p = 0.000; D-W = 2.015.
¹ p < 0.05, ² p < 0.01.
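As a reading aid for Table 3: each t value is the regression coefficient divided by its standard error, and with roughly 31,847 residual degrees of freedom the two-sided p value is closely approximated by the standard normal distribution. The sketch below recomputes both quantities for three rows. The function name `t_and_p` and the use of a normal approximation are illustrative assumptions, not the authors' procedure.

```python
import math


def t_and_p(coefficient, standard_error):
    """Return (t, two-sided p) using a normal approximation to the t distribution."""
    t = coefficient / standard_error
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)), where Phi is the standard normal CDF.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


# Coefficient and standard error pairs copied from Table 3.
rows = {
    "B01": (35060.387, 20577.011),
    "R01": (520.412, 230.504),
    "P05": (-367.234, 149.638),
}

for name, (coef, se) in rows.items():
    t, p = t_and_p(coef, se)
    print(f"{name}: t = {t:.3f}, p = {p:.3f}")
```

For these rows the recomputed values should agree with the tabulated t and p to within rounding, which is a quick way to check that the flattened columns were read correctly.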
Table 4. Finalized system of indicators.
| Indicator Layer | Number | Description |
| --- | --- | --- |
| State of urban construction | N01 | Nighttime Lighting Index |
| | R01 | Distance from road |
| | R02 | Road network area ratio |
| State of natural cover | U02 | Distance to forest |
| | U03 | Distance to grass |
| | U06 | Distance from bare ground |
| | U07 | Distance from impervious surface |
| Socioeconomic state | P04 | Distance to financial services POIs |
| | P05 | Distance to healthcare POIs |
| | P07 | Distance to residential service POIs |
| | P12 | Distance to Leisure and Entertainment Services POIs |
| | P13 | Kernel Density of POIs |
Table 5. Regression coefficients for indicators of the SLM.
| Name | Regression Coefficient | Name | Regression Coefficient |
| --- | --- | --- | --- |
| W_POP | 0.890965 | U07 | 0.459132 |
| N01 | 713.867 | P04 | −6.03744 |
| R01 | −2.30669 | P05 | 4.92153 |
| R02 | −8.67391 | P07 | −0.826732 |
| U02 | −3.89721 | P12 | 0.138499 |
| U03 | 4.39785 | P13 | 2865.92 |
| U06 | 15.2164 | | |
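For orientation, the coefficients in Table 5 can be read against the standard form of a spatial lag model. The block below is a generic textbook formulation, not an equation taken from this article. The symbols $\rho$, $W$, $\beta_k$, and $\varepsilon$ are notation assumed here, and W_POP is read as the spatial autoregressive coefficient on the spatially lagged population.

```latex
\mathrm{POP} = \rho \, W \, \mathrm{POP} + \beta_0 + \sum_{k} \beta_k x_k + \varepsilon,
\qquad \rho = 0.890965 \ \ (\text{W\_POP in Table 5})
```

Under this reading, a value of $\rho$ close to 1 indicates strong positive spatial dependence: the population of a cell is closely tied to the population of its neighbours. The $x_k$ are the twelve indicators listed in the table, and the table does not list an intercept $\beta_0$.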
Table 6. Parameter settings of RFM.
| Name | Settings |
| --- | --- |
| Training set ratio | 70% |
| Test set ratio | 30% |
| Number of decision trees | 30 |
| Random seed | 42 |
| Maximum number of features | Auto |
| Maximum depth of decision tree | None |
| Sampling rule | With replacement |
| Dependent variable | POP |
| Explanatory variables | N01, R01, R02, U02, U03, U06, U07, P04, P05, P07, P12, P13 |
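The sampling-related settings in Table 6 (a 70/30 split, 30 trees, seed 42, and sampling with replacement) can be illustrated with a small standard-library sketch. It shows only how row indices would be partitioned and resampled, not the tree fitting itself. The function name `split_and_bootstrap`, the reading of the sampling rule as bootstrap sampling, and the assumption that each tree draws a sample as large as the training set are illustrative and not confirmed by the article.

```python
import random


def split_and_bootstrap(n_samples, train_ratio=0.7, n_trees=30, seed=42):
    """Split row indices into train/test sets and draw one bootstrap sample per tree."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_train = round(n_samples * train_ratio)
    train, test = indices[:n_train], indices[n_train:]

    # Sampling with replacement: a row may appear several times in one sample,
    # and some training rows are left out of each sample.
    bootstraps = [rng.choices(train, k=len(train)) for _ in range(n_trees)]
    return train, test, bootstraps


train, test, bootstraps = split_and_bootstrap(1000)
print(len(train), len(test), len(bootstraps))
```

Because each bootstrap sample repeats some rows and omits others, the trees see different data, which is what lets their averaged prediction vary less than any single tree.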
Table 7. Comparative test results of SLM accuracy.
| Precision Indicators | SLM | RFM | GHSL | WorldPop | LandScan | GPW |
| --- | --- | --- | --- | --- | --- | --- |
| MAE | 21,840.29 | 11,132.77 | 26,822.67 | 16,671.79 | 17,329.67 | 23,070.15 |
| RMSE | 51,180.14 | 23,114.50 | 55,251.70 | 48,138.11 | 48,244.11 | 53,458.65 |
| %RMSE | 1.89 | 0.85 | 2.04 | 1.78 | 1.78 | 1.97 |
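The three accuracy measures in Table 7 can be sketched as follows. MAE and RMSE follow their usual definitions. %RMSE is assumed here to be RMSE divided by the mean observed population, expressed as a ratio. That assumption matches the table, where RMSE divided by %RMSE gives roughly 27,000 in every column, but the article's exact definition may differ. The function name `accuracy_metrics` and the sample values are hypothetical.

```python
import math


def accuracy_metrics(observed, predicted):
    """Return (MAE, RMSE, %RMSE) for paired observed and predicted population counts."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and of equal length")

    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumption: %RMSE is RMSE relative to the mean observed value, as a ratio.
    pct_rmse = rmse / (sum(observed) / n)
    return mae, rmse, pct_rmse


# Toy values for illustration only; these are not data from the study.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
mae, rmse, pct_rmse = accuracy_metrics(observed, predicted)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, %RMSE = {pct_rmse:.4f}")
```

RMSE penalises large errors more heavily than MAE, which is why the RMSE row in the table is always well above the MAE row for the same dataset.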
Table 8. Extent, Area, and Population of Centers by Level.
| Rank | Extent | Area (km²) | Population |
| --- | --- | --- | --- |
| Level I Center | Urumqi, Changji and Wujiaqu | 1792.34 | 4,195,936 |
| Level III Center | Shihezi and Manas | 286.19 | 478,490 |
| Level III Center | Kuitun, Karamay (Dushanzi District) and Wusu | 331.03 | 312,628 |
| Level IV Center | Karamay | 218.79 | 180,014 |

Share and Cite

MDPI and ACS Style

Zhang, Y.; Wang, H.; Luo, K.; Wu, C.; Li, S. Study on Spatialization and Spatial Pattern of Population Based on Multi-Source Data—A Case Study of the Urban Agglomeration on the North Slope of Tianshan Mountain in Xinjiang, China. Sustainability 2024, 16, 4106. https://doi.org/10.3390/su16104106
