Mapping Essential Urban Land Use Categories in Nanjing by Integrating Multi-Source Big Data

: High-spatial-resolution (HSR) urban land use maps are very important for urban planning, tra ﬃ c management, and environmental monitoring. The rapid urbanization in China has led to dramatic urban land use changes, however, so far, there are no such HSR urban land use maps based on uniﬁed classiﬁcation frameworks. To ﬁll this gap, the mapping of 2018 essential urban land use categories in China (EULUC-China) was jointly accomplished by a group of universities and research institutes. However, the relatively lower classiﬁcation accuracy may not su ﬃ ciently meet the application demands for speciﬁc cities. Addressing these challenges, this study took Nanjing city as the case study to further improve the mapping practice of essential urban land use categories, by reﬁning the generation of urban parcels, resolving the problem of unbalanced distribution of point of interest (POI) data, integrating the spatial dependency of POI data, and evaluating the size of training samples on the classiﬁcation accuracy. The results revealed that (1) the POI features played the most important roles in classiﬁcation performance, especially in identifying administrative, medical, sport, and cultural land use categories, (2) compared with the EULUC-China, the overall accuracy for Level I and Level II in EULUC-Nanjing has increased by 11.1% and 5%, to 86.1% and 80% respectively, and (3) the classiﬁcation accuracy of Level I and Level II would be stable when the number of training samples was up to 350. The methods and ﬁndings in this study are expected to better inform the regional to continental mappings of urban land uses.

the road network represent comparatively homogeneous socioeconomic functions, which can be used as the basic unit of land use classification [11]. Nevertheless, due to the roads at different administrative levels having different widths [12], urban parcels should not be directly segmented by the road center line. Previous studies simply adopted a unified threshold to generate a road buffer for whole study areas [10], which resulted in under-or over-segmentation of urban parcels, and thus led to reducing the purity of land uses within parcels.
Given the fact that the nature of urban land use categories is caused by differences in human social and economic activities, it is difficult to distinguish urban land use types only relying on remote sensing imagery [13]. For example, the spectral features of commercial, business, and residential land uses are similar, and we cannot differentiate them only using remote sensing images [14]. The urban light density recorded by nighttime light (NTL) images is closely related to the intensity of human activities [15] and can provide a different perspective for extracting urban land use information. Besides, various kinds of open social big data, such as mobile phone positioning data [14], point of interest (POI) data [16], Tencent Mobile phone locating-request (MPL) data [17][18][19], and taxi trajectory data [20], also provide new opportunities for identifying different urban land uses, due to its intrinsic capability in capturing the spatiotemporal rhythm of the human activities. Among these available social big data, POI data that contain geographic location and attribute information have been most widely used [10,21]. However, most of the previous studies only used the frequency information of POIs without considering their inner spatial dependency [22]. Secondly, since POIs were generated by people's annotation of interested places, there was often an unbalanced distribution in space and in different land use categories. For example, the number of commercial POIs was significantly larger than other land use categories-the POIs density in the central business district was always higher than the other areas. The imbalance of the original POIs cannot reflect the actual distribution of urban land uses, which requires to be preprocessed and regenerated before being used appropriately [16].
Previous studies have demonstrated that the comprehensive utilization of remote sensing data and POI data [9], mobile phone positioning data [14], or other geographical data can effectively improve the urban land use classification accuracy [23]. However, due to the lack in the data availability and a uniform standard on the urban land use classification practices, the mapping of urban land use in China and other countries was still limited [12]. Based on the urban areas delineated from 30 m resolution impervious surface data [24] and a 10 m resolution global land cover map (FROM-GLC10) [25], Gong et al. [12] has generated preliminary mapping results of essential urban land use categories in China (EULUC-China) using both remote sensing images (Sentinel-2, Luojia-01) and social big data (OSM, MPL, POIs). The classification accuracy for 27 validation cities ranges from 40.4% to 82.9% for Level I, and from 34% to 80% for Level II. However, the relatively lower accuracy of EULUC-China may not sufficiently meet the growing demands for urban planning, environmental monitoring, and other aspects at local to regional scopes [26], which can be attributed to the following aspects. Firstly, the buffer thresholds of roads for generating urban parcels were simply divided into major and minor categories without details assigned. Secondly, there were no pre-processing steps to resolve the unbalanced distribution of POI data. Thirdly, the number of parcel samples was still not sufficient (there were 440,798 parcels in China, but for training samples there were only 1795 and for validation samples there were 869). Fourthly, there were potentials to further explore the features extracted from multi-source big data [12]. For example, the building height information [27] and the texture features [28] extracted from remote sensing images have proven to help distinguish different urban land use types.
To address the above issues, this study aims to take Nanjing as the case study, comprehensively utilize multi-source remote sensing images and social big data, refine the generation of urban parcels, resolve the problem of unbalanced distribution between POIs, explore the geographic data features, and analyze the impact of sample size on the classification accuracy in order to improve the mapping result of EULUC-Nanjing. The structure of this article is as follows: Section 1 describes the background of this research and reviews related work on urban land use classification. The data sources, methodology, and techniques are described in detail in Section 2. The experimental results are illustrated in Section 3. Section 4 provides a meaningful discussion of the results and analyzes the future research directions. Finally, this article is summarized in Section 5.

Study Area
Nanjing, the capital of Jiangsu province, is situated on the lower reaches of the Yangtze River in eastern China ( Figure 1). As an ancient city, Nanjing enjoys a worldwide reputation not only for its history and culture, but also for its economy and politics. In recent years, with the rapid urbanization, the urban areas of Nanjing have expanded dramatically. As of 2018, the built-up area of Nanjing is approximately 1624 km 2 and the urban population is about 8.44 million, with an urbanization rate of 82.5% [29]. In this study, all built-up areas were regarded as our research area (marked with the red polygon in Figure 1c). sources, methodology, and techniques are described in detail in Section 2. The experimental results are illustrated in Section 3. Section 4 provides a meaningful discussion of the results and analyzes the future research directions. Finally, this article is summarized in Section 5.

Study Area
Nanjing, the capital of Jiangsu province, is situated on the lower reaches of the Yangtze River in eastern China ( Figure 1). As an ancient city, Nanjing enjoys a worldwide reputation not only for its history and culture, but also for its economy and politics. In recent years, with the rapid urbanization, the urban areas of Nanjing have expanded dramatically. As of 2018, the built-up area of Nanjing is approximately 1624 km 2 and the urban population is about 8.44 million, with an urbanization rate of 82.5% [29]. In this study, all built-up areas were regarded as our research area (marked with the red polygon in Figure 1c).  [12,30] with Sentinel-2A composite imagery as the background.

Remotely Sensed Data
Sentinel-2A/B Composite Imagery All preprocessed Sentinel-2A/B images from 1 January to 31 December 2018 were first collected from the Copernicus Open Access Hub (https://scihub.copernicus.eu). Then, the normalized difference vegetation index (NDVI) of each pixel was calculated. The pixel-based maximum NDVI values were finally used as a quality index to merge the whole-year images as the greenest composite imagery [12].  [12,30] with Sentinel-2A composite imagery as the background.

Remotely Sensed Data
Sentinel-2A/B Composite Imagery All preprocessed Sentinel-2A/B images from 1 January to 31 December 2018 were first collected from the Copernicus Open Access Hub (https://scihub.copernicus.eu). Then, the normalized difference vegetation index (NDVI) of each pixel was calculated. The pixel-based maximum NDVI values were finally used as a quality index to merge the whole-year images as the greenest composite imagery [12].
Luojia-1 Nighttime Light Image LJ1-01, launched in June 2018, is the first remote sensing satellite focusing on nighttime light in China. It can obtain high-precision nighttime light imagery with a swath of 250 km and a spatial resolution of 130 m [15]. In this study, we used the Luojia-1 nighttime light images acquired from December 2018 as an indicator of human activities. The data can be downloaded freely at the High-Resolution Earth Observation System of the Hubei Data and Application Center (http: //59.175.109.173:8888/app/login.html).

Google Earth High-Resolution Image
Google Earth high-resolution image [31] of Nanjing obtained in May 2018 was downloaded from BIGEMAP (http://www.bigemap.com), with a resolution of 2 m, including three bands of blue, green, and red. In this study, we used it as an auxiliary data to help preprocess POIs.

Point of Interest
We purchased the POI data obtained in December 2018 from Gaode Map Services (https: //lbs.amap.com/), which is one of the most popular and largest web map service providers in China. The data have a total of 760,000 records with 19 categories in the top level and 400 categories in the second level.

Mobile Phone Locating-Request Big Data
Tencent mobile phone locating-request big data records the real-time location of Tencent active users such as QQ (an instant-messaging software, 800 million users in China), WeChat (a chatting software, 350 million users in China), Tencent games (an online game community, 200 million users in China), and others, which can reflect the spatial distribution of the population in the study area [18]. The hourly data for an entire week from 12 to 18 August 2019 (12-16 is the workday, 17-18 is weekend), with a spatial resolution of 25 m were collected from the Tencent Location Big Data platform (http://heat.qq.com). The data are in the format of TXT, including four fields: count, longitude, latitude, and time. The count field carries the heat information of the population.

Other Data
The Impervious Surface The impervious surface is defined as the ground surface, such as roof, asphalt, or cement road, which is the key index to characterize the degree of urbanization and evaluate the quality of the urban ecological environment. In this study, we used the 2018 impervious surface [12] extracted from Landsat images by the "exclusion and inclusion" framework [30] to represent the urban area.

OSM Road Network
OSM road network data collected in December 2018 were downloaded from OpenStreetMap (https://www.openstreetmap.org) [32]. The data are in vector format and contain a series of information, such as road names, road grades, etc.

Building Footprint Dataset
The building footprint data collected in 2018 were downloaded from BIGEMAP (http://www. bigemap.com), which are in vector format, including information about the area and number of floors for each building. The building polygons are mainly extracted from the Google Earth high-resolution image, and the building height is inverted from its shadow length.

FROM-GLC10
FROM-GLC10 [25] is the first 10 m resolution global land-cover map in the world, which can be freely downloaded from the website: http://data.ess.tsinghua.edu.cn/. In this research, we used FROM-GLC10 [25] to determine the land cover types in the non-built-up area in Nanjing.

Methods
The study included the following four procedures ( Figure 2). Firstly, urban parcels were generated by an overlay analysis of the road buffer, the water, and the impervious surface data. Secondly, multiple features were extracted from Sentinel-2 images, Luojia-1 nighttime light image, POI data, MPL data, and Nanjing building footprint data. Thirdly, training and validation samples were collected by both the visual interpretation and field investigation. Fourthly, the mapping of EULUC-Nanjing and assessment of classification accuracy were performed.

FROM-GLC10
FROM-GLC10 [25] is the first 10 m resolution global land-cover map in the world, which can be freely downloaded from the website: http://data.ess.tsinghua.edu.cn/. In this research, we used FROM-GLC10 [25] to determine the land cover types in the non-built-up area in Nanjing.

Methods
The study included the following four procedures ( Figure 2). Firstly, urban parcels were generated by an overlay analysis of the road buffer, the water, and the impervious surface data. Secondly, multiple features were extracted from Sentinel-2 images, Luojia-1 nighttime light image, POI data, MPL data, and Nanjing building footprint data. Thirdly, training and validation samples were collected by both the visual interpretation and field investigation. Fourthly, the mapping of EULUC-Nanjing and assessment of classification accuracy were performed.

Classification System
According to the EULUC classification system [12], urban land use types are divided into five classes in Level I (01 Residential, 02 Commercial, 03 Industrial, 04 Transportation, 05 Public management and service) and twelve classes in Level II (0101 Residential, 0201 Business office, 0202 Commercial service, 0301 Industrial, 0401 Road, 0402 Transportation station, 0403 Airport, 0501 Administrative, 0502 Educational, 0503 Medical, 0504 Sport and cultural, 0505 Park and greenspace). For the non-built-up area, there are four land cover types in Nanjing, including cropland, forest, grassland, and water, by referring to the FROM-GLC10 data [25].

Classification System
According to the EULUC classification system [12], urban land use types are divided into five classes in Level I (01 Residential, 02 Commercial, 03 Industrial, 04 Transportation, 05 Public management and service) and twelve classes in Level II (0101 Residential, 0201 Business office, 0202 Commercial service, 0301 Industrial, 0401 Road, 0402 Transportation station, 0403 Airport, 0501 Administrative, 0502 Educational, 0503 Medical, 0504 Sport and cultural, 0505 Park and greenspace). For the non-built-up area, there are four land cover types in Nanjing, including cropland, forest, grassland, and water, by referring to the FROM-GLC10 data [25].

Parcels Generation
In this study, according to the attributes of OSM data (Figure 3), the roads were divided into seven levels as the primary, secondary, tertiary, residential, motorway, trunk road, and railway. In order to set buffer thresholds for different road levels, the 604 samples were randomly selected from the above seven road levels and their width were measured using high-spatial-resolution imagery in the BIGEMAP software. Here, the size of samples from each road level were more than 30 percent of its total numbers. Then, the upper quartile of the roads' width statistics of each level was adopted as the buffer thresholds (Table 1). By overlaying the road buffer and the water layer from FROM-GLC10 [25] and the impervious surface data, the built-up area of Nanjing was segmented into 8209 land parcels ( Figure 4).

Parcels Generation
In this study, according to the attributes of OSM data (Figure 3), the roads were divided into seven levels as the primary, secondary, tertiary, residential, motorway, trunk road, and railway. In order to set buffer thresholds for different road levels, the 604 samples were randomly selected from the above seven road levels and their width were measured using high-spatial-resolution imagery in the BIGEMAP software. Here, the size of samples from each road level were more than 30 percent of its total numbers. Then, the upper quartile of the roads' width statistics of each level was adopted as the buffer thresholds (Table 1). By overlaying the road buffer and the water layer from FROM-GLC10 [25] and the impervious surface data, the built-up area of Nanjing was segmented into 8209 land parcels (Figure 4). .

Parcels Generation
In this study, according to the attributes of OSM data (Figure 3), the roads were divided into seven levels as the primary, secondary, tertiary, residential, motorway, trunk road, and railway. In order to set buffer thresholds for different road levels, the 604 samples were randomly selected from the above seven road levels and their width were measured using high-spatial-resolution imagery in the BIGEMAP software. Here, the size of samples from each road level were more than 30 percent of its total numbers. Then, the upper quartile of the roads' width statistics of each level was adopted as the buffer thresholds (Table 1). By overlaying the road buffer and the water layer from FROM-GLC10 [25] and the impervious surface data, the built-up area of Nanjing was segmented into 8209 land parcels (Figure 4).
.    By overlaying a total record of 760,000 POI data with impervious surface polygons, the 220,000 POI points within the built-up area were retained. According to the EULUC classification system [12], the POI data were reclassified into four classes at Level I and nine classes at Level II. It should be noted that the transportation category was excluded in this study [12]. Statistically, the commercial, residential, public, and industrial POIs accounted for 52%, 27%, 19.8%, and 1.2% of the total numbers, respectively. Additionally, the POIs density in the downtown business district is higher than that in the surrounding areas. In order to solve the unbalanced distribution problem, these POI data were regenerated by the following three steps: (1) the commercial POI points with a distance less than 10 m were deleted, (2) the industrial and residential POIs were added according to the method proposed by Zhang et al. [16]. Firstly, each building abstracted from the building footprint data was regarded as the basic unit. For each unit, the geometrical, spectral, and textural features were extracted from the Google Earth high-resolution image. Next, the buildings overlapped with the POIs were labeled by the corresponding POI's Level I category. The labeled buildings were employed to train a random forest (RF) model for recognizing other non-labeled buildings. Finally, the centers of the building polygons identified as the residential and industrial categories were converted to POI points of the corresponding classes. (3) For the public category, POI points were added on the buildings by visual interpretation due to the relatively lower classification accuracy via the above RF classification method.

Features Extraction
From Remote Sensing Data Eighteen spectral features and 26 texture features were extracted from Sentinel-2 composite imagery ( Table 2) for each urban parcel. Spectral features included the mean and standard deviations of blue, green, red, near-infrared, short-wave-infrared bands, NDVI, normalized difference built-up index (NDBI), and normalized difference water index (NDWI). Texture features included the entropy, contrast, correlation of blue, green, red, near-infrared bands, and the entropy of normalized difference vegetation index (NDVI). For the Luojia-1 nighttime light image, we calculated the mean of digital number (DN) values for each urban parcel and applied it as a feature to urban land use classification ( Table 2). The names of features are shown in Supplementary Table S1. Building footprint data (12) Total number of buildings, proportion of different grade buildings, average height of buildings in each parcel, floor area ratio, and height density index.

From POIs
There was a total of 19 frequency features and 10 spatial features extracted from the regenerated POIs (Table 2). The frequency features included the total number of all POIs, and the number and proportion of each POI class within each parcel. The spatial features were abstracted by the Google word2vec model, which is a deep-learning tool for transforming natural language words into high-dimensional spatial vectors [33]. In this study, we regarded the Nanjing built-up-area as a corpus, 8209 parcels as the documents, POI categories as the basic words, and the POIs spatial distribution within the parcels as the word sequences in the documents. Thus, the spatial distribution features of POIs can be quantified by the Word2vec model. The calculation of spatial features included three steps [20]: (1) the parcels and POIs were used to build a Parcel-POI corpus. In order to obtain an adequate number of words, we divided the generated POIs into 44 categories in Level III on the basis of Level II (Supplementary Table S2), (2) all POI category vectors were obtained using the continuous bag of words (CBOW) model in Word2Vec and the dimension value of 10 was set by the trial and error, and (3) the parcel vectors were calculated by averaging inside POI vectors with weightings. The 10-dimensional parcel vectors were the final spatial features we would use in the EULUC-Nanjing mapping. The results of the Parcel-POI corpus, POI category vectors, and the parcel vectors can be downloaded from https://pan.baidu.com/s/1jb9lOpqFcqNlVQZ4UrC8rA.

From the Building Footprint Data
Based on the number of floors, the buildings were divided into eight levels ( Figure 5). For each parcel, the total number, each level's proportion, and the average height and floor area ratio (FAR) of all buildings were calculated. Considering that the business office building is higher but occupies a small area, and the commercial service building is lower but covers a large area, this study proposed a height density index (total height of buildings in parcel/parcel area), which would differentiate the difference between the commercial and business land uses. Twelve features were finally extracted from Nanjing building footprint data ( Table 2).

From Mobile Phone Locating-Request Big Data
Given that people's daily activities generally change periodically on a weekly basis, and there is a certain difference in the population distribution between work and non-work days [19], the weekly Tencent MPL data were divided into two groups: working day and rest day. Kernel density analysis is commonly used in urban hot spot exploration [34]. Since the original data was a series of spatial points, the kernel density analysis was used to make the MPL data into population heat maps. Furthermore, to better reflect the law of population distribution and reduce the data error, the results of kernel density analysis at the same time on working days and rest days were averaged. Finally, the mean of the hourly population heat map value of working days and rest days was calculated from each parcel and 48 features were obtained ( Table 2).

Urban Land Use Classification
The samples were collected by both visual interpretation and field investigation from the Google Earth [31] high-resolution image. The selected samples should be typical and stable with a low mixing of land uses. However, there are few samples with high purity, such as the medical and administrative categories, which are usually mixed with other land uses. Therefore, we edited these samples manually to ensure that the purity of each sample was more than 90%. For example, as shown in Figure 6, the parcel contained two land use types (educational and medical) before editing (Figure 6a), while after editing, it was divided into two parcels with high-purity (Figure 6b). Finally, a total of 680 samples were collected, according to a ratio about 3:1 [35], 500 samples (from 26 to 98 in Level II) were used for training and the remaining 180 samples (from 15 to 25 in Level II) were used for validation. The samples can be downloaded from https://pan.baidu.com/s/1jb9lOpqFcqNlVQZ4UrC8rA.

From Mobile Phone Locating-Request Big Data
Given that people's daily activities generally change periodically on a weekly basis, and there is a certain difference in the population distribution between work and non-work days [19], the weekly Tencent MPL data were divided into two groups: working day and rest day. Kernel density analysis is commonly used in urban hot spot exploration [34]. Since the original data was a series of spatial points, the kernel density analysis was used to make the MPL data into population heat maps. Furthermore, to better reflect the law of population distribution and reduce the data error, the results of kernel density analysis at the same time on working days and rest days were averaged. Finally, the mean of the hourly population heat map value of working days and rest days was calculated from each parcel and 48 features were obtained ( Table 2).

Urban Land Use Classification
The samples were collected by both visual interpretation and field investigation from the Google Earth [31] high-resolution image. The selected samples should be typical and stable with a low mixing of land uses. However, there are few samples with high purity, such as the medical and administrative categories, which are usually mixed with other land uses. Therefore, we edited these samples manually to ensure that the purity of each sample was more than 90%. For example, as shown in Figure 6, the parcel contained two land use types (educational and medical) before editing (Figure 6a), while after editing, it was divided into two parcels with high-purity (Figure 6b). Finally, a total of 680 samples were collected, according to a ratio about 3:1 [  The random forest (RF) model is an ensemble classifier that consists of multiple decision trees and has been extensively used in urban land use classification [36]. It has a higher advantage in processing high-dimensional data compared with other algorithms [37]. In this study, in order to improve the performance of the RF model, the optimal features were selected from the total of 134 features by using the recursive feature elimination algorithm with the help of the Caret package [38] in R. The selected features and 500 training samples were employed to train a RF model for identifying the land use types of the remaining parcels. The parameters of 'ntree' and 'mtry' for the RF model were set to 500 and 3, respectively. Finally, the 180 independent validation samples were adopted to establish the confusion matrices for Level I and Level II categories, and the overall accuracy (OA) and kappa coefficient [39][40][41] were used to evaluate the classification result.
As for the built-up area, the mapping of Level I and Level II categories was directly used from the classification results. As for the non-built-up area, the FROM-GLC10 land cover map of Nanjing City was used [25]. Finally, the built-up and non-built-up areas were combined to produce a complete The random forest (RF) model is an ensemble classifier that consists of multiple decision trees and has been extensively used in urban land use classification [36]. It has a higher advantage in processing high-dimensional data compared with other algorithms [37]. In this study, in order to improve the performance of the RF model, the optimal features were selected from the total of 134 features by using the recursive feature elimination algorithm with the help of the Caret package [38] in R. The selected features and 500 training samples were employed to train a RF model for identifying the land use types of the remaining parcels. The parameters of 'ntree' and 'mtry' for the RF model were set to 500 and 3, respectively. Finally, the 180 independent validation samples were adopted to establish the confusion matrices for Level I and Level II categories, and the overall accuracy (OA) and kappa coefficient [39][40][41] were used to evaluate the classification result.
As for the built-up area, the mapping of Level I and Level II categories was directly used from the classification results. As for the non-built-up area, the FROM-GLC10 land cover map of Nanjing City was used [25]. Finally, the built-up and non-built-up areas were combined to produce a complete land use map.

The Results of POI Regeneration
Compared with the original POIs, the number of regenerated POIs substantially increased (Table 3), especially for industrial, residential, and public POI points. Furthermore, the skewness distribution of the original POIs (Figure 7a) has been rectified to a relatively uniform distribution after being regenerated (Figure 7b).

Feature Selection
The accuracy assessment based on OOB error indicated that the number of features, 68 in Level I and 61 in Level II, achieved the highest OA (Figure 8). The specific features selected are shown in Supplementary Tables S3 and S4. Table 4 shows that using the above 68 and 61 features, the classification OA (obtained through 180 independent validation samples) increased by 2.6% and 4% for Level I and Level II respectively, and the kappa coefficients both increased by 0.04.

Feature Selection
The accuracy assessment based on OOB error indicated that the number of features, 68 in Level I and 61 in Level II, achieved the highest OA (Figure 8). The specific features selected are shown in Supplementary Tables S3 and S4. Table 4 shows that using the above 68 and 61 features, the classification OA (obtained through 180 independent validation samples) increased by 2.6% and 4% for Level I and Level II respectively, and the kappa coefficients both increased by 0.04. Figure 7. A spatial distribution of (a) the original POI points and (b) the regenerated POI points.

Feature Selection
The accuracy assessment based on OOB error indicated that the number of features, 68 in Level I and 61 in Level II, achieved the highest OA (Figure 8). The specific features selected are shown in Supplementary Tables S3 and S4. Table 4 shows that using the above 68 and 61 features, the classification OA (obtained through 180 independent validation samples) increased by 2.6% and 4% for Level I and Level II respectively, and the kappa coefficients both increased by 0.04.

Performance of EULUC-Nanjing
The confusion matrices of Level I (Table 5) and Level II (Table 6) showed that the OA and kappa coefficient were 86.1% and 0.78 respectively, in Level I, and 80% and 0.77 respectively, in Level II. The residential, industrial, and public land use in Level I achieved a higher accuracy with both producer's accuracy (PA) and user's accuracy (UA) of more than 80%, while the commercial land use had a relatively lower user's accuracy of 68%. As for the Level II category, the industrial land use had the highest accuracy with both producer's accuracy (PA) and user's accuracy (UA) of 92%, while the business land use had the lowest accuracy with UA of 60% and PA of 64%. The feature importance plots indicated that seven POI spatial features (POIspa_2, POIspa_8, POIspa_6, POIspa_1, POIspa_9, POIspa_7, and POIspa_5), three POI frequency features (POIp201, POIp101, and POI101), three texture features (b2cormean, b2entstd, and b3entstd), and two spectral features (ndvimean, b8mean) were identified as the top 15 important features (Figure 9a) in the Level I class. In the Level II class, six POI spatial features (POIspa_8, POIspa_1, POIspa_2, POIspa_7, POIspa_6, and POIspa_9), four POI frequency features (POIp502, POIp501, POI503, and POIp201), three texture features (b2entstd, b2cormean, and b3entmean), and two spectral features (ndvimean, b4mean) were selected in the top 15 (Figure 9b). Compared with spectral and texture features extracted from Sentinel-2 imagery, POI features, especially for the POI spatial features, were more important in the EULUC-Nanjing mapping.

Contribution of Different Features
In order to investigate the role of different features in urban land use classification, the features from five data sources in Table 2 were removed in turn and the features from the remaining four data sources were used as input variables in the RF model. From the changes of the OA, the influence degree of the features from different data sources on Level II categories can be found. The result (Table 7) showed that POI data had a great impact on the business, commercial, educational, administrative, medical, and sports culture categories. In particular, the influences on the latter three land uses were the most obvious. The features derived from Sentinel-2 imagery had more influences on residential, industrial, and greenspace land. The height features had an influence on the business and commercial land. In addition, the mean DN values from Luojia-1 data were also helpful to distinguish commercial and business land from other land uses. This may be because the night light intensity in the commercial districts was relatively higher than that in other places. In this study, it can be found that the POI features had the greatest contributions in the EULUC-Nanjing mapping (Table 7 and Figure 9). However, Tu et al. [26] pointed out that compared with POI data, the features obtained from Sentinel-2 imagery were the main factors affecting the classification performance. This may be because this study used the regenerated POI data by overcoming the problem of unbalanced distribution and extracted the POI spatial features. Therefore, the POI spatial features derived from the regenerated POIs are expected to better inform the regional to continental mappings of urban land uses.

The Impact of the Sample Sizes
In order to explore whether 500 samples can meet the classification requirement, the number of training samples was gradually increased by 10% each time and the variation of classification accuracy was observed. Figure 11 shows that when the number of samples reaches 70% (350), the accuracy for both the Level I and Level II categories tends to be stable, which is consistent with the stable classification concept proposed by Gong et al. [25]. Therefore, 500 training samples used in this study have met the number of samples required for the maximum classification accuracy.
Since there are some small water bodies within the impervious surface polygons, the water layer from FROM-GLC10 was used as a mask to remove these water areas. However, there were still water bodies within segmented urban land parcels. This is due to the limited spatial resolution of Landsat images being used to extract impervious surface, and Sentinel-2 images being applied for FROM-GLC10 generation. The water bodies of less than 100 m 2 or rivers with a width of less than 10 m cannot be removed. Therefore, high-spatial-resolution images will be considered to improve the parcels' purity in the future work. In addition, although we have tried to use high-purity samples to train the RF model, actually, the land use parcels are often mixed [42], and even a single building contains multiple functions. Liu et al. [43] have found that 18.82% of the buildings in Tianhe district of Guangzhou had mixed functions; for example, some buildings had residential and commercial functions, while others had residential, leisure, and official functions. The existence of multi-functional buildings makes it hard to determine a single land use type in one urban land parcel. The results of EULUC-China showed that the Level II classification OA in Beijing, Hong Kong, and Shenzhen city were only about 43%, 44%, and 51%, respectively. The low accuracy may be due to the highly mixed land use functions within one building in these cities. Although the OAs of EULUC-Nanjing have been greatly improved (from 75.0% and 75.0% to 86.1% and 80% in Level I and Level II categories, respectively), there was still considerable confusion between the business and the commercial land use (Table 6). By overlapping the object parcels segmented from the high-resolution images with the POI data, Zhang et al. [16] found that the feature of the object categories contributed the most to the extraction of urban functional zones. This enlightens us to use the object parcels obtained from high-resolution images as the basic analysis unit in the future to reduce the possibility of mixed land use. Since there are some small water bodies within the impervious surface polygons, the water layer from FROM-GLC10 was used as a mask to remove these water areas. However, there were still water bodies within segmented urban land parcels. This is due to the limited spatial resolution of Landsat images being used to extract impervious surface, and Sentinel-2 images being applied for FROM-GLC10 generation. The water bodies of less than 100 m 2 or rivers with a width of less than 10 m cannot be removed. Therefore, high-spatial-resolution images will be considered to improve the parcels' purity in the future work. In addition, although we have tried to use high-purity samples to train the RF model, actually, the land use parcels are often mixed [42], and even a single building contains multiple functions. Liu et al. [43] have found that 18.82% of the buildings in Tianhe district of Guangzhou had mixed functions; for example, some buildings had residential and commercial functions, while others had residential, leisure, and official functions. The existence of multifunctional buildings makes it hard to determine a single land use type in one urban land parcel. The results of EULUC-China showed that the Level II classification OA in Beijing, Hong Kong, and Shenzhen city were only about 43%, 44%, and 51%, respectively. The low accuracy may be due to the highly mixed land use functions within one building in these cities. Although the OAs of EULUC-Nanjing have been greatly improved (from 75.0% and 75.0% to 86.1% and 80% in Level I and Level II categories, respectively), there was still considerable confusion between the business and the commercial land use (Table 6). By overlapping the object parcels segmented from the high-resolution images with the POI data, Zhang et al. [16] found that the feature of the object categories contributed the most to the extraction of urban functional zones. This enlightens us to use the object parcels obtained from high-resolution images as the basic analysis unit in the future to reduce the possibility of mixed land use.
In this study, it can be found that the POI spatial features made the greatest contributions to the In this study, it can be found that the POI spatial features made the greatest contributions to the EULUC-Nanjing (Figure 9). Due to the large area of the urban parcels, there are different types of POI data in a parcel. The POI points have the coordinates but no geometric attribute information, such as the area, so it is difficult to determine the weight of the different POI types. At present, this study only adopted the same weight for different POI types in a parcel to calculate the POI vectors. If the segmented objects can be obtained, it could apply the POI category to mark the attributes of these objects and use the object area as the weight to construct a more accurate parcel vector.
Compared with previous studies [12], this study refined the spatial and temporal resolution of MPL data; however, the results showed that the contribution of this kind of data in classification was the lowest (Table 6). Chen et al. [44] used the MPL data and a dynamic time warping (DTW) distance-based k-medoids method to aggregate those buildings with similar social activities into functional areas with a clustering accuracy rate of up to 85%. Therefore, the dynamic change of MPL data in a parcel can be used as a feature in the regional or global urban land use classification. Furthermore, with the increasing of mobile phone users, mobile phone positioning data also can be used to obtain a large amount of the dynamic trajectory data. Limited to the density of the base station, mobile phone positioning data has a lower resolution of 300-500 m, while the MPL data has a relatively higher spatial resolution of 25 m. The number of the Tencent users is relatively smaller but the number of mobile phone users is larger. Therefore, fusing the MPL data and mobile phone positioning data would be feasible and effective in the national or global urban land use mapping.

Conclusions
Based on the concept of EULUC-China, this study proposed a more consolidated framework of urban land use mapping at a regional scale. We refined the generation of the urban parcels, resolved the problem of unbalanced distribution of POIs, extracted the POI spatial features, and mapped the EULUC-Nanjing. The results showed that: (1) compared with EULUC-China, the classification OA in Level I and Level II were improved by 11.1% and 5% respectively, (2) the spatial features from POI data were identified to be the most important variables, and (3) the classification accuracy tended to be stable when the number of training samples reached 350. This study recommended that the POI data should be preprocessed before they were used and the spatial features from POIs cannot be overlooked in the national or global urban land use mapping, especially in cities with a lot of tall buildings, and there are multiple land use functions in a single building, such as Hong Kong, Shen Zhen, etc. Also, the urban land use information provided in this study can be applied to help urban planners monitor urban land use changes, analyze urban structures [45], and make scientific and reasonable planning for the existing urban land resources, so as to promote the healthy development of the city. In addition, this study also had some limitations. For example, the extraction of urban parcels needed to be further refined, and the problem of mixed land use was not considered. Different types of POIs used the same weights when constructing parcel vectors, and the land use information of MPL data was not deeply mined. In the future work, for a multi-scale analysis unit, such as a single building, the objects or the parcels segmented from the high-spatial-resolution images will be considered in the urban land use. Attempts will be made to address the weighting problem of different types of POIs. Meanwhile, the MPL data and mobile phone positioning data can be fused to derive dynamic trajectory data with high spatial and temporal resolutions to better uncover land use functions in other cities.