Sampling Strategy for Detailed Urban Land Use Classification: A Systematic Analysis in Shenzhen

Abstract: A heavy workload is required for sample collection for urban land use classification, and researchers are in urgent need of sampling strategies as a guide to achieve more effective work. In this paper, we make use of an urban land use survey to obtain a complete sample set of a city, test the impact of different training and validation sample sizes on the accuracy, and summarize the sampling strategy. The following conclusions are drawn based on our systematic analysis in Shenzhen. (1) For the best classification accuracy, the number of training samples should be no less than 40% of the total number of parcels or no less than 5500 parcels. For the best labor cost performance, the number should be no less than 7% or no less than 900. (2) For a stable and reliable accuracy evaluation, the number of validation samples should be no less than 10% of the total or no less than 1200. (3) Samples with a purity of 60–90% are preferred; for the same number of samples, they classify more effectively than samples with a purity greater than 90%. (4) If spatial equilibrium sampling cannot be carried out, sampling areas with complex land use patterns should be preferred.


Introduction
Urbanization has greatly changed our living environments, and more than half of the global population resides in urban areas [1]. China has undergone the fastest urbanization worldwide over the past three decades, and its artificial impervious area ranked first in 2015 [2]. For better urban planning, spatial governance, and sustainable development of urbanized areas in China, more up-to-date, detailed, and accurate land use classification is critically important.
Thus far, detailed urban land use classification in China has been performed only through field surveys [3,4]. Currently, only a few major cities, such as Shenzhen, Wuhan, and Chongqing, have detailed urban land use classifications at the entire city level [3][5][6][7]. This is an important task for the Third Terrestrial Survey of China [8].
Field surveys are time consuming and laborious, and researchers have long been committed to improving the efficiency of land use classification through remote sensing technology [9][10][11][12][13][14][15][16][17][18][19][20]. Gong and his colleagues were among the earliest researchers to use spatial-context information in addition to spectral data from satellite images to map urban land use categories, and their algorithms have been adopted in mapping global settlement areas [21]. However, because of the limitation of physical property measurements, the above-mentioned methods involving only spectral, texture, and structural features face challenges in effectively differentiating among residential, industrial, commercial, and service types of land uses.
In 2000, Zhang et al. proposed conducting urban land use classification by integrating GIS and remote sensing data [22]. In 2007, Goodchild noted that volunteered geographic information (VGI) can be used as a new data source for urban land use classification [23]. Information from OpenStreetMap (OSM), point of interest (POI), and social data, such as traffic trace data of individuals, taxis, and public transportation, can all be applicable to urban land use mapping [24][25][26][27][28][29][30]. VGI can be used as an important supplement to remotely sensed data in the detailed mapping of urban land use [31] and has since become a new focus area of research [32][33][34][35][36][37][38][39][40]. The most influential work was the mapping of essential urban land use categories (EULUC) in all cities in China by 70 researchers from more than 30 organizations [40].
Because it is impossible to determine the classification results simply through visual interpretation of images, the difficulty and workload of sample collection are increasing exponentially, representing a difficult challenge for most researchers. Researchers are in urgent need of sampling strategies as a guide to achieve more effective classification with relatively low labor costs. In the field of traditional land use/land cover, scholars have accumulated a large number of samples over a long time and quantitatively analyzed the impact of the sample number and other conditions on classification accuracy [41][42][43][44][45]. However, detailed urban land use classification is a new research focus; most studies use a limited number of sample units to test experimental classification methods, and no research results regarding the optimal sampling strategies have been reported [31,34,36,40][46][47][48].
In this study, we take advantage of the availability of an urban land use map of Shenzhen city that has been generated through a field survey of the entire city. By converting the map into a parcel-based land use map, we obtain a complete sample set for experiments with various sample sizes. Based on this map, we evaluated the impact of the sample size and land use mix of samples on the resulting classification accuracy.

Study Area and Data
Shenzhen is the most rapidly developing city in China. In 1979, Shenzhen was essentially a rural county bordering Hong Kong (Figure 1). By 2019, Shenzhen had more than 13 million permanent residents, and its per capita gross domestic product (GDP) ranked first in China [49][50][51]. Due to the high diversity and high precision of urban land use, complex land use types exist, such as villages surrounded by city blocks, golf courses, and large entertainment facilities. The high level of complexity and high land use intensity in Shenzhen provide a good opportunity for detailed urban land use classification experiments.

Figure 2 shows a flowchart outlining the methodology used in this study, including the following four major procedures: first, parcel segmentation with road networks, water, and impervious layers; second, collection of training and validation samples; third, multisource feature extraction; and fourth, classification and mapping. All datasets used in this study are summarized in Table 1.

Detailed Urban Land Use Classification System
In 2007, China issued the first formal land use classification standard, which was revised in 2017. This standard includes residential land, commercial and service land, industrial and mining storage land, public administration and public service land, and transportation land [52]. The city of Shenzhen developed a local classification system to supplement the national system [53]. In this study, based on the national and Shenzhen classification schemes, we develop the Shenzhen Urban Land Use Classification system (SULUC), which includes 5 Level I classes and 18 Level II classes (Table 2). The SULUC is largely consistent with the standard used in the Third Terrestrial Survey of China, and some Level II classes are even more detailed.

Parcel Segmentation
We used the road network from the 2018 special road survey to divide the Shenzhen area into land parcels using the following major procedures: first, a road buffer was generated using the road centerline and width; second, the road buffer zone was used to divide the Shenzhen area into land parcels; third, water surface data from the National Geographical Condition Survey were used to exclude parcels of water; fourth, parcels within the built-up area were extracted to exclude farmland, forestland, bare land, and other categories that do not belong to SULUC. Approximately 200 parcels in the built-up area still did not belong to SULUC; they accounted for approximately 2% of the total number of parcels and had little impact on the overall accuracy of classification. The segmentation produced 12,965 land parcels (Figure 3).

The average size of a parcel was approximately 6 ha, which was approximately four times greater than the land parcel size in the field survey. More than 100 parcels were superlarge land parcels exceeding 50 ha. These superlarge parcels included villages in cities and large tracts of factories with no obvious roads (Figure 4). These areas were located in the less developed part of the city, and the road network-based partition method, used here as a quick land partition strategy, should be improved in the future.


Feature Extraction
We used the following five types of features in the parcel-level land use classification based on Sentinel-2A/B images, Tencent mobile-phone locating-request (MPL) data, Luojia-1 nighttime light images, Gaode POI data [40,54], and building surveys.

We used the co-constellation Sentinel-2A/B images from January 1 to December 31, 2018, from the Copernicus Open Access Hub to extract the multispectral features. We first calculated the normalized difference vegetation index (NDVI) of each pixel. We further used the pixel-based maximum NDVI values as a quality index to merge the whole-year images. Then, we calculated the mean and standard deviations of the blue, green, red, and near-infrared bands, NDVI, and normalized difference water index (NDWI) in each urban parcel.
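The multispectral steps described above (per-pixel NDVI, maximum-NDVI compositing over the year, and parcel-level statistics) can be sketched as follows. This is a minimal illustration rather than the processing chain used in the study: the data layout (lists of per-pixel observations held as dictionaries), the function names, the green/near-infrared form of NDWI, and the use of the population standard deviation are all assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel observation."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def ndwi(green, nir):
    """NDWI in the green/NIR form (assumed; the text does not name the variant)."""
    total = green + nir
    return (green - nir) / total if total else 0.0


def max_ndvi_composite(observations):
    """For one pixel, keep the dated observation with the highest NDVI."""
    return max(observations, key=lambda o: ndvi(o["red"], o["nir"]))


def parcel_features(pixel_series):
    """Mean and standard deviation of each layer over the pixels of one parcel.

    pixel_series: one list per pixel, each holding that pixel's dated
    observations as dicts with keys "blue", "green", "red", "nir".
    """
    composite = [max_ndvi_composite(obs) for obs in pixel_series]
    layers = {band: [p[band] for p in composite]
              for band in ("blue", "green", "red", "nir")}
    layers["ndvi"] = [ndvi(p["red"], p["nir"]) for p in composite]
    layers["ndwi"] = [ndwi(p["green"], p["nir"]) for p in composite]
    # Population standard deviation is an assumption of this sketch.
    return {name: (mean(values), pstdev(values))
            for name, values in layers.items()}
```

Each parcel then contributes twelve multispectral features (a mean and a standard deviation for six layers), which matches the list of bands and indices given in the text.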

Human Activity Features from Tencent MPL Data
We used the MPL dataset from November 1 to November 30, 2018, from Tencent, Inc. to track the dynamics of the population distribution. MPL records are produced by retrieving the real-time locations of active mobile-phone users as they use Tencent's location-based services (LBS). We aggregated the 5 min MPL records per 8 h on weekdays and weekends, which represented the geographic pattern of the human distribution during three temporal periods (

Nighttime Light Features from Luojia-1 Nighttime Light Imagery
We used Luojia-1 nighttime light images acquired from June to December 2018, and the spatial resolution of these images was 130 m. For each urban parcel, we calculated the mean value of the digital number.

POI Features from Gaode
We used POI data from Gaode, Inc. in 2018. Each POI record consists of the name, location coordinates, and POI type, such as catering, retailing, automobile, accommodation, recreation, public facility, transportation, culture and media, and so forth. For each urban parcel, we calculated the total number of all POI and the total number and proportion of each type of POI within that parcel.
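The POI features described above reduce to simple counting once each POI has been assigned to a parcel. The following is a minimal sketch under that assumption; the function name, the feature names, and the input layout (a list of POI type strings for one parcel) are hypothetical, and the spatial join that assigns POI to parcels is not shown.

```python
from collections import Counter


def poi_features(poi_types, all_types):
    """Total POI count plus per-type count and proportion for one parcel.

    poi_types: type labels of the POI that fall inside the parcel.
    all_types: the full list of POI types, so every parcel gets the same
               feature columns even when a type is absent.
    """
    counts = Counter(poi_types)
    total = len(poi_types)
    features = {"poi_total": total}
    for poi_type in all_types:
        n = counts.get(poi_type, 0)
        features[f"{poi_type}_count"] = n
        # Parcels without any POI get a proportion of zero for every type.
        features[f"{poi_type}_share"] = n / total if total else 0.0
    return features
```

Passing the full type list keeps the feature vector length constant across parcels, which a classifier trained on tabular features requires.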

Building Features from Survey Data
We used building survey data consisting of the base area, stories, and average story height of each building in Shenzhen. We further aggregated these data into parcel levels to calculate the number of stories, the sum of the building height, and the average building height.
The specific features are summarized in Table 3.

Table 3. Summary of the features used in the parcel-level mapping of SULUC.

Training and Validation Samples
Since the land survey data covering the entire city of Shenzhen are accessible, we possessed an accurate reference dataset for training and validating the sample collection. Quality assurance of the field survey data was determined following a procedure of in situ photographing and by interviewing the land managers to record the condition of the land use operation. The data were sample-verified and quality-checked by a series of indoor processes to ensure that the results were consistent with field survey standards. Therefore, the field surveyed land use served as a reliable source of reference in this study.
Because parcels resulting from field surveys differ from parcels resulting from segmentation, within each land parcel, we obtained the statistics of the areal proportion of different land use types through a spatial intersect operation with the GIS software system. The land use category with the largest proportion was assigned to the land parcel (Table 4). A sliver polygon removal operation was applied to polygons less than 1000 m² in area.
Through the above operation, we obtained a complete coverage reference land use dataset with proportional records of different land use types. An advantage of this dataset is that all land parcels can be used for training or validation. Therefore, we refer to this reference sample set as complete samples, and the number of parcels in each category is shown in Figure 5. Under the complete samples, the accuracy of the sample is equivalent to that of the field survey.


The complete samples can reflect the land use mixing status. We used purity to quantify the land mixed-use level of the parcel. The higher the purity of the parcel, the lower the mixing level of land use. In the complete sample, we started with 100% purity and divided it into 10 groups according to each 10% decrease and combined 0-40% into one group. The number of each group is shown in Figure 5.
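The labeling and purity grouping of a parcel can be illustrated with a short sketch. It assumes that purity is the areal share of the dominant land use type in the parcel, which is consistent with the largest-proportion labeling rule but is not stated explicitly in the text; the function names, the group labels, and the handling of values that fall exactly on a group boundary are likewise assumptions.

```python
def dominant_class_and_purity(area_by_class):
    """Return the land use type with the largest area and its areal share.

    area_by_class: intersected area (for example in m²) of each land use
    type inside one parcel.
    """
    total = sum(area_by_class.values())
    if total <= 0:
        raise ValueError("parcel has no intersected area")
    label = max(area_by_class, key=area_by_class.get)
    return label, area_by_class[label] / total


def purity_group(purity):
    """Map a purity in [0, 1] to a 10% group, merging everything below 40%."""
    pct = purity * 100
    if pct < 40:
        return "0-40%"
    # A purity of exactly 100% is placed in the top group.
    lower = min(int(pct // 10) * 10, 90)
    return f"{lower}-{lower + 10}%"
```

For example, a parcel with 7000 m² of industrial land, 2000 m² of urban residential land, and 1000 m² of storage land would be labeled industrial with a purity of 0.7 under these assumptions.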

Classifier
Since 1996, machine learning has been widely used in the field of remote sensing classification. Many scholars have found that machine learning can obtain results with a higher precision than traditional parametric classifiers in processing complex data with a high-dimensional feature space [47,[55][56][57][58]. In particular, random forest (RF) is widely used by scholars. RF is a machine-learning algorithm consisting of a large ensemble of regression trees that has shown great efficiency and robustness in both computational cost and model performance [46][47][48]. We used RF with the training parcels and their extracted features to produce a parcel-level map of urban land use in Shenzhen.

The Impact of the Sample Size
We set up two experiments. The first experiment tested the influence of different training sample sizes on accuracy. From the complete sample, 30% of the parcels were drawn by stratified random sampling as validation samples, and the remaining parcels were used as training samples. The number of training samples was then decreased by 1% each time, and the random sampling was repeated k times at each step. The second experiment tested the influence of different validation sample sizes on the accuracy evaluation. From the complete sample, 35% of the parcels were drawn by stratified random sampling as training samples, and the remaining parcels were used as validation samples. The number of validation samples was then decreased by 1% each time, and the random sampling was repeated k times at each step. For k = 5, the accuracy of each classification and the average accuracy are shown in Figure 6.

We define stable accuracy as a classification accuracy with reduced samples that differs by no more than 1% from that obtained with all samples. Experiment One shows that the relationship between the number of samples and accuracy follows the rule of stable classification with limited samples (Gong, Liu, et al., 2019). The classification accuracy remained stable until the number of training samples was reduced to 61% of all training samples (5540, accounting for 40% of all urban parcels). When the number was reduced to 10% (908, approximately 7% of all urban parcels), the classification accuracy began to decline significantly.

Experiment Two shows that as the number of validation samples decreases, the range of the accuracy evaluation results increases. Taking the average accuracy as the measure, the accuracy evaluation results were no longer stable once the number of validation samples was reduced to 14% of all validation samples (1178, approximately 9% of all urban parcels).

In summary, to obtain stable and reliable classification results, the training samples should comprise at least 40% of the total number of parcels or no fewer than 5500 parcels. The validation samples should comprise at least 10% of the total number of parcels or no fewer than 1200 parcels. If labor is limited, the cost-effective scheme requires the training samples to be at least 7% of all parcels or no fewer than 900 parcels. In this situation, the maximum accuracy loss was not greater than 7%.
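The sampling protocol of the first experiment (a fixed stratified validation hold-out, followed by stratified training subsets that shrink in 1% steps with k repeated draws per step) can be sketched as follows. This is an illustration of the sampling schedule only: the function names, the rounding rule, the minimum of one sample per class, and the fixed seed are assumptions, and the random forest training and scoring that would consume each draw are deliberately left out.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, rng):
    """Draw the given fraction of indices from every class (at least one each)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        n = min(len(members), max(1, round(len(members) * fraction)))
        chosen.extend(rng.sample(members, n))
    return sorted(chosen)


def reduction_schedule(labels, holdout_fraction=0.30, step=0.01,
                       repeats=5, seed=0):
    """Yield (fraction, repeat, training indices, validation indices).

    The validation hold-out is drawn once; the training pool is then
    resampled at 100%, 99%, ..., 1% of its size, `repeats` times per step.
    """
    rng = random.Random(seed)
    holdout = set(stratified_sample(labels, holdout_fraction, rng))
    pool = [i for i in range(len(labels)) if i not in holdout]
    pool_labels = [labels[i] for i in pool]
    validation = sorted(holdout)
    for k in range(round(1 / step), 0, -1):
        fraction = k * step
        for repeat in range(repeats):
            picked = stratified_sample(pool_labels, fraction, rng)
            yield fraction, repeat, [pool[i] for i in picked], validation
```

The second experiment follows the same pattern with the roles swapped: a fixed 35% stratified training set and a validation pool that is reduced step by step.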

Impact of the Sample Purity
In this experiment, the influence of the sample purity on the classification accuracy was tested. Currently, in most research concerning urban land use classification, the level of mixed land use is not high, and the training samples always have high purity [31,39,40]. The mixed-use level of land in Shenzhen is high, and there are many low-purity parcels. Therefore, it is necessary to study whether it is reasonable to select high-purity samples as training samples (Figure 7).

Remote Sens. 2020, 12, x FOR PEER REVIEW 10 of 20

We selected seven categories of 11,034 parcels for the test. The specific categories included urban residential, urban village, business and finance, storage, other commercial, industrial, instructional and research, and parks and green space.
Among them, 30% of the parcels were drawn by stratified random sampling as validation samples, and the remaining parcels formed the mixed-purity [0,100%] sample set. Then, we divided the mixed-purity set into high purity (≥90%), medium purity (60-90%), and low purity (≤60%). Finally, we randomly selected the same number of training samples from each of the four sets (mixed, high, medium, and low purity), and the results are shown in Figure 8.
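The construction of the four equally sized training sets can be sketched as below. The tier boundaries in the text overlap at exactly 60% and 90%, so this sketch assumes that 90% counts as high purity and 60% counts as low purity; it also uses simple rather than stratified random draws. The function names and the fixed seed are hypothetical.

```python
import random


def purity_tier(purity):
    """Assign a purity in [0, 1] to a tier (boundary handling is assumed)."""
    if purity >= 0.90:
        return "high"
    if purity > 0.60:
        return "medium"
    return "low"


def equal_size_training_sets(purities, n, seed=0):
    """Draw up to n parcel indices from the mixed set and from each tier."""
    rng = random.Random(seed)
    indices = list(range(len(purities)))
    sets = {"mixed": indices, "high": [], "medium": [], "low": []}
    for i in indices:
        sets[purity_tier(purities[i])].append(i)
    # A tier with fewer than n parcels contributes all of its parcels.
    return {name: sorted(rng.sample(members, min(n, len(members))))
            for name, members in sets.items()}
```

Training the same classifier on each returned set and scoring it against the shared validation samples gives the comparison reported for the four purity sets.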


[Figure: x-axis "Percentage of samples"; series: Mixed-purity, High-purity, Medium-purity, Low-purity]

The experimental results show that, for the same number of samples, the classification accuracy of the mixed-purity samples was equal to that of the medium-purity samples and higher than that of the high-purity samples. The classification accuracy of the low-purity samples was the lowest. These results show that for a study area with a high land use mixing level, high-purity samples are not sufficiently representative, which could lead to accuracy loss. The classification features of the low-purity samples are all mixed; thus, it is difficult for the classifier to learn effectively. The medium-purity samples are representative, and prioritizing them can be used as a principle of sample collection.

Impact of the Sample Spatial Distribution
In this experiment, the influence of the spatial distribution of the samples on accuracy was tested. We divided Shenzhen into three zones: the original special zone, former Bao'an, and former Longgang. The original special zone included Luohu District, Futian District, Nanshan District, and Yantian District. Former Bao'an included current Bao'an District, Longhua District, and Guangming District. Former Longgang included the current Longgang District, Pingshan District, and Dapeng District. The same numbers of training and validation samples were randomly selected from the three regions for the cross experiment, and the accuracy was calculated with the training samples from the original special zone, former Bao'an, former Longgang, and the validation samples from the three regions (Figure 9).

The experimental results show that land use is heterogeneous even among different areas of a single city and that an uneven spatial distribution of samples could cause accuracy loss. In this experiment, the original special zone was the old special economic zone, which has good planning control and orderly land development. Former Bao'an is a labor-intensive industrial agglomeration area with inefficient and extensive land use. Former Longgang is restricted by ecological protection due to location factors, and its density is relatively low. The samples from the three zones differ in representativeness, and the classification accuracy is significantly reduced when a classifier is applied to the other zones.

From the perspective of sample migration capacity (that is, how well samples from one zone transfer to another), the more diverse the regional urban land use model, the stronger the migration capacity. In former Bao'an, Guangming is a relatively less developed area of Shenzhen, and Bao'an Qianhai center is the most important economic center. Therefore, multiple internal development stages coexist in former Bao'an, land use is extremely complex, and the migration capacity is strong. Due to the high level of overall urban development, the original special zone has low representativeness and a weak migration capacity.

Mapping of SULUC in Shenzhen
At the beginning, local professional urban land use surveyors were invited to choose training samples from the complete sample set according to their knowledge and experience. They generated 1163 high-purity samples. Four-fold cross-validation was adopted to optimize the land use classifier, and the classifier was applied to the complete sample set for accuracy assessment. The overall accuracy was 62% for the Level I categories and 55% for the Level II categories. Then, we applied the best sampling strategy identified in the above experiments and selected 5028 samples of medium purity as the training samples. Their frequency distribution was similar to that of the complete sample set (Figure 10). Using the same parcels, features, and classifier, the overall accuracy for Level I categories reached 76%, and that for Level II categories reached 71% (Tables 5 and 6). As shown in Figure 11, the accuracy improved by approximately 15 percentage points under the optimal sampling strategy.

Regarding Level I categories, major discrepancies were clustered in residential and industrial land, and the misclassification of other land use types to residential and industrial land accounted for over 50% of each of the misclassified categories. Regarding Level II categories, major discrepancies were clustered in the urban residential, industrial, and parks and green space land. For example, urban residential land was primarily misclassified as industrial land, industrial land was primarily misclassified as urban villages, and parks and green space land was primarily misclassified as urban residential, industrial, and road areas.

We compared the difference between the mapping of SULUC and land surveys in terms of the urban land use structure (Figure 12). Most commercial and public service lands are not correctly classified and are mostly misclassified as residential and industrial; addressing this is critical for improving accuracy in the future.

From the perspective of the feature contribution rate, the most important feature is building height information, followed by POI and Sentinel-2A/B multispectral information (Figure 13). The contribution rates of the MPL data and the Luojia-1 nighttime light features are very low, mainly because the original spatial resolution of these data is low, which makes them unsuitable for high-resolution urban land use classification tasks.
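The overall accuracy and the misclassification shares discussed in this section can both be read from a confusion matrix. The sketch below shows one way to compute them; the nested-dictionary layout (true class, then predicted class, then parcel count) and the function names are assumptions of this illustration, and no values from Tables 5 and 6 are reproduced here.

```python
def overall_accuracy(confusion):
    """Share of parcels whose predicted class equals the reference class.

    confusion: {true_class: {predicted_class: parcel_count}}.
    """
    correct = sum(row.get(true, 0) for true, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def share_misclassified_into(confusion, true_class, targets):
    """Among the errors of one true class, the share assigned to `targets`."""
    errors = {pred: n for pred, n in confusion[true_class].items()
              if pred != true_class}
    total_errors = sum(errors.values())
    if total_errors == 0:
        return 0.0
    return sum(n for pred, n in errors.items() if pred in targets) / total_errors
```

Calling `share_misclassified_into` for each true class with residential and industrial as the targets yields the kind of "over 50% of each misclassified category" statistic quoted for the Level I results.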

Discussion
Mixed land use is a major obstacle to improving classification accuracy. Current results show that misclassifications were far more frequent for low-purity parcels than for high-purity parcels. The lower the purity of the parcel, the worse the classification accuracy (Figure 14). The reasons are as follows:

1. Due to the high scarcity of land, commercial, transportation, and public facilities in high-density cities such as Shenzhen often exist in the form of nonindependent land occupation. In this case, the features mentioned above may not be sufficiently significant compared with those in other cities.

2. Land is increasingly used in three dimensions. For example, a business center generated by urban renewal could have a commercial center on its low floors and high-quality housing on the top floors; thus, this center is both commercial and urban residential. Additionally, government agencies could rent some commercial buildings for office space, and in this situation, the building is both for commercial use and public service use. In the above cases, it is unreasonable to assign only one category to a parcel. A possible solution is to assign multiple categories to a parcel through a probability method.
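One way to picture the probability-based idea raised above is to keep every category whose estimated probability for a parcel reaches a threshold. This is a hypothetical sketch, not a method evaluated in the study: the function name and the threshold of 0.3 are assumptions, and the class probabilities are assumed to come from the classifier (for a random forest, for example, the share of trees voting for each class).

```python
def assign_categories(class_probabilities, threshold=0.3):
    """Return all categories at or above the threshold, most probable first.

    class_probabilities: {category: estimated probability} for one parcel.
    If no category reaches the threshold, the single most probable
    category is returned so that every parcel keeps at least one label.
    """
    ranked = sorted(class_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    selected = [label for label, p in ranked if p >= threshold]
    return selected or [ranked[0][0]]
```

Under these assumptions, a renewed business center with probabilities of 0.45 for commercial and 0.40 for urban residential would carry both labels instead of being forced into one.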


The methodology of the parcel segmentation and feature extraction can be improved:

1. The segmentation of parcels is not detailed enough. Because road segmentation technology is not suitable for the underdeveloped areas of the road network in the city, this results in superlarge parcels which contain multiple land use categories. In the future, image segmentation can be introduced to segment the superlarge parcels generated by road segmentation.

2. The POI information collection from commercial companies is biased, resulting in unsatisfactory classification results. In the future, POI information from official electronic maps can be combined with POI information from commercial institutions to enhance the classification accuracy.
Because Shenzhen has a complete set of ground truth land use samples, it was possible to design a series of experimental tests to investigate the impact of sample quantity and quality on detailed land use classification performance. We have further checked the availability of data in different cities around the world. The multispectral and nighttime light remote sensing data used in this paper can be obtained globally. Global road network data are also accessible through OpenStreetMap. However, the major challenge of this study was to collect sufficient land use samples. Fortunately, Shenzhen has just conducted an urban land use survey, and we could obtain its complete sample set from the survey results. Similar research can be conducted in other cities in China after the completion of the Third Nationwide Land Survey of China. In other areas, cadastral data could be considered as a source of samples in similar experiments to test whether the conclusions hold elsewhere in the world.

Conclusions
In the process of detailed urban land use classification based on multisource remote sensing, VGI, and machine learning, we studied how to improve the classification accuracy by optimizing the number and purity of the samples and summarized the optimal sampling strategy. The main conclusions are as follows:

1. Quantity strategy. To acquire the best classification accuracy in a single city, it is necessary to collect training samples of no less than 40% of the total number of urban parcels or no less than 5500 in number. If limited labor is available for sample selection, it is recommended to collect training samples of no less than 7% of the total parcels or no less than 900 samples. Further reduction in the number could cause a significant loss of accuracy. To ensure the stability and reliability of the accuracy evaluation results, it is necessary to collect validation samples of no less than 10% of the total parcels or no less than 1200. Notably, if the principle of stratified random sampling is followed, the impact of the number of validation samples on the accuracy evaluation is limited. Even if the number of validation samples is reduced to 1% of the total, the maximum accuracy evaluation loss is not greater than 8%.

2. Purity strategy. Using only high-purity samples could cause a certain loss of accuracy, so there is no need to collect only high-purity parcels as training samples. The better strategy is to prioritize samples with a purity between 60% and 90%. It is worth noting that random sampling without considering purity can also obtain ideal accuracy results, but there are great difficulties in identifying low-purity mixed land, which could require more work.

3. Spatial distribution strategy. The spatial distribution of the samples should be as balanced as possible, as unbalanced sampling will cause a significant accuracy loss even within a single city. The samples have some capacity to migrate between areas. When spatial equilibrium sampling is not feasible, priority should be given to areas with complex land use patterns, which can provide better classification results.

Funding: This research received no external funding.