Supervised versus Semi-Supervised Urban Functional Area Prediction: Uncertainty, Robustness and Sensitivity

: To characterize a community-scale urban functional area using geo-tagged data and available land-use information, several supervised and semi-supervised models are presented and evaluated in Hong Kong for comparing their uncertainty, robustness and sensitivity. The following results are noted: (i) As the training set size grows, models’ accuracies are improved, particularly for multi-layer perceptron (MLP) or random forest (RF). The graph convolutional network (GCN) (MLP or RF) model reveals top accuracy when the proportion of training samples is less (greater) than 10% of the total number of functional areas; (ii) With a large amount of training samples, MLP shows the highest prediction accuracy and good performances in cross-validation, but less stability on same training sets; (iii) With a small amount of training samples, GCN provides viable results, by incorporating the auxiliary information provided by the proposed semantic linkages, which is meaningful in real-world predictions; (iv) When the training samples are less than 10%, one should be cautious using MLP to test the optimal epoch for obtaining the best accuracy, due to its model overfitting problem. The above insights could support efficient and scalable urban functional area mapping, even with insufficient land-use information (e.g., covering only ~20% of Beijing in the case study).


Introduction
Land use is defined by the function, or functions, characterizing humans' use of an area of land, which mostly falls within six main categories, including agricultural, residential, recreational, commercial, industrial and transportation [1], in which the last four are the main contributors for urban area coverage.Commercial land use is land being used for the sale of goods or services for financial profit, including central business districts and shopping centers.Residential land use is the land used for housing.Recreational land use in urban areas includes city parks, playing fields, hiking and biking trails, etc. Industrial land use is the land used in manufacturing and storing, etcetera.Transport land use is the land delegated to the moving or transportation of goods and commuting people from one spot to another; that is, roads, highways, railroads and airports.
Urbanization is a process whereby populations move from rural to urban areas, enabling cities and towns to grow [2,3], and the process typically brings the need for more housing and jobs, associating with a need for land use change.That is, agricultural or natural recreational types of land use must be converted to residential, business, industrial and transportation types [4,5].During the renewal of a city, different functional areas with relatively homogeneous internal functional land use are gradually formed to meet the various needs of people's daily life and they are considered as basic spatial units implementing urban plans [6].The combination and distribution of different areas constitute the structure of the city [7].
Efforts have been made to monitor real urban functional land use patterns [8], and to compare with the urban development plan, which is important for sustainable urban development.Some of the urban functional areas may be inevitably sculpted by anthropogenic activities, rather than following the initial masterplan, due to multiple driving factors, including local geographic/topographic conditions [9,10], economic purposes [11,12], etc.Thus, diagnosing differences between plan and reality enables an improved future urban planning, such as modifications for planning transportation and recreational spaces [13,14].Besides, detailed urban functional maps and the dynamics of urban functional area change make a possible estimation for urban resilience in facing natural hazards or climate change [14][15][16][17][18], and for the ecological impact, due to urban sprawl, developing pattern, clustering, trend and functional land use interrelationships [19][20][21][22][23][24].
Nevertheless, before commencing all those analyses, a precise knowledge of the distribution of urban functional areas is necessary, which depends on the classification model, the amount of training data, different combinations of training data, etc.The unsupervised clustering methods exhibit the ability to discover various functional areas, yet ultimately require manual identification of the properties of individual clusters, which could bring about large uncertainty [25].Recently, artificial intelligence (AI) and machine learning (ML) developers have generated AI and ML to "think more intelligently", like humans, making decisions with supervised and semi-supervised models on urban functional area analysis [26][27][28].Later, GeoAI and artificial intelligence, together with a geographical information system (GIS), performs well in numerous tasks via combining the strong modeling ability of AI and geospatial characteristics (see for example [29][30][31][32]).Still, less effort has been made for diagnosing the sensitivities of performances which are related to the selection of the classification models, discrepancies in training data sets, and the method for quantifying model uncertainty.Previous works assess classification accuracies by comparing results with planning maps and online maps (e.g., [28,[33][34][35][36]), or using land use attributes extracted from human activities [27], but very few have leveraged standardized land use datasets with non-empirical means to determine the viability of classification models [37].
Research using geotagged training data adds new perspectives to data mining and to categorizing urban functional land use on a community scale.Geotagging is the process of appending geographic coordinates to media-based on the location of a mobile device [38].Geotags usually consist of coordinates, such as latitude and longitude, but may also include bearing, altitude and place names, which can be applied to photos, videos, or QR codes, and could also include time stamps or other contextual information [38].The point of interest (POI) data is one of the most popular geo-tagged data [26,39], used for sensing urban functional area characteristics associated with analysis methods based on GIS and GeoAI [8,37,40].
However, data requirements in number and quality are different for diverse supervised and semi-supervised ML models [41].For example, traditional non-graph structured supervised models may not capture the complex interactions and connections between urban functional elements [30], which results in a larger requirement of training data.In comparison, graph-based classification models require relatively less training samples (i.e., support a semi-supervised learning) due to their ability to combine semantic linkages between urban functional areas [27,30].Efficient methods, however, for building linkages are essential for the performance of graph-based classification models [27].
In this study, first, the basic study unit (i.e., functional area) is designed in Hong Kong, according to urban road networks, to obtain a relatively more homogeneous functional property (see [36,42]), followed by data-labeling on the basic study unit, which is the process of identifying raw data and adding meaningful labels to provide context for machine learning.After applying the supervised and semi-supervised ML models to achieve the urban functional area classification, the models' performances, in terms of uncertainty, robustness and sensitivity, will be compared to give insights into model selection strategies for different scenarios.The resulted insights are further supported by a case study carried out in Beijing.

Materials and Methods
Three data sets are used as inputs: the point of interest (POI) dataset, the open street map (OSM) dataset and a public land use dataset.The POI data used in this study is collected from the Amap platform, which consists of attributes including longitudes, latitudes, type (see Appendix A, Table A1) and rating scores, etc., which provide information about real-world geographical locations on which human activities take place [10].A Python language-based web crawler is developed for accessing and storing the JSON-formatted POI data: (https://gitee.com/pickup20/multi-modal-paper/tree/mastere/data,POI data was accessed on 1 March 2022).The OSM dataset includes detailed world-wide road network and part of the land use information.The road network consists of multiple classes of roads, of which the classes of primary, secondary and tertiary are used.The main drawbacks of land use information in OSM data are the inadequacy in some areas and incorrect records.The recently updated OSM data is downloaded from https://www.openstreetmap.org/accessed on 3 April 2022.Finally, the land use dataset of Hong Kong (Land Utilization in Hong Kong, LUHK) with 10 × 10 m spatial resolution is used for testing the classification accuracy, which is downloaded from https://www.pland.gov.hk/pland_en/info_serv/open_data/landu/accessed on 16 April 2022.
The identification and classification scheme of an urban functional area (see Figure 1) consists of commercial, residential, public service, recreational and transportation.Here, for diagnosing more details about urban central areas, agricultural and industrial urban functional land use types mentioned in the first chapter are excluded and replaced by public service land use type, including schools, institutions and administration facilities.

Preprocessing
The input data (e.g., OSM and POI) attain a unified coordinate system by map projection transformation as the known functional land use datasets; then, they are reclassified into a unified classification scheme (see Table 1) for further categorizing, feature extraction, semantic linkage and model comparison.The following processes are introduced: Segmentation: Urban road network data is used to divide the study area into basic study units (i.e., areas), according to the segmentation method proposed by [42].The initial OSM road networks are line vector features.Followed by transforming the line vectors to polygon vectors (using the ArcGIS 10.7 software), polygons with an area less than 0.001 km 2 are merged into the closest polygon through a morphological closing manipulation, which eliminates the scattered objects while preserving the shapes and sizes of larger objects.
Labeling: The segmented polygons are labeled/categorized (using the Zonal Statistic Tool in ArcGIS 10.7), by retrieving the functional land use type with the highest coverage within them.Occupation percentage of the labeled class is calculated for further possible bias testing (see Section 3.1.4).Note that in our case study, only when the highest coverage of the functional land use type exceeds 50%, a valid functional area is attained; otherwise, the function of this area remains unknown.Representation methods, such as the Word2Vec and the Term Frequency-Inverse Document Frequency (TF-IDF), in the natural language-processing field, enable the parse or process of natural language to a standard feature vector format for present models.Here, the Term Frequency-Inverse Document Frequency (TF-IDF) is used to obtain a standard feature vector for each functional area from its related POI categories.
Importance of POI categories within the functional area (represented by the TF value): The frequency of occurrence for the ith POI category, POI i , providing a weight of its importance within the specific functional area, which is calculated as: where AREA represents one specific functional area out of all areas, with a total number of N, and j is a counting integer ranging from 0 to N. Freq AREA j , POI i is the number of occurrences of a specific POI category (POI i ) in a given area (AREA j ), with the integer i ranging from 0 to C.
Importance of POI category among all functional areas (represented by the IDF value): the Inverse Document Frequency (IDF) value of a given POI category, IDF POI i , is calculated as: where: This value indicates the overall view of a POI category in the whole study region.That is, the more frequently the POI category occurs among all the functional areas, the lower is the IDF value.
Combination (TF-IDF): Thus, a normalized Term Frequency-Inverse Document Frequency (TF-IDF) feature vector for each functional area, TFIDF AREA j , can be calculated for a further supervised or semi-supervised classification: The similarity among TF-IDF feature vectors from each functional area is also used for a further mapping of the land use distribution (detailed information see Section 2.2.2).

Similarity Measurement for Functional Area Semantic Linkage
Pairwise relations or linkages of functional areas have been considered to be useful information in big data based urban computation [27].Together with the TF-IDF feature vector of the functional areas, a graph could be obtained for further exploration of the implicit information among all functional areas, so as to improve the final urban functional area classification.The linkages could be simply derived by the normalized TF-IDF feature vectors.
Semantic similarity: A graph consists of nodes and their linkages.Here, the nodes are areas with the associated normalized TF-IDF feature vectors TFIDF AREA , and the linkages are calculated as the cosine similarity of TF-IDF feature vectors from two different areas, Sim(TFIDF AREA m , TFIDF AREA n ): where m, n representing different functional areas, are integers ranging from 0 to N. Threshold is set as 0.7 for separating the final interdependencies or similarities among all the functional areas (see similar link prediction strategy in [43]).Values exceeding the threshold indicate that there is a noticeable interrelationship between two functional areas.The final linkages are organized by an adjacency matrix, A, with each element calculated as:

Urban Functional Area Prediction Models
Both supervised and semi-supervised models are commonly used for classification tasks.The differences here are that only the normalized TF-IDF feature vectors (calculated from Section 2.2.1) are required by supervised models, e.g., support vector machine (SVM, [44]), random forest (RF, [45]) and multi-layer Pperceptron (MLP, [46]), while both the normalized TF-IDF feature vectors and their linkages (Section 2.2.2) should serve as input for semi-supervised models, e.g., graph convolutional network (GCN, [41]), and comparison (see Table 2).Followed by an introduction, five models (four supervised versus one semi-supervised) are built and compared in the application analysis (Section 3).

Support Vector Machine (SVM):
A SVM is a supervised machine learning algorithm used for both classification and regression, which has become exceedingly popular, due to its relative simplicity and flexibility in addressing a range of classification problems, even in studies where sample sizes may be limited [44,46].After providing the SVM model sets of labeled training data for each category, it defines a decision boundary (i.e., a hyperplane) by maximizing the width of the gap between two categories to best separate them.Note that SVM is a binary classification method, which classifies objects into two groups of "True" and "False".Thus, a simple workaround of One-vs.-Rest is implemented to obtain a multi-class classification (for details, see [47]).
Random Forest (RF): RF is a classic ensemble learning model, which is built based on decision trees to provide results about modeling predictions and behavior analysis [45].Each decision tree in RF represents a distinct instance of the classification of data input into the random forest by integrating the entropy function to measure the loss between the prediction and the true label.RF considers the instances individually, taking the one with the majority of votes from decision trees as the selected prediction.
Multi-layer Perceptron (MLP): MLP is a widely applied supervised neural network model, which consists of three types of layers: (a) the input layer receives the input signal to be processed; (b) the output layer displays results from the required task, e.g., prediction and classification; (c) the hidden layers (with arbitrary number) are placed in between the input and output layer, which are the true computational engine of the MLP [46].Training data flows in the forward direction from input to output layer, and the neurons in the MLP are trained with the back propagation learning algorithm; thus, they approximate any continuous function for unknown pattern classification, recognition, prediction and approximation.Note that a focal loss function (proposed by Lin, et al. [48]) is used to address the class imbalance problem during the training process [37,49], which poses a problem in machine learning when the numbers of training samples for different classes vary greatly: where t represents the classified type.The focal loss adds a factor, α t (1 − p t ) γ , to the standard cross entropy criterion, −log(p t ), to reduce the relative loss for well-classified examples.Here, p t is the softmax-normalized t-th output of the model, and α t is a weighting factor corresponding to the model's t-th output.γ is for reducing well-classified examples loss, thus, forcing the model to focus on hard and misclassified objects, thereby improving the model performance.

Semi-Supervised Model: Graph Convolutional Network
Unlike traditional machine learning (e.g., SVM, RF, MLP) lacking consideration of graph structured semantic linkage, graph-based models have the potential to predict urban land use types with a small number of training data, which is meaningful in the real-world prediction.This could be achieved via measuring the semantic linkages or their similarity distances, etc. [27].The requirements of training data are largely reduced for these models compared to others, i.e., they could be trained in a semi-supervised manner.
Kipf and Welling [41] proposed a multi-layer graph convolutional network to scale linearly in the number of graph edges and learn hidden layer representations that encode both features of nodes and the graph structure.For a GCN model with L hidden layers, the forward propagation rule of graph convolution is given by: where H (0) represents TF-IDF in each area, and H (l) is for i-th neural network layer outputs.
σ(•) is a non-linear activation function, like the ReLU (see [41]).A is a representative description of the graph structure in the form of an adjacency matrix (self-connection is included, see Section 2.2.2).D is the diagonal node degree matrix of A, and W (l) is a weight matrix for the i-th convolutional layer.

Accuracy Assessment
The proposed supervised and semi-supervised models will be trained on a series of training sets, while being evaluated on the test sets in chapter 3. Two widely used metrics are used for measuring the prediction precision and precision for each individual type: where #(•) is the count of the corresponding predictions.As suggested by [50], the prediction accuracy is the fraction of the number of correct predictions over all the predictions.User accuracy is the probability that a value predicted to be in a certain class really is in that class.
The probability is based on the fraction of correctly predicted values to the total number of values predicted to be in a class.

Results
For a model comparison and sensitivity test, the above-mentioned four supervised and one semi-supervised models are applied over the central urban region of Hong Kong (~91.13 km 2 ), for which locally complete data sets of urban functional land use are available and accessible.Specifically, the study region includes the Central and West, Wanchai, Eastern District, Kwun Tong, Kowloon City, Sham Shui Po and Yau Tsim Mong districts (see Figure 2a).After segmentation, this region is divided into 469 functional areas, which will be labeled and represented by feature vectors for further classification (for related histogram see Figure 2b).The frequency distribution of the logarithm area of the basic study units (functional areas) is plotted in Figure 2b; the mean (standard deviation) and median values are 0.139 (0.625) and 0.064 km 2 , respectively.There is a total number of 171,704 POI points located in the study region.The related kernel density distribution shows hotspots in the economic central regions of Mongkok, Central district and Causeway Bay (see Figure 2c).Local urban functional land use is 100% (available from the LUHK dataset, Figure 2d), which provides an ideal study area for validating the candidates of the supervised and semi-supervised models.The total urban functional areas consist of 53.3% residential, 11.9% recreational, 17.1% commercial, 15.8% public service and 1.9% transportation areas (Figure 2d).

Model Comparison
The five selected supervised and semi-supervised models are trained on a series of training sets with increasing number of training samples (ranging from 2% to 90% of the 469 functional areas).For the cases whose number of training samples are critically small (less than 5%), we manually select areas as training samples to ensure there are no types missing in the training set.Otherwise, the training/test sets are randomly split from all areas in the study region (see Appendix A, Table A2).

Sensitivity on Training Set Size
Figure 3a displays the accuracy changes along an increasing number of training samples, from which the following results are noted:  In general, RF and MLP (GCN) are better choices when presented with sufficient (insufficient) amounts of training data, while lacking training data (<10%) can significantly reduce the supervised model's accuracy.

Training Performances: Small vs. Large Number of Training Data
Different training performances under the same number of training sample conditions (here 10% and 50% are selected for comparison, and the prediction accuracy is the fraction of the number of correct predictions over the total number of samples) indicate (Figure 3b,c):

Small number of training data (10%):
The best testing accuracy of MLP and GCN are similar with the number of epochs of ~200.However, GCN is better, because the testing accuracy of MLP first increases as the epoch increases but then decreases, which indicates an overfitting phenomenon: the production of an analysis that corresponds too closely or exactly to a particular set of data may, therefore, fail to fit to additional data or to predict future observations reliably [51].
Large number of training data (50%): GCN is approaching the highest accuracy around ~150 epochs, with which MLP indicates similar testing accuracy.However, the testing accuracy of MLP keeps improving as the number of training data increases and approaches 0.8, while GCN is facing a stagnation at less than 0.7.

Robustness to Different Selection of Training Data
To evaluate the performance of the 5 models, a further validation is conducted.10 independent training sets, each consisting of 10% out of all samples, are generated, based on the stratified sampling [52] strategy.Another 10 training sets, each consisting of 50% out of all samples (one sample may occur in many sets), are generated similarly for a further cross-validation (Figure 4).The models are estimated on the test sets, which consist of the other 90% or 50% samples, to observe the stability and generalization ability [53,54].The following results are noted: Condition of small number of training data (10%): Accuracies for all the models (SVM, two RFs, MLP and GCN) are roughly divided into three levels: MLP and GCN are in the 1st class, RFs (with 128 and 200 decision trees) are in the 2nd class, and SVM should be the last choice in terms of accuracy.For MLP and GCN (the 1st class), GVN (MLP) shows a relatively higher (lower) accuracy and a relatively higher (lower) variability.For RFs (the 2nd class), more decision trees do not necessarily increase the accuracy, but may increase the variability.

Condition of large number of training data (50%):
Apart from the SVM, all models yield higher accuracies and lower variabilities with rising training set sizes (Figure 4b).The RF-200 exhibits higher accuracies than the RF-128, while the two RFs still show the highest variabilities among all models.Two neural networks (MLP and GCN) show clear improvements on the robustness, whereas the accuracies of GCN are no more competitive.The MLP exhibits the highest accuracies of around 0.8, and a low sensitivity on the selection of training samples; therefore, it is preferred under the large number of training data conditions.

Impact from Different Levels of Functional Heterogeneity
To study the impact of the functional heterogeneity inside areas on the classification accuracy, 5 training sets, each consisting of 10%, are selected out of all samples (using the same stratified sampling method in Section 3.1.3)to train the GCN, and another 5, each consisting of 50% samples, are generated similarly to train the MLP.The occupation percentage of the labeled class (or purity) is calculated for the remaining 90% and 50% samples, based on the same method described in Section 2.1.The purity values are then segmented with an equal interval of 10% to form different purity ranks.The trained models are applied on those samples to observe the relationship between the accuracy and the purity rank.The results are plotted together with a frequency distribution of samples per purity rank in Figure 5. Results reveal that, whether using 10% or 50% samples (to train GCN and MLP model, respectively), the classification accuracy increases with higher ranking of functional purity, and (or) larger number of samples per purity rank.On the one hand, mixed functions could blur the important features, making it more difficult for the model to distinguish the main function among compound functions.On the other hand, since there is an observed bias in the number of samples of different purities (fewer samples have low purity, as seen in Figure 5), the models may fail to learn enough knowledge to correctly classify low-purity samples.Therefore, we argue that the estimation of functional heterogeneity inside functional areas should be an important procedure in the entire workflow to indicate the confidence of classification results.For example, our experiment in Hong Kong showcases that using road network to divide and generate functional areas, the averaged purity is ~76.1%, and 90.4% samples have higher purity than 50%, meaning a reasonable overall confidence.

Visual Comparison
Figure 6 illustrates the details of urban functional area predictions and the comparisons with the reality maps (derived from the LUHK dataset of Hong Kong following the labeling methods in Section 2.1).In this comparison, to maximize model output differences, the number of training samples is set as 4.7% (known/unknown: 22/469) within the whole region.Images acquired from Sentinel-2A satellite with the spatial resolution of 10 × 10 m is used as a background.Given such a low quantity of training sample, the accuracy ranking from high to low is: GCN (0.68) > MLP (0.63) = RF-128 (0.63) > RF-200 (0.62) > SVM (0.55).As shown in the visual comparison, the GCN exhibits abilities to distinguish functional areas that other models fail to classify correctly.For example, the GCN model effectively identified areas characterized by hospitals and schools (universities) (as indicated by the red boxes, located near Yau Ma Tei area), as well as major business and industrial areas (as indicated by yellow boxes, located at Kwun Tong district).In addition, despite introducing the one-vs.-reststrategy, the SVM classifier failed to distinguish multi-classes within the study area (Figure 6), which may be because it lacks the ability to handle the class imbalance problem introduced in Section 2.3.1.

Case Study: Beijing
The above analyses have compared and explained model sensitivity (on the training sample size), stability or reproducibility, accuracy, and robustness (from cross validation).Constraints on model applicability are comprehensively studied, which encouraged us to test the classification framework on a very different city in China.Unlike the relatively narrower roads, denser road network and smaller blocks observed in Hong Kong, the urban structure in Beijing is much less affected by terrain factors, but more by anthropological motivations [55,56].
The GCN is outstanding with a small number of training samples, which is important because, in real-world scenarios, lacking training samples of urban functional land use is one of the significant problems, especially in China.Here, for Beijing, the functional land use information of only ~20% of the central urban area (within the 5th ring, ~1140 km 2 ) is known (Figure 7).This small amount of training samples brings challenges and less accuracies if a supervised model (e.g., SVM, RF, MLP) is applied (see Section 3).Thus, the semi-supervised model of graph convolutional network (GCN) is selected in the following urban functional area prediction.Applications are carried out in parallel for three sub-regions of Beijing to promote the computing efficiency and highlight the heterogeneities within sub-regions.After data processing, 1571 functional areas are generated, with 9.4% (14/149), 8.81% (52/590) and 13.21% (110/833) of the three sub-regions (inside the 2nd ring, 2nd to 4th ring, and outside the 4th ring) selected as training samples.Three sub-regions cover different periods and degrees of development of the city.The 2nd ring encircles the famous Forbidden City and other archaeological landmarks, and outside the 4th ring the area shows less POI data (Figure 7b).Therefore, TF-IDF is calculated on the entire study region scale (within the 5th ring), while the semantic linkage of the cosine similarity is calculated on sub-region scale for highlighting the heterogeneities within sub-regions (see [57,58]).
GCN based classification and validation results are shown in Figure 8.The validation is achieved through field surveys and refers to the online map provided by the Amap platform close to the POI acquisition date.20/149, 40/590 and 50/833 areas are randomly selected for validation, and the confusion matrixes of each sub-region are derived (Figure 8b).The following results are noticed:
The classification accuracies for three sub-regions are 0.60, 0.83 and 0.69, respectively, and 0.73 for the whole study area.The belt sub-region between the 2nd and 4th ring roads contains a complex distribution of various functional regions, yet exhibits the highest accuracy among three sub-regions.This is in agreement with the relatively high variability observed in the robustness test in Section 3.1.3;

7.
Confusion matrixes indicate relatively higher user accuracies of residential (and recreational) functional land use from 64% to 88% (67% to 100%).The user accuracy of commercial (and public service) functional land use varies in three subareas from 40% to 80% (and 50% to 83%).Transportation functional land use is with less accuracy, which is probably because its low density; 8.
More specifically, public service is located more in the north of the city, while transportation occurs more on the south.On the east side of the 2nd ring, Beijing Central Business District is clearly displaced (the clustered dark red colors).From the 2nd to 4th ring, the well-identified public service areas show the existence of corresponding institutes and universities, especially in the north.To the northwest part of the third sub-region, recreational areas, such as the Summer Palace and Yuanmingyuan, compose distinctive clusters (see the areas colored with dark green).POI point density is relatively low in this region, indicating a lower density of buildings and commercial activities.Areas in the south part of the city are more irregular in shape (Figure 8a), which may be explained by the fact that this region is relatively less planned and developed compared with other regions.

Sensitivity and Accuracy
The accuracy-efficiency tradeoff: Neural networks are complex architectures and require enormous amounts of training data with good quality to produce viable results.For one neural network, as the size of the training data grows, so does the output accuracies.This is in agreement with [59,60].However, in choosing different supervised and semisupervised models, it is not necessarily the more training data the better.An efficient training set size could be selected, according to the cost of the training data and the required output accuracy (Figure 3a).For example, with a 5% training sample, GCN reaches an accuracy of 0.68; with a 40% training sample, RF reaches an accuracy of 0.8; while, with a 80% training sample, MLP reaches an accuracy of 0.85.In this study, the semi-supervised GCN model has better performance in urban functional land use prediction, using only a small number (5%) of training samples, which is meaningful in large-scale, real-world applications.
Reproducibility: Lack of reproducibility in machine learning, which is a complex and growing issue exacerbated by a lack of code transparency, can affect safety, reliability and the detection of bias.In choosing supervised and semi-supervised models, reproducibility comes from two aspects: applying a model multiple times with the same training set (Figure 3a), and a different combination of the same amount of training data (see the cross validation in Figure 4).That is, to obtain a more reliable urban functional land use classification, a model with high reproducibility (less variable in the cross validation) is recommended.In this study, GCN is more stable with the same training set, but varies more in cross-validation with 10% training samples (Figure 4a).MLP is less stable with the same training set, but varies less in cross-validation.And for RF, the increasing of the number of decision tree may not increase the final accuracy, but may increase the model's instability.This is also reported in previous papers such as [61].

When Is Machine Learning Application the Best Choice?
As the size of a neural network's architecture grows, so does its requirement for the amounts of training data.Thus, with a large amount of training samples, model selection and validation would be easy.For example, in modifying present urban functional land use plans with keeping most of the original designs, supervised models such as RF and MLP could be selected.However, there is normally limited training data available in real-world applications.In such cases, supervised classifiers that once have performed well may fail, while exploiting the limited data by incorporating their sematic linkages will produce viable results; see, for example, with 5% training sample, GCN reaches an accuracy of 0.68.This is in agreement with previous research [27], which feeds both feature vectors and different linkages of functional areas (such as the origin and destination pairs of taxi trips) into a GCN model and obtains a lower error.As also implied in previous research [31], the forward and backward training processes of GCN are effectively equivalent to geographical weighted regression (GWR), which makes it suitable in understanding geographical phenomena.

Limitation and Future Work
Almost always labeled data is essential for machine learning (ML) models.Without enough high-quality labeled data, the use of ML is not recommended.Meanwhile, most ML algorithms work better when there is a spatial balanced or quantitatively equaled distribution for each urban functional area types.In addition, the overfitting problem is noticed when applying MLP over Hong Kong, but in the real application, there may not be enough samples for a validation set to monitor the training process and the overfitting problem.Thus, one should be cautious to apply MLP when there is less than 10% of the whole regions with known classification categories.
In addition to POI data used in this study, other geographical and remote sensing indexes could also improve the classification accuracy [62].Other similarity methods could also be used to measure the linkage among urban functional areas (see the Moran's I index and high-resolution image features for example [62,63]).In addition, different segmentation (e.g., OSM based in this study) and different study scale (e.g., sub-regions in Beijing) and urban functional characteristics in neighborhoods, communities, or even regions [37,62], result in varied prediction accuracy.Therefore, each specific method should be considered simultaneously with the available data source, the cost and efficiency for a better outcome.And for future work, within the basic study units, functional characteristics are actually mixed and require more detailed decomposition techniques to better represent the development intensities and interactions in a downscaled situation [27,39].Furthermore, to deal with the current poor performance diagnosing low-purity samples (Figure 5), new techniques may be needed in the model training process, to specifically improve the ability to discriminate them

Conclusions
Formal land use refers to the qualitative attributes of land surface, while functional land use indicates its socioeconomic function.The formal land use map can be created with aerial or remote sensing images, but it is difficult to infer any functional attributes from these observations, especially for urban land use.City planners and other agencies have undertaken surveys to assign or infer the functional characteristics on basic urban areas under their jurisdiction [39].Such an endeavor is often time-consuming, as the urban landscape is constantly changing with the construction/renovation of infrastructures, new commercial/residential/industrial developments, and the modification of existing uses [20,57,64].Present geotagged data, e.g., point of interest (POI) data, bring new perspectives in data mining and supplement for defining urban functional area characteristics by associating with machine learning techniques.However, to our knowledge, there are still questions that need to be answered in using present geotagged data to diagnose urban functional areas, such as: When is a machine learning application the best choice?How to select a machine learning model?And what is the model's uncertainty, robustness and sensitivity?
Therefore, in this study, three supervised (SVM, RF, MLP) and one semi-supervised machine learning model (GCN) are selected.For both supervised and semi-supervised models, a normalized Term Frequency-Inverse Document Frequency (TF-IDF) feature vector for each functional area is calculated as model inputs for the three supervised (SVM, RF, MLP) models.Both the TF-IDF feature vector and the cosine similarity of the TF-IDF feature vectors from two different areas are calculated as model inputs for the semi-supervised GCN model.Followed by the uncertainty, robustness and sensitivity tests (see Section 3), the following results are noticed: 1.
As the amount of training sample grows, models' accuracies are improved, but with different potentials.GCN model is with the top accuracy, from 0.65 to 0.70, when the number of training samples is less than 10%, while MLP and RF show top accuracies when the number of training samples exceeds around 10%; 2.
With a large amount of training samples, which is normally in the modification of existing urban functional area maps, RF and MLP could be the best selection.However, one should note that MLP is less stable with the same training set, but varies less in cross-validation.For RF, the increasing of the number of decision trees may not increase the final accuracy, but may increase the model's instability; 3.
With a small amount of training samples, which is normally the case in the real world, GCN could provide viable results by incorporating the auxiliary information provided by the proposed semantic linkages.For example, with the incorporating of the similarity-based semantic linkage, the model could be trained using only 5% of the total samples and produce an accuracy of 0.68; 4.
In the perspective of the model overfit problem, which could be ignored in the real application due to lacking enough testing samples, when the training samples is less than 10%, we suggest choosing GCN for the urban functional land use prediction, and one should be cautious using MLP, by testing the optimal epoch for obtaining the best accuracy.

Figure 1 .
Figure 1.A flowchart illustrating the five stages of the functional area type identification and classification conducted in this paper.

Figure 2 .
Figure 2. (a) Geographical setting of the study region, Hong Kong, China.The research is carried out within the red lines; (b) histogram statistics of the size for the functional area; (c) kernel density distribution of Point of Interest (POI) points; and (d) the functional land use distribution derived from the dataset of Land Utilization in Hong Kong (LUHK), including five categories: commercial, residential, public service, recreational and transportation.

4 .Figure 3 .
Figure 3. (a) Comparing training set size dependence associated with five functional area classification models: support vector machine (SVM), random forest (RF-128 and RF-200), multi-layer perceptron (MLP) and graph convolutional network (GCN), indicating a best performance with GCN under semi-supervised situation and a best performance with MLP under supervised situation; (b) and (c) the testing accuracy and testing loss curves for MLP and GCN during model trainings.

Figure 4 .
Figure 4. Cross validation results associated with five classification models (a) using small number of samples (10%) for training and (b) using large number of samples (50%) for training.

Figure 5 .
The correlation between accuracy and different functional purities with the frequency distribution of purity ranks.

Figure 6 .
Figure 6.Visual comparison between ground truth and five functional land use prediction results obtained from SVM, RF-128, RF-200, MLP and GCN.The number of training samples is set as 4.7% (known/unknown: 22/469) of the total region, and the accuracy is highlighted accordingly.

Figure 7 .
Figure 7. (a) Geographical setting of the study region, Beijing, China.The research is carried out within the 5th ring road (marked by the red line); (b) histogram statistics of the size for the functional area; (c) kernel density distribution of local point of interest (POI) points; and (d) the functional land use distribution obtained from the open street map land use dataset, including five categories: commercial, residential, public service, recreational and transportation.

Figure 8 .
Figure 8.(a) Functional area prediction result within the 5th ring road of Beijing; (b) confusion matrixes of the 3 sub-regions, respectively.The row-normalized matrixes where diagonal element are user accuracies of the corresponding functional land use types.

Table 1 .
Functional land use types in this study associated with OSM functional land use dataset and the Land Utilization in Hong Kong (LUHK) dataset.

Table 2 .
Characteristics and implementation details of 5 models compared in this study.

1 .
Although models' accuracies are improved as the amount of training data increases, disparities could be diagnosed from the model comparison.RF, MLP and GCN show an obvious higher accuracy and improving potential (see the tendency of the accuracies to the training sample percentage) as the number of training samples Semi-supervised models indicate advantages with a small number of training samples.That is, the GCN model is within the top accuracy, from 0.65 to 0.70, when the number of training samples is less than 10%.However, the potential of the GCN model is of underperformance when the number of training samples increases, compared with supervised models of RF and MLP;

Table A1 .
POI categories used in this paper.

Table A2 .
Statistics of the training sets used for model comparison experiment.