Semantic-Enhanced Graph Convolutional Neural Networks for Multi-Scale Urban Functional-Feature Identification Based on Human Mobility

: Precise identification of spatial unit functional features in the city is a pre-condition for urban planning and policy-making. However, inferring unknown attributes of urban spatial units from data mining of spatial interaction remains a challenge in geographic information science. Although neural-network approaches have been widely applied to this field, urban dynamics, spatial semantics, and their relationship with urban functional features have not been deeply discussed. To this end, we proposed semantic-enhanced graph convolutional neural networks (GCNNs) to facilitate the multi-scale embedding of urban spatial units, based on which the identification of urban land use is achieved by leveraging the characteristics of human mobility extracted from the largest mobile phone datasets to date. Given the heterogeneity of multi-modal spatial data, we introduced the combination of a systematic data-alignment method and a generative feature-fusion method for the robust construction of heterogeneous graphs, providing an adaptive solution to improve GCNNs’ performance in node-classification tasks. Our work explicitly examined the scale effect on GCNN backbones, for the first time. The results prove that large-scale tasks are more sensitive to the directionality of spatial interaction, and small-scale tasks are more sensitive to the adjacency of spatial interaction. Quantitative experiments conducted in Shenzhen demonstrate the superior performance of our proposed framework compared to state-of-the-art methods. The best accuracy is achieved by the inductive GraphSAGE model at the scale of 250 m, exceeding the baseline by 25.4%. Furthermore, we innovatively explained the role of spatial-interaction factors in the identification of urban land use through the deep learning method.


Introduction
Investigating urban land use is pivotal for building sustainable and habitable cities [1], and the measured influence of the environment on travel behavior in human mobility is shown to be sensitive to land use [2].In this research field, one of the most longstanding and far-reaching geographic issues is the modifiable areal unit problem (MAUP), concerned with the spatial scale and the zoning sub-problem [3,4].Moreover, appropriate spatialunit zoning is significant for urban planning, urban governance, and synthetic studies of urban geography [5].Given the spatial heterogeneity of urban spatial units, predicting unknown functional features of spatial units from the effective representation of units' relationships is treated as an extensive challenge in urban science [6].In addition, the various features of the connections among spatial units are always disorganized in the social sensing process, making it arduous to adopt characteristics of spatial interaction in an urban geographic-prediction issue [7,8].The rapid development of deep learning mechanics has made remarkable advances in geospatial artificial intelligence (GeoAI) research [9,10].The deep neural network brings a new direction to the novel representation of urban spatial units, which encodes a spatial unit into high-dimensional embedding for the complete preservation of various types of geospatial information [9,11].Graph convolutional neural network (GCNN)-related methods can represent spatial structures graphically, while traditional neural network approaches are not powerful enough to interpret spatial dynamics [12].With the potency of capturing different spans of connection relationships and transforming them into the weights of neural networks, GCNNs can further learn a deep representation of spatial units' functional features from complicated geographic contexts involving human activities [13].
A comprehensive embedding of urban spatial units consists of physical feature vectors and human feature vectors [14,15].The physical vector directly conveys information on the built-environment attributes of the individual urban spatial unit.The human vector, concerned with social-economic activities, implies estimable information on how citizens generate and operate urban spaces with their daily demands [16,17].In other words, human activities carried out in different functional urban regions may reshape a place's land-use type [18].Spatial interaction represented by human mobility as flows of population [19] can capture complementary dependencies in different ranges [15,20,21].A profound understanding of spatial interaction favors optimizing crisis management, mitigating epidemic spreading, forecasting traffic status, and many other applications [22].On the city scale, spatial interaction can dynamically indicate functional land use and urban structure through spatial-temporal human mobility patterns from a collective perspective [16,23,24].Moreover, the urban land-use type is prone to diverge from official plans, due to the sophistication of urban evolutions; thus, it is of great necessity to identify and monitor functional land use for urban infrastructure construction, transportation modernization, resource assignment, ecology protection, and disaster assessment [25,26].
The identification of urban land use is an interdisciplinary problem related to the prediction of spatial units' functional features, which relies on the observed characteristics of the individual unit and the connected spatial units.Early studies that explicate the information of such kinds of geographic context as predefined mathematical functions cannot decently model the intricate essence of the urban system [8].While most existing studies apply GCNNs to the embedding of overall patterns and flow rates or amounts of spatial interaction, fewer of them have in-depth discussions on integrating the semantics of mobility patterns and socioeconomic attributes into the GCNN framework for the better prediction of spatial-unit features.One of the major reasons is the limitation of the range and contents of their datasets.The deliberate arrangement of spatial units will considerably increase the prediction accuracy in GCNN-based deep-learning tasks; however, the spatial scale effect has not been validated in the identification of urban land use from spatial interaction with graph embedding [5].In addition, due to the deficiency of transparency for most deep learning models, explainability continues to be a major concern in attaining greater insight into the characteristics underlying mobility flows [27].
This study aims to investigate the scale effect and semantic characteristics of spatial interaction in identifying functional features of urban spatial units.We developed a semantically enhanced GCNN framework to learn the multi-scale embedding of urban spatial units and the dynamic identification of functional urban land use by leveraging the geographic context of spatial interaction.The key contributions of this study are fourfold.First, we introduced a hierarchical data-alignment method for the spatial segmentation and mapping of urban land-use data and spatial interaction with the assemblage of individual human mobility data.Second, we primarily proposed a heterogeneous feature-fusion method to extract multi-dimensional attributes of spatial interaction from a massive amount of mobile phone data.Third, we expanded the graph-construction method by aggregating the distributed spatial interaction information to their origination and destination at different spatial scales.Fourth, we achieved the goal of explaining the features' contribution to the deep-learning tasks of land-use identification.Quantitative experiments comparing the performance of various backbone GCNN models empirically demonstrate that the semantic-enhanced GCNN framework is more powerful than the state-of-the-art methods.

Place Embedding with Graph Convolutional Neural Networks
The embedding of urban spatial units can be taken as a vector representation of their spatiotemporal characteristics, to manifest the similarity among urban spatial units in the vector space [28].In the initial phase of embedding, unsupervised text encoding algorithms broadly applied in Natural Language Processing (NLP) tasks, such as Word2Vec [29,30] and Doc2Vec [31], and homologous subjects bottomed on graph theory, such as Node2Vec [32], Place2Vec [33], and Traj2Vec [34], have fostered significant development in place-embedding research.However, an enormous loss of information happens when spatial units are transformed into word sequences in a document by the Word2Vec model, because spatial units usually possess multi-dimensional features, whereas word sequences are dispensed in one-dimensional space [35].In other words, the aforementioned methods cannot currently involve node features in graph embeddings.To overcome this limitation, a notable increase in interest has arisen in graph convolutional neural networks (GCNNs), as they can master spatial-unit embedding even when the scope of unit connections shifts [36].
The backbone models of GCNNs integrate a graph convolutional network (GCN), a relational graph convolutional network (R-GCN), inductive and directed graph sample and aggregate (GraphSAGE), and a graph attention network (GAT) [37][38][39][40].GCN utilizes a competent layer-wise propagation rule for neural network models that directly deal with graphs [37].On the strength of GCN, relation-specific transformations are brought to R-GCN to proceed with the highly multi-relational features of knowledge bases [38].In preference to building discrete embeddings per node, GraphSAGE samples and aggregates the features from a node's local space to produce its embedding [39,41].While GraphSAGE and GCN are learned in an unsupervised way, where nodes co-occurring is made along with short random walks in the embedding space, attributed network embedding (Attribute to Vector, Attri2Vec) is learned by capturing the context node similarity over a network structure-guided transformation in the original attribute space [42].
Thanks to the emergence of geospatial big data collected from location-based services (LBSs) and the Internet of Things (IoT), an increasing number of researchers have taken advantage of GCNNs to investigate urban issues, e.g., traffic prediction [43,44], urban land-use recognition [24,45], urban scene classification [25], urban security perception [46], public health evaluation [47], weather forecasting [48], and cultural association mining [49].Nevertheless, the implicit semantics and contextual information of geospatial big data are underexploited in the identification of urban functional features.Regarding the application of GCNNs in embedding urban spatial units, an important issue to be addressed is the effective representation of multi-dimensional features that can be adaptive to most backbone models.

Advanced Data and Methods for Urban Land-Use Identification
Research in identifying functional urban regions or land-use types has evolved from rule-guided to data-driven, with a bottom-up strategy, over the past decades [50].The data sources applied in this field vary from static data to dynamically recorded data.The static data mainly include remote sensing images, street view images, points of interest (POIs), building footprints, and traditional survey statistics [51][52][53][54].The dynamically updated data consist of mobile phone signal data, social media data, taxi traces, smartcard records of metros or buses, and other human activity data [55][56][57][58][59]. High-resolution remote sensing images and street view images are effective for deriving urban land covers, yet they struggle to continuously acquire the social-economic characteristics of urban spatial units, due to the accessibility of data and the shadows of abundant skyscrapers in highdensity cities [60,61].POI data is popular in urban studies for its facilitated accessibility, but the problem of widespread noise should not be disregarded [62].Compared with static data sources, dynamically updated human mobility data have the bulge on extensive data coverage, diverse spatiotemporal resolution, and high sensitivity to social functional attributes [60,63,64].
The progression of research methods conducted in urban land-use identification contains two stages: one focuses on density analysis and cluster analysis, while the other concentrates on data mining and deep learning.Most density-analysis and cluster-analysis approaches are generally acceptable for their hallmark of lightweight computation and time-saving functions [60].However, for certain urban regions with spatial heterogeneity, the foregoing methods using shallow features of data may have limited identification accuracy [60].Accordingly, scholars seek to employ data-mining methods like topic models, which can generate latent semantics for delineating functional urban zones [65].Further, a growing group of deep-learning methods has been widely applicable in urban functionalzone identification, including the Place2Vec model and the convolutional neural network (CNN)-based models, such as large-patch CNN and deep-feature CNN [66,67].Those experiments prove that deep learning's migration capability and scalability have unlimited potential in urban land-use identification and smart cities [60,68].In summary, a method to combine dynamically updated data and graph-based deep learning techniques for urban functional-feature identification is long-awaited.And the spatial scale effect, previously underappreciated, needs to be deeply explored in the current research stage.

Overview
In this study, a semantic-enhanced GCNN framework (Figure 1) to investigate the relationship between spatial dynamics represented by spatial interaction and urban functional features is proposed.We start with the alignment of multi-modal datasets, including mobile phone signal data that contain user locations and mobile phone usage data, land use data, and city boundary data within the research area.As the datasets are of different spatial scales, we conduct spatial segmentation on the study area and map multi-modal data to urban spatial units.Three kinds of features are extracted and fused under feature engineering, to enrich the semantic properties of urban spatial units.We then build up the spatial interaction matrix, based on the aggregated trips extracted from mobile phone data.With the combination of the feature fusion matrix and spatial interaction matrix, a variety of graphs involving the information of geographic semantics and spatial topological structure are constructed.Finally, we introduce a series of backbone models under the GCNN framework to test their performance in urban spatial-unit embedding, presenting the fact that they are competitive with state-of-the-art models.Furthermore, we analyze the contribution of multiple features in deep-learning classification tasks.

Building the Grid-Travel Corpus
Grid cells are typically regarded as a metric function for coding space for their utility in delivering multi-scale spatial representations in urban computing [69].The entire urban region is divided into uniform grids, as the urban spatial unit at different scales for the research demands.The human mobility involving travel between origin-destination (OD) grids is systematically documented on an hourly basis, encompassing both the travel and social attributes outlined in Table 1.Travel attributes primarily delineate the physical characteristics of mobility, while social attributes provide a comprehensive depiction of individuals engaged in travels.These attributes within the mobility data or OD data vividly showcase the diverse nature of human mobility, indicating a significant correlation between social structures and urban spaces [70,71].

Name Connotation Examples
Travel Attribute

Travel Type
The purpose of the travel commuting, dwelling, etc.

Transport Mode
The means of transportation subway, highway, airplane, train, etc.

Travel Time
The

Probability of Owning Cars
The probability level that a traveler owns a car Five levels: high, middle, low, etc.

Probability of Owning Houses
The probability level that a traveler owns a house Five levels: high, middle, low, etc.
To infer urban functional features with human mobility patterns, the traveling population flows as the link between urban spatial units, and its attributes should be aggregated to the origin and destination grids.Given the data heterogeneity of travel attributes and social attributes, a self-adapting data-normalization method is proposed to support the training of the GCNN models.Above all, those attributes are classified as continuous attributes and discrete attributes (categorical attributes).The continuous attributes include travel time, distance, and speed.The average and standard deviation of these data are calculated based on the origin grid (O-grid) and destination grid (D-grid), separately.The discrete attributes include travel type, transport mode, date of travel, people's age, gender, occupation, education, income, and the probability of owning a car or a house.The frequency at which the value of those discrete attributes appears is primarily calculated as the ratio.Since the attribute value is also important for feature construction, we multiply the ratio with the attribute value to obtain the weight of the attribute.The average and standard deviation of attribute ratio and weight are then obtained from the O-grid and D-grid, respectively.
Apart from the property of human mobility as an intrinsic attribute of urban spatial interaction, the intensity and frequency of spatial interaction also play a significant role in quantifying spatial interaction and delineating urban functional features.The intensity of urban spatial interaction can be defined as the volume of flows between urban spatial units, which is obtained by summarizing the number of travelers per hour.The number of travelers departing from the O-grid is calculated as "out-population", while the number of travelers arriving at the D-grid is calculated as "in-population"; the difference between "out-population" and "in-population" of each grid is defined as "stay-population".The frequency of urban spatial interaction can be defined as the ratio of flows between urban spatial units, which is acquired by computing the ratio of travel numbers of origin-destination pairs compared to the total travel records.Specifically, the travel ratio of O-grids is calculated as "out-frequency", while the travel ratio of D-grids is calculated as "in-frequency".

Generating a Feature Matrix from Multi-Modal Data Fusion
To probe how the semantic features of urban spatial units influence the performance of identifying urban functional features, the grids are further characterized by fusing the OD data with the other two types of human activity data derived from mobile phone use records.The first type is the monthly population (Pop) of each grid, which covers the population of working, dwelling, and visiting activities.A series of social attributes are extracted to present a comprehensive population portrait of urban spatial units, such as gender, age, occupation, education, etc.With similar social attributes of OD data, Pop data shares the same aforementioned normalization method.The second type of human activity data is the monthly mobile phone data usage of different mobile apps (Flux) located in each grid.These mobile apps are classified into six categories, including communication, entertainment, business, news, tools, and domestic services.The total mobile phone data usage of each category is calculated based on separate grids.Utilizing OD data as the primary features of urban spatial units, the normalized Pop data and Flux data are individually integrated with the OD data.Consequently, three distinct node feature matrices (OD and Pop, OD and Flux, and OD and Pop and Flux) are derived through multi-dimensional data fusion, serving as the foundation for graph construction in the context of urban spatial analysis.

Constructing Semantic-Enhanced Graph
A graph is constructed with nodes and with edges linking between nodes.While the nodes refer to urban spatial units, the edges correlate with the spatial interactions between nodes.The grid features are taken as node features, and the intensity of spatial interaction between grids is taken as edge weights in the graph.Three kinds of graphs are constructed in this study: directed weighted, inductive weighted, and multi-type weighted.In the directed graph, the origin and destination grids are taken as the source node and target node, respectively.The inductive weighted graph is undirected, and its nodes are treated as the same type.The multi-type weighted graph is a heterogeneous directed graph, where various types of edges symbolize different relationships between nodes.In our multi-type graph, two types of edges are constructed by, respectively, in-flow and out-flow, the weight of which is represented by in-population and out-population calculated in Section 3.2.1.
A directed graph is comprised of urban spatial interactions reflected by human mobility flows (m) between spatial units, which are aggregated from the travel trajectories of a crowd of people: where V m signifies a variety of graph nodes, and each node v i ∈ V m represents an urban spatial unit u i ∈ U, composed of origin nodes N o and destination nodes N d of trajectories: E m represents a set of edges, and each edge e ij = v i , v j , a ij ∈ E m stands for a directed mobility trajectory between two nodes v i and v j in end-to-end spatial interaction.The interaction intensity between urban spatial units is thus encoded as edge weights a ij .Edge directions are consistent with trajectory directions.
For a multi-type graph, the edge type that represents the relations between nodes is defined as R in the multi-type graph: and then each edge is refined as e ij = v i , r, v j , a ij ∈ E m where r ∈ R m is a relation type.
Moreover, E m in the multi-type graph is defined as the aggregation of bidirectional edges, namely in-edge (E in ) and out-edge (E out ): For each node v i ∈ V m in graph G m , the local attributes of urban spatial units and the spatial interaction attributes are encoded as node features x k ∈ X v .Specifically, X v is a subset of the fusion of the feature matrix, which is the concatenation ( ) of the urban mobility features, local population features, and mobile phone data usage of various apps:

Graph Convolution
In this study, the problem of identifying urban land-use type from the characteristics of urban spatial units and the spatial interaction between them can be defined in a straightforward form: where J is the general notation of the GCN model designed for the node-classification task in graph-embedding issues.The propagation model and loss function of the GCNN backbones we used will be further discussed in the following sub-sections.

GCN
In the GCN framework, the embedding of the urban spatial unit is conducted in K convolution layers; thus, each node v i in the graph has K node embeddings through hidden state h (k) i ∈ H (K) .While the node features X v are applied to the initial hidden layer H (0) , the prediction of node attributes is obtained from the embedding H (K) in the last layer.A layer-wise propagation rule is followed by the multi-layer GCN [72].
In particular, S = S + I N is a weighted spatial interaction matrix S ∈ R N×N , with added self-connections.It denotes the topological structure and edge weights of graph G m .I N is the identity matrix, and N typifies the node number of graph G m .W (k) is the learnable weight matrix specified by layers, and D ii = ∑ j S ij is a diagonal degree matrix.Following the layer-wise propagation rule discussed in Equation ( 7), we apply a two-layer GCN model to our study for supervised node classification, which is an instantiated form of Equation ( 6): Here, X is the input data that serve as H (0) in Equation (7).Ŝ = D − 1 2 S D − 1 2 is precalculated, based on the symmetric weighted matrix S. W (0) ∈ R C×H is a weight matrix for an input layer to a hidden layer with C input channels and H feature maps.W (1) ∈ R H×F is a weight matrix for a hidden layer to an output layer with F classes.The activation functions softmax and ReLU are applied row-wise.Given that the identification of urban land-use types belongs to supervised multi-class node classification, we estimate the categorical cross-entropy error for all nodes with land-use labels: where Y is denoted as node labels.The weights of GCN are trained using gradient descent, and the loss function above is applied to tune the weights to predict better results.

Relational GCN
Motivated by the architecture of GCN, in the "k + 1"-th convolutional layer of a relational multi-type graph, a node represented by v i is updated by the forward pass, and the propagation model for calculating it is defined as the following: where N r i depicts the group of node indices under relation r ∈ R m as the neighbors of node i.Moreover, c i,r is a learnable constant for task-oriented normalization.In contrast to the regular GCN model, the R-GCN model conducts transformations for individual relations, depending on the type and direction of its edge.
A critical problem of the aforementioned propagation model is that the increase in relation number in the graph will lead to a sharp rise in the number of parameters, which often becomes a primary contributor to overfitting.To address this issue, we apply basisdecomposition to regularize the weights of R-GCN layers in this study, where each W (k) r is defined as follows: This is a linear combination of basis transformation rb , which depend on r relation type.

GraphSAGE
The main purpose of GraphSAGE is to produce an approximate representation of adjacent nodes by aggregating feature information from the local node's neighborhood; meanwhile, the representations of dissimilar nodes are quite distinctive.The loss function designed for the output representations is shown in Equation (12).
Specifically, v and u are nodes that coexist, and they can reach each other on a random walk of definite length; σ is the nonlinear activation function for the representations of vertex z u (∀u ∈ V m ).P n denotes the distribution of negative sampling, and the negative samples are counted by Q. Incompatible with preceding embedding methods, a constant number of neighbors N(v) are evenly sampled from the set [u ∈ V m : (u, v) ∈ E m ] in Graph- SAGE.Moreover, the erratic node neighborhood features are operated by aggregator functions with symmetry properties.
Here, mean denotes the element-wise mean operator, and b is an optional bias toward each node's aggregated embedding in the given graph.The reason we employ a mean pooling aggregator is to efficiently capture alternative aspects of the node's neighborhood.

Study Area and Data Description
The study area Shenzhen (Figure 2) is one of the special economic zones of China.This port city is located on the east coast of the Pearl River Delta, bordering Hong Kong to the south.Two types of data are utilized in this study: mobile phone data and land use data.The mobile phone data were formally collected from China Unicom (China United Network Communications Group Co., Ltd. in Bejing, China), one of the largest national communication operators.Those long-sequence mobile phone data convey three types of information.The first is users' hourly intracity mobility data between OD grids, characterized by travel attributes and social attributes.The second is the population data of dwelling, working, and visiting in each grid.The third is the mobile phone usage data of different apps per grid.The mobile phone data covers three months of 2019: March, September, and December.Regarding the label category of urban spatial units, the ground truth of land-use data was collected from the third official National Land Resource Survey, entitled "Classification of Land Use Status", which was launched by Shenzhen State Council in September 2018 (GBT 21010-2017).An approximation strategy for labeling a spatial unit is to select the land-use type with the largest area in the spatial unit as the unique identifier.The spatial resolution was set as the grid of 250 m, 500 m, and 1000 m, respectively.In addition, the WGS84 geographic coordinate system and UTM 50 N project system were adopted in this study.Figure 3 illustrates the spatial distribution of Pop data and Flux data in the study area.Based on those data, a set of heterogeneous graphs was systematically constructed across three distinct spatial scales, as delineated in Table 2.

Geo-Semantic Embedding and Prediction
The identification of urban functional features such as land use according to spatial graphs is intrinsically a node-classification task in geo-semantic embedding and prediction issues.To contextualize the empirical results of the proposed method, we compared it against the state-of-the-art baselines including Random Forest and Attri2Vec, and four backbone GCNN models: GCN, relational-GCN, inductive GraphSAGE, and directed GraphSAGE.All models were implemented in the TensorFlow framework with the Stellargraph library.In terms of the experimental setup, 80% of the dataset was prepared for training, while the remaining 20% was used for validation and testing.For the urban land-use-identification task, we applied a three-layer model in all GCNN frameworks introduced in this study.In the GCN framework, the model works with a layer size of 128 for the first two layers, followed by a rectified linear unit (ReLU) activation for those hidden layers, as well as a softmax activation for the last output layer.In the GraphSAGE framework, the model executes with a batch size of 256 in both inductive and directed GraphSAGE.The number of in-samples and out-samples is set as 64 for each hidden layer in the directed GraphSAGE, while the same number is set for the undirected samples in the inductive GraphSAGE.In the R-GCN framework, the first two layers are activated by leaky rectified linear units (Leaky ReLU), whereas the output layer is normally activated by softmax for node classification.As R-GCN uses basis-decomposition for the regularization of layer weights, we chose 30 as the number of basis-decomposition, based on the validation set performance.The architectural hyperparameters of those tasks had been optimized on randomly sampled datasets.We set the drop rate as 0.3 for all layers, and imposed the bias to True.All models were trained using the Adam optimizer, to minimize categorical cross-entropy on the training nodes, implemented by an initial learning rate of 1 × 10 −3 and a decay of 1 × 10 −4 .We used an early stopping strategy with a patience of 200 epochs on both the accuracy and the cross-entropy loss.

The Performance of Multi-Modal Data Fusion
The semantic information provided by node features plays a significant role in the GCN-based node-classification task.Assuming that the performance of urban functional feature identification will improve with the increase in the amount of semantic information for urban spatial units, we conducted experiments on three multi-modal data fusion matrices, using Random Forest as the baseline and GCN as the representation of GCNN models.Results abbreviated in Table 3 convey overall classification accuracy (OA) and Kappa coefficient in percentages to evaluate the performance of the model and dataset.The results indicate that the data fusion matrix of population data and mobile phone app usage data outperformed the semantic feature matrix generated from single human mobility data in tasks executed by both the Random Forest and the GCN model.We can conclude that the integration of diverse human activity data will provide valuable information to the GCN model for identifying urban functional features.
Regarding the performance in various urban spatial scales, the scale of 250 m, with an OA of 82.59%, outperformed the scale of 500 m and 1000 m over four multi-modal data-fusion scenarios.In particular, the scale of 250 m outperformed the scale of 500 m by 4.9%, on average, while it outperformed the scale of 1000 m by 15.2%, on average.It turns out that the performance of urban functional-feature identification is negatively related to the spatial scale.Additionally, we found that the classification results of different land-use types are correlated with their proportions in spatial units.Taking the spatial scale of 1000 m as an example in Figure 4, residential land and industrial land with a large proportion of spatial units have higher F1-scores in the classification results.Nonconstruction land accounts for the largest proportion of spatial units, but its F1-score does not change drastically with variation in data-fusion scenarios, like other land types.An essential reason is that human activities have little interaction with non-construction land.This further illustrates the fact that the semantic information on human activities plays a significant role in identifying urban functional features.

The Performance of Different Embedding Models
Given that the semantic feature of the graph network constructed from integrating three kinds of human activity data outperformed other data-fusion scenarios in the task of urban functional-feature identification, we employed this graph network to test the performance of different embedding models at three urban spatial scales.Results archived in Table 4 connote OA and Kappa coefficient in percentages.The performance of all those models at small urban spatial scales was superior to those at large urban spatial scales.For the fine spatial scale of 250 m, the mean accuracy of all models was 80.19%.The GraphSAGE model strongly outperformed other GCNN models and baselines, and exceeded the mean accuracy by 9.1%.The inductive one, with an accuracy of 87.49%, shows a modest improvement compared to the directed one.For larger spatial scales of 500 m and 1000 m, the mean accuracy was 76.41% and 69.88%, respectively.At the scale of 500 m, the directed GraphSAGE model with an accuracy of 83.17% significantly outperformed other GCNN models and baselines, and exceeded the mean accuracy by 8.9%.At the scale of 1000 m, the R-GCN model, with an accuracy of 75.72%, outperformed other models and exceeded the mean accuracy by 8.3%.In summary, the GCNN models that emphasize the direction and relational type of graph edges are more suitable for identifying land-use types at middle and large spatial scales.The GCNN models that strengthen local characteristics of graph nodes are more competent in the identification of land-use types at small spatial scales.This result also brings conceptual insights into the embedding of urban spatial units.Both the directed GraphSAGE model and the R-GCN model exploit the edge direction that traces the directionality of spatial interactions.While the edge in directed GraphSAGE maps the one-directional relationship between the origin and the destination, the edge with bilateral types in R-GCN explicitly maps the bi-directional relationship between spatial units.The performance variations in directed GraphSAGE and R-GCN indicate that graph embeddings with the directionality of spatial interaction have a crucial impact on largescale urban functional-feature identification.Regarding the inductive GraphSAGE model that achieves superior performance at the 250 m scale, it recognizes the node's local role in the neighborhood by incorporating adjacent nodes' features and topological structures.This leads to another noteworthy conclusion: the graph embedding with the adjacency of spatial interaction makes a distinctive contribution to identifying small-scale urban functional features.

Geospatial Distribution of Urban Functional Features
To investigate the geospatial distribution of urban functional features, we took the landuse identification results output from the GCN model at the scale of 1000 m as an example (Figure 5).The regions near the southern border of Shenzhen are economically prosperous, due to their superior geographical conditions and high population density.The land-use types and human activities in these regions are more complex for GCNN embedding.The prediction result in the north of Shenzhen is better than that in the south of the city.The results also show that residential land is the most distinguishable with our semanticenhanced GCNN identification framework.Inferred from the average computation based on prediction results, over 56.1% of industrial land was successfully identified.Around 27.9% of commercial land was misclassified as residential land, for which an example is demonstrated by the middle arrow in Figure 5.One of the major reasons is that many commercial areas are located near residential areas, and there were more instances of residential land use in our dataset.Further supplementary information was required to precisely discriminate them.A noticeable issue is that no sound quality within the input features is sufficiently effective for recognizing public-service and administration land, which occurs around areas such as schools, hospitals, parks, etc.Most of these areas were misclassified as residential land instead, of which an instance is shown by the right arrow in Figure 5. Similarly, there were a small number of transportation land-use areas in our dataset, and a number of industrial parks were located around the transportation land, to facilitate logistics.Thus, 16.8% of them were misclassified as industrial land, with more occurrences.An example is displayed by the left arrow in Figure 5. Finally, the highest accuracy appeared in the identification of non-construction land at all three spatial scales, because the sample size of this land-use type accounts for the largest proportion.The potential reason will be discussed in Section 5.

Explanation of Feature Impacts on Model Prediction
Understanding why a model makes certain predictions is of great importance for results interpretation, the explanation of differences between models, and the extension of our understanding of the phenomenon under analysis [73,74].We used Shapley Addictive exPlanations (SHAP) [75] to figure out how the input features of human mobility contributed to determining the output urban land use of the GCNN model.According to game theory, SHAP enumerates optimal Shapley values to measure the contribution of individual features.It indicates how the occurrence of certain features influences the model function, compared to the normal performance for the complete dataset.With the support of SHAP, we demonstrated global interpretations of feature impacts at three urban spatial scales (Figure 6) and the corresponding local interpretations for six classes of urban land use at the spatial scale of 1000 m (Figure 7).Names with the prefix "o" and "d" are features extracted from the origin and the destination, respectively.The feature name with the suffix "avg" is the average of that feature value.The feature name with the suffix "std" is the standard deviation of that feature value.The feature names with the abbreviations "fre" and "wt" are, respectively, the ratio and the weight of the attribute mentioned in Section 3.2.1.
From a global perspective, the features extracted from the travel attributes contributed more to the identification of urban land use at small spatial scales, like 250 m, while the features extracted from the social attributes contributed more to the identification of urban land use at middle and large spatial scales, such as 500 m and 1000 m.In addition, the features for describing the graph network of urban mobility, such as the frequency of spatial interaction between urban spatial units (the out_degree and in_degree of the urban spatial unit), were of similar importance to the model prediction for three spatial scales.Two of the most relevant features were the full-weighted population (out_fpop) and gender-weighted population (out_gpop), which flows out of the urban spatial unit, which contributed more to the identification of non-construction land, residential land, and industrial land than that of commercial land, transportation land, and public service and administration land.While other features only brought a marginal contribution to the global prediction, the average weight of transport type (o_type_wt_avg) from the origin and the average arrival time (d_ehour_avg) to the destination showed more relevance for the identification of commercial land and residential land.From a local perspective (Figure 7), we take features' Shapley value for the land-use identification at the spatial scale of 1000 m as an example.Most of the features had a mixed effect on the identification probability of transportation land.For public-service and administration land, a decrease in the identification probability of this land-use type is provoked by a vast distance between origin and destination, which leads, contrarily, to an increase in the probability of identifying residential land.A possible explanation is that people tend to choose nearby public facilities, yet they must return home, no matter how far away they live.Also, the average speed across a destination was inversely proportional to the identification probability of administration land and commercial land, which coincides with the fact that there are speed limits in most administrative and commercial areas.Apart from the spatial features, the temporal features also showed a strong impact on the model output.The average departure time from the origin was conducive to an increase in the identification probability of commercial land, which resulted in a reduction in residential land.A rational explanation is that people leave residential areas early for commuting, and are inclined to visit commercial areas late in the day, after leaving work.In contrast to the features of transport attributes, features of social attributes revealed the social impact on land-use identification.For instance, the educational qualification of people from the origin had a positive effect on the identification probability of public-service and administration land; however, the opposite effect was shown in predicting industrial land.A possible explanation is that most of the people working in the administrative sector are highly educated, while industrial parks have no evident requirement concerning education level.Compared to other features of social attributes, such as age, gender, income, occupation, etc., educational qualification was the only one that appeared in the top 20 relevant features for the identification of urban land use.

Discussion
In this study, we introduced a spatial-hierarchical method in which a series of backbone GCNN models for urban functional-feature identification were operated at different spatial scales.The results substantiate a negative correlation between prediction accuracy and spatial scale.Notably, it is suggested that an optimal size of 250 m should be deemed suitable for urban spatial units in land-use studies, in conjunction with spatial interaction analyses.This observation is underpinned by the notion that large-scale spatial units tend to amalgamate subtle distinctions in land-use characteristics, compounded by the escalating data occasionality associated with a higher volume of spatial interaction flows at larger scales [76,77].In addition, the downsizing of spatial resolution provides an augmented set of training samples for GCNN models, thereby enhancing their capacity to discern intricate relationships between spatial interaction patterns and urban land-use dynamics.Furthermore, we applied heterogeneous GCNN models to explore how spatial interactions affect the identification task of urban functional features.An important finding is that the directionality of spatial interaction makes a greater contribution to large-scale tasks, while the adjacency of spatial interaction has a more prominent impact on small-scale tasks.As the spatial scale increases, the intensity of spatial interaction shows an upward trend, enlarging the edge weights in GCNN models.The edge direction distinguishes the spatial interactions with similar intensities, and enriches the semantic information of the graph structure [78], thereby improving the accuracy of urban functional-feature identification.For the small spatial scale, according to the first law of geography [79], the local attributes of adjacent graph nodes have higher similarity, which enhances the representation of spatial dependency for identifying urban functional features.
In the second place, to probe the model performance in different human-activity scenarios, we extended the graph feature matrix by fusing the mobility data with population data and mobile phone usage data, respectively.The best model performance was derived from using the integration of three kinds of human-activity datasets.Concerning geographic context, human mobility data generally have wider coverage in both space and time, providing useful insights to represent social-economic activities between urban spatial units.The traffic attributes and social attributes of human mobility further characterize the diversification of relationships between spatial units.Additionally, the demographic data containing various population structures depict the long-term routine of human activity [80], revealing their stable usage of urban functions.The spatial pattern of mobile phone usage data may illustrate the short-term routine of human activity, reflecting the urban vitality in spatial units with different functions [81].Therefore, when the spatial interaction traced by human mobility patterns is taken as a vital factor for inferring urban functional features, other types of human activity are also worthy of academic attention.While mobile phone data are gaining popularity in spatial interaction research, their availability varies in different countries and regions.A potential alternative is to integrate human mobility features using other location-based services, such as GPS data from taxis or Uber, public transportation data, social media data, etc.

Conclusions
This study proposes a novel semantic-enhanced GCNN framework for the multi-scale embedding of urban spatial units, and applies it to the identification of urban functional features.It is the first time that spatial interaction features derived from mobile phone data have been taken as the only data source for urban land-use identification.Above all, our work enriches the theory of multi-scale spatial interaction based on human mobility flows, and promotes the technology of feature-driven node classification in GCNNs.
From a theoretical perspective, predicting the unknown attributes of urban spatial units, such as land-use type, from spatial interaction is a challenging issue in urban geography, as the unknown attributes of a geographic unit have previously been inferred from its observed local attributes, rather than from the relationships between geographic units.Instead of using auxiliary information (e.g., POIs, remote sensing images, building footprints) as the observed attributes, as in existing studies, we exploited in depth the latent semantic information of spatial interaction as the geographic context, and aggregated the decentralized attributes to urban spatial units.This work leads to the high accuracy of GCNN models for urban land-use identification compared to state-of-the-art methods, showing the great power of spatial interaction data in precisely predicting the attributes of urban spatial units.Additionally, the scale effect as an important issue in the field of urban geography has rarely been examined in the existing studies.We explored the scale effect of spatial interaction and GCNN-based urban land-use identification in three typical spatial scales for urban studies.Our results prove that the directionality of spatial interaction has considerable influence on large-scale urban functional-feature identification, while the adjacency of spatial interaction exerts a profound effect on the same issue at a small scale.
From a technological perspective, while most related studies have focused on the effective fusing of multi-source data, we propose the combination of a systematic data-alignment method and a feature-generation method for the robust construction of heterogeneous graphs with multi-modal spatiotemporal data, providing an adaptive solution for improving the GCNNs' performance in geographic problems.As far as we can tell, the mobile phone dataset we collected in the current study is the largest and most current dataset used to date in the research of urban land-use identification through convolutional deep learning.In utilizing this large dataset, the generalization ability of our model was improved by augmenting the training samples.We conducted extensive experiments on this dataset with backbone GCNN models at different spatial scales, offering a scientific approach of scale-oriented model prioritization to the relevant research.
In terms of our contributions to the explainability of deep-learning methods, we used quantitative techniques to explain the role of spatial-interaction characteristics in land use identification, strengthening our understanding of the human-land relationship in the urban environment.By identifying the functional features of urban spatial units through AI methods, our study can provide references to the quick examination of urban spatial structure, timely verification of land-use planning, and synthetical optimization of urban functional form.Meanwhile, the study acknowledges specific limitations.The model's performance is contingent upon the quality of the data, wherein the spatialinteraction pattern derived from human mobility data typically adheres to a long-tail distribution.Effectively mitigating the inherent sparsity within human mobility data poses a formidable challenge for GCNNs.Concurrently, achieving a nuanced equilibrium between prediction accuracy and computational efficiency across diverse scales remains a complex endeavor.In future research, our paramount focus will be the development of a self-adaptive GCNN framework, which is designed to exhibit superior performance in identifying urban functional features characterized by sparse data, across a spectrum of spatial scales.

Figure 2 .
Figure 2. Location of the study area.

Figure 3 .
Figure 3. Examples of diverse demographic distribution on the spatial scale of 1000 m. (a-c): Demographic distribution of occupational classifications in Shenzhen; (d-f): Demographic distribution of educational qualifications in Shenzhen; (g-i): Mobile phone usage data of various apps in the spatial unit with different land-use types.

Figure 4 .
Figure 4. (a) F1-score of land-use types predicted by the GCN model in different data-fusion scenarios at the spatial scale of 1000 m.The F1-score distribution of the other two spatial scales is similar to the scale of 1000 m.(b) The proportion of units of each land-use type in the total number of spatial units.

Figure 5 .
Figure 5.The identification of urban land use achieved by the GCN model at the spatial scale of 1000 m.(a) Ground truth of Shenzhen land use; (b) Identification results of Shenzhen land use.

Figure 6 .
Figure 6.Global summary of distributed SHAP values for the top 20 contributed features of urban land-use identification by GCN model.Class 0-5 in the legend corresponds to the six classes of land-use types.Class 0: Transportation land; Class 1: Public service and administration land; Class 2: Commercial land; Class 3: Residential land; Class 4: Industrial land; Class 5: Non-construction land.Names with the prefix "o" and "d" are features extracted from the origin and the destination, respectively.The feature name with the suffix "avg" is the average of that feature value.The feature name with the suffix "std" is the standard deviation of that feature value.The feature names with the abbreviations "fre" and "wt" are, respectively, the ratio and the weight of the attribute mentioned in Section 3.2.1.

Figure 7 .
Figure 7. Local summary of distributed SHAP values related to the top 20 contributed features for identifying six urban land-use classes.The spot on the horizontal axis depicts the feature's Shapely value for that land-use class, indicating how the feature influences the rise and fall of the probability of identifying that land-use class.The blue spot and red spot typify the features that have a low and high value, respectively.

Table 1 .
Travel and social attributes for characterizing human mobility.

Table 2 .
Description of heterogeneous graph structures.

Table 3 .
Model performance in four data-fusion scenarios at different spatial scales.

Table 4 .
Performance of backbone GCNN models and baselines.