Urban Multi-Source Spatio-Temporal Data Analysis Aware Knowledge Graph Embedding

Abstract: Multi-source spatio-temporal data analysis is an important task in the development of smart cities. However, traditional data analysis methods can neither keep pace with the growth of massive multi-source spatio-temporal data nor explain the practical significance of their results. To explore the network structure and semantic relationships, we propose a general framework for multi-source spatio-temporal data analysis via knowledge graph embedding. The framework extracts low-dimensional feature representations from multi-source spatio-temporal data in a high-dimensional space and recognizes the network structure and semantic relationships in the data. Experimental results show that the framework not only makes effective use of multi-source spatio-temporal data but also reveals the network structure and semantic relationships. Taking real Shanghai datasets as an example, we confirm the validity of the analytical framework based on knowledge graph embedding.


Introduction
Large amounts of data are collected from people's daily lives, including daily travel, weather, and industry data, and they contain a wealth of information [1][2][3]. Multi-source spatio-temporal information is the basic data source for predicting urban population activity flows and for urban transportation planning. Understanding the potential laws behind multi-source spatio-temporal data is therefore an important task. The goal of data analysis is to examine the laws behind the activity data of city residents and the many kinds of external-influence data, which includes predicting the possibility of future development [4] and the aggregation state of a region [5], explaining practical significance [6], and recognizing abnormal road surfaces [7]. Urban multi-source spatio-temporal data analysis not only helps explain the practical significance of data from the perspective of the human-land relationship, but also provides a connecting point between the construction of new smart cities and big data development strategies. The development of a smart city is inseparable from the support of resident activity data. As a megacity, Shanghai bases its development into a smart city on the actual activities of its residents. Analyzing residents' activities can guide and recommend residents' travel, which lays the foundation for the development of smart cities [8].
One main issue for data analysis is to understand the structure and practical significance of multi-source spatio-temporal data. However, this is a difficult task in smart cities. First, across different scenarios, it is difficult to consider data of different types, sources, and meanings comprehensively [2]. Second, traditional data analysis methods cannot adapt to high-dimensional data from daily life, and their results on multi-source spatio-temporal data are not interpretable [6]. These obstacles hinder the development of data analysis in smart cities.
There are many existing data analysis methods. Because the data are high-dimensional, it is difficult to recognize patterns in multi-source spatio-temporal data directly, so many scholars treat data as a network graph. From the perspective of node type, networks are mainly divided into homogeneous and heterogeneous networks. Data analysis models based on homogeneous networks mainly include Word2Vec [9], word embedding and spatio-temporal embedding [10], LINE [11], node2vec [12], and SDNE [13]. These methods treat a single type of data as a network but cannot represent multi-source data. To make full use of multi-source spatio-temporal data, some scholars use the meta-path approach to represent multi-source data as a heterogeneous network [14,15]. However, a heterogeneous network can only represent a specific network and requires accurate meta-paths between nodes. It is therefore not universal and cannot be applied to multi-source spatio-temporal data analysis tasks.
To solve the problem of multi-source spatio-temporal data analysis in heterogeneous networks, in this paper we use a general framework based on knowledge graph embedding for multi-source spatio-temporal data analysis tasks. The main contributions of this paper are:
• We propose a general framework for multi-source spatio-temporal data analysis based on knowledge graph embedding. Knowledge graph embedding models are used to capture heterogeneous network structure features and semantic features in a low-dimensional space. We then use link prediction and cluster analysis tasks to mine the network structure and semantic knowledge.
• We recognize the importance of knowledge from a practical perspective. Different knowledge has different impacts on travel activities.
• We evaluate the framework using travel data and external knowledge data of research areas in Shanghai. We then analyze the potential network structure and semantics of multi-source spatio-temporal data from the evaluation results, and we interpret the practical significance of the data through visualization.
The rest of the paper is organized as follows. Section 2 reviews related research on multi-source data analysis. Section 3 introduces the details of our framework. In Section 4, we evaluate the performance of the framework with actual data, including model parameter design, analysis results, and disturbance analysis. We summarize and discuss the results in Section 5.

Related Works
Multi-source spatio-temporal data analysis is an important cornerstone for the development of smart cities. During this development, understanding the potential development and aggregation states of internal urban structures is important. Existing methods for analyzing fragmented urban data are divided into application-driven and model-driven approaches. Application-driven approaches rely strongly on raw data and analytics platforms. Representative methods include spatial auto-correlation analysis [16], kernel density estimation (KDE) [17,18], cluster analysis [19], and social network analysis (SNA) [20]. In practice, there are many kinds of fragmented urban data, and it is difficult to analyze the network structure and practical significance from multiple angles. These application-driven methods are limited to a certain type of application, and they cannot deeply capture the internal correlation of urban multi-source data by means of econometric analysis and spatial organization.
The model-driven approach mainly analyzes fragmented urban data through probabilistic topic models or deep learning models. Probabilistic topic models, including LDA and LSA [21][22][23], can be regarded as extracting features to find the optimal feature subspace. However, they ignore the time factor of residents' travel. Deep learning models, in contrast, can automatically learn feature representations of the original data and extract the potential semantics of human travel modes, e.g., DeepWalk [24], node2vec [12], and LINE [11], as well as several others [25][26][27]. They embed high-dimensional data into a low-dimensional space while retaining the spatial structure and semantic connections of the data. However, many deep learning models can only be applied to networks with a single type of node. In the era of big data, traditional methods for analyzing fragmented urban data cannot comprehensively describe the structure of the data, and the heterogeneity of multi-source data exposes the limitations of traditional deep models.
Multi-source spatio-temporal data analysis is indispensable for building smart cities, and many models for analyzing urban multi-source spatio-temporal data have emerged. To overcome the limitations of homogeneous nodes, scholars have proposed the concept of heterogeneous networks and carried out related research [14,15]. A heterogeneous network can represent information about different types of nodes, as well as relationships between nodes. The PTE model achieves network heterogeneity by classifying text or tags and representing their relationships [28]; the HINES model constructs a heterogeneous network by representing paths between nodes according to meta-information [29]; on the basis of edge features and the hyperedge concept, the authors of [30,31] proposed the HEBE embedding framework, which models strongly correlated events as a whole and realizes a heterogeneous event network. However, a major drawback of heterogeneous networks is that accurate meta-paths must be built to represent relationships between nodes, and specific meta-paths confine a heterogeneous network to the framework of a particular network. In recent years, the knowledge graph has been widely used by many scholars due to its richness and relevance.
Knowledge graphs have also been used to analyze and retrieve residents' activity data in many aspects of life [32]. Representing heterogeneous networks as knowledge graphs provides a broader perspective on the above problems [33][34][35].
Therefore, in order to solve those problems of fragmented multi-source spatio-temporal data analysis in heterogeneous networks, we propose an analytical framework based on knowledge graph embedding to recognize multi-source spatio-temporal data. The framework can exploit the potential law and aggregation state of multi-source spatio-temporal data from network structure and semantic knowledge.

Definition
The main objective of urban multi-source spatio-temporal data analysis is to analyze the network structure and its practical significance. In this paper, the multi-source spatio-temporal data analytical framework is a general framework comprising knowledge of multi-source data, knowledge graph embedding, and multiperspective analysis. Without loss of generality, in the experiment section we used multi-source spatio-temporal data based on Mobike riding behavior in Shanghai as an example of urban data analysis.
Definition 1. Travel Network G. We used the triple $G = (H, R, T)$ to describe the travel network, where the head entity $H$ is the origin grid, the tail entity $T$ is the destination, and the relation $R$ describes the relationship between head and tail entities in the travel network. In addition, $(h_g, r_g, t_g)$ denotes a single triple in $G$.
Definition 2. Knowledge Network K. We used $K = \{K_1, K_2, \ldots, K_n\}$ to describe the knowledge network, where $K_k = (H_k, R_k, T_k)$ is the $k$-th type of auxiliary knowledge. In this paper, we set the value of $k$ to 6.
Definition 3. City Knowledge Graph (CKG). We used the directed network $CKG = (K, G)$ to describe the CKG, where $G$ is the travel network and $K$ is the knowledge network that describes the collection of auxiliary information.

Framework Overview
In this section, we describe an analytical framework based on knowledge graph embedding for multi-source spatio-temporal data. Specifically, the framework consists of three parts: knowledge of multi-source spatio-temporal data, knowledge graph embedding, and analysis of multi-source spatio-temporal data. As shown in Figure 1, the lower dotted frame is the knowledge of multi-source data, which mainly includes the travel network, the knowledge network, and the city knowledge graph. First, we processed the original network into triples. Second, we combined the semantics of the triples with the structural information of the network and represented entities and relationships as vectors. Finally, we analyzed the multi-source spatio-temporal data from the network structure and semantic knowledge perspectives. The details of the analysis are described in Section 4.4.
Figure 1. Overview. We took network triples as input and obtained low-dimensional feature vectors through the representation model. Then, we analyzed multi-source spatio-temporal data from the network structure and semantic knowledge perspectives.

Knowledgeable Multi-Source Spatio-Temporal Data
To effectively explore urban multi-source spatio-temporal data, in this paper we represented the data with a heterogeneous network, namely the knowledge graph. The knowledge graph is essentially a semantic network that can represent heterogeneous nodes and multi-relationship information. We used the knowledge graph to fuse multi-source data while retaining the original information. We processed urban multi-source data into triples that form three basic networks: the travel network (G), the knowledge network (K), and the city knowledge graph (CKG). As shown in Figure 2, the visualization of the knowledge graph of Shanghai forms a hierarchical structure consisting of Shanghai, the administrative divisions, the grids, and the POIs. For example, (Hongkou, belongs to, Shanghai) is a triple of the city knowledge graph.
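As a minimal sketch of how such triples might be held in memory, the Python snippet below merges a travel network and a set of knowledge networks into one city knowledge graph and assigns integer IDs to entities and relations. Apart from the (Hongkou, belongs to, Shanghai) triple taken from the text, all grid IDs, POI names, and relation names are hypothetical placeholders, and the helper names `build_ckg` and `index_symbols` are illustrative only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Travel network G: origin grid -> destination grid (hypothetical IDs and relation name).
travel_network = {
    Triple("grid_0012", "travels_to", "grid_0047"),
    Triple("grid_0047", "travels_to", "grid_0012"),
}

# Knowledge network K = {K_1, ..., K_n}: one set of triples per auxiliary source.
knowledge_network = {
    "administrative_division": {
        Triple("Hongkou", "belongs to", "Shanghai"),   # example given in the text
        Triple("grid_0012", "located_in", "Hongkou"),  # hypothetical
    },
    "poi": {
        Triple("poi_cafe_01", "located_in", "grid_0012"),  # hypothetical
    },
}


def build_ckg(travel, knowledge):
    """Merge G and every K_k into a single set of triples (the CKG)."""
    ckg = set(travel)
    for triples in knowledge.values():
        ckg |= triples
    return ckg


def index_symbols(triples):
    """Assign integer IDs to entities and relations, as embedding models expect."""
    entities = sorted({t.head for t in triples} | {t.tail for t in triples})
    relations = sorted({t.relation for t in triples})
    entity_ids = {name: i for i, name in enumerate(entities)}
    relation_ids = {name: i for i, name in enumerate(relations)}
    return entity_ids, relation_ids


ckg = build_ckg(travel_network, knowledge_network)
entity_ids, relation_ids = index_symbols(ckg)
print(len(ckg), "triples,", len(entity_ids), "entities,", len(relation_ids), "relations")
```

Keeping each auxiliary source in its own set makes it straightforward to build partial graphs (for example, travel data plus POI knowledge only) for comparison.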

Knowledge Graph Embedding
A knowledge graph handles the fusion of multi-source spatio-temporal data well, which is one of the key problems in analyzing such data. At present, traditional knowledge graph analysis methods are based on database operations. On the basis of graph theory and probability, the graph model can efficiently analyze the association between entities. However, the database limits the analysis of the intrinsic and potential properties of the knowledge graph. A knowledge graph embedding model can produce low-dimensional vector representations while preserving the structural and relational features of high-dimensional networks. These low-dimensional vectors can be used to perform a variety of structural or relational analyses. Therefore, we selected knowledge graph embedding methods to obtain the structural and semantic characteristics of the network. Figure 3 is a schematic diagram of the knowledge graph embedding (KGE) models of the TransX series [36][37][38][39]. The input of a KGE model is the knowledge graph, and the outputs are the entity and relationship embedding vectors. For example, $(h_1, r_1, t_1)$ and $(h_2, r_2, t_2)$ are two triples in the knowledge graph. The left dashed box is the original space, and the right dashed box is the mapping space. $M_r$ is the transfer matrix from the original space to the mapping space learned by the KGE model. With the transfer matrix, we can project head and tail entities from the original space into the mapping space. Therefore, the projection vectors of head and tail entities can be expressed as:
$$h_{\perp} = M_r^1 h, \qquad t_{\perp} = M_r^2 t$$
Training constantly adjusts the embeddings by mapping the entities of each triple into the mapping space so that $h_{\perp} + r \approx t_{\perp}$. The loss function of TransX is:
$$f_r(h, t) = \left\| M_r^1 h + r - M_r^2 t \right\|_{\ell_{1/2}}$$
where $h$ and $t$ are the head and tail entity vectors in the city knowledge graph, and $r$ is the relation vector between head and tail entities. $M_r^1$ and $M_r^2$ are the transform matrices of head and tail entities, respectively; models that use a single transfer matrix $M_r$ set $M_r^1 = M_r^2 = M_r$, and $\ell_{1/2}$ denotes the L1 or L2 norm.
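The projection-and-translation idea can be illustrated with a small, dependency-free Python sketch. It scores a triple as the distance between the projected head plus the relation and the projected tail, using separate head and tail matrices as in the description of M 1 r and M 2 r. The margin-based ranking loss shown here is an assumption: it is the form commonly paired with translation models, and the text only mentions a margin hyperparameter. The function names and toy vectors are hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def score(h, r, t, m_head, m_tail, norm=1):
    """Distance between (M_head h + r) and (M_tail t); lower means more plausible."""
    h_proj = mat_vec(m_head, h)
    t_proj = mat_vec(m_tail, t)
    diff = [hp + ri - tp for hp, ri, tp in zip(h_proj, r, t_proj)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Assumed margin ranking loss: push corrupted triples at least `margin` away."""
    return max(0.0, margin + pos_score - neg_score)


# Toy 2-D example; identity matrices reduce the score to plain h + r - t.
identity = [[1.0, 0.0], [0.0, 1.0]]
h, r = [0.0, 1.0], [1.0, 0.0]
t_true, t_corrupt = [1.0, 1.0], [3.0, -1.0]

pos = score(h, r, t_true, identity, identity)     # h + r equals t_true exactly
neg = score(h, r, t_corrupt, identity, identity)  # corrupted tail is far away
print(pos, neg, margin_loss(pos, neg))
```

With identity matrices the score corresponds to the simplest translation model; supplying one shared non-identity matrix or two different matrices corresponds to the single-matrix and two-matrix variants discussed above.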
In addition, some embedding models characterize relationships between entities in the knowledge graph through matrix decomposition or (non)linear operations. The dot product of real-valued vectors is commutative, so models built on it can only handle symmetric relations. The ComplEx model [40] overcomes this limitation by representing every entity and relationship with complex-valued vectors, whose Hermitian product is not commutative. ComplEx [40] can therefore capture symmetric relations between entities, and its representation of asymmetric relations is also significantly better than that of other models, which confirms the importance of complex-valued representations.
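The effect of complex-valued embeddings on symmetry can be seen in a few lines of Python. The sketch below uses the real part of the Hermitian trilinear product as the score, which is the standard ComplEx formulation; the one-dimensional toy embeddings are hypothetical. A relation vector with only real components scores both directions of a pair equally, while a purely imaginary one flips the sign when head and tail are swapped.

```python
def complex_score(head, relation, tail):
    """Real part of sum_k relation_k * head_k * conj(tail_k); higher means more plausible."""
    total = sum(r * h * t.conjugate() for h, r, t in zip(head, relation, tail))
    return total.real


# Hypothetical one-dimensional complex embeddings.
entity_a = [1 + 2j]
entity_b = [3 - 1j]

symmetric_relation = [2 + 0j]      # real-valued: direction does not matter
antisymmetric_relation = [0 + 1j]  # purely imaginary: direction flips the sign

print(complex_score(entity_a, symmetric_relation, entity_b),
      complex_score(entity_b, symmetric_relation, entity_a))
print(complex_score(entity_a, antisymmetric_relation, entity_b),
      complex_score(entity_b, antisymmetric_relation, entity_a))
```

A learned relation vector usually mixes real and imaginary parts, so one model can represent both kinds of relation.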

Multi-Source Spatio-Temporal Data Analysis
A knowledge graph embedding model yields entity and relationship feature vectors that reflect the network structure and semantics. In this paper, we analyzed multi-source spatio-temporal data through link prediction and clustering tasks.
To analyze multi-source spatio-temporal data, we explored the network structure and the practical significance of semantic information. Figure 4 shows the basic framework of the multi-source spatio-temporal data analysis tasks. The first task is link prediction, which reveals the structure of the network by mining potential relationships between entities and reveals semantics by comparing different types of knowledge. For example, if entity 1 has a relationship with entity 2 and entity 2 has a relationship with entity 3, it is possible to predict whether there is a relationship between entity 1 and entity 3. The second task is cluster analysis, which gives a more precise view of the network structure by discovering similar structures in the network and reveals semantics by visualizing different types of knowledge. We adopted the K-means clustering method to understand the structural characteristics of the network in terms of intraclass aggregation and interclass separation.
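For the clustering task, a plain K-means loop (Lloyd's algorithm) over entity embedding vectors might look like the following sketch. It is a generic textbook implementation and not the authors' code; the two-dimensional toy vectors stand in for real entity embeddings, and the seed and iteration limit are arbitrary choices.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels, centroids


# Toy stand-ins for entity embeddings: two well-separated groups.
embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(embeddings, k=2)
print(labels)
```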

Data Description
In this section, we evaluated the performance of the framework on data from Mobike, a bike-sharing service suited to short-distance trips, together with weather, administrative division, POI, station, and grid information in Shanghai. MobikeStation, MobikeGrid, MobikeWeather, MobikeAD, and MobikePOI are the city knowledge graphs formed with the subway station, grid geographic relationship, weather, administrative division, and POI information, respectively. MobikeKG is the city knowledge graph of Shanghai obtained by integrating the various types of multi-source data. The research area is mainly the inner ring of Shanghai, accounting for 25.39% of the total area of Shanghai in 2016. It is divided into 500 × 500 grids, of which 5859 are effective, covering a total of 14 administrative areas: Huangpu, Xuhui, Changning, Jing'an, Putuo, Zhabei, Hongkou, Yangpu, Minhang, Baoshan, Jiading, Pudong New, Songjiang, and Qingpu. The specific data distribution is shown in Figure 5:

Evaluation Metrics
We used link prediction to evaluate the possibility of potential associations and the semantics between network entities in the city knowledge graph. The evaluation indicators are the average ranking of the entities (MeanRank) and the proportion of correct entities ranked in the top 10 (Hit@10). Then, we used the clustering task to understand the network clustering effect and its practical significance. The evaluation indicators are the silhouette coefficient (SC) and the Calinski-Harabasz index (CHI), which measure clustering quality from the intra- and inter-class perspectives.

1. Average ranking of entities (MeanRank):
$$\mathrm{MeanRank} = \frac{1}{n} \sum_{i=1}^{n} f_r(h, t)_i$$
where $n$ is the number of triples and $f_r(h, t)_i$ is the result of the $i$-th triple; the smaller $f_r(h, t)_i$ is, the better. MeanRank is the average of all entity rankings. The smaller the MeanRank, the better the prediction.
2. Proportion of top 10 correct entities (Hit@10):
$$\mathrm{Hit@10} = \frac{\#T}{n}$$
where $\#T$ is the number of correct entities ranked in the top 10, so Hit@10 is the proportion of correct entities in the top 10. The larger the Hit@10, the better the prediction.
3. Silhouette coefficient (SC):
$$\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $a_i$ is the average distance from sample $i$ to the other samples in the same category, $b_i$ is the average distance from sample $i$ to samples in different categories, and $n$ is the total number of samples. The silhouette coefficient ranges over $[-1, 1]$. The closer the samples within a category and the farther apart the samples of different categories, the higher the score.
4. Calinski-Harabasz index (CHI):
$$\mathrm{CHI} = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{m - k}{k - 1}$$
where $B_k$ and $W_k$ are the between-class and within-class covariance matrices, respectively; $\operatorname{tr}$ is the trace of a matrix; $m$ is the number of samples in the training set; and $k$ is the number of categories. The larger the covariance between different categories and the smaller the covariance within the same category, the larger the value of CHI and the better the representation.
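The two clustering indicators can be computed directly from embedding vectors and cluster labels. The sketch below is a small pure-Python illustration of the usual definitions. It assumes Euclidean distance and takes b as the mean distance to the nearest other cluster, which is the standard silhouette convention. The function names and the four toy points are hypothetical, and at least two clusters are required.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def group_by_label(points, labels):
    """Map each cluster label to the list of its member points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_coefficient(points, labels):
    """Mean of (b - a) / max(a, b) over all samples; singletons contribute 0."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz_index(points, labels):
    """tr(B_k) / tr(W_k) * (m - k) / (k - 1) for m samples and k clusters."""
    clusters = group_by_label(points, labels)
    m, k = len(points), len(clusters)
    overall = centroid(points)
    between = within = 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * euclidean(center, overall) ** 2
        within += sum(euclidean(p, center) ** 2 for p in members)
    return (between / within) * (m - k) / (k - 1)


toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
toy_labels = [0, 0, 1, 1]
sc = silhouette_coefficient(toy_points, toy_labels)
chi = calinski_harabasz_index(toy_points, toy_labels)
print(round(sc, 3), round(chi, 3))
```

SC is bounded while CHI is not, so the two values are not on a common scale and are best compared across datasets or models separately.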

Model Parameter Design
In this section, we introduced the relevant parameters of the KGE models.

Hyperparameters.
The hyperparameters of the framework mainly include the learning rate λ, embedding dimension k, number of training epochs, batch size B, margin γ, number of iterations, and number of clusters. In the experiment, we tuned them manually and set the learning rate to 0.001, batch size to 100, embedding dimension to 100, number of training epochs to 500, number of iterations to 1000, and number of clusters to 5.
To select an appropriate embedding dimension, we experimented with different embedding dimensions and compared their impact on link prediction accuracy. In the experiment, we used the TransE embedding model with dimensions of 50, 80, 100, 150, and 200. Figure 6 shows the MeanRank and Hit@10 results for the different embedding dimensions on the MobikeKG dataset. The horizontal axis represents the embedding dimension, and the vertical axis represents the value of each evaluation index. In addition, "filter" indicates that the model is evaluated after the negative samples that already appear in the original data have been removed. Performance is best when the embedding dimension is 150. We therefore selected the more stable embedding dimension of 150 for the subsequent experiments.
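The raw and filtered variants of MeanRank and Hit@10 can be sketched as follows. The code assumes the usual filtered protocol, in which candidates that form other known correct triples are skipped when ranking the true entity; whether this matches the filter used in the experiments exactly is an assumption. The candidate names and scores are hypothetical, and lower scores are treated as better, in line with the distance-based scoring of translation models.

```python
def rank_of_true_tail(scores, true_tail, known_tails=frozenset()):
    """Rank (1 = best) of the true tail among candidates; lower score is better.

    `known_tails` holds other tails that also form correct triples and are
    therefore skipped in the filtered setting.
    """
    true_score = scores[true_tail]
    better = sum(
        1
        for candidate, value in scores.items()
        if value < true_score
        and candidate != true_tail
        and candidate not in known_tails
    )
    return better + 1


def mean_rank(ranks):
    """Average rank over all test triples."""
    return sum(ranks) / len(ranks)


def hits_at(ranks, k=10):
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical scores for every candidate tail of one test triple.
scores = {"grid_a": 0.1, "grid_b": 0.3, "grid_c": 0.5, "grid_d": 0.9}

raw_rank = rank_of_true_tail(scores, "grid_c")
filtered_rank = rank_of_true_tail(scores, "grid_c", known_tails={"grid_a"})
print(raw_rank, filtered_rank)
print(mean_rank([1, 3, 20]), hits_at([1, 3, 20], k=10))
```

Filtered ranks are never worse than raw ranks, which is why filtered MeanRank is lower and filtered Hit@10 is higher on the same model.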

Training.
Table 1 gives a detailed description of the triples and of the training, test, and validation datasets (80%, 10%, and 10% of the total, respectively) for the seven types of travel data in the city knowledge graph:
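An 80/10/10 split of this kind can be produced with a seeded shuffle, as in the minimal sketch below. The function name, the seed, and the synthetic triples are hypothetical; the text does not state how the split was drawn.

```python
import random


def split_triples(triples, train_ratio=0.8, test_ratio=0.1, seed=42):
    """Shuffle once, then cut into train / test / validation parts.

    The validation set receives whatever remains after train and test.
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_test = int(len(shuffled) * test_ratio)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid


# 100 synthetic triples standing in for one dataset.
example_triples = [(f"grid_{i}", "travels_to", f"grid_{i + 1}") for i in range(100)]
train, test, valid = split_triples(example_triples)
print(len(train), len(test), len(valid))
```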

Experiment Results
To analyze multi-source spatio-temporal data realistically from the network structure and semantic perspectives, we divided the experiment into two parts. The first part (see Section 4.4.1) uses link prediction to mine the potential relationships between entities from the network structure and the semantics from different types of knowledge. The second part (see Section 4.4.2) uses the clustering task to understand entity associations in the network through entity similarity and through geometric and geographic visualization.

Analysis from Network Structure Perspective
To verify the validity of the KGE methods, we compared them with the traditional homogeneous node embedding methods DeepWalk and node2vec, as shown in Table 2. Table 2 compares the different embedding methods on the same datasets. The results of the KGE methods are much better than those of the embedding models for homogeneous nodes. The addition of 'knowledge' increases the semantic connections in the network, so KGE methods are superior to embedding methods for homogeneous nodes.

A. Link prediction
To explore the potential relationships between entities in the network structure, we used different KGE methods to analyze network characteristics, as shown in Figure 7. Different KGE methods capture different aspects of the network structure. The experiment used four KGE methods on the MobikeKG datasets to understand the structural characteristics of the knowledge graph network from multiple angles. Figure 7 shows the comparison results obtained via the link prediction task. MeanRank and Hit@10 measure the global and local characteristics of the network structure, respectively. The STransE model performs the worst under MeanRank but better under Hit@10, indicating that STransE is weak at capturing the global features of the network but pays more attention to local characteristics. In addition, the ComplEx model achieves the smallest MeanRank and the largest Hit@10, indicating that there are many asymmetric triples in the KG. The relationship between entities is more affected by the local network structure.

B. Cluster
To understand the entity association in the network from entity similarity, we utilized the cluster task to study the similarity and mine the aggregation state of entities in multi-source spatio-temporal data, as shown in Figures 8 and 9. We first understood the similarity between entities from different dimensionality reduction methods. Second, we explored the influence of different KGE methods on similarity clustering between entities, as shown in Figure 10.

Different clustering dimensionality reduction methods
Different dimensionality reduction methods change the clustering effect between entities from different angles. To understand the aggregation state of entities in the network under different dimensionality reduction methods, the dimensionality of the entity representation vectors is reduced by TSNE, ICA, ISOMAP, LLE, and PCA, and SC and CHI are used to evaluate the results. Experiment results are shown in Figure 8. The radar diagram in Figure 8 shows the clustering evaluation results of the dimensionality reduction methods based on the TransR model. The outer ring represents six dimensionality reduction methods, and colors represent different datasets. Figure 8 shows that LLE has the most prominent effect among the six traditional methods. This may be because LLE is better at capturing local features and entity similarity, so similar entities are generally distributed around a given entity.
In addition, to show the clustering effect of the multi-source spatio-temporal data analysis model clearly, we used geometrical visualization to understand the network structure based on the MobikeKG datasets. Experiment results are shown in Figure 9. The visualization results in Figure 9 show that the linear dimensionality reduction models ICA and PCA separate entities of different structural types well. The nonlinear dimensionality reduction models ISOMAP and TSNE are more concerned with overall data characteristics. The locally linear model LLE preserves the manifold structure of the data and uses local linearity to reflect global nonlinearity, which distinguishes different categories better.

Different KGE methods
Different KGE methods can change the clustering effect between entities. To understand the aggregation state of entities in the network under different KGE methods, the TransE, TransH, TransR, STransE, and ComplEx methods are used to map high-dimensional data to a low-dimensional space. On the basis of the LLE dimension reduction method, the results are evaluated by the clustering indicators SC and CHI. Experiment results are shown in Figure 10. The heat map in Figure 10 shows the clustering evaluation results of the representation vectors of different KGE methods after LLE dimension reduction. The horizontal axis and vertical axis describe different datasets and different KGE methods, respectively. The darker the color is, the better the clustering effect is. Figure 10a shows that the TransE model has the greatest influence on clustering classes among the various KGE methods. Figure 10b shows that several types of embedding models have similar effects on entity similarity. In general, KGE models have less impact on the aggregation state of entities within a class than between classes.

Analysis from Knowledge Semantic Perspective
Different types of 'knowledge' have different effects on the network. To more accurately understand the reality of the city knowledge graph, we used link prediction and cluster task to study the network semantic characteristics from different knowledge, as shown in Figures 11 and 12.

A. Link prediction
To explore the impact of the various types of auxiliary 'knowledge' on the fragmented data of the network, we used four KGE methods to analyze datasets with seven types of external knowledge. Taking the MobikeOD travel network as the baseline, we examined the semantic relationships between potential entities to assess the embedding performance of each auxiliary knowledge network. Figure 11 compares the link prediction results for the various types of 'knowledge'. The horizontal axis shows the datasets, and different colors represent different KGE methods. Figure 11b,d shows that the station knowledge and the full KG enhance the results of link prediction, whereas the other auxiliary types of knowledge reduce the accuracy of link prediction to a certain extent or leave it unchanged. Not all types of auxiliary knowledge can enhance knowledge representation. This may be because, in the full KG, knowledge with a positive effect outweighs knowledge with an inhibiting effect. In summary, 'knowledge' has both positive and negative effects, and the addition of auxiliary knowledge can enhance the association between entities in the travel network structure of residents.

B. Cluster
Different types of 'knowledge' have different degrees of impact on the aggregation state of entities in the network. We used the STransE method and LLE to explore the impact of different 'knowledge' types on the similarity of entities in the network, treating 'knowledge' as a variable, as shown in Figures 12 and 13. Figure 12 shows the clustering evaluation results of the KGE vectors after dimensionality reduction. The horizontal axis shows the seven datasets with different 'knowledge' types. The overall trends of the two evaluation indicators are consistent, and the addition of different knowledge types assists the clustering results to different extents. The weather factor had the least effect, and the POI had the greatest impact. Except for the grid, SC values for the other types of 'knowledge' are greater than or equal to CHI, indicating that the similarity of entities within a class is more important than the similarity between classes.
In addition, to understand the reality of the network, we considered changes in the spatial active areas of the city under different types of auxiliary knowledge, from the perspective of human-land relations in urban geography. The urban spatial active domain represents the spatial mapping of human activities over time. The results of geographic visualization are shown in Figure 13. From these results, we can see:
• The activity of residents in MobikeOD is more dispersed, and the active urban space is not clear. MobikeAD shows that the Pudong New Area can be separated well, and the concentration of the urban space is relatively high. MobikeGrid shows that first-order and second-order associations can ensure regional clustering. The urban spatial active domain is not only limited to administrative divisions but also related to grids.
• MobikePOI shows that the distribution of urban POIs in combination with residents mainly presents a ring-enclosed structure, and the discovery of resident activities from urban spatial active areas is based on the importance of the POIs in the area. The MobikeStation distribution is similar to that of MobikeWeather, which reflects the same degree of influence on clustering results and is consistent with the previous conclusions.
In general, the role of 'knowledge' can be understood from different perspectives by classifying different auxiliary knowledge types, and the interpretability of 'knowledge' can assist some research. In addition, various types of knowledge affect the degree of urban spatial agglomeration and spatial active domain to varying degrees in terms of aggregation degree or distribution position and shape. From the perspective of big data, knowledge can not only assist in network analysis and research, but also make results interpretable.

Perturbation Analysis and Robustness
Actual city perception data are rich in information but contain some noise. To understand the effect of noise on the model, we added noise to the city knowledge graph and analyzed its effect.
In the noise addition experiment, we added Gaussian and Poisson noise to the city knowledge graph. First, we kept the number of entities constant and randomly deleted relationships at ratios of [0, 0.1, 0.3]; then, we randomly deleted entities at ratios of [0, 0.2, 0.3]. Taking the MobikeOD data as an example, the results obtained under four different evaluation metrics are shown in Figure 14.
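The two deletion-based perturbations can be sketched as below. This is one possible reading of the procedure and not the authors' implementation: "deleting relationships" is interpreted as removing a random fraction of triples, and "deleting entities" as removing a random fraction of entities together with every triple that mentions them. How the Gaussian and Poisson noise enters is not specified in the text and is not modeled here. Function names, seeds, and the synthetic chain of triples are hypothetical.

```python
import random


def drop_relations(triples, ratio, seed=0):
    """Remove a random fraction `ratio` of the triples (edges)."""
    triples = list(triples)
    n_remove = int(round(len(triples) * ratio))
    return random.Random(seed).sample(triples, len(triples) - n_remove)


def drop_entities(triples, ratio, seed=0):
    """Remove a random fraction `ratio` of entities and all triples touching them."""
    triples = list(triples)
    entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    n_remove = int(round(len(entities) * ratio))
    removed = set(random.Random(seed).sample(entities, n_remove))
    return [tr for tr in triples if tr[0] not in removed and tr[2] not in removed]


# Synthetic chain of 10 triples over 11 entities.
chain = [(f"grid_{i}", "travels_to", f"grid_{i + 1}") for i in range(10)]

for ratio in (0, 0.1, 0.3):
    print("relation ratio", ratio, "->", len(drop_relations(chain, ratio)), "triples")
for ratio in (0, 0.2, 0.3):
    print("entity ratio", ratio, "->", len(drop_entities(chain, ratio)), "triples")
```

Retraining the embedding model on each perturbed graph and recomputing the evaluation metrics then gives a robustness curve per ratio.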

Conclusions
In this paper, we have proposed an analytical framework for multi-source spatio-temporal data analysis tasks based on KGE methods. We have modeled fragmented urban multi-source spatio-temporal data as three types of networks: the travel network, the knowledge network, and the city knowledge graph. From the network structure perspective, we found that there are many asymmetric triples in the CKG and that entities show local similarity in the network structure. From the knowledge semantic perspective, we found that knowledge can have both positive and negative effects and that it can enhance the semantic relationships in the network, which helps explain the spatial distribution characteristics of cities.
Our future focus will be on multi-source data of different disciplines, and we will more extensively study the important role of knowledge assistance and its true meaning. We hope that this work will enrich the future understanding and promote the continuous advancement of data analysis.