Structured Knowledge Base as Prior Knowledge to Improve Urban Data Analysis

Abstract: Urban computing at present often relies on a large number of manually extracted features. This may require a considerable amount of feature engineering, and the procedure may miss certain hidden features and relationships among data items. In this paper, we propose a method to use structured prior knowledge in the form of knowledge graphs to improve the precision and interpretability in applications such as optimal store placement and traffic accident inference. Specifically, we integrate sub-graph feature extraction, sub-knowledge graph gated neural networks, and kernel-based knowledge graph convolutional neural networks as ways of incorporating large urban knowledge graphs into a fully end-to-end learning system. Experiments using data from several large cities showed that our method outperforms the baseline methods.


Introduction
Nowadays, applications of urban computing often rely on manual feature engineering, which may lead to some latent features being overlooked. For example, it is usually necessary to construct and combine complex features for machine learning tasks in urban computing. However, the complexity of applications and the different modalities of urban data make feature construction extremely challenging. Moreover, most learning-based approaches cannot provide explanations of their prediction results. Data generated from sensors and social media in cities contain millions of concepts that are understood by humans. Each region in a large city contains some hidden and inherent knowledge (e.g., demographics, points of interest, and so on). When shown only a few items of knowledge pertaining to a region, humans can make a satisfactory assessment of the area. In contrast, modern learning-based approaches usually require thousands of labeled instances with complicated feature engineering. We therefore combine prior urban knowledge with available learning approaches.
Recently, structured knowledge representations such as knowledge graphs [1,2] have come to play an important role in search engines. Urban knowledge graphs generated from historical experience, geography, and common sense can play an unexpectedly important role in practical applications. For instance, Figure 1 illustrates how we might use prior knowledge of a region in the optimal store placement problem. If we want to open a new restaurant, we might know the target region's function type (e.g., business area, range of rent), and that it is near a subway station (transportation), a university, a film center, and so on. In this paper, we exploit structured urban knowledge and reasoning to solve practical problems in urban computing. A number of studies have lately been devoted to building a knowledge graph (KG) that can be useful for machine learning, such as the Never Ending Image Learner [3], the Never Ending Language Learner [4], WordNet [5], Wikidata [6], and ConceptNet5 [7]. However, most of these KGs are noisy and contain little relevant urban knowledge. Therefore, we build our urban-specific KG from both raw data and existing knowledge bases. Even so, effectively extracting features from the graph remains a problem.
To tackle a practical urban computing problem, we need to retrieve three kinds of knowledge from an urban KG: (1) the definition of the problem (e.g., what is a good location for a store?); (2) the measurement of the problem (e.g., why does this store attract more consumers?); (3) specific knowledge of the location (e.g., what features does this location have?). A considerable amount of research is available on graph feature extraction. We classify graph features into three major kinds: (1) Global features, which usually contain the global information of a graph. Random walk-based approaches, such as the path ranking algorithm [8] and sub-graph feature extraction (SFE) [9], can easily generate features from graphs given two nodes; (2) Propagation features, which usually learn evidence between nodes conditional on the type of relations. An example is the gated graph neural network (GGNN) [10], which takes an arbitrary graph as input and learns how to propagate information given the annotation specific to the task; (3) Local features, which usually represent node-specific features. The graph convolution neural network (GCNN) [11] learns convolutional neural networks for arbitrary graphs based on kernels.
Our work improves on the above models and adapts an end-to-end graph neural network to the urban computing problem. We propose an urban knowledge graph neural network (UKG-NN) that uses features from raw urban data and graph features extracted from an urban KG. SFE is first used to extract global features from the graph to obtain the definition of the problem. Following this, for each region node (a node representing a task-specific region) in the urban KG, we extract sub-graphs and use them as inputs for the GGNN to obtain knowledge of the measurement of the problem. Node-specific features, which represent the specific knowledge of a location, are extracted based on graph kernels. Finally, we feed the three kinds of graph features and the features from raw data into a neural network. We show that our UKG-NN model is effective at reasoning about concepts to improve urban computing tasks. Importantly, our model can provide some types of explanations by following how information propagates in the graph.
The major contributions of this work are as follows: (1) We propose a method to learn a structured knowledge representation from raw urban data and build an urban knowledge graph containing domain prior knowledge. This knowledge is helpful for decision-making when there are few instances, and may help improve the prediction of other applications through transfer learning. UKG-NN employs a convolution-based neural network that considers global, propagation-specific, and locale-specific features automatically generated from an urban knowledge graph, as well as manually extracted features from raw urban data. (2) We apply the UKG-NN to optimal store placement and traffic accident inference using a noisy urban knowledge graph and manually extracted features. (3) The UKG-NN can explain some results of store placement and traffic accident inference; model interpretability is a requirement in many applications in which users make crucial decisions relying on a model's outputs. (4) We evaluated our method using real user comments, taxicab data, meteorological data, and human flow data in Shanghai. The results showed that our approach outperforms the baseline methods.

Urban Computing
In recent years, many applications have been proposed for different scenarios of urban data analysis, including transportation, the environment, energy, society, the economy, and public safety and security [12][13][14][15][16][17][18]. For example, a number of researchers have studied the store placement problem using various techniques, such as multiple regression discriminant analysis, spatial interaction models, and so on [19]. The spatial interaction model is based on the assumption that the strength of the interaction between positions decreases with their distance, and that the availability of locations increases with the strength of use and the proximity of complementary, temporally arranged positions [20,21]. However, a location does not attract only the residents near it. Multiple regression discriminant analysis [22] is a location analysis technique that has been employed to produce a series of sales forecasts for both new and existing stores. Specifically, Ref. [23] studied the predictive power of various machine learning features on the popularity of retail stores in a city using a dataset collected from the Foursquare mall in New York. Ref. [24] proposed "Antenna Virtual Placement" (AVP), a method to geolocate mobile devices according to their connections to Base Transceiver Stations (BTS), which has implications for the design of applications such as the recommendation of places and routes, retail store placement, and so on. Ref. [25] studied an analytical approach to selecting expansion locations for retailers selling add-on products whose demand is derived from the demand for another base product. Ref. [26] employed the "Retail Gravity Model" to identify crucial retail variables (for example, supply chain and operations management) that can assist the site selection decisions of fashion retailers in Hong Kong. Ref. [27] used regional relevance analysis and human mobility construction to establish the feature values for retail store recommendation. Ref. [28] focused on locating ambulance stations by using real traffic information to minimize the average travel time to service emergency requests. On the other hand, many researchers have proposed methods for analyzing traffic accidents [29,30]. These focus on detecting traffic accident hotspots. Past work on optimal location finding and traffic accident inference did not consider structured urban knowledge and cannot explain the prediction results.

Knowledge Graph
Learning a knowledge graph and using it for machine learning tasks has attracted recent research interest [31]. For example, Ref. [32] collected a knowledge base and queried it for first-order probabilistic reasoning to predict affordances. However, none of these approaches involves learning in an end-to-end manner, and the propagation model on the graph is mostly handcrafted. Ref. [9] defined a simpler algorithm, called SFE, for generating feature matrices from graphs. Ref. [10] studied feature learning techniques for graph-structured inputs and modified them to use gated recurrent units and modern optimization techniques. They then extended this model, called GGNN, to output sequences. Ref. [11] presented a general approach to extracting locally connected regions from graphs. Our work improves on these models and uses end-to-end graph neural networks to construct features from an urban KG.

Overview
As Figure 2 shows, we first preprocess raw urban data and construct a structured representation of the urban data (the urban KG) in combination with existing external KGs (e.g., Wikidata, ConceptNet5, and so on), based on a pre-defined urban KG OWL (Web Ontology Language) ontology. Then, global, propagation-specific, and local-specific features are automatically generated from the urban KG and fed into the neural network. We also construct several simple manual features and feed them into the neural network. After several pre-training iterations, dense layers, and a fusion layer, we obtain the final results. The global and propagation-specific features can help explain the prediction results.

Notations and Problem Formulation
Definition 1 (Region). In this study, we partition a city into $I$ regions based on the road network [33].
Using these raw data and the external KGs, the urban KG $G = (V, E)$ is constructed. Manual features $F_{i,t}$, $i \in I$, at the $t$-th time interval are extracted for all $I$ regions. We use the number of consumers $y^{(1)}_{i,t}$ in each region as the measurement for optimal store placement, and the traffic accident occurrence $y^{(2)}_{i,t}$ as the measurement for traffic accident inference.

Problem 1. Define $f^{(1)}_{i,t} \in F_{i,t}$ as the store placement-related feature sets; $I_l$ and $I_u$ are the labeled region set and the unlabeled region set of optimal store placement, respectively, with $I = I_l \cup I_u$. Given the historical observations $\{f^{(1)}_{i,t} \mid t = 0, 1, 2, \ldots, n-1,\ i \in I_l\}$ and $G$, we need to predict $\{y^{(1)}_{i,t} \mid i \in I_u,\ t = n\}$.
Problem 2. Define $f^{(2)}_{i,t} \in F_{i,t}$ as the traffic accident inference-related feature sets. Given the historical observations $\{f^{(2)}_{i,t} \mid t = 0, 1, 2, \ldots, n-1,\ i \in I\}$ and $G$, we need to predict $\{y^{(2)}_{i,t} \mid i \in I,\ t = n\}$.
As Figure 3 depicts, the predictions of our approach for both optimal store placement and traffic accident inference can be easily mapped and displayed on city maps to support decision-making by smart-city administrators. Moreover, supplementary information derived from the urban KG can explain the reasons for the predictions.

Urban Knowledge Graph
We preprocess the raw urban data to obtain raw data for each region. Based on the urban OWL, we discretize and semantically process the data. For example, given the real estate price $p$ of region $i \in I$, we discretize the data into triples such as {Region_i, Has_Real_Estate_Price, RealEstateHigh} (if $p$ > 60,000). The triples are connected to form the urban KG. We build a preliminary urban KG from the raw urban data, and then collect nodes in the external KGs that are not among our output labels but directly connect to them and are thus likely to be relevant; we add these nodes to a combined graph. We then take all edges between these nodes and add them to the combined graph. Moreover, we consider the differences in shop locations and carry out spatial analysis in a GIS environment as part of creating the urban knowledge base of the city. All shop locations are added to the KG as attribute nodes. Finally, we obtain the urban KG, which is used for automatic graph feature extraction in Section 3.5.
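The discretization step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the paper: only the 60,000 threshold and the names Has_Real_Estate_Price and RealEstateHigh come from the text, while the RealEstateLow label, the function names, and the sample prices are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Threshold taken from the text; the complementary "RealEstateLow"
# label below is a hypothetical name, not one given in the paper.
HIGH_PRICE_THRESHOLD = 60_000


def price_to_triple(region_id: str, price: float) -> Triple:
    """Discretize a raw real estate price into a KG triple."""
    level = "RealEstateHigh" if price > HIGH_PRICE_THRESHOLD else "RealEstateLow"
    return (f"Region_{region_id}", "Has_Real_Estate_Price", level)


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Connect triples into an adjacency list: head -> [(relation, tail)]."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])  # make sure attribute nodes exist too
    return graph


# Hypothetical prices for two regions.
triples = [price_to_triple("12", 72_500.0), price_to_triple("13", 41_000.0)]
graph = build_graph(triples)
```

Nodes collected from an external KG could be merged into the same adjacency list by appending their triples before calling `build_graph`.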

Manual Feature Extraction
With the raw urban data, we also manually construct some simple features.
Function of the region. The most prominent Point of Interest (POI) category $f_f$ of the region itself and its surroundings is used to denote the function of the region.
Traffic convenience. The number of bus and subway stations $f_t$ is used to denote traffic convenience.
Real estate price nearby. This is the average price $f_e$ of the nearest five pieces of real estate within 2 km, and is used to estimate price.
Popularity of specific area category. The set of all POIs of category $C$ is defined as $P_C$. The popularity of a specific area category is then the average number of consumers over all POIs in $P_C$ whose region function is $f_f$. Formally, $f_p = \frac{1}{|P_C|} \sum_{p \in P_C} \mathrm{cum}(p)$, where $\mathrm{cum}(p)$ is the number of consumers of POI $p$.
Competition. We define the number of POIs belonging to category $C$ in region $i$ as $N_c(i)$, and the total number of POIs in $i$ as $N(i)$. Competition is then defined as the ratio of $N_c(i)$ to $N(i)$, that is, $f_c = \frac{N_c(i)}{N(i)}$.
Area popularity. Consumers at places in POI $j$, $j \in i$, are potential consumers for a new store at $i$. The popularity of an area $i$ is defined as $f_a = \sum_{p \in j} \mathrm{cum}(p)$.
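The three count-based features above (category popularity, competition, and area popularity) can be computed as in the following sketch. The `POI` record, its field names, and the sample values are hypothetical, and the category popularity here averages over all POIs of a category without the additional restriction on region function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class POI:
    category: str   # POI category C
    region: int     # index i of the region containing the POI
    consumers: int  # cum(p), number of consumers observed at the POI


def competition(pois: List[POI], region: int, category: str) -> float:
    """f_c = N_c(i) / N(i): share of POIs in region i that belong to category C."""
    in_region = [p for p in pois if p.region == region]
    if not in_region:
        return 0.0
    same_category = sum(1 for p in in_region if p.category == category)
    return same_category / len(in_region)


def area_popularity(pois: List[POI], region: int) -> int:
    """f_a: total number of consumers over the POIs located in region i."""
    return sum(p.consumers for p in pois if p.region == region)


def category_popularity(pois: List[POI], category: str) -> float:
    """f_p: average number of consumers over all POIs of category C."""
    members = [p for p in pois if p.category == category]
    if not members:
        return 0.0
    return sum(p.consumers for p in members) / len(members)


# Hypothetical sample data.
pois = [
    POI("coffee shop", 1, 120),
    POI("coffee shop", 1, 80),
    POI("express inn", 1, 40),
    POI("coffee shop", 2, 200),
]
```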

Graph Feature Extraction
Sub-Graph Feature Extraction (SFE). Given the urban KG $G$, we adopt SFE to obtain global knowledge. The nodes in $G$ are first classified into categories $G = \{A_0, A_1, A_2, \ldots, A_n\}$ according to the problem. For example, for traffic accident inference there are two categories: nodes with attribute 0 (no accident) and nodes with attribute 1 (accident). Then, for a node $v_i$ with region type $g_i$, we compute the feature paths to all nodes in the same category $g_i$. Specifically, for node $v_i$, the global features $H_g$ are computed from $\mathrm{sfe}(v_i, v_j)$ over the nodes $v_j$ that belong to the same category as $v_i$ with $i \neq j$, where $n$ is the total number of nodes in the category containing $i$ and $\mathrm{sfe}(v_i, v_j)$ computes the feature matrices from the graph through the paths from $v_i$ to $v_j$. These features capture the definition of the problem. Following this, the extracted features $H_g$ are fed into a two-layer fully connected neural network with activation $f(\cdot)$, the ReLU function, i.e., $f(x) = \max(0, x)$.
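To make the idea of path-based global features concrete, the sketch below enumerates the relation sequences ("path types") that connect two nodes and turns them into a binary feature vector. It is a simplified stand-in for SFE [9], not the algorithm itself: it uses a bounded depth-first enumeration, and the inverse relation names (`weatherTypeOf`, `regionTypeOf`), the `max_len` parameter, and the toy graph are assumptions.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(relation, neighbor)]
PathType = Tuple[str, ...]                # sequence of relation names


def path_types(graph: Graph, source: str, target: str, max_len: int = 3) -> Set[PathType]:
    """Collect relation sequences of simple paths from source to target."""
    found: Set[PathType] = set()
    stack = [(source, (), frozenset([source]))]  # (node, relations so far, visited)
    while stack:
        node, relations, visited = stack.pop()
        if node == target and relations:
            found.add(relations)
            continue
        if len(relations) == max_len:
            continue  # do not extend paths beyond the length bound
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                stack.append((neighbor, relations + (relation,), visited | {neighbor}))
    return found


def to_feature_vector(found: Set[PathType], vocabulary: List[PathType]) -> List[int]:
    """Binary indicator for each path type in a fixed vocabulary."""
    return [1 if path_type in found else 0 for path_type in vocabulary]


# Toy graph: two regions share a weather type; only one has a region type.
graph: Graph = {
    "Region_1": [("hasWeatherType", "Rainy"), ("hasRegionType", "Business")],
    "Region_2": [("hasWeatherType", "Rainy")],
    "Rainy": [("weatherTypeOf", "Region_1"), ("weatherTypeOf", "Region_2")],
    "Business": [("regionTypeOf", "Region_1")],
}
vocabulary = [("hasWeatherType", "weatherTypeOf"), ("hasRegionType", "regionTypeOf")]
features = to_feature_vector(path_types(graph, "Region_1", "Region_2"), vocabulary)
```

Because each feature corresponds to a named relation sequence, a large weight on a feature can be read back as a relation path.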

Sub-Knowledge Graph Gated Neural Network (s-GGNN).
In contrast to the SFE, we use sub-graphs of $G$ to extract features because the scale of the graph is very large. We generate a sub-graph $g_i$ for each region-type node $i$, in which all nodes and edges farther than $\gamma$ from $i$ are removed. For each node $v$ in graph $g_i$, we have a hidden state representation $h^t_v$ at every time step $t$. We start at $t = 0$ with initial hidden states $x_v$ that depend on the problem. For instance, to learn graph reachability, this might be a two-bit vector that indicates whether a node is the source or the destination node. We then use the structure of our graph, encoded in matrix $A$, to retrieve the hidden states of adjacent nodes based on their edge types. The hidden states are then updated by a gated update module similar to an LSTM. After $T$ time steps of the propagation network, our final hidden states are obtained. The node-level outputs can then simply be computed as $o_v = g(h^T_v, x_v)$, where $g$ is a fully connected network (the output network) and $x_v$ is the original annotation for the node.
The features $F_G$ of the sub-graph are computed from the node-level outputs, where $m$ is the total number of nodes of $g_i$. Following this, the extracted features $F_G$ are fed into a two-layer fully connected neural network, which yields the propagation-specific representation $H_p$.
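The propagation loop described above can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the trained model: it uses a GRU-style gate as in the GGNN formulation [10], a single edge type with one shared adjacency matrix, and hand-written weight matrices whose names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ggnn_step(adj: Matrix, hidden: List[Vector], params: Dict[str, Matrix]) -> List[Vector]:
    """One propagation step: aggregate neighbor states, then apply a gated update."""
    updated = []
    for v, h in enumerate(hidden):
        # Message a_v: adjacency-weighted sum of the neighbors' hidden states.
        a = [sum(adj[v][u] * hidden[u][k] for u in range(len(hidden))) for k in range(len(h))]
        z = [sigmoid(p + q) for p, q in zip(matvec(params["Wz"], a), matvec(params["Uz"], h))]
        r = [sigmoid(p + q) for p, q in zip(matvec(params["Wr"], a), matvec(params["Ur"], h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(p + q) for p, q in zip(matvec(params["W"], a), matvec(params["U"], gated))]
        updated.append([(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)])
    return updated


def propagate(adj: Matrix, annotations: List[Vector], params: Dict[str, Matrix], steps: int) -> List[Vector]:
    """Run T propagation steps starting from the node annotations x_v."""
    hidden = [list(x) for x in annotations]
    for _ in range(steps):
        hidden = ggnn_step(adj, hidden, params)
    return hidden


# Two connected nodes with two-bit annotations (source / destination).
adj = [[0.0, 1.0], [1.0, 0.0]]
annotations = [[1.0, 0.0], [0.0, 1.0]]
small = [[0.1, 0.0], [0.0, 0.1]]
params = {name: small for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
final_hidden = propagate(adj, annotations, params, steps=3)
```

A separate output network would then map each final hidden state, together with the original annotation, to a node-level output.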

Kernel-Based Knowledge Graph Convolution Neural Network (GCNN).
Figure 4 illustrates the Knowledge Graph Convolution Neural Network architecture. Given the sub-graph $g_i$ for each region-type node $i$, a receptive field needs to be constructed. The nodes of the neighborhood are candidates for the receptive field. Given as inputs a node $v$ and the size of the receptive field $k$, the system performs a breadth-first search, exploring vertices with increasing distance from $v$, and adds these vertices to a set $N$. If the number of collected nodes is smaller than $k$, the one-neighborhood of the vertices most recently added to $N$ is collected, and so on, until at least $k$ vertices are in $N$ or there are no more neighbors to add. Note that, at this point, the size of $N$ may differ from $k$. The receptive field for a node is constructed by normalizing the neighborhood assembled in the previous step. The normalization imposes an order on the nodes of the neighborhood graph to map from the unordered graph space to a vector space with a linear order. The basic idea is to leverage graph labeling procedures that assign nodes of two graphs to a similar relative position in the respective adjacency matrices if and only if their structural roles within the graphs are similar. Finally, we obtain the local feature $H_l$ of node $i$ and feed it into a two-layer convolutional network, in which $*$ denotes convolution and $g(z) := \max(0, z)$ is the activation function [34].
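The neighborhood assembly and normalization steps can be sketched as below. The breadth-first collection follows the description above; the ordering used in `normalize` (root first, then by descending degree, then by name) is only a stand-in for a real graph labeling procedure, and the padding token and toy graph are assumptions.

```python
from typing import Dict, List

Adjacency = Dict[str, List[str]]


def collect_neighborhood(adj: Adjacency, v: str, k: int) -> List[str]:
    """Breadth-first collection until at least k nodes are found or none remain."""
    collected = [v]
    seen = {v}
    frontier = [v]
    while len(collected) < k and frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in adj.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        collected.extend(next_frontier)  # whole one-neighborhood is added at once
        frontier = next_frontier
    return collected


def normalize(adj: Adjacency, nodes: List[str], k: int, pad: str = "<pad>") -> List[str]:
    """Impose a linear order and force the receptive field to exactly k slots."""
    root, rest = nodes[0], nodes[1:]
    # Stand-in labeling: higher degree first, ties broken by node name.
    ranked = sorted(rest, key=lambda n: (-len(adj.get(n, [])), n))
    field = ([root] + ranked)[:k]
    return field + [pad] * (k - len(field))


# Toy sub-graph around a region node "r".
adj: Adjacency = {
    "r": ["a", "b"],
    "a": ["r", "c", "d"],
    "b": ["r"],
    "c": ["a"],
    "d": ["a"],
}
neighborhood = collect_neighborhood(adj, "r", 4)
receptive_field = normalize(adj, neighborhood, 4)
```

With `k = 4` the search collects five nodes, which illustrates why the collected set can differ in size from `k` before normalization truncates or pads it.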

Model
To incorporate the graph features into a regression pipeline, we concatenate the three kinds of features in an early fusion layer to capture the global, propagation, and local patterns together. Afterwards, we stack a fully connected layer. For the manually extracted features, we first pretrain on them with a denoising autoencoder [35]. The hidden representations are then passed through two fully connected layers with ReLU as the activation function. Finally, all features are fed into a late fusion layer, which is better suited to fusing data from different domains. In our implementation, we used external factors (i.e., the function of the region) as global features. In the late fusion layer, the predicted tensor $\hat{Y}_r$ is obtained by applying the activation $f$ to a combination, weighted by $W^{(5)}$, of the hidden representation $H^{(4)}$, the global feature vector $G_r$ of region $r$, and $Z$, the output of three fully connected layers applied to the manually extracted feature vector. We use the mean squared error as the loss function for the regression task and the cross-entropy error for the classification task. To improve robustness and obtain better results, we adopt boosting using XGBoost [36].
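The two fusion stages can be illustrated with a small forward pass. This is a sketch under assumptions rather than the architecture of the paper: early fusion is taken to be plain concatenation, late fusion is taken to be a ReLU over the sum of two linear maps, and all weights, dimensions, and function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(vector: Vector) -> Vector:
    return [max(0.0, x) for x in vector]


def dense(weights: Matrix, bias: Vector, vector: Vector) -> Vector:
    """Fully connected layer without activation: W v + b."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def early_fusion(h_global: Vector, h_propagation: Vector, h_local: Vector) -> Vector:
    """Concatenate the global, propagation, and local graph features."""
    return h_global + h_propagation + h_local


def late_fusion(w_graph: Matrix, w_manual: Matrix, graph_hidden: Vector, manual_hidden: Vector) -> Vector:
    """Combine the graph branch and the manual-feature branch, then apply ReLU."""
    from_graph = dense(w_graph, [0.0] * len(w_graph), graph_hidden)
    from_manual = dense(w_manual, [0.0] * len(w_manual), manual_hidden)
    return relu([g + m for g, m in zip(from_graph, from_manual)])


# Hypothetical feature vectors and weights.
fused = early_fusion([0.2], [0.5, -0.1], [0.3])
graph_hidden = relu(dense([[1.0, 1.0, 1.0, 1.0], [-1.0, 0.0, 0.0, 0.0]], [0.0, 0.0], fused))
manual_hidden = [1.0, 2.0]
prediction = late_fusion([[1.0, 0.0]], [[0.5, -0.5]], graph_hidden, manual_hidden)
```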

Datasets
We used real urban data from Shanghai, including traffic and weather data, from 1 March 2015 to 1 March 2016 [37], together with data obtained through the Baidu API. We partitioned Shanghai into 2135 regions. For the optimal store placement task, we used desensitized user location data from China Telecom near a POI as consumer data. We treated users whose records stayed at the POI for more than half an hour as consumers. We used the total number of consumers in the previous two months to measure whether a given store was attractive. We used two categories of stores, express inns and coffee shops, to evaluate the performance of our framework. For each category, the test data consisted of two brands of stores, to eliminate ranking bias and avoid over-fitting. Specifically, the training datasets were all POIs belonging to the categories "coffee shop" and "express inn" in Shanghai, except the POIs in the test data. For express inns, "Home-Inn" was excluded from the training set. For the traffic accident inference task, we used data from 1 March 2015 to 28 February 2016 as the training set, data from 29 February 2016 as the validation set, and data from 1 March 2016 as the test set. We used external KG data from Wikidata [38] and ConceptNet5 [39].
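The consumer-labeling rule (a stay of more than half an hour at a POI) can be expressed as in the sketch below. The record layout `(poi, arrive, leave)`, the POI identifiers, and the timestamps are assumptions; the actual format of the location data is not described here.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Visit = Tuple[str, datetime, datetime]  # (poi identifier, arrival, departure)

MIN_STAY = timedelta(minutes=30)  # "more than half an hour", from the text


def count_consumers(visits: List[Visit]) -> Dict[str, int]:
    """Count, per POI, the visits that lasted longer than the minimum stay."""
    counts: Dict[str, int] = {}
    for poi, arrive, leave in visits:
        if leave - arrive > MIN_STAY:
            counts[poi] = counts.get(poi, 0) + 1
    return counts


# Hypothetical visit records.
visits = [
    ("poi_a", datetime(2016, 1, 5, 9, 0), datetime(2016, 1, 5, 9, 45)),
    ("poi_a", datetime(2016, 1, 5, 10, 0), datetime(2016, 1, 5, 10, 10)),
    ("poi_b", datetime(2016, 1, 6, 14, 0), datetime(2016, 1, 6, 15, 0)),
]
consumers = count_consumers(visits)
```

Summing these counts over the previous two months would give the attractiveness measure used as the store placement label.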

Baselines
We compared our proposed method with these eight baselines:

• Linear. We used the linear regression algorithm Lasso, where the regularization parameter was $10^{-2}$.
• RF. RF stands for random forests. The number of trees was 10, and the minimum number of samples required to split an internal node was set to 2. The function to measure the quality of a split was the mean squared error.
• GBR. GBR is short for gradient boosting for regression, where the loss function to be optimized was least-squares regression, and the number of boosting stages was 100.
• SVR (Support Vector Regression). The kernel function we used was an RBF. The penalty parameter of the error term was set to 1.0, and the kernel coefficient was set to 0.1.
• SVC (Support Vector Classification). The kernel function we used was also an RBF. The penalty parameter of the error term was set to 1.5 and the kernel coefficient to 0.2.
• LambdaMART. The number of boosting stages was 100, and the learning rate was 0.1; the minimum loss required to make a further partition was 1.0.
• Huff Gravity Model. We used both manual features and graph features in the Huff Gravity Model.
• Geo-spotting. We used both manual features and graph features in the Geo-spotting model.
Table 1 shows the features of each baseline model.

Effectiveness of Urban KG:
To determine whether the UKG-NN can learn a better model from less data, we gradually added training data. As Figure 5 shows, as the number of training samples increased, the error of the UKG-NN dropped more rapidly than that of the baselines for both the optimal store placement and traffic accident inference tasks. The UKG-NN obtained better results even with little training data.

Explanations of Results:
For region $i$, we obtain the prediction $y^{(2)}_i = 1$ at 9:00 a.m., which means that this region might have a traffic accident. As Table 4 shows, the paths between the predicted region node and the label node with the largest weights involved relations such as "hasWeatherType", "hasSubwayInFlow", and "hasRegionType", which indicate the possible causes of this result. Figure 6 shows part of the nodes (entities) and paths (relations) of the graph. The mean weight of each path is the average of the weights of the paths (e.g., $\mathrm{MeanWeight} = \frac{1}{10} \sum_{i=1}^{10} w_i$). As Table 5 shows, "WeatherType" and "SubwayInFlow" were the nodes with the largest weights, which indicates that these data had a relatively strong effect on the final prediction results. The explanation of the results for the optimal store placement task is similar.
For region $i$, we obtain the prediction $400 < y^{(1)}_i < 500$ at 8:00 p.m., which means that this region is popular. As Table 6 shows, the paths between the predicted region node and the label node with the largest weights involved relations such as "hasNeighbourRegion", "hasSubwayInFlow", and "hasRegionType", which indicate the possible causes of this result. In fact, the subway may bring many passengers, who tend to be potential customers. As Table 7 shows, "RegionType", "HumanFlow", and "SubwayInFlow" were the nodes with the largest weights, which indicates that these data had a relatively strong effect on the final prediction results.
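Ranking relations by their mean path weight, as used in the explanations above, amounts to a grouped average followed by a sort. In the sketch below, the weights are invented for illustration and are not values from Tables 4 to 7; only the relation names come from the text.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_relations(path_weights: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Average the weights of the paths per relation and sort in descending order."""
    means = {relation: mean(ws) for relation, ws in path_weights.items() if ws}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical path weights grouped by relation.
path_weights = {
    "hasWeatherType": [0.8, 0.6],
    "hasRegionType": [0.2, 0.4],
    "hasSubwayInFlow": [0.5, 0.5],
}
ranking = rank_relations(path_weights)
```

The relations at the top of `ranking` are the ones that would be reported as the most likely causes of a prediction.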

Ablation Study
To analyze the effect of the different components of our framework, we also report an ablation test of the UKG-NN using different setups of the network. The experimental results are summarized in Table 8 for optimal store placement and in Table 9 for traffic accident inference. Generally, all of the proposed components, including Boost, GCNN, SFE, s-GGNN, and FC (Fully Connected Layer), contribute to the effectiveness of the UKG-NN, which integrates all of them.

Conclusions
In this study, we built an urban knowledge graph containing domain prior knowledge. This knowledge is helpful for decision-making tasks such as human flow or traffic monitoring and guidance when there are few instances, and may help improve the prediction of other applications through transfer learning. We proposed the UKG-NN, which employs convolution-based neural networks that consider global, propagation-specific, and locale-specific features automatically generated from an urban knowledge graph, together with manually extracted features from raw urban data. We discussed two case studies based on the implementation of our framework and obtained interesting results. In addition, we tested our proposed approach, and the results showed that it is practical and can provide explanations for predictions; model interpretability is a requirement in many applications in which users make crucial decisions relying on a model's outputs.
There are some limitations to this study, which should be addressed in future work. A major one lies in the partially missing data when constructing the urban KG. For example, some data cannot be represented in the urban KG because the OWL lacks a suitable representation and there is no procedure for discretizing them. We would like to mine the data for regions more deeply and try a spatial interpolation approach in the future to reduce data loss. The adaptability of this approach to real-world circumstances will also be considered in future work. Some visual analytics functions will be added to our ongoing demonstration system. By presenting similar historical circumstances or forecasting results according to different features, the system will be able to provide more information for flexible decision-making. We are also investigating a new model that utilizes data from similar historical circumstances to understand the underlying semantics. We plan to apply our approach to additional applications such as urban smog prediction and urban road planning. Moreover, we aim to study a distributed version of our framework to enable it to process very large amounts of data.

Figure 1. Example of how semantic knowledge of a region aids store placement.

Figure 3. Visualization of predictions for smart cities.

Figure 4. Knowledge Graph Convolution Neural Network. A node sequence is selected from a graph via a graph labeling procedure. For some nodes in the sequence, a local neighborhood graph is assembled and normalized. The normalized neighborhoods are used as receptive fields and combined with existing Convolution Neural Network (CNN) components.

Figure 5. Effectiveness of the urban KG: error vs. number of training samples.

Figure 6. Part of the urban knowledge graph for the traffic accident inference task.

Table 2. Results for Starbucks and Home-Inn.

Table 3. Results of traffic accident inference.

Table 4. Explanations of the results based on global graph features for traffic accident inference.

Table 5. Explanations of the results based on propagation graph features for traffic accident inference.

Table 6. Explanations of the results based on global graph features for optimal store placement.

Table 7. Explanations of the results based on propagation graph features for optimal store placement.

Table 8. Ablation study for optimal store placement.

Table 9. Ablation study for traffic accident inference.