Multi-Modal Graph Interaction for Multi-Graph Convolution Network in Urban Spatiotemporal Forecasting

Graph convolution network based approaches have recently been used to model region-wise relationships in region-level prediction problems in urban computing. Each relationship represents a kind of spatial dependency, such as region-wise distance or functional similarity. To incorporate multiple relationships into spatial feature extraction, we define the problem as a multi-modal machine learning problem on multi-graph convolution networks. Leveraging the advantages of multi-modal machine learning, we develop modality interaction mechanisms for this problem in order to reduce generalization error by reinforcing the learning of multi-modal coordinated representations. We propose two interaction techniques, for handling features in lower layers and higher layers respectively. In lower layers, we propose grouped GCN to combine the graph connectivity from different modalities for more complete spatial feature extraction. In higher layers, we adapt multi-linear relationship networks to GCN by exploring the dimension transformation and freezing part of the covariance structure. The adapted approach, called multi-linear relationship GCN, learns more generalized features to overcome the train-test divergence induced by time shifting. We evaluate our model on the ride-hailing demand forecasting problem using two real-world datasets. The proposed technique outperforms state-of-the-art baselines in terms of prediction accuracy, training efficiency, interpretability and model robustness.


INTRODUCTION
The deployment of urban sensor networks is one of the most important advances in the urban digitization process. Recent advances in sensor technology enable the collection of a large variety of datasets. Multi-modality is one of the most significant features of the knowledge discovery process in urban computing, since data from different sources are often correlated with each other. For region-level prediction problems, like crowd flow prediction [29,30] or taxi demand prediction [6,11,21], it has become common practice to incorporate a large variety of auxiliary datasets, like weather, POI, road network and events. In this paper, we define each auxiliary dataset as a modality and study multi-modal learning on multi-graph convolution networks (MGCN) for spatiotemporal prediction problems in urban computing. This task is challenging due to complex spatial dependencies and the temporal shifting generalization gap.
Designing a spatial feature extraction method is challenging due to complex region-wise spatial dependencies. GCN-based models [15,23] were first used for traffic prediction on road networks. Geng et al. [6] proposed Multi-GCN (MGCN) for generic spatiotemporal prediction tasks by stacking three GCNs. Each GCN encodes a unique modality (relationship) of auxiliary data (geo-distance, POI similarity and road network) as a graph and extracts spatial dependencies from that relationship. The spatial feature extraction of the MGCN architecture is incomplete, due to the lack of cross-graph connectivity. Figure 1 shows an example for MGCN. Consider the vertex (region) pair A and D. According to the graph topology, A and D are disconnected in all three graphs, so MGCN is incapable of extracting features from D for A, or vice versa. However, we argue that the A−D relationship is important. The region pairs $A_3$−$B_3$ and $B_2$−$D_2$ are closely related on road connectivity and POI similarity. A and D are related region pairs for spatial feature extraction. To complete the physical meaning of spatial feature extraction by MGCN, the ideal graph connectivity is shown in Figure 1(b). It is produced by merging all edges from the separate graphs, so that a random walk path can be a compound of any kinds of relationships.
Improving model generality to overcome the temporal shifting generalization gap is another challenging task. The temporal pattern of time series data varies with time. Formally,

$$P(X_t \mid X_{t-1}, X_{t-2}, \ldots) \neq P(X_{t'} \mid X_{t'-1}, X_{t'-2}, \ldots), \quad t \neq t'$$

The gap above defines the divergence between temporal pattern distributions in two different time windows $t$ and $t'$. Such a time shifting gap is often caused by time series fluctuations induced by periodicity, seasonality or miscellaneous factors like weather variation or events. We further discovered that this gap is usually accumulative: a longer temporal interval between two timestamps causes a larger divergence between the two distributions. Due to this problem, machine learning models for time series prediction tasks expire frequently. Improving model generality makes the model more robust and keeps it from fitting to local time series fluctuations.
We propose several graph interaction techniques to address the above problems, by enhancing the learning of multi-modal coordinated representations and reinforcing model performance. Yosinski et al. [25] studied feature transferability in deep learning and showed that features in lower layers are more general and those in higher layers are more specific. Following this observation, we design two kinds of graph interaction mechanisms, for lower layers and higher layers respectively.
In lower layers, the input spatiotemporal signal maintains its physical properties as engineered features. As the case in Figure 1 shows, generating latent features via compound graph connectivity makes sense in terms of spatial feature extraction. For lower layer spatial feature extraction, we design grouped GCN (GGCN), which enables random walk graph convolution on compound graph connectivity. The objective of GGCN is to produce a more abstract multi-modal latent feature representation based on graph convolution operations. This technique addresses the first problem, the completeness of spatial feature extraction.
Higher layer features provide high level abstractions of the input signal, and it becomes meaningless to explicitly extract features from a certain region. Leveraging advances from multi-task learning [31], we adapt multi-linear relationship learning [17] to graph convolution networks and try to find shared information among modality-specific representations. According to the characteristics of GCNs, we propose multi-linear relationship GCN (MRGCN), which imposes a tensor normal distribution as the prior distribution of multi-modality graph convolution kernels to learn explainable, robust and fine-grained relationships among modalities. To further enhance model generality, we propose to freeze part of the covariance structure in the covariance update algorithm, in order to improve output feature independence and alleviate the feature co-adaptation problem. The proposed model generates more general high level feature abstractions. This technique also reduces model training time.
On real-world ride-hailing demand data, our model outperforms state-of-the-art baselines by a significant margin. Leveraging the advantages of multi-modal and multi-task learning, our model requires less data and time to reach low prediction error. In summary, this paper makes the following contributions: The proposed approach achieves more than 10% error reduction over state-of-the-art baseline methods for ride-hailing demand forecasting.

RELATED WORK
Region-level prediction in urban computing
Region-level prediction is a fundamental task in data-driven urban management. It covers a rich set of topics, including citizen flow prediction [29,30], traffic demand prediction [10,11,24], arrival time estimation [14] and meteorology forecasting [18,19]. For these topics, the region-wise relationships are measured as geographical distance. The spatial structures for these prediction tasks are formulated as regular graphs, which are inherently Euclidean structures, and convolutional neural network based models are used for effective prediction. Non-Euclidean structures exist in station-based prediction tasks, including bike-flow prediction [4], traffic volume prediction [15,23,26] and point-based taxi demand prediction [21]. The spatial structures for these problems are no longer regular, so graph convolution networks are usually leveraged for spatial feature extraction in these tasks. Non-Euclidean structures also exist when incorporating auxiliary data to model region-wise relationships. Yao et al. [24] encoded region-wise relationships as a graph and used the graph embedding as external features for convolutional neural networks. Geng et al. [6] used MGCN to model region-wise relationships under multiple modalities.

Multi-modality in urban computing
The core issue for multi-modal machine learning is to build models that can process or relate information from multiple modalities [3]. Traditional multi-modal machine learning problems focus on human sensory modalities, including audio-visual speech recognition [28], multi-media analysis [2] and media description [9]. In urban computing, we usually need to harness knowledge from a diverse family of related datasets. Wei et al. [22] first categorized the diversity of urban computing datasets, such as POI and air quality, as multi-modality and explored feature transferability among different modalities.
Multi-modal fusion is one of the most challenging problems in urban computing. Most existing works incorporate multi-modality auxiliary data as handcrafted features in a straightforward manner. Tong et al. [21] used multi-modality data as input features for a linear regression model. Zhang et al. [29] and Yao et al. [24] concatenated auxiliary data to high level abstractions in region-level spatiotemporal prediction networks.
GCN-based approaches encode multi-modality data as region-wise relationships, which act as a static structure in deep learning. The spatial feature extraction process on GCN is associated with these modalities. According to applications in traffic volume prediction [15] and taxi demand prediction [6], GCNs are effective in spatial feature extraction on spatially variant modality data. However, all techniques above fail to build relationships among modalities, which are expected to improve the generality of the learning framework.

Multi-task relationship learning
Multi-task relationship learning is a basic approach to multi-task learning. Zhang and Yeung [32] first proposed a regularized multi-task model, MTRL, by placing a matrix-variate normal prior on the model parameters, where $\Sigma_r$ and $\Sigma_c$ are the row and column covariances. Long et al. [17] proposed the Multilinear Relationship Network (MR Network), which learns multilinear relationships on different modes of the joint-task parameter tensor $W$, where $W$ refers to the joint weight obtained by concatenating all fully connected weights from all tasks. $D_f$, $D_c$ and $D_t$ denote the feature dimension, class dimension and task dimension of the joint weight, and $\Sigma_f$, $\Sigma_c$ and $\Sigma_t$ represent the covariance for each mode. Experimental results showed that imposing a multilinear relationship regularizer on the last few fully connected layers in CNN-like structures increased feature generality and transferability in task-specific layers.
However, MR Networks only learn multilinear relationships on fully connected layers. Other deep learning structures, like CNN or GCN, have more complicated physical meanings.
Table 1 entries (notation):
$x_t$: a spatiotemporal observation (like ride-hailing demand) value at time t
Output layer as the l-th layer
$\sigma$: activation function
$f^l$, $f^{l+1}$: input feature dimension and output feature dimension
For grouped GCN:
$b_j \in \mathbb{R}^{|V| \times f}$: bias for the j-th modality
Weight corresponding to a specific Chebyshev polynomial term
For multi-linear relationship GCN:
$|I|$, $|O|$ (scalars): input and output dimension used to measure weight dimension

METHODOLOGY
Denote $\mathcal{A} = \{A_0, A_1, \ldots, A_{|M|}\}$ as the adjacency matrices of the different graphs. Each graph corresponds to one of the $|M|$ modalities. In the ride-hailing demand prediction problem, each graph represents one kind of pair-wise spatial relationship between regions, including neighborhood (geo-distance) $A_N$, POI similarity $A_S$ and road connectivity $A_C$ [6].

$$A_{N,i,j} = \begin{cases} 1, & \text{if regions } i \text{ and } j \text{ are adjacent} \\ 0, & \text{otherwise} \end{cases}$$

$$A_{S,i,j} = \mathrm{sim}(P_{v_i}, P_{v_j})$$

$A_N$ defines the adjacency relationship between regions. We construct $A_N$ by connecting a vertex to its 8 neighbors in a 3 × 3 grid. $A_S$ is the cosine similarity between the POI vectors of two regions. Each entry in the POI vector represents the number of POIs in a specific category. $A_C$ indicates the connectivity between two regions. Two regions are connected as long as there is a highway or subway that directly connects them.
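The construction of $A_N$ and $A_S$ described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the row-major region indexing and the toy POI counts are hypothetical, and the text does not say whether the diagonal of $A_S$ is kept (it is kept here).

```python
import numpy as np

def neighborhood_adjacency(rows, cols):
    """A_N for a rows x cols grid: each cell is linked to its (up to) 8 neighbours."""
    n = rows * cols
    adj = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if (dr, dc) != (0, 0) and inside:
                        adj[r * cols + c, rr * cols + cc] = 1.0
    return adj

def poi_similarity_adjacency(poi_counts):
    """A_S: cosine similarity between the POI count vectors of every region pair."""
    poi_counts = np.asarray(poi_counts, dtype=float)
    norms = np.linalg.norm(poi_counts, axis=1, keepdims=True)
    unit = poi_counts / np.maximum(norms, 1e-12)  # guard against regions with no POI
    return unit @ unit.T

A_N = neighborhood_adjacency(3, 3)
poi = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]  # toy counts for 3 regions, 2 POI categories
A_S = poi_similarity_adjacency(poi)
```

On the 3 × 3 toy grid the centre region is linked to all 8 surrounding regions, while a corner region has only 3 neighbours.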
Define the one-step spatiotemporal prediction task for a certain modality (graph $A_i$) on a spatiotemporal observation $x$ as predicting the next temporal slice of $x$ from its recent slices with a network $G$ on $A_i$, where $G$ represents any random-walk based graph convolution network and $x_t \in \mathbb{R}^{|V|}$ is the temporal slice of the spatiotemporal observation at time $t$.
When the graph convolution operation $G$ is defined as a polynomial of the graph Laplacian $L$ with degree up to $K$, the above definition refers to the graph convolution operation of ChebNet [5]. In this work, we use this variation of graph convolution operations.
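As a rough sketch of a graph convolution defined as a polynomial of the graph Laplacian, the snippet below sums the terms $L^{\alpha} X W_{\alpha}$. Several points are assumptions made for illustration: plain Laplacian powers are used rather than the Chebyshev recurrence, the activation and bias are omitted, the number of polynomial terms is taken from the weight shape, and all names are hypothetical.

```python
import numpy as np

def sym_norm_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated vertices keep a zero factor
    return np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def poly_graph_conv(x, lap, weights):
    """Return sum over alpha of L^alpha @ x @ weights[alpha].

    x: (|V|, f_in), lap: (|V|, |V|), weights: (n_terms, f_in, f_out).
    """
    out = np.zeros((x.shape[0], weights.shape[2]))
    lx = x  # L^0 x
    for w_alpha in weights:
        out += lx @ w_alpha
        lx = lap @ lx  # next power of the Laplacian applied to x
    return out

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # toy path graph
lap = sym_norm_laplacian(adj)
x = np.arange(6, dtype=float).reshape(3, 2)
weights = np.zeros((2, 2, 2))
weights[1] = np.eye(2)  # only the first-order term is active
y = poly_graph_conv(x, lap, weights)
```

With only the first-order weight set to the identity, the output reduces to `lap @ x`, which makes the role of each polynomial term easy to check.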
In the multi-modality formulation of this problem, each modality refers to a representation learning process of the same spatiotemporal observation on a different graph. Following the convention in [3], we formalize a joint representation of the multi-modality learning problem on the multi-graph convolution network through an interaction function $F_{A \in \mathcal{A}}$ across the multiple graphs. In previous work [6], it is defined as a stacking function in anterior layers and a sum function in the output layer. The major contribution of this work is the design of this interaction function. Figure 2 shows the proposed framework. Following the analysis of feature generality [25] in deep neural networks, we propose two techniques for building modality-wise interactions, targeted at lower layers and higher layers respectively in stacked MGCNs. In lower layers, the hidden features are concrete and the feature extraction is usually general. Considering these facts, we propose to build inter-modality connections to enable inter-graph spatial feature extraction. To distinguish feature extraction parameters, we penalize inter-graph weights and intra-graph weights differently by group regularization. In higher layers, the hidden features are so abstract that they can no longer maintain their physical properties, and inter-modality connections are no longer applicable. High level features are usually task specific, which is harmful to model generality and transferability. In these layers, we propose to learn multilinear relationships on the training parameters of the joint modalities, in order to improve model generality and avoid overfitting the model to local fluctuations.

Grouped GCN
Figure 3 shows a one-layer transformation of grouped GCN (GGCN). In lower layers, we use GGCN to build compound graph connectivity, which enables cross-graph spatial feature extraction.
Denote $L_i \in \mathbb{R}^{|V| \times |V|}$ as the graph Laplacian matrix of the $i$-th modality, where we use the symmetric normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$. Denote $X^l_i \in \mathbb{R}^{|V| \times f^l}$ as the input signal of the $i$-th modality at the $l$-th layer. When $l = 1$, $X^1_i$ represents the raw input and $X^1_i = X^1_j$, $\forall i, j$. The $l$-th layer parameter $W^l$ consists of blocks $w^l_{i,j} \in \mathbb{R}^{f^l \times f^{l+1} \times K}$, where $w^l_{i,j}$ is the weight matrix that transforms the $i$-th modality input to the $j$-th modality output via the ChebNet transformation $G_{w^l_{i,j}}(X^l_i; A_i)$, and $f^l$ and $f^{l+1}$ are the feature dimensions of the $l$-th and $(l+1)$-th layers. $K$ represents the degree of the Chebyshev polynomial, which is sliced during the computation of ChebNet. The $j$-th modality output is computed from the transformations $G_{w^l_{i,j}}(X^l_i; A_i)$ of all input modalities $i$. We denote all weights that transform input to output within the same modality, i.e. $w^l_{i,j}$ for $i = j$, as intra-modality weights. Similarly, the inter-modality weights are $w^l_{i,j}$ for $i \neq j$. When all inter-modality weights are set to 0, the graph convolution operation defined above degrades to MGCN.
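To make the roles of intra-modality and inter-modality weights concrete, here is a minimal sketch of one grouped layer. It assumes that the $j$-th output sums the transformed inputs of all modalities, uses ReLU as the activation and omits the bias; these choices, the helper names and the random stand-in Laplacians are illustrative assumptions rather than the exact layer of the paper.

```python
import numpy as np

def poly_conv(x, lap, w):
    """Sum over alpha of L^alpha @ x @ w[alpha]; w has shape (n_terms, f_in, f_out)."""
    out, lx = np.zeros((x.shape[0], w.shape[2])), x
    for w_alpha in w:
        out = out + lx @ w_alpha
        lx = lap @ lx
    return out

def ggcn_layer(inputs, laplacians, weights):
    """inputs[i]: (|V|, f_in); weights[i][j] maps modality i input to modality j output."""
    m = len(inputs)
    outputs = []
    for j in range(m):
        total = sum(poly_conv(inputs[i], laplacians[i], weights[i][j]) for i in range(m))
        outputs.append(np.maximum(total, 0.0))  # ReLU, an assumed activation
    return outputs

rng = np.random.default_rng(0)
n, f_in, f_out, terms, m = 5, 3, 2, 2, 2
laps = [rng.normal(size=(n, n)) for _ in range(m)]  # stand-ins for graph Laplacians
xs = [rng.normal(size=(n, f_in))] * m               # first layer: same raw input everywhere
ws = [[rng.normal(size=(terms, f_in, f_out)) for _ in range(m)] for _ in range(m)]
out = ggcn_layer(xs, laps, ws)

# Zeroing every inter-modality block leaves one independent GCN per modality (MGCN).
ws_intra = [[ws[i][j] if i == j else np.zeros_like(ws[i][j]) for j in range(m)]
            for i in range(m)]
out_mgcn = ggcn_layer(xs, laps, ws_intra)
```

The second call illustrates the degradation to MGCN: each output then depends only on the input and graph of its own modality.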
Adding cross-modality weights as stated above increases the number of parameters by a factor of O(|M|). This may boost model complexity and cause overfitting. To address this issue, we use group sparsity [20,27] to regularize the complexity of the parameters, and design a flexible group regularization loss for layer l. Different from traditional group regularization, we use a tunable parameter α to control the trade-off between the penalties on intra-modality weights and inter-modality weights. To maintain the difference among modalities, we prefer a smaller α value, in order to introduce less penalty on intra-modality weights. The inter-modality feature extraction then focuses on the strongest relationships. This helps to maintain the model generality gained from multi-modality throughout the proposed GGCN architecture.
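The exact regularization loss is not reproduced in this text, so the following is only one plausible form, chosen to match the description: a group (L2-norm) penalty per weight block, with α scaling the intra-modality blocks and 1 − α scaling the inter-modality blocks. The weighting scheme and the function name are assumptions.

```python
import numpy as np

def group_regularizer(weights, alpha):
    """Assumed form: alpha * sum ||w_ii||_2 + (1 - alpha) * sum_{i != j} ||w_ij||_2."""
    m = len(weights)
    intra = sum(np.linalg.norm(weights[i][i]) for i in range(m))
    inter = sum(np.linalg.norm(weights[i][j])
                for i in range(m) for j in range(m) if i != j)
    return alpha * intra + (1.0 - alpha) * inter

# Toy blocks of shape (n_terms, f_in, f_out) = (1, 2, 2).
block_intra = np.ones((1, 2, 2))        # L2 norm 2
block_inter = 2.0 * np.ones((1, 2, 2))  # L2 norm 4
toy_weights = [[block_intra, block_inter], [block_inter, block_intra]]
penalty = group_regularizer(toy_weights, alpha=0.1)
```

Under this assumed form, a small α such as 0.1 penalizes the inter-modality blocks nine times more heavily than the intra-modality ones, which pushes weak cross-graph connections toward zero.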
The design strategy has several properties that maintain the advantages of GCN models. Firstly, the increase in computational complexity for GGCN is limited. The time complexity increases by a factor of O(|M|), which is polynomial in the number of modalities. In practice, the number of modalities is usually not large. Secondly, the extra computations for the intra-modality and inter-modality transformations are naturally independent, so it is easy to design a parallel implementation. Finally, GGCN is a linear combination of different graph Laplacians, which keeps the numerical stability of the original MGCN model when using the symmetric normalized Laplacian.

Multi-linear relationship GCN
In high level layers, latent features no longer maintain their properties as spatiotemporal observations. Instead of building cross-modality connections, we propose to learn multi-linear relationships (MR) on joint-modality weights by imposing a tensor normal distribution as the prior distribution. We only keep intra-modality weights in high level layers.
The dimensionality transformation of graph convolution operations in ChebNet is shown in Figure 4. There are five dimensions in total in the whole system: regions/vertices (R, |R| = |V|), inputs (I), outputs (O), Chebyshev polynomial (C, |C| = K) and modalities (M). For each single-modality task, the representation of input signals on graph Laplacian $L_i$ is in the three-dimensional space of region, input and Chebyshev polynomial: $\{L_i^{\alpha} X \mid \alpha = 0, 1, \ldots, K\} \in \mathbb{R}^{|V| \times |I| \times K}$. The model parameter for the $i$-th modality is in the three-dimensional space of input, output and Chebyshev polynomial, $\mathbb{R}^{|I| \times |O| \times K}$. The joint representation of the multi-modality weight is defined as a fourth-order tensor $W^l \in \mathbb{R}^{|I| \times |O| \times K \times |M|}$. Firstly, we impose a tensor normal distribution as the prior distribution for $W^l$, where $M^l$ is the mean tensor and the covariance has a Kronecker decomposable structure with one covariance matrix per mode. The density function is that of a multivariate normal distribution over $\mathrm{vec}(W^l)$ with this covariance.
According to Long et al. [17], for MAP estimation of the model parameters, learning the posterior distribution of $W^l$ given training data $(X, Y)$ is equivalent to minimizing the negative logarithm of the density $P(W^l)$, where $\mathrm{vec}(\cdot)$ is the flattening operation that transforms a high-dimensional tensor to a 1-d vector. The flip-flop algorithm updates the covariance matrix $\Sigma_i$ of one mode at a time, where $\epsilon I_{d_i}$ is a trade-off term for numerical stability and $(W^l)_{(i)}$ is the matricization along the $i$-th mode. This operation outputs a matrix of shape $d_i \times \prod_{j \neq i} d_j$. We further discovered that the covariance update rule above should not be applied to the input (I) and output (O) modes. Instead, freezing the covariance matrices of the input (I) and output (O) modes to the identity matrix $I_d$ improves model generality and transferability.
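The flip-flop step can be sketched as below. Since the update equation itself is not reproduced in this text, the sketch assumes the form used for multilinear relationship networks in [17] with a zero mean tensor: $\Sigma_i = \frac{1}{\prod_{j \neq i} d_j} (W)_{(i)} \big(\bigotimes_{j \neq i} \Sigma_j\big)^{-1} (W)_{(i)}^{\top} + \epsilon I_{d_i}$. The function names and the mode ordering (I, O, C, M) are illustrative assumptions.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode-`mode` matricization: shape (d_mode, product of the other dimensions)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def flip_flop_update(weight, covs, mode, eps=1e-6):
    """Assumed update of the covariance of one mode, holding the others fixed."""
    others = [covs[k] for k in range(weight.ndim) if k != mode]
    kron_others = reduce(np.kron, others)  # same ordering as the unfolded columns
    w_mat = unfold(weight, mode)
    scale = 1.0 / w_mat.shape[1]
    inner = w_mat @ np.linalg.inv(kron_others) @ w_mat.T
    return scale * inner + eps * np.eye(w_mat.shape[0])

# Toy joint weight with modes ordered (I, O, C, M).
weight = np.ones((2, 3, 2, 3))
covs = [np.eye(d) for d in weight.shape]
first = flip_flop_update(weight, covs, mode=3)

# Frozen variant: only the C and M modes are updated, I and O stay at identity.
for mode in (2, 3):
    covs[mode] = flip_flop_update(weight, covs, mode)
```

Forming the full Kronecker product is only practical for small toy shapes; it is used here to keep the sketch short and explicit.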
Observe equation 2 of ChebNet on the $l$-th layer, where:
• $|V|$ is very large, and $L^{\alpha} X^l$ is usually sparse.
• In higher layers of DNNs, the feature dimension is usually decreasing.
According to the lemma on matrix multiplication, $\mathrm{Rank}(AB) \leq \min(\mathrm{Rank}(A), \mathrm{Rank}(B))$, the rank of the GCN output feature matrix is bounded by the rank of the weight, where $\mathrm{Rank}_{|f_2|}(W_{\alpha})$ is the rank on the $f_2$ mode of matrix $W_{\alpha}$. Increasing $\mathrm{Rank}(W_{\alpha})$ on both modes ($f_1$ and $f_2$) will lift the upper bound of the output rank. It is known that the co-adaptation problem [8] limits the generality and transferability of DNNs. Initializing and freezing the covariance matrices along the input and output dimensions to $I_I$ and $I_O$ will induce a high rank matrix $W_{\alpha}$, which lifts the upper bound of the rank of the output features. The inter-neuron dependency is smaller for a high rank output feature matrix, so the co-adaptation problem is alleviated and model generality is increased.
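The rank lemma behind this argument, $\mathrm{Rank}(AB) \leq \min(\mathrm{Rank}(A), \mathrm{Rank}(B))$, can be checked numerically. In this small sketch the matrix `a` merely stands in for one term $L^{\alpha} X^l$; the sizes and random matrices are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 8))  # stand-in for L^alpha X^l, shape (|V|, f_1)

# A weight of rank 2 versus a generic weight, both of shape (f_1, f_2) = (8, 6).
w_low = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
w_full = rng.normal(size=(8, 6))

rank_low = np.linalg.matrix_rank(a @ w_low)
rank_full = np.linalg.matrix_rank(a @ w_full)
bound_low = min(np.linalg.matrix_rank(a), np.linalg.matrix_rank(w_low))
bound_full = min(np.linalg.matrix_rank(a), np.linalg.matrix_rank(w_full))
```

The low rank weight caps the rank of the output features no matter how rich the input is, whereas the higher rank weight leaves room for more linearly independent output features.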

Multi-modality fusion
The final layer is the modality fusion layer, which aggregates features from different modalities and outputs a prediction result.
For the one-step spatiotemporal prediction problem, the output shape is $\mathbb{R}^{|V| \times 1}$. The design of modality fusion is straightforward. First, we make sure the last MRGCN layer reduces the feature dimension to 1. Then, the modality fusion layer is designed as a modality-wise average of the $|M|$ modality outputs.

Training algorithm
We combine all loss functions and sum them for the entire network:

$$J = J_0 + \alpha_{low} \sum_{l \in L_{low}} J_1^l + \alpha_{high} \sum_{l \in L_{high}} J_2^l$$

where the $J_0$ term is the prediction loss of the model.

EXPERIMENTS
In this section, we compare our graph interaction techniques with state-of-the-art baselines on region-level demand forecasting for ride-hailing service.

Dataset
We conduct our experiments on two real-world large scale ride-hailing datasets collected in two cities: City A and City B. Both datasets were collected in the main city zone in 2017. We split the data into a training set (Mar 1st to Jul 31st, 2017), a validation set (Aug 1st to Oct 31st, 2017) and a test set (Nov 1st to Dec 31st, 2017). The POI data used for $A_S$ contains 13 primary categories, including business buildings, residential buildings, entertainment, etc. The road network data used for $A_C$ is extracted from the railway, highway and subway datasets of OpenStreetMap [7].

Experiment setting
The ride-hailing forecasting problem is a one-step spatiotemporal prediction problem to learn a predictor $f: \mathbb{R}^{|V| \times T} \to \mathbb{R}^{|V| \times 1}$. Following previous works [6,11,29], we set T to 5. Physically, this means predicting the ride-hailing demand in the next time interval using the three most recent intervals (closeness), the one at the same time yesterday (period) and the one at the same time last week (trend) [30]. V is the set of regions acquired by partitioning the main city zone into 1km × 1km rectangular grids. Under this setting, there are 1296 regions in City A and 896 regions in City B. We set 30 minutes as the time interval for both training data and test data. Each entry in the spatiotemporal tensor represents the ride-hailing demand of a certain region in 30 minutes.
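With 30-minute intervals, one day spans 48 slices and one week spans 336, so the T = 5 input slices for a target interval can be located as in the sketch below. The function name and the ordering of the returned indices are assumptions made for illustration.

```python
def history_indices(t, steps_per_day=48):
    """Indices of the T = 5 input slices used to predict slice t (30-minute intervals)."""
    closeness = [t - 3, t - 2, t - 1]  # three most recent intervals
    period = t - steps_per_day         # same time yesterday
    trend = t - 7 * steps_per_day      # same time last week
    return [trend, period] + closeness

indices = history_indices(400)
```

Any target slice with `t` below 336 has no complete history under this scheme and would have to be skipped when building samples.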
We propose a 4-layer MGCN, where the first two layers are GGCN and the last two layers are MRGCN. The output dimensions of these layers are set to 32, 64, 32 and 1. For all graph convolution operations, the maximum Chebyshev polynomial degree K is set to 4. In GGCN, the tunable α is set to 0.1 to maintain intra-modality properties. In MRGCN, the trade-off parameter ϵ is set to 1e−6. We monitor RMSE on the validation set with early stopping. The regularizers $\alpha_{low}$ and $\alpha_{high}$ are both set to 1e−4. The neural network is implemented using TensorFlow [1] and optimized using the Adam optimizer [12] with a learning rate of 5e−4 and a batch size of 32. All experiments are conducted in an environment with 10GB RAM and 9GB GPU memory on a Tesla P40. Table 2 shows experimental comparisons between the proposed methodology, its variations and baselines:

Method
• MGCN: Use one separate GCN to learn the prediction task in each modality. There is no graph interaction among modalities.
• STMGCN [6]: Use an RNN-based model to extract temporal features ahead of MGCN.
• Share weight: A common technique in multi-task learning. The GCN weight is shared across modalities in each layer.
• Domain adaptation network (DAN) [16]
Achieving a lower error usually costs longer training time. We set the benchmark to 10.78 in City A, which is the performance of baseline [6] on the same dataset.
The experiments show the following facts. Firstly, GGCN improves the prediction accuracy of MGCN by introducing more complexity in spatial feature extraction on graphs. With the help of inter-modality transformations, spatial feature extraction is more complete and the model is more expressive. The performance improvement from GGCN is even more significant than that from incorporating an RNN-based temporal feature extraction process (STMGCN). However, as the parameter size grows, the model is more prone to overfitting and requires longer training time.
Secondly, MRGCN also improves model performance. Compared with GGCN, its effect on prediction error is slightly smaller. There is no significant difference in model capacity or model structure between MRGCN and MGCN. We infer that the multi-linear relationship approach improves prediction performance by improving model generality. Freezing the covariance for the input and output dimensions of the weight tensor may induce higher independence among neurons, which alleviates the co-adaptation problem and thus improves model generality.
Training speed is another important factor in evaluating machine learning models. Table 4 shows the training time required to achieve the optimal performance of each model. We use grid search to determine the minimum training length of each model. Given a larger training set than this, the model cannot converge to a significantly lower validation error. Compared with the baseline, the proposed method reduces the size of the training set and the length of training time by approximately 50%. Among all tested approaches, MRGCN 2Σ achieves the lowest prediction error on average and on test data after the 4th week. This is an important feature for industrial use. The life cycle of a more generalized model is longer, which reduces the frequency of model updates. Figure 5 shows the generalization ability of different models, which validates the above arguments in detail. The relative data divergence (blue bar) is computed as the Kullback-Leibler divergence [13] between the temporal pattern of the last week in the training set and the temporal patterns of each week in the test set. We discovered that the gap between training set and test set is accumulative. This indicates that the test data becomes more and more divergent from the training data with time shifting. Models are expected to be more general to overcome this phenomenon. According to the prediction error by week, the prediction error of STMGCN keeps increasing as the test data becomes more divergent. We believe this phenomenon is caused not by model capability but by model generality. For methods including GGCN and MRGCN, model performance is less influenced by this generalization gap. There is no difference between the model capacity of STMGCN (MGCN) and MRGCN, and the network architecture and connectivity are almost the same. This shows that MRGCN has better generalization ability and avoids overfitting to the training set.
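The divergence measurement can be illustrated with a small sketch. The text does not specify how a weekly temporal pattern is turned into a probability distribution; here it is assumed that a non-negative demand profile is normalized to sum to one. The function name and the toy profiles are hypothetical.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) between two non-negative profiles after normalizing each to sum to 1."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    divergence = 0.0
    for p_raw, q_raw in zip(p_counts, q_counts):
        p, q = p_raw / p_total, q_raw / q_total
        if p > 0:
            divergence += p * math.log((p + eps) / (q + eps))  # eps avoids log of zero
    return divergence

last_train_week = [1.0, 1.0]  # toy two-bin profile
test_week = [3.0, 1.0]
gap = kl_divergence(last_train_week, test_week)
same = kl_divergence(last_train_week, last_train_week)
```

Computing this value between the last training week and each successive test week would give one bar per week, in the spirit of the blue bars described above.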
Figure 6 shows the feature inter-dependency of different models. The feature covariance is calculated as the negative logarithm of the L2-norm of the covariance matrix along the feature mode. It measures the inter-dependency between different neurons in a hidden layer of a deep neural network. A higher value represents a lower absolute value of covariance between neurons and a higher neuron independence. According to the plot, neuron independence is greatly improved by MRGCN. According to Yosinski et al. [25], co-adapted neurons are the major cause of optimization difficulty in middle layers. Compared with baseline methods, the proposed MRGCN 2Σ successfully reduces the coherence among hidden layer units and improves generality and transferability for deep neural networks.
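A minimal sketch of this measure is given below. It assumes that the latent features are arranged as a samples × features matrix and that the L2-norm is the Frobenius norm of the full covariance matrix; both choices, and the function name, are assumptions, since the text does not pin them down.

```python
import numpy as np

def feature_independence_score(features):
    """Negative log of the Frobenius norm of the feature covariance matrix."""
    cov = np.cov(np.asarray(features, dtype=float), rowvar=False)
    return -np.log(np.linalg.norm(cov))

# Two toy feature sets with the same per-feature variance.
uncorrelated = [[1, 0], [-1, 0], [0, 1], [0, -1]]
correlated = [[1, 1], [-1, -1], [0, 0], [0, 0]]
score_uncorrelated = feature_independence_score(uncorrelated)
score_correlated = feature_independence_score(correlated)
```

The correlated set has non-zero off-diagonal covariance, so its norm is larger and its score is lower than that of the uncorrelated set.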

Modality relationship
MRGCN learns explainable relationships between modalities by maintaining a modality-wise covariance matrix.In this part, we first show that all modalities are helpful to the learning task.Then, we will explore the relationship between the modality-wise relationship learnt from optimization and relationship between graphs.
Figure 7 is the Hinton diagram showing the modality-wise relationships for the 3rd and 4th layers in GGCN + MRGCN 2Σ. N, P and R represent the modalities for neighborhood $A_N$, POI similarity $A_S$ and road connectivity $A_C$. Similar to the interpretation in [17], we can draw several conclusions. Firstly, most of the tasks are positively correlated (green), implying that all modalities can reinforce the learning of the others. This conclusion agrees with the ablation study of [6]. Removing any one modality will result in great damage to the prediction accuracy.
Table 5: Ablation study for ST-MGCN.
Secondly, we discover that the relationship between N and R is weak and random. These two tasks appear to be hardly related. In contrast, the relationships R-P and N-P are stable and robust. We try to explain this phenomenon by comparing the graphs $A_N$, $A_S$ and $A_C$.
Table 7: Two measurements to show similarity between different graphs. F-measurement considers matched and unmatched edges proportional to graph size. Edit distance measures the difference between two edge sets.
Table 6 shows the density of each graph, which measures the connectivity of the graph in each modality. According to the graph definition, $A_S$ is defined as the POI similarity between any region pair, which induces a dense adjacency matrix. $A_N$ and $A_C$ are sparse. We measure graph similarity by F-measure and edit distance in Table 7. According to the graph definition, edges in $A_N$ are all removed from $A_C$, so that the edge sets satisfy $E_N \cap E_C = \emptyset$. From the view of graph connectivity, the prediction tasks on these modalities are hardly related. The relationships $A_S$−$A_C$ and $A_S$−$A_N$ are quite similar because $A_S$ is dense. The analysis above helps to understand Figure 7. The relationship between neighborhood (N) and road connectivity (R) is quite random, due to the inherent independence between these two modalities. MRGCN learns similar modality relationships for similar graph pairs. The relationships N-P and R-P are maintained to be similar in both layers.
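The graph statistics used in this comparison can be sketched as follows. The density follows the formula 2|E| / (|V|(|V| − 1)) for undirected graphs; the F-measure and edit distance over edge sets are assumed forms (harmonic mean of edge precision and recall, and size of the symmetric difference), since their exact definitions are not given here. All names and the toy edge lists are hypothetical.

```python
def edge_set(pairs):
    """Undirected edge set: (u, v) and (v, u) are the same edge."""
    return {frozenset(pair) for pair in pairs}

def density(edges, n_vertices):
    """Density of an undirected graph: 2|E| / (|V| (|V| - 1))."""
    return 2 * len(edges) / (n_vertices * (n_vertices - 1))

def f_measure(edges_a, edges_b):
    """Assumed form: harmonic mean of edge precision and recall."""
    total = len(edges_a) + len(edges_b)
    return 2 * len(edges_a & edges_b) / total if total else 0.0

def edit_distance(edges_a, edges_b):
    """Assumed form: number of edge insertions and deletions to turn one set into the other."""
    return len(edges_a ^ edges_b)

e_n = edge_set([(0, 1), (1, 2)])  # toy neighborhood edges
e_c = edge_set([(0, 2), (2, 3)])  # toy road edges, disjoint from e_n
d_n = density(e_n, 4)
f_nc = f_measure(e_n, e_c)
dist_nc = edit_distance(e_n, e_c)
```

When the two edge sets are disjoint, as stated for $E_N$ and $E_C$, the F-measure under this assumed form is zero and the edit distance equals the total number of edges in both graphs.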

CONCLUSION AND FUTURE WORK
In this work, we propose two graph interaction techniques for multi-modal multi-graph convolution networks. We use GGCN in lower layers to complete graph connectivity for better spatial feature extraction by graph convolution networks. In higher layers, we use MRGCN to learn robust modality relationships. MRGCN alleviates the co-adaptation problem by lifting the upper bound on feature independence, and thus improves model generality. The experiment on ride-hailing demand prediction shows that our proposed model outperforms baselines in effectiveness, efficiency and robustness. For future work, we plan to investigate the following aspects: (1) evaluate the model on other spatiotemporal prediction tasks and other region-wise relationships; (2) explore the impact of sparse and dense graphs on this framework;

Figure 1: (a) shows graph connectivity for MGCN [6] in each graph. $X_i$ represents vertex (region) X on the i-th graph. Weighted edges between vertices denote region-wise relationships. There is no interaction among graphs. (b) shows compound graph connectivity obtained by adding graph-wise interaction to MGCN. Vertices are connected as long as there exists an edge in any graph.

Figure 2: Overview of the proposed graph interaction mechanism for stacked MGCNs. The multi-modality representation of input signals is generated by multiple graphs. In lower layers of deep neural networks, we use grouped GCN to enable inter-graph spatial feature extraction. In higher layers, we use multi-linear relationship GCN to learn modality-wise relationships by imposing a tensor normal distribution on the joint representation of parameters. Finally, we aggregate modalities to produce the output.

Figure 3: One-layer transformation for grouped GCN. Weights marked in red represent intra-modality weights. Green ones represent inter-modality weights.

Figure 4: The dimensionality transformation for graph convolution operations in MGCN. Single-modality GCN slices the input and the weight along the C mode, multiplies the slice pairs, and sums up the products.

... 2Σ is less prone to overfitting to the local fluctuations in the training set and overcomes the gap between the training set and the test set. Multi-task learning based approaches, including share weight, DAN and MRGCN, all shorten the model training time. Among these approaches, the share weight method reduces model complexity by a factor of O(|M|), which lowers the prediction performance. The performance of MRGCN and DAN is almost the same. Thirdly, we show that freezing the input and output coordinates in MRGCN is effective. Compared with MRGCN 4Σ, MRGCN 2Σ decreases the prediction error. This validates our assumption that ...

Table 4: Training speed for each model to achieve best performance in ride-hailing demand forecasting task.

Figure 5: The experiment to test model generality in overcoming divergence in temporal data. The relative data divergence in the test set accumulates over time. Multi-task learning based approaches maintain a low prediction error when the data divergence is large.

Figure 6: Feature dimension-wise covariance for different models. It is calculated as the negative logarithm of the L2 norm of the covariance matrix of latent features along the feature mode. A higher value indicates higher feature independence.
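As a rough illustration of the quantity plotted in Figure 6, the sketch below computes the negative logarithm of a covariance-matrix norm for a batch of latent feature vectors. Several details are assumptions, since the caption does not fix them: the L2 norm is taken here as the Frobenius norm over all entries (diagonal included), the covariance uses the unbiased 1/(n − 1) estimator, and the natural logarithm is used. The function names are hypothetical.

```python
import math


def feature_covariance(features):
    """Sample covariance across the feature dimension.

    features: list of n samples, each a list of d feature values (n >= 2).
    Returns a d x d matrix as nested lists, using the 1/(n - 1) estimator.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in features)
            / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def independence_score(features):
    """Negative log of the Frobenius norm of the feature covariance matrix.

    Assumption: "L2 norm" is read as the Frobenius norm. Constant features
    give a zero norm, for which math.log raises ValueError.
    """
    cov = feature_covariance(features)
    norm = math.sqrt(sum(value * value for row in cov for value in row))
    return -math.log(norm)


if __name__ == "__main__":
    weakly_coupled = [[0.0, 0.0], [1.0, 1.0]]
    strongly_coupled = [[0.0, 0.0], [2.0, 2.0]]
    # The batch with the larger covariance norm receives the lower score.
    print(independence_score(weakly_coupled) > independence_score(strongly_coupled))
```

Because the diagonal is included in this reading, the score also reflects feature variance; excluding the diagonal would isolate cross-feature dependence, and the caption does not say which variant was used.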

Figure 7: Hinton diagram for modality relationships. The magnitude of a relationship is represented by the rectangle size. A green rectangle represents a positive relationship. A red rectangle represents a negative relationship.

Table 6: The density of each graph. The graphs are undirected. Density is calculated as 2|E| / (|V|(|V| − 1)).

Graph pairs: A_N − A_S, A_S − A_C, A_C − A_N

Table 1: Table of notations of the l-th layer for the i-th modality and the α-th Chebyshev polynomial.

In this work, we use the root mean squared error (RMSE) to measure the distance between the predicted value and the true value. In stacked MGCNs, we assign layers 1, 2, ..., l_k to L_low and use GGCN to construct graph interactions. The remaining layers l_k + 1, l_k + 2, ... are set to learn multi-linear relationships by MRGCN. The J_1 terms are the GGCN regularizers for each lower layer. The J_2 terms are the relationship regularizers for MRGCN in the higher layers. α_low and α_high are the trade-off parameters for the regularizers. The overall training algorithm for the entire network, including GGCN and MRGCN, is shown below.

Training algorithm for GCN with interactions
  Set layers L_low = {1, 2, ..., l_k} to grouped GCN
  Set layers L_high = {l_k + 1, ...} to multi-linear relationship GCN
  Initialize Σ^l_d = I_d, ∀l ∈ L_high and d ∈ {|I|, |O|, |C|, |M|}
  Initialize all weights
  repeat
    Extract (x_i, y_i) from the training set as the current training batch
    Update model parameters W according to J_W(x_i, y_i)
    Update covariance matrices Σ^l_C and Σ^l_M, ∀l ∈ L_high
  until convergence
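The control flow of the training algorithm above can be sketched as follows. This is only a skeleton under stated assumptions: the weight update and the covariance update are passed in as caller-supplied functions (`update_weights` and `update_covariance` are hypothetical names, and their actual formulas are not reproduced here), the loop runs over a fixed list of batches instead of testing convergence, and matrices are plain nested lists.

```python
def partition_layers(num_layers, l_k):
    """Split layer indices 1..num_layers into L_low (GGCN) and L_high (MRGCN)."""
    if not 0 <= l_k <= num_layers:
        raise ValueError("l_k must lie between 0 and num_layers")
    low = list(range(1, l_k + 1))
    high = list(range(l_k + 1, num_layers + 1))
    return low, high


def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def init_covariances(high_layers, mode_sizes):
    """One identity covariance per higher layer and per mode (I, O, C, M)."""
    return {
        layer: {mode: identity(size) for mode, size in mode_sizes.items()}
        for layer in high_layers
    }


def train(batches, weights, covariances, update_weights, update_covariance,
          updated_modes=("C", "M")):
    """Alternate weight updates and covariance updates, batch by batch.

    Only the modes in updated_modes are refreshed; the input and output
    covariances stay frozen at their initial value.
    """
    for x, y in batches:
        weights = update_weights(weights, covariances, x, y)
        for layer in covariances:
            for mode in updated_modes:
                covariances[layer][mode] = update_covariance(weights, layer, mode)
    return weights, covariances


if __name__ == "__main__":
    low, high = partition_layers(num_layers=4, l_k=2)
    covs = init_covariances(high, {"I": 2, "O": 2, "C": 3, "M": 3})
    print(low, high, sorted(covs[high[0]]))
```

Keeping the two update rules as parameters leaves the skeleton agnostic to how the loss J_W and the covariance estimates are actually computed.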

Table 2: Experiment performance in City A and City B. The proposed approach achieves the best result among all methods.

Table 3: Number of epochs required to converge to optima or benchmark. Multi-task-based methods reduce training time by at least 50%. The experiment is done on the City A dataset.

• MRGCN 4Σ: The proposed multi-linear relationship GCN with all four covariance matrices updated.
• MRGCN 2Σ: The proposed method that freezes the covariance matrices for input and output coordinates.

All proposed methods above are 4-layer MGCNs, with similar hidden feature sizes and the same training configurations (learning rate, batch size, etc.). We evaluate the model performance according to the prediction error (RMSE) on the test set. The epoch of convergence shown in table 4 measures the time consumption for each model to reach its optimum. Different models converge to different optima.
: Minimizing modality divergence by minimizing cross-modality feature divergence. The divergence used is maximum mean discrepancy (MMD).
in table 5.