Predicting Station-Level Short-Term Passenger Flow in a Citywide Metro Network Using Spatiotemporal Graph Convolutional Neural Networks

Abstract: Predicting the passenger flow of metro networks is of great importance for traffic management and public safety. However, such predictions are very challenging, as passenger flow is affected by complex spatial dependencies (nearby and distant) and temporal dependencies (recent and periodic). In this paper, we propose a novel deep-learning-based approach, named STGCNNmetro (spatiotemporal graph convolutional neural networks for metro), to collectively predict two types of passenger flow volumes, inflow and outflow, in each metro station of a city. Specifically, instead of representing metro stations by grids and employing conventional convolutional neural networks (CNNs) to capture spatiotemporal dependencies, STGCNNmetro transforms the city metro network to a graph and makes predictions using graph convolutional neural networks (GCNNs). First, we apply stereogram graph convolution operations to seamlessly capture the irregular spatiotemporal dependencies along the metro network. Second, a deep structure composed of GCNNs is constructed to capture the distant spatiotemporal dependencies at the citywide level. Finally, we integrate three temporal patterns (recent, daily, and weekly) and fuse the spatiotemporal dependencies captured from these patterns to form the final prediction values. The STGCNNmetro model is an end-to-end framework which can accept raw passenger flow-volume data, automatically capture the effective features of the citywide metro network, and output predictions. We test this model by predicting the short-term passenger flow volume in the citywide metro network of Shanghai, China. Experiments show that the STGCNNmetro model outperforms seven well-known baseline models (LSVR, PCA-kNN, NMF-kNN, Bayesian, MLR, M-CNN, and LSTM). We additionally explore the sensitivity of the model to its parameters and discuss the distribution of prediction errors.


Introduction
The prediction of short-term passenger flow volume is a vital component of metro systems. In particular, the accurate prediction of the short-term passenger flow volume in a citywide metro network can help urban managers to fine-tune travel behaviors, reduce passenger congestion, and enhance the service quality of metro systems [1]. Thus, constructing an effective model to predict the passenger flow volume in a citywide metro network is essential.
During recent decades, many passenger-flow prediction models have been proposed based on statistical and machine learning (ML) algorithms, such as support vector machines (SVMs) [2–4], Bayesian regression [5], principal component analysis (PCA) [6,7], non-negative matrix factorization (NMF) [8], and artificial neural networks (ANNs) [1,9–11]. However, these conventional methods cannot process raw sample data and require a manual feature engineering procedure. In the ML domain, a feature is an individual measurable property of the phenomenon being observed [12]. In a typical ML model, feature engineering transforms raw data into suitable internal features so that the learning subsystem can detect patterns in the input data. Feature engineering is generally complex, as it requires careful design and considerable domain expertise. Because the key to a high-accuracy prediction system is capturing features as accurately as possible [13], feature engineering is crucial for ML methods. However, for a citywide metro network, it is difficult to achieve high-quality feature engineering manually. The most important features for predicting citywide spatiotemporal passenger flow volume are the spatiotemporal dependencies [14,15]. In a metro network, the spatial dependencies refer to the interactions between the passenger inflow and outflow volumes at near and distant stations. Specifically, the passenger inflow volume at one station directly affects the outflow volume at another (near or distant) station. Meanwhile, the temporal dependencies refer to the impact of historical observations in the time-series. For example, the passenger flow volume at 7:00 a.m. may have a strong correlation with that at 6:00 a.m. It may also be very similar to the flow volume at 7:00 a.m. on the previous day or at 7:00 a.m. on the same day of the previous week, since human activities have daily and weekly periodicities. Some metropolises have hundreds of metro stations (e.g., Beijing, China, has 391; Shanghai, China, has 289; and Tokyo, Japan, has 290). The spatial and temporal dependencies of these citywide metro systems are complex and highly integrated, so it is difficult to characterize them by manual feature engineering. Therefore, conventional ML methods are not well suited to the prediction of passenger flow volume in a citywide metro network.
Recently, deep learning (DL) has largely removed the need for manual feature engineering [16]. Unlike traditional ML algorithms, DL models can accept input data in raw format and automatically discover the required features level by level, a technique that is called "end-to-end" learning. DL enormously simplifies the feature engineering procedure and also improves feature quality [13]. The performance of DL methods has been confirmed in many domains, such as image recognition, video processing, and natural language processing [13,17–21]. Long short-term memory (LSTM) is a special kind of deep recurrent neural network (RNN) that is suitable for modeling the dynamic temporal dependencies occurring in a time-series [22,23]. Hence, several LSTM-based models have been proposed to predict the passenger flow volumes in isolated stations or traffic lines, and these obtained better accuracy than traditional prediction methods [24–26]. However, LSTM is not good at capturing spatial dependency at the citywide level [27]. Convolutional neural networks (CNNs) have been applied for the prediction of citywide crowd flow [28]. CNNs are specifically designed for data domains with regular grids, such as images. They can straightforwardly identify spatial dependencies among grids with various localized filters or kernels, and these shift-invariant kernels are learned automatically from data. To apply a CNN to the prediction of large-scale spatiotemporal crowd flow volume, Zhang et al. [28] divided a city extent into grids with a predefined size and calculated the flow volume for each grid. Several historical spatiotemporal flow-volume grids were then fed into a deep CNN model, and the spatiotemporal dependencies among grids were learned through stacked localized kernels. However, this approach is not applicable to the prediction of station-level flow volume: if the grid size is set too large, multiple stations fall into the same grid cell, which fails to provide the required granularity. Conversely, if the grid size is as small as one station, the resulting huge image matrix with redundant zero elements strongly increases the computational burden. Ma et al. [29] proposed another method to transform road-network-based data into a two-dimensional image, with the horizontal axis representing time tags and the vertical axis representing road segments. This image was then input into a CNN framework to model the spatiotemporal dependency using a convolution operator. Although the image of Ma et al. represented each road segment separately, maintaining the spatial precision of each road, the image transformation procedure did not consider the topological dependencies of the road network, and the same limitation applies to a metro network. Importantly, as citywide metro networks are irregular network structures, a regular-grid-based convolution operation cannot accurately capture the spatiotemporal dependencies along the metro network.
For data in irregular or non-Euclidean domains, such as user data in social networks and genetic data in biological regulatory networks, graphs have been applied as the main structures to encode heterogeneous pairwise dependencies and complex geometric structures in data [30]. In order to conduct classification and prediction tasks using these graph datasets, a novel DL model, named graph CNN (GCNN), was recently proposed [30,31]. A citywide metro network can be represented as a graph with metro stations as vertices and metro lines as edges. Each station vertex has a feature vector consisting of historical values of passenger flow, and an adjacency matrix can be defined to encode the pairwise dependencies between stations. Therefore, instead of representing metro stations by grids and capturing features using a CNN, a metro network can be characterized as a general graph, and a GCNN can be employed to effectively capture the irregular spatiotemporal dependencies at the metro network level, rather than the grid level.
This paper introduces GCNNs to predict the station-level short-term passenger flow volume in a citywide metro network, proposing a novel DL method named STGCNNmetro (spatiotemporal graph convolutional neural networks for metro). First, we apply stereogram graph convolution kernels to historical metro passenger flow-volume time-series to seamlessly capture the irregular spatiotemporal dependencies along the metro network. Second, a deep structure composed of GCNNs is constructed to capture the distant spatiotemporal dependencies at the citywide level. Finally, to consider the influences of multiple temporal patterns, we fuse the spatiotemporal dependencies captured from recent, daily, and weekly patterns to produce the final prediction values. The model is trained by back-propagation. Compared with existing models, the main contribution of this paper is a novel "end-to-end" DL-based model that is able to automatically capture the irregular spatiotemporal dependencies of a metro network from raw passenger flow-volume data and achieve accurate predictions of passenger flow volume for the citywide metro system.

Representing Time-Series of Metro Network Passenger Flow Volume by Graphs
As the prediction of the passenger flow volume of a citywide metro network is a typical spatiotemporal series prediction problem, we can predict the passenger flow volume at a subsequent time step given the previous M observations. In this work, we define a citywide metro network on a graph and focus on structured time-series of passenger flow volume. As shown in Figure 1, $V_t \in \mathbb{R}^{n}$ is an observation vector of n metro stations at time step t, each element of which records the historical passenger flow volume for a single metro station. The observation $V_t$ is not isolated in space but is rather linked by pairwise connections in the graph. Therefore, the passenger flow volume $V_t$ of the citywide metro network can be regarded as a graph signal that is defined on an undirected graph $G$ with weights $w_{ij}$. At time step t, in the graph $G_t = (V_t, E, W)$, $V_t$ is a finite set of vertices corresponding to the observations from n monitoring stations in a metro network; $E$ is a set of edges, indicating the connectedness between stations; and $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $G_t$.
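As a minimal sketch of the graph representation described above, the following Python snippet builds a symmetric weighted adjacency matrix from an edge list. The four-station toy network, the unit edge weights, and the flow values are made up for illustration, and `build_adjacency` is a hypothetical helper name rather than part of the original method.

```python
def build_adjacency(n, edges):
    """Return a symmetric n x n weighted adjacency matrix as nested lists."""
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        W[i][j] = w
        W[j][i] = w  # undirected graph: w_ij equals w_ji
    return W


# Hypothetical toy network: stations 0-1-2-3 on one line plus a link 1-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0)]
W = build_adjacency(4, edges)

# Graph signal at one time step: one (made-up) flow value per station.
V_t = [120.0, 340.0, 95.0, 210.0]
```

Each entry of `V_t` lines up with one row of `W`, which is what lets the flow vector be treated as a signal defined on the graph.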

Graph Convolution Operation
A standard convolution for regular grids is clearly not applicable to general graphs. A basic approach to generalizing CNNs to structured data is to operate in the spectral domain using graph Fourier transforms [32]. This spectral framework applies convolutions in the spectral domain and is often called spectral graph convolution. Several follow-up studies have made the graph convolution more efficient by reducing the computational complexity from $O(n^2)$ to linear cost [30,31]. In this study, we adopted the GCNN structure proposed by Defferrard et al. [30] to model passenger flow volume data from a citywide metro network. The graph convolution operator "$*_g$", which is based on the concept of spectral graph convolutions, is defined as the multiplication of a signal $x \in \mathbb{R}^{n}$ with a kernel $\Theta$:

$$\Theta *_g x = \sigma\left(\sum_{j=0}^{K} \alpha_j L^{j} x\right) \quad (1)$$

where $K$ is the kernel size of the graph convolution, which determines the maximum radius of the convolution from central nodes, $\alpha_j$ is the convolution kernel parameter, $L$ is the normalized graph Laplacian, and $\sigma$ is an activation function (here, the rectified linear unit (ReLU)).
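To make the role of the kernel size K concrete, here is a small pure-Python sketch of a polynomial graph filter. It assumes the filter takes the form relu(sum over j of alpha_j L^j x) with the symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2); the function names and the three-station path graph are illustrative assumptions, not the authors' implementation.

```python
import math


def normalized_laplacian(W):
    """L = I - D^(-1/2) W D^(-1/2) for a weighted adjacency matrix W."""
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]


def matvec(A, x):
    """Multiply matrix A (nested lists) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def graph_conv(L, x, alpha):
    """Return relu(sum_j alpha[j] * L^j x), where K = len(alpha) - 1."""
    out = [0.0] * len(x)
    term = list(x)  # L^0 x
    for a in alpha:
        out = [o + a * t for o, t in zip(out, term)]
        term = matvec(L, term)  # next power of L applied to x
    return [max(0.0, o) for o in out]


# Hypothetical path graph 0-1-2 with a unit signal placed on station 0.
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
L = normalized_laplacian(W)
y = graph_conv(L, [1.0, 0.0, 0.0], alpha=[0.0, -1.0])  # K = 1
```

With K = 1 the signal on station 0 reaches its direct neighbor (station 1) but not station 2, which is two hops away; this is the locality that the kernel size controls.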

Using Stereogram Graph Convolution to Capture Irregular Spatiotemporal Dependencies
After constructing the graph-structured metro network data as described in the previous section, we extract the local irregular spatiotemporal dependencies of the metro network using the GCNN. Taking vertex i as an example, Figure 2 shows the conventional graph convolution with a kernel size of one (K = 1), that is, the convolution performed over the first-order adjacencies. The spatial dependencies between vertex i and its five first-order adjacent vertices (marked as 1, 2, 3, 4, and 5 in Figure 2) are captured. The illustrated operation is executed for all vertices. For the metro network passenger flow volume time-series data of the previous M intervals, we construct a stereogram convolution kernel to seamlessly extract the spatiotemporal dependencies. The M historical flow-volume graphs are stacked in the temporal dimension. In the stereogram graph convolution, convolutions are performed jointly in both the spatial and temporal dimensions, as shown in Figure 3. We define the spatiotemporal graph convolution operator "$*'_g$" as the multiplication of a signal $x' \in \mathbb{R}^{M \times n}$ with a stereogram kernel $\Theta'$:

$$\Theta' *'_g x' = \sigma\left(\sum_{t=1}^{M} \sum_{j=0}^{K} \alpha_{jt} L^{j} x_t\right) \quad (2)$$

where $\alpha_{jt}$ is the convolution kernel parameter at time step t, and $x_t$ corresponds to the observations from n monitoring stations in a metro network at time step t. For vertex i, after the stereogram graph convolution, the irregular spatiotemporal dependencies contained in its first-order spatial adjacent vertices and its M-order temporal adjacencies are captured.
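The stereogram idea can be sketched as a graph filter that keeps a separate set of coefficients for each of the M stacked time steps and sums the results before the activation. The snippet below assumes the form relu(sum over t and j of alpha[t][j] L^j x_t); the Laplacian, the flow values, and the coefficients are made up, and the code is an illustration rather than the authors' implementation.

```python
def matvec(A, x):
    """Multiply matrix A (nested lists) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def stereogram_conv(L, X, alpha):
    """Spatiotemporal filter over M stacked graph signals.

    X is a list of M signals (each of length n); alpha[t][j] weights the
    j-th power of L applied to the signal at time step t.
    """
    out = [0.0] * len(X[0])
    for x_t, alpha_t in zip(X, alpha):
        term = list(x_t)  # L^0 x_t
        for a in alpha_t:
            out = [o + a * v for o, v in zip(out, term)]
            term = matvec(L, term)
    return [max(0.0, o) for o in out]  # ReLU


# Toy Laplacian of the path graph 0-1-2 (combinatorial form, for brevity).
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
X = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # M = 2 made-up time steps

# Only the j = 0 terms are active: the two time steps are simply added.
temporal_only = stereogram_conv(L, X, alpha=[[1.0, 0.0], [1.0, 0.0]])
# Only the j = 1 term of the first time step is active: a purely spatial mix.
spatial_only = stereogram_conv(L, X, alpha=[[0.0, 1.0], [0.0, 0.0]])
```

Setting different coefficients to zero shows that a single kernel can mix information along the temporal axis, along the graph, or both at once.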

Using Deep GCNNs to Capture Distant Spatiotemporal Dependencies in a Citywide Metro Network
In a metro network structure, high-order adjacencies can be accumulated from low-order adjacencies. For example, second-order adjacencies can be obtained by accumulating two first-order adjacencies. Usually, the spatial scale of a citywide metro network is large and contains many stations from the city center to the suburbs. Intuitively, the passenger flow volumes between nearby stations may affect each other. This can be effectively handled by one GCNN layer, which has shown a powerful ability to hierarchically capture spatial structural information [33,34]. Additionally, since metro systems connect locations separated by large distances, spatiotemporal dependencies also arise between distant stations. As one convolution layer accounts for dependencies only between stations that are situated close to each other, it is necessary to stack multiple GCNN layers to form a deep GCNN structure in order to capture the spatiotemporal dependencies between distant stations.
First, the Laplacian matrix is constructed based on the adjacency matrix of the metro network data, and the local spatiotemporal dependencies (LSTD) of the metro network structure are captured, as shown in Equation (3):

$$\mathrm{LSTD} = F_{GCNN}(\theta, L_f(W), X) \quad (3)$$

where $F_{GCNN}$ is one GCNN layer that performs Equation (2) on all metro stations, $\theta$ represents all the parameters learned in the model, $L_f$ is the Laplacian matrix, $W$ is the adjacency matrix of the metro network graph, and $X \in \mathbb{R}^{M \times n}$ corresponds to the graphs of passenger flow volume of n metro stations in the past M intervals. Then, multiple GCNN layers are stacked to form a deep structure to capture the high-order (distant) spatiotemporal dependencies (HSTD) from the LSTDs, as shown in Equation (4):

$$\mathrm{HSTD} = \mathrm{LSTD}_N = F_{GCNN}(\theta, L_f(W), \mathrm{LSTD}_{N-1}) \quad (4)$$

where $\mathrm{LSTD}_N$ refers to the N-order spatial dependencies and M-order temporal dependencies captured by stacking N GCNN layers. Assuming the size of the graph convolution kernel is 1, the specific deep GCNN procedure is shown in Figure 4. By stacking multiple GCNN layers, we can capture the distant spatiotemporal dependencies of a citywide metro network.
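The effect of stacking can be illustrated without any learning: if one layer with kernel size K mixes information from stations at most K hops away, then N stacked layers can reach stations up to N times K hops away. The sketch below only tracks which stations are reachable; `receptive_field` and the six-station line are hypothetical, and the hop-counting view is a simplifying assumption about how the layers compose.

```python
def receptive_field(W, start, num_layers, K=1):
    """Stations whose inputs can influence `start` after stacking layers."""
    reached = {start}
    for _ in range(num_layers * K):
        # One more hop: add every neighbor of the stations reached so far.
        reached |= {j for i in reached
                    for j, w in enumerate(W[i]) if w != 0}
    return reached


# Hypothetical line of six stations: 0-1-2-3-4-5.
n = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

one_layer = receptive_field(W, 0, num_layers=1)
three_layers = receptive_field(W, 0, num_layers=3)
```

On this toy line, one layer with K = 1 sees only the adjacent station, while three layers see stations up to three hops away.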
ISPRS Int. J. Geo-Inf. 2019, 7, x FOR PEER REVIEW

Using Spatiotemporal GCNNs for the Prediction of Station-Level Short-Term Passenger Flow Volume in a Citywide Metro Network
The general framework of the proposed DL-based model, STGCNNmetro, is shown in Figure 5. The passenger flow of a metro system consists of inflow and outflow. In this study, inflow volume is the total passenger flow volume entering a station during a given time interval, and outflow volume is the total passenger flow volume leaving a station during that interval. We first transform the historical citywide metro inflow volume graph and outflow volume graph at each time interval into structured spatiotemporal time-series. As the spatiotemporal dependencies have different patterns [15,26,35], the historical flow-volume time-series are divided into recent, daily, and weekly patterns. Then, three sub-models based on GCNNs are constructed to capture the features of these three patterns. The three outputs are fused using parameter matrices, which assign different weights to the results of different patterns at different stations. Finally, the ReLU function is applied to the fusion result to give the final prediction values.

Input Datasets
The input datasets used in this study consist of spatiotemporal passenger flow-volume data from recent, daily, and weekly time scales. Recent inputs refer to the historical observations which are close to the target time interval, and daily and weekly inputs refer to the historical observations in the same time interval as the prediction target, but in daily or weekly periodic cycles. For example, assuming that the prediction target is the spatiotemporal flow volume between 9:00 and 9:10 a.m. on a Thursday, then the recent inputs are the historical observations between 8:00 and 9:00 a.m. on the same day, the daily inputs are the observations between 9:00 and 9:10 a.m. on the previous days, and the weekly inputs are the observations between 9:00 and 9:10 a.m. on the previous Thursdays.
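Let $X_t$ denote the passenger flow volume data of the citywide metro network for the $t$th time interval, and let $X_R^{(0)}$, $X_D^{(0)}$, and $X_W^{(0)}$ represent the recent, daily, and weekly historical observation data, respectively. Assume that the numbers of time intervals in $X_R^{(0)}$, $X_D^{(0)}$, and $X_W^{(0)}$ are r, d, and w, respectively. The time interval to be predicted is t, and the total number of time intervals in one day is m. Then, $X_R^{(0)}$, $X_D^{(0)}$, and $X_W^{(0)}$ are defined in Equation (5):

$$X_R^{(0)} = [X_{t-r}, X_{t-(r-1)}, \ldots, X_{t-1}], \quad X_D^{(0)} = [X_{t-d \cdot m}, X_{t-(d-1) \cdot m}, \ldots, X_{t-m}], \quad X_W^{(0)} = [X_{t-w \cdot 7m}, X_{t-(w-1) \cdot 7m}, \ldots, X_{t-7m}] \quad (5)$$

In the $t$th time interval, the inflow and outflow volumes in all stations can be denoted as a matrix $X_t \in \mathbb{R}^{2 \times n}$. The three input datasets $X_R^{(0)}$, $X_D^{(0)}$, and $X_W^{(0)}$ can therefore be represented by matrices with dimensions of $2r \times n$, $2d \times n$, and $2w \times n$, respectively, where n is the number of stations.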
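The selection of recent, daily, and weekly inputs can be sketched as simple index arithmetic over a flat sequence of time intervals. The function below is an illustrative assumption: it treats intervals as consecutively numbered across days with m intervals per day, and `input_indices` is a hypothetical name.

```python
def input_indices(t, r, d, w, m):
    """Interval indices feeding the recent, daily, and weekly inputs.

    t is the target interval, m the number of intervals per day, and
    r, d, w the numbers of recent, daily, and weekly intervals used.
    """
    recent = list(range(t - r, t))                     # t-r, ..., t-1
    daily = [t - k * m for k in range(d, 0, -1)]       # same slot, prior days
    weekly = [t - k * 7 * m for k in range(w, 0, -1)]  # same slot, prior weeks
    return recent, daily, weekly


# Example with 105 intervals per day and r = 7, d = 1, w = 1.
recent, daily, weekly = input_indices(t=1000, r=7, d=1, w=1, m=105)
```

For a target interval of 1000, the daily input is the interval exactly 105 steps earlier and the weekly input is the one 735 steps earlier, that is, the same slot one day and one week before.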

Integrally Capturing Spatiotemporal Dependencies
In our task, the final output size should be the same as the size of the input (i.e., 2 × n). The same requirement arises in video-sequence generation tasks, where the input and output have the same resolution [36]. To meet it, we employ a special type of convolution, i.e., "same convolution." Several methods have been introduced to avoid the loss of resolution brought about by subsampling while preserving distant dependencies. As our approach is different from the classical CNN, we do not use subsampling, but only convolutions [37].
The recent, daily, and weekly patterns each use a network structure composed of multiple GCNN layers, as shown in Figure 5. Taking the recent pattern as an example, the recent input $X_R^{(0)}$ is fed into a stack of stereogram graph convolution layers to integrally capture spatiotemporal dependencies. After several convolutions, the output of the recent pattern is denoted as $X_R$. Likewise, using the same operations, we can construct the daily and weekly components of Figure 5, whose outputs are denoted as $X_D$ and $X_W$, respectively.

Feature Fusion
A previous study demonstrated that crowd inflows and outflows in one region are both affected by recent, daily, and weekly dependencies, although the degrees of influence may be very different [28]. Therefore, we adopted a parametric-matrix-based method to fuse the three components of Figure 5 (recent, daily, and weekly passenger flow volume patterns) as follows:

$$\hat{X}_t = \sigma\left(W_R \circ X_R + W_D \circ X_D + W_W \circ X_W\right) \quad (8)$$

where $\hat{X}_t$ is the prediction target in the $t$th time interval, $\circ$ is the Hadamard product (i.e., element-wise multiplication), and $W_R$, $W_D$, and $W_W$ are the learnable parameters that adjust the degrees to which the prediction is affected by recent, daily, and weekly dependencies, respectively. The fusion result is activated by $\sigma$ (i.e., ReLU) to obtain the prediction target.
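The parametric-matrix fusion can be sketched as an element-wise weighted sum followed by ReLU. In the snippet below, the 2 × 2 matrices stand in for the 2 × n outputs of the three components, and the weight values are fixed by hand for illustration; in the model they would be learned, and `fuse` is a hypothetical name.

```python
def fuse(X_R, X_D, X_W, W_R, W_D, W_W):
    """Return relu(W_R * X_R + W_D * X_D + W_W * X_W), all element-wise."""
    rows, cols = len(X_R), len(X_R[0])
    return [[max(0.0,
                 W_R[i][j] * X_R[i][j]
                 + W_D[i][j] * X_D[i][j]
                 + W_W[i][j] * X_W[i][j])
             for j in range(cols)] for i in range(rows)]


# Made-up component outputs for n = 2 stations (row 0: inflow, row 1: outflow).
X_R = [[10.0, 20.0], [30.0, 40.0]]
X_D = [[12.0, 18.0], [28.0, 44.0]]
X_W = [[8.0, 22.0], [32.0, 36.0]]

# Hand-picked weights; each station and flow type could have its own value.
W_R = [[0.5, 0.5], [0.5, 0.5]]
W_D = [[0.25, 0.25], [0.25, 0.25]]
W_W = [[0.25, 0.25], [0.25, 0.25]]

prediction = fuse(X_R, X_D, X_W, W_R, W_D, W_W)
```

Because the weights are matrices rather than scalars, a station dominated by commuter traffic could weight the weekly pattern differently from a station near a stadium or airport.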

Model Training
To train the STGCNNmetro model, the mean-square error (MSE) is used as the loss function, as shown in Equation (9):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \tilde{y}_i\right)^2 \quad (9)$$

where $y_i$ is the actual value, $\tilde{y}_i$ is the predicted value, and n is the total number of values to be predicted. All samples are divided into three sub-datasets: a training set, a validation set, and a testing set. The training set is fed into the model in batches. For each batch, the value of the loss function is calculated after forward propagation. Then, the loss is back-propagated layer by layer, and an optimizer updates all trainable parameters according to the loss. The Adam optimizer [38] is widely adopted in ML, as it generates adaptive learning rates for different parameters, and it is applied here. By minimizing the loss, all trainable parameters are trained.
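The training loop described above can be illustrated on a toy problem with a single trainable weight. The sketch below implements the MSE loss and one standard Adam update with commonly used default hyperparameters; the data, the one-parameter model, and the learning rate of 0.01 are made up for illustration and do not reflect the settings of the actual model.

```python
import math


def mse(y_true, y_pred):
    """Mean-square error between two equally long sequences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; `state` holds t, m, and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected variance
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem (made up): fit one weight so that prediction = weight * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
weight = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
initial_loss = mse(ys, [weight * x for x in xs])
for _ in range(2000):
    preds = [weight * x for x in xs]
    # Gradient of the MSE with respect to the weight.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    weight = adam_step(weight, grad, state, lr=0.01)
final_loss = mse(ys, [weight * x for x in xs])
```

In a real model the same forward pass, loss, gradient, and update cycle is applied to every trainable parameter and to mini-batches rather than the full dataset.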

Experimental Data
A metro passenger flow dataset was generated from card-swiping data of the metro system of Shanghai, China, from April 1 to April 30, 2015. During this period, there were about nine million card records per day, covering 289 unique metro stations from all 14 lines of the Shanghai metro. The distribution of all metro lines and stations is shown in Figure 6. Corresponding to human activities, the number of metro passengers is significantly lower before 5:30 a.m. and after 11:00 p.m. Therefore, we selected the time range between 5:30 a.m. and 11:00 p.m. as our study period. The time interval was set to 10 min, so that there are 105 time intervals per day. The passenger inflow volume and outflow volume during each time interval were determined based on the card records.
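The interval arithmetic above (5:30 a.m. to 11:00 p.m. in 10-min steps gives 105 intervals) can be checked with a short sketch that maps a timestamp to its interval index and counts records per station. The record layout (timestamp, station id, direction) is a hypothetical schema chosen for illustration; the actual card-record format is not described here.

```python
from collections import Counter
from datetime import datetime

START_MIN = 5 * 60 + 30  # 5:30 a.m. as minutes after midnight
END_MIN = 23 * 60        # 11:00 p.m.
INTERVAL_MIN = 10
INTERVALS_PER_DAY = (END_MIN - START_MIN) // INTERVAL_MIN


def interval_index(ts):
    """Index of the 10-min interval containing ts, or None if outside."""
    minutes = ts.hour * 60 + ts.minute
    if not START_MIN <= minutes < END_MIN:
        return None
    return (minutes - START_MIN) // INTERVAL_MIN


# Hypothetical card records: (timestamp, station id, "in" or "out").
records = [
    (datetime(2015, 4, 1, 5, 30), 12, "in"),
    (datetime(2015, 4, 1, 5, 39), 12, "in"),
    (datetime(2015, 4, 1, 9, 5), 12, "out"),
    (datetime(2015, 4, 1, 23, 10), 12, "out"),  # outside the study period
]

counts = Counter()
for ts, station, direction in records:
    idx = interval_index(ts)
    if idx is not None:
        counts[(station, direction, idx)] += 1
```

The resulting counter holds one inflow or outflow count per station and interval, which is the raw material for the 2 × n flow matrices used as model input.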
The dataset is divided as follows: the data for April 24 and 25 (one working day and one non-working day) are used as the validation set, the data before April 24 are used as the training set, and the data for the five days after April 25 (four working days and one non-working day) are used as the test set.
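A date-based split of this kind can be sketched in a few lines. The snippet assumes that the training days are those preceding the two held-out validation days and that all later days form the test set; `split_of` is a hypothetical helper, and the weekday check uses the calendar only (public holidays are not considered).

```python
from datetime import date, timedelta

VALIDATION_DAYS = {date(2015, 4, 24), date(2015, 4, 25)}


def split_of(day):
    """Assign a calendar day to the train, validation, or test split."""
    if day in VALIDATION_DAYS:
        return "validation"
    return "train" if day < min(VALIDATION_DAYS) else "test"


days = [date(2015, 4, 1) + timedelta(days=k) for k in range(30)]
splits = {name: [d for d in days if split_of(d) == name]
          for name in ("train", "validation", "test")}

# Number of weekend days (Saturday or Sunday) in the test split.
test_weekend_days = sum(d.weekday() >= 5 for d in splits["test"])
```

Splitting by whole days keeps every interval of a given day in the same subset, so no validation or test interval leaks into training.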

Evaluation Metrics and Baseline Models
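To measure and evaluate the performance of the different methods, the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE) were adopted. They are calculated as follows:

$$\mathrm{RMSE}(\tilde{y}, y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^2}$$

$$\mathrm{MAPE}(\tilde{y}, y) = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \tilde{y}_i}{y_i}\right|$$

where $y_i$, $\tilde{y}_i$, and $n$ are the actual value, the predicted value, and the number of predicted values, as in the loss function. We compared our STGCNNmetro model with seven well-known baseline models. Brief introductions to these are given in Table 1.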

PCA-kNN
A mixture of principal component analysis (PCA) and k-nearest neighbor (kNN) regression [6].PCA is used to select the principal components which are input into kNN for prediction.

NMF-kNN
A mixture of non-negative matrix factorization (NMF) and kNN regression which is similar to PCA-kNN above.

LSTM
Long short-term memory (LSTM). Here, the LSTM model has multiple LSTM layers and one fully connected layer.

M-CNN
A convolutional neural network (CNN)-based model proposed by Ma et al. [29], which transforms the metro-network-based passenger flow volume into a two-dimensional image whose horizontal axis represents time and whose vertical axis represents the metro station. Prediction is made by performing convolutions on the image.

Experimental Environment and Settings
All experiments were run on a graphics processing unit (GPU) platform with an NVIDIA GeForce GTX 950 graphics card with 8 GB of GPU memory, and were implemented in the Python 3 programming language, using Keras [40] and TensorFlow [41] as the DL packages and Sklearn [42] as the general ML package.
Parameter tuning is an essential part of most DL-based models [15,29,43]. Here, only the settings that obtained the best prediction accuracy are listed; the tuning procedure is detailed in Section 3.3. The input lengths (numbers of time intervals) of the recent, daily, and weekly patterns were set to r = 7, d = 1, and w = 1, respectively. Since two passenger flow volumes (inflow and outflow) were predicted, the number of kernels in the last layer was set to 2, while that in the other layers was set to 288. For the recent pattern, the graph convolution kernel size (i.e., K) was set to 3 and the network depth (i.e., the number of graph convolution layers) was set to 5. The daily pattern and the weekly pattern had the same graph convolutional network structure. Unlike for the recent pattern, the main contribution of the two periodic patterns (daily and weekly) was temporal dependencies. Thus, the K values of the daily and weekly patterns were both set to 1 and their network depths were both set to 2.
We trained the STGCNNmetro model by minimizing the MSE for 100 epochs with a batch size of five. The initial learning rate was 1 × 10^−3, with a decay rate of 0.95 applied every 20 epochs. For the seven comparison models, we also selected, via tuning, the parameters that gave the best results.
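The learning-rate schedule described above can be written as a short function. This sketch assumes a staircase interpretation, in which the rate is multiplied by 0.95 once after each completed block of 20 epochs; the text does not say whether the decay is stepwise or continuous, and the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay_rate=0.95, decay_every=20):
    """Staircase decay: multiply by decay_rate once per completed block of epochs."""
    return initial_lr * decay_rate ** (epoch // decay_every)

# Learning rate for each of the 100 training epochs (0-based epoch index)
schedule = [learning_rate(epoch) for epoch in range(100)]
```

A function of this shape could be passed to a per-epoch scheduler callback in the DL framework, so that the optimizer picks up the new rate at the start of each epoch.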
For inputs, the multivariable linear regression (MLR) model was given only the historical observations from the three patterns (i.e., recent, daily, and weekly) to predict each passenger inflow or outflow volume, while the other seven models were given the inflow and outflow volumes of all stations from the three patterns, thereby taking spatiotemporal correlation into account.
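To make the three input patterns concrete, the sketch below selects the historical time intervals for one prediction target with r = 7, d = 1, and w = 1. The indexing is an assumption: intervals are numbered consecutively across days with 105 intervals per day, the daily pattern takes the same time of day on previous days, and the weekly pattern takes the same time of day in previous weeks. The function names are hypothetical.

```python
INTERVALS_PER_DAY = 105   # 10 min intervals between 5:30 a.m. and 11:00 p.m.
DAYS_PER_WEEK = 7

def pattern_indices(t, r=7, d=1, w=1):
    """Return the interval indices used as recent, daily, and weekly inputs for target t."""
    recent = [t - k for k in range(r, 0, -1)]
    daily = [t - k * INTERVALS_PER_DAY for k in range(d, 0, -1)]
    weekly = [t - k * DAYS_PER_WEEK * INTERVALS_PER_DAY for k in range(w, 0, -1)]
    return recent, daily, weekly

def earliest_valid_target(r=7, d=1, w=1):
    """Smallest target index for which all three patterns have history available."""
    return max(r, d * INTERVALS_PER_DAY, w * DAYS_PER_WEEK * INTERVALS_PER_DAY)

recent, daily, weekly = pattern_indices(1000)
```

Under these assumptions, the weekly pattern determines how much history is needed before the first sample can be formed: one full week of intervals.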

Experimental Results and Analysis
The test results for all models are shown in Table 2. The RMSEs were calculated from all prediction data. The MAPEs were calculated only from the data whose actual value is not 0, which avoids the case where the denominator is 0.
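The two metrics, including the exclusion of zero actual values from the MAPE, can be sketched in plain Python as follows. The function names and the toy values are illustrative assumptions and are not results from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error over all predicted values."""
    n = len(y_true)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n)

def mape(y_true, y_pred):
    """Mean absolute percentage error, skipping entries whose actual value is 0."""
    pairs = [(a, p) for a, p in zip(y_true, y_pred) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

# Hypothetical actual and predicted flow volumes
actual = [0, 10, 20, 40]
predicted = [2, 12, 18, 40]
toy_rmse = rmse(actual, predicted)
toy_mape = mape(actual, predicted)
```

Note that the two metrics are computed over different subsets: the RMSE uses all four values, while the MAPE uses only the three with a non-zero actual value.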
The following can be seen from Table 2: (1) Compared with the other seven models, the RMSE of the STGCNNmetro model is the smallest, while the MAPE of the PCA-kNN model is the smallest. The MAPE of STGCNNmetro is the second smallest and very close to that of PCA-kNN, with a difference of 0.14%. However, the RMSE of PCA-kNN is 8.05 larger than that of STGCNNmetro. Thus, STGCNNmetro is better than PCA-kNN overall, which means that STGCNNmetro is better than all other models; (2) the RMSEs of the STGCNNmetro, Bayesian, and LSVR models are similar. However, the training time of the Bayesian model is over 4.5 h, which is nearly nine times that of STGCNNmetro, and the training time of LSVR is close to six hours, which is nearly 11 times that of STGCNNmetro; (3) the RMSE of the STGCNNmetro model is 9.46 less than that of the M-CNN model, which reflects the advantage of STGCNNmetro in capturing irregular spatiotemporal dependencies; (4) the LSTM model has poor prediction results, which means that it is not good at capturing the spatial dependencies of the metro network; and (5) the MLR model directly performs regression on historical data, whereas passenger flow exhibits spatiotemporal correlation, nonlinearity, and volatility; this model therefore considers too few factors, and its result is worse than those of the other seven models.
We first choose two models: one whose MAPE is the closest to that of the STGCNNmetro model and whose RMSE is relatively small (PCA-kNN), and another whose RMSE is the closest to that of STGCNNmetro and whose MAPE is relatively small (Bayesian). Then, we take the Shenzhuang metro station, the most crowded station in Shanghai [44], as an example to explore the models in more detail. The passenger flow predictions of the three models for working days and non-working days are compared and analyzed as follows.

Prediction Results for Passenger Inflow Volume
From the results of the prediction of inflow volume using the test set, shown in Figure 7, it can be seen that the values predicted by the STGCNNmetro, Bayesian, and PCA-kNN models correspond well to the actual values. In some time intervals (i.e., morning and evening; see the red circle in Figure 7), the values predicted by the STGCNNmetro model are superior to those predicted by the Bayesian and PCA-kNN models. The following is a detailed analysis of the prediction of inflow volume for a Sunday (non-working day) and a Monday (working day).
As shown in Figure 8a, the STGCNNmetro model predicts the actual values of passenger inflow volume for the non-working day better than the Bayesian and PCA-kNN models. Between 8:30 and 9:30 a.m. (see the red arrow in Figure 8a), the predictions of the STGCNNmetro model are clearly superior to those of the other two models: the predictions of the Bayesian model are too large, while those of the PCA-kNN model are too small. As shown in Figure 8b, all three models effectively predict passenger inflow volume for the working day. However, in the morning and evening peaks, the predictions of the STGCNNmetro model are superior (see the red arrow in Figure 8b).

Prediction Results for Passenger Outflow Volume
As shown in Figure 9, the passenger outflow volume is more variable than the inflow volume, and the prediction results are therefore slightly worse. Nevertheless, the overall predictive abilities of the three models are good.
Based on the results of the prediction of outflow volume on a non-working day shown in Figure 10a, the STGCNNmetro and Bayesian models perform significantly better than the PCA-kNN model, and the prediction results of the STGCNNmetro model during peak periods are better than those of both the Bayesian and PCA-kNN models (see the red arrow in Figure 10a). Furthermore, based on the results of the prediction of outflow volume on a working day, as shown in Figure 10b, the STGCNNmetro and Bayesian models generally perform better than the PCA-kNN model. The ability of the STGCNNmetro model to capture peaks in the evening is better than that of the Bayesian model (see the red arrow in Figure 10b). However, its ability to capture peaks in the morning is slightly worse than that of the Bayesian model.

In summary, the predictive ability of the STGCNNmetro model is generally better than those of the Bayesian and PCA-kNN models. The STGCNNmetro model's prediction of peaks is usually more accurate than those of the other two models, although its prediction of peak values of outflow volume on working-day mornings is slightly worse than that of the Bayesian model. This may be related to the parameter sharing of the convolutions, which improves computational efficiency: unlike the STGCNNmetro model, the Bayesian model has specific parameters for each station, and its training time is almost nine times that of the STGCNNmetro model. Compared with the Bayesian model, the STGCNNmetro model therefore shows more advantages for practical application.

Tuning Parameters
Hyperparameter tuning is an indispensable process for DL models. The numbers of graph convolution kernels and graph convolution layers represent spatial attributes. The lengths of the input sequences of the three patterns (i.e., recent, daily, and weekly) represent temporal attributes. This section focuses on the tuning of these three types of parameters (i.e., the number of graph convolution kernels, the number of graph convolution layers, and the lengths of the input sequences of the three patterns).

Different Input Lengths of the Recent, Daily, and Weekly Passenger Flow Volume Patterns
Here, we verify the impact of different input lengths of the three patterns (i.e., recent, daily, and weekly), as shown in Figure 11. Figure 11a shows the effect of temporal closeness when d and w are fixed at 1 and r is varied. For example, r = 0 means that the recent pattern is not employed, which results in a very high RMSE of 44.23. The RMSE first decreases and then increases as the input length increases, with r = 7 giving the lowest RMSE.
Figure 11b depicts the effect of the daily pattern when r is set to 7 and w to 1 and d is varied. It can be seen that d = 1 gives the lowest RMSE. The model without the daily pattern (i.e., d = 0) has a higher RMSE than the models with d = 1, 2, 3, 4, but a lower RMSE than the models with d = 5, 6, meaning that short-range periods are well modeled, whereas long-range periods may be hard to model or not helpful. Figure 11c presents the effect of the weekly pattern when r and d are fixed at 7 and 1, respectively, while w is varied from 0 to 2. The model with w = 1 outperforms the others. For the weekly pattern as well, long-range periods may be hard to model or not useful.

The Effect of Different Numbers of Graph Convolution Layers
From Figure 12, it can be seen that as the number of graph convolution layers increases, the RMSE of the model first decreases and then increases. The initial decrease demonstrates that more graph convolution layers can give a better result, since they capture not only close spatial dependencies but also distant ones. However, when the network is very deep (e.g., number of network layers ≥ 9), the training result becomes worse. The stations newly included by the additional layers are from Shanghai's urban fringe and contribute little to the passenger flow volume at the core stations. At the same time, an increasing number of layers leads to more parameters and greater training difficulty, and as a result, the RMSE increases.
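Why deeper networks reach more distant stations can be illustrated with a breadth-first search on a toy network. The sketch assumes that each graph convolution layer with kernel size K aggregates information from stations up to K hops away, so that a stack of L layers reaches up to K × L hops; this is a common property of graph convolutions and is used here as an assumption, not as a statement about the exact STGCNNmetro operator. The 20-station line is a hypothetical topology.

```python
from collections import deque

def stations_within(adjacency, source, max_hops):
    """Breadth-first search returning all stations within max_hops of source."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the allowed radius
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def receptive_field(adjacency, source, kernel_size, n_layers):
    """Stations that can influence the prediction at source (assumed radius K * L)."""
    return stations_within(adjacency, source, kernel_size * n_layers)

# Hypothetical metro line with 20 stations connected in a chain
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}

one_layer = receptive_field(line, 0, kernel_size=3, n_layers=1)
five_layers = receptive_field(line, 0, kernel_size=3, n_layers=5)
nine_layers = receptive_field(line, 0, kernel_size=3, n_layers=9)
```

On this toy line, nine layers already cover every station, so further depth adds parameters without adding new stations, which is in the spirit of the explanation given above for the rising RMSE.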

The Effect of Different Numbers of Kernels
From the results shown in Figure 13, it can be seen that, up to a certain level, more kernels result in a lower RMSE. However, when the number of kernels is very large (e.g., ≥352), the RMSE increases. An increasing number of kernels leads to more parameters and greater training difficulty, and the RMSE therefore increases.

Spatial Distribution of Error
This section discusses the spatial distribution of the errors of the STGCNNmetro model for working days and non-working days. The daily period of interest is from 6:40 a.m. to 11:00 p.m. We calculate the RMSE of each station's predicted inflow and outflow volumes and, as a reference, the average actual inflow and outflow volume of each station. The latter is called the station mean actual value (SMAV), and its calculation is shown in Equation (11):

$$\mathrm{SMAV} = \frac{1}{T}\sum_{i=1}^{T} y_i \quad (11)$$

where $y_i$ is the actual value corresponding to the ith time interval and $T$ is the number of predicted time intervals.
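The per-station error analysis can be sketched as follows: for every station, the RMSE over its T predicted intervals is paired with its SMAV. The dictionary layout, the station names, and the numbers are hypothetical.

```python
import math

def station_rmse_and_smav(actual, predicted):
    """Return {station: (RMSE, SMAV)} from per-station lists of T values."""
    result = {}
    for station, y in actual.items():
        y_hat = predicted[station]
        T = len(y)
        rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / T)
        smav = sum(y) / T  # station mean actual value, Equation (11)
        result[station] = (rmse, smav)
    return result

# Hypothetical flows for a busy station and a quiet station
actual = {"busy": [100, 200, 300], "quiet": [10, 20, 30]}
predicted = {"busy": [110, 190, 310], "quiet": [11, 19, 31]}
stats = station_rmse_and_smav(actual, predicted)
```

In this toy example the busy station has both the larger SMAV and the larger RMSE, which is the kind of relationship examined in the maps and scatter plot.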
From the distribution maps of RMSE and SMAV shown in Figure 14, it can be seen that there is a positive correlation between the model errors and the actual values of passenger flow volume. To further verify the relationship between the two, they are plotted against each other in Figure 15.
Based on the scatter plot of RMSE and SMAV shown in Figure 15, we further carry out a regression analysis on these two measures and obtain the linear regression equation and the coefficient of determination (R^2). The results show that, overall, the RMSE is positively correlated with the SMAV. However, several points deviate from this trend. For example, although the SMAV of the People's Square station (see marker "1" in Figure 14) is the largest of all stations, its RMSE is not the largest. Additionally, the RMSEs of the stations of Xu Jidong (see marker "2" in Figure 14), Hongqiao Railway (see marker "3" in Figure 14), and Shanghai South Railway (see marker "4" in Figure 14) are all abnormally high. The reason for this may be that they are located near railway stations and are affected by external uncertainties. Furthermore, the RMSEs of the stations of Caohejing Development Zone (see marker "5" in Figure 14) and Lu Jiazui (see marker "6" in Figure 14) are also abnormally high. All of these abnormal points should be explored using other types of data in the future.
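The regression analysis mentioned above amounts to an ordinary least-squares line through the (SMAV, RMSE) points together with its coefficient of determination. The sketch below shows that computation in plain Python; the data points are made up for illustration and are not values from Figure 15.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical per-station values: SMAV on the x-axis, RMSE on the y-axis
smav = [20.0, 50.0, 100.0, 200.0]
rmse = [4.0, 6.0, 12.0, 20.0]
slope, intercept, r2 = linear_fit(smav, rmse)
```

A positive slope with R^2 close to 1 would indicate that the error grows roughly in proportion to the flow volume; stations far above the fitted line are the abnormal points discussed above.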

Temporal Distribution of Error
This section analyzes the temporal distribution of the errors in the output of the STGCNNmetro model, using the data of the test set for five consecutive days (Sunday to Thursday) as an example. The daily period of interest is from 6:40 a.m. to 11:00 p.m. We calculate the RMSE of the predicted passenger inflow and outflow volumes of all stations for every 10 min interval, and also calculate the average actual flow volume of all stations in the same interval. The latter is called the time mean actual value (TMAV), and its calculation is shown in Equation (12):

$$\mathrm{TMAV} = \frac{1}{S}\sum_{i=1}^{S} y_i \quad (12)$$

where $y_i$ is the actual value corresponding to the ith station and $S$ is the total number of stations.
The result is shown in Figure 16. In general, the error distribution in the temporal dimension is similar to that in the spatial dimension, and the RMSE is positively correlated with the TMAV.

Based on the scatter plot of RMSE and TMAV shown in Figure 17, we further carry out a regression analysis on these two measures and obtain the linear regression equation and the coefficient of determination.

Conclusions and Future Work
In this paper, we propose a spatiotemporal graph convolutional neural network model, called STGCNNmetro, to predict the passenger inflow and outflow volumes of a citywide metro network. We apply stereogram graph convolution kernels to historical metro passenger flow-volume time series to seamlessly capture the irregular spatiotemporal dependencies between stations along the metro network. Furthermore, a deep structure composed of GCNNs is constructed to capture the distant spatiotemporal dependencies at the citywide level. Finally, we fuse the spatiotemporal dependencies captured from recent, daily, and weekly patterns to form the final predicted values of passenger inflow and outflow volumes. The effectiveness of the STGCNNmetro model is verified by predicting the citywide short-term passenger flow volume of the metro network of Shanghai, China. Experiments show that our STGCNNmetro model outperforms seven baseline models, namely LSVR, PCA-kNN, NMF-kNN, Bayesian, MLR, M-CNN, and LSTM. It achieves an "end-to-end" prediction that can accept raw metro passenger flow-volume data and automatically capture the effective features of the citywide metro network. By discussing the spatial and temporal error distributions of the results of the STGCNNmetro model, we conclude that the errors are positively correlated with the actual values for most predicted targets. However, there are also several stations with abnormally high RMSEs.
In the future, we will further optimize the model's network structure and parameters to explore the reason for the abnormal prediction points. Moreover, we will continue our work to design an "end-to-end" DL model for the multistep prediction of metro passenger flow volumes.

Figure 1. Graph-structured metro passenger flow-volume data. At time step t, in the graph $G_t = (V_t, E, W)$, $V_t$ is an observation vector of n metro stations at time step t, each element of which records the historical passenger flow volume for a single metro station; $E$ is a set of edges, indicating the connectedness between stations; while $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $G_t$.

Figure 2. Procedure for capturing spatial dependency by graph convolution for K = 1 (using only one vertex as an example). K is the kernel size of the graph convolution, which determines the maximum radius of the convolution from central nodes, $\alpha_1$ is the convolution kernel parameter, $\sigma$ is an activation function (i.e., rectified linear unit (ReLU)), $y^{(i)}$ indicates the passenger flow volume of vertex i to be predicted, and each $f^{(j)}$ indicates the historical passenger flow volume of vertex j.

Figure 3. Procedure for capturing spatiotemporal dependencies by graph convolution for K = 1 (using only one vertex as an example). $\alpha_t$ is the convolution kernel parameter at the time step t, and $f_t^{(j)}$

Figure 4. Procedure for capturing the spatiotemporal dependencies of metro network data using graph convolutional neural networks (GCNNs) (taking Station 6 as an example).

Figure 5. Architecture of the STGCNNmetro model. Conv: Convolution; $X_t$ and $X'_t$ represent actual and predicted values in the tth time interval, respectively; $X_R$, $X_D$, and $X_W$ represent the outputs of the recent, daily, and weekly patterns, respectively; $X_F$ is the fusion result of $X_R$, $X_D$, and $X_W$ using a parametric-matrix-based method.

Figure 7. A comparison of the predicted and actual values of passenger inflow volume per 10 min using the test data from the Shenzhuang metro station, Shanghai. Data cover five consecutive days (from Sunday to Thursday). STGCNNmetro: spatiotemporal graph convolutional neural networks model for metro; Bayesian: Bayesian regression model; PCA-kNN: a mixture of principal component analysis (PCA) and k-nearest neighbor (kNN) regression [6].

Figure 8. A comparison of the predicted and actual values of passenger inflow volume between 6:40 a.m. and 11:00 p.m. at the Shenzhuang metro station for (a) a Sunday (non-working day) and (b) a Monday (working day).
Unlike the STGCNNmetro model, the Bayesian model requires a separate set of parameters for each station, and its training time is almost nine times that of the STGCNNmetro model. The STGCNNmetro model is therefore better suited to practical application than the Bayesian model.

Figure 9. A comparison of the predicted and actual values of passenger outflow volume per 10 min using the test data from the Shenzhuang metro station. Data cover five consecutive days (from Sunday to Thursday).

Figure 10. A comparison of the predicted and actual values of outflow volume between 6:40 a.m. and 11:00 p.m. for the Shenzhuang metro station on (a) a non-working day and (b) a working day.

3.3.1. Different Input Lengths of the Recent, Daily, and Weekly Passenger Flow Volume Patterns

Figure 11. Experimental results for different input lengths of (a) the recent pattern, (b) the daily pattern, and (c) the weekly pattern. RMSE: root-mean-square error.

Figure 12. Experimental results with different numbers of graph convolution layers.

Figure 13. Experimental results with different numbers of kernels.

Figure 14. The spatial distribution of (a) the RMSE of passenger inflow volume, (b) the RMSE of passenger outflow volume, (c) the station mean actual value (SMAV) of passenger inflow volume, and (d) the SMAV of passenger outflow volume, at each metro station on working days; and the spatial distribution of (e) the RMSE of passenger inflow volume, (f) the RMSE of passenger outflow volume, (g) the SMAV of passenger inflow volume, and (h) the SMAV of passenger outflow volume, at each station on non-working days. The numbers 1-6 indicate the metro stations of People's Square, Xu Jidong, Hongqiao Railway, Shanghai South Railway, Caohejing Development Zone, and Lu Jiazui, respectively.

Figure 15. The relationship between the RMSE and SMAV of (a) passenger inflow volume at each station on working days, (b) passenger outflow volume at each station on working days, (c) passenger inflow volume at each station on non-working days, and (d) passenger outflow volume at each station on non-working days.

Figure 16. The RMSE distribution of (a) the passenger inflow volume at each 10-min interval and (b) the passenger outflow volume at each 10-min interval; and the time mean actual value (TMAV) distribution of (c) the passenger inflow volume at each 10-min interval and (d) the passenger outflow volume at each 10-min interval.

Figure 17. The relationship between the RMSE and TMAV for (a) passenger inflow volume and (b) passenger outflow volume, at each 10-min interval.
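
The per-station and per-interval quantities compared in Figures 14 to 17 can be sketched as follows. The sketch assumes that SMAV is the mean of the actual values over all time intervals for one station, and that TMAV is the mean of the actual values over all stations for one 10-min interval; these readings of the abbreviations, the function names, and the toy data are assumptions made for illustration.

```python
import math


def rmse_and_mean(actual, predicted):
    """Return (RMSE, mean actual value) for two equal-length sequences."""
    n = len(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / n), sum(actual) / n


def per_station(actual, predicted):
    """One (RMSE, SMAV) pair per station; columns of the input are stations."""
    return [
        rmse_and_mean(a_col, p_col)
        for a_col, p_col in zip(zip(*actual), zip(*predicted))
    ]


def per_interval(actual, predicted):
    """One (RMSE, TMAV) pair per time interval; rows of the input are intervals."""
    return [rmse_and_mean(a_row, p_row) for a_row, p_row in zip(actual, predicted)]


# Toy data: 2 time intervals (rows) x 2 stations (columns), one flow type.
actual = [[100, 10], [300, 30]]
predicted = [[110, 12], [280, 29]]

station_stats = per_station(actual, predicted)
interval_stats = per_interval(actual, predicted)
```

In this toy data the busier station (mean 200) also has the larger RMSE, which is the kind of relationship between error and mean volume that Figures 15 and 17 plot.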

Table 1. Brief introductions to the baseline models used for comparison with the model used in this study.

Table 2. Root-mean-square error (RMSE) and mean absolute percentage error (MAPE) values for different models.
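
For reference, the two error measures named in Table 2 can be computed as in the minimal sketch below, which uses their standard definitions. Skipping intervals whose actual value is zero when computing MAPE is an assumption made here to avoid division by zero; the caption does not say how such intervals are handled. The sample values are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is zero are skipped (an assumption made for
    this sketch, since the percentage error is undefined there).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Invented passenger counts for three 10-min intervals.
actual = [100, 200, 400]
predicted = [110, 190, 380]

error_rmse = rmse(actual, predicted)
error_mape = mape(actual, predicted)
```

RMSE is expressed in passengers per interval and is dominated by high-volume intervals, whereas MAPE is relative and gives low-volume intervals equal weight, which is why the two measures are usually reported together.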