A Hybrid GLM Model for Predicting Citywide Spatio-Temporal Metro Passenger Flow

Abstract: Accurate prediction of citywide short-term metro passenger flow is essential to urban management and transport scheduling. Recently, an increasing number of researchers have applied deep learning models to passenger flow prediction. Nevertheless, the task is still challenging due to the complex spatial dependency on the metro network and the time-varying traffic patterns. Therefore, we propose a novel deep learning architecture that combines graph attention networks (GAT) with long short-term memory (LSTM) networks, called the hybrid GLM (hybrid GAT and LSTM Model). The proposed model captures the spatial dependency via the graph attention layers and learns the temporal dependency via the LSTM layers. Moreover, several external factors are embedded. We tested the hybrid GLM by predicting the metro passenger flow in Shanghai, China, and compared the results with the forecasts from some typical data-driven models. The hybrid GLM achieves the smallest root-mean-square error (RMSE) and mean absolute percentage error (MAPE) at different time intervals (TIs), which demonstrates the superiority of the proposed model. In particular, at a TI of 10 min, the hybrid GLM yields an additional improvement of about 6–30% in terms of RMSE. We additionally explore the sensitivity of the model to its parameters, which will aid the application of this model.


Introduction
People constantly interact with the urban space through various spatio-temporal activities, such as taking the subway, driving, and walking [1]. In the big data era, the rapid proliferation of mobile sensors and Internet technologies continuously generates an exceptionally large amount of spatio-temporal data, which offers unprecedented opportunities for constructing intelligent transportation systems (ITS). In particular, short-term metro passenger flow prediction is an important part of ITS. Accurate prediction of passenger flow can help urban managers to fine-tune travel behaviors, reduce passenger congestion, and enhance the service quality of metro system [2]. From a broader point of view, metro passenger flow prediction helps to optimize traffic efficiency via alleviating the imbalance of transport capacity across the city. Therefore, developing an effective framework for predicting passenger flow in a citywide metro network is essential.
Due to its great practical value, passenger flow prediction has been extensively investigated. Existing solutions can be classified into three categories: statistical methods, machine learning (ML) methods [3], and deep learning (DL) methods. Statistical methods are simple but cannot capture non-linear features. ML methods address this drawback but are still incapable of processing raw spatio-temporal data. When constructing an ML-based model, feature extractors require precise engineering and substantial domain knowledge to transform raw data into a proper internal representation. This procedure is called feature engineering, and it is particularly challenging for big data. Compared with ML methods, DL methods extract features automatically and accept raw input to make end-to-end predictions, so they can learn more complex non-linear features and generalize better. DL methods are the most widely used solution for passenger flow prediction. As metro passenger flow prediction is a time series problem, recurrent neural networks (RNNs) are effective for the task [4,5]. In addition, it has become popular to combine RNNs with convolutional neural networks (CNNs) for traffic flow prediction [6,7], owing to the ability of CNNs to mine spatial dependency. However, CNNs are designed for spatial structures in Euclidean space (e.g., 2D images and regular grids), so they cannot fully adapt to the complex topological structure of a metro network. To address this problem, graph modeling of spatio-temporal data has come into the spotlight. Several works have studied graph neural networks (GNNs) for capturing topological spatial correlation [8][9][10]. However, existing graph-structure-based approaches still have the following gaps:

1. Most graph-structure-based approaches are based on the graph convolutional network (GCN) [11], which operates in the spectral domain. Because it relies on the Laplacian matrix, GCN requires the network to be symmetric. However, asymmetric networks, i.e., networks whose graph structure is asymmetric, are common in a city. For example, a road network with one-way and two-way streets is an asymmetric network. A GCN-based structure cannot be used in this case.

2. Most graph-structure-based methods do not improve the adjacency matrix. In other words, they only consider the effect of adjacent nodes and ignore nodes located a little further away.

3. Some graph-structure-based models only capture the spatial dependency but ignore the temporal dependency and external factors.
To overcome the abovementioned issues, we propose a hybrid DL model for short-term metro passenger flow prediction by integrating graph attention networks (GATs) and long short-term memory (LSTM) networks. The proposed model is called the hybrid GLM (hybrid GAT and LSTM Model). Inspired by Petar Veličković [12], we introduce the GAT to solve the problem that GCN cannot be applied to asymmetric networks in a city. GAT captures the spatial dependency of adjacent nodes by calculating the graph attention coefficient between nodes, which makes it suitable for asymmetric graphs. In addition to GAT, we use LSTM networks to model the temporal dependency of metro passenger flow. External factors related to metro passenger flow include weather conditions, air quality, weekends, holidays, and events. We feed these factors into another LSTM layer to improve the accuracy of the entire model. The main contributions of this paper are as follows:

1. We propose a hybrid graph-structure-based model to predict the short-term metro passenger flow. The GAT structure in the proposed model can capture the complex topological dependency. Besides, GAT puts more focus on nodes, which means it can solve the problem that GCN cannot be used in asymmetric networks. Moreover, we improve the adjacency matrix in the GAT structure to model nodes located a little further away.

2. We construct a novel framework to jointly model the spatial, dynamic temporal, and external dependencies in metro flow volume data. Specifically, we stack graph-structure-based layers based on GAT, recurrent layers based on LSTM, and an output layer based on fully connected neural networks in the proposed model.

Related Work
Passenger flow prediction is one of the major research issues in geo-information systems. Extensive research has been conducted to solve the problem. Existing methods can be categorized into three broad types: statistical methods, machine learning methods and deep learning methods.

Statistical Methods
Statistical methods predict future values based on previously observed values through time-series analysis [9]. As passenger flow data are a kind of time series data, it is feasible to use statistical methods for the prediction task. Statistical methods include the autoregressive integrated moving average (ARIMA) model [13] and its variations [14,15], the logistic regression model [16], Kalman filtering [17], etc. Liu [13] used a model based on ARIMA to forecast rail traffic and found the results superior to those of a back-propagation (BP) network. Ding [18] integrated ARIMA and GARCH models to forecast short-term subway ridership while accounting for dynamic volatility. Liang [19] combined Kalman filtering and the K-nearest neighbor (KNN) approach to handle different variation trends in the passenger flow data.
These methods can well capture the linear features but may neglect the non-linear features of passenger flow. However, passenger flow is often influenced by a variety of factors. The prediction performance of the statistical methods may worsen significantly if the data are not stable.

Machine Learning Methods
ML methods can map the complicated non-linear relations between input and output data, which addresses the limitation of statistical methods. The support vector machine (SVM) [20] is one of the most widely used ML models. It strikes a compromise between prediction accuracy and generalization ability based on the structural risk minimization principle [21]. Zhang [22] used the SVM to predict traffic flow and obtained better results than linear regression models. Hybrid models based on SVR have also been widely used [23,24]. Li [25] integrated the seasonal autoregressive integrated moving average (SARIMA) model and SVM to establish a traffic flow prediction model. Cao [26] tuned the parameters of the SVM model by using particle swarm optimization (PSO) for traffic flow prediction. Wang [21] proposed an online SVM model for capturing the periodicity and non-linear characteristics of short-term metro ridership, which extracts input features via the SARIMA model and optimizes the parameters via PSO. Some other ML models, such as Bayesian networks, random forests (RFs), BP neural networks, and KNN, are also used in traffic flow prediction. Roos [27] proposed a Bayesian network for traffic flow forecasting that can be used with incomplete data. Liu [28] combined RF methods to predict passenger flow, with more focus on input feature combination. Zhang [29] combined principal component analysis (PCA) and error BP networks to predict bus passenger flow, which increases the convergence speed. Bai [30] enhanced the KNN method by considering the trend factor and time interval factor of passenger flow, achieving better performance than the BP network and the original KNN method.
ML methods greatly improved the accuracy of the metro passenger flow prediction. However, the performance of ML methods heavily depends on the manually designed features. Thus, it is hard to yield the best results for passenger flow prediction due to the complex and huge spatio-temporal data. Nowadays, it is rare to apply a single ML model to passenger flow prediction.

Deep Learning Methods
Due to the complexity of spatio-temporal data, most state-of-the-art studies apply DL methods to passenger flow prediction. Compared with ML methods, DL methods can automatically extract essential features from raw data and make them robust with respect to variations in inputs [31].
RNNs [32] are good at dealing with complicated sequence information. As passenger flow is a kind of time series data, RNNs and their successors, such as long short-term memory (LSTM) networks [33] and gated recurrent units (GRU) [34], are commonly used in the prediction task. Zhao [35] proposed a traffic forecasting model based on LSTM networks and achieved good performance. Zhang [36] used a GRU-based method to predict urban traffic flow. Later, HAN [37] improved the optimizer of the LSTM, yielding better results than the standard LSTM. Lin [38] used random forest (RF) to calculate the feature importance and applied the LSTM to predict metro passenger flow. However, the models mentioned above only capture the temporal dependency and neglect the spatial dependency, so they cannot optimize the performance for the entire network.
CNNs [39] are originally designed for data with regular grids, such as images. Some works apply CNN to identify spatial dependency with various localized filters or kernels. Zhang [40] input the spatio-temporal flow data into regular grids and proposed a model based on CNN called DeepST, which contains three fragments, denoting recent time, near history, and distant history. This is a milestone in passenger flow prediction. From then on, CNN-based models became prevailing in passenger flow prediction. Zhang [41] continued to integrate the residual unit to propose ST-ResNet. Residual networks can increase the model depth to capture characters with longer distances and more complex structures. Yu [42] designed a three-dimensional CNN network to achieve large-scale prediction on traffic flow. To capture the spatio-temporal dependency, scholars exploit RNNs in combination with CNN-based networks for passenger flow prediction. Ren [7] combined the ResNet and the LSTM to form a hybrid model, HIDLST, which yields better results than ST-ResNet. Qiao [6] utilized a one-dimensional CNN and LSTM for flow prediction.
However, CNN-based models must operate on regular grid data, which means they cannot capture the irregular topological relations of non-Euclidean data. Such topological relations often exist in traffic infrastructure, such as stations, metro lines, and roads. Therefore, the topological information of traffic networks should be considered. With the development of graph theory [11,43,44], it has become feasible to apply a graph structure to passenger flow prediction. Yu [45] used GCN to capture the spatial relationship and applied one-dimensional convolution to explore the temporal relationship. The proposed STGCN model was verified using the PEMS and Beijing datasets. Zhao [46] proposed the temporal graph convolutional network (T-GCN) model, combining the GCN with the GRU for traffic prediction. Zhang [47] utilized the GCN and 3D-CNN to model passenger flow. Ye [10] designed three spatial matrices to extract the spatial dependency of neighbors at different distances. However, because of the use of GCN, these models can only be used in symmetric networks. Based on the attention mechanism, Guo [8] proposed an attention-based spatio-temporal graph convolutional network (ASTGCN) model, which can effectively capture the dynamic spatio-temporal dependency in traffic data. Zhang [48] combined ResNet, GCN, and attention LSTM to build a hybrid model, ResLSTM, which has achieved good results on the prediction of metro passenger flow in Beijing. However, these models do not improve the adjacency matrix.

Summary
Statistical methods and ML methods can be used for metro passenger flow prediction. However, neither kind of model can accept raw inputs for prediction. Therefore, it is hard for them to obtain higher prediction accuracy given the complexity and randomness of the spatio-temporal data.
Among the DL methods, the graph-structure-based model is the current focus of passenger flow prediction research. Though the current literature shows progress in the given tasks, a number of knowledge gaps need to be addressed: (1) GCN-based models require the network structure to be symmetric, so they are not applicable to asymmetric networks in a city; (2) most graph-structure-based methods only consider the effect of adjacent nodes and ignore nodes located a little further away; and (3) some graph-structure-based models only capture the spatial dependency but ignore the temporal dependency and external factors.
Our model is inspired by GAT [12], which can model the topological spatial dependency of asymmetric networks. To solve the above problems, we propose the hybrid GLM to extract spatio-temporal dependency and external dependency seamlessly.

Problem Definition
The research objective is to seamlessly model the spatial and temporal dependency of metro passenger flow data. The spatial dependency refers to the influence between metro stations, while the temporal dependency refers to the influence of historical metro passenger flow to the current time point. Moreover, metro passenger flow is affected by some external factors, such as weather conditions, temperature, holidays, air quality, etc. The external factors may affect human mobility, which increases the uncertainty of the metro passenger flow prediction. For example, people tend to stay at home rather than go out for dinner on a rainy day.
Thus, the problem of spatio-temporal metro passenger flow forecasting can be regarded as an equation, $\mathrm{Target} = F_{\mathrm{prediction}}(I_S, I_T, I_E, W)$, where Target is the target passenger flow at time $t$, $F_{\mathrm{prediction}}$ is the model used to tackle the problem, $I_S$ represents the input related to the spatial dependency, $I_T$ represents the input related to the temporal dependency, $I_E$ is the external factor influencing the metro passenger flow, and $W$ denotes the parameters to be learned. The whole prediction process is shown in Figure 1. Our goal is to use the historical passenger flow to predict the metro passenger flow at a certain moment.
The input data come from three fragments of historical flow data: the close, daily, and weekly patterns. The close pattern refers to the recent time. The daily and weekly patterns denote the historical passenger flow at the same time as the target time, but at daily or weekly periodicities [7]. If the target time interval is 7:00 a.m. to 7:10 a.m. on Saturday, the close pattern refers to the times close to 7:00 a.m., such as 6:50 a.m., 6:40 a.m., 6:30 a.m., etc. The daily pattern refers to 7:00-7:10 a.m. on each of the prior $d$ days. The weekly pattern refers to 7:00-7:10 a.m. on each Saturday of the prior $w$ weeks.
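The input data of the close, daily, and weekly patterns can be described by Equations (1)-(3):

$$I_C = [I_{t-c}, I_{t-(c-1)}, \ldots, I_{t-1}] \tag{1}$$

$$I_D = [I_{t-d \cdot n}, I_{t-(d-1) \cdot n}, \ldots, I_{t-n}] \tag{2}$$

$$I_W = [I_{t-7w \cdot n}, I_{t-7(w-1) \cdot n}, \ldots, I_{t-7n}] \tag{3}$$

where $I_t$ represents the passenger flow at the target time $t$, and $I_C$, $I_D$, and $I_W$ denote the historical passenger flow of the close, daily, and weekly patterns, respectively. The numbers of time intervals in $I_C$, $I_D$, and $I_W$ are $c$, $d$, and $w$, and the total number of time intervals in a day is $n$.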
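The selection of the three input fragments can be illustrated with a small sketch. It assumes, following the description above, that the close pattern is the c intervals immediately before the target, the daily pattern is the same interval on each of the prior d days, and the weekly pattern is the same interval on each of the prior w weeks. The function name `pattern_indices` and the example values are hypothetical and are not taken from the paper.

```python
def pattern_indices(t, c, d, w, n):
    """Return the interval indices of the close, daily, and weekly patterns.

    t: index of the target interval; n: number of intervals per day.
    Indices are listed from the oldest to the most recent interval.
    """
    close = [t - k for k in range(c, 0, -1)]            # t-c, ..., t-1
    daily = [t - k * n for k in range(d, 0, -1)]        # t-d*n, ..., t-n
    weekly = [t - 7 * k * n for k in range(w, 0, -1)]   # t-7*w*n, ..., t-7*n
    return close, daily, weekly


# Example with 10-min intervals, i.e., n = 144 intervals per day.
close, daily, weekly = pattern_indices(t=3000, c=3, d=2, w=2, n=144)
print(close)   # three intervals just before the target
print(daily)   # same interval on the two prior days
print(weekly)  # same interval in the two prior weeks
```

In practice the target index t must be large enough that all returned indices are non-negative, i.e., at least 7·w·n intervals of history are available.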

Principle of GAT
Graph attention networks (GATs) were proposed by Petar Veličković [12], which can be operated on graph-structured data. GAT introduces an attention mechanism into the graph structure and applies masked self-attentional layers that can assign different importance to different nodes within a neighborhood without costly matrix operations [49]. Besides, a graph structure is injected into the model as a mask. In this way, neither the matrix operation nor the entire graph structure is needed. Therefore, we can apply GAT to the incomplete graph, directed graphs, asymmetric graphs, and dynamic graphs. The graph structure is one of the typical organizations of citywide metro data. Therefore, we can apply GAT to predict citywide metro passenger flow.
Using GAT, the first step is to construct a graph structure $G(V, E)$, where $V$ is the set of nodes, namely the metro stations, and $E$ is the set of edges, namely the metro line segments between two neighboring stations. We then build a graph attention layer, which is stacked to form the graph attention network. The inputs are the features of the nodes, i.e., the passenger flow of each metro station. They can be described as $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^{F}$, where $N$ is the number of nodes and $F$ is the number of features of each node. The layer produces a new set of node features as outputs, $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, where $F'$ is the number of output features. A shared linear transformation, parametrized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, is applied to each node. Then we perform a shared self-attention mechanism on the nodes, denoted as $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$. Its objective is to compute the attention coefficients $e_{ij} = a(W\vec{h}_i, W\vec{h}_j)$, where $e_{ij}$ indicates the importance of node $j$ to node $i$. Next, we inject a mask based on the graph structure, which means we only compute $e_{ij}$ for $j \in N_i$, where $N_i$ is the neighborhood of node $i$ (as shown in Figure 2, we only calculate the influence of Nodes 1, 7, 5, and 6 on Node 3). Lastly, we apply the LeakyReLU activation and normalize over the neighborhood. The attention coefficient is as follows:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}[W\vec{h}_i \,\Vert\, W\vec{h}_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}[W\vec{h}_i \,\Vert\, W\vec{h}_k]\right)\right)} \tag{4}$$

where $^{T}$ represents transposition and $\Vert$ is the concatenation operation.
To stabilize the learning process of self-attention, the authors extended the mechanism to multi-head attention. The features of the heads are concatenated, and the output feature is as follows:

$$\vec{h}'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} \vec{h}_j\Big) \tag{5}$$

where $\Vert$ represents concatenation, $K$ is the number of heads, $\alpha_{ij}^{k}$ are the normalized attention coefficients computed by the $k$-th attention mechanism, and $W^{k}$ is the weight matrix of the corresponding input linear transformation [12]. In particular, if we perform multi-head attention on the final prediction layer of the network, we average over the heads rather than concatenate them. The process is shown in Figure 3, which contains three heads. We aim to calculate the graph attention coefficients of Nodes 2, 3, and 4 with respect to Node 1 in the picture.
We calculate the attention coefficients of all neighbor nodes and add them to obtain the final attention coefficient of the node. In this way, the influence between nodes can be captured and the spatial dependency of passenger flow can be accurately obtained.
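The masked attention computation of a single GAT head can be sketched in NumPy as follows. This is a minimal illustration of the standard GAT formulation in [12], not the authors' implementation. It assumes that the mask contains self-loops so that every node has at least one neighbor, it omits the output nonlinearity, and the function name `gat_layer`, the LeakyReLU slope of 0.2, and the toy graph are illustrative choices.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph attention layer (output nonlinearity omitted).

    h:   (N, F) node features
    adj: (N, N) mask; adj[i, j] != 0 means node j is in the neighborhood of i
         (assumed to include self-loops, so no row is empty)
    W:   (F_out, F) shared linear transformation
    a:   (2 * F_out,) attention vector
    Returns the attention matrix alpha (N, N) and the new features (N, F_out).
    """
    Wh = h @ W.T                                   # (N, F_out)
    f_out = Wh.shape[1]
    # a^T [Wh_i || Wh_j] splits into a[:F_out].Wh_i + a[F_out:].Wh_j
    src = Wh @ a[:f_out]                           # contribution of node i
    dst = Wh @ a[f_out:]                           # contribution of node j
    e = leaky_relu(src[:, None] + dst[None, :], slope)
    e = np.where(adj != 0, e, -np.inf)             # mask out non-neighbors
    e = e - e.max(axis=1, keepdims=True)           # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ Wh


# Toy directed (asymmetric) graph with self-loops.
rng = np.random.default_rng(0)
N, F, F_out = 4, 3, 2
h = rng.normal(size=(N, F))
adj = np.array([[1, 1, 0, 0],
                [0, 1, 1, 0],
                [1, 0, 1, 1],
                [0, 0, 0, 1]])
W = rng.normal(size=(F_out, F))
a = rng.normal(size=(2 * F_out,))
alpha, h_out = gat_layer(h, adj, W, a)
```

Because the mask is applied row by row, nothing requires `adj` to be symmetric, which is the property that makes attention-based layers usable on directed or asymmetric graphs.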

Principle of LSTM
Long short-term memory (LSTM) networks were proposed by Hochreiter and Schmidhuber [50]. It is a kind of recurrent neural network (RNN). The objective of LSTM is to model the longer-distance features of the time series. LSTM can tackle the problems of gradient exploding and vanishing in the traditional RNNs. The LSTM consists of three parts, an input layer, a recurrent hidden layer, and an output layer. Different from the traditional RNNs, the recurrent hidden layer of the LSTM contains a special memory block, whose core structure is shown in Figure 4. The memory block contains memory cells with self-connection that store the temporal state of the network at each time step [51]. The temporal state is controlled by three gates: the forget gate, the input gate, and the output gate. The input gate is to protect the memory contents from irrelevant inputs. The forget gate is to forget some useless message. The output gate is to export the outputs.
In Figure 4, $X_t$ is the input at the current time point, $h_t$ is the output of the hidden layer, $h_{t-1}$ is the output of the hidden layer at the previous time interval, $\tilde{C}_t$ is the input state of the cell, $C_t$ is the output state of the cell, and $C_{t-1}$ is the output state of the cell at the previous time interval. The coefficients of the input gate, the forget gate, and the output gate in LSTM can be calculated by Equations (6)-(8) below.
input gate:

$$i_t = \sigma(W_{xi} X_t + W_{hi} h_{t-1} + b_i) \tag{6}$$

forget gate:

$$f_t = \sigma(W_{xf} X_t + W_{hf} h_{t-1} + b_f) \tag{7}$$

output gate:

$$o_t = \sigma(W_{xo} X_t + W_{ho} h_{t-1} + b_o) \tag{8}$$

where $W_{xi}$, $W_{xf}$, and $W_{xo}$ are learnable weight parameters connecting $X_t$ with the input gate, forget gate, and output gate; $W_{hi}$, $W_{hf}$, and $W_{ho}$ are weight parameters connecting $h_{t-1}$ with the three gates; and $b_i$, $b_f$, and $b_o$ are learnable bias parameters. $\sigma$ is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The input state of the cell is as follows:

$$\tilde{C}_t = \tanh(W_{xC} X_t + W_{hC} h_{t-1} + b_C)$$

where $W_{xC}$ is the weight matrix connecting $X_t$ with the cell inputs, $W_{hC}$ is the weight matrix connecting $h_{t-1}$ with the cell inputs, $b_C$ is the learnable bias parameter, and $\tanh$ is the hyperbolic tangent function. The output state of the cell is as follows:

$$C_t = i_t \circ \tilde{C}_t + f_t \circ C_{t-1}$$

where $i_t$, $f_t$, $C_t$, $C_{t-1}$, and $\tilde{C}_t$ share the same dimension and $\circ$ denotes the element-wise product. The output of the hidden layer is as follows:

$$h_t = o_t \circ \tanh(C_t)$$

In short, the LSTM can "remember" the needed information and "forget" the useless information. Thus, the LSTM has a strong ability to process time series with longer temporal dependencies. Applying the LSTM to metro passenger flow prediction captures the temporal dependency of the data, which contributes to the accuracy of the prediction model.
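One time step of the LSTM memory block described above can be sketched in NumPy. The parameter names mirror the symbols in the text (`W_xi`, `W_hi`, `b_i`, and so on). The helper `init_params`, the hidden size, and the random inputs are hypothetical choices for illustration and do not correspond to the configuration of the hybrid GLM.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_in, n_hidden, rng):
    """Create small random weights and zero biases for the three gates and the cell input."""
    p = {}
    for name in ("i", "f", "o", "C"):
        p["W_x" + name] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W_h" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p["b_" + name] = np.zeros(n_hidden)
    return p


def lstm_step(x_t, h_prev, c_prev, p):
    """Compute (h_t, C_t) from the input x_t and the previous state (h_prev, c_prev)."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_xC"] @ x_t + p["W_hC"] @ h_prev + p["b_C"])  # cell input state
    c_t = i_t * c_tilde + f_t * c_prev                                  # cell output state
    h_t = o_t * np.tanh(c_t)                                            # hidden output
    return h_t, c_t


# Run a short random sequence through the cell.
rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h, c = lstm_step(x_t, h, c, params)
```

The forget gate scales the previous cell state and the input gate scales the new candidate, which is how the block keeps or discards information across time steps.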

Model Development
Citywide metro passenger flow prediction is a typical spatio-temporal modeling problem. Therefore, we propose a model combining GAT and LSTM, which is called the hybrid GLM. Besides, the proposed model consists of a multi-time pattern. With GAT, the hybrid model can deal with topological problems better than the other models. The model consists of five parts, Branches 1-5. Branches 1-3 use the GAT structure to capture the spatial dependency in the close, daily, and weekly patterns. Branch 4 uses the LSTM to capture the temporal dependency through the fused close, daily, and weekly patterns. Branch 5 shows the impact of external factors. Moreover, an LSTM layer is used to obtain the output data. The detailed model architecture is presented in Figure 5.

[Figure 5. Architecture of the hybrid GLM. Inputs: Close, Daily, Weekly, Close+Daily+Weekly, and External (weather and AQI). The Close, Daily, and Weekly branches each pass through two graph attention layers (Graph Attention1, Graph Attention2).]

Branches 1-3: Spatial Dependency
The influence of historical passenger flow can be divided into three patterns: the close pattern, the daily pattern, and the weekly pattern. We treat the three patterns as three separate inputs, each of which is sent into a GAT for training. We separate them for two reasons. On the one hand, if the three parts were regarded as one input, the amount of data in the GAT would be large and the training process could be very slow. On the other hand, the spatial correlation among the three patterns is not strong, so there is no need to train them together. The GAT structure in our model can capture the topological features of the passenger flow, and it can be used in asymmetric networks. Every GAT structure in Branches 1 to 3 contains two graph attention layers. The topological relationships between the nodes are used to construct the adjacency matrix, which serves as a layer mask. We use the mask to capture the topological relations between metro stations. To observe the correlations of more distant metro stations, we improved the traditional adjacency matrix. In the traditional adjacency matrix, an entry is set to 1 if the two nodes are connected by a line. However, we want to capture the spatial correlation of nodes located a little further away for better prediction. To that end, we set 4, 3, 2, and 1 as the weights of the closest, less close, much less close, and further nodes, respectively. Besides, if the nodes are connected by several edges, we add 0.5 to the weight per edge. Doing so can also facilitate large-scale metro network prediction.
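One possible reading of the distance-weighted adjacency matrix is sketched below. It assumes that "closest", "less close", "much less close", and "further" correspond to stations 1, 2, 3, and 4 hops away, and that track segments are undirected. The text does not state these details, so they are assumptions. The additional 0.5 per extra edge is left out because its exact rule is not specified, and the function name `hop_weighted_adjacency` is hypothetical.

```python
from collections import deque


def hop_weighted_adjacency(n_nodes, edges, weights=(4, 3, 2, 1)):
    """Build a weighted adjacency matrix from hop distances.

    edges:   iterable of (u, v) pairs, treated as undirected (assumption)
    weights: weights[k-1] is assigned to node pairs that are k hops apart;
             pairs further apart than len(weights) hops keep weight 0
    """
    neighbors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for src in range(n_nodes):
        # Breadth-first search limited to len(weights) hops.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == len(weights):
                continue
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    A[src][v] = float(weights[dist[v] - 1])
                    queue.append(v)
    return A


# A single line with six stations: 0 - 1 - 2 - 3 - 4 - 5.
line_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
A = hop_weighted_adjacency(6, line_edges)
print(A[0])  # weights decrease with distance from station 0
print(A[2])  # station 2 sees neighbors on both sides
```

A matrix of this kind can serve as the attention mask, since any non-zero entry marks a station that should be attended to.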

Branch 4: Temporal Dependency
Another obvious characteristic of metro passenger flow is its temporal dependency, which refers to the impact of the historical passenger flow on the current time point. There are three obvious aspects of temporal correlation: proximity, trend, and periodicity. Proximity is the influence of the closest time intervals. Trend is the overall tendency over a period of time. Periodicity is the influence over a longer time span. In our model, the time intervals of the close, daily, and weekly patterns are merged and sent into the LSTM. LSTM is a special RNN that uses gate structures to decide which information is needed. It mitigates the exploding and vanishing gradient problems of traditional RNNs, which enables it to capture features over much longer temporal distances. In the hybrid GLM, a two-layer LSTM is used for the metro passenger flow sequence. Then, the data are flattened and fully connected with 578 neurons. Through Branch 4, the temporal dependency can be obtained and the overall periodicity of the metro passenger flow can be learned.

Branch 5: External Influence
Apart from the spatial and temporal factors, some external factors may affect metro passenger flow. For example, people tend to stay at home during sandstorms or on heavily polluted days. Major events, such as the National Day, may push the metro passenger flow to a new peak. External factors are essential references for people when they schedule their travel plans. At present, only a few models introduce external factors into the prediction model, and they pay little attention to air quality. Our model selects 11 external factors in three categories: weather conditions (maximum temperature, minimum temperature, and whether the day is rainy), air quality (AQI, PM2.5, PM10, NO2, CO, O3, and SO2), and events (whether the day is a holiday). We use the time series of these factors to analyze the external influence. The external data are recorded every hour, and some examples are shown in Table 1. However, the time interval in our experiment is 10 min, which means that a day contains 144 time intervals and an hour contains 6. Therefore, the 6 intervals within an hour share the same recorded data. For example, the interval from 6:00 to 6:10 uses the weather data recorded for 6:00 to 7:00, as shown in the first row of Table 1.
We normalized the external data so that all the quantities are in the same range. We performed one-hot encoding on the Boolean values like Holiday and RainyDay. Then the processed data were sent into the stacked LSTM layers. We built a three-layer LSTM and each layer has 256 neurons. At last, the output of Branch 5 and the outputs of the prior 4 Branches were put into the feature fusion part for training.
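The alignment of hourly external records with 10-min intervals, followed by normalization, can be sketched as follows. Min-max scaling to [0, 1] is an assumption, since the text only states that the quantities are brought into the same range. The function names and the sample temperatures are hypothetical.

```python
def expand_hourly(hourly_values, intervals_per_hour=6):
    """Repeat each hourly record so that every 10-min interval in that hour shares it."""
    return [v for v in hourly_values for _ in range(intervals_per_hour)]


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# 24 made-up hourly temperature readings for one day.
hourly_temperature = [12.0 + (h % 12) for h in range(24)]
per_interval = expand_hourly(hourly_temperature)   # 144 values, one per 10-min interval
scaled = min_max_normalize(per_interval)
print(len(per_interval))
```

Each numeric factor would be scaled separately in this way, while Boolean factors such as Holiday and RainyDay are one-hot encoded as described above.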

Feature Fusion
Because the output data from the five branches are in identical shape, we can fuse the five parts. However, the influence of the different parts varies. We adopt the parametric-matrix-based method, whose function is shown below.
$$\mathrm{Fusion} = W_1 \circ O_1 + W_2 \circ O_2 + W_3 \circ O_3 + W_4 \circ O_4 + W_5 \circ O_5$$
where Fusion is the prediction target after fusion, $\circ$ denotes the Hadamard product, $O_1$, $O_2$, $O_3$, $O_4$, and $O_5$ are the outputs from the five branches, and $W_1$, $W_2$, $W_3$, $W_4$, and $W_5$ are the corresponding learnable weights. Then the results after fusion are activated by the activation function σ (i.e., ReLU). An LSTM layer with 64 neurons was applied after feature fusion. The LSTM output was subsequently flattened and fully connected with 578 neurons to generate the final outputs.
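The parametric-matrix-based fusion can be sketched in a few lines of NumPy: each branch output is multiplied element-wise by its own weight matrix, the products are summed, and ReLU is applied. The function name `parametric_fusion` and the toy shapes are assumptions; in the actual model the weights are learned during training rather than fixed.

```python
import numpy as np

def parametric_fusion(outputs, weights):
    """Sum of Hadamard products W_k * O_k over the branches, followed by ReLU."""
    fused = sum(w * o for w, o in zip(weights, outputs))
    return np.maximum(fused, 0.0)

# Toy example: five branch outputs of identical shape with fixed weights.
outputs = [np.full((2, 3), float(k)) for k in range(1, 6)]
weights = [np.full((2, 3), 0.1) for _ in range(5)]
fused = parametric_fusion(outputs, weights)  # every entry is 0.1 * (1+2+3+4+5)
```

Because every weight matrix has the same shape as the branch output, the model can weight each branch differently for each output position.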

Model Training
To train the hybrid GLM, the mean-square error (MSE) is used as the loss function, as shown in Equation (13):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (13)$$
where $y_i$ is the ground-truth value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The original data were divided into three parts: a training dataset, a validation dataset, and a testing dataset. We use the training dataset for training in batches, and the loss is calculated per batch. In addition, we apply Adam as the optimizer for backpropagation training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default [52] (p. 309). Training updates all trainable parameters so as to minimize the loss.

Experiment Data
The metro passenger flow data used in this study were collected from Smart Card Data (SCD) of the metro system of Shanghai, China. The study area and the corresponding metro lines are shown in Figure 6. The time span of the data is between April 1st and April 30th in 2015. During this period, there were about nine million card records per day, covering 289 metro stations. Parts of the original data are shown in Table 2. The corresponding field descriptions are shown in Table 3.
Based on everyday experience, we can derive the metro passenger inflow and outflow from the original data. The field Figure is the key to telling whether a record is inflow or outflow: a Figure of zero represents inflow, while a non-zero Figure represents outflow. Take Row 1 in Table 2 as an example, which is a record of passenger outflow.
Then we define the time interval (TI) used for counting passenger flow. We choose 10 min, 15 min, 20 min, and 30 min as the TIs. If the TI is 10 min, the passenger flow is counted every 10 min, so a full day has 144 time slices. However, we chose 6:40 a.m. to 11:00 p.m. as the study period according to human activities, which spans 980 min. Therefore, there were 98 time slices in a day.
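The counting procedure can be sketched as follows. The sketch assumes each record has been reduced to a (time, Figure) pair, that a slice is identified by the time elapsed since 6:40 a.m., and that records falling into an incomplete last slice are dropped; the function names are hypothetical.

```python
from datetime import time

START = time(6, 40)   # start of the study period
END = time(23, 0)     # end of the study period

def _minutes(t):
    return t.hour * 60 + t.minute

def slices_per_day(ti_minutes):
    """Number of complete time slices in the study period."""
    return (_minutes(END) - _minutes(START)) // ti_minutes

def slice_index(t, ti_minutes):
    """0-based slice index of time t, or None outside the study period."""
    m = _minutes(t)
    if not (_minutes(START) <= m < _minutes(END)):
        return None
    return (m - _minutes(START)) // ti_minutes

def count_flows(records, ti_minutes=10):
    """records: iterable of (time, figure); figure == 0 is inflow, otherwise outflow."""
    n = slices_per_day(ti_minutes)
    inflow, outflow = [0] * n, [0] * n
    for t, figure in records:
        k = slice_index(t, ti_minutes)
        if k is None or k >= n:   # outside the period or in an incomplete last slice
            continue
        if figure == 0:
            inflow[k] += 1
        else:
            outflow[k] += 1
    return inflow, outflow

# Hypothetical records for one station.
demo = [(time(6, 40), 0), (time(6, 45), 3.0), (time(6, 50), 0), (time(5, 0), 0)]
inflow, outflow = count_flows(demo)
print(slices_per_day(10))  # 98
```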
The training set includes observations from 1-23 April 2015; the validation set is from 24-25 April 2015 (including one working day and one non-working day). We selected the last five days, 26-30 April 2015, as the testing period, which contains four working days and one non-working day.
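As a quick check of the split, the sketch below counts working and non-working days in each date range. It treats Saturdays and Sundays as non-working days and ignores public holidays and make-up working days, which is a simplifying assumption; the names `SPLITS` and `summarize` are hypothetical.

```python
from datetime import date, timedelta

SPLITS = {
    "train": (date(2015, 4, 1), date(2015, 4, 23)),
    "validation": (date(2015, 4, 24), date(2015, 4, 25)),
    "test": (date(2015, 4, 26), date(2015, 4, 30)),
}

def summarize(first, last):
    """Return (total days, weekdays, weekend days) for an inclusive date range."""
    days = [first + timedelta(days=k) for k in range((last - first).days + 1)]
    weekend = sum(d.weekday() >= 5 for d in days)  # 5 = Saturday, 6 = Sunday
    return len(days), len(days) - weekend, weekend

for name, (first, last) in SPLITS.items():
    print(name, summarize(first, last))
```

Under this assumption, the validation range contains one weekday and one weekend day, and the test range contains four weekdays and one weekend day, in line with the description above.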

Evaluation Metrics
To measure the performance of the different flow prediction models, we chose root-mean-square error (RMSE) and mean absolute percentage error (MAPE) as the evaluation metrics. They are calculated from the predicted values and the ground-truth values. Definitions of the two metrics are shown in Equations (14) and (15). By definition, the smaller the value is, the better the model performs.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (14)$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (15)$$
where $y_i$ is the ground-truth value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
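The two metrics can be sketched in NumPy as below. Following the note in the results section that MAPEs are calculated only from data whose actual value is not zero, the MAPE helper masks out zero ground-truth values; the function names and the toy values are assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """MAPE in percent, computed only where the ground truth is non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100.0)

# Toy values, not taken from the experiments.
y_true = [100, 200, 0, 50]
y_pred = [110, 180, 5, 50]
print(rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE is dominated by large absolute errors, which typically occur at busy stations and peak periods, whereas MAPE weights every non-zero sample by its relative error.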

Environment and Training Settings
Experiments were mainly run on a GPU platform with an NVIDIA GeForce GTX 1050 Ti graphics card, whose detailed information is shown in Table 4. Python libraries, including scikit-learn, Keras, and TensorFlow, were used to build our model. Parameter tuning is important for DL prediction. Here, only the final settings are listed, which gave the best results in our tuning. The detailed tuning procedure will be presented in Section 5.6. In our experiments, the numbers of time intervals for the close, daily, and weekly patterns were set as c = 7, d = 1, and w = 1. For Branches 1-3, we stacked two graph attention layers for every branch. The first layer had 6 output neurons, while the second layer had 2. We set the number of attention heads to 12 for better training. To avoid overfitting, dropout layers were added between the two graph attention layers. The dropout rate was set as 0.6. For Branch 4, we stacked two LSTM layers with 600 neurons each. A fully connected layer consisting of 578 neurons was applied in the end. To capture the influence of the external factors, we utilized a 3-layer LSTM with 256 neurons. For the feature fusion part, an LSTM layer and a fully connected layer consisting of 64 and 578 neurons were applied, respectively.
We trained the hybrid GLM model by minimizing the MSE for 200 epochs with a batch size of seven. We also used early stopping to avoid overfitting. The initial learning rate was set at 5 × 10^−4, with a decay rate of 0.95 after every 20 epochs. The training loss and validation loss become stable after 200 epochs, which shows the robustness of the proposed model.
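One way to read the learning-rate setting is as a staircase schedule in which the rate is multiplied by 0.95 once every 20 epochs. This interpretation, and the function name `learning_rate`, are assumptions; the text does not state whether the decay is applied stepwise or continuously.

```python
def learning_rate(epoch, initial=5e-4, decay=0.95, step=20):
    """Staircase decay: multiply the rate by `decay` after every `step` epochs."""
    return initial * decay ** (epoch // step)

for epoch in (0, 19, 20, 199):
    print(epoch, learning_rate(epoch))
```

Under this reading, the rate is reduced nine times over 200 epochs and stays within the same order of magnitude as the initial value.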

Baseline Models
We compare the hybrid GLM with the following 10 baseline models (including one statistical method, two ML methods, and seven DL methods) to evaluate the performance. To make a fair comparison, all these models take the close, daily, and weekly data as inputs. The Adam optimizer is used for all the models. Related abbreviations of the baseline models are shown in Table 5. The descriptions of these baseline models are as follows.
• KNN: K-nearest neighbor (KNN) regression [55] is a commonly used method in nonparametric regression. We also employ PCA to select the principal components before inputting the data into KNN;
• RSVR: A typical machine learning method [20]. The kernel of SVR in scikit-learn is set as a radial-basis function (RSVR);
• LSTM: Long short-term memory (LSTM) networks [50]. LSTM is a special kind of RNN, which is capable of learning long-term temporal dependencies. The model consists of two stacked LSTM layers and one fully connected layer;
• CNN: A convolutional neural network (CNN) [56], which transforms the metro-network-based passenger flow into a two-dimensional image. The vertical axis represents the metro stations, and the horizontal axis represents time;
• ResNet: A model combining CNN and ResUnit (ResNet) [41]. It has previously been used in the traffic field. However, we do not embed the external factors in our study;
• STGCN: A model that generalizes CNNs to non-Euclidean data, which is used in the spectral domain with graph Fourier transforms. In our study, we utilize the spatio-temporal graph convolutional networks (STGCN) proposed by Han [57] as a baseline model;
• GAT: Graph attention networks (GAT) [12]. GAT is a kind of graph neural network, which can analyze the topological relations of nodes. Two graph attention layers are used in the model;
• GLM_NoE: The hybrid GLM with Branch 5 removed;
• GLM_NoIA: The hybrid GLM with only the traditional adjacency matrix used as a mask layer, instead of the improved adjacency matrix.

Different Networks Prediction Performance
Similar to the hybrid GLM, we tuned the hyperparameters for the other 10 baseline models and recorded the optimal hyperparameters. The final results are shown in Table 6. The MAPEs are calculated from the data whose actual value is not zero. In Table 6, FN_CNN denotes the number of hidden neurons of one CNN layer. D_CNN refers to the number of hidden layers in CNN. FN and D have similar meanings in the other baseline models. L_C, L_D, and L_W denote the lengths of the close, daily, and weekly patterns, respectively. K means the number of attention heads. Kernel_SVR refers to the type of kernel in the SVM algorithm. Neighbor means the number of nodes that one class contains. K_s is the kernel size of the graph convolution. To observe the prediction performance in a more intuitive way, we draw bar charts of the RMSEs and the MAPEs for all models in Figure 7. As shown in Table 6 and Figure 7, the hybrid GLM outperforms the baseline methods on the Shanghai metro data, with the smallest RMSE and MAPE. Compared with CNN, ResNet, and GAT, the hybrid GLM exhibits an obvious reduction in RMSE and MAPE. Specifically, compared with CNN, the hybrid GLM has a 34.11% relative reduction in RMSE and a 25.57% relative reduction in MAPE. Compared with ResNet, the hybrid GLM has a 31.58% relative reduction in RMSE and a 24.07% relative reduction in MAPE. Compared with GAT, the hybrid GLM has a 29.04% relative reduction in RMSE and a 19.54% relative reduction in MAPE.
In addition, the hybrid GLM also outperforms LSTM. Compared with the LSTM, the hybrid GLM exhibits an RMSE reduction of 13.37% and a MAPE reduction of 9.15%. We regard CNN, ResNet, and GAT as Group1, and LSTM as Group2. The hybrid GLM is superior to both groups because Group1 captures only the spatial dependency and Group2 captures only the temporal dependency, whereas the hybrid GLM combines the advantages of GAT and LSTM to capture the spatio-temporal dependency.
Next, we compared the performance of the different kinds of models: the statistical model, the ML models, and the DL models. The statistical model, LR, performs worse than the ML models and the DL models. Among the DL models, we find that most models concerning spatial dependency get worse results than the models based on recurrent neural networks. Take CNN and LSTM as an example: LSTM exhibits an RMSE reduction of 23.87% and a MAPE reduction of 18.07% compared with CNN. The reason may be that it is more difficult to capture the spatial dependency than the temporal dependency for citywide metro network prediction. As for the ML methods, KNN and RSVR, they perform worse than the DL model concerning temporal dependency, such as LSTM. However, they get better results than the DL models concerning spatial dependency, such as CNN, ResNet, and GAT. Compared with KNN, the hybrid GLM has a 23.94% relative reduction in RMSE and an 11.29% relative reduction in MAPE. Moreover, the hybrid GLM exhibits a reduction of 16.59% in RMSE and a reduction of 1.3% in MAPE compared with RSVR. Overall, the hybrid GLM outperforms the ML models on both metrics, although its MAPE margin over RSVR is small.
Compared with the raster-based models, like CNN and ResNet, the graph-structure-based models have a smaller RMSE and MAPE. Compared with ResNet, the STGCN has a 26.83% relative reduction in RMSE and a 23.11% relative reduction in MAPE. Compared with CNN, GAT has a 7.05% relative reduction in RMSE and a 7.50% relative reduction in MAPE. Generally speaking, the RMSEs of the raster-based models are larger than 40, while the hybrid GLM only gets 31.42. The MAPEs of the raster-based models are larger than 12, while the hybrid GLM only gets 9.43. From the results, we conclude that the graph-structure-based models can capture the irregular spatial dependency better than the raster-based models for citywide metro passenger flow prediction.
Then, we compared the hybrid GLM with STGCN and found that the hybrid GLM performs a little better. The hybrid GLM has a 6.49% relative reduction in RMSE and a 10.87% relative reduction in MAPE, mainly because the GAT component in the hybrid GLM improves the original adjacency matrix to compensate for the asymmetric matrix problem of STGCN.
Lastly, we discuss the contribution of the improved adjacency matrix and the external factors. As shown in Table 6, the hybrid GLM has an 8.74% relative reduction in RMSE and a 3.06% relative reduction in MAPE compared with GLM_NoIA, which indicates the benefits of the improved adjacency matrix. Moreover, the hybrid GLM embeds external factors, which improves the model a little: the hybrid GLM exhibits an RMSE reduction of 3.74% and a MAPE reduction of 3.94% compared with GLM_NoE.
In summary, by integrating GAT and LSTM, the hybrid GLM can capture the spatiotemporal dependency better than several existing models.
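The relative reductions quoted in this subsection are assumed here to follow the usual definition, (baseline − model) / baseline × 100%. The sketch below implements that definition and its inverse; the helper names are hypothetical, and the only inputs taken from the text are the hybrid GLM's RMSE of 31.42 and the 34.11% reduction relative to CNN.

```python
def relative_reduction(baseline, model):
    """Relative reduction of `model` with respect to `baseline`, in percent."""
    return (baseline - model) / baseline * 100.0

def implied_baseline(model, reduction_pct):
    """Baseline error implied by a model error and a quoted relative reduction."""
    return model / (1.0 - reduction_pct / 100.0)

glm_rmse = 31.42                                  # reported for the hybrid GLM
cnn_rmse = implied_baseline(glm_rmse, 34.11)      # implied value, not read from Table 6
print(round(relative_reduction(cnn_rmse, glm_rmse), 2))  # 34.11
```

The implied CNN value is above 40, which agrees with the statement that the RMSEs of the raster-based models are larger than 40.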

Prediction Results of a Specific Metro Station
According to the RMSE and MAPE results of the different models, we selected four models that performed relatively well, namely, STGCN, RSVR, LSTM, and GLM, and discuss them in detail. We chose the People's Square metro station as an example, for it is one of the most crowded stations in Shanghai and it is located in the center of the city. Figures 8 and 9 show the prediction results of passenger inflow and outflow per 10 min at People's Square metro station during the four working days and one non-working day. We set 6:40 a.m. to 11:00 p.m. as the time span. The periodicity of metro passenger flow on the working days is easy to see in Figures 8 and 9. The People's Square station is located in a typical working area. Therefore, the inflow rush hours come at 5:00 p.m. every working day (the time interval is around 161, 271, 361, and 455 in Figure 8), which is the time to get off work. As for outflow volume, the rush hours come at 9:00 a.m. every working day (the time interval is around 113, 209, 308, and 410 in Figure 9), which is when people arrive at work. As for the performance, the predicted values of the hybrid GLM, RSVR, LSTM, and STGCN models correspond well with the actual values. More specifically, in the red boxes of Figure 8, the hybrid GLM catches the characteristics of the inflow volume in rush hours more accurately than the other three models. However, the hybrid GLM performs slightly worse in predicting the outflow volume in non-rush hours, which can be seen in Figure 9. In Figure 10, we draw the prediction results of non-working days and working days separately. There are 98 time slices in a day.
[Figure 8: inflow volume versus time interval (10 minutes) for GLM, RSVR, LSTM, STGCN, and y_true. Figure 9: outflow volume versus time interval (10 minutes) for the same series.]
As shown in Figure 10a,c, the passenger flow volume of the non-working days is more variable and the prediction results are therefore slightly worse. Comparing the overall predictive abilities on non-working days, the hybrid GLM performs a little worse than the other three models.
However, the hybrid GLM is superior to the STGCN, LSTM, and RSVR models when predicting on working days, which can be seen in Figure 10b,d. The GLM model fits the ground-truth values much better than the other three models during the rush hours of working days (time intervals 50 to 70 in Figure 10b, and time intervals 10 to 20 and 60 to 80 in Figure 10d; see the red circles in Figure 10b,d). However, the hybrid GLM's ability to capture characteristics in non-rush hours of the working day is sometimes worse than that of the STGCN model (see the red box in Figure 10b). We assume the reason to be the low passenger flow in the non-rush hours, which leads to less flow in the graph structures.
To further explore the hybrid GLM's ability to predict during the rush hours of a working day, we quantified the RMSEs for the rush hours and the non-rush hours over all stations. The results compared with the best baseline, STGCN, are shown in Table 7.
Table 7. RMSEs of the rush hours (7:00-9:00; 11:00-13:00; 17:00-19:00) and non-rush hours for a working day in the TI 10 min.
As shown in Table 7, the RMSEs of the hybrid GLM are significantly smaller than those of STGCN in the rush hours of a working day. For passenger inflow, the hybrid GLM has an 8.62% relative reduction in RMSE. For passenger outflow, the hybrid GLM brings a 10.06% improvement compared with STGCN. In non-rush hours, however, the RMSEs of the two models are relatively close, and for passenger outflow the hybrid GLM even performs worse. The results show that the hybrid GLM's ability to predict during the rush hours of a working day is better than that of the best baseline, STGCN. Note that the RMSEs in the rush hours are relatively higher than those in non-rush hours due to the higher passenger flow.
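The rush-hour breakdown can be sketched by building a Boolean mask over the 98 daily time slices and computing the RMSE separately on each part. Assigning a slice to a window by its start time is an assumption, as are the function names and the toy data.

```python
import numpy as np

# Rush-hour windows from Table 7, in minutes after midnight.
RUSH_WINDOWS = [(7 * 60, 9 * 60), (11 * 60, 13 * 60), (17 * 60, 19 * 60)]

def rush_mask(n_slices=98, ti=10, start_minute=6 * 60 + 40):
    """True for slices whose start time falls inside a rush-hour window."""
    starts = start_minute + ti * np.arange(n_slices)
    mask = np.zeros(n_slices, dtype=bool)
    for lo, hi in RUSH_WINDOWS:
        mask |= (starts >= lo) & (starts < hi)
    return mask

def split_rmse(y_true, y_pred, mask):
    """Return (rush-hour RMSE, non-rush-hour RMSE); axis 0 indexes time slices."""
    err2 = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    return float(np.sqrt(err2[mask].mean())), float(np.sqrt(err2[~mask].mean()))

mask = rush_mask()
# Toy data: errors of 3 in rush-hour slices and 1 elsewhere.
y_true = np.zeros(98)
y_pred = np.where(mask, 3.0, 1.0)
print(mask.sum(), split_rmse(y_true, y_pred, mask))  # 36 (3.0, 1.0)
```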
In summary, the hybrid GLM's prediction results during rush hours are more accurate than those of the STGCN, LSTM, and RSVR models. However, the hybrid GLM's predictive ability of non-rush hours is a little worse than the STGCN. Generally speaking, the overall predictive ability of the hybrid GLM is superior to those of STGCN, LSTM, and RSVR.

Prediction Performance in Different TIs
To verify the robustness of the hybrid GLM, we compared the prediction results for different time intervals (TIs). We chose 10 min, 15 min, 20 min, and 30 min as the TIs. From Figure 11 and Table 8, we can observe that the prediction precision decreases as the TI increases, which results from the lower number of samples in the training data. When the TI is fixed, the hybrid GLM gets the smallest RMSE compared with the other 9 baseline models. However, the MAPE of the hybrid GLM is not always the smallest. In the TIs of 15 min and 20 min, KNN and RSVR get smaller MAPEs than the hybrid GLM. In the TI of 30 min, the MAPEs of RSVR and LSTM are smaller than that of the hybrid GLM. Though the MAPEs of some models are relatively small, the hybrid GLM still gets a much smaller RMSE. The reason may be that the hybrid GLM predicts rush hours better than some baseline models, while its predictions for non-rush hours are sometimes worse. For a single TI, the conclusions made in Section 5.5.1 still hold. Taking TI 30 min as an example, the hybrid GLM exhibits an RMSE reduction of 43.27% and a MAPE reduction of 14.11% compared with CNN. Compared with LSTM, the hybrid GLM has a 15.87% relative reduction in RMSE. The results verify the superiority of the spatio-temporal model. The graph-structure-based model, GAT, exhibits an RMSE reduction of 2.50% and a MAPE reduction of 3.82% compared with the raster-based model, ResNet. The hybrid GLM has a 10.97% relative reduction in RMSE and a 2.99% relative reduction in MAPE compared with STGCN. The reason may be the improvement of GAT when using an asymmetric matrix. As shown in Table 8, GLM_NoIA and GLM_NoE perform a little worse than the hybrid GLM, which exhibits the effectiveness of the improved adjacency matrix and the external factors. As shown in Figure 12, the prediction results of the hybrid GLM fit the true values well in different TIs, especially in the peak period.
Note that the results of TI 10 min are shown in Figures 8 and 9. Therefore, we can conclude that the hybrid GLM outperforms the other baseline models in different TIs in total, which exhibits the robustness and high accuracy of the hybrid GLM.

Parameters Tuning
The procedure for tuning the parameters is an indispensable process for the training of DL models. This section focuses on the adjustment process of some typical factors.

Lengths of the Different Input Patterns
We here verify the impact of the different input lengths of the three patterns, namely, the close, daily, and weekly patterns. The results are shown in Figure 13. We respectively define c, d, and w as the input length of the close, daily, and weekly patterns.

Figure 13a shows the effect of temporal closeness when d and w are fixed as 1 and c varies. The RMSE and the MAPE are very large when c is 0, which means the close pattern is very important. The best performance appears when c is 7. Figure 13b shows the effect of the daily period when c is set as 7 and w is set as 1 while d varies from 0 to 6. The RMSE and the MAPE first decrease and then increase as d increases. The optimal d is 1. Figure 13c shows the effect of the weekly period when c is set as 7 and d is set as 1 while w varies from 0 to 2. From Figure 13c, we find that the RMSE and the MAPE increase when w is larger than 1, which means the situation at 7:00 a.m. two weeks ago is not closely related to that at 7:00 a.m. this week. After tuning, we conclude that it is better to employ some temporal patterns, but the long-term trend may be ineffective or even useless.
[Figure 13 panels: (a) the input length of the close pattern; (b) the input length of the daily pattern; (c) the input length of the weekly pattern.]
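To illustrate what the input lengths c, d, and w select, the sketch below lists the historical time-slice indices that would feed a prediction at slice t. It assumes that the daily pattern takes the same slice on previous days, that the weekly pattern takes the same slice in previous weeks, and that the slices are concatenated across days with 98 slices per day; the function name and the ordering of the merged sequence are hypothetical.

```python
def pattern_indices(t, c=7, d=1, w=1, slices_per_day=98):
    """Indices of the weekly, daily, and close inputs for target slice t (oldest first)."""
    close = [t - k for k in range(c, 0, -1)]
    daily = [t - k * slices_per_day for k in range(d, 0, -1)]
    weekly = [t - k * 7 * slices_per_day for k in range(w, 0, -1)]
    return weekly + daily + close

print(pattern_indices(1000))
# [314, 902, 993, 994, 995, 996, 997, 998, 999]
```

With the settings c = 7, d = 1, and w = 1, each sample uses nine historical slices, and setting any of the three lengths to 0 simply drops that pattern.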

Number of Hidden Layers
We performed nine experiments to explore the optimal depth for the hybrid GLM. We investigate the depths of the graph attention layers and the LSTM layers separately, which are denoted as D_GAT and D_LSTM. First, with D_LSTM = 1, D_GAT varies from one to three. Then D_LSTM is set as 2 and 3 in the same way. The results are shown in Figure 14. Generally speaking, the RMSE and the MAPE of the model first decrease and then increase, which shows that a deeper network often gives better results because it captures more features. However, if the networks are too deep, the training may become hard, which leads to an increase in the RMSE and MAPE. More specifically, the RMSE is relatively low if D_LSTM is 1 or 2. The RMSE increases once D_LSTM becomes 3, which indicates that the deeper LSTM performs worse in the model. As for D_GAT, the RMSE and the MAPE are relatively large when D_GAT is set as 1 or 3. Lastly, we find that the optimal D_LSTM and D_GAT are both 2.
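The nine depth experiments form a 3 × 3 grid, which can be enumerated as in the sketch below. The `evaluate` callable stands in for training and validating one configuration; the placeholder scoring function used in the example is not an experimental result, and all names are hypothetical.

```python
from itertools import product

def depth_grid(depths=(1, 2, 3)):
    """All (D_LSTM, D_GAT) pairs, with D_GAT varying fastest."""
    return list(product(depths, repeat=2))

def select_best(grid, evaluate):
    """Return the configuration with the lowest score (e.g., validation RMSE)."""
    return min(grid, key=evaluate)

grid = depth_grid()
# Placeholder score for demonstration only; a real run would train the model here.
best = select_best(grid, lambda cfg: (cfg[0] - 2) ** 2 + (cfg[1] - 2) ** 2)
print(len(grid), best)  # 9 (2, 2)
```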

Number of Hidden Neuron Units
FN_LSTM and FN_GAT represent the number of hidden units in the LSTM and GAT layers, respectively. To test FN_LSTM, the FN_GAT of the 2-layer GAT were fixed as 32 and 2, and FN_LSTM varied between 32, 64, 128, 256, 512, 576, 600, and 640. Correspondingly, to test FN_GAT, FN_LSTM was fixed as 600, and the FN_GAT in the first GAT layer varied between 6, 12, ..., and 256. The results of the influence of different hidden neuron units are shown in Figure 15. Figure 15a is the result when changing FN_LSTM. The MAPE is relatively stable as FN_LSTM is changed. However, a larger FN_LSTM results in a lower RMSE up to a certain level. The turning point of FN_LSTM is around 600. Therefore, the optimal FN_LSTM is 600. Figure 15b is the result for changing FN_GAT. The RMSEs and the MAPEs are very stable when FN_GAT is changing. However, FN_GAT = 32 has a superior result. In summary, the model exhibits the best prediction accuracy when FN_LSTM is 600 and FN_GAT is 32.

Figure 15. Experimental results for the different hidden layers.

Conclusions and Future Work
In this paper, we focus on a valuable and widely studied problem, metro passenger flow prediction, whose goal is to effectively and accurately predict the passenger flow in future time intervals for a specific region. We argue that the existing works ignore the application of asymmetric networks in a city and lose sight of the fact that further neighbors may also have some impact. In addition, some of them do not analyze the effects of external factors, such as weather, air quality, etc.
To address these issues, we propose a new method, the hybrid GLM, to predict the citywide metro passenger flow by integrating two DL methods, LSTM and GAT. By utilizing GAT, the proposed model can be used in asymmetric networks. In order to explore the influence of nodes located a little further away, we improve the adjacency matrix by applying different weights to some further neighbors. Further, LSTM structures are adopted to capture the temporal dependency and external influence, which can improve the entire model. We tested the proposed model via a case study involving the prediction of the citywide metro passenger flow in Shanghai, China for five days. The experimental results indicate that the hybrid GLM significantly outperforms several baseline models, namely LR, KNN, RSVR, LSTM, CNN, ResNet, GAT, and STGCN. A detailed comparison between the hybrid GLM and STGCN reveals that the hybrid GLM provides a performance improvement of 6% to 10% for different TIs. For rush hours in a working day, the hybrid GLM fits the ground truths better, which may be more helpful for urban managers to make effective plans. The accurate prediction results can also provide references for people's travel schedules.
However, some limitations still exist. Firstly, the prediction errors of the hybrid GLM are relatively larger than those of STGCN for the non-rush hours in a working day. Secondly, the time span of the validation datasets only covers a month, which may ignore some temporal external factors, such as seasons. Thirdly, we apply a single-step ahead prediction [58] in our study, which may cause error accumulation. In the future, we intend to address these limitations to better discover the correlations for a higher quality prediction. We will further explore external features and multi-step ahead prediction for model improvement. We also intend to investigate the application of the hybrid GLM to much longer datasets or other types of flows, such as bike flow, crowd flow, and traffic flow in different TIs. Lastly, in terms of DL models, some advantages of GAT, such as capturing the bidirectional characteristics of the traffic lines, modeling dynamic graphs, and modeling multi-graphs, should be further studied.