Large-Scale Station-Level Crowd Flow Forecast with ST-Unet

High crowd mobility is a characteristic of transportation hubs such as metro/bus/bike stations in cities worldwide. Forecasting the crowd flow for such places, known as station-level crowd flow forecast (SLCFF) in this paper, would have many benefits, for example in traffic management and public safety. Concretely, SLCFF predicts the number of people that will arrive at or depart from stations in a given period. However, one challenge is that the crowd flows across hundreds of stations irregularly scattered throughout a city are affected by complicated spatio-temporal events. Additionally, some external factors such as weather conditions or holidays may change the crowd flow tremendously. In this paper, a spatio-temporal U-shape network model (ST-Unet) for SLCFF is proposed. It is a neural network-based multi-output regression model, handling hundreds of target variables, i.e., all stations’ in and out flows. ST-Unet emphasizes stations’ spatial dependence by integrating the crowd flow information from each station’s neighboring stations and from the cluster it belongs to after hierarchical clustering. It learns the temporal dependence by modeling the temporal closeness, period, and trend of crowd flows. With proper modifications to the network structure, ST-Unet is easily trained and converges reliably. Experiments on four real-world datasets were carried out to verify the proposed method’s performance, and the results show that ST-Unet outperforms seven baselines in terms of SLCFF.


Introduction
Forecasting crowd flow is of great importance for risk assessment and public safety [1,2]; there has been increased emphasis on this since accidents such as the 2014 Shanghai Stampede occurred. Compared with citywide or regional forecasts, a station-level crowd flow forecast (SLCFF) benefits public safety protection at the station level by predicting the flow at places with high crowd mobility, such as metro/bus/bike stations. Stations are scattered throughout a city and the variation of crowd flow reflects people's daily life: work, activities, home, etc. Moreover, SLCFF can benefit many other applications, such as traffic management, taxi dispatching, bike-sharing pre-reallocation, etc. Concretely, SLCFF predicts the number of people that will arrive at or depart from stations in a given period.
There are many stations in a city. The crowd flow in one individual station exhibits greater fluctuation than that observed on the cluster level. The crowd flow variation in a station generally follows the trend of the cluster it belongs to when hierarchical clustering of geo-neighboring stations is applied. The spatial dependence of crowd flow lies in the hierarchical structure of stations. Moreover, the peak arrival crowd at a certain station may have come from several other stations a while before, and the peak departure would cause fluctuation at nearby or far away stations a while later. Viewing crowd flow at each time slice individually would not reflect the inherent temporal dependence. Furthermore, some external factors, such as weather conditions and events, may change the crowd flow tremendously. Together, these issues make it challenging to do a station-level crowd flow forecast with high precision. The forecast performance can only be improved when the spatio-temporal dependence and the external factors are well modeled.
Crowd flow forecast is intrinsically a regression problem. From the view of models and the forecasting techniques adopted, the works could be categorized into two groups: one uses empirical statistical methods [3,4] or pattern mining to identify crowd flow hot-spots or activity patterns [5,6]; the other implements machine learning techniques to forecast crowd flow. The former is used to answer when/where/how the future hot-spots might be from a macro perspective. The latter is used to make predictions numerically by modeling the impact factors as much as possible. This paper focuses on the latter.
For a regression problem, from the view of how the many target variables are modeled, the machine learning techniques can be divided into single-output models and multi-output models [7]. The former trains one model for each target station individually, or just one single-output model, so that the loss function has only one target variable. The latter builds one multi-output model to forecast many real-valued target variables, which are optimized jointly in the loss function.
Here are some examples of the former type: support vector regression (SVR) methods for traffic flow predictions [8][9][10], gradient boosting regression tree (GBRT) and multi-similarity-based inference models for bike-sharing demand forecasting [11,12], ensemble framework with time-varying Poisson models and the auto-regressive integrated moving average (ARIMA) model for taxi-passenger demand forecasting [13]. For multi-output models, some examples include: the probabilistic graphical models (PGM)-based hybrid framework for citywide traffic volume estimation [14], intrinsic Gaussian Markov random field (IGMRF) model, one of the PGM models with cluster-based adjustment for cluster-level crowd flow forecast [1], vector auto-regressive moving average (VARMA) with a spatio-temporal correlations matrix for real-time traffic predictions [15], ν-SVR (the modified multi-output SVR (M-SVR) method) for traffic speed predictions in large road networks [16], deep spatio-temporal residual networks (with convolutional neural networks (CNNs) as kernels) for region-level crowd flow predictions [2], and multi-graph convolutional networks for station-level bike flow predictions [17].
Theoretically, by modeling the relationships between the target variables and optimizing accordingly, multi-output models can guarantee a better representation and interpretability of the real-world problems than single-output models [7], as is shown in the works enumerated. However, many multi-output models (PGMs, M-SVR, VARMA, as mentioned above) exhibit high computational complexity and cannot handle large-scale problems (hundreds of target variables) well [7]. Because they model the spatio-temporal dependence of targets carefully, the number of training parameters is often k times the product of the number of features and the number of target variables. To reduce complexity, target variables are grouped by clustering algorithms [16] or part of the training parameters are set according to rules (as in Reference [15]), which sacrifices some forecast performance. With abundant structural designs and mature training techniques that handle large-scale problems well, deep neural networks (DNNs) are currently subject to much research (References [2,17,18], as mentioned above). However, either the geo-factors and information about the city are lost or the forecast is only applicable regionally, because regular grids are leveraged and spatio-temporal dependences are simplified to enable the application of widely used neural networks (CNN, LSTM, etc.).
Inspired by the trend of leveraging DNNs on such large-scale regression problems, we forecast station-level crowd flow with a spatio-temporal U-shape network (ST-Unet) in this paper. It is a neural network-based multi-output regression model, handling hundreds of target variables. Its structure is carefully designed to emphasize stations' local-global dependence of crowd flow to improve the forecast performance. Concretely, the contributions we make in this paper are: (1) the gConv layer (convolutional layer) is designed to handle stations' irregular distribution and learn the influence of crowd flow from neighbor stations, based on the ideas of receptive fields and weight sharing from CNNs; (2) the hierarchical information of the stations is integrated into the networks by gUpSampling/gDownSampling layers, which enhances the model's ability to understand the local-global information of crowd flow; (3) several modifications to the widely used Unet are made, which improve the model's convergence and allow it to handle hundreds of target variables well. Experiments on four real-world datasets were carried out to verify the proposed method's performance. Results show that ST-Unet outperforms seven baselines on station-level crowd forecasting.

Overview
As shown in Figure 1a,b, stations are irregularly scattered throughout a city. The in-out flow at each station reflects people's mobility level of the located region at the time. Visualized on maps and sequenced through time, we can get a series of double-channel (in-out channels) heat maps (as shown in Figure 1c). Thus, we model the station-level forecast problem by generating a subsequent heat map based on the ordered series of heat maps. Ideas from CNNs and multi-source data are utilized to improve the forecast performance. In this section, the formal definition of the SLCFF problem and some preliminaries used are first introduced, and the framework of our method is illustrated as follows.

Framework
Figure 2 shows the framework of our method. To guarantee the performance, multi-source data are adopted, including the in-out records data at each station, locations of stations, road network, meteorology, etc. The ST-Unet model is the forecast model. It merges three Unet branches to capture the temporal dependence of crowd flow and one branch to integrate external factors (see Section 3.1). Each Unet deals with the spatial dependence of crowd flow among the stations (see Section 3.2). Owing to the stations' irregular distribution, we redesign the receptive field of each station and bring in the hierarchical information of the stations in the Unet.

ST-Unet Model

• k-NN (nearest neighbor) Receptive Field of Each Station. The receptive field of CNNs of each entry in regular grid data is its 8 or 24 neighbor grids (when using 3 × 3 or 5 × 5 feature maps, respectively). However, because the stations are scattered irregularly, the k-NN receptive field of each station should be redefined. Inspired by graph-CNNs utilizing graph labelings to impose an order on nodes [19], we define and determine each station's receptive field as its k ordered nearest neighbor stations reachable in the road network (see Section 3.3).

• Hierarchical Structure of Stations. From the view of one individual station, the changing regularity of in-out flow is difficult to determine because of its fluctuation, as shown in Figure 3a,b. However, it is much more robust and regular from the view of a region with several stations, as shown in Figure 3c. Thus, we employ an agglomerative clustering algorithm to construct the hierarchical structure of the stations, which is based on the stations' geo-locations and historical in-out flow data. This is used as auxiliary information to determine the 'pools' of downsampling/upsampling layers in ST-Unet, which enhance the forecasting stability (see Section 3.4).

• Hierarchical/Time-period In-Out Proportion. According to the hierarchical structure of stations and different time periods, the maximum likelihood estimation method is used to estimate each station's in-out flow proportion within its cluster. Such information is used to correct the upsampling operation, replacing the commonly adopted padding method (see Section 3.4).
Besides the intermediate data above, external features should also be prepared, including date/time properties and weather. Date/time properties include weekday/weekend, holiday, time slot, and so on. Weather conditions affect the crowd flow to some degree, as shown in Figure 3d. Weather features include precipitation, visibility, and so on. With these intermediate data and external data, ST-Unet can be trained to predict each station's crowd flow in a given period. Its architecture is elaborated in Section 3.1.

Overview
Figure 4 presents the architecture of ST-Unet. It is composed of four branches: (a) Three Unet branches capture the temporal influence on crowd flow: closeness (the recent time period), period (the same time period yesterday), and trend (the same time period last week); each branch is a modified Unet capturing the spatial dependence of crowd flow, as illustrated in Section 3.2. (b) One branch introduces external influence, which contains weather and date/time property features in this paper.
As illustrated in Definition 4, we use a single 1-d vector to represent the double-channel heat map of crowd flow in Figure 1c, i.e., the in-flow and out-flow of all stations during one time slot, which simplifies the subsequent operations. Then, we stack specified in-out flow records of different time slots to capture the variation along the time axis in the three colored branches, where $l$ is the number of time slots chosen to stack, $l \geq 1$, and $\tau_p$ and $\tau_t$ are the numbers of time slots in one day and one week, respectively. The blue branch stacks the records of recent time slots; the green branch stacks the records of the same time slots as yesterday; the red branch stacks the records of the same time slots as last week. They separately model three temporal properties: closeness, period, and trend.
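The stacking of the three branch inputs can be illustrated with a minimal sketch. The function name `stack_branch_inputs` and the exact window alignment are assumptions made for illustration: the closeness window is taken as the `l` slots just before the target slot `n`, while the period and trend windows are taken as the `l` slots ending at (and including) the same slot one day and one week earlier. The paper's own indexing may differ.

```python
def stack_branch_inputs(flows, n, l, tau_p, tau_t):
    """Build the closeness, period, and trend inputs for target slot n.

    flows : list of per-slot flow vectors (one 1-d in-out vector per time slot)
    n     : index of the time slot to forecast (may equal len(flows))
    l     : number of time slots to stack, l >= 1
    tau_p : number of time slots in one day
    tau_t : number of time slots in one week
    The window alignment below is an assumption, not the paper's definition.
    """
    if l < 1:
        raise ValueError("l must be at least 1")
    if n > len(flows) or n - tau_t - l + 1 < 0:
        raise ValueError("not enough history for the requested windows")
    closeness = flows[n - l:n]                       # most recent l slots
    period = flows[n - tau_p - l + 1:n - tau_p + 1]  # ends at the same slot yesterday
    trend = flows[n - tau_t - l + 1:n - tau_t + 1]   # ends at the same slot last week
    return closeness, period, trend


# Hourly slots: 24 per day, 168 per week; each record is [in_flow, out_flow].
history = [[t, 100 + t] for t in range(400)]
closeness, period, trend = stack_branch_inputs(history, n=300, l=4, tau_p=24, tau_t=168)
```

Each returned window has `l` rows, so the three branches receive inputs of identical shape.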
The bottom branch uses one layer of the fully connected neural network to introduce the external feature vector, including weather and date/time properties. The weather features include precipitation, wind speed, temperature, visibility, etc. The date/time property features contain workday/weekend, holiday or not, kind of time period, etc.

Unet Branch of ST-Unet
The Unet branch of ST-Unet is thus named because of its U-like network shape, as shown in Figure 5. It is inspired by the widely used network Unet in the domain of medical image processing [20]. It is usually applied in pixel-level image segmentation, i.e., to classify each pixel in an image. The horizontal architecture of Unet performs CNNs on different hierarchical resolutions of the image. The design emphasizes the local-global dependence of the entire image on the output. The architecture is well-suited to SLCFF, especially when the multi-output model is adopted and there are hundreds of stations. Local and global features both exist in crowd flow in urban areas. The local features are a result of the distribution of points of interest and different regions' functionality. The global features are a consequence of different time periods of the day, weather, or events. Combining local and global information related to crowd flows at stations enhances the station-level forecast.
In Figure 5, gConv, gDownSampling, and gUpSampling are designed to deal with the irregular distribution of the stations; they are analogous to the convolutional, downsampling, and upsampling layers in CNNs, respectively. They are elaborated in Sections 3.3 and 3.4. The horizontal architecture uses convolutional operations, with two 'bridges' in the two shallower layers. $r_0$, $r_1$, $r_2$ are the numbers of feature channels of gConv. There are two downsampling layers and two upsampling layers determined by the hierarchical structure of the stations (see Section 3.4). $m_{C1}$ and $m_{C2}$ are the numbers of clusters in the corresponding layers of the hierarchical structure, respectively.
Different from the usage of Unet in pixel-level image segmentation, the outputs in our model are real values, which require higher precision. For that reason, several modifications were tested before being finally adopted. First, the depth of the hierarchical structure is shallow, as the 'pixels' are not as numerous as those in images (hundreds of stations compared to a 572 × 572 image, as in Reference [20]). Second, the bridges use an 'add' operation as a highway instead of 'concat'. Third, the resid blocks are placed immediately after the bridges. These modifications make the network more trainable and achieve reliable convergence.

gConv
Generally, the site-selection of metro/bus/bike-sharing stations is well-designed and stations are scattered throughout a city. Using New York City Citi Bike as an example, most stations' 8 nearest neighbors (8-NN) are reachable within 1.0 kilometer in the road network (shown in Figure 6a). According to Tobler's first law of geography [21], everything is related to everything else and near things are more related than distant things. That means the crowd flow at a station is probably related to its neighbor stations. This is the insight of the convolutional layers in the shallower layers of the Unet branch. Furthermore, as can be seen in Figure 6b, the distance of most trips does not exceed 5 km and is mainly between 1.5 km and 4 km. That means when people leave an area, they do not generally go too far. This is the insight from the convolutional layers in the deeper layers of the Unet branch. The k-NN receptive field of each station should be redefined as they are not in regular grids. Inspired by the works of graph-CNNs [19,22], we formalize the convolutional layer gConv in this paper as follows.
As shown in Figure 7, a rectangular buffer is first used to roughly determine the neighbors of a station $S_i$. Then, the k nearest neighbors are determined and ordered, according to the road network and using the shortest path between stations to measure the distance. These stations are the receptive field of station $S_i$. The cases in the deeper layers of the Unet branch are different, treating each cluster as one station located on its centroid and using Euclidean distance between centroids as measurement.
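The ordered k-NN receptive field in the shallower layers can be sketched with a plain shortest-path search. This is a minimal illustration under stated assumptions: the road network is represented as a hypothetical adjacency dictionary (`node -> list of (neighbor, length)`), each station is assumed to sit on a road-network node, and the rectangular pre-filter described above is omitted for brevity. The helper names are illustrative only.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest road distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, length in graph.get(u, []):
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def knn_receptive_field(graph, station_nodes, i, k):
    """Indices of the k nearest stations to station i, ordered by road distance.

    Stations that are unreachable in the road network are excluded.
    """
    dist = shortest_distances(graph, station_nodes[i])
    reachable = [
        (dist[node], j)
        for j, node in enumerate(station_nodes)
        if j != i and node in dist
    ]
    reachable.sort()
    return [j for _, j in reachable[:k]]


# Toy undirected road network: each edge is listed in both directions.
roads = {
    "a": [("b", 1.0), ("d", 5.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0), ("d", 1.0)],
    "d": [("a", 5.0), ("c", 1.0)],
}
stations = ["a", "b", "c", "d"]
print(knn_receptive_field(roads, stations, 0, 2))  # [1, 2]
```

For the deeper layers, the same ordering step could be applied to Euclidean distances between cluster centroids instead of road distances.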
To simplify the operations, we consider only the prediction of out-flow for the moment. Let $w = [w_0, w_1, \ldots, w_k] \in \mathbb{R}^{1 \times (k+1)}$ be a feature map. As shown in Figure 7, the convolutional operation of each station is $x \cdot w^T$. For all stations, gConv can be depicted in matrix form, where $f(\cdot)$ is the activation function, $x^{out}_{\tau_i}$ records each station's out-flow during time period $\tau_i$, and $WK$ is the result of filling the k-NN matrix $K'$ with the feature map $w$. In the k-NN matrix, $S_j \xrightarrow{p\text{th}} S_i$ means station $S_j$ is the $p$th nearest station to station $S_i$. Thus, filling $K'$ with $w$ to get $WK$ (denoted as WK=w K') intrinsically embeds the per-station operations.
In this paper, the crowd flow forecast includes in-flow and out-flow. The operations above are easily extended with slight modifications. First, the feature map $w$ should be extended as $w = \begin{bmatrix} w_{00} & w_{01} \\ w_{10} & w_{11} \end{bmatrix}$ to learn both the in and out crowd flow patterns. The k-NN matrix and $WK$ are extended accordingly, where $x_{\tau_i}$ records each station's in-out flow during the time period $\tau_i$. The form of gConv for multi-channels and in the deeper layers of the Unet branch is similar.
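A minimal single-channel, out-flow-only sketch of this idea follows. It assumes that row i of WK holds $w_0$ on the diagonal and $w_p$ in the column of the $p$th nearest neighbor of station $S_i$, and that the layer output is $f(WK \cdot x)$ with a ReLU activation. The orientation of the matrix product and the helper names `build_wk` and `gconv` are assumptions for illustration, not the paper's exact formulation.

```python
def build_wk(knn, w):
    """Fill an m x m matrix with the shared weights w according to the k-NN order.

    knn[i] lists the indices of station i's neighbors, nearest first.
    w[0] is applied to the station itself and w[p] to its p-th nearest neighbor.
    """
    m = len(knn)
    wk = [[0.0] * m for _ in range(m)]
    for i, neighbors in enumerate(knn):
        wk[i][i] = w[0]
        for p, j in enumerate(neighbors, start=1):
            wk[i][j] = w[p]
    return wk


def relu(value):
    return max(value, 0.0)


def gconv(x, knn, w, activation=relu):
    """Apply the shared feature map to every station's receptive field at once."""
    wk = build_wk(knn, w)
    return [
        activation(sum(weight * flow for weight, flow in zip(row, x)))
        for row in wk
    ]


# Three stations, k = 1: each station sees itself and its single nearest neighbor.
knn = [[1], [0], [1]]
w = [1.0, 0.5]
x_out = [2.0, 4.0, 6.0]
print(gconv(x_out, knn, w))  # [4.0, 5.0, 8.0]
```

Because every row reuses the same k+1 weights, the number of trainable parameters does not grow with the number of stations, which mirrors weight sharing in ordinary CNNs.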

gDownSampling and gUpSampling
As depicted in Section 3.3, neighbor stations or stations in neighbor areas have highly-related crowd flows. Furthermore, as shown in Figure 3, the periodicity and regularity of a single station's crowd flow seems chaotic; but it becomes distinguishable from the view of a region with several stations. These insights inspire us to group stations by a hierarchical structure. Such hierarchical structures can be leveraged to determine the 'pools' of downsampling/upsampling layers in the Unet for enhancing the forecasting stability. As shown in Figure 8a, the bottom layer is each station itself. The middle layer 2 and the top layer 1 are extracted using an agglomerative clustering algorithm, which is based on the stations' locations and historical data of stations' in-out flow. The historical flow data are used to estimate the transition probability of each station's in-out flow from/to other stations. Then, the similarity of crowd flow patterns between stations is measured and used as the weighted coefficient.
Definition 5. Each station's feature vector of in-out flow transition probability. Let $l_{S_i,hd} = [l^{in}_{S_0}, \ldots, l^{in}_{S_{m-1}}, l^{out}_{S_0}, \ldots, l^{out}_{S_{m-1}}]_{hd}$ denote the transition probability of station $S_i$'s in-out flow from/to other stations, with $\sum l^{in}_{S_i} = 1$ and $\sum l^{out}_{S_i} = 1$.
$l_{S_i,hd}$ is estimated using the maximum likelihood estimation method according to each station's historical in-out flow records. Subscripts $hd$ are used to distinguish different kinds of time periods (see Section 2.2).
Definition 6. Similarity of crowd flow between two stations. The cosine of two vectors $\cos(l_{S_i,hd}, l_{S_j,hd})$ is used to measure the similarity between stations $S_i$ and $S_j$: $\theta(S_i, S_j) = \cos(l_{S_i,hd}, l_{S_j,hd}) = \frac{l_{S_i,hd} \cdot l_{S_j,hd}}{\|l_{S_i,hd}\| \, \|l_{S_j,hd}\|}$.
Figure 8b shows the core idea of agglomerative clustering to determine the hierarchical structure of stations. Agglomerative clustering is a 'bottom-up' type hierarchical clustering: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy [23]. Two goals are achieved by using the stations' proximity matrix Z. First, stations in one cluster should be close to each other in the road network. Second, stations in the same group have similar in-out crowd flow patterns.
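Definitions 5 and 6 can be illustrated with a short sketch. For count data, the maximum likelihood estimate of a transition probability is the normalized count, so the feature vector is built by normalizing the in-flow and out-flow counts separately. The function names and the input layout (one count per partner station) are assumptions for illustration.

```python
import math


def transition_vector(in_counts, out_counts):
    """Maximum likelihood estimate of a station's in/out transition probabilities.

    in_counts[j]  : number of historical trips arriving from station j
    out_counts[j] : number of historical trips departing to station j
    Returns the concatenation [l_in..., l_out...]; each half sums to 1
    (or is all zeros when the station has no records).
    """
    def normalize(counts):
        total = sum(counts)
        return [c / total if total else 0.0 for c in counts]

    return normalize(in_counts) + normalize(out_counts)


def flow_similarity(l_i, l_j):
    """Cosine similarity between two stations' transition-probability vectors."""
    dot = sum(a * b for a, b in zip(l_i, l_j))
    norm = math.sqrt(sum(a * a for a in l_i)) * math.sqrt(sum(b * b for b in l_j))
    return dot / norm if norm else 0.0


l_a = transition_vector([0, 3, 1], [0, 1, 1])
l_b = transition_vector([2, 0, 0], [4, 0, 0])
similarity = flow_similarity(l_a, l_b)  # stations with disjoint partners score 0
```

In practice a separate vector would be estimated for each kind of time period (the `hd` subscript), for example by splitting the trip history before counting.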
As mentioned in Section 3.2, the depth of the hierarchical structure in each Unet branch is shallow. In this paper, the output layers cut from the clustering tree are set to three, as shown in Figure 8. Thus, two constraints are used to restrict each cluster's size on the respective layer: the distance between any two stations in each cluster does not exceed $d_{C1}$ (for the top layer, constraint $C_1$) and $d_{C2}$ (for the middle layer, constraint $C_2$), respectively ($d_{C1} > d_{C2}$).
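The effect of the distance constraints can be sketched with a small bottom-up clustering routine. This is a simplified illustration, not the paper's algorithm: it uses complete linkage on a plain distance matrix and merges the closest pair of clusters while the largest pairwise distance in the merged cluster stays within the limit. The paper's proximity matrix also accounts for flow-pattern similarity, which is left out here, and the function name is hypothetical.

```python
def constrained_agglomerative(dist, max_diameter):
    """Complete-linkage agglomerative clustering with a cluster-diameter limit.

    dist[i][j]   : distance between stations i and j
    max_diameter : no two stations in one cluster may be farther apart than this
    Returns a list of clusters, each a list of station indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Largest distance between a member of a and a member of b.
        return max(dist[i][j] for i in a for j in b)

    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if d <= max_diameter and (best is None or d < best[0]):
                    best = (d, x, y)
        if best is None:
            return clusters  # no admissible merge is left
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Four stations on a line at positions 0, 1, 2, and 10 (distances in km).
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(p - q) for q in positions] for p in positions]
print(constrained_agglomerative(dist, 2.5))  # [[0, 1, 2], [3]]
print(constrained_agglomerative(dist, 1.5))  # [[0, 1], [2], [3]]
```

Running the routine twice, with a tighter limit for the middle layer and a looser one for the top layer, gives two cluster layers in the spirit of $d_{C2}$ and $d_{C1}$; the two runs are not guaranteed to be nested unless the second run starts from the clusters of the first.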
With the hierarchical structure of all stations, the 'pool' of the gDownSampling/gUpSampling layer in the Unet branch can now be determined. Similarly, to simplify the operations, we consider only the out-flow prediction first.
As shown in Figure 9a,b, $D_{C2}$ and $U_{C2}$ are determined from the hierarchical structure of all stations; let $h_{\tau_i}$ be the result of gConv in the middle layer 2. Each row in $D_{C2}$ and each column in $U_{C2}$ correspond to a certain station. Each column in $D_{C2}$ and each row in $U_{C2}$ correspond to a certain cluster in the middle layer 2. Concretely, $S_i \in C_{2,j}$ means station $S_i$ belongs to cluster $C_{2,j}$ in layer 2, $\sum_{i=0}^{m-1} u_{ij} = 1$, and $u_{ij}$ is station $S_i$'s out-flow proportion in cluster $C_{2,j}$, estimated by the maximum likelihood estimation method according to different time periods. The non-zero entries in each column of $D_{C2}$ and in each row of $U_{C2}$ denote the 'pools' of gDownSampling and gUpSampling, respectively.
In this paper, station-level crowd flow forecasts include in and out flow. The operations above are easily extended with slight modifications: the 'pools' $D_{C2}$ and $U_{C2}$ are extended to cover both flows, and gDownSampling and gUpSampling are applied to $h_{\tau_i}$, the result of gConv in the middle layer at $\tau_i$. The forms of gDownSampling and gUpSampling for the top layer 1 are similar.
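The role of the 'pools' can be sketched for the out-flow-only case. The sketch assumes that the non-zero entries of $D_{C2}$ are ones, so that downsampling sums the values of the stations in each cluster, and that upsampling redistributes each cluster value by the proportions $u_{ij}$. Trainable weights and the activation function are omitted, and all function names are hypothetical.

```python
def estimate_proportions(station_totals, cluster_of):
    """Each station's share of its cluster's historical out-flow (normalized counts)."""
    cluster_totals = {}
    for total, c in zip(station_totals, cluster_of):
        cluster_totals[c] = cluster_totals.get(c, 0) + total
    return [
        total / cluster_totals[c] if cluster_totals[c] else 0.0
        for total, c in zip(station_totals, cluster_of)
    ]


def g_downsample(h, cluster_of, num_clusters):
    """Pool station-level values into cluster-level values by summation."""
    pooled = [0.0] * num_clusters
    for value, c in zip(h, cluster_of):
        pooled[c] += value
    return pooled


def g_upsample(pooled, cluster_of, proportions):
    """Spread each cluster-level value back to its stations by their proportions."""
    return [u * pooled[c] for u, c in zip(proportions, cluster_of)]


cluster_of = [0, 0, 1]  # stations 0 and 1 share a cluster; station 2 is alone
u = estimate_proportions([30, 10, 5], cluster_of)
pooled = g_downsample([6.0, 2.0, 4.0], cluster_of, num_clusters=2)
restored = g_upsample(pooled, cluster_of, u)
print(pooled, restored)  # [8.0, 4.0] [6.0, 2.0, 4.0]
```

Using estimated proportions instead of padding means that a station with historically larger flow receives a larger share of its cluster's value when the resolution is restored.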

Experiments
Experiments to verify ST-Unet's effectiveness are presented in this section. Three bike-sharing trip datasets and one taxi-trip record dataset were used. All experiments were conducted on a virtual machine with 32 GB RAM and Python 2.7 with TensorFlow 1.7.

Datasets
The three bike-sharing trip datasets are from New York Citi Bike in New York City (http://www.citibikenyc.com/system-data), Capital-Bikeshare in Washington DC (http://www.capitalbikeshare.com/system-data), and DivvyBikes in Chicago (http://www.divvybikes.com/data). The taxi-trip record dataset is the yellow taxi-trip records from NYC Taxi and Limousine Commission (TLC) (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). They are named CITI, DC, DIVVY, TAXI in the following. The details are presented in Table 1. The meteorology dataset used is from the site (https://mesowest.utah.edu/) and the selected stations are New York City Central Park (ID: KNYC), Washington (ID: WASD2), Chicago Midway Airport (ID: KMDW). The missing records in the meteorology dataset were filled according to the records from the previous hours. The weather features include: relative-humidity, wind-speed, visibility, sea-level-pressure, precipitation-accumulation (1 h), precipitation-accumulation (3 h), temperature.
Data from 1 April to 19 September were used as training data; data from 19 September to 9 October were used as validating data; and data from 10 October to 30 October were used as testing data.In the testing data, 10 October 2016 was Columbus Day, which was a public holiday.The rainy dates and foggy dates in New York City (NYC), Chicago, and Washington DC are shown in Table 2.

Hyperparameters Selection of ST-Unet
The selection of the hyperparameters significantly affects the performance of most deep learning-based models.However, since the training of ST-Unet requires a great deal of time, we present here the best hyperparameters for ST-Unet as well as other settings we tested.

Baselines & Metric
In order to confirm the effectiveness of ST-Unet, we conducted experiments to compare ST-Unet with seven baselines:
• XGB: XGB, short for eXtreme Gradient Boosting, is an implementation of GBRT (gradient boosted regression trees) [25]. All input features are the same as ST-Unet.
XGB and Ensemble are single-output models, i.e., a separate forecast model was trained for each station's in or out crowd flow. VARIMA, FC, MG-CNN, and ST-Unet are multi-output models. Owing to the heavy computational cost of VARIMA, the second layer (layer 2) of stations was used and a VARIMA model was trained for each cluster. Unet and ST-net are simplified versions used to verify the design of ST-Unet.
that the influence this had on crowd flow was not great; for this reason, ST-Unet performed only slightly better than the other methods.
As shown in Table 3, ST-Unet and the other deep learning-based methods (FC and MG-CNN) show better performance than the other baselines on the whole. However, FC shows an unstable forecast performance among different datasets, owing to its fully-connected layers containing too many parameters for training. ST-Unet performs slightly better than MG-CNN. With long short term memory (LSTM) cells, MG-CNN uses only the history data from the past six time slots to forecast the flow in the subsequent time slot [17]. Details can be found in Figure 10, where we present some forecast examples of different time periods (holiday: 10/10/2016; weekends: 15-16/10/2016 and 14-15/10/2017; rainy day: 29/10/2017; the others are workdays). It shows that ST-Unet performs better than FC and MG-CNN at the peaks of crowd flow. To verify the design of ST-Unet, we present the forecast results of Unet and ST-net (see Table 3). The former has only a closeness Unet branch and the latter has no gDownSampling and gUpSampling. The forecast performance was improved by 13.1% on average by ST-Unet, showing the necessity of introducing the period/trend branch and the hierarchical structure of stations. It is worth noting that ST-net does not perform better than Unet. It is probable that the highways in ST-Unet are quite important for the networks' training and 'add' is better than 'concat' (see Section 3.2), as determined after attempts on multiple network structures.

Conclusions and Discussion
In this paper, we propose a deep learning-based model, named ST-Unet, to make station-level crowd flow forecasts. We present our methods for combining geographic information with the design of the neural network. Three Unet branches to capture the temporal influence and one branch to introduce external influence were integrated into the forecast model. To deal with the irregular grid format of the data in the Unet branch, we propose gConv and gDownSampling/gUpSampling to replace the corresponding widely used convolutional layer and downsampling/upsampling layers in CNNs. Specifically, to make gConv effective, the receptive field is determined by each station's k-NN stations reachable on the road network. To make gDownSampling/gUpSampling effective, we regularize the 'pools' according to the hierarchical structure of stations, which is extracted using an agglomerative clustering algorithm based on the stations' locations and the historical flow data. Compared with several baselines, ST-Unet generally performed well in the experiments. ST-Unet accurately predicted each station's in-out flow in a future period.
It is notable that the proposed method does not show much superiority with regard to predictions involving rainy/foggy days and holidays. The reason may be the insufficiency of training instances on special days; however, this requires further study. In addition, the multi-time-step forecast performance of ST-Unet was not explored; how ST-Unet can be modified to do this is a focus for future research.

Figure 1. (a,b) Metro stations and bike-sharing stations in New York City; each station serves a region of the city, which could be roughly calculated by Voronoi-based segmentation; (c) the in-out crowd flow at each station aggregated along time axis can be viewed as a series of double-channel heat maps.

2.1. Preliminaries & Problem Definition
Definition 1. Stations. There are m stations in the station-level crowd flow forecast problem; $S_i$ ($i \in [0, m-1]$) is the $i$th station.

Figure 2. The framework of the forecast model.

Figure 3. (a,b) The variation in out-flow count of two stations in one day. (c) The variation in out-flow count of a region with seven stations in one day. (d) The effect of weather on the fluctuation of crowd flow (27 October 2016 in New York City was a rainy day).

Figure 5. On the left is the architecture of a single Unet branch; on the right is the unfolded format of the resid block with gConv as kernels.

Figure 6. (a) The distance distribution of 8 nearest neighbors (8-NN) stations reachable on the road network of each station. (b) Trip distance distribution in New York Citi Bike-sharing system.

Definition 2. Trip. A trip $Tr = (S_o, S_d, t_o, t_d)$ is a record, where $S_o$ and $S_d$ denote the origin and destination station, respectively, and $t_o$ and $t_d$ are the timestamps when people depart from $S_o$ and arrive at $S_d$, respectively.
Observing time unit. $\tau$ is the observing time unit for aggregating the in-out flow count, e.g., 30 min or 1 h. Let $T = [\tau_0, \tau_1, \ldots, \tau_i, \ldots, \tau_{n-1}]$ be the whole observing time period.
Problem: The station-level crowd flow forecast problem. Given the historical observations $\{x_{\tau_i} \mid i \in [0, 1, \ldots, n-1]\}$, forecast $\hat{x}_{\tau_n}$, aiming to minimize $|\hat{x}_{\tau_n} - x_{\tau_n}|$, where $x_{\tau_n}$ is the ground truth at $\tau_n$.
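To make the definitions concrete, the following minimal sketch aggregates trip records into per-slot in-out flow vectors. The vector layout (the first m entries hold in-flow, the last m hold out-flow), the use of numeric timestamps in seconds, and the function name `aggregate_flows` are assumptions for illustration.

```python
def aggregate_flows(trips, m, t_start, unit, n):
    """Count arrivals and departures per station for each observing time unit.

    trips   : iterable of (s_o, s_d, t_o, t_d) with station indices and timestamps
    m       : number of stations
    t_start : timestamp at which the first time slot begins
    unit    : length of one observing time unit, in the same units as the timestamps
    n       : number of time slots in the observing period
    Returns a list of n vectors of length 2*m: in-flow first, then out-flow.
    """
    x = [[0] * (2 * m) for _ in range(n)]
    for s_o, s_d, t_o, t_d in trips:
        out_slot = int((t_o - t_start) // unit)
        in_slot = int((t_d - t_start) // unit)
        if 0 <= out_slot < n:
            x[out_slot][m + s_o] += 1  # departure counts as out-flow at the origin
        if 0 <= in_slot < n:
            x[in_slot][s_d] += 1       # arrival counts as in-flow at the destination
    return x


# Two stations, three one-minute slots (timestamps in seconds).
trips = [(0, 1, 10, 50), (1, 0, 70, 130), (0, 1, 65, 200)]
flows = aggregate_flows(trips, m=2, t_start=0, unit=60, n=3)
print(flows)  # [[0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
```

The third trip departs inside the observing period but arrives after it ends, so only its departure is counted.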

Table 1. Details of the datasets.
July 2016, only the origin/destination of each taxi trip was recorded; with the taxi zone ID according to the taxi zone map (https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip). In the experiments, the centroid of each taxi zone was treated as a location where a station was located.

Table 2. The rainy dates and foggy dates in New York City, Chicago, and Washington DC.
• k nearest neighbor stations: 4.
• The length l of time slots chosen to stack in the input of each Unet branch: 4.
• Constraints $d_{C1}$ and $d_{C2}$ to limit the size of each cluster of layer 1 and layer 2: 2.5 km and 1.5 km (10.0 km and 5.0 km for the TAXI dataset).
• $r_0$, $r_1$, $r_2$ in each Unet: 16, 16, 16.
• Activation function of gConv, gDownSampling, gUpSampling: relu.
• Loss function: MAE, the metric described in Section 4.3.
• Optimizer: Adam optimizer [24].
• Termination condition: the training reaches 400 iterations, or the model does not achieve further improvement on the validation data for 25 consecutive iterations.
• VARIMA: Vector-ARIMA extends ARIMA to the multivariate case, which can capture the pairwise relations among multiple time series.
• FC: A three-layer fully connected neural network is built. Its output is the forecast of all stations' in-out crowd flow. All input features are the same as ST-Unet.
• MG-CNN: Multi-graph convolutional networks, a deep neural network model that fuses multiple graphs with CNNs for station-level future bike flow forecasts [17]. The history data of the past six time slots are used to forecast the flow in the next time slot.
• ST-net: Neither gDownSampling nor gUpSampling is in the Unet branches; both are replaced by gConv.