# Listening to the City, Attentively: A Spatio-Temporal Attention-Boosted Autoencoder for the Short-Term Flow Prediction Problem


## Abstract


## 1. Introduction

1. We propose a novel autoencoder-based architecture that outperforms the state of the art for the flow prediction problem. To the best of our knowledge, STREED-Net is the first autoencoder architecture that combines the use of time-distributed convolutional blocks with residual connections, CMUs, and two different attention mechanisms. Moreover, unlike other state-of-the-art models [4,6,7,8], STREED-Net focuses only on recent time dependencies (closeness); that is, it relies only on a small number of time periods preceding the one to be predicted. This results in fewer hyperparameters to tune.
2. The impact of the most important components of the architecture on prediction is assessed through an ablation study and discussed. In particular, the study demonstrates the ability of the two different attention blocks to capture and harness important underlying temporal and spatial information.
3. Finally, this work presents a methodologically sound comparative assessment against the best models from the literature on real-life case studies. The analyses consider different error measures, the number of trainable parameters, and a complexity indicator (number of FLOPs). Results indicate that STREED-Net outperforms the considered state-of-the-art approaches while using a relatively small number of parameters and FLOPs.

## 2. Related Work

## 3. Problem Statement

## 4. STREED-Net

**Autoencoder architecture.** Given a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in \mathbb{R}^{n}$, an autoencoder neural network is an unsupervised learning algorithm that applies backpropagation setting the target values equal to the inputs, $y^{(i)} = x^{(i)}$. It is a neural network trained to learn a function $h_{W,b}(x) = \widehat{x} \approx x$, where $W$ and $b$ are the weights and biases of the ANN, respectively. In other words, an autoencoder is a learned approximation of the identity function, whose output $\widehat{x}$ is as similar as possible to $x$. The overall network can be decomposed into two parts: an encoder function $h = f(x)$, which maps the input vector space onto an internal representation, and a decoder that transforms it back, that is, $\widehat{x} = g(h)$. This type of architecture has been applied successfully to several difficult tasks, including traffic prediction [11].
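The idea above can be sketched with a minimal linear autoencoder in NumPy: the encoder $f$ and decoder $g$ are trained by gradient descent on the reconstruction loss with the input itself as target. This is an illustrative toy (assumed dimensions $n=8$, $k=3$), not the STREED-Net implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3  # toy input dimension and code dimension (assumptions)

# Encoder h = f(x) and decoder x_hat = g(h), both linear for brevity.
W_enc = rng.normal(scale=0.1, size=(k, n))
W_dec = rng.normal(scale=0.1, size=(n, k))

X = rng.normal(size=(200, n))  # unlabeled training examples x^(i)

lr = 0.05
for _ in range(500):
    H = X @ W_enc.T        # encode: internal representation
    X_hat = H @ W_dec.T    # decode: reconstruction
    err = X_hat - X        # backprop target is the input itself: y = x
    # Gradients of the mean squared reconstruction loss.
    grad_dec = err.T @ H / len(X)
    grad_enc = (err @ W_dec).T @ X / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, h_{W,b}(x) = g(f(x)) approximates the identity
# as well as a rank-k bottleneck allows.
loss = float(np.mean((X @ W_enc.T @ W_dec.T - X) ** 2))
```

Because the code dimension $k < n$, the reconstruction cannot be exact; the bottleneck is what forces the internal representation to be informative.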

**Attention mechanism.** In DNNs, the attention mechanism helps the network focus on the important features of the input while downplaying the others. This paradigm is inspired by the human neurovisual system, which quickly scans images and identifies sub-areas of interest, optimizing the usage of limited attention resources [36]. Similarly, the attention mechanism in a DNN determines and emphasizes the most informative features in the input data, i.e., those likely to be most valuable to the task at hand.
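A minimal sketch of this weighting idea, assuming toy feature vectors and learned relevance scores (both invented here for illustration): a softmax turns the scores into weights that sum to one, so the attended representation is dominated by the most informative features.

```python
import numpy as np

def attention_weights(scores):
    # Numerically stable softmax: raw relevance scores -> weights in (0, 1)
    # that sum to 1, stressing the highest-scoring features.
    e = np.exp(scores - scores.max())
    return e / e.sum()

features = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])  # 3 features, dim 2
scores = np.array([0.1, 2.0, 0.3])  # hypothetical learned relevance per feature
w = attention_weights(scores)
context = w @ features  # attended representation, dominated by feature 2
```

In a real attention block the scores are themselves produced by a small learned sub-network; only the weighting step is shown here.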

#### 4.1. Encoder

#### 4.2. Cascading Hierarchical Block

#### 4.3. External Factors

#### 4.4. Decoder

#### Channel Attention

#### Spatial Attention

## 5. Experimental Analysis

#### 5.1. Reference Methods

**ST-ResNet [4]**: one of the first deep learning approaches to traffic prediction. It predicts the flows of crowds into and out of each individual region of the city. ST-ResNet uses three residual networks that separately model the temporal aspects of closeness, period, and trend.

**MST3D [7]**: this model is architecturally similar to ST-ResNet. The three time dependencies and the external factors are independently modeled and dynamically merged, assigning different weights to different branches, to obtain the forecast. Unlike ST-ResNet, MST3D learns to identify spatio-temporal correlations using 3D convolutions.

**ST-3DNet [24]**: the network uses two distinct branches to model the temporal components of closeness and trend, while the daily period is left out. Both branches start with a series of 3D convolutional layers used to capture the spatio-temporal dependencies among the input frames. In the closeness branch, the output of the last convolutional layer is linked to a sequence of residual units to further investigate the spatial dependencies between the frames of the closeness period. The most innovative architectural element is the Recalibration Block. It is a block inserted at the end of each of the two main branches to explicitly model the contribution that each region makes to the prediction.

**3D-CLoST [8]**: the model uses sequential 3D convolutions to capture spatio-temporal dependencies. Afterwards, a fully connected layer condenses the learned information into a one-dimensional vector, which is passed to an LSTM block; stacked LSTM layers allow the model to dwell on the temporal dependencies of the input. The output of the LSTM section is added to the output produced by the sub-network for external features. The result is then multiplied by a mask, which allows the user to introduce domain knowledge: the mask is a matrix with null values in the regions of the city that never have inflow or outflow values greater than zero (such areas may or may not exist depending on the conformation of the city), and ones in all other locations.
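The masking step described above amounts to an element-wise product between the predicted flow grid and a binary matrix. A hedged sketch, with an invented 4×4 city grid where the first row of regions is known to carry no flow:

```python
import numpy as np

# Hypothetical prediction: 2 channels (inflow, outflow) on a 4x4 grid.
pred = np.ones((2, 4, 4))

# Domain-knowledge mask: 0 where a region never has flow, 1 elsewhere.
mask = np.ones((4, 4))
mask[0, :] = 0.0  # e.g., a river or park strip with no traffic (assumption)

# Broadcasting applies the same mask to both channels.
masked_pred = pred * mask
```

The mask is fixed, not learned: it simply forces the network's output to zero wherever flow is physically impossible.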

**STAR [6]**: this approach models temporal dependencies by extracting representative frames of closeness, period, and trend. However, unlike other solutions, the model consists of a single branch: the frames selected for the prediction are concatenated along the channel axis to form the main input to the network. STAR also includes a sub-network dedicated to external factors, whose output is added directly to the main network input. Residual learning is used to train the deep network to derive a detailed prediction for the whole city.

**PredCNN [11]**: this network builds on the core idea of recurrent models, in which previous states in the network undergo more transition operations than future states. PredCNN employs an autoencoder with CMUs, which proved to be a valid alternative to RNNs. Unlike the models discussed above, this approach considers only the temporal component of closeness, but it has a relatively complex architecture. The key idea of PredCNN is to sequentially capture spatial and temporal dependencies using CMU blocks.

**ACFM [25]**: this module is composed of two progressive Convolutional Long Short-Term Memory (ConvLSTM [52]) units connected via a convolutional layer. Specifically, the first ConvLSTM unit takes the sequential flow features as input and generates a hidden state at each time-step, which is further fed into the connected convolutional layer for spatial attention map inference. The second ConvLSTM unit aims at learning the dynamic spatial-temporal representations from the attentionally weighted traffic flow features.

**HA**: the algorithm generates inflow and outflow forecasts by averaging the historical values observed on the same day of the week at the same time of day as the instant to be predicted. This classical method serves as a baseline in our comparative analysis, since it was not developed specifically for the flow prediction problem.
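The HA baseline can be sketched in a few lines. This toy version (synthetic hourly scalar series, invented here for illustration) averages all past observations that share the same weekday and hour as the target instant:

```python
import numpy as np

# Toy series: one scalar flow value per hour over four weeks.
hours_per_week = 24 * 7
weeks = 4
rng = np.random.default_rng(1)
series = rng.poisson(10.0, size=weeks * hours_per_week).astype(float)

def ha_forecast(series, t, period=hours_per_week):
    # Average all strictly earlier observations taken at the same
    # weekday and hour (i.e., indices congruent to t modulo one week).
    past = series[t % period : t : period]
    return float(past.mean())

# Forecast hour 5 of the fourth week from the three previous weeks.
pred = ha_forecast(series, t=3 * hours_per_week + 5)
```

In the paper's setting the same averaging would be applied independently to every grid cell and to both inflow and outflow channels.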

#### 5.2. Case Studies

**BikeNYC.** In this first case study, the behavior of bicycles in New York City is analyzed. The data were collected by the NYC Bike system in 2014, from 1 April to 30 September. Records from the last 10 days form the test set, while the rest are used for training. The length of each time period is 1 h.

**TaxiBJ.** In the second case study, a fleet of cabs in the city of Beijing is considered. Data were collected in four different time periods: 1 July 2013–30 October 2013, 1 March 2014–30 June 2014, 1 March 2015–30 June 2015, and 1 November 2015–15 April 2016. The last four weeks form the test set, and the rest are used for training. The length of each time period is set to 30 min.

**TaxiNYC.** Finally, a data set containing data from a fleet of taxicabs in New York is considered. Data were collected from 1 January 2009 to 31 December 2014. The last four weeks form the test set, and the rest are used for training. The length of each time period is set to one hour. This case study has been specifically created to perform a more thorough and sound experimental assessment than those presented in the literature.

#### 5.3. Analysis of Results

#### 5.3.1. BikeNYC

#### 5.3.2. TaxiBJ

#### 5.3.3. TaxiNYC

- **ST-ResNet\*.** Optimized parameters: number of residual units, batch size, and learning rate. Optimal values found: 2, 16, and 0.0001.
- **MST3D.** Optimized parameters: batch size and learning rate. Optimal values found: 16 and 0.00034.
- **PredCNN.** Optimized parameters: encoder length, decoder length, number of hidden units, batch size, and learning rate. Optimal values found: 2, 3, 64, 16, and 0.0001.
- **ST-3DNet.** Optimized parameters: number of residual units, batch size, and learning rate. Optimal values found: 5, 16, and 0.00095.
- **STAR\*.** Optimized parameters: number of residual units, batch size, and learning rate. Optimal values found: 2, 16, and 0.0001.
- **3D-CLoST.** Optimized parameters: number of LSTM layers, number of hidden units in each LSTM layer, batch size, and learning rate. Optimal values found: 2, 500, 16, and 0.00076.
- **ACFM.** Optimized parameter: learning rate. Optimal value found: 0.0003.
- **STREED-Net.** Optimized parameters: kernel size, batch size, and learning rate. Optimal values found: 3, 64, and 0.00086.

#### 5.4. Ablation Study

- **STREED-Net_N3.** Same architecture as STREED-Net, but input volumes with 3 frames ($[X_{t-3}, X_{t-2}, X_{t-1}]$).
- **STREED-Net_N5.** Same architecture as STREED-Net, but input volumes with 5 frames ($[X_{t-5}, X_{t-4}, X_{t-3}, X_{t-2}, X_{t-1}]$).
- **STREED-Net_NoLSC.** STREED-Net with the long skip connection between encoder and decoder removed.
- **STREED-Net_NoAtt.** STREED-Net without the attention blocks.
- **STREED-Net_NoExt.** STREED-Net without the external factors.
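The N3/N5 variants differ only in how many past frames are stacked into the closeness input volume. A hedged sketch of that stacking, with an invented frame history of shape (time, channels, height, width):

```python
import numpy as np

# Hypothetical history of flow frames: (time, 2 channels, 16x8 grid).
frames = np.random.default_rng(2).random((100, 2, 16, 8))

def closeness_volume(frames, t, n):
    # Stack the n frames immediately preceding time t, oldest first:
    # [X_{t-n}, ..., X_{t-1}].
    return np.stack([frames[t - i] for i in range(n, 0, -1)], axis=0)

vol3 = closeness_volume(frames, t=50, n=3)  # STREED-Net_N3 input
vol5 = closeness_volume(frames, t=50, n=5)  # STREED-Net_N5 input
```

The grid size and channel count here are placeholders; only the closeness-window construction is the point.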

#### 5.5. Number of Trainable Parameters and FLOPs

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
ANN | Artificial Neural Network |
BN | Batch Normalization |
CMU | Cascade Multiplicative Unit |
CNN | Convolutional Neural Network |
DNN | Deep Neural Network |
GRU | Gated Recurrent Unit |
HA | Historical Average |
LSTM | Long Short-Term Memory |
MU | Multiplicative Unit |
ReLU | Rectified Linear Unit |
RNN | Recurrent Neural Network |
SVR | Support Vector Regression |

## References

1. Zheng, Y.; Capra, L.; Wolfson, O.; Yang, H. Urban computing: Concepts, methodologies, and applications. ACM Trans. Intell. Syst. Technol. (TIST) **2014**, 5, 1–55.
2. Tolomei, L.; Fiorini, S.; Ciociola, A.; Vassio, L.; Giordano, D.; Mellia, M. Benefits of Relocation on E-scooter Sharing—A Data-Informed Approach. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3170–3175.
3. Yuan, C.; Li, Y.; Huang, H.; Wang, S.; Sun, Z.; Li, Y. Using traffic flow characteristics to predict real-time conflict risk: A novel method for trajectory data analysis. Anal. Methods Accid. Res. **2022**, 35, 100217.
4. Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; Yi, X.; Li, T. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artif. Intell. **2018**, 259, 147–166.
5. Liu, Y.; Lyu, C.; Khadka, A.; Zhang, W.; Liu, Z. Spatio-Temporal Ensemble Method for Car-Hailing Demand Prediction. IEEE Trans. Intell. Transp. Syst. **2019**, 21, 1–6.
6. Wang, H.; Su, H. STAR: A Concise Deep Learning Framework for Citywide Human Mobility Prediction. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 304–309.
7. Chen, C.; Li, K.; Teo, S.G.; Chen, G.; Zou, X.; Yang, X.; Vijay, R.C.; Feng, J.; Zeng, Z. Exploiting spatio-temporal correlations with multiple 3D convolutional neural networks for citywide vehicle flow prediction. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 893–898.
8. Fiorini, S.; Pilotti, G.; Ciavotta, M.; Maurino, A. 3D-CLoST: A CNN-LSTM Approach for Mobility Dynamics Prediction in Smart Cities. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 3180–3189.
9. Yao, H.; Tang, X.; Wei, H.; Zheng, G.; Li, Z. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5668–5675.
10. Chowanda, A. Spatiotemporal Features Learning from Song for Emotions Recognition with Time Distributed CNN. In Proceedings of the 2021 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), Jakarta, Indonesia, 28 October 2021; Volume 1, pp. 407–412.
11. Xu, Z.; Wang, Y.; Long, M.; Wang, J.; Kliss, M. PredCNN: Predictive Learning with Cascade Convolutions. In Proceedings of the IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 2940–2947.
12. Yu, R.; Li, Y.; Shahabi, C.; Demiryurek, U.; Liu, Y. Deep learning: A generic approach for extreme condition traffic forecasting. In Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, TX, USA, 27–29 April 2017; pp. 777–785.
13. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
14. Moayedi, H.Z.; Masnadi-Shirazi, M. ARIMA model for network traffic prediction and anomaly detection. In Proceedings of the 2008 International Symposium on Information Technology, Kuala Lumpur, Malaysia, 26–28 August 2008; Volume 4, pp. 1–6.
15. Guo, J.; Huang, W.; Williams, B.M. Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification. Transp. Res. Part C Emerg. Technol. **2014**, 43, 50–64.
16. Sun, S.; Zhang, C.; Yu, G. A Bayesian network approach to traffic flow forecasting. IEEE Trans. Intell. Transp. Syst. **2006**, 7, 124–132.
17. Qi, Y.; Ishak, S. A Hidden Markov Model for short term prediction of traffic conditions on freeways. Transp. Res. Part C Emerg. Technol. **2014**, 43, 95–111.
18. Wu, C.H.; Ho, J.M.; Lee, D.T. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. **2004**, 5, 276–281.
19. Asif, M.T.; Dauwels, J.; Goh, C.Y.; Oran, A.; Fathi, E.; Xu, M.; Dhanya, M.M.; Mitrovic, N.; Jaillet, P. Spatiotemporal patterns in large-scale traffic speed prediction. IEEE Trans. Intell. Transp. Syst. **2013**, 15, 794–804.
20. Tong, Y.; Chen, Y.; Zhou, Z.; Chen, L.; Wang, J.; Yang, Q.; Ye, J.; Lv, W. The Simpler the Better: A Unified Approach to Predicting Original Taxi Demands Based on Large-Scale Online Platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), Halifax, NS, Canada, 13–17 August 2017; pp. 1653–1662.
21. Qian, X.; Ukkusuri, S.V. Spatial variation of the urban taxi ridership using GPS data. Appl. Geogr. **2015**, 59, 31–42.
22. Azzouni, A.; Pujolle, G. A long short-term memory recurrent neural network framework for network traffic matrix prediction. arXiv **2017**, arXiv:1705.05690.
23. Ma, X.; Dai, Z.; He, Z.; Ma, J.; Wang, Y.; Wang, Y. Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction. Sensors **2017**, 17, 818.
24. Guo, S.; Lin, Y.; Li, S.; Chen, Z.; Wan, H. Deep spatial–temporal 3D convolutional neural networks for traffic data forecasting. IEEE Trans. Intell. Transp. Syst. **2019**, 20, 3913–3926.
25. Liu, L.; Zhang, R.; Peng, J.; Li, G.; Du, B.; Lin, L. Attentive crowd flow machines. In Proceedings of the 26th ACM International Conference on Multimedia, Lisbon, Portugal, 22–26 October 2018; pp. 1553–1561.
26. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv **2016**, arXiv:1609.02907.
27. Jiang, W.; Luo, J. Graph Neural Network for Traffic Forecasting: A Survey. arXiv **2021**, arXiv:2101.11174.
28. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv **2017**, arXiv:1707.01926.
29. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. **2019**, 21, 3848–3858.
30. Peng, H.; Du, B.; Liu, M.; Liu, M.; Ji, S.; Wang, S.; Zhang, X.; He, L. Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning. Inf. Sci. **2021**, 578, 401–416.
31. Li, Y.; Zhao, W.; Fan, H. A Spatio-Temporal Graph Neural Network Approach for Traffic Flow Prediction. Mathematics **2022**, 10, 1754.
32. Lee, K.; Eo, M.; Jung, E.; Yoon, Y.; Rhee, W. Short-Term Traffic Prediction With Deep Neural Networks: A Survey. IEEE Access **2021**, 9, 54739–54756.
33. Yin, X.; Wu, G.; Wei, J.; Shen, Y.; Qi, H.; Yin, B. Deep Learning on Traffic Prediction: Methods, Analysis, and Future Directions. IEEE Trans. Intell. Transp. Syst. **2022**, 23, 4927–4943.
34. Wang, J.; Jiang, J.; Jiang, W.; Li, C.; Zhao, W.X. LibCity: An Open Library for Traffic Prediction. In Proceedings of the 29th International Conference on Advances in Geographic Information Systems, Beijing, China, 2–5 November 2021; pp. 145–148.
35. Kemp, K.; Sean, C.A.; Ola, A.; Jochen, A.; Carl, A.; Brandon, B.; David, A.B.; Barry, B.; Scott, B.; Daniel, G.B.; et al. Encyclopedia of Geographic Information Science; Sage: Thousand Oaks, CA, USA, 2008.
36. Kastner, S.; Ungerleider, L.G. Mechanisms of visual attention in the human cortex. Annu. Rev. Neurosci. **2000**, 23, 315–341.
37. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv **2014**, arXiv:1409.0473.
38. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2017; pp. 5209–5217.
39. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
40. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; Wierstra, D. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1462–1471.
41. Liu, Y.; Liu, Z.; Lyu, C.; Ye, J. Attention-Based Deep Ensemble Net for Large-Scale Online Taxi-Hailing Demand Prediction. IEEE Trans. Intell. Transp. Syst. **2020**, 21, 4798–4807.
42. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv **2015**, arXiv:1502.03167.
43. Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How Does Batch Normalization Help Optimization? In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31, pp. 2483–2493.
44. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1.
45. Ranjan, N.; Bhandari, S.; Zhao, H.; Kim, H.; Khan, P. City-Wide Traffic Congestion Prediction Based on CNN, LSTM and Transpose CNN. IEEE Access **2020**, 8, 81606–81620.
46. Kalchbrenner, N.; Oord, A.; Simonyan, K.; Danihelka, I.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Video pixel networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1771–1779.
47. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
48. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17), San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
49. Woo, S.; Park, J.; Lee, J.Y.; So Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
50. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
51. Komodakis, N.; Zagoruyko, S. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
52. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
53. Bianco, S.; Cadene, R.; Celona, L.; Napoletano, P. Benchmark Analysis of Representative Deep Neural Network Architectures. IEEE Access **2018**, 6, 64270–64277.

**Prediction errors on BikeNYC.**

Model | RMSE | MAPE | APE |
---|---|---|---|
HA | 6.56 | 26.46 | 4.09 $\cdot 10^{5}$ |
ST-ResNet | 5.01 ± 0.07 | 21.97 ± 0.26 | 3.40 $\cdot 10^{5}$ ± 4.06 $\cdot 10^{5}$ |
MST3D | 4.98 ± 0.05 | 22.03 ± 0.47 | 3.41 $\cdot 10^{5}$ ± 7.26 $\cdot 10^{5}$ |
3D-CLoST | 4.90 ± 0.04 | 21.38 ± 0.20 | 3.31 $\cdot 10^{5}$ ± 3.12 $\cdot 10^{5}$ |
PredCNN | 4.81 ± 0.04 | 21.38 ± 0.24 | 3.31 $\cdot 10^{5}$ ± 3.76 $\cdot 10^{5}$ |
ST-3DNet | 4.75 ± 0.06 | 21.42 ± 0.28 | 3.31 $\cdot 10^{5}$ ± 4.36 $\cdot 10^{5}$ |
STAR | 4.73 ± 0.05 | 20.97 ± 0.13 | 3.24 $\cdot 10^{5}$ ± 2.02 $\cdot 10^{5}$ |
ACFM | 4.68 ± 0.13 | 20.98 ± 0.68 | 3.25 $\cdot 10^{5}$ ± 1.05 $\cdot 10^{5}$ |
STREED-Net | 4.67 ± 0.03 | 20.85 ± 0.15 | **3.23 $\cdot 10^{5}$ ± 2.31 $\cdot 10^{5}$** |

**Prediction errors on TaxiBJ.**

Model | RMSE | MAPE | APE |
---|---|---|---|
HA | 40.93 | 30.96 | 6.77 $\cdot 10^{7}$ |
ST-ResNet | 17.56 ± 0.91 | 15.74 ± 0.94 | 3.45 $\cdot 10^{7}$ ± 2.05 $\cdot 10^{6}$ |
MST3D | 21.34 ± 0.55 | 22.02 ± 1.40 | 4.81 $\cdot 10^{7}$ ± 3.03 $\cdot 10^{5}$ |
3D-CLoST | 17.10 ± 0.23 | 16.22 ± 0.20 | 3.55 $\cdot 10^{7}$ ± 4.39 $\cdot 10^{5}$ |
PredCNN | 17.42 ± 0.12 | 15.69 ± 0.17 | 3.43 $\cdot 10^{7}$ ± 3.76 $\cdot 10^{5}$ |
ST-3DNet | 17.29 ± 0.42 | 15.64 ± 0.52 | 3.43 $\cdot 10^{7}$ ± 1.13 $\cdot 10^{6}$ |
STAR | 16.25 ± 0.40 | 15.40 ± 0.62 | 3.38 $\cdot 10^{7}$ ± 1.36 $\cdot 10^{6}$ |
ACFM | 15.67 ± 0.23 | 15.16 ± 0.33 | 3.32 $\cdot 10^{7}$ ± 7.25 $\cdot 10^{5}$ |
STREED-Net | 15.61 ± 0.11 | 14.73 ± 0.21 | **3.22 $\cdot 10^{7}$ ± 4.51 $\cdot 10^{5}$** |

**Prediction errors on TaxiNYC.**

Model | RMSE | MAPE | APE |
---|---|---|---|
HA | 164.31 | 27.19 | 7.94 $\cdot 10^{5}$ |
ST-ResNet* | 35.87 ± 0.60 | 22.52 ± 3.43 | 6.57 $\cdot 10^{5}$ ± 1.00 $\cdot 10^{5}$ |
MST3D | 48.91 ± 1.98 | 23.98 ± 1.30 | 6.98 $\cdot 10^{5}$ ± 1.34 $\cdot 10^{4}$ |
3D-CLoST | 48.17 ± 3.16 | 22.18 ± 1.05 | 6.48 $\cdot 10^{5}$ ± 3.08 $\cdot 10^{4}$ |
PredCNN | 40.91 ± 0.51 | 25.65 ± 2.16 | 7.49 $\cdot 10^{5}$ ± 6.32 $\cdot 10^{4}$ |
ST-3DNet | 41.62 ± 3.44 | 25.75 ± 6.11 | 7.52 $\cdot 10^{5}$ ± 1.78 $\cdot 10^{5}$ |
STAR* | 36.44 ± 0.88 | 25.36 ± 5.24 | 7.41 $\cdot 10^{5}$ ± 1.53 $\cdot 10^{5}$ |
ACFM | 36.75 ± 0.94 | 19.10 ± 1.08 | **5.58 $\cdot 10^{5}$ ± 2.21 $\cdot 10^{4}$** |
STREED-Net | 36.22 ± 0.72 | 20.29 ± 1.48 | 5.93 $\cdot 10^{5}$ ± 4.31 $\cdot 10^{4}$ |

**Ablation study results on BikeNYC.**

Model | RMSE | MAPE | APE |
---|---|---|---|
STREED-Net_N3 | 4.75 ± 0.04 | 21.18 ± 0.18 | 3.28 $\cdot 10^{5}$ ± 2.73 $\cdot 10^{3}$ |
STREED-Net_N5 | 4.74 ± 0.03 | 21.03 ± 0.22 | 3.26 $\cdot 10^{5}$ ± 3.43 $\cdot 10^{3}$ |
STREED-Net_NoLSC | 4.84 ± 0.04 | 21.53 ± 0.24 | 3.33 $\cdot 10^{5}$ ± 3.71 $\cdot 10^{3}$ |
STREED-Net_NoAtt | 4.78 ± 0.04 | 20.95 ± 0.27 | 3.25 $\cdot 10^{5}$ ± 4.20 $\cdot 10^{3}$ |
STREED-Net_NoExt | 4.76 ± 0.04 | 20.99 ± 0.29 | 3.26 $\cdot 10^{5}$ ± 4.55 $\cdot 10^{3}$ |
STREED-Net | 4.67 ± 0.03 | 20.85 ± 0.15 | **3.23 $\cdot 10^{5}$ ± 2.31 $\cdot 10^{3}$** |

**Number of trainable parameters.**

Model | BikeNYC | TaxiNYC | TaxiBJ |
---|---|---|---|
ST-ResNet | 906,272 | 458,304 | 2,696,992 |
MST3D | 668,218 | 668,378 | 8,674,370 |
3D-CLoST | 13,099,090 | 19,477,648 | 72,046,714 |
PredCNN | 3,967,906 | 3,967,906 | 4,827,842 |
ST-3DNet | 540,696 | 617,586 | 903,242 |
STAR | 161,052 | 310,076 | 476,388 |
ACFM | 182,065 | 270,581 | 969,893 |
STREED-Net | 582,673 | 582,673 | 765,497 |

**Number of FLOPs.**

Model | BikeNYC | TaxiNYC | TaxiBJ |
---|---|---|---|
ST-ResNet | 230,849,450 | 115,735,786 | 5,459,663,018 |
MST3D | 33,042,250 | 33,042,570 | 272,483,226 |
3D-CLoST | 29,613,094 | 9,601,920 | 338,148,804 |
PredCNN | 1,015,468,288 | 1,015,468,288 | 9,883,813,888 |
ST-3DNet | 171,242,496 | 190,130,922 | 1,823,295,898 |
STAR | 40,449,706 | 78,231,530 | 928,100,922 |
ACFM | 41,687,924 | 93,643,568 | 621,498,864 |
STREED-Net | 130,047,738 | 130,047,738 | 1,067,063,882 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Fiorini, S.; Ciavotta, M.; Maurino, A.
Listening to the City, Attentively: A Spatio-Temporal Attention-Boosted Autoencoder for the Short-Term Flow Prediction Problem. *Algorithms* **2022**, *15*, 376.
https://doi.org/10.3390/a15100376
