# Benchmarking Attention-Based Interpretability of Deep Learning in Multivariate Time Series Predictions


## Abstract


## 1. Introduction

- We design a novel benchmark of synthetic datasets with a transparent underlying generating process of multiple interacting time series of increasing complexity, intended for understanding the inner workings of attention-based deep learning models for multivariate forecasting tasks.
- Using the designed benchmark, we conduct a comprehensive analysis of existing attention-based deep neural networks along three dimensions: prediction performance, interpretability correctness, and sensitivity to hyperparameters.
- We demonstrate that although most models achieve satisfactory and stable prediction performance, they often fail to provide correct interpretability, and that intrinsic interpretability improves with the complexity of the interactions between the time series.

## 2. Deep Learning Models for Multivariate Time Series Data

#### 2.1. seq2graph

#### 2.2. Interpretable Multi-Variable Long Short-Term Memory

$\tilde{\mathbf{h}}$ is used as input to calculate the attention coefficients $\alpha $. The product of $\alpha $ and the hidden state matrix $\tilde{\mathbf{h}}$ is then concatenated with $\tilde{\mathbf{h}}$, i.e., $[\alpha *\tilde{\mathbf{h}},\tilde{\mathbf{h}}]$, and used as input to the attention mechanism that calculates the $\beta $ coefficients.

#### 2.3. Temporal Causal Discovery Framework

#### 2.4. Dual Stage Attention

#### 2.5. Overview of Analyzed Deep Learning Models

## 3. Experimental and Evaluation Framework/Setup

#### 3.1. Synthetic Time Series Datasets

- r = 1.5: time series quickly converge to (1.5 − 1)/1.5
- r = 2.5: time series converge to (2.5 − 1)/2.5, fluctuating briefly beforehand
- r = 3.2: time series oscillate between two values
- r = 3.55: time series oscillate between more than four values
- r = 3.56996: time series enter the chaotic regime
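These regimes follow directly from iterating the logistic map $x_{t+1}=rx_t(1-x_t)$; a short sketch (the initial value 0.4 is an arbitrary choice):

```python
import numpy as np

def logistic_map(r, x0=0.4, n=500):
    """Iterate x_{t+1} = r * x_t * (1 - x_t) for n steps."""
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = r * x[t - 1] * (1 - x[t - 1])
    return x

# r = 1.5: converges to the fixed point (r - 1) / r = 1/3
tail = logistic_map(1.5)[-50:]
assert np.allclose(tail, (1.5 - 1) / 1.5, atol=1e-6)

# r = 3.2: period-2 oscillation -> exactly two distinct values in the tail
tail = np.round(logistic_map(3.2)[-50:], 6)
assert len(set(tail)) == 2
```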

#### 3.2. Quantitative Evaluation—Prediction Performance

- ${\mu}_{noise}=0$
- ${\sigma}_{noise}^{2}=0.1$
- $f=0.3$
- N = 5 (except dataset 4, where N = 2, and datasets 7 and 8, where N = 4)
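Under these settings, the noise injection can be sketched as follows; interpreting $f$ as the fraction of perturbed time steps is an assumption made for illustration:

```python
import numpy as np

def add_noise(x, f=0.3, mu=0.0, var=0.1, rng=None):
    """Perturb a fraction f of time steps with Gaussian noise N(mu, var).

    Reading f as the fraction of noisy time steps is an assumption;
    mu and var correspond to mu_noise and sigma^2_noise in the text.
    """
    rng = np.random.default_rng(rng)
    x = x.copy()
    mask = rng.random(x.shape) < f          # choose which steps get noise
    x[mask] += rng.normal(mu, np.sqrt(var), size=mask.sum())
    return x

clean = np.zeros(1000)
noisy = add_noise(clean, f=0.3, var=0.1, rng=42)
print((noisy != 0).mean())  # roughly 0.3 of the points are perturbed
```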

- IMV-LSTM: 3 experiments
- seq2graph: 5 experiments
- TCDF: 10 experiments
- DA-RNN: 3 experiments

#### 3.3. Qualitative Evaluation—Interpretability

#### 3.4. Sensitivity Analysis—Dependence on Hyperparameters

- Noise frequency f: from 0 to 1 in steps of 0.05
- Noise amount ${\sigma}_{noise}^{2}$: 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 3, and 5
- Number of time series N: from 3 to 20 in steps of 2
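The sweep above amounts to a simple one-at-a-time hyperparameter grid; a sketch of the enumerated values:

```python
import numpy as np

# Sweep values used in the sensitivity analysis, one hyperparameter varied at a time
f_values = np.round(np.arange(0.0, 1.0 + 1e-9, 0.05), 2)  # noise frequency
var_values = [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 3, 5]       # noise variance
n_values = list(range(3, 21, 2))                           # number of time series

print(len(f_values), len(var_values), len(n_values))  # 21 9 9
```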

## 4. Experimental Results

#### 4.1. Quantitative Analysis

#### 4.2. Qualitative Analysis

#### 4.2.1. seq2graph

#### 4.2.2. TCDF

#### 4.2.3. IMV-LSTM

#### 4.2.4. DA-RNN

#### 4.3. Sensitivity Analysis

#### 4.3.1. Dependency on Noise Frequency

#### 4.3.2. Dependency on Noise Amplitude

#### 4.3.3. Dependency on Number of Time Series

#### 4.4. Simulated Data from Statistical and Mechanistic Models

#### 4.4.1. Ising Model

#### 4.4.2. Logistic Map Inspired Model

## 5. Conclusions and Outlook

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Lim, B.; Zohren, S. Time Series Forecasting With Deep Learning: A Survey. arXiv 2020, arXiv:2004.13408.
- Ramchandani, A.; Fan, C.; Mostafavi, A. DeepCOVIDNet: An interpretable deep learning model for predictive surveillance of COVID-19 using heterogeneous features and their interactions. IEEE Access 2020, 8, 159915–159930.
- Shi, Z.R.; Wang, C.; Fang, F. Artificial intelligence for social good: A survey. arXiv 2020, arXiv:2001.01818.
- Song, W.; Chandramitasari, W.; Weng, W.; Fujimura, S. Short-Term Electricity Consumption Forecasting Based on the Attentive Encoder-Decoder Model. IEEJ Trans. Electron. Inf. Syst. 2020, 140, 846–855.
- Arya, V.; Bellamy, R.K.; Chen, P.Y.; Dhurandhar, A.; Hind, M.; Hoffman, S.C.; Houde, S.; Liao, Q.V.; Luss, R.; Mojsilović, A.; et al. One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques. arXiv 2019, arXiv:1909.03012.
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
- Cetinic, E.; Lipic, T.; Grgic, S. A deep learning perspective on beauty, sentiment, and remembrance of art. IEEE Access 2019, 7, 73694–73710.
- Lake, B.M.; Salakhutdinov, R.; Tenenbaum, J.B. The Omniglot challenge: A 3-year progress report. Curr. Opin. Behav. Sci. 2019, 29, 97–104.
- Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Lawrence Zitnick, C.; Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1988–1997.
- Santoro, A.; Hill, F.; Barrett, D.; Morcos, A.; Lillicrap, T. Measuring abstract reasoning in neural networks. In Proceedings of the International Conference on Machine Learning, Alvsjo, Sweden, 10–15 July 2018; pp. 4477–4486.
- Springer, J.M.; Kenyon, G.T. It is Hard for Neural Networks To Learn the Game of Life. arXiv 2020, arXiv:2009.01398.
- Chollet, F. On the Measure of Intelligence. arXiv 2019, arXiv:1911.01547.
- Assaf, R.; Schumann, A. Explainable Deep Neural Networks for Multivariate Time Series Predictions. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 6488–6490.
- Arnout, H.; El-Assady, M.; Oelke, D.; Keim, D.A. Towards a Rigorous Evaluation of XAI Methods on Time Series. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 4197–4201.
- Ismail, A.A.; Gunady, M.; Corrada Bravo, H.; Feizi, S. Benchmarking Deep Learning Interpretability in Time Series Predictions. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
- Pantiskas, L.; Verstoep, C.; Bal, H. Interpretable Multivariate Time Series Forecasting with Temporal Attention Convolutional Neural Networks. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence, Canberra, Australia, 1–4 December 2020.
- Fauvel, K.; Masson, V.; Fromont, É. A Performance-Explainability Framework to Benchmark Machine Learning Methods: Application to Multivariate Time Series Classifiers. arXiv 2020, arXiv:2005.14501.
- Mohankumar, A.K.; Nema, P.; Narasimhan, S.; Khapra, M.M.; Srinivasan, B.V.; Ravindran, B. Towards Transparent and Explainable Attention Models. arXiv 2020, arXiv:2004.14243.
- Runge, J. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 075310.
- Runge, J.; Nowack, P.; Kretschmer, M.; Flaxman, S.; Sejdinovic, D. Detecting and quantifying causal associations in large nonlinear time series datasets. Sci. Adv. 2019, 5, eaau4996.
- Runge, J.; Bathiany, S.; Bollt, E.; Camps-Valls, G.; Coumou, D.; Deyle, E.; Glymour, C.; Kretschmer, M.; Mahecha, M.; Munoz-Mari, J.; et al. Inferring causation from time series with perspectives in Earth system sciences. Nat. Commun. 2019, 10.
- Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 2020, 36, 54–74.
- Dang, X.H.; Shah, S.Y.; Zerfos, P. seq2graph: Discovering Dynamic Dependencies from Multivariate Time Series with Multi-level Attention. arXiv 2018, arXiv:1812.04448.
- Guo, T.; Lin, T.; Antulov-Fantulin, N. Exploring Interpretable LSTM Neural Networks over Multi-Variable Data. arXiv 2019, arXiv:1905.12034.
- Nauta, M.; Bucur, D.; Seifert, C. Causal Discovery with Attention-Based Convolutional Neural Networks. Mach. Learn. Knowl. Extr. 2019, 1, 312–340.
- Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; Cottrell, G. A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction. arXiv 2017, arXiv:1704.02971.
- Onsager, L. Crystal statistics. I. A two-dimensional model with an order-disorder transition. Phys. Rev. 1944, 65, 117.
- Landau, D.P.; Binder, K. A Guide to Monte Carlo Simulations in Statistical Physics; Cambridge University Press: Cambridge, UK, 2009.
- Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85.
- Iso, S.; Shiba, S.; Yokoo, S. Scale-invariant feature extraction of neural network and renormalization group flow. Phys. Rev. E 2018, 97, 053304.

**Figure 1.**Parallel coordinates plot of evaluated models within Performance-Explainability Framework.

**Figure 2.**Parallel coordinates plot of multivariate time-series synthetic datasets with embedded interactions, linearity, and complexity. Datasets are presented in detail in Table 1.

**Figure 3.**Stability of prediction performance by dataset. On the y-axis we plot the standard deviation of the MSE divided by its mean value; the x-axis corresponds to the dataset index. TCDF is the model with the most unstable performance across most datasets, with seq2graph performing worse on datasets 2 and 10. IMV-LSTM is the model with the most stable performance.
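The stability measure plotted in Figure 3 is the coefficient of variation of the MSE across repeated experiments; a minimal sketch with hypothetical MSE values:

```python
import numpy as np

def stability(mse_per_experiment):
    """Coefficient of variation of the MSE: std(MSE) / mean(MSE)."""
    mse = np.asarray(mse_per_experiment, dtype=float)
    return mse.std() / mse.mean()

# Hypothetical MSE values from three repeated runs of one model on one dataset
print(round(stability([0.0015, 0.0014, 0.0016]), 3))  # 0.054
```

Lower values mean more reproducible performance across retrainings; the metric is scale-free, so models with very different absolute MSE remain comparable.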

**Figure 4.**Ground truth (left panel) and mean $\beta $ values (right panel) from the seq2graph model for dataset 4. The y-axis corresponds to the target series index, and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. This dataset consists of two time series that are generated from each other without autoregression. This is why we expect high values on the antidiagonal. The $\beta $ coefficients of seq2graph differ from what we expect and are biased towards the time series with index 0.

**Figure 5.**The mean values of $\alpha $ coefficients of the seq2graph model for dataset 2. The y-axis corresponds to the index of the target series and the x-axis corresponds to the time lag whose impact on the selected target series we plot. We can see the seq2graph bias towards the beginning of the window.

**Figure 6.**Mean (left) and standard deviation (right) of $\beta $ coefficients from the seq2graph model for dataset 2. The y-axis corresponds to the target series index, and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. In the left panel, similar to the analysis in Figure 4, we can see that seq2graph is biased towards one time series, in this case the time series with index 2 and, to a lesser extent, the time series with index 0 (i.e., the beginning of the window). In the right panel, we plot the standard deviation of the $\beta $ coefficients, averaged across all experiments. A high standard deviation with almost uniform mean values suggests low confidence in the given interpretability.

**Figure 7.**Percentage of TCDF-retrieved causality associations within series in datasets 2 and 4. On the y-axis, we plot the target time series. The x-axis corresponds to the index of the series whose impact on the selected target series we count. Values in a single row do not need to sum to 1 for TCDF, because there can be experiments where TCDF did not find any causality for the targeted time series; these are not attention coefficients. In the left panel, we plot the percentage of experiments in which TCDF found causality for dataset 2. Interpretability is correct for this dataset, although for the last time series causality was found in only a small percentage of experiments. It is crucial to notice that TCDF never found incorrect causality in this dataset. In the right panel, we plot the percentage of experiments in which TCDF found causality for dataset 4. For this dataset, TCDF always indicates that a single time series generates both of them; in the majority of experiments it was the time series with index 0.

**Figure 8.**Percentage of TCDF-retrieved causality associations within series in dataset 8. On the y-axis we plot the target time series. The x-axis corresponds to the index of the series whose impact on the selected target series we count. Values in a single row do not need to sum to 1 for TCDF, because there can be experiments where TCDF did not find any causality for the targeted time series. We can see that TCDF gives correct interpretability for the time series with indices 0 and 3, whose behaviour is autoregressive. For the time series with indices 1 and 2, this model gives wrong interpretability. Furthermore, even though the model gives correct interpretability for the time series with index 3, it is found in only 40% of experiments.

**Figure 9.**The mean values of $\beta $ coefficients from the IMV-LSTM model for datasets 4 and 2. The y-axis corresponds to the target series index, and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. In the left panel, we plot the mean $\beta $ coefficients of the IMV-LSTM model for dataset 4. We can see that the IMV-LSTM model gives correct interpretability for this dataset, but with low confidence. In the right panel, we plot the mean $\beta $ coefficients of IMV-LSTM for dataset 2. Again, the IMV-LSTM model gives correct interpretability, this time with high confidence.

**Figure 10.**The mean values of $\alpha $ coefficients of IMV-LSTM model for dataset 2. The y-axis corresponds to the target series’s index, and the x-axis corresponds to the time lag whose impact on the selected target series we plot. We can see that IMV-LSTM is also biased towards the end of the window. However, it is crucial to notice that for time series which do not impact target series, $\alpha $ coefficients are uniform.

**Figure 11.**The mean values of $\beta $ coefficients from the IMV-LSTM model for dataset 8. The y-axis corresponds to the index of the target series and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. We can see that IMV-LSTM gives correct interpretability for the time series with indices 0 and 3, whose behaviour is autoregressive. For the time series with indices 1 and 2, this model gives wrong interpretability.

**Figure 12.**Ground truth (left panel) and aggregated $\beta $ values (right panel) from the DA-RNN model for dataset 5. The y-axis corresponds to the target series index, and the x-axis corresponds to the index of series whose impact on selected target series we plot. This dataset consists of N = 5 time series that are generated from time series with index 0. This is the reason we expect high values in the first column. $\beta $ coefficients of DA-RNN differ from what we expect. Furthermore, the distribution of these coefficients is highly uniform.

**Figure 13.**The mean values of $\alpha $ coefficients of the DA-RNN model for dataset 7 for time series with index 0. The y-axis corresponds to the index of feature in Encoder output, and the x-axis corresponds to the time lag whose impact on selected target series we plot. We can see that DA-RNN is biased towards the end of the window. The same behavior is seen with seq2graph and IMV-LSTM.

**Figure 14.**In the left panel, we can see how the maximum error percentage for specific model changes with f for dataset 2. We observe a change in behavior at around $f=0.4$. In the right panel, we can see results for dataset 7. Here we see a change in behavior around $f=0.6$. All models show similar behavior with a slight deviation from the TCDF model on dataset 2.

**Figure 15.**Maximum error percentage divided by ${\sigma}_{noise}^{2}$ vs. ${\sigma}_{noise}^{2}$ for dataset 2. We see that models have an almost linear dependency on ${\sigma}_{noise}^{2}$, with TCDF achieving almost perfect linearity.

**Figure 16.**Percentage of maximum model error vs. number of time series N for dataset 2. We can see a significant increase in error percentage at $N=10$. Notice that seq2graph only goes up to $N=13$; for higher N, the seq2graph model ran out of memory.

**Figure 17.**Percentage of maximum model error vs. number of time series N for dataset 5. For this dataset percentage of the maximal model error is almost independent of N, and this behavior is almost identical for all models.

**Figure 18.**The y-axis corresponds to the index of the target series and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. In the left panel we plot mean $\beta $ values for dataset 9 at $T=2.75$. The model shows that first neighbours have the highest impact on the targeted spin, with diminishing impact for higher-order neighbours. In the middle panel we show the spin correlations given by the RBM approach; these values are highly similar to those given by IMV-LSTM. The right panel shows the same as the left panel, but at $T=2$. At temperatures lower than the critical one, spins become frozen and long-range correlations appear, which is what makes this panel blurrier.

**Figure 19.**IMV-LSTM mean $\alpha $ coefficients for spin #40. The x-axis corresponds to the time lag, and the y-axis corresponds to the index of spin. We only show interaction with selected spins, as there are 100 of them. As we can see, spins that interact with our selected spin have the most diverse values. Spins that do not interact with spin #40, for instance, spin #20 or spin #37, have almost uniform values across all time stamps.

**Figure 20.**The mean values of $\beta $ coefficients from the IMV-LSTM model for dataset 10 for three different values of r: 1.5, 3.55, and 3.56996. The y-axis corresponds to the index of the target series and the x-axis corresponds to the index of the series whose impact on the selected target series we plot. In the left panel we plot the mean values of $\beta $ coefficients for $r=1.5$. The interpretability is completely wrong, but in this regime the logistic map converges to a single value. In the middle panel we plot the mean values of $\beta $ coefficients for $r=3.55$. The interpretability is much closer to the expected behaviour; in this regime the logistic map oscillates between several different values, so it is beneficial for the model to learn the correct interpretability. In the right panel we plot the mean values of $\beta $ coefficients for $r=3.56996$. The interpretability is almost correct, with high confidence.

**Table 1.**Datasets used for benchmarking; the complexity of the generating model increases with each successive dataset: constant time series (dataset 1), autoregressive (dataset 2) and nonlinear autoregressive (dataset 3) time series with no interaction between them, two interdependent time series without autoregression (dataset 4), an autoregressive (dataset 5) or nonlinear autoregressive (dataset 6) first series with all other time series calculated from it, a custom vector autoregression model (dataset 7), and switching time series (dataset 8). Additionally, we created two datasets from statistical and mechanistic models: the Ising model on the first-order 2D square lattice (dataset 9) and the logistic map inspired model (dataset 10).

Name | Formula | Parameters |
---|---|---|
Dataset 1 | ${X}_{n,t}={C}_{n}+{\epsilon}_{t}$ | ${C}_{n}=N(0,1)$ |
Dataset 2 | ${X}_{n,t}={c}_{{t}_{lag}}{X}_{n,t-{t}_{lag}}+{\epsilon}_{t}$ | ${c}_{3}=1/2$, ${c}_{7}=1/2$ |
Dataset 3 | ${X}_{n,t}=\tanh({c}_{{t}_{lag}}{X}_{n,t-{t}_{lag}}+{\epsilon}_{t})$ | ${c}_{3}=5/7$, ${c}_{7}=1/7$, ${c}_{9}=1/7$ |
Dataset 4 | ${X}_{1-n,t}={c}_{{t}_{lag}}{X}_{n,t-{t}_{lag}}+{\epsilon}_{t}$ | ${c}_{2}=2/5$, ${c}_{5}=1/5$, ${c}_{9}=2/5$ |
Dataset 5 | ${X}_{n,t}={c}_{n,{t}_{lag}}{X}_{0,t-{t}_{lag}}+{\epsilon}_{t}$ | ${c}_{0,3}=1/2$, ${c}_{0,4}=1/2$; ${c}_{1,9}=1$; ${c}_{2,2}=1/2$, ${c}_{2,7}=1/2$; ${c}_{3,3}=1/10$, ${c}_{3,4}=1/10$, ${c}_{3,8}=4/5$; ${c}_{4,2}=1/3$, ${c}_{4,5}=2/9$, ${c}_{4,8}=4/9$ |
Dataset 6 | ${X}_{n,t}=\tanh({c}_{n,{t}_{lag}}{X}_{0,t-{t}_{lag}}+{\epsilon}_{t})$ | ${c}_{0,3}=1/2$, ${c}_{0,4}=1/2$; ${c}_{1,9}=1$; ${c}_{2,2}=1/2$, ${c}_{2,7}=1/2$; ${c}_{3,3}=1/10$, ${c}_{3,4}=1/10$, ${c}_{3,8}=4/5$; ${c}_{4,2}=1/3$, ${c}_{4,5}=2/9$, ${c}_{4,8}=4/9$ |
Dataset 7 | ${X}_{0,t}={c}_{0,1}{X}_{0,t-1}+{c}_{0,5}{X}_{0,t-5}+{\epsilon}_{t}$; ${X}_{1,t}=1+{c}_{1,2}{X}_{0,t-2}+{\epsilon}_{t}$; ${X}_{2,t}={c}_{2,1}{X}_{1,t-1}+{c}_{2,4}{X}_{3,t-4}+{\epsilon}_{t}$; ${X}_{3,t}=1+{c}_{3,4}{X}_{2,t-4}+{c}_{3,1}{X}_{0,t-1}+{\epsilon}_{t}$; ${X}_{4,t}={c}_{4,4}{X}_{4,t-4}+{c}_{4,1}{X}_{1,t-1}+{\epsilon}_{t}$ | ${c}_{0,1}=1/4$, ${c}_{0,5}=3/4$; ${c}_{1,2}=-1$; ${c}_{2,1}=1$, ${c}_{2,4}=1$; ${c}_{3,4}=-2/7$, ${c}_{3,1}=5/7$; ${c}_{4,4}=12/22$, ${c}_{4,1}=10/22$ |
Dataset 8 | if ${X}_{0,t-5}>1/2$: ${X}_{0,t}={c}_{0,1}{X}_{0,t-1}+{c}_{0,3}{X}_{0,t-3}+{\epsilon}_{t}$, ${X}_{1,t}={X}_{0,t-5}+{\epsilon}_{t}$, ${X}_{2,t}={X}_{0,t-4}+{\epsilon}_{t}$, ${X}_{3,t}={c}_{3,1}{X}_{3,t-1}+{c}_{3,4}{X}_{3,t-4}+{\epsilon}_{t}$; else: ${X}_{0,t}={c}_{0,1}{X}_{0,t-1}+{c}_{0,3}{X}_{0,t-3}+{\epsilon}_{t}$, ${X}_{1,t}={X}_{3,t-2}+{\epsilon}_{t}$, ${X}_{2,t}={X}_{3,t-4}+{\epsilon}_{t}$, ${X}_{3,t}={c}_{3,1}{X}_{3,t-1}+{c}_{3,4}{X}_{3,t-4}+{\epsilon}_{t}$ | ${c}_{0,1}=1/2$, ${c}_{0,3}=1/2$; ${c}_{3,1}=1/2$, ${c}_{3,4}=1/2$ |
Dataset 9 | $H\left(\sigma \right)=-{\sum}_{\langle i,j\rangle}{J}_{i,j}{\sigma}_{i}{\sigma}_{j}-\mu {\sum}_{j}{h}_{j}{\sigma}_{j}$ | $T=2$, ${T}_{c}$, $2.75$ |
Dataset 10 | ${X}_{0,t}=r{X}_{0,t-3}(1-{X}_{0,t-3})$; ${X}_{1,t}=r{X}_{1,t-5}(1-{X}_{1,t-5})$; ${X}_{2,t}=\frac{1}{2}{X}_{0,t-3}+\frac{1}{2}{X}_{1,t-5}$ | $r=1.5$, $2.5$, $3.2$, $3.55$, $3.56996$ |
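As an illustration of how the benchmark series are built, dataset 2 (linear autoregression with lags 3 and 7 and coefficients 1/2) can be generated roughly as follows; the random warm-up initialization of the first 7 values is an assumption for the sketch:

```python
import numpy as np

def dataset2(n_series=5, length=1000, var_noise=0.1, seed=0):
    """X[n, t] = 0.5 * X[n, t-3] + 0.5 * X[n, t-7] + eps_t  (dataset 2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, size=(n_series, length))  # assumed warm-up values
    for t in range(7, length):
        x[:, t] = (0.5 * x[:, t - 3] + 0.5 * x[:, t - 7]
                   + rng.normal(0, np.sqrt(var_noise), size=n_series))
    return x

x = dataset2()
print(x.shape)  # (5, 1000)
```

Each series depends only on its own past, so a correct attention-based explanation should concentrate on the series' own lags 3 and 7 and assign no weight to the other series.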

**Table 2.**Model prediction performance on all datasets. The average experiment MSE for each model is reported as a score. We do not report seq2graph results on dataset 9 because the model had memory problems. Dataset 9 consists of 100 series in our experiment, and seq2graph cannot model that many time series. ES-RNN model (the winner of the M4 competition) is added for comparison and evaluated only on datasets 1–8. ES-RNN is only used in quantitative analysis since it does not provide interpretability.

Dataset | DA-RNN | IMV-LSTM | seq2graph | TCDF | ES-RNN |
---|---|---|---|---|---|
1 | 0.00293 ± 1 × 10^{−5} | 0.02905 ± 9 × 10^{−7} | 0.00303 ± 9 × 10^{−5} | 0.0033 ± 0.0002 | 0.003731 ± 1 × 10^{−6} |
2 | 0.00150 ± 6 × 10^{−5} | 0.0018 ± 0.0002 | 0.011 ± 0.007 | 0.02 ± 0.01 | 0.001496 ± 9 × 10^{−6} |
3 | 0.00013 ± 1 × 10^{−5} | 0.000125 ± 6 × 10^{−6} | 0.00001 ± 2 × 10^{−5} | 0.0006 ± 0.0004 | 0.000137 ± 1 × 10^{−6} |
4 | 0.000244 ± 2 × 10^{−6} | 0.000238 ± 5 × 10^{−6} | 0.00032 ± 6 × 10^{−6} | 0.002 ± 0.002 | 0.00045 ± 1 × 10^{−4} |
5 | 0.00229 ± 7 × 10^{−5} | 0.00138 ± 3 × 10^{−5} | 0.0020 ± 0.0001 | 0.005 ± 0.003 | 0.017833 ± 1 × 10^{−6} |
6 | 0.00210 ± 1 × 10^{−5} | 0.00143 ± 6 × 10^{−5} | 0.00213 ± 0.0001 | 0.005 ± 0.002 | 0.018541 ± 1 × 10^{−6} |
7 | 0.009 ± 0.001 | 0.0051 ± 0.0006 | 0.008 ± 0.001 | 0.021 ± 0.007 | 0.0120 ± 0.0002 |
8 | 0.286 ± 0.006 | 0.258 ± 0.002 | 0.18 ± 0.05 | 0.3 ± 0.1 | 0.250 ± 0.001 |
9 | 0.3353 ± 0.0005 | 0.2688 ± 0.0001 | - | 0.348 ± 0.007 | - |
10 | 0.002 ± 0.001 | (9 ± 1) × 10^{−5} | 0.006 ± 0.008 | 0.03 ± 0.01 | - |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Barić, D.; Fumić, P.; Horvatić, D.; Lipic, T.
Benchmarking Attention-Based Interpretability of Deep Learning in Multivariate Time Series Predictions. *Entropy* **2021**, *23*, 143.
https://doi.org/10.3390/e23020143
