TimeREISE: Time Series Randomized Evolving Input Sample Explanation

Deep neural networks are among the most successful classifiers across different domains. However, their use is limited in safety-critical areas due to their limited interpretability. The research field of explainable artificial intelligence addresses this problem. However, most interpretability methods are designed for the imaging modality. This paper introduces TimeREISE, a model-agnostic attribution method that shows success in the context of time series classification. The method applies perturbations to the input and accounts for different attribution map characteristics such as granularity and density. The approach demonstrates superior performance compared to existing methods concerning several well-established measurements. TimeREISE shows impressive results in the deletion and insertion test, Infidelity, and Sensitivity. Concerning the continuity of an explanation, it shows superior performance while preserving the correctness of the attribution map. Additional sanity checks prove the correctness of the approach and its dependency on the model parameters. TimeREISE scales well with an increasing number of channels and timesteps, applies to any time series classification network, and does not rely on prior data knowledge. It is suited for any use case independent of dataset characteristics such as sequence length, channel number, and number of classes.


Introduction
The success of deep neural networks comes from the superior performance and scalability they offer compared to traditional machine learning methods [1]. However, during the last few decades, the need for explainable decisions has become more significant. In critical infrastructures, it is inconceivable to use approaches without any justification for the results [2]. In the medical sector, the financial domain, and other safety-critical areas, explainable computations are required by law [3]. Furthermore, there are ethical constraints that limit the use of artificial intelligence even more [4,5]. Accordingly, a large research domain has evolved: explainable artificial intelligence (XAI). One of its main goals is to propose techniques that provide interpretable results and thereby enable the broader use of deep neural networks.
For several years, researchers have developed network modifications and model-agnostic methods to provide such results [6]. The majority of these methods originate from the imaging modality, as its concepts are easier for humans to interpret [7]. Model-agnostic methods in particular have shown great success. One popular category of model-agnostic approaches is attribution methods [8], and the number of available methods in this category increases every year. One advantage is their loose coupling with the network: they do not limit the processing capabilities of the network, although some attribution methods impose minor restrictions on the network architecture.
The downside of these methods is that the provided results require additional human inspection and interpretation. Furthermore, they do not make any statement related to the concepts covered by the network. Revealing the concepts learned by the network is not the goal of these approaches. Considering the time series modality, this is not a huge drawback, as concepts are not well defined in this domain, and an explanation based on pre-defined concepts would not be suitable.
Despite their great success, not all of these methods can be applied to time series. Besides the above-mentioned limitations, additional properties arise in the time series context. These properties are less important for the imaging modality, but they are pivotal for the success of an attribution method in the time series context. Noisy explanations are acceptable in the image domain but can result in low information gain in time series interpretability. Another aspect is the Continuity of the attribution [9]. It is pivotal for time series attributions that a certain degree of Continuity is preserved. In the paper written by Crabbe and Van der Schaar [10], Continuity is identified as relevant for the interpretability of an attribution map, as it can greatly lower the cognitive effort required to understand the map. In addition, an explanation that includes large spikes of importance within small windows introduces ambiguity and increases the cognitive load. Due to the potentially unbounded length and number of channels, it is infeasible to focus on every data point; the explanation needs to highlight the significant time frames and channels. This is not the case in the image domain, where the number of channels and their roles are pre-defined and the channels are always interpreted together, which is not possible in the time series domain.
Taking into account the above-mentioned limitations and time series specific properties, there is no perfect attribution method available for time series. This paper proposes TimeREISE, an instance-based attribution method applicable to any classifier. It addresses common bottlenecks such as runtime, smoothness, and robustness against input perturbations, as mentioned in [11]. Many methods suffer from large computation times, making them unfeasible for real-time applications. Another aspect mentioned by Mercier et al. covers noise included in the explanation, making it difficult to interpret the maps. The rest of the paper shows that the explanations provided by TimeREISE are continuous, precise, and robust. Without prior knowledge about the dataset, it is possible to produce attribution maps with different granularity and smoothness. The idea originates from RISE [12] and various perturbation-based attribution methods. Two main advantages are the following: TimeREISE is applicable to black-box classifiers, and its runtime does not scale directly with the input shape of the data.

Related Work
Interpretability methods are widespread across the different modalities such as image, natural language, and time series. A good overview of the diversity of these methods is given by Das and Rad [8]. Independent of the modality, the goal is to identify a significant subset of features to overcome ethical and industrial restrictions, as mentioned by Peres et al. [2] and Karliuk [4]. Furthermore, the identification of these features must be precise and easy to understand. Therefore, noise-free explanations that pass a sanity check are required. One prominent class of interpretability methods is attribution techniques. The following paragraphs describe the different attribution methods, their categories, and their characteristics. These methods are used for comparison with TimeREISE, and they cover a broad set of post-hoc attribution map approaches applicable to time series classification.
The first sub-category of attribution methods covers the gradient-based approaches. A good survey of these was provided by Ancona et al. [13]. These methods use backpropagation to compute the importance of the features. Concerning their advantages and disadvantages, they are known for their superior runtime but suffer from noisy gradients and require access to the model internals. Guided-Backpropagation and IntegratedGradients are two well-known gradient-based methods. Guided-Backpropagation computes the gradient concerning the target prediction based on the non-negative gradients. More information about this approach was provided by Springenberg et al. [15]. IntegratedGradients uses so-called baselines and approximates the integral of the gradients compared to the baseline. Further information is given by Sundararajan et al. [14].
In contrast to these methods, the perturbation-based techniques do not require full access to the model, as they perturb the input. A disadvantage of these methods is the increased runtime, as they utilize multiple forward passes. One famous example is FeatureAblation, presented in Fisher et al. [16]. In this approach, the features are replaced with a baseline value such as the mean, and the change in prediction is used to evaluate their impact. Very similar is Occlusion, presented in Zeiler et al. [17], in which the features within a sliding window are removed completely.
The last category covers methods that do not fit directly into the previously mentioned ones. One method that falls into this category is LIME, introduced by Ribeiro et al. [18]. Although LIME performs perturbations on the input, it differs in that a local surrogate model is trained to estimate the importance.
To evaluate the effectiveness of attribution maps, a set of well-known metrics has evolved. An important fact is that the ground truth of the feature importance is not given in most cases, and the measurements have to deal with that. One approach is to perform a deletion and insertion test, as both are well-known techniques to evaluate the efficiency of attribution methods. Petsiuk et al. [12] used them to provide evidence for their attribution method. Another well-known approach is to use the Infidelity and Sensitivity proposed by Yeh et al. [19]. To compute the Infidelity, the attribution gets perturbed by a significant amount, and the prediction change is evaluated. The Sensitivity perturbs the input by an insignificant amount, and the attribution is compared to the original one. A third metric related to robustness is Continuity. A continuous attribution map may suffer in the insertion evaluation; however, smooth attribution maps are more robust against attacks. Detailed information about the adversarial robustness was given by Alvarez et al. [20]. In addition, smooth attribution maps require less cognitive effort for interpretation, as stated by Abdul et al. [9]; however, the correctness of the method needs to be preserved [21]. It has been shown that existing methods such as Guided-Backpropagation and IntegratedGradients can act as edge detectors when the network is randomized, resulting in misleading explanations. Finally, one of the most important aspects is the scaling concerning the runtime, as this defines the usability.

TimeREISE
This paper presents the novel approach TimeREISE, a post-hoc interpretability method applicable to any classification network. The work was inspired by Petsiuk et al. [12], who presented a random perturbation-based approach for the image domain that serves as a baseline for TimeREISE. Similar to RISE [12], masks are generated and applied to the input, and the output confidence is measured using the classification scores. However, several adaptations of the native RISE [12] are necessary to enhance the approach and successfully apply it to time series data. Besides the simple normalization based on the occurrences of each data point, TimeREISE additionally creates masks that evaluate the different channels. Therefore, the masks cover both the time and the channel direction. This makes it possible to evaluate different combinations, e.g., the first timesteps are perturbed for the first channel while the second half of the timesteps is perturbed for another channel. The second main addition is the summation over different probabilities. RISE [12] uses only a fixed probability of occluded points to create the masks, resulting in a fixed density. In contrast, TimeREISE uses masks of different densities and combines them in an additive manner, which removes any assumption about the number of relevant data points. Finally, the granularity is introduced as a parameter to define the granularity of the explanation; it directly affects the size of the patterns analyzed. Figure 1 shows the overall workflow of TimeREISE. M denotes a set of masks, e.g., M1.x can have a smaller window size related to a finer granularity, whereas M3.x can have a larger window size for coarse patterns. This makes it possible to cover smaller and larger patterns within the data. The same holds for the density, e.g., M2.x could have the same granularity as M1.x but with a denser or sparser feature perturbation. Figure 1a shows the mask generation.
The masks can be created once per dataset, as their generation only depends on the number of timesteps, channels, and the provided set of granularities and densities. The process shown in Figure 1b needs to be executed for each sample. Each mask is applied to the input using an exchangeable perturbation function; the default perturbation is an element-wise multiplication. The masked input is passed to the classifier, and the classification score is retrieved. The classification scores are multiplied (*) by the masks, and finally the attribution is normalized based on the number of occurrences of each point.
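The mask-generation stage described above can be sketched as follows. This is a minimal NumPy sketch, not the reference implementation: the function name, parameter names, and default values are illustrative, and the random-shift cropping of the original RISE is reduced to plain linear interpolation for brevity.

```python
import numpy as np

def generate_masks(n_masks, channels, timesteps,
                   densities=(0.1, 0.3, 0.5), granularities=(4, 8, 16),
                   seed=0):
    """For every (density p, granularity g) pair, draw coarse random grids
    of shape (channels, g), threshold them at p, and upsample them along
    the time axis to (channels, timesteps), which yields smooth masks."""
    rng = np.random.default_rng(seed)
    masks = []
    for p in densities:
        for g in granularities:
            for _ in range(n_masks):
                # Binary coarse mask: a fraction p of the entries is active.
                coarse = (rng.random((channels, g)) < p).astype(float)
                # Linear upsampling along the time axis only.
                xs = np.linspace(0, g - 1, timesteps)
                fine = np.stack([np.interp(xs, np.arange(g), row)
                                 for row in coarse])
                masks.append(fine)
    return np.stack(masks)  # shape: (|P| * |G| * n_masks, channels, timesteps)
```

Because the masks occlude a two-dimensional (channel, time) grid rather than a single time axis, different channels can be perturbed at different timesteps within the same mask, as described above.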

Mathematical Formulation
TimeREISE extends the native mathematical formulation presented by Petsiuk et al. [12] by utilizing the different channels. TimeREISE generates masks with the shape s' = (c, t') instead of s = (1, t'), where t' refers to the downsampled time axis and c to the channels. This enables TimeREISE to apply masks that occlude different timesteps across the channels within a mask of shape s' instead of using the same timesteps across all channels, as is the case for s. Furthermore, the Monte Carlo sampling is performed across a set of densities P and granularities G. Using several density values p regularizes the density of the attribution; similarly, using several granularity values g regularizes the size of the occluded patches. This changes the set of masks as shown in Equation (1).
Finally, denote S as the weighted sum of the scores produced by the network and the random masks M, similar to Petsiuk et al. [12], but with each feature normalized as shown in Equation (2).
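Equation (2) did not survive extraction. As a hedged reconstruction consistent with the description above and the original RISE formulation, the per-feature normalized weighted sum can be written as:

```latex
S[c,t] \;=\; \frac{\sum_{i=1}^{N'} f\!\left(x \odot M_i\right)\, M_i[c,t]}
                  {\sum_{i=1}^{N'} M_i[c,t]},
\qquad N' = N \cdot |P| \cdot |G|,
```

where $f(\cdot)$ denotes the classifier score for the target class, $\odot$ the element-wise perturbation, and the denominator counts the occurrences of feature $(c,t)$ across the mask set.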

Theoretical Correctness
Concerning sanity checks mentioned by Adebayo et al. [21], the correctness of the approach is crucial, and it should mainly depend on the learned behavior rather than highlighting dataset-specific features. Adebayo et al. showed that in the image domain the methods may produce edge detectors or show similar attribution maps when some weights of the networks are randomized. However, this mainly holds for gradient-based methods. TimeREISE does not suffer from this as it only depends on the logits produced by the network prediction. The complete process only depends on the forward pass of the network and has no access to any internal parameters. Therefore, the attribution only relies on the prediction of the network. This prediction depends on the correct weights of the neurons.

Theoretical Runtime
For the runtime evaluation, initialization and attribution are considered as two separate processes. Equation (3) shows the runtime required to create the set of masks for a given set of density probabilities P, granularities G, and the number of masks N defined for each combination of p_i and g_i. β is defined as the constant time to create a given mask. In addition, P and G are independent of the data shape, do not grow, and can therefore be considered constant factors, leading to a runtime of Θ(N).
Equation (4) shows the linear runtime of the attribution step. γ is defined as the constant time to apply the perturbation and δ as the constant time the classifier requires to forward pass the sample. Similar to the initialization step, P and G are assumed as constants resulting in a runtime of Θ(N).
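Equations (3) and (4) are missing from this extraction. A plausible reconstruction matching the constants defined above (β for mask creation, γ for the perturbation, δ for a forward pass) is:

```latex
T_{\text{init}} = \sum_{p \in P} \sum_{g \in G} \sum_{i=1}^{N} \beta
                = \beta \cdot N \cdot |P| \cdot |G| \in \Theta(N)
\tag{3}
```

```latex
T_{\text{attr}} = \sum_{p \in P} \sum_{g \in G} \sum_{i=1}^{N} (\gamma + \delta)
                = (\gamma + \delta) \cdot N \cdot |P| \cdot |G| \in \Theta(N)
\tag{4}
```

Since |P| and |G| are fixed, both expressions are linear in the number of masks N and independent of the number of channels and timesteps.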
In the following, the runtimes of existing methods are discussed. As the runtime can heavily depend on the implementation and the hardware used, a theoretical analysis offers a more reliable comparison. Gradient-based methods such as GuidedBackprop are superior concerning their runtime. These methods only depend on the backward pass, which is executed once for GuidedBackprop [15] and n_step times for IntegratedGradients [14]. This means that neither depends on the number of channels or timesteps, assuming the time for a backward pass is constant. Perturbation-based methods usually require multiple forward passes, and their runtime depends on the parameters that affect the number of forward passes. For Occlusion [17] and FeatureAblation [17], this is the window size: using the same window size for a longer sequence results in more forward passes, and the number of passes can be adjusted via the window size. Finally, the runtime of LIME [18] depends on the number of samples used to train the surrogate model that estimates the importance of the different features. This means that, similar to the perturbation-based approaches, it depends on the channels and timesteps: to obtain a precise model, a larger feature space requires more samples to be processed by LIME.
As described above, the runtime of TimeREISE mainly depends on the number of masks used to compute the attribution map. This number can vary based on the time series, granularity, and density. Using hyperparameter tuning, it is possible to find the minimal number of masks required to produce a map that shows only insignificant changes.

Theoretical Implementation
The implementation of TimeREISE can be divided into two parts, similar to the RISE implementation by Petsiuk et al. [12]. In the first stage, shown in Algorithm 1, a set of masks suited for the input shape is generated. This has to be executed only once per dataset. Therefore, every provided combination of probabilities P and granularities G is considered. The probability refers to a threshold used to determine the density of the mask, and the granularity refers to the amount of data considered in a single slice. The downsampling and upsampling are performed along the time axis. Uniform refers to a uniform distribution with the given downsampled shape s'. An additional cropping step is performed to preserve the original shape s.
Algorithm 2 performs the actual attribution. A predefined perturbation method σ is applied to the input x using every mask m_i, and the result is passed to the classifier θ. As default perturbation, the method uses the simple element-wise multiplication of the input x and the mask m_i, as proposed by Petsiuk et al. This results in a list of scores stored in S. Next, the matrix product of S^T and the masks M is computed, and each point is normalized by its number of occurrences N in the set M. Finally, the map is normalized to values between zero and one.
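The attribution step can be paraphrased in a few lines of NumPy. This is a sketch under the assumption that `classifier` maps an array of shape (channels, timesteps) to a vector of class scores; the function signature is illustrative, not the paper's reference implementation.

```python
import numpy as np

def attribute(x, masks, classifier, target):
    """Sketch of the attribution step: perturb, score, and take the
    occurrence-normalized weighted sum of the masks.
    `masks` has shape (n_masks, channels, timesteps)."""
    # Default perturbation sigma: element-wise multiplication of input and mask.
    scores = np.array([classifier(x * m)[target] for m in masks])
    # Weighted sum S^T x M, each feature normalized by its occurrence count.
    attribution = np.tensordot(scores, masks, axes=1) / (masks.sum(axis=0) + 1e-12)
    # Min-max normalization to [0, 1].
    attribution -= attribution.min()
    return attribution / (attribution.max() + 1e-12)
```

Note that only the forward pass of `classifier` is used, which is what makes the method applicable to black-box models.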

Datasets
The work uses multiple datasets from the well-known UEA & UCR repository [22] to perform the experiments. The selection of datasets is based on a sufficient number of samples and the dataset modalities such as the number of timesteps, channels, and classes. Very small datasets with fewer than 300 samples were excluded, as this would result in difficulties when training the network and lead to more variance concerning the metrics. Furthermore, the list of datasets is extended using the Anomaly dataset proposed by Siddiqui et al. [23]. This synthetic dataset serves as an interpretable baseline, as the point anomalies in this dataset are mathematically defined and therefore the ground truth attribution is available. Conversely, this is not the case for the other datasets, and only limited interpretability is given. Table 1 lists the datasets and their characteristics. These are assigned to the critical infrastructure domains to which they belong.

Experiments
The following paragraph describes the general setup to reproduce the results and covers decisions that affect the experiments. Following the generic experiment setup, the paper provides experiments on the insertion and deletion of data based on the importance scores of the attribution methods, an Infidelity and Sensitivity analysis, and visual examples of the method and other state-of-the-art attribution methods. In addition, the experiments cover a sanity check to validate the correctness of the approach and a runtime analysis to evaluate the dependency on dataset properties such as channels and timesteps.

Baseline Accuracy
As classifier, InceptionTime, the current state of the art proposed by Fawaz et al. [24], was used. The network architecture consists of multiple inception blocks followed by a global average pooling and a fully connected layer. Each inception block consists of multiple convolutional layers and a max-pooling. Furthermore, the network uses residual connections between the input and the different inception blocks. It is based on the Inception-v4 [25] architecture but specifically adapted to the time series domain and has been shown to achieve good performances across time series classification datasets while being very robust. Figure 2 shows the architecture of InceptionTime. For detailed information about InceptionTime, the reader is referred to the paper of Fawaz et al. [24].
The network was trained using a learning rate scheduler to halve the learning rate on plateaus and early stopping to prevent overfitting. As an optimizer, SGD was applied with an initial learning rate of 0.01 and a maximum of 100 epochs. None of the datasets required the full 100 epochs, as the experiments have shown that every model converged in fewer. As some datasets are very large and the computation of measures such as the Sensitivity is computationally expensive, this work randomly sampled a set of 100 test samples to perform the attribution on a representative subset. In addition, the base accuracy scores for the whole datasets and the subset are provided in Table 2, which highlights that the findings based on the subset can be transferred to the complete datasets. Concerning the attribution methods, GuidedBackprop [15], IntegratedGradients [14], FeatureAblation [17], Occlusion [17], and LIME [18] were used as state-of-the-art methods. This set of methods covers all categories of attribution methods mentioned in Section 2.
Figure 2. The architecture of InceptionTime: several inception blocks consisting of convolution layers, residual connections to skip some inception blocks, and, after the last inception block, a global average pooling followed by a fully connected layer that produces the output classification. Figure taken from [24].

Sanity Check
In addition to the theoretical explanation of the correctness, a sanity check was conducted. Therefore, a sample of the CharacterTrajectories dataset was used, and the attribution map for different states of the model was computed. In Figure 3 the different attribution maps are shown. The first column always shows the original attribution map. Going from left to right increases the number of randomized layers for the top-down and bottom-up approaches. The first row refers to the bottom-up approach, in which the layers are sequentially randomized starting from the first convolutional block in the first inception block up to the last dense layer. Respectively, the second row shows the top-down approach, randomizing the dense layer first and the first convolutional block of the first inception block last. The third row covers the independent randomization of a single layer, namely the last dense layer in the case of 'Random-9' and a single block for the other cases. Across all setups, it is visible that randomizing the layer weights results in significant changes in the attribution map. In Figure 4 the Spearman and Pearson correlations between the original and randomized attribution maps are given. The Spearman correlation was used as it is a rank-based approach and provides information about the preservation of the ranks of the points within the sample. The color of the individual points shows the correctness of the prediction using the manipulated network. It is visible that the correlation of the attribution maps is higher when the prediction is still correct. This shows that TimeREISE indeed depends on the prediction of the network. Furthermore, the figure shows that for the top-down randomization the correlation drops by a large value. Similarly, the bottom-up randomization shows an increasing drop in correlation. However, randomizing a single layer or block resulted in higher correlation values, except for the randomization of the last dense layer.
This can be explained by the structure of InceptionTime, as it is very robust concerning the randomization of a single block. This is further validated by the correct predictions within this setup.

Runtime
Figure 5 shows the runtimes for the experiments executed on 100 samples. The results show that methods that directly depend on a window size, such as FeatureAblation [17] and Occlusion [17], require much longer processing for the datasets that have a large number of features. Especially the runtimes for the four datasets with the highest number of channels and timesteps highlight that the increasing feature number makes them unsuitable for some cases. To reduce the processing time, it is possible to select a larger window; however, this requires knowledge about the dataset and the size of the patterns within it. In contrast, GuidedBackprop [15] and IntegratedGradients [14] are unaffected, as they do not depend on such dataset properties. LIME [18] and TimeREISE mainly depend on the number of samples and masks defined for each approach. The experiments show that in both cases the runtime is constant for the selected parameters; however, as mentioned, it would be possible to finetune them. In addition to the real datasets, Figure 6 shows the increase based on the number of timesteps and channels. These results refine the previous analysis. FeatureAblation and Occlusion scale with the timesteps and the channels, whereas the processing time of LIME and TimeREISE only slightly increases because the forward passes require more time. The latter holds for the gradient-based methods and their backward passes, too; the difference is that the number of backward passes required by those methods is limited, whereas the number of forward passes for LIME and TimeREISE is higher.
It has to be mentioned that the numbers might vary depending on the implementation and hardware; however, they provide insights into the expected behavior when the dataset changes.

Insertion & Deletion
The causal metric was used by Fong and Vedaldi [26] to explain the significant values of an attribution method. The intuition behind the deletion is that the prediction of a classifier changes if the cause is removed; the same applies to the insertion. In the case of the deletion, the points are removed from the input starting with the most important one, and the prediction is computed. Large drops suggest that the feature was significant for the prediction. Further, the AUC based on the sequential deletion of features was computed to rank the methods across every dataset. For the deletion, lower AUCs suggest that the method is superior in spotting important parts of the input. Similarly, the same was done for the insertion, starting with a sample that contains only mean values; here, higher AUCs are superior, and large increases correspond to adding important data points relevant to the prediction. Figure 7 shows the critical difference diagrams of every attribution method, calculated using the AUC based on the achieved accuracy. In Figure 7a, TimeREISE shows an outstanding performance compared to the other state-of-the-art methods with respect to the deletion of significant data that affects the classifier performance. Another important finding is that methods that utilize a window, such as FeatureAblation and Occlusion, show better performances concerning the deletion compared to methods that directly depend on the gradients, such as GuidedBackprop and IntegratedGradients. Figure 7b highlights that the same holds for the insertion task. One reason for this outcome is the smoothing applied by approaches that use a defined window, whereas gradient-based methods provide noisy and spiking attribution maps. Table 3 shows the individual deletion and insertion results for every dataset, along with the average scores achieved by the methods.
TimeREISE shows superior behavior in both the average deletion and insertion score. For the deletion, the method achieves the best (lowest) score for 13 datasets and an average of 0.2516; the second-best approach concerning the average AUC score is GuidedBackprop with a score of 0.3220 and two best-performing datasets. While TimeREISE has the best average score for the insertion, it achieves the best score only two times; GuidedBackprop achieves the best insertion score five times, IntegratedGradients four times, and LIME three times. However, the average insertion score of TimeREISE is 0.7510 compared to the second-best of 0.6653 for Occlusion.
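The deletion test described above can be sketched as follows. This is an illustrative simplification, not the paper's evaluation code: features are replaced in importance order with a baseline value, and the AUC is approximated by the mean of the target-class scores over the deletion steps.

```python
import numpy as np

def deletion_auc(x, attribution, classifier, target, baseline=None, steps=20):
    """Replace features in order of decreasing importance with a baseline
    (the mean by default) and average the target score over the steps.
    Lower values indicate a better attribution; for the insertion test the
    roles are reversed and higher values are better."""
    if baseline is None:
        baseline = np.full_like(x, x.mean())
    order = np.argsort(attribution.ravel())[::-1]  # most important first
    x_del = x.copy().ravel()
    scores = [classifier(x)[target]]
    for chunk in np.array_split(order, steps):
        x_del[chunk] = baseline.ravel()[chunk]
        scores.append(classifier(x_del.reshape(x.shape))[target])
    return float(np.mean(scores))  # simple approximation of the AUC
```

An attribution that concentrates its mass on the truly causal features produces a sharp early drop in the score curve and therefore a lower AUC.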

Infidelity & Sensitivity
The Infidelity and Sensitivity proposed by Yeh et al. [19] cover significant and insignificant changes applied to the attribution and the input, respectively. The intuition behind Infidelity is that a significant perturbation of the attribution map leads to a change in the prediction. Similarly, the Sensitivity is calculated using an insignificant change in the input sample; it requires recomputing the attribution maps. For both Infidelity and Sensitivity, lower values are better. For Infidelity, 1000 perturbations were computed for each of the 100 samples, and the averaged Infidelity value was calculated. For Sensitivity, 10 perturbations were computed for each of the samples.
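A sketch of the Sensitivity side of this evaluation, in the spirit of the max-Sensitivity of Yeh et al.: the input is perturbed by an insignificant amount, the attribution is recomputed, and the largest relative change of the map is reported. The function name and the choice of the Frobenius norm are illustrative assumptions.

```python
import numpy as np

def max_sensitivity(x, explain, n_perturb=10, radius=0.02, seed=0):
    """`explain` is any attribution function mapping an input array to an
    attribution map of the same shape. Returns the worst-case relative
    change of the map under small uniform input noise."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_perturb):
        noise = rng.uniform(-radius, radius, size=x.shape)
        diff = np.linalg.norm(explain(x + noise) - base)
        worst = max(worst, diff / (np.linalg.norm(base) + 1e-12))
    return worst
```

Smoothed attribution maps, such as those of the perturbation-based methods, change little under such noise and therefore score low on this measure.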
Starting with Infidelity, the results shown in Table 4 emphasize that there is no significant difference between the methods. Overall, the average scores differ only by 0.011, which is insignificant. Across all datasets, the methods perform similarly, and it is impossible to create a critical difference diagram as the null hypothesis holds. Interestingly, the Infidelity scores for the ElectricDevices and PhalangesOutlinesCorrect datasets are much larger compared to those of any other dataset.
The Sensitivity experiments are shown in Table 5. The results of these experiments show a significant difference between the methods. The best result was achieved by TimeREISE, with a score of 0.0533. The worst result was achieved by LIME, with a score of 0.2182, which is about four times larger than the score of TimeREISE. The overall finding is that the perturbation-based approaches are superior in the case of Sensitivity compared to the gradient-based or other approaches. This is the case because the gradient-based methods produce noisy attribution maps, whereas the perturbation-based methods produce smoothed maps based on a window of multiple features. This smoothing increases the robustness against minor changes in the input. In Figure 8 the critical difference diagram across all datasets is provided. It shows the superior performance of the perturbation-based approaches compared to the other approaches. In addition, it highlights that TimeREISE is only slightly ahead of the Occlusion method.

Attribution Continuity
Furthermore, this work calculated the Continuity proposed by Abdul et al. [9]. Continuity is a measurement that bridges correctness and visual interpretability. The Continuity for each feature was calculated as presented in Equation (5), and the mean was taken for the overall comparison between the methods. Lower values are better with respect to the cognitive load but might conflict with the exact correctness of the feature importance.
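A common way to express such a continuity measure, sketched here as a plausible reading of Equation (5) rather than its exact form, is the mean absolute difference between the attribution values of neighboring timesteps:

```python
import numpy as np

def continuity(attribution):
    """Mean absolute difference between neighboring timesteps (per channel).
    Smooth attribution maps give small values; spiky maps give large ones."""
    return float(np.mean(np.abs(np.diff(attribution, axis=-1))))
```

Under this reading, a linearly increasing map scores low while an alternating, spiky map scores high, matching the intuition that spikes within small windows increase the cognitive load.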
Table 6 shows the average Continuity of the attribution methods. Similar to the Sensitivity, smaller values are better. Interestingly, the performance of the attribution methods is very similar to the Sensitivity. Again, TimeREISE shows superior performance with a score of 0.0267 compared to Occlusion as the second-best approach with a score of 0.0565. The reason for the superior performance is the smooth mask design: the masks of TimeREISE are created on a downscaled sample and then upscaled to the original input size using interpolation. This results in smoother masks compared to Occlusion and FeatureAblation, which utilize fixed windows. Figure 9 shows the corresponding critical difference diagram. It is intuitive that the Sensitivity, which measures the change of the attribution when the method is applied to a slightly perturbed input, and the Continuity, which reflects the smoothness of the attribution map, are connected to each other. Nevertheless, it is interesting to observe how strong the correlation between those two aspects is.

Visualization
This section presents some interpretable attribution maps. The results highlight that TimeREISE produces smoother attribution maps while preserving a shape similar to that of the other attribution methods. TimeREISE strikes a good compromise between the visual appearance, which is strongly affected by the Continuity and the noise, and the correctness of the feature importance values.
Figure 10 shows an attribution map for every evaluated attribution method. Figure 10a shows an anomalous sample of the Anomaly dataset, where the anomaly is represented by the peak in the green signal. All methods successfully identify the peak as the most important part. However, Occlusion and TimeREISE additionally highlight the points neighboring the peak as important. Although intuition first suggests that only the peak should be highlighted, this is not correct, as changing the neighboring points would influence the peak. Furthermore, the attribution map provided by TimeREISE is visibly much smoother than the other attributions while preserving the relevant information.
Figure 10b shows an attribution map for the ECG5000 dataset. The results of all methods look similar to a certain degree. However, all methods except TimeREISE assign some importance to the last part of the sequence, and their attribution maps include some noise. Specifically, IntegratedGradients and LIME identify the first negative peak in the signal as an important part. This is not the case for the remaining methods, and changing this part or the last part has only a minor effect on the prediction. Figure 11 shows the attributions applied to an interpretable character trajectory sample. The figure presents the time series sample and its back transformation to 2D space. The attribution maps in the second row illustrate the smoothness of TimeREISE. One finding is that most methods rate the horizontal and vertical movement as more important and that the majority of important points occur within the first 100 timesteps. Interestingly, GuidedBackprop assigns a surprisingly high relevance to the force. FeatureAblation and Occlusion show low importance for both the vertical movement and the pressure.

Discussion
This section discusses the results to give a better impression of the relevance and possible applications of TimeREISE, and mentions the advantages and drawbacks of the existing methods. To summarize the experiments, the datasets are grouped below based on their properties described in Table 1. Datasets with fewer than 1000 training samples are referred to as small datasets. In addition, a distinction is made between univariate and multivariate datasets as well as binary and multi-class datasets. A final characteristic is the sequence length, which is considered long if it exceeds 500.
During the sanity checks, TimeREISE showed a strong dependency on the model parameters. This dependency indicates that the method does not merely reflect dataset characteristics but visualizes the points the model considers relevant. The works of Mercier et al. [11] and Adebayo et al. [21] already demonstrated the importance of this sanity check for time series and image analysis, respectively. The sanity check showed strong robustness to single-layer manipulation and a large drop in attribution map correlation when layers were manipulated sequentially. Adebayo et al. have shown that some methods provide similar attributions independent of the model parameters.
The runtime analysis also shows that TimeREISE scales better than other perturbation-based approaches such as Occlusion. Specifically, long-sequence and multivariate datasets result in a dramatic runtime increase for perturbation-based methods and LIME. It has to be mentioned that, concerning the runtime, the gradient-based methods are always superior, as they require only a backward pass. However, TimeREISE shows superior behavior compared to the remaining methods, specifically when long sequences or high channel numbers occur. For short sequences, the runtime of TimeREISE was nearly constant. In addition, the runtime of TimeREISE can be precisely adapted to the application via the number of masks to be computed, and configured further by incorporating knowledge about the dataset. One example is the detection of point anomalies, where a low mask density is sufficient for detection and can thus improve the runtime.
The results of the deletion and insertion test provide evidence that the attribution maps are relevant for the prediction. Especially in the deletion test, TimeREISE shows excellent results, with direct performance drops when important features are removed. Aside from two long-sequence datasets, TimeREISE showed superior performance across all dataset categories. The good performance in the insertion test can be explained by the high Continuity of TimeREISE, which assigns contiguous time segments a more uniform importance. In general, window-based approaches perform well in the insertion test, since no gradients are used to create a synthetic minimalist sample. The detection of a peak is an example of an application where the insertion test gives a false picture: methods that use gradients show lower importance at the timesteps adjacent to the peak, while this is not the case for TimeREISE. Intuitively, the immediately adjacent points are relevant to the existence of a peak, so they should be given some importance. Summarizing the deletion and insertion test, the results show a superior performance independent of the dataset category.
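The deletion test described above can be sketched as follows: the most important timesteps according to the attribution map are progressively replaced by a baseline value, and the model score is recorded after each step. This is a simplified illustration with hypothetical names (`predict`, `deletion_curve`); the paper reports the area under the resulting curve, and the insertion test is the mirror image, starting from the baseline and inserting important timesteps.

```python
import numpy as np

def deletion_curve(predict, x, attr, steps=10, fill=0.0):
    """Deletion test sketch: progressively replace the most important
    timesteps (according to the attribution map) with a baseline value and
    record the model score; a fast drop indicates a faithful attribution."""
    order = np.argsort(attr)[::-1]          # most important timesteps first
    scores = [predict(x)]
    xp = x.copy()
    for chunk in np.array_split(order, steps):
        xp[chunk] = fill                    # occlude the next chunk
        scores.append(predict(xp))
    return np.array(scores)                 # the area under this curve is the metric
```

A faithful attribution map yields a curve that drops quickly, which is exactly the behavior reported for TimeREISE in the deletion test.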
In addition, TimeREISE shows excellent results for Infidelity and Sensitivity. TimeREISE achieved the best Sensitivity values across all datasets. Concerning Infidelity, TimeREISE achieved the second-best overall performance, with the best scores on long sequences, and there was no dataset category for which the method performed worse than the other methods. It has to be mentioned that LIME was slightly better in overall Infidelity; however, TimeREISE achieved more individual best scores across the datasets. Due to the mask-based design, TimeREISE is highly robust to insignificant changes in the input. This is also because the attribution masks created by TimeREISE have a high Continuity, which contributes to the interpretability. Despite this continuity, the attribution masks were shown to withstand the sanity check and thus offer a good compromise between explanatory power and correctness. Finally, the visualizations of the different attribution methods show that TimeREISE provides less noisy explanations. Across all datasets, TimeREISE achieved the best Continuity values.
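Infidelity measures the expected squared gap between the effect a perturbation should have according to the attribution (a dot product) and the actual change in the model output. The following is a simplified Monte Carlo sketch, not the paper's exact evaluation code; `predict` and `attribute` are placeholders, and for a linear model whose weights are returned as the attribution, the estimate is exactly zero.

```python
import numpy as np

def infidelity(predict, attribute, x, n_samples=50, scale=0.1, seed=0):
    """Monte Carlo sketch of Infidelity: expected squared gap between the
    attribution's predicted effect of a perturbation (dot product) and the
    actual change in the model output under that perturbation."""
    rng = np.random.default_rng(seed)
    attr = attribute(x)
    fx = predict(x)
    errs = []
    for _ in range(n_samples):
        pert = rng.normal(0.0, scale, size=x.shape)
        errs.append((pert @ attr - (fx - predict(x - pert))) ** 2)
    return float(np.mean(errs))
```

Lower values are better: an attribution map that accurately predicts how perturbations change the output is considered more faithful.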
In summary, TimeREISE has proven to be an outstanding method for time series analysis in terms of correctness, runtime, and interpretability. In addition, the customization of its hyperparameters allows easy adaptation to different use cases: attribution masks can be computed for real-time analysis at the cost of some accuracy, the Continuity can be adjusted, and the perturbation method is interchangeable. In contrast to existing perturbation-based methods, TimeREISE can be applied without knowing the size of the relevant pattern within the data, making it more effective in cases where it is not known in advance which parts of the input are relevant.

Conclusions
This work shows that the novel attribution method TimeREISE achieves excellent performance concerning most of the evaluated metrics across all selected datasets. The method outperforms other state-of-the-art attribution methods concerning Continuity, Sensitivity, and the causal metrics. Specifically, the deletion scores when important data are occluded show that the approach provides superb performance. Furthermore, the paper has demonstrated that the method produces smooth attribution maps that require less effort to interpret. Concerning Infidelity, our approach is on par with the state-of-the-art methods. Further, the theoretical runtime evaluation shows that the method scales better than methods that directly depend on the number of features and is applicable to any classifier. This is further validated by the additional runtime experiments, which highlight that the increase in runtime is dominated by the time the network needs for a forward pass. Another positive aspect is that the method does not depend on noisy gradients or internal classifier variables. Ultimately, the sanity checks highlight the dependency on the model parameters and the robustness: TimeREISE is shown to strongly depend on the correctness of the network and exhibits significant changes when the model weights are randomized. Summarizing the metrics, TimeREISE shows superior results for Sensitivity, Continuity, and the deletion and insertion tests, with a particularly large improvement over gradient-based methods; only insignificant changes were observed for Infidelity. Concerning the runtime, the method shows superior results compared to the other perturbation-based methods when the dataset covers long sequences or multiple channels. For short univariate sequences, the runtime is nearly constant and only slightly above that of the other perturbation-based methods. Compared to the gradient-based methods, however, the runtime of TimeREISE, like that of any other perturbation-based method, is inferior.