XCM: An Explainable Convolutional Neural Network for Multivariate Time Series Classification

Multivariate Time Series (MTS) classification has gained importance over the past decade with the increase in the number of temporal datasets in multiple domains. The current state-of-the-art MTS classifier is a heavyweight deep learning approach, which outperforms the second-best MTS classifier only on large datasets. Moreover, this deep learning approach cannot provide faithful explanations as it relies on post hoc model-agnostic explainability methods, which could prevent its use in numerous applications. In this paper, we present XCM, an eXplainable Convolutional neural network for MTS classification. XCM is a new compact convolutional neural network which extracts information relative to the observed variables and time directly from the input data. Thus, XCM architecture enables a good generalization ability on both large and small datasets, while allowing the full exploitation of a faithful post hoc model-specific explainability method (Gradient-weighted Class Activation Mapping) by precisely identifying the observed variables and timestamps of the input data that are important for predictions. We first show that XCM outperforms the state-of-the-art MTS classifiers on both the large and small public UEA datasets. Then, we illustrate how XCM reconciles performance and explainability on a synthetic dataset and show that XCM enables a more precise identification of the regions of the input data that are important for predictions compared to the current deep learning MTS classifier also providing faithful explainability. Finally, we present how XCM can outperform the current most accurate state-of-the-art algorithm on a real-world application while enhancing explainability by providing faithful and more informative explanations.


Introduction
Following the remarkable availability of multivariate temporal data, Multivariate Time Series (MTS) analysis is becoming a necessary procedure in a wide range of application domains (e.g., finance [1], healthcare [2], mobility [3], and natural disasters [4]). A time series is a sequence of real values ordered according to time; when a set of coevolving time series is recorded simultaneously by a set of sensors, it is called an MTS. In this paper, we address the issue of MTS classification, which consists of learning the relationship between an MTS and its label.
According to the published results, the most accurate state-of-the-art MTS classifier on average is a deep learning approach (MLSTM-FCN [5]). MLSTM-FCN consists of the concatenation of a Long Short-Term Memory (LSTM) block with a Convolutional Neural Network (CNN) block composed of three convolutional sub-blocks. However, MLSTM-FCN outperforms the second-best MTS classifier (the Bag-of-Words method WEASEL+MUSE [6]) only on the large datasets (large relative to the public UEA archive [7], i.e., training set size ≥ 500). This deep learning approach contains a significant number of trainable parameters, which could be an important reason for its poor performance on small datasets. Moreover, for many applications, the adoption of machine learning methods cannot rely solely on their prediction performance. For example, the European Union's General Data Protection Regulation (GDPR), which became enforceable on 25 May 2018, introduces a right to explanation for all individuals so that they can obtain "meaningful explanations of the logic involved" when automated decision making has "legal effects" on individuals or is similarly "significantly affecting" them (https://ec.europa.eu/info/law/law-topic/data-protection_en (accessed on 1 November 2021)). As far as we have seen, an architecture concatenating an LSTM network with a CNN such as MLSTM-FCN, or a classifier based on unigram/bigram extraction following a Symbolic Fourier Approximation [8] such as WEASEL+MUSE, cannot provide perfectly faithful explanations as they rely solely on post hoc model-agnostic explainability methods [9], which could prevent their use in numerous applications. Faithfulness is critical as it corresponds to the level of trust an end-user can have in the explanations of model predictions, i.e., the level of relatedness of the explanations to what the model actually computes. Hence, we propose a new compact (in terms of the number of parameters) and explainable deep learning approach for MTS classification that performs well on both large and small datasets while providing faithful explanations.
CNNs, along with post hoc model-specific saliency methods such as Gradient-weighted Class Activation Mapping (Grad-CAM) [10], have the potential to offer a compact architecture while enabling faithful explanations [11]. A recent CNN, MTEX-CNN [12], proposes using 2D and 1D convolution filters in sequence to extract key MTS information, i.e., information relative to the observed variables and time, respectively. However, as confirmed by our experiments, the features related to time, which are extracted from the output features of the first stage (relative to each observed variable), cannot fully incorporate the timing information from the input data and subsequently yield poor performance compared to the state-of-the-art MTS classifiers. In addition, the significant number of trainable parameters of MTEX-CNN affects its generalization ability on small datasets. Finally, MTEX-CNN requires upsampling processes on feature maps when applying Grad-CAM, which can lead to an imprecise identification of the regions of the input data that are important for predictions.
Therefore, we propose a new faithfully eXplainable CNN method for MTS classification (XCM) which improves on MTEX-CNN in three substantial ways: (i) it generates features by extracting information relative to the observed variables and timestamps in parallel and directly from the input data; (ii) it enhances the generalization ability by adopting a compact architecture (in terms of the number of parameters); and (iii) it allows precise identification of the observed variables and timestamps of the input data that are important for predictions by avoiding upsampling processes. Our main contributions are summarized as follows:

• We present XCM, a new end-to-end, compact and explainable convolutional neural network for MTS classification which supports its predictions with faithful explanations;
• We show that XCM outperforms the state-of-the-art MTS classifiers on both the large and small UEA datasets [7];
• We illustrate on a synthetic dataset that XCM enables a more precise identification of the regions of the input data that are important for predictions compared to the current faithfully explainable deep learning MTS classifier MTEX-CNN;
• We show that XCM outperforms the current most accurate state-of-the-art algorithm on a real-world application while enhancing explainability by providing faithful and more informative explanations.
The rest of this paper is organized as follows: Section 2 presents the related work concerning MTS classification and explainability; Section 3 details XCM architecture; Section 4 presents our evaluation method; and finally, Section 5 discusses our results.

Related Work
In this section, we first introduce the background of our study. Then, we present the state-of-the-art MTS classifiers, and we end with existing explainability methods supporting the predictions of CNN models.

Background
We address the issue of supervised learning for classification. Classification consists of learning a function that maps an input to its label: given an input space $X$, an output space $Y$, an unknown distribution $P$ over $X \times Y$, a training set sampled from $P$, and a 0-1 loss function $\ell_{0-1}$, compute the function $h^*$ as follows:

$$h^* = \arg\min_{h} \; \mathbb{E}_{(x, y) \sim P} \left[ \ell_{0-1}(h(x), y) \right] \quad (1)$$

In this study, classification is performed on multivariate time series datasets. A Multivariate Time Series (MTS) $M = \{x_1, \ldots, x_d\} \in \mathbb{R}^{d \times l}$ is an ordered sequence of $d \in \mathbb{N}$ streams with $x_i = (x_{i,1}, \ldots, x_{i,l})$, where $l$ is the length of the time series and $d$ is the number of dimensions. We address MTS generated from automatic sensors with a fixed and synchronized sampling along all dimensions. An example of an MTS with two dimensions and a length of 100 is given at the top of Figure 5.
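Since the distribution $P$ is unknown, the expected 0-1 loss can only be estimated on a finite labelled sample. The sketch below illustrates this estimate in plain Python. The toy classifier, the sample values and the function names are hypothetical and only serve to show the d × l layout of an MTS and the averaging of the 0-1 loss.

```python
from typing import Callable, Sequence

# An MTS is stored as d rows (observed variables) of l values each, i.e., shape d x l.
MTS = Sequence[Sequence[float]]


def zero_one_loss(predicted: int, actual: int) -> int:
    """Return 0 for a correct prediction and 1 otherwise."""
    return int(predicted != actual)


def empirical_risk(h: Callable[[MTS], int],
                   samples: Sequence[MTS],
                   labels: Sequence[int]) -> float:
    """Average 0-1 loss of classifier h over a finite labelled sample."""
    losses = [zero_one_loss(h(x), y) for x, y in zip(samples, labels)]
    return sum(losses) / len(losses)


def toy_classifier(x: MTS) -> int:
    """Hypothetical rule: predict class 1 if the first variable has a positive mean."""
    return int(sum(x[0]) / len(x[0]) > 0)


# Three hypothetical MTS with d = 2 variables and l = 3 timestamps.
samples = [
    [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],
    [[-1.0, -2.0, -3.0], [0.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
]
labels = [1, 0, 0]
risk = empirical_risk(toy_classifier, samples, labels)  # one error out of three
```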
Before presenting the state-of-the-art MTS classifiers, we introduce some notions about neural networks and the subfamily of our approach, Convolutional Neural Networks (CNNs). A neural network is a composition of $L$ parametric functions referred to as layers, where each layer is considered a representation of the input domain [13]. A layer $l_i$, with $i \in \{1, \ldots, L\}$, contains neurons, which are small units that each compute one element of the layer's output. The layer $l_i$ takes as input the output of its previous layer $l_{i-1}$ and applies a transformation to compute its own output. The behavior of these transformations is controlled by a set of parameters $\theta_i$ for each layer and an activation sublayer that shapes the nonlinearity of the network. These parameters are called weights and link the input of the previous layer to the output of the current layer based on matrix multiplication. This process is also referred to as feedforward propagation in the deep learning literature and is the building block of multilayer perceptrons (MLPs). A neural network is usually called "deep" when it contains more than one layer between its input and output layers. Following the good performance of CNN architectures in image recognition [14] and natural language processing [15,16], CNNs have started to be adopted for time series analysis [17]. CNNs are neural networks that use convolution in place of general matrix multiplication in at least one of their layers [13]. A convolution can be seen as applying and sliding a filter over the time series. The use of different types, numbers and sequences of filters allows the learning of multiple discriminative features (feature maps) useful for the classification task.
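To make the "sliding a filter" view concrete, here is a minimal sketch in plain Python. It computes an unpadded sliding dot product (the cross-correlation that deep learning frameworks usually call convolution) on a single univariate series. The series and the two-value kernel are hypothetical; in a CNN, the kernel values would be learned.

```python
def slide_filter(series, kernel):
    """Slide the kernel over the series and return one dot product per position (no padding)."""
    width = len(kernel)
    return [
        sum(series[t + i] * kernel[i] for i in range(width))
        for t in range(len(series) - width + 1)
    ]


series = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
step_kernel = [-1.0, 1.0]  # hypothetical filter: positive on upward steps, negative on downward steps
feature_map = slide_filter(series, step_kernel)
```

The resulting feature map is non-zero only where the series steps up or down, which is the kind of discriminative feature a learned filter can capture.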

MTS Classifiers
The state-of-the-art MTS classifiers are usually grouped into three categories: similarity-based, feature-based and deep learning methods.
Similarity-based methods make use of similarity measures (e.g., Euclidean distance) to compare two MTS. Dynamic Time Warping (DTW) has been shown to be the best similarity measure to use along with the k-Nearest Neighbors (k-NN) [18]. DTW is not a distance metric as it does not fully satisfy the required properties (the triangle inequality in particular), but its use as a similarity measure along with the NN-rule is valid [19]. There are two versions of kNN-DTW for MTS, dependent (DTW_D) and independent (DTW_I), and neither dominates the other [20]. DTW_I is the cumulative distance of all dimensions independently measured under DTW. DTW_D uses a calculation similar to the one for single-dimensional time series; it considers the squared Euclidean accumulated distance over the multiple dimensions.
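The difference between the two variants can be sketched in a few lines of plain Python. This is an illustrative implementation, not the one used in the benchmark: it assumes no warping-window constraint and a squared-difference pointwise cost, and the function names are ours. DTW_I warps each dimension on its own and sums the results, whereas DTW_D uses a single warping path shared by all dimensions.

```python
import math


def dtw(a, b, cost):
    """Dynamic-programming DTW between sequences a and b for a given pointwise cost."""
    n, m = len(a), len(b)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost(a[i - 1], b[j - 1]) + best_previous
    return acc[n][m]


def dtw_independent(x, y):
    """DTW_I: sum of univariate DTW distances, one per dimension (x, y are lists of d series)."""
    return sum(dtw(xd, yd, lambda u, v: (u - v) ** 2) for xd, yd in zip(x, y))


def dtw_dependent(x, y):
    """DTW_D: one shared warping path; pointwise cost is the squared Euclidean distance across dimensions."""
    x_steps = list(zip(*x))  # one d-tuple per timestamp
    y_steps = list(zip(*y))
    return dtw(x_steps, y_steps, lambda u, v: sum((p - q) ** 2 for p, q in zip(u, v)))


# Hypothetical pair: the peak order is swapped between the two dimensions.
x = [[0, 1, 0, 0], [0, 0, 1, 0]]
y = [[0, 0, 1, 0], [0, 1, 0, 0]]
independent = dtw_independent(x, y)  # each dimension can be aligned perfectly on its own
dependent = dtw_dependent(x, y)      # no single path aligns both dimensions at once
```

In this example the independent variant finds a perfect alignment per dimension, while the dependent variant cannot, which illustrates why the two can rank neighbors differently.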
Next, feature-based methods can be categorized into two families: shapelet-based and Bag-of-Words (BoW) classifiers. Shapelet models (gRSF [21] and UFS [22]) use subsequences (shapelets) to transform the original time series into a lower-dimensional space that is easier to classify. On the other hand, BoW models (LPS [23], mv-ARF [24], SMTS [25] and WEASEL+MUSE [6]) convert time series into a bag of discrete words and use a histogram-of-words representation to perform the classification. WEASEL+MUSE shows better results on average (20 MTS datasets) than gRSF, LPS, mv-ARF, SMTS and UFS. WEASEL+MUSE generates a BoW representation by applying sliding windows of different sizes on each discretized dimension (Symbolic Fourier Approximation) to capture features (unigrams, bigrams and dimension identification). Following a feature selection with a chi-square test, it classifies the MTS with a logistic regression.
Finally, deep learning methods (FCN [26], MLSTM-FCN [5], MTEX-CNN [12], ResNet [27], TapNet [28] and TST [29]) use Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN) or Transformers. According to the published results and our experiments, the current state-of-the-art model (MLSTM-FCN) was proposed in [5] and consists of an LSTM layer and a stacked CNN layer along with squeeze-and-excitation blocks to generate latent features. A recent network, TapNet [28], also consists of an LSTM layer and a stacked CNN layer, followed by an attentional prototype network. However, TapNet shows lower accuracy results (https://github.com/xuczhang/xuczhang.github.io/blob/master/papers/aaai20_tapnet_full.pdf (accessed on 1 November 2021)) on average on the 30 public UEA MTS datasets compared to MLSTM-FCN (MLSTM-FCN results presented in Table 3). There is no basis of comparison for MLSTM-FCN with MTEX-CNN [12] as MTEX-CNN has not been evaluated on public datasets. As illustrated in Figure 1, MTEX-CNN is a two-stage CNN network which first extracts information relative to each feature with 2D convolution filters and then extracts information relative to time with 1D convolution filters. The output feature map is fed into fully connected layers for classification.
Therefore, in this work, we choose to benchmark XCM against the best-in-class of each of the similarity-based, feature-based and deep learning categories (DTW_D/DTW_I, WEASEL+MUSE and MLSTM-FCN classifiers). We also include MTEX-CNN in the benchmark to demonstrate the superiority of our approach, as MTEX-CNN has not been evaluated on the public UEA datasets.

Explainability
In addition to their prediction performance, machine learning methods have to be assessed on how they can support their decisions with explanations. Two levels of explanations are generally distinguished: global and local [30]. Global explainability means that explanations concern the overall behavior of the model across the full dataset, while local explainability informs the user about a particular prediction. As previously introduced with the example of the GDPR, our new CNN approach needs to be able to support each individual prediction. Thus, we present in this section the local explainability methods for CNNs.
CNN classifiers do not provide explainability-by-design at the local level. Thus, some post hoc model-agnostic explainability methods could be used. These methods provide explanations for any machine learning model. They treat the model as a black box and do not inspect internal model parameters. The main line of work consists of approximating the decision surface of a model using an explainable one (e.g., LIME [31], SHAP [32], Anchors [33] and LORE [34]). However, the explanations from the surrogate models cannot be perfectly faithful with respect to the original model [9], and faithfulness is a prerequisite for numerous applications.
Post hoc model-specific explainability methods also exist. These methods are specifically designed to extract explanations for a particular model. They usually derive explanations by examining internal model structures and parameters. The approaches based on back-propagation are seen as the state-of-the-art explainability methods for deep learning models [35]. Methods based on back-propagation (e.g., Gradient Explanation [36], Guided Backpropagation [37], ε-Layer-wise Relevance Propagation [38], Gradient × Input [39], Integrated Gradients [40], DeepLift [41] and Grad-CAM [10]) calculate the gradient, or its variants, of a particular output with respect to the input using back-propagation to derive the contribution of features. In particular, Gradient-weighted Class Activation Mapping (Grad-CAM) [10] has proven to be an adequate method for supporting CNN predictions. Grad-CAM identifies the regions of the input data that are important for predictions in CNNs using the class-specific gradient information. The method has been shown to provide faithful explanations with regard to the model [11]. The faithfulness of the explanations provided by Grad-CAM is shown following a methodology based on model parameter and data randomization tests. However, the precision of the explanations provided by Grad-CAM, i.e., the fraction of explanations that are relevant to a prediction, can vary across CNN architectures as Grad-CAM is sensitive to the upsampling processes applied to feature maps to match the input data dimensions.
Therefore, we support the predictions of our new CNN model XCM with Grad-CAM, a post hoc model-specific explainability method which provides faithful explanations at the local level. The design of our network architecture avoids upsampling processes and enables Grad-CAM to identify the observed variables and timestamps of the input data that are important for predictions more precisely than the current explainable deep learning MTS classifier MTEX-CNN does.
Table 1 presents an overview of the challenges addressed by the state-of-the-art MTS classifiers and how we position our new method XCM. We evaluate the classification performance of XCM and its explainability in Section 5. The next section presents XCM in detail.

XCM
In this section, we present our new eXplainable Convolutional neural network for Multivariate time series classification (XCM). The first part details the architecture of the network, and the second part explains how XCM can provide explanations by identifying the observed variables and timestamps of the input data that are important for predictions.

Architecture
Our approach aims to design a new compact and explainable CNN architecture that performs well on both the large and small UEA datasets. As illustrated in Figure 1, a recent explainable CNN, MTEX-CNN [12], proposes to use 2D and 1D convolution filters in sequence to extract key MTS information, i.e., information relative to the observed variables and time, respectively. However, CNN architectures such as MTEX-CNN have significant limitations. The use of 2D and 1D convolution filters in sequence means that the features related to time (feature maps from 1D convolution filters) are extracted from the processed features related to observed variables (feature maps from 2D convolution filters). Therefore, features related to time cannot fully incorporate the timing information from the input data and can only partially reflect the information necessary to discriminate between the different classes. Thus, (i) our approach XCM extracts both features related to observed variables (2D convolution filters) and features related to time (1D convolution filters) directly from the input data, which leads to more discriminative features by incorporating all the relevant information and ultimately to a better classification performance on average than the 2D/1D sequential approach (see results in Section 5.1). Then, a CNN architecture using fully connected layers to perform classification, especially with the size of the first layer depending on the time series length as in MTEX-CNN, is prone to overfitting and can lead to an explosion of the number of trainable parameters. Thus, (ii) the output feature maps of XCM are processed with a 1D global average pooling before being input to a softmax layer for classification. The use of 1D global average pooling followed by a softmax layer for classification reduces the number of parameters and improves the generalization ability of the network compared to fully connected layers. Global average pooling consists of summarizing each feature map by its average. This operation improves the generalization ability of the network, as it does not have parameters to train, and it provides robustness to spatial translations of the input [42]. In cases where the events in an MTS occur at shifted positions, this robustness to spatial translation ensures that the classification result is not modified. Finally, the use of non fully padded convolution filters as in MTEX-CNN can lead to an imprecise identification of the regions of the input data that are important for predictions, as Grad-CAM is sensitive to upsampling processes. Therefore, (iii) the 2D and 1D convolution filters of XCM are fully padded. As detailed in the next section, the output feature maps can then be analyzed with the Grad-CAM explainability method without altering the precision of the explanations through upsampling processes. Figure 2 illustrates XCM, and the following paragraphs detail the architecture.

Firstly, XCM extracts information relative to the observed variables with 2D convolution filters (upper green part in Figure 2). This upper part is composed of one 2D convolutional block whose output is then converted to one feature map with a 1 × 1 convolution filter to reduce the number of parameters. The convolutional block contains a 2D convolution layer followed by a batch normalization layer [43] and a ReLU activation layer [44]. We set the kernel size of the 2D convolution filters to Window Size × 1, where Window Size is a hyperparameter which specifies the time window size, i.e., the size of the subsequence of the MTS expected to be interesting for extracting discriminative features, and × 1 means for each observed variable. Thus, these 2D convolution filters (number: F in Figure 2) allow the extraction of features per observed variable. The features are extracted using a sliding window (strides equal to 1), and we use full padding instead of half padding to keep the dimension of the feature maps the same as the input data. The padding allows us to avoid using upsampling and interpolation methods on the feature maps when building the attribution maps, i.e., the heatmaps of dimensions T × D that identify the regions of the input data that are important for predictions (detailed in the next section). Then, batch normalization brings normalization at the layer level, and it enables faster convergence and better generalization of the network [45]. In addition, the ReLU activation layer induces nonlinearity in the network. Next, the output feature maps are fed into a module (1 × 1 convolution filter) [46] which reduces the number of parameters. It projects the feature maps into a single one following a channel-wise pooling.
In parallel, XCM extracts information relative to time with 1D convolution filters (lower red part in Figure 2). This lower part is the same as the upper part, except that the 2D convolution filters are replaced by 1D convolution filters. We set the kernel size of the 1D convolution filters to Window Size × D, where Window Size is the same hyperparameter as for the 2D convolution filters and D is the number of observed variables of the input data. The 1D convolution filters slide over the time axis only (stride equal to 1) and capture the interaction between the different time series. Following the use of padding, the output feature map of this lower part has a dimension of T × 1, with T the time series length of the input data. The use of padding, as for the 2D convolution filters, allows us to avoid upsampling the feature maps on the dimension related to the information extracted (time, T) when building the attribution maps (detailed in the next section).
In the following step, the output feature maps from these two parts are concatenated and form a feature map of dimensions T × (D + 1). We apply the same 1D convolution block as presented in the previous paragraph (1D convolution layer with F filters, kernel size Window Size × (D + 1), stride 1 and padding, followed by batch normalization and a ReLU activation layer) to slide over the time axis and capture the interaction between the features extracted. Finally, we add a 1D global average pooling on the output feature maps and perform classification with a softmax layer. As previously introduced, the use of global average pooling instead of fully connected layers improves the generalization ability of the network.
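The compactness of this design can be checked with a back-of-the-envelope parameter count. The sketch below is our own tally, not taken from the released code: it assumes a bias term in every convolution and in the softmax layer, two trainable parameters per channel for batch normalization, and F = 128 filters. The dense head used for comparison is a hypothetical flatten-then-classify layer, not the actual MTEX-CNN head.

```python
def xcm_parameter_count(n_variables, window, n_filters, n_classes):
    """Count trainable parameters of an XCM-like network under the assumptions stated above."""
    batch_norm = 2 * n_filters   # scale and shift per channel
    projection = n_filters + 1   # 1 x 1 convolution mapping F feature maps to one
    # Upper part: F filters of size window x 1 applied to a single input channel.
    branch_2d = (window * 1 * n_filters + n_filters) + batch_norm + projection
    # Lower part: F filters of size window x D sliding over the time axis.
    branch_1d = (window * n_variables * n_filters + n_filters) + batch_norm + projection
    # Block applied to the concatenated T x (D + 1) feature map.
    merge = (window * (n_variables + 1) * n_filters + n_filters) + batch_norm
    # Global average pooling has no parameters; softmax layer maps F averages to the classes.
    head = n_filters * n_classes + n_classes
    total = branch_2d + branch_1d + merge + head
    return {"branch_2d": branch_2d, "branch_1d": branch_1d,
            "merge": merge, "head": head, "total": total}


def dense_head_parameter_count(length, n_filters, n_classes):
    """Hypothetical alternative head: flatten T x F activations into one dense softmax layer."""
    return length * n_filters * n_classes + n_classes


# Two variables, a window of 20 timestamps, two classes, and an assumed F = 128.
counts = xcm_parameter_count(n_variables=2, window=20, n_filters=128, n_classes=2)
dense_head = dense_head_parameter_count(length=100, n_filters=128, n_classes=2)
```

Under these assumptions the total is 17,028 parameters, and none of the terms depends on the time series length T. By contrast, the hypothetical dense head alone would need 25,602 parameters for T = 100 and grows linearly with T.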
In order to assess the potential advantage of concatenating the 2D and 1D convolution blocks instead of having them in sequence, independently of the choice of the classification layers (fully connected layers as in MTEX-CNN versus 1D global average pooling with a softmax layer in XCM), we include in our experiments in Section 5.1 a variant of XCM (XCM-Seq). XCM-Seq is the same as XCM except that the 2D and 1D convolution blocks are in sequence. The next section presents how the architecture of XCM allows the communication of explanations supporting the model predictions with Grad-CAM.

Explainability
The new CNN architecture of XCM has been designed to enable the precise identification of the observed variables and timestamps that are important for predictions based on Gradient-weighted Class Activation Mapping (Grad-CAM) [10]. As presented in Section 2.3, Grad-CAM identifies the regions of the input data that are important for predictions in CNNs using the class-specific gradient information. More specifically, Grad-CAM can output two types of attribution maps from the XCM architecture: one related to observed variables and another one related to time. Attribution maps are heatmaps of the same size as the input data in which colors indicate the features that contribute positively to the activation of the target output [35]. These attribution maps constitute the explanations provided to support XCM model predictions and are available at the sample level. The following paragraphs explain how we adapt Grad-CAM for XCM.
In order to build the first attribution map related to observed variables, Grad-CAM is applied to the output feature maps of the 2D convolution layer which uses convolution filters per observed variable (first block in the upper green part in Figure 2). To obtain the class-discriminative attribution map $L^c_{2D} \in \mathbb{R}^{T \times D}$, with $T$ the time series length and $D$ the number of observed variables, we first compute the gradient of the score for class $c$ ($y^c$) with respect to the feature map activations $A^k$ of the convolutional layer, i.e., $\frac{\partial y^c}{\partial A^k}$, with $k \in [1, \ldots, F]$ the identifier of the feature map. These gradients flowing back are global-average-pooled over the time series length ($T$) and observed variables ($D$) dimensions (indexed by $i$ and $j$, respectively) to obtain the weight of each feature map. Thus, for the feature map $k$, we calculate the weight as:

$$w_k^c = \frac{1}{T \times D} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}} \quad (2)$$

We then use the weights to compute a weighted combination of all the feature maps for that particular class and apply a ReLU to keep only the positive attributions to the predictions (Equation (3)):

$$L^c_{2D} = \mathrm{ReLU}\left( \sum_{k} w_k^c A^k \right) \quad (3)$$
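The two steps described above (averaging the gradients of each feature map into a weight, then passing the weighted sum of feature maps through a ReLU) can be sketched in plain Python. The activations and gradients below are hypothetical hand-written values; in practice they would come from a forward and a backward pass through the trained network, which this sketch does not perform.

```python
def grad_cam_2d(activations, gradients):
    """Build a T x D attribution map from F feature maps and their gradients.

    activations, gradients: lists of F maps, each a T x D nested list.
    """
    n_time = len(activations[0])
    n_vars = len(activations[0][0])
    # Weight of feature map k: its gradient averaged over time and observed variables.
    weights = [sum(sum(row) for row in g) / (n_time * n_vars) for g in gradients]
    attribution = []
    for i in range(n_time):
        attribution.append([
            # ReLU keeps only the positive contributions to the class score.
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(n_vars)
        ])
    return attribution


# Hypothetical example with F = 2 feature maps, T = 2 timestamps and D = 2 variables.
activations = [[[1, 0], [0, 2]],
               [[0, 1], [3, 0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],        # averages to a weight of 0.5
             [[-1.0, -1.0], [-1.0, -1.0]]]    # averages to a weight of -1.0
attribution_2d = grad_cam_2d(activations, gradients)
```

Because the feature maps already have the T × D shape of the input, the result can be read directly as a heatmap over timestamps and observed variables.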
The second attribution map, $L^c_{1D}$, relates to time and is built on the same principle. Grad-CAM is applied to the output feature maps of the 1D convolution layer which uses convolution filters sliding over the time axis (first block in the lower red part in Figure 2). With respect to the feature map activations $M$ and the class $c$, we calculate $L^c_{1D}$ as:

$$L^c_{1D} = \mathrm{ReLU}\left( \sum_{k} w_k^c M^k \right), \quad \text{with } w_k^c = \frac{1}{T} \sum_{i} \frac{\partial y^c}{\partial M^k_{i}} \quad (4)$$

Thus, $L^c_{1D}$ has dimensions T × 1. We then upsample it to match the input data dimensions T × D with a bilinear interpolation in order to obtain the attribution map. This operation does not alter the time attribution results, as the padding on the 1D convolution filters ensures that the feature extraction over the time dimension keeps the time series length. Therefore, the upsampling only replicates the results over the observed variables. An example of observed variables and time attribution maps on a synthetic dataset is presented in Section 5.2.
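As a minimal sketch of the time attribution map, the code below applies the same weighting principle to 1D feature maps and then expands the T × 1 result to T × D. It assumes that interpolating from a single column to D columns amounts to copying each value across the observed variables, as stated above. The input values and function names are hypothetical.

```python
def grad_cam_1d(activations, gradients):
    """Build a length-T attribution vector from F feature maps of length T and their gradients."""
    n_time = len(activations[0])
    # Weight of feature map k: its gradient averaged over the time axis.
    weights = [sum(g) / n_time for g in gradients]
    return [
        max(0.0, sum(w * m[i] for w, m in zip(weights, activations)))
        for i in range(n_time)
    ]


def expand_over_variables(time_attribution, n_variables):
    """Turn a length-T vector into a T x D map by copying each value across the variables."""
    return [[value] * n_variables for value in time_attribution]


# Hypothetical example with F = 2 feature maps and T = 3 timestamps.
activations = [[1, 2, 0],
               [0, 1, 4]]
gradients = [[1.0, 1.0, 1.0],        # averages to a weight of 1.0
             [-0.5, -0.5, -0.5]]     # averages to a weight of -0.5
time_attribution = grad_cam_1d(activations, gradients)
attribution_1d = expand_over_variables(time_attribution, n_variables=2)
```

Each row of the expanded map holds the same value for every variable, so the expansion adds no information along the time axis and removes none.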
Before discussing the performance and explainability results of XCM, we present in the next section the evaluation setting.

Evaluation
In this section, we present the methodology employed (datasets, algorithms, hyperparameters and metrics) to evaluate our approach.

Datasets
We benchmarked XCM on the 30 currently available public UEA MTS datasets [7]. We kept the train/test splits provided in the archive. The characteristics of each dataset are presented in Table 2.

Algorithms
We compare our algorithm XCM, implemented in Python 3.6 (code available on GitHub: https://github.com/XAIseries/XCM), to the state-of-the-art MTS classifiers detailed in Section 2.2 and to the variant XCM-Seq. We used the same setting as XCM.

Hyperparameters
For each dataset, hyperparameters were set by grid search based on the best average accuracy following a stratified 5-fold cross-validation on the training set.
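The selection procedure can be sketched as follows in plain Python. This is an illustrative outline rather than the code used in the experiments: the round-robin fold assignment, the `evaluate` callable (which would train a model on the training indices and return its validation accuracy) and the grid values are all assumptions.

```python
from collections import defaultdict
from itertools import product


def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds so that each fold has a similar class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def grid_search(grid, labels, evaluate, n_folds=5):
    """Return the configuration with the best mean validation accuracy and that accuracy."""
    folds = stratified_folds(labels, n_folds)
    keys = sorted(grid)
    best_config, best_score = None, -1.0
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        scores = []
        for held_out, val_idx in enumerate(folds):
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            scores.append(evaluate(config, train_idx, val_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


def fake_evaluate(config, train_idx, val_idx):
    """Stand-in for training and validating a model; returns a made-up accuracy."""
    return 0.9 if config["window"] == 0.2 else 0.5


labels = [0] * 10 + [1] * 10
grid = {"window": [0.2, 0.4], "batch_size": [1, 8]}  # hypothetical grid values
folds = stratified_folds(labels)
best_config, best_score = grid_search(grid, labels, fake_evaluate)
```

Ties between configurations are resolved in favor of the first one encountered, which is a design choice of this sketch.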

Metrics
For each dataset, we computed the classification accuracy, which is the metric used to benchmark the MTS classifiers on the public UEA datasets [7]. Then, we presented the average rank and the number of wins/ties to compare the different classifiers on the same datasets. Finally, we presented the critical difference diagram [47], the statistical comparison of multiple classifiers on multiple datasets based on the nonparametric Friedman test, to show the overall performance of XCM. We used the implementation available in the R package scmamp (https://www.rdocumentation.org/packages/scmamp/versions/0.2.55/topics/plotCD (accessed on 1 November 2021)).
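The average rank can be computed as in the sketch below, which ranks the classifiers on each dataset (1 = best) and averages the ranks per classifier. It assumes that tied accuracies share the mean of the ranks they span, which is the usual convention for the Friedman test, and it does not handle missing results. The classifier names and accuracies are hypothetical.

```python
def rank_with_ties(accuracies):
    """Rank classifiers on one dataset (1 = best); tied accuracies share the average rank."""
    order = sorted(accuracies, key=accuracies.get, reverse=True)
    ranks = {}
    position = 0
    while position < len(order):
        end = position
        while end + 1 < len(order) and accuracies[order[end + 1]] == accuracies[order[position]]:
            end += 1
        shared_rank = (position + end) / 2 + 1  # mean of ranks position+1 .. end+1
        for name in order[position:end + 1]:
            ranks[name] = shared_rank
        position = end + 1
    return ranks


def average_ranks(results):
    """results: {dataset: {classifier: accuracy}} -> {classifier: mean rank across datasets}."""
    collected = {}
    for accuracies in results.values():
        for name, rank in rank_with_ties(accuracies).items():
            collected.setdefault(name, []).append(rank)
    return {name: sum(ranks) / len(ranks) for name, ranks in collected.items()}


# Hypothetical accuracies for three classifiers on two datasets.
results = {
    "dataset_1": {"A": 0.90, "B": 0.80, "C": 0.80},
    "dataset_2": {"A": 0.70, "B": 0.95, "C": 0.60},
}
mean_ranks = average_ranks(results)
```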

Results
In this section, we first present the performance results of XCM on the public UEA datasets. Then, we illustrate how XCM can reconcile performance and explainability on a synthetic dataset. Finally, we end this section by showing that XCM outperforms the current most accurate state-of-the-art algorithm in a real-world application while providing faithful and more informative explanations.

Performance
The accuracy results on the public UEA test sets of XCM and the other MTS classifiers are presented in Table 3. A blank in the table indicates that the approach ran out of memory. The best accuracy for each dataset is denoted in boldface.
Firstly, we observe that XCM obtains the best average rank and the lowest rank variability across the datasets (rank: 2.3, standard error: 0.4), followed by MLSTM-FCN in second position (rank: 3.5, standard error: 0.5) and WEASEL+MUSE in third position (rank: 4.0, standard error: 0.5). Using the categorization of the datasets published on the archive website (http://www.timeseriesclassification.com/dataset.php (accessed on 1 November 2021)), we do not see any influence from the different train set sizes, MTS lengths, number of dimensions, number of classes and dataset types on XCM performance relative to the other classifiers on the UEA datasets.
More specifically, XCM exhibits better performance than MLSTM-FCN and WEASEL+MUSE on both the large datasets (train size ≥ 500, 23% of the datasets; XCM rank: 1.9, MLSTM-FCN rank: 2.1, WEASEL+MUSE rank: 4.6) and the small datasets (train size < 500, 77% of the datasets; XCM rank: 2.4, MLSTM-FCN rank: 4.0, WEASEL+MUSE rank: 3.9). We can assume that the more compact architecture of XCM compared to the other deep learning classifiers provides a better generalization ability on the UEA datasets (average rank on the number of trainable parameters: XCM 1.7, MLSTM-FCN 1.9, MTEX-CNN 2.0). Furthermore, the results confirm the superiority of the XCM approach, which extracts features relative to the observed variables and time in parallel and directly from the input data, over the sequential approaches. XCM outperforms both XCM-Seq and MTEX-CNN on average on the UEA datasets (rank: XCM 2.3, XCM-Seq 5.0, MTEX-CNN 7.2).
With regard to the hyperparameter Window Size of XCM, Figure 3 shows the average relative drop in performance across the datasets when using time window sizes other than the one used in the best configuration given in Table 3. In order to evaluate the relative impact with respect to the range of performance, we defined four categories of datasets: datasets with XCM original accuracy < 50%, datasets with 50% ≤ accuracy < 75%, datasets with 75% ≤ accuracy < 90% and datasets with accuracy ≥ 90%. First, as expected, we observe that the average relative impact of using suboptimal time window sizes is higher when the XCM level of performance is low (average relative drop in accuracy: 13.1% when XCM accuracy < 50% versus 3.0% when XCM accuracy ≥ 90%). Then, the average relative drop in accuracy when using suboptimal time window sizes is not negligible but remains limited in all cases. This drop is below 15% on average in the category where XCM has the lowest level of accuracy (13.1% ± 3.2%) and below 10% on average across all the datasets (7.0% ± 1.3%).

Figure 3. The performance drop is presented across four categories of datasets, defined according to XCM levels of accuracy shown in Table 3. Abbreviation: Acc, Accuracy.
Finally, we performed a statistical test to evaluate the performance of XCM compared to the other MTS classifiers. We present in Figure 4 the critical difference plot with alpha equal to 0.05, computed from the results shown in Table 3. The values correspond to the average rank, and the classifiers linked by a bar do not have a statistically significant difference. The plot confirms the top three ranking presented before (XCM: 1, MLSTM-FCN: 2, and WEASEL+MUSE: 3), without showing a statistically significant difference between them. We notice that XCM is the only classifier with a significant performance difference compared to DTW_D.
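The average ranks reported on a critical difference plot can be obtained as in the minimal sketch below, which ranks the classifiers on each dataset (rank 1 for the highest accuracy, tied classifiers sharing the mean of their ranks) and then averages over datasets. The tie-handling convention, the function name, and the toy accuracies are assumptions for illustration; the statistical test itself is not sketched here.

```python
def average_ranks(accuracies):
    """Average rank of each classifier over datasets.

    accuracies: list of rows, one row per dataset, one column per classifier.
    Rank 1 is the best (highest accuracy); ties receive the mean of their ranks.
    """
    n_datasets = len(accuracies)
    n_classifiers = len(accuracies[0])
    totals = [0.0] * n_classifiers
    for row in accuracies:
        for j, acc in enumerate(row):
            higher = sum(1 for other in row if other > acc)
            equal = sum(1 for other in row if other == acc)  # includes itself
            totals[j] += higher + (equal + 1) / 2
    return [total / n_datasets for total in totals]


# Toy accuracies for three classifiers on two datasets (not from Table 3).
toy = [
    [0.9, 0.8, 0.7],  # ranks: 1, 2, 3
    [0.6, 0.8, 0.8],  # ranks: 3, 1.5, 1.5
]
ranks = average_ranks(toy)  # [2.0, 1.75, 2.25]
```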

Explainability
In this section, we illustrate how our approach XCM reconciles performance and explainability and show that XCM enables a more precise identification of the regions of the input data that are important for predictions compared to MTEX-CNN, the current deep learning MTS classifier that also provides faithful explainability. We perform the comparison on a synthetic dataset, as the construction of such a dataset allows us to know the expected explanations; this information is not available in the public UEA datasets. Concerning the evaluation of the results, we adopt the intersection-over-union as a metric, i.e., the extent of overlap between the predicted and expected explanations.
The synthetic dataset is composed of 20 MTS (50%/50% train/test split) with a length of 100, two dimensions, and two balanced classes. The difference between the 10 MTS belonging to the negative class and the 10 belonging to the positive class stems from a 20% time window of the MTS. Negative class MTS are sine waves, and, as illustrated in the plot on the top part of Figure 5, positive class MTS are sine waves with a square signal on 20% of dimension 1 (see timestamps between 60 and 80).
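A dataset with this structure could be generated as in the following minimal sketch. Only the sizes (20 MTS, length 100, two dimensions, 10 MTS per class, window between timestamps 60 and 80 on dimension 1) come from the description above; the sine period, the shape of the square signal (modeled here as a constant plateau), and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def make_synthetic_mts(n_per_class=10, length=100, start=60, end=80,
                       period=25.0, plateau=1.0):
    """Build a toy two-class MTS dataset of shape (2 * n_per_class, 2, length).

    period and plateau are illustrative assumptions, not values from the paper.
    """
    t = np.arange(length)
    sine = np.sin(2 * np.pi * t / period)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            sample = np.stack([sine, sine])  # dimension 1 and dimension 2
            if label == 1:
                # Positive class: square signal on 20% of dimension 1.
                sample[0, start:end] = plateau
            X.append(sample)
            y.append(label)
    return np.asarray(X), np.asarray(y)


def stratified_half_split(y):
    """50%/50% train/test split that keeps both classes balanced."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        half = len(idx) // 2
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    return np.array(train_idx), np.array(test_idx)


X, y = make_synthetic_mts()
train_idx, test_idx = stratified_half_split(y)
```

In this sketch, dimension 2 is the same sine wave in every sample, so only the window on dimension 1 separates the two classes.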
First, MTEX-CNN and XCM (Batch Size: 1, Window Size: 20%) correctly predict the 10 MTS of the test set (accuracy 100%). We observe that XCM and MTEX-CNN obtain the same performance although XCM has more than 10 times fewer parameters than MTEX-CNN (trainable parameters: XCM 17k, MTEX-CNN 232k). Moreover, MTEX-CNN and XCM with Grad-CAM both correctly identify the discriminative time window. However, as shown in Figure 5, the attribution maps of MTEX-CNN and XCM with the same explainability method (Grad-CAM) are different. Figure 5 shows one MTS sample belonging to the positive class, and the time and observed variables attribution maps supporting MTEX-CNN and XCM predictions. Attribution maps are heatmaps of the same size as the input data. The more intense the red, the more strongly the features (observed variables, time) contribute positively to the prediction. We observe that the attribution maps drawn from XCM are more precise than the ones from MTEX-CNN, i.e., the intersection-over-union of the explanations is higher for XCM than for MTEX-CNN (intersection-over-union: XCM 0.65 versus MTEX-CNN 0.4). On the time attribution map, high attribution values (above 0.6) for XCM begin on timestamp 63 and end on timestamp 76 (expected: [60, 80], intersection-over-union: 0.65), whereas for MTEX-CNN they begin later (timestamp 68, intersection-over-union: 0.4). Concerning the attribution map of the observed variables, as expected, we see that high attribution values on the discriminative dimension (dimension 1) appear at the same timestamps as high attribution values on the time attribution map for XCM (timestamps 63 to 76, intersection-over-union: 0.65). Nonetheless, the observed variables attribution map of MTEX-CNN shows high attribution values on a window larger than the discriminative one (timestamps range [34, 83], intersection-over-union: 0.41).
As MTEX-CNN attribution maps exhibit a red color gradient, the precision of identification of the regions of the input data on MTEX-CNN attribution maps could be enhanced by setting a threshold higher than 0.6 for the attribution values. However, the red color gradient is due to the upsampling processes needed to match the 2D/1D output feature maps of MTEX-CNN to the size of the input data when applying Grad-CAM. Grad-CAM is applied at a local level, which means that we would potentially need to set a different threshold for each instance, and that would render the MTEX-CNN explainability method impractical. Thus, based on the same attribution threshold (0.6), XCM allows a more precise identification of the regions of the input data that are important for predictions than MTEX-CNN. Both MTEX-CNN and XCM have periodically high attribution values on dimension 2 of the observed variables attribution maps. This could be surprising, as the sinusoidal signal on this dimension is the same across all MTS; however, the fact that this information is uniformly high or low renders it irrelevant for explanations. Therefore, considering that XCM-Seq attribution maps are the same as XCM ones, we can assume that the use of half padding on the different convolution layers to reduce the number of parameters in MTEX-CNN, i.e., the use of upsampling to retrieve the input data dimensions on the attribution maps, can lead to a less precise identification of the regions of the input data that are important for predictions.
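The intersection-over-union values reported above can be reproduced with simple interval arithmetic, as in the minimal sketch below. Treating each explanation as a continuous interval whose size is its end minus its start is an assumption on our part (it matches the reported figures), and the function name is hypothetical.

```python
def interval_iou(predicted, expected):
    """Intersection-over-union of two time intervals given as (start, end)."""
    (p_start, p_end), (e_start, e_end) = predicted, expected
    intersection = max(0, min(p_end, e_end) - max(p_start, e_start))
    union = (p_end - p_start) + (e_end - e_start) - intersection
    return intersection / union if union > 0 else 0.0


expected = (60, 80)  # discriminative window of the synthetic dataset

# XCM, high attribution values on timestamps 63 to 76: 13 / 20
xcm_iou = interval_iou((63, 76), expected)

# MTEX-CNN, observed variables map, timestamps 34 to 83: 20 / 49
mtex_iou = interval_iou((34, 83), expected)

print(round(xcm_iou, 2), round(mtex_iou, 2))  # prints: 0.65 0.41
```

The wider MTEX-CNN window fully covers the expected interval, so its score is penalized only through the larger union, which is why a broad explanation scores lower than a slightly too narrow one.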

Real-World Application
Machine learning methods have great potential to improve the detection of determining events for milk production in dairy farms, which is one of the most important steps toward meeting both food production and environmental goals [48]. A key factor for dairy farm performance is reproduction. Reproduction directly impacts milk production, as cows start to produce milk after giving birth to a calf, and milk productivity declines after the first 3 months. Furthermore, the most prevalent reason for cow culling, the act of slaughtering a cow, is a reproduction issue (e.g., a long interval between two calves) [49]. Thus, it is crucial to detect estrus, the only period when the cow is susceptible to pregnancy, in order to inseminate cows in a timely manner and therefore optimize resource use in dairy farms.
The ground truth is estrus estimation using automated progesterone analysis in milk [50,51]. However, the cost of this solution prohibits its extensive implementation. Thus, the machine learning challenge lies in developing a binary MTS classifier to detect estrus (class estrus/non-estrus) based on affordable sensor data (activity and body temperature). Commercial solutions based on these affordable sensor data have been developed. Nonetheless, their adoption rate remains moderate [52]. These commercial detection solutions suffer from insufficient performance (false alerts and incomplete estrus coverage) and from a lack of justifications supporting alerts. Therefore, aside from enhanced performance, decision support solutions need to provide farmers with explanations supporting the alerts.
The offline dataset consists of 15.5k MTS samples of length 4 with seven variables: the body temperature variable and six activity variables (rumination, ingestion, rest, standing up, over activity, and other activity). A time series corresponds to a 4-day period (MTS length 4): the day of estrus (Day 0) and the previous 3 days. The labels are set with the ground truth in estrus detection, i.e., progesterone dosage in whole milk. On this real-world application, we compare XCM with Grad-CAM to a reference commercial solution (HeatPhone [53]) and to MLSTM-FCN with SHAP [32], MLSTM-FCN being the most accurate state-of-the-art MTS classifier of our benchmark (see Section 4.2). As far as we have seen, an architecture concatenating an LSTM network with a CNN, such as MLSTM-FCN, can only rely on post hoc model-agnostic explainability methods to support its predictions. We chose the state-of-the-art explainability method SHAP, as its granularity of explanation is comparable to Grad-CAM (both global and local). Indeed, Grad-CAM can also offer global explainability by averaging the attribution map values per class. SHAP provides the relative importance of the observed variables and timestamps on predictions. Performance is calculated following a five-fold cross-validation and an arithmetic mean of the F1-scores on the test sets. The choice of this metric is driven by two reasons. First, no assumption is made about the dairy management style; farmers can favor a higher estrus detection rate (higher recall) or fewer false alerts (higher precision) according to their needs. Second, there is a class imbalance (33% of estrus days), which renders the accuracy metric irrelevant.
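The evaluation protocol described above (F1-score on each test fold, then an arithmetic mean over the folds) can be sketched as follows. The function names and the toy labels are hypothetical, the example uses two folds instead of five for brevity, and the positive class is assumed to be encoded as 1 (estrus).

```python
from statistics import mean


def f1_score(y_true, y_pred, positive=1):
    """F1-score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_over_folds(folds):
    """Arithmetic mean of the per-fold F1-scores; folds is a list of (y_true, y_pred)."""
    return mean(f1_score(y_true, y_pred) for y_true, y_pred in folds)


# Toy test folds (not real data): 1 = estrus, 0 = non-estrus.
folds = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),  # precision 0.5, recall 0.5, F1 0.5
    ([1, 0, 0, 0], [1, 0, 0, 0]),  # perfect fold, F1 1.0
]
score = mean_f1_over_folds(folds)  # 0.75
```

Because the F1-score ignores true negatives, a classifier that always predicts the majority non-estrus class obtains an F1-score of 0 here, whereas its accuracy would still look acceptable under the class imbalance.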
As presented in Table 4, we observe that XCM outperforms the current state-of-the-art deep learning approach (MLSTM-FCN) and the reference commercial solution by increasing the average F1-score (69.7% versus 63.1% and 55.3%) and obtaining the lowest variability across folds (1.5% versus 1.5% and 5.1%). In addition, concerning XCM explainability, Figure 6 shows an example of the observed variables and time attribution maps supporting the correct prediction of an MTS sample belonging to the class Estrus. We plot the MTS sample as a heatmap to ease readability. The intersection of attribution maps and sample values informs us that the prediction was made mainly based on the presence of a high overactivity (or low rest) of the animal on the day of estrus (attribution values above 0.6 on Day 0 and on the variable over activity, which has a high value). This behavior is aligned with the literature on estrus detection [54], as it is the behavior associated with most estrus events.
The current state-of-the-art MTS classifier MLSTM-FCN and XCM have different explainability methods (SHAP: post hoc model-agnostic; Grad-CAM: post hoc model-specific), which come with their own form of explanations. In order to assess and benchmark these two MTS classifiers also with respect to their explainability, we use a framework that we have proposed in [55]. The framework details a set of characteristics (performance, model comprehensibility, granularity of the explanations, information type, faithfulness, and user category) that systematize the performance-explainability assessment of machine learning methods. The results of the framework are represented in a parallel coordinates plot in Figure 7. Both deep learning approaches are hard-to-understand models (Comprehensibility: Black-Box) which provide explanations at both global and local levels (Granularity: Global and Local) that can be analyzed by a domain expert (User: Domain Expert). However, in addition to giving the relative importance of observed variables and time as MLSTM-FCN with SHAP does, XCM with Grad-CAM provides more informative explanations by supplying the corresponding sample values (Information: MLSTM-FCN with SHAP, Features+Time; XCM with Grad-CAM, Features+Time+Values). Furthermore, unlike MLSTM-FCN with SHAP and as discussed in Section 2.3, the XCM with Grad-CAM approach provides faithful explanations, which is a prerequisite to reduce solution mistrust from the farmers (Faithfulness: MLSTM-FCN with SHAP, Imperfect; XCM with Grad-CAM, Perfect). Therefore, XCM outperforms the current state-of-the-art algorithm on the real-world application (Performance: Best), while enhancing explainability by providing faithful and more informative explanations.
Finally, the performance-explainability framework introduced in the previous paragraph can also be used to identify the limitations of XCM, which point to directions for improving our approach. We see in Figure 7 that the level of information of the explanation provided by XCM with Grad-CAM (Features+Time+Values) could be enhanced. Therefore, aside from automating the hyperparameter setting of XCM (Window Size), it would be interesting to work on synthesizing the attribution maps to improve the level of information.
Institutional Review Board Statement: This work was carried out in accordance with the guidelines for animal research of the French Ministry of Agriculture (decret NOR AGRG 1231951D) and approved by the "Comite National de Réflexion Ethique sur l'Experimentation Animale" (Authorization of the French Ministry of Higher Education, Research and Innovation reference APAFIS 3122-2015112718172611).

Informed Consent Statement: Not applicable
Data Availability Statement: The UEA multivariate time series classification archive is available online: https://www.timeseriesclassification.com/index.php

Figure 1. MTEX-CNN architecture. Abbreviations: D: number of observed variables, de: dense layer size, F: number of filters, k: kernel size, and T: time series length.

Figure 2. XCM architecture. Abbreviations: BN: Batch Normalization, D: number of observed variables, F: number of filters, T: time series length, and Window Size: kernel size, which corresponds to the time window size.

Figure 3. XCM average relative accuracy drop across the UEA datasets when using time window sizes other than the one used in the best configuration given in Table 3. The performance drop is presented across four categories of datasets, defined according to XCM levels of accuracy shown in Table 3. Abbreviation: Acc: Accuracy.

Figure 4. Critical difference plot of the MTS classifiers on the UEA datasets with alpha equal to 0.05.

Figure 5. Observed variables and time attribution maps supporting the correct MTEX-CNN and XCM predictions of an MTS from the synthetic dataset belonging to the class Positive. Abbreviation: Dim: Dimension.

Figure 6. Observed variables and time attribution maps supporting the correct XCM prediction of an MTS from the real-world test set, which belongs to the class Estrus. The MTS sample is represented in the form of a heatmap, with the regions important for the prediction highlighted with a red square.

Table 1. Overview of the state-of-the-art MTS classifiers.