Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series

New remote sensing sensors now acquire high spatial and spectral Satellite Image Time Series (SITS) of the world. These series of images are a key component of any classification framework to obtain up-to-date and accurate land cover maps of the Earth's soils. More specifically, the combination of the temporal, spectral and spatial resolutions of new SITS enables the monitoring of vegetation dynamics. Although some traditional classification algorithms, such as Random Forest (RF), have been successfully applied for SITS classification, these algorithms do not fully take advantage of the temporal domain. Conversely, deep-learning based methods have been successfully used to make the most of sequential data such as text and audio data. For the first time, this paper explores the use of Convolutional Neural Networks (CNNs) with convolutions applied in the temporal dimension for SITS classification. The goal is to quantitatively and qualitatively evaluate the contribution of temporal CNNs for SITS classification. More precisely, this paper proposes a set of experiments performed on a million Formosat-2 time series. The experimental results show that temporal CNNs are 2 to 3 % more accurate than RF. The experiments also highlight some counter-intuitive results on pooling layers: contrary to image classification, their use decreases accuracy. Moreover, we provide some general guidelines on the network architecture, common regularization mechanisms, and hyper-parameter values such as the batch size. Finally, the visual quality of the land cover maps produced by the temporal CNN is assessed.


I. INTRODUCTION
Since 2004, the biophysical cover on Earth's surface (land cover) has been declared one of the fifty Essential Climate Variables [1]. Accurate knowledge of land cover is key information for current environmental research. More generally, land cover maps are essential to monitor the effects of climate change, to manage resources, and to assist in disaster prevention. Accurate and up-to-date land cover maps are critical as both inputs to modeling systems (e.g. flood and fire spread models) and decision tools for informing public policy [2].
State-of-the-art approaches to producing accurate land cover maps use supervised classification of satellite images [3]. This makes it possible for maps to be reproducible and to be automatically produced at a global scale [4]. Recent satellite constellations are now acquiring satellite image time series (SITS) with high spectral, spatial and temporal resolutions. For instance, since March 2017 the two Sentinel-2 satellites have provided freely distributed worldwide images every five days, composed of thirteen spectral bands at spatial resolutions varying from 10 to 60 meters [5].
These new high-resolution SITS constitute an incredible source of data for land cover mapping, especially for vegetation and crop mapping [6], [7], at regional and continental scales [8], [4]. Figure 1 displays an example of such a map. The state-of-the-art classification algorithms used to produce maps are currently Support Vector Machines (SVMs) and Random Forests (RFs) [9]. These algorithms are generally applied in both spectral and temporal domains at pixel-level by stacking images in the temporal domain, and then by extracting the instances for each pixel. However, these algorithms are oblivious to the temporal dimension that structures SITS: the temporal or the spectral order in which the images are presented has no influence on the results, inducing a loss of the temporal behavior for classes with evolution over time such as the numerous forms of vegetation that are subject to seasonal change.
One solution to mitigate this problem has been to extract temporal features that are then fed to the classification algorithm [10], [11], [12]. Temporal features are generally computed from vegetation profiles such as the Normalized Difference Vegetation Index (NDVI). They correspond to either some statistical values, such as the maximum of NDVI, or the approximation of key dates in the phenological stages of the targeted vegetation classes. However, the use of such temporal features in addition to spectral bands has shown little effect on classification performance [13].
To make the most of the temporal domain, other works have applied the Nearest Neighbor (NN) algorithm combined with temporal measures [14]. Such measures aim at capturing the temporal trends present in the series by measuring a similarity independent of some time variations between two time series [15]. Although promising, their computational complexity is prohibitive for applications with more than a few thousand profiles [16].
Meanwhile, deep neural networks, especially Convolutional Neural Networks (CNNs), have been shown to be extremely good at taking advantage of unstructured data such as images, audio, or text. Their most successful applications have been to benefit from the spatial dimension: CNNs are considered the state of the art for image classification [17], face recognition [18], and semantic segmentation [19]. Recently, they have also proven useful for handling the temporal dimension in time series classification [20], and both the spatial and temporal dimensions in video classification [21].
In remote sensing, CNNs have been successfully used to exploit the spatial dimension of very high spatial resolution satellite images for object detection [22], land cover classification [23], [24] and land cover change detection [25]. They have also been successfully used for classification of multi-source and multi-temporal data [26], [27], but without taking advantage of the temporal dimension: convolutions were applied in the spatial domain, excluding the temporal one. In other words, the order of the images had no influence on the results.
In addition, the potential of Recurrent Neural Networks (RNNs), developed for sequence data, has also been demonstrated in remote sensing, especially for the classification of multi-temporal Synthetic Aperture Radar (SAR) data with both Long Short-Term Memory (LSTM) units [28] and Gated Recurrent Units (GRUs) [29]. Such RNN models are able to explicitly consider the temporal correlation of the data [30], making them particularly well adapted when the task requires a prediction at each time point, such as producing a translation of each word in a sentence [31]. Contrary to CNNs, they are able to share the learned features across the sequence, e.g. across positions of text. As land cover mapping aims at producing one label for all the time points, RNNs might be less suited to this specific classification task. In particular, the number of training steps or the number of gradient calculations is a function of the length of the series [32, Section 10.2.2], while it is only a function of the depth of the network for CNNs. The result is a network that is: 1) harder to train because patterns at the start of the series are many layers away from the classification output, and 2) longer to train because the error has to be back-propagated through each layer in turn. For these reasons, we excluded RNNs from our experiments.
As CNNs are revolutionizing the field of machine learning, we propose for the first time to fully explore them with convolutions applied in the temporal domain for SITS classification. The aim of this work is to provide an extensive experimental study of CNNs in order to give general guidelines about how they might be used and parameterized. This paper is not about proposing one architecture that should be adopted by practitioners. Rather, this paper is about giving a theoretical understanding of temporal CNN models, and studying the influence of their parameterizations. CNNs are complex systems which usually require deep machine learning understanding to be successfully used. This paper looks for a methodological understanding of the models and their theoretical underpinnings, as well as an experimental study to learn how to use them for SITS classification. This paper should help to understand: 1) how and why CNN models work, 2) how to prepare the data to feed it to CNNs, 3) what good model engineering practices are, and 4) which parameters are likely to have a large influence. The covered topics include the width and depth of the network, the size of the convolutions, the pooling layers, the optimization algorithms, the batch size, the input normalization, the batch normalization, the usefulness of spectral features, and the composition of a validation set. These topics are addressed both theoretically and experimentally using high-resolution Formosat-2 SITS, composed of one million labeled time series. This paper presents the results obtained for 1775 deep learning models, corresponding roughly to 2000 hours of training time performed mainly on NVIDIA Tesla V100 Graphics Processing Units (GPUs).
This paper is organized as follows: Section II describes CNN models, their theoretical foundation, as well as the baseline architecture that will be used for comparison in subsequent sections. Then, Section III is devoted to the description of the data and the experimental settings. Section IV is the core section of the paper that presents the experimental results to answer the questions raised when designing a CNN model. Finally, Section V draws the main conclusions of this work.
Note that this paper focuses only on temporal CNN models and does not cover the use of the spatial structure of the data. As the reader will observe, CNNs are complex and the use of the temporal dimension alone raises a number of questions, such that we were not able to present a spatio-temporal study in a single paper. We leave such a study for future work.

II. CONVOLUTIONAL NEURAL NETWORKS
Deep Convolutional Neural Networks (CNNs) have been successfully used for many machine learning tasks including face detection [18], object recognition [33], and machine translation [31]. Benefiting from both theoretical and technical advances [17], [34], deep CNNs have also been applied to remote sensing data for classification of hyperspectral images [35], reconstruction of missing data [36] and pansharpening [37]. In this paper, we explore and assess the use of temporal CNNs for the classification of SITS.
This section is organised as follows: Section II-A reviews the theory of neural networks and CNNs; Section II-B details the different layers and building blocks that are assembled to form a CNN; Section II-C explains how these models are learned and details the most common optimization techniques; Section II-D reviews the challenges inherent to learning deep neural networks; Section II-E details the regularization methods that are used to tackle overfitting; Section II-F draws on the previous sections to introduce the general form of CNN architectures that will be studied in Section IV.

A. General Principles
Deep learning networks are based on the concatenation of different layers where each layer takes the outputs of the previous layer as inputs. Figure 2 shows an example of a fully-connected network where the neurons in green represent the input, the neurons in blue belong to the hidden layers and the neurons in red are the outputs (the Softmax layer is presented in Section II-B). As depicted, each layer is composed of a certain number of units, namely the neurons. The input layer size depends on the dimension of the instances, whereas the output layer is composed of C units for a classification task of C classes. The number of hidden layers and their number of units need to be selected by the practitioner.
Formally, the outputs of a layer $l$, the activation map denoted by $A^{[l]}$, are obtained through a two-step calculation: it first takes a linear combination of the inputs, which are the outputs of layer $l-1$, i.e. $A^{[l-1]}$, and then it applies a non-linear activation function $g^{[l]}$ to this linear combination. It can be written as follows:
$$A^{[l]} = g^{[l]}\left(W^{[l]} A^{[l-1]} + b^{[l]}\right) \qquad (1)$$
where $W^{[l]}$ and $b^{[l]}$ are the weights and the biases of the layer $l$, respectively, that need to be learned. The choice of the activation function is discussed in Section II-B. The idea behind the stacking of several layers is to increase the capacity of the network to represent complex functions, while keeping the layers simple, i.e. composed of a small number of units. Section IV-D will provide some experimental results for different network depths.
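As a minimal illustration of the two-step calculation described above, the following sketch computes the activations of a single dense layer in plain Python. The function name `dense_forward`, the toy weights and the use of ReLU as activation function are assumptions made for this example only; they are not the implementation used in this paper.

```python
def relu(z):
    # Rectified linear unit, used here as an example activation function
    return max(0.0, z)

def dense_forward(a_prev, W, b, g=relu):
    """Return g(W a_prev + b) for one layer.

    W is a list of rows (one row of weights per unit), b a list of biases.
    """
    out = []
    for w_row, b_u in zip(W, b):
        z = sum(w * a for w, a in zip(w_row, a_prev)) + b_u  # linear combination
        out.append(g(z))                                      # non-linearity
    return out

# Hypothetical layer with 2 inputs and 3 units
a0 = [1.0, 2.0]
W1 = [[0.5, -1.0], [1.0, 1.0], [0.0, 2.0]]
b1 = [0.0, -1.0, 0.5]
a1 = dense_forward(a0, W1, b1)
```

Stacking several calls to such a function, each taking the previous output as input, gives the forward pass of a fully-connected network.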
Let $(X, Y)$ be a set of $n$ training instances such that $(X, Y) = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, with each $(x_i, y_i) \in \mathbb{R}^{T \times D} \times \mathcal{Y}$. The couple $(x_i, y_i)$ represents training instance $i$, where $x_i$ is a $D$-variate time series of length $T$ associated with the label $y_i \in \mathcal{Y} = \{1, \cdots, C\}$. Formally, $x_i$ can be expressed by
$$x_i = \langle x_i(1), \cdots, x_i(T) \rangle, \qquad x_i(t) = \left(x_i^1(t), \cdots, x_i^D(t)\right) \in \mathbb{R}^D.$$
Training a neural network corresponds to finding the values of $W = \{W^{[l]}\}_{\forall l}$ and $b = \{b^{[l]}\}_{\forall l}$ that will minimize a given cost function which assesses the fit of the model to the data. This process is known as empirical risk minimization, and the cost function $J$ is usually defined as the average of the errors committed on each training instance:
$$J(W, b) = \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}_i, y_i) \qquad (2)$$
The loss function $L(\hat{y}_i, y_i)$ is usually expressed for a multi-class problem as the cross-entropy loss:
$$L(\hat{y}_i, y_i) = -\sum_{y \in \mathcal{Y}} \mathbb{1}\{y = y_i\} \log p(y \mid x_i) = -\log p(y_i \mid x_i)$$
where $\hat{y}_i$ corresponds to the network prediction, and $p(y_i \mid x_i)$ represents the probability of predicting the true class $y_i$ of instance $i$, computed by the last layer of the network and denoted by $A^{[L]}$ (see Section II-B on the Softmax layer).

B. Layers
In the previous section, we have described a typical layer which is composed of a linear combination followed by an activation function, namely the dense layer. We describe here this layer and others which vary the way the weights are applied to the outputs of the previous layer. A last section is also dedicated to the choice of the activation function.
1) Dense layer: The dense layer, also known as a fully-connected layer, is the main component of traditional neural network architectures, such as in the Multilayer Perceptron illustrated in Figure 2. As described above, it connects all the inputs of a layer to each of its neurons by applying the linear combination followed by the activation function presented in Equation 1. The number of trainable parameters of this layer depends on the number of units and the size of the input to this layer.
2) Convolutional layer: Convolutional layers were proposed to limit the number of weights that a network needs to learn while trying to make the most of structuring dimensions in the data (e.g. spatial, temporal or spectral) [38]. They apply a convolution filter to the output of the previous layer. In contrast to the dense layer, where the output of a neuron is a single number reflecting the activation, the output of a convolutional layer is a set of activations. For example, if the input is a univariate time series, then the output will be a time series where each point in the series is the result of a convolution filter. Figure 3 shows the application of a gradient filter $[-1, -1, 0, 1, 1]$ onto the time series depicted in blue. The output is depicted in red. It takes high positive values where an increase in the signal is detected, and low negative values where a decrease in the signal occurs. Note that the so-called convolution is technically a cross-correlation. Compared to dense layers that apply different weights to the different inputs, convolutional layers differ in that they share their parameters: the same linear combination is applied by sliding it over the entire input. This drastically reduces the number of weights in the layer, by assuming that the same convolution might be useful in different parts of the time series. This is why the number of trainable parameters only depends on the filter size of the convolution $f$ and on the number of units $n$, but not on the size of the input. Conversely, the size of the output will depend on the size of the input, and also on two other hyper-parameters: the stride and the padding. The stride represents the interval between two convolution centers. The padding controls the size of the output after the application of the convolution by adding values (usually zeros) at the borders of the input.
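The sliding of a shared filter over a univariate time series can be sketched as follows. This is an illustrative example only: the function `cross_correlate` and the toy series are hypothetical, and the filter is the gradient filter mentioned above. The filter is not flipped, which is why the operation is a cross-correlation.

```python
def cross_correlate(series, kernel, stride=1, pad=0):
    """Slide `kernel` over `series` without flipping it (cross-correlation)."""
    padded = [0] * pad + list(series) + [0] * pad   # zero padding at both borders
    f = len(kernel)
    out = []
    for start in range(0, len(padded) - f + 1, stride):
        window = padded[start:start + f]
        out.append(sum(k * x for k, x in zip(kernel, window)))
    return out

gradient_filter = [-1, -1, 0, 1, 1]
series = [0, 0, 0, 1, 2, 3, 3, 3, 2, 1, 0, 0]       # toy signal: rise, plateau, fall

valid = cross_correlate(series, gradient_filter)              # no padding
same = cross_correlate(series, gradient_filter, pad=2)        # output keeps the input length
strided = cross_correlate(series, gradient_filter, stride=2)  # every second position
```

In this sketch the output is positive on the rising part of the signal and negative on the falling part, and its length is floor((T + 2p - f) / s) + 1 for an input of length T, a padding p, a filter size f and a stride s.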
3) Pooling layer: The aim of pooling layers is to reduce the size of the representation, both to speed up the computation and to make some of the learned features more robust to noise [39]. Pooling layers can be seen as a de-zooming operation. Two fixed transformations are usually used: taking either the average or the maximum over a window of size $k$, generally with a stride also equal to $k$. They naturally induce a multi-scale analysis when interleaved between successive convolutional layers. For a time series, these pooling layers simply reduce the length, and thus the resolution, of the time series that are output by the neurons by a factor $k$. As convolutional layers output a series of values and not a single one, pooling layers provide a complement to them by progressively reducing the length of their inputs. Global pooling corresponds to a particular setting where $k$ is equal to the size of the input. The $k$ values are then simply either averaged or maxed for global average and global max pooling layers, respectively. Pooling layers do not have any trainable parameter as $k$ is a user-set hyper-parameter.
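The four pooling variants described above (local or global, average or max) can be sketched in a few lines. The helper names `pool` and `global_pool` are hypothetical, and dropping trailing values that do not fill a complete window is an assumption of this example.

```python
def pool(series, k, mode="max"):
    """Local pooling with window size k and stride k.

    Trailing values that do not fill a complete window are dropped
    (an assumption of this sketch).
    """
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda w: sum(w) / len(w)
    return [reduce_window(series[i:i + k])
            for i in range(0, len(series) - k + 1, k)]

def global_pool(series, mode="average"):
    # Global pooling is local pooling with k equal to the input length
    return pool(series, len(series), mode)[0]

x = [1.0, 3.0, 2.0, 8.0, 4.0, 4.0]
local_max = pool(x, 2, "max")         # length divided by 2
local_avg = pool(x, 3, "average")     # length divided by 3
g_avg = global_pool(x, "average")     # a single value
g_max = global_pool(x, "max")         # a single value
```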
Two of these four types of pooling have received most of the attention in the literature about image analysis: 1) the local max-pooling [40], and 2) the global average pooling [41]. For time series, global average pooling seems to have been more successful [20], [42]. In Section IV-C, experiments will show that these results do not generalize well to Earth Observation data.
4) Softmax layer: The Softmax layer is a special case of a dense layer used at the end of the network to predict the output label. It maps the output of the previous layer to a vector of class probabilities, and it has as many neurons as there are classes. The activation for neuron $i$, i.e. class $i$, is an extension of the sigmoid function to the multi-class case that can be written as:
$$A_i^{[L]} = \frac{e^{Z_i^{[L]}}}{\sum_{j=1}^{C} e^{Z_j^{[L]}}}$$
where $Z_i^{[L]}$ is the result of the linear combination for neuron $i$ of the Softmax layer, i.e. the $i$-th component of $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$. For a given training instance, the $C$ activations sum to one and can be interpreted as a probability distribution over the classes.
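A minimal sketch of the Softmax activation, together with the cross-entropy loss of Section II-A, is given below. The function names and toy values are hypothetical; subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result, and classes are indexed from 0 here rather than from 1.

```python
import math

def softmax(z):
    """Map the linear combinations z (one per class) to class probabilities."""
    z_max = max(z)                        # shift for numerical stability
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class (0-based index)."""
    return -math.log(probs[true_class])

z = [2.0, 1.0, 0.1]          # hypothetical outputs of the last linear combination
p = softmax(z)               # probabilities that sum to one
loss = cross_entropy(p, 0)   # small when the true class receives a high probability
```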
5) Activation function: The activation function, denoted by $g^{[l]}$ in Equation 1, is crucial as it introduces non-linear combinations of the features. If only linear functions are used, the depth of the network will have little effect since the final output will simply be a linear combination of the input, which could be achieved with only one layer.
Sigmoid and hyperbolic tangent functions were historically used to provide this non-linearity, but they suffer from the problem of 'vanishing gradient' [17]: when the value fed to the sigmoid is large (either positively or negatively), the neuron saturates and the gradient of the loss becomes very close to zero. The result is a model that is extremely difficult and slow to train, because at each step of the learning algorithm (see Section II-C), the model is barely modified and thus learns extremely slowly.
Rectified Linear Units (ReLU), calculated as $\mathrm{ReLU}(z) = \max(0, z)$, have been introduced to solve the above problem [17]. They keep the 'non-vanishing' gradient of linear activations for positive inputs, while retaining the non-linearity provided by the sigmoid. Other variants exist (Leaky ReLU, parametric ReLU), but standard ReLU is currently the most used activation function as it does not require any parameters to learn or hyper-parameters to tune.
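The saturation discussed above can be checked numerically by comparing the derivatives of the sigmoid and ReLU functions. The following sketch is only illustrative; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z): 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

grad_center = sigmoid_grad(0.0)      # largest possible sigmoid gradient
grad_saturated = sigmoid_grad(10.0)  # nearly zero: the neuron is saturated
grad_relu = relu_grad(10.0)          # stays equal to one
```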

C. How to train a deep learning model?
Neural networks are trained with a two-step process: 1) the forward propagation step passes the training data through the network to calculate the different activation values; 2) the backpropagation step reverses the process and updates the trainable parameters. The forward pass has been described in previous sections. We focus here on how backward propagation works, and on the optimization method used in this work. We also include a section about batch normalization.
1) Backpropagation and optimization method: Training of neural networks is traditionally done by using a gradient descent technique. This calculates the gradient of the cost function, defined in Equation 2, with regard to the current parameter values, and updates the parameters by following the opposite direction to the gradient, so as to minimize the cost function. Parameters are updated in turn, starting from the last layer back to the first layer using the chain rule:
$$W^{[l]} \leftarrow W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \qquad b^{[l]} \leftarrow b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}$$
The learning rate $\alpha$ is the hyper-parameter that controls how much the parameter is modified in the opposite direction to the gradient. If the value of $\alpha$ is too small, many iterations will be required for the network to converge. Conversely, if $\alpha$ is too large, learning may overshoot and even diverge.
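The influence of the learning rate can be illustrated on a toy one-dimensional problem. The sketch below assumes a hypothetical cost $J(w) = (w - 3)^2$, whose minimum is at $w = 3$; it is not related to the networks studied in this paper.

```python
def gradient_descent(alpha, steps=50, w=0.0):
    """Minimize the toy cost J(w) = (w - 3)^2 with a fixed learning rate."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # dJ/dw
        w = w - alpha * grad     # step in the opposite direction to the gradient
    return w

w_good = gradient_descent(alpha=0.1)     # converges close to 3
w_slow = gradient_descent(alpha=0.001)   # still far from 3 after 50 steps
w_bad = gradient_descent(alpha=1.1)      # overshoots more at each step and diverges
```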
One assumption behind deep learning methods is the availability of a huge training dataset, e.g. the GoogleNet network requires 1.2M training images to reach human-level performance in an image recognition task [43]. Therefore, applying a gradient descent algorithm where all the training data are processed before any update of the parameters might not be efficient. This is because one only needs a good estimate of the gradient, which might not require the whole dataset. Mini-batch gradient descent was introduced to accelerate learning: it applies the forward and backward propagation steps on successive small batches of the training data, with the parameters updated once per batch. In addition, the activations and gradients are independent for each instance in the batch and can thus be computed in parallel.
The size of the mini-batch is also a hyper-parameter that sets the trade-off between the speed of the training and the progress made by the network at each iteration. If the batch size is equal to one, also known as Stochastic Gradient Descent (SGD), then the parameters are updated after seeing each instance but 1) it cannot fully benefit from parallelization, and 2) the gradients might be poorly estimated. Conversely, if a batch contains all the training instances, then a lot of calculations have to be done to update the parameters only once. We will study the influence of the batch size in our experiments.
The state of the art for training deep neural networks is currently Adam (Adaptive moment estimation) [44]. It adapts the learning rate $\alpha$ at each iteration for each parameter. Its specificity is to store exponentially decaying averages of both past squared gradients and past gradients, the latter as in the momentum method [45].
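As a hedged sketch of how these two running averages are used, the following function implements one Adam update for a single scalar parameter, following the update rule of [44] with its bias-correction terms. The function name `adam_step` and the toy cost $J(w) = (w - 3)^2$ are assumptions of this example.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias corrections for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize J(w) = (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
```

Note that, because the gradient is divided by the square root of its running second moment, the very first step has a magnitude close to $\alpha$ whatever the scale of the gradient.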
2) Batch normalization: Batch normalization performs a normalization of the network's activations to accelerate learning [34]. Within each batch, it gives them approximately zero mean and unit variance by subtracting the batch mean and dividing by the batch standard deviation. More specifically, it can be applied on either the activation map $A^{[l]}$ (Equation 1) or the intermediate values before the application of the activation function $g^{[l]}$. The second option is the most used in the literature, and also the one adopted in this manuscript.
The main intuition behind batch normalization is that it counteracts what the authors of [34] call 'covariate shift': each layer tries to minimize its contribution to the cost with regard to its input, but as learning progresses, the distribution of those inputs actually changes. This can slow training down because each layer is trying to learn a mapping from input to output while its inputs are constantly changing. The normalization helps to stabilize this so-called covariate shift from one batch to the next so as to make it easier for the function to learn the correct mapping.
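The normalization step can be sketched for the values taken by one unit over a batch. In this example the values are centered on the batch mean and divided by the square root of the batch variance (plus a small constant for numerical stability); the scale `gamma` and shift `beta` are the trainable parameters introduced in [34]. The function name and toy batch are hypothetical, and the running statistics used at prediction time are not covered.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the values of one unit over a batch (training-time behaviour)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var + eps)             # eps avoids a division by zero
    return [gamma * (v - mean) / std + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]               # hypothetical pre-activation values
normalized = batch_norm(batch)
new_mean = sum(normalized) / len(normalized)
new_var = sum(v ** 2 for v in normalized) / len(normalized)
```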

D. Challenges in training deep neural networks
Training deep neural networks presents two main challenges which are offset by a substantial benefit. First, it requires significant expertise to engineer the architecture of the network, choose its hyper-parameters, and decide how to optimize it. In return, such models require less feature engineering than more traditional classification algorithms and have been shown to provide superior accuracy across a wide range of tasks. It is in some sense shifting the difficulty of engineering the features to the one of engineering the architecture. Second, deep neural networks are usually prone to overfitting because of their very low bias: they have so many parameters that they can fit a very large family of distributions, which in turn creates an overfitting issue [46].
The aim of supervised learning is to learn a model from the training instances that then generalizes to accurately classify new previously unseen instances. One of the main difficulties is to find the right trade-off between a model that is too simple but generalizes well (high-bias and low-variance), and one that fits the training data perfectly but generalizes poorly (low-bias and high-variance). The latter is called overfitting. For deep networks, a good proxy to understanding this bias-variance trade-off is to understand what elements of the architecture influence the number of trainable parameters most. We detail this here, and the next section will describe some mechanisms to control overfitting.
Let us compute the number of trainable parameters for dense and convolutional layers. For convolutional layer $i$, let $f^{[i]}$ and $n^{[i]}$ denote the filter size (i.e. the length of the convolution) and the number of units, respectively. $n^{[0]}$ denotes the number of features in the dataset, e.g. $n^{[0]}$ is equal to 3 for RGB input images. For dense layer $j$, let $m^{[j]}$ denote the number of units. Considering a model composed of $K$ convolutional layers followed by $K'$ dense layers and the Softmax layer, the number of trainable parameters $P$ for such a model on a dataset composed of time series of length $T$ belonging to $C$ classes is expressed by:
$$P = \sum_{i=1}^{K} \left(f^{[i]} n^{[i-1]} n^{[i]} + n^{[i]}\right) + \sum_{j=1}^{K'} \left(m^{[j-1]} m^{[j]} + m^{[j]}\right) + \left(m^{[K']} C + C\right)$$
where the three terms correspond to the convolutional layers, the dense layers and the Softmax layer, respectively, and $m^{[0]}$ denotes the size of the output of the last convolutional layer, which directly precedes the first dense layer: $m^{[0]} = T n^{[K]}$. Note that the use of batch normalization, other stride and padding strategies or temporal pooling layers will change the total number of trainable parameters. For instance, $P$ will slightly increase with batch normalization, and it can decrease significantly with pooling layers. Table I gives the values of $P$ for a few architectures with different numbers of dense and convolutional layers. We can see that the number is of course small if no hidden layers are used ($K = K' = 0$), which corresponds to a logistic regression model. The addition of either a convolutional or a dense layer raises the number of parameters to about 100,000, but the greatest increase in the number of parameters occurs when at least one of each is used ($K, K' \geq 1$). This is because the number of parameters of the first dense layer is a function of the number of outputs of the last convolutional layer, which itself outputs about $T \cdot n^{[K]}$ values.
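The count described above can be sketched as a small function in which each layer contributes its weights plus one bias per unit. The function name, its arguments and the toy configuration are hypothetical (they are not the architectures of Table I), and the sketch assumes a stride of one with a padding that preserves the length $T$, without batch normalization or pooling.

```python
def count_parameters(T, n0, C, conv_layers, dense_units):
    """Number of trainable parameters of a temporal CNN.

    conv_layers: list of (filter_size, n_units) tuples, one per convolutional layer.
    dense_units: list with the number of units of each dense layer.
    Assumes every convolutional output keeps the length T.
    """
    total = 0
    n_prev = n0
    for f, n in conv_layers:
        total += f * n_prev * n + n   # shared filter weights + biases
        n_prev = n
    m_prev = T * n_prev               # flattened input of the first dense layer
    for m in dense_units:
        total += m_prev * m + m       # dense weights + biases
        m_prev = m
    total += m_prev * C + C           # Softmax layer
    return total

# Toy configuration: series of length 10 with 3 features and 4 classes
p_logistic = count_parameters(10, 3, 4, [], [])        # no hidden layer
p_conv_dense = count_parameters(10, 3, 4, [(5, 8)], [16])
```

In this toy example, most of the parameters of the second model come from the first dense layer, whose input size is the length of the series multiplied by the number of convolutional units.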

E. Mechanisms to control over-fitting
The number of parameters of deep networks is often very large, such that it might be 10 or 100 times larger than the number of training instances. With that number of parameters, a deep network can theoretically perfectly fit any training data, which creates an overfitting problem [46]. We present here some mechanisms to mitigate that problem. Section II-F will specify the ones used in this work.
1) Regularizing the weights: The first technique corresponds to forcing the range of parameter values to be close to zero by adding a Gaussian prior $\mathcal{N}(0, \lambda^{-1})$. Optimizing for the posterior then translates to adding the squared $L_2$-norm of all of the weight vectors, $\frac{\lambda}{2m} \sum_{l=1}^{L} \|W^{[l]}\|_F^2$, to the cost in Equation 2. The term $\|\cdot\|_F^2$ is the squared Frobenius norm, which corresponds to summing the squares of all weight values in the network. This technique is thus usually called $L_2$-regularization or weight decay. The regularization parameter $\lambda$ is the precision of the normal distribution and controls the trade-off between the fit to the data and the model complexity. If the value of $\lambda$ is very large, then most of the probability mass of the prior is near zero and weights are strongly pulled toward zero. Similarly, a Laplace prior can be put on the parameters, which results in using the $L_1$-norm. That second version tends to completely disable parts of the network and is rarely used for deep networks.
2) Dropout: Dropout randomly turns off some units for each step of the training process [47]. Each time one instance is fed through the network, a proportion of the neurons is disabled. As a neuron might be shut down at any time, the network becomes less sensitive to the activation of specific input neurons. Note that at prediction time all the neurons contribute to the decision: none of them are turned off. Dropout has a parameter corresponding to the probability for a neuron to be turned off at training, regularization being maximized when the rate is equal to 0.5 [48].
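A minimal sketch of dropout is given below. It uses the common "inverted dropout" convention, in which the kept activations are rescaled during training so that nothing needs to change at prediction time; this rescaling convention, the function name and the toy values are assumptions of this example and are not described in the text above.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Turn off each activation with probability `rate` during training.

    Kept activations are divided by (1 - rate) so that their expected value
    is unchanged (inverted dropout, an assumption of this sketch).
    At prediction time, all activations are returned untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)                              # fixed seed for reproducibility
train_out = dropout([1.0] * 1000, 0.5, True, rng)   # about half of the units are zeroed
test_out = dropout([1.0, 2.0], 0.5, False)          # nothing is turned off
```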
3) Data augmentation: Data augmentation corresponds to generating new examples based on the ones in the training dataset. The idea here is that it might be difficult to integrate background knowledge about the data into the network itself. However, that knowledge can be used to generate variations of the training data that should not change the class of any particular instance. Techniques developed for images include translations, rotations, scaling, changing the contrast or adding some random additive noise [49]. Data augmentation techniques have also been developed for time series including window slicing, window warping and weighted averaged time series [50], [51], [52].
4) Transfer learning: Transfer learning aims at applying the knowledge of some networks (e.g. the learned parameters) to a related task. For instance, image classification tasks where few training instances are available can benefit from the features learned by networks on huge training datasets, such as the AlexNet network [17]. The underlying assumption is that the learned features in the earliest layers of a network (shapes, edges, break points) will be similar for different image classification tasks. Transfer learning has been successfully used in remote sensing for the classification of very high spatial resolution images [53], [54], where models pre-trained on general image classification tasks were fine-tuned with few labeled remote sensing data.
5) Validation set and early stopping: The use of a validation set to evaluate the loss and the accuracy of a model is useful to measure a rough approximation of the variance of the learned model. The higher the difference between the accuracy obtained on the training and the validation set, the higher the variance of the model. In addition, the validation set might be used to mitigate overfitting. This technique, named early stopping, stops the learning when the validation loss increases or the validation accuracy decreases over a number of epochs, namely the patience.
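The stopping rule can be sketched as a function that scans the validation loss recorded after each epoch. The function name and the sequence of losses are hypothetical, and the convention adopted here (stop once the number of consecutive epochs without improvement exceeds the patience) is one possible reading of the patience; deep learning libraries may count it slightly differently.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for more than `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss        # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1
            if waited > patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs are run

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.6, 0.65]   # hypothetical validation losses
stop_p0 = early_stopping_epoch(losses, patience=0)
stop_p1 = early_stopping_epoch(losses, patience=1)
stop_p2 = early_stopping_epoch(losses, patience=2)
```

With this convention, a patience of zero stops the training at the first epoch for which the validation loss does not improve.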

F. Proposed network architecture
In this section, we present the baseline architecture that will be discussed in this manuscript. The goal is not to propose the best architecture through exhaustive experiments, but rather to explore the behavior of temporal CNNs for SITS classification. Figure 4 displays an example of such an architecture, composed of three convolutional layers, one dense layer and one Softmax layer.
In Figure 4, the convolutional layers (or dense layer) are building blocks composed of one convolutional layer (or one dense layer), a batch normalization layer, and a dropout layer with a dropout rate of 0.5. In addition, an $L_2$-regularization on the weights is applied for all the layers with a rate of $10^{-6}$. The choice of this architecture will be justified in Section IV-A.
The experimental section will also study the use of pooling layers.
In the experiments, if not otherwise specified, the parameters of the studied networks are trained using Adam optimization (standard parameter values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$) for a batch size equal to 32, and a number of epochs equal to 20. An early stopping mechanism with a patience of zero is also applied. All the network parameter values are initialized with Glorot uniform initialization [55].
Fig. 4: Proposed temporal Convolutional Neural Network. The network input is a multi-variate time series. Three convolutional filters are consecutively applied, then one dense layer, and finally the Softmax layer, which provides the predicted class distribution.
In addition, a validation set, corresponding to 5 % of the train set, is used. Considering the specificity of satellite data, we propose to build our validation set at the polygon level. Similarly to the split between the train and test sets (see the details in Section III-B), the validation set is composed of instances that do not come from the same polygons as the train set.
All the studied CNN models have been implemented with the Keras library [56], with Tensorflow as the backend [57]. To help others build on this work, we have made our code available at https://github.com/charlotte-pel/temporalCNN.

III. DATA AND MATERIAL
This section presents the dataset used for the experiments. First, optical satellite data are presented. Next, the used reference data are briefly described. Then, the data preparation steps are detailed. Finally, the Random Forest algorithm used to compare the results, as well as the used evaluation measures are presented.

A. Optical satellite data
The study area is located in the south west of France, near the city of Toulouse (1°10' E, 43°27' N). It is a 24 km × 24 km area where about 60 % of the soil corresponds to arable surfaces. The area has a temperate continental climate with hot and dry summers (average temperature of about 22.4 °C and rainfall of about 38 mm per month). Figure 5 displays a satellite image of the area in false color for July 14 2006.
The satellite dataset is composed of 46 Formosat-2 images acquired at 8 meter spatial resolution during the year 2006. Figure 6 shows the distribution of the 46 acquisitions, which are mainly concentrated during the summer. Note that Formosat-2's characteristics are similar to those of the new Sentinel-2 satellites, which provide 10 meter spatial resolution images every five days.
For each Formosat-2 image, only the three bands Near-Infrared (760-900 nm) (NIR), Red (630-690 nm) (R) and  Green (520-600 nm) (G) are used. The blue channel has been discarded since it is very sensitive to atmospheric artifacts.
Each image has been ortho-rectified to ensure the same pixel location throughout the whole time series. In addition, the digital numbers from the raw images have been converted to top-of-canopy reflectance by the French Space Agency. This last step corrects images from atmospheric effects, and also outputs cloud, shadow and saturation masks. The remaining steps of the data preparation (temporal sampling, feature extraction and feature normalization) are presented in Section III-C.

B. Reference data
The reference data come from three sources: 1) farmers' declarations from 2006 (Registre Parcellaire Graphique in French), 2) ground field campaigns performed during 2006, and 3) a reference map obtained with a semi-automatic procedure [58]. From these three reference sources, a total of 13 classes is extracted, representing three winter crops (wheat, barley and rapeseed), five summer crops (e.g. corn, soy and sunflower), four natural classes (e.g. grassland, forests and water) and the urban surfaces. Note that the reference map is used only to extract the urban surfaces, and each extracted urban polygon is visually controlled. Table II displays the total number of instances per class at pixel- and polygon-level. It shows great variations in the number of available instances for each class, where grassland, urban surfaces, wheat and sunflower predominate.
The reference data are randomly split into two independent datasets at the polygon level where 60 % of the data is used for training the classification algorithms and 40 % is used for testing. To statistically evaluate the performance of the different algorithms, this split operation is repeated five times. Hence, each algorithm is evaluated five times on different test sets. The presented results are averaged over these five folds.
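The polygon-level split can be sketched as follows. This is only an illustration: the function name `polygon_level_split` and the toy polygon identifiers are hypothetical, and the sketch assigns 60 % of the polygons (rather than 60 % of the pixels) to the training set, which is an assumption.

```python
import random


def polygon_level_split(polygon_ids, train_fraction=0.6, seed=0):
    """Split instance indices so that no polygon is shared between subsets.

    polygon_ids[i] is the identifier of the polygon containing instance i.
    """
    polygons = sorted(set(polygon_ids))
    rng = random.Random(seed)
    rng.shuffle(polygons)
    n_train = int(round(train_fraction * len(polygons)))
    train_polygons = set(polygons[:n_train])
    train_idx = [i for i, p in enumerate(polygon_ids) if p in train_polygons]
    test_idx = [i for i, p in enumerate(polygon_ids) if p not in train_polygons]
    return train_idx, test_idx


# Toy example: ten pixels belonging to five polygons, split five times
# with different random seeds (one split per fold).
polygon_ids = [0, 0, 1, 1, 1, 2, 3, 3, 4, 4]
folds = [polygon_level_split(polygon_ids, seed=s) for s in range(5)]
```

Splitting on polygon identifiers rather than on pixels prevents pixels of the same polygon from appearing on both sides of the split.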

C. Data preparation
1) Temporal sampling: The optical SITS includes invalid pixels due to the presence of clouds and saturated pixels. Nowadays, the high temporal resolution of SITS is used to efficiently detect clouds and their shadows. The produced masks are then used to gap-fill the cloudy and saturated pixels before applying supervised classification algorithms without a loss of accuracy [59]. We use here a temporal linear interpolation for imputing invalid pixel values.
As most of the classification algorithms explored in the manuscript require a regular temporal sampling, we apply interpolation on a regular temporal grid defined with a time gap of two days. The starting and ending dates correspond to the first and last acquisition dates of the Formosat-2 series, respectively. This operation artificially increases the length of the Formosat-2 time series from 46 to 149. As some studied algorithms, such as RF, may be sensitive to this increase of the length, the temporal interpolation is also applied for the original sampling.
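The gap-filling and resampling steps described above can be sketched for a single pixel and a single feature as follows. The function name `interpolate_to_regular_grid`, the integer day representation, and the choice of holding the nearest valid value before the first and after the last valid observation are assumptions made for this example.

```python
def interpolate_to_regular_grid(days, values, valid, step=2):
    """Linearly interpolate the valid observations onto a regular grid.

    days   : sorted acquisition days (integers, e.g. day of year)
    values : observed values, one per acquisition
    valid  : booleans, False for cloudy or saturated observations
    step   : time gap of the regular grid, in days
    """
    pts = [(d, v) for d, v, ok in zip(days, values, valid) if ok]
    if not pts:
        raise ValueError("at least one valid observation is required")
    grid = list(range(days[0], days[-1] + 1, step))
    out = []
    j = 0
    for g in grid:
        # Move to the last valid observation located at or before g.
        while j + 1 < len(pts) and pts[j + 1][0] <= g:
            j += 1
        d0, v0 = pts[j]
        if g <= d0 or j + 1 == len(pts):
            # Exactly on a valid point, or outside the valid range:
            # hold the nearest valid value (assumption).
            out.append(v0)
        else:
            d1, v1 = pts[j + 1]
            out.append(v0 + (v1 - v0) * (g - d0) / (d1 - d0))
    return grid, out


# Toy example: the second acquisition is flagged as invalid (e.g. cloudy).
grid, filled = interpolate_to_regular_grid(
    days=[0, 3, 4, 10],
    values=[0.0, 99.0, 0.4, 1.0],
    valid=[True, False, True, True],
)
```

Passing the original acquisition days as the target grid instead of a regular one would correspond to gap-filling on the original sampling.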
2) Feature extraction: Taking advantage of the Formosat-2 spectral resolution, spectral indexes are computed after the gap-filling, for each image of the Formosat-2 time series. Spectral indexes are commonly used in addition to spectral bands as the input of the supervised classification system in remote sensing [3]. They can help the classifier to handle some nonlinear relationships between the spectral bands [4]. More specifically, we compute three commonly used indexes: the Normalized Difference Vegetation Index (NDVI) [60], the Normalized Difference Water Index (NDWI) [61] and a brilliance index (IB) defined as the norm of all the available bands [14], [59].
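For one pixel at one date, the three indexes can be sketched as below. Two assumptions are made here: the NDWI is taken in its green/near-infrared formulation, which is the one computable from the three bands used in this study, and the norm of the brilliance index is taken as the Euclidean norm. The small constant `eps` is added only to avoid divisions by zero.

```python
import math


def spectral_indexes(nir, red, green, eps=1e-12):
    """Return (NDVI, NDWI, IB) from three reflectance values."""
    ndvi = (nir - red) / (nir + red + eps)
    # Assumed green/NIR formulation of the NDWI.
    ndwi = (green - nir) / (green + nir + eps)
    # Brilliance index: Euclidean norm of the available bands (assumption).
    ib = math.sqrt(nir ** 2 + red ** 2 + green ** 2)
    return ndvi, ndwi, ib


# Hypothetical reflectances of a vegetated pixel.
ndvi, ndwi, ib = spectral_indexes(nir=0.5, red=0.1, green=0.1)
```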
In the experiments, we want to quantify the contribution of the spectral features for the proposed CNN models. To this end, a total of three different feature vectors are defined: 1) NDVI only, 2) spectral bands (SB), and 3) SB + NDVI + NDWI + IB. The simplest strategy consists of using all the available spectral bands. The contribution of the spectral features is analyzed by adding the three computed spectral indexes (NDVI, NDWI and IB) to the spectral bands. We also decide to analyze the NDVI index alone, as it is the most common index for vegetation mapping. Table III summarizes the total number of variables for the studied datasets as a function of the temporal strategy and the used spectral features.
3) Feature normalization: In remote sensing, the input time series are generally standardized by subtracting the mean and dividing by the standard deviation for each feature, where each time stamp is considered as a separate feature. This standardization, also called feature scaling, ensures that the distance measure, often a Euclidean distance computed through all features, is not dominated by a single feature that has a high dynamic range. However, it transforms the general temporal trend of the instances.
In machine learning, the input data are generally z-normalized by subtracting the mean and dividing by the standard deviation for each time series [62]. This z-normalization has been introduced to be able to compare time series that have similar trends, but different scaling and shifting [63]. However, it leads to a loss of the significance of the magnitude, which is recognized as crucial for vegetation mapping, e.g. the corn will have higher NDVI values than other summer crops.
To overcome both limitations of the common normalization methods, we decide to use a min-max normalization per type of feature. The traditional min-max normalization performs a subtraction of the minimum, then a division by the range, i.e. the maximum minus the minimum [64]. As this normalization is highly sensitive to extreme values, we propose to use the 2 % (or 98 %) percentile rather than the minimum (or the maximum) value. For each feature, both percentile values are extracted from all the time-stamp values.
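A minimal sketch of this percentile-based min-max normalization for one type of feature is given below. The helper names `percentile` and `robust_min_max`, as well as the linear interpolation rule used to compute percentiles, are assumptions made for this example. Values outside the percentile range are not clipped here, so normalized values may fall slightly outside [0, 1].

```python
def percentile(sorted_values, q):
    """Percentile q (in %) with linear interpolation between closest ranks."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def robust_min_max(series_list, low_q=2.0, high_q=98.0):
    """Normalize all time series of one feature type with shared percentiles.

    The two percentiles are computed once, over the values of all time
    stamps of all series, so that temporal trends and magnitudes are kept.
    """
    pooled = sorted(v for series in series_list for v in series)
    p_low = percentile(pooled, low_q)
    p_high = percentile(pooled, high_q)
    scale = p_high - p_low
    if scale == 0:
        raise ValueError("the feature is constant and cannot be normalized")
    return [[(v - p_low) / scale for v in series] for series in series_list]


# Toy example: two series of one feature, with values 0..50 and 51..100.
series_list = [[float(v) for v in range(0, 51)],
               [float(v) for v in range(51, 101)]]
normalized = robust_min_max(series_list)
```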

D. Random Forest algorithm
The remote sensing community has assessed the performance of different algorithms for SITS classification showing that Random Forests (RF) and the Support Vector Machines (SVM) algorithms dominate the other algorithms [65], [3]. In particular, the RF algorithm manages the high dimension of the SITS data [13], is robust to the presence of mislabeled data [66], has high accuracy performance on large scale study [4], and has parameters easy to tune [13].
The RF algorithm builds an ensemble of binary decision trees [67]. Its first specificity is to use bootstrap instances at each tree (i.e. training instances randomly selected with replacement [68]) to increase the diversity among the trees. The second specificity is the use of the random subspace technique for choosing the splitting criterion at each node: a subset of the features is first randomly selected, then all the possible splits on this subset are tested based on a feature value test, e.g. maximization of the Gini index. It will result in a split of the data into two subsets, for which previous operations are recursively repeated. The construction stops when all the nodes are pure (i.e. in each node, all the data belong to the same class), or when a user-defined criterion is met, such as a maximum depth or a number of instances at the node below a threshold.
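The split search at a node can be illustrated on a single feature as follows. This sketch expresses the criterion as a minimization of the weighted Gini impurity of the two child nodes, which is a common way of posing the Gini-based test; the function names are hypothetical, and the random selection of the feature subset is left out for brevity.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, score) minimizing the weighted child impurity.

    Instances with a feature value <= threshold go to the left child,
    the others to the right child.
    """
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Toy example: one feature value per instance and two classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["grassland"] * 3 + ["corn"] * 3
threshold, score = best_threshold(values, labels)
```

In a Random Forest, this search would be repeated for each feature of the randomly selected subset, and the best split over the subset would be kept.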
To complete the experiments of Section IV, the RF implementation from Scikit-Learn has been used with standard parameter settings [13]: 500 trees at the maximum depth, and a number of randomly selected variables per node equal to the square root of the total number of features.

E. Performance evaluation
The performance of the different classification algorithms is quantitatively and qualitatively evaluated. Following traditional quantitative evaluations, confusion matrices are obtained by comparing the reference labels with the predicted ones. Then, the standard Overall Accuracy (OA) measure is computed. In addition, the results will also be qualitatively evaluated through a visual inspection.
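For reference, the confusion matrix and the OA can be computed as in the following sketch, where class labels are assumed to be encoded as integers from 0 to `n_classes - 1` and the function names are chosen for this example.

```python
def confusion_matrix(reference, predicted, n_classes):
    """cm[r][p] counts the instances of reference class r predicted as p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for r, p in zip(reference, predicted):
        cm[r][p] += 1
    return cm


def overall_accuracy(cm):
    """Proportion of correctly classified instances (diagonal over total)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


# Toy example with three classes.
reference = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(reference, predicted, n_classes=3)
oa = overall_accuracy(cm)
```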

IV. EXPERIMENTAL RESULTS
This section aims at evaluating the temporal CNN architecture presented in Section II-F. A set of six experiments is run to study: A) the CNN model complexity for our dataset, B) how the proposed CNN models benefit from both spectral and temporal dimensions, C) how pooling layers influence the performance, D) how deep the model should be, E) how the regularization mechanisms help the learning, and F) which values to use for the batch size. A last section is dedicated to the visual analysis of the produced land cover maps.
As explained in Section III-B, all the presented Overall Accuracy (OA) values correspond to average values over five folds. When displayed, the interval always corresponds to one standard deviation. Moreover, one can see all the details of the trained networks at https://github.com/charlotte-pel/temporalCNN.

A. Bias-variance trade-off: how big a model for our data?
This first section has two goals: 1) approximately decide how big or complex should the models be for the remainder of this study, and 2) be able to give an idea about how big should the model be if the reader wants to start using temporal CNN for his or her application. Both questions relate to the bias-variance trade-off of the model for our quantity of data. The more complex the model (i.e. more parameters), the lower its bias, i.e. the fewer incorrect assumptions the model makes about the distribution from which the data is sampled. Conversely, given a fixed quantity of training data, the more complex the model, the higher the variance, which corresponds to the accuracy with which the parameters are learned. The model that will perform best on a particular dataset will be the one that has the best trade-off between bias and variance: making as few incorrect hypotheses as possible while having enough data to learn its parameters accurately. It follows that this trade-off depends on both the complexity of the problem to model and the quantity of data available.
Many classifiers vary their bias-variance trade-off automatically, such as when decision trees grow deeper as the quantity of data increases. For neural networks however, the bias is fixed by the architecture, and hence so is the variance for a fixed quantity of data. In this paper, our problem and quantity of data, presented in Section III, are indeed fixed. We thus study here the influence of the number of parameters to learn on the results. That number of parameters will then be the one used in the remaining sections of the paper: we can thus isolate the influence of other studied parameters of the network, independently of the bias-variance trade-off. We will see at the end of this section that the results are quite stable with the size of the network and give some conservative ideas about how to choose it for another problem or data size.
Note that the number of trainable parameters is here used as a proxy for model complexity, which provides a reasonable measure when dealing with a specific classification problem where the quantity of data and the number of classes are fixed [32,Chapter 7,Introduction].
We studied seven CNN architectures with an increasing number of parameters. Each architecture is composed of three convolutional layers, one dense layer, and the Softmax layer as depicted in Figure 4. We then vary the number of neurons (or width). The depth of the model will be specifically studied in Section IV-D, where we will show that it is reasonable to have three convolutional layers.
The number of trainable parameters is mainly given by the size of the output after the third convolutional layer (see Equation 8). We thus fixed the number of units in the dense layer to 256 and varied the number of units in the convolutional layers from 16 to 1024. The total number of trainable parameters then ranges from about 320,000 to 50 million. The length of the convolutions is set to 5; this will be further studied in Section IV-B. All models are trained following Section II-F, with the exception of not using a validation set, in order to observe more accuracy variations by letting the models be more prone to overfitting. The used dataset is composed of three spectral bands with a regular 2-day temporal sampling.
Fig. 7: Overall Accuracy (± one standard deviation) as a function of the number of parameters for seven Convolutional Neural Network models.
Figure 7 shows the OA values as a function of the number of parameters in logarithmic scale. It first shows that the architecture is very robust to a drastic change in the number of parameters, as exhibited by OA varying only between 93.28 % for 50M parameters and 93.69 % for 2.5M parameters. The standard deviation increases with the number of parameters, but even at 50M parameters, most results are between 91 % and 95 % accuracy.
From this result, we can take the two following decisions: 1) models having about 2.5M parameters with three convolutional layers and one dense layer will be used in the remainder of this study, and 2) one can use this result to see that if more training data is available, then using a similar architecture is likely to conservatively work well, because more data is likely to drive the variance down while bias is fixed. If less data is available, one could decide to use a smaller architecture but again, overall, the results are very stable.
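To give an order of magnitude of the model sizes discussed in this section, the number of weights and biases can be approximated as in the sketch below. It is not Equation 8: it assumes that the convolutions preserve the temporal length of 149 (padded convolutions), that the output layer has 13 units (one per class), and it ignores the parameters of the batch normalization layers. The function name and its arguments are hypothetical.

```python
def count_parameters(n_units, filter_size=5, n_features=3, length=149,
                     dense_units=256, n_classes=13, n_conv=3):
    """Approximate number of weights and biases of the temporal CNN."""
    total = 0
    in_channels = n_features
    for _ in range(n_conv):
        # One filter of shape (filter_size, in_channels) and one bias per unit.
        total += filter_size * in_channels * n_units + n_units
        in_channels = n_units
    # Dense layer applied to the flattened output of the last convolution.
    total += length * n_units * dense_units + dense_units
    # Softmax layer.
    total += dense_units * n_classes + n_classes
    return total


params_64_units = count_parameters(64)
params_1024_units = count_parameters(1024)
```

Under these assumptions, 64 units per convolutional layer give about 2.49 million parameters and 1024 units give about 49.6 million, the dense layer accounting for most of them. The exact figures of the trained networks may differ.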

B. Benefiting from both spectral and temporal dimensions.
In this section, we first explore whether the learned CNN models benefit from the temporal or the spectral dimension. Then, we study the influence of the convolution filter size on the accuracy performance.
1) Spectro-temporal guidance for temporal CNNs: In this experiment, we compare four configurations: 1) no guidance, 2) only temporal guidance, 3) only spectral guidance, and 4) both temporal and spectral guidance. Before presenting the obtained results, we first describe the trained models for these four types of guidance.
No guidance: Similar to a traditional classifier, such as the RF algorithm, the first considered type of model ignores the spectral and temporal structures of the data, i.e. a shuffle of the data across both spectral and temporal dimensions will provide the same results. For this configuration, we decided to train two types of algorithms: 1) the RF classifier selected as the competitor (see Appendix A for a comparison of RF with time series classification algorithms), and 2) a deep learning model composed of only dense layers of 1024 units (this specific architecture is named FC in the following). As both RF and FC models do not require regular temporal sampling, the use of 2-day sampling is not necessary and can even lead to under-performance. Indeed, the use of a high dimensional space composed of redundant and sometimes noisy features may hurt the accuracy performance. Hence, the results of both models are displayed for the original temporal sampling.
Temporal guidance: The second type of model provides guidance only on the temporal dimension. Among all the possible architectures, we decided to train an architecture with convolution filters of size (f, 1), instead of (f, D) with D the number of spectral features. In other words, the same convolution filters are applied across the temporal dimension identically for all the spectral dimensions.
Spectral guidance: The third type of model includes guidance only on the spectral dimension. For this purpose, a convolution of size (1, D) is first applied without padding, reducing the spectral dimension to one for the next convolution layers.
Spectro-temporal guidance: The last type of model corresponds to the one learned in previous Section IV-A, where the first convolution has a size (f, D).
Table IV displays the OA values and one standard deviation for the four levels of guidance. As the use of engineered features may help the different models, we train all the models for the three feature vectors presented in Section III-C: NDVI alone, spectral bands (SB), and spectral bands with three spectral indexes (SB-SF). For both models using temporal guidance, the filter size f is set to five. The influence of the filter size f on the accuracy results is studied right after. All the models are learned as specified in Section II-F, including dropout and batch normalization layers, weight decay and the use of a validation set.
Table IV shows that the overall accuracy increases when adding more types of guidance, regardless of the type of used features. Note that the case of using only spectral guidance with the NDVI feature is a particular "degenerate" case. For this case, the spectral dimension is composed of only one feature, the NDVI index. The proposed model will thus apply convolutions of size (1,1), leading to a model that does not provide guidance.
When using at least the spectral bands in the feature vector (SB and SB-SF columns), both RF and FC models obtain the lowest accuracy, with a difference of 2 to 3 % compared to the case where both temporal and spectral guidance are used. Interestingly, models based on only spectral convolutions with spectral features (third row, second column) slightly outperform models that use only temporal guidance (fourth row, second column). This result thus confirms the importance of the spectral domain for land cover mapping applications. In addition, the use of convolutions in both temporal and spectral domains leads to slightly better OA values compared to the other three levels of guidance. Finally, Table IV shows that the use of spectral indexes in addition to the available spectral bands does not help to improve the accuracy when using spectro-temporal guidance.
2) Influence of the filter size: For both models using temporal guidance, it is also interesting to study the filter size. Considering the 2-day regular temporal sampling, a filter size of f (with f an odd number) will abstract the temporal information over ±(f − 1) days, before and after each point of the series. Given this natural expression in number of days, we name (f − 1) the reach of the convolution: it corresponds to half of the width of the temporal neighborhood used for the temporal convolutions. For example, a convolution with a size of 5 will apply the filter over a neighborhood consisting of four days before and after the central point. When using this small reach of four days, the network identifies in priority weekly temporal patterns. Conversely, the use of a reach higher than a month could lead to a system similar to one that does not use temporal guidance for crop classes with quick temporal dynamics.
Figure 8 displays the OA values as a function of reach for the model learned using both temporal and spectral guidance. We study five filter sizes f = {3, 5, 9, 17, 33} corresponding to a reach of 2, 4, 8, 16, and 32 days, respectively. Each curve corresponds to a different feature vector: NDVI in blue, SB in orange, and SB-SF in yellow. Although not displayed here, we obtained similar results for models learned using only temporal guidance. Figure 8 shows that the maximum OA is reached for a reach of 4 to 8 days. This result therefore shows the importance of high temporal resolution SITS, such as the one provided at five days by both Sentinel-2 satellites. Indeed, the acquisition frequency will allow CNNs to abstract enough temporal information from the temporal convolutions. In general, the reach of the convolutions will mainly depend on the patterns that we try to abstract at a given temporal resolution.
Moreover, Figure 8 shows that the OA variations are larger when using only NDVI than when using SB or SB-SF. This means that the use of several spectral features reduces the sensitivity of the learned model to the filter size f by abstracting also useful information from the spectral dimension.
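The relation between the filter size and the reach can be written as a short helper. The function name `reach_in_days` is hypothetical; the default time gap of two days corresponds to the regular temporal sampling used in this study.

```python
def reach_in_days(filter_size, sampling_days=2):
    """Reach of a temporal convolution, in days, on each side of a point."""
    if filter_size % 2 == 0:
        raise ValueError("the filter size must be odd")
    # (f - 1) / 2 neighbors on each side, separated by `sampling_days` days.
    return (filter_size - 1) // 2 * sampling_days


reaches = [reach_in_days(f) for f in (3, 5, 9, 17, 33)]  # [2, 4, 8, 16, 32]
```

With a different temporal sampling, the same filter sizes would correspond to different reaches, for example 5 to 80 days with a hypothetical 5-day sampling.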

C. Are local and global temporal pooling layers important?
In this section, we explore the use of pooling layers for different reach values. As presented in Section II-B, pooling layers are generally used in image classification tasks to reduce [...]
For this purpose, we train models with a global average pooling layer added after the third convolution layer for the following reach values: 2, 4, 8, 16, and 32 days. We also train models with local pooling layers interleaved between each convolution layer with a window size k of 2. For this experiment, the reach is kept constant (2, 4, 8, 16, and 32 days) by reducing the convolution filter size f after each convolution. For example, a constant reach of 8 is obtained by successively applying three convolutions with filter sizes of 9, 5, and 3.
Figure 9 displays the OA values as a function of reach. Each curve represents a different configuration: local max-pooling (MP) in blue, local max-pooling and global average pooling (MP+GAP) in orange, local average pooling (AP) in yellow, local and global average pooling (AP+GAP) in purple, and global average pooling (GAP) in green. The horizontal red dashed line corresponds to the OA values obtained without pooling layers in the previous experiment. Figure 9 shows that the use of pooling layers performs poorly: the OA results are almost always below the one obtained without pooling layers (red dashed line). Let us describe in more detail the different findings for both global and local pooling layers.
The use of a global average pooling layer leads to the highest decrease in accuracy. This layer is generally used to drastically reduce the number of trainable parameters by reducing the size of the last convolution layer to its depth. It thus performs an extreme dimensionality reduction that decreases here the accuracy performance.
Concerning the use of only a local pooling layer, Figure 9 shows similar results for both max and average pooling layers. The OA values tend to decrease when the reach increases. The results are similar to those obtained by the model without pooling layers (horizontal red dashed line) for reach values lower than nine days, with even a slight improvement when using a local average pooling layer with a constant reach of four days. This result is in disagreement with results obtained for image classification tasks, for which 1) max-pooling tends to have better results than average pooling, and 2) the use of local pooling layers helps to improve the classification performance. The main reason for this difference is probably task-related. In image classification, local pooling layers are known to extract features that are invariant to the scale and small transformations, leading to models that can detect objects in an image no matter their locations or their sizes. However, the location of the temporal features, and their amplitude, are crucial for SITS classification. For example, winter and summer crops, that we want to distinguish, may have similar profiles with only a shift in time. Removing the temporal location of the peak of greenness might prevent their discrimination.
The used dataset is composed of the three spectral bands with a regular temporal sampling at two days.
As the results displayed here show that pooling layers do not help the proposed temporal CNNs, we have decided not to use them in the following experiments. We also recommend using pooling layers carefully when dealing with temporal CNNs.
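The way the reach is kept constant when local pooling layers are interleaved between the convolutions can be sketched as follows. Each pooling layer with a window of 2 doubles the time gap between consecutive points, so the filter size must shrink accordingly. The function name is hypothetical, and the minimum filter size of 3, used when the reach becomes smaller than the time gap, is an assumption.

```python
def constant_reach_filter_sizes(reach_days, n_layers=3, sampling_days=2, pool=2):
    """Filter sizes of successive convolutions that keep a constant reach."""
    sizes = []
    step = sampling_days  # time gap, in days, between consecutive points
    for _ in range(n_layers):
        # Number of neighbors on each side needed to cover `reach_days`.
        half_width = max(1, reach_days // step)
        sizes.append(2 * half_width + 1)
        # A pooling layer of window `pool` multiplies the time gap by `pool`.
        step *= pool
    return sizes


sizes_reach_8 = constant_reach_filter_sizes(8)    # [9, 5, 3]
sizes_reach_32 = constant_reach_filter_sizes(32)  # [33, 17, 9]
```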

D. How deep should the model be?
Contrary to Section IV-A where we varied the number of trainable parameters, we propose here to vary the number of layers, i.e. the depth of the network, for the same model complexity. For this purpose, we decrease the number of units for deeper networks. More specifically, we consider six architectures composed of one to six convolutional layers with a number of units ranging from 256 to 16, and one dense layer with a number of units ranging from 64 to 2048. Figure 10 shows OA values as a function of depth for the six architectures. The orange bar represents one standard deviation. The used dataset is composed of the three spectral bands with a 2-day regular temporal sampling.
Figure 10 shows that the highest accuracy scores, with the lowest standard deviation, are obtained for two or three convolutional layers on the studied data. The use of an inappropriate number of convolutional layers and number of units may lead to an under-estimation of the performance of the CNN model. The selection of a reasonable architecture is therefore crucial, and may be optimized through a computationally expensive cross-validation procedure or meta-learning approaches [69], [70], [71].
E. How to control overfitting?
As described in Section II-E, several techniques have been proposed in the literature to deal with the overfitting issue. Considering the architecture selected in Section IV-A, the model needs to learn more than three times as many parameters as there are training instances (2 million parameters versus 620,000 training instances). This section aims at determining which of the used regularization mechanisms are the most crucial to train the temporal CNN. For this purpose, we study the architecture composed of three convolutional layers, one dense layer, and one Softmax layer, first with only one regularization technique, then with all regularization techniques except one. More precisely, we focus on dropout, batch normalization, the use of a validation set, and weight decay. We include here the batch normalization layer even if its primary goal is to help the learning, not to regularize the model (see Section II-C).
Table V displays OA values with or without the use of the different regularization mechanisms. The first row displays the results when no regularization mechanism is applied (lower bound), whereas the last row displays the results when all the regularization mechanisms are used (upper bound). Table V shows that dropout is the most important regularization mechanism for our temporal CNNs, as using it alone leads to an OA value close to the one obtained when using the four regularization mechanisms. Conversely, the use of a validation set and of weight decay seems less useful for regularizing the network.

F. What values are used for the batch size?
This section aims at studying the influence of the batch size on the classification performance and on the runtime complexity. For this purpose, the model of Section IV-E is trained for the following batch sizes: {8, 16, 32, 64, 128}. The given accuracy is computed on test instances for models that have obtained the minimum validation loss across the twenty epochs.

G. Visual analysis
This experimental section ends with a visual analysis of the results for both blue and green areas of size 3.7 km × 3.6 km (465 pixels × 445 pixels) displayed in Figure 5. The analysis is performed for the RF and the temporal CNN used in Section IV-E with all the regularization techniques. The original temporal sampling is used for RF, whereas the regular temporal sampling at two days is used to train the CNN model. Both models are learned on datasets composed of the three spectral bands. Figure 11 displays the produced land cover maps. The first row displays the results for the blue area, whereas the second row displays those for the green area. The first column displays the Formosat-2 image in false color for July 14 2006 (zoom of Figure 5). The second and third columns give the results for the RF and the temporal CNN classifiers, respectively. The images in the last column display in red the disagreements between both classifiers. The legend of the land cover maps is provided in Table II.
Although the results look visually similar, the disagreement images between both classifiers highlight some strong differences on the delineations between several land covers, but also at the object level (e.g. crop, urban areas, forest). Concerning the delineation disagreements, we found that RF spreads out the majority class, i.e. urban areas, leading to an over-detection of this class, especially for mixed pixels. Concerning the object disagreements, one can observe for example that RF predicts an urban area (in light pink) where, according to the reference (a test polygon in that case), there should be a sunflower crop (in purple). Finally, this visual analysis shows that both classification algorithms are sensitive to salt and pepper noise, which could potentially be removed by a post-processing procedure or by incorporating some spatial information into the classification framework.

V. CONCLUSION
For the first time, this work explores the use of temporal CNNs for SITS classification. Through an extensive set of experiments carried out on a series of 46 Formosat-2 images, we show that most of the tested temporal CNN architectures outperform the RF algorithm by 2 to 3 %. A visual analysis also shows the ability of temporal CNNs to accurately map land cover without over-representation of majority classes.
To provide intuitions behind this good performance, we studied the impact of the network architecture by varying the depth and the width of the models, by testing different batch size values, and by looking at the influence of common regularization mechanisms. We also demonstrate the importance of using both temporal and spectral dimensions when computing the convolutions. The remaining experimental results support two main recommendations on the use of pooling layers and the engineering of spectral features. First, we show that the use of global pooling layers, which drastically reduces the number of trainable parameters, is harmful for SITS classification. Overall, we recommend a careful study of the influence of pooling layers, favoring local average pooling over global pooling, before any integration into a temporal CNN. Second, we show that the addition of manually-calculated spectral features, such as the NDVI, does not seem to improve CNN models. We thus recommend not to compute them.
All these results show that temporal CNNs are a strong learner for Sentinel-2 SITS, which presents a high spectral resolution with 10 bands at a spatial resolution of 10 and 20 meters. While we have argued that RNN models might be less suited to SITS classification than the temporal CNN models we study herein, this analysis warrants experimental verification. Finally, the presence of salt and pepper noise also indicates a need to take into account the spatial dimension of SITS in addition to the spectral and temporal dimensions.

APPENDIX A
In Section IV-B, we compare temporal CNN results to RF ones. Although the RF algorithm obtains promising results for SITS classification [13], it is oblivious to the temporal dimension of the SITS. Thus, this comparison between temporal CNN and RF algorithms may seem unfair. For this reason, we explore in this appendix the use of time series classification algorithms that take into account the temporal domain. They have been mainly developed by the machine learning community for various time series classification tasks present in the UCR archive, including image classification, speech recognition or electro-cardiogram (ECG) analysis [72]. This appendix aims at justifying the choice of comparing the temporal CNNs to RF, rather than to state-of-the-art time series classification algorithms. First, time series classification approaches are briefly presented. Second, a comparison between RF and five of these classification algorithms is performed.
Dynamic Time Warping: Dynamic Time Warping (DTW) quantifies the similarity between two time series while allowing some time distortions. The authorized amount of time distortion is generally defined by a parameter named the warping window w. The full DTW is computed when w equals the length of the time series, whereas the Euclidean distance can be seen as the particular case where w equals zero. The use of a 1-nearest neighbor (NN) classifier combined with DTW, whose warping window size is cross-validated, has been the gold-standard algorithm on the UCR datasets for several years [73]. The state-of-the-art approaches are now led by more complex algorithms [62], which we describe hereafter.
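As an illustration of the role of the window parameter w, the following minimal Python sketch computes a windowed DTW between two univariate series and uses it in a 1-NN classifier. It is not the Java implementation used in the experiments below; the function names `dtw_distance` and `predict_1nn`, the squared-difference local cost, and the final square root are assumptions made for this example.

```python
import math


def dtw_distance(a, b, w):
    """Windowed DTW between two univariate series (illustrative sketch).

    w is the warping window: cell (i, j) is only reachable if |i - j| <= w.
    """
    n, m = len(a), len(b)
    # The band must at least cover the length difference to reach (n, m).
    w = max(w, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return math.sqrt(cost[n][m])


def predict_1nn(train_series, train_labels, query, w):
    """Return the label of the training series closest to `query` under DTW."""
    best = min(range(len(train_series)),
               key=lambda k: dtw_distance(train_series[k], query, w))
    return train_labels[best]
```

With w = 0 and series of equal length, only the diagonal cells are reachable, so the function reduces to the Euclidean distance; with w equal to the series length, no cell is excluded and the full DTW is obtained.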
Elastic Ensemble: The Elastic Ensemble (EE) algorithm is an ensemble approach composed of eleven 1-NN classifiers, each associated with one of eleven distance measures that yield different results on the UCR datasets [74]. As for DTW, the parameters of these measures (if any) are set using a cross-validation procedure.
BOSS: BOSS is a dictionary-based method that discretizes the time series signal into words using the Symbolic Fourier Approximation (SFA) algorithm [75]. This method is particularly robust to noise and delivers a high classification accuracy.
Shapelet Transform: A shapelet is defined as a sub-series that helps to discriminate among the different classes. The intuition is that only some sub-sequences of the time series are helpful for the classification task. The initial shapelet algorithm builds a binary decision tree with a splitting criterion based on the distance between the training instances and the best shapelet [76]. To cope with the huge runtime complexity required to find the best shapelets, several algorithms have been developed to reduce the search time. One of them, the Shapelet Transform (ST) algorithm, first searches for the k best shapelets in one pass, and then uses the distances between the training time series and these k best shapelets to transform the data into a new space, namely the shapelet space. This new representation of the training instances is then used to train eight traditional supervised classification algorithms, including RF and SVM.
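The transformation step can be sketched as follows, assuming that the k shapelets have already been selected (the search itself is not shown). The distance between a series and a shapelet is taken here as the smallest Euclidean distance over all sliding windows of the series, which is a common choice but an assumption of this example; normalization of the sub-series is omitted, and the names `shapelet_distance` and `shapelet_transform` are hypothetical.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window of the series.

    Returns math.inf if the shapelet is longer than the series.
    """
    length = len(shapelet)
    best = math.inf
    for start in range(len(series) - length + 1):
        d = sum((series[start + k] - shapelet[k]) ** 2 for k in range(length))
        best = min(best, d)
    return math.sqrt(best)


def shapelet_transform(dataset, shapelets):
    """Map each series to its vector of distances to the k given shapelets."""
    return [[shapelet_distance(s, sh) for sh in shapelets] for s in dataset]
```

Each row of the output is a k-dimensional feature vector that can be passed to any standard classifier, which is how the new representation is used in the text above.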
Collective Of Transformation-based Ensembles: Both Collective Of Transformation-based Ensembles (COTE) algorithms, Flat-COTE [77] and the more recent Hierarchical Vote of COTE (HIVE-COTE) [78], are meta-ensemble algorithms that include a set of classifiers working with different representations of the data. They include the EE and ST algorithms, but also classifiers learned in the frequency domain, e.g. after applying a Discrete Fourier Transform (DFT) to the data. The strength of these algorithms is their use of different feature spaces, which provide different representations of the data from which models can be learned for different tasks. However, the use of computationally expensive algorithms such as EE and ST makes them almost impossible to run on most real-world datasets.
Although these algorithms obtain the best accuracy on the UCR datasets, they have a huge runtime complexity, which prevents them from scaling to large datasets or to datasets of long time series. Note that the biggest dataset of the UCR archive is composed of fewer than 10,000 training time series. In addition, they have been mainly developed for univariate time series, even if the community is now actively proposing adaptations of these algorithms for multivariate datasets [79].
In the following experiments, the Java implementations from the timeseriesclassification.com website are used. The default parameters have been used, except for the DTW algorithm, where the warping window size is fixed at 25 % of the total length of the time series. The comparison is performed here on the NDVI feature with a 2-day regular temporal sampling. As the algorithms are known to have a huge runtime complexity, we decided to limit the runtime of every algorithm to 24 hours on one thread. Each algorithm is trained on an increasing number of randomly selected training instances, ranging from about 300 to 600,000. If the total computational time exceeds 24 hours for a given training set size, the algorithm is not trained on bigger training sets. Again for computational reasons, the performance evaluations are performed only on a subset of 1,000 test instances, randomly selected among all the test instances extracted at the polygon level as described in Section III-B. Figure 12 shows OA values as a function of the number of training instances for six algorithms. Each curve corresponds to one algorithm: 1-NN combined with DTW in blue, EE in yellow, BOSS in red, ST in purple, COTE in green, and RF in cyan. An incomplete curve means that the algorithm requires more than 24 hours to run.
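The stopping rule of this protocol can be sketched as follows. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical callable that trains one algorithm on n training instances and returns its OA, and the elapsed time is measured after the run rather than by interrupting it, which may differ from how the experiments were actually executed.

```python
import time


def run_with_budget(train_and_score, sizes, budget_seconds, clock=time.perf_counter):
    """Evaluate one algorithm on increasing training set sizes under a time budget.

    Returns a dict {training set size: score}. As soon as one run exceeds the
    budget, its score is discarded and larger sizes are not attempted, which
    yields the incomplete curves described in the text.
    """
    results = {}
    for n in sorted(sizes):
        start = clock()
        score = train_and_score(n)  # hypothetical: train on n instances, return OA
        elapsed = clock() - start
        if elapsed > budget_seconds:
            break
        results[n] = score
    return results
```

For the 24-hour limit used here, `budget_seconds` would be set to 24 * 3600. The `clock` argument is only there so that the timing source can be replaced.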
Fig. 12: Overall Accuracy as a function of the number of training instances for six classification algorithms. The dataset used is composed of the NDVI feature with a 2-day regular temporal sampling. Markers on the curves indicate the points beyond which the algorithms do not scale.
Figure 12 shows that most of the time series classification algorithms become infeasible for a large number of training instances. Neither the ST nor the COTE algorithm scales beyond 300 training instances. The EE and BOSS algorithms stop at about 700 and 18,000 training instances, respectively. Only the DTW and RF algorithms scale up to 620,000 training instances. However, RF clearly outperforms DTW. RF is thus the most accurate classifier that scales to thousands of training instances, and it has therefore been used as the state-of-the-art approach in Section IV. Note that scalable versions of the COTE and EE algorithms may be promising for the classification of large training sets of SITS.