Satellite Image Time Series Clustering via Time Adaptive Optimal Transport

: Satellite Image Time Series (SITS) have become more accessible in recent years and SITS analysis has attracted increasing research interest. Given that labeled SITS training samples are time and effort consuming to acquire, clustering or unsupervised analysis methods need to be developed. Similarity measure is critical for clustering, however, currently established methods represented by Dynamic Time Warping (DTW) still exhibit several issues when coping with SITS, such as pathological alignment, sensitivity to spike noise, and limitation on capacity. In this paper, we introduce a new time series similarity measure method named time adaptive optimal transport (TAOT) to the application of SITS clustering. TAOT inherits several promising properties of optimal transport for the comparing of time series. Statistical and visual results on two real SITS datasets with two different settings demonstrate that TAOT can effectively alleviate the issues of DTW and further improve the clustering accuracy. Thus, TAOT can serve as a usable tool to explore the potential of precious SITS data.


Introduction
Recent years have witnessed a rapid growth of satellite imagery data sources [1,2], thanks to the launch of various new satellite sensors such as the GaoFen series [3], ZiYuan series [4], Sentinel-2 [5], etc. Historical data, for instance, MODIS imagery [6] and Landsat imagery [7], have also been accumulated over decades, making satellite image time series (SITS) data more accessible nowadays. Compared with a single-scene image, SITS records the evolution of land cover types over time and this kind of temporal information is sometimes critical to make land cover types more distinguishable [8][9][10]. In addition, image preprocessing methods required by SITS analytics, such as geometric correction [11,12] and cloud removal [13,14], become more mature than before. Due to the above reasons, SITS analytics has attracted much attention in recent years and many applications have been developed to explore the rich information contained in SITS, for example, classification [15,16], clustering [1,17], class noise reduction [18], trend detection [19], disturbance detection [20], etc.
Among different data mining tasks, clustering [21], or unsupervised classification, seems to gain more importance because the acquisition or update of labeled SITS training samples is difficult [1]. The difficulty comes from multiple aspects: 1. All images contained in a SITS have to be considered simultaneously and a comprehensive judgement depending heavily on an expert's knowledge has to be made. 2. The land cover type of a SITS may change, especially when the time series is long and, thus, the class label itself is hard to decide in some cases.
1. Pathological alignment: A rational alignment between time series should be feature-tofeature and uniformly balanced, but DTW sometimes can lead to pathological alignment as shown by Figure 1a, where one point in a time series is mapped to nearly all points in the other time series, and this type of extreme alignment always ends with undesirable results. 2. Spike noise: DTW is sensitive to spike noise as shown by Figure 1b,c, where a spike noise point easily disarranges the original alignment. Unluckily for SITS, spike noises such as cloud or cloud shadow pixels are ubiquitous and we cannot assume cloud-contaminated pixels will always be detected and removed completely. 3. Limited capacity: The search space of optimal alignment is limited by DTW due to its rules of continuity, monotonicity, and boundary conditions. In theory, a fully-connected alignment can have a larger capacity and a higher flexibility for a more precise similarity. These issues of DTW have long been noticed [39][40][41] but have not been effectively solved because they are directly related to the core definition of DTW [42]. Looking outside the context of DTW based methods, we find Optimal Transport (OT) [43][44][45] has theoretically promising properties to overcome the aforementioned issues of DTW. OT is a classic method to compare probability distributions or histograms [46][47][48] in that OT measures similarities by finding the minimum cost to transform one probability distribution to another. From the perspective of finding optimal cost OT is similar to DTW, but OT has a fuller search space than DTW because it is not constrained by temporal orders of data points or other rules of DTW. For a long time, the utility of OT had been limited by its high computational overhead [49] but recently a series of modifications have been proposed to significantly accelerate its computation. For instance, Sinkhorn distance [50] makes OT dozens of times faster while keeping approximately the same result.
Time series is different from probability distributions so that adaptation is required for OT to cope with time series data. In this paper, we introduce Time Adaptive Optimal Transport (TAOT) [42] as the similarity measure for the clustering of SITS. On the basis of Sinkhorn distance [50], TAOT enables OT to consider both observed values and their corresponding time coordinates simultaneously. In addition, TAOT assumes each data point in a time series to have the same probability and thus further simplifies the calculation. To demonstrate the performance of TAOT, we conduct SITS clustering experiments on two real SITS datasets in two different settings. The results are visually and statistically compared with multiple well-established similarity measures including Euclidean distance and DTW-based methods. To have an intuitive understanding of the mechanism of TAOT, we illustrate the alignments generated by TAOT and analyze their difference from DTW in detail. Other closely related topics such as parameter extracting and limitations of TAOT will also be discussed.
The rest of this article is structured as follows. Section 2 systematically describes time series similarity measures mainly from the perspective of alignment, and introduces OT and TAOT in detail. Section 3 presents the datasets, settings, and statistical and visual results of the SITS clustering experiments. Section 4 compares the alignments generated by TAOT and DTW, and discusses the limitations of TAOT. Finally, Section 5 concludes this paper.

Alignment-Based Similarity Measures
A time series consists of a sequence of chronologically ordered data points. When two time series are compared, the real question can be described as how to align data points from different time series and how to measure the cost of each pair of points. From this perspective, many time series similarity measures can be classified as alignment-based methods, including the widely used Euclidean distance and Dynamic Time Warping (DTW) [26].
Euclidean distance is the most straightforward but still effective method. It enforces a strict one-to-one alignment in terms of time. Euclidean distance is usually viewed as the baseline method in the context of time series similarity measures.
Let A = {a 1 , a 2 , . . . , a i , . . . , a I } and B = {b 1 , b 2 , . . . , b j , . . . , b J } be two time series of length I and J, respectively. The lowercase a i and b j with subscript i and j denotes the i-th and j-th data points of A and B. Let d(i, j) denote the cost between a pair of data points a i and b j . Then the Euclidean distance (ED) between A and B can be defined as: Note that the difference between a i and b j is squared because, in practice, squared Euclidean distance is more frequently used to save the square-root operation.
In contrast with Euclidean distance, DTW employs a more flexible one-to-many alignment that enables DTW to cope with time distortions effectively. There is only one case of one-to-one alignment, but many cases of one-to-many alignments. DTW attempts to find the optimal alignment with minimum accumulation cost. We use a warping path W = w 1 , w 2 , . . . , w k , . . . , w K to represent the alignment between two time series A and B, where a time warp w k = (i, j) denotes a link between point a i and point b j , and the total number of links or point pairs that compose the entire alignment is K. Pairwise cost between each pair of linked data points is added up to the final distance score. In this setting, DTW can be defined as: The DTW problem defined by Equation (2) is essentially a dynamic programming problem and it can be solved by the following recursive formula: where DTW(A i , B j ) is the DTW distance between sub-sequences made up of the first i data points of A and the first j data points of B. DTW(A I , B J ) = DTW(A, B) is the final DTW distance between two entire time series. Figure 2 illustrates the alignments generated by different methods. For each alignment, the x-axis and y-axis indicate the time coordinates of the two time series. An intersection point colored in green, for instance, (i, j), indicates the i-th point in the first time series (colored in red) is aligned to the j-th point in the second time series (colored in blue), and all these green points together forms the warping path W. Figure 2a shows the alignment generated by Euclidean distance and we can observe that links happen only between points with the same time coordinate. Figure 2b shows the alignment generated by DTW where one point can link to multiple temporally adjacent points as long as the rules of DTW defined by Equation (2) are kept. DTW is a well-established time series similarity measure framework under which many variants have been proposed. Variants of DTW can be classified into two major categories. The first category imposes constraints on DTW, for example, window constraint [51,52], weight constraint [34,53], slope constraint [54], warping path length constraint [41], etc. The second category replaces the feature DTW considers, for example, derivative DTW [39], piecewise DTW [55,56], shape context-based DTW [40], etc. In this paper, we employ the Sakoe-Chiba band [57,58] constrained DTW, piecewise DTW, and time-weighted DTW [34] to represent variants of DTW, and their performance will be compared with our proposed method. The Sakoe-Chiba band narrows the warping window of DTW using a constant radius r, which means for any w k = (i, j) in the warping path, |i − j| ≤ r. Piecewise DTW employs piecewise averages to replace the raw time series when calculating DTW. Time-weighted DTW adds a temporal cost to the cost between data points based on the date of each point when calculating DTW. Equation (4) formulates the cost between data points in time-weighted DTW.
where d(i, j) denotes the cost between data points a i and b j . θ is a weight coefficient for the temporal cost. φ(i, j) is a logistic model with steepness α and midpoint β. g(i, j) is the absolute difference between the day of year or Julian day of a i and b j .

Time Adaptive Optimal Transport
Optimal Transport (OT) [43,59], also known as Earth Mover's Distance [46,47] or Wasserstein Distance [60,61], has long been a theoretically ideal tool to compare probability distributions or histograms [48,62]. OT is modeled as the optimal solution to the supplydemand balanced transportation problem. Informally, suppose a series of mines providing iron ores and a series of factories consuming iron ores. The supply of each single mine or the demand of each single factory can be different but the total supply and the total demand are the same. Given the transport cost between each mine to each factory, OT can find the optimal transport allocation plan that leads to the minimum total cost while satisfying global supply-demand balance.
Given two probability distributions denoted as: where a i or b j is the i-th or j-th observed value in its respective distribution, and p a i or p b j is the corresponding probability of a i or b j . Let M = {m ij } be the cost matrix between observed values of A and B, and typically m ij = (a i − b j ) 2 , then the OT distance can be defined as: where 1 d is a d-dimensional vector whose elements are all equal to 1, and U(A, B) is the set containing all possible joint probabilities between A and B, whose row and column sums to P A and P B , respectively. The corresponding optimal transport plan P , is thus defined as: From the definition of OT we can observe that OT has appealing mathematical properties to find the global optimal solution. However, the power comes with a large computational burden. OT has a worst case time complexity of O(d 3 logd) [49] that scales too fast even for a moderate-sized problem. After many attempts, a variant of OT named Sinkhorn distance [50] successfully makes OT dozens of times faster and reactivates the utility of OT. Sinkhorn distance adds an entropic regularization term to the classic OT equation to enforce a simple structure that has a fast solution. Sinkhorn distance is defined by the following equation: where λ is the regularization coefficient. A larger λ leads to a weaker regularization and as λ increases Sinkhorn distance converges to the raw OT distance. As the name implies, the regularized Equation (9) can be solved efficiently by Sinkhorn's fixed point iteration.
From the perspective of finding minimum cost, OT is similar to DTW but OT considers a more general problem and, thus, OT has a larger capacity to find a more correct result. However, the downside is that OT is not originally designed for time series data and to some extent it neglects the nature of time series, namely the temporal order or the time coordinates of data points in a time series. To make OT capable of handling time series data, Time Adaptive Optimal Transport (TAOT) [42] has been proposed. TAOT simultaneously considers the observed values and their corresponding time coordinates when calculating the cost between each pair of data points. Concretely, the cost matrix in Equation (9) is where t i and t j are z-scores of the time coordinates of a i and b j , respectively, and w is a weight parameter for the tradeoff between the two parts. In addition, given the observation that data points in a time series are acquired at different times, TAOT assumes each data point is independent and thus has the same probability of 1/d, where d is the total number of points in a time series as defined in Equation (5). In this setting, the Sinkhorn iteration can be further simplified and the detailed TAOT algorithm can be referred to in [42].
To have an intuitive impression of how OT and TAOT measure the similarity between time series, Figure 2c,d illustrates the alignments generated by OT and TAOT, respectively, where the radius of intersection points indicate the weights of connections. We can observe that both OT and TAOT lead to full connections among all points, although only some connections are strong while others are too weak to notice. Strong connections happen between corresponding peaks in the two time series and this characterizes an accurate alignment. Specifically, OT imposes no penalty on temporal distances and thus in Figure 2c connections (green circles) scatter around the whole area. In contrast, in Figure 2d most connections happen near the diagonal, the place where the temporal distance (difference between x and y coordinates) is small. This observation demonstrates that TAOT considers the temporal gap between points and penalizes connections with long temporal distances. In this setting, we can still find several strong connections (bigger green circles) a bit far from the diagonal, and this is because their numerical similarity dominates the temporal dissimilarity. In Figure 2c we can observe several strong connections (bigger green circles) in the upper left corner, however, due to their long temporal distances they are dismissed in Figure 2d.
A good alignment is expected to be feature-to-feature, where a local peak should be aligned to another temporally adjacent local peak in the other time series, and vice versa for local valleys. If a point is allowed to link with multiple points with different weights, then the weights of connections should consider the trade-off between numerical distance and temporal distance. In this standard, TAOT is better than DTW because TAOT generates more feature-to-feature connections and achieves the balance between numerical distance and temporal distance.

SITS Clustering with TAOT
A SITS clustering method usually requires two components: a similarity measure and a clustering algorithm. We introduce TAOT in this paper to serve as the similarity measure for SITS. As for the clustering algorithm, we choose mini-batch K-Means [63][64][65] for two main reasons. Firstly, mini-batch K-Means is a memory-saving algorithm. Compared with other categories of clustering algorithms such as density-based clustering, spectral clustering, or affinity propagation, mini-batch K-Means does not have to maintain a large distance matrix that scales quadratically with the number of samples. In contrast, mini-batch K-Means only has to maintain the distances to a limited number of cluster centers and the amount of memory scales linearly. Since satellite images always have millions of pixels, the memory-saving property becomes critical for the utility of a clustering algorithm. Secondly, given the same initial condition, the performance of mini-batch K-Means depends solely on the similarity measure and thus it ensures a fair comparison among different similarity measures.
The mini-batch K-Means [65] is a variant of K-Means that employs randomly sampled subsets instead of the entire sample set to update cluster centers during iterations. It drastically reduces the memory cost and the computation required to converge to the final solution, while still attempting to optimize the same global objective function as the raw K-Means. Generally, mini-batch K-Means produces results that are only slightly different than the raw K-Means. In this paper, we use the scikit-learn Python library to implement the mini-batch K-Means algorithm.

Results
To demonstrate the utility of TAOT for the clustering of SITS, in this section we evaluate the clustering accuracy of TAOT on two different datasets: the Reunion Island dataset and the Poyang Lake dataset. TAOT is compared with five well-established methods: Euclidean distance, DTW, Sakoe-Chiba band constrained DTW (SC-DTW), piecewise DTW (PDTW), and time-weighted DTW (TWDTW). To ensure a fair competition, we use the same initial cluster centers and the same random number generator such that the results depend only on the similarity measure.
Since the K-Means clustering algorithm is sensitive to initial cluster centers, and in order to demonstrate the utility of TAOT with different initial conditions, we conduct two groups of experiments for each dataset. The first experiment aims to estimate the best case capacity of these methods by using good initial cluster centers. Concretely, the first experiment employs the average time series of each class to be the initial cluster centers. The first experiment is technically semi-supervised because labeled samples are used for initialization. This scenario happens in real clustering tasks when we still have a few labeled samples. The second experiment uses 100 sets of random initial cluster centers and repeats the clustering procedure 100 times. The second experiment aims to evaluate the robustness under different conditions and is a traditional setting for clustering studies. Detailed results are shown, respectively, by the following Sections 3.2 and 3.3.

Performance Metrics
The performance of each method is evaluated with four most widely used criteria: Adjusted Rand score, Cohen Kappa score, Overall Accuracy, and Weighted F1 score.
The Rand score [66,67] measures the similarity between two data clusterings from the perspective of sample pairs. For a total of n samples, the number of sample pairs is n * (n − 1)/2. For two clusterings C 1 and C 2 , let a be the number of pairs that are in the same clustering in C1 and in the same clustering in C2, and let b be the number of pairs that are in different clusterings in C1 and in different clusterings in C2. In this setting, a + b expresses the number of agreements between C1 and C2. Then the Rand score is defined as the following Equation (10): The adjusted Rand score [68] is an adjusted-for-chance version of the Rand score that ensures a value close to 0 for two random clusterings and exactly 1.0 for two identical clusterings. It is defined as the following Equation (11): The Cohen Kappa score [69][70][71] is a statistic that measures the agreement between two classification results. It is defined as the following Equation (12): where p o is the observed agreement ratio and p e is the expected agreement ratio.
The overall accuracy is a straightforward performance metric, which is defined as the fraction of correct predictions. For a total of n samples, if y i is the real class label of the i-th sample andŷ i is the predicted class label, then the overall accuracy can be formulated as the following Equation (13): The F1 score [72,73] is the harmonic mean of precision and recall of a classifier. It is defined as the following Equation (14): where the precision is the number of true positive predictions divided by the number of all positive predictions, including those false positive predictions. In the context of classification, a true positive prediction means we predict a sample should belong to a certain class and it is true. A false positive prediction means we predict a sample should belong to a certain class but it is false. Precision reflects the ability of a classifier not to label as positive a sample that is negative. The recall is the number of true positive predictions divided by the number of all samples that should have been identified as positive. Recall reflects the ability of a classifier to find all the positive samples. In multi-class cases, the F1 score of each class is averaged and the weighted F1 score finds their average weighted by the number of samples of each class. All of the four criteria described above range in [0, 1] and a higher value indicates a better result.

Reunion Island Dataset
The Reunion Island dataset was released by the Time Series Land Cover Classification Challenge (TiSeLaC) in the 2017 European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2017). The study area covers the entire Reunion Island, a France overseas territory in the southwest Indian Ocean. Figure 3 illustrates the location and overview of the study area. The dataset is generated from an annual time series of 23 Landsat 8 images (30 m spatial resolution and 16 day temporal resolution) acquired in 2014. Figure 4 shows the temporal coverage of these images. Cloudy observations have been filled via pixel-wise multi-temporal linear interpolation on each multi-spectral band (OLI) independently. Each data point in a time series contains a total of 10 features, seven surface reflectance bands (Ultra Blue, Blue, Green, Red, NIR, SWIR1, and SWIR2) plus three complementary radiometric indices (NDVI, NDWI, and BI).  In the experiment we operate in an unsupervised manner and we conduct time series clustering on the testing set, which consists of 17,973 samples with ground truth class labels. To avoid the impact of imbalanced class distribution on the K-Means clustering algorithm and to focus on the comparison of similarity measures, two tiny classes whose sizes are not proportional to the other classes are respectively merged into their most similar classes. Concretely, the built-up class (647 samples) is merged into the urban class (4000 samples), and the other crops class (154 samples) is merged into the grassland class (1136 samples). The dataset is thus distributed over seven classes and Table 1 reports the detailed class distribution. Recall that two groups of experiments with different initial cluster centers are conducted. Table 2 shows the clustering performance where the average time series of each class in the training set is used for initialization. The training set was released along with the testing set by TiSeLaC and it is used only in the first experiment for setting initial cluster centers. The performance is evaluated with four criteria described in the above Section 3.1. We can observe from Table 2 that TAOT consistently achieves the best results in terms of the four criteria, and TAOT wins by a relatively large margin of 3.7%, 9.6%, 8.7%, and 9.0% compared to the second best result for each respective criterion. Visually, Figure 5 shows the clustering maps generated by different similarity measures and the ground truth reference map. We can observe that the forest class (purple) is well recognized by all the five methods. However, Euclidean distance leads to significant confusion between sparse vegetation (orange) and water (gray). DTW, SC-DTW, and TWDTW lead to significant confusion between rocks and bare soil (blue) and water (gray). PDTW causes an obvious decrease of urban and built-up (green). Some of these issues also exist in the clustering map generated by TAOT, but they are alleviated in varying degrees. Table 3 shows the clustering performance with 100 sets of random initial cluster centers on the Reunion Island dataset. The clustering is repeated 100 times and the average performance with standard deviation is reported. We can observe that in terms of all the four criteria, TAOT once again achieves the best results by margins of 1.3%, 1.4%, 1.3%, and 1.8% compared to the second best result, respectively.    Figure 6 shows the distribution of each performance metric over the 100 repetitions performed on the Reunion Island dataset. We can observe that TAOT achieves the highest median, mean, maximum, and minimum in all the four box-plots.

Poyang Lake Dataset
The Poyang Lake dataset consists of 23 cloud-free Landsat 8 land surface reflectance images acquired between 2014 and 2016. The study area is located in Poyang Lake, in the Jiangxi Province of China. Figure 7 illustrates the location and overview of the study area. We first mended cloud and cloud shadow pixels in each raw Landsat 8 image by the method proposed in [14] and then selected 23 clean images to construct the dataset. Figure 8 shows the temporal coverage of the selected images. We use the FROM-GLC 2015 (Finer Resolution Observation and Monitoring of Global Land Cover) classification product [74] (http://data.ess.tsinghua.edu.cn/ (accessed on 20 January 2021)) as ground truth reference. To further ensure the reliability of the reference set, we morphologically eroded the class labels with two iterations to keep only the central pixels of each land patch, because the central pixels are more likely to be correctly classified when generating the classification product.  The dataset contains six land cover classes: forest, impervious surface, cropland, grassland, bareland, and water. Table 4 reports the detailed class distribution. Table 5 shows the clustering performance where average time series of each class in the reference set is used for initialization. The performance is also evaluated with the four criteria. We can observe that TAOT outperforms the other methods by margins of 1.8%, 1.8%, 1.5%, and 1.3% compared to the second best result on each respective criterion. Visually, Figure 9 shows the clustering maps generated by different similarity measures and the morphologically eroded reference map. TAOT generates more precise contours for the majority of land cover patches.    Table 6 shows the clustering performance with 100 sets of random initial cluster centers on the Poyang Lake dataset. The clustering is also repeated 100 times with the average performance and the standard deviation reported. We can observe that TAOT outperforms the other methods by margins of 2.0%, 0.9%, 0.7%, and 0.6% compared to the second best result on each criterion. Figure 10 shows the distribution of each performance metric over the 100 repetitions performed on the Poyang Lake dataset. For the adjusted Rand score, TAOT achieves the highest median, mean, maximum, and minimum by a relatively large margin. For the Cohen Kappa score, TAOT has the highest mean, maximum, and minimum, while TWDTW has a higher median (TAOT: 0.668, TWDTW: 0.679). For overall accuracy, TAOT has the highest mean and maximum. SC-DTW has a slightly larger minimum (TAOT: 0.695, SC-DTW: 0.696) and TWDTW has a larger median (TAOT: 0.750, TWDTW: 0.760). For the weighted F1 score, TAOT has the highest mean and maximum, while SC-DTW has the highest minimum (TAOT: 0.731, SC-DTW: 0.734) and median (TAOT: 0.795, SC-DTW: 0.800). The outliers illustrated by circles below the boxes are mainly caused by poor random initializations of cluster centers.

Extraction of Parameters
When parameters are involved in any similarity measure in the experiment, we search for the optimal values. Table 7 lists the optimal parameter values used in this paper. Euclidean distance and DTW are both parameter-free. SC-DTW has one parameter r, which is the radius of the Sakoe-Chiba band. PDTW also has one parameter n, which is the number of pieces. TWDTW has three parameters: the temporal weight coefficient θ, the steepness α, and the midpoint β of the logistic weight model. TAOT involves two parameters: the regularization coefficient λ and the temporal weight w. For SC-DTW and PDTW, as they are reference methods, we use the global optimal parameters found by linear search on the testing sets to show their best possible performance. For TWDTW, a linear search of θ with different combinations of α (α = 0.1 or 0.2) and β (β = 50, 100, 150, or 200) is conducted on the testing sets to find the optimal parameters. For TAOT, we find the optimal λ and w by grid search on the training sets. If a full grid search is too time-consuming, we can search for w first with a coarse interval of λ, and then search for an optimal λ with a dense interval given fixed w. Figure 11 illustrates the extraction process of λ when the optimal w is decided.

Alignments Generated by TAOT
A major motivation of using TAOT instead of DTW-based methods is to avoid the issues of pathological alignment and spike noise when coping with SITS data. Throughout this paper, similarity measure methods of time series are introduced from the perspective of alignments, and thus an intuitive way to see whether TAOT can solve these issues is to compare the alignments generated by different methods on real SITS datasets. Figure 12 illustrates three sets of alignments generated by DTW and TAOT on the TiSeLaC Reunion Island dataset. Recall that TAOT produces a fully-connected alignment represented by a transport matrix, but a large portion of the matrix cells have small values that are close to zero. To extract the essential part of the alignment and make it comparable with DTW, we filter the alignment of TAOT by weights of connections and only significant connections are kept. Two filter thresholds are adopted, where the smaller one (0.005) gives an overall impression of the alignment generated by TAOT, and the larger one (0.01) enables a comparison with DTW. In addition, we use the widths of lines to reflect the weights of connections. In pathological alignments a point can be mapped onto an excessively large subsection of other points and distort the results. From all three sets of alignments shown in Figure 12 we find that DTW leads to pathological alignments locally in varying degrees, while TAOT leads to more balanced alignments. In Figure 12a,c we observe the appearance of spike noise and how it causes distortions to DTW. In contrast, TAOT is not obviously affected by the spike noise. These observations demonstrate our hypothesis that TAOT can alleviate the pathological alignment issue of DTW and it is not sensitive to spike noise. In addition, we observe that TAOT connections of points only happen in moderate temporal neighborhoods, which proves that TAOT achieves a trade-off between numerical and temporal similarities.

Capacity of TAOT
Another motivation of using TAOT is that TAOT derives a fully-connected alignment and thus theoretically has a larger capacity for a more precise result. The second row of Figure 12 shows that although weak connections are not drawn, each point still involves multiple connections with different weights. Rather than the many-to-many alignment generated by TAOT, DTW generates either one-to-many or many-to-one alignment. As a consequence, TAOT has a larger search space and more flexibility. Statistically, we have tested the capacity of TAOT in the previous experiments by giving each method an ideal initial condition and observing how well they could perform. On the Reunion Island dataset, TAOT outperforms the second best method by a large margin of 9.6% in terms of the Cohen Kappa score, and on the Poyang Lake dataset the margin is 2.1%. This observation proves that TAOT can reach an obviously higher limit than the other well-established methods.

Limitations of TAOT
While TAOT has advantages on accuracy, its computational efficiency still has significant room for improvement. In theory, given two time series of length N, the time complexity of TAOT is approximately O(N 2 log N) [75], which is slower than DTW-based methods (O(N 2 )) and Euclidean distance (O(N)). Note that the time complexity of naive optimal transport is O(N 3 log N) [49] and TAOT is already a faster variant of optimal transport.
In practice, Figure 13a,b shows the distribution of computational times for each method over the 100 repetitions performed on the two datasets, respectively. The times are measured on a configuration with 8 CPU cores of 2.5 GHz and 16-GB memory. We observe that Euclidean distance is the fastest due to its simplicity. Among DTW-based methods, PDTW is relatively rapid because it reduces the length of time series by piecewise averaging. TWDTW is the slowest on both the two datasets. TAOT runs moderately slower than DTW and SC-DTW on the Reunion Island dataset, which coincides with the theoretical analysis above. However, TAOT runs faster than DTW and SC-DTW on the Poyang Lake dataset. This might be because TAOT involves many matrix operations whose efficiency is better optimized when the size of matrices scale. Further acceleration of TAOT might be achieved through decomposing a multi-dimensional OT problem into one-dimensional ones and using one-dimensional results to compose the high-dimensional result [76]. Another issue of TAOT is that sometimes it will encounter the machine precision limit when λ increases beyond a problem-dependent value λ max , beyond which some elements of e −λM are represented as zeroes. In this case we have to use a smaller λ that may lead to a stricter regularization than desired.

Conclusions
In this paper, we have introduced time adaptive optimal transport (TAOT) as a similarity measure tool for satellite image time series (SITS), with the aim of avoiding the issues of DTW-based methods, namely the pathological alignment, sensitivity to spike noise, and limitation on capacity. TAOT is derived from the classic optimal transport framework which has long been a powerful tool to compare probability distributions or histograms. In addition, TAOT further considers temporal similarities to make it suitable for SITS data. In order to demonstrate the properties of TAOT, we have presented SITS clustering experiments on two real SITS datasets in two different settings. TAOT consistently outperformed the other methods in terms of four well-established accuracy criteria. To gain a deeper understanding of TAOT, we have illustrated the alignments generated by TAOT and compared them with DTW. TAOT is able to generate a more balanced fully-connected alignment to precisely capture the similarity between time series, and thus TAOT can serve as a usable tool for the analysis of complex SITS data.