Cluster Analysis on Locally Asymptotically Self-similar Processes with Known Number of Clusters

We conduct cluster analysis on a class of locally asymptotically self-similar stochastic processes, with multifractional Brownian motion as a representative example. When the true number of clusters is assumed to be known, a new covariance-based dissimilarity measure is introduced, from which we obtain approximately asymptotically consistent clustering algorithms. In simulation studies, clustering data sampled from multifractional Brownian motions with distinct functional Hurst parameters illustrates the approximate asymptotic consistency of the proposed algorithms. Clustering the equity index returns and sovereign CDS spreads of global financial markets provides a successful real-world application.


Introduction
Learning stochastic processes is an important area of machine learning, as a considerable number of machine learning problems involve time as a component (Cotofrei, 2002; Harms et al., 2002a; Jin et al., 2002a,b; Keogh and Kasetty, 2003). Due to the nature of the time component, stochastic processes often possess path features (Lin et al., 2002). This additional information brought by the time component makes machine learning on stochastic processes more difficult than machine learning on other types of objects. A number of new techniques have been developed alongside the study of such machine learning problems (Li et al., 1998; Bradley and Fayyad, 1998; Keogh et al., 2001; Java and Perlman, 2002). In this paper, we study a particular type of learning problem on stochastic processes: clustering. Among all machine learning tools, cluster analysis is a common technique for unsupervised learning. It aims to detect the hidden patterns of a set of objects by grouping objects in the same cluster when they are more similar to each other than to those in other clusters. Compared to other machine learning problems, clustering stochastic processes is more sensitive to the dissimilarity measure, which heavily depends on the objects' data type (Gan et al., 2007; Aghabozorgi et al., 2015; Sardá-Espinosa, 2017). Clustering techniques may vary drastically depending on whether the objects' data type is a vector, a matrix or a sequence, etc. Being a subset of clustering problems, clustering stochastic processes has received growing attention in diverse areas that aim to discover patterns in data indexed by "time". The stochastic process is a common type of dynamic data that naturally arises in many different scenarios.
Such data have been broadly explored in information technology (Slonim et al., 2005; Jain et al., 1999), signal and image processing (Yairi et al., 2001; Honda et al., 2002; Guha et al., 2003; Truppel et al., 2003; Rubinstein et al., 2013), geology (Juozapavičius and Rapsevicius, 2001; Harms et al., 2002b), biology and medical research (Bar-Joseph et al., 2002; Damian et al., 2007; Zhao et al., 2014; Jääskinen et al., 2014), robotics (Oates, 1999), and finance (Mantegna, 1999; Gavrilov et al., 2000; Tino et al., 2000; Fu et al., 2001; Pavlidis et al., 2006; Bastos and Caiado, 2014; Ieva et al., 2016), etc. Unlike random-vector data, stochastic-process data are sampled from processes' distributions, which possess not only finite-dimensional distribution features but also infinite-length path features, such as stationarity, ergodicity, seasonality and the Markov property.
In the problem of clustering stochastic processes, the dimensionality of a process indexed by time is extremely high compared to that of commonly observed data. Therefore, new challenges arise if one applies the conventional approaches for clustering static data to stochastic processes; such approaches usually become computationally forbidding (Sardá-Espinosa, 2017). We list at least three issues that may occur when performing these clustering algorithms.
To overcome the above issues, we assume in this framework that some key path features of the observed stochastic processes are known (see Assumption (A) in the next section). This is not a strong constraint, since in many fields the observed paths have been shown to be sampled from well-known process distributions. For example, dynamics in financial markets (equity returns and interest rates) can be described by geometric Brownian motions (GBm); long-range dependent or self-similar phenomena are often modeled by fractional Brownian motions (fBm). Contrary to conventional clustering approaches, clustering based on the path features of the processes largely removes the noise by capturing the observations' path features. Therefore, a nice dissimilarity measure should be one that well characterizes the path features. In this context, "nice" refers to the property that the computational complexity and the prediction errors caused by over-fitting are expected to be largely reduced. Moreover, subject to known path features, consistency of the clustering algorithm (Khaleghi et al., 2012, 2016) may be obtained. Among all the stochastic process features, we focus on characterizing the property of ergodicity in this paper. However, similar analysis can be made for other process features such as seasonality, the Markov property and the martingale property.
Ergodicity (Krengel, 1985) is a typical feature possessed by a number of well-known processes applied to financial time series analysis. It is tightly related to other process features such as stationarity, long-term memory and self-similarity (Grazzini, 2012; Samorodnitsky, 2004). Khaleghi et al. (2016) and Peng et al. (2019) showed that both distribution ergodicity and covariance ergodicity lead to asymptotically consistent algorithms for clustering processes. In this paper, we go a step further and relax the condition of ergodicity to "local asymptotic ergodicity" (Boufoussi et al., 2008), obtaining so-called "approximately asymptotically consistent algorithms" for clustering processes having such a path property. This setting covers a large class of processes that includes the well-known Lévy processes, some self-similar processes and some multifractional processes (Boufoussi et al., 2008).
Each problem of clustering stochastic processes involves handling data, defining clusters, measuring dissimilarities and finding groups efficiently; we therefore organize the paper as follows. Section 2 is devoted to introducing a class of locally asymptotically self-similar processes to which our clustering approaches apply. In Section 3, a covariance-based dissimilarity measure is suggested, and in Section 4 the approximately asymptotically consistent algorithms for clustering both offline and online datasets are designed. A simulation study is performed in Section 5, where the algorithms are applied to cluster multifractional Brownian motions (mBm), an excellent representative of the class of locally asymptotically self-similar processes. In Section 6, we perform cluster analysis of real-world global financial market data using our clustering algorithms and provide economic implications. Section 7 concludes our findings.

A Class of Locally Asymptotically Self-similar Processes
Self-similarity is a process (path) feature. Self-similar processes are a class of processes that are invariant in distribution under suitable scaling of time (Samorodnitsky and Taqqu, 1994; Embrechts and Maejima, 2000, 2002). These processes have successfully modeled various time-scaling random phenomena observed in high-frequency data, especially geological and financial data.
Definition 1 (Self-similar process) A stochastic process {Y^(H)_t}_{t≥0} (here the time index set is not necessarily continuous) is self-similar with self-similarity index H ∈ (0, 1) if, for all n ∈ N := {1, 2, . . .}, all t_1, . . . , t_n ≥ 0 and all c > 0,

(Y^(H)_{ct_1}, . . . , Y^(H)_{ct_n}) law= (c^H Y^(H)_{t_1}, . . . , c^H Y^(H)_{t_n}),   (2)

where law= denotes equality of the joint probability distributions of two random vectors.
Taking t = 0 on both sides of (2) (with n = 1, t_1 = 0) yields Y^(H)_0 law= c^H Y^(H)_0 for all c > 0, hence

Y^(H)_0 = 0 almost surely.   (3)

Self-similar processes are generally not distribution stationary, but their increment processes can be distribution stationary (any finite subset's joint distribution is invariant under time shifts) or covariance stationary (the mean and covariance structure exist and are invariant under time shifts). From now on we restrict our setting to stochastically continuous-time self-similar processes only (Embrechts and Maejima, 2002), i.e., processes that are continuous in probability at each t ≥ 0; this assumption is weaker than almost sure continuity. For u > 0, we call {Y(t)}_t = {X(t + u) − X(t)}_t the increment process (or simply the increments) of {X(t)}_t. If all the increment processes of a continuous-time self-similar process are covariance stationary, its covariance structure can be given explicitly as below.

Theorem 1 Let {X^(H)_t}_{t≥0} be a self-similar process with index H ∈ (0, 1) and with covariance stationary increments. Then

E(X^(H)_t) = 0 for all t ≥ 0,   (4)

and

Cov(X^(H)_s, X^(H)_t) = (Var(X^(H)_1)/2)(s^{2H} + t^{2H} − |t − s|^{2H}) for all s, t ≥ 0.   (5)

Theorem 1 can be obtained by replacing the distribution stationary increments in Theorem 1.2 in Embrechts and Maejima (2000) with covariance stationary increments. We briefly provide the proof below.
Proof 1 We first prove (4). On the one hand, using the fact that the increments of {X^(H)_t}_t are covariance stationary together with (3), we have

E(X^(H)_{2t}) − E(X^(H)_t) = E(X^(H)_t) − E(X^(H)_0) = E(X^(H)_t), i.e., E(X^(H)_{2t}) = 2E(X^(H)_t).   (6)

On the other hand, since {X^(H)_t}_t is self-similar with index H,

E(X^(H)_{2t}) = 2^H E(X^(H)_t).   (7)

Putting together (6), (7) and the fact that H < 1, we necessarily have E(X^(H)_t) = 0 for all t ≥ 0. (4) is proved.
For proving (5) we first observe that, for s, t ≥ 0, since the process is zero-mean,

Cov(X^(H)_s, X^(H)_t) = (1/2)(Var(X^(H)_s) + Var(X^(H)_t) − Var(X^(H)_t − X^(H)_s)).   (8)

Next, from the facts that {X^(H)_t}_t is self-similar with index H, that its increments are covariance stationary, and (4), we obtain

Var(X^(H)_t) = t^{2H} Var(X^(H)_1) and Var(X^(H)_t − X^(H)_s) = |t − s|^{2H} Var(X^(H)_1).   (9)

The covariance stationarity yields Var(X^(H)_1) < +∞. (5) then follows from (8) and (9). Theorem 1 is thus proved.
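To make Theorem 1 concrete, the sketch below (assuming a Gaussian self-similar process, i.e., fBm, which is not required by the theorem itself) simulates zero-mean Gaussian vectors whose covariance is prescribed by (5) and checks that the empirical covariance matches the formula. The function name `fbm_cov` and all numerical choices are illustrative, not from the paper.

```python
import numpy as np

def fbm_cov(s, t, H, var1=1.0):
    # Covariance (5): Var(X_1)/2 * (s^{2H} + t^{2H} - |t - s|^{2H}).
    return 0.5 * var1 * (s**(2 * H) + t**(2 * H) - abs(t - s)**(2 * H))

H = 0.7
times = np.arange(1.0, 11.0)                   # t = 1, ..., 10
C = np.array([[fbm_cov(s, t, H) for t in times] for s in times])

# Draw zero-mean Gaussian paths with exactly this covariance (Cholesky).
rng = np.random.default_rng(0)
L = np.linalg.cholesky(C)
paths = rng.standard_normal((100_000, len(times))) @ L.T

emp_cov = paths.T @ paths / paths.shape[0]     # empirical covariance
max_err = float(np.max(np.abs(emp_cov - C)))   # should be small
```

For H = 1/2 the formula reduces to Cov(X_s, X_t) = min(s, t), the Brownian-motion covariance, which is a quick sanity check on (5).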
We highlight that, contrary to Theorem 1.2 in Embrechts and Maejima (2000), the covariance stationary increment process of {X^(H)_t}_t in Theorem 1 is not necessarily distribution stationary. This fact inspires us to relax the distribution stationarity of the processes to covariance stationarity in the following Assumption (A). Below we introduce a natural extension of self-similar processes, the so-called locally asymptotically self-similar processes (Boufoussi et al., 2008; Falconer, 2002, 2003).

Definition 2 (Locally asymptotically self-similar process) A continuous-time stochastic process {Z^(H(t))_t}_{t≥0}, with its index H(•) being a continuous function valued in (0, 1), is called locally asymptotically self-similar if, for each t ≥ 0, there exists a non-degenerate self-similar process {Y^(H(t))_u}_{u≥0} such that

{(Z^(H(t+τu))_{t+τu} − Z^(H(t))_t)/τ^{H(t)}}_{u≥0} f.d.d.−−−→ {Y^(H(t))_u}_{u≥0} as τ → 0+,   (10)

where the convergence f.d.d.−−−→ is in the sense of all the finite-dimensional distributions. The process {Y^(H(t))_u}_{u≥0} is called the tangent process of Z at t (Falconer, 2002, 2003). Moreover, it is shown (see Theorem 3.8 in Falconer (2003)) that if {Y^(H(t))_u}_u is unique in law, then it is self-similar with index H(t) and has distribution stationary increments. Local asymptotic self-similarity thus generalizes conventional self-similarity, in the sense that any non-degenerate self-similar process with distribution stationary increments is locally asymptotically self-similar and its tangent process is itself. Further, in a weaker sense, it is not difficult to show the following:

Proposition 1 Let {Z^(H)_t}_{t≥0} be a continuous-time locally asymptotically self-similar process with constant index H ∈ (0, 1) and with covariance stationary increments. Then its tangent processes share equal mean and covariance functions.

Proof 2 Since {Z (H )
t } t ≥0 is locally asymptotically self-similar, by definition at each t ≥ 0 there exists a tangent process {Y (H ) u } u≥0 such that Next we show {Y (H ) u } u≥0 's mean and covariance structure are uniquely determined. Since {Z (H ) t } t ≥0 has covariance stationary increments, for any u ≥ 0, τ > 0, define the scaled increments Again by the fact that {Z (H ) t } t ≥0 has covariance stationary increments, then using (4) in Theorem 1 we obtain and by (5) in Theorem 1, we have for u 1 , u 2 ≥ 0 and τ > 0, which is independent of τ. It follows from (11), (12) and (13) that This implies that all tangent processes of {Z (H ) t } t ≥0 possess zero-mean and equal covariance functions. By the way it is easy to derive from (14) that these tangent processes have covariance stationary increments. Proposition 1 is proved.
We remark from Proposition 1 that the tangent processes of {Z^(H)_t}_{t≥0} may not be unique in law, but their finite-dimensional subsets have unique first- and second-order moments.
Based on the above discussion, throughout this paper we assume that the observed dataset is sampled from a known number κ of continuous-time processes satisfying the following condition:

Assumption (A): The processes are locally asymptotically self-similar, and their tangent processes' increment processes are autocovariance ergodic.
Here autocovariance ergodicity means that the sample autocovariance functions of the covariance stationary process converge in squared mean to the autocovariance functions of the process; i.e., a zero-mean (which is the case for the tangent processes' increments) continuous-time process {X(t)}_{t≥0} is autocovariance ergodic if it is covariance stationary and the convergence (15) holds. Note that the convergence (15) yields (16). Thus Assumption (A) says that the observed processes' tangent processes have covariance stationary increments. Well-known examples of locally asymptotically self-similar processes satisfying Assumption (A) are fractional Brownian motions and multifractional Brownian motions (Mandelbrot and Van Ness, 1968; Peltier and Lévy-Véhel, 1995; Benassi et al., 1997).
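As a hedged illustration of autocovariance ergodicity, the snippet below uses a stationary AR(1) process, a standard covariance-ergodic example chosen for simplicity (it is not one of the processes studied in the paper), and checks that the sample autocovariances approach the true autocovariance γ(k) = φ^k/(1 − φ²).

```python
import numpy as np

# Stationary AR(1): X_t = phi * X_{t-1} + eps_t with unit-variance noise;
# true autocovariance gamma(k) = phi^k / (1 - phi^2).
rng = np.random.default_rng(1)
phi, n = 0.5, 400_000
eps = rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0] / np.sqrt(1 - phi**2)   # start in the stationary law
for i in range(1, n):
    x[i] = phi * x[i - 1] + eps[i]

def sample_autocov(x, k):
    # Sample autocovariance (1/(n-k)) * sum_i X_i X_{i+k} (zero-mean case).
    m = len(x) - k
    return float(x[:m] @ x[k:k + m] / m)

gamma = lambda k: phi**k / (1 - phi**2)
errs = [abs(sample_autocov(x, k) - gamma(k)) for k in range(4)]
```

The errors shrink as the path length n grows, which is exactly the convergence of sample autocovariances to true autocovariances that Assumption (A) requires of the tangent processes' increments.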
The assumption of covariance stationarity inspires us to introduce a covariance-based dissimilarity measure between the sample paths, in order to capture the level of differences between the two corresponding covariance stationary processes. Later we show that the assumption of autocovariance-ergodicity is sufficient for the clustering algorithms to be approximately asymptotically consistent.

Covariance-based Dissimilarity Measure between Autocovariance Ergodic Processes
Let Z be a process satisfying Assumption (A). Denote by Y its tangent process (see (10)) and by X an increment process of Y; i.e., for some h > 0, X_t = Y_{t+h} − Y_t for all t ≥ 0. By Assumption (A), X is autocovariance ergodic. Since we will show that clustering distinct Z's is approximately asymptotically equivalent to clustering the corresponding increment processes X's, the dissimilarity measures for the Z's can be constructed based on those of the autocovariance ergodic processes X's. From (4) we know that the process X is zero-mean. Our first main result is then the introduction of the following covariance-based dissimilarity measure between autocovariance ergodic processes (Peng et al., 2019).

Definition 3
The covariance-based dissimilarity measure d between two discrete-time stochastic processes X^(1), X^(2) (in fact X^(1), X^(2) denote two covariance structures; each class may contain different process distributions) is defined by

d(X^(1), X^(2)) := Σ_{m=1}^{+∞} Σ_{l=1}^{+∞} w_m w_l ρ(Cov(X^(1)_{l...l+m−1}), Cov(X^(2)_{l...l+m−1})),   (17)

where:
• For any integers l ≥ 1, m ≥ 0, X^(j)_{l...l+m−1} is the shortcut notation for the row vector (X^(j)_l, . . . , X^(j)_{l+m−1}).
• The distance ρ between two equal-sized covariance matrices M_1, M_2 is defined to be the Frobenius norm of M_1 − M_2. Recall that for a matrix A = (a_{ij})_{M×N}, its Frobenius norm is defined by ‖A‖_F := (Σ_{i=1}^M Σ_{j=1}^N |a_{ij}|²)^{1/2}.
• The sequence of positive weights {w_j}_{j≥1} should be chosen such that d(X^(1), X^(2)) < +∞, i.e., the series on the right-hand side of Eq. (17) converges. The choice of {w_j}_j will be discussed in the forthcoming simulation study in Section 5.
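A minimal sketch of (17) for two fractional Gaussian noises (increments of fBm, whose covariance matrices are Toeplitz in the autocovariance and therefore do not depend on the starting index l): the double series is truncated at M terms, and the weights w_j = 1/(j²(j+1)²) anticipate the choice discussed in Section 5. The truncation level is an illustrative assumption.

```python
import numpy as np

def fgn_autocov(k, H):
    # Autocovariance of fBm increments (fractional Gaussian noise).
    k = abs(k)
    return 0.5 * (abs(k - 1)**(2 * H) - 2 * k**(2 * H) + (k + 1)**(2 * H))

def cov_matrix(m, H):
    # Toeplitz covariance matrix of m consecutive increments; by
    # stationarity it does not depend on the starting index l.
    lags = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    return np.array([[fgn_autocov(k, H) for k in row] for row in lags])

def rho(M1, M2):
    return np.linalg.norm(M1 - M2, "fro")       # Frobenius distance

def d_trunc(H1, H2, M=20):
    # Truncated version of (17); the l-sum factors out by stationarity.
    w = lambda j: 1.0 / (j**2 * (j + 1) ** 2)
    w_l = sum(w(l) for l in range(1, M + 1))
    return w_l * sum(w(m) * rho(cov_matrix(m, H1), cov_matrix(m, H2))
                     for m in range(1, M + 1))

d_diff = d_trunc(0.3, 0.7)   # distinct Hurst indexes
d_same = d_trunc(0.5, 0.5)   # identical covariance structures
```

As expected, d is zero for identical covariance structures and strictly positive for distinct Hurst indexes.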
Remark 1 It is important to note that Definition 3 only defines the dissimilarity measure between discrete-time stochastic processes (time series). This is the most common object studied in the literature on clustering stochastic processes. Considering time series is sufficient for practical applications for at least two reasons. First, a continuous-time path is not observable in practice. Second, any (stochastically) continuous-time process can be approximated by its discrete-time paths.
In what follows, by sample path we mean a finite-length subsequence of a discretized stochastic process.
Thanks to the autocovariance ergodicity of the sampled processes, the dissimilarity measure d can be estimated by the empirical dissimilarity measure d̂ below.

Definition 4 Given two processes' discrete-time sample paths x_j = (X^(j)_1, . . . , X^(j)_{n_j}) for j = 1, 2, let n = min{n_1, n_2}; then the empirical covariance-based dissimilarity measure between x_1 and x_2 is given by

d̂(x_1, x_2) := Σ_{m=1}^{m_n} Σ_{l=1}^{n−m+1} w_m w_l ρ(ν(x^(1)_{l...l+m−1}), ν(x^(2)_{l...l+m−1})),   (18)

where:
• m_n (≤ n) is the largest dimension of the covariance matrices considered by d̂; in this framework we take m_n = ⌊log n⌋, i.e., the floor of log n (Khaleghi et al., 2016; Peng et al., 2019);
• ν(x^(j)_{l...l+m−1}) denotes the empirical estimator of the covariance matrix Cov(X^(j)_{l...l+m−1}), built from outer products of the form (x^(j)_{l...l+m−1})^T x^(j)_{l...l+m−1}, where (•)^T denotes the transpose of a matrix.
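The following sketch implements one plausible reading of (18). The estimator `nu` averages outer products over all length-m windows of a zero-mean path, which is a consistent choice under covariance stationarity but not necessarily the paper's exact estimator; since this `nu` does not depend on l, the l-sum factors out.

```python
import numpy as np

def nu(x, m):
    # Empirical covariance of a zero-mean path: average the outer
    # products of all length-m sliding windows (assumed estimator).
    W = np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), m)
    return W.T @ W / W.shape[0]

def rho(M1, M2):
    # Frobenius distance between two equal-sized covariance matrices.
    return np.linalg.norm(M1 - M2, "fro")

def d_hat(x1, x2):
    # Empirical covariance-based dissimilarity, cf. (18), with
    # m_n = floor(log n) and weights w_j = 1/(j^2 (j+1)^2).
    n = min(len(x1), len(x2))
    m_n = max(1, int(np.floor(np.log(n))))
    w = lambda j: 1.0 / (j**2 * (j + 1) ** 2)
    w_l = sum(w(l) for l in range(1, n + 1))  # l-sum factors out for this nu
    return w_l * sum(w(m) * rho(nu(x1, m), nu(x2, m)) for m in range(1, m_n + 1))

rng = np.random.default_rng(2)
x_a = rng.standard_normal(5000)                 # white noise
x_b = rng.standard_normal(5001)                 # same covariance structure
x_c = np.repeat(rng.standard_normal(2500), 2)   # strong lag-1 dependence
same, diff = d_hat(x_a, x_b), d_hat(x_a, x_c)   # expect same < diff
```

Paths with the same covariance structure yield a small d̂ driven only by estimation noise, while paths with a genuinely different autocovariance (here, strong lag-1 dependence) yield a clearly larger value.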
Remark 2 Since X is autocovariance ergodic, every empirical covariance matrix ν(X_{l...l+m−1}) is a consistent estimator of the covariance matrix Cov(X_{l...l+m−1}) under the Frobenius norm and in probability, i.e.,

ρ(ν(X_{l...l+m−1}), Cov(X_{l...l+m−1})) −−→ 0 in probability, as the path length n → +∞.   (19)

Further, the fact that both d and d̂ satisfy the triangle inequality implies that d̂ is a consistent estimator of d. The proof is quite similar to that of Lemma 1 in Peng et al. (2019), except that here the convergence holds in probability. This ergodicity and these triangle inequalities are the keys to demonstrating that our algorithms in the next section are approximately asymptotically consistent. We state them in the following remarks.
Remark 3 For every pair of paths x_1 = (X^(1)_1, . . . , X^(1)_{n_1}) and x_2 = (X^(2)_1, . . . , X^(2)_{n_2}), sampled from two autocovariance ergodic processes X^(1) and X^(2) respectively, we have

d̂(x_1, x_2) −−→ d(X^(1), X^(2)) in probability, as n_1, n_2 → +∞,   (20)

and

d̂(x_i, X^(j)) −−→ d(X^(i), X^(j)) in probability, as n_i → +∞, for i, j ∈ {1, 2},   (21)

where the dissimilarity measure d̂(x_i, X^(j)) between the sample path x_i and the stochastic process X^(j) is defined as in (18), with ν(x^(j)_{l...l+m−1}) replaced by Cov(X^(j)_{l...l+m−1}).

Remark 4 Thanks to their definitions, the triangle inequality holds for the covariance-based dissimilarity measure d in (17), as well as for its empirical estimate d̂ in (18). Therefore, for arbitrary processes X^(i), i = 1, 2, 3, and arbitrary random vectors x_i, i = 1, 2, 3,

d(X^(1), X^(2)) ≤ d(X^(1), X^(3)) + d(X^(2), X^(3)) and d̂(x_1, x_2) ≤ d̂(x_1, x_3) + d̂(x_2, x_3).   (22)

In the next section we define a proper covariance-based dissimilarity measure between locally asymptotically self-similar processes satisfying Assumption (A), based on the dissimilarity measure d.

Covariance-based Dissimilarity Measure between Locally Asymptotically Self-similar Processes
Now under Assumption (A ), we study the asymptotic relationship between the locally asymptotically self-similar process {Z (H (t )) t } t in (10) and its tangent process' increment process. The following result reveals the relationship between local asymptotic self-similarity and covariance stationarity.
Proposition 2 Let {Z^(H(t))_t}_{t≥0} be a locally asymptotically self-similar process satisfying Assumption (A). For each h > 0 and each t ≥ 0, the increment process {X^(H(t))_{u,h}}_{u≥0} := {Y^(H(t))_{u+h} − Y^(H(t))_u}_{u≥0} of the tangent process {Y^(H(t))_u}_{u≥0} (see (10)) is an autocovariance ergodic process, and

{(Z^(H(t+τ(u+h)))_{t+τ(u+h)} − Z^(H(t+τu))_{t+τu})/τ^{H(t)}}_{u≥0} f.d.d.−−−→ {X^(H(t))_{u,h}}_{u≥0}, as τ → 0+.   (23)
Proof 3 Let us fix h > 0 and pick any finite discrete time index set T ⊂ [0, +∞). Under Assumption (A), the f.d.d. convergence (10) holds, which implies (24), where we adopt the notation (a_u, b_u)_{u∈{u_1,...,u_N}} to denote the vector (a_{u_1}, b_{u_1}, . . . , a_{u_N}, b_{u_N}). It follows from (24) and the continuous mapping theorem that (25) holds. (23) then results from (25) and the fact that the choice of T is arbitrary. Under Assumption (A), {X^(H(t))_{u,h}}_u is autocovariance ergodic, hence Proposition 2 is proved.
From a statistical point of view, the sequence on the left-hand side of (23) cannot straightforwardly serve to estimate the distribution of the right-hand side {X^(H(t))_{u,h}}_u, because the functional index H(•) on the left-hand side is not observable in practice. To overcome this inconvenience, we note that (23) can be interpreted as follows: when τ is sufficiently small, the approximation (26) holds, where K is an arbitrary positive integer. Statistically, (26) says that, given a discrete-time path of Z observed with mesh ∆t, each localized increment path of length K + 1 (defined in (27)) is approximately distributed as an autocovariance ergodic increment process of the self-similar tangent process at the corresponding time i∆t. This fact drives us to define the empirical covariance-based dissimilarity measure between two paths z_1 and z_2 of locally asymptotically self-similar processes as

d̂*(z_1, z_2) := (1/(n_min − K − 1)) Σ_{i=1}^{n_min−K−1} d̂(z_1^(i), z_2^(i)),   (28)

where n_min denotes the minimum of the two path lengths and z_1^(i), z_2^(i) are the localized increment paths defined as in (27). Heuristically speaking, d̂* compares the local covariance structures of the two paths around each observation time. The following observation is straightforward.
Remark 5 Based on the definition (28) and Remark 2, d̂* is also a (weakly) consistent estimator of d (see Remark 3), and it also satisfies the triangle inequalities as in Remark 4.
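A hedged sketch of the localized measure: each path is first differenced to obtain increments as in (27), and an empirical covariance dissimilarity in the spirit of (18) is averaged over windows of K + 1 consecutive increments. The window length, the non-overlapping windowing, and the estimator inside are simplifying assumptions of this sketch, not the paper's exact constants.

```python
import numpy as np

def nu(x, m):
    # Empirical covariance of a zero-mean path from length-m windows.
    W = np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), m)
    return W.T @ W / W.shape[0]

def d_hat(x1, x2):
    # Empirical covariance-based dissimilarity, cf. (18).
    n = min(len(x1), len(x2))
    m_n = max(1, int(np.floor(np.log(n))))
    w = lambda j: 1.0 / (j**2 * (j + 1) ** 2)
    return sum(w(m) * np.linalg.norm(nu(x1, m) - nu(x2, m), "fro")
               for m in range(1, m_n + 1))

def d_star(z1, z2, K=50):
    # Localized increments (cf. (27)), then d_hat averaged over windows
    # of K + 1 consecutive increments (an assumed, simplified averaging).
    x1, x2 = np.diff(z1), np.diff(z2)
    n = min(len(x1), len(x2))
    return float(np.mean([d_hat(x1[i:i + K + 1], x2[i:i + K + 1])
                          for i in range(0, n - K, K)]))

rng = np.random.default_rng(3)
z1 = np.cumsum(rng.standard_normal(2000))              # Brownian-like path
z2 = np.cumsum(rng.standard_normal(2000))              # same increment law
z3 = np.cumsum(2.0 * np.repeat(rng.standard_normal(1000), 2))  # different law
```

Paths whose increments share one covariance structure (z1, z2) stay close under this measure, while a path with differently scaled, dependent increments (z3) is clearly farther away.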
Through employing the dissimilarity measure d * on locally asymptotically self-similar processes, we obtain the so-called "approximately asymptotically consistent algorithms", which are introduced in the following section.

Offline and Online Algorithms
Note that the covariance-based dissimilarity measure d̂* defined in (28) aims to cluster covariance structures, not process distributions; therefore the ground truth of the clustering should be based on covariance structures. We thus define the ground truth as follows (Peng et al., 2019).

Definition 5 (Ground truth of covariance structures) Let G = {G_1, . . . , G_κ} be a partitioning of N := {1, 2, . . .} into κ disjoint subsets G_k, k = 1, . . . , κ, such that the means and covariance structures of x_i, i ∈ N, are identical if and only if i ∈ G_k for some k = 1, . . . , κ. Such a G is called the ground truth of covariance structures. For N ≥ 1, we denote by G|_N the restriction of G to the first N sequences: G|_N := {G_k ∩ {1, . . . , N} : k = 1, . . . , κ}.

The processes Z satisfying Assumption (A) are generally not covariance stationary; however, their tangent processes' increments X are covariance stationary. In view of (23) and (26), clustering the processes Z is approximately equivalent to clustering the X's, based on the covariance structure ground truth of the latter increments. Below we introduce algorithms aiming to approximate the covariance structure ground truth of X.
Depending on how the information is collected, processes clustering problems involve two separate model settings: the offline setting and the online setting. In the offline setting, the sample size and each path's length are time-independent. In the online setting, however, they may both grow with time. As stated in Khaleghi et al. (2016), using the offline algorithm in the online setting, by simply applying it to the entire data observed at every time step, does not result in an asymptotically consistent algorithm. As a result, we treat clustering offline and online datasets as two separate approaches and study them separately. Accordingly, the approximate asymptotic consistency will be described in Theorem 2 and Theorem 3 below, for the offline and online clustering algorithms respectively. Our offline and online clustering algorithms below are obtained by replacing the dissimilarity measures in Algorithms 1 and 2 in Peng et al. (2019) with the measure d̂* defined in (28).

For the offline setting, we cluster observed data using Algorithm 1 below. It is a 2-point initialization centroid-based clustering approach. Under the distance d̂* defined in (28), the algorithm picks the farthest two points to be the first two cluster centers (Lines 1-2). Then, iteratively, each next cluster center is chosen to be the point farthest from all the previously assigned cluster centers (Lines 3-5). Finally, the algorithm assigns each remaining point to the nearest cluster (Lines 7-10).
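The offline procedure just described can be sketched generically for any precomputed dissimilarity matrix D (here a toy distance on the real line rather than d̂*, so the clustering logic can be checked in isolation):

```python
import numpy as np

def offline_cluster(D, kappa):
    # Lines 1-2: the farthest pair becomes the first two centers.
    N = D.shape[0]
    c1, c2 = np.unravel_index(np.argmax(D), D.shape)
    centers = [int(c1), int(c2)]
    # Lines 3-5: each next center is the point farthest from all centers.
    while len(centers) < kappa:
        dist_to_centers = D[:, centers].min(axis=1)
        centers.append(int(np.argmax(dist_to_centers)))
    # Lines 7-10: assign every point to its nearest center.
    labels = np.argmin(D[:, centers], axis=1)
    return centers, labels

# Toy dissimilarity: three groups of points on a line, D[i, j] = |x_i - x_j|.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.2])
D = np.abs(np.subtract.outer(x, x))
centers, labels = offline_cluster(D, 3)
```

In the paper's setting, D would be filled with pairwise d̂* values between the observed sample paths; the farthest-point initialization only requires the N(N − 1)/2 upper-triangular entries.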
The strategy for clustering online data is presented in Algorithm 2 as follows. At each time t, first update the collection of sample paths S(t) (Lines 1-2). Then, for each j = κ, . . . , N(t), use Algorithm 1 to group the first j paths in S(t) into κ clusters (Lines 6-7); for each cluster, the center is selected as the point having the smallest index within that cluster, and these indexes are ordered increasingly (Line 8); calculate the minimum inter-cluster distance γ_j and update the normalization factor η (Lines 9-11). Finally, every sample path in S(t) is assigned to the "nearest" cluster, based on the weighted combination of the distances (given in Line 15) between this observation and the candidate cluster centers obtained at each iteration on j (Lines 14-17).
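A hedged sketch of this online procedure on a precomputed dissimilarity matrix: the weighting of Line 15 is paraphrased here as w_j γ_j / η times the distance to the j-th iteration's candidate centers, so this is an illustrative reading rather than the paper's verbatim Algorithm 2.

```python
import numpy as np

def offline_cluster(D, kappa):
    # Algorithm 1 on a precomputed dissimilarity matrix.
    c1, c2 = np.unravel_index(np.argmax(D), D.shape)
    centers = [int(c1), int(c2)]
    while len(centers) < kappa:
        centers.append(int(np.argmax(D[:, centers].min(axis=1))))
    return np.argmin(D[:, centers], axis=1)

def online_cluster(D, kappa, w=lambda j: 1.0 / (j**2 * (j + 1) ** 2)):
    N = D.shape[0]
    cands, gammas, weights = [], [], []
    for j in range(kappa, N + 1):                       # Lines 6-7
        labels = offline_cluster(D[:j, :j], kappa)
        # Line 8: per-cluster center = smallest index, ordered increasingly.
        centers = sorted(int(np.flatnonzero(labels == k).min())
                         for k in range(kappa))
        # Lines 9-11: minimum inter-cluster distance gamma_j.
        gamma = min(D[a, b] for a in centers for b in centers if a != b)
        cands.append(centers); gammas.append(gamma); weights.append(w(j))
    eta = sum(wj * g for wj, g in zip(weights, gammas))  # normalization
    scores = np.zeros((N, kappa))                        # Lines 14-17
    for centers, g, wj in zip(cands, gammas, weights):
        scores += (wj * g / eta) * D[:, centers]
    return np.argmin(scores, axis=1)

x = np.array([0.0, 5.0, 10.0, 0.1, 5.1, 10.1, 0.2])
labels = online_cluster(np.abs(np.subtract.outer(x, x)), 3)
```

Ordering the candidate centers by smallest index keeps the cluster labels aligned across the iterations on j, which is what allows the weighted scores to be accumulated in a single matrix.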

Computational Complexity and Consistency of the Algorithms
We describe the computational complexity in terms of the number of computations of the distance ρ. For Algorithm 1, the 2-point initialization requires N(N − 1)/2 calculations of d̂*. From (28) we see that each calculation of d̂* consists of n_min − K − 1 computations of d̂. By (18), d̂ can be obtained by computing K − ⌊log K⌋ + 1 distances ρ. Therefore the total number of computations of ρ is not greater than N(N − 1)(n_min − K − 1)(K − ⌊log K⌋ + 1)/2. For Algorithm 2, since at each step j ∈ {κ, . . . , N − κ + 1} Algorithm 1 is run on j observations, the total number of computations of ρ admits a similar polynomial bound; this is acceptable in practice, and it is quite competitive with existing algorithms for clustering stochastic processes.

Now we introduce the notion of approximate asymptotic consistency. Fix a positive integer K. Let Z^(1), Z^(2) be two locally asymptotically self-similar processes with respective functional indexes H_1(•), H_2(•). Also let x_1, x_2 be, respectively, the increments of their sample paths z_1, z_2, defined as in (27). For j = 1, 2, we define the normalized increments by taking the linear transformation (29). Then, using (23), we obtain (30), where X^(H_j(t_i)) denotes a discrete-time path of the increments of a self-similar process with self-similarity index H_j(t_i). Fix L ≥ 1. For each empirical dissimilarity measure d̂*(z_1, z_2), we correspondingly define d̄*(z_1, z_2) by (31). d̄*(z_1, z_2) is another dissimilarity measure between z_1, z_2, which has a tight relationship to the distance d̂ between their tangent processes' increments. Indeed, by using (30) and the continuous mapping theorem, it is easy to derive the following result.

Proposition 3
For any N independent sample paths z_1, z_2, . . . , z_N, the convergence (32) holds, where x_1, x_2 are the increments of the tangent processes corresponding to z_1, z_2 respectively.
In particular, when ∆t = 1, d̄*(z_1, z_2) = d̂*(z_1, z_2). In this sense d̄*(z_1, z_2) "approximates" d̂*. d̄* cannot be observed in practice, since the functional indexes of the locally asymptotically self-similar processes are supposed to be unknown. In what follows, d̄* only serves to define the approximate asymptotic consistency. In this notion, "approximate" means that the problem of clustering locally asymptotically self-similar processes is "approximately" equivalent to the problem of clustering their tangent processes. Now we state the consistency theorems. Through Theorems 2 and 3 below we show that Algorithms 1 and 2 are both approximately asymptotically consistent. Their proofs are inspired by those in Khaleghi et al. (2016) and Peng et al. (2019). However, different from the consistency theorems in Khaleghi et al. (2016) and Peng et al. (2019), the convergences in Theorems 2 and 3 hold in a weaker sense: in probability, not almost surely.
Theorem 2 Under Assumption (A), Algorithm 1 is approximately asymptotically consistent for clustering the offline sample paths S = {z_1, . . . , z_N}. This means: if d̂* is replaced with d̄* in Algorithm 1, then the output clusters converge to the covariance structure ground truths of the increments of the corresponding tangent processes, denoted S̃ = {x_1, . . . , x_N}, in probability, as ∆t → 0 and n_min := min{n_1, . . . , n_N} → +∞. More formally, (33) holds, where f is given in Algorithm 1 and G_S̃ denotes the ground truth of the covariance structures that generate the set of paths S̃.

Proof 4 First, we show the convergence (34), where G = {C_1, . . . , C_κ} denotes any κ-partition of {1, . . . , N}. Indeed, if the estimated clustering satisfies f(S, κ, d̄*) = G, then the samples in S that are generated by the same cluster in G are closer to each other under d̄* than to the rest of the samples; hence we can write (35). It follows from (35) and (32) that (34) holds.

Next we show that Algorithm 1 is asymptotically consistent for clustering S̃ under d̂: this is the convergence (36). Denote by δ_min the minimum non-zero dissimilarity between the processes with different covariance structures, as defined in (37). Fix ε ∈ (0, δ_min/4) and let δ > 0 be arbitrarily small. Since there is only a finite number N of samples, by Remark 3 there is n_0 such that for n_min > n_0 we have (38). On the one hand, applying the triangle inequalities (see Remark 5), we obtain (39); then by (39) and the fact that 2ε < δ_min/2, the inclusion (40) holds. On the other hand, applying the triangle inequalities (see Remark 5) and the fact that 2ε < δ_min/2, we also obtain (41). It follows from (40), (41) and (38) that (42) holds for n_min > n_0. (42) tells us that the sample paths in S̃ that are generated by the same covariance structures are closer to each other than to the rest of the sample paths. Then, by (42), for n_min > n_0, each sample path is "close" enough to its cluster center, as stated in (43), where the κ cluster centers' indexes c_1, . . . , c_κ are determined by Algorithm 1 through its farthest-point initialization. These c_1, . . . , c_κ index sample paths generated by different process covariance structures. Then, by (42), each remaining sample path is assigned to the cluster center corresponding to a sample path generated by the same process covariance structure. Finally, (36) results from (42) and (43), and (33) is proved by combining (34) and (36).
Below we state the consistency theorem concerning the online clustering algorithm.

Theorem 3 Under Assumption (A), Algorithm 2 is approximately asymptotically consistent for clustering the online sample paths S(t). That is, for each N ≥ 1, if d̂* is replaced with d̄* in Algorithm 2, then the first N output clusters converge to the covariance structure ground truths of the increments of the corresponding tangent processes S̃(t)|_N := {x^t_1, . . . , x^t_N} in probability, as ∆t → 0 and t → +∞. In other words, (44) holds, where f(S(t), κ, d̄*)|_N denotes the clustering f(S(t), κ, d̄*) restricted to the first N sample paths in S(t). We also recall that G_{S̃(t)}|_N is the restriction of G_{S̃(t)} to the first N sample paths {x^t_1, . . . , x^t_N} in S̃(t) (see Definition 5).
Proof 5 Let us fix N ≥ 1. First, similarly to the derivation of (34) in the proof of Theorem 2, we can obtain (45). It then remains to prove the convergence (46). Let δ > 0 be arbitrarily small and fix ε ∈ (0, δ_min/4), where δ_min is defined as in (37). Introduce the quantity defined in (47). For k ∈ {1, . . . , κ}, denote by s_k the index of the first path in S̃(t) sampled from X^(k), i.e., s_k is given by (48). For j ≥ 1, denote by S(t)|_j the first j sample paths contained in S(t). Then, with m defined through (49), we see that m ≤ N(t) and that S(t)|_m contains paths sampled from all κ distinct processes (covariance structures). Using the fact that Σ_{j=1}^{+∞} w_j < +∞, we can find a fixed value J ≥ m such that (50) holds. Recall that in the online setting the i-th sample path's length n_i(t) grows with time t. Therefore, by Remark 5, for every j ∈ {1, . . . , J} there exists some T_1(j) > 0 such that (51) holds. Since J ≥ m, by Theorem 2, for every j ∈ {m, . . . , J} there exists T_2(j) > 0 such that Algorithm 1 applied to S(t)|_j recovers the ground truth with high probability, as stated in (52). From Algorithm 2 (Lines 9, 11) we see how γ^t_j and η_t are computed. Below we provide upper bounds, in probability, for η_t and γ^t_j.
Upper bound for γ^t_j: similarly to the derivation of (41), we use the triangle inequalities (Remark 5) and (37) to obtain (53). Since the clusters are ordered in the order of appearance of the distinct process covariance structures, we have x^t_{c^j_k} = x^t_{s_k} for all j ≥ m and k ∈ {1, . . . , κ}, where we recall that the index s_k is defined in (48). It follows from (53), the fact that ε < δ_min/4 and (51) that (54) holds. For j ∈ {1, . . . , N(t)}, by (52), the triangle inequality, (47) and (54), we obtain the desired upper bound for γ^t_j.

Upper bound for η_t: by (54), and recalling that N(t) > J for t ≥ T, for every k ∈ {1, . . . , κ} we can write the corresponding bound.

Efficiency Improvement: log * -transformation
In this section, we show the performance of the proposed clustering approaches (Algorithm 1 and Algorithm 2) on clustering simulated multifractional Brownian motions (mBm). The mBm is a paradigmatic example of locally asymptotically self-similar processes. Its tangent process is fractional Brownian motion (fBm), which is self-similar. Since the covariance structure of the fBm depends nonlinearly on its self-similarity index, we can apply the so-called log*-transformation to the covariance matrices of its increments, in order to improve the efficiency of the clustering algorithms. More precisely, in our clustering algorithms we replace in d̂* the coefficients of all the covariance matrices and their estimators with their log*-transformations, i.e., for x ∈ R, as given in (67). By applying this transformation, the observations assigned to any two clusters by the covariance structure ground truth become well separated, and thus the clustering algorithms become more efficient. For more detail on this efficiency improvement approach we refer the readers to Section 3 in Peng et al. (2019).
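The definition (67) of log* is not reproduced above; as a hypothetical stand-in with the same intent (a sign-preserving, odd, entrywise transform that compresses large covariance entries), one may take log*(x) = sign(x) log(1 + |x|). This exact form is an assumption of this sketch, not the paper's definition.

```python
import numpy as np

def log_star(x):
    # Hypothetical stand-in for the log* transform of (67):
    # sign-preserving, odd, and compressing large covariance entries.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

# Entrywise transform of a covariance matrix before computing rho in d*.
M = np.array([[1.0, -0.5], [-0.5, 2.0]])
M_star = log_star(M)
```

Any candidate for log* should at least be odd and vanish at zero, so that transformed covariance matrices remain symmetric and zero entries stay zero; the assertions below check exactly those properties.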
It can be seen from Boufoussi et al. (2008) that the mBm is locally asymptotically self-similar and satisfies Assumption (A ). Its tangent process at t is an fBm {B^(H(t))(u)}_u with index H(t), where C_{H(t)} is a deterministic function depending only on H(t). We select Wood-Chan's simulation method (Wood and Chan, 1994; Chan and Wood, 1998) to simulate the mBm paths, and use the MATLAB implementation of Wood-Chan's method in FracLab (version 2.2) by INRIA in our simulation study 1. This method outputs independent sample paths of the following form, where n ≥ 1 is an input parameter. We now select w_j = 1/(j^2(j+1)^2) so that d (see (17)) is a convergent series and hence well-defined. To show that this choice is reasonable, we consider the stochastic process sampled at some given mesh ∆ > 0. For each t_0 = 0, ∆, 2∆, . . ., the increments of the tangent process (see the right-hand side of (26)) are given, for i = 0, 1, . . ., by the corresponding expression. As increments of an fBm, X^(H(t_0))(·) is autocovariance ergodic. Moreover, for i, j = 0, 1, . . ., using the definition of log* (see (67)), the covariance function of fBm (see (5)) and the fact that sup_{s≥0} H(s) ≤ 1, we obtain the bound (72). From the definition of d in (17) and from (72), we see that, taking w_j = 1/(j^2(j+1)^2), for any t_0, t_0' = 0, ∆, 2∆, . . ., the series defining d is bounded by a finite constant and is therefore < +∞.
Therefore we have shown that the choice w_j = 1/(j^2(j+1)^2) makes d well-defined.
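The summability of this weight sequence can also be checked directly: since 1/(j(j+1)) = 1/j − 1/(j+1), the terms telescope and the series ∑_{j≥1} 1/(j^2(j+1)^2) sums in closed form to π²/3 − 3. A small numerical check:

```python
import math

def w(j):
    # Weight sequence w_j = 1 / (j^2 (j+1)^2), summable over j >= 1.
    return 1.0 / (j * j * (j + 1) * (j + 1))

# Telescoping: w_j = 1/j^2 + 1/(j+1)^2 - 2/(j(j+1)),
# so the series sums to pi^2/6 + (pi^2/6 - 1) - 2 = pi^2/3 - 3.
partial_sum = sum(w(j) for j in range(1, 100001))
closed_form = math.pi ** 2 / 3.0 - 3.0
```

The partial sum over the first 10^5 terms already agrees with the closed form to well beyond 9 decimal places, since the tail is of order 1/(3N^3).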

Synthetic Datasets
To construct a collection of mBm paths with distinct functional indexes H(•), we set the functional form of H(•) in each of the predetermined clusters. Two functional forms of H(•) are selected for the synthetic data study: • Case 1 (Monotonic function): The general form is taken to be (73), where Q > 0 is a fixed integer and different values of h correspond to different clusters. We then predetermine 5 clusters separated by distinct values of h. In this study we set Q = 100, h 1 = −0.4, h 2 = −0.
• Case 2 (Periodic function): The general form is taken to be (74), where different values of h lead to different clusters. Specifically, we take Q = 100, h 1 = 0.4, h 2 = 0.2, h 3 = 0, h 4 = −0.2 and h 5 = −0.4. The trajectories of the corresponding 5 functional forms of H (•) are illustrated in the top graph of Figure 2.
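The exact expressions (73) and (74) are not reproduced above. As a purely hypothetical stand-in, the sketch below implements one monotonic and one periodic family of Hurst functions H(•) taking values in (0, 1), parameterized by h and Q as in the text; the specific functional forms are our assumption, not the authors' formulas:

```python
import math

Q = 100  # scale parameter, as in the text

def H_monotonic(t, h):
    # Hypothetical monotonic Hurst function (stand-in for (73)):
    # moves smoothly from about 0.5 - h to about 0.5 + h over [0, Q],
    # staying inside (0, 1) for |h| <= 0.4.
    return 0.5 + h * math.tanh(4.0 * (t / Q - 0.5))

def H_periodic(t, h):
    # Hypothetical periodic Hurst function (stand-in for (74)):
    # oscillates around 0.5 with amplitude |h| <= 0.4.
    return 0.5 + h * math.sin(2.0 * math.pi * t / Q)
```

With h ∈ {−0.4, −0.2, 0, 0.2, 0.4}, the five trajectories in each family are well separated, mimicking the five predetermined clusters.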
We demonstrate the approximate asymptotic consistency of the proposed algorithms by conducting both offline and online clustering analyses. Denote by n(t) the number of observed data points in each time series and by N(t) the number of time series paths.
In the offline setting, the number of observed paths does not depend on time t; however, the path lengths do. To construct the offline datasets, we proceed as follows: 1. For i = 1, . . . , 5, simulate 20 mBm paths in group i (corresponding to h i), each of length 305, so that the total number of paths is N = 100. More explicitly, the dataset is the matrix whose i-th row (x_{i,1}, x_{i,2}, . . . , x_{i,305}) is an mBm discrete-time path, and for i = 1, . . . , 5 the data from the i-th group are the rows 20(i − 1) + 1 through 20i. 2. At each t = 1, . . . , 100, we observe the first n(t) = 3t + 5 values of each path. The online dataset does not require the observed paths to have equal length and can be regarded as an extension of the offline case. Introducing the online dataset aims at mimicking the situation where new time series are observed as time goes on. In our simulation study, we construct the online datasets as follows: 1. For i = 1, . . . , 5, simulate 20 mBm paths in group i (corresponding to h i), each of length 305 (see (75) and (76)).
where the x_{k,l}'s are the (k, l)-coefficients in S^(i) given in (76).
• N_i(t) := 6 + ⌊(t − 1)/10⌋ denotes the number of paths in the i-th group, where ⌊·⌋ denotes the floor function. That is, starting from 6 paths in each group, 1 new path is added to each group as t increases by 10.
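The two sampling schedules above, n(t) = 3t + 5 for the observed path lengths and N_i(t) = 6 + ⌊(t − 1)/10⌋ for the number of paths per group, can be sketched directly (the function names are ours):

```python
def n_obs(t):
    # Offline setting: at time t = 1, ..., 100 the first n(t) = 3t + 5
    # values of each path are observed; at t = 100 this equals the full
    # path length 305.
    return 3 * t + 5

def paths_per_group(t):
    # Online setting: N_i(t) = 6 + floor((t - 1) / 10); each group starts
    # with 6 paths and gains one new path every 10 time steps.
    return 6 + (t - 1) // 10
```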
Since at each time t the ground-truth covariance structure is known, we can evaluate the clustering performance in terms of the so-called misclassification rate (Peng et al., 2019). Heuristically speaking, the misclassification rate is calculated by averaging the proportion of mis-clustered paths over the scenarios.
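The precise definition is given in Peng et al. (2019); as an illustrative sketch (our implementation, not the authors' code), the rate for a known number of clusters κ can be computed by minimizing the error over all identifications of predicted cluster labels with ground-truth groups, since cluster labels are only defined up to permutation:

```python
from itertools import permutations

def misclassification_rate(true_labels, pred_labels, kappa):
    # Fraction of mis-clustered paths, minimized over all bijections
    # between the kappa predicted cluster indices and the kappa
    # ground-truth group indices.
    n = len(true_labels)
    best_errors = n
    for perm in permutations(range(kappa)):
        errors = sum(t != perm[p] for t, p in zip(true_labels, pred_labels))
        best_errors = min(best_errors, errors)
    return best_errors / n
```

Enumerating permutations is feasible for small κ (here κ = 5, i.e. 120 permutations); for larger κ an assignment-problem solver (Hungarian algorithm) would replace the loop.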

Experimental Results
We demonstrate the asymptotic consistency of our clustering algorithms by computing the misclassification rates on the simulated offline and online datasets. More details on this misclassification rate are provided in Section 4 of Peng et al. (2019).
Below we summarize the simulation study results.

Case 1 (Monotonic function):
When the H (•)'s are chosen to be 5 monotonic functions of the form (73) (see the top graph in Figure 1), the bottom graph in Figure 1 illustrates the behavior of the misclassification rates of Algorithm 1 applied in the offline setting (solid line) and Algorithm 2 applied in the online setting (dashed line). From this result we observe the following: (1) Both algorithms appear to be consistent in their respective settings as the time t increases, in the sense that the corresponding misclassification rates decrease to 0.
(2) Clustering mBms is asymptotically equivalent to clustering the increments of their tangent processes.
(3) The online algorithm seems to have overall better performance: its misclassification rates are 5%-10% lower than those of the offline algorithm. The reason may be that at early time steps the differences among the H (•)'s are not significant. Unlike the offline clustering algorithm, the online one is flexible enough to capture these small differences.

Case 2 (Periodic function):
The same convergent behavior is found in the case of the periodic functional form of H (•) specified in (74). The corresponding trajectories are illustrated in the top graph of Figure 2. The clustering performance shown in the bottom graph of Figure 2 indicates the following: (1) The misclassification rates of both clustering algorithms show a generally declining trend as time increases.
(2) As the differences among the values of the periodic functions H (•) go up and down, the misclassification rates go down and up accordingly.
(3) The online clustering algorithm has overall worse performance than the offline one. This may be because, starting from t = 20, the differences among the H (•)'s become significantly large. In this situation the offline clustering algorithm can better capture these differences, since it has a larger sample size (20 paths in each group) than the online one.
Finally, note that in the simulation study, for each pair of paths of length n(t), we have taken K = n(t) − 2 and L = 1 in d*; however, any other value of K could be taken. We provide easily readable and editable MATLAB code for the proposed algorithms and for replicating the simulation study. All the code used in this section is publicly available online 2.
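The precise dissimilarity d* (with its parameters K and L) is defined in (17) of the paper and is not reproduced here. The following is only a simplified, hypothetical stand-in that conveys the underlying idea: compare the empirical covariance structures of two paths, with contributions damped by the summable weights w_j = 1/(j^2(j+1)^2) used in the text. The function names and the l1-form of the comparison are our assumptions:

```python
import numpy as np

def empirical_autocov(x, max_lag):
    # Empirical autocovariances gamma(0), ..., gamma(max_lag) of a
    # one-dimensional sample path.
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.dot(xc[: n - k], xc[k:]) / (n - k)
                     for k in range(max_lag + 1)])

def cov_dissimilarity(x, y, max_lag):
    # Hypothetical covariance-based dissimilarity: a weighted l1-distance
    # between the empirical autocovariances of two paths, weighted by
    # w_j = 1/(j^2 (j+1)^2).
    w = [1.0 / (j * j * (j + 1) * (j + 1)) for j in range(1, max_lag + 2)]
    ax = empirical_autocov(x, max_lag)
    ay = empirical_autocov(y, max_lag)
    return float(sum(wj * abs(a - b) for wj, a, b in zip(w, ax, ay)))
```

In the spirit of the simulation study, one would take max_lag of the order of K = n(t) − 2 for paths of length n(t).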
Real World Application: Clustering Global Financial Markets

Motivation
In this section, we motivate the application of the proposed clustering algorithms to real world datasets by performing cluster analysis on global equity markets. In the past, stock returns of countries in the same region were commonly believed to have more similar patterns and higher correlation. The reason is obvious: economic entities within closer geographical distance potentially trade more with each other and are thus influenced by similar economic factors. However, more recent empirical evidence from financial markets reveals that globalization is breaking the geographical barrier and creating common economic factors. As a result, global economic clusters have switched from "geographical centroids" to "emerging/developed economies centroids". That is, emerging markets demonstrate increasingly similar financial market patterns and correlations (Demirer et al., 2018), whereas developed economic entities share increasing financial market similarity (Ang and Longstaff, 2013). The idea of modeling stock returns by locally asymptotically self-similar processes (mBm) was pioneered by Bianchi and Pianese (2008). This time-varying self-similar feature of financial market stochastic processes is further confirmed by Bianchi et al. (2013) and Peng and Zhao (2018). We therefore consider data from global financial markets to be well suited as underlying stochastic processes for our proposed clustering algorithms. We further examine the connection among global financial markets by asking whether economic entities are better co-behaved and clustered by geographical distribution or by development level.

Data and Methodology
Two asset classes from global financial markets are used in our empirical cluster analysis. The equity index return captures the upside (growth) characteristics, and the sovereign CDS spread captures the downside (credit risk) characteristics, of the underlying economic entities: • Equity index returns: We cluster the global equity indexes according to the empirical time-varying covariance structure of their performance, using Algorithms 1 and 2 as proposed in this paper. The index constituents of the MSCI ACWI (All Country World Index) are selected as the underlying stochastic processes. Each index is a realized path representing the historical monthly total returns (with dividends) of the underlying economic entity. The MSCI ACWI is the leading global equity market index and covers about 85% of the market capitalization in each market 3.
• Sovereign CDS spreads: We cluster the sovereign credit default swap (CDS) spreads of the same economic entities. A sovereign CDS is an insurance-like product that provides default protection on bonds and other debts issued by a government, nation or large economic entity. Its spread reflects the cost of insuring the exposure to the possibility of a sovereign default or restructuring. We select the 5-year sovereign CDS spread as the indicator of sovereign credit risk, as 5-year products usually have the best liquidity on the market. The same economic entities as in the equity index analysis are selected. Our data sources are Bloomberg and Markit.
Empirical studies suggest that these index returns exhibit the "long memory" path feature; hence they can be modeled by self-similar processes or, more generally, by locally asymptotically self-similar processes such as fBms and mBms (see e.g. Bianchi and Pianese (2008), Bianchi et al. (2013) and Peng and Zhao (2018)). Therefore, similarly to Section 5, we cluster the increments of the index returns with the log*-transformed dissimilarity measure. We select equity indexes and sovereign CDS spreads covering 23 developed economic entities and 24 emerging markets, grouped into four regions: Americas, EMEA (Europe, Middle East and Africa), Pacific and Asia. The detailed constituents can be found in Table 1 or at https://www.msci.com/acwi.
We construct both offline and online datasets of monthly returns. The offline dataset of global equity index monthly returns starts in June 2005, includes the 2007-2008 financial crisis period, and ends in November 2019. We include the global stock market crisis period to incorporate the potential credit risk contagion effect in our clustering analysis. The online dataset starts in January 1989, covering the 1997 Asian financial crisis, the 2003 dot-com bubble and the 2007 subprime mortgage crisis, and ends in November 2019. In the offline setting, each stochastic path in the dataset has 174 time series observations, and there are 47 paths to be clustered. In the online setting, the longest time series have 371 observations and the shortest have 174. The online dataset begins with 33 economic entities and ends with 47 paths.
For monthly averages of sovereign CDS spreads, we remove the Netherlands, Qatar, Singapore, Greece and the United Arab Emirates from the sample of economic entities due to insufficient observations. We therefore end up with 42 economic entities in the CDS spread clustering analysis. The offline dataset starts in June 2005 and ends in November 2019, and each economic entity has 174 observations.

Clustering Results
We compare the clustering outcomes for both offline and online datasets with the partitions suggested by region (4 groups: Americas, Europe & Middle East, Pacific and Asia) and by development level (2 groups: emerging markets and developed markets). The clustering factor (region or development level) with the lower misclassification rate contributes the most to the partition of the economic entities. That is, we examine whether region or development level better differentiates financial market behaviors. Table 2 shows that the misclassification rates for development level are significantly and consistently lower than those for geographical region, in both the offline and online settings and for both the equity and credit markets. This comparison suggests that, for the underlying economic entities, development level dominates geographical distance in determining financial market characteristics. The best results are obtained by clustering the offline dataset with the offline algorithm and the online dataset with the online algorithm when dividing emerging and developed markets. Misclassification rates decrease by 30% to 50% when clustering by development level rather than by region. Table 3 presents the misclassified economic entities from the cluster analysis grouping emerging and developed markets. For the equity index dataset, misclassification concentrates on clustering developed economic entities into the emerging market group. Stock index returns from the Austrian, Finnish and Portuguese markets are clustered as emerging markets by both the offline and online algorithms. The misclassification within the developed group is rather random. For the sovereign CDS spreads, misclassification concentrates on clustering emerging economic entities into the developed market group. Sovereign CDS spreads of Chile, China (Mainland), the Czech Republic, Korea, Malaysia, Mexico, Poland and Thailand are consistently misclassified into the developed market group, probably due to their low sovereign credit risk.
From both the equity and credit markets, we show that clustering financial time series by development level outperforms clustering by region. This empirical result further supports the view that economic globalization is breaking down geographic barriers and enhancing the comovement of financial market behaviors according to economic strength and status.

Conclusions and Future Prospects
We introduce the problem of clustering locally asymptotically self-similar processes. A new covariance-based dissimilarity measure is proposed, from which we obtain approximately asymptotically consistent clustering algorithms for both the offline and online settings. We have shown that the recommended algorithms are competitive for at least three reasons: (1) Given their flexibility, our algorithms are applicable to clustering distribution stationary ergodic processes with finite variances; autocovariance ergodic processes; and locally asymptotically self-similar processes whose tangent processes have autocovariance ergodic increments. The multifractional Brownian motion (mBm) is an excellent example of the latter class.
(2) Our algorithms are efficient in terms of their computational complexity. A simulation study is performed on clustering mBm paths, and the results show that both the offline and online algorithms are approximately asymptotically consistent.
(3) Our algorithms are successfully applied to clustering real world financial time series (equity returns and sovereign CDS spreads) by development level and by region. The outcomes are consistent with observed financial market behavior.
Finally we list the following open problems which could be left for future research.
(1) The clustering framework proposed in our paper focuses only on the case where the true number of clusters κ is known. The problem in which κ is unknown remains open.
(2) If we drop the Gaussianity assumption, the class of self-similar processes with stationary increments becomes much larger. This motivates the introduction of a more general class of locally asymptotically self-similar processes whose autocovariances do not exist. This class includes the linear multifractional stable motion (Stoev and Taqqu, 2004, 2005) as a paradigmatic example. Cluster analysis on such stable processes will no doubt lead to a wide range of applications, especially when the process distributions exhibit heavy-tailed phenomena. Neither the distributional dissimilarity measure introduced in Khaleghi et al. (2016) nor the covariance-based dissimilarity measures used in this paper would work in this case, hence new techniques are required to cluster such processes.

Table 3: Misclassification outcomes using the offline algorithm on the offline dataset and the online algorithm on the online dataset. Panel A reports the incorrectly categorized economic entities from the equity index return clustering, and Panel B reports the incorrectly categorized economic entities from the sovereign CDS spread clustering. The algorithm divides the whole dataset into two groups: emerging markets and developed markets. The incorrect outcomes, where (i) entities from developed markets are incorrectly clustered into the emerging market group, or (ii) vice versa, are reported in the table.