Engineering Proceedings
  • Proceeding Paper
  • Open Access

12 November 2021

Robust Methods for Soft Clustering of Multidimensional Time Series †

1 Research Group MODES, Research Center for Information and Communication Technologies (CITIC), University of A Coruña, 15071 A Coruña, Spain
2 Department of Economics, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Rome, Italy
3 Technological Institute for Industrial Mathematics (ITMATI), 15782 Santiago de Compostela, Spain
* Author to whom correspondence should be addressed.
This article belongs to the Proceedings The 4th XoveTIC Conference

Abstract

Three robust algorithms for clustering multidimensional time series from the perspective of underlying processes are proposed. The methods are robust extensions of a fuzzy C-means model based on estimates of the quantile cross-spectral density. Robustness to the presence of anomalous elements is achieved by using the so-called metric, noise and trimmed approaches. Analyses from a wide simulation study indicate that the algorithms are substantially effective in coping with the presence of outlying series, clearly outperforming alternative procedures. The usefulness of the suggested methods is also highlighted by means of a specific application.

1. Introduction

Clustering of time series is a pivotal problem in statistics with several applications [1,2]. Generally, the goal is to divide a collection of unlabelled time series into homogeneous groups so that intra-cluster similarity is maximized whereas inter-cluster similarity is minimized. Most current techniques deal with univariate time series (UTS), while clustering of multidimensional time series (MTS) has received limited attention. This paper proposes three robust clustering methods for MTS. All of them aim to neutralize the effect of outlying series while detecting the underlying grouping structure.

2. Robust Clustering Methods for Multivariate Time Series

Let $\{\boldsymbol{X}_t, t \in \mathbb{Z}\} = \{(X_{t,1}, \ldots, X_{t,d}), t \in \mathbb{Z}\}$ be a $d$-variate real-valued strictly stationary stochastic process. Let $F_j$ be the marginal distribution function of $X_{t,j}$, $j = 1, \ldots, d$, and let $q_j(\tau) = F_j^{-1}(\tau)$, $\tau \in [0, 1]$, be the corresponding quantile function. For a fixed $l \in \mathbb{Z}$ and an arbitrary pair of quantile levels $(\tau, \tau') \in [0, 1]^2$, consider the cross-covariance of the indicator functions $I\{X_{t,j_1} \le q_{j_1}(\tau)\}$ and $I\{X_{t+l,j_2} \le q_{j_2}(\tau')\}$,
$$\gamma_{j_1,j_2}(l, \tau, \tau') = \mathrm{Cov}\left(I\{X_{t,j_1} \le q_{j_1}(\tau)\},\, I\{X_{t+l,j_2} \le q_{j_2}(\tau')\}\right),$$
for $1 \le j_1, j_2 \le d$. Taking $j_1 = j_2 = j$, the function $\gamma_{j,j}(l, \tau, \tau')$, with $(\tau, \tau') \in [0, 1]^2$, is the so-called quantile autocovariance function (QAF) of lag $l$, which generalizes the traditional autocovariance function.
For the multivariate process $\{\boldsymbol{X}_t, t \in \mathbb{Z}\}$, we can consider the $d \times d$ matrix $\Gamma(l, \tau, \tau') = \left(\gamma_{j_1,j_2}(l, \tau, \tau')\right)_{1 \le j_1, j_2 \le d}$, which simultaneously gives information about both the cross-dependence (when $j_1 \ne j_2$) and the serial dependence (since there is a lag $l$).
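As an illustration of the definition above, the following Python sketch computes a plug-in sample version of the quantile cross-covariance and of the matrix $\Gamma(l, \tau, \tau')$. It is not taken from the paper: the function names are hypothetical, empirical quantiles replace the unknown $q_j(\tau)$, and only non-negative lags are handled.

```python
import numpy as np


def quantile_cross_covariance(x1, x2, lag, tau1, tau2):
    """Plug-in estimate of gamma_{j1,j2}(lag, tau1, tau2) for lag >= 0.

    x1 and x2 are two components of the same multivariate series
    (equal length). Empirical quantiles stand in for q_{j1}, q_{j2}.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    ind1 = (x1 <= np.quantile(x1, tau1)).astype(float)
    ind2 = (x2 <= np.quantile(x2, tau2)).astype(float)
    a = ind1[: n - lag]  # I{X_{t, j1} <= q_{j1}(tau1)}
    b = ind2[lag:]       # I{X_{t+lag, j2} <= q_{j2}(tau2)}
    # Sample covariance of the two indicator sequences.
    return float(np.mean(a * b) - np.mean(a) * np.mean(b))


def quantile_cross_covariance_matrix(series, lag, tau1, tau2):
    """d x d matrix Gamma(lag, tau1, tau2) for a (T, d) array."""
    X = np.asarray(series, dtype=float)
    d = X.shape[1]
    return np.array([
        [quantile_cross_covariance(X[:, j1], X[:, j2], lag, tau1, tau2)
         for j2 in range(d)]
        for j1 in range(d)
    ])
```

With $j_1 = j_2$ and lag 0, the estimate reduces to the variance of a single indicator sequence, which gives a quick sanity check.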
Under appropriate summability conditions (mixing conditions), we can define the Fourier transform of the cross-covariances. In this regard, the quantile cross-spectral density is given by
$$f_{j_1,j_2}(\omega, \tau, \tau') = \frac{1}{2\pi} \sum_{l=-\infty}^{\infty} \gamma_{j_1,j_2}(l, \tau, \tau')\, e^{-il\omega},$$
for $1 \le j_1, j_2 \le d$, $\omega \in \mathbb{R}$ and $\tau, \tau' \in [0, 1]$. Note that $f_{j_1,j_2}(\omega, \tau, \tau')$ is complex-valued.
The quantile cross-spectral density contains information about the general dependence patterns of a given stochastic process. For a specific realization of the process, this quantity can be consistently estimated by means of the so-called smoothed CCR-periodogram, $\hat{G}_{T,R}^{j_1,j_2}(\omega, \tau, \tau')$, proposed by [3].
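To give a rough idea of how such an estimate can be computed, the sketch below builds rank-based cross-periodograms at the positive Fourier frequencies and smooths them over neighbouring frequencies. It is a simplified stand-in and not the exact estimator of [3]: the uniform smoothing window, the parameter `half_width` and the function names are assumptions made for illustration.

```python
import numpy as np


def ccr_periodogram(series, taus=(0.1, 0.5, 0.9)):
    """Raw rank-based cross-periodograms for a (T, d) array.

    Returns (omegas, P), where P[k, j1, j2, a, b] is the periodogram
    for components (j1, j2) and levels (taus[a], taus[b]) at omegas[k].
    """
    X = np.asarray(series, dtype=float)
    T, d = X.shape
    # Normalised ranks in (0, 1]; ties are broken by order of appearance.
    ranks = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / T
    taus = np.asarray(taus, dtype=float)
    # ind[t, j, a] = I{rank_{t,j} <= taus[a]}
    ind = (ranks[:, :, None] <= taus[None, None, :]).astype(float)
    # Discrete Fourier transform of each indicator sequence over time.
    dft = np.fft.fft(ind, axis=0)
    ks = np.arange(1, T // 2 + 1)
    omegas = 2.0 * np.pi * ks / T
    D = dft[ks]
    # The indicators are real, so the transform at -omega is conj(D).
    P = np.einsum("kja,klb->kjlab", D, np.conj(D)) / (2.0 * np.pi * T)
    return omegas, P


def smooth_over_frequencies(P, half_width=2):
    """Average each entry over a uniform window of nearby frequencies."""
    K = P.shape[0]
    out = np.empty_like(P)
    for k in range(K):
        lo, hi = max(0, k - half_width), min(K, k + half_width + 1)
        out[k] = P[lo:hi].mean(axis=0)
    return out
```

A kernel with decaying weights would be closer in spirit to a smoothed periodogram; the uniform window is used here only to keep the sketch short.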
Based on the previous remarks, a simple dissimilarity measure between two realizations of the $d$-variate process (MTS) can be defined as follows. Given the $i$-th MTS, $\boldsymbol{X}_t^{(i)}$, consider the set $G^{(i)} = \{\hat{G}_{T,R}^{j_1,j_2}(\omega, \tau, \tau'),\ j_1, j_2 = 1, \ldots, d,\ \omega \in \Omega,\ \tau, \tau' \in \mathcal{T}\}$, where $\Omega$ is the set of Fourier frequencies and $\mathcal{T} = \{0.1, 0.5, 0.9\}$. Let $\Psi^{(i)}$ be the vector formed by concatenating the elements of the set $G^{(i)}$. The dissimilarity measure between the series $\boldsymbol{X}_t^{(1)}$ and $\boldsymbol{X}_t^{(2)}$ is defined as the Euclidean distance between the complex vectors $\Psi^{(1)}$ and $\Psi^{(2)}$. We call this dissimilarity $d_{QCD}$.
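The construction of $d_{QCD}$ can be sketched in a few lines of Python. The helper names below are hypothetical, and the input is assumed to be any array holding the spectral estimates of one series. The last helper stacks real and imaginary parts, which yields a real vector with the same pairwise Euclidean distances and is convenient for clustering routines that expect real features.

```python
import numpy as np


def qcd_features(estimates):
    """Concatenate all spectral estimates of one series into a complex vector."""
    return np.asarray(estimates, dtype=complex).ravel()


def d_qcd(psi_1, psi_2):
    """Euclidean distance between two complex feature vectors."""
    diff = np.asarray(psi_1) - np.asarray(psi_2)
    return float(np.sqrt(np.sum(np.abs(diff) ** 2)))


def to_real(psi):
    """Stack real and imaginary parts; pairwise distances are unchanged."""
    psi = np.asarray(psi)
    return np.concatenate([psi.real, psi.imag])
```

For example, two feature vectors that differ only in one entry by $3 + 4i$ are at distance 5 under both representations.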
The dissimilarity $d_{QCD}$ is used to develop three robust fuzzy clustering methods. All of them assume that we want to group $n$ MTS into $C$ clusters, and are based on the traditional fuzzy C-means clustering algorithm. They look for the set of centroids $\bar{\Psi} = \{\bar{\Psi}^{(1)}, \ldots, \bar{\Psi}^{(C)}\}$ and the $n \times C$ matrix of fuzzy coefficients, $U = (u_{ic})$, $i = 1, \ldots, n$, $c = 1, \ldots, C$, which define the solution of a given minimization problem. The quantity $u_{ic}$ represents the membership degree of the $i$-th MTS in the $c$-th cluster. The minimization problem for the first method is the following:
$$\min_{\bar{\Psi}, U} \sum_{i=1}^{n} \sum_{c=1}^{C} u_{ic}^{m} \left[1 - \exp\left\{-\beta \left\|\Psi^{(i)} - \bar{\Psi}^{(c)}\right\|_2^2\right\}\right] \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{and} \ u_{ic} \ge 0,$$
where $\beta$ is a hyperparameter that needs to be set in advance and $m > 1$ is a parameter which determines the fuzziness of the partition, frequently called the fuzziness parameter.
The exponential distance is used in the previous model because it is capable of neutralizing the effect of outlying series by spreading out their membership degrees between the different clusters [4].
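The spreading effect can be seen in a small sketch of the alternating updates that follow from setting the derivatives of the objective to zero, in the spirit of the alternative c-means algorithm of [4]. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the features are assumed to be real-valued (for instance, stacked real and imaginary parts), and initial centroids must be supplied by the caller.

```python
import numpy as np


def _exp_memberships(X, centroids, m, beta, eps=1e-12):
    # Squared Euclidean distances between every series and every centroid.
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    expo = np.exp(-beta * sq)
    # Exponential distance 1 - exp(-beta * ||.||^2), floored to avoid 1/0.
    dist = np.maximum(1.0 - expo, eps)
    inv = dist ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True), expo


def exponential_fcm(features, init_centroids, m=2.0, beta=1.0,
                    n_iter=200, tol=1e-10):
    """Fuzzy C-means with the exponential distance (fixed-point updates)."""
    X = np.asarray(features, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        U, expo = _exp_memberships(X, centroids, m, beta)
        # Far-away series get weights near zero through the exp factor.
        w = (U ** m) * expo
        new_centroids = (w.T @ X) / np.maximum(w.sum(axis=0), 1e-12)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    U, _ = _exp_memberships(X, centroids, m, beta)
    return U, centroids
```

For a series far from every centroid, all exponential distances are close to 1, so its memberships approach $1/C$ and its weight in the centroid update is close to zero.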
The second robust procedure follows the noise cluster approach, and considers the following minimization problem:
$$\min_{\bar{\Psi}, U} \sum_{i=1}^{n} \sum_{c=1}^{C-1} u_{ic}^{m} \left\|\Psi^{(i)} - \bar{\Psi}^{(c)}\right\|_2^2 + \sum_{i=1}^{n} \delta^2 \left(1 - \sum_{c=1}^{C-1} u_{ic}\right)^{m} \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{and} \ u_{ic} \ge 0,$$
where $\delta > 0$ is a parameter known as the noise distance, which has to be specified in advance.
The previous model includes $C$ groups, but only $C - 1$ are “real” clusters. The noise cluster is artificially created for outlier identification purposes. The aim is to locate the outliers and place them in the noise cluster, which is represented by a fictitious prototype that has a constant distance from every MTS (the noise distance, $\delta$).
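A minimal sketch of the noise cluster idea is given below, using the standard alternating updates for noise clustering. It rests on several assumptions: the function names are hypothetical, the features are real-valued, initial centroids for the $C - 1$ real clusters are supplied by the caller, and the last column of the returned matrix holds the noise memberships.

```python
import numpy as np


def _noise_memberships(X, centroids, m, delta, eps=1e-12):
    # Squared distances to the C - 1 real centroids, floored to avoid 1/0.
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    sq = np.maximum(sq, eps)
    # The fictitious noise prototype sits at squared distance delta^2
    # from every series; append it as an extra column.
    sq_all = np.hstack([sq, np.full((X.shape[0], 1), delta ** 2)])
    inv = sq_all ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def noise_fcm(features, init_centroids, delta, m=2.0, n_iter=200, tol=1e-10):
    """Fuzzy C-means with C - 1 real clusters plus one noise cluster."""
    X = np.asarray(features, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        U = _noise_memberships(X, centroids, m, delta)
        # Only the real clusters have centroids to update.
        w = U[:, :-1] ** m
        new_centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    return _noise_memberships(X, centroids, m, delta), centroids
```

A series whose distance to every real centroid is much larger than $\delta$ receives a noise membership close to 1, so it barely influences the centroids of the real clusters.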
The third technique can be expressed by means of the minimization problem:
$$\min_{Y, U} \sum_{i=1}^{H(\alpha)} \sum_{c=1}^{C} u_{ic}^{m} \left\|\Psi^{(i)} - \bar{\Psi}^{(c)}\right\|_2^2 \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{and} \ u_{ic} \ge 0,$$
where $Y$ ranges over all the subsets of $\Psi = \{\Psi^{(1)}, \ldots, \Psi^{(n)}\}$ of size $H(\alpha) = \lfloor n(1 - \alpha) \rfloor$, and the outer sum runs over the elements of $Y$. The model attains its robustness by removing a certain proportion of the series and requires the specification of the fraction $\alpha$ of the data to be trimmed.
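One way to approach the trimmed problem is to alternate between choosing the retained subset and updating memberships and centroids. The sketch below does this as a heuristic illustration, not necessarily the procedure used by the authors: the function name is hypothetical, features are assumed real-valued, initial centroids are supplied by the caller, and the retained set is taken as the $\lfloor n(1 - \alpha) \rfloor$ series with the smallest individual contribution to the objective under their optimal memberships.

```python
import numpy as np


def trimmed_fcm(features, init_centroids, alpha, m=2.0, n_iter=200, tol=1e-10):
    """Fuzzy C-means that discards a fraction alpha of the series."""
    X = np.asarray(features, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    n = X.shape[0]
    # Number of retained series; the small offset guards against
    # floating-point error when n * (1 - alpha) is an exact integer.
    H = int(np.floor(n * (1.0 - alpha) + 1e-9))
    eps = 1e-12
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        sq = np.maximum(sq, eps)
        inv = sq ** (-1.0 / (m - 1.0))
        S = inv.sum(axis=1)
        # Contribution of each series under its optimal memberships:
        # sum_c u_ic^m * sq_ic = S_i^{-(m - 1)}.
        cost = S ** (-(m - 1.0))
        kept = np.zeros(n, dtype=bool)
        kept[np.argsort(cost)[:H]] = True
        # Trimmed series get zero membership everywhere.
        U = np.zeros_like(sq)
        U[kept] = inv[kept] / S[kept, None]
        w = U ** m
        new_centroids = (w.T @ X) / np.maximum(w.sum(axis=0), eps)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    # U and kept come from the last pass over the data.
    return U, centroids, kept
```

The returned boolean mask identifies the retained series; rows of the membership matrix for trimmed series are all zero, which makes them easy to flag in a results table.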
The three previously presented robust models have been analysed by means of a broad simulation study containing a wide variety of generating processes. Two alternative dissimilarities were taken into account for comparison purposes [5,6]. In all cases, the three proposed algorithms outperformed the competitors.

3. Application to Real Data

The three techniques proposed in Section 2 were applied to perform clustering in a real MTS database. Specifically, we considered daily stock returns and trading volume of the top 20 companies of the S&P 500 index, thus obtaining 20 bivariate MTS. Table 1 shows the membership degrees of the series obtained with the trimmed approach.
Table 1. Membership degrees for the top 20 companies in the S&P 500 index by considering the trimmed approach and a 6-cluster partition.
The symbols in bold correspond to the companies which were trimmed away, Berkshire Hathaway (BRK.B), Walmart (WMT) and Home Depot (HD). Similar clustering solutions were obtained with the remaining two methods.

4. Conclusions

This work proposes three robust methods to perform fuzzy clustering of MTS. They are based on the so-called exponential, noise and trimmed ideas. Each approach attains robustness to outlying series in a different way. The three procedures have been presented and assessed through a wide simulation study, in which they substantially outperformed alternative approaches. A real data application has also been carried out in order to show the usefulness of the presented techniques.

Acknowledgments

This research has been supported by MINECO (MTM2017-82724-R and PID2020-113578RB-100), the Xunta de Galicia (ED431C-2020-14), and “CITIC” (ED431G 2019/01).

References

  1. Liao, T.W. Clustering of time series data - A survey. Pattern Recognit. 2005, 38, 1857–1874.
  2. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering - A decade review. Inf. Syst. 2015, 53, 16–38.
  3. Baruník, J.; Kley, T. Quantile coherency: A general measure for dependence between cyclical economic variables. Econom. J. 2019, 22, 131–152.
  4. Wu, K.L.; Yang, M.S. Alternative c-means clustering algorithms. Pattern Recognit. 2002, 35, 2267–2278.
  5. D’Urso, P.; Maharaj, E.A. Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets Syst. 2009, 160, 3565–3589.
  6. D’Urso, P.; Maharaj, E.A. Wavelets-based clustering of multivariate time series. Fuzzy Sets Syst. 2012, 193, 33–61.