Distribution-based entropy weighting clustering of skewed time series

Abstract: The goal of clustering is to identify common structures in a data set by forming groups of homogeneous objects. The observed characteristics of many economic time series have motivated the development of classes of distributions that can accommodate properties such as heavy tails and skewness. Thanks to its flexibility, the Skew Exponential Power Distribution (also called Skew Generalized Error Distribution) provides a unified and general framework for clustering possibly skewed time series. This paper develops a model-based clustering procedure, assuming that the time series are generated by the same underlying probability distribution but with different parameters. Moreover, we propose to optimally combine all the parameter estimates to form the clusters with an entropy-weighted k-means approach. The usefulness of the proposal is shown by means of an application to financial time series, which also illustrates how the obtained clusters can be used to form portfolios of stocks.

The Exponential Power Distribution (EPD) has density

f_EPD(z; µ, σ, p) = (C_EPD(p)/σ) exp{ −(1/p) |(z − µ)/σ|^p },   (1)

where z ∈ R, µ ∈ (−∞, +∞) is called location parameter, σ > 0 is called scale parameter, p > 0 is a measure of fatness of tails and is called shape parameter (see [40]), and

Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt   (2)

is the Gamma function. Since the distribution is symmetric and unimodal, the location parameter is also the mode, median and mean of the distribution (Fig. 1). The EPD probability density (1) is written in compact form by means of [40], where C_EPD(p) = 1/[2 p^{1/p} Γ(1 + 1/p)] is a normalizing constant.
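As a quick numerical check on the density above, the sketch below evaluates the EPD density and verifies that p = 2 recovers the Gaussian and p = 1 the Laplace. The exp{−|·|^p/p} parametrization is assumed here because it is the one consistent with the constant C_EPD(p); the function name is illustrative.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def epd_pdf(z, mu=0.0, sigma=1.0, p=2.0):
    """Exponential Power (Generalized Error) density with the normalizing
    constant C_EPD(p) = 1 / (2 p^(1/p) Gamma(1 + 1/p))."""
    c = 1.0 / (2.0 * p ** (1.0 / p) * gamma_fn(1.0 + 1.0 / p))
    u = np.abs((z - mu) / sigma)
    return (c / sigma) * np.exp(-(u ** p) / p)

# p = 2 gives the Gaussian density, p = 1 the Laplace density
gauss_at_mode = epd_pdf(0.0, p=2.0)   # should equal 1 / sqrt(2*pi)
laplace_at_mode = epd_pdf(0.0, p=1.0) # should equal 1/2
```

Evaluating the density at the mode for these two special cases is an easy way to confirm the normalizing constant is implemented correctly.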

The shape parameter p controls the tails and the peak of the distribution: a small value of p means that the tails of the distribution become fat, with the center becoming sharply peaked.

A very important feature of this family of distributions, which has proved useful in modeling stock market volatility (e.g. [37,38,44]), is that it includes other common distributions for particular values of the shape parameter p. In particular, the Gaussian distribution is a special case of the GED when p = 2, and when p < 2 the distribution has fatter tails than a Gaussian distribution [37]. Moreover, when p = 1 we have a Laplace distribution, and for p = +∞ we have the Uniform distribution [42].

So far, two different methods have been proposed to extend the EPD for skewness (see Fig. 2). A first approach is the one of [27], who defined the first family of SEPD. Later, [28,29] extended the EPD class to another family of SEPD by using a two-piece method, in which an additional skew parameter γ (that henceforth we denote as λ) is introduced. By a method similar to that of [28,29], [30] and [31], respectively, constructed seemingly different classes of SEPD, which are actually reparametrizations of the one developed by [28,29]. In what follows we consider the SEPD family of [28,29,35]: there exist parameters p > 0, µ ∈ R, σ > 0, and λ > 0 such that the density function has the form

f_SEPD(z; µ, σ, p, λ) = [2/(λ + 1/λ)] (C_EPD(p)/σ) exp{ −(1/p) |(z − µ)/(σ λ^{sign(z−µ)})|^p },

where C_EPD(p) is the normalizing constant defined above. The parameters µ and σ correspond to location and scale, respectively, while λ controls skewness, and p is the shape parameter. For λ = 1, the distribution is symmetric about µ, so we obtain the symmetric exponential power distribution. In the case λ ≠ 1, letting p = 1 leads to the skew Laplace distribution with density

f(z; µ, σ, λ) = [1/(σ(λ + 1/λ))] exp{ −(|z − µ|/σ) λ^{−sign(z−µ)} }.

For p = 2, we obtain the skew normal distribution. The moments of the Skew Exponential Power Distribution are the following [35].
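The two-piece construction can be sketched numerically as follows. This assumes a Fernández–Steel-type rescaling of the two halves of the symmetric EPD, which matches the qualitative properties stated above (λ = 1 gives back the symmetric EPD, and the density integrates to one), but it is not necessarily the exact parametrization used in [28,29]; all names are illustrative.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def epd_pdf(z, mu=0.0, sigma=1.0, p=2.0):
    """Symmetric EPD density (base distribution for the two-piece skewing)."""
    c = 1.0 / (2.0 * p ** (1.0 / p) * gamma_fn(1.0 + 1.0 / p))
    return (c / sigma) * np.exp(-np.abs((z - mu) / sigma) ** p / p)

def sepd_pdf(z, mu=0.0, sigma=1.0, p=2.0, lam=1.0):
    """Two-piece skew EPD: the right half is stretched by lam and the left
    half shrunk (or vice versa), then renormalized by 2/(lam + 1/lam)."""
    u = (np.asarray(z, dtype=float) - mu) / sigma
    a = np.where(u >= 0, u / lam, u * lam)          # piecewise rescaling
    c = 1.0 / (2.0 * p ** (1.0 / p) * gamma_fn(1.0 + 1.0 / p))
    return (2.0 / (lam + 1.0 / lam)) * (c / sigma) * np.exp(-np.abs(a) ** p / p)
```

With lam > 1 the right tail becomes heavier than the left one, producing positive skewness; lam = 1 reduces exactly to the symmetric EPD.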

The mean, the variance, the skewness and the excess kurtosis of the Skew Exponential Power Distribution can be expressed in closed form in terms of µ, σ, p and λ; the explicit expressions are given in [35]. In the special case p = 1 (Laplace distribution) these expressions simplify accordingly.

A further generalization is the Asymmetric Exponential Power Distribution (AEPD), where µ ∈ R and σ > 0 still represent location and scale, respectively, λ ∈ (0, 1) is the skewness parameterization, p_1 > 0 and p_2 > 0 are the left and right tail parameters, respectively, and C_EPD(p) is the constant defined before. This representation supposes that the two tails have different shapes p_1 and p_2. If p_1 = p_2 = p, implying λ* = λ, the AEPD reduces to the SEPD, which is equivalent, up to a different parametrization, to the distributions developed by [28,29].

Therefore, assuming to have N (n = 1, . . . , N) time series, all generated by the Skew Exponential Power Distribution with parameters µ_n, σ_n, p_n and λ_n, we can store the parameter estimates in the following N × 4 matrix:

X = [ µ̂_1 σ̂_1 p̂_1 λ̂_1 ; µ̂_2 σ̂_2 p̂_2 λ̂_2 ; . . . ; µ̂_N σ̂_N p̂_N λ̂_N ],

whose n-th row collects the estimates for the n-th time series; the rows are then clustered, so that the similarity of the estimated parameters determines the membership of the observations to each cluster.
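One possible way to build the N × 4 matrix of estimates is a per-series maximum-likelihood fit, sketched below. The skew density used here is the same two-piece parametrization assumed earlier (not necessarily the exact one of [28,29]), and the optimizer, starting values and function names are illustrative choices, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma as gamma_fn

def sepd_neg_loglik(theta, x):
    """Negative log-likelihood of a two-piece skew EPD (sketch parametrization).
    sigma, p and lam are optimized on the log scale to keep them positive."""
    mu, log_sigma, log_p, log_lam = theta
    sigma, p, lam = np.exp(log_sigma), np.exp(log_p), np.exp(log_lam)
    c = 1.0 / (2.0 * p ** (1.0 / p) * gamma_fn(1.0 + 1.0 / p))
    u = (x - mu) / sigma
    a = np.where(u >= 0, u / lam, u * lam)          # two-piece rescaling
    logf = (np.log(2.0 / (lam + 1.0 / lam)) + np.log(c) - np.log(sigma)
            - np.abs(a) ** p / p)
    return -np.sum(logf)

def fit_sepd(x):
    """Return the estimated (mu, sigma, p, lam) for one series."""
    x0 = np.array([np.mean(x), np.log(np.std(x)), np.log(2.0), 0.0])
    res = minimize(sepd_neg_loglik, x0, args=(x,), method="Nelder-Mead")
    mu, ls, lp, ll = res.x
    return np.array([mu, np.exp(ls), np.exp(lp), np.exp(ll)])

# Stack the per-series estimates (mu_n, sigma_n, p_n, lambda_n) into an N x 4 matrix
rng = np.random.default_rng(0)
series = [rng.standard_normal(500) for _ in range(3)]   # toy return series
X = np.vstack([fit_sepd(x) for x in series])            # shape (N, 4)
```

Each row of X then plays the role of one observation in the clustering step that follows.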

Formally, the Weighted k-Means algorithm (WKM) can be formalized as follows:

min_{U, Z, W} Σ_{c=1}^{C} Σ_{n=1}^{N} Σ_{m=1}^{M} u_{n,c} w_{m,c}^β d(x_{n,m}, z_{c,m}),

subject to Σ_{m=1}^{M} w_{m,c} = 1, w_{m,c} ≥ 0, u_{n,c} ∈ {0, 1} and Σ_{c=1}^{C} u_{n,c} = 1, where u_{n,c} indicates the membership of the n-th series to the c-th cluster, z_{c,m} is the m-th coordinate of the c-th centroid, w_{m,c} is the weight assigned to the m-th estimated parameter in cluster c, and d(·, ·) is a dissimilarity measure between the parameter estimates and the centroid. In this way, the parameters that are more informative receive a larger weight in the distance computation for each series. Moreover, another appealing feature is that each c-th group has its own optimal weighting.
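The objective above can be evaluated directly. The sketch below uses squared Euclidean dissimilarity and per-cluster weights; the matrix names are illustrative.

```python
import numpy as np

def wkm_objective(X, U, Z, W, beta):
    """Weighted k-means cost:
    sum_c sum_n u_nc * sum_m w_cm^beta * (x_nm - z_cm)^2.
    X: (N, M) data, U: (N, C) memberships, Z: (C, M) centroids,
    W: (C, M) per-cluster weights, beta: weight exponent."""
    cost = 0.0
    for c in range(Z.shape[0]):
        d2 = (X - Z[c]) ** 2                          # (N, M) squared deviations
        cost += np.sum(U[:, c][:, None] * (W[c] ** beta) * d2)
    return cost
```

For example, two points [0, 0] and [2, 2] assigned to a single centroid [1, 1] with uniform weights 0.5 and β = 2 give a cost of 2 points × 2 dimensions × 0.25 × 1 = 1.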

In the end, the exponent β has to be analyzed. With β = 0 we obtain the usual k-means clustering algorithm, while with β = 1 the weight of the estimated distribution parameter with the smallest value of the weighted dissimilarity is equal to 1 and all the other w_{m,c} are equal to zero.

When β > 1, the larger D_m, the smaller the weight w_m; the effect of a parameter with a large dispersion D_m is therefore reduced. When β < 0, the larger D_m, the larger the weight w_m; however, w_m^β becomes smaller, so the parameter still receives less weight in the distance calculation because of the negative exponent. Finally, if 0 < β < 1, the larger the parameters' dissimilarity, the larger the weight, which is the opposite of the intended behavior. Therefore we cannot choose 0 < β < 1, but we can choose β < 0 or β > 1 in the weighted k-means algorithm.
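These regimes can be illustrated numerically with the closed-form weight update w_m ∝ D_m^{−1/(β−1)}, which is the Huang-et-al.-style solution assumed here for the weighted k-means objective; names are illustrative.

```python
import numpy as np

def wkm_weights(D, beta):
    """Closed-form WKM weight update: w_m proportional to D_m^(-1/(beta-1)),
    normalized to sum to one; valid for beta > 1 or beta < 0."""
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()

D = np.array([1.0, 2.0, 4.0])        # per-parameter dispersions
w_pos = wkm_weights(D, beta=3.0)     # beta > 1: larger D -> smaller weight
w_neg = wkm_weights(D, beta=-1.0)    # beta < 0: larger D -> larger raw weight,
effective = w_neg ** -1.0            # but w^beta still decreases with D
```

So for β > 1 the weights behave as intended, and for β < 0 the raw weights are inverted but the effective factors w^β in the distance again penalize large dispersions, exactly as described in the text.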

However, the exponent β is an artificial device lacking a strong theoretical justification, which motivates replacing it with an entropy regularization of the weights.

The new objective function can be written as follows:

min_{U, Z, W} Σ_{c=1}^{C} [ Σ_{n=1}^{N} Σ_{m=1}^{M} u_{n,c} w_{m,c} d(x_{n,m}, z_{c,m}) + γ Σ_{m=1}^{M} w_{m,c} log w_{m,c} ],   (17)

subject to the constraints Σ_{m=1}^{M} w_{m,c} = 1, w_{m,c} ≥ 0, u_{n,c} ∈ {0, 1} and Σ_{c=1}^{C} u_{n,c} = 1. The first term is the sum of the within-cluster dispersion, and the second term is the (negative) entropy of the weights, whose influence is controlled by γ.

Proof of (17). By using the Lagrange multiplier technique we obtain the following unconstrained minimization problem:

min Σ_{m=1}^{M} w_{m,c} D²_{m,c} + γ Σ_{m=1}^{M} w_{m,c} log w_{m,c} + λ_c ( Σ_{m=1}^{M} w_{m,c} − 1 ),

for c = 1, . . . , C, where D²_{m,c} = Σ_{n=1}^{N} u_{n,c} d(x_{n,m}, z_{c,m}). By setting the gradient with respect to w_{m,c} and λ_c to zero, we obtain:

D²_{m,c} + γ (log w_{m,c} + 1) + λ_c = 0

and:

Σ_{m=1}^{M} w_{m,c} = 1.

From the first equality we get w_{m,c} = exp(−D²_{m,c}/γ) exp(−1 − λ_c/γ), where D²_{m,c} can be interpreted as a measure of the data dispersion of the m-th dimension for the objects placed within the c-th cluster. By substitution into the constraint we get exp(−1 − λ_c/γ) = 1 / Σ_{m'=1}^{M} exp(−D²_{m',c}/γ). Substituting back we obtain the weights solving (17):

w_{m,c} = exp(−D²_{m,c}/γ) / Σ_{m'=1}^{M} exp(−D²_{m',c}/γ).

Similarly to the standard k-means algorithm, u_{n,c} is updated by assigning each series to the cluster with the smallest weighted distance. For γ > 0, the weight w_{m,c} decreases with the dispersion D²_{m,c}, as desired. Instead, if γ < 0, the weight w_{m,c} is proportional to the distance D²_{m,c}: the larger the distance, the larger the associated weight. This is a contradictory result and, hence, γ cannot be smaller than zero.
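A compact sketch of the resulting algorithm is given below, using squared Euclidean dissimilarity, γ > 0, and the closed-form entropy weights derived above; the initialization scheme and all names are illustrative choices.

```python
import numpy as np

def ewkm(X, C, gamma=1.0, n_iter=50, init=None, seed=0):
    """Entropy-weighted k-means sketch: alternate (i) assignments under
    weighted distances, (ii) centroid updates, and (iii) closed-form weights
    w_cm = exp(-D2_cm/gamma) / sum_m' exp(-D2_cm'/gamma)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Z = X[rng.choice(N, C, replace=False)].copy() if init is None else init.copy()
    W = np.full((C, M), 1.0 / M)                   # per-cluster weights
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # assign each object to the cluster with smallest weighted distance
        dist = np.stack([((X - Z[c]) ** 2 * W[c]).sum(axis=1) for c in range(C)], axis=1)
        labels = dist.argmin(axis=1)
        for c in range(C):
            members = X[labels == c]
            if len(members) == 0:
                continue
            Z[c] = members.mean(axis=0)
            D2 = ((members - Z[c]) ** 2).sum(axis=0)   # per-dimension dispersion
            e = np.exp(-(D2 - D2.min()) / gamma)       # shifted for numerical stability
            W[c] = e / e.sum()
    return labels, Z, W
```

Dimensions along which a cluster is tightly concentrated receive large weights, so each cluster effectively selects the parameters that best characterize it.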

In the end, γ can also be set equal to zero: in this limiting case, the dimension m with the smallest distance has a weight equal to 1, w_{m,c} = 1, while all the others are zero, w_{m,c} = 0.

Therefore, from these simple considerations, the need for a very flexible distribution able to accurately capture these diversities appears clearly.

As usual, the second step of the clustering procedure involves the choice of the number of clusters C. As specified in Section 3 of the paper, we take advantage of the statistic S: the highest value of S is obtained with C = 3 clusters, and then its value dramatically decreases with an increasing number of clusters. Therefore we choose C = 3.

The third cluster c = 3 is the least numerous, with only 16% of the assets (AUTO, FERG, III and SVT) placed within it. By looking at the parameter estimates in Tab. 1, it appears clear that the stocks within group c = 3 are those showing a heavy-tailed distribution with a low degree of skewness.
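Assuming S denotes the average silhouette width, a common criterion for selecting the number of clusters, it can be computed as sketched below; all names are illustrative.

```python
import numpy as np

def avg_silhouette(X, labels):
    """Average silhouette width: for each object, a = mean intra-cluster
    distance, b = smallest mean distance to another cluster,
    s = (b - a) / max(a, b); return the mean of s over all objects."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.zeros(N)
    for i in range(N):
        own = labels == labels[i]
        a = D[i, own & (np.arange(N) != i)].mean()
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# candidate C values would then be compared via, e.g.,
# best_C = max(candidates, key=lambda C: avg_silhouette(X, cluster(X, C)))
```

Values of S close to 1 indicate compact, well-separated clusters, so one picks the C at which S peaks before it starts to decline.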

The second cluster c = 2, on the other hand, is the most numerous, since 52% of the stocks are included in it. In this case, looking at Tab. 1, we can conclude that the second group contains the stocks with the highest degree of skewness. Indeed, for example, the stocks DGE, GSK and HL are those with the three highest values of skewness (close to 2 for both DGE and GSK, and 1.31 for HL).

In the end, the residual cluster c = 1 contains the remaining 32% of the stocks.

An important feature of the proposed clustering approach is that each of the C clusters has its own optimal weighting of the estimated parameters. Finally, one can be interested in the possible usage of these groups in the real world.

An immediate example for any clustering approach applied to financial data is portfolio selection: the clusters obtained in the previous section by the proposed approach can be seen as possible portfolios from an asset allocation perspective.

Financial literature has provided various approaches to portfolio selection. Nevertheless, [50] showed that, empirically, the naive or Talmudic equally weighted allocation is difficult to outperform.

In what follows we consider each cluster as a possible set of stocks, and we use the Global Minimum Variance approach to build C different portfolios.
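The Global Minimum Variance weights have the closed form w = Σ⁻¹1 / (1′Σ⁻¹1), with Σ the covariance matrix of the returns. A minimal sketch, with no short-sale constraint and illustrative names:

```python
import numpy as np

def gmv_weights(returns):
    """Global Minimum Variance weights: w = S^{-1} 1 / (1' S^{-1} 1),
    where S is the sample covariance of the asset returns (T x K)."""
    S = np.cov(returns, rowvar=False)
    ones = np.ones(S.shape[0])
    w = np.linalg.solve(S, ones)     # solve S w = 1 instead of inverting S
    return w / w.sum()
```

Applying `gmv_weights` to the return series of the stocks in each cluster yields the C portfolios; by construction, the resulting in-sample variance is no larger than that of the equally weighted portfolio on the same assets.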

In order to evaluate the out-of-sample performances, we follow the empirical procedure described in the paper.

Clustering, by grouping objects that have maximum similarity with other objects within the group and minimum similarity with objects in other groups, is a useful approach for exploratory data analysis, as it identifies structure(s) in an unlabeled dataset by objectively organizing data into similar groups. Three important particular cases of the SEPD are analyzed in the paper: the Gaussian, the Laplace and the Asymmetric Laplace.

The clustering algorithm, which represents the innovative aspect of this paper, uses the moments estimated from the introduced Skew Exponential Power Distribution to form the clusters.

The criterion is that time series with similar moment estimates are placed in the same group. Therefore, with a k-means clustering algorithm, the measure of dissimilarity is determined on the basis of these estimates. In this paper we therefore propose to combine all the information in an optimal way to form clusters.

The approach we devised to optimally weight the different data characteristics is rep-