1. Introduction
Time-series clustering [1,2], just like time-series classification [3] and time-series prediction [4,5], is one of the data mining methods for time-series. Time-series clustering is used to extract useful information from the data curve and divide unlabeled data into different clusters, maximizing the similarity of objects within the same cluster and the divergence of objects between different clusters [6]. It has been widely applied in many fields, such as finance, biomedicine, and the environment. Many studies have been conducted on time-series clustering. For example, Huang [7] proposed a new K-Means-type smooth subspace clustering algorithm for clustering time-series data. Guijo-Rubio [8] demonstrated that a least-squares polynomial segmentation procedure could be applied to each time-series to return segments of different lengths; all the segments were then projected into the same dimensional space according to the coefficients of the model. Zhang [9] proposed a fuzzy time-series forecasting model based on multiple linear regression and time-series clustering for forecasting market prices.
Besides the clustering algorithm itself, distance measurement is another key factor affecting the clustering performance for time-series. Many different distance measurement methods have been adopted and compared for time-series data, including Hausdorff distance [10], Minkowski distance [11], hidden Markov model-based distance [12], Euclidean distance [13], and dynamic time warping (DTW) [14]; the last two are the most widely used [15].
Although many clustering algorithms and distance measurement methods have been applied together, these traditional clustering methods often show poor classification performance and usually do not perform well on high-dimensional data. In contrast, image clustering, based on the values of image pixels, is a key technique for better accomplishing image annotation and search in large image repositories. Such clustering can quickly extract obvious features from images and achieve clustering. Owing to the rapid development of deep learning and its inherent characteristics, deep neural networks are used to transform data into representations that are well suited to clustering.
The existing deep clustering models for images can be divided into three categories according to the network structure: methods based on the autoencoder (AE), methods based on CDNN, and methods based on generative models. The first category can be regarded as encoders and decoders, used for feature mapping and reconstruction, respectively, and includes deep clustering networks [16], deep embedding networks [17], deep subspace clustering networks [18], and deep manifold clustering [19]. The methods based on CDNN have three types of network architectures: deep belief networks [20], fully convolutional networks, and convolutional neural networks (CNN). Deep nonparametric clustering [21], deep embedded clustering [22], and discriminatively boosted image clustering [23] are the classical unsupervised preprocessing network models among these methods. The third category, based on variational autoencoders and generative adversarial networks [24], including variational deep embedding [25] and deep adaptive image clustering [26], has been proposed for clustering and sample generation. Among these, clustering methods based on CDNN can extract more distinctive features and can cluster large-scale image datasets. The deep clustering algorithms mentioned above are usually applied to two-dimensional images, whereas most existing time-series clustering algorithms are based on traditional methods and show low accuracy. Therefore, a model based on CDNN is proposed to convert time-series into images and improve the clustering performance through deep learning methods.
For improving the performance of time-series clustering, four image feature representation methods were demonstrated to convert one-dimensional time-series into two-dimensional images, and a deep clustering network (DC), leveraging an autoencoder and K-Means, was designed for feature learning and clustering. On this basis, a deep clustering model, named GW_DC, was proposed in this study. First, time-series were converted into four kinds of two-dimensional images: grayscale images, recurrence plot (RP) images, Markov transition field (MTF) images, and Gramian Angular Difference Field (GADF) images. Then, for better clustering performance, wavelet transform was applied to extract and process the different signal components of the two-dimensional images, removing noise and enhancing the texture features of the images. Finally, the features of the enhanced images were represented by the autoencoder of GW_DC, and the clustering process was completed by the clustering layer in the network. To verify the performance of the proposed GW_DC model, six UCR datasets were applied, and the clustering results were employed to verify, in reverse, the characterization effects of the different two-dimensional images. The comparative analysis revealed that the clustering results of GADF images were the best, and the proposed GW_DC model showed a better clustering effect than other deep clustering models.
2. Methodology
The proposed GW_DC model could convert the original time-series into two-dimensional images and then classify them into groups. For comparison, time-series were converted into grayscale images, RP images, MTF images, and GADF images. Meanwhile, the deep clustering network was employed for classifying these enhanced two-dimensional images, and the convolution kernel factorization method was introduced into the DC network for reducing the number of parameters and improving the clustering performance. The frame of the proposed GW_DC model, depicted in Figure 1, consisted of three stages: the image conversion stage, the image enhancement stage, and the image clustering stage.
In the image conversion stage, the time-series was transformed into four kinds of two-dimensional images for expanding data volume and enhancing generalization ability, including grayscale images, RP images, MTF images, and GADF images. The low-dimensional features were mapped to a high-dimensional space for amplifying the feature attributes and improving the clustering effect. Through comparison, GADF with the best characterization effect was selected in this study.
In the image enhancement stage, the wavelet transform algorithm was used to enhance the texture features of the transformed two-dimensional images. The resolution of the images was regarded as the measurement standard of image decomposition, and the image signals were decomposed into high-frequency subband and low-frequency subband. The high-frequency subband was expanded, and the low-frequency subband was scaled for strengthening the change details of the time-series and enhancing the contrast ratio. Then, the processed signal components were reconstructed to obtain the enhanced images.
In the image clustering stage, a new deep clustering network DC was proposed to improve the clustering effect. In this stage, the DC network based on CNN was designed to learn and represent the features of two-dimensional images. Then, the obtained features were clustered by the K-Means algorithm, and the cross-entropy (CE) [27] was introduced as the loss function to optimize the DC network.
2.1. Image Conversion Stage
For well retaining the time correlation and frequency structure of the time-series, one-dimensional data were transformed into two-dimensional images. Time-series data were converted into four two-dimensional images, including grayscale images, RP images, MTF images, and GADF images. The clustering results according to different kinds of two-dimensional images were compared and analyzed to determine the best characterization effects of the aforementioned images. Four two-dimensional image representation methods were described in this section.
2.1.1. Conversion from Time-Series into Grayscale Image
The time-series of length $n$ could be expressed as $X = \{x_1, x_2, \dots, x_n\}$. Then, for reducing the dimension of the time-series, piecewise aggregation approximation (PAA) was adopted to compress time-series $X$, and a new smooth time-series curve was generated. The generated time-series could be expressed as $\bar{X} = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m\}$, where $m$ is the length of time-series $\bar{X}$. The dimension of time-series $X$ was thus compressed from $n$ into $m$, with $m < n$. The reduction factor $k = n/m$ gives each compressed value as the average of $k$ consecutive points:
$$\bar{x}_i = \frac{1}{k} \sum_{j=k(i-1)+1}^{ki} x_j, \quad i = 1, 2, \dots, m.$$
The compressed series was then arranged into a data matrix, where each row in the matrix contained every timestamp of the time-series, and each column was the transpose representation of the corresponding row for expanding the matrix data with redundancy features.
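The PAA compression and gray-value construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: it assumes the series length is divisible by the target length, and it reads the GVM construction as a row-wise tiling of the normalized series (whose transpose supplies the column-wise view).

```python
import numpy as np

def paa(x, m):
    """Piecewise aggregation approximation: compress a length-n series
    to length m by averaging k = n/m consecutive points per segment
    (assumes n is divisible by m)."""
    x = np.asarray(x, dtype=float)
    return x.reshape(m, x.size // m).mean(axis=1)

def gray_value_matrix(x):
    """Min-max normalize a (compressed) series to [0, 255] and tile it
    so that each row contains every timestamp; the matrix transpose
    gives the redundant column-wise representation."""
    x = np.asarray(x, dtype=float)
    g = 255.0 * (x - x.min()) / (x.max() - x.min())
    return np.tile(g, (g.size, 1))

compressed = paa(np.arange(8.0), 4)   # 8 points -> 4 segment averages
img = gray_value_matrix(compressed)   # 4 x 4 gray-value matrix
```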
Then the data matrix was transformed into the gray value matrix (GVM), and the corresponding gray values were obtained using the maximum-minimum normalization method [28]. Six time-series datasets from the UCR website were applied to verify the performance of the model and for visual display. The samples of four time-series datasets were converted into grayscale images. The visualization effect is shown in Figure 2.
The grayscale images intuitively present the general changes of different datasets. For example, the grayscale variation of the SonyAIBORobotSurface2 dataset is relatively flat, while the BME dataset shows a tendency to mutate.
2.1.2. Conversion from Time-Series into RP Images
RP was a time-versus-time signal processing method that could be used to show the periodicity of the trajectory in phase space and reveal the internal structure of the time-series. It consisted of two time axes and black and white dots. A black dot indicated that recursion occurred between the states corresponding to the horizontal and vertical axes, and a white dot indicated that recursion did not occur. The key to constructing the RP image was reconstructing the phase space, which required selecting an appropriate delay coefficient, embedding dimension, and threshold for reconstructing the time-domain information of the original series and promoting the signals to a higher dimension.
The transformation of the RP was divided into three steps, which were described as follows.
Step 1. For time-series $X = \{x_1, x_2, \dots, x_n\}$, the sampling interval was determined to be $\tau$. The appropriate embedding dimension $m$ was determined through relevant theoretical calculation, and the time-series $X$ was reconstructed. The reconstructed phase-space vectors are
$$\vec{v}_i = \left(x_i, x_{i+\tau}, \dots, x_{i+(m-1)\tau}\right), \quad i = 1, 2, \dots, N.$$
The length $N$ of the reconstructed series is
$$N = n - (m - 1)\tau,$$
where $n$ is the length of time-series $X$.
Step 2. The calculation of the distance between point $\vec{v}_i$ and point $\vec{v}_j$ in the reconstructed phase space is
$$d_{ij} = \left\| \vec{v}_i - \vec{v}_j \right\|_2,$$
where $\|\cdot\|_2$ represents the L2 norm.
Step 3. The calculation of the recursive value is
$$R_{ij} = \Theta\left(\varepsilon - d_{ij}\right), \quad i, j = 1, 2, \dots, N,$$
where $R$ is a square matrix with the size of $N \times N$, $N$ is equal to the number of vectors $\vec{v}_i$, $\varepsilon$ is the threshold, and $\Theta(\cdot)$ represents the Heaviside function. The calculation of $\Theta(z)$ is
$$\Theta(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases}$$
The effect of converting samples from different datasets into RP images is shown in Figure 3.
The RP images clearly show the changes of the different datasets through the arrangement of black and white blocks. For example, the SyntheticControl dataset generally presents a drift trend. The RP images of the UMD dataset contain large black areas and show a mutation mode, which is caused by the rapid change of the data.
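The three steps above can be sketched as follows; the embedding dimension, delay, and threshold defaults are illustrative placeholders, not the parameters used in the study.

```python
import numpy as np

def recurrence_plot(x, m=2, tau=1, eps=0.5):
    """Binary recurrence matrix R[i, j] = Theta(eps - ||v_i - v_j||_2),
    where v_i are the delay-embedded phase-space vectors of dimension m
    with delay tau."""
    x = np.asarray(x, dtype=float)
    n_vec = len(x) - (m - 1) * tau                      # N = n - (m - 1) * tau
    vecs = np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(n_vec)])
    # pairwise L2 distances between all embedded vectors
    d = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    return (d <= eps).astype(int)                       # Heaviside: 1 where eps - d >= 0

R = recurrence_plot(np.sin(np.linspace(0, 4 * np.pi, 50)))
```

Recursion is symmetric and every state recurs with itself, so the matrix is symmetric with a black main diagonal.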
2.1.3. Conversion from Time-Series into MTF Images
The MTF method transformed one-dimensional time-series into two-dimensional images by constructing discrete quantile bins, building the Markov transition matrix, and encoding the transition probability field. The MTF was obtained by adding the temporal position information to the first-order Markov chain. It provided an inverse operation to map the images back to the original signals, making a visual representation of the images easy to realize.
The calculation process of the MTF could be divided into the following three steps.
Step 1. The data signals were discretized. First, the original time-series of length $n$ was divided into $Q$ quantile bins, and each data point $x_i$ belonged to a unique bin $q_j$ ($j \in [1, Q]$).
Step 2. The Markov transition matrix $W$ with the size of $Q \times Q$ was constructed. Each element $w_{ij}$ was determined by the frequency with which a point in quantile $q_j$ was followed by a point in quantile $q_i$, and its calculation formula is
$$w_{ij} = P\left(x_t \in q_i \mid x_{t-1} \in q_j\right), \quad \sum_{j} w_{ij} = 1.$$
Step 3. The time dependence was added to the transition probability matrix $W$, and the Markov transition field $M$ with the size of $n \times n$ was constructed. Each element $M_{kl}$ represents the transition probability from the bin containing $x_k$ to the bin containing $x_l$, and the calculation of $M$ is
$$M_{kl} = w_{ij \,\mid\, x_k \in q_i,\; x_l \in q_j}, \quad k, l = 1, 2, \dots, n.$$
To better manage the graph and improve the operational efficiency, the principle of PAA was applied at this stage to reduce the size of the MTF matrix, which was gridded and averaged.
The effect of converting the time-series into MTF images is described in Figure 4.
The MTF images intuitively show the law of change through the variation of colors, and the trends in the MTF images are similar to those in the grayscale and RP images.
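A minimal NumPy reading of the MTF construction is sketched below; it assumes equal-probability quantile bins and estimates $W$ from row-normalized transition counts, with the bin count as an illustrative parameter.

```python
import numpy as np

def markov_transition_field(x, n_bins=4):
    """MTF: quantile-bin the series, estimate the first-order transition
    matrix W (Q x Q), then spread it over time as
    M[k, l] = W[bin(x_k), bin(x_l)] (an n x n field)."""
    x = np.asarray(x, dtype=float)
    # interior quantile edges define Q equal-probability bins 0..Q-1
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)
    # count transitions between consecutive points, then row-normalize
    W = np.zeros((n_bins, n_bins))
    for a, b in zip(bins[:-1], bins[1:]):
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)
    # time-dependent field: probability of moving between the bins of x_k and x_l
    return W[bins[:, None], bins[None, :]]

M = markov_transition_field(np.arange(16.0))
```

In practice, as noted above, PAA gridding and averaging would then shrink this $n \times n$ field to a manageable size.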
2.1.4. Conversion from Time-Series to Gramian Angular Difference Field
The Gramian Angular Field (GAF) method was used to transform the scaled one-dimensional time-series into a polar coordinate system and construct a bijective mapping between the one-dimensional time-series and the two-dimensional space. The GAF method can be divided into two implementation methods according to the angle calculation between different time points: the Gramian Angular Summation Field (GASF) and the GADF. A GAF image was a graphical representation of a Gramian matrix in which each element was the superposition of directions between different time intervals, and the polar coordinate system was used to retain the time correlation.
The conversion processes of GASF and GADF were similar, and the conversion process of GADF was as follows.
Step 1. The one-dimensional time-series was scaled numerically. The time-series in the Cartesian coordinate system was scaled to the $[-1, 1]$ interval, and the calculation of the time-series scaling is
$$\tilde{x}_i = \frac{\left(x_i - \max(X)\right) + \left(x_i - \min(X)\right)}{\max(X) - \min(X)}.$$
Step 2. The scaled sequence data $\tilde{X}$ were transformed from the Cartesian coordinate system to a polar coordinate system, in which the value was regarded as the cosine of the included angle, the timestamp was treated as the radius, and the $\arccos$ function was used for the mapping. This method retained the time dependence through the $r$ coordinate, and the coordinate transformation equations are
$$\phi_i = \arccos\left(\tilde{x}_i\right), \quad -1 \leq \tilde{x}_i \leq 1; \qquad r_i = \frac{t_i}{N},$$
where $t_i$ is the timestamp, and $N$ is a constant factor to regularize the span of the polar coordinate system.
Step 3. The GADF matrix was obtained by the trigonometric transformation of the differences between pairs of angles. The calculation of the GADF matrix is
$$\mathrm{GADF} = \left[\sin\left(\phi_i - \phi_j\right)\right] = \sqrt{I - \tilde{X}^2}^{\,\top} \tilde{X} - \tilde{X}^{\top} \sqrt{I - \tilde{X}^2},$$
where $I$ is the unit row vector $[1, 1, \dots, 1]$.
Similar to other image characterization methods mentioned earlier, the PAA method was used to retain the sequence trend and reduce the sequence size at this stage.
The effect of converting the time-series into GADF images is described in Figure 5.
The GADF images reveal the temporal correlation between data pairs and intuitively preserve the spatial variation law. The variation in the GADF images is roughly the same as that of the three image representation methods mentioned above.
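The scaling, angular mapping, and pairwise difference steps can be condensed into a short sketch; the `clip` call is only a numerical safeguard against rounding outside $[-1, 1]$ and is not part of the method itself.

```python
import numpy as np

def gadf(x):
    """Gramian angular difference field: rescale the series to [-1, 1],
    map values to polar angles phi = arccos(x~), and form
    GADF[i, j] = sin(phi_i - phi_j)."""
    x = np.asarray(x, dtype=float)
    xs = ((x - x.max()) + (x - x.min())) / (x.max() - x.min())  # scale to [-1, 1]
    phi = np.arccos(np.clip(xs, -1.0, 1.0))                     # polar angles
    return np.sin(phi[:, None] - phi[None, :])

G = gadf(np.sin(np.linspace(0, 2 * np.pi, 20)))
```

Since $\sin(\phi_i - \phi_j) = -\sin(\phi_j - \phi_i)$, the resulting matrix is antisymmetric with a zero diagonal, which distinguishes GADF from the symmetric GASF.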
2.2. Image Enhancement Stage
The multi-resolution decomposition of the wavelet transform was used to perform a multistage two-dimensional discrete wavelet transform on the images using low-pass and high-pass filters. The image signals were thereby decomposed into low-frequency and high-frequency components. In the images, most of the noise and some edge details belonged to the high-frequency subbands, while the low-frequency subband mainly characterized the approximate signals of the images. The high-frequency and low-frequency subbands were processed by different methods to enhance the images, including reducing noise, improving contrast, and strengthening details. Then, the reconstructed image was obtained by applying the inverse discrete wavelet transform to the processed components.
The two-dimensional image signals were filtered in the horizontal and vertical directions to realize the two-dimensional wavelet multi-resolution decomposition. First, the signals of the images were decomposed row by row to obtain the low-frequency component ($L$) and the high-frequency component ($H$) in the horizontal direction. Then, the columns of the transformed data were decomposed to obtain the low-frequency component ($LL$) and the high-frequency components ($LH$, $HL$, $HH$) in four directions. The reconstructed images could be obtained by the inverse discrete wavelet transform in the opposite direction. The aforementioned process of image decomposition and reconstruction is described in Figure 6.
The $LL$ subband was an approximate representation obtained using a low-pass wavelet filter. The $LH$ subband was obtained using a low-pass wavelet filter and a high-pass wavelet filter, and showed the singularity of the image in the horizontal direction. The $HL$ subband was obtained using the same two filters in the opposite order and represented the singular characteristics of the image in the vertical direction. The $HH$ subband, obtained using a high-pass wavelet filter, indicated the diagonal edge characteristics of the images. Different measures were taken for the low-frequency and high-frequency components to improve the contrast of the images and strengthen the texture details. If a low-frequency coefficient was greater than 250, it was multiplied by 0.75; if a high-frequency coefficient was less than 150, it was multiplied by 1.25. The aforementioned process is depicted in Figure 7.
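The decomposition, coefficient scaling, and reconstruction can be illustrated with a self-contained Haar-wavelet sketch. The paper does not state which wavelet family it uses, and applying the 150-threshold rule to the magnitude of the (possibly negative) high-frequency coefficients is an assumption made here for illustration.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar transform: filter rows into low (L) / high (H)
    parts, then filter columns, yielding the LL, LH, HL, HH subbands."""
    a = np.asarray(img, dtype=float)
    L = (a[:, 0::2] + a[:, 1::2]) / 2.0          # row low-pass (average)
    H = (a[:, 0::2] - a[:, 1::2]) / 2.0          # row high-pass (difference)
    def cols(b):
        return (b[0::2, :] + b[1::2, :]) / 2.0, (b[0::2, :] - b[1::2, :]) / 2.0
    LL, LH = cols(L)
    HL, HH = cols(H)
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2 (average/difference pairs recombine)."""
    def uncols(lo, hi):
        out = np.empty((lo.shape[0] * 2, lo.shape[1]))
        out[0::2, :] = lo + hi
        out[1::2, :] = lo - hi
        return out
    L = uncols(LL, LH)
    H = uncols(HL, HH)
    out = np.empty((L.shape[0], L.shape[1] * 2))
    out[:, 0::2] = L + H
    out[:, 1::2] = L - H
    return out

def enhance(img):
    """Scaling rule from the text: shrink large low-frequency coefficients
    (> 250 -> x0.75), amplify small high-frequency ones (|c| < 150 -> x1.25),
    then reconstruct."""
    LL, LH, HL, HH = haar_dwt2(img)
    LL = np.where(LL > 250, LL * 0.75, LL)
    LH, HL, HH = (np.where(np.abs(b) < 150, b * 1.25, b) for b in (LH, HL, HH))
    return haar_idwt2(LL, LH, HL, HH)
```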
2.3. Image Clustering Stage
In this stage, an unsupervised deep clustering network, DC, was designed and applied. The target of the DC network was to define a parametric nonlinear mapping from the data space to a low-dimensional feature space and to complete clustering in that low-dimensional space. The DC network, which followed the ideas of the DEC model and the Inception network, included an improved autoencoder for feature learning and a clustering layer for clustering.
In the feature learning stage, the autoencoder, incorporating the convolution kernel factorization method, was trained to learn the mapping parameters from the data space to the feature space. Asymmetric convolution kernel factorization was first proposed by Szegedy [29], and its effectiveness in CNN networks was demonstrated there. According to the convolution kernel factorization, a 3 ∗ 3 convolution kernel was replaced by a 1 ∗ 3 and a 3 ∗ 1 kernel, reducing the number of parameters without reducing the effective receptive field. Meanwhile, the symmetric structure of the samples was retained after enhancement, and the model could learn more valuable representations. The structural parameters of the improved autoencoder are shown in Figure 8.
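The parameter saving from the factorization can be checked with a quick count (biases ignored); the 32-channel width below simply mirrors the encoder description and is otherwise arbitrary.

```python
def conv_params(kh, kw, c_in, c_out):
    """Weights in a kh x kw convolution with c_in input and c_out output channels."""
    return kh * kw * c_in * c_out

# a 3 x 3 kernel vs. its 1 x 3 followed by 3 x 1 factorization, 32 -> 32 channels
full = conv_params(3, 3, 32, 32)
factored = conv_params(1, 3, 32, 32) + conv_params(3, 1, 32, 32)
```

The factorized pair uses 6/9 of the weights of the full kernel, a one-third reduction, while the stacked 1 ∗ 3 and 3 ∗ 1 kernels still cover a 3 ∗ 3 receptive field.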
The improved autoencoder consisted of an encoder and a decoder. The encoder stored the parameters of the learned feature representation, and the decoder was employed to reconstruct the data. The encoder was composed of 10 layers, and the numbers of its channels were 3, 32, 32, 32, 16, 16, 16, 8, 8, and 8, respectively. A small 3 ∗ 1 receptive field was set in the first convolution layer, followed by a second convolution layer with a 1 ∗ 3 effective receptive field. Then, convolution kernels of 3 ∗ 3 and 5 ∗ 5 were subsequently added to obtain comprehensive distinctive features. The first three convolution layers were followed by batch normalization (BN) and max pooling. A flattening operation for the learned feature representation was performed at the end of the encoder. The decoder consisted of 10 layers, and the numbers of its channels were 3, 8, 8, 8, 16, 16, 16, 32, 32, and 32, respectively. The decoder network contained four convolution layers with weights, as shown in Figure 8. The first three convolution layers of the decoder were followed by BN and an upsampling operation. ReLU non-linearity was applied to every convolution layer. Then, the numerical probability of the predicted output was mapped to [0, 1]. The process of the DC is depicted in Figure 9.
After pre-training, the decoder layers were discarded, and the encoder was employed as the initial mapping between the data space and the feature space. The output of the encoder was used as the input of the clustering layer, and the feature data were clustered using the K-Means algorithm. Meanwhile, the CE was adopted as the loss function $L$ to optimize the objective function. The loss function $L$ is
$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + \left(1 - y_i\right) \log \left(1 - p_i\right) \right],$$
where $y_i$ represents the real label, $p_i$ denotes the predicted probability that the current sample label is 1, and $1 - p_i$ is the predicted probability that the current sample label is 0. $L$ is the total loss function over the $n$ samples and represents the difference between the ground truth and the predicted values.
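The cross-entropy loss described above corresponds to the mean binary cross-entropy, which can be written directly; the clipping constant is only a numerical guard against $\log 0$ and is not part of the definition.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over n samples:
    L = -(1/n) * sum(y * log(p) + (1 - y) * log(1 - p))."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

For example, a prediction of 0.5 on a positive sample costs $\log 2 \approx 0.693$, while perfect predictions drive the loss toward zero.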