1. Introduction
The advent of the Big Data era brings both opportunities and challenges to humanity. Wedel and Kannan [
1] said that “data is the oil of the digital economy”. Indeed, this data-rich environment is enhancing the way people live and work. To harness the potential of Big Data, identification, management, and analysis are essential tools for making intelligent and data-driven decisions. Nowadays, data are not only vast in capacity but also intricate in structure, encompassing elements like (ultra) high-dimensional data, (ultra) high-frequency data, and unequal observation time point data. Traditional data analysis techniques mainly focus on exploring univariate data, multivariate data, time-series data, cross-sectional data, and panel data. As a result, they encounter numerous limitations when attempting to extract information from Big Data. However, functional data analysis (FDA) has gained prominence as a research hotspot in statistics and various fields of science by capitalising on the advantages of utilising the functional features of observed data that vary across a continuum [
2].
Generally speaking, functional data is a type of data with functional properties that describe information about curves, surfaces, or anything that varies over a continuum. One of the most significant topics in the FDA is functional clustering. Methodologies for functional data clustering have developed rapidly and have aroused intense interest among researchers over the past few decades. The existing functional clustering methods can roughly be categorised into four groups: raw data clustering, filtering methods, adaptive methods, and distance-based methods [
3]. For raw data methods, functional data are clustered directly based on their discrete evaluation points [
4]. This method is often the simplest, as the functions are typically already recorded at specific discrete observation points. Therefore, there is no requirement to reconstruct the functional form of the data, we can directly cluster the observation points. However, the drawback of this method is its disregard for the functional characteristics of the data, potentially leading to a loss of valuable information in this regard [
3]. Filtering methods involve representing the curves using specific basis functions and then clustering them based on the coefficients of the basis expansion. Hence, the curves are expressed through a limited set of parameters (or coefficients), effectively reducing the dimensionality of the data. For the filtering phase, opting for a spline basis is a prevalent choice due to its optimal properties [
5]. For instance, Abraham et al. [
6] used the B-spline to fit the functional data and then cluster them by the K-means algorithm. Peng and Müller [
7] used the functional principal component scores and the K-means algorithm to realise the clustering. Wu et al. [
8] used functional principal component analysis and then clustered the principal scores based on the principal curves clustering method. Unlike the filtering methods, adaptive methods consider the basis expansion coefficients as random variables that adhere to a probability distribution, rather than treating them as parameters. For example, James and Sugar [
9] used the basis expansion coefficients to cluster sparsely sampled functional data based on a mixture of Gaussian distributions. Bouveyron and Jacques [
10] proposed a clustering algorithm for high-dimensional functional data within a specific functional subspace. The core idea involves assuming a Gaussian distribution for the functional principal components and defining a probabilistic model based on the mixture model. Distance-based methods modify traditional multivariate clustering algorithms to suit functional data, achieved by establishing specific distances or dissimilarities among curves. For instance, in the work of Ferraty and Vieu [
11], they suggested employing a hierarchical clustering algorithm in conjunction with either the
-metric or the semi-metric. Ieva et al. combined the K-means algorithm with
distance [
12]. Also, Tarpey and Kinateder explored the utilisation of
distance with K-means for Gaussian processes [
13].
Furthermore, Jacques and Preda [
14] considered the dependency between the curves and proposed a model-based clustering method for multivariate functional data; Secchi et al. [
15] presented the Bagging Voronoi classifiers algorithm, designed to handle the functional data of large geo-referenced datasets. Additionally, Boullé [
16] introduced a novel nonparametric method based on piecewise constant density estimation, which offers the advantage of not requiring parametric settings for functional data clustering. Meanwhile, Verma et al. [
17] provided an overview of the performance of various clustering algorithms in data mining, including K-means clustering, hierarchical clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), density-based clustering, Ordering Points to Identify the Clustering Structure (Optics), and the EM Algorithm. They point out that in all the discussed clustering techniques, the central approach involves finding cluster centres that represent each cluster. These centres act as reference points, aiding in identifying the core location of each cluster. When presented with an input, the system utilises a similarity metric between the input and cluster centres to assign the input to the nearest or most similar cluster.
In this paper, we present a novel functional clustering algorithm that hinges on a proximity measure designed for functional data trajectories, referred to as the proximity measure. This innovative approach takes inspiration from the aforementioned distance-based methods, leading us to introduce the fresh concept of proximity measurement. Diverging from the distance-based clustering methods commonly found in the literature, our proposed method takes into account both the shape and trend of the curves, operating independently of their specific positions. For instance, even if two functional curves share an identical trend but are widely separated, our proximity measure will still classify them within the same cluster. To put it succinctly, our focus is on their degree of similarity rather than their spatial proximity. In contrast, our proposed method stands apart from filtering clustering techniques as it does not require the estimation of the coefficients of the base expansion. We employ the B-spline smoothing technique to separate the real signal from the noisy data. Once the smoothed functional curves are derived from the discrete observation data, we have the flexibility to redefine the measuring points, either equidistant or nonuniform, as required to apply the proximity measure. To clarify, we extract the essence of the functional curve from the raw data and subsequently partition the smoothed functional curves into discrete measurement points once again. This strategy holds a significant advantage in handling original functional entities featuring unevenly spaced points or instances of missing data. It is worth noting that, unlike raw data clustering methods, we engage in reconstructing the functional structure of the data. This ensures the preservation of the data’s functional characteristics and the avoidance of any loss of valuable information. It is essential to highlight that K-means clustering necessitates prior knowledge of the number of clusters. This method proves highly susceptible to noise within datasets, which poses challenges in achieving optimal clustering. Another limitation of K-means lies in its inherent uncertainty, resulting in potentially varied clustering results across different runs, even when utilising the same dataset [
18]. Likewise, our proposed proximity measure algorithm incorporates a crucial parameter known as the “proximity threshold”. This threshold possesses a clear range of values and does not need to be preconfigured before using the algorithm. Moreover, our algorithm ensures consistent clustering outcomes even across multiple runs.
Another noteworthy aspect of this paper is the rolling window clustering technique, derived from the introduced proximity measure algorithm, denoted as “time-shift clustering”. This method is designed to identify segments within historical data that share a similar trend with the observed data points of interest. In this context, the “rolling window” concept discussed in this paper has similarities to the “sliding window” concept introduced in the study of Keogh and Lin [
19]. In their work, they asserted that “sliding window time-series clustering is never meaningful” due to instances of “trivial match”. However, our approach, unlike the “trivial match”, identifies subsequences that exhibit similar shapes to designated subsequences by searching across the entire timeframe, rather than focusing on locating cluster centres. Furthermore, it is important to note that the “trivial match” is typically employed in unsupervised clustering, whereas time-shift clustering falls under the category of supervised clustering. This distinction arises from our practice of presetting a target window as the designated subsequences. Another point of departure is that “trivial match” clusters the closest subsequences, which are the subsequences immediately to the left and right. In contrast, time-shift clustering can identify similar subsequences that are distant from the designated subsequences.
Finally, we applied the proposed algorithms to both simulated and real datasets. The outcomes demonstrate that the proposed methods exhibit excellent clustering accuracy and are adept at uncovering valuable insights within the data.
The rest of this paper is organised as follows.
Section 2 elaborates on the principle of the proximity measure algorithm and time-shift clustering for functional data. The proposed method is demonstrated using some simulated examples in
Section 3.
Section 4 showcases the application of the proposed clustering method to COVID-19 and NASDAQ datasets. Finally, in
Section 5, the advantages and disadvantages of the proposed algorithm are discussed, along with potential future directions.
2. Methodology
2.1. Proximity Measure and Functional Clustering Algorithm
In this section, we initially define a proximity measure for continuous functions and subsequently introduce our novel clustering algorithm for functional data, grounded in their curvilinear structure.
Let
be the collection of
m distinct continuous functions of
, i.e.,
We define the proximity
between
and
as
where
and
.
Obviously, the proximity
between
and
satisfies the following properties:
where
is a constant. Furthermore, we have the following.
Theorem 1. The proximity between and defined in Equation (2) satisfies the following reverse triangle inequality: Proof. Let
and
, we have
□
In practice, we only possess discrete observations of each function at a finite set of points. Consequently, continuous functions need to be estimated from these discrete observations using a basis of functions, such as the B-spline or Fourier basis. Subsequently, proximity can be calculated through numerical integration.
Suppose now we have m curves (), denoted collectively by . Our proposed clustering algorithm for functional data is outlined in the following steps.
- Step 1
Set up the proximity matrix. Based on Equation (
2), we calculate the proximity
, and obtain the following proximity matrix
:
It is evident that within this proximity matrix, all diagonal entries are zero.
- Step 2
Set a proximity threshold . From the proximity matrix , we find the largest value and denote it by . We define a threshold as . The threshold plays a vital role in clustering and determines the number of clusters (hence it is also referred to as the threshold of cluster separation):
- (a)
When , we will have m clusters, meaning that each object is a cluster;
- (b)
When , we will have only one cluster, meaning that all objects belong to one cluster;
- (c)
If , go to Step 3.
- Step 3
Find the two initial centroids. We refer to the different groups of functions as clusters . Assuming that and are such that their proximity is and , then we can determine the two initial cluster centroids: (belonging to ) and (belonging to ). It means that there exists a maximum difference between the trajectory of object and the trajectory of object . Thus, we set the objects and as two initial centroids, whilst and are two initial clusters.
- Step 4
Find the third centroid (if possible). Search all objects in and find an object such that and . Let . This means that the object cannot be grouped into or . In other words, there is a maximum difference between the trajectory of object , the trajectory of object , and the trajectory of object :
- (a)
If exists, update the cluster centroids: , , and , and belongs to the third cluster . Go to Step 5.
- (b)
If does not exist, go to Step 6.
- Step 5
Try to find the fourth centroid. Following the same logic as in Step 4, for any object , check the following:
- (a)
If , and , move object to with the cluster centroid ;
- (b)
If , and , move object to with the cluster centroid ;
- (c)
If , and , move object to with the cluster centroid ;
- (d)
If , and , let object be in a separate cluster .
Here, we call the to provisional clusters. Next, go to Step 7.
- Step 6
If the object does not exist. For any object :
- (a)
If , move object to with the cluster centroid ;
- (b)
If , move object to with the cluster centroid .
Similarly, we call the and provisional clusters. Next, go to Step 7.
- Step 7
Check and iteration. We check all provisional clusters. For example, for the provisional cluster mentioned in Step 5 and Step 6, check the following:
- (a)
If , , is a completed cluster. Next, go to Step 8;
- (b)
If there exists at least one pair of objects , such that , then we regard the cluster as a new collection of functions , where is the size of , and repeat Step 3 to Step 7 for cluster only.
For the remaining provisional clusters, we execute the same process as described for cluster above.
- Step 8
Update and complete. Update all clusters and gather all completed clusters from Step 7. The algorithm terminates when the number of objects in all the completed clusters equals the total number of objects in .
It is worth noting that within the above algorithm, a crucial inquiry arises regarding the determination of the proximity threshold
. This threshold
can be selected according to the specific scenario and research objectives in each case. Further elaboration on
selection will be provided in
Section 4.
2.2. Time-Shift Clustering for Time-Series Data
In this section, we present a fresh concept called time-shift clustering, building upon the clustering method outlined earlier. It is important to clarify that the term “time-shift” in this context allows for position adjustments, enabling movement backwards or to different time points, rather than strictly adhering to a linear chronological order. This concept merges the proposed clustering algorithm with the rolling time window technique, with the aim of identifying historical data segments that are similar to current observations.
The rolling window analysis of time-series models is commonly employed to assess the stability of models and parameters over time or to forecast future situations [
20,
21,
22]. As discussed in
Section 1, the utilisation of the “rolling window” in this context differs from the “trivial match” mentioned in the work of Keogh and Lin [
19]. Our emphasis lies solely on searching for and identifying necessary subsequences, without necessitating the estimation of a cluster centre among all the subsequences. Therefore, in our case, we utilise a rolling window to search backwards for historical periods that are similar to the current one. Regarding the time-shift clustering procedure, the primary clustering algorithm remains consistent with the one outlined in
Section 2.1. Consequently, in the following, we will solely elucidate the aspect related to the rolling window.
Given a time series with observations , for . Suppose that we are interested in the pattern of the most recent w observations, i.e., at the time points , which could be recent w weeks, w months, or w years. We denote these time points as the target window. Our objective is to identify segments of the same length within the historical data () that exhibit patterns similar to the target window, employing a proximity measure. To attain this objective, we employ the rolling window method, which comprises the following three steps:
- Step I
Choose a rolling step size and partition the complete time series into multiple windows. In this scheme, the first rolling window encompasses observations spanning the period from to , the second rolling window spans observations from to , and so forth. When the rolling step size equals one, the time series is segmented into windows. A larger value of indicates a swifter rolling of the windows.
- Step II
Regard both the target window and each of the rolling windows as continuous functions, and then execute Step 1 to Step 8 of the clustering algorithm outlined in
Section 2.1.
- Step III
Examine all of the rolling windows (see
Figure 1) to identify those that exhibit patterns similar to the target window.
The following schematic diagram illustrates the concept of the time-shift clustering procedure.
3. Numerical Experiments
To assess the efficacy of the proposed clustering algorithm, this section presents two simulated examples. We compare the performance of our method with that of traditional K-means clustering. In the case of K-means clustering, the observations are treated as objects () evaluated at a significantly high number of equidistant points within the interval.
3.1. Simulation Study
We generated two functional datasets as follows.
The first functional curves dataset (Case 1) was generated using the following functions:
where
, and
. We generated 30 sample curves for each of the aforementioned three functions, each containing 26 data points. These data points were evaluated at equidistant intervals within the range of
. An illustration of the simulated curves is presented in
Figure 2.
The second functional curves dataset (Case 2) is defined by the following functions [
23]:
where
. Likewise, we generated 30 sample curves for each of the aforementioned three functions, each comprising 26 data points. These data points were evaluated at equidistant intervals within the range of
. An illustration of the simulated curves is presented in
Figure 3.
3.2. Simulation Results
A simulation study comprising 10,000 replications was carried out, considering the Case 1 and Case 2 functional curves datasets. We present the outcomes of our proposed clustering method and contrast them with the conventional K-means clustering approach. We choose to use the average error rate as our performance evaluation metric instead of the mode or median. This decision is based on the fact that the mode is not always unique, and the median closely approximates the mean value in our case. Opting for the mean value provides a more accurate representation of the error level since all error rates exhibit a relatively stable pattern devoid of outliers.
Therefore, the average error rates for these two clustering methods are presented in
Table 1.
In the proximity measure method, the selection of the proximity threshold depends on the desired number of clusters. Given that these 90 curves are generated by three distinct functions, the ultimate number of clusters should be three. Consequently, we set the proximity threshold to yield a total of three final clusters. Notably, there exists a range of threshold values that lead to the formation of three clusters. Nonetheless, the clustering outcomes for varying threshold values exhibit remarkable similarity, with closely aligned error rates.
As an illustrative instance,
Table 1 showcases five thresholds for both Case 1 and Case 2. The error rates for the K-means clustering approach are computed by specifying the correct number of clusters. The data in
Table 1 reveal that the error rates produced by the proposed method for different threshold values are very similar and notably lower than those resulting from K-means clustering.
Let us now take a closer look at some of the typical clustering results.
Figure 4a,b exemplify instances of poor clustering using the proposed proximity measure clustering and the K-means method for Case 1, respectively. A distinct contrast can be observed between these two clustering outcomes.
Similar patterns can be discerned in
Figure 5a,b, which provide an example of poor clustering results using the proximity measure and the K-means method for Case 2, respectively.
To provide a clearer demonstration, we use Silhouette analysis to illustrate the poor clustering results for the simulation study.
Figure 6 and
Figure 7 show the Silhouette plots for Case 1 and Case 2 using the proximity measure and K-means methods, which correspond to the clustering results of
Figure 4 and
Figure 5.
Overall, based on the Silhouette plots, the worst-case scenario for K-means clustering is even worse than that of the proximity measure.
Furthermore, we used the Rand Index (RI) and Adjusted Rand Index (ARI) to assess the performance of these two methods. In the proximity measure, both the RI and ARI values are very close to one, indicating significantly better performance compared to the K-means method, as shown in
Table 2.
For illustrative purposes, examples of accurate clustering results are presented in
Figure 8 and
Figure 9 for both the proposed method and the K-means algorithm, in the context of Case 1 and Case 2, respectively. Hence, these simulation outcomes demonstrate the superiority of our proposed method in effectively identifying curvilinear characteristics within curve clustering.
5. Discussion and Conclusions
In this paper, we introduced a novel curve clustering algorithm based on the proximity measure for continuous functions. To validate the utility and effectiveness of these proposed algorithms, we conducted several numerical experiments and compared the outcomes against the K-means clustering algorithm. The simulation results underscore the capability of our algorithm to identify curvilinear features within functional data, yielding clustering results with enhanced accuracy compared to K-means. Moreover, as observed in
Figure 4a,b and
Figure 5a,b, the proximity measure algorithm can effectively focus on a curve’s trend while remaining independent of its positional information. In contrast, the K-means clustering approach tends to cause confusion since it cannot disregard positional information, which consequently results in dissimilar curves being grouped together due to close proximity. Subsequently, we extended our analysis by applying the proximity measure algorithm to the COVID-19 dataset. The obtained clustering results are satisfactory, demonstrating that confirmed new cases and confirmed new deaths in 49 countries can be grouped into six clusters based on the designated threshold. Each group can be explained by certain common features.
Moreover, we have extended our efforts by developing a time-shift clustering algorithm that merges proximity measure clustering with a rolling window technique. This innovative approach was then applied to the NASDAQ Composite data, resulting in the successful identification of similar subsequences that correspond to three target windows. This has been proven we can extend this method to explore and mine any functional dataset with potential periodic regularity.
The findings of this paper can be understood as the proposed proximity measure enriches the field of functional data clustering. It is especially suitable for identifying and clustering curve features of function curves, without interference from spatial position information. Searching for similar subsequences across the entire timeframe may be considered a further validation of the flexible applicability of the proximity measures.
Although the proposed algorithms have demonstrated effectiveness, it is important to note that the choice of the proximity threshold remains a significant consideration. This choice is closely tied to the specific context and goals of clustering, making it potentially challenging to determine an appropriate value in certain scenarios. Determining an optimal proximity threshold is an aspect we intend to address in future research. Moreover, future investigations should focus on clustering continuous functional curves directly, without the need to convert them into discrete data.
Another avenue of future exploration involves extending the algorithms to cluster surfaces and multidimensional functional data. Additionally, the time-shift clustering approach holds promise for applications in assessing economic cycles within finance, showcasing its potential usefulness beyond the contexts explored in this paper.