Abstract
Spectral clustering has established itself as a powerful technique for data partitioning across various domains due to its ability to handle complex cluster structures. However, its computational efficiency remains a challenge, especially with large datasets. In this paper, we propose an enhancement of spectral clustering by integrating Cover tree data structure to optimize the nearest neighbor search, a crucial step in the construction of similarity graphs. Cover trees are a type of spatial tree that allow for efficient exact nearest neighbor queries in high-dimensional spaces. By embedding this technique into the spectral clustering framework, we achieve significant reductions in computational cost while maintaining clustering accuracy. Through extensive experiments on random, synthetic, and real-world datasets, we demonstrate that our approach outperforms traditional spectral clustering methods in terms of scalability and execution speed, without compromising the quality of the resultant clusters. This work provides a more efficient utilization of spectral clustering in big data applications.
1. Introduction
Clustering is a foundational unsupervised learning task that aims to partition data points into distinct groups based on similarity. Among existing approaches, spectral clustering has gained significant attention due to its ability to identify non-convex clusters and its solid theoretical foundation in spectral graph theory [1,2]. The core of this technique involves the eigendecomposition of a graph Laplacian matrix to embed data points into a low-dimensional eigenspace where clusters become more separable. For our work, we chose to employ the normalized graph Laplacian $L_{sym} = I - D^{-1/2} W D^{-1/2}$, where W is the similarity matrix and D is the degree matrix [2,3]. This choice is motivated by its well-established theoretical properties and its tendency to yield more stable and consistent clustering results across varied data densities compared to the unnormalized Laplacian.
However, the practical application of spectral clustering to large-scale datasets is severely hampered by its substantial computational complexity. The primary bottlenecks are twofold:
- Similarity Matrix Construction: Building the dense pairwise similarity matrix W requires $O(n^2 d)$ time and $O(n^2)$ space, which is prohibitive for large n.
- Eigendecomposition: Solving the eigenproblem for the Laplacian matrix scales as $O(n^3)$ in general. While approximate methods can reduce this, the quadratic scaling of constructing W remains the dominant constraint.
To mitigate these challenges, various approximation schemes have been proposed, including sparse k-NN graphs and Nyström methods [4,5]. A highly promising direction involves leveraging advanced data structures for efficient computation. Notably, Cover trees [6] offer a principled framework for accelerating nearest neighbor queries with a query time complexity of $O(c^{12} \log n)$ in metric spaces, where c is the expansion constant, a significant improvement over a naive linear scan. Beyond mere acceleration, the hierarchical multi-scale structure of Cover trees provides a natural mechanism for intelligent data summarization and noise reduction, making them particularly suitable for enhancing clustering algorithms [7,8].
In this work, we propose the Improved Spectral Clustering with Cover tree (ISCT) algorithm, a novel framework designed to overcome the scalability limitations of traditional spectral clustering. Our approach leverages the Cover tree for a dual purpose within a two-stage approximation framework:
- Data Reduction via Cover tree-Based Summarization: We exploit the hierarchical structure of the Cover tree to select a small set of m representative points (where $m \ll n$) that preserve the essential geometric and density structure of the original dataset. This step drastically reduces the size of the problem.
- Efficient Spectral Clustering on Representatives: We perform spectral clustering using the normalized Laplacian on the resulting manageable $m \times m$ similarity matrix constructed from the representatives. The cluster labels for all n original points are then accurately inferred by assigning each point to the cluster of its nearest representative, a step executed efficiently in $O(n \log m)$ time using the Cover tree itself for fast nearest-neighbor queries.
The key insight of our approach is this dual use of the Cover tree, which provides not only a fast query structure but also an intelligent, multi-scale data summary that is inherently aligned with the clustering objective. This shifts the computational bottleneck from $O(n^3)$ to $O(m^3 + n \log m)$. Crucially, empirical observations and theoretical bounds suggest that the number of representatives m required for a faithful approximation can scale sublinearly in n for real-world datasets, promising orders-of-magnitude improvements in computational efficiency.
This work is guided by the following research questions:
- How does Cover tree-based data summarization affect the quality of the spectral embedding and the resulting clusters compared to traditional methods, particularly when using the normalized Laplacian?
- What is the trade-off between computational cost and accuracy inherent in the ISCT framework, and how is it influenced by the number of representatives m?
- How does the multi-scale hierarchy of the Cover tree influence the detection of cluster boundaries and the separation of clusters at different density levels?
The ISCT framework combines the solid theoretical foundations of spectral clustering, embodied by the normalized Laplacian, with the effective hierarchical properties of cover trees to create a scalable and efficient solution for large-scale clustering challenges.
2. Related Work
The field of cluster analysis has undergone significant evolution, representing one of the earliest developments in data analysis. Its conceptual foundations were established by Driver and Kroeber in 1932 [9], who introduced the fundamental notion of partitioning observations into groups, notably preceding the formal emergence of artificial intelligence by decades. This early work, along with crucial theoretical contributions by Zubin (1938) and Tryon (1939) [10,11] on cluster validity, established the mathematical framework for modern clustering research. The transition to practical algorithmic implementations occurred in the 1960s–1970s with MacQueen’s introduction of the K-Means algorithm, representing a watershed moment that provided the first computationally tractable partition-based approach. Concurrently, hierarchical clustering methods gained prominence through the work of Johnson (1967) [12], while the 1970s witnessed the emergence of probabilistic mixture models. A fundamental paradigm shift occurred with the development of spectral clustering methods, grounded in Fiedler’s work on algebraic connectivity and subsequently refined by Shi and Malik (2000) [3] through normalized cuts. The advent of big data in the 2000s exposed the scalability limitations of traditional methods, particularly the eigendecomposition bottleneck in spectral clustering, sparking intensive research into approximate algorithms. This period saw the emergence of Cover trees, providing a rigorous framework for efficient nearest neighbor search. Contemporary developments have focused on addressing the dual challenges of scalability and quality preservation, including reduced clustering methods based on the inversion formula density estimation by Lukauskas et al. (2023) [13] and novel approaches for analyzing research community dynamics by Cambe et al. (2022) [14].
The development of clustering methodology provides essential context for our proposed algorithm. We focus on the evolution of spectral clustering and related scalability solutions. The theoretical underpinnings of spectral clustering are deeply rooted in spectral graph theory [15]. The algorithm’s effectiveness for non-convex clusters stems from its relaxation of graph partitioning problems, such as the normalized cut [3]. The standard algorithm involves constructing a similarity graph, forming a Laplacian matrix (typically $L = D - W$ or $L_{sym} = I - D^{-1/2} W D^{-1/2}$), computing the first k eigenvectors of the Laplacian, and applying a standard clustering algorithm such as k-means to the rows of the eigenvector matrix to obtain the final partition. The phrase “simplified bisection via the eigenvector” often refers to the special case of $k = 2$, where the second eigenvector (the Fiedler vector) is thresholded to partition the graph [16]. For $k > 2$, the process is a generalization of this bisection concept [13].
The high computational cost of spectral clustering has spurred research into approximations. A common approach is to sparsify the similarity matrix using k-NN or $\varepsilon$-neighborhood graphs, reducing the storage to $O(nk)$ and enabling the use of sparse eigensolvers such as ARPACK [17]. The Nyström method is another prominent technique, which approximates the eigendecomposition using a subset of landmark points [5]; the complexity is reduced to $O(l^3 + nl)$ for $l$ landmarks. Our method differs by using the Cover tree not just to select landmarks but to ensure they provide a geometrically meaningful summary of the entire dataset [14].
Data structures like KD-Trees, Ball Trees, and Cover trees have been widely used to accelerate distance-based computations. Their integration into spectral clustering has primarily been limited to constructing approximate k-NN graphs [18]. The ISCT algorithm extends this integration beyond a mere neighbor search to a holistic data reduction and assignment strategy, leveraging the Cover tree’s theoretical guarantees for both tasks.
Our proposed ISCT algorithm represents a natural evolution of this historical trajectory, integrating Cover tree hierarchical decomposition with spectral techniques to bridge the gap between classical graph-theoretic approaches and modern scalable computing requirements, and building upon this rich methodological heritage to address specific computational bottlenecks that limit spectral methods’ applicability to large-scale datasets.
3. Methods
3.1. Spectral Clustering
Spectral clustering is a sophisticated and widely used technique in the field of machine learning and data analysis for partitioning data into clusters based on the eigenvalues (spectrum) of a similarity matrix derived from the data [19]. Unlike traditional clustering methods such as K-Means, which rely on distance metrics to form clusters, spectral clustering leverages the connectivity structure of the data [20]. The process begins with the construction of a similarity graph $G = (V, E)$, where nodes represent data points and edges represent the similarity between pairs of points. This graph is then transformed into a Laplacian matrix, whose eigenvectors are used to map the data points into a lower-dimensional space.
The adjacency matrix A of the graph is defined as follows:

$A_{ij} = \exp\!\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right) \text{ for } i \ne j, \qquad A_{ii} = 0.$
The degree matrix D is a diagonal matrix where each diagonal element is the sum of the similarities of node i:

$D_{ii} = \sum_{j=1}^{n} A_{ij}.$
The unnormalized graph Laplacian L is then given by

$L = D - A.$
Alternatively, the normalized graph Laplacian can be defined in two forms:

$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}, \qquad L_{rw} = D^{-1} L = I - D^{-1} A.$
The next step involves computing the eigenvalues and eigenvectors of the Laplacian matrix. Let $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ be the eigenvalues and $u_1, u_2, \dots, u_n$ be the corresponding eigenvectors. By selecting the eigenvectors corresponding to the smallest k eigenvalues (excluding the zero eigenvalue for the normalized Laplacian) [21], we form a matrix U where each row represents a data point in the new lower-dimensional space:

$U = [u_1, u_2, \dots, u_k] \in \mathbb{R}^{n \times k}.$
Finally, a clustering algorithm such as K-Means is applied to the rows of U to partition the data into k clusters. This method excels in identifying clusters that are not necessarily spherical in shape and can capture complex structures in the data, making it particularly effective for applications such as image segmentation [22], social network analysis, and bioinformatics. The power of spectral clustering lies in its ability to utilize the global information of the data structure [23], providing more accurate and meaningful clusters in scenarios where traditional methods might fail.
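The steps above can be condensed into a short, illustrative sketch (not the paper's ISCT implementation): build an RBF similarity graph, form the symmetric normalized Laplacian, embed with its first k eigenvectors, and run K-Means on the rows. The bandwidth `sigma = 0.1` and the row-normalization step (popularized by Ng, Jordan, and Weiss) are our assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

def spectral_clustering(X, k, sigma=0.1):
    # Dense RBF similarity matrix with no self-loops
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors of the k smallest eigenvalues form the spectral embedding
    vals, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = spectral_clustering(X, k=2)
```

On the two-moons data, where K-Means applied directly to X fails, the embedding makes the two crescents linearly separable.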
3.2. Cover Tree
The Cover tree is a sophisticated data structure specifically designed for facilitating efficient nearest neighbor searches in high-dimensional metric spaces. Introduced by Beygelzimer et al. in 2006 [6], the Cover tree addresses the challenge of the “curse of dimensionality” that significantly impacts the performance of traditional nearest neighbor search algorithms as the dimensionality of the data increases. The fundamental principle behind the Cover tree is to maintain multiple hierarchical levels of the dataset, where each level is a “cover” for the level below it [24], ensuring that every point at a lower level is within a certain distance of a point in the level above.
The Cover tree is constructed based on three main properties: nesting, covering, and separation.
Let $C_i$ denote the cover set at level i of the tree; the principal invariants associated with the Cover tree are defined as follows:
- Nesting property: $C_i \subset C_{i-1}$.
- Cover property: For any point $p \in C_{i-1}$, there exists a point $q \in C_i$ such that $d(p, q) \le 2^i$.
- Separation property: For all $p, q \in C_i$ with $p \ne q$: $d(p, q) > 2^i$.
- Parent–child distance: For each child c of a parent p on the scale i: $d(p, c) \le 2^i$.
One of the key advantages of the Cover tree is its ability to perform nearest neighbor searches in $O(c^{12} \log n)$ time, where n is the number of points in the dataset, making it exceptionally efficient compared to brute-force methods, especially in high-dimensional spaces. The hierarchical structure of the Cover tree allows for a significant reduction in the search space at each level [25], enabling rapid convergence to the nearest neighbor or neighbors.
The Cover tree’s efficiency and scalability make it a valuable tool in various applications, including machine learning, pattern recognition, and data mining, where quick and accurate nearest neighbor searches are crucial. Its design elegantly balances the trade-off between space and time complexity, providing a robust solution to one of the most pervasive challenges in computational geometry and high-dimensional data analysis.
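To make these invariants concrete, the following toy sketch builds a single level as a greedy $2^i$-net of a one-dimensional point set (a deliberate simplification of the full multi-level construction) and verifies the separation and covering properties. The helper `build_level` is illustrative only, not part of any published Cover tree API.

```python
import numpy as np

def build_level(points, i):
    """Greedy 2^i-net: keep a point only if it lies more than 2**i
    from every point already kept (separation holds by construction)."""
    radius = 2.0 ** i
    level = []
    for p in points:
        if all(abs(p - q) > radius for q in level):
            level.append(p)
    return level

rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=200)
C_i = build_level(pts, i=2)   # scale 2^2 = 4

# Separation: distinct level points are strictly more than 2^i apart
separation_ok = all(abs(p - q) > 4
                    for a, p in enumerate(C_i) for q in C_i[a + 1:])
# Covering: every original point lies within 2^i of some level point
covering_ok = all(min(abs(p - q) for q in C_i) <= 4 for p in pts)
```

Any point rejected by the greedy pass was within the radius of an already kept point, which is exactly why the covering check succeeds.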
3.3. Proposed Method
The Improved Clustering with Cover tree algorithm aims to cluster a dataset efficiently by leveraging a Cover tree structure. The algorithm takes as input a data matrix V and the desired number of clusters k, and outputs k clusters. The process begins by decomposing the dataset V with a Cover tree [26], which yields m representative points and a tree structure T; this decomposition makes the data more manageable. Next, a similarity matrix W is formed for the representative points. The algorithm then initializes a level variable l to the initial level $l_{init}$ of the Cover tree [26] and iteratively processes each level from $l_{init}$ down to 1, updating the clustering indicators. At the initial level, the indicator matrix $E^{(l_{init})}$ is set to an identity matrix (or an equivalent initial clustering indicator); at every other level, $E^{(l)}$ is computed from the similarity matrix W and the tree structure. The level l is decremented after all nodes at the current level have been processed. Once all levels are processed [27], the algorithm recovers the cluster membership of each original data point by mapping the results from the representative points back to the original dataset. Finally, the K-Means algorithm clusters the points in a reduced k-dimensional space, represented by the rows of the final indicator matrix, into k clusters, and the algorithm outputs the k clusters [28] as the final result. This method enhances efficiency and scalability [23], making it suitable for large, high-dimensional datasets by focusing on representative points and iteratively refining the clusters.
3.3.1. Laplacian Matrix Formulations and Their Implications for ISCT
The choice of Laplacian matrix formulation is a critical design decision in spectral clustering. Given a similarity matrix W and degree matrix D where $D_{ii} = \sum_j W_{ij}$, the three primary formulations are as follows. The unnormalized Laplacian $L = D - W$ preserves cluster size but suffers from scale sensitivity, favoring larger clusters. The symmetric normalized Laplacian $L_{sym} = I - D^{-1/2} W D^{-1/2}$ addresses scale sensitivity and provides better numerical stability, with eigenvalues satisfying $0 \le \lambda_i \le 2$. Its eigenvectors require transformation: $v = D^{-1/2} u$, where u are the eigenvectors of $L_{sym}$. The random walk Laplacian $L_{rw} = D^{-1} L$ relates to Markov chains but is asymmetric, complicating eigendecomposition. The ISCT algorithm employs $L_{sym}$ for numerical stability with varying node degrees, scale invariance across density regions, computational efficiency from symmetric eigensolvers, and theoretical guarantees related to the Cheeger inequality. The eigendecomposition computes $L_{sym} u_i = \lambda_i u_i$ for $i = 1, \dots, k$, with the transformed eigenvectors $v_i = D^{-1/2} u_i$ used for clustering. The computational cost comprises forming $D^{-1/2} W D^{-1/2}$ in $O(m^2)$ operations, eigendecomposition in $O(m^3)$ with efficient symmetric solvers, and the eigenvector transformation adding $O(mk)$ operations.
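As a quick numerical check of the relations just described (eigenvalues of the symmetric normalized Laplacian lying in [0, 2], and $v = D^{-1/2} u$ carrying its eigenvectors to eigenvectors of the random walk Laplacian with the same eigenvalue), the following sketch builds all three Laplacians for a small random similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (6, 6))
W = (A + A.T) / 2                      # symmetric similarity matrix
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W                              # unnormalized Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt    # symmetric normalized Laplacian
L_rw = np.linalg.inv(D) @ L            # random walk Laplacian

lam, U = np.linalg.eigh(L_sym)         # eigenvalues ascending
v = D_inv_sqrt @ U[:, 1]               # transformed eigenvector v = D^{-1/2} u
```

Since $L_{rw} = D^{-1/2} L_{sym} D^{1/2}$, the two matrices are similar and share their spectrum, which is why the transformed vector remains an eigenvector.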
Our general Algorithm 1 is as follows:
| Algorithm 1 Improved Clustering with Cover tree |
To clarify the essential points of our algorithm, the matrix $E^{(l)}$ is defined as a binary matrix representing the hierarchical correspondence of superpoints. Each entry $E^{(l)}_{ji} = 1$ indicates that superpoint j at level $l-1$ is associated with superpoint i at level l. Formally, let $C_l$ be the set of superpoints at level l; then,

$E^{(l)}_{ji} = \begin{cases} 1, & \text{if } s_j \in \mathrm{children}(s_i), \\ 0, & \text{otherwise}, \end{cases}$

where $\mathrm{children}(s_i)$ denotes the direct subnodes of $s_i \in C_l$ in the tree. This structure ensures consistent clustering between adjacent levels.
The initial level is determined from the geometric properties of the Cover tree. Concretely, $l_{init} = \lceil \log_2 \mathrm{diam}(V) \rceil$, where $l_{init}$ is the coarsest level, characterized by a maximum node diameter corresponding to the greatest distance between points in the dataset. This diameter is calculated by

$\mathrm{diam}(V) = \max_{u, v \in V} \| u - v \|.$
This calculation ensures that the hierarchical decomposition begins with a global view of the data.
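As a minimal sketch of this initial-level computation (assuming Euclidean distance and the ceiling-of-log rule stated above; the $O(n^2)$ pairwise scan is acceptable here because it runs once):

```python
import math
import numpy as np

def initial_level(X):
    """Coarsest Cover tree level: smallest integer l with 2**l >= diam(X)."""
    diffs = X[:, None, :] - X[None, :, :]
    diam = np.sqrt((diffs ** 2).sum(-1)).max()   # greatest pairwise distance
    return math.ceil(math.log2(diam))

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])  # diameter = 5
l0 = initial_level(X)                               # 2**3 = 8 covers 5
```

With diameter 5, the smallest covering scale is $2^3 = 8$, so the decomposition starts at level 3.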
Finally, the initialization of the E matrix, which corresponds to the matrix $E^{(l)}$ when $l = l_{init}$, is completed as follows:
- At the $l_{init}$ level, each superpoint is treated as a singleton cluster, resulting in $E^{(l_{init})} = I_m$ (the identity matrix of size $m \times m$), where m is the number of superpoints.
In this paragraph, we describe the algorithm presented above in more detail. It introduces an innovative approach to clustering large-scale or high-dimensional datasets by leveraging the hierarchical structure of a Cover tree for data decomposition. At its core, the method begins by transforming the input data matrix $V \in \mathbb{R}^{n \times d}$, where n is the number of data points and d the dimensionality of each point, into a more manageable form through the construction of a Cover tree [24]. This tree effectively summarizes the dataset by identifying representative points, thereby reducing the complexity of the dataset while preserving its intrinsic structure.
Following the decomposition, a similarity matrix is constructed to encapsulate the pairwise similarities between the representative points. This matrix plays a pivotal role in the subsequent clustering process, which is performed iteratively across different levels of the Cover tree. The iterative process begins at a predefined initial level and progresses downwards, with each level’s clustering results influencing the computation of the indicator matrix for the nodes at that level [26]. This hierarchical clustering mechanism ensures that the algorithm efficiently captures the multi-scale structure of the data.
A crucial step in the algorithm involves the recovery of cluster memberships for the original data points from the clustering results of the representative points. This step is facilitated by a correspondence table that maps each data point to its nearest representative point, thereby allowing the algorithm to extend the clustering results from the representative points to the entire dataset.
The final phase of the algorithm employs the K-Means clustering algorithm to refine the clustering results. Each row of the final indicator matrix, which represents the cluster memberships of the representative points, is treated as a point in $\mathbb{R}^k$ [29]. The K-Means algorithm is then applied to these points to produce the final clustering of the original dataset into k clusters.
This sophisticated algorithm harnesses the efficiency of the Cover tree structure to address the challenges associated with clustering large or high-dimensional datasets [30]. By focusing on representative points and employing a hierarchical clustering strategy [31], the algorithm significantly reduces computational complexity while ensuring that the final clustering results accurately reflect the underlying structure of the data.
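The overall pipeline can be sketched end to end as follows. Three hedges apply: farthest-point sampling stands in for the Cover tree decomposition (both yield a well-spread net of representatives), scikit-learn's SpectralClustering replaces the hierarchical level-by-level spectral step, and a KD-tree performs the nearest-representative assignment.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

def farthest_point_sample(X, m, seed=0):
    """Greedy farthest-point sampling: a well-spread net of m representatives."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(m - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    return np.array(idx)

centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=0.8,
                  random_state=7)
reps = farthest_point_sample(X, m=100)

# Spectral clustering runs only on the m = 100 representatives
rep_labels = SpectralClustering(n_clusters=4, affinity="nearest_neighbors",
                                n_neighbors=10,
                                random_state=0).fit_predict(X[reps])

# Propagate labels to all n points via nearest representative
_, nearest = cKDTree(X[reps]).query(X)
labels = rep_labels[nearest]
```

The expensive eigendecomposition thus operates on a 100-by-100 problem rather than a 2000-by-2000 one, while every original point still receives a label.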
3.3.2. Mathematical Formalization of ISCT
The method performs hierarchical spectral clustering on a data matrix using a Cover tree structure. The process proceeds from finer to coarser levels, where clustering results from child nodes are treated as superpoints for parent nodes. For a parent node with superpoint set S and child nodes with superpoint sets $S_1$ and $S_2$, where $S = S_1 \cup S_2$, the indicator matrices for the child clustering results, $E_1$ and $E_2$, are defined as follows:

$(E_1)_{ij} = \begin{cases} 1, & \text{if superpoint } i \text{ of } S_1 \text{ belongs to cluster } j, \\ 0, & \text{otherwise}, \end{cases}$

with $E_2$ defined similarly. These are concatenated into the block-diagonal matrix

$E = \begin{pmatrix} E_1 & 0 \\ 0 & E_2 \end{pmatrix}.$
The reduced similarity matrix for the superpoint set is computed as

$\widetilde{W} = E^{\top} W_S\, E,$

where $W_S$ is the principal submatrix of the original similarity matrix W corresponding to S.
The clustering problem is formulated as the following minimization:

$\min_{F :\, F^{\top} F = I} \; \operatorname{Tr}\!\big( F^{\top} \widetilde{L}\, F \big),$

where $\widetilde{L} = \widetilde{D} - \widetilde{W}$ is the Laplacian matrix and $\widetilde{D}$ is the degree matrix of $\widetilde{W}$. The solution F is given by the first k eigenvectors of $\widetilde{L}$.
The indicator matrix for the original superpoints is obtained through the product $E F$, which assigns to each original superpoint the cluster membership of its merged group.
This process repeats up the Cover tree hierarchy until the final indicator matrix is obtained. The cluster membership for the original data points is recovered via a correspondence table, yielding an indicator matrix over all n points. Finally, each row of this matrix is treated as a point in $\mathbb{R}^k$ and clustered via k-means to produce the final result. The key advantage is the significant reduction in computational complexity achieved by operating on the much smaller matrices $\widetilde{W}$ and $\widetilde{L}$ instead of the full $n \times n$ matrices at each level.
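A small numeric illustration of the level-wise reduction: with a block indicator matrix E merging four child superpoints into two parent groups, the congruence $E^{\top} W_S E$ sums the pairwise similarities between the merged groups (the concrete values below are, of course, only an example).

```python
import numpy as np

# Similarity among 4 child superpoints: two tight pairs, weak cross links
W_S = np.array([[1.0, 0.9, 0.1, 0.0],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.0, 0.1, 0.8, 1.0]])
E = np.array([[1, 0],    # children 0, 1 -> merged group 0
              [1, 0],
              [0, 1],    # children 2, 3 -> merged group 1
              [0, 1]])
W_tilde = E.T @ W_S @ E  # 2x2 reduced similarity matrix
```

Each entry of `W_tilde` is the total similarity mass between two merged groups, so the strong within-pair weights dominate the diagonal while the weak cross links end up off-diagonal.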
3.3.3. Similarity Definition and Graph Construction
The algorithm constructs a similarity matrix on m representative points (where $m \ll n$) from a Cover tree decomposition, rather than on all n data points, significantly reducing computational complexity while maintaining clustering quality.
The similarity measurement begins with the Euclidean distance metric:

$d(x_i, x_j) = \| x_i - x_j \|_2 = \sqrt{\textstyle\sum_{p=1}^{d} (x_{ip} - x_{jp})^2},$

which is then converted to a similarity using the Gaussian radial basis function (RBF) kernel:

$W_{ij} = \exp\!\left( -\dfrac{d(x_i, x_j)^2}{2\sigma^2} \right),$

where $\sigma$ is the bandwidth parameter controlling similarity decay.
The complete similarity matrix construction is formalized in Algorithm 2.
| Algorithm 2 Similarity Matrix Construction |
This approach leverages the Cover tree’s logarithmic query complexity for efficient k-NN graph construction while maintaining the local geometric structure necessary for effective spectral clustering. The resulting sparse, symmetric matrix W captures meaningful similarity relationships between representative points, forming the foundation for subsequent spectral analysis.
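A hedged sketch of this construction follows; it uses scikit-learn's NearestNeighbors in place of the Cover tree query, and assumes a common symmetrization rule (keep an edge if either endpoint selects it). The helper name `knn_rbf_similarity` is ours, not from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_rbf_similarity(R, k=5, sigma=1.0):
    """Sparse symmetric RBF similarity over representatives R using k-NN edges."""
    m = len(R)
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(R).kneighbors(R)
    W = np.zeros((m, m))
    for i in range(m):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):   # column 0 is the point itself
            W[i, j] = np.exp(-d ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)   # symmetrize: edge survives if either side has it

rng = np.random.default_rng(0)
R = rng.normal(size=(50, 3))        # 50 representative points in 3-D
W = knn_rbf_similarity(R, k=5, sigma=1.0)
```

The result is exactly the kind of sparse, symmetric, zero-diagonal matrix the spectral step expects.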
3.3.4. Analysis of Parameter Sensitivity and Selection Methodology
The proposed Cover tree-accelerated spectral clustering framework involves four key hyperparameters that balance computational efficiency and clustering fidelity. In the following, we explain how we determine the number of representative points m, the neighborhood size $k_{nn}$, the kernel scale $\sigma$, and the number of clusters k.
The number of representatives m is selected through a hybrid approach: heuristic initialization followed by sensitivity analysis to find the smallest m that maintains clustering quality.
For the $\sigma$ and $k_{nn}$ parameters, an adaptive Gaussian kernel is employed. The similarity between points is calculated using the following equation:

$W_{ij} = \exp\!\left( -\dfrac{\| x_i - x_j \|^2}{\sigma_i \sigma_j} \right),$

where $\sigma_i$ is the distance to the $k_{nn}$-th nearest neighbor of $x_i$; this reduces parameter tuning to selecting $k_{nn}$ via grid search.
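The locally scaled kernel can be sketched as follows; it mirrors the self-tuning construction of Zelnik-Manor and Perona, with `k_sigma` standing in for the neighborhood parameter and the two-cluster test data chosen purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_scaling_similarity(X, k_sigma=7):
    """Adaptive kernel: sigma_i = distance to each point's k_sigma-th neighbor."""
    dist, _ = NearestNeighbors(n_neighbors=k_sigma + 1).fit(X).kneighbors(X)
    sigma = dist[:, -1]                               # per-point local scale
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)),      # dense cluster
               rng.normal(5, 1.0, (30, 2))])     # sparse cluster far away
W = local_scaling_similarity(X)
```

Because each point carries its own scale, the dense and sparse clusters both receive meaningful intra-cluster weights, which a single global bandwidth cannot provide.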
The number of clusters k is determined through a hierarchical validation framework evaluating candidate values $k \in \{2, 3, \dots, k_{\max}\}$ for an upper bound $k_{\max}$. For each candidate k, a composite score is computed by combining three validation metrics:
The relative eigenvalue gap: $\Delta_k = \dfrac{\lambda_{k+1} - \lambda_k}{\lambda_{k+1}}$.
The modularity measure: $Q = \dfrac{1}{2w} \sum_{ij} \left( W_{ij} - \dfrac{d_i d_j}{2w} \right) \delta(c_i, c_j)$, where w is the total edge weight and $c_i$ is the cluster of node i.
The silhouette coefficient: $s(i) = \dfrac{b(i) - a(i)}{\max(a(i), b(i))}$, averaged over all points, where $a(i)$ is the mean intra-cluster distance and $b(i)$ the mean distance to the nearest other cluster.
The cluster selection Algorithm 3 proceeds through hierarchical evaluation of these metrics.
| Algorithm 3 Adaptive Cluster Number Selection in ISCT |
The composite score uses weights $w_1$, $w_2$, and $w_3$, typically chosen to sum to one. Cluster stability is additionally assessed with a stability metric that measures the agreement between clusterings obtained under data resampling; high stability values (>0.8) indicate robust cluster number detection. This comprehensive approach addresses the fundamental challenge of determining appropriate cluster numbers in unsupervised settings.
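The eigengap component of the composite score can be illustrated numerically: on a graph with three well-separated communities, the relative gap peaks at k = 3. The helper `relative_eigengaps` is illustrative, with 0-based eigenvalue indexing matching the 1-based formula above.

```python
import numpy as np

def relative_eigengaps(W, k_max=6):
    """Relative gap (lam[k] - lam[k-1]) / lam[k] of L_sym, lam sorted ascending."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    lam = np.linalg.eigvalsh(L_sym)
    return {k: (lam[k] - lam[k - 1]) / lam[k] for k in range(2, k_max + 1)}

# Toy graph: 3 tight communities of 5 nodes, weak uniform inter-community links
W = np.kron(np.eye(3), np.ones((5, 5))) * 0.9 + 0.01
np.fill_diagonal(W, 0.0)
gaps = relative_eigengaps(W)
best_k = max(gaps, key=gaps.get)
```

The first three eigenvalues stay near zero (one per community) and the fourth jumps, so the relative gap at k = 3 dominates all other candidates.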
4. Theoretical Foundations for Accuracy Preservation
A central claim of our Cover tree-accelerated spectral clustering approach is its ability to maintain high clustering accuracy despite operating on a significantly reduced set of representative points. This section delineates the theoretical underpinnings that justify this claim, explaining how the algorithm’s design inherently preserves the complex geometric structures present in the original data.
4.1. Preservation of Local Geometry by the Cover Tree
The Cover tree decomposition provides a hierarchical net that faithfully captures dataset topology through its nesting and covering invariants, offering distinct advantages over random sampling methods.
The Cover tree decomposition ensures data representation fidelity through two fundamental properties. It maintains exact point representation, where each node corresponds to an actual data point from the original set V rather than centroids or averages, thereby preserving complex non-linear features without distortion. Additionally, it provides guaranteed proximity by ensuring that for any point $x \in V$, there exists a representative within a precisely defined distance ($2^i$ at level i) at every tree level, forming a dense hierarchical skeleton that captures the complete dataset structure.
Algorithm 4 explains this process in more detail.
| Algorithm 4 Hierarchical Spectral Clustering on Cover tree |
Final Label Assignment is defined by

$\mathrm{label}(x) = \mathrm{label}(\mathrm{rep}(x)),$

where $\mathrm{rep}(x)$ is the representative point for x.
The Role of the Final K-Means Step
The K-Means step clusters the rows of the eigenvector matrix U to resolve the final partitioning in the transformed spectral embedding space, where clusters are linearly separable, providing a clean, hard assignment based on the complex cluster boundaries identified by the hierarchical spectral clustering process.
Enhancing Robustness and Noise Resilience
To enhance robustness against noise and outliers, two refined graph construction methods are employed. The Mutual k-NN (MkNN) method establishes an edge between points only if they are mutually included in each other’s k-nearest neighbor sets, ensuring reciprocal connectivity. The -k-NN Hybrid method further requires that the distance between points remains below a threshold , combining both relative and absolute proximity criteria. Together, these techniques generate a sparser and more reliable similarity graph that minimizes the impact of outliers while faithfully representing the true underlying cluster structure.
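The Mutual k-NN rule can be sketched directly; this is a toy example, with `mutual_knn_adjacency` as an assumed helper name and a deliberately planted outlier to show the pruning effect.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_knn_adjacency(X, k=2):
    """Keep edge (i, j) only if i is in j's k-NN set AND j is in i's."""
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        A[i, idx[i, 1:]] = True        # directed k-NN edges (skip self)
    return A & A.T                     # mutual: keep only reciprocated edges

X = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])  # last point is an outlier
M = mutual_knn_adjacency(X, k=2)
```

The outlier lists two regular points among its neighbors, but no regular point lists the outlier back, so all of its edges are pruned from the mutual graph.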
4.2. Leveraging the Cover Tree for Proactive Outlier Detection
The Cover tree hierarchy inherently reveals data density, enabling proactive outlier identification. Representatives with few associated data points are flagged as potential outliers using a density threshold $\tau$:
- If $|\{ x \in V : \mathrm{rep}(x) = i \}| < \tau$, representative i is flagged as PotentialNoise.
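A minimal sketch of this density test, assuming nearest-representative assignment via a KD-tree; the helper name `flag_sparse_representatives` is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def flag_sparse_representatives(X, reps, tau=5):
    """Count points mapped to each representative; flag counts below tau."""
    _, nearest = cKDTree(reps).query(X)              # nearest-rep assignment
    counts = np.bincount(nearest, minlength=len(reps))
    return counts < tau, counts                      # True = PotentialNoise

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (80, 2)),          # dense region
               np.array([[8.0, 8.0]])])              # one isolated point
reps = np.array([[0.0, 0.0], [8.0, 8.0]])            # second rep is isolated
noise_mask, counts = flag_sparse_representatives(X, reps, tau=5)
```

The isolated representative attracts only a single point and is flagged, while the dense-region representative easily clears the threshold.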
Algorithm 5 for this task can be summarized as follows:
| Algorithm 5 Integrated Robust Spectral Clustering |
The parametrization is summarized below.
| Parameter | Purpose |
| $\tau$ | Minimum point count for dense region classification |
| $k_{nn}$ | Neighborhood size for Mutual k-NN graph |
| $s_{\min}$ | Minimum size for valid cluster components |
This integrated approach leverages the Cover tree structure for proactive noise identification and removal, enhancing clustering robustness without significant computational overhead. The method combines density-based outlier detection with graph-theoretic noise filtering before performing spectral clustering on the purified dataset.
5. Materials and Environment
For our research, we utilized an ASUS ROG Strix computer featuring an Intel(R) Core(TM) i7-10870H processor with a base clock speed of 2.20 GHz, capable of reaching 2.21 GHz, paired with 16 GB of RAM and an NVIDIA RTX 2060 graphics card. The computer was sourced from ASUSTeK Computer Inc., Taipei, Taiwan. The processor was sourced from Intel Corporation, Santa Clara, California, United States. The graphics card was sourced from NVIDIA Corporation, Santa Clara, California, United States. This hardware was selected for its capacity to handle the demanding computational tasks typically required for data science modeling and simulations. We conducted our experiments in a Python 3.12 environment provided by Anaconda, deliberately avoiding the integration of any additional packages. This minimalist setup was chosen to develop our model and ensure that no external features influenced its performance. Conversely, the alternative models we analyzed utilized additional packages included in the Anaconda distribution. These models leveraged specialized libraries, which allowed for performance optimizations and reduced development time. This approach enabled a comparative analysis between our streamlined model, operating in a pure Python environment, and the alternative models enhanced by advanced libraries. The findings underscored the efficiency of our simplified model in terms of performance.
6. Results
To compare our algorithm with others, we chose the real datasets Iris, Seeds, Glass, Mall, and Cancer, as well as synthetic data such as Blobs, Circles, and Moons; these are described in Table 1, and the higher-dimensional real datasets are described in Table 2. They are frequently used for testing and evaluating machine learning algorithms, and the use of synthetic datasets such as Three Moons, Blobs, and Circles is a particularly common practice for evaluating and visualizing the performance of clustering algorithms. These datasets are specifically designed to test the ability of clustering methods to identify and separate distinct groups in data of varying complexity and structure. Three Moons, with its three intertwined crescent shapes, challenges algorithms with a non-convex cluster shape. Blobs contains well-separated Gaussian clusters to evaluate clustering on spherical groups. Circles features points arranged in concentric circles to test the handling of nested, non-linearly separable clusters.
Table 1.
Description of real and synthetic datasets used.
Table 2.
Description of higher dimensions datasets.
We evaluated the clustering algorithms on three synthetic benchmark datasets: two interleaving half-moons (Moons) and concentric circles (Circles), and a Gaussian Blobs dataset. The Moons and Circles datasets each contained 500 points with two ground-truth classes, while the Blobs dataset comprised 500 points distributed across five Gaussian clusters. To ensure reproducibility, we generated the data using Scikit-learn’s dataset generators with fixed parameters: Moons with n_samples = 500, noise = 0.1, and random_state = 40; Circles with n_samples = 500, noise = 0.1, factor = 0.1, and random_state = 40; and Blobs with n_samples = 500, centers = 5, cluster_std = 3.0, and random_state = 40.
All clustering algorithms were configured to identify $k = 5$ clusters. This setup intentionally creates a mismatch between the true number of classes (2) and the number of predicted clusters (5) for the Moons and Circles datasets. To address this discrepancy, we employed a many-to-one relabeling strategy (also known as cluster purity mapping). Specifically, each predicted cluster was assigned to the ground-truth class that represents the majority of its points. The mapping was established by finding the mode of the true labels within each cluster using SciPy’s mode function. This alignment allowed us to compute standard supervised evaluation metrics, including Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and classification-style metrics (Precision, Recall, F1-score). Additionally, we reported internal clustering quality measures such as the Silhouette score. A many-to-one confusion matrix was also constructed to visualize the distribution of true classes across predicted clusters.
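The relabeling step can be sketched as follows, using SciPy's mode function as described; `purity_relabel` is an illustrative helper name, and the six-point example is fabricated for demonstration only.

```python
import numpy as np
from scipy.stats import mode

def purity_relabel(y_true, y_pred):
    """Map each predicted cluster to the majority true class of its members."""
    mapped = np.empty_like(y_pred)
    for c in np.unique(y_pred):
        members = (y_pred == c)
        mapped[members] = mode(y_true[members], keepdims=False).mode
    return mapped

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([2, 2, 4, 3, 3, 4])   # 3 predicted clusters, 2 true classes
aligned = purity_relabel(y_true, y_pred)
```

After alignment, classification-style metrics such as precision and recall can be computed directly between `y_true` and `aligned`, despite the original mismatch in cluster counts.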
Figure 1, Figure 2 and Figure 3 illustrate the dataset distribution and the selected representative points for each clustering configuration. In all visualizations, grey circles represent the original data points prior to clustering, while red crosses designate the representative points selected by the Cover Tree in the ISCT algorithm, shown for three standard synthetic datasets: Moons, Circles, and Blobs. These representatives provide a multi-scale summarization of each dataset. Only a small subset of the original points was retained as representatives, yielding a compact core set that significantly reduced the computational load of subsequent clustering while preserving the intrinsic topology of the data. Because the Cover Tree is hierarchical, the red-cross nodes correspond to points chosen at intermediate levels of the decomposition, reflecting a coarse-to-fine grouping of the data. These visualizations thus demonstrate how ISCT approximates the overall dataset structure via this sparse representation prior to the final clustering stage.
Figure 1.
Representative points on Circles data.
Figure 2.
Representative points on Moons data.
Figure 3.
Representative points on Blobs data.
We also used a variety of random datasets, presented in Table 3, to analyze and evaluate the behavior and performance of the clustering algorithms across different distributions and dimensions. The generated datasets cover a wide range of distributions and parameters, ensuring a comprehensive evaluation, and each one enabled the clustering methodologies to be tested and validated in a different scenario. These datasets were generated using NumPy’s random module, encompassing both continuous and discrete probability distributions. Continuous data were produced using uniform (np.random.rand), normal (np.random.normal), exponential (np.random.exponential), gamma (np.random.gamma), and log-normal (np.random.lognormal) distributions. Discrete data included integer uniform (np.random.randint), binomial (np.random.binomial), Poisson (np.random.poisson), and Bernoulli distributions. All distributions were parameterized according to their statistical properties, and random seed initialization ensured the reproducibility of the generated data sequences for simulation and testing purposes.
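As a sketch, such random datasets can be regenerated as follows; the sample size, dimensionality, seed, and distribution parameters below are illustrative placeholders, not the exact values of Table 3.

```python
import numpy as np

np.random.seed(42)     # fixed seed for reproducibility (illustrative value)

n, d = 500, 2          # illustrative sample size and dimensionality

datasets = {
    # Continuous distributions
    "uniform":     np.random.rand(n, d),
    "normal":      np.random.normal(loc=0.0, scale=1.0, size=(n, d)),
    "exponential": np.random.exponential(scale=1.0, size=(n, d)),
    "gamma":       np.random.gamma(shape=2.0, scale=1.0, size=(n, d)),
    "lognormal":   np.random.lognormal(mean=0.0, sigma=1.0, size=(n, d)),
    # Discrete distributions
    "integer":     np.random.randint(0, 10, size=(n, d)),
    "binomial":    np.random.binomial(n=10, p=0.5, size=(n, d)),
    "poisson":     np.random.poisson(lam=3.0, size=(n, d)),
    "bernoulli":   np.random.binomial(n=1, p=0.5, size=(n, d)),  # Bernoulli = binomial with n = 1
}

for name, X in datasets.items():
    print(name, X.shape)
```

Re-seeding with the same value regenerates identical sequences, which is what makes the experiments repeatable.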
Table 3.
Random datasets of different distributions used.
To evaluate and compare the effectiveness of ISCT (Improved Spectral Clustering with Cover tree), Spectral Clustering, and K-Means, three commonly used evaluation indices are employed: the Silhouette index, the Davies–Bouldin index, and the Calinski–Harabasz index.
The Silhouette Index measures how similar an object is to its own cluster compared to other clusters. It ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
For a data point i, the silhouette score is defined as follows:
s(i) = (b(i) − a(i)) / max(a(i), b(i)),
where a(i) is the average distance between i and all other points in the same cluster, and b(i) is the minimum, over the other clusters, of the average distance from i to all points in that cluster.
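As a sketch, this definition translates directly into NumPy (in practice one would use scikit-learn’s silhouette_score; the tiny dataset below is illustrative):

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():          # singleton cluster: s(i) = 0 by convention
            continue
        a = D[i, same].mean()       # mean distance to own cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated clusters -> mean silhouette close to 1
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
labels = np.array([0, 0, 1, 1])
print(silhouette_scores(X, labels).mean())
```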
The Davies–Bouldin Index (DBI) evaluates clustering quality based on intra-cluster and inter-cluster distances. It is defined as the average similarity ratio of each cluster with the cluster most similar to it; a lower DBI indicates better clustering.
For k clusters, the DBI is
DBI = (1/k) Σ_{i=1..k} max_{j ≠ i} (σ_i + σ_j) / d(c_i, c_j),
where σ_i is the average distance between each point in cluster i and the centroid c_i of cluster i, and d(c_i, c_j) is the distance between the centroids of clusters i and j.
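This formula can be sketched in a few lines of NumPy (scikit-learn’s davies_bouldin_score computes the same quantity; the toy data are illustrative):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI = (1/k) * sum_i max_{j != i} (sigma_i + sigma_j) / d(c_i, c_j)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ks])
    # sigma_i: mean distance of cluster i's points to its own centroid
    sigma = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                      for i, c in enumerate(ks)])
    dbi = 0.0
    for i in range(len(ks)):
        dbi += max((sigma[i] + sigma[j]) /
                   np.linalg.norm(centroids[i] - centroids[j])
                   for j in range(len(ks)) if j != i)
    return dbi / len(ks)

# Two compact, distant clusters -> low (good) DBI
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
print(davies_bouldin(X, np.array([0, 0, 1, 1])))  # -> 0.1
```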
The Calinski–Harabasz Index (CH, also known as the Variance Ratio Criterion) evaluates the ratio of between-cluster dispersion to within-cluster dispersion; a higher CH value indicates better-defined clusters.
For k clusters, the CH index is defined as follows:
CH = [tr(B_k) / tr(W_k)] · [(n − k) / (k − 1)],
where tr(B_k) is the trace of the between-group dispersion matrix, tr(W_k) is the trace of the within-cluster dispersion matrix, n is the total number of points, and k is the number of clusters. In the result tables, the Silhouette index is denoted SI, the Davies–Bouldin index Da, and the Calinski–Harabasz index Ca.
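The traces can be computed without forming the dispersion matrices explicitly, since the trace of each matrix equals a sum of squared deviations. A sketch (equivalent to scikit-learn’s calinski_harabasz_score; the toy data are illustrative):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH = [tr(B_k) / tr(W_k)] * [(n - k) / (k - 1)]."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    n, ks = len(X), np.unique(labels)
    k = len(ks)
    mean = X.mean(axis=0)
    # tr(B_k): weighted squared distances of cluster centroids to global mean
    tr_B = sum((labels == c).sum() *
               np.sum((X[labels == c].mean(axis=0) - mean) ** 2) for c in ks)
    # tr(W_k): squared distances of points to their own cluster centroid
    tr_W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in ks)
    return (tr_B / tr_W) * ((n - k) / (k - 1))

X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
print(calinski_harabasz(X, np.array([0, 0, 1, 1])))  # -> 200.0
```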
In unsupervised learning, evaluating partition quality is crucial. Two notable metrics are the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI).
The ARI corrects for chance in partition similarity, where a value of 1 indicates perfect agreement. It is calculated as follows:
ARI = [Σ_{ij} C(n_ij, 2) − E] / [(1/2)(Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) − E], with E = (Σ_i C(a_i, 2) · Σ_j C(b_j, 2)) / C(n, 2),
where n_ij is an element of the contingency table, a_i and b_j are its row and column sums, C(·, 2) denotes the binomial coefficient, and n is the total number of observations.
The NMI measures normalized informational redundancy between labelings, with NMI = 1 indicating perfect correlation:
NMI(U, V) = I(U; V) / √(H(U) · H(V)),
where I(U; V) is the mutual information between partitions U and V, and H(·) represents entropy.
These two indices, ARI and NMI, provide complementary perspectives for validating the robustness of clustering algorithms.
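Both indices follow from the contingency table alone. The sketch below implements the two formulas directly (scikit-learn provides adjusted_rand_score and normalized_mutual_info_score for production use; the geometric-mean normalization of NMI is one common convention):

```python
import numpy as np
from math import comb, log

def ari_nmi(u, v):
    """ARI and NMI computed from the contingency table of labelings u, v."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    cu, cv = np.unique(u), np.unique(v)
    # Contingency table n_ij with row sums a_i and column sums b_j
    N = np.array([[np.sum((u == a) & (v == b)) for b in cv] for a in cu])
    a, b = N.sum(axis=1), N.sum(axis=0)
    # Adjusted Rand Index
    sum_ij = sum(comb(int(x), 2) for x in N.ravel())
    sum_a = sum(comb(int(x), 2) for x in a)
    sum_b = sum(comb(int(x), 2) for x in b)
    expected = sum_a * sum_b / comb(n, 2)
    ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
    # Mutual information and entropies (natural log)
    I = sum(N[i, j] / n * log(n * N[i, j] / (a[i] * b[j]))
            for i in range(len(cu)) for j in range(len(cv)) if N[i, j] > 0)
    H = lambda w: -sum(x / n * log(x / n) for x in w if x > 0)
    nmi = I / np.sqrt(H(a) * H(b))
    return ari, nmi

# Identical partitions up to relabeling: ARI and NMI are both ~1.0
print(ari_nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```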
The comprehensive evaluation in Table 4 of six clustering algorithms across eight datasets reveals distinct performance patterns that highlight the relative strengths and limitations of each method. The proposed ISCT algorithm demonstrates remarkable consistency, achieving superior performance in 31 out of 40 metric–dataset combinations. This dominance is particularly evident on real-world datasets, where ISCT obtains the highest scores across all five metrics for the Iris dataset (ARI: 0.732, NMI: 0.743, SI: 0.5528) and maintains this advantage on the Seed, Glass, Mall, and Cancer datasets. The performance gap is especially pronounced on the Seed dataset, where ISCT achieves an ARI of 0.817 compared to DBSCAN’s 0.805, while also recording exceptional internal validation metrics (SI: 0.6344, Ca: 959.75). On synthetic datasets, the results reveal more nuanced patterns. For the Blobs dataset, ISCT demonstrates clear superiority across four of five metrics, with Spectral clustering being competitive. However, the Moons dataset presents an interesting case where Spectral clustering outperforms ISCT in external validation metrics (ARI: 0.3431 vs. 0.3341; NMI: 0.5052 vs. 0.4866), while ISCT maintains an advantage in internal metrics (SI: 0.4720, Da: 0.7106, Ca: 769.2960). This divergence suggests that spectral clustering better captures the underlying moon-shaped geometry, while ISCT produces more compact clusters. The comparative analysis reveals that DBSCAN and Spectral clustering emerge as the strongest competitors. DBSCAN shows particular strength on the Seed dataset, nearly matching ISCT’s performance, while Spectral clustering excels on the challenging Moons structure. Traditional methods like K-means and Agglomerative clustering generally underperform compared to the density-based and spectral approaches, particularly on complex geometries. The ISCT algorithm demonstrates remarkable strength on various types of data, thanks to its hybrid approach.
This integration simultaneously optimizes internal and external validation metrics, making it an important advance for clustering heterogeneous data structures.
Table 4.
Results of different clustering methods for low-dimensional and synthetic datasets.
The experimental results presented in Table 5 across ten high-dimensional datasets reveal that the proposed ISCT method consistently outperforms all benchmark algorithms, achieving top scores in all metrics, including the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Silhouette Index (SI), and Calinski–Harabasz Index (Ca), while maintaining the lowest Davies–Bouldin Index (Da). This demonstrates ISCT’s exceptional ability to recover the true class structure while forming clusters with superior internal cohesion and separation.
Table 5.
Results of different methods for high-dimensional datasets.
Among the competitors, hierarchical approaches (Agglomerative Clustering and BIRCH) ranked as the closest alternatives, suggesting their suitability for high-dimensional spaces. DBSCAN showed volatile performance, excelling on datasets like iono and Coil20 but performing poorly on dermatology and segment, highlighting its parameter sensitivity and density structure dependency. K-Means consistently delivered suboptimal results due to limitations in capturing complex, non-linear relationships, while Spectral Clustering outperformed K-Means but was still surpassed by ISCT.
The results establish two key conclusions: ISCT demonstrates itself as a robust, highly effective solution for high-dimensional clustering, with significant improvements over existing techniques, and no single algorithm is universally optimal, as performance depends on specific data characteristics. ISCT’s superior consistency across diverse datasets positions it as a generalizable tool for knowledge discovery in complex data.
Table 6 shows the results obtained on the random datasets.
Table 6.
Comprehensive clustering results on random datasets.
ISCT outperforms K-Means and Spectral Clustering on most datasets when evaluated with the Silhouette Index (SI), Calinski–Harabasz (Ca), and Davies–Bouldin (Da) indices, showing particular strength on exponential, gamma, and lognormal data. For example, on lognormal data it achieves an SI of 0.26048, compared to 0.213247 for K-Means and 0.132666 for Spectral Clustering. On high-dimensional datasets like normal and binomial, the performance gaps are smaller, but ISCT still leads slightly in SI and Ca. These results highlight its robustness, versatility, and ability to produce more coherent, well-separated clusters, thanks to its combined Cover tree decomposition and spectral clustering approach. The performance of various clustering algorithms, including the proposed Improved Spectral Clustering with Cover tree (ISCT) method, was evaluated across eleven random datasets. ISCT demonstrated remarkable robustness by securing top ranks in the majority of cases across diverse data distributions (Random, Exponential, Gamma, Normal, and Discrete). Its success stems from a multi-level hierarchical approach using Cover tree decomposition, which efficiently captures intrinsic multi-scale geometry and forms stable clusters from representative superpoints before refining assignments. The other algorithms show significant performance variation. DBSCAN exhibits extreme volatility, performing catastrophically on Random and Bernoulli data due to its dependency on density-separated clusters, while K-Means and hierarchical methods (Agglomerative, BIRCH) achieved stable but sub-optimal results. Notably, ISCT is not universally superior; Spectral Clustering outperforms it on Random Data 2, and BIRCH excels on Bernoulli Data, indicating that no single algorithm dominates all configurations. The evaluation establishes ISCT as a robust general-purpose clustering tool whose Cover tree-based approach provides significant advantages in capturing complex structures across diverse distributions.
While traditional algorithms have assumption-related strengths and weaknesses, ISCT offers a consistent, reliable performance, making it ideal for clustering tasks with complex or unknown data distributions.
Figure 4 through Figure 48 show the clusters obtained by our algorithm and by the other algorithms.
Figure 4.
True labels.
Figure 5.
Spectral clustering on IRIS.
Figure 6.
ISCT on IRIS.
Figure 7.
True labels.
Figure 8.
Spectral clustering on Cancer.
Figure 9.
ISCT on Cancer.
Figure 10.
K-Means on Glass.
Figure 11.
Spectral clustering on Glass.
Figure 12.
ISCT on Glass.
Figure 13.
K-Means on Circles data.
Figure 14.
Spectral clustering on Circles data.
Figure 15.
ISCT on Circles Data.
Figure 16.
K-Means on Moons data.
Figure 17.
Spectral clustering on Moons data.
Figure 18.
ISCT on Moons data.
Figure 19.
K-Means on Blobs data.
Figure 20.
Spectral clustering on Blobs data.
Figure 21.
ISCT on Blobs data.
Figure 22.
K-Means on Random data.
Figure 23.
Spectral clustering on Random data.
Figure 24.
ISCT on Random data.
Figure 25.
K-Means on Exponential data.
Figure 26.
Spectral clustering on Exponential data.
Figure 27.
ISCT on Exponential data.
Figure 28.
K-Means on Integer data.
Figure 29.
Spectral clustering on Integer data.
Figure 30.
ISCT on Integer data.
Figure 31.
K-Means on Gamma data.
Figure 32.
Spectral clustering on Gamma data.
Figure 33.
ISCT on Gamma data.
Figure 34.
K-Means on Normal data.
Figure 35.
Spectral clustering on Normal data.
Figure 36.
ISCT on Normal data.
Figure 37.
K-Means on Binomial data.
Figure 38.
Spectral clustering on Binomial data.
Figure 39.
ISCT on Binomial data.
Figure 40.
K-Means on Uniform discrete data.
Figure 41.
Spectral clustering on Uniform discrete data.
Figure 42.
ISCT on Uniform discrete data.
Figure 43.
K-Means on Poisson data.
Figure 44.
Spectral clustering on Poisson data.
Figure 45.
ISCT on Poisson data.
Figure 46.
K-Means on Lognormal data.
Figure 47.
Spectral clustering on Lognormal data.
Figure 48.
ISCT on Lognormal data.
7. Discussion
7.2. Comparative Performance Analysis of the ISCT Algorithm Against Other Clustering Algorithms
The comprehensive evaluation on diverse datasets reveals the superior performance of the proposed ISCT method compared to established clustering approaches. The ISCT method achieves important improvements on most performance metrics. On low-dimensional and synthetic datasets, ISCT outperforms conventional methods, with particularly remarkable performance on complex structures. For the Iris and Seed datasets, ISCT achieves ARI values of 0.732 and 0.817, respectively, representing improvements of 1–2% over spectral clustering and 5–8% over K-means. This enhanced performance can be attributed to the ability of the Cover tree to detect intricate geometric structures while maintaining computational efficiency. The robustness of the method is also demonstrated by its consistent performance on most evaluation metrics, achieving optimal values for cohesion-based measures (Silhouette Index) and generally competitive values for separation measures (Davies–Bouldin Index). The results on high-dimensional datasets reveal ISCT’s particular strength in handling complex feature spaces. For the arrhythmia dataset, ISCT achieves an ARI of 0.625, significantly outperforming all benchmark methods. Similarly, for the dermatology dataset, ISCT reaches an exceptional ARI of 0.985, demonstrating near-perfect clustering accuracy. This performance advantage becomes increasingly pronounced in very high-dimensional spaces, suggesting that the Cover Tree-based approximation effectively preserves essential structural information while mitigating the curse of dimensionality. Notably, ISCT maintains strong performance even on challenging random distributions. While all methods understandably show reduced performance on random data, ISCT consistently achieves the best results, with ARI values of 0.055 for random data and 0.120 for exponential distributions. 
This demonstrates the method’s ability to avoid overfitting and maintain appropriate clustering discipline even in minimally structured environments. The computational efficiency of ISCT, while not explicitly tabulated, can be inferred from the maintained performance quality across varying dataset sizes and dimensionalities. The method’s hierarchical approach enables scalable processing without compromising clustering accuracy, making it particularly suitable for large-scale real-world applications where both precision and efficiency are critical requirements. These results collectively confirm that ISCT is a sturdy and versatile clustering approach that effectively overcomes the limitations of existing methods for various data characteristics, from simple low-dimensional distributions to high-dimensional, sparsely distributed feature spaces.
7.2. Computational Complexity Analysis of Clustering Algorithms
A comparative analysis of the computational complexities of Improved Spectral Clustering with Cover tree (ISCT), Spectral Clustering, and K-Means in Table 7 and Table 8 reveals significant performance differences. Spectral clustering demonstrates the highest computational demand, with a worst-case time complexity of O(n³) and a space complexity of O(n²), primarily due to the construction of an n × n similarity matrix.
Table 7.
Computational complexity of clustering algorithms.
Table 8.
Execution times of different clustering methods (in seconds).
In contrast, K-Means exhibits superior computational efficiency with a time complexity of O(n · k · i · d), where n represents the number of data points, k the number of clusters, i the iteration count, and d the data dimensionality. Its space complexity remains minimal at O(n · d), depending only on the number of points and their dimensionality.
The ISCT algorithm achieves an optimized complexity profile of O(n log n + m · k · i · d), where the first term corresponds to Cover tree construction and the second to the final K-Means clustering phase on the m representative points. Its space complexity of O(n · (d + k)) reflects its dependencies on both data dimensionality and cluster count.
This complexity analysis demonstrates that ISCT substantially outperforms spectral clustering for large datasets, owing to its near-linear O(n log n) complexity compared to spectral clustering’s O(n³). While ISCT exhibits marginally higher complexity than K-Means, this overhead is justified by significantly improved clustering quality for complex data structures. Regarding space efficiency, ISCT proves substantially more scalable than spectral clustering for high-dimensional data, though it is slightly less efficient than K-Means due to its additional dependency on the cluster count k.
In conclusion, ISCT presents a compelling alternative to spectral clustering for large-scale and high-dimensional clustering tasks, while offering a favorable complexity–quality trade-off compared to K-Means for complex data distributions [45,46].
- Spectral Clustering becomes computationally prohibitive for large datasets due to its cubic time complexity.
- K-Means demonstrates linear scalability with sample size but requires careful initialization.
- ISCT achieves near-linear scalability through Cover tree approximation while maintaining clustering quality.
- All algorithms benefit from dimensionality reduction techniques for high-dimensional data.
- For small-to-moderate datasets, spectral clustering provides excellent quality despite its higher complexity.
- For large datasets, K-Means or ISCT is preferred, depending on cluster shape complexity.
- For high-dimensional data, always apply a dimensionality reduction before clustering.
- For complex cluster structures, ISCT provides the optimal balance between quality and scalability.
7.3. Statistical Validation of Rankings
To statistically analyze the results, we employed a non-parametric approach consisting of three steps. First, the Friedman test was used to detect potentially significant differences between treatments across multiple datasets. Subsequently, post hoc Wilcoxon signed-rank tests were conducted to perform pairwise comparisons between our method (ISCT) and each baseline method, applying a Holm–Bonferroni correction to account for multiple comparisons. Finally, a ranking analysis was performed to calculate the average rankings of all methods across all datasets and metrics considered [47].
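The three-step procedure can be sketched with SciPy. The score matrix below is synthetic and purely illustrative (rows stand for dataset–metric configurations, columns for methods, with the first column playing the role of ISCT); the Holm–Bonferroni step-down correction is implemented manually.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Illustrative score matrix: 20 configurations x 4 methods.
# Column 0 plays the role of ISCT (shifted upward); the rest are baselines.
rng = np.random.default_rng(0)
base = rng.uniform(0.3, 0.7, size=(20, 1))
scores = np.hstack([base + 0.10,
                    base + rng.normal(0, 0.02, (20, 1)),
                    base + rng.normal(0, 0.02, (20, 1)),
                    base - 0.05])

# Step 1: Friedman omnibus test across all methods
stat, p_friedman = friedmanchisquare(*scores.T)

# Step 2: pairwise Wilcoxon signed-rank tests, method 0 vs. each baseline
p_raw = [wilcoxon(scores[:, 0], scores[:, j]).pvalue
         for j in range(1, scores.shape[1])]

# Step 3: Holm-Bonferroni step-down correction
order = np.argsort(p_raw)
p_adj = np.empty(len(p_raw))
running_max = 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (len(p_raw) - rank) * p_raw[idx])
    p_adj[idx] = min(1.0, running_max)

print(p_friedman, p_adj)
```

In the paper, the rows of this matrix are the actual metric values per dataset, and the ranking analysis averages each method’s rank over all rows.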
7.3.1. Statistical Analysis on Results of Different Clustering Methods for Low-Dimensional and Synthetic Datasets
To rigorously evaluate the performance differences between clustering algorithms, we conducted a comprehensive statistical analysis of the results presented in Table 9. For each of the 40 experimental configurations (8 datasets × 5 metrics), we ranked the six algorithms from 1 (best) to 6 (worst), with appropriate consideration of metric polarity (higher values preferred for ARI, NMI, SI, and Ca; lower values preferred for Da).
Table 9.
Average rankings of clustering methods across all datasets and metrics.
We employed the non-parametric Friedman test to detect significant differences in performance across all algorithms, testing the null hypothesis that all methods perform equivalently. The analysis revealed a Friedman chi-squared statistic of 78.34 with a p-value < 0.0001, providing strong evidence against the null hypothesis and indicating statistically significant performance differences between the methods.
Following this significant result, we conducted a post hoc analysis using pairwise comparisons between ISCT and each baseline algorithm with the Wilcoxon signed-rank test, applying the Holm–Bonferroni correction for multiple comparisons.
Based on Table 10 and the statistical analysis, several key findings are confirmed:
Table 10.
Results of pairwise comparisons (ISCT vs. other methods).
- ISCT demonstrates statistically significant improvements over BIRCH, Spectral, Agglomerative, and K-Means clustering methods (p < 0.05 after multiple comparison correction).
- While ISCT shows a better average performance than DBSCAN across most metrics and datasets, this difference does not reach statistical significance after multiple comparison correction (adjusted p = 0.054).
- The Friedman test confirms the significant differences between methods (p < 0.0001), with ISCT achieving the best average rank across all datasets and metrics.
These results provide strong statistical support for the superior performance of our ISCT method while acknowledging its comparable performance to DBSCAN, which represents the strongest baseline among the comparison methods. The consistent top ranking of ISCT across diverse datasets and evaluation metrics underscores its robustness and general applicability in various clustering scenarios.
7.3.2. Statistical Analysis on Results of Different Clustering Methods for High-Dimensional Datasets
The procedure was as follows: for each of the 50 experimental configurations (10 datasets × 5 metrics), the six algorithms were ranked from 1 (best) to 6 (worst), with higher values being better for the ARI, NMI, SI, and Ca metrics and a lower value being preferable for the Da metric. The Friedman test was then used to determine if there were statistically significant differences in the average ranks of the algorithms by testing the null hypothesis that all algorithms perform equivalently. Upon rejection of the null hypothesis, a post hoc analysis was conducted using pairwise comparisons between ISCT and each baseline algorithm with the Wilcoxon signed-rank test, where the resulting p-values were adjusted for multiple comparisons using the Holm–Bonferroni method.
The average rank of each algorithm across all experimental configurations is presented in Table 11. ISCT achieved the lowest (best) average rank.
Table 11.
Average rankings of clustering methods across all high-dimensional datasets and metrics.
The Friedman test rejected the null hypothesis that all algorithms perform equally, with a p-value of less than 0.00001, demonstrating highly significant differences in performance among the evaluated algorithms. This result provides compelling evidence that at least one algorithm exhibits significantly different performance characteristics compared to the others.
Post hoc Wilcoxon Signed-Rank Tests
The results of the pairwise comparisons (Table 12) show that ISCT’s performance is statistically superior to all other methods after correction for multiple comparisons.
Table 12.
Results of pairwise Wilcoxon signed-rank tests (ISCT vs. other methods).
The statistical analysis provides unequivocal evidence that ISCT outperforms all baseline methods on high-dimensional data. It achieved the best average rank and the Friedman test confirmed that there were significant differences between the methods (p < 0.00001). The post hoc Wilcoxon tests confirmed that ISCT’s superiority is statistically significant against every baseline, even after a stringent multiple-comparison correction. These results robustly support our claims regarding the effectiveness of the ISCT algorithm.
7.3.3. Statistical Analysis on Results of Different Clustering Methods for Random Datasets
The evaluation procedure consisted of three main steps: first, for each of the 55 experimental configurations (11 datasets × 5 metrics), all six algorithms were ranked from 1 (best) to 6 (worst), with higher values indicating better performance for the ARI, NMI, SI, and Ca metrics, and lower values being preferable for the Da metric; second, the non-parametric Friedman test was employed to determine whether there were statistically significant differences in the average rankings across all algorithms; finally, upon rejection of the null hypothesis, post hoc pairwise comparisons between ISCT and each baseline algorithm were conducted using the Wilcoxon signed-rank test with Holm–Bonferroni correction for multiple comparisons.
Table 13 shows the average rank of each algorithm across all experimental configurations. ISCT achieved the best average rank.
Table 13.
Average rankings of clustering methods across all random datasets and metrics.
The Friedman test rejected the null hypothesis that all algorithms perform equally: the analysis revealed a Friedman chi-squared statistic of 89.47 with a p-value of less than 0.00001, providing strong statistical evidence that at least one algorithm performs significantly differently from the others.
Post hoc Wilcoxon Signed-Rank Tests
Table 14 shows the results of pairwise comparisons between ISCT and other methods after Holm–Bonferroni correction.
Table 14.
Results of pairwise Wilcoxon signed-rank tests (ISCT vs. other methods).
The statistical analysis demonstrates that ISCT achieves the best overall performance on the random datasets, with statistically significant improvements over Spectral Clustering, DBSCAN, K-Means, and BIRCH. Although ISCT shows a better average performance than Agglomerative clustering, this difference does not reach statistical significance after multiple comparison correction. These results provide rigorous statistical support for our claims regarding ISCT’s effectiveness while acknowledging its specific performance relationships with different algorithms.
7.4. Adapting Spectral Clustering with Cover Trees for High Dimensions
Spectral clustering is a powerful technique for identifying clusters in data by leveraging the eigenstructure of similarity matrices. However, as data dimensionality increases, traditional spectral clustering algorithms face significant challenges, including computational inefficiency and difficulty capturing meaningful structures in the data. To address these issues, an adaptation of spectral clustering using Cover tree can be highly effective. This approach combines the benefits of spectral methods with the efficiency of Cover tree [30], a data structure designed for high-dimensional spaces [27]. This section discusses the adaptation of spectral clustering with Cover tree, focusing on the motivation, methodology, and implications for high-dimensional data.
In high-dimensional spaces, the curse of dimensionality can severely affect clustering performance. Traditional spectral clustering involves computing a similarity matrix, which can be computationally expensive and memory-intensive for large datasets. Additionally, an eigenvalue decomposition of this matrix can become prohibitive in terms of both time and space complexity. The Cover tree is an efficient data structure for high-dimensional nearest-neighbor searches, which can help overcome these challenges by speeding up the computation of similarity matrices and enhancing the scalability of spectral clustering. We will discuss the methodology to be followed in four essential steps.
- Data Decomposition Using Cover Tree. The initial phase of adapting spectral clustering with Cover tree involves decomposing the data, resulting in a set of representative points and a tree structure that captures the inherent clustering structure of the data. The process begins with the construction of the Cover tree, which is achieved by recursively partitioning the data into smaller subsets based on distance thresholds. This hierarchical structure groups data points into increasingly finer clusters. The nodes of the Cover tree serve as representative points, summarizing the clustering structure at different levels of the hierarchy.
- Forming the Representative Points Similarity Matrix. Following the construction of the Cover tree, a similarity matrix W is formed based on the representative points. This matrix captures the pairwise similarities between these points and is used to construct a reduced similarity matrix for spectral clustering. The similarity between representative points is computed using appropriate distance measures, reflecting the local structure of the data as represented by the Cover tree nodes. The similarity matrix is then normalized to ensure it is suitable for spectral clustering, typically involving standardization or the application of a kernel function.
- Spectral Clustering on Representative Points. With the similarity matrix W in hand, spectral clustering is applied to the reduced set of representative points. This involves performing an eigenvalue decomposition on the normalized similarity matrix to obtain the principal eigenvectors. These eigenvectors are used to project the representative points into a lower-dimensional space, where a classical clustering algorithm, such as K-Means, is applied to partition these points into clusters.
- Recovering Clusters for Original Data. The final step involves recovering the cluster assignments for the original data points by mapping them back to the clusters of their corresponding representative points. Each data point is assigned to the cluster of the representative point closest to it in the original space. Optionally, the clustering results can be refined using additional techniques, such as post-processing or fine-tuning, to improve clustering quality.
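The four steps above can be sketched end-to-end. For brevity, this sketch replaces the Cover tree with a simple greedy covering at a fixed radius to select representatives (a stand-in, not our actual data structure); the spectral step uses an RBF similarity matrix and the normalized Laplacian, and scikit-learn’s KMeans performs the final partitioning. The radius and kernel width are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

def greedy_representatives(X, radius):
    """Stand-in for the Cover tree: keep a point only if it lies farther
    than `radius` from every representative chosen so far."""
    reps = [0]
    for i in range(1, len(X)):
        if np.min(np.linalg.norm(X[reps] - X[i], axis=1)) > radius:
            reps.append(i)
    return np.array(reps)

def isct_sketch(X, n_clusters, radius=0.3, sigma=0.5):
    # Step 1: data decomposition -> representative points
    reps = greedy_representatives(X, radius)
    R = X[reps]
    # Step 2: similarity matrix W on representatives (RBF kernel)
    D2 = np.sum((R[:, None] - R[None, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2 * sigma ** 2))
    # Step 3: embed via the normalized Laplacian, then cluster with K-Means
    d = W.sum(axis=1)
    L_sym = np.eye(len(R)) - W / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L_sym)
    emb = vecs[:, :n_clusters]                      # smallest eigenvectors
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
    rep_labels = KMeans(n_clusters, n_init=10, random_state=0).fit_predict(emb)
    # Step 4: assign each original point to its nearest representative's cluster
    nearest = np.argmin(np.linalg.norm(X[:, None] - R[None, :], axis=-1), axis=1)
    return rep_labels[nearest]

X, _ = make_moons(n_samples=300, noise=0.05, random_state=40)
labels = isct_sketch(X, n_clusters=2)
print(labels.shape, np.unique(labels))
```

Because the eigendecomposition runs only on the representatives, its cost depends on the size of the core set rather than on the full dataset.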
7.5. Implications of Adapting Spectral Clustering with Cover Trees for High Dimensions
This method delivers important improvements in both efficiency and quality, while guaranteeing greater scalability. This innovative combination paves the way for practical, high-performance applications in contexts involving different datasets. The adaptation of spectral clustering with Cover tree offers a robust solution to the challenges of high-dimensional and large-scale datasets by enhancing computational efficiency, improving clustering quality, and ensuring scalability. By reducing dimensionality and focusing on representative points, it mitigates the high computational costs of traditional spectral clustering, resulting in faster processing times and lower memory usage. The hierarchical structure of the Cover tree captures both local and global data relationships effectively, leading to more accurate clustering, particularly in high-dimensional spaces where conventional methods often struggle. Additionally, the efficiency of Cover tree in nearest-neighbor searches and the reduced complexity of working with representative points enable the algorithm to scale seamlessly with data size and dimensionality. This methodology represents a significant advancement in clustering techniques, making spectral clustering more feasible and effective for different datasets.
7.6. Limitations
The adaptation of spectral clustering with Cover trees introduces significant advancements in handling high-dimensional data. However, its practical application is accompanied by several limitations that warrant consideration. This section explores the key challenges associated with the use of Cover trees in spectral clustering, focusing on computational complexity, approximation accuracy, parameter tuning, and scalability.
- Computational Complexity and Overhead: Although Cover trees are designed for efficiency, their construction and maintenance present computational challenges, particularly in high-dimensional spaces. Building and updating a Cover tree involves recursive partitioning and the management of hierarchical structures [48], which can introduce significant computational overhead. In scenarios involving extremely large datasets or very high-dimensional data, this complexity may offset the efficiency gains expected from the approach. Another computational bottleneck arises during the calculation of similarity matrices. While Cover trees facilitate rapid nearest-neighbor searches, the process of computing and normalizing similarity matrices from representative points remains resource-intensive, particularly for large-scale datasets. Additionally, the memory requirements for storing both the Cover tree and the similarity matrices can be substantial, potentially necessitating advanced hardware resources to handle massive or high-dimensional datasets effectively.
- Approximation and Accuracy: The use of representative points in the Cover tree can lead to a loss of precision in clustering results. These points, while reducing data complexity, may fail to capture all nuances of the original dataset, resulting in suboptimal clustering outcomes. The sensitivity of the approach to the selection of representative points is particularly evident in datasets with overlapping or complex cluster structures, where a poorly constructed Cover tree can significantly degrade accuracy. Moreover, spectral clustering relies on an eigenvalue decomposition of similarity matrices, and the use of approximated matrices derived from Cover trees introduces errors in eigenvalue and eigenvector computations [30]. These approximation errors may compromise the quality of the final clustering, especially in cases where the representative points inadequately reflect the global data structure.
- Parameter Tuning and Selection: The effectiveness of spectral clustering with Cover trees is highly dependent on parameter tuning. Key parameters such as the number of representative points, distance thresholds, and similarity computation methods must be carefully chosen to balance efficiency and accuracy [27]. Identifying optimal parameter values often requires extensive experimentation and cross-validation, adding to the complexity of implementation. Additionally, the approach is sensitive to hyperparameters, such as the choice of distance metric and the number of clusters. Misconfigured parameters can result in poor clustering performance [49], reducing the effectiveness of the method. Balancing computational efficiency with clustering accuracy requires a nuanced understanding of both the data and the algorithm’s behavior.
- Scalability and Applicability: While Cover trees address high-dimensional data challenges, their performance may degrade in extremely high-dimensional spaces with thousands of features. The effectiveness of Cover trees is also contingent on the nature of the data: irregular or non-Euclidean structures may not align well with the assumptions underlying the Cover tree, limiting its applicability across diverse domains and datasets. The complexity of implementing spectral clustering with Cover trees is another practical challenge. The approach involves multiple interconnected components, including Cover tree construction, similarity matrix computation, and spectral analysis [50]. For practitioners unfamiliar with these methods, the implementation process can be daunting, potentially hindering adoption in real-world applications.

In summary, the adaptation of spectral clustering with Cover trees offers valuable benefits for clustering high-dimensional data, but it is accompanied by notable limitations. Challenges related to computational complexity, approximation accuracy, parameter tuning, and scalability must be carefully addressed to maximize its potential. Understanding these limitations is essential for effectively applying this method and interpreting its results. Future research can focus on refining algorithms, optimizing implementations, and exploring alternative techniques to enhance the robustness and versatility of this approach across various contexts.
8. Future Research
The integration of Cover trees with spectral clustering represents a significant innovation in the analysis of high-dimensional data. Despite its promising performance, numerous avenues for future research can further enhance the efficiency, accuracy, and applicability of this method [51]. This section outlines critical areas for advancing spectral clustering with Cover trees, focusing on algorithmic improvements, robustness, flexibility, and theoretical understanding.
- Generalization to Probabilistic Metric Spaces and Advanced Association Methods
We aim to generalize the ISCT algorithm to probabilistic metric spaces, which will enable the uncertainties inherent in real-world data to be captured, particularly in contexts where distances or similarities are stochastic (e.g., data from noisy sensors, complex networks, or dynamic systems). This generalization will require adapting the Cover tree construction to incorporate probability distributions over metrics while preserving the theoretical guarantees of the hierarchical structure.
Furthermore, we will explore advanced association methods, including optimal transport techniques, random graph models, and transfer learning mechanisms. These approaches seek to improve the quality of computed affinities between data points, especially in scenarios where relationships are non-linear or multi-scale. Integrating these methods with Cover tree could lead to a significant reduction in the biases induced by initial approximations while enhancing robustness to outliers.
- Enhancing Algorithmic Efficiency: Efforts to optimize the construction and maintenance of Cover trees could lead to significant performance gains. The current implementations, while efficient, still involve computational overhead, particularly in high-dimensional spaces. Future research can explore advanced partitioning strategies and alternative data structures to reduce the time complexity of tree construction and updates. Additionally, accelerating similarity matrix computations through parallelization, approximate algorithms, or novel similarity measures can address one of the main bottlenecks in the clustering process. Finally, developing scalable algorithms that leverage distributed computing or memory-efficient strategies could enable the application of this approach to extremely large and complex datasets.
- Improving Accuracy and Robustness: The precision of clustering results in spectral clustering with Cover trees depends on the accuracy of approximations and parameter tuning. Future studies could focus on refining approximation techniques, potentially integrating Cover trees with other dimensionality reduction methods or hybrid approaches. Automated parameter tuning methods, such as metaheuristics or machine learning-based systems, could further improve clustering performance by dynamically adjusting settings based on dataset characteristics. Moreover, enhancing robustness against noise and outliers through noise-resistant algorithms or robust similarity measures is crucial for handling real-world data with inherent imperfections.
- Expanding Applicability and Flexibility: Adapting spectral clustering with Cover trees to diverse data types is another promising research direction. Extending this method to time-series data, graph-based data, and heterogeneous datasets could broaden its applicability across various domains. Additionally, integrating Cover trees with deep learning techniques, such as autoencoders and neural networks, may improve feature extraction and clustering performance, enabling the development of hybrid models that combine classical clustering with modern machine learning. To facilitate practical adoption, future work should also focus on creating user-friendly tools, visualization interfaces, and software libraries that simplify the application of these advanced methods.
- Advancing Theoretical Understanding: A deeper theoretical exploration of Cover trees and their role in spectral clustering is essential to uncovering the full potential of this approach. Research could delve into the mathematical properties of Cover trees, their impact on clustering accuracy, and their limitations. Extending the theoretical framework to include adaptations for non-Euclidean spaces, dynamic datasets, or multi-view data could provide valuable insights into how Cover trees can address more complex clustering scenarios. Furthermore, comparative studies between spectral clustering with Cover trees and other state-of-the-art clustering techniques can clarify their relative strengths and guide best practices for their application.
The adaptation of spectral clustering using Cover tree presents a robust and scalable solution for clustering high-dimensional data. However, significant opportunities remain to refine its efficiency, accuracy, and applicability. By addressing challenges related to algorithmic optimization, robustness, and flexibility, and by expanding its theoretical foundations, researchers can further advance this promising approach. These efforts will not only enhance the method’s performance but also facilitate its adoption across a broader range of data analysis tasks, solidifying its place as a valuable tool in the field of clustering.
9. Conclusions
Through the adaptation of spectral clustering techniques to incorporate Cover trees, significant improvements have been observed in clustering accuracy, efficiency, and the ability to handle large datasets. This methodology demonstrates substantial benefits across various dimensions. First, the Cover tree-based approach reduces computational complexity by efficiently managing the nearest-neighbor search and similarity computations, which are often bottlenecks in spectral clustering. The ability to handle high-dimensional data more effectively enables more accurate and meaningful clustering results, which is crucial for applications ranging from image analysis and bioinformatics to text mining and social network analysis. Despite these advancements, several limitations and areas for improvement remain. The challenges associated with optimizing the Cover tree construction and similarity matrix computation, enhancing scalability, and improving robustness to noise and outliers are notable. These limitations highlight the need for ongoing research to refine the methodology further, explore alternative approximation techniques, and develop more scalable solutions. Future research directions include optimizing algorithmic efficiency, integrating with deep learning approaches, expanding applicability to diverse data types, and enhancing theoretical understanding. In conclusion, while spectral clustering with Cover trees offers a promising approach to high-dimensional data clustering, continued research and development are essential to address existing challenges and fully realize its potential. By advancing the theoretical foundations, improving practical implementations, and exploring new applications, this methodology can contribute significantly to the field of data clustering and analysis.
The progress made so far lays a solid foundation for future innovations, making it an exciting area of research with considerable opportunities for further exploration and application.
Author Contributions
A.L.H. and Y.A. contributed to the conceptualization, methodology, and preparation of the original draft. M.L.S. and S.R. contributed to the review, editing, validation, and acquisition of funding. V.Š.Č. contributed to the review and acquisition of funding. S.M. contributed to the acquisition of funding and provision of resources. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Data are contained in the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Von Luxburg, U. A Tutorial on Spectral Clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
- Ng, A.Y.; Jordan, M.I.; Weiss, Y. On Spectral Clustering: Analysis and an Algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS’01), Vancouver, BC, Canada, 3–8 December 2001; pp. 849–856. [Google Scholar]
- Shi, J.; Malik, J. Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar] [CrossRef]
- Li, M.; Bi, W.; Kwok, J.T.; Lu, B.L. Large-Scale Spectral Clustering on Graphs. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI’11), Barcelona, Spain, 16–22 July 2011; Volume 2, pp. 1486–1491. [Google Scholar]
- Fowlkes, C.; Belongie, S.; Chung, F.; Malik, J. Spectral Grouping Using the Nyström Method. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 214–225. [Google Scholar] [CrossRef] [PubMed]
- Beygelzimer, A.; Kakade, S.; Langford, J. Cover Trees for Nearest Neighbor. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), Pittsburgh, PA, USA, 25–29 June 2006; pp. 97–104. [Google Scholar] [CrossRef]
- Zheng, P.; Zhou, H.; Liu, J.; Nakanishi, Y. Interpretable building energy consumption forecasting using spectral clustering algorithm and temporal fusion transformers architecture. Appl. Energy 2023, 349, 121607. [Google Scholar] [CrossRef]
- Elkin, Y. A New Compressed Cover Tree for k-Nearest Neighbour Search and the Stable-Under-Noise Mergegram of a Point Cloud. Ph.D. Thesis, University of Liverpool, Liverpool, UK, 2022. [Google Scholar]
- Driver, H.E.; Kroeber, A.L. Quantitative Expression of Cultural Relationships; University of California: Berkeley, CA, USA, 1932. [Google Scholar]
- Zubin, J. A Technique for Measuring Like-Mindedness. J. Abnorm. Soc. Psychol. 1938, 33, 508–516. [Google Scholar] [CrossRef]
- Tryon, R.C. Cluster Analysis: Correlation Profile and Orthometric Analysis for the Isolation of Unities in Mind and Personality; Edward Brothers: Ann Arbor, MI, USA, 1939. [Google Scholar]
- Johnson, S.C. Hierarchical Clustering Schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef]
- Lukauskas, M.; Ruzgas, T. Reduced Clustering Method Based on the Inversion Formula Density Estimation. Mathematics 2023, 11, 661. [Google Scholar]
- Cambe, J.; Grauwin, S.; Flandrin, P.; Jensen, P. A New Clustering Method to Explore the Dynamics of Research Communities. Scientometrics 2022, 127, 4459–4482. [Google Scholar]
- Chung, F.R.K. Spectral Graph Theory; American Mathematical Society: Providence, RI, USA, 1997; Volume 92. [Google Scholar]
- Fiedler, M. Algebraic Connectivity of Graphs. Czechoslov. Math. J. 1973, 23, 298–305. [Google Scholar] [CrossRef]
- Lehoucq, R.B.; Sorensen, D.C.; Yang, C. ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. In Software, Environments, and Tools; SIAM: Philadelphia, PA, USA, 1998; Volume 6. [Google Scholar]
- Chen, J.; Jia, X.; Yang, W.; Matsushita, B. Generalization of Subpixel Analysis for Hyperspectral Data with Flexibility in Spectral Similarity Measures. IEEE Trans. Geosci. Remote Sens. 2009, 47, 2165–2171. [Google Scholar] [CrossRef]
- Zhu, X.; Gan, J.; Lu, G.; Li, J.; Zhang, S. Spectral clustering via half-quadratic optimization. World Wide Web 2020, 23, 1969–1988. [Google Scholar] [CrossRef]
- Zhan, Q.; Mao, Y. Improved spectral clustering based on Nyström method. Multimed. Tools Appl. 2017, 76, 20149–20165. [Google Scholar] [CrossRef]
- Goubko, M.V.; Ginz, V. Improved spectral clustering for multi-objective controlled islanding of power grid. Energy Syst. 2017, 10, 59–94. [Google Scholar] [CrossRef]
- Mizutani, T. Improved analysis of spectral algorithm for clustering. Optim. Lett. 2021, 15, 1303–1325. [Google Scholar] [CrossRef]
- Yan, J.; Cheng, D.; Zong, M.; Deng, Z. Improved Spectral Clustering Algorithm Based on Similarity Measure. In Proceedings of the International Conference on Advanced Data Mining and Applications, Guilin, China, 19–24 December 2014; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
- Khani, M.R.; Salavatipour, M.R. Improved Approximation Algorithms for the Min-max Tree Cover and Bounded Tree Cover Problems. Algorithmica 2011, 69, 443–460. [Google Scholar] [CrossRef]
- Hemdanou, A.L.; Sefian, M.L.; Achtoun, Y.; Tahiri, I. Comparative analysis of feature selection and extraction methods for student performance prediction across different machine learning models. Comput. Educ. Artif. Intell. 2024, 7, 100301. [Google Scholar] [CrossRef]
- Elkin, Y.G.; Kurlin, V. A new near-linear time algorithm for k-nearest neighbor search using a compressed cover tree. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; PMLR: Baltimore, MD, USA, 2021. [Google Scholar]
- Ren, S.; Zhang, S.; Wu, T. An Improved Spectral Clustering Community Detection Algorithm Based on Probability Matrix. In Discrete Dynamics in Nature and Society; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2020. [Google Scholar]
- Chao, G.; Sun, S.; Bi, J. A survey on multiview clustering. IEEE Trans. Artif. Intell. 2021, 2, 146–168. [Google Scholar] [CrossRef]
- Katwe, M.V.; Singh, K.; Clerckx, B.; Li, C.P. Improved Spectral Efficiency in STAR-RIS Aided Uplink Communication Using Rate Splitting Multiple Access. IEEE Trans. Wirel. Commun. 2023, 22, 5365–5382. [Google Scholar] [CrossRef]
- Calder, J.; Trillos, N.G. Improved spectral convergence rates for graph Laplacians on ϵ-graphs and k-NN graphs. Appl. Comput. Harmon. Anal. 2022, 60, 123–175. [Google Scholar] [CrossRef]
- Wang, W.; Li, J.; Lu, S. Application of Signal Denoising Technology Based on Improved Spectral Subtraction in Arc Fault Detection. Electronics 2023, 12, 3147. [Google Scholar] [CrossRef]
- Fisher, R.A. Iris; UCI Machine Learning Repository: Irvine, CA, USA, 1988. [Google Scholar] [CrossRef]
- Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.; Lukasik, S. Seeds; UCI Machine Learning Repository: Irvine, CA, USA, 2012. [Google Scholar] [CrossRef]
- German, B. Glass Identification; UCI Machine Learning Repository: Irvine, CA, USA, 1987. [Google Scholar] [CrossRef]
- Ashwani; Kaur, G.; Rani, L. Mall Customer Segmentation Using K-Means Clustering. In Proceedings of the International Conference on Data Analytics and Management, London, UK, 23–24 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 459–474. [Google Scholar]
- Wolberg, W.; Mangasarian, O.; Street, N.; Street, W. Breast Cancer Wisconsin (Diagnostic); UCI Machine Learning Repository: Irvine, CA, USA, 1995. [Google Scholar] [CrossRef]
- Güvenir, H.A.; Acar, B.; Demiröz, G.; Cekin, A. A supervised machine learning algorithm for arrhythmia analysis. Comput. Cardiol. 1997, 24, 433–436. [Google Scholar]
- Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (Coil-20); Technical Report CUCS-005-96; Columbia University: New York, NY, USA, 1996. [Google Scholar]
- Güvenir, H.A.; Demiröz, G.; Ilter, N. Dermatology Dataset; UCI Machine Learning Repository: Irvine, CA, USA, 1998. [Google Scholar]
- Hofmann, H. Statlog (German Credit Data); UCI Machine Learning Repository: Irvine, CA, USA, 1994. [Google Scholar]
- Detrano, R.; Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Schmid, J.-J.; Sandhu, S.; Guppy, K.; Lee, S.; Froelicher, V. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 1989, 64, 304–310. [Google Scholar] [CrossRef] [PubMed]
- Sigillito, V.G.; Wing, S.P.; Hutton, L.V.; Baker, K.B. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Dig. 1989, 10, 262–266. [Google Scholar]
- Hopkins, M.; Reeber, E.; Forman, G.; Suermondt, J. Spambase Dataset; UCI Machine Learning Repository: Irvine, CA, USA, 1999. [Google Scholar]
- Street, W.N.; Wolberg, W.H.; Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. In Proceedings of the IS and T/SPIE Symposium on Electronic Imaging: Science and Technology, San Jose, CA, USA, 31 January–5 February 1993; Volume 1905, pp. 861–870. [Google Scholar]
- Yin, H.; Aryani, A.; Petrie, S.; Nambissan, A.; Astudillo, A.; Cao, S. A Rapid Review of Clustering Algorithms. arXiv 2024, arXiv:2401.07389. [Google Scholar] [CrossRef]
- Xu, D.; Tian, Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data Sci. 2015, 2, 165–193. [Google Scholar] [CrossRef]
- Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
- Blair, M.D.; Huang, X.; Sogge, C.D. Improved spectral projection estimates. J. Eur. Math. Soc. 2022. [Google Scholar] [CrossRef]
- Achtoun, Y.; Gardasević-Filipović, M.; Mitrović, S.; Radenović, S. On Prešić-Type Mappings: Survey. Symmetry 2024, 16, 415. [Google Scholar] [CrossRef]
- Hemdanou, A.L.; Achtoun, Y.; Sefian, M.L.; Tahiri, I.; Afia, A.E. Random Normed k-Means: A Paradigm-Shift in Clustering within Probabilistic Metric Spaces. arXiv 2025, arXiv:2504.03928. [Google Scholar]
- Hemdanou, A.L.; Barkouk, H.; Lamarti Sefian, M.; Achtoun, Y.; Tahiri, I. Determinant Factors for Predicting Student Academic Success in Portuguese High Schools. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development, Agadir, Morocco, 17–23 December 2024; Springer Nature: Cham, Switzerland, 2024; pp. 981–991. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).