Review

Clustering of Temporal and Visual Data: Recent Advancements

Portland State University, Portland, OR 97229, USA
Submission received: 24 August 2025 / Revised: 13 November 2025 / Accepted: 24 December 2025 / Published: 4 January 2026
(This article belongs to the Section Featured Reviews of Data Science Research)

Abstract

Clustering plays a central role in uncovering latent structure within both temporal and visual data. It enables critical insights in various domains including healthcare, finance, surveillance, autonomous systems, and many more. With the growing volume and complexity of time-series and image-based datasets, there is an increasing demand for robust, flexible, and scalable clustering algorithms. Although these modalities differ—time-series being inherently sequential and vision data being spatial—they exhibit common challenges such as high dimensionality, noise, variability in alignment and scale, and the need for interpretable groupings. This survey presents a comprehensive review of recent advancements in clustering methods that are adaptable to both time-series and vision data. We explore a wide spectrum of approaches, including distance-based techniques (e.g., DTW, EMD), feature-based methods, model-based strategies (e.g., GMMs, HMMs), and deep learning frameworks such as autoencoders, self-supervised learning, and graph neural networks. We also survey hybrid and ensemble models, as well as semi-supervised and active clustering methods that leverage minimal supervision for improved performance. By highlighting both the shared principles and the modality-specific adaptations of clustering strategies, this work outlines current capabilities and open challenges, and suggests future directions toward unified, multimodal clustering systems.

1. Introduction

Clustering is a fundamental unsupervised learning technique that aims to group a set of data points into subsets, or clusters, where points in the same group are more similar to each other than to those in other groups [1,2]. It is one of the most widely used methods in exploratory data analysis, pattern recognition, and machine learning due to its ability to discover hidden structures in data without requiring labeled examples [3]. The increasing availability of large datasets across various domains has made clustering even more essential, especially in the context of complex, high-dimensional data [4,5]. In AI-enabled sensor systems, clustering helps uncover hidden patterns, detect anomalies, and reduce dimensionality. These capabilities are particularly important when working with time-series and vision-based sensor data, where labeled data is often scarce. As such, clustering plays a crucial role in enabling intelligent data interpretation and decision-making in modern sensing applications.
Among the diverse types of data encountered in machine learning, time-series and vision data have gained prominence due to their prevalence in real-world applications [6,7]. Time-series data, by its nature, consists of sequences of data points indexed by time. Examples include medical data such as electrocardiogram (ECG) signals, financial data such as stock prices, environmental data from sensor logs, and internet traffic data [8].
Vision data, on the other hand, is spatially rich. It comes from a variety of sources, including digital cameras, satellites, medical imaging devices, and video surveillance systems. Unlike time-series data, vision data is typically represented as high-dimensional matrices, where each pixel or voxel corresponds to a feature in a multi-dimensional space. In video data, this spatial information is combined with temporal dynamics, making the problem of clustering even more complex [9].
Despite these differences in modality, both time-series and vision data pose several common challenges for clustering. High dimensionality is a major issue, as time-series can include thousands of time points and images can contain millions of pixels, making the data sparse and standard distance measures less effective [4,5]. These data types are also often noisy due to sensor errors, environmental interference, or sampling irregularities, which can obscure meaningful patterns [10]. Additionally, both exhibit significant variability: time-series show shifts in trends or periodicity, while vision data varies spatially with changes in object position, orientation, or scale. Finally, alignment issues are common, where sequences or visual elements may not be properly synchronized across samples, complicating the clustering process [11].
Given these shared challenges, clustering methods for both time-series and vision data must be adaptive, robust, and scalable [12]. Furthermore, with the growing importance of multimodal systems, which combine both time-series and vision data (e.g., in autonomous vehicles or healthcare), the need for cross-modal clustering techniques is increasing [13]. Traditional clustering algorithms, such as k-means or hierarchical clustering, struggle to handle these complexities due to their reliance on specific assumptions or their inability to scale efficiently with large, high-dimensional datasets [14].
This paper aims to provide a comprehensive survey of recent clustering techniques developed for or applicable to both time-series and vision data. Figure 1 presents an overview of the structure of our paper. To structure our discussion meaningfully, we reorganize the clustering techniques based on their methodological approaches—ranging from traditional distance-based methods (e.g., Dynamic Time Warping for time-series or pixel-based distances for images) [15,16], to more advanced techniques utilizing deep learning, autoencoders, self-supervised learning, and hybrid models [12,17]. This taxonomy is motivated by the need to highlight how different algorithmic strategies tackle the challenges posed by temporal and visual data, rather than classifying methods solely by application domain. Such an organization enables a clearer comparison of the underlying principles, assumptions, and trade-offs across approaches, helping readers to identify techniques most suitable for specific data characteristics or use cases. We also explore the growing role of semi-supervised clustering, which incorporates limited human guidance, and discuss its potential for handling complex, multimodal data sources. By offering a unified methodological perspective, this survey reveals both shared challenges and domain-specific innovations in clustering time-series and vision data, and outlines key directions for future research.

2. Related Works

Clustering has been extensively studied in the machine learning community, with a wide range of algorithms developed for different data types and application domains [2,3]. Traditional clustering methods, such as k-means, hierarchical clustering, and DBSCAN [18], have been applied to both time-series and vision data. However, these methods often assume Euclidean distance metrics and struggle with high-dimensional or non-linearly separable data [4,5], making them less effective for complex temporal or visual inputs.
For time-series data, early approaches relied heavily on distance-based metrics, with Dynamic Time Warping (DTW) being one of the most widely used due to its ability to account for time shifts and alignments [16]. Variants such as the Barycenter Averaging (DBA) of DTW [19] and shape-based distance measures have been introduced to improve clustering accuracy. Model-based methods, such as Hidden Markov Models (HMMs) [20] and autoregressive models [21], have also been explored to capture temporal dependencies more explicitly. More recently, deep learning methods including recurrent neural networks (RNNs), temporal convolutional networks (TCNs) [22], and sequence autoencoders [23] have enabled feature learning from raw time-series data, allowing clustering to be performed in a learned representation space.
In vision data, traditional clustering approaches such as spectral clustering and Gaussian Mixture Models (GMMs) have seen success in applications involving low-level image features. However, these approaches are limited by their reliance on hand-crafted features. The rise of convolutional neural networks (CNNs) [24] has led to a paradigm shift, enabling end-to-end feature learning for clustering. Deep Embedded Clustering (DEC) [1], Joint Unsupervised Learning (JULE) [25], and self-supervised methods such as contrastive learning [17,26] have shown significant promise in clustering high-dimensional image and video data. Extensions of these approaches have incorporated attention mechanisms [27] and multi-scale architectures to better handle spatial hierarchies and variability in visual scenes.
Multimodal and semi-supervised clustering approaches are also gaining traction, especially in contexts that combine time-series and vision data, such as healthcare, video analytics, and autonomous systems [13]. Methods that align or fuse data across modalities—such as using shared latent representations or co-training frameworks [28]—are increasingly explored. Semi-supervised clustering techniques, which incorporate limited labeled data or constraints [29], further enhance clustering performance in scenarios with scarce supervision. A recent work proposed by Zhang et al. [30] is the Fast Projected Fuzzy Clustering with Anchor Guidance (FPFC) method, which addresses challenges in large-scale multimodal remote sensing imagery. FPFC constructs meaningful superpixels to denoise data and generate high-quality anchors, then projects these into a shared subspace to jointly learn unified anchor graphs and membership matrices across modalities. Through adaptive weighting, the approach achieves efficient and consistent clustering without supervision, marking one of the first soft clustering frameworks for multimodal data. This represents a shift from feature concatenation toward cross-modal structure alignment and efficient anchor-based representation, underscoring the growing importance of integrative multimodal frameworks that balance scalability, robustness, and interpretability.
Recent developments in nonlinear subspace clustering have sought to address the limitations of traditional kernel-based methods, which often rely on predefined kernels and struggle to preserve intrinsic data geometry. Xu, Chen, and Wang [31] proposed a Data-driven Kernel Learning Model (DKLM) that directly learns the kernel from the data’s self-representation, thereby adapting to complex nonlinear structures without manual kernel selection. DKLM enforces a multiplicative triangle inequality constraint to ensure robustness and simultaneously preserves manifold structure while promoting an optimal block-diagonal affinity matrix. This data-driven kernel learning paradigm marks a conceptual shift from fixed-kernel subspace clustering toward adaptive kernel learning, integrating representation learning with structural preservation. As a result, nonlinear clustering is evolving from parameter-dependent techniques to self-tuning frameworks that better capture manifold relationships inherent in high-dimensional data.
At the intersection of clustering and foundation models, Wang et al. [32] proposed FedCKMS, a cluster-aware framework for heterogeneous federated foundation model adaptation. FedCKMS employs multi-factor heterogeneity-aware clustering to group clients based on both data distributions and computational constraints, enabling efficient and personalized model deployment. By combining clustering with knowledge-aware model architecture search and cluster-level knowledge transfer, the framework facilitates adaptive fine-tuning of foundation models while mitigating communication and resource limitations. This approach reframes clustering as a coordination mechanism in large-scale distributed learning rather than a purely unsupervised grouping tool. It exemplifies how clustering is increasingly being integrated into federated and adaptive learning pipelines, driving the personalization and scalability of modern foundation models.
Numerous surveys have examined clustering techniques within the domains of time-series and vision data, each tackling domain-specific challenges such as temporal dependencies or high-dimensional spatial structures. For time-series clustering, foundational surveys like Liao et al. [10] and Aghabozorgi et al. [33] offer taxonomies of distance-based, feature-based, and model-based approaches, while recent reviews provide updates on deep learning advances and neural clustering frameworks [34,35]. Vision clustering has similarly evolved, with surveys such as Pavel et al. [9] and Yang et al. [12] discussing self-supervised, subspace, and deep clustering methods tailored to visual representations. More integrative perspectives are emerging, such as surveys on leveraging vision models for time series [36] and foundation models for time-series tasks [37], reflecting increasing cross-pollination between the two domains. However, only a few works, e.g., Xu et al. [38] and Min et al. [39], explicitly examine clustering from a unified viewpoint. This paper aims to bridge that gap by reviewing clustering methods applicable to each domain and highlighting approaches capable of addressing their shared challenges.

3. Shared Challenges

Clustering temporal and visual data shares several challenges that arise due to the inherent complexity and characteristics of both data types. These challenges impact the performance and applicability of clustering algorithms across different domains. Below are the main shared challenges:
High Dimensionality: Both time-series and vision data are often high-dimensional, meaning they consist of a large number of features or data points. For time-series data, this typically means having a large number of time steps or sensors per sample. For example, in applications like ECG signals or stock price history, a single time-series sample may consist of hundreds or thousands of time steps, each carrying rich temporal information [16]. In the case of vision data, particularly images or videos, each sample is represented by numerous pixels or voxels, often running into the thousands or even millions for high-resolution images or videos [24]. This high dimensionality presents challenges for clustering algorithms, as the data may become sparse, leading to difficulty in defining meaningful clusters. The curse of dimensionality makes it harder to determine the inherent similarity between data points, reducing the effectiveness of traditional distance-based clustering methods like k-means and increasing computational costs [3].
Temporal/Spatial Variability: Both time-series and vision data exhibit substantial variability. For time-series, patterns often shift over time due to seasonality, trends, or external factors, making it difficult to capture consistent patterns across different time periods [21]. For instance, in healthcare, ECG signals from different patients might have similar overall patterns, but each may be influenced by different factors, such as age, health conditions, or medications, causing variations in temporal behavior [20]. Similarly, in vision data, spatial variability refers to how objects in images or videos might change in position, orientation, or scale, which can cause inconsistencies in how these objects are represented across different samples [2]. For example, an object in an image may appear rotated, translated, or scaled differently depending on the camera angle or zoom level, making it harder to cluster similar objects. Thus, clustering algorithms must account for such variations while still recognizing common patterns across time-series or visual data.
Noise and Redundancy: Noise in data refers to irrelevant or erroneous information that can distort the underlying patterns. Both time-series and vision data are particularly prone to noise, which can arise due to sensor malfunctions, environmental factors, or data collection errors [18]. For instance, in time-series data from sensors, noise could be introduced due to signal interference, measurement inaccuracies, or missing data points [4,5]. Similarly, in vision data, noise can take the form of image artifacts, compression losses, or background clutter, which can negatively affect clustering results [17]. In addition to noise, redundancy is another problem, where different samples may carry repetitive or highly correlated information. For instance, a time-series dataset might have several sensors that measure similar quantities, causing redundant data. In images, redundancy could be in the form of similar features being present across multiple frames of a video. Handling both noise and redundancy is critical to improving the accuracy of clustering algorithms and ensuring meaningful clusters are formed.
Scalability: In real-world applications, both time-series and vision datasets are often large-scale, sometimes involving millions of data points. Time-series datasets, especially in fields like finance, healthcare, and IoT, can span long time periods and involve multiple sensors or variables, which can lead to massive datasets [22]. Vision datasets, particularly when dealing with videos or large-scale image databases, can contain millions of images or frames, each with high-dimensional feature vectors [27]. Clustering these large datasets can be computationally expensive, particularly when traditional methods such as k-means or hierarchical clustering are applied [2]. Thus, clustering algorithms must be scalable to handle the sheer volume of data. This often requires more efficient algorithms or techniques, such as approximate nearest neighbor search, or the use of distributed computing to speed up clustering procedures [13].
Multimodal Fusion: With the increasing availability of data from multiple sources, there is a growing trend to combine time-series and vision data in multimodal systems. This fusion of temporal and visual data offers richer information and has applications in areas such as autonomous driving, healthcare, and surveillance. For example, in autonomous vehicles, sensor data (e.g., LiDAR, cameras) and time-series data (e.g., speed, acceleration) must be integrated to provide a comprehensive understanding of the environment [28]. However, multimodal data presents its own challenges: aligning different data types in a common feature space, handling the varying temporal resolutions (e.g., videos captured at 30 fps versus time-series data recorded every second), and dealing with the heterogeneity of data [13]. Clustering in such multimodal settings requires algorithms that can handle these different types of data simultaneously, making it a complex but crucial task for applications that rely on multimodal data [1]. These challenges also overlap with those found in vision and spatial alignment, where integrating visual inputs from different viewpoints or spatial formats requires precise geometric reasoning. While both tasks involve aligning heterogeneous data, multimodal fusion focuses more on temporal and semantic integration across modalities, whereas vision and spatial alignments emphasize spatial coherence within the visual domain. Despite these differences, both domains face the shared difficulty of building models that can robustly align and cluster complex, multi-source data.

4. Distance-Based Clustering

Distance-based clustering is one of the most common approaches for clustering time-series and vision data. In these methods, the similarity (or dissimilarity) between data points is calculated using a distance metric, and the data points are then grouped based on their proximity in the feature space [40] as shown in Figure 2. However, time-series and vision data come with distinct challenges that require specialized distance metrics for accurate clustering [2].

4.1. Temporal Methods

Time-series data often exhibits non-linear variations, misalignments, and phase shifts, making it challenging to apply traditional distance metrics such as Euclidean distance. As a result, specialized distance measures have been developed to account for these issues. Below are some of the most commonly used distance-based techniques for clustering time-series data:
Dynamic Time Warping (DTW): DTW is a powerful distance metric that measures similarity between two time-series by aligning them non-linearly in the time dimension. Unlike the Euclidean distance, which requires the sequences to be of equal length and aligned in time, DTW allows for local shifts and stretching of the time axis. This is particularly useful for time-series data that may be misaligned, such as signals recorded at different time scales or events that occur at different points in time. DTW computes an optimal alignment by minimizing the accumulated cost over all possible alignments. It has been widely used in applications such as speech recognition, gesture recognition, and financial market prediction [16].
Let $X = (x_1, x_2, \dots, x_n)$ and $Y = (y_1, y_2, \dots, y_m)$ be two time-series. The DTW distance is computed via dynamic programming:

$$D(i, j) = \operatorname{dist}(x_i, y_j) + \min \left\{ D(i-1, j),\ D(i, j-1),\ D(i-1, j-1) \right\}$$

with the boundary condition $D(0, 0) = 0$, and $\operatorname{dist}(x_i, y_j)$ typically being the squared Euclidean distance $(x_i - y_j)^2$.
The final DTW distance is:

$$\operatorname{DTW}(X, Y) = D(n, m)$$
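To make the recursion concrete, the following is a minimal NumPy sketch of DTW with a squared-Euclidean local cost. It is illustrative only: the full $O(nm)$ table is computed, whereas practical implementations usually add a warping-window constraint (e.g., a Sakoe–Chiba band) or use an optimized library.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW via the dynamic-programming recursion above, with
    dist(x_i, y_j) = (x_i - y_j)^2. O(n*m) time and memory."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two versions of the same pattern sampled at different lengths
a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 70))
print(dtw_distance(a, b))  # small despite the length mismatch
```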
Edit Distance with Real Penalty (ERP): ERP is a variant of the classic edit distance used in string matching but designed for time-series data. In ERP, the distance between two time-series is calculated by considering both insertion/deletion and substitution operations, with penalties based on the magnitude of differences between corresponding points. ERP is particularly effective for dealing with time-series that may contain gaps or noise, as it allows for flexible alignment of the series while penalizing large differences in values. It has been applied in fields such as anomaly detection, sensor networks, and biometric identification [42]. Given two time-series $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$ and a reference value $g$ (often zero), the ERP distance is defined recursively as:

$$\operatorname{ERP}(i, j) = \min \left\{ \operatorname{ERP}(i-1, j-1) + \operatorname{dist}(x_i, y_j),\ \operatorname{ERP}(i-1, j) + \operatorname{dist}(x_i, g),\ \operatorname{ERP}(i, j-1) + \operatorname{dist}(g, y_j) \right\}$$

with $\operatorname{dist}(a, b) = |a - b|$ or $(a - b)^2$ and $\operatorname{ERP}(0, 0) = 0$.
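A corresponding sketch of the ERP recursion, using the absolute-difference cost and gap value $g = 0$. The boundary rows accumulate the cost of matching against the gap value, which is what makes ERP a proper metric.

```python
import numpy as np

def erp_distance(x, y, g=0.0):
    """ERP per the recursion above, with dist(a, b) = |a - b| and a
    reference (gap) value g."""
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1))
    # Boundary rows: deleting/inserting points is penalized relative to g
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + abs(x[i - 1] - g)
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + abs(y[j - 1] - g)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j - 1] + abs(x[i - 1] - y[j - 1]),  # match
                D[i - 1, j] + abs(x[i - 1] - g),             # gap in y
                D[i, j - 1] + abs(y[j - 1] - g),             # gap in x
            )
    return D[n, m]
```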
Move-Split-Merge (MSM): MSM is another technique for measuring the dissimilarity between time-series data that accounts for more complex sequence transformations. The MSM distance metric is based on the concept of aligning time-series by applying move, split, and merge operations, which allow for more flexible handling of temporal misalignments and variations. This method is well-suited for applications where time-series data undergoes structural changes, such as in biological data (e.g., gene expression) and motion capture data. MSM also allows for a hierarchical approach to clustering, making it adaptable to more complex time-series patterns [33]. Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$ be time-series. MSM defines the distance recursively using three operations (Move, Split, Merge):

$$\operatorname{MSM}(i, j) = \min \left\{ \operatorname{MSM}(i-1, j-1) + C_{\text{move}}(x_i, y_j),\ \operatorname{MSM}(i-1, j) + C_{\text{split}}(x_i),\ \operatorname{MSM}(i, j-1) + C_{\text{merge}}(y_j) \right\}$$

where $C_{\text{move}}(x_i, y_j)$ is the cost of aligning $x_i$ and $y_j$, and the split/merge costs are defined based on application-specific parameters.
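Since the split/merge costs are application-specific, the sketch below assumes the widely used constant-cost formulation, where a single tunable parameter $c$ penalizes each split or merge (plus any overshoot outside the neighboring values). Treat it as one possible instantiation, not the definitive MSM.

```python
import numpy as np

def _msm_cost(new, prev, other, c):
    """Split/merge cost: c if `new` lies between its neighbour `prev`
    and the opposing point `other`, else c plus the overshoot."""
    if (prev <= new <= other) or (prev >= new >= other):
        return c
    return c + min(abs(new - prev), abs(new - other))

def msm_distance(x, y, c=1.0):
    """MSM under the constant-cost formulation; c trades off move
    (substitution) against split/merge operations."""
    n, m = len(x), len(y)
    D = np.zeros((n, m))
    D[0, 0] = abs(x[0] - y[0])
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + _msm_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, m):
        D[0, j] = D[0, j - 1] + _msm_cost(y[j], x[0], y[j - 1], c)
    for i in range(1, n):
        for j in range(1, m):
            D[i, j] = min(
                D[i - 1, j - 1] + abs(x[i] - y[j]),                 # move
                D[i - 1, j] + _msm_cost(x[i], x[i - 1], y[j], c),   # split
                D[i, j - 1] + _msm_cost(y[j], x[i], y[j - 1], c),   # merge
            )
    return D[n - 1, m - 1]
```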
These distance-based methods have proven effective in handling misalignments and non-linearities in time-series data, enabling more accurate clustering and pattern discovery in domains such as healthcare, finance, and sensor networks.

4.2. Spatial Methods

In vision data, the traditional pixel-based distance metrics (e.g., Euclidean distance between raw pixel values) often fail to capture the complex relationships between images or video frames, especially when dealing with issues such as noise, translation, rotation, or scale differences. Therefore, clustering vision data typically requires the use of more advanced distance measures, particularly those based on high-level features learned from the data. Below are some of the most commonly used distance metrics for vision data clustering:
Cosine Distance (on deep features): Cosine distance is often used in image clustering when dealing with feature embeddings extracted by deep learning models, such as Convolutional Neural Networks (CNNs) [43]. Instead of using raw pixel values, images are represented by high-dimensional feature vectors learned by a pre-trained model, capturing the high-level semantic content of the images. Cosine distance computes the cosine of the angle between two vectors, providing a measure of their similarity regardless of their magnitude. This is particularly useful when images are transformed in ways that preserve their semantic content but not their pixel values, such as in the case of rotation or scaling. Cosine similarity has been widely used for image retrieval, face recognition, and object detection [44]. Let $f_1$ and $f_2$ be deep feature vectors extracted from two images using a CNN. The cosine similarity is defined as:

$$\operatorname{CosSim}(f_1, f_2) = \frac{f_1 \cdot f_2}{\|f_1\| \, \|f_2\|}$$

The cosine distance is then computed as:

$$\operatorname{CosDist}(f_1, f_2) = 1 - \operatorname{CosSim}(f_1, f_2)$$

This distance ranges from 0 (identical direction) to 2 (opposite direction), though in practice, for non-negative feature embeddings, the range is usually $[0, 1]$.
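A direct NumPy transcription of the two formulas (the small `eps` guarding against zero-norm vectors is an implementation convenience, not part of the definition):

```python
import numpy as np

def cosine_distance(f1, f2, eps=1e-12):
    """Cosine distance between two feature vectors, per the definition above."""
    sim = f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps)
    return 1.0 - sim

# For clustering, sklearn.metrics.pairwise.cosine_distances(F) returns the
# full pairwise matrix for a feature matrix F of shape (n_images, d).
```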
Earth Mover’s Distance (EMD): EMD, also known as Wasserstein distance, is a measure of the dissimilarity between two distributions by calculating the minimum amount of “work” required to transform one distribution into another. In the context of vision data, EMD is often applied to compare image histograms or distributions of features. EMD is particularly useful for handling cases where images have large differences in appearance but share underlying structural similarities. For example, EMD can be used to compare images of the same scene taken under different lighting conditions or with varying viewpoints. It has been applied in applications such as image retrieval, shape matching, and texture comparison [45]. Let $P = \{(p_i, w_i)\}$ and $Q = \{(q_j, u_j)\}$ be two discrete distributions (e.g., image histograms), where $p_i$ and $q_j$ are the features, and $w_i$, $u_j$ are their weights. Let $f_{ij}$ be the flow from $p_i$ to $q_j$. The EMD is defined as:

$$\operatorname{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$

subject to:

$$f_{ij} \geq 0, \qquad \sum_{j=1}^{n} f_{ij} \leq w_i \ \ \forall i, \qquad \sum_{i=1}^{m} f_{ij} \leq u_j \ \ \forall j, \qquad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min \left( \sum_{i=1}^{m} w_i,\ \sum_{j=1}^{n} u_j \right)$$

where $d_{ij}$ is the ground distance (e.g., Euclidean) between $p_i$ and $q_j$.
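For general ground distances, computing EMD means solving the flow problem above (e.g., with an optimal-transport library); for one-dimensional histograms it reduces to the closed-form Wasserstein-1 distance, which the SciPy-based sketch below uses. The 8-bit grayscale inputs and 64-bin histogram are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def histogram_emd(img1, img2, bins=64):
    """1-D EMD between the intensity histograms of two grayscale images
    (assumed uint8). For 1-D distributions, EMD coincides with the
    Wasserstein-1 distance, so no explicit flow solver is needed."""
    h1, edges = np.histogram(img1, bins=bins, range=(0, 255), density=True)
    h2, _ = np.histogram(img2, bins=bins, range=(0, 255), density=True)
    centers = (edges[:-1] + edges[1:]) / 2  # bin centers as feature values
    return wasserstein_distance(centers, centers, u_weights=h1, v_weights=h2)
```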
Structural Similarity Index (SSIM): SSIM is a metric designed to measure the perceived quality of images by taking into account luminance, contrast, and structure. Unlike pixel-based distance measures, SSIM compares images in terms of their structural similarity, which is closer to how the human visual system perceives images. This makes SSIM more robust to noise and minor distortions, such as slight changes in lighting or small image transformations. SSIM is often used in image denoising, image compression, and video quality assessment, but it has also found application in clustering tasks, particularly when comparing images with different lighting or noise conditions. Clustering based on SSIM can be more robust to real-world variations, such as those encountered in surveillance, medical imaging, and satellite imaging [46]. Given two image patches $x$ and $y$, SSIM is computed as:

$$\operatorname{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where:
  • $\mu_x$, $\mu_y$ are the mean intensities of $x$ and $y$
  • $\sigma_x^2$, $\sigma_y^2$ are the variances
  • $\sigma_{xy}$ is the covariance between $x$ and $y$
  • $C_1$, $C_2$ are small constants to stabilize the division
SSIM values range from $-1$ to $1$, where $1$ indicates perfect structural similarity.
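A direct NumPy transcription of the formula over whole patches; libraries such as scikit-image instead compute SSIM over sliding local windows, which is the form normally used in practice. The constants follow the common $C_k = (K_k L)^2$ convention with $K_1 = 0.01$, $K_2 = 0.03$, $L = 255$, an assumption rather than something fixed by the text.

```python
import numpy as np

def ssim(x, y, L=255, K1=0.01, K2=0.03):
    """Global SSIM between two same-sized patches, per the formula above."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()           # sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # sigma_xy
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    )
```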
These distance-based methods for vision data provide a more flexible and robust framework for clustering, as they focus on high-level feature representations instead of raw pixel values. They are particularly effective in dealing with real-world challenges in vision data, such as noise, occlusions, and geometric transformations [47].
K-Means clustering: K-Means is one of the most widely used and straightforward clustering algorithms [48]. As shown in Figure 3, it partitions a dataset into k clusters by iteratively assigning each data point to the nearest cluster centroid and then updating the centroids as the mean of the points assigned to each cluster. The algorithm aims to minimize the within-cluster sum of squared distances, promoting compact and well-separated clusters. Due to its simplicity and efficiency, K-Means is popular in various applications; however, it assumes clusters are roughly spherical and of similar size, and it requires the number of clusters k to be specified beforehand. Additionally, K-Means is sensitive to initialization and outliers, which can affect the quality of the clustering results [49].
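A minimal scikit-learn sketch on stand-in feature vectors; the k-means++ initialization (scikit-learn's default) and multiple restarts (`n_init`) mitigate the initialization sensitivity noted above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))  # stand-in for deep features

km = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = km.fit_predict(embeddings)  # one cluster label per sample
inertia = km.inertia_                # within-cluster sum of squares
```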

5. Density-Based Clustering

Density-based clustering algorithms define clusters as dense regions of data points separated by areas of lower density. As shown in Figure 4, unlike centroid-based approaches such as K-Means, these methods do not assume spherical cluster shapes or require the number of clusters to be specified in advance. They are particularly effective for datasets with irregular or complex cluster structures and are robust to noise and outliers. Figure 5 presents the results of applying various density-based clustering methods to the CIFAR-10 dataset.
One of the most widely used density-based algorithms is DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [18], which groups together points that lie within a specified radius ε of each other and have at least a minimum number of neighboring points (minPts). Points that do not meet these criteria are considered noise. DBSCAN can discover clusters of arbitrary shape and handles noise naturally but is sensitive to parameter selection and struggles with varying cluster densities.
OPTICS (Ordering Points To Identify the Clustering Structure) [51] extends DBSCAN by removing the need for a global density threshold. Instead, it builds a reachability plot that reflects the clustering structure across multiple density levels, making it more flexible for datasets with variable density.
HDBSCAN (Hierarchical DBSCAN) [52] further generalizes DBSCAN by constructing a hierarchical clustering tree based on density and selecting clusters using a stability criterion. It improves cluster detection for data with differing densities and often provides more reliable results without requiring a single global density threshold.
Another method, DENCLUE (DENsity-based CLUstEring) [53], uses kernel density estimation to model the data distribution and identifies clusters by locating density attractors. Although mathematically elegant, DENCLUE can be computationally intensive for large datasets.
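The three scikit-learn estimators below illustrate the trade-offs just described on synthetic blobs with background noise; `HDBSCAN` requires scikit-learn 1.3+, and DENCLUE is omitted for lack of a widely used reference implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, HDBSCAN  # HDBSCAN: sklearn >= 1.3

rng = np.random.default_rng(0)
# Two dense blobs plus sparse background noise
X = np.vstack([
    rng.normal(0, 0.3, size=(200, 2)),
    rng.normal(3, 0.3, size=(200, 2)),
    rng.uniform(-2, 5, size=(40, 2)),
])

db = DBSCAN(eps=0.3, min_samples=10).fit(X)   # label -1 marks noise
op = OPTICS(min_samples=10).fit(X)            # no single global eps needed
hd = HDBSCAN(min_cluster_size=25).fit(X)      # tolerates varying densities

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, int(np.sum(db.labels_ == -1)))  # clusters found, noise points
```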

6. Feature-Based Clustering

Feature-based clustering is a technique where raw data is transformed into a more manageable form by extracting relevant features, which reduces the data’s dimensionality and enhances the clustering process. By representing the data through a smaller set of features, it becomes easier to discover meaningful patterns and clusters. This section discusses various feature extraction methods commonly used for time-series and vision data, followed by a discussion on the application of dimensionality reduction techniques to further improve clustering quality.

6.1. Time-Series Features

As shown in Figure 6, time-series data is inherently high-dimensional, and direct clustering in its raw form can be computationally expensive and ineffective due to the noise and variability often present in the data. Extracting meaningful features from time-series data enables more efficient clustering. Below are some common feature extraction methods:
Fourier and Wavelet Coefficients: Fourier transform and wavelet transform are widely used to decompose time-series data into frequency components, enabling the extraction of features that represent the periodic or oscillatory behavior of a time-series. The Fourier transform breaks the signal into sinusoidal components, while wavelet transform provides a multi-resolution analysis, offering better time localization for non-stationary signals [54]. These transforms help capture patterns that are not immediately apparent in the raw time-domain data, especially when the time-series exhibits periodic behavior or varying frequency content. Fourier coefficients are especially useful when dealing with periodic signals (e.g., ECG, sensor data), whereas wavelet coefficients are used to handle non-stationary signals, such as irregular financial time series or anomalous sensor data. By representing time-series in the frequency domain, these methods reduce the dimensionality and allow for more efficient clustering. Given a time-series $x(t)$, the Discrete Fourier Transform (DFT) transforms it into the frequency domain:

$$X(f_k) = \sum_{n=0}^{N-1} x(n) \cdot e^{-2 \pi i \cdot k n / N}, \qquad k = 0, 1, \dots, N-1$$

Here, $X(f_k)$ are the Fourier coefficients that capture the contribution of each frequency component $f_k$ in the signal.
The Discrete Wavelet Transform (DWT) decomposes the signal into approximation and detail coefficients using a chosen wavelet $\psi(t)$:

$$W_{j,k} = \sum_{n} x(n) \cdot \psi_{j,k}(n)$$

where $\psi_{j,k}(n) = 2^{-j/2} \, \psi(2^{-j} n - k)$ are the scaled and shifted versions of the mother wavelet $\psi(t)$, and $W_{j,k}$ are the wavelet coefficients at scale $j$ and location $k$.
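A sketch of both feature extractors: truncated DFT magnitudes via NumPy, and multi-level DWT coefficients via the third-party PyWavelets package (an assumed dependency); the number of coefficients, wavelet family, and decomposition level are illustrative choices.

```python
import numpy as np
import pywt  # PyWavelets, a common third-party choice for the DWT

def spectral_features(x, n_coeffs=16):
    """Magnitudes of the leading DFT coefficients as a fixed-length
    frequency-domain feature vector for clustering."""
    spectrum = np.abs(np.fft.rfft(x))
    return spectrum[:n_coeffs]

def wavelet_features(x, wavelet="db4", level=3):
    """Concatenated approximation/detail coefficients from a multi-level
    DWT; captures localized, non-stationary behaviour."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    return np.concatenate(coeffs)
```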
Shapelets: Shapelets are small, discriminative subsequences within time-series data that capture the underlying patterns or motifs within the series [55]. A shapelet is a subsequence that maximizes the dissimilarity between different classes or clusters in the time-series dataset. Shapelet-based clustering methods identify these patterns, which often correspond to key events or transitions in the data. For example, in healthcare, certain segments of an ECG signal may correspond to abnormal heart rhythms. By identifying and clustering similar shapelets, one can group time-series with similar events or behaviors, even if the overall signal appears quite different. Shapelet discovery has been used effectively in various applications, such as anomaly detection, health monitoring, and gesture recognition, where the identification of key motifs is crucial for grouping similar behavior. Let $T = [t_1, t_2, \dots, t_N]$ be a time-series and $S = [s_1, s_2, \dots, s_L]$ a shapelet (subsequence). The distance between shapelet $S$ and time-series $T$ is defined as the minimum Euclidean distance between $S$ and any subsequence of $T$ of length $L$:

$$\operatorname{Dist}(S, T) = \min_{i = 1, \dots, N-L+1} \sqrt{\sum_{j=1}^{L} (s_j - t_{i+j-1})^2}$$
This distance is used as a feature for clustering or classification based on how well the shapelet matches parts of a time-series.
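The distance itself is straightforward to implement; given a set of already discovered shapelets, each series is then represented by its vector of shapelet distances, which serves as the clustering feature.

```python
import numpy as np

def shapelet_distance(S, T):
    """Minimum Euclidean distance between shapelet S and any
    subsequence of T of the same length, per the formula above."""
    L = len(S)
    return min(
        np.sqrt(np.sum((S - T[i:i + L]) ** 2))
        for i in range(len(T) - L + 1)
    )

# Feature matrix for clustering: one row per series, one column per shapelet
# features[i, k] = shapelet_distance(shapelets[k], series[i])
```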
Statistical Summaries: In many time-series applications, it is useful to summarize the overall statistical properties of the data. Common statistical features include mean, standard deviation, skewness, kurtosis, and autocorrelation [56]. These statistical summaries capture the general distribution, trend, and dynamics of the time-series. For instance, in financial time-series data, the mean and variance can capture trends, while autocorrelation might reveal periodicity or cyclic behavior. These summaries reduce the dimensionality of the raw time-series, providing a compact representation that is easier to cluster. While these features may lose some fine-grained information, they can still be very effective when the time-series follows predictable statistical patterns. Given a time-series $x = [x_1, x_2, \dots, x_N]$, statistical features can be computed as:

$$\text{Mean:} \quad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

$$\text{Standard Deviation:} \quad \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

$$\text{Skewness:} \quad \gamma = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^3$$

$$\text{Kurtosis:} \quad \kappa = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^4 - 3$$

$$\text{Autocorrelation:} \quad \rho_k = \frac{1}{(N-k) \, \sigma^2} \sum_{i=1}^{N-k} (x_i - \mu)(x_{i+k} - \mu)$$
These features form a fixed-length vector representing the time-series for clustering.
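A compact sketch using NumPy and SciPy; the set of autocorrelation lags is an arbitrary illustrative choice, and SciPy's `kurtosis` returns the excess kurtosis by default, matching the $\kappa$ defined above.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def summary_features(x, lags=(1, 2, 5)):
    """Fixed-length statistical feature vector for one series: mean,
    standard deviation, skewness, excess kurtosis, and autocorrelations
    at a few chosen lags."""
    mu, sigma = x.mean(), x.std()
    acf = [np.mean((x[:-k] - mu) * (x[k:] - mu)) / sigma ** 2 for k in lags]
    return np.array([mu, sigma, skew(x), kurtosis(x), *acf])
```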
These feature extraction techniques enable clustering algorithms to focus on the most informative aspects of the time-series data, making clustering more efficient and interpretable.

6.2. Vision Features

In vision data, clustering is often performed in the high-dimensional feature space rather than on raw pixel data. Direct clustering of raw images typically leads to poor results due to their high dimensionality, noise, and variability in appearance. Feature extraction in the form of high-level embeddings or low-level descriptors enables clustering algorithms to operate on a more compact, meaningful representation of the images. Below are some common feature extraction methods used for image data:
CNN-based Embeddings (e.g., ResNet, VGG, Inception): Convolutional Neural Networks (CNNs) are deep learning models that automatically learn hierarchical representations of images [57]. CNN-based embeddings refer to the feature vectors generated by a CNN, typically from the final layers of the network, which capture high-level semantic information about the image. These embeddings are often used in clustering tasks, as they effectively summarize complex visual content into a fixed-length vector, which can then be clustered using techniques like k-means or DBSCAN. Pretrained CNNs, such as ResNet or VGG, are commonly used for this purpose. These embeddings are invariant to transformations such as rotation, scaling, and translation, making them robust to visual changes in the data. CNN-based embeddings are particularly powerful for clustering images in large-scale datasets, such as for object recognition, face clustering, and scene analysis.
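One common recipe, sketched below with torchvision (assuming version 0.13+ for the weights API): take a pretrained ResNet-50, replace its classification head with the identity, and use the pooled 2048-dimensional activations as clustering features.

```python
import torch
from torchvision import models

# Pretrained backbone with the classification head removed
weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()   # keep the 2048-d pooled features
backbone.eval()

preprocess = weights.transforms()   # standard ImageNet resize/crop/normalize

@torch.no_grad()
def embed(images):
    """images: list of PIL images -> (N, 2048) embedding matrix that can
    be fed to k-means, DBSCAN, etc."""
    batch = torch.stack([preprocess(img) for img in images])
    return backbone(batch)
```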
Local Descriptors (e.g., SIFT, HOG): Local feature descriptors are designed to capture distinctive patterns or keypoints in an image, which are robust to various transformations (such as scale, rotation, and affine distortion). Some popular local descriptors include Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) [58]. SIFT identifies keypoints in an image and extracts descriptors around these points, making it well-suited for image matching and object recognition. HOG focuses on capturing the distribution of gradients or edge directions within localized regions, making it highly effective for tasks such as pedestrian detection. These local features are often clustered to group similar objects or scenes together, especially in tasks where fine-grained image details (such as object shape) are important. Local descriptors can be especially useful when images vary in size or when certain local features, rather than global structures, define the similarity between images.
Recent advancements in multimodal learning and self-supervised training have led to powerful large-scale models that extract high-quality embeddings suitable for clustering tasks. Models such as CLIP (Contrastive Language-Image Pretraining) [59], BLIP (Bootstrapping Language-Image Pretraining) [60], and DINO (Self-Distilled Vision Transformers) [61] have significantly improved how visual and textual data are represented. CLIP maps images and text into a shared embedding space using large-scale image-text pairs, enabling semantic clustering based on high-level features rather than raw pixel data. BLIP extends this idea by focusing on vision-and-language pretraining with better grounding and captioning capabilities, making its embeddings well-suited for clustering tasks that involve multimodal data or require understanding textual context. DINO, on the other hand, leverages Vision Transformers and self-distillation to learn visual representations in a completely self-supervised manner. Its embeddings are effective for capturing both global image structures and fine-grained details, proving valuable in unsupervised clustering and anomaly detection tasks.
In addition to these, models like SimCLR [26] have been instrumental in advancing contrastive self-supervised learning, training image encoders to distinguish between different instances without requiring labels. Another noteworthy model is MAE (Masked Autoencoders), which learns representations by reconstructing masked portions of input images, resulting in embeddings that capture contextual and structural information useful for clustering [62].
The use of such pretrained embedding models has shifted clustering from a purely distance-based task over raw features to one that operates in rich, semantically meaningful spaces, enabling more accurate, interpretable, and scalable clustering across vision and multimodal datasets. These large pre-trained models provide powerful, high-dimensional embeddings that can be directly applied to clustering tasks, enabling the grouping of images based on high-level semantic features rather than raw pixel data or simple handcrafted features.

6.3. Dimensionality Reduction for Clustering

Although feature extraction reduces the dimensionality of the data, the resulting representation may still be high-dimensional, leading to computational challenges and difficulties in clustering. Dimensionality reduction methods such as Principal Component Analysis (PCA) [63] and Uniform Manifold Approximation and Projection (UMAP) [64] are commonly used to reduce the dimensionality of high-dimensional data. These methods focus on preserving the most informative details of the data; however, this preservation comes with challenges. Depending on the structure of the data and the reduction technique used, some relevant information may be lost. Therefore, while these methods are powerful, their results should be interpreted with care.
Principal Component Analysis (PCA): PCA is a linear dimensionality reduction technique that identifies the principal components (the directions of maximum variance) in the data and projects the data onto a lower-dimensional space formed by these components [63]. It is widely used in time-series and vision applications to reduce the number of features while retaining as much variance as possible. By eliminating less informative components, PCA makes clustering more efficient and can lead to more distinct and well-separated clusters. For instance, in financial time-series data, PCA can be used to reduce the number of variables by focusing on the principal components that explain the most variance in stock prices, improving the clustering of similar stock behaviors. However, PCA assumes linear relationships in the data, which may not hold in many real-world scenarios where complex, non-linear patterns are present. Additionally, because PCA focuses solely on variance, it may retain components that capture noise rather than meaningful structure, potentially degrading clustering performance in such cases.
Uniform Manifold Approximation and Projection (UMAP): UMAP is a non-linear dimensionality reduction technique that focuses on preserving both local and global structure in the data [64]. Unlike PCA, which is a linear method, UMAP can better capture complex relationships in the data by modeling the data as a manifold in a higher-dimensional space. UMAP has become increasingly popular for reducing the dimensionality of high-dimensional data, such as images and time-series, before clustering. It is particularly useful when the data lies on a non-linear manifold, which is common in high-dimensional image and time-series datasets. While UMAP is commonly described as preserving local neighborhood relationships through the construction of fuzzy simplicial sets, in practice it only approximates local structure, and the quality of that approximation varies widely with the dataset and settings. Multiple studies have reported cases where UMAP fails to maintain accurate neighborhood structure, highlighting its limitations in consistently capturing local geometry [65,66]. Additionally, UMAP can be sensitive to hyperparameters such as the number of neighbors and minimum distance, which can significantly influence the resulting embedding. Moreover, the stochastic nature of its optimization may lead to variability across different runs, making reproducibility a challenge.
t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear dimensionality reduction technique widely used for visualizing high-dimensional data in a lower-dimensional space, typically two or three dimensions [67]. t-SNE aims to maintain local similarity: points that are close together in high-dimensional space tend to stay close in the low-dimensional embedding. It achieves this by modeling pairwise similarities using probabilities. However, to achieve this aim, t-SNE sacrifices the preservation of global structure, and in practice it preserves local neighborhoods reliably only when clear cluster structure exists in the data. Moreover, t-SNE is sensitive to hyperparameters such as perplexity, as well as to initialization and randomness. As a result, local neighborhoods can sometimes be distorted, merged, or split artificially depending on the dataset and chosen parameters. Additionally, since t-SNE embeddings are not deterministic, repeated runs with the same data and parameters may produce different results. While t-SNE is a powerful tool for visualization, its sensitivity and tendency to distort global structure make it challenging for general-purpose dimensionality reduction or clustering preprocessing.
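A sketch of a typical pipeline reflecting these caveats: PCA first for denoising and speed, then UMAP (third-party umap-learn package) or t-SNE on the reduced space, with the 2-D maps reserved for visual inspection rather than used as the clustering space itself. The stand-in feature matrix and parameter values are illustrative.

```python
import numpy as np
import umap  # third-party package umap-learn
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 512))  # stand-in for deep feature embeddings

Z_pca = PCA(n_components=50).fit_transform(Z)               # linear, fast
Z_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
                   random_state=0).fit_transform(Z_pca)      # non-linear
Z_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(Z_pca)           # for plots only
```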
Figure 7 illustrates the clustering results on the CIFAR-10 dataset using t-SNE for visualization and K-Means for clustering across different embedding types: CLIP, BLIP, DINOv2, and CNN. These embeddings reduce the high-dimensional image data into more tractable feature spaces, allowing t-SNE to capture the overall structure and K-Means to partition the data into clusters. The results demonstrate varying degrees of class separability, with CLIP and DINOv2 embeddings generally showing more distinct and well-formed clusters compared to traditional CNN features, suggesting stronger semantic encoding. However, challenges arise due to t-SNE’s sensitivity to parameter settings and its tendency to distort local distances, which can mislead interpretation of cluster boundaries. Additionally, K-Means assumes spherical cluster shapes and equal variance, which may not hold for real-world image embeddings, potentially leading to suboptimal clustering performance. These limitations highlight the importance of embedding choice and methodological awareness when interpreting clustering outcomes.

7. Probabilistic Model-Based Clustering

Model-based clustering approaches assume that the data is generated from a probabilistic model, which allows for the identification of latent structures in the data as shown in Figure 8. These methods are particularly useful in scenarios where the data can be well-approximated by a mixture of distributions. In this section, we cover some of the most commonly used models for clustering both time-series and vision data, including Gaussian Mixture Models (GMMs), temporal models, and visual generative models [68].

7.1. Mixture Models

7.1.1. Gaussian Mixture Models (GMMs) and Variants

Gaussian Mixture Models (GMMs) assume that the data points are drawn from a mixture of several Gaussian distributions. GMMs are widely used for clustering both time-series and visual data because they allow for the modeling of complex data distributions that may not be adequately represented by a single Gaussian [69]. Each component of the GMM represents a cluster, and the model assigns probabilities to each data point to determine which cluster it most likely belongs to [68]. The model is typically fit using the Expectation-Maximization (EM) algorithm [70], which iteratively optimizes the likelihood of the data given the mixture model.
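A minimal scikit-learn sketch of EM-fitted GMM clustering on synthetic features; unlike k-means, `predict_proba` exposes soft memberships, and information criteria such as the BIC can guide the choice of the number of components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 8)), rng.normal(4, 1, (300, 8))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=3, random_state=0).fit(X)
hard = gmm.predict(X)          # most likely component per point
soft = gmm.predict_proba(X)    # soft (probabilistic) memberships
print(gmm.bic(X))              # lower BIC suggests a better model choice
```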

7.1.2. Applications

Time-Series Clustering with GMMs: GMMs can be used to model time-series data when the underlying temporal patterns are Gaussian or can be approximated by Gaussian distributions. For example, in the financial domain, stock price movements or sensor readings may exhibit Gaussian-like behavior [71]. GMMs can be extended to include temporal dependencies, such as through the use of Hidden Markov Models (HMMs) [20] or autoregressive processes, which can help capture the sequential nature of time-series data.
Visual Clustering with GMMs: In image clustering, GMMs can be applied to cluster pixels, image patches, or deep embeddings extracted from CNNs. For example, GMMs have been used for image segmentation, where each component of the mixture model corresponds to a different region of the image [72]. The advantage of GMMs in visual clustering lies in their flexibility to model data with complex distributions, which is common in natural images due to their variability in color, texture, and shape. GMMs can also be used in combination with deep learning-based embeddings (e.g., CNNs or autoencoders) [73] to cluster feature representations of images instead of raw pixel data.
GMMs are a powerful tool for modeling the underlying distribution of the data and are often used as a starting point for clustering tasks. However, more complex models are often necessary when the data exhibits temporal dependencies or complex generative processes.

7.2. Temporal Probabilistic Models

Time-series data is characterized by temporal dependencies, which are often crucial for accurate clustering. Several model-based approaches have been developed specifically for handling time-series data. These models incorporate temporal dependencies, allowing them to model sequences that exhibit trends, seasonality, or other dynamic behaviors. Below are some of the key temporal models used for clustering time-series data:

7.2.1. Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) are a powerful probabilistic tool used for modeling sequential data where the system is assumed to be in one of several latent states at any given time [20]. The transitions between these states follow a Markov process, meaning the next state depends only on the current state and not on previous states. HMMs are widely used for time-series clustering because they allow for the modeling of both observed data and hidden states, making them effective for sequences with temporal dependencies [74]. For example, HMMs are frequently applied in speech recognition, where different states correspond to different phonemes, or in finance, where they can model different market regimes. GMMs and HMMs both fall under the category of model-based clustering, but they are suited to different application scenarios due to the nature of the data they model. GMMs are typically applied in scenarios where the data points (images or time-series data) are assumed to approximately follow a mixture of Gaussian distributions. This assumption often holds for clustering visual features extracted from natural images, such as color histograms or SIFT descriptors [75], where the feature distributions tend to be unimodal or multimodal Gaussian-like. However, other types of images, including medical imaging (MRI, CT scans) or textured and highly structured images, contain complex, non-Gaussian distributions that may violate GMM assumptions. In such cases, relying only on GMMs can lead to suboptimal clustering performance, and alternative models or preprocessing steps may be necessary to better capture the underlying data structure. HMMs are designed to model temporal sequences by incorporating hidden states and transition probabilities, making them particularly effective for time-series data, such as activity recognition from sensor readings over time. The performance of these models thus varies with data type: GMMs are advantageous for static, spatial clustering, while HMMs capture temporal dependencies and are better suited for dynamic, sequential inputs. HMMs are a natural fit for the many real-world scenarios in which the underlying system dynamics are not directly observable but influence the observed outputs. The Markov property simplifies modeling by relying only on the current state to predict the next, which aligns well with sequential data where immediate past behavior is most relevant. Furthermore, the assumption that observations are conditionally independent given the hidden state reflects situations where the observed data is primarily driven by the current latent process, such as in speech or activity recognition.
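As one illustrative recipe for HMM-based time-series clustering (not prescribed above), the sketch below fits a small Gaussian HMM per sequence using the third-party hmmlearn package, describes each sequence by its length-normalized log-likelihood under every fitted model, and clusters those likelihood profiles.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package `hmmlearn`
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy sequences drawn from two regimes with different means
seqs = [rng.normal(m, 1, size=(100, 1)) for m in (0, 0, 5, 5)]

models = []
for s in seqs:  # one small HMM per sequence
    m = GaussianHMM(n_components=2, covariance_type="diag",
                    n_iter=50, random_state=0).fit(s)
    models.append(m)

# Each sequence -> its (length-normalized) log-likelihood under every model
feats = np.array([[m.score(s) / len(s) for m in models] for s in seqs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
print(labels)  # sequences from the same regime should share a label
```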

7.2.2. Autoregressive Models

Autoregressive (AR) models are used for time-series data where the current value of the series depends linearly on its previous values [76]. AR models are simple but powerful for modeling data that exhibit temporal dependencies, such as financial stock prices, temperature readings, or sensor data. Variants of AR models, such as ARMA (AutoRegressive Moving Average) [77] and ARIMA (AutoRegressive Integrated Moving Average) [78], are also commonly used to model stationary and non-stationary time-series data. However, AR models assume a linear relationship between past and present values, which may limit their effectiveness when applied to time-series with complex or non-linear dynamics. Additionally, these models typically require strong assumptions about stationarity and can struggle with capturing long-term dependencies or sudden structural changes in the data.
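A hedged sketch of one simple AR-based clustering recipe: fit an AR(p) model per series with statsmodels and cluster the fitted coefficient vectors. It presumes roughly stationary series, consistent with the assumptions noted above; the toy generator and lag order are illustrative.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def ar1(phi, n=300):
    """Toy AR(1) series with coefficient phi."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

series = [ar1(0.9) for _ in range(5)] + [ar1(-0.5) for _ in range(5)]

# Represent each series by its fitted AR coefficients, then cluster those
coeffs = np.array([AutoReg(s, lags=3).fit().params for s in series])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coeffs)
print(labels)  # the two dynamics should separate cleanly
```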

7.3. Bayesian State-Space Models

Bayesian state-space models combine the flexibility of Bayesian inference with the ability to model complex temporal dependencies [79]. These models assume that the observed data is generated by a hidden process, which can be described by a set of states evolving over time. State-space models are particularly effective for handling noisy or incomplete time-series data, as they explicitly account for uncertainty in both the observations and the model parameters. In clustering, Bayesian state-space models can be used to identify underlying temporal patterns in the data, such as trends, cycles, or regime changes, and group similar time-series accordingly [79]. However, these models can be computationally intensive, especially when applied to large datasets or when using complex priors and inference algorithms such as Markov Chain Monte Carlo (MCMC) or variational methods. Additionally, model specification can be challenging, as results are sensitive to prior choices and assumptions about the latent state dynamics and observation processes.

8. Deep Learning Approaches

Deep learning techniques have significantly advanced clustering tasks, particularly due to their ability to learn rich, high-dimensional representations directly from raw data. In both time-series and visual data clustering, deep learning models excel at extracting useful features and reducing the complexity of clustering tasks. This section discusses several deep learning approaches that have been successfully applied to clustering, including representation learning, joint clustering models, and graph-based methods.

8.1. Representation Learning

Deep networks can automatically learn embeddings (low-dimensional representations) that capture the underlying structure of data. These embeddings are often more effective for clustering than manually engineered features. The primary advantage of deep learning-based representations is that they allow for learning complex, non-linear patterns in the data that traditional methods may not capture. Various deep learning models have been used to extract representations for time-series and visual data, including:

8.1.1. Long Short-Term Memory (LSTM)/Gated Recurrent Units (GRU)

LSTM and GRU networks are widely used for modeling sequential data such as time-series, as they can capture long-term dependencies and temporal patterns that are essential for effective representation learning. These models have been extensively applied in time-series clustering tasks by first encoding input sequences into latent embeddings, which are then clustered using traditional algorithms such as k-means or DBSCAN [80,81]. For instance, LSTM-based autoencoders have been used for clustering ECG signals [82], human activity sequences [83], and financial time-series [84], demonstrating their capacity to learn temporally informed embeddings that improve clustering performance.
GRUs, which are a simplified variant of LSTMs, have also been shown to be effective for time-series representation with fewer parameters, making them suitable for scenarios with limited data. Some hybrid approaches integrate attention mechanisms or variational encoders with LSTMs to enhance sequence modeling and improve cluster separability [83]. However, training these models can be computationally demanding and may require careful tuning to avoid overfitting, especially in domains with noisy or sparse sequences. Moreover, despite their strength in modeling moderate-range dependencies, LSTM and GRU architectures can struggle with very long sequences or fine-grained temporal variations unless enhanced with architectural extensions [85].
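A minimal sketch of this encode-then-cluster pattern is shown below; the LSTM encoder here is untrained and the data random, whereas in practice the encoder would be trained, for example inside an autoencoder with a reconstruction loss, before its hidden states are clustered.

```python
# Sketch: encode sequences with an LSTM and cluster the final hidden states.
# The encoder is untrained purely for illustration.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class LSTMEncoder(nn.Module):
    def __init__(self, input_size=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        _, (h, _) = self.lstm(x)     # h: (num_layers, batch, hidden_size)
        return h[-1]                 # last layer's hidden state as the embedding

encoder = LSTMEncoder()
x = torch.randn(40, 100, 1)          # 40 hypothetical series of length 100
with torch.no_grad():
    embeddings = encoder(x).numpy()
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
```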

8.1.2. Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) are primarily used for visual data but have also been effectively adapted for time-series clustering, particularly when temporal data is transformed into 2D representations such as spectrograms, recurrence plots, or Gramian Angular Fields [86,87]. In such cases, CNNs can extract spatially organized temporal patterns that are useful for downstream clustering. For raw time-series, 1D CNNs have been applied directly to capture local structures like peaks, shifts, and repetitive motifs [88,89], which serve as discriminative features for clustering algorithms such as k-means or spectral clustering [90]. In visual domains, CNN-based embeddings are widely used in tasks like face clustering [91,92] or object discovery [38]. These models generate compact, high-level feature representations that can be clustered more effectively than raw image pixels. CNN-based autoencoders and Siamese architectures have also been used to learn pairwise similarity functions, aiding in more structure-aware clustering [25,93].
Traditional CNNs can be extended to operate over graph-structured data [94]. They have been used for both time-series and vision data clustering, where the data points are connected by edges that represent relationships or similarities. For instance, in time-series clustering, Graph Convolutional Networks (GCNs) can model relationships between different sensor readings over time, while in vision clustering, GCNs can capture spatial relationships between different regions in an image or between frames in a video [95,96]. They perform layer-wise propagation by applying a localized spectral convolution on the graph. At each layer, the feature representation of a node is updated by aggregating features from its immediate neighbors using a normalized adjacency matrix [97]. A common formulation of a GCN layer is:
$$H^{(l+1)} = \sigma\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),$$
where $\hat{A} = A + I$ is the adjacency matrix with added self-loops, $\hat{D}$ is the corresponding degree matrix, $H^{(l)}$ is the node feature matrix at layer $l$, $W^{(l)}$ is the learnable weight matrix, and $\sigma$ is a non-linear activation function such as ReLU. This operation enables each node to iteratively integrate information from its local neighborhood. GCNs may be useful for clustering tasks because they can learn compact node embeddings that encode both node features and graph structure, which can then be clustered using algorithms like k-means or spectral clustering [94]. However, applying GCNs to clustering also presents several challenges. First, GCNs rely on a predefined graph structure, and constructing an informative adjacency matrix often requires domain knowledge or assumptions about data similarity. Second, they are typically shallow (1–3 layers) due to the over-smoothing problem, where increasing depth causes node embeddings to become indistinguishable; this limits their ability to capture long-range dependencies in the graph. Third, in unsupervised clustering settings, the lack of label supervision can make training unstable or less effective, requiring auxiliary losses (e.g., contrastive or reconstruction losses) or pretext tasks. Finally, scalability remains a concern for large graphs, as the matrix operations involved in GCN propagation can be memory-intensive and computationally expensive [98,99].
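The following sketch implements a single propagation step of the formula above in plain numpy, with a small random graph and random weights standing in for learned parameters.

```python
# Sketch: one GCN propagation step, H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)             # ReLU activation

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops
H = rng.standard_normal((6, 8))                # node features
W = rng.standard_normal((8, 4))                # weights (random here, learned in practice)
H1 = gcn_layer(A, H, W)
```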

8.1.3. Graph-Based Methods

Graph-based clustering methods have gained significant attention in recent years, especially for their ability to model complex relationships between data points. In domains such as time-series and vision, the inherent structure of the data is often more effectively represented as a graph, where nodes correspond to individual instances (e.g., time series or image regions) and edges encode similarities or interactions between them.
Graph Neural Networks (GNN) have emerged as powerful tools for learning from such graph-structured data [97]. By propagating information across the graph, these networks can capture inter-instance dependencies that traditional models may overlook [100]. This message-passing mechanism enables each node to iteratively refine its representation by aggregating information from its neighbors using learnable functions, thereby facilitating more expressive and meaningful clustering [101] as shown in Figure 9.
In the context of time-series data, these models can capture temporal relationships across different sequences, making it possible to group time series exhibiting similar dynamic patterns [102]. Similarly, in visual tasks, they can model the interactions between different regions within an image or across multiple images, leveraging spatial or semantic relationships. The ability to learn such structured representations makes graph-based approaches particularly effective in scenarios where data points are interrelated and collective patterns are essential for downstream tasks.
Figure 9. Schematic diagram of the workflow for graph-based clustering. The figure outlines key steps including graph construction, similarity computation, and partitioning, highlighting how clustering is performed through analysis of graph structures derived from the data. Image taken from [103].
GNNs have been widely applied to model inter-instance similarity in both time-series and visual data. For instance, in the context of time-series analysis, these networks can learn dependencies across different sequences, enabling the clustering of time series that display similar temporal dynamics [102]. In computer vision tasks, they are employed to capture relationships either among distinct regions within a single image or across multiple images in a scene. These models operate on graph-structured data, where nodes typically represent entities such as time series or image regions, and edges encode relationships or similarities between them. During training, the node representations are iteratively refined through a message-passing mechanism, in which each node aggregates feature information from its neighbors using learnable aggregation and update functions. Formally, at each layer l, the representation of a node v is updated as:
$$h_v^{(l)} = \sigma\left(f_{\text{agg}}\left(\left\{\, h_u^{(l-1)} : u \in \mathcal{N}(v) \,\right\},\; h_v^{(l-1)}\right)\right),$$
where $\mathcal{N}(v)$ denotes the neighbors of node $v$, $f_{\text{agg}}$ is an aggregation function (e.g., mean, sum, attention), and $\sigma$ is a non-linear activation function. By stacking multiple layers, nodes can capture multi-hop dependencies in the graph. This property makes them particularly effective in handling irregular or sparse data, making them well-suited for real-world time-series applications, such as sensor data clustering or video-based clustering. However, several challenges arise when applying GNNs to clustering. First, constructing an appropriate graph structure (i.e., defining nodes and edges) is often non-trivial and may require domain knowledge or heuristics; poorly constructed graphs can lead to suboptimal representation learning. Second, they may suffer from over-smoothing, where node representations become indistinguishable as more layers are added, which can degrade clustering performance. Third, scalability is a concern, especially for large-scale graphs, as message passing across many nodes and edges can be computationally expensive. Lastly, clustering typically requires unsupervised or self-supervised learning, but most GNN architectures are designed with supervised learning in mind, necessitating adaptations for effective unsupervised representation learning.
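The sketch below illustrates the overall pipeline under strong simplifying assumptions: a kNN graph (an assumed, generic construction) stands in for a domain-informed graph, two rounds of mean aggregation replace learned message passing, and the smoothed embeddings are clustered with k-means.

```python
# Sketch: kNN graph construction, mean-aggregation message passing, k-means.
# Real pipelines would use domain-informed graphs and trained aggregators.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))                  # hypothetical instance features
A = kneighbors_graph(X, n_neighbors=5).toarray()   # directed kNN adjacency
A = np.maximum(A, A.T) + np.eye(len(X))            # symmetrize, add self-loops
A = A / A.sum(axis=1, keepdims=True)               # row-normalized mean aggregation

H = X
for _ in range(2):                                 # two message-passing rounds
    H = A @ H                                      # each node averages its neighborhood
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(H)
```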

8.2. Generative Models for Clustering

Generative models are a class of models that learn the underlying distribution of the data and generate new samples that resemble the observed data [104]. These models have shown great promise in both time-series and vision data clustering by generating feature representations that can be effectively clustered [73]. In addition to traditional models such as Variational Autoencoders (VAEs) [73] and Generative Adversarial Networks (GANs) [104], recent advances in Generative AI-based methods have opened new frontiers in clustering tasks.

8.2.1. VAE-Based Clustering Approaches

VAEs are a class of generative models that learn probabilistic mappings between data and latent spaces [73]. By learning compact and continuous latent representations, VAEs enable effective clustering based on the learned features [105]. The latent space can sometimes provide interpretable representations that facilitate clustering into distinct groups [73]; however, this interpretability is not guaranteed, as latent variables are primarily optimized for reconstruction accuracy rather than for encoding human-understandable features. For instance, in vision tasks, VAEs are used to learn lower-dimensional embeddings from images or image patches, which are then clustered using traditional clustering techniques like k-means [106] or hierarchical clustering [107]. However, training VAEs can be challenging due to issues such as posterior collapse, where the learned latent variables fail to capture meaningful information. Additionally, VAEs require careful tuning of model architecture and hyperparameters to balance reconstruction accuracy and latent space regularization, which can affect clustering quality.
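A minimal sketch of the VAE-based pipeline is given below; the encoder is shown untrained for brevity, and in practice it would be optimized jointly with a decoder under the usual reconstruction-plus-KL objective before the latent means are clustered.

```python
# Sketch: a minimal VAE encoder whose latent means are clustered with k-means.
# Training (reconstruction + KL objective with a decoder) is omitted.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class VAEEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

encoder = VAEEncoder()                       # assume trained jointly with a decoder
x = torch.rand(256, 784)                     # hypothetical flattened image patches
with torch.no_grad():
    _, mu, _ = encoder(x)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(mu.numpy())
```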

8.2.2. GAN-Based Clustering Approaches

Generative Adversarial Networks (GANs) have been used to generate realistic images and learn high-quality feature representations [104]. These embeddings can be used in clustering tasks. A typical approach involves using the generator network of a GAN to extract features from the data, and the discriminator is used to differentiate between features belonging to the same class [104]. This allows for clustering in the latent space of the generator, where similar images or time-series sequences can be grouped. However, GANs are notoriously difficult to train due to issues such as mode collapse, instability in the adversarial training process, and sensitivity to hyperparameter choices. Additionally, the quality of the learned representations heavily depends on the balance between the generator and discriminator, which can affect the robustness and interpretability of the resulting clusters.

8.2.3. Vision-Language Model-Based Clustering

Recent advancements in Generative AI have introduced large pretrained models that bridge the gap between natural language processing (NLP) and computer vision (CV), enabling powerful feature extraction for both modalities. Some of the key models include:
CLIP (Contrastive Language-Image Pre-training): CLIP is a large vision-language model that learns representations by matching textual descriptions with images [108]. By embedding both text and images in a shared feature space, CLIP can be used for clustering images based on textual descriptions, a key advantage when the data is multimodal [108]. However, CLIP’s performance can be limited by the quality and diversity of its training data, which may introduce biases and affect generalization to domain-specific datasets. Additionally, the reliance on paired text-image data makes it less applicable in settings where such annotations are unavailable or noisy. The large model size also poses challenges for deployment in resource-constrained environments. A minimal sketch of clustering with frozen CLIP embeddings follows this list.
BLIP (Bootstrapping Language-Image Pre-training): BLIP, like CLIP, is a vision-language model, but it focuses on improving vision-language understanding by combining self-supervised learning with large-scale image-text pairs. In clustering, BLIP can be used to extract embeddings from images and then perform clustering based on these embeddings. However, despite its improvements, BLIP’s performance can still be affected by noisy or imbalanced training data, which may impact the quality of the learned representations. Additionally, the reliance on large-scale paired datasets presents challenges for applications in domains where such data is scarce or expensive to obtain. Computational demands for training and inference also limit its usability in resource-constrained environments.
DINO (Self-supervised Learning with Vision Transformers): DINO is a self-supervised learning framework that utilizes Vision Transformers (ViTs) to potentially learn meaningful representations of images without any labeled data [109]. DINO focuses on learning discriminative embeddings through self-supervised methods, where the model is trained to distinguish between augmented versions of the same image [109]. These embeddings are then clustered using traditional methods such as k-means or hierarchical clustering [107]. However, DINO’s effectiveness depends heavily on the quality of augmentations and the capacity of the ViT model, which can make training computationally intensive. Additionally, self-supervised representations may capture spurious correlations or be sensitive to dataset biases, potentially impacting the robustness of the clustering results.
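The sketch below illustrates clustering with frozen CLIP embeddings via the Hugging Face transformers API; the random stand-in images and the choice of checkpoint are purely illustrative.

```python
# Sketch: cluster images on frozen CLIP embeddings (transformers API).
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in images; in practice these would be loaded from a dataset.
images = [Image.fromarray((np.random.rand(224, 224, 3) * 255).astype("uint8"))
          for _ in range(8)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)            # image embeddings
feats = torch.nn.functional.normalize(feats, dim=-1)      # unit norm for cosine geometry
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats.numpy())
```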
The integration of these generative AI models into the clustering process has led to several innovations, particularly in multimodal data scenarios where the temporal and visual data can be processed jointly [108]. This synergy enables more robust and semantically meaningful cluster formations by leveraging complementary information from multiple data modalities, which traditional clustering methods often struggle to capture effectively.
Adaptation Trade-offs in Generative and Foundation Models. While generative and foundation models provide powerful pretrained embeddings, adapting these models for clustering tasks—especially in domain-specific settings—comes with trade-offs. Fine-tuning can improve performance by aligning representations with specific clustering objectives or data characteristics, but it also introduces risks of overfitting, particularly when training data is limited or lacks diversity [110,111]. In contrast, using frozen representations from these models ensures generalization but may yield suboptimal clustering in specialized tasks. To mitigate this, recent methods employ partial fine-tuning (e.g., tuning projection heads only), regularization strategies, or prompt-based adaptation to balance performance and generalization [112,113]. In generative models such as VAEs and GANs, over-adaptation can distort latent spaces critical for effective clustering, while under-adaptation may lead to generic, less discriminative embeddings. These trade-offs highlight the need for careful adaptation strategies, particularly in applications requiring domain transfer, interpretability, or robustness.

8.3. Joint Clustering Models

Joint clustering models aim to combine the feature learning process and clustering objective in a unified framework. These models not only learn feature representations but also directly optimize the clustering results, improving the overall clustering quality. Some of the key models include:

8.3.1. Deep Embedded Clustering (DEC) and Variants

DEC is a popular deep learning model for clustering that jointly learns feature representations and cluster assignments. It works by first learning a deep feature extractor, often an autoencoder, and then applying clustering loss to enforce cluster assignments in the latent space. The DEC model optimizes both reconstruction loss and clustering loss simultaneously, resulting in cluster assignments that are learned jointly with data representations [114]. DEC has been successfully applied to tasks such as image and text clustering [114], demonstrating its potential for learning compact and discriminative embeddings. However, it is important to note that such success does not imply that the model universally produces meaningful cluster assignments or preserves temporal structure across all data types. In practice, the quality of clustering in DEC depends heavily on the initial representation learned by the autoencoder and the balance between reconstruction and clustering objectives. Moreover, for complex temporal data, DEC may struggle to capture long-range dependencies or subtle temporal patterns, especially since it lacks explicit mechanisms for modeling sequential relationships.
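The core of DEC can be summarized in a few lines: a Student's t kernel produces soft assignments $q_{ij}$, a sharpened target distribution $p_{ij}$ emphasizes confident assignments, and the KL divergence between them is minimized. The numpy sketch below, with random embeddings and centroids standing in for the trained autoencoder and k-means initialization, shows these quantities.

```python
# Sketch: DEC's soft assignments and target distribution.
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    """q_ij: similarity of embedding i to centroid j under a Student's t kernel."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """p_ij: sharpened targets that emphasize high-confidence assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.standard_normal((100, 10))        # hypothetical latent embeddings
centroids = rng.standard_normal((5, 10))  # hypothetical cluster centers
q = soft_assignments(z, centroids)
p = target_distribution(q)
kl_loss = (p * np.log(p / q)).sum()       # minimized w.r.t. encoder and centroids
```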
Variants of DEC, such as Deep Clustering with Label Propagation (DCLP) and Deep Clustering with Stacked Autoencoders (DCSA), adapt DEC to improve its performance by incorporating label propagation techniques or stacking multiple autoencoders for deeper representation learning [115,116].

8.3.2. Time-Series DEC (T-DEC)

Time-series data often exhibit unique characteristics like temporal dependencies and varying patterns over time. T-DEC is an extension of DEC specifically designed for time-series clustering. It adapts the DEC framework to take into account the sequential nature of the data. T-DEC uses recurrent networks, such as LSTM or GRU, in the feature extraction phase to capture temporal patterns. The clustering step in T-DEC ensures that similar time-series sequences are grouped together, respecting the temporal structure of the data [117].

8.3.3. Vision Clustering with Self-Supervised Pretraining

Self-supervised pretraining has recently gained popularity in vision clustering. Models like SimCLR, MoCo, and BYOL are pre-trained on unlabeled data to learn meaningful image representations. These representations are then used for downstream tasks like clustering. In the context of clustering, self-supervised pretraining ensures that the learned features are invariant to transformations (e.g., rotation, scaling, translation), making them robust for clustering [26,118]. Once the model is pre-trained, traditional clustering algorithms such as k-means or hierarchical clustering are applied to the embeddings to group similar images.
For instance, SimCLR (a simple framework for contrastive learning of visual representations) learns feature representations by contrasting positive pairs (augmented views of the same image) against negative pairs (views of different images) [26].
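A compact sketch of SimCLR's NT-Xent loss is given below, assuming the batch stacks two augmented views of each image so that rows i and i + N form a positive pair.

```python
# Sketch: the NT-Xent contrastive loss used by SimCLR.
import torch
import torch.nn.functional as F

def nt_xent(z, temperature=0.5):
    z = F.normalize(z, dim=1)                    # cosine similarity space
    sim = z @ z.T / temperature                  # (2N, 2N) similarity matrix
    n = z.shape[0] // 2
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    targets = (torch.arange(2 * n) + n) % (2 * n)  # positive pair indices
    return F.cross_entropy(sim, targets)

z = torch.randn(64, 128)                         # 32 images x 2 augmented views
loss = nt_xent(z)
```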

9. Hybrid and Ensemble Approaches

Hybrid and ensemble approaches have gained popularity in clustering tasks, particularly for complex, high-dimensional data such as time-series and visual data. These methods blend different clustering algorithms, feature extraction techniques, and distance metrics to leverage their complementary strengths, helping to overcome the limitations of individual algorithms such as sensitivity to noise, high dimensionality, and varying data distributions, and improving overall clustering performance and robustness [119,120,121]. Rather than relying on a single algorithm's perspective, hybrid techniques explore multiple facets of the data (geometric structure, statistical properties, learned representations), while ensemble methods exploit the diversity of multiple algorithms to reduce sensitivity to noise and outliers and to increase stability across different data characteristics. Furthermore, integrating deep feature extraction with traditional clustering algorithms addresses a fundamental challenge: raw data often obscures meaningful patterns that become apparent only in carefully learned representation spaces. This synergy between representation learning and clustering has proven particularly effective for complex, high-dimensional datasets where neither approach alone suffices.
Hybrid clustering approaches can be systematically categorized based on the components they combine and the specific challenges they aim to address. Broadly, hybridization in clustering falls into the following categories:
  • Feature–Distance Hybrids: Combine learned or engineered feature extraction (e.g., CNNs, autoencoders) with advanced or domain-specific distance metrics (e.g., DTW, EMD) to improve similarity measurement.
  • Representation–Clustering Hybrids: Apply dimensionality reduction or representation learning followed by clustering algorithms like k-means or DBSCAN to enhance scalability and interpretability.
  • Algorithm–Algorithm Ensembles: Fuse results from multiple clustering algorithms to achieve consensus and improve robustness to noise, outliers, or parameter sensitivity.
  • Optimization–Model Hybrids: Jointly train clustering objectives alongside deep learning models, integrating clustering loss into the training process.
The rationale behind these combinations is rooted in addressing limitations of individual methods—such as high dimensionality, sensitivity to noise, inability to model non-linear patterns, or poor initialization. Table 1 summarizes key hybrid categories, their goals, and representative works.

9.1. Feature + Distance Hybrid Approach

One of the primary motivations behind hybrid models is the combination of feature extraction techniques with advanced distance measures. These hybrid models aim to capture both the high-level features of the data and the underlying relationships between data points. A common strategy is to apply a feature extractor (e.g., CNNs) to learn meaningful representations of the data, followed by a distance-based clustering algorithm that leverages these features. Notable examples of feature + distance hybrids include:
CNN embedding + Dynamic Time Warping (DTW): In time-series clustering, a CNN-based feature extractor can be used to learn high-level representations from raw time-series data. Once the feature vectors are obtained from the CNN, DTW, which is particularly effective for non-linearly aligned time-series, can be applied as the distance metric to group similar sequences. This hybrid approach is particularly useful in situations where time-series data exhibits temporal shifts or varying lengths, as DTW handles misaligned time sequences, while CNNs learn high-level temporal features that are robust to noise [10,122]. A plain DTW sketch is provided at the end of this subsection.
CNN embedding + Euclidean Distance: In visual data clustering, CNNs are commonly used for feature extraction, as they excel in extracting spatial hierarchies from images. After applying CNNs for feature learning, the Euclidean distance can be used to measure the similarity between the embeddings. This combination is effective for clustering tasks where images need to be grouped based on visual content, such as object recognition or face clustering, where pixel-level similarity (e.g., Euclidean distance) might not capture all the necessary details, but CNN-generated embeddings provide a more meaningful representation [125,126].
Feature extraction + Earth Mover’s Distance (EMD): Earth Mover’s Distance is a measure of the difference between two probability distributions over a region [45]. When applied to visual or time-series data, it provides a robust metric for comparing distributions in a feature space. Combining feature extraction techniques such as autoencoders or CNNs with EMD allows for more nuanced clustering by considering the “movement” required to transform one distribution into another. This hybrid approach can be applied to both vision and time-series data where spatial or temporal distributions are important, like in document clustering (text as a time-series) or video summarization (grouping visually similar frames) [123].
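The sketch below implements the plain dynamic-programming DTW recurrence and feeds the resulting pairwise distances into hierarchical clustering; in the hybrids above, the inputs would be CNN or autoencoder feature sequences rather than the random, variable-length series used here.

```python
# Sketch: DTW distance + hierarchical clustering on the pairwise distance matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(nm) dynamic-programming DTW distance between two 1D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
series = [rng.standard_normal(rng.integers(40, 60)) for _ in range(12)]  # varying lengths
n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])
labels = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
```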

9.2. Autoencoder + k-Means Joint Training

Another widely used hybrid model involves combining autoencoders with traditional clustering algorithms like k-means. The idea is to use the autoencoder to reduce the dimensionality of the data and learn a compact representation of the input, which is then passed to a k-means algorithm for clustering.
Autoencoder + k-means: In this hybrid approach, an autoencoder is first trained to learn a low-dimensional latent space for the input data. The autoencoder reduces the complexity of the data while preserving important features for clustering. Once the latent representation is learned, k-means is applied to the compressed feature space to perform clustering. This approach works well when dealing with high-dimensional data such as images, videos, or multi-dimensional time-series, as the autoencoder can reduce the data to a manageable size and uncover hidden patterns that are then clustered by k-means. This method has been shown to be effective in scenarios like image clustering (where CNN-based autoencoders are used) or sensor data clustering (with time-series data) [114,124].
Deep autoencoder + k-means: In some cases, deeper or stacked autoencoders (which consist of multiple encoder-decoder layers) are used to learn even more abstract feature representations of the data before clustering. These deep autoencoders can capture complex data structures that simpler models may miss, allowing the k-means clustering step to operate on more sophisticated feature representations. This approach is particularly effective for datasets with intricate patterns that require deeper models for accurate representation [105,130].
Variational Autoencoder + k-means: For probabilistic clustering, a variational autoencoder (VAE) can be combined with k-means. VAEs learn a latent representation that models the underlying distribution of the data in a probabilistic manner. When used in conjunction with k-means, the learned latent space from the VAE is often more robust to noise and missing data, which is especially important in real-world scenarios where the data is often imperfect. This hybrid method has been successfully applied in medical imaging and other domains where uncertainty and variability play significant roles [73,129].

9.3. Consensus Clustering from Multiple Algorithms

Consensus clustering refers to a technique that combines the results of multiple clustering algorithms to generate a final clustering solution. The intuition behind consensus clustering is that different algorithms may offer complementary insights into the data, and combining them can produce a more robust clustering outcome.
Combining multiple clustering algorithms: One approach to consensus clustering involves running several clustering algorithms independently (e.g., k-means, DBSCAN, spectral clustering) and then combining their results through a majority voting mechanism or consensus function. For instance, the most common cluster assignment for each data point across all algorithms is selected. This can help mitigate the risks of a single algorithm being biased or sensitive to noise or outliers. This method is especially beneficial when working with complex data that exhibits non-linearities or varying structures, as different algorithms may capture different facets of the data’s internal structure [127,131]. A co-association sketch of this idea is given at the end of this subsection.
Ensemble clustering using graph-based methods: In some consensus clustering frameworks, each clustering result is represented as a graph, where nodes are data points, and edges represent the similarity between points. These graphs can be combined using methods like graph fusion, where the consensus graph is formed by merging the edges from each clustering result. The final clusters can then be derived from the consensus graph using graph partitioning techniques. This method works well for complex datasets where relationships between data points are intricate, and different clustering methods may emphasize different relationships [128,132].
Co-training and co-ensemble methods: Another ensemble-based approach involves co-training, where different models are trained on different subsets of features or data instances, and then the clustering results from these models are aggregated. This strategy reduces the risk of overfitting and allows for a more comprehensive understanding of the data. The co-ensemble method further extends this idea by combining several models with complementary assumptions (e.g., feature-based and distance-based models) to generate a more robust clustering outcome [133].
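Returning to the majority-voting idea above, the sketch below builds a co-association matrix from several k-means runs with varying k and seeds, then re-clusters the averaged co-occurrence pattern; data and run settings are illustrative.

```python
# Sketch: consensus clustering via a co-association matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
n = len(X)
coassoc = np.zeros((n, n))
runs = [(k, seed) for k in (2, 3, 4) for seed in range(5)]
for k, seed in runs:
    labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
    coassoc += (labels[:, None] == labels[None, :])      # co-occurrence votes
coassoc /= len(runs)

dist = 1.0 - coassoc                          # frequent co-occurrence -> small distance
np.fill_diagonal(dist, 0.0)
final = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
```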

10. Semi-Supervised and Active Clustering

While unsupervised clustering has been highly successful, there are many scenarios in which some form of supervision can significantly improve clustering results. Semi-supervised and active clustering techniques aim to incorporate limited supervision, either in the form of labeled data points or pairwise constraints, to guide the clustering process [134,135]. These approaches allow models to perform clustering with a minimal amount of labeled data, reducing the reliance on vast amounts of fully labeled data, which is often expensive or difficult to obtain.
In the context of both time-series and visual data, semi-supervised and active clustering can help refine cluster assignments, improve the quality of results, and address challenges such as ambiguity and noise [136,137]. These approaches have gained attention in domains like medical imaging, autonomous vehicles, and social networks, where labeled data may be scarce but expert knowledge is available in the form of domain-specific annotations [138,139]. Semi-supervised and active clustering methods bridge the gap between fully unsupervised and supervised learning by strategically incorporating limited supervision to improve clustering quality. These approaches offer three key advantages. First, the incorporation of pairwise constraints, domain-specific knowledge, and human feedback significantly enhances clustering results [140], particularly for noisy, high-dimensional, or ambiguous data where purely unsupervised methods struggle [135,141]. Second, by leveraging small amounts of labeled data or constraints, these methods achieve strong performance while dramatically reducing the labeling cost compared to fully supervised approaches—crucial in domains where annotation is expensive or time-consuming [138,141]. Third, semi-supervised and active clustering techniques provide greater adaptability to complex datasets that defy traditional unsupervised assumptions, enabling algorithms to incorporate expert knowledge and domain-specific characteristics to produce more meaningful and interpretable clusters [135,136,139].

10.1. Must-Link/Cannot-Link Constraints

One of the fundamental approaches in semi-supervised clustering is to incorporate pairwise constraints that indicate whether two data points must belong to the same cluster (must-link) or must belong to different clusters (cannot-link) [135]. These constraints provide valuable supervision without requiring fully labeled data.
Must-link constraints: A must-link constraint specifies that two data points should be placed in the same cluster. This could arise from prior knowledge about the data, such as knowing that two ECG signals represent the same type of cardiac event, or that two images depict the same object [138]. For example, in medical applications, must-link constraints can be generated when a radiologist labels certain slices of MRI scans as depicting the same anatomical region. When incorporated into a clustering algorithm, these constraints can help guide the model towards more meaningful and coherent clusters [138].
Cannot-link constraints: In contrast, a cannot-link constraint specifies that two data points must belong to different clusters. These constraints are useful when there is prior knowledge indicating that certain data points should not be grouped together. For instance, in the context of facial recognition, a cannot-link constraint could be applied between images of two different individuals. In time-series clustering, a cannot-link constraint might arise when there are known differences between two sensor readings that belong to different devices [135]. These constraints serve to refine the clustering process by explicitly preventing the algorithm from grouping points that are inherently dissimilar [135].
Pairwise constraints can be incorporated into various clustering algorithms, such as k-means, hierarchical clustering, or spectral clustering [135,142]. A popular approach for incorporating pairwise constraints is to modify the objective function of the clustering algorithm, so that it minimizes the clustering cost while satisfying the must-link and cannot-link constraints [135].
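The sketch below shows a single COP-k-means-style assignment pass under must-link and cannot-link constraints; it is a simplified illustration, since a full algorithm would alternate such assignments with centroid updates, and points may remain unassigned if no feasible cluster exists.

```python
# Sketch: a simplified constrained assignment step. Each point takes the
# nearest centroid that does not violate must-link (ML) or cannot-link (CL)
# constraints against points assigned so far.
import numpy as np

def violates(i, c, assign, ML, CL):
    for a, b in ML:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] not in (-1, c):      # must-link partner is elsewhere
                return True
    for a, b in CL:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] == c:                # cannot-link partner is here
                return True
    return False

def constrained_assign(X, centroids, ML, CL):
    assign = np.full(len(X), -1)
    for i in range(len(X)):
        order = np.argsort(((X[i] - centroids) ** 2).sum(axis=1))
        for c in order:                       # nearest feasible centroid first
            if not violates(i, int(c), assign, ML, CL):
                assign[i] = c
                break
    return assign

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
centroids = X[rng.choice(len(X), 3, replace=False)]
labels = constrained_assign(X, centroids, ML=[(0, 1)], CL=[(0, 2)])
```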

10.2. Interactive Clustering with Human-in-the-Loop

Active clustering approaches go a step further by involving human feedback during the clustering process [141,143]. This feedback loop can significantly enhance the quality of clustering by incorporating domain expertise or resolving ambiguities in the data.
Human-in-the-loop (HITL) clustering: HITL clustering refers to systems where a human expert provides feedback to the model during the clustering process. The expert can label data points, adjust cluster boundaries, or even provide pairwise constraints, which the model uses to refine its clustering assignments [141,143]. In time-series data, this may involve annotating specific periods in sensor logs as corresponding to meaningful events. In vision tasks, experts might manually adjust the clustering of regions or objects within an image or video.
A key emerging trend in HITL systems is the integration of human-guided, lossless multidimensional data visualization with machine learning algorithms (both supervised and unsupervised) [144]. Recent empirical studies further demonstrate that the choice of dimensionality reduction (DR) technique significantly impacts human performance in visual cluster analysis tasks [140]. For instance, UMAP and t-SNE were shown to outperform other methods in tasks like cluster and membership identification, while linear techniques such as NMF and t-SNLE performed better in density and distance comparisons, respectively. Traditional methods for visualizing high-dimensional data—such as PCA or t-SNE, as shown in Figure 10—are inherently lossy, often distorting local or global structures and leading to misinterpretations. This is particularly problematic in HITL settings, where humans rely on these projections to assess and refine clustering decisions. Without accurate visual access to high-dimensional structure, expert feedback may be based on misleading cues. Recent approaches aim to address this challenge by developing visualization techniques that better preserve high-dimensional information, or by tightly coupling visualization with model learning so that human feedback remains meaningful and effective.
Active learning in clustering: Active learning techniques can be applied to clustering tasks to make the clustering process more efficient [141,145]. The idea is to select the most informative data points for human labeling, thereby reducing the number of labeled data points required to achieve high-quality clustering. For example, an active learning algorithm may choose data points that are located near the boundaries between clusters (uncertain points), where the model would benefit the most from additional supervision. Once labeled, the model refines its clusters. This is especially useful in applications like medical imaging, where expert labeling is costly and time-consuming [141].
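A minimal sketch of this boundary-point selection is shown below: after an initial k-means fit, the points whose two nearest centroids are almost equidistant are flagged as the most informative queries for expert labeling.

```python
# Sketch: margin-based query selection for active clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

dists = np.sort(km.transform(X), axis=1)     # distances to centroids, ascending
margin = dists[:, 1] - dists[:, 0]           # small margin = ambiguous boundary point
query_idx = np.argsort(margin)[:10]          # 10 most uncertain points to label
```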
Active clustering frameworks combine elements of machine learning, human expertise, and interactivity to improve clustering performance, allowing the model to evolve based on user input [141,143]. This approach is particularly beneficial in domains like medical diagnostics, autonomous vehicles, or any domain where a high level of domain-specific knowledge is needed to guide the model [143].

10.3. Domain-Specific Priors

In certain fields, domain-specific priors play a crucial role in improving clustering performance by embedding expert knowledge into the model. These priors are especially beneficial when data exhibits known patterns or structures that standard algorithms might miss [146,147].
Anatomical labels in medical imaging: In medical imaging, for instance, anatomical priors can guide the clustering process. In brain scans, known structures such as the hippocampus or amygdala may be used to constrain clustering so that it aligns with meaningful biological regions [138]. Similarly, in time-series data from medical devices (e.g., ECGs), prior knowledge of typical heart rhythms or known abnormalities can improve the interpretability and clinical relevance of clusters [135].
A recent research direction integrates domain-specific priors with human-in-the-loop frameworks and improved data visualizations to form a more holistic clustering pipeline. In this paradigm, experts not only inject prior knowledge into the model but also interact with the clustering process via more faithful representations of high-dimensional data. This co-learning approach can improve clustering quality, interpretability, and trust, particularly in complex domains such as healthcare or scientific research.
Semantic priors in visual data: In computer vision, domain-specific priors can be used to guide clustering in applications like object recognition or scene segmentation [146]. For instance, in satellite image clustering, prior knowledge of geographical regions and their typical features (forests, urban areas, water bodies) can be integrated to guide the clustering process [146]. Similarly, semantic priors can help in the clustering of images in social media applications, where the types of content (e.g., portraits, landscapes) are well understood [147].
Task-specific priors in time-series data: In time-series analysis, domain knowledge about the underlying processes generating the data can be used to inform clustering [135]. For instance, in stock market data, it is common to assume that certain patterns, like trends or seasonality, govern the data [135]. These assumptions can guide the clustering process, enabling the algorithm to group time-series based on these expected temporal behaviors rather than raw statistical properties alone [139].
By incorporating domain-specific priors, semi-supervised and active clustering models can leverage expert knowledge to achieve more accurate and relevant clusters [146]. This is particularly valuable in fields where the data is complex and the underlying patterns are not immediately apparent from the raw data alone [147].

11. Scalability and Real-Time Clustering

The increasing availability of large-scale datasets in both time-series and visual data domains has posed significant challenges for clustering algorithms, particularly in terms of computational efficiency and scalability. In many real-world applications, datasets can be extremely large, with millions of samples, and may need to be processed in real-time or near real-time. As such, there has been growing interest in developing scalable and real-time clustering methods that can handle the size and dynamic nature of modern data streams.
In this section, we explore recent advancements and techniques designed to enhance the scalability and real-time processing capabilities of clustering algorithms.

11.1. Approximate Nearest Neighbors (ANN)

One of the core challenges when clustering large datasets is efficiently measuring similarity between data points, which often requires computing distances between all pairs of points. For high-dimensional data (common in time-series and vision), this can quickly become prohibitively expensive in terms of both time and memory. Approximate Nearest Neighbors (ANN) techniques address this challenge by providing fast, approximate solutions for nearest neighbor search, significantly improving clustering performance in large datasets [148].
Annoy (Approximate Nearest Neighbors Oh Yeah): Annoy is a popular library for fast approximate nearest neighbor search, specifically optimized for large-scale datasets [149]. It builds a forest of random-projection trees that partition the data points so that nearest-neighbor queries can be answered efficiently. Annoy is particularly useful when dealing with high-dimensional feature spaces, such as embeddings from deep neural networks, which are commonly used in vision data clustering.
FAISS (Facebook AI Similarity Search): FAISS is a highly optimized library developed by Facebook for similarity search and clustering of large-scale datasets [150]. FAISS supports both exact and approximate nearest neighbor search, and it leverages efficient algorithms for indexing and searching, particularly for high-dimensional data. FAISS supports GPU acceleration, which further improves the scalability of the algorithm. It is widely used in clustering tasks involving large image datasets, text embeddings, or time-series data where fast search operations are crucial. A brief usage sketch follows this list.
Other ANN Methods: Other ANN methods such as HNSW (Hierarchical Navigable Small World) [151], LSH (Locality Sensitive Hashing) [152], and KD-Trees [153] are also commonly used for speeding up nearest neighbor search. These techniques balance between search accuracy and computation time, providing scalable solutions that allow clustering algorithms to process large datasets in a fraction of the time it would take using exact methods.
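The sketch below shows a typical FAISS workflow on synthetic vectors, combining an exact flat index for nearest-neighbor search with FAISS's built-in k-means; index type and parameters are illustrative, and the faiss package is assumed to be installed.

```python
# Sketch: nearest-neighbor search and k-means with FAISS on synthetic vectors.
import numpy as np
import faiss

d = 128
xb = np.random.rand(100_000, d).astype("float32")   # database vectors

index = faiss.IndexFlatL2(d)                        # exact L2 index
index.add(xb)
D, I = index.search(xb[:5], 10)                     # 10 nearest neighbors each

kmeans = faiss.Kmeans(d, 50, niter=20, verbose=False)
kmeans.train(xb)
_, assignments = kmeans.index.search(xb, 1)         # nearest centroid per vector
```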
By leveraging ANN methods, clustering algorithms can handle high-dimensional, large-scale datasets efficiently without the need for exhaustive pairwise distance computations. These methods have seen widespread adoption in domains like image retrieval, time-series anomaly detection [154], and even social network analysis, where speed is a key requirement.

11.2. Stream Clustering Algorithms

For real-time or continuous data, traditional clustering methods may not be suitable, as they require processing the entire dataset at once. Stream clustering algorithms are specifically designed to handle data that arrives in a continuous stream, allowing clusters to be updated incrementally as new data points arrive. This capability is critical in applications such as real-time monitoring, sensor networks, and online recommender systems.
DenStream: DenStream is a density-based clustering algorithm designed for data streams [155]. It uses a concept called “micro-clusters” to represent data in a compact form, allowing the algorithm to adapt to changes in the data over time. DenStream can efficiently track evolving patterns in time-series data or dynamic visual data streams, making it useful for applications like monitoring financial data or real-time object detection in video streams.
CluStream: CluStream is another stream clustering algorithm that combines both online and offline clustering phases [156]. It first maintains an online summary of the incoming data and periodically performs offline clustering to refine the clusters. This approach is highly effective for scenarios where real-time processing is needed, but periodic updates to the clustering structure are still required. CluStream has been used in various applications, such as in sensor networks and for clustering web traffic.

11.3. GPU-Accelerated Frameworks

As the complexity of clustering algorithms increases, so does the demand for computational resources. Graphics Processing Units (GPUs) have become a popular solution for accelerating clustering algorithms due to their massive parallel processing capabilities. GPU-accelerated frameworks allow clustering algorithms to process large datasets much faster than traditional CPU-based methods, making them a natural fit for real-time clustering tasks [157,158].
CUDA-Accelerated Clustering: Many clustering algorithms, such as k-means and DBSCAN, have been parallelized for execution on GPUs using CUDA (Compute Unified Device Architecture). CUDA enables efficient parallelization of computationally expensive steps like distance calculations and centroid updates. This acceleration allows clustering tasks that take hours on CPUs to be completed in a fraction of the time [159,160]. A minimal GPU k-means sketch is provided after this list.
Deep Learning with GPU Acceleration: In deep learning-based clustering, large models such as convolutional neural networks (CNNs), autoencoders, and recurrent networks (LSTMs, GRUs) are trained and applied using GPU acceleration. Frameworks like TensorFlow, PyTorch, and MXNet leverage GPUs to significantly reduce training and inference times for clustering high-dimensional visual and sequential data [161,162].
Graph-Based Clustering on GPUs: Graph-based clustering algorithms, including spectral clustering and community detection, benefit from GPU acceleration as graph operations (e.g., adjacency matrix computation, eigenvalue decomposition) are resource-intensive. Libraries such as cuGraph and PyTorch Geometric facilitate parallel graph processing to speed up clustering in graph-structured data [163,164].
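As a concrete example of the CUDA-accelerated pattern above, the sketch below writes Lloyd's iterations directly in PyTorch tensors, so the distance computations and centroid updates run on the GPU when one is available.

```python
# Sketch: a few Lloyd's iterations of k-means in PyTorch; the heavy distance
# and update steps execute on the GPU if torch.cuda.is_available().
import torch

def kmeans_torch(X, k=8, iters=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    X = X.to(device)
    centroids = X[torch.randperm(len(X), device=device)[:k]]
    for _ in range(iters):
        assign = torch.cdist(X, centroids).argmin(dim=1)   # parallel distances
        for c in range(k):                                 # centroid updates
            mask = assign == c
            if mask.any():
                centroids[c] = X[mask].mean(dim=0)
    return assign.cpu(), centroids.cpu()

labels, centers = kmeans_torch(torch.randn(100_000, 64))
```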
GPU-accelerated frameworks are essential for scaling clustering algorithms to handle large datasets with high-dimensional features and for achieving real-time processing in both time-series and visual data applications.

11.4. Applications of Scalable and Real-Time Clustering

Real-time clustering has become increasingly important across a wide range of application domains. In autonomous vehicles, clustering sensor data such as LiDAR, radar, and camera feeds enables the segmentation of environments, object detection, and navigation in dynamic traffic conditions. In healthcare, wearable devices and medical monitoring systems rely on clustering to detect early signs of health anomalies, such as irregularities in ECG signals or abnormal patterns in patient vitals. The finance sector utilizes stream clustering algorithms to identify emerging trends and anomalies in time-series market data, aiding in timely investment decisions. In video surveillance, clustering of object and activity features supports tasks like anomaly detection, people tracking, and behavior recognition in real-time video streams. IoT networks also benefit from clustering by enabling efficient organization and monitoring of large-scale sensor deployments, useful in smart cities, environmental monitoring, and industrial applications. Lastly, in manufacturing, clustering plays a crucial role in monitoring equipment, predicting failures [165], optimizing maintenance, and improving product quality by identifying patterns in sensor data collected from machines and production processes.
Scalable and real-time clustering techniques are therefore vital in enabling the analysis and decision-making in various time-sensitive and large-scale applications across multiple domains.

12. Explainability in Clustering

Explainability has become an essential consideration in modern machine learning, including unsupervised tasks like clustering. Unlike supervised learning, clustering lacks ground truth labels, making it inherently difficult to understand or justify why certain instances are grouped together. This opacity is problematic in high-stakes domains such as healthcare, finance, and scientific research, where model decisions must be interpretable and trustworthy.
Visualization is increasingly being integrated into the core of AI/ML modeling to support model discovery, evaluation, and refinement—shifting from a post-hoc analysis tool to an active part of the modeling process. Advanced visual methods like BC-DT and SPC-DT enhance the interpretability of decision tree models by revealing attribute relationships, data flow, and threshold tightness, which is valuable for understanding and refining clustering structures derived from decision-tree-based methods [144].
The demand for explainable clustering is further heightened in human-in-the-loop (HITL) systems, where experts interact with clustering models by labeling points, adjusting boundaries, or providing constraints. Without interpretable feedback from the model, human guidance can be misaligned or ineffective. Similarly, the integration of domain-specific priors requires the clustering model’s reasoning to be accessible to domain experts to ensure that it respects relevant structures or semantics. To address these challenges, several approaches have been developed to improve the interpretability of clustering results:
  • Prototype-based Explanations: These methods identify representative samples (or prototypes) within each cluster to summarize the core characteristics of the group. Techniques like Deep k-Means and ProtoDash provide a set of prototypical examples that help humans understand what defines each cluster.
  • Feature Importance and Attribution: Inspired by explainable AI (XAI) techniques, these methods estimate which features contributed most to a data point’s assignment to a particular cluster. This helps users interpret clusters in terms of the underlying feature space.
  • Interactive and Visual Explanations: Visualization tools allow users to explore clusters via 2D/3D projections, metadata overlays, and interactive controls. While methods like t-SNE and PCA are commonly used, they are lossy and can distort high-dimensional relationships. Recent work focuses on preserving more structure to provide faithful visual explanations, especially in HITL settings.
  • Contrastive and Counterfactual Explanations: These approaches provide explanations based on comparisons—for example, why a point belongs to one cluster and not another, or how it would need to change to switch clusters. Such methods enhance transparency and are particularly useful in applications requiring actionable insights.
  • Model-integrated Interpretability: Some clustering models—especially in deep learning or graph-based settings—are now being designed with interpretability in mind. This includes attention mechanisms in Graph Neural Networks (GNNs), disentangled representations, or interpretability constraints embedded in the learning objective.
Together, these methods represent a growing movement toward making clustering models more transparent, trustworthy, and usable in practice. However, significant challenges remain in ensuring that explanations are faithful, scalable, and aligned with human understanding—particularly in high-dimensional, noisy, or domain-specific data.

13. Evaluation Metrics in Clustering

Evaluating clustering performance is a fundamental challenge due to the absence of ground truth labels in most unsupervised settings. As shown in Table 2, when ground truth is available, external metrics such as Adjusted Rand Index (ARI) [166], which measures the similarity between predicted and true cluster assignments while adjusting for chance, Normalized Mutual Information (NMI) [167], which quantifies the mutual dependence between clustering and true labels, and Fowlkes–Mallows Index (FMI) [168], which balances precision and recall in pairwise clustering decisions, are commonly used. Homogeneity score [169] ensures each cluster contains members of a single class, completeness score [169] checks that all samples of a class fall into the same cluster, and V-measure [170] combines both using the harmonic mean. In fully unsupervised scenarios, internal metrics like Silhouette Score [171], which compares intra- vs. inter-cluster distances, Davies–Bouldin Index [172], which evaluates average similarity between clusters, and Calinski–Harabasz Score [173], which assesses the ratio of between- and within-cluster dispersion, are used to assess compactness and separation of clusters based on intrinsic data properties. However, these metrics may not always align with human intuition or domain-specific notions of meaningful clustering. As a result, recent research has emphasized the need for task-aware, domain-specific, and human-in-the-loop evaluation strategies that incorporate expert feedback or interpretability into the assessment process. The choice of evaluation metric can significantly influence the design and optimization of clustering algorithms, especially in high-dimensional or noisy data contexts.
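For reference, the sketch below computes the external and internal metrics above with scikit-learn on synthetic data and k-means predictions.

```python
# Sketch: external and internal clustering metrics via scikit-learn.
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# External metrics (require ground truth)
print(metrics.adjusted_rand_score(y_true, y_pred))
print(metrics.normalized_mutual_info_score(y_true, y_pred))
print(metrics.fowlkes_mallows_score(y_true, y_pred))
print(metrics.homogeneity_completeness_v_measure(y_true, y_pred))

# Internal metrics (ground truth not required)
print(metrics.silhouette_score(X, y_pred))
print(metrics.davies_bouldin_score(X, y_pred))
print(metrics.calinski_harabasz_score(X, y_pred))
```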

14. Comparative Analysis of Clustering Methods

To provide a clearer comparison of existing clustering techniques, we summarize key methods in Table 3 based on their applicable scenarios, computational complexity, and representative performance on widely used benchmark datasets such as CIFAR-10 and the UCI time-series library. This comparison highlights the diversity of clustering algorithms—from classical methods like K-Means and GMMs to more recent approaches such as deep embedded clustering and structured doubly stochastic graph-based clustering. Each method exhibits strengths suited to specific data modalities and structures; for instance, HMMs are more effective for time-series data due to their sequential modeling capability, while graph-based and multiview methods are better suited for multimodal or relational data. The table also illustrates trade-offs in terms of computational cost and accuracy, helping readers assess the practical considerations and performance profiles of each approach in real-world applications.

15. Future Directions

The field of clustering time-series and vision data is evolving rapidly, with several promising directions for future research and innovation. Key areas for further exploration include:
Multimodal Clustering: As datasets increasingly consist of multiple types of data (e.g., video + text, sensor data + images), developing clustering techniques that can effectively handle multimodal data is crucial. The challenge lies in aligning features from different domains (e.g., combining visual and temporal information) while preserving the inherent structure of each modality. This requires novel architectures that can leverage shared and modality-specific information to perform joint clustering tasks. Another key difficulty lies in the alignment of heterogeneous data types, particularly the mismatch of spatio-temporal scales between modalities. For instance, visual features extracted from high-frame-rate video streams may not align temporally with lower-frequency time-series data from sensors, complicating the construction of a unified feature space [175]. A potential direction to address this challenge is the development of cross-modal synchronization frameworks that incorporate temporal interpolation, attention mechanisms, or learnable alignment layers to dynamically match and weight features across modalities during training.
Explainability: One of the key challenges in clustering, particularly in deep learning-based approaches, is the limited interpretability of the resulting models. Many of these methods, especially those involving complex neural architectures such as autoencoders or contrastive learning frameworks, are often perceived as “black boxes.” There is a growing need for clustering models that not only deliver high performance but also offer insights into how and why specific groupings are formed. Recent efforts in explainable AI (XAI) have introduced techniques such as attention mechanisms [176], saliency maps [177], and concept attribution methods such as Grad-CAM [178] and TCAV [179], which improve model transparency in supervised learning tasks; these tools are increasingly being adapted for unsupervised settings, including clustering. For example, in time-series data, attention-based clustering can highlight which temporal segments are most influential in cluster assignment, while in vision data, visualization techniques can reveal key spatial features that drive cluster formation [180]. Explainable clustering methods enable greater transparency, facilitate trust in model outputs, and support the identification of biases or errors in the clustering process—particularly critical in sensitive domains such as healthcare, finance, or autonomous systems.
Cross-Domain Transfer: Another exciting future direction is the transfer of knowledge between different domains. For example, models trained on vision tasks might be applied to time-series data, or vice versa. This cross-domain transfer can help reduce the need for extensive labeled data in new domains, accelerating the development of clustering systems across a variety of applications. Transfer learning techniques and domain adaptation methods will likely play a central role in this area.
Foundation Models: The rise of large pretrained models such as CLIP for vision and TimeGPT [181] for time-series data has opened up new avenues for clustering. These foundation models, trained on vast amounts of data, provide powerful feature embeddings that can be directly used for clustering tasks. Leveraging these models allows for clustering in high-dimensional, complex data spaces without the need for custom feature engineering. However, fine-tuning foundation models for few-shot clustering presents unique obstacles. These include preventing overfitting on limited data, preserving the generalization capabilities of pretrained models, and developing task-specific adaptation strategies that can exploit the latent structure of the data with minimal supervision. Addressing these technical gaps is essential for advancing practical implementations of clustering in real-world multimodal systems. Future research will explore how to fine-tune these models or use them as fixed feature extractors to improve clustering outcomes, especially in real-time applications.
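For the fixed-feature-extractor route, a minimal sketch using CLIP image embeddings followed by k-means might look as follows. The model identifier and preprocessing reflect common Hugging Face transformers conventions and should be verified against installed versions; the random images stand in for a real unlabeled dataset.

```python
# Illustrative sketch only: a frozen foundation model (CLIP) as a feature
# extractor, with k-means run on the resulting embeddings.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import KMeans

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

rng = np.random.default_rng(0)            # dummy images stand in for real data
images = [Image.fromarray(rng.integers(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)      # (n_images, 512)
feats = feats / feats.norm(dim=-1, keepdim=True)    # unit-normalize embeddings

print(KMeans(n_clusters=2, n_init=10).fit_predict(feats.numpy()))
```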
Scalability and Real-Time Clustering Considerations: While many of the clustering techniques surveyed offer high accuracy and flexibility, their scalability to large datasets and applicability in real-time scenarios remain critical concerns, especially for time-series and vision applications that generate data continuously. To address this, several approximate and streaming-based clustering methods have been proposed. For instance, techniques such as CluStream, DenStream, and StreamKM++ are designed for evolving data streams and offer fast, incremental updates. In the deep learning space, recent works have explored online variants of clustering with mini-batch training and memory-efficient architectures [155,182,183,184]. These approaches strike a balance between computational efficiency and clustering quality, making them well-suited for real-time monitoring, anomaly detection, or video stream analysis. Incorporating or adapting such methods into the broader taxonomy of clustering approaches remains a promising direction for future research.
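As a small illustration of the mini-batch strategy (rather than of CluStream, DenStream, or StreamKM++ themselves), the sketch below updates centroids incrementally with scikit-learn's MiniBatchKMeans as batches arrive; the stream generator is a stand-in for real sensor or video-feature batches.

```python
# Illustrative sketch only: incremental clustering over a (synthetic) stream.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def feature_stream(n_batches=50, batch_size=64, dim=16, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        center = rng.choice([-2.0, 2.0])   # two alternating sources
        yield center + rng.standard_normal((batch_size, dim))

mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=3, random_state=0)
for batch in feature_stream():
    mbk.partial_fit(batch)                 # incremental centroid update per batch

new_points = np.random.default_rng(1).standard_normal((5, 16))
print(mbk.predict(new_points))             # assign incoming points on the fly
```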
Unifying Similarity Measures Across Domains: While standard definitions are essential for clarity, it is equally important to provide intuitive interpretations that contextualize these formulas within clustering tasks. For example, Dynamic Time Warping (DTW) aligns sequences by optimally “warping” time dimensions, capturing temporal variations beyond straightforward pointwise distance. In contrast, vision clustering methods often rely on learned embeddings where geometric distances reflect semantic similarity. Despite their differences, these approaches share the core aim of defining meaningful similarity metrics that facilitate clustering. However, a comprehensive mathematical framework that abstracts both temporal and spatial data clustering methods into a unified similarity learning paradigm remains elusive. Such a framework would enhance theoretical understanding and foster cross-domain methodological advances. We thus identify this as a key open challenge and a promising avenue for future research.
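To make the DTW intuition concrete, a textbook dynamic-programming sketch is shown below; production systems would typically use an optimized, band-constrained implementation rather than this quadratic version.

```python
# Illustrative sketch only: the standard DTW recurrence for univariate series.
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])   # local pointwise cost
            # Warp by taking the cheapest of match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 70))     # same shape, different length
print(dtw_distance(a, b))                     # small despite the length mismatch
```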

16. Conclusions

In this paper, we reviewed clustering techniques for temporal and visual data, which are critical to advancing machine learning applications across domains such as healthcare, finance, and autonomous systems. Despite the differences in modality—sequential for time-series and spatial for visual data—both data types share common challenges, including high dimensionality, noise, and the need for robust clustering algorithms. The integration of traditional clustering methods with modern deep learning techniques, such as representation learning and hybrid models, has proven effective in addressing these challenges. However, scalability and interpretability remain key concerns, particularly in real-time and large-scale settings. Future research will likely focus on enhancing these models through the use of foundation models, multimodal data fusion, and cross-domain transfer, enabling more efficient and adaptable clustering systems. Additionally, incorporating user feedback through semi-supervised and active learning methods will further improve clustering performance. As the field continues to evolve, these innovations will significantly enhance the ability of machine learning systems to derive meaningful insights from complex, dynamic datasets, paving the way for more intelligent and robust applications in diverse industries.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 478–487. [Google Scholar]
  2. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
  3. Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef]
  4. Aggarwal, C.C. Data Classification: Algorithms and Applications, 1st ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2014. [Google Scholar]
5. Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory—ICDT 2001; Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]
  6. Fu, T.-C. A review on time series data mining. Eng. Appl. Artif. Intell. 2011, 24, 164–181. [Google Scholar] [CrossRef]
  7. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 87–110. [Google Scholar] [CrossRef]
  8. Fakhrazari, A.; Vakilzadian, H. A survey on time series data mining. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; pp. 476–481. [Google Scholar]
  9. Pavel, M.I.; Tan, S.Y.; Abdullah, A. Vision-based autonomous vehicle systems based on deep learning: A systematic literature review. Appl. Sci. 2022, 12, 6831. [Google Scholar] [CrossRef]
  10. Liao, T.W. Clustering of time series data—A survey. Pattern Recognit. 2005, 38, 1857–1874. [Google Scholar] [CrossRef]
  11. Ding, H.; Trajcevski, G.; Scheuermann, P.; Wang, X.; Keogh, E. Querying and mining of time series data: Experimental comparison of representations and distance measures. Proc. VLDB Endow. 2008, 1, 1542–1552. [Google Scholar] [CrossRef]
  12. Ren, Y.; Pu, J.; Yang, Z.; Xu, J.; Li, G.; Pu, X.; Yu, P.S.; He, L. Deep Clustering: A Comprehensive Survey. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 5858–5878. [Google Scholar] [CrossRef]
  13. Baltrusaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  14. Steinbach, M.; Ertöz, L.; Kumar, V. The Challenges of Clustering High Dimensional Data. In New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition; Wille, L.T., Ed.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 273–309. [Google Scholar]
  15. Li, T.; Wang, Z.; Liu, S.; Lin, W.-Y. Deep unsupervised anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3636–3645. [Google Scholar]
  16. Berndt, D.J.; Clifford, J. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 31 July–1 August 1994; pp. 359–370. [Google Scholar]
  17. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  18. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
19. Cuturi, M. Fast Global Alignment Kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML’11), Bellevue, WA, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 929–936. [Google Scholar]
  20. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  21. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control, 5th ed.; Wiley: Hoboken, NJ, USA, 2015. [Google Scholar]
  22. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 156–165. [Google Scholar]
  23. Malhotra, P.; Tv, V.; Vig, L.; Agarwal, P.; Shroff, G. TimeNet: Pre-trained deep recurrent neural network for time series classification. arXiv 2017, arXiv:1706.08838. [Google Scholar]
  24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. Available online: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 23 December 2025). [CrossRef]
  25. Yang, J.; Parikh, D.; Batra, D. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5147–5156. [Google Scholar]
  26. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  27. Dosovitskiy, A.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://papers.nips.cc/paper_files/paper/2014/file/fd1b9ae90284fca85ba3fd719f3ba756-Paper.pdf (accessed on 23 December 2025). [CrossRef] [PubMed]
  28. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696. [Google Scholar]
  29. Bilenko, M.; Basu, S.; Mooney, R.J. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 11. [Google Scholar]
  30. Zhang, Y.; Yan, S.; Zhang, L.; Du, B. Fast projected fuzzy clustering with anchor guidance for multimodal remote sensing imagery. IEEE Trans. Image Process. 2024, 33, 4640–4653. [Google Scholar] [CrossRef]
  31. Xu, K.; Chen, L.; Wang, S. Towards robust nonlinear subspace clustering: A kernel learning approach. IEEE Trans. Artif. Intell. 2025, 1–13. [Google Scholar] [CrossRef]
  32. Wang, X.; Qiao, Y.; Wu, D.; Wu, C.; Wang, F. Cluster based heterogeneous federated foundation model adaptation and fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39. [Google Scholar]
  33. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering—A decade review. Inf. Syst. 2015, 53, 16–38. [Google Scholar] [CrossRef]
  34. Alqahtani, A.; Ali, M.; Xie, X.; Jones, M.W. Deep time-series clustering: A review. Electronics 2021, 10, 3001. [Google Scholar] [CrossRef]
  35. Paparrizos, J.; Yang, F.; Li, H. Bridging the Gap: A Decade Review of Time-Series Clustering Methods. arXiv 2024, arXiv:2412.20582. [Google Scholar] [CrossRef]
  36. Ni, J.; Zhao, Z.; Shen, C.; Tong, H.; Song, D.; Cheng, W.; Luo, D.; Chen, H. Harnessing Vision Models for Time Series Analysis: A Survey. arXiv 2025, arXiv:2502.08869. [Google Scholar] [CrossRef]
  37. Liang, Y.; Wen, H.; Nie, Y.; Jiang, Y.; Jin, M.; Song, D.; Pan, S.; Wen, Q. Foundation Models for Time Series Analysis: A Tutorial and Survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6555–6565. [Google Scholar]
  38. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep Clustering for Unsupervised Learning of Visual Features. arXiv 2019, arXiv:1807.05520. [Google Scholar] [CrossRef]
  39. Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; Long, J. A Survey of Clustering with Deep Learning: From the Perspective of Network Architecture. IEEE Access 2018, 6, 39501–39514. [Google Scholar] [CrossRef]
  40. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef]
  41. Gao, C.X.; Dwyer, D.; Zhu, Y.; Smith, C.L.; Du, L.; Filia, K.M.; Bayer, J.; Menssink, J.M.; Wang, T.; Bergmeir, C.; et al. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Res. 2023, 327, 115265. [Google Scholar] [CrossRef]
  42. Hsu, C.-J.; Huang, K.-S.; Yang, C.-B.; Guo, Y.-P. Flexible Dynamic Time Warping for Time Series Classification. Procedia Comput. Sci. 2015, 51, 2838–2842. [Google Scholar] [CrossRef]
  43. Rui, X. A convolutional neural networks based approach for clustering of emotional elements in art design. PeerJ Comput. Sci. 2023, 9, e1548. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  44. Chen, W.; Liu, Y.; Wang, W.; Bakker, E.M.; Georgiou, T.; Fieguth, P.W.; Liu, L.; Lew, M.S. Deep Learning for Instance Retrieval: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7270–7292. [Google Scholar] [CrossRef]
  45. Rubner, Y.; Tomasi, C.; Guibas, L. The Earth Mover’s Distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  46. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  47. Chai, J.; Zeng, H.; Li, A.; Ngai, E.W.T. Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Mach. Learn. Appl. 2021, 6, 100134. [Google Scholar] [CrossRef]
  48. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
49. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965 and 27 December 1965–7 January 1966; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
  50. Wang, N.; Cui, Z.; Li, A.; Lu, Y.; Wang, R.; Nie, F. Structured Doubly Stochastic Graph-Based Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11064–11077. [Google Scholar] [CrossRef]
  51. Ankerst, M.; Breunig, M.M.; Kriegel, H.; Sander, J. OPTICS: Ordering points to identify the clustering structure. SIGMOD Rec. 1999, 28, 49–60. [Google Scholar] [CrossRef]
  52. Campello, R.J.G.B.; Moulavi, D.; Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining; Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G., Eds.; PAKDD 2013; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7819. [Google Scholar]
  53. Hinneburg, A.; Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD’98), New York, NY, USA, 27–31 August 1998; pp. 58–65. [Google Scholar]
  54. Cooper, G.R.J.; Cowan, D.R. Comparing time series using wavelet-based semblance analysis. Comput. Geosci. 2008, 34, 95–102. [Google Scholar] [CrossRef]
  55. Li, G.; Choi, B.; Xu, J.; Bhowmick, S.S.; Chun, K.-P.; Wong, G.L.-H. Shapenet: A shapelet-neural network approach for multivariate time series classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 11–15 October 2021; Volume 35, pp. 8375–8383. [Google Scholar]
56. Wang, W.; Yang, J.; Muntz, R.R. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB ’97), Athens, Greece, 25–29 August 1997; pp. 186–195. [Google Scholar]
  57. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  58. Li, J.; Allinson, N.M. A comprehensive review of current local features for computer vision. Neurocomputing 2008, 71, 1771–1787. [Google Scholar] [CrossRef]
  59. Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv 2021, arXiv:2110.05208. [Google Scholar]
  60. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  61. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  62. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  63. Maćkiewicz, A.; Ratajczak, W. Principal components analysis (PCA). Comput. Geosci. 1993, 19, 303–342. [Google Scholar] [CrossRef]
  64. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426. [Google Scholar] [CrossRef]
  65. Xiang, R.; Wang, W.; Yang, L.; Wang, S.; Xu, C.; Chen, X. A Comparison for Dimensionality Reduction Methods of Single-Cell RNA-seq Data. Front. Genet. 2021, 12, 646936. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  66. Huang, H.; Wang, Y.; Rudin, C.; Browne, E.P. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun. Biol. 2022, 5, 719. [Google Scholar] [CrossRef]
  67. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  68. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
  69. Figueiredo, M.A.T.; Jain, A.K. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 381–396. [Google Scholar] [CrossRef]
  70. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 2018, 39, 1–22. [Google Scholar] [CrossRef]
71. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 2020. [Google Scholar]
  72. Gupta, L.; Sortrakul, T. A gaussian-mixture-based image segmentation algorithm. Pattern Recognit. 1998, 31, 315–325. [Google Scholar] [CrossRef]
  73. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  74. Smyth, P. Clustering sequences with hidden Markov models. In Proceedings of the 10th International Conference on Neural Information Processing Systems, Denver, CO, USA, 3–5 December 1996; pp. 648–654. [Google Scholar]
  75. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  76. Fuller, W.A.; Hasza, D.P. Properties of predictors for autoregressive time series. J. Am. Stat. Assoc. 1981, 76, 155–161. [Google Scholar] [CrossRef]
  77. Benjamin, M.A.; Rigby, R.A.; Stasinopoulos, D.M. Generalized autoregressive moving average models. J. Am. Stat. Assoc. 2003, 98, 214–223. [Google Scholar] [CrossRef]
  78. Alnaa, S.E.; Ahiakpor, F. ARIMA (autoregressive integrated moving average) approach to predicting inflation in Ghana. J. Econ. Int. Financ. 2011, 3, 328–336. [Google Scholar]
  79. Rivot, E.; Prévost, E.; Parent, E.; Baglinière, J.-L. A Bayesian state-space modelling framework for fitting a salmon stage-structured population dynamic model to multiple time series of field data. Ecol. Model. 2004, 179, 463–485. [Google Scholar] [CrossRef]
  80. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
81. Cho, K.; Merrienboer, B.V.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  82. Jang, J.H.; Kim, T.Y.; Lim, H.S.; Yoon, D. Unsupervised feature learning for electrocardiogram data using the convolutional variational autoencoder. PLoS ONE 2021, 16, e0260612. [Google Scholar] [CrossRef]
  83. Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.; Li, X.; Guan, C. Time-Series Representation Learning via Temporal and Contextual Contrasting. arXiv 2021, arXiv:2106.14112. [Google Scholar] [CrossRef]
  84. Wang, Z.; Yan, W.; Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: New York, NY, USA, 2017; pp. 1578–1585. [Google Scholar]
  85. Zhou, X.; Zhang, N.L. Deep Clustering with Features from Self-Supervised Pretraining. arXiv 2022, arXiv:2207.13364. [Google Scholar] [CrossRef]
  86. Wang, Z.; Oates, T. Imaging time-series to improve classification and imputation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  87. Hatami, N.; Gavet, Y.; Debayle, J. Classification of time-series images using deep convolutional neural networks. In Proceedings of the International Conference on Machine Vision (ICMV), Vienna, Austria, 13–15 November 2017. [Google Scholar]
  88. Zheng, Y.; Liu, Q.; Chen, E.; Ge, Y.; Zhao, J.L. Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks. In Web-Age Information Management; Li, F., Li, G., Hwang, S., Yao, B., Zhang, Z., Eds.; WAIM 2014; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8485. [Google Scholar]
  89. Meng, Q.; Qian, H.; Liu, Y.; Xu, Y.; Shen, Z.; Cui, L. Unsupervised Representation Learning for Time Series: A Review. arXiv 2023, arXiv:2308.01578. [Google Scholar] [CrossRef]
  90. Hahsler, M.; Piekenbrock, M.; Doran, D. dbscan: Fast density-based clustering with R. J. Stat. Softw. 2019, 91, 1–30. [Google Scholar] [CrossRef]
  91. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar] [CrossRef]
  92. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5962–5979. [Google Scholar] [CrossRef]
  93. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar] [CrossRef]
  94. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871. [Google Scholar]
  95. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  96. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
97. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef]
  98. Li, R.; Wang, S.; Zhu, F.; Huang, J. Adaptive graph convolutional neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’18/IAAI’18/EAAI’18), New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Washington, DC, USA, 2018. Article 434. pp. 3546–3553. [Google Scholar]
  99. Bhatti, U.A.; Tang, H.; Wu, G.; Marjan, S.; Hussain, A. Deep learning with graph convolutional networks: An overview and latest applications in computational intelligence. Int. J. Intell. Syst. 2023, 2023, 8342104. [Google Scholar] [CrossRef]
  100. Wu, L.; Cui, P.; Pei, J.; Zhao, L.; Guo, X. Graph Neural Networks: Foundation, Frontiers and Applications. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), Washington, DC, USA, 14–18 August 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4840–4841. [Google Scholar] [CrossRef]
  101. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  102. Jin, M.; Koh, H.Y.; Wen, Q.; Zambon, D.; Alippi, C.; Webb, G.I.; King, I.; Pan, S. A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10466–10485. [Google Scholar] [CrossRef] [PubMed]
  103. Liu, Z.; Barahona, M. Graph-based data clustering via multiscale community detection. Appl. Netw. Sci. 2020, 5, 3. [Google Scholar] [CrossRef]
  104. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  105. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
  106. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461. [Google Scholar] [CrossRef]
  107. Nielsen, F. Hierarchical clustering. In Introduction to HPC with MPI for Data Science; Springer: Berlin/Heidelberg, Germany, 2016; pp. 195–211. [Google Scholar]
  108. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  109. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 9630–9640. [Google Scholar]
  110. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? In Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 2 (NIPS’14), Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; Volume 2, pp. 3320–3328. [Google Scholar]
  111. Witten, D.M.; Tibshirani, R. A framework for feature selection in clustering. J. Am. Stat. Assoc. 2010, 105, 713–726. [Google Scholar] [CrossRef]
  112. Li, X.; Xiong, H.; Wang, H.; Rao, Y.; Liu, L.; Huan, J. Delta: Deep learning transfer using feature map with attention for convolutional networks. arXiv 2019, arXiv:1901.09229. [Google Scholar]
  113. Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. arXiv 2020, arXiv:2012.15723. [Google Scholar]
  114. Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 2016, 184, 232–242. [Google Scholar] [CrossRef]
  115. Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5070–5079. [Google Scholar]
  116. Xiao, S.; Wang, S.; Guo, W. SGAE: Stacked graph autoencoder for deep clustering. IEEE Trans. Big Data 2022, 9, 254–266. [Google Scholar] [CrossRef]
  117. Najafgholizadeh, A.; Nasirkhani, A.; Mazandarani, H.R.; Soltanalizadeh, H.R.; Sabokrou, M. Imaging Time Series for Deep Embedded Clustering: A Cryptocurrency Regime Detection Use Case. In Proceedings of the 2022 27th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 23–24 February 2022; pp. 1–6. [Google Scholar]
  118. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735. [Google Scholar]
  119. Rathore, P.; Kumar, D.; Bezdek, J.C.; Rajasegarar, S.; Palaniswami, M. A rapid hybrid clustering algorithm for large volumes of high dimensional data. IEEE Trans. Knowl. Data Eng. 2018, 31, 641–654. [Google Scholar] [CrossRef]
  120. Alqurashi, T.; Wang, W. Clustering ensemble method. Int. J. Mach. Learn. Cybern. 2019, 10, 1227–1246. [Google Scholar] [CrossRef]
  121. Rawat, B.S.; Srivastava, A.; Singh, G.; Kumar, G.; Dhondiyal, S.A. Hybrid Clustering Techniques for Optimizing Online Datasets Using Data Mining Techniques. In Proceedings of the 2023 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS), New Raipur, India, 6–8 October 2023; pp. 1–5. [Google Scholar]
  122. Senin, P. Dynamic Time Warping Algorithm Review; Information and Computer Science Department University of Hawaii at Manoa Honolulu: Honolulu, HI, USA, 2008; Volume 855, p. 40. [Google Scholar]
  123. Zhao, Q.; Yang, Z.; Tao, H. Differential earth mover’s distance with its applications to visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 32, 274–287. [Google Scholar] [CrossRef]
  124. Chen, P.-Y.; Huang, J.-J. A Hybrid Autoencoder Network for Unsupervised Image Clustering. Algorithms 2019, 12, 122. [Google Scholar] [CrossRef]
  125. Liu, M.; Shi, J.; Li, Z.; Li, C.; Zhu, J.; Liu, S. Towards better analysis of deep convolutional neural networks. IEEE Trans. Vis. Comput. Graph. 2016, 23, 91–100. [Google Scholar] [CrossRef]
  126. Du, K.-L. Clustering: A neural network approach. Neural Netw. 2010, 23, 89–107. [Google Scholar] [CrossRef] [PubMed]
  127. Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2003, 3, 583–617. [Google Scholar]
  128. Meng, J.; Hao, H.; Luan, Y. Classifier ensemble selection based on affinity propagation clustering. J. Biomed. Inform. 2016, 60, 234–242. [Google Scholar] [CrossRef] [PubMed]
  129. Sohn, K.; Yan, X.; Lee, H. Learning structured output representation using deep conditional generative models. In Proceedings of the 29th International Conference on Neural Information Processing Systems—Volume 2, Montreal, QC, Canada, 7–12 December 2015; pp. 3483–3491. [Google Scholar]
  130. Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; pp. 52–59. [Google Scholar]
  131. Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 2003, 52, 91–118. [Google Scholar] [CrossRef]
  132. Mahmud, M.S.; Huang, J.Z.; Ruby, R.; Ngueilbaye, A.; Wu, K. Approximate clustering ensemble method for big data. IEEE Trans. Big Data 2023, 9, 1142–1155. [Google Scholar] [CrossRef]
  133. Oza, N.C. Ensemble data mining methods. In Encyclopedia of Data Warehousing and Mining, 2nd ed.; IGI Global Scientific Publishing: Hershey, PA, USA, 2009; pp. 770–776. [Google Scholar]
  134. Zhu, X.; Ghahramani, Z. Learning from Labeled and Unlabeled Data with Label Propagation. 2002. Available online: https://www.semanticscholar.org/paper/Learning-from-labeled-and-unlabeled-data-with-label-Zhu-Ghahramani/2a4ca461fa847e8433bab67e7bfe4620371c1f77 (accessed on 23 December 2025).
  135. Basu, S.; Bilenko, M.; Mooney, R.J. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 59–68. [Google Scholar]
  136. Xiong, S.; Azimi, J.; Fern, X.Z. Active learning of constraints for semi-supervised clustering. IEEE Trans. Knowl. Data Eng. 2013, 26, 43–54. [Google Scholar] [CrossRef]
  137. Ali, H.; Salleh, M.; Saedudin, R.; Talpur, K.; Mushtaq, M. Imbalance class problems in data mining: A review. Indones. J. Electr. Eng. Comput. Sci. 2019, 14, 1560–1571. [Google Scholar] [CrossRef]
  138. Law, M.H.C.; Topchy, A.; Jain, A.K. Model-based clustering with probabilistic constraints. In Proceedings of the 2005 SIAM International Conference on Data Mining, Newport Beach, CA, USA, 21–23 April 2005; pp. 641–645. [Google Scholar]
  139. Sander, J.; Ester, M.; Kriegel, H.-P.; Xu, X. Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Min. Knowl. Discov. 1998, 2, 169–194. [Google Scholar] [CrossRef]
  140. Xia, J.; Zhang, Y.; Song, J.; Chen, Y.; Wang, Y.; Liu, S. Revisiting Dimensionality Reduction Techniques for Visual Cluster Analysis: An Empirical Study. IEEE Trans. Vis. Comput. Graph. 2022, 28, 529–539. [Google Scholar] [CrossRef] [PubMed]
  141. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Gupta, B.B.; Chen, X.; Wang, X. A survey of deep active learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [Google Scholar] [CrossRef]
142. Wagstaff, K.; Cardie, C.; Rogers, S.; Schrödl, S. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA, 28 June–1 July 2001; pp. 577–584. [Google Scholar]
  143. Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, Á. Human-in-the-loop machine learning: A state of the art. Artif. Intell. Rev. 2023, 56, 3005–3054. [Google Scholar] [CrossRef]
  144. Kovalerchuk, B.; Nazemi, K.; Andonie, R.; Datia, N.; Banissi, E. (Eds.) Artificial Intelligence and Visualization: Advancing Visual Knowledge Discovery; Springer Nature: Berlin/Heidelberg, Germany, 2024; Volume 1126, pp. 420–434. [Google Scholar]
  145. Cohn, D.A.; Ghahramani, Z.; Jordan, M.I. Active learning with statistical models. J. Artif. Intell. Res. 1996, 4, 129–145. [Google Scholar] [CrossRef]
  146. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  147. Cui, G.; Wang, R.; Wu, D.; Li, Y. Semi-supervised Multi-view Clustering based on NMF with Fusion Regularization. ACM Trans. Knowl. Discov. Data 2024, 18, 1–26. [Google Scholar] [CrossRef]
  148. Liu, T.; Moore, A.; Yang, K.; Gray, A. An investigation of practical approximate nearest neighbor algorithms. Adv. Neural Inf. Process. Syst. 2004, 17. Available online: https://papers.nips.cc/paper_files/paper/2004/file/1102a326d5f7c9e04fc3c89d0ede88c9-Paper.pdf (accessed on 23 December 2025).
  149. Rahman, M.D.; Rabbi, S.M.E.; Rashid, M.M. Optimizing Domain-Specific Image Retrieval: A Benchmark of FAISS and Annoy with Fine-Tuned Features. arXiv 2024, arXiv:2412.01555. [Google Scholar]
150. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281. [Google Scholar]
  151. Malkov, Y.A.; Yashunin, D.A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 824–836. [Google Scholar] [CrossRef]
  152. Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, Brooklyn, NY, USA, 8–11 June 2004; pp. 253–262. [Google Scholar]
  153. Silpa-Anan, C.; Hartley, R. Optimised KD-trees for fast image descriptor matching. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  154. Gungor, O.; Rios, A.; Mudgal, P.; Ahuja, N.; Rosing, T. A Robust Framework for Evaluation of Unsupervised Time-Series Anomaly Detection. In Proceedings of the International Conference on Pattern Recognition, Guiyang, China, 15–17 August 2025; pp. 48–64. [Google Scholar]
  155. Cao, F.; Estert, M.; Qian, W.; Zhou, A. Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006; pp. 328–339. [Google Scholar]
  156. Friedman, R.; Goaz, O.; Rottenstreich, O. Clustreams: Data Plane Clustering. In Proceedings of the ACM SIGCOMM Symposium on SDN Research (SOSR), Virtual Event, USA, 11–12 October 2021; pp. 101–107. [Google Scholar]
  157. Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C. GPU Computing. Proc. IEEE 2008, 96, 879–899. [Google Scholar] [CrossRef]
  158. Harris, M.; Sengupta, S.; Owens, J.D. Parallel prefix sum (scan) with CUDA. GPU Gems 2007, 3, 851–876. [Google Scholar]
  159. Andrade, G.; Ramos, G.; Madeira, D.; Sachetto, R.; Ferreira, R.; Rocha, L. G-dbscan: A gpu accelerated algorithm for density-based clustering. Procedia Comput. Sci. 2013, 18, 369–378. [Google Scholar] [CrossRef]
  160. Cuomo, S.; De Angelis, V.; Farina, G.; Marcellino, L.; Toraldo, G. A GPU-accelerated parallel K-means algorithm. Comput. Electr. Eng. 2019, 75, 262–274. [Google Scholar] [CrossRef]
  161. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  162. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  163. Fender, A.; Rees, B.; Eaton, J. Rapids cugraph. In Massive Graph Analytics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 483–493. [Google Scholar]
  164. Fey, M.; Lenssen, J.E. Fast Graph Representation Learning with PyTorch Geometric. arXiv 2019, arXiv:1903.02428. [Google Scholar] [CrossRef]
  165. Mudgal, P.; Wouhaybi, R.H. Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data. In Proceedings of the 2024 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Virtual, 4–6 July 2024; pp. 212–216. [Google Scholar]
  166. Warrens, M.J.; van der Hoef, H. Understanding the Adjusted Rand Index and Other Partition Comparison Indices Based on Counting Object Pairs. J. Classif. 2022, 39, 487–509. [Google Scholar] [CrossRef]
  167. McDaid, A.F.; Greene, D.; Hurley, N. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv 2011, arXiv:1110.2515. [Google Scholar]
  168. Fowlkes, E.B.; Mallows, C.L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 1983, 78, 553–569. [Google Scholar] [CrossRef]
  169. Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
170. Rosenberg, A.; Hirschberg, J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007; pp. 410–420. [Google Scholar]
  171. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  172. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227. [Google Scholar] [CrossRef] [PubMed]
173. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  174. Xu, D.; Tian, Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data. Sci. 2015, 2, 165–193. [Google Scholar] [CrossRef]
  175. Chen, Y.; Shu, T.; Zhou, X.; Zheng, X.; Kawai, A.; Fueda, K.; Yan, Z.; Liang, W.; Wang, K.I.K. Graph attention network with spatial-temporal clustering for traffic flow forecasting in intelligent transportation system. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8727–8737. [Google Scholar] [CrossRef]
  176. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  177. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  178. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  179. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  180. McConville, R.; Santos-Rodríguez, R.; Piechocki, R.J.; Craddock, I. N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5145–5152. [Google Scholar] [CrossRef]
  181. Garza, A.; Challu, C.; Mergenthaler-Canseco, M. TimeGPT-1. arXiv 2024, arXiv:2310.03589. [Google Scholar]
182. Ackermann, M.R.; Märtens, M.; Raupach, C.; Swierkot, K.; Lammersen, C.; Sohler, C. StreamKM++: A clustering algorithm for data streams. ACM J. Exp. Algorithmics 2012, 17, 2.1–2.30. [Google Scholar] [CrossRef]
183. Aggarwal, C.C.; Han, J.; Wang, J.; Yu, P.S. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB ’03), Berlin, Germany, 9–12 September 2003; pp. 81–92. [Google Scholar]
184. Khan, I.; Huang, J.Z.; Ivanov, K. Incremental density-based ensemble clustering over evolving data streams. Neurocomputing 2016, 191, 34–43. [Google Scholar] [CrossRef]
Figure 1. Organization of clustering techniques in our paper. The figure illustrates various methods applied at different stages of the clustering workflow—before, during, or after clustering. It highlights their respective roles in preprocessing, clustering execution, and post-processing.
Figure 2. Illustration of distance-based partitioning clustering. This method establishes a predefined number of cluster centers and assigns data points to clusters based on their proximity to these centers. Illustration taken from [41].
Figure 3. Comparison of clustering results on a random set of images using various clustering algorithms: K-Means, Gaussian Mixture Model (GMM), Autoencoder + K-Means, and Spectral Clustering. Each method is applied to the feature space derived from the images, with K-Means and GMM performing centroid-based and probabilistic clustering, respectively. Autoencoder + K-Means combines deep feature extraction with traditional clustering, while Spectral Clustering utilizes graph-based methods to capture complex relationships in the data [50]. The figure demonstrates the varying cluster structures, separability, and performance of each algorithm across the dataset.
Figure 4. Illustration of density-based clustering. This approach groups data based on the density of point distributions, allowing it to identify clusters of arbitrary shapes and varying sizes. Illustration taken from [41].
Figure 5. Clustering results on the CIFAR-10 dataset using various algorithms, including K-Means, Gaussian Mixture Models (GMM), DBSCAN, and HDBSCAN. Each method groups unlabeled image features into clusters, which are compared against the ground-truth classes. The visualization illustrates differences in cluster compactness, separation, and alignment with semantic categories, highlighting the strengths and limitations of density-based and centroid-based clustering techniques on high-dimensional visual data.
Figure 6. Comparison of clustering results for time-series data using different feature extraction approaches: Wavelet Transform, Shapelet-based, and Coefficient-based methods. The figure highlights how each method captures distinctive patterns within the time-series data. The Wavelet Transform emphasizes localized frequency features, the Shapelet method identifies discriminative subsequences, and the Coefficient-based approach uses statistical features derived from the time-series coefficients to improve clustering performance.
Figure 7. Visualization of clustering results on the CIFAR-10 dataset using t-SNE and K-Means for different types of embeddings: CLIP, BLIP, DINOv2, and CNN. These embeddings map the high-dimensional image data into lower-dimensional spaces, with t-SNE providing a global view of the data’s structure, and K-Means highlighting the clustering performance and separability between classes across different feature representations.
Figure 8. Illustration of model-based clustering. This approach assumes that the data is generated from a mixture of underlying probability distributions, each corresponding to a latent subgroup within the population. Illustration reproduced from [41].
Figure 10. Visualizations of the CIFAR-10 dataset using dimensionality reduction techniques. The left image shows the clustering results after applying Principal Component Analysis (PCA), while the right image shows the results after applying t-SNE. Both methods project high-dimensional image data onto two dimensions to reveal patterns, cluster structures, and separations between different classes in the dataset.
Table 1. Overview of Hybrid Clustering Approaches and Their Motivations.
| Hybrid Type | Component Combination | Main Motivation | Representative Works |
|---|---|---|---|
| Feature–Distance | CNN/Autoencoder + DTW/EMD | Capture structural or temporal similarity beyond Euclidean assumptions | [10,122,123] |
| Representation–Clustering | Autoencoder + k-means | Reduce dimensionality, denoise data, improve cluster separability | [114,124] |
| Deep Representation–Distance | CNN + Euclidean | Learn task-specific embeddings for improved similarity estimation | [125,126] |
| Consensus/Ensemble | Multiple algorithms (e.g., k-means, DBSCAN, spectral) | Improve robustness via voting or graph fusion | [127,128] |
| Probabilistic–Clustering | VAE + k-means | Model uncertainty, improve clustering in noisy or incomplete data | [73,129] |
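As a concrete instance of the representation–clustering row in Table 1, the following minimal sketch pretrains an autoencoder on reconstruction alone and then runs k-means on the bottleneck codes; all layer sizes, training settings, and the synthetic data are illustrative.

```python
# Illustrative sketch only: autoencoder + k-means hybrid clustering.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

X = torch.randn(500, 64)                   # stand-in for flattened samples

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
decoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 64))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(100):                   # reconstruction-only pretraining
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = encoder(X).numpy()             # compact, denoised representations
labels = KMeans(n_clusters=5, n_init=10).fit_predict(codes)
print(np.bincount(labels))                 # cluster sizes
```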
Table 2. Clustering Evaluation Metrics, Formulas, and Ground Truth Requirements. Details collected from [174].
| Metric | Formula | Description | Requires Ground Truth? |
|---|---|---|---|
| Adjusted Rand Index (ARI) [166] | $\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$ | Measures agreement between predicted and true clusters, adjusted for chance. | Yes |
| Normalized Mutual Information (NMI) [167] | $\mathrm{NMI}(U,V) = \frac{2 \cdot I(U;V)}{H(U) + H(V)}$ | Quantifies shared information between predicted and true clusters, normalized by entropy. | Yes |
| Fowlkes–Mallows Index (FMI) [168] | $\mathrm{FMI} = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}$ | Combines pairwise precision and recall to assess clustering accuracy. | Yes |
| Homogeneity Score [169] | $h = 1 - \frac{H(C \mid K)}{H(C)}$ | Clusters contain only members of a single class. | Yes |
| Completeness Score [169] | $c = 1 - \frac{H(K \mid C)}{H(K)}$ | All members of a given class are in the same cluster. | Yes |
| V-measure [170] | $v = \frac{2 \cdot h \cdot c}{h + c}$ | Harmonic mean of homogeneity and completeness. | Yes |
| Silhouette Score [171] | $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$ | Assesses how well a point fits in its cluster vs. the nearest other cluster. | No |
| Davies–Bouldin Index (DBI) [172] | $\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$ | Evaluates average cluster similarity; lower is better. | No |
| Calinski–Harabasz Index (CHI) [173] | $\mathrm{CHI} = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \cdot \frac{n - k}{k - 1}$ | Ratio of between- to within-cluster variance; higher is better. | No |
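All of the metrics in Table 2 have reference implementations in scikit-learn; the short sketch below computes them on a synthetic dataset, separating those that require ground-truth labels from those that do not.

```python
# Illustrative sketch only: computing the Table 2 metrics with scikit-learn.
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External metrics: require ground-truth labels.
print("ARI  :", metrics.adjusted_rand_score(y_true, y_pred))
print("NMI  :", metrics.normalized_mutual_info_score(y_true, y_pred))
print("FMI  :", metrics.fowlkes_mallows_score(y_true, y_pred))
print("h,c,v:", metrics.homogeneity_completeness_v_measure(y_true, y_pred))

# Internal metrics: need only the data and the predicted partition.
print("Silhouette       :", metrics.silhouette_score(X, y_pred))
print("Davies-Bouldin   :", metrics.davies_bouldin_score(X, y_pred))
print("Calinski-Harabasz:", metrics.calinski_harabasz_score(X, y_pred))
```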
Table 3. Computational Complexity of Common Clustering Algorithms. Details collected from [174].
| Algorithm | Time Complexity | Notes |
|---|---|---|
| K-Means | $O(nkdi)$ | $n$: data points, $k$: clusters, $d$: dimensions, $i$: iterations. Scalable and efficient, but sensitive to initialization. |
| Hierarchical (Agglomerative) | $O(n^2 \log n)$ | Computes pairwise distances and merges clusters iteratively. Not suitable for large datasets. |
| DBSCAN | $O(n \log n)$ to $O(n^2)$ | Index structures like KD-trees improve performance; good for low dimensions. |
| Spectral Clustering | $O(n^3)$ | Requires eigen-decomposition of the similarity matrix. Poor scalability for large datasets. |
| Gaussian Mixture Models (EM) | $O(nkdi)$ | Similar to K-Means but fits full distributions. More flexible but slower. |
| Mean Shift | $O(n^2)$ | Based on kernel density estimation. Not efficient for large-scale data. |
| Affinity Propagation | $O(n^2 T)$ | $T$: iterations. Message-passing based; memory-intensive for large $n$. |
| OPTICS | $O(n \log n)$ | Orders points to reflect clustering structure. More robust than DBSCAN. |
| HDBSCAN | $O(n \log n)$ | Hierarchical extension of DBSCAN. Good scalability and cluster quality. |
| BIRCH | $O(n)$ | Builds a CF-tree to summarize data. Fast and memory-efficient for large datasets. |
| CNN + K-Means | $O(nkdi + C)$ | Combines CNN feature extraction ($C$ is the cost of training the CNN) with K-Means. High upfront cost, but effective for image clustering. |
| Joint Clustering Models | $O(nkd + \text{NN/AE training})$ | Learns clustering and representation simultaneously (e.g., DEC, IDEC). Cost depends on the neural network architecture. |
| Graph-Based Clustering | $O(n^2)$ to $O(n^3)$ | Includes spectral clustering and GNN-based clustering. Graph construction and processing are the computational bottlenecks. |