Article

Improving K-Means Clustering: A Comparative Study of Parallelized Version of Modified K-Means Algorithm for Clustering of Satellite Images

1 Department of Electrical Engineering and Computer Science, South Dakota State University (SDSU), Brookings, SD 57007, USA
2 Image Processing Lab, Engineering Office of Research, South Dakota State University (SDSU), Brookings, SD 57007, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(8), 532; https://doi.org/10.3390/a18080532
Submission received: 5 June 2025 / Revised: 8 August 2025 / Accepted: 11 August 2025 / Published: 21 August 2025
(This article belongs to the Special Issue Algorithms in Multi-Sensor Imaging and Fusion)

Abstract

Efficient clustering of high-spatial-dimensional satellite image datasets remains a critical challenge, particularly due to the computational demands of spectral distance calculations, random centroid initialization, and sensitivity to outliers in conventional K-Means algorithms. This study presents a comprehensive comparative analysis of eight parallelized variants of the K-Means algorithm, designed to enhance clustering efficiency and reduce computational burden for large-scale satellite image analysis. The proposed parallelized implementations incorporate optimized centroid initialization for better starting-point selection, a dynamic K-Means Sharp method to detect outliers and improve cluster robustness, and a Nearest-Neighbor Iteration Calculation Reduction method to minimize redundant computations. These enhancements were applied to a test set of 114 global land cover data cubes, each comprising high-dimensional satellite images of size 3712 × 3712 × 16, and executed on a multi-core CPU architecture to leverage extensive parallel processing capabilities. Performance was evaluated across three criteria: convergence speed (iterations), computational efficiency (execution time), and clustering accuracy (RMSE). The Parallelized Enhanced K-Means (PEKM) method achieved the fastest convergence at 234 iterations and the lowest execution time of 4230 h, while maintaining consistent RMSE values (0.0136) across all algorithm variants. These results demonstrate that targeted algorithmic optimizations, combined with effective parallelization strategies, can improve the practicality of K-Means clustering for high-dimensional satellite image analysis. This work underscores the potential of improving K-Means clustering frameworks beyond hardware acceleration alone, offering scalable solutions for large-scale unsupervised image classification tasks.

Graphical Abstract

1. Introduction

1.1. Overview of Satellite Image

Satellite images are rich data sources crucial for providing geographical information [1]. For instance, the Landsat program has consistently provided high-quality multispectral images to help understand changes in land cover and the environment since 1972 [2]. Satellite remote sensing technologies collect data/images at variable revisit times, at diverse spatial, spectral, temporal and radiometric resolutions by sensors sensitive to specific spectral regions [3].
Satellite image clustering is a powerful technique for deriving information and fulfills a crucial function across diverse applications, from land cover mapping and urban planning to environmental monitoring and disaster management [3]. The South Dakota State University (SDSU) Image Processing Lab (IPLAB) uses global image clustering as Extended Pseudo Invariant Calibration Sites (EPICS) to calibrate satellites [4,5]. The purpose of image clustering is to group the pixels in the image to identify spectrally similar, homogeneous sites with known spectral, spatial and temporal characteristics [6]. In contrast, information classes are categorical, such as crop type, forest type and residential type [7]. The advancement of improved clustering algorithms has therefore garnered considerable attention. Previous implementations of the K-Means clustering process within the IPLAB EPICS satellite calibration workflow consistently failed to reach the convergence threshold, preventing the algorithm from attaining a global minimum. Despite this, the resulting spectrally similar and temporally stable pixels were utilized in various satellite calibration efforts, such as global EPICS classification for stability monitoring, validation of cross-calibration techniques, and inter-sensor comparison during on-orbit initialization [5,8,9]. However, the absence of convergence limited the robustness of site selection and contributed to increased uncertainty in radiometric calibration outcomes. The enhanced convergence properties of the proposed algorithms establish a more stable foundation for future satellite calibration efforts, enabling reduced uncertainty and improved reproducibility in global-scale clustering outputs.

1.2. Related Work

1.2.1. Overview of Clustering Algorithm

Clustering algorithms are fundamental to unsupervised learning, enabling the partitioning of large datasets into meaningful groups without labeled training data [10,11]. In satellite image analysis, clustering enables the classification of pixels into land cover types based on spectral characteristics. Common clustering algorithms include hierarchical clustering [12], density-based methods such as DBSCAN [13], spectral clustering [14,15], and partitioning algorithms like K-Means [16,17,18,19]. Each of these methods offers distinct advantages and trade-offs, particularly when applied to high-spatial-dimensional satellite imagery.
Hierarchical clustering provides a nested clustering structure, or tree structure, that allows flexible exploration of data groupings without predefining the number of clusters [20]. However, its computational complexity, typically O(n² log n) [21], renders it impractical for very large datasets like high-resolution satellite images exceeding millions of pixels per scene.
Density-based methods such as DBSCAN are effective in detecting clusters of arbitrary shapes and naturally handle noise and outliers [22]. Despite its advantages, DBSCAN faces significant challenges in high-dimensional data spaces, where the notion of density becomes less meaningful due to the curse of dimensionality. Moreover, its performance is highly sensitive to the choice of neighborhood radius (ε) and minimum points (MinPts), which complicates its application to large and diverse satellite datasets [23]. Recent advancements, such as the Fast and General Density Peaks algorithm [24], have addressed the scalability limitations of density-based clustering by approximating k-nearest neighbor graphs, enabling clustering of up to one million high-spatial-dimensional vectors. It is a general approach that can be applied to diverse data types. However, it has not yet been demonstrated on massive, spatially continuous remote sensing datacubes where over 15.7 billion pixels with 16 spectral features present unique spatial correlation and memory locality challenges.
Spectral clustering [25] transforms the data into a graph-based representation and solves for the eigenvectors of the Laplacian matrix to identify clusters. While powerful for handling non-convex clusters, spectral clustering suffers from the prohibitive computational cost (O(n³)) associated with eigendecomposition, making it unsuitable for large-scale, high-spatial-dimensional image analysis.
In contrast, K-Means clustering [18,19,26,27] remains one of the most computationally efficient methods for large datasets, with the time complexity of O(n × k × i × d), where n is the number of data points, k is the number of clusters, i is the number of iterations, and d is the data dimensionality. K-Means is particularly well-suited to hyperspectral and multispectral satellite images due to its:
  • Linear scalability with data volume;
  • Support for asynchronous parallel execution across independent data slices, enabling non-blocking processing of extremely large datasets (e.g., multiple 1 TB segments) by distributing computational tasks concurrently across multi-core or distributed systems;
  • Flexibility to integrate algorithmic improvements such as advanced centroid initialization, iteration reduction, and outlier handling, as explored in this study.
Furthermore, modified K-Means facilitates high-performance parallel implementations using frameworks like MPI, OpenMP, and MapReduce, allowing effective distribution of computation across large CPU clusters [28,29,30]. This characteristic aligns closely with the architectural resources of our target HPC environment, which predominantly comprises CPU nodes, thus justifying the selection of K-Means as the fundamental algorithm for our optimization efforts.

1.2.2. Advancement in K-Means Clustering

Since its introduction by MacQueen in 1967 [31], the K-Means algorithm has undergone extensive refinement to address its inherent limitations, including sensitivity to initial centroid selection, vulnerability to outliers, and high computational demand on distance calculations [19].
One of the well-recognized limitations of the Standard K-Means algorithm is its reliance on random selection of initial centroid positions, which can lead to poor convergence behavior and susceptibility to local minima [32]. To overcome this, several studies have proposed strategies to enhance centroid initialization. Khan et al. [33] proposed an improved initialization strategy that differentiates between instance attributes using a novel weighting formula, coupled with a specialized normalization process that avoids forcing values into a restrictive range. However, the evaluation of this methodology on large-scale datasets and real-world scenarios remains limited. Chan et al. [34] introduced a hybrid method combining kernel K-Means++ with random projection to reduce dimensionality before centroid selection. This approach successfully reduced computational complexity from O(n²D) to O(n²d), where n is the number of data points, D is the original feature dimensionality, and d ≪ D is the reduced dimensionality after projection. This makes the method particularly suitable for high-spatial-dimensional clustering scenarios while preserving clustering quality. Nevertheless, while the method effectively reduces dimensionality, it does not address the issue of large sample size (n).
It is important to note that high dimensionality alone does not hinder clusterability when the data are well structured [35]; however, it significantly increases computational cost, particularly in distance-based clustering algorithms like K-Means. In this study, our focus is on reducing this computational burden in large-scale satellite image analysis, where each data point comprises multiple correlated spectral bands and iterations are governed by centroid stability rather than a fixed count.
These methods have demonstrated improvements over random selection and aim to improve clustering outcomes for complex and large-scale datasets. However, many of these techniques still encounter challenges related to computational load and scalability when applied to extremely high-spatial-dimensional data such as multispectral satellite imagery. To address this, our study proposes a uniform average pixel grouping strategy specifically designed for large-scale image clustering, which is detailed in Section 3.2.2.
The susceptibility of K-Means clustering to outliers is a persistent challenge, often leading to skewed centroids and degraded clustering performance, especially in complex, high-spatial-dimensional datasets [36]. To overcome this issue, Olukanmi et al. [37] introduced K-Means Sharp, a modified version of the classical K-Means algorithm that automatically detects outliers using a global threshold derived from the distribution of point-to-centroid distances. However, as the authors acknowledged, the reliance on a single global threshold may not sufficiently capture local variations in outlier distribution across clusters. Building upon this foundation, the CSK-Means and NSK-Means algorithms advanced the handling of outliers by incorporating density-based strategies and assigning separate clusters to outliers. These algorithms enhance robustness by considering the density within clusters and the overlap spaces between them, making them capable of handling non-spherical cluster shapes and datasets with varying densities [38]. Despite these improvements, the authors noted that the algorithms require multiple iterative steps to determine the final cluster count in high-cluster scenarios, which can lead to elevated computational costs, especially in large-scale datasets. Recent advancements in anomaly detection have improved outlier detection in high-spatial-dimensional data; for instance, the ResAD framework proposed by Zhang et al. [39] focuses on class-generalizable anomaly detection by learning residual feature distributions, which can reduce feature variations and improve detection across diverse classes. However, ResAD's effectiveness in clustering contexts remains to be thoroughly evaluated, and its integration into clustering frameworks like K-Means requires further investigation.
This study extends previous advancements by introducing a dynamic, cluster-specific thresholding approach within the K-Means Sharp framework. Thresholds are calculated individually for each cluster and dynamically updated at every iteration to enhance clustering robustness. A detailed explanation of this refinement, including computational considerations and its application to large-scale satellite image analysis, is provided in Section 3.2.4.
The Standard K-Means algorithm suffers from high computational cost due to the need to repeatedly calculate distances between every data point and all cluster centers at each iteration [40]. To address this, various methods have emerged that focus on reducing the number of distance calculations while maintaining clustering quality. Na et al. [41] proposed an improved K-Means algorithm that introduces a simple data structure to retain essential information across iterations, thereby eliminating redundant computations of distances to cluster centers. While this method effectively accelerates clustering, it primarily benefits datasets of moderate size and may struggle to scale efficiently for high-spatial-dimensional data or complex cluster boundaries. Wang et al. [42] further advanced this direction by identifying active points near cluster boundaries, enabling selective recalculation of cluster assignments. However, its effectiveness is heavily dependent on accurately identifying boundary points, making it less reliable in scenarios involving overlapping or irregular clusters. Moodi et al. [43] developed an improved K-Means variant for big data applications, leveraging the distances to the two nearest centroids while excluding points beyond an equidistance threshold from further consideration. However, it faces challenges in handling large sample sizes, as its computational performance degrades when the number of objects, dimensions, and clusters increases.
To address these limitations in the context of satellite image clustering, this study proposes a Nearest-Neighbor Iteration Calculation Reduction Method that draws upon the strengths of prior work while adapting to the complexities of high-spatial-dimensional satellite datasets. Rather than relying solely on nearest-centroid tracking, the method incorporates a broader neighborhood context to better capture the nuances of spatial data distributions. The approach is designed to balance computational efficiency with classification accuracy, making it well suited for clustering tasks involving diverse and overlapping land-cover types. Details of this proposed method are provided in Section 3.2.3.

1.2.3. Parallelization Techniques in K-Means Cluster

To address the computational intensity of clustering high-spatial-dimensional data, parallelization strategies have become essential [44]. Early parallel K-Means implementations leveraged shared-memory architectures to distribute computations across multiple cores [45]. Subsequent developments introduced block-processing techniques, effectively partitioning datasets for concurrent processing on multi-core CPUs [46]. Frameworks such as OpenMP and POSIX threads have demonstrated significant improvements in clustering performance by harnessing the full potential of modern processors [28].
GPU-accelerated implementations of K-Means have gained attention due to their massive parallel processing capabilities [47,48], particularly for real-time and high-throughput applications. However, in many high-performance computing (HPC) environments, including the one employed in this study, CPU nodes constitute the majority of available resources. The HPC system employed in this study comprises over 3000 CPU cores and only 14 GPU nodes. This resource distribution informed the strategic decision to optimize CPU-based parallelization, enabling full utilization of available computational capacity and achieving scalable performance for global land cover classification. The implementation prioritizes asynchronous, CPU-based parallel execution across independent data partitions, avoiding blocking calls during I/O-bound operations and enhancing throughput and computational efficiency in large-scale image clustering tasks. Although the algorithm is optimized for CPU-based HPC environments owing to the high availability of CPU resources, the design remains fully compatible with GPU processing, making it adaptable to diverse computing architectures.

1.2.4. Motivation and Contribution of This Study

While prior research has explored individual improvements in K-Means clustering algorithms and parallelization strategies, few studies have systematically evaluated multiple optimized variants under a unified framework, particularly at the scale required for global satellite image analysis. Existing methods have largely been tested on smaller or less complex datasets, leaving a gap in understanding their scalability and robustness when applied to extensive hyperspectral or multispectral image collections.
This study addresses this gap by conducting a comprehensive comparative analysis of eight parallelized K-Means variants, including enhancements in centroid initialization, outlier handling, and iteration reduction strategies. Performance evaluation was conducted using 114 high-spatial-dimensional data cubes representing diverse land cover types worldwide, with each method assessed based on convergence speed, computational efficiency, and clustering quality. The insights gained from this analysis guide the selection of the most effective algorithms for future deployment across the intended global dataset of 9500 data cubes in HPC environments. This work contributes to the advancement of scalable and efficient clustering techniques for large-scale remote sensing data classification.

2. Materials

2.1. SDSU Derived Data Product

Landsat 8 Operational Land Imager data were selected to generate the data cubes using Google Earth Engine (GEE), a cloud-based platform with high computational capabilities for processing large geospatial datasets [49]. The data filtering process addressed cloud cover and cloud shadows. For additional information, refer to Shrestha et al. [4] and Fajardo et al. [5].

2.2. Hardware and Software Specifications

This section describes the software and tools used to test all algorithms on identical datasets. The computational experiments were performed on a server equipped with an Intel Xeon Gold 6242R CPU @ 3.10 GHz, featuring 2 sockets with 20 cores each, totaling 40 cores and 80 logical CPUs (2 threads per core). The software environment included the Linux operating system (5.4.0-204-generic x86_64) and the MATLAB (R2023a) programming language.
Due to the large size of the test data, processing it in a single run was not feasible. The asynchronous parallel implementation of the algorithm utilized the ‘parfor’ construct, analogous to a job scheduling mechanism, to process approximately 10 TB of data by calculating Euclidean distances and assigning pixels to clusters based on the shortest spectral distance, defined as the Euclidean distance between a pixel in the 16-dimensional spectral feature space and the centroids. This distance metric was used throughout the clustering process to quantify spectral similarity between pixels. This approach was effective because each pixel value is independent of the others. Multicore execution, with 19 workers running concurrently, led to a significant reduction in computational time.
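As a rough illustration of this setup, the following MATLAB sketch shows how a pool of workers can process independent data cubes concurrently with ‘parfor’; the file list, the field name, and the classifyCube helper are hypothetical stand-ins rather than the actual production code:

% Minimal sketch (illustrative only): each worker classifies whole data cubes
% independently, so no communication is needed between iterations of the loop.
pool = parpool(19);                        % 19 concurrent workers, as used in this study
nCubes = numel(cubeFiles);                 % e.g., cell array of 114 data-cube file paths
labels = cell(nCubes, 1);
parfor c = 1:nCubes
    s = load(cubeFiles{c});                % loads one 3712 x 3712 x 16 data cube
    labels{c} = classifyCube(s.cube, centroids);   % hypothetical nearest-centroid helper
end
delete(pool);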

3. Methodology

Figure 1 presents the logical development of the parallelized K-Means algorithms evaluated in this study. The base method, Parallel Standard K-Means (PSKM), serves as the foundation. Three key improvements, Centroid Initialization (CI), Dynamic K-Means Sharp (KS), and Nearest-Neighbor Iteration Calculation Reduction (NN), are introduced independently to enhance convergence speed and cluster quality. These individual improvements are then combined in various ways to form hybrid methods. The final integrated method, Parallelized Enhanced K-Means (PEKM), incorporates all three enhancements (i.e., PEKM = PSKM + CI + KS + NN). This modular design allows the contribution of each improvement to be evaluated both independently and in combination. The hierarchical flowchart in Figure 1 visually clarifies the relationships among these variants and highlights how PEKM is systematically constructed from its components.
Figure 2 shows the flowchart of the overall proposed framework. In the following section, the foundations of the approach are presented. For comparative analysis, the framework provides an overview of each algorithm. Landsat 8 images are input into eight different algorithms, each performing image segmentation through pixel classification. Clustering quality and computational efficiency are then evaluated using various performance metrics. Additional details about each method are provided in subsequent sections.

3.1. Satellite Imagery Dataset

To evaluate the performance of the parallelized versions of the modified K-Means algorithms, testing on all 9522 data cubes in our archive is impractical due to the massive data volume (>1 billion pixels in total). Consequently, a strategy was devised to select representative datasets based on different land cover types. Specifically, the Biosphere Atmosphere Transfer Scheme (BATS) classification [50] was employed, categorizing land cover into 20 distinct classes. From this classification, 114 data cubes were selected, each representing a heterogeneous mixture of land cover types, as seen in Figure 3. This approach ensures a comprehensive and efficient assessment of the algorithms’ performance across diverse land cover scenarios. Each data cube consists of temporal mean layers (bands 1–8) and temporal standard deviation layers (bands 9–16). Please refer to Table 1 for more details of the dataset and the K-Means parameters.

3.2. Algorithm Development

This study evaluates a selection of parallelized, modified versions of the K-Means clustering algorithm, built from three main variants:
  • Centroid Initialization Method;
  • Nearest-Neighbor Iteration Calculation Reduction Method;
  • Dynamic K-Means Sharp Method.

3.2.1. Parallelized Standard K-Means Algorithm

The K-Means clustering algorithm is an iterative clustering technique that partitions a dataset into a predefined number of k distinct clusters, where each pixel is assigned to the cluster whose centroid has the closest mean value [19]. The process begins by randomly selecting k initial cluster centers, referred to as centroids, from each layer of the 114 data cubes, with each cube containing 3712 × 3712 × 16 pixels, where each of the 3712 × 3712 pixels is represented by 16 input features derived from 8 spectral bands of Landsat-8 OLI imagery. These centroids serve as the initial reference points for the clusters.
$$\mu_{mn} = \begin{bmatrix} \mu_{1,1} & \cdots & \mu_{1,16} \\ \vdots & \ddots & \vdots \\ \mu_{160,1} & \cdots & \mu_{160,16} \end{bmatrix} \quad (1)$$
where ‘m’ represents the number of clusters and ‘n’ represents the number of features.
The algorithm iteratively refines the clusters through the following steps. First, each pixel Pxyn is assigned to the nearest cluster based on the Euclidean distance. The Euclidean distance between a pixel Pxyn = (Pxy1, Pxy2,…, Pxy16) from the data cube of size 3712 × 3712 × 16 and a centroid μmn = ([μ(1,1), μ(1,2),…, μ(1,16)],…, [μ(160,1), μ(160,2),…, μ(160,16)]) is computed as seen in Equation (2), where ‘m’ represents the centroid index, ‘K’ represents the number of desired clusters, and ‘n’ represents the feature index.
$$D_{xym} = \sqrt{\sum_{n=1}^{16} \left( P_{xyn} - \mu_{mn} \right)^2}, \quad 1 \le x \le \mathrm{Row},\; 1 \le y \le \mathrm{Column},\; 1 \le m \le K \quad (2)$$
After all pixels have been assigned to clusters, the centroid for each layer of a cluster is calculated by using the following equation:
$$\mu_{mn,\mathrm{new}} = \frac{\mathrm{sum}\left(\mathrm{ClassifiedPixels}_{mn}\right)}{\mathrm{count}\left(\mathrm{ClassifiedPixels}_{mn}\right)}, \quad 1 \le m \le K,\; 1 \le n \le 16 \quad (3)$$
Equation (3) calculates the new mean, defined as the average of the pixel values assigned to the cluster. After calculating the new mean value, convergence is checked by comparing the old and new cluster means. If the maximum difference between the old and new means exceeds the convergence threshold of 0.0005, the cluster means are updated with the new values and the iterative process is repeated. This iterative approach gradually refines the clusters, making them more distinct and well-defined with each iteration. If the difference is below the threshold, the algorithm has converged. Algorithm 1 presents the pseudocode for the K-Means algorithm.
Algorithm 1: Pseudocode for K-Means Algorithm
Input: I = {I1, I2, I3,…,Id} // set of d data cubes (114 data cubes of size 3712 × 3712 × 16)
K // Number of desired clusters
Output: A set of K distinct clusters

1. Randomly select K initial cluster centers (centroids) from the 114 data cubes, where each centroid is represented by the matrix μmn as shown in Equation (1).
2. While max(|μmn,old − μmn,new|) ≥ 0.0005 do:
3. For (I = 1: number_of_images)//for all images
    For (x = 1: number_of_row)
     For (y = 1: number_of_column)//for all pixel
4.      Calculate the Euclidean distance between Pxyn and each centroid μmn using Equation (2)
5. Assign Pxyn to the cluster with the nearest centroid (min(D(x,y,m)))
6. Update each centroid μmn: For each cluster m:
7. Calculate the new centroid using Equation (3).
8. Check for convergence:
9. Calculate the maximum change in centroids:
10. DiffMean = max(|μmn,old−μmn_new|)
11. If DiffMean < 0.0005, then:
    Convergence achieved, exit loop
12. Else:
   Update centroids: μmn = μmn_new
13. Continue iteration
   End While
When applying the K-Means clustering algorithm to 114 data cubes, each with dimensions 3712 × 3712 × 16, significant challenges are encountered related to speed-up, throughput, and scalability, particularly during the computation of the Euclidean distance between each pixel in the data cubes and the k centroids. Since the distance calculation for each pixel in different data cubes is independent, the process can be effectively parallelized asynchronously, utilizing all available resources and making it computationally efficient.
To achieve this parallelization in MATLAB, the ‘parfor’ loop is utilized, which is analogous to the job scheduling mechanism in Slurm, allowing tasks to run concurrently on multiple worker sessions. The process begins by opening a MATLAB pool, which allocates a set of MATLAB worker sessions. These workers are then assigned to handle clustering tasks, where each worker processes a subset of the data cubes, specifically N/n data cubes, where N is the total number of data cubes and n is the number of workers.
Parallel execution allows for efficient distance computation by distributing workloads among the available compute nodes, as shown in Figure 4. Each node independently processes assigned data chunks, gathers results, and updates centroids upon completion. This process continues until the convergence criteria are met, ensuring iterative refinement of clusters. Figure 5 illustrates the operational flowchart of this parallelization, showing how data chunks are distributed to workers and how results are combined.
Algorithm 2 provides the pseudocode for the parallel implementation of the Standard K-Means clustering algorithm, which is implemented using the ‘parfor’ function on the server. Details about the hardware and software used for this implementation are described in Section 2.2.
Algorithm 2: Pseudocode for Parallelized Standard K-Means Algorithm
Input: I = {I1, I2, I3,…,Id}//set of d data cubes (114 data cubes of size 3712 × 3712 × 16)
K//Number of desired clusters
Output: A set of K distinct clusters
  • Randomly select K initial cluster centers (centroids) from the 114 data cubes, where each centroid is represented by the matrix μmn as shown in Equation (1).
  • While max(|μmn,old − μmn,new|) ≥ 0.0005 do:
  • Parfor each Data chunk I = {I1, I2,I3,…,Id}
  •   For (x = 1: number_of_rows)
  •   For (y = 1: number_of_columns)
  • Calculate the Euclidean distance between Pxyn and each centroid μmn using Equation (2).
  •  Assign Pxyn to the cluster with the nearest centroid (min(D(x,y,m)))
  • Calculate the partial sums for each cluster for updating centroids
  •   End Parfor
  • Combine partial sums from all chunks to calculate the new mean (centroid) for each cluster.
  • Update each centroid μmn: For each cluster m:
  • Calculate the new centroid using Equation (3).
  •  Check for convergence:
  • Calculate the maximum change in centroids: DiffMean = max(|μmn,old−μmn_new|)
  • If DiffMean < 0.0005, then:
  •   Convergence achieved, exit while loop
  • Else:
  •   Update centroids: μmn = μmn_new
  •   Continue iteration
  • End While
In line 3 of Algorithm 2, the “parfor” function leverages the multicore capabilities of the CPU to compute distances and assign pixels to the nearest cluster. This parallelization is consistently applied across all modified versions of the K-Means algorithms.
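To make the reduction and convergence test concrete, a simplified MATLAB sketch is given below; it reuses the hypothetical cubeFiles list and centroids matrix from the sketch in Section 2.2, localPartialSums is a hypothetical helper returning per-cluster pixel sums and counts for one data chunk, and the 160 × 16 dimensions follow the centroid matrix in Equation (1):

% Illustrative reduction of per-worker partial sums into new centroids,
% followed by the 0.0005 convergence check described above.
sumPix = zeros(160, 16);                   % running sum of pixel values per cluster/band
cntPix = zeros(160, 1);                    % running pixel count per cluster
parfor c = 1:nCubes
    [s, n] = localPartialSums(cubeFiles{c}, centroids);   % hypothetical helper
    sumPix = sumPix + s;                   % parfor reduction variables
    cntPix = cntPix + n;
end
newCentroids = sumPix ./ cntPix;           % Equation (3): mean of assigned pixels
diffMean = max(abs(newCentroids - centroids), [], 'all');
if diffMean < 0.0005
    converged = true;                      % convergence achieved, stop iterating
else
    centroids = newCentroids;              % otherwise continue with updated centroids
end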

3.2.2. Parallelized Centroid Initialization Method

This improved algorithm introduces a new method to calculate a uniform average across the dataset. Pixel values are extracted through a 100 × 100 grid to create a matrix. The matrix is then sorted based on the fifth column, corresponding to the Near-Infrared (NIR) band, chosen for its relative insensitivity to atmospheric effects and its capacity to distinguish surface variability. Rows containing NaN values are removed to ensure data integrity prior to further processing. Initial centroids are determined by dividing the matrix into 160 groups and calculating the average for each group. Refer to Algorithm 3 for the pseudocode of this class initialization.
Algorithm 3: Pseudocode for class initialization of parallel centroid initialization method
Input:I = {I1, I2, I3,….,In}//set of n data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output:A set of K distinct clusters.

sampledpixelvalues ← matrix of zeros (size: 100 × 16) to hold the sampled pixel values
  • counter← 1
  • For (i = 1 to 114):
  • imagein ← read image from Imagefile
  • Select only the first 16 layers of the image: imagein ← imagein(:,:,1:16)
  • image ← imagein(:,:,5)//Select band 5 for operations
  • Set grid size: gridsize ← 371
  • Calculate number of grid points:
  • numpointX ← floor(size of image in dimension 1/gridsize)
  • numpointY ← floor(size of image in dimension 2/gridsize)
  • For each grid point, extract pixel values:
  • For j = 1 to numpointX:
  • For z = 1 to numpointY:
  • Calculate x and y based on grid index:
  • x ← (j−1)  × gridsize + 1
  • y ← (z−1) ×  gridsize + 1
  • Assign sampled pixel values: sampledpixelvalues[counter, :] ← imagein(x, y, :)
  • Counter Increment
  • Save the sampled pixel values:
  • Sort the sampled pixel values by the 5th column:
  • columntosortby ← sampledpixelvalues(:, 5)
  • sortedvalues, sortedindices ← sort(columntosortby)
  • sortedMatrix ← sampledpixelvalues(sorted by sortedindices)
  • Remove rows containing NaN values:
  • rowswithnans ← any(isnan(sortedMatrix), 2)
  • sortedMatrix ← sortedMatrix without rowswithnans
  • Calculate the Initial centroid:
  • row, column ← size(sortedMatrix)
  • rows_per_group ← floor (row/160)
  • remainder ← mod (row, 160)
  • InitialCluster ← 160  ×  column matrix of zeros
  • Group rows and calculate mean for each group:
  • start_index ← 1
  • For (i = 1 to 160):
  • If i ≤ remainder
  • set end_index ← start_index + rows_per_group
  • Else
  • Set end_index ← start_index + rows_per_group−1
  • Currentgroup ← sortedMatrix[startindex:endindex, :]
  • Calculate mean for the current group and store in InitialCluster[i, :] ← mean(currentgroup)
  • Update startindex ← endindex + 1
  • Save the calculated initial centroids.
Algorithm 3 introduces a more effective method for determining the initial centroids of clusters, unlike the Standard K-Means Algorithm, which chooses centroids randomly. This improvement accelerates the convergence of the K-Means clustering algorithm.
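The grouping step can be sketched in MATLAB as follows, assuming the sampled pixel matrix (one row per sampled pixel, 16 columns) has already been assembled from the grid; the near-equal group boundaries are a slight simplification of the remainder handling in Algorithm 3, and all names are illustrative:

% Illustrative initialization: sort sampled pixels by the NIR band (column 5),
% drop rows containing NaN, split into 160 groups, and average each group.
[~, order]   = sort(sampledPixelValues(:, 5));                  % sort by NIR band
sortedMatrix = sampledPixelValues(order, :);
sortedMatrix = sortedMatrix(~any(isnan(sortedMatrix), 2), :);   % remove NaN rows
nGroups        = 160;
nRows          = size(sortedMatrix, 1);
edges          = round(linspace(0, nRows, nGroups + 1));        % near-equal group boundaries
initialCluster = zeros(nGroups, size(sortedMatrix, 2));
for g = 1:nGroups
    initialCluster(g, :) = mean(sortedMatrix(edges(g)+1:edges(g+1), :), 1);
end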
Moreover, the revised K-Means algorithm incorporates data vectorization into a 2D matrix, making Euclidean distance calculations as seen in Equation (4) more efficient. While Algorithm 2 relies on two ‘for’ loops to extract pixels from the image, the optimized Algorithm 4 streamlines this task using a single loop.
$$D_{xm} = \sqrt{\sum_{n=1}^{16} \left( P_{xn} - \mu_{mn} \right)^2}, \quad 1 \le x \le \mathrm{Row},\; 1 \le m \le K \quad (4)$$
Algorithm 4: Pseudocode for main classification
  • Parfor each Data chunk I = {I1, I2,I3,…,Id}
  • Reshape the image data into 2D array
  • For (x = 1:number_of_row)
  • Calculate the Euclidean distance between Px,n and each centroid μmn using Equation (4)
  • Assign Px,n to the cluster with the nearest centroid  (min(D(x,m)))
  • Calculate the partial sums for each cluster for updating centroids
  • End Parfor
As shown in Algorithm 4, the three-dimensional data cube is reshaped (vectorized) into a two-dimensional matrix. This modification eliminates one of the two nested ‘for’ loops in the pixel extraction and Euclidean distance calculation, as seen in Equation (4), while the other steps remain similar to those in Algorithm 2.
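A minimal MATLAB sketch of this vectorized assignment for a single data cube is shown below (variable names are illustrative); the reshape removes the nested row/column loops so that only a loop over the K centroids remains:

% Illustrative vectorized pixel assignment for one data cube (Algorithm 4 style).
% In practice the pixel matrix may be processed in chunks to limit memory use.
X = reshape(cube, [], 16);                     % (3712*3712) x 16 pixel matrix
K = size(centroids, 1);
D = zeros(size(X, 1), K);
for m = 1:K                                    % loop over clusters only
    D(:, m) = sqrt(sum((X - centroids(m, :)).^2, 2));   % Equation (4)
end
[~, label] = min(D, [], 2);                    % nearest centroid per pixel
labelImage = reshape(label, 3712, 3712);       % back to image geometry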

3.2.3. Parallelized Nearest-Neighbor Iteration Calculation Reduction Method

To address the computational inefficiency of the Standard K-Means algorithm, this study introduces a Nearest-Neighbor Iteration Calculation Reduction Method that enhances classification accuracy while minimizing redundant distance calculations. Unlike the Iteration Reduction Method proposed by Na et al. [41], which stores only the closest cluster for each pixel and recalculates distances during every iteration, the proposed method maintains the distances to the fifteen nearest clusters for each pixel. Assigning clusters based only on the nearest centroid can lead to errors, particularly in a dataset where the large spatial volume and heterogeneous land cover boundaries cause centroids to shift gradually and stabilize slowly over iterations. To address this, the fifteen nearest centroids for each pixel are stored, which helps improve accuracy by better reflecting the data’s distribution.
As shown in Algorithm 2, the distance calculation from each pixel to all k cluster centers during each iteration consumes a significant amount of execution time, particularly for large datasets like multispectral images. The core concept of this approach involves utilizing two fundamental data structures to store the closest cluster label and the corresponding distance of each pixel to its nearest cluster at every iteration, allowing this information to be reused in the subsequent iteration. For each pixel, its distance to the nearest cluster is stored. In the subsequent iteration, the algorithm first computes distances to the previously stored 15 nearest clusters. If a pixel is assigned to a cluster beyond the 12 nearest, or if the previous shortest distance exceeds the current shortest distance among the 15 neighbors, distances to all clusters are recalculated and the nearest clusters are updated accordingly. The 15–12 combination was selected on the basis of computational speed, since there was no difference in accuracy. An experiment with different combinations ((10,5), (15,10), and (15,12)) was conducted using a small dataset of 20 data cubes to determine which combination converges faster. The results show that saving 15 neighbors and assigning pixels among the 12 nearest neighbors achieves faster convergence. Although accuracy remains the primary objective, mean squared error (MSE) comparisons across the tested configurations showed negligible differences, supporting the use of the 15–12 setting as a more efficient choice without compromising clustering quality. Algorithm 5 presents the pseudocode of the proposed parallelized Nearest-Neighbor Iteration Calculation Reduction method.
Algorithm 5: Pseudocode for Parallelized Nearest-Neighbor Iteration Calculation Reduction Method
Input:I = {I1, I2, I3,….,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output:A set of k clusters.
  • Randomly select K initial cluster centers (centroids) from the 114 data cubes, where each centroid is represented by the matrix μmn as shown in Equation (1).
  • While max(|μmn,old − μmn,new|) ≥ 0.0005 do:
  • Parfor each Data chunk I = {I1, I2,I3,…,Id}
  • Reshape the image data into 2D array
  • For (x = 1: number_of_row)
  • Calculate the Euclidean distance between Pxn and each centroid μmn using Equation (2).
  • Reshape back to original dimension
  • Assign Pxyn to the cluster with the nearest centroid (min(D(x,y,m))).
  • Store the labels of 15 nearest cluster centers for pixel Pxyn in the arrays Cluster_Label[]. Separately store the distance of Pxyn to the nearest cluster in Closest_Dist[].
  • Set Cluster_Label[x,y] ← the indices of the 15 nearest clusters.
  • Set Closest_Dist[x,y] = D(Pxyn, μmn); (the minimum distance to the closest cluster).
  • Calculate the partial sums for each cluster for updating centroids
  • End Parfor
  • Update each centroid μmn: For each cluster m:
  • Calculate the new centroid using Equation (3) &
  • Update centroids: μmn = μmn_new
  • Repeat
  • Parfor each Data chunk I = {I1, I2,I3,…,Id}
  • Reshape the image data into 2D array
  • Calculate the distance between the pixel and the centroids of the 15 nearest clusters stored in Cluster_Label[].
  • Identify the rank of the currently assigned cluster within the 15 nearest clusters:
  • If the currently assigned cluster is within the top 12 nearest clusters, no recomputation is required.
  • If the currently assigned cluster is ranked between the 13th and 15th nearest clusters, or the pixel has not been assigned, then recompute the distances between the pixel and all cluster centroids.
  • End parfor;
  • For each cluster m, (1 ≤ m ≤ K), recalculate the centroid.
  • Until the convergence criterion is met: max(|previous centroid − new centroid|) < 0.0005
The first step of the algorithm involves calculating the Euclidean distance from each pixel to the initial centroids of all clusters using Equation (4). Each pixel is allocated to the nearest centroid, forming the initial pixel groups. For every pixel, the cluster assignment (Cluster_Label[]) and its proximity to the nearest cluster centroid (Closest_Dist[]) are recorded. Additionally, the labels of the 15 nearest cluster centers are stored in the array Cluster_Label[], while the distance to the nearest cluster is separately stored in Closest_Dist[]. As pixels are assigned to different clusters, the values of the cluster centroids may change. Centroids are iteratively recomputed for each cluster by averaging the pixel values within the respective cluster. Up to this point, the procedure is similar to Algorithm 2.
The next step is an iterative process that employs a heuristic method to improve efficiency. During the subsequent iteration, the algorithm focuses on optimizing cluster assignments by leveraging the distances between each pixel and the centroids of its fifteen nearest clusters. At the start of each iteration, the distances to these fifteen nearest centroids, stored from the previous iteration, are recalculated. If the pixel remains assigned to a cluster within its top twelve nearest clusters, no further computation is required for that pixel. This approach avoids unnecessary distance calculations to all the clusters, saving computation time.
However, if the pixel is assigned to a cluster ranked between the thirteenth and fifteenth nearest or the new distance to its nearest cluster exceeds the previously stored distance, the algorithm recalculates distances between the pixel and all cluster centroids. This ensures accurate reassignment to the closest cluster. Additionally, pixels marked for recomputation undergo a similar process to identify the new nearest cluster, and the Cluster_Label and Closest_Dist arrays are updated accordingly. This iterative refinement continues until the movement between centroids stabilizes, defined by a threshold of 0.0005, ensuring convergence.
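A per-pixel MATLAB sketch of this check is given below; cand holds the 15 nearest-centroid indices stored in the previous iteration, currentLabel is the pixel’s current cluster, and p is one 1 × 16 pixel vector (all names are illustrative):

% Illustrative reduced-computation check for one pixel.
dCand       = sqrt(sum((centroids(cand, :) - p).^2, 2));   % distances to the 15 candidates
[~, order]  = sort(dCand);
rankCurrent = find(cand(order) == currentLabel, 1);        % rank of the current assignment

if ~isempty(rankCurrent) && rankCurrent <= 12
    % Still among the 12 nearest candidates: keep the best of the 15 and
    % skip the full distance computation to all K centroids.
    newLabel = cand(order(1));
else
    % Ranked 13th-15th, or not found: recompute against all K centroids
    % and refresh the stored neighbour list and closest distance.
    dAll           = sqrt(sum((centroids - p).^2, 2));
    [dSorted, idx] = sort(dAll);
    newLabel       = idx(1);
    cand           = idx(1:15)';
    closestDist    = dSorted(1);
end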

3.2.4. Parallelized Dynamic K-Means Sharp Method

K-Means clustering is known to be sensitive to outliers, which can disproportionately influence centroid updates. Olukanmi et al. [37] proposed the K-Means Sharp method, which addresses this by modifying the Standard K-Means centroid update step to exclude outliers during centroid calculation. Their adaptation introduces a single global threshold value for all clusters. The original study noted that “the use of cluster-specific thresholds in contrast to a single global threshold may improve performance and is therefore worth exploring”.
In this study, a Dynamic K-Means Sharp method is implemented by extending the K-Means Sharp framework to use cluster-specific thresholds derived from spatial standard deviation measurements within each cluster. This adjustment addresses the limitations observed with a fixed global threshold when applied to large-scale, heterogeneous remote sensing image data cubes.
The approach is presented as a context-specific implementation decision intended to improve centroid stability during iterative updates under conditions of spatial variability and data complexity.
  • A. Outlier Detection
K-Means operates on the assumption that the data follow a Gaussian distribution, meaning that most inliers will fall within ±3 standard deviations of the centroid. This range corresponds to the area under the normal distribution curve within ±3σ of the mean, where most data points are expected to be located. Any point outside this range is considered an outlier.
For each band of a cluster, the spatial standard deviation was calculated. Consequently, the pixel-to-centroid distance threshold, T, which differentiates inliers from outliers, is defined as:
$$T(m,n) = 3\sigma_{mn} \quad (5)$$
where σmn is the spatial standard deviation of cluster m for feature n, with 1 ≤ m ≤ K (clusters) and 1 ≤ n ≤ 16 (features). This results in a cluster- and band-specific threshold, in contrast to the single global threshold proposed by Olukanmi et al. [37] that applies across all clusters.
  • B. New Centroid update
With an appropriate value of T calculated as per Equation (5), this method identifies inlier and outlier pixels for each spectral band based on their distance from the cluster mean (centroid) and the cluster’s standard deviation. Inliers are pixels that lie within 3 standard deviations of the mean for each feature, while outliers are those outside this range. For each cluster feature, the new centroid is recalculated by averaging only the inlier pixels, effectively minimizing the impact of outlier pixels on the centroid’s position.
$$\mu_{mn} = \frac{1}{\left| S_{mn} \right| - \left| S_{mn}^{o} \right|} \sum_{P_{mn} \in S_{mn} \setminus S_{mn}^{o}} P_{mn}, \quad 1 \le m \le K,\; 1 \le n \le 16 \quad (6)$$
where $S_{mn}^{o} \subseteq S_{mn}$ is the set of pixels $P_{mn}$ in $S_{mn}$ whose pixel-to-centroid distance satisfies $\left| P_{mn} - \mu_{mn} \right| > T(m,n)$.
The mean for each cluster can be calculated using the two approaches outlined in Equations (3) and (6). Both approaches define the new centroid as the mean of the pixels associated with each respective band and cluster. However, Equation (6) differs by excluding pixels that lie beyond a distance T from the cluster centroid. This exclusion refines the centroid calculation by reducing the impact of outlier pixels. A single global threshold failed to generalize effectively across clusters with distinct spatial characteristics, which motivated the cluster- and band-specific thresholds used here.
Pixel classification follows the same method as in Algorithm 2. The primary change in this approach is in the centroid update step. Instead of recalculating centroids by averaging all pixels within each cluster, this modified algorithm identifies and excludes potential outliers from the computation. Algorithm 6 presents the pseudocode of the proposed parallelized Dynamic K-Means Sharp method.
Algorithm 6: Pseudocode for Parallelized Dynamic K-Means Sharp Method
Input:I = {I1, I2, I3,….,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output:A set of k clusters.
  • Randomly select K initial cluster centers (centroids) from the 114 data cubes, where each centroid is represented by the matrix μmn as shown in Equation (1).
  • While max(|μmn,old − μmn,new|) ≥ 0.0005 do:
  • Parfor each Data chunk I = {I1, I2,I3,…,Id}
  • Reshape the image data into 2D array
  • For (x = 1: number_of_row)
  • Calculate the Euclidean distance between Pxn and each centroid μmn using Equation (4).
  • Reshape back to original dimension
  • Assign pixel Pxyn to the cluster with the nearest centroid (min(D(x,y,m))).
  • Calculate the partial sums for each cluster for updating centroids
  • END Parfor
  • Update each centroid μmn: For each cluster m:
  • Calculate the new centroid using Equation (3).
  • Update centroids: μmn = μmn_new
  • Calculate Spatial Standard deviation
  • For each data chunk, calculate the sum of squared difference for each band of clusters
  • squared_diff = (Meantile − NewCluster)^2
  • Accumulate the sum across all data cubes.
  • Compute the standard deviation for each cluster and band:
  • std_values = sqrt(sum_squared_diff/count)
  • Repeat
  • Parfor loop to update pixel assignments:
  • Compute inliers and outliers based on the spatial standard deviation:
  • For each cluster and band: get pixels in the current cluster and compute the absolute difference pixels and the cluster mean for each band.
  • inliers are those within 3 × std_value.
  • inliers = abs(cluster_pixel(:,feature)- mean_value) ≤ 3 × std_value
  • Update inlier count and calculate partial sums for the next centroid update:
  • Inlier_count(cluster,band) += sum(inliers)
  • MeanCluster(cluster,band) += sum(cluster_pixel(inliers,band))
  • Outlier count:
  • Outlier_count(cluster,band) = total_pixel_in_cluster-inlier_count(cluster,band)
  • End Parfor
  • Update the cluster centroids based on inliers using Equation (6).
  • Until the convergence criterion is met: max(|previous centroid − new centroid|) < 0.0005
In the dynamic K-Means sharp algorithm, the process starts by randomly initializing the cluster centroids from the data cubes. Each iteration assigns pixels from the data cubes to the nearest centroid based on Euclidean distance, computed in parallel using a “parfor” loop to handle the large dataset. Each pixel is assigned to the centroid with the minimum distance, and partial sums are calculated in parallel to gather the data needed for updating the centroid. After the parallel step, the partial sums from all data chunks are combined to compute new centroids for the clusters, and the algorithm checks for convergence. If the maximum change in the centroids is smaller than a threshold (0.0005), the algorithm converges. Otherwise, the centroids are updated, and the process repeats.
In later iterations, the algorithm incorporates spatial standard deviations for each cluster. After parallel processing of pixel assignments, the spatial standard deviation is calculated based on the squared differences between the current centroids and the pixel values for each band. Pixels within three standard deviations of the mean for each cluster and band are considered inliers, while those outside are considered outliers. The inliers are used to update the centroids using Equation (6). The algorithm continues this iterative process, recalculating the centroids and spatial standard deviations in each iteration until convergence.
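For one cluster, the band-specific screening and centroid update can be sketched in MATLAB as follows, where clusterPixels holds the pixels currently assigned to cluster m (one row per pixel, 16 bands); the names are illustrative and the accumulation across data chunks is omitted:

% Illustrative inlier-only centroid update with a cluster- and band-specific threshold.
sigma  = std(clusterPixels, 0, 1);                      % spatial std per band (1 x 16)
T      = 3 * sigma;                                     % Equation (5): 3-sigma threshold
inlier = abs(clusterPixels - centroids(m, :)) <= T;     % logical mask, per pixel and band
newMu  = centroids(m, :);
for n = 1:16
    if any(inlier(:, n))
        newMu(n) = mean(clusterPixels(inlier(:, n), n));  % Equation (6): inliers only
    end
end
centroids(m, :) = newMu;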

3.2.5. Parallelized Centroid Initialization and Dynamic K-Means Sharp Method

This combination integrates the Centroid Initialization Algorithm with the Dynamic K-Means Sharp algorithm to enhance accuracy. Detailed steps for this improved method are provided in Algorithm 7.
Algorithm 7: Centroid Initialization and Dynamic K-Means Sharp Method
Input: I = {I1, I2, I3,…,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output: A set of k clusters
Phase 1: Identify the initial cluster centroids by using Algorithm 3
Phase 2: Update centroid by using Algorithm 6
In Algorithm 7, the first phase systematically sets the initial centroids to form clusters with enhanced accuracy, as outlined in Section 3.2.2. The second phase applies a variant of the clustering method from Section 3.2.4. This phase begins with the initial clusters created from the uniform average of grouped pixels, as explained in Algorithm 3. These clusters are then refined using the Dynamic K-Means Sharp method, which updates centroids by excluding outliers. Detailed steps for these two enhanced methods are provided in Algorithms 3 and 6.

3.2.6. Parallelized Dynamic K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction Method

This approach combines Iteration Calculation Reduction with the Dynamic K-Means Sharp algorithm to enhance computational efficiency and accelerate convergence while maintaining clustering accuracy. The detailed steps of this method are provided in Algorithm 8.
Algorithm 8: Dynamic K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction Method
Input: I = {I1, I2, I3,….,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output: A set of k clusters
Phase 1: Assign each data point to the appropriate cluster by Algorithm 5.
Phase 2: Update centroid by using Algorithm 6
In Algorithm 8, the first phase employs a variant of the clustering method discussed in Section 3.2.3 to improve efficiency. The second phase utilizes a variant of the clustering method outlined in Section 3.2.4. The two phases of this enhanced method are detailed in Algorithms 5 and 6.

3.2.7. Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction Method

By merging the Centroid Initialization Algorithm and Iteration Calculation Reduction, this combination is designed to improve convergence rates while maintaining accurate cluster assignments.
In Algorithm 9, the first phase employs a variant of the clustering method discussed in Section 3.2.2, aimed at enhancing efficiency. In the second phase, another variant of the clustering method, outlined in Section 3.2.3, is used. Detailed descriptions of both phases of this enhanced method appear in Algorithms 3 and 5.
Algorithm 9: Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction Method
Input: I = {I1, I2, I3,….,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output: A set of k clusters
Phase 1: Identify the initial cluster centroids by using Algorithm 3
Phase 2: Assign each data point to the appropriate cluster by Algorithm 5

3.2.8. Parallelized Enhanced K-Means Method (PEKM)

This comprehensive combination integrates all three variants: the Centroid Initialization Method, Nearest-Neighbor Iteration Calculation Reduction Method, and Dynamic K-Means Sharp Method. By leveraging the strengths of each approach, the method ensures efficient and improved clustering quality, maximizing both performance and precision.
In Algorithm 10, the initial centroids are systematically determined, as described in Section 3.2.2. The second phase uses a variant of the clustering method discussed in Section 3.2.3, while the third phase applies another variant outlined in Section 3.2.4. The three phases of this enhanced method are detailed in Algorithms 3, 5, and 6.
Algorithm 10: Centroid Initialization Method, Dynamic K-Means Sharp, and Nearest-Neighbor Iteration Calculation Reduction Method
Input: I = {I1, I2, I3,….,Id}//set of d data cube (114 data cube with size 3712 × 3712 × 16)
K//Number of desired clusters
Output: A set of k clusters
Phase 1: Determine the initial centroids of the clusters by using Algorithm 3
Phase 2: Assign each data point to the appropriate cluster by Algorithm 5.
Phase 3: Update centroid by using Algorithm 6

3.3. Convergent Criteria

In K-Means clustering, pixels are assigned to clusters and centroid updates continue until specific stopping criteria are met. Possible criteria include the following:
  • Reaching a predetermined maximum number of iterations.
  • Having fewer pixel reassignments per iteration than a set threshold.
  • Centroid shifts fall below a specified distance threshold during an update cycle.
For this project, convergence was determined using a centroid migration threshold of 0.0005 (0.05 reflectance units), a value adopted from the work of Shrestha et al. [4], where it was shown to effectively optimize cluster compactness.

3.4. Performance Metrics

As datasets in remote sensing continue to grow, reducing computational costs has become crucial due to the increasing demand for computational power. The methodology section outlines the various methods implemented to address this challenge. The model’s effectiveness was evaluated using the following performance metrics.

3.4.1. Convergence Speed

The iteration count required for the algorithm to meet the threshold condition represents a primary performance metric. Evaluating each parallel K-Means variant reveals which version converges with the fewest iterations.

3.4.2. Clustering Quality

Clustering quality in K-Means can be assessed by evaluating the compactness of clusters based on the algorithm’s objective function. In the context of high-volume, high-spatial-dimensional satellite imagery comprising over 15.7 billion 16-dimensional data points, external validation metrics such as clustering accuracy are not applicable due to the absence of ground truth, and internal validation indices based on pairwise distances (e.g., Silhouette Coefficient [51], Calinski–Harabasz Index [52]) are computationally infeasible. Root Mean Square Error (RMSE) [53] was therefore selected as a computationally efficient and scalable metric that directly reflects clustering compactness and the minimization objective of K-Means. This approach is consistent with best practices outlined in Franti et al. [54], which recommend objective-function-based evaluation when external references are unavailable and cluster quality must be assessed under large-scale unsupervised learning scenarios [19,53]. The RMSE is calculated through the following steps (a brief MATLAB sketch follows the symbol definitions below):
  • Squared difference calculation: $\mathrm{squared\_diff} = \left( v_{ij} - c_{ik} \right)^2$
  • Update the MSE for the cluster: $\mathrm{MSE}_{ik} = \mathrm{MSE}_{ik} + \sum_{j=1}^{N_i} \mathrm{squared\_diff}_j$
  • Update the pixel count for the cluster: $\mathrm{pixel\_count}_i = \mathrm{pixel\_count}_i + N_i$
  • Compute the average MSE for the cluster and band: $\mathrm{MSE}_{ik} = \mathrm{MSE}_{ik} / \mathrm{pixel\_count}_i$
  • Compute the Root Mean Square Error: $\mathrm{RMSE}_{ik} = \sqrt{\mathrm{MSE}_{ik}}$
  • Compute the average Root Mean Square Error: $\mathrm{Avg\_RMSE} = \overline{\mathrm{RMSE}_{ik}}$ (mean over all clusters $i$ and bands $k$)
where
  • Vij: The jth valid pixel in the ith cluster.
  • Cik: The centroid value for the kth band of the ith cluster.
  • Ni: The number of valid pixels in the ith cluster.
  • C: Total number of clusters.
  • B: Total number of bands.
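The sketch below illustrates the computation, under the assumption that the valid pixels of all data cubes have been gathered into a single (n, B) array with per-pixel cluster labels; variable names are illustrative only, not the implementation used in this study.

```python
import numpy as np

def average_rmse(pixels, labels, centroids):
    """Sketch of the Avg_RMSE compactness metric defined above.

    pixels    : (n, B) array of valid pixel values.
    labels    : (n,) array of cluster indices.
    centroids : (C, B) array of cluster centroids.
    """
    C, B = centroids.shape
    rmse = np.full((C, B), np.nan)
    for i in range(C):
        members = pixels[labels == i]                # valid pixels of cluster i
        if len(members) == 0:
            continue                                 # empty clusters are skipped
        sq_diff = (members - centroids[i]) ** 2      # squared differences per band
        mse = sq_diff.sum(axis=0) / len(members)     # average MSE per cluster and band
        rmse[i] = np.sqrt(mse)                       # RMSE_ik
    return np.nanmean(rmse)                          # mean over all clusters and bands
```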

3.4.3. Computational Efficiency

Evaluating computational efficiency involves measuring the execution time required for each parallelized K-Means algorithm variant to reach the specified convergence threshold. A comparison of execution times highlights the most efficient algorithms.

3.5. Comparative Analysis

This study provides a comparison of parallelized, modified versions of K-Means clustering algorithms based on selected performance metrics. Experiments were conducted on identical datasets to ensure consistency and enable a direct comparison of algorithm performance.

4. Evaluation and Experimental Results

This study primarily aims to enhance the standard K-Means clustering algorithm by accelerating its convergence, improving computational efficiency, and maintaining similar clustering compactness. This section discusses the findings derived from optimized variants of the K-Means clustering algorithm, as described in Section 3.2. The evaluation includes an analysis of convergence speed, clustering quality, and computational efficiency for each optimization technique described in Section 3.4, with a focus on their effectiveness in handling high-spatial-dimensional satellite imagery.

4.1. Convergence Speed Comparison

Convergence speed is a critical metric for assessing the efficiency of the parallelized K-Means algorithms, measured by the number of iterations needed to achieve convergence. The criterion for convergence is the displacement of the centroids between the previous and the most recently computed centroid positions.
For clarity in figures and tables, abbreviated method names are used. The corresponding full method descriptions are provided in Figure 1.
Figure 6 illustrates the convergence speed of the parallelized modified K-Means algorithms over iterations. On the graph, the x-axis indicates the number of iterations, whereas the y-axis depicts the displacement of centroids between consecutive iterations. Each line corresponds to a specific parallelized modified K-Means algorithm. The results also demonstrate that the centroids stabilize as the iterations progress. Please refer to Figure 1 for descriptions of the abbreviated methods.
The Parallelized Standard K-Means algorithm (blue line with diamond markers) failed to converge within the tested iterations. However, incorporating improvements such as selecting the initial centroids deterministically, applying Dynamic K-Means Sharp to exclude outliers from the centroid calculation, and reusing each pixel’s previously assigned cluster and distance to reduce the computational cost of pixel classification enabled convergence of the Parallelized Standard K-Means algorithm under specific conditions. The following observations were made regarding the convergence of the modified algorithms, with the convergence threshold set to 0.0005. Table 2 details the number of iterations taken by each method.
Two observations can be drawn from these results: first, the Parallelized Centroid Initialization method had the greatest impact in allowing the problem to converge, and second, the Parallelized Dynamic K-Means Sharp method had the largest impact on iteration reduction. These results highlight the efficacy of the modifications in improving the convergence behavior of the Parallelized Standard K-Means algorithm; however, these methods also deliver different levels of efficiency gains, as shown below.

4.2. Computational Efficiency Comparison

Efficiency is a key metric for assessing the practical applicability of the modified K-Means clustering algorithms. The bar graph shown in Figure 7 compares the computational efficiency of the eight algorithms on the dataset described in Section 3. The y-axis indicates the total execution time in hours, while the x-axis lists the algorithms under evaluation. Please refer to Figure 1 for descriptions of the abbreviated methods.
Table 3 reports the total execution time taken by each method. The results reveal that the Parallelized Standard K-Means method failed to converge, so no measurable computational efficiency can be reported for this baseline approach. Among the improved algorithms, PEKM demonstrated superior computational efficiency, achieving the lowest execution time required for convergence. This method outperformed the other algorithms by reducing the computational complexity, as reflected in its total execution time of approximately 4230 h.
Although the Parallelized Centroid Initialization and Dynamic K-Means Sharp method and the Parallelized Dynamic K-Means Sharp method converged in fewer iterations, their execution times of 5237 h and 6341 h, respectively, exceeded the 5130 h required by the Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction method. These results highlight the critical role of the Nearest-Neighbor Iteration Calculation Reduction method in minimizing the computational overhead of pixel classification.
Overall, the results underscore the practical advantages of incorporating Nearest-Neighbor Iteration Calculation Reduction, which not only accelerates convergence but also ensures the computational feasibility of the clustering process for large datasets.

4.3. Clustering Quality Evaluation

The RMSE of cluster compactness serves as a critical metric for assessing cluster quality, as it reflects the effectiveness of cluster formation. Figure 8 illustrates the clustering quality of each parallelized modified K-Means algorithm outlined in Section 3.2 by evaluating the compactness of the generated clusters. Compactness is quantified through the Average Root Mean Square Error, computed using the methodology detailed in Section 3.4.2. Please refer to Figure 1 for descriptions of the abbreviated methods. Table 4 lists the RMSE of each method.
The results indicate that the compactness of all the modified algorithms is similar, with minimal differences between them, as shown in Figure 8.
The minimal variation in RMSE across the algorithms demonstrates comparable clustering quality. This suggests that while the methods vary in their computational efficiencies, their ability to form compact and well-defined clusters remains consistent.

4.4. Image Cluster Output

To further assess the consistency of clustering results across all methods, a specific chip classified as sand is displayed in Figure 9a. Differences between each method’s clustering and the reference classification shown in Figure 9b, the output of the Centroid Initialization method (which improves centroid selection over Standard K-Means clustering), are presented in the subsequent subfigures. Each subfigure in Figure 9 illustrates the deviation of a different classification approach from the reference. With the exception of minor random variations, the results demonstrate a high degree of similarity across methods. A noticeable exception is observed in Figure 9f, corresponding to the Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction method, where full agreement with the reference classification was not achieved. Despite this visually detectable difference, all methods exhibit broadly consistent classification outcomes. These results support evaluating the methods primarily on performance metrics, such as convergence efficiency and computational scalability, rather than on differences in clustering compactness. To supplement these results with another target, the “water” category is shown in Appendix A. Please refer to Figure 1 for descriptions of the abbreviated methods.
Finally, this study verified that the clustering results generated by each parallelized modified K-Means algorithm were nearly indistinguishable. As illustrated in Figure 9, the segmentation masks for sand exhibit similar pixel distributions across all methods, indicating minimal variation in cluster assignment arising from randomly chosen initial centroids. This variation was observed in methods such as Dynamic K-Means Sharp, the Nearest-Neighbor Iteration Calculation Reduction method, and their combined variants, which begin with randomly selected centroids. Additionally, the RMSE values presented in Table 4 further support this observation. The Parallelized Centroid Initialization method achieved an RMSE of 0.01417 and the Parallelized Nearest-Neighbor Iteration Calculation Reduction method reported 0.01399, while the Parallelized Dynamic K-Means Sharp method recorded 0.01332. The Parallelized Centroid Initialization and Dynamic K-Means Sharp method yielded an RMSE of 0.01375, the Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction method reported 0.01412, the Parallelized Dynamic K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction method yielded 0.01333, and PEKM recorded 0.01364. The small differences in RMSE values across methods indicate that the final clustering results remain consistent. Building on these findings, a new optimized approach, termed PEKM, was developed. The PEKM method introduces three key improvements: (1) the use of deterministic techniques for centroid initialization, replacing reliance on random selection; (2) the exclusion of outliers during the computation of new centroids; and (3) enhanced computational efficiency through a simple data structure that retains information from each iteration for subsequent processing. The proposed PEKM method is particularly suited for application in remote sensing, including the unsupervised clustering of global high-spatial-dimensional satellite imagery and the generation of training data for supervised or deep learning-based classification tasks.

5. Discussion

This study presented the Parallelized Enhanced K-Means (PEKM) algorithm, developed to address the scalability challenges of standard K-Means clustering when applied to high-spatial-dimensional satellite image datasets. By integrating three algorithmic improvements (Centroid Initialization, Dynamic K-Means Sharp, and Nearest-Neighbor Iteration Calculation Reduction) alongside parallel computing on CPU nodes, PEKM effectively reduced computational cost while maintaining clustering quality. The combined approach demonstrated stable performance across diverse land cover types and provided a scalable solution suitable for remote sensing applications where large unlabeled datasets are common.
One of the limitations of the standard K-Means algorithm is its reliance on random centroid initialization, which can lead to poor convergence and entrapment in local minima. Prior studies have sought to address this by proposing more systematic initialization strategies. Khan et al. [33] introduced a weighting-based approach coupled with normalization to improve centroid selection, yet its evaluation on large-scale, high-spatial-dimensional datasets such as satellite imagery remains limited. Chan et al. [34] further refined initialization for high-spatial-dimensional scenarios by combining kernel K-Means++ with random projections, successfully reducing computational complexity from O(n²D) to O(n²d). However, while this reduced dimensionality, it did not address the scalability challenges related to large sample sizes (n). Building upon these efforts, our approach proposes a uniform average pixel grouping strategy, specifically designed to improve initialization robustness for large-scale satellite imagery. The Centroid Initialization method is a one-time process performed prior to iterative clustering. It involves grid-based sampling and aggregation of pixel values, followed by sorting and mean calculation. Since the number of sampled points is substantially smaller than the total dataset size, this yields a time complexity of O(gd + g log g), where g is the number of sampled grid points and d is the number of spectral bands, keeping the initialization overhead minimal relative to the overall clustering process. Additionally, future work could explore incorporating Java-based parallel Random Swap [55,56], a wrapper around K-Means that periodically replaces centroids to escape local minima, particularly effective when initial centroid selection is suboptimal.
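A minimal sketch of such a one-time, grid-based initialization is given below. The grid step, the use of the mean spectral value as the sort key, and the equal-sized grouping are assumptions introduced for illustration rather than the exact rule used in this study.

```python
import numpy as np

def grid_initialize_centroids(image_cube, k, grid_step=64):
    """Sketch of a one-time, grid-based centroid initialization.

    image_cube : (rows, cols, bands) array.  Pixels sampled on a regular grid are
    sorted by mean spectral value, split into k equal groups, and each group's
    mean becomes an initial centroid (requires g >= k sampled points).
    """
    bands = image_cube.shape[2]
    samples = image_cube[::grid_step, ::grid_step, :].reshape(-1, bands)   # O(g d)
    order = np.argsort(samples.mean(axis=1))                               # O(g log g)
    groups = np.array_split(samples[order], k)
    return np.array([g.mean(axis=0) for g in groups])                      # one centroid per group
```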
Outlier sensitivity, another critical challenge in K-Means clustering, has been the focus of several enhancements. Olukanmi et al. [37] developed K-Means Sharp, employing a global threshold to detect outliers based on point-to-centroid distances. Although effective for small datasets, this global threshold approach may overlook cluster-specific variation, especially in heterogeneous land cover datasets. The CSK-Means and NSK-Means algorithms further refined this by incorporating density-based outlier detection and assigning outliers to a separate cluster, improving performance in non-spherical and overlapping cluster scenarios [38]. However, their reliance on iterative steps for final cluster determination increases computational costs for large-scale datasets. Recent anomaly detection frameworks such as ResAD [39] offer potential for addressing high-spatial-dimensional outlier challenges through residual feature learning, but their integration with clustering algorithms like K-Means remains to be fully explored. In this study, Dynamic K-Means Sharp is introduced as a dynamic, cluster-specific thresholding mechanism, as described in Section 3.2.4, which is an advancement of K-Means Sharp. The Dynamic K-Means Sharp method introduces additional computations of spatial standard deviations, maintaining an overall complexity of O(nd) per iteration while improving centroid stability, thereby lowering the total number of iterations required for convergence, as seen in Section 4.1. This advancement is particularly advantageous for complex satellite imagery, where outlier distributions vary across land-cover types.
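The sketch below illustrates a cluster-specific (dynamic) outlier threshold applied during the centroid update. The particular threshold rule shown here, the mean point-to-centroid distance plus three standard deviations per cluster, is an assumption for illustration and may differ from the exact rule used by Dynamic K-Means Sharp.

```python
import numpy as np

def robust_centroid_update(pixels, labels, centroids, n_std=3.0):
    """Sketch of a per-cluster outlier threshold for the centroid update."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = pixels[labels == i]
        if len(members) == 0:
            continue
        dist = np.linalg.norm(members - centroids[i], axis=1)
        cutoff = dist.mean() + n_std * dist.std()      # cluster-specific threshold
        inliers = members[dist <= cutoff]              # exclude outliers from the update
        if len(inliers) > 0:
            new_centroids[i] = inliers.mean(axis=0)
    return new_centroids
```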
Furthermore, the high computational cost of K-Means, caused by exhaustive distance calculations at every iteration, has driven the development of methods aimed at reducing redundant computations. Shi et al. [41] introduced a data structure to retain essential cluster assignment information between iterations, while Wang et al. [42] identified active boundary points to optimize reassignment steps. Moodi et al. [43] extended this direction by using equidistance thresholds to eliminate certain distance checks. Nonetheless, these methods face challenges in accurately handling overlapping clusters or extremely large datasets. To overcome these limitations, our proposed Nearest-Neighbor Iteration Calculation Reduction method maintains a set of the 15 nearest clusters for each pixel, balancing computational efficiency with classification accuracy. This design mitigates the risk of misclassification near cluster boundaries, supports scalability for high-spatial-dimensional satellite data, and offers the most significant improvement in reducing redundant distance calculations. By maintaining distances to the 15 nearest clusters for each point, the method reduces the exhaustive O(nkd) distance computation to O(nd) per iteration for the majority of points, where n denotes the number of data points (pixels), k is the number of clusters, and d represents the number of spectral dimensions. Only a small subset of points, denoted by a fraction ε, requires full recomputation, with complexity O(εnkd). This selective updating mechanism substantially decreases overall execution time, as reflected in Section 4.2. Future work could explore integrating centroid activity detection strategies, which accelerate clustering by skipping inactive centroids during distance computations [57].
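The sketch below illustrates the candidate-list idea: distances are computed only against each pixel's stored nearest clusters, and the lists are re-ordered with the new distances. How and when the candidate lists are refreshed (the fraction ε of pixels requiring a full recomputation) is omitted here and is an assumption of this illustration, not the exact bookkeeping used in this study.

```python
import numpy as np

def assign_with_candidates(pixels, centroids, candidates):
    """Sketch of the Nearest-Neighbor Iteration Calculation Reduction idea.

    candidates : (n, m) array of each pixel's m (e.g., 15) nearest cluster indices
    from the last full distance computation.  Distances are evaluated only against
    these candidates, O(n m d) instead of O(n k d).
    """
    n = len(pixels)
    cand_centroids = centroids[candidates]                           # (n, m, d)
    d = np.linalg.norm(pixels[:, None, :] - cand_centroids, axis=2)  # (n, m)
    best = d.argmin(axis=1)
    labels = candidates[np.arange(n), best]                          # nearest candidate cluster
    order = np.argsort(d, axis=1)                                    # keep candidate lists ordered
    candidates = np.take_along_axis(candidates, order, axis=1)
    return labels, candidates
```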
By aggregating these algorithmic enhancements, the overall time complexity of the PEKM algorithm is significantly improved compared with the Parallelized Standard K-Means algorithm. While standard K-Means requires O(inkd) operations, where i is the number of iterations to converge, n is the number of pixels, d is the number of spectral features, and k is the number of classes, PEKM achieves a cumulative complexity of O(nkd + i(nd + εnkd)), where ε represents the proportion of pixels requiring full recomputation during each iteration. The space complexity remains O(nd), comparable to standard K-Means, with additional storage for maintaining nearest-neighbor distances that scales linearly with the dataset size. These improvements are particularly valuable for large-scale satellite image clustering, where minimizing per-iteration computation and total iteration count directly translates into substantial reductions in overall processing time.
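In display form, and using the symbols defined in the text, the comparison above can be summarized as:

\[
T_{\text{standard}} = O(i\,n\,k\,d), \qquad
T_{\text{PEKM}} = O\!\left(n k d + i\left(n d + \varepsilon\, n k d\right)\right), \qquad
S_{\text{PEKM}} = O(n d), \quad 0 < \varepsilon \ll 1 .
\]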
The results reinforce the theoretical complexity analysis. PEKM achieved the fastest convergence, completing in 234 iterations and approximately 4230 processing hours, outperforming the other methods: Parallelized Centroid Initialization (348 iterations, 9965 h), Parallelized K-Means Sharp (282 iterations, 6341 h), Parallelized Nearest-Neighbor Iteration Calculation Reduction (326 iterations, 5860 h), Parallelized Centroid Initialization and K-Means Sharp (257 iterations, 5237 h), Parallelized K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction (288 iterations, 5243 h), and Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction (292 iterations, 5130 h). Besides computational efficiency, cluster compactness was measured via RMSE. Cluster compactness serves as a measure of clustering consistency, indicating the extent to which different methods identify similar pixel groupings. Cluster compactness showed negligible differences across all algorithms: Parallelized Centroid Initialization (0.0141), Parallelized K-Means Sharp (0.0133), Parallelized Centroid Initialization and K-Means Sharp (0.0137), Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction (0.0141), and PEKM (0.0136). Because cluster compactness was similar across the different versions of the K-Means algorithm, many practitioners may base their choice of algorithm on iteration count and execution time. Users prioritizing minimal iterations (234) and lower execution time (4230 h) could adopt the full combination, PEKM, whereas those focusing strictly on achieving the smallest RMSE might select Parallelized K-Means Sharp alone (0.0133) despite its higher computational cost. These results highlight that computational efficiency and convergence speed, rather than cluster quality, differentiated the algorithms.
Further analysis of parameter influence highlights critical considerations for PEKM scalability. As expected, increasing the number of clusters K proportionally increases the initial computational load due to the spectral distance calculations between each pixel and each centroid. However, the nearest-neighbor reduction strategy mitigates this impact during subsequent iterations by focusing computations on a limited candidate set. Additionally, maintaining the number of nearest neighbors at 15 proved to be an effective balance between computational cost and clustering compactness; this number needs to be fine-tuned, as explained in Section 3.2.3. The convergence threshold, fixed at 0.05 reflectance units in this study, provides a pragmatic balance between processing time and classification stability, though future studies could explore adaptive thresholding strategies for further optimization.
In terms of operational application, the PEKM algorithm addresses long-standing challenges in clustering reliability and scalability for satellite sensor calibration workflows. Previous studies have leveraged K-Means clustering to identify globally distributed Extended Pseudo Invariant Calibration Sites (EPICS), which are essential for radiometric calibration across multiple satellite missions [4,5,9,58]. Fajardo et al. [5] applied K-Means clustering exclusively to temporally stable pixels to generate globally distributed EPICS clusters for radiometric calibration efforts. Fajardo et al. [59] also applied K-Means clustering to temporally stable pixels to define clusters such as Cluster 13 Global Temporally Stable, which improved temporal consistency and expanded calibration opportunities. In a related study, Shah et al. [9] utilized K-Means-derived EPICS clusters over North Africa to support an expanded Trend-to-Trend cross-calibration technique. Despite the operational use of K-Means in these works, convergence was not consistently achieved, an issue acknowledged implicitly through reliance on prefiltered or heuristically selected cluster outputs. This lack of convergence introduces potential uncertainty in cluster reproducibility, which is critical for calibration accuracy and long-term sensor monitoring.
The PEKM algorithm introduced in this study offers a foundational improvement in this context. By integrating enhanced centroid initialization, dynamic outlier handling, and iteration calculation reduction, along with asynchronous parallel execution on CPU-based HPC systems, PEKM ensures convergence even when applied to high-spatial-dimensional, global-scale datasets. The algorithm was validated using 114 data cubes (each measuring 3712 × 3712 × 16), with successful convergence and consistent cluster compactness. In contrast to earlier K-Means implementations, PEKM provides a repeatable and scalable framework suitable for the annual image clustering needs of calibration-focused laboratories. With deployments planned across more than 9512 data cubes, the algorithm leverages more than 3000 CPU cores asynchronously within a high-performance computing cluster, enabling robust classification outputs for global radiometric calibration. This advancement not only reduces algorithmic uncertainty but also enhances the reliability of EPICS generation for current and future satellite missions.
The current configuration of 16-layer input data, encompassing temporal mean bands and temporal standard deviations, was maintained to ensure methodological continuity with prior datasets and to preserve model robustness for complex data distributions. These 16 layers are derived from the eight optical bands of the Landsat 8 OLI sensor, with each band processed to generate both a temporal mean and a temporal standard deviation, as outlined in the methodology of Shrestha et al. [4] and Fajardo et al. [5]. This approach captures long-term reflectance behavior across diverse global land cover types, providing an input structure for unsupervised clustering. However, the inclusion of all bands may introduce inter-band redundancy, potentially impacting clustering stability and increasing computational load. Prior studies in land use and land cover classification have shown that not all spectral bands contribute equally to classification performance. For instance, Yu et al. [60] identified optimal three- and four-band combinations (Bands 4, 5, 6 and Bands 1, 2, 5, 7, respectively) that achieved accuracy comparable to full-band inputs, suggesting the potential for dimensionality reduction without performance loss. Future research could incorporate correlation analysis and feature selection to reduce the number of input layers, thereby mitigating the curse of dimensionality while preserving the spectral and spatial characteristics of global land cover for robust unsupervised clustering.
Looking forward, several enhancements could extend the scalability and applicability of the PEKM algorithm. Integration of GPU acceleration [40,47,48,61,62], approximate nearest neighbor (ANN) search methods [63,64,65], and mini-batch K-Means mechanisms [66,67,68] may offer further computational gains. Where GPUs are available, Han et al. [62] demonstrated that GPU-based parallelization of K-Means clustering significantly accelerated processing, achieving a 25.53× speedup over sequential processing and up to a 39× speedup with advanced GPU optimizations. Their findings underscore the potential of GPU-enabled clustering for handling extensive satellite imagery datasets. However, as outlined in Section 1.2.3, the current HPC environment comprises limited GPU resources, only 14 nodes, while offering over 3000 CPU cores. This constraint guided the strategic focus on CPU-based parallel optimization, ensuring full utilization of available computational resources for large-scale clustering. Approximate nearest neighbor techniques, as surveyed by Andoni et al. [63], have been shown to significantly reduce computational cost in high-spatial-dimensional spaces by avoiding exhaustive distance calculations. Incorporating ANN within the PEKM framework could further improve clustering speed at the cost of a small loss in accuracy, particularly during the distance assignment phase. Additionally, mini-batch K-Means, explored by Peng et al. [66], offers an efficient alternative for large datasets; however, its speed advantages may come at the cost of reduced global clustering accuracy, a trade-off that must be evaluated in the context of PEKM. While this study achieved promising results, some limitations should be noted. Although ANN and mini-batch K-Means improve computational efficiency, they may reduce clustering quality, and future research should evaluate their accuracy relative to PEKM. Furthermore, the choice of 160 clusters was based on experimental analysis and the results of the initial work by Shrestha et al. [4], which showed clusters with a compactness value of approximately 0.01 to 0.015 reflectance units, the target for this experiment; further analysis and updates of this choice may have influenced the outcomes. Future work should explore more systematic methods for determining the optimal number of clusters. Additionally, instead of the Euclidean distance metric employed in this project, Aggarwal et al. [69] suggest that fractional distance metrics can significantly improve clustering effectiveness in high-spatial-dimensional spaces.
Beyond algorithmic developments, the outputs of PEKM offer potential utility beyond unsupervised classification. Clustered outputs representing global land cover types could serve as valuable inputs for fine-tuning pre-trained Vision Transformer (ViT) models, enriching feature representation for downstream tasks. While this remains a prospective research direction, integrating clustering results with deep learning frameworks presents an opportunity to enhance model performance in remote sensing applications.
The PEKM algorithm presents a balanced solution for large-scale satellite image clustering, optimizing computational efficiency while maintaining clustering accuracy. Its design is particularly suited to resource-constrained environments emphasizing CPU parallelization and to applications requiring high clustering precision, such as global land cover mapping and satellite sensor calibration. Through continued refinement and integration with emerging technologies, PEKM holds promise for advancing large-scale remote sensing data analysis.

6. Conclusions

This study aimed to address the scalability limitations and convergence challenges of traditional K-Means clustering when applied to high-spatial-dimensional, large-scale satellite imagery datasets. By implementing and evaluating the eight parallelized and optimized K-Means algorithms described in Section 3, the most efficient and accurate solution for unsupervised classification tasks was identified. Key findings indicate that the PEKM algorithm demonstrated the best overall performance in terms of computational efficiency and convergence speed. PEKM achieved the fastest convergence, completing in 234 iterations with a total execution time of 4230 h, while maintaining cluster quality comparable to the other optimized variants. However, if cluster quality is the primary consideration, users may prefer the Dynamic K-Means Sharp algorithm, despite its higher computational cost.
The proposed enhancements, centroid initialization, dynamic outlier mitigation, and reduced iteration calculations, were integrated within a CPU-parallel framework, making PEKM well suited for environments with limited GPU resources. This advancement enables more efficient classification of multispectral and hyperspectral satellite data, supporting applications such as global land cover mapping, sensor calibration site detection, and spatiotemporal pattern analysis.
Furthermore, the scalable design of PEKM facilitates its deployment in high-performance computing environments, enabling large-scale unsupervised classification of more than 9000 datacubes. These results provide a foundation for further integration of clustering techniques into operational remote sensing workflows and future adaptation with GPU acceleration, approximate nearest neighbor search, and hybrid clustering frameworks.
Future work could extend the scalability and effectiveness of PEKM by exploring the integration of GPU acceleration, approximate nearest neighbor (ANN) search, and hybrid clustering frameworks such as mini-batch K-Means to further reduce runtime. In particular, adapting centroid activity detection strategies, random swap, and fractional distance metrics may yield additional performance improvements, especially in high-spatial-dimensional spaces. Moreover, dimensionality reduction through spectral band selection or feature correlation analysis may help mitigate inter-band redundancy and the curse of dimensionality without sacrificing clustering quality. Prior studies have shown that subsets of spectral bands can achieve classification accuracy comparable to full-band inputs, suggesting that more efficient feature selection methods can enhance both performance and interpretability.
The computational advancements presented here enable efficient analysis of multispectral and hyperspectral data, supporting applications such as global land cover classification. Beyond classification tasks, the optimized algorithm can facilitate calibration processes to identify stable environmental targets and differentiate land cover types on a global scale. By improving the computational efficiency of K-Means, this work empowers researchers to conduct large-scale exploratory analyses with preliminary datasets. These contributions establish a foundation for advancing clustering algorithm optimization in distributed computing environments and high-spatial-dimensional data research.

Author Contributions

Conceptualization, Y.R.P. and L.L.; Data curation, Y.R.P., L.L. and J.F.R.; Formal analysis, Y.R.P., L.L. and J.F.R.; Methodology, Y.R.P., L.L. and J.F.R.; Resources, L.L.; Software, Y.R.P. and L.L.; Visualization, Y.R.P.; Writing—original draft, Y.R.P.; Writing—review and editing, L.L. and J.F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by USGS EROS (grant number SA2500150).

Data Availability Statement

Landsat-8 image courtesy of the U.S. Geological Survey, Google Earth Engine.

Acknowledgments

The authors would like to thank the reviewers for their valuable input, and the Image Processing Laboratory for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Cluster-based Masked Image of “Water” Generated from Parallelized modified K-Means Cluster: (a) Input Image (b) from Parallelized Centroid Initialization method (Reference) (c) Pixel difference between Reference and Parallelized Dynamic K-Means Sharp Method (d) Pixel difference between Reference and Parallelized Nearest-Neighbor Iteration Calculation Reduction Method (e) Pixel difference between Reference and Parallelized Centroid Initialization and Dynamic K-Means Sharp Method (f) Pixel difference between Reference and Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction Method (g) Pixel difference between Reference and Parallelized Dynamic K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction Method (h) Pixel difference between Reference and PEKM.

References

  1. Ao, S.I. World Congress on Engineering: WCE 2012: 4–6 July, 2012, Imperial College London, London, U.K.; Newswood Ltd.; International Association of Engineers: Champaign, IL, USA, 2012; ISBN 9789881925138. [Google Scholar]
  2. Wulder, M.A.; Roy, D.P.; Radeloff, V.C.; Loveland, T.R.; Anderson, M.C.; Johnson, D.M.; Healey, S.; Zhu, Z.; Scambos, T.A.; Pahlevan, N.; et al. Fifty Years of Landsat Science and Impacts. Remote Sens. Environ. 2022, 280, 113195. [Google Scholar] [CrossRef]
  3. Dutta, S.; Das, M. Remote Sensing Scene Classification under Scarcity of Labelled Samples—A Survey of the State-of-the-Arts. Comput. Geosci. 2023, 171, 105295. [Google Scholar] [CrossRef]
  4. Shrestha, M.; Leigh, L.; Helder, D. Classification of North Africa for Use as an Extended Pseudo Invariant Calibration Sites (EPICS) for Radiometric Calibration and Stability Monitoring of Optical Satellite Sensors. Remote Sens. Environ. 2019, 11, 875. [Google Scholar] [CrossRef]
  5. Fajardo Rueda, J.; Leigh, L.; Teixeira Pinto, C.; Kaewmanee, M.; Helder, D. Classification and Evaluation of Extended Pics (Epics) on a Global Scale for Calibration and Stability Monitoring of Optical Satellite Sensors. Remote Sens. Environ. 2021, 13, 3350. [Google Scholar] [CrossRef]
  6. Usman, B. Satellite Imagery Land Cover Classification Using K-Means Clustering Algorithm: Computer Vision for Environmental Information Extraction. Elixir Int. J. Comput. Sci. Eng. 2013, 63, 18671–18675. [Google Scholar]
  7. Yasin, H.E.E.; Kornel, C. Evaluating Satellite Image Classification: Exploring Methods and Techniques. In Geographic Information Systems-Data Science Approach; IntechOpen: London, UK, 2024. [Google Scholar]
  8. Kaewmanee, M.; Leigh, L.; Shah, R.; Gross, G. Inter-Comparison of Landsat-8 and Landsat-9 during On-Orbit Initialization and Verification (OIV) Using Extended Pseudo Invariant Calibration Sites (EPICS): Advanced Methods. Remote Sens. 2023, 15, 2330. [Google Scholar] [CrossRef]
  9. Shah, R.; Leigh, L.; Kaewmanee, M.; Pinto, C.T. Validation of Expanded Trend-to-Trend Cross-Calibration Technique and Its Application to Global Scale. Remote Sens. 2022, 14, 6216. [Google Scholar] [CrossRef]
  10. Yin, L.; Lv, L.; Wang, D.; Qu, Y.; Chen, H.; Deng, W. Spectral Clustering Approach with K-Nearest Neighbor and Weighted Mahalanobis Distance for Data Mining. Electronics 2023, 12, 3284. [Google Scholar] [CrossRef]
  11. Ni, L.; Manman, P.; Qiang, W. A Spectral Clustering Algorithm for Non-Linear Graph Embedding in Information Networks. Appl. Sci. 2024, 14, 4946. [Google Scholar] [CrossRef]
  12. Ran, X.; Xi, Y.; Lu, Y.; Wang, X.; Lu, Z. Comprehensive Survey on Hierarchical Clustering Algorithms and the Recent Developments; Springer: Dordrecht, The Netherlands, 2023; Volume 56, ISBN 0123456789. [Google Scholar]
  13. Zhang, X.; Shen, X.; Ouyang, T. Extension of DBSCAN in Online Clustering: An Approach Based on Three-Layer Granular Models. Appl. Sci. 2022, 12, 9402. [Google Scholar] [CrossRef]
  14. Dinh, T.; Hauchi, W.; Lisik, D.; Koren, M.; Tran, D.; Yu, P.S.; Torres-Sospedra, J. Data Clustering: An Essential Technique in Data Science. arXiv 2024, arXiv:2412.18760. [Google Scholar] [CrossRef]
  15. Chaudhry, M.; Shafi, I.; Mahnoor, M.; Vargas, D.L.R.; Thompson, E.B.; Ashraf, I. A Systematic Literature Review on Identifying Patterns Using Unsupervised Clustering Algorithms: A Data Mining Perspective. Symmetry 2023, 15, 1679. [Google Scholar] [CrossRef]
  16. Miao, S.; Zheng, L.; Liu, J.; Jin, H. K-Means Clustering Based Feature Consistency Alignment for Label-Free Model Evaluation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 24 June 2023; pp. 3299–3307. [Google Scholar] [CrossRef]
  17. Al-Sabbagh, A.; Hamze, K.; Khan, S.; Elkhodr, M. An Enhanced K-Means Clustering Algorithm for Phishing Attack Detections. Electronics 2024, 13, 3677. [Google Scholar] [CrossRef]
  18. Ahmed, M.; Seraj, R.; Islam, S.M.S. The K-Means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  19. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-Means Clustering Algorithms: A Comprehensive Review, Variants Analysis, and Advances in the Era of Big Data. Inf. Sci. 2023, 622, 178–210. [Google Scholar] [CrossRef]
  20. Rana, M.; Rahman, A.; Smith, D. Hierarchical Semi-Supervised Approach for Classifying Activities of Workers Utilising Indoor Trajectory Data. Internet Things 2024, 28, 101386. [Google Scholar] [CrossRef]
  21. Dash, M.; Liu, H.; Scheuermann, P.; Tan, K.L. Fast Hierarchical Clustering and Its Validation. Data Knowl. Eng. 2003, 44, 109–138. [Google Scholar] [CrossRef]
  22. Shi, K.; Yan, J.; Yang, J. A Semantic Partition Algorithm Based on Improved K-Means Clustering for Large-Scale Indoor Areas. ISPRS Int. J. Geoinf. 2024, 13, 41. [Google Scholar] [CrossRef]
  23. Degirmenci, A.; Karal, O. Efficient Density and Cluster Based Incremental Outlier Detection in Data Streams. Inf. Sci. 2022, 607, 901–920. [Google Scholar] [CrossRef]
  24. Sieranoja, S.; Fränti, P. Fast and General Density Peaks Clustering. Pattern Recognit. Lett. 2019, 128, 551–558. [Google Scholar] [CrossRef]
  25. Guava. Spectral Clustering for Large Scale Datasets (Part 1). Available online: https://medium.com/@guava1427/spectral-clustering-for-large-scale-datasets-part-1-874571887610 (accessed on 15 April 2025).
  26. Sreedhar, C.; Kasiviswanath, N.; Chenna Reddy, P. Clustering Large Datasets Using K-Means Modified Inter and Intra Clustering (KM-I2C) in Hadoop. J. Big Data 2017, 4, 27. [Google Scholar] [CrossRef]
  27. Capó, M.; Pérez, A.; Lozano, J.A. An Efficient K -Means Clustering Algorithm for Massive Data. arXiv 2018, arXiv:1801.02949. [Google Scholar] [CrossRef]
  28. Jin, S.; Cui, Y.; Yu, C. A New Parallelization Method for K-Means. arXiv 2016, arXiv:1608.06347. [Google Scholar] [CrossRef]
  29. Honggang, W.; Jide, Z.; Hongguang, L.; Jianguo, W. Parallel Clustering Algorithms for Image Processing on Multi-Core CPUs. In Proceedings of the International Conference on Computer Science and Software Engineering (CSSE), Wuhan, China, 12–14 December 2008; Volume 3, pp. 450–453. [Google Scholar]
  30. Zhang, Y.; Xiong, Z.; Mao, J.; Ou, L. The Study of Parallel K-Means Algorithm. In Proceedings of the 2006 6th World Congress on Intelligent Control and Automation, Dalian, China, 21–23 June 2006. [Google Scholar]
  31. Macqueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics & Probability, Berkeley, CA, USA, 21 June–18 July 1965. [Google Scholar]
  32. Parveen, S.; Yang, M. Lasso-Based k-Means++ Clustering. Electronics 2025, 14, 1429. [Google Scholar] [CrossRef]
  33. Khan, A.A.; Bashir, M.S.; Batool, A.; Raza, M.S.; Bashir, M.A. K-Means Centroids Initialization Based on Differentiation Between Instances Attributes. Int. J. Intell. Syst. 2024, 2024, 7086878. [Google Scholar] [CrossRef]
  34. Chan, J.Y.K.; Leung, A.P.; Xie, Y. Efficient High-Dimensional Kernel k-Means++ with Random Projection. Appl. Sci. 2021, 11, 6963. [Google Scholar] [CrossRef]
  35. Fränti, P.; Sieranoja, S. K-Means Properties on Six Clustering Benchmark Datasets. Appl. Intell. 2018, 48, 4743–4759. [Google Scholar] [CrossRef]
  36. Li, H.; Sugasawa, S.; Katayama, S. Adaptively Robust and Sparse K-Means Clustering. Trans. Mach. Learn. Res. 2024, 2024, 1–29. [Google Scholar]
  37. Olukanmi, P.O.; Twala, B. K-Means-Sharp: Modified Centroid Update for Outlier-Robust k-Means Clustering. In Proceedings of the 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), Bloemfontein, South Africa, 29 November–1 December 2017; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2017; Volume 2018-January, pp. 14–19. [Google Scholar]
  38. Zhao, J.; Bao, Y.; Li, D.; Guan, X. An Improved K-Means Algorithm Based on Contour Similarity. Mathematics 2024, 12, 2211. [Google Scholar] [CrossRef]
  39. Yao, X.; Chen, Z.; Gao, C.; Zhai, G.; Zhang, C. ResAD: A Simple Framework for Class Generalizable Anomaly Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 125287–125311. [Google Scholar]
  40. Wu, S.; Zhai, Y.; Liu, J.; Huang, J.; Jian, Z.; Dai, H.; Di, S.; Chen, Z.; Cappello, F. TurboFFT: A High-Performance Fast Fourier Transform with Fault Tolerance on GPU. arXiv 2024, arXiv:2405.02520. [Google Scholar] [CrossRef]
  41. Shi, N.; Liu, X.; Guan, Y. Research on K-Means Clustering Algorithm: An Improved k-Means Clustering Algorithm. In Proceedings of the 3rd International Symposium on Intelligent Information Technology and Security Informatics (IITSI), Jian, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
  42. Wang, J.; Wang, J.; Ke, Q.; Zeng, G.; Li, S.; Valley, S. Fast Approximate K-Means via Cluster Closures. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3037–3044. [Google Scholar]
  43. Moodi, F.; Saadatfar, H. An Improved K-Means Algorithm for Big Data. IET Softw. 2022, 16, 48–59. [Google Scholar] [CrossRef]
  44. Mussabayev, R.; Mussabayev, R. Superior Parallel Big Data Clustering Through Competitive Stochastic Sample Size Optimization in Big-Means. In Proceedings of the Asian Conference on Intelligent Information and Database Systems, Ras Al Khaimah, United Arab Emirates, 15–18 April 2024; pp. 224–236. [Google Scholar] [CrossRef]
  45. Rashmi, C.; Chaluvaiah, S.; Kumar, G.H. An Efficient Parallel Block Processing Approach for K -Means Algorithm for High Resolution Orthoimagery Satellite Images. In Proceedings of the Procedia Computer Science, Bangalore, India, 19–21 August 2016; Elsevier B.V.: Amsterdam, The Netherlands, 2016; Volume 89, pp. 623–631. [Google Scholar]
  46. Jin, R.; Yang, G.; Agrawal, G. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. IEEE Trans. Knowl. Data Eng. 2005, 17, 71–89. [Google Scholar] [CrossRef]
  47. Cuomo, S.; De Angelis, V.; Farina, G.; Marcellino, L.; Toraldo, G. A GPU-Accelerated Parallel K-Means Algorithm. Comput. Electr. Eng. 2019, 75, 262–274. [Google Scholar] [CrossRef]
  48. Bellavita, J.; Pasquali, T.; Del Rio Martin, L.; Vella, F.; Guidi, G. Popcorn: Accelerating Kernel K-Means on GPUs Through Sparse Linear Algebra; Association for Computing Machinery: New York, NY, USA, 2025; Volume 1, ISBN 9798400714436. [Google Scholar]
  49. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone. Remote Sens. Environ. 2017, 202, 18–27. [Google Scholar] [CrossRef]
  50. Dickinson, R.E.; Henderson-Sellers, A.; Kennedy, P.J.; Wilson, M.F. NCAR/TN-257+STR Biosphere-Atmosphere Transfer Scheme (BATS) for the NCAR Community Climate Model; National Center for Atmospheric Research: Boulder, CO, USA, 1986. [Google Scholar]
  51. Shahapure, K.R.; Nicholas, C. Cluster Quality Analysis Using Silhouette Score. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 747–748. [Google Scholar] [CrossRef]
  52. Syahputri, Z.; Sutarman, S.; Siregar, M.A.P. Determining The Optimal Number of K-Means Clusters Using The Calinski Harabasz Index and Krzanowski and Lai Index Methods for Groupsing Flood Prone Areas In North Sumatra. Sinkron 2024, 9, 571–580. [Google Scholar] [CrossRef]
  53. KC, M.; Leigh, L.; Pinto, C.T.; Kaewmanee, M. Method of Validating Satellite Surface Reflectance Product Using Empirical Line Method. Remote Sens. 2023, 15, 2240. [Google Scholar] [CrossRef]
  54. Fränti, P.; Sieranoja, S. Clustering Accuracy. Appl. Comput. Intell. 2024, 4, 24–44. [Google Scholar] [CrossRef]
  55. Nigro, L.; Cicirelli, F.; Fränti, P. Parallel Random Swap: An Efficient and Reliable Clustering Algorithm in Java. Simul. Model. Pract. Theory 2023, 124, 102712. [Google Scholar] [CrossRef]
  56. Fränti, P. Efficiency of Random Swap Clustering. J. Big Data 2018, 5, 1–29. [Google Scholar] [CrossRef]
  57. Kaukoranta, T.; Fränti, P.; Nevalainen, O. A Fast Exact GLA Based on Code Vector Activity Detection. IEEE Trans. Image Process. 2000, 9, 1337–1342. [Google Scholar] [CrossRef]
  58. Fajardo Rueda, J.; Leigh, L.; Kaewmanee, M.; Byregowda, H.; Teixeira Pinto, C. Derivation of Hyperspectral Profiles for Global Extended Pseudo Invariant Calibration Sites (EPICS) and Their Application in Satellite Sensor Cross-Calibration. Remote Sens. 2025, 17, 216. [Google Scholar] [CrossRef]
  59. Fajardo Rueda, J.; Leigh, L.; Teixeira Pinto, C. Identification of Global Extended Pseudo Invariant Calibration Sites (EPICS) and Their Validation Using Radiometric Calibration Network (RadCalNet). Remote Sens. 2024, 16, 4129. [Google Scholar] [CrossRef]
  60. Alshari, E.A.; Gawali, B.W. Development of Classification System for LULC Using Remote Sensing and GIS. Glob. Transit. Proc. 2021, 2, 8–17. [Google Scholar] [CrossRef]
  61. Yang, C.; Li, Y.; Cheng, F. Accelerating K-Means on GPU with CUDA Programming. IOP Conf. Ser. Mater. Sci. Eng. 2020, 790, 012036. [Google Scholar] [CrossRef]
  62. Han, S.; Lee, J. Parallelized Inter-Image k-Means Clustering Algorithm for Unsupervised Classification of Series of Satellite Images. Remote Sens. 2024, 16, 102. [Google Scholar] [CrossRef]
  63. Andoni, A.; Indyk, P.; Razenshteyn, I. Approximate Nearest Neighbor Search in High Dimensions. In Proceedings of the International Congress of Mathematicians (ICM), Rio de Janeiro, Brazil, 1–9 August 2018; Volume 4, pp. 3305–3336. [Google Scholar] [CrossRef]
  64. Shindler, M.; Wong, A.; Meyerson, A. Fast and Accurate κ-Means for Large Datasets. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), Granada, Spain, 10–13 December 2011. [Google Scholar]
  65. Spalding-Jamieson, J.; Robson, E.W.; Zheng, D.W. Scalable K-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search. arXiv 2025, arXiv:2502.06163. [Google Scholar]
  66. Peng, K.; Leung, V.C.M.; Huang, Q. Clustering Approach Based on Mini Batch Kmeans for Intrusion Detection System over Big Data. IEEE Access 2018, 6, 11897–11906. [Google Scholar] [CrossRef]
  67. Jourdan, B.; Schwartzman, G. Mini-Batch Kernel k-Means. arXiv 2024, arXiv:2410.05902. [Google Scholar]
  68. Newling, J. Nested Mini-Batch K-Means. Adv. Neural Inf. Process. Syst. 2015, 29, 1–9. [Google Scholar]
  69. Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory ICDT 2001; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 1973, pp. 420–434. [Google Scholar] [CrossRef]
Figure 1. Logical Development of Parallelized K-Means Algorithm.
Figure 2. Workflow of the study.
Figure 3. Selected Dataset Around the world.
Figure 4. Job Submission and Execution Pipeline in an HPC Environment.
Figure 5. Operational Flowchart of Parallelization.
Figure 6. Cost Variation per Iteration for different Parallelized K-Means Variants.
Figure 7. Bar Plot of Computational Efficiency Comparisons in terms of time.
Figure 8. Bar Plot Cluster Quality Comparison in terms RMSE.
Figure 9. Cluster-based Masked Image of “Sand” Generated from Parallelized modified K-Means Cluster: (a) Input Image (b) from Parallelized Centroid Initialization method (Reference) (c) Pixel difference between Reference and Parallelized Dynamic K-Means Sharp Method (d) Pixel difference between Reference and Parallelized Nearest-Neighbor Iteration Calculation Reduction Method (e) Pixel difference between Reference and Parallelized Centroid Initialization and Dynamic K-Means Sharp Method (f) Pixel difference between Reference and Parallelized Centroid Initialization and Nearest-Neighbor Iteration Calculation Reduction Method (g) Pixel difference between Reference and Parallelized Dynamic K-Means Sharp and Nearest-Neighbor Iteration Calculation Reduction Method (h) Pixel difference between Reference and PEKM.
Table 1. Parameters of Dataset and K-Means Clustering Algorithm.
Parameter | Value
Data cube Number | 114
Data cube Size | 3712 × 3712 × 16 (~1 GB)
# Cluster (k) | 160
Convergent Criteria | 0.0005
Table 2. Number of iterations required for convergence across Parallelized K-Means methods.
Method | Iterations Till Convergence
PSKM + CI | 348
PSKM + KS | 282
PSKM + NN | 326
PSKM + CI + KS | 257
PSKM + CI + NN | 292
PSKM + KS + NN | 288
PEKM | 234
Table 3. Computational time (in hours) required for convergence across various Parallelized K-Means methods.
Method | Hours Till Convergence
PSKM + CI | 9965.47
PSKM + KS | 6341.58
PSKM + NN | 5860.90
PSKM + CI + KS | 5237.19
PSKM + CI + NN | 5130.24
PSKM + KS + NN | 5243.95
PEKM | 4230.43
Table 4. Comparison of RMSE values for various Parallelized K-Means methods.
Method | RMSE
PSKM + CI | 0.01417
PSKM + KS | 0.01332
PSKM + NN | 0.01399
PSKM + CI + KS | 0.01375
PSKM + CI + NN | 0.01412
PSKM + KS + NN | 0.01333
PEKM | 0.01364