A Heuristic Evaluation of Partitioning Techniques Considering Early-Type Galaxy Databases

Abstract: Galaxies are, statistically, among the most interesting and complex astronomical objects due to their continuous diversification, driven mainly by events such as accretion, interaction, or mergers. Multivariate studies are among the most useful tools for analyzing this type of data and understanding its various components. We study a sample of 509 early-type galaxies of the local universe from Ogando et al., imputed with a Predictive Mean Matching (PMM) multiple imputation algorithm, with the aim of classifying the galaxies into distinct clusters through the k-medoids and k-means algorithms and, in turn, performing a heuristic evaluation of the two partitioning algorithms through the percentage of misclassification observed. From the clustering algorithms, we observed four distinct clusters of galaxies with a misclassification of about 1.96%. Comparing the percentage of misclassification heuristically, k-means is superior to k-medoids under fixed optimal cluster sizes for this category of galaxy datasets. By treating galaxies as continuously evolving complex objects and using appropriate statistical tools, we are able to derive an explanatory classification of galaxies based on their diverse physical properties, and also to establish a better partitioning method for working with galaxy data.


Introduction
A galaxy represents a vast and intricate system composed of stars and interstellar matter within the expanse of our universe. It requires great effort to effectively engage with these complex and dynamic databases. The repository of galaxy data includes an extensive array of information encompassing diverse aspects of galaxies, including their morphological characteristics, photometric properties, spectral attributes, and more. While substantial research has been conducted in these specific domains, the comprehensive exploration of their "physical properties" remains a relatively uncharted territory.
Esteemed statisticians and physicists concur that multivariate techniques represent the most suitable approach for deriving meaningful insights from these astronomical databases. Among the array of partitioning techniques widely embraced in multivariate statistics, the k-means and k-medoids methods emerge as notable contenders. As we navigate through our analysis, it becomes increasingly apparent that a heuristic comparison between these robust partitioning techniques can illuminate their relative strengths, particularly concerning the percentage of misclassification, all within the context of an assumed optimal number of clusters tailored to this specific category of astronomical data. This dataset was meticulously assembled by Ogando et al. in 2008 [1,2] and comprises a set of parameters that hold paramount significance for our study. Furthermore, we have enriched our dataset by incorporating supplementary parameters sourced from the Hyperleda database, enhancing the depth and breadth of our analytical endeavors.

Missing Value Imputations
To address the absence of data in the Galaxy dataset, we have employed the multiple imputation technique known as Predictive Mean Matching (PMM). In essence, PMM computes the anticipated value of the target variable Y based on the specified imputation model. Predictive mean matching is used in statistics and data analysis to impute missing values by matching them with the predicted means of similar observations, preserving the original data distribution and relationships.
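As a rough illustration of the matching idea described above, the following is a minimal single-imputation sketch of PMM for one target variable with one predictor (the paper uses a full multiple-imputation algorithm over many parameters; the function name, the linear imputation model, and the donor-pool size `k` are our assumptions for illustration only):

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Impute missing values of y by Predictive Mean Matching (sketch).

    Fit a linear model of y on x using complete cases, then for each
    missing y draw a donor among the k observed cases whose predicted
    means are closest, so imputed values come from the observed data.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    obs = ~np.isnan(y)

    # Fit y ~ x on the complete cases (least squares with intercept).
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    yhat = X @ beta  # predicted means for ALL cases

    for i in np.where(~obs)[0]:
        # Match on predicted means, not raw values: this is what
        # preserves the observed-data distribution.
        d = np.abs(yhat[obs] - yhat[i])
        donors = y[obs][np.argsort(d)[:k]]
        y[i] = rng.choice(donors)  # draw the observed value of a donor
    return y
```

Because each imputed value is an actually observed value (rather than a model prediction), PMM avoids implausible imputations outside the observed range.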

Choice of Optimal Clusters

Elbow Plot
To ascertain the ideal number of partitions into which the data can be divided, the Distortion Plot Method stands as a widely embraced technique for determining this optimal value, often denoted as 'k'. This method computes the average sum of squared distances from the partition centers within the generated partitions. Essentially, the optimal number of clusters becomes evident when examining the graph for a distinct 'elbow-like' point [3].
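The quantity plotted can be computed directly from repeated k-means fits. A minimal sketch using scikit-learn (the helper name and the range of k values are our choices, not from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(data, k_max=8, seed=0):
    """Within-cluster sums of squared distances for k = 1..k_max.

    Plotting these values against k and looking for the point where
    the curve bends ('elbow') gives the heuristic choice of k.
    """
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_
        for k in range(1, k_max + 1)
    ]
```

The elbow is where adding one more cluster stops producing a large drop in the within-cluster sum of squares.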

Dunn Index
The Dunn Index is a metric used to evaluate the quality of clustering results in unsupervised machine learning [4]. It helps assess the separation between clusters and the compactness of data points within each cluster.

Formula
The Dunn Index is calculated using the following formula:

    Dunn Index = min(inter-cluster distances) / max(intra-cluster distances)

where:
• Inter-cluster distances refer to the distances between different clusters.
• Intra-cluster distances refer to the distances within each cluster.
A higher Dunn Index indicates better clustering, as it signifies greater inter-cluster separation and smaller intra-cluster distances.

• When the Dunn Index is high, it suggests that the clusters are well-separated and compact, indicating a good clustering solution.
• Conversely, a low Dunn Index implies that clusters are either too close to each other (poor separation) or data points within clusters are too spread out (low compactness).
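The ratio above can be computed from any set of cluster labels. A small sketch, taking the minimum single-linkage distance between clusters over the maximum cluster diameter (one common variant of the index; the function name is ours):

```python
import numpy as np

def dunn_index(data, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster diameter.

    Inter-cluster distance: smallest pairwise distance between points of
    two different clusters. Intra-cluster diameter: largest pairwise
    distance within a cluster. Higher values indicate better clustering.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    clusters = [data[labels == c] for c in np.unique(labels)]

    def pdist(a, b):
        # Pairwise Euclidean distances between the rows of a and b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    inter = min(pdist(a, b).min()
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    intra = max(pdist(c, c).max() for c in clusters)
    return inter / intra
```

For two tight, well-separated clusters this ratio exceeds 1; overlapping or diffuse clusters push it toward 0.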

Clustering (Partitioning) Algorithms and Discriminant Analysis
Clustering is a method that involves categorizing individuals with diverse characteristics based on their similarities or dissimilarities. In this study, several renowned algorithms have been employed, including the following: k-means clustering is a versatile and straightforward technique for clustering data. It is easy to implement and can be applied to various domains; here we use it to classify and cluster galaxy diversification, discovering hidden patterns and grouping similar data points together.

K-Medoids
We use this as a second algorithm to compare against k-means. The method is given below.

1. Initialize K medoids by selecting K data points at random.
2. Assign each data point to the nearest medoid.
3. For each cluster, select the data point that minimizes the total distance to the other points in the same cluster as the new medoid.
4. Repeat steps 2 and 3 until convergence.
k-medoids clustering is a valuable technique for partitioning data into meaningful clusters. It is particularly useful when dealing with noisy or non-linear data.
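The steps above can be sketched directly in NumPy. This is a minimal alternating-update variant (not the full PAM swap procedure, and not the paper's implementation; the function name and random initialization are our assumptions):

```python
import numpy as np

def k_medoids(data, k, max_iter=100, seed=0):
    """Minimal k-medoids following the four steps listed above:
    pick k points as initial medoids, assign points to the nearest
    medoid, move each medoid to the in-cluster point with the smallest
    total distance to the others, and repeat until convergence."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(data), size=k, replace=False)
    # Precompute all pairwise distances once.
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)

    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoid_idx], axis=1)
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # New medoid: member minimizing total in-cluster distance.
                sub = dist[np.ix_(members, members)]
                new_idx[c] = members[np.argmin(sub.sum(axis=1))]
        if np.array_equal(new_idx, medoid_idx):
            break  # medoids stopped moving: converged
        medoid_idx = new_idx
    return labels, data[medoid_idx]
```

Because medoids are always actual data points, the algorithm is less sensitive to outliers than k-means, whose centroids are means.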

The Linear Discriminant Analysis (LDA)
The primary objective of LDA is to find a linear combination of features that best separates two or more classes in a dataset. It aims to maximize the between-class variance while minimizing the within-class variance [5]. In LDA, key concepts include:

• Scatter matrices: within-class and between-class scatter matrices.
• Eigenvectors and eigenvalues: used to find the optimal linear transformation.
• Decision boundaries: separating classes based on discriminant functions.

Results
Astronomy generates complex datasets, especially for galaxies, and partitioning techniques such as k-means and k-medoids are vital for analyzing them: they help astronomers unveil patterns, understand celestial structures, and explore the universe's mysteries.
From the techniques used to find the optimal number of clusters, the Elbow plot and Dunn Index yield 4 clusters for k-means and 3 for k-medoids, respectively. The Elbow plots and the values of the Dunn Index are given in Table 1 and Figures 1 and 2. The clusters formed by k-means and k-medoids with the optimal number of clusters set to 3 and 4 are shown in Figures 3-6.

Figures, Tables
To comprehensively compare both clustering algorithms, we formed clusters using both three and four optimal cluster numbers for each algorithm. We evaluated the comparison based on the percentage of misclassification within the procedure using the optimal cluster numbers. The results of the discriminant analysis are presented in tabular format in Tables 2-5. K-means clustering exhibits superior performance compared to k-medoids when using three optimal clusters, with a misclassification rate of approximately 2.36% for k-means and 8.25% for k-medoids. The trend continues with four optimal clusters, where k-means maintains its advantage with a misclassification rate of around 1.96% versus 11.19% for k-medoids. In summary, k-means outperforms k-medoids overall, even in the presence of outliers, for the galaxy dataset.

Conclusions
From the results and findings of this work, we can observe four distinct clusters of galaxies in the local universe of Ogando (2008) based on their collective physical characteristics. The approximate mean values of the parameters in those robust clusters are also included in the study, which gives a heuristic idea of the physical characteristics of a newly observed galaxy, provided it falls into one of the robust clusters. Additionally, there is about 1.96% misclassification in the data, which indicates the high accuracy of the clustering. From the misclassification observed while clustering for a given optimal number of clusters (k = 3 and k = 4), it can be inferred that k-means performs better than k-medoids for this category of galaxy database. The misclassification at the optimal number of clusters for k-means (k = 4) and k-medoids (k = 3) also serves as a reasonable indication of the superiority of the k-means algorithm over k-medoids for galaxy data.

K-Means
K-means clustering is a popular unsupervised machine learning technique used for data clustering and segmentation. It is a simple yet effective algorithm for partitioning a dataset into K distinct, non-overlapping clusters. The goal is to group similar data points together based on their feature similarity.

Algorithm
The k-means algorithm works as follows (Algorithm 1) [3]:

Algorithm 1: k-means Clustering
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of the data points in each cluster.
4. Repeat steps 2 and 3 until convergence (centroids no longer change significantly).
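The four steps of Algorithm 1 can be sketched in a few lines of NumPy (a from-scratch illustration, not the implementation used in the paper; the function name and the choice of initial centroids as random data points are ours):

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means following the four steps of Algorithm 1."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids as k randomly chosen data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the nearest centroid.
        dists = np.linalg.norm(data[:, None] - centroids[None], axis=-1)
        labels = np.argmin(dists, axis=1)
        # 3. Recalculate each centroid as the mean of its cluster
        #    (keeping the old centroid if a cluster is empty).
        new = np.array([data[labels == c].mean(axis=0)
                        if (labels == c).any() else centroids[c]
                        for c in range(k)])
        # 4. Stop when the centroids no longer change significantly.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Contrasting this with the k-medoids sketch: centroids here are cluster means and need not coincide with data points, which makes k-means cheaper per iteration but more sensitive to outliers.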

Figure 1. Elbow Plot for k-means.