Review

A Brief Review of Unsupervised Machine Learning Algorithms in Astronomy: Dimensionality Reduction and Clustering

1 Faculty of Science, University of Toronto, 27 King’s College Circle, Toronto, ON M5S 1A1, Canada
2 Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. George Street, Toronto, ON M5S 3H8, Canada
3 Department of Astronomy & Astrophysics, University of Toronto, 50 St. George Street, Toronto, ON M5S 3H4, Canada
* Author to whom correspondence should be addressed.
Universe 2025, 11(12), 412; https://doi.org/10.3390/universe11120412
Submission received: 2 October 2025 / Revised: 19 November 2025 / Accepted: 4 December 2025 / Published: 11 December 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence in Modern Astronomy)

Abstract

This review investigates the application of unsupervised machine learning algorithms to astronomical data. Unsupervised machine learning enables researchers to analyze large, high-dimensional, and unlabeled datasets, and it is often considered especially useful for exploratory analysis because it is not limited by existing knowledge and can therefore be used to extract new knowledge. Unsupervised machine learning algorithms that have been repeatedly applied to analyze astronomical data are classified according to their usage, namely dimensionality reduction and clustering. This review also discusses anomaly detection and symbolic regression. For each algorithm, this review discusses how the algorithm works in mathematical and statistical terms, the algorithm’s characteristics (e.g., advantages, shortcomings, and possible types of inputs), and the different types of astronomical data analyzed with the algorithm. Example figures are generated, and the algorithms are tested on synthetic datasets. This review aims to provide an up-to-date overview of both the high-level concepts and detailed applications of various unsupervised learning methods in astronomy, highlighting their advantages and disadvantages to help researchers new to unsupervised learning.

1. Introduction

Machine learning (ML) has been applied to various analyses of astronomical data, such as analyzing spectral data (e.g., [1,2,3]), catalogs (e.g., [4,5,6]), light curves (e.g., [7,8,9]), and images (e.g., [10,11,12]). ML is a subfield of artificial intelligence, aiming to mimic human brain functions using computers. Rather than manually coding every step, as in traditional programming, researchers use ML algorithms to analyze data. ML adjusts models iteratively to minimize errors. Parameters are either automatically optimized by ML or manually fine-tuned by researchers. In simple terms, ML focuses on achieving a desired outcome rather than specifying how to produce it, performing tasks with general guidance rather than detailed instructions [13]. Unlike traditional programming, ML often involves iterative processes that cannot be easily described by equations.
The growing significance of ML arises from the rapidly increasing volume and complexity of astronomical data collected using progressively advanced instruments [14]. With the advanced instruments, high-resolution, high-dimensional datasets are collected. ML offers efficient and objective solutions for analyzing such large datasets. Therefore, ML has become increasingly popular among astronomers.
There are two types of ML techniques: unsupervised and supervised. Unsupervised ML conducts exploratory data analysis, discovering unknown data features, without any prior information on classification. In comparison, supervised ML requires a labeled dataset (i.e., a training set), where the features are known [14]. Supervised ML learns from this labeled dataset and makes predictions on new data with the same properties [14,15]. This review focuses on unsupervised ML algorithms, which are particularly valuable for scientific research because they are not constrained by existing knowledge and can uncover new insights [14].
In this review, we classify unsupervised ML algorithms into two categories based on their primary use: dimensionality reduction and clustering. Within the discussion of dimensionality reduction algorithms, we also include neural network-based methods. While clustering and dimensionality reduction are typically framed as objectives, neural networks are better viewed as model architectures that extract information from data; the resulting learned representations can then be used to perform tasks such as clustering or dimensionality reduction.
Clustering refers to finding the concentration of multivariable data points [16]. In simpler words, clustering groups objects so that those in the same group are more similar to each other than to those in different groups.
Dimensionality reduction selects or constructs a subset of features that best describe the data, reducing the number of features [14]. It retains essential information while discarding trivial information [16]. Manifold learning is an important branch of dimensionality reduction that performs non-linear reduction to unfold the surface of data and reveal its underlying structure [16].
A neural network model is designed to mimic the structure and function of the human brain [16]. A neural network consists of multiple interconnected neurons or layers of neurons, where each neuron receives inputs and transmits outputs. A shallow neural network has one or two hidden layers, while deep learning models have three or more. While deep learning can be used for supervised ML, this review focuses on unsupervised neural networks.
This review may serve as an up-to-date, comprehensive manual on unsupervised ML algorithms and their applications in astronomy, tailored for astronomy researchers new to ML. Emerging approaches—such as semi-supervised, self-supervised, and hybrid models—provide additional strategies for analyzing complex astronomical datasets and are discussed in detail in Section 5.

2. Dimensionality Reduction

This section introduces various dimensionality reduction algorithms, including principal component analysis, multi-dimensional scaling and isometric feature mapping, locally linear embedding, t-distributed stochastic neighbor embedding, self-organizing maps, and auto-encoders and variational auto-encoders. These algorithms project high-dimensional data into lower dimensions by identifying linear or non-linear structures that preserve essential information and, in some cases, by studying the underlying manifold1 of the data.

2.1. Linear Methods

We examine the commonly used linear dimensionality reduction algorithms PCA and its kernel-based extension, kernel PCA.

Principal Component Analysis (PCA) and Kernel PCA

Principal component analysis (PCA), as suggested by its name, is a dimensionality reduction method that focuses on the principal components (PCs) of the multivariate dataset. PCA was initially developed by Pearson [17] and further improved by Hotelling [18,19]. PCA performs singular value decomposition (SVD) and can be interpreted as a rotation and projection of the dataset to maximize variance, thereby highlighting the most important features [16]. PCA constructs a covariance matrix of the dataset, and the orthonormal eigenvectors are the PCs (i.e., the axes) [14]. The PCs are identified sequentially: the first maximizes the variance, the second is orthogonal to the first while maximizing the residual variance, and the subsequent components are orthogonal to all prior components [16]. The PCs are linearly uncorrelated, so applying PCA removes the correlation between multiple dimensions, simplifying the data. The first few components convey most of the information [15]. Therefore, when all PCs are retained, the transformation is simply a rotation of the original coordinate system and thus no information is lost; when k PCs are used, the data are reduced to k dimensions. Figure 1 is a demonstration of dimensionality reduction with PCA, showing how the first two PCs are selected given the dataset. Figure 2 illustrates an example of PCA applied to images, showing how the reconstructed images retain the main features of the original data. The number n on the left indicates the number of PCs used for the reconstruction. A smaller number of components corresponds to greater compression in feature space, which results in a blurrier reconstructed image.
Choosing an appropriate number of principal components k is a critical step in PCA. Retaining too few components risks discarding meaningful structure in the data, whereas retaining too many diminishes the benefits of dimensionality reduction. The optimal choice of k depends strongly on the dataset’s size and characteristics, as well as on the intended purpose of dimensionality reduction (e.g., visualization versus pre-processing). A common approach is to examine how the reconstruction error—quantifying the discrepancy between the reconstructed and original data—varies with the number of retained PCs. In many practical applications, the eigenvalue spectrum decays approximately exponentially or as a power law, so additional PCs contribute only marginal improvements. For example, Çakir and Buck [21] retain 215 three-dimensional eigengalaxies (PCs) from an original sample of 11,960 galaxies. They report that 90% of their input images produce reconstruction errors below 0.027, which they use to validate that their chosen value of k is adequate.
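As an illustration, the following minimal Python sketch (assuming scikit-learn and a synthetic data matrix; the variable names, variance threshold, and error percentile are illustrative rather than taken from [21]) shows one common way to choose k from the cumulative explained variance and to check the resulting reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data matrix: 1000 samples, 50 correlated features (rank ~5).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))

# Fit a full PCA and inspect how much variance each PC explains.
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# One common heuristic: keep the smallest k explaining, say, 95% of the variance.
k = int(np.searchsorted(cum_var, 0.95)) + 1

# Per-sample reconstruction error for the chosen k.
pca_k = PCA(n_components=k).fit(X)
X_rec = pca_k.inverse_transform(pca_k.transform(X))
rel_err = np.linalg.norm(X - X_rec, axis=1) / np.linalg.norm(X, axis=1)
print(k, np.percentile(rel_err, 90))
```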
There are some shortcomings in PCA. Firstly, pre-processing (i.e., treatment of the dataset before applying the algorithm) is necessary for PCA to yield informative results [16]. PCA is sensitive to outliers, so outliers need to be removed. Then, due to the property of PCA, the data needs to be normalized beforehand, similar to K-means (see Section 3.2) and some other algorithms. For instance, z-score normalization, sometimes referred to as feature scaling, can be applied. The equation is $z_i = (x_i - \mu_i)/\sigma_i$, where $z_i$ is the z-score of feature $i$ for a given data point, $x_i$ is the value of that feature for that data point, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of that feature across the dataset. Secondly, PCA is a linear decomposition of data and thus not applicable in some cases (e.g., when effects are multiplicative) [14]. When applied to analyze a non-linear dataset, PCA may fail. Thirdly, standard PCA only supports batch processing, meaning that all data must be loaded into memory at once. To address this limitation, incremental PCA was developed to enable minibatch processing. In this approach, the algorithm processes small, randomly sampled subsets of the data (i.e., minibatches) in each iteration, which reduces memory requirements and computational cost. However, because the principal components are estimated from partial data rather than the full dataset, the resulting components may be slightly less accurate than those obtained from standard PCA.
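The pre-processing and minibatch ideas above can be sketched as follows; this assumes scikit-learn, and the synthetic data, batch count, and number of components are purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))          # illustrative dataset

# z-score normalization: z_i = (x_i - mu_i) / sigma_i, applied per feature.
X_std = StandardScaler().fit_transform(X)

# Incremental PCA: feed the data in minibatches to limit memory use.
ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X_std, 20):    # 20 minibatches of ~500 points
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X_std)
print(X_reduced.shape)                     # (10000, 5)
```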
As mentioned, PCA is a linear method. Therefore, to analyze datasets that are not linearly separable, kernel PCA [22] is introduced, where ‘kernel’ means that the algorithm uses a function measuring the similarity between two data points, corresponding to an inner product in a transformed feature space. Compared to PCA, kernel PCA can make a non-linear projection of the data points, thus providing a clearer presentation of information by unfolding the dataset. When the kernel PCA projection is mapped back to the original feature space, small differences remain, even if the number of components is the same as the number of original features. The difference can be reduced by choosing a different kernel function or by adjusting the implementation, as suggested by Pedregosa et al. [23]. In particular, they recommend tuning the hyperparameter alpha, which is used when learning the inverse transformation. This step relies on ridge regression, where a smaller value of alpha typically improves the reconstruction but can also make the model more sensitive to noise and increase the risk of overfitting.
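A minimal kernel PCA sketch, assuming scikit-learn and an illustrative two-circle dataset (the RBF kernel and the values of gamma and alpha are arbitrary demonstration choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable, so plain PCA cannot unfold them.
X, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# RBF-kernel PCA; fit_inverse_transform learns the pre-image map via ridge
# regression, whose strength is controlled by alpha (see [23]).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True, alpha=0.1)
X_kpca = kpca.fit_transform(X)
X_back = kpca.inverse_transform(X_kpca)    # approximate reconstruction

print(((X - X_back) ** 2).mean())          # small but non-zero reconstruction error
```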
PCA has been applied to dimensionality reduction of spectral data (e.g., [24,25,26,27,28]), light curves (e.g., [29]), catalogs (e.g., [30]), and images (e.g., [11]).

2.2. Non-Linear Methods

There are also non-linear methods aiming to process non-linear datasets. The discussed algorithms are multi-dimensional scaling and isometric feature mapping, locally linear embedding, and t-distributed stochastic neighbor embedding.

2.2.1. Multi-Dimensional Scaling and Isometric Feature Mapping

Multi-dimensional scaling (MDS) [31] is a dimensionality reduction algorithm frequently compared to PCA. Ramsay [32] gives the core theory behind MDS. MDS aims to preserve the disparities, which are derived from the pairwise distances computed in the original high-dimensional space. Given a dataset, MDS first computes the pairwise distances between all pairs of points. It then reconstructs a low-dimensional representation of the data that minimizes a quantity known as stress (also known as the error). Stress is defined as the sum of the squared differences between the original pairwise distances and the corresponding distances in the low-dimensional embedding. There are various ways to define disparities. For example, metric MDS (also known as absolute MDS) defines disparity as a function of the absolute distance between two points. In contrast, non-metric MDS [33,34] preserves the rank order of the pairwise distances. That is, if one pair of points is farther apart than another pair of points, the same relative ordering is maintained in the low-dimensional embedding space2.
Isometric feature mapping (Isomap, [35]), an extension of MDS, is also used for non-linear dimensionality reduction. Unlike PCA and classical MDS, which rely on Euclidean distances, Isomap uses geodesic distance. Specifically, Isomap first connects each point to its nearest neighbors based on Euclidean distance, constructing a neighborhood graph. The geodesic distance between two points is then approximated as the shortest path along this graph. In other words, Isomap approximates the geodesic curves lying on the manifold and computes the distance along these curves [16]. While MDS considers the distances between all pairs of points to define the shape of the data, Isomap only uses distances between neighboring points and sets the distances between all other pairs to zero, thus unfolding the manifold [16]. After constructing this new distance matrix, Isomap applies MDS to the resulting structure. Figure 3a shows an example of how a three-dimensional ‘Swiss roll’ manifold is unfolded to two dimensions by Isomap.
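For illustration, a minimal Isomap sketch on a synthetic ‘Swiss roll’ (assuming scikit-learn; the number of neighbors is an arbitrary choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A three-dimensional 'Swiss roll', as in Figure 3a.
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Isomap: build a neighborhood graph, approximate geodesic distances by
# shortest paths along the graph, then apply MDS to the resulting distances.
iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)
print(X_2d.shape)   # (1500, 2): the roll unfolded into a plane
```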
Bu et al. [36] point out that Isomap is more efficient than PCA in feature extraction for spectral classification. However, Ivezić et al. [16] state that Isomap is more computationally expensive compared to locally linear embedding (see Section 2.2.2). The book also lists various algorithms that can reduce the amount of computation, such as the Floyd–Warshall algorithm [37] and the Dijkstra algorithm [38].
MDS has been used to reduce the dimension of catalogs (e.g., [12]), while Isomap has been used to reduce spectral data (e.g., [36,39,40]) and latent spaces generated by neural networks (see Section 2.3.2) (e.g., [41]).

2.2.2. Locally Linear Embedding

Locally linear embedding (LLE, [42]) is an algorithm for dimensionality reduction that preserves the data’s local geometry [16]. For each data point, LLE identifies its k nearest neighbors and produces a set of weights that can be applied to the neighbors to best reconstruct the data point. The weights encode the geometry defined by the data point and its nearest neighbors. The weight matrix W is found by minimizing the error $E_1(W) = |X - WX|^2$, where X denotes the original dataset [16]. From the weight matrix, LLE finds the low-dimensional embedding Y by minimizing the error $E_2(Y) = |Y - WY|^2$ [16]. The solution to the two minimizations can be found with efficient linear algebra techniques: a matrix $C_W \equiv (I - W)^T (I - W)$ is constructed, and eigenvalue decomposition is performed on $C_W$, producing the low-dimensional embedding. Figure 3b shows an example of how a three-dimensional ‘Swiss roll’ manifold is flattened to two dimensions by LLE.
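A corresponding LLE sketch on the same kind of synthetic ‘Swiss roll’ (assuming scikit-learn; the neighborhood size is illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# LLE: reconstruct each point from its k nearest neighbors, then find a
# 2D embedding that preserves those reconstruction weights (Figure 3b).
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="standard")
X_2d = lle.fit_transform(X)
print(lle.reconstruction_error_)   # value of the minimized embedding cost
```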
A shortcoming of LLE is that the direct eigenvalue decomposition becomes expensive for large datasets [16]. According to Ivezić et al. [16], this problem can be circumvented by using iterative methods, such as Arnoldi decomposition, which is available in the Fortran package ARPACK [43]. LLE is sometimes compared with PCA. On the one hand, LLE does not allow the projection of new data, which would impact the weight matrix and subsequent computations. This means that when LLE is applied to reduce new data of the same type, the training process needs to be repeated, unlike some other methods (e.g., PCA) that allow direct application of a trained template to new data. On the other hand, Vanderplas and Connolly [44] show that LLE leads to improved classification, although Bu et al. [45] point out that the usage of LLE is more specific than PCA and thus more limited.
LLE has been applied to dimensionality reduction of spectral data (e.g., [44,46,47,48]) and light curves (e.g., [7,49]).

2.2.3. t-Distributed Stochastic Neighbor Embedding

t-distributed stochastic neighbor embedding (t-SNE) is another widely used algorithm for dimensionality reduction. Stochastic neighbor embedding, developed by Hinton and Roweis [50], forms the basis of t-SNE, and the t-distribution variant was introduced by van der Maaten and Hinton [51]. t-SNE converts the affinities—similarities between two points, usually measured by pairwise distances in a high-dimensional space—into Gaussian joint probabilities [23]. Figure 4 illustrates how t-SNE calculates affinity using a three-dimensional dataset shown in (a). For example, the distance from any point to the darkest black dot is computed, and t-SNE calculates the corresponding conditional Gaussian probabilities, as shown in (b). Perplexity, a hyperparameter that reflects the effective number of nearest neighbors considered when computing conditional probabilities, determines the width of the Gaussian distribution around each point. Because of normalization, the probabilities between any two points, a and b, can differ depending on whether the perspective is from a to b or from b to a. The joint probability is typically defined as the average of these two probabilities. The probabilities are further represented by Student’s t-distributions [23]. When high-dimensional data is projected into a low-dimensional space, the probability distribution is maximally preserved [14]. That is, t-SNE uses a joint Gaussian distribution to model the likelihood of data in the high-dimensional space, which is then mapped to a Student’s t-distribution in the low-dimensional space [15]. Figure 5 illustrates an example of t-SNE applied to spectral data. Each spectrum can be represented as a vector whose dimensionality equals the total number of wavelength bins. The color of each dot indicates the cluster it belongs to, as determined by a given clustering algorithm, while the pink dots represent intermediate data points. The t-SNE projection reveals that the circled groups of pink data points lie near the boundary between the two main clusters, suggesting the presence of possible intermediate-type subgroups.
As discussed, Isomap and LLE can learn a single continuous manifold. However, in many cases, there is more than one manifold. Due to its design, t-SNE can learn different manifolds in the dataset. t-SNE also mitigates the issue of overcrowding at the center, a common problem in many dimensionality reduction algorithms where points cluster densely, by using a t-distribution with heavier tails than a Gaussian distribution [15]. This allows distant points to be spread out more effectively in the lower-dimensional space. Nonetheless, t-SNE has some disadvantages. Similarly to LLE, t-SNE does not allow the projection of new data points once the manifold has been learned. t-SNE is non-deterministic, which means it generates different results each time it runs. t-SNE is also very computationally expensive. To accelerate computation, the Barnes–Hut approximation [53] may be adopted, but the embedding manifolds applicable are limited to two or three dimensions. t-SNE may not preserve the global structure of the dataset. To solve this problem, one may select initial points by PCA.
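For illustration, a minimal t-SNE sketch (assuming scikit-learn; the dataset and perplexity are illustrative, and init="pca" corresponds to the PCA initialization mentioned above):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Handwritten digits as a stand-in for high-dimensional astronomical vectors.
X, y = load_digits(return_X_y=True)     # 1797 samples, 64 features

# Perplexity sets the effective number of neighbors; initializing with PCA
# helps preserve the global structure of the dataset.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                       # (1797, 2)
```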
t-SNE has been applied to reduce the dimensionality of catalogs (e.g., [54,55,56], which are, respectively, quasar parameters, line ratios, and chemical abundance), spectral data [57], photometry data [58], and light curves [8,59].

2.3. Neural Network-Based Dimensionality Reduction

This section introduces two neural network algorithms: the self-organizing map and the auto-encoder. Both techniques transmit outputs between interconnected neurons that mimic the human brain, performing tasks such as clustering, dimensionality reduction, and outlier detection or producing information that can be used as input for other algorithms.

2.3.1. Self-Organizing Map

The self-organizing map (SOM, [60]) is a neural network technique typically used for visualization by dimensionality reduction, but the SOM can also be applied in clustering and outlier detection. The SOM uses competitive learning, which is a form of unsupervised learning where nodes compete to represent input data. When used for dimensionality reduction, the output is almost always two-dimensional. The number of output nodes is first defined manually, usually k × k . Larger datasets generally require more output nodes, and thus a higher k. Each node is assigned a weight vector, which can be viewed as a coordinate in the input space that the node is responsible for. Therefore, the weight has the same dimensionality as the input space. Then, for each data point, the SOM updates the weights of the nearby nodes so that these nodes move even closer to the data point. That is, the closest nodes are dragged the most, while the most distant nodes are barely moved. The process of updating each node is iterated until the weight vectors converge. Figure 6 shows a visualization of the final result. In the end, each data point can be assigned to a winning neuron, which is the node whose weight vector is closest to the data point in the input space. Therefore, each data point can be represented by the x and y coordinates of the corresponding winning neuron. When used in clustering, the number of nodes is set to the known number of clusters, and the data points with the same winning node belong to the same cluster. The SOM performs clustering by partitioning the space, producing notably different results compared to the clustering algorithms introduced in Section 3, as discussed in Section 3.7.
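The update rule described above can be sketched with a minimal NumPy implementation (the grid size, learning-rate schedule, and Gaussian neighborhood are illustrative choices; dedicated SOM libraries provide more complete implementations):

```python
import numpy as np

def train_som(X, k=10, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Minimal rectangular k x k SOM trained with stochastic updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sigma0 = sigma0 or k / 2.0
    W = rng.normal(size=(k, k, d))           # one d-dimensional weight per node
    gy, gx = np.mgrid[0:k, 0:k]              # grid coordinates on the map
    for t in range(n_iter):
        x = X[rng.integers(n)]
        # Winning node: weight vector closest to the data point.
        dist = ((W - x) ** 2).sum(axis=2)
        wy, wx = np.unravel_index(dist.argmin(), dist.shape)
        # Learning rate and neighborhood width decay over time.
        frac = t / n_iter
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 1e-3
        # Gaussian neighborhood: closer nodes are dragged more.
        h = np.exp(-((gy - wy) ** 2 + (gx - wx) ** 2) / (2 * sigma ** 2))
        W += lr * h[:, :, None] * (x - W)
    return W

def winners(X, W):
    """Map each data point to the (row, col) of its winning node."""
    d = ((W[None, :, :, :] - X[:, None, None, :]) ** 2).sum(axis=3)
    flat = d.reshape(len(X), -1).argmin(axis=1)
    return np.column_stack(np.unravel_index(flat, W.shape[:2]))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
W = train_som(X, k=8)
print(winners(X, W)[:5])   # grid coordinates of the first five points
```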
Notably, the SOM performs a simplification of the data, not only reducing the dimensionality but also turning continuous data into discrete data. In other words, similar data points, corresponding to the same node, are represented by the same box in the map. Consequently, the SOM presents the distribution of data in the 2D space clearly, but the resulting output may not be suitable for further data mining because most of the information is lost. Another disadvantage is that, like other neural networks, the SOM is computationally expensive.
The SOM has been applied to photometric and spectroscopic data (e.g., for dimensionality reduction in [61,62,63,64]), catalogs (e.g., for dimensionality reduction and clustering in [65]), and light curves (e.g., for clustering in [66,67]).

2.3.2. Auto-Encoder and Variational Auto-Encoder

An auto-encoder (AE, [68,69]) is a neural network technique that aims to reduce a high-dimensional dataset to a low-dimensional representation (also known as latent space) so that the dataset reconstructed from the representation is highly similar to the original dataset. Therefore, an AE can be used in dimensionality reduction. As shown in Figure 7, an AE network can be divided into three components: encoder, bottleneck, and decoder. The encoder consists of multiple layers, with the same or decreasing number of neurons in each layer, until the information reaches the bottleneck—the most compressed representation of data, with the lowest dimensionality in the AE, only preserving the significant features. The bottleneck is thus the output of the encoder. If the number of neurons of the bottleneck is less than the dimensionality of the data, then the AE compresses the data [16]. The decoder works to reconstruct the data from the bottleneck. Once the encoder and decoder are trained, new data can be entered without retraining.
In addition to dimensionality reduction, AEs, like the SOM, can also be used for denoising. However, the AE is extremely computationally expensive and may require a GPU to run. Another disadvantage of the AE is that it does not guarantee a continuous interpolation3 of new data because of the compactness of the latent space [16]. Therefore, the variational auto-encoder (VAE, [70]) was introduced to solve the interpolation issue. The VAE imposes a Gaussian prior, mapping each data point to a probability distribution in the latent space rather than a single point. Data for reconstruction is then sampled from this distribution, mitigating issues related to interpolation and overfitting.
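A minimal auto-encoder sketch, assuming PyTorch and random data standing in for real spectra or images (the layer sizes, latent dimensionality, and training settings are illustrative):

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Minimal fully connected auto-encoder with a low-dimensional bottleneck."""
    def __init__(self, n_features=64, n_latent=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, n_latent),            # bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)                     # compressed representation
        return self.decoder(z)                  # reconstruction

# Illustrative training loop on random data.
X = torch.randn(1024, 64)
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), X)                 # reconstruction error
    loss.backward()
    opt.step()

latent = model.encoder(X).detach()              # reduced 4D representation
print(latent.shape)                             # torch.Size([1024, 4])
```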
The AE has been applied to compress and decompress images (e.g., [71,72]) and catalogs (e.g., [73]) and to denoise time series (e.g., [74]). The VAE has been applied for anomaly detection of spectroscopic data (e.g., [75]) and time series data (e.g., [76]).

2.4. Examples of Applications of Different Dimensionality Reduction Algorithms

In this section, we present examples of applying different dimensionality reduction algorithms to real observational datasets in astronomy. The dataset consists of five-dimensional astrometry data for 1254 stars, including right ascension, declination, distance, proper motion in right ascension, and proper motion in declination. The data is provided by SIMBAD [77]. We used the Strasbourg Astronomical Data Center (CDS, [78]) criteria query to retrieve the data. The criteria we used were 55° < RA < 70°, 20° < Dec < 35°, 100 pc < Distance < 170 pc, and spectral types being dimmer than or equal to ‘O’. The data is pre-processed by normalizing, removing outliers, and then re-normalizing to adjust the scale based solely on the inlier data, retaining 1113 data points.
Figure 8 shows the results of the dimensionality reduction algorithms, including PCA, LLE, non-metric MDS, and t-SNE. The five-dimensional dataset is clustered with DBSCAN (see Section 3.4) before being reduced to two dimensions. PCA is linear, while the other three algorithms are non-linear. PCA and t-SNE project the clusters relatively separately, suggesting they may be suitable for dimensionality reduction prior to clustering. In contrast, LLE and MDS—especially LLE—tend to project clusters with overlap, which may impair clustering performance. LLE’s behavior may be further amplified by its sensitivity to outliers and its reliance on local neighborhood structure.
We also applied PCA, Isomap, LLE, and t-SNE to reduce the dimensionality of 5-dimensional and 10-dimensional synthetic datasets described in Appendix A.1. The results and evaluations are presented in Appendix A.2. We find that t-SNE provides the greatest separation between clusters. When used for dimensionality reduction prior to clustering, t-SNE and Isomap achieve higher accuracy, while PCA and Isomap show greater robustness. LLE requires fine-tuning of the parameters and shows lower accuracy and robustness compared to the other algorithms.

2.5. Benchmarking of Dimensionality Reduction Algorithms

In this section, we present a benchmarking study of dimensionality reduction algorithms. Table 1 summarizes each algorithm’s run time, classic applications, novel applications, popular datasets, strengths, weaknesses and failure modes, and overall evaluation. We also provide a summary of literature-based comparisons of algorithm performance.
Table 2 presents the result of a focused search, showing the number of refereed astronomy papers in the SAO/NASA Astrophysics Data System (ADS) that applied each algorithm to different types of data from 2015 to 2025. These results offer insight into which algorithms are most widely used for specific types of analyses, reflecting current trends and preferences in the field.
Multiple studies have compared different linear and non-linear dimensionality reduction algorithms, most using spectral data. A majority of the studies [36,39,48,87] report that Isomap outperforms PCA for the dimensionality reduction of spectra from stars, massive protostars, and galaxies. In particular, Daniel et al. [48] show that, compared to PCA, LLE may better preserve information crucial for classification in a lower-dimensional projection. Similarly, Vanderplas and Connolly [44] show that LLE outperforms PCA for the classification of galactic spectra. They also point out that PCA performs better for continuum spectra than for line spectra. On the other hand, Kao et al. [46] suggest that PCA outperforms Isomap, LLE, and t-SNE when XGBoost or Random Forest is used for classification; in this example, quasar spectra are classified. When applied to catalogs, Zhang et al. [55] suggest that t-SNE outperforms PCA at visualizing galaxy clusters given high-dimensional datasets. When applied to images, Semenov et al. [118] suggest that LLE outperforms PCA, Isomap, and t-SNE; this example uses galaxy images.
In terms of the neural network-based algorithms AE and SOM, we find that the AE has wider applications in general. The AE can be used for dimensionality reduction and feature extraction, data compression, anomaly detection and denoising, and generative modeling, whereas the SOM is mostly used for dimensionality reduction and clustering only. This is because the AE has a more complex architecture than the SOM, which allows the AE to learn complex, non-linear features. However, this complexity makes the AE more computationally demanding and typically slower to train, often requiring a GPU for efficient training. In contrast, the SOM is CPU-friendly, though less flexible in application. The VAE is even more computationally expensive than the traditional AE, but it helps mitigate overfitting and improves the generalization of the model, as stated in Section 2.3.2. In modern applications, most AEs and VAEs applied to images are implemented with convolutional layers, so the nomenclature ‘AE’ commonly includes what would traditionally be called the convolutional auto-encoder (CAE). The CAE is a variant of the AE that combines a convolutional neural network (CNN) with an AE, using convolutional layers instead of fully connected layers. The CAE is best viewed as an architectural choice within the broader AE family rather than a distinct network structure. So far, there have been more applications of VAEs in astronomy than of CAEs. By conducting a focused search in the ADS regarding the application of each algorithm in astronomy in the most recent 10 years (i.e., from 2015 to 2025), we found that the SOM is more often applied to analyze photometric data, while the AE, VAE, and CAE are usually used on images.
In terms of the performance of the AE and SOM, they are not often directly compared in published studies; instead, they are frequently implemented together. Multiple studies apply the VAE for denoising and subsequently implement the SOM to study pulses of the Vela Pulsar [119,120,121]. Amaya et al. [122] use an AE for dimensionality reduction of solar wind data and then apply an SOM to cluster the latent vectors produced by the AE. Ralph et al. [71] apply the CAE and SOM to cluster radio-astronomical images: the CAE first converts the images into feature vectors, the SOM further reduces the CAE latent space, and K-means is used for clustering. Teimoorinia et al. [123] use a Deep Embedded SOM (DESOM, [124]) for clustering large astronomical images. The DESOM embeds the SOM within the bottleneck of the AE, resulting in improved classification performance and shorter training time [123].
There are various answers regarding which algorithm provides the most interpretable low-dimensional projection. As suggested by the Strengths column of the benchmarking Table 1, the choice of algorithm depends on which aspect of the data one aims to reveal or investigate. PCA is linear, so the results are more intuitive. Isomap, LLE, and t-SNE are non-linear, projecting similar data points closely, which makes them more suitable for displaying different clusters, where the clusters may overlap in a PCA projection. The SOM is often used for preliminary clustering and low-dimensional visualization, while AEs are commonly used for dimensionality reduction aimed at learning compressed latent representations that capture non-linear structure, and they are also used as a component in generative models.
Some hybrid models (Section 5) combine multiple dimensionality reduction algorithms to produce the most interpretable and useful projections.

3. Clustering

Given a dataset, we may be interested in classifying it into groups so that each group contains similar data. Unsupervised clustering algorithms are designed to accomplish this task without the need for human-provided labels. This section introduces various clustering algorithms, including the Gaussian mixture model, the K-means algorithm, hierarchical clustering, density-based spatial clustering of applications with noise and its hierarchical variant, and fuzzy C-means clustering. These algorithms group datasets into clusters by either iteratively estimating functions that represent the clusters, expanding clusters outward from core data points, or progressively merging or splitting clusters.

3.1. Gaussian Mixture Model

The Gaussian mixture model (GMM) is widely used to group objects or data into clusters. The GMM is a probabilistic model that assumes the data distribution can be represented as a weighted sum of multiple Gaussian functions, where the weights sum to one across all components. Each Gaussian function is treated as a component, allowing objects to be clustered based on which component assigns the highest probability to each data point. The expectation-maximization (EM) algorithm is the iterative process of fitting the Gaussian functions to the data. For GMMs in particular, it is common to initialize the model parameters using K-means (Section 3.2), which provides a reasonable starting estimate for the component means. In the E-step, EM calculates the probability of generating each data point using the Gaussian components with the current parameters. In the M-step, EM adjusts the parameters to maximize that probability. The E- and M-steps are repeated until convergence criteria are met, such as a sufficiently small change in the log-likelihood. The GMM assumes the clusters to be convex-shaped; i.e., each cluster has a single center and follows a relatively ellipsoidal distribution around it. Figure 9 shows an example in which two components are considered for clustering. The lines can be seen as contours of the sum of the two Gaussian functions.
The GMM assumes the number of components is known, though this is often not the case in practice. As a result, the users may either apply the Bayesian information criterion (BIC, [125]) to determine the number of components or use the variational Bayesian GMM (VBGMM, based on the variational inference algorithm of Blei and Jordan [126]), which does not fix the number of components. The BIC is a score computed from a grid search over different numbers of components and the shapes of their distributions (i.e., the types of covariance). The combination with the lowest BIC score indicates the best fit to the data distribution and is used for the GMM. The VBGMM requires more hyperparameters than the standard EM-based GMM. In particular, the choice of prior on the mixture weights plays a central role. When using a Dirichlet distribution as the prior, the maximum number of components is fixed, and the concentration parameter controls how weight is distributed across components: the lower the concentration parameter, the more weight is placed on fewer components [23]. In contrast, when the VBGMM is formulated with a Dirichlet process prior (often referred to as the DPGMM), the model is theoretically capable of using an unbounded number of components. In this case, the concentration parameter similarly regulates how many components are effectively used, with smaller values favoring fewer active components [23].
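A minimal sketch of the BIC grid search described above, assuming scikit-learn and synthetic data (the ranges of components and covariance types searched are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative data: three Gaussian blobs with different shapes.
X = np.vstack([
    rng.normal([0, 0], [1.0, 0.3], size=(300, 2)),
    rng.normal([5, 5], [0.5, 1.5], size=(300, 2)),
    rng.normal([0, 6], [0.8, 0.8], size=(300, 2)),
])

# Grid search over the number of components and covariance type,
# keeping the combination with the lowest BIC.
best = None
for n in range(1, 7):
    for cov in ("full", "tied", "diag", "spherical"):
        gmm = GaussianMixture(n_components=n, covariance_type=cov,
                              random_state=0).fit(X)
        bic = gmm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, n, cov, gmm)

bic, n, cov, gmm = best
labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X)       # soft membership probabilities
print(n, cov)
```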
The GMM has been applied to cluster photometric data (e.g., [127,128]), spectroscopic data (e.g., [129]), catalogs (e.g., [130,131]), and scattered data (e.g., [132]). An example of an application of the DPGMM is the detection of the shape of open clusters [133].

3.2. K-Means

The K-means algorithm was developed by MacQueen [134] to partition N-dimensional objects into k clusters. The procedure of K-means can be summarized as follows. First, centroid initialization is applied. From the data, k initial centroids are picked, and each element is assigned to be in the same cluster as the closest centroid. Next, the centroids are updated by computing the mean position of all points assigned to each cluster. Based on the updated centroids, points are reassigned to the nearest centroid, and the centroids are recomputed. This process of assignment and centroid update is repeated iteratively until convergence, at which point the centroids minimize the in-cluster sum of squares, i.e., the sum of squared distances between each point and its cluster centroid.
K-means has several limitations. Similarly to the GMM (Section 3.1), K-means requires prior knowledge of the number of clusters. This issue can be mitigated by testing different numbers of clusters and selecting the one at which the distortion curve4 begins to level off, indicating diminishing returns from adding more clusters [6], or by using a large number of clusters and discarding small clusters when the data allows [1]. Another limitation is K-means’ sensitivity to centroid initialization. One common approach is to repeat the computation for different initializations and select the result that produces the smallest in-cluster sum of squares, as demonstrated by Panos et al. [3]. Alternatively, the K-means++ algorithm [135] improves the initial centroid selection by ensuring that centroids are initially well separated, which often leads to better clustering outcomes. Additionally, K-means assumes that clusters are convex-shaped, which may not hold for more complex data distributions.
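A minimal K-means sketch illustrating the elbow heuristic, k-means++ initialization, and repeated initializations (assuming scikit-learn; the data and the range of k are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=900, centers=4, cluster_std=1.0, random_state=0)

# Elbow heuristic: inertia (in-cluster sum of squares) versus k.
inertias = {k: KMeans(n_clusters=k, init="k-means++", n_init=10,
                      random_state=0).fit(X).inertia_
            for k in range(1, 9)}
print(inertias)   # look for where the curve starts to level off

# Final model with the chosen k; n_init repeats the run with different
# initializations and keeps the solution with the smallest inertia.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
labels, centroids = km.labels_, km.cluster_centers_
```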
The K-means algorithm has been applied to various types of data for clustering, including scattered data from catalogs (e.g., [6]), spectral data (e.g., [1,2,3]), and polarimetric data (e.g., [136]).

3.3. Hierarchical Clustering and Friends-of-Friends Clustering

Hierarchical clustering (HC) is a clustering algorithm that identifies clusters across all scales without requiring a specified number of clusters [16]. HC can be performed in a top-down (also known as divisive) or bottom-up (also known as agglomerative; [137]) approach. Figure 10 is a dendrogram demonstrating bottom-up clustering. In the bottom-up approach, the N elements start as N independent clusters, each containing a single element. The two nearest clusters are then merged, reducing the number of clusters from N to N − 1. The distance between clusters can be calculated in various ways, generating significantly different results. One example is to sum the distances between all possible pairs of points, with one point from each cluster, and divide by the product of the numbers of elements in the two clusters [16]. The merging procedure is repeated until there is only one cluster, containing all N elements. The top-down approach is the reverse of the bottom-up approach. Instead of merging clusters, this method divides a single cluster into two at each step. By applying HC, clusterings for all possible numbers of clusters are generated, and the user can choose the level (i.e., the number of clusters) for investigation.
Friends-of-Friends clustering (FOF) can be viewed as single-linkage agglomerative hierarchical clustering with a ‘linking length’ l. In single-linkage clustering, the distance between two clusters is defined as the minimum distance between any two points in the clusters. The linking length serves as the threshold for merging: if the distance between two points is smaller than l, they are considered ’friends’ and belong to the same cluster [16].
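Both procedures can be sketched with SciPy’s hierarchical clustering routines (the linkage method, number of clusters, and linking length below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

# Bottom-up (agglomerative) clustering; 'average' linkage uses the mean
# pairwise distance between clusters, as described above.
Z = linkage(X, method="average")
labels_3 = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram at 3 clusters

# Friends-of-Friends as single-linkage clustering with a linking length l:
# points closer than l end up in the same cluster.
l = 1.5
Z_single = linkage(X, method="single")
fof_labels = fcluster(Z_single, t=l, criterion="distance")
print(len(np.unique(fof_labels)))   # number of FOF groups at this linking length
```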
An advantage of HC is that the computation does not need to be repeated if different numbers of clusters are considered. Another advantage is that HC does not assume the shape of clusters to be convex, unlike GMM and K-means algorithms.
HC has been applied to cluster bivariate data (e.g., chemical abundances and positions of stars in [138,139]), higher-dimensional scattered data (e.g., [140]), light curves (e.g., [9]), and spectral data (e.g., [141]). FOF is often used in cluster analysis for cosmological N-body simulations, mostly to cluster particles into dark matter halos (e.g., [142,143,144]).

3.4. Density-Based Spatial Clustering of Applications with Noise

Density-based spatial clustering of applications with noise (DBSCAN, [145]) is another clustering algorithm that discovers clusters of all shapes and does not require a specified number of clusters. DBSCAN divides the data space into high-density and low-density regions; the clusters lie in the high-density regions. Core samples are picked from the high-density regions. We define a core sample as a sample that has a minimum of k samples within a maximum distance s from itself. Samples within a distance s from a core sample are considered neighbors of that sample. Then, the same requirement is applied to find the core samples among the neighbors. Samples are assigned to the same cluster as their nearest core sample. Then, for each additional core sample, we find its neighboring core samples. The steps are performed iteratively to cluster the samples.
DBSCAN has some advantages and disadvantages. As discussed above, an advantage of DBSCAN is that it does not assume the clusters to have convex shapes, thus discovering clusters with arbitrary shapes. Another advantage is that DBSCAN can filter out the outliers, which are non-core samples that do not neighbor any core sample. On the other hand, one crucial disadvantage of DBSCAN is that the results are greatly impacted by the parameters k and s, especially s. For high-density data, DBSCAN may require a higher k for better clustering. s is highly dependent on the data: setting s too small would lead to the fringes of clusters being recognized as outliers, while setting s too large could merge clusters. However, there are various ways to determine k and s (e.g., [146,147,148,149]). Another disadvantage of DBSCAN is the single density threshold conveyed by the fixed k and s, which means DBSCAN may not be useful when the clusters have different densities (i.e., inhomogeneous). This disadvantage could be avoided using hierarchical DBSCAN, which is introduced in Section 3.5.
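A minimal DBSCAN sketch, assuming scikit-learn (the dataset, the value of eps, which plays the role of s, and min_samples, which plays the role of k, are illustrative, and the k-distance percentile is only a rough heuristic):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: non-convex, so K-means or the GMM would struggle.
X, _ = make_moons(n_samples=600, noise=0.07, random_state=0)

# One heuristic for s (eps): look at the distribution of distances to the
# k-th nearest neighbor and pick a value near the 'knee' of that curve.
k = 5
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
print(np.percentile(dists[:, -1], 90))   # a rough guide for eps

db = DBSCAN(eps=0.2, min_samples=k)      # eps ~ s, min_samples ~ k in the text
labels = db.fit_predict(X)
print(np.unique(labels))                 # -1 marks points labeled as noise/outliers
```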
DBSCAN has been applied to five-dimensional scattered data from catalogs (e.g., position and motion taken from GAIA catalogs by [146,150,151]), three-dimensional positional scattered data (e.g., positions of stars, taken from GAIA catalogs, by [147]), spectral data (e.g., [148,149]), and images (e.g., [10]).

3.5. Hierarchical Density-Based Spatial Clustering of Applications with Noise

Hierarchical density-based spatial clustering of applications with noise (HDBSCAN, [152]) is the hierarchical extension of DBSCAN, as indicated by the name. HDBSCAN is also used for clustering, especially for globally inhomogeneous data, whereas DBSCAN assumes the data to be homogeneous. As discussed above, DBSCAN imposes a single density threshold to define clusters. As a result, DBSCAN may not be applicable when the clusters have different densities. HDBSCAN solves this problem by fixing the minimum number of samples k and considering all possible distances s (for more information on k and s, refer to Section 3.4). First, HDBSCAN defines the core distance of a point as the distance to its k-th nearest point. For all pairs of points p and q, HDBSCAN defines the mutual reachability distance as the maximum of the core distances of the two points and the distance between them, and thus transforms the graph so that each pair of data points is separated by their mutual reachability distance. Then, HDBSCAN finds the minimum spanning tree of the new graph, which connects all points with the smallest total distance while avoiding cycles. From the minimum spanning tree, HDBSCAN uses HC to find all possible clusterings.
HDBSCAN has all the advantages of DBSCAN, but it also has two additional advantages: HDBSCAN does not apply the same density threshold to all clusters, and the computation does not need to be repeated to consider different numbers of clusters. HDBSCAN eliminates the use of s and instead has a new parameter—the minimum size of a cluster. In some cases, this parameter may be easier to set than s, since it is basically asking what size of a group of data you would consider a cluster [153]. Similarly, HDBSCAN shares some limitations with DBSCAN. A key disadvantage is its sensitivity to hyperparameters: selecting suboptimal values can lead to under- or over-clustering. In addition, HDBSCAN tends to identify clusters in the densest regions, potentially labeling points in more diffuse clusters as outliers [152].
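A minimal HDBSCAN sketch; this assumes a recent scikit-learn release (1.3 or later), although the standalone hdbscan package offers an equivalent interface (the data and minimum cluster size are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import HDBSCAN   # or: from hdbscan import HDBSCAN

# Clusters with very different densities, where plain DBSCAN struggles.
X1, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X2, _ = make_blobs(n_samples=200, centers=[[4, 4]], cluster_std=1.5, random_state=1)
X = np.vstack([X1, X2])

# No global distance threshold s; instead, set the minimum cluster size.
hdb = HDBSCAN(min_cluster_size=25)
labels = hdb.fit_predict(X)
print(np.unique(labels))   # -1 marks points considered noise
```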
HDBSCAN has been applied to spectroscopic and photometric data (e.g., [154,155,156]), light curves (e.g., [157]), astrometric data (e.g., [158]), and other catalogs (e.g., [159,160]).

3.6. Fuzzy C-Means Clustering

Fuzzy C-means clustering (FCM, also known as C-means clustering), being the most widely applied fuzzy clustering (also known as soft clustering) algorithm, was initially developed by Dunn [161] and further improved by Bezdek [162]. Soft clustering means that the algorithms may not assign a data point to a single cluster. Instead, the point may be assigned to multiple clusters with corresponding membership grades between 0 and 1. That is, a point at the edge of a cluster may have a smaller membership grade for that cluster (e.g., 0.1) compared to a point at the center of the cluster (e.g., 0.95). Turning to FCM, it is very similar to the K-means algorithm mentioned in Section 3.2. The only difference is that K-means clustering imposes hard clustering, while C-means clustering imposes soft clustering. In fact, K-means clustering is sometimes referred to as hard C-means clustering [163]. Figure 11 shows an application of FCM to two-dimensional scattered data from the Sloan Moving Object Catalog.
The advantages and disadvantages are very similar to those of K-means clustering. One shared disadvantage is the need for prior knowledge about the number of clusters. To avoid this problem, FCM can be run with different numbers of clusters; one can then compute the fuzzy partition coefficient for each resulting clustering, which measures how well the clustering describes the data, and select the number of clusters that optimizes this coefficient. Compared to K-means, an advantage of FCM is that it may be more flexible because it allows a point to be assigned to multiple clusters, thus generating more reliable results. However, FCM is computationally more expensive for the same reason. There are some algorithms built on FCM that aim to reduce the computational cost (e.g., [165,166]).
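Because FCM is not part of scikit-learn, the sketch below implements the standard membership and center updates directly in NumPy (the fuzzifier m, tolerance, and data are illustrative; dedicated fuzzy-clustering packages provide ready-made implementations):

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means: returns cluster centers and membership grades."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix U (each row sums to one).
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Update centers as membership-weighted means of the data.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Update memberships: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
print(centers)
print(U[:3])   # soft membership grades of the first three points (rows sum to 1)
```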
Fuzzy clustering has been applied to cluster catalogs (e.g., [167,168]), spectral data (e.g., [169]), and images (e.g., [170]). FCM has been applied to cluster time series data (e.g., [171]), catalogs (e.g., [172,173]), and images (e.g., [174,175]).

3.7. Examples of Applications of Different Clustering Algorithms

In this section, we present examples of applying different clustering algorithms to the same real observational datasets in astronomy as those introduced in Section 2.4. Figure 12 shows the results of the clustering algorithms, including the GMM, K-means, HC, DBSCAN, FCM, and the self-organizing map (SOM, see Section 2.3.1). PCA is applied for visualization after the dataset is clustered. The clustering algorithms are applied to generate four clusters, as suggested by the BIC calculation discussed in Section 3.1.
Among the six algorithms, the GMM and DBSCAN yield similar results by recognizing outliers, while K-means and HC produce similar clustering. The GMM, K-means, and HC identify three relatively dense, convex-shaped clusters and one scattered cluster that likely represents outliers. In contrast, DBSCAN excludes the outlier cluster from its core clusters, instead labeling the scattered points as noise, resulting in a total of five distinct clusters. Comparing the GMM and DBSCAN, the overlap between corresponding clusters shows high consistency, where 83.3% of the data points are similarly clustered: the upper clusters overlap by 19.9% of the GMM upper cluster and 100.0% of the DBSCAN upper cluster; the lower right clusters by 99.3% and 96.8%, respectively; the lower left clusters by 74.4% and 99.3%, respectively (where the green and orange clusters are combined for easier comparison); and the broader (i.e., outlier) clusters by 94.9% and 63.8%, respectively. Comparing K-means and HC, the overlap between corresponding clusters also shows high consistency, where 94.16% of the data points are similarly clustered: the upper clusters (pink) overlap by 91.7% of the K-means upper cluster and 85.8% of the HC upper cluster; the lower right clusters (blue) by 95.1% and 99.2%, respectively; the lower left clusters (green) by 94.6% and 93.9%, respectively; and the scattered clusters on the right by 93.1% and 87.1%, respectively. The statistics may change slightly if the program is re-run, as most algorithms are not deterministic.
We also applied the algorithms to cluster 5-dimensional and 10-dimensional synthetic datasets. The results and evaluations are presented in Appendix A.3. We find that the GMM outperforms the other algorithms, while K-means demonstrates improved accuracy and robustness in higher dimensions. The SOM yields poor clustering performance; however, it is typically not used for clustering directly. The SOM is typically used for data visualization or as a preliminary step before applying other clustering algorithms.

3.8. Benchmarking of Clustering Algorithms

In this section, we present a benchmarking study of clustering algorithms. Table 3 summarizes each algorithm’s run time, classic applications, novel applications, popular datasets, strengths, weaknesses and failure modes, and overall evaluation. We also provide a summary of literature-based comparisons of algorithm performance.
Table 4 presents the result of a focused search, showing the number of refereed astronomy papers in the ADS that applied each algorithm to different types of data from 2015 to 2025. These results offer insight into which algorithms are most widely used for specific types of analyses, reflecting current trends and preferences in the field.
Compared to the dimensionality reduction algorithms, the difference in application of the clustering algorithms may be more significant. Generally speaking, when clusters are spherical and have similar densities, K-means provides fast and inexpensive clustering and is applicable to large, high-dimensional datasets. When clusters have different sizes but are still convex-shaped (e.g., spherical or oval), the GMM can be applied and will be fast when there is not much noise. If clusters are not convex-shaped but have similar densities, DBSCAN can be applied. If clusters are not convex-shaped and have different densities, HDBSCAN can be applied. If we want to investigate the hierarchical structure of the dataset, HC can be applied. If the clusters overlap and we want to know the membership probability, soft clustering algorithms, such as FCM, can be applied.
Some studies compare the discussed algorithms. As Hunt and Reffert [151] point out, both the GMM and K-means do not deal with noise, while DBSCAN and HDBSCAN do. The same paper also suggests that DBSCAN produces less reliable OC membership lists compared to the GMM and HDBSCAN. Yan et al. [148] suggest that DBSCAN can identify consecutive structures in position–position–velocity space, while HDBSCAN cannot. Yan et al. [100] suggest that, when applied to classify accretion states, K-means is usually more accurate than HC.

4. Other Applications of Unsupervised Machine Learning

This section introduces two additional applications of unsupervised machine learning: anomaly detection and symbolic regression. Anomaly detection refers to identifying outliers in a dataset using various algorithmic approaches. Symbolic regression is a task where algorithms, typically genetic programming, search for one or more analytical expressions that model the data, with or without prior physical knowledge.

4.1. Anomaly Detection

Anomaly detection aims to find objects that either do not belong to pre-defined classes (in supervised learning) or are produced as noise or unusual events. Anomaly detection is particularly crucial as a pre-processing step before applying supervised algorithms: when a supervised classification algorithm is trained, we typically do not include anomalies in the training set. Therefore, if anomalies are presented as input to a supervised classifier, they will be forcibly assigned to one of the pre-defined classes [201].
Some dimensionality reduction algorithms, such as LLE and t-SNE, are good visualization tools to identify the outliers [15]. These projections often map outliers far from the main clusters, which can help guide the application of clustering algorithms to detect outliers more effectively (e.g., [48]). Furthermore, certain clustering algorithms can identify outliers directly, without relying on a dimensionality reduction step. For example, the GMM can group the outliers into one or multiple clusters but may not classify them as outliers, as shown in Figure 12. An extension of the GMM computes the log-likelihood of each sample, where a smaller score indicates a higher chance of being an outlier. The threshold for outlier detection may be manually selected to achieve better results, especially when the probabilistic distribution of the data points is not Gaussian. Other clustering algorithms, such as DBSCAN and its variant HDBSCAN, can also identify the outliers, though both require certain parameters to be selected manually. SOM denoising also performs outlier detection but requires a manually defined probabilistic threshold, similar to the GMM (e.g., [105]). VAEs can also be used to identify outliers by applying a threshold on the reconstruction error of each data point. For example, Villaescusa-Navarro et al. [202] use a fully unsupervised VAE on 2D gas map images and identify anomalies: images with significantly larger reconstruction errors compared to typical samples are classified as outliers. Their goal is to provide theoretical predictions for various observables as a function of cosmology and astrophysics, and the anomaly detection step helps to rule out theoretical models that deviate from the learned manifold.
In addition to the algorithms discussed above, Isolation Forest (iForest, [203]) is a dedicated anomaly detection algorithm. iForest uses random trees to randomly partition the dataset. It randomly selects a feature (i.e., a dimension), chooses a random splitting threshold on that feature, and recursively partitions the data until each point is isolated in its own partition. The outliers are partitioned first because they are more isolated (i.e., sparse and located at the edges of the data), resulting in a smaller isolation path length, which is the number of partitions required to isolate a point. Thus, a shorter path length suggests a higher likelihood that the point is an outlier. The path length depends strongly on the initial random partitions, so multiple trees are considered. The anomaly score is computed from the path lengths obtained from multiple trees, with outliers having higher anomaly scores. The number of trees and the threshold for determining outliers are user-defined, where the threshold is often set through the contamination rate hyperparameter, which can significantly impact the results. Due to its structure, iForest is fast and efficient. However, iForest may not perform well when interdependencies exist between features [204]. iForest has been applied to detect outliers in time series (e.g., [204]), catalogs (e.g., [205]), light curves (e.g., [157]), and latent spaces of neural networks (e.g., [206,207,208]).
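A minimal iForest sketch, assuming scikit-learn (the contamination rate, number of trees, and synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Dense inlier cloud plus a handful of scattered outliers.
X = np.vstack([rng.normal(0, 1, (500, 5)), rng.uniform(-6, 6, (15, 5))])

# Many random trees; 'contamination' sets the expected outlier fraction and
# hence the score threshold used to flag anomalies.
iso = IsolationForest(n_estimators=200, contamination=0.03, random_state=0)
flags = iso.fit_predict(X)          # -1 for outliers, +1 for inliers
scores = iso.score_samples(X)       # lower scores indicate more anomalous points
print((flags == -1).sum())
```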
In addition to unsupervised methods, there are also semi-supervised approaches (Section 5) for identifying anomalies. Richards et al. [201] introduced a semi-supervised routine to classify All Sky Automated Survey objects while detecting anomalies. The paper applies Random Forest—typically used as a supervised algorithm—in a semi-supervised setting, where the feature vector produced by the Random Forest is used to compute an anomaly score for each object. A threshold on this score is then used to determine the anomalies. The study claims that this routine is able to find interesting objects that do not belong to any of the pre-defined classes. Furthermore, Lochner and Bassett [209] point out that although most anomaly detection methods are able to flag anomalies, they often fail to distinguish relevant anomalies from irrelevant ones. For example, one may want to remove anomalies caused by instrument errors but retain unusual astronomical events or features. The ‘relevance’ of anomalies ultimately depends on the users’ scientific goals. They introduce a semi-supervised anomaly detection framework, Astronomaly, that uses active learning to search for interesting anomalies. The authors demonstrate that the method is able to recover many interesting galaxies in the Galaxy Zoo dataset.
Figure 13 shows example results of outlier detection using the GMM, DBSCAN, the SOM, and iForest, applied to the same real observational astronomy dataset introduced in Section 2.4. As shown, all four outlier detection algorithms identify the three major convex-shaped clusters discussed in Section 3.7 and generate similar results with minor differences. While the true outliers are unknown, the algorithms show consistent patterns in separating points that lie farther from the cluster cores, particularly for the two larger clusters. The SOM exhibits a more scattered inlier pattern, whereas the GMM, DBSCAN, and iForest produce results that largely confirm one another.

4.2. Symbolic Regression

Symbolic regression (SR) computes analytical equations directly from the data, finding both the functional form and its parameters simultaneously [210]. SR is based on genetic programming, a subfield of evolutionary computing. SR randomly initializes a first population of candidate configurations (i.e., equations). Each equation can be represented as a tree with nodes and branches, with each node being a symbol (e.g., ÷, log, 4.2, X, and Y). SR evaluates the configurations, removes the less effective ones, and then randomly selects and exchanges sub-configurations (subtrees) between two or more configurations. This evolutionary procedure iterates until SR produces a robust equation.
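The tree representation and subtree exchange (crossover) described above can be illustrated with a short, self-contained sketch that does not rely on any particular SR library; the example expressions, operator set, and protected division are illustrative choices.

```python
import math
import random

# An expression tree is a nested tuple: (operator, child, ...) with leaves being
# the variable 'x' or a numeric constant.
EXPR_A = ('add', ('mul', 'x', 'x'), ('sin', 'x'))   # x*x + sin(x)
EXPR_B = ('div', ('add', 'x', 2.0), ('cos', 'x'))   # (x + 2) / cos(x)

OPS = {
    'add': lambda a, b: a + b,
    'mul': lambda a, b: a * b,
    'div': lambda a, b: a / b if abs(b) > 1e-9 else 1.0,  # protected division
    'sin': math.sin,
    'cos': math.cos,
}

def evaluate(node, x):
    """Recursively evaluate an expression tree at the value x."""
    if node == 'x':
        return x
    if isinstance(node, (int, float)):
        return node
    op, *children = node
    return OPS[op](*(evaluate(child, x) for child in children))

def subtree_paths(node, path=()):
    """Return the index paths of all subtrees, including the root."""
    paths = [path]
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            paths += subtree_paths(child, path + (i,))
    return paths

def get_subtree(node, path):
    for i in path:
        node = node[i]
    return node

def replace_subtree(node, path, new):
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace_subtree(node[i], path[1:], new),) + node[i + 1:]

# Crossover: swap a randomly chosen subtree of parent A with one from parent B.
random.seed(1)
path_a = random.choice(subtree_paths(EXPR_A))
path_b = random.choice(subtree_paths(EXPR_B))
offspring = replace_subtree(EXPR_A, path_a, get_subtree(EXPR_B, path_b))

print('offspring expression:', offspring)
print('offspring evaluated at x = 1.5:', evaluate(offspring, 1.5))
```

In a full SR run, many such offspring would be generated per generation, scored against the data, and the least effective ones discarded.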
SR is frequently used as a supervised learning algorithm, where a label y is given for every data point x, and the goal is to find a model y = f(x). SR can also be applied in an unsupervised setting, where the objective is to discover an implicit relationship among variables by identifying an equation of the form f(x, y) = 0 that describes the dataset [211]. Equivalently, if the dataset consists solely of the variables x, one may seek a relation f(x) = 0. While there are numerous applications of supervised SR in astronomy (e.g., [212,213,214,215,216]), the use of unsupervised SR remains relatively unexplored. A key reason why unsupervised SR has seen limited application is the difficulty of identifying a meaningful model that satisfies an implicit equation. The discrete search space contains a dense distribution of degenerate solutions—in simpler terms, there are infinitely many mathematically valid implicit equations that fit the data [211]—which degrades SR performance [217]. For example, an implicit model of the form f(x) = g(x) − ĝ(x) is trivially valid whenever g(x) and ĝ(x) are semantically equivalent (e.g., g(x) = cos²(x_i) + sin²(x_i) and ĝ(x) = 1). Therefore, only a few SR methods are suitable for application to unlabeled data.
One early attempt to address this challenge was proposed by Schmidt and Lipson [211]. They show that directly applying basic SR methods does not guarantee recovering a meaningful implicit model for complex data. To address this, they introduce the Implicit Derivative Method, which uses the local derivatives of the data points when searching for an implicit equation. The idea is that a meaningful model should be able to predict the derivatives of each variable. For unordered data, one can fit a higher-order plane to the local neighborhood of each point and compute the derivatives from the fitted hyperplane. Schmidt and Lipson [211] demonstrate that this approach is accurate for simple 2D and 3D geometric shapes and for basic physical systems. However, it remains unclear whether the method can scale to the large, noisy, and high-dimensional datasets commonly encountered in astronomy.
Building on the need to better recover complex models, a more recent approach was introduced by Kuang et al. [217]. They propose a pre-training framework, PIE, designed to recover implicit equations from unlabeled data in an unsupervised setting. In particular, PIE generates a large synthetic pre-training dataset to refine the search space and reduce the impact of degenerate solutions. This method has not yet been applied in astronomy, and its applicability may be limited by the extremely large input data volumes typical of astronomical surveys, as well as the presence of noise and anomalies.

5. Future Directions

This section discusses the open challenges and future research directions of unsupervised ML in astronomy.
Scalability to large datasets has always been a critical problem for ML. Unsupervised ML, whose main aim is to reduce the manual cost of analyzing large datasets, must therefore be able to handle such datasets efficiently. Many studies work with large databases such as Gaia and SDSS, but the actual size of the subsets selected from these databases is often moderate (e.g., 720k objects in Queiroz et al. [56] and 320k in Lyke et al. [81]). Therefore, the datasets used in these studies may not fully reflect the scalability of the algorithms, and even algorithms successfully applied to full databases might struggle with future, larger datasets (e.g., the upcoming LSST survey5). Non-linear dimensionality reduction methods generally scale poorly due to high computational complexity, while PCA requires batch processing and large memory. Certain clustering algorithms (e.g., HC, DBSCAN, and HDBSCAN) also do not readily scale to extremely large datasets. The SOM is computationally complex and therefore scales poorly, whereas EM is expensive to train but can be applied to large datasets once trained. To mitigate these issues, practitioners have developed extensions of traditional algorithms, such as minibatch processing and randomized SVD, trading a small amount of accuracy for improved scalability.
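The following sketch illustrates the kind of scalability extensions mentioned above using scikit-learn: a randomized SVD solver for PCA, incremental (batched) PCA, and minibatch K-means. The random matrix stands in for a large catalog, and the batch sizes and component counts are arbitrary illustrative values.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA, PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))  # stand-in for a large, high-dimensional catalog

# Randomized SVD solver: an approximate but much faster PCA for large matrices.
pca_fast = PCA(n_components=5, svd_solver='randomized', random_state=0).fit(X)

# IncrementalPCA processes the data in batches, avoiding loading everything into
# memory at once (the batches could equally be streamed from disk).
ipca = IncrementalPCA(n_components=5, batch_size=10_000)
for start in range(0, len(X), 10_000):
    ipca.partial_fit(X[start:start + 10_000])

# MiniBatchKMeans updates the cluster centers from small random batches.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=4096, random_state=0).fit(X)
print(pca_fast.explained_variance_ratio_.round(3), mbk.inertia_)
```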
Data acquisition also presents challenges. Although diverse types of data are available, a substantial gap exists between the volume of photometric and spectroscopic data. Compared to photometric data, spectroscopic data may give us more precise information about the objects, and most of the discussed ML algorithms are more often used on spectroscopic and spectral data, as shown in Table 2 and Table 4. However, as noted by van den Busch et al. [102], spectroscopic observations are considerably scarcer because they are more time-consuming and expensive. The scarcity of spectroscopic observations gives rise to a major challenge for redshift calibration, as photometric estimates for individual galaxies are less precise and subject to biases [218]. One strategy to address this limitation is direct calibration (DIR, [218]), which matches photometric sources to spectroscopic counterparts. However, because spectroscopic coverage is incomplete, photometric data without a corresponding spectroscopic match within the same SOM cell must be excluded. Building on DIR, Wright et al. [63] introduced additional quality control, an approach subsequently adopted by van den Busch et al. [102] and Hildebrandt et al. [62] to jointly exploit photometric and spectroscopic data for redshift calibration.
Building on the recent developments, future astronomical studies can leverage semi-supervised, self-supervised, and ensemble methods to improve data analysis. The following sections discuss these approaches in more detail, highlighting recent applications and potential benefits.
Semi-supervised and self-supervised learning are typically implemented with deep neural networks. Semi-supervised learning utilizes both labeled and unlabeled data, a scenario very common in astronomy. In contrast, self-supervised learning generates its own labels from an unlabeled dataset, which are then used for subsequent training. Together, semi-supervised and self-supervised approaches blur the traditional distinction between supervised and unsupervised learning. A focused literature search shows that the number of studies utilizing semi-supervised learning has increased sharply since 2000, from just two papers in the ADS database between 2000 and 2005 to 124 papers between 2020 and 2025. Balakrishnan et al. [219] apply semi-supervised learning to classify pulsars, and Rahmani et al. [220] use it to classify galactic spectra. Both studies suggest that semi-supervised learning may achieve higher classification accuracy than supervised learning. Similarly, Stein et al. [221] suggest that, compared to supervised and unsupervised learning, self-supervised learning may generate higher-quality image representations, especially when labeled data are insufficient.
Ensemble methods, which combine multiple algorithms to improve performance, include both hybrid models and model stacking. A hybrid model combines different algorithms with the same objective to generate an outcome, performing tasks such as classification. Hybrid models are expected to combine the advantages of individual algorithms while mitigating their disadvantages. For example, Kumar and Kumar [222] apply hybrid and traditional models to analyze sunspot time series data and find that hybrid models consistently outperform traditional models by capturing complex patterns. A pipeline can be considered a type of hybrid model if multiple algorithms are used; a pipeline is a workflow with multiple components, where each component produces its own output that is either used for subsequent analysis or serves as the final output. For instance, Angarita et al. [223] perform dimensionality reduction on a spectral catalog by applying LLE after PCA, as LLE is more computationally expensive, and then apply the GMM for clustering (a sketch of this style of pipeline is given below). Asadi et al. [183] use K-means as a component of a semi-supervised learning framework, in combination with Random Forest, to classify spectroscopic data. Yan et al. [100] first cluster galactic photometric data into SOM neurons and then cluster the neurons using HC; compared to using only HC, this model reduces computational cost, and compared to using only the SOM, it is more flexible and allows faster inspection of results for different numbers of clusters. In addition, model stacking—a form of ensemble learning6—can be considered a type of hybrid model when different algorithms are combined. Model stacking involves training multiple base models, followed by a meta-model trained on the predictions or outputs of the base models, which then produces the final result. Shojaeian et al. [224] demonstrate model stacking of six different algorithms using geometric data and further improve training using PCA. Bussov and Nättilä [225] provide an example in astrophysics, applying model stacking to multiple SOM models to study simulated plasma flows.
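As an illustration of a pipeline-style hybrid model in the spirit of the PCA-then-LLE-then-GMM workflow of Angarita et al. [223], the following sketch chains the three steps with scikit-learn. The synthetic blob data and the component, neighbor, and cluster counts are illustrative placeholders rather than the authors' actual settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.mixture import GaussianMixture

# Stand-in for a spectral catalog: three blobs in 20 dimensions.
X, _ = make_blobs(n_samples=2000, n_features=20, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: cheap linear compression with PCA (reduces the cost of LLE).
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# Step 2: non-linear embedding with LLE on the compressed data.
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=15).fit_transform(X_pca)

# Step 3: cluster the embedded points with a GMM.
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_lle)
print(np.bincount(labels))
```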
In recent years, there have been several developments extending classical algorithms such as the SOM and HDBSCAN. A frequently used algorithm for star clustering is Stars' Galactic Origin (StarGO, [226]), which is based on the SOM. Hierarchical Mode Association Clustering (HMAC, [227]) has been applied in astronomy to cluster astrometric data [139], while Agglomerative Clustering for ORganising Nested Structures (ACORNS, [189]) was developed to cluster spectroscopic data; both HMAC and ACORNS are based on HC. HDBSCAN-MC [197] integrates Monte Carlo simulation with HDBSCAN to better handle astrometric data with significant uncertainties. The Extreme Deconvolution GMM (XDGMM) combines extreme deconvolution [228] with the GMM; its applications include, but are not limited to, identifying membership lists for the Gaia-Sausage-Enceladus using stellar spectroscopic data [177] and clustering gamma-ray bursts [229]. These examples are not exhaustive, as new developments continue to emerge.
In addition, entirely new algorithms have been developed. Some, such as the dimensionality reduction algorithm Uniform Manifold Approximation and Projection (UMAP), are gaining popularity in the field, although more applications may be needed for a comprehensive evaluation. Examples of UMAP applications include reducing the dimensionality of galaxy images [118], stellar spectral data [107], and simulated time-series magnetohydrodynamic data of star-forming clouds [172].
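A minimal UMAP usage sketch is shown below, assuming the third-party umap-learn package is installed; the random input array stands in for, e.g., flattened images or spectra, and the parameter values are commonly used defaults rather than recommendations from the cited studies.

```python
import numpy as np
import umap  # third-party umap-learn package

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))  # stand-in for flattened galaxy images or spectra

# n_neighbors controls the local/global balance of the embedding;
# min_dist controls how tightly points are packed in the 2D space.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (5000, 2)
```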

6. Conclusions

This review discusses unsupervised machine learning algorithms used in astronomy, classifying them into two categories by their aims: dimensionality reduction and clustering. Anomaly detection and symbolic regression are also briefly discussed. For each algorithm, the mechanism, characteristics (e.g., advantages and disadvantages), and past applications are reviewed, and benchmarking examples are presented. Most of the algorithms, such as DBSCAN and the VAE, are frequently used in astronomy, while others, such as unsupervised symbolic regression, remain underutilized. Table 2 and Table 4 present the results of a focused search, showing the number of refereed astronomy papers in the ADS that applied each algorithm to different types of data from 2015 to 2025. These results offer insight into which algorithms are most widely used for specific types of analyses, reflecting current trends and preferences in the field. The rows are ordered by the total number of applications summed over all algorithms, with higher totals listed first.
This review also provides examples demonstrating the application of these algorithms to a real five-dimensional astrometry dataset and to two synthetic datasets of five and ten dimensions. Overall, unsupervised machine learning has wide application in astronomy, allowing researchers to analyze large quantities of high-dimensional, unlabeled data and meeting the needs created by advances in observational technology.

Author Contributions

Conceptualization, C.-T.K., D.X., and R.F.; methodology, C.-T.K. and D.X.; software, C.-T.K. and D.X.; validation, C.-T.K. and D.X.; formal analysis, C.-T.K.; investigation, C.-T.K. and D.X.; resources, C.-T.K.; data curation, C.-T.K.; writing—original draft preparation, C.-T.K.; writing—review and editing, C.-T.K. and D.X.; visualization, C.-T.K. and D.X.; supervision, D.X. and R.F.; project administration, C.-T.K., D.X., and R.F.; funding acquisition, D.X. All authors have read and agreed to the published version of the manuscript.

Funding

D.X. acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number 568580]. D.X. also acknowledges support from the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship Program, a program of Schmidt Sciences.

Data Availability Statement

Data used in this study are publicly available from the SIMBAD astronomical database and were retrieved using the ‘Query by Criteria’ interface (https://simbad.cds.unistra.fr/simbad/sim-fsam (accessed on 8 September 2025)).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACORNS: Agglomerative clustering for organising nested structures
AE: Auto-encoder
BIC: Bayesian information criterion
CADET: Cavity Detection Tool
CAE: Convolutional auto-encoder
DBSCAN: Density-based spatial clustering of applications with noise
DESOM: Deep embedded self-organizing map
dim red: Dimensionality reduction
DIR: Direct calibration
DPGMM: Dirichlet process Gaussian mixture model
EM: Expectation maximization
FCM: Fuzzy C-means clustering
FOF: Friends-of-Friends
GMM: Gaussian mixture model
GPU: Graphics processing unit
HC: Hierarchical clustering
HDBSCAN: Hierarchical density-based spatial clustering of applications with noise
HDBSCAN-MC: Hierarchical density-based spatial clustering of applications with noise–Monte Carlo
HMAC: Hierarchical mode association clustering
iForest: Isolation Forest
Isomap: Isometric feature mapping
LLE: Locally linear embedding
MDS: Multi-dimensional scaling
ML: Machine learning
OC: Open cluster
PC: Principal component
PCA: Principal component analysis
SNR: Signal-to-noise ratio
SOM: Self-organizing map
SR: Symbolic regression
StarGO: Stars' Galactic Origin
SVD: Singular value decomposition
SVM: Support vector machine
t-SNE: t-distributed stochastic neighbor embedding
UMAP: Uniform manifold approximation and projection
VAE: Variational auto-encoder
VBGMM: Variational Bayesian Gaussian mixture model
XDGMM: Extreme deconvolution Gaussian mixture model
XGBoost: Extreme gradient boosting

Appendix A. Supplementary Discussion: Algorithm Performance on Synthetic Datasets

This appendix provides a supplementary discussion of the performance of the dimensionality reduction and clustering algorithms, complementing the evaluation on the real astronomy dataset in Section 2.4 and Section 3.7. To achieve this, we generate two noise-free synthetic datasets of different dimensionality, each containing three clusters. This construction provides known membership labels for each cluster, which is particularly useful for evaluating clustering algorithms because it allows a more straightforward discussion of accuracy. We then apply the dimensionality reduction and clustering algorithms to the datasets and compare their outputs.

Appendix A.1. Synthetic Datasets

We generate two synthetic datasets of scattered data: one 5-dimensional and one 10-dimensional. Each dataset contains three clusters (A: 350 data points; B: 600 data points; and C: 720 data points), for a total of 1670 points. The data are randomly generated and stretched to mimic open clusters, and the label of each data point is known. The coefficients used to generate the datasets were chosen so that the clusters have different spatial extents and eccentricities.
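The construction can be sketched as follows with NumPy; the means, covariance scales, rotations, and random seed are illustrative stand-ins rather than the exact coefficients used to produce Figures A1 and A2.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed

def stretched_cluster(n_points, center, scales, dim):
    """Draw a multivariate-normal cluster stretched along random directions."""
    cov = np.diag(np.asarray(scales) ** 2)
    # A random rotation gives each cluster a different orientation (eccentricity).
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    cov = q @ cov @ q.T
    return rng.multivariate_normal(mean=center, cov=cov, size=n_points)

dim = 5
A = stretched_cluster(350, center=np.zeros(dim), scales=np.linspace(0.5, 2.0, dim), dim=dim)
B = stretched_cluster(600, center=np.full(dim, 6.0), scales=np.linspace(1.0, 3.0, dim), dim=dim)
C = stretched_cluster(720, center=np.full(dim, -5.0), scales=np.linspace(0.8, 2.5, dim), dim=dim)

X = np.vstack([A, B, C])                               # 1670 x 5 data matrix
labels = np.repeat(['A', 'B', 'C'], [350, 600, 720])   # known membership labels
print(X.shape, np.unique(labels, return_counts=True))
```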
The scatterplot matrices of the two datasets are shown in Figure A1 and Figure A2. For both figures, Cluster A is colored in pink, Cluster B is colored in green, and Cluster C is colored in blue.
Figure A1. A scatterplot matrix of the 5D synthetic dataset.
We highlight several limitations arising from the construction of our synthetic dataset. First, the clusters are generated from multivariate normal distributions and are therefore inherently convex-shaped. As a result, the evaluation does not capture how the algorithms perform on non-convex cluster structures. Algorithms such as the GMM and K-means, which assume convex or ellipsoidal clusters, may exhibit improved performance under these conditions. Second, the synthetic dataset does not include noise or outliers. This simplifies the clustering problem and makes the evaluation of the algorithms easier. However, it also prevents us from assessing how dimensionality reduction methods handle noisy or contaminated data. In particular, we expect linear methods such as PCA to project noise so that it overlaps with existing clusters, whereas non-linear methods may map noise points to isolated regions in the low-dimensional space. For clustering algorithms, noise points may be interpreted as one or more clusters, making the results highly sensitive to the specified number of clusters. Since our synthetic dataset contains no noise, we cannot evaluate the algorithms’ robustness to noisy data or how their performance depends on the choice of the total number of clusters, which is required by several methods.
Figure A2. A scatterplot matrix of the 10D synthetic dataset.

Appendix A.2. Performance of Dimensionality Reduction Algorithms on a Synthetic Dataset

This section presents a supplementary discussion of the performance of the dimensionality reduction algorithms (PCA, LLE, Isomap, and t-SNE), following Section 2.4. The 5D and 10D datasets are normalized prior to reduction to two dimensions. The resulting 2D projections obtained from each algorithm are shown in Figure A3 and Figure A4.
Among the four algorithms, t-SNE provides the greatest separation between the three clusters, although some members of Clusters B and C are projected onto the boundaries of other clusters. To evaluate performance quantitatively, we apply K-means to the 2D projections and compare the results with the known labels. PCA achieves an accuracy of 94.1% on the 5D dataset and 98.0% on the 10D dataset; LLE achieves 95.5% and 96.8%; Isomap achieves 96.6% and 98.8%; and t-SNE achieves 97.7% and 98.4%, respectively. We then perturb the normalized datasets with random noise at the level of 10% of their standard deviation and repeat the procedure over eight realizations. The results are presented in Table A1 and Table A2, which report K-means clustering accuracy on the projections obtained with each dimensionality reduction algorithm. We find that t-SNE and Isomap yield accurate results, with PCA and Isomap demonstrating greater robustness, whereas LLE shows lower accuracy, with increasing uncertainty at higher dimensionality.
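A compact sketch of this evaluation procedure is given below: normalize, optionally perturb with 10% noise, reduce to two dimensions with each algorithm, cluster with K-means, and score accuracy after matching cluster indices to the known labels with the Hungarian algorithm. The make_blobs demo data is a stand-in for the synthetic datasets of Appendix A.1, and the neighbor counts are illustrative settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, Isomap, TSNE
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: optimally map predicted cluster indices to true labels."""
    classes = np.unique(y_true)
    cost = np.zeros((len(classes), len(classes)))
    for i, c in enumerate(classes):
        for j in range(len(classes)):
            cost[i, j] = -np.sum((y_true == c) & (y_pred == j))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(y_true)

def evaluate_reducers(X, y, noise_frac=0.0, seed=0):
    """Reduce to 2D with each algorithm, cluster with K-means, and report accuracy."""
    rng = np.random.default_rng(seed)
    Xn = StandardScaler().fit_transform(X)
    Xn = Xn + noise_frac * Xn.std() * rng.normal(size=Xn.shape)  # optional perturbation
    reducers = {
        'PCA': PCA(n_components=2),
        'LLE': LocallyLinearEmbedding(n_components=2, n_neighbors=15),
        'Isomap': Isomap(n_components=2, n_neighbors=15),
        't-SNE': TSNE(n_components=2, random_state=seed),
    }
    for name, reducer in reducers.items():
        X2 = reducer.fit_transform(Xn)
        pred = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X2)
        print(f'{name}: {100 * clustering_accuracy(y, pred):.1f}%')

if __name__ == '__main__':
    from sklearn.datasets import make_blobs
    X_demo, y_demo = make_blobs(n_samples=1670, n_features=5, centers=3, random_state=0)
    evaluate_reducers(X_demo, y_demo, noise_frac=0.1)
```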
Figure A3. Results of dimensionality reduction of the 5D synthetic dataset using PCA, LLE, Isomap, and t-SNE.
Figure A4. Results of dimensionality reduction of the 10D synthetic dataset using PCA, LLE, Isomap, and t-SNE.
Table A1. Accuracy of K-means clustering on the 2D projections of the 5D dataset, obtained using PCA, LLE, Isomap, and t-SNE.

Accuracy (%)
         Rep1   Rep2   Rep3   Rep4   Rep5   Rep6   Rep7   Rep8   Avg    SD
PCA      93.9   94.3   94.3   94.1   94.2   93.9   94.1   94.6   94.2   0.2
LLE      93.9   95.5   94.7   95.4   94.8   94.7   94.8   95.5   94.9   0.6
Isomap   96.5   96.5   96.3   96.3   96.0   95.9   96.2   96.3   96.3   0.2
t-SNE    96.0   97.1   96.9   97.1   97.1   96.9   97.5   97.1   97.0   0.4
Table A2. Accuracy of K-means clustering on the 2D projections of the 10D dataset, obtained using PCA, LLE, Isomap, and t-SNE.

Accuracy (%)
         Rep1   Rep2   Rep3   Rep4   Rep5   Rep6   Rep7   Rep8   Avg    SD
PCA      97.8   97.7   97.7   97.9   97.7   97.7   97.7   98.1   97.8   0.1
LLE      97.0   93.7   93.8   91.1   95.0   90.6   96.0   96.6   94.2   2.4
Isomap   98.7   99.0   98.7   98.7   98.7   99.0   98.7   98.6   98.8   0.1
t-SNE    98.2   98.3   98.5   98.3   98.2   98.0   98.7   97.8   98.3   0.3

Appendix A.3. Performance of Clustering Algorithms on a Synthetic Dataset

This section presents a supplementary discussion of the performance of the clustering algorithms (GMM, K-means, HC, FCM, and SOM), following Section 3.7. DBSCAN is excluded from this example, as it is particularly suited for noisy datasets and requires parameters beyond the number of clusters. The 5D and 10D datasets are normalized prior to clustering. The clustering results obtained from each algorithm are shown in Figure A5 and Figure A6, projected onto the PCA coordinates.
The GMM achieves an accuracy of 98.0% on the 5D dataset and 98.4% on the 10D dataset; K-means achieves 95.6% and 98.5%; HC achieves 96.8% and 95.7%; FCM achieves 94.0% and 93.3%; and the SOM achieves 67.4% and 41.6%, respectively. We then perturb the normalized datasets with random noise at the level of 10% of their standard deviation and repeat the procedure over eight realizations; the results are presented in Table A3 and Table A4. We find that the GMM produces accurate results but is less robust. K-means shows relatively improved accuracy and robustness in higher dimensions. The SOM exhibits poor accuracy and robustness; however, it is typically not used for clustering directly, but rather to simplify the dataset before applying other clustering algorithms.
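The clustering comparison can be sketched in the same way, applying the algorithms directly to the normalized data; scikit-learn provides the GMM, K-means, and HC (agglomerative) implementations used here, whereas FCM and the SOM require third-party packages and are omitted from this sketch. The make_blobs demo data is again a stand-in for the synthetic datasets of Appendix A.1.

```python
import numpy as np
from itertools import permutations
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, AgglomerativeClustering

def best_match_accuracy(y_true, y_pred, k=3):
    """Accuracy under the best mapping of predicted cluster indices to true labels."""
    classes = np.unique(y_true)
    best = 0.0
    for perm in permutations(range(k)):
        mapping = dict(zip(perm, classes))
        best = max(best, np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)]))
    return best

def compare_clusterers(X, y, seed=0):
    """Cluster the normalized data directly and report matched accuracy."""
    Xn = StandardScaler().fit_transform(X)
    models = {
        'GMM': GaussianMixture(n_components=3, random_state=seed),
        'K-means': KMeans(n_clusters=3, n_init=10, random_state=seed),
        'HC': AgglomerativeClustering(n_clusters=3, linkage='ward'),
    }
    for name, model in models.items():
        pred = model.fit_predict(Xn)
        print(f'{name}: {100 * best_match_accuracy(y, pred):.1f}%')

if __name__ == '__main__':
    from sklearn.datasets import make_blobs
    X_demo, y_demo = make_blobs(n_samples=1670, n_features=5, centers=3, random_state=0)
    compare_clusterers(X_demo, y_demo)
```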
Figure A5. Results of clustering of the 5D synthetic dataset using the GMM, K-means, HC, FCM, and the SOM.
Figure A6. Results of clustering of the 10D synthetic dataset using the GMM, K-means, HC, FCM, and the SOM.
Table A3. Clustering accuracy on the 5D dataset using the GMM, K-means, HC, FCM, and the SOM.

Accuracy (%)
          Rep1   Rep2   Rep3   Rep4   Rep5   Rep6   Rep7   Rep8   Avg    SD
GMM       96.8   97.5   97.8   97.2   97.9   98.0   98.1   97.3   97.6   0.5
K-means   95.0   95.6   95.5   95.3   95.6   95.3   94.9   95.9   95.4   0.4
HC        96.8   96.2   96.4   97.0   97.1   97.4   97.0   97.0   96.9   0.4
FCM       93.6   94.0   93.7   93.5   93.9   93.6   93.4   94.0   93.7   0.2
SOM       68.7   67.7   66.2   69.5   66.9   66.7   66.8   69.0   67.7   1.2
Table A4. Clustering accuracy on the 10D dataset using the GMM, K-means, HC, FCM, and the SOM.

Accuracy (%)
          Rep1   Rep2   Rep3   Rep4   Rep5   Rep6   Rep7   Rep8   Avg    SD
GMM       98.4   98.3   99.5   98.3   98.1   99.2   99.3   99.3   98.8   0.6
K-means   98.3   98.1   98.0   98.3   98.0   98.4   98.3   98.4   98.2   0.1
HC        97.1   97.5   97.8   97.7   97.2   98.1   97.2   97.8   97.6   0.3
FCM       92.7   92.5   93.0   92.9   92.9   92.5   92.4   93.4   92.8   0.3
SOM       42.3   47.0   44.4   41.7   41.2   42.3   43.5   42.8   43.2   1.9

Notes

1. Manifold: A topological space that locally resembles Euclidean space (e.g., Cartesian space) around each point but may have a more complex global structure.
2. Embedding space: A lower-dimensional representation of high-dimensional data that typically captures the essential structure or features of the data.
3. Continuous interpolation means that similar latent vectors produce similar outputs when decoded.
4. A distortion curve plots the distortion d_K versus the number of clusters K, where d_K, the estimated distortion of the clustering with K clusters, is defined from the squared distances of points to their assigned cluster centers. Further details on the equation and its formulation are provided by Chattopadhyay et al. [6].
5. Legacy Survey of Space and Time: https://rubinobservatory.org/explore/how-rubin-works/lsst, accessed on 8 September 2025.
6. Ensemble learning is a technique that combines the predictions of multiple models with the same objective, using methods such as voting and averaging [16].

References

  1. Almeida, J.S.; Aguerri, J.A.L.; Muñoz-Tuñón, C.; de Vicente, A. Automatic Unsupervised Classification of All Sloan Digital Sky Survey Data Release 7 Galaxy Spectra. Astrophys. J. 2010, 714, 487–504. [Google Scholar] [CrossRef]
  2. Boersma, C.; Bregman, J.; Allamandola, L.J. Properties of Polycyclic Aromatic Hydrocarbons in the Northwest Photon Dominated Region of NGC 7023. II. Traditional PAH analysis using k-means as a Visualization tool. Astrophys. J. 2014, 795, 110. [Google Scholar] [CrossRef]
  3. Panos, B.; Kleint, L.; Huwyler, C.; Krucker, S.; Melchior, M.; Ullmann, D.; Voloshynovskiy, S. Identifying Typical Mg ii Flare Spectra Using Machine Learning. Astrophys. J. 2018, 861, 62. [Google Scholar] [CrossRef]
  4. Rodrigo, C.; Cruz, P.; Aguilar, J.F.; Aller, A.; Solano, E.; Gálvez-Ortiz, M.C.; Jiménez-Esteban, F.; Mas-Buitrago, P.; Bayo, A.; Cortés-Contreras, M.; et al. Photometric segregation of dwarf and giant FGK stars using the SVO Filter Profile Service and photometric tools. Astron. Astrophys. 2024, 689, A93. [Google Scholar] [CrossRef]
  5. Zhang, H.; Ardern-Arentsen, A.; Belokurov, V. On the existence of a very metal-poor disc in the Milky Way. Mon. Not. R. Astron. Soc. 2024, 533, 889–907. [Google Scholar] [CrossRef]
  6. Chattopadhyay, T.; Misra, R.; Chattopadhyay, A.K.; Naskar, M. Statistical Evidence for Three Classes of Gamma-Ray Bursts. Astrophys. J. 2007, 667, 1017–1023. [Google Scholar] [CrossRef]
  7. Matijevič, G.; Prša, A.; Orosz, J.A.; Welsh, W.F.; Bloemen, S.; Barclay, T. Kepler Eclipsing Binary Stars. III. Classification of Kepler Eclipsing Binary Light Curves with Locally Linear Embedding. Astron. J. 2012, 143, 123. [Google Scholar] [CrossRef]
  8. Steinhardt, C.L.; Mann, W.J.; Rusakov, V.; Jespersen, C.K. Classification of BATSE, Swift, and Fermi Gamma-Ray Bursts from Prompt Emission Alone. Astrophys. J. 2023, 945, 67. [Google Scholar] [CrossRef]
  9. Froebrich, D.; Campbell-White, J.; Scholz, A.; Eislöffel, J.; Zegmott, T.; Billington, S.J.; Donohoe, J.; Makin, S.V.; Hibbert, R.; Newport, R.J.; et al. A survey for variable young stars with small telescopes: First results from HOYS-CAPS. Mon. Not. R. Astron. Soc. 2018, 478, 5091–5103. [Google Scholar] [CrossRef]
  10. Paraficz, D.; Courbin, F.; Tramacere, A.; Joseph, R.; Metcalf, R.B.; Kneib, J.P.; Dubath, P.; Droz, D.; Filleul, F.; Ringeisen, D.; et al. The PCA Lens-Finder: Application to CFHTLS. Astron. Astrophys. 2016, 592, A75. [Google Scholar] [CrossRef]
  11. Mesa, D.; Gratton, R.; Zurlo, A.; Vigan, A.; Claudi, R.U.; Alberi, M.; Antichi, J.; Baruffolo, A.; Beuzit, J.L.; Boccaletti, A.; et al. Performance of the VLT Planet Finder SPHERE. II. Data analysis and results for IFS in laboratory. Astron. Astrophys. 2015, 576, A121. [Google Scholar] [CrossRef]
  12. Banda, J.M.; Angryk, R.A.; Martens, P.C.H. Steps Toward a Large-Scale Solar Image Data Analysis to Differentiate Solar Phenomena. Sol. Phys. 2013, 288, 435–462. [Google Scholar] [CrossRef]
  13. Koza, J.R.; Bennett, F.H.; Andre, D.; Keane, M.A. Automated Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming. In Artificial Intelligence in Design ’96; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1996; pp. 151–170. [Google Scholar] [CrossRef]
  14. Baron, D. Machine Learning in Astronomy: A practical overview. arXiv 2019, arXiv:1904.07248. [Google Scholar] [CrossRef]
  15. Fotopoulou, S. A review of unsupervised learning in astronomy. Astron. Comput. 2024, 48, 100851. [Google Scholar] [CrossRef]
  16. Ivezić, Z.; Connolly, A.; Vanderplas, J.T.; Gray, A. Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data; Princeton University Press: Princeton, NJ, USA, 2020. [Google Scholar]
  17. Pearson, K., LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  18. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441, 498–520. [Google Scholar] [CrossRef]
  19. Hotelling, H. Relations Between Two Sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  20. Follette, K.B. An Introduction to High Contrast Differential Imaging of Exoplanets and Disks. Publ. Astron. Soc. Pac. 2023, 135, 093001. [Google Scholar] [CrossRef]
  21. Çakir, U.; Buck, T. MEGS: Morphological Evaluation of Galactic Structure—Principal component analysis as a galaxy morphology model. Astron. Astrophys. 2024, 691, A320. [Google Scholar] [CrossRef]
  22. Scholkopf, B.; Smola, A.; Müller, K.R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998, 10, 1299–1319. [Google Scholar] [CrossRef]
  23. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  24. Dale, D.A.; de Paz, A.G.; Gordon, K.D.; Hanson, H.M.; Armus, L.; Bendo, G.J.; Bianchi, L.; Block, M.; Boissier, S.; Boselli, A.; et al. An Ultraviolet-to-Radio Broadband Spectral Atlas of Nearby Galaxies. Astrophys. J. 2007, 655, 863–884. [Google Scholar] [CrossRef]
  25. Ellingson, E.; Lin, H.; Yee, H.K.C.; Carlberg, R.G. The Evolution of Population Gradients in Galaxy Clusters: The Butcher-Oemler Effect and Cluster Infall. Astrophys. J. 2001, 547, 609–622. [Google Scholar] [CrossRef]
  26. Francis, P.J.; Hewett, P.C.; Foltz, C.B.; Chaffee, F.H. An Objective Classification Scheme for QSO Spectra. Astrophys. J. 1992, 398, 476. [Google Scholar] [CrossRef]
  27. Osmer, P.S.; Porter, A.C.; Green, R.F. Luminosity Effects and the Emission-Line Properties of Quasars with 0 < Z < 3.8. Astrophys. J. 1994, 436, 678. [Google Scholar] [CrossRef]
  28. Brotherton, M.S.; Wills, B.J.; Francis, P.J.; Steidel, C.C. The Intermediate Line Region of QSOs. Astrophys. J. 1994, 430, 495. [Google Scholar] [CrossRef]
  29. Cowan, N.B.; Agol, E.; Meadows, V.S.; Robinson, T.; Livengood, T.A.; Deming, D.; Lisse, C.M.; A’Hearn, M.F.; Wellnitz, D.D.; Seager, S.; et al. Alien Maps of an Ocean-bearing World. Astrophys. J. 2009, 700, 915–923. [Google Scholar] [CrossRef]
  30. Whitmore, B.C. An objective classification system for spiral galaxies. I. The two dominant dimensions. Astrophys. J. 1984, 278, 61–80. [Google Scholar] [CrossRef]
  31. Borg, I.; Groenen, P.J.F. Modern Multidimensional Scaling—Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar] [CrossRef]
  32. Genest, C.; Nešlehová, J.G.; Ramsay, J.O. A Conversation with James O. Ramsay. Int. Stat. Rev. 2014, 82, 161–183. [Google Scholar] [CrossRef]
  33. Kruskal, J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 1964, 29, 1–27. [Google Scholar] [CrossRef]
  34. Kruskal, J.B. Nonmetric multidimensional scaling: A numerical method. Psychometrika 1964, 29, 115–129. [Google Scholar] [CrossRef]
  35. Tenenbaum, J.B.; de Silva, V.; Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323. [Google Scholar] [CrossRef]
  36. Bu, Y.; Chen, F.; Pan, J. Stellar spectral subclasses classification based on Isomap and SVM. New Astron. 2014, 28, 35–43. [Google Scholar] [CrossRef]
  37. Floyd, R.W. Algorithm 97: Shortest path. Commun. ACM 1962, 5, 345. [Google Scholar] [CrossRef]
  38. Fredman, M.L.; Tarjan, R.E. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM 1987, 34, 596–615. [Google Scholar] [CrossRef]
  39. Ward, J.L.; Lumsden, S.L. Locally linear embedding: Dimension reduction of massive protostellar spectra. Mon. Not. R. Astron. Soc. 2016, 461, 2250–2256. [Google Scholar] [CrossRef]
  40. Thorsen, T.; Zhou, J.; Wu, Y. Comparison of Stellar Classification Accuracies Using Automated Algorithms. In Proceedings of the American Astronomical Society Meeting Abstracts #227, Kissimmee, FL, USA, 4–8 January 2016; Volume 227, p. 348.18. [Google Scholar]
  41. Pearson, W.J.; Rodriguez-Gomez, V.; Kruk, S.; Margalef-Bentabol, B. Determining the time before or after a galaxy merger event. Astron. Astrophys. 2024, 687, A45. [Google Scholar] [CrossRef]
  42. Roweis, S.T.; Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
  43. Lehoucq, R.B.; Sorensen, D.C.; Yang, C. ARPACK Users’ Guide; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1998. [Google Scholar] [CrossRef]
  44. Vanderplas, J.; Connolly, A. Reducing the Dimensionality of Data: Locally Linear Embedding of Sloan Galaxy Spectra. Astron. J. 2009, 138, 1365–1379. [Google Scholar] [CrossRef]
  45. Bu, Y.; Zhao, G.; Luo, A.l.; Pan, J.; Chen, Y. Restricted Boltzmann machine: A non-linear substitute for PCA in spectral processing. Astron. Astrophys. 2015, 576, A96. [Google Scholar] [CrossRef]
  46. Kao, W.B.; Zhang, Y.; Wu, X.B. Efficient identification of broad absorption line quasars using dimensionality reduction and machine learning. Publ. Astron. Soc. Jpn. 2024, 76, 653–665. [Google Scholar] [CrossRef]
  47. Matijevič, G.; Zwitter, T.; Bienaymé, O.; Bland-Hawthorn, J.; Boeche, C.; Freeman, K.C.; Gibson, B.K.; Gilmore, G.; Grebel, E.K.; Helmi, A.; et al. Exploring the Morphology of RAVE Stellar Spectra. Astrophys. J. Suppl. Ser. 2012, 200, 14. [Google Scholar] [CrossRef]
  48. Daniel, S.F.; Connolly, A.; Schneider, J.; Vanderplas, J.; Xiong, L. Classification of Stellar Spectra with Local Linear Embedding. Astron. J. 2011, 142, 203. [Google Scholar] [CrossRef]
  49. Yang, M.; Zhang, H.; Wang, S.; Zhou, J.L.; Zhou, X.; Wang, L.; Wang, L.; Wittenmyer, R.A.; Liu, H.G.; Meng, Z.; et al. Eclipsing Binaries From the CSTAR Project at Dome A, Antarctica. Astrophys. J. Suppl. Ser. 2015, 217, 28. [Google Scholar] [CrossRef]
  50. Hinton, G.E.; Roweis, S. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems; Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2002; Volume 15. [Google Scholar]
  51. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  52. Peruzzi, T.; Pasquato, M.; Ciroi, S.; Berton, M.; Marziani, P.; Nardini, E. Interpreting automatic AGN classifiers with saliency maps. Astron. Astrophys. 2021, 652, A19. [Google Scholar] [CrossRef]
  53. van der Maaten, L. Barnes-Hut-SNE. arXiv 2013, arXiv:1301.3342. [Google Scholar] [CrossRef]
  54. Nakoneczny, S.; Bilicki, M.; Solarz, A.; Pollo, A.; Maddox, N.; Spiniello, C.; Brescia, M.; Napolitano, N.R. Catalog of quasars from the Kilo-Degree Survey Data Release 3. Astron. Astrophys. 2019, 624, A13. [Google Scholar] [CrossRef]
  55. Zhang, X.; Feng, Y.; Chen, H.; Yuan, Q. Powerful t-SNE Technique Leading to Clear Separation of Type-2 AGN and H II Galaxies in BPT Diagrams. Astrophys. J. 2020, 905, 97. [Google Scholar] [CrossRef]
  56. Queiroz, A.B.A.; Anders, F.; Chiappini, C.; Khalatyan, A.; Santiago, B.X.; Nepal, S.; Steinmetz, M.; Gallart, C.; Valentini, M.; Dal Ponte, M.; et al. StarHorse results for spectroscopic surveys and Gaia DR3: Chrono-chemical populations in the solar vicinity, the genuine thick disk, and young alpha-rich stars. Astron. Astrophys. 2023, 673, A155. [Google Scholar] [CrossRef]
  57. Traven, G.; Feltzing, S.; Merle, T.; Van der Swaelmen, M.; Čotar, K.; Church, R.; Zwitter, T.; Ting, Y.S.; Sahlholdt, C.; Asplund, M.; et al. The GALAH survey: Multiple stars and our Galaxy. I. A comprehensive method for deriving properties of FGK binary stars. Astron. Astrophys. 2020, 638, A145. [Google Scholar] [CrossRef]
  58. Steinhardt, C.L.; Weaver, J.R.; Maxfield, J.; Davidzon, I.; Faisst, A.L.; Masters, D.; Schemel, M.; Toft, S. A Method to Distinguish Quiescent and Dusty Star-forming Galaxies with Machine Learning. Astrophys. J. 2020, 891, 136. [Google Scholar] [CrossRef]
  59. Garcia-Cifuentes, K.; Becerra, R.L.; De Colle, F.; Cabrera, J.I.; Del Burgo, C. Identification of Extended Emission Gamma-Ray Burst Candidates Using Machine Learning. Astrophys. J. 2023, 951, 4. [Google Scholar] [CrossRef]
  60. Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 1982, 43, 59–69. [Google Scholar] [CrossRef]
  61. Masters, D.; Capak, P.; Stern, D.; Ilbert, O.; Salvato, M.; Schmidt, S.; Longo, G.; Rhodes, J.; Paltani, S.; Mobasher, B.; et al. Mapping the Galaxy Color-Redshift Relation: Optimal Photometric Redshift Calibration Strategies for Cosmology Surveys. Astrophys. J. 2015, 813, 53. [Google Scholar] [CrossRef]
  62. Hildebrandt, H.; van den Busch, J.L.; Wright, A.H.; Blake, C.; Joachimi, B.; Kuijken, K.; Tröster, T.; Asgari, M.; Bilicki, M.; de Jong, J.T.A.; et al. KiDS-1000 catalogue: Redshift distributions and their calibration. Astron. Astrophys. 2021, 647, A124. [Google Scholar] [CrossRef]
  63. Wright, A.H.; Hildebrandt, H.; van den Busch, J.L.; Heymans, C. Photometric redshift calibration with self-organising maps. Astron. Astrophys. 2020, 637, A100. [Google Scholar] [CrossRef]
  64. Carrasco Kind, M.; Brunner, R.J. SOMz: Photometric redshift PDFs with self-organizing maps and random atlas. Mon. Not. R. Astron. Soc. 2014, 438, 3409–3421. [Google Scholar] [CrossRef]
  65. Yuan, Z.; Myeong, G.C.; Beers, T.C.; Evans, N.W.; Lee, Y.S.; Banerjee, P.; Gudin, D.; Hattori, K.; Li, H.; Matsuno, T.; et al. Dynamical Relics of the Ancient Galactic Halo. Astrophys. J. 2020, 891, 39. [Google Scholar] [CrossRef]
  66. Armstrong, D.J.; Kirk, J.; Lam, K.W.F.; McCormac, J.; Osborn, H.P.; Spake, J.; Walker, S.; Brown, D.J.A.; Kristiansen, M.H.; Pollacco, D.; et al. K2 variable catalogue—II. Machine learning classification of variable stars and eclipsing binaries in K2 fields 0-4. Mon. Not. R. Astron. Soc. 2016, 456, 2260–2272. [Google Scholar] [CrossRef]
  67. Brett, D.R.; West, R.G.; Wheatley, P.J. The automated classification of astronomical light curves using Kohonen self-organizing maps. Mon. Not. R. Astron. Soc. 2004, 353, 369–376. [Google Scholar] [CrossRef]
  68. Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243. [Google Scholar] [CrossRef]
  69. Kramer, M. Autoassociative neural networks. Comput. Chem. Eng. 1992, 16, 313–328. [Google Scholar] [CrossRef]
  70. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  71. Ralph, N.O.; Norris, R.P.; Fang, G.; Park, L.A.F.; Galvin, T.J.; Alger, M.J.; Andernach, H.; Lintott, C.; Rudnick, L.; Shabala, S.; et al. Radio Galaxy Zoo: Unsupervised Clustering of Convolutionally Auto-encoded Radio-astronomical Images. Publ. Astron. Soc. Pac. 2019, 131, 108011. [Google Scholar] [CrossRef]
  72. Savary, E.; Rojas, K.; Maus, M.; Clément, B.; Courbin, F.; Gavazzi, R.; Chan, J.H.H.; Lemon, C.; Vernardos, G.; Cañameras, R.; et al. Strong lensing in UNIONS: Toward a pipeline from discovery to modeling. Astron. Astrophys. 2022, 666, A1. [Google Scholar] [CrossRef]
  73. Ganeshaiah Veena, P.; Lilow, R.; Nusser, A. Large-scale density and velocity field reconstructions with neural networks. Mon. Not. R. Astron. Soc. 2023, 522, 5291–5307. [Google Scholar] [CrossRef]
  74. Shen, H.; George, D.; Huerta, E.A.; Zhao, Z. Denoising Gravitational Waves with Enhanced Deep Recurrent Denoising Auto-Encoders. arXiv 2019, arXiv:1903.03105. [Google Scholar] [CrossRef]
  75. Ichinohe, Y.; Yamada, S. Neural network-based anomaly detection for high-resolution X-ray spectroscopy. Mon. Not. R. Astron. Soc. 2019, 487, 2874–2880. [Google Scholar] [CrossRef]
  76. Bayley, J.; Messenger, C.; Woan, G. Rapid parameter estimation for an all-sky continuous gravitational wave search using conditional varitational auto-encoders. Phys. Rev. D 2022, 106, 083022. [Google Scholar] [CrossRef]
  77. Wenger, M.; Ochsenbein, F.; Egret, D.; Dubois, P.; Bonnarel, F.; Borde, S.; Genova, F.; Jasniewicz, G.; Laloë, S.; Lesteven, S.; et al. The SIMBAD astronomical database. The CDS reference database for astronomical objects. Astron. Astrophys. Suppl. Ser. 2000, 143, 9–22. [Google Scholar] [CrossRef]
  78. Allen, M.G. CDS—Strasbourg Astronomical Data Centre. In Astronomical Data Analysis Software and Systems XXIX; Pizzo, R., Deul, E.R., Mol, J.D., de Plaa, J., Verkouter, H., Eds.; Astronomical Society of the Pacific: San Francisco, CA, USA, 2020; Volume 527, p. 751. [Google Scholar]
  79. Tamuz, O.; Mazeh, T.; Zucker, S. Correcting systematic effects in a large set of photometric light curves. Mon. Not. R. Astron. Soc. 2005, 356, 1466–1470. [Google Scholar] [CrossRef]
  80. Norberg, P.; Baugh, C.M.; Hawkins, E.; Maddox, S.; Madgwick, D.; Lahav, O.; Cole, S.; Frenk, C.S.; Baldry, I.; Bland-Hawthorn, J.; et al. The 2dF Galaxy Redshift Survey: The dependence of galaxy clustering on luminosity and spectral type. Mon. Not. R. Astron. Soc. 2002, 332, 827–838. [Google Scholar] [CrossRef]
  81. Lyke, B.W.; Higley, A.N.; McLane, J.N.; Schurhammer, D.P.; Myers, A.D.; Ross, A.J.; Dawson, K.; Chabanier, S.; Martini, P.; Busca, N.G.; et al. The Sloan Digital Sky Survey Quasar Catalog: Sixteenth Data Release. Astrophys. J. Suppl. Ser. 2020, 250, 8. [Google Scholar] [CrossRef]
  82. Newton-Bosch, J.; González, L.X.; Valdés-Galicia, J.F.; Monterde-Andrade, F.; Morales-Olivares, O.G.; Sergeeva, M.A.; Muraki, Y.; Shibata, S.; Matsubara, Y.; Sako, T.; et al. Atmospheric pressure and temperature effects on the Solar Neutron Telescope at Sierra Negra. Adv. Space Res. 2025, 75, 6543–6552. [Google Scholar] [CrossRef]
  83. Boroson, T.A.; Green, R.F. The Emission-Line Properties of Low-Redshift Quasi-stellar Objects. Astrophys. J. Suppl. Ser. 1992, 80, 109. [Google Scholar] [CrossRef]
  84. DeMeo, F.E.; Binzel, R.P.; Slivan, S.M.; Bus, S.J. An extension of the Bus asteroid taxonomy into the near-infrared. Icarus 2009, 202, 160–180. [Google Scholar] [CrossRef]
  85. Tous, J.L.; Solanes, J.M.; Perea, J.D. Fully comprehensive diagnostic of galaxy activity using principal components of visible spectra: Implementation on nearby S0s. Mon. Not. R. Astron. Soc. 2025, 537, 1459–1469. [Google Scholar] [CrossRef]
  86. Xu, M.; Fu, X.; Chen, Y.; Li, L.; Fang, M.; Zhao, H.; Liu, P.; Zuo, Y. Nearby open clusters with tidal features: Golden sample selection and 3D structure. Astron. Astrophys. 2025, 698, A156. [Google Scholar] [CrossRef]
  87. Bu, Y.D.; Pan, J.C.; Chen, F.Q. Stellar spectral outliers detection based on Isomap. Guang Pu Xue Yu Guang Pu Fen Xi 2014, 34, 267–273. [Google Scholar]
  88. Sasdelli, M.; Ishida, E.E.O.; Vilalta, R.; Aguena, M.; Busti, V.C.; Camacho, H.; Trindade, A.M.M.; Gieseke, F.; de Souza, R.S.; Fantaye, Y.T.; et al. Exploring the spectroscopic diversity of Type Ia supernovae with DRACULA: A machine learning approach. Mon. Not. R. Astron. Soc. 2016, 461, 2044–2059. [Google Scholar] [CrossRef]
  89. Žerjal, M.; Zwitter, T.; Matijevič, G.; Strassmeier, K.G. Chromospherically Active Stars in the RAVE Survey. In Setting the scene for Gaia and LAMOST; Feltzing, S., Zhao, G., Walton, N.A., Whitelock, P., Eds.; IAU Symposium; Cambridge University Press: Cambridge, UK, 2014; Volume 298, pp. 298–303. [Google Scholar] [CrossRef]
  90. Bódi, A.; Hajdu, T. Classification of OGLE Eclipsing Binary Stars Based on Their Morphology Type with Locally Linear Embedding. Astrophys. J. Suppl. Ser. 2021, 255, 1. [Google Scholar] [CrossRef]
  91. Bu, Y.; Pan, J.; Jiang, B.; Wei, P. Stellar Spectral Subclass Classification Based on Locally Linear Embedding. Publ. Astron. Soc. Jpn. 2013, 65, 81. [Google Scholar] [CrossRef]
  92. Chang, H.; Yeung, D.Y. Robust locally linear embedding. Pattern Recognit. 2006, 39, 1053–1065. [Google Scholar] [CrossRef]
  93. Matijevič, G.; Chiappini, C.; Grebel, E.K.; Wyse, R.F.G.; Zwitter, T.; Bienaymé, O.; Bland-Hawthorn, J.; Freeman, K.C.; Gibson, B.K.; Gilmore, G.; et al. Very metal-poor stars observed by the RAVE survey. Astron. Astrophys. 2017, 603, A19. [Google Scholar] [CrossRef]
  94. Hughes, A.C.N.; Spitler, L.R.; Zucker, D.B.; Nordlander, T.; Simpson, J.; da Costa, G.S.; Ting, Y.S.; Li, C.; Bland-Hawthorn, J.; Buder, S.; et al. The GALAH Survey: A New Sample of Extremely Metal-poor Stars Using a Machine-learning Classification Algorithm. Astrophys. J. 2022, 930, 47. [Google Scholar] [CrossRef]
  95. Anders, F.; Chiappini, C.; Santiago, B.X.; Matijevič, G.; Queiroz, A.B.; Steinmetz, M.; Guiglion, G. Dissecting stellar chemical abundance space with t-SNE. Astron. Astrophys. 2018, 619, A125. [Google Scholar] [CrossRef]
  96. Kos, J.; Bland-Hawthorn, J.; Freeman, K.; Buder, S.; Traven, G.; De Silva, G.M.; Sharma, S.; Asplund, M.; Duong, L.; Lin, J.; et al. The GALAH survey: Chemical tagging of star clusters and new members in the Pleiades. Mon. Not. R. Astron. Soc. 2018, 473, 4612–4633. [Google Scholar] [CrossRef]
  97. Zhu, S.Y.; Sun, W.P.; Ma, D.L.; Zhang, F.W. Classification of Fermi gamma-ray bursts based on machine learning. Mon. Not. R. Astron. Soc. 2024, 532, 1434–1443. [Google Scholar] [CrossRef]
  98. Chen, J.M.; Zhu, K.R.; Peng, Z.Y.; Zhang, L. Classification and Physical Characteristic Analysis of Fermi-GBM Gamma-Ray Bursts Based on Deep Learning. Astrophys. J. Suppl. Ser. 2025, 276, 62. [Google Scholar] [CrossRef]
  99. Quispe-Huaynasi, F.; Roig, F.; Holanda, N.; Loaiza-Tacuri, V.; Eleutério, R.; Pereira, C.B.; Daflon, S.; Placco, V.M.; Lopes de Oliveira, R.; Sestito, F.; et al. An Unsupervised Machine Learning Approach to Identify Spectral Energy Distribution Outliers: Application to the S-PLUS DR4 Data. Astron. J. 2025, 169, 332. [Google Scholar] [CrossRef]
  100. Yan, Z.; Wright, A.H.; Elisa Chisari, N.; Georgiou, C.; Joudaki, S.; Loureiro, A.; Reischke, R.; Asgari, M.; Bilicki, M.; Dvornik, A.; et al. KiDS-Legacy: Angular galaxy clustering from deep surveys with complex selection effects. Astron. Astrophys. 2025, 694, A259. [Google Scholar] [CrossRef]
  101. Buchs, R.; Davis, C.; Gruen, D.; DeRose, J.; Alarcon, A.; Bernstein, G.M.; Sánchez, C.; Myles, J.; Roodman, A.; Allen, S.; et al. Phenotypic redshifts with self-organizing maps: A novel method to characterize redshift distributions of source galaxies for weak lensing. Mon. Not. R. Astron. Soc. 2019, 489, 820–841. [Google Scholar] [CrossRef]
  102. van den Busch, J.L.; Wright, A.H.; Hildebrandt, H.; Bilicki, M.; Asgari, M.; Joudaki, S.; Blake, C.; Heymans, C.; Kannawadi, A.; Shan, H.Y.; et al. KiDS-1000: Cosmic shear with enhanced redshift calibration. Astron. Astrophys. 2022, 664, A170. [Google Scholar] [CrossRef]
  103. Beck, R.; Szapudi, I.; Flewelling, H.; Holmberg, C.; Magnier, E.; Chambers, K.C. PS1-STRM: Neural network source classification and photometric redshift catalogue for PS1 3π DR1. Mon. Not. R. Astron. Soc. 2021, 500, 1633–1644. [Google Scholar] [CrossRef]
  104. Galvin, T.J.; Huynh, M.T.; Norris, R.P.; Wang, X.R.; Hopkins, E.; Polsterer, K.; Ralph, N.O.; O’Brien, A.N.; Heald, G.H. Cataloguing the radio-sky with unsupervised machine learning: A new approach for the SKA era. Mon. Not. R. Astron. Soc. 2020, 497, 2730–2758. [Google Scholar] [CrossRef]
  105. Polsterer, K.L.; Gieseke, F.; Igel, C. Automatic Galaxy Classification via Machine Learning Techniques: Parallelized Rotation/Flipping INvariant Kohonen Maps (PINK). In Astronomical Data Analysis Software and Systems XXIV (ADASS XXIV); Taylor, A.R., Rosolowsky, E., Eds.; Astronomical Society of the Pacific Conference Series; Astronomical Society of the Pacific: San Francisco, CA, USA, 2015; Volume 495, p. 81. [Google Scholar]
  106. Kollasch, F.; Polsterer, K. UltraPINK—New possibilities to explore Self-Organizing Kohonen Maps. In Astronomical Data Analysis Software and Systems XXXI; Hugo, B.V., Van Rooyen, R., Smirnov, O.M., Eds.; Astronomical Society of the Pacific Conference Series; Astronomical Society of the Pacific: San Francisco, CA, USA, 2024; Volume 535, p. 49. [Google Scholar] [CrossRef]
  107. Viscasillas Vázquez, C.; Solano, E.; Ulla, A.; Ambrosch, M.; Álvarez, M.A.; Manteiga, M.; Magrini, L.; Santoveña-Gómez, R.; Dafonte, C.; Pérez-Fernández, E.; et al. Advanced classification of hot subdwarf binaries using artificial intelligence techniques and Gaia DR3 data. Astron. Astrophys. 2024, 691, A223. [Google Scholar] [CrossRef]
  108. Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; LeCun, Y. What is the best multi-stage architecture for object recognition? In Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 2146–2153. [Google Scholar] [CrossRef]
  109. Han, Y.; Zou, Z.; Li, N.; Chen, Y. Identifying Outliers in Astronomical Images with Unsupervised Machine Learning. Res. Astron. Astrophys. 2022, 22, 085006. [Google Scholar] [CrossRef]
  110. Sharma, K.; Kembhavi, A.; Kembhavi, A.; Sivarani, T.; Abraham, S. Detecting Outliers in SDSS using Convolutional Neural Network. Bull. Soc. R. Sci. Liege 2019, 88, 174–181. [Google Scholar] [CrossRef]
  111. Tröster, T.; Ferguson, C.; Harnois-Déraps, J.; McCarthy, I.G. Painting with baryons: Augmenting N-body simulations with gas using deep generative models. Mon. Not. R. Astron. Soc. 2019, 487, L24–L29. [Google Scholar] [CrossRef]
  112. Jadhav, S.; Shrivastava, M.; Mitra, S. Towards a robust and reliable deep learning approach for detection of compact binary mergers in gravitational wave data. Mach. Learn. Sci. Technol. 2023, 4, 045028. [Google Scholar] [CrossRef]
  113. Laroche, A.; Speagle, J.S. Closing the Stellar Labels Gap: Stellar Label independent Evidence for [α/M] Information in Gaia BP/RP Spectra. Astrophys. J. 2025, 979, 5. [Google Scholar] [CrossRef]
  114. Andrianomena, S.; Tang, H. Radio Galaxy Zoo: Leveraging latent space representations from variational autoencoder. J. Cosmol. Astropart. Phys. 2024, 2024, 034. [Google Scholar] [CrossRef]
  115. Ferragamo, A.; de Andres, D.; Sbriglio, A.; Cui, W.; De Petris, M.; Yepes, G.; Dupuis, R.; Jarraya, M.; Lahouli, I.; De Luca, F.; et al. THE THREE HUNDRED project: A machine learning method to infer clusters of galaxy mass radial profiles from mock Sunyaev-Zel’dovich maps. Mon. Not. R. Astron. Soc. 2023, 520, 4000–4008. [Google Scholar] [CrossRef]
  116. Zhou, C.; Gu, Y.; Fang, G.; Lin, Z. Automatic Morphological Classification of Galaxies: Convolutional Autoencoder and Bagging-based Multiclustering Model. Astron. J. 2022, 163, 86. [Google Scholar] [CrossRef]
  117. Alemi, A.; Poole, B.; Fischer, I.; Dillon, J.; Saurous, R.A.; Murphy, K. Fixing a Broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; PMLR (Proceedings of Machine Learning Research): London, UK, 2018; Volume 80, pp. 159–168. [Google Scholar]
  118. Semenov, V.; Tymchyshyn, V.; Bezguba, V.; Tsizh, M.; Khlevniuk, A. Galaxy morphological classification with manifold learning. Astron. Comput. 2025, 52, 100963. [Google Scholar] [CrossRef]
  119. Zubieta, E.; Missel, R.; Sosa Fiscella, V.; Lousto, C.O.; del Palacio, S.; López Armengol, F.G.; García, F.; Combi, J.A.; Wang, L.; Combi, L.; et al. First results of the glitching pulsar monitoring programme at the Argentine Institute of Radioastronomy. Mon. Not. R. Astron. Soc. 2023, 521, 4504–4521. [Google Scholar] [CrossRef]
  120. Zubieta, E.; Missel, R.; Araujo Furlan, S.B.; Lousto, C.O.; García, F.; del Palacio, S.; Gancio, G.; Combi, J.A.; Wang, L. Study of the 2024 major Vela glitch at the Argentine Institute of Radioastronomy. Astron. Astrophys. 2025, 698, A72. [Google Scholar] [CrossRef]
  121. Lousto, C.O.; Missel, R.; Prajapati, H.; Sosa Fiscella, V.; Armengol, F.G.L.; Gyawali, P.K.; Wang, L.; Cahill, N.D.; Combi, L.; Palacio, S.d.; et al. Vela pulsar: Single pulses analysis with machine learning techniques. Mon. Not. R. Astron. Soc. 2022, 509, 5790–5808. [Google Scholar] [CrossRef]
  122. Amaya, J.; Dupuis, R.; Innocenti, M.E.; Lapenta, G. Visualizing and Interpreting Unsupervised Solar Wind Classifications. Front. Astron. Space Sci. 2020, 7, 66. [Google Scholar] [CrossRef]
  123. Teimoorinia, H.; Shishehchi, S.; Tazwar, A.; Lin, P.; Archinuk, F.; Gwyn, S.D.J.; Kavelaars, J.J. An Astronomical Image Content-based Recommendation System Using Combined Deep Learning Models in a Fully Unsupervised Mode. Astron. J. 2021, 161, 227. [Google Scholar] [CrossRef]
  124. Forest, F.; Lebbah, M.; Azzag, H.; Lacaille, J. Deep Embedded SOM: Joint representation learning and self-organization. In Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 24–26 April 2019. [Google Scholar]
  125. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  126. Blei, D.M.; Jordan, M.I. Variational inference for Dirichlet process mixtures. Bayesian Anal. 2006, 1, 121–143. [Google Scholar] [CrossRef]
  127. Hao, J.; McKay, T.A.; Koester, B.P.; Rykoff, E.S.; Rozo, E.; Annis, J.; Wechsler, R.H.; Evrard, A.; Siegel, S.R.; Becker, M.; et al. A GMBCG Galaxy Cluster Catalog of 55,424 Rich Clusters from SDSS DR7. Astrophys. J. Suppl. Ser. 2010, 191, 254–274. [Google Scholar] [CrossRef]
  128. Duncan, K.J. All-purpose, all-sky photometric redshifts for the Legacy Imaging Surveys Data Release 8. Mon. Not. R. Astron. Soc. 2022, 512, 3662–3683. [Google Scholar] [CrossRef]
  129. Das, P.; Hawkins, K.; Jofré, P. Ages and kinematics of chemically selected, accreted Milky Way halo stars. Mon. Not. R. Astron. Soc. 2020, 493, 5195–5207. [Google Scholar] [CrossRef]
  130. D’Isanto, A.; Polsterer, K.L. Photometric redshift estimation via deep learning. Generalized and pre-classification-less, image based, fully probabilistic redshifts. Astron. Astrophys. 2018, 609, A111. [Google Scholar] [CrossRef]
131. Lee, K.J.; Guillemot, L.; Yue, Y.L.; Kramer, M.; Champion, D.J. Application of the Gaussian mixture model in pulsar astronomy - pulsar classification and candidates ranking for the Fermi 2FGL catalogue. Mon. Not. R. Astron. Soc. 2012, 424, 2832–2840. [Google Scholar] [CrossRef]
  132. Cheng, T.Y.; Li, N.; Conselice, C.J.; Aragón-Salamanca, A.; Dye, S.; Metcalf, R.B. Identifying strong lenses with unsupervised machine learning using convolutional autoencoder. Mon. Not. R. Astron. Soc. 2020, 494, 3750–3765. [Google Scholar] [CrossRef]
  133. Tarricq, Y.; Soubiran, C.; Casamiquela, L.; Castro-Ginard, A.; Olivares, J.; Miret-Roig, N.; Galli, P.A.B. Structural parameters of 389 local open clusters. Astron. Astrophys. 2022, 659, A59. [Google Scholar] [CrossRef]
134. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 27 December 1965–7 January 1966; Volume 1, pp. 281–297. [Google Scholar]
  135. Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Volume 8, pp. 1027–1035. [Google Scholar]
  136. Viticchié, B.; Sánchez Almeida, J. Asymmetries of the StokesVprofiles observed by HINODE SOT/SP in the quiet Sun. Astron. Astrophys. 2011, 530, A14. [Google Scholar] [CrossRef]
  137. Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef] [PubMed]
  138. Dantas, M.L.L.; Smiljanic, R.; Boesso, R.; Rocha-Pinto, H.J.; Magrini, L.; Guiglion, G.; Tautvaišienė, G.; Gilmore, G.; Randich, S.; Bensby, T.; et al. The Gaia-ESO Survey: Old super-metal-rich visitors from the inner Galaxy. Astron. Astrophys. 2023, 669, A96. [Google Scholar] [CrossRef]
  139. Galli, P.A.B.; Loinard, L.; Bouy, H.; Sarro, L.M.; Ortiz-León, G.N.; Dzib, S.A.; Olivares, J.; Heyer, M.; Hernandez, J.; Román-Zúñiga, C.; et al. Structure and kinematics of the Taurus star-forming region from Gaia-DR2 and VLBI astrometry. Astron. Astrophys. 2019, 630, A137. [Google Scholar] [CrossRef]
  140. Kounkel, M.; Covey, K.; Suárez, G.; Román-Zúñiga, C.; Hernandez, J.; Stassun, K.; Jaehnig, K.O.; Feigelson, E.D.; Peña Ramírez, K.; Roman-Lopes, A.; et al. The APOGEE-2 Survey of the Orion Star-forming Complex. II. Six-dimensional Structure. Astron. J. 2018, 156, 84. [Google Scholar] [CrossRef]
  141. Hojnacki, S.M.; Kastner, J.H.; Micela, G.; Feigelson, E.D.; LaLonde, S.M. An X-Ray Spectral Classification Algorithm with Application to Young Stellar Clusters. Astrophys. J. 2007, 659, 585–598. [Google Scholar] [CrossRef]
  142. Tinker, J.; Kravtsov, A.V.; Klypin, A.; Abazajian, K.; Warren, M.; Yepes, G.; Gottlöber, S.; Holz, D.E. Toward a Halo Mass Function for Precision Cosmology: The Limits of Universality. Astrophys. J. 2008, 688, 709–728. [Google Scholar] [CrossRef]
  143. Behroozi, P.S.; Wechsler, R.H.; Wu, H.Y. The ROCKSTAR Phase-space Temporal Halo Finder and the Velocity Offsets of Cluster Cores. Astrophys. J. 2013, 762, 109. [Google Scholar] [CrossRef]
  144. Howlett, C.; Ross, A.J.; Samushia, L.; Percival, W.J.; Manera, M. The clustering of the SDSS main galaxy sample—II. Mock galaxy catalogues and a measurement of the growth of structure from redshift space distortions at z = 0.15. Mon. Not. R. Astron. Soc. 2015, 449, 848–866. [Google Scholar] [CrossRef]
  145. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD; The Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 1996; Volume 96, pp. 226–231. [Google Scholar]
  146. Castro-Ginard, A.; Jordi, C.; Luri, X.; Julbe, F.; Morvan, M.; Balaguer-Núñez, L.; Cantat-Gaudin, T. A new method for unveiling open clusters in Gaia. New nearby open clusters confirmed by DR2. Astron. Astrophys. 2018, 618, A59. [Google Scholar] [CrossRef]
  147. Zari, E.; Brown, A.G.A.; de Zeeuw, P.T. Structure, kinematics, and ages of the young stellar populations in the Orion region. Astron. Astrophys. 2019, 628, A123. [Google Scholar] [CrossRef]
  148. Yan, Q.Z.; Yang, J.; Su, Y.; Sun, Y.; Wang, C. Distances and Statistics of Local Molecular Clouds in the First Galactic Quadrant. Astrophys. J. 2020, 898, 80. [Google Scholar] [CrossRef]
  149. Price-Jones, N.; Bovy, J. Blind chemical tagging with DBSCAN: Prospects for spectroscopic surveys. Mon. Not. R. Astron. Soc. 2019, 487, 871–886. [Google Scholar] [CrossRef]
  150. Castro-Ginard, A.; Jordi, C.; Luri, X.; Cantat-Gaudin, T.; Carrasco, J.M.; Casamiquela, L.; Anders, F.; Balaguer-Núñez, L.; Badia, R.M. Hunting for open clusters in Gaia EDR3: 628 new open clusters found with OCfinder. Astron. Astrophys. 2022, 661, A118. [Google Scholar] [CrossRef]
  151. Hunt, E.L.; Reffert, S. Improving the open cluster census. I. Comparison of clustering algorithms applied to Gaia DR2 data. Astron. Astrophys. 2021, 646, A104. [Google Scholar] [CrossRef]
  152. Campello, R.J.G.B.; Moulavi, D.; Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining; Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 160–172. [Google Scholar]
  153. McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
  154. Koppelman, H.H.; Helmi, A.; Massari, D.; Price-Whelan, A.M.; Starkenburg, T.K. Multiple retrograde substructures in the Galactic halo: A shattered view of Galactic history. Astron. Astrophys. 2019, 631, L9. [Google Scholar] [CrossRef]
  155. Hunt, E.L.; Reffert, S. Improving the open cluster census. II. An all-sky cluster catalogue with Gaia DR3. Astron. Astrophys. 2023, 673, A114. [Google Scholar] [CrossRef]
  156. Kerr, R.M.P.; Rizzuto, A.C.; Kraus, A.L.; Offner, S.S.R. Stars with Photometrically Young Gaia Luminosities Around the Solar System (SPYGLASS). I. Mapping Young Stellar Structures and Their Star Formation Histories. Astrophys. J. 2021, 917, 23. [Google Scholar] [CrossRef]
  157. Webb, S.; Lochner, M.; Muthukrishna, D.; Cooke, J.; Flynn, C.; Mahabal, A.; Goode, S.; Andreoni, I.; Pritchard, T.; Abbott, T.M.C. Unsupervised machine learning for transient discovery in deeper, wider, faster light curves. Mon. Not. R. Astron. Soc. 2020, 498, 3077–3094. [Google Scholar] [CrossRef]
  158. Moranta, L.; Gagné, J.; Couture, D.; Faherty, J.K. New Coronae and Stellar Associations Revealed by a Clustering Analysis of the Solar Neighborhood. Astrophys. J. 2022, 939, 94. [Google Scholar] [CrossRef]
  159. Shank, D.; Komater, D.; Beers, T.C.; Placco, V.M.; Huang, Y. Dynamically Tagged Groups of Metal-poor Stars. II. The Radial Velocity Experiment Data Release 6. Astrophys. J. Suppl. Ser. 2022, 261, 19. [Google Scholar] [CrossRef]
  160. Cabrera Garcia, J.; Beers, T.C.; Huang, Y.; Li, X.Y.; Liu, G.; Zhang, H.; Hong, J.; Lee, Y.S.; Shank, D.; Gudin, D.; et al. Probing the Galactic halo with RR Lyrae stars—V. Chemistry, kinematics, and dynamically tagged groups. Mon. Not. R. Astron. Soc. 2024, 527, 8973–8990. [Google Scholar] [CrossRef]
  161. Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybern. 1973, 3, 32–57. [Google Scholar] [CrossRef]
  162. Bezdek, J. Pattern Recognition With Fuzzy Objective Function Algorithms; Springer: New York, NY, USA, 1981. [Google Scholar] [CrossRef]
  163. Kruse, R.; Döring, C.; Lesot, M.J. Fundamentals of Fuzzy Clustering. In Advances in Fuzzy Clustering and its Applications; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2007; Chapter 1; pp. 1–30. [Google Scholar] [CrossRef]
  164. Colazo, M.; Alvarez-Candal, A.; Duffard, R. Zero-phase angle asteroid taxonomy classification using unsupervised machine learning algorithms. Astron. Astrophys. 2022, 666, A77. [Google Scholar] [CrossRef]
  165. Shi, L.; He, P. A Fast Fuzzy Clustering Algorithm for Large-Scale Datasets. In Advanced Data Mining and Applications; Li, X., Wang, S., Dong, Z.Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 203–208. [Google Scholar]
  166. Cheng, T.W.; Goldgof, D.B.; Hall, L.O. Fast fuzzy clustering. Fuzzy Sets Syst. 1998, 93, 49–56. [Google Scholar] [CrossRef]
  167. Szabó, G.M.; Kálmán, S.; Borsato, L.; Hegedűs, V.; Mészáros, S.; Szabó, R. Sub-Jovian desert of exoplanets at its boundaries. Parameter dependence along the main sequence. Astron. Astrophys. 2023, 671, A132. [Google Scholar] [CrossRef]
  168. Modak, S. Distinction of groups of gamma-ray bursts in the BATSE catalog through fuzzy clustering. Astron. Comput. 2021, 34, 100441. [Google Scholar] [CrossRef]
  169. Li, H. Fuzzy Cluster Analysis: Application to Determining Metallicities for Very Metal-poor Stars. Astrophys. J. 2021, 923, 183. [Google Scholar] [CrossRef]
  170. Barra, V.; Delouille, V.; Hochedez, J.F. Segmentation of extreme ultraviolet solar images via multichannel fuzzy clustering. Adv. Space Res. 2008, 42, 917–925. [Google Scholar] [CrossRef]
  171. Anilkumar, B.T.; Sabarinath, A. Grouping and long term prediction of sunspot cycle characteristics-A fuzzy clustering approach. Astron. Comput. 2024, 48, 100836. [Google Scholar] [CrossRef]
  172. Offner, S.S.R.; Taylor, J.; Markey, C.; Chen, H.H.H.; Pineda, J.E.; Goodman, A.A.; Burkert, A.; Ginsburg, A.; Choudhury, S. Turbulence, coherence, and collapse: Three phases for core evolution. Mon. Not. R. Astron. Soc. 2022, 517, 885–909. [Google Scholar] [CrossRef]
  173. Shan, H.; Wang, X.; Chen, X.; Yuan, J.; Nie, J.; Zhang, H.; Liu, N.; Wang, N. Wavelet based recognition for pulsar signals. Astron. Comput. 2015, 11, 55–63. [Google Scholar] [CrossRef]
  174. Bandyopadhyay, S.; Das, S.; Datta, A. Comparative Study and Development of Two Contour-Based Image Segmentation Techniques for Coronal Hole Detection in Solar Images. Sol. Phys. 2020, 295, 110. [Google Scholar] [CrossRef]
  175. Revathy, K.; Lekshmi, S.; Nayar, S.R.P. Fractal-Based Fuzzy Technique For Detection Of Active Regions From Solar Images. Sol. Phys. 2005, 228, 43–53. [Google Scholar] [CrossRef]
  176. Castro-Ginard, A.; McMillan, P.J.; Luri, X.; Jordi, C.; Romero-Gómez, M.; Cantat-Gaudin, T.; Casamiquela, L.; Tarricq, Y.; Soubiran, C.; Anders, F. Milky Way spiral arms from open clusters in Gaia EDR3. Astron. Astrophys. 2021, 652, A162. [Google Scholar] [CrossRef]
  177. Buder, S.; Lind, K.; Ness, M.K.; Feuillet, D.K.; Horta, D.; Monty, S.; Buck, T.; Nordlander, T.; Bland-Hawthorn, J.; Casey, A.R.; et al. The GALAH Survey: Chemical tagging and chrono-chemodynamics of accreted halo stars with GALAH+ DR3 and Gaia eDR3. Mon. Not. R. Astron. Soc. 2022, 510, 2407–2436. [Google Scholar] [CrossRef]
  178. Talbot, C.; Thrane, E. Flexible and Accurate Evaluation of Gravitational-wave Malmquist Bias with Machine Learning. Astrophys. J. 2022, 927, 76. [Google Scholar] [CrossRef]
  179. Myeong, G.C.; Belokurov, V.; Aguado, D.S.; Evans, N.W.; Caldwell, N.; Bradley, J. Milky Way’s Eccentric Constituents with Gaia, APOGEE, and GALAH. Astrophys. J. 2022, 938, 21. [Google Scholar] [CrossRef]
  180. Morales-Luis, A.B.; Sánchez Almeida, J.; Aguerri, J.A.L.; Muñoz-Tuñón, C. Systematic Search for Extremely Metal-poor Galaxies in the Sloan Digital Sky Survey. Astrophys. J. 2011, 743, 77. [Google Scholar] [CrossRef]
  181. Hasselquist, S.; Carlin, J.L.; Holtzman, J.A.; Shetrone, M.; Hayes, C.R.; Cunha, K.; Smith, V.; Beaton, R.L.; Sobeck, J.; Allende Prieto, C.; et al. Identifying Sagittarius Stream Stars by Their APOGEE Chemical Abundance Signatures. Astrophys. J. 2019, 872, 58. [Google Scholar] [CrossRef]
  182. Mackereth, J.T.; Schiavon, R.P.; Pfeffer, J.; Hayes, C.R.; Bovy, J.; Anguiano, B.; Allende Prieto, C.; Hasselquist, S.; Holtzman, J.; Johnson, J.A.; et al. The origin of accreted stellar halo populations in the Milky Way using APOGEE, Gaia, and the EAGLE simulations. Mon. Not. R. Astron. Soc. 2019, 482, 3426–3442. [Google Scholar] [CrossRef]
  183. Asadi, V.; Haghi, H.; Zonoozi, A.H. Semi-supervised classification of stars, galaxies and quasars using K-means and random-forest approaches. arXiv 2025, arXiv:2507.14072. [Google Scholar] [CrossRef]
  184. Hogg, D.W.; Casey, A.R.; Ness, M.; Rix, H.W.; Foreman-Mackey, D.; Hasselquist, S.; Ho, A.Y.Q.; Holtzman, J.A.; Majewski, S.R.; Martell, S.L.; et al. Chemical Tagging Can Work: Identification of Stellar Phase-space Structures Purely by Chemical-abundance Similarity. Astrophys. J. 2016, 833, 262. [Google Scholar] [CrossRef]
  185. Bose, S.; Joshi, J.; Henriques, V.M.J.; Rouppe van der Voort, L. Spicules and downflows in the solar chromosphere. Astron. Astrophys. 2021, 647, A147. [Google Scholar] [CrossRef]
  186. Hayes, J.J.C.; Kerins, E.; Awiphan, S.; McDonald, I.; Morgan, J.S.; Chuanraksasat, P.; Komonjinda, S.; Sanguansak, N.; Kittara, P.; SPEARNet Collaboration. Optimizing exoplanet atmosphere retrieval using unsupervised machine-learning classification. Mon. Not. R. Astron. Soc. 2020. [Google Scholar] [CrossRef]
  187. Sreehari, H.; Nandi, A. A machine learning approach for classification of accretion states of black hole binaries. Mon. Not. R. Astron. Soc. 2021, 502, 1334–1343. [Google Scholar] [CrossRef]
  188. Barchi, P.H.; da Costa, F.G.; Sautter, R.; Moura, T.C.; Stalder, D.H.; Rosa, R.R.; de Carvalho, R.R. Improving galaxy morphology with machine learning. arXiv 2017, arXiv:1705.06818. [Google Scholar] [CrossRef]
  189. Henshaw, J.D.; Ginsburg, A.; Haworth, T.J.; Longmore, S.N.; Kruijssen, J.M.D.; Mills, E.A.C.; Sokolov, V.; Walker, D.L.; Barnes, A.T.; Contreras, Y.; et al. ‘The Brick’ is not a brick: A comprehensive study of the structure and dynamics of the central molecular zone cloud G0.253+0.016. Mon. Not. R. Astron. Soc. 2019, 485, 2457–2485. [Google Scholar] [CrossRef]
  190. Carruba, V.; Aljbaae, S.; Lucchini, A. Machine-learning identification of asteroid groups. Mon. Not. R. Astron. Soc. 2019, 488, 1377–1386. [Google Scholar] [CrossRef]
  191. Barnes, A.T.; Henshaw, J.D.; Caselli, P.; Jiménez-Serra, I.; Tan, J.C.; Fontani, F.; Pon, A.; Ragan, S. Similar complex kinematics within two massive, filamentary infrared dark clouds. Mon. Not. R. Astron. Soc. 2018, 475, 5268–5289. [Google Scholar] [CrossRef]
  192. Castro-Ginard, A.; Jordi, C.; Luri, X.; Álvarez Cid-Fuentes, J.; Casamiquela, L.; Anders, F.; Cantat-Gaudin, T.; Monguió, M.; Balaguer-Núñez, L.; Solà, S.; et al. Hunting for open clusters in Gaia DR2: 582 new open clusters in the Galactic disc. Astron. Astrophys. 2020, 635, A45. [Google Scholar] [CrossRef]
  193. Prisinzano, L.; Damiani, F.; Sciortino, S.; Flaccomio, E.; Guarcello, M.G.; Micela, G.; Tognelli, E.; Jeffries, R.D.; Alcalá, J.M. Low-mass young stars in the Milky Way unveiled by DBSCAN and Gaia EDR3: Mapping the star forming regions within 1.5 kpc. Astron. Astrophys. 2022, 664, A175. [Google Scholar] [CrossRef]
  194. Plšek, T.; Werner, N.; Topinka, M.; Simionescu, A. CAvity DEtection Tool (CADET): Pipeline for detection of X-ray cavities in hot galactic and cluster atmospheres. Mon. Not. R. Astron. Soc. 2024, 527, 3315–3346. [Google Scholar] [CrossRef]
  195. Nidever, D.L.; Dey, A.; Fasbender, K.; Juneau, S.; Meisner, A.M.; Wishart, J.; Scott, A.; Matt, K.; Nikutta, R.; Pucha, R. Second Data Release of the All-sky NOIRLab Source Catalog. Astron. J. 2021, 161, 192. [Google Scholar] [CrossRef]
  196. Kounkel, M.; Covey, K. Untangling the Galaxy. I. Local Structure and Star Formation History of the Milky Way. Astron. J. 2019, 158, 122. [Google Scholar] [CrossRef]
  197. Patel, V.; Hora, J.L.; Ashby, M.L.N.; Vig, S. Identification of Outer Galaxy Cluster Members Using Gaia DR3 and Multidimensional Simulation. arXiv 2025, arXiv:2507.09721. [Google Scholar] [CrossRef]
  198. Hendy, Y.H.M.; Shokry, A.; Takey, A.; Aboueisha, M.S. The first CCD photometric study of the member eclipsing binary ZTF J060425.73+365000.1 in the newly discovered young open cluster UBC 68. New Astron. 2025, 119, 102392. [Google Scholar] [CrossRef]
  199. Thakur, P.S.; Verma, R.K.; Tiwari, R. Analysis of Time Complexity of K-Means and Fuzzy C-Means Clustering Algorithm. Eng. Math. Lett. 2024, 2024, 4. [Google Scholar] [CrossRef]
  200. Benvenuto, F.; Piana, M.; Campi, C.; Massone, A.M. A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction. Astrophys. J. 2018, 853, 90. [Google Scholar] [CrossRef]
201. Richards, J.W.; Starr, D.L.; Miller, A.A.; Bloom, J.S.; Butler, N.R.; Brink, H.; Crellin-Quick, A. Construction of a Calibrated Probabilistic Classification Catalog: Application to 50k Variable Sources in the All-Sky Automated Survey. Astrophys. J. Suppl. Ser. 2012, 203, 32. [Google Scholar] [CrossRef]
  202. Villaescusa-Navarro, F.; Anglés-Alcázar, D.; Genel, S.; Spergel, D.N.; Somerville, R.S.; Dave, R.; Pillepich, A.; Hernquist, L.; Nelson, D.; Torrey, P.; et al. The CAMELS Project: Cosmology and Astrophysics with Machine-learning Simulations. Astrophys. J. 2021, 915, 71. [Google Scholar] [CrossRef]
  203. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  204. Wen, J.; Ahmadzadeh, A.; Georgoulis, M.K.; Sadykov, V.M.; Angryk, R.A. Outlier Detection and Removal in Multivariate Time Series for a More Robust Machine Learning–based Solar Flare Prediction. Astrophys. J. Suppl. Ser. 2025, 277, 60. [Google Scholar] [CrossRef]
  205. Pruzhinskaya, M.V.; Malanchev, K.L.; Kornilov, M.V.; Ishida, E.E.O.; Mondon, F.; Volnova, A.A.; Korolev, V.S. Anomaly detection in the Open Supernova Catalog. Mon. Not. R. Astron. Soc. 2019, 489, 3591–3608. [Google Scholar] [CrossRef]
  206. Villar, V.A.; Cranmer, M.; Berger, E.; Contardo, G.; Ho, S.; Hosseinzadeh, G.; Lin, J.Y.Y. A Deep-learning Approach for Live Anomaly Detection of Extragalactic Transients. Astrophys. J. Suppl. Ser. 2021, 255, 24. [Google Scholar] [CrossRef]
  207. Sánchez-Sáez, P.; Lira, H.; Martí, L.; Sánchez-Pi, N.; Arredondo, J.; Bauer, F.E.; Bayo, A.; Cabrera-Vives, G.; Donoso-Oliva, C.; Estévez, P.A.; et al. Searching for Changing-state AGNs in Massive Data Sets. I. Applying Deep Learning and Anomaly-detection Techniques to Find AGNs with Anomalous Variability Behaviors. Astron. J. 2021, 162, 206. [Google Scholar] [CrossRef]
  208. Chan, H.S.; Villar, V.A.; Cheung, S.H.; Ho, S.; O’Grady, A.J.G.; Drout, M.R.; Renzo, M. Searching for Anomalies in the ZTF Catalog of Periodic Variable Stars. Astrophys. J. 2022, 932, 118. [Google Scholar] [CrossRef]
  209. Lochner, M.; Bassett, B.A. ASTRONOMALY: Personalised active anomaly detection in astronomical data. Astron. Comput. 2021, 36, 100481. [Google Scholar] [CrossRef]
  210. Angelis, D.; Sofos, F.; Karakasidis, T.E. Artificial Intelligence in Physical Sciences: Symbolic Regression Trends and Perspectives. Arch. Comput. Methods Eng. 2023, 30, 3845–3865. [Google Scholar] [CrossRef]
  211. Schmidt, M.; Lipson, H. Symbolic Regression of Implicit Equations. In Genetic Programming Theory and Practice VII; Riolo, R., O’Reilly, U.M., McConaghy, T., Eds.; Springer: Boston, MA, USA, 2010; pp. 73–85. [Google Scholar] [CrossRef]
  212. Llorella, F.R.; Cebrian, J.A. Exploring Symbolic Regression and Genetic Algorithms for Astronomical Object Classification. Open J. Astrophys. 2025, 8, 27. [Google Scholar] [CrossRef]
  213. Tan, B. Neural infalling cloud equations (NICE): Increasing the efficacy of subgrid models and scientific equation discovery using neural ODEs and symbolic regression. Mon. Not. R. Astron. Soc. 2025, 537, 3383–3395. [Google Scholar] [CrossRef]
  214. Lemos, P.; Jeffrey, N.; Cranmer, M.; Ho, S.; Battaglia, P. Rediscovering orbital mechanics with machine learning. Mach. Learn. Sci. Technol. 2023, 4, 045002. [Google Scholar] [CrossRef]
  215. Delgado, A.M.; Wadekar, D.; Hadzhiyska, B.; Bose, S.; Hernquist, L.; Ho, S. Modelling the galaxy-halo connection with machine learning. Mon. Not. R. Astron. Soc. 2022, 515, 2733–2746. [Google Scholar] [CrossRef]
  216. Gebhardt, M.; Anglés-Alcázar, D.; Borrow, J.; Genel, S.; Villaescusa-Navarro, F.; Ni, Y.; Lovell, C.C.; Nagai, D.; Davé, R.; Marinacci, F.; et al. Cosmological baryon spread and impact on matter clustering in CAMELS. Mon. Not. R. Astron. Soc. 2024, 529, 4896–4913. [Google Scholar] [CrossRef]
  217. Kuang, Y.; Wang, J.; Huang, H.; Ye, M.; Zhu, F.; Li, X.; Hao, J.; Wu, F. Advancing Symbolic Discovery on Unsupervised Data: A Pre-training Framework for Non-degenerate Implicit Equation Discovery. arXiv 2025, arXiv:2505.03130. [Google Scholar]
  218. Lima, M.; Cunha, C.E.; Oyaizu, H.; Frieman, J.; Lin, H.; Sheldon, E.S. Estimating the redshift distribution of photometric galaxy samples. Mon. Not. R. Astron. Soc. 2008, 390, 118–130. [Google Scholar] [CrossRef]
  219. Balakrishnan, V.; Champion, D.; Barr, E.; Kramer, M.; Sengar, R.; Bailes, M. Pulsar candidate identification using semi-supervised generative adversarial networks. Mon. Not. R. Astron. Soc. 2021, 505, 1180–1194. [Google Scholar] [CrossRef]
  220. Rahmani, S.; Teimoorinia, H.; Barmby, P. Classifying galaxy spectra at 0.5 < z < 1 with self-organizing maps. Mon. Not. R. Astron. Soc. 2018, 478, 4416–4432. [Google Scholar] [CrossRef]
  221. Stein, G.; Blaum, J.; Harrington, P.; Medan, T.; Lukić, Z. Mining for Strong Gravitational Lenses with Self-supervised Learning. Astrophys. J. 2022, 932, 107. [Google Scholar] [CrossRef]
  222. Kumar, A.; Kumar, V. Hybrid-Ensemble Deep-Learning Models to Enhance the Sunspot Prediction and Forecasting of Solar Cycle 26. Sol. Phys. 2025, 300, 100. [Google Scholar] [CrossRef]
  223. Angarita, Y.; Chaparro, G.; Lumsden, S.L.; Walsh, C.; Avison, A.; Asabre Frimpong, N.; Fuller, G.A. Pattern finding in millimetre-wave spectra of massive young stellar objects. Astron. Astrophys. 2025, 694, A20. [Google Scholar] [CrossRef]
  224. Shojaeian, A.; Shafizadeh-Moghadam, H.; Sharafati, A.; Shahabi, H. Extreme flash flood susceptibility mapping using a novel PCA-based model stacking approach. Adv. Space Res. 2024, 74, 5371–5382. [Google Scholar] [CrossRef]
  225. Bussov, M.; Nättilä, J. Segmentation of turbulent computational fluid dynamics simulations with unsupervised ensemble learning. Signal Process. Image Commun. 2021, 99, 116450. [Google Scholar] [CrossRef]
  226. Yuan, Z.; Chang, J.; Banerjee, P.; Han, J.; Kang, X.; Smith, M.C. StarGO: A New Method to Identify the Galactic Origins of Halo Stars. Astrophys. J. 2018, 863, 26. [Google Scholar] [CrossRef]
  227. Li, J.; Ray, S.; Lindsay, B.G. A Nonparametric Statistical Approach to Clustering via Mode Identification. J. Mach. Learn. Res. 2007, 8, 1687–1723. [Google Scholar]
  228. Bovy, J.; Hogg, D.W.; Roweis, S.T. Extreme deconvolution: Inferring complete distribution functions from noisy, heterogeneous and incomplete observations. Ann. Appl. Stat. 2011, 5, 1657–1677. [Google Scholar] [CrossRef]
  229. Bhave, A.; Kulkarni, S.; Desai, S.; Srijith, P.K. Two dimensional clustering of Gamma-Ray Bursts using durations and hardness. Astrophys. Space Sci. 2022, 367, 39. [Google Scholar] [CrossRef]
Figure 1. Demonstration of dimensionality reduction with PCA, showing how the two PCs are selected given the dataset. The figure is taken from Follette [20], licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/).
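As a complement to Figure 1, the following minimal Python sketch shows how the principal components of a two-dimensional point cloud can be inspected with scikit-learn; the synthetic covariance matrix and sample size are illustrative choices, not values used in this review.

```python
# Minimal sketch (not the exact code used for Figure 1): fit PCA to a
# correlated 2D point cloud and inspect the two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic, elongated 2D cloud (hypothetical data, for illustration only)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=500)

pca = PCA(n_components=2).fit(X)
print("PC directions:\n", pca.components_)              # unit vectors of PC1 and PC2
print("explained variance ratio:", pca.explained_variance_ratio_)

X_reduced = pca.transform(X)[:, :1]   # keep only PC1 to reduce the dimensionality
```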
Figure 2. Example of PCA applied to images. The figure is taken from Çakir and Buck [21], licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
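A hedged sketch of the idea behind Figure 2: images are flattened into vectors, PCA learns a small set of components, and each image is reconstructed from its low-dimensional code. The random image stack, image size, and number of components below are placeholders, not the data or settings of Çakir and Buck [21].

```python
# Hypothetical sketch of PCA applied to images: flatten, fit, reconstruct.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_images, height, width = 200, 32, 32            # assumed toy image stack
images = rng.random((n_images, height, width))   # placeholder for real cutouts

X = images.reshape(n_images, -1)                 # flatten to (n_images, 1024)
pca = PCA(n_components=20).fit(X)

codes = pca.transform(X)                         # low-dimensional representation
reconstructed = pca.inverse_transform(codes).reshape(n_images, height, width)
print("compression:", X.shape[1], "->", codes.shape[1], "features per image")
```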
Figure 3. Example of dimensionality reduction with (a) Isomap and (b) LLE (Section 2.2.2), showing how the two algorithms unroll a ‘Swiss roll’ manifold. Each panel contains three mini-plots labeled A–C. In (a), A shows the geodesic distance between two points on the manifold (blue line), B shows the corresponding distance along the neighborhood graph (red line), and C presents the resulting low-dimensional embedding after unfolding. In (b), A–C correspond to the same stages, with color used to represent the local structure of the manifold. The change in the circled area between B and C illustrates how LLE preserves local structure during the unfolding. The figure is taken from Fotopoulou [15], licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
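The 'Swiss roll' unrolling in Figure 3 can be reproduced in spirit with scikit-learn; the neighbor counts and sample size below are illustrative guesses rather than the settings used in the source figure.

```python
# Sketch reproducing the spirit of Figure 3: unroll a 'Swiss roll' manifold
# with Isomap (geodesic distances) and LLE (local linear reconstructions).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

iso = Isomap(n_neighbors=12, n_components=2)          # geodesic-distance embedding
X_iso = iso.fit_transform(X)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)                          # preserves local structure

print(X_iso.shape, X_lle.shape)                       # both (1500, 2)
```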
Figure 4. Example of how t-SNE measures the affinity between two points, demonstrated using the three-dimensional dataset shown in (a). The distance from each dot to the darkest black dot is computed, and the corresponding probability is determined using a Gaussian kernel, as shown in (b).
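The affinity computation sketched in Figure 4 amounts to converting distances from a reference point into Gaussian-kernel probabilities. The snippet below illustrates this single step with a fixed bandwidth sigma; the full t-SNE algorithm instead tunes the bandwidth per point through the perplexity parameter, so the toy data and sigma here are assumptions for illustration.

```python
# Distances from one reference point turned into conditional probabilities
# p_{j|i} with a Gaussian kernel, as illustrated in Figure 4.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))        # toy three-dimensional dataset
i = 0                                # index of the reference ("darkest") point
sigma = 1.0                          # assumed kernel bandwidth

d2 = np.sum((X - X[i]) ** 2, axis=1)             # squared Euclidean distances
p = np.exp(-d2 / (2.0 * sigma ** 2))
p[i] = 0.0                                       # a point has no affinity with itself
p /= p.sum()                                     # conditional probabilities p_{j|i}
print(p[:5])
```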
Figure 5. Example of dimensionality reduction with t-SNE applied to spectral data. The red circle marks the region in which intermediate-type subgroups are likely to be found. The figure is taken from Peruzzi et al. [52], licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
Figure 6. Visualization of the SOM grid overlaid on the two-dimensional input data. The purple dots represent neuron weights, and the black lines connect neighboring neurons, illustrating how the SOM captures the underlying data topology.
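A sketch of how a grid such as the one in Figure 6 can be trained, assuming the third-party MiniSom package (pip install minisom); the grid size, learning rate, and number of iterations are arbitrary illustrative values and are not those used to generate the figure.

```python
# Train a small rectangular SOM on two-dimensional data; the neuron weights
# returned by get_weights() are what the purple dots in Figure 6 represent.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))                    # two-dimensional input data

som = MiniSom(x=8, y=8, input_len=2, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)

weights = som.get_weights()                       # shape (8, 8, 2): neuron positions
print(weights.shape)
# Plotting the weights and the links between neighbouring neurons reproduces
# the overlaid grid of Figure 6.
```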
Figure 7. Structure of the AE, showing the input and output layers, encoder and decoder, and bottleneck. A shallow neural network has one or two hidden layers, while deep learning models have three or more. The figure is adapted from Fotopoulou [15], which was adapted from Kramer [68], licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
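A minimal sketch of the encoder-bottleneck-decoder structure in Figure 7, written in PyTorch as an assumed framework choice; the layer widths, latent dimension, and input size are placeholders rather than settings from any study cited here.

```python
# Fully connected autoencoder: encoder compresses the input to a bottleneck,
# decoder reconstructs it, and the loss is the reconstruction error.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=100, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, n_latent),               # bottleneck layer
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 64), nn.ReLU(),
            nn.Linear(64, n_features),             # reconstruction of the input
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.randn(32, 100)                           # a batch of toy spectra
loss = nn.functional.mse_loss(model(x), x)         # reconstruction loss
loss.backward()
```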
Figure 8. Results of the dimensionality reduction algorithms applied to the same five-dimensional dataset. The algorithms include PCA, LLE, non-metric MDS, and t-SNE. The five-dimensional dataset is clustered with DBSCAN before being reduced to two dimensions, where different colors represent different clusters, and the black dots represent the outliers. This figure illustrates how the clusters are distributed under different dimensionality reduction algorithms.
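A hedged sketch of the kind of pipeline behind Figure 8: a synthetic five-dimensional dataset is clustered with DBSCAN and then projected to two dimensions with PCA, LLE, non-metric MDS, and t-SNE. The dataset and every parameter value below are illustrative and are not the ones used to generate the figure.

```python
# Cluster first in the original five-dimensional space, then compare how the
# clusters appear under different two-dimensional embeddings.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, MDS, TSNE

X, _ = make_blobs(n_samples=600, n_features=5, centers=4, random_state=0)
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)   # -1 marks outliers

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "LLE": LocallyLinearEmbedding(n_neighbors=15, n_components=2).fit_transform(X),
    "non-metric MDS": MDS(n_components=2, metric=False, random_state=0).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}
for name, emb in embeddings.items():
    print(name, emb.shape)          # each embedding can then be colored by `labels`
```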
Figure 9. Example of clustering with the GMM, where two components are considered for clustering. The lines show the equi-probability contours of the model, i.e., a contour plot of the sum of the two Gaussian components. The figure was generated using code adapted from the scikit-learn documentation (https://scikit-learn.org/1.1/auto_examples/mixture/plot_gmm_pdf.html).
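A sketch along the lines of the scikit-learn example cited in the Figure 9 caption: a two-component GMM is fitted and its density is evaluated on a grid to draw equi-probability contours. The synthetic data, grid range, and contour levels are illustrative choices.

```python
# Fit a two-component Gaussian mixture and contour its (negative) log-density.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),
               rng.normal([5, 4], 0.8, size=(300, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

xx, yy = np.meshgrid(np.linspace(-4, 9, 200), np.linspace(-4, 8, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
log_density = gmm.score_samples(grid).reshape(xx.shape)   # log p(x) of the mixture

plt.contour(xx, yy, -log_density, levels=10)              # equi-probability contours
plt.scatter(X[:, 0], X[:, 1], s=4)
plt.show()
```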
Figure 10. Hierarchical clustering dendrogram for the bottom-up approach. The dashed line shows the level selected for clustering when 4 clusters are required, and the dotted line shows the level selected when 6 clusters are required.
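A dendrogram such as the one in Figure 10 can be produced with SciPy's agglomerative (bottom-up) routines; the synthetic data and the Ward linkage below are assumptions, and the two fcluster calls correspond to the four- and six-cluster cuts marked by the dashed and dotted lines.

```python
# Build the bottom-up merge history, draw the dendrogram, and cut it at two levels.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(5)
# Four loosely separated toy groups along a diagonal (illustrative only).
X = rng.normal(size=(60, 2)) + rng.integers(0, 4, size=(60, 1)) * 4.0

Z = linkage(X, method="ward")                         # agglomerative merge history
labels_4 = fcluster(Z, t=4, criterion="maxclust")     # cut giving 4 clusters (dashed line)
labels_6 = fcluster(Z, t=6, criterion="maxclust")     # cut giving 6 clusters (dotted line)

dendrogram(Z)
plt.show()
```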
Figure 11. Example of clustering with FCM applied to scattered data. The intensity of each data point reflects the probability of its membership in the cluster. The figure is taken from Colazo et al. [164], licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
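To make the soft memberships in Figure 11 concrete, the following from-scratch NumPy sketch implements the standard fuzzy c-means updates (weighted cluster centers and the membership rule u_ik = 1 / Σ_j (d_ik/d_jk)^(2/(m−1))). It is not the code used by Colazo et al. [164]; the fuzzifier m, iteration count, and toy data are arbitrary.

```python
# Minimal fuzzy c-means: alternate between updating centers and soft memberships.
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U (n_samples, n_clusters); rows sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]            # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
    return centers, U

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 0], [1.5, 2.5])])
centers, U = fuzzy_c_means(X)
print(U[:3])    # each row: membership probabilities of one point in the 3 clusters
```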
Figure 12. The clustering algorithms are applied to the same five-dimensional dataset, with the number of clusters set to four based on the BIC calculation. For algorithms that do not take the number of clusters as a hyperparameter, their hyperparameters were selected to yield four clusters. The algorithms include the GMM, K-means, HC, DBSCAN, FCM, and the SOM. After clustering, the dataset is reduced with PCA to a two-dimensional projection for visualization. The colors of the clusters are selected manually. When interpreting the results, note that PCA may not provide the best representation of the clusters; therefore, overlapping clusters or other irregularities do not necessarily indicate a failure of the algorithm. Among the six algorithms, the GMM and DBSCAN yield similar results by recognizing outliers, while K-means and HC produce similar clusterings.
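A hedged sketch of the workflow described in the Figure 12 caption: the number of clusters is chosen by minimizing the BIC of a GMM, several clustering algorithms are run on the same synthetic five-dimensional data, and PCA provides the two-dimensional view. Only four of the six algorithms are included here for brevity, and all parameter values are illustrative.

```python
# Choose k via the GMM's BIC, cluster with several algorithms, project with PCA.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=800, n_features=5, centers=4, random_state=1)

# Pick the number of components by minimizing the Bayesian Information Criterion.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 9)}
n_clusters = min(bics, key=bics.get)

results = {
    "GMM": GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X),
    "K-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X),
    "HC": AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=10).fit_predict(X),   # -1 = outliers
}
X2 = PCA(n_components=2).fit_transform(X)        # 2D projection for plotting
print({name: len(set(lbl)) for name, lbl in results.items()})
```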
Figure 13. Results of outlier detection using the GMM, DBSCAN, the SOM, and iForest on the five-dimensional dataset, presented using PCA. The labeled inliers are the blue dots, and the labeled outliers are the black dots. The four anomaly detection algorithms are able to identify the three major clusters to some extent. Under PCA visualization, both the GMM and SOM label a few points along the edge of the cluster as outliers, while DBSCAN and iForest consider half of the upper cluster as outliers. Again, PCA may not provide the best representation of the clusters, so irregularities do not necessarily indicate a failure of the algorithm.
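The outlier-flagging strategies compared in Figure 13 can be sketched as follows: DBSCAN noise labels, a low-likelihood cut on a fitted GMM, and Isolation Forest (the SOM-based variant is omitted here). The 2% likelihood threshold and the DBSCAN parameters are arbitrary illustrative choices, not the settings used for the figure.

```python
# Three ways of flagging outliers on the same synthetic five-dimensional data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=800, n_features=5, centers=3, random_state=2)

dbscan_outlier = DBSCAN(eps=1.5, min_samples=10).fit_predict(X) == -1

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
log_p = gmm.score_samples(X)
gmm_outlier = log_p < np.percentile(log_p, 2)           # flag the 2% least likely points

iforest_outlier = IsolationForest(random_state=0).fit_predict(X) == -1

print(dbscan_outlier.sum(), gmm_outlier.sum(), iforest_outlier.sum())
```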
Table 1. Benchmarking table of the dimensionality reduction algorithms.
PCA
- Runtime scaling ¹: O(n²) [23]
- Classic applications: Denoising and PCA correction ² applied to remove unwanted systematic variations [79,80,81,82]; finding the major driver of variation, where subsequent classification is performed [83,84,85].
- Novel applications: Given a stellar catalog, selecting high-quality samples for further ML training or analysis [86].
- Popular datasets and data types: SDSS [81,85]; spectra [80,83,85].
- Strengths: Can be automated and thus applied to large datasets [81]. Results are very interpretable. Significantly fast.
- Weaknesses and failure modes: Linear, and thus may not demonstrate differences between clusters. Denoising: fails when the systematic uncertainty varies case by case [79].
- Overall evaluation: Simple and widely applied. Fast ⇔ linear and simple.

Isomap
- Runtime scaling: Traditional: O(n²kn) ³; Floyd–Warshall: O(n³); Dijkstra: O(n² log n) [16,23]
- Classic applications: Reducing the dimensionality of data before applying supervised learning (e.g., XGBoost, Random Forest, SVM) [36,40,46].
- Novel applications: Spectral outlier detection, which finds mostly the spectra of spectroscopic binaries, but the method becomes less effective at low signal-to-noise ratio (SNR), where outliers lie close to the main clusters [87].
- Popular datasets and data types: SDSS [36,40,46,87]; stellar spectra [36,40,46,87] and latent space [41,88].
- Strengths: Isomap, being a non-linear method, projects clusters more distinctly and can thus be used for clearer visualization [88].
- Weaknesses and failure modes: When the SNR is low, the clusters may be projected close to each other, reducing the accuracy of clustering [87]. The program needs to be re-run from the start if new data points are added.
- Overall evaluation: Moderate cost; one can reduce the cost by selecting a good training set ⁴. Separation of clusters ⇔ sensitive to noise.

LLE
- Runtime scaling: O(n³) [16]
- Classic applications: Being a component of automated classification.
- Novel applications: Vanderplas and Connolly [44] were the first to systematically apply LLE to SDSS spectral classification.
- Popular datasets and data types: RAVE [47,89] and SDSS [44,48]; light curves [7,49,90] and stellar and galactic spectra [44,47,48,91].
- Strengths: Better preserves the relation between similar data points, which is valuable information for classification [44,47,48]. Once the lower-dimensional space is defined, classification based on LLE is fast and can be automated [7]. Low number of free parameters [7].
- Weaknesses and failure modes: Computational cost limits its use on large datasets. Sensitive to outliers; one solution is to use robust LLE [44,92]. May not generate meaningful results for small datasets [48]. Requires trial and error to find the best parameters [47]. The program needs to be re-run from the start if new data points are added [48]. Ward and Lumsden [39] note that LLE can fail to distinguish protostar spectra when continuum emission dominates; removing the continuum improves the performance of LLE.
- Overall evaluation: Moderate cost; one can reduce the cost by selecting a good training set. Best used on large, low-dimensional datasets.

t-SNE
- Runtime scaling: Traditional: O(n²); Barnes–Hut: O(n log n) [23]
- Classic applications: Creating, from a dataset, a subset of data points of research interest [57,93,94]; performing chemical tagging or very similar classification [56,95,96]; investigating gamma-ray bursts [58,97,98].
- Novel applications: Given stellar photometric data, identifying stellar populations and facilitating the classification of outlier spectral energy distributions [99].
- Popular datasets and data types: Multi-source datasets [56,57] and GALAH [94,96]; catalogs [55,56].
- Strengths: Performs well on high-dimensional datasets [54,55].
- Weaknesses and failure modes: Relatively high computational cost and relatively slow [54]. Requires parameter fine-tuning (especially perplexity) by inspecting the results; choosing suboptimal values can distort the low-dimensional embedding, potentially producing misleading clusters or relationships that do not reflect the true structure of the data [56,58]. The program needs to be re-run from the start if new data points are added [54].
- Overall evaluation: Similar to LLE's.

SOM
- Runtime scaling: O(n × m) [100]
- Classic applications: Redshift calibration [62,63,101,102]; dimensionality reduction (and clustering) by projecting the feature space onto a 2D grid of neurons [63,101,103,104]; galaxy visualization and classification [105,106].
- Novel applications: Viscasillas Vázquez et al. [107] use the SOM to cluster Gaia spectral data to identify hot subdwarf binaries; a small dataset is used.
- Popular datasets and data types: KiDS [62,63,102]; photometric data [62,63,103], typically galactic.
- Strengths: The SOM not only reduces the dimension but also clusters the data points into cells, and thus can be used to identify gold samples ⁵ [63,102] or generate prototypes that represent the cells [104]. Once trained, other catalogs can be mapped onto the same projection [65].
- Weaknesses and failure modes: A small number of neurons tends to be dominated by the most common features of the dataset, so complex and rare morphologies may be lost. Presenting the features better requires more neurons, which increases the computational cost; to reduce the cost, one may adopt hierarchical SOM layers [104].
- Overall evaluation: The SOM is usually used to reduce the dimension to 2D; projecting the data in 3D has little impact on the results [101]. Large computational cost limits its application to large datasets [104].

AE
- Runtime scaling: Depends on the number of epochs required, the data size, and the hardware.
- Classic applications: Convolutional AE (CAE, [108]) [71,109,110] and VAE [75,76,111,112,113,114]; anomaly detection [75,99,109,110]; feature extraction and dimensionality reduction [71,109,115,116]; generative modeling, producing new data points that resemble the original dataset by decoding the learned latent space [111,114].
- Novel applications: Scatter VAE, which can generate a Gaia BP/RP spectrum and simultaneously estimate the intrinsic scatter for individual spectra [113].
- Popular datasets and data types: Galaxy Zoo [71,109,114]; simulations [76,111,115] and images [71,109,112,114,115,116]; the images are mostly galactic.
- Strengths: The VAE is interpretable and stable and thus ideal among generative models [111].
- Weaknesses and failure modes: Training the VAE takes a long time [75]. The result of anomaly detection is very sensitive to the hyperparameters; e.g., if the AE is trained using brighter objects and dimmer objects are introduced, the AE may view the dimmer objects as anomalies [99]. For the VAE, the reconstructed feature space may not be very similar to the input feature space, which suggests that the decoder may lack sufficient power; however, if the decoder power increases, the VAE may experience posterior collapse, where the latent space is ignored [117].
- Overall evaluation: The VAE is more computationally expensive than the AE. The AE and CAE are widely applied to reduce the size of data, decreasing the computational cost of further data mining. Low output quality ⇔ high stability [111]. Takes a long time to train ⇔ becomes fast once trained.
¹ The runtime scaling is presented with respect to the number of data points n and the number of neurons m.
² PCA correction: the application of PCA to remove unwanted systematic variations, which may include noise.
³ n is the number of data points; k is the number of neighbors considered.
⁴ One can reduce the computational cost of Isomap and LLE by selecting a representative training set that captures the overall data distribution using fewer, more ‘informative’ spectra. Vanderplas and Connolly [44] illustrate this by dividing the dataset into smaller groups, selecting points with the largest reconstruction errors (those poorly represented by their local neighborhoods and thus more unique), and iteratively merging them until a suitably small training set is obtained.
⁵ Gold sample: a subset of a large photometric dataset that is represented by the spectroscopic dataset.
Table 2. Result of a focused search, showing the number of refereed astronomy papers in the ADS that applied each dimensionality reduction algorithm to different types of data from 2015 to 2025. The rows are ordered by the sum of all algorithm applications, with higher totals ranked first. The box corresponding to the most popular application for each algorithm is highlighted in yellow. We do not report counts for cases in which latent spaces are used as inputs to the SOM or AE because any focused search for such usage would also return studies where the latent spaces produced by the SOM or AE were subsequently fed into other algorithms, making the numbers ambiguous.
Data type | PCA | MDS | Isomap | LLE | t-SNE | SOM | AE | Total
Spectral data | 251 | 1 | 2 | 5 | 18 | 41 | 22 | 340
Spectroscopic data | 256 | 0 | 1 | 1 | 28 | 30 | 6 | 322
Image | 106 | 0 | 1 | 0 | 7 | 26 | 16 | 156
Photometry data | 46 | 0 | 0 | 2 | 7 | 51 | 1 | 107
Catalogs | 35 | 0 | 1 | 5 | 14 | 38 | 3 | 96
Light curves | 31 | 0 | 0 | 3 | 4 | 9 | 4 | 51
Astrometric data | 9 | 3 | 0 | 0 | 2 | 4 | 0 | 18
Radio-astronomy | 4 | 0 | 0 | 0 | 1 | 4 | 3 | 12
Polarimetric data | 9 | 0 | 0 | 0 | 0 | 0 | 1 | 10
Latent space | 3 | 0 | 1 | 0 | 0 | / | / | 4
Total No. | 750 | 4 | 6 | 16 | 81 | 203 | 56 | 1116
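For readers who wish to repeat a focused search like the ones behind Tables 2 and 4, the sketch below queries the ADS search API with the Python requests library. The endpoint follows the publicly documented ADS API, but the query string, the property:refereed and collection:astronomy filters, and the hypothetical ADS_TOKEN environment variable are assumptions for illustration; the exact queries used by the authors are not given in the tables.

```python
# Count refereed astronomy papers matching an example full-text query via ADS.
import os
import requests

token = os.environ["ADS_TOKEN"]        # hypothetical environment variable holding an ADS API token
params = {
    "q": 'full:"t-SNE" AND full:"light curve" AND collection:astronomy '
         'AND property:refereed AND year:2015-2025',
    "fl": "bibcode",
    "rows": 1,
}
resp = requests.get("https://api.adsabs.harvard.edu/v1/search/query",
                    params=params, headers={"Authorization": f"Bearer {token}"})
print(resp.json()["response"]["numFound"])   # number of matching refereed papers
```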
Table 3. Benchmarking table of the clustering algorithms.
GMM
- Runtime scaling ¹: O(n) [151]
- Classic applications: Studying open clusters (OCs) [133,151,176]; clustering objects such that one or more clusters exhibit properties of scientific interest [131,176,177].
- Novel applications: A novel pre-processing method [178] that enhances the performance of the GMM (and potentially other density estimation algorithms), with applications so far limited to gravitational-wave and black hole astronomy.
- Popular datasets and data types: Gaia [133,151,176]; stellar catalogs [133,151,176] and photometry data [127,128].
- Strengths: Fast and simple. Effective at finding membership lists of OCs [151].
- Weaknesses and failure modes: Does not deal with noise ⇒ expensive and ineffective in the presence of noise [151]. Struggles to represent a non-Gaussian cluster with a single Gaussian distribution and may split the cluster into multiple separate but related clusters [179]. Hard-boundary cuts applied to the data may degrade the performance; a solution is to exclude the dimensions that were used for the hard-boundary selection of the data [179]. Sensitive to the number of clusters [151]. The GMM fails when there are severe outliers in a large dataset: it clusters field stars as numerous clusters, and with more field stars and a greater number of components, the size of the routine (O(n·m)) increases drastically with the size of the dataset ² [151]; this is improved by imposing the GMM on sub-partitions.
- Overall evaluation: Inefficient for a blind search of a large dataset with a lot of noise [151]. Fast and simple ⇔ more sensitive to convex-shaped clusters compared to other shapes.

K-means
- Runtime scaling: O(n) [151]
- Classic applications: Clustering objects such that one or more clusters exhibit properties of scientific interest [180,181,182].
- Novel applications: K-means is used as a component of a semi-supervised learning framework (Section 5), in combination with Random Forest, to classify spectroscopic data of stars, galaxies, and quasars [183].
- Popular datasets and data types: APOGEE [181,182,184]; spectroscopic data, where most examples use chemical abundances [180,181,182,183,184,185,186].
- Strengths: Fast and simple [180,186]. Applicable to large datasets [180]. Effective for near-normally distributed isotropic clusters in high-dimensional space [184].
- Weaknesses and failure modes: Greatly depends on the initialization and thus needs to be repeated with different initializations [180,183]. Requires a known number of clusters [184]. More sensitive to spherical clusters than to other shapes [183,184]. Does not deal with noise [184].
- Overall evaluation: Does not deal with noise ⇒ more effective if the clusters are dense [184]. Fast and simple ⇔ assumes clusters are spherical and of similar sizes [183].

HC
- Runtime scaling: Traditional: O(n³); FOF: O(n log n) [100]
- Classic applications: Agglomerative HC [100,139,187,188,189]; clustering objects so that the groups can be studied separately [100,140,187,190]; clustering to reveal the structure of the object (i.e., the dataset) [188,189,191].
- Novel applications: Data points are first clustered into SOM cells (i.e., neurons), and the cells are further clustered using HC. Compared to HC alone, this reduces the computational cost; compared to the SOM alone, it is more flexible and faster for viewing the results of different numbers of clusters [100].
- Popular datasets and data types: Different databases are used (e.g., KiDS [100], Gaia [139], ALMA [189], and multi-source datasets [140,187]); photometric data [100,187], astrometric data [139,140], and catalogs [188,190].
- Strengths: Robust, flexible, and interpretable [100]. Dendrograms can be very informative and interpretable, providing information on how close or similar two clusters are [139] or showing the hierarchical structure of the objects [189,191].
- Weaknesses and failure modes: High computational cost [100]. Handles overlapping clusters poorly [190].
- Overall evaluation: High computational cost ⇔ provides an informative dendrogram and does not require re-running to explore different numbers of clusters. Best used to study hierarchically structured large clusters.

DBSCAN
- Runtime scaling: O(n log n) [151]
- Classic applications: Searching for OC candidates [146,151,192] and molecular cloud candidates [148]; in broader terms, providing membership lists for clusters, with one or multiple clusters subsequently analyzed [146,147,148,151,192,193].
- Novel applications: DBSCAN is a component of the Cavity Detection Tool (CADET), an automated machine learning pipeline for detecting and measuring X-ray cavities, in which DBSCAN is used to cluster processed pixels [194].
- Popular datasets and data types: Gaia [147]; astrometric data [146,147,151,192] and catalogs [193,195].
- Strengths: Good scalability ³ and can be used on large datasets [146,151]. Can be used for blind searching [151]. Can discard small clusters [151]. Does not require an a priori number of clusters [192,193]. Can detect arbitrarily shaped clusters [146,192,193].
- Weaknesses and failure modes: Sensitive to hyperparameters [151]. Applies the same requirements to all clusters (i.e., has global parameters) and thus may not detect clusters of different densities [146,147,151]; a solution is to divide the feature space into multiple smaller regions for clustering [146]. Does not handle uncertainty [195]. Low sensitivity: may fail to cluster diffuse clusters [147]. Given photometric and astrometric data, DBSCAN may fail to differentiate two groups that have different kinematics but are spatially close [147].
- Overall evaluation: Can be applied to large datasets but may require dividing the feature space into regions for separate clustering [147]. More sensitive for detecting distant OCs [151].

HDBSCAN
- Runtime scaling: O(n log n) [151]
- Classic applications: Providing membership lists for clusters, with each cluster subsequently analyzed [56,133,151,196]; providing information about the substructures of a group [154,156].
- Novel applications: HDBSCAN-MC: integrating Monte Carlo simulation with HDBSCAN to better handle astrometric data with significant uncertainties [197].
- Popular datasets and data types: Gaia [133,151,156,196,197]; astrometric data [133,151,156,196,197] and photometric data [156,198].
- Strengths: Able to reveal substructures [151,156]. High sensitivity: can find clusters of different densities [133,151] and small-scale structure [154]. Parameters are relatively straightforward to interpret when applied to detecting open clusters [151].
- Weaknesses and failure modes: Low precision: requires careful post-processing to select valid clusters and remove false positives [151]. Prone to finding clusters in the densest region [151]. Not effective for diffuse clusters with a low SNR [151]. Given astrometric data, HDBSCAN may fail to detect both nearby and distant clusters at the same time when the parallax range is large [196].
- Overall evaluation: High sensitivity ⇔ low precision [151]. Higher sensitivity to nearby OCs [151]. Acceptable runtime: does not increase significantly compared to DBSCAN [151].

FCM
- Runtime scaling: O(n) [199]
- Classic applications: Visualizing the structure of the dataset or the shape of the clusters [172,174,175]; studying the Sun [171,174,175,200].
- Novel applications: Application in astronomy is limited. A newer application is to cluster sunspot cycles [171]; fuzzy logic has often been applied to clustering sunspot cycles, but this paper is the first to use FCM.
- Popular datasets and data types: Various databases (e.g., simulations [172], SDSS [164], the ATNF Pulsar Catalogue [173], and APOGEE [167]); images [174,175], catalogs [167,173], and time series [171,172].
- Strengths: Shows the membership probability of each data point, which can be used to compile a list of ambiguous targets that require re-classification [164]. Useful if the cluster boundaries are not clear [174].
- Weaknesses and failure modes: Not applicable to large datasets with a large number of clusters; the runtime (O(n·m²)) increases drastically with the number of clusters m [199].
- Overall evaluation: Should be applied if the clusters overlap. Soft clustering and thus informative ⇔ high computational cost.
¹ The runtime scaling is presented with respect to the number of data points n.
² n is the number of data points; m is the number of clusters.
³ Good scalability: the performance does not degrade much when applied to large datasets.
Table 4. Result of a focused search, showing the number of refereed astronomy papers in ADS that applied each clustering algorithm to different types of data from 2015 to 2025. The rows are ordered by the sum of all algorithm applications, with higher totals ranked first. The box corresponding to the most popular application for each algorithm is highlighted in yellow.
Data type | GMM | K-Means | HC | DBSCAN | HDBSCAN | FCM | Total
Spectroscopic data | 59 | 54 | 24 | 9 | 24 | 0 | 170
Spectral data | 43 | 76 | 19 | 18 | 11 | 1 | 168
Catalogs | 48 | 27 | 22 | 31 | 22 | 1 | 151
Image | 36 | 70 | 13 | 13 | 2 | 3 | 137
Photometric data | 51 | 14 | 8 | 14 | 26 | 1 | 114
Astrometric data | 4 | 1 | 6 | 16 | 22 | 1 | 50
Bivariate data | 11 | 11 | 12 | 1 | 1 | 0 | 36
Light curves | 10 | 9 | 3 | 3 | 3 | 0 | 28
Polarimetric data | 2 | 7 | 0 | 0 | 0 | 0 | 9
Latent space | 3 | 0 | 0 | 0 | 0 | 0 | 3
Total No. | 267 | 269 | 107 | 105 | 111 | 7 | 866