Review Reports
- Chih-Ting Kuo1,*,
- Duo Xu2 and
- Rachel Friesen3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
1. Overall Assessment
This manuscript presents a timely and comprehensive review of unsupervised machine learning (ML) techniques and their applications in astronomy. The topic is of significant interest given the data-driven nature of modern astrophysics. The authors have undertaken a substantial effort to catalog and benchmark a wide array of algorithms, supported by practical examples and synthetic data tests. The scope is impressive, covering dimensionality reduction, clustering, neural networks, anomaly detection, and symbolic regression.
However, the manuscript in its current form requires minor revisions to meet the standards of a high-quality review article.
2. Major Strengths
Comprehensive Scope: The review covers an extensive range of algorithms, from classic techniques like PCA and K-Means to more modern methods like t-SNE, HDBSCAN, and Variational Autoencoders. This breadth is a major asset.
Practical Utility: The inclusion of benchmarking tables (e.g., Tables 1, 3, 5) and the application-focused search results (Tables 2, 4, 6) is highly valuable for practitioners seeking to choose an appropriate algorithm for their specific data type and task.
Hands-on Examples: The use of both real astronomical data (Gaia astrometry) and synthetic datasets to demonstrate and compare algorithms is a strong point. It successfully bridges the gap between theoretical description and practical application.
For example, Figure 6 provides an excellent at-a-glance comparison of how different dimensionality reduction techniques handle the same dataset, which is incredibly useful for a reader.
Structure: The high-level structure (Introduction, Dimensionality Reduction, Clustering, etc.) is logical and appropriate for a review of this nature.
3. Weaknesses
Technical Clarifications:
The description of the EM algorithm in Section 3.1 is slightly misleading. It does not start with "random parameters"; it starts with an initial guess (e.g., from K-Means) and iterates between the E-step and M-step (a short illustrative sketch follows this list of comments).
The manuscript would benefit from a consistent notation (e.g., $x$ for variables) and careful checking of mathematical expressions.
Appendix A: The use of synthetic data for validation is a major strength. However, the analysis should more explicitly state the limitations of these tests (e.g., idealized clusters without noise, known labels) and how the conclusions might differ for real, messier astronomical data.
“Table” should be “Tables” in lines 712, 843, 858.
“Fig.” should be “Figures” in line 853.
“Fig.” should be “Figure” in the text.
The captions of tables and figures should end with a period.
The captions of tables should be placed above the tables.
References 137 and 141 are missing.
“Total” should be “Total No.” in Tables 2, 4 and 6.
“Accuracy” should be “Accuracy (%)” in Tables A1, A2, A3, and A4.
The format of the first row in Tables A2 and A4 should be the same as Tables A1 and A3.
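Regarding the EM-algorithm comment above, the following is a minimal sketch of the initialize-then-iterate structure described there. It assumes scikit-learn's GaussianMixture (whose default initial guess comes from k-means) and uses placeholder data; it is an illustration only, not code from the manuscript.

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder two-component data for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

# Initial guess from k-means (scikit-learn's default init_params), followed by
# EM iterations alternating the E-step and M-step until convergence.
gmm = GaussianMixture(n_components=2, init_params="kmeans", max_iter=100)
labels = gmm.fit_predict(X)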
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
The manuscript offers a valuable overview of methodological usage in the field, and the inclusion of benchmarking results represents a particularly welcome and practical contribution to the community. However, the discussion would benefit from a clearer articulation of the motivation behind each method and its specific applications within astronomy. It would also be helpful to more distinctly separate linear and non-linear approaches in the paper’s structure.
At present, many sections read as descriptive lists of methods and their limitations, without sufficient detail on implementation aspects or concrete astronomical case studies that would guide readers wishing to apply these techniques in practice. In addition, while the discussion overlaps with previous work (e.g., Fotopoulou et al.), it does so in a less comprehensive manner.
Overall, several areas require substantial expansion and clarification to meet the expectations for a thorough and authoritative review in this domain. Specific points for consideration are outlined below:
Line 82:
“when all principal components are used, the full information of the dataset is preserved” - how is “all” defined?
Line: 84:
“the data are reduced to k-dimension” -> “the data are reduced to k-dimensions”
A critical discussion on how to choose the number of components k in methods like PCA is necessary.
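For reference, one standard criterion such a discussion could cover is the cumulative explained variance. A minimal sketch, assuming scikit-learn and an illustrative 95% threshold (both my own choices, not taken from the manuscript):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 20)                     # placeholder data matrix
pca = PCA().fit(X)                              # fit with all components kept
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest k reaching 95%
print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")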
Line 101:
“incremental PCA is developed to support minibatch processing but is relatively less accurate.” -> The English needs improving. Also, what is minibatch processing and why is it less accurate?
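For context, "minibatch processing" here means updating the decomposition on successive chunks of the data rather than on the full matrix at once, which saves memory at some cost in accuracy. A minimal sketch, assuming scikit-learn's IncrementalPCA and placeholder data:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=5)
for _ in range(10):                      # e.g. ten chunks streamed from disk
    batch = np.random.rand(1000, 20)     # placeholder minibatch
    ipca.partial_fit(batch)              # update the estimate incrementally

X_new = np.random.rand(100, 20)
X_reduced = ipca.transform(X_new)        # project onto the 5 components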
Line 110:
“The difference can be reduced by using a different kernel function or modifying the code” -> what specific code modifications are generally required to reduce the difference mentioned?
Line 231:
“LLE also shows a tendency to overcrowd the data points, possibly due to its sensitivity to outliers.” But PCA is also sensitive to outliers.
Table 1: Isomap: “one can reduce cost by selecting a good training set” - what is considered a good training set? The same question applies to LLE.
Table 1 implies LLE has no novel applications or failure modes, but it does; see e.g. Vanderplas 2009. Similarly, t-SNE is sensitive to hyperparameters: Steinhart+2020.
Line 296:
“the variational Bayesian GMM” - a reference is needed.
Line 323:
“distortion curve begins to level off” - what is the distortion curve?
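For context, the "distortion curve" usually refers to the within-cluster sum of squared distances (the k-means inertia) plotted against the number of clusters k; the "elbow" where it levels off is commonly taken as a reasonable k. A minimal sketch, assuming scikit-learn and placeholder data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)               # placeholder data
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in range(1, 11)]
# Plot distortions against k and look for the "elbow" where the curve flattens.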
Line 408:
What are the disadvantages of HDBSCAN?
Line 470:
Section for SOM should be referenced. “SOM yields poor clustering performance; however, it is typically not used for clustering directly” - what is it used for?
Table 3: again, ensure the tables are complete; also, the failure modes overlap with the weaknesses, so you could consider combining the two columns for conciseness.
Table 5:
SOM can also be used for galaxy visualisation and classification (Polsterer+2015, Kollasch+2024).
AE runtime scaling also depends on data size and hardware.
P30:
In many modern applications, AE is widely used as nomenclature for CAE, so the discussion here is obsolete. Similarly, most VAE models use convolutional layers.
There is some section redundancy. The Neural Network section should be reconsidered. AE is better placed under Dimensionality Reduction, and SOM under Dimensionality Reduction or Clustering. I also think that Anomaly Detection and Symbolic Regression warrant a more in-depth discussion, and possibly inclusion in the benchmarking study. For Symbolic Regression (Line 650), discuss why unsupervised methods appear underexplored and provide examples of physics-based applications.
Lines 685-701 feel out of place and might be better placed in the introduction.
When you talk about hybrid and stacking models: these are more commonly referred to as ensemble methods.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report (New Reviewer)
Comments and Suggestions for Authors
This manuscript provides a comprehensive and well-written review of unsupervised machine learning algorithms applied in astronomy. It is organized into three main classes: dimensionality reduction, clustering, and neural networks. For each method, the authors present suitable astronomical applications and clearly describe their strengths and limitations.
However, the current title “Unsupervised Machine Learning in Astronomy” is somewhat too broad. I suggest replacing it with a more specific and informative title, such as “A Brief Review of Unsupervised Machine Learning Algorithms in Astronomy: Dimensionality Reduction, Clustering, and Neural Networks”, which would better reflect the paper’s systematic and data-driven scope. Aside from this minor suggestion, the reviewer finds no significant weaknesses or omissions. Any minor stylistic or formatting issues can be addressed during the editorial process. I therefore recommend publication.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
I am happy with the corrections made; I believe it is publishable in its current state.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overall, I am unsure whether or not this review is going to be of real use for the community. While it presents an overview of several available methods, the manuscript in its current form does not provide a clear explanation of the purpose, strengths, or weaknesses of individual methods in an astronomy context. More importantly, the scope of this review is so broad (astronomy as a whole) that it inevitably results in a lack of focus. It would be better to identify a given sub-area of astronomy and focus on that.
The review includes 136 references, which do not seem adequate for the scope of a comprehensive review.
The added value and novelty of the review are unclear, especially since it is intended for the astronomy community but lacks an adequate astronomical perspective, focusing instead on generic technical details that could be obtained by browsing the internet. The applications to astronomical data, and the original analysis and comparison performed using a common astronomical dataset, lack a proper in-depth discussion.
A paragraph describing the types of problems for which each algorithm is best suited, as well as its pros and cons, would be highly appreciated. In its current form, the manuscript mostly includes a list of papers in which a given technique is used, without even a minimal discussion. A good example is given at lines 496-499 for anomaly detection, but it would be helpful to see this for every method.
In the following, I list some minor comments.
Check affiliation 3: it is not used and seems identical to affiliation 1.
lines 21-23: unnecessary and a bit too simplistic.
27: the use of “innumerable” appears redundant
34: “and predictions” ==> “and makes predictions”
35: this may largely depend on the problem that is tackled.
57-60: redundant statements; make them more concise
71: “invented” ==> “initially developed”, “developed by Hotelling” ==> “further improved by Hotelling”. A similar comment applies to lines 262 and 358.
110-111: the suggested literature is quite diverse, and it is unclear what the common aspect between the references is. As a reader, I do not get a clear picture of the types of problems for which PCA can be helpful, because the description given is too technical.
213: use the tilde to keep everything on the same line
217: the results of the comparison should be discussed. The results of the four dimensionality reduction algorithms are completely different; it is unclear how to interpret this, and no in-depth discussion of the comparison is given.
472-474: these sentences do not connect with the previous one and even appear to contradict each other.
500-501: it is unclear which data set is used.
502-503: conclusion is missing.
517: unclear statement. I assume this refers to supervised SR, but it is worth clarifying.
Author Response
Attached please find our responses.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper may be considered for publication for its thorough overview of unsupervised machine learning techniques in astronomy, including clustering, dimensionality reduction, neural networks, anomaly detection, and symbolic regression. The inclusion of seminal works and up-to-date references (e.g., papers up to 2025) strengthens its relevance.
Advantages:
The paper is well-organized, with logical progression between sections and effective categorization of algorithms (e.g., clustering vs. dimensionality reduction). This enhances readability and accessibility.
The use of example figures (e.g., PCA, t-SNE, clustering results) and real-world applications (e.g., astrometry data) effectively illustrates the utility of these methods in astronomy.
The balanced discussion of advantages and limitations (e.g., PCA, DBSCAN’s parameter sensitivity) provides valuable insights for practitioners.
The review successfully bridges the gap for astronomers unfamiliar with ML, offering both high-level explanations and technical details (e.g., mathematical formulations of LLE).
Disadvantages:
While the review is comprehensive, it primarily synthesizes existing knowledge without presenting new methodologies, case studies, or experimental results. A stronger original contribution (e.g., novel applications, benchmarking studies) would enhance its impact.
The paper could expand on open challenges (e.g., scalability for large surveys like LSST) and future research directions (e.g., hybrid models, interdisciplinary applications). Adding a "Future Directions" section would strengthen this.
Technical jargon (e.g., "manifold learning") may hinder accessibility for non-ML audiences. A glossary or expanded definitions could help address this.
Recommendations:
This paper is worth publishing but requires minor revisions.
Key comments include:
Deepen the critique of challenges (e.g., noise/sparsity in astronomical data, comparison to supervised methods).
Add a "Future Directions" section to highlight gaps (e.g., hybrid models, scalability).
Although the authors provide application examples for each algorithm in astronomy, they only mention the data and references used. Astronomers are more interested in understanding the specific tasks these algorithms solve. The authors could expand on the application examples by specifying 'which astronomical problems' were addressed using 'which data'.
Additional Notes for Authors:
Consider leveraging datasets from upcoming surveys (e.g., LSST, JWST) to demonstrate real-world scalability challenges.
A table summarizing algorithmic trade-offs (e.g., computational cost, accuracy) could further aid practitioners.
Some modifications are as follows:
Page 1, Abstract:
"Unsupervised machine learning enables the researchers to analyze large..."
→ "Unsupervised machine learning enables researchers to analyze large..." (Remove "the" before "researchers.")
Page 1, Introduction:
"ML, a subfield of artificial intelligence, aims to mimic the human brain using computers."
→ "ML, a subfield of artificial intelligence, aims to mimic human brain functions using computers."
Page 2, Introduction:
"The project focuses on unsupervised ML algorithms, which are sometimes considered more helpful for scientific research as they are not limited by present knowledge and can be used to extract new knowledge [14]."
→ "This review focuses on unsupervised ML algorithms, which are often considered more useful for scientific research because they are not constrained by existing knowledge and can uncover new insights [14]."
Page 3, Section 2.1:
"PCA performs singular value decomposition (SVD) and can be seen as a rotation and projection of the data set that maximizes the variance, making the data features more significant [15]."
→ "PCA performs singular value decomposition (SVD) and can be interpreted as a rotation and projection of the dataset to maximize variance, thereby highlighting the most important features [15]."
Page 3, Section 2.1:
"The difference can be reduced by imposing a different built-in function or revising the code, as suggested by Pedregosa et al. [23]."
→ "The difference can be reduced by using a different kernel function or modifying the code, as suggested by Pedregosa et al. [23]."
Page 5, Section 2.4:
"t-SNE also addresses the issue of overcrowding at the center - a common problem in many dimensionality reduction algorithms, where points become densely packed in the center - by using a t-distribution, which has heavier tails than a Gaussian distribution [16]."
→ "t-SNE also mitigates the issue of overcrowding at the center, a common problem in many dimensionality reduction algorithms where points cluster densely, by using a t-distribution with heavier tails than a Gaussian distribution [16]."
Page 7, Section 3.1:
"GMM assumes that the number of components is known. However, in most applications, the number of components is unknown."
→ "GMM assumes the number of components is known, though this is often not the case in practice."
Page 10, Section 3.4:
"The samples are considered to be in the same cluster as their neighboring core sample."
→ "Samples are assigned to the same cluster as their nearest core sample."
Page 15, Section 5.1:
"iForest may not perform well when there exists interdependence between features [125]."
→ "iForest may not perform well when interdependencies exist between features [125]."
Page 16, Section 6:
"This review also includes examples that demonstrate the results of applying these algorithms to a five-dimensional astrometry data set."
→ "This review also provides examples demonstrating the results of applying these algorithms to a five-dimensional astrometry dataset."
Author Response
Attached please find our responses.
Author Response File:
Author Response.pdf