Author Contributions
Conceptualization, R.C. and O.A.; methodology, R.C. and O.A.; software, R.C. and T.H.; investigation, R.C., T.H. and O.A.; resources, O.A.; data curation, R.C.; writing, original draft preparation, R.C. and O.A.; writing, review and editing, R.C. and O.A.; visualization, R.C.; supervision, O.A.; project administration, O.A. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Example galaxy images from DESI-LS [3].
Figure 2.
Example optical spectra from DESI [2].
Figure 3.
Two-dimensional UMAP projections of aligned image embeddings. Galaxies are coloured by binned physical properties. Each panel represents a distinct property with values divided into low and high categories.
Figure 4.
Two-dimensional UMAP projections of unaligned image embeddings. Galaxies are coloured by binned physical properties. Each panel represents a distinct property with values divided into low and high categories.
Figure 5.
Two-dimensional UMAP projections of aligned spectrum embeddings. Galaxies are coloured by binned physical properties. Each panel represents a distinct property with values divided into low and high categories.
Figure 6.
Two-dimensional UMAP projections of unaligned spectrum embeddings. Galaxies are coloured by binned physical properties. Each panel represents a distinct property with values divided into low and high categories.
Figure 7.
Elbow curves for K-means clustering of AstroDino (left) and AstroCLIP (right) image embeddings. Inertia is plotted against the number of clusters k. The dotted red lines mark the selected value of k.
Figure 8.
Elbow curves for K-means clustering of SpecFormer (left) and AstroCLIP (right) spectral embeddings. The dotted red lines mark the selected value of k.
Figure 9.
Scatter plot of sample coordinates (RA, DEC) colour-coded by cluster label as determined by K-means clustering.
Figure 10.
Box plot of nearest-neighbour angular distances, illustrating the spread of local separations across the galaxy sample.
Figure 11.
Distribution of galaxies across neighbour bins, defined by the number of companions within the angular distance threshold.
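The quantities behind Figures 10 and 11 can be illustrated with a minimal sketch. It assumes coordinates in degrees, a brute-force pairwise search, and a placeholder threshold; the function names, the toy coordinates, and the threshold value are hypothetical and are not taken from the paper.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def neighbour_stats(coords, threshold_deg):
    """Per object: (nearest-neighbour distance, number of companions within threshold).

    Brute force, O(n^2); a spatial index would be needed for a full catalogue.
    """
    stats = []
    for i, (ra_i, dec_i) in enumerate(coords):
        dists = [angular_separation_deg(ra_i, dec_i, ra_j, dec_j)
                 for j, (ra_j, dec_j) in enumerate(coords) if j != i]
        stats.append((min(dists), sum(d <= threshold_deg for d in dists)))
    return stats

# Toy (RA, DEC) pairs in degrees, for illustration only.
coords = [(10.0, 0.0), (10.1, 0.0), (10.0, 0.2), (50.0, 30.0)]
for nearest, count in neighbour_stats(coords, threshold_deg=0.25):
    print(f"nearest = {nearest:.3f} deg, companions within threshold = {count}")
```

The companion counts returned here are what would then be grouped into neighbour bins such as few, moderate, and many.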
Figure 12.
Box plot of the absolute residuals for each model and property using KNN.
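As a rough illustration of the absolute residuals summarised in Figure 12, the sketch below runs a plain k-nearest-neighbour regression on toy two-dimensional "embeddings". The value of k, the Euclidean metric, and all data are assumptions made for illustration and may differ from the paper's setup.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training embeddings closest to `query` (Euclidean)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

# Toy embeddings and property values, not the paper's data.
train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_y = [8.0, 8.1, 8.2, 9.0, 9.1, 9.2]
test_x = [(0.05, 0.05), (1.05, 1.05)]
test_y = [8.0, 9.3]

# Absolute residual |prediction - truth| for each held-out sample.
abs_residuals = [abs(knn_predict(train_x, train_y, x) - y)
                 for x, y in zip(test_x, test_y)]
print(abs_residuals)
```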
Figure 13.
Distribution of target variable values with quantile-based bin membership.
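Quantile-based bin membership, as shown in Figure 13, can be sketched with the standard library. The number of bins and the toy target values below are assumptions for illustration only.

```python
import bisect
import statistics

def quantile_bin_edges(values, n_bins):
    """Interior cut points that split `values` into n_bins roughly equal-count bins."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges)."""
    return [bisect.bisect_right(edges, v) for v in values]

# Toy target values, not the paper's data.
targets = [8.1, 8.4, 8.5, 8.7, 8.8, 8.9, 9.0, 9.3]
edges = quantile_bin_edges(targets, n_bins=4)
bins = assign_bins(targets, edges)
print(edges)
print(bins)
```

Because the edges are quantiles of the data, each bin receives a similar number of samples even when the target distribution is skewed.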
Figure 14.
Net instance count by error magnitude where AstroCLIP outperformed (negative values) or underperformed (positive values) AstroDino for , shown by target-value bins.
Figure 15.
Net instance count by error magnitude where AstroCLIP outperformed or underperformed AstroDino for , by target-value bins.
Table 1.
Model architecture and training infrastructure for each stage of the AstroCLIP pipeline.
| Property | Image (DINOv2) | Spectra Transformer | AstroCLIP (CLIP) |
|---|---|---|---|
| Architecture | ViT-L (24 layers, 16 heads) | 6-layer Transformer (6 heads) | Dual encoder + cross-attention |
| Parameters | ∼307 M | ∼43 M | — |
| Patch Size | | 20 (overlap 10) | — |
| Input Resolution | | 7514-d spectrum | Paired image–spectrum |
| Loss Function | DINO + iBOT + KoLeo | MSE (masked reconstruction) | InfoNCE |
| Hardware | 16 × H100 GPUs | 4 × H100 GPUs | 1 × H100 GPU |
| Training Time | ∼48 h | ∼24 h | ∼48 h |
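
The InfoNCE objective listed for the AstroCLIP stage can be sketched in plain Python. This is a minimal illustration assuming the symmetric CLIP-style form (image-to-spectrum and spectrum-to-image cross-entropy averaged) and a fixed logit scale applied as a multiplier on cosine similarities; both choices are assumptions of this sketch, and the toy embeddings are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def info_nce(image_emb, spectrum_emb, logit_scale=15.5):
    """Symmetric InfoNCE over a batch of paired image and spectrum embeddings."""
    img = [normalise(v) for v in image_emb]
    spec = [normalise(v) for v in spectrum_emb]
    logits = [[logit_scale * sum(a * b for a, b in zip(u, v)) for v in spec]
              for u in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: matched pairs give a near-zero loss, swapped pairs a large one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```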
Table 2.
Training hyperparameters for each stage of the AstroCLIP pipeline [1].
| Hyperparameter | Image (DINOv2) | Spectra Transformer | AstroCLIP (CLIP) |
|---|---|---|---|
| Training Length | 200 epochs | 500k steps | 100 epochs |
| Batch Size | 1536 (72/GPU × 16) | 64 | 256 |
| Embedding Dim. | 1024 | 768 | 512 |
| Optimiser | AdamW | AdamW (β₁, β₂: 0.9, 0.95) | AdamW |
| Learning Rate | 2 × 10⁻⁴ (warmup + cosine) | 1 × 10⁻⁵ (warmup + cosine) | 1 × 10⁻⁴ (cosine) |
| Weight Decay | 0.001 → 0.01 | 0.1 | 0.05 |
| Gradient Clipping | 3.0 | 1.0 | 1.0 |
| EMA Momentum | 0.992 → 1.0 | — | — |
| Masking Strategy | iBOT () | 6 segments of length 30 | None |
| Logit Scale | — | — | 15.5 (fixed) |
| Drop Path Rate | 0.3 | — | — |
| Dropout | — | 0.0 | — |
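
The "warmup + cosine" learning-rate entries can be read as a linear ramp followed by cosine decay. The sketch below assumes that common form; the warmup length and the minimum learning rate are placeholders, since the table does not state them, while the peak rate and step count follow the Spectra Transformer column.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine

# warmup_steps=2_000 is a hypothetical value chosen only for illustration.
for step in (0, 1_999, 250_000, 499_999):
    lr = warmup_cosine_lr(step, total_steps=500_000, peak_lr=1e-5, warmup_steps=2_000)
    print(step, lr)
```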
Table 3.
Galaxy property estimation performance.
| Source | Method | | | | |
|---|---|---|---|---|---|
| Images | AstroCLIP Zero-Shot | 0.74 | 0.44 | 0.27 | 0.44 |
| | AstroCLIP Few-Shot | 0.73 | 0.43 | 0.26 | 0.42 |
| | Unaligned Trans. Zero-Shot | 0.65 | 0.40 | 0.16 | 0.25 |
| | Unaligned Trans. Few-Shot | 0.72 | 0.43 | 0.23 | 0.40 |
| Spectra | AstroCLIP Zero-Shot | 0.87 | 0.57 | 0.43 | 0.63 |
| | AstroCLIP Few-Shot | 0.88 | 0.58 | 0.43 | 0.64 |
| | Unaligned Trans. Zero-Shot | 0.84 | 0.57 | 0.38 | 0.62 |
| | Unaligned Trans. Few-Shot | 0.88 | 0.64 | 0.47 | 0.69 |
Table 4.
Galaxy property estimation performance on the training set.
| Source | Method | | | | |
|---|---|---|---|---|---|
| Images | AstroCLIP Zero-Shot | 0.77 | 0.47 | 0.27 | 0.48 |
| | AstroCLIP Few-Shot | 0.77 | 0.45 | 0.28 | 0.48 |
| | Unaligned Trans. Zero-Shot | 0.64 | 0.39 | 0.17 | 0.24 |
| | Unaligned Trans. Few-Shot | 0.73 | 0.42 | 0.21 | 0.39 |
| Spectra | AstroCLIP Zero-Shot | 0.88 | 0.57 | 0.41 | 0.63 |
| | AstroCLIP Few-Shot | 0.88 | 0.61 | 0.43 | 0.68 |
| | Unaligned Trans. Zero-Shot | 0.84 | 0.58 | 0.34 | 0.62 |
| | Unaligned Trans. Few-Shot | 0.89 | 0.65 | 0.47 | 0.70 |
Table 5.
Bootstrap 95% confidence intervals for and (AstroCLIP minus unaligned baseline) using image embeddings.
| Property | Model | Zero-Shot | Zero-Shot 95% CI | Few-Shot | Few-Shot 95% CI |
|---|---|---|---|---|---|
| | AstroCLIP | 0.74 | [0.732, 0.748] | 0.73 | [0.722, 0.738] |
| | Unaligned | 0.65 | [0.639, 0.661] | 0.72 | [0.710, 0.730] |
| | Difference | 0.09 | [0.083, 0.097] | 0.01 | [0.004, 0.017] |
| | AstroCLIP | 0.44 | [0.425, 0.455] | 0.43 | [0.414, 0.446] |
| | Unaligned | 0.40 | [0.385, 0.415] | 0.43 | [0.414, 0.446] |
| | Difference | 0.04 | [0.032, 0.049] | 0.00 | [−0.008, 0.008] |
| | AstroCLIP | 0.27 | [0.256, 0.284] | 0.26 | [0.245, 0.275] |
| | Unaligned | 0.16 | [0.147, 0.173] | 0.23 | [0.214, 0.246] |
| | Difference | 0.11 | [0.100, 0.120] | 0.03 | [0.018, 0.042] |
| | AstroCLIP | 0.44 | [0.429, 0.451] | 0.42 | [0.407, 0.432] |
| | Unaligned | 0.25 | [0.239, 0.261] | 0.40 | [0.387, 0.413] |
| | Difference | 0.19 | [0.180, 0.200] | 0.02 | [0.010, 0.030] |
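
A percentile bootstrap for a difference in scores between two models (here "AstroCLIP minus unaligned baseline") can be sketched as follows. The sketch assumes paired resampling of test samples and uses the coefficient of determination as an example metric; the metric choice, the number of resamples, and the toy data are assumptions and may not match the paper's procedure.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination, used here as an example score."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def bootstrap_diff_ci(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for score(pred_a) - score(pred_b), resampling samples in pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        yt = [y_true[i] for i in idx]
        diffs.append(r_squared(yt, [pred_a[i] for i in idx])
                     - r_squared(yt, [pred_b[i] for i in idx]))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data: model A has smaller noise than model B (illustrative only).
_rng = random.Random(1)
y = [_rng.gauss(0, 1) for _ in range(200)]
pred_a = [v + _rng.gauss(0, 0.3) for v in y]
pred_b = [v + _rng.gauss(0, 1.0) for v in y]
print(bootstrap_diff_ci(y, pred_a, pred_b))
```

Resampling the same indices for both models keeps the comparison paired, so an interval that excludes zero indicates a difference that is stable across resamples.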
Table 6.
Bootstrap 95% confidence intervals for and (AstroCLIP minus unaligned baseline) using spectral embeddings.
| Property | Model | Zero-Shot | Zero-Shot 95% CI | Few-Shot | Few-Shot 95% CI |
|---|---|---|---|---|---|
| | AstroCLIP | 0.87 | [0.865, 0.875] | 0.88 | [0.876, 0.884] |
| | Unaligned | 0.84 | [0.835, 0.845] | 0.88 | [0.876, 0.884] |
| | Difference | 0.03 | [0.027, 0.033] | 0.00 | [−0.003, 0.003] |
| | AstroCLIP | 0.57 | [0.556, 0.583] | 0.58 | [0.567, 0.593] |
| | Unaligned | 0.57 | [0.556, 0.583] | 0.64 | [0.628, 0.652] |
| | Difference | 0.00 | [−0.005, 0.005] | −0.06 | [−0.066, −0.054] |
| | AstroCLIP | 0.43 | [0.413, 0.447] | 0.43 | [0.412, 0.448] |
| | Unaligned | 0.38 | [0.367, 0.393] | 0.47 | [0.452, 0.488] |
| | Difference | 0.05 | [0.042, 0.058] | −0.04 | [−0.049, −0.032] |
| | AstroCLIP | 0.63 | [0.620, 0.640] | 0.64 | [0.631, 0.649] |
| | Unaligned | 0.62 | [0.610, 0.630] | 0.69 | [0.680, 0.700] |
| | Difference | 0.01 | [0.005, 0.015] | −0.05 | [−0.057, −0.044] |
Table 7.
Mean physical properties and cluster frequencies for AstroDino and AstroCLIP image embedding clusters.
| Cluster | | | | | Count |
|---|---|---|---|---|---|
| AstroDino 0 | 8.82 | −4.96 | 11.00 | 4.04 | 15,100 |
| AstroDino 1 | 8.63 | −5.15 | 10.82 | 4.47 | 19,279 |
| AstroDino 2 | 8.28 | −5.97 | 10.12 | 5.67 | 11,148 |
| AstroCLIP 0 | 8.47 | −5.77 | 10.31 | 4.97 | 9408 |
| AstroCLIP 1 | 8.77 | −5.20 | 10.81 | 4.31 | 21,307 |
| AstroCLIP 2 | 8.45 | −5.11 | 10.82 | 4.86 | 14,812 |
Table 8.
AMI scores between cluster labels and binned physical properties.
| Model | Property | AMI Score |
|---|---|---|
| AstroCLIP | | 0.012 |
| | | 0.014 |
| | | 0.038 |
| | | 0.003 |
| AstroDino | | 0.007 |
| | | 0.017 |
| | | 0.093 |
| | | 0.021 |
Table 9.
Mean physical properties and cluster frequencies for SpecFormer and AstroCLIP spectral embedding clusters.
| Cluster | | | | | Frequency |
|---|---|---|---|---|---|
| SpecFormer 0 | 8.94 | −4.95 | 10.96 | 2.88 | 14,400 |
| SpecFormer 1 | 8.45 | −5.20 | 10.81 | 5.33 | 21,978 |
| SpecFormer 2 | 8.44 | −6.05 | 10.07 | 5.66 | 9149 |
| AstroCLIP 0 | 8.97 | −5.42 | 10.67 | 4.77 | 10,538 |
| AstroCLIP 1 | 8.45 | −5.16 | 10.76 | 4.74 | 19,015 |
| AstroCLIP 2 | 8.55 | −5.36 | 10.68 | 4.38 | 15,974 |
Table 10.
Adjusted mutual information scores between cluster labels and binned physical properties for spectral embeddings.
| Model | Property | AMI Score |
|---|---|---|
| AstroCLIP | | 0.013 |
| | | 0.009 |
| | | 0.006 |
| | | 0.002 |
| SpecFormer | | 0.024 |
| | | 0.079 |
| | | 0.093 |
| | | 0.062 |
Table 11.
Cluster frequencies for AstroCLIP image embeddings using HDBSCAN.
| Cluster | Count |
|---|---|
| −1 | 16,705 |
| 0 | 28,816 |
| 1 | 6 |
Table 12.
Cluster frequencies for AstroDino image embeddings using HDBSCAN.
| Cluster | Count |
|---|---|
| −1 | 8404 |
| 0 | 37,050 |
| 1 | 73 |
Table 13.
Distribution of samples by cluster label assigned by K-means.
| Cluster Label | Frequency | Proportion (%) |
|---|---|---|
| 5 | 16,776 | 11.8 |
| 1 | 16,696 | 11.8 |
| 3 | 15,144 | 10.7 |
| 2 | 14,472 | 10.2 |
| 7 | 14,080 | 9.9 |
| 8 | 13,568 | 9.6 |
| 9 | 13,344 | 9.4 |
| 6 | 13,272 | 9.3 |
| 0 | 12,896 | 9.1 |
| 4 | 11,872 | 8.4 |
Table 14.
scores for Z_MW by cluster label using KNN and MLP models: (a) AstroCLIP image embeddings and (b) AstroCLIP spectra embeddings.
(a)

| Cluster Label | | |
|---|---|---|
| 0 | 0.438 | 0.445 |
| 1 | 0.377 | 0.377 |
| 2 | 0.405 | 0.407 |
| 3 | 0.471 | 0.491 |
| 4 | 0.415 | 0.419 |
| 5 | 0.427 | 0.431 |
| 6 | 0.401 | 0.387 |
| 7 | 0.417 | 0.415 |
| 8 | 0.405 | 0.416 |
| 9 | 0.434 | 0.453 |

(b)

| Cluster Label | | |
|---|---|---|
| 0 | 0.584 | 0.641 |
| 1 | 0.523 | 0.593 |
| 2 | 0.555 | 0.608 |
| 3 | 0.601 | 0.640 |
| 4 | 0.573 | 0.622 |
| 5 | 0.561 | 0.615 |
| 6 | 0.571 | 0.613 |
| 7 | 0.581 | 0.630 |
| 8 | 0.536 | 0.586 |
| 9 | 0.621 | 0.669 |
Table 15.
scores for galaxy properties across neighbour bins for KNN and MLP models using image and spectra embeddings.
| Embedding Type | Property | Neighbour Bin | KNN | MLP |
|---|---|---|---|---|
| Image | | Few | 0.687 | 0.711 |
| | | Moderate | 0.698 | 0.722 |
| | | Many | 0.663 | 0.704 |
| | | Few | 0.220 | 0.245 |
| | | Moderate | 0.217 | 0.240 |
| | | Many | 0.195 | 0.210 |
| | | Few | 0.414 | 0.431 |
| | | Moderate | 0.436 | 0.445 |
| | | Many | 0.391 | 0.406 |
| | | Few | 0.313 | 0.367 |
| | | Moderate | 0.346 | 0.403 |
| | | Many | 0.339 | 0.386 |
| Spectra | | Few | 0.845 | 0.873 |
| | | Moderate | 0.855 | 0.879 |
| | | Many | 0.841 | 0.857 |
| | | Few | 0.418 | 0.511 |
| | | Moderate | 0.393 | 0.464 |
| | | Many | 0.363 | 0.441 |
| | | Few | 0.576 | 0.630 |
| | | Moderate | 0.579 | 0.624 |
| | | Many | 0.540 | 0.605 |
| | | Few | 0.580 | 0.647 |
| | | Moderate | 0.618 | 0.674 |
| | | Many | 0.625 | 0.683 |
Table 16.
scores by brightness category and property for KNN and MLP models using image and spectral embeddings.
| Embedding | Property | Brightness Category | KNN | MLP |
|---|---|---|---|---|
| Image | | High-z (Faint) | 0.276 | 0.276 |
| | | High-z (Faint) | 0.606 | 0.630 |
| | | High-z (Faint) | 0.120 | 0.150 |
| | | High-z (Faint) | 0.283 | 0.350 |
| | | Low-z (Bright) | 0.510 | 0.527 |
| | | Low-z (Bright) | 0.733 | 0.775 |
| | | Low-z (Bright) | 0.187 | 0.219 |
| | | Low-z (Bright) | 0.349 | 0.405 |
| Spectra | | High-z (Faint) | 0.603 | 0.649 |
| | | High-z (Faint) | 0.867 | 0.888 |
| | | High-z (Faint) | 0.305 | 0.366 |
| | | High-z (Faint) | 0.591 | 0.664 |
| | | Low-z (Bright) | 0.515 | 0.578 |
| | | Low-z (Bright) | 0.817 | 0.863 |
| | | Low-z (Bright) | 0.415 | 0.531 |
| | | Low-z (Bright) | 0.635 | 0.683 |
Table 17.
score differences between the Bright and Faint categories for KNN and MLP models using image and spectral embeddings.
| Embedding | Property | | |
|---|---|---|---|
| Image | | 0.234 | 0.251 |
| | | 0.127 | 0.145 |
| | | 0.067 | 0.069 |
| | | 0.066 | 0.055 |
| Spectra | | 0.090 | 0.070 |
| | | 0.050 | 0.030 |
| | | −0.110 | −0.170 |
| | | −0.040 | −0.020 |
Table 18.
Distribution of outlier occurrence counts across properties for AstroDino and AstroCLIP.
| Model | Outlier Count | Total | Proportion (%) |
|---|---|---|---|
| AstroDino | 1 | 2308 | 82.55 |
| | 2 | 415 | 14.84 |
| | 3 | 65 | 2.32 |
| | 4 | 8 | 0.29 |
| AstroCLIP | 1 | 2464 | 79.13 |
| | 2 | 528 | 16.96 |
| | 3 | 112 | 3.60 |
| | 4 | 10 | 0.32 |
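
The occurrence counts in Table 18 can be illustrated with a short sketch. It assumes each galaxy carries one boolean outlier flag per property and that proportions are taken over galaxies flagged at least once, which is consistent with each model's percentages summing to 100. The function name and toy flags are hypothetical.

```python
from collections import Counter

def occurrence_distribution(outlier_flags):
    """Count how many properties flag each galaxy, ignoring galaxies never flagged.

    Returns {n_properties_flagged: (count, percentage of flagged galaxies)}.
    """
    occurrences = Counter(sum(flags) for flags in outlier_flags if any(flags))
    total = sum(occurrences.values())
    return {k: (n, round(100 * n / total, 2)) for k, n in sorted(occurrences.items())}

# Toy flags for five galaxies and four properties (illustrative only).
toy_flags = [
    (True, False, False, False),
    (False, False, False, False),
    (True, True, False, False),
    (False, False, True, False),
    (True, True, True, True),
]
toy_distribution = occurrence_distribution(toy_flags)
print(toy_distribution)

# The same percentage step applied to the AstroDino totals listed in Table 18.
astrodino_totals = {1: 2308, 2: 415, 3: 65, 4: 8}
n_outliers = sum(astrodino_totals.values())
astrodino_pct = {k: round(100 * n / n_outliers, 2) for k, n in astrodino_totals.items()}
print(astrodino_pct)
```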