Trustworthy Celestial Eye: Calibrated and Robust Planetary Classification via Self-Supervised Vision Transformers
Abstract
1. Introduction
1.1. Background and Motivation
1.2. The Promise of Vision Foundation Models
1.3. Research Objectives and Contributions
1.4. Key Findings
2. Materials and Methods
2.1. Vision Transformers and Foundation Models
2.2. Deep Learning in Astronomical Image Classification
2.3. Transfer Learning and Domain Adaptation
2.4. Uncertainty Quantification and Model Calibration
2.5. Research Gap and Motivation
2.6. Method
2.6.1. Problem Formulation
2.6.2. Overall Framework
2.7. Data Preprocessing
2.7.1. Dataset Organization and Split Convention
2.7.2. Input Resolution and Automatic Size Resolution
2.7.3. Train/Val/Test Transforms
2.7.4. Class-Imbalance Handling via Weighted Sampling
- Let $n_c$ be the number of training samples in class $c$, with $N_{\text{train}} = \sum_{c=1}^{C} n_c$.
- The code constructs inverse-frequency class weights $w_c \propto 1/n_c$.
- Each training sample with label $y_i$ receives weight $w_{y_i}$.
- A WeightedRandomSampler draws samples with replacement, with num_samples = $N_{\text{train}}$. When the sampler is enabled, the dataloader uses shuffle = False and relies on the sampler; otherwise, it uses standard shuffling.
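The weighting scheme above can be sketched as follows; the unnormalized form $w_c = 1/n_c$ and the function name are illustrative assumptions, not taken verbatim from the released code:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights w_{y_i} = 1 / n_{y_i} for inverse-frequency sampling.

    The returned list would be passed to torch.utils.data.WeightedRandomSampler
    (num_samples=len(labels), replacement=True), with the dataloader's shuffle
    flag set to False, as described above.
    """
    counts = Counter(labels)                    # n_c for each class c
    return [1.0 / counts[y] for y in labels]

# Three samples of class 0 each get weight 1/3; the lone class-1 sample gets 1.0,
# so both classes contribute equal total mass to the sampler.
weights = inverse_frequency_weights([0, 0, 0, 1])
```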
2.8. Model Architecture
2.8.1. Backbone Wrapper
2.8.2. Linear Probing vs. Selective Fine-Tuning (Freeze/Unfreeze Policy)
- (a) Linear probing (linear_probe = True). All backbone parameters are frozen; only the classification head remains trainable.
- (b) Fine-tuning (linear_probe = False). The wrapper first freezes all parameters, then selectively unfreezes: if unfreeze_last_n_blocks >= 999, all parameters are unfrozen (full fine-tuning); otherwise, only the last unfreeze_last_n_blocks transformer blocks (together with the head) are trained.
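The freeze/unfreeze policy can be sketched as below. This is a minimal stand-in, not the wrapper itself: each "parameter" is a plain dict with a requires_grad flag, and the function name is illustrative.

```python
def apply_freeze_policy(blocks, head, linear_probe, unfreeze_last_n_blocks=0):
    """Sketch of the Sec. 2.8.2 policy over a list of backbone blocks.

    blocks: ordered list of transformer blocks (each a list of parameters);
    head:   parameters of the classification head.
    """
    # Freeze everything first, in both modes.
    for p in [p for b in blocks for p in b] + head:
        p["requires_grad"] = False
    if not linear_probe:
        # >= 999 is the sentinel for full fine-tuning.
        if unfreeze_last_n_blocks >= 999:
            n = len(blocks)
        else:
            n = min(unfreeze_last_n_blocks, len(blocks))
        for block in blocks[len(blocks) - n:]:   # unfreeze only the last n blocks
            for p in block:
                p["requires_grad"] = True
    # The classification head is always trainable.
    for p in head:
        p["requires_grad"] = True

blocks = [[{"requires_grad": True}] for _ in range(4)]
head = [{"requires_grad": True}]
apply_freeze_policy(blocks, head, linear_probe=False, unfreeze_last_n_blocks=2)
```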
2.8.3. Backbone Candidates Used in Ablations
2.8.4. Legacy CNN Baselines
2.9. Loss Function and Optimization
2.9.1. Supervised Losses
- (1) Standard cross-entropy (loss_name = “ce”). Given a one-hot label $y_i$ and predicted probabilities $p_i$, the per-sample loss is $\ell_i = -\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}$, and the empirical objective over a mini-batch of size $B$ is $\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\ell_i$. Symbols: $z_i$ denotes the logits from the model, $p_{i,c} = \mathrm{softmax}(z_i)_c$ is the softmax probability of class $c$, and $y_i$ is the integer label.
- (2) Label-smoothing cross-entropy (loss_name = “label_smoothing”). The implementation fixes the smoothing factor $\varepsilon$ by default. The smoothed target distribution is $\tilde{y}_{i,c} = (1-\varepsilon)\,y_{i,c} + \varepsilon/C$. With $\tilde{y}_i$, the loss is $\ell_i = -\sum_{c=1}^{C}\tilde{y}_{i,c}\,\log p_{i,c}$. Symbols: $\varepsilon$ is the smoothing factor; $C$ is the number of classes.
- (3) Focal loss (loss_name = “focal”). The focal variant is implemented as $\ell_i = -(1 - p_{i,y_i})^{\gamma}\,\log p_{i,y_i}$, where $\gamma$ takes its fixed default, and the batch loss is the mean over samples. In the provided build_loss(), no class-weight vector is supplied (i.e., weight = None).
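The focal term can be sketched in NumPy as follows. The default $\gamma = 2$ used here is a common choice but an assumption, since the exact default is not stated above; with $\gamma = 0$ the expression reduces to standard cross-entropy.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss of Sec. 2.9.1 with weight=None; gamma=2 is an assumed default.

    logits:  (B, C) float array of raw model outputs.
    targets: (B,) integer class labels.
    """
    z = logits - logits.max(axis=1, keepdims=True)            # stabilize softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-probabilities
    log_pt = log_p[np.arange(len(targets)), targets]          # log p of true class
    pt = np.exp(log_pt)
    # Down-weight easy examples (pt near 1); mean over the mini-batch.
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))
```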
2.9.2. Two-Stage Optimization Strategy (AdamW)
2.9.3. Training Loop, AMP, and Early Stopping
- Compute logits $z_i = f_\theta(x_i)$.
- Compute loss (one of the losses above).
- Backpropagate and apply the AdamW update.
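The early-stopping logic referenced in this subsection can be sketched as follows; the patience value and the monitored metric (mode="max", e.g., validation accuracy) are illustrative assumptions rather than the repository's exact settings:

```python
class EarlyStopping:
    """Stop when the validation metric fails to improve for `patience` epochs.

    Minimal sketch of the Sec. 2.9.3 loop's stopping rule; patience=5 is an
    assumed default.
    """
    def __init__(self, patience=5, mode="max"):
        self.patience, self.mode = patience, mode
        self.best, self.bad_epochs = None, 0

    def step(self, value):
        improved = self.best is None or (
            value > self.best if self.mode == "max" else value < self.best
        )
        if improved:
            self.best, self.bad_epochs = value, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True means: stop training
```

Called once per epoch after validation, the returned flag breaks the training loop; the best checkpoint is the one saved when `improved` was last true.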
2.10. Evaluation Metrics
2.10.1. Top-k Accuracy (k = 1 and 5)
2.10.2. Macro-F1 Score
2.10.3. Balanced Accuracy
2.10.4. Confidence Intervals (Wilson and Bootstrap)
2.10.5. Statistical Significance via McNemar’s Test (Paired)
2.10.6. Calibration: Temperature Scaling and ECE
2.10.7. Test-Time Augmentation (TTA)
2.10.8. Energy-Based OOD Scoring and OOD Metrics
2.10.9. Interpretability via Grad-CAM
2.11. Relation to Bayesian/Ensemble Uncertainty Methods
3. Experiments
3.1. Dataset Curation
3.1.1. The Celestial Body Dataset
3.1.2. OOD Dataset
3.1.3. Preprocessing Pipeline
- Stratified Splitting: The dataset was partitioned into Training, Validation, and Test sets using a ratio of 69:16:15. Training Set: 483 images (69 per class). Validation Set: 112 images (16 per class). Test Set: 105 images (15 per class).
- Repeated-Split Stability Protocol: Because the test split is small (15 images per class), single-split estimates can be sensitive to partitioning. We therefore perform R = 5 repeated stratified train/validation/test splits with different random seeds (random_state ∈ {42, 43, 44, 45, 46}) and report mean ± standard deviation for Top-1 accuracy, ECE, and OOD AUROC. To keep computation tractable while isolating split-induced variability, we adopt a frozen-backbone linear-probing setting: we extract fixed DINOv2 ViT features (vit_base_patch14_dinov2.lvd142m) and train a multinomial logistic-regression classifier on each training split; temperature scaling is fitted on the corresponding validation split.
- Leakage Prevention: Given the small dataset size, we executed Perceptual Hashing (pHash) verification across all splits. The verification script confirmed 0 exact or near-duplicates between the training and testing sets, ensuring that the reported performance reflects genuine generalization rather than memorization.
- Normalization: All images were resized (to 518 × 518 for DINOv2, 224 × 224 or 300 × 300 for CNNs) and normalized using standard ImageNet mean and standard deviation statistics.
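The temperature-scaling step fitted on the validation split can be sketched in NumPy. The grid search below is a simple stand-in for the usual gradient-based (e.g., LBFGS) fit, and the grid range is an illustrative assumption:

```python
import numpy as np

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature T minimizing validation NLL (Sec. 2.10.6 sketch).

    Calibrated probabilities are softmax(logits / T). T is fitted only on the
    validation split, never on the test split.
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)        # assumed search range
    def nll(T):
        z = val_logits / T
        z = z - z.max(axis=1, keepdims=True)
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_p[np.arange(len(val_labels)), val_labels].mean()
    return float(min(grid, key=nll))
```

For an overconfident model the fitted T exceeds 1 (probabilities are softened); for a confident and correct model it stays at or below 1.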
3.2. Implementation Details
3.3. Stress-Test Robustness Evaluation
3.4. Baselines and Zero-Shot Evaluation
4. Results
4.1. Main Results
4.1.1. Classification Performance
4.1.2. Reliability Analysis (Calibration)
4.1.3. Safety Analysis (OOD Detection)
4.1.4. Stability Across Repeated Stratified Splits
4.2. Stress-Test Robustness Beyond Saturated Accuracy
4.3. Ablation Study
4.3.1. Loss-Function Sensitivity (Calibration)
4.3.2. Learning-Rate Scheduling Discussion
4.3.3. Unfreezing-Depth Sensitivity
4.4. Qualitative Analysis (Visualization)
5. Conclusions and Discussion
5.1. Summary of Research Findings
- Performance Saturation: DINOv2 achieved a perfect 100% Top-1 accuracy on the test set, demonstrating faster convergence (reaching saturation within 2 epochs) compared to supervised baselines. While modern CNNs like EfficientNet-B3 also achieved 100% accuracy, legacy architectures like ResNet50 struggled (96.2%), highlighting the necessity of modern feature extractors for fine-grained celestial textures.
- Unmatched Reliability: The most significant contribution of this work is in model calibration. DINOv2 achieved a state-of-the-art Expected Calibration Error (ECE) of 0.08% after temperature scaling. This represents a 36-fold improvement over ResNet50 (2.90%) and a 4.5-fold improvement over EfficientNet-B3 (0.36%), indicating that self-supervised ViT features align more closely with the true empirical probability distribution.
- Operational Safety: In OOD detection scenarios involving internet-sourced space artifacts, our energy-based detector achieved an AUROC of 93.7%. Qualitative analysis via Grad-CAM confirmed that DINOv2 focuses on intrinsic planetary features (e.g., surface bands, rings) rather than background noise, thereby mitigating the “shortcut learning” observed in ResNet50.
5.2. Theoretical and Practical Implications
5.3. Limitations
- Dataset scale and heterogeneity: Although the curated dataset is balanced and high-fidelity, it remains small (700 images, 7 classes). As a result, clean-test accuracy becomes saturated for several modern architectures, and robustness conclusions must be interpreted within this benchmark’s scope. We therefore complement clean accuracy with controlled stress tests, but we do not claim guaranteed robustness under mission-specific sensors or large-scale survey distributions. Future work will validate the proposed framework on larger and more heterogeneous datasets (e.g., survey data and instrument-specific imagery) and will consider grouped splits (by mission/instrument) to reduce the risk of source-specific visual signatures leaking across splits.
- The OOD set used in this study is sourced from web images and is designed to represent open-world, non-planetary visual confounders (e.g., nebulae, debris, abstract space imagery). While it is diverse, it does not fully cover sensor-specific artifacts encountered in real missions (e.g., cosmic-ray hits, saturation spikes, and instrument-dependent noise patterns). Therefore, the reported AUROC should be interpreted as performance on a web-image OOD benchmark rather than a definitive estimate under mission-specific sensor anomalies.
- Computational Cost (Latency/Memory Footprint): The proposed framework leverages DINOv2 (ViT-Base/14) as the strongest backbone, but this choice has clear deployment implications. In terms of model size, ViT-Base/14 contains 86M parameters, corresponding to approximately 344 MB of weights in FP32 (or ~172 MB in FP16), excluding a lightweight classification head. Importantly, inference-time latency and activation memory are strongly affected by input resolution for ViT backbones because self-attention scales quadratically with the token length. With a patch size of 14, the token length for a square input of side $H$ is $(H/14)^2$. Under our pipeline, DINOv2 operates at 518 × 518 (37 × 37 = 1369 tokens), while the CNN baselines use 224 × 224 (16 × 16 = 256 tokens). This results in ~5.35× more tokens and ~28.6× larger quadratic attention cost (relative to 224), which helps explain the higher latency and peak memory footprint of DINOv2 compared to efficient CNN baselines.
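The token-count arithmetic above can be checked directly; this back-of-the-envelope sketch ignores the CLS/register tokens, which shift the totals only slightly:

```python
def patch_tokens(image_size, patch=14):
    """Number of patch tokens for a square input of side `image_size`."""
    side = image_size // patch
    return side * side

dino_tokens = patch_tokens(518)       # 37 * 37 = 1369 tokens at 518 x 518
small_tokens = patch_tokens(224)      # 16 * 16 = 256 tokens if a ViT ran at 224
token_ratio = dino_tokens / small_tokens     # ~5.35x more tokens
attn_cost_ratio = token_ratio ** 2           # ~28.6x quadratic attention cost
```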
5.4. Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References







| Model Architecture | Top-1 Accuracy | ECE (Post-Cal) ⬇ | OOD AUROC ⬆ |
|---|---|---|---|
| ResNet50 | 96.19% | 2.90% | 86.77% |
| Standard ViT | 97.14% | - | - |
| Swin Transformer | 100.0% | - | - |
| ConvNeXt V2 | 100.0% | - | - |
| EfficientNet-B3 | 100.0% | 0.36% | 93.96% |
| DINOv2 (Ours) | 100.0% | 0.08% | 93.68% |
| Metric | Mean ± Std |
|---|---|
| Acc@1 (%) | 99.43 ± 0.85 |
| ECE (%) | 0.74 ± 0.75 |
| NLL | 0.1596 ± 0.2339 |
| Brier | 0.0117 ± 0.0168 |
| OOD AUROC (%) | 61.50 ± 3.49 |
| TPR@FPR = 5% (%) | 33.40 ± 1.08 |
| TPR@FPR = 1% (%) | 22.80 ± 3.35 |
| Perturbation | Severity | Top-1 Acc (%) | Δ vs. Clean (pp) |
|---|---|---|---|
| Clean | - | 100 | 0.00 |
| Gaussian noise | σ = 5 | 87.62 | −12.38 |
| | σ = 10 | 77.14 | −22.86 |
| | σ = 20 | 68.57 | −31.43 |
| Gaussian blur | level = 1 | 99.05 | −0.95 |
| | level = 2 | 97.14 | −2.86 |
| | level = 3 | 97.14 | −2.86 |
| JPEG compression | Q = 90 | 100 | 0.00 |
| | Q = 70 | 97.14 | −2.86 |
| | Q = 50 | 94.29 | −5.71 |
| Downsample + resize | r = 0.75 | 100 | 0.00 |
| | r = 0.50 | 100 | 0.00 |
| | r = 0.25 | 99.05 | −0.95 |
| RandomCrop + resize | r = 0.70 | 99.05 | −0.95 |
| | r = 0.50 | 98.10 | −1.90 |
| Augmentation Strategy | Test Accuracy Range | Mean Accuracy |
|---|---|---|
| Mean Accuracy | 95.2–99.0% | 97.4% |
| Rand (Rotation + Jitter) | 100.0–100.0% | 100.0% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Xu, Z.; Choi, Y.; Yi, C.; Park, C.; Park, J.; Park, H.; Song, S. Trustworthy Celestial Eye: Calibrated and Robust Planetary Classification via Self-Supervised Vision Transformers. Aerospace 2026, 13, 222. https://doi.org/10.3390/aerospace13030222

