Article

Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

by Vasileios Arampatzakis 1,2,*, George Pavlidis 2, Nikolaos Mitianoudis 1 and Nikos Papamarkos 1

1 Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
2 Athena Research Center, 67100 Xanthi, Greece
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(8), 1272; https://doi.org/10.3390/math14081272
Submission received: 25 February 2026 / Revised: 6 April 2026 / Accepted: 9 April 2026 / Published: 11 April 2026

Abstract

Monocular depth estimation is a central problem in computer vision with applications in robotics, augmented reality, and autonomous driving, yet the self-attention mechanisms used by modern Transformer architectures remain opaque. In this work, we integrate SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), introducing a spectrally structured attention formulation for dense prediction that decouples directional alignment from spectral modulation through a learnable diagonal matrix embedded in normalized query–key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves competitive predictive performance while enabling intrinsic interpretability: on KITTI, AbsRel improves from 0.058 to 0.056 and δ1 from 0.976 to 0.979, while on NYU-v2, AbsRel improves from 0.133 to 0.124 and δ1 from 0.865 to 0.872. This is achieved with only 0.01% additional parameters, at the cost of a measurable runtime overhead associated with the added normalization and spectral modulation. More importantly, SVDA enables six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness, revealing consistent cross-dataset and depth-wise patterns in how attention organizes during training. These properties make the model easier to inspect and better suited to applications where transparency and reliability are important, such as robotics and autonomous navigation.
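The core idea stated in the abstract, separating directional alignment (via normalized query–key interactions) from spectral modulation (via a learnable diagonal matrix), can be sketched in a minimal single-head form. This is an illustrative assumption of the mechanism, not the paper's exact DPT integration: the function name `svda_attention`, the NumPy single-head setting, and the per-channel vector `sigma` standing in for the learnable diagonal are all placeholders introduced here.

```python
import numpy as np

def svda_attention(Q, K, V, sigma):
    """Sketch of an SVD-inspired attention step.

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d) values,
    sigma: (d,) learnable per-channel scaling (the diagonal matrix).
    """
    # Normalize queries and keys to unit length, so their dot product
    # reduces to cosine similarity: pure directional alignment.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # The diagonal matrix diag(sigma) rescales each channel of the
    # interaction, modulating its spectrum independently of alignment.
    scores = (Qn * sigma) @ Kn.T
    # Standard numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage: 4 query tokens, 6 key/value tokens, 8 channels.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
sigma = np.ones(8)  # identity diagonal recovers plain cosine attention
out, w = svda_attention(Q, K, V, sigma)
```

Because alignment and scaling are factored apart, spectral indicators such as the row-wise entropy of `w` can be read off directly, which is one way the intrinsic interpretability described above could be probed.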
Keywords: monocular; depth; transformers; self-attention; SVD; interpretability

Share and Cite

MDPI and ACS Style

Arampatzakis, V.; Pavlidis, G.; Mitianoudis, N.; Papamarkos, N. Interpretable Vision Transformers in Monocular Depth Estimation via SVDA. Mathematics 2026, 14, 1272. https://doi.org/10.3390/math14081272


