This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Open Access Article
Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
1 Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
2 Athena Research Center, 67100 Xanthi, Greece
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(8), 1272; https://doi.org/10.3390/math14081272
Submission received: 25 February 2026 / Revised: 6 April 2026 / Accepted: 9 April 2026 / Published: 11 April 2026
Abstract
Monocular depth estimation is a central problem in computer vision with applications in robotics, augmented reality, and autonomous driving, yet the self-attention mechanisms used by modern Transformer architectures remain opaque. In this work, we integrate SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), introducing a spectrally structured attention formulation for dense prediction that decouples directional alignment from spectral modulation through a learnable diagonal matrix embedded in normalized query–key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves competitive predictive performance while enabling intrinsic interpretability: on KITTI, AbsRel improves from 0.058 to 0.056 and the δ1 accuracy from 0.976 to 0.979, while on NYU-v2, AbsRel improves from 0.133 to 0.124 and δ1 from 0.865 to 0.872. This is achieved with only 0.01% additional parameters, at the cost of a measurable runtime overhead from the added normalization and spectral modulation. More importantly, SVDA yields six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness, revealing consistent cross-dataset and depth-wise patterns in how attention organizes during training. These properties make the model easier to inspect and better suited to applications where transparency and reliability are important, such as robotics and autonomous navigation.
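The abstract's core idea can be illustrated with a minimal sketch: L2-normalizing the query and key rows isolates pure directional alignment, and a learnable diagonal matrix then rescales each channel, playing a role analogous to singular values. This is a hedged reconstruction from the abstract's description, not the paper's actual implementation; the function names, shapes, and the entropy indicator below are illustrative assumptions.

```python
import numpy as np

def svda_attention(Q, K, V, sigma):
    """Sketch of SVD-inspired attention (assumed form, single head).

    Q, K, V: (n, d) arrays; sigma: (d,) learnable diagonal entries.
    Normalized Q.K^T captures directional alignment; `sigma` supplies
    the spectral modulation described in the abstract.
    """
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)  # unit-norm queries
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)  # unit-norm keys
    scores = (Qn * sigma) @ Kn.T                        # diagonal modulation
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

def attention_entropy(A, eps=1e-12):
    """One plausible spectral indicator: per-query entropy of attention rows,
    in the spirit of the paper's entropy indicator (exact definition assumed)."""
    return -(A * np.log(A + eps)).sum(axis=-1)
```

With `sigma` initialized to ones, the formulation reduces to cosine-similarity attention; training the diagonal lets the model sharpen or suppress individual channels, which is what makes the six indicators directly computable from the intermediate quantities.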
Share and Cite
MDPI and ACS Style
Arampatzakis, V.; Pavlidis, G.; Mitianoudis, N.; Papamarkos, N.
Interpretable Vision Transformers in Monocular Depth Estimation via SVDA. Mathematics 2026, 14, 1272.
https://doi.org/10.3390/math14081272
Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.