Article

Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

by Vasileios Arampatzakis 1,2,*, George Pavlidis 2, Nikolaos Mitianoudis 1 and Nikos Papamarkos 1

1 Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
2 Athena Research Center, 67100 Xanthi, Greece
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(8), 1272; https://doi.org/10.3390/math14081272
Submission received: 25 February 2026 / Revised: 6 April 2026 / Accepted: 9 April 2026 / Published: 11 April 2026

Abstract

Monocular depth estimation is a central problem in computer vision with applications in robotics, augmented reality, and autonomous driving, yet the self-attention mechanisms used by modern Transformer architectures remain opaque. In this work, we integrate SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), introducing a spectrally structured attention formulation for dense prediction that decouples directional alignment from spectral modulation through a learnable diagonal matrix embedded in normalized query–key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves competitive predictive performance while enabling intrinsic interpretability: on KITTI, AbsRel improves from 0.058 to 0.056 and δ1 from 0.976 to 0.979, while on NYU-v2, AbsRel improves from 0.133 to 0.124 and δ1 from 0.865 to 0.872. This is achieved with only 0.01% additional parameters, at the cost of a measurable runtime overhead associated with the added normalization and spectral modulation. More importantly, SVDA enables six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness, revealing consistent cross-dataset and depth-wise patterns in how attention organizes during training. These properties make the model easier to inspect and better suited to applications where transparency and reliability are important, such as robotics and autonomous navigation.
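The core idea stated in the abstract, separating directional alignment (via normalized query–key interactions) from spectral modulation (via a learnable diagonal matrix), can be sketched in a minimal single-head form. This is an illustrative assumption of the mechanism, not the paper's exact DPT integration: the function name `svda_attention`, the NumPy single-head setting, and the per-channel vector `sigma` standing in for the learnable diagonal are all placeholders introduced here.

```python
import numpy as np

def svda_attention(Q, K, V, sigma):
    """Sketch of an SVD-inspired attention step.

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d) values,
    sigma: (d,) learnable per-channel scaling (the diagonal matrix).
    """
    # Normalize queries and keys to unit length, so their dot product
    # reduces to cosine similarity: pure directional alignment.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # The diagonal matrix diag(sigma) rescales each channel of the
    # interaction, modulating its spectrum independently of alignment.
    scores = (Qn * sigma) @ Kn.T
    # Standard numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage: 4 query tokens, 6 key/value tokens, 8 channels.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
sigma = np.ones(8)  # identity diagonal recovers plain cosine attention
out, w = svda_attention(Q, K, V, sigma)
```

Because alignment and scaling are factored apart, spectral indicators such as the row-wise entropy of `w` can be read off directly, which is one way the intrinsic interpretability described above could be probed.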
Keywords: monocular; depth; transformers; self-attention; SVD; interpretability

Share and Cite

MDPI and ACS Style

Arampatzakis, V.; Pavlidis, G.; Mitianoudis, N.; Papamarkos, N. Interpretable Vision Transformers in Monocular Depth Estimation via SVDA. Mathematics 2026, 14, 1272. https://doi.org/10.3390/math14081272


