Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features
Abstract
1. Introduction
- We introduce a multimodal architecture that leverages both tactile features and PCA-reduced ResNet-50 visual embeddings, is suitable for robotic deployment, and extends naturally to other sensor combinations (a minimal feature-extraction sketch follows this list).
- We employ a mid-level fusion architecture with multi-head cross-attention layers to learn complementary representations, joint embeddings, and bidirectional interactions between tactile and visual inputs.
- We implement a lightweight, encoder-only Transformer for tactile-only classification, achieving the highest accuracy among the tactile-only baselines and the lowest inference latency overall, making it suitable for real-time deployment on embedded robotic systems where computational resources are limited.
- We evaluate our models on the Touch and Go dataset [21] and demonstrate that multimodal fusion improves accuracy over tactile-only baselines.
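
As a rough illustration of the first contribution, the sketch below shows one way pooled ResNet-50 embeddings could be extracted and reduced with PCA before fusion. The pretrained weights, pooling choice, number of retained components, and the `train_images` placeholder are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: pooled ResNet-50 features for the visual stream,
# reduced with PCA. Dimensions and weights are assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()          # keep the 2048-d pooled embedding
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(images):
    """images: list of PIL images; returns an (N, 2048) NumPy array."""
    batch = torch.stack([preprocess(im) for im in images])
    return resnet(batch).numpy()

# Fit PCA on training-set embeddings only, then reuse it for test data.
# `train_images` is a placeholder; 64 components is an illustrative choice.
pca = PCA(n_components=64)
train_vision_feats = pca.fit_transform(embed(train_images))
```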
2. Methods
2.1. Feature Processing
2.1.1. Feature Extraction
2.1.2. Feature Selection
2.2. Multi-Modal Classification
2.2.1. Cross-Modal Fusion
- Self-Attention and Cross-Attention:
- Feed-Forward Processing and Normalization:
- Fusion Process:
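
The items above name the components of the cross-modal fusion block. As a non-authoritative sketch of how self-attention, bidirectional cross-attention, feed-forward layers, and layer normalization could be combined in such a block, consider the following PyTorch module; the model dimension, head count, and residual/normalization placement are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Illustrative bidirectional fusion block (all sizes are assumptions)."""
    def __init__(self, d_model=128, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.self_attn_t = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        # Cross-attention in both directions: tactile queries attend to vision and vice versa.
        self.cross_t2v = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.cross_v2t = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.ff_t = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ff_v = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, t, v):  # t, v: (batch, seq_len, d_model)
        t = self.norms[0](t + self.self_attn_t(t, t, t)[0])
        v = self.norms[1](v + self.self_attn_v(v, v, v)[0])
        t = self.norms[2](t + self.cross_t2v(t, v, v)[0])   # tactile attends to vision
        v = self.norms[3](v + self.cross_v2t(v, t, t)[0])   # vision attends to tactile
        t = self.norms[4](t + self.ff_t(t))
        v = self.norms[5](v + self.ff_v(v))
        return t, v                                          # fused token streams
```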
2.2.2. Classification Head
3. Experimental Results and Discussion
3.1. Data Preprocessing
3.2. Training Strategy
3.3. Results and Discussion
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Dave, V.; Lygerakis, F.; Rückert, E. Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8013–8020.
- Li, Q.; Kroemer, O.; Su, Z.; Veiga, F.F.; Kaboli, M.; Ritter, H.J. A review of tactile information: Perception and action through touch. IEEE Trans. Robot. 2020, 36, 1619–1634.
- Hu, Y.; Li, M.; Yang, S.; Li, X.; Liu, S.; Li, M. Learning robust grasping strategy through tactile sensing and adaption skill. arXiv 2024, arXiv:2411.08499.
- Chen, L.; Zhu, Y.; Li, M. Tactile-GAT: Tactile graph attention networks for robot tactile perception classification. Sci. Rep. 2024, 14, 27543.
- Yang, J.; Ji, X.; Li, S.; Dong, H.; Liu, T.; Zhou, X.; Yu, S. Robot tactile data classification method using spiking neural network. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; IEEE: New York, NY, USA, 2021; pp. 5274–5279.
- Yuan, W.; Dong, S.; Adelson, E.H. GelSight: High-resolution robot tactile sensors for estimating geometry and force. Sensors 2017, 17, 2762.
- Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379.
- Srivastava, N.; Salakhutdinov, R.R. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25 (NIPS 2012), 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2012; Volume 25, pp. 2949–2980.
- Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; Sun, C. Attention bottlenecks for multimodal fusion. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 6–14 December 2021; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2021; Volume 34, pp. 14200–14213.
- Lee, W.Y.; Jovanov, L.; Philips, W. Cross-modality attention and multimodal fusion transformer for pedestrian detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 608–623.
- Xu, D.; Ouyang, W.; Ricci, E.; Wang, X.; Sebe, N. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5363–5371.
- Calandra, R.; Owens, A.; Jayaraman, D.; Lin, J.; Yuan, W.; Malik, J.; Adelson, E.H.; Levine, S. More than a feeling: Learning to grasp and regrasp using vision and touch. IEEE Robot. Autom. Lett. 2018, 3, 3300–3307.
- Miller, P.; Leibowitz, P. Integration of vision, force and tactile sensing for grasping. Int. J. Intell. Mach. 1999, 4, 129–149.
- Struckmeier, O.; Tiwari, K.; Salman, M.; Pearson, M.J.; Kyrki, V. ViTa-SLAM: A bio-inspired visuo-tactile SLAM for navigation while interacting with aliased environments. In Proceedings of the 2019 IEEE International Conference on Cyborg and Bionic Systems (CBS), Munich, Germany, 18–20 September 2019; IEEE: New York, NY, USA, 2019; pp. 97–103.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2019; Volume 32, pp. 1–11.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
- Yang, F.; Ma, C.; Zhang, J.; Zhu, J.; Yuan, W.; Owens, A. Touch and Go: Learning from Human-Collected Vision and Touch. arXiv 2022, arXiv:2211.12498.
- Johnson, M.K.; Adelson, E.H. Retrographic sensing for the measurement of surface texture and shape. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 1070–1077.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2017; Volume 30, pp. 1–11.
Texture/Roughness Feature | Description |
---|---|
Gradient Magnitude | Sharpness or local variations in surface texture; helps distinguish coarse vs. smooth textures |
Contrast | Variation in intensity across the tactile surface; differentiates flat from textured materials |
Roughness | Surface unevenness; distinguishes materials like concrete (high) vs. synthetic fabric (moderate) |
Uniformity | How evenly the tactile pressure is distributed; identifies consistent contact textures |
Edge Density | The number of distinct edges or transitions per unit area |

Pressure/Contact Feature | Description
---|---
Pressure Std | Variability in pressure over the contact area; indicates surface irregularity or material compliance
Center Deviation | How far the pressure centroid deviates from the geometric center; reflects asymmetric or uneven contact
Max Pressure | Maximum localized pressure during contact; associated with material hardness or contact sharpness
Avg Pressure | Overall pressure magnitude across the contact area; related to surface hardness and grip force
Contact Area | Total area activated during contact; depends on material softness and indentation profile
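
The tables above describe the handcrafted tactile features in qualitative terms. A hedged sketch of how such statistics could be computed from an 8-bit grayscale GelSight frame is shown below; the specific operators, thresholds, and the use of pixel intensity as a pressure proxy are illustrative assumptions rather than the paper's exact recipe.

```python
import cv2
import numpy as np

def tactile_features(frame):
    """Illustrative texture/pressure statistics from an 8-bit grayscale tactile frame.
    Thresholds and normalizations are assumptions, not the paper's exact recipe."""
    g = frame.astype(np.float32)

    # Texture/roughness-style features
    gx = cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3)
    grad_mag = float(np.mean(np.sqrt(gx**2 + gy**2)))        # Gradient Magnitude
    contrast = float(g.max() - g.min())                       # Contrast
    roughness = float(np.std(g))                              # Roughness (intensity spread)
    hist = np.histogram(g, bins=32)[0].astype(np.float32)
    hist /= hist.sum() + 1e-8
    uniformity = float(np.sum(hist**2))                       # Uniformity (histogram energy)
    edges = cv2.Canny(frame, 50, 150)
    edge_density = float(edges.mean() / 255.0)                # Edge Density

    # Pressure/contact-style features (intensity treated as a pressure proxy)
    contact = g > g.mean()                                     # assumed contact mask
    ys, xs = np.nonzero(contact)
    cy, cx = (ys.mean(), xs.mean()) if len(ys) else (g.shape[0] / 2, g.shape[1] / 2)
    center_dev = float(np.hypot(cy - g.shape[0] / 2, cx - g.shape[1] / 2))

    return {
        "gradient_magnitude": grad_mag, "contrast": contrast,
        "roughness": roughness, "uniformity": uniformity, "edge_density": edge_density,
        "pressure_std": float(g.std()), "center_deviation": center_dev,
        "max_pressure": float(g.max()), "avg_pressure": float(g.mean()),
        "contact_area": float(contact.mean()),
    }
```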
Texture/Roughness Feature | Importance | Pressure/Contact Feature | Importance |
---|---|---|---|
Gradient Magnitude | 0.2493 | Pressure Std | 0.3390 |
Contrast | 0.2262 | Center Deviation | 0.2001 |
Roughness | 0.2244 | Max Pressure | 0.1942 |
Uniformity | 0.1733 | Avg Pressure | 0.1745 |
Edge Density | 0.1268 | Contact Area | 0.0922 |
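
Importance scores of this kind are commonly obtained from tree ensembles such as a random forest. The following scikit-learn sketch shows how comparable values could be computed; `X` and `y` are placeholders for the preprocessed tactile feature matrix and surface labels, and the hyperparameters are illustrative, not the authors' settings.

```python
from sklearn.ensemble import RandomForestClassifier

# Feature columns in the order of the tables above.
feature_names = [
    "Gradient Magnitude", "Contrast", "Roughness", "Uniformity", "Edge Density",
    "Pressure Std", "Center Deviation", "Max Pressure", "Avg Pressure", "Contact Area",
]

# X: (n_samples, 10) feature matrix, y: surface class labels (placeholders).
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Print impurity-based importances, highest first.
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda item: -item[1]):
    print(f"{name}: {score:.4f}")
```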
Modality | Model | Precision | Recall | F1-Score | Accuracy | Parameters | Inference Time (ms)
---|---|---|---|---|---|---|---
T-Only | Random Forest | 0.96 | 0.96 | 0.96 | 0.9560 | 108,788 | 0.2819
T-Only | XGBoost | 0.97 | 0.97 | 0.97 | 0.9670 | 1500 | 0.0923
T-Only | SVM (RBF) | 0.87 | 0.87 | 0.86 | 0.8660 | N/A | 0.0483
T-Only | SVM (Linear) | 0.68 | 0.68 | 0.67 | 0.6810 | 80 | 0.0337
T-Only | Encoder-only Transformer | 0.97 | 0.97 | 0.97 | 0.9740 | 152,901 | 0.0085
V-T | Multimodal-CNN | 1.00 | 1.00 | 1.00 | 1.0000 | 48,311,301 | 5.0737
V-T | Surformer v1 | 0.99 | 0.99 | 0.99 | 0.9940 | 673,321 | 0.7271
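
Parameter counts and per-sample inference times such as those in the table can be measured with simple utilities. The sketch below assumes a PyTorch model and CPU wall-clock timing; the paper's actual measurement protocol and hardware are not specified here.

```python
import time
import torch

def count_parameters(model):
    """Number of trainable parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def avg_inference_ms(model, example_input, n_runs=1000, warmup=50):
    """Average per-call latency in milliseconds (CPU wall clock)."""
    model.eval()
    for _ in range(warmup):          # warm-up passes to stabilize timings
        model(example_input)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(example_input)
    return (time.perf_counter() - start) / n_runs * 1000.0
```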
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kansana, M.; Hossain, E.; Rahimi, S.; Amiri Golilarz, N. Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features. Information 2025, 16, 839. https://doi.org/10.3390/info16100839