Semantically Supervised SeDINO Encoder for Visual–Language–Action Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Design and Construction of the SeDINO Visual Perception Model
2.1.1. SeDINO Architecture Design
2.1.2. MLP Alignment Network Training
2.2. Integration of SeDINO into the VLA Model
3. Results
3.1. Visual Dataset Validation Results
3.2. VLA Simulated and Real-World Scenario Validation Results
4. Discussion
4.1. Performance Advantages and Mechanisms
4.2. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P. OpenVLA: An open-source vision-language-action model. arXiv 2024, arXiv:2406.09246.
2. Lu, G.; Guo, W.; Zhang, C.; Zhou, Y.; Jiang, H.; Gao, Z.; Tang, Y.; Wang, Z. VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning. arXiv 2025, arXiv:2505.18719.
3. Pan, C.; Junge, K.; Hughes, J. Vision-language-action model and diffusion policy switching enables dexterous control of an anthropomorphic hand. arXiv 2024, arXiv:2410.14022.
4. Liang, Z.; Li, Y.; Yang, T.; Wu, C.; Mao, S.; Pei, L.; Yang, X.; Pang, J.; Mu, Y.; Luo, P. Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies. arXiv 2025, arXiv:2508.20072.
5. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A. DINOv2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
6. Dong, X.; Bao, J.; Zheng, Y.; Zhang, T.; Chen, D.; Yang, H.; Zeng, M.; Zhang, W.; Yuan, L.; Chen, D. MaskCLIP: Masked self-distillation advances contrastive language-image pretraining. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10995–11005.
7. Wysoczańska, M.; Siméoni, O.; Ramamonjisoa, M.; Bursuc, A.; Trzciński, T.; Pérez, P. CLIP-DINOiser: Teaching CLIP a Few DINO Tricks for Open-Vocabulary Semantic Segmentation. Lect. Notes Comput. Sci. 2023, 13752, 320–337.
8. Lan, M.; Chen, C.; Ke, Y.; Wang, X.; Feng, L.; Zhang, W. ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; pp. 70–88.
9. Wang, Y.; Shen, X.; Yuan, Y.; Du, Y.; Li, M.; Hu, S.X.; Crowley, J.L.; Vaufreydaz, D. TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15790–15801.
10. Rodriguez, D.S.; Gomez, A.E.R.; Serrezuela, R.R. Development of an embedded diagnostic tool for visual misalignment screening. HardwareX 2025, 2025, e00692.
11. Rodriguez Serrezuela, R.; Zamora, R.S.; Hermosilla, D.M.; Gomez, A.E.R.; Reyes, E.M. Hybrid Convolutional Vision Transformer for Robust Low-Channel sEMG Hand Gesture Recognition: A Comparative Study with CNNs. Biomimetics 2025, 10, 806.
12. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge 2012 (VOC2012) Results (2012). J. Vis. Comput. 2012, 10, 142–149.
13. Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; Yuille, A. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2014; Volume 2014, pp. 891–898.
14. Caesar, H.; Uijlings, J.; Ferrari, V. COCO-Stuff: Thing and Stuff Classes in Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; Volume 10, pp. 1209–1218.
15. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; Volume 2016, pp. 3213–3223.
16. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; Volume 2017, pp. 633–641.
17. Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. GroupViT: Semantic Segmentation Emerges from Text Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; Volume 1, pp. 18134–18144.
18. Cha, J.; Mun, J.; Roh, B. Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 11165–11174.
19. Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from CLIP. In European Conference on Computer Vision; IEEE: New York, NY, USA, 2022; pp. 696–712.
20. Wysoczańska, M.; Ramamonjisoa, M.; Trzciński, T.; Siméoni, O. CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation for-Free. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2024; Volume 1, pp. 1403–1413.
21. Wang, F.; Mei, J.; Yuille, A. SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; Volume 1, pp. 315–332.
22. Lan, M.; Chen, C.; Ke, Y.; Wang, X.; Feng, L.; Zhang, W. ClearCLIP: Decomposing CLIP representations for dense vision-language inference. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; Volume 1, pp. 143–160.
23. Hajimiri, S.; Ayed, I.B.; Dolz, J. Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 5061–5071.
24. Barsellotti, L.; Amoroso, R.; Cornia, M.; Baraldi, L.; Cucchiara, R. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; Volume 1, pp. 3689–3698.

mIoU by benchmark dataset:

| Model | V20 | V21 | C59 | Stuff | City | ADE20K | Avg |
|---|---|---|---|---|---|---|---|
| GroupViT [17] | 79.7 | 50.4 | 23.4 | 15.3 | 11.1 | 9.2 | 31.5 |
| TCL [18] | 77.5 | 51.2 | 30.3 | 19.6 | 23.1 | 14.9 | 36.1 |
| MaskCLIP [19] | 74.9 | 38.8 | 26.4 | 16.4 | 12.6 | 9.8 | 29.8 |
| CLIP-DIY [20] | 79.7 | 59.9 | 19.8 | 13.3 | 11.6 | 9.9 | 32.4 |
| SCLIP [21] | 80.4 | 59.1 | 34.2 | 22.4 | 32.2 | 16.1 | 40.7 |
| CLIP-DINOiser [7] | 80.9 | 62.1 | 35.9 | 24.6 | 31.1 | 20.0 | 42.4 |
| ClearCLIP [22] | 80.9 | 51.8 | 35.9 | 23.9 | 30.0 | 16.7 | 39.9 |
| NACLiP [23] | 79.7 | 58.9 | 35.2 | 23.3 | 35.5 | 17.4 | 41.7 |
| FreeDA [24] | 77.1 | 51.7 | 37.1 | 24.9 | 34.0 | 19.5 | 40.7 |
| ProxyCLIP [8] | 83.0 | 58.6 | 37.2 | 25.4 | 33.9 | 19.7 | 43.0 |
| SeDINO (ours) | 85.1 | 60.3 | 39.6 | 27.9 | 34.8 | 21.1 | 44.8 |
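The Avg column appears to be the unweighted mean of the six per-dataset mIoU scores. A minimal Python sketch, assuming that reading, recomputes it for three rows copied from the table above (the names `SCORES` and `row_average` are illustrative and not from the paper):

```python
# Per-dataset mIoU copied from the table above, in column order:
# V20, V21, C59, Stuff, City, ADE20K.
SCORES = {
    "CLIP-DINOiser": [80.9, 62.1, 35.9, 24.6, 31.1, 20.0],
    "ProxyCLIP": [83.0, 58.6, 37.2, 25.4, 33.9, 19.7],
    "SeDINO": [85.1, 60.3, 39.6, 27.9, 34.8, 21.1],
}


def row_average(values):
    """Unweighted mean across datasets, rounded to one decimal like the table.

    Assumption: the Avg column weights every dataset equally.
    """
    return round(sum(values) / len(values), 1)


averages = {name: row_average(vals) for name, vals in SCORES.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Under this assumption the recomputed values match the table's Avg entries for these rows, and SeDINO's average sits 1.8 points above ProxyCLIP's. The average hides per-dataset differences: CLIP-DINOiser is higher on V21 (62.1 vs. 60.3) and NACLIP is higher on City (35.5 vs. 34.8).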

Inference frequency by visual encoder:

| Visual Encoder | Inference Frequency (Hz) |
|---|---|
| DINO | 4.16 |
| CLIP | 4.57 |
| SeDINO | 4.03 |
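Inference frequency can be easier to compare as time per step. The sketch below assumes the reported frequency counts complete inference passes per second, so latency is simply its reciprocal; the names `FREQ_HZ` and `latency_ms` are illustrative and not from the paper:

```python
# Inference frequencies (Hz) copied from the table above.
FREQ_HZ = {"DINO": 4.16, "CLIP": 4.57, "SeDINO": 4.03}


def latency_ms(freq_hz):
    """Time per inference step in milliseconds.

    Assumption: frequency = complete passes per second, so latency = 1 / f.
    """
    return 1000.0 / freq_hz


latencies = {name: latency_ms(hz) for name, hz in FREQ_HZ.items()}
for name, ms in latencies.items():
    print(f"{name}: {ms:.1f} ms per step")
```

Read this way, SeDINO takes roughly 248 ms per step, about 8 ms more than DINO and about 29 ms more than CLIP.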
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Tian, S.; Yu, D.; Cui, L.; Liu, Z.; Wang, H.; Li, Z.; Liu, H. Semantically Supervised SeDINO Encoder for Visual–Language–Action Model. Appl. Sci. 2026, 16, 1464. https://doi.org/10.3390/app16031464

