Application of Vision Language Models in the Shoe Industry
Abstract
1. Introduction
2. Literature Review
2.1. DL and LLMs
2.2. Multimodal LLMs (MLLMs)
3. Methodology
3.1. VideoLLaMA 2
3.2. Qwen2.5-VL
3.3. Zero-Shot Learning and In-Context Learning for VLMs
4. Results and Discussion
4.1. VLM-Based Surveillance Video Analysis with Zero-Shot Learning
4.2. VLM-Based Assembly Quality Monitoring
4.3. In-Context Learning for Assembly Quality Monitoring
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Questions | VideoLLaMA 2, Video (a) | VideoLLaMA 2, Video (b) | Qwen2.5-VL, Video (a) | Qwen2.5-VL, Video (b) |
---|---|---|---|---|
How many workers are there? | X | X | V | V |
How many vehicles are there? | V | V | V | V |
Is the worker wearing a helmet? | V | V | V | V |
Is the worker smoking? | X | X | V | V |
Do workers wear masks? | V | V | X | X |
Did the worker fall? | V | V | V | V |
Did the workers take notes? | X | X | V | V |
Did the worker talk on the phone? | X | X | V | V |
Is the vehicle moving? | X | V | V | V |
Is this vehicle a bus? | X | V | V | V |
Questions | VideoLLaMA 2, Video (a) | VideoLLaMA 2, Video (b) | Qwen2.5-VL, Video (a) | Qwen2.5-VL, Video (b) |
---|---|---|---|---|
Is the worker applying glue to the soles of the shoes? | V | X | X | V |
Is the worker’s action assembling the sole and upper? | X | V | X | X |
Is the worker wearing a glove on his right hand? | V | V | V | V |
Is the worker wearing a glove on his left hand? | V | V | V | V |
Do the workers wear shoes? | X | X | V | X |
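The per-question marks in the two tables above can be totaled per model and video. The following is a minimal sketch of that tally, assuming that "V" marks a correct answer and "X" an incorrect one (the tables themselves do not define the marks). The names `COLUMNS`, `SURVEILLANCE`, `ASSEMBLY`, and `tally` are hypothetical and introduced only for illustration.

```python
# Assumption: "V" marks a correct answer and "X" an incorrect one.
# Each string holds one table row, in column order:
# VideoLLaMA 2 (a), VideoLLaMA 2 (b), Qwen2.5-VL (a), Qwen2.5-VL (b).
COLUMNS = [
    "VideoLLaMA 2, Video (a)",
    "VideoLLaMA 2, Video (b)",
    "Qwen2.5-VL, Video (a)",
    "Qwen2.5-VL, Video (b)",
]

# Surveillance video questions (first table), transcribed row by row.
SURVEILLANCE = [
    "XXVV",  # How many workers are there?
    "VVVV",  # How many vehicles are there?
    "VVVV",  # Is the worker wearing a helmet?
    "XXVV",  # Is the worker smoking?
    "VVXX",  # Do workers wear masks?
    "VVVV",  # Did the worker fall?
    "XXVV",  # Did the workers take notes?
    "XXVV",  # Did the worker talk on the phone?
    "XVVV",  # Is the vehicle moving?
    "XVVV",  # Is this vehicle a bus?
]

# Assembly quality questions (second table), transcribed row by row.
ASSEMBLY = [
    "VXXV",  # Is the worker applying glue to the soles of the shoes?
    "XVXX",  # Is the worker's action assembling the sole and upper?
    "VVVV",  # Is the worker wearing a glove on his right hand?
    "VVVV",  # Is the worker wearing a glove on his left hand?
    "XXVX",  # Do the workers wear shoes?
]


def tally(rows):
    """Return the number of "V" marks per column for the given rows."""
    return {
        name: sum(row[i] == "V" for row in rows)
        for i, name in enumerate(COLUMNS)
    }


if __name__ == "__main__":
    for title, rows in (("Surveillance", SURVEILLANCE), ("Assembly", ASSEMBLY)):
        print(title)
        for name, count in tally(rows).items():
            print(f"  {name}: {count}/{len(rows)}")
```

Under that reading of the marks, the surveillance table totals 4/10 and 6/10 for VideoLLaMA 2 and 9/10 on both videos for Qwen2.5-VL, while every column of the assembly table totals 3/5.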
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tseng, H.-M.; Chu, H.-T. Application of Vision Language Models in the Shoe Industry. Eng. Proc. 2025, 108, 50. https://doi.org/10.3390/engproc2025108050