Proceeding Paper

Application of Vision Language Models in the Shoe Industry †

Department of Artificial Intelligence, Asia University, Taichung 413305, Taiwan
* Author to whom correspondence should be addressed.
Presented at the 2025 IEEE 5th International Conference on Electronic Communications, Internet of Things and Big Data, New Taipei, Taiwan, 25–27 April 2025.
Eng. Proc. 2025, 108(1), 50; https://doi.org/10.3390/engproc2025108050
Published: 24 September 2025

Abstract

The confluence of computer vision and natural language processing has yielded powerful vision language models (VLMs) capable of multimodal understanding. We applied state-of-the-art VLMs to quality monitoring in the shoe assembly industry. By leveraging the ability of VLMs to jointly process visual and textual data, we developed a system for automated defect detection and contextualized feedback generation to enhance the efficiency and consistency of quality assurance processes. We empirically evaluated the effectiveness of the developed VLM system in identifying standard assembly procedures using video data from a shoe assembly line. The experimental results validated the potential of the VLM system for assessing the quality of footwear assembly, highlighting the feasibility of future deployment in industrial quality control scenarios.

1. Introduction

The need for the footwear manufacturing industry to adopt Industry 4.0 stems from a confluence of pressures and opportunities requiring a paradigm shift from traditional practices. To improve efficiency and productivity amid rising labor costs and potential shortages, it has become critical to integrate automation, robotics, the Internet of Things (IoT), and artificial intelligence (AI) vision systems into labor-intensive tasks in order to optimize workflows [1]. IoT provides the connectivity and data foundation necessary for intelligent manufacturing. Robotics delivers automation, precision, and flexibility to manufacturing processes. AI acts as the “brain” of the intelligent factory.
The application of AI in this industry can be categorized into four distinct stages (Figure 1). The initial stage, pattern recognition, focuses on identifying and classifying pre-defined patterns within industrial data [2]. The second stage, machine learning (ML), enables systems to learn from data and make predictions or classifications without explicit programming [3]. This is followed by the third stage, dominated by deep learning (DL), which leverages complex neural networks to automatically extract intricate features from large datasets, significantly enhancing performance in image recognition and anomaly detection. Currently, industrial AI is in its fourth stage, characterized by generative AI built on large language models (LLMs), whose new capabilities can be applied to footwear manufacturing. This study explores this latest stage of AI by applying vision language models to quality monitoring in shoe assembly.

2. Literature Review

2.1. DL and LLMs

The past two decades have witnessed a monumental surge in DL and the nascent development of LLMs. DL emerged as a dominant force, fueled by advancements in computational power, especially graphics processing units. Subsequently, the foundations for LLMs were laid. While early natural language processing relied on statistical methods and simpler ML models, the late 2010s marked a turning point with the introduction of the transformer in 2017. This architecture, with its self-attention mechanism, proved highly effective in capturing long-range dependencies in text, paving the way for the first generation of LLMs. Models such as Word2Vec [4] for word embeddings and the initial versions of the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT) [5,6] demonstrated a significant leap in language understanding and generation capabilities.

2.2. Multimodal LLMs (MLLMs)

MLLMs represent a significant evolution in artificial intelligence, extending the capabilities of traditional LLMs beyond text to process and generate information across multiple modalities such as images, audio, and video (Figure 2) [7]. A key feature of their development is the combination of powerful architectures: transformers, which excel at sequence modeling and have become the backbone of modern LLMs, and convolutional neural networks, which are predominantly used for visual feature extraction. These visual encoders, often paired with transformers, are crucial for processing image and video data. Mechanisms that fuse visual and textual features enable MLLMs to perform image captioning, visual question answering, text-to-image generation, and even more complex reasoning involving multiple modalities. While still rapidly developing, MLLMs hold immense promise for creating versatile, human-like AI systems capable of understanding and interacting with the world in a richer, more comprehensive way.

3. Methodology

In this study, we utilized publicly available VLMs hosted on Hugging Face Spaces that accept video input, including VideoLLaMA 2 [8] and Qwen2.5-VL [9]. These platforms provide access to the capabilities of state-of-the-art VLMs for understanding and processing video content. By leveraging these pre-existing Spaces, we tested and evaluated the models’ performance on tasks involving video analysis and interpretation, taking advantage of their pre-trained abilities to process both visual and textual information extracted from the video inputs.
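As a concrete illustration of this workflow, the following minimal sketch shows how a video clip and a text question can be submitted to Qwen2.5-VL through the public Hugging Face transformers interface. The checkpoint name, video path, and question are placeholders, and the Space-hosted demos used in this study wrap essentially the same kind of inference call rather than this exact script.

```python
# Minimal sketch: querying Qwen2.5-VL with a video clip and a text prompt.
# Assumptions: a recent transformers release with Qwen2.5-VL support and the
# qwen-vl-utils package are installed; checkpoint and file paths are placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn containing a video and a question, in the chat format the processor expects.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/assembly_clip.mp4"},  # placeholder path
        {"type": "text", "text": "Is the worker applying glue to the soles of the shoes?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # decodes and samples video frames
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated answer is decoded.
answer_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```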

3.1. VideoLLaMA 2

VideoLLaMA 2 was developed to enhance spatial-temporal modeling and audio understanding in video- and audio-related tasks [8]. Building upon its predecessor, VideoLLaMA, version 2 incorporates a tailored spatial-temporal convolution (STC) connector to effectively capture the complex spatial and temporal dynamics within video data. It also integrates an audio branch through joint training, allowing the model to seamlessly incorporate audio cues and enrich its multimodal comprehension abilities.

3.2. Qwen2.5-VL

Qwen2.5-VL was developed by the Qwen team at Alibaba Cloud [9]. Building upon the foundations of previous Qwen models, Qwen2.5-VL demonstrates significant advancements in understanding and processing visual content alongside textual information. Its key improvements include enhanced general image recognition across a wider range of categories, more precise object grounding capabilities with diverse output formats like bounding boxes and points, and substantially improved text recognition and understanding within images, including multi-lingual and multi-oriented text.

3.3. Zero-Shot Learning and In-Context Learning for VLMs

Zero-shot learning [10] for VLMs refers to the ability of these models to perform tasks or recognize concepts for which they have not been explicitly trained. Instead of requiring task-specific labeled data, VLMs leverage pre-existing knowledge acquired from training on massive datasets of paired images and text to generalize to new, unseen scenarios. This is achieved by learning a shared embedding space in which visual and textual representations are aligned. When presented with a new task or concept described in text, a VLM uses this alignment to identify the relevant visual features, even if it has never seen an example of that specific task or concept before. In-context learning [11] for VLMs allows a model to perform new tasks or follow specific instructions simply by being given a few input-output examples within the prompt, without any explicit fine-tuning or gradient updates. The model observes the task demonstrations, infers the underlying pattern, and applies it to new, unseen inputs.
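To make the distinction concrete, the sketch below contrasts a zero-shot prompt with an in-context prompt in the chat-message format accepted by the Qwen2.5-VL processor shown earlier. The video paths, questions, and demonstration answer are illustrative placeholders, not the exact prompts used in the experiments.

```python
# Zero-shot prompt: the test question alone, with no demonstrations.
zero_shot_messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/test_clip.mp4"},  # placeholder
        {"type": "text", "text": "Is the worker wearing a glove on his right hand? Answer yes or no."},
    ],
}]

# In-context prompt: a worked example (demonstration) precedes the test question,
# so the model can infer the expected task and answer format without fine-tuning.
in_context_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/demo_clip.mp4"},  # placeholder demo clip
            {"type": "text", "text": "Is the worker applying glue to the soles of the shoes? Answer yes or no."},
        ],
    },
    {"role": "assistant", "content": "Yes."},  # illustrative demonstration answer
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/test_clip.mp4"},  # clip to be judged
            {"type": "text", "text": "Is the worker applying glue to the soles of the shoes? Answer yes or no."},
        ],
    },
]

# Either message list is passed to processor.apply_chat_template(...) and
# process_vision_info(...) exactly as in the inference sketch in Section 3.
```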

4. Results and Discussion

In this study, we collected two types of video clips: videos showing a worker inspecting two tanker trucks in a simulated workplace security scenario (Figure 3), and videos from real shoe assembly workshops (Figure 4). We submitted questions as text prompts, along with the videos, to each VLM.

4.1. VLM-Based Surveillance Video Analysis with Zero-Shot Learning

Two VLMs were evaluated on the workplace security videos for their ability to understand and interpret complex visual scenes. A set of ten targeted questions was designed to test the models’ reasoning across various aspects of human activity, including counting (“How many workers are there?”), safety compliance (“Is the worker wearing a helmet?”), behavior detection (“Is the worker smoking?”, “Did the worker talk on the phone?”), and object recognition (“Is this vehicle a bus?”) (Table 1). The questions covered a range of tasks, including activity recognition, attribute detection, and object identification, all relevant to workplace safety and surveillance. By analyzing the accuracy and relevance of the models’ responses, we evaluated their capabilities on real-world video understanding tasks.
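A simple way to run this protocol is to loop over the question list for each video and compare the model’s yes/no style replies against manually annotated ground truth. The sketch below assumes a hypothetical ask_vlm(video_path, question) helper (for example, a thin wrapper around the inference code in Section 3) and placeholder ground-truth labels; neither the helper nor the labels are taken from the paper.

```python
from typing import Callable, Dict, List

# The ten surveillance questions from Table 1.
QUESTIONS: List[str] = [
    "How many workers are there?",
    "How many vehicles are there?",
    "Is the worker wearing a helmet?",
    "Is the worker smoking?",
    "Do workers wear masks?",
    "Did the worker fall?",
    "Did the workers take notes?",
    "Did the worker talk on the phone?",
    "Is the vehicle moving?",
    "Is this vehicle a bus?",
]

def evaluate_video(
    video_path: str,
    ground_truth: Dict[str, str],          # question -> expected short answer (annotated by hand)
    ask_vlm: Callable[[str, str], str],    # hypothetical wrapper around the VLM inference call
) -> float:
    """Return the fraction of questions the model answers correctly for one video."""
    correct = 0
    for question in QUESTIONS:
        reply = ask_vlm(video_path, question).strip().lower()
        expected = ground_truth[question].strip().lower()
        # Count a hit if the expected short answer appears in the model's free-form reply.
        if expected in reply:
            correct += 1
    return correct / len(QUESTIONS)

# Example usage (placeholder labels, not the paper's annotations):
# accuracy = evaluate_video("video_a.mp4", {"Is the worker smoking?": "no", ...}, ask_vlm)
```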

4.2. VLM-Based Assembly Quality Monitoring

We then evaluated the performance of the VLMs on shoe assembly workshop videos, focusing on their ability to understand and interpret complex visual scenes relevant to footwear manufacturing. The questions in Table 2 were presented to the VLMs to assess their comprehension of specific actions, object identification, and worker attributes: “Is the worker applying glue to the soles of the shoes?”, “Is the worker’s action assembling the sole and upper?”, “Is the worker wearing a glove on his right hand?”, “Is the worker wearing a glove on his left hand?”, and “Do the workers wear shoes?” The models’ responses to these questions were analyzed to evaluate their capacity for detailed visual understanding in an industrial context.

4.3. In-Context Learning for Assembly Quality Monitoring

The Qwen2.5-VL model outperformed VideoLLaMA 2 in these tests, so the final evaluation was conducted with Qwen2.5-VL using in-context learning text prompts. Figure 5 shows the questions and responses, which demonstrate the potential of VLMs for industrial applications.

5. Conclusions

We evaluated the capabilities of VLMs for monitoring quality in the shoe assembly industry. The results indicated that the Qwen2.5-VL model demonstrated strong reasoning capabilities in various industrial scenarios. VLMs hold significant potential for optimizing efficiency, improving quality control, and enabling more intelligent automation throughout the footwear manufacturing process.

Author Contributions

Conceptualization, H.-M.T. and H.-T.C.; methodology, H.-M.T. and H.-T.C.; validation, H.-M.T. and H.-T.C.; investigation, H.-M.T. and H.-T.C.; resources, H.-M.T.; writing—original draft preparation, H.-M.T. and H.-T.C.; supervision, H.-T.C.; project administration, H.-T.C.; funding acquisition, H.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported in part by the National Science and Technology Council, Taiwan (Grant No. MOST 111-2410-H-324-006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental results can be found at: https://github.com/htchu/vlm_QMSA (accessed on 20 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krishnan, A.; Swarna, S.; Balasubramanya, H.S. Robotics, IoT, and AI in the Automation of Agricultural Industry: A Review. In Proceedings of the B-HTC 2020—1st IEEE Bangalore Humanitarian Technology Conference, Vijiyapur, India, 8–10 October 2020.
  2. Adugna, T.D.; Ramu, A.; Haldorai, A. A Review of Pattern Recognition and Machine Learning. J. Mach. Comput. 2024, 4, 210–220.
  3. Janiesch, C.; Zschech, P.; Heinrich, K. Machine Learning and Deep Learning. Electron. Mark. 2021, 31, 685–695.
  4. Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162.
  5. Ray, P.P. ChatGPT: A Comprehensive Review on Background, Applications, Key Challenges, Bias, Ethics, Limitations and Future Scope. Internet Things Cyber-Phys. Syst. 2023, 3, 121–154.
  6. Bello, A.; Ng, S.C.; Leung, M.F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506.
  7. Bian, Y.; Küster, D.; Liu, H.; Krumhuber, E.G. Understanding Naturalistic Facial Expressions with Deep Learning and Multimodal Large Language Models. Sensors 2024, 24, 126.
  8. Cheng, Z.; Leng, S.; Zhang, H.; Xin, Y.; Li, X.; Chen, G.; Zhu, Y.; Zhang, W.; Luo, Z.; Zhao, D. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv 2024, arXiv:2406.07476.
  9. Ahmed, I.; Islam, S.; Datta, P.P.; Kabir, I.; Chowdhury, N.U.R.; Haque, A. Qwen 2.5: A Comprehensive Review of the Leading Resource-Efficient LLM with Potential to Surpass All Competitors. TechRxiv 2025.
  10. Romera-Paredes, B.; Torr, P. An Embarrassingly Simple Approach to Zero-Shot Learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2152–2161.
  11. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-Context Learning. arXiv 2022, arXiv:2301.00234.
Figure 1. Industrial application stages of AI.
Figure 2. Architecture of MLLMs.
Figure 3. Videos for the simulation of a workplace security scenario: (a) a worker standing up and (b) a worker lying down.
Figure 4. Videos from a real shoe assembly workshop: (a) applying glue to the soles of shoes and (b) assembling the sole and upper of a shoe.
Figure 5. In-context learning of VLMs with Qwen2.5-VL.
Table 1. Questions on surveillance assessed by VLM.

Questions                              VideoLLaMA 2            Qwen2.5-VL
                                       Video (a)   Video (b)   Video (a)   Video (b)
How many workers are there?            X           X           V           V
How many vehicles are there?           V           V           V           V
Is the worker wearing a helmet?        V           V           V           V
Is the worker smoking?                 X           X           V           V
Do workers wear masks?                 V           V           X           X
Did the worker fall?                   V           V           V           V
Did the workers take notes?            X           X           V           V
Did the worker talk on the phone?      X           X           V           V
Is the vehicle moving?                 X           V           V           V
Is this vehicle a bus?                 X           V           V           V
V indicates a correct answer and X an incorrect answer.
Table 2. Questions on shoe assembly assessed by VLM.

Questions                                                 VideoLLaMA 2            Qwen2.5-VL
                                                          Video (a)   Video (b)   Video (a)   Video (b)
Is the worker applying glue to the soles of the shoes?    V           X           X           V
Is the worker’s action assembling the sole and upper?     X           V           X           X
Is the worker wearing a glove on his right hand?          V           V           V           V
Is the worker wearing a glove on his left hand?           V           V           V           V
Do the workers wear shoes?                                X           X           V           X
V indicates a correct answer and X an incorrect answer.