A Novel Convolutional Vision Transformer Network for Effective Level-of-Detail Awareness in Digital Twins
Abstract
1. Introduction
1.1. Literature Review
1.2. Research Motivation
1.3. Our Contributions
- We propose a unified architecture that integrates a CvT backbone with multiple LoD-specific ViT branch networks. This design enables the model to detect key features tailored to each LoD through the corresponding ViT-based branch network, while extracting collective features across all LoDs through the CvT-based backbone (a minimal sketch follows this list).
- For each ViT-based branch network, we adopt a coarse-to-fine inference strategy to effectively detect the key features of the corresponding LoD: informative image patches are identified and then processed at a finer granularity. An early exit mechanism in the coarse-to-fine structure significantly reduces both the computational burden and the inference time at each LoD whenever the intermediate inference is sufficiently accurate.
- From the perspective of LoD-aware DT synchronization, LCvT enables a single model to perform end-to-end hierarchical classification across different LoDs, eliminating the need to maintain an independent model for each LoD.
- Through experiments on real-world datasets, we demonstrate that LCvT effectively learns key features across varying LoDs and consistently outperforms existing baseline models. Moreover, LCvT achieves shorter inference times than a set of independent per-LoD models, especially in real-world scenarios where the LoD changes dynamically.
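To make the architecture described above concrete, the following is a minimal, hypothetical PyTorch sketch of the LCvT layout: a shared convolutional backbone feeding one ViT-style branch per LoD, each branch with a coarse stage, an early-exit check, and a fine stage. All module names, dimensions, layer counts, and the confidence threshold are illustrative assumptions rather than the authors' implementation, which uses CvT stages and refines only the informative patches.

```python
import torch
import torch.nn as nn

class LoDBranch(nn.Module):
    """ViT-style branch for one LoD: coarse inference, early exit, optional fine stage."""
    def __init__(self, dim, num_classes, threshold=0.9):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.coarse = nn.TransformerEncoder(layer, num_layers=2)  # layers are deep-copied
        self.fine = nn.TransformerEncoder(layer, num_layers=2)
        self.coarse_head = nn.Linear(dim, num_classes)
        self.fine_head = nn.Linear(dim, num_classes)
        self.threshold = threshold  # assumed early-exit confidence threshold

    def forward(self, tokens):
        h = self.coarse(tokens)
        coarse_logits = self.coarse_head(h.mean(dim=1))
        # Early exit (assumes batch size 1): skip the fine stage when the
        # coarse-grained prediction is already confident enough.
        if coarse_logits.softmax(dim=-1).max() >= self.threshold:
            return coarse_logits
        # LCvT would re-split only the informative patches at a finer granularity;
        # this sketch simply passes all tokens through a second encoder stack.
        return self.fine_head(self.fine(h).mean(dim=1))

class LCvTSketch(nn.Module):
    """Shared convolutional backbone with one LoD-specific branch per stage."""
    def __init__(self, classes_per_lod=(10, 100), dim=128):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.branch1 = LoDBranch(dim, classes_per_lod[0])  # LoD 1: coarse categories
        self.branch2 = LoDBranch(dim, classes_per_lod[1])  # LoD 2: fine categories

    def forward(self, x, lod):
        f1 = self.stage1(x)  # collective features shared by all LoDs
        if lod == 1:
            return self.branch1(f1.flatten(2).transpose(1, 2))  # (B, N, C) tokens
        f2 = self.stage2(f1)  # deeper shared stage for the finer LoD
        return self.branch2(f2.flatten(2).transpose(1, 2))
```

For instance, `LCvTSketch()(torch.randn(1, 3, 224, 224), lod=2)` returns fine-LoD logits, exiting after the coarse encoder whenever its prediction clears the threshold.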
1.4. Paper Structure
2. LoD-Aware Classification Problem
3. LoD-Aware Convolutional Vision Transformer Network
3.1. LCvT Architecture
3.1.1. Backbone Network
3.1.2. LoD-Specific Branch Network
3.2. Training of LCvT
4. LCvT in LoD-Aware DT Synchronization
4.1. LCvT Workflow for LoD-Aware DT Synchronization
4.2. Practical Implications of LCvT in DT Systems
5. Experimental Results
5.1. Experimental Setup
- LCvT is our proposed model.
- LCvT-OC is a model that shares the identical architecture of LCvT but executes only coarse-grained inference at each LoD.
- CvT is a model that executes only the backbone network used in LCvT, with a classification head for each LoD; in a dynamic LoD scenario, this corresponds to running an independent model per LoD (see the sketch after this list).
- CNN is a model that uses only traditional convolution-based feature extraction. This model shares a design philosophy with [21].
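To illustrate why a single LCvT model is attractive when the required LoD changes at run time (evaluated in Section 5.2.2), the following sketch caches the shared stage-1 features so that a switch to another LoD pays only for the remaining stage and branch. The `LoDServer` class, the caching scheme, and the frame-id interface are illustrative assumptions built on the earlier `LCvTSketch`, not part of the paper.

```python
import torch

class LoDServer:
    """Hypothetical server that reuses shared backbone features across LoD switches."""
    def __init__(self, model):
        self.model = model   # an LCvTSketch-like model from the earlier sketch
        self._f1 = None      # cached stage-1 features for the current frame
        self._frame_id = None

    @torch.no_grad()
    def infer(self, frame_id, x, lod):
        if frame_id != self._frame_id:     # new frame: recompute the shared stage
            self._f1 = self.model.stage1(x)
            self._frame_id = frame_id
        if lod == 1:                       # LoD 1 reuses the cached features directly
            return self.model.branch1(self._f1.flatten(2).transpose(1, 2))
        f2 = self.model.stage2(self._f1)   # LoD 2 adds only stage 2 and its branch
        return self.model.branch2(f2.flatten(2).transpose(1, 2))
```

Under this reading, a switch costs only the incremental stage and branch time, whereas independent per-LoD models must run end-to-end on every switch; this is one plausible account of the cumulative timings reported in Section 5.2.2.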
5.2. Performance Evaluation
5.2.1. Comparison with Baseline Models
5.2.2. Time Efficiency in Dynamic LoD Scenarios
5.3. Interpretability and Qualitative Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Grieves, M.; Vickers, J. Digital twin: Mitigating unpredictable, undesirable emergent behavior in complex systems. In Transdisciplinary Perspectives on Complex Systems: New Findings and Approaches; Springer: Cham, Switzerland, 2017; pp. 85–113.
- Wang, H.; Yang, Z.; Zhang, Q.; Sun, Q.; Lim, E. A digital twin platform integrating process parameter simulation solution for intelligent manufacturing. Electronics 2024, 13, 802.
- Wu, Y.; Zhang, K.; Zhang, Y. Digital twin networks: A survey. IEEE Internet Things J. 2021, 8, 13789–13804.
- Xu, H.; Wu, J.; Pan, Q.; Guan, X.; Guizani, M. A survey on digital twin for industrial internet of things: Applications, technologies and tools. IEEE Commun. Surv. Tutor. 2023, 25, 2569–2598.
- Zhang, J.; Tao, D. Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things. IEEE Internet Things J. 2021, 8, 7789–7817.
- Castaño, F.; Strzelczak, S.; Villalonga, A.; Haber, R.E.; Kossakowska, J. Sensor reliability in cyber-physical systems using internet-of-things data: A review and case study. Remote Sens. 2019, 11, 2252.
- Al-Ali, A.R.; Gupta, R.; Zaman Batool, T.; Landolsi, T.; Aloul, F.; Al Nabulsi, A. Digital twin conceptual model within the context of internet of things. Future Internet 2020, 12, 163.
- Fett, M.; Kraft, M.; Wilking, F.; Goetz, S.; Wartzack, S.; Kirchner, E. Medium-level architectures for digital twins: Bridging conceptual reference architectures to practical implementation in cloud, edge and cloud–edge deployments. Electronics 2024, 13, 1373.
- Brooks, R.J.; Tobias, A.M. Choosing the best model: Level of detail, complexity, and model performance. Math. Comput. Model. 1996, 24, 1–14.
- Abualdenien, J.; Borrmann, A. Levels of detail, development, definition, and information need: A critical literature review. J. Inf. Technol. Constr. 2022, 27, 363–392.
- Alfaro-Viquez, D.; Zamora-Hernandez, M.; Fernandez-Vega, M.; Garcia-Rodriguez, J.; Azorin-Lopez, J. A comprehensive review of AI-based digital twin applications in manufacturing: Integration across operator, product, and process dimensions. Electronics 2025, 14, 646.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 22–31.
- Fukui, H.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Attention branch network: Learning of attention mechanism for visual explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10705–10714.
- Teerapittayanon, S.; McDanel, B.; Kung, H. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of the International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2464–2469.
- Cambazoglu, B.B.; Zaragoza, H.; Chapelle, O.; Chen, J.; Liao, C.; Zheng, Z.; Degenhardt, J. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), New York, NY, USA, 3–6 February 2010; pp. 411–420.
- Matsubara, Y.; Levorato, M.; Restuccia, F. Split computing and early exiting for deep learning applications: Survey and research challenges. ACM Comput. Surv. 2022, 55, 90.
- Chen, M.; Lin, M.; Li, K.; Shen, Y.; Wu, Y.; Chao, F.; Ji, R. CF-ViT: A general coarse-to-fine method for vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7042–7052.
- Yan, Z.; Zhang, H.; Piramuthu, R.; Jagadeesh, V.; DeCoste, D.; Di, W.; Yu, Y. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2740–2748.
- Zhu, X.; Bain, M. B-CNN: Branch convolutional neural network for hierarchical classification. arXiv 2017, arXiv:1709.09890.
- Wang, Y.; Chung, S.H.; Khan, W.A.; Wang, T.; Xu, D.J. ALADA: A lite automatic data augmentation framework for industrial defect detection. Adv. Eng. Inform. 2023, 58, 102205.
- Shi, Y.; Wei, P.; Feng, K.; Feng, D.C.; Beer, M. A survey on machine learning approaches for uncertainty quantification of engineering systems. Mach. Learn. Comput. Sci. Eng. 2025, 1, 11.
- Khan, W.A.; Chung, S.H.; Wang, Y.; Eltoukhy, A.E.; Liu, S.Q. Intelligent early fault management using a continual deep learning information system for industrial operations. Ind. Manag. Data Syst. 2025, in press.
- Kuts, V.; Modoni, G.; Terkaj, W.; Tähemaa, T.; Sacco, M.; Otto, T. Exploiting factory telemetry to support virtual reality simulation in robotics cell. In Proceedings of the International Conference on Augmented Reality, Virtual Reality, and Computer Graphics (AVR), Ugento, Italy, 12–15 June 2017; pp. 212–221.
- Zhang, J.; Cheng, J.; Chen, W.; Chen, K. Digital twins for construction sites: Concepts, LoD definition, and applications. J. Manag. Eng. 2022, 38, 04021094.
- Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949.
- Liang, Y.; Chongjian, G.; Tong, Z.; Song, Y.; Wang, J.; Xie, P. EViT: Expediting vision transformers via token reorganizations. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
- Yang, L.; Luo, P.; Loy, C.C.; Tang, X. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015; pp. 3973–3981.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
|  | LCvT | LCvT-OC | CvT | CNN |
|---|---|---|---|---|
| Stage 1 | Conv, 64; Conv, 128; Pool (common to all models) | | | |
| LoD 1 Branch | Encoders(C), Head → Patch Identification → Patch Splitting → Encoders(F), Head | Encoders(C), Head | Head | Flatten |
| Stage 2 | Conv, 256; Conv, 512; Pool (common to all models) | | | |
| LoD 2 Branch | Encoders(C), Head → Patch Identification → Patch Splitting → Encoders(F), Head | Encoders(C), Head | Head | Flatten |
| Dataset | LoD Level | LCvT | LCvT-OC | CvT | CNN |
|---|---|---|---|---|---|
| CompCars | LoD 1 | 62.48% (C) / 68.68% (F) | 50.33% | 32.97% | 20.52% |
| CompCars | LoD 2 | 70.59% (C) / 74.58% (F) | 57.34% | 55.86% | 58.23% |
| ImageNet | LoD 1 | 46.52% (C) / 52.23% (F) | 41.79% | 26.84% | 19.16% |
| ImageNet | LoD 2 | 54.21% (C) / 55.62% (F) | 43.27% | 40.18% | 44.08% |
|  | LCvT | CvT |
|---|---|---|
| LoD 1 → LoD 2 | 66.977 (43.253 + 23.724) | 94.953 (38.572 + 56.381) |
| LoD 2 → LoD 1 | 73.012 (66.977 + 6.035) | 94.953 (56.381 + 38.572) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).