Enhanced MobileViT with Dilated and Deformable Attention and Context Broadcasting Module for Intangible Cultural Heritage Embroidery Recognition
Abstract
1. Introduction
- Construction of the Guizhou intangible cultural heritage embroidery dataset.
- First application of a CNN-ViT hybrid model to embroidery recognition tasks.
- Replacement of the lower-layer ViT Block in MobileViT with DilateFormer [22] to reduce computation while improving recognition accuracy.
- Introduction of a new attention mechanism, Multi-Scale Dilated Deformable Attention (MSDDA), combining Multi-Scale Dilated Attention (MSDA) with deformable convolutions [23] (a hedged sketch of both ingredients follows this list).
- Replacement of the higher-layer ViT Block in MobileViT with DefDilateFormer to improve recognition accuracy at minimal additional computational cost.
- Introduction of CBM [24] to alleviate the information propagation bottleneck of sparse attention during long-range dependency modeling.
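As a rough illustration of the two attention ingredients named above, the sketch below shows (a) DilateFormer-style multi-scale dilated attention, where each head attends over a small neighborhood sampled at its own dilation rate, and (b) the deformable-sampling mechanism via torchvision's `deform_conv2d`. Class and variable names (`MSDASketch`, `offset_head`) are ours; this is not the authors' MSDDA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class MSDASketch(nn.Module):
    """Multi-scale dilated attention, one head per dilation rate: every
    query attends only to its k x k neighbourhood, sampled at that head's
    dilation (DilateFormer-style sliding-window attention)."""

    def __init__(self, dim, dilations=(1, 2, 3), kernel_size=3):
        super().__init__()
        assert dim % len(dilations) == 0
        self.dilations, self.k = dilations, kernel_size
        self.head_dim = dim // len(dilations)
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        q, k, v = self.qkv(x).chunk(3, dim=1)
        outs = []
        for i, d in enumerate(self.dilations):  # split channels across scales
            s = slice(i * self.head_dim, (i + 1) * self.head_dim)
            outs.append(self._window_attn(q[:, s], k[:, s], v[:, s], d))
        return self.proj(torch.cat(outs, dim=1))

    def _window_attn(self, q, k, v, d):
        B, C, H, W = q.shape
        pad = d * (self.k - 1) // 2             # keeps spatial size unchanged
        kn = F.unfold(k, self.k, dilation=d, padding=pad).view(B, C, self.k**2, H * W)
        vn = F.unfold(v, self.k, dilation=d, padding=pad).view(B, C, self.k**2, H * W)
        attn = (q.view(B, C, 1, H * W) * kn).sum(1, keepdim=True) / C**0.5
        out = (attn.softmax(dim=2) * vn).sum(dim=2)
        return out.view(B, C, H, W)

# Deformable ingredient: a small conv predicts 2*k*k (x, y) offsets per
# position, bending the fixed sampling grid, as in deformable convolution [23].
dim = 96
msda = MSDASketch(dim)
offset_head = nn.Conv2d(dim, 2 * 3 * 3, 3, padding=1)
weight = nn.Parameter(torch.randn(dim, dim, 3, 3) * 0.02)
x = torch.randn(1, dim, 32, 32)
print(msda(x).shape)                                     # (1, 96, 32, 32)
print(deform_conv2d(x, offset_head(x), weight, padding=1).shape)
```

The actual DefDilateFormer presumably applies such offsets inside the attention sampling itself (Section 2.4); the sketch separates the two mechanisms only for readability.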
2. Materials and Methods
2.1. MobileViT-DDC Network Architecture
2.2. MV2 Block
2.3. DilateFormer Block
2.4. DefDilateFormer Block
2.5. CBM Block
3. Results
3.1. Dataset and Experimental Parameter Settings
3.1.1. Dataset
- Original training subset: 500 images (100 per category), used exclusively for data augmentation;
- Original validation subset: 500 images (100 per category), kept raw for final model evaluation.
1. Image resizing for standardization;
2. Basic enhancement;
3. Texture perturbation;
4. Color space transformation;
5. Mixed enhancement (a hedged pipeline sketch follows this list).
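For illustration, the five augmentation families above could be composed with torchvision. All parameter values below, including the 256 × 256 size, are placeholder assumptions; this outline does not preserve the paper's exact settings.

```python
import torchvision.transforms as T

# Illustrative composition of the five augmentation families above.
# Every parameter value here is an assumption, not the paper's setting.
resize = T.Resize((256, 256))                                        # 1. resizing
basic = T.Compose([T.RandomHorizontalFlip(), T.RandomRotation(15)])  # 2. basic
texture = T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0))            # 3. texture
color = T.ColorJitter(brightness=0.4, contrast=0.4,
                      saturation=0.4, hue=0.1)                       # 4. color
mixed = T.Compose([resize, basic, texture, color])                   # 5. mixed
train_transform = T.Compose([mixed, T.ToTensor()])
```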
3.1.2. Experimental Parameter Settings
3.1.3. Optimization Strategies
- 4 subsets (3600 images) served as training data to update parameters;
- 1 subset (900 images) acted as internal validation to monitor intermediate performance (split sketched below).
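This is a standard 5-fold protocol over the 4500 augmented images. A minimal sketch, where the `random_state` is an arbitrary choice of ours:

```python
import numpy as np
from sklearn.model_selection import KFold

# 5-fold split of the 4500 augmented training images:
# each fold trains on 3600 and validates on 900, as described above.
indices = np.arange(4500)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # seed is our choice
for fold, (train_idx, val_idx) in enumerate(kfold.split(indices)):
    assert len(train_idx) == 3600 and len(val_idx) == 900
    # train_set = Subset(dataset, train_idx); val_set = Subset(dataset, val_idx)
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```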
3.1.4. Experimental Evaluation Indicators
1. Accuracy: Proportion of correctly predicted samples among all samples, shown in Equation (15):

   $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

2. Precision: Proportion of true positives among all predicted positives, shown in Equation (16):

   $$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

3. Recall: Proportion of actual positives correctly identified as positive, shown in Equation (17):

   $$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

4. F1-score: Harmonic mean of Precision and Recall (suitable for imbalanced data), shown in Equation (18); a short reference implementation follows this list:

   $$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$
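Equations (15)–(18) can be implemented directly. The sketch below macro-averages precision and recall over classes, which is one reasonable reading; note that on the balanced test set (100 images per category) accuracy and macro recall coincide.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Equations (15)-(18) with macro-averaged precision/recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precisions, recalls = [], []
    for c in np.unique(y_true):                       # per-class TP/FP/FN
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision, recall = float(np.mean(precisions)), float(np.mean(recalls))
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```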
3.2. DilateFormer Ablation Experiment
- Accuracy: DilateFormer_x_small reached 97.45% (+1.01% vs. ViT_x_small's 96.44%), DilateFormer_xx_small 95.68% (+0.48% vs. ViT_xx_small's 95.20%), and DilateFormer_small 97.43% (+0.13% vs. ViT_small's 97.30%);
- Computational efficiency: DilateFormer_x_small had 805.07 M FLOPs (−16.2% vs. ViT_x_small's 961.12 M) and 1.76 M parameters (−8.8% vs. ViT_x_small's 1.93 M); DilateFormer_xx_small had 279.10 M FLOPs (−20.6%) and 1.03 M parameters (−18.9%);
- High-resolution (384 × 384) test: DilateFormer_x_small↑384 accuracy rose slightly to 97.46% (FLOPs 1600.32 M), while DilateFormer_xx_small↑384 fell to 95.06% (−0.14% vs. ViT_xx_small), indicating the need to balance receptive field and feature detail in high-resolution scenarios.
- Optimal accuracy (97.45%) at [1,2,3]; accuracy declined with larger rates ([1,2,4] 97.34%, [1,3,5] 97.20%, [2,4,6] 96.45%), as excessive receptive fields caused feature over-smoothing;
- FLOPs (805.07 M) and parameters (1.76 M) remained identical across all rates, since dilation only modulates the receptive field without changing the structure or parameter count (illustrated by the sketch below).
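This invariance is easy to verify: dilation changes which positions a fixed set of weights samples, not how many weights or multiply–accumulates there are. A minimal check with PyTorch's dilated convolutions (the analogous statement holds for the dilated attention windows):

```python
import torch.nn as nn

# A 3x3 convolution has the same weights and the same per-output FLOPs at
# every dilation rate; dilation only changes *where* the 9 taps land.
for d in (1, 2, 3, 4, 6):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(f"dilation={d}: params={sum(p.numel() for p in conv.parameters())}")
# prints the identical count (36,928) for every rate
```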
3.3. DefDilateFormer Ablation Experiment
- Inserted only in Layer 4: Top-1 accuracy increased from 97.45% to 97.70% (+0.25%), F1-score reached 97.73%, but FLOPs rose to 884.45M (+9.9%) and parameters to 2.04M (+15.9%);
- Inserted only in Layer 5: Accuracy reached 98.30% (+0.85%) and F1-score 98.31%, with minimal overhead (FLOPs up only 2.37% to 824.09 M, parameters up 17.0% to 2.06 M), the best cost-effectiveness of the three placements;
- Inserted in both Layers 4–5: Accuracy reached 98.20% (+0.75%), but parameters surged by 32.9% to 2.34 M and FLOPs by 11.4% to 896.84 M, demanding more hardware resources.
3.4. CBM Ablation Experiment
- xx_small: MediCBM improved accuracy by 0.10% (96.70% vs. 96.60% for CBM), precision by 0.13% (96.74%), and F1-score by 0.09% (96.69%);
- x_small: MediCBM boosted accuracy by 0.20% (98.60% vs. 98.40% for CBM) and recall by 0.20% (98.60%), with gains over the MobileViT baseline in precision (+2.15%) and F1-score (+2.17%);
- small: MediCBM reduced Std (0.04 vs. 0.05 for CBM) and raised F1-score to 97.74% (+0.45% vs. MobileViT, against +0.41% for CBM), despite identical accuracy gains (+0.40%). A hedged sketch of CBM and a median-based reading of MediCBM follows.
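MediCBM is defined in the paper's Section 2.5, which this outline does not preserve. Purely as a hedged reading of the name, the sketch below contrasts CBM's mean-token broadcasting [24] with a median-based variant; the 0.5 rescaling and the median choice are our assumptions.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Context Broadcasting [24]: mix the mean token back into every token.
    The 0.5 rescaling is one published formulation and may differ here."""
    def forward(self, x):                 # x: (B, N, C) token sequence
        return 0.5 * (x + x.mean(dim=1, keepdim=True))

class MediCBM(nn.Module):
    """Our *assumed* reading of 'MediCBM': broadcast the per-channel median
    token instead of the mean (see the paper's Section 2.5 for the actual
    definition)."""
    def forward(self, x):
        return 0.5 * (x + x.median(dim=1, keepdim=True).values)

tokens = torch.randn(2, 64, 96)
print(CBM()(tokens).shape, MediCBM()(tokens).shape)   # both (2, 64, 96)
```

A median is less sensitive to outlier tokens than a mean, which would be consistent with the small Std reductions reported above, though that interpretation is ours.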
3.5. Comparison Between MobileViT and MobileViT-DDC
3.5.1. Layer 3: Global Structural Perception
- MobileViT: Attention is scattered. For radially symmetric Dong embroidery, attention spreads over blank backgrounds and non-core contours; for Shui horsehair embroidery, it distributes evenly between stitch contours and blurred backgrounds, failing to aggregate to the linear skeleton.
- MobileViT-DDC: Layer 3 embeds DilateFormer with MSDA (Section 2.3), using dilation rates [1,2,3] for three-scale sampling: the small rate (r = 1) captures local contours, the medium rate (r = 2) structural connections, and the large rate (r = 3) background–target distinctions. High-response regions align with embroidery skeletons: for Qiandongnan Miao embroidery, red regions overlap with totem frames and sub-pattern lines; background interference (blue) decreases by ~30% vs. MobileViT, and overlap with manually annotated contours reaches 82% (higher than MobileViT).
3.5.2. Layer 4: Category-Specific Semantic Alignment
- MobileViT: Semantic misalignment occurs. For Bouyei embroidery (flower-vine patterns), attention distributes evenly between flower centers (most discriminative) and leaves; for Shui horsehair embroidery, it focuses on stitch transitions rather than cores; for Dong embroidery (phoenix motifs), it scatters over the phoenix’s body and background.
- MobileViT-DDC: Layer 4's DilateFormer uses MSDA to split channels into attention heads (dilation rates [1,2,3]): small-rate heads (r = 1) target fine features (flower centers, phoenix heads), medium-rate heads (r = 2) connect structures (vines, wings), and large-rate heads (r = 3) filter irrelevant features (leaves, backgrounds). Post-linear aggregation highlights key features: Bouyei flower-center overlap hits 78% (+45% vs. MobileViT), and leaf/background response is 52% lower than at flower centers; the response at Shui horsehair stitch cores exceeds that at stitch transitions (the reverse of MobileViT); Dong phoenix head/wing response is 3.1× that of the body (vs. MobileViT).
3.5.3. Layer 5: Fine-Grained Detail Localization
- MobileViT: Two limitations. For Shui horsehair embroidery, it misses “segmented stitch nodes”, only responding to stitch directions; for Miao/Dong dragon embroidery, high-response regions are nearly identical, failing to distinguish “serrated vs. smooth scales”.
- MobileViT-DDC: Layer 5's DefDilateFormer (Section 2.1) adds MSDA-based deformable sampling offsets (Section 2.4): for Shui horsehair embroidery, sampling shifts to the "segmented nodes" (dot-like high responses, localization accuracy 89%, higher than MobileViT); for Miao/Dong dragon embroidery, it shifts to "serrated edges" (Miao) and "smooth curves" (Dong), with inter-category discrimination at 76% (higher than MobileViT). One plausible way to compute such overlap/localization scores is sketched below.
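How the overlap and localization percentages quoted in Sections 3.5.1–3.5.3 are computed is not preserved in this outline. One plausible protocol, sketched below purely as an assumption, scores the fraction of top-quantile attention mass that falls inside a manually annotated mask:

```python
import numpy as np

def attention_overlap(attn_map, annotation_mask, q=0.9):
    """Fraction of top-decile attention pixels inside the annotated region.
    The q-quantile threshold and the ratio itself are our assumptions about
    the overlap protocol; the paper's exact procedure is not preserved here."""
    high = attn_map >= np.quantile(attn_map, q)       # high-response region
    return float((high & annotation_mask.astype(bool)).sum() / max(high.sum(), 1))
```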
3.6. Comparison with Competitive Benchmark Models
4. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Lu, T.L.D. The management of intangible cultural heritage in China. In The Routledge Companion to Intangible Cultural Heritage; Routledge: London, UK, 2016; pp. 121–134.
2. Yan, W.J.; Chiou, S.C. The safeguarding of intangible cultural heritage from the perspective of civic participation: The informal education of Chinese embroidery handicrafts. Sustainability 2021, 13, 4958.
3. He, X.; Li, S. Predicaments and Solutions for Minority Handicrafts Industrialization in Southwest of China. In Proceedings of the 2022 International Conference on County Economic Development, Rural Revitalization and Social Sciences (ICCRS 2022), Nanjing, China, 24–26 January 2022; Atlantis Press: Dordrecht, The Netherlands, 2022.
4. Torimaru, T. Similarities of Miao Embroidery and Ancient Chinese Embroidery and Their Cultural Implications. Res. J. Text. Appar. 2011, 15, 52–57.
5. Yu, X. An Overview of the Development of Chinese Embroidery Patterns in the Past Dynasties. In Proceedings of the 2018 International Conference on Sports, Arts, Education and Management Engineering (SAEME 2018), Taiyuan, China, 29–30 June 2018; Atlantis Press: Dordrecht, The Netherlands, 2018; pp. 389–394.
6. Hu, X.; Yang, C.; Fang, F.; Huang, J.; Li, P.; Sheng, B. Msembgan: Multi-stitch embroidery synthesis via region-aware texture generation. IEEE Trans. Vis. Comput. Graph. 2024, 31, 5334–5347.
7. Quan, H.; Li, Y.; Liu, D.; Zhou, Y. Protection of Guizhou Miao Batik Culture Based on Knowledge Graph and Deep Learning. Herit. Sci. 2024, 12, 202.
8. Rousseau, D.; Tsaftaris, S. Data Augmentation Techniques for Deep Learning. Tutorial at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019.
9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90.
10. Zhang, C.; Wu, S.; Chen, J. Identification of Miao Embroidery in Southeast Guizhou Province of China Based on Convolution Neural Network. Autex Res. J. 2021, 21, 198–206.
11. Zhou, Z.; Wang, H.; Zhang, X.; Tao, F.; Ren, Q. Classification model for Chinese traditional embroidery based on Xception-TD. Data Anal. Knowl. Discov. 2022, 6, 338–347.
12. Zhu, C.; Bai, X.; Zhu, J.; Huang, W. Research on Image Recognition of Nantong Shen Embroidery Intangible Cultural Heritage Based on Improved MobileNetV3. Preprint (Version 1), Research Square, 2024. Available online: https://www.researchsquare.com/article/rs-5300929/v1 (accessed on 22 July 2025).
13. Zhu, J.; Zhu, C. Research on the innovative application of Shen Embroidery cultural heritage based on convolutional neural network. Sci. Rep. 2024, 14, 9574.
14. Zhao, Y.; Fan, Z.; Yao, H.; Zhang, T.; Seng, B. Automatic Classification and Recognition of Qinghai Embroidery Images Based on the SE-ResNet152V2 Model. IET Image Process. 2025, 19, e70108.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
17. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training Data-Efficient Image Transformers & Distillation Through Attention. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021.
18. Naseer, M.; Ranasinghe, K.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Intriguing Properties of Vision Transformers. In Advances in Neural Information Processing Systems, 6–14 December 2021; Volume 34, pp. 23296–23308.
19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
20. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
21. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
22. Jiao, J.; Tang, Y.-M.; Lin, K.-Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.-S.; Ma, J. DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919.
23. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
24. Hyeon-Woo, N.; Yu-Ji, K.; Heo, B.; Han, D.; Oh, S.J.; Oh, T.-H. Scratching Visual Transformer's Back with Uniform Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.
25. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803.
26. He, W.; Han, K.; Nie, Y.; Wang, C.; Wang, Y. Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition. arXiv 2023, arXiv:2309.14183.
27. Shah, R.; Bhatti, N.; Akhtar, N.; Khalil, S.; García-Magariño, I. Random Patterns Clothing Image Retrieval Using Convolutional Neural Network. In Proceedings of the 2020 International Conference on Emerging Trends in Smart Technologies (ICETST), Karachi, Pakistan, 26–27 March 2020; IEEE: Piscataway, NJ, USA, 2020.
28. Zafar, A.; Aamir, M.; Nawi, N.M.; Arshad, A.; Riaz, S.; Alruban, A.; Dutta, A.K.; Almotairi, S. A Comparison of Pooling Methods for Convolutional Neural Networks. Appl. Sci. 2022, 12, 8643.
| Layer | Mode | Out Channels | ViT Channels | FFN Dim | ViT Blocks | Patch Size |
|---|---|---|---|---|---|---|
| Layer 3 | xx_small | 48 | 64 | 128 | 2 | 2 × 2 |
| Layer 3 | x_small | 64 | 96 | 192 | 2 | 2 × 2 |
| Layer 3 | small | 96 | 144 | 288 | 2 | 2 × 2 |
| Layer 4 | xx_small | 64 | 80 | 160 | 4 | 2 × 2 |
| Layer 4 | x_small | 80 | 120 | 240 | 4 | 2 × 2 |
| Layer 4 | small | 128 | 192 | 384 | 4 | 2 × 2 |
| Layer 5 | xx_small | 80 | 96 | 192 | 3 | 2 × 2 |
| Layer 5 | x_small | 96 | 144 | 288 | 3 | 2 × 2 |
| Layer 5 | small | 160 | 240 | 480 | 3 | 2 × 2 |
| Parameters | Values |
|---|---|
| Learning rate | 1 × 10⁻³ |
| Weight decay | 5 × 10⁻⁵ |
| Image size | |
| Epoch | 200 |
| Batch size | 64 |
| Learning rate decay | 1 × 10⁻³ |
| Architecture | Acc@1 | Std | F1-Score | Precision | Recall | FLOPs [M] | Params [M] |
|---|---|---|---|---|---|---|---|
| ViT_xx_small | 95.20 | 0.01 | 95.20 | 95.22 | 95.20 | 351.54 | 1.27 |
| DilateFormer_xx_small | 95.68 (+0.48) | 0.04 | 95.61 | 95.65 | 95.60 | 279.10 | 1.03 |
| DilateFormer_xx_small↑384 | 95.06 (−0.14) | 0.04 | 95.01 | 95.11 | 95.00 | 627.26 | 1.03 |
| ViT_x_small | 96.44 | 0.03 | 96.42 | 96.45 | 96.40 | 961.12 | 1.93 |
| DilateFormer_x_small | 97.45 (+1.01) | 0.01 | 97.43 | 97.45 | 97.45 | 805.07 | 1.76 |
| DilateFormer_x_small↑384 | 97.46 (+1.02) | 0.05 | 97.41 | 97.47 | 97.40 | 1600.32 | 1.76 |
| ViT_small | 97.30 | 0.06 | 97.29 | 97.33 | 97.30 | 1885.94 | 5.57 |
| DilateFormer_small | 97.43 (+0.13) | 0.02 | 97.40 | 97.40 | 97.40 | 1514.42 | 4.12 |
| DilateFormer_small↑384 | 97.51 (+0.21) | 0.03 | 97.50 | 97.50 | 97.50 | 3406.66 | 4.12 |
| Dilation Rate | Acc@1 | FLOPs [M] | Params [M] |
|---|---|---|---|
| [1,2,3] | 97.45 | 805.07 | 1.76 |
| [1,2,4] | 97.34 (−0.11) | 805.07 | 1.76 |
| [1,3,5] | 97.20 (−0.25) | 805.07 | 1.76 |
| [2,4,6] | 96.45 (−1.00) | 805.07 | 1.76 |
| Module | Front | Mid | End | Acc@1 | Std | F1-Score | Precision | Recall | FLOPs [M] | Params [M] |
|---|---|---|---|---|---|---|---|---|---|---|
| DilateFormer | ✓ | ✓ | ✓ | 97.45 | 0.01 | 97.43 | 97.45 | 97.45 | 805.07 | 1.76 |
| DefDilateFormer | ✕ | ✓ | ✕ | 97.70 (+0.25) | 0.03 | 97.73 | 97.71 | 97.70 | 884.45 | 2.04 |
| DefDilateFormer | ✕ | ✕ | ✓ | 98.30 (+0.85) | 0.04 | 98.31 | 98.29 | 98.30 | 824.09 | 2.06 |
| DefDilateFormer | ✕ | ✓ | ✓ | 98.20 (+0.75) | 0.05 | 98.20 | 98.22 | 98.20 | 896.84 | 2.34 |

Front/Mid/End denote the insertion position; per Section 3.3, Mid corresponds to Layer 4 and End to Layer 5.
| Module | Front | Mid | End | Acc@1 | Std | F1-Score | Precision | Recall | FLOPs [M] | Params [M] |
|---|---|---|---|---|---|---|---|---|---|---|
| DefDilateFormer | – | – | – | 98.30 | 0.04 | 98.31 | 98.29 | 98.30 | 824.09 | 2.06 |
| CBM | ✓ | ✕ | ✕ | 98.30 (+0) | 0.04 | 98.31 | 98.29 | 98.30 | 824.09 | 2.06 |
| CBM | ✕ | ✓ | ✕ | 98.30 (+0) | 0.06 | 98.31 | 98.29 | 98.30 | 824.09 | 2.06 |
| CBM | ✕ | ✕ | ✓ | 98.40 (+0.10) | 0.01 | 98.39 | 98.41 | 98.40 | 824.09 | 2.06 |
| Architecture | Acc@1 | Std | F1-Score | Precision | Recall | FLOPs [M] | Params [M] |
|---|---|---|---|---|---|---|---|
| xx_small | | | | | | | |
| MobileViT | 95.20 | 0.01 | 95.20 | 95.22 | 95.20 | 351.54 | 1.27 |
| MobileViT-DDC (CBM) | 96.60 (+1.40) | 0.05 | 96.60 (+1.40) | 96.61 (+1.39) | 96.60 (+1.40) | 288.24 | 1.17 |
| MobileViT-DDC (MediCBM, ours) | 96.70 (+1.50) | 0.04 | 96.69 (+1.49) | 96.74 (+1.52) | 96.70 (+1.50) | 288.24 | 1.17 |
| x_small | | | | | | | |
| MobileViT | 96.44 | 0.03 | 96.42 | 96.45 | 96.40 | 961.12 | 1.93 |
| MobileViT-DDC (CBM) | 98.40 (+1.96) | 0.01 | 98.39 (+1.97) | 98.41 (+1.96) | 98.40 (+2.00) | 824.09 | 2.06 |
| MobileViT-DDC (MediCBM, ours) | 98.60 (+2.16) | 0.08 | 98.59 (+2.17) | 98.60 (+2.15) | 98.60 (+2.20) | 824.09 | 2.06 |
| small | | | | | | | |
| MobileViT | 97.30 | 0.06 | 97.29 | 97.33 | 97.30 | 1885.94 | 5.57 |
| MobileViT-DDC (CBM) | 97.70 (+0.40) | 0.05 | 97.70 (+0.41) | 97.72 (+0.39) | 97.70 (+0.40) | 1563.82 | 4.90 |
| MobileViT-DDC (MediCBM, ours) | 97.70 (+0.40) | 0.04 | 97.74 (+0.45) | 97.72 (+0.39) | 97.70 (+0.40) | 1563.82 | 4.90 |
| Dataset | Model | FLOPs [M] | Params [M] | Acc@1 | F1-Score | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Guizhou Intangible Cultural Heritage Embroidery Dataset | EfficientNetV2_S | 3784.40 | 21.45 | 94.10 | 94.13 | 94.23 | 94.10 |
| | EfficientNetV2_L | 7102.51 | 54.13 | 96.80 | 96.79 | 96.81 | 96.80 |
| | MobileViT_x_small | 961.12 | 1.93 | 96.44 | 96.42 | 96.45 | 96.40 |
| | MobileNetV3_S | 80.59 | 2.54 | 90.40 | 90.51 | 90.83 | 90.40 |
| | MobileNetV3_L | 304.71 | 5.48 | 95.10 | 95.10 | 95.22 | 95.10 |
| | ShuffleNet_V2_X0_5 | 57.90 | 1.36 | 91.00 | 90.99 | 91.28 | 91.00 |
| | ShuffleNet_V2_X1_0 | 199.14 | 2.27 | 96.80 | 96.78 | 96.82 | 96.80 |
| | Xception | 72,810.48 | 22.85 | 98.30 | 98.39 | 98.30 | 98.30 |
| | CoatNet | 6869.81 | 33.08 | 97.90 | 97.89 | 97.91 | 97.90 |
| | ResNet50 | 5398.54 | 25.55 | 97.20 | 97.19 | 97.20 | 97.20 |
| | EdgeViT_XS | 1327.19 | 6.77 | 84.70 | 84.52 | 85.20 | 84.70 |
| | EdgeViT_S | 2477.89 | 13.10 | 90.40 | 90.39 | 90.49 | 90.40 |
| | GhostNetV2 | 244.38 | 6.15 | 89.40 | 89.46 | 89.78 | 89.40 |
| | MobileViT-DDC (ours) | 824.09 | 2.06 | 98.40 | 98.39 | 98.41 | 98.40 |
| Dataset | Model | FLOPs [M] | Params [M] | Acc@1 | F1-Score | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Pakistani National Dress Dataset | EfficientNetV2_S | 3784.40 | 21.45 | 70.78 | 70.98 | 72.02 | 70.78 |
| | EfficientNetV2_L | 7102.51 | 54.13 | 74.86 | 74.91 | 76.10 | 74.86 |
| | MobileViT_x_small | 961.12 | 1.93 | 76.44 | 76.55 | 78.35 | 76.44 |
| | MobileNetV3_S | 80.59 | 2.54 | 60.39 | 59.86 | 61.02 | 60.39 |
| | MobileNetV3_L | 304.71 | 5.48 | 67.57 | 67.79 | 69.25 | 67.57 |
| | ShuffleNet_V2_X0_5 | 57.90 | 1.36 | 63.28 | 63.34 | 64.69 | 63.28 |
| | ShuffleNet_V2_X1_0 | 199.14 | 2.27 | 68.42 | 68.86 | 71.64 | 68.42 |
| | Xception | 72,810.48 | 22.85 | 70.00 | 69.81 | 70.65 | 70.00 |
| | CoatNet | 6869.81 | 33.08 | 78.57 | 78.63 | 79.99 | 78.57 |
| | ResNet50 | 5398.54 | 25.55 | 78.13 | 78.27 | 79.11 | 78.13 |
| | EdgeViT_XS | 1327.19 | 6.77 | 68.28 | 68.42 | 69.65 | 68.28 |
| | EdgeViT_S | 2477.89 | 13.10 | 74.47 | 74.40 | 75.05 | 74.47 |
| | GhostNetV2 | 244.38 | 6.15 | 69.86 | 69.98 | 71.32 | 69.86 |
| | MobileViT-DDC (ours) | 824.09 | 2.06 | 79.07 | 79.25 | 80.22 | 79.07 |