YOLOv10-TWD: An Improved YOLOv10n for Terracotta Warrior Recognition
Abstract
1. Introduction
- Construction of a bespoke Terracotta Army dataset. We curated a comprehensive dataset of 5796 images across nine categories (including General, Military Officer, and Cavalry figures), captured from varying angles and under varying lighting conditions, providing a robust foundation for classification and recognition tasks.
- Development of the YOLOv10-TWD detection framework. Building upon the YOLOv10n backbone, we adapt and integrate three established high-efficiency modules for this task. Specifically, we embed CAFM in the deep detection path to sharpen the model’s perception of subtle morphological differences; deploy DualConv in the semantic fusion branch to enhance fine-grained detail modeling without overfitting; and use GSConv in the Neck as a structural “adhesive” to mitigate information barriers and compress computational cost.
- Establishing an efficient and robust benchmark for Terracotta Warrior recognition. Through comprehensive ablation and comparative experiments, we validate the efficacy of the proposed architectural synergy. Compared to the baseline YOLOv10n, YOLOv10-TWD achieves a 7.63-percentage-point gain in mAP@0.5 (87.65% → 95.28%) and a 6.66% relative increase in inference throughput (156.25 → 166.66 FPS); a quick arithmetic check of these figures follows this list. This demonstrates a superior trade-off between recognition accuracy and deployment efficiency, providing a highly pragmatic solution for real-time museum guidance systems.
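The two headline numbers can be verified directly against the result tables in Section 3. A minimal reader-side check, with the values copied from the ablation and comparison tables:

```python
# Values taken from the ablation and comparison tables (Section 3).
base_map, twd_map = 87.65, 95.28     # mAP@0.5 (%): YOLOv10n vs. YOLOv10-TWD
base_fps, twd_fps = 156.25, 166.66   # frames per second

print(f"mAP@0.5 gain: {twd_map - base_map:.2f} percentage points")  # 7.63
print(f"FPS gain: {(twd_fps - base_fps) / base_fps:.2%}")           # 6.66%
```

Note that the accuracy gain is absolute (percentage points), while the speed gain is relative.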
2. Materials and Methods
2.1. Dataset Construction and Preprocessing
2.1.1. Analysis of Visual Features for Different Types of Terracotta Warriors
2.1.2. Image Preprocessing and Data Augmentation Strategies
2.2. The Proposed YOLOv10-TWD Network Architecture
2.2.1. Overview of the YOLOv10n Baseline
2.2.2. Network Architecture of YOLOv10-TWD
- (1) CAFM Attention Module: Integrated into the detection head, the Convolution-Attention Fusion Module (CAFM) reinforces the model’s responsiveness to salient regions. This mechanism effectively suppresses background noise, thereby improving target discrimination in complex, cluttered museum environments.
- (2) DualConv Module: This module replaces standard convolutional structures within specific sections of the backbone network. By leveraging dual-kernel designs, it enhances the model’s capacity for fine-grained detail modeling and multi-scale feature representation, significantly improving adaptability to pose variations and occluded targets.
- (3) GSConv Module: Integrated into the feature fusion layer (Neck), the Ghost-Shuffle Convolution (GSConv) combines the advantages of standard convolution and depth-wise separable convolution. This integration compresses parameter count and computational overhead while optimizing the efficiency of channel information representation, ensuring a lightweight yet robust feature fusion process.
2.2.3. CAFM: Convolution and Attention Fusion Module
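The fusion can be pictured as a local convolutional branch running in parallel with a global, channel-wise attention branch whose outputs are summed. The PyTorch sketch below illustrates this structure in the spirit of CAFM (Hu et al., 2024); it is a simplified reconstruction, not the authors’ exact implementation, and it assumes the channel count is divisible by the number of heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAFMSketch(nn.Module):
    """Simplified convolution-attention fusion: a local convolutional branch
    plus a global, channel-wise multi-head attention branch, summed at the
    output. Illustrative only; `dim` must be divisible by `num_heads`."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        # Local branch: 1x1 pointwise conv followed by a 3x3 depth-wise conv.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
        )
        # Global branch: attention computed across channels, not pixels.
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local_out = self.local(x)

        q, k, v = self.qkv(x).chunk(3, dim=1)
        # (b, heads, c/heads, h*w): channel-wise attention keeps the cost
        # linear in spatial size, unlike spatial self-attention.
        q = F.normalize(q.view(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        k = F.normalize(k.view(b, self.num_heads, c // self.num_heads, h * w), dim=-1)
        v = v.view(b, self.num_heads, c // self.num_heads, h * w)

        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (b, heads, c', c')
        global_out = (attn.softmax(dim=-1) @ v).view(b, c, h, w)
        return self.proj(global_out) + local_out
```

A quick shape check: `CAFMSketch(64)(torch.randn(1, 64, 20, 20))` returns a `(1, 64, 20, 20)` tensor, consistent with a drop-in block in the deep detection path.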
2.2.4. DualConv: Dual Convolutional Module
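DualConv pairs a grouped 3×3 kernel with a parallel 1×1 pointwise kernel over the same input and sums the results, so spatial detail and full cross-channel mixing are both retained at a fraction of a standard 3×3 convolution’s cost. A minimal sketch following Zhong et al. (2023), simplified for illustration:

```python
import torch
import torch.nn as nn

class DualConvSketch(nn.Module):
    """Dual convolutional kernels (simplified): a grouped 3x3 conv captures
    spatial detail while a parallel 1x1 conv preserves cross-channel mixing;
    the branches are summed before BN and activation. Both `in_ch` and
    `out_ch` must be divisible by `groups`."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, groups: int = 2):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                                 padding=1, groups=groups, bias=False)
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride,
                                 bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the two kernels approximates a dense 3x3 conv at lower cost.
        return self.act(self.bn(self.conv3x3(x) + self.conv1x1(x)))
```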
2.2.5. GSConv: Ghost-Shuffle Convolution
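GSConv generates half of the output channels with a standard convolution, derives the other half from them with a cheap depth-wise convolution, and then shuffles the channels so the two halves interact. A minimal sketch after Li et al. (2024); details such as kernel sizes follow the slim-neck paper’s common configuration and may differ from the authors’ exact layer settings:

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channel groups so information flows between them."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSConvSketch(nn.Module):
    """Ghost-Shuffle convolution (simplified): a dense conv produces half the
    output channels, a cheap 5x5 depth-wise conv derives the other half from
    them, and a shuffle mixes the halves. `out_ch` must be even."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 1, stride: int = 1):
        super().__init__()
        half = out_ch // 2
        self.dense = nn.Sequential(
            nn.Conv2d(in_ch, half, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(  # depth-wise: one 5x5 filter per channel
            nn.Conv2d(half, half, 5, 1, 2, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.dense(x)
        return channel_shuffle(torch.cat((y, self.cheap(y)), dim=1), groups=2)
```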
3. Experiments and Analysis
3.1. Experimental Environment and Training Parameters
3.2. Performance Evaluation Metrics
- Accuracy Metrics:
  - mAP@0.5: The mean Average Precision (mAP) calculated at an Intersection over Union (IoU) threshold of 0.5. This metric measures the model’s overall detection capability at a lower overlap threshold.
  - mAP@0.5:0.95: The mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. This metric provides a comprehensive assessment of detection performance across varying degrees of overlap strictness.
- Complexity Metrics:
  - Params (Parameters): The total number of parameters in the network, reflecting the model’s scale and storage requirements. A lower parameter count typically indicates a more lightweight model with greater deployment flexibility.
  - FLOPs (Floating Point Operations): The computational cost of the model’s inference process, expressed in GFLOPs (Giga Floating-point Operations). FLOPs serve as a critical indicator of the model’s computational efficiency.
- Speed Metrics (their relationship is sketched below):
  - Preprocess Time: The time consumed preprocessing a single image prior to inference (in milliseconds, ms), reflecting efficiency in the data preparation stage.
  - Inference Time: The time required for a single image to complete the forward inference pass, directly reflecting the model’s response speed in practical applications.
  - FPS (Frames Per Second): The number of image frames the model can process per second, comprehensively evaluating continuous inference capability; higher FPS signifies superior real-time performance.
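A note on how the speed metrics relate: the result tables are consistent with FPS being computed as 1000 / (preprocess time + inference time), both in milliseconds. The snippet below is a reader-side check of that relation against two table rows, not the authors’ code, together with a minimal IoU helper of the kind underlying the mAP metrics:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def fps(preprocess_ms: float, inference_ms: float) -> float:
    """Throughput as the result tables appear to define it."""
    return 1000.0 / (preprocess_ms + inference_ms)

assert abs(fps(2.9, 3.5) - 156.25) < 0.01   # YOLOv10n baseline row
assert abs(fps(2.3, 3.7) - 166.67) < 0.01   # YOLOv10-TWD row (tabled as 166.66)
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-9  # sanity check
```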
3.3. Overall Performance and Per-Class Analysis
3.4. Ablation Studies
3.5. Comparative Experiments
3.5.1. Performance Comparison of Different Attention Mechanisms
3.5.2. Comparative Experiments of Different YOLO Object Detection Algorithms
3.6. Visualization Analysis
4. Conclusions
- CAFM Attention Module: By aggregating local and global features, this module enhances the model’s spatial awareness of critical structural regions, effectively mitigating localization errors caused by partial damage or occlusion.
- DualConv Module: The heterogeneous dual-branch design optimizes the modeling of fine-grained details and global contours, significantly improving discriminative robustness against visually similar figure categories.
- GSConv Module: By streamlining feature pathways and enhancing cross-channel interaction, this module reduces computational redundancy, ensuring the model’s real-time performance on edge deployment devices.
- Extreme Parameter Control: Future work will explore advanced model compression techniques, such as knowledge distillation and structural pruning, to further reduce the parameter count for deployment on ultra-resource-constrained edge hardware without sacrificing detection precision.
- Small-Scale Target Robustness: We aim to incorporate more sophisticated multi-scale feature aggregation mechanisms (e.g., BiFPN or specialized small-object detection heads) to enhance the model’s discriminative stability for extremely small targets in wide-angle museum scenes.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full Term |
|---|---|
| YOLO | You Only Look Once |
| TWD | Terracotta Warrior Detection |
| CAFM | Convolution-Attention Fusion Module |
| DualConv | Dual Convolution |
| GSConv | Ghost-Shuffle Convolution |
| mAP | Mean Average Precision |
| IoU | Intersection over Union |
| FLOPs | Floating Point Operations |
| FPS | Frames Per Second |
| LMMs | Large Multi-modal Models |
| CNN | Convolutional Neural Network |
| BN | Batch Normalization |
| DWConv | Depth-wise Separable Convolution |
| SE | Squeeze-and-Excitation |
| CBAM | Convolutional Block Attention Module |
| MHSA | Multi-Head Self-Attention |
| DCN | Deformable Convolutional Networks |
References
- Tuo, Y.; Wu, J.; Zhao, J.; Si, X. Artificial intelligence in tourism: Insights and future research agenda. Tour. Rev. 2025, 80, 793–812. [Google Scholar] [CrossRef]
- Zhao, F.Q.; Zhou, M.Q. Automatic matching method of cultural relic fragments based on multi-feature parameter fusion. Opt. Precis. Eng. 2023, 31, 1522–1531. [Google Scholar] [CrossRef]
- Liu, J.; Ge, Y.F.; Tian, M. Research on super-resolution reconstruction algorithm of cultural relic images. Acta Electron. Sin. 2023, 51, 139–145. [Google Scholar]
- Onkhar, V.; Kumaaravelu, L.T.; Dodou, D.; de Winter, J.C.F. Towards Context-Aware Safety Systems: Design Explorations Using Eye-Tracking, Object Detection, and GPT-4V; Technical Report; Delft University of Technology: Delft, The Netherlands, 2024. [Google Scholar]
- Limberg, C.; Gonçalves, A.; Rigault, B.; Prendinger, H. Leveraging YOLO-World and GPT-4V LMMs for zero-shot person detection and action recognition in drone imagery. arXiv 2024, arXiv:2404.01571. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933. [Google Scholar] [CrossRef] [PubMed]
- Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4488–4499. [Google Scholar]
- Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. GOLD-YOLO: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 36, 30902–30915. [Google Scholar]
- Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
- Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
- Jocher, G. YOLOv5 Release v7.0. 2022. Available online: https://github.com/ultralytics/yolov5/tree/v7.0 (accessed on 10 January 2026).
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2026).
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
- Tan, H.; Liu, X.; Yin, B.; Li, X. MHSA-Net: Multihead self-attention network for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8210–8224. [Google Scholar] [CrossRef] [PubMed]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
- Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5504005. [Google Scholar] [CrossRef]
- Zhong, J.C.; Chen, J.Y.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9528–9535. [Google Scholar] [CrossRef] [PubMed]
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
- Jin, Y. A study on the morphological characteristics of the figures of the Terracotta Warriors of Qin Shi Huang. J. Korea Soc. Ceram. Art 2021, 18, 5–31. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 37, 107984–108011. [Google Scholar]
- Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Varghese, R.; M, S. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 12–14 March 2024; pp. 1–6. [Google Scholar]
- Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
| Warrior Type | Prototypical Visual Features |
|---|---|
| Kneeling Archer | Hair bun on the left, wearing armor, in a half-kneeling posture. |
| Standing Archer | Hair bun, light robe, arms in bow-pulling pose, with symmetrical movement. |
| Chariot Soldier | Standing beside a chariot, wearing armor, and holding long-shafted weapons. |
| General Figure | Pheasant-tail crown, armor with tassels, complex details, majestic posture. |
| Robe Warrior | Wearing robes, natural standing posture, and robust physique. |
| Charioteer | Long crown, arms forward as if driving, hand guards, and compact movement. |
| Military Officer | Single or double plate crown, clothing distinct from soldiers, serious expression. |
| Armored Warrior | Wearing armor, natural standing posture, and robust physique. |
| Cavalry | Leather cap, tight sleeves, short boots, and small armor; pose adapted for riding. |
| Hyperparameter | Value |
|---|---|
| Initial Learning Rate | 0.01 |
| Cyclic Learning Rate | 0.01 |
| Momentum | 0.937 |
| Batch Size | 32 |
| Image Size | 640 |
| Training Epochs | 1000 (with early stopping) |
| Warmup Epochs | 3.0 |
| Warmup Momentum | 0.8 |
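Assuming the standard Ultralytics training interface (which supports YOLOv10 models), the table above maps onto a training call like the following. The dataset configuration file name is a placeholder, and we read the table’s “Cyclic Learning Rate” as the Ultralytics `lrf` factor, which is an assumption:

```python
from ultralytics import YOLO

# Hypothetical training call mirroring the hyperparameter table above;
# "terracotta.yaml" stands in for the authors' 9-class dataset config.
model = YOLO("yolov10n.yaml")  # YOLOv10n baseline, before TWD modifications
model.train(
    data="terracotta.yaml",
    epochs=1000,             # upper bound; early stopping cuts training short
    batch=32,
    imgsz=640,
    lr0=0.01,                # initial learning rate
    lrf=0.01,                # final LR factor ("Cyclic Learning Rate" above)
    momentum=0.937,
    warmup_epochs=3.0,
    warmup_momentum=0.8,
)
```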
| Category | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall |
|---|---|---|---|---|
| Kneeling Archer | 0.9958 | 0.8639 | 0.9895 | 1.0000 |
| Standing Archer | 0.9954 | 0.8475 | 0.9790 | 1.0000 |
| Chariot Soldier | 0.9713 | 0.8243 | 0.8954 | 0.9512 |
| General Figure | 0.9676 | 0.7908 | 0.9764 | 0.9074 |
| Robe Warrior | 0.9447 | 0.6866 | 0.9345 | 0.8146 |
| Charioteer | 0.9356 | 0.7461 | 0.9678 | 0.8025 |
| Military Officer | 0.9281 | 0.6858 | 0.9209 | 0.7359 |
| Armored Warrior | 0.9269 | 0.6732 | 0.8964 | 0.7803 |
| Cavalry | 0.9095 | 0.6698 | 0.9110 | 0.7324 |
| All (mAP) | 0.9528 | 0.7542 | 0.9523 | 0.8583 |
| No. | CAFM | DualConv | GSConv | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Preprocess (ms) ↓ | Inference (ms) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | × | × | × | 87.65 | 63.92 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25 |
| 2 | ✓ | × | × | 88.91 | 64.52 | 3.06 | 8.7 | 3.4 | 3.4 | 147.06 |
| 3 | × | ✓ | × | 93.99 | 72.93 | 2.69 | 8.4 | 1.6 | 3.4 | 200.00 |
| 4 | × | × | ✓ | 92.43 | 72.37 | 2.63 | 8.3 | 1.3 | 3.4 | 212.76 |
| 5 | ✓ | ✓ | × | 88.72 | 64.26 | 3.04 | 8.6 | 2.3 | 3.5 | 172.41 |
| 6 | × | ✓ | ✓ | 92.54 | 72.14 | 2.72 | 8.4 | 2.3 | 3.6 | 169.49 |
| 7 | ✓ | ✓ | ✓ | 95.28 | 75.54 | 3.07 | 8.6 | 2.3 | 3.7 | 166.66 |
| Model | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Preprocess (ms) ↓ | Inference (ms) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| YOLOv10n | 87.65 | 63.92 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25 |
| YOLOv10n + SE | 88.31 | 64.64 | 2.71 | 8.4 | 2.5 | 3.4 | 169.49 |
| YOLOv10n + CBAM | 88.91 | 64.52 | 2.71 | 8.6 | 4.3 | 3.3 | 131.58 |
| YOLOv10n + CAFM | 88.94 | 64.86 | 3.07 | 8.7 | 3.4 | 3.4 | 147.06 |
| Model | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Preprocess (ms) ↓ | Inference (ms) ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| YOLOv5n | 87.65 | 63.92 | 1.77 | 4.3 | 0.4 | 8.6 | 111.11 |
| YOLOv6n | 79.39 | 55.23 | 4.16 | 11.5 | 1.5 | 4.3 | 172.41 |
| YOLOv7-Tiny | 89.10 | 61.90 | 6.03 | 13.2 | 8.3 | 1.1 | 106.38 |
| YOLOv8n | 87.41 | 62.86 | 3.01 | 8.2 | 1.1 | 3.9 | 200.00 |
| YOLOv9n | 88.39 | 66.19 | 1.76 | 6.4 | 1.3 | 4.5 | 153.85 |
| YOLOv10n | 87.65 | 63.93 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25 |
| YOLOv11n | 90.26 | 68.01 | 2.59 | 6.3 | 1.5 | 3.6 | 172.41 |
| YOLOv12n | 85.71 | 61.60 | 2.80 | 7.3 | 2.1 | 3.5 | 178.27 |
| YOLOv10-TWD | 95.28 | 75.54 | 3.07 | 8.6 | 2.3 | 3.7 | 166.66 |