Article

YOLOv10-TWD: An Improved YOLOv10n for Terracotta Warrior Recognition

Yalin Li, Liang Wang, Xinyuan Zhang, Sijie Dong and Xinjuan Zhu
1 School of Computer Science, Xi’an Polytechnic University, No. 58, Shangu Avenue, Lintong District, Xi’an 710048, China
2 Key Scientific Research Base of Ancient Polychrome Pottery Conservation, Emperor Qinshihuang’s Mausoleum Site Museum, Xi’an 710600, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(5), 2616; https://doi.org/10.3390/app16052616
Submission received: 10 February 2026 / Revised: 2 March 2026 / Accepted: 6 March 2026 / Published: 9 March 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

To address challenges such as complex backgrounds, partial occlusion, and high similarity of details in Terracotta Warrior image recognition, this paper proposes a lightweight detection method, YOLOv10-TWD, based on an improved YOLOv10n. Specifically, a lightweight Convolution-Attention Fusion Module (CAFMAttention) and a dual-branch feature extraction structure (DualConv) are integrated into the detection head to enhance the model’s focus on fine-grained features and its discriminative robustness under partial damage conditions. In the Neck network, Ghost-Shuffle Convolution (GSConv) is introduced to compress the computational cost of multi-scale feature fusion while strengthening context-aware capabilities. Experimental results on a self-built Terracotta Warrior dataset demonstrate that the proposed method achieves a 7.63% improvement in mAP@0.5 compared to the baseline YOLOv10n, while simultaneously achieving a 6.66% increase in inference speed. The model achieves high precision alongside significant optimization in inference efficiency, making it well-suited for rapid recognition tasks in cultural heritage and museum scenarios.

1. Introduction

In recent years, with the rapid development of intelligent tourism systems, the automatic recognition of cultural relic images has gradually become a crucial supporting technology for cultural dissemination and public education [1]. For the Terracotta Warriors, a world-class cultural heritage, automatic classification and recognition face numerous challenges in practical applications [2]. Due to the high density of targets, diverse poses, frequent occlusions, uneven illumination, and complex backgrounds within the burial pits, existing detection models are highly prone to false positives and missed detections, which imposes extremely high requirements on the precision and robustness of algorithms [3].
While large-scale multi-modal models (LMMs) exhibit exceptional prowess in general image-text understanding, they often lack standardized structures for object detection and entail prohibitive computational overhead. Consequently, they remain ill-suited for structured visual tasks like Terracotta Warrior detection, which require high sensitivity to multi-object positioning [4,5]. Therefore, adopting lightweight visual detection models with transparent architectures and controllable performance remains the most pragmatic and efficient path. In this context, the YOLO (You Only Look Once) series has emerged as a benchmark, successfully applied in real-time scenarios such as autonomous driving and intelligent surveillance due to its superior trade-off between detection accuracy, inference speed, and resource consumption [6,7].
Despite the significant performance leaps achieved through successive iterations of the YOLO framework [8,9,10], several bottlenecks persist in the specific task of Terracotta Warrior detection. The high inter-class similarity, dense clustering, and intricate backgrounds limit the model’s ability to capture fine-grained features while maintaining efficiency [11,12,13]. To mitigate these issues, researchers have attempted to integrate advanced modules such as multi-head self-attention (MHSA) for handling occlusion [14] and deformable convolution (DCN) for adapting to geometric variations [15]. However, effectively integrating these powerful—yet often computationally intensive—modules into a lightweight framework remains a critical challenge. This paper addresses this gap by proposing a synergistic fusion of established lightweight operators. Specifically, we adapt the Convolution and Attention Fusion Module (CAFM) [16], DualConv [17], and Ghost-Shuffle Convolution (GSConv) [18] to create a balanced and efficient detection head and neck network tailored for cultural heritage imagery.
In this paper, we propose YOLOv10-TWD, an improved object detection method based on the lightweight YOLOv10n model, specifically tailored for Terracotta Warrior recognition. Our approach aims to bolster detection accuracy and inference efficiency in complex cultural relic scenarios. The primary contributions are as follows:
  • Construction of a bespoke Terracotta Army dataset. We curated a comprehensive dataset comprising 5796 images across nine categories (including General, Military Officer, and Cavalry figures) under varying angles and lighting conditions, providing a robust foundation for classification and recognition tasks.
  • Development of the YOLOv10-TWD detection framework. Building upon the YOLOv10n backbone, we achieved a task-specific adaptation and synergistic integration of three established high-efficiency modules. Specifically, we strategically embedded CAFM in the deep detection path to sharpen the model’s perception of subtle morphological differences; deployed DualConv in the semantic fusion branch to enhance fine-grained detail modeling without over-fitting; and utilized GSConv in the Neck as a structural “adhesive” to mitigate information barriers and compress computational costs.
  • Establishing an efficient and robust benchmark for Terracotta Warrior recognition. Through comprehensive ablation and comparative experiments, we validate the efficacy of the proposed architectural synergy. Compared to the baseline YOLOv10n, YOLOv10-TWD achieves a 7.63% increase in mAP@0.5 and a 6.66% boost in inference speed. This demonstrates a superior trade-off between recognition accuracy and deployment efficiency, providing a highly pragmatic solution for real-time museum guidance systems.

2. Materials and Methods

2.1. Dataset Construction and Preprocessing

2.1.1. Analysis of Visual Features for Different Types of Terracotta Warriors

The Terracotta Warriors serve as high-fidelity physical embodiments of the military system and cultural iconography of the Qin Dynasty. They exhibit pronounced visual discriminative features across dimensions such as morphology, vestments, and armaments, which facilitate the categorization of diverse military branches and hierarchical statuses [19]. From the perspective of computer vision, these distinct morphological characteristics constitute the semantic basis for feature extraction and multi-class recognition.
The primary types of warriors and their corresponding prototypical visual features are summarized in Table 1.

2.1.2. Image Preprocessing and Data Augmentation Strategies

To meet the requirements for dense small object detection and multi-class classification in Terracotta Warrior recognition tasks, this study constructed a dedicated image dataset encompassing nine prototypical categories of targets, such as General Warriors and Junior Officers. Raw images were collected from Pits 1, 2, and 3, as well as the permanent exhibition halls of the Emperor Qinshihuang’s Mausoleum Site Museum. Given the status of the Terracotta Warriors as world cultural heritage, high-quality, multi-angle field imagery is inherently scarce due to strict cultural heritage protection regulations, low-light constraints in exhibition halls, and rigorous shooting distance requirements. A total of 1443 original images were acquired. Despite the relatively small image count compared to general-purpose datasets, the dataset exhibits extremely high annotation density because the warriors typically appear in clusters.
During the data preprocessing stage, poor-quality images with uneven lighting or motion blur were excluded. To enhance the model’s generalization capability in complex archaeological environments, targeted data augmentation was implemented to simulate real-world museum scenarios. Specifically, illumination variation was simulated to mimic the non-uniform light distribution in the pits and exhibition halls. Following the standard configuration of the training pipeline, the brightness perturbation factor was sampled from $[-0.2, +0.2]$, and the contrast scaling factor was constrained within $[0.8, 1.2]$. This controlled variance mitigates the model’s sensitivity to extreme lighting conditions.
Furthermore, archaeological occlusion simulation was introduced using Random Erasing and Grid Mask techniques. In accordance with the standard implementation of these algorithms, the Random Erasing area was limited to a proportion $s_e \in [0.02, 0.15]$ of the total image area, with an aspect ratio $r_e \in [0.3, 3.3]$. For the Grid Mask operation, the occlusion ratio was set to 0.3, ensuring that the majority of the target’s semantic features remained intact for effective learning. Combined with geometric transformations (horizontal flipping, random rotation) and Gaussian noise, the dataset was expanded to 5796 images and uniformly scaled to 640 × 640 pixels to fit the detection model input.
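For illustration, the augmentation ranges above can be approximated with off-the-shelf transforms. The following is a minimal sketch assuming a torchvision-style pipeline applied to single images; Grid Mask is omitted, the rotation angle and noise level are assumptions, and detection training would additionally require box-aware transforms. It is not the exact pipeline used in this study.

```python
import torch
import torchvision.transforms as T

# Illustrative approximation of the augmentation described above (not the authors' pipeline).
# ColorJitter(brightness=0.2) scales brightness by a factor in [0.8, 1.2], approximating the
# stated perturbation range; contrast is scaled within [0.8, 1.2] as in the text.
augment = T.Compose([
    T.Resize((640, 640)),                                    # model input resolution
    T.ColorJitter(brightness=0.2, contrast=(0.8, 1.2)),      # illumination simulation
    T.RandomHorizontalFlip(p=0.5),                           # geometric transformation
    T.RandomRotation(degrees=10),                            # random rotation (angle assumed)
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.15), ratio=(0.3, 3.3)),  # occlusion simulation
    T.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0, 1)),  # Gaussian noise (sigma assumed)
])
```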
Data annotation was performed using the MakeSense.ai tool, strictly executing a one-image-multi-instance strategy to ensure all visible targets were accurately boxed and exported in YOLO format. To ensure ground-truth reliability, a dual-person cross-review mechanism was established to conduct multiple rounds of verification, specifically correcting classification ambiguities caused by target displacement, omissions, or subtle morphological features. Consistency checks on a random 5% sample yielded a Fleiss’ Kappa coefficient of 0.89, with a final verified annotation accuracy exceeding 99.5%. The dataset was divided into training (4057 images), validation (1160 images), and testing sets (579 images) following a 7:2:1 ratio. A stratified sampling strategy was adopted during the split to maintain a relative class balance across subsets, supporting training stability and robust generalization evaluation. This dataset presents significant challenges in terms of scale variation, pose diversity, and background complexity, providing realistic scenario support for subsequent structural optimization of the proposed model. Figure 1 illustrates representative sample images from the constructed dataset.
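As a concrete illustration of the 7:2:1 stratified split described above, the subsets can be produced in two stages. The sketch below assumes each image is assigned a single dominant-class label for stratification; the variable names are hypothetical, and the actual split procedure may differ.

```python
from sklearn.model_selection import train_test_split

# Two-stage stratified split producing a 7:2:1 ratio; `image_paths` and `dominant_labels`
# are hypothetical lists of image file paths and per-image dominant class labels.
train_paths, rest_paths, train_lbls, rest_lbls = train_test_split(
    image_paths, dominant_labels, test_size=0.3, stratify=dominant_labels, random_state=42)
val_paths, test_paths, _, _ = train_test_split(
    rest_paths, rest_lbls, test_size=1/3, stratify=rest_lbls, random_state=42)
# Resulting proportions: 70% training, 20% validation, 10% testing.
```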

2.2. The Proposed YOLOv10-TWD Network Architecture

2.2.1. Overview of the YOLOv10n Baseline

YOLOv10, proposed by researchers from Tsinghua University in 2024 [20], represents the latest evolution in real-time end-to-end object detection, optimizing the trade-off between detection accuracy and computational cost. As the nano-scale variant of this family, YOLOv10n is engineered specifically for edge deployment, offering ultra-low latency and high efficiency in resource-constrained environments.
Architecturally, YOLOv10n introduces a Consistent Dual Assignments strategy that eliminates the need for Non-Maximum Suppression (NMS) during inference, fundamentally reducing latency overhead [20]. To further enhance efficiency, the model employs a holistic efficiency-accuracy driven design, incorporating a lightweight classification head and spatial-channel decoupled downsampling [21]. Additionally, the Rank-Guided Block design creates a compact network structure by reducing redundancy in gradient information flow, ensuring high precision with minimal parameter count.
For the specific task of Terracotta Warrior recognition—characterized by dense small targets, severe occlusion, and complex tourist backgrounds—YOLOv10n’s structured output and lightweight footprint offer distinct advantages. While Large Multimodal Models (LMMs) like CLIP and GPT-4V excel in semantic generalization, they lack the spatial granularity required for precise bounding box regression and incur prohibitive inference costs. Consequently, the deterministic and efficient nature of YOLOv10n makes it the optimal foundational framework for constructing the proposed YOLOv10-TWD model.

2.2.2. Network Architecture of YOLOv10-TWD

The overall architecture of the proposed YOLOv10-TWD is illustrated in Figure 2. Adhering to the classic object detection paradigm, it comprises three primary components: the Backbone for feature extraction, the Neck for multi-scale feature fusion, and the Head for predicting bounding boxes and class probabilities.
As depicted in Figure 2, this study integrates three distinct lightweight modules to synergistically enhance both model performance and deployment efficiency:
(1) CAFM Attention Module: Integrated into the detection head, the Convolution-Attention Fusion Module (CAFM) reinforces the model’s responsiveness to salient regions. This mechanism effectively suppresses background noise, thereby improving target discrimination capabilities in complex, cluttered museum environments.
(2) DualConv Module: This module replaces the downsampling convolution in the detection head’s P3 → P4 semantic fusion branch (see Section 2.2.4). By leveraging dual-kernel designs, it enhances the model’s capacity for fine-grained detail modeling and multi-scale feature representation, significantly improving adaptability to pose variations and occluded targets.
(3) GSConv Module: Integrated into the feature fusion layer (Neck), the Ghost-Shuffle Convolution (GSConv) combines the advantages of standard convolution and depth-wise separable convolution. This integration effectively compresses parameter count and computational overhead while optimizing the efficiency of channel information representation, ensuring a lightweight yet robust feature fusion process.

2.2.3. CAFM: Convolution and Attention Fusion Module

To enhance the feature representation capability of the model in complex scenarios, this study incorporates the Convolution and Attention Fusion Module (CAFM) into the detection head architecture [16]. While the fundamental operator of CAFM is an established technique for feature fusion, the core design of this module lies in the fusion of convolutional operations with self-attention mechanisms, facilitating collaborative modeling through a parallel dual-branch structure. The specific structure is illustrated in Figure 3.
Mathematically, let the input feature map be X. The local branch extracts spatial features via depth-wise separable convolutions after channel shuffling, as formulated in Equation (1):
$F_{\mathrm{local}} = \mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}\left(\mathrm{DWConv}\left(F_{\mathrm{shuffle}}^{(g)}\right)\right)\right)$ (1)
Concurrently, the global branch models long-range semantic dependencies through a self-attention mechanism, using convolutional projections to generate the Query (Q), Key (K), and Value (V) matrices. The attention-weighted features are computed as shown in Equation (2):
$A = \mathrm{Softmax}\left(Q \times K^{T}\right), \qquad F_{\mathrm{global}} = \mathrm{DWConv}_{3\times 3}\left(\mathrm{Reshape}\left(A \cdot V\right)\right)$ (2)
The final output is a residual fusion of the input, local, and scaled global features, which is defined in Equation (3):
$F_{\mathrm{out}} = X + F_{\mathrm{local}} + \alpha \cdot F_{\mathrm{global}}$ (3)
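To make the dual-branch computation of Equations (1)–(3) concrete, the following is a simplified PyTorch sketch of a CAFM-style block. It is an illustrative reimplementation (channel-wise attention, learnable fusion scale α, assumed group count and projection layers), not the authors’ released code.

```python
import torch
import torch.nn as nn

class CAFMAttentionSketch(nn.Module):
    """Illustrative CAFM-style block following Eqs. (1)-(3); hyperparameters are assumptions."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups  # channels must be divisible by groups
        # Local branch: channel shuffle -> depth-wise conv -> 1x1 fusion (Eq. (1))
        self.dwconv_local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw_local = nn.Conv2d(channels, channels, 1)
        # Global branch: 1x1 conv generates Q, K, V for channel-wise attention (Eq. (2))
        self.qkv = nn.Conv2d(channels, channels * 3, 1)
        self.dwconv_global = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable fusion scale (Eq. (3))

    def channel_shuffle(self, x):
        b, c, h, w = x.shape
        return x.view(b, self.groups, c // self.groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        # Eq. (1): local feature extraction after channel shuffling
        f_local = self.pw_local(self.dwconv_local(self.channel_shuffle(x)))
        # Eq. (2): channel-wise self-attention over flattened spatial positions
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q, k, v = (t.view(b, c, h * w) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (b, c, c)
        f_global = self.dwconv_global((attn @ v).view(b, c, h, w))
        # Eq. (3): residual fusion of input, local, and scaled global features
        return x + f_local + self.alpha * f_global
```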
Unique Design Rationale for Terracotta Warrior Recognition: While CAFM is a known technique, its standard application is insufficient for addressing the extreme inter-class similarity inherent in Terracotta Warrior imagery. Our unique contribution lies in its strategic placement. Instead of generic deployment, we embed CAFM specifically into the P5 detection path of YOLOv10n (positioned between the C2fCIB module and the v10Detect head). This position corresponds to the deepest feature map with the strongest semantic representation but the lowest spatial resolution. By deploying CAFM at this critical juncture, the module is forced to focus on subtle morphological differences—such as the distinct crown types of Officers and the intricate armor details of Generals (as defined in Table 1)—rather than generic background noise. The parallel dual-branch structure allows the network to simultaneously capture local textural details and global pose semantics, significantly improving fine-grained recognition for highly similar warrior categories.

2.2.4. DualConv: Dual Convolutional Module

Despite the lightweight characteristics of YOLOv10n, its standard convolutional operations possess limited capacity for extracting fine-grained features from detail-rich cultural heritage imagery. To address this limitation, this study introduces the DualConv module [17], an established lightweight operator designed to optimize the network’s feature representation capabilities without incurring a significant computational burden.
The core design philosophy of DualConv lies in integrating Group Convolution (GroupConv) with a Heterogeneous Kernel strategy. As illustrated in Figure 4, the channel dimension of the input feature map is partitioned into multiple sub-groups. Unlike traditional GroupConv, DualConv assigns distinct kernel types to each sub-group (e.g., lightweight 1 × 1 kernels for information interaction, and standard 3 × 3 kernels for spatial modeling). The outputs are then concatenated along the channel dimension.
Furthermore, the DualConv module incorporates standard Batch Normalization (BN) and non-linear activation functions (such as SiLU, Sigmoid Linear Unit) to stabilize the training process.
To quantify its efficiency, assuming the module contains $G$ convolution groups where only $1/G$ of the channels employ $K \times K$ convolution, the total computational cost of this structure, denoted as $FL_{\mathrm{DualConv}}$, can be formally derived as shown in Equation (4):
$FL_{\mathrm{DualConv}} = \dfrac{(K^{2} + G - 1) \times D^{2} \times M \times N}{G}$ (4)
where $D$ denotes the spatial dimension of the feature map, and $M$ and $N$ denote the numbers of input and output channels, respectively. Compared with standard convolution ($FL_{\mathrm{Conv}}$), the theoretical computational reduction ratio $R$ is expressed as formulated in Equation (5):
$R = \dfrac{FL_{\mathrm{DualConv}}}{FL_{\mathrm{Conv}}} = \dfrac{K^{2} + G - 1}{G \times K^{2}} \approx \dfrac{1}{G} + \dfrac{1}{K^{2}}$ (5)
This heterogeneous strategy significantly lowers overall computational complexity while retaining critical spatial perception capabilities.
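For a concrete sense of the saving (values chosen purely for illustration): with $K = 3$ and $G = 2$, Equation (5) gives $R = \frac{9 + 1}{2 \times 9} \approx 0.56$, while $G = 4$ yields $R = \frac{9 + 3}{4 \times 9} = \frac{1}{3}$, i.e., roughly a threefold reduction in the cost of the replaced layer.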
Unique Design Rationale for Semantic Fusion: Applying DualConv blindly as a generic backbone replacement often leads to semantic loss in detail-rich cultural heritage imagery. To leverage its structural advantages while preventing the loss of fine-grained armor textures, our unique contribution lies in its strategic placement. This study embeds DualConv specifically within the detection head of YOLOv10n, replacing the downsampling convolution module in the original P3 → P4 branch. This path serves as a critical layer for semantic fusion, bridging shallow detail features with mid-level semantic features. By deploying DualConv at this specific juncture, we enhance multi-scale modeling and local perception capabilities within this fusion path. This strategic decision maximizes the extraction of highly similar textural features (e.g., the overlapping armor plates of Armored Warriors) while preserving the backbone’s generalizability and mitigating the over-fitting risks associated with small-scale datasets.
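Under the channel-partition description above, a minimal PyTorch sketch of such a dual-kernel block might look as follows. The split ratio and stride handling are assumptions, and the original DualConv operator arranges its group convolutions somewhat differently; this is illustrative only.

```python
import torch
import torch.nn as nn

class DualKernelBlock(nn.Module):
    """Simplified dual-kernel block: one channel sub-group receives 3x3 kernels for spatial
    modeling, the other 1x1 kernels for channel interaction; outputs are concatenated."""
    def __init__(self, in_ch, out_ch, split=0.5, stride=1):
        super().__init__()
        self.c_spatial = int(in_ch * split)        # channels processed with 3x3 kernels
        self.c_point = in_ch - self.c_spatial      # channels processed with 1x1 kernels
        self.conv3x3 = nn.Conv2d(self.c_spatial, out_ch // 2, 3, stride, 1, bias=False)
        self.conv1x1 = nn.Conv2d(self.c_point, out_ch - out_ch // 2, 1, stride, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()                       # BN + SiLU, as mentioned in the text

    def forward(self, x):
        xs, xp = torch.split(x, [self.c_spatial, self.c_point], dim=1)
        y = torch.cat([self.conv3x3(xs), self.conv1x1(xp)], dim=1)  # concat along channels
        return self.act(self.bn(y))
```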

2.2.5. GSConv: Ghost-Shuffle Convolution

In the context of online cultural heritage image recognition, the task is characterized by dense targets and rich textural details. The Neck network of YOLOv10n primarily relies on Standard Convolution (SC) and Depth-wise Separable Convolution (DWConv) for multi-scale feature fusion. However, this creates an inherent design dichotomy: SC incurs high computational costs, while DWConv suffers from insufficient cross-channel information interaction due to its “channel-independent” computation [18].
To resolve these contradictions, this study introduces the GSConv module into the Neck layer of the YOLOv10-TWD. While GSConv is a recognized lightweight operator, this module integrates the advantages of SC and DWConv, facilitating information flow between the main and auxiliary paths via a Channel Shuffle operation. The structure is illustrated in Figure 5.
Mathematically, let the input feature map be X. The GSConv module extracts global features ( f 1 ) via standard convolution and local features ( f 2 ) via depth-wise separable convolution in parallel, as described in Equation (6):
$f_{1} = \mathrm{Conv}_{3\times 3}(X), \qquad f_{2} = \mathrm{DWConv}_{3\times 3}(X)$ (6)
These features are concatenated and subsequently fused via a Channel Shuffle operation to enhance inter-channel interaction, as formulated in Equation (7):
$f_{3} = \mathrm{Concat}(f_{1}, f_{2}), \qquad F = \mathrm{Shuffle}(f_{3})$ (7)
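As a concrete illustration of Equations (6) and (7), a simplified GSConv-style layer can be sketched as follows, with the two branches applied to the input in parallel as described above. This is an illustrative reimplementation; the slim-neck paper’s exact layer and channel splits may differ.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """Simplified Ghost-Shuffle Convolution following Eqs. (6)-(7); illustrative only."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        half = out_ch // 2
        # f1: standard convolution branch (global features)
        self.sc = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, stride, 1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU())
        # f2: depth-wise separable branch (local features)
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),  # depth-wise
            nn.Conv2d(in_ch, half, 1, bias=False),                            # point-wise
            nn.BatchNorm2d(half), nn.SiLU())

    def forward(self, x):
        f1, f2 = self.sc(x), self.dw(x)          # Eq. (6): parallel branches
        f3 = torch.cat([f1, f2], dim=1)          # Eq. (7): concatenation
        b, c, h, w = f3.shape
        # Channel shuffle: interleave the two halves to mix information across branches
        return f3.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```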
Unique Design Rationale for Synergistic Integration: Our unique contribution lies in the synergistic integration of GSConv within the Neck network to complement the modifications in the detection head. While the previously introduced DualConv module significantly reduces computational redundancy, its group-based convolution strategy can inadvertently create inter-group information barriers. To resolve this, we strategically deploy GSConv in the Neck (replacing standard convolution units within the C2f layers). The Channel Shuffle operation inherent in GSConv acts as a structural “adhesive,” mitigating these information barriers and realigning feature flows through robust cross-channel interaction. This synergistic design ensures that the complex, high-similarity textures of cultural artifacts are seamlessly transmitted to the CAFM-enhanced detection head, achieving an optimal balance between lightweight deployment and fine-grained accuracy.

3. Experiments and Analysis

3.1. Experimental Environment and Training Parameters

The experiments were conducted within an Ubuntu 22.04 LTS operating system environment. The hardware platform includes an NVIDIA GeForce RTX 3090 (24 GB) GPU, an Intel(R) Xeon(R) Platinum 8358P @ 2.60 GHz (15 vCPUs) CPU, and 90 GB of RAM. Regarding the software configuration, the deep learning framework utilized is PyTorch 2.1.2, with Python version 3.10 and CUDA version 11.8. The development environment is PyCharm 2025.2.3 Professional.
Prior to model training, the configuration file of the YOLOv10n model was adjusted according to the specific task characteristics of the Terracotta Warrior dataset. A series of training hyperparameters were established to ensure favorable convergence and stability of the model on this dataset. The specific hyperparameter settings are presented in Table 2.
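For reproducibility, Table 2 maps directly onto a standard training call. The following is a minimal sketch assuming an Ultralytics-style interface; the configuration file names, the dataset YAML, and the early-stopping patience value are assumptions rather than the exact settings used in this study.

```python
from ultralytics import YOLO

# Minimal training sketch with the Table 2 hyperparameters (illustrative only).
model = YOLO("yolov10n.yaml")          # baseline config; the TWD variant would use a modified YAML
model.train(
    data="terracotta.yaml",            # hypothetical dataset description file
    epochs=1000, patience=50,          # long schedule with early stopping (patience value assumed)
    batch=32, imgsz=640,
    lr0=0.01, lrf=0.01,                # initial learning rate and cyclic/final learning-rate factor
    momentum=0.937,
    warmup_epochs=3.0, warmup_momentum=0.8,
)
```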

3.2. Performance Evaluation Metrics

To comprehensively evaluate the performance of the proposed improved model in the task of Terracotta Warrior object detection, this study establishes key evaluation metrics across three dimensions: accuracy, complexity, and speed. These metrics are selected to accurately assess the model’s comprehensive performance in practical application scenarios.
  • Accuracy Metrics
    mAP@0.5: The mean Average Precision (mAP) calculated at an Intersection over Union (IoU) threshold of 0.5. This metric is utilized to measure the model’s overall detection capability at a lower overlap threshold.
    mAP@0.5:0.95: The average mAP calculated over IoU thresholds ranging from 0.5 to 0.95 (in steps of 0.05). This metric provides a comprehensive assessment of the model’s detection performance across varying degrees of overlap strictness.
  • Complexity Metrics
    Params (Parameters): This represents the total number of parameters within the network, reflecting the model’s scale and storage requirements. A lower parameter count typically indicates a more lightweight model with greater deployment flexibility.
    FLOPs (Floating Point Operations): This measures the computational cost required during the model inference process, expressed in GFLOPs (Giga Floating-point Operations). FLOPs serve as a critical indicator for evaluating the model’s computational efficiency.
  • Speed Metrics
    Preprocess Time: This refers to the time consumed for data preprocessing of a single image prior to inference (measured in milliseconds, ms), reflecting the processing efficiency during the data preparation stage.
    Inference Time: This denotes the time required for a single image to complete the forward inference pass, directly reflecting the model’s response speed in practical applications.
    FPS (Frames Per Second): This indicates the number of image frames the model can process per second. It is used to comprehensively evaluate the model’s continuous inference capability; a higher FPS signifies superior real-time performance.
These evaluation metrics collectively reflect the improved model’s performance across multiple dimensions, covering detection accuracy, model scale, computational efficiency, and real-time response capability. Through multi-dimensional comparative analysis, the optimization effects and practical application value of each module design within the Terracotta Warrior recognition scenario can be effectively validated.

3.3. Overall Performance and Per-Class Analysis

To address the requirement for verifying the classification accuracy of each specific warrior type, Table 3 provides a detailed breakdown of the performance metrics for all nine categories defined in this study, including mAP@0.5, mAP@0.5:0.95, Precision, and Recall.
The experimental results demonstrate that YOLOv10-TWD exhibits high discriminative stability across diverse categories. Notably, Kneeling and Standing Archers achieve near-perfect mAP@0.5 scores (0.9958 and 0.9954, respectively), as their unique half-kneeling and bow-pulling poses (as defined in Table 1) provide highly salient geometric features. For more challenging categories such as Military Officers, Armored Warriors, and Cavalry, which share similar standing postures, the model still maintains an mAP@0.5 above 0.90. This validates the efficacy of the CAFMAttention and DualConv modules in extracting fine-grained discriminative details. Specifically, for the Cavalry category, the relatively lower recall (0.7324) is primarily attributed to two factors: first, the smaller scale of their specialized equipment, where the leather caps (defined in Table 1) are significantly smaller and exhibit lower visual contrast compared to the prominent pheasant tail crowns of Generals; second, the complex occlusions inherent in equestrian-themed displays, where the proximity of terracotta horses and dense formations often obscures critical features. These factors collectively highlight the model’s operational limits in extremely cluttered museum environments and provide a clear direction for future research into multi-scale robustness.

3.4. Ablation Studies

To verify the efficacy and individual contributions of the structural improvement modules within the YOLOv10-TWD model, this study conducts ablation experiments based on the YOLOv10n baseline. The CAFMAttention, DualConv, and GSConv modules are introduced sequentially to systematically evaluate the impact of each modification on detection accuracy, model complexity, and inference efficiency.
To ensure a comprehensive assessment, the experiments utilize three categories of metrics: detection accuracy metrics (mAP@0.5 and mAP@0.5:0.95), complexity metrics (Parameters and FLOPs), and efficiency metrics (Preprocess Time, Inference Time, and Frames Per Second [FPS]). The experimental results are summarized in Table 4.
As indicated by the ablation results in Table 4, the improvement modules demonstrate distinct phased and synergistic effects on model performance.
Initially, the incorporation of the CAFMAttention module into the baseline YOLOv10n model increased the mAP@0.5 from 87.65% to 88.91%. This suggests that the module, through its attention mechanism integrating local and global features, significantly enhances the model’s capability to capture fine-grained details of the Terracotta Warriors.
Building upon this, the further integration of the DualConv module (Row 5) exhibited an excellent balance of efficiency. The data reveals that the model’s parameter count decreased slightly to 3.04 M, while the detection accuracy (mAP@0.5) was maintained at 88.72%, showing only a marginal fluctuation of 0.19% compared to the previous stage. Notably, inference speed achieved a significant boost—FPS surged from 147.06 to 172.41, representing an increase of approximately 17.2%. This demonstrates that the DualConv module, through its heterogeneous convolution kernel design, successfully eliminated computational redundancy during the feature extraction process.
Subsequently, the introduction of the GSConv module (as seen in the transition to Row 6) realized a comprehensive enhancement in model performance. The core advantage of the GSConv module lies in its Channel Shuffle operation; this mechanism mitigates potential inter-group information barriers introduced by DualConv and realigns feature flows through cross-channel interaction, playing a pivotal structural “adhesive” role.
Ultimately, with the deep fusion of these three modules to form the YOLOv10-TWD model (Row 7), the mAP@0.5 reached 95.28%, representing a substantial increase of 7.63% compared to the original YOLOv10n. It is crucial to address the seemingly counter-intuitive phenomenon where the final model achieves a higher overall FPS (166.66) than the baseline (156.25) despite an increase in theoretical computational complexity (8.6 GFLOPs vs. 8.4 GFLOPs). From a hardware-execution perspective, theoretical FLOPs do not strictly equate to actual latency. As shown in Table 4, the actual inference time (Inf.) slightly increased from 3.5 ms to 3.7 ms, which perfectly aligns with the added FLOPs overhead introduced by the CAFM module. However, the overall FPS boost is primarily driven by a significant reduction in preprocessing time (Pre., from 2.9 ms to 2.3 ms) and the optimization of Memory Access Cost (MAC). Specifically, the Channel Shuffle mechanism in GSConv and the heterogeneous design of DualConv substantially streamline memory read/write operations and alleviate memory bottlenecks. This allows the model to achieve higher actual parallel throughput on GPU hardware, proving that YOLOv10-TWD secures superior execution efficiency despite a marginal increase in theoretical parameters. These experimental results fully validate the synergistic efficacy of the proposed optimization schemes, demonstrating that YOLOv10-TWD achieves a superior balance between high-precision recognition and real-time deployment requirements in complex museum environments.
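A quick arithmetic check (illustrative only) shows that the throughput figures in Table 4 follow directly from the per-image latency budget, $\mathrm{FPS} \approx 1000 / (t_{\mathrm{pre}} + t_{\mathrm{inf}})$: for the baseline, $1000/(2.9 + 3.5) = 156.25$, and for YOLOv10-TWD, $1000/(2.3 + 3.7) \approx 166.7$. This is why the reduction in preprocessing time outweighs the slightly longer inference time in the overall speed-up.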

3.5. Comparative Experiments

3.5.1. Performance Comparison of Different Attention Mechanisms

Terracotta Warrior images are characterized by significant complex background interference, repetitive detailed textures, and pose occlusions. Relying solely on standard convolutional operations makes it difficult to effectively model contextual semantics and focus on key discriminative regions. To enhance the model’s recognition robustness regarding small targets and redundant textures, this study introduces attention mechanisms to strengthen the discriminative capability across feature channels and spatial perception.
Based on the YOLOv10n architecture, this experiment integrates the SE, CBAM, and CAFMAttention modules, respectively, to construct three comparative models. Their performance is evaluated in terms of detection accuracy and inference efficiency, with the results presented in Table 5 [22,23].
The results indicate that YOLOv10n+CAFM achieves an mAP@0.5 of 88.94% and an mAP@0.5:0.95 of 64.86%. Its overall accuracy performance surpasses that of SE and is comparable to or slightly higher than that of CBAM, validating its modeling capability in feature compression and channel focusing. Particularly within the high IoU threshold interval (>0.75), CAFM demonstrates superior effectiveness in capturing key regions of the Terracotta Warriors (such as heads, hands, and armor textures), exhibiting stronger contextual understanding capabilities.
Regarding inference efficiency, CAFMAttention significantly outperforms CBAM (131.58 FPS) with a latency of 3.4 ms and a performance of 147.06 FPS, approaching the speed of the SE module (169.49 FPS). This indicates that the structure maintains representational capacity while effectively controlling computational costs. Its advantage stems primarily from the parallel path design and efficient convolutional operations: CAFMAttention processes the local perception branch and the global context branch in parallel, thereby reducing serial information bottlenecks. Simultaneously, the introduction of Depth-wise Separable Convolution (DWConv) and Channel Shuffle operations enhances the efficiency of cross-channel information flow, effectively mitigating the inference path redundancy issues present in CBAM.
In summary, the CAFMAttention module not only delivers superior accuracy but also balances inference speed with structural controllability, making it the optimal attention mechanism choice for deployment in efficiency-sensitive scenarios within Terracotta Warrior recognition tasks.

3.5.2. Comparative Experiments of Different YOLO Object Detection Algorithms

To comprehensively evaluate the detection performance of the proposed YOLOv10-TWD model in cultural heritage scenarios, this study selects several mainstream lightweight object detection algorithms—including YOLOv7-Tiny, YOLOv8n, YOLOv11n, and YOLOv12n [24,25,26,27]—as comparative baselines. Systematic experiments were conducted under uniform dataset and training configurations, with the results summarized in Table 6.
According to the experimental results, YOLOv10-TWD surpasses other lightweight models in both mAP@0.5 and mAP@0.5:0.95. Specifically, the proposed model attained an mAP@0.5 of 95.28% and an mAP@0.5:0.95 of 75.54%. With a parameter count of 3.07 M and a computational complexity of 8.6 GFLOPs, the model still achieves a high inference speed of 166.66 FPS, outperforming the baseline YOLOv10n in both accuracy and efficiency. These results demonstrate a significant comprehensive advantage, successfully balancing high detection accuracy with operational efficiency for real-time cultural heritage digitalization.

3.6. Visualization Analysis

To further analyze the performance of the improved YOLOv10-TWD model in terms of semantic modeling and target focusing, this study conducts a visualization analysis using feature activation heatmaps of the high-level semantic output layer (the 22nd layer, P5/32). As this layer serves as the direct input to the detection head, it is primarily responsible for category discrimination and bounding box regression, playing a critical role in global perception and the recognition of large-scale targets.
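The layer-22 activation maps used here can be obtained with a standard forward hook. The sketch below assumes an Ultralytics-style model object; the weight file, image path, and exact module index are assumptions rather than the authors’ released tooling.

```python
import torch
from ultralytics import YOLO

model = YOLO("yolov10-twd.pt")          # hypothetical trained weights
activations = {}

def grab(module, inputs, output):
    # The hooked layer may return a tensor or a tuple; keep the first tensor.
    out = output[0] if isinstance(output, (list, tuple)) else output
    activations["p5"] = out.detach()

handle = model.model.model[22].register_forward_hook(grab)    # layer 22 (P5/32), index assumed
model.predict("warrior_sample.jpg", imgsz=640, verbose=False)  # hypothetical test image
handle.remove()

# Collapse channels into a single 2-D map (mean absolute activation), normalize to [0, 1],
# then upsample to the input size and overlay on the image for visualization.
heat = activations["p5"].float().abs().mean(dim=1, keepdim=True)
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
heat = torch.nn.functional.interpolate(heat, size=(640, 640), mode="bilinear", align_corners=False)
```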
Initially, to evaluate the model’s feature response to targets in various poses, Figure 6 presents a comparison of activation patterns between YOLOv10n and YOLOv10-TWD for categories such as “Military Officer”. The results indicate that the heat zones of YOLOv10-TWD are more concentrated on key structural regions, characterized by clear boundaries and significantly suppressed background interference. This demonstrates enhanced discriminative power and semantic focusing capability.
Taking the “Military Officer” figure as an example, Figure 6c shows that YOLOv10-TWD precisely focuses on critical structural components of the figure, including the head, the mid-section of the armor, and hand gestures. The heat response is highly concentrated, achieving a target confidence score of 0.79, which reflects excellent feature extraction and discrimination capabilities. In contrast, as shown in Figure 6b, the baseline YOLOv10n model misclassifies this instance as a “General” figure and exhibits uneven heat distribution, causing the confidence score to drop to 0.71.
Regarding background interference control, the disparity between the two models is even more pronounced. YOLOv10n (Figure 6b) exhibits a strong heat response at an exhibition plaque in the lower-right corner of the image, leading to the false activation of non-target regions. Conversely, the improved YOLOv10-TWD (Figure 6c) effectively suppresses activation in these irrelevant areas. The heat zones remain focused on the terracotta figure itself, with significantly reduced background response and a marked increase in boundary clarity.
To further analyze model performance in dense object detection scenarios, multi-instance images were selected for heatmap visualization comparison, as shown in Figure 7. YOLOv10-TWD demonstrates higher detection completeness and instance separation capability in complex, dense environments.
Specifically, YOLOv10n (Figure 7b) identifies a total of 8 instances. In comparison, YOLOv10-TWD (Figure 7c) identifies a significantly higher number of targets, detecting 17 instances. Consequently, YOLOv10-TWD exhibits superior performance under conditions of occlusion and low visibility, effectively mitigating the missed detection issues inherent in YOLOv10. For instance, in the densely arranged “Armored Warrior” region, the number of identified instances by YOLOv10-TWD increased from 1 to 8, maintaining clear and separated target boundaries while enhancing overall detection coverage. It is important to note that the overlapping bounding boxes observed in the figure are inherent characteristics of this dense scene. Rather than a visual defect, this overlap demonstrates the model’s robustness in resolving crowded targets and distinguishing adjacent instances.
Furthermore, an interesting phenomenon can be observed in the Grad-CAM visualization of the proposed model (Figure 7c): a wide horizontal heat zone appears on the museum rooftop, which is less pronounced in the baseline model. The primary cause of this activation is the enhanced texture-extraction capability of the integrated DualConv and CAFM modules. Designed to capture the dense, repetitive, and fine-grained linear textures of the warriors’ armor plates, these modules are highly sensitive to high-frequency spatial patterns. Consequently, the repetitive metallic grid structure of the rooftop triggers strong feature activations in the intermediate layers. However, this does not constitute a performance deterioration factor. As evidenced by the detection results, no false positive bounding boxes are generated in the rooftop area. The decoupled classification head of YOLOv10-TWD effectively evaluates the global semantic context, assigning near-zero objectness scores to these architectural structures and successfully suppressing task-irrelevant activations during the final inference phase.
Additionally, the orange ovals in Figure 7 also serve to highlight the detection boundary of the proposed model. Beyond this highlighted region, a few extremely small and distant warriors remain undetected, which defines the current “detection limit” (i.e., the maximum detectable distance and resolution limit) of YOLOv10-TWD. Specifically, when a target’s spatial resolution is lower than 10 × 10 pixels or the occlusion ratio exceeds 80%, the semantic information becomes too fragmented for reliable recognition. These distant instances lack the key morphological features (e.g., armor tassels or specific headgear) required for fine-grained classification. Given that our model is optimized for real-time edge deployment, this detection boundary represents a strategic trade-off between computational efficiency and the suppression of false positives in high-density, low-visibility background regions.
The experimental results demonstrate that the YOLOv10-TWD method achieves a favorable balance between accuracy, complexity, and inference efficiency while remaining lightweight. Compared to multi-modal large models (LMMs) such as CLIP and GPT-4o, its structured detection path provides higher recognition accuracy in niche categories, structure-sensitive tasks, and multi-target scenarios, aligning more closely with the actual requirements of cultural heritage protection.

4. Conclusions

This study proposes YOLOv10-TWD, a high-performance and lightweight detection framework tailored for Terracotta Warrior digitalization. By integrating CAFMAttention, DualConv, and GSConv modules, the model effectively addresses the challenges of complex museum backgrounds and target occlusions. Experimental results demonstrate that YOLOv10-TWD achieves a peak mAP@0.5 of 95.28%, representing a 7.63% improvement over the baseline YOLOv10n. Despite a slight increase in complexity to 3.07 M parameters and 8.6 GFLOPs, the model maintains a high inference speed of 166.66 FPS (a 6.66% boost), achieving a superior balance between recognition precision and operational efficiency.
The core technical contributions are summarized as follows:
  • CAFMAttention Module: By aggregating local and global features, this module enhances the model’s spatial awareness of critical structural regions, effectively mitigating localization errors caused by partial damage or occlusion.
  • DualConv Module: The heterogeneous dual-branch design optimizes the modeling of fine-grained details and global contours, significantly improving discriminative robustness against visually similar figure categories.
  • GSConv Module: By streamlining feature pathways and enhancing cross-channel interaction, this module reduces computational redundancy, ensuring the model’s real-time performance on edge deployment devices.
While YOLOv10-TWD demonstrates significant overall advantages, we acknowledge certain limitations that define our future research directions:
  • Extreme Parameter Control: Future work will explore advanced model compression techniques, such as knowledge distillation and structural pruning, to further reduce the parameter count for deployment on ultra-resource-constrained edge hardware without sacrificing detection precision.
  • Small-Scale Target Robustness: We aim to incorporate more sophisticated multi-scale feature aggregation mechanisms (e.g., BiFPN or specialized small-object detection heads) to enhance the model’s discriminative stability for extremely small targets in wide-angle museum scenes.

Author Contributions

Conceptualization, Y.L. and X.Z. (Xinjuan Zhu); methodology, Y.L.; software, Y.L. and X.Z. (Xinyuan Zhang); validation, Y.L. and X.Z. (Xinyuan Zhang); formal analysis, Y.L. and L.W.; investigation, Y.L. and L.W.; resources, L.W. and X.Z. (Xinjuan Zhu); data curation, Y.L. and S.D.; writing—original draft preparation, Y.L.; writing—review and editing, X.Z. (Xinjuan Zhu) and L.W.; visualization, Y.L.; supervision, X.Z. (Xinjuan Zhu); project administration, X.Z. (Xinjuan Zhu); funding acquisition, X.Z. (Xinjuan Zhu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Cultural Relics Science Research Project of the National Cultural Heritage Administration, grant number 2023ZCK026.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset related to this study is available at: https://aistudio.baidu.com/dataset/detail/368713/file (accessed on 7 February 2026).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
TWD: Terracotta Warrior Detection
CAFM: Convolution-Attention Fusion Module
DualConv: Dual Convolution
GSConv: Ghost-Shuffle Convolution
mAP: Mean Average Precision
IoU: Intersection over Union
FLOPs: Floating Point Operations
FPS: Frames Per Second
LMMs: Large Multi-modal Models
CNN: Convolutional Neural Network
BN: Batch Normalization
DWConv: Depth-wise Separable Convolution
SE: Squeeze-and-Excitation
CBAM: Convolutional Block Attention Module
MHSA: Multi-Head Self-Attention
DCN: Deformable Convolutional Networks

References

  1. Tuo, Y.; Wu, J.; Zhao, J.; Si, X. Artificial intelligence in tourism: Insights and future research agenda. Tour. Rev. 2025, 80, 793–812. [Google Scholar] [CrossRef]
  2. Zhao, F.Q.; Zhou, M.Q. Automatic matching method of cultural relic fragments based on multi-feature parameter fusion. Opt. Precis. Eng. 2023, 31, 1522–1531. [Google Scholar] [CrossRef]
  3. Liu, J.; Ge, Y.F.; Tian, M. Research on super-resolution reconstruction algorithm of cultural relic images. Acta Electron. Sin. 2023, 51, 139–145. [Google Scholar]
  4. Onkhar, V.; Kumaaravelu, L.T.; Dodou, D.; de Winter, J.C.F. Towards Context-Aware Safety Systems: Design Explorations Using Eye-Tracking, Object Detection, and GPT-4V; Technical Report; Delft University of Technology: Delft, The Netherlands, 2024. [Google Scholar]
  5. Limberg, C.; Gonçalves, A.; Rigault, B.; Prendinger, H. Leveraging YOLO-world and GPT-4V LMMs for zero-shot person detection and action recognition in drone imagery. arXiv 2024, arXiv:2404.01571. [Google Scholar]
  6. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933. [Google Scholar] [CrossRef] [PubMed]
  7. Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4488–4499. [Google Scholar]
  8. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. GOLD-YOLO: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inf. Process. Syst. (Neurips) 2024, 36, 30902–30915. [Google Scholar]
  9. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  10. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
  11. Jocher, G. YOLOv5 Release v7.0. 2022. Available online: https://github.com/ultralytics/yolov5/tree/v7.0 (accessed on 10 January 2026).
  12. Jocher, G.; Chiguroy, A.; Romishin, B. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2026).
  13. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  14. Tan, H.; Liu, X.; Yin, B.; Li, X. MHSA-Net: Multihead self-attention network for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8210–8224. [Google Scholar] [CrossRef] [PubMed]
  15. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  16. Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5504005. [Google Scholar] [CrossRef]
  17. Zhong, J.C.; Chen, J.Y.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9528–9535. [Google Scholar] [CrossRef] [PubMed]
  18. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  19. Jin, Y. A study on the morphological characteristics of the figures of the Terracotta Warriors of Qin Shi Huang. J. Korea Soc. Ceram. Art 2021, 18, 5–31. [Google Scholar]
  20. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. (Neurips) 2024, 37, 107984–108011. [Google Scholar]
  21. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]
  22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  24. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  25. Varghese, R.; M, S. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Senzuru, India, 12–14 March 2024; pp. 1–6. [Google Scholar]
  26. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  27. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Figure 1. Terracotta Warriors Dataset Segment.
Figure 2. The architecture of the YOLOv10-TWD network.
Figure 3. Schematic diagram of the Convolution and Attention Fusion Module (CAFM).
Figure 4. Schematic comparison of DualConv principles: (a) Group Convolution, (b) Heterogeneous Convolution, and (c) Filter design of the proposed Dual Convolution.
Figure 5. Structure of the GSConv module.
Figure 6. Visualization analysis of different models: (a) The original input image; (b) Detection results of YOLOv10n; (c) Detection results of YOLOv10-TWD. Note: “junliyong” and “jiangjunyong” in the figures are the pinyin transliterations of the Chinese terms for “Military Officer” and “General Figure”, respectively. The green and red rectangular boxes indicate the detected bounding boxes. The yellow ellipses highlight the predicted class labels and confidence scores. The red circle in (b) highlights the area of false heat activation on the background exhibition plaque.
Figure 7. Comparison of detection performance in dense scenarios: (a) Original input image; (b) Detection results of YOLOv10n; (c) Detection results of YOLOv10-TWD. Note: The text labels associated with the bounding boxes (e.g., “junliyong”, “zhanpaowushi”, “qibingyong”) are the pinyin transliterations of the Chinese terms for specific terracotta warrior categories (e.g., Military Officer, Robe Warrior, and Cavalry, respectively). Different colored bounding boxes represent different warrior categories. The orange ovals highlight a challenging region where the baseline model exhibits missed detections (b), whereas the proposed model successfully identifies the targets (c). The overlapping bounding boxes are inherent characteristics of this dense scene.
Table 1. Summary of primary types of warriors and their corresponding prototypical visual features (Sorted by detection difficulty).
Warrior Type | Prototypical Visual Features
Kneeling Archer | Hair bun on the left, wearing armor, in a half-kneeling posture.
Standing Archer | Hair bun, light robe, arms in bow-pulling pose, with symmetrical movement.
Chariot Soldier | Standing beside a chariot, wearing armor, and holding long-shafted weapons.
General Figure | Pheasant tail crown, armor with tassels, complex details, majestic posture.
Robe Warrior | Wearing robes, natural standing posture, and robust physique.
Charioteer | Long crown, arms forward as if driving, hand guards, and compact movement.
Military Officer | Single or double plate crown, clothing distinct from soldiers, serious expression.
Armored Warrior | Wearing armor, natural standing posture, and robust physique.
Cavalry | Leather cap, tight sleeves, short boots, and small armor; pose adapted for riding.
Table 2. Experimental Hyperparameters.
Hyperparameter | Value
Initial Learning Rate | 0.01
Cyclic Learning Rate | 0.01
Momentum | 0.937
Batch Size | 32
Image Size | 640
Training Epochs | 1000 (including early stopping mechanism)
Warmup Epochs | 3.0
Warmup Momentum | 0.8
Table 3. Detailed detection performance for each warrior type (Sorted by mAP@0.5).
Category | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall
Kneeling Archer | 0.9958 | 0.8639 | 0.9895 | 1.0000
Standing Archer | 0.9954 | 0.8475 | 0.9790 | 1.0000
Chariot Soldier | 0.9713 | 0.8243 | 0.8954 | 0.9512
General Figure | 0.9676 | 0.7908 | 0.9764 | 0.9074
Robe Warrior | 0.9447 | 0.6866 | 0.9345 | 0.8146
Charioteer | 0.9356 | 0.7461 | 0.9678 | 0.8025
Military Officer | 0.9281 | 0.6858 | 0.9209 | 0.7359
Armored Warrior | 0.9269 | 0.6732 | 0.8964 | 0.7803
Cavalry | 0.9095 | 0.6698 | 0.9110 | 0.7324
All (mAP) | 0.9528 | 0.7542 | 0.9523 | 0.8583
Table 4. Results of Model Ablation Experiments. In this table, ‘✓’ indicates the inclusion of a module, while ‘×’ indicates its exclusion. For the performance metrics, the up-arrow (↑) denotes that higher values are better, and the down-arrow (↓) denotes that lower values are better.
No. | CAFM | DualConv | GSConv | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Pre. (ms) ↓ | Inf. (ms) ↓ | FPS ↑
1 | × | × | × | 87.65 | 63.92 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25
2 | ✓ | × | × | 88.91 | 64.52 | 3.06 | 8.7 | 3.4 | 3.4 | 147.06
3 | × | ✓ | × | 93.99 | 72.93 | 2.69 | 8.4 | 1.6 | 3.4 | 200.00
4 | × | × | ✓ | 92.43 | 72.37 | 2.63 | 8.3 | 1.3 | 3.4 | 212.76
5 | ✓ | ✓ | × | 88.72 | 64.26 | 3.04 | 8.6 | 2.3 | 3.5 | 172.41
6 | × | ✓ | ✓ | 92.54 | 72.14 | 2.72 | 8.4 | 2.3 | 3.6 | 169.49
7 | ✓ | ✓ | ✓ | 95.28 | 75.54 | 3.07 | 8.6 | 2.3 | 3.7 | 166.66
Table 5. Comparative experiments of different attention mechanisms. For the performance metrics, the up-arrow (↑) denotes that higher values are better, and the down-arrow (↓) denotes that lower values are better.
Model | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Pre. (ms) ↓ | Inf. (ms) ↓ | FPS ↑
YOLOv10n | 87.65 | 63.92 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25
YOLOv10n + SE | 88.31 | 64.64 | 2.71 | 8.4 | 2.5 | 3.4 | 169.49
YOLOv10n + CBAM | 88.91 | 64.52 | 2.71 | 8.6 | 4.3 | 3.3 | 131.58
YOLOv10n + CAFM | 88.94 | 64.86 | 3.07 | 8.7 | 3.4 | 3.4 | 147.06
Table 6. Comparative experimental results of various detection models. For the performance metrics, the up-arrow (↑) denotes that higher values are better, and the down-arrow (↓) denotes that lower values are better.
Model | mAP@0.5 (%) ↑ | mAP@0.5:0.95 (%) ↑ | Params (M) ↓ | FLOPs (G) ↓ | Pre. (ms) ↓ | Inf. (ms) ↓ | FPS ↑
YOLOv5n | 87.65 | 63.92 | 1.77 | 4.3 | 0.4 | 8.6 | 111.11
YOLOv6n | 79.39 | 55.23 | 4.16 | 11.5 | 1.5 | 4.3 | 172.41
YOLOv7-Tiny | 89.10 | 61.90 | 6.03 | 13.2 | 8.3 | 1.1 | 106.38
YOLOv8n | 87.41 | 62.86 | 3.01 | 8.2 | 1.1 | 3.9 | 200.00
YOLOv9n | 88.39 | 66.19 | 1.76 | 6.4 | 1.3 | 4.5 | 153.85
YOLOv10n | 87.65 | 63.93 | 2.71 | 8.4 | 2.9 | 3.5 | 156.25
YOLOv11n | 90.26 | 68.01 | 2.59 | 6.3 | 1.5 | 3.6 | 172.41
YOLOv12n | 85.71 | 61.60 | 2.80 | 7.3 | 2.1 | 3.5 | 178.27
YOLOv10-TWD | 95.28 | 75.54 | 3.07 | 8.6 | 2.3 | 3.7 | 166.66
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
