1. Introduction
Heritage textiles are fragile fibrous surface systems composed of yarn networks, dyed layers, embroidered structures, and aged surface interfaces. Their long-term preservation depends not only on cultural value, but also on the stability of surface morphology, chromatic integrity, fiber cohesion, and resistance to environmental deterioration. From the perspective of applied surface science, textile deterioration can therefore be understood as a set of material changes occurring on complex fibrous substrates [
1]. Traditional textiles and clothing from the Kashgar region in northwestern China provide a representative case of such complex surface systems. Shaped by Silk Road exchange, regional craft practices, and historically significant textile and embroidery traditions, these objects include garments, embroidered fabrics, woven cottons, silk-based textiles, decorative cloths, and household textile artifacts [
2]. While they carry historical, artistic, and ethnographic significance, their preservation depends partly on the reliable detection and interpretation of surface degradation.
Kashgar heritage textiles commonly feature cotton, silk, blended yarns, embroidered structures, dyed surfaces, and lightweight woven substrates. Their preservation is constrained by the intrinsic instability of fibrous materials and by long-term environmental, mechanical, and handling-related stress [
3,
4]. These factors may lead to visible and latent deterioration, including yarn breakage, fiber shedding, holes, staining, abrasion, local deformation, chromatic fading, and loss of surface cohesion [
5,
6]. Such phenomena involve changes in morphology, texture continuity, chromatic response, contamination accumulation, fiber exposure, and local substrate integrity. Textile defects are therefore not merely visual irregularities, but measurable indicators of material instability [
7]. Reliable surface defect detection is essential for condition documentation, preliminary screening, and preventive conservation assessment.
Conventional examination of textile surface deterioration still relies heavily on manual visual inspection by conservators. Although expert observation remains indispensable, it is time-consuming, partly subjective, and difficult to standardize across large collections, subtle defects, or complex woven and embroidered textures [
8,
9]. Compared with industrial fabrics, heritage textiles pose greater analytical challenges due to heterogeneous materials, irregular manufacturing traces, aged surface layers, and non-uniform deterioration pathways [
10]. In the Kashgar context, this complexity is further amplified by the coexistence of woven, dyed, embroidered, and composite textile structures, each responding differently to environmental fluctuation, light exposure, contamination, and mechanical stress.
With advances in scientific instrumentation, specialized devices have increasingly been applied to assess textile damage. Digital microscopes [
11], infrared thermography cameras [
12,
13], and spectroscopic systems [
14] can provide valuable surface and material information. However, these methods often require costly equipment, limited inspection areas, specialized operation, and controlled experimental conditions. These constraints limit their suitability for routine, large-scale, or repeated monitoring of heritage textile surfaces. Computer vision has therefore become a promising non-contact and scalable approach for detecting visible surface deterioration in fragile heritage materials.
Earlier studies in textile analysis have examined fiber morphology measurement [
15,
16], yarn structure characterization [
17], fabric pilling evaluation [
18], and texture or defect recognition [
19,
20]. Yuen et al. proposed a hybrid model integrating a genetic algorithm with a three-layer back-propagation neural network, in which the genetic algorithm optimized fabric structural elements, and the neural network performed image classification [
21]. Hsu et al. [
22] used a genetic algorithm-based approach to optimize yarn-dyeing process scheduling, demonstrating improved performance over conventional methods. These studies established an important basis for intelligent textile analysis, particularly in textile structure, production quality, and surface pattern recognition. However, they were not specifically designed for conservation-oriented assessment of culturally significant heritage textile surfaces.
In recent years, deep learning–based object detection models such as YOLO, Faster R-CNN, and Mask R-CNN have shown strong performance in industrial fabric inspection and damage detection on other material surfaces, including building materials [
23], industrial fabrics [
24], ceramic cracks [
25,
26], and metal corrosion [
27]. These studies suggest that convolution-based detectors can learn to detect fine-grained visual anomalies in complex textured backgrounds and therefore support conservation-oriented textile inspection. In fabric defect detection, Wang et al. [
28] proposed an improved YOLOv5 method incorporating adaptive spatial feature fusion and attention mechanism to enhance multiscale feature processing, achieving an average detection accuracy of 71.70%. Zheng et al. [
29] developed the SE-YOLOv5 by embedding squeeze-and-excitation modules into the backbone and replacing ReLU with ACON, thereby improving detection accuracy, generalization, and robustness. Zhao et al. [
30] introduced a transfer-learning-based defect detection system using an improved Faster R-CNN with ResNet50 and feature pyramid enhancement, substantially improving mean detection accuracy.
Nevertheless, most existing studies focus on industrial defect detection or generic textile datasets, while intelligent detection tailored to fragile heritage textile surfaces remains underdeveloped [
23,
24,
25,
26]. This limitation is especially evident in regional textile heritage such as Kashgar textile, where material composition, weave structures, decorative layers, and degradation patterns differ from those of standard industrial fabrics. Three technical challenges are particularly relevant. First, material heterogeneity reduces the transferability of general-purpose models across cotton, silk, blended substrates, embroidery, and aged dyed surfaces. Second, many defects have weak boundaries and strong multiscale variation, ranging from micro-level fiber disruption to broad fading or local structural collapse. Third, heritage datasets are naturally limited and imbalanced because genuinely damaged samples are scarce and cannot be artificially generated without compromising conservation ethics. In addition, existing studies often stop at surface defect localization or category recognition, while less attention has been paid to translating detected surface anomalies into interpretable condition levels for documentation and expert review.
To address these issues, this study develops an improved YOLOv8-based framework for detecting fine-grained surface defects on Kashgar heritage textiles. The method is designed not as a generic engineering solution but as an applied surface-inspection framework that links image-based anomaly detection with preliminary surface-condition assessment. A high-resolution dataset was established from representative textile samples and annotated into six surface-degradation categories, including stains, broken yarn, yarn shedding, holes, abrasion, and color fading. The proposed YOLOv8-MABFT model contains two coordinated improvements: a C2f-Faster-EMA module for feature preservation, lightweight aggregation, and attention-based texture-interference suppression, and an RT-DETR-informed decoder head for contextual alignment and localization stability. CAM, Grad-CAM, Grad-CAM++, and SSCAM are used as complementary interpretability tools to visualize model attention and assess whether detected regions correspond to materially meaningful surface abnormalities. Building on the detection stage, this study introduces a Random Forest-based classification strategy to assign preliminary surface-condition levels based on multiple surface-condition variables. Comparative experiments against Faster R-CNN and representative YOLO baselines demonstrate the advantages of the proposed framework in precision, recall, detection robustness, deployment efficiency, and preliminary condition-level assessment.
This study makes four main contributions. First, it proposes a surface-degradation taxonomy focused on the material deterioration characteristics of Kashgar heritage textiles rather than on industrial fabric flaws alone. Second, it constructs a representative dataset of images for culturally significant fibrous textile substrates with complex surface morphologies. Third, it develops an optimized lightweight detection framework that balances detection accuracy and computational efficiency in non-contact surface inspection. Fourth, it demonstrates how AI-based visual detection can support applied surface-science analysis, surface-condition documentation, and preliminary condition-level assessment of fragile textile substrates in museum and field contexts.
2. Materials and Methods
2.1. Fibrous Surface Substrates and Surface Deterioration Characteristics
The Kashgar region in northwestern China, historically shaped by long-term exchange along the Silk Road, preserves a rich textile heritage characterized by diverse material systems, embroidery traditions, weaving practices, and decorative techniques. From the perspective of applied surface science, these heritage textiles can be regarded as complex fibrous surface substrates composed of yarn networks, dyed surface layers, embroidered relief structures, and aged material interfaces. Garments, embroidered cloths, household textiles, and other culturally significant fabric artifacts from this region commonly incorporate cotton, silk, blended textile systems, lightweight woven substrates, and patterned or dyed surfaces. To construct a representative and surface-oriented dataset for heritage textile defect detection, this study selected major textile-related cultural contexts associated with the Kashgar region as image collection sources. These sources were intended to cover a broad range of textile surface substrates, structural characteristics, and naturally occurring surface deterioration patterns associated with long-term use, storage, environmental exposure, and material aging.
Representative samples were collected from three major categories of heritage textile context. First, traditional garment and costume textiles included regionally characteristic garments and dress-related textiles, such as embroidered clothing, woven cotton fabrics, and silk-based decorative materials. These objects provided dense surface textures, fine yarn structures, and complex surface details suitable for detecting subtle anomalies such as fiber breakage, yarn exposure, yarn loosening, edge abrasion, micro-staining, and local loss of surface continuity. Cotton-based textiles often exhibited surface roughness, yarn exposure, and staining retention, while silk-based materials showed reflective surface responses, weak-contrast fading, and sensitivity to illumination variation. Embroidered garments further presented raised surface relief, shadow interference, and local stress concentration around decorative threads.
Second, decorative and household textile artifacts included embroidered fabrics, patterned cloths, and interior textile decorations. Such materials often exhibited repeated motifs, layered decorative treatments, chromatic surface layers, and localized stress accumulation. These characteristics made them appropriate for sampling surface defects related to wear, chromatic fading, seam distortion, partial structural damage, and contamination accumulation. In particular, dyed and patterned textiles were important for examining chromatic layer instability, fading boundaries, color inconsistency, and the difficulty of distinguishing true surface degradation from normal decorative variation.
Third, stored and aged textile substrates focused on folded or stored textiles, materials kept in conservation-related contexts, and other substrates showing authentic degradation. These samples were particularly valuable for documenting realistic surface deterioration phenomena, including discoloration, dust-associated staining, fiber embrittlement, local tears, yarn shedding, surface cohesion loss, and material fatigue caused by long-term aging. Aged, fragile fabrics were included because they frequently display weakened fiber cohesion, reduced substrate stability, and irregular surface morphology, which are central indicators of surface degradation in heritage textile conservation.
To ensure that the dataset reflects the material diversity and surface complexity of Kashgar heritage textiles, representative textile substrates were selected across six major types: cotton, silk, blended yarn systems, lightweight woven substrates, patterned or dyed textiles, and aged, fragile fabrics. These substrates collectively capture the main surface-material conditions encountered in the regional textile heritage context, ranging from relatively stable woven cottons to highly sensitive silk-based and degraded fabrics. Their inclusion was essential for representing both material heterogeneity and differential vulnerability to surface deterioration. More specifically, cotton substrates were associated with surface roughness, yarn exposure, and staining retention; silk substrates were associated with reflective surface behavior and weak-contrast fading; embroidered textiles were associated with raised relief, shadow interference, and local fiber disturbance; dyed textiles were associated with chromatic layer instability and fading; and aged fabrics were associated with fiber embrittlement, surface cohesion loss, and local substrate weakening.
Particular attention was also paid to naturally aged defects most relevant to surface-condition assessment. Across the sampled textile contexts, recurrent deterioration phenomena included staining, chromatic fading, fiber breakage, yarn loosening, abrasion, holes, and local deformation. These conditions reflect the broader material-aging environment of heritage textiles in the Kashgar region and provide the empirical basis for identifying conservation-relevant surface abnormalities. Compared with artificially generated flaws in industrial inspection datasets, these naturally occurring defects more accurately represent the true condition of heritage textile surfaces. They therefore provide a more reliable foundation for surface-oriented intelligent detection, surface-condition mapping, and deterioration-risk assessment.
Across these three source categories, field-oriented surface image acquisition was carried out to capture realistic surface deterioration conditions of heritage textile substrates. To ensure dataset diversity, samples were selected to include variation in fiber composition, weave density, decorative treatment, surface texture, chromatic response, and degradation intensity. This strategy enabled the dataset to represent both material heterogeneity and surface-defect heterogeneity, two key challenges in heritage textile surface analysis.
Figure 1 summarizes the geographic context, source categories, representative fibrous substrates, and typical naturally aged surface defects of Kashgar heritage textiles that informed the subsequent dataset construction and preprocessing procedures. As a result, the final image corpus provides a robust material foundation for subsequent surface defect detection, surface-condition classification, and deterioration-level assessment.
To further clarify the material basis of the detection task, the representative textile substrates were described not only by fiber composition but also by their geometric architectures, yarn constructions, surface relief, and optical responses, as shown in
Table 1. These structural properties are directly relevant to image-based defect capture because weave density, yarn thickness, thread orientation, embroidery relief, and surface reflectance influence how deterioration boundaries appear in RGB images. For example, loosely woven or aged substrates may expose yarn displacement and fiber loosening more clearly, whereas compact silk-based surfaces may produce reflective highlights that obscure weak abrasion or fading boundaries. Similarly, raised embroidery introduces local shadows and relief edges that may visually resemble yarn shedding or abrasion. Therefore, the dataset was organized to include different substrate architectures so that the model could learn defect features under heterogeneous material, optical, and textural conditions.
2.2. Surface Image Acquisition, Annotation, and Preprocessing
The dataset was constructed as a surface-image corpus, focusing on visible changes in textile surface morphology, chromatic continuity, and material discontinuity. To develop a deep learning model capable of detecting surface defects on heritage textiles from the Kashgar region of China, a dedicated dataset of 8247 high-resolution annotated images was constructed. The dataset includes six representative surface-degradation categories: stains, broken yarn, yarn shedding, holes, abrasion, and color fading—each category comprising approximately 1300 to 1400 annotated samples, as shown in
Figure 2. Although the sample size was constrained by the scarcity of authentic heritage textiles, limited access to collection materials, and conservation ethics, priority was placed on maximizing substrate diversity, texture variability, surface-defect heterogeneity, and scene representativeness.
To improve dataset transparency and reproducibility, the final image corpus was further organized according to image source. As summarized in
Table 1, the dataset consisted primarily of authentic Kashgar heritage textile images, supplemented by open-source textile images selected for material and visual relevance. No laboratory-generated or artificially damaged samples were used in the final dataset. Authentic heritage textile images were treated as the primary data source because they directly reflect the surface-degradation characteristics of regional textile materials, aging processes, handling traces, and preservation conditions. Supplementary open-source images were used only to enhance the coverage of underrepresented textile surface appearances and deterioration patterns, rather than to replace authentic heritage textile images.
The authentic heritage textile images covered traditional garments, embroidered fabrics, woven cotton textiles, silk-based decorative materials, household textile artifacts, patterned cloths, and aged stored textile substrates associated with the Kashgar region. These images were collected to capture realistic variations in fiber composition, weave density, decorative structure, chromatic response, surface reflectance, and naturally occurring deterioration. The supplementary open-source images were retained only when their textile structure, material appearance, surface texture, and visible deterioration morphology were consistent with the target heritage textile surfaces. Images showing industrial fabric regularity, unrealistic defect boundaries, non-comparable materials, or visual conditions unrelated to heritage textile surfaces were excluded.
Image acquisition conditions were documented to improve reproducibility. Authentic textile images were obtained under non-contact imaging conditions in museum, storage, documentation, or controlled indoor environments. The imaging distance was adjusted according to object size and defect scale, with close-range images used for local surface deterioration and wider images used for textile panels or larger surface regions. Illumination conditions included diffuse indoor lighting, museum or storage-room lighting, and controlled auxiliary lighting when available. Direct flash and strong specular illumination were avoided as far as possible to reduce glare on silk-based or reflective textile surfaces. Original image resolutions varied according to acquisition source and device, but only images with sufficient surface detail for defect annotation were retained. Most retained images had an original long-side resolution above 2000 pixels, and all images were subsequently resized to 640 × 640 pixels for model training and evaluation.
To describe source diversity across textile substrates, the dataset was further summarized by textile type, as shown in
Table 2. The distribution was not intended to represent the full statistical population of Kashgar heritage textiles, but to ensure that the model was exposed to a wide range of surface structures, fiber appearances, and deterioration contexts.
To ensure comprehensive coverage of the material conditions and deterioration patterns commonly found in Kashgar heritage textiles, the dataset was designed to reflect four major surface-condition dimensions and operationalized during preprocessing: substrate composition, surface texture complexity, environment- and storage-related degradation, and surface-degradation type. In terms of substrate composition, the dataset covers cotton, silk, blended yarn systems, embroidered textiles, lightweight woven cloth, and decorative composite fabrics. In terms of surface texture complexity, it includes dense weave structures, patterned woven textiles, raised embroidered surfaces, and reflective silk-based materials. In terms of degradation context, the dataset incorporates representative aging phenomena such as dust staining, fiber embrittlement, yarn loosening, color fading, fold-induced damage, surface abrasion, and local substrate weakening associated with long-term handling, storage fluctuation, light exposure, and inland environmental stress. These dimensions guided both sampling and preprocessing protocols, ensuring that the dataset captured the full range of visual variability caused by substrate composition, surface patterning, environmental aging, chromatic instability, and structural damage.
The inclusion of highly textured surfaces, such as embroidered textiles, patterned woven fabrics, and decorative composite cloth, was particularly important for testing the model’s robustness against surface-background interference, repetitive patterns, and low-contrast defect boundaries. Likewise, reflective or fine-fiber materials such as silk were included to evaluate model stability under weak-boundary conditions and complex light responses. In this sense, the dataset was not constructed merely as a collection of defect images, but as a material-sensitive and surface-oriented image corpus reflecting the visual, structural, and chromatic complexity of Kashgar heritage textiles.
A potential domain shift between authentic heritage textile images and supplementary open-source images was also considered during dataset construction. Supplementary images may differ from authentic heritage textile images in acquisition conditions, lighting consistency, background complexity, image quality, and defect-boundary clarity. Such differences may affect model generalization if the detector learns source-specific visual patterns rather than degradation-relevant surface features. To reduce this risk, supplementary images were screened according to material and visual relevance, and only images with textile structures, surface textures, and deterioration morphologies comparable to the target heritage textile surfaces were retained. In addition, source-domain labels were retained during dataset organization so that future source-stratified evaluation could distinguish performance on authentic heritage textile images and supplementary open-source images. All image sources followed the same annotation, preprocessing, augmentation, and data-splitting protocol to minimize source-related inconsistencies.
To improve model generalization across varying textures, materials, defect scales, surface reflectance, and illumination conditions, a multidimensional augmentation strategy was implemented. This included random cropping to simulate close-range surface inspection, ±90° rotation and affine transformation to simulate different object placement and image acquisition angles, Gaussian blur to mimic soft-focus conditions caused by fiber fuzziness or shallow imaging depth, and background-texture smoothing to reduce excessive interference from dense embroidered or patterned surface regions. For highly textured textiles, local contrast enhancement was additionally applied to improve the visibility of subtle surface defects such as fine broken yarn, abrasion traces, weakly bounded fading regions, and localized material discontinuities. All external images were anonymized by removing metadata such as timestamps, location, and device information. No identifiable individuals were included in the dataset; any incidental human figures were blurred using an OpenCV-based preprocessing pipeline followed by manual verification. These procedures were adopted to ensure data privacy and to standardize the visual input before model training.
Annotation was conducted by three researchers with backgrounds in textile conservation and material degradation analysis, each with more than two years of experience in heritage textile damage identification. Every image was independently labeled by two annotators and reviewed by a third following the YOLO bounding-box standard with an IoU ≥ 0.5 threshold. The overall Cohen’s Kappa coefficient reached 0.90 (SD = 0.05), indicating excellent inter-annotator agreement. Category-wise agreement was as follows: stains κ = 0.92, yarn shedding κ = 0.89, broken yarn κ = 0.89, holes κ = 0.91, abrasion κ = 0.88, and color fading κ = 0.90. Among these, abrasion and color fading were the most challenging categories because of irregular boundaries, gradual transitions, and their frequent overlap with patterned or aged textile backgrounds. Scene-dependent agreement was κ = 0.87 in low-light museum or storage environments and κ = 0.94 under controlled imaging conditions. Approximately 3.5% of samples with IoU < 0.5 or centroid deviation > 10 px were resolved through a third-round majority vote.
To prevent data leakage, images originating from the same textile object, the same cloth fragment, or the same decorative panel were assigned exclusively to either the training, validation, or test set based on unique sample identifiers. The final dataset was split into 70% for training, 20% for validation, and 10% for testing before any random data augmentation was performed. After this split was completed, augmentation operations were applied only to the training set. The validation and test sets were not subjected to random augmentation; they were only resized and normalized using the same preprocessing pipeline required for model input. This strategy prevented highly similar local textures or augmented variants of the same original image from appearing across multiple subsets and ensured that the reported validation and test results reflected model generalization rather than augmentation-induced similarity.
Model training was conducted using the improved YOLOv8-MABFT framework. Because the newly incorporated C2f-Faster-EMA module and RT-DETR-informed decoder head lacked pretrained weights specifically adapted to heritage textile surfaces, the model was trained from scratch. Training settings were as follows: optimizer = SGD, initial learning rate = 1 × 10−2, momentum = 0.9, weight decay = 1 × 10−4, batch size = 12, input resolution = 640 × 640, and cosine annealing learning-rate scheduling. Experiments were performed on an NVIDIA RTX 2080 (24 GB) with an Intel Core i9-13900K, running Ubuntu 22.04 and PyTorch 2.1.0 (CUDA 12.1, cuDNN 8.9). Training proceeded for 200 epochs, with early stopping triggered if the validation F1-score did not improve for 50 consecutive epochs.
2.3. Model Comparison and Improvement
Before constructing the improved YOLOv8-based detector, a systematic comparison of representative object detection models was conducted to evaluate their adaptability to surface defect detection on Kashgar heritage textiles. The model was designed to support applied surface inspection rather than generic object detection. Its architectural modifications aim to improve the recognition of surface discontinuities, weak-boundary abrasion, chromatic surface changes, and texture-induced false positives on heterogeneous fibrous substrates. Because heritage textile substrates from the Kashgar region exhibit dense embroidered textures, patterned woven surfaces, reflective silk interfaces, and weak-boundary aging-related deterioration, the detection task is more complex than conventional industrial fabric inspection. To establish reliable baselines, several widely used object detection frameworks were selected for comparison, including SSD, Faster R-CNN, YOLOv5, YOLOv7, and YOLOv8. These baselines represent both two-stage detectors, such as Faster R-CNN, and single-stage detectors, such as SSD and the YOLO series, thereby enabling a structured comparison of their adaptability to high-texture fibrous surface backgrounds, multiscale surface degradation, and heterogeneous heritage textile substrates.
The proposed YOLOv8-MABFT architecture was developed through a problem-driven and stepwise design strategy rather than by arbitrarily combining multiple improvement modules. The starting point of the design was the specific visual and material characteristics of Kashgar heritage textile surfaces. Unlike standard industrial fabric inspection, this task involves heterogeneous fibrous substrates, dense woven or embroidered backgrounds, reflective silk surfaces, weak-boundary deterioration, and visually gradual defects such as abrasion and color fading. These characteristics create four major technical requirements for model design: preserving fine surface details, enhancing lightweight multiscale feature aggregation, suppressing background texture interference, and modeling broader contextual relationships across complex textile surfaces. Therefore, the architecture was modified at two coordinated levels: first, a C2f-Faster-EMA module was introduced into the feature extraction and fusion stages; second, an RT-DETR-informed decoder head was used at the prediction stage, as illustrated in
Figure 3.
It should be emphasized that the proposed architecture does not treat C2f, lightweight feature aggregation, EMA-based attention, and RT-DETR-informed decoding as unrelated add-on components. Instead, they are organized according to their functional roles in the detection pipeline. C2f, lightweight feature aggregation, and EMA-based attention are integrated into a unified C2f-Faster-EMA module to strengthen local surface-feature preservation, improve efficient feature transmission, and suppress texture interference. The RT-DETR-informed decoder head is then used at the output stage to enhance contextual consistency and localization stability. In this sense, the architecture contains two principal improvements: the C2f-Faster-EMA module for feature enhancement and texture-interference suppression, and the RT-DETR-informed decoder head for context-aware prediction.
In the backbone of YOLOv8-MABFT, the multistage hierarchical feature-extraction structure of YOLOv8 is retained, but several standard bottleneck units are replaced with C2f-Faster-EMA modules. This module integrates three functions: C2f-based feature preservation, lightweight feature aggregation, and EMA-based attention refinement. The C2f structure improves cross-stage feature fusion and gradient flow, helping preserve shallow and intermediate surface details that are important for detecting broken yarn, early-stage yarn shedding, slight abrasion, and low-contrast deterioration boundaries. The lightweight aggregation component improves feature transmission across different depths without substantially increasing computational cost, thereby supporting the recognition of both small local defects and larger deteriorated regions. The EMA-based attention mechanism further suppresses interference from repetitive motifs, embroidery shadows, reflective fibers, and aged but non-defective surface textures, improving defect–background discrimination on complex heritage textile surfaces.
The SPPF module is retained at the deeper stage of the backbone to enlarge the receptive field and represent broader degradation patterns, such as large-area fading, dispersed staining, and structurally weakened regions. Building on this convolution-based feature-extraction structure, a lightweight RT-DETR-informed decoder head is integrated at the prediction stage. Compared with a purely convolutional prediction head, this transformer-informed decoder can model broader contextual relationships among spatially separated but visually related surface regions. This is useful for weak-boundary deterioration, gradual fading, and partially obscured broken yarn, where local convolutional responses alone may be insufficient for stable localization.
The detection head follows YOLOv8’s decoupled prediction strategy, allowing classification and localization to be optimized separately across multiple output resolutions. Training was performed using stochastic gradient descent (SGD), consistent with the optimization setting reported in
Section 2.2. A moving-average weight update strategy was also adopted to reduce optimization fluctuations on highly textured, heterogeneous fibrous surfaces.
Figure 4 further illustrates the internal attention refinement mechanism embedded in the C2f-Faster-EMA module. In this structure, the input feature map is divided into grouped feature representations, and both horizontal and vertical average-pooling operations are used to capture directional spatial dependencies. The attention responses generated from these pooled features are then combined with local convolutional responses to enhance informative regions and suppress irrelevant background textures. This design is suitable for heritage textile surfaces because subtle deterioration often appears within dense visual textures, and local material discontinuities can easily be confused with decorative patterns, woven motifs, or embroidery relief.
To further enhance robustness under complex patterned backgrounds and multi-scale fine defects, this study integrates an RT-DETR-informed head, inspired by Real-Time DETR (RT-DETR) [
31], into the improved YOLOv8-MABFT framework, as shown in
Figure 5. This component enables cross-scale contextual alignment and allows the model to preserve informative spatial tokens from visually ambiguous textile surface regions. By integrating local convolutional responses with broader contextual cues, the decoder head improves the recognition of subtle surface anomalies such as weak abrasion traces, gradual fading transitions, and partially obscured broken yarn regions. Compared with conventional purely convolutional prediction heads, this strategy provides better semantic consistency across dense textile surface patterns while maintaining practical computational efficiency. For heritage textile surface defect detection, such a design is especially valuable because the model must distinguish genuine deterioration from normal variability in weave density, embroidery texture, and decorative surface structure.
In summary, the proposed YOLOv8-MABFT framework contains two coordinated improvements. First, the C2f-Faster-EMA module enhances fine surface-feature preservation, lightweight feature aggregation, and attention-guided suppression of texture interference. Second, the RT-DETR-informed decoder head improves contextual alignment and localization stability in visually ambiguous regions of textiles. The final architecture was retained because it improved detection performance without introducing a substantial computational burden, as further evaluated in the ablation experiments and model comparison sections. In this sense, the development of YOLOv8-MABFT followed both theoretical task analysis and empirical validation: the component selection was guided by the material and visual properties of heritage textile degradation, while the final configuration was confirmed through ablation experiments and comparative evaluation. This improved architecture provides the technical foundation for the subsequent performance evaluation, attention visualization, and preliminary surface-condition assessment presented in the following sections.
2.4. Model Evaluation Metrics
To comprehensively evaluate the performance of the YOLOv8-MABFT for detecting surface defects on Kashgar heritage textiles, four widely used object detection metrics were employed: Precision, Recall, F1-score, and mAP@0.5. These indicators assess the model’s effectiveness from multiple perspectives, including small-defect recognition, resistance to texture interference, and the detection of weak-boundary deterioration patterns on complex heritage textile surfaces.
- (1)
Precision measures the proportion of predicted defect samples that are true defects. It reflects the model’s ability to suppress interference caused by dense woven motifs, embroidered textures, and repetitive decorative patterns. A higher precision indicates fewer false positives, which is particularly important for heritage textiles where background ornamentation may visually resemble actual defects.
- (2)
Recall evaluates the proportion of real defects that the model successfully detects. It reflects sensitivity to subtle anomalies such as fine broken yarn, early-stage yarn shedding, slight abrasion, small holes, and weakly bounded color fading. A high recall indicates a low missed-detection rate and more comprehensive coverage of multi-scale deterioration patterns in heritage textile substrates.
- (3)
F1 score, the harmonic mean of Precision and Recall, provides a balanced assessment of detection accuracy and sensitivity. For heritage textile surfaces with complex textures, diverse material conditions, and multiple defect categories, the
F1 score offers a more robust evaluation than any single metric alone.
- (4)
mAP@50 is used as the overall detection accuracy metric, representing the average detection performance across categories at an IoU threshold of 0.5. It effectively reflects the model’s ability to identify multiple defect types—including stains, broken yarn, yarn shedding, holes, abrasion, and color fading—and is a standard comprehensive indicator in object detection research.
To further assess whether the observed improvements of YOLOv8-MABFT over the baseline detection models were meaningful rather than isolated numerical differences from a single test evaluation, a paired bootstrap significance analysis was conducted on the fixed test set. Because all detection models were evaluated on the same test images, the comparisons between YOLOv8-MABFT and each baseline detector could be treated as paired at the test-sample level. For each comparison, the test images were resampled with replacement for 10,000 bootstrap iterations. In each iteration, the detection metrics were recalculated for both YOLOv8-MABFT and the corresponding baseline model, and the paired performance difference was recorded. The bootstrap distribution of the paired differences was then used to estimate the mean difference, 95% bootstrap confidence interval, and empirical p-value. The reported mean differences and confidence intervals are expressed in percentage points. A performance improvement was considered statistically supported when the 95% bootstrap confidence interval of the difference did not include zero. Because multiple baseline comparisons were conducted, Holm–Bonferroni correction was applied to the empirical p-values. This paired bootstrap procedure was used as a practical robustness check to assess whether the reported improvements were stable across test-sample variability. Since the analysis was conducted using fixed trained models on a fixed test set, it should be interpreted as evidence of fixed-test-set robustness rather than a complete replacement for repeated training across multiple random seeds.
2.5. Surface Degradation–Based Condition Level Prediction
The evaluation of surface-condition levels in Kashgar heritage textiles involves not only the visual identification of surface defects but also the integrated assessment of their material, structural, perceptual, and conservation-related implications. Although the condition of heritage textiles is often described qualitatively in conservation practice, expert assessment has long relied on a composite judgment of material degradation, structural disturbance, visual integrity, and preservation significance. In this study, the surface-condition level of heritage textiles is modeled as a predictable attribute driven by multidimensional textile-surface features. The goal is not to reduce conservation judgment to a purely mechanical score, but to operationalize the long-standing conservation principle that the condition of heritage textiles can be systematically observed, described, and interpreted through a theoretically grounded computational framework.
Accordingly, the scoring dimensions designed in this study meet both statistical modeling requirements and conservation-oriented material logic. This is validated using a Random Forest classifier integrated with SHAP (SHapley Additive exPlanations), forming a closed loop between theoretical formulation and technical implementation. SHAP, grounded in cooperative game theory, provides an interpretable framework for black-box models by quantifying the marginal contribution of each input feature to the model output. In the context of heritage textile surface assessment, this interpretability is particularly important because conservation decisions require not only accurate predictions but also transparent explanations of why a textile surface is classified as stable, mildly deteriorated, moderately deteriorated, or severely deteriorated. The methodological backbone of this section follows the same predictive and interpretive logic as the original framework. At the same time, the feature ontology and scoring criteria are fully redefined for heritage textile surfaces.
During model training, an 80/20 hold-out method was adopted to split the dataset into training and testing subsets. Eighty percent of the data was used for model training, while the remaining 20% was reserved for performance validation and generalizability assessment. The input feature matrix consisted of both categorical and continuous variables. Categorical features were converted into numeric form using One-Hot Encoding (OHE), and all quantitative features were normalized using Min-Max Scaling where required by the model. These preprocessing steps ensured that all candidate classifiers were evaluated under the same feature representation and data-processing conditions.
To ensure that the choice of classifier was empirically justified rather than based only on conceptual considerations, a comparative classifier evaluation was conducted before the final model was selected. The same seven surface-condition variables—Severity, Distribution, Structural, Chromaticity, Repairability, Focus, and Sensitivity—were used as input features for all candidate classifiers. The evaluated classifiers included Logistic Regression, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree, Gradient Boosting, LightGBM, XGBoost, and Random Forest. These models were selected because they represent linear, kernel-based, distance-based, single-tree, boosting-based, and bagging-based approaches commonly used for tabular classification tasks.
All classifiers were evaluated under the same data split, preprocessing strategy, and evaluation metrics. Categorical variables were encoded using one-hot encoding where necessary. Continuous variables were normalized using Min–Max scaling for scale-sensitive models, including Logistic Regression, SVM, and KNN. Tree-based models, including Decision Tree, Gradient Boosting, LightGBM, XGBoost, and Random Forest, were trained using the same feature representation but did not require feature scaling. Model performance was evaluated using overall accuracy, Macro-F1, Weighted-F1, and Cohen’s Kappa. Overall accuracy was used to assess general prediction correctness; Macro-F1 was used to evaluate balanced performance across the four condition levels; Weighted-F1 was used to account for possible class-frequency differences; and Cohen’s Kappa was used to measure agreement beyond chance.
Based on the comparative classifier evaluation, Random Forest was retained as the final surface-condition prediction model. The selection was supported by its best overall performance across the evaluation metrics and by its suitability for interpretable condition assessment. Although Gradient Boosting, LightGBM, and XGBoost showed competitive performance, Random Forest achieved the strongest overall balance between accuracy, class-wise performance, agreement beyond chance, and interpretability. The detailed comparison is reported in
Supplementary Table S1.
After the Random Forest classifier was selected, key hyperparameters were adjusted to optimize prediction performance, including the number of trees, maximum depth, maximum number of features per split, and minimum number of samples per leaf. These settings were used to balance classification accuracy, computational efficiency, model stability, and interpretability. Once the model was fully trained, SHAP was employed to conduct model explainability analysis. All SHAP interpretations were generated without retraining the model, ensuring that the explanation process did not introduce additional bias. The SHAP-based interpretability analysis was carried out across four dimensions:
- (1)
Class-wise comparison of feature contributions: SHAP summary plots were generated for each surface-condition level (0–3) to reveal how different surface-material variables contributed to predictions, including the direction and magnitude of their effects.
- (2)
Global feature importance ranking: bar plots based on mean absolute SHAP values were used to compare the overall influence of each variable across all classes, highlighting which material, structural, or visual factors dominate the model’s decisions.
- (3)
Precision–Recall curve evaluation using five-fold cross-validation: to address class imbalance, five-fold cross-validation was conducted, and PR curves were plotted for each level to visualize the model’s trade-offs in identifying low-frequency but high-risk surface deterioration categories.
- (4)
ROC–AUC comparison between training and test sets: multiclass ROC curves were plotted, and the AUC was calculated for each class to assess model stability and detect signs of overfitting.
All SHAP visualizations were implemented using the Python (3.10.12) SHAP library, together with Matplotlib version 3.7.2 and Scikit-learn version 1.3.0 for visualization and metric computation. This interpretability framework not only quantified the influence of each feature on model predictions but also improved the transparency, trustworthiness, and academic reproducibility of the proposed textile-surface condition assessment system.
To translate textile conservation theory into an operational and model-ready feature system, this study adopts a simplified but conservation-grounded ontology of Material–Damage–Perception, which serves as the conceptual foundation for all subsequent quantitative modeling. (i) Material refers to the physical characteristics of textile substrates, including fiber type, weave density, structural fineness, and sensitivity to degradation. (ii) Damage refers to the actual deterioration state of the textile surface, including the severity, spatial spread, and structural impact of visible defects. (iii) Perception refers to how deterioration is visually manifested and conservation-relevant, including chromatic inconsistency, visual saliency, and recoverability through intervention.
This tripartite framework mirrors how textile conservators and heritage specialists evaluate historic fabrics—not as isolated image patterns, but as culturally and materially embedded artifacts in which substrate composition, deterioration mechanism, and visible consequence interact. It also provides the theoretical rationale for the seven modeling variables used in this study: Severity, Distribution, Structural Damage, Color Difference, Repairability, Visual Focus, and Material Sensitivity. These variables were established through an expert-informed operationalization process rather than through post hoc model selection. Specifically, they were derived from three interrelated dimensions of textile surface assessment: material vulnerability, visible damage characteristics, and perceptual or conservation relevance. The seven variables were established through an expert-informed operationalization process rather than through post hoc model selection. Specifically, the variables were derived from three interrelated dimensions of textile surface assessment: material vulnerability, visible damage characteristics, and perceptual or conservation relevance.
Table 3 summarizes the operational meaning and theoretical foundation of each variable. These variables were defined before model training and were not adjusted based on Random Forest performance, SHAP values, or classification results, thereby reducing potential circularity between input construction and prediction outcomes.
The seven variables were constructed through a hybrid expert-informed and image-analysis-assisted workflow to balance conservation validity and computational reproducibility. Severity, Distribution, Structural Damage, and Repairability were primarily derived from expert annotation using a unified coding rubric grounded in textile conservation logic and material degradation assessment. Color Difference, Visual Focus, and Material Sensitivity were defined using fixed rule-based or image-analysis-assisted procedures: Color Difference was based on chromatic deviation between defect regions and adjacent relatively intact textile surfaces; Visual Focus reflected the positional prominence and display relevance of the detected defect; and Material Sensitivity was assigned according to the mapped relationship between textile substrate type and vulnerability to deterioration. The same scoring rules were applied consistently across all samples before model training, and none of these variables was adjusted according to Random Forest performance, SHAP values, or classification results.
To construct an evaluation model that is both computationally tractable and conservation meaningful, all seven variables were normalized to the [0, 1] range and assigned expert-informed weights according to their relative relevance in textile surface-condition assessment. The weighting scheme was established before model training and was not adjusted according to Random Forest performance, SHAP outputs, or classification results. This was done to reduce the risk of circularity between weight assignment and model evaluation.
The weights were assigned according to conservation-oriented material logic rather than statistical optimization. Variables most directly associated with irreversible material deterioration and substrate instability were given higher weights. Specifically, Severity (20%) reflects the basic intensity of surface deterioration, and Structural Damage (20%) captures direct weakening of textile integrity. Distribution (15%) reflects the spatial expansion of deterioration, while Material Sensitivity (15%) represents substrate-specific vulnerability. Color Difference (10%), Repairability (10%), and Visual Focus (10%) were assigned lower but still meaningful weights because they mainly refine the visual, treatment-related, and display-related interpretation of surface condition. To further minimize circularity, the Random Forest classifier was not used to determine the weights or redefine the condition labels. The four-level condition labels were generated from the predefined, expert-informed scoring framework, while the Random Forest model was used to evaluate whether the predefined variable system could support stable, interpretable condition-level prediction. SHAP analysis was subsequently used only for post hoc interpretation of feature contributions after model training, not for constructing the input variables, adjusting the weights, or redefining the outcome labels. This weighting scheme should therefore be understood as an expert-informed operational framework for preliminary surface-condition classification rather than as a statistically optimized or universally applicable grading model.
The composite surface-condition score for each heritage textile sample was calculated as a weighted linear combination of these normalized indicators:
where
denotes the normalized value of the
i-th variable for the
j-th textile sample, and
represents the corresponding weight assigned to that variable.
To facilitate supervised learning and classification evaluation, the continuous composite scores were discretized into four ordinal surface-condition levels (Levels 0 to 3), as shown in
Table 4. Unlike equal-width segmentation, the thresholds in this study were defined using expert-informed interval grading, which better reflects textile tolerance logic and conservation handling practice. This approach was chosen because the transitions between stable, mild, moderate, and severe surface deterioration in heritage textiles do not always follow equal statistical spacing; rather, they are shaped by preservation risk, material vulnerability, and intervention urgency.
The four predicted surface-condition levels were designed to provide a preliminary screening and documentation reference rather than a direct conservation prescription. In practical use, Level 0 indicates that the textile surface is relatively stable and may be recorded through routine digital documentation. Level 1 indicates mild surface deterioration and may require periodic monitoring, especially when defects occur in visually important or materially sensitive areas. Level 2 indicates moderate deterioration and may be used to prioritize objects for closer expert inspection, detailed condition reporting, or preventive conservation planning. Level 3 indicates severe surface deterioration and should trigger professional review by textile conservators, including detailed examination of structural stability, handling risk, storage conditions, and possible stabilization needs. Therefore, the predicted levels are intended to support triage, documentation, and expert prioritization, but they should not be interpreted as automatic treatment decisions.
The mapping between textile-surface features and surface-condition levels was clarified through three validation approaches. First, the seven variables were grounded in expert-informed reasoning in textile conservation. Severity, Distribution, Structural Damage, Color Difference, Repairability, Visual Focus, and Material Sensitivity were defined to reflect visible deterioration intensity, spatial spread, substrate weakening, chromatic inconsistency, treatment feasibility, visual saliency, and material vulnerability. Second, the prediction framework was designed to support consistency between the predefined expert-informed scoring logic and the Random Forest classification output. The Random Forest model was not used to define the variables, adjust their weights, or generate the scoring criteria; rather, it was used to evaluate whether the predefined variables could support interpretable condition-level prediction. Third, each variable was aligned with established conservation assessment principles. Severity and Structural Damage correspond to material stability evaluation; Distribution reflects deterioration spread; Color Difference and Visual Focus correspond to visible integrity and display impact; Repairability reflects treatment feasibility; and Material Sensitivity corresponds to substrate-specific vulnerability.
During preliminary framework development, alternative stratification strategies, including quantile-based binning and expert-only grading, were considered. However, these approaches produced less interpretable boundaries between adjacent levels and were less suitable for practical surface-condition documentation. The final four-level surface-condition scheme was therefore retained because it provided clearer semantic distinctions, better alignment with textile conservation practice, and stronger interpretive value for expert review.
To keep the workflow transparent and reproducible, the modeling and encoding process was organized into five steps. First, key surface and substrate characteristics were grouped into categorical and continuous variables. Second, each textile sample was assigned a four-level surface-condition label based on the weighted seven-variable evaluation system. Third, seven continuous variables representing material condition, visual manifestation, and conservation relevance were used as the core model inputs. Fourth, feature representation was controlled to avoid unnecessary redundancy and improve model stability. Fifth, categorical variables were encoded using One-Hot Encoding, while continuous variables were normalized using Min-Max Scaling where required, forming a unified input matrix for Random Forest classification and SHAP interpretability analysis.
The seven variables were scored using four normalized intervals: low concern (0.00–0.44), moderate concern (0.45–0.67), high concern (0.68–0.82), and very great concern (0.83–1.00). Severity reflects the intensity of the defect, as measured by size, contrast, visibility, and disruption to the textile surface. Distribution reflects the spatial extent, continuity, and spread of defects across the textile surface. Structural Damage evaluates the degree to which deterioration weakens the textile substrate, including yarn breakage, holes, tears, abrasion, deformation, and local loss of continuity. Color Difference measures chromatic deviation between deteriorated and relatively intact surface areas. Repairability reflects the difficulty of stabilizing, concealing, or mitigating the defect through conservation or technical intervention; a higher score indicates lower repair feasibility and greater conservation concern. Visual Focus reflects the saliency of the defect within the main visible or display-relevant area of the textile object. Material Sensitivity measures the vulnerability of the textile substrate to defects, with fragile silk, fine embroidery, aged, low-strength fibers, and layered decorative textiles receiving higher concern scores than more robust substrates.
To examine whether the expert-informed weighting scheme produced stable surface-condition assessment results, a robustness and sensitivity analysis was conducted and reported in the
Supplementary Materials. The purpose of this analysis was to evaluate whether reasonable changes in the relative weights of the seven surface-condition variables would substantially alter the composite condition scores or the final four-level condition assignments.
The original expert-informed weighting scheme was treated as the reference configuration. Four alternative weighting scenarios were examined. First, an equal-weight scenario assigned the same weight to all seven variables, providing a non-expert baseline for comparison. Second, a ±10% perturbation scenario was constructed by increasing and decreasing each variable weight individually by 10%, while proportionally normalizing the remaining weights so that the total weight remained 1. Third, the same procedure was repeated using a stronger ±20% perturbation to evaluate robustness under moderate weighting changes. Fourth, a conservation-risk-oriented weighting scenario was constructed by assigning relatively higher weights to variables more directly related to material deterioration risk, especially Severity, Structural, and Sensitivity, while proportionally reducing the weights of the remaining variables.
For each weighting scenario, the composite surface-condition score and the four-level condition assignment were recalculated using the same seven variables and the same condition-level thresholds. The alternative results were compared with the original expert-informed results using Spearman’s rank correlation coefficient for continuous condition scores, overall level agreement for four-level condition assignments, weighted Cohen’s Kappa for ordinal agreement, and the percentage of samples whose condition level changed. Because the condition levels are ordinal, weighted Cohen’s Kappa was used to account for the degree of disagreement between adjacent and non-adjacent levels. The full sensitivity analysis is reported in
Supplementary Table S3.
Overall, this framework transforms the assessment of Kashgar heritage textile surfaces from a purely descriptive inspection process into a structured and explainable preliminary surface-condition classification workflow. It can support surface-condition documentation, monitoring prioritization, and expert review, but it should not be interpreted as an automatic conservation grading system or a substitute for professional conservation judgment.
2.6. Attention Visualization (CAM-Based Analysis)
To qualitatively compare the attention behavior of the baseline YOLOv8 and the enhanced YOLOv8-based framework, we applied a set of commonly used Class Activation Mapping (CAM) techniques to visualize where the models focus during surface defect recognition in Kashgar heritage textiles. Instead of detailing the technical differences among individual CAM variants, we use them collectively as complementary visualization tools to assess whether the models attend to materially and conservationally meaningful regions—such as yarn discontinuities, fiber-loosening zones, stain boundaries, abrasion traces, faded decorative areas, and weakly bounded structural defects embedded in dense textile textures.
In this study, CAM, Grad-CAM, Grad-CAM++, and SSCAM are employed as a unified interpretability suite. CAM provides a coarse overview of the discriminative regions activated during textile defect detection. Grad-CAM and Grad-CAM++ improve semantic alignment and multi-region localization, enabling clearer observation of whether the model concentrates on fine-grained deterioration cues within complex woven, embroidered, or patterned surfaces. SSCAM provides more stable attention responses under visually degraded conditions, such as low contrast, uneven illumination, gradual fading, and texture-dense backgrounds. Together, these methods enable a robust comparison of model focus behavior without relying on low-level implementation details.
For heritage textile analysis, such visualization is particularly important because defects are often subtle, weakly bounded, and easily confused with decorative motifs, woven patterns, or material irregularities. Therefore, attention visualization serves not only as a model interpretation tool, but also as a means of examining whether the detection process is materially meaningful from a conservation perspective. A reliable model should focus on actual degradation-relevant surface regions rather than on unrelated ornamental structures, repeated background textures, or lighting artifacts. In this sense, the CAM-based analysis helps verify whether the enhanced model more consistently captures conservation-relevant surface abnormalities across diverse Kashgar textile substrates.
3. Results
3.1. Ablation Experiment
To assess the individual and combined contributions of the two proposed architectural improvements, a series of ablation experiments was conducted using the baseline YOLOv8n detector. As summarized in
Table 5, the ablation design evaluated two replaceable architectural units, the C2f-Faster-EMA module and the RT-DETR-informed decoder head.
Compared with the baseline YOLOv8n, introducing the C2f-Faster-EMA module alone improved the F1-score from 89.6% to 90.4%, Precision from 82.5% to 84.6%, and mAP@0.5 from 92.8% to 93.7%. These results indicate that the feature-enhancement module improved the model’s ability to distinguish true surface defects from visually similar textile background structures. The improvement in Precision suggests that the module helped reduce false positives caused by repetitive woven motifs, embroidery shadows, reflective fibers, and aged but non-defective surface textures. Although Recall decreased slightly from 98.0% to 97.0%, the overall improvement in F1-score and mAP@0.5 indicates that the module achieved a better balance between sensitivity and prediction reliability.
When the RT-DETR-informed decoder head was introduced alone, the model achieved a larger improvement in F1-score, increasing from 89.6% to 92.6%. Precision also increased from 82.5% to 87.7%, while Recall remained stable at 98.0%. This indicates that the decoder head improved prediction reliability without reducing the model’s ability to detect real defects. The improvement can be attributed to the decoder’s stronger contextual modeling capability, which is useful for weak-boundary deterioration patterns, gradual color fading, slight abrasion, and partially obscured broken yarn embedded in dense textile textures.
The best overall performance was achieved when the C2f-Faster-EMA module and the RT-DETR-informed decoder head were combined. The full YOLOv8-MABFT model achieved the highest F1-score of 94.6%, Precision of 91.4%, and mAP@0.5 of 94.0%, while maintaining Recall at 98.0%. Compared with the baseline YOLOv8n, the full model improved F1-score by 5.0 percentage points, Precision by 8.9 percentage points, and mAP@0.5 by 1.2 percentage points. These results demonstrate that the two improvements are complementary: the C2f-Faster-EMA module strengthens feature preservation and texture-interference suppression, while the RT-DETR-informed decoder head improves contextual alignment and localization stability.
The computational indicators remained comparable across the tested variants, with GFLOPs and parameter size reported at approximately 8.1 and 3.0 MB, respectively. This suggests that the performance gains were achieved without a substantial increase in computational burden. This property is important for practical heritage textile inspection, where models should remain lightweight enough for routine documentation and non-contact surface monitoring.
Overall, the ablation results confirm that the proposed improvements deliver measurable, complementary performance gains. The C2f-Faster-EMA module primarily improves feature discrimination under complex fibrous backgrounds, whereas the RT-DETR-informed decoder head contributes more strongly to contextual prediction and localization stability. Their combination produces the most balanced performance, supporting the final architecture design of YOLOv8-MABFT.
Figure 6 presents the normalized confusion matrix of the YOLOv8-MABFT framework across the six target defect categories in Kashgar heritage textiles—stains, broken yarn, yarn shedding, holes, abrasion, and color fading—together with the background class. The values on the main diagonal represent the proportion of samples that were correctly classified within each category. Overall, the strong diagonal concentration indicates that the model achieves high class-wise recognition accuracy and maintains stable discrimination across visually complex textile surfaces.
Among the six categories, structurally distinctive defects such as broken yarn and holes typically show the clearest recognition pattern. For non-specialist readers, this means that defects with obvious geometric interruption—such as a visible yarn break or an open hole in the fabric—are easier for the model to detect because they differ more clearly from the surrounding textile structure. By contrast, categories such as yarn shedding, abrasion, and color fading are generally more challenging because their boundaries are often weak, gradual, or partially blended into the original woven or embroidered background. Nevertheless, the model still maintains competitive performance on these subtle classes, demonstrating that it is able to capture fine surface deterioration even when the defect does not appear as a sharply separated object.
The confusion matrix also provides insight into where classification difficulty remains. Misclassification is more likely to occur between categories that share similar visual characteristics—for example, between abrasion and color fading, or between yarn shedding and other fine fiber-related disturbances. This is understandable from a conservation perspective: on aged heritage textiles, several deterioration phenomena may coexist in the same local region, and their visual manifestations can partially overlap. Even so, the relatively limited confusion with the background category indicates that the improved model is effective at suppressing false positives. In practical terms, this means the system is less likely to mistake decorative patterns, woven motifs, or surface texture irregularities for actual defects.
Overall, the normalized confusion matrix confirms that the proposed architectural improvements strengthen both feature discrimination and context modeling. The enhanced backbone and multiscale fusion improve the model’s ability to extract defect-relevant signals from dense textile textures. At the same time, the transformer-informed decoding strategy helps preserve broader contextual relationships on the textile surface. Together, these improvements lead to stronger recognition robustness and better generalization across the diverse and visually complex defect scenarios found in Kashgar heritage textiles.
To further evaluate the optimization effect of the proposed C2f-Faster-EMA + DETR decoder scheme,
Figure 7 illustrates the precision trajectories of the baseline YOLOv8 and the improved model (YOLOv8-MABFT) over 200 training epochs. As shown in the curve, both models exhibit an overall upward trend, but the improved method demonstrates significantly faster convergence and higher stability throughout the training process.
During the early training stage (0–50 epochs), the baseline YOLOv8 shows large oscillations and slower precision accumulation, stabilizing around 0.60–0.70. In contrast, the proposed model increases rapidly from near zero to above 0.75 within the first 50 epochs, indicating stronger feature discrimination capability at the initial optimization stage.
After 100 epochs, the baseline model gradually saturates with a precision plateau around 0.82–0.84, consistent with its final precision of 82.5%. Conversely, the improved model maintains a steady upward trend and stabilizes near 0.90–0.91, achieving a final precision of 91.4%. This corresponds to a +8.9% absolute improvement over the baseline, demonstrating that the integration of the Faster-EMA enhanced backbone and RT-DETR decoding effectively suppresses background noise, strengthens texture-aware representation, and improves classification robustness across diverse defect categories.
Overall, the precision curves verify that the proposed improvements not only accelerate convergence but also yield substantially higher steady-state performance, confirming the effectiveness of the redesigned feature extraction and decoding modules for textile defect detection. These results indicate that the proposed modules improve the recognition of surface discontinuities and chromatic degradation patterns under complex fibrous surface textures.
3.2. Model Comparison
To objectively verify the effectiveness of the proposed improved YOLOv8-based framework for fine-grained surface defect detection on Kashgar heritage textiles, we conducted a comprehensive comparison with five representative baseline detectors, including both two-stage and one-stage models. The purpose of this comparison was not only to evaluate detection accuracy, but also to examine whether the proposed framework is suitable for applied surface inspection of heterogeneous textile substrates. The quantitative results are summarized in
Table 6.
As shown in
Table 7, among the compared methods, the two-stage Faster R-CNN achieved relatively high Recall (94.6%) and mAP@0.5 (93.6%), yet its Precision remained markedly low (66.4%), resulting in a comparatively modest F1-score (78.0%). This performance gap suggests that Faster R-CNN tends to generate excessive false positives when dealing with visually dense heritage textile surfaces. For readers outside computer vision, this means the model was often able to locate many real surface defects. Still, it also tended to mistake repetitive woven motifs, embroidered textures, or aged surface irregularities for deterioration. Similarly, SSD, as an earlier one-stage detector, showed the weakest overall performance (F1-score = 65.5%, mAP@0.5 = 83.5%) and had difficulty capturing small and subtle surface defects embedded in complex fibrous backgrounds.
Among the lightweight YOLO-family baselines, YOLOv5n, YOLOv7n, and YOLOv8n demonstrated more balanced performance, with F1-scores of 91.8%, 92.1%, and 89.6%, respectively. YOLOv7n achieved the highest Recall (99.0%) among all baseline models, indicating very strong sensitivity to real surface defects. However, its lower Precision (86.1%) suggests that highly textured surfaces—such as embroidered regions, patterned woven textiles, and aged decorative fabrics—may still induce misdetections. YOLOv8n offered favorable computational efficiency (8.1 GFLOPs, 3.0 MB), but it struggled to maintain sufficiently high Precision (82.5%) for the recognition of weak-boundary or micro-scale surface deterioration.
The YOLOv8-MABFT framework achieved the best overall performance, with the highest F1-score (94.6%), Precision (91.4%), and mAP@0.5 (94.0%), while maintaining the same computational cost as YOLOv8n (8.1 GFLOPs, 3.0 MB). This result indicates that the architectural improvements enhanced the model’s ability to distinguish genuine surface deterioration from visually similar background structures without sacrificing efficiency. From an applied surface inspection perspective, this is particularly important because surface defects in Kashgar textiles are often embedded in complex material contexts, including dense embroidery, repeated woven motifs, faded decorative surfaces, reflective silk interfaces, and fragile, aged substrates. In such cases, an effective model must not only detect obvious defects, but also avoid falsely identifying normal surface pattern variation as damage.
Overall, the comparison demonstrates the suitability of the proposed framework for applied surface inspection of heterogeneous textile substrates. Its main advantage lies in its stronger ability to distinguish subtle and conservation-relevant surface deterioration—such as fine broken yarn, early-stage yarn shedding, weak abrasion traces, small holes, stains, and color fading—from high-texture fibrous surface backgrounds. At the same time, its lightweight design preserves real-time applicability, making it well-suited for routine surface inspection, digital documentation, and surface-condition monitoring of Kashgar heritage textiles.
To further examine whether the observed improvements were stable at the test-sample level, a paired bootstrap comparison was conducted between YOLOv8-MABFT and each baseline detection model on the fixed test set. The full results are reported in
Supplementary Table S2. The bootstrap analysis showed that YOLOv8-MABFT achieved positive mean differences in F1-score and mAP@0.5 compared with Faster R-CNN, SSD, YOLOv5n, YOLOv7n, and YOLOv8n. The 95% bootstrap confidence intervals did not include zero, and the adjusted
p-values remained below 0.001 after Holm–Bonferroni correction. Specifically, compared with YOLOv8n, the strongest direct YOLOv8 baseline, YOLOv8-MABFT improved the F1-score by 5.0 percentage points (95% bootstrap confidence interval: [4.3, 5.7]) and mAP@0.5 by 1.2 percentage points (95% bootstrap confidence interval: [0.8, 1.7]). Compared with Faster R-CNN, YOLOv8-MABFT showed a large improvement in F1-score (+16.6 percentage points), while the improvement in mAP@0.5 was small (+0.4 percentage points). This indicates that Faster R-CNN achieved competitive localization performance, but YOLOv8-MABFT provided a better balance between precision and recall, with fewer false detections on complex textile surfaces. Overall, these results indicate that the performance gains were stable under paired resampling of the fixed test set and reflected more reliable test-set-level improvements in detection performance.
Figure 8 illustrates the training loss curves of six representative detection models, including Faster R-CNN, SSD, YOLOv5n, YOLOv7n, YOLOv8n, and the proposed YOLOv8-MABFT. As shown in the figure, all models exhibit a decreasing trend in training loss over 200 epochs. However, the convergence speed and final stability differ significantly.
Faster R-CNN and SSD show slow convergence and maintain relatively high loss values throughout training, reflecting the limited capacity of traditional two-stage or early one-stage frameworks to handle complex fibrous surface textures. Among lightweight models, YOLOv5n and YOLOv7n converge moderately fast but still retain noticeable fluctuations after 100 epochs, indicating unstable optimization when dealing with dense woven backgrounds and repetitive surface patterns.
YOLOv8n demonstrates improved convergence stability compared with previous YOLO versions. Nevertheless, its loss decreases more slowly and remains higher in later epochs. In contrast, the YOLOv8-MABFT achieves the most rapid and smooth loss reduction. The curve drops sharply within the first 30 epochs and continues to decline steadily without oscillation, eventually reaching the lowest terminal loss value among all models.
This behavior indicates that the enhanced feature aggregation strategy significantly improves gradient flow and suppresses noise from repetitive textile surface patterns. At the same time, the DETR-based decoder effectively stabilizes object-query learning for small and irregular surface defects. Overall, the proposed model demonstrates superior optimization efficiency and generalization ability during training, supporting its advantages in downstream surface defect detection and surface-condition assessment.
Figure 9 presents qualitative detection results for representative surface defects observed in Kashgar heritage textiles, including stains, broken yarn, yarn shedding, holes, abrasion, and color fading. These examples were selected from authentic textile samples associated with the Kashgar region, so that the visual evaluation reflects realistic surface deterioration patterns commonly encountered in heritage textile conservation rather than simplified industrial inspection settings.
Overall, the proposed model exhibits robust and consistent detection capability across different surface defect types and material conditions. Stain defects, although often characterized by blurred boundaries and low contrast against woven backgrounds, are accurately identified, indicating that the model is sensitive to subtle chromatic and contamination-related surface anomalies. For non-specialist readers, this means the system can still recognize weak discoloration or soiling even when it does not stand out clearly from the surrounding fabric. Broken yarn and holes, which involve more evident structural interruption of the textile surface, are also reliably detected, showing that the model can capture both local damage boundaries and the surrounding surface-structure context. In practical terms, the detector is not only looking for isolated spots, but is also learning how deteriorated surface regions differ from normal woven or embroidered structures.
More challenging categories, such as yarn shedding, abrasion, and color fading, also show stable detection behavior. These defects are generally harder to recognize because they may appear as a gradual surface change, slight fiber disturbance, local material discontinuity, or irregular texture weakening rather than as sharply separated objects. The model’s successful identification of such defects suggests that the enhanced multiscale feature extraction and context-aware decoding strategy improve its ability to distinguish genuine deterioration from dense textile textures, repeated motifs, and age-related visual noise. This is particularly important for Kashgar heritage textiles, where embroidery, patterned weaving, dyed surface layers, and decorative treatments can easily interfere with automated surface recognition.
Overall, the qualitative results confirm that the YOLOv8-MABFT enhances robustness against complex woven and embroidered surface backgrounds, strengthens sensitivity to subtle surface deterioration, and maintains stable performance across diverse textile materials and defect forms. These properties make the proposed framework highly suitable for applied surface inspection, digital documentation, and routine surface-condition monitoring of fragile heritage textiles from the Kashgar region.
3.3. Material Interpretation of Detected Textile Surface Anomalies
Beyond numerical detection accuracy, the predicted categories also showed distinct image-based material characteristics that are relevant to surface-condition documentation. The six detected anomaly types—broken yarn, yarn shedding, abrasion, holes, color fading, and stains—corresponded to different forms of visible surface change on fibrous textile substrates. These categories were not only detection labels but also observable indicators of changes in yarn continuity, surface cohesion, material loss, chromatic stability, contamination, and local substrate integrity.
Broken yarn was typically detected as a linear discontinuity, interrupted thread path, or localized yarn fracture. This category was visually distinguishable when the yarn direction was clear and when the broken segment produced a strong contrast against the surrounding woven structure. Yarn shedding was detected as loose fibers, fuzzy protrusions, partial yarn detachment, or exposed surface fibers. Compared with broken yarn, yarn shedding generally showed weaker boundaries and more diffuse edges, which made it more sensitive to background texture. Abrasion appeared as roughened, flattened, thinned, or locally worn surface regions. Its visual boundary was often gradual rather than sharply defined, especially on aged or patterned textiles. Holes were detected as open discontinuities, missing substrate regions, or ruptured local structures, and they were generally easier to distinguish because they interrupted the continuity of the textile surface. Color fading appeared as low-saturation regions, uneven chromatic transitions, or weakened decorative color areas. Stains appeared as local darkening, irregular contamination boundaries, or discoloration patches.
Table 8 summarizes the relationship between the detected anomaly categories, their RGB image manifestations, and their conservation-oriented material implications. These results indicate that the proposed model does not merely identify generic visual defects but captures surface-level indicators associated with different forms of textile deterioration.
3.4. Interpretability Evaluation via CAM, Grad-CAM, Grad-CAM++, and SSCAM
To further compare the interpretability of the baseline YOLOv8n and the YOLOv8-MABFT,
Figure 10 visualizes their attention distributions on a representative damaged Kashgar heritage textile sample using four complementary methods: CAM, Grad-CAM, SSCAM, and Grad-CAM++. The input image contains a large structurally damaged region embedded within a dense decorative textile background, which provides a suitable case for examining whether the model can focus on surface-degradation regions rather than being distracted by surrounding woven or embroidered decorative patterns.
Among the four methods, Grad-CAM++ applied to the improved YOLOv8 (YOLOv8-MABFT) provides the clearest localization effect. Its high-response region is tightly aligned with the central damaged area. It more closely follows the actual contour of the surface defect, with relatively limited spillover into the surrounding patterned textile surface. For non-specialist readers, this means that the model is not simply responding to visually striking areas in general, but is more accurately identifying the part of the textile surface that is genuinely deteriorated. By contrast, CAM produces a comparatively coarser activation pattern, while SSCAM, although highly responsive, shows slightly broader diffusion into neighboring regions. Grad-CAM performs well, but its localization remains somewhat less constrained than that of Grad-CAM++.
The comparison also highlights the practical importance of the improved architecture for heritage textile surface inspection. In Kashgar heritage textiles, deterioration often occurs within visually complex surfaces that contain repeated motifs, dense embroidery, dyed layers, and aged pattern irregularities. Under such conditions, a reliable detector must distinguish actual surface degradation from normal decorative variation. The tighter and more stable attention responses produced by the improved YOLOv8 suggest that the proposed enhancements strengthen both surface-feature discrimination and contextual understanding, thereby improving defect-oriented interpretability.
Overall, CAM-based visualization confirms that the model attends to surface-degradation regions rather than decorative surface patterns. This result supports the suitability of the proposed framework for applied surface inspection, where the central task is not only to identify visually salient areas, but to localize materially meaningful surface deterioration on heterogeneous fibrous textile substrates.
3.5. Result of Prediction of the Comprehensive Value Level of Textile Heritage
After surface defect detection was completed by the YOLOv8-MABFT, the extracted seven-dimensional variables—Severity, Distribution, Structural, Chromaticity, Repairability, Focus, and Sensitivity—were input into the Random Forest classifier for four-level surface-condition prediction. To clarify how these variables contributed to the classification process, SHAP beeswarm plots were generated for each surface-condition level, as shown in
Figure 11.
Across all four levels, Distribution and Severity consistently showed the largest SHAP ranges, indicating that the model primarily relied on the spatial extent and intrinsic intensity of deterioration when assigning surface-condition levels. For Level 0, low values of Severity, Distribution, and Chromaticity were the main drivers, suggesting that this class corresponds to slight, localized, and visually weak surface deterioration. For Level 1, Distribution remained the strongest predictor. At the same time, Focus and Sensitivity became more influential, indicating that mild defects located in visually important areas or on more vulnerable substrates were more likely to be assigned to this level. For Level 2, positive contributions from Distribution, Severity, and Focus became more pronounced, showing that wider spread, stronger visual impact, and increasing substrate vulnerability jointly pushed predictions toward a more serious surface-condition state. For Level 3, the highest positive SHAP contributions were concentrated in Distribution, Focus, Severity, and Sensitivity, indicating that the most severe cases were characterized by extensive surface deterioration, strong visual prominence, high defect intensity, and fragile material conditions.
In contrast, Repairability showed the weakest overall contribution across all four panels, suggesting that it functioned mainly as a secondary adjustment variable rather than a primary determinant of surface-condition classification. Structural and Chromaticity played intermediate roles, refining the boundaries between adjacent levels but not dominating the decision on their own. Overall, the SHAP analysis indicates that the Random Forest classifier follows an interpretable and conservation-consistent decision pattern, prioritizing defect spread and severity while allowing visual prominence, substrate sensitivity, color change, and structural disturbance to further refine the final surface-condition level prediction.
To further clarify the overall influence of each input variable in the four-level surface-condition classification of Kashgar heritage textiles, global feature importance was computed using the mean absolute SHAP values of the seven continuous variables. A stacked bar chart showing class-stratified SHAP contributions across the four predicted levels, Levels 0–3, was generated, as shown in
Figure 12. This visualization summarizes both the total contribution of each variable and its relative influence within each surface-condition level.
The global SHAP ranking shows that Distribution is the most influential variable across the entire classification system, with the largest cumulative mean absolute SHAP value, approaching 0.40. This indicates that the spatial extent and spread of surface deterioration provide the strongest overall basis for level assignment. In conservation terms, the model places the greatest emphasis on whether a defect is localized or widely distributed across the textile surface. A defect that extends over a broader region is therefore much more likely to shift the prediction toward a higher surface-condition level. This finding is consistent with practical textile assessment, where widespread deterioration is usually treated as more serious than an isolated local flaw.
The second most influential feature is Severity, with a mean absolute SHAP value slightly above 0.30. This result confirms that the intrinsic intensity of surface damage—such as how obvious, disruptive, or materially serious a defect appears—is another major determinant of the final classification. Together, Distribution and Severity form the two primary drivers of the model’s decision-making, indicating that the classifier first evaluates how extensive the damage is and how serious the damage is before refining the prediction with additional variables.
A second tier of importance is formed by Focus, which ranks third, with a cumulative SHAP contribution around 0.26. This suggests that the visual prominence of a defect plays a substantial role in the classification process. In other words, defects located in visually dominant or compositionally important areas of the textile are more likely to raise the predicted surface-condition level than comparable defects in less noticeable regions. This is highly consistent with conservation practice, since deterioration in focal areas often has a greater impact on both visual integrity and display value.
The remaining variables—Sensitivity, Chromaticity, Repairability, and Structural—show lower but still meaningful contributions. Sensitivity and Chromaticity occupy intermediate positions, indicating that substrate vulnerability and color-related disruption are important supporting factors in the model. Their role is not as strong as that of Distribution or Severity, but they help distinguish between cases with similar physical defect intensity but different material fragility or visual disturbance. Repairability contributes less strongly overall, suggesting that treatment feasibility mainly functions as a boundary-adjusting factor rather than a primary driver of surface-condition classification. Structural shows the smallest total SHAP value among the seven variables, but it still provides relevant discriminative information, especially when structural weakening becomes more prominent in higher condition levels.
The stacked contributions across Levels 0–3 further indicate that the importance of the variables is not evenly distributed. Distribution and Severity contribute strongly across all levels, confirming their role as stable global determinants. By contrast, Focus, Sensitivity, and Chromaticity become more influential in the middle-to-higher levels, where broader visual and material consequences become increasingly relevant. This pattern suggests that the Random Forest classifier follows a hierarchical decision logic: first identifying the extent and intensity of surface deterioration, and then refining the final level assignment using variables related to visual prominence, substrate vulnerability, and recoverability.
Figure 13 presents the performance consistency of the Random Forest-based surface-condition classification model across the four predicted levels based on five-fold cross-validation. Each curve represents the Precision–Recall relationship obtained from one fold, with Recall on the
x-axis and Precision on the
y-axis. This visualization allows the stability and generalization ability of the classifier to be evaluated under different data partitions, while also revealing whether the four surface-condition levels differ in classification difficulty.
For Level 0, the five PR curves are highly overlapped and remain close to the upper-right region of the plot, indicating consistently high precision across most of the recall range. This suggests that the model can reliably identify textiles in the lowest surface-condition level, which are generally characterized by slight, localized, and visually weak deterioration. For non-specialist readers, this means the classifier is able to recognize stable or minor surface-condition cases with relatively few false alarms, and its performance remains stable across repeated sampling.
For Level 1, the PR curves also remain relatively concentrated, although they show a slightly earlier decline than those of Level 0. Precision is still high over a broad recall range, indicating that the classifier maintains good discrimination for mild surface deterioration cases. This level can be interpreted as a transitional condition class, in which defects are more noticeable than in Level 0 but are not yet strongly destructive. The limited spread among the five folds indicates that the model handles this intermediate class with reasonable consistency.
For Level 2, performance becomes more challenging. The PR curves exhibit a clearer downward trend as recall approaches the upper range, and the variation among folds becomes more visible than in Levels 0 and 1. This suggests that Level 2 contains more ambiguous cases, where surface defects are serious enough to be distinguished from lower levels but may still overlap with adjacent classes in terms of severity, distribution, or visual prominence. In conservation terms, this is reasonable, since moderate surface deterioration often shares features with both mild and severe cases, making class boundaries less sharply defined.
For Level 3, the PR curves begin with a long plateau near Precision = 1.0, indicating that the classifier can identify the most severe cases with very high confidence under a conservative prediction strategy. These samples are likely to exhibit strongly distinctive characteristics, such as extensive defect distribution, high severity, pronounced visual prominence, and high substrate sensitivity. Although some performance decline appears near the highest recall range, the curves remain consistently well above the random baseline, suggesting that the model preserves good recognition capability for the most severe surface-condition category.
Overall, the five-fold PR curves demonstrate that the YOLOv8-MABFT feature extraction and Random Forest classification framework has strong generalization ability for four-level surface-condition prediction in Kashgar heritage textiles. The classifier performs particularly well for Level 0 and Level 3, where the feature patterns are more distinctive, while Levels 1 and 2 show slightly greater overlap and therefore represent more challenging intermediate classes. Even so, the relatively stable curve distribution across folds confirms that the proposed feature framework supports robust and reproducible surface-condition classification.
To further illustrate how the proposed framework reasons about textile surface condition in a manner consistent with conservation-oriented assessment, a representative case-level SHAP analysis was conducted using a severely deteriorated heritage textile sample from the Kashgar region, as shown in
Figure 14. After surface defect detection and feature extraction, the seven-dimensional condition variables—Severity, Distribution, Structural, Chromaticity, Repairability, Focus, and Sensitivity—were input into the Random Forest classifier for four-level surface-condition prediction. The model assigned this sample to Level 3, corresponding to the most severe surface-condition category, and the class-specific SHAP force plot provides a transparent explanation of this decision pathway.
The SHAP decomposition shows that Distribution exerts the strongest contribution to the final prediction. In practical terms, this indicates that the extensive spatial spread of deterioration across the textile surface is the most important reason why the sample is classified as severe. This is consistent with the visual appearance of the textile, where damage is not confined to a single local point but affects a broad area of the object. Severity, Sensitivity, and Chromaticity also contribute to the model’s decision. These variables indicate that the defect is not only extensive but is also materially serious, occurs on a fragile substrate, and is accompanied by visible color disruption. For non-specialist readers, this means the classifier is responding not only to how much surface damage there is, but also to how serious it appears, how vulnerable the textile substrate is, and how strongly the damage disrupts the original visual surface.
By contrast, Structural provides a small positive contribution, suggesting that structural weakening remains relevant but is not the dominant driver in this particular case. This is reasonable because, although the textile clearly shows major loss and fragmentation, the model appears to assign greater importance to the combined effects of wide defect spread, material fragility, and visual deterioration. Other variables, including Focus and Repairability, contribute only modestly in comparison, indicating that they function as secondary adjustment factors rather than the primary basis for the Level 3 prediction in this sample.
Overall, this case study confirms that the model’s internal reasoning is interpretable and broadly consistent with heritage textile surface-condition assessment logic. The prediction of the highest surface-condition level is mainly driven by widespread deterioration, strong defect severity, substrate sensitivity, and visible chromatic disruption. At the same time, structural damage and other variables provide supporting refinements. This example demonstrates that SHAP analysis does not merely explain the classifier mathematically, but also reveals a decision pathway that is meaningful in practical textile surface-condition assessment.
4. Discussion
The findings of this study indicate that the proposed YOLOv8-MABFT framework provides an effective and interpretable approach for detecting surface defects in Kashgar heritage textiles. In this study, heritage textiles are treated as fragile fibrous surface systems composed of woven structures, embroidered reliefs, dyed layers, aged fibers, and deteriorated surface interfaces. Therefore, the detection task is not limited to identifying visual defects, but also involves recognizing surface degradation patterns associated with changes in surface morphology, chromatic continuity, fiber exposure, and local structural stability.
Rather than treating textile defects as generic industrial flaws, this study situates surface defect detection within the material characteristics of heterogeneous textile substrates. The results show that the two coordinated improvements of YOLOv8-MABFT—the C2f-Faster-EMA module and the RT-DETR-informed decoder head—enhance the model’s ability to recognize stains, broken yarn, yarn shedding, holes, abrasion, and color fading under complex woven, embroidered, dyed, and aged surface conditions. These defect types correspond to different forms of textile surface deterioration, ranging from chromatic change and contamination to fiber discontinuity and local substrate damage.
The image-based results can be further interpreted in terms of textile material deterioration. Although the proposed framework operates on non-contact RGB surface images, the detected visual anomalies are not merely image irregularities. In heritage textiles, visible surface changes often reflect underlying deterioration processes involving fiber weakening, yarn discontinuity, surface wear, chromatic instability, accumulation of contamination, and local substrate failure. Therefore, the detection outputs can serve as conservation-oriented visual proxies for material condition, while not replacing direct mechanical, chemical, or microscopic analysis.
Among the detected categories, broken yarn is most closely associated with mechanical deterioration. A visible interruption in the yarn path may result from tensile fatigue, bending fatigue, repeated handling, fiber embrittlement, or local stress concentration. Yarn shedding can be interpreted as a sign of reduced fiber cohesion or weakened inter-fiber bonding, especially when loose fibers or fuzzy protrusions appear on the textile surface. Abrasion reflects progressive surface wear caused by repeated friction, contact pressure, or handling-related fatigue, and is therefore associated with the gradual loss of the textile surface layer. Holes represent a more advanced stage of structural discontinuity, where yarn rupture and local substrate failure have progressed beyond surface-level weakening.
The remaining categories are more closely related to surface chemistry, optical response, and environmental exposure. Color fading may be associated with chromatic-layer instability, light-induced dye degradation, aging-related changes in surface reflectance, or long-term exposure to unfavorable storage conditions. Stains may indicate contaminant deposition, moisture migration, particulate accumulation, or localized environmental interaction. These interpretations explain why the detected categories are relevant to condition assessment: each visual anomaly corresponds to a different pathway through which textile surfaces lose continuity, cohesion, chromatic integrity, or structural stability.
However, these material-mechanical interpretations should be treated with caution. The present study does not directly measure tensile strength, fiber crystallinity, dye chemistry, or microscopic fiber morphology. Instead, it establishes an image-based framework that links visible degradation patterns with plausible material processes. This distinction is important for conservation practice: the model can support preliminary screening, mapping, and documentation, but final diagnosis should still be combined with expert examination and, when necessary, material testing.
The ablation experiments demonstrate that the proposed architectural improvements contribute directly to surface defect detection performance. Compared with the baseline YOLOv8n, the complete YOLOv8-MABFT model achieved higher Precision, F1-score, and mAP@0.5 while maintaining the same computational cost. This suggests that the improvement does not depend on simply increasing model size, but on better surface feature representation and defect–background discrimination. This is particularly relevant for heritage textile surfaces, where dense patterns, embroidery, reflective silk surfaces, chromatic variation, and aged material irregularities can easily interfere with automated recognition.
The paired bootstrap analysis provides further statistical support for this interpretation. Since all models were evaluated on the same test images, the paired bootstrap procedure examined whether the performance differences remained stable when the test samples were resampled. The results showed that YOLOv8-MABFT improvements in F1-score and mAP@0.5 were consistently positive across all baseline comparisons, and the 95% bootstrap confidence intervals did not include zero after adjustment for multiple comparisons. This suggests that the reported gains were not merely isolated single-evaluation differences, but were stable under fixed-test-set resampling. The magnitude of the improvement also follows a reasonable pattern. YOLOv8-MABFT showed the largest F1-score gains over SSD and Faster R-CNN. For SSD, this likely reflects weaker overall performance in detecting subtle or weak-boundary textile defects. For Faster R-CNN, the small mAP@0.5 difference but large F1-score difference suggests that Faster R-CNN achieved competitive localization performance but produced more false detections, resulting in lower precision and F1-score. Compared with YOLOv7n and YOLOv8n, which are stronger, lightweight detection baselines, the improvements were smaller but remained positive. This pattern supports the interpretation that the proposed architectural modifications improved surface-feature representation and defect–background discrimination, rather than simply producing an anomalous result in a single comparison. In particular, the improvement over YOLOv8n indicates that the added C2f-Faster-EMA module and RT-DETR-informed decoder head contributed to more reliable detection of complex textile degradation patterns.
The comparative experiments further confirm the suitability of the YOLOv8-MABFT for applied surface inspection of heterogeneous textile substrates. Faster R-CNN achieved relatively high recall but showed lower precision, indicating that it could identify many true defects but also produced more false detections on highly textured textile surfaces. SSD showed weaker overall performance, especially for subtle or small defects. In contrast, the proposed YOLOv8-MABFT model achieved a better balance between precision and recall, indicating that it can detect real surface defects while reducing the likelihood of mistaking decorative motifs, normal textile textures, or aged pattern variation for damage. This balance is important because both missed deterioration and excessive false alarms may affect condition documentation and subsequent assessment.
The confusion matrix and qualitative detection results also reveal where the model performs well and where challenges remain. Structurally distinct defects, such as holes and broken yarn, are generally easier to detect because they present clearer interruptions in the textile surface. By contrast, yarn shedding, abrasion, and color fading remain more difficult because their boundaries are often weak, gradual, or visually similar to the surrounding material. These differences are consistent with the material characteristics of textile deterioration: defects involving strong structural discontinuity are easier to localize, whereas defects involving gradual fiber loosening, surface wear, or chromatic transition are more susceptible to background interference. The qualitative false-positive analysis showed that errors were not randomly distributed, but were mainly associated with specific interactions between substrate geometry, surface relief, optical response, and deterioration boundaries. High-relief embroidery produced raised thread ridges, local shadows, stitch overlaps, and occluded edges. In two-dimensional RGB images, these relief structures sometimes resembled yarn shedding, abrasion, or broken yarn because both true damage and embroidery relief can appear as local surface discontinuities. Reflective silk-based surfaces created another type of ambiguity. Specular highlights and local brightness variation occasionally resembled weak color fading, abrasion, or stain boundaries, especially when illumination was uneven or when defect contrast was low. Repeated woven or dyed motifs also introduced strong periodic textures and chromatic transitions, which could be confused with stains or fading when decorative boundaries overlapped with aged surface irregularities.
False positives also occurred in aged but non-defective regions, such as folds, wrinkles, dust traces, and irregular discoloration. These features are visually similar to early-stage deterioration but do not always represent active material damage. Dense blended-yarn textures further reduced category separability because broken yarn and yarn shedding may both appear as local fiber discontinuities. False negatives were more likely to occur in early-stage abrasion, slight color fading, fine broken yarn, and weakly contrasted yarn shedding, particularly when these defects were embedded in dense woven patterns, reflective silk surfaces, or shadowed embroidery. Misclassification may also occur between visually related categories, such as yarn shedding and broken yarn, or abrasion and color fading. These cases show that the main limitation of the detector lies not in recognizing isolated defects, but in distinguishing true degradation from normal or historically produced textile surface complexity.
From a conservation perspective, this limitation is meaningful. It suggests that YOLOv8-MABFT should be used as a preliminary surface-screening and documentation tool rather than as an autonomous diagnostic system. The model can identify regions requiring attention, but expert interpretation remains necessary when material relief, reflectance, decorative patterning, and deterioration boundaries overlap.
The CAM-based interpretability analysis provides additional evidence that the improved model attends more closely to surface-degradation regions. Compared with the baseline YOLOv8n, the improved model produced more concentrated attention responses, especially in the Grad-CAM++ visualization. This suggests that the model is less distracted by surrounding decorative backgrounds and more capable of focusing on actual damaged areas. For textile surface analysis, this is meaningful because interpretability helps determine whether the model is responding to real deterioration rather than unrelated visual patterns.
Beyond defect localization, this study further extends the workflow toward surface-condition level assessment through Random Forest classification and SHAP interpretation. Random Forest was not selected solely because of its conceptual suitability or general interpretability. A comparative classifier evaluation was conducted using Logistic Regression, SVM, KNN, Decision Tree, Gradient Boosting, LightGBM, XGBoost, and Random Forest, all trained on the same feature set, data split, preprocessing strategy, and evaluation metrics. The results showed that Random Forest achieved the highest overall accuracy, Macro-F1, Weighted-F1, and Cohen’s Kappa among the compared classifiers. This indicates that Random Forest was empirically appropriate for the present four-level surface-condition prediction task. The comparison also shows that ensemble-based classifiers are generally suitable for this type of condition-assessment problem. Gradient Boosting, LightGBM, and XGBoost produced competitive results, but Random Forest provided a slightly stronger overall balance between predictive performance, class-wise stability, and interpretability. This balance is important for conservation-oriented condition assessment because the goal is not only to maximize overall accuracy but also to maintain reliable discrimination across stable, mild, moderate, and severe surface-condition levels. Random Forest was therefore retained as the final classifier for SHAP-based interpretation. The seven-dimensional feature system (Severity, Distribution, Structural, Chromaticity, Repairability, Focus, and Sensitivity) translates detected surface defects into condition-relevant variables. SHAP results show that Distribution and Severity are the most influential factors in surface-condition classification, indicating that the model primarily assesses how widely a defect spreads and how severe the deterioration is. Focus and Sensitivity also become more important in higher condition levels, suggesting that visual prominence and substrate vulnerability play meaningful supporting roles in identifying more serious cases.
This pattern is generally consistent with material-based surface assessment logic. A small and localized defect is usually less critical than one that is widespread, visually dominant, or located on a fragile substrate. The model’s reliance on Distribution and Severity, therefore, suggests that its classification behavior reflects a reasonable relationship between visible surface deterioration and condition concern. At the same time, the lower contribution of Repairability and Structural indicates that these variables function more as secondary refinement factors than as primary determinants in the current model. This does not mean they are unimportant; rather, it suggests that their influence may require more detailed feature extraction, segmentation-based measurement, or expert scoring in future studies.
The robustness and sensitivity analysis of the expert-informed weighting scheme further support this interpretation. Because the weighting scheme was partly based on conservation judgment, it was necessary to examine whether the four-level condition assessment would change substantially under alternative weighting assumptions. The results reported in
Supplementary Table S3 show that the condition scores and level assignments remained broadly consistent under equal-weight, ±10% perturbation, ±20% perturbation, and conservation-risk-oriented weighting scenarios. The strongest stability was observed under the ±10% perturbation scenario, while the ±20% perturbation scenario still maintained a high score correlation and good ordinal agreement.
The pattern of changes is also meaningful from a conservation perspective. The equal-weight scenario produced the largest number of changed cases because it removed the conservation-informed prioritization among variables. However, the high Spearman correlation indicates that the overall ranking of condition severity remained broadly similar. Under ±20% perturbation and conservation-risk-oriented scenarios, most changes occurred between adjacent condition levels, especially around the Level 1/Level 2 boundary. This is consistent with the practical ambiguity in assessing textile condition between mild and moderate deterioration. In contrast, clearly stable and clearly severe cases were less affected by weighting changes. These findings suggest that the proposed condition-assessment framework is not determined by a single arbitrary weight configuration, but intermediate cases should still be interpreted as decision-support outputs requiring expert review.
The Precision–Recall results for the four surface-condition levels further show that the model performs more consistently for Level 0 and Level 3, while Levels 1 and 2 are more challenging. This is understandable because stable and severe cases usually have clearer feature patterns, whereas intermediate levels often contain overlapping signs of deterioration. In surface-condition assessment, the boundary between mild and moderate deterioration is often less distinct than the boundary between clearly stable and clearly severe conditions. Therefore, the predicted condition levels should be understood as preliminary decision-support information for expert review rather than as direct treatment recommendations. From a practical conservation perspective, the predicted surface-condition levels may support textile heritage management in three main ways. First, they can assist routine documentation by translating detected defects into standardized and comparable condition records, which may help museums or conservation teams track visible surface changes over time. Second, they can support preliminary triage. Objects predicted as Level 0 or Level 1 may be suitable for routine documentation or periodic monitoring, whereas Level 2 objects may require closer expert inspection and more detailed condition reporting. Level 3 predictions may indicate cases that should be prioritized for professional review because extensive deterioration, strong defect intensity, or high material sensitivity may increase handling or storage risk. Third, SHAP-based explanations can help conservators understand whether a prediction was mainly driven by Distribution, Severity, Material Sensitivity, Visual Focus, or other variables, making the automated assessment more transparent and easier to verify against material evidence. However, the model cannot replace conservators’ examination of fiber strength, dye stability, historical value, previous restoration history, environmental exposure, or cultural significance. The proposed framework is therefore most suitable for preliminary screening, surface-condition documentation, monitoring prioritization, and expert review support. Final professional judgments should still be made by qualified specialists based on direct examination and broader contextual information beyond image-based surface features.
Limitations and Future Research
Several limitations should also be acknowledged. First, although the dataset was designed to reflect the material diversity of Kashgar heritage textiles, it remains regionally specific and relies on supplementary open-source textile images to improve coverage of underrepresented surface textures and deterioration. These supplementary images may introduce domain bias because their acquisition conditions, lighting consistency, image quality, and clarity of defect boundaries may differ from those of authentic Kashgar heritage textile images. Therefore, the reported results should be interpreted primarily as evidence of model performance within the constructed dataset rather than as proof of universal generalization to all heritage textile collections. Although the dataset includes multiple textile substrate types, such as woven cotton textiles, silk-based decorative textiles, embroidered textiles, blended yarn systems, lightweight woven cloth, and decorative composite fabrics, this study did not conduct a robustness analysis stratified by substrate type or deterioration intensity. The model’s transferability to other textile traditions, climatic contexts, imaging conditions, deterioration intensities, or museum collections remains to be further validated.
Second, the current framework focuses mainly on visible surface defects captured in RGB images. Latent deterioration, fiber-level chemical changes, dye instability, and subsurface weakening may not be fully captured through standard image-based detection. Future research should integrate multimodal data such as microscopic images, spectral information, environmental records, and conservation reports to improve the assessment of non-visible or early-stage surface deterioration.
Third, some defect categories, especially abrasion, yarn shedding, and color fading, remain difficult to separate when their visual boundaries overlap. The current evaluation still relies mainly on aggregate detection indicators. A more detailed class-wise quantitative evaluation, normalized confusion matrix analysis, false-positive/false-negative categorization, and stricter localization assessment, such as AP@0.75 or mAP@0.5:0.95, would further strengthen the interpretation of model performance. Future work should therefore report class-wise results and cases of systematic error to better explain category-level performance differences and localization limitations.
Fourth, although this study added robustness and sensitivity analyses of the expert-informed weighting scheme, the surface-condition-level prediction still depends on the current seven-variable feature system and the condition-level thresholds defined in this study. The sensitivity analysis shows that the results are broadly stable under reasonable alternative weighting scenarios, but it does not fully eliminate the need for broader expert validation. In particular, the equal-weight and conservation-risk-oriented scenarios indicate that some intermediate cases remain sensitive to the weighting of conservation priorities. Future research should involve larger expert panels, Delphi-based weighting procedures, cross-institutional conservation review, and external datasets to further refine the weighting scheme and condition-level thresholds. Data-driven or hybrid weighting strategies could also be explored to complement expert-informed conservation logic.
Although the paired bootstrap analysis provides statistical evidence that the reported improvements were stable under fixed-test-set resampling, it does not fully replace repeated training across multiple random seeds. The present statistical analysis evaluates test-sample-level robustness using fixed, trained models. In contrast, training variability arising from random initialization, stochastic optimization, and data loading would require multi-seed repeated training to be fully assessed. Therefore, the bootstrap results should be interpreted as additional evidence of fixed-test-set robustness rather than definitive proof of training-seed stability. They should not be interpreted as direct evidence that the training process itself is insensitive to random seeds. Future work should conduct multi-seed repeated training and external dataset validation to further evaluate the stability and generalizability of the proposed model.
Future research should proceed in several directions. First, the dataset should be expanded to include more textile types, environmental conditions, imaging devices, regional heritage contexts, and authentic heritage textile images collected independently. Future datasets should retain explicit source-domain labels, enabling source-stratified, substrate-stratified, and deterioration-intensity-stratified evaluation of model performance on authentic heritage textile images and supplementary image sources. Second, independent cross-institutional heritage textile datasets should be used to test whether the model generalizes beyond the current Kashgar-centered dataset. Third, segmentation-based methods could be explored to complement bounding-box detection, especially for irregular surface degradation such as fading and abrasion. Fourth, although this study added a comparative classifier evaluation and found that Random Forest achieved the best overall balance among the tested classifiers, the surface-condition level prediction still depends on the current seven-variable expert-informed feature system. The relative weights and definitions of these variables remain partly dependent on conservation judgment. Therefore, the condition-prediction results should be interpreted as a preliminary, explainable assessment framework rather than as a fully optimized conservation grading system. Future research should further validate the classifier selection and feature-weighting strategy using larger datasets, independent cross-institutional textile collections, external test sets, and condition labels assigned by multiple conservation experts.