1. Introduction
Textiles have played a fundamental role in human history, from clothing to decorative purposes, serving not only practical functions, such as protection from the elements, but also signaling economic and social status, religious affiliation, and cultural identity [1,2,3,4,5,6,7,8]. Since ancient times, textiles have been produced from fibers derived from plant and animal sources, as well as inorganic materials, including metal threads and asbestos, while cellulosic man-made and synthetic fibers were introduced on an industrial scale in the 1950s.
Within the framework of archaeometric studies, textiles—and the fibers they are composed of—offer insights into cultural practices, trade routes, and technological development through time. Traditional macroscopic textile analysis focuses on yarn making, weaving, and knitting techniques. Diagnostic features, such as thread twisting direction, tightness and thickness of the thread, number and twisting direction of threads that compose the yarn, warp–weft count, and pore size, define the textile under study [2,4,5,8,9]. These features often survive even in degraded and mineralized textiles, making them valuable for historical studies. In some cases, however, such diagnostic techniques are invasive and destructive due to the need for sampling and extensive sample preparation [6,10,11].
Textile recognition methodology depends on visual inspection of macroscopic textile features with the naked eye, enhanced with a magnifying glass, or with optical and electron microscopy [5,8]. Given the time-consuming nature of these methods, the possibility of human error cannot be ignored [5], especially when dealing with degraded and mineralized textile remains, where morphology interpretation requires highly specialized expertise [9]. Where fibers exhibit similar morphological features, misidentification can easily occur, and archaeologically relevant information preserved in the textiles, such as macroscopic features including thread and weave type, may be overlooked [1,3,6,8].
In alignment with the growing demand for non-destructive and non-invasive methods of analysis for textile classification and documentation, artificial intelligence (AI) has been increasingly employed not only in heritage science, but also in the textile industry, including tasks such as weaving parameter estimation and recognition [5,12]. Textile recognition and classification have gained significant traction in applications related to cultural preservation, documentation, and analysis. Recent studies have employed pretrained deep learning models such as ResNet-50 and MobileNet to classify traditional woven textile patterns, achieving accuracy above 94% even under conditions such as image rotation and varying lighting [13]. Moreover, hybrid models like YOLOv4-ViT and GANs have been used to detect and restore damaged ancient textiles, supporting the reconstruction of cultural artifacts [14]. On an ethnographic level, convolutional neural networks (CNNs) have been applied to the classification of handwoven designs from the Kalinga tribe in the Philippines, aiding digital preservation of regional textile identities [15]. Finally, large-scale datasets, featuring over 760,000 images, provide a unified benchmark for textile classification in both fashion and cultural heritage domains [16].
Pretrained computer vision models offer a powerful starting point for textile classification tasks, leveraging features learned from large datasets such as ImageNet [17]. However, imbalanced datasets, where some classes have significantly fewer samples than others, pose significant challenges to image classification [18], as models can end up biased toward the majority classes while underperforming on the minority classes [19]. This is a common problem in real-world applications, including textile datasets, and specifically textiles of archaeological interest, where some rare or specialized classes may have limited representation, affecting the ability of models trained on balanced benchmarks to generalize effectively [20]. Such imbalances bias model performance toward majority classes and distort evaluation metrics, such as accuracy, which fail to capture minority-class performance [21]. Different architectures, such as CNNs, transformers, and hybrid models, each respond differently to long-tailed data distributions and class imbalance, both in terms of learning dynamics and generalization capacity [18,22,23].
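To make the metric distortion concrete, the following minimal sketch (with purely synthetic labels, not drawn from any textile dataset) shows how plain accuracy can look strong while the macro-averaged f1-score exposes a classifier that never predicts the minority class:

```python
# Synthetic illustration: accuracy hides minority-class failure,
# while macro-averaged F1 exposes it. Labels are purely illustrative.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 18 + [1] * 2   # 18 majority-class samples, 2 minority-class
y_pred = [0] * 20             # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))                              # 0.90
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```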
This preliminary study investigates the potential of pretrained deep learning models for textile classification, using conventional-magnification images, within the scope of non-invasive analysis in heritage science. Such images, capturing macroscopic textile features such as fiber bundle and weave appearance and structure, texture, and sheen, can be acquired with standard cameras or smartphones. This offers a faster, non-invasive, and non-destructive alternative to traditional microscopic methods, which are time consuming and rely on specialized expertise. To this end, six pretrained computer vision models of different architectures are examined and compared, focusing on their performance when adapted to a highly imbalanced textile image dataset. Apart from classification accuracy, the study also evaluates the energy consumption and carbon emissions associated with model training and deployment. The environmental impact of artificial intelligence models has become an increasingly important consideration across fields as sustainability gains importance. With this preliminary evaluation of the balance between classification accuracy, energy cost, and carbon emissions, the practical and ecological implications of model selection are considered for future applications of AI in cultural heritage. This approach moves textile identification toward a scalable method applicable in situ, making it attractive beyond heritage-related fields, for example in forensics and the textile industry [5,8,24]. This preliminary study focuses on model performance under controlled conditions, using a publicly available dataset of contemporary textiles. Future work will involve constructing an image library dedicated to archaeological textiles, enabling model training and testing under authentic preservation states, imaging conditions, and sample heterogeneity.
3. Results
3.1. Per-Model Evaluation
The models were evaluated for textile classification using macroscopic images from the FABRICS dataset, which is characterized by significant class imbalance. The evaluation focused on classification performance, the training process through the loss and accuracy curves, and each model's ability to handle the imbalance problem. In general, all models achieved high overall accuracy, but with significant differences in performance on minority textile categories, as presented in the following sections.
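The per-class reports and confusion matrices discussed below follow standard scikit-learn conventions. A minimal sketch of how such an evaluation is typically produced is given here, with synthetic stand-in predictions, since the study's own evaluation pipeline is not reproduced in this section:

```python
# Sketch of the per-class evaluation used below: precision/recall/f1 per class
# plus a confusion matrix. y_true/y_pred are synthetic stand-ins for validation
# labels and model predictions; the class names are an illustrative excerpt.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
class_names = ["Cotton", "Denim", "Polyester", "Satin"]
y_true = rng.integers(0, len(class_names), size=200)
# Simulate a noisy classifier that is correct ~80% of the time.
y_pred = np.where(rng.random(200) < 0.8, y_true,
                  rng.integers(0, len(class_names), size=200))

print(classification_report(y_true, y_pred, target_names=class_names,
                            zero_division=0))
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
print(cm)
```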
3.1.1. ResNet50
From the classification report (Figure 1a) and confusion matrix (Figure 1b), ResNet50 shows strong performance in identifying dominant classes and avoids false positive predictions, for example, in the classes Denim (f1-score: 0.95) and Cotton (f1-score: 0.87), followed by Polyester (f1-score: 0.69). On the other hand, the model struggles significantly with Satin (f1-score: 0.27, precision: 0.20, recall: 0.39) and Felt (f1-score: 0.25), indicating substantial difficulty in learning features of these classes. Moderate performance is demonstrated for Crepe (f1-score: 0.67) and Viscose (f1-score: 0.53), the latter presenting well-balanced recall and precision (both 0.78). Minority classes such as Chenille, Suede, Velvet, and Lut present a perfect f1-score of 1.00, but with minimal validation support, due to the number of samples in the validation set (between 1 and 3). Perfect performance on such a small number of samples may not generalize to larger sets and is not a strong indication of generalization ability.
The training loss curve (Figure 1c) decreases steadily, reaching a value of 0.19 at epoch 100, indicating that the model learns efficiently from the training data. The validation loss curve starts higher and fluctuates, and its final value of 0.63 is notably higher than the final training loss of 0.19, suggesting slight overfitting. Similarly, the training accuracy (Figure 1d) increases rapidly in the initial epochs and then stabilizes at high levels, reaching 86.76%. This suggests that the model learns effectively to recognize textiles in the training set. The validation accuracy reaches 81.00%, showing good generalization, though the gap with training accuracy leaves room for improvement.
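For orientation, the following is a minimal transfer-learning sketch for the ResNet50 case using torchvision; the number of classes, hyperparameters, and freezing strategy are assumptions for illustration, not the study's actual training configuration:

```python
# Minimal sketch: ImageNet-pretrained ResNet50 adapted to textile classes.
# NUM_CLASSES and all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20  # assumed number of textile categories

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the head

# Optionally train only the new head first, keeping the backbone frozen.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```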
3.1.2. EfficientNetV2
The performance of the EfficientNetV2 model is presented in Figure 2. According to the classification report (Figure 2a) and confusion matrix (Figure 2b), the model achieves significant success in recognizing several classes. Cotton is identified with high confidence (f1-score: 0.94, recall: 0.95, precision: 0.94), consistent with its dominance in the dataset. The Denim and Polyester classes also exhibit excellent performance, with f1-scores of 0.97 and 0.83, respectively. Satin remains a challenge, as with the ResNet50 results; its f1-score of 0.50 and recall of 0.57 both suggest confusion with other classes. On the other hand, Viscose performs satisfactorily (f1-score: 0.66), indicating a balanced classification. Classes like Crepe, Acrylic, and Corduroy present mid-level performance (f1-scores: 0.67–0.80). While Acrylic shows a reasonable f1-score, Crepe remains a relatively difficult class, although its performance is improved in comparison to ResNet50. Classes like Leather, Chenille, Linen, Suede, Velvet, and Lut are perfectly classified (f1-scores: 1.00), but given their small number of samples, these results should be interpreted with caution and are not strong indicators of generalization ability. In any case, the distinctive morphological characteristics of Leather in comparison to all the other classes should be taken into consideration.
Regarding the learning process, the training loss curve (Figure 2c) decreases consistently, reaching a final value of 0.12, confirming that the model learns effectively from the training data. The validation loss also drops sharply in the initial epochs and stabilizes around 0.46, remaining notably higher than the training loss. This gap was also observed in the validation loss of ResNet50, again suggesting a degree of overfitting between epochs 15 and 25, although the validation performance remains strong. The training accuracy curve (Figure 2d) rises rapidly and reaches a plateau near 95.27%, while the validation accuracy reaches 85.18%. Despite the gap, the model exhibits strong generalization ability, and the overall performance is considered robust across both frequent and less-represented textile classes. The observed fluctuations, which are more pronounced in the validation curves than in the training curves, reflect the model's sensitivity to class imbalance in the validation set, owing to the challenge of generalizing across underrepresented classes.
3.1.3. ConvNeXt
The ConvNeXt model also shows strong performance across the dominant textile categories. For example, Cotton is again recognized with great success, achieving an f1-score of 0.92 (Figure 3a). Denim is one of the best-classified textiles, with an excellent f1-score of 0.94. For the Polyester class, the model achieves a strong f1-score of 0.81, with a high success rate in recognizing its samples (recall: 0.79) and good precision (0.83). The confusion matrix (Figure 3b) further clarifies the specific confusions. The difficulty in recognizing Satin persists (f1-score: 0.27), similar to the previous models: the confusion matrix shows that Satin continues to be mistaken for Silk and Polyester, which can be explained by the similarity in morphological characteristics among the members of those classes. Nylon has an f1-score of 0.67, with relatively high precision (0.77) but lower recall (0.59), indicating that the model misses many of its samples. Despite these weaknesses, significant improvements were seen in certain small classes. Crepe has an f1-score of 0.67, and Silk is classified more effectively, with an f1-score of 0.80. The Acrylic class performs reasonably well with an f1-score of 0.67, while Leather achieves a perfect f1-score of 1.00, as with the previous models, and Velvet also shows excellent performance with an f1-score of 0.80. For some textiles like Linen, Suede, and Lut, the f1-score is perfect (1.00), but this should be taken with caution, given the very few samples in these categories. Finally, Viscose and Wool both present f1-scores of 0.90, showing a good level of recognition.
The training loss decreases smoothly and steadily (Figure 3c), and the training accuracy curve (Figure 3d) increases to a high level, as with the previous models, reaching a final training accuracy of 91.91% and a validation accuracy of 86.44%. The final training loss is 0.14, and the validation loss is 0.42. This gap between training and validation loss again suggests a degree of overfitting, but it is narrower than in the previous models.
3.1.4. ViT
Based on the per-class performance analysis of ViT (Figure 4a), the model performs well in the most frequent categories. For example, the Cotton class is recognized with an f1-score of 0.91, and the Denim class is also classified well, with an f1-score of 0.95, while for Polyester, the f1-score is 0.83. In this model, too, a bias towards Polyester seems to exist, as the model tends to incorrectly classify other textiles into this category, as suggested by its precision of 0.78 compared to a recall of 0.88. The recognition of Satin is extremely poor, as the f1-score reaches only 0.27, further demonstrating that ViT is a weak performer in this class. Nylon has an f1-score of 0.69, with perfect precision (1.00) but low recall (0.53), indicating that the model often fails to identify its samples. Crepe also performs very poorly, with an f1-score of 0.43. The confusion matrix (Figure 4b) reveals more about these misclassifications. Viscose has an f1-score of 0.43, with balanced precision and recall. In the remaining categories, the overall picture is mixed. Classes like Acrylic (f1-score: 0.33), Corduroy (f1-score: 0.73), and Silk (f1-score: 0.59) show low to moderate performance. Felt stands out with an f1-score of 0.00, while Wool has an f1-score of 0.86. Classes such as Suede, Linen, and Lut achieve perfect classification, as they all reach an f1-score of 1.00. However, as with the previously presented models, these perfect scores are again attributable to the very small number of samples available for these categories and are not strong indicators of the model's general ability.
The loss and accuracy curves (Figure 4c,d) indicate that the training process has its own distinct characteristics. The training loss curve (Figure 4c) decreases smoothly to 0.12, but the validation loss curve shows fluctuations and remains at higher levels, reaching 0.28. This means that the model faces difficulties generalizing to new data and exhibits overfitting. The training accuracy curve (Figure 4d) rises quickly to 98.51%, but the validation accuracy curve, while good, reaches 85.89%, further confirming the challenges in generalization.
3.1.5. Swin Transformer
The f1-scores and the confusion matrix of the Swin Transformer are presented in Figure 5a,b. The model performs well in the most frequent categories. For Cotton, it achieves an f1-score of 0.96 with very high precision (0.96) and recall (0.96), while for Denim, the performance is also excellent (f1-score: 0.96).
The model effectively identifies most of the Polyester samples (f1-score: 0.79), with a recall of 0.73 and a precision of 0.86, indicating only minor over-prediction. Regarding the more challenging categories, the performance on Satin remains low, with an f1-score of 0.27, the same as ViT, indicating that this class remains a challenge. Nylon has an f1-score of 0.77, meaning that samples predicted as Nylon are always correct, but many actual Nylon samples are still missed. Notably, the model shows a significant improvement in recognizing Viscose compared to ViT, achieving an f1-score of 0.95. This indicates that the Swin Transformer is much more effective at identifying this category. In the remaining categories, the performance is generally acceptable, with Acrylic, Terrycloth, and Wool presenting very strong results. The learning curves indicate a smooth and stable training process (Figure 5c,d). The training loss steadily decreases to 0.09, and the validation loss stabilizes around 0.25 with low fluctuations. The relatively small gap between the two loss curves indicates that the model avoids significant overfitting and generalizes well. The training accuracy rises quickly to 97.09%, and the validation accuracy stabilizes at a high level of 91.80%, confirming the Swin Transformer's strong ability to generalize while maintaining high performance across both major and minor textile classes.
3.1.6. MaxViT
The performance of the MaxViT architecture is presented in Figure 6. Beyond the f1-scores (Figure 6a), the confusion matrix (Figure 6b) reveals the model's uneven textile-by-textile performance. Lut is perfectly classified (f1-score: 1.00), although it is supported by only one sample. Satin continues to be a problematic class (f1-score: 0.30), and Crepe is also poorly classified (f1-score: 0.57); the confusion matrix further elaborates on the specific confusions for these classes. Cotton is once again well recognized (f1-score: 0.91), and Denim also performs very well (f1-score: 0.97), with strong precision (0.95) and recall (0.94), indicating that the model correctly identifies almost all of the class's samples. For Polyester, the model presents a good f1-score of 0.81, with balanced precision (0.80) and recall (0.84). The results for the remaining classes are mixed: Corduroy, Silk, Chenille, and Velvet present satisfactory f1-scores of ~0.80, and Linen of 1.00, although the Chenille and Linen classes contain very few samples. Fleece also shows strong performance, with an f1-score of 0.90.
The training loss and accuracy curves (Figure 6c,d) reveal a training process with some inconsistencies. While the training loss curve decreases smoothly to a very low level of 0.02, the validation loss curve shows significant fluctuations and stays at relatively high levels, reaching 0.51. This suggests that the model struggles to fully generalize to new data, exhibiting signs of overfitting. Similarly, the training accuracy curve rises quickly to 97.61%, while the validation accuracy curve, although showing good performance, reaches 82.89%, indicating a notable gap between training and generalization capabilities.
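All six architectures compared above are available as ImageNet-pretrained backbones, for example through the timm library. A minimal instantiation sketch follows; the specific model variants named here are assumptions, since the exact checkpoints used by the study are defined in its methodology:

```python
# Sketch: creating the six compared architectures with pretrained weights via
# timm. The chosen variants (e.g., tiny/small/base) are assumptions; the
# study's own configurations may differ.
import timm

NUM_CLASSES = 20  # assumed number of textile categories

variants = {
    "ResNet50":       "resnet50",
    "EfficientNetV2": "tf_efficientnetv2_s",
    "ConvNeXt":       "convnext_tiny",
    "ViT":            "vit_base_patch16_224",
    "Swin":           "swin_tiny_patch4_window7_224",
    "MaxViT":         "maxvit_tiny_tf_224",
}

models = {name: timm.create_model(v, pretrained=True, num_classes=NUM_CLASSES)
          for name, v in variants.items()}
```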
3.2. Comparative Analysis of Heatmaps and Attention Maps
The heatmaps for CNN models and the attention maps for transformers and hybrid models provide insights into the regions of the image that each model considers important for classification [46]. The heatmaps for the CNN models (ResNet50, EfficientNetV2, ConvNeXt) and the attention maps for the transformer and hybrid models (Swin, ViT, and hybrid MaxViT), for a representative sample, are presented in Figure 7 and Figure 8, respectively, for comparison.
ResNet50: The Grad-CAM heatmap for ResNet50 shows a strong, localized focus on details on the lower-left side of the textile textures. This is consistent with the nature of CNNs to extract hierarchical local features, particularly where weaving details are most pronounced.
EfficientNetV2: The heatmap for EfficientNetV2 shows a broader focus across regions where weaving features are pronounced, covering a wider area than ResNet50. This suggests that EfficientNetV2 attends to multiple key local features across the image, rather than concentrating exclusively on areas where weaving details are pronounced. This distributed, wide focus may contribute to its overall strong performance.
ConvNeXt: The heatmap for ConvNeXt also shows a focus on areas with strong texture and pattern, but the distribution appears more limited, or more evenly spread across textured regions, compared to the very broad focus of EfficientNetV2.
Swin Transformer: The attention map for the Swin Transformer reveals a focus on distinct, “patched” areas of the texture. This directly reflects the window-based self-attention mechanism of Swin Transformer, where attention is computed within local windows that are shifted in subsequent layers. The model captures local interactions within these windows and aggregates information hierarchically.
Vision Transformer (ViT): The attention map for the Vision Transformer shows smaller, patch-based foci spread across various areas of the image. Unlike the localized or window-based focus of CNNs and Swin, ViT processes the image as a sequence of patches and computes attention between them globally (at least in its base form). The attention map indicates that the model considers information from multiple, potentially non-contiguous small patches important for classification.
MaxViT: The attention map for MaxViT shows a focus on various texture regions, potentially combining local and more extensive features. This aligns with its multi-axis attention mechanism, which aims to capture both local and sparse global dependencies. The focus appears less narrowly localized than in pure CNNs and more uniformly distributed than in the base ViT, suggesting a hybrid approach to feature integration.
Figure 9 presents a schematic overview of the distinct focus mechanisms of the architectures, while the focus characteristics of all models are summarized in Table 3, reflecting the fundamental architectural designs of the models. CNNs excel at capturing local features through convolutional filters, while transformers, with their attention mechanisms, are better at modeling long-range dependencies. Understanding these focus patterns can provide insights into why certain models perform better on specific types of visual tasks or datasets. For textile classification, the ability to capture both fine-grained local textures and potentially broader patterns across the textile could be important. The distributed or multi-region focus observed in some models (EfficientNetV2, Swin, and ViT) might be beneficial for this task compared to a very centralized focus. These differences have practical implications for heritage professionals. CNNs focus on small, highly textured areas, such as thread crossings and weave density, while transformers focus on multiple distant regions simultaneously, capturing wider weave structures and variations; hybrid models combine these characteristics. As a consequence, models with broader or multi-region attention may perform better on complex weave patterns or degradation irregularities, while models focusing on local texture may stand out when identifying details such as threads or distinctive fiber characteristics.
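For the CNN heatmaps, the underlying Grad-CAM computation can be reproduced with a few hooks. The sketch below is a minimal hand-rolled version for ResNet50, with a random placeholder input instead of a preprocessed textile image; it does not claim to match the study's exact visualization tooling [46]:

```python
# Minimal hand-rolled Grad-CAM for ResNet50: the map is the ReLU of the last
# conv block's activations, weighted channel-wise by the spatially averaged
# gradients of the top-class logit. `img` is a placeholder input tensor.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
acts, grads = {}, {}
layer = model.layer4[-1]  # last residual block
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

img = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
logits = model(img)
logits[0, logits.argmax()].backward()    # gradient of the top-class score

w = grads["g"].mean(dim=(2, 3), keepdim=True)            # channel weights
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))   # weighted sum of maps
cam = F.interpolate(cam, size=img.shape[2:], mode="bilinear",
                    align_corners=False)                  # upsample to image
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```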
3.3. Energy Efficiency and Ecological Footprint
In addition to accuracy, the energy consumption during training and testing of the models was evaluated using the CodeCarbon tool [44]. The aim was to measure the efficiency of each method, i.e., how much energy is required to achieve a given level of accuracy, and therefore what its ecological footprint (CO2 emissions) is.
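The measurement pattern with CodeCarbon is straightforward; a minimal sketch is shown below, with the project name as an illustrative placeholder. The tracker estimates energy from hardware power draw and converts it to CO2 via the regional energy mix:

```python
# Sketch of emissions tracking with CodeCarbon around a training run.
# The project name is a placeholder; "GRC" pins the Greek energy mix.
from codecarbon import OfflineEmissionsTracker

tracker = OfflineEmissionsTracker(country_iso_code="GRC",
                                  project_name="textile-classification")
tracker.start()
# ... training or inference workload goes here ...
emissions_kg = tracker.stop()   # estimated kg of CO2-equivalent
print(f"{emissions_kg * 1000:.1f} g CO2eq")
```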
As presented in Table 4, the results show, as expected, significant differences. Swin and ResNet50 are the most energy efficient. Specifically, Swin consumed approximately 0.064 kWh for 55 training epochs, which corresponds to ~22 g of CO2 emissions in Greece, an impressively low value for such a powerful model. ResNet50, due to its smaller architecture and optimized convergence process, showed even lower consumption (~0.044 kWh), thus achieving the best accuracy-to-energy ratio. On the contrary, ConvNeXt and MaxViT had clearly increased consumption. MaxViT was estimated to consume about 0.239 kWh for the same training time, more than five times the energy of ResNet50, despite delivering lower accuracy. This also implies a larger carbon footprint.
Overall, the energy efficiency is ranked as follows: ResNet50 → Swin → EfficientNetV2 → ViT → ConvNeXt → MaxViT. This is also visible in Figure 10, which plots the final accuracy against the energy spent per model. Swin and EfficientNetV2 sit in the upper right corner (high accuracy and low energy), while MaxViT moves to the middle left (middling accuracy, higher energy consumption). Such a combination is undesirable when the environmental aspect is taken into consideration.
It is worth noting that all measurements were made on a local system with an NVIDIA RTX 3050 GPU, with emissions calculated by CodeCarbon based on the energy mix of Greece. The absolute emission figures, a few tens of grams of CO2, are small, but at a larger scale (e.g., training multiple models or for many more epochs), they would increase. Therefore, the choice of model also has ecological significance: a more efficient model can reduce computational costs and emissions during the development and deployment of the system.
In addition, the consumption during the inference stage was also examined. It was found that heavier models (ConvNeXt, MaxViT) require more memory and time per image for classification, while lightweight CNNs make predictions faster. This means that in a possible real-world application (e.g., a textile recognition system at the EU scale for researchers), the use of EfficientNetV2 would be preferable not only for its accuracy but also for its faster response and lower operating costs.
In summary, the energy evaluation highlights that efficient modern architectures, whether CNN-based (EfficientNetV2) or transformer-based (Swin), outperform heavier models in resource-constrained scenarios. These results encourage the adoption of models like Swin and EfficientNetV2 in practical applications where sustainability and efficiency are desired, as they achieve similar or better accuracy with a smaller ecological footprint.
4. Overall Discussion on Model Selection for Heritage-Oriented Textile Classification
The choice of the FABRICS dataset was dictated by the need to test candidate model architectures on a set that simulates the challenges of archaeological research, and, in particular, the documentation and analysis of ancient, historical, and contemporary textiles. It should be noted that the FABRICS dataset consists of contemporary textiles photographed under controlled conditions, not authentic, culture-related artifacts. The class imbalance in the dataset, combined with its stylistic and optical diversity, together with the coexistence of classes based on cloth type and on textile composition, simulates conditions such as limited or fragmentary preservation of textiles, unevenness in images due to wear, or the presence of textures combined with the remains of other materials, as often occurs in archaeological contexts [5,25]. In practice, computational analysis of archaeological textiles is often based on limited samples, without the possibility of homogeneous photography or extensive preprocessing. The FABRICS dataset offers a controlled experimental environment that allows for the prototyping of models before applying them to real, noisy field data. The use of FABRICS as an experimental platform allowed for the objective evaluation and comparison of different image classification models, as well as imbalance treatment techniques, such as Focal Loss [47] and stratified sampling [48], sketched below. This experimental phase is a prerequisite for application to archaeological data, where the aim is to identify those models that exhibit the best generalization ability from a few samples and the greatest discriminatory power on materials with similar texture or coloration. Therefore, the study on the FABRICS dataset served as a critical preparatory phase for the development of textile classification tools for archaeological collections, with the aim of automatically assisting the documentation, analysis, and conservation of textile findings.
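For concreteness, a minimal sketch of both imbalance-handling techniques follows: a multi-class focal loss in PyTorch and a stratified split via scikit-learn. The gamma value and the toy data are illustrative assumptions:

```python
# Sketch of the two imbalance-handling techniques named above.
# Focal loss down-weights easy (typically majority-class) examples;
# stratified splitting preserves class proportions in train/validation.
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL = (1 - p_t)^gamma * CE, averaged."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                      # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Stratified split on toy data: each class keeps its proportion in both sets.
samples = list(range(100))                    # stand-ins for image paths
labels = [i % 4 for i in range(100)]          # four illustrative classes
train_x, val_x, train_y, val_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=42)
```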
Additionally, it should be noted that the FABRICS dataset includes textiles of various colors, as well as patterned textiles. For the latter, the motifs result from the use of differently colored yarns, not from printed or embroidered designs. Despite this additional visual complexity, the models achieve high classification accuracy, again demonstrating their ability to generalize across both plain and patterned textiles. Furthermore, the FABRICS grouping contains several weave and pattern variants within the same class. The Satin class is such an example, where multiple satin-type weaves are included without further sub-classification. This reflects the dataset's prioritization of surface texture over detailed weave typology, which should be taken into consideration in future work aiming at a more weave-specific classification.
The combined results of applying six computer vision architectures to textile classification show that Swin and EfficientNetV2 achieved the highest overall accuracies, of 0.89 (Table 5 and Figure 11). They also stand out with the best overall performance, as indicated by their Weighted-f1-score (0.89), while ConvNeXt closely follows (0.87). This suggests that they are the most reliable models for textile classification, even with imbalanced data [23]. MaxViT shows impressive potential, with an overall accuracy of 0.86 and a Weighted-f1-score of 0.86, quite close to the aforementioned architectures. In contrast, ViT and ResNet50 had the lowest performances, with accuracies of 0.84 and 0.81, respectively, and lower Weighted-f1-scores (0.83 and 0.82), suggesting that they struggle more with this specific type of data. The Macro-f1-score, which is more sensitive to small classes, confirms this trend: Swin and EfficientNetV2 (Macro-f1-score: 0.89) remain at the top, showing that they handle the challenge of small classes better.
By combining all the findings presented in Section 3, from performance metrics and energy efficiency to visual interpretation, the following guidelines are provided for selecting the most appropriate model for the unevenly distributed textile classification task. All models successfully recognize categories with many samples, like Cotton and Denim. However, the real challenge lies in textile classes with limited samples, which can be considered of the greatest significance for heritage research, as authentic historical findings are scarce [49]. In this case, the recognition of Satin (n = 24) was extremely difficult for all models, with almost all architectures presenting a low f1-score of 0.27 for this class, apart from EfficientNetV2 (f1-score: 0.50). All models nearly failed to classify Felt, as this class has only a single sample. MaxViT faced additional difficulties, failing in the Suede class as well (n = 5). Despite the struggles, there are also encouraging signs: some limited-sample classes like Chenille (n = 13), Linen (n = 19), and Velvet (n = 11) are perfectly classified by most models. In conclusion, the choice of the right model should not be based solely on overall accuracy but primarily on the f1-score of limited-sample classes.
The Swin and EfficientNetV2 models proved to be the most reliable, combining high overall performance with stable and efficient operation. The choice between them might depend on computational cost, as Swin is likely more efficient. The ConvNeXt model is a very promising option, as it approaches the performance of Swin and EfficientNetV2 and offers improvements on specific limited-sample classes. It is a much better choice than ViT and MaxViT, which faced similar difficulties with limited training samples. Consequently, in archaeological research—where the correct classification of a limited number of textile classes is decisive—priority should be given to models with stable and reliable performance in these challenging low-sample cases, while the potential of ConvNeXt for further improvement on limited-sample classes is also particularly interesting [14].
Apart from their performance in this preliminary study, the integration of the models into museum and laboratory workflows is a key future step. Lightweight and efficient models, such as the Swin Transformer and EfficientNetV2, could run on standard workstations or even portable devices, allowing for in situ classification during standard workflows (documentation, conservation, or cataloging procedures). In laboratory environments, this automated classification could be embedded in standard, established procedures to assist with preliminary identification and prioritization for further analysis.
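As an indication of how lightweight such a deployment could be, a minimal single-image inference sketch follows; the checkpoint filename and image path are hypothetical placeholders, and the normalization constants are the standard ImageNet statistics:

```python
# Sketch of in situ single-image inference with a fine-tuned model.
# "textile_model.pt" and "sample.jpg" are hypothetical placeholder names.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet mean
                         [0.229, 0.224, 0.225]),  # ImageNet std
])

model = torch.load("textile_model.pt", map_location="cpu", weights_only=False)
model.eval()

img = preprocess(Image.open("sample.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=1)
print(int(probs.argmax(dim=1)))  # predicted class index
```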
The explainability of the models, as demonstrated through the resulting heatmaps and attention maps, remains the most significant output of this deep learning approach. These visualizations highlight key structural textile details and offer archaeological insights that are interpretable by researchers. The successful application of deep learning architectures highlights the need to construct an extended image library of textiles of archaeological and cultural interest. Such a dataset would allow model training on real cases. It would also be used to test the method's limitations, for example, in distinguishing between textiles with similar characteristics, or degraded textiles, which remain challenging even under optical and electron microscopes [6].
5. Conclusions
The present preliminary study investigates the possibility of using pretrained deep learning models for textile classification, using low- and conventional-magnification textile images, within the scope of non-invasive analysis in heritage science. To this end, three CNN models (ResNet50, EfficientNetV2, and ConvNeXt) and three transformer and hybrid models (Swin Transformer, Vision Transformer, and MaxViT) were trained and evaluated on a publicly available, imbalanced textile image dataset. This approach—bridging machine learning and heritage science—offers a non-invasive alternative for textile identification and documentation, aligned with sustainability goals.
Overall, this benchmarking provides a comprehensive view of the capabilities and limitations of each of the models under study. Based on the findings, the following recommendations are proposed for model selection, depending on accuracy, explainability, and energy efficiency requirements. EfficientNetV2 performs best on small datasets, making it ideal for early-stage archaeometric applications, while Swin appears efficient when facing large datasets. On the other hand, although transformers typically require large training datasets, they should not be ruled out, as their performance could be improved with further training, pre-training, or hybrid architectural approaches. However, given the imbalanced dataset scenario considered here, the present work tends to conclude that classical deep learning (CNNs) outperforms transformers for imbalanced textile classification, both in efficiency and effectiveness.
In conclusion, this comparative study demonstrates that although no single model outperforms the others across all criteria, Swin and EfficientNetV2 emerge as the most balanced with respect to the proposed objectives. They both offer top accuracy, comparable or superior to the others, maintain high performance on the minority classes, and remain resource friendly. The results of the present preliminary study show that the deep learning approach can effectively classify low-magnification textile images. This highlights the feasibility of developing a custom image dataset of textiles of archaeological and cultural interest, including samples in various preservation states. Moreover, the models demonstrate promising performance even with low-resolution images that can be collected using conventional microscopes or cameras, reinforcing the potential of building field-deployable textile recognition systems based on non-invasive imaging. In the long term, such approaches may enable the identification and classification of the detailed morphological characteristics that define textile structure, such as thread-twisting direction, tightness and thickness of the thread, number of threads that compose the yarn, warp–weft count, and pore size, supporting deeper, non-invasive analysis in heritage and forensic contexts.
Based on the present findings, the creation of a heritage-specific textile image dataset is recommended for future heritage-oriented applications. Such a dataset should include images of textiles of different preservation states and typologies, enabling the training of models that generalize effectively to archaeological contexts. This effort should be supported by the systematic recording of textile-related metadata for model explainability and reproducibility across various collections. Additionally, energy assessment protocols and in situ integration into documentation workflows would support rapid classification, prioritization, and evaluation of environmental impact. Together, these steps could support the development of an integrated, sustainable, rapid, and robust framework for large-scale textile classification.