1. Introduction
Fine-grained visual classification (FGVC) aims to distinguish between visually similar subordinate categories within the same superordinate class, requiring models to capture subtle inter-class differences while accommodating significant intra-class variations. This task finds critical applications across diverse domains, including biological species identification [1,2], art heritage preservation [3,4], and medical diagnosis. The inherent difficulty of FGVC stems from the fact that discriminative regions are often localized and subtle.
Ancient mural classification represents a domain where fine-grained classification presents unique challenges. First, mural images often contain complex backgrounds with significant degradation due to aging and environmental factors, introducing considerable noise that interferes with feature extraction [5]. Second, artistic styles vary substantially across different historical periods, requiring models to capture both low-level visual features and high-level semantic information [3]. Third, ancient mural datasets are typically limited in scale due to the scarcity of these cultural artifacts.
Traditional approaches relied on hand-crafted features such as SIFT descriptors [6], color histograms, and LBP textures. While achieving moderate success, they suffer from the inability to bridge the semantic gap between low-level visual features and high-level categorical concepts. The advent of deep learning has revolutionized image recognition through automatic hierarchical feature learning [7]. These methods can be broadly categorized into supervised and semi-supervised/unsupervised paradigms. Supervised approaches, such as Bilinear CNN [8], Recurrent Attention CNN [9], NTS-Net [10], and data-driven neural operators [11], learn discriminative representations directly from labeled data, achieving strong performance when sufficient annotations are available. Semi-supervised and unsupervised methods, including domain adaptation techniques and generative adversarial networks [12], address data scarcity by leveraging unlabeled samples or synthesizing augmented data. For mural classification, Cao et al. [3] proposed a multichannel separable network fusing high-level features with low-level descriptors. While these paradigms have advanced solutions for degradation and limited annotations, they primarily operate in the spatial domain without explicitly modeling frequency characteristics or fine-grained token relationships. Our supervised framework complements existing approaches by introducing frequency-domain attention and cross-token relation modeling to capture texture periodicity and compositional structures inherent in cultural heritage imagery.
The emergence of the Vision Transformer (ViT) [13] has introduced a paradigm shift by adapting self-attention mechanisms to computer vision. Unlike CNNs, ViT treats an image as a sequence of patches and models global relationships through multi-head self-attention, offering advantages for fine-grained classification: capturing long-range dependencies, providing flexibility in aggregating information from different layers, and offering a unified class token representation. Recent ViT-based methods such as TransFG [14], TPSKG [15], and FFVT [16] have demonstrated superior performance compared to their CNN counterparts.
However, directly applying standard ViT to fine-grained tasks reveals critical limitations. First, standard patch-based tokenization treats all image regions uniformly, which is suboptimal when only specific regions contain discriminative information. Second, while ViT employs multi-head self-attention, it lacks specialized mechanisms to model fine-grained cross-token relationships essential for understanding compositional structures. Third, standard ViT uses only the class token from the final layer for classification, potentially discarding valuable complementary information from intermediate layers. Fourth, existing methods operate in the spatial domain without explicitly considering frequency-domain characteristics, which are relevant for analyzing periodic patterns and textures in ancient murals.
To address these limitations, this paper proposes a novel fine-grained classification framework with three key innovations that directly target the aforementioned challenges. First, to overcome the fourth limitation regarding the lack of frequency-domain modeling in existing methods, we design a Frequency Channel Attention (FreqCA) module that operates in the frequency domain to capture periodic patterns and texture characteristics. By transforming image features into the frequency domain and applying adaptive channel-wise attention, FreqCA enables the model to focus on the frequency components that are most discriminative for distinguishing artistic styles, which is particularly relevant for analyzing brushwork patterns, texture granularity, and decorative motifs in ancient murals. Second, to address the second limitation concerning the absence of specialized mechanisms for modeling fine-grained cross-token relationships, we propose a Cross-Token Relation Attention (CTRA) mechanism that explicitly models pairwise relationships between tokens beyond standard self-attention. By constructing a relation-aware feature space where tokens corresponding to related visual elements establish stronger connections, CTRA facilitates the understanding of compositional structures and recurring motifs that are essential for fine-grained discrimination.
Furthermore, to mitigate the challenge that visually similar fine-grained categories are inherently difficult to separate, we incorporate an Adaptive Margin Contrastive Center Loss (AMCCL) that enhances discriminative power by simultaneously encouraging intra-class compactness and inter-class separability with adaptive margins. Unlike fixed-margin loss functions that treat all class pairs equally, AMCCL adjusts separation boundaries dynamically based on the similarity between class centers, assigning larger margins to confusable categories while allowing smaller margins for clearly separable ones. This adaptive mechanism is particularly beneficial for fine-grained classification tasks where inter-class similarities vary substantially across different category pairs.
We evaluate our framework on the CUB-200-2011 birds dataset [1], the Stanford Dogs dataset [2], and an ancient mural dataset spanning multiple dynasties. The experimental results demonstrate state-of-the-art performance across all datasets, with particularly notable improvements for ancient mural classification.
The main contributions of this work are summarized as follows:
We propose FRAM-ViT, a unified Vision Transformer framework for fine-grained visual classification in complex scenarios such as ancient murals. This framework systematically integrates frequency-domain feature enhancement, explicit cross-token relation modeling, and discriminative metric learning within a modular and extensible architecture. It effectively addresses challenges caused by image degradation, complex backgrounds, and limited annotations.
To achieve fine-grained discrimination, we design three core mechanisms: FreqCA for enhancing frequency-domain texture information, CTRA for explicitly modeling pairwise relationships among tokens, and AMCCL for enhancing feature separability and discriminative supervision.
We validate the proposed framework through extensive experiments on CUB-200-2011, Stanford Dogs, and a proprietary Dunhuang mural dataset. The proposed method achieves classification accuracies of 91.15%, 94.57%, and 94.27% on these three datasets, respectively, outperforming the ACC-ViT baseline by 1.35%, 1.63%, and 2.20%. These results, compared with several recent fine-grained classification methods, highlight the framework’s effectiveness, robustness, and generalization across datasets with varying levels of visual complexity.
The remainder of this paper is organized as follows.
Section 2 reviews related work.
Section 3 presents the methodology.
Section 4 describes the experimental setup.
Section 5 discusses implications and limitations.
Section 6 concludes the paper.
2. Related Work
In this section, we provide a comprehensive review of research areas relevant to our work, including fine-grained visual classification methods, vision transformer architectures, attention mechanisms for feature learning, cultural heritage image analysis, and metric learning approaches for discriminative feature representation.
2.1. Fine-Grained Visual Classification
Fine-grained visual classification has been a longstanding challenge in computer vision, requiring models to distinguish between subordinate categories with subtle visual differences. Early approaches to FGVC relied heavily on part-based representations with strong supervision. These methods typically required manual annotations of object bounding boxes and part locations to guide the localization of discriminative regions [17,18]. While effective, the heavy annotation burden limited their practical applicability and scalability to large-scale datasets.
The development of deep convolutional neural networks has enabled weakly supervised FGVC methods that learn from image-level labels alone. Bilinear CNN (B-CNN) [8] pioneered the use of second-order feature interactions through outer product pooling, capturing rich correlations between feature channels without explicit part annotations. Building upon this foundation, numerous CNN-based methods have been proposed to automatically localize and aggregate discriminative features. Recurrent Attention CNN (RA-CNN) [9] introduced a recurrent attention mechanism that iteratively zooms into discriminative regions at multiple scales. The Navigator–Teacher–Scrutinizer Network (NTS-Net) [10] combined region proposal with classification in a unified framework, enabling joint learning of region localization and feature extraction. Part-Stacked CNN [19] stacked multiple part detectors to capture complementary discriminative regions, while Weakly Supervised Complementary Parts Models [20] learned to identify mutually exclusive parts through a bottom-up approach.
More recent CNN-based approaches have focused on sophisticated attention mechanisms and feature aggregation strategies. Cross-X Learning [21] enhanced feature representations by learning cross-layer and cross-scale interactions. Channel Interaction Networks [22] explicitly modeled relationships between feature channels to capture fine-grained patterns. Selective Sparse Sampling [23] adaptively samples informative spatial locations based on learned attention weights. Despite these advances, CNN-based methods face fundamental limitations in modeling long-range dependencies due to their inductive bias toward local receptive fields, motivating the exploration of alternative architectures such as transformers.
2.2. Vision Transformers for Image Classification
The Vision Transformer (ViT) [13] marked a paradigm shift in visual recognition by successfully adapting the transformer architecture from natural language processing to computer vision. Unlike CNNs, which build hierarchical representations through successive convolutions with local receptive fields, ViT treats an image as a sequence of flattened patches and models global relationships through multi-head self-attention. This design enables every patch to attend to all other patches, facilitating the capture of long-range dependencies without the constraints of fixed kernel sizes. The original ViT demonstrated that when pretrained on large-scale datasets such as ImageNet-21k or JFT-300M, pure transformer architectures can match or exceed the performance of state-of-the-art CNNs on various image classification benchmarks.
Following ViT, numerous variants and improvements have been proposed to enhance transformer-based visual recognition. Data-Efficient Image Transformers (DeiT) [24] introduced knowledge distillation strategies to train vision transformers effectively on smaller datasets without requiring massive amounts of pretraining. TransUNet [25] combined transformers with the U-Net architecture for medical image segmentation, demonstrating the effectiveness of hybrid CNN-transformer designs. DETR [26] applied transformers to object detection by formulating detection as a set prediction problem. TransTrack [27] extended transformers to multiple-object tracking. These works collectively demonstrate the versatility and effectiveness of transformer architectures across diverse computer vision tasks.
For fine-grained visual classification specifically, several ViT-based methods have recently emerged. TransFG [14] represents one of the pioneering works in applying vision transformers to FGVC. It introduced a part selection module that identifies discriminative image regions by analyzing attention weights from the transformer layers. The selected parts are then used to augment the classification features. Transformer with Peak Suppression and Knowledge Guidance (TPSKG) [15] addressed the attention concentration problem by introducing a peak suppression mechanism that penalizes the model for focusing excessively on the most discriminative part, thereby encouraging the discovery of complementary discriminative regions. Additionally, it incorporated knowledge guidance through a teacher–student framework to transfer learned discriminative patterns. Feature Fusion Vision Transformer (FFVT) [16] aggregated important tokens from multiple transformer layers to preserve multi-scale information and compensate for potential information loss in deeper layers. However, these methods primarily focus on token selection and multi-layer fusion without explicitly addressing frequency-domain characteristics or fine-grained token relationships, leaving room for further improvements.
Recent advances have further explored the effective utilization of multi-scale feature hierarchies within transformer architectures. Jiang et al. [28] revisited the multi-scale feature hierarchy design in Detection Transformer (DETR), demonstrating that appropriate integration of features from different network depths significantly enhances detection performance by capturing both fine-grained details and high-level semantics. This insight resonates with our Complementary Tokens Integration (CTI) strategy, which extracts and aggregates class tokens from multiple transformer layers to leverage hierarchical feature representations. While Jiang et al. focused on object detection with explicit multi-scale feature pyramid construction, our CTI approach adapts this philosophy to fine-grained classification by selectively combining intermediate and final-layer representations, thereby preserving both low-level textural details and high-level semantic structures essential for distinguishing visually similar categories.
2.3. Attention Mechanisms in Deep Learning
Attention mechanisms have become a fundamental component in modern deep learning architectures, enabling models to selectively focus on relevant information while suppressing irrelevant features. In computer vision, spatial attention mechanisms have been widely adopted to identify discriminative regions. Squeeze-and-Excitation Networks introduced channel attention by modeling inter-channel dependencies through global average pooling and fully connected layers. Context-Aware Attentional Pooling (CAP) [29] proposed context-dependent attention weights for feature aggregation in fine-grained classification. Mask-Guided Contrastive Attention [30] leveraged foreground–background segmentation masks to guide attention learning for person re-identification.
Beyond spatial and channel attention, recent works have explored attention mechanisms in the frequency domain. While spatial-domain attention focuses on identifying important spatial locations or feature channels, frequency-domain attention can capture periodic patterns and textures that are not readily accessible in the spatial domain. In medical image analysis, frequency-domain features have been shown to be effective for capturing subtle pathological patterns. In natural image processing, frequency analysis has been used for texture classification and image quality assessment. However, the explicit integration of frequency-domain attention into vision transformers for fine-grained classification remains relatively unexplored, particularly for applications involving textured and patterned images such as ancient murals.
Cross-attention mechanisms, which model relationships between different modalities or feature groups, have attracted increasing attention. In vision-language models built on BERT [31], cross-attention enables the model to align visual features with textual descriptions. In visual recognition, cross-attention has been used to model relationships between global and local features, or between features from different network branches. However, most existing ViT-based fine-grained classification methods rely primarily on standard self-attention within each layer, without explicitly modeling cross-token relationships that could capture complementary discriminative information. Developing specialized cross-attention mechanisms for fine-grained feature learning represents a promising research direction.
Beyond spatial and channel attention paradigms, recent works have explored attention mechanisms that explicitly model correlations among feature representations. Li et al. [32] proposed AMTrans, an Auto-Correlation Multi-Head Attention Transformer designed for infrared spectral deconvolution. The key innovation of AMTrans lies in its auto-correlation attention mechanism, which computes attention weights based on the correlation structure among input features rather than relying solely on query-key dot products. This design philosophy shares conceptual similarities with our proposed Cross-Token Relation Attention (CTRA) module. While AMTrans applies auto-correlation to capture inherent spectral feature dependencies in infrared signal processing, our CTRA explicitly models pairwise relationships between visual tokens through learnable relation transformations, enabling the network to establish stronger connections between semantically related image regions. Both approaches recognize the importance of going beyond standard self-attention by incorporating explicit relational priors, though they target different application domains and employ distinct computational mechanisms tailored to their respective tasks.
2.4. Cultural Heritage Image Analysis and Mural Classification
The application of computer vision techniques to cultural heritage preservation has attracted increasing attention in recent years, with ancient mural classification representing a particularly challenging problem. Traditional approaches to mural classification relied on manually designed features and conventional machine learning methods. Early works employed SIFT descriptors [6] to capture local texture patterns, combined with support vector machines for classification. However, SIFT-based methods often produce false matches when applied to murals with complex artistic styles and severe degradation.
With the development of deep learning, CNN-based approaches have been adopted for mural analysis. Kumar et al. [6] employed pretrained AlexNet and VGGNet models for mural classification through transfer learning, demonstrating that features learned from natural images can be adapted to the cultural heritage domain. Wang et al. [5] modified the Inception-v3 architecture for dynasty-based classification of ancient Chinese murals, achieving improved performance through careful network adaptation and training strategies. Cao et al. [3] proposed a multichannel separable network that explicitly fuses high-level deep features with low-level color histograms and LBP textures, recognizing that both levels of representation are important for capturing the artistic characteristics of murals.
Despite these advances, mural classification remains challenging due to several factors. First, mural images often exhibit significant background noise and degradation artifacts resulting from environmental exposure and aging, which can dominate the feature representations learned by neural networks. Second, the artistic styles, color palettes, and compositional structures vary substantially across different historical periods and geographical regions, requiring models to capture style-specific characteristics. Third, the limited availability of annotated mural datasets poses challenges for training deep neural networks that typically require large-scale data. Fourth, murals contain rich frequency-domain information in the form of brushwork patterns, texture details, and periodic decorative motifs, which are not adequately captured by standard spatial-domain convolutional or attention operations. These challenges motivate the development of specialized architectures that can effectively handle the unique characteristics of mural images while maintaining generalizability to other fine-grained classification tasks.
2.5. Metric Learning and Contrastive Loss Functions
Metric learning aims to learn feature representations where similar samples are close together while dissimilar samples are far apart in the embedding space. This objective aligns naturally with fine-grained classification, where the goal is to learn discriminative features that separate visually similar categories. Center Loss [30] introduced the concept of learning class centers in the feature space and minimizing the distances between features and their corresponding class centers to enhance intra-class compactness. Contrastive Loss encourages features of the same class to be similar while pushing apart features from different classes, typically using sample pairs or triplets.
In fine-grained classification, the challenge lies in the fact that inter-class distances are naturally small due to visual similarity, while intra-class variations can be large due to factors such as pose, illumination, and viewpoint changes. Fixed-margin loss functions may not be optimal for this scenario, as they treat all class pairs equally without considering the varying difficulty of distinguishing between different category pairs. Adaptive margin methods have been proposed to address this issue. For instance, in face recognition, adaptive margin losses adjust the decision boundaries based on sample difficulty or class distributions. However, the application of adaptive margin contrastive learning to fine-grained visual classification, particularly in combination with transformer architectures, remains an area that requires further exploration.
Furthermore, most existing metric learning approaches focus solely on the feature extraction backbone without considering the integration of metric learning objectives with multi-head classification strategies or hierarchical feature fusion. In the context of ancient mural classification, where categories may exhibit hierarchical relationships (e.g., murals from adjacent dynasties may share certain stylistic elements), adaptive margin contrastive learning could provide additional benefits by learning fine-grained distinctions between closely related categories while maintaining clear separations between distinctly different ones.
2.6. Research Gaps and Motivations
Despite the significant progress in fine-grained classification, vision transformers, and cultural heritage image analysis, several research gaps remain that motivate our work. First, existing ViT-based fine-grained classification methods primarily operate in the spatial domain and do not explicitly capture frequency-domain characteristics, which are particularly important for analyzing textured and patterned images such as ancient murals. Second, while self-attention mechanisms in standard transformers model relationships between all tokens, they do not explicitly emphasize fine-grained pairwise relationships that could be beneficial for capturing compositional structures and recurring motifs. Third, most methods use only the final layer’s class token for classification, potentially discarding complementary information from intermediate layers that focus on different levels of visual abstraction. Fourth, existing token selection strategies either discard less important tokens entirely or aggregate them uniformly, missing the opportunity to maintain multiple parallel focuses on different discriminative aspects. Finally, the application of adaptive margin contrastive learning to transformer-based fine-grained classification, particularly for cultural heritage images, remains underexplored.
To provide a clearer overview of the research landscape and facilitate comparison with our proposed approach, Table 1 summarizes the key characteristics, strengths, and limitations of representative methods.
As illustrated in Table 1, existing methods typically address only a subset of the identified challenges. CNN-based methods such as B-CNN and RA-CNN excel at local feature extraction but face inherent limitations in modeling long-range dependencies. ViT-based methods, including TransFG and FFVT, leverage global self-attention but lack specialized mechanisms for frequency-domain analysis or explicit cross-token relation modeling. Notably, recent works such as the multi-scale feature hierarchy rethinking in DETR [28] and the auto-correlation attention mechanism in AMTrans [32] have demonstrated the value of hierarchical feature integration and explicit relational modeling in their respective domains, providing important conceptual foundations that inform our design choices.
Our proposed FRAM-ViT framework addresses these gaps comprehensively by integrating frequency-domain attention (FreqCA), cross-token relation modeling (CTRA), adaptive omni-focus classification (AOF), and adaptive margin contrastive learning (AMCCL). The FreqCA module draws inspiration from frequency-domain processing approaches to capture periodic patterns and textures. The CTRA mechanism shares philosophical similarities with AMTrans in explicitly modeling pairwise feature relationships, though adapted for visual token interactions in fine-grained classification. The CTI strategy aligns with the multi-scale hierarchy insights from recent DETR variants by leveraging complementary information across transformer layers. By combining these components, we aim to advance the state of the art in fine-grained classification while demonstrating particular effectiveness for the challenging task of ancient mural classification. The proposed approach is evaluated on both natural fine-grained datasets (CUB-200-2011 birds and Stanford Dogs) and a proprietary ancient mural dataset, ensuring comprehensive validation across different domains and visual characteristics.
3. Methodology
In this section, we present a comprehensive description of the proposed framework for fine-grained visual classification with emphasis on ancient mural recognition. We first provide an overview of the overall architecture, followed by detailed explanations of each proposed component: Frequency Channel Attention (FreqCA), Cross-Token Relation Attention (CTRA), Adaptive Omni-Focus (AOF) block, and Adaptive Margin Contrastive Center Loss (AMCCL). Finally, we describe the training strategy and integration of complementary information from multiple layers.
3.1. Overall Architecture
The proposed framework builds upon the Vision Transformer (ViT) architecture as the backbone, specifically utilizing the ViT-B/16 configuration with patch size $P = 16$. Given an input image $X \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width, respectively, the image is first resized to a fixed input resolution and then partitioned into $N = HW / P^2$ non-overlapping patches, where $P$ is the patch size. Each patch is linearly projected into a $D$-dimensional embedding space, where $D = 768$ for ViT-B/16.
The initial sequence of tokens $z_0$ is constructed by concatenating a learnable class token $x_{\mathrm{cls}}$ with the patch embeddings and adding positional encodings:
$z_0 = [x_{\mathrm{cls}};\, x_1 E;\, x_2 E;\, \ldots;\, x_N E] + E_{\mathrm{pos}},$
where $E \in \mathbb{R}^{(P^2 \cdot 3) \times D}$ denotes the patch embedding projection matrix and $E_{\mathrm{pos}} \in \mathbb{R}^{(N+1) \times D}$ represents the positional embeddings. The token sequence is then processed through $L$ transformer encoder layers, each incorporating multi-head self-attention (MSA) and feed-forward networks (FFNs).
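For concreteness, the following PyTorch sketch illustrates this tokenization step; the module name PatchEmbedding and the default image resolution of 224 pixels are illustrative assumptions rather than details taken from our implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, project them to D dimensions,
    prepend a learnable class token, and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying E
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)         # patch embeddings: (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, D)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # z_0: (B, N+1, D)
```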
Our architecture extends the standard ViT by integrating four key innovations: (1) FreqCA modules inserted at early layers to capture frequency-domain texture characteristics; (2) CTRA mechanisms integrated into selected transformer layers to model fine-grained token relationships; (3) an AOF block after the backbone to perform adaptive token selection and multi-head classification; and (4) the AMCCL loss function to enhance discriminative feature learning. The overall information flow is illustrated in Figure 1. In the following subsections, we elaborate on each component in detail.
To improve readability and provide a consistent reference for the mathematical notation used throughout this section, Table 2 summarizes all key symbols and their definitions.
3.2. Frequency Channel Attention (FreqCA)
Ancient murals exhibit distinctive characteristics in the frequency domain, including periodic brushwork patterns, repetitive decorative motifs, and texture granularity reflective of artistic techniques. Standard spatial-domain operations may not adequately capture these frequency-specific cues. To address this limitation, we propose the Frequency Channel Attention (FreqCA) module, which learns channel-wise importance in the frequency domain and applies the learned weights to both the real and imaginary components of the spectrum. A schematic illustration is provided in Figure 2.
Given feature maps $X \in \mathbb{R}^{H \times W \times C}$ (or an input image treated as a 3-channel feature map), where $H$, $W$, and $C$ denote the spatial dimensions and channel number, respectively, FreqCA first transforms $X$ into the frequency domain using the two-dimensional Discrete Fourier Transform (DFT), implemented by FFT in practice:
$F_c(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_c(h, w)\, e^{-j 2\pi \left( \frac{u h}{H} + \frac{v w}{W} \right)},$
where $u$ and $v$ are frequency-domain coordinates, $c$ indexes the channel dimension, and $j$ is the imaginary unit. We denote the complex spectrum as $F = R + jI$, with real part $R$ and imaginary part $I$.
To construct a compact channel descriptor, we compute pooling statistics on the magnitude spectrum $|F_c(u, v)|$ and define channel-wise average and max pooled descriptors as
$s_{\mathrm{avg}}(c) = \frac{1}{HW} \sum_{u, v} |F_c(u, v)|, \qquad s_{\mathrm{max}}(c) = \max_{u, v} |F_c(u, v)|.$
We concatenate them to obtain $s = [s_{\mathrm{avg}}; s_{\mathrm{max}}] \in \mathbb{R}^{2C}$, which is fed into a channel attention function implemented with two fully connected layers (reduction ratio $r$):
$w = \sigma\!\left( W_2\, \delta(W_1 s) \right),$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times 2C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are learnable parameters, $\delta$ denotes the ReLU activation, $\sigma$ is the sigmoid function, and $r = 16$ in our implementation.
The resulting weights $w \in \mathbb{R}^{C}$ are shared and applied channel-wise to modulate both the real and imaginary components:
$R' = w \odot R, \qquad I' = w \odot I,$
where $\odot$ denotes channel-wise multiplication and $w$ is broadcast to match the spatial dimensions. The modulated spectrum is then recombined and transformed back to the spatial domain via inverse FFT:
$X' = \Re\!\left( \mathrm{IFFT}(R' + jI') \right).$
Here, $\Re(\cdot)$ denotes taking the real part to obtain a real-valued output in practice.
By operating in the frequency domain, FreqCA enables the model to selectively emphasize channels that carry discriminative periodic patterns and textures. For computational efficiency, we implement the DFT using FFT algorithms with complexity $\mathcal{O}(HW \log(HW))$ per channel. FreqCA modules are inserted at layers 3, 6, and 9 of the transformer backbone, chosen empirically to capture frequency characteristics at multiple levels of abstraction while maintaining reasonable computational overhead.
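As an illustration, a minimal PyTorch sketch of FreqCA is given below, assuming the input is a C-channel spatial feature map (token sequences would first be reshaped to a spatial grid); the hidden-layer sizing and class name are illustrative rather than taken from our released code.

```python
import torch
import torch.nn as nn

class FreqCA(nn.Module):
    """Frequency Channel Attention: learn channel weights from the magnitude
    spectrum and use them to modulate both real and imaginary components."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        spec = torch.fft.fft2(x, norm="ortho")    # complex spectrum F = R + jI
        mag = spec.abs()                          # magnitude |F|
        s_avg = mag.mean(dim=(-2, -1))            # channel-wise average pooling: (B, C)
        s_max = mag.amax(dim=(-2, -1))            # channel-wise max pooling: (B, C)
        w = self.fc(torch.cat([s_avg, s_max], dim=1))        # attention weights: (B, C)
        w = w.view(w.size(0), w.size(1), 1, 1)               # broadcast over (H, W)
        spec = torch.complex(spec.real * w, spec.imag * w)   # modulate R and I
        return torch.fft.ifft2(spec, norm="ortho").real      # back to the spatial domain
```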
3.3. Cross-Token Relation Attention (CTRA)
To facilitate a clearer understanding of the proposed CTRA design, we provide a schematic illustration in Figure 3. The module adopts a two-branch structure: a global branch that compresses token information into channel context, and a local branch that preserves token-wise details. The two branches jointly generate multiplicative gates to refine the input token representations.
Let $Z \in \mathbb{R}^{N \times D}$ denote the input token sequence of a transformer layer, where $N$ is the number of tokens (including the class token) and $D$ is the embedding dimension. CTRA refines $Z$ by combining a global (channel-context) gate and a local (token-wise) gate.
Global branch (channel-context gating). We first aggregate token information using global average pooling (GAP) over the token dimension:
$g = \frac{1}{N} \sum_{n=1}^{N} Z_n \in \mathbb{R}^{D}.$
The aggregated descriptor $g$ is then transformed by a lightweight mapping $f_g(\cdot)$ (implemented as an MLP or, equivalently, a 1D convolution, as indicated in Figure 3) followed by a sigmoid gate to obtain channel-context weights:
$w_g = \sigma\!\left( f_g(g) \right) \in \mathbb{R}^{D},$
where $f_g(\cdot)$ denotes the learnable transformation and $\sigma$ is the sigmoid function.
Local branch (token-wise gating). In parallel, the local branch processes the token sequence directly to preserve token-wise details and generate a token-dependent gate:
$w_l = \sigma\!\left( f_l(Z) \right) \in \mathbb{R}^{N \times D},$
where $f_l(\cdot)$ is implemented as an MLP/Conv1d module, as shown in Figure 3.
Joint modulation. Finally, the two gates are applied multiplicatively to refine the input tokens:
$Z' = Z \odot w_l \odot \mathrm{broadcast}(w_g),$
where $\odot$ denotes element-wise multiplication and $\mathrm{broadcast}(\cdot)$ replicates $w_g$ along the token dimension to match the shape of $Z$. By jointly considering channel-context information from the global branch and token-wise details from the local branch, CTRA produces refined token representations that are subsequently used by the remaining components of the transformer block.
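A minimal PyTorch sketch of the two-branch gating is given below; the hidden width of the MLPs (controlled by hidden_ratio) is an illustrative assumption, since only the MLP/Conv1d structure is specified above.

```python
import torch.nn as nn

class CTRA(nn.Module):
    """Cross-Token Relation Attention: a global (channel-context) gate and a
    local (token-wise) gate jointly modulate the input token sequence."""
    def __init__(self, dim, hidden_ratio=0.25):
        super().__init__()
        hidden = max(int(dim * hidden_ratio), 1)
        def gate():
            return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, dim), nn.Sigmoid())
        self.global_mlp = gate()   # acts on the pooled descriptor g
        self.local_mlp = gate()    # acts on every token independently

    def forward(self, z):                               # z: (B, N, D), incl. class token
        g = z.mean(dim=1)                               # GAP over the token dimension
        w_global = self.global_mlp(g).unsqueeze(1)      # (B, 1, D), broadcast over tokens
        w_local = self.local_mlp(z)                     # (B, N, D), token-wise gate
        return z * w_local * w_global                   # jointly modulated tokens Z'
```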
3.4. Adaptive Omni-Focus (AOF) Block
Existing token selection strategies in vision transformers typically either discard less informative tokens entirely or aggregate all tokens uniformly. However, fine-grained classification benefits from maintaining multiple parallel focuses on different discriminative aspects simultaneously. The Adaptive Omni-Focus (AOF) block addresses this by dynamically partitioning tokens into multiple subsets and maintaining separate classification heads for each subset, enabling the model to attend to diverse discriminative regions concurrently.
Given the final token representations $Z^{(L)} \in \mathbb{R}^{(N+1) \times D}$ from the $L$-th transformer layer, AOF first computes token importance scores based on the class token's attention to patch tokens. Specifically, we extract the attention weights $a_i$ from the class token to all patch tokens in the final layer:
$a_i = \frac{1}{H} \sum_{h=1}^{H} A_h(0, i),$
where $A_h(0, i)$ denotes the attention weight from the class token (index 0) to the $i$-th patch token in head $h$, averaged across all $H$ attention heads. Based on these importance scores, we partition tokens into $K$ groups $\{G_1, \ldots, G_K\}$ using adaptive thresholding:
$G_k = \{\, i \mid \tau_{k-1} < a_i \le \tau_k \,\},$
where the thresholds $\tau_0 < \tau_1 < \cdots < \tau_K$ are determined by equally dividing the sorted attention weights into $K$ quantiles, with $\tau_0 = 0$ and $\tau_K = 1$. Each group $G_k$ corresponds to tokens with similar importance levels, capturing different aspects of the image, from background context in $G_1$ to highly discriminative regions in $G_K$.
For each group $G_k$, we aggregate the corresponding tokens and the class token into a group-specific representation:
$f_k = \frac{1}{|G_k| + 1} \left( z_{\mathrm{cls}}^{(L)} + \sum_{i \in G_k} z_i^{(L)} \right),$
where $z_{\mathrm{cls}}^{(L)}$ is the class token from layer $L$, $z_i^{(L)}$ denotes the $i$-th patch token, and $|G_k|$ is the cardinality of group $k$. Each group representation $f_k$ is then fed into a separate classification head:
$p_k = \mathrm{softmax}\!\left( W_k f_k + b_k \right),$
where $W_k \in \mathbb{R}^{M \times D}$ and $b_k \in \mathbb{R}^{M}$ are learnable parameters for the $k$-th classification head, and $M$ denotes the number of classes. The final prediction is obtained by weighted aggregation:
$p = \sum_{k=1}^{K} \alpha_k\, p_k,$
where $\alpha_1, \ldots, \alpha_K$ are aggregation weights corresponding to groups $G_1$ through $G_K$, ordered by ascending token importance. The rationale for weight assignment follows the principle that groups containing tokens of greater importance should contribute more to the final prediction. Through systematic grid search experiments on validation sets, testing weight configurations ranging from a uniform distribution to various asymmetric distributions, we empirically determined that an asymmetric configuration yields optimal performance for CUB-200-2011 and the mural dataset. This configuration assigns higher weights (0.8) to the groups containing the most discriminative tokens, which typically correspond to distinctive object parts such as bird beaks or mural motifs, while lower weights (0.2) are assigned to the groups that predominantly contain background or less informative regions. For Stanford Dogs, uniform weights performed best, likely because discriminative features in dog images are more uniformly distributed across different body regions rather than concentrated in specific parts.
By maintaining multiple classification heads focusing on different token subsets, AOF enables the model to capture diverse discriminative aspects simultaneously rather than committing to a single focus. This multi-focus strategy improves robustness, particularly when discriminative features are distributed across multiple image regions or when background variations are significant.
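The sketch below illustrates the inference-time grouping and multi-head aggregation in PyTorch; equal-size chunking of the sorted indices is used as a simple stand-in for the quantile-based thresholding, and the default group count and uniform weights are illustrative assumptions (per-head training losses are omitted for brevity).

```python
import torch
import torch.nn as nn

class AOF(nn.Module):
    """Adaptive Omni-Focus: group patch tokens by class-token attention and
    classify each group with its own head, then aggregate the predictions."""
    def __init__(self, dim, num_classes, num_groups=4, group_weights=None):
        super().__init__()
        self.num_groups = num_groups
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(num_groups)])
        w = torch.full((num_groups,), 1.0 / num_groups) if group_weights is None \
            else torch.tensor(group_weights, dtype=torch.float)
        self.register_buffer("alpha", w)   # aggregation weights alpha_k

    def forward(self, tokens, attn):       # tokens: (B, N+1, D); attn: (B, H, N+1, N+1)
        cls_tok, patches = tokens[:, 0], tokens[:, 1:]
        scores = attn.mean(dim=1)[:, 0, 1:]            # head-averaged class-token attention: (B, N)
        order = scores.argsort(dim=1)                  # token indices by ascending importance
        groups = order.chunk(self.num_groups, dim=1)   # approximate quantile partition
        out = 0.0
        for k, idx in enumerate(groups):
            idx = idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
            grp = patches.gather(1, idx)                           # tokens of group k
            feat = (grp.sum(dim=1) + cls_tok) / (grp.size(1) + 1)  # group representation f_k
            out = out + self.alpha[k] * self.heads[k](feat).softmax(dim=-1)
        return out                          # weighted sum of per-head predictions
```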
3.5. Complementary Tokens Integration (CTI)
Standard ViT uses only the class token from the final layer for classification, potentially discarding complementary information encoded in intermediate layers. Different transformer layers capture features at varying levels of abstraction: shallow layers encode low-level textures and colors, middle layers represent mid-level patterns, and deep layers capture high-level semantic structures. To leverage this multi-level complementary information, we propose Complementary Tokens Integration (CTI) that extracts class tokens from multiple layers for classification.
Specifically, we select class tokens from layers $l \in \{10, 11, 12\}$ for ViT-B/16, corresponding to the 10th, 11th, and 12th (final) transformer layers. For each selected layer $l$, we extract the class token $z_{\mathrm{cls}}^{(l)}$ and pass it through a dedicated classification head:
$p^{(l)} = \mathrm{softmax}\!\left( W^{(l)} z_{\mathrm{cls}}^{(l)} + b^{(l)} \right),$
where $W^{(l)}$ and $b^{(l)}$ are layer-specific classification parameters. During training, each classification head is supervised independently with cross-entropy loss:
$\mathcal{L}_{\mathrm{CTI}} = \sum_{l \in \{10, 11, 12\}} \lambda_l\, \mathcal{L}_{\mathrm{CE}}\!\left( p^{(l)}, y \right),$
where $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss, $y$ is the ground-truth label, and $\lambda_l$ are layer-specific weights. We set $\lambda_{10} < \lambda_{11} < \lambda_{12}$ to gradually increase the supervision strength for deeper layers, reflecting their greater semantic capacity.
During inference, predictions from multiple layers are aggregated through weighted averaging:
$\hat{p} = \frac{\sum_{l \in \{10, 11, 12\}} \lambda_l\, p^{(l)}}{\sum_{l \in \{10, 11, 12\}} \lambda_l}.$
This multi-layer integration enables the model to simultaneously leverage low-level textural details (from layer 10), mid-level pattern information (from layer 11), and high-level semantic features (from layer 12), providing a more comprehensive representation for fine-grained classification. The complementary nature of these multi-level features is particularly beneficial for ancient mural classification, where both low-level artistic techniques (brushwork textures) and high-level compositional structures contribute to dynasty-specific characteristics.
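A compact sketch of CTI is shown below; the default layer weights in the code are illustrative placeholders, since only their increasing trend with depth is specified above.

```python
import torch.nn as nn
import torch.nn.functional as F

class CTI(nn.Module):
    """Complementary Tokens Integration: one classification head per selected
    layer's class token, with weighted supervision and weighted averaging."""
    def __init__(self, dim, num_classes, layer_weights=(0.5, 0.75, 1.0)):
        super().__init__()                 # layer_weights are illustrative defaults
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in layer_weights])
        self.weights = layer_weights       # lambda_l, increasing with depth

    def forward(self, cls_tokens, labels=None):   # cls_tokens: list of (B, D), shallow -> deep
        logits = [head(t) for head, t in zip(self.heads, cls_tokens)]
        if labels is not None:             # training: weighted cross-entropy per layer
            loss = sum(w * F.cross_entropy(l, labels) for w, l in zip(self.weights, logits))
            return logits, loss
        probs = [l.softmax(dim=-1) for l in logits]   # inference: weighted averaging
        return sum(w * p for w, p in zip(self.weights, probs)) / sum(self.weights)
```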
3.6. Adaptive Margin Contrastive Center Loss (AMCCL)
To enhance the discriminative power of learned feature representations, we propose Adaptive Margin Contrastive Center Loss (AMCCL), which simultaneously encourages intra-class compactness and inter-class separability with adaptive margins. AMCCL combines the concepts of center loss and contrastive learning while introducing adaptive margins that adjust based on inter-class similarity.
Let $f_i \in \mathbb{R}^{D}$ denote the feature representation (class token from layer 12) for the $i$-th sample with label $y_i$, and let $c_j$ denote the center (mean feature) for class $j$. The center loss component encourages features to be close to their respective class centers:
$\mathcal{L}_{\mathrm{center}} = \frac{1}{B} \sum_{i=1}^{B} \left\| f_i - c_{y_i} \right\|_2^2,$
where $B$ denotes the batch size. Class centers are updated using an exponential moving average during training:
$c_j \leftarrow \mu\, c_j + (1 - \mu)\, \frac{1}{|\mathcal{B}_j|} \sum_{i \in \mathcal{B}_j} f_i,$
where $\mathcal{B}_j$ is the set of samples belonging to class $j$ in the current batch and $\mu$ is the momentum coefficient.
The contrastive component encourages inter-class separation by pushing apart features from different classes. We define an adaptive margin $m_{ij}$ between classes $i$ and $j$ based on their center similarity:
$m_{ij} = m_0 + \gamma \cdot \mathrm{sim}(c_i, c_j),$
where $m_0$ is the base margin, $\gamma$ is a scaling factor, and $\mathrm{sim}(c_i, c_j)$ measures cosine similarity between class centers. This adaptive mechanism assigns larger margins to similar classes that are harder to distinguish, while allowing smaller margins for clearly separable classes.
The contrastive loss with adaptive margins is formulated as follows:
$\mathcal{L}_{\mathrm{con}} = \frac{1}{B} \sum_{i=1}^{B} \sum_{j \neq y_i} \max\!\left( 0,\; m_{y_i j} + \left\| f_i - c_{y_i} \right\|_2^2 - \left\| f_i - c_j \right\|_2^2 \right),$
which encourages the distance from a sample to the centers of other classes to exceed the adaptive margin plus the distance to its own class center. The complete AMCCL is defined as follows:
$\mathcal{L}_{\mathrm{AMCCL}} = \mathcal{L}_{\mathrm{center}} + \lambda_{\mathrm{con}}\, \mathcal{L}_{\mathrm{con}},$
where $\lambda_{\mathrm{con}}$ balances the two components. For the mural classification task, we found that AMCCL with adaptive margins provides significant improvements over fixed-margin alternatives, as ancient mural categories often exhibit hierarchical similarities (e.g., murals from adjacent dynasties share certain stylistic elements) that benefit from flexible margin adjustments.
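The following sketch implements AMCCL as described above; the scaling factor gamma and the balancing weight lambda_con are illustrative defaults, while the base margin and momentum follow the values reported in Section 4.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMCCL(nn.Module):
    """Adaptive Margin Contrastive Center Loss: center loss plus a contrastive
    term whose margin grows with the cosine similarity between class centers."""
    def __init__(self, num_classes, dim, base_margin=0.5, gamma=0.5,
                 momentum=0.9, lambda_con=1.0):
        super().__init__()
        self.register_buffer("centers", torch.zeros(num_classes, dim))
        self.m0, self.gamma = base_margin, gamma          # gamma is an illustrative default
        self.mu, self.lambda_con = momentum, lambda_con   # lambda_con is an illustrative default

    @torch.no_grad()
    def update_centers(self, feats, labels):
        for j in labels.unique():                         # EMA update of class centers
            self.centers[j] = self.mu * self.centers[j] + \
                              (1 - self.mu) * feats[labels == j].mean(dim=0)

    def forward(self, feats, labels):                     # feats: (B, D)
        c_own = self.centers[labels]                      # centers of each sample's class
        center_loss = (feats - c_own).pow(2).sum(dim=1).mean()
        sim = F.normalize(c_own, dim=1) @ F.normalize(self.centers, dim=1).t()  # (B, C)
        margins = self.m0 + self.gamma * sim              # adaptive margins m_{y_i, j}
        d_all = torch.cdist(feats, self.centers).pow(2)   # squared distances to all centers
        d_own = d_all.gather(1, labels.unsqueeze(1))      # distance to own center
        hinge = F.relu(margins + d_own - d_all)           # push other centers beyond the margin
        mask = F.one_hot(labels, self.centers.size(0)).bool()
        con_loss = hinge.masked_fill(mask, 0).sum(dim=1).mean()
        return center_loss + self.lambda_con * con_loss
```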
3.7. Overall Training Objective
The complete training objective combines multiple loss functions to supervise different aspects of the model:
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CTI}} + \sum_{k=1}^{K} \beta_k\, \mathcal{L}_{\mathrm{AOF}}^{(k)} + \lambda_{\mathrm{AMCCL}}\, \mathcal{L}_{\mathrm{AMCCL}},$
where the first term supervises multi-layer classification through CTI, the second term supervises the $K$ classification heads in the AOF block with weights $\beta_k$, and the third term applies AMCCL for discriminative feature learning with weight $\lambda_{\mathrm{AMCCL}} = 0.05$ for the mural dataset (determined empirically through validation). The relatively small weight for AMCCL ensures that it provides a regularization effect without dominating the optimization process.
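For clarity, the combination of the three terms can be written as a small helper; the per-head weights beta_k default to uniform here purely for illustration.

```python
def total_loss(cti_loss, aof_losses, amccl_loss, aof_weights=None, lambda_amccl=0.05):
    """Combine CTI supervision, per-head AOF losses, and AMCCL regularization."""
    if aof_weights is None:                              # uniform beta_k by default
        aof_weights = [1.0 / len(aof_losses)] * len(aof_losses)
    aof_term = sum(b * l for b, l in zip(aof_weights, aof_losses))
    return cti_loss + aof_term + lambda_amccl * amccl_loss
```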
4. Experiments
In this section, we present comprehensive experiments to evaluate the effectiveness of the proposed framework. We first describe the datasets and implementation details, then we compare our method with state-of-the-art approaches on three fine-grained classification benchmarks. Finally, we conduct ablation studies to analyze the contribution of each proposed component, namely Frequency Channel Attention (FreqCA), Cross-Token Relation Attention (CTRA), and Adaptive Margin Contrastive Center Loss (AMCCL).
4.1. Datasets
We evaluate our method on three fine-grained visual classification datasets covering both natural object recognition and cultural heritage image analysis.
CUB-200-2011 [1] is a widely used benchmark for fine-grained bird classification, containing 11,788 images from 200 bird species. The dataset is split into 5994 images for training and 5794 images for testing. This dataset presents significant challenges due to large intra-class variations caused by different poses, viewpoints, and backgrounds, while inter-class differences are often subtle and localized to specific body parts such as beaks, wings, and tail feathers.
Stanford Dogs [2] consists of 20,580 images covering 120 dog breeds, with 12,000 images for training and 8580 images for testing. The dataset is challenging because different dog breeds often share similar body structures, and discriminative features are primarily found in facial characteristics, fur patterns, and body proportions.
Ancient Mural Dataset is a proprietary dataset collected from the Dunhuang Grottoes, containing 1812 mural images spanning multiple historical dynasties. The images were acquired using standardized photographic protocols with controlled shooting angles and distances to mitigate severe geometric distortions caused by uneven wall surfaces and cavities. Nevertheless, residual perspective variations and minor surface irregularities inherent to in situ mural photography remain present in the data. Ancient mural classification presents unique challenges including complex background noise, severe degradation artifacts due to environmental exposure and aging, geometric variations from curved or irregular wall surfaces, and subtle stylistic variations across different historical periods [3]. The artistic techniques, color palettes, and compositional structures vary substantially across dynasties, requiring models to capture both low-level visual features and high-level semantic patterns.
4.2. Implementation Details
Our framework is implemented using PyTorch (version 1.8.0) and trained on NVIDIA RTX GPUs. We adopt ACC-ViT [33] with ViT-B/16 [13] pretrained on ImageNet-21k as the backbone. During training, input images are resized to a larger resolution and then randomly cropped to the final resolution fed into the transformer; standard data-augmentation techniques, including random horizontal flipping, are applied.
We use the SGD optimizer with momentum 0.9 and weight decay regularization, and the initial learning rate is decayed with a cosine annealing schedule. The batch size is set to 10 and the model is trained for 60 epochs. For the FreqCA module, the reduction ratio $r$ is set to 16, and the module is inserted at layers 3, 6, and 9 of the transformer backbone. For CTRA, the scaling parameter is initialized to 0.1, and the mechanism is integrated into layers 6, 9, and 12. For AMCCL, the loss weight $\lambda_{\mathrm{AMCCL}}$ is set to 0.05, the base margin $m_0$ to 0.5, and the momentum coefficient $\mu$ for center updates to 0.9. The aggregation weights $\alpha_k$ for the multi-head classification follow the asymmetric configuration described in Section 3.4 for CUB-200-2011 and the mural dataset, while uniform weights are used for Stanford Dogs.
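For reference, a minimal training-loop sketch consistent with this setup is shown below; the learning-rate and weight-decay values are illustrative placeholders rather than the exact settings used in our experiments, and the model is assumed to return the combined objective of Section 3.7.

```python
import torch

def train(model, train_loader, epochs=60, lr=3e-2, weight_decay=5e-4):
    """SGD with momentum and a cosine annealing schedule; lr and weight_decay
    are illustrative placeholders, not the paper's exact settings."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = model(images, labels)   # assumed to return the combined training objective
            loss.backward()
            optimizer.step()
        scheduler.step()
```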
Hyperparameter Selection Process. All hyperparameters were determined through systematic empirical experiments on validation sets. For each hyperparameter, we conducted a grid search over a predefined range of candidate values and selected the configuration that yielded the highest validation accuracy. Specifically, for the AMCCL loss weight $\lambda_{\mathrm{AMCCL}}$, we tested several candidate values and found 0.05 to provide the best balance between discriminative supervision and classification loss. For the base margin $m_0$, several candidates were evaluated, with 0.5 achieving optimal inter-class separation. The FreqCA reduction ratio $r$ was selected from a set of candidate values, where $r = 16$ offered an effective trade-off between model capacity and computational efficiency. For the AOF aggregation weights $\alpha_k$, we experimented with uniform distributions, linearly increasing weights, and various asymmetric configurations, ultimately determining that the asymmetric configuration described in Section 3.4 maximizes performance on datasets where discriminative information is concentrated in specific image regions.
Table 3 summarizes the sensitivity analysis for key hyperparameters on the mural dataset.
Computational Cost Analysis. To evaluate the practical efficiency of the proposed modules, we measure the computational overhead on the ancient mural dataset using an NVIDIA RTX 3090 GPU.
Table 4 presents the training time, inference time, GPU memory consumption, parameter count, and FLOPs when progressively adding each module to the baseline. The results show that FreqCA introduces the largest overhead (+8.2%) due to FFT operations, while CTRA and AOF add minimal costs. AMCCL only affects training and introduces no inference overhead. The total additional overhead is approximately 15% in training time, with only a 2.5% increase in parameters (from 86.4 M to 88.6 M) and a 3.4% increase in FLOPs (from 17.6 G to 18.2 G), which we consider acceptable given the accuracy improvement of 2.20% on the mural dataset.
4.3. Comparison with State-of-the-Art Methods
We compare our method with both CNN-based and ViT-based fine-grained classification approaches.
Table 5 presents the classification accuracy on the CUB-200-2011 and Stanford Dogs datasets.
As shown in Table 5, our method achieves the best performance on both datasets. On CUB-200-2011, our approach attains 91.15% accuracy, outperforming the baseline ACC-ViT [33] by 1.35%. On Stanford Dogs, our method achieves 94.57% accuracy, surpassing ACC-ViT by 1.63%. Compared with CNN-based methods, our approach demonstrates substantial improvements, exceeding the best CNN-based method API-Net [35] by 5.15% on CUB-200-2011.
The consistent improvements over the ACC-ViT baseline can be attributed to the three proposed components. The FreqCA module captures frequency-domain texture patterns that complement the spatial-domain features learned by the baseline, which is particularly beneficial for distinguishing fine-grained features such as feather patterns in birds and fur textures in dogs. The CTRA mechanism explicitly models cross-token relationships beyond standard self-attention, enabling the network to establish stronger connections between semantically related image regions. The AMCCL loss function enhances discriminative feature learning by adaptively adjusting inter-class margins based on category similarity, which is crucial for separating visually similar fine-grained categories.
4.4. Results on Ancient Mural Dataset
Table 6 presents the classification results on the ancient mural dataset, comparing our method with the ACC-ViT baseline.
Our method achieves 94.27% accuracy on the ancient mural dataset, outperforming the ACC-ViT baseline by 2.20%. The improvement on the mural dataset is more pronounced compared to the natural image datasets, demonstrating that the proposed components are particularly effective for cultural heritage image analysis.
The superior performance on the mural dataset can be attributed to the FreqCA module’s ability to capture periodic brushwork patterns and texture granularity that reflect dynasty-specific artistic techniques. Ancient murals from different dynasties exhibit distinctive frequency-domain characteristics in terms of line density, color gradients, and decorative motifs. Standard spatial-domain attention mechanisms in the baseline may not adequately capture these frequency-specific patterns, whereas our FreqCA module explicitly models channel-wise importance in the frequency domain, enabling selective emphasis on discriminative frequency components.
Furthermore, the CTRA mechanism facilitates the learning of compositional structures that are characteristic of different dynasty styles. Ancient murals often contain recurring motifs and compositional arrangements that span different image regions. By explicitly modeling pairwise relationships between tokens, CTRA enables the network to establish connections between these related visual elements regardless of their spatial proximity, providing a more holistic understanding of artistic styles.
4.5. Ablation Studies
To analyze the contribution of each proposed component, we conduct comprehensive ablation studies on all three datasets.
Table 7 presents the results when progressively adding FreqCA, CTRA, and AMCCL to the ACC-ViT baseline.
Effect of FreqCA. Adding the FreqCA module to the baseline yields improvements of 0.38% on CUB-200-2011, 0.41% on Stanford Dogs, and 0.61% on the mural dataset. The largest improvement is observed on the mural dataset, validating our hypothesis that frequency-domain attention is particularly effective for capturing periodic patterns and textures in artistic images. The FreqCA module transforms spatial features into the frequency domain using Discrete Fourier Transform and applies adaptive channel-wise attention to selectively emphasize channels carrying discriminative frequency components. This mechanism is especially beneficial for ancient murals, as brushwork patterns and texture granularity contain important stylistic information that varies across dynasties.
Effect of CTRA. The CTRA mechanism contributes 0.44% improvement on CUB-200-2011, 0.48% on Stanford Dogs, and 0.71% on the mural dataset when added to the baseline. By constructing a relation-aware feature space where tokens corresponding to related visual elements can establish stronger connections, CTRA facilitates the learning of compositional structures and global patterns beyond standard self-attention. The enhanced relation matrix, computed through learnable transformations of pairwise cosine similarities, provides a relation prior that augments the standard query-key attention computation. This is particularly beneficial for mural classification, where compositional arrangements and recurring motifs are important discriminative cues for distinguishing different dynasty styles.
Effect of AMCCL. The AMCCL loss function provides improvements of 0.32% on CUB-200-2011, 0.34% on Stanford Dogs, and 0.48% on the mural dataset. By combining center loss with adaptive margin contrastive learning, AMCCL simultaneously encourages intra-class compactness and inter-class separability. The adaptive margin mechanism assigns larger margins to similar classes that are harder to distinguish, while allowing smaller margins for clearly separable classes. This flexibility is particularly beneficial for the mural dataset, where categories from adjacent dynasties often share certain stylistic elements due to cultural continuity, requiring the model to learn fine-grained distinctions with appropriate inter-class boundaries.
Synergistic Effects. When combining all three components, the total improvement reaches 1.35% on CUB-200-2011, 1.63% on Stanford Dogs, and 2.20% on the mural dataset. Notably, the combined improvement exceeds the sum of individual improvements, indicating synergistic effects among the proposed components. FreqCA and CTRA capture complementary aspects of discriminative information—frequency-domain textures and cross-token relationships, respectively—while AMCCL provides enhanced supervision that encourages the learning of more discriminative features in the representation space defined by FreqCA and CTRA. The most significant synergy is observed on the mural dataset, where the combination of frequency-domain analysis, cross-token relation modeling, and adaptive margin learning addresses the unique challenges of cultural heritage image classification.
4.6. Visualization Analysis
To qualitatively inspect the model’s behavior, we provide both attention heatmap overlays and a feature-space visualization based on t-SNE.
Attention heatmaps. We report representative attention heatmaps on three datasets, i.e., CUB-200-2011, Stanford Dogs, and the ancient mural dataset. As shown in Figure 4, each example contains the original image and the corresponding attention overlay.
For the ancient mural dataset, the attention heatmaps (Figure 4e,f) reveal that the model effectively identifies discriminative regions at multiple semantic levels. At the global scale, the attention concentrates on semantically significant areas such as central figure compositions, decorative borders, and background patterns that are characteristic of different dynasties. At a finer granularity, the heatmaps highlight regions containing brushwork textures, color transitions, and periodic decorative motifs—precisely the frequency-domain features that our FreqCA module is designed to capture. Furthermore, the attention distribution demonstrates that the CTRA mechanism successfully establishes connections between spatially distributed but semantically related elements, such as recurring artistic motifs across different image regions. These visualization results confirm that the proposed framework captures both holistic compositional structures and subtle stylistic details essential for accurate dynasty-based mural classification.
t-SNE feature-space visualization. To further examine the structure of the learned representations, we visualize high-dimensional image embeddings using t-SNE by projecting them into a 2D space. It should be noted that this visualization depicts the distribution of learned feature vectors in an embedding space; the original mural images remain unaltered throughout this process. The purpose of t-SNE visualization is to reveal how the model internally organizes samples based on discriminative features learned during training.
In Figure 5, samples are colored by their class indices (as indicated by the color bar), and samples from the same class tend to form localized groups. On CUB-200-2011 and Stanford Dogs, many class groups appear relatively compact and are separated by visible gaps, while a subset of groups remain close or partially interleaved, suggesting that these categories share highly similar visual patterns. For the ancient mural dataset, several groups show broader spread and partial overlap, which may reflect stronger intra-class variability and stylistic proximity among certain categories.
To demonstrate the effectiveness of our proposed components, we compare the feature-space organization between the ACC-ViT baseline and our full model. Given identical input images, our method produces notably tighter intra-class clusters and more distinct inter-class boundaries, confirming that the proposed FreqCA, CTRA, and AMCCL modules enhance the discriminative quality of learned representations. We emphasize that t-SNE primarily preserves local neighborhood relations; therefore, the visualization serves as a qualitative diagnostic of feature-space quality rather than a metric-faithful measure of global inter-class distances.
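For completeness, a minimal sketch of the projection used for this qualitative analysis is given below; the perplexity value and colormap are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, perplexity=30):
    """Project high-dimensional embeddings (e.g., layer-12 class tokens) to 2D
    and color the points by class index for qualitative inspection."""
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca",
               random_state=0).fit_transform(np.asarray(features))
    sc = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=8)
    plt.colorbar(sc, label="class index")
    plt.show()
```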
5. Discussion
The experimental results presented in the previous section demonstrate the effectiveness of our proposed framework for fine-grained visual classification, particularly for the challenging task of ancient mural classification. In this section, we discuss the implications of our findings, analyze the behavior of the proposed components, and acknowledge the limitations of our approach.
5.1. Analysis of Frequency-Domain Attention
The FreqCA module consistently improves classification performance across all three datasets, with the most significant gains observed on the ancient mural dataset (0.61% improvement). This observation aligns with our hypothesis that frequency-domain features are particularly informative for images containing rich textural patterns and periodic structures. Ancient murals exhibit distinctive frequency characteristics that reflect dynasty-specific artistic techniques, including brushwork density, decorative motif periodicity, and texture granularity. Unlike natural images in which spatial features often dominate, mural images require explicit modeling of frequency-domain information to capture these subtle stylistic variations.
The relatively smaller improvements on CUB-200-2011 (0.38%) and Stanford Dogs (0.41%) suggest that frequency-domain attention provides complementary rather than dominant benefits for natural object classification. Birds and dogs possess discriminative features that are more readily captured by spatial-domain attention mechanisms, such as distinctive body part shapes and color patterns. Nevertheless, the consistent positive contribution of FreqCA across diverse datasets validates its general applicability beyond cultural heritage images.
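As a simplified illustration of the general idea behind frequency-channel attention (not the exact FreqCA implementation), the sketch below reshapes patch tokens to a 2D grid, measures per-channel spectral energy with a real 2D FFT, and uses that energy to gate the channels; the module name, gating design, and hyperparameters are illustrative.

```python
# Illustrative sketch of frequency-channel attention (a simplified stand-in for
# FreqCA): patch tokens are reshaped to a 2D grid, transformed with a real 2D FFT,
# and the per-channel spectral energy is used to reweight channels.
import torch
import torch.nn as nn

class FrequencyChannelAttention(nn.Module):
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, grid_size: int) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens (class token excluded), N = grid_size ** 2
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, grid_size, grid_size)

        # Real 2D FFT over the spatial grid; the magnitude reflects texture periodicity.
        spec = torch.fft.rfft2(x, norm="ortho")       # (B, C, H, W//2+1), complex
        energy = spec.abs().mean(dim=(-2, -1))        # (B, C) per-channel spectral energy

        # A channel gate derived from spectral energy reweights the original tokens.
        weights = self.gate(energy).unsqueeze(1)      # (B, 1, C)
        return tokens * weights
```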
5.2. Effectiveness of Cross-Token Relation Modeling
The CTRA mechanism demonstrates strong performance improvements, particularly on the mural dataset (0.71%). This finding supports our design motivation that fine-grained classification benefits from explicit modeling of pairwise token relationships beyond standard self-attention. While self-attention in vision transformers captures global dependencies, it treats all token relationships uniformly without emphasizing semantically meaningful connections.
For ancient murals, compositional structures and recurring motifs often span multiple image regions, requiring the model to establish connections between spatially distant but semantically related elements. The relation-aware attention in CTRA facilitates this by computing pairwise similarity scores and incorporating them as learnable priors in the attention computation. The enhanced performance on mural classification validates that cross-token relation modeling effectively captures the holistic structural patterns characteristic of different dynasty styles.
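The following sketch illustrates one way such a relational prior can be injected into standard multi-head self-attention. It is a simplified stand-in for CTRA: the cosine-similarity prior and its learnable per-head scale are chosen for illustration rather than taken verbatim from our implementation.

```python
# Illustrative sketch of relation-aware attention in the spirit of CTRA: pairwise
# cosine similarities between tokens are added, through a learnable per-head scale,
# to the standard scaled dot-product attention logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable weights controlling the strength of the relational prior per head.
        self.relation_scale = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B, heads, N, head_dim)

        # Standard attention logits plus a pairwise cosine-similarity prior.
        logits = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        rel = F.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)  # (B, N, N)
        logits = logits + self.relation_scale.view(1, -1, 1, 1) * rel.unsqueeze(1)

        out = logits.softmax(dim=-1) @ v                 # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```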
On natural image datasets, CTRA provides moderate improvements by helping the model establish connections between complementary discriminative parts, such as bird head and tail patterns or dog facial features and body proportions. The consistent benefits across datasets demonstrate that explicit relation modeling is a generally useful inductive bias for fine-grained recognition.
5.3. Role of Adaptive Margin Learning
The AMCCL loss function contributes meaningful improvements across all datasets, with the largest gain on the mural dataset (0.48%). The adaptive margin mechanism is particularly beneficial when dealing with categories that exhibit hierarchical similarity structures. Ancient murals from adjacent dynasties often share certain stylistic elements due to cultural continuity, while maintaining distinct characteristics that enable classification. Fixed-margin losses treat all class pairs equally and may not optimally handle such varying inter-class similarities.
By dynamically adjusting margins based on class center similarities, AMCCL assigns larger separation boundaries to confusable category pairs while allowing smaller margins for clearly distinguishable ones. This flexibility enables more efficient use of the feature space and encourages the model to focus its learning capacity on difficult distinctions. The feature distribution visualizations confirm that AMCCL produces tighter intra-class clustering and clearer inter-class separation compared to standard cross-entropy training.
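One plausible instantiation of this idea is sketched below (the exact AMCCL formulation is given in the method section; the margin schedule and names here are illustrative): margins between class pairs grow with the cosine similarity of their learned centers, so confusable classes are pushed farther apart while easily separable pairs keep a smaller margin.

```python
# Illustrative adaptive-margin contrastive center loss (a simplified stand-in for
# AMCCL): features are pulled toward their own class center and pushed away from
# other centers by a margin that scales with inter-center similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMarginCenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int, base_margin: float = 0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.base_margin = base_margin

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Pull each feature toward its own class center.
        own_centers = self.centers[labels]                        # (B, D)
        pull = (feats - own_centers).pow(2).sum(dim=1).mean()

        # Adaptive margins: more similar center pairs receive a larger margin.
        sim = F.cosine_similarity(
            self.centers.unsqueeze(1), self.centers.unsqueeze(0), dim=-1
        )                                                         # (K, K)
        margins = self.base_margin * (1.0 + sim[labels])          # (B, K)

        # Push features away from all other class centers by at least the margin.
        dists = torch.cdist(feats, self.centers)                  # (B, K)
        mask = F.one_hot(labels, self.centers.size(0)).bool()
        push = F.relu(margins - dists).masked_fill(mask, 0.0).mean()

        return pull + push
```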
5.4. Synergistic Effects and Component Interactions
An important finding from our ablation studies is that the combined improvement from all three components exceeds the sum of individual contributions, particularly on the mural dataset. This synergistic effect suggests that FreqCA, CTRA, and AMCCL capture complementary aspects of discriminative information and interact beneficially during training.
FreqCA enriches the feature representation with frequency-domain information, providing additional texture and pattern cues that complement spatial features. CTRA leverages these enriched features to establish meaningful cross-token relationships, enabling holistic structural understanding. AMCCL then provides enhanced supervision that encourages discriminative learning in this augmented feature space. The combination addresses multiple challenges in fine-grained classification simultaneously: texture analysis, structural understanding, and discriminative representation learning.
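To make this interaction concrete, the schematic sketch below composes the illustrative modules from the preceding sketches into a single head on top of the ViT backbone. It is a structural illustration only, not our full training pipeline, and the pooling and loss weighting are deliberately simplified.

```python
# Schematic composition of the illustrative modules defined in the earlier sketches:
# frequency-enriched tokens feed the relation-aware attention, and the pooled
# representation is supervised jointly by cross-entropy and the adaptive-margin loss.
import torch
import torch.nn as nn

class FineGrainedHead(nn.Module):
    def __init__(self, dim: int, num_classes: int, grid_size: int):
        super().__init__()
        self.grid_size = grid_size
        self.freq_attn = FrequencyChannelAttention(dim)           # illustrative FreqCA stand-in
        self.rel_attn = RelationAwareAttention(dim)               # illustrative CTRA stand-in
        self.classifier = nn.Linear(dim, num_classes)
        self.amccl = AdaptiveMarginCenterLoss(num_classes, dim)   # illustrative AMCCL stand-in

    def forward(self, tokens: torch.Tensor, labels: torch.Tensor):
        # tokens: (B, N, C) patch tokens from the ViT backbone.
        x = self.freq_attn(tokens, self.grid_size)   # enrich tokens with spectral cues
        x = self.rel_attn(x)                         # model cross-token relations
        feats = x.mean(dim=1)                        # pooled representation
        logits = self.classifier(feats)
        loss = nn.functional.cross_entropy(logits, labels) + self.amccl(feats, labels)
        return logits, loss
```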
5.5. Limitations and Future Directions
Despite the promising results, our approach has several limitations that warrant future investigation. First, the FreqCA module introduces additional computational overhead due to Fourier transform operations, which may limit applicability in resource-constrained scenarios. Developing more efficient frequency-domain attention mechanisms could address this limitation.
Second, our evaluation on ancient murals is limited to a single dataset from Dunhuang Grottoes. Validating the approach on mural collections from different geographical regions and artistic traditions would strengthen the generalizability claims for cultural heritage applications.
Third, the current framework focuses on image-level classification without explicit localization of discriminative regions. Extending the approach to provide interpretable attention maps that highlight stylistically significant elements could enhance its utility for art historical analysis and cultural heritage preservation.
Fourth, while our comparative experiments include representative CNN-based and ViT-based methods, the rapid advancement of vision transformers necessitates continuous benchmarking against the latest state-of-the-art approaches. We plan to extend our comparisons to include emerging methods from 2025 and beyond as they become available, ensuring comprehensive and up-to-date experimental validation.
Fifth, ancient murals are often painted on uneven surfaces such as curved walls and cavities, introducing geometric distortions that pose additional challenges for automated analysis. While our current framework does not include explicit geometric rectification, the proposed components provide implicit robustness to moderate geometric variations: the FreqCA module operates in the frequency domain where global texture periodicity is partially preserved despite local spatial deformations; the CTRA mechanism models semantic relationships between tokens that are inherently less sensitive to geometric distortions than pixel-level correspondences; and standard data augmentation techniques (random cropping and resizing) during training enhance tolerance to perspective variations. Nevertheless, developing explicit geometric normalization or distortion-aware attention mechanisms remains an important direction for improving robustness in challenging real-world scenarios where murals exhibit significant surface curvature or structural damage.
Future work could explore the integration of domain-specific knowledge, such as iconographic databases or art historical annotations, to further improve classification accuracy and provide more interpretable results. Additionally, investigating the transferability of learned representations across different fine-grained domains represents a promising research direction.
6. Conclusions
This paper presents a novel vision transformer-based framework for fine-grained visual classification with particular emphasis on ancient mural recognition. We introduce three key innovations that address the fundamental limitations of existing approaches: Frequency Channel Attention (FreqCA) for capturing frequency-domain texture characteristics, Cross-Token Relation Attention (CTRA) for modeling fine-grained pairwise relationships between image regions, and Adaptive Margin Contrastive Center Loss (AMCCL) for enhancing discriminative feature learning with flexible inter-class boundaries.
Comprehensive experiments on CUB-200-2011, Stanford Dogs, and a proprietary ancient mural dataset validate the effectiveness of our approach. Our method achieves 91.15% accuracy on CUB-200-2011, 94.57% on Stanford Dogs, and 94.27% on the mural dataset, consistently outperforming the ACC-ViT baseline and other state-of-the-art methods. Ablation studies demonstrate that each proposed component contributes positively to the overall performance, with synergistic effects observed when combining all components. Specifically, the synergy arises because FreqCA enriches feature representations with frequency-domain texture information (e.g., brushwork periodicity and decorative patterns), CTRA then leverages these enriched features to establish meaningful cross-token semantic relationships that capture compositional structures, and AMCCL provides discriminative supervision that encourages learning in this augmented feature space. As evidence, the combined improvement on the mural dataset (2.20%) exceeds the sum of individual component contributions (0.61% + 0.71% + 0.48% = 1.80%), confirming that these modules complement each other rather than operating independently.
The proposed framework is particularly effective for ancient mural classification, where frequency-domain patterns, compositional structures, and hierarchical category similarities present unique challenges. The improvements in this cultural heritage task demonstrate the potential of advanced computer vision techniques for art historical research and cultural preservation applications. From a practical perspective, our framework can aid conservation decision-making in several ways: (1) automatic dynasty classification enables professionals to prioritize restoration efforts based on historical significance and allocate resources accordingly; (2) accurate style identification assists in selecting historically appropriate materials and techniques for intervention, ensuring that restoration work respects the original artistic tradition; (3) the attention visualization highlights dynasty-specific artistic features, providing interpretable evidence that supports expert judgment in authentication and provenance research; and (4) systematic classification of large mural collections facilitates comprehensive documentation of stylistic evolution across historical periods, contributing to broader cultural heritage databases.
Future work will focus on improving computational efficiency, extending evaluation to diverse mural collections, and incorporating domain-specific knowledge to enhance both accuracy and interpretability. We believe that the proposed techniques provide a solid foundation for advancing fine-grained visual classification in both natural image and cultural heritage domains.