Article

FRAM-ViT: Frequency-Aware and Relation-Enhanced Vision Transformer with Adaptive Margin Contrastive Center Loss for Fine-Grained Classification of Ancient Murals

1 School of Computer Science, Northwest University, Xi’an 710127, China
2 School of Electronic Information, Northwest University, Xi’an 710127, China
3 School of Computer Science and Technology, Henan Institute of Technology, Xinxiang 453000, China
4 School of Art, Northwest University, Xi’an 710127, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(2), 488; https://doi.org/10.3390/electronics15020488
Submission received: 17 December 2025 / Revised: 13 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026

Abstract

Fine-grained visual classification requires recognizing subtle inter-class differences under substantial intra-class variation. Ancient mural recognition poses additional challenges: severe degradation and complex backgrounds introduce noise that obscures discriminative features, limited annotated data restricts model training, and dynasty-specific artistic styles manifest as periodic brushwork patterns and compositional structures that are difficult to capture. Existing spatial-domain methods fail to model the frequency characteristics of textures and the cross-region semantic relationships inherent in mural imagery. To address these limitations, we propose a Vision Transformer (ViT) framework which integrates frequency-domain enhancement, explicit token-relation modeling, adaptive multi-focus inference, and discriminative metric supervision. A Frequency Channel Attention (FreqCA) module applies 2D FFT-based channel gating to emphasize discriminative periodic patterns and textures. A Cross-Token Relation Attention (CTRA) module employs joint global and local gates to strengthen semantically related token interactions across distant regions. An Adaptive Omni-Focus (AOF) block partitions tokens into importance groups for multi-head classification, while Complementary Tokens Integration (CTI) fuses class tokens from multiple transformer layers. Finally, Adaptive Margin Contrastive Center Loss (AMCCL) improves intra-class compactness and inter-class separability with margins adapted to class-center similarities. Experiments on CUB-200-2011, Stanford Dogs, and a Dunhuang mural dataset show accuracies of 91.15%, 94.57%, and 94.27%, outperforming the ACC-ViT baseline by 1.35%, 1.63%, and 2.20%, respectively.

1. Introduction

Fine-grained visual classification (FGVC) aims to distinguish between visually similar subordinate categories within the same superordinate class, requiring models to capture subtle inter-class differences while accommodating significant intra-class variations. This task finds critical applications across diverse domains, including biological species identification [1,2], art heritage preservation [3,4], and medical diagnosis. The inherent difficulty of FGVC stems from the fact that discriminative regions are often localized and subtle.
Ancient mural classification represents a domain where fine-grained classification presents unique challenges. First, mural images often contain complex backgrounds with significant degradation due to aging and environmental factors, introducing considerable noise that interferes with feature extraction [5]. Second, artistic styles vary substantially across different historical periods, requiring models to capture both low-level visual features and high-level semantic information [3]. Third, ancient mural datasets are typically limited in scale due to the scarcity of these cultural artifacts.
Traditional approaches relied on hand-crafted features such as SIFT descriptors [6], color histograms, and LBP textures. While achieving moderate success, they suffer from the inability to bridge the semantic gap between low-level visual features and high-level categorical concepts. The advent of deep learning has revolutionized image recognition through automatic hierarchical feature learning [7]. These methods can be broadly categorized into supervised and semi-supervised/unsupervised paradigms. Supervised approaches, such as Bilinear CNN [8], Recurrent Attention CNN [9], NTS-Net [10], and data-driven neural operators [11], learn discriminative representations directly from labeled data, achieving strong performance when sufficient annotations are available. Semi-supervised and unsupervised methods, including domain adaptation techniques and generative adversarial networks [12], address data scarcity by leveraging unlabeled samples or synthesizing augmented data. For mural classification, Cao et al. [3] proposed a multichannel separable network fusing high-level features with low-level descriptors. While these paradigms have advanced solutions for degradation and limited annotations, they primarily operate in the spatial domain without explicitly modeling frequency characteristics or fine-grained token relationships. Our supervised framework complements existing approaches by introducing frequency-domain attention and cross-token relation modeling to capture texture periodicity and compositional structures inherent in cultural heritage imagery.
The emergence of Vision Transformer (ViT) [13] has introduced a paradigm shift by adapting self-attention mechanisms to computer vision. Unlike CNNs, ViT treats an image as a sequence of patches and models global relationships through multi-head self-attention, offering advantages for fine-grained classification: capturing long-range dependencies, providing flexibility in aggregating information from different layers, and offering a unified class token representation. Recent ViT-based methods such as TransFG [14], TPSKG [15], and FFVT [16] have demonstrated superior performance compared to their CNN counterparts.
However, directly applying standard ViT to fine-grained tasks reveals critical limitations. First, standard patch-based tokenization treats all image regions uniformly, which is suboptimal when only specific regions contain discriminative information. Second, while ViT employs multi-head self-attention, it lacks specialized mechanisms to model fine-grained cross-token relationships essential for understanding compositional structures. Third, standard ViT uses only the class token from the final layer for classification, potentially discarding valuable complementary information from intermediate layers. Fourth, existing methods operate in the spatial domain without explicitly considering frequency-domain characteristics, which are relevant for analyzing periodic patterns and textures in ancient murals.
To address these limitations, this paper proposes a novel fine-grained classification framework with three key innovations that directly target the aforementioned challenges. First, to overcome the fourth limitation regarding the lack of frequency-domain modeling in existing methods, we design a FreqCA module that operates in the frequency domain to capture periodic patterns and texture characteristics. By transforming image features into the frequency domain and applying adaptive channel-wise attention, FreqCA enables the model to focus on the frequency components that are most discriminative for distinguishing artistic styles, which is particularly relevant for analyzing brushwork patterns, texture granularity, and decorative motifs in ancient murals. Second, to address the second limitation concerning the absence of specialized mechanisms for modeling fine-grained cross-token relationships, we propose a CTRA mechanism that explicitly models pairwise relationships between tokens beyond standard self-attention. By constructing a relation-aware feature space where tokens corresponding to related visual elements establish stronger connections, CTRA facilitates the understanding of compositional structures and recurring motifs that are essential for fine-grained discrimination.
Furthermore, to mitigate the challenge that visually similar fine-grained categories are inherently difficult to separate, we incorporate AMCCL that enhances discriminative power by simultaneously encouraging intra-class compactness and inter-class separability with adaptive margins. Unlike fixed-margin loss functions that treat all class pairs equally, AMCCL adjusts separation boundaries dynamically based on the similarity between class centers, assigning larger margins to confusable categories while allowing smaller margins for clearly separable ones. This adaptive mechanism is particularly beneficial for fine-grained classification tasks where inter-class similarities vary substantially across different category pairs.
We evaluate our framework on the CUB-200-2011 birds dataset [1], the Stanford Dogs dataset [2], and an ancient mural dataset spanning multiple dynasties. The experimental results demonstrate state-of-the-art performance across all datasets, with particularly notable improvements for ancient mural classification.
The main contributions of this work are summarized as follows:
  • We propose FRAM-ViT, a unified Vision Transformer framework for fine-grained visual classification in complex scenarios such as ancient murals. This framework systematically integrates frequency-domain feature enhancement, explicit cross-token relation modeling, and discriminative metric learning within a modular and extensible architecture. It effectively addresses challenges caused by image degradation, complex backgrounds, and limited annotations.
  • To achieve fine-grained discrimination, we design a triple core mechanism which includes FreqCA for enhancing frequency-domain texture information, CTRA for explicitly modeling pairwise relationships among tokens, and AMCCL to enhance feature separability and discriminative supervision.
  • We validate the proposed framework through extensive experiments on CUB-200-2011, Stanford Dogs, and a proprietary Dunhuang mural dataset. The proposed method achieves classification accuracies of 91.15%, 94.57%, and 94.27% on these three datasets, respectively, outperforming the ACC-ViT baseline by 1.35%, 1.63%, and 2.20%. These results, compared with several recent fine-grained classification methods, highlight the framework’s effectiveness, robustness, and generalization across datasets with varying levels of visual complexity.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the methodology. Section 4 describes the experimental setup. Section 5 discusses implications and limitations. Section 6 concludes the paper.

2. Related Work

In this section, we provide a comprehensive review of research areas relevant to our work, including fine-grained visual classification methods, vision transformer architectures, attention mechanisms for feature learning, cultural heritage image analysis, and metric learning approaches for discriminative feature representation.

2.1. Fine-Grained Visual Classification

Fine-grained visual classification has been a longstanding challenge in computer vision, requiring models to distinguish between subordinate categories with subtle visual differences. Early approaches to FGVC relied heavily on part-based representations with strong supervision. These methods typically required manual annotations of object bounding boxes and part locations to guide the localization of discriminative regions [17,18]. While effective, the heavy annotation burden limited their practical applicability and scalability to large-scale datasets.
The development of deep convolutional neural networks has enabled weakly supervised FGVC methods that learn from image-level labels alone. Bilinear CNN (B-CNN) [8] pioneered the use of second-order feature interactions through outer product pooling, capturing rich correlations between feature channels without explicit part annotations. Building upon this foundation, numerous CNN-based methods have been proposed to automatically localize and aggregate discriminative features. Recurrent Attention CNN (RA-CNN) [9] introduced a recurrent attention mechanism that iteratively zooms into discriminative regions at multiple scales. The Navigator–Teacher–Scrutinizer Network (NTS-Net) [10] combined region proposal with classification in a unified framework, enabling joint learning of region localization and feature extraction. Part-Stacked CNN [19] stacked multiple part detectors to capture complementary discriminative regions, while Weakly Supervised Complementary Parts Models [20] learned to identify mutually exclusive parts through a bottom-up approach.
More recent CNN-based approaches have focused on sophisticated attention mechanisms and feature aggregation strategies. Cross-X Learning [21] enhanced feature representations by learning cross-layer and cross-scale interactions. Channel Interaction Networks [22] explicitly modeled relationships between feature channels to capture fine-grained patterns. Selective Sparse Sampling [23] proposed that informative spatial locations be sampled adaptively based on learned attention weights. Despite these advances, CNN-based methods face fundamental limitations in modeling long-range dependencies due to their inductive bias toward local receptive fields, motivating the exploration of alternative architectures such as transformers.

2.2. Vision Transformers for Image Classification

The Vision Transformer (ViT) [13] marked a paradigm shift in visual recognition by successfully adapting the transformer architecture from natural language processing to computer vision. Unlike CNNs, which build hierarchical representations through successive convolutions with local receptive fields, ViT treats an image as a sequence of flattened patches and models global relationships through multi-head self-attention. This design enables every patch to attend to all other patches, facilitating the capture of long-range dependencies without the constraints of fixed kernel sizes. The original ViT demonstrated that when pretrained on large-scale datasets such as ImageNet-21k or JFT-300M, pure transformer architectures can match or exceed the performance of state-of-the-art CNNs on various image classification benchmarks.
Following ViT, numerous variants and improvements have been proposed to enhance transformer-based visual recognition. Data-Efficient Image Transformers (DeiT) [24] introduced knowledge distillation strategies to train vision transformers effectively on smaller datasets without requiring massive amounts of pretraining. TransUNet [25] combined transformers with a U-Net architecture for medical image segmentation, demonstrating the effectiveness of hybrid CNN-transformer designs. DETR [26] applied transformers to object detection by formulating detection as a set prediction problem. TransTrack [27] extended transformers to multiple-object tracking. These works collectively demonstrate the versatility and effectiveness of transformer architectures across diverse computer vision tasks.
For fine-grained visual classification specifically, several ViT-based methods have recently emerged. TransFG [14] represents one of the pioneering works in applying vision transformers to FGVC. It introduced a part selection module that identifies discriminative image regions by analyzing attention weights from the transformer layers. The selected parts are then used to augment the classification features. Transformer with Peak Suppression and Knowledge Guidance (TPSKG) [15] addressed the attention concentration problem by introducing a peak suppression mechanism that penalizes the model from focusing excessively on the most discriminative part, thereby encouraging the discovery of complementary discriminative regions. Additionally, it incorporated knowledge guidance through a teacher–student framework to transfer learned discriminative patterns. Feature Fusion Vision Transformer (FFVT) [16] aggregated important tokens from multiple transformer layers to preserve multi-scale information and compensate for potential information loss in deeper layers. However, these methods primarily focus on token selection and multi-layer fusion without explicitly addressing frequency-domain characteristics or fine-grained token relationships, leaving room for further improvements.
Recent advances have further explored the effective utilization of multi-scale feature hierarchies within transformer architectures. Jiang et al. [28] revisited the multi-scale feature hierarchy design in Detection Transformer (DETR), demonstrating that appropriate integration of features from different network depths significantly enhances detection performance by capturing both fine-grained details and high-level semantics. This insight resonates with our Complementary Tokens Integration (CTI) strategy, which extracts and aggregates class tokens from multiple transformer layers to leverage hierarchical feature representations. While Jiang et al. focused on object detection with explicit multi-scale feature pyramid construction, our CTI approach adapts this philosophy to fine-grained classification by selectively combining intermediate and final-layer representations, thereby preserving both low-level textural details and high-level semantic structures essential for distinguishing visually similar categories.

2.3. Attention Mechanisms in Deep Learning

Attention mechanisms have become a fundamental component in modern deep learning architectures, enabling models to selectively focus on relevant information while suppressing irrelevant features. In computer vision, spatial attention mechanisms have been widely adopted to identify discriminative regions. Squeeze-and-Excitation Networks introduced channel attention by modeling inter-channel dependencies through global average pooling and fully connected layers. Context-Aware Attentional Pooling (CAP) [29] proposed context-dependent attention weights for feature aggregation in fine-grained classification. Mask-Guided Contrastive Attention [30] leveraged foreground–background segmentation masks to guide attention learning for person re-identification.
Beyond spatial and channel attention, recent works have explored attention mechanisms in the frequency domain. While spatial-domain attention focuses on identifying important spatial locations or feature channels, frequency-domain attention can capture periodic patterns and textures that are not readily accessible in the spatial domain. In medical image analysis, frequency-domain features have been shown to be effective for capturing subtle pathological patterns. In natural image processing, frequency analysis has been used for texture classification and image quality assessment. However, the explicit integration of frequency-domain attention into vision transformers for fine-grained classification remains relatively unexplored, particularly for applications involving textured and patterned images such as ancient murals.
Cross-attention mechanisms, which model relationships between different modalities or feature groups, have attracted increasing amounts of attention. In vision-language models built on BERT-style architectures [31], cross-attention enables the model to align visual features with textual descriptions. In visual recognition, cross-attention has been used to model relationships between global and local features, or between features from different network branches. However, most existing ViT-based fine-grained classification methods rely primarily on standard self-attention within each layer, without explicitly modeling cross-token relationships that could capture complementary discriminative information. Developing specialized cross-attention mechanisms for fine-grained feature learning represents a promising research direction.
Beyond spatial and channel attention paradigms, recent works have explored attention mechanisms that explicitly model correlations among feature representations. Li et al. [32] proposed AMTrans, an Auto-Correlation Multi-Head Attention Transformer designed for infrared spectral deconvolution. The key innovation of AMTrans lies in its auto-correlation attention mechanism, which computes attention weights based on the correlation structure among input features rather than relying solely on query-key dot products. This design philosophy shares conceptual similarities with our proposed Cross-Token Relation Attention (CTRA) module. While AMTrans applies auto-correlation to capture inherent spectral feature dependencies in infrared signal processing, our CTRA explicitly models pairwise relationships between visual tokens through learnable relation transformations, enabling the network to establish stronger connections between semantically related image regions. Both approaches recognize the importance of going beyond standard self-attention by incorporating explicit relational priors, though they target different application domains and employ distinct computational mechanisms tailored to their respective tasks.

2.4. Cultural Heritage Image Analysis and Mural Classification

The application of computer vision techniques to cultural heritage preservation has attracted increasing amounts of attention in recent years, with ancient mural classification representing a particularly challenging problem. Traditional approaches to mural classification relied on manually designed features and conventional machine learning methods. Early works employed SIFT descriptors [6] to capture local texture patterns, combined with support vector machines for classification. However, SIFT-based methods often produce false matches when applied to murals with complex artistic styles and severe degradation.
With the development of deep learning, CNN-based approaches have been adopted for mural analysis. Kumar et al. [6] employed pretrained AlexNet and VGGNet models for mural classification through transfer learning, demonstrating that features learned from natural images can be adapted to the cultural heritage domain. Wang et al. [5] modified the Inception-v3 architecture for dynasty-based classification of ancient Chinese murals, achieving improved performance through careful network adaptation and training strategies. Cao et al. [3] proposed a multichannel separable network that explicitly fuses high-level deep features with low-level color histograms and LBP textures, recognizing that both levels of representation are important for capturing the artistic characteristics of murals.
Despite these advances, mural classification remains challenging due to several factors. First, mural images often exhibit significant background noise and degradation artifacts resulting from environmental exposure and aging, which can dominate the feature representations learned by neural networks. Second, the artistic styles, color palettes, and compositional structures vary substantially across different historical periods and geographical regions, requiring models to capture style-specific characteristics. Third, the limited availability of annotated mural datasets poses challenges for training deep neural networks that typically require large-scale data. Fourth, murals contain rich frequency-domain information in the form of brushwork patterns, texture details, and periodic decorative motifs, which are not adequately captured by standard spatial-domain convolutional or attention operations. These challenges motivate the development of specialized architectures that can effectively handle the unique characteristics of mural images while maintaining generalizability to other fine-grained classification tasks.

2.5. Metric Learning and Contrastive Loss Functions

Metric learning aims to learn feature representations where similar samples are close together while dissimilar samples are far apart in the embedding space. This objective aligns naturally with fine-grained classification, where the goal is to learn discriminative features that separate visually similar categories. Center Loss [30] introduced the concept of learning class centers in the feature space and minimizing the distances between features and their corresponding class centers to enhance intra-class compactness. Contrastive Loss encourages features of the same class to be similar while pushing apart features from different classes, typically using sample pairs or triplets.
In fine-grained classification, the challenge lies in the fact that inter-class distances are naturally small due to visual similarity, while intra-class variations can be large due to factors such as pose, illumination, and viewpoint changes. Fixed-margin loss functions may not be optimal for this scenario, as they treat all class pairs equally without considering the varying difficulty of distinguishing between different category pairs. Adaptive margin methods have been proposed to address this issue. For instance, in face recognition, adaptive margin losses adjust the decision boundaries based on sample difficulty or class distributions. However, the application of adaptive margin contrastive learning to fine-grained visual classification, particularly in combination with transformer architectures, remains an area that requires further exploration.
Furthermore, most existing metric learning approaches focus solely on the feature extraction backbone without considering the integration of metric learning objectives with multi-head classification strategies or hierarchical feature fusion. In the context of ancient mural classification, where categories may exhibit hierarchical relationships (e.g., murals from adjacent dynasties may share certain stylistic elements), adaptive margin contrastive learning could provide additional benefits by learning fine-grained distinctions between closely related categories while maintaining clear separations between distinctly different ones.

2.6. Research Gaps and Motivations

Despite the significant progress in fine-grained classification, vision transformers, and cultural heritage image analysis, several research gaps remain that motivate our work. First, existing ViT-based fine-grained classification methods primarily operate in the spatial domain and do not explicitly capture frequency-domain characteristics, which are particularly important for analyzing textured and patterned images such as ancient murals. Second, while self-attention mechanisms in standard transformers model relationships between all tokens, they do not explicitly emphasize fine-grained pairwise relationships that could be beneficial for capturing compositional structures and recurring motifs. Third, most methods use only the final layer’s class token for classification, potentially discarding complementary information from intermediate layers that focus on different levels of visual abstraction. Fourth, existing token selection strategies either discard less important tokens entirely or aggregate them uniformly, missing the opportunity to maintain multiple parallel focuses on different discriminative aspects. Finally, the application of adaptive margin contrastive learning to transformer-based fine-grained classification, particularly for cultural heritage images, remains underexplored.
To provide a clearer overview of the research landscape and facilitate comparison with our proposed approach, Table 1 summarizes the key characteristics, strengths, and limitations of representative methods.
As illustrated in Table 1, existing methods typically address only a subset of the identified challenges. CNN-based methods such as B-CNN and RA-CNN excel at local feature extraction but face inherent limitations in modeling long-range dependencies. ViT-based methods, including TransFG and FFVT, leverage global self-attention but lack specialized mechanisms for frequency-domain analysis or explicit cross-token relation modeling. Notably, recent works such as the multi-scale feature hierarchy rethinking in DETR [28] and the auto-correlation attention mechanism in AMTrans [32] have demonstrated the value of hierarchical feature integration and explicit relational modeling in their respective domains, providing important conceptual foundations that inform our design choices.
Our proposed FRAM-ViT framework addresses these gaps comprehensively by integrating frequency-domain attention (FreqCA), cross-token relation modeling (CTRA), adaptive multi-focus classification (AOF), and adaptive margin contrastive learning (AMCCL). The FreqCA module draws inspiration from frequency-domain processing approaches to capture periodic patterns and textures. The CTRA mechanism shares philosophical similarities with AMTrans in explicitly modeling pairwise feature relationships, though adapted for visual token interactions in fine-grained classification. The CTI strategy aligns with the multi-scale hierarchy insights from recent DETR variants by leveraging complementary information across transformer layers. By combining these components, we aim to advance the state of the art in fine-grained classification while demonstrating particular effectiveness for the challenging task of ancient mural classification. The proposed approach is evaluated on both natural fine-grained datasets (CUB-200-2011 birds and Stanford Dogs) and a proprietary ancient mural dataset, ensuring comprehensive validation across different domains and visual characteristics.

3. Methodology

In this section, we present a comprehensive description of the proposed framework for fine-grained visual classification with emphasis on ancient mural recognition. We first provide an overview of the overall architecture, followed by detailed explanations of each proposed component: Frequency Channel Attention (FreqCA), Cross-Token Relation Attention (CTRA), Adaptive Omni-Focus (AOF) block, and Adaptive Margin Contrastive Center Loss (AMCCL). Finally, we describe the training strategy and integration of complementary information from multiple layers.

3.1. Overall Architecture

The proposed framework builds upon the Vision Transformer (ViT) architecture as the backbone, specifically the ViT-B/16 configuration with patch size $16 \times 16$. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width, respectively, the image is first resized to $448 \times 448$ pixels and then partitioned into $N = \frac{H}{P} \times \frac{W}{P}$ non-overlapping patches, where $P = 16$ is the patch size. Each flattened patch $x_i \in \mathbb{R}^{P^2 \cdot 3}$ is linearly projected into a $D$-dimensional embedding space, where $D = 768$ for ViT-B/16.
The initial token sequence $Z_0 \in \mathbb{R}^{(N+1) \times D}$ is constructed by concatenating a learnable class token $z_{cls}$ with the patch embeddings and adding positional encodings:
$$Z_0 = [\, z_{cls};\ x_1 E;\ x_2 E;\ \ldots;\ x_N E \,] + E_{pos},$$
where $E \in \mathbb{R}^{(P^2 \cdot 3) \times D}$ denotes the patch embedding projection matrix and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ represents the positional embeddings. The token sequence is then processed through $L = 12$ transformer encoder layers, each incorporating multi-head self-attention (MSA) and feed-forward networks (FFNs).
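As a concrete reference, the following PyTorch sketch reproduces this token-sequence construction for a ViT-B/16-style configuration; the module name PatchEmbed and its internals are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch embedding + class token + positional encoding (a sketch of Z_0)."""
    def __init__(self, img_size=448, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # N = (H/P) * (W/P)
        # Linear projection of flattened patches, implemented as a strided convolution
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                       # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)             # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)          # (B, 1, D) class token
        return torch.cat([cls, x], dim=1) + self.pos_embed      # Z_0: (B, N+1, D)
```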
Our architecture extends the standard ViT by integrating four key innovations: (1) FreqCA modules inserted at early layers to capture frequency-domain texture characteristics; (2) CTRA mechanisms integrated into selected transformer layers to model fine-grained token relationships; (3) an AOF block after the backbone to perform adaptive token selection and multi-head classification; and (4) AMCCL loss function to enhance discriminative feature learning. The overall information flow is illustrated in Figure 1. In the following subsections, we elaborate on each component in detail.
To improve readability and provide a consistent reference for the mathematical notation used throughout this section, Table 2 summarizes all key symbols and their definitions.

3.2. Frequency Channel Attention (FreqCA)

Ancient murals exhibit distinctive characteristics in the frequency domain, including periodic brushwork patterns, repetitive decorative motifs, and texture granularity reflective of artistic techniques. Standard spatial-domain operations may not adequately capture these frequency-specific cues. To address this limitation, we propose the Frequency Channel Attention (FreqCA) module, which learns channel-wise importance in the frequency domain and applies the learned weights to both the real and imaginary components of the spectrum. A schematic illustration is provided in Figure 2.
Given feature maps $F \in \mathbb{R}^{H \times W \times C}$ (or an input image treated as a 3-channel feature map), where $H$, $W$, and $C$ denote the spatial dimensions and the number of channels, respectively, FreqCA first transforms $F$ into the frequency domain using the two-dimensional Discrete Fourier Transform (DFT), implemented with the FFT in practice:
$$\mathcal{F}(F)(u, v, c) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} F(h, w, c)\, e^{-j 2\pi \left( \frac{uh}{H} + \frac{vw}{W} \right)},$$
where $u \in [0, H-1]$ and $v \in [0, W-1]$ are frequency-domain coordinates, $c$ indexes the channel dimension, and $j = \sqrt{-1}$ is the imaginary unit. We denote the complex spectrum as
$$\mathcal{F}(F) = F_{re} + j F_{im}.$$
To construct a compact channel descriptor, we compute pooling statistics on the magnitude spectrum
$$F_{mag} = |\mathcal{F}(F)| = \sqrt{F_{re}^2 + F_{im}^2},$$
and define channel-wise average- and max-pooled descriptors as
$$f_{avg}^{c} = \frac{1}{H \cdot W} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F_{mag}(u, v, c), \qquad f_{max}^{c} = \max_{u, v} F_{mag}(u, v, c).$$
We concatenate them to obtain $f_{freq} = [f_{avg}; f_{max}] \in \mathbb{R}^{2C}$, which is fed into a channel attention function implemented with two fully connected layers (reduction ratio $r$):
$$a_{freq} = \sigma\big( W_2 \cdot \mathrm{ReLU}(W_1 \cdot f_{freq}) \big),$$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times 2C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are learnable parameters, $\sigma(\cdot)$ is the sigmoid function, and $r = 16$ in our implementation.
The resulting weights $a_{freq} \in \mathbb{R}^{C}$ are shared and applied channel-wise to modulate both the real and imaginary components:
$$\hat{F}_{re} = F_{re} \odot \mathrm{Reshape}(a_{freq}), \qquad \hat{F}_{im} = F_{im} \odot \mathrm{Reshape}(a_{freq}),$$
where $\odot$ denotes channel-wise multiplication and $\mathrm{Reshape}(\cdot)$ broadcasts $a_{freq}$ to match the spatial dimensions. The modulated spectrum is then recombined and transformed back to the spatial domain via the inverse FFT:
$$\hat{\mathcal{F}}(F) = \hat{F}_{re} + j \hat{F}_{im}, \qquad \tilde{F} = \mathcal{F}^{-1}\big( \hat{\mathcal{F}}(F) \big), \qquad F_{out} = \mathrm{Re}\big( \tilde{F} \big).$$
Here, $\mathrm{Re}(\cdot)$ denotes taking the real part to obtain a real-valued output in practice.
By operating in the frequency domain, FreqCA enables the model to selectively emphasize channels that carry discriminative periodic patterns and textures. For computational efficiency, we implement the DFT with FFT algorithms, giving a complexity of $O(HWC \log(HW))$. FreqCA modules are inserted at layers 3, 6, and 9 of the transformer backbone, chosen empirically to capture frequency characteristics at multiple levels of abstraction while maintaining reasonable computational overhead.
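To make the operation concrete, the sketch below implements the FreqCA computation on a (B, C, H, W) feature map using PyTorch's FFT routines; reshaping transformer tokens into such a spatial grid before applying the module, as well as the class and parameter names, are assumptions made for illustration rather than details of the released implementation.

```python
import torch
import torch.nn as nn

class FreqCA(nn.Module):
    """Frequency-domain channel gating: a minimal sketch of the FreqCA idea."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        spec = torch.fft.fft2(x, dim=(-2, -1))             # 2D DFT of each channel
        mag = spec.abs()                                   # magnitude spectrum |F(F)|
        f_avg = mag.mean(dim=(-2, -1))                     # (B, C) average-pooled descriptor
        f_max = mag.amax(dim=(-2, -1))                     # (B, C) max-pooled descriptor
        a = self.fc(torch.cat([f_avg, f_max], dim=1))      # channel gate a_freq in (0, 1)
        a = a.view(x.size(0), -1, 1, 1)
        # The shared gate modulates both the real and imaginary parts of the spectrum
        spec = torch.complex(spec.real * a, spec.imag * a)
        return torch.fft.ifft2(spec, dim=(-2, -1)).real    # back to the spatial domain
```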

3.3. Cross-Token Relation Attention (CTRA)

To facilitate a clearer understanding of the proposed CTRA design, we provide a schematic illustration in Figure 3. The module adopts a two-branch structure: a global branch that compresses token information into channel context, and a local branch that preserves token-wise details. The two branches jointly generate multiplicative gates to refine the input token representations.
Let $Z \in \mathbb{R}^{T \times D}$ denote the input token sequence of a transformer layer, where $T = N + 1$ is the number of tokens (including the class token) and $D$ is the embedding dimension. CTRA refines $Z$ by combining a global (channel-context) gate with a local (token-wise) gate.
Global branch (channel-context gating). We first aggregate token information using global average pooling (GAP) over the token dimension:
$$g = \mathrm{GAP}(Z) = \frac{1}{T} \sum_{t=1}^{T} Z_t \in \mathbb{R}^{D}.$$
The aggregated descriptor $g$ is then transformed by a lightweight mapping (implemented as an MLP or, equivalently, a 1D convolution, as indicated in Figure 3) followed by a sigmoid gate to obtain channel-context weights:
$$a_g = \sigma\big( \phi_g(g) \big) \in \mathbb{R}^{D},$$
where $\phi_g(\cdot)$ denotes the learnable transformation and $\sigma(\cdot)$ is the sigmoid function.
Local branch (token-wise gating). In parallel, the local branch processes the token sequence directly to preserve token-wise details and generate a token-dependent gate:
$$A_l = \sigma\big( \phi_l(Z) \big) \in \mathbb{R}^{T \times D},$$
where $\phi_l(\cdot)$ is implemented as an MLP/Conv1d module, as shown in Figure 3.
Joint modulation. Finally, the two gates are applied multiplicatively to refine the input tokens:
$$Z_{out} = Z \odot A_l \odot \mathrm{Broadcast}(a_g),$$
where $\odot$ denotes element-wise multiplication and $\mathrm{Broadcast}(a_g) \in \mathbb{R}^{T \times D}$ replicates $a_g$ along the token dimension to match the shape of $Z$. By jointly considering channel-context information from the global branch and token-wise details from the local branch, CTRA produces refined token representations that are subsequently used by the remaining components of the transformer block.
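The two-branch gating can be sketched in PyTorch as follows; the hidden width of the gating MLPs and the class name are assumptions, while the global-average pooling, sigmoid gates, and joint multiplicative structure follow the description above.

```python
import torch
import torch.nn as nn

class CTRA(nn.Module):
    """Global (channel-context) and local (token-wise) gating of a token sequence."""
    def __init__(self, dim, hidden=192):
        super().__init__()
        self.phi_g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                   nn.Linear(hidden, dim))   # global-branch mapping
        self.phi_l = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                   nn.Linear(hidden, dim))   # local-branch mapping

    def forward(self, z):                                # z: (B, T, D), T = N + 1 tokens
        g = z.mean(dim=1)                                # GAP over the token dimension
        a_g = torch.sigmoid(self.phi_g(g)).unsqueeze(1)  # (B, 1, D) channel-context gate
        a_l = torch.sigmoid(self.phi_l(z))               # (B, T, D) token-wise gate
        return z * a_l * a_g                             # joint multiplicative refinement
```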

3.4. Adaptive Omni-Focus (AOF) Block

Existing token selection strategies in vision transformers typically either discard less informative tokens entirely or aggregate all tokens uniformly. However, fine-grained classification benefits from maintaining multiple parallel focuses on different discriminative aspects simultaneously. The Adaptive Omni-Focus (AOF) block addresses this by dynamically partitioning tokens into multiple subsets and maintaining separate classification heads for each subset, enabling the model to attend to diverse discriminative regions concurrently.
Given the final token representations $Z_L \in \mathbb{R}^{(N+1) \times D}$ from the $L$-th transformer layer, AOF first computes token importance scores based on the class token's attention to the patch tokens. Specifically, we extract the attention weights $A_{cls} \in \mathbb{R}^{N}$ from the class token to all patch tokens in the final layer:
$$A_{cls}^{i} = \frac{1}{H} \sum_{h=1}^{H} \mathrm{Attention}_{h}(0, i),$$
where $\mathrm{Attention}_h(0, i)$ denotes the attention weight from the class token (index 0) to the $i$-th patch token in head $h$, averaged across all $H$ attention heads. Based on these importance scores, we partition the tokens into $K = 4$ groups using adaptive thresholding:
$$G_k = \{\, i \mid \tau_{k-1} < A_{cls}^{i} \le \tau_k \,\}, \qquad k = 1, \ldots, K,$$
where the thresholds $\{\tau_0, \tau_1, \ldots, \tau_K\}$ are determined by equally dividing the sorted attention weights into $K$ quantiles, with $\tau_0 = 0$ and $\tau_K = \max(A_{cls})$. Each group $G_k$ corresponds to tokens with similar importance levels, capturing different aspects of the image, from background context in $G_1$ to highly discriminative regions in $G_K$.
For each group $G_k$, we aggregate the corresponding tokens and the class token into a group-specific representation:
$$z_k = z_{cls}^{L} + \frac{1}{|G_k|} \sum_{i \in G_k} Z_L^{i},$$
where $z_{cls}^{L}$ is the class token from layer $L$, $Z_L^{i}$ denotes the $i$-th patch token, and $|G_k|$ is the cardinality of group $k$. Each group representation $z_k \in \mathbb{R}^{D}$ is then fed into a separate classification head:
$$p_k = \mathrm{softmax}(W_k z_k + b_k),$$
where $W_k \in \mathbb{R}^{C_{cls} \times D}$ and $b_k \in \mathbb{R}^{C_{cls}}$ are learnable parameters of the $k$-th classification head, and $C_{cls}$ denotes the number of classes. The final prediction is obtained by weighted aggregation:
$$p_{final} = \sum_{k=1}^{K} \beta_k p_k,$$
where $\{\beta_1, \beta_2, \beta_3, \beta_4\}$ are aggregation weights corresponding to groups $G_1$ through $G_4$, ordered by ascending token importance. The rationale for the weight assignment follows the principle that groups containing tokens of greater importance should contribute more to the final prediction. Through systematic grid-search experiments on the validation sets, testing weight configurations ranging from uniform $[1.0, 1.0, 1.0, 1.0]$ to various asymmetric distributions, we empirically determined that $\beta = [0.2, 0.2, 0.8, 0.8]$ yields optimal performance for CUB-200-2011 and the mural dataset. This configuration assigns higher weights (0.8) to groups $G_3$ and $G_4$, which contain the most discriminative tokens, typically corresponding to distinctive object parts such as bird beaks or mural motifs, while lower weights (0.2) are assigned to groups $G_1$ and $G_2$, which predominantly contain background or less informative regions. For Stanford Dogs, uniform weights $\beta = [1.0, 1.0, 1.0, 1.0]$ performed best, likely because discriminative features in dog images are distributed more uniformly across different body regions rather than concentrated in specific parts.
By maintaining multiple classification heads focusing on different token subsets, AOF enables the model to capture diverse discriminative aspects simultaneously rather than committing to a single focus. This multi-focus strategy improves robustness, particularly when discriminative features are distributed across multiple image regions or when background variations are significant.
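The grouping and multi-head aggregation can be sketched as below; for simplicity, the sketch forms K equally sized groups from the sorted importance scores (a close approximation of the quantile thresholds above) and assumes the number of patch tokens is divisible by K. The class name and argument layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AOF(nn.Module):
    """Importance-based token grouping with one classification head per group."""
    def __init__(self, dim, num_classes, num_groups=4, betas=(0.2, 0.2, 0.8, 0.8)):
        super().__init__()
        self.num_groups = num_groups
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(num_groups)])
        self.register_buffer("betas", torch.tensor(betas))

    def forward(self, tokens, cls_token, attn_cls):
        # tokens: (B, N, D) patch tokens; cls_token: (B, D);
        # attn_cls: (B, N) class-token attention averaged over heads
        B, N, D = tokens.shape
        order = attn_cls.argsort(dim=1)                        # ascending importance
        group_size = N // self.num_groups                      # assumes N divisible by K
        logits = []
        for k in range(self.num_groups):
            idx = order[:, k * group_size:(k + 1) * group_size]             # group G_k
            group_tokens = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
            z_k = cls_token + group_tokens.mean(dim=1)         # group-specific representation
            logits.append(self.heads[k](z_k))
        probs = torch.stack([torch.softmax(l, dim=-1) for l in logits])     # (K, B, C)
        p_final = (self.betas.view(-1, 1, 1) * probs).sum(dim=0)            # weighted fusion
        return p_final, logits
```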

3.5. Complementary Tokens Integration (CTI)

Standard ViT uses only the class token from the final layer for classification, potentially discarding complementary information encoded in intermediate layers. Different transformer layers capture features at varying levels of abstraction: shallow layers encode low-level textures and colors, middle layers represent mid-level patterns, and deep layers capture high-level semantic structures. To leverage this multi-level complementary information, we propose Complementary Tokens Integration (CTI) that extracts class tokens from multiple layers for classification.
Specifically, we select class tokens from layers $\mathcal{L} = \{l_{10}, l_{11}, l_{12}\}$ for ViT-B/16, corresponding to the 10th, 11th, and 12th (final) transformer layers. For each selected layer $l_i \in \mathcal{L}$, we extract the class token $z_{cls}^{l_i}$ and pass it through a dedicated classification head:
$$p_{l_i} = \mathrm{softmax}(W_{l_i} z_{cls}^{l_i} + b_{l_i}),$$
where $W_{l_i} \in \mathbb{R}^{C_{cls} \times D}$ and $b_{l_i}$ are layer-specific classification parameters. During training, each classification head is supervised independently with a cross-entropy loss:
$$\mathcal{L}_{CTI} = \sum_{l_i \in \mathcal{L}} \alpha_{l_i} \mathcal{L}_{CE}(p_{l_i}, y),$$
where $\mathcal{L}_{CE}$ denotes the cross-entropy loss, $y$ is the ground-truth label, and $\alpha_{l_i}$ are layer-specific weights. We set $\alpha_{10} = 0.3$, $\alpha_{11} = 0.3$, and $\alpha_{12} = 0.4$ to gradually increase the supervision strength for deeper layers, reflecting their greater semantic capacity.
During inference, the predictions from the selected layers are aggregated through weighted averaging:
$$p_{CTI} = \sum_{l_i \in \mathcal{L}} \alpha_{l_i} p_{l_i}.$$
This multi-layer integration enables the model to simultaneously leverage low-level textural details (from layer 10), mid-level pattern information (from layer 11), and high-level semantic features (from layer 12), providing a more comprehensive representation for fine-grained classification. The complementary nature of these multi-level features is particularly beneficial for ancient mural classification, where both low-level artistic techniques (brushwork textures) and high-level compositional structures contribute to dynasty-specific characteristics.
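A minimal sketch of CTI is given below; it assumes the class tokens of layers 10, 11, and 12 have already been collected from the backbone and simply applies the layer-specific heads, weighted cross-entropy during training, and weighted averaging at inference. The class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTI(nn.Module):
    """Per-layer classification heads over class tokens from selected layers."""
    def __init__(self, dim, num_classes, layer_weights=(0.3, 0.3, 0.4)):
        super().__init__()
        self.alphas = layer_weights
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in layer_weights])

    def forward(self, cls_tokens, target=None):
        # cls_tokens: list of (B, D) class tokens, one per selected layer
        logits = [head(z) for head, z in zip(self.heads, cls_tokens)]
        if target is not None:       # training: each head supervised independently
            return sum(a * F.cross_entropy(l, target) for a, l in zip(self.alphas, logits))
        # inference: weighted average of the per-layer predictions
        probs = [a * torch.softmax(l, dim=-1) for a, l in zip(self.alphas, logits)]
        return torch.stack(probs).sum(dim=0)
```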

3.6. Adaptive Margin Contrastive Center Loss (AMCCL)

To enhance the discriminative power of learned feature representations, we propose Adaptive Margin Contrastive Center Loss (AMCCL), which simultaneously encourages intra-class compactness and inter-class separability with adaptive margins. AMCCL combines the concepts of center loss and contrastive learning while introducing adaptive margins that adjust based on inter-class similarity.
Let $f_i \in \mathbb{R}^{D}$ denote the feature representation (the class token from layer 12) of the $i$-th sample with label $y_i \in \{1, \ldots, C_{cls}\}$, and let $c_j \in \mathbb{R}^{D}$ denote the center (mean feature) of class $j$. The center-loss component encourages features to be close to their respective class centers:
$$\mathcal{L}_{center} = \frac{1}{2B} \sum_{i=1}^{B} \| f_i - c_{y_i} \|_2^2,$$
where $B$ denotes the batch size. The class centers are updated with an exponential moving average during training:
$$c_j^{(t+1)} = \mu\, c_j^{(t)} + (1 - \mu) \frac{1}{|\mathcal{B}_j|} \sum_{i \in \mathcal{B}_j} f_i,$$
where $\mathcal{B}_j$ is the set of samples belonging to class $j$ in the current batch and $\mu = 0.9$ is the momentum coefficient.
The contrastive component encourages inter-class separation by pushing apart features from different classes. We define an adaptive margin $m_{ij}$ between classes $i$ and $j$ based on the similarity of their centers:
$$m_{ij} = m_0 \cdot \big( 1 + \exp(\gamma \cdot \mathrm{sim}(c_i, c_j)) \big),$$
where $m_0 = 0.5$ is the base margin, $\gamma = 2.0$ is a scaling factor, and $\mathrm{sim}(c_i, c_j) = \frac{c_i \cdot c_j}{\| c_i \| \, \| c_j \|}$ measures the cosine similarity between class centers. This adaptive mechanism assigns larger margins to similar classes that are harder to distinguish, while allowing smaller margins for clearly separable classes.
The contrastive loss with adaptive margins is formulated as
$$\mathcal{L}_{contrast} = \frac{1}{B} \sum_{i=1}^{B} \sum_{j \ne y_i} \max\big( 0,\; m_{y_i, j} - \| f_i - c_j \|^2 + \| f_i - c_{y_i} \|^2 \big),$$
which encourages the distance from a sample to the centers of other classes to exceed the adaptive margin plus the distance to its own class center. The complete AMCCL is defined as
$$\mathcal{L}_{AMCCL} = \mathcal{L}_{center} + \xi\, \mathcal{L}_{contrast},$$
where $\xi = 1.0$ balances the two components. For the mural classification task, we found that AMCCL with adaptive margins provides significant improvements over fixed-margin alternatives, as ancient mural categories often exhibit hierarchical similarities (e.g., murals from adjacent dynasties share certain stylistic elements) that benefit from flexible margin adjustments.
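The loss can be sketched as a module that keeps EMA-updated class centers as a buffer; the vectorized margin computation and the separate center-update call are illustrative implementation choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMCCL(nn.Module):
    """Center loss + adaptive-margin contrastive term with EMA class centers."""
    def __init__(self, num_classes, dim, m0=0.5, gamma=2.0, xi=1.0, momentum=0.9):
        super().__init__()
        self.register_buffer("centers", torch.zeros(num_classes, dim))
        self.m0, self.gamma, self.xi, self.mu = m0, gamma, xi, momentum

    @torch.no_grad()
    def update_centers(self, feats, labels):       # call once per training step
        for j in labels.unique():
            self.centers[j] = self.mu * self.centers[j] + \
                              (1 - self.mu) * feats[labels == j].mean(dim=0)

    def forward(self, feats, labels):              # feats: (B, D), labels: (B,)
        d = torch.cdist(feats, self.centers) ** 2  # squared distances to all centers
        d_own = d.gather(1, labels.view(-1, 1))    # distance to own class center
        center_loss = 0.5 * d_own.mean()
        cn = F.normalize(self.centers, dim=1)
        sim = cn[labels] @ cn.t()                  # cosine similarity sim(c_{y_i}, c_j)
        margin = self.m0 * (1 + torch.exp(self.gamma * sim))   # adaptive margins
        hinge = torch.clamp(margin - d + d_own, min=0)
        mask = torch.ones_like(hinge).scatter_(1, labels.view(-1, 1), 0)  # exclude j = y_i
        contrast_loss = (hinge * mask).sum(dim=1).mean()
        return center_loss + self.xi * contrast_loss
```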

3.7. Overall Training Objective

The complete training objective combines multiple loss functions to supervise different aspects of the model:
$$\mathcal{L}_{total} = \mathcal{L}_{CTI} + \sum_{k=1}^{K} \alpha_k \mathcal{L}_{CE}(p_k, y) + \omega\, \mathcal{L}_{AMCCL},$$
where the first term supervises multi-layer classification through CTI, the second term supervises the $K$ classification heads in the AOF block with weights $\alpha_k$, and the third term applies AMCCL for discriminative feature learning with weight $\omega = 0.05$ for the mural dataset (determined empirically through validation). The relatively small weight for AMCCL ensures that it provides a regularization effect without dominating the optimization process.
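Putting the pieces together, the overall objective can be assembled roughly as follows; the argument names and the way the individual terms are passed in are illustrative assumptions consistent with the preceding subsections.

```python
import torch.nn.functional as F

def total_loss(cti_loss, aof_logits, amccl_loss, target, alpha_k, omega=0.05):
    """Combine the CTI loss, the per-head AOF cross-entropies, and the AMCCL term."""
    aof_loss = sum(a * F.cross_entropy(logits, target)
                   for a, logits in zip(alpha_k, aof_logits))
    return cti_loss + aof_loss + omega * amccl_loss
```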

4. Experiments

In this section, we present comprehensive experiments to evaluate the effectiveness of the proposed framework. We first describe the datasets and implementation details, then we compare our method with state-of-the-art approaches on three fine-grained classification benchmarks. Finally, we conduct ablation studies to analyze the contribution of each proposed component, namely Frequency Channel Attention (FreqCA), Cross-Token Relation Attention (CTRA), and Adaptive Margin Contrastive Center Loss (AMCCL).

4.1. Datasets

We evaluate our method on three fine-grained visual classification datasets covering both natural object recognition and cultural heritage image analysis.
CUB-200-2011 [1] is a widely used benchmark for fine-grained bird classification, containing 11,788 images from 200 bird species. The dataset is split into 5994 images for training and 5794 images for testing. This dataset presents significant challenges due to large intra-class variations caused by different poses, viewpoints, and backgrounds, while inter-class differences are often subtle and localized to specific body parts such as beaks, wings, and tail feathers.
Stanford Dogs [2] consists of 20,580 images covering 120 dog breeds, with 12,000 images for training and 8580 images for testing. The dataset is challenging because different dog breeds often share similar body structures, and discriminative features are primarily found in facial characteristics, fur patterns, and body proportions.
Ancient Mural Dataset is a proprietary dataset collected from Dunhuang Grottoes, containing 1812 mural images spanning multiple historical dynasties. The images were acquired using standardized photographic protocols with controlled shooting angles and distances to mitigate severe geometric distortions caused by uneven wall surfaces and cavities. Nevertheless, residual perspective variations and minor surface irregularities inherent to in situ mural photography remain present in the data. Ancient mural classification presents unique challenges including complex background noise, severe degradation artifacts due to environmental exposure and aging, geometric variations from curved or irregular wall surfaces, and subtle stylistic variations across different historical periods [3]. The artistic techniques, color palettes, and compositional structures vary substantially across dynasties, requiring models to capture both low-level visual features and high-level semantic patterns.

4.2. Implementation Details

Our framework is implemented using PyTorch (version 1.8.0) and trained on NVIDIA RTX GPUs. We adopt ACC-ViT [33] with ViT-B/16 [13] pretrained on ImageNet-21k as the backbone. During training, input images are resized to 512 × 512 pixels and then randomly cropped to 448 × 448 to obtain the final resolution fed into the transformer; standard data-augmentation techniques including random horizontal flipping are applied.
We use the SGD optimizer with momentum 0.9 and weight decay $5 \times 10^{-4}$. The initial learning rate is set to $3 \times 10^{-4}$ with a cosine annealing schedule. The batch size is set to 10 and the model is trained for 60 epochs. For the FreqCA module, the reduction ratio $r$ is set to 16, and the module is inserted at layers 3, 6, and 9 of the transformer backbone. For CTRA, the scaling parameter $\lambda$ is initialized to 0.1, and the mechanism is integrated into layers 6, 9, and 12. For AMCCL, the loss weight $\omega$ is set to 0.05, the base margin $m_0$ is set to 0.5, and the momentum coefficient for center updates is set to 0.9. The aggregation weights $\beta$ for the multi-head classification are set to $[0.2, 0.2, 0.8, 0.8]$ for CUB-200-2011 and the mural dataset, while uniform weights $[1.0, 1.0, 1.0, 1.0]$ are used for Stanford Dogs.
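The data pipeline and optimization settings above translate roughly into the following sketch; `model` is a placeholder standing in for the assembled FRAM-ViT network, and the snippet reflects the stated hyperparameters rather than the authors' released training script.

```python
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((512, 512)),            # resize to 512 x 512
    T.RandomCrop(448),               # random crop to the 448 x 448 input resolution
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

model = torch.nn.Linear(768, 200)    # placeholder for the assembled FRAM-ViT network
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)

for epoch in range(60):
    # ... one pass over the training loader with the losses defined in Section 3 ...
    scheduler.step()
```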
Hyperparameter Selection Process. All hyperparameters were determined through systematic empirical experiments on validation sets. For each hyperparameter, we conducted a grid search over a predefined range of candidate values and selected the configuration that yielded the highest validation accuracy. Specifically, for the AMCCL loss weight $\omega$, we tested values in $\{0.01, 0.02, 0.05, 0.1, 0.2\}$ and found $\omega = 0.05$ to provide the best balance between discriminative supervision and classification loss. For the base margin $m_0$, candidates in $\{0.3, 0.4, 0.5, 0.6, 0.7\}$ were evaluated, with $m_0 = 0.5$ achieving optimal inter-class separation. The FreqCA reduction ratio $r$ was selected from $\{4, 8, 16, 32\}$, where $r = 16$ offered an effective trade-off between model capacity and computational efficiency. For the AOF aggregation weights $\beta$, we experimented with uniform distributions, linearly increasing weights, and various asymmetric configurations, ultimately determining that $\beta = [0.2, 0.2, 0.8, 0.8]$ maximizes performance on datasets where discriminative information is concentrated in specific image regions. Table 3 summarizes the sensitivity analysis for key hyperparameters on the mural dataset.
Computational Cost Analysis. To evaluate the practical efficiency of the proposed modules, we measure the computational overhead on the ancient mural dataset using an NVIDIA RTX 3090 GPU. Table 4 presents the training time, inference time, GPU memory consumption, parameter count, and FLOPs when progressively adding each module to the baseline. The results show that FreqCA introduces the largest overhead (+8.2%) due to FFT operations, while CTRA and AOF add minimal costs. AMCCL only affects training and introduces no inference overhead. The total additional overhead is approximately 15% in training time, with only a 2.5% increase in parameters (from 86.4 M to 88.6 M) and a 3.4% increase in FLOPs (from 17.6 G to 18.2 G), which we consider acceptable given the accuracy improvement of 2.20% on the mural dataset.

4.3. Comparison with State-of-the-Art Methods

We compare our method with both CNN-based and ViT-based fine-grained classification approaches. Table 5 presents the classification accuracy on the CUB-200-2011 and Stanford Dogs datasets.
As shown in Table 5, our method achieves the best performance on both datasets. On CUB-200-2011, our approach attains 91.15% accuracy, outperforming the baseline ACC-ViT [33] by 1.35%. On Stanford Dogs, our method achieves 94.57% accuracy, surpassing ACC-ViT by 1.63%. Compared with CNN-based methods, our approach demonstrates substantial improvements, exceeding the best CNN-based method API-Net [35] by 5.15% on CUB-200-2011.
The consistent improvements over the ACC-ViT baseline can be attributed to the three proposed components. The FreqCA module captures frequency-domain texture patterns that complement the spatial-domain features learned by the baseline, which is particularly beneficial for distinguishing fine-grained features such as feather patterns in birds and fur textures in dogs. The CTRA mechanism explicitly models cross-token relationships beyond standard self-attention, enabling the network to establish stronger connections between semantically related image regions. The AMCCL loss function enhances discriminative feature learning by adaptively adjusting inter-class margins based on category similarity, which is crucial for separating visually similar fine-grained categories.

4.4. Results on Ancient Mural Dataset

Table 6 presents the classification results on the ancient mural dataset, comparing our method with the ACC-ViT baseline.
Our method achieves 94.27% accuracy on the ancient mural dataset, outperforming the ACC-ViT baseline by 2.20%. The improvement on the mural dataset is more pronounced compared to the natural image datasets, demonstrating that the proposed components are particularly effective for cultural heritage image analysis.
The superior performance on the mural dataset can be attributed to the FreqCA module’s ability to capture periodic brushwork patterns and texture granularity that reflect dynasty-specific artistic techniques. Ancient murals from different dynasties exhibit distinctive frequency-domain characteristics in terms of line density, color gradients, and decorative motifs. Standard spatial-domain attention mechanisms in the baseline may not adequately capture these frequency-specific patterns, whereas our FreqCA module explicitly models channel-wise importance in the frequency domain, enabling selective emphasis on discriminative frequency components.
Furthermore, the CTRA mechanism facilitates the learning of compositional structures that are characteristic of different dynasty styles. Ancient murals often contain recurring motifs and compositional arrangements that span different image regions. By explicitly modeling pairwise relationships between tokens, CTRA enables the network to establish connections between these related visual elements regardless of their spatial proximity, providing a more holistic understanding of artistic styles.

4.5. Ablation Studies

To analyze the contribution of each proposed component, we conduct comprehensive ablation studies on all three datasets. Table 7 presents the results when progressively adding FreqCA, CTRA, and AMCCL to the ACC-ViT baseline.
Effect of FreqCA. Adding the FreqCA module to the baseline yields improvements of 0.38% on CUB-200-2011, 0.41% on Stanford Dogs, and 0.61% on the mural dataset. The largest improvement is observed on the mural dataset, validating our hypothesis that frequency-domain attention is particularly effective for capturing periodic patterns and textures in artistic images. The FreqCA module transforms spatial features into the frequency domain using Discrete Fourier Transform and applies adaptive channel-wise attention to selectively emphasize channels carrying discriminative frequency components. This mechanism is especially beneficial for ancient murals, as brushwork patterns and texture granularity contain important stylistic information that varies across dynasties.
Effect of CTRA. The CTRA mechanism contributes 0.44% improvement on CUB-200-2011, 0.48% on Stanford Dogs, and 0.71% on the mural dataset when added to the baseline. By constructing a relation-aware feature space where tokens corresponding to related visual elements can establish stronger connections, CTRA facilitates the learning of compositional structures and global patterns beyond standard self-attention. The enhanced relation matrix, computed through learnable transformations of pairwise cosine similarities, provides a relation prior that augments the standard query-key attention computation. This is particularly beneficial for mural classification, where compositional arrangements and recurring motifs are important discriminative cues for distinguishing different dynasty styles.
Effect of AMCCL. The AMCCL loss function provides improvements of 0.32% on CUB-200-2011, 0.34% on Stanford Dogs, and 0.48% on the mural dataset. By combining center loss with adaptive margin contrastive learning, AMCCL simultaneously encourages intra-class compactness and inter-class separability. The adaptive margin mechanism assigns larger margins to similar classes that are harder to distinguish, while allowing smaller margins for clearly separable classes. This flexibility is particularly beneficial for the mural dataset, where categories from adjacent dynasties often share certain stylistic elements due to cultural continuity, requiring the model to learn fine-grained distinctions with appropriate inter-class boundaries.
Synergistic Effects. When combining all three components, the total improvement reaches 1.35% on CUB-200-2011, 1.63% on Stanford Dogs, and 2.20% on the mural dataset. Notably, the combined improvement exceeds the sum of individual improvements, indicating synergistic effects among the proposed components. FreqCA and CTRA capture complementary aspects of discriminative information—frequency-domain textures and cross-token relationships, respectively—while AMCCL provides enhanced supervision that encourages the learning of more discriminative features in the representation space defined by FreqCA and CTRA. The most significant synergy is observed on the mural dataset, where the combination of frequency-domain analysis, cross-token relation modeling, and adaptive margin learning addresses the unique challenges of cultural heritage image classification.

4.6. Visualization Analysis

To qualitatively inspect the model’s behavior, we provide both attention heatmap overlays and a feature-space visualization based on t-SNE.
Attention heatmaps. We report representative attention heatmaps on three datasets, i.e., CUB-200-2011, Stanford Dogs, and the ancient mural dataset. As shown in Figure 4, each example contains the original image and the corresponding attention overlay.
For the ancient mural dataset, the attention heatmaps (Figure 4e,f) reveal that the model effectively identifies discriminative regions at multiple semantic levels. At the global scale, the attention concentrates on semantically significant areas such as central figure compositions, decorative borders, and background patterns that are characteristic of different dynasties. At a finer granularity, the heatmaps highlight regions containing brushwork textures, color transitions, and periodic decorative motifs—precisely the frequency-domain features that our FreqCA module is designed to capture. Furthermore, the attention distribution demonstrates that the CTRA mechanism successfully establishes connections between spatially distributed but semantically related elements, such as recurring artistic motifs across different image regions. These visualization results confirm that the proposed framework captures both holistic compositional structures and subtle stylistic details essential for accurate dynasty-based mural classification.
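The paper does not specify how the overlays in Figure 4 were produced; a common recipe for ViT models is attention rollout, sketched below under that assumption.

```python
import torch


def attention_rollout(attn_maps):
    """Fuse per-layer ViT attention into a single patch heatmap by averaging
    heads, adding the residual path, renormalizing, and multiplying layers."""
    num_tokens = attn_maps[0].size(-1)
    result = torch.eye(num_tokens)
    for attn in attn_maps:                      # each attn: (heads, T, T)
        a = attn.mean(dim=0)                    # average over heads
        a = a + torch.eye(num_tokens)           # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)     # renormalize rows
        result = a @ result
    cls_to_patches = result[0, 1:]              # class-token attention over the N patches
    side = int(cls_to_patches.numel() ** 0.5)   # e.g., 28 for N = 784
    return cls_to_patches.reshape(side, side)   # patch-grid heatmap, to be upsampled for overlay
```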
t-SNE feature-space visualization. To further examine the structure of the learned representations, we visualize high-dimensional image embeddings using t-SNE by projecting them into a 2D space. It should be noted that this visualization depicts the distribution of learned feature vectors in an embedding space; the original mural images remain unaltered throughout this process. The purpose of t-SNE visualization is to reveal how the model internally organizes samples based on discriminative features learned during training.
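A minimal sketch of this projection step is given below. The file names, perplexity, and color map are illustrative placeholders rather than the exact settings used for Figure 5.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical files holding the (num_samples, D) embeddings extracted from
# the trained model and the corresponding integer class labels.
embeddings = np.load("mural_features.npy")
labels = np.load("mural_labels.npy")

proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)

plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="tab20", s=5)
plt.colorbar(label="class index")
plt.title("t-SNE projection of learned embeddings")
plt.savefig("tsne_mural.png", dpi=300)
```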
In Figure 5, samples are colored by their class indices (as indicated by the color bar), and samples from the same class tend to form localized groups. On CUB-200-2011 and Stanford Dogs, many class groups appear relatively compact and are separated by visible gaps, while a subset of groups remain close or partially interleaved, suggesting that these categories share highly similar visual patterns. For the ancient mural dataset, several groups show broader spread and partial overlap, which may reflect stronger intra-class variability and stylistic proximity among certain categories.
To demonstrate the effectiveness of our proposed components, we compare the feature-space organization between the ACC-ViT baseline and our full model. Given identical input images, our method produces notably tighter intra-class clusters and more distinct inter-class boundaries, confirming that the proposed FreqCA, CTRA, and AMCCL modules enhance the discriminative quality of learned representations. We emphasize that t-SNE primarily preserves local neighborhood relations; therefore, the visualization serves as a qualitative diagnostic of feature-space quality rather than a metric-faithful measure of global inter-class distances.

5. Discussion

The experimental results presented in the previous section demonstrate the effectiveness of our proposed framework for fine-grained visual classification, particularly for the challenging task of ancient mural classification. In this section, we discuss the implications of our findings, analyze the behavior of the proposed components, and acknowledge the limitations of our approach.

5.1. Analysis of Frequency-Domain Attention

The FreqCA module consistently improves classification performance across all three datasets, with the most significant gains observed on the ancient mural dataset (0.61% improvement). This observation aligns with our hypothesis that frequency-domain features are particularly informative for images containing rich textural patterns and periodic structures. Ancient murals exhibit distinctive frequency characteristics that reflect dynasty-specific artistic techniques, including brushwork density, decorative motif periodicity, and texture granularity. Unlike natural images in which spatial features often dominate, mural images require explicit modeling of frequency-domain information to capture these subtle stylistic variations.
The relatively smaller improvements on CUB-200-2011 (0.38%) and Stanford Dogs (0.41%) suggest that frequency-domain attention provides complementary rather than dominant benefits for natural object classification. Birds and dogs possess discriminative features that are more readily captured by spatial-domain attention mechanisms, such as distinctive body part shapes and color patterns. Nevertheless, the consistent positive contribution of FreqCA across diverse datasets validates its general applicability beyond cultural heritage images.

5.2. Effectiveness of Cross-Token Relation Modeling

The CTRA mechanism demonstrates strong performance improvements, particularly on the mural dataset (0.71%). This finding supports our design motivation that fine-grained classification benefits from explicit modeling of pairwise token relationships beyond standard self-attention. While self-attention in vision transformers captures global dependencies, it treats all token relationships uniformly without emphasizing semantically meaningful connections.
For ancient murals, compositional structures and recurring motifs often span multiple image regions, requiring the model to establish connections between spatially distant but semantically related elements. The relation-aware attention in CTRA facilitates this by computing pairwise similarity scores and incorporating them as learnable priors in the attention computation. The enhanced performance on mural classification validates that cross-token relation modeling effectively captures the holistic structural patterns characteristic of different dynasty styles.
On natural image datasets, CTRA provides moderate improvements by helping the model establish connections between complementary discriminative parts, such as bird head and tail patterns or dog facial features and body proportions. The consistent benefits across datasets demonstrate that explicit relation modeling is a generally useful inductive bias for fine-grained recognition.

5.3. Role of Adaptive Margin Learning

The AMCCL loss function contributes meaningful improvements across all datasets, with the largest gain on the mural dataset (0.48%). The adaptive margin mechanism is particularly beneficial when dealing with categories that exhibit hierarchical similarity structures. Ancient murals from adjacent dynasties often share certain stylistic elements due to cultural continuity, while maintaining distinct characteristics that enable classification. Fixed-margin losses treat all class pairs equally and may not optimally handle such varying inter-class similarities.
By dynamically adjusting margins based on class center similarities, AMCCL assigns larger separation boundaries to confusable category pairs while allowing smaller margins for clearly distinguishable ones. This flexibility enables more efficient use of the feature space and encourages the model to focus its learning capacity on difficult distinctions. The feature distribution visualizations confirm that AMCCL produces tighter intra-class clustering and clearer inter-class separation compared to standard cross-entropy training.

5.4. Synergistic Effects and Component Interactions

An important finding from our ablation studies is that the combined improvement from all three components exceeds the sum of individual contributions, particularly on the mural dataset. This synergistic effect suggests that FreqCA, CTRA, and AMCCL capture complementary aspects of discriminative information and interact beneficially during training.
FreqCA enriches the feature representation with frequency-domain information, providing additional texture and pattern cues that complement spatial features. CTRA leverages these enriched features to establish meaningful cross-token relationships, enabling holistic structural understanding. AMCCL then provides enhanced supervision that encourages discriminative learning in this augmented feature space. The combination addresses multiple challenges in fine-grained classification simultaneously: texture analysis, structural understanding, and discriminative representation learning.

5.5. Limitations and Future Directions

Despite the promising results, our approach has several limitations that warrant future investigation. First, the FreqCA module introduces additional computational overhead due to Fourier transform operations, which may limit applicability in resource-constrained scenarios. Developing more efficient frequency-domain attention mechanisms could address this limitation.
Second, our evaluation on ancient murals is limited to a single dataset from Dunhuang Grottoes. Validating the approach on mural collections from different geographical regions and artistic traditions would strengthen the generalizability claims for cultural heritage applications.
Third, the current framework focuses on image-level classification without explicit localization of discriminative regions. Extending the approach to provide interpretable attention maps that highlight stylistically significant elements could enhance its utility for art historical analysis and cultural heritage preservation.
Fourth, while our comparative experiments include representative CNN-based and ViT-based methods, the rapid advancement of vision transformers necessitates continuous benchmarking against the latest state-of-the-art approaches. We plan to extend our comparisons to include emerging methods from 2025 and beyond as they become available, ensuring comprehensive and up-to-date experimental validation.
Fifth, ancient murals are often painted on uneven surfaces such as curved walls and cavities, introducing geometric distortions that pose additional challenges for automated analysis. While our current framework does not include explicit geometric rectification, the proposed components provide implicit robustness to moderate geometric variations: the FreqCA module operates in the frequency domain where global texture periodicity is partially preserved despite local spatial deformations; the CTRA mechanism models semantic relationships between tokens that are inherently less sensitive to geometric distortions than pixel-level correspondences; and standard data augmentation techniques (random cropping and resizing) during training enhance tolerance to perspective variations. Nevertheless, developing explicit geometric normalization or distortion-aware attention mechanisms remains an important direction for improving robustness in challenging real-world scenarios where murals exhibit significant surface curvature or structural damage.
Future work could explore the integration of domain-specific knowledge, such as iconographic databases or art historical annotations, to further improve classification accuracy and provide more interpretable results. Additionally, investigating the transferability of learned representations across different fine-grained domains represents a promising research direction.

6. Conclusions

This paper presents a novel vision transformer-based framework for fine-grained visual classification with particular emphasis on ancient mural recognition. We introduce three key innovations that address the fundamental limitations of existing approaches: Frequency Channel Attention (FreqCA) for capturing frequency-domain texture characteristics, Cross-Token Relation Attention (CTRA) for modeling fine-grained pairwise relationships between image regions, and Adaptive Margin Contrastive Center Loss (AMCCL) for enhancing discriminative feature learning with flexible inter-class boundaries.
Comprehensive experiments on CUB-200-2011, Stanford Dogs, and a proprietary ancient mural dataset validate the effectiveness of our approach. Our method achieves 91.15% accuracy on CUB-200-2011, 94.57% on Stanford Dogs, and 94.27% on the mural dataset, consistently outperforming the ACC-ViT baseline and other state-of-the-art methods. Ablation studies demonstrate that each proposed component contributes positively to the overall performance, with synergistic effects observed when combining all components. Specifically, the synergy arises because FreqCA enriches feature representations with frequency-domain texture information (e.g., brushwork periodicity and decorative patterns), CTRA then leverages these enriched features to establish meaningful cross-token semantic relationships that capture compositional structures, and AMCCL provides discriminative supervision that encourages learning in this augmented feature space. As evidence, the combined improvement on the mural dataset (2.20%) exceeds the sum of individual component contributions (0.61% + 0.71% + 0.48% = 1.80%), confirming that these modules complement each other rather than operating independently.
The proposed framework is particularly effective for ancient mural classification, where frequency-domain patterns, compositional structures, and hierarchical category similarities present unique challenges. The improvements in this cultural heritage task demonstrate the potential of advanced computer vision techniques for art historical research and cultural preservation applications. From a practical perspective, our framework can aid conservation decision-making in several ways: (1) automatic dynasty classification enables professionals to prioritize restoration efforts based on historical significance and allocate resources accordingly; (2) accurate style identification assists in selecting historically appropriate materials and techniques for intervention, ensuring that restoration work respects the original artistic tradition; (3) the attention visualization highlights dynasty-specific artistic features, providing interpretable evidence that supports expert judgment in authentication and provenance research; and (4) systematic classification of large mural collections facilitates comprehensive documentation of stylistic evolution across historical periods, contributing to broader cultural heritage databases.
Future work will focus on improving computational efficiency, extending evaluation to diverse mural collections, and incorporating domain-specific knowledge to enhance both accuracy and interpretability. We believe that the proposed techniques provide a solid foundation for advancing fine-grained visual classification in both natural image and cultural heritage domains.

Author Contributions

Conceptualization, L.W. and Z.C.; Data curation, J.L. and J.C.; Formal analysis, Z.C. and X.P.; Funding acquisition, X.P.; Investigation, L.W. and Z.C.; Methodology, X.P.; Project administration, X.P.; Resources, X.P.; Software, L.W., Z.C., J.L., X.P. and J.C.; Supervision, Z.C.; Validation, L.W., J.L. and J.C.; Visualization, L.W.; Writing—original draft, L.W.; Writing—review and editing, Z.C. and X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62471390).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study cannot be made public due to privacy, copyright, or cooperation-agreement restrictions, but they are available from the corresponding author upon reasonable request. The source code is publicly available at https://github.com/wl-stu/FRAM_ViT (accessed on 19 January 2026).

Acknowledgments

The authors are grateful for the support of the National Natural Science Foundation of China (No. 62471390).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACC-ViT: Attention-based Complementary-part Contrastive Vision Transformer.
AMCCL: Adaptive Margin Contrastive Center Loss.
AOF: Adaptive Omni-Focus.
B-CNN: Bilinear Convolutional Neural Network.
CAP: Context-Aware Attentional Pooling.
CNN: Convolutional Neural Network.
CTRA: Cross-Token Relation Attention.
CTI: Complementary Tokens Integration.
DeiT: Data-Efficient Image Transformer.
DETR: Detection Transformer.
DFT: Discrete Fourier Transform.
FFN: Feed-Forward Network.
FFT: Fast Fourier Transform.
FFVT: Feature Fusion Vision Transformer.
FGVC: Fine-Grained Visual Classification.
FreqCA: Frequency Channel Attention.
GPU: Graphics Processing Unit.
LBP: Local Binary Pattern.
MSA: Multi-head Self-Attention.
NTS-Net: Navigator–Teacher–Scrutinizer Network.
PMG: Progressive Multi-Granularity.
RA-CNN: Recurrent Attention Convolutional Neural Network.
RAMS-Trans: Region Attention Multi-Scale Transformer.
ReLU: Rectified Linear Unit.
SGD: Stochastic Gradient Descent.
SIFT: Scale-Invariant Feature Transform.
TransFG: Transformer for Fine-Grained Recognition.
TPSKG: Transformer with Peak Suppression and Knowledge Guidance.
t-SNE: t-distributed Stochastic Neighbor Embedding.
ViT: Vision Transformer.

References

  1. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Fei-Fei, L. Novel Dataset for Fine-Grained Image Categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  2. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Sydney, Australia, 1–8 December 2013; pp. 554–561. [Google Scholar]
  3. Cao, J.; Jia, Y.; Chen, H.; Yan, M.; Chen, Z. Ancient Mural Classification Methods Based on a Multichannel Separable Network. Herit. Sci. 2021, 9, 88. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  5. Wang, Y.; Zhang, X.; Li, Z. Application of a Modified Inception-v3 Model in the Dynasty-Based Classification of Ancient Murals. EURASIP J. Image Video Process. 2021, 2021, 49. [Google Scholar]
  6. Chaudhuri, B.; Nakagawa, M.; Khanna, P.; Kumar, S. (Eds.) Proceedings of 3rd International Conference on Computer Vision and Image Processing: CVIP 2018, Volume 2: 1024 (Advances in Intelligent Systems and Computing); Springer: Singapore, 2019. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  8. Lin, T.-Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN Models for Fine-Grained Visual Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
  9. Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S. Region-Aware Network for Fine-Grained Visual Categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4103–4112. [Google Scholar]
  10. Yang, Z.; Luo, T.; Wang, H.; Cui, Z.; Huang, W.; Li, Y.; Jiang, L. Learning to Navigate for Fine-Grained Classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 560–576. [Google Scholar]
  11. Wang, H.; Song, Y.; Yang, H.; Liu, Z. Generalized Koopman Neural Operator for Data-Driven Modelling of Electric Railway Pantograph-Catenary Systems. IEEE Trans. Transp. Electrif. 2025, 10, 14100–14112. [Google Scholar] [CrossRef]
  12. Wang, X.; Jiang, H.; Zeng, T.; Dong, Y. An Adaptive Fused Domain-Cycling Variational Generative Adversarial Network for Machine Fault Diagnosis under Data Scarcity. Inf. Fusion 2025, 126, 103616. [Google Scholar] [CrossRef]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  14. He, J.; Chen, J.-N.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A Transformer Architecture for Fine-Grained Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 852–860. [Google Scholar]
  15. Liu, X.; Wang, L.; Han, X. Transformer with Peak Suppression and Knowledge Guidance for Fine-Grained Image Recognition. Neurocomputing 2022, 492, 213–225. [Google Scholar] [CrossRef]
  16. Wang, J.; Yu, X.; Gao, Y. Feature Fusion Vision Transformer for Fine-Grained Visual Categorization. In Proceedings of the British Machine Vision Conference (BMVC), Online, 22–25 November 2021. [Google Scholar]
  17. Berg, T.; Belhumeur, P.N. POOF: Part-Based One-vs-One Features for Fine-Grained Categorization, Face Verification, and Attribute Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 955–962. [Google Scholar]
  18. Khan, F.; Weijer, J.; Bagdanov, A.; Vanrell, M. Portmanteau Vocabularies for Multi-Cue Image Representation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Granada, Spain, 12–14 December 2011; Volume 24. [Google Scholar]
  19. Huang, S.; Xu, Z.; Tao, D.; Zhang, Y. Part-Stacked CNN for Fine-Grained Visual Categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1173–1182. [Google Scholar]
  20. Ge, W.; Lin, X.; Yu, Y. Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3029–3038. [Google Scholar]
  21. Luo, W.; Yang, X.; Mo, X.; Lu, Y.; Davis, L.S.; Li, J.; Yang, J.; Lim, S.-N. Cross-X Learning for Fine-Grained Visual Categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8241–8250. [Google Scholar]
  22. Gao, Y.; Han, X.; Wang, X.; Huang, W.; Scott, M. Channel Interaction Networks for Fine-Grained Image Categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10818–10825. [Google Scholar]
  23. Ding, Y.; Zhou, Y.; Zhu, Y.; Ye, Q.; Jiao, J. Selective Sparse Sampling for Fine-Grained Image Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6598–6607. [Google Scholar]
  24. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; Volume 139, pp. 10347–10357. [Google Scholar]
  25. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  26. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  27. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple Object Tracking with Transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  28. Liu, F.; Zheng, Q.; Tian, X.; Shu, F.; Jiang, W.; Wang, M.; Elhanashi, A.; Saponara, S. Rethinking the Multi-Scale Feature Hierarchy in Object Detection Transformer (DETR). Appl. Soft Comput. 2025, 175, 113081. [Google Scholar] [CrossRef]
  29. Behera, A.; Wharton, Z.; Hewage, P.; Bera, A. Context-Aware Attentional Pooling (CAP) for Fine-Grained Visual Classification. arXiv 2021, arXiv:2101.06635. [Google Scholar] [CrossRef]
  30. Song, C.; Huang, Y.; Ouyang, W.; Wang, L. Mask-Guided Contrastive Attention Model for Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1179–1188. [Google Scholar]
  31. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  32. Gao, L.; Cui, L.; Chen, S.; Deng, L.; Wang, X.; Yan, X.; Zhu, H. AMTrans: Auto-Correlation Multi-Head Attention Transformer for Infrared Spectral Deconvolution. Tsinghua Sci. Technol. 2024, 30, 1329–1341. [Google Scholar] [CrossRef]
  33. Zhang, Z.-C.; Chen, Z.-D.; Wang, Y.; Luo, X.; Xu, X.-S. A Vision Transformer for Fine-Grained Classification by Reducing Noise and Enhancing Discriminative Information. Pattern Recognit. 2024, 145, 109979. [Google Scholar] [CrossRef]
  34. Du, Y.; Chang, D.; Bhunia, A.K.; Xie, J.; Song, Y.-Z.; Ma, Z.; Guo, J. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 153–168. [Google Scholar]
  35. Zhuang, P.; Wang, Y.; Qiao, Y. Learning Attentive Pairwise Interaction for Fine-Grained Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13130–13137. [Google Scholar]
  36. Zhang, N.; Zhang, Y.; Yu, W.; Zhu, J. RAMS-Trans: Multi-Scale Region Attention Transformer for Fine-Grained Image Classification. IEEE Access 2022, 10, 12345–12356. [Google Scholar]
Figure 1. Overall architecture and information flow of the proposed FRAM-ViT framework. The input image $I \in \mathbb{R}^{H \times W \times 3}$ is first processed by the FreqCA module for frequency-domain enhancement, then partitioned into patch embeddings $Z_0 \in \mathbb{R}^{(N+1) \times D}$. The token sequence passes through 12 transformer encoder layers; class tokens from layers 10–12 are extracted for CTI-based multi-layer classification (Heads 10–12). The final layer outputs undergo Token Selection to partition patches into importance groups, which are processed by the AOF Block with CTRA to produce group-specific representations. These are aggregated to yield the final classification probabilities $p_{final} \in \mathbb{R}^{C_{cls}}$.
Figure 2. Schematic illustration of the Frequency Channel Attention (FreqCA) module.
Figure 3. Schematic illustration of the Cross-Token Relation Attention (CTRA) module.
Figure 4. Attention heatmaps generated by the proposed model on three datasets. Each row shows the original image (left) and its attention overlay (right).
Figure 5. t-SNE projections of learned embeddings on three datasets (colored by class index).
Table 1. Comparative summary of representative methods. Columns indicate the presence (✓) or absence (×) of Freq. = frequency-domain modeling; C-R. = cross-token relation; M-L = multi-layer integration; A-M = adaptive margin.
| Method | Strengths | Limitations | Freq. | C-R. | M-L | A-M |
| --- | --- | --- | --- | --- | --- | --- |
| B-CNN [8] | Bilinear pooling for second-order interactions | High cost; limited global context | × | × | × | × |
| RA-CNN [9] | Iterative multi-scale attention | Sequential; local receptive field | × | × | × | × |
| TransFG [14] | Part selection via attention | Single-layer; no frequency | × | × | × | × |
| FFVT [16] | Multi-layer token fusion | Uniform aggregation; no relation | × | × | ✓ | × |
| DETR-MS [28] | Multi-scale feature pyramid | Detection-oriented; not for FGVC | × | × | ✓ | × |
| AMTrans [32] | Auto-correlation attention | Domain-specific (spectral) | ✓ | ✓ | × | × |
| ACC-ViT [33] | Complementary part learning | Limited cross-token; spatial only | × | × | ✓ | × |
| Ours | Integration of all components | Increased overhead | ✓ | ✓ | ✓ | ✓ |
Table 2. Summary of notation used in the methodology.
| Symbol | Description | Dimension/Value |
| --- | --- | --- |
| *Input and Patch Embedding* | | |
| $I$ | Input image | $\mathbb{R}^{H \times W \times 3}$ |
| $H, W$ | Image height and width | 448 pixels |
| $P$ | Patch size | 16 |
| $N$ | Number of patches | $(H/P) \times (W/P) = 784$ |
| $x_i$ | $i$-th flattened patch | $\mathbb{R}^{P^2 \times 3}$ |
| $E$ | Patch embedding projection matrix | $\mathbb{R}^{(P^2 \cdot 3) \times D}$ |
| $E_{pos}$ | Positional embeddings | $\mathbb{R}^{(N+1) \times D}$ |
| $z_{cls}$ | Learnable class token | $\mathbb{R}^{D}$ |
| $Z_0$ | Initial token sequence | $\mathbb{R}^{(N+1) \times D}$ |
| $D$ | Embedding dimension | 768 |
| $L$ | Number of transformer layers | 12 |
| *FreqCA Module* | | |
| $F$ | Input feature maps | $\mathbb{R}^{H \times W \times C}$ |
| $\mathcal{F}(\cdot)$ | 2D Discrete Fourier Transform | |
| $F_{re}, F_{im}$ | Real and imaginary components | $\mathbb{R}^{H \times W \times C}$ |
| $F_{mag}$ | Magnitude spectrum | $\mathbb{R}^{H \times W \times C}$ |
| $f_{avg}, f_{max}$ | Channel-wise pooled descriptors | $\mathbb{R}^{C}$ |
| $a_{freq}$ | Frequency channel attention weights | $\mathbb{R}^{C}$ |
| $W_1, W_2$ | FC layer weights in FreqCA | $\mathbb{R}^{C/r \times 2C}$, $\mathbb{R}^{C \times C/r}$ |
| $r$ | Reduction ratio | 16 |
| *CTRA Module* | | |
| $Z$ | Input token sequence to CTRA | $\mathbb{R}^{T \times D}$ |
| $T$ | Total number of tokens | $N + 1$ |
| $g$ | Global average pooled descriptor | $\mathbb{R}^{D}$ |
| $a_g$ | Global (channel-context) gate | $\mathbb{R}^{D}$ |
| $A_l$ | Local (token-wise) gate | $\mathbb{R}^{T \times D}$ |
| $\phi_g, \phi_l$ | Learnable transformations (MLP) | |
| *AOF Block and CTI* | | |
| $Z_L$ | Token representations from layer $L$ | $\mathbb{R}^{(N+1) \times D}$ |
| $A_{cls}$ | Class token attention weights | $\mathbb{R}^{N}$ |
| $G_k$ | $k$-th token group | Index set |
| $K$ | Number of token groups | 4 |
| $\tau_k$ | Threshold for group partitioning | Adaptive quantiles |
| $z_k$ | Group-specific representation | $\mathbb{R}^{D}$ |
| $p_k$ | Prediction from $k$-th head | $\mathbb{R}^{C_{cls}}$ |
| $\beta_k$ | Aggregation weight for group $k$ | Scalar |
| $p_{final}$ | Final aggregated prediction | $\mathbb{R}^{C_{cls}}$ |
| $\mathcal{L}$ | Selected layers for CTI | $\{10, 11, 12\}$ |
| $\alpha_{l_i}$ | Layer-specific loss weight | Scalar |
| *AMCCL Loss* | | |
| $f_i$ | Feature representation for sample $i$ | $\mathbb{R}^{D}$ |
| $y_i$ | Ground-truth label for sample $i$ | $\{1, \ldots, C_{cls}\}$ |
| $c_j$ | Class center for class $j$ | $\mathbb{R}^{D}$ |
| $B$ | Batch size | 10 |
| $\mu$ | Momentum for center update | 0.9 |
| $m_{ij}$ | Adaptive margin between classes | Scalar |
| $m_0$ | Base margin | 0.5 |
| $\gamma$ | Scaling factor for margin | 2.0 |
| $\xi$ | Balance weight for contrastive loss | 1.0 |
| $\omega$ | Weight for AMCCL in total loss | 0.05 |
| $C_{cls}$ | Number of classes | Dataset-dependent |
Note: Italic text/symbols indicate variables used in equations.
Table 3. Hyperparameter sensitivity analysis on the ancient mural dataset. Bold values indicate the selected configurations.
| Hyperparameter | Candidate Values and Accuracy (%) |
| --- | --- |
| $\omega$ (AMCCL weight) | 0.01: 93.52; 0.02: 93.78; **0.05: 94.27**; 0.1: 93.95; 0.2: 93.41 |
| $m_0$ (base margin) | 0.3: 93.65; 0.4: 93.89; **0.5: 94.27**; 0.6: 94.01; 0.7: 93.72 |
| $r$ (FreqCA reduction) | 4: 93.85; 8: 94.05; **16: 94.27**; 32: 93.92 |
| $\beta$ configuration | [1, 1, 1, 1]: 93.68; **[0.2, 0.2, 0.8, 0.8]: 94.27**; [0.1, 0.2, 0.3, 0.4]: 93.95 |
Table 4. Computational cost analysis on the ancient mural dataset.
| Configuration | Training (ms/ep) | Inference (ms/img) | Memory (GB) | Params (M) | FLOPs (G) |
| --- | --- | --- | --- | --- | --- |
| ACC-ViT (baseline) | 5.2 | 16.8 | 7.4 | 86.4 | 17.6 |
| +FreqCA | 7.1 | 18.2 | 7.9 | 86.9 | 17.9 |
| +CTRA | 7.5 | 18.7 | 8.1 | 87.4 | 18.0 |
| +AOF | 9.2 | 19.1 | 8.3 | 88.6 | 18.2 |
| +AMCCL | 6.0 | 19.1 | 8.5 | 88.6 | 18.2 |
| Ours (Full) | 9.7 | 19.1 | 8.5 | 88.6 | 18.2 |
Table 5. Comparison of classification accuracy (%) with state-of-the-art methods on the CUB-200-2011 and Stanford Dogs datasets. The best results are shown in bold.
| Method | Backbone | CUB-200-2011 | Stanford Dogs |
| --- | --- | --- | --- |
| *CNN-based Methods* | | | |
| B-CNN [8] | VGG-16 | 84.1 | – |
| RA-CNN [9] | VGG-19 | 85.3 | 87.3 |
| NTS-Net [10] | ResNet-50 | 86.5 | – |
| Cross-X [21] | ResNet-50 | 86.7 | 88.9 |
| PMG [34] | ResNet-50 | 85.6 | 89.9 |
| API-Net [35] | DenseNet-161 | 86.0 | 89.4 |
| *ViT-based Methods* | | | |
| ViT [13] | ViT-B/16 | 87.82 | 92.40 |
| TransFG [14] | ViT-B/16 | 89.7 | 92.3 |
| TPSKG [15] | ViT-B/16 | 89.3 | 92.5 |
| FFVT [16] | ViT-B/16 | 89.6 | 91.5 |
| RAMS-Trans [36] | ViT-B/16 | 89.3 | 92.4 |
| ACC-ViT [33] | ViT-B/16 | 89.80 | 92.94 |
| Ours | ViT-B/16 | **91.15** | **94.57** |
Note: Italic text indicates the category of methods.
Table 6. Comparison of classification accuracy (%) on the ancient mural dataset.
| Method | Backbone | Accuracy |
| --- | --- | --- |
| ViT [13] | ViT-B/16 | 89.45 |
| TransFG [14] | ViT-B/16 | 90.72 |
| FFVT [16] | ViT-B/16 | 90.33 |
| ACC-ViT [33] | ViT-B/16 | 92.07 |
| Ours | ViT-B/16 | 94.27 |
Table 7. Ablation study on the contribution of each proposed component in terms of accuracy (%). FreqCA: Frequency Channel Attention; CTRA: Cross-Token Relation Attention; AMCCL: Adaptive Margin Contrastive Center Loss.
| FreqCA | CTRA | AMCCL | CUB-200-2011 | Stanford Dogs | Mural |
| --- | --- | --- | --- | --- | --- |
| | | | 89.80 | 92.94 | 92.07 |
| ✓ | | | 90.18 | 93.35 | 92.68 |
| | ✓ | | 90.24 | 93.42 | 92.78 |
| | | ✓ | 90.12 | 93.28 | 92.55 |
| ✓ | ✓ | | 90.72 | 94.05 | 93.58 |
| ✓ | | ✓ | 90.58 | 93.86 | 93.32 |
| | ✓ | ✓ | 90.65 | 93.95 | 93.45 |
| ✓ | ✓ | ✓ | **91.15** | **94.57** | **94.27** |
Note 1: The checkmark denotes the inclusion of the corresponding module. Note 2: Bold values represent the optimal results obtained in this study.
