Article

Multi-Scale Attention-Driven Hierarchical Learning for Fine-Grained Visual Categorization

Graduate School of Artificial Intelligence and Science, Rikkyo University, Tokyo 171-8501, Japan
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2869; https://doi.org/10.3390/electronics14142869
Submission received: 9 June 2025 / Revised: 2 July 2025 / Accepted: 15 July 2025 / Published: 18 July 2025
(This article belongs to the Special Issue Advances in Machine Learning for Image Classification)

Abstract

Fine-grained visual categorization (FGVC) presents significant challenges due to subtle inter-class variation and substantial intra-class diversity, often leading to limited discriminative capacity in global representations. Existing methods inadequately capture localized, class-relevant features across multiple semantic levels, especially under complex spatial configurations. To address these challenges, we introduce a Multi-scale Attention-driven Hierarchical Learning (MAHL) framework that iteratively refines feature representations via scale-adaptive attention mechanisms. Specifically, fully connected (FC) classifiers are applied to spatially pooled feature maps at multiple network stages to capture global semantic context. The learned FC weights are then projected onto the original high-resolution feature maps to compute spatial contribution scores for the predicted class, serving as attention cues. These multi-scale attention maps guide the selection of discriminative regions, which are hierarchically integrated into successive training iterations to reinforce both global and local contextual dependencies. Moreover, we explore a generalized pooling operation that parametrically fuses average and max pooling, enabling richer contextual retention in the encoded features. Comprehensive evaluations on benchmark FGVC datasets demonstrate that MAHL consistently outperforms state-of-the-art methods, validating its efficacy in learning robust, class-discriminative, high-resolution representations through attention-guided hierarchical refinement.

1. Introduction

Fine-grained visual classification (FGVC) is a specialized subfield of computer vision that focuses on distinguishing among visually similar subordinate categories [1], such as bird or cat species, car models, or fashion item variants. Unlike generic image classification, which differentiates between distinct object categories, FGVC requires recognizing subtle inter-class differences while accounting for significant intra-class variation. This makes FGVC inherently more difficult and computationally demanding. Due to its domain-specific importance, FGVC has broad applications in fields such as biodiversity and ecological monitoring [2], intelligent transportation and automotive manufacturing [3], fashion recommendation [4], and retail product recognition [5,6].
Over the past decade, deep learning methodologies have been widely employed in FGVC and have demonstrated significant progress in terms of recognition performance [5,7,8,9,10,11,12,13,14,15], benefiting from the design of sophisticated architectures such as convolutional neural networks (CNNs) [16,17,18] and vision transformers (ViTs) [19,20,21,22]. The existing architectures [17,20,21,23] generally learn hierarchical feature representations to capture increasingly abstract visual patterns from shallow to deep layers. Despite this capability of capturing diverse features, many FGVC methods rely exclusively on the final layer’s globally aggregated feature vector for classification. This design choice assumes that the deepest layers encapsulate all the necessary information for category discrimination. However, this assumption often fails in FGVC tasks, where class-specific differences are localized and subtle, for example, fine texture variations in bird feathers or small structural details in similar vehicle models. Ignoring mid- and low-level features may, therefore, result in the loss of informative patterns learned in earlier layers, potentially degrading classification performance. In addition, the global aggregation operation further suppresses the fine-grained and spatially localized information that is often crucial for distinguishing between visually similar categories, thereby diminishing the model’s ability to differentiate between closely related subcategories, especially in the FGVC task. Moreover, existing global aggregation models usually employ either average pooling (AP) [24] or max pooling (MP) [23] to condense high-dimensional spatial feature maps into fixed-size representations. Although computationally efficient and conducive to end-to-end training, these operations introduce inherent trade-offs that can hinder the model’s ability to capture discriminative visual cues, particularly in the context of FGVC.
To address the aforementioned limitations in existing FGVC methods, we propose a novel framework called Multi-scale Attention-driven Hierarchical Learning (MAHL). The consensus mechanism of the MAHL is realized through training refinement based on dynamically learned discriminative regions guided by attention. At multiple semantic stages within the network, class-specific attention maps are generated to localize discriminative regions, which, in turn, guide the feature learning process at each stage. This design allows the model to progressively refine its focus on task-relevant visual cues during training. Specifically, MAHL integrates multi-scale features extracted from multiple convolutional blocks of a CNN backbone (e.g., ResNet-50 [23]), enabling the model to combine both low-level textures and high-level semantic context for classification. To enhance discriminative capability, we incorporate attention mechanisms that dynamically identify and emphasize informative spatial regions at each scale. These attention-guided regions highlight salient local patterns, such as the part-specific structures or fine texture differences that are crucial for distinguishing visually similar subcategories. The resulting class-relevant attention maps are then used to guide progressive feature refinement across training stages. By recursively reintroducing these attended regions during hierarchical learning, MAHL supports iterative improvement in both feature focus and contextual understanding. In addition, we exploit a simple yet generalized adaptive pooling strategy to selectively aggregate the top-ranked activations based on contribution scores, offering a middle ground between max and average pooling. The proposed pooling mechanism potentially retains salient visual patterns while suppressing irrelevant background information, leading to a more robust and expressive representation. We validate the proposed MAHL framework on multiple FGVC benchmark datasets. The experimental results show that our method consistently outperforms existing state-of-the-art (SOTA) approaches, confirming the effectiveness of multi-scale attention and adaptive pooling in fine-grained recognition.
In summary, our proposed MAHL framework offers several key advantages to elaborately address critical limitations in existing FGVC approaches:
(1) Multi-scale hierarchical feature integration: We leverage hierarchical representations by integrating features from multiple intermediate and deep layers of the backbone network. This strategy enables the model to capture both low-level visual details and high-level semantic context for distinguishing between fine-grained categories that exhibit subtle inter-class variations.
(2) Attention-driven region localization and progressive training: We incorporate a dynamic attention mechanism that identifies class-relevant spatial regions across different scales, and then reintroduce the attention-driven localized discriminating areas for progressive training refinement. With this strategy, MAHL potentially enhances the precision of feature learning and improves its ability to localize subtle but informative visual cues, ultimately leading to higher classification accuracy.
(3) Generalized adaptive pooling: We introduce a pooling method that adaptively combines the top-ranked activations, achieving better spatial information retention and stronger feature discriminability for fine-grained classification.

2. Related Work

This section offers a comprehensive review of prior work relevant to the proposed Multi-scale Attention-driven Hierarchical Learning (MAHL) framework for fine-grained visual classification (FGVC) tasks, and elucidates how these studies have informed and motivated its development.

2.1. Attention Mechanism

Attention mechanisms have been widely incorporated in modern convolutional architectures for various visual tasks. For example, the squeeze-and-excitation (SE) block [25] explicitly models channel-wise dependencies to adaptively recalibrate feature maps, and the convolutional block attention module (CBAM) [26] sequentially introduces channel and spatial attention to adaptively reweight features. These foundational attention modules enhance representational power by highlighting informative feature channels and locations while suppressing irrelevant responses [25,26,27]. In practice, such global attention mechanisms serve as building blocks for many FGVC models by guiding the network to focus on discriminative image regions. In fine-grained categorization, attention is especially valuable for coping with subtle inter-class differences and large intra-class variations [28]. Numerous FGVC methods integrate attention to localize class-specific parts [13,29]. For instance, Fu et al. [29] exploited a recurrent attention CNN (RA-CNN) that recursively learns to crop and refine discriminative regions in a coarse-to-fine manner via an attention proposal network (APN). Similarly, Yang et al. [13] proposed a navigator–teacher–scrutinizer pipeline with a navigator agent to obtain the most informative regions and a scrutinizer to classify these attended patches. Zhao et al. [28] proposed a diversified visual attention network (DVAN) that explicitly pursues multiple distinct attentions to capture a wide range of discriminative cues. All the above works illustrate how recurrent or multi-agent attention schemes can sequentially focus on different object parts, thereby amplifying class-specific cues and mitigating confusion among visually similar classes.
Beyond explicit cropping, many FGVC models use internal attention modules to automatically learn important features along the channel or spatial dimensions. For example, Lu et al. [30] investigated a hybrid attention block that applies both channel and spatial attention to highlight key feature maps and image regions simultaneously. Specifically, this model employs an attention erasing strategy to remove the most salient region while focusing on discovering secondary discriminative parts, encouraging the network to cover multiple class-discriminative regions. Multi-scale attention [31,32] has also been explored to generate attention maps at different image scales via pyramid or multi-branch architectures. By leveraging multi-scale attentions for part localization, these methods potentially reduce intra-class variance and ensure that subtle distinguishing features are consistently captured across viewpoints.
Recently, transformer-based attention [19,20,21,33] has been adapted to FGVC tasks and has shown strong performance on FGVC benchmarks [34]. For example, Conde et al. [34] proposed a multi-stage ViT that localizes informative regions with the inherent multi-head attention of each stage without architectural changes. Si et al. [34] introduced specialized token-level attention for FGVC and proposed a token-selective transformer (TSVT) that prunes irrelevant tokens at each layer, thereby restricting attention to the most discriminative patches. Compared to these prior methods, the proposed MAHL framework adopts a unified multi-scale, hierarchical attention strategy, and achieves attention via class activation maps to crop explainable discriminative cues at each scale, enabling the network to focus on the most relevant object parts.

2.2. Multi-Scale Feature Extraction

Multi-scale feature extraction plays a vital role in FGVC owing to its capability of capturing both coarse object structure and subtle local details to distinguish between highly similar classes. Conventional CNN-based models naturally yield hierarchical features at successive layers, although these are underutilized because only the final-scale aggregation is used for classification [16,17,23]. Early multi-scale architectures, such as inception networks [24], integrate features using parallel branches with varying receptive fields, while feature pyramid networks (FPN) [35] construct explicit top-down pathways that combine low- and high-level features across spatial resolutions. In FGVC, leveraging multi-scale features for coarse-to-fine learning, as in RA-CNN [29] and TASN [36], has been widely exploited by recursively focusing on image sub-regions or generating multi-granularity part-based representations. Similarly, Du et al. [15] investigated a progressive training strategy over increasingly fine-grained image crops to refine classification in stages.
Recent research has shifted towards adaptive or learnable multi-scale feature extraction, especially with the emergence of attention-based mechanisms and transformer architectures. For instance, Ding et al. [37] generated region proposals at multiple levels using soft masks and integrated them via semantic alignment. Zhuang et al. [38] introduced pairwise relational attention to guide feature selection across scales, emphasizing discriminative cues through instance-level alignment. Luo et al. [39] proposed semantically enhanced feature learning by fusing object-level and category-level semantics across layers to guide representation learning, while Wang et al. [40] utilized a graph-propagation mechanism to enhance inter-region correlations across weakly supervised regions, offering an implicit multi-scale relational structure. In addition, DFGMM [41] incorporated Gaussian mixture models to perform discriminative part-level learning with spatial regularity, implicitly encoding multi-resolution patterns. Many other methods with strong performance across standard benchmarks have also been investigated [39]. For instance, Lin et al. explored bilinear CNNs (BCNN) [42], while Zhou et al. introduced self-supervised structure modeling [43] but overlooked localized, scale-adaptive details. Du et al. [15] employed fixed patch granularity, limiting adaptability. Yao et al. [44] leveraged dilated convolution to adjust receptive fields yet lacked guided spatial refinement, while Niu et al. [45] proposed attention shift-based DNN but did not provide iterative, hierarchical processing. More recently, some methods have focused on architectural optimization [46], data augmentation [47], or concept-guided learning [48]. In contrast, MAHL unifies classifier-driven attention with dynamic, multi-scale feature refinement and hierarchical integration, offering more effective extraction of discriminative visual patterns.
With the rise of vision transformer [19], hierarchical token-based architectures have become dominant in multi-scale representation learning. The Swin transformer [21,22] and pyramid vision transformer [33] introduce window-based and pyramid-level self-attention, respectively, enabling dense representations at multiple spatial resolutions. These have inspired recent FGVC-specific transformers such as TransFG [49] and TransFGVC [50]. While the existing methods succeed in capturing features at multiple levels, they often lack a class-specific guidance mechanism for region selection. In contrast, the proposed MAHL framework adopts a multi-scale attention-driven approach in which spatial regions are adaptively selected based on class-discriminative contribution scores derived from fully connected classifiers at different stages. Moreover, MAHL leverages these contribution cues to explicitly guide the hierarchical refinement of feature representations at each scale. This enables a tight coupling between multi-scale features and semantic relevance, yielding more robust and interpretable models for FGVC.

2.3. Pooling Methods

Pooling operations, such as max pooling and average pooling, have traditionally been employed in CNN models to reduce the spatial dimensionality of feature maps [16,17,23]. This reduction not only alleviates computational burdens but also imparts a degree of translation invariance by summarizing activations across local or global regions, serving as an integral component of compact feature representation. However, in FGVC, where intra-class variation is high and inter-class differences are subtle, pooling strategies have to be adapted to preserve discriminative local features while maintaining computational efficiency. To this end, several works have proposed advanced pooling mechanisms tailored for FGVC [42,51,52,53]. For example, bilinear pooling [42] was introduced to model pairwise feature interactions across spatial locations, which significantly enriches the representational capacity by capturing localized feature co-occurrences. He et al. [53] proposed spatial pyramid pooling, which partitions the feature map at multiple scales and pools independently across each region, thereby preserving spatial hierarchies and enabling the handling of arbitrary input sizes. Additionally, second-order pooling techniques [54] and attention-guided pooling strategies [55] have been explored to capture higher-order dependencies or emphasize class-relevant regions. While these methods have demonstrated strong performance, they often incur substantial computational overhead or require complex architectural modifications. In contrast, we propose a simple yet effective pooling strategy that integrates the advantages of both average pooling and max pooling.
In summary, we provide a comparative table in Table 1 that summarizes existing methods with strong performance across standard benchmarks, while clearly highlighting how the proposed MAHL differs through its unified, attention-driven, multi-scale, and hierarchical refinement framework.

3. Proposed Multi-Scale Attention-Driven Hierarchical Learning (MAHL)

This section first presents an overview of the proposed MAHL framework for the FGVC task, and then introduces the contributed components in the proposed pipeline.

3.1. Overview

This study aims to exploit a novel framework, termed Multi-scale Attention-driven Hierarchical Learning (MAHL), designed to effectively leverage multi-scale features for FGVC task while dynamically identifying and localizing discriminative local regions through automatically learned attention maps. The key objective of MAHL is to enhance feature representation by integrating hierarchical cues and progressively refining the model’s focus on informative regions during training. Specifically, MAHL exploits the inherent hierarchical architecture of deep convolutional networks such as ResNet [23] to extract feature representations from multiple intermediate and deep layers. These layers capture complementary visual information, ranging from fine-grained textures and edge patterns in shallow layers to high-level semantic concepts in deeper layers. To fully exploit these multi-scale representations, multiple classification heads, each composed of fully connected (FC) layers, are attached to the spatially aggregated feature vectors from each individual block, as well as from the fused multi-scale feature map. This structure enables the model to generate a set of classification predictions at varying semantic levels. Following this, the learned classifier parameters (from the FC layers) associated with the predicted class, i.e., the class with the highest softmax probability, are employed to compute contribution scores across the unaggregated spatial feature maps. These contribution scores serve as attention maps, highlighting the spatial regions that contribute most significantly to the classification decision. By generating attention maps at multiple scales, the model can localize discriminative regions at varying resolutions and contextual depths. Finally, during the training phase, the identified multi-scale attention maps are used to crop discriminative regions, which are then reintroduced into the backbone network for further processing. This hierarchical refinement enables the model to iteratively improve its feature learning and focus more precisely on the salient visual patterns relevant to the classification task. Through this end-to-end progressive learning strategy, MAHL enhances the model’s ability to capture subtle inter-class differences, leading to more robust and accurate fine-grained recognition. The overall structure of our proposed MAHL is demonstrated in Figure 1, which outlines the key components of our approach, including multi-scale feature extraction, attention-guided region localization, and hierarchical supervision through intermediate classification heads.
Concretely, we employ the ResNet50 [23] as the backbone architecture, which is composed of an initial stem layer followed by four hierarchical convolutional blocks. Given an input image $I \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ denote the image height and width, respectively, the backbone yields multi-scale feature representations through its four blocks: $X_1 \in \mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$, $X_2 \in \mathbb{R}^{2C \times \frac{H}{8} \times \frac{W}{8}}$, $X_3 \in \mathbb{R}^{4C \times \frac{H}{16} \times \frac{W}{16}}$, and $X_4 \in \mathbb{R}^{8C \times \frac{H}{32} \times \frac{W}{32}}$. In conventional classification pipelines, a fully connected (FC) layer is typically applied to a spatially aggregated feature vector, $\hat{X}_4 = f_{SA}(X_4)$, derived from the final block, where $\hat{X}_4 \in \mathbb{R}^{8C}$, to produce the final classification output. In contrast, our proposed MAHL framework exploits multi-scale features extracted from all convolutional blocks. It constructs independent FC-based classifiers for different scales as well as their fused representations, thereby enabling the generation of multiple classification predictions, which is expressed as follows:
$$P^{s} = f_{FC}^{s}(\hat{X}_{s}) = f_{FC}^{s}(f_{SA}(X_{s})), \qquad P^{a} = f_{FC}^{a}(\hat{X}_{a}) = f_{FC}^{a}(f_{concat}([\hat{X}_{s}])),$$
where $f_{FC}^{*}$ refers to the FC-based classifier applied either to the feature representation at scale $s$ or to the concatenated feature vector across multiple scales, while $P \in \mathbb{R}^{K}$ represents the resulting probability prediction for the $K$-class task.
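For clarity, the following PyTorch sketch illustrates how per-scale and fused classifiers of this form can be attached to a ResNet-50 backbone. It is a minimal sketch, not the released implementation: the head width of 512, the use of plain global average pooling as $f_{SA}$, and the choice of scales $X_2$ to $X_4$ are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiScaleClassifiers(nn.Module):
    """Minimal sketch: per-scale FC heads P^s plus a fused head P^a on concatenated features."""

    def __init__(self, num_classes, hidden=512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        channels = (512, 1024, 2048)                 # output channels of X2, X3, X4 in ResNet-50
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(c, hidden), nn.ReLU(inplace=True),
                          nn.Linear(hidden, num_classes))
            for c in channels
        ])
        self.fused_head = nn.Sequential(nn.Linear(sum(channels), hidden), nn.ReLU(inplace=True),
                                        nn.Linear(hidden, num_classes))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)                          # X1 .. X4
        used = feats[1:]                             # X2, X3, X4 (the Scale3 setting)
        pooled = [f.mean(dim=(2, 3)) for f in used]  # f_SA: spatial aggregation (GAP here)
        per_scale = [head(p) for head, p in zip(self.heads, pooled)]   # P^s for each scale
        fused = self.fused_head(torch.cat(pooled, dim=1))              # P^a on the fused vector
        return per_scale + [fused], used
```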
Although multi-scale features are incorporated into classification prediction, the use of global spatial pooling inevitably weakens the discriminative power of features that are localized to specific regions. To mitigate this limitation, we introduce two complementary strategies. The first is a dynamic localization mechanism that operates via progressively discovering and utilizing discriminative local regions with the automatically obtained attention maps that strongly influence the prediction score during training, dubbed attention-driven region localization (ADRL). Then, the localized regions are further forwarded to the backbone to refine the training process in a hierarchical manner. The second strategy is a generalized spatial pooling method that selectively aggregates highly activated features, thereby retaining the critical spatial information crucial for fine-grained visual classification. Next, we will introduce the proposed ADRL, the generalized spatial pooling method, and the hierarchical learning with the localized regions.

3.2. Attention-Driven Region Localization (ADRL)

As previously mentioned, the probability vector $P^{s} \in \mathbb{R}^{K}$ over $K$ classes is obtained by applying a classification (CLS) head on the spatially aggregated feature $\hat{X}_{s}$, which is the result of global average pooling over the output feature map from the $s$-th convolutional block. The CLS head comprises two fully connected (FC) layers: The first maps the feature to an intermediate representation, and the second projects it to the final class scores. For notational simplicity, we omit bias terms, batch normalization (BN), and activation functions in the following formulation. Let $W_{s}^{1} \in \mathbb{R}^{C_{s}^{1} \times C_{s}}$ be the weight matrix of the first FC layer, where $C_{s}$ is the number of channels in $\hat{X}_{s}$, and $C_{s}^{1}$ is the dimension of the intermediate representation. Let $W_{s}^{2} \in \mathbb{R}^{K \times C_{s}^{1}}$ denote the weight matrix of the second FC layer. The classification output at scale $s$ without considering the bias term is thus computed as
$$P^{s} = W_{s}^{2} W_{s}^{1} \hat{X}_{s},$$
where the output $P^{s}$ is passed through a softmax function to obtain the final class probabilities.
Next, we identify the currently recognized class by selecting the one with the highest predicted probability, denoted as $P_{l}^{s} = \max(P^{s})$, indicating that the input image is most likely classified into class $l$ at scale $s$. Formally, the predicted class label $l$ at scale $s$ is determined by identifying the index corresponding to the highest probability score in the prediction vector $P^{s}$. This can be expressed as
$$l = \arg\max_{k \in \{1, \ldots, K\}} P_{k}^{s},$$
where $P_{k}^{s}$ denotes the predicted probability for class $k$. Then, we extract the weight vector $W_{s}^{2,l} \in \mathbb{R}^{1 \times C_{s}^{1}}$ corresponding to the predicted class $l$ from the second FC layer, along with the full weight matrix $W_{s}^{1}$ of the first FC layer. Rather than operating on the spatially aggregated feature $\hat{X}_{s}$, we apply a point-wise convolution directly on the original feature map $X_{s} \in \mathbb{R}^{C_{s} \times H_{s} \times W_{s}}$ using these weights, thereby retaining spatial resolution and enabling specific attention of the $l$-th class over the feature map. Initially, we compute a temporary contribution score map $A_{s}$ to estimate the spatial importance of each location in the feature map $X_{s}$, which is calculated as
$$A_{s} = f_{PW}(f_{PW}(X_{s}, W_{s}^{1}), W_{s}^{2,l}),$$
where $W_{s}^{1}$ and $W_{s}^{2,l}$ are extended into 4-D tensors for point-wise convolution operations, with dimensions $C_{s}^{1} \times C_{s} \times 1 \times 1$ and $1 \times C_{s}^{1} \times 1 \times 1$, respectively. Specifically, the first convolution, parameterized by $W_{s}^{1} \in \mathbb{R}^{C_{s}^{1} \times C_{s} \times 1 \times 1}$, projects the input feature map $X_{s}$ from $C_{s}$ input channels to $C_{s}^{1}$ intermediate channels. The second convolution, using $W_{s}^{2,l} \in \mathbb{R}^{1 \times C_{s}^{1} \times 1 \times 1}$, further compresses the intermediate feature map to a single-channel attention map $A_{s}$, representing the contribution score of each spatial location to the predicted class. This formulation enables back-projection of class-specific information onto the original spatial domain of the feature map. Then, a min-max normalization is employed to obtain the attention map at the $s$-th scale, expressed as
$$\hat{A}_{s}(w, h) = \frac{A_{s}(w, h) - \min(A_{s})}{\max(A_{s}) - \min(A_{s})}.$$
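A compact PyTorch sketch of this attention-map generation is given below. It assumes the CLS head at scale $s$ is a small `nn.Sequential` whose first module is the first FC layer and whose last module is the class-score FC layer; bias, BN, and activations are dropped from the back-projection, exactly as in the formulation above.

```python
import torch
import torch.nn.functional as F

def class_attention_map(feat, cls_head, eps=1e-6):
    """Back-project the predicted class's classifier weights onto the unpooled
    feature map X_s and min-max normalize the result (per-batch sketch)."""
    fc1, fc2 = cls_head[0], cls_head[-1]            # assumed CLS-head layout: first/last FC layers
    pooled = feat.mean(dim=(2, 3))                  # spatially aggregated feature \hat{X}_s
    l = cls_head(pooled).argmax(dim=1)              # predicted class l at this scale

    # First point-wise convolution with W_s^1 reshaped to (C_s^1, C_s, 1, 1).
    mid = F.conv2d(feat, fc1.weight[:, :, None, None])
    # Second point-wise convolution with the class-l row of W_s^2 (one row per sample).
    w2_l = fc2.weight[l][:, :, None, None]          # shape (B, C_s^1, 1, 1)
    score = (mid * w2_l).sum(dim=1, keepdim=True)   # contribution map A_s, shape (B, 1, H_s, W_s)

    # Per-sample min-max normalization to obtain \hat{A}_s in [0, 1].
    flat = score.flatten(1)
    mn = flat.min(dim=1).values.view(-1, 1, 1, 1)
    mx = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (score - mn) / (mx - mn + eps)
```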
The detailed flow of the attention map generation process is illustrated in Figure 2, which highlights how attention maps are dynamically derived using the weights of the classification heads. Since our objective is to localize and crop the most discriminative regions directly from the input image at its original resolution, we upsample the attention maps from all scales to match the spatial dimensions of the input image. This is achieved by applying an upsampling function $f_{Up}(\cdot)$, yielding the resized attention map $\bar{A}_{s} = f_{Up}(\hat{A}_{s})$. To extract the cropped discriminative region, we first generate a binary mask $M_{s}$ by thresholding $\bar{A}_{s}$. Specifically, spatial locations in the attention map with values greater than a threshold $\theta \in [0, 1]$ are set to 1, and all others are set to 0. Formally, each element of the mask $M_{s}(w, h)$ is defined as
$$M_{s}(w, h) = \begin{cases} 1, & \text{if } \bar{A}_{s}(w, h) > \theta \\ 0, & \text{otherwise} \end{cases}$$
Given the binary mask $M_{s}$, we locate a bounding box that tightly encloses all positive regions. The bounding box is then used to crop the corresponding region $I_{s}$ from the input image, which is considered the most discriminative area for the FGVC task. Having obtained the discriminative region for each individual scale $s = 1, 2, 3, 4$, we proceed to construct an integrated attention map $\bar{A}_{a}$ that aggregates attention information across all scales. The goal of this integration is to capture a more holistic view of the discriminative cues distributed at different semantic levels, thereby facilitating the localization of an overall discriminative region in the input image. Specifically, $\bar{A}_{a}$ is generated by averaging the upsampled attention maps from different scales. Each spatial element of the aggregated attention map $\bar{A}_{a}(w, h)$ is computed as
$$\bar{A}_{a}(h, w) = \frac{1}{S} \sum_{s=1}^{S} \bar{A}_{s}(h, w),$$
where $S$ denotes the total number of scales (e.g., 4). Once $\bar{A}_{a}$ is obtained, we utilize it to guide a refined localization of the most informative region in the input image. Similar to the procedure applied at individual scales, we generate a binary mask $M_{a}$ by thresholding $\bar{A}_{a}$ with a predefined threshold $\theta$, and extract the bounding box $I_{a}$ enclosing all positive pixels. Compared to the individual-scale crops, this fused region benefits from the complementary information across multiple feature levels, leading to a more robust and semantically complete representation of the discriminative content. The final set of discriminative regions thus includes both scale-specific local crops and the integrated region.
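The localization step following the attention maps can be sketched as below for a single image. It assumes per-scale maps of shape (1, 1, H_s, W_s); the fallback to the full image when no location exceeds $\theta$ is an assumption the text does not specify.

```python
import torch
import torch.nn.functional as F

def localize_regions(image, attn_maps, theta=0.5):
    """Upsample per-scale attention maps, fuse them by averaging, threshold with
    theta, and crop tight bounding boxes (sketch for one image of shape (3, H, W))."""
    H, W = image.shape[-2:]
    upsampled = [F.interpolate(a, size=(H, W), mode="bilinear", align_corners=False)
                 for a in attn_maps]                         # \bar{A}_s = f_Up(\hat{A}_s)
    fused = torch.stack(upsampled).mean(dim=0)               # \bar{A}_a: average over the S scales

    crops = []
    for a in upsampled + [fused]:                            # scale crops I_s plus integrated crop I_a
        mask = a[0, 0] > theta                               # binary mask M
        ys, xs = mask.nonzero(as_tuple=True)
        if ys.numel() == 0:                                  # nothing above threshold: keep the full image
            crops.append(image)
            continue
        y0, y1 = ys.min().item(), ys.max().item() + 1        # tight bounding box around positive pixels
        x0, x1 = xs.min().item(), xs.max().item() + 1
        crops.append(image[:, y0:y1, x0:x1])
    return crops[:-1], crops[-1]                             # per-scale regions, integrated region
```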

3.3. Hierarchical Learning with the Localized Regions

We propose a hierarchical learning framework that progressively optimizes baseline networks using both raw images and discriminative localized regions. The training process follows a structured multi-step paradigm aimed at enhancing representation learning across different semantic levels. In the initial stages, the model is trained with either the original image or individual localized regions at various scales, allowing each training step to specialize in capturing scale-specific discriminative cues without mutual interference. In the intermediate stages, we introduce a cross-scale mutual data augmentation (MDA) strategy, wherein a localized region at one scale is replaced with a region identified by the attention map from a different scale. This mechanism promotes cross-scale knowledge transfer: Deeper layers contribute high-level semantic abstractions to guide shallower layers, while shallower layers offer fine-grained visual patterns that enhance the descriptive power of deeper layers. Training is performed in a top-down fashion, progressing from deep to shallow scales to maintain semantic consistency and hierarchical alignment. In the final stages, the network is jointly optimized using both the globally aggregated attention region and the original input, facilitating the integration of multi-scale information. This collaborative phase enhances the model’s ability to adapt to varying object resolutions. The stage-wise optimization effectively manages the partially conflicting learning objectives across different depths, leading to a coherent convergence of complementary representations within the overall architecture.
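This stage-wise schedule can be summarized in the following sketch for a single training sample. The hypothetical helper `localize_and_crop`, assumed to wrap the ADRL steps sketched in Section 3.2 and to return regions ordered deep-to-shallow, the swap probability used for mutual data augmentation, and the equal weighting of all CLS-head losses are assumptions made for illustration, not the authors' exact procedure.

```python
import random
import torch
import torch.nn.functional as F

def mahl_iteration(model, optimizer, criterion, image, label, size=448, p_swap=0.5):
    """One MAHL training iteration (sketch): original image, per-scale crops with
    cross-scale mutual data augmentation, then the integrated attention region."""

    def step(inp):
        inp = F.interpolate(inp.unsqueeze(0), size=(size, size),
                            mode="bilinear", align_corners=False)
        optimizer.zero_grad()
        preds, _ = model(inp)
        loss = sum(criterion(p, label) for p in preds)       # supervise every CLS head
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        crops, fused_crop = localize_and_crop(model, image)  # hypothetical ADRL helper, deep -> shallow

    step(image)                                              # early stage: the original input
    for i, region in enumerate(crops):                       # per-scale stages, trained top-down
        if len(crops) > 1 and random.random() < p_swap:      # mutual data augmentation: swap in a
            region = crops[random.choice(                    # region found at a different scale
                [j for j in range(len(crops)) if j != i])]
        step(region)
    step(fused_crop)                                         # final stage: integrated attention region
    step(image)                                              # ... jointly with the original input
```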

3.4. Generalized Spatial Pooling

As described, the feature maps extracted from different blocks of the backbone ResNet50 are required to be spatially pooled into a feature vector for classification, and this pooling is usually implemented with the global average or max pooling (GAP or GMP) method. In the context of FGVC, where the distinction between categories often relies on subtle, localized differences, conventional pooling strategies exhibit inherent limitations. GAP uniformly aggregates all spatial activations, which can lead to over-smoothing and the dilution of discriminative information by incorporating a large number of low-activation or background pixels. Conversely, GMP retains only the single most responsive activation, thereby discarding valuable contextual and complementary cues present in other highly informative regions. This extreme sparsity may result in feature representations that are overly sensitive to noise or biased towards a single part of the object. To address these issues, we propose a generalized top-$z\%$ pooling strategy, which combines the advantages of both GAP and GMP while mitigating their respective drawbacks. Specifically, given a feature $X$ composed of activation values $\{x_1, x_2, \ldots, x_d\}$, we first sort the activations in ascending order and select the top $z\%$ of the highest values, denoted as $\{x_t, x_{t+1}, \ldots, x_d\}$, with $t = (1 - \frac{z}{100}) \cdot d$. Then, we compute the mean of these top-ranked activations to produce a pooled feature vector:
$$\hat{X} = \frac{1}{d - t + 1} \sum_{m=t}^{d} x_{m}.$$
The proposed pooling mechanism preserves multiple highly responsive regions simultaneously, capturing the fine-grained visual cues essential for category discrimination. Moreover, by focusing on the most salient activations while suppressing background noise, the proposed method yields a more informative, stable, and discriminative representation for FGVC tasks.
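A possible implementation of this generalized top-$z\%$ pooling, applied per channel over the spatial positions of a feature map, is sketched below; the per-channel interpretation and the rounding of the number of retained elements are assumptions rather than confirmed implementation details.

```python
import torch

def generalized_pooling(feat, z=25.0):
    """Average the top z% of spatial activations per channel (sketch).
    z = 100 recovers global average pooling; a z small enough to keep a
    single element recovers global max pooling. `feat` has shape (B, C, H, W)."""
    B, C, H, W = feat.shape
    d = H * W                                       # number of activations per channel
    k = max(1, int(round(d * z / 100.0)))           # how many top-ranked activations to keep
    topk = feat.flatten(2).topk(k, dim=2).values    # highest k activations per (sample, channel)
    return topk.mean(dim=2)                         # pooled descriptor of shape (B, C)
```

With z = 25, the setting that performs best in the ablation of Section 4.3, roughly a quarter of the most responsive locations per channel contribute to the pooled vector.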

4. Experiments

This section first introduces the benchmark datasets and the detailed experimental setup, and then presents a comparative evaluation of our method against state-of-the-art (SoTA) approaches to establish its effectiveness. Finally, we perform a series of ablation studies to investigate the individual contributions and interactions of key components and hyperparameters.

4.1. Experimental Settings

Datasets: We conduct experiments with three widely used and highly competitive benchmark datasets: Food-11 [58], FGVC-Aircraft [59], and Stanford Cars [60]. As summarized in Table 2, the Food-11 dataset consists of food images categorized into 11 meal types, with 9866 training images and 3347 testing images. The FGVC-Aircraft dataset comprises images of 100 distinct aircraft variants, with a total of 6667 training images and 3333 testing images. The Stanford Cars dataset has images of 196 car models, split into 8144 images for training and 8041 for testing.
Implementation Details: Our framework is implemented using PyTorch 2.5.1, with ResNet50 adopted as the backbone architecture for the proposed MAHL model. Model training is conducted using Stochastic Gradient Descent (SGD) for 200 epochs, with a momentum coefficient of 0.9 and a weight decay of $5 \times 10^{-4}$. The initial learning rate is set to 0.002 and is gradually decayed according to a cosine annealing schedule [22]. To generate the binary attention mask $M_s$, a threshold $\theta = 0.5$ is applied to the normalized attention map. During training, standard data augmentation techniques such as random rotation, scaling, and cropping are utilized to enhance the diversity and robustness of the training data. The model is trained using a mini-batch size of 8 on an NVIDIA GeForce RTX 3070 GPU (10 GB VRAM).
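The setup above corresponds roughly to the following PyTorch configuration sketch; the specific rotation range, crop scale, and normalization statistics are illustrative assumptions rather than the exact values used.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Standard augmentations: random rotation, scaling, and cropping (parameters assumed).
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(448, scale=(0.75, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def build_optimizer(model, epochs=200):
    """SGD with momentum 0.9, weight decay 5e-4, initial LR 0.002, and cosine decay."""
    optimizer = SGD(model.parameters(), lr=0.002, momentum=0.9, weight_decay=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```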

4.2. Comparison with State-of-the-Art Methods

To evaluate the effectiveness of our proposed MAHL method, we conducted comprehensive comparisons against a range of state-of-the-art approaches across three datasets: FGVC-Aircraft [59], Stanford Cars [60], and Food-11 [58]. The compared results are summarized in Table 3. Following the experimental setting [56,61], we resized the input images with a fixed spatial resolution of 448 × 448 . On the FGVC-Aircraft dataset as shown in Table 3a, our approach achieves a classification accuracy of 93.03%, comparable with several recent competitive methods such as DF-GMM [41] (93.8%), GCL [40] (93.2%), and PMG [15] (92.8%). Notably, our model surpasses recent SOTA approaches such as I2-HOFI [57], and SaSPA [47], further demonstrating the effectiveness of our proposed framework. On the Stanford Cars dataset, our method attains a significant improvement with 97.11% accuracy, exceeding all existing methods listed in Table 3a. Compared to strong baselines like PMG [15] (95.1%), DF-GMM [41] (94.8%), SDNs [56] (94.6%), and SaSPA [47] (95.34%), our approach demonstrates a notable gain of over or about 2% in accuracy, illustrating its superior discriminative power for fine-grained vehicle classification. Moreover, following the recent trend in FGVC research toward foundation models and few-shot or data-free learning, we additionally include a comparison with the data-free knowledge distillation (DFKD) pipeline [62] in Table 3a. Notably, DFKD operates under a fundamentally different setting from our work, typically assuming access to either a limited amount of labeled data or no labeled data at all, with the objective of generalizing under low-resource conditions. Consequently, such approaches tend to yield lower performance on standard FGVC benchmarks. For instance, DFKD reports 65.76% accuracy on FGVC Aircraft and 71.89% on Stanford Cars, illustrating the trade-offs inherent in low-data regimes. In contrast, our method is developed and evaluated in the fully supervised setting, where complete training annotations are available. Our focus is on enhancing discriminative feature representations under this setting, and the results clearly demonstrate a superior performance compared to both conventional and recent fully supervised methods.
Regarding the Food-11 dataset, our method also exhibits a state-of-the-art performance. As shown in Table 3b, our model achieves 97.01% accuracy, surpassing the recent CMAL [61] method, which reported 96.5% using a more complex Res2Next50 backbone. Overall, these results confirm that our proposed MAHL method consistently outperforms previous SoTA methods across diverse fine-grained visual recognition tasks, while maintaining a comparable or even simpler backbone configuration.
In the above comparisons, we primarily report top-1 classification accuracy in line with standard practice in FGVC tasks, which remains the most widely used and directly comparable metric across SoTA methods. Other evaluation metrics such as F1-score and mean average precision (mAP) are more common in detection or multi-label classification settings, and their application is less standardized in FGVC benchmarks. However, to offer deeper insight into the model’s behavior, we further include a confusion matrix and the per-class precision/recall on the Food-11 dataset, as shown in Figure 3. This visual analysis reveals several class-level misclassification patterns. For instance, as shown in Figure 4, images of dairy products are frequently misclassified as desserts, while desserts themselves are sometimes confused with the seafood or egg classes. Similarly, the meat category shows confusion with egg, and bread items are often misclassified as desserts. These misclassification patterns suggest overlapping visual cues among certain food types and point to specific areas where further improvement in class discrimination is needed.

4.3. Ablation Study

To comprehensively analyze the effectiveness of the proposed MAHL method, we conduct ablation studies to evaluate the individual contributions of key design components. Specifically, we investigate the impact of integrating the MAHL training and the generalized pooling (GP) mechanism into the baseline ResNet-50 architecture for all three datasets with the input image size of 448 × 448 . The corresponding recognition accuracies are reported in Table 4, demonstrating how each module contributes to performance gain.
Subsequently, we further investigate the effects of varying the internal configurations of the two key components: the MAHL module and the generalized pooling (GP) mechanism. By modifying their respective settings, we aim to assess the sensitivity and robustness of the proposed method to different architectural choices. To expedite training across multiple experimental variants, input images are resized to 224 × 224 pixels, and all evaluations are conducted on the Food-11 dataset. Specifically, as illustrated in Figure 1, we utilize three-scale feature maps, namely, X 2 , X 3 , and X 4 (referred to as Scale3), as the default setting for both classification and attention map generation. To analyze the effect of scale selection, we test three additional combinations: (1) Scale4: all four feature maps ( X 1 X 4 ), (2) Shallow2: two shallower layers ( X 1 and X 2 ), and (3) Deep2: two deeper layers ( X 3 and X 4 ). In addition to feature scale variations, we also compare our simple CAM-based attention mechanism with the conventional Grad-CAM-based attention strategy [68], by applying both to the same three-scale setting (Scale3). The results of these ablation experiments are summarized in Table 5a, offering insights into how different design choices affect recognition performance. Our default configuration, Scale3 (Ours) with three intermediate-to-deep feature maps ( X 2 , X 3 , and X 4 ), achieves the highest accuracy of 94.95% on the Food-11 dataset, highlighting the advantage of incorporating both mid- and high-level semantic features for attention-guided learning. In contrast, when we switch to Shallow2 (using X 1 and X 2 ) or Deep2 (using X 3 and X 4 ) configurations, the accuracy drops to 94.59% and 94.14%, respectively. This suggests that relying solely on shallow or deep features limits the model’s capacity to capture comprehensive discriminative cues, validating the benefit of multi-scale representation. In addition, the use of Scale4 (all four scales) results in 94.26% accuracy, which is also lower than the three-scale setup. This may be attributed to the inclusion of the very shallow features ( X 1 ), which could introduce noise or redundancy, thereby affecting the attention quality. Importantly, replacing our attention generation method with Grad-CAM (i.e., Scale3 with Grad-CAM) significantly reduces the accuracy to 93.30%. This 1.65% drop in performance indicates that the proposed method yields more effective and better-aligned attention maps for hierarchical learning, likely due to its simplicity, efficiency, and stronger compatibility with the network’s spatial activation behavior. These results support the design of our attention module and its contribution to improved recognition performance.
To evaluate the effectiveness of the proposed generalized pooling (GP) mechanism, we conduct an ablation study by varying the top-$z\%$ elements selected within the pooling region, and the compared results are presented in Table 5b. As shown in Table 5b, conventional max pooling (MP) and average pooling (AP) are represented as special cases of the GP framework: MP corresponds to selecting only the single most activated element ($z = \frac{1}{d} \times 100\%$), where $d$ denotes the total number of elements in the entire pooling region, while AP corresponds to averaging over all elements ($z = 100\%$). The compared results demonstrate that our GP strategy with $z = 25\%$ achieves the highest accuracy of 94.95%, outperforming both MP (94.47%) and AP (93.87%). The comparison highlights the limitations of fixed pooling strategies. MP, while effective in preserving strong activations, may be overly sensitive to noise and ignore contextual cues. AP, on the other hand, tends to dilute discriminative signals by treating all spatial features equally. Our GP mechanism strikes a balance by adaptively aggregating a moderately selective subset of high-activation features, enabling it to retain salient information while suppressing irrelevant or noisy responses. Interestingly, as $z$ increases to 50% or 75%, performance degrades, suggesting that including too many lower-activation elements weakens the representation’s discriminative strength. This supports the notion that emphasizing a focused subset of informative features is more effective for fine-grained recognition tasks.
In our method, the attention threshold $\theta$ is used to filter and retain highly attentive spatial regions in the feature maps. To investigate the effect of this hyperparameter, we conducted an ablation study by varying $\theta$ from 0.1 to 0.9 with an interval of 0.2. The corresponding results, summarized in Table 5c, demonstrate that performance remains relatively stable across a broad range of threshold values, with the best accuracy achieved at $\theta = 0.5$. This value provides a good trade-off between focusing on discriminative regions and preserving spatial diversity, and is, therefore, adopted as the default setting in our framework.

4.4. Visualization of Multi-Scale Attention Maps

To qualitatively assess the behavior of our proposed MAHL framework, we visualize the attention maps generated at different scales. These maps provide insights into how the model progressively refines its focus across hierarchical levels by capturing both coarse and fine-grained contextual information. As shown in Figure 5, the attention map at a lower scale ( X 2 ) tends to highlight broader regions of interest, such as the general object outline or background-context boundaries. This early-stage attention enables the model to gather global semantics and context. In contrast, the attention maps at higher scales ( X 3 or X 4 ) exhibit concentrated activation on more localized and discriminative regions, such as object textures, edges, or critical sub-parts. This scale-wise transition of attention demonstrates the model’s capacity to gradually shift from global understanding to fine detail refinement. Notably, the consistency of focus across scales reinforces the hierarchical learning effect, where earlier attentions guide subsequent stages towards more semantically meaningful regions. Moreover, the multi-scale attention structure helps mitigate noisy or irrelevant activations that may arise in single-scale models, potentially leading to more robust and interpretable predictions.

4.5. Discussion

This subsection discusses several aspects of the proposed MAHL framework, including efficiency, comparison with transformer-based backbones, the effect of advanced augmentation techniques, limitations, potential real-world application scenarios, and related societal/ethical issues.
Efficiency of the proposed MAHL framework: As introduced earlier, our framework builds upon the baseline ResNet-50 by attaching prediction heads to multi-scale feature maps from the second to fourth stages of the backbone. Each prediction head consists of two fully connected layers, which naturally increases the number of model parameters and computational cost during both training and inference. However, it is important to emphasize that the attention-guided hierarchical learning (MAHL) process is primarily conducted during the training phase. Specifically, each training iteration of MAHL involves multiple steps: optimizing the model using the full input image, followed by refinement based on attention-guided cropped regions. This multi-step training strategy indeed leads to increased training time. However, in most real-world applications, models are trained offline, and only the inference time is relevant during deployment. In this context, the inference time of our model remains comparable to that of a standard multi-scale prediction model without hierarchical learning, as only the full image is used for prediction at test time. From a practical standpoint, training cost is a one-time offline expense, whereas inference time and efficiency—directly related to the model’s multiply–accumulate operations (MACs)—are more critical for deployment. For reference, the baseline ResNet-50 requires 25.56 M parameters and 49.07 GMACs when processing an image of spatial size 448 × 448 . In contrast, our MAHL framework involves 36.65 M parameters and 64.77 GMACs. Despite the increase, our model still supports real-time inference, achieving an average prediction time of approximately 8 ms per image on a NVIDIA GeForce RTX 3080 Ti.
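For reference, the deployment-side figures quoted above (parameter count and per-image latency) can be measured with a simple routine like the one below; MAC counting would additionally require a profiler such as fvcore or ptflops, which is not shown here.

```python
import time
import torch

def profile_model(model, input_size=448, device="cuda", warmup=10, runs=100):
    """Report the parameter count (millions) and average per-image forward latency (ms)."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, input_size, input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):                      # warm-up iterations before timing
            model(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000.0
    return params_m, latency_ms
```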
Comparison with transformer-based backbones: As introduced, the proposed MAHL framework is implemented using a CNN backbone, specifically, ResNet50. To evaluate the impact of transformer-based architectures, we further implemented MAHL with the Swin transformer (SwinT) [21], pyramid vision transformer (PVT) [33], and MetaFormer [69] as backbones, using the Food-11 dataset under the same experimental settings (input size: 448 × 448, max pooling). The ResNet50-based MAHL achieved an accuracy of 96.3%, while the Swin-, PVT-, and MetaFormer-based variants significantly underperformed, with accuracies of 83.2%, 86.85%, and 88.95%, respectively. Although transformer models offer strong global modeling capabilities, they are less effective at capturing fine-grained, localized visual patterns critical to FGVC tasks. This performance gap is primarily due to architectural differences: (1) CNNs like ResNet50 incorporate strong inductive biases (e.g., local connectivity and translation equivariance), which are well-suited for recognizing subtle, localized features; (2) vision transformers (ViTs), in contrast, require large-scale data to learn such priors from scratch and tend to apply attention globally, potentially overlooking important local cues. Additionally, their high computational complexity poses challenges when processing high-resolution inputs, as typically required in FGVC. In conclusion, while MAHL is compatible with both CNN and transformer backbones, our experiments demonstrate that the CNN-based MAHL (ResNet50) is more effective for FGVC due to its superior ability to extract and refine discriminative local features under limited data conditions.
Effect of advanced augmentation techniques: To assess the impact of more sophisticated data augmentation, we incorporated CutMix into our standard augmentation pipeline on the Food-11 dataset. This resulted in an improved classification accuracy, increasing from 94.95% to 95.60%. While this demonstrates the compatibility of our MAHL framework with advanced augmentation strategies, such techniques are beyond the primary focus of this study. For fair comparison with existing SoTA methods, we report results using standard augmentations.
Limitations and potential real-world applications: While our proposed method demonstrates strong performance in fully supervised FGVC, it also has certain limitations. First, the approach relies on the availability of high-quality annotated datasets, particularly for generating robust and discriminative attention regions in the hierarchical learning process. Such detailed annotations may not always be accessible in real-world scenarios. Furthermore, although the attention-based mechanism effectively highlights discriminative regions, its performance may degrade under significant domain shifts or in the presence of noisy or cluttered backgrounds, conditions frequently encountered in practical applications. Despite these limitations, the proposed method exhibits a promising recognition performance across several benchmark FGVC datasets and shows potential for real-world deployment in domains requiring fine-grained visual understanding. For example, in biodiversity monitoring, accurate identification of visually similar species (e.g., birds or plants) is essential for ecological analysis. In intelligent transportation systems, distinguishing between similar car models or aircraft types based on visual cues supports inventory tracking and surveillance. These scenarios could benefit from our model’s ability to attend to subtle yet informative regions, thereby improving classification accuracy in visually complex environments.
Related societal/ethical issues: While FGVC offers significant advancements in domains such as biodiversity monitoring and agriculture, it also raises important societal and ethical considerations. One major concern is the presence of biases in benchmark datasets, which may reflect imbalanced or non-representative sampling across object categories or environments. Such biases can result in uneven model performance, potentially disadvantaging underrepresented groups or scenarios. In addition, FGVC technologies, if misused, could be applied in sensitive contexts such as surveillance or automated decision-making without ethical safeguards. To address these concerns, future work should prioritize fairness-aware evaluation, open-access dataset auditing, and the development of explainable models that offer interpretable outputs. By considering these aspects, the FGVC community can help ensure that technological advancements contribute positively and equitably to society.

5. Conclusions

In this study, we proposed a novel Multi-scale Attention-driven Hierarchical Learning (MAHL) framework to tackle the central challenges in fine-grained visual categorization (FGVC), including subtle inter-class differences and high intra-class variation. Our key innovation lies in a multi-scale attention mechanism that dynamically identifies class-discriminative regions across different feature layers. This enables a hierarchical learning process that selectively enhances informative spatial features during training. Another notable contribution is the attention generation strategy, which leverages class-specific contribution scores derived from intermediate fully connected (FC) classifiers. This design fosters effective synergy between global semantic context and localized discriminative cues. Moreover, we introduce a generalized pooling strategy that combines average and max pooling to better preserve fine-grained features, enhancing the robustness and expressiveness of the learned representations. Extensive experiments on three FGVC benchmarks demonstrate that MAHL consistently achieves superior classification accuracy compared to state-of-the-art methods, validating its effectiveness and generality.
Nonetheless, our method has limitations. Its performance may be impacted when applied to domains with limited annotations or rare categories with small sample sizes, as the attention mechanism depends on sufficient data to generate reliable spatial focus. Future work will explore ways to improve robustness in such low-resource settings and extend the approach to unsupervised or semi-supervised FGVC tasks.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, writing—original draft preparation, and visualization, Z.H. and X.-H.H.; writing—review and editing, R.K. and X.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available in publicly accessible repositories. The datasets used include the Food-11, FGVC-Aircraft, and Stanford Cars datasets. The Food-11 dataset is available at https://www.kaggle.com/datasets/vermaavi/food11 (accessed on 7 June 2025), the FGVC-Aircraft dataset is available at https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/ (accessed on 7 June 2025), and the Stanford Cars dataset is available at https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset (accessed on 7 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, M.; Zhang, C.; Bai, H.; Zhang, R.; Zhao, Y. Cross-part learning for fine-grained image classification. IEEE Trans. Image Process. 2021, 37, 748–758. [Google Scholar] [CrossRef]
  2. Liu, H.; Zhang, C.; Deng, Y.; Xie, B.; Liu, T.; Zhang, Z.; Li, Y.F. Transifc: Invariant cues-aware feature concentration learning for efficient finegrained bird image classification. IEEE Trans. Multimed. 2023, 27, 1677–1690. [Google Scholar] [CrossRef]
  3. Du, R.; Yu, W.; Wang, H.; Lin, T.E.; Chang, D.; Ma, Z. Multi-view active finegrained visual recognition. In Proceedings of the 2023 International Conference on Computer Vision (ICCV2023), Paris, France, 4–6 October 2023. [Google Scholar]
  4. Zhu, S.; Zou, X.; Qian, J.; Wong, W.K. Learning structured relation embeddings for finegrained fashion attribute recognition. IEEE Trans. Multimed. 2023, 26, 1652–1664. [Google Scholar] [CrossRef]
  5. Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large scale visual food recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949. [Google Scholar] [CrossRef]
  6. Sakai, R.; Kaneko, T.; Shiraishi, S. Framework for fine-grained recognition of retail products from a single exemplar. In Proceedings of the 2023 15th International Conference on Knowledge and Smart Technology (KST2023), Phuket, Thailand, 21–24 February 2023. [Google Scholar]
  7. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the 2015 International Conference on Computer Vision (ICCV2015), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
  8. Cai, S.; Zuo, W.; Zhang, L. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In Proceedings of the 2017 International Conference on Computer Vision (ICCV2017), Venice, Italy, 22–29 October 2017; pp. 511–520. [Google Scholar]
  9. Engin, M.; Wang, L.; Zhou, L.; Liu, X. DeepKSPD: Learning kernel-matrix-based SPD representation for fine-grained image recognition. In Proceedings of the 15th European Conference on Computer Vision (ECCV2018), Munich, Germany, 8–14 September 2018; pp. 629–645. [Google Scholar]
  10. Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Learning deep bilinear transformation for fine-grained image representation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 4277–4289. [Google Scholar]
  11. Gao, Y.; Han, X.; Wang, X.; Huang, W.; Scott, M.R. Channel interaction networks for fine-grained image categorization. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI2020), New York, NY, USA, 7–12 February 2020; pp. 10818–10825. [Google Scholar]
  12. Sun, G.; Cholakkal, H.; Khan, S.; Khan, F.S.; Shao, L. Fine-grained recognition: Accounting for subtle differences between similar classes. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI2020), New York, NY, USA, 7–12 February 2020; pp. 12047–12054. [Google Scholar]
  13. Yang, Z.; Luo, T.; Wang, D.; Hu, Z.; Gao, J.; Wang, L. Learning to Navigate for Fine-grained Classification. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 438–454. [Google Scholar]
  14. Chang, D.; Ding, Y.; Xie, J.; Bhunia, A.; Li, X.; Ma, Z.; Wu, M.; Guo, J.; Song, Y.Z. The devil is in the channels: Mutual-channel loss for fine-grained image classification. IEEE Trans. Image Process. 2020, 29, 4683–4695. [Google Scholar] [CrossRef]
  15. Du, R.; Chang, D.; Bhunia, A.K.; Xie, J.; Song, Y.Z.; Ma, Z.; Guo, J. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  16. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  18. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR2021), Virtual, 3–7 May 2021. [Google Scholar]
  20. Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. DaViT: Dual Attention Vision Transformer. arXiv 2022, arXiv:2204.03645. [Google Scholar]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, H.Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV2021), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  22. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2022), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  24. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2018), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV2018), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  27. Jetley, S.; Lord, N.A.; Lee, N.; Torr, P.H.S. Learn to pay attention. arXiv 2018, arXiv:1804.02391. [Google Scholar]
  28. Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; Yan, S. Diversified Visual Attention Networks for Fine-Grained Object Classification. IEEE Trans. Multimed. 2017, 19, 1245–1256. [Google Scholar] [CrossRef]
  29. Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446. [Google Scholar]
  30. Lu, W.; Yang, Y.; Yang, L. Fine-grained image classification method based on hybrid attention module. Front. Neurorobot. 2024, 18, 1391791. [Google Scholar] [CrossRef]
  31. Zhang, F.; Li, M.; Zhai, G.; Liu, Y. Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization. In International Conference on Multimedia Modeling; Springer International Publishing: Cham, Switzerland, 2021; pp. 136–147. [Google Scholar]
  32. Hou, Y.; Zhang, W.; Zhou, D.; Ge, H.; Zhang, Q.; Wei, X. Multi-Scale Attention Constraint Network for Fine-Grained Visual Classification. In IEEE International Conference on Multimedia and Expo (ICME); IEEE Computer Society: Washington, DC, USA, 2021; pp. 1–6. [Google Scholar]
  33. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV2021), Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  34. Conde, M.V.; Turgutlu, K. Exploring Vision Transformers for Fine-grained Classification. arXiv 2021, arXiv:2106.10587. [Google Scholar]
  35. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  36. Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2019), Long Beach, CA, USA, 15–20 June 2019; pp. 5012–5021. [Google Scholar]
  37. Ding, Y.; Zhou, Y.; Zhu, Y.; Ye, Q.; Jiao, J. Selective Sparse Sampling for Fine-Grained Image Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV2019), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  38. Zhuang, P.; Wang, Y.; Qiao, Y. Learning Attentive Pairwise Interaction for Fine-Grained Classification. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI2020), New York, NY, USA, 7–12 February 2020; pp. 13130–13137. [Google Scholar]
  39. Luo, W.; Zhang, H.; Li, J.; Wei, X.S. Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process. Lett. 2020, 27, 1545–1549. [Google Scholar] [CrossRef]
  40. Wang, Z.; Wang, S.; Li, H.; Dou, Z.; Li, J. Graph-propagation based correlation learning for weakly supervised fine-grained image classification. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI2020), New York, NY, USA, 7–12 February 2020; pp. 12289–12296. [Google Scholar]
  41. Wang, Z.; Wang, S.; Yang, S.; Li, H.; Li, J.; Li, Z. Weakly supervised fine-grained image classification via gaussian mixture model oriented discriminative learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020), Seattle, WA, USA, 13–19 June 2020; pp. 9749–9758. [Google Scholar]
  42. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1309–1322. [Google Scholar] [CrossRef] [PubMed]
  43. Zhou, M.; Bai, Y.; Zhang, W.; Zhao, T.; Mei, T. Look-into-object: Self-supervised structure modeling for object recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020), Seattle, WA, USA, 13–19 June 2020; pp. 11774–11783. [Google Scholar]
  44. Yao, J.; Wang, D.; Xing, H.H.W.; Wang, L. Adcnn: Towards learning adaptive dilation for convolutional neural networks. Pattern Recognit. 2022, 123, 108369. [Google Scholar] [CrossRef]
  45. Niu, Y.; Jiao, Y.; Shi, G. Attention-shift based deep neural network for fine–grained visual categorization. Pattern Recognit. 2021, 116, 107947. [Google Scholar] [CrossRef]
  46. Lu, Z.; Sreekumar, G.; Goodman, E.; Banzhaf, W.; Deb, K.; Boddeti, V. Neural architecture transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2971–2989. [Google Scholar] [CrossRef] [PubMed]
  47. Michaeli, E.; Fried, O. Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS2024), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  48. Bi, Q.; Zhou, B.; Ji, W.; Xia, G.S. Universal Fine-grained Visual Categorization by Concept Guided Learning. IEEE Trans. Image Process. 2025, 34, 394–409. [Google Scholar] [CrossRef]
  49. He, J.; Chen, J.N.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C.; Yuille, A. TransFG: A Transformer Architecture for Fine-grained Recognition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI2022), Virtual, 22 February–1 March 2022; pp. 852–860. [Google Scholar]
  50. Shen, L.; Hou, B.; Jian, Y.; Tu, X. TransFGVC: Transformer-based fine-grained visual classification. Vis. Comput. 2024, 41, 2439–2459. [Google Scholar] [CrossRef]
  51. Kong, S.; Fowlkes, C. Low-rank bilinear pooling for fine-grained classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), Honolulu, HI, USA, 21–26 July 2017; pp. 365–374. [Google Scholar]
  52. Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical bilinear pooling for fine-grained visual recognition. In Proceedings of the 2018 European Conference on Computer Vision (ECCV2018), Munich, Germany, 8–14 September 2018; pp. 595–610. [Google Scholar]
  53. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  54. Wang, Q.; Xie, J.; Zuo, W.; Zhang, L.; Li, P. Deep cnns meet global covariance pooling: Better representation and generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2582–2597. [Google Scholar] [CrossRef]
  55. Behera, A.; Wharton, Z.; Hewage, P.R.P.G.; Bera, A. Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI2021), Virtual, 2–9 February 2021; pp. 929–937. [Google Scholar]
  56. Zhang, L.; Huang, S.; Liu, W. Learning sequentially diversified representations for fine-grained categorization. Pattern Recognit. 2022, 121, 108219. [Google Scholar] [CrossRef]
  57. Sikdar, A.; Liu, Y.; Kedarisetty, S.; Zhao, Y.; Ahmed, A.; Behera, A. Interweaving Insights: High-Order Feature Interaction for Fine-Grained Visual Recognition. Int. J. Comput. Vis. 2024, 133, 1755–1779. [Google Scholar] [CrossRef] [PubMed]
  58. Singla, A.; Yuan, L.; Ebrahimi, T. Food/non-food image classification and food categorization using pre-trained googlenet model. In Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, Amsterdam, The Netherlands, 16 October 2016; pp. 3–11. [Google Scholar]
  59. Maji, S.; Kannala, J.; Rahtu, E.; Blaschko, M.; Vedaldi, A. Fine-grained visual classification of aircraft. arXiv 2013, arXiv:1306.5151. [Google Scholar]
  60. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D object representations for fine-grained categorization. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013. [Google Scholar]
  61. Liu, D.; Zhao, L.; Wang, Y.; Kato, J. Learn from each other to Classify better: Cross-layer mutual attention learning for fine-grained visual classification. Pattern Recognit. 2023, 140, 109550. [Google Scholar] [CrossRef]
  62. Shao, R.; Zhang, W.; Yin, J.; Wang, J. Data-free Knowledge Distillation for Fine-grained Visual Categorization. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV2023), Paris, France, 2–6 October 2023; pp. 1515–1525. [Google Scholar]
  63. Islam, M.; Siddique, B.; Rahman, S.; Jabid, T. Food image classification with convolutional neural network. In Proceedings of the International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS2018), Bangkok, Thailand, 21–24 October 2018; Volume 3, pp. 257–262. [Google Scholar]
  64. McAllister, P.; Zheng, H.; Bond, R.; Moorhead, A. Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets. Comput. Biol. Med. 2018, 95, 217–233. [Google Scholar] [CrossRef]
  65. Yigit, G.O.; Ozyildirim, B. Comparison of convolutional neural network models for food image classification. J. Inf. Telecommun. 2018, 2, 347–357. [Google Scholar] [CrossRef]
  66. Islam, K.; Wijewickrema, S.; Pervez, S.O.M. An exploration of deep transfer learning for food image classification. In Proceedings of the Digital Image Computing: Techniques and Applications (DICTA2018), Canberra, Australia, 10–13 December 2018; pp. 1–5. [Google Scholar]
  67. Tan, R.; Chew, X.; Khaw, K. Neural architecture search for lightweight neural network in food recognition. Mathematics 2021, 9, 1245. [Google Scholar] [CrossRef]
  68. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV2017), Venice, Italy, 22–29 October 2017. [Google Scholar]
  69. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2022), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Figure 1. Overall architecture of the proposed Multi-scale Attention-driven Hierarchical Learning (MAHL) framework, including the end-to-end structure of the proposed system, multi-scale feature extraction, attention-guided refinement, and hierarchical supervision.
Figure 2. Detailed illustration of the attention map generation process. This flow diagram highlights how class-specific contribution scores and attention maps are computed using the learned weights of the intermediate classification heads.
Figure 3. The confusion matrix and the precision/recall for each class on the Food-11 dataset using our proposed MAHL framework.
Figure 4. The image examples of misclassification on the Food-11 dataset using our proposed MAHL framework.
Figure 5. Visualization of attention maps on multi-scale features: X2, X3, and X4.
Table 1. Comparative summary of existing works with strong performance across standard benchmarks and key differences from the proposed MAHL.
Method | Core Idea | Limitation vs. MAHL
MC-Loss [14] | Enhances channel diversity via mutual-channel loss. | Lacks spatial attention and hierarchical refinement.
SEF [39] | Semantic enhancement of features. | No iterative or scale-adaptive attention.
GCL [40] | Graph-based correlation modeling. | Relies on graph structures instead of direct spatial attention.
DF-GMM [41] | GMM-based part localization. | Uses statistical parts, not adaptive attention maps.
B-CNN [42] | Bilinear pooling for high-order interactions. | No dynamic attention or hierarchical processing.
GCP [54] | Global covariance pooling. | Misses fine-grained local attention mechanisms.
LIO [43] | Self-supervised structure learning. | Not tailored for discriminative refinement in FGVC.
PMG [15] | Multi-granularity jigsaw learning. | Fixed patches; lacks adaptive attention integration.
NAT [46] | Neural architecture search. | Focuses on architecture, not attention/pooling mechanisms.
AS-DNN [45] | Attention-shift mechanisms. | No classifier-guided spatial attention or hierarchy.
ADCNN [44] | Adaptive dilation in convolutions. | Emphasizes receptive fields over attention structure.
SDR [56] | Sequentially diversified representations. | Lacks focus on spatial attention and region hierarchy.
I2-HOFI [57] | High-order feature interactions. | No spatial refinement or multiscale integration.
SaSPA [47] | Augmentation preserving structure. | Data-focused, not a learning framework innovation.
CGL [48] | Concept-guided semantic learning. | Focuses on semantics over spatial attention integration.
MAHL (Ours) | Hierarchical learning with multi-scale attention and generalized pooling. | Increased training cost.
Table 2. Statistics of the datasets used in our experiments.
Dataset | Dataset Content | Categories | Training Images | Testing Images
FGVC-Aircraft [59] | Aircraft Models | 100 | 6667 | 3333
Stanford Cars [60] | Car Models | 196 | 8144 | 8041
Food-11 [58] | Dishes | 11 | 9866 | 3347
Table 3. Comparison with state-of-the-art methods on the FGVC-Aircraft, Stanford Cars, and Food-11 datasets.
(a) on FGVC-Aircraft and Stanford Cars datasets
Methods | Publication venue | Backbone | FGVC-Aircraft | Cars
SEF [39] | Signal Processing Letters, 2020 | ResNet50 | 92.1 | 94.0
MC-Loss [14] | TIP, 2020 | B-CNN | 92.9 | 94.4
GCL [40] | AAAI, 2020 | ResNet50 | 93.20 | 94.0
DF-GMM [41] | CVPR, 2020 | ResNet50 | 93.8 | 94.8
LIO [43] | CVPR, 2020 | ResNet50 | 92.7 | 94.5
PMG [15] | ECCV, 2020 | ResNet50 | 92.8 | 95.1
GCP [54] | TPAMI, 2021 | ResNet101 | 91.4 | 93.3
B-CNN [42] | TPAMI, 2021 | VGG-M + VGG-D | 84.1 | 90.6
NAT [46] | TPAMI, 2021 | NAT-M4 | 90.8 | 92.9
AS-DNN [45] | Pattern Recognition, 2021 | AS-DNN | 92.3 | 94.1
ADCNN [44] | Pattern Recognition, 2022 | W-ResNet101 | 92.5 | 91.3
SDNs [56] | Pattern Recognition, 2022 | ResNet101 | 92.7 | 94.6
I2-HOFI [57] | IJCV, 2024 | ResNet50 | 92.26 | 94.33
SaSPA [47] | NeurIPS, 2024 | ResNet50 | 90.79 | 95.34
CGL [48] | TIP, 2025 | ResNet50 | 94.2 | -
Ours | - | ResNet-50 | 93.03 | 97.11
DFKD [62] | ICCV, 2023 | ResNet-34 | 65.76 | 71.89
(b) on Food-11 dataset
Methods | Publication venue | Backbone | Food-11
Inception-TL [63] | ICIIBMS, 2018 | Inception V3 | 92.9
ANN [64] | Computers in Biology and Medicine, 2018 | ResNet152 | 91.3
Food-DCNN [65] | JIT, 2018 | AlexNet | 86.9
ResNet50-TL [66] | DICTA, 2018 | ResNet50 | 88.1
LNAS [67] | Mathematics, 2021 | LNAS-net | 89.1
CMAL [61] | Pattern Recognition, 2023 | ResNet50 | 96.3
CMAL [61] | Pattern Recognition, 2023 | Res2Next50 | 96.5
Ours | - | ResNet-50 | 97.01
Table 4. Comparison of classification accuracies with and without (w/o) the integration of the MAHL and GP strategies.
Methods | Food-11 | FGVC-Aircraft | Stanford Cars
Baseline | 88.1% | 88.5% | 91.7%
+MAHL | 96.3% | 92.8% | 97.27%
+MAHL & GP | 97.01% | 93.03% | 97.11%
Table 5. Ablation study.
(a) Different settings in MAHL
Setting | Shallow2Deep | 2Scale | 4Scale | Scale3 with Grad-CAM | Scale3 (Ours)
Acc. | 94.59% | 94.14% | 94.26% | 93.30% | 94.95%
(b) Different settings in the GP strategy (d refers to the number of elements in the pooling region)
Top-z% | z = (1/d) × 100 (MP) | z = 25 | z = 50 | z = 75 | z = 100 (AP)
Acc. | 94.47% | 94.95% | 93.92% | 93.52% | 93.87%
(c) Different attention thresholds to produce discriminative regions
Threshold θ | θ = 0.1 | θ = 0.3 | θ = 0.5 | θ = 0.7 | θ = 0.9
Acc. | 94.20% | 94.68% | 94.95% | 94.68% | 93.92%
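Table 5(b) sweeps the top-z% parameter of the generalized pooling (GP) between the max-pooling (z = (1/d) × 100) and average-pooling (z = 100) endpoints. As a rough illustration, the following is a minimal PyTorch-style sketch of a top-z% pooling operator that averages the largest z% of activations per channel; the function name top_z_percent_pool and its arguments are our own assumptions, not the exact GP implementation of the paper.

```python
import torch

def top_z_percent_pool(feat, z=25.0):
    """Average the top-z% activations of each channel over the spatial region.

    With d = H * W spatial elements, z = (1/d) * 100 keeps only the largest
    activation (max pooling) and z = 100 keeps all of them (average pooling);
    intermediate values of z interpolate between the two.

    feat: (B, C, H, W) feature map; returns a (B, C) pooled descriptor.
    """
    B, C, H, W = feat.shape
    d = H * W
    k = max(1, min(d, int(round(d * z / 100.0))))  # number of activations kept
    flat = feat.flatten(start_dim=2)               # (B, C, d)
    topk_vals, _ = flat.topk(k, dim=2)             # k largest activations per channel
    return topk_vals.mean(dim=2)                   # (B, C)
```

For example, on a 7 × 7 feature map (d = 49), z = 25 would average roughly the 12 largest activations per channel, which corresponds to the best-performing setting in Table 5(b).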
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
