1. Introduction
Despite recent rapid advances, fine-grained visual recognition (FGVR) remains a nontrivial task in the computer vision community. Unlike conventional recognition tasks, FGVR aims to predict subordinate categories of a given object, e.g., subcategories of birds [1], flowers [2,3], and cars [4,5]. It is a highly challenging task due to inherently subtle inter-class differences caused by similar subordinate categories and large intra-class variations caused by object pose, scale, or deformation.
The most common solution for FGVR is to decompose the target object into multiple local parts [6,7,8,9,10]. Because the subtle differences between fine-grained categories mostly reside in the unique properties of object parts [11], decomposed local parts provide more discriminative clues about the target object. For example, a given bird object can be decomposed into its beak, wing, and head parts; a 'Glaucous Winged Gull' and a 'California Gull' can then be distinguished by comparing their corresponding object parts. Early part-based methods find discriminative local parts using manual part annotations [6,7,11]. However, curating manual annotations for all possible object parts is labor-intensive and carries the risk of human error [12]. The research focus has consequently shifted to weakly supervised approaches [8,9,10,13,14,15], which use additional machinery such as attention mechanisms [9,14,16] or region proposal networks (RPNs) [13,15,17] to estimate local parts with only category-level labels. However, the part proposal process greatly increases the overall computational cost. In addition, these methods tend not to deeply consider the interactions between estimated local parts, which are essential for accurate recognition [18].
Recently, Vision Transformers (ViTs) [19] have been actively applied to FGVR [18,20,21,22,23,24,25,26,27]. Relying exclusively on the Transformer [28] architecture, ViTs have shown competitive image classification performance at large scale. Similar to token sequences in NLP, ViTs embed the input image into fixed-size image patches, and the patches pass through multiple Transformer encoder blocks. Patch-by-patch processing is highly suitable for FGVR because each image patch can be considered a local part, which means that the cumbersome part proposal step is no longer necessary. Additionally, the self-attention mechanism [28] inherent in each encoder block facilitates the modeling of global interactions between patch-divided local parts. ViT-based FGVR methods use patch selection to further boost performance [18,20,21,22]. Because ViTs treat all patch-divided image regions equally, many irrelevant patches may lead to inaccurate recognition. Similar to part proposals, patch selection picks the most salient patches from the set of generated image patches based on a computed importance ranking, i.e., accumulated attention weights [21,29]. As a result, redundant patch information is filtered out, and only the selected salient patches are considered for the final decision.
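As a concrete illustration of this idea, the following sketch shows one way single-scale patch selection based on accumulated attention weights (attention roll-out) could be implemented; the tensor layout and the number of kept patches are illustrative assumptions, and the exact procedure differs across the cited methods:

```python
import torch

def attention_rollout(attentions):
    """Accumulate per-layer attention maps by successive matrix products.

    attentions: list of tensors of shape (B, heads, N, N), one per encoder block.
    Returns a (B, N, N) matrix of accumulated attention weights.
    """
    rollout = None
    for attn in attentions:
        attn = attn.mean(dim=1)                       # average over heads -> (B, N, N)
        attn = attn + torch.eye(attn.size(-1))        # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout

def select_salient_patches(tokens, attentions, k=12):
    """Keep the k patch tokens most attended by the CLS token (assumed at index 0)."""
    rollout = attention_rollout(attentions)           # (B, N, N)
    cls_to_patch = rollout[:, 0, 1:]                  # attention from CLS to each patch
    topk = cls_to_patch.topk(k, dim=-1).indices       # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens[:, 1:, :].gather(1, idx)            # (B, k, C)
```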
However, existing ViT-based FGVR methods suffer from single-scale limitations. ViTs use fixed-size image patches throughout the entire network, so the receptive field remains the same across all layers, which prevents ViTs from obtaining multiscale feature representations [30,31,32]. In contrast, Convolutional Neural Networks (CNNs) are well suited to multiscale feature representations thanks to their staged architecture, where feature resolution decreases as layer depth increases [33,34,35,36,37]. In the early stages, the spatial details of an object are encoded in high-resolution feature maps; as the stages deepen, the receptive field expands with decreasing feature resolution, and higher-order semantic patterns are encoded in low-resolution feature maps. Multiscale features are important for most vision tasks, especially pixel-level dense prediction tasks, e.g., object detection [38,39,40] and segmentation [41,42,43]. In the same context, single-scale processing can cause two failure cases in FGVR, which lead to suboptimal recognition performance. (i) First, it is vulnerable to scale changes in fine-grained objects [32,35,38]: a fixed patch size may be too coarse to capture the very subtle features of small-scale objects and, conversely, may over-decompose the discriminative features of large-scale objects into too finely split patches. (ii) Second, single-scale processing limits the representational richness for objects [30,39]: compared with a CNN that explores rich feature hierarchies from multiscale features, a ViT considers only monotonic single-scale features due to its fixed receptive field.
In this paper, we improve existing ViT-based FGVR methods by enhancing their multiscale capabilities. One simple solution is to use recent MultiScale Vision Transformers (MS-ViTs) [30,31,32,44,45,46,47,48,49]. In fact, we can achieve satisfactory results simply by using MS-ViTs. However, we further boost performance by adapting patch selection to MS-ViTs. Specifically, we propose MultiScale Patch Selection (MSPS), which extends the previous Single-Scale Patch Selection (SSPS) [18,20,21,22] to multiscale. MSPS selects salient patches of different scales from different stages of the MS-ViT backbone. As shown in Figure 1, the multiscale salient patches selected through MSPS include both large-scale patches that capture object semantics and small-scale patches that capture fine-grained details. Compared with the single-scale patches of SSPS, the feature hierarchies in multiscale patches provide richer representations of objects, which leads to better recognition performance. In addition, the flexibility of multiscale patches is useful for handling extremely large or small objects through multiple receptive fields.
However, we argue that patch selection alone cannot fully describe the object; careful consideration is required of how to model the interactions between selected patches and how to effectively reflect them in the final decision. This is more complicated than considering only single-scale patches. Therefore, we introduce Class Token Transfer (CTT) and MultiScale Cross-Attention (MSCA) to effectively handle the selected multiscale patches. First, CTT aggregates the multiscale patch information by transferring the global CLS token to each stage. Stage-specific patch information is shared through the transferred global CLS tokens, which generates richer network-level representations. In addition, we propose MSCA to model direct interactions between the selected multiscale patches. In the MSCA block, cross-scale interactions in both the spatial and channel dimensions are computed for the selected patches of all stages. Finally, our MultiScale Vision Transformer with MultiScale Patch Selection (M2Former) obtains improved FGVR performance over other ViT-based SSPS models, as well as CNN-based models.
Our main contributions can be summarized as follows:
We propose MultiScale Patch Selection (MSPS) that further boosts the multiscale capabilities of MS-ViTs. Compared with Single-Scale Patch Selection (SSPS), MSPS generates richer representations of fine-grained objects with feature hierarchies, and obtains flexibility for scale changes with multiple receptive fields.
We propose Class Token Transfer (CTT) that effectively shares the selected multiscale patch information. Stage-specific patch information is shared through transferred global CLS tokens to generate enhanced network-level representations.
We design a MultiScale Cross-Attention (MSCA) block to capture the direct interactions of selected multiscale patches. In the MSCA block, the spatial-/channel-wise cross-scale interdependencies can be captured.
Extensive experimental results on widely used FGVR benchmarks show the superiority of our M2Former over conventional methods. In short, our M2Former achieves an accuracy of 92.4% on Caltech-UCSD Birds (CUB) [1] and 91.1% on NABirds [50].
3. Our Method
The overall framework of our method is presented in Figure 2. First, we use a MultiScale Vision Transformer (MS-ViT) as our backbone network (Section 3.1). After that, MultiScale Patch Selection (MSPS) is equipped on different stages of the MS-ViT to extract multiscale salient patches (Section 3.2). Class Token Transfer (CTT) aggregates the multiscale patch information by transferring the global CLS token to each stage (Section 3.3). MultiScale Cross-Attention (MSCA) blocks are used to model the spatial-/channel-wise interactions of the selected multiscale patches (Section 3.4). Finally, we use additional training strategies for better optimization (Section 3.5). More details are described as follows.
3.1. Multiscale Vision Transformer
To enhance the multiscale capability, we use an MS-ViT as our backbone network, specifically the recent Multiscale Vision Transformer (MViT) [30,45]. MViT constructs a four-stage pyramid structure for low-level to high-level visual modeling instead of single-scale processing. To produce a hierarchical representation, MViT introduces Pooling Attention (PA), which pools the query tensors to control the downsampling factor. We refer the interested reader to the original works [30,45] for details.
Let $x \in \mathbb{R}^{H \times W \times C}$ denote the input image, where $H$, $W$, and $C$ refer to the height, width, and the number of channels, respectively. $x$ first goes through a patch embedding layer to produce the initial feature maps with a fixed patch size. As the stages deepen, the resolution of the feature maps decreases and the channel dimension increases proportionally. As a result, at each stage $i \in \{1, 2, 3, 4\}$, we can extract feature maps $F_i$ with resolution $H_i \times W_i$ and channel dimension $C_i$. We can also flatten $F_i$ into a 1D patch sequence $z_i \in \mathbb{R}^{N_i \times C_i}$, where $N_i = H_i W_i$. In fact, after patch embedding, we attach a trainable class token (CLS token) to the patch sequence, and all patches are fed into the consecutive encoder blocks of each stage. After the last block, the CLS token is detached from the patch sequence and used for class prediction through a linear classifier.
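As a minimal sketch of this staged layout (the input size, per-stage downsampling factors, and channel widths below are illustrative assumptions rather than the exact MViT configuration), the following shows how per-stage feature maps are flattened into patch sequences with a prepended CLS token:

```python
import torch

# Illustrative four-stage pyramid: each stage halves the resolution and widens channels.
H = W = 224
stage_strides = [4, 8, 16, 32]        # assumed total downsampling factor per stage
stage_channels = [96, 192, 384, 768]  # assumed channel widths C_i

x = torch.randn(2, 3, H, W)           # (B, 3, H, W) input image

for i, (s, C_i) in enumerate(zip(stage_strides, stage_channels), start=1):
    H_i, W_i = H // s, W // s
    F_i = torch.randn(2, C_i, H_i, W_i)       # stand-in for the stage-i feature maps
    z_i = F_i.flatten(2).transpose(1, 2)      # (B, N_i, C_i) with N_i = H_i * W_i
    cls = torch.zeros(z_i.size(0), 1, C_i)    # trainable CLS token in a real model
    z_i = torch.cat([cls, z_i], dim=1)        # prepend the CLS token
    print(f"stage {i}: {tuple(z_i.shape)}")   # (B, 1 + N_i, C_i)
```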
3.2. Multiscale Patch Selection
Single-Scale Patch Selection (SSPS) has limited representations due to its fixed receptive field. Therefore, we propose MultiScale Patch Selection (MSPS) that extends SSPS to multiscale. With multiple receptive fields, the proposed MSPS encourages rich representations of objects from deep semantic information to fine-grained details. We design MSPS based on the MViT backbone. Specifically, we select salient patches from the intermediate feature maps produced at each stage of MViT.
Given the patch sequence $z_i$, we start by detaching the CLS token and reshaping the remaining patches back into 2D feature maps $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$. We then group $s \times s$ neighboring patches, reshaping $F_i$ so that $\frac{H_i}{s} \times \frac{W_i}{s}$ neighboring patch groups are generated. Afterwards, we apply a per-group average to merge the patch groups, producing $\hat{F}_i \in \mathbb{R}^{\hat{H}_i \times \hat{W}_i \times C_i}$, where $\hat{H}_i = H_i / s$ and $\hat{W}_i = W_i / s$. We set $s$ so that patches within a local region are merged. This merging process removes the redundancies of neighboring patches, which forces MSPS to search for salient patches in wider areas of the image.
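A minimal sketch of the grouping and per-group averaging step, assuming channel-first feature maps and an illustrative group size of 2 × 2:

```python
import torch
import torch.nn.functional as F

def merge_neighboring_patches(F_i, s=2):
    """Average s x s groups of neighboring patches.

    F_i: (B, C_i, H_i, W_i) stage feature maps (CLS token already detached).
    Returns (B, C_i, H_i // s, W_i // s) merged feature maps.
    """
    return F.avg_pool2d(F_i, kernel_size=s, stride=s)

F_i = torch.randn(2, 96, 56, 56)
F_hat = merge_neighboring_patches(F_i)   # (2, 96, 28, 28)
```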
Now, we produce a score map $S_i \in \mathbb{R}^{\hat{H}_i \times \hat{W}_i}$ using a predefined scoring function $\phi(\cdot)$. Then, the patches with the top-$k_i$ scores are selected from $\hat{F}_i$. We set $k_i$ differently for each stage to account for hierarchical representations. Since the high-resolution feature maps of the lower stages capture the detailed shape of the object with a small receptive field, we set $k_i$ to be large so that enough patches are selected to sufficiently represent the details of the object. On the other hand, the low-resolution feature maps of the higher stages capture the semantic information of objects with a large receptive field, so a small $k_i$ is sufficient to represent the overall semantics.
For patch selection, we need to define the scoring function $\phi(\cdot)$. Attention roll-out [29] has mainly been used as the scoring function for SSPS [18,20]. Attention roll-out aggregates the attention weights of the Transformer blocks through successive matrix multiplications, and the patch selection module selects the most salient patches based on the aggregated attention weights. However, since we use an MS-ViT as the backbone, we cannot use attention roll-out, because the size of the attention weights differs for each stage, and even for each block. Instead, we propose a simple scoring function based on mean activation, where the score for the $j$-th patch of $\hat{F}_i$ is calculated by

$$S_i^j = \frac{1}{C_i} \sum_{c=1}^{C_i} \hat{F}_i^{j,c},$$

where $c$ is the channel index. Mean activation measures how strongly the channels in each patch are activated on average. After computing the score map, our MSPS conducts patch selection based on it. This is implemented through top-$k$ and gather operations. We extract the $k_i$ patch indices with the highest scores from $S_i$ through the top-$k$ operation, and the patches corresponding to these indices are gathered from $\hat{F}_i$, yielding the selected patches $P_i \in \mathbb{R}^{k_i \times C_i}$.
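A sketch of the mean-activation scoring and top-$k$ gathering described above (the per-stage $k$ values and feature shapes are illustrative assumptions):

```python
import torch

def msps_select(F_hat, k):
    """Select the k most salient merged patches of one stage by mean activation.

    F_hat: (B, C_i, H_hat, W_hat) merged feature maps.
    Returns (B, k, C_i) selected patches.
    """
    B, C, H, W = F_hat.shape
    patches = F_hat.flatten(2).transpose(1, 2)   # (B, N_hat, C_i), N_hat = H * W
    scores = patches.mean(dim=-1)                # mean activation per patch: (B, N_hat)
    idx = scores.topk(k, dim=-1).indices         # (B, k) indices of top-k scores
    idx = idx.unsqueeze(-1).expand(-1, -1, C)    # broadcast indices over channels
    return patches.gather(1, idx)                # (B, k, C_i)

# Example: larger k at lower stages, smaller k at higher stages (assumed values).
stage_feats = [torch.randn(2, 96, 28, 28), torch.randn(2, 768, 4, 4)]
selected = [msps_select(f, k) for f, k in zip(stage_feats, [32, 8])]
```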
3.3. Class Token Transfer
Through MSPS, we can extract the salient patches $P_i$ from each stage. In Section 3.2, the CLS token $x_{cls}^i$ is detached from the patch sequence before MSPS at each stage. The simplest way to reflect the selected multiscale patches in the model decision is to concatenate the detached CLS token $x_{cls}^i$ with $P_i$ again and feed the result into a few additional ViT blocks consisting of multihead self-attention (MSA) and feed-forward networks (FFN):

$$\hat{z}_i = [x_{cls}^i; P_i], \quad \hat{z}_i' = \hat{z}_i + \mathrm{MSA}(\hat{z}_i), \quad \hat{z}_i'' = \hat{z}_i' + \mathrm{FFN}(\hat{z}_i'), \quad (4)$$

where $\hat{z}_i, \hat{z}_i', \hat{z}_i'' \in \mathbb{R}^{(1+k_i) \times C_i}$ and $[\cdot\,;\cdot]$ denotes concatenation. Finally, the prediction for each stage is computed by extracting the CLS token from $\hat{z}_i''$ and attaching a linear classifier. It should be noted that the CLS token is shared by all stages: the set of $\{x_{cls}^i\}$ is derived from the global CLS token, which is detached with a different dimension at each stage. This means that the stage-specific multiscale information is shared through the CLS tokens. However, this sharing method may cause inconsistency between stage features, because the detached CLS tokens do not equally utilize the representational power of the network. For example, $x_{cls}^1$ is detached right after stage 1, so it will always lag behind $x_{cls}^4$, which utilizes the representations of all stages.
To this end, we introduce a Class Token Transfer (CTT) strategy that aggregates the multiscale information more effectively. The core idea is to use a CLS token transferred from the global CLS token $x_{cls}$ rather than the detached $x_{cls}^i$ at each stage. It should be noted that $x_{cls}$ is equal to $x_{cls}^4$, so $x_{cls} \in \mathbb{R}^{C_4}$. We transfer $x_{cls}$ according to the dimension of each stage through a projection layer consisting of two linear layers along with Batch Normalization (BN) and ReLU activation:

$$\tilde{x}_{cls}^i = W_2^i \, \mathrm{ReLU}\big(\mathrm{BN}(W_1^i \, x_{cls})\big), \quad (5)$$

where $W_1^i$ and $W_2^i$ are the weight matrices, and $\tilde{x}_{cls}^i$ is the transferred CLS token for stage $i$. Now, (4) is reformulated as

$$\hat{z}_i = [\tilde{x}_{cls}^i; P_i], \quad \hat{z}_i' = \hat{z}_i + \mathrm{MSA}(\hat{z}_i), \quad \hat{z}_i'' = \hat{z}_i' + \mathrm{FFN}(\hat{z}_i'). \quad (6)$$

Compared with the conventional approach, CTT guarantees consistency between stage features, as it uses CLS tokens with the same representational power. Each stage encodes its stage-specific patch information into a globally updated CLS token. CTT is similar to the top-down pathway [35,39]: it combines the high-level representations of the object with the multiscale representations of the lower layers to generate richer network-level representations.
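A minimal sketch of the CTT projection under these definitions (the hidden width, the ordering of BN and ReLU between the two linear layers, and the dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ClassTokenTransfer(nn.Module):
    """Project the global CLS token to the channel dimension of one stage."""

    def __init__(self, global_dim=768, stage_dim=96, hidden_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(global_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, stage_dim),
        )

    def forward(self, cls_token):
        # cls_token: (B, global_dim) -> transferred token: (B, stage_dim)
        return self.proj(cls_token)

ctt = ClassTokenTransfer()
x_cls = torch.randn(2, 768)
x_cls_1 = ctt(x_cls)   # (2, 96), ready to prepend to the selected stage-1 patches
```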
3.4. Multiscale Cross-Attention
Although CTT can aggregate multiscale patch information from all stages, it cannot model the direct interactions between multiscale patches, i.e., how interrelated they are. Therefore, we propose MultiScale Cross-Attention (MSCA) to model the interactions between multiscale patches.
MSCA takes the stage token sequences $\hat{z}_i = [\tilde{x}_{cls}^i; P_i]$ of all stages as input and models the interactions between the selected multiscale salient patches. Specifically, MSCA consists of Channel Cross-Attention (CCA) and Spatial Cross-Attention (SCA), so (6) is reformulated as

$$\hat{z}_i' = \hat{z}_i + \mathrm{MSCA}\big(\hat{z}_i; \{\hat{z}_l\}_{l=1}^{4}\big), \quad \hat{z}_i'' = \hat{z}_i' + \mathrm{FFN}(\hat{z}_i'),$$

where the MSCA block replaces MSA by applying CCA and SCA across the tokens of all stages, as described below.
3.4.1. Channel Cross-Attention
Exploring feature channels has been very important in many vision tasks because feature channels encode visual patterns that are strongly related to foreground objects [10,53,59,70,71]. Many methods have been proposed to enhance the representational power of a network by explicitly modeling the interdependencies between feature channels [72,73,74,75,76]. In the same vein, we propose CCA to further enhance the representational richness of multiscale patches by explicitly modeling their cross-scale channel interactions.
We illustrate CCA in Figure 3a. First, we apply global average pooling (GAP) to $P_i$ to obtain a global channel descriptor $d_i \in \mathbb{R}^{C_i}$ for each stage. The $c$-th element of $d_i$ is calculated by

$$d_i^c = \frac{1}{k_i} \sum_{j=1}^{k_i} P_i^{j,c},$$

where $j$ is the patch index. From the stage-specific channel descriptors, we compute the channel attention score as follows:

$$a = \sigma\big(W_2 \, \mathrm{ReLU}(W_1 \, [d_1; d_2; d_3; d_4])\big), \quad (9)$$

where $[\cdot\,;\cdot]$ denotes concatenation, $W_1$ and $W_2$ are the weight matrices, and $\sigma$ is the sigmoid function. We then split $a$ back into stage-specific scores $a_i \in \mathbb{R}^{C_i}$ and recalibrate the channels of $P_i$ as follows:

$$\tilde{P}_i = a_i \otimes P_i,$$

where ⊗ indicates element-wise multiplication. In (9), we compute the channel attention score by aggregating the channel descriptors of all multiscale patches. It captures channel dependencies in a cross-scale way and reflects them back into each stage-specific piece of channel information.
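A sketch of CCA following the formulation above (the reduction ratio and the stage dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Cross-scale channel recalibration over the selected patches of all stages."""

    def __init__(self, stage_dims=(96, 192, 384, 768), reduction=4):
        super().__init__()
        total = sum(stage_dims)
        self.stage_dims = list(stage_dims)
        self.fc = nn.Sequential(
            nn.Linear(total, total // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(total // reduction, total),
            nn.Sigmoid(),
        )

    def forward(self, patches):
        # patches: list of (B, k_i, C_i) selected patches, one per stage
        descriptors = [p.mean(dim=1) for p in patches]         # GAP over patches: (B, C_i)
        attn = self.fc(torch.cat(descriptors, dim=-1))         # cross-scale score: (B, sum C_i)
        attn = torch.split(attn, self.stage_dims, dim=-1)      # split back per stage
        return [p * a.unsqueeze(1) for p, a in zip(patches, attn)]  # recalibrate channels

cca = ChannelCrossAttention()
patches = [torch.randn(2, k, c) for k, c in zip((32, 24, 16, 8), (96, 192, 384, 768))]
out = cca(patches)   # same shapes as the inputs
```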
3.4.2. Spatial Cross-Attention
In addition to channel-wise interactions, we can compute the spatial-wise interdependencies of the selected multiscale patches. To this end, we propose SCA, which is a multiscale extension of MSA [19,28].
We illustrate SCA in Figure 3b. First, we compute query, key, and value tensors $Q_i$, $K_i$, and $V_i$ for every stage through linear projections that map the stage tokens to a common embedding dimension $d$, so that $Q_i, K_i, V_i \in \mathbb{R}^{k_i \times d}$. After that, we concatenate the $K_i$ and $V_i$ of all stages to generate global key and value tensors $K_g$ and $V_g$, where $K_g, V_g \in \mathbb{R}^{(\sum_{l} k_l) \times d}$. Now, we can compute self-attention for $Q_i$, $K_g$, and $V_g$,

$$\mathrm{SCA}(Q_i, K_g, V_g) = \mathrm{softmax}\!\left(\frac{Q_i K_g^{\top}}{\sqrt{d}}\right) V_g,$$

and a single linear layer is used to restore the stage dimension $C_i$. SCA is also implemented in a multihead manner [28]. Through the global key and value, SCA captures how strongly the multiscale patches interact spatially with each other. Specifically, SCA models how large-scale semantic patches decompose into more fine-grained views and, conversely, how small-scale fine-grained patches can be identified in more global views.
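A single-head sketch of SCA with the global key/value construction (the common embedding dimension and the per-stage projection layout are assumptions; a multihead version would split `dim` into heads as in standard MSA):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialCrossAttention(nn.Module):
    """Single-head sketch: each stage's queries attend to keys/values of all stages."""

    def __init__(self, stage_dims=(96, 192, 384, 768), dim=256):
        super().__init__()
        self.dim = dim
        self.q = nn.ModuleList([nn.Linear(c, dim) for c in stage_dims])
        self.kv = nn.ModuleList([nn.Linear(c, 2 * dim) for c in stage_dims])
        self.out = nn.ModuleList([nn.Linear(dim, c) for c in stage_dims])  # restore stage dims

    def forward(self, tokens):
        # tokens: list of (B, n_i, C_i) per-stage token sequences
        kv = [f(t).chunk(2, dim=-1) for f, t in zip(self.kv, tokens)]
        K_g = torch.cat([k for k, _ in kv], dim=1)   # global keys:   (B, sum n_i, dim)
        V_g = torch.cat([v for _, v in kv], dim=1)   # global values: (B, sum n_i, dim)
        outs = []
        for f_q, f_o, t in zip(self.q, self.out, tokens):
            Q_i = f_q(t)                                                     # (B, n_i, dim)
            attn = F.softmax(Q_i @ K_g.transpose(1, 2) / self.dim ** 0.5, dim=-1)
            outs.append(f_o(attn @ V_g))                                     # back to (B, n_i, C_i)
        return outs

sca = SpatialCrossAttention()
tokens = [torch.randn(2, n, c) for n, c in zip((33, 25, 17, 9), (96, 192, 384, 768))]
out = sca(tokens)
```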
3.5. Training
After the MSCA block, we can extract the CLS token from each $\hat{z}_i''$ and compute the class prediction $p_i$ using a linear classifier. In addition, we can compute an extra prediction $p_{cat}$ by concatenating all CLS tokens. For model training, we compare every prediction against the ground-truth label $y \in \{0, 1\}^n$, where $n$ is the total number of classes and $t$ denotes the element index of the label. To improve model generalization and encourage diversity of the representations from specific stages, we employ soft supervision using label smoothing [8,77]. We modify the one-hot vector $y$ as follows:

$$y_i^{t} = \begin{cases} \alpha_i, & t = t^{*}, \\ \dfrac{1 - \alpha_i}{n - 1}, & t \neq t^{*}, \end{cases}$$

where $t^{*}$ denotes the index of the ground-truth class and $\alpha_i$ denotes a smoothing factor. $\alpha_i$ controls the magnitude of the ground-truth class. As a result, the different predictions are supervised with different labels during training. We set $\alpha_i$ to increase in equal intervals up to 1, so $p_1$ is supervised with the smallest $\alpha_i$.
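A sketch of the stage-wise soft labels described above (the number of predictions and the start of the smoothing schedule are illustrative assumptions):

```python
import torch

def smoothed_label(target, num_classes, alpha):
    """Soft label whose ground-truth entry is alpha; the rest share 1 - alpha."""
    y = torch.full((num_classes,), (1.0 - alpha) / (num_classes - 1))
    y[target] = alpha
    return y

num_classes, target = 200, 17
num_preds = 5                                    # e.g., four stage predictions plus the concatenated one
alphas = torch.linspace(0.6, 1.0, num_preds)     # assumed schedule: equal intervals up to 1
labels = [smoothed_label(target, num_classes, a.item()) for a in alphas]
# Each prediction p_i is then trained against its own soft label via cross-entropy.
```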
For inference, we conduct the final prediction by considering all of the predictions $\{p_i\}$ and $p_{cat}$, where the maximum entry in the combined prediction corresponds to the class prediction.
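For example, one simple way to combine the predictions (summing the logits is an assumption about the exact combination rule):

```python
import torch

preds = [torch.randn(2, 200) for _ in range(5)]   # per-stage and concatenated predictions
combined = torch.stack(preds).sum(dim=0)          # aggregate all predictions
predicted_class = combined.argmax(dim=-1)         # maximum entry gives the class
```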