MFVT: Multilevel Feature Fusion Vision Transformer and RAMix Data Augmentation for Fine-grained Visual Categorization

The introduction and application of the Vision Transformer (ViT) has promoted the development of fine-grained visual categorization (FGVC). However, there are problems with directly applying ViT to FGVC tasks: ViT classifies using only the class token of the last layer, ignoring the local and low-level features necessary for FGVC. We propose a ViT-based multilevel feature fusion vision transformer (MFVT) for FGVC tasks. In this framework, following ViT, the backbone network adopts 12 layers of Transformer blocks, divided into four stages, with multilevel feature fusion (MFF) added between Transformer layers. We also design RAMix, a CutMix-based data augmentation strategy that uses a resize strategy for crop-paste images and attention-based label assignment. Experiments on the CUB-200-2011, Stanford Dogs, and iNaturalist 2017 datasets gave competitive results, especially on the challenging iNaturalist 2017, with an accuracy of 72.6%.


Introduction
Image classification is a classic problem in computer vision, whose goal is to assign images to categories [1,18]. Fine-grained visual categorization (FGVC) refers to a more refined sub-category division on top of basic categories, such as distinguishing species of birds or breeds of dogs, and is essentially intra-class classification: it makes a more detailed division of the categories obtained in the traditional classification problem [2,3,7,8].
FGVC has common research needs and application scenarios in fields such as smart agriculture and unmanned retail. Therefore, designing an accurate and efficient FGVC algorithm is of great significance [3]. Compared with ordinary image classification tasks, the images seen in FGVC have similar appearance and features, and there is interference from posture, illumination, perspective, occlusion, and background, resulting in small inter-class variations and large intra-class variations. These difficulties make FGVC a challenging research task [1,4,6].
Feature extraction is a key factor in determining the accuracy of image classification. Traditional FGVC methods are based on manually engineered image features [5]. However, the descriptive ability of hand-crafted features is limited, and the resulting classification performance is poor. With the rapid development of deep learning, features learned by neural networks have stronger predictive ability than those extracted manually, which has promoted the rapid development of FGVC [6].
The self-attention-based Transformer has become the object of much research in the computer vision field. The Vision Transformer (ViT) makes full use of the modeling ability of the Transformer's attention mechanism by dividing an image into multiple patch tokens, and it has promoted vision tasks such as image classification. A number of FGVC algorithms based on ViT, such as TransFG, AFTrans, and FFVT, have also been produced [9,20,23]. These methods improve the model structure to different degrees to effectively improve the performance of FGVC algorithms. Nonetheless, there are issues to consider when applying ViT to FGVC tasks. ViT only uses the class token of the last layer for classification, and this deep class token is computed over all image patches by self-attention, so it attends mostly to global information. In FGVC tasks, landmark details and parts are the key information for classification [11,17,23]. According to our experiments, class tokens at different layers extract different levels of features, and these levels are complementary. Therefore, better utilization of these levels of information in the FGVC task allows the model to obtain more comprehensive information from the image for the final prediction.
For the model to grasp more detailed information that is helpful for classification, instead of only the most discriminative information, CutMix data augmentation covers part of one image with a patch from another image, so that the model can discover more useful information. However, this brings new problems: randomly cropping and pasting background patches that do not contribute to classification leads to loss of object information and incorrect label assignment [29,30,31].
We therefore propose a ViT-based multilevel feature fusion vision transformer (MFVT) for the FGVC task.
In addition to a ViT backbone, multilevel feature fusion (MFF) is included. To solve the possible loss of object information and label errors caused by CutMix, and referring to ResizeMix and TransMix, we design RAMix, a CutMix-style data augmentation strategy that uses a resize operation for image cropping and pasting and a Transformer-based attention mechanism for label assignment. This paper makes the following contributions:
(1) The MFVT algorithm has a backbone network consistent with ViT, with the 12 Transformer blocks divided into four stages, and requires no additional annotation information such as bounding boxes to achieve fine-grained visual categorization.
(2) The MFF module extracts the features output by the last block of each stage in the backbone network and fuses them with a lightweight method, introducing visual information at different levels and effectively improving feature expression.
(3) RAMix data augmentation mixes images reasonably, and the attention mechanism in ViT effectively guides image label assignment without introducing additional parameters.

Related Work
Fine-grained visual categorization has been extensively explored with convolutional neural network (CNN)-based methods. Early work such as Part-based R-CNN and Mask-CNN relied on bounding boxes and part annotations to locate and distinguish regions, but this requires extensive manual annotation, which limits practical application [18]. Subsequent work used only image labels, with attention mechanisms to localize key regions in a weakly supervised manner; typical methods include RA-CNN, MA-CNN, and AP-CNN [3,5,24]. There is also a line of work on enriching feature representations to achieve better classification results. For example, in Bilinear CNN, two networks coordinate with each other for overall and local detection and feature extraction with high accuracy [25]. The emergence of ViT has brought further development to FGVC.

ViT-based image classification
The Vision Transformer (ViT) first realized the application of the Transformer to image classification, introducing the Transformer to the vision field. Experimental results have shown that vision models do not necessarily rely on CNNs, and feeding a sequence of image patches to a Transformer can also achieve a good image classification effect.
Due to the excellent performance of ViT, many Transformer-based image classification models have been proposed, improving ViT from perspectives in five categories [15]:
(1) Transformer-enhanced CNN image classification inserts Transformers into the CNN backbone or replaces convolutional blocks with Transformer layers, e.g., VTs, BoTNet.
(2) CNN-enhanced Transformer image classification enhances the Transformer and accelerates its convergence with convolutional biases, e.g., DeiT, ConViT.
(3) Transformer image classification with local attention enhancement adapts the image patch structure through a local self-attention mechanism, e.g., TNT, Swin Transformer.
(4) Hierarchical Transformer image classification applies hierarchical structures to Transformers, following hierarchical CNNs, e.g., CvT, PVT.
(5) Deep Transformer image classification enables the network to learn more complex representations by increasing the depth of the model, e.g., CaiT, DeepViT.

Fine-grained visual categorization based on ViT
TransFG is said to be the first method to validate the effectiveness of vision Transformers on FGVC tasks.
In TransFG, a part selection module (PSM) integrates all the raw attention weights of the Transformer into an attention map to guide the efficient and accurate selection of discriminative image patches and to compute the relationships between them. A repetition loss encourages multiple attention heads to focus on different regions, and a contrastive loss is applied to further increase the distance between feature representations of similar subclasses [9]. FFVT extends ViT to large-scale FGVC and small-scale ultra-fine-grained visual categorization with a feature fusion vision Transformer that aggregates local information from low-, mid-, and high-level tokens for classification. Mutual attention weight selection (MAWS) selects a representative token in each layer and adds it to the input of the last Transformer layer [20].
AFTrans and RAMS-Trans use the same strategy: the most discriminative part of the image is extracted, enlarged, and re-input to the network for further learning. The selective attention collection module (SACM) in AFTrans leverages the attention weights in ViT and adaptively filters input patches based on their relative importance. Global and local multiscale pipelines are supervised by weight-sharing encoders, enabling end-to-end training [23].
To learn local discriminative regional attention, the strength of the attention weights is used to measure a patch's importance to the original image, and the multiscale recurrent attention Transformer RAMS-Trans utilizes the Transformer's self-attention mechanism to recurrently learn discriminative regional attention in a multiscale manner. The core of the algorithm, the dynamic patch proposal module (DPPM), guides region enlargement to integrate multiscale image patches [11]. R²-Trans adaptively adjusts the masking threshold by calculating the proportion of high-weight regions in the segmentation, moderately extracting background information in the input space, and an information bottleneck approach guides the network to learn a minimal sufficient representation in the feature space. MetaFormer is a simple, effective method for joint learning of vision and various meta-information, using meta-information to improve the performance of fine-grained recognition and providing a strong baseline for FGVC [26].

Method
For the FGVC task, we start from ViT, improve the model by adding a feature fusion mechanism, and improve the training strategy, mainly through the RAMix data augmentation strategy.

Backbone and MFF
The key challenge of FGVC is to detect discriminative regions that significantly contribute to finding the subtle differences between subordinate classes, a need that the multi-head self-attention (MSA) mechanism in ViT can meet well [20]. Deep MSA pays more attention to global information, while FGVC tasks require more attention to detail. Therefore, we propose multilevel feature fusion to ensure the use of high-level global information while also obtaining middle- and low-level information. The backbone network and the multilevel feature fusion module are the core parts of the algorithm, shown at the left and right of Fig. 2, respectively.
The backbone network is consistent with the original ViT. We use a division method similar to the Swin Transformer to divide the 12 Transformer blocks into stages 1-4 [27]. Stages 1, 2, and 4 have two blocks each, and stage 3 has six blocks. The output of each stage contains certain feature information, which is effective and complementary, and the fused features have stronger expressiveness.
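As a rough illustration of this partition, the sketch below (in PyTorch, the framework used in our experiments) splits a list of 12 ViT blocks into stages of 2, 2, 6, and 2 blocks; the helper name and the way stage outputs are collected are our own assumptions, not released code.

```python
import torch.nn as nn

def split_into_stages(blocks, depths=(2, 2, 6, 2)):
    """Split a list of 12 Transformer blocks into stages of 2, 2, 6, and 2
    blocks, returning one nn.Sequential per stage. A sketch of the partition
    described above; `blocks` is any list of ViT block modules."""
    stages, i = [], 0
    for d in depths:
        stages.append(nn.Sequential(*blocks[i:i + d]))
        i += d
    return nn.ModuleList(stages)

# During the forward pass, the output of the last block of each stage is
# kept for the MFF module:
#   outs = []
#   for stage in stages:
#       z = stage(z)
#       outs.append(z)
```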
The Transformer block is the basic unit of the backbone network. Figure 3 shows the structure of two Transformer blocks connected in series.
A Transformer block includes multi-head self-attention (MSA), a multilayer perceptron (MLP), and layer normalization (LN). The forward propagation of layer $l$ is calculated as

$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l,$

and multi-head self-attention is calculated as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$

$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}), \quad \mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O},$

where $Q$, $K$, and $V$ refer to query, key, and value, respectively; $d_k$ refers to the key vector dimension; $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ refer to the parameters of the linear transformations performed on $Q$, $K$, and $V$; and $h$ is the number of heads. The Concat operation concatenates the outputs of the multiple heads [4,9].
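For concreteness, a minimal PyTorch sketch of such a pre-norm block follows; the embedding dimension, head count, and MLP ratio are illustrative ViT-Base values, not necessarily our exact configuration.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm ViT block matching the equations above:
    z'_l = MSA(LN(z_{l-1})) + z_{l-1};  z_l = MLP(LN(z'_l)) + z'_l."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        z = z + self.mlp(self.norm2(z))                    # MLP + residual
        return z
```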
Feature fusion combines features from different layers or branches, often fusing features of different scales to improve deep learning performance. In the network architecture, low-level features have higher resolution and more detailed information, while high-level features have stronger semantic information but poorer perception of detail. Their efficient integration is the key to improving the classification model.
After the basic features are extracted by the backbone network, a lightweight feature fusion method is adopted. Consulting the feature pyramid network (FPN) used in CNNs, top-down pathways and lateral connections are added to the network structure. As shown on the right side of Fig. 2, the final features of stages 4 and 3 are fused, and we use features from three different levels for classification: P4 is the output of stage 4, P3 is obtained by fusing the output of stage 3 with P4, and P2 is obtained by fusing the output of stage 2 with P3.
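The text describes the fusion as lightweight but does not spell out the operator, so the sketch below assumes a linear lateral projection followed by element-wise addition, in the spirit of FPN; treat it as one plausible reading rather than the definitive implementation.

```python
import torch.nn as nn

class MFF(nn.Module):
    """Sketch of the top-down multilevel feature fusion over token sequences.
    The fusion operator (projection + addition) is an assumption."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj3 = nn.Linear(dim, dim)  # lateral connection for stage 3
        self.proj2 = nn.Linear(dim, dim)  # lateral connection for stage 2

    def forward(self, s2, s3, s4):
        p4 = s4                       # P4: output of stage 4
        p3 = self.proj3(s3) + p4      # P3: stage-3 features fused with P4
        p2 = self.proj2(s2) + p3      # P2: stage-2 features fused with P3
        return p2, p3, p4
```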

Classification head and loss function
In object detection, performance can be improved by combining the detection results of different layers: before the final fusion is completed, detection starts on the partially fused layers, so there are multiple layers of detection, and the multiple detection results are finally merged.
Instead of using the output features of the last layer for classification prediction, we use multilevel features and fuse the classification predictions to obtain the final prediction. As shown in Fig. 4, we use three classification heads, each consisting of a fully connected layer.
Having obtained three different levels of features from the MFF, we use three classification heads to classify them: heads 1, 2, and 3 classify features P4, P3, and P2, respectively, and the final classification result is obtained by averaging the three predictions.
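A sketch of this prediction scheme follows, assuming each head reads the class token of its feature level (consistent with ViT) and that the logits are averaged; the class name and `num_classes` value are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """Three classification heads on P4, P3, and P2; the final prediction
    averages the three head outputs."""
    def __init__(self, dim=768, num_classes=200):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(3))

    def forward(self, p4, p3, p2):
        # classify the class token (index 0) of each feature level
        logits = [head(feat[:, 0]) for head, feat in zip(self.heads, (p4, p3, p2))]
        return logits, torch.stack(logits).mean(0)  # per-head logits + average
```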
We use soft-target cross-entropy as the loss function in each classification head,

$L_h = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log \hat{y}_{ij},$

where $N$ is the number of samples, $C$ is the number of categories, $y_{ij}$ represents the input label, $\hat{y}_{ij}$ represents the predicted label, $L_h$ is the loss function of head $h$, and $h$ takes values from 1 to 3.
To adjust the influence of the classification results of the different levels on the final classification, the overall loss function is a weighted combination of the three losses,

$L = \alpha L_1 + \beta L_2 + \gamma L_3,$

where $\alpha$, $\beta$, and $\gamma$ are weight parameters.
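In code, the per-head soft-target cross-entropy and the weighted combination might look as follows; the weighted-sum form (rather than a normalized average) is our assumption, with the default weights taken from the experimental section.

```python
import torch
import torch.nn.functional as F

def soft_target_ce(logits, target):
    """Soft-target cross-entropy: -(1/N) sum_i sum_j y_ij * log(yhat_ij).
    `target` is a (N, C) soft label distribution."""
    return torch.sum(-target * F.log_softmax(logits, dim=-1), dim=-1).mean()

def mfvt_loss(logits_list, target, alpha=1.0, beta=1.0, gamma=0.5):
    """Weighted combination of the three head losses."""
    l1, l2, l3 = (soft_target_ce(lg, target) for lg in logits_list)
    return alpha * l1 + beta * l2 + gamma * l3
```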

RAMix data augmentation
The Transformer has powerful expressive power, but according to the experiments of ViT and DeiT, the network needs a large quantity of data for model training [32]. Hence data augmentation is an important part of model training, which can avoid overfitting and improve model performance. We use CutMix data augmentation in the FGVC task, but when the cropped area lies in the background and contains no object information, the random crop-and-paste operation adds nothing useful to the target image. Labels, however, are still allocated according to the size ratio of the mixed images, resulting in loss of object information and label allocation errors. This has a greater impact when training on the small datasets common in FGVC. We design RAMix to address this problem, as follows.
In the training set, images A and B are randomly selected, and $x$ and $y$ are used to represent a training image and its label, respectively. The goal is to generate a new training sample $(\tilde{x}, \tilde{y})$ by combining the training samples $(x_A, y_A)$ and $(x_B, y_B)$ [28].
Image A is randomly cropped to $x_A^{\mathrm{crop}}$ through the crop value, and $x_A^{\mathrm{crop}}$ is reduced to a small image block $x_A^{r}$ by the scale ratio $\tau$ and the resize operation, i.e.,

$x_A^{r} = \mathrm{Resize}(x_A^{\mathrm{crop}}, \tau), \quad \tau \sim U(\tau_l, \tau_u),$

where $\mathrm{Resize}(\cdot)$ represents the reduction operation, and $\tau_l$ and $\tau_u$ are the lower and upper bounds of the scale ratio.
The image block $x_A^{r}$ is pasted onto a random area of image B to generate a new image,

$\tilde{x} = M \odot \mathrm{Paste}(x_A^{r}) + (\mathbf{1} - M) \odot x_B,$

where $M \in \{0,1\}^{H \times W}$ is a binary mask representing the paste location. Since the scale ratios and paste regions are obtained randomly, this mixing operation adds little cost.
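A hedged sketch of this mixing step is given below; the crop size and the bounds on the scale ratio $\tau$ are illustrative choices, since the text does not fix them.

```python
import random
import torch
import torch.nn.functional as F

def ramix_mix(x_a, x_b, tau_low=0.1, tau_high=0.8):
    """Sketch of RAMix image mixing: randomly crop image A, shrink the crop
    by a random scale ratio tau, and paste it at a random location in
    image B. Tensors are (C, H, W); crop size is a half-size crop for
    simplicity, an illustrative assumption."""
    _, H, W = x_a.shape
    # random crop of image A
    ch, cw = H // 2, W // 2
    y0, x0 = random.randint(0, H - ch), random.randint(0, W - cw)
    crop = x_a[:, y0:y0 + ch, x0:x0 + cw]
    # resize the crop by a random scale ratio tau ~ U(tau_low, tau_high)
    tau = random.uniform(tau_low, tau_high)
    rh, rw = max(1, int(H * tau)), max(1, int(W * tau))
    patch = F.interpolate(crop.unsqueeze(0), size=(rh, rw),
                          mode="bilinear", align_corners=False)[0]
    # paste onto a random region of image B; M marks the pasted area
    py, px = random.randint(0, H - rh), random.randint(0, W - rw)
    mixed = x_b.clone()
    mixed[:, py:py + rh, px:px + rw] = patch
    mask = torch.zeros(H, W)
    mask[py:py + rh, px:px + rw] = 1.0
    return mixed, mask
```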
The last step is to assign labels. Because paste areas differ in size, the degree to which the target in the original image is occluded also differs, and the assignment of labels should reflect this, so we improve the label assignment. We utilize the attention map $A$ instead of the size of the paste region to compute the mixing weight $\lambda$. The labels $y_A$ and $y_B$ of images A and B, respectively, are mixed according to $\lambda$ to obtain the label of the mixed image,

$\tilde{y} = \lambda y_A + (1 - \lambda) y_B.$

The calculation of $\lambda$ is guided by the attention map $A$, which determines the weights for mixing the labels of the two sample images:

$\lambda = A \cdot M_{\downarrow}, \quad M_{\downarrow} = \mathrm{nearest}(M),$

where $A$ is the attention map from the class token to the image patch tokens, summarizing which patches are most useful to the final classifier, and $\mathrm{nearest}(\cdot)$ denotes nearest-neighbor interpolation downsampling that transforms the original $M$ from $H \times W$ to $p$ pixels, with $p$ the number of patch tokens. In this way, the network can learn to dynamically reassign the label weights for each data point based on their responses in the attention map. An input that is better focused on by the attention map is assigned a higher value in the mixed label [30].
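The label weighting could be implemented as follows, assuming the class-token attention has already been extracted from the last Transformer layer and averaged over heads (as in TransMix [30]); function and argument names are our own.

```python
import torch
import torch.nn.functional as F

def ramix_lambda(attn, mask, patch_hw):
    """Sketch of attention-guided label weighting.
    attn: (p,) attention from the class token to the p patch tokens;
    mask: (H, W) binary paste mask from ramix_mix;
    patch_hw: (h, w) patch grid with h * w = p.
    Returns lambda, the label weight for the pasted image A."""
    # downsample M from H x W to p pixels with nearest-neighbor interpolation
    m_down = F.interpolate(mask[None, None], size=patch_hw, mode="nearest")
    m_down = m_down.flatten()           # (p,)
    lam = torch.sum(attn * m_down)      # lambda = A . M_down
    return lam.clamp(0.0, 1.0)

# mixed label: y_tilde = lam * y_a + (1 - lam) * y_b
```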

Algorithm Verification
We performed multiple experiments on three datasets in a Linux environment using the PyTorch deep learning framework on an Nvidia 3090 GPU.

Dataset and experimental details
We evaluated MFVT on the widely used fine-grained datasets iNaturalist 2017, CUB-200-2011, and Stanford Dogs; details are shown in Table 1. In the preprocessing stage, we used the same strategy for the relatively small CUB-200-2011 and Stanford Dogs: the input image was resized to 600 × 600 and randomly cropped to 448 × 448. For iNaturalist 2017, to reduce training time, we resized images to 400 × 400 and randomly cropped them to 304 × 304. Finally, random horizontal flipping and RAMix data augmentation were employed for all three datasets.
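Assuming standard torchvision transforms and ImageNet normalization statistics (the latter are not stated in the text), the CUB-200-2011 / Stanford Dogs preprocessing could be written as:

```python
from torchvision import transforms

# Resize to 600x600, random crop to 448x448, random horizontal flip,
# as described above; normalization values are standard ImageNet statistics.
train_tf = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.RandomCrop((448, 448)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```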
We chose the SGD optimizer for training, with momentum 0.9, weight decay 5e-4, and cosine annealing to adjust the learning rate. The initial learning rate was 0.03 for CUB-200-2011 and Stanford Dogs, and 0.02 for iNaturalist 2017. The batch size for all three datasets was 16. The weights in the weighted loss function were set to α = β = 1 and γ = 0.5.
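These optimizer settings translate directly to PyTorch; the model stand-in and the epoch count below are placeholders, since the text does not state the training length.

```python
import torch

model = torch.nn.Linear(768, 200)   # stand-in for MFVT
epochs = 100                        # placeholder; not stated in the text
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,  # 0.02 for iNat2017
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```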

Ablation experiment
To evaluate the effectiveness and impact of MFF and the RAMix data augmentation, we conducted ablation studies on the CUB-200-2011 dataset; similar behavior was observed on the other datasets.
To confirm the validity and complementarity of the information at each level, we used classification heads at different layers to make the final prediction and tried fusing the features of multiple stages, with results as shown in Tables 2 and 3. These experiments were performed without RAMix data augmentation.
From Table 2, we can see that even if only the class token of the fourth layer is used for the final classification, an accuracy of more than 68% can be achieved, which means that the class token of this layer carries features that are effective for classification. Table 3 shows that feature fusion significantly improves the performance of the model. As the number of fused levels increases, performance grows stronger, which shows the effectiveness and complementarity of the features at each level. However, after merging the features of the first stage, the accuracy of the model decreases.
Combined with the results in Table 2, we believe that the lowest-level features contain more noise, so the model cannot extract useful information from them well.
We used stage 4 + stage 3 + stage 2 for feature fusion, which resulted in a 0.5% accuracy improvement at the cost of a 0.4% increase in computation due to the additional classification heads, which we consider affordable given the performance gain.

Visualization
To analyze the effectiveness of the algorithm design and data augmentation, we show visualizations of RAMix data augmentation in the training phase and of the attention map in the testing phase.
Figure 5 shows the mixed images, in which we cover the image patches whose impact on classification is lower than a threshold. It can be seen that most of the background is covered while the object is preserved. Both the object in the original image and the objects in the mixed-in patches receive attention, which means that the strategy of label assignment using the attention map is reasonable and efficient.
Figure 6 shows the attention heatmap during the inference phase. The attention is distributed over the entire object, not just the most discriminative parts, which validates the effectiveness of our design.

Conclusion
We proposed MFVT, an improved fine-grained visual categorization method based on ViT. To improve the performance of the vision Transformer on FGVC, the backbone network adopts 12 layers of Transformer blocks divided into four stages, with feature fusion added between Transformer layers. The feature fusion mechanism integrates high-level information and low-level features. For more accurate and reliable data augmentation, the RAMix method was designed.
Experiments on the CUB-200-2011, Stanford Dogs, and iNaturalist 2017 datasets showed that MFVT significantly improves the classification accuracy of standard ViT in fine-grained settings, especially on the more challenging iNaturalist 2017 dataset, where MFVT reached an accuracy of 72.6%. Based on the experimental results, we believe the ViT model still has research potential in the field of FGVC.

Figures


Figure 1
Algorithm framework. The framework follows ViT and includes patch embedding, a Transformer encoder, a multilevel feature fusion module, classification heads, and data augmentation.