Token Injection Transformer for Enhanced Fine-Grained Recognition
Abstract
1. Introduction
1. Patch Tokenization Step: Image feature gradients (capturing fine-grained spatial transitions, e.g., the edge of a car grille) are inherently defined at the pixel level. When pixels are converted into fixed-size patch tokens, the local spatial gradient information is averaged within each patch, leading to an initial loss of high-frequency gradient detail (a minimal illustrative sketch follows this list).
2. Self-Attention Aggregation Step: The self-attention mechanism fuses token features based on similarity scores, prioritizing global semantic consistency over local spatial gradient preservation. With stacked attention layers, this global fusion further smooths the spatial gradient information embedded in the tokens: as tokens are repeatedly aggregated across layers, the distinct gradient cues that distinguish adjacent regions (e.g., between a headlight and the car body) are progressively diluted, attenuating the image's content feature gradients.
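To make the first point concrete, the following minimal sketch (not the paper's code; the 16×16 patch size, the input resolution, and the use of average pooling as a stand-in for the linear patch projection are assumptions) computes a pixel-level Sobel gradient map and shows how reducing each patch to a single aggregate value collapses the gradient detail inside it.

```python
# Illustrative sketch: pixel-level gradient detail is largely averaged away
# when pixels are grouped into fixed-size patch tokens.
import torch
import torch.nn.functional as F

def sobel_magnitude(img):
    """Per-pixel gradient magnitude of a grayscale image of shape (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

img = torch.rand(1, 1, 224, 224)                # stand-in input image
grad = sobel_magnitude(img)                     # fine, pixel-level edge map

# Patch tokenization: each 16x16 patch is reduced to one token, so the spatial
# gradient variation inside the patch collapses to a single aggregate value
# (average pooling here is a simplification of the learned patch projection).
patch_level = F.avg_pool2d(grad, kernel_size=16, stride=16)   # shape (1, 1, 14, 14)

# The spread of gradient values shrinks markedly after patch-level aggregation.
print(grad.std().item(), patch_level.std().item())
```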
1. We provide an in-depth analysis of how multi-layer token aggregation in Vision Transformers progressively weakens gradient-based edge and texture information, highlighting an underexplored limitation of existing attention-based FGVC models.
2. We propose a token injection mechanism that encodes gradient magnitude and orientation into learnable tokens and integrates them with visual tokens via Transformer attention, enabling effective visual–gradient feature interaction without additional annotations (a hedged sketch of this idea follows the list).
3. Extensive evaluation on four standard FGVC benchmarks shows that the proposed Gradient-Aware Token Injection Transformer establishes new state-of-the-art accuracies of 92.9%, 90.5%, 93.2%, and 95.3% on CUB-200-2011, iNaturalist 2018, NABirds, and Stanford Cars, respectively, including a record-setting result on the NABirds benchmark.
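The sketch below is a hedged illustration of the token injection mechanism named in contribution 2: gradient magnitude and orientation maps are embedded, compressed into a small set of learnable gradient tokens, and fused with the visual tokens through standard Transformer attention. The module names, dimensions, and gradient-token count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GradientTokenInjection(nn.Module):
    """Hypothetical sketch of visual-gradient token fusion (not the paper's code)."""
    def __init__(self, dim=768, num_grad_tokens=16, num_heads=12):
        super().__init__()
        # Project a 2-channel gradient map (magnitude, orientation) into patch-like tokens.
        self.grad_embed = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        # Learnable queries that pool the gradient map into a few gradient tokens.
        self.grad_queries = nn.Parameter(torch.randn(1, num_grad_tokens, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fusion: one standard Transformer encoder layer over [visual; gradient] tokens.
        self.fuse = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                               batch_first=True, norm_first=True)

    def forward(self, visual_tokens, grad_map):
        # visual_tokens: (B, N, dim); grad_map: (B, 2, H, W)
        g = self.grad_embed(grad_map).flatten(2).transpose(1, 2)   # (B, N_g, dim)
        q = self.grad_queries.expand(visual_tokens.size(0), -1, -1)
        grad_tokens, _ = self.pool(q, g, g)                        # learnable gradient tokens
        fused = self.fuse(torch.cat([visual_tokens, grad_tokens], dim=1))
        return fused[:, :visual_tokens.size(1)]                    # keep the visual token count

x = torch.randn(2, 196, 768)        # e.g., 14x14 patch tokens from a ViT backbone
g = torch.randn(2, 2, 224, 224)     # stacked gradient magnitude and orientation maps
out = GradientTokenInjection()(x, g)
print(out.shape)                     # torch.Size([2, 196, 768])
```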
2. Related Work
2.1. Fine-Grained Visual Classification
2.2. Transformer Architectures and Optimizations in Vision
3. Methodology
3.1. Image Tokenization
3.2. Gradient Tokenization
3.3. Token Fusion Unit
4. Results
4.1. Datasets and Implementation Details
4.1.1. Datasets
4.1.2. Implementation Details
4.2. Comparison with State-of-the-Art Methods
4.2.1. Results on the iNaturalist 2018 Dataset
4.2.2. Results on the CUB-200-2011 Dataset
4.2.3. Results on the NABirds Dataset
4.2.4. Results on the Stanford Cars Dataset
1. Dataset-specific feature dominance: The Stanford Cars dataset contains a large number of images with highly similar geometric structures (e.g., the same car models with different colors or backgrounds). In such cases, color and texture features may play a more dominant role than fine-grained geometric details in classification. Gradient tokens, which are designed to highlight geometric features, therefore do not provide a significant edge over existing methods on this dataset.
2. Limited feature discrimination: The current design of gradient tokens focuses on local geometric feature extraction but lacks a global feature aggregation mechanism. For car images with subtle differences in grilles or headlights, the local geometric cues captured by gradient tokens may not be sufficient to distinguish between similar categories, leading to saturated performance.
4.3. Visual Analysis
4.3.1. Qualitative Attention Analysis Across Datasets
4.3.2. Gradient Token Activation Maps
4.4. Ablation Study
4.4.1. Effectiveness of the Proposed Modules
4.4.2. Effect of Gradient Token Count on Model Performance
4.4.3. Evaluation Beyond Accuracy
4.4.4. Confusion Matrix Analysis
4.5. Model Complexity Evaluation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Yang, Z.; Luo, T.; Wang, D.; Hu, Z.; Gao, J.; Wang, L. Learning to Navigate for Fine-Grained Classification. In Proceedings of the Computer Vision–ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIV. Springer: Berlin/Heidelberg, Germany, 2018; pp. 438–454. [Google Scholar] [CrossRef]
- Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Learning deep bilinear transformation for fine-grained image representation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 4227–4236. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Zhang, Z.C.; Chen, Z.D.; Wang, Y.; Luo, X.; Xu, X.S. Vit-fod: A vision transformer based fine-grained object discriminator. arXiv 2022, arXiv:2203.12816. [Google Scholar]
- Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4682–4692. [Google Scholar] [CrossRef]
- Tang, Y.; Han, K.; Guo, J.; Xu, C.; Li, Y.; Xu, C.; Wang, Y. An Image Patch Is a Wave: Phase-Aware Vision MLP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10935–10944. [Google Scholar] [CrossRef]
- He, J.; Chen, J.N.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A transformer architecture for fine-grained recognition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 852–860. [Google Scholar]
- Chen, T.; Lin, L.; Chen, R.; Wu, Y.; Luo, X. Knowledge-embedded representation learning for fine-grained image recognition. arXiv 2018, arXiv:1807.00505. [Google Scholar]
- Mac Aodha, O.; Cole, E.; Perona, P. Presence-Only Geographical Priors for Fine-Grained Image Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9595–9605. [Google Scholar] [CrossRef]
- Diao, Q.; Jiang, Y.; Wen, B.; Sun, J.; Yuan, Z. Metaformer: A unified meta framework for fine-grained recognition. arXiv 2022, arXiv:2203.02751. [Google Scholar]
- Wang, J.; Yu, X.; Gao, Y. Feature fusion vision transformer for fine-grained visual categorization. arXiv 2021, arXiv:2107.02341. [Google Scholar]
- Zhu, L.; Chen, T.; Yin, J.; See, S.; Liu, J. Learning Gabor Texture Features for Fine-Grained Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 1621–1631. [Google Scholar] [CrossRef]
- Wang, J.; Xu, Q.; Jiang, B.; Luo, B.; Tang, J. Multi-Granularity Part Sampling Attention for Fine-Grained Visual Classification. IEEE Trans. Image Process. 2024, 33, 4529–4542. [Google Scholar] [CrossRef]
- Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNNs for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar] [CrossRef]
- Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 574–589. [Google Scholar] [CrossRef]
- Behera, A.; Wharton, Z.; Hewage, P.R.P.G.; Bera, A. Context-Aware Attentional Pooling (CAP): For fine-grained visual classification. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), Virtual, 2–9 February 2021; Volume 35, pp. 929–937. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5314–5321. [Google Scholar] [CrossRef]
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. MetaFormer Is Actually What You Need for Vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar] [CrossRef]
- Lu, Z.; Lu, B.; Wang, F. CausalSR: Structural causal model-driven super-resolution with counterfactual inference. Neurocomputing 2025, 646, 130375. [Google Scholar] [CrossRef]
- Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; Belongie, S. The iNaturalist Species Classification and Detection Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8769–8778. [Google Scholar] [CrossRef]
- Van Horn, G.; Branson, S.; Farrell, R.; Haber, S.; Barry, J.; Ipeirotis, P.; Perona, P.; Belongie, S. Building a Bird Recognition App and Large Scale Dataset with Citizen Scientists: The Fine Print in Fine-Grained Dataset Collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 595–604. [Google Scholar] [CrossRef]
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
- Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 1–8 December 2013; pp. 554–561. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Chu, G.; Potetz, B.; Wang, W.; Howard, A.; Song, Y.; Brucher, F.; Leung, T.; Adam, H. Geo-Aware Networks for Fine-Grained Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCV Workshops), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
- Sun, S.; Lu, H.; Li, J.; Xie, Y.; Li, T.; Yang, X.; Zhang, L.; Yan, J. Rethinking Classifier Re-Training in Long-Tailed Recognition: Label Over-Smooth Can Balance. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Virtual, 28 April–2 May 2025. [Google Scholar]
- Han, P.; Ye, C.; Tong, J.; Jiang, C.; Hong, J.; Fang, L.; Li, X. Enhancing Features in Long-tailed Data Using Large Vision Model. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; pp. 1–9. [Google Scholar]
- Park, S.; Hong, Y.; Heo, B.; Yun, S.; Choi, J.Y. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6887–6896. [Google Scholar]
- Cui, J.; Zhong, Z.; Liu, S.; Yu, B.; Jia, J. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 715–724. [Google Scholar]
- Touvron, H.; Cord, M.; El-Nouby, A.; Verbeek, J.; Jégou, H. Three things everyone should know about vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 497–515. [Google Scholar] [CrossRef]
- Cui, J.; Zhong, Z.; Tian, Z.; Liu, S.; Yu, B.; Jia, J. Generalized parametric contrastive learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 7463–7474. [Google Scholar] [CrossRef] [PubMed]
- Liu, J.; Huang, X.; Zheng, J.; Liu, Y.; Li, H. MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6252–6261. [Google Scholar] [CrossRef]
- Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 32–42. [Google Scholar] [CrossRef]
- Tian, C.; Wang, W.; Zhu, X.; Dai, J.; Qiao, Y. VL-LTR: Learning class-wise visual-linguistic representation for long-tailed visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 73–91. [Google Scholar] [CrossRef]
- Kim, D.; Heo, B.; Han, D. DenseNets reloaded: Paradigm shift beyond ResNets and ViTs. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 395–415. [Google Scholar] [CrossRef]
- Girdhar, R.; Singh, M.; Ravi, N.; Van Der Maaten, L.; Joulin, A.; Misra, I. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16102–16112. [Google Scholar] [CrossRef]
- Singh, M.; Misra, I.; Joulin, A.; Van Der Maaten, L. Revisiting weakly supervised pre-training of visual perception models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 804–814. [Google Scholar] [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar] [CrossRef]
- Ryali, C.; Hu, Y.T.; Bolya, D.; Wei, C.; Fan, H.; Huang, P.Y.; Aggarwal, V.; Chowdhury, A.; Poursaeed, O.; Hoffman, J.; et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 29441–29454. [Google Scholar]
- Xu, Q.; Wang, J.; Jiang, B.; Luo, B. Fine-grained visual classification via internal ensemble learning transformer. IEEE Trans. Multimed. 2023, 25, 9015–9028. [Google Scholar] [CrossRef]
- Zhao, Y.; Li, J.; Chen, X.; Tian, Y. Part-guided relational transformers for fine-grained visual recognition. IEEE Trans. Image Process. 2021, 30, 9470–9481. [Google Scholar] [CrossRef] [PubMed]
- Yang, X.; Wang, Y.; Chen, K.; Xu, Y.; Tian, Y. Fine-grained object classification via self-supervised pose alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7399–7408. [Google Scholar] [CrossRef]
- Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 1025–1034. [Google Scholar] [CrossRef]
- Imran, A.; Athitsos, V. Domain adaptive transfer learning on visual attention aware data augmentation for fine-grained visual categorization. In Proceedings of the 15th International Symposium on Visual Computing (ISVC), San Diego, CA, USA, 5–7 October 2020; pp. 53–65. [Google Scholar] [CrossRef]
- Liu, M.; Zhang, C.; Bai, H.; Zhang, R.; Zhao, Y. Cross-part learning for fine-grained image classification. IEEE Trans. Image Process. 2021, 31, 748–758. [Google Scholar] [CrossRef]
- Liu, H.; Zhang, C.; Deng, Y.; Xie, B.; Liu, T.; Li, Y.F. TransIFC: Invariant cues-aware feature concentration learning for efficient fine-grained bird image classification. IEEE Trans. Multimed. 2023, 25, 11894–11907. [Google Scholar] [CrossRef]
- Zhao, P.; Yang, S.; Ding, W.; Liu, R.; Xin, W.; Liu, X.; Miao, Q. Learning multi-scale attention network for fine-grained visual classification. J. Inf. Intell. 2025, 3, 492–503. [Google Scholar] [CrossRef]
- Kim, S.; Nam, J.; Ko, B.C. ViT-Net: Interpretable vision transformers with neural tree decoder. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 11162–11172. [Google Scholar]
- Sun, H.; He, X.; Peng, Y. Sim-Trans: Structure information modeling transformer for fine-grained visual categorization. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), Lisbon, Portugal, 10–14 October 2022; pp. 5853–5861. [Google Scholar] [CrossRef]
- Jiang, X.; Tang, H.; Gao, J.; Du, X.; He, S.; Li, Z. Delving into multimodal prompting for fine-grained visual classification. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 23–27 February 2024; Volume 38, pp. 2570–2578. [Google Scholar]
- Zhang, Z.C.; Chen, Z.D.; Wang, Y.; Luo, X.; Xu, X.S. A vision transformer for fine-grained classification by reducing noise and enhancing discriminative information. Pattern Recognit. 2024, 145, 109979. [Google Scholar] [CrossRef]
- Qiao, S.; Li, S.; Zheng, H. Fine-Grained Visual Classification via Adaptive Attention Quantization Transformer. IEEE Trans. Neural Netw. Learn. Syst. 2025, 1–15. [Google Scholar] [CrossRef]
- Bi, Q.; Zhou, B.; Ji, W.; Xia, G.S. Universal Fine-Grained Visual Categorization by Concept Guided Learning. IEEE Trans. Image Process. 2025, 34, 394–409. [Google Scholar] [CrossRef]
- Zhao, Y.; Yan, K.; Huang, F.; Li, J. Graph-based high-order relation discovery for fine-grained recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 15079–15088. [Google Scholar] [CrossRef]
- Zhuang, P.; Wang, Y.; Qiao, Y. Learning attentive pairwise interaction for fine-grained classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13130–13137. [Google Scholar]
- Du, R.; Xie, J.; Ma, Z.; Chang, D.; Song, Y.Z.; Guo, J. Progressive learning of category-consistent multi-granularity features for fine-grained visual classification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9521–9535. [Google Scholar] [CrossRef]
- Ke, X.; Cai, Y.; Chen, B.; Liu, H.; Guo, W. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification. Pattern Recognit. 2023, 137, 109305. [Google Scholar] [CrossRef]
- Shi, Y.; Hong, Q.; Yan, Y.; Li, J. LDH-ViT: Fine-grained visual classification through local concealment and feature selection. Pattern Recognit. 2025, 161, 111224. [Google Scholar] [CrossRef]
- Yao, H.; Miao, Q.; Zhao, P.; Li, C.; Li, X.; Feng, G.; Liu, R. Exploration of class center for fine-grained visual classification. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9954–9966. [Google Scholar] [CrossRef]
| Dataset | Classes | Train | Test | LR |
|---|---|---|---|---|
| iNaturalist 2018 [21] | 8142 | 437,513 | 24,426 | 5 |
| CUB-200-2011 [23] | 200 | 5994 | 5794 | 5 |
| NABirds [22] | 555 | 23,929 | 24,633 | 5 |
| Stanford Cars [24] | 196 | 8114 | 8041 | 5 |
| Method | Publication | Accuracy (%) |
|---|---|---|
| LOS [28] | ICLR 2025 | 70.8 |
| LVM-LT [29] | IJCNN 2025 | 72.7 |
| BS-CMO [30] | CVPR 2022 | 74.0 |
| PaCo [31] | ICCV 2021 | 75.2 |
| ViT-L [32] | ECCV 2022 | 75.3 |
| GPaCo [33] | TPAMI 2023 | 75.4 |
| MixMIM-B [34] | CVPR 2023 | 77.5 |
| CaiT-M-36U224 [35] | ICCV 2021 | 78.0 |
| MixMIM-L [34] | CVPR 2023 | 80.3 |
| VL-LTR [36] | ECCV 2022 | 81.0 |
| RDNet-L [37] | ECCV 2024 | 81.8 |
| OMNIVORE [38] | CVPR 2022 | 84.1 |
| SWAG [39] | CVPR 2022 | 86.0 |
| MAE [40] | CVPR 2022 | 86.8 |
| Hiera-H [41] | PMLR 2023 | 87.3 |
| MetaFormer [10] | arXiv 2022 | 88.7 |
| Ours | – | 90.5 |
| Method | Publication | Accuracy (%) |
|---|---|---|
| PART [43] | TIP 2021 | 90.1 |
| P2P-Net [44] | CVPR 2022 | 90.2 |
| CAL [45] | ICCV 2021 | 90.6 |
| DATL [46] | ISVC 2020 | 91.2 |
| CP-CNN [47] | TIP 2022 | 91.4 |
| LGTF [12] | ICCV 2023 | 91.5 |
| TransIFC [48] | TMM 2023 | 91.0 |
| CAMF [49] | SPL 2021 | 91.2 |
| RAMS-Trans [45] | AAAI 2021 | 91.3 |
| ViT-Net [50] | ICML 2022 | 91.6 |
| TransFG [7] | AAAI 2022 | 91.7 |
| SIM-Trans [51] | ACM MM 2022 | 91.8 |
| IELT [42] | TMM 2023 | 91.8 |
| MP-FGVR [52] | AAAI 2024 | 91.8 |
| ACC-ViT [53] | PR 2024 | 91.8 |
| DACL [5] | CVPR 2022 | 92.0 |
| QTrans [54] | TNNLS 2025 | 92.2 |
| MetaFormer [10] | arXiv 2022 | 92.5 |
| CGL [55] | TMM 2025 | 92.6 |
| MPSA [13] | TIP 2024 | 92.8 |
| MetaFormer (ImageNet-21K) [10] | arXiv 2022 | 91.0 |
| MetaFormer (iNaturalist 2021) [10] | arXiv 2022 | 92.5 |
| Ours (ImageNet-21K) | – | 91.4 |
| Ours (iNaturalist 2021) | – | 92.9 |
| Method | Publication | Accuracy (%) |
|---|---|---|
| GaRD [56] | CVPR 2021 | 88.0 |
| API-Net [57] | AAAI 2020 | 88.1 |
| PMG-V2 [58] | TPAMI 2022 | 88.4 |
| GDSPM-Net [59] | PR 2023 | 89.0 |
| LGTF [12] | ICCV 2023 | 90.4 |
| TransFG [7] | AAAI 2022 | 90.8 |
| IELT [42] | TMM 2023 | 90.8 |
| TransIFC [48] | TMM 2023 | 90.9 |
| MP-FGVR [52] | AAAI 2024 | 91.0 |
| ACC-ViT [53] | PR 2024 | 91.4 |
| CGL [55] | TMM 2025 | 91.7 |
| MPSA [13] | TIP 2024 | 92.5 |
| QTrans [54] | TNNLS 2025 | 92.5 |
| MetaFormer (ImageNet-21K) [10] | arXiv 2022 | 91.9 |
| MetaFormer (iNaturalist 2021) [10] | arXiv 2022 | 92.8 |
| Ours (ImageNet-21K) | – | 92.3 |
| Ours (iNaturalist 2021) | – | 93.2 |
| Method | Publication | Accuracy (%) |
|---|---|---|
| LDH-ViT [60] | PR 2025 | 93.1 |
| QTrans [54] | TNNLS 2025 | 93.6 |
| DATL [46] | ISVC 2020 | 94.5 |
| ResNet50-ECC [61] | TCSVT 2024 | 94.7 |
| TransFG [7] | AAAI 2022 | 94.8 |
| ViT-Net [50] | ICML 2022 | 95.0 |
| GaRD [56] | CVPR 2021 | 95.1 |
| CAMF [49] | SPL 2021 | 95.3 |
| DACL [5] | CVPR 2022 | 95.3 |
| API-Net [57] | AAAI 2020 | 95.3 |
| GDSPM-Net [59] | PR 2023 | 95.3 |
| PART [43] | TIP 2021 | 95.3 |
| MetaFormer (ImageNet-21K) [10] | arXiv 2022 | 94.9 |
| MetaFormer (iNaturalist 2021) [10] | arXiv 2022 | 95.0 |
| Ours (ImageNet-21K) | – | 95.2 |
| Ours (iNaturalist 2021) | – | 95.3 |
| GTE | iNaturalist 2018 | CUB-200-2011 | NABirds | Stanford Cars |
|---|---|---|---|---|
| - | 88.7 | 92.5 | 92.8 | 95.0 |
| ✓ | 90.5 | 92.9 | 93.2 | 95.3 |
| Number of Gradient Tokens | 0 | 4 | 8 | 16 | 20 |
|---|---|---|---|---|---|
| Accuracy (%) | 92.5 | 92.7 | 92.7 | 92.9 | 92.7 |
| Dataset | Acc@1 (%) | Acc@5 (%) | Macro F1 | Weighted F1 | Macro AUC |
|---|---|---|---|---|---|
| NABirds | 93.066 | 99.610 | 0.9157 | 0.9290 | 0.9996 |
| CUB-200-2011 | 92.906 | 98.792 | 0.9285 | 0.9281 | 0.9977 |
| Stanford Cars | 95.275 | 99.678 | 0.9491 | 0.9501 | 0.9987 |
| iNaturalist2018 | 90.464 | 98.886 | 0.9021 | 0.9051 | 0.9993 |
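For reference, the following is a minimal sketch (assumed tooling, not the paper's evaluation code) of how metrics of this kind are typically computed with scikit-learn from per-sample class probabilities and ground-truth labels; the data below are random stand-ins.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, top_k_accuracy_score

rng = np.random.default_rng(0)
num_classes = 200                                        # e.g., CUB-200-2011
y = np.repeat(np.arange(num_classes), 10)                # 10 stand-in samples per class
probs = rng.dirichlet(np.ones(num_classes), size=y.size) # stand-in softmax outputs (rows sum to 1)
pred = probs.argmax(axis=1)

acc1 = top_k_accuracy_score(y, probs, k=1)               # Acc@1
acc5 = top_k_accuracy_score(y, probs, k=5)               # Acc@5
macro_f1 = f1_score(y, pred, average="macro")
weighted_f1 = f1_score(y, pred, average="weighted")
macro_auc = roc_auc_score(y, probs, multi_class="ovr", average="macro")
print(acc1, acc5, macro_f1, weighted_f1, macro_auc)
```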
| Method | Input Size | Params (M) | GFLOPs | Accuracy (%) |
|---|---|---|---|---|
| ViT [3] | 448 | 86.4 | 78.5 | 91.0 |
| TransFG [7] | 448 | 86.4 | 130.2 | 91.7 |
| IELT [42] | 448 | 93.5 | 73.2 | 91.8 |
| ACC-ViT [53] | 448 | 87.0 | 162.9 | 91.8 |
| ViT-Net [50] | 448 | 92.2 | 65.6 | 91.8 |
| MPSA [13] | 448 | 98.5 | 67.9 | 92.5 |
| MetaFormer [10] | 384 | 81.0 | 50.01 | 92.8 |
| Ours | 384 | 86.9 | 51.35 | 93.2 |
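The parameter and GFLOP figures above are the kind of numbers one can reproduce with a direct parameter count and a FLOP profiler. The sketch below uses torchvision's ViT-B/16 purely as a stand-in model and fvcore's FlopCountAnalysis as the profiler; both are assumptions rather than the paper's measurement setup.

```python
import torch
from torchvision.models import vit_b_16
from fvcore.nn import FlopCountAnalysis

model = vit_b_16().eval()                            # stand-in backbone, random weights
params_m = sum(p.numel() for p in model.parameters()) / 1e6
x = torch.randn(1, 3, 224, 224)                      # torchvision ViT-B/16 expects 224x224 input
flops = FlopCountAnalysis(model, x).total()          # fvcore counts a fused multiply-add as one op
print(f"Params: {params_m:.1f} M, GFLOPs: {flops / 1e9:.2f}")
```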