Search Results (10)

Search Parameters:
Keywords = fine-grained visual categorization

22 pages, 1342 KB  
Article
Multi-Scale Attention-Driven Hierarchical Learning for Fine-Grained Visual Categorization
by Zhihuai Hu, Rihito Kojima and Xian-Hua Han
Electronics 2025, 14(14), 2869; https://doi.org/10.3390/electronics14142869 - 18 Jul 2025
Cited by 1 | Viewed by 2213
Abstract
Fine-grained visual categorization (FGVC) presents significant challenges due to subtle inter-class variation and substantial intra-class diversity, often leading to limited discriminative capacity in global representations. Existing methods inadequately capture localized, class-relevant features across multiple semantic levels, especially under complex spatial configurations. To address these challenges, we introduce a Multi-scale Attention-driven Hierarchical Learning (MAHL) framework that iteratively refines feature representations via scale-adaptive attention mechanisms. Specifically, fully connected (FC) classifiers are applied to spatially pooled feature maps at multiple network stages to capture global semantic context. The learned FC weights are then projected onto the original high-resolution feature maps to compute spatial contribution scores for the predicted class, serving as attention cues. These multi-scale attention maps guide the selection of discriminative regions, which are hierarchically integrated into successive training iterations to reinforce both global and local contextual dependencies. Moreover, we explore a generalized pooling operation that parametrically fuses average and max pooling, enabling richer contextual retention in the encoded features. Comprehensive evaluations on benchmark FGVC datasets demonstrate that MAHL consistently outperforms state-of-the-art methods, validating its efficacy in learning robust, class-discriminative, high-resolution representations through attention-guided hierarchical refinement.
(This article belongs to the Special Issue Advances in Machine Learning for Image Classification)
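Two ideas in this abstract are concrete enough to sketch: the CAM-style projection of FC classifier weights onto feature maps, and the generalized pooling that parametrically fuses average and max pooling. Below is a minimal PyTorch sketch under assumed tensor shapes, not the authors' implementation; the learnable coefficient `alpha` and the helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedPool(nn.Module):
    """Fuse average and max pooling with one learnable mixing weight (assumed form)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion coefficient

    def forward(self, x):                        # x: (B, C, H, W)
        a = torch.sigmoid(self.alpha)            # keep the mix in [0, 1]
        avg = F.adaptive_avg_pool2d(x, 1).flatten(1)
        mx = F.adaptive_max_pool2d(x, 1).flatten(1)
        return a * avg + (1.0 - a) * mx          # (B, C)

def class_attention_map(feats, fc_weight, class_idx):
    """Project FC weights back onto the feature map to score spatial locations.
    feats: (B, C, H, W); fc_weight: (num_classes, C)."""
    w = fc_weight[class_idx]                      # (C,)
    return torch.einsum("bchw,c->bhw", feats, w)  # per-location class evidence
```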

23 pages, 2486 KB  
Article
Learning High-Order Features for Fine-Grained Visual Categorization with Causal Inference
by Yuhang Zhang, Yuan Wan, Jiahui Hao, Zaili Yang and Huanhuan Li
Mathematics 2025, 13(8), 1340; https://doi.org/10.3390/math13081340 - 19 Apr 2025
Viewed by 1477
Abstract
Recently, causal models have gained significant attention in natural language processing (NLP) and computer vision (CV) due to their capability of capturing features with causal relationships. This study addresses fine-grained visual categorization (FGVC) by incorporating high-order feature fusion to improve the representation of feature interactions while mitigating the influence of confounding factors through causal inference. A novel high-order feature learning framework with causal inference is developed to enhance FGVC. A causal graph tailored to FGVC is constructed, and the causal assumptions of baseline models are analyzed to identify confounding factors. A reconstructed causal structure establishes meaningful interactions between individual images and image pairs. Causal interventions are applied by severing specific causal links, effectively reducing confounding effects and enhancing model robustness. The framework combines high-order feature fusion with interventional fine-grained learning by performing causal interventions on both classifiers and categories. The experimental results demonstrate that the proposed method achieves accuracies of 90.7% on CUB-200, 92.0% on FGVC-Aircraft, and 94.8% on Stanford Cars, highlighting its effectiveness and robustness across these widely used fine-grained recognition datasets.
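The abstract's "high-order" fusion is not specified in detail; a common second-order instance is bilinear pooling, sketched below in PyTorch as an illustration only. The causal-graph and intervention machinery is not reproduced here.

```python
import torch

def bilinear_pool(feats):
    """Second-order feature interactions (bilinear pooling).
    feats: (B, C, H, W) -> (B, C*C) signed-sqrt, L2-normalized descriptor."""
    B, C, H, W = feats.shape
    x = feats.reshape(B, C, H * W)
    gram = torch.bmm(x, x.transpose(1, 2)) / (H * W)       # (B, C, C) channel interactions
    v = gram.reshape(B, C * C)
    v = torch.sign(v) * torch.sqrt(torch.abs(v) + 1e-10)   # signed square root
    return torch.nn.functional.normalize(v, dim=1)          # L2 normalization
```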

14 pages, 3540 KB  
Article
INTS-Net: Improved Navigator-Teacher-Scrutinizer Network for Fine-Grained Visual Categorization
by Huilong Jin, Jiangfan Xie, Jia Zhao, Shuang Zhang, Tian Wen, Song Liu and Ziteng Li
Electronics 2023, 12(7), 1709; https://doi.org/10.3390/electronics12071709 - 4 Apr 2023
Cited by 2 | Viewed by 2403
Abstract
Fine-grained image recognition, as a significant branch of computer vision, has become prevalent in various real-world applications. However, it is more challenging than general image recognition due to the highly localized and subtle differences in specific parts. Many classic models, including Bilinear Convolutional Neural Networks (Bilinear CNNs) and Destruction and Construction Learning (DCL), have emerged to make corresponding improvements. This paper focuses on optimizing the Navigator-Teacher-Scrutinizer Network (NTS-Net). The structure of NTS-Net gives it a strong ability to capture subtle informative areas. However, our research finds that this advantage leads to a bottleneck in the model's learning ability: during training, the loss value on the training set approaches zero prematurely, which is not conducive to later learning. We therefore propose the INTS-Net model, in which the Stochastic Partial Swap (SPS) method is flexibly added to the feature extractor of NTS-Net. By injecting noise into the model during training, neurons are activated in a more balanced and efficient manner. In addition, we obtain a speedup of about 4.5% in test time by fusing batch normalization and convolution. Experiments conducted on CUB-200-2011 and Stanford Cars demonstrate the superiority of INTS-Net.
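The batch-normalization/convolution fusion behind the reported ~4.5% test-time speedup is a standard inference-time transformation and can be sketched directly; this PyTorch version is a generic fold, not code from the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d's weight and bias."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # per-channel factor
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused  # one conv replaces conv + BN at inference time
```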

14 pages, 3596 KB  
Article
MFVT: Multilevel Feature Fusion Vision Transformer and RAMix Data Augmentation for Fine-Grained Visual Categorization
by Xinyao Lv, Hao Xia, Na Li, Xudong Li and Ruoming Lan
Electronics 2022, 11(21), 3552; https://doi.org/10.3390/electronics11213552 - 31 Oct 2022
Cited by 7 | Viewed by 5478
Abstract
The introduction and application of the Vision Transformer (ViT) has promoted the development of fine-grained visual categorization (FGVC). However, some problems arise when directly applying ViT to FGVC tasks: ViT classifies using only the class token in the last layer, ignoring the local and low-level features necessary for FGVC. We propose a ViT-based multilevel feature fusion transformer (MFVT) for FGVC tasks. In this framework, the backbone network follows ViT with 12 Transformer blocks, divided into four stages, with multilevel feature fusion (MFF) added between Transformer layers. We also design RAMix, a CutMix-based data augmentation strategy that uses a resize strategy for crop-paste images and attention-based label assignment. Experiments on the CUB-200-2011, Stanford Dogs, and iNaturalist 2017 datasets gave competitive results, especially on the challenging iNaturalist 2017, with an accuracy of 72.6%.
(This article belongs to the Special Issue Deep Learning for Computer Vision)
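RAMix's exact label assignment uses attention, which the abstract does not detail; the sketch below implements only the resize-and-paste mixing, with a simple area-based label weight as a stand-in. Function and parameter names are illustrative.

```python
import torch

def resize_paste_mix(images, labels, num_classes, box=(0.25, 0.25, 0.75, 0.75)):
    """images: (B, C, H, W) float tensor; pastes a resized, shuffled copy into `box`."""
    B, C, H, W = images.shape
    x1, y1, x2, y2 = (int(box[0] * W), int(box[1] * H), int(box[2] * W), int(box[3] * H))
    perm = torch.randperm(B)
    patch = torch.nn.functional.interpolate(          # resize, not crop, the source
        images[perm], size=(y2 - y1, x2 - x1), mode="bilinear", align_corners=False)
    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = patch
    lam = 1.0 - ((x2 - x1) * (y2 - y1)) / (W * H)     # area-based label weight (stand-in)
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    targets = lam * onehot + (1.0 - lam) * onehot[perm]
    return mixed, targets
```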

17 pages, 3046 KB  
Article
Progressive Training Technique with Weak-Label Boosting for Fine-Grained Classification on Unbalanced Training Data
by Yuhui Jin, Zuyun Wang, Huimin Liao, Sainan Zhu, Bin Tong, Yu Yin and Jian Huang
Electronics 2022, 11(11), 1684; https://doi.org/10.3390/electronics11111684 - 25 May 2022
Viewed by 1965
Abstract
In practical classification tasks, the sample distribution of a dataset is often unbalanced; for example, a dataset may contain a massive quantity of samples with weak labels for which concrete identification is unavailable. Even among samples with exact labels, the number of samples corresponding to many labels is small, making it difficult to learn those concepts from so few labeled examples. In addition, there is always a small inter-class variance and a large intra-class variance among categories. Weak labels, few-shot problems, and fine-grained analysis are the key challenges affecting the performance of a classification model. In this paper, we develop a progressive training technique to address the few-shot challenge, along with a weak-label boosting method that treats all weak IDs as negative samples of every predefined ID in order to take full advantage of the more numerous weak-label data. We introduce an instance-aware hard ID mining strategy in the classification loss and further develop global and local feature-mapping losses to expand the decision margin. We entered the proposed method into a Kaggle competition that aims to build an algorithm to identify individual humpback whales in images; with a few other common training tricks, the proposed approach won first place. All three problems (weak labels, few-shot learning, and fine-grained analysis) exist in the competition dataset. Additionally, we applied our method to CUB-200-2011 and Cars-196, the most widely used datasets for fine-grained visual categorization tasks, achieving respective accuracies of 90.1% and 94.9%. These experiments show that the proposed method outperforms other common baselines and verify the effectiveness of our method. Our solution has been made available as an open-source project.
(This article belongs to the Section Computer Science & Engineering)
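The weak-label boosting idea, treating every weak-ID sample as a negative for all predefined IDs, can be sketched as a per-class sigmoid loss. This is an illustrative reading of the abstract, not the authors' code; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def weak_label_bce_loss(logits, labels, is_weak):
    """logits: (B, K) per-ID scores; labels: (B,) class indices (any value for weak
    samples); is_weak: (B,) bool mask. Weak samples get an all-negative target."""
    K = logits.size(1)
    targets = F.one_hot(labels.clamp(min=0), K).float()
    targets[is_weak] = 0.0   # weak-ID sample: negative for every predefined ID
    return F.binary_cross_entropy_with_logits(logits, targets)
```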

14 pages, 2553 KB  
Article
Two-Branch Attention Learning for Fine-Grained Class Incremental Learning
by Jiaqi Guo, Guanqiu Qi, Shuiqing Xie and Xiangyuan Li
Electronics 2021, 10(23), 2987; https://doi.org/10.3390/electronics10232987 - 1 Dec 2021
Cited by 6 | Viewed by 2765
Abstract
As a long-standing research area, class incremental learning (CIL) aims to effectively learn a unified classifier as the number of classes grows. Owing to its small inter-class variances and large intra-class variances, fine-grained visual categorization (FGVC) is a challenging visual task that has not attracted enough attention in CIL. The localization of critical regions specialized for fine-grained object recognition plays a crucial role in FGVC, and it is important to learn fine-grained features from critical regions in fine-grained CIL for the recognition of new object classes. This paper designs a network architecture named the two-branch attention learning network (TBAL-Net) for fine-grained CIL. TBAL-Net can localize critical regions and learn fine-grained feature representations using a lightweight attention module. An effective training framework for fine-grained CIL is proposed by integrating TBAL-Net into an effective CIL process. This framework is tested on three popular fine-grained object datasets: CUB-200-2011, FGVC-Aircraft, and Stanford Cars. The comparative experimental results demonstrate that the proposed framework achieves state-of-the-art performance on all three datasets.
(This article belongs to the Special Issue Advancements in Cross-Disciplinary AI: Theory and Application)
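The abstract does not detail the lightweight two-branch attention module; a plausible shape, in the spirit of channel-plus-spatial attention designs such as CBAM, is sketched below. The module design is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoBranchAttention(nn.Module):
    """Illustrative lightweight attention: a channel branch and a spatial branch."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(                 # channel branch (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
        self.spatial = nn.Sequential(                 # spatial branch
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid())

    def forward(self, x):                             # x: (B, C, H, W)
        x = x * self.channel(x)                       # reweight channels
        s = torch.cat([x.mean(1, keepdim=True),       # avg- and max-pooled maps
                       x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial(s)                    # reweight spatial locations
```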

17 pages, 1861 KB  
Article
The Kinematics of Social Action: Visual Signals Provide Cues for What Interlocutors Do in Conversation
by James P. Trujillo and Judith Holler
Brain Sci. 2021, 11(8), 996; https://doi.org/10.3390/brainsci11080996 - 28 Jul 2021
Cited by 24 | Viewed by 4091
Abstract
During natural conversation, people must quickly understand the meaning of what the other speaker is saying. This concerns not just the semantic content of an utterance, but also the social action (i.e., what the utterance is doing—requesting information, offering, evaluating, checking mutual understanding, etc.) that the utterance is performing. The multimodal nature of human language raises the question of whether visual signals may contribute to the rapid processing of such social actions. While previous research has shown that how we move reveals the intentions underlying instrumental actions, we do not know whether the intentions underlying fine-grained social actions in conversation are also revealed in our bodily movements. Using a corpus of dyadic conversations combined with manual annotation and motion tracking, we analyzed the kinematics of the torso, head, and hands during the asking of questions. Manual annotation categorized these questions into six fine-grained social action types (i.e., request for information, other-initiated repair, understanding check, stance or sentiment, self-directed, active participation). We demonstrate, for the first time, that the kinematics of the torso, head, and hands differ between some of these social action categories, based on a 900 ms time window that captures movements starting slightly before to 600 ms after utterance onset. These results provide novel insights into the extent to which our intentions shape the way we move, and open new avenues for understanding how this phenomenon may facilitate the fast communication of meaning in conversational interaction.
(This article belongs to the Special Issue Human Intention in Motor Cognition)
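As a rough illustration of the kind of analysis described, assuming joint positions sampled at a fixed frame rate and a window running from 300 ms before to 600 ms after utterance onset, one might extract window-limited kinematic features like this (the authors' actual pipeline is not reproduced; all names and offsets are illustrative):

```python
import numpy as np

def window_kinematics(positions, fps, onset_s, start_offset_s=-0.3, end_offset_s=0.6):
    """positions: (T, 3) tracked joint positions in meters; fps: sampling rate.
    Returns peak and mean speed inside the 900 ms analysis window."""
    t0 = max(0, int((onset_s + start_offset_s) * fps))
    t1 = int((onset_s + end_offset_s) * fps)
    window = positions[t0:t1]
    speed = np.linalg.norm(np.diff(window, axis=0), axis=1) * fps  # m/s per frame
    return {"peak_speed": float(speed.max()), "mean_speed": float(speed.mean())}
```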

12 pages, 779 KB  
Article
Learning Attention-Aware Interactive Features for Fine-Grained Vegetable and Fruit Classification
by Yimin Wang, Zhifeng Xiao and Lingguo Meng
Appl. Sci. 2021, 11(14), 6533; https://doi.org/10.3390/app11146533 - 16 Jul 2021
Cited by 2 | Viewed by 3021
Abstract
Vegetable and fruit recognition can be considered a fine-grained visual categorization (FGVC) task, which is challenging due to large intraclass variances and small interclass variances. A mainstream direction for addressing the challenge is to exploit fine-grained local/global features to enhance feature extraction and representation in the learning pipeline. However, unlike the human visual system, most existing FGVC methods only extract features from individual images during training. In contrast, human beings can learn discriminative features by comparing two different images. Inspired by this intuition, a recent FGVC method named the Attentive Pairwise Interaction Network (API-Net) takes an image pair as input for pairwise feature interaction and demonstrates superior performance on several open FGVC data sets. However, the accuracy of API-Net on VegFru, a domain-specific FGVC data set, is lower than expected, potentially due to the lack of spatialwise attention. Following this direction, we propose an FGVC framework named the Attention-aware Interactive Features Network (AIF-Net), which refines API-Net by integrating an attentive feature extractor into the backbone network. Specifically, we employ a region proposal network (RPN) to generate a collection of informative regions and apply a biattention module to learn global and local attentive feature maps, which are fused and fed into an interactive feature learning subnetwork. The novel neural structure is verified through extensive experiments and shows consistent performance improvement over the state of the art (SOTA) on the VegFru data set, demonstrating its superiority in fine-grained vegetable and fruit recognition. We also find that a concatenation fusion operation applied in the feature extractor, along with the three top-scoring regions suggested by the RPN, effectively boosts performance.
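API-Net's pairwise interaction can be sketched in simplified form: a mutual vector computed from both images gates each image's features so the pair's shared evidence drives the comparison. AIF-Net's RPN and biattention components are omitted; all names here are illustrative.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Simplified pairwise feature interaction in the spirit of API-Net."""
    def __init__(self, dim):
        super().__init__()
        self.mutual = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f1, f2):                        # f1, f2: (B, D) pooled features
        m = self.mutual(torch.cat([f1, f2], dim=1))   # mutual gate from both images
        return f1 * m, f2 * m                         # attended features per image
```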

12 pages, 19293 KB  
Communication
A Public Dataset for Fine-Grained Ship Classification in Optical Remote Sensing Images
by Yanghua Di, Zhiguo Jiang and Haopeng Zhang
Remote Sens. 2021, 13(4), 747; https://doi.org/10.3390/rs13040747 - 18 Feb 2021
Cited by 97 | Viewed by 10977
Abstract
Fine-grained visual categorization (FGVC) is an important and challenging problem due to large intra-class differences and small inter-class differences caused by deformation, illumination, angles, etc. Although major advances have been achieved on natural images in the past few years thanks to the release of popular datasets such as CUB-200-2011, Stanford Cars, and Aircraft, fine-grained ship classification in remote sensing images has rarely been studied because of the relative scarcity of publicly available datasets. In this paper, we investigate a large amount of remote sensing imagery of sea ships and determine the 42 most common categories for fine-grained visual categorization. Building on our previous DSCR dataset for ship classification in remote sensing images, we collect additional remote sensing images containing warships and civilian ships of various scales from Google Earth and other popular remote sensing image datasets, including DOTA, HRSC2016, and NWPU VHR-10. We call our dataset FGSCR-42, meaning a dataset for Fine-Grained Ship Classification in Remote sensing images with 42 categories. FGSCR-42 contains 9320 images of the most common types of ships. We evaluate popular object classification algorithms and fine-grained visual categorization algorithms to build a benchmark. Our FGSCR-42 dataset is publicly available on our webpage.
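Assuming the released dataset is arranged in class-named folders (the path below is hypothetical), a benchmark entry of the kind the paper evaluates might look like this minimal PyTorch sketch:

```python
import torch
import torchvision
from torchvision import transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
data = torchvision.datasets.ImageFolder("FGSCR-42/test", transform=tfm)  # hypothetical path
loader = torch.utils.data.DataLoader(data, batch_size=64)

model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, len(data.classes))  # 42 ship classes
model.eval()  # assumes the model was fine-tuned on the training split beforehand

correct = total = 0
with torch.no_grad():
    for x, y in loader:
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
print(f"top-1 accuracy: {correct / total:.3f}")
```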

27 pages, 5036 KB  
Article
Image-Based Delineation and Classification of Built Heritage Masonry
by Noelia Oses, Fadi Dornaika and Abdelmalik Moujahid
Remote Sens. 2014, 6(3), 1863-1889; https://doi.org/10.3390/rs6031863 - 28 Feb 2014
Cited by 52 | Viewed by 9737
Abstract
Fundación Zain is developing new built heritage assessment protocols. The goal is to objectivize and standardize the analysis and decision process that leads to determining the degree of protection of built heritage in the Basque Country. The ultimate step in this objectivization and standardization effort will be the development of an information and communication technology (ICT) tool for the assessment of built heritage. This paper presents the groundwork carried out to make this tool possible: the automatic, image-based delineation of stone masonry. This is a necessary first step in the development of the tool, as the built heritage to be assessed consists of stone masonry construction, and many of the features analyzed can be characterized according to the geometry and arrangement of the stones. Much of the assessment is carried out through visual inspection; this process will therefore be automated by applying image processing to digital images of the elements under inspection. The principal contribution of this paper is the proposed automatic delineation framework. The other contribution is the performance evaluation of this delineation as the input to a classifier for a geometrically characterized feature of a built heritage object. The element chosen for this evaluation is the stone arrangement of masonry walls. The validity of the proposed framework is assessed on real images of masonry walls.
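The paper's delineation method is not reproduced here; as a generic illustration of image-based stone delineation with classical operations, one might start from edges and morphology as below (the file name and thresholds are hypothetical):

```python
import cv2

img = cv2.imread("masonry_wall.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
blur = cv2.GaussianBlur(img, (5, 5), 0)                     # suppress stone texture noise
edges = cv2.Canny(blur, 50, 150)                            # candidate boundary pixels
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)   # join broken mortar lines
contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
stones = [c for c in contours if cv2.contourArea(c) > 200]  # drop tiny blobs
print(f"delineated {len(stones)} candidate stones")
```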
