Few-Shot Fine-Grained Image Classification: A Comprehensive Review



Introduction
Few-shot fine-grained image classification (FSFGIC) [1] refers to classifying images (e.g., birds [2], flowers [3], and airplanes [4]) that belong to different subclasses of the same species using only a small number of labeled samples. As illustrated in Figure 1, image classification tasks can be divided by classification granularity into coarse-grained image classification (CGIC) and fine-grained image classification (FGIC). CGIC is cross-species classification: the classes usually differ obviously in appearance, giving large inter-class differences and small intra-class differences. FGIC classifies different subclasses of the same species: the differences between classes may be very small, giving small inter-class differences and large intra-class differences.
Researchers have found that two-year-old children can classify objects into different categories after viewing just a few images, but a child may be confused by fine-grained image classification with a limited number of samples [5,6], for the following reasons: (1) Objects in FSFGIC come from sub-categories of one category, making them visually very similar. Some images may differ only in subtle visual features, requiring domain experts to distinguish between specific categories. (2) Samples are affected by factors such as background, pose, occlusion, light intensity, and shooting angle, so the differences between subclasses may be small while the differences within the same subclass may be larger, resulting in a classification problem with small inter-class differences and large intra-class differences.
Fine-grained image datasets usually contain a small number of samples and need domain experts for labeling. Traditional image classification algorithms, however, require a large amount of labeled data for model training, which makes them poorly suited to FGIC tasks. How to use few-shot learning to complete FGIC tasks is therefore a research hotspot in this field. Since the objects in different sub-categories of the same entry-level category are very similar to each other, a key consideration in FSFGIC is how to effectively learn discriminative features from extremely limited training samples, which makes FSFGIC a very challenging research problem. With the growing attention on FSFGIC, various FSFGIC methods have recently been proposed, and many few-shot learning methods have also been applied to FSFGIC tasks with impressive results. Currently, there is no survey of FSFGIC, and this paper aims to fill this gap. The quality of feature representation learning directly affects classification performance on FSFGIC, because it determines whether a method can make good use of the limited sample information and learn more discriminative feature representations, and hence its classification accuracy and generalization ability. Accordingly, a taxonomy of feature representation learning for FSFGIC is proposed, and the different types of FSFGIC methods are discussed in depth according to this taxonomy. Few-shot image classification algorithms (e.g., [7,8]) that have achieved good classification performance on some FSFGIC datasets are also introduced in this survey.
The contributions of this survey comprise the following aspects. This is the first work to review FSFGIC under a taxonomy of feature representation learning. Different types of feature representation learning techniques for FSFGIC are then reviewed. Additionally, the relationships among different FSFGIC methods are presented. Finally, drawing on representative existing FSFGIC techniques, the main unresolved issues in FSFGIC are discussed.

Problem, Datasets, and Categorization of FSFGIC Methods
In this section, the problem formulation of FSFGIC, categorization of FSFGIC methods, and representative benchmark datasets for FSFGIC are presented.

Problem Formulation
For an FSFGIC task, the dataset $D$ is typically divided into a training set $D_{\mathrm{train}}$, a validation set $D_{\mathrm{val}}$, and a test set $D_{\mathrm{test}}$. $D_{\mathrm{train}}$ is used to train the parameters of the model, $D_{\mathrm{val}}$ is used to verify and tune the model, and $D_{\mathrm{test}}$ is used to finally evaluate the accuracy of the FSFGIC method. These correspond to the three stages of training, validation, and testing of the model. Each stage consists of many epochs, each containing thousands of episodes.
The FSFGIC task is denoted as a $C$-way $K$-shot task, which means that $C$ categories are selected in each episode, $K$ samples in each category are selected as support samples, and part of the remaining samples in the $C$ categories are selected as query samples. Each episode's dataset $D_{\mathrm{episode}}$ consists of a support set $S$ of $C \times K$ labeled support samples and a query set $Q$ of $C \times J$ unlabeled query samples:
$$D_{\mathrm{episode}} = \{S, Q\}, \quad S = \{(x_i, y_i)\}_{i=1}^{C \times K}, \quad Q = \{x_j\}_{j=1}^{C \times J},$$
where $x_i \cap x_j = \emptyset$ (support and query samples do not overlap), $x_i$ and $x_j$ denote fine-grained samples drawn from the $C$ selected categories, and $y_i$ is the ground-truth label of $x_i$, taking one of the $C$ category labels. The purpose of an FSFGIC method is to predict the category of each $x_j$ using the $x_i$ and $y_i$. The evaluation criterion is classification accuracy, calculated by dividing the number of correctly predicted query samples by the total number of query samples.
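The episode construction and the accuracy criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any surveyed method: the function names, the dictionary layout of the dataset, and the toy item names are assumptions made for the example.

```python
import random

def sample_episode(labels_to_items, c_way, k_shot, j_query, rng):
    """Draw one C-way K-shot episode from a dict {class_label: [items]} (assumed layout)."""
    classes = rng.sample(sorted(labels_to_items), c_way)
    support, query = [], []
    for label in classes:
        # Sample K + J distinct items so support and query never overlap.
        picked = rng.sample(labels_to_items[label], k_shot + j_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

def accuracy(predictions, targets):
    """Correctly predicted query samples divided by the total number of query samples."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Toy dataset (hypothetical): 10 classes with 20 placeholder items each.
toy = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, c_way=5, k_shot=1, j_query=15, rng=random.Random(0))
```

Sampling the $K + J$ items of a class in a single draw is one simple way to guarantee that the support and query sets are disjoint.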

A Taxonomy of the Existing Feature Representation Learning for FSFGIC
According to the content and form of the learned features, the existing feature representation learning techniques for FSFGIC can be divided into three categories: local and/or global deep feature representation learning based FSFGIC methods [9,10], class representation learning based FSFGIC methods [11,12], and task-specific feature representation learning based FSFGIC methods [13,14]. A taxonomy of feature representation learning for FSFGIC methods, organized by these learning paradigms, is illustrated in Figure 2.
Local and/or global deep feature representation learning based FSFGIC methods use the degree of difference between the local and/or global deep feature representations of query and support samples to perform FSFGIC tasks. Class representation learning based FSFGIC methods use deep feature representations from all training samples in a class to construct a class feature representation (e.g., class-level graph [15] or class-level local deep feature representation [7]) for this class, which is then used to perform FSFGIC tasks. Task-specific feature representation learning based FSFGIC methods use deep feature representations from all training images in a task (i.e., one training episode) to construct a task-specific feature representation (e.g., task-level graph relationship representation [16] or task-level local deep feature representation [8]) for this task and to perform FSFGIC tasks.
It is worth noting that after the feature representation is learned, most methods rely on meta-learning based techniques to perform FSFGIC tasks. These techniques fall into two branches: optimization-based techniques and metric-based techniques. Optimization-based techniques aim to make the model converge quickly on novel tasks by learning how to update the parameters of a given initial model with only a few training samples per category. Metric-based techniques aim to learn transferable feature knowledge and obtain a distribution over classes based on similarity metrics between different samples. For each type of feature representation learning, both the optimization-based and the metric-based techniques used for FSFGIC are therefore reviewed in detail.
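To make the metric-based branch concrete, the sketch below turns similarities into a class distribution in the style of prototype-based classifiers: each class is summarized by the mean of its support embeddings, and a softmax over negative squared Euclidean distances gives the distribution. The embeddings, labels, and function names are illustrative assumptions, and the choice of Euclidean distance is only one of many metrics used in the literature.

```python
import math

def prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for emb, label in support:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for d, v in enumerate(emb):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def class_distribution(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    logits = {c: -sum((q - p) ** 2 for q, p in zip(query_emb, proto))
              for c, proto in protos.items()}
    m = max(logits.values())                      # subtract the max for numerical stability
    exp = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

# Hypothetical 2-D embeddings for a 2-way 2-shot episode.
support = [([0.0, 0.0], "a"), ([0.2, 0.0], "a"), ([1.0, 1.0], "b"), ([1.0, 0.8], "b")]
probs = class_distribution([0.1, 0.1], prototypes(support))
predicted = max(probs, key=probs.get)
```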

Benchmark Datasets
Datasets play a critical role in the development of FSFGIC, not only as a means of evaluating the classification accuracy of different FSFGIC methods, but also in driving the field forward (e.g., toward more complex, practical, and challenging problems). Currently, the representative datasets for training and evaluation on FSFGIC are CUB-200-2010 [63], CUB-200-2011 [2], Stanford Dogs [64], Stanford Cars [65], FGVC-Aircraft [4], NABirds [66], SUN397 [67], and Oxford 102 Flowers [3]. The number of images and the number of categories in each of these datasets are shown in Table 1.
A detailed description of the datasets available for FSFGIC can be accessed at https://paperswithcode.com/task/fine-grained-image-classification (accessed on 1 March 2024). In addition, several ultra-fine-grained image datasets (such as Cottons and Soybeans [68]) exist in this field. Compared with the widely used FSFGIC datasets (e.g., CUB-200-2011 [2]), the inter-class differences among ultra-fine-grained images are much smaller, which places greater demands on the design of FSFGIC algorithms.

Methods on FSFGIC
In this section, we first review the data augmentation techniques for FSFGIC. Then, local and/or global deep feature representation learning based FSFGIC methods, class representation learning based FSFGIC methods, and task-specific feature representation learning based FSFGIC methods are introduced in detail. Furthermore, the relationships among the classification methods for different types of FSFGIC techniques are also introduced.

Data Augmentation Techniques for FSFGIC
Data augmentation techniques aim to increase both the quantity and the diversity of training data, thus alleviating overfitting and improving generalization ability. Currently, two types of data augmentation techniques are widely used in FSFGIC. The first type (e.g., random horizontal flipping [50,69], jittering [39,44], scaling [1], random cropping [6], translation [70], zooming [70], and random rotation [56,71]) consists of basic image manipulations.
The second type [33,72–74] is based on deep learning mechanisms that aim to mimic the characteristics of real data. For example, in [72], generative adversarial networks (GANs) were used to generate realistic samples from a given dataset. In [75], a feature encoder-decoder was used to augment the dataset by generating feature representations. In [76], a pre-trained GAN without a discriminator was applied to generate subtle features of fine-grained images, and in [77], a GAN was used to generate hallucinated images. In [45], a self-training strategy was developed to augment data with unlabeled samples, and in [78], a self-taught learning strategy was applied to measure the credibility of each pseudo-labeled instance. In [27], a fully annotated auxiliary dataset with a distribution similar to that of the target dataset was used to train a meta learner, which can transfer knowledge from the auxiliary dataset to the target dataset. In [79], a diversity transfer network (DTN) was proposed to learn to transfer latent diversities from training data to testing data. Xu et al. [74] first proposed a variational autoencoder (VAE)-based feature disentanglement method for FSL problems to generate images. ∆-encoder [80] used an autoencoder to find deformations between different samples of the same category, and then generated new samples for the other categories. Ref. [81] proposed a method of foreground extraction and posture transformation, which extracts the foreground from base classes and generates additional samples for novel sub-classes to expand the data. Inspired by the hypothesis that language can help learn new visual objects [82], auxiliary semantic modalities (e.g., attribute annotations [50,83]) were applied to the support set while ignoring the query set. Other data augmentation techniques are described in detail in the following review of FSFGIC methods.
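Two of the basic manipulations listed above, random cropping and random horizontal flipping, can be sketched as follows. The sketch treats an image as a nested list of pixel values purely for illustration; the function names and the 0.5 flip probability are assumptions, and practical pipelines would normally use an image library instead.

```python
import random

def horizontal_flip(image):
    """Mirror each row of an H x W image stored as a list of rows."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)     # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in image[top:top + size]]

def augment(image, crop_size, rng, flip_prob=0.5):
    """Random crop followed by a horizontal flip with probability flip_prob."""
    out = random_crop(image, crop_size, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out

# A 4 x 4 toy "image" and four augmented 3 x 3 views of it.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
views = [augment(image, 3, random.Random(seed)) for seed in range(4)]
```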

Local and/or Global Deep Feature Representation Learning Based FSFGIC Methods
In the field of FSFGIC, some scholars consider that local deep feature representations can identify the discriminative regions needed to distinguish subtle differences between fine-grained features. Others argue that combining global and local deep feature representation learning can effectively improve the capability of deep feature representation. Currently, there are two main research directions (i.e., optimization-based techniques and metric-based techniques) that use local and/or global deep feature representations to perform FSFGIC tasks, as described in the following.

Optimization-Based Local and/or Global Deep Feature Representation Learning
The existing optimization-based methods for local and/or global deep feature representation learning mainly focus on learning fine-tuning techniques. These methods aim to improve the model's performance with limited training samples by integrating the fine-tuning process into the meta-training stage.
Learning to fine-tune. The multi-attention meta-learning (MattML) method [6] applied attention mechanisms to both the base learner and the task learner to capture the feature information of subtle and local parts of an image. It was indicated in [17] that some knowledge in the base data may be biased against the new class, so transferring all of the knowledge in the base data to the new class may not yield a good meta learner or classifier. An evolutionary search strategy was therefore proposed to transfer partial knowledge by fine-tuning particular layers of the base model after deep feature representations are obtained through the feature extractor. First, several fine-tuning strategies were randomly generated and their classification accuracies on the validation set were obtained; the K strategies with the highest accuracy were selected as parents. Second, offspring vectors were obtained through gene mutation and gene crossover, as in an evolutionary algorithm, and their classification accuracies were calculated. By repeating this process over several iterations, the best fine-tuning strategy can be obtained. The evolutionary search strategy can be embedded into a metric-based method [84] or an optimization-based method [85] for performing FSFGIC tasks. By introducing enhancement methods that combine global and local perception features into the feature space and adding semantic orthogonality constraints, ref. [18] achieved a more comprehensive and accurate representation of image feature information.
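The evolutionary search over fine-tuning strategies can be sketched as a small genetic algorithm. The sketch assumes that a strategy is a binary mask over layers (1 = fine-tune, 0 = freeze) and replaces the validation accuracy with a toy fitness function; the mask encoding, population size, mutation rate, and single-point crossover are illustrative choices, not details confirmed by [17].

```python
import random

def evolve(fitness, n_layers, rng, population=8, n_parents=3, generations=10, mutate_p=0.2):
    """Search binary fine-tuning masks (1 = fine-tune the layer, 0 = freeze it)."""
    pop = [[rng.randint(0, 1) for _ in range(n_layers)] for _ in range(population)]
    for _ in range(generations):
        # Keep the strategies with the highest (validation) fitness as parents.
        parents = sorted(pop, key=fitness, reverse=True)[:n_parents]
        children = []
        while len(children) < population - n_parents:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_layers - 1)            # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each gene independently with probability mutate_p (mutation).
            child = [1 - g if rng.random() < mutate_p else g for g in child]
            children.append(child)
        pop = parents + children                          # parents survive (elitism)
    return max(pop, key=fitness)

# Hypothetical stand-in for validation accuracy: agreement with a "good" mask
# that freezes the early layers and fine-tunes the late ones.
TARGET = [0, 0, 0, 1, 1, 1]

def toy_fitness(mask):
    return sum(m == t for m, t in zip(mask, TARGET)) / len(TARGET)

best = evolve(toy_fitness, n_layers=6, rng=random.Random(0))
```

In a real setting, evaluating `fitness` means fine-tuning the selected layers and measuring accuracy on the validation set, so this call dominates the cost of the search.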

Metric-Based Local and/or Global Deep Feature Representation Learning
Metric-based local and/or global deep feature representation learning methods can be classified into six categories: attention mechanism, metric strategy, multimodal learning, feature distribution, semantic alignment, and multi-scale representation.
Attention mechanism. Following the idea that a self-attention mechanism can indicate the discriminative regions in an image [46], a novel network architecture [86] that takes saliency information as input was designed. Local deep feature representations from training samples were combined with their corresponding saliency maps obtained from [87] to improve the classification performance on FSFGIC. Following the object localization strategy of [88], a meta-reweighting strategy [19] was designed to extract and exploit local deep feature representations of support samples. Furthermore, an adaptive attention mechanism based on the meta-reweighting model was designed to localize the region of interest in query samples; it matches query images against support images to highlight relevant regions of interest and so obtain more discriminative local deep feature representations. A trilinear spatial-awareness network (S3Net) [23] was proposed to strengthen the spatial representation of each local descriptor by adding a global relationship feature with self-attention. Multi-scale features were constructed to enrich the global features, and a local loss and a global loss were combined to learn discriminative features. In [29], an attention-based pyramid structure was proposed to weight the different areas of the feature maps and produce multi-scale features. Ref. [20] proposed a fusion spatial attention method that performs spatial attention simultaneously in both the image space and the embedding space. Ref. [21] proposed a self-attention based prototype enhancement network (SAPENet) to obtain a more representative prototype for each class. In [89], an automatic salient region selection network was proposed that locates salient regions in images without using bounding box or part annotations.
Metric strategy. The DeepEMD method [22] formalized image classification as an optimal image matching problem. The earth mover's distance (EMD) was then applied to select local discriminative feature representations and find the optimal matching between query samples and support samples. In [90], a two-stage comparison strategy was proposed to mine hard examples: the examples corresponding to the top two relation scores output by a first relation network were fed into a second relation network to distinguish similar classes. A subtle difference module [23] was proposed to classify confused or near-duplicate samples based on the cooperation of local and global similarities between the query image and the prototype of each class. Ref. [24] used the Sinkhorn distance to find an optimal matching between images, mitigating the object mismatch caused by misaligned positions. They also proposed intra-image and inter-image attentions as a bilateral normalization of the Sinkhorn distance to suppress the object mismatch caused by background clutter.
Multimodal learning. In [25], Zhao et al. argued that cross-modal external knowledge helps improve the classification performance on FSFGIC. Accordingly, a mirror mapping network (MMN) was designed to map multimodal features (i.e., external knowledge and global and local feature representations) into the same semantic space. The external knowledge, extracted from textual descriptions and a knowledge graph, was used to generate global and local features for training samples. Finally, global and local feature representations from samples and external knowledge were combined to perform FSFGIC tasks.
Feature distribution. Sun et al. [26] proposed a domain-specific FSFGIC task on marine organisms and designed a feature fusion model to focus on the key regions. Specifically, the framework consisted of a ConvNet-based feature extractor, a feature fusion model, and a classifier. As the key component, the feature fusion model used focus-area location and high-order integration to generate feature representations that contain more identifiable information.
Semantic alignment. Huang et al. [27] proposed a novel pairwise bilinear pooling to recognize the subtle differences between fine-grained images. Specifically, they designed a fine-grained feature extractor containing an alignment loss regularization and a pairwise bilinear pooling layer. The alignment loss aimed to match the features at the same position, and the pairwise bilinear pooling layer captured comparative features from pairs of images. The bi-directional local alignment strategy [28] was proposed to encode image features using shared embedding networks, construct bi-directional distances to align similar semantic information, and optimize the network for FSFGIC tasks. Traditional feature generation networks fail to capture the subtle differences between fine-grained categories; to address this problem, a feature composition framework was proposed in [91] to generate fine-grained features for novel classes. In the training stage, a dense attribute-based attention was used to compute attention features for all attributes, which were then aligned with attribute semantic vectors to obtain a similarity score. These attribute features were then used to construct features of novel classes.
Multi-scale representation. Unlike a single-scale representation, a multi-scale representation enhances the global features, because the larger scales, with their larger receptive fields, contain richer information [92–99]. In [23], a structural-pyramid descriptor was constructed by applying pyramid pooling to the global feature at different scales. The multi-scale features were then magnified to the same size by bilinear interpolation and fused together. Ruan et al. [29] proposed a spatial attentive comparison network (SACN) for the FSFGIC task. They constructed a selective-comparison similarity module (SCSM) based on a pyramid structure and an attention mechanism to assign different weights to the background and the target, aiming to produce multi-scale feature maps for classification. Ref. [30] was the first attempt to integrate multi-scale representation into the cross-domain few-shot classification problem, proposing a new hierarchical residual-like block applicable to lightweight ResNet structures such as ResNet-10. In [31], Zhang et al. proposed a multi-scale second-order relation network (MsSoSN), which was equipped with second-order pooling and a scale selector to create multi-scale second-order representations. They proposed scale and discrepancy discriminators, trained in a self-supervised manner, to reweight the multi-scale features.
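The idea of matching local descriptors with a transport-based distance can be sketched with plain Sinkhorn iterations. The sketch assumes uniform weights on the descriptors and a cost of one minus cosine similarity; it does not include the cross-reference weighting of DeepEMD [22] or the bilateral attention normalization of [24], and the regularization strength `eps` and iteration count are arbitrary illustrative values.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sinkhorn_distance(xs, ys, eps=0.1, n_iters=200):
    """Entropy-regularised transport cost between two descriptor sets (uniform weights)."""
    n, m = len(xs), len(ys)
    cost = [[1.0 - cosine(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan

# Hypothetical local descriptors: the second set holds the same parts in a different
# spatial order, the third set holds unrelated parts.
query = [[1.0, 0.0], [0.0, 1.0]]
same_parts = [[0.0, 1.0], [1.0, 0.0]]
other = [[1.0, 1.0], [1.0, 1.0]]
d_same, plan = sinkhorn_distance(query, same_parts)
d_other, _ = sinkhorn_distance(query, other)
```

Because the transport plan is free to pair each query part with its best counterpart, the permuted set is scored as much closer than the unrelated one, which is the property that makes such distances tolerant of misaligned object positions.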
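A multi-scale descriptor built by pyramid pooling can be sketched as below. For brevity the sketch average-pools a single-channel feature map into grids of several sizes and concatenates the flattened grids; this replaces the bilinear upsampling and fusion step described for [23] with simple concatenation, and the scale set (1, 2, 4) is an assumption.

```python
def avg_pool_to_grid(fmap, bins):
    """Average-pool an H x W map into a bins x bins grid (H and W must be divisible by bins)."""
    h, w = len(fmap), len(fmap[0])
    sh, sw = h // bins, w // bins
    return [[sum(fmap[r][c]
                 for r in range(i * sh, (i + 1) * sh)
                 for c in range(j * sw, (j + 1) * sw)) / (sh * sw)
             for j in range(bins)]
            for i in range(bins)]

def pyramid_descriptor(fmap, scales=(1, 2, 4)):
    """Concatenate the pooled grids from coarse (global) to fine (local) scales."""
    desc = []
    for bins in scales:
        grid = avg_pool_to_grid(fmap, bins)
        desc.extend(v for row in grid for v in row)
    return desc

# A 4 x 4 toy feature map with values 0..15.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
desc = pyramid_descriptor(fmap)   # 1 + 4 + 16 = 21 values
```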

Class Representation Learning Based FSFGIC Methods
The authors of class representation learning based methods argue that local and/or global deep feature representations learned from extremely limited training samples cannot effectively represent a novel class, and class representations (e.g., class-level graph [15] or class-level local deep feature representation [7]) can be used to alleviate the phenomenon of overfitting and effectively represent a novel class.

Optimization-Based Class Representation Learning
The existing optimization-based class representation learning methods can be divided into two categories: (1) learning-a-model methods, which design network architectures that adapt efficiently to target tasks through only a few gradient descent steps; and (2) learning-to-fine-tune methods.
Learning a model. In [32], an optimization-based FSFGIC method was proposed that included a bilinear feature learning module and a classifier mapping module, which encoded discriminative information and mapped features to decision boundaries using a "piecewise mappings" function. The meta variance transfer method [33] was proposed to transfer factors of variation between classes in order to improve classification performance on unseen examples, allowing deep learning models to generalize better from scarce data and to be more robust to various factors of variation. To combine distribution-level and instance-level relations, Yang et al. [34] proposed a distribution propagation graph network (DPGN). The features of support images and query images were fed into a dual complete graph network, where a point-to-distribution aggregation strategy aggregated instance similarities to construct distribution representations. Additionally, a distribution-to-point aggregation strategy was applied to calculate similarity using both distribution-level and instance-level relations. Few-shot image classification methods struggle to capture diverse context and intra-class variations from limited labeled images, which leads to object and scale mismatch; He et al. [100] addressed these issues with the bilaterally normalized scale-consistent Sinkhorn distance (BSSD) method, improving performance on few-shot benchmarks.
Learning to fine-tune. A weight imprinting strategy was proposed in [35] to directly set the weights of a ConvNet classifier for new categories. A normalization layer with a scaling factor was applied in the classifier, so that the features of new-category samples are transformed into activation vectors that serve as the weights of the normalization layer. In [36], a transfer-based method was proposed to generate class representations. A power transform was applied to preprocess the support features and bring them closer to a Gaussian distribution. Under this Gaussian-like distribution, maximum a posteriori estimation was used to estimate each class center, which is similar to minimizing a Wasserstein distance. An iterative algorithm based on the Wasserstein distance was then used to estimate the optimal transport from the initial feature distribution to the Gaussian distribution in order to update the centers. In [37], an adaptive distribution calibration (ADC) method was proposed, which addressed distribution bias in few-shot learning by adaptively transferring and calibrating information from base classes to improve classification performance on novel classes.
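The weight imprinting idea can be sketched as follows: the classifier weight of each new class is set directly from its support embeddings, and queries are scored by scaled cosine similarity. The sketch assumes that embeddings are already produced by a feature extractor; the function names and the scale value of 10 are illustrative assumptions rather than settings taken from [35].

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def imprint_weights(support):
    """Set each new class weight to the normalised mean of its normalised embeddings."""
    sums = {}
    for emb, label in support:
        unit = l2_normalize(emb)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for d, x in enumerate(unit):
            acc[d] += x
    # Normalising the sum gives the same direction as normalising the mean.
    return {label: l2_normalize(acc) for label, acc in sums.items()}

def scores(query_emb, weights, scale=10.0):
    """Scaled cosine similarity between the query and each imprinted class weight."""
    q = l2_normalize(query_emb)
    return {label: scale * sum(a * b for a, b in zip(q, w)) for label, w in weights.items()}

# Hypothetical 2-D embeddings for two new classes with one support sample each.
support = [([2.0, 0.0], "x"), ([0.0, 3.0], "y")]
weights = imprint_weights(support)
result = scores([1.0, 0.1], weights)
predicted = max(result, key=result.get)
```

No gradient step is needed to add a class this way, which is why imprinting is often used as an initialization before optional fine-tuning.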

Metric-Based Class Representation Learning
Many techniques have been put forward for effective metric-based class representations, which can be broadly divided into five categories: feature distribution, attention mechanism, metric strategy, semantic alignment, and multimodel learning.
Feature distribution.In [101], it was demonstrated that the GANs-based feature generator [102] suffered from the issue of mode collapse.To address this problem, varia-tional autoencoder (VAE) [103] and GANs were combined together to form a conditional feature generation model [73], which aimed to learn the conditional distribution of image features on the labeled class data and the marginal distribution of image features on the unlabeled class data.Alternatively, a multi-mixed feature distribution could be learned to represent each category in RepMet [38] and perform FSFGIC tasks.Davis et al. [39] extended the DeepEMD method [22] by reconstructing each query sample as a weighted sum of components from the same class for obtaining class-level feature distribution.In [40], a re-abstraction and perturbing support pair network (RaPSPNet) was proposed to improve the performance of FSFGIC by enhancing feature discrimination through a feature re-abstraction embedding (FRaE) module and a novel perturbing support pair (PSP)-based similarity measure module.
Afrasiyabi [69] proposed two distribution alignment strategies to align the novel categories to the related base categories, aiming to obtain better class representations.A centroid alignment strategy and an adversarial alignment strategy based on Wasserstein distance were designed to enforce intra-class compactness.Das et al. proposed a nonparametric approach [104] to address the problem that only base-class prototypes were available.They considered that all class prototype distributions were arranged on a manifold.They first estimated the novel-class prototypes by calculating the mean of the prototypes, which were near the novel samples.A graph was structured with all the class prototypes, and an induced absorbing Markov chain was applied to complete the classification task.Ref. [105] proposed compositional prototypical networks (CPN) to learn transferable component prototypes for improved feature reusability, which could be adaptively fused with visual prototypes using a learnable weight generator for recognizing novel classes based on human-annotated attributes.
In order to learn fine-grained structure in the feature space, Luo et al. [106] proposed a two-path network to adaptively learn the views.One path was label-guided classification, where the support features belonging to the same class were aggregated into a prototype and the similarities were calculated between the prototypes and query images.Another path was instance-level classification, which aimed to produce different views for an image, then map them into feature space to construct a better fine-grained semantic structure.Ref. [107] proposed to combine the frequency features with routine features.In addition to a regular CNN module, a discrete cosine transformation was applied to generate frequency feature representations.Then, the two kinds of features were concatenated as the final features.Current approaches overlooked intra-class distribution details while focusing on learning a generalized class-level metric.Ref. [108] proposed improved prototypical networks (IPN) to address the issue by using an attention-analogous strategy with varied sample weights based on representativeness and a distance-scaling strategy to enhance class-distribution exploration and discriminative information across classes.To gain Gaussian-like distributions, ref. [109] proposed a transfer-based method to process features belonging to the same class.They introduced transforms to adjust the distribution of features, and a Wasserstein distance-based iterative algorithm to calculate the prototype for each class.Similarly, ref. [110] proposed an optimal-transport algorithm to transform features into Gaussian-like distributions and estimate the best class centers.
Attention mechanism.The attention strategy aims to select discriminative feature or region from the extracted feature space for effective class-level feature representation.In [46], an attention mechanism [111] was applied to locate and reweight semantically relevant local region pairs between query and support samples, which aimed to strengthen discriminative objects and suppress the background.He et al. [41] indicated that object localization (using local discriminative regions) could provide great help for FSFGIC.Then a self-attention-based complementary module, which utilized channel attention and spatial attention was designed for performing weakly supervised object localization and finding their corresponding discriminative regions.Ref. [48] utilized channel attention and spatial attention to find discriminative regions from query and support samples for improving the classification performance of FSFGIC.A novel transformer-based neural network architecture called CrossTransformers [42] was designed which applied a cross-attention mechanism to find coarse spatial correspondence between the query and support labeled samples in a class.In [50], an attention mechanism was proposed to mix two modalities (i.e., semantic and visual modalities) and ensure that the representations of attributes were in the same space with visual representation.Single prototype-based methods might fail to capture the subtle information of a class.To address this problem, Huang et al. [43] proposed a descriptor-based multi-prototype network (LMPNet) to learn multi-prototype.They designed an attention mechanism to weight all channels in each spatial position of all samples adaptively to obtain local descriptors, and constructed multiple prototypes based on these descriptors which contained more complete information of a class.
Metric strategy.To obtain discriminative class representations for FSFGIC, image-toclass metric strategies were proposed.Deep nearest neighbor neural network (DN4) [7] aimed to learn optimal class-level local deep feature representation of a class space based on the designed image-to-class similarity measure strategy in the case of extremely limited training samples.A discriminative deep nearest neighbor neural network (D2N4) [112] extended the DN4 method [7] by adding a center loss function [113].And then class-level local and global feature representations were learned for improving the quality discriminability features in the framework of the DN4 method [7].The Bi-Similarity Network (BSNet) [44] was proposed to use two different similarity measures to create more discriminative feature maps from a small number of images, resulting in a significant boost in generalization performance.In [45], Zhu et al. argued that a large amount of unlabeled data had the high potential to improve the classification performance in FSFGIC tasks.A progressive point to set metric learning (PPSML) [45] was presented to improve few-shot classification accuracy by defining a distance metric and using a self-training strategy.To avoid overfitting and calculate a robust class representation under the condition of extremely limited training samples, a deep subspace network (DSN) [114] was introduced to transform class representation into an adaptive subspace and generate a corresponding classifier.
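An image-to-class similarity measure of the kind used by DN4 [7] can be sketched as follows: each local descriptor of the query image is compared with the pooled local descriptors of all support images in a class, and the similarities to its k nearest neighbours are summed. The sketch assumes cosine similarity and pre-extracted descriptors; the class names, descriptor values, and the default k = 1 are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def image_to_class(query_descs, class_descs, k=1):
    """Sum, over query descriptors, of the similarities to their k nearest class descriptors."""
    total = 0.0
    for q in query_descs:
        sims = sorted((cosine(q, d) for d in class_descs), reverse=True)
        total += sum(sims[:k])
    return total

def classify(query_descs, class_pools, k=1):
    """Assign the query image to the class with the highest image-to-class similarity."""
    return max(class_pools, key=lambda c: image_to_class(query_descs, class_pools[c], k))

# Hypothetical pools of local descriptors, merged over the support images of each class.
class_pools = {
    "sparrow": [[1.0, 0.0], [0.9, 0.1]],
    "finch": [[0.0, 1.0], [0.1, 0.9]],
}
query_descs = [[1.0, 0.1], [0.8, 0.2]]
label = classify(query_descs, class_pools)
```

Pooling descriptors at the class level rather than comparing image to image is what lets a few support images jointly cover the parts of an object that any single image may miss.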
Triantafillou et al. [115] proposed a method based on mean average precision (mAP), which aimed to learn a similarity metric from an information retrieval perspective. They extended the work that optimized for AP in order to account for all possible choices of query among the batch points. They then used the frameworks of SSVM (Structural Support Vector Machine) and DLM (Direct Loss Minimization) for the optimization of mAP. Liu et al. [116] introduced a negative margin loss to reduce inter-class variance and generate more efficient decision boundaries. Hilliard et al. [70] proposed a metric-agnostic conditional embeddings (MACO) network. MACO contained four stages: the feature stage was used to obtain features; the relational stage produced a single vector as the class representation of each class; the conditioning stage connected the class representations to the query image features, which aimed to learn the class representation that was more relevant to the query image; and the classifier made the final prediction.
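To make the retrieval view concrete, the following sketch computes the quantity that such a method targets: every point in a batch is used in turn as the query, the remaining points are ranked by similarity, and the average precision values are averaged. It only evaluates mAP for fixed embeddings. The SSVM and DLM optimization used in [115] is not reproduced, and the function names and the `similarity` argument are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of booleans (True = relevant)."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def batch_map(embeddings, labels, similarity):
    """Mean average precision where each batch point serves once as the query
    and all other points are ranked by similarity to it."""
    aps = []
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: similarity(query, embeddings[j]), reverse=True)
        aps.append(average_precision([labels[j] == labels[i] for j in others]))
    return sum(aps) / len(aps)
```

For example, a ranking with relevant items at positions 1 and 3 has an average precision of (1/1 + 2/3) / 2 = 5/6.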
Semantic alignment. It was indicated in [47] that people tended to compare similar objects thoroughly in a pairwise manner, e.g., comparing the heads of two birds first, then their wings and feet. In this manner, it was natural to enhance feature information during the comparison process. A low-rank pairwise bilinear pooling operation network [47] was designed for obtaining class-level deep feature representations between query and support samples, following the way that people compared similar objects. According to [46], the main object could be situated anywhere in the image, leading to potential ambiguity when directly computing the distance between query and support samples. To address this problem, semantic alignment metric learning (SAML) [46] was proposed to align the semantically related local regions on samples by a "collect and select" strategy. On the one hand, the similarities of all local region pairs between a query sample and a support class were calculated and collected in a relation matrix. On the other hand, an attention mechanism [111] was applied to "select" the semantically relevant pairs. Li et al. [48] extended the method in [46], and a convolutional block attention module [117] was applied to capture discriminative regions. To eliminate the influence of noise and improve the efficiency of the similarity measure, query-relevant regions from support samples were selected for semantic alignment. Then, multi-scale class-level feature representations were utilized to represent the discriminative regions of the query and the support samples in a class, and to perform FSFGIC tasks. In [69], a centroid associative alignment strategy was proposed to enforce intra-class compactness and obtain better class representations.
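The "collect and select" strategy can be sketched as follows. The "collect" step builds the relation matrix of cosine similarities between all local region pairs. For the "select" step, a plain softmax weighting over the matrix entries is used here as a stand-in for the learned attention mechanism of [46]. This substitution, the `temperature` parameter, and all function names are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def relation_matrix(query_regions, class_regions):
    """Collect: similarity of every (query region, support-class region) pair."""
    return [[cosine(q, s) for s in class_regions] for q in query_regions]


def attention_select(matrix, temperature=0.1):
    """Select: softmax-weighted sum of the relation matrix, so that strongly
    related region pairs dominate the final similarity score."""
    flat = [value for row in matrix for value in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp((value - peak) / temperature) for value in flat]
    normalizer = sum(exps)
    return sum(e / normalizer * value for e, value in zip(exps, flat))
```

With this weighting, a query whose regions match some regions of the class receives a high score regardless of where those regions lie in the image, which is the ambiguity that alignment is meant to remove.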
Alternatively, an end-to-end graph-based approach called explicit class knowledge propagation network (ECKPN) [15] was proposed, which aimed to learn and propagate the class representations explicitly. First, a comparison module was used to explore the relationship between paired samples for learning sample representations in instance-level graphs. Second, a squeeze strategy was proposed to make the instance-level graph generate the class-level graph, which helped obtain class-level visual representations. Third, the class-level visual representations were combined with the instance-level sample representations for performing FSFGIC tasks.
Multimodal learning. Inspired by the prototypical network [85], a multimodal prototypical network [49] was designed for mapping text data into the visual feature space by using GANs. In [50], Huang et al. indicated that some methods, which applied auxiliary semantic modalities to a metric learning framework, only augmented the feature representations of samples with available semantics and ignored the query samples, which might lose the potential for improving classification performance and could lead to a shift between the combined-modality representation and the pure-visual representation. To address this issue, an attributes-guided attention module (AGAM) was proposed, which aimed to make more effective use of human-annotated attributes and learn more discriminative class-level feature representations. An attention alignment mechanism was designed to distill knowledge from attribute guidance into the pure-visual feature selection process, so that it could learn to pay attention to more semantic features without being restricted to samples with attribute annotations. To better align the visual and language feature distributions that described the same object class, a cross-modal distribution alignment module [51] was proposed, in which a vision-language prototype was introduced for each class to align the distributions, and the earth mover's distance (EMD) was adopted to optimize the prototypes.
Gu et al. [118] proposed a two-stream neural network (TSNN), which not only learned features from RGB images, but also focused on steganalysis features via a steganalysis rich model filter layer. The RGB stream aimed to distinguish the difference between support images and query images based on the global-level features and calculated the representations of each support class; the steganalysis stream extracted steganalysis features to locate critical regions. An extractor and fusion module was used to fuse the two-stream features by a general convolutional block. An image-to-class deep metric was applied to produce the similarity scores. Zhang et al. [119] introduced fine-grained attributes into the prototype network and proposed a prototype completion network (ProtoComNet). In the meta-training stage, ProtoComNet extracted representative attribute features as priors. They applied an attention-based aggregator to aggregate the attribute features and the prototype to obtain the completed prototype. In addition, a Gaussian-based prototype fusion strategy was designed to learn mean-based prototypes from unlabeled samples, and Bayesian estimation was applied to fuse the two kinds of prototypes, aiming to produce more representative prototypes.

Task-Specific Feature Representation Learning Based FSFGIC Methods
Task-specific feature representation learning based FSFGIC methods aim to overcome the problem of overfitting and poor generalization and utilize deep feature representation from all training samples in a task (i.e., one training episode) to construct a task-specific feature representation (e.g., task-level graph relationship representation [16] or task-level local deep feature representation [8]) for this task.

Optimization-Based Task-Specific Feature Representation Learning
The existing optimization-based task-specific feature representation learning methods can be divided into two categories: learning a model and learning to fine-tune.
Learning a model. In [52], a task embedding network was presented to learn task-specific feature representations via a Fisher information matrix [120] for exploring the nature of the target task and its relationship to other tasks. Meanwhile, the learned task-specific feature representations could also show the similarity between two different tasks. It was indicated in [53] that the existing optimization-based methods learned to utilize meta-knowledge equally in each task without considering the diversity of the tasks. To address this problem, the authors extended the model-agnostic meta-learning method [121] to deal with the imbalance of the number of samples in each task instance and with out-of-distribution tasks, but the encoding of complex datasets and the calculation of balance variables for each task increased the computational complexity of the algorithm.
A meta neural architecture search method called M-NAS [122] was proposed to effectively obtain a task-specific architecture for each new task. Specifically, an autoencoder was designed to generate a task-aware model architecture which had the ability to tailor the globally shared meta-parameters. It was indicated in [123] that meta-learning models were prone to overfitting in a new task with limited samples. To address this, a gradient dropout regularization was proposed to efficiently adapt to a new task. The key idea was to impose uncertainty on the meta-training stage by adding a noise gradient to parameters to improve the generalization of the model. In [54], new transformers called HCTransformers were introduced, which enhanced data efficiency for visual recognition by leveraging spectral token pooling and attribute surrogate learning. They addressed the limitations of vision transformers with limited data, providing better performance through improved parameter optimization and image structure utilization.
In order to improve the representation ability of meta-learning methods, a deep meta-learning (DEML) method [124] was proposed to generate high-level concepts for each image in a task. These concepts could guide the meta-learner to adapt quickly to new tasks. Moreover, a concept discriminator was designed to recognize different images. Tian et al. [125] proposed a new consistent meta-regularization (Con-MetaReg) to enhance the learning ability of meta-learning models. Specifically, a base learner was trained on the support set, and another learner was trained on a novel query set. Con-MetaReg was proposed to align the two learners through the Frobenius norm of the difference between their parameters, in order to eliminate the data discrepancy and obtain better meta-knowledge. In [126], a label-free loss function called Self-Critique and Adapt (SCA) was proposed. SCA could be added to a base model to learn knowledge with an unsupervised loss from a critic loss network. The features learned by the base model were sent to the critic network to create a loss for the target task.
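The alignment term described for Con-MetaReg can be written down in a few lines. The sketch below assumes that each learner's parameters are stored as a dictionary mapping a layer name to a matrix given as a list of rows. It uses the plain (non-squared) Frobenius norm and a hypothetical weighting factor `weight`. Whether [125] uses the squared norm or a different weighting is not stated in this review, so those details are assumptions.

```python
import math


def frobenius_gap(params_a, params_b):
    """Frobenius norm of the difference between two parameter sets.
    Each set maps a layer name to a matrix given as a list of rows."""
    total = 0.0
    for name in params_a:
        for row_a, row_b in zip(params_a[name], params_b[name]):
            total += sum((a - b) ** 2 for a, b in zip(row_a, row_b))
    return math.sqrt(total)


def regularized_loss(task_loss, params_support, params_query, weight=0.1):
    """Task loss plus a penalty that pulls the support-trained learner
    and the query-trained learner toward the same parameters."""
    return task_loss + weight * frobenius_gap(params_support, params_query)
```

The penalty is zero only when both learners end up with identical parameters, so minimizing it discourages solutions that depend on which subset of the task data a learner happened to see.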
Learning to fine-tune. In order to overcome overfitting and the poor generalization ability caused by limited training samples, an effective scheme [1] for selecting samples from the auxiliary data was proposed. According to a given classifier with shared parameters, some samples with feature distributions similar to those of some given target samples were selected from an auxiliary dataset with rich samples. The selected samples from the auxiliary dataset and the given target samples were sent into the classifiers to pre-train a weight initialization. Finally, the remaining target samples were used to fine-tune the parameters corresponding to the classifiers for quickly adapting to target tasks.
In order to improve generalization to a novel domain, ref. [55] proposed a combining domain-specific meta-learners (CosML) method. CosML pre-trained a set of meta-learners on different domains to learn domain-specific parameters. CosML generated task and domain prototypes to represent each task and domain in the feature space. For the novel domain, a subnetwork was initialized with the domain-specific meta-parameters, which were weighted by the similarity between these domains and the novel domain. In the optimization phase, properties in an image that were not related to the target task interfered with the optimization results. A context-agnostic (CA) method [56] was proposed to discard these additional properties in the training data. In the training task, they applied a context-adversarial network to generate another object without extra information to the base network to initialize context-agnostic weights.

Metric-Based Task-Specific Feature Representation Learning
The existing metric-based task-specific feature representation learning methods can be classified into three categories: feature distribution, attention mechanism, and metric strategy.
Feature distribution. In [57], a covariance metric network (CovaMNet) was proposed, which aimed to obtain task-level covariance representations and a covariance metric between query and support samples. Furthermore, a novel deep covariance metric was designed to measure the consistency of distributions between query and support samples for performing FSFGIC tasks. The metric function might fail to generalize due to the discrepancy between the feature distributions of the base and novel domains in a task. To address this problem, Tseng et al. [58] proposed a cross-domain approach which applied a feature-wise transformation layer to simulate the feature distributions of different domains. In the training stage, the feature-wise transformation layer was inserted into the feature encoder and optimized by two hyper-parameters via a learning-to-learn strategy. Ref. [59] proposed an unsupervised embedding adaptation mechanism called early-stage feature reconstruction (ESFR). ESFR contained a feature-level reconstruction training stage and a dimensionality-driven early stopping stage, which aimed to find more generalizable features.
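A covariance-based metric of the kind described for CovaMNet can be sketched under simple assumptions: the class is summarized by the sample covariance matrix of its local descriptors, and a query is scored by the quadratic form q^T Σ q summed over its local descriptors. Any normalization or further layers used in [57] are omitted, and the function names are hypothetical.

```python
def covariance(descriptors):
    """Sample covariance matrix (d x d) of a list of d-dimensional local
    descriptors. At least two descriptors are required."""
    n = len(descriptors)
    d = len(descriptors[0])
    mean = [sum(x[i] for x in descriptors) / n for i in range(d)]
    centered = [[x[i] - mean[i] for i in range(d)] for x in descriptors]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def covariance_metric(query_descriptors, cov):
    """Sum of the quadratic forms q^T * cov * q over the query's descriptors.
    Larger values mean the query varies along the class's main directions."""
    d = len(cov)
    score = 0.0
    for q in query_descriptors:
        score += sum(q[i] * cov[i][j] * q[j] for i in range(d) for j in range(d))
    return score
```

In this toy form, a query descriptor that points along a direction in which the class descriptors vary strongly obtains a large score, while one that is orthogonal to all such directions obtains zero.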
Attention mechanism. In [8], an adaptive episodic attention module was designed to select and weight key regions across the entire task. Alternatively, the attention strategy was also used in graph neural networks (GNNs) for effectively obtaining task-level relation representations. Guo et al. indicated in [16] that existing GNN-based FSFGIC methods focused on the sample-to-sample relations while neglecting task-level relationships. Then, a GNN-based sample-to-task FSFGIC method named attention-based task-level relation module (ATRM) was proposed to consider the specificity of different tasks. In ATRM, task-relation representations between the embedding features of a target sample and the embedding features of all samples in the task were obtained by calculating the absolute difference between the target sample and all samples in the task. Then, an attention mechanism was used to learn task-specific relation representations for each task.
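The two steps described for ATRM can be illustrated with a small sketch. The first function computes the element-wise absolute difference between a target embedding and every embedding in the task. The second aggregates these relation vectors with softmax attention weights. The scoring vector `score_weights` stands in for parameters that would be learned in [16] and is purely hypothetical, as are the function names.

```python
import math


def task_relations(target, task_embeddings):
    """Element-wise absolute difference between the target embedding and
    every sample embedding in the task (one relation vector per sample)."""
    return [[abs(t - x) for t, x in zip(target, emb)] for emb in task_embeddings]


def attend(relations, score_weights):
    """Softmax attention over relation vectors. Each vector is scored by a
    dot product with score_weights (a stand-in for learned parameters),
    and the attention-weighted sum of the vectors is returned."""
    scores = [sum(w * r for w, r in zip(score_weights, rel)) for rel in relations]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    normalizer = sum(exps)
    weights = [e / normalizer for e in exps]
    dim = len(relations[0])
    return [sum(w * rel[i] for w, rel in zip(weights, relations)) for i in range(dim)]
```

With an all-zero scoring vector the attention is uniform and the result is the mean relation vector. Non-zero scores shift the weight toward particular samples in the task.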
Metric strategy. It was indicated in [8] that the existing image-to-image similarity measure [19] or image-to-class similarity measure [7] could not make full use of local deep feature representations. To address this problem, an adaptive task-aware local representations network (ATL-Net) was designed to select local descriptors with learned thresholds and assign different weights to the selected local representations based on episodic attention for improving the local deep feature representations. In [60], a region comparison network was proposed which aimed to reveal how FSFGIC worked in neural networks. In order to explore more fine-grained information and find the critical regions, each support sample was divided into several parts, and task-level local deep feature representations between each region in a support sample and each query sample were used to calculate their feature similarities and their corresponding region weights. Then, an explainable network was designed to find the critical regions related to the final classification results. A discriminative mutual nearest neighbor neural network (DMN4) [61] extended the DN4 method [7], and a mutual nearest neighbor mechanism [127] was applied to obtain task-level local feature representations between query and support samples for performing FSFGIC tasks. Li et al. extended a triplet network [128] into a deep K-tuplet network [62] for learning a task-level local deep feature representation by utilizing the relationship among the input samples in a training episode.
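The mutual nearest neighbor criterion that DMN4 builds on can be sketched as follows. A query descriptor and a support descriptor form a mutual pair when each is the other's nearest neighbor. The sketch uses a plain dot product as similarity (descriptors can be L2-normalized beforehand if cosine similarity is wanted) and only returns the retained pairs. How [61] scores and weights these pairs is not reproduced, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest(vector, pool):
    """Index of the descriptor in pool that is most similar to vector."""
    return max(range(len(pool)), key=lambda j: dot(vector, pool[j]))


def mutual_nearest_pairs(query_descriptors, support_descriptors):
    """Return (query_index, support_index) pairs that are nearest neighbors
    of each other. Query descriptors without a mutual partner are dropped."""
    pairs = []
    for i, q in enumerate(query_descriptors):
        j = nearest(q, support_descriptors)
        if nearest(support_descriptors[j], query_descriptors) == i:
            pairs.append((i, j))
    return pairs
```

Compared with a one-directional nearest-neighbor search, the mutual check discards query descriptors, such as repeated background patterns, that all map to the same support descriptor without being that descriptor's own best match.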

Comparison of Experimental Results
Table 2 reports selected experimental results for the three types of feature representation learning methods discussed above. It is worth noting that the data in Table 2 are taken from the corresponding original papers. Different backbone networks and different feature representation learning methods lead to different final model performance. At present, some researchers [7,46] combine two or more feature representation learning methods to obtain better classification results. In this paper, we classify each model according to the feature representation learning method that occupies the largest proportion in the original method.

Summary and Discussions
Our investigation indicates that the existing FSFGIC methods have made great progress in FSFGIC tasks, but there are still some important challenges to FSFGIC that need to be dealt with in the future.
Trade-off between the problem of overfitting and the ability of image feature representation. Our investigation indicates that the existing FSFGIC algorithms are still at the stage of theoretical exploration and are not yet ready for practical applications. Currently, data augmentation, regularization, and modeling of the feature extraction process can effectively alleviate the overfitting problem caused by extremely limited training samples and can also enhance the ability of feature representation, but there is still a trade-off between overcoming the overfitting problem and enhancing the ability of image feature representation. On the one hand, image feature representation is used not only to represent training samples, but also to construct classifiers for performing FSFGIC tasks. In this manner, the quality of feature representation directly affects the classification performance on FSFGIC. On the other hand, due to the extremely limited number of training samples in FSFGIC, the existing FSFGIC methods utilize a relatively simple network as a backbone (e.g., Conv-64F [132]) for alleviating the overfitting problem. Our investigation indicates that the existing simple networks cannot learn discriminative features from training samples as effectively as the existing large networks (e.g., ResNet50 [133]). Therefore, how to balance the problem of overfitting and the ability of image feature representation is one of the most important challenges in FSFGIC.
Generalization in FSFGIC. There exist two main challenges regarding generalization in FSFGIC methods. On the one hand, an ideal FSFGIC algorithm should have the ability to handle various learning tasks with different complexity and diversity of data. Our investigation indicates that, currently, the number of tasks and datasets available for FSFGIC training is very limited (much less than the number of instances available in few-shot learning). Most of the existing FSFGIC methods are over-designed for specific benchmark tasks and datasets, which may weaken their applicability to more general FSFGIC tasks. On the other hand, our investigation indicates that most of the existing FSFGIC studies focus on common application scenarios with small-scale tasks and large-scale labeled auxiliary data. However, the actual FSFGIC tasks that need to be solved may be dynamic, and labeled auxiliary data may not be available. Therefore, it is necessary to generalize the technique of feature representation learning to effectively perform cross-domain or multi-domain FSFGIC tasks.
Theoretical research. In essence, all FSFGIC solutions are designed by specific techniques to obtain feature representations that can be used to accurately represent samples and to perform FSFGIC tasks. Although the quality of feature representation directly affects the classification performance of FSFGIC, our investigation indicates that no one has considered how to establish a theoretical approach to measure whether the feature representation learned from training samples can correctly reflect the inherent characteristics of the training samples. Therefore, constructing a systematic theory for FSFGIC from the perspective of improving the accuracy of feature representations obtained from training samples can bring new inspiration to FSFGIC researchers.
Performance and efficiency. As shown in Figure 3, the FSFGIC methods still face some challenges in terms of performance and efficiency. Researchers still need to make breakthroughs in the following aspects: (1) how to obtain more discriminative local salient features; (2) how to achieve better classifier performance; (3) how to reduce the model complexity and time complexity, so as to avoid overfitting and strengthen the robustness of the model.

[Figure 3, efficiency] (a) Model complexity. A simple but efficient algorithm can take less memory space while achieving good performance, for example through a simple network construction (Li et al. 2020) or reduced data redundancy (Ruan et al. 2021).

Conclusions
Fine-grained datasets are small in scale, and the differences between samples of different subclasses often exist only in subtle local regions. Therefore, only methods that can extract the feature information of the local salient regions of an image without a large number of labeled training samples can achieve good classification performance on fine-grained datasets. The general few-shot algorithm is not designed for the fine-grained features of the image, so it cannot effectively extract the subtle differences in the image, resulting in poor performance [134]. Based on this, many FSFGIC methods have been proposed by researchers, and satisfactory results have been achieved. The excellent classification performance of these methods is mainly due to the following two reasons: (1) by focusing on the feature information of the salient regions of the image, they can obtain more distinctive and effective feature representations; (2) the inter-class distance between different subclasses is increased, and the intra-class distance within the same subclass is reduced.
In this paper, we presented a comprehensive review of feature representation learning for FSFGIC. A taxonomy for FSFGIC was proposed. In terms of this taxonomy, different issues of FSFGIC methods were discussed. The main unresolved problems related to feature representation learning for FSFGIC were summarized and discussed. We hope that this survey can help newcomers and practitioners position themselves in this growing field and work together to keep pushing the field forward.

[Figure 3, performance] (a) High ability to obtain local feature representation. Recognizing discriminative local features is the critical part of the FSFGIC task. Existing methods learn local features mainly by local and/or global deep feature representation (Wertheimer et al. 2021; Huang et al. 2020), class representation (Karlinsky et al. 2019; Xian et al. 2019), and task-specific feature representation (Lee & Chung 2021; Guo et al. 2021). (b) Classifier performance. Classifier performance is very important for prediction results. Existing methods include learning a metric measure (Li et al. 2019; Yang et al. 2020) or proposing a deep neural network (Wei et al. 2019).