Article

Few-Shot Fine-Grained Image Classification with Residual Reconstruction Network Based on Feature Enhancement

Center for Image and Information Processing, School of Communications and Information Engineering and School of Artificial Intelligence, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 9953; https://doi.org/10.3390/app15189953
Submission received: 7 August 2025 / Revised: 5 September 2025 / Accepted: 10 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)

Abstract

In recent years, few-shot fine-grained image classification has shown great potential in addressing data scarcity and distinguishing highly similar categories. However, existing unidirectional reconstruction methods, while enhancing inter-class differences, fail to effectively suppress intra-class variations; bidirectional reconstruction methods, although alleviating intra-class variations, inevitably introduce background noise. To overcome these limitations, this paper proposes a Bidirectional Feature Reconstruction Network that incorporates a Feature Enhancement Attention Module (FEAM) to highlight discriminative regions and suppress background interference, while integrating a Channel-Aware Spatial Attention (CASA) module to strengthen local feature modeling and compensate for the Transformer’s tendency to overemphasize global information. This joint design not only enhances inter-class separability but also effectively reduces intra-class variation. Extensive experiments on the CUB-200-2011, Stanford Cars, and Stanford Dogs datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches, validating its effectiveness and robustness in few-shot fine-grained image classification.

1. Introduction

In recent years, few-shot fine-grained image classification (FS-FGIC [1], as shown in Figure 1) has gradually become a prominent research focus in the field of computer vision. FS-FGIC demonstrates significant practical value across multiple application scenarios. For instance, in ship classification, it can distinguish between different types or models of ships [2]; in synthetic aperture radar (SAR) image analysis, it enables fine-grained recognition of terrain features or targets [3]; and in marine ecological monitoring, it can identify different marine organisms or ecological states [4].
These application scenarios highlight the potential of FS-FGIC while also revealing the fundamental challenges the task faces in capturing subtle intra-class variations. FS-FGIC requires not only distinguishing inter-class differences but also capturing fine-grained intra-class variations, such as shape, texture, and local structural features, which are crucial for accurate instance discrimination. Effectively extracting such nuanced features under conditions of limited samples remains a core challenge for FS-FGIC.
Metric learning [5], due to its efficiency and simplicity, has been widely applied in few-shot image classification. Representative approaches include ProtoNet [6], which classifies query samples by computing their Euclidean distances to class prototypes, and RelationNet [7], which models relationships between samples by learning a nonlinear similarity function. While these methods perform well in general few-shot classification tasks, they face limitations in FS-FGIC. In particular, traditional global average pooling often loses critical spatial information, whereas directly flattening feature maps preserves features but makes the model overly sensitive to pose variations, thus hindering the capture of fine-grained local details. Therefore, designing methods that can both retain essential spatial details and mitigate overfitting and positional sensitivity is imperative for enhancing the extraction of fine-grained features.
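To make the metric-learning baseline concrete, the following is a minimal sketch of the prototype-based scoring described above (support embeddings averaged into class prototypes, queries scored by the negative squared Euclidean distance). The function name, tensor shapes, and the embedding network producing the inputs are illustrative assumptions, not code from the cited works.

```python
import torch

def prototype_logits(support, query, n_way, k_shot):
    """ProtoNet-style scoring: average support embeddings into class
    prototypes and score queries by negative squared Euclidean distance.

    support: (n_way * k_shot, d) embeddings, grouped by class
    query:   (n_query, d) embeddings
    """
    prototypes = support.view(n_way, k_shot, -1).mean(dim=1)   # (n_way, d) class prototypes
    dists = torch.cdist(query, prototypes, p=2) ** 2           # (n_query, n_way) squared distances
    return -dists                                              # higher score = more similar
```

In practice the embeddings would come from a backbone such as Conv-4 or ResNet-12, and the logits would be fed into a cross-entropy loss during episodic training.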
Building upon conventional metric learning approaches, recent studies have explored feature reconstruction-based strategies to better capture fine-grained details and more precisely model the relationships between support and query features. For example, DeepEMD [8] formulates feature matching as an optimal transport problem, achieving high accuracy but with considerable computational cost. To enhance efficiency, Wertheimer et al. [9] proposed the Feature Map Reconstruction Network (FRN), which formulates feature reconstruction as a ridge regression problem and derives optimal reconstruction weights via a closed-form solution, enabling efficient reconstruction of query features from the support set. Following this, Li et al. [10] introduced the Locally Content-enriched Cross Reconstruction Network (LCCRN), which improves semantic representation through local feature extraction and cross-view reconstruction. Furthermore, Li et al. [11] incorporated task-difference maximization and center loss into the FRN framework to alleviate overfitting on fine-grained subclasses and enhance inter-subclass discriminability. More recently, BiEN [12] combined self-reconstruction and bidirectional reconstruction modules to simultaneously improve inter-class separability and reduce intra-class variance, while employing a snapshot ensemble strategy to strengthen model robustness and generalization.
Meanwhile, attention mechanisms and Transformer architectures have also been increasingly applied to FS-FGIC due to their ability to dynamically model relationships between features and emphasize task-relevant information. Ruan et al. [13] proposed the Spatial Attentive Comparison Network (SACN), which uses a Selective Comparison Similarity Module for pixel-level fusion of support and query features. Huang et al. [14] introduced SAPENet, incorporating self-attention and intra-class attention blocks to enhance local features within each support sample while leveraging channel attention to preserve critical channel-level information. Ma et al. [15] proposed C2-Net, which integrates Cross-Layer Feature Refinement and Cross-Sample Feature Adjustment modules to address feature misalignment. Li et al. [16] presented CDN4, generating multi-view representations of support and query samples via attention and performing cross-view measurements to enhance focus on discriminative features. Sun et al. [17] introduced TST_MFL, incorporating self-attention modules into a local classification subnetwork to capture complex dependencies among local patches, producing more discriminative feature representations. These studies collectively demonstrate that attention mechanisms can significantly improve feature discriminability in FS-FGIC.
Although attention mechanisms and Transformer architectures demonstrate significant advantages in capturing global context and enhancing feature discriminability, they still face certain challenges in FS-FGIC. For instance, local features may become misaligned, attention modules are prone to overfitting under limited sample conditions, and background noise can be amplified, which interferes with the extraction of key discriminative features. Feature reconstruction-based models, while capable of capturing class-specific features, also encounter difficulties in FS-FGIC due to large intra-class variations and background noise. Specifically, reconstruction weights can be influenced by non-critical regions, causing the reconstructed features to contain redundant information and thereby weakening class discriminability. Our experiments further reveal that existing bidirectional reconstruction models, such as BiFRN, often introduce irrelevant background noise when focusing on key regions (see Figure 2), a phenomenon that is particularly detrimental for fine-grained classification tasks.
To address the aforementioned challenges, this paper proposes an FS-FGIC framework that first applies a Feature Enhancement Attention Module (FEAM) to enhance input features. The enhanced features are then processed by a Feature Refinement Module (FRM), which incorporates a Transformer encoder for global context modeling and a Channel-Aware Spatial Attention (CASA) module for discriminative local feature extraction. In addition, a residual bidirectional reconstruction mechanism is employed to align support and query features effectively. This framework enables simultaneous modeling of global contextual dependencies while emphasizing fine-grained local features. The main contributions of this work are summarized as follows:
(1)
In this work, the Feature Enhancement Attention Module (FEAM) is proposed, which integrates multi-scale convolutions with attention mechanisms. This design effectively combines features from the main branch and the side branch, enabling the model to comprehensively perceive the data distribution and accurately respond to feature variations in different local regions.
(2)
The Feature Refinement Module and the Residual Reconstruction Module are introduced. The Feature Refinement Module integrates self-attention mechanisms with feature fusion techniques to aggregate information across feature channels and capture subtle, discriminative fine-grained details. Building on these refined features, the Residual Reconstruction Module leverages residual learning to fully exploit the enhanced representations, thereby improving the model's ability to separate minor inter-class differences in the presence of significant intra-class variations.
(3)
Experiments on three widely used fine-grained classification datasets demonstrate the effectiveness and robustness of the proposed method. Detailed ablation studies and performance comparisons show that the proposed components contribute significantly to addressing the challenges of FS-FGIC.
The structure of this manuscript is organized as follows: Section 1 first introduces the research background and motivation for FS-FGIC, while providing a systematic review of existing related methods. This includes metric learning, feature reconstruction approaches, and the application of attention mechanisms in FS-FGIC. Section 2 offers a detailed introduction to the proposed algorithm, with an in-depth explanation of each module. Section 3 describes the parameter settings used for model training and the experimental datasets, followed by a comprehensive analysis conducted on three standard datasets. Section 4 provides a conclusion to the overall work.

2. Proposed Method

2.1. The Proposed Algorithm

Existing feature reconstruction-based FS-FGIC methods primarily focus on reconstructing query set features from support set features, as well as mutual reconstruction between the support set and query set features. These models face challenges in effectively reducing intra-class variations and are influenced by background noise, which negatively affects their performance.
In this paper, a Residual Reconstruction Network based on Feature Enhancement Attention is introduced to address the limitations of FS-FGIC. The model consists of four key modules (as shown in Figure 3): (1) the Feature Embedding Module (merged from the Feature Enhancement Attention Module and the Embedding Module), which combines multi-scale feature extraction and attention mechanisms to extract and enhance key image features through networks such as Conv-4 or ResNet-12, thereby improving classification performance; (2) the Feature Refinement Module, which leverages self-attention mechanisms and feature fusion to enhance fine-grained feature representation; (3) the Residual Feature Similarity and Reconstruction (RFSR) Module, which combines bidirectional reconstruction with residual connections to enhance inter-class separability and reduce intra-class variation; and (4) the Euclidean Distance Module, which calculates reconstruction distances for classification.
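The following skeleton sketches how a single episode could flow through these four modules; all class and attribute names (`embed`, `frm`, `rfsr`, and so on) are placeholders introduced here for illustration and do not correspond to released code.

```python
import torch.nn as nn

class RRNetSketch(nn.Module):
    """Hypothetical skeleton of the pipeline in Figure 3; every sub-module
    passed in is a placeholder callable with the indicated role."""
    def __init__(self, embed, frm, rfsr):
        super().__init__()
        self.embed = embed    # backbone (Conv-4 / ResNet-12) combined with FEAM
        self.frm = frm        # feature refinement (CASA + Transformer encoder + iAFF)
        self.rfsr = rfsr      # residual feature similarity and reconstruction

    def forward(self, support_imgs, query_imgs):
        s = self.frm(self.embed(support_imgs))     # refined support features
        q = self.frm(self.embed(query_imgs))       # refined query features
        s_hat, q_hat = self.rfsr(s, q)             # mutual reconstruction and enhancement
        return s, q, s_hat, q_hat                  # Euclidean distances are computed downstream
```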

2.2. Feature Enhancement Attention Module

In FS-FGIC tasks, traditional convolutional neural networks struggle to effectively extract discriminative features due to the limited number of samples and the high similarity between categories. To address this issue, this paper proposes a Feature Enhancement Attention Module (FEAM, as shown in Figure 4), which extracts multi-scale semantic information through a multi-branch convolutional structure and expands the receptive field by incorporating multi-scale convolutional kernels and depthwise separable convolution techniques.
The FEAM module consists of three main branches, two of which use convolutional kernels with expanded receptive fields (such as 1 × 3, 3 × 1, and 3 × 3) to enhance the multi-scale representation of local features, while the other branch employs standard convolutions. To further optimize the feature map, the FEAM module introduces a channel attention mechanism by replacing the fully connected layers in the SE module [18] with convolution operations, reducing the number of parameters and effectively learning the dependencies between channels. Finally, the FEAM module fuses the extracted features and adjusts the output through residual connections. The mathematical formulation of the module is as follows:
$$
f_{FEAM}(x)=\left\{
\begin{aligned}
x_1 &= f^{conv}_{3\times 3}\big(f^{conv}_{3\times 3}(x)\big)\\
x_2 &= f^{diconv}_{5}\Big(f^{conv}_{3\times 1}\big(f^{conv}_{1\times 3}(f^{conv}_{1\times 1}(x))\big)\Big)\\
x_3 &= f^{diconv}_{5}\Big(f^{conv}_{1\times 3}\big(f^{conv}_{3\times 1}(f^{conv}_{1\times 1}(x))\big)\Big)\\
x' &= \sigma\Big[f^{conv}_{1\times 1}\big(\mathrm{Cat}(x_1,x_2,x_3)\big)\Big]\cdot \mathrm{Cat}(x_1,x_2,x_3)\\
x'' &= f^{conv}_{1\times 1}(x')\oplus f^{conv}_{dp}(x')\\
X &= f^{conv}_{adaptive}(x'')
\end{aligned}
\right.
$$
where $f^{conv}_{n\times m}$ denotes a convolution with a kernel size of $n\times m$, $f^{diconv}_{k}$ denotes a dilated convolution with a dilation rate of $k$, $f^{conv}_{dp}$ denotes a depthwise separable convolution, $\sigma$ is the sigmoid function, and $f^{conv}_{adaptive}$ denotes an adaptive convolution that adjusts the size of the feature map required by the subsequent network step. The symbol $\oplus$ indicates element-wise addition, $x$ is the input feature map, $x'$ and $x''$ are intermediate feature maps, and $X$ is the output feature map produced by the module.
Compared to FEM [19], FEAM introduces depthwise separable convolutions and channel attention mechanisms. Additionally, through the multi-scale feature enhancement module, FEAM learns fine-grained local features, thereby enhancing its ability to represent complex targets.
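A minimal PyTorch sketch of Equation (1) is given below. The branch structure and the dilation rate of 5 follow the text, while the channel counts, padding choices, and the exact wiring of the conv-based channel attention and the depthwise separable path are assumptions; the final adaptive resizing convolution is omitted.

```python
import torch
import torch.nn as nn

class FEAMSketch(nn.Module):
    """Illustrative sketch of the FEAM branches in Equation (1)."""
    def __init__(self, c):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                     nn.Conv2d(c, c, 3, padding=1))
        self.branch2 = nn.Sequential(nn.Conv2d(c, c, 1),
                                     nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
                                     nn.Conv2d(c, c, (3, 1), padding=(1, 0)),
                                     nn.Conv2d(c, c, 3, padding=5, dilation=5))
        self.branch3 = nn.Sequential(nn.Conv2d(c, c, 1),
                                     nn.Conv2d(c, c, (3, 1), padding=(1, 0)),
                                     nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
                                     nn.Conv2d(c, c, 3, padding=5, dilation=5))
        self.attn = nn.Conv2d(3 * c, 3 * c, 1)              # conv-based channel attention (SE-style, no FC)
        self.fuse = nn.Conv2d(3 * c, c, 1)                  # pointwise fusion path
        self.dw = nn.Sequential(nn.Conv2d(3 * c, 3 * c, 3, padding=1, groups=3 * c),
                                nn.Conv2d(3 * c, c, 1))     # depthwise separable path

    def forward(self, x):
        cat = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        cat = torch.sigmoid(self.attn(cat)) * cat           # re-weight concatenated features
        return self.fuse(cat) + self.dw(cat)                # element-wise addition, as in the last step
```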

2.3. Feature Refinement Module

The FRM module consists of the CASA module, the Transformer encoder, and the iAFF [20] feature fusion module (Figure 5 illustrates the details of the iAFF module). As shown in Figure 3a, the CASA module first processes the input feature map through the CBAM module [21], which computes channel and spatial attention maps to weight the input feature map, thereby emphasizing important features. Next, the feature map undergoes further processing through a pointwise convolution layer to adjust the number of channels. Then, the SimAM attention mechanism [22] is applied to compute the mean squared difference for each channel of the feature map, and an adaptive weight $y$ is generated based on the mean and variance of the feature map, as shown in the following formula:
$$
y=\frac{(x-\mu)^2}{4\left(\dfrac{\sum (x-\mu)^2}{n}+\lambda\right)}+0.5
$$
where $x$ is the input feature map, $\mu$ is the per-channel mean, $\lambda$ is a tunable parameter, and $n$ is the spatial size of the feature map. Finally, the module normalizes the weight $y$ using the sigmoid activation function and multiplies it with the original feature map to reinforce the important features, yielding the final output. This module enhances the model's sensitivity to key regions and improves the effectiveness of the feature representation.
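The adaptive weighting of Equation (2) can be sketched as follows; the per-channel statistics are computed over spatial positions, and the default value of `lam` is an assumption.

```python
import torch

def simam_weight(x, lam=1e-4):
    """Adaptive weight of Equation (2), applied per channel over spatial positions.

    x: feature map of shape (B, C, H, W)
    """
    b, c, h, w = x.shape
    n = h * w                                               # spatial size, as stated in the text
    mu = x.mean(dim=(2, 3), keepdim=True)                   # per-channel spatial mean
    sq_diff = (x - mu) ** 2
    y = sq_diff / (4 * (sq_diff.sum(dim=(2, 3), keepdim=True) / n + lam)) + 0.5
    return x * torch.sigmoid(y)                             # reinforce important positions
```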
The composition of the Transformer encoder is shown in Figure 3b. We use the sum of the spatial sequence $[\tilde{x}_i^1, \tilde{x}_i^2, \ldots, \tilde{x}_i^r]$ and the positional encoding $P_{sin}$ as the input to the Transformer, where $P_{sin}$ denotes the sinusoidal positional encoding. This paper follows the standard self-attention computation in the Transformer encoder, as shown in Equation (3):
$$
\mathrm{Attention}(Q,K,V)=\sigma\!\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V
$$
where $\sigma$ is the sigmoid activation function. The output $\tilde{z}_i$ can then be obtained through Equation (4):
$$
\tilde{z}_i=\mathrm{MLP}\Big(\mathrm{LN}\big(z_i+\mathrm{Attention}(z_iW^{Q}_{\phi},\,z_iW^{K}_{\phi},\,z_iW^{V}_{\phi})\big)\Big)
$$
where $z_i$ is the input to the Transformer encoder module, $W^{Q}_{\phi}, W^{K}_{\phi}, W^{V}_{\phi} \in \mathbb{R}^{d\times d}$ are learnable weight matrices, and $\mathrm{LN}$ denotes the layer normalization operation. Next, the refined features from the Transformer encoder and the CASA module are fed into the iAFF module for fusion.
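A single-head sketch of Equations (3) and (4) is shown below. It keeps the sigmoid normalization $\sigma$ stated in the text, while the MLP width, the GELU activation, and the single-head formulation are assumptions of this sketch.

```python
import math
import torch
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    """Single-head encoder step of Equations (3)-(4); dimensions assumed."""
    def __init__(self, d):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, z):                                    # z: (B, r, d) spatial tokens plus P_sin
        q, k, v = self.wq(z), self.wk(z), self.wv(z)
        attn = torch.sigmoid(q @ k.transpose(-2, -1) / math.sqrt(z.size(-1)))  # Eq. (3)
        return self.mlp(self.norm(z + attn @ v))             # Eq. (4)
```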

2.4. Residual Feature Similarity and Reconstruction Module

The Feature Mutual Reconstruction Module (FMRM) [23] is an advanced technique aimed at enhancing feature matching and similarity computation across various tasks, including few-shot learning (FSL). Inspired by FMRM-based methods, the Residual Feature Similarity and Reconstruction (RFSR) module is introduced, which incorporates current state-of-the-art bidirectional feature reconstruction strategies. This module performs two main tasks: first, reconstructing the query set features from the support set, and second, reconstructing the support set features from the query set. Figure 6 illustrates the details of this module, which is designed to enhance mutual feature alignment and improve discriminative capability in FSL.
In the common N-way K-shot few-shot image classification setting, we have the support features of the $n$-th class, denoted $S_{(n,k)}$ with $k\in\{1,2,\ldots,K\}$, and the query features, denoted $Q_i$ with $i\in\{1,2,\ldots,C\times M\}$. For each class $n$, the RFSR module performs mutual reconstruction between the support set and the query set. Specifically, the query features $Q_i$ are mapped into the $Q$, $K$, $V$ spaces by linear transformations (multiplication with the corresponding weight matrices) to obtain $Q_i^{Q}$, $Q_i^{K}$, $Q_i^{V}$. Similarly, the support features $S_n$ are mapped into the same spaces to obtain $S_{(n\times k)}^{Q}$, $S_{(n\times k)}^{K}$, $S_{(n\times k)}^{V}$. Then, according to Equations (5) and (6), we obtain the mutually reconstructed features $\hat{S}_{(i\times n)}$ and $\hat{Q}_{(n\times i)}$.
$$
\hat{S}_{(i\times n)}=\mathrm{Attention}\big(S_n^{Q},\,Q_i^{K},\,Q_i^{V}\big)
$$
$$
\hat{Q}_{(n\times i)}=\mathrm{Attention}\big(Q_i^{Q},\,S_n^{K},\,S_n^{V}\big)
$$
To further improve the expressiveness of the features, we employ a feature enhancement strategy (shown in Equations (7) and (8)), which performs element-wise multiplication between the original and reconstructed features. This enables the model to capture meaningful interactions between the support set and the query set.
$$
\hat{S}^{e}_{(i\times n)}=\hat{S}_{(i\times n)}\times S_{(i\times n)}
$$
$$
\hat{Q}^{eV}_{n}=\hat{Q}^{V}_{n}\times Q_{n}
$$
The enhanced features incorporate both the original input features and the reconstructed contextual information, making them more discriminative.
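The mutual reconstruction and enhancement steps (Equations (5) to (8)) can be sketched for a single class and query pair as follows; shared projection matrices and the sigmoid normalization of Equation (3) are assumptions of this sketch, and the shapes are illustrative.

```python
import math
import torch

def mutual_reconstruct(s, q, wq, wk, wv):
    """Bidirectional reconstruction and element-wise enhancement, Eqs. (5)-(8).

    s: (Ns, d) support spatial features of one class
    q: (Nq, d) query spatial features of one query image
    wq, wk, wv: (d, d) projection matrices (assumed to be shared)
    """
    def attend(a, b):                                    # reconstruct a from b
        weights = torch.sigmoid((a @ wq) @ (b @ wk).t() / math.sqrt(a.size(-1)))
        return weights @ (b @ wv)

    s_hat = attend(s, q)                                 # Eq. (5): support reconstructed from query
    q_hat = attend(q, s)                                 # Eq. (6): query reconstructed from support
    return s_hat * s, q_hat * q                          # Eqs. (7)-(8): enhance with the originals
```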

2.5. Metric Module

Through the RFSR module, we obtain the reconstructed and enhanced features; we then need to measure the similarity between the support set features and the query set features. In this paper, we use the Euclidean distance. Here, $d_{Q_iS_n}$ is the distance from the query sample $Q_i$ to the support sample $S_n$ of class $n$, as shown in Equation (9):
$$
d_{Q_iS_n}=\big\|\hat{Q}^{eV}_{n}-\hat{Q}_{(n,i)}\big\|^{2}
$$
Similarly, we can obtain $d_{S_nQ_i}$. The total distance can then be calculated by Equation (10):
$$
d^{n}_{i}=\tau\left(\omega_{1}\,d_{Q_iS_n}+\omega_{2}\,d_{S_nQ_i}\right)
$$
where $\omega_1$ and $\omega_2$ are learnable weight parameters and $\tau$ is the temperature factor. After that, normalization is performed to obtain $\hat{d}^{\,n}_{i}$. The total loss $L$ for the N-way K-shot tasks is then:
$$
L=-\frac{1}{M\times N}\sum_{i=1}^{M\times N}\sum_{n=1}^{N}\mathbb{I}\left[y_i=n\right]\log\hat{d}^{\,n}_{i}
$$
where $\mathbb{I}[y_i=n]$ equals 1 when $y_i=n$ and 0 otherwise.
During the training process, we minimize L to update the proposed network.
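For clarity, the distance weighting and loss of Equations (9) to (11) can be sketched as below; realizing the normalization of $\hat{d}^{\,n}_{i}$ as a softmax over negative weighted distances is an assumption about how the text's normalization step is implemented.

```python
import torch
import torch.nn.functional as F

def episode_loss(d_q2s, d_s2q, labels, w1, w2, tau):
    """Weighted reconstruction distances (Eq. (10)) turned into a loss (Eq. (11)).

    d_q2s, d_s2q: (num_query, n_way) squared reconstruction distances in each direction
    labels:       (num_query,) ground-truth class indices
    w1, w2, tau:  learnable scalars (direction weights and temperature)
    """
    d = tau * (w1 * d_q2s + w2 * d_s2q)          # Eq. (10)
    log_p = F.log_softmax(-d, dim=1)             # smaller distance -> higher probability (assumed)
    return F.nll_loss(log_p, labels)             # Eq. (11), averaged over the M*N query samples
```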

3. Experiments

3.1. Experiments Setup

To evaluate the performance of the proposed method, we selected three fine-grained benchmark datasets: CUB-200-2011 [24], Stanford-Dogs [25], and Stanford-Cars [26]. Each dataset is split into training, validation, and test sets according to the proportions described in reference [27]. All images are resized to 84 × 84 .
Below, we provide a brief overview of each dataset:
The CUB-200-2011 dataset (CUB) is a well-established benchmark for fine-grained image classification. It comprises 11,788 images spanning 200 bird species. Following the protocols in [28], each image is cropped according to human-annotated bounding boxes.
Stanford Dogs (Dogs) presents a challenging fine-grained classification task, consisting of 20,580 labeled images covering 120 dog breeds worldwide.
Stanford Cars (Cars) is also a widely used fine-grained classification benchmark containing 16,185 images representing 196 car types, categorized mainly by brand, model, and manufacturing year.
We conducted experiments on two widely adopted backbone architectures: Conv-4 and ResNet-12, following the design principles outlined in FRN. All experiments were implemented in a PyTorch-1.7.0-based environment.
For the Conv-4 backbone, models were trained for 800 epochs using stochastic gradient descent (SGD) with Nesterov momentum (momentum coefficient of 0.9). The initial learning rate was set to 0.1 and decayed to 0.01 after 400 epochs. Training was performed under a 30-way 5-shot setting, and evaluation was carried out under both 1-shot and 5-shot scenarios, with 15 query images per class used in each case.
For the ResNet-12 backbone, models were trained for 1200 epochs with the learning rate initialized at 0.1 and reduced by a factor of 10 every 400 epochs. We used the same optimizer (SGD with Nesterov momentum of 0.9). To mitigate memory consumption, training was conducted under a 15-way 5-shot setting, while evaluation followed standard 1-shot and 5-shot configurations. Unless otherwise specified, the weight decay was set to 5 × 10⁻⁴.
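The ResNet-12 schedule described above corresponds roughly to the following PyTorch configuration; `model` and `train_one_epoch` are placeholders, and the episodic batch sampler is omitted from this sketch.

```python
import torch

# Training-schedule sketch for the ResNet-12 setting described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.1)

for epoch in range(1200):
    train_one_epoch(model, optimizer)   # hypothetical helper: one 15-way 5-shot episodic epoch
    scheduler.step()                    # learning rate divided by 10 every 400 epochs
```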
We employed standard data augmentation techniques during training, including center cropping, random horizontal flipping, and color jittering. Model selection was based on validation performance, with evaluation conducted every 20 epochs.
All final results are reported for the CUB-200-2011, Stanford Dogs, and Stanford Cars datasets by averaging the accuracy over 10,000 randomly sampled episodes under the standard 5-way 1-shot and 5-shot settings. The reported results include the 95% confidence intervals for all accuracy metrics.

3.2. Comparative Experiments

To validate the effectiveness of our proposed method in FS-FGIC tasks, we conducted experiments using two backbone networks, Conv-4 and ResNet-12, on three fine-grained classification datasets: CUB, Dogs, and Cars. The experimental results are presented in Table 1 and Table 2, respectively.
As shown in Table 1, based on the Conv-4 backbone network, our method significantly outperforms existing mainstream approaches on the 5-way 1-shot and 5-shot tasks across all three datasets. Compared to strong-performing models such as BiFRN and TDM+CSCAM, our method improves 1-shot accuracy by approximately 2 to 5 percentage points, with smaller but consistent gains in the 5-shot setting, demonstrating stronger feature representation and classification capabilities. Notably, the improvement is especially pronounced on the more challenging Dogs dataset, indicating that our method has advantages in capturing subtle differences between fine-grained categories.
When conducting experiments with the more powerful ResNet-12 backbone network (shown in Table 2), our method also achieves the best performance, further validating its generalization ability and robustness. Compared to other state-of-the-art methods using the same backbone, we attain leading accuracy on the 1-shot and 5-shot tasks across the CUB, Dogs, and Cars datasets. Notably, the improvement in accuracy during the 5-shot tasks is both stable and significant, indicating that the model can effectively leverage additional samples to achieve more efficient learning.
Figure 7 and Figure 8 illustrate the validation accuracies of our proposed method compared with SRM and BiFRN on three fine-grained datasets (Dogs, CUB, Cars), using Conv-4 and ResNet-12 as backbone networks.
Under the Conv-4 shallow backbone network (Figure 7), our method consistently achieves higher validation accuracy than SRM and BiFRN across all datasets, particularly excelling on the Dogs and Cars datasets. This demonstrates strong feature modeling capabilities and good generalization even with a lightweight backbone.
With the deeper ResNet-12 backbone (Figure 8), our method sustains stable validation performance, significantly outperforming SRM on the Dogs and CUB datasets, and achieving performance comparable to or slightly better than BiFRN.
Overall, our method demonstrates strong generalization ability and adaptability across both shallow and deep backbone networks, highlighting its superior performance and practical potential in few-shot learning tasks.
This success is mainly attributed to the design of key modules. The FEAM module effectively enhances feature representation by enabling the network to focus on critical regions within the image; the feature refinement module integrates both local and global information to further optimize feature representations and highlight detailed features; meanwhile, the RFSR module improves the model’s sensitivity to subtle differences in fine-grained images through residual learning and feature similarity reconstruction. The collaborative effect of these three modules substantially enhances the model’s discriminative power, enabling more stable and efficient training and inference in few-shot learning.
In summary, the comparative experiments verify the effectiveness of our method across different backbone networks and demonstrate its broad applicability and competitiveness in various FS-FGIC tasks. The results fully indicate that the proposed module design and learning strategies effectively improve the model’s discriminative capability and detail capture, thereby achieving more stable and superior performance.

3.3. Visualization

To demonstrate that our method effectively focuses on the most discriminative regions of the image for classification, we employed the Eigen-CAM [33] technique to visualize and compare the performance of BiFRN and our method on the same images (as shown in Figure 2). By visualizing the feature activation maps, we can clearly highlight the key regions that each method attends to during the classification process. Specifically, we fed the same image to both methods and generated the feature activation maps for each using the Eigen-CAM technique. The results show that, compared to BiFRN, our method more accurately focuses on the most discriminative regions of the image while mitigating the impact of noisy background areas on the foreground subject, thereby enhancing both classification accuracy and robustness.
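For reference, the general Eigen-CAM computation (projecting the final convolutional activations onto their first principal component) can be sketched as follows; this is a generic illustration of the technique, not the exact script used to produce Figure 2.

```python
import torch

def eigen_cam(feature_map):
    """Eigen-CAM-style activation map for one image.

    feature_map: (C, H, W) activations from the last convolutional layer.
    Returns an (H, W) saliency map normalized to [0, 1].
    """
    c, h, w = feature_map.shape
    flat = feature_map.permute(1, 2, 0).reshape(h * w, c)    # spatial positions as rows
    flat = flat - flat.mean(dim=0, keepdim=True)             # center before PCA
    _, _, vh = torch.linalg.svd(flat, full_matrices=False)   # principal directions in rows of vh
    cam = torch.relu((flat @ vh[0]).reshape(h, w))           # project onto the first component
    return cam / (cam.max() + 1e-8)                          # normalize for overlay on the image
```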

3.4. Ablation Experiments

To rigorously evaluate the contribution of each component in the proposed method, ablation experiments were conducted on the CUB and Stanford Cars datasets using Conv-4 as the backbone. Hyperparameters were kept consistent with the main experiments. The results are summarized in Table 3.
As shown in Table 3, each module—FEAM, CASA, and RFSR—provides measurable improvements over the baseline. The FEAM module delivers the largest individual gain, particularly on the Cars dataset, reflecting its strong capability in enhancing local feature representations. The CASA module contributes noticeably on the CUB dataset but shows limited effect on Cars, likely due to the regular spatial structure of vehicle images reducing the impact of spatial attention. The RFSR module consistently improves performance across datasets by facilitating semantic-level feature reconstruction.
Combinations of modules yield further gains, highlighting their complementary roles. Notably, the combination of RFSR and FEAM surpasses other dual-module configurations, suggesting a synergistic effect between semantic selection and feature enhancement. Incorporating all three modules achieves the highest performance across all experimental settings, demonstrating their structural complementarity and the substantial benefit of the proposed design for FS-FGIC tasks.

4. Conclusions

This paper proposes an improved method for FS-FGIC that effectively addresses the limitations of existing unidirectional and bidirectional reconstruction techniques. Traditional reconstruction methods often struggle to accurately capture subtle intra-class variations and inter-class discriminative features due to insufficient feature interaction mechanisms. To overcome these challenges, we introduce two novel modules within a bidirectional feature reconstruction framework: the Feature Enhancement Attention Module (FEAM) and the Channel-Aware Spatial Attention (CASA) module. Additionally, a residual learning-based module is incorporated to facilitate effective propagation and refinement of deep features, thereby further enhancing feature representation quality.
The FEAM module dynamically emphasizes critical semantic features while suppressing irrelevant noise, effectively improving the discriminative capability of feature representations. The CASA module adaptively models spatial dependencies within images, enabling the network to focus on subtle regions crucial for distinguishing fine-grained categories. The integration of these two modules enables enhanced interaction of both inter-class and intra-class features, resulting in more robust and precise classification boundaries.
Extensive experiments were conducted on multiple widely used public benchmark datasets, including CUB-200-2011, Stanford Cars and Stanford Dogs. The results demonstrate that the proposed method achieves state-of-the-art performance in FS-FGIC tasks, with significant improvements observed under both 1-shot and 5-shot settings. These findings indicate the model’s superior capability in capturing fine-grained discriminative details.
Although our method is evaluated on generic datasets, its feature reconstruction mechanism is particularly well-suited to domains with limited annotations, such as tumor segmentation in medical imaging and ship classification in remote sensing imagery. We plan to further explore this direction in future work.

Author Contributions

Conceptualization, Y.L.; Funding acquisition, Y.L. and W.Z.; Methodology, Y.L.; Resources, Y.L.; Software, H.Z.; Supervision, Y.L.; Validation, H.Z.; Visualization, H.Z.; Writing—original draft, Y.L. and H.Z.; Writing—review & editing, Y.L. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, X.S.; Wang, P.; Liu, L.; Shen, C.; Wu, J. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. IEEE Trans. Image Process. 2019, 28, 6116–6125. [Google Scholar] [CrossRef] [PubMed]
  2. Li, Y.; Chen, L.; Li, W.; Wang, N. Few-shot fine-grained classification with rotation-invariant feature map complementary reconstruction network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608312. [Google Scholar] [CrossRef]
  3. Yang, M.; Bai, X.; Wang, L.; Zhou, F. HENC: Hierarchical embedding network with center calibration for few-shot fine-grained SAR target classification. IEEE Trans. Image Process. 2023, 32, 3324–3337. [Google Scholar] [CrossRef] [PubMed]
  4. Sun, X.; Xv, H.; Dong, J.; Zhou, H.; Chen, C.; Li, Q. Few-shot learning for domain-specific fine-grained image classification. IEEE Trans. Ind. Electron. 2020, 68, 3588–3598. [Google Scholar] [CrossRef]
  5. Li, X.; Yang, X.; Ma, Z.; Xue, J.H. Deep metric learning for few-shot image classification: A review of recent developments. Pattern Recognit. 2023, 138, 109381. [Google Scholar] [CrossRef]
  6. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4080–4090. [Google Scholar]
  7. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208. [Google Scholar]
  8. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12203–12213. [Google Scholar]
  9. Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8012–8021. [Google Scholar]
  10. Li, X.; Song, Q.; Wu, J.; Zhu, R.; Ma, Z.; Xue, J.H. Locally-enriched cross-reconstruction for few-shot fine-grained image classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7530–7540. [Google Scholar] [CrossRef]
  11. Li, X.; Guo, Z.; Zhu, R.; Ma, Z.; Guo, J.; Xue, J.H. A simple scheme to amplify inter-class discrepancy for improving few-shot fine-grained image classification. Pattern Recognit. 2024, 156, 110736. [Google Scholar] [CrossRef]
  12. Wu, J.; Chang, D.; Sain, A.; Li, X.; Ma, Z.; Cao, J.; Guo, J.; Song, Y.Z. Bi-directional ensemble feature reconstruction network for few-shot fine-grained classification. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6082–6096. [Google Scholar] [CrossRef] [PubMed]
  13. Ruan, X.; Lin, G.; Long, C.; Lu, S. Few-shot fine-grained classification with spatial attentive comparison. Knowl.-Based Syst. 2021, 218, 106840. [Google Scholar] [CrossRef]
  14. Huang, X.; Choi, S.H. Sapenet: Self-attention based prototype enhancement network for few-shot learning. Pattern Recognit. 2023, 135, 109170. [Google Scholar] [CrossRef]
  15. Ma, Z.X.; Chen, Z.D.; Zhao, L.J.; Zhang, Z.C.; Luo, X.; Xu, X.S. Cross-layer and cross-sample feature optimization network for few-shot fine-grained image classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4136–4144. [Google Scholar]
  16. Li, X.; Ding, S.; Xie, J.; Yang, X.; Ma, Z.; Xue, J.H. CDN4: A cross-view Deep Nearest Neighbor Neural Network for fine-grained few-shot classification. Pattern Recognit. 2025, 163, 111466. [Google Scholar] [CrossRef]
  17. Sun, Z.; Zheng, W.; Guo, P.; Wang, M. TST_MFL: Two-stage training based metric fusion learning for few-shot image classification. Inf. Fusion 2025, 113, 102611. [Google Scholar] [CrossRef]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  19. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  20. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  22. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  23. Wu, J.; Chang, D.; Sain, A.; Li, X.; Ma, Z.; Cao, J.; Guo, J.; Song, Y.Z. Bi-directional feature reconstruction network for fine-grained few-shot image classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2821–2829. [Google Scholar]
  24. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S.J. The Caltech-UCSD Birds-200-2011 Dataset. 2011. Available online: https://authors.library.caltech.edu/records/cvm3y-5hh21 (accessed on 9 September 2025).
  25. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 554–561. [Google Scholar]
  26. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Fei-Fei, L. Novel datasets for fine-grained image categorization. In Proceedings of the First Workshop on Fine Grained Visual Categorization, Columbus, OH, USA, 23–28 June 2014; Volume 5, p. 2. [Google Scholar]
  27. Zhu, Y.; Liu, C.; Jiang, S. Multi-attention meta learning for few-shot fine-grained image recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; pp. 1090–1096. [Google Scholar]
  28. Ye, H.J.; Hu, H.; Zhan, D.C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8808–8817. [Google Scholar]
  29. Doersch, C.; Gupta, A.; Zisserman, A. Crosstransformers: Spatially-aware few-shot transfer. Adv. Neural Inf. Process. Syst. 2020, 33, 21981–21993. [Google Scholar]
  30. Lee, S.; Moon, W.; Heo, J.P. Task discrepancy maximization for fine-grained few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5331–5340. [Google Scholar]
  31. Li, X.; Li, Z.; Xie, J.; Yang, X.; Xue, J.H.; Ma, Z. Self-reconstruction network for fine-grained few-shot classification. Pattern Recognit. 2024, 153, 110485. [Google Scholar] [CrossRef]
  32. Yang, S.; Li, X.; Chang, D.; Ma, Z.; Xue, J.H. Channel-Spatial Support-Query Cross-Attention for Fine-Grained Few-Shot Image Classification. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 9175–9183. [Google Scholar]
  33. Muhammad, M.B.; Yeasin, M. Eigen-cam: Class activation map using principal components. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–7. [Google Scholar]
Figure 1. Major challenges of the FS-FGIC task. The figure shows an example of a 5-way 5-shot FS-FGIC task. Horizontally, there is less variation between subcategories in different contexts. Vertically, greater variation is observed within each subcategory.
Figure 2. Eigen-CAM visualization of feature activations from the trained BiFRN and our method with the ResNet-12 backbone.
Figure 3. The structure of our model, illustrated with an example of 5-way 1-shot classification. (a) is the CASA module, and (b) is the Transformer encoder module.
Figure 4. The structure of FEAM.
Figure 5. (a) is the iAFF module framework, and (b) is the MS-CAM module.
Figure 6. The structure of the RFSR module.
Figure 7. Validation accuracy comparison of our method against baselines on multiple datasets (5-way 5-shot) with the Conv-4 backbone.
Figure 8. Validation accuracy comparison of our method against baselines on multiple datasets (5-way 5-shot) with the ResNet-12 backbone.
Table 1. 5-way few-shot classification accuracies on the CUB, Dogs and Cars datasets with the Conv-4 backbone. Mean accuracy and the 95% confidence interval are reported.

Method | CUB 1-Shot | CUB 5-Shot | Dogs 1-Shot | Dogs 5-Shot | Cars 1-Shot | Cars 5-Shot
ProtoNet (NeurIPS’17) * [6] | 64.82 ± 0.23 | 64.82 ± 0.23 | 46.66 ± 0.21 | 46.66 ± 0.21 | 50.88 ± 0.23 | 66.07 ± 0.21
CTX (NeurIPS’20) * [29] | 72.61 ± 0.21 | 86.23 ± 0.14 | 57.86 ± 0.21 | 73.59 ± 0.16 | 66.35 ± 0.21 | 82.25 ± 0.14
FRN (CVPR’21) [9] | 75.64 ± 0.21 | 89.39 ± 0.12 | 60.72 ± 0.22 | 79.07 ± 0.15 | 68.37 ± 0.21 | 87.51 ± 0.11
FRN+TDM(-noise) (CVPR’22) * [30] | 76.55 ± 0.21 | 90.33 ± 0.11 | 62.68 ± 0.22 | 79.59 ± 0.15 | 71.16 ± 0.21 | 89.55 ± 0.10
BiFRN (AAAI’23) [23] | 78.56 ± 0.20 | 91.81 ± 0.11 | 64.89 ± 0.22 | 81.51 ± 0.14 | 76.30 ± 0.20 | 91.63 ± 0.09
SRM (PR’24) [31] | 73.53 ± 0.21 | 87.81 ± 0.13 | 58.09 ± 0.21 | 77.47 ± 0.15 | 66.07 ± 0.21 | 84.90 ± 0.13
AFRN (PR’24) [11] | 75.99 ± 0.23 | 89.25 ± 0.12 | 60.16 ± 0.22 | 78.73 ± 0.15 | 68.08 ± 0.22 | 87.38 ± 0.12
TDM+CSCAM (MM’24) [32] | 79.89 ± 0.20 | 92.49 ± 0.11 | 65.22 ± 0.22 | 82.45 ± 0.14 | 78.34 ± 0.20 | 92.08 ± 0.09
Ours | 82.00 ± 0.19 | 93.06 ± 0.10 | 70.16 ± 0.21 | 85.21 ± 0.13 | 80.48 ± 0.19 | 93.46 ± 0.08

* Results quoted from BiFRN; all other baseline results were reproduced using the officially published code in our experimental setup.
Table 2. 5-way few-shot classification accuracies on the CUB, Dogs and Cars datasets with the ResNet-12 backbone. Mean accuracy and the 95% confidence interval are reported.

Method | CUB 1-Shot | CUB 5-Shot | Dogs 1-Shot | Dogs 5-Shot | Cars 1-Shot | Cars 5-Shot
ProtoNet (NeurIPS’17) * [6] | 81.59 ± 0.19 | 91.99 ± 0.10 | 73.81 ± 0.21 | 87.39 ± 0.12 | 85.46 ± 0.19 | 95.08 ± 0.08
CTX (NeurIPS’20) * [29] | 80.39 ± 0.20 | 91.01 ± 0.11 | 73.22 ± 0.22 | 85.90 ± 0.13 | 85.03 ± 0.19 | 92.63 ± 0.11
FRN (CVPR’21) [9] | 84.30 ± 0.18 | 93.34 ± 0.10 | 76.76 ± 0.21 | 88.74 ± 0.12 | 88.01 ± 0.17 | 95.75 ± 0.07
FRN+TDM(-noise) (CVPR’22) * [30] | 84.97 ± 0.18 | 93.83 ± 0.09 | 77.94 ± 0.21 | 89.54 ± 0.12 | 88.80 ± 0.16 | 97.02 ± 0.06
BiFRN (AAAI’23) [23] | 85.50 ± 0.18 | 94.73 ± 0.09 | 76.55 ± 0.21 | 88.22 ± 0.12 | 90.28 ± 0.14 | 97.45 ± 0.06
SRM (PR’24) [31] | 82.46 ± 0.19 | 93.71 ± 0.09 | 75.23 ± 0.20 | 88.84 ± 0.11 | 84.06 ± 0.19 | 96.07 ± 0.07
AFRN (PR’24) [11] | 84.78 ± 0.18 | 93.49 ± 0.09 | 78.23 ± 0.21 | 89.18 ± 0.16 | 89.18 ± 0.16 | 96.25 ± 0.07
TDM+CSCAM (MM’24) [32] | 85.78 ± 0.18 | 94.42 ± 0.09 | 76.35 ± 0.21 | 88.94 ± 0.11 | 90.34 ± 0.15 | 96.53 ± 0.08
Ours | 86.42 ± 0.17 | 94.87 ± 0.08 | 78.46 ± 0.20 | 90.06 ± 0.11 | 90.92 ± 0.14 | 97.62 ± 0.05

* Results quoted from BiFRN; all other baseline results were reproduced using the officially published code in our experimental setup.
Table 3. Ablation experiments on the CUB-200-2011 and Cars datasets with the Conv-4 backbone.

Baseline | RFSR | CASA | FEAM | CUB 1-Shot | CUB 5-Shot | Cars 1-Shot | Cars 5-Shot
✓ | | | | 78.56 ± 0.20 | 91.81 ± 0.11 | 76.30 ± 0.20 | 91.63 ± 0.09
✓ | ✓ | | | 79.10 ± 0.20 | 92.39 ± 0.10 | 75.82 ± 0.20 | 91.55 ± 0.10
✓ | | ✓ | | 80.21 ± 0.20 | 92.44 ± 0.10 | 75.92 ± 0.21 | 90.64 ± 0.10
✓ | | | ✓ | 81.48 ± 0.19 | 92.76 ± 0.10 | 80.12 ± 0.19 | 92.65 ± 0.08
✓ | ✓ | ✓ | | 80.35 ± 0.19 | 92.44 ± 0.10 | 76.69 ± 0.20 | 91.74 ± 0.09
✓ | ✓ | | ✓ | 80.82 ± 0.19 | 92.89 ± 0.10 | 79.39 ± 0.19 | 93.25 ± 0.08
✓ | ✓ | ✓ | ✓ | 82.00 ± 0.19 | 93.06 ± 0.10 | 80.48 ± 0.19 | 93.46 ± 0.08