Article

TA-MSA: A Fine-Tuning Framework for Few-Shot Remote Sensing Scene Classification

1 National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Chengdu 610209, China
2 The Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610209, China
3 The Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
4 University of Chinese Academy of Sciences, Beijing 100049, China
5 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(8), 1395; https://doi.org/10.3390/rs17081395
Submission received: 14 March 2025 / Revised: 2 April 2025 / Accepted: 13 April 2025 / Published: 14 April 2025

Abstract

Existing few-shot remote sensing scene classification (FS-RSSC) works primarily follow the meta-learning paradigm, which meta-trains a model on an auxiliary dataset before adapting it to target FS-RSSC tasks. To ensure good performance, the auxiliary dataset should share similar distributions with the target tasks. However, acquiring such an auxiliary dataset is difficult and economically costly in real-world FS-RSSC applications. To address this issue, we aim to handle FS-RSSC tasks by directly fine-tuning a general pre-trained model, eliminating the need for an auxiliary dataset related to the target tasks. In this paper, we propose a novel fine-tuning framework, named TA-MSA, which consists of a Task-Adaptive (TA) fine-tuning strategy and a Multi-level Spatial feature Aggregation (MSA) module. The TA fine-tuning strategy is composed of two components: (1) a layer-specific optimizer that alleviates distribution shifts between the pre-trained and target remote sensing datasets, and (2) a task-specific training scheme designed to accommodate variations in discriminative features across different FS-RSSC tasks. Additionally, to suppress the negative effect of the cluttered backgrounds and enhance the spatial features of true discriminative regions, the MSA module extracts multi-level spatially important features using trainable spatial templates for classification. Experimental analysis demonstrates the superiority of the proposed TA-MSA framework. On three FS-RSSC benchmarks (NWPU-RESISC45, UC Merced LandUse, and WHU-RS19), our TA-MSA framework outperforms many state-of-the-art methods, achieving an average classification accuracy of 76.78% in the 5-way 1-shot setting and 91.89% in the 5-way 5-shot setting.

1. Introduction

Research on remote sensing scene classification (RSSC) is essential for resource investigation [1], geographical image retrieval [2], and environmental monitoring [3]. Numerous deep learning-based models have been introduced to address RSSC tasks effectively [4,5]. Most of these models rely on large-scale annotated datasets of remote sensing images [6]. However, large numbers of labeled samples for the categories to be recognized are difficult to obtain in some real-world applications, owing to data sensitivity and annotation cost [7]. As a result, few-shot remote sensing scene classification (FS-RSSC), which aims to recognize novel remote sensing scenes from only a few labeled samples, has attracted great attention in recent years [8,9].
Existing FS-RSSC research mainly focuses on meta-learning-based methods [10,11]. In the meta-learning paradigm, abundant simulated few-shot tasks are drawn from an auxiliary dataset for meta-training [12]. These simulated tasks are split into meta-training and meta-validation tasks with disjoint categories. During the meta-training phase, the model is trained to minimize the classification loss on the meta-training tasks, and its best configuration is selected as the one with the highest classification performance on the meta-validation tasks [13,14]. During the meta-testing phase, the meta-trained model is directly evaluated on the target tasks to be solved [15]. Notably, meta-learning methods always assume that the auxiliary dataset has a distribution similar to that of the target tasks. In most existing FS-RSSC studies [16,17,18,19], the auxiliary dataset used to generate the simulated meta-training and meta-validation tasks comes from the same RSSC dataset as the target FS-RSSC tasks to be solved. In other words, the meta-training, meta-validation, and target tasks always share similar data distributions in this meta-learning paradigm, which plays a significant role in ensuring good performance.
However, for practical FS-RSSC tasks, only a few labeled samples are available, and meta-training and meta-validation tasks that share similar distributions with the target FS-RSSC tasks are not always accessible. As claimed in [20], remote sensing images are captured under different conditions and thus exhibit varying distributions. For target FS-RSSC tasks of unknown distribution, sampling proper images to build meta-training and meta-validation tasks is unrealistic. Moreover, even when the sampling conditions of the target FS-RSSC tasks are known in advance, constructing a well-annotated auxiliary dataset remains challenging and incurs extensive economic and labor costs [21]. As a result, to enable the practical implementation of FS-RSSC systems in real-world scenarios, a method that relies on only a few labeled samples and eliminates the need for a well-annotated auxiliary dataset related to the target tasks is highly desirable.
Recently, the fine-tuning paradigm has shown great potential in few-shot learning [22,23,24]. For example, it improves the average classification accuracy of the TPN + AFA method on four popular CDFSL benchmarks (EuroSAT, ISIC, CropDiseases, and ChestX) by 3.81% in the 5-way 1-shot setting and 4.94% in the 5-way 5-shot setting. Inspired by these studies, and in order to handle FS-RSSC tasks without relying on any pre-constructed auxiliary dataset related to the target tasks, we focus on a fine-tuning-based framework. Specifically, to ensure the generality and reproducibility of our approach, we directly fine-tune a publicly available pre-trained model for FS-RSSC tasks. It is important to emphasize that we only use the pre-trained model; the pre-training dataset is not required. During the fine-tuning process, the model is updated solely based on a few labeled samples.
Solving FS-RSSC tasks in such a fine-tuning-based paradigm poses three challenges. First, domain shifts between the public pre-training dataset and the target FS-RSSC tasks must be bridged with only a few labeled samples. Many publicly available pre-trained models, including the ResNet-18 and ResNet-34 models provided by the PyTorch framework [25,26], are trained on the large-scale ImageNet dataset [27]. However, ImageNet contains many images of natural categories whose data distributions differ substantially from those of remote sensing images, as depicted in Figure 1a. As demonstrated in many cross-domain studies [13,23,28], these distribution shifts cause severe performance degradation when the pre-trained model is directly adapted to remote sensing tasks, and the degradation can be even greater when the available supervised samples are very limited.
The second challenge lies in the variation in discriminative feature patterns across different FS-RSSC tasks. Luo et al. observed that different classification tasks may emphasize different discriminative information [29]. Inspired by this, we examine the same phenomenon in FS-RSSC tasks. As depicted in Figure 1b, the exemplar 3-way 2-shot tasks are distinguished by different feature patterns. For example, the categories in Group I are each characterized by their discriminative objects, such as the baseball field, stadium, and intersection. Categories in Group II can be easily distinguished by their color and textural features. In contrast, the discriminative features for Group III are more complex, requiring the identification of the main linear structures in the image, such as railway tracks, along with the detailed information surrounding these structures. In summary, the effective feature patterns vary across FS-RSSC tasks, posing a challenge to existing fine-tuning algorithms that learn all features in a unified manner [30].
The third challenge involves locating the true discriminative regions in remote sensing images. The classification results can vary significantly depending on which regions are emphasized. For example, in the first group of Figure 1c, focusing on the background information of these two images may lead to misclassifications into the “dense residential” category, which features many buildings surrounded by plants. Similarly, in the second group, class confusion arises due to the background information shared by images from two different categories. Clearly, the class-specific discriminative objects in remote sensing images vary in shape, size, and position, hindering recognition performance.
To handle the aforementioned challenges, we introduce an effective fine-tuning framework, TA-MSA, which is composed of a Task-Adaptive (TA) fine-tuning strategy and a Multi-level Spatial feature Aggregation (MSA) module. First, to efficiently address the discrepancies between the pre-training and remote sensing datasets, our TA fine-tuning strategy applies a layer-specific optimizer to fine-tune the pre-trained model. Specifically, instead of applying the same learning rate to all layers, our strategy assesses the degree of distribution bias in each layer and generates layer-specific learning rates for fine-tuning. Second, recognizing the variability of discriminative features across FS-RSSC tasks, our TA fine-tuning strategy incorporates a task-specific training scheme designed to identify the most discriminative feature patterns and enhance their learning for accurate classification. The discriminative criterion is based on the inter-class similarity and variance of the support samples.
Additionally, our proposed TA-MSA framework incorporates a Multi-level Spatial feature Aggregation (MSA) module to highlight the true discriminative regions in remote sensing images. This module features learnable spatial templates following every block in the feature extractor. These spatial templates first extract saliency maps of the discriminative regions. Next, multi-level spatially important features are generated by combining these saliency maps with the feature maps output by each block. These spatially important features are then concatenated and passed into the final classifier. In this way, the class-specific discriminative regions are highlighted to improve the classification accuracy.
To summarize, our key contributions can be outlined in four points. First, we propose an innovative fine-tuning framework, TA-MSA, which directly fine-tunes a publicly available pre-trained model for FS-RSSC tasks. This framework offers a more practical solution compared to the existing meta-learning-based methods by eliminating the need for a pre-constructed auxiliary dataset related to the target FS-RSSC tasks. Second, we introduce a task-adaptive (TA) fine-tuning strategy composed of a layer-specific optimizer and a task-specific discriminative-features-emphasizing training scheme. Third, we design a multi-level spatial feature aggregation (MSA) module that combines multi-level spatially important features to enhance the classification accuracy. Finally, we perform a series of comprehensive experiments to validate the superiority of the proposed TA-MSA framework in FS-RSSC.

2. Related Work

We review the research that is closely relevant to our study in this section, focusing on few-shot remote sensing scene classification in Section 2.1, cross-domain generalization in Section 2.2 and task adaptation with few labeled samples in Section 2.3.

2.1. Few-Shot Remote Sensing Scene Classification

FS-RSSC studies are mainly focused on two problems: (1) the over-fitting problem arising from limited labeled samples and (2) the classification difficulty due to the large intra-class variance and inter-class similarity in remote sensing images [31].
To handle the over-fitting risk, Zhang et al. transformed few-shot classification tasks into classical classification tasks with an online sample generation method [31]. Tang et al. designed a class-level training strategy based on meta-learning [10], which employs an episodic training paradigm to simulate few-shot scenes during training. Besides, the methods proposed in [8,18,32] are all based on meta-learning. Recently, transfer-learning-based methods [22,33], especially parameter-efficient tuning approaches developed for large-scale deep learning models [34,35], have attracted great attention. Inspired by these works, Zhu et al. proposed the meta visual prompt (MVP) tuning approach, which fine-tunes a pre-trained network for downstream tasks using few labeled samples [21]. Similar to MVP, our TA-MSA framework aims to directly learn task-specific models for downstream tasks with few labeled samples. In contrast, our framework employs a lightweight backbone network, which imposes lower requirements on deployment conditions.
Furthermore, the issues of large intra-class variance and high inter-class similarity have received considerable attention in FS-RSSC research. To address them, Wang et al. proposed a spatial affinity attention mechanism and a class surrogate-based learning strategy to enhance the focus on important regions and promote inter-class dispersion [9]. Tian et al. introduced a hierarchical relation network to increase the model's discriminative power by learning hierarchical feature relations between the support and query sets [19]. Additionally, Li et al. designed a global-local contrastive learning auxiliary task for the recognition of images from different categories with low distinguishability [36].
In this paper, to solve the low-data dilemma, our framework is based on the fine-tuning paradigm. Besides, we address the class confusion problem in FS-RSSC tasks by introducing the multi-level spatial feature aggregation module, which captures multi-level spatially important features to enhance the final prediction.

2.2. Cross-Domain Generalization

The problem of domain shifts between the training and test datasets has been widely explored in classical machine learning and few-shot learning, as seen in the research on domain generalization [37] and cross-domain few-shot learning [38]. To achieve cross-domain generalization, existing approaches primarily focus on enhancing the source-domain training or the adaptation to target domains. Methods such as FWT [39], AFA [24], and ATA [23] conduct data augmentation during training to increase the diversity and complexity of the training data, thereby improving the model's generalization ability. Hu et al. proposed the DSL method, which improves robustness to domain shifts by continuously switching tasks from different domains during training [40]. Li et al. learned domain-invariant features by minimizing distances between different source domains [41]. These methods all enhance the model's generalization ability by improving the training on the source domain [13]. Additionally, adaptation-based methods have been proposed to accurately and efficiently adjust the trained model to target tasks. Fine-tuning a pre-trained model's last few layers or classifier head has shown great potential in overcoming domain shifts [22]. Based on the fine-tuning paradigm, Ji et al. devised a dense-sparse-dense training framework to achieve domain generalization in the low-data regime [42]. Lee et al. claimed that selectively fine-tuning specific layers may perform better than only updating the last few layers, and that the optimal fine-tuning choice depends on the type of distribution shift [43].
Our proposed framework, based on fine-tuning, follows the target domain adaptation paradigm. To effectively bridge the distribution shifts between the pre-trained model and the target remote-sensing datasets, the task-adaptive fine-tuning strategy employs a layer-specific optimizer inspired by [43].

2.3. Task Adaptation with Few Labeled Samples

In few-shot classification, category shifts widely exist between the training and test tasks and limit the classification performance [29]. Various approaches have been introduced to handle this issue, focusing on task adaptation with few labeled samples [44,45,46]. Li et al. designed task-specific adapters, which predict task-specific weights of the model's classifier conditioned on a few labeled samples [44]. Zhao et al. aligned the base knowledge to the target task by generating adaptive prototypes according to the characteristics of the target tasks [46]. Based on fine-tuning, Ji et al. localized the parameters in the pre-trained model that are harmful to downstream tasks and introduced a dense-sparse-dense (DSD) fine-tuning flow to achieve efficient adaptation to target tasks [42].
In our study, both category and domain shifts exist between pre-training and target tasks, presenting greater challenges for task adaptation with few labeled samples. To handle diverse FS-RSSC tasks with the pre-trained model, our task-adaptive fine-tuning strategy includes a task-specific training scheme to enhance the learning of discriminative features for better task adaptation.

3. Methods

3.1. Preliminary

Problem definition. FS-RSSC tasks aim to recognize query images of novel remote sensing scenes with few labeled samples. A C-way K-shot FS-RSSC task $\mathcal{T}$ includes a support set $\mathcal{S} = \{X_s, Y_s\}$ and a query set $\mathcal{Q} = \{X_q, Y_q\}$. The support set $\mathcal{S}$ includes $K$ labeled samples for each of the $C$ novel categories. The query set $\mathcal{Q}$ is composed of several query images to be recognized. Aiming at directly learning task-specific models without relying on meta-training on auxiliary datasets, our TA-MSA framework builds upon the fine-tuning paradigm. During fine-tuning, we utilize the classification loss of the samples in $\mathcal{S}$ to learn the task-specific model for task $\mathcal{T}$.
The structure of the feature extractor. As depicted in [25], the structure of most residual networks can be divided into four blocks. Following many existing FS-RSSC works [47,48], we employ the ResNet-18 backbone network as the feature extractor. The architecture of the ResNet-18 is detailed in Table 1 [25]. The four blocks in Table 1 correspond to those shown in Figure 2. Building upon a pre-trained ResNet-18 feature extractor, our proposed fine-tuning framework aims to efficiently learn task-specific models for FS-RSSC tasks with few labeled samples.
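As a concrete illustration, the four blocks correspond to the layer1 through layer4 modules of the torchvision ResNet-18 implementation. The following minimal sketch shows how they can be separated for block-wise processing; the helper name split_resnet18_blocks and the choice of ImageNet weights are our own illustrative assumptions, not the authors' code.

```python
# Minimal sketch: splitting a pre-trained torchvision ResNet-18 into its stem and four blocks.
# The weight specifier assumes torchvision >= 0.13; older versions use pretrained=True instead.
import torch
import torchvision

def split_resnet18_blocks():
    net = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
    blocks = [net.layer1, net.layer2, net.layer3, net.layer4]  # the four blocks in Table 1
    return stem, blocks

stem, blocks = split_resnet18_blocks()
x = torch.randn(1, 3, 224, 224)
z, features = stem(x), []
for block in blocks:
    z = block(z)
    features.append(z)  # intermediate outputs, later reused by the MSA module
print([f.shape for f in features])  # channel widths 64, 128, 256, 512
```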
The classification loss for the FS-RSSC task. Different from many metric-based methods, we employ a simple trainable fully connected layer as the linear classifier, denoted as $F(\cdot)$. Besides, let us denote the original feature extractor and the multi-level spatial features aggregation module as $E(\cdot)$ and $E_\theta(\cdot)$, respectively. For a remote sensing image $(x, y)$ sampled from a C-way K-shot FS-RSSC task, the prediction of a standard model is
$\hat{y} = F(E(x)),$  (1)
where $\hat{y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_C\}$ and $\hat{y}_i$ is the prediction score of the query image $x$ belonging to the $i$-th category. In our proposed TA-MSA fine-tuning framework, with the task-adaptive fine-tuning strategy and the multi-level spatial features aggregation module, the extracted features and the final prediction can be denoted as
$z = \mathrm{concat}(E(x), E_\theta(x)),$  (2)
$\hat{y} = F(Mz),$  (3)
where the matrix $M$ represents a channel-wise mask detailed in Equation (13).
During the fine-tuning process, the classification loss over all images in $\mathcal{S}$ is utilized to update the trainable parameters of the model. The overall loss function is
$\mathrm{Loss} = \sum_{x \in X_s} \mathrm{CE}(y, \hat{y}).$  (4)
After fine-tuning, the classification accuracy is calculated based on the correctly predicted labels for all samples in the query set. The predicted label of a sample $(x, y)$ is
$c = \arg\max \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_C\}.$  (5)
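To make the notation concrete, a minimal PyTorch sketch of Equations (2)–(5) is given below. It assumes that extractor, msa, and classifier implement $E(\cdot)$, $E_\theta(\cdot)$, and $F(\cdot)$, respectively, and that mask holds the channel-wise vector $M$; these module and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Equations (2)-(5): concatenated features, channel-wise mask, cross-entropy on the
# support set, and arg-max prediction for query images. Module names are illustrative.
import torch
import torch.nn.functional as F

def forward_and_loss(extractor, msa, classifier, mask, x_support, y_support):
    z = torch.cat([extractor(x_support), msa(x_support)], dim=1)  # Eq. (2)
    logits = classifier(mask * z)                                 # Eq. (3)
    return F.cross_entropy(logits, y_support)                     # Eq. (4)

def predict(extractor, msa, classifier, mask, x_query):
    z = torch.cat([extractor(x_query), msa(x_query)], dim=1)
    return classifier(mask * z).argmax(dim=1)                     # Eq. (5)
```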

3.2. Overview of the Proposed TA-MSA Fine-Tuning Framework

An overview of our TA-MSA framework is presented in Figure 2. This framework aims to address FS-RSSC tasks with a general pre-trained feature extractor. There are two key innovations in our proposed TA-MSA fine-tuning framework: (1) the task-adaptive fine-tuning strategy and (2) the multi-level spatial feature aggregation module. To effectively bridge the distribution shifts between pre-training and FS-RSSC tasks, the task-adaptive fine-tuning strategy optimizes the model’s different layers using layer-specific learning rates. Besides, to better adapt to various FS-RSSC tasks, the task-adaptive fine-tuning strategy also enhances the learning of task-specific discriminative features. Specifically, the strategy first evaluates the discriminative degree of feature maps produced by different channels and then emphasizes the training of the most discriminative channels. Additionally, to accumulate spatial information related to true discriminative regions, the proposed Multi-level Spatial feature Aggregation (MSA) module extracts multi-level spatially important features using learnable spatial templates. These features are subsequently integrated with the original feature outputs to improve the classification accuracy.
In the remainder of this section, we first introduce the task-adaptive fine-tuning strategy, which comprises a layer-specific optimizer (Section 3.3) and a task-specific training scheme (Section 3.4). Next, we present the multi-level spatial features aggregation module (Section 3.5). Finally, the fine-tuning process of the proposed TA-MSA framework is concisely summarized in Algorithm 1.
Algorithm 1: The TA-MSA fine-tuning framework.
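Since Algorithm 1 appears only as a figure in the original layout, the following is a rough sketch of the overall fine-tuning loop as we read it from Sections 3.3, 3.4, and 3.5; it is not a transcription of Algorithm 1. The helpers forward_and_loss, layer_specific_lrs, and channel_mask refer to the illustrative sketches accompanying Equations (2)–(5), (6)–(8), and (9)–(13), and all attribute, argument, and default values here are hypothetical.

```python
# Rough sketch of the TA-MSA fine-tuning loop (our reading, not the authors' exact algorithm).
import torch

def ta_msa_finetune(model, support_x, support_y, support_feats,
                    epochs=50, n1=10, lr0=5e-4, q=896, lam=0.9):
    for epoch in range(epochs):
        # Task-specific training scheme: emphasize the Q most discriminative channels during
        # the first N1 epochs, then train on all channels (Section 3.4).
        mask = channel_mask(support_feats, q, lam) if epoch < n1 else torch.tensor(1.0)
        loss = forward_and_loss(model.extractor, model.msa, model.classifier,
                                mask, support_x, support_y)
        loss.backward()
        # Layer-specific optimizer: learning rates from RGN scores, re-evaluated every step
        # (Section 3.3); other trainable parameters use a fixed rate (Section 4.1).
        lrs = layer_specific_lrs(model, lr0)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if param.grad is not None:
                    layer = name.rsplit(".", 1)[0]
                    param -= lrs.get(layer, 1e-3) * param.grad
        model.zero_grad()
    return model
```

In practice, the support features used by channel_mask would be re-extracted with the current model, and the hyper-parameter values follow Section 4.1 (the total number of epochs above is only a placeholder).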

3.3. Layer-Specific Optimizer in the Task-Adaptive Fine-Tuning Strategy

As demonstrated by Lee et al., for different distribution shifts, the optimal subsets of model layers to be fine-tuned are different [43]. Inspired by this, to effectively bridge the distribution shifts between the pre-training and the target FS-RSSC datasets, different model layers should be optimized to different extents during the fine-tuning.
As a result, we propose a layer-specific optimizer, designed to update the different layers of the pre-trained model according to their relevance to the target FS-RSSC task. To determine this relevance, we utilize the relative gradient norm (RGN) score. Let us denote the set of weights in the model's $l$-th layer as $W_l$. Then, for each weight $w \in W_l$, we calculate its RGN score to evaluate its relevance to the target task as
$RGN(w) = \frac{\|g\|_2}{\|w\|_2},$  (6)
where $g$ is the gradient of the training loss with respect to the parameter $w$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm. The gradient $g$ describes the change in the model's performance with respect to the parameter $w$ for the current task, while $\|w\|_2$ serves as a normalization term that eliminates the influence of differences in parameter magnitude. Thus, $RGN(w)$ reflects the importance of $w$ for the current task. For the $l$-th layer, its RGN score is obtained by summing the RGN scores of all parameters in this layer, as
$RGN_l = \sum_{w \in W_l} RGN(w).$  (7)
Intuitively, the larger $RGN_l$ is, the more effort should be devoted to adapting the $l$-th layer to the target task. As a result, we generate the layer-specific learning rates as
$lr_l = f(RGN_l) = lr_0 \cdot \frac{RGN_l}{\max_{l \in L} RGN_l},$  (8)
where $lr_0$ is a hyper-parameter representing the base learning rate, $\max_{l \in L} RGN_l$ is a normalization term that constrains the ratio $RGN_l / \max_{l \in L} RGN_l$ to $(0, 1]$ (so that $lr_l \in (0, lr_0]$), and $L$ denotes the set of all trainable layers.
Furthermore, the RGN scores, as well as the learning rates of the parameters to be learned, are dynamically updated during the fine-tuning process. After each optimization step, we re-evaluate the importance score of each parameter for the current task, so that the pre-trained model can be rapidly adapted to specific target tasks with few labeled samples.
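A minimal PyTorch sketch of Equations (6)–(8) is given below, assuming the support-set classification loss has already been back-propagated so that every trainable parameter carries a gradient; the function name and the way parameters are grouped into layers are illustrative choices rather than the authors' implementation.

```python
# Sketch of Equations (6)-(8): per-layer relative gradient norms mapped to layer-specific
# learning rates. Assumes loss.backward() has been called on the support-set loss.
import torch

def layer_specific_lrs(model, lr0=5e-4):
    rgn_per_layer = {}
    for name, param in model.named_parameters():
        if not param.requires_grad or param.grad is None:
            continue
        layer = name.rsplit(".", 1)[0]                             # group parameters by parent module
        rgn = param.grad.norm(p=2) / (param.norm(p=2) + 1e-12)     # Eq. (6)
        rgn_per_layer[layer] = rgn_per_layer.get(layer, 0.0) + rgn.item()  # Eq. (7)
    max_rgn = max(rgn_per_layer.values())
    # Eq. (8): normalize by the largest layer score so that lr_l lies in (0, lr0].
    return {layer: lr0 * rgn / max_rgn for layer, rgn in rgn_per_layer.items()}
```

The returned dictionary can then be mapped onto optimizer parameter groups; as stated above, the scores and the resulting learning rates are re-computed after every optimization step.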

3.4. Task-Specific Training Scheme in the Task-Adaptive Fine-Tuning Strategy

As demonstrated in [29], the discriminative features required for recognition vary across different tasks, even when these tasks are sampled from the same dataset. Besides, feature patterns are widely recognized to be correlated with the output channels of the feature extractor. For example, Luo et al. utilized a simple transformation that balances the outputs of different channels to enhance the model's generalization ability on novel tasks [29]. Li et al. improved the model's capacity for learning general and diverse features by dynamically discarding the outputs of some channels [13]. As a result, to adapt to the shifts in discriminative features across various FS-RSSC tasks, we integrate a novel training scheme into the task-adaptive fine-tuning strategy. This scheme emphasizes the learning of discriminative channels during fine-tuning.
Specifically, this training scheme first evaluates the discriminative degree of the different output feature channels and then emphasizes the learning of the most discriminative ones with a channel-wise mask. As the criterion of discriminative degree, we assume that features exhibiting lower inter-class similarity and higher inter-class variance are more discriminative. All samples in the support set are employed to construct this criterion. We denote the final feature representation fed into the linear classifier as $f \in \mathbb{R}^D$. Inspired by [49], the inter-class similarity can be computed by
$S_d = \frac{1}{C^2 K^2} \sum_{i=1}^{C} \sum_{j=1, j \neq i}^{C} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} x_d^{i,k_1} \cdot x_d^{j,k_2},$  (9)
where $d \in \{1, \ldots, D\}$ denotes the index of the feature channel, and $x_d^{i,k_1}$ and $x_d^{j,k_2}$ represent the value of the $d$-th channel in the $k_1$-th and $k_2$-th support features belonging to class $i$ and class $j$ ($i, j \in \{1, \ldots, C\}$), respectively. As shown in Equation (9), $S_d$ reflects the inter-class similarity by computing the average channel-wise similarity between all samples belonging to different classes. Besides, the inter-class variance can be calculated using the distances between the prototype features of different categories. For category $c$, its prototype feature is
$x^c = \frac{1}{K} \sum_{k=1}^{K} x^{c,k}.$  (10)
The inter-class variance is calculated as
$V_d = \frac{1}{C} \sum_{c=1}^{C} (x_d^c - \bar{x}_d)^2,$  (11)
where $\bar{x}_d = \frac{1}{C} \sum_{c=1}^{C} x_d^c$. The final discriminative criterion combines the inter-class similarity and variance as
$J_d = \lambda S_d - (1 - \lambda) V_d,$  (12)
where $\lambda$ is a trade-off hyper-parameter. A smaller inter-class similarity $S_d$ and a larger inter-class variance $V_d$ indicate a higher discriminative power of the channel, which corresponds to a lower $J_d$ score. As a result, we select the subset $D_Q$ of the $Q$ feature channels with the lowest $J_d$ scores and construct a corresponding channel-wise mask $M \in \mathbb{R}^D$, whose values are given by
$M_d = \begin{cases} 1, & \text{if } d \in D_Q \\ 0, & \text{otherwise.} \end{cases}$  (13)
During the early stage of the fine-tuning process (typically the first 10 epochs), we multiply the final output features by the channel-wise mask M to emphasize the learning of discriminative features. Subsequently, to learn additional features useful for classification, we remove the mask and train the model using all features.
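A minimal sketch of Equations (9)–(13) is given below, assuming the support-set features are arranged in a tensor of shape (C, K, D); the function name, the tensor layout, and the default values are our own illustrative choices.

```python
# Sketch of Equations (9)-(13): channel-wise inter-class similarity and variance combined into
# the discriminative criterion J_d, and a binary mask over the Q lowest-scoring channels.
import torch

def channel_mask(feats, q=896, lam=0.9):
    C, K, D = feats.shape
    flat = feats.reshape(C * K, D)
    labels = torch.arange(C).repeat_interleave(K)
    diff_class = (labels.unsqueeze(1) != labels.unsqueeze(0)).float()   # pairs from different classes
    pairwise = flat.unsqueeze(1) * flat.unsqueeze(0)                    # (CK, CK, D) channel products
    s = (pairwise * diff_class.unsqueeze(-1)).sum(dim=(0, 1)) / (C * C * K * K)  # Eq. (9)
    protos = feats.mean(dim=1)                                          # Eq. (10): class prototypes
    v = protos.var(dim=0, unbiased=False)                               # Eq. (11)
    j = lam * s - (1.0 - lam) * v                                       # Eq. (12)
    mask = torch.zeros(D)
    mask[j.topk(q, largest=False).indices] = 1.0                        # Eq. (13)
    return mask
```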
In Figure 3, we vividly depict our proposed task-adaptive fine-tuning strategy in comparison to the standard fine-tuning approach.

3.5. Multi-Level Spatial Features Aggregation Module

To extract spatial features correlated with the true discriminative regions, the multi-level spatial features aggregation module is proposed. This module first extracts saliency maps highlighting important spatial regions with spatial templates at different levels. Then, the spatially important features are generated by multiplying these saliency maps with the corresponding feature maps.
The intermediate feature output by the $i$-th block can be represented as $z_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, $i \in \{1, 2, 3\}$, where $C_i$ is the number of output channels, and $H_i$ and $W_i$ are the height and width of the feature maps. The spatial templates are expected to represent the features of true discriminative regions. We implement each spatial template with convolutional kernels, and the $i$-th spatial template can be represented as $t_i \in \mathbb{R}^{C_i \times h_i \times w_i}$. First, saliency maps are extracted through a convolution of the original feature maps with the spatial templates as
$h_i = \mathrm{Conv}(z_i, t_i), \quad i = 1, 2, 3,$  (14)
where $h_i$ denotes the saliency map, which has the same shape as the feature map $z_i$. Then, spatially important features are generated from these saliency maps as
$\hat{z}_i = \mathrm{AvgPool}(z_i \cdot h_i),$  (15)
where $\hat{z}_i \in \mathbb{R}^{C_i}$ and $\mathrm{AvgPool}(\cdot)$ represents the average pooling operation. Subsequently, the multi-level spatially important features are concatenated as
$E_\theta(x) = \mathrm{concat}(\hat{z}_1, \hat{z}_2, \hat{z}_3).$  (16)
$E_\theta(x)$ is then combined with $E(x)$ in Equation (2) for the final prediction.
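A minimal PyTorch sketch of the MSA module is shown below. Because Equation (14) states that the saliency map $h_i$ keeps the same shape as $z_i$, we interpret the spatial template $t_i \in \mathbb{R}^{C_i \times h_i \times w_i}$ as a depth-wise convolution kernel; this interpretation, the channel widths of the first three ResNet-18 blocks, and the 7 × 7 template size from Section 4.1 are our assumptions rather than the authors' exact implementation.

```python
# Sketch of Equations (14)-(16): one learnable depth-wise spatial template per block, used to
# weight the feature maps before pooling and concatenation. All names are illustrative.
import torch
import torch.nn as nn

class MSAModule(nn.Module):
    def __init__(self, channels=(64, 128, 256), kernel_size=7):
        super().__init__()
        # Depth-wise convolutions whose kernels act as the spatial templates t_i; padding keeps
        # the saliency map h_i the same spatial size as the input feature map z_i.
        self.templates = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size, stride=1, padding=kernel_size // 2, groups=c)
            for c in channels])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feats):                         # feats: intermediate maps [z1, z2, z3]
        out = []
        for z, template in zip(feats, self.templates):
            h = template(z)                           # Eq. (14): saliency map
            out.append(self.pool(z * h).flatten(1))   # Eq. (15): spatially weighted pooled feature
        return torch.cat(out, dim=1)                  # Eq. (16): multi-level concatenation
```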

4. Experiments

In this section, we begin by presenting an overview of the experimental setup, covering the datasets and implementation details (Section 4.1). Then, we evaluate the superiority of our proposed framework by comparing it with various FS-RSSC methods on the three popular benchmarks (Section 4.2). Subsequently, we conduct ablation studies (Section 4.3) and visualization analysis (Section 4.4) to have a deep understanding of our proposed TA-MSA fine-tuning framework.

4.1. Experimental Settings

Datasets. We employ three commonly used FS-RSSC datasets to evaluate our proposed framework, i.e., NWPU-RESISC45 [50], UC Merced-LandUse [51], and WHU-RS19 [52]. The NWPU-RESISC45 dataset was developed by Northwestern Polytechnical University. It consists of 31,500 images covering 45 scene categories; each category contains 700 images with a resolution of 256 × 256 pixels. Twelve classes from the NWPU-RESISC45 dataset, referred to as the meta-test data, are selected to test the model's performance. The UC Merced-LandUse dataset, created for land-use classification, contains 2100 images across 21 land-use classes, with 100 images of 256 × 256 pixels per class. Six classes of the UC Merced-LandUse dataset are chosen for evaluation. The WHU-RS19 dataset is sourced from Google Earth. It includes 1005 high-resolution satellite images covering 19 scene classes, with approximately 50 images of size 600 × 600 pixels per class. We select five classes of the WHU-RS19 dataset for evaluation. Sample images from these datasets are shown in Figure 4. Besides, for a fair comparison with other FS-RSSC methods, we select the test categories of each dataset following previous works [31,53], as outlined in Table 2.
Implementation details. The publicly available ResNet-18 backbone network is employed as our feature extractor [25] (the model weights of the pre-trained ResNet-18 network are available at https://pytorch.org/vision/stable/models/resnet.html, accessed on 12 April 2025). To mitigate the over-fitting risk, only the last block of the ResNet-18 backbone network is updated to adapt to the target tasks during the fine-tuning process. In the layer-specific optimizer, the base learning rate $lr_0$ is 0.0005, and the learning rates of the other trainable parameters are all set to 0.001. Besides, in the task-specific training scheme, the number of selected most discriminative feature channels, $Q$, is 896. To obtain the best performance, we set the hyper-parameter $\lambda$ to 0.7 and 0.6 for the UC Merced-LandUse and WHU-RS19 datasets in the 5-way 1-shot setting, respectively, $\lambda = 0.7$ for the WHU-RS19 dataset in the 5-way 5-shot setting, and $\lambda = 0.9$ in the other settings. Additionally, the number of fine-tuning epochs with the task-specific training scheme, $N_1$, is set to 10. In the multi-level spatial feature aggregation module, the learnable spatial template is configured with a 7 × 7 kernel size and a stride of 1. For evaluation, we randomly construct 1000 few-shot tasks from the test categories in Table 2 and report the model's average classification accuracy on these tasks along with a 95% confidence interval. Each of these target tasks contains 16 query samples per category. All experiments were conducted with the PyTorch 2.0.1 framework.
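For concreteness, a minimal sketch of this evaluation protocol is shown below; images_by_class is an assumed mapping from each test class to its images, and finetune_and_predict stands in for TA-MSA fine-tuning on the support set followed by query prediction. Both names are illustrative.

```python
# Sketch of the evaluation protocol: 1000 random C-way K-shot episodes with 16 query images per
# class, reporting the mean accuracy and a 95% confidence interval.
import random
import numpy as np

def evaluate(images_by_class, test_classes, finetune_and_predict,
             n_tasks=1000, c_way=5, k_shot=1, n_query=16):
    accs = []
    for _ in range(n_tasks):
        classes = random.sample(test_classes, c_way)
        support, query, query_labels = [], [], []
        for label, cls in enumerate(classes):
            imgs = random.sample(images_by_class[cls], k_shot + n_query)
            support += [(img, label) for img in imgs[:k_shot]]
            query += imgs[k_shot:]
            query_labels += [label] * n_query
        preds = finetune_and_predict(support, query)   # fine-tune on support, classify query
        accs.append(np.mean(np.array(preds) == np.array(query_labels)))
    accs = np.array(accs)
    return accs.mean(), 1.96 * accs.std() / np.sqrt(n_tasks)
```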

4.2. Comparison Results on the FS-RSSC Tasks

To validate the effectiveness of our proposed TA-MSA framework, we perform a comparative analysis with several state-of-the-art FS-RSSC methods, all of which are meta-learning-based [12,16,19,32,47,48,54,55,56,57,58,59,60,61]. Although MVP is a state-of-the-art transfer-learning-based method [21], we do not compare against it here because its experimental settings differ entirely from those in this paper. The selected methods for comparison all require meta-training on base classes sampled from the same remote sensing dataset before tackling target tasks with few labeled samples. In contrast, our proposed TA-MSA framework directly fine-tunes a pre-trained model with few labeled samples, without the need for meta-training on base classes, demonstrating greater practical applicability.
The comparison results on the three FS-RSSC datasets are reported in Table 3, Table 4 and Table 5. From these tables, it is clear that our proposed TA-MSA framework shows distinct superiority in overall performance compared with the other methods, achieving the best or second-best results in nearly all settings. For example, on the UC Merced-LandUse dataset in the 5-way 5-shot setting, TA-MSA achieves a classification accuracy of 91.75%, greatly outperforming the best accuracy achieved by the other methods (87.69%). Besides, on the WHU-RS19 dataset in the 5-way 1-shot setting, the proposed TA-MSA framework achieves a classification accuracy of 87.24%, surpassing the best result from the other methods, which is 86.89%.

4.3. Ablation Study

We perform ablation studies on three FS-RSSC datasets to analyze the individual contributions of each component in our TA-MSA framework. The experimental results primarily reflect (1) the effectiveness of the task-adaptive fine-tuning strategy and the multi-level spatial features aggregation module and (2) the effect of various hyper-parameter values on the performance of the TA-MSA framework.
The effectiveness of the “TA” and “MSA” modules in the TA-MSA framework. Our proposed TA-MSA fine-tuning framework is composed of the task-adaptive fine-tuning strategy and the multi-level spatial features aggregation module. The task-adaptive fine-tuning strategy, comprising the layer-specific optimizer and the task-specific training scheme, provides customized fine-tuning adjustments for improved adaptation to target tasks. Specifically, the layer-specific optimizer assigns different learning rates to different layers based on their relevance to the target tasks, ensuring more effective updates. The task-specific training scheme identifies the most discriminative channels for each target task and enhances their learning, ultimately improving performance on the current task. Additionally, the multi-level spatial features aggregation module improves FS-RSSC accuracy for two reasons. First, the learnable spatial templates capture spatially important features related to true discriminative regions, reducing class confusion. Second, combining multi-level features enhances the robustness of the feature representations, which in turn improves the reliability of the similarity metric between the query and support samples.
To demonstrate the individual contribution of each proposed method in the TA-MSA framework, we evaluate the ablated models on the three datasets in the 5-way 1-shot and 5-way 5-shot settings. For simplicity, the combinations of the baseline method and the proposed methods are denoted as “BS + TA”, “BS + MSA”, “BS + TA1” and “BS + TA2”, where “TA1” and “TA2” represent the layer-specific optimizer and the task-specific training scheme, respectively.
As depicted in Table 6, the task-adaptive fine-tuning strategy and the multi-level spatial features aggregation module improve the average classification accuracy on the three datasets by 0.90% and 0.67%, respectively, in the 5-way 1-shot setting. From Table 7, it is observed that the task-adaptive fine-tuning strategy and the multi-level spatial features aggregation module boost the classification accuracy by 0.67% and 0.64% on average, respectively, in the 5-way 5-shot setting. Besides, the combination of the two modules always yields the best results, with average accuracy improvements of 1.80% and 1.47% in the 5-way 1-shot and 5-way 5-shot settings, respectively, which validates the great potential of our TA-MSA fine-tuning framework.
As for the layer-specific optimizer (denoted as “TA1”) and the task-specific training scheme (denoted as “TA2”), the results in Table 6 and Table 7 demonstrate their effectiveness across the three datasets in both 5-way 1-shot and 5-way 5-shot settings. Furthermore, the results of “BS + TA” consistently outperform those for “BS + TA1” and “BS + TA2,” confirming the effectiveness of the collaboration between these two methods.
Influence of different base learning rates $lr_0$. According to Equation (8), the base learning rate $lr_0$ is crucial for the layer-specific optimizer in the task-adaptive fine-tuning strategy. We assess the performance of our TA-MSA framework with base learning rates ranging from 0.0001 to 0.002 on the three datasets in the two few-shot learning settings.
From the results in Table 8 and Table 9, the base learning rate $lr_0$ significantly impacts the performance of our TA-MSA framework. A value of $lr_0$ that is too small may lead to underfitting, while a value that is too large can result in overfitting. To achieve optimal performance, we set $lr_0$ to 0.0005.
Influence of the hyper-parameter $\lambda$. According to Equation (12), the hyper-parameter $\lambda$ balances the inter-class similarity $S_d$ and the inter-class variance $V_d$ in the discriminative scores. We assess the performance of the TA-MSA framework with different $\lambda$ values on the three datasets in the two few-shot learning settings.
The experimental results are depicted in Figure 5. On the NWPU-RESISC45 dataset, the accuracy of the FS-RSSC tasks remains stable across different values of $\lambda$ in both few-shot learning settings. On the UC Merced-LandUse dataset, our TA-MSA framework achieves the best results with $\lambda = 0.7$ in the 5-way 1-shot setting and $\lambda = 0.9$ in the 5-way 5-shot setting. On the WHU-RS19 dataset, our TA-MSA framework achieves the best results with $\lambda = 0.6$ in the 5-way 1-shot setting and $\lambda = 0.9$ in the 5-way 5-shot setting.

4.4. Visualization Analysis

To gain a deeper understanding of our proposed TA-MSA framework, we use gradient-weighted class activation mapping (Grad-CAM) [63] to visualize the predictive regions in images for FS-RSSC tasks, highlighting the key regions relevant to classification. Several visualization results are presented in Figure 6. It is evident that our TA-MSA framework concentrates more on the discriminative regions than the baseline method. For example, in the “ground track field” scene, the baseline method tends to focus on cluttered background for its prediction, while our proposed TA-MSA framework accurately focuses on the ground track field itself. In the “intersection” scene, compared to the baseline method, our TA-MSA framework places more emphasis on the crossroads themselves, aligning more closely with the semantic meaning of this category.

5. Discussion

Our proposed TA-MSA fine-tuning framework achieves promising performance in FS-RSSC tasks and demonstrates greater practical applicability compared to many existing meta-learning-based methods. Here, we provide a deeper discussion of our proposed TA-MSA framework, focusing on its generalization to other backbone networks, its computational analysis, and an exploration of its potential limitations.
First, the generalization of our proposed TA-MSA framework to other ResNet-based backbones is straightforward, while its application to transformer-based frameworks is more limited. When applied to other ResNet-based backbone networks, the methods proposed in the TA-MSA framework remain effective. This is because ResNet architectures are typically divided into multiple blocks, as shown in [25]. Therefore, we can fine-tune the last block, select discriminative channels using the TA method, and aggregate multi-level spatial features output by different blocks with the proposed MSA module. However, the TA-MSA framework has limitations when applied to transformer-based backbone networks since transformer-based networks use entirely different learnable structures and forward propagation flows [64]. Adapting our TA-MSA framework to transformer-based networks is a direction for future investigation.
Second, the computational cost of our proposed TA-MSA framework is manageable for the FS-RSSC tasks with a limited number of categories to recognize. Compared to many existing methods that calculate the relationships between query and support samples using cosine similarity [19,31], similar to our approach, the additional computational cost of our TA-MSA framework stems from the selection of discriminative features. Although our TA-MSA fine-tuning framework requires computing the discriminative power of each channel, this computation is performed only once. Its complexity for a C-way K-shot task is O ( C 2 K 2 D ) , where D represents the number of feature channels, and K is typically small in few-shot learning tasks.
Additionally, from the results in Table 8, Table 9, and Figure 5, the performance of our TA-MSA framework shows some sensitivity to changes in the hyper-parameters. One possible reason is that the supervised information is very limited, which leads to (1) a significant impact from possible noisy samples in the support set and (2) instability in the computation of the discriminative scores in Equation (12). In our future work, a well-designed data-augmentation-based method may prove to be an effective approach to addressing this issue.

6. Conclusions

In this paper, we introduce a novel fine-tuning framework, TA-MSA, which directly adapts a general pre-trained model to handle FS-RSSC tasks. The TA-MSA framework consists of a task-adaptive fine-tuning strategy and a multi-level spatial feature aggregation module. The task-adaptive fine-tuning strategy incorporates a layer-specific optimizer to effectively bridge the distribution shifts and a task-specific training scheme to rapidly adapt to discriminative feature shifts across FS-RSSC tasks. Meanwhile, the multi-level spatial feature aggregation module extracts multi-level features of true discriminative regions to improve the classification accuracy. Experimental results validate that our TA-MSA framework achieves competitive classification accuracy compared to several state-of-the-art methods, across three widely used FS-RSSC datasets in both 5-way 1-shot and 5-way 5-shot settings. Recently, the advancements in large pre-trained models have opened new possibilities in various research fields [65,66]. In future research, we plan to investigate innovative fine-tuning techniques using large-scale pre-trained models to further improve the performance of FS-RSSC tasks.

Author Contributions

Conceptualization, X.L.; methodology, X.L. and Y.S.; software, X.L. and G.Q.; validation, X.L.; formal analysis, X.L.; investigation, X.L. and G.Q.; resources, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L., Y.S. and X.P.; visualization, X.L.; project administration, J.Z.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Postdoctoral Fellowship Program of CPSF GZC20232676.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

We would like to express our gratitude to the editor and reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Tripathi, D. Land Resource Investigation Using Remote Sensing and Geographic Information System: A Case Study. Int. J. Innov. Sci. Eng. Technol. 2017, 4, 125–132. [Google Scholar]
  2. Li, Y.; Ma, J.; Zhang, Y. Image retrieval from remote sensing big data: A survey. Inf. Fusion 2021, 67, 94–115. [Google Scholar] [CrossRef]
  3. Li, J.; Pei, Y.; Zhao, S.; Xiao, R.; Sang, X.; Zhang, C. A review of remote sensing for environmental monitoring in China. Remote Sens. 2020, 12, 1130. [Google Scholar] [CrossRef]
  4. Nogueira, K.; Penatti, O.A.; Dos Santos, J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556. [Google Scholar] [CrossRef]
  5. Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention consistent network for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045. [Google Scholar] [CrossRef]
  6. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
  7. Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise unsupervised domain adaptation with adversarial self-training for road segmentation of remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5609413. [Google Scholar] [CrossRef]
  8. Jia, Y.; Gao, J.; Huang, W.; Yuan, Y.; Wang, Q. Exploring Hard Samples in Multi-View for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615714. [Google Scholar] [CrossRef]
  9. Wang, L.; Zhuo, L.; Li, J. Few-shot Remote Sensing Scene Classification with Spatial Affinity Attention and Class Surrogate-based Supervised Contrastive Learning. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 4705714. [Google Scholar] [CrossRef]
  10. Tang, X.; Lin, W.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Class-level prototype guided multiscale feature learning for remote sensing scene classification with limited labels. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622315. [Google Scholar] [CrossRef]
  11. Li, H.; Cui, Z.; Zhu, Z.; Chen, L.; Zhu, J.; Huang, H.; Tao, C. RS-MetaNet: Deep Metametric Learning for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6983–6994. [Google Scholar] [CrossRef]
  12. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 3630–3638. [Google Scholar]
  13. Li, X.; Luo, H.; Zhou, G.; Peng, X.; Wang, Z.; Zhang, J.; Liu, D.; Li, M.; Liu, Y. Learning general features to bridge the cross-domain gaps in few-shot learning. Knowl.-Based Syst. 2024, 299, 112024. [Google Scholar] [CrossRef]
  14. Ye, H.J.; Sheng, X.R.; Zhan, D.C. Few-shot learning with adaptively initialized task optimizer: A practical meta-learning approach. Mach. Learn. 2020, 109, 643–664. [Google Scholar] [CrossRef]
  15. Sun, Z.; Zheng, W.; Guo, P.; Wang, M. TST_MFL: Two-stage training based metric fusion learning for few-shot image classification. Inf. Fusion 2025, 113, 102611. [Google Scholar] [CrossRef]
  16. Ma, J.; Lin, W.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Multipretext-task prototypes guided dynamic contrastive learning network for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614216. [Google Scholar] [CrossRef]
  17. Zhang, B.; Feng, S.; Li, X.; Ye, Y.; Ye, R.; Luo, C.; Jiang, H. SGMNet: Scene graph matching network for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628915. [Google Scholar] [CrossRef]
  18. Chen, X.; Zhu, G.; Wei, J. MMML: Multi-manifold Metric Learning for Few-Shot Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618714. [Google Scholar] [CrossRef]
  19. Tian, F.; Lei, S.; Zhou, Y.; Cheng, J.; Liang, G.; Zou, Z.; Li, H.C.; Shi, Z. HiReNet: Hierarchical-Relation Network for Few-Shot Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603710. [Google Scholar] [CrossRef]
  20. Lu, X.; Gong, T.; Zheng, X. Domain Mapping Network for Remote Sensing Cross-Domain Few-Shot Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606411. [Google Scholar] [CrossRef]
  21. Zhu, J.; Li, Y.; Yang, K.; Guan, N.; Fan, Z.; Qiu, C.; Yi, X. MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610413. [Google Scholar] [CrossRef]
  22. Luo, X.; Wu, H.; Zhang, J.; Gao, L.; Xu, J.; Song, J. A closer look at few-shot classification again. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 23103–23123. [Google Scholar]
  23. Wang, H.; Deng, Z.H. Cross-domain few-shot classification via adversarial task augmentation. arXiv 2021, arXiv:2104.14385. [Google Scholar]
  24. Hu, Y.; Ma, A.J. Adversarial Feature Augmentation for Cross-domain Few-Shot Classification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 20–37. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Subramanian, V. Deep Learning with PyTorch: A Practical Approach to Building Neural Network Models Using PyTorch; Packt Publishing Ltd.: Birmingham, UK, 2018. [Google Scholar]
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  28. Zou, Y.; Yi, S.; Li, Y.; Li, R. A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2025, 37, 85523–85545. [Google Scholar]
  29. Luo, X.; Xu, J.; Xu, Z. Channel importance matters in few-shot image classification. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2022; pp. 14542–14559. [Google Scholar]
  30. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A closer look at few-shot classification. arXiv 2019, arXiv:1904.04232. [Google Scholar]
  31. Zhang, X.; Fan, X.; Wang, G.; Chen, P.; Tang, X.; Jiao, L. MFGNet: Multibranch Feature Generation Networks for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  32. Li, L.; Han, J.; Yao, X.; Cheng, G.; Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7844–7853. [Google Scholar] [CrossRef]
  33. Guo, Y.; Codella, N.C.; Karlinsky, L.; Codella, J.V.; Smith, J.R.; Saenko, K.; Rosing, T.; Feris, R. A broader study of cross-domain few-shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16; Springer International Publishing: Cham, Switzerland, 2020; pp. 124–141. [Google Scholar]
  34. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 2790–2799. [Google Scholar]
  35. Wang, H.; Yang, X.; Chang, J.; Jin, D.; Sun, J.; Zhang, S.; Luo, X.; Tian, Q. Parameter-efficient tuning of large-scale multimodal foundation model. Adv. Neural Inf. Process. Syst. 2024, 36, 15752–15774. [Google Scholar]
  36. Li, J.; Gong, M.; Liu, H.; Zhang, Y.; Zhang, M.; Wu, Y. Multiform ensemble self-supervised learning for few-shot remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  37. Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; Loy, C.C. Domain generalization: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4396–4415. [Google Scholar] [CrossRef]
  38. Oh, J.; Kim, S.; Ho, N.; Kim, J.H.; Song, H.; Yun, S.Y. Understanding cross-domain few-shot learning based on domain similarity and few-shot difficulty. Adv. Neural Inf. Process. Syst. 2022, 35, 2622–2636. [Google Scholar]
  39. Tseng, H.Y.; Lee, H.Y.; Huang, J.B.; Yang, M.H. Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation. arXiv 2020, arXiv:2001.08735. [Google Scholar]
  40. Hu, Z.; Sun, Y.; Yang, Y. Switch to generalize: Domain-switch learning for cross-domain few-shot classification. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  41. Li, S.; Song, S.; Huang, G.; Ding, Z.; Wu, C. Domain invariant and class discriminative feature learning for visual domain adaptation. IEEE Trans. Image Process. 2018, 27, 4260–4273. [Google Scholar] [CrossRef] [PubMed]
  42. Ji, F.; Chen, Y.; Liu, L.; Yuan, X.T. Cross-Domain Few-Shot Classification via Dense-Sparse-Dense Regularization. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1352–1363. [Google Scholar] [CrossRef]
  43. Lee, Y.; Chen, A.S.; Tajwar, F.; Kumar, A.; Yao, H.; Liang, P.; Finn, C. Surgical Fine-Tuning Improves Adaptation to Distribution Shifts. In Proceedings of the The Eleventh International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
44. Li, W.H.; Liu, X.; Bilen, H. Cross-domain few-shot learning with task-specific adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7161–7170.
45. Liu, X.; Ji, Z.; Pang, Y.; Han, Z. Self-taught cross-domain few-shot learning with weakly supervised object localization and task-decomposition. Knowl.-Based Syst. 2023, 265, 110358.
46. Zhao, Y.; Zhang, T.; Li, J.; Tian, Y. Dual adaptive representation alignment for cross-domain few-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11720–11732.
47. Chen, Y.; Li, Y.; Mao, H.; Chai, X.; Jiao, L. A novel deep nearest neighbor neural network for few-shot remote sensing image scene classification. Remote Sens. 2023, 15, 666.
48. Cheng, G.; Cai, L.; Lang, C.; Yao, X.; Chen, J.; Guo, L.; Han, J. SPNet: Siamese-prototype network for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11.
49. Zhu, X.; Zhang, R.; He, B.; Zhou, A.; Wang, D.; Zhao, B.; Gao, P. Not all features matter: Enhancing few-shot clip with adaptive prior refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2605–2615.
50. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
51. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; pp. 270–279.
52. Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412.
53. Qin, A.; Chen, F.; Li, Q.; Tang, L.; Yang, F.; Zhao, Y.; Gao, C. Deep Updated Subspace Networks for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
54. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4077–4087.
55. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
56. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2017; pp. 1126–1135.
57. Oreshkin, B.; Rodríguez López, P.; Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. Adv. Neural Inf. Process. Syst. 2018, 31, 721–731.
58. Zhang, P.; Fan, G.; Wu, C.; Wang, D.; Li, Y. Task-adaptive embedding learning with dynamic kernel fusion for few-shot remote sensing scene classification. Remote Sens. 2021, 13, 4200.
59. Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; Luo, J. Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7260–7268.
60. Dong, Z.; Lin, B.; Xie, F. Optimizing few-shot remote sensing scene classification based on an improved data augmentation approach. Remote Sens. 2024, 16, 525.
61. Chen, Y.; Li, Y.; Mao, H.; Liu, G.; Chai, X.; Jiao, L. A Novel Discriminative Enhancement Method for Few-Shot Remote Sensing Image Scene Classification. Remote Sens. 2023, 15, 4588.
62. Ji, Z.; Hou, L.; Wang, X.; Wang, G.; Pang, Y. Dual contrastive network for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12.
63. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
64. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
65. Zhang, R.; Hu, X.; Li, B.; Huang, S.; Deng, H.; Qiao, Y.; Gao, P.; Li, H. Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15211–15222.
66. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
Figure 1. Challenges to overcome when fine-tuning a general pre-trained model for FS-RSSC tasks. The red rectangles highlight the class-specific discriminative regions.
Figure 2. The proposed TA-MSA fine-tuning framework. We freeze the first three blocks of the backbone network and update the last block with the proposed task-adaptive fine-tuning strategy. The proposed multi-level spatial feature aggregation module leverages multi-level spatially important information to enhance classification accuracy. Specifically, the module consists of three spatial templates that assist in extracting spatial features at different levels. The extracted multi-level features are then fed into a linear classifier for final prediction.
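As a rough illustration of the MSA idea summarized in the caption above, the sketch below applies three trainable spatial templates to the backbone's last 7 × 7 feature map and concatenates the template-pooled descriptors for a linear classifier. Treating every template as a learnable, softmax-normalized weight map over a single feature map is our own simplification (the paper may instead draw the templates from different depths); the class count, grid size, and the MSAHead name are illustrative only.

```python
import torch
import torch.nn as nn

class MSAHead(nn.Module):
    """Minimal sketch of an MSA-style head: trainable spatial templates
    re-weight the last feature map before a linear classifier."""
    def __init__(self, channels=512, grid=7, num_templates=3, num_classes=5):
        super().__init__()
        # One learnable spatial weight map per template ("level").
        self.templates = nn.Parameter(torch.randn(num_templates, grid, grid))
        self.classifier = nn.Linear(num_templates * channels, num_classes)

    def forward(self, feat):  # feat: B x C x 7 x 7 map from the backbone
        b, c, h, w = feat.shape
        weights = torch.softmax(self.templates.view(-1, h * w), dim=-1).view(-1, h, w)
        pooled = torch.einsum("bchw,thw->btc", feat, weights)  # template-weighted pooling
        return self.classifier(pooled.reshape(b, -1))

head = MSAHead()
print(head(torch.randn(4, 512, 7, 7)).shape)  # torch.Size([4, 5])
```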
Figure 3. Comparison of our proposed task-adaptive fine-tuning strategy and the standard fine-tuning approach. The layer-specific optimizer trains different layers with distinct learning rates based on their relevance to the target task, effectively handling domain shifts. Additionally, by masking the outputs of less discriminative channels during fine-tuning, the task-specific training scheme enhances the learning of discriminative features, leading to better adaptation to the target tasks.
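To make the comparison in Figure 3 concrete, the sketch below shows one way to realize the two TA components in PyTorch: optimizer parameter groups with layer-wise learning rates, and a mask that zeroes out less discriminative channels during fine-tuning. The scaling factors and the random channel selection are placeholders of our own; in the actual strategy both would be derived from the support set of each task.

```python
import torch
import torch.nn as nn

lr0 = 5e-4  # basic learning rate lr_0, cf. Tables 8 and 9

# A stand-in for the trainable last block of the backbone.
trainable_block = nn.Sequential(
    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
)

# Layer-specific optimizer: each convolutional layer gets its own learning
# rate, scaled here by an illustrative relevance factor.
conv_layers = [m for m in trainable_block if isinstance(m, nn.Conv2d)]
layer_factors = [1.0, 0.5]  # placeholder factors; TA derives them per task
optimizer = torch.optim.Adam(
    [{"params": m.parameters(), "lr": lr0 * f} for m, f in zip(conv_layers, layer_factors)]
)

# Task-specific training scheme: suppress the outputs of less discriminative
# channels during fine-tuning (channels picked at random here for illustration).
def mask_channels(features, keep_ratio=0.8):
    c = features.size(1)
    mask = torch.zeros(c, device=features.device)
    mask[torch.randperm(c, device=features.device)[: int(keep_ratio * c)]] = 1.0
    return features * mask.view(1, -1, 1, 1)
```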
Figure 4. Example instances of datasets referred to in this paper: (a) ImageNet, (b) NWPU-RESISC45, (c) UCMerced LandUse, (d) WHU-RS19.
Figure 5. Performance of the TA-MSA framework for varying values of λ .
Figure 6. Grad-CAM visualization results for the baseline method and our TA-MSA framework across six remote sensing scenes.
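The heat maps in Figure 6 are produced with the standard Grad-CAM recipe of [63]; for completeness, a minimal sketch of that computation using forward/backward hooks is given below. The helper name and the choice of target layer are ours; only the weighting-by-averaged-gradients step follows the cited method.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Standard Grad-CAM [63]: weight activations by spatially averaged gradients."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(image)              # image: 1 x 3 x H x W
    model.zero_grad()
    logits[0, class_idx].backward()    # gradients of the chosen class score
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)           # 1 x C x 1 x 1
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # 1 x 1 x h x w
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).detach()  # heat map in [0, 1]
```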
Table 1. The architecture of the ResNet18 backbone and its output size with an input image of 3 × 224 × 224 pixels.
Block Index | Architecture | Output Size
block 1 | 7 × 7, 64, stride = 2; 3 × 3 maxpool, stride = 2; [3 × 3, 64, stride = 1] × 4 | 64 × 56 × 56
block 2 | [3 × 3, 128, stride = 1] × 4 | 128 × 28 × 28
block 3 | [3 × 3, 256, stride = 1] × 4 | 256 × 14 × 14
block 4 | [3 × 3, 512, stride = 1] × 4; 7 × 7 avgpool, stride = 7 | 512 × 1 × 1
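For readers reproducing the backbone in Table 1, the sketch below shows how the four blocks map onto a torchvision ResNet-18 and how blocks 1–3 are frozen as in Figure 2. The grouping into block1–block4 and the use of torchvision are our illustrative assumptions, not the authors' released code; the final print reproduces the 512 × 1 × 1 output listed in the last row.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical grouping of a torchvision ResNet-18 into the four blocks of Table 1.
net = resnet18(weights=None)  # load ImageNet weights here for the pre-trained setting
block1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)  # 64 x 56 x 56
block2 = net.layer2                                   # 128 x 28 x 28
block3 = net.layer3                                   # 256 x 14 x 14
block4 = nn.Sequential(net.layer4, net.avgpool)       # 512 x 1 x 1 after global average pooling

# Freeze blocks 1-3 as in Figure 2; only block 4 (plus the head) is fine-tuned.
for frozen in (block1, block2, block3):
    for p in frozen.parameters():
        p.requires_grad = False

x = torch.randn(1, 3, 224, 224)
print(block4(block3(block2(block1(x)))).shape)  # torch.Size([1, 512, 1, 1])
```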
Table 2. Selected categories of the three FS-RSSC datasets for evaluation.
NWPU-RESISC45: Airport; Circular farmland; Basketball court; Dense residential; Ground track field; Forest; Medium residential; Intersection; River; Parking lot.
WHU-RS19: Meadow; Commercial; Pond; Viaduct; River.
UCMerced-LandUse: Golf course; River; Mobile home park; Tennis court; Sparse residential; Beach.
Table 3. Comparison of the few-shot remote sensing scene classification accuracy across various methods on the NWPU-RESISC45 dataset. We bold the best results and underline the second-best results. The results marked with * are sourced from [48].
Method | Backbone | 5-Way 1-Shot | 5-Way 5-Shot
MatchingNet * [12] | ResNet-18 | 64.41 ± 0.86 | 76.33 ± 0.65
ProtoNet * [54] | ResNet-18 | 65.20 ± 0.84 | 80.52 ± 0.55
RelationNet * [55] | ResNet-18 | 60.04 ± 0.85 | 80.39 ± 0.56
MAML [56] | ResNet-12 | 56.01 ± 0.87 | 72.94 ± 0.63
TADAM [57] | ResNet-12 | 62.25 ± 0.79 | 82.36 ± 0.54
DLA-MatchNet [32] | ConvNet | 68.80 ± 0.70 | 81.63 ± 0.46
TAE-Net [58] | ResNet-12 | 69.13 ± 0.83 | 82.37 ± 0.52
DN4 [59] | ResNet-18 | 66.39 ± 0.86 | 83.24 ± 0.87
SPNet [48] | ResNet-18 | 67.84 ± 0.87 | 83.94 ± 0.50
HiReNet [19] | Conv + ViT | 70.43 ± 0.90 | 81.24 ± 0.58
MPCL [16] | ConvNet | 55.94 ± 0.04 | 76.24 ± 0.12
ODS [60] | ResNet-12 | 67.47 ± 1.17 | 80.59 ± 0.86
DN4AM [47] | ResNet-18 | 70.75 ± 0.81 | 86.79 ± 0.51
DEADN4 [61] | ResNet-18 | 73.56 ± 0.83 | 87.28 ± 0.50
TA-MSA (Ours) | ResNet-18 | 68.88 ± 0.63 | 86.95 ± 0.36
Table 4. Comparison of the few-shot remote sensing scene classification accuracy across various methods on the UC Merced-LandUse dataset. We bold the best results and underline the second-best results. The results marked with * are sourced from [48].
Method | Backbone | 5-Way 1-Shot | 5-Way 5-Shot
MatchingNet * [12] | ResNet-18 | 48.18 ± 0.75 | 67.39 ± 0.50
ProtoNet * [54] | ResNet-18 | 53.85 ± 0.78 | 71.23 ± 0.48
RelationNet * [55] | ResNet-18 | 50.07 ± 0.72 | 65.22 ± 0.52
MAML [56] | ResNet-12 | 43.65 ± 0.68 | 58.43 ± 0.64
DLA-MatchNet [32] | ConvNet | 53.76 ± 0.62 | 63.01 ± 0.51
TAE-Net [58] | ResNet-12 | 60.21 ± 0.72 | 77.44 ± 0.51
DN4 [59] | ResNet-18 | 57.25 ± 1.01 | 79.74 ± 0.78
SPNet [48] | ResNet-18 | 57.64 ± 0.73 | 73.52 ± 0.51
HiReNet [19] | Conv + ViT | 58.60 ± 0.80 | 76.84 ± 0.56
DUSN [53] | Conv5 | 62.20 ± 0.84 | 79.44 ± 0.47
MFGNet [31] | ResNet-12 | 61.76 ± 0.59 | 76.55 ± 0.40
DCN [62] | ResNet-12 | 58.64 ± 0.71 | 76.61 ± 0.49
MPCL [16] | ConvNet | 56.46 ± 0.21 | 76.57 ± 0.07
ODS [60] | ResNet-12 | 60.35 ± 1.02 | 72.67 ± 0.73
DN4AM [47] | ResNet-18 | 65.49 ± 0.72 | 85.73 ± 0.47
DEADN4 [61] | ResNet-18 | 67.27 ± 0.74 | 87.69 ± 0.44
TA-MSA (Ours) | ResNet-18 | 74.20 ± 0.49 | 91.75 ± 0.25
Table 5. Comparison of the few-shot remote sensing scene classification accuracy across various methods on the WHU-RS19 dataset. We bold the best results and underline the second-best results. The results marked with * are sourced from [48].
Method | Backbone | 5-Way 1-Shot | 5-Way 5-Shot
MatchingNet * [12] | ResNet-18 | 67.78 ± 0.67 | 85.01 ± 0.38
ProtoNet * [54] | ResNet-18 | 76.36 ± 0.67 | 85.00 ± 0.36
RelationNet * [55] | ResNet-18 | 65.01 ± 0.72 | 79.75 ± 0.32
MAML [56] | ResNet-12 | 59.19 ± 0.92 | 72.34 ± 0.75
DLA-MatchNet [32] | ConvNet | 68.27 ± 1.83 | 79.89 ± 0.33
TAE-Net [58] | ResNet-12 | 73.67 ± 0.74 | 88.95 ± 0.53
DN4 [59] | ResNet-18 | 82.14 ± 0.80 | 96.02 ± 0.33
SPNet [48] | ResNet-18 | 81.07 ± 0.60 | 88.04 ± 0.28
DCN [62] | ResNet-12 | 81.74 ± 0.55 | 91.67 ± 0.25
DN4AM [47] | ResNet-18 | 85.05 ± 0.52 | 96.94 ± 0.21
DEADN4 [61] | ResNet-18 | 86.89 ± 0.57 | 97.63 ± 0.19
TA-MSA (Ours) | ResNet-18 | 87.24 ± 0.32 | 96.97 ± 0.14
Table 6. Comparison of the ablated models’ few-shot remote sensing scene classification accuracy in the 5-way 1-shot setting on the three FS-RSSC datasets. “BS + x” represents the combination of the baseline and our proposed method. We bold the average accuracy gains of each ablated model.
Model | NWPU-RESISC45 | UC Merced-LandUse | WHU-RS19 | Avg. Gain
baseline | 67.33 ± 0.64 | 71.80 ± 0.47 | 85.20 ± 0.34 | –
BS + TA1 | 67.42 ± 0.64 | 72.37 ± 0.50 | 84.95 ± 0.33 | +0.14
BS + TA2 | 67.53 ± 0.66 | 72.96 ± 0.51 | 85.83 ± 0.34 | +0.66
BS + TA | 68.25 ± 0.65 | 73.25 ± 0.49 | 85.53 ± 0.32 | +0.90
BS + MSA | 68.58 ± 0.63 | 72.58 ± 0.51 | 86.41 ± 0.32 | +1.08
TA-MSA | 68.88 ± 0.63 | 73.62 ± 0.51 | 87.24 ± 0.32 | +1.80
Table 7. Comparison of the ablated models’ few-shot remote sensing scene classification accuracy in the 5-way 5-shot setting on the three FS-RSSC datasets. “BS + x” represents the combination of the baseline and our proposed method. We bold the average accuracy gains of each ablated model.
Model | NWPU-RESISC45 | UC Merced-LandUse | WHU-RS19 | Avg. Gain
baseline | 85.20 ± 0.38 | 90.16 ± 0.27 | 95.71 ± 0.15 | –
BS + TA1 | 85.11 ± 0.39 | 90.73 ± 0.26 | 95.54 ± 0.17 | +0.10
BS + TA2 | 86.11 ± 0.37 | 90.73 ± 0.27 | 96.16 ± 0.15 | +0.64
BS + TA | 86.13 ± 0.38 | 91.03 ± 0.26 | 95.93 ± 0.16 | +0.67
BS + MSA | 85.97 ± 0.39 | 90.57 ± 0.27 | 96.46 ± 0.14 | +0.64
TA-MSA | 86.95 ± 0.36 | 91.75 ± 0.25 | 96.97 ± 0.13 | +1.53
Table 8. The classification accuracy of the TA-MSA framework with different basic learning rates lr_0 in the 5-way 1-shot setting.
lr_0 | NWPU-RESISC45 | UC Merced-LandUse | WHU-RS19
0.0001 | 67.47 ± 0.66 | 73.86 ± 0.48 | 83.95 ± 0.18
0.0005 | 68.88 ± 0.66 | 74.20 ± 0.50 | 87.24 ± 0.14
0.001 | 68.72 ± 0.66 | 74.10 ± 0.49 | 86.84 ± 0.14
0.0015 | 68.58 ± 0.67 | 73.16 ± 0.50 | 87.02 ± 0.14
0.002 | 67.46 ± 0.64 | 72.15 ± 0.53 | 86.79 ± 0.14
Table 9. The classification accuracy of the TA-MSA framework with different basic learning rates lr_0 in the 5-way 5-shot setting.
lr_0 | NWPU-RESISC45 | UC Merced-LandUse | WHU-RS19
0.0001 | 85.75 ± 0.39 | 91.45 ± 0.25 | 94.81 ± 0.31
0.0005 | 86.95 ± 0.37 | 91.75 ± 0.25 | 96.97 ± 0.32
0.001 | 87.03 ± 0.35 | 91.63 ± 0.25 | 96.57 ± 0.34
0.0015 | 86.33 ± 0.37 | 91.17 ± 0.27 | 96.61 ± 0.32
0.002 | 86.15 ± 0.37 | 90.61 ± 0.27 | 96.76 ± 0.33
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
