1. Introduction
Unlike pixel-level and object-level interpretation techniques, remote sensing scene classification (RSSC) focuses on scene-level understanding by capturing the spatial distribution patterns of objects and their associated semantic attributes within remote sensing images (RSIs) [1]. Over the past decade, deep learning-based RSSC methods have achieved remarkable success in static, closed-set environments, extensively supporting applications such as land-use and land-cover mapping [2], environmental monitoring [3], urban planning [4,5], and related geospatial applications [6,7,8,9,10]. In real-world RSSC systems, new scene classes may emerge due to environmental changes, urban expansion, sensor upgrades, or the incorporation of new geographic regions into monitoring programs. In addition, rapid advances in satellite sensors and earth observation technologies have led to a substantial increase in both the quantity and diversity of RSIs [11], thereby requiring RSSC systems to continuously adapt to newly emerging scene classes.
Most existing deep learning approaches for RSSC follow the offline learning paradigm, in which models are trained using all available data with fixed class sets. However, this paradigm becomes inadequate in dynamic open-world environments where training data are received as streams and new classes continually emerge. Retraining models from scratch using both historical and incoming data largely wastes the substantial computational resources previously invested in the old model. More critically, in many operational settings, historical data cannot be permanently stored or repeatedly accessed due to strict memory budgets or data privacy regulations.
These intrinsic limitations necessitate the adoption of the class-incremental learning (CIL) paradigm, which enables models to acquire new knowledge sequentially without requiring full retraining. Inspired by human cognition, CIL methods aim to retain prior knowledge while simultaneously integrating new information [12]. However, CIL methods are susceptible to catastrophic forgetting [13], where the incorporation of new classes leads to severe degradation in performance on previously learned classes. Therefore, effective CIL methods must balance model stability and plasticity [14,15], achieving a trade-off between preserving old knowledge and adapting to new classes.
Recently, incremental learning has attracted substantial interest in remote sensing applications. Notably, several studies have extended CIL to the task of RSSC [16,17,18,19,20,21,22,23,24,25,26]. To mitigate catastrophic forgetting, most existing methods achieve promising performance by employing a replay-based strategy, in which a subset of representative exemplars from past classes is stored in a limited memory buffer and jointly used with new data for mixed training. While effective in high-resource server environments, this strategy encounters significant bottlenecks in real-world deployment. Specifically, the requirement to store historical data frequently conflicts with strict memory constraints and data privacy protocols.
To address the inherent limitations of image-based replay in practical CIL scenarios, we propose a memory-efficient feature-replay framework that preserves compact feature embeddings instead of raw historical images. The overall workflow of our proposed framework is schematically illustrated in Figure 1. This feature-space rehearsal strategy provides a dual advantage. By retaining semantically rich feature embeddings, it effectively preserves the decision boundaries of previously learned classes while alleviating prediction bias toward newly introduced classes. Meanwhile, storing compact embeddings instead of raw data substantially enhances memory efficiency without sacrificing classification performance, thereby enabling practical continual adaptation in real-world remote sensing deployments.
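To make the memory argument concrete, a back-of-the-envelope comparison is helpful. The numbers below are illustrative assumptions, not figures from this work: a 256 × 256 RGB scene image stored as uint8 versus a 512-dimensional float32 embedding (a common output size for ResNet-18 backbones).

```python
# Illustrative memory comparison for image replay vs. feature replay.
# The image size (256x256x3, uint8) and embedding size (512-d, float32)
# are assumptions for the sake of the example.
image_bytes = 256 * 256 * 3 * 1      # one uint8 RGB scene image
feature_bytes = 512 * 4              # one float32 feature embedding
ratio = image_bytes / feature_bytes

print(image_bytes)    # 196608 bytes per stored image
print(feature_bytes)  # 2048 bytes per stored embedding
print(ratio)          # 96.0x fewer bytes per exemplar
```

Under these assumptions, each feature exemplar occupies roughly two orders of magnitude less memory than a raw image, which is what makes a much larger (or much cheaper) exemplar buffer feasible.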
Although retaining old feature embeddings is technically feasible, effectively incorporating them into mixed training with new data in the CIL setting remains challenging when applied to complex remote sensing scenarios. Specifically, the feature-replay strategy faces three key challenges: feature space drift, representation ambiguity, and classifier bias. Firstly, despite its memory efficiency, our strategy of preserving feature descriptors confronts a critical challenge inherent to incremental learning: feature space drift. As the model is sequentially trained on new classes, the feature extractor is continually updated. This process inevitably alters the embedding space, rendering the stored descriptors from previous tasks obsolete. Consequently, a significant distributional gap emerges between the legacy feature space and the current one, making the two sets of representations incompatible. Secondly, the complex backgrounds, intra-class variability, and inter-class similarities inherent in RSIs necessitate the learning of well-structured and coherent feature representations. Moreover, our feature replay strategy relies on replaying robust and discriminative features, which are essential for maintaining decision boundaries. This becomes particularly challenging in CIL: as new classes emerge over time and the feature extractor is continuously updated across incremental stages, the representations of new and old classes overlap in the feature space, leading to ambiguity and making them difficult to distinguish [27]. Thirdly, new classes are typically trained with abundant data in CIL, whereas previous classes are represented by limited exemplars, leading to an imbalanced training distribution that skews predictions toward new classes. In addition, the knowledge distillation commonly adopted in replay-based methods relies on soft labels generated by earlier models, which become increasingly noisy due to error accumulation and feature distribution shifts. As a result, CIL methods are inherently prone to class imbalance and noisy distillation labels, both of which induce significant classifier bias.
To tackle the challenges inherent to the feature-replay paradigm, we propose several specific technical components within our framework. First, we introduce a specialized feature calibration network (FCN) to compensate for feature space drift. The FCN efficiently adapts stored feature descriptors to the updated feature space, thereby bridging the distributional gap and facilitating balanced training of the unified classifier. A key challenge in feature calibration is error accumulation: naive feature mapping strategies accumulate errors over successive incremental steps, leading to performance degradation as the number of classes grows. To enhance the robustness of this mapping, we adopt a transductive learning strategy to train the FCN. This strategy is notable in that it exclusively leverages paired feature vectors from the current task, thereby obviating the need for original past-task images. Specifically, we model the feature space alignment as an orthogonal transformation, implemented via the Cayley transform, to enforce a manifold-preserving regularization constraint. This principled transformation preserves the intrinsic geometric structure and relative relationships among the feature vectors. To counteract representation ambiguity, we propose a progressive multi-scale feature enhancement (PMFE) module, which employs a progressive construction scheme to enable fine-grained and interactive feature enhancement, thereby yielding richer feature representations. To mitigate classifier bias, we employ a bias rectification (BR) strategy. Once the feature calibration process is complete, we fix the feature extractor and further optimize the classifier using only the calibrated old-class features and the new-class features, effectively mitigating classification bias in the CIL task.
The primary contributions of our work are summarized as follows.
We propose a novel CIL framework for RSSC that retains compact feature embeddings, rather than raw images, as exemplars for previously learned classes. This memory-efficient feature replay method not only addresses data privacy concerns but also mitigates representation drift and classifier bias induced by data imbalance. Consequently, the framework maintains robust decision boundaries and significantly alleviates catastrophic forgetting.
A specialized FCN is trained in a transductive learning paradigm with manifold consistency regularization to adapt outdated feature descriptors to the current feature space. The FCN effectively compensates for feature space drift, thereby facilitating balanced and compatible unified classifier training. Following the feature calibration process, we implement a BR strategy that mitigates the final prediction bias by exclusively optimizing the classifier on a balanced exemplar set.
To mitigate representation ambiguity, we propose a PMFE module. By adopting a progressive construction scheme, the PMFE module achieves fine-grained and interactive feature enhancement, yielding richer feature representations and a more comprehensive understanding of remote sensing scenes.
3. Methodology
3.1. Problem Setting
Unlike the traditional learning paradigm, which utilizes all available data simultaneously, CIL involves training on a sequence of tasks with disjoint class sets. The ultimate goal is to learn a unified classifier capable of recognizing all classes encountered up to the current stage. Specifically, in task $t$, the training set comprises data $\mathcal{D}_t$ for new classes and an exemplar memory $\mathcal{M}_t$ for old classes. We define $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{n_t}$, where $n_t$ represents the number of training samples for new classes in task $t$, while $x_i$ and $y_i$ denote the input data and corresponding target labels, respectively. Notably, the class sets across different tasks are disjoint, i.e., $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ for $i \neq j$. For task $t$, our classification model $\Phi_t$ comprises a feature extractor $f_t(\cdot): \mathcal{X} \rightarrow \mathbb{R}^d$ and a unified classifier $g_t(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^{C_t}$. Here, $d$ denotes the feature dimension, and $C_t$ indicates the cumulative number of categories learned up to stage $t$. The model is formulated as the composition $\Phi_t = g_t \circ f_t$. Consequently, the final prediction for a test sample $x$ is obtained by:

$$\hat{y} = \arg\max_{y \in \{1, \ldots, C_t\}} \Phi_t(x)_y,$$

where $\Phi_t(x)_y$ represents the logit corresponding to class $y$.
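The disjoint-class-set protocol above can be sketched in a few lines. The split sizes below (a 10-class base task followed by 5-class increments, roughly an AID-style 30-class setup) are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def make_class_splits(num_classes, base, increment, seed=0):
    """Partition class IDs into disjoint incremental tasks: one base task
    of `base` classes, then tasks of `increment` classes each."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_classes)
    splits = [order[:base].tolist()]
    for start in range(base, num_classes, increment):
        splits.append(order[start:start + increment].tolist())
    return splits

# illustrative: 30 classes split as 10 base classes + 4 tasks of 5 classes
splits = make_class_splits(30, base=10, increment=5)
assert sum(len(s) for s in splits) == 30
# the class sets are pairwise disjoint, as the CIL setting requires
assert len(set().union(*map(set, splits))) == 30
```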
3.2. Method Overview
To address the challenge of catastrophic forgetting and effectively balance the stability-plasticity trade-off, we propose a novel CIL framework for RSSC, termed FR-CIL. It mainly consists of four key components: (1) the PMFE module, designed to construct fine-grained, interactive feature representations through parallel depthwise dilated convolutions with progressive connections; (2) the dual-space knowledge retention (DSKR) mechanism, which incorporates two synergistic distillation losses to preserve stability across both decision boundaries and the feature representation space; (3) the FCN, which is trained in a transductive learning paradigm under manifold consistency regularization and efficiently adapts previously stored feature descriptors to the updated feature space, reconciling the distributional gap for a unified classifier; (4) the BR strategy, which exclusively optimizes the classifier on a balanced exemplar set to mitigate prediction bias.
An overview of the proposed FR-CIL framework is illustrated in Figure 2; it is organized as a unified pipeline that proceeds through three sequential stages. (1) Backbone training stage. The backbone network is trained using a combination of new class data and retained feature descriptors. The optimization is governed by a modified cross-entropy loss augmented with cosine normalization and the DSKR mechanism. (2) Feature calibration stage. The FCN is employed to adapt historical feature descriptors from the previous latent space to the updated feature space, thereby ensuring distributional alignment. (3) Bias rectification stage. The feature extractor is fixed, and the classifier is further optimized exclusively using a balanced set of calibrated old features and new class features. This step effectively mitigates the classifier bias inherent in CIL tasks.
In summary, our proposed framework is designed to effectively mitigate both representation ambiguity and classifier bias in CIL for RSSC. In the subsequent sections, we provide a detailed elaboration of the key components and procedural steps.
3.3. Feature Learning
Given the complex semantic content inherent in remote sensing scenes, capturing features across a diverse range of spatial scales is essential for robust classification. However, conventional multi-scale feature extraction paradigms typically rely on parallel convolutional pathways with fixed receptive fields, often resulting in insufficient contextual correlation and limited feature interaction.
To address these limitations, we propose a progressive multi-scale feature enhancement (PMFE) module. This module adopts a progressive construction scheme to facilitate feature enhancement in a fine-grained and interactive manner, effectively remedying the deficiencies of static multi-branch architectures.
The structural design of the PMFE module is illustrated in Figure 3. PMFE employs a multi-branch architecture designed to systematically expand the receptive field. This is realized through a set of parallel $3 \times 3$ depthwise dilated convolutions configured with exponentially increasing dilation rates (e.g., $r = 1, 2, 4$). By strategically introducing dilation within the kernel structure, depthwise dilated convolution effectively expands the receptive field without increasing the number of parameters. Furthermore, when synergized with pointwise convolution, this approach significantly curbs computational complexity. This structure enables the module to effectively capture a comprehensive range of multi-scale features, spanning from fine-grained local textures to broader global structural contexts. Formally, this operation is defined by the following equation:
$$F_i = \delta\big(\mathrm{PW}\big(\mathrm{DW}_{r_i}([F, F_{i-1}])\big)\big), \quad F_0 = F,$$

where $F$ denotes the input feature map, and $[\cdot, \cdot]$ represents channel-wise concatenation. The operator $\mathrm{DW}_{r_i}$ signifies a $3 \times 3$ depthwise dilated convolution with a dilation rate of $r_i$, while $\mathrm{PW}$ denotes a pointwise convolution employed to adjust the channel dimensionality to $C$. $\delta$ denotes the ReLU activation function. Finally, $F_i$ represents the enhanced feature output of the $i$-th branch.
The choice of exponentially increasing dilation rates is designed to rapidly expand the receptive field and capture the broad global context required for complex remote sensing scenes, without increasing computational overhead [52,53]. Furthermore, this progressive cascading structure naturally mitigates the gridding effect commonly associated with large dilation rates [53]. As each branch fuses the original input $F$ with the densely extracted features $F_{i-1}$ from the preceding branch with a smaller dilation rate, the spatial holes introduced by sparse sampling are continuously filled. This fine-grained and interactive enhancement preserves local continuity while capturing multi-scale structural dependencies.
Subsequently, the multi-scale feature maps $F_1, F_2, \ldots, F_n$ are concatenated along the channel dimension. The resulting tensor is then passed through a pointwise convolution to fuse information across scales and adjust the channel dimensionality. This process yields the aggregated feature map $F_{agg}$, formulated as follows:

$$F_{agg} = \mathrm{PW}\big([F_1, F_2, \ldots, F_n]\big).$$

Finally, a residual skip connection is introduced to fuse the original input with the enhanced features. This design facilitates identity mapping, ensuring the preservation of original information while effectively mitigating the vanishing gradient problem to stabilize model training. Consequently, the final output $F_{out}$ can be calculated as follows:

$$F_{out} = F + F_{agg}.$$
Conventional feature pyramid architectures, such as FPN [54] and PANet [55], are often characterized by intricate designs involving elaborate pathways. In contrast, the proposed PMFE module offers a streamlined architecture, ensuring high modularity and ease of integration. Notably, the construction process follows a coarse-to-fine cognitive paradigm, where global structural understanding is progressively enriched with local details. Consequently, the PMFE module is particularly adept at addressing the inherent challenges in RSIs, such as complex background interference, extreme scale variations, and high inter-class similarity.
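The progressive multi-branch construction described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the $3 \times 3$ kernels, the dilation rates (1, 2, 4), and the choice to seed the first branch with the input itself are assumptions consistent with the textual description.

```python
import torch
import torch.nn as nn

class PMFE(nn.Module):
    """Sketch of a progressive multi-scale feature enhancement block.
    Each branch concatenates the original input with the previous branch's
    output, applies a depthwise dilated conv followed by a pointwise conv,
    and all branch outputs are fused with a residual skip connection."""
    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            self.branches.append(nn.Sequential(
                # depthwise dilated conv over the concatenated 2C-channel input
                nn.Conv2d(2 * channels, 2 * channels, 3, padding=r,
                          dilation=r, groups=2 * channels, bias=False),
                # pointwise conv restores the channel dimensionality to C
                nn.Conv2d(2 * channels, channels, 1, bias=False),
                nn.ReLU(inplace=True),
            ))
        # pointwise fusion of all branch outputs back to C channels
        self.fuse = nn.Conv2d(len(rates) * channels, channels, 1, bias=False)

    def forward(self, x):
        prev, outs = x, []
        for branch in self.branches:
            # progressive connection: fuse input with the preceding branch
            prev = branch(torch.cat([x, prev], dim=1))
            outs.append(prev)
        # residual skip connection preserves the original information
        return x + self.fuse(torch.cat(outs, dim=1))

feats = torch.randn(2, 64, 32, 32)
out = PMFE(64)(feats)
assert out.shape == feats.shape  # spatial and channel dims are preserved
```

Because `padding` equals the dilation rate for a $3 \times 3$ kernel, every branch preserves the spatial resolution, so the module can be dropped into an existing backbone without changing feature-map shapes.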
3.4. Incremental Learning
CIL methods inherently suffer from catastrophic forgetting, characterized by a sharp degradation in performance on previously learned tasks when sequentially adapted to new data. To alleviate this, replay-based strategies mitigate forgetting by preserving a representative subset of old class samples within an exemplar memory. By interleaving these retained exemplars with incoming data during subsequent training phases, the model jointly optimizes over both past and current information, thereby maintaining old knowledge while learning new classes.
Despite their effectiveness, replay-based strategies are hindered by inherent limitations. Firstly, storing raw samples imposes a substantial storage overhead that escalates linearly with the number of tasks. Secondly, the retention of historical RSIs may involve sensitive or proprietary information, raising critical privacy and security concerns. Finally, the data imbalance between the limited exemplar set and abundant new data leads to a severe prediction bias toward new classes.
To address these issues, we propose a novel replay-based framework for CIL that preserves low-dimensional feature descriptors, rather than raw images, as exemplars for previously learned classes. We employ ResNet-18 as the feature extractor, followed by a cosine classifier for final classification. The model is optimized using a composite cross-entropy loss, which is applied jointly to both the new class training data $\mathcal{D}_t$ and the retained feature exemplars $\mathcal{M}_t$. The loss function is formulated as:

$$\mathcal{L}_{ce} = -\frac{1}{|\mathcal{D}_t|} \sum_{(x, y) \in \mathcal{D}_t} \log p_y(x) \;-\; \frac{1}{|\mathcal{M}_t|} \sum_{(v, y) \in \mathcal{M}_t} \log p_y(v),$$

where $p_y(\cdot)$ denotes the probability of class $y$ computed over all classes encountered up to stage $t$. The first term represents the standard classification loss for the new task data $\mathcal{D}_t$, and the second term imposes a supervision constraint on the retained feature exemplars $\mathcal{M}_t$, which bypass the feature extractor $f_t$ and are evaluated directly by the unified classifier $g_t$.
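The key mechanical point of this loss, that image batches pass through the extractor while stored feature exemplars feed the classifier directly, can be illustrated with a toy numpy sketch. The linear extractor and classifier below are stand-ins for ResNet-18 and the cosine classifier; all shapes and values are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

rng = np.random.default_rng(0)
d, num_classes = 8, 5
W_ext = rng.normal(size=(16, d))           # stand-in for the backbone
W_cls = rng.normal(size=(num_classes, d))  # unified classifier weights

x_new = rng.normal(size=(4, 16))           # new-class inputs (raw data)
y_new = np.array([3, 4, 3, 4])
feat_old = rng.normal(size=(4, d))         # stored old-class feature exemplars
y_old = np.array([0, 1, 2, 0])

# new data passes through the extractor; stored exemplars bypass it and
# are scored directly by the unified classifier
loss = cross_entropy(np.tanh(x_new @ W_ext) @ W_cls.T, y_new) \
     + cross_entropy(feat_old @ W_cls.T, y_old)
assert np.isfinite(loss) and loss > 0
```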
To further mitigate catastrophic forgetting and enhance feature discriminability, we augment this loss function with a cosine normalization strategy alongside a dual-space knowledge retention (DSKR) mechanism. The details of these components are elaborated below.
3.4.1. Cosine Normalization
During incremental learning, replay-based methods exhibit a pronounced prediction bias toward new classes. This phenomenon stems from the severe data disparity between the abundant new training data and the limited exemplar set, manifesting empirically as larger logit values for new classes compared to old ones.
This bias is structurally exacerbated by the standard fully connected layer. Specifically, the prediction logit for class $i$, denoted as $o_i$, is computed via the dot product between the feature vector $f_t(x)$ and the classifier weight $w_i$, i.e., $o_i = w_i^{\top} f_t(x)$. As the dot product is sensitive to vector magnitudes, the larger weight norms of new classes directly inflate their output scores, regardless of the actual semantic similarity.
To mitigate this magnitude-driven bias, we reformulate the logit computation via cosine normalization. Specifically, the logit $o_i$ is computed via a scaled cosine similarity, effectively decoupling the prediction score from the vector magnitude:

$$o_i = \eta \, \langle \bar{w}_i, \bar{f}_t(x) \rangle,$$

where $\bar{v} = v / \|v\|_2$ denotes the $\ell_2$-normalized version of a vector $v$, $\eta$ is a scaling factor, and the inner product $\langle \cdot, \cdot \rangle$ quantifies the cosine similarity between a pair of normalized vectors. The final probabilities are then obtained by applying the softmax function to these rectified logits.
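The effect of this normalization is easy to demonstrate: once both features and weights are normalized, no class can dominate the logits through weight magnitude alone. The scale value below is an illustrative assumption.

```python
import numpy as np

def cosine_logits(features, weights, scale=16.0):
    """Scaled cosine-similarity logits: L2-normalize both the features and
    the classifier weights so weight magnitude cannot bias predictions."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return scale * f @ w.T

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
w_old = rng.normal(size=(4, 8))          # old-class weights
w_new = 10.0 * rng.normal(size=(2, 8))   # new-class weights, inflated norms
logits = cosine_logits(feats, np.vstack([w_old, w_new]))
# every logit is bounded by the scale, regardless of weight magnitude
assert np.all(np.abs(logits) <= 16.0 + 1e-9)
```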
3.4.2. Dual-Space Knowledge Retention Mechanism
To effectively mitigate catastrophic forgetting, we employ a dual-space knowledge retention (DSKR) mechanism that integrates two complementary distillation strategies.
First, we introduce a prediction-space knowledge distillation loss, denoted as $\mathcal{L}_{pkd}$. This objective serves as a regularization constraint, forcing the output probability distribution of the current model to align with that of the frozen old model. This process ensures the preservation of previously established decision boundaries. Formally, during the training of task $t$, $\mathcal{L}_{pkd}$ is computed as:

$$\mathcal{L}_{pkd} = \mathrm{KL}\big(\tilde{p}^{\,t-1}(x) \,\big\|\, \tilde{p}^{\,t}(x)\big),$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback–Leibler divergence. The term $\tilde{p}^{\,t}(x)$ represents the temperature-scaled probability distribution computed over previously learned classes.
While prediction-space distillation preserves the final decision boundaries, feature representations may still gradually drift. To address this, we introduce a feature-space distillation loss $\mathcal{L}_{fkd}$ that imposes a direct constraint on the feature representations, complementing the output-level supervision. This loss encourages the current model to maintain discriminative feature representations consistent with the old model. Specifically, it minimizes the cosine distance between the normalized feature embeddings extracted by the model at the previous and current steps. The loss is formulated as:

$$\mathcal{L}_{fkd} = 1 - \big\langle \bar{f}_{t-1}(x), \bar{f}_t(x) \big\rangle.$$
As a result, the DSKR mechanism guarantees the stability of the decision boundaries and feature representation, providing a holistic defense against catastrophic forgetting.
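The two distillation terms can be sketched in numpy as follows. This is a simplified illustration: the temperature value is an assumption, and for brevity the KL term is applied to a full logit vector rather than being restricted to the old-class slice.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prediction_space_kd(old_logits, new_logits, T=2.0):
    """KL divergence between temperature-scaled distributions of the
    frozen old model and the current model."""
    p, q = softmax(old_logits, T), softmax(new_logits, T)
    return np.sum(p * (np.log(p) - np.log(q)), axis=1).mean()

def feature_space_kd(old_feats, new_feats):
    """One minus cosine similarity between normalized embeddings of the
    previous and current feature extractors."""
    a = old_feats / np.linalg.norm(old_feats, axis=1, keepdims=True)
    b = new_feats / np.linalg.norm(new_feats, axis=1, keepdims=True)
    return (1.0 - np.sum(a * b, axis=1)).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
feats = rng.normal(size=(4, 8))
# identical old/new models incur zero distillation penalty in both spaces
assert np.isclose(prediction_space_kd(logits, logits), 0.0)
assert np.isclose(feature_space_kd(feats, feats), 0.0)
```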
3.4.3. The Total Loss
Combining these components, the total objective function $\mathcal{L}_{total}$ for optimizing the current model $\Phi_t$ is formulated as:

$$\mathcal{L}_{total} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{pkd} + \lambda_2 \mathcal{L}_{fkd},$$

where the hyperparameters $\lambda_1$ and $\lambda_2$ serve as weighting coefficients for the respective distillation losses.
3.5. Feature Calibration
A central challenge in feature replay is that feature descriptors stored from previous tasks become progressively misaligned with the evolving feature space as the model incrementally learns new classes. This distributional shift creates a discrepancy between legacy and current feature representations, rendering them incompatible. Consequently, the core objective is to adapt these historical feature descriptors into the current latent space, effectively bridging the distributional gap to facilitate the construction of a unified and robust classifier.
To resolve the feature space incompatibility, we propose a transductive feature calibration network that obviates the need for storing past raw images. This mechanism leverages the current task data $\mathcal{D}_t$ to generate a set of paired feature correspondences, which subsequently serve as proxy supervision for the feature calibration process. Specifically, for each sample $x_i$ in the current dataset $\mathcal{D}_t$, we generate a pair of feature vectors: one obtained by passing the sample through the frozen prior extractor $f_{t-1}$, and the other through the current extractor $f_t$. As a result, two corresponding feature sets are generated as follows:

$$\mathcal{F}_{t-1} = \{f_{t-1}(x_i)\}_{i=1}^{n_t}, \qquad \mathcal{F}_t = \{f_t(x_i)\}_{i=1}^{n_t}.$$
The discrepancy between these two feature sets encapsulates the evolution of the feature space. The proposed feature calibration network (FCN) learns to bridge this gap by optimizing a transformation that maps representations from the prior space to the current space.
The entire calibration process adheres to the manifold consistency principle, which enforces that the intrinsic topology of the underlying data manifold remains invariant. This regularization ensures that the relative distances and structural relationships among the stored old descriptors are preserved, which guarantees a robust and structurally coherent calibration process.
To enforce this geometric constraint, we explicitly model the feature calibration process as an orthogonal operator, which is achieved by parameterizing the transformation weights via the Cayley transform. Specifically, the orthogonal matrix $W$ is generated from a learnable skew-symmetric matrix $A$ using the formulation:

$$W = (I - A)(I + A)^{-1},$$

where $I$ denotes the identity matrix, and $A$ satisfies the skew-symmetric property $A^{\top} = -A$.
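The Cayley parameterization takes only a few lines of numpy. Building the skew-symmetric matrix from an unconstrained matrix via $A = \tfrac{1}{2}(M - M^{\top})$ is a common implementation choice and an assumption here; the resulting $W$ is exactly orthogonal, so it preserves norms and pairwise distances, which is the manifold-preserving property the text relies on.

```python
import numpy as np

def cayley_orthogonal(M):
    """Build an orthogonal matrix from an unconstrained square matrix:
    skew-symmetrize M, then apply the Cayley transform
    W = (I - A)(I + A)^{-1}.  (I + A) is always invertible because the
    eigenvalues of a skew-symmetric A are purely imaginary."""
    A = 0.5 * (M - M.T)          # enforce A^T = -A
    I = np.eye(M.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
W = cayley_orthogonal(rng.normal(size=(16, 16)))

# orthogonality: W W^T = I, so distances between features are preserved
assert np.allclose(W @ W.T, np.eye(16), atol=1e-8)
v1, v2 = rng.normal(size=16), rng.normal(size=16)
assert np.isclose(np.linalg.norm(W @ v1 - W @ v2), np.linalg.norm(v1 - v2))
```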
During the calibration phase, with both the feature extractor and classifier frozen, we exclusively optimize the parameters of the FCN. The objective is to learn the optimal orthogonal transformation that aligns the calibrated old features with the current feature representations. To achieve this, a dual-component loss function is employed to enforce both feature alignment and semantic consistency. The formulation is given by:

$$\mathcal{L}_{FCN} = \gamma \, \mathcal{L}_{align} + \mathcal{L}_{sem},$$

where the hyperparameter $\gamma$ modulates the weight of the feature alignment loss. The first component, $\mathcal{L}_{align}$, is a feature alignment loss that minimizes the spatial distance between the original and calibrated feature descriptors by leveraging cosine similarity. The second component, $\mathcal{L}_{sem}$, is a semantic consistency loss that ensures the calibrated feature descriptors are still correctly classified by the current model. Consequently, this process ensures that the retained feature descriptors are adapted to the new latent space, without distorting its intrinsic topological structure or compromising its semantic consistency.
3.6. Bias Rectification
The commonly adopted knowledge distillation in replay-based methods relies on soft labels generated by earlier models, which become increasingly noisy due to error accumulation and feature distribution shifts. Consequently, CIL methods are inherently susceptible to class imbalance and noisy distillation labels, both of which induce severe classifier bias.
To address these issues, we propose a bias rectification (BR) strategy. Upon completion of the feature calibration process, the memory buffer is updated to adhere to specific memory constraints. We employ the herding strategy [28] to select the most representative samples. This selection algorithm iteratively chooses a subset of feature descriptors such that their cumulative average best approximates the true class mean in the feature space.
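The greedy selection just described, in the spirit of iCaRL's herding procedure, can be sketched in numpy. Function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def herding_select(features, m):
    """Greedy herding: at each step, pick the feature whose inclusion keeps
    the running mean of the selected set closest to the true class mean."""
    mu = features.mean(axis=0)
    selected, chosen = [], np.zeros(len(features), dtype=bool)
    for k in range(1, m + 1):
        current = features[selected].sum(axis=0) if selected else 0.0
        # distance of each candidate running mean to the true class mean
        dists = np.linalg.norm(mu - (current + features) / k, axis=1)
        dists[chosen] = np.inf      # select without replacement
        idx = int(np.argmin(dists))
        selected.append(idx)
        chosen[idx] = True
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
picked = herding_select(feats, 10)
assert len(picked) == len(set(picked)) == 10
# the herded subset approximates the class mean better than an arbitrary one
herd_err = np.linalg.norm(feats[picked].mean(axis=0) - feats.mean(axis=0))
arbitrary_err = np.linalg.norm(feats[:10].mean(axis=0) - feats.mean(axis=0))
assert herd_err <= arbitrary_err
```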
In this stage, we freeze the feature extractor and exclusively optimize the classifier on the balanced exemplar set. This optimization uses only true class labels, thereby avoiding the label noise inherent in distillation-based soft labels. As a result, the proposed BR strategy effectively mitigates prediction bias and yields consistent performance improvements across all previously learned classes.
The pseudo-code for the proposed method is presented in Algorithm 1. This procedure details the optimization steps required to address catastrophic forgetting.
Algorithm 1 Pseudocode for Our Method Training at Incremental Stage t
Input: New data; exemplar memory; old model; hyperparameters.
1: Initialize the new model with the old model's weights and extend the classifier head for the new classes.
2: for each incremental stage do
3:   if t = 1 then
4:     Train the first task:
5:     Train the model by minimizing the total loss, according to Equation (9).
6:     Construct the exemplar memory by the herding strategy [28].
7:     Retain the model.
8:   else
9:     Train incremental tasks:
10:    Initialize the new module with the old model's weights.
11:    Train the model by minimizing the total loss, according to Equation (9).
12:    Train the FCN by minimizing the calibration loss, according to Equation (13).
13:    Update the exemplar memory.
14:    Construct the exemplar memory by the herding strategy [28].
15:    Jointly train the classifier by optimizing the rectification objective, according to Equation (14).
16:    Discard the old model.
17:    Retain the new model for the next stage.
18:  end if
19: end for
Output: New model; exemplar memory.
4. Experiments and Results
In this section, we present a comprehensive evaluation of the proposed FR-CIL across five datasets using diverse data split protocols. First, we outline the experimental setting, involving the datasets, selected baselines for comparison, evaluation metrics, and implementation details. Next, we compare FR-CIL against several state-of-the-art methods to demonstrate its superior performance. Following this, sequential learning experiments are conducted to analyze its stability and performance trends as tasks accumulate. Subsequently, we analyze resource efficiency by comparing memory footprint and evaluating sensitivity to the number of preserved data points. Finally, we perform ablation studies to examine the contribution of each proposed component to the overall performance and investigate the model's robustness to changes in hyperparameters. Visual analysis further complements our findings, providing additional insights into the model's behavior.
4.1. Experimental Setting
4.1.1. Datasets
To comprehensively evaluate the effectiveness of the proposed method, experiments were conducted on five datasets: AID [56], RSI-CB256 [57], NWPU-45 [11], PatternNet [58], and UC-Merced [59]. These datasets encompass a diverse range of spatial resolutions and scene complexities. The detailed characteristics of each dataset are summarized as follows.
The AID dataset contains 30 scene classes with spatial resolutions ranging from 0.5 to 8 m, characterized by high intra-class variability and high inter-class similarity. The RSI-CB256 dataset includes 35 classes featuring diverse angles, scales, and colors, providing rich sample diversity. The NWPU-45 dataset consists of 45 classes with spatial resolutions spanning 0.2–30 m, exhibiting significant variations in viewpoint, lighting, and occlusion. The PatternNet dataset comprises 38 classes with resolutions of 0.06–4.69 m, ensuring high object occupancy to minimize background noise while maintaining visual diversity. Specifically, the ‘airplane’ and ‘baseball field’ classes were excluded to ensure class balance across the incremental learning stages. Finally, the UC-Merced dataset is composed of 21 classes with a fixed spatial resolution of 0.3 m, covering a wide range of agricultural, urban, and natural textures under varying lighting and background contexts.
To provide a comprehensive and structured evaluation of the proposed method, the five datasets are strategically distributed across different experimental analyses according to their characteristics and evaluation objectives. Specifically, AID, RSI-CB256, and NWPU-45 are used as the primary benchmarks for state-of-the-art comparisons and core ablation studies, as they represent large-scale and widely recognized RSSC datasets with varying incremental task lengths. PatternNet is utilized specifically for sequential learning analysis to validate the multi-scale feature enhancement capabilities of the PMFE module, given its broad spectrum of spatial resolutions and rich visual diversity. UC-Merced is utilized for qualitative visualization and hyperparameter sensitivity analysis, offering clear interpretability due to its moderate scale and well-structured class distribution.
4.1.2. Baselines
In the experiments, the proposed method was compared with eleven classical incremental learning methods. (1) Joint (Retraining): This setting represents the ideal offline baseline, where all available data are utilized for simultaneous joint training. Consequently, it serves as the performance upper bound for evaluating CIL methods. (2) Finetuning: A naive incremental baseline that sequentially trains solely on new tasks. However, it typically suffers from severe catastrophic forgetting. (3) GEM [40]: A projection-based approach that computes gradients for both the memory buffer and the current task, projecting the current gradient to minimize its angle relative to the memory gradients. (4) EWC [37]: A pioneering regularization-based approach that utilizes the diagonal of the Fisher information matrix to approximate the posterior distribution, thereby penalizing changes to parameters deemed critical for previous tasks. (5) LwF.MC [60]: A representative distillation-based method that addresses catastrophic forgetting by incorporating a knowledge distillation term into the global loss function. (6) iCaRL [28]: A hybrid approach that combines a distillation loss with a reserved memory of exemplars to preserve old knowledge. (7) WA [31]: A post-processing method designed to correct prediction bias toward new classes. It introduces a weight aligning strategy to adjust the biased weights in the final layer, ensuring fairness between old and new classes. (8) CwD [61]: A regularization strategy applied at the initial stage to mitigate representation collapse. It compels the model to learn uniformly scattered features by decorrelating class-wise representations, thereby boosting generalization for subsequent incremental stages. In this work, we utilize the AANet-based version of CwD. (9) DER [62]: A structure-based approach that dynamically expands the network architecture. It freezes previous feature extractors to ensure stability while adding prunable new branches for plasticity, employing an auxiliary loss to enhance feature discrimination. (10) MEMO [63]: A parameter-efficient expansion framework that decomposes the backbone into shared shallow layers and expandable deep layers. It optimizes memory allocation between model parameters and exemplars, ensuring robust performance across varying memory budgets. (11) EASE [64]: A framework based on pre-trained models that expands the architecture using lightweight adapters. It constructs task-specific subspaces and utilizes semantic similarities to synthesize old class prototypes, enabling ensemble prediction without forgetting.
4.1.3. Evaluation Metrics
Following standard evaluation protocols in CIL [27], we assess performance based on two primary aspects: the overall discriminative capability across learned tasks and the stability of memory regarding previously acquired knowledge. To quantify these, we utilize accuracy, mean accuracy (mACC), and backward transfer (BWT) metrics.
(1) Accuracy: The classification accuracy for a specific task $k$ serves as the fundamental performance metric:
$$\mathrm{ACC}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \mathbb{1}\left(\hat{y}_i = y_i\right),$$
where $N_k$ represents the total number of test samples for task $k$, $\hat{y}_i$ denotes the predicted label for the $i$-th sample, $y_i$ is the corresponding ground truth label, and $\mathbb{1}(\cdot)$ is an indicator function that outputs 1 if the condition is satisfied and 0 otherwise.
(2) mACC: Mean accuracy serves as a comprehensive metric for evaluating the overall performance of incremental learning models. Notably, it assesses the stability-plasticity trade-off by averaging the model’s accuracy on all encountered tasks after the entire learning sequence is complete. Let $a_{k,j}$ denote the accuracy on the test set of task $j$ after the model has been trained on task $k$ (where $j \le k$). After the final task $T$, mACC is calculated as:
$$\mathrm{mACC} = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}.$$
A higher mACC indicates superior prediction capability across all observed classes.
(3) BWT: Backward transfer measures the model’s resistance to forgetting and quantifies the impact of new tasks on previously learned tasks. BWT is calculated as:
$$\mathrm{BWT} = \frac{1}{T-1} \sum_{j=1}^{T-1} \left( a_{T,j} - a_{j,j} \right),$$
where a negative BWT value indicates forgetting of previous tasks, and a positive BWT value suggests that learning new tasks has improved performance on old tasks.
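All three metrics can be computed directly from the matrix of per-task accuracies. The following sketch, using a small hypothetical 3-task accuracy matrix (not values from the paper), illustrates the calculation:

```python
import numpy as np

def cil_metrics(acc):
    """Compute mACC and BWT from a lower-triangular accuracy matrix.

    acc[k, j] is the accuracy on task j's test set after training
    on task k (valid for j <= k), with T tasks in total.
    """
    T = acc.shape[0]
    macc = acc[T - 1].mean()                      # mean accuracy after the final task
    bwt = np.mean([acc[T - 1, j] - acc[j, j]      # accuracy change on each old task
                   for j in range(T - 1)])
    return macc, bwt

# Hypothetical 3-task run: accuracy on old tasks drops -> negative BWT
a = np.array([[0.90, 0.00, 0.00],
              [0.80, 0.88, 0.00],
              [0.75, 0.82, 0.86]])
macc, bwt = cil_metrics(a)
```

Here mACC averages only the final row of the matrix, while BWT compares each task's final accuracy against its accuracy immediately after it was learned.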
4.1.4. Implementation Details
To ensure a fair comparison, ResNet-18 is employed as the backbone for all compared methods, with the exception of EASE, for which the ViT-B/16 architecture is utilized. Regarding data partitioning, all datasets are randomly divided into training and test sets with a ratio of 4:1. Unless otherwise specified, all methods adopt the default hyperparameter configurations reported in their respective original papers. Specifically, for EWC, the regularization weight was set to 1000. For LwF, the temperature scaling factor $T$ was set to 2. In WA, weight normalization was applied to the fully connected layer. For DER, 10 warm-up epochs were employed. CwD was implemented using the AANet-based version. Finally, for EASE, the adapter projection dimension $r$ was set to 16, with a trade-off parameter of 0.1. All methods were optimized using stochastic gradient descent for 200 epochs, with a momentum of 0.9 and weight decay. A batch size of 128 was utilized. Regarding the exemplar memory set, the herding strategy [28] was adopted to select representative samples after the completion of each training phase. The transductive learning strategy strictly and exclusively utilizes the training data from the current incremental stage; there is no data leakage, no access to future tasks, and no peeking at the global test set at any point in the training process.
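The herding selection from iCaRL [28] can be sketched as a greedy loop that picks, at each step, the sample bringing the running mean of the chosen exemplars closest to the class mean feature. The function name and array shapes below are illustrative, not the paper's implementation:

```python
import numpy as np

def herding_select(features, m):
    """Greedily select m exemplar indices from an (n, d) feature array
    so that the mean of the selected features approximates the class mean."""
    mu = features.mean(axis=0)
    selected, running_sum = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # distance between the class mean and the mean obtained by
        # adding each candidate to the exemplars chosen so far
        dists = np.linalg.norm(mu - (running_sum + features) / k, axis=1)
        dists[selected] = np.inf          # sample without replacement
        i = int(dists.argmin())
        selected.append(i)
        running_sum += features[i]
    return selected
```

Note that the very first exemplar selected is simply the sample nearest to the class mean, which is why herding tends to store prototypical rather than outlying samples.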
To ensure a fair and reproducible comparison, we adopt a fixed total memory budget for all methods. Specifically, image-based replay is allocated 300 MB, while feature-based replay (FR-CIL) is allocated 16 MB. For image rehearsal, RGB images are stored in their raw format; for feature-based replay, we store feature embeddings using 32-bit floating-point precision. When new classes are introduced, the memory is evenly reallocated across all observed classes to maintain a fixed total memory constraint. The class order was randomized using a fixed NumPy random seed, thereby enforcing an identical task sequence across all methods. Each experiment was repeated three times, and the mean results are reported. No data augmentation strategies were utilized. All experiments were implemented in Python 3.8 using the PyTorch 1.9 framework and executed on a single NVIDIA RTX 3090 GPU.
4.2. Results and Analysis
To provide a comprehensive assessment of FR-CIL, we compare it with several state-of-the-art baselines. Experiments are conducted across three datasets to ensure a fair and robust evaluation, thereby reducing potential bias from any single dataset. The mACC and BWT results are summarized in
Table 1.
As expected, Joint (Retraining) achieves the highest mACC scores. Since the model is trained using all accumulated data simultaneously, it does not suffer from catastrophic forgetting and therefore has no corresponding BWT value. Consequently, it serves as an upper bound for CIL performance. In contrast, finetuning focuses solely on the current task without any mechanism to mitigate catastrophic forgetting, resulting in the lowest mACC values and the most severe forgetting among all compared methods. Prior-focused methods, such as GEM and EWC, demonstrate limited effectiveness and offer negligible gains over finetuning. Notably, GEM exhibits even poorer BWT performance on NWPU-45, as the feasible gradient space for subspace projection becomes increasingly constrained as the number of tasks grows. EWC, which estimates parameter importance via the Fisher information matrix to restrict updates on critical parameters, similarly yields only a marginal improvement in mACC relative to finetuning. In contrast, LwF employs knowledge distillation during incremental updates, yielding a substantial improvement in mACC over finetuning and demonstrating the effectiveness of this strategy. iCaRL further enhances performance by maintaining an exemplar memory, resulting in higher mACC scores and highlighting the efficacy of exemplar replay. WA builds upon iCaRL by addressing class imbalance through bias correction, thereby achieving additional performance gains. CwD constrains the model to generate uniformly distributed features by explicitly decorrelating class-wise representations. This mechanism enhances generalization across incremental stages, leading to improved overall performance. Architecture-based methods (i.e., DER, MEMO, and EASE) employ expandable architectures and consequently deliver competitive results. These findings highlight that greater representational capacity plays a crucial role in alleviating forgetting and improving continual learning robustness.
Among them, EASE distinguishes itself by integrating pre-trained models, resulting in superior performance across all datasets. Our method surpasses existing state-of-the-art methods, achieving mACC values closer to those of the Joint upper bound and attaining the lowest BWT scores, thereby validating the effectiveness of the proposed framework.
4.3. Sequential Learning Analysis
CIL methods proceed in a sequential manner, where each incremental stage introduces new classes. To evaluate the dynamic learning behavior, we employ sequential learning analysis, which serves as a dynamic visualization tool for characterizing how well the model retains knowledge of previously learned classes while adapting to new ones. After completing the t-th incremental stage, we evaluate the model using mACC over all classes learned up to that point. While the mACC metric provides a global summary, it fails to elucidate the specific timing and magnitude of catastrophic forgetting. Consequently, we analyze the sequential learning trajectory on the ten-task AID dataset to provide a more detailed view of performance evolution as the number of learned classes increases.
As illustrated in
Figure 4, the mACC curves generally exhibit a downward trend as the number of learned classes increases. Notably, our method maintains consistently high and stable mACC values throughout the incremental tasks, indicating its robust capability to adapt to new tasks without compromising old knowledge. Conversely, naive finetuning lacks any strategies to prevent catastrophic forgetting and therefore exhibits the poorest performance among all evaluated methods. Under long-sequence CIL settings, GEM behaves similarly to finetuning, failing to preserve knowledge of earlier tasks. EWC, which estimates parameter importance, achieves only a marginal improvement over finetuning. The results of these two methods suggest that such simple regularization is insufficient for the challenging CIL setting. The incremental improvements across LwF, iCaRL, and WA highlight the effectiveness of their strategies. These methods exhibit a progressive relationship: LwF employs a knowledge distillation strategy; iCaRL improves upon this by incorporating exemplar replay; and WA further enhances performance via post-processing for bias correction. Additionally, CwD mitigates representation collapse during the initial phase, outperforming methods that rely exclusively on replay or distillation. Architecture-based methods, including DER, MEMO, and EASE, rank among the top methods and achieve competitive results. By maintaining additional network branches or adapters, they sustain high mACC scores even as the total number of learned classes increases.
4.4. Impact of Memory Footprint
We propose a novel replay-based framework for CIL that preserves low-dimensional feature embeddings, rather than raw images, as exemplars for previously learned classes. The primary goal is to substantially reduce the memory consumption while maintaining performance. Given that different replay-based methods vary in their storage mechanisms, we standardize the memory budget to ensure a fair comparison. Specifically, considering that the NWPU-45 dataset imposes the most stringent storage demands due to its extensive class number and extended incremental sequence, we compare our approach with three representative methods employing distinct storage strategies on the NWPU-45 dataset in terms of the exemplar memory footprint.
As illustrated in
Figure 5, our method demonstrates competitive mACC scores while maintaining a significantly lower memory footprint compared to conventional image-based replay methods. It is important to note that the reported memory footprint encompasses all preserved data (i.e., raw images or feature embeddings) across all classes. Specifically, storing a single 256 × 256 RGB image in uint8 format requires 192 KB. However, for GPU-based training, these images are typically normalized and converted to float32 tensors, escalating the storage requirement to 768 KB per sample. In contrast, a 512-dimensional float32 feature embedding occupies merely 2 KB, which represents less than 1% of the storage needed for the original image. Consequently, as demonstrated in Figure 5, our method reduces memory requirements by at least an order of magnitude while improving the mACC in most cases.
The experimental results presented in
Figure 5 illustrate the trade-off between model performance and storage overhead. As the quantity of preserved exemplars increases, both mACC and memory consumption rise. Notably, the proposed method achieves performance saturation at a memory usage of approximately 16 MB. In contrast, conventional image-based replay methods necessitate nearly 300 MB to achieve comparable stability. Beyond these thresholds, the marginal gains in mACC diminish significantly while memory cost continues to grow unnecessarily.
Given the substantial discrepancy in storage density between raw images and compact feature embeddings, strictly enforcing an identical memory budget for all comparisons proves impractical. Specifically, a budget sufficient for images implies an excessive number of features, whereas a budget optimized for features is insufficient for images. Consequently, throughout all experiments on the NWPU-45 dataset, we allocate a memory budget of 16 MB for feature-based replay and 300 MB for image-based replay, respectively. To maintain experimental consistency, the memory constraints for all other datasets are calibrated based on these benchmarks, with the number of preserved exemplars adjusted accordingly.
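The per-sample storage figures above follow from simple arithmetic, assuming 256 × 256 RGB images and 512-dimensional embeddings (the embedding size is consistent with the ResNet-18 backbone used in the experiments):

```python
# Back-of-envelope exemplar storage cost per sample.
H, W, C, D = 256, 256, 3, 512

img_uint8_kb = H * W * C / 1024          # raw image, 1 byte per channel value
img_float32_kb = H * W * C * 4 / 1024    # after normalization to float32
feat_float32_kb = D * 4 / 1024           # one stored float32 embedding
```

Dividing a 300 MB image budget or a 16 MB feature budget by these per-sample costs gives the corresponding exemplar counts under each replay strategy.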
4.5. Ablation Experiments
To validate the effectiveness of the proposed FR-CIL, we conduct a comprehensive ablation study to systematically evaluate the specific contribution of each key component: the PMFE module, the FCN, the DSKR mechanism and the BR strategy.
4.5.1. Effect of PMFE
We conducted ablation experiments on the PMFE module across three datasets, with the results reported in
Table 2. Compared to the variant without PMFE (w/o PMFE), our method achieves a consistent increase in mACC alongside a reduction in BWT values. These findings indicate that PMFE effectively improves incremental learning performance by promoting fine-grained and interactive feature enhancement.
To further validate the effectiveness of PMFE, we perform a sequential learning analysis on the six-task PatternNet dataset. PatternNet encompasses a broad spectrum of spatial resolutions (0.06–4.69 m) and is characterized by high object occupancy and rich visual diversity. These attributes make it particularly suitable for evaluating multiscale feature learning in remote sensing scenarios.
As illustrated in Figure 6a, the model incorporating PMFE maintains consistently higher mACC values throughout the incremental learning process. Furthermore, Figure 6b details the final accuracy achieved for each task after completing all incremental learning phases. The proposed method demonstrates significant performance gains across all six tasks. These performance enhancements are attributed to the multiscale and fine-grained feature construction process, which enhances the model’s capacity to learn discriminative representations for RSIs. The performance gains on all tasks provide strong empirical evidence for the effectiveness of PMFE.
4.5.2. Effect of FCN
To evaluate the effect of FCN, we assess the quality of feature calibration by computing the average similarity between adapted feature vectors and their corresponding ground-truth representations. Specifically, for a given image x, the ground-truth vector is the feature representation extracted by the current model using the original image. This vector is compared against the adapted feature vector of x, which evolves across incremental stages. The quality of feature calibration is quantified using cosine similarity, serving as a metric to determine how accurately the model approximates the true feature distribution.
Figure 7 illustrates the cosine similarity of the five classes from the initial task, evaluated after completing all learning phases on the AID, RSI-CB256, and NWPU-45 datasets. The results demonstrate that the proposed method consistently maintains higher average cosine similarity for old classes relative to the variant without FCN. This improvement is particularly pronounced on the NWPU-45 dataset, where our method attains average cosine similarities ranging from 80.39% to 85.04%. In contrast, the variant without FCN yields significantly lower values, falling within the range of 51.29% to 60.35%.
These results indicate that the proposed FCN efficiently adapts the previously stored feature descriptors to the updated feature space to reconcile the distributional gap. Consequently, the adapted features retained in the memory bank maintain high semantic fidelity to the original class distributions, thereby facilitating robust incremental learning.
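The calibration-quality metric described above can be sketched as follows; the helper name and array shapes are illustrative, not the paper's code:

```python
import numpy as np

def mean_cosine_similarity(adapted, reference):
    """Row-wise cosine similarity between calibrated exemplar features
    and features re-extracted by the current model, averaged over samples.

    Both inputs are (n, d) arrays with matching rows.
    """
    a = adapted / np.linalg.norm(adapted, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return float((a * r).sum(axis=1).mean())
```

A value near 1 indicates that the stored descriptors remain well aligned with the current feature space, while values drifting toward 0 signal the distributional gap that the FCN is designed to close.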
To further validate its contribution, we employ t-distributed stochastic neighbor embedding (t-SNE) for feature visualization. This qualitative analysis complements our quantitative findings by providing intuitive insights into the model’s internal representation learning. Specifically, we utilize t-SNE to project the high-dimensional feature embeddings from the final layer of FR-CIL into a two-dimensional latent space.
Figure 8 presents the t-SNE visualization of feature embeddings on the UC-Merced dataset. The feature distribution in
Figure 8a appears less organized and diffuse, characterized by ambiguous class boundaries. In contrast,
Figure 8b demonstrates that the proposed FR-CIL generates a highly structured and regularized embedding space. The visualization reveals significantly enhanced inter-class separability alongside a high degree of intra-class compactness. These geometric properties ensure that the learned features do not lose discrimination as the feature space evolves. Consequently, the proposed method effectively enhances feature stability, allowing the model to maintain robust classification boundaries even as the data distribution evolves.
4.5.3. Effect of DSKR
During the incremental learning stage, the proposed method incorporates two complementary knowledge distillation losses within the DSKR mechanism to ensure robust knowledge transfer. Specifically, one loss performs distillation within the output probability space, whereas the other operates directly in the feature representation space. To verify the effectiveness of this dual-constraint mechanism, we performed ablation experiments across three datasets. The detailed results are presented in Table 3.
Compared to models optimized solely with the classification loss, the incorporation of either distillation loss yields notable improvements in mACC while effectively reducing BWT. These findings underscore the effectiveness of distillation mechanisms in preserving knowledge via both output semantic probabilities and latent feature representations. In particular, the output-space distillation loss consistently achieves slightly superior performance compared to the feature-space loss, suggesting that constraints imposed in the semantic output space play a more critical role in retaining class discrimination. Notably, when both distillation losses are jointly employed, the improvements in mACC and reductions in BWT across the three datasets surpass those obtained using either distillation strategy alone, demonstrating the complementary and synergistic benefits of the proposed dual-space distillation mechanism.
4.5.4. Effect of BR
We assess the effectiveness of the proposed BR strategy on classifier performance across three datasets. As detailed in Table 4, the incorporation of the BR strategy yields additional improvements in ACC and BWT. Notably, the impact of the BR strategy is particularly pronounced on the AID dataset compared to the other datasets. This distinction can be attributed to the class imbalance inherent in AID, characterized by a highly uneven distribution of samples across different scene classes. Specifically, such imbalance typically causes model parameters to bias towards classes with a higher quantity of samples during incremental learning. The BR strategy effectively mitigates this bias by optimizing the classifier on the balanced exemplar set, thereby enhancing overall accuracy and alleviating catastrophic forgetting for old classes.
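A minimal sketch of constructing such a class-balanced exemplar subset is shown below; the function name and sampling details are illustrative rather than the paper's exact procedure:

```python
import numpy as np

def balanced_subset(labels, per_class, seed=0):
    """Return indices of a class-balanced subset: the same number of
    samples drawn (without replacement) from every observed class.

    Only the classifier is re-trained on this subset in the BR stage.
    """
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=per_class, replace=False))
    return np.asarray(idx)
```

Because every class contributes equally to the re-training set, gradient updates to the final layer no longer favor the classes with more samples, which is the bias this strategy targets.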
The effect of the BR strategy is further illustrated through confusion matrix analysis on the UC-Merced dataset. Compared to
Figure 9a,b exhibits significantly fewer off-diagonal elements and a more uniform diagonal distribution. Notably, the diagonal elements corresponding to old classes exhibit higher values in
Figure 9b. This observation confirms that incorporating the BR strategy effectively mitigates classifier bias, ensuring that the model retains high accuracy for previously learned classes while simultaneously learning new ones.
4.5.5. Synergistic Effect Among PMFE, FCN, and BR
To further investigate the interaction among the three proposed components, we conduct combination ablation experiments by incrementally integrating the PMFE module, the FCN, and the BR strategy.
As detailed in
Table 5, the results reveal strong synergistic interactions among the three proposed components beyond their individual contributions. Notably, there is a pronounced synergy between PMFE and FCN. The PMFE module functions as a stabilizing foundation by enhancing discriminative multi-scale features. When the FCN subsequently performs transductive mapping to calibrate historical embeddings into the current feature space, it directly benefits from this enhanced representation. As PMFE inherently increases intra-class compactness and inter-class separability, the calibration process within FCN becomes more stable and significantly less susceptible to distortion. Furthermore, the BR strategy specifically targets the classifier to rectify biased decision boundaries. Its effectiveness becomes more pronounced when the feature space is both discriminative and stabilized. When FCN and PMFE are jointly applied, the classifier processes embeddings with reduced drift and improved separability, thereby enabling the BR strategy to more effectively mitigate classification bias.
4.5.6. Effect of the Distillation Loss Weights
As formulated in Equation (9), the overall loss function incorporates two hyperparameters that serve as weighting coefficients for the respective distillation losses. Specifically, one governs the contribution of knowledge distillation within the output probability space, while the other modulates the impact of distillation in the feature representation space.
These hyperparameters are critical for navigating the stability–plasticity dilemma. When both weights are set to small values, insufficient regularization leads to pronounced catastrophic forgetting. Conversely, excessively large values impose overly rigid constraints on knowledge retention, reducing the model’s plasticity.
To evaluate the sensitivity of the proposed framework to these two weights, we conducted ablation experiments on the three-task UC-Merced dataset. Preliminary empirical observations indicated that the model achieves a favorable stability-plasticity trade-off within a narrow range of each weight. Consequently, we performed a fine-grained grid search within these ranges, incrementing the values by 0.1 at each step to precisely identify the optimal configuration.
As illustrated in Figure 10, our method achieves optimal performance on the UC-Merced dataset when the output-space weight is set to 1.8 and the feature-space weight to 0.8. Therefore, these settings are employed as the default hyperparameters throughout the experiments.
4.5.7. Effect of the Feature Alignment Weight
As formulated in Equation (13), this hyperparameter modulates the weight of the feature alignment loss, thereby regulating the degree of emphasis placed on feature similarity during the training of the feature calibration network. To evaluate the model’s sensitivity to this weight, we conducted ablation studies on the UC-Merced dataset.
As presented in Table 6, the mACC demonstrates robustness across a relatively broad range of this weight. The best-performing value in Table 6 is consequently adopted for all experiments reported in our work.
5. Discussion
5.1. Dataset-Specific Sensitivity Analysis
Through comprehensive comparisons with several state-of-the-art methods, extensive experiments conducted on multiple public datasets demonstrate both the effectiveness and robustness of the proposed method, as well as the contribution of each key component, indicating its strong potential for real-world RSSC applications. To supplement the general performance overview across these datasets, we provide an additional discussion on the dataset-specific behavior of different methods. In particular, we identify which methods maintain stable performance across diverse datasets and which exhibit notable performance fluctuations under certain data distributions. Such performance variations are worthy of careful investigation, as they often reflect the intrinsic compatibility between a given incremental learning strategy and the underlying characteristics of the data.
As illustrated in
Table 1, the performance variations observed across the AID, RSI-CB256, and NWPU-45 datasets highlight the impact of inherent dataset characteristics, including intra-class variability, inter-class similarity, class distribution balance, and the granularity of task partitioning.
For the AID dataset, the primary challenges arise from significant fluctuations in spatial resolution (ranging from 0.5 m to 8 m) and class imbalance (220–420 samples per class). These factors adversely affect prior-focused regularization methods, such as GEM and EWC. Under conditions of high intra-class variability, the gradient-based constraints employed by these methods become unreliable, leading to substantial performance degradation. In contrast, other methods demonstrate robust resistance to catastrophic forgetting under these conditions. The RSI-CB256 dataset features a balanced class distribution that mitigates the adverse effects of class imbalance, thereby enhancing the efficacy of data-replay strategies. Consequently, iCaRL achieves a slightly higher mACC on RSI-CB256 than on the AID dataset (64.02% vs. 61.17%), and the performance gap between iCaRL and WA narrows significantly. However, as the sequence length and number of classes increase, the limitations of pure regularization methods (GEM, EWC, and LwF) become pronounced, leading to sustained performance degradation. The NWPU-45 dataset poses the most rigorous challenge due to its high scene complexity and extended incremental sequence. The larger number of incremental stages leads to cumulative task interference, thereby exacerbating the stability–plasticity dilemma. Notably, methods such as iCaRL and WA, which perform well on the AID and RSI-CB256 datasets, exhibit a significant performance drop on the NWPU-45 dataset, suggesting that strategies heavily reliant on exemplar replay lose effectiveness as task interference intensifies. Architecture-based methods, including DER, MEMO, and EASE, consistently rank among the top-performing methods and achieve competitive results, indicating that maintaining additional network branches or adapters can be highly effective, despite additional storage overhead.
Our method maintains consistently high and stable mACC values throughout incremental tasks without relying on additional storage, indicating that the proposed feature replay strategy effectively preserves discriminative representations while mitigating catastrophic forgetting.
This discussion facilitates a deeper understanding of the underlying causes of the observed performance fluctuations and offers guidance for the development of more robust and generalizable RSSC-oriented CIL methods.
5.2. Real-World Application Scenarios Analysis
To further clarify the practical value of the proposed method, we highlight its applicability in real-world remote sensing deployment scenarios, particularly in resource-constrained platforms, such as satellites and unmanned aerial vehicles (UAVs). In such platforms, storage capacity and computational resources are strictly limited, and continuous data transmission to ground stations for full retraining is impractical due to bandwidth constraints and latency considerations. As a result, models deployed in such environments must support incremental updates while operating under strict memory constraints.
Long-term earth observation missions necessitate adaptive model updates to accommodate newly emerging scene classes, such as evolving urban structures, disaster-affected regions, and newly monitored geographic zones. However, retaining large volumes of historical raw imagery for rehearsal is frequently infeasible due to limited on-board storage and strict regulatory restrictions on data retention. Under these constraints, memory efficiency becomes a decisive factor for operational deployment rather than a purely theoretical advantage.
The proposed FR-CIL framework is explicitly designed to accommodate such deployment constraints. By storing compact feature embeddings instead of raw images, the framework achieves substantial memory reduction while preserving knowledge of previously learned classes. This enables incremental adaptation without periodic retraining or permanently storing historical imagery. Consequently, FR-CIL offers a robust and practical solution for resource-constrained remote sensing platforms, empowering them with long-term, adaptive scene understanding capabilities.
5.3. Computational Efficiency Analysis
The proposed memory-efficient design is not achieved at the expense of substantial computational overhead. Both the FCN and the PMFE module incur only minor computational cost during training and have negligible impact during inference.
The FCN operates as an orthogonal transformation applied to 512-dimensional feature embeddings rather than high-resolution images. As a result, its computational cost is minimal compared with the backbone network, which dominates overall training complexity. The calibration process therefore introduces only marginal overhead in training, and at inference time the FCN performs a simple forward transformation, leading to negligible additional latency. The PMFE module enhances multiscale representation within the feature extraction stage using depthwise and dilated convolutions. These operations are computationally efficient and widely adopted in lightweight CNN architectures. In particular, depthwise convolutions significantly reduce parameter count and FLOPs compared with standard convolutions. Consequently, the overhead introduced by PMFE remains small relative to the total computational cost of the backbone. Importantly, conventional image-replay methods require repeated forward and backward propagation of stored raw images through the entire backbone during each incremental stage. In contrast, the proposed feature-replay strategy directly reuses precomputed feature embeddings, thereby eliminating redundant backbone processing of historical data. Consequently, this computational saving effectively offsets the additional training time introduced in each incremental task, leading to comparable overall training time in practice.
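The parameter saving from depthwise convolutions can be illustrated with a quick count for a k × k layer; the channel sizes below are hypothetical and chosen only for illustration, not taken from the paper:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k filter per input channel, followed by a
    1 x 1 pointwise convolution to mix channels (bias omitted)."""
    return c_in * k * k + c_in * c_out

# Hypothetical 256-channel 3x3 layer
std = conv_params(256, 256, 3)
dws = depthwise_separable_params(256, 256, 3)
```

For this configuration the depthwise-separable variant uses roughly an order of magnitude fewer weights, which is why modules built from such operations add little to the backbone's overall cost.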
Overall, the proposed method maintains comparable training time and inference latency while significantly reducing memory consumption, ensuring that its storage efficiency does not come at the cost of substantial computational burden.
6. Conclusions
In this article, we introduce a memory-efficient CIL framework for RSSC that stores compact feature embeddings, rather than raw images, as exemplars for previously learned classes. This strategy mitigates privacy concerns, representation drift, and classifier bias, thereby preserving decision boundaries and alleviating catastrophic forgetting. The proposed framework comprises four key components: the PMFE module, the DSKR mechanism, the FCN, and the BR strategy. The PMFE module enables fine-grained and interactive feature enhancement through a progressive construction scheme, yielding richer representations and a more comprehensive understanding of remote sensing scenes. The DSKR mechanism integrates two complementary distillation losses to jointly preserve decision boundaries and feature representation stability. A specialized FCN is trained in a transductive learning paradigm with manifold consistency regularization to adapt stored feature descriptors to the updated feature space, thereby bridging the distributional gap for a unified classifier. Finally, the BR strategy mitigates prediction bias by exclusively optimizing the classifier on a balanced exemplar set. The framework proceeds in three sequential stages. It first learns discriminative representations via joint optimization with the DSKR mechanism, then calibrates old features into the current feature space via FCN, and finally mitigates classifier bias by optimizing the classifier on a balanced exemplar set.
Extensive experiments on five public datasets demonstrate the superior performance and robustness of the proposed framework. Comprehensive analyses further validate its effectiveness across diverse remote sensing scenarios. In future work, we plan to extend this framework to few-shot class-incremental learning, where only limited samples of new classes are available while preserving previously acquired knowledge. This setting more closely reflects real-world remote sensing applications.