1. Introduction
Recent advances in remote sensing technologies have enabled the acquisition of large volumes of imagery from diverse platforms, including satellites, drones, and airborne systems. This has led to the extensive application of remote sensing data in various fields such as urban planning [1,2,3,4], environmental monitoring [5,6,7], and disaster management [8,9,10]. Furthermore, the introduction of deep learning techniques has significantly enhanced the efficiency and accuracy of remote sensing image analysis, achieving state-of-the-art performance across a wide range of applications [11,12,13,14,15]. For instance, tasks such as building segmentation and land use/land cover (LULC) classification, which previously relied on manual interpretation or basic analytical methods, have seen remarkable improvements in both accuracy and computational efficiency through the application of deep learning [16,17,18]. Consequently, the research paradigm in remote sensing has shifted from traditional rule-based approaches to data-driven, deep learning-based methods.
Most deep learning studies in remote sensing have focused on improving task performance using fixed datasets under static experimental settings, with the development of state-of-the-art models becoming a dominant research trend. However, in real-world scenarios, imagery is continuously collected from various platforms, leading to significant variations in visual characteristics—such as object size, shape, and texture—as illustrated in Figure 1. Due to these domain-specific discrepancies, models trained on a single domain (i.e., a set of images from a specific platform) often suffer from severe performance degradation when applied to new domains [19,20]. This necessitates either retraining the model for each new domain, which adapts it only to that domain, or performing joint learning across all domains, which is computationally expensive.
To alleviate these challenges, domain adaptation has been explored, aiming to transfer knowledge from a source domain to a target domain and thereby improve performance on the target domain [21,22]. While this approach can indeed enhance target-domain performance, it primarily focuses on forward knowledge transfer and does not explicitly preserve knowledge from previous domains, leading to catastrophic forgetting [23,24], where previously learned knowledge is lost during adaptation. Therefore, in real-world scenarios where new domains appear sequentially, more flexible and adaptive learning strategies are required to facilitate the practical deployment of deep learning-based remote sensing methods. In this context, continual learning has emerged as a promising approach capable of mitigating both domain shift and catastrophic forgetting.
Continual learning is a learning paradigm in which a model incrementally learns from a sequence of tasks, aiming to integrate new knowledge without forgetting what has already been learned [25,26]. In this context, a “task” refers to a specific learning objective that arises sequentially over time. Based on learning scenarios, continual learning is typically categorized into Task-Incremental Learning (TIL), Class-Incremental Learning (CIL), and Domain-Incremental Learning (DIL) [27,28]. In TIL, each task has a distinct label space, and a task identifier is available during both training and inference to specify which task the model should address. In contrast, CIL does not provide a task identifier during inference, requiring the model to classify all previously encountered classes jointly. DIL differs from the previous two scenarios in that the learning objective (i.e., the set of classes or the application) remains the same across tasks, while the input data distribution or domain characteristics vary. Among the three, DIL is particularly relevant to real-world remote sensing applications, where domain shifts frequently occur. Nevertheless, research on DIL remains relatively underexplored compared to TIL and CIL. Therefore, this study focuses on addressing challenges within the DIL scenario.
In addition to scenario-based categorization, continual learning can also be classified based on implementation strategies: regularization-based, architecture-based, and replay-based methods [29,30]. Regularization-based methods preserve prior knowledge by incorporating additional terms into the loss function that penalize changes to parameters important for previous tasks. Architecture-based methods maintain prior knowledge by isolating parameters for each task, either by fixing and masking them during the training of new tasks or by using dynamic architectures that freeze existing parameters while expanding the model for task-specific learning. Replay-based methods, on the other hand, preserve prior knowledge more intuitively by explicitly storing a subset of data from previous tasks in a memory buffer and reusing it when training on new tasks.
Most existing studies on DIL have primarily employed regularization-based or architecture-based approaches, often citing concerns regarding data storage costs and privacy issues associated with replay-based methods [31,32,33]. However, in the field of remote sensing, long-term data collection is standard practice, making data storage costs a less critical concern. Furthermore, remote sensing imagery typically does not contain personally identifiable information, reducing privacy risks compared to other fields—except in sensitive areas such as national defense. Instead, the limitations of regularization-based and architecture-based methods pose significant challenges in practical remote sensing applications. Regularization-based methods are inherently vulnerable to domain shifts, which can lead to catastrophic forgetting. Meanwhile, architecture-based methods tend to increase model complexity as tasks accumulate, requiring additional memory resources. Given these factors, replay-based methods present a more practical solution for DIL within remote sensing.
Therefore, this study focuses on developing a replay-based learning algorithm for DIL in remote sensing, where a model learns the same application from sequentially emerging domains collected from different platforms, while retaining knowledge from previously learned domains. To this end, we propose Experience Replay with Performance-Aware Submodular Sampling (ER-PASS), designed to improve adaptability across domains, mitigate catastrophic forgetting, and ensure computational efficiency. ER-PASS integrates the robustness of joint learning—by processing samples from multiple domains as a unified input—with the efficiency of replay-based learning in reducing training time. Furthermore, we introduce a performance-aware submodular sample selection strategy to enhance the stability of the learning process. To evaluate the effectiveness of our approach, we conduct experiments on two representative remote sensing applications: building segmentation and LULC classification. The main contributions of this study are as follows:
- (1) We propose a replay-based learning algorithm that incorporates a performance-aware submodular sample selection strategy, namely ER-PASS, which is model-agnostic and can be applied across various deep learning models.
- (2) We demonstrate that ER-PASS effectively mitigates catastrophic forgetting compared to existing methods, while requiring relatively low resource demands.
- (3) Experimental results on building segmentation and LULC classification demonstrate that ER-PASS exhibits generalizability across diverse remote sensing applications.
3. Methodology
3.1. Overview
This study is conducted under the DIL scenario, where data from multiple platforms are incrementally collected, and the model is continuously updated to accommodate newly acquired data. In this setting, learning for each domain is treated as an individual task, and the number of tasks increases as new domain data becomes available. A strict DIL setup assumes that class labels remain consistent across tasks. To align with this assumption, we adopt building segmentation as the primary downstream application. However, real-world scenarios often involve the emergence of new classes over time. To reflect such practical considerations, we additionally evaluate our approach under a more relaxed setting using LULC classification, where new classes can emerge in later tasks.
The overall learning process of ER-PASS proposed in this study is illustrated in Figure 2. In the training phase, model learning and sample selection are performed sequentially for each task. The segmentation model is trained on the current domain’s data in conjunction with the samples stored in the memory buffer from previous domains. After training, the sample selection algorithm identifies core samples based on scores derived from the trained model’s predictions. These samples are then stored in the memory buffer and carried over to the next task, enabling the model to retain knowledge from earlier domains while facilitating learning on subsequent ones. For evaluation, the model is assessed not only on the current domain but also on all previously encountered domains. We adopt UNet [63] and DeepLabV3+ [64] as baseline models for both building segmentation and LULC classification. Notably, ER-PASS is model-agnostic and can be readily integrated into other segmentation networks without requiring any architectural modifications.
3.2. Proposed Algorithm
ER-PASS is a replay-based continual learning algorithm tailored for DIL in remote sensing applications. It is motivated by the observation that joint learning has been shown to improve generalization across heterogeneous domains and that dataset-level memory integration—rather than batch-level—offers greater computational efficiency.
Let the continual learning process consist of a sequence of tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K\}$, each corresponding to a distinct remote sensing domain. For the k-th task $\mathcal{T}_k$, a labeled dataset $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}$ is provided, where $x_i^k$ and $y_i^k$ denote the input image and its corresponding label, and $n_k$ is the number of training samples in task $\mathcal{T}_k$.
Unlike conventional experience replay methods [45,46,47], which merge current task data and memory buffer samples within each mini-batch, ER-PASS performs dataset-level integration by combining the two into a unified training set. Specifically, for task $\mathcal{T}_k$, the new training set $\tilde{\mathcal{D}}_k$ is defined as:

$$\tilde{\mathcal{D}}_k = \mathcal{D}_k \cup \mathcal{M}_{k-1}$$

where $\mathcal{M}_{k-1}$ denotes the memory buffer containing representative samples from previous tasks. The model parameters $\theta_k$ for task $\mathcal{T}_k$ are then optimized by minimizing the expected loss over the combined dataset $\tilde{\mathcal{D}}_k$:

$$\theta_k = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim\tilde{\mathcal{D}}_k}\left[\mathcal{L}\big(f_{\theta}(x), y\big)\right]$$

where $\mathcal{L}$ denotes the downstream segmentation loss (e.g., binary cross-entropy or cross-entropy), and $f_{\theta}$ represents the segmentation model.
Through this simple yet effective approach, ER-PASS computes a single gradient per iteration without requiring gradient aggregation, while maintaining a consistent batch size. This design reduces memory overhead compared to conventional replay-based methods and shortens training time compared to joint learning by relying only on a compact memory buffer rather than the full dataset of previous tasks.
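The dataset-level integration above can be sketched in a few lines of plain Python; `make_task_batches` is a hypothetical helper (not the authors' code) that merges the buffer into the training set once per task and then batches normally:

```python
import random

def make_task_batches(current_data, memory_buffer, batch_size, seed=0):
    """Dataset-level replay (sketch): merge the memory buffer into the
    current task's training set once, then draw ordinary mini-batches."""
    combined = list(current_data) + list(memory_buffer)  # D_k united with M_{k-1}
    random.Random(seed).shuffle(combined)
    # Each mini-batch yields a single gradient per iteration; no per-batch
    # buffer sampling or gradient aggregation is needed, and the batch size
    # stays constant regardless of how many tasks have been seen.
    return [combined[i:i + batch_size]
            for i in range(0, len(combined), batch_size)]

# Toy usage: 6 current samples plus 2 replayed samples, batch size 4
batches = make_task_batches(range(6), ["m1", "m2"], batch_size=4)
```

Because the merge happens once per task, every training iteration processes one ordinary mini-batch, in contrast to batch-level replay where each batch must be assembled from two sources.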
After training on task $\mathcal{T}_k$, a sample selection algorithm is employed to extract a representative subset of samples from $\tilde{\mathcal{D}}_k$ for updating the memory buffer. The selection is based on feature similarity and task-specific performance, as described in Section 3.3:

$$\mathcal{M}_k = \mathrm{Select}\big(\tilde{\mathcal{D}}_k, f_{\theta_k}\big)$$

The updated memory buffer $\mathcal{M}_k$ is then used for training on the next task $\mathcal{T}_{k+1}$. The overall learning process of ER-PASS is summarized in Algorithm 1.
Algorithm 1 Learning process of ER-PASS
Input: $\{\mathcal{D}_k\}_{k=1}^{K}$ ▹ Dataset corresponding to each task
Require: $f_{\theta}$ ▹ Neural network
Initialize: $\mathcal{M}_0 \leftarrow \emptyset$ ▹ Memory buffer
1: Define $\tilde{\mathcal{D}}_k$ as the training set used for task k
2: for task $k = 1, \ldots, K$ do
3:  if $k = 1$ then
4:   $\tilde{\mathcal{D}}_1 \leftarrow \mathcal{D}_1$
5:   $\theta \leftarrow \theta_{\text{init}}$ ▹ Initialize model parameters
6:  else
7:   $\tilde{\mathcal{D}}_k \leftarrow \mathcal{D}_k \cup \mathcal{M}_{k-1}$
8:   $\theta \leftarrow \theta_{k-1}$
9:  end if
10:  Define $B$ as the total number of mini-batches in $\tilde{\mathcal{D}}_k$
11:  for $b = 1, \ldots, B$ do
12:   $\theta \leftarrow \theta - \eta \nabla_{\theta}\mathcal{L}$ ▹ Update model
13:  end for
14:  $\mathcal{M}_k \leftarrow \mathrm{Select}(\tilde{\mathcal{D}}_k, f_{\theta_k})$ ▹ Update memory buffer
15: end for
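The control flow of Algorithm 1 can be condensed into the following skeleton; `train_fn` and `select_fn` are hypothetical stand-ins for mini-batch training and the selection procedure of Section 3.3:

```python
def er_pass(datasets, train_fn, select_fn, budget):
    """Skeleton of the ER-PASS task loop (a sketch, not the authors' code).
    train_fn(data, params) -> params and select_fn(data, params, budget)
    -> buffer samples are placeholders for the real routines."""
    memory, params = [], None
    for dataset in datasets:
        combined = list(dataset) + memory             # unified training set
        params = train_fn(combined, params)           # mini-batch training
        memory = select_fn(combined, params, budget)  # update memory buffer
    return params, memory

# Dummy stand-ins, used only to show the data flow
train = lambda data, p: (p or 0) + len(data)   # "training" counts samples seen
select = lambda data, p, b: list(data)[:b]     # keep the first b samples
params, memory = er_pass([[1, 2, 3], [4, 5]], train, select, budget=2)
```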
3.3. Performance-Aware Submodular Sample Selection
In ER-PASS, the samples stored in the memory buffer from previous tasks are essential to preserving knowledge. To ensure effective knowledge retention, we propose a performance-aware submodular sample selection strategy that jointly considers both feature-level diversity and task-specific prediction performance. This strategy promotes the retention of diverse, representative, and informative samples, thereby improving stability and mitigating catastrophic forgetting during the learning process.
For each candidate sample $(x_i, y_i) \in \tilde{\mathcal{D}}_k$, we extract the feature representation $z_i \in \mathbb{R}^d$ by applying global average pooling to the encoder output of the trained model $f_{\theta_k}$. Here, $d$ denotes the dimensionality of the feature space, which corresponds to 512 for UNet and 2048 for DeepLabV3+. Specifically, we use the output of the final downsampling block in the UNet encoder and the output of layer 4 in the ResNet50 backbone of DeepLabV3+ as the feature map. Each feature vector is then $\ell_2$-normalized to ensure consistent scaling, using the following equation:

$$\hat{z}_i = \frac{z_i}{\lVert z_i \rVert_2}$$
We also compute a task-specific evaluation score $e_i$, such as the Intersection-over-Union (IoU) or mean Intersection-over-Union (mIoU) between the model’s prediction and the ground truth. This score serves as a performance-based weight during sample selection.
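For the binary building segmentation setting, such a per-sample score might be computed as in the minimal sketch below; the convention of returning 1.0 for two empty masks is our assumption, not specified in the text:

```python
import numpy as np

def binary_iou(pred, target):
    """IoU between a predicted and a ground-truth binary mask (sketch)."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention for two empty masks
    return np.logical_and(pred, target).sum() / union

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = binary_iou(pred, target)  # intersection 1, union 2
```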
Let $V = \{\hat{z}_1, \ldots, \hat{z}_n\}$ denote the set of normalized feature vectors extracted from all candidate samples. The intra-similarity of a candidate sample $\hat{z}_i$ is defined as the total cosine similarity between its feature vector and those of all other vectors in $V$:

$$\mathrm{sim}_{\mathrm{intra}}(i) = \sum_{j=1}^{n} \hat{z}_i^{\top} \hat{z}_j$$
This term reflects the alignment of $\hat{z}_i$ with the candidate sample distribution in the feature space, promoting the selection of representative and diverse samples.
Let $S \subseteq V$ be the set of feature vectors corresponding to the samples already selected into the memory buffer. The inter-similarity of $\hat{z}_i$ with respect to $S$ is then defined as:

$$\mathrm{sim}_{\mathrm{inter}}(i) = \sum_{\hat{z}_j \in S} \hat{z}_i^{\top} \hat{z}_j$$
This measures the redundancy of $\hat{z}_i$ relative to the selected set. A lower inter-similarity indicates that the sample introduces novel information, contributing to buffer diversity.
We define the selection score for each candidate $\hat{z}_i$ as the submodular gain, i.e., the difference between intra- and inter-similarity, weighted by the task-specific evaluation score:

$$g_i = e_i \cdot \big(\mathrm{sim}_{\mathrm{intra}}(i) - \mathrm{sim}_{\mathrm{inter}}(i)\big)$$
At each iteration, the candidate $\hat{z}_{i^*}$ with the highest score is greedily selected and added to the memory buffer:

$$i^{*} = \operatorname*{arg\,max}_{i \notin S} g_i, \qquad \mathcal{M}_k \leftarrow \mathcal{M}_k \cup \{(x_{i^*}, y_{i^*})\}$$
This process continues until a predefined budget—such as a fixed number of samples or percentage of the dataset—is reached. By jointly optimizing representativeness (via intra-similarity), redundancy reduction (via inter-similarity), and task relevance (via evaluation score), the proposed strategy constructs a memory buffer that is both submodular-optimal and performance-sensitive. Notably, since samples are dynamically selected based on their scores after the completion of each task, there is no need to pre-allocate memory per task. This design ensures that important samples are naturally retained in the buffer and are not mechanically displaced due to memory constraints, thereby effectively supporting continual learning. The detailed procedure is described in Algorithm 2.
Algorithm 2 Performance-aware submodular sample selection
Input: $\tilde{\mathcal{D}}_k$, $f_{\theta_k}$ ▹ Dataset and trained model corresponding to $\mathcal{T}_k$
Require: N ▹ Memory budget (number of samples to select)
Output: $\mathcal{M}_k$ ▹ Updated memory buffer
1: Initialize $\mathcal{M}_k \leftarrow \emptyset$, $S \leftarrow \emptyset$
2: Extract features $z_i$ from $f_{\theta_k}$ for each sample
3: Compute normalized features: $\hat{z}_i = z_i / \lVert z_i \rVert_2$
4: Compute evaluation score $e_i$ between the prediction and ground truth for each sample
5: Let $V = \{\hat{z}_1, \ldots, \hat{z}_n\}$, where n is the total number of samples in $\tilde{\mathcal{D}}_k$
6: Compute intra-similarity: $\mathrm{sim}_{\mathrm{intra}}(i) = \sum_{j=1}^{n} \hat{z}_i^{\top} \hat{z}_j$
7: for $m = 1$ to N do
8:  for $i = 1$ to n do
9:   if $S = \emptyset$ then
10:   $g_i = e_i \cdot \mathrm{sim}_{\mathrm{intra}}(i)$ ▹ Only intra-similarity
11:  else
12:   Compute inter-similarity: $\mathrm{sim}_{\mathrm{inter}}(i) = \sum_{\hat{z}_j \in S} \hat{z}_i^{\top} \hat{z}_j$
13:   $g_i = e_i \cdot (\mathrm{sim}_{\mathrm{intra}}(i) - \mathrm{sim}_{\mathrm{inter}}(i))$
14:  end if
15:  Skip $i$ if $\hat{z}_i \in S$ ▹ Exclude already selected samples
16:  end for
17:  $i^{*} = \arg\max_{i} g_i$
18:  $\mathcal{M}_k \leftarrow \mathcal{M}_k \cup \{(x_{i^*}, y_{i^*})\}$, $S \leftarrow S \cup \{\hat{z}_{i^*}\}$
19: end for
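Assuming the features and per-sample evaluation scores have been precomputed, the greedy loop of Algorithm 2 can be sketched compactly with NumPy; `select_memory` is a hypothetical name, not the authors' implementation:

```python
import numpy as np

def select_memory(features, scores, budget):
    """Greedy performance-aware submodular selection (sketch of Algorithm 2).
    features: (n, d) encoder features; scores: (n,) IoU-like per-sample
    evaluation scores; returns the indices chosen for the memory buffer."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)  # l2-normalize
    sim = z @ z.T                    # pairwise cosine similarities
    intra = sim.sum(axis=1)          # alignment with all candidates
    selected = []
    for _ in range(budget):
        # redundancy w.r.t. already selected samples (zero on the first pass)
        inter = sim[:, selected].sum(axis=1) if selected else np.zeros(len(z))
        gain = scores * (intra - inter)  # performance-weighted submodular gain
        gain[selected] = -np.inf         # exclude already selected samples
        selected.append(int(np.argmax(gain)))
    return selected

# Toy usage: two tight feature clusters; with equal scores the greedy
# selection should cover both clusters rather than pick near-duplicates.
feats = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
chosen = select_memory(feats, np.ones(4), budget=2)
```

The inter-similarity term penalizes candidates close to what is already in the buffer, which is why the toy example ends up with one sample from each cluster.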
5. Results
5.1. Building Segmentation (Strict DIL Setting)
As previously described, our incremental learning setup employs a step-wise training procedure across four domains in the following order: Potsdam, LoveDA, DeepGlobe, and GID. The model is initially trained on Potsdam in Step 1 with randomly initialized weights. In Steps 2 to 4, the model is incrementally updated by leveraging the weights learned from the previous step, thereby facilitating knowledge transfer across domains. Under this setup, we conducted a downstream evaluation on building segmentation. Experiments were primarily conducted using UNet, with additional comparisons performed using DeepLabV3+.
Table 2 presents the IoU performance at each incremental step using UNet for building segmentation, along with evaluation metrics at Step 4. At each step, the reported IoU values reflect the model’s performance on all previously encountered domains, evaluated using the model trained on the current task. Models trained via single-task learning lack adaptability to unseen domains and demonstrate satisfactory performance only within the domain on which they were trained. This limitation becomes particularly evident at Step 4, where performance on other domains significantly degrades—primarily caused by substantial domain shifts. While fine-tuning facilitates some knowledge transfer, its performance gains on previous tasks over single-task learning remain marginal, suggesting the occurrence of catastrophic forgetting. Notably, even continual learning methods such as EWC and LwF fail to provide meaningful improvements, yielding results comparable to naive fine-tuning and demonstrating limited ability to mitigate forgetting. In contrast, ER demonstrates better performance in terms of BWT, suggesting a stronger capability to retain prior knowledge. It also achieves higher AIA values compared to other baseline methods. However, comparison with single-task learning reveals that ER exhibits lower performance on the current task at each step, indicating that the high AIA is primarily due to the preservation of previous knowledge rather than effective learning of new tasks.
In theory, joint learning serves as the performance upper bound for incremental learning approaches. Our experimental results demonstrate that the proposed method achieves performance comparable to joint learning and, in some cases, even surpasses it. Furthermore, our method consistently outperforms baseline methods. In particular, compared to ER, which achieved the best performance among the benchmarks, our method achieves improvements of 0.1613 in AIA and 0.332 in BWT at Step 4. These results indicate that our method not only mitigates catastrophic forgetting effectively but also maintains reasonable adaptability to new tasks. A similar trend is observed in Table 3, where DeepLabV3+ is used as the segmentation model instead of UNet. This consistent improvement across different architectures highlights that the effectiveness of the proposed algorithm is not limited to a specific model, demonstrating its model-agnostic nature.
To complement the quantitative results, Figure 4 and Figure 5 provide qualitative visualizations that further demonstrate the effectiveness of the proposed method.
Figure 4 shows segmentation results from each domain using the UNet model trained up to Step 4, illustrating representative samples across domains. As expected, the segmentation performance on GID is relatively accurate since the model was directly trained on that domain. However, all benchmark methods exhibit noticeable degradation on previously learned domains. In particular, EWC and LwF fail to produce meaningful predictions, indicating that severe catastrophic forgetting has occurred. In contrast, ER performs better than these methods and partially mitigates forgetting. The proposed method maintains segmentation performance on previous domains at a level comparable to joint learning, clearly demonstrating its effectiveness in mitigating catastrophic forgetting.
Beyond the cross-domain comparison at Step 4, Figure 5 presents step-by-step segmentation results for the first domain, analyzing how each method retains prior knowledge throughout the incremental learning process. In the case of EWC and LwF, prior knowledge is relatively well preserved up to Step 2. However, noticeable forgetting begins at Step 3, and severe catastrophic forgetting becomes evident by Step 4. ER maintains acceptable performance up to Step 3, with slight degradation observed at Step 4. Consistent with earlier findings, the proposed method maintains stable performance across all steps, further demonstrating its robustness against catastrophic forgetting. For completeness, the qualitative visualization results for the DeepLabV3+ model are provided in Appendix A.
5.2. LULC Classification (Relaxed DIL Setting)
To further evaluate the generalizability of the proposed method in practical applications, we conduct experiments on LULC classification. Unlike building segmentation, which follows a strict DIL setting with consistent class labels, LULC classification represents a more relaxed and realistic scenario where new classes may emerge over time.
Table 4 and Table 5 present LULC classification results for UNet and DeepLabV3+, respectively. Similar to the building segmentation experiments, these tables provide step-wise mIoU scores and final-step metrics, with overall trends following a comparable pattern. Nonetheless, several noteworthy differences can be observed. First, the single-task performance on LoveDA is relatively low. As observed in Figure 3, this can be attributed to the highly imbalanced class distribution in LoveDA, which causes certain classes, such as agriculture and barren, to be frequently misclassified as background. In addition, EWC and LwF, which failed to retain prior knowledge in the earlier building segmentation experiments (yielding near-zero scores at Step 4), show relatively better performance in LULC classification. This may be attributed to greater semantic consistency or overlap among LULC classes across domains, which facilitates better classification performance.
Another interesting observation concerns the behavior of ER. While ER showed substantial forgetting at Step 4 in building segmentation, its performance in LULC classification begins to degrade earlier, at Step 2, but then remains relatively stable in the subsequent steps. Building segmentation involves binary and consistent labels, allowing previous knowledge to be effectively reinforced through the memory buffer. In contrast, LULC classification involves many classes, with new classes introduced at each step. Because the batch-level replay mechanism in ER integrates previous and current samples within every batch, learning new classes persistently interferes with previously learned knowledge, which explains the severe forgetting observed at Step 2. In comparison, the proposed method integrates previous and current samples at the dataset level before batching, thereby reducing such interference and enabling more stable optimization even when new classes are introduced. This is supported by the experimental results, which show that the proposed method consistently mitigates forgetting, as also observed in the building segmentation experiments.
Figure 6 illustrates the LULC classification results obtained with the UNet model trained up to Step 4. The results for EWC and LwF on previous domains show that predictions are primarily focused on classes such as forest, grassland, water, and agriculture, which correspond to the classes present in the GID. However, both methods largely fail to predict classes absent from GID, such as cars, roads, and barren areas, suggesting substantial forgetting of these classes. Notably, despite the building class being commonly present across all domains, both EWC and LwF still perform poorly on this class. This is likely due to substantial domain discrepancies, which hinder consistent recognition. In contrast, ER partially retains classes absent from GID, demonstrating better knowledge preservation across diverse classes and domains.
Figure 7 provides a detailed visualization of the forgetting dynamics across incremental steps. As confirmed by the quantitative results, substantial forgetting occurs at Step 4 in the building segmentation, whereas in the LULC classification, it emerges earlier, starting from Step 2. Specifically, in the case of ER, prediction performance on the grassland class noticeably degrades at Step 2, which corresponds to the second domain, LoveDA, where this class is absent. This implies that classes not present in the current task are more susceptible to forgetting. Similarly, at Step 3, performance on the road class deteriorates, consistent with its absence in the DeepGlobe, and this trend continues through Step 4. Notably, at Step 4, a performance degradation is also observed for the building class, which is present in the GID, aligning with earlier observations in the building segmentation experiments. In contrast, the proposed method effectively preserves a broader range of class representations across incremental steps, mitigating forgetting not only for classes currently being learned but also for those learned in previous steps. These findings suggest that the proposed method generalizes well beyond binary segmentation to more general semantic segmentation problems such as LULC classification, underscoring its robustness and scalability under both strict and relaxed DIL settings. The corresponding DeepLabV3+ results are provided in Appendix A.
7. Conclusions
In this paper, we propose ER-PASS, an experience replay-based continual learning algorithm that leverages the strengths of joint learning and experience replay for DIL in remote sensing. ER-PASS integrates a performance-aware submodular sample selection strategy to enhance the stability of the learning process across evolving domains. We conducted experiments on two distinct remote sensing applications: building segmentation and LULC classification. In both applications, ER-PASS outperforms the benchmarks in terms of AIA and BWT, demonstrating its effectiveness in mitigating catastrophic forgetting and its generalizability across diverse applications. Furthermore, experiments with multiple model architectures, including UNet and DeepLabV3+, show that ER-PASS is model-agnostic and can be flexibly integrated into various network structures. Additional analyses confirm that the proposed sample selection strategy positively impacts adaptability and forgetting mitigation, with stable performance maintained when the sampling ratio is 0.5 or higher. Finally, ER-PASS exhibits reasonable efficiency in terms of training time and memory consumption. These results suggest that ER-PASS can be practically applied in real-world remote sensing scenarios.
Nevertheless, ER-PASS has several limitations. First, ER-PASS has only been evaluated on two representative applications—building segmentation and LULC classification—using solely optical imagery. Its applicability to a wider range of remote sensing tasks, such as change detection or super-resolution, as well as its generalizability to more heterogeneous remote sensing domains, such as hyperspectral or synthetic aperture radar imagery, remains to be validated. Second, the experiments were limited to domain sequences from high resolution to low resolution, and the impact of domain order was not fully analyzed. Therefore, performance across various domain orders should be evaluated in future work. Third, the submodular gain calculation involved in the sample selection process of ER-PASS includes matrix operations, which leads to increased computational complexity over successive incremental learning steps. Hence, efficient computation strategies for large-scale matrix operations need to be explored. Lastly, while ER-PASS was primarily compared with relatively classical methods, additional evaluations against recent state-of-the-art benchmarks are required for a more comprehensive assessment. Furthermore, as a model-agnostic learning algorithm, ER-PASS could be further explored for potential integration with architecture-based approaches proposed in recent studies.