DILRS: Domain-Incremental Learning for Semantic Segmentation in Multi-Source Remote Sensing Data

With the exponential growth in the speed and volume of remote sensing data, deep learning models are expected to adapt and continually learn over time. Unfortunately, the domain shift between multi-source remote sensing data from various sensors and regions poses a significant challenge. Segmentation models face difficulty in adapting to incremental domains due to catastrophic forgetting, which can be addressed via incremental learning methods. However, current incremental learning methods mainly focus on class-incremental learning, wherein classes belong to the same remote sensing domain, and neglect investigations into incremental domains in remote sensing. To solve this problem, we propose a domain-incremental learning method for semantic segmentation in multi-source remote sensing data. Specifically, our model aims to incrementally learn a new domain while preserving its performance on previous domains without accessing previous domain data. To achieve this, our model has a unique parameter learning structure that reparametrizes domain-agnostic and domain-specific parameters. We use different optimization strategies to adapt to domain shift in incremental domain learning. Additionally, we adopt multi-level knowledge distillation loss to mitigate the impact of label space shift among domains. The experiments demonstrate that our method achieves excellent performance in domain-incremental settings, outperforming existing methods with only a few parameters.


Introduction
The deployment of deep learning models on edge devices is emerging as a significant trend for the future interpretation of Earth intelligence [1]. By processing real-time data locally, this approach eliminates the need to transfer data to cloud devices, saving significant processing time, data transmission, and resource consumption. This technique has matured in the field of autonomous driving [2,3]. Meanwhile, the faster and more abundant acquisition of remote sensing data has created new standards and requirements for deep learning models, making it essential for these models to adapt and learn continuously over time. However, most existing deep learning models for semantic segmentation tasks [4,5] are trained offline and statically deployed [6,7]. These models require large amounts of data for long-term training and can only be applied to a specific domain, with no provision for adapting or expanding over time. When new domain data becomes available, the models cannot maintain their performance on the original domain, a problem known as catastrophic forgetting [8,9]. Figure 1 shows an example of catastrophic forgetting when semantic segmentation models are applied to continual remote sensing domains. Therefore, it is critical to develop deep learning models [10,11] that can adapt to changing data and continuously learn to keep up with the evolving requirements of Earth intelligence interpretation.
Figure 1. An example of catastrophic forgetting in a continual domain sequence. We deploy two segmentation models statically, where pre-trained models are sequentially fine-tuned (FT) [12] on five different domains (D1-D5) in the field of continual remote sensing. This domain sequence closely simulates the data collected by edge devices, such as satellite and aerial sensors capturing images of urban and rural regions. The performance of the models is evaluated from two perspectives. Firstly, a bar chart demonstrates the performance of the DeepLabV3+ model when fine-tuned on D1-D5 (i.e., GID [13], BDCI2020 [14], deepglobe [15], LoveDA-urban [16], and LoveDA-rural [16]). The chart reveals a degradation in performance for the previous domains. Secondly, two line diagrams illustrate the performance on the GID domain on different tasks for the DeepLabV3+ [4] and Erfnet [5] models. The results show that models trained on new domains achieve good performance, while the performance of the models on previous domains gradually decreases. For a more detailed description, please see Section 4.4.
The existing training schemes for deep learning models in continual domains can be summarized as follows [17]: (a) separate training in a single domain, storing each model, and then flexibly switching between different domains; (b) storing each domain's data, then joint training with multiple domains when deploying on a specific domain; (c) sequential training with incremental domains, adopting a domain adaptation method to improve the performance in the target domain. However, deploying deep learning models on edge devices requires the consideration of operation rate, storage pressure, and data privacy. The aforementioned schemes do not meet these requirements. Instead, an extensible lightweight model with an incremental learning ability that can maintain good performance across all cumulative domains is more suitable for applications. Incremental learning (also known as continual learning) [8,9] has been proposed for this purpose.
The study of incremental learning in remote sensing is still in its early stages and is primarily focused on task-incremental learning or class-incremental learning, as evidenced by the works of [18][19][20][21][22]. However, domain-incremental learning is more relevant in practical applications, since deep learning models are expected to learn continual remote sensing domains when deployed on edge devices. Despite the importance of domain-incremental learning, it has received relatively little attention. In this regard, our objective is to address this gap and explore methods for domain-incremental learning that can improve performance in continual domains. Additionally, we choose semantic segmentation as our downstream task.
In the domain-incremental setting, catastrophic forgetting can be attributed to two main factors: domain shift and label space shift. Figure 2 illustrates the properties of continual remote sensing domains, including: (1) inconsistent class distributions of different regions; and (2) spatial resolutions and spectral divergence of the specific object category in different sensors. Due to the diversity in sensors and regions, domain shift can occur as the image capture conditions change, such as variations in object scales, complex backgrounds, spatial resolution, spectral divergence, and weather conditions. Furthermore, previously unseen classes in new geographical regions and inconsistent class distributions can both contribute to label space shift. As a result, addressing domain shift and label space shift is crucial to achieving optimal segmentation model performance in domain-incremental learning.
Figure 2. A brief illustration of the multi-source remote sensing data, including samples from multi-sensor (satellite and aerial sensors) and multi-region (urban and rural regions) data. Taking the building and agriculture categories as examples, the upper half of the graph shows the visual difference between sensors of different spatial resolutions, reflected in object scales and styles. The lower half shows the spectral divergence of building and agriculture in a series of images from different domains, by the mean and standard deviation in the red, green, and blue wavelengths. To simplify our research, we regard the remote sensing domains as images from multiple sensors and regions.

Therefore, we propose the domain-incremental learning for remote sensing (DILRS) framework as a solution in continual domains. Drawing inspiration from the effectiveness of universal parametrization in multi-domain learning [23,24], we find that model parameters can be divided into those that learn domain-specific features and those that represent shared features. In domain-incremental learning settings, retaining domain-agnostic parameters and switching domain-specific parameters can be an effective approach to maintaining performance in current and previous domains with fewer parameters. To handle the label space shift problem during model updates in domains with non-overlapping new classes, we propose a multi-level knowledge distillation loss, which builds on knowledge distillation strategies [25][26][27]. Our main contributions can be summarized as follows:
(1) We define the problem of domain-incremental learning for remote sensing and propose a dynamic framework specific to this problem that does not use previous training data and labels. Experimental results demonstrate the excellent performance of our method with fewer parameters.
(2) To alleviate the domain shift among incremental domains, we adopt domain residual adapter modules in the structure, using different optimization strategies for domain-specific and domain-agnostic parameters.
(3) Considering the different label space shifts, a class-specific knowledge distillation loss is applied to distill the common class knowledge between domains, and we also use a distillation loss in the intermediate feature space to avoid background class interference.

Related Work
While remote sensing visual perception models such as image scene classification and segmentation models have been thoroughly studied, incremental learning in remote sensing is still in its early stages. This section aims to provide an overview of incremental learning, followed by a focus on domain-incremental learning and incremental learning for semantic segmentation. In particular, we will briefly review the relevant research in remote sensing.

Incremental Learning
Incremental learning [8,9], which aims to tackle catastrophic forgetting as a model learns and extends, can be divided into three scenarios: task-incremental learning, class-incremental learning, and domain-incremental learning. Three categories of incremental learning technologies have been proposed, including replay-based, regularization-based, and parameter isolation-based strategies. Replay-based methods [28] involve storing a portion of old data or training additional generators to produce pseudo-data for replay, followed by joint training with new data. However, this approach may raise data privacy concerns and create storage pressures. To address the forgetting of previous knowledge, regularization-based strategies [12,29] typically employ knowledge distillation or regularization terms in loss functions. Compared with replay-based strategies, regularization-based strategies do not require the storage of previous data. However, the model is optimized based on the previous task, which could lead to the final model not converging towards the globally optimal solution and to unsatisfactory performance. Parameter isolation-based strategies [30,31], on the other hand, typically isolate or freeze important model parameters from previous tasks and allow models to introduce new parameters to prevent forgetting in the new task. Given that remote sensing data storage is impractical, and we aim to maximize effectiveness, parameter isolation-based methods are the most suitable approach for our task [19]. Our proposed model is based on parameter isolation-based methods.

Domain-Incremental Learning
Domain-incremental learning refers to model learning from continual domains of changing distribution, where nonstationarity is reflected in background, blur, noise, and other factors [8,9]. However, there is a gap between this definition and real-world applications. A relevant example of domain-incremental learning in the real world is an agent that needs to learn to survive in different environments. Classic scenes include autonomous driving [23,32], person ReID [33], and crowd counting [34], among others [35]. For instance, Garg et al. [23] proposed a dynamic semantic segmentation model, which is effective in three driving scenes from visually disparate geographical regions. Mirza et al. [32] presented a robust object detection method for autonomous driving that learns incrementally across varying weather conditions. In fact, the research setting of domain-incremental learning is also applicable to remote sensing. Multi-source remote sensing data collected by revisiting multiple satellites implies that remote sensing intelligent interpretation models will require higher domain-incremental learning capabilities. To the best of our knowledge, there is no related research on domain-incremental learning for remote sensing.
Additionally, domain adaptation [36] and multi-domain learning [24,37] are closely related to our research.Multi-domain learning, with access to all domains' data, aims to retain good performance in all domains.Domain adaptation utilizes labeled data in the source domain to maximize performance in the target domain.The difference with our work is reflected in the model's goal, the availability of different domains, and the diversity in the label space.

Incremental Learning for Semantic Segmentation
Recently, the limitation of the offline setting used in semantic segmentation models has become a cause for concern. Incremental learning strategies for semantic segmentation have been proposed [25][26][27]. These studies are based on the assumption that models update for new, unseen categories in the class-incremental setting, where the background class or semantic shift is the primary challenge. Cermelli et al. [27] highlighted that the semantic distribution shift exists in the non-overlapping new classes of each learning step and proposed a distillation-based framework specifically to solve this issue. Klingner et al. [26] introduced a knowledge distillation loss without relying on previous data in class-incremental learning for semantic segmentation.
However, compared with natural images, continual semantic segmentation in remote sensing is a relatively new field with few papers [20][21][22][38][39][40][41] studying the problem. Tasar et al. [22] were the first to study the incremental learning scenario of remote sensing segmentation, while Shan et al. [38] proposed two effective modules embedded in the proposed class-incremental segmentation framework without access to previous data. However, these experimental settings considered only a class-incremental semantic shift in the same domain. In reality, the semantic shift and domain shift may coexist in remote sensing applications, which motivates us to study semantic segmentation in the domain-incremental learning setting.

Method
This section provides an overview of our work, DILRS, which focuses on domain-incremental learning for remote sensing. Firstly, we present the problem formulation of domain-incremental learning that we use in our work. Next, we introduce the overall framework, DILRS, and provide a detailed description of its key component, the domain residual adapter module. Finally, we discuss the proposed loss function and the optimization strategy that we developed.

Problem Formulation
The domain-incremental learning setting assumes the presentation of $N$ tasks, each corresponding to $n$ training domains $D_k = \{(x_k, y_k)\}_{k=1}^{n}$, where $x_k$ and $y_k$ represent a data sample and its corresponding label, respectively. In contrast to current incremental semantic segmentation research [40], where only a semantic shift exists at each step, our experiments consider the coexistence of domain shift and label shift between $D_k$ and $D_{k-1}$. Additionally, $y_k$ may contain overlapping classes and new classes compared to $y_{k-1}$, and vice versa. We use the original dataset labels as domain labels and a multi-head decoder structure to prevent semantic shifts in the background class. The classifier in each decoder predicts the category of pixels in different domains independently.
The overall training process of the domain-incremental semantic segmentation model is illustrated in Figure 3. The segmentation model is trained incrementally on the domain sequence. At step $k$ of training, we train model $M_k(x_k, k)$ on domain $D_k = (x_k, y_k)$. We assume that the old data samples $\sum_{i=1}^{k-1} D_i$ become unavailable while $D_k$ is provided. Our model aims to adapt to each new domain without degrading its performance on previous ones. During the inference phase, we have access to the ID of each domain, similarly to task-incremental learning. We evaluate the performance of the model at the end of the training sequence for the current and previous domains $\sum_{i=1}^{k} D_i$. To simplify the introduction, we refer to the domain, dataset, and task at step $k$ as $D_k$.

Proposed Framework
The DILRS architecture consists of two components: a shared encoder $E$ and $K$ parallel domain-specific decoders $C_k$, as depicted in detail in Figure 4. The shared encoder is based on the lightweight efficient residual factorized network (Erfnet) [5], which incorporates domain residual adapter (DRA) modules. These DRA modules learn both domain-specific features and domain-agnostic features. For the domain $D_k = (x_k, y_k)$, our framework learns a mapping from the input $x_k$ to the prediction $\hat{y}_k$, where $W_k = \{\alpha_k, C_k\}$ and $W_s$ are the domain-specific and domain-agnostic parameters of the model, respectively, and $\alpha_k$ are the domain-specific parameters in the encoder $E$.

Domain Residual Adapter Module
The domain residual adapter (DRA) module is a critical component in the DILRS architecture, as it is responsible for reparametrizing the network into domain-specific and domain-agnostic parameters. As shown in Figure 5, the features in the $j$th module are denoted as $u$ from the previous module. Specifically, $\hat{u}^{j-1}_k$ is formed by a concatenation of the domain-agnostic part $g^j(\cdot)$ and the domain-specific part $h^j_k(\cdot)$, where $g^j$ is a domain-agnostic structure across all domains, and $f^j_k$ and $h^j_k$ constitute a parallel domain-specific residual adapter structure for each domain. Among them, $g^j$ is composed of [3 × 1] and [1 × 3] convolutional layers, followed by a ReLU activation function. As for the domain-specific structure, h

Loss
As mentioned above, the domain shift is the primary challenge when the model adapts to continual domains. Additionally, the label space shift between these domains is another factor that needs to be considered. Recent studies [26,27] of incremental learning for semantic segmentation have focused on the semantic shift of the background class, where the old classes of the previous step are divided into the background class at each step, and all classes belong to the same domain in their setting. Although there is no domain shift in this research, the methods that adapt knowledge distillation strategies to solve the semantic shift are worth referencing, as they are a common strategy to transfer knowledge from the old model into the new one. However, in the DILRS setting, a naive application of previous knowledge distillation loss functions would not suffice.
Considering the fact that domain shift coexists with label space shift in DILRS, we revisit the classical knowledge distillation and optimization strategies by introducing a multi-level class-specific knowledge distillation function and different optimization for domain-specific and domain-agnostic parameters.
(1) Class-specific knowledge distillation function. To optimize the domain-agnostic parameters across the domain sequence, we adopt the knowledge distillation loss. At step $k$, we initialize the domain-agnostic parameters from $W_s$ of $M_{k-1}$ and then distill between the output predictions of the new and old models. Considering the fact that we only have data of domain $D_k = (x_k, y_k)$, we input $x_k$ into both the current model $M_k$ and the previous model $M_{k-1}$ for each previous task $i$, $0 < i < k$, and denote by $q^{new}_i$ and $q^{old}_i$ the output probabilities of the current model $M_k$ and the previous model $M_{k-1}$, respectively. We define the class-specific knowledge distillation loss $L_D$, which is computed over all previous tasks. The detailed diagram of $L_D$ can be seen in Figure 6. We design the class-specific knowledge distillation strategy on both spatial and channel dimensions to further mitigate the influence of the label space shift. Since we use the current domain $D_k$ to replace all previous domains $\sum_{i=1}^{k-1} D_i$, we empirically found that the data shift in the background class and non-overlapping classes leads to worse distillation results. Hence, we only distill knowledge in the overlapping classes between the current domain $D_k$ and each previous domain $D_i$, $0 < i < k$, separately. In this loss, $\mu_k \in \{0, 1\}^{C \times H \times W}$ denotes the pixel-wise elements of a binary mask obtained from the one-hot encoded label $y_k$, $C_\mu$ represents the overlapping classes between the current domain $D_k$ and the previous domain $D_i$, $0 < i < k$, and $N_k$ is the set of all pixels of domain $D_k$ contributing to the loss.
(2) Knowledge distillation function in feature space. Our model can be decomposed into one shared encoder $E$ and a multi-head decoder $C_k$. As mentioned above, the parameters of the decoder $C_k$ are separate domain-specific parameters, while the parameters of the encoder mix domain-specific parameters $\alpha_k$ and domain-agnostic parameters $W_s$. Inspired by [25,38], we try to preserve knowledge by keeping the encoders $E$ of models $M_k$ and $M_{k-1}$ at a similar representation capability in feature space. We compute this with the feature-space distillation loss $L_F$.
(3) Overall loss. Additionally, during the incremental step $k$, we use the cross-entropy loss $L_{CE}$ to train model $M_k$, where $\hat{y}_k$ denotes the prediction output of $M_k$ and $\psi_k$ is the softmax cross-entropy loss; we set class weights for each category to better handle the category imbalance problem. The total loss is defined as a weighted sum of these three losses, $L_{total} = \lambda_1 L_{CE} + \lambda_2 L_D + \lambda_3 L_F$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the cross-entropy loss $L_{CE}$, the class-specific knowledge distillation function $L_D$, and the knowledge distillation function in feature space $L_F$. The details of the parameter settings $\lambda_1$, $\lambda_2$, and $\lambda_3$ can be seen in Section 4.5.
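The masking idea behind the class-specific distillation and the weighted total loss can be illustrated with a minimal pure-Python sketch. The exact form of the distillation term is not given here, so the sketch assumes a cross-entropy between old and new probabilities restricted to the overlapping classes, averaged over pixels whose current label is an overlapping class. The function names, the use of class names as dictionary keys, and the `eps` constant are all hypothetical.

```python
import math


def class_specific_kd(q_old, q_new, labels, overlap, eps=1e-12):
    """Assumed form of a class-specific distillation term.

    q_old, q_new: one dict per pixel, mapping class name -> probability
                  (outputs of the previous and current model for one old head).
    labels:       ground-truth class name per pixel in the current domain.
    overlap:      set of class names shared by the current and previous domain.
    """
    total, count = 0.0, 0
    for p_old, p_new, y in zip(q_old, q_new, labels):
        if y not in overlap:
            # Binary mask: background and non-overlapping pixels are skipped.
            continue
        # Cross-entropy restricted to the overlapping classes (an assumption).
        total += -sum(p_old[c] * math.log(p_new[c] + eps) for c in overlap)
        count += 1
    return total / count if count else 0.0


def total_loss(l_ce, l_d, l_f, lam1, lam2, lam3):
    """Weighted sum of cross-entropy, class-specific KD, and feature-space KD."""
    return lam1 * l_ce + lam2 * l_d + lam3 * l_f
```

Pixels labeled with a class that does not overlap contribute nothing, which mirrors the motivation in the text: distilling on background or non-overlapping classes was found to hurt. The weights passed to `total_loss` are left as required arguments because their values are reported elsewhere in the paper.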
(4) Optimization strategy. At step $k$ of our work, we take different optimization strategies for the domain-specific parameters $W_k = \{\alpha_k, C_k\}$ and the domain-agnostic parameters $W_s$. We initialize $W_k$ from $W_{k-1}$, while the output classification layer is randomly initialized considering the label space shift between $y_k$ and $y_{k-1}$. Additionally, all previous domain-specific parameters $\sum_{i=1}^{k-1} W_i$ are frozen at step $k$. Similarly, the domain-agnostic parameters $W_s$, which are shared across all domains, are initialized from the model $M_{k-1}$. Knowledge distillation in feature space and on class-specific output predictions makes the current model preserve previous domain knowledge in the domain-agnostic weights $W_s$ as much as possible. In contrast, the cross-entropy loss trained on each domain improves the domain-specific performance. In addition, the domain-specific paths of the multiple domains are intertwined with the domain-agnostic parameters. When evaluating a domain, only the corresponding domain-specific path, together with the domain-agnostic parameters, is activated in the forward pass. Overall, the training process is summarized in Algorithm 1.

Algorithm 1: Process of learning a new domain at the kth step in DILRS
Require: ...
Freeze: domain-specific weights of all previous domains, $\sum_{i=1}^{k-1} W_i$
for epochs do
    ...
    Compute knowledge distillation losses $L_D$, $L_F$
    Compute loss $L_{total}$
    Update: $M_k$ and $M_s$ at learning rate lr
end for
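The freezing and per-group learning rates used at step $k$ can be sketched as a small helper that partitions parameters by name. This is only an illustration under stated assumptions: the naming convention (`shared.` for $W_s$, `domain<i>.` for $W_i$) and the function name are hypothetical, and the default learning rates are the values reported in the implementation details of the paper.

```python
def build_param_groups(named_params, k, lr_specific=5e-4, lr_agnostic=5e-6):
    """Split parameters for step k into trainable groups and a frozen list.

    named_params: dict mapping a parameter name to its parameter object.
    Hypothetical naming convention: "shared." prefixes the domain-agnostic
    parameters W_s, and "domain<i>." prefixes the domain-specific W_i.
    """
    specific, agnostic, frozen = [], [], []
    for name, param in named_params.items():
        if name.startswith("shared."):
            agnostic.append(param)        # W_s: updated slowly for stability
        elif name.startswith(f"domain{k}."):
            specific.append(param)        # W_k: updated quickly for plasticity
        else:
            frozen.append(param)          # W_i of other domains: not updated
    groups = [
        {"params": specific, "lr": lr_specific},
        {"params": agnostic, "lr": lr_agnostic},
    ]
    return groups, frozen
```

The returned list of dictionaries has the per-parameter-group shape that PyTorch optimizers accept, so with real tensors it could be passed to `torch.optim.Adam(groups)`, and each entry of `frozen` could be excluded from training with `requires_grad_(False)`.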

Experiments
In this section, we first give an overview of the datasets, and then present the implementation details and evaluation metrics. Finally, we briefly introduce the compared methods and show the experimental results in detail from different perspectives.

Datasets
In the remote sensing field, there is currently no clear definition of what constitutes a 'domain', despite extensive research on domain adaptation [36]. For the purposes of our research, we have chosen several representative datasets [13][14][15][16] to form an experimental domain sequence. Table 1 presents the statistics of our chosen domain sequence, with all datasets resized to 256 × 256. The domains are incrementally ordered based on an increasing sensor resolution (D1-D5), and include a range of satellite (GF-2, GF-1/6, WorldView-2) and airborne sensors, covering various complex scenarios in different regions and countries. Given the regional diversity, we further divided the rural and urban areas of LoveDA [16] into two independent domains. Additionally, our domain sequence has non-overlapping categories, which are detailed in Table 1. The class distribution is shown in Figure 7, which highlights the differences and challenges posed by the domain-incremental learning setting. Our experimental setting involves simultaneous domain shift and label space shift. We will make our dataset available on our website at http://complex.ustc.edu.cn/ (accessed on 11 May 2023).

Implementation Details
We utilize Erfnet [5] as our model's backbone, with the integration of our domain residual adapter module. Specifically, the encoder embeds our domain residual adapter module, while each head of the multi-head decoder follows the Erfnet structure. We use the Adam optimizer and set the batch size to 36. In light of the data imbalance in LoveDA compared to the other datasets, we adopted data augmentation strategies such as random horizontal flip and rotation for LoveDA-rural and LoveDA-urban.
Considering the stability-plasticity trade-off problem in incremental learning, we use different learning rates for the domain-specific parameters $W_k$ and the domain-agnostic parameters $W_s$. In particular, $W_k$ is related to the plasticity on a new domain, while $W_s$ is tied to the stability on previous domains. There is an imbalance between representation learning on a new domain and representation maintenance on previous domains when using the same learning rate for $W_k$ and $W_s$. Experiments indicate that a ratio of 100 between the two learning rates obtains a good stability-plasticity trade-off. The learning rates of $W_k$ and $W_s$ are set to $5 \times 10^{-4}$ and $5 \times 10^{-6}$, respectively. Our experiments are implemented using PyTorch, and we use an NVIDIA Tesla V100 GPU.
In accordance with the evaluation metrics used in studies such as [19,23,34,43], we utilize $\Delta_m$ and BWT to evaluate the performance of our incremental learning model, DILRS. Specifically, $\Delta_m$ measures the average performance degradation compared to the single-task baseline $b$. Here, the mean intersection over union $mIoU_{T,t}$ denotes the evaluation accuracy of the incremental-learning segmentation model on task $t$, which can also be considered a domain. On the other hand, $mIoU_{b,t}$ represents the evaluation accuracy of the single-task baseline for task $t$, and $T$ denotes the total number of tasks. If $\Delta_m < 0$, it indicates that the performance of the incremental-learning model is worse on average than the single-task baseline, while $\Delta_m > 0$ indicates a better performance.
The backward transfer BWT metric measures the forgetting of the model on old tasks after training on a new task, which is particularly relevant for evaluating incremental learning methods. Theoretically, if BWT < 0, this means that the model learning a new task will improve the performance of the model on the previous tasks. In contrast, if BWT > 0, it denotes that the performance of previous tasks decreases when learning a new task, which is also known as catastrophic forgetting. Here, $mIoU_{T,t}$ represents the accuracy on task $t$ after learning task $T$.
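To make the two metrics concrete, the following sketch computes them from a table of mIoU values. The exact formulas are assumptions chosen to agree with the sign conventions stated above: $\Delta_m$ is taken as the mean relative change against the single-task baseline, and BWT as the mean drop between the accuracy on a task right after it was learned and its accuracy after the final step, so that positive values indicate forgetting. The function names and the example numbers are hypothetical.

```python
def delta_m(final_miou, baseline_miou):
    """Mean relative change versus the single-task baseline (assumed form).

    final_miou[t]:    mIoU on task t after the last training step.
    baseline_miou[t]: mIoU of the single-task model for task t.
    """
    ratios = [(m - b) / b for m, b in zip(final_miou, baseline_miou)]
    return sum(ratios) / len(ratios)


def bwt(miou):
    """Forgetting-style backward transfer (assumed form, needs >= 2 tasks).

    miou[s][t] is the mIoU on task t measured after training step s (t <= s).
    Positive values mean earlier tasks got worse, matching the text.
    """
    last = len(miou) - 1
    drops = [miou[t][t] - miou[last][t] for t in range(last)]
    return sum(drops) / len(drops)


# Made-up numbers for three tasks, purely for illustration.
history = [[0.60], [0.55, 0.70], [0.50, 0.66, 0.80]]
baseline = [0.60, 0.70, 0.80]
forgetting = bwt(history)
degradation = delta_m(history[-1], baseline)
```

With these made-up values, `forgetting` is positive and `degradation` is negative, which corresponds to a model that forgets part of the earlier tasks and ends up below the single-task baseline on average.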

Compared Methods
When training on multiple domains, several training paradigms are available, including disjoint training, joint training, and fine-tuning. We compare our proposed method with these baseline methods. Additionally, the three latest incremental learning methods are compared, including EWC [29], LwF [12], and ILTSS [25]. Note that all the compared methods are based on the Erfnet model.
(1) Benchmark Methods for Comparison. Disjoint (single-task) trains a separate model independently for each domain, which aligns with the i.i.d. (independent and identically distributed) assumption of model training. Joint-training (multi-task) trains a unified model by combining all domain data. Both disjoint and joint-training belong to the offline setting, and we record results when training until convergence on each new task. We use a multi-head decoder to train all accumulated datasets. Fine-tuning (FT) trains each domain incrementally by fine-tuning the pre-trained model on the new domain until convergence. Fine-tuning is a standard baseline in incremental learning, in which nothing is done to avoid forgetting. We used a single-head decoder in our experimental setting. Feature extraction (FE) freezes the weights of previous domains, which theoretically preserves the model's performance over the previous domains as much as possible. We fix all encoder weights and only train the new domain's decoder weights.
(2) Latest Incremental Learning Methods. There are only a few methods explicitly designed for domain-incremental learning. Given the similarity between our experimental setting and incremental learning for semantic segmentation, we compare our method with the three latest incremental learning methods, namely EWC [29], LwF [12], and ILTSS [25]. Since the data of the previous domain are not available in our setting, we do not consider replay-based incremental learning as a compared method. EWC [29] ranks the weights of the old task and then optimizes them differently depending on their importance. It does this by introducing the Fisher information matrix based on the analysis of sequential Bayesian estimation. LwF [12] was the first method to use knowledge distillation to prevent catastrophic forgetting when learning new tasks or classes. On the other hand, ILTSS [25] extends the knowledge distillation strategy in segmentation by adopting a feature-space loss.

Experimental Results
(1) Multi-source domain-incremental learning scenario. As mentioned previously, our incremental learning approach follows the domain sequence (D1-D5): GID [13]; BDCI2020 [14]; deepglobe [15]; LoveDA-urban [16]; and LoveDA-rural [16]. We begin by training a model on GID in step 1 and then incrementally train the same model on BDCI2020 in step 2, followed by the remaining domains in steps 3-5. Table 2 presents the performance evolution of different methods in the domain-incremental learning setting, with all domain test results recorded when the model achieves its best results on the new domain. Our results show that our method outperforms multi-task learning. While multi-task training can access all training data and is typically considered an upper bound, the differences in data distribution among the multiple domains in our experiment may affect each task's ability to achieve optimal results. As anticipated, fine-tuning (FT) performs poorly on the previous domains, indicating that the model entirely forgets the knowledge of old domains. Feature extraction (FE) cannot achieve satisfactory results in the new domain, demonstrating that freezing the encoder weights leads to the lowest plasticity in the new domain. Moreover, the performance of FE in previous domains was ideal, which we consider as reference results. Compared with FT, LwF [12] and ILTSS [25] only have a slight effect on relieving catastrophic forgetting, while LwF (single-head) yields even worse results than FT. Additionally, the performance of the multi-head decoder setting is slightly better than that of the single-head decoder. Similarly, inevitable catastrophic forgetting exists in the EWC [29] method. Compared with the above methods, our method achieves significantly better results in both old and new domains. As highlighted in bold in Table 2, our approach shows the minimum degradation of mIoU at each step, with good performance concurrently in the current new domain. Additionally, our method achieves $\Delta_m$ of −5.46% and BWT of 6.03%, demonstrating plasticity and stability.
(2) Single-source domain-incremental learning scenario. In addition to the multi-source scenario, we also consider a single-source scenario in which the rural and urban domains come from the same aerial dataset, LoveDA [16]. These domains share the same semantic categories and sensor resolution but exhibit a domain shift. Furthermore, the t-SNE visualization of the five domains in the feature space, shown in Figure 8, suggests that the domains follow separate distributions, with rural and urban being the most similar. Based on this observation, we conduct experiments in which the model incrementally learns from rural to urban and from urban to rural. The results, shown in Table 3, are compared with the single-task performance on rural and urban. Note that the two parentheses in step 2 have different meanings: the left one shows the drop/gain in performance relative to step 1, while the right one compares the performance with the single-task baseline for the corresponding dataset. Our method performs well in the single-source domain-incremental learning scenarios, with little catastrophic forgetting. Furthermore, the performance in step 2 surpasses that of the corresponding dataset in step 1, and we observe a gain of 14.78% relative to the single-task baseline for rural and urban. Our experiments suggest that our model achieves forward transfer from the previous domain by capturing and adapting domain-agnostic and domain-specific features between the rural and urban domains.
Table 3. Results obtained on a single-source domain-incremental learning scenario: rural → urban and urban → rural. The left parenthesis in step 2 indicates a drop/gain in performance relative to step 1, while the right one compares with the single-task baseline for the corresponding dataset.
(3) Class-wise qualitative analysis. In this section, we examine the class-wise accuracy on previous domains during domain-incremental learning. Figure 9 compares our method with the single-task baseline and ILTSS (multi-head) [25], where we record the test results of domains 1-4 (GID, BDCI2020, deepglobe, LoveDA-urban) at step 5. All test results are recorded when the model is optimal in the current domain (LoveDA-rural). As shown in Figure 9, the ILTSS method suffers heavy catastrophic forgetting and fails to maintain performance in the last domain (LoveDA-urban). The model forgets almost all knowledge of the earlier domains (GID and BDCI2020), particularly in classes such as building, agriculture, and grassland in GID and water, road, and grassland in BDCI2020. In contrast, our method successfully mitigates forgetting in all previous domains, and the gap between our performance and that of ILTSS in the three domains (GID, BDCI2020, and deepglobe) is considerable: 41.80%, 31.11%, and 31.14%, respectively. It is worth noting that our method's performance differs only slightly from that of the single-task method and is even slightly better in some categories. Moreover, in the current domain, LoveDA-rural, our method and ILTSS (multi-head) outperform the single-task performance by 7.55-13.32%. Considering the similarity between the rural and urban areas of LoveDA, we attribute this gain to forward transfer of knowledge from the previous domain to the current one. ILTSS performs better than our approach in the current domain, LoveDA-rural, which highlights the plasticity-stability trade-off in incremental learning. This trade-off refers to the need to compromise between learning a new domain and preserving the knowledge acquired from previously learned domains.
(4) Visualization analysis. Figure 10 presents the visualization results of representative samples obtained by our method and the comparative methods on the experimental domains. The datasets are uniformly cropped to 256 × 256, and some regions are cut into small pieces, which makes segmentation difficult because contextual semantic information is lost. As shown in the first three domains (six rows) of Figure 10b,c, much noise is introduced and significant misclassification is present. Moreover, the model loses the ability to classify some categories in the previous domains, such as the 'road' category in BDCI2020 for the ILTSS method, while predicting categories that did not exist before, such as the 'water' category in BDCI2020 for the LwF method, indicating catastrophic forgetting in both methods. In contrast, the results of the last domain (urban) and the current domain (rural) are better in Figure 10b,c. Furthermore, we discuss the performance of multi-task training in Figure 10d. Although multi-task training can access all domain data, the model optimizes over the training data of all domains simultaneously, leading to suboptimal performance on all domains. The performance of multi-task training in the rural and urban domains is worse than in the other three domains, likely due to the smaller amount of data and the feature space gap shown in Figure 8. Although multi-task training achieves more precise predictions in some categories, such as the 'agriculture' category embedded in the 'road' and 'building' categories of BDCI2020 in Figure 10d, it still misclassifies the 'grassland' category of deepglobe, reflecting its unstable performance across samples.
In comparison, our method significantly improves the performance in all five domains, as shown in Figure 10e, with only minor catastrophic forgetting in the previous domains. Additionally, our method correctly classifies the 'grassland', 'forest', and 'agriculture' categories even though they are similar in appearance, as in the bounding box region in the sixth row, thanks to the model's use of contextual semantic knowledge. A model's classification ability is sensitive to multi-source data, especially when it already holds knowledge of previous domains, which makes it challenging to exploit knowledge from different domains; our results benefit from the domain-specific structure.

Ablation Study
(1) Loss analysis. The proposed method comprises two key components: the DILRS architecture and the multiple loss function. In this section, we present ablation experiments to evaluate the effectiveness of the proposed loss function and architecture, as reported in Table 4. In all experiments, we used the proposed DILRS architecture, except for the 'ours' entry in Table 2, which is based on Erfnet [5]. We report the optimal results for each experimental setting, with varying weights for the different loss functions. Ablation experiment 1 examines the performance of training with only the cross-entropy loss $L_{CE}$, which is similar to FT (multi-head, in Table 2) for incremental training. Compared with FT, our DILRS model yields better results, which we attribute to the use of domain-specific and domain-agnostic structures. By comparing ablation experiment 1 with experiments 2, 3, and 4, we evaluate the impact of different types of distillation loss. The results show that our proposed class-specific loss $L_D$ outperforms the others, as both $\Delta_m$ and BWT are improved. Additionally, we combine the class-specific loss $L_D$ with a distillation loss in the feature space, as shown in ablation experiment 5. As mentioned above, using these two losses jointly maximizes the distillation of previous domain knowledge while minimizing the effects of the label space shift. The configuration of ablation experiment 5, which performs well, is the one used in our method. Moreover, we investigate the influence of the loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ in Equation (8), which is also a critical factor in balancing plasticity and stability. In our experiments, we set $\lambda_2 = \lambda_3$ to simplify the study, so that $\lambda_2/\lambda_1$ represents the ratio of distillation to cross-entropy loss. The results of varying the ratio $\lambda_2/\lambda_1$ are shown in Table 5.
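To make the role of the weight ratio concrete, the following is a minimal sketch that assumes Equation (8) is a plain weighted sum of the three losses, $\lambda_1 L_{CE} + \lambda_2 L_D + \lambda_3 L_F$, with $\lambda_2 = \lambda_3$. The equation itself is not reproduced in this section, so the form, the function name `total_loss`, the loss values, and the ratio values are all illustrative assumptions.

```python
def total_loss(l_ce: float, l_d: float, l_f: float,
               lam1: float = 1.0, ratio: float = 1.0) -> float:
    """Assumed form of the combined objective.

    lam2 = lam3 = ratio * lam1, so `ratio` plays the role of
    lambda_2 / lambda_1 (distillation weight over cross-entropy weight).
    """
    lam2 = lam3 = ratio * lam1
    return lam1 * l_ce + lam2 * l_d + lam3 * l_f


# Hypothetical per-batch loss values and ratios, for illustration only.
l_ce, l_d, l_f = 2.0, 0.5, 0.25
for ratio in (0.5, 1.0, 2.0):
    print(ratio, total_loss(l_ce, l_d, l_f, ratio=ratio))
```

A larger ratio gives the two distillation terms more influence on the gradient relative to the cross-entropy term, which is the mechanism behind the stability and plasticity trade-off discussed here.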
As introduced in Section 3.2, $\Delta_m$ indicates the performance of each domain in domain-incremental models, while BWT measures the ability to retain old knowledge. Ideally, as the ratio $\lambda_2/\lambda_1$ increases, the model should focus more on retaining old knowledge. Thus, the value of $\Delta_m$ should gradually decrease, and the value of BWT should decrease accordingly. The results show that the performance follows this trend only within a specific range, and $\lambda_2/\lambda_1 = 1$ achieves better results than the other four parameter settings.
(2) Parameters and FLOP analysis. As discussed in [19] and in the related work, parameter isolation-based methods are the most suitable option for incremental learning in the remote sensing field when compared with replay-based and regularization-based methods. Our method is also a parameter isolation-based method. However, as the model expands to more domains, the growing number of parameters inevitably burdens the application. It is therefore necessary to analyze the evolution of parameters and FLOPs in the domain-incremental learning setting, as these reflect the model's space and time complexity. In Figure 11, we present the growth in the number of parameters and floating point operations (FLOPs) of the single-task baseline and our method as domains are added. Our method exhibits a 21.09% growth in parameters, while the FLOPs remain constant. In contrast, although the single-task model has fewer parameters and FLOPs at domain 1, both grow rapidly as incremental domains are added.
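The difference in growth can be illustrated with simple bookkeeping. The sketch below contrasts one full model per domain with a shared backbone plus a small per-domain add-on. All counts are hypothetical and chosen only to show the shape of the two curves; they are not the values behind Figure 11, and the helper names are assumptions.

```python
def single_task_params(base: int, n_domains: int) -> int:
    """One independent model per domain: parameters scale linearly."""
    return base * n_domains


def shared_params(shared: int, per_domain: int, n_domains: int) -> int:
    """Shared (domain-agnostic) parameters plus a small domain-specific part."""
    return shared + per_domain * n_domains


def growth_pct(first: int, last: int) -> float:
    """Relative growth in percent from the first to the last step."""
    return 100.0 * (last - first) / first


# Hypothetical sizes: a 2.0M-parameter model and 0.1M parameters per domain.
n = 5
single = [single_task_params(2_000_000, k) for k in range(1, n + 1)]
ours = [shared_params(2_000_000, 100_000, k) for k in range(1, n + 1)]
print(round(growth_pct(single[0], single[-1]), 2))
print(round(growth_pct(ours[0], ours[-1]), 2))
```

Under these made-up numbers the per-domain-model baseline grows by several hundred percent over five domains while the shared design grows by a few tens of percent. The constant FLOPs reported above would be consistent with executing only one domain-specific path per input, though that reading is an assumption.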

Conclusions
In this paper, we investigate the domain-incremental learning challenge in the context of remote sensing, where the model needs to incrementally learn new out-of-domain distribution data. Catastrophic forgetting caused by the coexistence of a domain shift and a label space shift has limited the performance of previous works in this area. To tackle this issue, we propose a model that uses domain adapter modules to reparametrize domain-agnostic and domain-specific parameters, and we introduce a novel multi-level knowledge distillation loss. Our experimental results demonstrate that our approach outperforms existing methods for both multi-source and single-source remote sensing domains. Additionally, class-wise qualitative analysis and visualization support the superiority of our method.
As the deployment of deep learning models on edge devices gains importance in Earth intelligence interpretation, developing domain-incremental learning methods that are suitable for multi-source remote sensing data becomes essential. Currently, there are few relevant studies, and our dataset and experimental settings can serve as a benchmark in the future. However, our research still has some limitations: data collected by the same satellite in different seasons and regions can also be considered different domains, which aligns better with the actual deployment of remote sensing edge devices. Due to limited data, this setting was not followed in this study. We will continue to improve our method in future studies.

Figure 2. A brief illustration of the multi-source remote sensing data, including samples from multi-sensor (satellite and aerial sensors) and multi-region (urban and rural regions) data. Taking the building and agriculture categories as examples, the upper half of the figure shows the visual differences between sensors of different spatial resolutions, reflected in object scales and styles. The lower half shows the spectral divergence of building and agriculture across images from different domains, via the mean and standard deviation in the red, green, and blue bands. For simplicity, we regard remote sensing domains as images from multiple sensors and regions.

Figure 3. The DILRS framework for domain-incremental learning in remote sensing. At each step, only the current domain is available. Our proposed model incrementally trains on the current domain while simultaneously being tested on all previous domains.
j k is a [1 × 1] convolutional layer denoted as a domain-specific layer of the domain k in parallel. Additionally, f j k represents the batch normalization layers, which are also domain-specific structures in parallel. The setting of the DRA follows the residual adapter module in [23,24,42].
[Figure 4 legend: decoders; DRA; domain-agnostic path; domain-specific path of previous domains; domain-specific path of domain k; frozen.]

Figure 4. Our proposed approach is composed of a shared encoder and domain-specific decoders. The encoder is made up of several domain residual adapter (DRA) modules, as illustrated in detail in Figure 5. At each step, denoted by k, the domain-specific paths for the previous domains are frozen, as indicated by the yellow dotted line in the figure. The model then trains on the domain-agnostic path (in blue) and the current domain-specific path (in green).
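The freezing schedule described in this caption can be sketched as a small helper that reports which parameter groups would be trainable at step k. This is only an illustration of the rule stated in the caption; the group names and the function `trainable_groups` are hypothetical and do not come from the authors' code.

```python
from typing import Dict


def trainable_groups(step_k: int) -> Dict[str, bool]:
    """Map each parameter group to a trainable flag at step k (1-indexed).

    Rule taken from the caption: the domain-agnostic path and the
    domain-specific path of the current domain k are trained, while the
    domain-specific paths of all previous domains stay frozen.
    """
    flags = {"domain_agnostic": True}
    for d in range(1, step_k + 1):
        flags[f"domain_specific_{d}"] = (d == step_k)
    return flags


# Example: at step 3, only the shared path and domain 3's path are trainable.
for name, trainable in trainable_groups(3).items():
    print(name, "trainable" if trainable else "frozen")
```

In a deep learning framework, the same flags would typically be applied by disabling gradient updates for the frozen groups before training on the new domain.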

Figure 5. The detailed structure of the domain residual adapter (DRA) module. Parameters of the domain-specific and domain-agnostic parts of the DRA are shown in different colors.

Figure 6. The detailed diagram of the class-specific knowledge distillation loss J, which can be divided into spatial and channel dimension parts.

i and $p^{old}_i$ represent the intermediate features of the current model $M_k$ and the previous model $M_{k-1}$ before the decoding stage, and $\|\cdot\|$ denotes the L2 norm.

Figure 7. The class distribution of our datasets.

Figure 10. Visualization of semantic segmentation results in the five domains at step 5. Each domain is displayed in two rows, following the training order (GID-BDCI2020-deepglobe-urban-rural). In each row: (a) Input image; (b) LwF (multi-head); (c) ILTSS (multi-head); (d) multi-task; (e) Ours; (f) Ground truth. The black bounding boxes highlight details in the images. The color corresponding to each category is shown at the bottom.

$\lambda_2 = \lambda_3$ to simplify research, and $\lambda_2/\lambda_1$ represents the ratio of distillation to cross-entropy loss. The results of varying the ratio $\lambda_2/\lambda_1$ are shown in Table 5.

Figure 11. Growth in parameters and FLOPs with incremental domains.

Table 1. Comparison and statistics among datasets.

Table 2. Results of the 5-step domain-incremental learning. We record the performance (IoU) on the current and all previous domains at each step. Parentheses indicate the drop in performance compared with the step at which the domain was first trained. $\Delta_m$ and BWT are calculated from the results at step 5. ↑ indicates that larger is better, while ↓ indicates the opposite. Bold indicates the best result.

Table 4. Results of the ablation study for different loss functions. $\Delta_m$ and BWT are calculated based on our proposed model at step 5. $L_{CE}$, $L_D$, and $L_F$ are the loss functions in our proposed method, while $L_{dist}$ represents the classical distillation loss proposed by [44].

Table 5. Results of an ablation study for the weight ratio $\lambda_2/\lambda_1$.