1. Introduction
Remote sensing scene classification can be regarded as the task of assigning semantic labels to high-spatial-resolution (HSR) image patches. According to the different spatial distributions and combinations of objects, remote sensing scenes can be divided into semantic categories that carry specific semantic information, such as airport, forest, resort, and tennis court. This fundamental task underpins a wide range of applications, including urban planning [1], environmental monitoring [2], disaster detection [3], and object recognition [4,5].
Over the past few years, remote sensing scene classification has achieved significant improvements, and researchers have developed various well-performing methods for the task. In particular, deep-learning-based methods take full advantage of deep feature extraction and classification to achieve state-of-the-art performance. However, with the development of real-time earth observation technology, large volumes of new remote sensing imagery are continuously acquired from different satellites. It is difficult for pre-trained models to classify these ever-growing collections of new images accurately, owing to the significant variation (e.g., in spatial resolution and imaging angle) across sensors. Moreover, existing models usually fail to update continually on non-independent and identically distributed (non-i.i.d.) streaming data, resulting in catastrophic forgetting of previously learned knowledge. Thus, it is critical to preserve the knowledge learned by the old model while extending learning to new tasks.
Currently, several studies adopt pre-trained models that are continually updated over consecutive tasks for cross-dataset scene classification. For example, Lima et al. [6] used a model pre-trained on ImageNet and applied it to new remote sensing scene classification tasks. However, features pre-trained on natural images cannot be directly transferred to remotely sensed images due to the significant differences between the two domains. To address this cross-domain problem, several transfer-learning-based methods have been employed for scene classification. For example, Li et al. [7] fine-tuned a pre-trained model with few-shot samples and achieved favorable performance on scene classification tasks. Song et al. [8] adopted domain adaptation to maintain the consistency of source- and target-domain features in a shared subspace, effectively improving scene classification. Although transfer learning effectively transfers knowledge from the source domain to the target domain, forgetting the knowledge of earlier tasks remains inevitable in continuous learning, especially when models are constantly updated.
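As an illustration of this kind of feature-alignment constraint (in our notation; the exact objective of [8] may differ), domain-adaptation methods often minimize a discrepancy such as the maximum mean discrepancy (MMD) between source and target features in the shared subspace:

$$ \mathrm{MMD}^2(\mathcal{X}_s, \mathcal{X}_t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \right\|_{\mathcal{H}}^2, $$

where \(x_i^s\) and \(x_j^t\) are source- and target-domain samples and \(\phi\) maps features into a reproducing-kernel Hilbert space \(\mathcal{H}\).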
In order to tackle the catastrophic forgetting problem in streaming scene classification tasks, continual learning (or life-long learning) has been introduced, which enables a model to learn adaptively from tasks without a predefined number of samples and categories. Existing methods are mainly divided into three groups: replay-based methods, regularization-based methods, and parameter-isolation methods [9]. To consolidate already learned knowledge, replay-based methods either store the original data or maintain a model that can regenerate the data to mitigate forgetting. For example, Rebuffi et al. [10] developed a training strategy for continual learning based on stored historical samples. Kamra et al. [11] proposed a generative dual memory network that generates pseudo-data to preserve previous information. Similarly, to overcome the model's forgetting of historical knowledge, Rostami et al. [12] used a generative model, together with a few samples from previous tasks, to produce pseudo samples, so that abstract concepts effectively migrate from the generative model to the current task. Shin et al. [13] proposed a deep generative replay framework with a "generator" and a "solver", where the "generator" produces data from the previous task and the "solver" then handles the current task with the generated samples. Verma et al. [14] demonstrated the contribution of their proposed Efficient Feature Transforms in generative models for overcoming catastrophic forgetting. Although replay-based methods obtain favorable results, the additional storage requirement and the complexity of training a generative model mean that such methods cannot be applied in resource-limited situations. Regularization-based methods protect previously learned knowledge by constraining parameter updates, typically by adding a regularization term that penalizes changes to critical parameters. Kirkpatrick et al. [15] first proposed Elastic Weight Consolidation (EWC), which employs a quadratic penalty term (sketched below) to constrain the update of important weights, with importance calculated from the diagonal of the Fisher information matrix. Similarly, Aljundi et al. [16] proposed Memory Aware Synapses (MAS), which estimates the importance of each parameter based on the sensitivity of the predicted output function, effectively preventing valuable parameters from being overwritten. However, the memorization mechanisms of regularization-based methods perform poorly at discriminating categories across tasks, and the additional loss term used to protect consolidated knowledge may lead to a performance trade-off between old and new tasks [17]. The third group, parameter-isolation approaches, allocates fixed parameters to each task to prevent forgetting; these approaches are further subdivided into dynamic-architecture and fixed-architecture methods, depending on whether the model structure changes. For instance, Yoon et al. [18] proposed the Dynamically Expandable Network (DEN), which dynamically expands the capacity of the old model when encountering new tasks. Rusu et al. [19] extended the model for new tasks while leaving the sections corresponding to previous tasks unmodified. Meanwhile, among the fixed-architecture solutions, PathNet [20] and PackNet [21] employ binary masks to reserve subsets of network parameters for specific tasks. However, parameter-isolation methods still suffer from parameter independence, which restricts their robustness on complex tasks.
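As a concrete illustration of this regularization scheme, the EWC objective [15] can be written (in our notation) as:

$$ \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{new}}(\theta) + \frac{\lambda}{2} \sum_{i} F_i \left( \theta_i - \theta_{A,i}^{*} \right)^2, $$

where \(\mathcal{L}_{\mathrm{new}}\) is the loss on the current task, \(\theta_{A}^{*}\) are the parameters learned on the previous task \(A\), \(F_i\) is the \(i\)-th diagonal entry of the Fisher information matrix, and \(\lambda\) controls how strongly old knowledge is protected.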
The three schemes above focus heavily on retaining historical experience, but Lee et al. [22] argued that this still leads to forgetting because it restricts the learning of future tasks; they instead alleviated catastrophic forgetting by learning more representative features in the first place. Inspired by this, obtaining meaningful features within the continual learning process becomes critical to alleviating catastrophic forgetting. For remote sensing scene images, a single image often covers a large variety of land-cover types and ground objects, and the inter-class similarity and intra-class diversity make scene classification more challenging. In addition, images acquired from different satellites vary in illumination, background, scale, and noise, which further increases the discrepancy of scene images across datasets. Facing such dramatic variations, extracting discriminative features from limited annotated samples becomes the primary goal.
In order to capture more representative features for continual scene classification, self-supervised learning, especially contrastive learning, has demonstrated its strength in capturing intrinsic features. Unlike supervised methods that require numerous manually annotated labels, contrastive learning uses similarity metrics to measure the distance between positive and negative samples after transformation: it pulls similar samples closer together and pushes distinct samples apart, thereby learning invariant features. For instance, Zhao et al. [23] combined the scene classification task with contrastive learning, which further improved feature extraction and the generalization of the model. Tao et al. [24] introduced contrastive learning to obtain a high-performance model for scene classification under insufficient labeled samples and achieved favorable results. Stojnic et al. [25] analyzed the effect of the sample size and domain of scene images used for training, and demonstrated that models pre-trained with contrastive learning outperform others on scene classification. Therefore, building robust and discriminative feature representations for describing scenes is an essential component of cross-dataset scene classification. However, although contrastive learning can strengthen the deep features obtained, preserving consolidated knowledge over a stream of tasks remains a challenge.
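For reference, a widely used contrastive objective of this kind is the normalized temperature-scaled cross-entropy (NT-Xent / InfoNCE) loss; the specific losses in [23,24,25] may differ, but take a similar form:

$$ \ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}, $$

where \(z_i\) and \(z_j\) are embeddings of two augmented views of the same image, \(\mathrm{sim}(\cdot,\cdot)\) is cosine similarity, \(\tau\) is a temperature parameter, and the denominator sums over the other \(2N-1\) samples in a batch of \(N\) images.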
Moreover, due to the similarity of samples among different datasets, it is difficult for the model to reuse the valuable knowledge learned from previous data. The knowledge distillation strategy, especially the distillation of feature-level knowledge and semantic information, enables the model to obtain more transferable features (a typical formulation is sketched below). Indeed, representative feature extraction in continual learning is intended to benefit future tasks; however, it still lacks a knowledge-retention mechanism to preserve the acquired representative features under streaming tasks. Specifically, without distillation of the historical model's features, especially for complex remote sensing scene images, the learned spatial knowledge becomes unadaptable to future scene classification tasks: the deep, abstract spatial features of the previous model no longer support knowledge retention, which eventually leads to forgetting.
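A typical instantiation of these two distillation terms (in our notation, as a sketch rather than the exact formulation used later) matches intermediate spatial features of the current (student) model to those of the historical (teacher) model, and matches their softened class outputs:

$$ \mathcal{L}_{\mathrm{feat}} = \left\| f^{s}(x) - f^{t}(x) \right\|_2^2, \qquad \mathcal{L}_{\mathrm{cls}} = \mathrm{KL}\left( \sigma\!\left(z^{t}/T\right) \,\middle\|\, \sigma\!\left(z^{s}/T\right) \right), $$

where \(f^{s}\) and \(f^{t}\) denote student and teacher feature maps, \(z^{s}\) and \(z^{t}\) their logits, \(\sigma\) the softmax function, and \(T\) a distillation temperature.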
Based on the issues mentioned above, it is essential to employ contrastive learning to enhance the robustness of the extracted features while introducing a distillation strategy. On the one hand, given the complex spatial configurations and the significant distinctions between datasets, contrastive representations further enhance the features and boost future learning. On the other hand, knowledge distillation effectively transfers valuable learned knowledge to new tasks and optimizes scene classification. In particular, with spatial-feature and class distillation, catastrophic forgetting can be dramatically alleviated by mimicking the features at different levels and the final output of the historical model. Hence, we apply both contrastive learning and knowledge distillation to ensure the model acquires robust features while preserving historically learned knowledge.
To this end, we propose the continual contrastive learning network (CCLNet) for continual scene classification, which contains a deep feature extractor, a knowledge distillation mechanism, and a contrastive feature enhancement scheme. First, we design a contrastive loss module that compares differently augmented views of samples to enhance the robustness of features for continual scene classification tasks. Then, we introduce deep spatial-feature distillation and class distillation for knowledge preservation by imitating the features at different levels and the outputs of historical models. Integrating the contrastive loss module with the knowledge distillation strategy ensures that the model captures comparative information under limited annotated samples across different datasets while further assuring knowledge retention.
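To make the combined objective concrete, the following is a minimal PyTorch-style sketch of one training step that couples the supervised, contrastive, and distillation terms. The model interfaces, loss weights (lambda_con, lambda_feat, lambda_cls), and temperatures are illustrative assumptions, not the exact CCLNet implementation.

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, x_view1, x_view2, labels,
                  lambda_con=1.0, lambda_feat=1.0, lambda_cls=1.0,
                  tau=0.5, T=2.0):
    """One illustrative update combining cross-entropy, NT-Xent contrastive,
    and feature/class distillation losses (hypothetical model interfaces)."""
    # Student forward pass on two augmented views of the same batch;
    # assumed to return (spatial features, projection embedding, logits).
    feat1, proj1, logits1 = student(x_view1)
    _, proj2, _ = student(x_view2)

    # Supervised loss on the current task.
    loss_ce = F.cross_entropy(logits1, labels)

    # NT-Xent contrastive loss between the two views.
    z = F.normalize(torch.cat([proj1, proj2], dim=0), dim=1)  # (2N, d)
    sim = z @ z.t() / tau                                     # scaled cosine similarities
    n = proj1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                # exclude self-similarity
    # The positive for row i is its counterpart view at index i +/- n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    loss_con = F.cross_entropy(sim, targets)

    # Distillation against the frozen historical (teacher) model.
    with torch.no_grad():
        t_feat, _, t_logits = teacher(x_view1)
    loss_feat = F.mse_loss(feat1, t_feat)                     # spatial-feature distillation
    # Class distillation on the old classes only (assumes the student head
    # extends the teacher head while keeping class order).
    loss_cls = F.kl_div(F.log_softmax(logits1[:, :t_logits.size(1)] / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction='batchmean') * T * T

    return loss_ce + lambda_con * loss_con + lambda_feat * loss_feat + lambda_cls * loss_cls
```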
The main contributions of the proposed CCLNet are:
- (1)
Contrastive learning within the continual learning process enables the model to learn invariant and robust features of complex scene images under limited annotated samples.
- (2)
The designed spatial and class distillation effectively distills the latent spatial structure and other knowledge of the previous model into the current model, thus facilitating continual learning.
The remainder of the paper is organized as follows: Section 2 introduces related work. Section 3 describes the proposed method in detail. Section 4 presents the experimental data and the experimental setup. Section 5 analyzes and discusses the experimental results. Finally, Section 6 concludes the paper.