1. Introduction
With the rapid development of remote sensing technology, optical remote sensing images (RSIs) have seen significant improvements in spatial resolution and textural detail, providing a richer data foundation for object detection tasks. This progress enables optical remote sensing object detection to identify surface objects and their spatial distribution more accurately, and it therefore plays a crucial role in fields such as environmental monitoring, geological hazard detection, Land Use and Land Cover (LULC) mapping, geographic information system (GIS) updating, precision agriculture, and urban planning [1]. Deep learning-based object detection algorithms have achieved remarkable success in traditional natural image domains, with models like R-CNN [2], Fast R-CNN [3], Faster R-CNN [4], the YOLO series [5], RetinaNet [6], GFL [7], and CenterNet [8]. However, object detection in remote sensing images faces unique challenges: targets to be detected are typically distributed in multi-scale forms, arbitrarily oriented, densely packed, and even mutually occluded. Recent studies further indicate that tiny objects, complex backgrounds, and label assignment uncertainty remain important obstacles for robust remote sensing object detection [9]. The uniqueness of remote sensing object detection tasks is reflected in recent remote sensing object detection datasets such as NWPU VHR-10 [10], DOTA [11], and DIOR [12], and many specialized approaches, including Oriented R-CNN [13], OAN [14], and FFCA-YOLO [15], have therefore been developed. Recent studies have further shown that large-scene remote sensing object detection benefits from context-aware and attention-guided detector design, which is especially important for handling small objects, cluttered backgrounds, and substantial scale variation in aerial imagery [16]. Recent systematic reviews have also summarized the rapid development of optical remote sensing object detection methods, benchmarks, and operational applications, indicating that robust object detection remains a fundamental task in remote sensing image interpretation [17].
In the construction of computer vision models, traditional paradigms typically assume that training data for all tasks can be permanently stored and accessed at any time. However, in practical applications, as task sequences continuously expand, systems inevitably face the dilemma of historical data becoming unavailable. This constraint arises from both the physical limitations of storage media and the exponential growth of computational costs associated with retraining. When models trained on old tasks are directly updated with new tasks, catastrophic parameter drift occurs, manifesting as significant performance degradation on previous tasks. This phenomenon is defined as catastrophic forgetting, which can hardly be effectively handled by existing classical object detection training paradigms. Incremental learning (IL) is specifically designed to address this pervasive issue of catastrophic forgetting. Based on whether the task identity is provided or must be inferred [18], researchers categorize incremental learning into three types: task-incremental, domain-incremental, and class-incremental learning. This paper focuses on the challenge of class-incremental learning in remote sensing object detection, as it is particularly relevant to the dynamic nature of remote sensing data, where new object classes are frequently introduced while maintaining the accuracy of previously learned classes is crucial for practical applications.
Knowledge distillation [19] is a technique that effectively mitigates the catastrophic forgetting of old-class knowledge in class-incremental learning by introducing a regularization constraint mechanism.
In knowledge distillation for object detection prior to the introduction of incremental learning, earlier works primarily focused on extracting knowledge through the combined distillation of detector components. For instance, studies such as [20,21] adopted a comprehensive approach by distilling all components of the detector. However, these methods fundamentally relied on feature-based knowledge distillation, leaving the potential of logit-based distillation underexplored in incremental object detection (IOD) scenarios. Unlike feature distillation, logit-based methods [22] leverage the teacher model’s logit outputs (e.g., classification scores or bounding box predictions), which inherently encapsulate the teacher’s reasoning information. This characteristic enables the student model to better mimic the teacher’s decision-making process. Consequently, carefully designing distillation strategies for different types of logits becomes critical to fully exploit their advantages. Building on this, for incremental object detection, ref. [23] suggested that not all responses are crucial for preventing catastrophic forgetting, thus proposing an Elastic Response Selection (ERS) method for logit distillation from old-class teachers. The advantage of ERS is that it can automatically select key information for incremental learning based on the teacher model’s logits, thereby effectively reducing the interference between old and new knowledge and lowering the risk of catastrophic forgetting. However, the ERS selection mechanism overlooks local information at the channel level, which reduces its ability to extract key fine-grained information about old classes, and it therefore fails to handle the confusion between old and new classes well.
Figure 1 presents two representative failure cases of Elastic Response Distillation (ERD) on DIOR. In Figure 1a, multiple old-class airplane instances in a dense airport scene are misclassified as the newly introduced windmill class. In Figure 1b, old-class airplane targets are also incorrectly assigned to the new windmill category, further indicating that ERD may suffer from old/new class confusion when fine-grained response information is insufficiently preserved.
Furthermore, the parameter-sharing mechanism of current incremental learning models that employ a single detector architecture has an inherent flaw. Because old and new categories share critical parameters, backward gradient updates for both sets of categories act on the same weights during training. This not only introduces implicit constraints into the training process but also readily couples the two sets of categories in the feature representation space, making it difficult to maintain independent and stable feature distributions. Consequently, classification boundaries blur and the risk of class confusion increases.
To address the aforementioned issues, we propose improvements to the ERS distillation method and its model architecture. Specifically, we propose a dual-branch detector framework for independent learning and a multi-granularity dynamic selection strategy. Unlike previous works, our method explicitly considers both channel-wise and spatial information during the fine-grained selection process. We then integrate these two types of information to make a unified selection and purification for the distillation supervision of old classes. Architecturally, we employ dual-branch detectors that share common backbone and neck network parameters, with each branch specifically dedicated to training old classes and new classes, respectively. This design effectively mitigates the risk of class confusion caused by overlapping classification boundaries at identical image locations between old and new classes. Furthermore, we adapt DIST loss [24] to the old-class classification logit distillation process through a sigmoid-based formulation, which preserves inter-class and intra-class relationships from the teacher responses. The proposed method is evaluated on the remote sensing datasets DIOR and DOTA, and the experimental results demonstrate that it effectively alleviates catastrophic forgetting and balances old-class retention with new-class adaptation under different remote-sensing scenarios.
The main contributions of this work are summarized as follows:
(1) We propose a dual-branch detector framework for remote sensing incremental object detection, which decouples the learning pathways of old and new classes to alleviate class confusion and catastrophic forgetting.
(2) We propose a multi-granularity dynamic selection (MDS) strategy that combines channel-wise and global response filtering, enabling more informative teacher responses to be selected for old-class knowledge distillation.
(3) We introduce a sigmoid-based DIST loss for classification logit distillation, which preserves both inter-class and intra-class relationships in the selected teacher responses.
(4) Extensive experiments on DIOR and DOTA demonstrate that the proposed method achieves a favorable balance between old-class retention and new-class adaptation under remote sensing incremental detection scenarios.
The remainder of this paper is organized as follows.
Section 2 reviews related work on dense detection frameworks, class-incremental object detection, knowledge distillation, and response selection.
Section 3 presents the proposed dual-branch incremental detector and the MDS strategy.
Section 4 describes the experimental setup, including datasets, implementation details, and evaluation metrics.
Section 5 reports the experimental results and discussion.
Section 6 concludes this paper.
5. Results and Discussion
5.1. Ablation Study on DIOR
All ablation studies in this subsection are conducted on DIOR under the old-10/new-10 class-incremental setting. DIOR is used as the main benchmark for component analysis because it provides a clear old/new category split and a relatively stable evaluation protocol for remote sensing incremental object detection. The joint-training result of GFL is used as the upper-bound reference, and “Diff” denotes the mAP gap between each incremental result and the corresponding joint-training result.
5.1.1. Baseline
We use GFL [7] as the baseline detector. Under the joint-training setting, GFL is trained with annotations of both old and new classes and serves as the upper-bound reference for incremental learning. As shown in Table 1, the joint-training GFL achieves an mAP of 70.40% on Old-10 and 65.91% on New-10. In contrast, the model denoted as “Baseline + w/o select” performs incremental training without response selection in a single detection branch. It obtains 62.56% mAP on Old-10, corresponding to a drop of 7.84 percentage points compared with joint training, and 61.54% mAP on New-10, corresponding to a drop of 4.37 percentage points. These results indicate that directly transferring all teacher responses without selective distillation is insufficient for preserving old-class knowledge and may also weaken new-class learning.
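As a quick arithmetic check of how “Diff” is obtained, the following minimal sketch recomputes the two gaps quoted above from the reported mAP values. The helper name `diff_to_joint` is ours and purely illustrative; it is not taken from the paper's code.

```python
def diff_to_joint(incremental_map: float, joint_map: float) -> float:
    """Return the mAP gap, in percentage points, to the joint-training reference."""
    return round(incremental_map - joint_map, 2)


# mAP values (%) quoted in the text for DIOR under the old-10/new-10 setting.
joint = {"old": 70.40, "new": 65.91}      # joint-training GFL (upper bound)
no_select = {"old": 62.56, "new": 61.54}  # incremental training without selection

gaps = {split: diff_to_joint(no_select[split], joint[split]) for split in joint}
print(gaps)
```

A negative value means the incremental model falls below the joint-training reference on that split.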
5.1.2. Dual-Branch Detector
In Table 1, “Baseline + Dual-branch” represents the addition of the dual-branch detector to the baseline model, where separate branches handle old and new classes. The corresponding mAP values for old and new classes are 68.05% and 64.38%, respectively. Compared to training and distillation using a single detector in “Baseline + w/o select”, this approach improves the mAP of old classes by 5.49 percentage points and that of new classes by 2.84 percentage points. Furthermore, the comparison between “Baseline + MDS + Dual-branch” and “Baseline + MDS” demonstrates that incorporating a dual-branch structure on top of our selection method (MDS) leads to significant performance gains. Similarly, the comparison between “Baseline + MDS + DIST loss” and “Baseline + MDS + Dual-branch + DIST loss” further validates this finding.
5.1.3. MDS
Comparing the old-class accuracy of MDS with that of direct distillation without a selection strategy shows the advantage of the MDS response selection mechanism. This finding is consistent with the conclusion in reference [23] that “not all responses are critical for preventing catastrophic forgetting”. Notably, when comparing our proposed MDS framework with the ERS model, the recognition accuracy for old classes improves from 67.99% to 68.26%. This gain of 0.27 percentage points indicates that our channel-wise (i.e., category-level) response selection mechanism is beneficial compared with the ERS approach. By implementing a fine-grained channel-level selection strategy, we improve the extraction of informative responses from the teacher model. Furthermore, under the dual-branch detector architecture, the distillation strategy employing MDS shows performance gains on both old and new classes compared to selection-free distillation, providing additional evidence for the effectiveness of this selection mechanism.
5.1.4. DIST Loss
Functioning as a regularization constraint, DIST loss [24] is designed to optimize the intrinsic correlations in knowledge transfer from the teacher model, thereby enhancing the student model’s capacity to retain old-class knowledge. Experimental results demonstrate that incorporating DIST loss into the Elastic Response Distillation (ERD) [23] framework improves the old-class recognition accuracy from 67.99% to 68.18%. Within the integrated architecture combining the dual-branch detector and the MDS selection mechanism, DIST loss further strengthens old-class response preservation while maintaining competitive new-class adaptation. This loss term preserves old-class knowledge by constraining the relational structure of selected classification responses between the teacher and student models, rather than directly aligning intermediate feature distributions. In this way, DIST loss complements MDS by maintaining both inter-class and intra-class response relationships, thereby improving the fidelity of old-class response distillation.
5.1.5. Additional Analysis of Feature-Based Distillation
To further examine the effect of feature-level knowledge transfer and analyze its compatibility with the proposed response-level framework, we introduce an additional feature-based distillation analysis under the same DIOR old-10/new-10 setting. Specifically, we evaluate a feature-KD-only variant and an extended variant that adds feature distillation to the proposed method. The results are shown in Table 2.
As shown in Table 2, the feature-KD-only variant obtains 54.70% mAP on Old-10 and 59.20% mAP on New-10, which is substantially lower than ERD and the proposed method. This indicates that feature-level imitation alone is insufficient for preserving old-class decision knowledge in the adopted GFL-based incremental detection framework. One possible reason is that direct feature alignment constrains intermediate representations without explicitly selecting discriminative old-class responses, which may introduce additional interference when old and new categories coexist in cluttered remote-sensing scenes.
When feature distillation is added to the proposed method, the old-class mAP decreases from 69.93% to 68.90%, and the new-class mAP decreases from 65.88% to 65.70%. This suggests that simply combining feature-level distillation with the proposed response-level framework does not necessarily bring further improvement. In particular, additional feature alignment may over-constrain the shared backbone and FPN features, thereby weakening the balance between old-class preservation and new-class adaptation. Therefore, we keep the response-level MDS-based dual-branch framework as the final model, since it achieves the best overall balance between old-class retention and new-class learning. These results also suggest that feature-level and response-level distillation may be complementary in principle, but their combination requires careful design rather than simple superposition.
5.2. Comparison with State-of-the-Art Methods on DIOR and DOTA
In this subsection, we compare the proposed method with representative class-incremental object detection methods on both DIOR and DOTA. DIOR is used as the main benchmark for systematic comparison and ablation analysis, while DOTA provides another representative remote-sensing benchmark for evaluating the effectiveness of the proposed framework under more challenging scene conditions. For fair comparison, all methods are evaluated under the same old/new class splits and AP50 metric.
To clarify the comparison protocol, Table 3 summarizes the detector type, main distillation level, and result source of each compared method. The results of Fast-IL, Faster-IL, and FPN-IL are taken from the reported results in FPN-IL [32], since they follow the same DIOR and DOTA class-incremental protocols adopted in this work. ERD is the most directly related response-level baseline and is re-implemented under the same codebase, data split, and evaluation protocol as the proposed method. The proposed method is also implemented and evaluated under the same setting. For each method, Diff is computed using its corresponding joint-training result as the reference.
As summarized in Table 3, the compared methods differ in both detector architecture and knowledge transfer level. Fast-IL and Faster-IL are two-stage incremental detection methods, while ERD and the proposed method are based on the one-stage GFL detector. In terms of knowledge transfer, Faster-IL and FPN-IL mainly rely on feature-level or feature-pyramid-level distillation, which aims to preserve intermediate representations. ERD performs response-level logit distillation by selecting informative teacher responses. Different from these methods, the proposed framework combines response-level MDS with a dual-branch detector, thereby explicitly decoupling old-class knowledge preservation and new-class learning. This structured comparison clarifies the main similarities and differences between previous incremental detection methods and the proposed method.
5.2.1. Results on DIOR
On DIOR, Table 4 and Table 5 report the performance of each method on old and new classes, respectively. On Old-10, the proposed method achieves the highest incremental mAP of 69.93 and the smallest performance gap of −0.47, demonstrating strong old-class retention. On New-10, FPN-IL obtains the highest absolute mAP, while our method achieves the second-best mAP and the smallest gap to its corresponding joint-training reference, with a Diff of only −0.03. These results indicate that the proposed method provides a favorable balance between old-class retention and new-class adaptation rather than improving only one side.
Compared with FPN-IL, the proposed method transfers knowledge at the prediction-response level rather than the feature-pyramid level. FPN-IL is effective for preserving multi-scale feature representations, but feature-level constraints may still introduce interference between old and new categories when their visual patterns are similar. In contrast, our method uses MDS to select informative teacher responses and uses a dual-branch detector to decouple old-class preservation from new-class learning. This explains why the proposed method achieves stronger old-class retention while maintaining competitive new-class adaptation.
5.2.2. Results on DOTA
On DOTA, we further compare the proposed method with the same state-of-the-art incremental object detection methods under the old-8/new-7 setting. Compared with DIOR, DOTA exhibits denser object layouts, larger scale variations, and more frequent co-occurrence of old and new categories within cropped patches, thus providing a more challenging benchmark for evaluating incremental object detection methods in remote-sensing scenes.
Table 6 reports the performance of each method on the old classes of DOTA. Compared with DIOR, DOTA contains denser object distributions and more complex background variations, which makes old-class knowledge preservation more challenging. The proposed method achieves the highest incremental mAP of 70.23 on Old-8, showing its advantage in preserving old-class detection capability under complex remote-sensing scenes. FPN-IL obtains the smallest performance gap to its joint-training reference, while the proposed method still provides the strongest absolute old-class performance among all incremental methods.
Table 7 reports the performance of each method on the new classes of DOTA. FPN-IL obtains the highest incremental mAP, while the proposed method achieves the second-best mAP. It is worth noting that the proposed method exceeds its corresponding joint-training reference by 2.00 percentage points on New-7. This positive gap should not be interpreted as incremental learning being generally superior to joint training. Instead, it may be related to the regularization effect of the frozen teacher model, the decoupled optimization of old and new branches, and category-specific data distribution differences under the adopted data split. Overall, the DOTA results show that the proposed framework can maintain strong old-class retention while providing competitive new-class adaptation under complex remote-sensing scenes.
5.3. Hyperparameter Analysis
All hyperparameter analyses in this subsection are conducted on DIOR, which serves as the main benchmark for parameter study in our method. We analyze the influence of the MDS thresholds and , as well as the weights and in the DIST loss.
The parameters
and
control the strictness of per-channel selection and global selection in MDS, respectively. As shown in
Figure 6, when
, increasing
improves the performance of both old and new classes, indicating the necessity of the second-stage global filtering. When
, setting
yields the best performance. However, further increasing
makes the selection too strict and may discard useful teacher responses, resulting in performance degradation. Therefore, a moderate threshold setting is more suitable for preserving informative old-class knowledge while avoiding noisy response transfer.
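To make the two-stage behaviour described above more concrete, the following is a minimal sketch of a per-channel filter followed by a global filter. The exact selection rule is not reproduced in this section, so the mean-plus-scaled-standard-deviation thresholds, the function name `mds_select`, and the parameter names `a` (per-channel strictness) and `b` (global strictness) are all assumptions made for illustration, not the paper's definitions.

```python
from statistics import mean, pstdev


def mds_select(responses, a=1.0, b=0.0):
    """Toy two-stage selection over teacher responses.

    responses: list of positions, each a list of per-channel (per-class) scores.
    Returns the set of (position, channel) pairs kept for distillation.
    """
    n_pos, n_ch = len(responses), len(responses[0])

    # Stage 1 (assumed rule): within each channel, keep responses above
    # that channel's mean + a * std.
    kept = []
    for c in range(n_ch):
        column = [responses[p][c] for p in range(n_pos)]
        threshold = mean(column) + a * pstdev(column)
        kept += [(p, c) for p in range(n_pos) if column[p] > threshold]
    if not kept:
        return set()

    # Stage 2 (assumed rule): pool the survivors and keep those above
    # the pooled mean + b * std.
    survivors = [responses[p][c] for p, c in kept]
    global_threshold = mean(survivors) + b * pstdev(survivors)
    return {(p, c) for p, c in kept if responses[p][c] > global_threshold}


# Hypothetical teacher scores: 4 positions x 2 old-class channels.
scores = [[0.9, 0.1], [0.1, 0.2], [0.1, 0.6], [0.1, 0.1]]
loose = mds_select(scores, a=1.0, b=-2.0)   # lenient global stage
strict = mds_select(scores, a=1.0, b=0.0)   # stricter global stage
```

In this toy setting, raising `b` shrinks the selected set, which mirrors the observation that an overly strict global threshold may discard useful teacher responses.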
The parameters
and
are used to balance the inter-class and intra-class relational losses in DIST loss. As illustrated in
Figure 7, we evaluate three settings: intra-only (
), inter-only (
), and both combined (
). Since DIST loss is applied to old-class distillation, its influence on new-class performance is relatively limited. The results show that using both inter-class and intra-class relational constraints achieves better old-class performance than using either term alone. This indicates that the two relational constraints are complementary in preserving old-class knowledge. In addition, the intra-class term contributes more than the inter-class term in our framework, suggesting that maintaining response consistency across samples of the same category is particularly important for old-class retention.
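The three settings compared above can be illustrated with a small sketch of a DIST-style relational loss applied to sigmoid responses. It follows our reading of the DIST formulation (one minus the Pearson correlation, computed across classes for each sample and across samples for each class); the function names and the weight names `w_inter` and `w_intra` are hypothetical, and the sketch omits response selection and any detector-specific details.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pearson_distance(u, v, eps=1e-8):
    """Return 1 - Pearson correlation of two equal-length sequences."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    du = [x - mu_u for x in u]
    dv = [y - mu_v for y in v]
    cov = sum(x * y for x, y in zip(du, dv))
    norm = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    return 1.0 - cov / (norm + eps)


def dist_loss_sigmoid(student_logits, teacher_logits, w_inter=1.0, w_intra=1.0):
    """Weighted sum of inter-class and intra-class relational terms."""
    s = [[sigmoid(z) for z in row] for row in student_logits]
    t = [[sigmoid(z) for z in row] for row in teacher_logits]
    n, k = len(s), len(s[0])
    # Inter-class term: relation among the classes of each sample (rows).
    inter = sum(pearson_distance(s[i], t[i]) for i in range(n)) / n
    # Intra-class term: relation among the samples of each class (columns).
    intra = sum(
        pearson_distance([s[i][j] for i in range(n)], [t[i][j] for i in range(n)])
        for j in range(k)
    ) / k
    return w_inter * inter + w_intra * intra


# Hypothetical logits: 3 samples x 3 old classes.
teacher = [[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0], [1.0, 0.0, -1.0]]
flipped = [[-z for z in row] for row in teacher]

loss_same = dist_loss_sigmoid(teacher, teacher)        # student matches teacher
loss_flipped = dist_loss_sigmoid(flipped, teacher)     # student anti-correlated
loss_intra_only = dist_loss_sigmoid(flipped, teacher, w_inter=0.0)
```

Setting one weight to zero reproduces the intra-only or inter-only configuration, while two non-zero weights give the combined setting.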
5.4. Activation Function Analysis in DIST Loss
In the proposed framework, we adapt the original DIST loss by replacing the softmax activation with the sigmoid activation for classification logit distillation. This modification is motivated by the multi-class response characteristics of dense object detection, where multiple category responses may provide useful relational information for old-class knowledge preservation. The following is the detailed explanation for this modification.
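The Softmax function is commonly used in multi-class classification tasks where each input is assumed to belong to one and only one class. It transforms a vector of logits $\mathbf{z} = (z_1, \dots, z_K)$ into a probability distribution over $K$ classes by the following formulation:
$$\mathrm{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \quad i = 1, \dots, K.$$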
This normalization ensures that the output probabilities are positive and sum to 1, thereby emphasizing the mutual exclusivity between classes.
The Sigmoid function, on the other hand, is typically used in multi-label classification settings where each input may belong to multiple classes simultaneously. It independently maps each logit $z_i$ to a value in the range $(0, 1)$ according to the following equation:
$$\sigma(z_i) = \frac{1}{1 + \exp(-z_i)}.$$
Unlike Softmax, the Sigmoid function does not enforce any inter-class competition or normalization across logits. This makes it well-suited for modeling non-exclusive, class-independent probabilities, such as those found in multi-label or soft-target scenarios.
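The contrast between the two activations can be checked numerically with a short sketch. The logit values below are hypothetical and chosen only to show that changing one logit shifts every Softmax output but leaves the other Sigmoid outputs unchanged.

```python
import math


def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Map each logit independently to a value in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.5, -3.0]   # hypothetical class scores at one location
bumped = [2.0, 4.0, -3.0]   # same scores, with only the second logit raised

soft_before, soft_after = softmax(logits), softmax(bumped)
sig_before, sig_after = sigmoid(logits), sigmoid(bumped)
```

Here `soft_after[0]` is lower than `soft_before[0]` even though the first logit did not change, whereas `sig_after[0]` equals `sig_before[0]`.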
In the logit distillation of this paper, the teacher model provides soft labels (probability distributions across multiple categories) that reflect inter-class confidence relationships. The student model is expected to learn these relationships rather than simply predicting a single correct class. Therefore, using Sigmoid enables a more appropriate modeling of these non-exclusive, correlated outputs.
As shown in Table 8, using sigmoid in the DIST loss outperforms softmax at every temperature setting for old classes: sigmoid reaches 69.93%, whereas softmax peaks at 69.31% (T = 2) and is lower at other temperatures. For new classes, sigmoid (65.88%) is marginally below softmax at T = 0.5 (65.90%) and T = 1 (65.93%), but the difference is only 0.02 to 0.05 percentage points, and sigmoid still outperforms softmax at T = 2 (65.50%) and T = 3 (64.84%). In other words, sigmoid improves old-class accuracy by at least 0.62 percentage points while incurring only a minimal cost to new-class performance. Compared to softmax, which requires careful temperature tuning, sigmoid delivers a more stable and balanced incremental-learning result. Therefore, incorporating sigmoid into the DIST loss more effectively balances retention of old-class knowledge with learning of new classes.
5.5. Visualization Analysis
To further analyze the effectiveness of the proposed response selection strategy, we visualize old-class response maps on representative remote-sensing scenes. As shown in Figure 8, two typical old-class categories are selected, including airplanes in a complex airport scene and ships in a dense harbor scene. Each row presents the input image, ground-truth boxes, the response map generated by ERD, and the response map generated by the proposed method.
As shown in Figure 8, ERD/ERS can activate several discriminative regions of old-class objects, but its responses are relatively weak or fragmented for some densely distributed instances. In contrast, Ours/MDS generates more stable and complete responses over old-class targets. For the airplane scene, Ours/MDS highlights more aircraft instances under a complex airport background. For the ship scene, it preserves more continuous responses over densely moored ships. These visualization results further demonstrate that the proposed multi-granularity response selection strategy is beneficial for retaining informative old-class knowledge during incremental learning.
5.6. Discussion
The effectiveness of the proposed method can be mainly attributed to the combination of multi-granularity response selection and old/new class decoupling. In class-incremental object detection, directly distilling all teacher responses may introduce redundant or even misleading information, especially in remote-sensing scenes where objects are densely distributed and different categories often exhibit similar visual appearances. The proposed MDS strategy addresses this issue by selecting informative teacher responses from both channel-wise and global perspectives. The channel-wise selection preserves category-specific response information, while the global selection further filters out low-confidence or less informative spatial responses. Therefore, MDS provides a more refined knowledge transfer mechanism than holistic response selection and helps the student model retain old-class decision knowledge more effectively.
The dual-branch detector also plays an important role in balancing old-class retention and new-class adaptation. In a single-branch incremental detector, old-class distillation and new-class supervised learning are optimized within the same prediction head, which may lead to competition between preserving old decision boundaries and adapting to newly introduced categories. By assigning old-class preservation and new-class learning to two separate branches, the proposed framework reduces the direct optimization conflict between old and new categories. This design is particularly useful for remote-sensing object detection, where inter-class similarity and dense object distributions can easily amplify the interference between old and new classes.
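The decoupling idea can be pictured with a deliberately small toy model: one shared trunk standing in for the backbone and neck, followed by two separate linear heads. This is only a sketch under our own assumptions (plain Python lists instead of a detection framework, a hypothetical class name `DualBranchToy`, random weights, and no training loop); it is not the detector used in the experiments.

```python
import random


def linear(weights, x):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DualBranchToy:
    """Shared trunk feeding one head for old classes and one for new classes."""

    def __init__(self, in_dim, feat_dim, n_old, n_new, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.shared = init(feat_dim, in_dim)   # stand-in for backbone + neck
        self.old_head = init(n_old, feat_dim)  # intended target of old-class distillation
        self.new_head = init(n_new, feat_dim)  # intended target of new-class supervision

    def forward(self, x):
        feat = [max(0.0, v) for v in linear(self.shared, x)]  # ReLU features
        return linear(self.old_head, feat), linear(self.new_head, feat)


# 10 old and 10 new classes, mirroring the DIOR old-10/new-10 split.
model = DualBranchToy(in_dim=4, feat_dim=8, n_old=10, n_new=10)
x = [0.2, -0.1, 0.7, 0.4]
old_logits, new_logits = model.forward(x)

# Changing the new-class head leaves the old-class logits untouched.
model.new_head[0] = [w + 1.0 for w in model.new_head[0]]
old_after, new_after = model.forward(x)
```

Because the heads hold separate parameters, an update applied only to `new_head` cannot alter the old-class predictions directly; any remaining interaction would have to pass through the shared trunk.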
The comparison with FPN-IL further reveals the difference between feature-level and response-level distillation. FPN-IL transfers knowledge at the feature-pyramid level and is effective in preserving multi-scale representations, which is important for remote-sensing objects with large scale variations. In contrast, the proposed method focuses on prediction-response-level knowledge transfer, where informative logits and localization responses are dynamically selected for distillation. The experimental results show that response-level selection is especially beneficial for old-class retention. Feature-level and response-level distillation may be complementary in principle, but their combination requires careful design. As shown in the additional feature-based distillation analysis, naively adding feature alignment to the proposed response-level framework does not necessarily improve performance and may over-constrain the shared representation.
For the DOTA new-class results, the proposed method obtains a positive Diff compared with its corresponding joint-training reference. This result should be interpreted carefully. It does not indicate that incremental learning is generally superior to joint training. Instead, the positive gap may be caused by several factors, including the regularization effect of the frozen teacher model, the decoupled optimization of old and new branches, and category-specific data distribution differences under the adopted split. Therefore, the main conclusion from the DOTA results is that the proposed framework can maintain competitive new-class adaptation while improving old-class retention, rather than universally outperforming joint training.
Remote-sensing datasets often exhibit natural class imbalance, where some categories contain dense and frequent instances while others appear sparsely. The proposed method does not explicitly introduce a class-balanced sampler or category re-weighting strategy. Instead, MDS partially alleviates this issue by selecting informative teacher responses from both channel-wise and global perspectives, which reduces the dominance of redundant high-frequency responses and encourages the student model to focus on discriminative old-class knowledge. Nevertheless, class imbalance is not fully solved in this work. Combining the proposed response-level distillation with class-balanced sampling or adaptive category re-weighting is a promising direction for future research.
Despite its effectiveness, this work still has several limitations. First, the current experiments mainly follow the large-step class-incremental protocol adopted in previous remote-sensing incremental detection studies, which ensures fair comparison with existing methods such as FPN-IL. This setting is also consistent with practical remote-sensing scenarios, where scene transitions or changes in monitoring tasks may introduce a group of new object categories simultaneously rather than only one category at a time. For example, when the observed area changes from relatively simple rural scenes to more complex urban or transportation-related scenes, multiple new categories may emerge together with different spatial layouts, scales, and background distributions. Therefore, the adopted old-10/new-10 and old-8/new-7 protocols provide a meaningful evaluation of substantial category expansion in remote-sensing incremental detection. Nevertheless, small-step sequential scenarios, where only a few categories are introduced at each stage, are also important for evaluating long-term continual learning ability and will be further investigated in future work. Second, this work mainly focuses on response-level distillation because ERD is the most directly related baseline. Although the additional feature-based distillation analysis shows that simple feature alignment does not further improve the proposed framework, more carefully designed hybrid feature-response distillation strategies remain worth investigating. In future work, we will further explore longer incremental sequences, small-step class-incremental protocols, and adaptive feature-response distillation strategies to improve the robustness and generality of remote-sensing incremental object detection.