A Dual-Branch Detector Based on the Multi-Granularity Dynamic Selection Mechanism for Remote Sensing Incremental Detection

Li, Shixi; Wang, Weiji; Xu, Yousheng; Yao, Wei; Xu, Shengzhou

doi:10.3390/rs18122032

by Shixi Li, Weiji Wang, Yousheng Xu, Wei Yao * and Shengzhou Xu

School of Computer Science, South-Central Minzu University, Wuhan 430070, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 2032; https://doi.org/10.3390/rs18122032

Submission received: 14 May 2026 / Revised: 10 June 2026 / Accepted: 16 June 2026 / Published: 18 June 2026

(This article belongs to the Special Issue Object Detection in Remote Sensing Images Based on Artificial Intelligence)

Highlights

What are the main findings?

A dual-branch detector framework is proposed for remote sensing incremental object detection, which decouples the learning pathways of old and new classes to alleviate inter-class confusion and catastrophic forgetting.
A multi-granularity dynamic selection (MDS) strategy and a sigmoid-based DIST loss are introduced to improve old-class knowledge transfer by selecting informative teacher responses and preserving inter-class and intra-class relationships.

What are the implications of the main findings?

The proposed framework provides a practical solution for updating remote sensing object detectors when new object categories emerge, without requiring full retraining with all historical data.
Experiments on DIOR and DOTA demonstrate that the method can achieve a favorable balance between old-class retention and new-class adaptation, indicating its potential for continual remote sensing image interpretation in evolving scenarios.

Abstract

In practical remote sensing object detection, deep learning models are often deployed in an incremental learning setting: when new target types that were not seen during training appear, a pre-trained model must acquire the new knowledge without suffering catastrophic forgetting. Among the techniques proposed for this setting, knowledge distillation (KD)-based regularization has proven to be one of the most effective. Current KD-based approaches primarily focus on addressing inter-task confusion and optimizing feature selection during distillation. In this paper, we propose a dual-branch detector framework for independent learning and a multi-granularity dynamic selection strategy. The former decouples the detection tasks for old and new classes to mitigate inter-class confusion, while the latter is a distillation mechanism designed to transfer critical old-class information precisely. Moreover, we apply a DIST loss that aligns both inter-class and intra-class relations, further enhancing the fidelity of old-class knowledge transfer. Experiments on the DIOR and DOTA datasets demonstrate that our method significantly outperforms state-of-the-art incremental learning approaches for remote sensing object detection and remains robust across different remote sensing scenarios.

Keywords:

remote sensing; object detection; incremental learning; knowledge distillation; dual-branch detector; DIST loss; multi-granularity dynamic selection

1. Introduction

With the rapid development of remote sensing technology, optical remote sensing images (RSIs) have seen significant improvements in spatial resolution and textural detail, providing a richer data foundation for object detection tasks. These advances enable optical remote sensing object detection to identify surface objects and their spatial distribution more accurately, so that it plays a crucial role in fields such as environmental monitoring, geological hazard detection, Land Use and Land Cover (LULC) mapping, geographic information system (GIS) updating, precision agriculture, and urban planning [1]. Deep learning-based object detection algorithms have achieved remarkable success in traditional natural image domains, with models such as R-CNN [2], Fast R-CNN [3], Faster R-CNN [4], the YOLO series [5], RetinaNet [6], GFL [7], and CenterNet [8]. However, object detection in remote sensing images faces unique challenges: targets are typically multi-scale, arbitrarily oriented, densely packed, and even mutually occluded. Recent studies further indicate that tiny objects, complex backgrounds, and label assignment uncertainty remain important obstacles for robust remote sensing object detection [9]. These characteristics are reflected in recent remote sensing object detection datasets such as NWPU VHR-10 [10], DOTA [11], and DIOR [12], and many specialized approaches, including Oriented R-CNN [13], OAN [14], and FFCA-YOLO [15], have therefore been developed. Recent studies have further shown that large-scene remote sensing object detection benefits from context-aware and attention-guided detector design, which is especially important for handling small objects, cluttered backgrounds, and substantial scale variation in aerial imagery [16]. Recent systematic reviews have also summarized the rapid development of optical remote sensing object detection methods, benchmarks, and operational applications, indicating that robust object detection remains a fundamental task in remote sensing image interpretation [17].

In the construction of computer vision models, traditional paradigms typically assume that training data for all tasks can be permanently stored and accessed at any time. However, in practical applications, as task sequences continuously expand, systems inevitably face the dilemma of historical data becoming unavailable. This constraint arises from both the physical limitations of storage media and the exponential growth of computational costs associated with retraining. When models trained on old tasks are directly updated with new tasks, catastrophic parameter drift occurs, manifesting as significant performance degradation on previous tasks. This phenomenon is defined as catastrophic forgetting, which can hardly be effectively handled by existing classical object detection training paradigms. Incremental learning (IL) is specifically designed to address this pervasive issue of catastrophic forgetting. Based on whether the task identity is provided or must be inferred [18], researchers categorize incremental learning into three types: task-incremental, domain-incremental, and class-incremental learning. This paper focuses on the challenge of class-incremental learning in remote sensing object detection, as it is particularly relevant to the dynamic nature of remote sensing data, where new object classes are frequently introduced while maintaining the accuracy of previously learned classes is crucial for practical applications.

Knowledge distillation [19] is a technique that effectively mitigates the catastrophic forgetting of old-class knowledge in class-incremental learning by introducing a regularization constraint mechanism.

Before incremental learning was introduced, knowledge distillation for object detection primarily extracted knowledge through the combined distillation of detector components. For instance, studies such as [20,21] adopted a comprehensive approach by distilling all components of the detector. However, these methods fundamentally relied on feature-based knowledge distillation, leaving the potential of logit-based distillation underexplored in incremental object detection (IOD) scenarios. Unlike feature distillation, logit-based methods [22] leverage the teacher model’s logit outputs (e.g., classification scores or bounding box predictions), which inherently encapsulate the teacher’s reasoning information. This characteristic enables the student model to better mimic the teacher’s decision-making process. Consequently, carefully designing distillation strategies for different types of logits becomes critical to fully exploit their advantages. Building on this, for incremental object detection, Elastic Response Distillation (ERD) [23] observed that not all responses are crucial for preventing catastrophic forgetting and proposed an Elastic Response Selection (ERS) method for logit distillation from old-class teachers. The advantage of ERS is that it can automatically select key information for incremental learning based on the teacher model’s logits, thereby effectively reducing the interference between old and new knowledge and lowering the risk of catastrophic forgetting. However, the ERS selection mechanism overlooks local information at the channel level, which reduces its ability to extract key fine-grained information of old classes, so it fails to handle the confusion between old and new classes well. Figure 1 presents two representative failure cases of ERD on DIOR. In Figure 1a, multiple old-class airplane instances in a dense airport scene are misclassified as the newly introduced windmill class. In Figure 1b, old-class airplane targets are also incorrectly assigned to the new windmill category, further indicating that ERD may suffer from old/new class confusion when fine-grained response information is insufficiently preserved.

Furthermore, the parameter-sharing mechanism of current incremental learning models that employ a single detector architecture has inherent flaws. Because old and new categories share critical parameters, the backward gradients of both category groups update these parameters simultaneously during training. This not only introduces implicit constraints into the training process but also easily couples the two groups in the feature representation space, making it difficult to maintain independent and stable feature distributions. Consequently, classification boundaries become blurred, which exacerbates the risk of class confusion.

To address the aforementioned issues, we propose improvements to the ERS distillation method and its model architecture. Specifically, we propose a dual-branch detector framework for independent learning and a multi-granularity dynamic selection strategy. Unlike previous works, our method explicitly considers both channel-wise and spatial information during the fine-grained selection process. We then integrate these two types of information to make a unified selection and purification for the distillation supervision of old classes. Architecturally, we employ dual-branch detectors that share common backbone and neck network parameters, with each branch specifically dedicated to training old classes and new classes, respectively. This design effectively mitigates the risk of class confusion caused by overlapping classification boundaries at identical image locations between old and new classes. Furthermore, we adapt DIST loss [24] to the old-class classification logit distillation process through a sigmoid-based formulation, which preserves inter-class and intra-class relationships from the teacher responses. The proposed method is evaluated on the remote sensing datasets DIOR and DOTA, and the experimental results demonstrate that it effectively alleviates catastrophic forgetting and balances old-class retention with new-class adaptation under different remote-sensing scenarios.

The main contributions of this work are summarized as follows:

(1) We propose a dual-branch detector framework for remote sensing incremental object detection, which decouples the learning pathways of old and new classes to alleviate class confusion and catastrophic forgetting.

(2) We propose a multi-granularity dynamic selection (MDS) strategy that combines channel-wise and global response filtering, enabling more informative teacher responses to be selected for old-class knowledge distillation.

(3) We introduce a sigmoid-based DIST loss for classification logit distillation, which preserves both inter-class and intra-class relationships in the selected teacher responses.

(4) Extensive experiments on DIOR and DOTA demonstrate that the proposed method achieves a favorable balance between old-class retention and new-class adaptation under remote sensing incremental detection scenarios.

The remainder of this paper is organized as follows. Section 2 reviews related works on dense detection frameworks, class-incremental object detection, knowledge distillation, and response selection. Section 3 presents the proposed dual-branch incremental detector and the MDS strategy. Section 4 describes the experimental setup, including datasets, implementation details, and evaluation metrics. Section 5 reports the experimental results and discussion. Section 6 concludes this paper.

2. Related Works

2.1. Dense Detection Frameworks and Related Losses

The design of dense object detectors and their loss functions plays an important role in improving detection performance. Traditional classification losses, such as Cross-Entropy Loss [25], may suffer from severe foreground–background imbalance, especially in dense detection scenarios with extreme positive–negative sample ratios. To address this problem, Focal Loss [6] introduces a modulating factor to reduce the contribution of easy samples and focus the training process on hard examples. However, Focal Loss is mainly designed for discrete classification labels and cannot directly model the joint relationship between classification confidence and localization quality.
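The effect of the modulating factor can be illustrated with a minimal NumPy sketch that compares binary cross-entropy with Focal Loss [6] on an easy and a hard positive sample. The defaults `gamma=2.0` and `alpha=0.25` are values commonly used with Focal Loss and are assumptions here, not settings reported in this paper.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Per-sample binary cross-entropy for probabilities p and labels y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-sample focal loss: alpha_t * (1 - p_t)^gamma * cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

p = np.array([0.95, 0.60])  # an easy positive and a hard positive
y = np.array([1, 1])
ce = binary_cross_entropy(p, y)
fl = focal_loss(p, y)
ratio = fl / ce             # how strongly each sample is down-weighted
```

The ratio equals `alpha_t * (1 - p_t)^gamma`, so the easy sample is suppressed far more strongly than the hard one.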

Generalized Focal Loss (GFL) [7] further extends this idea and provides a unified dense detection framework. It should be noted that GFL in this paper refers to the GFL detector framework rather than only a single loss function. Specifically, GFL integrates Quality Focal Loss (QFL) for joint classification-quality estimation and Distribution Focal Loss (DFL) for bounding box distribution regression. QFL uses continuous labels to jointly represent category confidence and localization quality, while DFL models bounding box locations as discrete probability distributions and embeds them into regression optimization through an integral form. Therefore, GFL provides a suitable one-stage detection baseline for our incremental detection framework.
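The integral form used by DFL can be sketched as follows: each box side is predicted as a softmax distribution over discrete offsets, and the regressed offset is the expectation of that distribution. This is a simplified single-side illustration; the five bins and the example logits are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dfl_expected_offset(side_logits):
    """Decode one box side as the expectation over bins 0..n_bins-1 (integral form)."""
    probs = softmax(side_logits)
    bins = np.arange(side_logits.shape[-1])
    return (probs * bins).sum(axis=-1)

def dfl_loss(side_logits, target):
    """Distribution Focal Loss for a 1-D logit vector and a continuous target offset."""
    n_bins = side_logits.shape[-1]
    left = min(int(np.floor(target)), n_bins - 2)  # nearest bin at or below the target
    right = left + 1                               # nearest bin above the target
    log_probs = np.log(softmax(side_logits))
    # Cross-entropy on the two neighbouring bins, weighted by proximity to the target.
    return -((right - target) * log_probs[left] + (target - left) * log_probs[right])

uniform = np.zeros(5)
offset_uniform = dfl_expected_offset(uniform)      # expectation of a uniform distribution

peaked = np.array([-100.0, 0.0, 0.0, -100.0, -100.0])
offset_peaked = dfl_expected_offset(peaked)        # mass split between bins 1 and 2
loss_peaked = dfl_loss(peaked, target=1.5)
```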

In addition to supervised detection losses, distillation losses are also important for preserving knowledge in incremental learning. Existing logit distillation methods often rely on L2 loss or KL divergence [19]. L2 loss mainly performs magnitude matching and ignores the relational structure among categories, while KL divergence focuses on inter-class relationships but does not explicitly preserve intra-class relationships. DIST loss [24] addresses this issue by replacing strict distribution matching with the Pearson correlation coefficient [26], enabling both inter-class and intra-class relational knowledge to be transferred. In this work, we adapt DIST loss to the classification logit distillation process by using sigmoid activation instead of softmax, so that class-wise response relationships can be better preserved under the class-incremental object detection setting.

2.2. Class-Incremental Object Detection

Incremental learning aims to update a model with new knowledge while preserving previously learned knowledge. According to whether task identity is available, incremental learning can be generally divided into task-incremental, domain-incremental, and class-incremental learning [18]. This paper focuses on class-incremental object detection, where newly introduced object categories need to be learned without severely degrading the detection performance on old categories.

Existing class-incremental learning methods can be roughly divided into data replay-based methods and regularization-based methods. Data replay methods preserve a subset of old samples and jointly train the detector with both old and new data. Representative methods, such as iCaRL [27] and experience replay for continual learning [28], can alleviate catastrophic forgetting by maintaining old data distributions. However, they require additional storage and may introduce privacy or data access issues, which limits their applicability in realistic remote sensing scenarios.

Regularization-based methods avoid storing old samples and instead constrain the updated model to retain old knowledge. Learning without Forgetting (LwF) [29] uses the response outputs of the old model as soft labels to regularize the new model. In object detection, Fast-IL [30] and Faster-IL [31] introduce knowledge distillation into incremental object detection frameworks. In the remote sensing domain, FPN-IL [32] further explores feature pyramid-based knowledge distillation for incremental detection of remote sensing objects, showing the importance of multi-scale feature preservation. In addition, some emerging studies, such as incremental few-shot object detection [33] and graph-based few-shot incremental learning [34], indicate that practical recognition systems may need to address knowledge preservation, limited supervision, and uncertainty in class membership simultaneously. Recent open-vocabulary remote sensing object detection studies also attempt to detect categories unseen during training through teacher–student learning, which is conceptually related to adapting detectors to newly emerging object classes [35].

2.3. Knowledge Distillation for Object Detection

Knowledge distillation [19] transfers knowledge from a teacher model to a student model and has been widely used for model compression, object detection, and incremental learning. In object detection, existing distillation methods can be generally divided into feature imitation and logit mimicking [22], as illustrated in Figure 2.

Feature imitation methods transfer knowledge by aligning intermediate feature maps between teacher and student detectors. For example, FitNet [36] introduces intermediate feature supervision for thinner networks. FGD [37], namely Focal and Global Distillation, separates foreground and background regions and applies feature distillation with global relation modeling. MGD [38], namely Masked Generative Distillation, randomly masks student features and forces the student model to reconstruct teacher-like representations. These feature-based methods are effective for transferring spatial representation knowledge, but feature-level constraints may also introduce interference between old and new categories in incremental detection, especially when class distributions change.

Logit mimicking methods directly distill the prediction outputs of the detector, such as classification logits and bounding box regression distributions. LD [39], namely Localization Distillation, transfers localization knowledge by distilling bounding box distributions. LS-KD [40], namely Logit Standardization Knowledge Distillation, normalizes logits before distillation to improve the transfer of relational information. Compared with feature imitation, logit mimicking is more directly related to the final prediction behavior of detectors. Therefore, it is particularly suitable for class-incremental object detection, where preserving old-class decision responses is crucial.

2.4. Response Selection in Incremental Object Detection

Not all teacher responses are equally useful for preventing catastrophic forgetting. Directly distilling all responses may transfer redundant or noisy information, which can weaken the learning of new categories and reduce the effectiveness of old-class knowledge preservation. To address this issue, Elastic Response Distillation (ERD) [23] introduces Elastic Response Selection (ERS) for logit distillation in incremental object detection.

The core idea of ERS is to select informative teacher responses according to the teacher model’s output distribution. Specifically, ERS first compresses the teacher logits along the channel dimension by taking the maximum response at each spatial location, resulting in a single spatial response map. Then, a global threshold is generated based on the mean and standard deviation of this response map, and only responses above the threshold are selected for distillation. This strategy allows the model to automatically filter teacher responses without manually selecting object regions.
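The selection described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the reference ERD implementation: the threshold is assumed to take the form mean + k · std, and the function name `ers_mask` and the parameter `k` are hypothetical.

```python
import numpy as np

def ers_mask(logits_t, k=1.0):
    """Sketch of ERS-style selection on teacher logits of shape (H, W, C)."""
    response = logits_t.max(axis=-1)                  # channel-wise max -> (H, W)
    threshold = response.mean() + k * response.std()  # one global threshold
    return response >= threshold                      # boolean (H, W) mask

# Toy teacher logits: a single strong response at position (0, 0), channel 1.
logits_t = np.zeros((4, 4, 3))
logits_t[0, 0, 1] = 10.0
mask = ers_mask(logits_t, k=1.0)
```

Because the channel axis is reduced before thresholding, the mask no longer records which class channel produced the selected response.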

However, ERS performs response filtering only after channel-wise compression. As a result, class-specific channel information may be discarded before the selection process. This limitation is particularly problematic in remote sensing incremental detection, where visually similar old and new categories may appear in dense or cluttered scenes and activate different response channels at the same spatial location. If fine-grained channel-level information is ignored, old-class responses may be insufficiently preserved, leading to old/new class confusion.

Different from ERS, the proposed multi-granularity dynamic selection (MDS) strategy first conducts per-channel response filtering and then applies global spatial selection. By combining channel-wise and global response selection, MDS preserves more fine-grained class-specific information from the teacher model and provides more informative distillation targets for old-class knowledge transfer. This design directly addresses the limitations of ERS and serves as the key motivation for the proposed method.

3. Proposed Method

Figure 3 provides an overview of the proposed dual-branch incremental detection framework, including both the incremental training stage and the inference stage.

In this section, we introduce the proposed dual-branch incremental detector. We first describe the overall training and inference procedures of the framework. Then, we present the multi-granularity dynamic selection (MDS) strategy, which is used to select informative teacher responses for old-class distillation. Finally, we introduce the classification and regression distillation losses used to preserve old-class knowledge.

3.1. Overall Structure

As illustrated in Figure 3, the proposed framework consists of three key components: a dual-branch detector, the proposed MDS module, and two complementary distillation objectives. MDS is applied to the classification and regression responses of the frozen teacher detector to select informative old-class knowledge. Then, DIST loss [24] and LD loss [39] are employed to transfer classification and localization knowledge from the teacher model to the old-class distillation branch (ODB) of the student detector. Meanwhile, the new-class learning branch (NLB) is trained with ground-truth labels to learn newly introduced categories. In this way, old-class knowledge preservation and new-class adaptation are decoupled into two branches, which helps alleviate class confusion during incremental training. Therefore, the overall learning loss of the student detector is defined as follows:

$$
\begin{aligned}
L_{\mathrm{total}} &= L_{\mathrm{ODB}} + L_{\mathrm{NLB}} \\
&= \lambda_{1} L_{\mathrm{MDS\_cls}}(C_{t}, C_{s}) + \lambda_{2} L_{\mathrm{MDS\_bbox}}(B_{t}, B_{s}) + L_{\mathrm{NLB}}
\end{aligned}
\tag{1}
$$

In Equation (1), $\lambda_{1}$ and $\lambda_{2}$ are balancing coefficients for different loss terms, and we set $\lambda_{1} = \lambda_{2} = 1$ by default in all experiments. The subscripts $t$ and $s$ denote the teacher model and the student model, respectively. Specifically, $C_{t}$ and $C_{s}$ represent the classification logits or response maps generated by the teacher model and the student old-class distillation branch (ODB), respectively, while $B_{t}$ and $B_{s}$ denote their corresponding bounding box regression distributions or regression logits. The MDS operation is applied to both the classification responses and regression responses to obtain selected old-class distillation targets.

$L_{\mathrm{NLB}}$ denotes the collection of supervised detection losses generated when the new-class learning branch (NLB) is trained with ground-truth labels, and it is used to enhance the student model’s ability to detect newly introduced classes. $L_{\mathrm{MDS\_cls}}$ is the classification distillation loss after MDS-based response selection, where DIST loss is adopted to preserve relational knowledge among old-class logits. Meanwhile, $L_{\mathrm{MDS\_bbox}}$ uses LD loss to distill bounding box regression knowledge from the teacher model. Both distillation losses are applied to preserve old-class knowledge, while $L_{\mathrm{NLB}}$ is used for new-class learning.

In class-incremental object detection, a single detection head often struggles to distinguish old and new categories because their feature distributions and decision boundaries may overlap. This can lead to severe old/new class confusion during incremental training. To address this problem, we adopt a dual-branch detector architecture, as shown in Figure 3. The two branches share the same backbone and FPN features but are responsible for different category groups. The ODB branch focuses on preserving old-class knowledge through teacher-guided distillation, while the NLB branch learns newly introduced categories through ground-truth supervision. This decoupled design reduces the interference between old-class preservation and new-class adaptation.

3.1.1. Old Classes Distillation Branch (ODB)

This branch implements knowledge transfer from the teacher model through a distillation framework. Guided by the soft labels generated by the teacher model, it optimizes via distillation losses for classification probability and bounding box regression. This design preserves discriminative features and geometric localization capabilities for old categories. Compared to the single-peak supervision of hard labels, the inter-class probabilistic correlations embedded in soft labels effectively mitigate catastrophic forgetting of old-class features while maintaining boundary consistency in object detection [29].

3.1.2. New Classes Learning Branch (NLB)

This branch employs end-to-end supervision using ground-truth annotations. Built on the generalized focal loss (GFL) framework [7], it is optimized with the two GFL loss terms. The Quality Focal Loss (QFL) strengthens the correlation between classification confidence and localization quality, while the Distribution Focal Loss (DFL) models bounding box regression as a probability distribution over box locations, thereby improving the model’s ability to capture features of new classes.

3.1.3. Inference Procedure

During inference, an input image is first processed by the shared backbone and FPN. The ODB branch predicts old-class categories and their bounding boxes, while the NLB branch predicts newly introduced categories. Since the old and new categories are disjoint in the class-incremental setting, the outputs of the two branches are remapped to different category indices in the global label space and then concatenated, so no category-level conflict is introduced during prediction merging. Score thresholding and class-wise non-maximum suppression (NMS) are applied to the merged predictions to obtain the final detection results. In this way, the ODB branch preserves the detection capability for old classes, whereas the NLB branch provides predictions for new classes.
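The merging procedure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: boxes are axis-aligned in (x1, y1, x2, y2) format, each branch emits branch-local integer label indices, and the function names, the dictionary layout, and the thresholds `score_thr=0.05` and `iou_thr=0.5` are hypothetical choices rather than the paper's settings.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thr):
    """Greedy NMS; returns indices of kept boxes in descending score order."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_thr]  # drop boxes overlapping the kept one
    return np.array(keep, dtype=int)

def merge_branches(odb, nlb, old_ids, new_ids, score_thr=0.05, iou_thr=0.5):
    """Remap branch-local labels to global ids, concatenate, threshold, class-wise NMS."""
    boxes = np.concatenate([odb["boxes"], nlb["boxes"]])
    scores = np.concatenate([odb["scores"], nlb["scores"]])
    labels = np.concatenate([np.asarray(old_ids)[odb["labels"]],
                             np.asarray(new_ids)[nlb["labels"]]])
    valid = scores >= score_thr
    boxes, scores, labels = boxes[valid], scores[valid], labels[valid]
    kept = []
    for cls in np.unique(labels):  # NMS is run separately for each global class
        idx = np.flatnonzero(labels == cls)
        kept.extend(idx[nms(boxes[idx], scores[idx], iou_thr)])
    kept = np.array(kept, dtype=int)
    return boxes[kept], scores[kept], labels[kept]

# Toy example: two overlapping old-class boxes and one new-class box at the same place.
odb = {"boxes": np.array([[0., 0., 10., 10.], [1., 1., 11., 11.]]),
       "scores": np.array([0.9, 0.8]),
       "labels": np.array([0, 0])}
nlb = {"boxes": np.array([[0., 0., 10., 10.], [50., 50., 60., 60.]]),
       "scores": np.array([0.7, 0.01]),
       "labels": np.array([0, 0])}
boxes_out, scores_out, labels_out = merge_branches(odb, nlb, old_ids=[0, 1], new_ids=[2])
```

In the toy example, the duplicate old-class box is suppressed and the low-score new-class box is thresholded out, while the old-class and new-class detections at the same location both survive because NMS is applied per class.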

3.2. MDS: Multi-Granularity Dynamic Selection

As indicated by previous studies [23], directly distilling all teacher responses is not always beneficial for incremental object detection, because redundant or noisy responses may weaken the effectiveness of old-class knowledge transfer. Therefore, response selection plays an important role in mitigating catastrophic forgetting. Elastic Response Selection (ERS) [23] addresses this issue by automatically selecting informative teacher responses according to the output distribution of the teacher model. Specifically, ERS first compresses the teacher logits along the channel dimension by taking the maximum response at each spatial location, and then applies a global threshold based on the mean and standard deviation of the compressed response map.

Although ERS can reduce the influence of redundant responses, it still has an important limitation: the channel dimension is compressed before response selection. As a result, class-specific channel information may be discarded before the selection process, which is undesirable for remote sensing incremental detection. In remote sensing scenes, visually similar old and new categories may appear in dense or cluttered regions and activate different channels at the same spatial location. If such channel-wise information is ignored, the selected teacher responses may be insufficient for preserving old-class knowledge, leading to old/new class confusion.

To address this limitation, we propose a multi-granularity dynamic selection (MDS) strategy. Different from ERS, MDS performs response selection in two stages. First, it conducts per-channel response filtering to preserve class-specific informative responses. Then, it applies global spatial selection to further filter the selected responses. By combining channel-wise and global response selection, MDS provides more fine-grained and informative distillation targets for old-class knowledge transfer. This design preserves channel-specific response information before global filtering, which directly addresses the information loss issue caused by the channel-wise compression in ERS. The comparison between ERS and MDS is illustrated in Figure 4.

In the logit distillation process, given an input image, we feed it into the frozen teacher model $T$. From the classification head of $T$, we obtain the teacher logit matrix $\mathrm{logits}_{t} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote the spatial height and width of the logit map, and $C$ denotes the number of class channels. Similarly, from the ODB classification head of the student model $S$, we obtain the student logit matrix $\mathrm{logits}_{s} \in \mathbb{R}^{H \times W \times C}$. A selective mask for distillation is generated from the teacher logits $\mathrm{logits}_{t}$. As shown in Figure 4b, MDS generates the final selection mask through two successive stages: per-channel selection and global selection.

3.2.1. Step 1: Per-Channel Selection

In this step, for each channel $c \in \{1, 2, \dots, C\}$, calculate the mean and standard deviation (std) across the spatial dimension $H \times W$, and generate an $H \times W \times C$ mask.

The mean and standard deviation are calculated for each channel $c$:

$$
\begin{aligned}
\mathrm{mean}_{c} &= \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} \mathrm{logits}_{t}(i, j, c) \\
\mathrm{std}_{c} &= \sqrt{\frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} \left(\mathrm{logits}_{t}(i, j, c) - \mathrm{mean}_{c}\right)^{2}}
\end{aligned}
\tag{2}
$$

Then we use $\mathrm{mean}_{c}$ and $\mathrm{std}_{c}$ to generate $\mathrm{mask}_{\mathrm{initial}}$:

$$
\mathrm{mask}_{\mathrm{initial}}(i, j, c) =
\begin{cases}
1 & \text{if } \mathrm{logits}_{t}(i, j, c) \geq \mathrm{mean}_{c} + k_{1} \cdot \mathrm{std}_{c} \\
0 & \text{otherwise}
\end{cases}
\tag{3}
$$

Here, $k_{1}$ is a hyperparameter that controls the strictness of the selection (e.g., $k_{1} = 1$ or $k_{1} = 2$).

Compress the $H \times W \times C$ mask $\mathrm{mask}_{\mathrm{initial}}$ into an $H \times W \times 1$ mask. Specifically, if a spatial position $(i, j)$ passes the selection in any channel (i.e., $\mathrm{mask}_{\mathrm{initial}}(i, j, c) = 1$), then the position in the compressed $\mathrm{mask}_{C}(i, j)$ is 1; otherwise, it is 0.

$$
\mathrm{mask}_{C}(i, j) =
\begin{cases}
1 & \text{if } \sum_{c = 1}^{C} \mathrm{mask}_{\mathrm{initial}}(i, j, c) \geq 1 \\
0 & \text{otherwise}
\end{cases}
\tag{4}
$$
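Equations (2)–(4) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code; the function name `per_channel_selection` and the toy logits are hypothetical, and the teacher logits are assumed to be stored as an (H, W, C) array.

```python
import numpy as np

def per_channel_selection(logits_t, k1=1.0):
    """Sketch of Step 1 of MDS on teacher logits of shape (H, W, C)."""
    # Eq. (2): per-channel statistics over the spatial axes (population std, 1 / (H * W)).
    mean_c = logits_t.mean(axis=(0, 1), keepdims=True)
    std_c = logits_t.std(axis=(0, 1), keepdims=True)
    # Eq. (3): per-channel thresholding, shape (H, W, C).
    mask_initial = logits_t >= mean_c + k1 * std_c
    # Eq. (4): a position is kept if it passes in at least one channel, shape (H, W).
    mask_c = mask_initial.any(axis=-1)
    return mask_initial, mask_c

# Toy teacher logits: two classes responding at different positions.
logits_t = np.zeros((4, 4, 2))
logits_t[0, 0, 0] = 5.0   # strong response in channel 0
logits_t[3, 3, 1] = 3.0   # weaker response in channel 1
mask_initial, mask_c = per_channel_selection(logits_t, k1=1.0)
```

Because each channel has its own threshold, the weaker response in channel 1 is judged against channel 1 statistics only and is still selected.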

3.2.2. Step 2: Global Selection

The compressed mask $\mathrm{mask}_{C}(i, j)$ is used to select from the teacher logits $\mathrm{logits}_{t}(i, j, c)$, yielding the initially selected results $\mathrm{logits}_{S}(i, j, c)$:

$$
\mathrm{logits}_{S}(i, j, c) = \mathrm{logits}_{t}(i, j, c) \cdot \mathrm{mask}_{C}(i, j)
\tag{5}
$$

For the initially selected results, the channel-wise maximum is taken at each pixel, a threshold is set to the mean of the resulting map plus $k_{2}$ times its standard deviation, and the final mask is generated from this threshold:

$$
\mathrm{logits}_{\max}(i, j) = \max_{c} \mathrm{logits}_{S}(i, j, c)
\tag{6}
$$

$$
\mathrm{mask}_{final}(i, j) =
\begin{cases}
1 & \text{if } \mathrm{logits}_{\max}(i, j) \geq \mathrm{mean}_{\max} + k_{2} \cdot \mathrm{std}_{\max} \\
0 & \text{otherwise}
\end{cases}
\tag{7}
$$

Here, $\mathrm{logits}_{\max}(i, j)$ is the $H \times W$ matrix obtained by taking the channel-wise maximum of $\mathrm{logits}_{S}(i, j, c)$, and $\mathrm{mean}_{\max}$ and $\mathrm{std}_{\max}$ are its mean and standard deviation.

The final mask $\mathrm{mask}_{final}(i, j)$ is then used to select informative teacher and student logits for distillation:

$$
\begin{aligned}
C_{t}^{mds} &= \mathrm{mask}_{final}(i, j) \cdot \mathrm{logits}_{t}(i, j, c) \\
C_{s}^{mds} &= \mathrm{mask}_{final}(i, j) \cdot \mathrm{logits}_{s}(i, j, c)
\end{aligned}
\tag{8}
$$
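The global stage in Equations (5)–(8) can be sketched as follows. The function names `global_mask` and `apply_mask` are hypothetical, and one detail is an assumption made here: $\mathrm{mean}_{\max}$ and $\mathrm{std}_{\max}$ are computed over all $H \times W$ entries of $\mathrm{logits}_{\max}$, including positions already zeroed by $\mathrm{mask}_{C}$, since the text does not say whether those positions are excluded.

```python
import numpy as np


def global_mask(logits_t: np.ndarray, mask_c: np.ndarray, k2: float = 1.0) -> np.ndarray:
    """Sketch of MDS step 2: logits_t is (H, W, C), mask_c is (H, W) with 0/1 entries."""
    # Eq. (5): zero out positions rejected by the per-channel stage.
    logits_sel = logits_t * mask_c[..., None]           # shape (H, W, C)

    # Eq. (6): channel-wise maximum at every spatial position.
    logits_max = logits_sel.max(axis=2)                 # shape (H, W)

    # Eq. (7): global threshold from the statistics of the H x W map
    # (assumption: statistics include the zeroed positions).
    threshold = logits_max.mean() + k2 * logits_max.std()
    return (logits_max >= threshold).astype(logits_t.dtype)


def apply_mask(logits: np.ndarray, mask_final: np.ndarray) -> np.ndarray:
    """Eq. (8): broadcast the (H, W) mask over the class channels."""
    return logits * mask_final[..., None]
```

The same `mask_final` is applied to both teacher and student logits, so the two selected tensors stay aligned position by position before they are passed to the distillation loss.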

For the logits of the regression output, we apply the same MDS operation to obtain the corresponding mask and generate the selected $B_{t}^{mds}$ and $B_{s}^{mds}$, following the process in Equations (2)–(8).

As shown in Figure 4a, ERS first performs channel-wise maximum projection and then applies a global threshold on the compressed $H \times W \times 1$ response map. In contrast, MDS preserves channel-specific information before global filtering, enabling more informative responses to be retained for distillation.

3.3. Classification Distillation Based on the DIST Loss

The DIST loss [24] is introduced to improve the logit distillation process for the classification head in our model. This method introduces the Pearson correlation coefficient [26] as a matching metric, relaxing the traditional KL divergence’s [19] strict requirement for exact alignment between the predictions of teacher and student models. Instead, it focuses on capturing the intrinsic relationships between prediction results, thereby better accommodating the significant discrepancies introduced by powerful teacher models (see Figure 5).

Furthermore, to adapt to incremental learning during logit distillation, we modify the original approach by replacing the softmax operation on the logits with a sigmoid operation. The sigmoid function maps the selected teacher logits $C_{t}^{mds}$ to values $Y_{t}$ in the range $(0, 1)$:

$$
Y_{t} = \mathrm{Sigmoid}(C_{t}^{mds})
\tag{9}
$$

Similarly, the student model’s $Y_{s}$ is defined as $Y_{s} = \mathrm{Sigmoid}(C_{s}^{mds})$.

Then, the Pearson correlation coefficient is employed to measure the correlation between the predictions of the teacher model and the student model. Its value lies in $[-1, 1]$, where $1$ and $-1$ indicate a perfect positive and negative linear correlation, respectively, while $0$ signifies no linear correlation. Specifically, for the predictions $Y_{s}$ and $Y_{t}$ of each instance, the Pearson correlation coefficient can be expressed as:

$$
\rho_{p}(Y_{s}, Y_{t}) = \frac{(Y_{s} - \bar{Y}_{s}) \cdot (Y_{t} - \bar{Y}_{t})}{\| Y_{s} - \bar{Y}_{s} \| \, \| Y_{t} - \bar{Y}_{t} \|}
\tag{10}
$$

where $Y_{s}$ and $Y_{t}$ are the sets of the selected probability vectors produced by the student model and teacher model, respectively, and $\bar{Y}_{s}$ and $\bar{Y}_{t}$ are the mean values of $Y_{s}$ and $Y_{t}$.

The DIST loss is computed from the Pearson correlation coefficient between $Y_{s}$ and $Y_{t}$. It consists of two parts: an inter-class relation loss and an intra-class relation loss.

The inter-class relation loss measures the correlation between the predictions of the teacher model and the student model for each instance. The loss value is one minus the Pearson correlation coefficient between the two prediction vectors:

$$
L_{inter} = 1 - \rho_{p}(Y_{s}, Y_{t})
\tag{11}
$$

The intra-class relation loss focuses on the relationships between the predicted probabilities of different instances within the same class. It is computed by transposing the prediction matrices, so that each vector corresponds to one class (column), and then applying the same Pearson-based loss:

$$
L_{intra} = 1 - \rho_{p}(Y_{s}^{T}, Y_{t}^{T})
\tag{12}
$$

where $Y_{s}^{T}$ and $Y_{t}^{T}$ are the transposes of the prediction matrices of the student and teacher models.

The total DIST loss function is obtained by weighted summation of the inter-class relation loss and intra-class relation loss:

$$
L_{\mathrm{DIST\_loss}} = \alpha L_{inter} + \beta L_{intra}
\tag{13}
$$

where $\alpha$ and $\beta$ are balancing parameters.
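Equations (9)–(13) can be sketched in NumPy as below. Several details are assumptions made for illustration rather than facts from the text: the selected logits are arranged as an `(N, C)` matrix with one row per selected spatial position, the per-row and per-column losses are averaged, and a small `eps` guards against division by zero. The names `pearson_rows` and `dist_loss` are hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def pearson_rows(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Row-wise Pearson correlation of two (N, C) matrices, as in Eq. (10)."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.linalg.norm(a_c, axis=1) * np.linalg.norm(b_c, axis=1) + eps
    return num / den


def dist_loss(c_s: np.ndarray, c_t: np.ndarray, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Sigmoid-based DIST loss on selected student/teacher logits of shape (N, C)."""
    y_s, y_t = sigmoid(c_s), sigmoid(c_t)                  # Eq. (9)
    l_inter = np.mean(1.0 - pearson_rows(y_s, y_t))        # Eq. (11), mean over instances
    l_intra = np.mean(1.0 - pearson_rows(y_s.T, y_t.T))    # Eq. (12), mean over classes
    return float(alpha * l_inter + beta * l_intra)         # Eq. (13)
```

A student that reproduces the teacher logits exactly gets a loss close to 0, while a student whose logits are the negated teacher logits is perfectly anti-correlated in both directions and gets a loss close to $2\alpha + 2\beta$. Because Pearson correlation is invariant to positive scaling and shifting, the student is not forced to match the teacher's values exactly, only their relative structure.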

3.4. Regression Distillation Based on the LD Loss

The logit distillation process for the regression head in the detection model is also improved in our research. A critical observation concerns the teacher model’s regression capability: even when no old-class object is detected at a specific location (i.e., the classification confidence is absent), the teacher model can still generate meaningful regression boxes containing old-class localization knowledge. This implicit spatial reasoning ability poses a distillation challenge: how to effectively transfer such geometric intuition from teacher to student. We build on the GFL model [7], in which each of the four edges of the bounding box (top, bottom, left, and right) is modeled as a discrete probability distribution obtained by applying the Softmax function to the edge’s logits. This probabilistic representation allows us to reinterpret the teacher’s regression outputs as learnable spatial distributions, which becomes the foundation for our distillation strategy.

The probability matrix for each bounding box $B$ can be formulated as

$$
B = [\delta_{t}, \delta_{b}, \delta_{l}, \delta_{r}] \in \mathbb{R}^{n \times 4}
\tag{14}
$$

where each column is the $n$-bin discrete distribution of one edge. Therefore, we can extract the incremental information of bounding boxes from the teacher model $T$ and distill it to the student model $S$ using the KL divergence loss,

$$
L_{LD} = \sum L_{KL}(B_{t}^{mds}, B_{s}^{mds})
\tag{15}
$$

where $B_{t}^{mds}$ represents the regression response obtained from the selected bounding box using the MDS method in the teacher detector, while $B_{s}^{mds}$ denotes the corresponding regression response in the student detector.
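A minimal sketch of Equation (15) is given below. It assumes that the selected regression responses are stored as raw edge logits of shape `(N, n, 4)` (N selected locations, n bins, 4 edges), that the Softmax is taken over the bin axis, and that $L_{KL}$ is the divergence from the teacher distribution to the student distribution. None of these layout or direction choices is stated in the text, and the name `ld_loss` is hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ld_loss(b_t_logits: np.ndarray, b_s_logits: np.ndarray, eps: float = 1e-12) -> float:
    """KL-based localization distillation on edge logits of shape (N, n, 4)."""
    p_t = softmax(b_t_logits, axis=1)  # teacher edge distributions over n bins
    p_s = softmax(b_s_logits, axis=1)  # student edge distributions over n bins
    # KL(teacher || student) per location and per edge, then summed as in Eq. (15).
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=1)  # shape (N, 4)
    return float(kl.sum())
```

The loss is zero when the student reproduces the teacher's edge distributions and grows as the student's distributions drift away, which is what lets localization knowledge be transferred even at positions without a confident old-class classification.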

4. Experimental Setup

The experiments are conducted on the open-source remote sensing object detection datasets DIOR [12] and DOTA [11]. Specifically, DIOR is used as the main benchmark for ablation studies to analyze the effectiveness of each component in the proposed method. Comparative experiments with state-of-the-art incremental object detection methods [32] are conducted on both DIOR and DOTA to further evaluate the generalization ability of the proposed framework.

4.1. Datasets and Incremental Settings

DIOR [12] is a large-scale remote sensing object detection dataset containing 20 target categories. The complete dataset consists of 23,463 images with $800 \times 800$ pixel resolution, divided into 5862 training images, 5863 validation images, and 11,738 test images. DIOR provides horizontal bounding box annotations for 192,472 object instances across 20 categories, including airplane, baseball field, bridge, ground track field (GTF), vehicle, ship, tennis court (TC), airport, chimney, dam, basketball court, storage tank (ST), harbor, expressway toll station (ETS), expressway service area (ESA), golf course, overpass, stadium, train station (TS), and windmill. Following the previous remote sensing incremental detection protocol [32], the first ten categories are used as old classes, while the remaining ten categories are used as new classes. In our experiments, the training and validation sets are combined for model training, and the test set is used for evaluation.

DOTA [11] is a representative remote sensing object detection benchmark containing 2806 images and 188,282 object instances, officially divided into 1411 training images, 458 validation images, and 937 test images. The image size ranges from $800 \times 800$ to $4000 \times 4000$, and the original images are cropped into patches of size $800 \times 800$ for training and evaluation. DOTA contains 15 categories, namely plane, baseball-diamond (BD), bridge, ground-track-field (GTF), small-vehicle (SV), large-vehicle (LV), ship, tennis-court (TC), basketball-court (BC), storage-tank (ST), soccer-ball-field (SBF), roundabout, harbor, swimming-pool (SP), and helicopter. Although DOTA provides oriented object annotations, all experiments on DOTA are conducted with horizontal bounding boxes to ensure consistency with the compared class-incremental object detection methods, which are based on horizontal detectors. In the adopted class-incremental setting, the first eight categories are treated as old classes, while the remaining seven categories are regarded as new classes. Since the labels of the official test set are unavailable, the training set is used for training and the validation set is used for evaluation.

Following the experimental protocol described in [32], we adopt a relatively large-step class-incremental setting in our experiments. This setting ensures fair comparison with existing remote sensing incremental detection methods. It is also consistent with practical remote sensing scenarios, where newly emerging object categories often appear as groups due to scene transitions, land-use changes, or newly introduced monitoring targets. For example, when the observed area changes from rural regions to urban regions, multiple new object categories may appear simultaneously, accompanied by different textures, spatial layouts, and background distributions. Therefore, the old-10/new-10 split on DIOR and the old-8/new-7 split on DOTA are adopted as the main evaluation protocols.
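The two class splits described above can be written down explicitly. The sketch below assumes that the category order used for the split is the order in which the categories are listed in this section; the helper `split_classes` is a hypothetical name introduced for illustration.

```python
# Category order as listed in Section 4.1 (assumed to be the order used for the split).
DIOR_CLASSES = [
    "airplane", "baseball field", "bridge", "ground track field", "vehicle",
    "ship", "tennis court", "airport", "chimney", "dam",
    "basketball court", "storage tank", "harbor", "expressway toll station",
    "expressway service area", "golf course", "overpass", "stadium",
    "train station", "windmill",
]
DOTA_CLASSES = [
    "plane", "baseball-diamond", "bridge", "ground-track-field",
    "small-vehicle", "large-vehicle", "ship", "tennis-court",
    "basketball-court", "storage-tank", "soccer-ball-field", "roundabout",
    "harbor", "swimming-pool", "helicopter",
]


def split_classes(classes, num_old):
    """Return (old_classes, new_classes) for a single-step incremental setting."""
    return classes[:num_old], classes[num_old:]


dior_old, dior_new = split_classes(DIOR_CLASSES, 10)  # old-10 / new-10
dota_old, dota_new = split_classes(DOTA_CLASSES, 8)   # old-8 / new-7
```

Under this assumption, the teacher is trained on `dior_old` (or `dota_old`) only, and the incremental stage introduces all classes in `dior_new` (or `dota_new`) at once.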

4.2. Implementation Details

All experiments are implemented based on MMDetection 2.10.0 and MMCV 1.2.7. The experimental environment includes Python 3.8.20, PyTorch 1.6.0, TorchVision 0.7.0, CUDA 10.2, and cuDNN 7605. All experiments are conducted on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB VRAM under a Linux platform. GFL with ResNet-50 [41] and FPN [42] is adopted as the basic detector, and the ResNet-50 backbone is initialized with ImageNet-pretrained weights. The batch size is set to 2. The optimization process utilizes stochastic gradient descent (SGD) with an initial learning rate of 0.0025, momentum of 0.9, and weight decay of 0.0001. All models are trained for 24 epochs. For reproducibility, a fixed random seed of 42 is used unless otherwise specified. The same training configuration and evaluation protocol are used for all ablation variants and re-implemented baselines to ensure fair comparison.
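For readers who want to reproduce these settings, the optimization hyperparameters above might be expressed as an MMDetection 2.x-style Python config fragment along the following lines. This is a hypothetical fragment, not the authors' configuration file: the key names follow the usual MMDetection 2.x config conventions and should be checked against the installed version, and how the seed is passed to training is not stated in the text.

```python
# Hypothetical config fragment mirroring the reported hyperparameters.
optimizer = dict(
    type="SGD",
    lr=0.0025,            # initial learning rate reported in the text
    momentum=0.9,
    weight_decay=0.0001,
)
data = dict(samples_per_gpu=2)  # batch size 2 on the single GPU
runner = dict(type="EpochBasedRunner", max_epochs=24)  # 24 training epochs
seed = 42  # assumption: supplied to the training script in some form
```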

In the incremental training stage, the teacher model is initialized from the detector trained on old classes and kept frozen throughout training. Only the student detector is updated. The frozen teacher model provides old-class response supervision for the ODB branch through MDS-based distillation, while the NLB branch learns newly introduced categories from ground-truth annotations.

4.3. Evaluation Metrics

In our experiments, $AP_{50}$ is adopted as the main evaluation metric to assess the performance of the proposed method. $AP_{50}$ denotes Average Precision at 50% Intersection over Union (IoU), where a detection is considered correct if the IoU between the predicted bounding box and the ground-truth bounding box is greater than 0.5. Following previous incremental object detection studies, we report the mean average precision (mAP) of old classes and new classes to evaluate old-class retention and new-class adaptation, respectively. In addition, we report “Diff” to measure the performance gap between incremental learning and joint training. Specifically, Diff is defined as the mAP difference between the incremental result and the corresponding joint-training result, i.e., incremental mAP minus joint-training mAP. A Diff value closer to 0 indicates a smaller performance degradation compared with the joint-training reference, while a positive Diff indicates that the incremental result exceeds the corresponding joint-training result.
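The matching rule and the Diff definition can be illustrated with a short sketch. The helper names `iou`, `is_match_ap50`, and `diff` are hypothetical, boxes are assumed to be given as `(x1, y1, x2, y2)` corner coordinates, and the mAP values used with `diff` are the DIOR figures reported in Section 5 (joint training 70.40% and 65.91%, proposed method 69.93% and 65.88%).

```python
def iou(box_a, box_b):
    """IoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_ap50(pred_box, gt_box):
    """A detection counts as correct for AP50 when IoU exceeds 0.5."""
    return iou(pred_box, gt_box) > 0.5


def diff(incremental_map, joint_map):
    """Diff = incremental mAP minus joint-training mAP, in percentage points."""
    return round(incremental_map - joint_map, 2)


old_diff = diff(69.93, 70.40)  # DIOR Old-10, proposed method vs. joint training
new_diff = diff(65.88, 65.91)  # DIOR New-10, proposed method vs. joint training
```

With these inputs, `old_diff` and `new_diff` reproduce the Diff values of −0.47 and −0.03 quoted for the proposed method on DIOR.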

5. Results and Discussion

5.1. Ablation Study on DIOR

All ablation studies in this subsection are conducted on DIOR under the old-10/new-10 class-incremental setting. DIOR is used as the main benchmark for component analysis because it provides a clear old/new category split and a relatively stable evaluation protocol for remote sensing incremental object detection. The joint-training result of GFL is used as the upper-bound reference, and “Diff” denotes the mAP gap between each incremental result and the corresponding joint-training result.

5.1.1. Baseline

We use GFL [7] as the baseline detector. Under the joint-training setting, GFL is trained with annotations of both old and new classes and serves as the upper-bound reference for incremental learning. As shown in Table 1, the joint-training GFL achieves an mAP of 70.40% on Old-10 and 65.91% on New-10. In contrast, the model denoted as “Baseline+w/o select” performs incremental training without response selection in a single detection branch. It obtains 62.56% mAP on Old-10, corresponding to a drop of 7.84 percentage points compared with joint training, and 61.54% mAP on New-10, corresponding to a drop of 4.37 percentage points. These results indicate that directly transferring all teacher responses without selective distillation is insufficient for preserving old-class knowledge and may also weaken new-class learning.

5.1.2. Dual-Branch Detector

In Table 1, “Baseline + Dual-branch” denotes the baseline model with the dual-branch detector added, where separate branches handle old and new classes. The corresponding accuracies for old and new classes are 68.05% and 64.38%, respectively. Compared with training and distillation using a single detector (“Baseline+w/o select”), this approach improves the accuracy by 5.49 percentage points on old classes and 2.84 percentage points on new classes. Furthermore, the comparison between “Baseline + MDS + Dual-branch” and “Baseline + MDS” shows that adding the dual-branch structure on top of our selection method (MDS) yields further performance gains. Similarly, the comparison between “Baseline + MDS + DIST loss” and “Baseline + MDS + Dual-branch + DIST loss” further supports this finding.

5.1.3. MDS

Comparing the old-class accuracy of MDS with that of direct distillation without a selection strategy demonstrates the advantage of MDS’s response selection mechanism. This finding supports the conclusion in [23] that “not all responses are critical for preventing catastrophic forgetting”. Notably, when comparing MDS with ERS, the recognition accuracy for old classes improves from 67.99% to 68.26%, a gain of 0.27 percentage points. This gain indicates the benefit of our channel-wise (i.e., category-level) response selection over the ERS approach: the fine-grained channel-level selection extracts more informative responses from the teacher model. Furthermore, under the dual-branch detector architecture, distillation with MDS improves both old-class and new-class performance compared with selection-free distillation, providing additional evidence for the effectiveness of this selection mechanism.

5.1.4. DIST Loss

Functioning as a regularization constraint, DIST loss [24] is designed to optimize the intrinsic correlations in knowledge transfer from the teacher model, thereby enhancing the student model’s capacity to retain old-class knowledge. Experimental results demonstrate that incorporating DIST loss into the Elastic Response Distillation (ERD) [23] framework improves the old-class recognition accuracy from 67.99% to 68.18%. Within the integrated architecture combining the dual-branch detector and the MDS selection mechanism, DIST loss further strengthens old-class response preservation while maintaining competitive new-class adaptation. This loss term preserves old-class knowledge by constraining the relational structure of selected classification responses between the teacher and student models, rather than directly aligning intermediate feature distributions. In this way, DIST loss complements MDS by maintaining both inter-class and intra-class response relationships, thereby improving the fidelity of old-class response distillation.

5.1.5. Additional Analysis of Feature-Based Distillation

To further examine the effect of feature-level knowledge transfer and analyze its compatibility with the proposed response-level framework, we introduce an additional feature-based distillation analysis under the same DIOR old-10/new-10 setting. Specifically, we evaluate a feature-KD-only variant and an extended variant that adds feature distillation to the proposed method. The results are shown in Table 2.

As shown in Table 2, the feature-KD-only variant obtains 54.70% mAP on Old-10 and 59.20% mAP on New-10, which is substantially lower than ERD and the proposed method. This indicates that feature-level imitation alone is insufficient for preserving old-class decision knowledge in the adopted GFL-based incremental detection framework. One possible reason is that direct feature alignment constrains intermediate representations without explicitly selecting discriminative old-class responses, which may introduce additional interference when old and new categories coexist in cluttered remote-sensing scenes.

When feature distillation is added to the proposed method, the old-class mAP decreases from 69.93% to 68.90%, and the new-class mAP decreases from 65.88% to 65.70%. This suggests that simply combining feature-level distillation with the proposed response-level framework does not necessarily bring further improvement. In particular, additional feature alignment may over-constrain the shared backbone and FPN features, thereby weakening the balance between old-class preservation and new-class adaptation. Therefore, we keep the response-level MDS-based dual-branch framework as the final model, since it achieves the best overall balance between old-class retention and new-class learning. These results also suggest that feature-level and response-level distillation may be complementary in principle, but their combination requires careful design rather than simple superposition.

5.2. Comparison with State-of-the-Art Methods on DIOR and DOTA

In this subsection, we compare the proposed method with representative class-incremental object detection methods on both DIOR and DOTA. DIOR is used as the main benchmark for systematic comparison and ablation analysis, while DOTA provides another representative remote-sensing benchmark for evaluating the effectiveness of the proposed framework under more challenging scene conditions. For fair comparison, all methods are evaluated under the same old/new class splits and AP50 metric.

To clarify the comparison protocol, Table 3 summarizes the detector type, main distillation level, and result source of each compared method. The results of Fast-IL, Faster-IL, and FPN-IL are taken from the reported results in FPN-IL [32], since they follow the same DIOR and DOTA class-incremental protocols adopted in this work. ERD is the most directly related response-level baseline and is re-implemented under the same codebase, data split, and evaluation protocol as the proposed method. The proposed method is also implemented and evaluated under the same setting. For each method, Diff is computed using its corresponding joint-training result as the reference.

As summarized in Table 3, the compared methods differ in both detector architecture and knowledge transfer level. Fast-IL and Faster-IL are two-stage incremental detection methods, while ERD and the proposed method are based on the one-stage GFL detector. In terms of knowledge transfer, Faster-IL and FPN-IL mainly rely on feature-level or feature-pyramid-level distillation, which aims to preserve intermediate representations. ERD performs response-level logit distillation by selecting informative teacher responses. Different from these methods, the proposed framework combines response-level MDS with a dual-branch detector, thereby explicitly decoupling old-class knowledge preservation and new-class learning. This structured comparison clarifies the main similarities and differences between previous incremental detection methods and the proposed method.

5.2.1. Results on DIOR

On DIOR, Tables 4 and 5 report the performance of each method on old and new classes, respectively. On Old-10, the proposed method achieves the highest incremental mAP of 69.93% and the smallest performance gap (a Diff of −0.47), demonstrating strong old-class retention. On New-10, FPN-IL obtains the highest absolute mAP, while our method achieves the second-best mAP and the smallest gap to its corresponding joint-training reference, with a Diff of only −0.03. These results indicate that the proposed method provides a favorable balance between old-class retention and new-class adaptation rather than improving only one side.

Compared with FPN-IL, the proposed method transfers knowledge at the prediction-response level rather than the feature-pyramid level. FPN-IL is effective for preserving multi-scale feature representations, but feature-level constraints may still introduce interference between old and new categories when their visual patterns are similar. In contrast, our method uses MDS to select informative teacher responses and uses a dual-branch detector to decouple old-class preservation from new-class learning. This explains why the proposed method achieves stronger old-class retention while maintaining competitive new-class adaptation.

5.2.2. Results on DOTA

On DOTA, we further compare the proposed method with the same state-of-the-art incremental object detection methods under the old-8/new-7 setting. Compared with DIOR, DOTA exhibits denser object layouts, larger scale variations, and more frequent co-occurrence of old and new categories within cropped patches, thus providing a more challenging benchmark for evaluating incremental object detection methods in remote-sensing scenes.

Table 6 reports the performance of each method on the old classes of DOTA. Compared with DIOR, DOTA contains denser object distributions and more complex background variations, which makes old-class knowledge preservation more challenging. The proposed method achieves the highest incremental mAP of 70.23 on Old-8, showing its advantage in preserving old-class detection capability under complex remote-sensing scenes. FPN-IL obtains the smallest performance gap to its joint-training reference, while the proposed method still provides the strongest absolute old-class performance among all incremental methods.

Table 7 reports the performance of each method on the new classes of DOTA. FPN-IL obtains the highest incremental mAP, while the proposed method achieves the second-best mAP. It is worth noting that the proposed method exceeds its corresponding joint-training reference by 2.00 percentage points on New-7. This positive gap should not be interpreted as incremental learning being generally superior to joint training. Instead, it may be related to the regularization effect of the frozen teacher model, the decoupled optimization of old and new branches, and category-specific data distribution differences under the adopted data split. Overall, the DOTA results show that the proposed framework can maintain strong old-class retention while providing competitive new-class adaptation under complex remote-sensing scenes.

5.3. Hyperparameter Analysis

All hyperparameter analyses in this subsection are conducted on DIOR, which serves as the main benchmark for parameter study in our method. We analyze the influence of the MDS thresholds $k_{1}$ and $k_{2}$, as well as the weights $\alpha$ and $\beta$ in the DIST loss.

The parameters $k_{1}$ and $k_{2}$ control the strictness of per-channel selection and global selection in MDS, respectively. As shown in Figure 6, when $k_{1} = 1$, increasing $k_{2}$ improves the performance of both old and new classes, indicating the necessity of the second-stage global filtering. When $k_{1} = 2$, setting $k_{2} = 1$ yields the best performance. However, further increasing $k_{2}$ makes the selection too strict and may discard useful teacher responses, resulting in performance degradation. Therefore, a moderate threshold setting is more suitable for preserving informative old-class knowledge while avoiding noisy response transfer.

The parameters $\alpha$ and $\beta$ are used to balance the inter-class and intra-class relational losses in DIST loss. As illustrated in Figure 7, we evaluate three settings: intra-only ($\beta = 1$), inter-only ($\alpha = 1$), and both combined ($\alpha = \beta = 1$). Since DIST loss is applied to old-class distillation, its influence on new-class performance is relatively limited. The results show that using both inter-class and intra-class relational constraints achieves better old-class performance than using either term alone. This indicates that the two relational constraints are complementary in preserving old-class knowledge. In addition, the intra-class term contributes more than the inter-class term in our framework, suggesting that maintaining response consistency across samples of the same category is particularly important for old-class retention.

5.4. Activation Function Analysis in DIST Loss

In the proposed framework, we adapt the original DIST loss by replacing the softmax activation with the sigmoid activation for classification logit distillation. This modification is motivated by the multi-class response characteristics of dense object detection, where multiple category responses may provide useful relational information for old-class knowledge preservation. The following is the detailed explanation for this modification.

The Softmax function is commonly used in multi-class classification tasks where each input is assumed to belong to one and only one class. It transforms a vector of logits $z = [z_{1}, z_{2}, \dots, z_{K}]$ into a probability distribution over $K$ classes by the following formulation:

$$
\mathrm{Softmax}(z_{i}) = \frac{\exp(z_{i})}{\sum_{j = 1}^{K} \exp(z_{j})}
\tag{16}
$$

This normalization ensures that the output probabilities are positive and sum to 1, thereby emphasizing the mutual exclusivity between classes.

The Sigmoid function, on the other hand, is typically used in multi-label classification settings where each input may belong to multiple classes simultaneously. It independently maps each logit $z_{i}$ to a value in the range $(0, 1)$ according to the following equation:

$$
\mathrm{Sigmoid}(z_{i}) = \frac{1}{1 + \exp(-z_{i})}
\tag{17}
$$

Unlike Softmax, the Sigmoid function does not enforce any inter-class competition or normalization across logits. This makes it well-suited for modeling non-exclusive, class-independent probabilities, such as those found in multi-label or soft-target scenarios.

In the logit distillation of this paper, the teacher model provides soft labels, i.e., confidence scores across multiple categories that reflect inter-class confidence relationships. The student model is expected to learn these relationships rather than simply predict a single correct class. Therefore, using Sigmoid enables a more appropriate modeling of these non-exclusive, correlated outputs.
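The difference between the two activations in Equations (16) and (17) can be seen on a small numeric example. The logit vector below is made up for illustration and does not come from the experiments; it represents a location where two classes both respond strongly.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


# Illustrative logits: classes 0 and 1 both respond strongly, class 2 does not.
z = np.array([4.0, 3.5, -2.0])

# Raise only the first logit and observe what happens to the second class.
z_boost = z.copy()
z_boost[0] += 2.0

soft_before, soft_after = softmax(z), softmax(z_boost)
sig_before, sig_after = sigmoid(z), sigmoid(z_boost)
```

Under Softmax the outputs always sum to 1, so raising the first logit lowers the score of the second class even though its own logit is unchanged. Under Sigmoid each score depends only on its own logit, so both strong responses stay high and the second class is unaffected, which is the non-exclusive behaviour argued for above.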

Table 8 shows that, for old classes, using sigmoid in the DIST loss outperforms softmax at every temperature setting: sigmoid reaches 69.93%, whereas softmax peaks at 69.31% (T = 2) and is lower at the other temperatures, so sigmoid is ahead by at least 0.62 percentage points. For new classes, sigmoid (65.88%) is marginally below softmax at T = 0.5 (65.90%) and T = 1 (65.93%), a difference of only 0.02 to 0.05 percentage points, and it outperforms softmax at T = 2 (65.50%) and T = 3 (64.84%). In other words, sigmoid improves old-class accuracy while incurring only a minimal cost to new-class performance. Whereas softmax requires careful temperature tuning, sigmoid delivers a more stable and balanced incremental-learning result. Therefore, incorporating sigmoid into the DIST loss more effectively balances retention of old-class knowledge with learning of new classes.

5.5. Visualization Analysis

To further analyze the effectiveness of the proposed response selection strategy, we visualize old-class response maps on representative remote-sensing scenes. As shown in Figure 8, two typical old-class categories are selected, including airplanes in a complex airport scene and ships in a dense harbor scene. Each row presents the input image, ground-truth boxes, the response map generated by ERD, and the response map generated by the proposed method.

As shown in Figure 8, ERD/ERS can activate several discriminative regions of old-class objects, but its responses are relatively weak or fragmented for some densely distributed instances. In contrast, Ours/MDS generates more stable and complete responses over old-class targets. For the airplane scene, Ours/MDS highlights more aircraft instances under a complex airport background. For the ship scene, it preserves more continuous responses over densely moored ships. These visualization results further demonstrate that the proposed multi-granularity response selection strategy is beneficial for retaining informative old-class knowledge during incremental learning.

5.6. Discussion

The effectiveness of the proposed method can be mainly attributed to the combination of multi-granularity response selection and old/new class decoupling. In class-incremental object detection, directly distilling all teacher responses may introduce redundant or even misleading information, especially in remote-sensing scenes where objects are densely distributed and different categories often exhibit similar visual appearances. The proposed MDS strategy addresses this issue by selecting informative teacher responses from both channel-wise and global perspectives. The channel-wise selection preserves category-specific response information, while the global selection further filters out low-confidence or less informative spatial responses. Therefore, MDS provides a more refined knowledge transfer mechanism than holistic response selection and helps the student model retain old-class decision knowledge more effectively.

The dual-branch detector also plays an important role in balancing old-class retention and new-class adaptation. In a single-branch incremental detector, old-class distillation and new-class supervised learning are optimized within the same prediction head, which may lead to competition between preserving old decision boundaries and adapting to newly introduced categories. By assigning old-class preservation and new-class learning to two separate branches, the proposed framework reduces the direct optimization conflict between old and new categories. This design is particularly useful for remote-sensing object detection, where inter-class similarity and dense object distributions can easily amplify the interference between old and new classes.

The comparison with FPN-IL further reveals the difference between feature-level and response-level distillation. FPN-IL transfers knowledge at the feature-pyramid level and is effective in preserving multi-scale representations, which is important for remote-sensing objects with large scale variations. In contrast, the proposed method focuses on prediction-response-level knowledge transfer, where informative logits and localization responses are dynamically selected for distillation. The experimental results show that response-level selection is especially beneficial for old-class retention. Feature-level and response-level distillation may be complementary in principle, but their combination requires careful design. As shown in the additional feature-based distillation analysis, naively adding feature alignment to the proposed response-level framework does not necessarily improve performance and may over-constrain the shared representation.

For the DOTA new-class results, the proposed method obtains a positive Diff of +2.00 mAP relative to its joint-training reference (57.73 versus 55.73, Table 7). This result should be interpreted carefully. It does not indicate that incremental learning is generally superior to joint training. Instead, the positive gap may be caused by several factors, including the regularization effect of the frozen teacher model, the decoupled optimization of old and new branches, and category-specific data distribution differences under the adopted split. Therefore, the main conclusion from the DOTA results is that the proposed framework can maintain competitive new-class adaptation while improving old-class retention, not that it universally outperforms joint training.
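The Diff value discussed here is the incremental mAP minus the joint-training mAP. The short check below recomputes it from the per-class APs of the proposed method on the DOTA new classes (values copied from Table 7); the variable and function names are illustrative.

```python
# Per-class AP of the proposed method on the 7 new DOTA classes (Table 7), in
# the order: Basketball Court, Storage Tank, Soccer Ball Field, Roundabout,
# Harbor, Swimming Pool, Helicopter.
incremental = [55.0, 65.6, 55.9, 62.5, 79.5, 46.8, 38.8]
joint = [57.1, 66.0, 42.1, 59.8, 72.8, 49.8, 42.5]


def mean_ap(per_class_ap):
    """mAP as the unweighted mean of the per-class APs."""
    return sum(per_class_ap) / len(per_class_ap)


def diff(incremental_ap, joint_ap):
    """Gap to joint training; positive means the incremental model scores higher."""
    return mean_ap(incremental_ap) - mean_ap(joint_ap)


print(f"incremental mAP = {mean_ap(incremental):.2f}")        # 57.73
print(f"joint mAP       = {mean_ap(joint):.2f}")              # 55.73
print(f"Diff            = {diff(incremental, joint):+.2f}")   # +2.00
```

The per-class rows of Table 7 show that the positive gap is driven mainly by Soccer Ball Field (+13.8) and Harbor (+6.7), while Swimming Pool (−3.0) and Helicopter (−3.7) remain below joint training, which is consistent with the category-specific explanation given above.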

Remote-sensing datasets often exhibit natural class imbalance, where some categories contain dense and frequent instances while others appear sparsely. The proposed method does not explicitly introduce a class-balanced sampler or category re-weighting strategy. Instead, MDS partially alleviates this issue by selecting informative teacher responses from both channel-wise and global perspectives, which reduces the dominance of redundant high-frequency responses and encourages the student model to focus on discriminative old-class knowledge. Nevertheless, class imbalance is not fully solved in this work. Combining the proposed response-level distillation with class-balanced sampling or adaptive category re-weighting is a promising direction for future research.

Despite its effectiveness, this work still has several limitations. First, the current experiments follow the large-step class-incremental protocol adopted in previous remote-sensing incremental detection studies, which ensures a fair comparison with existing methods such as FPN-IL. This setting is also consistent with practical remote-sensing scenarios, where scene transitions or changes in monitoring tasks may introduce a group of new object categories at once rather than one category at a time. For example, when the observed area changes from relatively simple rural scenes to more complex urban or transportation-related scenes, multiple new categories may emerge together, with different spatial layouts, scales, and background distributions. The adopted old-10/new-10 (DIOR) and old-8/new-7 (DOTA) protocols therefore provide a meaningful evaluation of substantial category expansion. Nevertheless, small-step sequential scenarios, where only a few categories are introduced at each stage, are also important for evaluating long-term continual learning ability and are not covered by the present experiments.

Second, this work mainly focuses on response-level distillation because ERD is the most directly related baseline. Although the additional feature-based distillation analysis shows that simple feature alignment does not further improve the proposed framework, more carefully designed hybrid feature-response distillation strategies remain worth investigating.

In future work, we will explore longer incremental sequences, small-step class-incremental protocols, and adaptive feature-response distillation strategies to improve the robustness and generality of remote-sensing incremental object detection.

6. Conclusions

In this paper, we propose a class-incremental object detection approach for remote-sensing images. Specifically, we extend the traditional single-branch detector into a dual-branch architecture that trains old and new categories separately, following a “divide-and-conquer” strategy. We also introduce the multi-granularity dynamic selection (MDS) strategy, which selects informative entries from the logit matrix of the old-class teacher model and thereby improves knowledge distillation from the teacher model to the student model. In addition, we apply a sigmoid-based DIST loss that further improves accuracy. The framework is not tied to a specific detector and can, in principle, be integrated with other deep learning-based object detection frameworks. On DIOR, the proposed method achieves the smallest gaps to joint training among the compared methods (−0.47 mAP on old classes and −0.03 mAP on new classes). The results on DIOR and DOTA further indicate that the proposed framework generalizes across different remote-sensing scenarios and balances old-class retention and new-class adaptation.

Author Contributions

Conceptualization, W.Y. and S.L.; methodology, S.L. and W.W.; software, S.L.; validation, S.L., W.W. and Y.X.; formal analysis, S.L.; investigation, S.L.; resources, W.Y.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, W.W., Y.X., W.Y. and S.X.; visualization, S.L.; supervision, W.Y.; project administration, W.Y.; funding acquisition, S.X. and W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Natural Science Foundation of Hubei Province, grant number 2025AFB688, and in part by the National Natural Science Foundation of China, grant number 61976226.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. DIOR and DOTA can be accessed from their official websites. Additional experimental materials are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28.
2. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
3. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
5. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
6. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2999–3007.
7. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the NIPS’20: 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020.
8. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
9. Hu, X.; Ren, Z.; Bhatti, U.A.; Huang, M.; Wu, Y. DCEDet: Tiny Object Detection in Remote Sensing Images Based on Dual-Contrast Feature Enhancement and Dynamic Distance Measurement. Remote Sens. 2025, 17, 2876.
10. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415.
11. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983.
12. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
13. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3500–3509.
14. Xie, X.; Cheng, G.; Li, Q.; Miao, S.; Li, K.; Han, J. Fewer is more: Efficient object detection in large aerial images. Sci. China Inf. Sci. 2023, 67, 112106.
15. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
16. Hua, X.; Wang, X.; Rui, T.; Zhang, H.; Wang, D. A fast self-attention cascaded network for object detection in large scene remote sensing images. Appl. Soft Comput. 2020, 94, 106495.
17. Fontanet Garcia, N.; Boccardo, P. Object Detection in Optical Remote Sensing Images: A Systematic Review of Methods, Benchmarks, and Operational Applications. Remote Sens. 2026, 18, 1289.
18. van de Ven, G.M.; Tolias, A.S. Three scenarios for continual learning. arXiv 2019, arXiv:1904.07734.
19. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
20. Chen, G.; Choi, W.; Yu, X.; Han, T.; Chandraker, M. Learning efficient object detection models with knowledge distillation. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 742–751.
21. Sun, R.; Tang, F.; Zhang, X.; Xiong, H.; Tian, Q. Distilling object detectors with task adaptive regularization. arXiv 2020, arXiv:2006.13108.
22. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
23. Feng, T.; Wang, M.; Yuan, H. Overcoming catastrophic forgetting in incremental object detection via elastic response distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9417–9426.
24. Huang, T.; You, S.; Wang, F.; Qian, C.; Xu, C. Knowledge distillation from a stronger teacher. In Proceedings of the NIPS’22: 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022.
25. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023.
26. Cohen, I.; Huang, Y.; Chen, J.; Benesty, J. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4.
27. Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5533–5542.
28. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.P.; Wayne, G. Experience replay for continual learning. In Proceedings of the NIPS’19: 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019.
29. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2935–2947.
30. Shmelkov, K.; Schmid, C.; Alahari, K. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3420–3429.
31. Hao, Y.; Fu, Y.; Jiang, Y.-G.; Tian, Q. An end-to-end architecture for class-incremental object detection with knowledge distillation. In Proceedings of the IEEE International Conference on Multimedia and Expo, Shanghai, China, 8–12 July 2019; pp. 1–6.
32. Chen, J.; Wang, S.; Chen, L.; Cai, H.; Qian, Y. Incremental detection of remote sensing objects with feature pyramid and knowledge distillation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5600413.
33. Pérez-Rúa, J.-M.; Zhu, X.; Hospedales, T.M.; Xiang, T. Incremental few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13843–13852.
34. Liu, Z.; Wang, Y.; Luo, Y.; Luo, C. Graph-based few-shot incremental learning algorithm for unknown class detection. Appl. Soft Comput. 2024, 154, 111363.
35. Wang, S.; Song, Y.; Xiang, J.; Chen, Y.; Zhong, P.; Fu, R. Mask-Guided Teacher–Student Learning for Open-Vocabulary Object Detection in Remote Sensing Images. Remote Sens. 2025, 17, 3385.
36. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2015, arXiv:1412.6550.
37. Yang, Z.; Li, Z.; Jiang, X.; Gong, Y.; Yuan, Z.; Zhao, D.; Yuan, C. Focal and global knowledge distillation for detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4643–4652.
38. Yang, Z.; Li, Z.; Shao, M.; Shi, D.; Yuan, Z.; Yuan, C. Masked generative distillation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 53–69.
39. Zheng, Z.; Ye, R.; Hou, Q.; Ren, D.; Wang, P.; Zuo, W.; Cheng, M.-M. Localization distillation for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10070–10083.
40. Sun, S.; Ren, W.; Li, J.; Wang, R.; Cao, X. Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15731–15740.
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
42. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.

Figure 1. Representative failure cases of Elastic Response Distillation (ERD) on DIOR. (a) A dense airport scene where multiple old-class airplane instances are partially or completely misclassified as the newly introduced windmill class. (b) A large-object airport scene where old-class airplane regions are also incorrectly assigned to the new windmill category. These two cases illustrate that ERD may suffer from old/new class confusion in both dense multi-instance scenes and large-object airport scenes when fine-grained response information is insufficiently preserved.

Figure 2. Comparison of different knowledge distillation methods in object detection. FGD denotes Focal and Global Distillation, MGD denotes Masked Generative Distillation, LD denotes Localization Distillation, and LS-KD denotes Logit Standardization Knowledge Distillation.

Figure 3. Overall architecture of the proposed dual-branch incremental detector. The framework consists of two stages: (a) incremental training and (b) inference. During incremental training, the frozen teacher detector provides selected old-class responses through MDS to supervise the old-class distillation branch (ODB), while the new-class learning branch (NLB) is optimized with new-class ground-truth annotations. During inference, only the student detector is used, and old-class and new-class predictions are merged by label remapping, concatenation, and NMS to obtain the final detection results.

Figure 4. Comparison between ERS and the proposed MDS. (a) ERS performs global-level response filtering after channel-wise maximum compression, which may discard class-specific channel information before selection. (b) MDS first conducts per-channel response filtering and then applies global spatial selection, preserving more fine-grained class-specific information for distillation. Arrows indicate the response-flow process, × denotes element-wise masking, and different colors are used to distinguish response maps, masks, and selection stages.

Figure 5. Illustration of inter-class and intra-class relational knowledge in DIST loss. Inter-class relations characterize the correlation among category predictions within a sample, while intra-class relations model the consistency of prediction responses across different samples of the same category. The orange dashed box illustrates an example inter-class relation within one instance, while the green dashed box illustrates an example intra-class relation across different instances.

Figure 6. Hyperparameter analysis of $k_{1}$ and $k_{2}$ in MDS on DIOR. Different settings indicate different response selection thresholds. The legend entries denote $(k_{1}, k_{2})$ pairs. The dashed lines indicate the corresponding old-class and new-class mAP values of the ERS baseline.

Figure 7. Effect of different $\alpha$ and $\beta$ settings in the DIST loss on old-class retention and new-class adaptation. The x-axis labels denote $(\alpha, \beta)$ pairs.

Figure 8. Visualization comparison of old-class response maps on representative remote-sensing scenes. Each row shows the input image, ground-truth boxes, the response map generated by ERD with ERS, and the response map generated by the proposed method with MDS. Compared with ERD/ERS, Ours/MDS produces stronger and more complete responses around old-class targets, especially in complex airport scenes with multiple airplanes and dense harbor scenes with numerous ships. In the ground-truth column, green boxes indicate object annotations. In the response maps, warmer colors denote stronger response activations, while cooler colors denote weaker responses.

Table 1. Ablation results of different strategies on DIOR under the old-10/new-10 setting. The best incremental results are marked in bold.

Method	Strategy	Old-10 mAP	Old-10 Diff	New-10 mAP	New-10 Diff
GFL	joint	70.40	0	65.91	0
ERD	incremental	67.99	−2.41	65.28	−0.63
ERD + DIST loss	incremental	68.18	−2.22	64.74	−1.17
Baseline + w/o select	incremental	62.56	−7.84	61.54	−4.37
Baseline + MDS	incremental	68.26	−2.14	64.10	−1.81
Baseline + Dual-branch	incremental	68.05	−2.35	64.38	−1.53
Baseline + MDS + Dual-branch	incremental	69.53	−0.87	65.84	−0.07
Baseline + MDS + DIST loss	incremental	68.25	−2.15	63.80	−2.11
Baseline + MDS + Dual-branch + DIST loss	incremental	69.93	−0.47	65.88	−0.03

Table 2. Additional analysis of feature-based distillation on DIOR under the old-10/new-10 setting. Diff denotes the performance gap relative to the corresponding joint-training reference.

Method	Strategy	Old-10 mAP	Old-10 Diff	New-10 mAP	New-10 Diff
GFL	joint	70.40	0	65.91	0
ERD	response KD	67.99	−2.41	65.28	−0.63
Feature KD only	feature KD	54.70	−15.70	59.20	−6.71
Ours + Feature KD	response KD + feature KD	68.90	−1.50	65.70	−0.21
Ours	MDS + dual-branch + DIST	69.93	−0.47	65.88	−0.03

Table 3. Summary of compared incremental detection methods.

Method	Detector Type	Main Distillation Level	Result Source
Fast-IL [30]	Two-stage	Output-level regularization	Reported in FPN-IL [32]
Faster-IL [31]	Two-stage	Feature-level RPN distillation	Reported in FPN-IL [32]
FPN-IL [32]	Two-stage with FPN	Feature-pyramid-level distillation	Reported in FPN-IL [32]
ERD [23]	One-stage GFL	Response-level logit distillation	Re-implemented in this work
Ours	One-stage GFL	Response-level MDS + dual-branch decoupling	Implemented in this work

Table 4. Comparison of detection results on old classes of DIOR. The best and second-best incremental mAP values are marked in bold and underlined, respectively. For Diff, bold and underline indicate the best and second-best gaps relative to the corresponding joint-training result, where values closer to 0 indicate less degradation and positive values indicate improvement over joint training.

Methods	Strategy	Airplane	Baseball Field	Bridge	GTF	Vehicle	Ship	TC	Airport	Chimney	Dam	mAP
Fast-IL [30]	incremental	22.4	46.7	7.0	36.3	6.5	9.1	35.5	45.4	66.7	34.2	31.00
	joint	22.4	49.0	10.2	43.0	7.0	7.8	38.7	52.6	69.7	35.0	33.50
	Diff	0.0	−2.3	−3.2	−6.7	−0.5	+1.3	−3.2	−7.2	−3.0	−0.8	−2.50
Faster-IL [31]	incremental	23.0	61.9	9.5	50.1	15.8	29.2	73.3	36.5	64.0	3.3	36.70
	joint	37.0	60.9	18.2	56.4	28.1	41.8	75.7	57.1	68.6	30.0	47.40
	Diff	−14.0	+1.0	−8.7	−6.3	−12.3	−12.6	−2.4	−20.6	−4.6	−26.7	−10.70
ERD [23]	incremental	47.8	78.2	86.5	54.0	76.9	71.4	64.7	79.4	41.9	79.1	67.99
	joint	63.6	73.8	86.1	61.0	78.2	77.5	68.8	76.2	40.0	78.8	70.40
	Diff	−15.8	+4.4	+0.4	−7.0	−1.3	−6.1	−4.1	+3.2	+1.9	+0.3	−2.41
FPN-IL [32]	incremental	47.6	71.5	46.1	83.1	54.1	85.7	85.0	74.8	77.7	61.8	68.74
	joint	55.1	70.7	43.7	83.0	53.7	85.6	84.9	75.7	78.5	62.7	69.36
	Diff	−7.5	+0.8	+2.4	+0.1	+0.4	+0.1	+0.1	−0.9	−0.8	−0.9	−0.62
Ours	incremental	55.8	77.7	86.2	61.3	76.8	76.7	64.8	81.3	40.2	78.5	69.93
	joint	63.6	73.8	86.1	61.0	78.2	77.5	68.8	76.2	40.0	78.8	70.40
	Diff	−7.8	+3.9	+0.1	+0.3	−1.4	−0.8	−4.0	+5.1	+0.2	−0.3	−0.47

Table 5. Comparison of detection results on new classes of DIOR. The best and second-best incremental mAP values are marked in bold and underlined, respectively. For Diff, bold and underline indicate the best and second-best gaps relative to the corresponding joint-training result, where values closer to 0 indicate less degradation and positive values indicate improvement over joint training.

Methods	Strategy	Basketball Court	ST	Harbor	ETS	ESA	Golf Course	Overpass	Stadium	Train Station	Wind Mill	mAP
Fast-IL [30]	incremental	33.7	10.3	12.5	23.8	23.5	39.7	12.8	27.5	1.4	9.2	19.40
	joint	58.2	13.0	27.9	33.2	43.7	66.9	20.7	51.1	5.4	21.4	34.10
	Diff	−24.5	−2.7	−15.4	−9.4	−20.2	−27.2	−7.9	−23.6	−4.0	−12.2	−14.70
Faster-IL [31]	incremental	70.0	29.3	17.9	33.4	54.0	62.8	43.2	48.1	38.0	72.9	47.00
	joint	70.4	29.8	16.4	37.9	53.8	64.3	42.8	47.2	39.5	74.9	47.70
	Diff	−0.4	−0.5	+1.5	−4.5	+0.2	−1.5	+0.4	+0.9	−1.5	−2.0	−0.70
ERD [23]	incremental	80.1	53.0	65.0	52.2	51.8	73.6	61.2	87.2	48.1	80.6	65.28
	joint	81.0	56.2	70.1	56.9	42.4	73.0	60.9	86.1	45.8	86.7	65.91
	Diff	−0.9	−3.2	−5.1	−4.7	+9.4	+0.6	+0.3	+1.1	+2.3	−6.1	−0.63
FPN-IL [32]	incremental	86.8	58.3	59.1	66.9	73.3	71.0	60.1	69.2	49.6	86.8	68.11
	joint	87.2	69.8	63.2	65.9	80.4	81.6	57.2	61.2	58.3	88.3	71.31
	Diff	−0.4	−11.5	−4.1	+1.0	−7.1	−10.6	+2.9	+8.0	−8.7	−1.5	−3.20
Ours	incremental	80.4	51.5	64.4	51.3	54.6	74.8	62.6	87.6	48.4	83.2	65.88
	joint	81.0	56.2	70.1	56.9	42.4	73.0	60.9	86.1	45.8	86.7	65.91
	Diff	−0.6	−4.7	−5.7	−5.6	+12.2	+1.8	+1.7	+1.5	+2.6	−3.5	−0.03

Table 6. Comparison of detection results on old classes of DOTA. The best and second-best incremental mAP values are marked in bold and underlined, respectively. For Diff, bold and underline indicate the best and second-best gaps relative to the corresponding joint-training result, where values closer to 0 indicate less degradation and positive values indicate improvement over joint training.

Methods	Strategy	Plane	Baseball-Diamond	Bridge	GTF	Small Vehicle	Large Vehicle	Ship	Tennis Court	mAP
Fast-IL [30]	incremental	48.5	26.7	2.9	31.4	3.1	20.7	16.8	57.5	25.95
	joint	50.7	28.3	4.4	32.4	3.1	20.7	15.7	58.9	26.78
	Diff	−2.2	−1.6	−1.5	−1.0	0.0	0.0	+1.1	−1.4	−0.83
Faster-IL [31]	incremental	57.9	36.9	18.1	25.4	14.0	29.0	28.5	79.2	36.13
	joint	66.0	47.5	31.0	21.0	15.1	39.6	31.2	80.9	41.54
	Diff	−8.1	−10.6	−12.9	+4.4	−1.1	−10.6	−2.7	−1.7	−5.41
ERD [23]	incremental	83.8	63.6	44.1	48.7	55.3	77.5	68.1	92.5	66.70
	joint	84.9	72.6	45.5	63.7	55.9	81.9	69.4	93.6	70.94
	Diff	−1.1	−9.0	−1.4	−15.0	−0.6	−4.4	−1.3	−1.1	−4.24
FPN-IL [32]	incremental	79.6	66.2	46.6	59.1	53.7	78.2	81.9	91.2	69.56
	joint	80.9	69.1	49.4	57.4	53.8	76.4	80.4	90.8	69.78
	Diff	−1.3	−2.9	−2.8	+1.7	−0.1	+1.8	+1.5	+0.4	−0.22
Ours	incremental	83.8	71.6	44.4	67.5	55.8	76.8	69.2	92.7	70.23
	joint	84.9	72.6	45.5	63.7	55.9	81.9	69.4	93.6	70.94
	Diff	−1.1	−1.0	−1.1	+3.8	−0.1	−5.1	−0.2	−0.9	−0.71

Table 7. Comparison of detection results on new classes of DOTA. The best and second-best incremental mAP values are marked in bold and underlined, respectively. For Diff, bold and underline indicate the best and second-best gaps relative to the corresponding joint-training result, where values closer to 0 indicate less degradation and positive values indicate improvement over joint training.

Methods	Strategy	Basketball Court	Storage Tank	Soccer Ball Field	Roundabout	Harbor	Swimming Pool	Helicopter	mAP
Fast-IL [30]	incremental	12.2	14.6	22.5	6.2	17.4	13.2	4.7	12.97
	joint	23.5	18.1	28.2	13.1	36.7	20.3	18.2	22.59
	Diff	−11.3	−3.5	−5.7	−6.9	−19.3	−7.1	−13.5	−9.61
Faster-IL [31]	incremental	18.3	29.1	19.8	30.2	42.4	35.4	9.7	26.41
	joint	14.8	32.4	14.8	36.7	45.1	36.1	8.1	26.86
	Diff	+3.5	−3.3	+5.0	−6.5	−2.7	−0.7	+1.6	−0.44
ERD [23]	incremental	52.1	62.1	47.9	58.9	68.8	46.5	36.6	53.27
	joint	57.1	66.0	42.1	59.8	72.8	49.8	42.5	55.73
	Diff	−5.0	−3.9	+5.8	−0.9	−4.0	−3.3	−5.9	−2.46
FPN-IL [32]	incremental	52.1	68.5	56.0	68.7	70.6	62.7	45.9	60.64
	joint	56.0	71.4	58.1	69.0	74.2	61.4	35.7	60.83
	Diff	−3.9	−2.9	−2.1	−0.3	−3.6	+1.3	+10.2	−0.19
Ours	incremental	55.0	65.6	55.9	62.5	79.5	46.8	38.8	57.73
	joint	57.1	66.0	42.1	59.8	72.8	49.8	42.5	55.73
	Diff	−2.1	−0.4	+13.8	+2.7	+6.7	−3.0	−3.7	+2.00

Table 8. Performance comparison of different activation functions in the DIST loss on DIOR under the old-10/new-10 setting.

Method	mAP (Old-10)	mAP (New-10)
sigmoid	69.93	65.88
softmax ( $T = 0.5$ )	68.77	65.90
softmax ( $T = 1$ )	69.09	65.93
softmax ( $T = 2$ )	69.31	65.50
softmax ( $T = 3$ )	67.24	64.84
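
To make the comparison in Table 8 concrete, the sketch below shows one way a DIST-style relational loss can be computed with either a sigmoid or a temperature-scaled softmax activation. It follows the general idea of DIST (Pearson-correlation matching of inter-class and intra-class relations) and is not the exact loss used in this work. The function names, the `(N, C)` logit layout, and the use of `alpha` and `beta` as weights on the inter-class and intra-class terms are assumptions.

```python
import numpy as np


def pearson_rows(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pearson correlation between corresponding rows of a and b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1)) + eps
    return num / den


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = x / T
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def dist_loss(student_logits, teacher_logits, alpha=1.0, beta=1.0,
              activation="sigmoid", T=1.0) -> float:
    """Relational distillation loss on (N, C) logits (N samples, C classes)."""
    if activation == "sigmoid":
        ps, pt = sigmoid(student_logits), sigmoid(teacher_logits)
    elif activation == "softmax":
        ps, pt = softmax(student_logits, T), softmax(teacher_logits, T)
    else:
        raise ValueError(f"unknown activation: {activation}")

    # Inter-class relation: correlation across classes within each sample (rows).
    inter = 1.0 - pearson_rows(ps, pt).mean()
    # Intra-class relation: correlation across samples within each class (columns).
    intra = 1.0 - pearson_rows(ps.T, pt.T).mean()
    # ASSUMPTION: alpha weights the inter-class term, beta the intra-class term.
    return float(alpha * inter + beta * intra)
```

With the sigmoid, each class score is activated independently, whereas the softmax couples all classes of a sample through its normalization and adds the temperature `T` as an extra hyperparameter; Table 8 reports the empirical effect of that choice.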

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, S.; Wang, W.; Xu, Y.; Yao, W.; Xu, S. A Dual-Branch Detector Based on the Multi-Granularity Dynamic Selection Mechanism for Remote Sensing Incremental Detection. Remote Sens. 2026, 18, 2032. https://doi.org/10.3390/rs18122032

