Article

D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection

1 School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
2 Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 School of Continuing Education, Xi’an Jiaotong University, Xi’an 710049, China
4 MIGU Video Co., Ltd., Shanghai 201206, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12723; https://doi.org/10.3390/app152312723
Submission received: 31 October 2025 / Revised: 25 November 2025 / Accepted: 28 November 2025 / Published: 1 December 2025

Abstract

Continual learning for open-vocabulary object detection aims to enable pretrained vision–language detectors to adapt to diverse specialized domains while preserving their zero-shot generalization capabilities. However, existing methods primarily focus on mitigating catastrophic forgetting, often neglecting the substantial domain shifts commonly encountered in real-world applications. To address this critical oversight, we pioneer Open-Domain Continual Object Detection (OD-COD), a new paradigm that requires detectors to continually adapt across domains with significant stylistic gaps. We propose Disentangled Domain Knowledge-Aided Learning (D-Know) to tackle this challenge. This framework explicitly disentangles domain-general priors from category-specific adaptation, managing them dynamically in a scalable domain knowledge base. Specifically, D-Know first learns domain priors in a self-supervised manner and then leverages these priors to facilitate category-specific adaptation within each domain. To rigorously evaluate this task, we construct OD-CODB, the first dedicated benchmark spanning six domains with substantial visual variations. Extensive experiments demonstrate that D-Know achieves superior performance, surpassing current state-of-the-art methods by an average of 4.2% mAP under open-domain continual settings while maintaining strong zero-shot generalization. Furthermore, experiments under the few-shot setting confirm D-Know’s superior data efficiency.

1. Introduction

Object detection has recently evolved toward the open-vocabulary paradigm, leading to Open-Vocabulary Object Detection (OVD) [1,2,3,4,5]. In this setting, large-scale multimodal data enable detectors to recognize both seen and unseen categories in the wild [6,7]. These models accept arbitrary text queries and detect visual concepts beyond a fixed label set, exhibiting strong zero-shot generalization. Despite this progress, deploying OVD models in specialized downstream applications still requires careful adaptation [7,8]. In realistic scenarios, downstream tasks typically arrive sequentially rather than simultaneously. Adapting to new tasks often harms performance on previously learned ones and may degrade zero-shot capability [9]. Consequently, existing OVD models are ill-equipped for continual adaptation across diverse real-world scenarios.
To address this limitation, recent studies [9,10,11] have introduced Open-Vocabulary Continual Object Detection (OV-COD), which aims to learn new categories from a sequential data stream without accessing prior data, while retaining previously learned knowledge and maintaining open-vocabulary recognition. For example, ZiRa [9] employs a dual-branch architecture to mitigate forgetting, MR-GDINO [10] uses memory-retrieval mechanisms, and Dithub [11] adopts a modular warm-up and specialization strategy. These approaches support dynamic deployment in applications such as autonomous driving, service robotics, and security monitoring [12,13]. However, existing OV-COD methods mainly address the incremental addition of categories and overlook the domain shift problem across sequential tasks. In practice, real-world data streams exhibit substantial variations in visual appearance and object distribution—for example, adapting from artistic styles such as Comic and Watercolor to specialized sensing conditions like Thermal or Aquarium, as illustrated in Figure 1a. Such domain discrepancies significantly hinder the effectiveness of existing approaches.
To tackle this limitation, we introduce for the first time the Open-Domain Continual Object Detection (OD-COD) task, as illustrated in Figure 1b. This task presents several key challenges: the model must (1) handle large domain discrepancies across sequential data streams without access to previously seen data, (2) continuously learn new categories while retaining knowledge of previously learned ones, and (3) maintain open-vocabulary capability to recognize unseen concepts. Such a setting more faithfully reflects real-world scenarios, where models must adapt to dynamically changing environments with diverse domains. For instance, in security applications, the category “person” should be detected not only in standard RGB imagery but also in thermal or infrared domains, requiring rapid adaptation to varying domain characteristics and recurring conditions.
To address the OD-COD challenge, we propose the Disentangled Domain Knowledge-Aided Learning (D-Know) framework. Recognizing that tasks within the same domain share valuable underlying visual structures, D-Know explicitly models and leverages these domain-specific structures to facilitate continual adaptation. This is achieved by disentangling domain-general knowledge from category-specific knowledge and dynamically managing both within a scalable Domain Knowledge Base (DKB) via the Dynamic Domain Assignment mechanism. The DKB maintains a three-component set for each encountered domain. To capture transferable domain priors, we design Disentangled Domain Prior Learning, which employs two complementary self-supervised tasks to model the global structural layouts and local appearance statistics that characterize each domain. To facilitate continual category learning, we introduce Domain-Guided Class-Specific Adaptation, which leverages the learned domain priors to guide the optimization of image and text representations, enabling efficient adaptation to new classes. To preserve knowledge of previously learned categories within each domain, we adopt intra-domain anti-forgetting strategies based on momentum consolidation. During inference, the model autonomously identifies the input domain and invokes the corresponding component triplet to detect both previously learned and unseen categories in open-world scenarios.
To evaluate our framework, we construct OD-CODB, a challenging benchmark comprising six visually diverse domains based on public datasets [7,14,15]. This benchmark is designed to assess an open-domain detector’s capability to resist catastrophic forgetting across new, old, and unseen categories in a continual learning framework. Experimental results show that our framework surpasses current state-of-the-art methods by 4.2% mAP on average under open-domain continual detection settings.
In summary, our main contributions are as follows:
  • We pioneer the Open-Domain Continual Object Detection (OD-COD) task, a novel learning paradigm that aims to continuously adapt to diverse and stylistically distinct domains while maintaining open-vocabulary recognition of unseen categories.
  • We propose the D-Know framework to address the challenging OD-COD task, which leverages domain priors to facilitate intra-domain category learning by disentangling the learning process of domain-general and category-specific knowledge.
  • We establish the OD-CODB benchmark and conduct comprehensive experiments, achieving state-of-the-art performance by surpassing current methods by 4.2% mAP on average, validating the effectiveness of the proposed approach.

2. Related Work

In this section, we review the foundational literature relevant to our work. We begin by discussing Open-Vocabulary Object Detection (OVD) to establish the baseline context for our task (Section 2.1). Next, we examine general Continual Object Detection (COD) strategies (Section 2.2). Finally, we focus on Open-Vocabulary Continual Object Detection (OV-COD) and highlight the limitations of existing methods within this emerging field (Section 2.3).

2.1. Open-Vocabulary Object Detection

Open-Vocabulary Object Detection (OVD) aims to develop detectors capable of recognizing both seen and unseen categories in real-world scenarios by leveraging large-scale, semantically rich multimodal data [6]. Existing studies can generally be divided according to their strategies for aligning visual regions with textual concepts [16]. Region-aware training methods integrate image–text pairs into the training process to establish soft alignments between image regions and vocabulary terms. For instance, OVR-CNN [6] first formalized the OVD task and optimized the vision–language projection layer through pretraining. Pseudo-labeling approaches, on the other hand, achieve hard alignments by generating pseudo-labels based on pretrained vision–language models (VLMs) or self-training strategies. Representative examples include GLIP [7], which reformulates detection as a phrase grounding problem and leverages linguistic context to improve localization of novel categories, and ProxyDet [17], which synthesizes proxy embeddings for unseen classes to overcome the generalization bottleneck of known-class alignment. Knowledge distillation–based approaches adopt a teacher–student paradigm to transfer visual knowledge from large-scale VLMs such as CLIP [18] into task-specific detectors. ViLD [19], for example, demonstrates the localization capability of region proposal networks for unseen categories and mitigates overfitting through ensemble prediction between teacher and student models, while BARON [20] captures co-occurrence relationships among visual concepts via regional set distillation. Transfer learning methods directly adapt VLM encoders to detection tasks. OWL-ViT [21] removes CLIP’s token pooling layer and attaches a lightweight detection head for end-to-end fine-tuning, whereas F-VLM [22] freezes the CLIP encoder and trains only the detection head, refining predictions via geometric mean ensembling.
Although these methods exhibit remarkable open-vocabulary capability, they are fundamentally designed for static datasets and do not intrinsically address the challenges posed by sequential data streams and catastrophic forgetting.

2.2. Continual Object Detection

Continual Object Detection (COD) focuses on enhancing a detector’s ability to adapt to new categories while maintaining recognition performance on previously learned ones. Existing approaches can be broadly categorized by their underlying strategies [23]. Knowledge distillation–based methods transfer knowledge from previous tasks through a teacher–student architecture. For example, LwF [24] preserves performance on old tasks during new task learning via distillation loss. Rehearsal-based strategies mitigate distribution shifts by reusing historical data. iCaRL [25] combines exemplar memory with a nearest-mean classifier, advancing incremental classification, while Rodeo [26] employs product quantization for memory-efficient feature replay. Parameter isolation methods allocate task-specific parameters to avoid interference between tasks. For instance, Li et al. [27] store and reset critical task parameters, and Zhang et al. [28] design a compact expert detector ensemble with pruning, achieving forgetting-free learning under limited memory budgets. Pseudo-labeling strategies generate weakly supervised signals to guide model adaptation. Guan et al. [29] demonstrate their effectiveness in scenarios where old and new categories co-occur, and RD-IDO [30] integrates pseudo-labeling with feature distillation and residual learning to further boost performance. However, these methods are designed primarily for conventional closed-set object detection and focus solely on retaining previously learned knowledge. They cannot be directly applied to open-vocabulary continual object detection, where models should not only preserve past knowledge but also generalize to unseen categories under dynamically evolving data streams.

2.3. Open-Vocabulary Continual Object Detection

In recent years, Open-Vocabulary Continual Object Detection (OV-COD) has emerged as a natural extension of the open-vocabulary detection paradigm [9]. The goal of OV-COD is to continuously learn novel object categories from a sequential data stream without access to previous data, while retaining knowledge of previously learned categories and maintaining open-vocabulary recognition of unseen ones. ZiRa [9] addresses this problem by introducing a re-parameterizable dual-branch architecture together with a zero-interference loss to mitigate forgetting. MR-GDINO [10] proposes a highly scalable detection framework based on a memory-and-retrieval mechanism, enabling efficient adaptation to newly encountered categories. Dithub [11] adopts a modular paradigm with separate warm-up and specialization phases. Although these methods achieve promising results, they focus primarily on the continual addition of categories and neglect the domain and visual style variations that naturally occur across tasks. Consequently, their performance deteriorates when facing substantial domain discrepancies between sequential tasks.

3. Method

In this section, we detail the proposed Disentangled Domain Knowledge-Aided Learning (D-Know) framework. We first formally define the Open-Domain Continual Object Detection (OD-COD) task and outline our core objectives (Section 3.1). Following a high-level overview of the framework architecture (Section 3.2), we detail the Dynamic Domain Assignment mechanism (Section 3.3), the Disentangled Domain Prior Learning strategy (Section 3.4), and the Domain-Guided Class-Specific Adaptation process (Section 3.5). The section concludes with a description of the Inference Process (Section 3.6).

3.1. Problem Definition

In this paper, we define and address a challenging task setting, termed open-domain continual object detection. In this setting, a model is trained on a sequence of $N$ incremental tasks, $\mathcal{T} = (T_1, T_2, \dots, T_N)$. Each task $T_i$ is associated with a dataset $D_i = \{(I_k, B_k)\}_{k=1}^{n_i}$ of $n_i$ samples, where $I_k$ is an image, and $B_k$ represents its bounding box annotations. The key challenge of this setting is that these tasks may belong to different visual domains $d_j$, and the sequence of domains can contain substantial shifts in style and content. Our learning model, denoted as $\mathcal{M}$, processes this task sequence incrementally. At any given step $i$, the model can only access the dataset of the current task, $D_i$, without storing or accessing data from any previous tasks $\{D_1, \dots, D_{i-1}\}$. Our core objective is to ensure that, upon completing this sequence of tasks, the model achieves two primary goals: maximizing the average performance across all previous tasks and maintaining its open-world detection capability on novel tasks.

3.2. Framework Overview

To address the performance degradation caused by domain shift, we propose a continual detection framework that decouples domain-prior modeling to assist class adaptation. To further mitigate catastrophic forgetting, we introduce a dynamic domain knowledge base, together with a momentum consolidation strategy, which continuously integrates newly learned knowledge while preserving previously acquired representations. As illustrated in Figure 2, the model builds upon a frozen Grounding DINO [31] backbone and maintains a domain knowledge base $\mathcal{K}$ that stores domain-specific components $\Theta_d = \{\mathrm{DSM}_d, \theta_d^{\mathrm{img}}, V_d^{\mathrm{text}}\}$. Each domain representation $\Theta_d$ includes: (1) a domain-specific modulator ($\mathrm{DSM}_d$) encoding class-agnostic domain priors, (2) an Image PEFT module $\theta_d^{\mathrm{img}}$ capturing class-specific visual patterns, and (3) a textual vocabulary table $V_d^{\mathrm{text}}$ defined relative to the Textual PEFT module, storing class-specific semantic embeddings.
When a new task $T_i$ arrives, the framework first performs dynamic domain assignment, where the domain classifier $f_{\mathrm{domain}}$ identifies its domain identity $d_i$ and retrieves the corresponding components $\Theta_{d_i}$ from $\mathcal{K}$. It then conducts disentangled domain prior learning, refining $\mathrm{DSM}_{d_i}$ through self-supervised domain modeling to capture transferable priors. Finally, domain-guided class-specific adaptation applies PEFT-based supervised learning to adapt class-specific representations under domain guidance. After training, the updated components are consolidated into $\mathcal{K}$ through a momentum-based integration strategy, enabling continual evolution of domain knowledge while preventing forgetting.

3.3. Dynamic Domain Assignment

We design a Dynamic Domain Assignment mechanism to enable our framework to dynamically associate incoming tasks with the correct specialized knowledge. This process determines the domain identity of a new task $T_i$ before incremental training begins, guiding the subsequent module loading and update procedures.
Domain Prototypes. We maintain a set of domain prototypes $\mathcal{P} = \{P_1, \dots, P_N\}$ within the knowledge base $\mathcal{K}$, where $N$ is the number of currently known domains. Each prototype $P_d \in \mathbb{R}^D$ represents the central tendency of domain $d$ in the feature space defined by a frozen CLIP visual encoder $\Phi_{\mathrm{CLIP}}(\cdot)$.
Domain Classification. When a new task $T_i$ with dataset $D_i$ arrives, we first compute its task-level representation $v_i$. A subset of $K$ samples $S_i = \{I_1, \dots, I_K\} \subset D_i$ is encoded using $\Phi_{\mathrm{CLIP}}$, and their features are averaged:
$$v_i = \frac{1}{K} \sum_{k=1}^{K} \Phi_{\mathrm{CLIP}}(I_k).$$
This representation $v_i$ is then compared against all existing prototypes in $\mathcal{P}$ using the cosine similarity function $\mathrm{sim}(\cdot, \cdot)$, which measures the angular closeness between two vectors, to find the most similar prototype index $d^*$ and its corresponding similarity score $s_{\max}$:
$$d^* = \arg\max_{j \in \{1, \dots, N\}} \mathrm{sim}(v_i, P_j),$$
$$s_{\max} = \max_{j \in \{1, \dots, N\}} \mathrm{sim}(v_i, P_j),$$
where $\mathrm{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$. The domain assignment $d_i$ for task $T_i$ is determined based on a threshold $\tau$:
$$d_i = \begin{cases} d^*, & \text{if } s_{\max} \geq \tau, \\ N + 1, & \text{otherwise}. \end{cases}$$
If $s_{\max} < \tau$, a new domain index $N + 1$ is initialized, signifying the need to create a new set of domain-specialized components $\Theta_{N+1}$.
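The assignment rule above can be illustrated with a minimal NumPy sketch. The function names, the 0-based indexing, and the toy 3-D vectors standing in for CLIP features are assumptions made for illustration; only the averaging, the cosine comparison, and the threshold test come from the description above.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_domain(features, prototypes, tau=0.85):
    """Return (domain_index, s_max) for a task.

    features:   (K, D) array of encoder features for K sampled images.
    prototypes: list of (D,) arrays, one per known domain (0-based here).
    A returned index equal to len(prototypes) signals a new domain.
    """
    v = features.mean(axis=0)                 # task-level representation v_i
    if not prototypes:                        # no known domain yet
        return 0, float("-inf")
    sims = [cosine_sim(v, p) for p in prototypes]
    d_star = int(np.argmax(sims))
    s_max = sims[d_star]
    return (d_star if s_max >= tau else len(prototypes)), s_max

# Toy usage with hand-made 3-D "features" (stand-ins for CLIP embeddings).
protos = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
near_first = np.array([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
far_from_all = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]])
print(assign_domain(near_first, protos)[0])    # 0: matches the first domain
print(assign_domain(far_from_all, protos)[0])  # 2: treated as a new domain
```

Because the decision depends on a single averaged vector, the choice of the sampled subset and of the threshold directly controls how readily new domains are spawned.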
Repulsive Prototype Updating. Following the domain assignment $d_i$, the corresponding prototype $P_{d_i}$ (either existing or newly initialized) is updated to incorporate information from the current task $T_i$. To enhance inter-domain separability and prevent prototype drift, we design a repulsive exponential moving average (EMA) rule to update domain prototypes. Let $P_{\mathrm{pos}} = \mathrm{norm}(v_i)$ be the normalized representation of the current task, and let $P_{\mathrm{neg}} = P_m$, where $m = \arg\max_{j \neq d_i} \mathrm{sim}(v_i, P_j)$, be the most similar prototype from a different domain. The update is performed as follows:
$$P_{d_i} \leftarrow (1 - \alpha) P_{d_i} + \alpha \cdot \mathrm{norm}\big(P_{\mathrm{pos}} + \beta (P_{\mathrm{pos}} - P_{\mathrm{neg}})\big),$$
where $\alpha$ is the EMA momentum, and $\beta$ is a small repulsion factor. This ensures that prototypes remain representative and distinct over time.
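A minimal sketch of the repulsive EMA rule follows, under stated assumptions: prototypes are held in a plain Python list, the repulsion term is skipped when no other domain exists yet, and the function names are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale a vector to unit L2 norm."""
    return x / np.linalg.norm(x)

def repulsive_update(prototypes, d, v, alpha=0.5, beta=0.01):
    """Update prototype d in place with the task representation v.

    prototypes: list of (D,) arrays; d: assigned domain index.
    The EMA target is pushed slightly away from the closest other-domain prototype.
    """
    p_pos = normalize(v)
    others = [j for j in range(len(prototypes)) if j != d]
    target = p_pos
    if others:  # repulsion needs at least one other domain (assumption)
        sims = [float(np.dot(p_pos, normalize(prototypes[j]))) for j in others]
        p_neg = prototypes[others[int(np.argmax(sims))]]
        target = p_pos + beta * (p_pos - p_neg)
    prototypes[d] = (1 - alpha) * prototypes[d] + alpha * normalize(target)
    return prototypes[d]

protos = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
updated = repulsive_update(protos, d=0, v=np.array([1.0, 1.0]))
# The second coordinate ends slightly below 0.5 * 0.7071, the value a plain EMA
# would give, because the target was nudged away from the other prototype.
print(updated)
```

With a small repulsion factor the effect per update is minor; it accumulates over repeated updates to keep nearby domains apart.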

3.4. Disentangled Domain Prior Learning

Domain priors play a crucial role in capturing domain-level statistics and style information that facilitate generalization across heterogeneous visual distributions. We introduce a lightweight Domain-Specific Modulator (DSM) to encode such priors through self-supervised optimization.
Domain-Specific Modulator. The DSM is implemented as a bottleneck-structured MLP adapter inserted after the Feed-Forward Network (FFN) within each Swin Transformer block [32]. Given the input feature $X \in \mathbb{R}^{L \times C}$, the DSM modulates the output as follows:
$$X' = X + \alpha \cdot \mathrm{DSM}(\mathrm{FFN}(X)),$$
where $\alpha$ is a learnable scaling coefficient controlling the adaptation strength. This deep integration enables the network to align intermediate features with the intrinsic patterns of each domain.
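To make the modulation concrete, the forward pass can be sketched in NumPy. The text specifies only a bottleneck MLP and a scaling coefficient; the ReLU nonlinearity, the zero-initialized up-projection, the hidden width, and the class and function names below are assumptions chosen for illustration.

```python
import numpy as np

class BottleneckAdapter:
    """Down-project -> ReLU -> up-project, a common adapter layout (assumed here)."""

    def __init__(self, dim, hidden, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(dim, hidden))
        self.w_up = np.zeros((hidden, dim))  # zero init: the adapter starts as a no-op
        self.alpha = alpha                   # scaling coefficient (learnable during training)

    def __call__(self, h):
        return np.maximum(h @ self.w_down, 0.0) @ self.w_up

def modulated_block_output(x, ffn, dsm):
    """Residual modulation: X' = X + alpha * DSM(FFN(X))."""
    return x + dsm.alpha * dsm(ffn(x))

x = np.ones((4, 8))          # L = 4 tokens, C = 8 channels
ffn = lambda h: 2.0 * h      # stand-in for the block's FFN
dsm = BottleneckAdapter(dim=8, hidden=2)
out = modulated_block_output(x, ffn, dsm)
print(out.shape, np.allclose(out, x))  # (4, 8) True
```

Starting from a zero up-projection means the frozen backbone's behavior is unchanged before any domain-specific training, which is one common way to keep adapter insertion stable.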
Self-Supervised Optimization. To disentangle domain priors from category-specific semantics, we observe that domain characteristics lie in two complementary aspects. One is the global structural layout, shared by all images in the same domain; the other is local appearance statistics, capturing fine-grained texture or style variations. To comprehensively encode these domain cues, we optimize DSM parameters through a self-supervised objective that integrates both aspects. Specifically, the loss function $\mathcal{L}_{\mathrm{ssl}}$ is formulated as follows:
$$\mathcal{L}_{\mathrm{ssl}} = \mathcal{L}_{\mathrm{con}} + w \cdot \mathcal{L}_{\mathrm{rec}},$$
where $\mathcal{L}_{\mathrm{con}}$ is the contrastive term aimed at learning holistic, domain-level representations; $\mathcal{L}_{\mathrm{rec}}$ is the reconstructive term focused on preserving fine-grained domain patterns; and $w$ balances these two losses. The contrastive loss $\mathcal{L}_{\mathrm{con}}$ encourages consistent representations between two augmented views of the same image:
$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_i \exp(q \cdot k_i / \tau)},$$
where $q$ and $k_+$ are the features of a positive pair, $k_i$ are negatives, and $\tau$ is the temperature coefficient (a separate quantity from the domain-matching threshold $\tau$ in Section 3.3). In parallel, the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ promotes fine-grained style understanding through patch recovery. The loss is the Mean Squared Error (MSE) between the predicted and original pixels over the masked regions:
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{|M|} \sum_{p \in M} \big\| (x_{\mathrm{pred}})_p - (x_{\mathrm{orig}})_p \big\|^2,$$
where $M$ denotes the set of masked patches, $|M|$ is the number of patches in $M$, $p$ represents a single patch within $M$, $(x_{\mathrm{pred}})_p$ are the pixels predicted by the model for patch $p$, and $(x_{\mathrm{orig}})_p$ are the original pixels of patch $p$.
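The two self-supervised terms can be written out for a single sample as a short NumPy sketch. The temperature value of 0.07 is an assumption (it is not reported above), as are the function names and the flattening of each patch into a row of pixels.

```python
import numpy as np

def info_nce(q, k_pos, k_negs, temperature=0.07):
    """Contrastive loss for one query: negative log-softmax score of the positive key."""
    logits = np.concatenate(([q @ k_pos], k_negs @ q)) / temperature
    logits -= logits.max()                        # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

def masked_mse(x_pred, x_orig, masked_idx):
    """Mean over masked patches of the squared L2 pixel error per patch."""
    diff = x_pred[masked_idx] - x_orig[masked_idx]   # (|M|, pixels_per_patch)
    return float((diff ** 2).sum(axis=1).mean())

def ssl_loss(q, k_pos, k_negs, x_pred, x_orig, masked_idx, w=0.5):
    """L_ssl = L_con + w * L_rec."""
    return info_nce(q, k_pos, k_negs) + w * masked_mse(x_pred, x_orig, masked_idx)

q = np.array([1.0, 0.0])
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
aligned = info_nce(q, np.array([1.0, 0.0]), negs)      # positive agrees with the query
misaligned = info_nce(q, np.array([-1.0, 0.0]), np.array([[1.0, 0.0]]))
print(aligned < misaligned)                            # True

x_orig, x_pred = np.zeros((4, 3)), np.ones((4, 3))     # 4 patches, 3 pixels each
print(masked_mse(x_pred, x_orig, [0, 2]))              # 3.0
```

Only the masked patches contribute to the reconstruction term, so unmasked regions cannot dominate the loss.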
Continual Regularization. To achieve stable evolution of domain priors during continual updates, we define the overall optimization objective for the DSM as the combination of self-supervised domain modeling and continual regularization:
$$\mathcal{L}_{\mathrm{domain}} = \mathcal{L}_{\mathrm{ssl}} + \mathcal{L}_{\mathrm{EWC}},$$
where $\mathcal{L}_{\mathrm{ssl}}$ captures transferable domain priors through contrastive and reconstructive objectives, while $\mathcal{L}_{\mathrm{EWC}}$ serves to preserve previously learned domain knowledge. Specifically, we employ Elastic Weight Consolidation (EWC) [33], which penalizes deviations of important parameters from their previously optimal values. The importance of each parameter $\theta_k$ is estimated by the diagonal of the Fisher Information Matrix (FIM) $F_k$, yielding:
$$\mathcal{L}_{\mathrm{EWC}} = \frac{\lambda}{2} \sum_k F_k (\theta_k - \theta_k^*)^2,$$
where $\lambda$ controls the regularization strength. This continual regularization mechanism enables the DSM to retain critical domain priors while adapting to new environments, effectively mitigating catastrophic forgetting.
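As a worked illustration of the EWC penalty, the sketch below evaluates it on flattened parameter vectors. Estimating the diagonal Fisher information as the mean of squared per-sample gradients is a common choice and is assumed here; the text does not state which estimator is used, and the function names are hypothetical.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1000.0):
    """(lam / 2) * sum_k F_k * (theta_k - theta*_k)^2 over flattened parameters."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients, shape (P,)."""
    return np.mean(np.asarray(per_sample_grads, dtype=float) ** 2, axis=0)

theta_star = np.array([0.0, 2.0])       # parameters after the previous task
theta = np.array([1.0, 2.0])            # current parameters
fisher = np.array([0.5, 10.0])          # importance of each parameter
print(ewc_penalty(theta, theta_star, fisher))   # 250.0
print(diagonal_fisher([[1.0, 0.0], [3.0, 0.0]]))
```

In the example only the first parameter moved, so the penalty is 0.5 × 1000 × 0.5 × 1² = 250 even though the second parameter carries far higher importance.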

3.5. Domain-Guided Class-Specific Adaptation

Once the domain priors are established, the model leverages these priors to support the learning of class-level discriminative information. This process relies on lightweight PEFT modules, which in turn ensure efficient and stable task adaptation.
Image PEFT Module. The Image PEFT Module is designed to enhance class-specific visual representations. It incorporates a dual-branch convolutional adapter into the multi-scale feature maps $F_{\mathrm{vis}} \in \mathbb{R}^{C \times H \times W}$, which are generated by the Feature Pyramid Network (FPN) from the backbone’s hierarchical features. This adapter operates in parallel with the feature projection layers:
$$F'_{\mathrm{vis}} = \mathrm{Proj}(F_{\mathrm{vis}}) + \mathrm{ImageAdapter}(F_{\mathrm{vis}}).$$
A persistent branch $\theta_p$ accumulates long-term knowledge, while a task-specific branch $\theta_\Delta$ captures current updates. After optimization, they are merged through the following:
$$\theta_p \leftarrow \theta_p + \alpha \cdot \theta_\Delta,$$
where $\alpha$ is a learnable fusion factor. This incremental merging preserves prior knowledge while integrating new visual cues.
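The merge step is well defined because the two branches act linearly on the same input: summing their weights gives the same output as summing their outputs. The sketch below checks this for plain matrices, which is a simplification of the convolutional adapter described above; the names and the fixed fusion value are illustrative assumptions.

```python
import numpy as np

def merge_branches(w_persistent, w_task, alpha):
    """theta_p <- theta_p + alpha * theta_delta for one weight matrix."""
    return w_persistent + alpha * w_task

rng = np.random.default_rng(0)
w_p, w_delta = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)
alpha = 0.3  # fixed here; described as learnable in the text

two_branch_output = w_p @ x + alpha * (w_delta @ x)      # both branches active
merged_output = merge_branches(w_p, w_delta, alpha) @ x  # single merged branch
print(np.allclose(two_branch_output, merged_output))     # True
```

After merging, only one set of weights has to be stored and evaluated per domain, so the number of learned tasks does not increase the adapter's inference cost.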
Textual Vocabulary Table. To handle evolving textual semantics across domains, we apply a residual adapter to the BERT encoder outputs $E \in \mathbb{R}^{L \times D}$:
$$E' = \mathrm{FeatMap}(E) + \mathrm{TextPEFT}(E),$$
where $\mathrm{FeatMap}(\cdot)$ is the main feature mapping layer. The evolution of textual knowledge is managed through the interaction between the adaptive Textual PEFT module and a domain vocabulary table serving as a long-term memory buffer. To mitigate semantic drift and forgetting caused by new task training, we construct a Domain Vocabulary with a momentum-based update mechanism. Each domain’s vocabulary stores semantic knowledge through key–value pairs, where keys represent task description texts (denoted as $C_t$ for the caption string of the $t$-th task) and values correspond to contextual embedding vectors (denoted as $E^{C_t}$ for caption $C_t$). When a new task arrives, the system first trains the PEFT module, then performs momentum updates on all historical description embeddings in the vocabulary:
$$E^{C_t} \leftarrow (1 - m) \cdot E_{\mathrm{old}}^{C_t} + m \cdot E_{\mathrm{new}}^{C_t},$$
where $E_{\mathrm{old}}^{C_t}$ and $E_{\mathrm{new}}^{C_t}$ represent the previous and newly generated embeddings for caption $C_t$, and $m$ is the momentum coefficient. This mechanism stabilizes the semantic space and prevents drift across tasks within each domain.
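The vocabulary update can be sketched with a dictionary keyed by caption strings. The `encode` callable stands in for the text encoder with the freshly trained adapter, and registering captions of the new task with their current embedding is an assumption; the text specifies only the momentum rule for historical entries.

```python
import numpy as np

def update_vocabulary(vocab, encode, new_captions, m=0.1):
    """Momentum-update a domain vocabulary table after the text adapter was retrained.

    vocab:  dict mapping caption string -> embedding array (long-term memory).
    encode: callable caption -> embedding under the current text adapter.
    """
    for caption, old in vocab.items():        # refresh historical entries
        vocab[caption] = (1 - m) * old + m * encode(caption)
    for caption in new_captions:              # register captions of the new task
        vocab.setdefault(caption, encode(caption))
    return vocab

vocab = {"person": np.array([1.0, 0.0])}
encode = lambda caption: np.array([0.0, 1.0])   # toy encoder after retraining
update_vocabulary(vocab, encode, ["car"], m=0.1)
print(vocab["person"])   # [0.9 0.1]
print(vocab["car"])      # [0. 1.]
```

With a small momentum coefficient, a stored embedding moves only a fraction of the way toward its re-encoded value per task, which limits drift for captions learned long ago.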
Supervised Optimization. The PEFT modules are trained using the standard detection objective $\mathcal{L}_{\mathrm{det}}$ [2]. This loss is calculated over pairs of ground truth objects and model predictions that are matched using bipartite matching, typically via the Hungarian algorithm. The total loss is a sum of classification and bounding box regression terms:
$$\mathcal{L}_{\mathrm{det}} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}},$$
where $\lambda_{\mathrm{cls}}$ and $\lambda_{\mathrm{reg}}$ are weighting coefficients. The classification loss $\mathcal{L}_{\mathrm{cls}}$ is often implemented using Focal Loss applied to the matched pairs to address class imbalance. The bounding box regression loss $\mathcal{L}_{\mathrm{reg}}$ is computed only for matched pairs corresponding to non-background objects and typically combines an L1 loss and a Generalized IoU (GIoU) loss to penalize localization errors:
$$\mathcal{L}_{\mathrm{reg}} = \lambda_{L1} \mathcal{L}_{L1} + \lambda_{\mathrm{GIoU}} \mathcal{L}_{\mathrm{GIoU}}.$$
Crucially, during this step, gradients update only the parameters of the Image and Textual PEFT modules, ensuring the domain priors learned by the DSM remain intact.
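For reference, the regression term can be computed for one already-matched pair as in the sketch below. Boxes are given as corner coordinates for readability, and the weights 5.0 and 2.0 are placeholder values borrowed from common DETR-style configurations; neither the box format nor the weights are reported above, so both are assumptions.

```python
import numpy as np

def giou(a, b):
    """Generalized IoU of two non-degenerate boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def regression_loss(pred, gt, lam_l1=5.0, lam_giou=2.0):
    """lam_l1 * L1 + lam_giou * (1 - GIoU) for one matched prediction/ground-truth pair."""
    l1 = float(np.abs(np.asarray(pred) - np.asarray(gt)).sum())
    return lam_l1 * l1 + lam_giou * (1.0 - giou(pred, gt))

box = (0.0, 0.0, 1.0, 1.0)
print(regression_loss(box, box))            # 0.0 for a perfect prediction
print(giou(box, (2.0, 0.0, 3.0, 1.0)))      # negative for disjoint boxes
```

Unlike plain IoU, GIoU stays informative for disjoint boxes: it becomes more negative as the boxes move apart, so the loss still provides a gradient toward the target.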

3.6. Inference

During inference, given a test image $I$ and query $T$, the domain classifier $f_{\mathrm{domain}}$ first identifies its corresponding domain $d$. If $d$ exists in $\mathcal{K}$, the corresponding specialized components $\Theta_d = \{\mathrm{DSM}_d, \theta_d^{\mathrm{img}}, V_d^{\mathrm{text}}\}$ are loaded to configure the domain-specialized detector $\mathcal{M}_d$. In this case, the textual branch bypasses the encoder and directly retrieves the semantic embedding of $T$ from the stored vocabulary table $V_d^{\mathrm{text}}$, ensuring consistent text–image alignment within that domain. Otherwise, the frozen base model $\mathcal{M}_0$ is used for detection:
$$\mathrm{Detect}(I, T) = \begin{cases} \mathcal{M}_d(I, T), & \text{if } d = f_{\mathrm{domain}}(I) \in \mathcal{K}, \\ \mathcal{M}_0(I, T), & \text{otherwise}. \end{cases}$$
This adaptive mechanism ensures both domain specialization and open-vocabulary generalization in unseen environments. By leveraging domain-specific vocabulary tables, the framework avoids redundant text encoding and maintains domain-consistent semantic grounding. Crucially, under strong domain shifts where no specific domain is identified, the automatic fallback to the frozen $\mathcal{M}_0$ leaves the original zero-shot behavior unchanged for those inputs.
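The routing logic can be summarized in a few lines of Python. All names are hypothetical, the detectors are passed in as plain callables, and forwarding the raw query when it is missing from the vocabulary table is an assumption, since that case is not specified above.

```python
def detect(image, query, classify_domain, knowledge_base, base_model, build_detector):
    """Route an input to a domain-specialized detector or fall back to the frozen base model.

    classify_domain: image -> domain id, or None when no prototype is similar enough.
    knowledge_base:  dict domain id -> stored components, including a "vocab" table.
    build_detector:  components -> callable detector(image, text).
    """
    d = classify_domain(image)
    if d is None or d not in knowledge_base:
        return base_model(image, query)          # zero-shot path, base weights untouched
    components = knowledge_base[d]
    # Reuse the stored embedding when available; otherwise pass the raw query (assumption).
    text = components["vocab"].get(query, query)
    return build_detector(components)(image, text)

# Toy stubs that only report which path was taken.
kb = {0: {"vocab": {"person": "EMB_person"}}}
base = lambda img, q: ("base", q)
build = lambda comp: (lambda img, t: ("domain", t))
print(detect("img", "person", lambda i: 0, kb, base, build))     # ('domain', 'EMB_person')
print(detect("img", "person", lambda i: None, kb, base, build))  # ('base', 'person')
```

Because the fallback branch never touches the stored components, inputs from unrecognized domains are processed exactly as by the original detector.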

4. Experiment

To validate D-Know, we conducted extensive experiments on our OD-CODB benchmark. This section first details the Experimental Setup (Section 4.1). We then present the Main Results in both full-sample and few-shot settings (Section 4.2), followed by Ablation Studies on key components (Section 4.3). Finally, we provide the Qualitative Analysis and Visualization (Section 4.4).

4.1. Experimental Setup

4.1.1. Datasets

To evaluate open-domain continual object detection performance, we construct a challenging benchmark from public datasets, encompassing six distinct domains: Clipart1k of clipart images, Comic2k of comic book illustrations, Watercolor2k of watercolor paintings [14], VisDrone of real-world aerial imagery [15], Thermal of thermal infrared imagery, and Aquarium of underwater scenes [7]. The distinct visual styles and content across these datasets introduce substantial domain gaps for the model to bridge. We divide the combined dataset into 36 incremental tasks based on object classes. Most tasks contain a single class, such as watercolor-car, while smaller classes within the same domain are merged into multi-class tasks to ensure sufficient training data. For instance, the clipart-merge task combines semantically related categories like dog and cat. All tasks are presented to the model in random order, further increasing the difficulty of maintaining performance across complex open-domain shifts.

4.1.2. Evaluation Metrics

We evaluate object detection performance across all learned tasks using the standard mean Average Precision (mAP) metric following the COCO evaluation protocol. Specifically, mAP is computed over multiple Intersection over Union (IoU) thresholds from 0.5 to 0.95 with a step size of 0.05. After completing the full sequence of 36 tasks, the model is evaluated on the test set of each task to measure both overall knowledge acquisition and retention [9].

4.1.3. Baselines

To comprehensively evaluate our approach, we compare it with representative state-of-the-art methods in continual open-vocabulary object detection. These baselines address the challenge of learning incrementally without forgetting, while our method further targets open-domain continual object detection with substantial domain shifts. Specifically, ZiRa tackles incremental vision–language object detection by introducing lightweight, reparameterizable adapters to both vision and language encoders, enabling task-specific adaptation without increasing inference cost. Notably, ZiRa has been demonstrated to outperform other recent continual detection methods, such as CL-DETR, particularly in open-vocabulary settings. MR-GDINO extends Grounding DINO with a memory-retrieval mechanism that stores and retrieves low-rank parameter representations for new concepts, facilitating continual adaptation to evolving domains. By evaluating these strong baselines on our open-domain benchmark, we provide a rigorous assessment of our model’s effectiveness in maintaining performance under large domain variations.

4.1.4. Implementation Details

All experiments were conducted using PyTorch 2.4.0 on a server equipped with two NVIDIA RTX 3090 GPUs. Grounding DINO [31] with the Swin-Tiny [32] backbone served as the default base vision–language detector in our experiments. This Grounding DINO model was pre-trained on Objects365 [34], GoldG [7], and Cap4M [7]. For each incremental task, the model was trained for 10 epochs with a total batch size of 2. We adopted the AdamW optimizer with an initial learning rate of $1 \times 10^{-3}$, decayed by a cosine schedule, and a weight decay of $1 \times 10^{-4}$. For dynamic domain assignment, features were extracted using a pre-trained CLIP ViT-B/16 model [18], with a similarity threshold $\tau = 0.85$ for domain matching. The EMA momentum for repulsive prototype updates was set to $\alpha = 0.5$, and the repulsion coefficient to $\beta = 0.01$. The momentum coefficient for task vocabulary updates was $m = 0.1$. For continual regularization, the EWC strength was set to $\lambda = 1000$, and $\mathcal{L}_{\mathrm{rec}}$ in the composite self-supervised objective was weighted by $w = 0.5$.

4.2. Main Results

We compare our proposed method against prominent continual learning baselines, with all approaches implemented on top of the Grounding DINO model. Our experiments encompass both full-sample and few-shot settings to comprehensively assess performance under varying data availability. To clearly illustrate performance based on domain characteristics, we aggregate and report the average results for all tasks belonging to the same domain.

4.2.1. Full-Sample Continual Learning Performance

Table 1 presents the results under the full-sample setting, where all available data are used for training. Our method achieves the highest overall performance with an average mAP of 46.97%, surpassing ZiRa by 4.2%. The performance gain is particularly evident in domains with substantial stylistic deviations from natural images, such as Clipart, Comic, and Watercolor. These improvements demonstrate the effectiveness of our domain-specialization mechanism, which captures domain-specific priors through self-supervised adaptation. To further support the quantitative results, Figure 3 provides qualitative comparisons among zero-shot Grounding DINO, ZiRa, and our method. Our approach produces more accurate bounding boxes across diverse domains, illustrating its superior open-domain localization capability. Crucially, our adaptive inference mechanism naturally preserves the original zero-shot generalization capability of the pre-trained Grounding DINO model on entirely unseen categories and domains. This is achieved by defaulting to the frozen base model for unrecognized domains, incurring no additional computational cost for maintaining this vital property. This ensures robust open-domain generalization alongside effective knowledge retention for seen tasks.

4.2.2. Few-Shot Continual Learning Performance

We further assess data efficiency under few-shot continual learning with 1, 3, 5, and 10 samples per class. As shown in Table 2, our method achieves the best average performance from the 3-shot setting onward, demonstrating strong adaptability with limited data. In the 5-shot and 10-shot settings, it reaches average mAP scores of 39.65% and 40.90%, outperforming ZiRa by 1.13 and 1.09 percentage points, respectively. The 1-shot performance is lower, at 31.49% mAP, which we attribute to insufficient samples for effective domain-prior estimation. Once a minimal support set is available, the model quickly forms stable domain representations that enhance learning efficiency. These results indicate that our approach offers a data-efficient and robust solution for open-domain continual learning, maintaining strong generalization even with few training examples.

4.2.3. Systematic Analysis of Catastrophic Forgetting

Beyond the final average performance, we systematically analyze the model’s stability across the 36 sequential tasks using two standard continual learning metrics: Backward Transfer (BWT) and Forgetting Measure (F). BWT measures the average influence of learning new tasks on the performance of previous tasks, where a negative value indicates forgetting. The Forgetting Measure (F) quantifies the average decline in performance for each task relative to its peak proficiency achieved during training. As shown in Table 3, the strongest baseline, ZiRa, exhibits a BWT of −3.32% and a Forgetting Measure of 4.42%, indicating noticeable knowledge decay as the domain shifts. In stark contrast, our D-Know framework achieves a BWT of −0.10% and a Forgetting Measure of 1.14%. These near-zero degradation values demonstrate that our framework effectively isolates domain and class knowledge within the dynamic domain knowledge base. This isolation allows the model to incrementally accumulate new capabilities without compromising previously acquired skills, confirming its robustness against catastrophic forgetting even under significant domain shifts.
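The two metrics can be sketched in a few lines. The snippet below is a minimal illustration that assumes the commonly used definitions: `R[i][j]` is a hypothetical results matrix holding the mAP on task `j` after training on task `i`, BWT averages the difference between final and just-learned performance, and the Forgetting Measure averages the drop from each task's peak. The toy numbers are invented for illustration and are not results from the paper.

```python
def backward_transfer(R):
    """Average of (final mAP - mAP right after learning) over all but the last task."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forgetting_measure(R):
    """Average drop from each task's peak mAP (before the last step) to its final mAP."""
    T = len(R)
    total = 0.0
    for j in range(T - 1):
        peak = max(R[i][j] for i in range(j, T - 1))
        total += peak - R[T - 1][j]
    return total / (T - 1)


if __name__ == "__main__":
    # Toy lower-triangular matrix for three tasks (illustrative values only).
    R = [
        [50.0, 0.0, 0.0],
        [52.0, 40.0, 0.0],
        [51.0, 38.0, 60.0],
    ]
    print("BWT:", backward_transfer(R))
    print("Forgetting:", forgetting_measure(R))
```

In the toy matrix, task 0 improves after task 1 is learned and then dips slightly, so the two metrics disagree in magnitude: BWT compares against the just-learned score, while the Forgetting Measure compares against the peak.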

4.3. Ablation Study

To rigorously evaluate the independent contribution of each design choice in our D-Know framework, we perform a comprehensive ablation study isolating the DSM, class-specific PEFT modules, and our Decoupled Training strategy, as summarized in Table 4. The frozen baseline model yields a poor average mAP of 35.59%, highlighting the significant domain gap. Introducing class-specific PEFT modules significantly boosts the performance to 45.48%. Crucially, even without these detection-specific modules, incorporating the DSM alone improves the baseline to 40.39%, demonstrating that explicitly modeling domain-general priors provides a strong foundation independent of category learning. When all modules are present, the optimization strategy becomes critical. While jointly training all components provides a strong basis of 46.14%, it is limited by the interference between self-supervised domain learning and supervised detection tasks. When our Decoupled Training strategy is applied to separate these optimization processes, performance further improves to the best result of 47.10%. These results highlight two key observations: the combined domain-specialized modules effectively capture transferable priors and align cross-modal features, and the Decoupled Training strategy is essential for stabilizing training and fully realizing this benefit.
Beyond these performance gains, we also highlight the parameter efficiency of our design, as detailed in the “Params” column of Table 4 and in Table 5. While the frozen Grounding DINO backbone comprises 174.84 M parameters, our full D-Know framework adds only 4.52 M trainable parameters in total. These account for 2.52% of the total model size, or equivalently a 2.58% increase over the frozen backbone. Notably, the core DSM module, despite its vital role in capturing domain priors, requires merely 0.22 M parameters. This analysis confirms that our method achieves substantial performance improvements with minimal storage and computational overhead.

4.4. Analysis and Visualization

4.4.1. Structure of the Learned Domain Feature Space

To assess the effectiveness of our domain prototype learning, we visualize the CLIP feature space using t-SNE. As shown in Figure 4, samples from different domains form distinct and well-separated clusters, confirming that the learned representations are both structured and discriminative. Domains with distinct imaging characteristics, such as Drone and Thermal, occupy isolated regions, demonstrating that the model effectively distinguishes between fundamentally different visual modalities. In contrast, artistic domains, i.e., Clipart, Comic, and Watercolor, cluster together at a higher level due to their shared non-photorealistic nature but still exhibit clear internal separation. Clipart and Comic appear closer because of their line-art characteristics, while Watercolor remains relatively distant, reflecting its texture-based style. The Aquarium domain also forms a separate cluster, indicating sensitivity to unique real-world content beyond the real vs. synthetic distinction. Each learned domain prototype, marked by a star, aligns closely with the centroid of its corresponding cluster, showing that the prototypes are stable and representative rather than arbitrary. Overall, the visualization demonstrates that our framework constructs a well-organized and highly separable domain feature space, where the prototypes accurately capture domain-level semantics, enabling reliable domain discrimination and adaptation.

4.4.2. Domain-Aware Attention Visualization

We further analyze the influence of the domain-specific modulator on the model’s visual perception using Grad-CAM visualizations [35]. As shown in Figure 5, our method produces sharper and more semantically aligned attention maps compared to the baseline, demonstrating its enhanced domain awareness. The baseline model exhibits diffuse, domain-agnostic attention that often focuses on low-level details, revealing a lack of sensitivity to domain-specific cues. In contrast, our approach shows clearly localized attention guided by learned domain priors. For example, in the comic domain, attention aligns with contour lines characteristic of line-art illustrations, while in the drone domain, it emphasizes top-down spatial layouts relevant to aerial imagery. These results confirm that our model effectively leverages domain priors to adapt its perception across diverse visual styles. By dynamically adjusting attention based on domain characteristics, it achieves more focused and interpretable visual understanding, which is critical for robust open-domain continual object detection.

4.4.3. Continual Performance Analysis Across Tasks

To assess the effectiveness of our method in mitigating catastrophic forgetting under open-domain continual object detection, we employ Normalized Average mAP to measure knowledge retention over the learning sequence. This metric reflects how well the model preserves its best past performance as new tasks are introduced. As shown in Figure 6, the baseline method exhibits notable fluctuations and a gradual decline in performance, indicating interference between old and new knowledge. In contrast, our approach maintains consistently high stability, keeping the Normalized Average mAP above 0.95 throughout the sequence. This demonstrates strong resistance to forgetting and balanced adaptation to new tasks. Furthermore, our method occasionally achieves scores exceeding 1.0, revealing positive backward transfer. This indicates that learning new tasks within the same domain refines the domain prior, which in turn enhances earlier tasks. These findings confirm that our framework not only preserves but can also reinforce prior knowledge, achieving robust and adaptive open-domain continual learning.
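To make the interpretation of values above 1.0 concrete, here is one plausible way such a normalized score could be computed. The normalization used below (current mAP divided by the best mAP previously reached on the same task, averaged over earlier tasks) is an assumption based on the description above, not the paper's stated formula; `R[i][j]` is a hypothetical matrix of mAP on task `j` after training on task `i`, and the numbers are invented.

```python
def normalized_average_map(R, t):
    """Mean ratio of current to best-past mAP over the tasks learned before step t.

    Assumed definition for illustration; values above 1.0 mean that earlier
    tasks now score higher than their previous best.
    """
    if t < 1:
        raise ValueError("step t needs at least one previously learned task")
    ratios = []
    for j in range(t):
        best_past = max(R[i][j] for i in range(j, t))
        ratios.append(R[t][j] / best_past)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy results matrix (illustrative values only).
    R = [
        [50.0, 0.0, 0.0],
        [52.0, 40.0, 0.0],
        [51.0, 38.0, 60.0],
    ]
    for step in (1, 2):
        print(step, normalized_average_map(R, step))
```

Under this reading, the score at step 1 exceeds 1.0 because task 0 improved after task 1 was learned, which is the kind of positive backward transfer the text describes.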

4.4.4. Hyperparameter Sensitivity Analysis

To verify the robustness of our framework and ensure that our results are not dependent on precise hyperparameter tuning, we analyze the sensitivity of the key hyperparameter $w$. This weight balances the contrastive term $\mathcal{L}_{\mathrm{con}}$ and the reconstructive term $\mathcal{L}_{\mathrm{rec}}$ in our self-supervised domain-prior learning objective. We evaluate the model’s Average mAP by varying $w$ over the set $\{0.1, 0.5, 1.0, 2.0\}$, while keeping other settings fixed. As illustrated in Figure 7, the performance remains consistently high across this range. Specifically, the performance gap between the peak (at $w = 0.5$) and the lowest point is marginal (less than 0.59%). This stability indicates that the complementarity of global and local domain features is robust, and that the proposed framework does not require exhaustive automated hyperparameter search to achieve state-of-the-art performance.
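For orientation, the role of $w$ can be written compactly. The form below is an assumed reading based on the statement that $w$ weights the reconstruction term; the subscript "ssl" is a label introduced here for illustration, and any additional terms or normalization in the full objective are not shown.

```latex
\mathcal{L}_{\mathrm{ssl}} = \mathcal{L}_{\mathrm{con}} + w \, \mathcal{L}_{\mathrm{rec}},
\qquad w \in \{0.1,\ 0.5,\ 1.0,\ 2.0\}
```

Under this reading, $w < 1$ emphasizes the contrastive term and $w > 1$ emphasizes reconstruction, with $w = 0.5$ being the default used in the experiments.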

5. Discussion

5.1. Scalability and Complexity Analysis

To assess the real-world applicability of D-Know, we analyze both its theoretical complexity (Big O) and empirical scalability in terms of storage and runtime.
Storage Scalability. A critical requirement for continual learning systems is the ability to learn new tasks without excessive memory growth. As detailed in Table 5, let $P_{\mathrm{base}}$ be the parameter count of the frozen backbone (174.84 M) and $P_{\Theta}$ be the size of the specialized components for one domain. In our framework, $P_{\Theta}$ amounts to approximately 4.52 M. In a standard fine-tuning setting, preventing forgetting often necessitates storing an independent model for each of the $N$ domains, resulting in a space complexity of $O(N \cdot P_{\mathrm{base}})$. In contrast, D-Know creates a new set of lightweight parameters only when a new domain is encountered, so the total storage complexity is $O(P_{\mathrm{base}} + N \cdot P_{\Theta})$. Since $P_{\Theta}$ is only about 2.6% of the backbone size (2.52% of the full model), the storage requirement grows at a much slower rate, demonstrating superior scalability for long-term adaptation.
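The difference between the two growth rates is easy to check numerically. The sketch below plugs the parameter counts from Table 5 into both expressions; the function names are illustrative, and the calculation assumes that every domain adds one full set of specialized components and ignores any other storage (for example, domain prototypes).

```python
P_BASE = 174.84    # frozen backbone, millions of parameters (Table 5)
P_THETA = 4.5161   # trainable components per domain, millions (Table 5)


def storage_independent_models(num_domains):
    """One full fine-tuned model per domain: N * P_base."""
    return num_domains * P_BASE


def storage_dknow(num_domains):
    """Shared frozen backbone plus one lightweight set per domain: P_base + N * P_theta."""
    return P_BASE + num_domains * P_THETA


def overhead_ratios():
    """Return (share of total model size, increase over the backbone) for one domain."""
    return P_THETA / (P_BASE + P_THETA), P_THETA / P_BASE


if __name__ == "__main__":
    for n in (1, 6, 36):
        print(n, round(storage_independent_models(n), 2), round(storage_dknow(n), 2))
    share, increase = overhead_ratios()
    print(round(100 * share, 2), round(100 * increase, 2))
```

For the six domains of the benchmark, storing independent models would require roughly 1049 M parameters, against roughly 202 M for a shared backbone with per-domain components.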
Inference Efficiency. During inference, the computational cost is determined by the domain assignment module and the detector. We measured the inference speed on an NVIDIA RTX 3090 GPU. The original Grounding DINO achieves 9.06 FPS, while our D-Know framework achieves 8.74 FPS, a speed drop of roughly 3.5%. Crucially, the computational cost is constant, $O(1)$, with respect to the number of learned domains $N$. Regardless of how large the domain knowledge base becomes, the system retrieves and loads only one specific component set $\Theta_d$ for a given input. Thus, our framework maintains stable inference speed regardless of the length of the task sequence.
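The retrieval step can be sketched as a nearest-prototype lookup with a fallback to the frozen base detector. This is a minimal illustration under stated assumptions: the dictionary layout of `knowledge_base`, the function names, and the use of `>=` against the threshold are hypothetical, and plain lists stand in for CLIP features. Only the threshold value 0.85 comes from the experimental setup.

```python
import math

TAU = 0.85  # similarity threshold for domain matching (from the experimental setup)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(feature, knowledge_base, tau=TAU):
    """Pick the single component set to load, or (None, None) to use the base model.

    knowledge_base maps a domain name to a (prototype, components) pair.
    """
    best_name, best_sim = None, -1.0
    for name, (prototype, _) in knowledge_base.items():
        sim = cosine(feature, prototype)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is not None and best_sim >= tau:
        return best_name, knowledge_base[best_name][1]
    return None, None  # unrecognized domain: fall back to the frozen base detector


if __name__ == "__main__":
    kb = {
        "comic": ([1.0, 0.0], "theta_comic"),
        "thermal": ([0.0, 1.0], "theta_thermal"),
    }
    print(route([0.9, 0.1], kb))
    print(route([1.0, 1.0], kb))
```

Only one component set is ever returned, so the detector's forward pass does not grow with the knowledge base; the prototype scan in this sketch is linear in the number of domains but involves only short vector comparisons.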

5.2. Limitations and Future Work

Despite the promising results, our framework has certain limitations that open avenues for future research. First, the dynamic domain assignment mechanism relies on a frozen, pre-trained CLIP model. While CLIP offers robust general-purpose features, it may struggle to accurately distinguish domains that are extremely specialized or vastly different from its pre-training data. Future work could explore adaptive or fine-tunable domain routers to address this. Second, our domain-prior learning strategy necessitates a minimal support set. As observed in our few-shot experiments, the self-supervised objectives require more than a single sample to capture meaningful domain statistics. Consequently, in extreme 1-shot scenarios, the advantage of our method is less pronounced. Developing techniques to rapidly infer domain priors from single examples remains a challenging direction for future investigation. Finally, while our method effectively disentangles domain and class knowledge, the domain assignment is currently a hard decision process. A soft assignment or probabilistic routing mechanism could potentially allow for better handling of images that lie on the boundary of multiple domains, further enhancing the model’s flexibility.

6. Conclusions

We introduce Open-Domain Continual Object Detection (OD-COD), a novel learning paradigm designed to enable pretrained vision–language detectors to continuously adapt across multiple specialized domains with significant domain shifts, while preserving their zero-shot generalization ability in broad open-world scenarios. We construct the OD-CODB benchmark, which comprises datasets spanning six distinct domains with substantial visual style variations and is both challenging and practical. To address this new challenge, we propose the Disentangled Domain Knowledge-Aided Learning (D-Know) framework, which disentangles domain-general and category-specific knowledge. We acquire domain priors through the self-supervised tasks designed in Disentangled Domain Prior Learning, and leverage these priors to facilitate category learning via Domain-Guided Class-Specific Adaptation. These knowledge representations are dynamically managed within a scalable domain knowledge base. Comprehensive experiments on COCO and the OD-CODB benchmark demonstrate the effectiveness and superiority of the proposed framework. While our current approach relies on a frozen pre-trained domain encoder and requires a minimal support set, as discussed in Section 5, we believe it establishes a strong foundation for future research in robust, open-world visual perception.

Author Contributions

Conceptualization, C.Y. and B.H.; methodology, C.Y., B.H. and Y.X.; validation, B.H.; writing—original draft preparation, B.H.; writing—review and editing, C.Y., Y.K., Y.W. and Y.X.; visualization, B.H. and Y.W.; supervision, X.L.; project administration, C.Y. and H.D.; funding acquisition, C.Y. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science and Technology Major Project under Grant No. 2022ZD0117103, the NSFC under Grants 62302384, 62192781, 62137002, and 62172326, the Research Project Funded by the State Key Laboratory of Communication Content Cognition under Grant No. A202403, and the Project of China Knowledge Centre for Engineering Science and Technology.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

One of the authors, Yugui Xie, is employed by MIGU Video Co., Ltd., China Mobile Communications Company Limited Research Institute, Shanghai, China. His employment is unrelated to the present research, and the employer had no role in the study design, data collection, data analysis, manuscript preparation, or the decision to publish this work. All other authors declare no financial or non-financial conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  2. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  3. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  4. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
  5. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
  6. Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.F. Open-Vocabulary Object Detection Using Captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14393–14402. [Google Scholar]
  7. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975. [Google Scholar]
  8. Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 36067–36080. [Google Scholar]
  9. Deng, J.; Zhang, H.; Ding, K.; Hu, J.; Zhang, X.; Wang, Y. Zero-shot generalizable incremental learning for vision-language object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 136679–136700. [Google Scholar]
  10. Dong, B.; Huang, Z.; Yang, G.; Zhang, L.; Zuo, W. MR-GDINO: Efficient open-world continual object detection. arXiv 2024, arXiv:2412.15979. [Google Scholar]
  11. Cappellino, C.; Mancusi, G.; Mosconi, M.; Porrello, A.; Calderara, S.; Cucchiara, R. DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection. arXiv 2025, arXiv:2503.09271. [Google Scholar]
  12. Poss, C.; Ibragimov, O.; Indreswaran, A.; Gutsche, N.; Irrenhauser, T.; Prueglmeier, M.; Goehring, D. Application of open Source Deep Neural Networks for Object Detection in Industrial Environments. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 231–236. [Google Scholar] [CrossRef]
  13. Kılıç, Ş. HybridVisionNet: An advanced hybrid deep learning framework for automated multi-class ocular disease diagnosis using fundus imaging. Ain Shams Eng. J. 2025, 16, 103594. [Google Scholar] [CrossRef]
  14. Inoue, N.; Furuta, R.; Yamasaki, T.; Aizawa, K. Cross-Domain Weakly-Supervised Object Detection Through Progressive Domain Adaptation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5001–5009. [Google Scholar]
  15. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  16. Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975. [Google Scholar] [CrossRef] [PubMed]
  17. Jeong, J.; Park, G.; Yoo, J.; Jung, H.; Kim, H. Proxydet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 2462–2470. [Google Scholar]
  18. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  19. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  20. Wu, S.; Zhang, W.; Jin, S.; Liu, W.; Loy, C.C. Aligning bag of regions for open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15254–15264. [Google Scholar]
  21. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 728–755. [Google Scholar]
  22. Kuo, W.; Cui, Y.; Gu, X.; Piergiovanni, A.J.; Angelova, A. Open-Vocabulary Object Detection upon Frozen Vision and Language Models. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  23. Menezes, A.G.; de Moura, G.; Alves, C.; de Carvalho, A.C. Continual Object Detection: A review of definitions, strategies, and challenges. Neural Netw. 2023, 161, 476–493. [Google Scholar] [CrossRef] [PubMed]
  24. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  25. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010. [Google Scholar]
  26. Acharya, M.; Hayes, T.L.; Kanan, C. Rodeo: Replay for online object detection. arXiv 2020, arXiv:2008.06439. [Google Scholar] [CrossRef]
  27. Li, W.; Wu, Q.; Xu, L.; Shang, C. Incremental learning of single-stage detectors with mining memory neurons. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1981–1985. [Google Scholar]
  28. Zhang, N.; Sun, Z.; Zhang, K.; Xiao, L. Incremental learning of object detection with output merging of compact expert detectors. In Proceedings of the 2021 4th International Conference on Intelligent Autonomous Systems (ICoIAS), Wuhan, China, 14–16 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7. [Google Scholar]
  29. Guan, L.; Wu, Y.; Zhao, J.; Ye, C. Learn to detect objects incrementally. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Suzhou, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 403–408. [Google Scholar]
  30. Yang, D.; Zhou, Y.; Shi, W.; Wu, D.; Wang, W. RD-IOD: Two-Level Residual-Distillation-Based Triple-Network for Incremental Object Detection. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–23. [Google Scholar] [CrossRef]
  31. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–55. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  33. Aich, A. Elastic weight consolidation (EWC): Nuts and bolts. arXiv 2021, arXiv:2105.04093. [Google Scholar] [CrossRef]
  34. Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8429–8438. [Google Scholar] [CrossRef]
  35. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. Illustration of the OD-COD task setting. (a) Data distribution spans diverse visual domains (e.g., Comic, Thermal) with distinct styles and categories. (b) Sequential task execution involves domain shifts and potential revisits (e.g., Domain 1), aiming for continuous adaptation while preserving zero-shot capabilities.
Figure 2. Overview of the D-Know framework. The process unfolds in three stages: (a) Dynamic domain assignment identifies the task domain using a frozen CLIP encoder. (b) Disentangled domain prior learning captures domain signatures via self-supervised optimization of the DSM. (c) Domain-guided class-specific adaptation leverages domain priors to adapt image and text features for detection through lightweight PEFT modules.
Figure 3. Qualitative comparison among zero-shot Grounding DINO, ZiRa, and our method. Our approach produces more accurate bounding boxes across diverse domains, demonstrating superior open-domain localization capability.
Figure 4. t-SNE visualization of the learned domain feature space.
Figure 5. Grad-CAM attention visualization comparison between zero-shot Grounding DINO and our method across different domains. In the heatmaps, red denotes the strongest activation, while blue indicates the lowest response.
Figure 6. Comparison of normalized average mAP between our method and ZiRa over a sequence of open-domain continual learning tasks.
Figure 7. Sensitivity analysis of the hyperparameter w. The model maintains stable performance across a wide range of values, demonstrating robustness to hyperparameter variations.
Table 1. Open-domain continual object detection results evaluated by mAP (%) after the full training sequence. The best performance is shown in bold.
| Shots | Methods | ZCOCO | Avg | Clipart | Comic | Watercolor | Drone | Thermal | Aquarium |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Zero Original Model | 47.40 | 35.59 | 53.33 | 42.94 | 45.53 | 11.90 | 39.31 | 31.59 |
| Full | MR-GDINO | 47.40 | 40.60 † | 49.37 | 38.95 | 44.38 | 21.60 | 66.63 | 54.35 |
| Full | ZiRa | 46.06 | 42.77 ± 0.06 * | 60.48 | 48.56 | 50.00 | 14.44 | 62.05 | 49.83 |
| Full | Ours | 47.40 | 46.97 ± 0.14 * | 64.40 | 50.95 | 53.17 | 19.87 | 66.39 | 56.06 |
Note: * Results are reported as “Mean ± Standard Deviation” over three independent runs with different random seeds. † Result is from a single run. Individual domain scores are from a single representative run.
Table 2. Few-shot continual learning results evaluated by mAP (%) under 1, 3, 5, and 10-shot settings. The best performance is in bold.
| Shots | Methods | Avg | Clipart | Comic | Watercolor | Drone | Thermal | Aquarium |
|---|---|---|---|---|---|---|---|---|
| 0 | Zero Original Model | 35.59 | 53.33 | 42.94 | 45.53 | 11.90 | 39.31 | 31.59 |
| 1 | MR-GDINO | 33.11 | 52.17 | 39.39 | 44.33 | 10.09 | 38.32 | 23.71 |
| 1 | ZiRa | 36.07 | 54.40 | 42.23 | 45.89 | 9.48 | 52.73 | 33.57 |
| 1 | Ours | 31.49 | 51.26 | 31.76 | 40.07 | 7.43 | 46.62 | 31.72 |
| 3 | MR-GDINO | 31.80 | 46.94 | 34.21 | 41.24 | 8.47 | 52.18 | 31.93 |
| 3 | ZiRa | 37.07 | 56.47 | 42.40 | 46.32 | 10.22 | 58.88 | 32.65 |
| 3 | Ours | 37.53 | 58.30 | 43.61 | 48.03 | 7.40 | 54.66 | 37.90 |
| 5 | MR-GDINO | 32.66 | 46.25 | 33.01 | 38.73 | 10.24 | 55.79 | 40.31 |
| 5 | ZiRa | 38.52 | 56.82 | 41.58 | 46.24 | 12.55 | 60.13 | 39.88 |
| 5 | Ours | 39.65 | 58.81 | 43.91 | 46.27 | 12.97 | 57.31 | 42.92 |
| 10 | MR-GDINO | 33.93 | 45.17 | 33.93 | 42.72 | 12.21 | 59.67 | 39.73 |
| 10 | ZiRa | 39.81 | 57.55 | 44.07 | 48.52 | 13.23 | 59.57 | 41.40 |
| 10 | Ours | 40.90 | 58.64 | 45.22 | 47.64 | 14.74 | 57.69 | 45.87 |
Table 3. Analysis of forgetting and stability. “↑” denotes that higher values are better, while “↓” denotes that lower values are preferred. Best results are in bold.
| Method | Backward Transfer (BWT) ↑ | Forgetting Measure (F) ↓ |
|---|---|---|
| ZiRa | −3.32% | 4.42% |
| D-Know (Ours) | −0.10% | 1.14% |
Table 4. Comprehensive ablation study on the core modules and training strategies of the D-Know framework on the OD-CODB benchmark. We report the Average mAP (%) and total parameter count (Params). The frozen backbone contributes 174.84 M parameters. “✓” indicates the component/strategy is used, “✗” indicates it is not.
| Ablation | DSM | Image PEFT | Text PEFT | Decoupled Training | Params (M) | Avg | Gain |
|---|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | ✗ | | 174.84 | 35.59 | - |
| Component | ✗ | ✓ | ✓ | | 179.14 | 45.48 | +9.89 |
| Component | ✓ | ✗ | ✗ | | 175.06 | 40.39 | +4.80 |
| Component | ✓ | ✓ | ✗ | | 179.29 | 42.92 | +7.33 |
| Component | ✓ | ✗ | ✓ | | 175.13 | 45.97 | +10.38 |
| Strategy | ✓ | ✓ | ✓ | ✗ | 179.36 | 46.14 | +10.55 |
| Strategy | ✓ | ✓ | ✓ | ✓ | 179.36 | 47.10 | +11.51 |
Table 5. Parameter breakdown and trainable status of the D-Know framework.
| Component | Parameters (M) | Status | Percentage of Total |
|---|---|---|---|
| Backbone (Grounding DINO) | 174.84 | Frozen | 97.48% |
| Our Trainable Components | | | |
| Domain-Specific Modulator | 0.2211 | Trainable | 0.12% |
| Image PEFT Module | 4.2291 | Trainable | 2.36% |
| Text PEFT Module | 0.0659 | Trainable | 0.04% |
| Subtotal (Trainable) | 4.5161 | - | 2.52% |
| Total | 179.3561 | - | 100.00% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

He, B.; Yan, C.; Kou, Y.; Wang, Y.; Lv, X.; Du, H.; Xie, Y. D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection. Appl. Sci. 2025, 15, 12723. https://doi.org/10.3390/app152312723

AMA Style

He B, Yan C, Kou Y, Wang Y, Lv X, Du H, Xie Y. D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection. Applied Sciences. 2025; 15(23):12723. https://doi.org/10.3390/app152312723

Chicago/Turabian Style

He, Bintao, Caixia Yan, Yan Kou, Yinghao Wang, Xin Lv, Haipeng Du, and Yugui Xie. 2025. "D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection" Applied Sciences 15, no. 23: 12723. https://doi.org/10.3390/app152312723

APA Style

He, B., Yan, C., Kou, Y., Wang, Y., Lv, X., Du, H., & Xie, Y. (2025). D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection. Applied Sciences, 15(23), 12723. https://doi.org/10.3390/app152312723
