D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection
Abstract
1. Introduction
- We introduce the Open-Domain Continual Object Detection (OD-COD) task, a learning paradigm in which a detector continually adapts to diverse, stylistically distinct domains while retaining open-vocabulary recognition of unseen categories.
- We propose the D-Know framework for OD-COD. It uses domain priors to facilitate intra-domain category learning by disentangling the learning of domain-general knowledge from that of category-specific knowledge.
- We establish the OD-CODB benchmark and conduct comprehensive experiments. In the full-sample setting, D-Know achieves state-of-the-art performance, surpassing the strongest compared method (ZiRa) by 4.2 mAP points on average.
2. Related Work
2.1. Open-Vocabulary Object Detection
2.2. Continual Object Detection
2.3. Open-Vocabulary Continual Object Detection
3. Method
3.1. Problem Definition
3.2. Framework Overview
3.3. Dynamic Domain Assignment
3.4. Disentangled Domain Prior Learning
3.5. Domain-Guided Class-Specific Adaptation
3.6. Inference
4. Experiment
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Evaluation Metrics
4.1.3. Baselines
4.1.4. Implementation Details
4.2. Main Results
4.2.1. Full-Sample Continual Learning Performance
4.2.2. Few-Shot Continual Learning Performance
4.2.3. Systematic Analysis of Catastrophic Forgetting
4.3. Ablation Study
4.4. Analysis and Visualization
4.4.1. Structure of the Learned Domain Feature Space
4.4.2. Domain-Aware Attention Visualization
4.4.3. Continual Performance Analysis Across Tasks
4.4.4. Hyperparameter Sensitivity Analysis
5. Discussion
5.1. Scalability and Complexity Analysis
5.2. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3651–3660.
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
- Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.F. Open-Vocabulary Object Detection Using Captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14393–14402.
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975.
- Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 36067–36080.
- Deng, J.; Zhang, H.; Ding, K.; Hu, J.; Zhang, X.; Wang, Y. Zero-shot generalizable incremental learning for vision-language object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 136679–136700.
- Dong, B.; Huang, Z.; Yang, G.; Zhang, L.; Zuo, W. MR-GDINO: Efficient open-world continual object detection. arXiv 2024, arXiv:2412.15979.
- Cappellino, C.; Mancusi, G.; Mosconi, M.; Porrello, A.; Calderara, S.; Cucchiara, R. DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection. arXiv 2025, arXiv:2503.09271.
- Poss, C.; Ibragimov, O.; Indreswaran, A.; Gutsche, N.; Irrenhauser, T.; Prueglmeier, M.; Goehring, D. Application of open Source Deep Neural Networks for Object Detection in Industrial Environments. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 231–236.
- Kılıç, Ş. HybridVisionNet: An advanced hybrid deep learning framework for automated multi-class ocular disease diagnosis using fundus imaging. Ain Shams Eng. J. 2025, 16, 103594.
- Inoue, N.; Furuta, R.; Yamasaki, T.; Aizawa, K. Cross-Domain Weakly-Supervised Object Detection Through Progressive Domain Adaptation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5001–5009.
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.
- Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975.
- Jeong, J.; Park, G.; Yoo, J.; Jung, H.; Kim, H. ProxyDet: Synthesizing proxy novel classes via classwise mixup for open-vocabulary object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 2462–2470.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763.
- Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
- Wu, S.; Zhang, W.; Jin, S.; Liu, W.; Loy, C.C. Aligning bag of regions for open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15254–15264.
- Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 728–755.
- Kuo, W.; Cui, Y.; Gu, X.; Piergiovanni, A.J.; Angelova, A. Open-Vocabulary Object Detection upon Frozen Vision and Language Models. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
- Menezes, A.G.; de Moura, G.; Alves, C.; de Carvalho, A.C. Continual Object Detection: A review of definitions, strategies, and challenges. Neural Netw. 2023, 161, 476–493.
- Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
- Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010.
- Acharya, M.; Hayes, T.L.; Kanan, C. RODEO: Replay for online object detection. arXiv 2020, arXiv:2008.06439.
- Li, W.; Wu, Q.; Xu, L.; Shang, C. Incremental learning of single-stage detectors with mining memory neurons. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1981–1985.
- Zhang, N.; Sun, Z.; Zhang, K.; Xiao, L. Incremental learning of object detection with output merging of compact expert detectors. In Proceedings of the 2021 4th International Conference on Intelligent Autonomous Systems (ICoIAS), Wuhan, China, 14–16 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7.
- Guan, L.; Wu, Y.; Zhao, J.; Ye, C. Learn to detect objects incrementally. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Suzhou, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 403–408.
- Yang, D.; Zhou, Y.; Shi, W.; Wu, D.; Wang, W. RD-IOD: Two-Level Residual-Distillation-Based Triple-Network for Incremental Object Detection. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–23.
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–55.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
- Aich, A. Elastic weight consolidation (EWC): Nuts and bolts. arXiv 2021, arXiv:2105.04093.
- Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8429–8438.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.







| Shots | Methods | ZCOCO | Avg | Clipart | Comic | Watercolor | Drone | Thermal | Aquarium |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Zero Original Model | 47.40 | 35.59 | 53.33 | 42.94 | 45.53 | 11.90 | 39.31 | 31.59 |
| Full | MR-GDINO | 47.40 | 40.60 † | 49.37 | 38.95 | 44.38 | 21.60 | 66.63 | 54.35 |
| Full | ZiRa | 46.06 | 42.77 ± 0.06 * | 60.48 | 48.56 | 50.00 | 14.44 | 62.05 | 49.83 |
| Full | Ours | 47.40 | 46.97 ± 0.14 * | 64.40 | 50.95 | 53.17 | 19.87 | 66.39 | 56.06 |

| Shots | Methods | Avg | Clipart | Comic | Watercolor | Drone | Thermal | Aquarium |
|---|---|---|---|---|---|---|---|---|
| 0 | Zero Original Model | 35.59 | 53.33 | 42.94 | 45.53 | 11.90 | 39.31 | 31.59 |
| 1 | MR-GDINO | 33.11 | 52.17 | 39.39 | 44.33 | 10.09 | 38.32 | 23.71 |
| 1 | ZiRa | 36.07 | 54.40 | 42.23 | 45.89 | 9.48 | 52.73 | 33.57 |
| 1 | Ours | 31.49 | 51.26 | 31.76 | 40.07 | 7.43 | 46.62 | 31.72 |
| 3 | MR-GDINO | 31.80 | 46.94 | 34.21 | 41.24 | 8.47 | 52.18 | 31.93 |
| 3 | ZiRa | 37.07 | 56.47 | 42.40 | 46.32 | 10.22 | 58.88 | 32.65 |
| 3 | Ours | 37.53 | 58.30 | 43.61 | 48.03 | 7.40 | 54.66 | 37.90 |
| 5 | MR-GDINO | 32.66 | 46.25 | 33.01 | 38.73 | 10.24 | 55.79 | 40.31 |
| 5 | ZiRa | 38.52 | 56.82 | 41.58 | 46.24 | 12.55 | 60.13 | 39.88 |
| 5 | Ours | 39.65 | 58.81 | 43.91 | 46.27 | 12.97 | 57.31 | 42.92 |
| 10 | MR-GDINO | 33.93 | 45.17 | 33.93 | 42.72 | 12.21 | 59.67 | 39.73 |
| 10 | ZiRa | 39.81 | 57.55 | 44.07 | 48.52 | 13.23 | 59.57 | 41.40 |
| 10 | Ours | 40.90 | 58.64 | 45.22 | 47.64 | 14.74 | 57.69 | 45.87 |

| Method | Backward Transfer (BWT) ↑ | Forgetting Measure (F) ↓ |
|---|---|---|
| ZiRa | −3.32% | 4.42% |
| D-Know (Ours) | −0.10% | 1.14% |
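
The two columns above can be read against the definitions most commonly used in continual learning. The sketch below assumes those conventional definitions (BWT as the mean change on earlier tasks after the final task, F as the mean drop from the best earlier score); the exact formulas used for this table are not shown here, and the score matrix `acc` is hypothetical illustration data, not results from the paper.

```python
from typing import Sequence

# acc[i][j] = detection score (e.g. mAP) on task j, measured after training on task i.
# Entries with j > i are unused by both metrics.
Matrix = Sequence[Sequence[float]]


def backward_transfer(acc: Matrix) -> float:
    """Mean of (final score - score right after learning) over all but the last task."""
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two tasks")
    final = acc[n - 1]
    return sum(final[j] - acc[j][j] for j in range(n - 1)) / (n - 1)


def forgetting(acc: Matrix) -> float:
    """Mean of (best score before the final task - final score) over all but the last task."""
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two tasks")
    final = acc[n - 1]
    drops = []
    for j in range(n - 1):
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / (n - 1)


# Hypothetical three-task run (values invented for illustration only).
acc = [
    [60.0, 0.0, 0.0],
    [58.0, 50.0, 0.0],
    [57.0, 49.0, 40.0],
]
bwt = backward_transfer(acc)
f = forgetting(acc)
print(f"BWT = {bwt:.2f}, F = {f:.2f}")
```

Under these definitions F can never be smaller than the negated BWT, because F compares against the best earlier score rather than the score right after learning. The table is consistent with that relation for both methods.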

| Ablation | DSM | Image PEFT | Text PEFT | Decoupled Training | Params (M) | Avg | Gain |
|---|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | ✗ | ✗ | 174.84 | 35.59 | - |
| Component | ✗ | ✓ | ✓ | ✗ | 179.14 | 45.48 | +9.89 |
| Component | ✓ | ✗ | ✗ | ✗ | 175.06 | 40.39 | +4.80 |
| Component | ✓ | ✓ | ✗ | ✗ | 179.29 | 42.92 | +7.33 |
| Component | ✓ | ✗ | ✓ | ✗ | 175.13 | 45.97 | +10.38 |
| Strategy | ✓ | ✓ | ✓ | ✗ | 179.36 | 46.14 | +10.55 |
| Strategy | ✓ | ✓ | ✓ | ✓ | 179.36 | 47.10 | +11.51 |

| Component | Parameters (M) | Status | Percentage of Total |
|---|---|---|---|
| Backbone (Grounding DINO) | 174.84 | Frozen | 97.48% |
| Our Trainable Components | | | |
| Domain-Specific Modulator | 0.2211 | Trainable | 0.12% |
| Image PEFT Module | 4.2291 | Trainable | 2.36% |
| Text PEFT Module | 0.0659 | Trainable | 0.04% |
| Subtotal (Trainable) | 4.5161 | - | 2.52% |
| Total | 179.3561 | - | 100.00% |
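
The percentages in the parameter table follow directly from the listed counts. As a small arithmetic check, the sketch below recomputes each share from the parameter counts given in the table; the dictionary layout and the helper name `percentage_of_total` are illustrative choices, not part of the D-Know code.

```python
# Parameter counts in millions, copied from the table above.
frozen = {"Backbone (Grounding DINO)": 174.84}
trainable = {
    "Domain-Specific Modulator": 0.2211,
    "Image PEFT Module": 4.2291,
    "Text PEFT Module": 0.0659,
}


def percentage_of_total(params: float, total: float) -> float:
    """Share of the total parameter count, in percent, rounded to two decimals."""
    return round(100.0 * params / total, 2)


trainable_total = sum(trainable.values())
total = sum(frozen.values()) + trainable_total

shares = {
    name: percentage_of_total(count, total)
    for name, count in {**frozen, **trainable}.items()
}
trainable_share = percentage_of_total(trainable_total, total)

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
print(f"Subtotal (Trainable): {trainable_total:.4f} M, {trainable_share:.2f}%")
print(f"Total: {total:.4f} M")
```

The same counts also reproduce the Params (M) column of the ablation table: adding only the modules enabled in a given row to the 174.84 M backbone gives that row's value after rounding to two decimals.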
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
He, B.; Yan, C.; Kou, Y.; Wang, Y.; Lv, X.; Du, H.; Xie, Y. D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection. Appl. Sci. 2025, 15, 12723. https://doi.org/10.3390/app152312723

