Article

PSMP: Category Prototype-Guided Streaming Multi-Level Perturbation for Online Open-World Object Detection

1 Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
2 Shandong Province Higher Education Institutions Future Industry Engineering Research Center for Artificial Intelligence Safety, Qingdao 266580, China
3 Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao 266580, China
4 School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
5 China Electronics Technology Group Corporation’s 22nd Research Institute (Qingdao), Qingdao 266107, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1237; https://doi.org/10.3390/sym17081237
Submission received: 11 June 2025 / Revised: 10 July 2025 / Accepted: 23 July 2025 / Published: 5 August 2025
(This article belongs to the Special Issue Symmetry and Asymmetry in Computer Vision and Graphics)

Abstract

Inspired by the human ability to learn continuously and adapt to changing environments, researchers have proposed Online Open-World Object Detection (OLOWOD). This emerging paradigm faces the challenges of detecting known categories, discovering unknown ones, continuously learning new categories, and mitigating catastrophic forgetting. To address these challenges, we propose Category Prototype-guided Streaming Multi-Level Perturbation (PSMP), a plug-and-play method for OLOWOD. PSMP comprises semantic-level, enhanced data-level, and enhanced feature-level perturbations jointly guided by category prototypes; these operate at different representational levels to collaboratively extract latent knowledge across tasks and improve adaptability. In addition, PSMP constructs a “contrastive tension” based on the relationships among category prototypes. This mechanism inherently leverages the symmetric structure formed by class prototypes in the latent space, where prototypes of semantically similar categories tend to align symmetrically or equidistantly. By guiding perturbations along these symmetric axes, the model can achieve more balanced generalization between known and unknown categories. PSMP requires no additional annotations, is lightweight in design, and can be seamlessly integrated into existing OWOD methods. Extensive experiments show that PSMP achieves an improvement of approximately 1.5% to 3% in mAP for known categories compared to conventional online training methods while significantly increasing the Unknown Recall (UR) by around 4.6%.

1. Introduction

Object detection (OD) is a fundamental task in computer vision that aims to identify all objects of interest within an image or video, accurately localizing and classifying them. The technology has been applied in a variety of domains, including, but not limited to, autonomous driving, video surveillance, facial recognition, and medical image analysis. Notwithstanding the considerable advances in detection accuracy achieved in recent years, the development of OD models remains fundamentally constrained by the closed-world assumption, under which all object categories are predefined and annotated during training, so that models can recognize only the categories present in the training data. This limitation stands in stark contrast to the open world, which is inherently open and dynamic and in which both known and unknown categories frequently emerge. In such scenarios, models are expected not only to detect known and unknown categories, but also to incrementally learn the unknown ones once identified. However, traditional OD methods [1,2,3,4] are inherently prone to catastrophic forgetting when adapting to novel categories, a phenomenon that severely impairs their generalization capabilities and practical applicability. This challenge has emerged as a critical bottleneck hindering the continued advancement of object detection technologies.
Traditional object detection methods are predominantly built upon Convolutional Neural Networks (CNNs) and, more recently, Transformer-based architectures. Classical two-stage detectors such as Faster R-CNN [5], as well as one-stage detectors like the YOLO series [6,7,8], have achieved significant success under the closed-world assumption with predefined categories. In addition, with the advent of Transformers [9], methods such as Deformable DETR [10] have integrated Transformer architectures into object detection frameworks, further enhancing the model’s adaptability to complex scenes and improving detection accuracy. However, these approaches still operate under the closed-world assumption and are unable to cope with the challenges posed by open-world scenarios.
To address this challenge, Joseph et al. [11] formally introduced Open-World Object Detection (OWOD) as a new paradigm in object detection. OWOD is designed to detect and localize unknown objects in the presence of known categories while enabling the continual learning of newly emerging classes. This paradigm fundamentally breaks away from the closed-set assumption of traditional object detectors and opens up new directions for learning in a dynamic, open world.
Concurrent with these developments, ORE [11], proposed by Joseph et al., was the first to establish a standardized task formulation for OWOD, introducing an open-set recognition loss and energy-based learning to enable preliminary awareness of unknown categories. Building upon this foundation, OW-DETR [12] integrates the DETR architecture with incremental learning mechanisms, improving performance on large-scale datasets. Subsequently, RE-OWOD [13] revisits the OWOD task and designs new benchmarks, PROB [14] incorporates probabilistic modeling to quantify uncertainty, and CAT [15] leverages contextual information to enhance the recognition of unknown classes. Despite the notable progress under static training–testing paradigms, these methods lack the adaptability needed for deployment in dynamic, open-world environments.
In open-world scenarios, data acquisition typically follows a streaming paradigm rather than a static batch-based provision. This dynamic nature of streaming data presents significant challenges for OWOD models. While current state-of-the-art methods (e.g., OCPL [16] and OW-RCNN [17]) have achieved considerable success in identifying unknown categories, their fundamental reliance on static datasets and their requirement for extensive training iterations (often dozens to hundreds of epochs) render them inadequate for streaming data scenarios. This inherent limitation becomes particularly evident when these methods encounter streaming data, manifesting as severe catastrophic forgetting of previously learned known categories. To address this limitation, Chen et al. [18] first introduced the formulation of Online Open World Object Detection (OLOWOD), which enables models to dynamically detect unknown objects and incrementally learn new categories during the continuous arrival of data, thereby narrowing the gap between algorithmic advances and open-world applicability.
Chen et al. [18] further proposed a plug-and-play approach, BSDP, to enhance the performance of existing OWOD methods under the OLOWOD formulation. Inspired by neuroscience [19], BSDP is the first to introduce a perturbation mechanism into OWOD. By incorporating a “dual-perturbation mechanism”, it simulates the brain’s ability to form new neural connections through moderate noise. On the one hand, BSDP perturbs new samples at the feature level using the prototype features of old class samples preserved during incremental learning, thereby enhancing the model’s ability to retain previously acquired knowledge. On the other hand, BSDP generates adversarial examples based on the feature distribution of old samples as a form of data-level perturbation, aiming to improve the model’s robustness to various categories.
Although BSDP is the first attempt to introduce and explore the challenging formulation of the OLOWOD problem and has achieved promising results, it still suffers from several limitations:
  • Weak handling of inter-class interference: Although BSDP’s feature-level and data-level perturbation strategies improve the model’s plasticity, they may also introduce ambiguous interference between categories, leading to fluctuations in the recognition accuracy of known classes in specific tasks.
  • Lack of diversity in perturbation generation: BSDP utilizes feature-level and data-level perturbations to enhance the performance of the model. However, both types of perturbations are constructed based solely on the statistical characteristics of known (old) categories, lacking mechanisms for modeling the diversity and generalization of unknown categories. Consequently, the perturbations exhibit constrained coverage within the semantic space, resulting in inadequate simulation of diverse and potentially unobserved categories. This limitation restricts the model’s adaptability in complex, open-world scenarios.
Existing works rarely explore the underlying symmetric structure among category prototypes in the feature space. In many real-world classification scenarios, category centers or prototypes tend to arrange themselves in a latent symmetric pattern, either geometrically (e.g., equidistant on a hypersphere) or semantically (e.g., hierarchical symmetry). In this paper, we argue that leveraging such symmetry can enhance representation learning and category discrimination, particularly under the open-world setting where unknown classes continuously emerge.
Inspired by the aforementioned observations, we propose a novel plug-and-play method tailored for online incremental learning, termed Prototype-guided Streaming Multi-Level Perturbation (PSMP). PSMP introduces three perturbation mechanisms—semantic-level, data-level, and feature-level perturbations—that are independent yet complementary. Uniquely, all perturbations are consistently guided by category prototypes, which act as semantic anchors to impose structured and directional changes within the feature space.
From a technical standpoint, this design draws on two fundamental insights: Prototype Theory, which models prototypes as compact representations of class-level semantics that capture low-rank, stable directions in the feature manifold; and stability theory, which emphasizes the generalization benefit of controlled, semantically aligned perturbations. Unlike isotropic or random noise injection, PSMP performs directional perturbations along the semantic axes formed by inter-class prototype discrepancies. These axes often reflect a latent symmetry among prototypes, where the class centers are positioned in balanced or equidistant patterns within the embedding space. By perturbing along these symmetric directions, the model is better able to preserve known category boundaries while expanding into the space of unknown categories. These structured perturbations serve dual purposes: they pull old-class features closer to their corresponding prototypes to mitigate forgetting, and they push new-class features away from old-class centers to enhance inter-class separability.
This multi-level, prototype-guided perturbation framework dynamically reinforces class-specific alignment, maintains discriminative boundaries under continual data shifts, and encourages robust feature disentanglement. As a result, PSMP not only preserves prior knowledge during online learning but also improves the model’s capacity to discover and isolate novel or unknown categories under open-world conditions.
PSMP requires no additional annotations, features a lightweight and flexible design, and can be easily integrated into existing detection frameworks. It significantly improves unknown-class recognition while alleviating catastrophic forgetting, demonstrating strong scalability and practical utility. Experimental results show that PSMP improves the detection accuracy of known categories by approximately 1.5% to 3% compared to the original online training manner while substantially enhancing the recall of unknown classes. The primary contributions of this paper are outlined below:
  • We propose a plug-and-play method called Category Prototype-guided Streaming Multi-Level Perturbation (PSMP). PSMP effectively enhances the model’s ability to distinguish both known and unknown categories by introducing data-level, feature-level, and semantic-level perturbations on the training dataset and image features while significantly mitigating catastrophic forgetting.
  • We optimize the prototype generation mechanism and design a semantic-level perturbation based on contrastive tension between inter-class prototypes, which guides feature perturbations along semantic boundaries in the latent space. This improves the model’s ability to delineate class boundaries, enhance generalization, and strengthen unknown category detection while alleviating forgetting. Furthermore, we refine the data-level and feature-level perturbation strategies from BSDP, effectively improving the representational stability and generalizability of class prototypes while enhancing the controllability of perturbations and the overall model performance.
  • We conduct a systematic evaluation of PSMP on standard OWOD benchmark tasks. The experimental results demonstrate that our method achieves consistently lower WI and A-OSE metrics across multiple task phases while significantly improving the detection performance of known categories. The results obtained demonstrate the robust compatibility, practicality, and scalability of PSMP.

2. Related Work

2.1. Traditional Object Detection (OD)

Early object detection methods can be broadly categorized into two-stage and one-stage detectors based on their architectural design. Two-stage approaches are exemplified by the R-CNN [20] series (e.g., Fast R-CNN [5], Mask R-CNN [1], Cascade R-CNN [21], Sparse R-CNN [22]), which typically first generate a set of region proposals and then perform classification and regression for each proposal. Although these methods achieve high detection accuracy, they tend to be computationally intensive and correspondingly slower at inference. Conversely, one-stage detectors such as the YOLO series and SSD [23] approach object detection as a unified regression task, directly mapping input images to class labels and bounding box coordinates. This design improves inference efficiency, making it well suited for open-world scenarios.
In recent years, with the rise of the Transformer architecture in vision tasks, an increasing number of object detection methods have begun to integrate Transformers at various stages of their pipelines [24]. On the one hand, Transformers can replace convolutional neural networks (CNNs) as backbone networks for feature extraction. Dosovitskiy et al. [25] first introduced the Vision Transformer (ViT) for image classification, which was later adapted for object detection tasks [26,27,28]. On the other hand, Transformer encoders and decoders have also been independently applied to object detection frameworks. Carion et al. [29] proposed DETR, which was the first to adopt a Transformer encoder–decoder architecture for object detection. DETR reinterprets object detection as a set prediction problem, thereby obviating the necessity for handcrafted components such as anchor generation and non-maximum suppression (NMS). Zhu et al. [10] further introduced Deformable DETR, which incorporates a deformable attention mechanism to improve training efficiency and enhance performance, particularly for small object detection.
Despite the significant success of the aforementioned methods, their training and learning processes are predominantly based on the closed-world assumption. As a result, they exhibit several limitations under the open-world assumption. These methods are incapable of recognizing new categories not predefined during training, lack adaptability to new category spaces, and are prone to catastrophic forgetting during incremental learning. Consequently, traditional object detection methods are inadequate in meeting the demands for flexibility and scalability in open-world applications, thereby establishing both the theoretical foundation and practical motivation for the study of OWOD.

2.2. Open-World Object Detection

Open World Object Detection (OWOD) is a practical object detection task formulated under the open-world assumption. OWOD assumes that the world is open and full of unknowns. In contrast to conventional object detection models, OWOD requires continuous adaptability: the model must detect objects from known categories, recognize unknown categories, and continually acquire knowledge of emerging ones. In 2021, Joseph et al. [11] first introduced and formalized the paradigm of OWOD, identifying three fundamental challenges: distinguishing unknown objects from the background, mitigating catastrophic forgetting, and enabling continual learning.
Following the introduction of this emerging paradigm, a variety of approaches have been proposed to tackle these challenges. Gupta et al. [12] proposed OW-DETR, the first method to incorporate a Transformer-based architecture with attention-guided pseudo-labeling and a novelty classification module, significantly enhancing the model’s ability to recognize unknown objects. Building on this work, Zohar et al. [14] introduced the PROB method, which employs probabilistic objectness modeling to improve the estimation accuracy of unknown object regions.
Wu et al. [30] proposed UC-OWOD, which further extended the formulation of OWOD by introducing a new problem setting called Unknown Classified Open-World Object Detection. This approach facilitates the identification and classification of multiple unknown categories through a two-stage detector. Yu et al. [16] proposed OCPL, which incorporates Class Prototype Learning (CPL) to compress intra-class distributions in the potential feature space, thereby effectively distinguishing between known and unknown categories while facilitating the incremental learning of new categories. Shaheen et al. [31] proposed the OWOD-NP framework, which adopts a non-parametric prototype learning mechanism to identify unknown objects during incremental learning and enhance the overall mean average precision (mAP).
Furthermore, Pershouse et al. [17] proposed OW-RCNN, a comprehensive framework that systematically addresses the three main challenges in OWOD, including recognizing unknown objects, reducing confusion between classes, and supporting incremental learning. OW-RCNN achieved significant improvements in U-Recall and A-OSE metrics on the MS-COCO [3] benchmark. Zhao et al. [13] revisited the OWOD task and designed new benchmarks and evaluation metrics to more comprehensively evaluate model performance on unknown detection. Li et al. [32] systematically review the progress in Open-World Object Detection, covering problem formulation, key techniques, challenges, and mainstream methods; they analyze how to detect unknown categories and continuously learn in dynamic environments, highlight future research directions and potential trends, and provide comprehensive theoretical and practical guidance for researchers in this field. Mullappilly et al. [33] propose a new framework called SS-OWFormer, which enables object detectors to learn from both labeled and unlabeled data while discovering unknown object categories, significantly reducing annotation costs in real-world scenarios. Ma et al. [34] propose Instance-Dictionary Learning (IDL) to improve Open-World Object Detection in autonomous driving by learning robust representations that align visual features with semantic categories and reduce both category and domain gaps. He et al. [35] propose SGROD, a SAM-guided Open-World Object Detector that significantly improves the recall of unknown objects without sacrificing precision on known ones by introducing dynamic label assignment and cross-layer learning techniques. Xue et al. [36] enhance Open-World Object Detection by generating diverse synthetic datasets using AIGC (via Stable Diffusion + LoRA) and mitigating catastrophic forgetting through Elastic Weight Consolidation, achieving high precision and better retention of old categories. Fang et al. [37] introduce an unsupervised approach for recognizing unknown objects in Open-World Object Detection by learning from raw region proposals without relying on known-object supervision, significantly improving unknown detection performance. Zhao et al. [38] propose a dual-detector framework for Open-World Object Detection that uses a class-agnostic detector to identify all foreground objects and a class-specific detector to classify known categories, improving unknown object discovery while preserving known detection quality. Jamonnak et al. [39] introduce OW-Adapter, a human-in-the-loop framework that enables pre-trained object detectors to perform Open-World Object Detection by identifying and learning unknown classes with only a few annotated examples, minimizing training cost and architectural changes.

2.3. Incremental Learning

Incremental learning (IL) is not only a popular research topic but also a fundamental challenge in OWOD, and numerous methods [40,41,42] have been proposed in this domain. The paradigm of incremental learning can be broadly categorized into Class Incremental Learning (CIL) [43,44] and Task Incremental Learning (TIL) [45,46]. TIL refers to dividing the data into multiple tasks in chronological order, where each task’s data is available in a single batch, and classification or detection is performed independently across tasks. During inference, the model requires the task ID to select the corresponding output head, thus adopting a multi-head output structure. In contrast, CIL treats incoming data at different time points as extensions of the same task with new categories. CIL requires the model to use a single output head and incrementally expand the number of recognized classes without relying on task IDs. This single-head structure makes CIL more challenging than TIL.
Based on whether samples from old categories are retained after training, IL can be further categorized into memory-based and memory-free approaches. Memory-based incremental learning methods retain a small subset of representative samples from previous tasks, leveraging replay mechanisms to facilitate the model’s rapid adaptation to new tasks. Representative methods include MER [47] and Multi-Objective Optimization [48]. In contrast, memory-free methods rely on imposing constraints on model parameters or applying knowledge distillation to facilitate adaptation to new tasks without storing any old samples. Notable examples include iCaRL [49] and HFC [50].
According to different learning paradigms, incremental learning can be categorized into online IL [51,52] and offline IL. Most existing OWOD models and traditional deep learning approaches adopt an offline training strategy, where models are trained on the same dataset for dozens or even hundreds of epochs to achieve satisfactory performance. However, this training paradigm deviates significantly from the way data is acquired in open-world scenarios. In the Online IL protocol, data arrives sequentially in the form of a stream, and the model is allowed to train on each instance only once (i.e., each sample is used for a single epoch) without multiple iterations. Online IL requires the model to continuously learn new samples and categories as data arrives, while also retaining previously acquired knowledge and mitigating forgetting.
Building on this foundation, Chen et al. [18] first introduced the concept of Online Open World Object Detection (OLOWOD). Unlike OWOD, OLOWOD assumes that training data arrives in a streaming manner, where each sample can be seen only once. This setting not only requires the model to detect unknown categories and incrementally learn new ones, but also demands continuous model updates under single-pass data constraints while avoiding the forgetting of previously learned knowledge. To address this, Chen et al. proposed a plug-and-play method for OLOWOD named Brain-inspired Streaming Dual-level Perturbations, BSDP. Inspired by neuroscience, BSDP introduces a dual-level perturbation mechanism, where dual-stream information from old samples is used as perturbations for new samples. This mimics the neuroscientific phenomenon that “specific noise can help the brain form new connections and neural pathways”, enabling the model to acquire new knowledge without forgetting the old. Although BSDP explores and improves upon the Online-OWOD problem to a certain extent, its relatively simplistic dual-stream perturbation mechanism still leaves considerable room for enhancement.

3. Method

3.1. Problem Formulation and Motivation

3.1.1. Problem Formulation

Based on the OWOD paradigm, OLOWOD introduces a more challenging incremental learning mechanism, which essentially falls under the category of Task Incremental Learning (TIL). In the OLOWOD protocol, a sequence of tasks $T = \{T_1, T_2, \ldots, T_N\}$ is presented in chronological order. For each task $T_t$ with $t \geq 2$, the model receives training data $D_{train}^{t}$ associated with a set of categories $C^{t} = \{c_1, c_2, \ldots, c_n\}$ in a streaming manner. The model is allowed to traverse the data in $D_{train}^{t}$ only once, meaning that each sample is seen exactly once and will not be reused in future tasks. In other words, the training data $D_{train}^{t}$ is visible only during the training of task $T_t$, and each sample is accessible only during its first occurrence.
Notably, for the first task $T_1$, it is assumed that the model can perform multiple passes over the training data $D_{train}^{1}$, resembling traditional object detection settings, in order to establish a strong initialization on known categories. Consequently, in the OLOWOD protocol, task $T_1$ is treated as an offline training stage, while tasks $T_t$ with $t \geq 2$ are trained in an online manner.
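To make the single-pass constraint concrete, the following minimal Python sketch outlines the protocol; the task objects and the `train_step` and `evaluate` callables are hypothetical placeholders rather than components of our method.

import itertools

def run_olowod(tasks, model, train_step, evaluate, epochs_t1=16):
    # tasks: sequence of task objects, each exposing train_batches().
    for t, task in enumerate(tasks, start=1):
        if t == 1:
            # Task T_1 is trained offline: multiple passes over D_train^1.
            for _ in range(epochs_t1):
                for batch in task.train_batches():
                    train_step(model, batch)
        else:
            # Tasks T_t (t >= 2) are trained online: each sample is seen exactly once.
            for batch in task.train_batches():
                train_step(model, batch)
        # After each task, report known-class mAP, UR, WI, and A-OSE.
        evaluate(model, task)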

3.1.2. Motivation

Prototype-guided perturbation serves not merely as a regularized alternative to random noise but also as a synergistic learning paradigm that integrates structural awareness, inter-class discrimination, and intra-class consistency. This approach enables a dynamic equilibrium between enhancing generalization and mitigating catastrophic forgetting.
Prototypes function as cognitive anchors. According to Prototype Theory in cognitive psychology, humans do not memorize every instance individually when learning new concepts; instead, they extract a central representation—the prototype—to guide future categorization. Analogously, in machine learning, a class prototype can be viewed as a compact representation synthesized from multiple instances of that class. It embodies a low-rank subspace that captures the class’s core semantic direction, offering high stability and interpretability. Such prototypes provide a reliable reference for assessing the affiliation of novel samples, making them especially valuable in open-world and continual learning scenarios.
Prototype-guided perturbation vs. conventional perturbation. Unlike traditional perturbation strategies that introduce noise in a random or isotropic manner within the feature space, prototype-guided perturbation imposes a structured adjustment aligned with inter-class semantic differences. In PSMP, we construct an inter-class prototype discrepancy tensor to quantify the semantic tension between classes. Principal directions are then extracted from this tensor via Principal Component Analysis (PCA), defining a semantic axis along which perturbations are injected. This structured perturbation serves multiple purposes: it pulls representations of old-class samples back toward their respective prototype centers to alleviate forgetting, it simultaneously pushes new-class representations away from old prototypes to enhance discriminability, and it refines the semantic positioning of samples near ambiguous decision boundaries. Fundamentally, this is a structure-constrained perturbation mechanism that encourages sample migration along semantically meaningful trajectories on the feature manifold, rather than injecting arbitrary or disruptive noise. In doing so, it facilitates more faithful alignment with class-specific subspaces and promotes robust feature disentanglement.
Perturbation-induced gains in generalization. From the perspective of stability theory, a model is considered robust and generalizable if its predictions remain consistent under small perturbations, i.e., $f(x + \sigma) \approx f(x)$. Crucially, when the direction of perturbation is semantically aligned with inter-class variations, the model learns to focus on features that are truly discriminative rather than those that are sensitive to irrelevant noise dimensions. Prototype-guided perturbation leverages this principle by injecting perturbations along directions that reflect semantic differences between classes. Rather than introducing arbitrary noise, this targeted semantic perturbation reinforces the model’s attention to class-defining features and their underlying structure. As a result, the model becomes more effective at capturing essential class semantics, thereby enhancing both generalization performance and representation stability, particularly in open-world and incremental learning scenarios where adaptability and discrimination must be simultaneously maintained.

3.2. Overall Architecture

The overall architecture of PSMP is illustrated in Figure 1. Our method is centered around category prototypes and incorporates multi-level perturbation strategies to enhance the model’s ability to recognize both known and unknown categories while avoiding catastrophic forgetting. Specifically, we first construct class prototypes to capture the core features of intra-class distributions, as shown in Figure 1c. Based on these prototypes, we propose a semantic-level perturbation mechanism that leverages the “contrastive tension” between inter-class prototypes to generate semantic-level perturbations, as shown in Figure 1b. To reinforce the model’s memory of known categories at the feature level, we design an enhanced feature-level perturbation module, as shown in Figure 1d. In addition, we propose a data-level enhanced perturbation method to improve the model’s robustness to input variations, as shown in Figure 1a. Since PSMP is a plug-and-play method, it can be integrated with mainstream OWOD methods such as ORE [11], OCPL [16], and OW-RCNN [17]. In this work, we choose OW-RCNN as our base framework due to its superior performance.
Training: For task $T_1$, we adopt an offline training strategy. After training on $D_{train}^{1}$, the data $D_{train}^{1}$ is fed again into the trained model to extract the bounding box features $f_{bbox} \in \mathbb{R}^{C}$ for each sample, where $C$ denotes the feature dimension. These extracted bounding box features $f_{bbox}$, together with the image features $f \in \mathbb{R}^{C}$ obtained via ROI Pooling, are used to compute the known (old) category prototypes $P \in \mathbb{R}^{C}$, as shown in Figure 1c. For incremental tasks $T_t$ ($t \geq 2$), we adopt an online training strategy. Before feeding the training data $D_{train}^{t}$ into the backbone, we first generate adversarial samples by applying data-level perturbations to $D_{train}^{t}$, based on the old category prototypes calculated after training task $T_{t-1}$, as shown in Figure 1a. The original training data and the adversarial samples are then fed together into the backbone. After ROI Align and ROI Pooling are applied to obtain fixed-size image features $f \in \mathbb{R}^{C}$, we apply semantic-level perturbation using the image features $f$ and the prototypes $P$, resulting in perturbed semantic features $f_{pert\_sem}$, which are then passed to the Convolutional Block Attention Module (CBAM). Subsequently, feature-level perturbations are applied to the output features $f_{CBAM}$ based on similarity computation with the old category prototypes. The final perturbed features $f_{pert\_fea}$ are then input to OW-RCNN’s classification head, class boundary head, IoU head, and class-agnostic head. After completing training on task $T_t$, we update the old category prototypes and continue the above process through all subsequent tasks until the final task $T_N$ is completed. Notably, after each task is completed, we replay a small set of representative samples selected according to the sample selection strategy proposed in Section 3.3 in order to better mitigate catastrophic forgetting.
Testing: During the testing phase, we do not apply data-level, semantic-level, or feature-level perturbations, nor do we compute category prototypes for known categories. The test data $D_{test}$ is fed directly into the model, and feature vectors $f_{test}$ are obtained via ROI Align and ROI Pooling. These features are then passed to the OW-RCNN classification head, class boundary head, IoU head, and class-agnostic head to produce the final prediction results.
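The following sketch summarizes, at a high level, where the three perturbations enter the training forward pass and how the test-time path omits them; all module and function names are placeholders for the components in Figure 1 and Sections 3.4–3.6, not the actual implementation.

def train_forward(batch, prototypes, old_class_feats, modules):
    # Data-level perturbation: augment the batch with adversarial copies (Section 3.6).
    batch = data_level_perturbation(batch, old_class_feats)
    feats = modules.roi_pool(modules.backbone(batch))         # per-ROI feature maps f
    feats = semantic_level_perturbation(feats, prototypes)    # Section 3.4
    feats = modules.cbam(feats)                               # channel + spatial attention
    feats = feature_level_perturbation(feats, prototypes)     # Section 3.5
    # Classification, class boundary, IoU, and class-agnostic heads of OW-RCNN.
    return modules.heads(feats)

def test_forward(batch, modules):
    # No perturbations and no prototype computation at test time.
    feats = modules.roi_pool(modules.backbone(batch))
    return modules.heads(feats)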

3.3. Enhanced Category Prototype Computation

Just as humans rely on learning to acquire knowledge and on memory to retain it for continuous cognitive development, machines also benefit from analogous mechanisms. In the protocol of OLOWOD, prototypes derived from previously learned categories act as crucial memory components, guiding the model in avoiding catastrophic forgetting and facilitating the identification of unknown categories. In this section, we provide a detailed introduction to our enhanced category prototype computation method and a prototype-distance-based sample selection strategy.

3.3.1. Inter-Class Prototype Computation via Pooling and Mean

In memory-based incremental learning, to help the model better adapt to new tasks, researchers [53,54] typically retain a small subset of representative samples that contain rich features from previously learned (old) categories, based on specific sample selection strategies. By replaying these samples during training, the model can reduce the tendency to overfit to new categories, thereby alleviating catastrophic forgetting to a certain extent. However, in OWOD tasks, a single image may simultaneously contain old categories, current known categories, and unknown categories. As the model continues to iterate, under the influence of the loss function, it gradually misclassifies old categories as either current known or unknown categories, resulting in progressive forgetting of the old categories.
To address this issue, we propose an enhanced category prototype computation method, as illustrated in Figure 2. We adopt a weighted fusion of global average pooling (GAP) and global max pooling (GMP) to harness the complementary strengths of two distinct feature aggregation strategies. Specifically, GAP captures the overall semantic contour of the feature map by summarizing average activations across spatial dimensions, providing a stable and holistic representation of the class. In contrast, GMP emphasizes the most salient and discriminative regions by focusing on peak responses, thereby enhancing sensitivity to key local features.
The integration of GAP and GMP strikes a balance between global stability and local sensitivity, leading to more robust and expressive prototypes. Compared to using either GAP or GMP alone, the fusion strategy better preserves feature diversity and mitigates the risk of information loss inherent to single-mode pooling. Moreover, in contrast to more sophisticated attention-based pooling methods (e.g., self-attention pooling), the GAP+GMP fusion offers a lightweight and implementation-friendly alternative with significantly lower computational cost—making it particularly suitable for multi-task and continual learning settings. Ablation studies (Section 4.4.1) further validate that this fusion strategy consistently improves model performance in incremental learning tasks.
As illustrated in Algorithm 1, after the training of task $T_t$ is completed, the training data $D_{train}^{t}$ is fed again into the model’s backbone to extract the feature maps $F$ for each sample. Subsequently, the ground-truth bounding boxes are used as input to the ROI Pooling module to obtain features $f \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the total number of instances from all known categories up to task $T_t$, $C$ is the number of channels, and $H$ and $W$ are the height and width of the features.
Algorithm 1 Enhanced Category Prototype Computation
Input:
  F_cls: feature maps of all instances in current class c, shape [N, C, H, W]
  Top_k: number of most representative samples to retain (e.g., 50)
  γ: weight coefficient for GAP and GMP fusion (default 0.5)
Output:
  P_c: final prototype vector for class c
1:  Initialize empty list candidate_features ← [ ]
2:  For i in range(N):
3:    feature_map ← F_cls[i]
4:    Compute GAP and GMP:
5:      gap ← GlobalAveragePooling(feature_map)
6:      gmp ← GlobalMaxPooling(feature_map)
7:    Fuse to form candidate feature:
8:      fused_feature ← γ × gap + (1 − γ) × gmp
9:    candidate_features.append(fused_feature)
10: end
11: center ← Mean(candidate_features)
12: distance_list ← [ ]
13: For feat in candidate_features:
14:   dist ← EuclideanDistance(feat, center)
15:   distance_list.append((feat, dist))
16: end
17: distance_list ← sort(distance_list, key = lambda x: x[1])
18: Retain Top_k closest features:
19:   selected_feats ← [feat for (feat, _) in distance_list[:Top_k]]
20: P_c ← Mean(selected_feats)
21: P_c ← Normalize(P_c)
22: Return P_c
Next, we split the features $f$ into a collection of category-specific subsets $\{f^{k}\}$, where $f^{k} \in \mathbb{R}^{B_k \times C \times H \times W}$ corresponds to the features of instances belonging to category $k$, and $B_k$ denotes the number of instances in that category, which may differ between categories. For each category, Global Average Pooling (GAP) is used to extract the average semantic information, and Global Max Pooling (GMP) is employed to capture the most representative features. These two are combined to form a robust feature representation $f_{bbox}$.
Subsequently, we compute the average of $f_{bbox}$ within each category to obtain the category prototype $f_{proto} \in \mathbb{R}^{C}$. Finally, prototypes from all known categories are aggregated into a matrix $M \in \mathbb{R}^{C \times n}$, where $n$ is the number of known categories. The computation of $f_{proto}$ is detailed in Equation (1).
$$ f_{proto} = \frac{1}{B_k} \sum_{i=1}^{B_k} \frac{GAP(f_i) + GMP(f_i)}{2}. $$
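As a concrete illustration, the following NumPy sketch implements Algorithm 1 for a single class, assuming the ROI-pooled instance feature maps are already available as an array; the weighting coefficient and top-k value follow the defaults stated above.

import numpy as np

def compute_class_prototype(instance_feats, top_k=50, gamma=0.5):
    # instance_feats: [N, C, H, W] ROI-pooled feature maps of one class.
    gap = instance_feats.mean(axis=(2, 3))          # [N, C] global average pooling
    gmp = instance_feats.max(axis=(2, 3))           # [N, C] global max pooling
    fused = gamma * gap + (1.0 - gamma) * gmp       # [N, C] fused candidate features

    center = fused.mean(axis=0)                     # provisional class center
    dists = np.linalg.norm(fused - center, axis=1)  # distance of each candidate to the center
    keep = np.argsort(dists)[: min(top_k, len(dists))]

    proto = fused[keep].mean(axis=0)                # mean of the Top_k closest candidates
    return proto / (np.linalg.norm(proto) + 1e-12)  # L2-normalized prototype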

3.3.2. Prototype Distance-Based Sample Selection Strategy

Unlike OW-RCNN [17], which constructs the memory replay subset using a class-balanced sampling strategy, we focus on leveraging prototypes to select a representative subset $R_t$ that best preserves knowledge of previously learned (old) categories in each training task. The subset $R_t$ is constructed using a prototype distance-based method, as described by Chen et al. [18]. The evaluation metric $Dist_p$ is defined as follows:
$$ Dist_p = \left\| f_{proto} - f_{bbox} \right\|_2 = \sqrt{ \sum_{k=1}^{C} \left( f_{proto}^{k} - f_{bbox}^{k} \right)^2 }, $$
where $k$ denotes the index of elements in the corresponding feature vector. A smaller $Dist_p$ value indicates that the feature contains richer knowledge of old categories and is therefore more representative. Following the approach of Chen et al., we select the top 50 most representative samples per category to fine-tune the model on new categories, thereby mitigating the effects of catastrophic forgetting. Specifically, the bounding box features $f_{bbox}$ of the selected top 50 samples from each category are also stored for use in the enhanced data perturbation process described in Section 3.6.
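A minimal sketch of this selection step is given below, assuming the per-category bounding-box features and the corresponding prototype are available as arrays.

import numpy as np

def select_replay_samples(bbox_feats, prototype, top_k=50):
    # bbox_feats: [N, C] bounding-box features of one old category.
    # prototype:  [C] prototype of that category.
    # Returns indices of the top_k features with the smallest Dist_p,
    # i.e., the most representative samples to retain for replay.
    dist_p = np.linalg.norm(bbox_feats - prototype, axis=1)
    return np.argsort(dist_p)[: min(top_k, len(dist_p))]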

3.4. Semantic-Level Perturbation via Prototype-Contrastive Tension

Analogous to the process of human cognitive development, infants initially perceive the visual world as completely unknown. However, through continuous exposure and learning, most objects in the world are gradually categorized into an existing class system. During this process, the number of unknown categories is extremely limited compared to the known categories, and they can often be explained by a finite set of cognitive prototypes. Inspired by this, we hypothesize that the overall category distribution in the feature space follows a multivariate Gaussian distribution, where high-density regions correspond to frequently occurring known categories, while the low-density tails may encompass potential unknown categories.
Previous work [55,56,57] has shown that appropriately introducing noise during model training can effectively improve the model’s classification and generalization capabilities. Kim et al. [58] enhanced representation ability by injecting semantic noise into the latent space, which significantly boosted model classification performance. Inspired by these studies, we propose a semantic-level perturbation based on prototype-driven contrastive tension. By explicitly constructing “contrastive tension” between categories and projecting the input data into a semantic space defined by these inter-class tensions, we add a structured prior perturbation of “inter-class semantics” to the features. This aims to alleviate catastrophic forgetting of old categories at the semantic level and improve the model’s ability to recognize unknown categories to a certain extent.
As illustrated in Figure 3, during the model’s transition from task $T_{t-1}$ to $T_t$, the absence of old-class labels in the current training dataset $D_{train}^{t}$ often leads to misclassification of old-class samples, which may be mistakenly identified as either current-task known classes or even as unknowns. To address this issue, we introduce a semantic-level perturbation $f_{pert\_sem}$ constructed from the inter-class semantic difference space. This perturbation maps unbiased noise sampled from a standard Gaussian distribution onto the real semantic variation directions defined by the principal axes of inter-class differences, injecting directional and structured perturbations into the latent space.
Specifically, it guides the image features in the feature space to make subtle adjustments along the semantic boundary direction: (1) Features of old-class samples misclassified as unknown are pulled toward their corresponding old-class prototypes, as shown in Figure 3, label a. (2) Old-class features lying outside the decision boundary are drawn back into their correct discriminative region, as shown in Figure 3, label b. (3) Features already within the correct region are further refined toward their class centroids, as shown in Figure 3, label c. (4) Features of unknown classes are pushed away from known class semantic centers, as shown in Figure 3, label d.
This mechanism constructs a contrast-aware perturbation with semantic tension, enhancing the model’s ability to capture inter-class boundaries and generalize across tasks. As a result, it alleviates catastrophic forgetting and also moderately improves the model’s capability to detect unknown categories.
We illustrate the feature generation process of semantic-level perturbation in Figure 4 and Algorithm 2. Specifically, during the training phase of task $T_t$, feature maps $f \in \mathbb{R}^{C \times H \times W}$ are obtained from the training data $D_{train}^{t}$ through ROI Pooling and then passed into the semantic perturbation module. First, the feature map $f$ is flattened into a feature matrix $F \in \mathbb{R}^{C \times HW}$. Meanwhile, we construct the inter-class prototype contrastive tension matrix $CT$ by performing pairwise subtractions between prototypes stored in matrix $M \in \mathbb{R}^{C \times n(t-1)}$, which contains the prototypes of known categories from previous tasks $T_1, \ldots, T_{t-1}$.
Algorithm 2 Semantic-Level Perturbation
Input:
  f ∈ R^{C×H×W}: feature map from ROI Pooling
  M ∈ R^{C×N}: prototype matrix from previous tasks (N = n × (t − 1))
  γ_sem: blending hyperparameter (e.g., 0.4)
Output:
  f_pert_sem: perturbed features after semantic-level perturbation
1:  F ∈ R^{C×HW} ← Flatten(f)
2:  T ← [ ]
3:  For i in range(N):
4:    For j in range(N):
5:      if i ≠ j:
6:        T.append(M[:, i] − M[:, j])   // f_proto^i − f_proto^j
7:    end
8:  end
9:  CT ∈ R^{C×K} ← Matrixization(T)   // K = N × (N − 1)
10: pca_model ← PCA()
11: pca_model.fit(CT^T)
12: cumulative_variance ← CumulativeSum(pca_model.explained_variance_ratio_)
13: D ← FirstIndexWhere(cumulative_variance ≥ 0.9) + 1
14: v ∈ R^{D×HW} ← TopDPrincipalComponents(pca_model)
15: F′ ∈ R^{C×D} ← F × v^T
16: M_cov ∈ R^{D×D} ← Covariance(CT)
17: L ∈ R^{D×D} ← Cholesky(M_cov)
18: δ ∈ R^{C×D} ← SampleFromStandardGaussian()
19: f_sem ∈ R^{C×D} ← δ × L^T
20: f_gen_sem ∈ R^{C×HW} ← F + f_sem × v^T
21: f_gen_sem ∈ R^{C×H×W} ← Reshape(f_gen_sem)
22: f_pert_sem ∈ R^{C×H×W} ← γ_sem · f + (1 − γ_sem) · f_gen_sem
23: Return f_pert_sem
$$ T = \left\{ f_{proto}^{i} - f_{proto}^{j} \;\middle|\; f_{proto}^{i}, f_{proto}^{j} \in M,\ i \neq j \right\}, $$
$$ CT = \left[ T_1, T_2, \ldots, T_{\left(n(t-1)-1\right) \times n(t-1)} \right], $$
where $f_{proto}^{i}$ denotes the prototype vector of the $i$-th category in matrix $M$. The contrastive tension matrix $CT$ contains a total of $\left(n(t-1)-1\right) \times n(t-1)$ column vectors, each of dimensionality $C$. Geometrically, these difference vectors characterize the semantic distribution directions and decision boundaries between different categories, reflecting the “semantic tension relationships” among categories. This structured prior serves as the foundation for constructing subsequent perturbations.
Subsequently, we perform Principal Component Analysis (PCA) on these difference vectors and extract the top $D$ principal components $v \in \mathbb{R}^{D \times HW}$ that together account for over 90% of the cumulative explained variance. We then compute the covariance matrix $M_{cov} \in \mathbb{R}^{D \times D}$, which reflects the distributional trends and perturbation sensitivity of different categories in the semantic space. $M_{cov}$ provides directional constraints for constructing discriminative semantic perturbations. Finally, the feature matrix $F \in \mathbb{R}^{C \times HW}$ is projected into the PCA-derived inter-class semantic difference space, resulting in a new feature representation $F' \in \mathbb{R}^{C \times D}$, as shown in Equation (5).
$$ F' = F \cdot v^{T}, $$
We perform Cholesky decomposition on $M_{cov}$, yielding a lower triangular matrix $L \in \mathbb{R}^{D \times D}$:
$$ M_{cov} = L \cdot L^{T}, $$
The matrix $L$ encodes the covariance information along the principal directions of the inter-class semantic difference space, serving as the foundation for generating structured semantic perturbations. Standard Gaussian sampling provides a neutral and unbiased source of perturbation, ensuring that all inter-class tension directions are equally likely to be sampled under the covariance structure. This approach maximally activates the discriminative dimensions within the semantic space without introducing additional priors. Based on this, we randomly sample $\delta \in \mathbb{R}^{C \times D}$ from a standard Gaussian distribution and compute the structured semantic perturbation $f_{sem} \in \mathbb{R}^{C \times D}$, which encodes inter-class differences, using Equation (7).
$$ f_{sem} = \delta \cdot L^{T}, $$
The semantic perturbation $f_{sem}$ is reprojected from the PCA space back to the original feature space via $v$ and added to the feature matrix $F$, yielding the perturbed feature $f_{gen\_sem} \in \mathbb{R}^{C \times H \times W}$:
$$ f_{gen\_sem} = F + f_{sem} \cdot v^{T}, $$
Finally, the generated semantic perturbation feature $f_{gen\_sem}$ is combined with the original feature $f$ at a certain ratio to obtain the perturbed feature $f_{pert\_sem}$:
$$ f_{pert\_sem} = \gamma_{sem} \cdot f + \left(1 - \gamma_{sem}\right) \cdot f_{gen\_sem}, $$
where $\gamma_{sem}$ is a hyperparameter set to 0.4. To enhance the representational quality and class separability of the features after semantic perturbation, we introduce a Convolutional Block Attention Module (CBAM) [59] to further strengthen the expression of semantic signals. CBAM jointly models attention along both the channel and spatial dimensions, effectively highlighting the feature dimensions and spatial regions that are more critical for class discrimination within the semantic perturbation.
On the one hand, CBAM effectively preserves the key directions within the perturbation that are relevant to distinguishing old categories; on the other hand, it suppresses the interference risk introduced by excessive irrelevant noise, thereby enhancing the model’s controllability over perturbation signals and the stability of decision boundaries.
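To summarize the full semantic-level pipeline (contrastive tension, PCA, Cholesky-structured sampling, reprojection, and blending), we provide a self-contained NumPy sketch below. Dimensions that are ambiguous in Algorithm 2 are resolved here by treating each spatial location of the ROI feature map as a C-dimensional vector; this resolution, and all variable names, are assumptions of the sketch rather than a description of the exact implementation.

import numpy as np

def semantic_level_perturbation(f, M, gamma_sem=0.4, var_keep=0.90, rng=None):
    # f: ROI feature map [C, H, W]; M: old-class prototype matrix [C, N_old].
    rng = np.random.default_rng() if rng is None else rng
    C, H, W = f.shape
    F = f.reshape(C, H * W)                                     # flattened feature matrix

    # 1. Contrastive tension: all pairwise prototype differences (K = N(N-1) vectors).
    N = M.shape[1]
    CT = np.stack([M[:, i] - M[:, j]
                   for i in range(N) for j in range(N) if i != j], axis=0)   # [K, C]

    # 2. PCA over the difference vectors; keep axes covering >= 90% of the variance.
    CT_c = CT - CT.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(CT_c, full_matrices=False)
    ratio = (S ** 2) / np.sum(S ** 2)
    D = int(np.searchsorted(np.cumsum(ratio), var_keep)) + 1
    D = min(D, len(ratio))
    v = Vt[:D]                                                  # [D, C] semantic axes

    # 3. Covariance of the differences in the semantic subspace, then Cholesky.
    cov = np.cov(CT_c @ v.T, rowvar=False) + 1e-6 * np.eye(D)   # [D, D]
    L = np.linalg.cholesky(cov)

    # 4. Structured Gaussian noise per spatial location, mapped back to R^C.
    delta = rng.standard_normal((D, H * W))                     # unbiased noise
    f_sem = L @ delta                                           # correlated noise in PCA space
    f_gen = (F + v.T @ f_sem).reshape(C, H, W)                  # reproject and add to F

    # 5. Blend with the original feature map.
    return gamma_sem * f + (1.0 - gamma_sem) * f_gen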

3.5. Enhanced Feature-Level Perturbations via Similarity Calculation

To further enhance the model’s retention of known categories, we incorporate a feature-level perturbation mechanism based on feature similarity from BSDP, applied after the semantic-level perturbation and the CBAM module. This mechanism introduces subtle perturbations around intra-class prototypes to simulate minor variations of the same class under different environmental conditions or viewpoints, thereby strengthening the model’s robustness to intra-class structural variations. Complementary to the inter-class semantic perturbation, this feature-level disturbance jointly facilitates the formation of a discriminative feature space that balances inter-class separability with intra-class compactness.
The generation of the feature-level perturbation is illustrated in Figure 5 and Algorithm 3. During the training phase of task $T_t$, the feature map $f \in \mathbb{R}^{C \times H \times W}$ output from the CBAM module serves as the input to the feature-level perturbation module. Differing from the feature-level perturbation in BSDP, we employ both Global Average Pooling (GAP) and Global Max Pooling (GMP) to jointly extract the feature vector $f_{in}$ from the input feature $f$, consistent with the feature extraction method used for prototype calculation in Section 3.3.
$$ f_{in} = \frac{1}{2}\left( GAP(f) + GMP(f) \right). $$
Algorithm 3 Feature-Level Perturbation
Input:
  f ∈ R^{C×H×W}: feature map from CBAM
  M ∈ R^{C×n(t−1)}: prototype memory matrix
  γ_fea: blending coefficient (e.g., 0.6)
Output:
  f_pert_fea: perturbed feature
1:  Compute GAP and GMP of feature f:
2:    GAP_f ← GAP(f)
3:    GMP_f ← GMP(f)
4:  Fuse into input feature vector:
5:    f_in ← 0.5 × (GAP_f + GMP_f)
6:  Compute cosine similarity between f_in and each prototype in M:
7:  For each prototype f_proto^i ∈ M:
8:    S_i ← (f_in · f_proto^i) / (‖f_in‖ · ‖f_proto^i‖)
9:  Compute attention weights via Softmax:
10:   W ← Softmax(S_1, S_2, …, S_{n×(t−1)})
11: Generate perturbation feature from weighted sum:
12:   f_gen_fea ← Σ_{i=1}^{n×(t−1)} W_i · f_proto^i
13: Fuse with original feature to obtain final perturbation:
14:   f_pert_fea ← γ_fea · f + (1 − γ_fea) · f_gen_fea
15: Return f_pert_fea
During the training of previous tasks $T_1, \ldots, T_{t-1}$, we stored $n \times (t-1)$ prototypes of the old categories in the prototype memory matrix $M$. As discussed in previous sections, these prototypes encapsulate rich semantic information about the old categories. To incorporate this information into the current task, we compute the cosine similarity $S_i$ between the extracted feature vector $f_{in}$ and each of the $n \times (t-1)$ stored prototypes in matrix $M$:
$$ S_i = \frac{ f_{in} \cdot f_{proto}^{i} }{ \left\| f_{in} \right\| \cdot \left\| f_{proto}^{i} \right\| }, \quad i \in \{1, 2, \ldots, n \times (t-1)\}, $$
where $f_{proto}^{i}$ denotes the $i$-th prototype in $M$. Subsequently, the computed $n \times (t-1)$ cosine similarity scores $S_i$ are passed through a Softmax function to obtain the weight vector $W$:
$$ \sigma\left(S_i\right) = \frac{ e^{S_i} }{ \sum_{k=1}^{n \times (t-1)} e^{S_k} }, \quad i \in \{1, 2, \ldots, n \times (t-1)\}, $$
$$ W = \left[ \sigma_1, \sigma_2, \ldots, \sigma_{n \times (t-1)} \right]. $$
The weight vector $W$ represents the correlation between all old-category prototypes in the matrix $M$ and the new samples. We multiply the weights in $W$ with the corresponding old-category prototypes in $M$ to obtain the feature perturbation $f_{gen\_fea}$, which encodes the correlation between the new and old categories, and add it to the feature $f$ in a certain proportion to obtain the perturbed feature $f_{pert\_fea}$:
$$ f_{gen\_fea} = \sum_{i=1}^{n \times (t-1)} \sigma_i \cdot f_{proto}^{i}, $$
$$ f_{pert\_fea} = \gamma_{fea} \cdot f + \left(1 - \gamma_{fea}\right) \cdot f_{gen\_fea}, $$
where $\gamma_{fea}$ is a hyperparameter and is set to 0.6.
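The following NumPy sketch illustrates this feature-level perturbation; it assumes the CBAM output and the prototype matrix are available as arrays, and it broadcasts the C-dimensional perturbation vector over the spatial dimensions when blending, which is one reasonable reading of the final fusion step.

import numpy as np

def feature_level_perturbation(f, M, gamma_fea=0.6):
    # f: feature map output by CBAM, shape [C, H, W]; M: prototype matrix [C, N_old].
    # GAP + GMP fusion of the input feature.
    f_in = 0.5 * (f.mean(axis=(1, 2)) + f.max(axis=(1, 2)))          # [C]

    # Cosine similarity between f_in and every stored prototype.
    protos = M.T                                                     # [N_old, C]
    sims = protos @ f_in / (np.linalg.norm(protos, axis=1)
                            * np.linalg.norm(f_in) + 1e-12)          # [N_old]

    # Softmax weights and weighted prototype mixture.
    w = np.exp(sims - sims.max())
    w /= w.sum()
    f_gen = protos.T @ w                                             # [C] perturbation vector

    # Blend with the original feature map (broadcast over spatial dimensions).
    return gamma_fea * f + (1.0 - gamma_fea) * f_gen[:, None, None]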

3.6. Enhanced Data-Level Perturbations via Multivariate Gaussian Distribution

The eyes are the windows to the soul; likewise, datasets serve as the “windows” through which models perceive the world. In Section 3.4, we propose the hypothesis that the category distribution in the feature space generally follows a multivariate Gaussian distribution. Regions of high density correspond to frequently occurring known categories, whereas the low-density tail regions are likely to encompass potential unknown categories. These unknown categories can often be approximately represented by a limited number of cognitive prototypes. Based on this assumption, we leverage the statistical characteristics of known category prototypes, as defined in Section 3.3, to generate adversarial samples aimed at enhancing the model’s robustness in an online manner.
Our enhanced data-level perturbation generation module is shown in Figure 6 and Algorithm 4. After each training task $T_t$ ($t \geq 2$) is completed, a fixed number $k$ of boundary features $f_{bbox}$ are stored for each class based on the sample selection strategy proposed in Section 3.3.2. Consequently, for task $T_t$, we retain a total of $(t-1) \times n \times k$ boundary features corresponding to the $(t-1) \times n$ previously encountered classes. For each class, we compute the mean vector $\mu_c \in \mathbb{R}^{C \times 1}$ and covariance matrix $\sigma_c \in \mathbb{R}^{C \times C}$. In addition, we compute the overall mean $\mu_{all}$ and covariance $\sigma_{all}$ across all $(t-1) \times n$ classes:
Algorithm 4 Enhanced Data-Level Perturbations
Input:
  F = {f_k^c}, k = 1, …, k; c = 1, …, (t−1)×n: boundary features retained from previous tasks
  D_train^t ∈ R^{B×C×H×W}: training data for current task T_t
  γ_data: perturbation ratio (e.g., 0.05)
Output:
  Adv_data^t: augmented training data
1:  For each class c in {1, 2, …, (t−1)×n}:
2:    Compute class mean vector μ_c ← (1/k) Σ_{i=1}^{k} f_i^c
3:    Compute class covariance matrix σ_c ← COV(f_1^c, f_2^c, …, f_k^c)
4:  end
5:  Compute global mean vector μ_all ← (1/((t−1)×n)) Σ_{i=1}^{(t−1)×n} μ_{c_i}
6:  Compute global covariance matrix σ_all ← (1/((t−1)×n)) Σ_{i=1}^{(t−1)×n} σ_{c_i}
7:  Define multivariate Gaussian distribution N(μ_all, σ_all)
8:  Sample Noise^t from N(μ_all, σ_all) with shape matching D_train^t
9:  Randomly select ⌊γ_data × |D_train^t|⌋ samples from D_train^t to form subset D_select^t
10: For each x ∈ D_select^t:
11:   Generate adversarial sample: x_adv ← x + Noise_i^t
12:   Add x and x_adv to Adv_data^t
13: end
14: Return Adv_data^t
$$ \mu_c = \frac{1}{k} \sum_{i=1}^{k} f_i^{c}, \qquad \sigma_c = COV\left( f_1^{c}, f_2^{c}, \ldots, f_k^{c} \right), $$
$$ \mu_{all} = \frac{1}{(t-1) \times n} \sum_{i=1}^{(t-1) \times n} \mu_{c_i}, \qquad \sigma_{all} = \frac{1}{(t-1) \times n} \sum_{i=1}^{(t-1) \times n} \sigma_{c_i}, $$
Based on the overall mean $\mu_{all}$ and covariance $\sigma_{all}$, we construct a multivariate Gaussian distribution that approximates the global distribution of previously encountered classes. From this distribution, we generate noise data $Noise^{t} \in \mathbb{R}^{C \times H \times W}$ for training task $T_t$. The generated noise $Noise^{t}$ is then combined with the new training data $D_{train}^{t}$ to synthesize adversarial samples $Adv_{data}^{t} \in \mathbb{R}^{C \times H \times W}$:
$$ Adv_{data}^{t} = D_{train}^{t} \oplus Noise^{t}, $$
where $\oplus$ denotes element-wise addition. The adversarial samples $Adv_{data}^{t}$ are fed into the model jointly with the original training data $D_{train}^{t}$. It is important to note that, in order to preserve the inherent distribution of $D_{train}^{t}$, perturbations are not applied to all samples. Instead, a fixed proportion of training samples is randomly selected for perturbation. Our ablation study (detailed in Section 4.4.2) demonstrates that perturbing 5% of the training samples yields a significant improvement in model performance.
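A compact NumPy sketch of Algorithm 4 is given below. It assumes the retained boundary features are grouped per class and that the noise dimensionality matches the channel dimension of the data it perturbs; how the feature-space statistics are mapped onto the input space is left abstract here, so the sketch should be read as an illustration of the sampling and mixing steps only.

import numpy as np

def data_level_perturbation(data, class_feats, gamma_data=0.05, rng=None):
    # data:        current-task training tensors, shape [B, C, H, W].
    # class_feats: dict {class_id: [k, C] retained boundary features f_bbox}.
    # Returns the original data plus adversarial copies of a randomly chosen subset.
    rng = np.random.default_rng() if rng is None else rng

    # Per-class mean/covariance, then global mean/covariance over all old classes.
    mus = np.stack([feats.mean(axis=0) for feats in class_feats.values()])
    covs = np.stack([np.cov(feats, rowvar=False) for feats in class_feats.values()])
    mu_all = mus.mean(axis=0)
    sigma_all = covs.mean(axis=0) + 1e-6 * np.eye(mus.shape[1])   # small ridge for stability

    B, C, H, W = data.shape
    n_sel = max(1, int(gamma_data * B))
    sel = rng.choice(B, size=n_sel, replace=False)

    # One noise vector per selected sample, broadcast across spatial positions.
    noise = rng.multivariate_normal(mu_all, sigma_all, size=n_sel)   # [n_sel, C]
    adv = data[sel] + noise[:, :, None, None]
    return np.concatenate([data, adv], axis=0)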

4. Experiments

4.1. Experiment Settings

This section provides a systematic description of the datasets, baseline models, evaluation metrics, and implementation details used in our experiments.

4.1.1. Datasets

As our proposed method is implemented on OW-RCNN [17], we adopt its dataset split strategy and evaluate our approach using two widely used object detection benchmarks: PASCAL VOC [60] and MS COCO [3]. Following the task split defined by the OWOD protocol, we divide the overall training process $T$ into four sequential tasks: $T_1$, $T_2$, $T_3$, and $T_4$. Specifically, all the object categories and corresponding data from PASCAL VOC are assigned to $T_1$. The remaining 60 categories in MS-COCO are evenly divided into three groups as tasks $T_2$, $T_3$, and $T_4$. At the end of each training task, we evaluate model performance using the PASCAL VOC test set and the MS-COCO validation set. The training data, testing data, and retained exemplar data for each task are summarized in Table 1. As detailed in Section 3.3, at the conclusion of every task $T_t$, we apply our sample selection strategy to retain 50 representative samples for each previously known category, in preparation for fine-tuning during the transition to the subsequent task $T_{t+1}$.
It is worth noting that only the initial task $T_1$ is trained in an offline manner, where the model is allowed multiple accesses to the training data $D_{train}^1$. For all subsequent incremental tasks $T_t$, we adopt an online learning protocol, in which the training data $D_{train}^t$ is only accessible once during the current task and is not revisited thereafter. Importantly, for a given training task $T_t$, all categories $\{C_i : i \leq t \times n\}$ are considered known classes, whereas categories $\{C_i : i > t \times n\}$ are treated as unknown. Among the known classes, those belonging to $\{C_i : i \leq (t-1) \times n\}$ are referred to as previously known classes, and those in the range $\{C_i : (t-1) \times n < i \leq t \times n\}$ are defined as currently known classes.
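For illustration, assuming $n = 20$ categories per task (the 20 PASCAL VOC classes in $T_1$ and three groups of 20 MS COCO classes), the known/unknown partition used at task $T_t$ can be expressed as the following sketch; the function name is a hypothetical helper, not part of any baseline codebase.

```python
def split_categories(t, n=20, total=80):
    """Partition category indices 1..total for task T_t under the OWOD protocol.
    Indices <= (t-1)*n are previously known, ((t-1)*n, t*n] are currently known,
    and indices > t*n are still unknown at this stage."""
    prev_known = list(range(1, (t - 1) * n + 1))
    curr_known = list(range((t - 1) * n + 1, t * n + 1))
    unknown = list(range(t * n + 1, total + 1))
    return prev_known, curr_known, unknown

# Example: at task T_2, classes 1-20 are previously known, 21-40 currently known,
# and 41-80 remain unknown.
prev, curr, unk = split_categories(t=2)
```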

4.1.2. Baseline and Metrics

Baseline: Since our proposed PSMP is a plug-and-play modular approach, we evaluate its effectiveness and generalizability by integrating it into three representative Open-World Object Detection methods: ORE [11], OW-RCNN [17], and OCPL [16]. These models serve as baselines in our comparative experiments, enabling a comprehensive assessment of PSMP’s performance across different methods.
Metrics: In line with our baselines, we adopt mean Average Precision (mAP) at the IoU threshold of 0.5 as the primary evaluation metric for known classes. At each task stage, we compute the mAP over all known categories to assess the model’s ability to retain detection performance throughout the incremental learning process. For unknown object evaluation, we follow the protocols established by OW-RCNN [17] and ORE [11], using Unknown Recall (UR) [61] to measure the model’s ability to correctly identify unknown categories. To further investigate the confusion between unknown and known categories, we report the Wilderness Impact (WI) [62], which assesses the overall degradation in detection performance caused by the presence of unknown objects in open-world scenarios. In addition, we employ the Absolute Open-Set Error (A-OSE) [63] to quantify the number of unknown objects that are incorrectly classified as known categories.
$\mathrm{Wilderness\ Impact}\ (WI) = \frac{P_k}{P_{k \cup U}} - 1,$
where $P_k$ denotes the precision evaluated over the known categories, while $P_{k \cup U}$ represents the overall precision of the model evaluated on both known and unknown categories.
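As a worked example of this metric, WI can be computed directly from the two precision values; note that, in prior OWOD evaluations, both precisions are typically measured at a fixed recall level (e.g., 0.8), which is an assumption not restated in the equation above.

```python
def wilderness_impact(precision_known: float, precision_known_and_unknown: float) -> float:
    """WI = P_k / P_{k∪U} - 1; lower is better (0 means unknown objects cause no degradation)."""
    return precision_known / precision_known_and_unknown - 1.0

# Example: if precision over known classes is 0.60 in isolation but drops to 0.55
# once unknown objects are present, then WI = 0.60 / 0.55 - 1 ≈ 0.091.
wi = wilderness_impact(0.60, 0.55)
```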

4.1.3. Training Process

We follow the OLOWOD protocol, in which the initial task $T_1$ is trained in an offline manner, while the subsequent tasks $T_2$, $T_3$, and $T_4$ are conducted in an online setting. The overall training pipeline is illustrated in Figure 7. We begin by training the model offline on task $T_1$, from which we extract boundary features $f_{bbox} \in \mathbb{R}^{C}$ to compute the corresponding category prototypes $f_{proto} \in \mathbb{R}^{C}$. Based on the distance-based sample selection strategy, we then select the most representative samples $S_{T_1}$ for each category.
Subsequently, for tasks T 2 , T 3 , and T 4 , we proceed with online incremental training. First, we utilize the boundary features f b b o x extracted from task T 1 in conjunction with the current training data D t r a i n 2 to compute data-level perturbations and generate adversarial samples. These adversarial samples are then combined with the original training data D t r a i n 2 and fed into the model for online training. During training, we apply semantic-level and feature-level perturbations to introduce disturbances at both the semantic and feature representation levels. Finally, we use the trained model to store exemplar samples for the current task and perform online model fine-tuning. This incremental training process is repeated iteratively until all tasks are completed.

4.1.4. Implementation Details

Our code implementation is based on Detectron2 and developed with PyTorch 2.1.2. To ensure fair comparison with baseline methods, we maintain identical backbone configurations and training hyperparameters across all models, incorporating our proposed perturbation strategies only within the designated modules. For data-level perturbation, adversarial samples are generated using only 5% of the training data. The intensities of the semantic-level and feature-level perturbations are set to 0.4 and 0.6, respectively. It is noteworthy that our perturbation strategies are applied exclusively during the training phase; no perturbations are included during the offline training of task T 1 or in the evaluation stage. All experiments are conducted on four NVIDIA RTX 3090 GPUs (24 GB each, manufactured by NVIDIA Corporation, San Jose, CA, USA), with a per-GPU batch size of 2.
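For reference, the perturbation-related hyperparameters stated above can be grouped into a single configuration sketch; the dictionary keys below are illustrative names, not Detectron2 config options.

```python
# Hypothetical grouping of the PSMP hyperparameters reported in this section
# (key names are illustrative; they are not exposed by Detectron2 itself).
PSMP_CFG = {
    "gamma_data": 0.05,        # fraction of training samples perturbed at the data level
    "gamma_sem": 0.4,          # semantic-level perturbation intensity
    "gamma_fea": 0.6,          # feature-level perturbation intensity (gamma_sem + gamma_fea = 1)
    "exemplars_per_class": 50, # representative samples retained per known category
    "batch_size_per_gpu": 2,
    "apply_during_inference": False,  # perturbations are used only at training time
}
```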

4.2. Results and Analysis

In this section, we systematically evaluate the overall performance of the proposed PSMP method in OLOWOD. To demonstrate the generality and effectiveness of our approach, we integrate PSMP into five representative baseline methods: ORE [11], OW-RCNN [17], UC-OWOD [30], OCPL [16], and RandBox [64]. We conduct both offline and online training experiments across multiple incremental tasks (from T 1 to T 4 ), and the results are shown in Table 2 and Table 3. “ORE*”, “OCPL*”, “UC-OWOD*”, “RandBox”, and “OW-RCNN” denote the offline versions of the corresponding methods. Specifically, the energy-based unknown recognition module in ORE [11] is not effective for OLOWOD scenarios; therefore, we remove this module for fair comparison, denoting the modified version as ORE*. For OCPL [16], we observe that its cosine-similarity-based classifier provides limited benefit—or even hinders performance—in the online setting, and thus we exclude it, referring to the modified implementation as “OCPL*”. The same applies to UC-OWOD [30], where we remove the SUC and UCR modules (UC-OWOD*) to better accommodate online training. The suffix “-OL” is used to distinguish online training. Accordingly, “ORE*-OL”, “OCPL*-OL”, “UC-OWOD*-OL”, “RandBox-OL”, and “OW-RCNN-OL” represent the online training versions of each method under the OLOWOD protocol, which serve as our second group of baselines.
It is important to note that a performance drop is commonly observed when the model transitions from T 1 to T 2 . This phenomenon arises from the inherent challenges of open-world learning, where newly introduced categories often suffer from insufficient representativeness at early stages. Such transient degradation is particularly pronounced when there exists a substantial semantic gap between consecutive tasks. This short-term instability reflects the model’s adaptation process to novel concepts and does not necessarily indicate a failure in learning but rather a characteristic feature of continual learning in open and evolving environments.
The results in rows 1–5 of Table 2 are obtained through offline training and can be considered performance upper bounds. In contrast, the results in rows 6–10 are obtained under the online training setting. As illustrated in rows 11–14 and 16 of Table 2, our proposed PSMP method significantly outperforms the online baselines “ORE*-OL”, “OCPL*-OL”, “UC-OWOD*-OL”, “RandBox-OL”, and “OW-RCNN-OL” across various evaluation metrics. Since the model for task T 1 is trained offline, the results for all methods in this stage are identical to their offline versions and therefore lack comparative value; for instance, “ORE*-OL” and PSMP (ORE*) are equivalent to “ORE*” during task T 1 . Accordingly, our analysis focuses on the incremental tasks T 2 , T 3 , and T 4 .
It is evident that, across these incremental tasks, models trained with PSMP achieve consistent improvements in mAP over their online counterparts, with gains ranging from approximately 1.5% to as much as 3%. The offline models (Rows 1–5) serve as an upper bound for reference. By comparing rows 1, 6, and 11, we can clearly observe that the PSMP(ORE*) yields notable improvements in mAP for each incremental task. Specifically, for tasks T 2 , T 3 , and T 4 , PSMP (ORE*) achieves mAP scores that are 2.76%, 1.91%, and 1.44% higher than those of ORE*-OL for the current known categories, respectively. While the original ORE* model exhibits limited capability in recognizing unknown categories, the PSMP (ORE*) nonetheless outperforms “ORE*-OL” in unknown recall (3.51% vs. 3.42%, 3.21% vs. 2.45%).
As an improvement over ORE [11], OCPL [16] is better suited for online learning tasks. This is evidenced by the fact that “OCPL*-OL” consistently outperforms “ORE*-OL” across all metrics. A direct comparison of rows 2, 7, and 12 in Table 2 further illustrates the comprehensive advantage of “OCPL*-OL” over “ORE*-OL”. OW-RCNN [17], which was proposed in the same period as OCPL, demonstrates a more balanced performance in both unknown category recognition and known category retention. For instance, in task T 4 , OW-RCNN achieves a mAP of 36.94% compared to 26.24% for “OCPL*”. In task T 3 , it obtains a U-Recall of 42.18% versus 12.28%. A comparison of rows 3, 8, and 16 in Table 2 reveals that the PSMP-enhanced OW-RCNN consistently surpasses “OW-RCNN-OL” across all metrics. Specifically, for tasks T 2 , T 3 , and T 4 , PSMP (OW-RCNN) improves the mAP of known categories by 2.41%, 5.79%, and 1.86%, respectively, and increases unknown recall by 4.62% and 11.23%. Moreover, when comparing rows 15 and 16 of Table 2, it can be observed that PSMP(OW-RCNN) outperforms BSDP(OW-RCNN) across all evaluation metrics. Notably, PSMP (OW-RCNN) achieves superior gains in both known category retention and unknown object recall, with performance improvements ranging from 2% to 5% over BSDP (OW-RCNN).
It is worth noting that our proposed PSMP method not only outperforms the original offline-trained models on previously known categories but also demonstrates significant improvements in recognizing unknown categories, whether integrated into “ORE*”, “OCPL*”, or OW-RCNN. For example, in the case of OW-RCNN, PSMP achieves 54.58% vs. 48.20%, 46.22% vs. 44.31%, and 41.27% vs. 39.82% across the incremental tasks, respectively. The mAP over both previously and currently known categories also improves consistently compared to the original OW-RCNN. These results indicate that PSMP can effectively enhance the model’s capability in identifying unknown categories while alleviating catastrophic forgetting.
In addition, we integrate PSMP into two other mainstream OWOD methods, “UC-OWOD*” and “RandBox”, to further demonstrate the generality and effectiveness of our approach. Experimental results show that PSMP consistently improves performance across both frameworks. For instance, when applied to “UC-OWOD*”, “PSMP(UC-OWOD*)” yields mAP gains on current classes of 1.38%, 0.49%, and 0.78% in tasks T 2 , T 3 , and T 4 , respectively, along with modest improvements in unknown recall. These results indicate that PSMP not only preserves the original strengths of the base model but also enhances its capacity to recognize novel categories.
Similarly, when applied to “RandBox”, PSMP achieves consistent improvements, with mAP increases ranging from 1.29% to 2.17% across incremental tasks and unknown recall gains of approximately 0.5% to 1%. These findings further validate the robustness and general applicability of PSMP, confirming its ability to enhance both the retention of known categories and the recognition of unknown objects in continual open-world learning settings.
Table 3 presents the Wilderness Impact (WI) and A-OSE scores for both the baselines and our proposed method under the “ORE*”, “OCPL*”, and OW-RCNN methods. The WI score quantifies how the detection of unknown categories affects the performance on known categories, and a lower WI indicates less impact. A-OSE reflects the number of unknown class instances that are mistakenly classified as known, where lower values are also desirable. As shown in the results, PSMP significantly improves the model’s ability to recognize unknown categories. For example, the WI score of PSMP(ORE*) is notably lower than that of “ORE*-OL” and is even close to the original “ORE*”, indicating minimal disruption to known category detection. A similar trend is observed for PSMP(OCPL*). Moreover, both the online-trained models and those trained with PSMP achieve lower A-OSE scores than the offline-trained baselines, and PSMP-based models consistently achieve the lowest A-OSE scores among all. This indirectly demonstrates the rationality and practicality of the OLOWOD protocol and further highlights the effectiveness of PSMP in enabling better unknown category discrimination. A comparison between rows 9 and 10 in Table 3 clearly shows that PSMP outperforms BSDP in enhancing unknown category recognition. Notably, PSMP achieves even lower WI scores than OW-RCNN itself, as evidenced by the results in rows 3, 6, and 10. This further indicates the strong potential of PSMP for open-world scenarios where distinguishing unknown categories is critical.
From the perspective of tasks, whether through offline training or online incremental learning, the model’s ability to recognize current task-specific categories tends to degrade as the number of tasks increases. This trend essentially reflects the inherent conflict between learning and forgetting under a limited “memory capacity”, much like how human memory becomes diluted when attempting to absorb increasing amounts of new knowledge. From a temporal standpoint, simultaneously recognizing unknown categories, maintaining accuracy on current tasks, and avoiding forgetting of previously learned content is a “have-it-all” capability that is extremely difficult to achieve in practice. Therefore, future research should no longer aim to store all knowledge indiscriminately but rather seek to compress the experience acquired during tasks into concise and stable representations. These representations can be gradually internalized as the model’s intuitive ability to perceive and understand the ever-changing world, much like how humans rely on long-term memory and intuition.

4.3. Incremental Object Detection

To more intuitively demonstrate the results of unknown object detection, we follow the evaluation protocol commonly adopted in the iOD [65] domain, as used in ORE [11] and OW-DETR [12], to evaluate our proposed PSMP method. Table 4 presents a comparison between PSMP and existing methods on PASCAL VOC. The incremental learning experiments are conducted in an offline manner, and we follow the experimental setup described in Ref. [11], where one group of classes (10, 5, and the final class) is incrementally introduced into a detector pre-trained on the remaining classes (10, 15, and 19, respectively).
It is worth noting that, apart from “OCPL*”, OW-RCNN, and PSMP, the results of the other methods shown in the table are taken directly from their original papers. As can be observed from Table 4, compared with traditional incremental learning methods such as ILOD [66] and Faster ILOD [67], our proposed PSMP achieves superior performance under all three settings, which is consistent with our discussion in Section 4.2. Specifically, when comparing OW-RCNN and PSMP (OW-RCNN), the performance of the model improves by 4.0%, 4.3%, and 6.6% across the three experimental settings, respectively. Furthermore, in terms of unknown category detection, PSMP (OW-RCNN) significantly outperforms OW-RCNN.

4.4. Ablation Study

In this section, we conduct ablation studies on each component of the proposed PSMP module, the perturbation intensity, memory, training/inference time, and other factors. All experiments are based on the data split introduced in Section 4.1.1. Among “ORE*”, “OCPL*”, and OW-RCNN, we select OW-RCNN, which achieves the best performance, as the detector for our experiments. “OW-RCNN-OL” is used as the baseline for all ablation comparisons.

4.4.1. Ablation Study of the PSMP Module

In this section, we conduct ablation studies on each component of the proposed PSMP module to better demonstrate their effectiveness. The results are shown in Table 5. In this table, “OW-RCNN-OL” serves as the baseline model, which uses its original sample selection strategy for incremental online learning. S p r o t o refers to the model composed of the baseline and our prototype-distance-based sample selection strategy (as detailed in Section 3.3.2), which replaces the baseline’s original sample selection strategy. P d a t a denotes our enhanced data-level perturbation module, which generates adversarial samples based on the distribution of old category prototypes, aiming to improve the robustness of the model and guide the model to learn discriminative features of new categories without forgetting the old ones. P s e m represents our semantic perturbation module, which constructs contrastive tension among old category prototypes to generate structured inter-class semantic priors. This module attempts to mitigate catastrophic forgetting of old classes on the semantic level and improves the model’s ability to recognize unknown categories to a certain extent. P f e a indicates our enhanced feature-level perturbation module, which perturbs new category features using old category prototypes to preserve previously learned knowledge. “All” refers to the combination of our proposed P d a t a , P s e m , and P f e a modules.
From the results presented in rows 1 and 2 of Table 5, it can be observed that employing our prototype-distance-based sample selection strategy leads to a notable improvement in the model’s performance on “previous” categories (e.g., 47.35% vs. 46.98% for T 2 , 43.06% vs. 42.81% for T 3 , and 39.27% vs. 38.76% for T 4 ). Meanwhile, performance in the “both” categories remains largely consistent with that of the baseline. However, a slight decrease is noted in the “current” categories, which can be attributed to the absence of any perturbation mechanisms applied to the S p r o t o component. This observation, in turn, validates the representativeness of the samples selected by our sample selection strategy.
Comparing the results in rows 2 and 4 of Table 5, we find that the structured semantic perturbation introduced by our semantic perturbation module significantly improves the mAP of all categories (for example, for task T 2 : “previous” categories, 52.77% vs. 47.35%; “current” categories, 41.07% vs. 39.74%; “both” categories, 45.11% vs. 44.47%). This effectively demonstrates the value of the semantic perturbation generated by the contrastive tension constructed from the old category prototypes, which allows the model to learn new features while retaining old knowledge.
The results presented in rows 2 and 3 of the table demonstrate that our enhanced feature perturbation module can effectively leverage the knowledge embedded in the prototypes of old classes to improve the mAP of previous categories and thereby avoid catastrophic forgetting. However, it is evident that the feature perturbation $P_{fea}$ alone tends to degrade the model’s ability to recognize known classes as a whole (“both” categories: task T 2 , −0.44%; task T 3 , −0.42%; task T 4 , −0.28%). This finding indirectly substantiates that an excessive focus on the shared features of old categories impairs the model’s capacity to learn new categories, thereby underscoring the rationale behind our proposed semantic perturbation module’s emphasis on inter-class distinctions.
Comparing the results in rows 2 and 5 reveals that our enhanced data perturbation module, by exploiting adversarial samples, substantially improves the performance on previous classes but has a detrimental effect on both the “current” and “both” categories. Furthermore, an examination of rows 3, 4, and 6 indicates that the combined use of semantic and feature perturbations consistently outperforms their individual application across all mAP metrics. This combined approach not only facilitates learning of new categories but also effectively discriminates unknown categories. Specifically, for tasks T 2 , T 3 , and T 4 , the mAP of the “both” categories with $S_{proto} + P_{sem} + P_{fea}$ surpasses those of $S_{proto} + P_{sem}$ and $S_{proto} + P_{fea}$.
Taken together with the results in rows 2 and 7, it is clear that compared to S p r o t o , our proposed PSMP framework yields a significant improvement in all mAP metrics, particularly for “previous” categories and “both” categories. These findings unequivocally demonstrate that our method markedly enhances mAP for both old and new categories, effectively mitigating catastrophic forgetting.
As illustrated in Figure 8, we visualize the semantic boundaries learned by the model before and after applying semantic-level perturbation in order to evaluate the effectiveness of our proposed contrastive tension mechanism. Specifically, we sample a subset of ten categories during training phase T 3 . The first six categories (“person”, “car”, “cat”, “dog”, “parking meter”, “train”) are selected from task T 1 and represent previous categories, the next three (“knife”, “carrot”, “cake”) are drawn from task T 3 as current categories, and the last one (“remote”), from task T 4 , represents the unknown category.
Figure 8a shows the category distribution of “OW-RCNN-OL” without semantic perturbation, whereas Figure 8b presents the results after applying our semantic perturbation module ( S p r o t o + P s e m ). A clear improvement can be observed: both the previous and current categories exhibit more compact intra-class clustering and better inter-class separation. Notably, the current categories benefit more than the previous ones, which aligns with the nature of incremental learning.
Prior to perturbation, semantically similar classes such as “cat”, “dog”, and “cake” are not well distinguished. However, the introduction of contrastive tension encourages the model to focus on inter-class discrepancies, driving the representation space toward enhanced intra-class compactness and inter-class sparsity—clearly reflected in Figure 8b. As for the unknown class, which the model has never encountered, it is inherently difficult to define sharp boundaries. Rather than directly modeling the unknown class, the model—guided by contrastive tension—focuses on refining the decision boundaries among known categories, which in turn leads to a clearer separation from the unknown. This effect is evident in Figure 8b, where the category “parking meter” is more distinctly separated from the unknown class “remote”.

4.4.2. Ablation Study of Perturbation Intensity

In this section, we conduct ablation studies on the perturbation intensity of the three proposed strategies. It is worth noting that both semantic and enhanced feature perturbations are applied within the internal mechanisms of the training process. Specifically, the semantic perturbation emphasizes inter-class discrepancies among prototypes of old categories, while the feature perturbation targets intra-class consistency by enhancing similar feature representations within categories. Given their complementary objectives and the coupled nature of their mechanisms, we design the experiments such that the sum of their perturbation intensities remains fixed at 1 (as defined in Equation (20)). This balanced regulation allows us to systematically investigate the impact of their interplay on overall model performance.
$\gamma_{sem} + \gamma_{fea} = 1,$
where $\gamma_{sem}$ denotes the intensity of the semantic-level perturbation, while $\gamma_{fea}$ represents the intensity of the enhanced feature-level perturbation.
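As a minimal sketch of how the two coupled intensities might be applied to a feature map, consider the snippet below. The additive combination is an illustrative assumption for this example; the exact way PSMP injects the two perturbations is described in Section 3.

```python
import torch

def mix_perturbations(features, sem_delta, fea_delta, gamma_sem=0.4, gamma_fea=0.6):
    """Apply semantic- and feature-level perturbations with coupled intensities.
    gamma_sem + gamma_fea is kept fixed at 1 (Equation (20)); the additive
    combination below is a simplifying assumption for illustration."""
    assert abs(gamma_sem + gamma_fea - 1.0) < 1e-6
    return features + gamma_sem * sem_delta + gamma_fea * fea_delta

# Example with dummy tensors: a batch of 2 feature maps with 256 channels.
f = torch.randn(2, 256, 7, 7)
perturbed = mix_perturbations(f, torch.randn_like(f), torch.randn_like(f))
```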
As illustrated in Figure 9, we present the effect of varying γ s e m and γ f e a on the detection performance—measured in terms of mAP of “previous”, “current”, and “both” categories in the absence of the enhanced data-level perturbation module. Here, α denotes γ s e m and β denotes γ f e a . Notably, the perturbation setting “α-0.4 β-0.6” yields the most favorable overall performance. Although this setting results in a slightly lower mAP for “previous” categories compared to “α-0.2 β-0.8”, it significantly outperforms the latter in terms of mAP for both the “current” and “both” categories. Furthermore, the “α-0.4 β-0.6” configuration achieves the highest mAP in “current” and “both” categories in tasks T 2 and T 3 .
It can be observed that the “α-0.2 β-0.8” achieves the highest mAP for “previous” categories, primarily due to an excessively large feature-level perturbation intensity γ f e a , as shown in Figure 9a. This overemphasis causes the model to disproportionately focus on “previous” categories while neglecting both “current” and “both” categories. In particular, the model places undue importance on similarity to the old category prototypes, at the expense of learning the “contrastive tension” between different prototypes. This observation aligns with the findings from the ablation study presented in Section 4.4.1.
Moreover, in Figure 9b, the mAP for “current” classes under “α-0.2 β-0.8” is consistently 2–4% lower across all tasks compared to under “α-0.4 β-0.6”. As shown in Figure 9c, the performance of “α-0.2 β-0.8” for “both” categories is clearly the worst among all settings.
This phenomenon indirectly validates that, along the temporal dimension, if the model persistently concentrates on representations of old categories without actively learning the distinctions between old and new categories, it will gradually develop a bias as the number of tasks and categories increases. Such bias may lead to overfitting to extreme (i.e., heavily overrepresented or heavily underrepresented) categories, ultimately resulting in a significant degradation of detection performance across all known categories.
Conversely, when the model places excessive emphasis on inter-class discrimination while neglecting the retention and transfer of prior knowledge, it may achieve improved recognition performance for current categories (as observed in Figure 9b for “α-0.6 β-0.4” and “α-0.8 β-0.2”). However, this comes at the cost of a substantial drop in overall detection performance across both old and new categories. Such outcomes reflect not true superiority but rather a fragile stability derived from overly aggressive differentiation. These findings underscore that strategies focusing solely on enhancing inter-class separability are insufficient to support robust incremental learning in open-world scenarios. Instead, a careful trade-off must be achieved between inter-class distinction and knowledge preservation.
Accordingly, the setting with semantic-level perturbation intensity γ s e m set to 0.4 and feature-level perturbation intensity γ f e a set to 0.6 yields the best overall performance. This setting not only establishes a well-calibrated balance between old and new categories—effectively mitigating catastrophic forgetting—but also demonstrates superior capability in unknown category discovery (as detailed in Section 4.2). These results validate the pivotal role of synergistic semantic and feature perturbations in enhancing OLOWOD.
Based on the preceding experiments, we fixed the γ s e m at 0.4 and the γ f e a at 0.6, and further conducted an ablation study to examine the impact of perturbation intensity in our proposed enhanced data-level perturbation module, where γ denotes enhanced data-level perturbation intensity γ d a t a . It is important to note that this module operates by introducing adversarial noise into the training data, guided by the semantic information encoded in the prototypes of previously learned categories. Furthermore, γ d a t a acts directly at the data level, potentially interacting or conflicting with the internal mechanisms influenced by semantic and feature-level perturbations. To more clearly isolate and evaluate the independent effect of data-level perturbation, we conducted controlled experiments by varying its intensity while holding γ s e m and γ f e a constant. This design allowed us to assess the contribution of enhanced data-level perturbations under different magnitudes and to determine their role in enhancing model performance.
As illustrated in Figure 10, we conducted an ablation study on the proposed enhanced data-level perturbation module by varying the perturbation intensity γ (set to 1%, 5%, 50%, and 100%) to assess its influence on model performance. In each incremental task T t t 2 , γ% of the training samples from D t r a i n t were randomly selected to be perturbed, thereby generating adversarial samples. The results show that when γ is set to 5%, the model exhibits the most pronounced improvement on the “previous” categories, with mAP gains of approximately 2-4%, while the performance impact on the “current” and “both” categories remains marginal (with mAP decreases controlled within ~1%). However, as γ increases to 50% and 100%, the model’s performance on “previous” categories shows only slight additional gains, as shown in Figure 10a, whereas the performance on “current” and “both” categories degrades significantly—falling below even that of the γ = 1% setting, as shown in Figure 10b,c.
These findings indicate that a carefully calibrated perturbation intensity (specifically γ = 5%) achieves a favorable trade-off between enhancing memory of previous categories and maintaining overall detection performance. In contrast, excessive perturbation severely distorts the original data distribution, impeding the model’s ability to learn new category features and ultimately leading to overall performance degradation. Thus, the intensity of enhanced data-level perturbation must strike a careful balance between enhancing memory of prior knowledge and preserving the integrity of the training data distribution.

4.4.3. Ablation Study of PSMP’s Performance

In this section, we conduct a quantitative analysis of the memory usage and training time of the proposed PSMP module to assess its lightweight design, practicality, and deployment feasibility. The experimental settings follow those described in Section 4.1.4, with “OW-RCNN-OL” adopted as the baseline.
As shown in Table 6, PSMP increases the total training time by only ~0.9 h compared to OW-RCNN-OL, which corresponds to approximately 20 additional minutes per task stage on average. During inference, the frames per second (FPS) drops moderately from 12.3 to 10.6, a reduction of approximately 14%. Detailed runtime statistics across task stages are presented in Figure 11.
Table 7 provides a breakdown of the resource consumption for each component within PSMP, including memory usage, training time, and inference speed. The observations are summarized as follows:
  • The prototype computation module introduces minimal overhead, with only ~1 GB increase in memory and 4 additional minutes per training stage.
  • Semantic-level and data-level perturbations consume similar amounts of memory, and each adds approximately 7 min to the training time. However, semantic perturbation leads to the most noticeable drop in training FPS.
  • Feature-level perturbation incurs the least memory overhead and increases training time by only 2 min, with a slight decrease in FPS.
  • As no perturbations are applied during inference, PSMP and OW-RCNN-OL exhibit identical inference-time performance metrics, which are therefore omitted here.
Table 7. Comparison of training metrics across PSMP modules. GPU Memory denotes the memory consumption per GPU, while Per-Stage Time refers to the average training time per task stage.

PSMP(OW-RCNN) | GPU Memory | Per-Stage Time | Training FPS
$S_{proto}$ | +1.0 GB | +4 min | 12.0
$P_{fea}$ | +0.7 GB | +4 min | 11.2
$P_{data}$ | +1.4 GB | +7 min | 11.6
$P_{sem}$ | +1.3 GB | +7 min | 10.8
Total | +4.2 GB | +20 min | 10.6
In summary, PSMP significantly enhances the model’s performance in Open-World Object Detection while introducing only marginal computational and memory overhead. All additional training costs remain within acceptable limits, and inference-time efficiency is largely preserved, demonstrating the method’s practicality for real-world deployment. Moreover, the three perturbation modules strike a favorable balance between effectiveness and resource efficiency, further validating the lightweight, practical, and adaptable nature of PSMP. These results offer a concrete and scalable design paradigm for future work on open-world detection frameworks.

4.5. Visualization

To provide a more intuitive understanding of the detection performance of our proposed PSMP approach across different task stages, we visualized a subset of the test results. As shown in Figure 12, we present detection results of the PSMP (OW-RCNN) model on four representative image groups after training on tasks T 1 and T 4 . The first and third columns correspond to detection results after training on task T 1 , while the second and fourth columns show results after training on task T 4 .
It is clearly observed that the categories that are unknown in task T 1 can be correctly detected and recognized by the model in task T 4 —for instance, the sandwich, cup, and spoon in Figure 12a; the stop sign in Figure 12b; the surfboard in Figure 12c; and the bear in Figure 12d. In addition, PSMP effectively mitigates catastrophic forgetting. For instance, the detection confidence for previously known categories from task T 1 improves after training on task T 4 : “dining table” in Figure 12a increases from 85.7% to 93.0%, and “car” and “person” in Figure 12b increase from 89.19% to 90.84% and from 94.83% to 97.38%, respectively.
These comparisons demonstrate that PSMP enables the model to retain strong detection performance for known categories across multiple tasks while simultaneously reducing false detections of unknown objects. This provides further evidence of the robustness and effectiveness of our method in OLOWOD.
In addition, we further visualized and compared the detection results for task T 4 between several baseline models and their corresponding PSMP-enhanced versions, as shown in Figure 13. The first and third columns present the outputs of the baseline methods “OCPL*” and OW-RCNN, while the second and fourth columns show the results of PSMP(OCPL*) and PSMP(OW-RCNN), respectively. By comparing columns 1 and 2, as well as columns 3 and 4, it is evident that our PSMP approach not only improves the confidence scores for known category detections but also significantly reduces the number of false positives caused by misclassifying unknown categories as known ones, thereby lowering the overall false detection rate.
This is particularly noticeable in the second row, where the comparison between the first and second columns illustrates that PSMP helps the model detect a greater number of known-category objects. For instance, PSMP(OCPL*) successfully detects the “stop sign” in the upper right corner, which was completely missed by the original “OCPL*” method.
Overall, PSMP(OW-RCNN) achieves superior detection performance across multiple examples, demonstrating higher accuracy and stronger generalization. It enables more complete and reliable identification of all known-category objects.

5. Discussion

The proposed PSMP is a lightweight and plug-and-play perturbation enhancement mechanism that demonstrates strong generalizability and transferability in Open-World Object Detection (OWOD), particularly in online incremental learning (OLOWOD) scenarios. PSMP can be seamlessly integrated into mainstream detection frameworks such as OW-RCNN, OCPL, and ORE, without requiring additional labels or architectural modifications, making it highly practical for real-world deployment across diverse tasks. Consequently, PSMP holds promise for a wide range of applications in domains such as edge computing, intelligent surveillance, and autonomous driving, where continuous learning and unknown object detection are critical. For instance, in intelligent video surveillance, the system must identify emerging threats from continuous video streams while maintaining robust recognition of known targets—an ideal setting for PSMP.
Despite its promising performance gains in OWOD, PSMP still faces several limitations. First, when categories have ambiguous semantic boundaries or overlapping feature distributions, the prototype-guided adversarial perturbations may lose directional guidance and even mislead the model to update along incorrect decision boundaries, resulting in misclassification of known categories. For example, in tasks T 3 and T 4 , we observed confusion between semantically similar classes such as “sofa” and “bed”, indicating that the current perturbation mechanism remains limited in handling fine-grained semantic ambiguities.
Second, although PSMP constructs discriminative prototypes and employs them to guide multi-level perturbations during training, these prototypes are not involved in the inference stage. This design choice limits the model’s ability to leverage prototype-based cues when encountering unknown categories—especially in scenarios with blurred feature distributions and unclear semantic boundaries, where the lack of prototype-driven support can hinder generalization.
Future work will proceed along two directions. First, we aim to strengthen the modeling of ambiguous boundary regions during training by incorporating uncertainty estimation or local clustering strategies to enhance the model’s sensitivity to semantic transitions. Second, we plan to explore combining the prototype mechanism with memory-augmented approaches—such as episodic memory or retrieval-based memory—to better preserve and utilize prior knowledge in incremental learning. Additionally, we will consider incorporating prototype-based post-processing during inference, such as semantic alignment or similarity-based scoring, to further improve the model’s generalization capacity in open-world scenarios.

6. Conclusions

In this paper, we propose PSMP, a plug-and-play approach for OLOWOD. Centered around the concept of “perturbation”, PSMP introduces a set of independent yet collaboratively designed modules at three levels (semantic, feature, and data), guided uniformly by category prototypes. These modules collectively contribute to model performance throughout different task stages. Without requiring additional supervision, PSMP significantly enhances the model’s ability to distinguish between old and new categories, effectively alleviating catastrophic forgetting across multiple incremental tasks, while also improving the detection of unknown categories. Extensive ablation studies demonstrate that a well-balanced configuration of semantic and feature perturbations helps maintain the performance trade-off between old and new categories, and that moderate data-level perturbation improves the model’s generalization without disrupting the underlying data distribution. For clarity, the definitions of the abbreviations used in the figures are listed in Table 8.
Compared with existing methods, PSMP not only achieves better detection performance on known categories but also exhibits heightened sensitivity and accuracy in recognizing unknown ones. The method is lightweight, highly compatible, and easily integrable into existing detection frameworks, showing promising scalability and practicality. Overall, PSMP provides a more stable and robust solution for online incremental learning in open-world scenarios, offering both theoretical value and application potential.

Author Contributions

Conceptualization: S.G. and Z.C. Methodology: S.G. Formal analysis: S.G. Software: S.G. and M.S. Validation: S.G. and M.S. Writing—original draft preparation: S.G. Writing—review and editing: S.G. and Y.B. Data curation: Z.Z. Investigation: S.G. Visualization: Y.B. Supervision: Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data and materials utilized in this research can be obtained by requesting them from the corresponding author.

Acknowledgments

During the research and writing of this paper, I was fortunate to receive generous guidance and support from many teachers and fellow students, to whom I express my sincere gratitude. I would also like to extend heartfelt thanks to the editors and reviewers for their valuable efforts and contributions.

Conflicts of Interest

Author Yuhao Bai is employed by the company China Electronics Technology Group Corporation’s 22nd Research Institute (Qingdao). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  2. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  3. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  8. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 2. [Google Scholar]
  10. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  11. Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5830–5840. [Google Scholar]
  12. Gupta, A.; Narayan, S.; Joseph, K.; Khan, S.; Khan, F.S.; Shah, M. Ow-detr: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 20 June 2019; pp. 9235–9244. [Google Scholar]
  13. Zhao, X.; Liu, X.; Shen, Y.; Qiao, Y.; Ma, Y.; Wang, D. Revisiting Open World Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 3496–3509. [Google Scholar] [CrossRef]
  14. Zohar, O.; Wang, K.-C.; Yeung, S. PROB: Probabilistic Objectness for Open World Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 24 June 2023; pp. 11444–11453. [Google Scholar]
  15. Ma, S.; Wang, Y.; Wei, Y.; Fan, J.; Li, T.H.; Liu, H.; Lv, F. CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 24 June 2023; pp. 19681–19690. [Google Scholar]
  16. Yu, J.; Ma, L.; Li, Z.; Peng, Y.; Xie, S. Open-world object detection via discriminative class prototype learning. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022. [Google Scholar]
  17. Pershouse, D.; Dayoub, F.; Miller, D.; Sünderhauf, N. Addressing the challenges of open-world object detection. arXiv 2023, arXiv:2303.14930. [Google Scholar] [CrossRef]
  18. Chen, Y.; Ma, L.; Jing, L.; Yu, J. BSDP: Brain-inspired Streaming Dual-level Perturbations for Online Open World Object Detection. Pattern Recognit. 2024, 152, 110472. [Google Scholar] [CrossRef]
  19. Van der Groen, O.; Potok, W.; Wenderoth, N.; Edwards, G.; Mattingley, J.B.; Edwards, D. Using noise for the better: The effects of transcranial random noise stimulation on the brain and behavior. Neurosci. Biobehav. Rev. 2022, 138, 104702. [Google Scholar] [CrossRef] [PubMed]
  20. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  21. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  22. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14454–14463. [Google Scholar]
  23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  24. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
  25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  26. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12179–12188. [Google Scholar]
  27. Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 6824–6835. [Google Scholar]
  28. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
  30. Wu, Z.; Lu, Y.; Chen, X.; Wu, Z.; Kang, L.; Yu, J. UC-OWOD: Unknown-Classified Open World Object Detection; Springer Nature: Cham, Switzerland; pp. 193–210.
  31. Shaheen, K.; Hanif, M.A.; Hasan, O.; Shafique, M. A Framework for Open World Object Detection. Artif. Intell. Evol. 2023, 4, 154–164. [Google Scholar] [CrossRef]
  32. Li, Y.; Wang, Y.; Wang, W.; Lin, D.; Li, B.; Yap, K. Open World Object Detection: A Survey. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 988–1008. [Google Scholar] [CrossRef]
  33. Mullappilly, S.S.; Gehlot, A.S.; Anwer, R.M.; Khan, F.S.; Cholakkal, H. Semi-supervised Open-World Object Detection. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI)/36th Conference on Innovative Applications of Artificial Intelligence/14th Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 4305–4314. [Google Scholar]
  34. Ma, Z.; Zheng, Z.; Wei, J.; Yang, Y.; Shen, H.T. Instance-Dictionary Learning for Open-World Object Detection in Autonomous Driving Scenarios. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3395–3408. [Google Scholar] [CrossRef]
  35. He, Y.; Chen, W.; Wang, S.; Liu, T.; Wang, M. Recalling Unknowns Without Losing Precision: An Effective Solution to Large Model-Guided Open World Object Detection. IEEE Trans. Image Process. 2025, 34, 729–742. [Google Scholar] [CrossRef] [PubMed]
  36. Xue, W.; Xu, G.; Yang, N.; Liu, J. Enhancing open-world object detection with AIGC-generated datasets and elastic weight consolidation. J. Supercomput. 2025, 81, 417. [Google Scholar] [CrossRef]
  37. Fang, R.; Pang, G.; Miao, W.; Bai, X.; Zheng, J.; Ning, X. Unsupervised Recognition of Unknown Objects for Open-World Object Detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11340–11354. [Google Scholar] [CrossRef] [PubMed]
  38. Zhao, R.; Wang, J.; Chen, Y.; Zheng, Z.; Cui, K.; Su, J. Class-Agnostic Detection of Unknown Objects from Foreground Improves Robust Open World Object Detection. In Proceedings of the 7th Chinese Conference on Pattern Recognition and Computer Vision, Urumqi, China, 18–20 October 2024; pp. 78–92. [Google Scholar]
  39. Jamonnak, S.; Guo, J.; He, W.; Gou, L.; Ren, L. OW-Adapter: Human-Assisted Open-World Object Detection with a Few Examples. IEEE Trans. Vis. Comput. Graph. 2024, 30, 694–704. [Google Scholar] [CrossRef] [PubMed]
  40. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Hassabis, D.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  41. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  42. Castro, F.M.; Marín-Jiménez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 233–248. [Google Scholar]
  43. Dong, N.; Zhang, Y.; Ding, M.; Bai, Y. Class-incremental object detection. Pattern Recognit. 2023, 139, 109488. [Google Scholar] [CrossRef]
  44. Sun, W.; Li, Q.; Zhang, J.; Wang, D.; Wang, W.; Geng, Y. Exemplar-free class incremental learning via discriminative and comparable parallel one-class classifiers. Pattern Recognit. 2023, 140, 109561. [Google Scholar] [CrossRef]
  45. Sun, Q.; Lyu, F.; Shang, F.; Feng, W.; Wan, L. Exploring example influence in continual learning. Adv. Neural Inf. Process. Syst. 2022, 35, 27075–27086. [Google Scholar]
  46. Tiwari, R.; Killamsetty, K.; Iyer, R.; Shenoy, P. Gcr: Gradient coreset based replay buffer selection for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 99–108. [Google Scholar]
  47. Riemer, M.; Cases, I.; Ajemian, R.; Liu, M.; Rish, I.; Tu, Y.; Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv 2018, arXiv:1810.11910. [Google Scholar]
  48. Zhuang, C.; Huang, S.; Cheng, G.; Ning, J. Multi-criteria selection of rehearsal samples for continual learning. Pattern Recognit. 2022, 132, 108907. [Google Scholar] [CrossRef]
  49. Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2001–2010. [Google Scholar]
  50. Dong, J.; Liang, W.; Cong, Y.; Sun, G. Heterogeneous forgetting compensation for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seattle, WA, USA, 16–22 June 2024; pp. 11742–11751. [Google Scholar]
  51. Yang, D.; Zhou, Y.; Zhang, A.; Sun, X.; Wu, D.; Wang, W.; Ye, Q. Multi-View correlation distillation for incremental object detection. Pattern Recognit. 2022, 131, 108863. [Google Scholar] [CrossRef]
  52. Wang, J.; Wang, X.; Shang-Guan, Y.; Gupta, A. Wanderlust: Online Continual Object Detection in the Real World. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10809–10818. [Google Scholar]
  53. Li, M.; Yan, Z.; Li, C. Class Incremental Learning with Important and Diverse Memory; Springer Nature: Cham, Switzerland; pp. 164–175.
  54. Nokhwal, S.; Kumar, N. DSS: A Diverse Sample Selection Method to Preserve Knowledge in Class-Incremental Learning. In Proceedings of the 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI), Mexico City, Mexico, 25–26 November 2023; pp. 178–182. [Google Scholar]
  55. Zeng, L.; Chen, X.; Shi, X.; Shen, H.T. Feature Noise Boosts DNN Generalization Under Label Noise. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 7711–7724. [Google Scholar] [CrossRef] [PubMed]
  56. Dhifallah, O.; Lu, Y. On the inherent regularization effects of noise injection during training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 2665–2675. [Google Scholar]
  57. Yuan, X.; Li, J.; Kuruoglu, E.E. Robustness enhancement in neural networks with alpha-stable training noise. Digit. Signal Process. 2025, 156, 104778. [Google Scholar] [CrossRef]
  58. Kim, H.-E.; Hwang, S.; Cho, K. Semantic Noise Modeling for Better Representation Learning. arXiv 2016, arXiv:1611.01268. [Google Scholar] [CrossRef]
  59. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  60. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  61. Bansal, A.; Sikka, K.; Sharma, G.; Chellappa, R.; Divakaran, A. Zero-Shot Object Detection. In Computer Vision—ECCV 2018; Springer Nature: Cham, Switzerland, 2018; pp. 397–414. [Google Scholar]
  62. Dhamija, A.R.; Günther, M.; Ventura, J.; Boult, T.E. The Overlooked Elephant of Object Detection: Open Set. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1010–1019. [Google Scholar]
  63. Miller, D.; Zhou, Z.; Bambos, N.; Ben-Gal, I. Optimal Sensing for Patient Health Monitoring. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; pp. 1–7. [Google Scholar]
  64. Wang, Y.; Yue, Z.; Hua, X.-S.; Zhang, H. Random boxes are open-world object detectors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6233–6243. [Google Scholar]
  65. Joseph, K.J.; Rajasegaran, J.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Incremental Object Detection via Meta-Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9209–9216. [Google Scholar] [CrossRef] [PubMed]
  66. Shmelkov, K.; Schmid, C.; Alahari, K. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3400–3409. [Google Scholar]
  67. Peng, C.; Zhao, K.; Lovell, B.C. Faster ILOD: Incremental learning for object detectors based on faster RCNN. Pattern Recognit. Lett. 2020, 140, 109–115. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed PSMP architecture. The symbol “ ” denotes weighted addition, “ ” represents matrix multiplication, “ ” indicates similarity computation, “ P ” is the abbreviation for prototype, and “ C T ” stands for contrastive tension.
Figure 2. Description of the enhanced calculation of prototypes for PSMP and multi-level perturbations. GAP and GMP denote the global average pooling and global max pooling operations, respectively. The symbol “ ” denotes weighted addition, C is the number of channels, and B is the number of samples per category, which may vary between categories. The “mean” indicates the averaging process, while P denotes the resulting prototype obtained through these computations.
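For readers who prefer code, the following minimal sketch illustrates one plausible reading of the prototype computation in Figure 2: each sample's feature map is summarized by a weighted addition of GAP and GMP, and the per-sample descriptors are averaged over the B samples of the category. The function name, tensor shapes, and the mixing weight alpha are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def category_prototype(feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Compute one category prototype from B feature maps of shape (B, C, H, W).

    Each sample is summarized by a weighted addition of global average pooling (GAP)
    and global max pooling (GMP); the prototype is the mean over the B samples of the
    category. The mixing weight `alpha` is a hypothetical parameter, not a value
    taken from the paper.
    """
    gap = F.adaptive_avg_pool2d(feats, 1).flatten(1)   # (B, C) GAP descriptor
    gmp = F.adaptive_max_pool2d(feats, 1).flatten(1)   # (B, C) GMP descriptor
    per_sample = alpha * gap + (1.0 - alpha) * gmp     # weighted addition per sample
    return per_sample.mean(dim=0)                      # (C,) category prototype
```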
Figure 3. An illustrative explanation of semantic-level perturbation. Each shape represents a known category, while the same shape in different colors denotes the same category across different task stages. The dashed lines indicate the decision boundaries for each category. The semantic-level perturbation can be intuitively understood as applying a directional perturbation, oriented toward the intra-class centroid, to the features of each category, guiding samples that drift beyond the decision boundary back toward the correct classification region.
Figure 4. Description of the semantic-level perturbation of PSMP. L denotes the number of prototypes of known categories stored from tasks T_1 to T_{t-1}, v represents the top-D principal components extracted via PCA whose cumulative explained variance exceeds 90%, and F indicates the projection of the input data in the PCA space. The symbol “ ” denotes weighted addition, “ ” represents matrix multiplication, and “ ” denotes pairwise subtraction.
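A rough sketch of the PCA step described above, assuming the stored prototypes are stacked into an (L, C) matrix and scikit-learn's PCA is used to retain the components covering 90% of the variance. The way the projection is turned back into a perturbation (and the intensity gamma_sem, cf. Figure 9) is a simplified placeholder rather than the exact PSMP formulation.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_semantic_subspace(prototypes: np.ndarray, var_ratio: float = 0.90) -> PCA:
    """Fit PCA on the L stored prototypes (shape (L, C)) and keep the top-D
    components whose cumulative explained variance exceeds `var_ratio`."""
    pca = PCA(n_components=var_ratio, svd_solver="full")
    pca.fit(prototypes)
    return pca  # pca.components_ has shape (D, C)

def semantic_perturbation(features: np.ndarray, pca: PCA, gamma_sem: float = 0.4) -> np.ndarray:
    """Project features (N, C) into the PCA space (F in the figure) and pull each
    feature toward its reconstruction in the principal subspace. The additive form
    and the intensity gamma_sem are illustrative assumptions."""
    proj = pca.transform(features)          # F: projection in the PCA space, (N, D)
    recon = pca.inverse_transform(proj)     # back-projection into feature space
    return features + gamma_sem * (recon - features)
```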
Figure 5. Description of the enhanced feature-level perturbation of our PSMP. The symbol “ ” is similarity calculation, the symbol “ ” denotes weighted addition, and W is the weight vector obtained after the similarity calculation.
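A minimal sketch consistent with this caption: the current features are compared against the stored prototypes, and the resulting weight vector W decides how much of each prototype is mixed back into the features. The cosine-similarity choice, the softmax normalization, and the intensity gamma_fea are assumptions for illustration, not the exact PSMP rule.

```python
import torch
import torch.nn.functional as F

def feature_level_perturbation(features: torch.Tensor,
                               prototypes: torch.Tensor,
                               gamma_fea: float = 0.6) -> torch.Tensor:
    """features: (N, C) current ROI features; prototypes: (L, C) stored class prototypes.

    W is obtained from a similarity calculation (here: cosine similarity followed by a
    softmax over prototypes); the perturbation is a weighted addition of the prototype
    mixture to the features, scaled by gamma_fea (illustrative values)."""
    sim = F.cosine_similarity(features.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)  # (N, L)
    W = sim.softmax(dim=-1)                      # weight vector per sample
    mixed = W @ prototypes                       # (N, C) prototype mixture
    return features + gamma_fea * mixed          # weighted addition
```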
Figure 6. Description of the enhanced data-level perturbation of PSMP.
Figure 7. Illustration of the training process of PSMP. D represents the training data, while S refers to the samples selected for each task by our “prototype distance-based sample selection strategy” outlined in Section 3.3.2.
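As a hedged illustration of how such a prototype distance-based selection could look in code: for each known category, the exemplars whose features lie closest to the category prototype are retained for replay. The nearest-to-prototype criterion and the per-class budget below are assumptions, not the exact rule from Section 3.3.2.

```python
import torch

def select_exemplars(features: torch.Tensor, prototype: torch.Tensor, budget: int = 50) -> torch.Tensor:
    """Return the indices of the `budget` samples of one category whose features lie
    closest (Euclidean distance) to the category prototype. The per-class `budget` is a
    hypothetical quota; Table 1 only fixes the total number of retained exemplars per
    task (1000)."""
    dists = torch.cdist(features, prototype.unsqueeze(0)).squeeze(1)  # (B,) distances
    return torch.argsort(dists)[:budget]                              # indices of kept exemplars
```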
Figure 8. Visualization of PSMP before and after semantic-level perturbations. (a) The semantic boundaries without perturbation, which can be used for comparison. (b) The semantic boundaries after applying semantic-level perturbation.
Figure 9. The mAP for “previous”, “current”, and “both” categories is reported under different settings of the semantic-level perturbation intensity (α = γ_sem) and the enhanced feature-level perturbation intensity (β = γ_fea). The setting “α-0.4 β-0.6” achieves the highest overall performance across all known categories. Comparison of the model’s mean Average Precision (mAP) across different settings: (a) results for the ‘previous’ category, (b) results for the ‘current’ category, and (c) results for the combined ‘both’ category.
Figure 10. The mAP for “previous”, “current”, and “both” categories is reported under different settings of the enhanced data-level perturbation intensity (γ = γ_data). The setting “γ-5%” achieves the highest overall performance across all known categories. Comparison of the model’s mean Average Precision (mAP) across different settings: (a) results for the ‘previous’ category, (b) results for the ‘current’ category, and (c) results for the combined ‘both’ category.
Figure 11. Detailed comparison of training metrics between PSMP and the baseline across all task stages.
Figure 12. Visualization of PSMP based on OW-RCNN. “Unknown” in the figure represents unknown categories. (a) Food-related objects, (b) outdoor scenes, (c) small-scale objects, and (d) animals.
Figure 13. Visualization results of task T_4 are presented for both the baseline and the PSMP approach. Columns 1 and 3 show the results of “OCPL*” and OW-RCNN, and columns 2 and 4 show the results of PSMP(OCPL*) and PSMP(OW-RCNN), respectively.
Table 1. Number of training samples, test samples, and retained exemplars in each task phase.

|                 | Task 1 (Base Task) | Task 2                                 | Task 3       | Task 4                                 |
| Semantic split  | VOC classes        | Outdoor, accessories, appliance, truck | Sports, food | Electronic, indoor, kitchen, furniture |
| Training images | 16,551             | 45,520                                 | 39,402       | 40,260                                 |
| Exemplars       | 1000               | 1000                                   | 1000         | 1000                                   |
| Test images     | 10,246             | 10,246                                 | 10,246       | 10,246                                 |
Table 2. The experimental results of PSMP and the baselines under the OLOWOD protocol, with a comparison to BSDP [18]. “P” and “C” denote “previous” and “current”, respectively, referring to the mAPs of the old categories and of the categories in the current task. “Both” represents the mAP over all known categories. The (↑) indicates that higher values correspond to better model performance.
| Method | Task 1 mAP C (↑) | Task 1 UR (↑) | Task 2 mAP P | Task 2 mAP C | Task 2 mAP Both (↑) | Task 2 UR (↑) | Task 3 mAP P | Task 3 mAP C | Task 3 mAP Both (↑) | Task 3 UR (↑) | Task 4 mAP P | Task 4 mAP C | Task 4 mAP Both (↑) |
| ORE* | 56.21 | 5.24 | 52.84 | 28.73 | 40.79 | 2.95 | 38.20 | 13.05 | 30.07 | 3.93 | 29.64 | 14.21 | 25.52 |
| OCPL* | 56.32 | 8.23 | 51.93 | 28.86 | 40.40 | 7.51 | 39.25 | 14.79 | 31.44 | 12.28 | 31.47 | 14.80 | 26.24 |
| OW-RCNN [17] | 62.41 | 37.52 | 48.20 | 41.58 | 45.14 | 39.67 | 44.31 | 30.94 | 40.08 | 42.18 | 39.82 | 28.07 | 36.94 |
| UC-OWOD* | 56.38 | 32.44 | 48.83 | 45.75 | 46.04 | 35.41 | 30.14 | 26.82 | 27.43 | 37.49 | 28.24 | 25.79 | 26.11 |
| RandBox | 60.41 | 0.5 | 47.26 | 43.88 | 45.07 | 6.25 | 42.17 | 38.34 | 39.28 | 7.67 | 36.02 | 33.94 | 34.81 |
| ORE*-OL | 56.21 | 5.24 | 51.44 | 22.84 | 37.22 | 3.42 | 37.28 | 12.59 | 28.73 | 2.45 | 28.47 | 13.88 | 23.12 |
| OCPL*-OL | 56.32 | 8.23 | 52.20 | 25.37 | 38.35 | 7.91 | 39.11 | 13.66 | 30.11 | 9.32 | 29.25 | 14.01 | 25.24 |
| OW-RCNN-OL | 62.41 | 37.52 | 46.98 | 40.32 | 44.43 | 40.28 | 42.81 | 29.19 | 39.26 | 40.71 | 38.76 | 27.40 | 36.07 |
| UC-OWOD*-OL | 56.38 | 32.44 | 49.46 | 42.90 | 44.28 | 35.88 | 29.97 | 25.06 | 26.32 | 34.85 | 26.11 | 25.37 | 25.39 |
| RandBox-OL | 60.41 | 0.5 | 48.04 | 40.39 | 43.97 | 5.98 | 41.32 | 36.33 | 38.50 | 6.94 | 34.53 | 31.71 | 32.55 |
| PSMP(ORE*) | 56.21 | 5.24 | 52.98 | 23.47 | 38.49 | 3.51 | 38.92 | 12.33 | 30.82 | 3.21 | 30.26 | 14.08 | 25.20 |
| PSMP(OCPL*) | 56.32 | 8.23 | 53.27 | 26.83 | 39.55 | 7.68 | 40.70 | 14.25 | 31.29 | 11.83 | 32.52 | 14.67 | 26.51 |
| PSMP(UC-OWOD*) | 56.38 | 32.44 | 50.66 | 44.34 | 45.77 | 35.69 | 30.84 | 26.55 | 28.01 | 35.08 | 28.29 | 26.15 | 27.04 |
| PSMP(RandBox) | 60.41 | 0.5 | 49.33 | 42.34 | 45.28 | 6.44 | 42.86 | 36.54 | 39.66 | 8.01 | 36.70 | 33.18 | 35.22 |
| BSDP(OW-RCNN) | 62.41 | 37.52 | 52.45 | 40.27 | 44.58 | 40.31 | 44.57 | 29.10 | 41.22 | 44.29 | 40.08 | 26.01 | 36.44 |
| PSMP(OW-RCNN) | 62.41 | 37.52 | 54.58 | 41.29 | 45.02 | 42.14 | 46.22 | 30.88 | 42.50 | 45.28 | 41.27 | 27.91 | 37.68 |
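For reference, Unknown Recall (UR), reported above, is commonly defined in the open-world detection literature as the fraction of ground-truth unknown instances that the detector recovers as unknown:

\[
\mathrm{UR} = \frac{\mathrm{TP}_{\mathcal{U}}}{\mathrm{TP}_{\mathcal{U}} + \mathrm{FN}_{\mathcal{U}}},
\]

where TP_U and FN_U are counted over the ground-truth boxes of the unknown classes.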
Table 3. WI and A-OSE results of the proposed PSMP are compared with those of the baseline methods under the OLOWOD protocol. The (↓) indicates that lower values correspond to better model performance.
| Method | Task 1 WI (↓) | Task 1 A-OSE (↓) | Task 2 WI (↓) | Task 2 A-OSE (↓) | Task 3 WI (↓) | Task 3 A-OSE (↓) |
| ORE* | 0.0580 | 11428 | 0.0295 | 10631 | 0.0214 | 9471 |
| OCPL* | 0.0462 | 5615 | 0.0228 | 5944 | 0.0165 | 4852 |
| OW-RCNN [17] | 0.0537 | 6882 | 0.0248 | 4722 | 0.0172 | 3827 |
| ORE*-OL | 0.0580 | 11428 | 0.0384 | 11377 | 0.0192 | 8925 |
| OCPL*-OL | 0.0462 | 5615 | 0.0242 | 5865 | 0.0174 | 4221 |
| OW-RCNN-OL | 0.0537 | 6882 | 0.0261 | 4006 | 0.0168 | 3428 |
| PSMP(ORE*) | 0.0580 | 11428 | 0.0297 | 12875 | 0.0260 | 7931 |
| PSMP(OCPL*) | 0.0462 | 5615 | 0.0249 | 5437 | 0.0179 | 4394 |
| BSDP(OW-RCNN) | 0.0537 | 6882 | 0.0226 | 3922 | 0.0174 | 4155 |
| PSMP(OW-RCNN) | 0.0537 | 6882 | 0.0213 | 3719 | 0.0160 | 3107 |
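For reference, the two metrics above follow the usual open-set definitions: A-OSE (Absolute Open-Set Error) counts the unknown instances that are wrongly detected as one of the known classes, while Wilderness Impact (WI, cf. [62]) measures how much the presence of unknown objects degrades the precision on known classes:

\[
\mathrm{WI} = \frac{P_{\mathcal{K}}}{P_{\mathcal{K}\cup\mathcal{U}}} - 1,
\]

where P_K is the precision measured with only known classes present and P_{K∪U} is the precision when unknown objects are also included.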
Table 4. Experiments are conducted under three different settings. Due to space limitations, only the mAPs of the last ten categories of the incremental task and the overall mAP are shown here. The 10, 5, and 1 categories highlighted with a yellow background in the table are introduced into detectors trained on the remaining 10, 15, and 19 categories, respectively.
10 + 10 Settings
| Method | Table | Dog | Horse | Bike | Person | Plant | Sheep | Sofa | Train | TV | mAP |
| ILOD [66] | 59.7 | 72.7 | 73.5 | 73.2 | 66.3 | 29.5 | 63.4 | 61.6 | 69.3 | 62.2 | 63.1 |
| Faster ILOD [67] | 36.7 | 70.9 | 66.8 | 67.6 | 66.1 | 24.7 | 63.1 | 48.1 | 57.1 | 43.6 | 62.2 |
| iOD [65] | 60.1 | 66.4 | 76.0 | 72.6 | 74.6 | 39.7 | 64.0 | 60.2 | 68.5 | 60.5 | 66.3 |
| ORE [11] | 56.1 | 70.4 | 80.2 | 72.3 | 81.8 | 42.7 | 71.6 | 68.1 | 77.0 | 67.7 | 64.6 |
| OCPL* | 46.1 | 50.8 | 65.2 | 66.0 | 71.9 | 21.6 | 49.8 | 54.6 | 68.5 | 46.3 | 64.3 |
| OW-RCNN [17] | 58.8 | 42.9 | 63.5 | 58.2 | 67.6 | 28.9 | 36.3 | 66.8 | 77.5 | 59.6 | 65.1 |
| PSMP(ORE*) | 36.7 | 69.3 | 77.4 | 76.6 | 70.8 | 38.2 | 70.7 | 52.2 | 63.9 | 59.9 | 67.9 |
| PSMP(OCPL*) | 52.2 | 61.3 | 72.4 | 70.0 | 69.7 | 30.7 | 64.3 | 36.5 | 60.1 | 56.6 | 68.5 |
| PSMP(OW-RCNN) | 51.3 | 64.2 | 68.8 | 84.3 | 64.4 | 37.7 | 50.6 | 48.3 | 48.7 | 60.8 | 69.1 |

15 + 5 Settings
| Method | Table | Dog | Horse | Bike | Person | Plant | Sheep | Sofa | Train | TV | mAP |
| ILOD [66] | 59.0 | 75.8 | 71.8 | 78.6 | 69.6 | 33.7 | 61.5 | 63.1 | 71.7 | 62.2 | 65.9 |
| Faster ILOD [67] | 63.1 | 78.6 | 80.5 | 78.4 | 80.4 | 36.7 | 61.7 | 59.3 | 67.9 | 59.1 | 67.9 |
| iOD [65] | 61.8 | 74.7 | 81.6 | 77.5 | 80.2 | 37.8 | 58.0 | 54.6 | 73.0 | 56.1 | 67.8 |
| ORE [11] | 55.4 | 76.7 | 86.2 | 78.5 | 82.1 | 32.8 | 63.6 | 54.7 | 77.7 | 64.6 | 68.5 |
| OCPL* | 75.5 | 81.2 | 89.2 | 84.4 | 83.3 | 19.1 | 25.2 | 24.0 | 65.1 | 36.8 | 69.5 |
| OW-RCNN [17] | 79.3 | 83.7 | 82.6 | 80.7 | 81.8 | 54.2 | 39.1 | 43.7 | 31.5 | 47.3 | 72.8 |
| PSMP(ORE*) | 61.5 | 79.2 | 90.1 | 88.7 | 86.6 | 42.3 | 68.8 | 50.7 | 48.2 | 52.9 | 74.7 |
| PSMP(OCPL*) | 82.4 | 87.0 | 85.2 | 84.5 | 84.1 | 58.2 | 47.6 | 51.3 | 42.9 | 50.1 | 76.0 |
| PSMP(OW-RCNN) | 81.7 | 88.1 | 86.9 | 84.3 | 85.4 | 59.6 | 48.3 | 52.1 | 45.7 | 54.3 | 77.1 |

19 + 1 Settings
| Method | Table | Dog | Horse | Bike | Person | Plant | Sheep | Sofa | Train | TV | mAP |
| ILOD [66] | 64.8 | 77.2 | 80.8 | 77.5 | 70.1 | 42.3 | 67.5 | 64.4 | 76.7 | 62.7 | 68.3 |
| Faster ILOD [67] | 58.7 | 78.8 | 81.8 | 75.3 | 77.4 | 43.1 | 73.8 | 61.7 | 69.8 | 61.1 | 68.6 |
| iOD [65] | 63.2 | 78.5 | 82.7 | 79.1 | 79.9 | 44.1 | 73.2 | 66.3 | 76.4 | 57.6 | 70.2 |
| ORE [11] | 54.6 | 72.8 | 85.9 | 81.7 | 82.4 | 44.8 | 75.8 | 68.2 | 75.7 | 60.1 | 68.9 |
| OCPL* | 74.4 | 86.1 | 88.7 | 87.6 | 85.4 | 50.1 | 83.7 | 74.9 | 77.8 | 48.5 | 77.1 |
| OW-RCNN [17] | 76.9 | 82.3 | 79.5 | 77.6 | 78.8 | 74.1 | 75.6 | 78.3 | 80.1 | 58.3 | 78.0 |
| PSMP(ORE*) | 79.8 | 84.2 | 81.2 | 80.3 | 81.0 | 80.5 | 82.7 | 83.8 | 85.3 | 54.1 | 81.1 |
| PSMP(OCPL*) | 82.6 | 86.0 | 82.8 | 81.4 | 82.9 | 82.1 | 84.0 | 84.7 | 85.3 | 52.6 | 82.6 |
| PSMP(OW-RCNN) | 83.9 | 87.6 | 85.5 | 80.0 | 86.2 | 84.8 | 86.9 | 87.8 | 88.4 | 56.2 | 84.6 |
Table 5. Ablation experiments are conducted on each module of the proposed PSMP and compared with the baseline method. The baseline used for comparison is “OW-RCNN-OL”.
| Method | Task 1 mAP (↑) | Task 2 mAP P | Task 2 mAP C | Task 2 mAP Both | Task 3 mAP P | Task 3 mAP C | Task 3 mAP Both | Task 4 mAP P | Task 4 mAP C | Task 4 mAP Both |
| Baseline | 62.41 | 46.98 | 40.32 | 44.43 | 42.81 | 29.19 | 39.26 | 38.76 | 27.40 | 36.07 |
| S_proto | 62.41 | 47.35 | 39.74 | 44.47 | 43.06 | 28.77 | 39.19 | 39.27 | 25.08 | 36.14 |
| S_proto + P_sem | 62.41 | 52.77 | 41.07 | 45.11 | 45.39 | 29.81 | 40.27 | 40.55 | 26.34 | 36.16 |
| S_proto + P_fea | 62.41 | 49.63 | 40.28 | 44.03 | 44.91 | 28.94 | 38.77 | 39.72 | 26.11 | 35.86 |
| S_proto + P_data | 62.41 | 48.32 | 39.27 | 44.08 | 44.57 | 28.61 | 38.64 | 39.88 | 24.97 | 35.82 |
| S_proto + P_sem + P_fea | 62.41 | 53.65 | 41.37 | 44.83 | 45.69 | 30.77 | 41.66 | 41.05 | 26.93 | 36.74 |
| S_proto + All | 62.41 | 54.58 | 41.29 | 45.02 | 46.22 | 30.88 | 42.50 | 41.27 | 27.91 | 37.68 |
Table 6. Comparison of training metrics between PSMP and the baseline. GPU Memory denotes the memory consumption per GPU, while Per-Stage Time refers to the average training time per task stage. Note that since task T_1 is trained offline, it runs for more than one epoch (about 22 min), so its training time is longer.
| Method | GPU Memory | Total Time | Per-Stage Time | Training FPS |
| OW-RCNN-OL | 13.2 GB | 18 (Task 1) + 3.1 h | 46 min | 12.3 FPS |
| PSMP(OW-RCNN) | 17.4 GB | 18 (Task 1) + 4.0 h | 66 min | 10.6 FPS |
Table 8. The acronyms and symbols used in this paper.
| Symbol | Explanation |
|  | Weighted addition |
|  | Matrix multiplication |
|  | Similarity computation |
| P | Prototype |
| CT | Contrastive tension |
| f_pert_sem | Semantic-level perturbation |
| f_pert_fea | Feature-level perturbation |
| Adv_data^t | Data-level perturbation |
| -OL | Training under the online setting |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
