Few-Shot Open-Set Object Detection with a Synthesized Monument Guided by Contrastive Distilled Prompts

Chen, Hao; Chen, Ying

doi:10.3390/app16073474

by Hao Chen and Ying Chen *

The Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214026, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3474; https://doi.org/10.3390/app16073474

Submission received: 13 February 2026 / Revised: 15 March 2026 / Accepted: 27 March 2026 / Published: 2 April 2026

(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)

Abstract

Few-shot open-set object detection (FS-OSOD) remains challenging in real-world scenarios, where detectors must accurately recognize known objects from few examples while reliably rejecting a vast range of unknown categories. Under this setting, decision boundaries between known and unknown classes are easily distorted by data scarcity and background clutter, leading to severe overfitting on base classes and overconfident misclassification of unknowns. Recent research attempts to alleviate these issues by regularizing detection heads to suppress base-class bias, or by leveraging vision–language priors through open-vocabulary alignment and prompt tuning to enhance semantic transferability. However, these solutions often overlook explicit modeling of truly out-of-set unknowns and the instability of prompt adaptation in low-data regimes, which can cause boundary drift and lead unknown proposals to be absorbed by similar seen classes or even suppressed as background. To address these issues, a guided prompt–monument network (GPMN) is proposed, which jointly enhances prompt learning and feature representation learning for FS-OSOD. First, the contrastive distilled prompts (CDP) module employs a teacher–student prompt framework to decouple optimization across base, novel, and unknown classes. This strategy preserves transferability between zero-shot and few-shot settings while enhancing discrimination on base categories. Second, a synthesized monument module (SMM) maintains class-centered memory with momentum-updated prototypes and a non-parametric classifier, which compresses the overlap between seen and unseen distributions and provides a stable rejection margin for unknowns with strong co-occurrence and background noise. Compared with existing head-regularization and open-vocabulary prompt-tuning pipelines, GPMN explicitly targets both base-class bias and seen–unseen overlap at the region level.
Extensive experiments on VOC10-5-5 and VOC-COCO benchmarks demonstrate that GPMN consistently improves unknown recall and few-shot mAP over representative FS-OSOD baselines. These results suggest that prompt-level decoupling mitigates base-class bias, whereas memory-anchored regularization enlarges the seen–unseen margin, jointly supporting reliable unknown rejection in scarce-supervision regimes.

Keywords: few shot; open set; object detection; prompt learning; representation learning

1. Introduction

Object detection has matured substantially over the past decade. Modern detectors, represented by the Faster R-CNN and YOLO frameworks, deliver accurate localization and recognition when trained and tested within a fixed, fully annotated label space [1,2]. This closed-set protocol underpins most benchmark progress. However, it only partially reflects real-world deployment in autonomous driving, robotics, and industrial inspection, where target categories evolve and annotation budgets remain limited and uneven.

An equally restrictive constraint is data scarcity. When only a few labeled instances are provided for each known category, the learned boundaries tend to be shaped by incidental cues in the support set rather than stable class characteristics. This over-specialization is amplified in detection because classification and localization are learned jointly, and background statistics can dominate training gradients when positive examples are scarce. In an ideal case, known-class features would remain compact yet sufficiently diverse and well separated from both background and the unknown region, as illustrated in Figure 1a. The few-shot open-set object detection (FS-OSOD) setting was first introduced in [3]. In FS-OSOD settings, limited supervision coupled with missing labels for unknown objects makes the known–unknown boundary highly susceptible to distortion, which often results in low recall for unknown discovery. Base-dominated optimization can drive known-class feature distributions toward the unknown space, increasing false unknown alarms and cross-class confusion, as shown in Figure 1b. Meanwhile, insufficient intra-class diversity for novel/unknown objects leads to under-coverage and reduced margins, causing unknown instances to be absorbed by similar known classes or suppressed as background.

Recent FS-OSOD work in Table 1 has sought to improve unknown rejection when training is limited to few-shot closed-set annotations. A common line regularizes the classification head to curb base-class overfitting, e.g., sparsifying normalized classifier weights to weaken co-adaptation [3] or stabilizing optimization via moving/averaged weights under scarce supervision [4]. While effective for training stability, these designs remain largely vision-centric and offer limited semantic guidance for “what lies outside” the seen taxonomy. In parallel, vision–language models such as CLIP and large-scale successors provide transferable, semantically structured embeddings [5,6], and open-vocabulary detectors align region features with text to broaden category coverage [7,8,9,10]. However, open-vocabulary pipelines typically assume a user-specified finite vocabulary and do not explicitly enforce rejection of truly vocabulary-external objects. Prompt-tuning methods further show that learning lightweight prompts is efficient, yet prompts can over-specialize in low-data regimes and degrade transfer to novel or unknown concepts [11,12,13]. These gaps indicate the need to jointly control semantic adaptation and region-level separation under clutter and co-occurrence.

Despite recent progress, few-shot open-set detectors still face a compound challenge in deployment driven by both scarce labels and unlabeled unknowns. As depicted in Figure 1, an ideal feature space forms compact yet diverse seen-class manifolds that are separated from unseen objects and background in Figure 1a. In practice, only a limited number of labeled instances are available for known classes, and unknown objects remain unlabeled. Consequently, base-class gradients dominate optimization, causing unknown regions to be absorbed into background noise. This bias causes a representation drift and boundary expansion, as shown in Figure 1b, increasing overlap between seen- and unseen-class manifolds while enabling high-confidence absorption of unknown proposals, especially for cluttered or partial RoIs. Meanwhile, limited intra-class coverage further shrinks inter-class margins in Figure 1c. These factors motivate controlling semantic specialization while enforcing clearer seen–unseen separation.

Despite advances made by FOOD [3] and CED-FOOD [20] in general-purpose few-shot open-set object detection, significant gaps remain. Existing methods still face challenges in addressing fundamental category bias, limited semantic transferability under scarce data conditions, region-level seen–unseen ambiguity in cluttered scenes, and the effective application of prompt learning. As summarized in Table 2, few-shot open-set detection requires two complementary forms of control. One operates at the prompt interface to prevent collapse into base-class specialization, and the other operates at the representation level to reduce overlap between seen and unseen region distributions in cluttered scenes. Concretely, we propose a guided prompt–monument network (GPMN) that jointly enhances prompt learning and region feature learning for few-shot open-set object detection. First, the contrastive distilled prompts (CDP) module decouples optimization via a teacher–student prompt design. The teacher prompt favors zero-shot or few-shot transfer to novel or unknown semantics, while the student prompt sharpens base-category discrimination and reduces base overfitting without loss of generalization. Second, the synthesized monument module (SMM) maintains a class-centered memory using momentum-updated prototypes and a lightweight non-parametric classifier. It regularizes region features to reduce overlap between seen and unseen distributions and yields a more stable rejection margin for unknown objects under strong co-occurrence and background noise. Importantly, combining CDP and SMM is particularly effective under few-shot imbalance because it addresses two coupled failure modes jointly. CDP counteracts base-dominated gradients by decoupling transferable semantics from base discriminability, producing reliable seen/unseen semantic conditions.
SMM then uses these CDP-derived conditions to construct the unseen anchor space and synthesize unseen-like features, enforcing a monument-centered structure that reduces seen–unseen overlap and yields a larger, more stable rejection margin than either component alone.

Extensive experiments on VOC10-5-5 and VOC-COCO demonstrate that GPMN consistently improves unknown recall and few-shot mAP over representative FS-OSOD baselines [3]. Overall, the results suggest that prompt-level decoupling and memory-anchored representation regularization offer complementary benefits beyond additive effects. They mitigate two distinct failure mechanisms that jointly shape reliable rejection of unknown objects in scarce-supervision regimes.

The contribution of the work can be summarized as follows:

First, the contrastive distilled prompts (CDP) module addresses base–novel optimization conflict through a teacher–student prompt division with learnable fusion. It improves base-class detection while preserving novel/unknown generalization and reduces high-confidence unknown-to-known errors.
Second, SMM reduces seen–unseen overlap at the distribution level via class-centered memory and cross-domain comparison. Non-parametric fusion at inference improves robustness, yielding higher unknown recall and more stable rejection in cluttered, co-occurring, low-supervision settings.
Third, experiments under the FS-OSOD setting show consistent improvements over competitive baselines, particularly on unknown detection and related aggregate metrics.

2. Related Works

This Related Works Section follows a problem-driven narrative review to position FS-OSOD. The literature is organized into three complementary streams: (i) few-shot object detection (FSOD) methods that address scarce annotations under closed-set evaluation, (ii) open-set object detection (OSOD) methods that learn to reject unknown objects, and (iii) prompt learning for vision–language models (VLMs) that provides a parameter-efficient interface for semantic transfer. Representative papers are selected to cover dominant design choices and evaluation protocols within each stream, and their limitations are summarized to identify gaps most relevant to generalized FS-OSOD in Table 3.

Inclusion and exclusion criteria. Primary papers are included if they (a) address detection under few-shot, open-set/open-world; (b) propose a mechanism operating on region representations, detection heads, or inference-time calibration; and (c) report detection results on standard benchmarks or comparable detection protocols. Context papers are additionally included when they provide widely adopted components that directly impact FS-OSOD, such as VLM prompt tuning and generality-preserving objectives for stable transfer. Papers are excluded if they focus solely on image-level recognition without a clear implication for region-level detection or unknown rejection, or if they do not provide an explicit mechanism/evaluation related to few-shot adaptation or open-set behavior.

To further clarify the distinctions and correlations among these research streams, Figure 2 provides a visual comparison of their respective focuses, limitations, and shared relevance to FS-OSOD. As illustrated in Figure 2, FS-OSOD lies at the intersection of few-shot adaptation, region-level localization, and unknown rejection, which in turn motivates the dual design of GPMN from both prompt-level and representation-level perspectives.

2.1. Few-Shot Open-Set Recognition

Few-shot open-set recognition (FSOSR) seeks to rapidly learn from a handful of labeled examples while identifying known categories and rejecting unknowns in open-world settings. Early benchmarks augment meta-learning-based few-shot frameworks with an open-set loss and pseudo-unknown sampling from base data [14]. Later directions focus on rejection via prototype transformation distances [15], pseudo-unknown construction through background-region mining [16], and task-adaptive negative thresholding to refine the rejection boundary during training [17]. These strategies primarily improve image-level unknown rejection, but they do not close the gap to detection. FS-OSOD requires fine-grained region representations and robust localization in cluttered, co-occurring scenes, together with explicit treatment of the background class, which makes it substantially harder than recognition alone [3,4,20].

2.2. Few-Shot Object Detection

Few-shot object detection has progressed rapidly as a remedy for the annotation burden of supervised detectors, and recent methods can be grouped into three lines. Meta-learning methods learn task-level knowledge that adapts to novel categories from only a few support examples; representative systems include FSRW [21], Meta-RCNN [22], and Meta-DETR [23]. Transfer-learning approaches adopt a two-stage regime—base training followed by few-shot fine-tuning—to transfer generalizable features to novel classes; notable baselines include TFA [24], FSCE [25], and DeFRCN [26]. A third direction treats few-shot detection as a class-imbalance problem and augments rare categories by synthesizing pseudo-samples for end-to-end training [33]. Despite strong performance under closed-set evaluation, these methods do not guarantee unknown rejection. Consequently, later studies extend standard transfer baselines with unknown-aware mechanisms to identify previously unseen objects, reporting improved rejection of unknown classes in few-shot regimes relative to prior state-of-the-art [4].

2.3. Open-Set Object Detection

Open-set object detection simultaneously detects known categories and rejects unknowns, and existing approaches can be grouped by how unknown evidence is acquired. A common approach synthesizes surrogate “unknowns”, e.g., by tail sampling in the known feature space, cross-class embedding mixing, or out-of-set patch generation. The resulting outliers supervise an unknown-class branch and help the detector form a separable rejection region [18,19]. A second line mines background proposals with high uncertainty as unknowns, using them to supervise open-set behavior [27,28]. Another option is to select unknowns from known-class samples, treating high uncertainty instances as proxy unknowns during training [29,30]. In parallel, threshold-based schemes rely on energy or entropy as uncertainty surrogates and reject predictions against a learned or fixed threshold [34,35,36,37]. Despite notable progress, most OSOD methods presuppose abundant known-class data; under few-shot conditions, this assumption is violated, leading to overfitting on scarce known classes and degraded open-set performance. This gap motivates mechanisms that reduce dependence on dense class supervision, such as sparsifying class weights and decoupling known or unknown decision factors to improve robustness when data are limited.

2.4. Prompt Learning

Prompt learning is included here because it complements the above research streams at the semantic level and is highly relevant to FS-OSOD. It facilitates transfer to novel or unknown categories, enables parameter-efficient adaptation under limited annotations, and provides text-guided priors that help stabilize seen–unknown separation in open-set detection.

Vision–language models (VLMs) such as CLIP [5] and larger successors (ALIGN [38], ALBEF [39], BLIP [6], Flamingo [40]) provide aligned visual–text representations that transfer well. As model size increases, full fine-tuning becomes prohibitively expensive. Prompt tuning is a parameter-efficient alternative that freezes the backbone and optimizes a small set of continuous prompt vectors, replacing manually crafted templates [40,41]. In VLM adaptation, CoOp [11] learns static textual context tokens but can overfit base classes and harm base-to-novel generalization. CoCoOp [12] addresses class shift by generating instance-conditional prompts. MaPLe [13] further couples prompts across visual and textual encoders to strengthen cross-modal coordination, yet higher prompt capacity may amplify distribution bias. Recent methods therefore emphasize generality-preserving objectives, including self-regularization against the frozen model in PromptSRC [31] and prompt-driven distillation in PromptKD [32]. Following this line, we introduce CDP in our FS-OSOD method to decouple base discrimination from novel or unknown transfer.

3. Method

This section presents GPMN for few-shot open-set object detection, where models must recognize known/novel classes from scarce annotations while rejecting unseen objects. We propose two complementary components. Contrastive distilled prompts (CDP) decouple transferable semantic alignment and task-specific discriminability via a frozen teacher prompt and a learnable student prompt, optimized with a detection-aware contrastive objective. The synthesized monument module (SMM) synthesizes region features under prompt guidance and maintains a momentum monument memory to calibrate classification scores and support unknown decisions at inference. We then detail the overall pipeline and each module.

3.1. Preliminaries

Let $D = \{(x, y)\}$ denote an object detection dataset, where $x \in X$ is an image, and each image $x$ is associated with a set of instance annotations $Y = \{(c_{i}, b_{i}) \mid i = 1, 2, \dots, N\}$, where $c_{i}$ and $b_{i}$ denote the class label and bounding box of the $i$-th instance, respectively. The dataset is split into training and test sets, $D_{tr} = D_{B} \cup D_{N}$ and $D_{te}$, respectively. The training label space comprises $K$ known classes $C_{K} = C_{B} \cup C_{N} = \{1, \dots, K\}$, where $C_{B} = \{1, \dots, B\}$ are data-abundant base classes and $C_{N} = \{B + 1, \dots, K\}$ are $N$ novel classes with $M$-shot supports per class. At test time, in addition to $C_{K}$, an open set of unknown categories appears and is aggregated into a placeholder class $C_{U} = \{K + 1\}$, since enumerating all unknowns is infeasible. We further include an explicit background class $C_{BG} = \{K + 2\}$ for regions not aligned with any object of interest. The FS-OSOD task is therefore to learn, from the class-imbalanced training set $D_{tr}$, a detector that (i) accurately recognizes instances from $C_{K}$, (ii) rejects all unknown instances into $C_{U}$, and (iii) distinguishes foreground from background according to $C_{BG}$.
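As a minimal sketch of the label-space indexing described above, the following Python snippet builds the base, novel, unknown, and background index sets from B and K. The function name and the values B = 10, K = 15 are illustrative assumptions, not settings taken from the paper.

```python
def build_label_space(num_base: int, num_known: int) -> dict:
    """Index sets for the FS-OSOD label space, using 1-based class ids."""
    if not 0 < num_base < num_known:
        raise ValueError("expected 0 < B < K")
    return {
        "base": list(range(1, num_base + 1)),               # C_B = {1, ..., B}
        "novel": list(range(num_base + 1, num_known + 1)),  # C_N = {B+1, ..., K}
        "unknown": num_known + 1,                           # C_U = {K+1}
        "background": num_known + 2,                        # C_BG = {K+2}
    }

# Hypothetical split: 10 base classes and 5 novel classes.
space = build_label_space(num_base=10, num_known=15)
```

All unknown categories share the single placeholder index K + 1, which is why the sketch stores it as one integer rather than a list.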

3.2. Overall

Figure 3 summarizes the workflow of the proposed method before the detailed technical description. As illustrated in the figure, GPMN proceeds from the FS-OSOD setting and a shared detector/VLM embedding space to prompt-level semantic stabilization by CDP, then to the two stages of SMM for conditional feature synthesis and monument-memory regularization, and finally to inference by fusing detector logits with memory-based similarities. The figure also outlines the overall training schedule and the connections among the components.

Figure 4 presents the full pipeline. A two-stage detector extracts region proposals and RoI features, and a frozen vision–language encoder provides a common embedding space for computing region–text similarity. Under few-shot supervision, optimization is easily dominated by background and base-category gradients, which tends to bias the semantic interface toward base-specific cues and enlarges the effective decision region of seen classes. The first stage therefore applies the contrastive distilled prompts (CDP) module to constrain semantic adaptation. CDP learns region–text alignment with a teacher–student prompt formulation, in which the teacher retains transferable semantics and the student is trained with guided hard negatives to enhance discrimination while limiting base-driven drift. Given a more reliable similarity space, the second stage invokes the synthesized monument module (SMM) to regularize region representations. SMM maintains momentum-updated class monuments and enforces cross-domain comparisons, reducing overlap among seen, synthesized unseen and background features, and yielding a clearer rejection margin. During inference, SMM-derived memory similarities are fused with detector logits for calibration, which mitigates over-confident assignment of ambiguous RoIs and improves unknown rejection under clutter and partial observations.

3.3. Contrastive Distilled Prompts

Few-shot open-set detection is learned from a class-imbalanced training set, where abundant base classes and M-shot novel classes share the known label space, while unknown categories appear only at test time and must be rejected into an unknown placeholder. This imbalance makes a single tuned prompt prone to base-dominated shortcuts. As illustrated in Figure 5, CDP addresses this issue by decoupling prompt adaptation into a fixed teacher prompt that preserves transferable semantics and a trainable student prompt that improves discrimination, while leveraging teacher-predicted scores to construct region-level hard negatives for training. The student prompt is then optimized with these hard negatives under a symmetric contrastive objective.

Region–text embeddings. Given an image $x$, a two-stage detector produces region proposals $\{v_{i}\}_{i = 1}^{N}$. As shown in Figure 4, the frozen CLIP image encoder $f(\cdot)$ maps each proposal to a region embedding, and the frozen CLIP text encoder $g(\cdot)$ maps each class text or prompt to a text embedding. In Figure 4, these components are explicitly labeled as “CLIP Image Encoder” and “CLIP Text Encoder”, respectively. Their outputs form a shared embedding space for region–text similarity computation. The region embedding is

$$R_{i} = f(v_{i}) \in \mathbb{R}^{d}. \tag{1}$$

For a class name $T_{c}$, a prompt is parameterized by $M$ learnable context vectors

$$P = [p_{1}; \dots; p_{M}] \in \mathbb{R}^{M \times d}, \tag{2}$$

and a frozen text encoder $g(\cdot)$ outputs the text embedding

$$t_{c}(P) = g(T_{c}, P) \in \mathbb{R}^{d}. \tag{3}$$

Let $s(a, b) = \cos(a, b) / \tau$ denote scaled cosine similarity, and let $P^{*}$ denote the prompt used for all categories, including known and unknown categories. The CLIP-style region-to-class probability over a candidate set $C$ is

$$p(c \mid R_{i}, P^{*}) = \frac{\exp(s(R_{i}, t_{c}(P^{*})))}{\sum_{k \in C} \exp(s(R_{i}, t_{k}(P^{*})))}. \tag{4}$$
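The region-to-class probability in Eq. (4) is a softmax over temperature-scaled cosine similarities. The sketch below illustrates it in plain Python. The toy 3-dimensional vectors stand in for encoder outputs, and the temperature value 0.1 is an assumption chosen for illustration rather than a setting from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def region_class_probs(region, text_embeddings, tau=0.1):
    """Eq. (4): softmax over s(R, t_c) = cos(R, t_c) / tau for each candidate class."""
    logits = [cosine(region, t) / tau for t in text_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for f(v_i) and g(T_c, P*); values are illustrative.
region = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = region_class_probs(region, texts)
```

A smaller temperature sharpens the distribution toward the most similar class, which is one reason the choice of $\tau$ matters for rejection behavior.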

Dual-prompt decoupling and task-aware mixing. CDP maintains a frozen teacher prompt $P^{t}$ and a trainable student prompt $P^{s}$. The student prompt is initialized from the teacher prompt, while the teacher prompt serves as a stable semantic anchor. To balance transferability and discrimination, we construct mixed prompts for different tasks. For known-class recognition, we use

$$P^{K} = \omega_{K} P^{s} + (1 - \omega_{K}) P^{t}, \quad \omega_{K} \in (0, 1), \tag{5}$$

where $P^{K}$ denotes the mixed prompt for known-class classification and $\omega_{K}$ is the weight coefficient for the known classes. For unknown-related semantics, we use

$$P^{U} = \omega_{U} P^{s} + (1 - \omega_{U}) P^{t}, \quad \omega_{U} \approx 0, \tag{6}$$

where $P^{U}$ is the mixed prompt for open-set rejection. Because $\omega_{U}$ is close to zero, $P^{U}$ is dominated by the teacher prompt, which preserves more transferable semantics for rejecting unseen objects.
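The mixing in Eqs. (5) and (6) is a convex combination of the student and teacher context vectors. The following sketch shows it for a toy prompt with two context vectors of dimension two. The weights 0.5 and 0.05 are illustrative assumptions, since the text only states that $\omega_{K} \in (0, 1)$ and $\omega_{U} \approx 0$.

```python
def mix_prompts(student, teacher, omega):
    """Element-wise omega * P^s + (1 - omega) * P^t over M x d context vectors."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return [
        [omega * s + (1.0 - omega) * t for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(student, teacher)
    ]

# Toy prompts with M = 2 context vectors of dimension d = 2 (illustrative values).
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.0, 2.0], [2.0, 0.0]]

p_known = mix_prompts(student, teacher, omega=0.5)     # Eq. (5), assumed weight
p_unknown = mix_prompts(student, teacher, omega=0.05)  # Eq. (6), weight near zero
```

With a weight near zero, the unknown-side prompt stays within a small distance of the teacher prompt, matching the stated intent of preserving transferable semantics.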

Unseen semantic matrix $W^{u}$. To connect CDP to the downstream synthesis branch in Figure 2, the mixed prompt for unknown-side semantics induces a text-embedding matrix

$$W^{u} = [t_{c}(P^{U})]_{c \in C_{u}} \in \mathbb{R}^{|C_{u}| \times d}, \tag{7}$$

where $C_{u}$ denotes an auxiliary set of unseen semantic anchors used for conditional feature synthesis. This matrix acts as the semantic condition passed to the synthesis module.

Hard negative objects with mismatch control. As shown in the upper half of Figure 5, this process corresponds to the hard negative sample module within the CDP optimizer. For a foreground RoI $r_{i}$ with base label $y_{i} \in C_{B}$, teacher-side scores over base classes are computed as

$$q_{i}(c) = \frac{\exp(s(R_{i}, t_{c}(P^{t})))}{\sum_{k \in C_{B}} \exp(s(R_{i}, t_{k}(P^{t})))}, \quad c \in C_{B}. \tag{8}$$

The hard negative object set is selected by Top-K filtering

$$N_{i} = \mathrm{TopK}(q_{i}(\cdot), K) \setminus \{y_{i}\}, \tag{9}$$

and duplicate negatives are removed. A key issue in detection is mismatch noise: a negative class label does not guarantee that the current RoI truly contains an instance of that class. To reduce this noise, each negative object $c^{-} \in N_{i}$ is paired with a matched negative region $v^{-}(c^{-})$ sampled from the training pool using ground-truth boxes (or high-IoU RoIs) of class $c^{-}$. This yields matched region–text pairs, rather than text-only negatives. The constructed contrastive set has size $L \leq bK$, where $b$ is the number of positive anchors in the mini-batch.
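The Top-K selection in Eq. (9) can be sketched as follows: rank base classes by teacher score, keep the top K, and drop the ground-truth label. The class names and scores are made-up illustrations, and the matched-region sampling step is omitted.

```python
def hard_negatives(teacher_scores, label, k):
    """Eq. (9): the k highest-scoring classes with the ground-truth label removed."""
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    return [c for c in ranked[:k] if c != label]

# Illustrative teacher scores q_i(c) for one foreground RoI.
scores = {"cat": 0.50, "dog": 0.30, "horse": 0.15, "boat": 0.05}

negs_cat = hard_negatives(scores, label="cat", k=3)    # label is inside the top 3
negs_boat = hard_negatives(scores, label="boat", k=3)  # label is outside the top 3
```

Because the label is removed after the Top-K cut, each anchor contributes either K − 1 or K negatives, which is consistent with the bound $L \leq bK$ on the contrastive set size.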

Feature filtering for stable text distributions. Let $T_{B}(P^{s}) = [t_{c}(P^{s})]_{c \in C_{B}} \in \mathbb{R}^{|C_{B}| \times d}$. Row-wise $\ell_{2}$ normalization is applied to keep the global text feature scale stable:

$$\hat{T}_{B} = \mathrm{norm}(T_{B}(P^{s})). \tag{10}$$

For each anchor $i$, only $\{y_{i}\} \cup N_{i}$ are retained when forming contrastive logits.

Symmetric contrastive loss. Let $B = \{(R_{\ell}, t_{\ell})\}_{\ell = 1}^{L}$ be the matched region–text pairs built from positives and their mined hard negatives, where $t_{\ell}$ is computed with the student prompt. The symmetric InfoNCE loss is

$$L_{cl} = -\frac{1}{2L} \sum_{\ell = 1}^{L} \left[ \log \frac{\exp(s(R_{\ell}, t_{\ell}))}{\sum_{j = 1}^{L} \exp(s(R_{\ell}, t_{j}))} + \log \frac{\exp(s(t_{\ell}, R_{\ell}))}{\sum_{j = 1}^{L} \exp(s(t_{\ell}, R_{j}))} \right]. \tag{11}$$

Only $P^{s}$ is updated; both encoders remain frozen.
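To make the two directions of Eq. (11) concrete, the sketch below computes the symmetric InfoNCE loss for a small batch of matched region–text pairs in plain Python. The toy vectors and the temperature 0.1 are assumptions for illustration. A real implementation would operate on encoder outputs and backpropagate only into the student prompt.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_diag_log_softmax(sim):
    """Average over rows of the log-softmax value at the matched (diagonal) entry."""
    total = 0.0
    for i, row in enumerate(sim):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += row[i] - log_norm
    return total / len(sim)

def symmetric_info_nce(regions, texts, tau=0.1):
    """Eq. (11): average of region-to-text and text-to-region InfoNCE terms."""
    n = len(regions)
    r2t = [[cosine(r, t) / tau for t in texts] for r in regions]  # s(R_l, t_j)
    t2r = [[r2t[j][i] for j in range(n)] for i in range(n)]       # s(t_l, R_j)
    return -0.5 * (mean_diag_log_softmax(r2t) + mean_diag_log_softmax(t2r))

regions = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(regions, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce(regions, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each region is closest to its own text embedding and grows when the pairing is swapped, which is the behavior the contrastive objective relies on.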

Total objective of CDP. CDP is trained as an auxiliary objective together with standard detector losses:

$$L_{CDP} = L_{\det} + \lambda_{cl} L_{cl}, \tag{12}$$

where $L_{\det}$ denotes the usual two-stage detection loss (classification over known classes and background, plus box regression), and $\lambda_{cl}$ balances prompt contrastive learning.

3.4. Synthesized Monument Module

SMM is applied after CDP has stabilized the region–text interface. Its role is twofold: it synthesizes region embeddings conditioned on CDP-induced class semantics and regularizes the feature space with class-centered monuments. This allows known classes to remain compact, and ambiguous proposals are less likely to be absorbed into $C_{K}$ at test time. As illustrated in Figure 6, SMM is trained in a two-stage manner: region embeddings are first pooled from detector proposals and then synthesized by a conditional WGAN using stochastic latents and semantic prototypes. During training, intra-class semantic diverging encourages diverse synthesis under a fixed condition, whereas inter-class structure preserving enlarges class margins by contrasting synthetic embeddings against real foreground and background proposals.

Semantic conditions from CDP (base, novel, and unseen anchors). The FS-OSOD label space contains known classes $C_{K} = C_{B} \cup C_{N} = \{1, \dots, K\}$ in training, where $C_{B}$ are data-abundant base classes and $C_{N}$ are $M$-shot novel classes. Unknown categories appear only at test time and are mapped to $C_{U} = \{K + 1\}$, while background is $C_{BG} = \{K + 2\}$. CDP provides text prototypes used as synthesis conditions:

$$W^{s} = [w_{c}^{s}]_{c \in C_{K}}, \quad w_{c}^{s} = \begin{cases} t_{c}(P^{B}), & c \in C_{B}, \\ t_{c}(P^{N}), & c \in C_{N}, \end{cases} \tag{13}$$

and an auxiliary unseen-anchor set $W^{u}$.

RoI embedding. For the $i$-th proposal, let the RoI feature map be $R_{i}$ and obtain a vector embedding by average pooling:

$$r_{i} = \mathrm{AvgPool}(R_{i}) \in \mathbb{R}^{d}. \tag{14}$$

Stage 1: Conditional synthesis on known proposals (WGAN-GP). We use a Wasserstein generative adversarial network with a gradient penalty (WGAN-GP), which typically trains more stably and is less prone to mode collapse than a standard GAN. The generator produces a synthetic embedding $\tilde{r} = G(w, z)$ given a semantic condition $w$ and noise $z \sim N(0, I)$, and a critic outputs $D(r, w)$. Stage 1 trains the synthesizer using real proposal embeddings from known classes $c \in C_{K}$ (including $M$-shot novel instances) with conditions $w_{c}^{s} \in W^{s}$. The WGAN-GP losses are

$$L_{D} = \mathbb{E}_{\tilde{r}}[D(\tilde{r}, w)] - \mathbb{E}_{r}[D(r, w)] + \xi \, \mathbb{E}_{\hat{r}}\left[ (\lVert \nabla_{\hat{r}} D(\hat{r}, w) \rVert_{2} - 1)^{2} \right], \tag{15}$$

$$L_{G} = -\mathbb{E}_{\tilde{r}}[D(\tilde{r}, w)] + \eta L_{cls}, \tag{16}$$

where $\hat{r} = \mu r + (1 - \mu) \tilde{r}$ and $\mu \sim U(0, 1)$. $L_{cls}$ enforces semantic faithfulness by requiring a fixed classifier to predict the conditioning label for $\tilde{r}$.
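The arithmetic of Eqs. (15) and (16) can be illustrated without a deep learning framework. The sketch below treats the critic as an arbitrary Python callable and estimates the gradient norm at the interpolated points by central finite differences, which is an assumption made purely for illustration. A practical implementation would use automatic differentiation. The linear toy critic, the penalty weight 10, and the classification loss value 0.3 are also assumptions.

```python
import math
import random

def grad_norm(critic, r, w, eps=1e-5):
    """L2 norm of dD/dr at r, estimated with central finite differences."""
    squared = 0.0
    for i in range(len(r)):
        up = r[:i] + [r[i] + eps] + r[i + 1:]
        down = r[:i] + [r[i] - eps] + r[i + 1:]
        g = (critic(up, w) - critic(down, w)) / (2 * eps)
        squared += g * g
    return math.sqrt(squared)

def critic_loss(critic, real, fake, w, xi=10.0, rng=random):
    """Eq. (15): E[D(fake)] - E[D(real)] + xi * E[(||grad D(r_hat)|| - 1)^2]."""
    n = len(real)
    penalty = 0.0
    for r, f in zip(real, fake):
        mu = rng.random()  # mu ~ U(0, 1)
        r_hat = [mu * a + (1 - mu) * b for a, b in zip(r, f)]
        penalty += (grad_norm(critic, r_hat, w) - 1.0) ** 2
    mean_fake = sum(critic(f, w) for f in fake) / n
    mean_real = sum(critic(r, w) for r in real) / n
    return mean_fake - mean_real + xi * penalty / n

def generator_loss(critic, fake, w, cls_loss, eta=1.0):
    """Eq. (16): -E[D(fake)] + eta * L_cls, with L_cls supplied by the caller."""
    return -sum(critic(f, w) for f in fake) / len(fake) + eta * cls_loss

# Toy linear critic D(r, w) = <r, w>; its gradient w.r.t. r is w, with norm 1 here.
linear_critic = lambda r, w: sum(a * b for a, b in zip(r, w))
w = [1.0, 0.0]
loss_d = critic_loss(linear_critic, real=[[1.0, 0.0]], fake=[[0.0, 0.0]], w=w)
loss_g = generator_loss(linear_critic, fake=[[0.0, 0.0]], w=w, cls_loss=0.3)
```

In this toy case the gradient penalty vanishes because the critic has unit gradient norm everywhere, so the critic loss reduces to the difference of the two expectations.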

Intra-class semantic diverging $L_{Sd}$. To encourage diversity under a fixed condition $w$, sample

$$z^{+} = z + \rho, \quad \rho \sim U[-r, r], \quad z_{k}^{-} \sim \{z' \mid \lVert z' - z \rVert_{2} > r\}, \quad k = 1, \dots, K. \tag{17}$$

Let $\tilde{r} = G(w, z)$, $\tilde{r}^{+} = G(w, z^{+})$, and $\tilde{r}_{k}^{-} = G(w, z_{k}^{-})$. Using cosine similarity $\mathrm{sim}(\cdot, \cdot)$ with $\ell_{2}$-normalized features,

$$L_{Sd} = -\log \frac{\exp(\mathrm{sim}(\tilde{r}, \tilde{r}^{+}) / \tau_{1})}{\exp(\mathrm{sim}(\tilde{r}, \tilde{r}^{+}) / \tau_{1}) + \sum_{k = 1}^{K} \exp(\mathrm{sim}(\tilde{r}, \tilde{r}_{k}^{-}) / \tau_{1})}. \tag{18}$$
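The latent sampling in Eq. (17) and the loss in Eq. (18) can be sketched as below. Negative latents are drawn by rejection sampling from a standard normal, which is one possible reading of the sampling rule and an assumption here. The radius 0.1, the temperature 0.5, and the toy embeddings (used in place of generator outputs) are illustrative.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sample_latents(z, radius, num_neg, rng):
    """Eq. (17): one nearby positive latent and num_neg latents farther than radius."""
    z_pos = [zi + rng.uniform(-radius, radius) for zi in z]
    negatives = []
    while len(negatives) < num_neg:
        cand = [rng.gauss(0.0, 1.0) for _ in z]  # assumed proposal distribution
        if math.dist(cand, z) > radius:
            negatives.append(cand)
    return z_pos, negatives

def diverging_loss(anchor, positive, negatives, tau):
    """Eq. (18): InfoNCE over one positive and the sampled negatives."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

rng = random.Random(0)
z = [0.0, 0.0, 0.0, 0.0]
z_pos, z_negs = sample_latents(z, radius=0.1, num_neg=3, rng=rng)

# Toy embeddings in place of G(w, z), G(w, z+), and G(w, z-).
loss = diverging_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=0.5)
```

In the full method the three embeddings would come from the generator under the same condition $w$, so the loss pushes outputs from distant latents apart while keeping nearby latents close.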

Inter-class structure preserving $L_{Sp}$. To keep synthetic embeddings aligned with detection-time proposal statistics, construct a hybrid pool

$$G = \{\tilde{r}\} \cup \{r^{fg}\} \cup \{r^{bg}\}, \tag{19}$$

where $r^{fg}$ are IoU-matched foreground proposals from $C_{K}$, and $r^{bg}$ are low-IoU background proposals from $C_{BG}$. For the current $\tilde{r}$, choose a positive $g^{+} \in G$ from the same known class, and negatives $\Omega = \{g \in G \mid y(g) \neq y(\tilde{r})\}$:

$$L_{Sp} = -\log \frac{\exp(\mathrm{sim}(\tilde{r}, g^{+}) / \tau_{2})}{\exp(\mathrm{sim}(\tilde{r}, g^{+}) / \tau_{2}) + \sum_{g \in \Omega} \exp(\mathrm{sim}(\tilde{r}, g) / \tau_{2})}. \tag{20}$$
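A minimal sketch of Eq. (20) follows, with the hybrid pool represented as a list of (embedding, label) pairs. Taking the first same-class entry as the positive is an assumption, since the text only says that a positive is chosen. The labels, vectors, and temperature are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def structure_preserving_loss(synthetic, label, pool, tau=0.1):
    """Eq. (20): one same-class positive against all pool entries with other labels."""
    positives = [e for e, y in pool if y == label]
    negatives = [e for e, y in pool if y != label]
    if not positives:
        raise ValueError("pool has no entry for the conditioning class")
    pos = math.exp(cosine(synthetic, positives[0]) / tau)  # assumed choice of g+
    neg = sum(math.exp(cosine(synthetic, e) / tau) for e in negatives)
    return -math.log(pos / (pos + neg))

# Hybrid pool mixing foreground entries and a background entry (toy values).
pool = [([1.0, 0.1], "cat"), ([0.0, 1.0], "dog"), ([-1.0, 0.0], "background")]

close = structure_preserving_loss([1.0, 0.0], "cat", pool)  # near the cat entry
far = structure_preserving_loss([0.0, 1.0], "cat", pool)    # near the dog entry
```

A synthetic embedding conditioned on one class but lying near another class or the background receives a large loss, which is how the term enlarges inter-class margins.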

Stage 2: Monument memory with momentum updates. After Stage 1, unseen-domain embeddings are synthesized by conditioning $G$ on $W^{u}$. SMM maintains two memory banks: a known bank $U^{s} = \{u_{c}^{s}\}_{c \in C_{K}}$ and an unseen-anchor bank $U^{u} = \{u_{a}^{u}\}_{a \in C_{u}}$. $U^{s}$ is initialized from real known proposals (class means) and kept fixed, while $U^{u}$ is initialized from synthesized unseen-domain embeddings and updated with momentum.

Given an embedding

r

with a conditioning index

y \in C_{K} \cup C_{u}

, define the memory softmax

p_{mem} (y ∣ r) = \frac{\exp (sim (r, u_{y}) / τ_{3})}{\sum_{c \in C_{K}} \exp (sim (r, u_{c}^{s}) / τ_{3}) + \sum_{a \in C_{u}} \exp (sim (r, u_{a}^{u}) / τ_{3})} .

(21)

The memory loss is

L_{CE} = - E_{(r, y)} \log p_{mem} (y ∣ r),

(22)

which attracts embeddings to their corresponding monuments and repels them from others.
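The memory softmax of Equation (21) and the loss of Equation (22) can be sketched as follows. The bank contents, the key scheme that tags each monument as "known" or "unseen", and the temperature value are illustrative assumptions rather than details taken from the method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def memory_softmax(r, known_bank, unseen_bank, tau):
    """Eq. (21): softmax over similarities to all known and unseen monuments."""
    logits = {("known", c): cosine(r, u) / tau for c, u in known_bank.items()}
    logits.update({("unseen", a): cosine(r, u) / tau for a, u in unseen_bank.items()})
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {key: math.exp(v - peak) for key, v in logits.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


def memory_ce(batch, known_bank, unseen_bank, tau):
    """Eq. (22): mean negative log-probability of each embedding's own monument."""
    losses = [
        -math.log(memory_softmax(r, known_bank, unseen_bank, tau)[y])
        for r, y in batch
    ]
    return sum(losses) / len(losses)


known_bank = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}   # illustrative monuments
unseen_bank = {"anchor0": [-1.0, 0.0]}
probs = memory_softmax([0.9, 0.1], known_bank, unseen_bank, tau=0.1)
batch = [([0.9, 0.1], ("known", "cat")), ([-0.8, 0.1], ("unseen", "anchor0"))]
loss_ce = memory_ce(batch, known_bank, unseen_bank, tau=0.1)
```

The denominator spans both banks, so raising the probability of the correct monument necessarily lowers the probability assigned to every other monument, known or unseen.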

Only the unseen-anchor bank is updated. For

a \in C_{u}

, let

Q_{a}

be synthesized samples of anchor a in the current iteration and

{\bar{r}}_{a}

their mean:

u_{a}^{u} \leftarrow m u_{a}^{u} + (1 - m) {\bar{r}}_{a}, a \in C_{u} .

(23)
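Equation (23) amounts to an exponential moving average over the per-iteration means of the synthesized samples. The sketch below assumes dictionary-based banks and skips anchors that received no samples in the current iteration; whether the monuments are re-normalized after the update is not stated in the text, so no re-normalization is applied here.

```python
def momentum_update(unseen_bank, synthesized, m=0.9):
    """Eq. (23): u_a <- m * u_a + (1 - m) * mean(Q_a) for each unseen anchor a.

    unseen_bank maps anchor id -> monument vector.
    synthesized maps anchor id -> list of synthesized embeddings Q_a.
    Anchors with no samples in this iteration are left unchanged (assumption).
    """
    for anchor, samples in synthesized.items():
        if not samples:
            continue
        dim = len(samples[0])
        mean = [sum(s[d] for s in samples) / len(samples) for d in range(dim)]
        unseen_bank[anchor] = [
            m * u + (1.0 - m) * r for u, r in zip(unseen_bank[anchor], mean)
        ]
    return unseen_bank


bank = {"anchor0": [1.0, 0.0], "anchor1": [0.0, 1.0]}
batch_samples = {"anchor0": [[0.0, 1.0], [0.0, 3.0]], "anchor1": []}
bank = momentum_update(bank, batch_samples, m=0.9)
# anchor0 moves one tenth of the way toward the sample mean [0.0, 2.0];
# anchor1 is untouched because it received no samples.
```

With m close to 1 the monuments drift slowly, which keeps the unseen anchors stable when individual batches of synthesized embeddings are noisy.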

Stage-wise objectives. The SMM optimization follows a two-stage schedule:

\text{Stage 1}: \quad \min_{G} \max_{D} \; L_{D} + L_{G} + λ_{1} L_{Sd} + λ_{2} L_{Sp},

(24)

\text{Stage 2}: \quad \min_{G} \max_{D} \; L_{D} + L_{G} + λ_{1} L_{Sd} + λ_{2} L_{Sp} + λ_{3} L_{CE} .

(25)

Inference and FS-OSOD decision. For a test proposal, the detector produces logits

ℓ^{\det} (c)

for

c \in C_{K} \cup C_{B G}

. SMM provides

p_{mem} (c ∣ r)

for

c \in C_{K}

and an unseen-anchor mass

p_{mem}^{u} (r) = \max_{a \in C_{u}} p_{mem} (a ∣ r)

. Calibration is performed on known-class logits:

ℓ (c) = ℓ^{\det} (c) + β \log p_{mem} (c ∣ r), c \in C_{K} .

(26)

A proposal is predicted as background if its objectness is below a threshold. Otherwise, it is assigned to

arg \max_{c \in C_{K}} ℓ (c)

when known confidence is high and

p_{mem}^{u} (r)

is small; it is rejected to

C_{U}

when known confidence is low and the unseen-anchor mass is high.
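The decision procedure can be sketched as below. Several points are assumptions of this sketch and are not specified in the text: the three threshold values, the use of a softmax over the calibrated known-class logits as the "known confidence", and the explicit "ambiguous" outcome for the two cases the description leaves open (both scores high, or both low).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decide(objectness, det_logits, p_mem_known, p_mem_unseen,
           beta=1.0, obj_thr=0.5, known_thr=0.7, unseen_thr=0.3, eps=1e-12):
    """Return 'background', a known class name, 'unknown', or 'ambiguous'.

    det_logits and p_mem_known map known class -> detector logit / memory prob.
    p_mem_unseen maps unseen anchor -> memory prob.
    All thresholds are hypothetical values chosen for illustration.
    """
    if objectness < obj_thr:
        return "background"
    classes = sorted(det_logits)
    # Eq. (26): calibrated logit = detector logit + beta * log memory probability.
    calibrated = [
        det_logits[c] + beta * math.log(max(p_mem_known[c], eps)) for c in classes
    ]
    probs = softmax(calibrated)
    best = max(range(len(classes)), key=lambda i: calibrated[i])
    known_conf = probs[best]
    unseen_mass = max(p_mem_unseen.values())  # max over unseen anchors
    if known_conf >= known_thr and unseen_mass <= unseen_thr:
        return classes[best]
    if known_conf < known_thr and unseen_mass > unseen_thr:
        return "unknown"
    return "ambiguous"  # cases not covered by the description


pred_known = decide(0.9, {"cat": 4.0, "dog": 1.0},
                    {"cat": 0.8, "dog": 0.1}, {"anchor0": 0.1})
pred_unknown = decide(0.9, {"cat": 0.2, "dog": 0.1},
                      {"cat": 0.15, "dog": 0.15}, {"anchor0": 0.7})
pred_background = decide(0.2, {"cat": 4.0, "dog": 1.0},
                         {"cat": 0.8, "dog": 0.1}, {"anchor0": 0.1})
```

In a deployed system the ambiguous cases would need an explicit tie-breaking rule; the sketch surfaces them rather than guessing one.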

3.5. Overall Optimization

A two-stage fine-tuning protocol is used to train the FS-OSOD detector. The first stage learns a strong detector on the base split, while the second stage adapts the model to the M-shot novel split under the open-set setting.

Stage I: Base training. Let

θ_{\det}

denote detector parameters (backbone, RPN, RoI heads), and let

θ_{cdp}

and

θ_{smm}

denote the trainable parameters in CDP and SMM, respectively. During base training, the student prompt parameters are updated in CDP, while the teacher prompt and the vision–language encoders remain frozen. SMM is trained jointly according to its stage-wise schedule, while the detector is optimized with standard detection losses. The overall objective is

L_{base} = L_{\det} + λ_{cdp} L_{CDP} + λ_{smm} L_{SMM},

(27)

where

L_{\det}

is the two-stage detection loss (classification over

C_{B} \cup {C_{B G}}

and box regression), and

L_{CDP}

and

L_{SMM}

are the auxiliary objectives defined in Section 3.3 and Section 3.4, respectively.

Stage II: Few-shot adaptation. In the few-shot stage, the detector is fine-tuned using the M-shot novel data and a small replay subset from base classes to mitigate drift. The same auxiliary objectives are retained but re-weighted to avoid over-regularization under scarce supervision:

L_{fs} = L_{\det}^{fs} + α_{cdp} L_{CDP} + α_{smm} L_{SMM} .

(28)

Here,

L_{\det}^{fs}

is defined on

C_{K} \cup {C_{B G}}

with box regression, and

α_{cdp}, α_{smm}

are typically smaller than

λ_{cdp}, λ_{smm}

to reflect the limited amount of novel evidence.

Parameter update policy. Across both stages, CDP updates only the student prompt parameters, while the teacher prompt and the vision–language encoders are frozen. SMM updates its synthesizer parameters and its memory container following the stage-wise schedule described in Section 3.4. Unless stated otherwise, all detector parameters are fine-tuned in both stages.

Implementation details for optimization. We adopt a two-stage fine-tuning protocol. In each iteration, we first run the detector forward pass to obtain proposals/RoI features and optimize the standard two-stage detection loss (classification and box regression). We then compute the CDP loss by constructing mixed prompts from the frozen teacher prompt and the learnable student prompt, encoding RoIs and class texts with frozen vision–language encoders, and performing teacher-guided Top-K hard-negative mining with mismatch-aware negative RoI matching (we use

K = 8

). SMM is trained jointly in a stage-wise manner: we sample conditioning vectors from known classes and unseen anchors, synthesize region embeddings, update the conditional WGAN-GP by alternating critic and generator updates, and momentum-update the unseen monument memory. Stage II uses the M-shot novel set mixed with a replay subset from base classes to mitigate drift; across both stages, only the student prompt parameters are updated in CDP, while the teacher prompt and the vision–language encoders remain frozen. Following the sensitivity analysis, we set the SMM loss weights to

λ_{1} = 10^{- 3}

,

λ_{2} = 10^{- 3}

, and

λ_{3} = 10^{- 2}

, and we use

β = 1.0

for memory-score calibration; for the overall loss weights, we use

λ_{cdp} = 0.5

,

λ_{smm} = 0.1

,

α_{cdp} = 0.1

, and

α_{smm} = 0.05

. Unless otherwise specified, we use SGD with a learning rate of

2 \times 10^{- 4}

in Stage I and

1 \times 10^{- 4}

in Stage II.
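For reference, the stage-specific weights and learning rates reported above can be collected in a small configuration sketch. The class and field names are hypothetical; only the numeric values come from the text, and the function simply evaluates the weighted sums of Equations (27) and (28) for given scalar loss values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Per-stage loss weights and learning rate (field names are hypothetical)."""
    name: str
    w_cdp: float   # lambda_cdp in Stage I, alpha_cdp in Stage II
    w_smm: float   # lambda_smm in Stage I, alpha_smm in Stage II
    lr: float


STAGE_I = StageConfig(name="base", w_cdp=0.5, w_smm=0.1, lr=2e-4)
STAGE_II = StageConfig(name="few-shot", w_cdp=0.1, w_smm=0.05, lr=1e-4)

# SMM-internal weights and the memory calibration factor reported in the text.
SMM_WEIGHTS = {"lambda1": 1e-3, "lambda2": 1e-3, "lambda3": 1e-2}
BETA = 1.0


def total_loss(cfg, l_det, l_cdp, l_smm):
    """Eq. (27) for Stage I and Eq. (28) for Stage II."""
    return l_det + cfg.w_cdp * l_cdp + cfg.w_smm * l_smm


# Illustrative scalar losses: the same raw values give about 2.4 in Stage I
# and about 1.4 in Stage II because the auxiliary terms are down-weighted.
loss_stage1 = total_loss(STAGE_I, l_det=1.0, l_cdp=2.0, l_smm=4.0)
loss_stage2 = total_loss(STAGE_II, l_det=1.0, l_cdp=2.0, l_smm=4.0)
```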

Training details in our method: As summarized in Figure 3 and Algorithm 1, GPMN is trained under a two-stage schedule. Stage I performs base training to establish the shared detector/VLM embedding space and initialize the semantic and memory components, while Stage II performs few-shot and unknown-class adaptation using novel data with base replay. Within each iteration, the process follows a fixed order. First, the detector branch samples a mini-batch, extracts RoI proposals/features, and computes the standard detection losses. Second, CDP mixes the frozen teacher and trainable student prompts to encode region and text embeddings, mines teacher-guided Top-K hard negatives, matches them with true negative RoIs, and applies symmetric InfoNCE to update the student-side semantic interface. Third, SMM proceeds in two stages: conditional synthesis generates seen/unseen-like region embeddings from semantic conditions, and monument-memory regularization updates the unseen monuments and constrains feature geometry to improve seen–unseen separation. Finally, all trainable parameters are jointly optimized by the weighted loss at each iteration. After training, inference fuses detector logits with memory-based similarities to produce calibrated known/unknown/background predictions.
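The teacher-guided Top-K mining step in CDP can be illustrated with the following sketch. It reflects one plausible reading of the description, not the authors' code: the teacher similarities are given as a dictionary, the hard negatives are the K most similar classes other than the ground truth, and each hard-negative class is matched to the first RoI in the batch that truly carries that label.

```python
import heapq


def top_k_hard_negatives(teacher_sims, gt_class, k=8):
    """Return the k classes (excluding the ground truth) that the frozen
    teacher prompt scores as most similar to the RoI."""
    candidates = [(c, s) for c, s in teacher_sims.items() if c != gt_class]
    hardest = heapq.nlargest(k, candidates, key=lambda cs: cs[1])
    return [c for c, _ in hardest]


def match_negative_rois(hard_negatives, batch_rois):
    """For each hard-negative class, pick the first RoI in the batch whose
    label is that class, so the negative is a true mismatch (assumption)."""
    matched = {}
    for roi_id, label in batch_rois:
        if label in hard_negatives and label not in matched:
            matched[label] = roi_id
    return matched


# Illustrative teacher similarities for one RoI whose ground truth is "cat".
teacher_sims = {"cat": 0.9, "dog": 0.8, "horse": 0.7, "car": 0.1, "bus": 0.05}
hard = top_k_hard_negatives(teacher_sims, gt_class="cat", k=2)

batch_rois = [("roi0", "dog"), ("roi1", "car"), ("roi2", "horse"), ("roi3", "dog")]
matched = match_negative_rois(hard, batch_rois)
```

Here the teacher singles out "dog" and "horse" as the classes most easily confused with "cat", and only RoIs labeled with those classes are used as negatives in the contrastive term.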

Limitation: This study focuses on the standard FS-OSOD setup, which employs a two-stage detector and region-of-interest (RoI)-level feature learning, aiming to simultaneously enhance robust recognition of known classes and reliable rejection of unknown objects in the face of scarce supervised data. The framework intentionally employs teacher–student prompt decoupling, staged optimization, and WGAN-guided feature synthesis to enhance semantic transfer and the ability to distinguish between seen and unseen objects. While these design choices contribute to the robustness of the FS-OSOD task, they also result in a training process that is slightly more complex than that of simpler baseline models.

Algorithm 1. Training and inference workflow of GPMN (pseudo-code aligned with the workflow in Figure 3).

Input: A stage-specific mini-batch ${(x_{i}, y_{i})}$ sampled from $D_{s}$ , where $s \in {Stage I, Stage II}$ . Here, Stage I uses base data, and Stage II uses few-shot novel data with base replay. Each image $x_{i} \in X$ is associated with instance annotations $y_{i} = {(c, b)}$ (class label c and bounding box b).

(A) FS-OSOD setting and batch sampling

1: ${(x_{i}, y_{i})} \leftarrow SampleBatch (D_{s})$ // Stage I: Base training. Stage II: Few-shot + replay.
2: $G \leftarrow {known detection, unknown rejection, background separation}$ // FS-OSOD target.

(B) Shared detector and VLM space

3: $R \leftarrow RoIExtract ({x_{i}}; θ)$ // proposals, RoI features, detector logits.
4: $L_{\det} \leftarrow DetLoss (R, {y_{i}})$ // classification + box regression.
5: $V \leftarrow RegionEncode (R; f_{I})$ // frozen CLIP image encoder.

(C) CDP: Prompt-level semantic stabilization

6: $(P^{K}, P^{U}) \leftarrow MixPrompt (P^{s}, P^{t})$ // teacher–student prompt mixing.
7: $T \leftarrow TextEncode (P^{K}; f_{T})$ // frozen CLIP text encoder.
8: $N \leftarrow HardNegMining (V; P^{t})$ // teacher-guided Top-K hard negatives.
9: $V^{-} \leftarrow MatchNegRegions (N)$ // mismatch-aware true negative RoIs.
10: $L_{CL} \leftarrow SymContrast (V, T, N, V^{-})$ // contrastive distillation.
11: $W \leftarrow BuildConditions (T, P^{U})$ // stable similarity space + unseen semantic conditions.

(D) SMM Stage 1: Conditional feature synthesis

12: $\tilde{V} \leftarrow Synthesize (G, W)$ // generate seen/unseen-like region embeddings.
13: $(G, D) \leftarrow WGANUpdate (G, D, \tilde{V}, W)$ // conditional WGAN-GP update.

(E) SMM Stage 2: Monument memory regularization

14: $U^{u} \leftarrow MomentumUpdate (U^{u}, \tilde{V}, W)$ // update unseen monuments.
15: $L_{SMM} \leftarrow MemoryReg (\tilde{V}, U^{s}, U^{u})$ // monument-memory regularization.
16: $S_{mem} \leftarrow MemoryScore (V, U^{s}, U^{u})$ // memory-based similarity scores.

(F) Overall training schedule and joint optimization

17: Update trainable parameters by minimizing $L_{\det} + λ_{CL} L_{CL} + λ_{SMM} L_{SMM}$.
18: Repeat under Stage I (base training) and Stage II (few-shot + replay).

4. Experiments

4.1. Overview

This section is organized as follows. Section 4.2 introduces the experimental settings, including datasets, implementation details, evaluation protocol, and baselines. Section 4.3 reports quantitative comparisons with state-of-the-art methods. Section 4.4 presents ablation studies to analyze the contributions of different components. Section 4.5 provides qualitative results and further discussion.

4.2. Experimental Detail

4.2.1. Datasets

We follow the standard FS-OSOD benchmark splits, including VOC10-5-5, VOC-COCO, and COCO-RoadAnomaly [42], as introduced in [3] for a fair comparison with prior methods. For VOC10-5-5, the dataset contains 10 base classes, 5 novel classes, and 5 unknown classes split from the PASCAL VOC label space [43]. The base training data consist of VOC07trainval and VOC12trainval, with annotations retained only for the base classes. Each novel class includes 1, 3, 5, and 30 shot objects extracted from VOC07trainval and VOC12trainval, and VOC07test is used as the testing set. For VOC-COCO, 20 classes from PASCAL VOC are used as base classes, and 20 classes from the 60 MS-COCO categories that do not intersect with PASCAL VOC are chosen as novel classes, while the remaining 40 categories are treated as unknown. The base training data again consist of VOC07trainval and VOC12trainval. Each novel class includes 1, 5, 10 and 30 shot objects sampled from COCO2017train, with COCO2017val serving as the testing set. For COCO-RoadAnomaly, rare or safety-critical categories are regarded as unknown, and the remaining categories are split into base and novel sets; this benchmark is mainly employed to test the generalization ability of our framework in open-set road scenes.

4.2.2. Setup

We employ ResNet-50 [44] (pre-trained with RegionCLIP [7]) as the image encoder, and ResNet-50 (pre-trained on ImageNet) as the RPN image encoder. The detector is trained with a two-stage strategy (base training and few-shot fine-tuning) [24]. On top of this base framework, the contrastive distilled prompts (CDP) module is applied to the vision–language branch to optimize region–text alignment, while the synthesized monument module (SMM, also denoted MEM below) is attached to the detection head as a class-centered memory module. Stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of

5 \times 10^{- 5}

is adopted for optimization, with a batch size of 1 image on a single RTX-class GPU. The learning rate is set to 0.0002 during the base training stage and 0.0001 for fine-tuning. Following RegionCLIP [7], the loss weight for the background class is fixed to 0.2, and a focal scaling strategy with parameter 0.5 is employed. For CDP, the prompt length and context initialization follow CoOp [11] with a context length of 16. For MEM, the monument dimension is set equal to the visual feature dimension, and the momentum coefficient is fixed to 0.9 unless otherwise specified.

4.2.3. Evaluation Metrics

For FS-OSOD evaluation, we report the mean average precision (

m AP

) of known classes (

m {AP}_{K}

) and novel classes (

m {AP}_{N}

) as closed-set metrics. For unknown-class metrics, we adopt the recall of unknown objects (

R_{U}

) and the average recall of unknown objects (

A R_{U}

) as in [31]. Furthermore, we report the wilderness impact (

W I

) under a recall level of 0.8 to measure the degree of unknown objects misclassified as known classes:

W I = \frac{P_{K}}{P_{K \cup U}} - 1,

(29)

where

P_{K}

and

P_{K \cup U}

denote the precision computed without and with unknown instances, respectively. In addition, absolute open-set error (

A\text{-}OSE

) is used to count the number of unknown objects misclassified as known [5], which directly reflects the tightness of the decision boundary between known and unknown classes.
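The two open-set risk indicators can be computed from detection counts as in the sketch below. The decomposition of false positives into a closed-set part and an unknown-induced part, and the label strings "unknown" and "background", are assumptions of this illustration; in the benchmark the counts would be taken at the operating point with recall 0.8.

```python
def wilderness_impact(tp_known, fp_closed, fp_open):
    """Eq. (29): WI = P_K / P_{K∪U} - 1.

    tp_known:  true positives on known classes
    fp_closed: false positives that arise without any unknown instances
    fp_open:   additional false positives caused by unknown instances
    """
    if tp_known == 0:
        raise ValueError("precision is undefined without true positives")
    p_k = tp_known / (tp_known + fp_closed)
    p_ku = tp_known / (tp_known + fp_closed + fp_open)
    return p_k / p_ku - 1.0


def absolute_open_set_error(pred_labels, true_labels,
                            unknown="unknown", background="background"):
    """A-OSE: number of unknown objects that are assigned to a known class."""
    return sum(
        1
        for pred, true in zip(pred_labels, true_labels)
        if true == unknown and pred not in (unknown, background)
    )


# Illustrative counts: 80 TP, 20 closed-set FP, 10 FP induced by unknowns.
wi = wilderness_impact(tp_known=80, fp_closed=20, fp_open=10)   # about 0.1

preds = ["cat", "unknown", "dog", "background"]
truth = ["unknown", "unknown", "unknown", "unknown"]
aose = absolute_open_set_error(preds, truth)
```

Algebraically, WI reduces to fp_open / (tp_known + fp_closed), so it is zero exactly when unknown objects produce no additional false positives.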

4.2.4. Baselines

We compare the proposed GPMN framework with three groups of baselines. (i) Few-shot object detection methods, including TFA [24], FSCE [25], and DeFRCN [26], which are evaluated under the standard FSOD setting and serve as strong closed-set baselines. (ii) Open-set and open-world detectors, including DS [45], ORE [27], PROSER [46], and OpenDet [29], which explicitly model unknown classes but do not exploit vision–language prompts. (iii) FS-OSOD methods, including FOOD [3] and FOODv2 [4], which jointly consider few-shot learning and unknown rejection. For methods with officially reported FS-OSOD results, we directly adopt the numbers from the original papers. In addition, OpenDet [29], FOODv2 [4], and CED-FOOD [20] are re-implemented within our detection framework, together with their variants enhanced by distilled prompts or MEM. All methods are trained and evaluated under the same FS-OSOD protocol and data splits for a fair comparison.

4.3. Experimental Analysis

4.3.1. Analysis of Experimental Results Under Different Shot Settings on VOC-COCO

Table 4 reports the few-shot open-set detection performance measured by the mean AP on known/novel classes

(m A P_{K} / m A P_{N})

, unknown recall metrics

(R_{U} / A R_{U})

, and open-set risk indicators

(W I / A O S E)

where lower is better. Across all shot regimes, our method consistently improves recognition of both known and novel categories. In particular, compared with the strongest baseline CED-FOOD, our approach increases

m A P_{K}

by +3.02 to +5.47 and

m A P_{N}

by +3.31 to +8.48 from 1-shot to 30-shot, demonstrating stronger few-shot transfer and reduced base-class bias. For unknown detection,

R_{U}

remains competitive and is often slightly improved, indicating that the gain on known/novel does not come at the expense of unknown rejection. Regarding open-set errors, our method achieves lower AOSE than CED-FOOD in moderate/high-shot settings, suggesting fewer unknown instances are misclassified as known categories. Although

W I

is not always the lowest, the overall results reveal a more favorable balance among known accuracy, novel generalization, and unknown robustness under the FOOD protocol.

4.3.2. Analysis of Experimental Results on VOC10-5-5 Under Different Shot Settings

Table 5 summarizes few-shot open-set detection performance on VOC10-5-5, reporting mean AP for known/novel classes

(m A P_{K} / m A P_{N})

, unknown recall metrics

(R_{U} / A R_{U})

, and open-set risk indicators. Overall, the proposed method achieves consistent gains in closed-set accuracy while maintaining competitive unknown rejection. Compared with the strongest baseline CED-FOOD, our approach yields higher

(m A P_{K} / m A P_{N})

across all shot regimes, indicating improved transfer to novel categories and reduced base-class overfitting. Notably, the improvements are evident in low-shot conditions, where scarce supervision typically causes unstable class boundaries, suggesting that our model better exploits limited annotations to form separable representations. For unknown detection,

R_{U}

and

A R_{U}

remain comparable to or slightly below the best-performing baseline in some settings; however, this does not translate into increased open-set errors. In contrast,

A O S E

remains well controlled across shots, implying fewer unknown instances are misclassified as known categories. Although

W I

is not always minimal, the overall trend demonstrates a favorable balance between known/novel recognition and open-set robustness under the challenging VOC10-5-5 protocol.

4.3.3. Analysis of Computational Resources on VOC10-5-5

Table 6 indicates that our method is computationally efficient. The training cost is on par with CED-FOOD. The inference time is comparable to most baselines and dramatically faster than DS. With a moderate parameter count, our method offers a strong efficiency–performance trade-off.

4.4. Ablation Studies

Table 7 presents an ablation study that quantifies the individual and joint contributions of CDP and SMM on the FS-OSOD benchmark. Starting from the CED-FOOD baseline, incorporating CDP consistently improves both

m A P_{K}

and

m A P_{N}

, indicating that teacher–student prompt decoupling and teacher-guided hard negatives enhance base discrimination while preserving transferability to novel classes. Adding SMM yields the most pronounced reduction in open-set errors, as reflected by the decreased

A O S E

, suggesting fewer unknown instances are misclassified as known categories due to the class-centered memory prior and momentum-updated monuments. When CDP and SMM are jointly enabled, the model achieves the best overall trade-off across known/novel accuracy and unknown robustness, validating their complementarity.

Figure 7 reports the sensitivity analysis of the hyper-parameters used in our experiments, including

λ_{1}, λ_{2}, λ_{3}, β

and the loss weights

λ_{cdp}, λ_{smm}, α_{cdp}, α_{smm}

. Using AR_{U} as the metric, the results show that performance is stable across a wide range of values. The best AR_{U} is obtained at

λ_{1} = 10^{- 3}

,

λ_{2} = 10^{- 3}

,

λ_{3} = 10^{- 2}

and

β \approx 1.0

(with comparable results at

β = 1.5

). For the overall objectives, moderate auxiliary weights work best, with

λ_{cdp} = 0.5

,

λ_{smm} = 0.1

,

α_{cdp} = 0.1

and

α_{smm} = 0.05

.

4.5. Visualized Results

4.5.1. Visualization of Test Results in Figure 8

Figure 8 compares qualitative detection results under the FOOD setting. FR-CNN tends to over-detect background or confuse unknown instances with known categories. For example, in the first row, the unseen elephant is misclassified as a known class (e.g., “cow”) with high confidence by FR-CNN, indicating strong closed-set bias. OpenDet and CED-FOOD alleviate this issue but still produce false positives on ambiguous regions. Our method yields cleaner predictions with more accurate localization and more reliable unknown labeling, which prevents unseen objects from being absorbed into the known label space and indicates improved open-set calibration.

Figure 8. Visualization of different methods’ detection results.

4.5.2. CAM Maps for Different Detectors in Figure 9

Figure 9 shows class activation maps (CAMs) for the different detectors, where warmer colors indicate a higher response to foreground regions. FR-CNN exhibits diffuse and background-dominated activations, implying weak localization under open-set settings. Such background-biased responses are often correlated with over-confident assignment of unseen objects to known categories (e.g., the elephant case). OpenDet and CED-FOOD partially concentrate on salient objects but still show spurious responses on surrounding context, which may induce misclassification of unknown regions. In contrast, our method produces more compact and object-aligned heat patterns across diverse scenes, highlighting discriminative parts while suppressing background interference. This behavior is consistent with improved open-set calibration: known objects receive stronger, sharper activations, whereas ambiguous regions are less likely to trigger over-confident predictions.

Figure 9. Visualization of CAM results.

5. Conclusions

This work presented a guided prompt–monument network (GPMN) for few-shot open-set object detection, aiming at robust recognition and unknown rejection under data-scarce open-world conditions. A contrastive distilled prompts (CDP) module restructured the prompt space into teacher–student branches, alleviating optimization conflicts between base and novel/unknown categories while preserving few-shot transferability. A synthesized monument module (SMM) maintained class-centered prototypes with momentum updates and non-parametric scoring, tightening distributional boundaries between seen and unseen classes while stabilizing predictions under strong background clutter. Extensive experiments on the VOC10-5-5 and VOC-COCO benchmarks showed consistent improvements in few-shot mAP on known and novel classes while keeping unknown recall competitive with representative FS-OSOD baselines, and ablations confirmed the complementary benefits of CDP and SMM within GPMN.

Future: Future work will focus on developing lighter-weight and more optimization-efficient variants of GPMN to reduce the training overhead introduced by staged prompt adaptation and WGAN-guided feature synthesis. We will also investigate more robust calibration strategies to further improve the balance between closed-set recognition and unknown-object rejection under few-shot open-set conditions. In addition, extending the framework to broader deployment-oriented scenarios will be an important direction for future study.

Author Contributions

Conceptualization, H.C. and Y.C.; methodology, H.C. and Y.C.; software, H.C. and Y.C.; validation, H.C. and Y.C.; formal analysis, H.C. and Y.C.; investigation, H.C. and Y.C.; resources, H.C. and Y.C.; data curation, H.C. and Y.C.; writing—original draft preparation, H.C. and Y.C.; writing—review and editing, H.C. and Y.C.; visualization, H.C. and Y.C.; supervision, H.C. and Y.C.; project administration, H.C. and Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 62173160.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We used public MS COCO (https://cocodataset.org/#download, accessed on 1 April 2026) and PASCAL VOC (https://www.robots.ox.ac.uk/~vgg/projects/pascal/VOC/, accessed on 1 April 2026). The VOC–COCO mixed split is derived from official downloads following the public protocol (https://github.com/binyisu/food, accessed on 1 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Su, B.; Zhang, H.; Li, J.; Zhou, Z. Toward generalized few-shot open-set object detection. IEEE Trans. Image Process. 2024, 33, 1389–1402.
Su, B.; Zhang, H.; Zhou, Z. HSIC-based moving weight averaging for few-shot open-set object detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5358–5369.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803.
Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975.
Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting twenty-thousand classes using image-level supervision. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 350–368.
Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 728–755.
Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825.
Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19113–19122.
Liu, B.; Kang, H.; Li, H.; Hua, G.; Vasconcelos, N. Few-shot open-set recognition using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8798–8807.
Pal, D.; Bundele, V.; Sharma, R.; Banerjee, B.; Jeppu, Y. Few-shot open-set recognition of hyperspectral images with outlier calibration network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 3801–3810.
Song, N.; Zhang, C.; Lin, G. Few-shot open-set recognition using background as unknowns. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5970–5979.
Huang, S.; Ma, J.; Han, G.; Chang, S.F. Task-adaptive negative envision for few-shot open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7171–7180.
Wu, A.; Chen, D.; Deng, C. Deep feature deblurring diffusion for detecting out-of-distribution objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 13381–13391.
Isaac-Medina, B.K.; Breckon, T.P. Dream-Box: Object-wise Outlier Generation for Out-of-Distribution Detection. arXiv 2025, arXiv:2504.18746.
Wu, Z.; Su, B.; Geng, Q.; Zhang, H.; Zhou, Z. Boosting Few-Shot Open-Set Object Detection via Prompt Learning and Robust Decision Boundary. arXiv 2024, arXiv:2406.18443.
Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429.
Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta R-CNN: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586.
Zhang, G.; Luo, Z.; Cui, K.; Lu, S.; Xing, E.P. Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12832–12843.
Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957.
Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7352–7362.
Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. DeFRCN: Decoupled Faster R-CNN for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 8681–8690.
Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5830–5840.
Gupta, A.; Narayan, S.; Joseph, K.; Khan, S.; Khan, F.S.; Shah, M. OW-DETR: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9235–9244.
Han, J.; Ren, Y.; Ding, J.; Pan, X.; Yan, K.; Xia, G.S. Expanding low-density latent regions for open-set object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9591–9600.
Zhang, S.; Ni, Y.; Du, J.; Xue, Y.; Torr, P.; Koniusz, P.; van den Hengel, A. Open-World Objectness Modeling Unifies Novel Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Jammu, India, 16–18 July 2025; pp. 30332–30342.
Khattak, M.U.; Wasim, S.T.; Naseer, M.; Khan, S.; Yang, M.H.; Khan, F.S. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15190–15200.
Li, Z.; Li, X.; Fu, X.; Zhang, X.; Wang, W.; Chen, S.; Yang, J. PromptKD: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26617–26626.
Zhang, W.; Wang, Y.X. Hallucination improves few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13008–13017.
Liu, W.; Wang, X.; Owens, J.; Li, Y. Energy-based out-of-distribution detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21464–21475.
Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv 2016, arXiv:1610.02136.
Liang, S.; Li, Y.; Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv 2017, arXiv:1706.02690.
Dhamija, A.; Gunther, M.; Ventura, J.; Boult, T. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1021–1030.
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 4904–4916.
Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705.
Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190.
Lis, K.; Nakka, K.; Fua, P.; Salzmann, M. Detecting the unexpected via image resynthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2152–2161.
Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Miller, D.; Nicholson, L.; Dayoub, F.; Sünderhauf, N. Dropout sampling for robust object detection in open-set conditions. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3243–3249.
Zhou, D.W.; Ye, H.J.; Zhan, D.C. Learning placeholders for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4401–4410.

Figure 1. Feature space challenges in few-shot open-set object detection. Black double-headed arrow indicates inter-class separability, while red single-headed arrows indicate directions of potential misclassification or inter-class inseparability.

Figure 2. Differences and correlations among FSOSR, FSOD and OSOD for FS-OSOD tasks.

Figure 3. Workflow of GPMN.

Figure 4. Dual-prompt contrastive distillation and WGAN-based feature synthesis jointly mitigate base-class bias and enhance unknown-class rejection. Green denotes random noise vectors; yellow indicates mixed prompt features; blue and beige triangles represent CLIP image/text encoders; orange marks hard negative selection.

Figure 5. Contrastive distilled prompts (CDP) module. Flame and snowflake icons denote trainable and frozen parameters. Pink, yellow, and cyan blocks represent S-prompt, T-prompt, and Mix-prompt. The top panel optimizes contrastive loss ($L_{CL}$) using mined hard negative objects. Bottom panels detail the Base and New class adaptation flows, dynamically governed by weight $w_{b}$.

Figure 6. Two-stage training process of SMM.

Figure 7. Hyper-parameter sensitivity analysis of the proposed GPMN.

Table 1. A comparative summary of prior work in this field.

Framework	Core Idea	Key Limitation
PEELER [14]	Open-set meta-learning: Samples novel classes per episode, maximizes posterior entropy for unseen classes, and uses a Mahalanobis metric to improve open-set separability.	Designed for image-level FSOSR; lacks region-level localization/background handling and does not address detection-specific seen–unseen overlap under clutter.
OCN [15]	Threshold-free outlier rejection via an auxiliary calibration network that takes distances to class prototypes; uses generated samples to learn seen vs. outlier separation.	Built for hyperspectral recognition; prototype-distance calibration is sensitive under few-shot noise and is not tailored to region-level unknown rejection in detection.
ProCAM [16]	Progressive CAM separates foreground/background; background features are treated as pseudo-unseen classes to reserve open space and train additional background/unknown capacity.	Background-as-unknown is an imperfect proxy for truly out-of-set objects; the formulation is recognition-centric and does not directly stabilize detection boundaries.
TANE [17]	Learns task-adaptive negative prototypes to obtain a threshold-free rejection boundary; integrates rejection calibration into learning.	Recognition setting; performance depends on negative prototype quality and does not directly model region-level ambiguity/background clutter in detection.
DFDD [18]	Feature-space OOD synthesis: Forward blurring produces virtual OOD features and reverse deblurring recovers features for augmentation to improve OOD object detection.	Not few-shot specific; does not address base/novel optimization conflict or semantic drift, and may be brittle under severe class imbalance.
Dream-Box [19]	Pixel-space diffusion generates object-wise outliers to train detectors for in-distribution detection and OOD rejection with visualizable outliers.	Not tailored to scarce supervision; outlier coverage can vary and does not explicitly stabilize region-level seen–unseen margins in FS-OSOD.
FOOD [3]	Decouples unknown optimization from known classes via an unknown decoupling learner (UDL) and mitigates overfitting using a class weight sparsification classifier (CWSC) that sparsifies normalized classifier weights to reduce co-adaptation.	Does not leverage vision–language priors for semantic transfer; rejection relies on classifier heuristics and can remain fragile when unknown regions are visually close to seen classes under clutter and co-occurrence.
Ours	Two complementary controls: (i) Teacher–student prompt decoupling to stabilize semantic transfer under few-shot data. (ii) Monument memory with momentum-updated prototypes and non-parametric inference fusion to compress region-level seen–unseen overlap.	Jointly targets base-class bias and region-level overlap, yielding a more stable rejection margin for unknown objects under clutter/co-occurrence.

Table 2. Process-level overview of GPMN.

Level	Module	Core Operation	Effect
Prompt level	CDP	Uses a teacher–student prompt design with teacher-guided hard negatives for contrastive optimization	Reduces base-class bias while preserving transfer to novel/unknown semantics
Representation level	SMM	Builds unseen anchors, synthesizes unseen-like features, and regularizes them with monument memory	Compresses seen–unseen overlap and enlarges the rejection margin for unknowns
Inference level	Memory fusion	Fuses detector logits with monument-memory similarities	Reduces unknown-to-known confusion and improves rejection stability
Overall	GPMN	Sequentially connects CDP and SMM	Achieves a better balance between known/novel detection and unknown rejection
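
The representation-level and inference-level rows above can be illustrated with a minimal Python sketch of a momentum-updated prototype memory and a non-parametric score fusion. This is not the authors' implementation: the function names, the momentum value, the fusion weight `alpha`, the softmax temperature, and the choice of cosine similarity are all assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def update_prototype(prototype, region_feats, momentum=0.9):
    """Exponential-moving-average update of one class prototype.

    `momentum` is a hypothetical value, not taken from the paper.
    """
    dim = len(prototype)
    mean = [sum(f[i] for f in region_feats) / len(region_feats) for i in range(dim)]
    blended = [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]
    return normalize(blended)

def fuse(det_logits, region_feat, prototypes, alpha=0.5, temperature=0.1):
    """Convex blend of detector probabilities and prototype-similarity probabilities.

    `alpha` and `temperature` are assumed hyper-parameters for this sketch.
    """
    f = normalize(region_feat)
    sims = [sum(a * b for a, b in zip(f, normalize(p))) for p in prototypes]
    p_det = softmax(det_logits)
    p_mem = softmax([s / temperature for s in sims])
    return [(1.0 - alpha) * d + alpha * m for d, m in zip(p_det, p_mem)]

# Toy usage: three class prototypes in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prototypes[0] = update_prototype(prototypes[0], [[0.9, 0.1], [1.0, -0.1]])
fused = fuse(det_logits=[0.2, 0.1, 0.0], region_feat=[0.1, 2.0], prototypes=prototypes)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the detector logits are nearly uniform, so the memory similarities decide the class. How the fused scores are thresholded to reject unknowns is left out of the sketch.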

Table 3. Summary of related work streams and the identified gaps for FS-OSOD.

Stream (Related Works)	Scope, Gap and Relevance to FS-OSOD
Few-shot detection: FSRW [21], Meta-RCNN [22], Meta-DETR [23]; TFA [24], FSCE [25], DeFRCN [26]	Scope: Train transferable detectors from scarce annotations via meta-learning, two-stage transfer, or pseudo-sample augmentation under closed-set evaluation. Gap: These formulations do not explicitly model out-of-set unknowns at the region level; under background clutter and multi-class co-occurrence, base-driven gradients can distort decision boundaries and induce overconfident unknown-to-known errors. Relevance: Motivates explicit unknown handling and stable semantic transfer for generalized FS-OSOD.
Open-set detection: outlier synthesis [18,19]; background mining [27,28]; proxy unknown selection [29,30]	Scope: Acquires “unknown evidence” through synthetic outliers, mined background proposals, proxy unknowns, or uncertainty thresholds and shapes a rejection region during training or inference. Gap: Many OSOD methods implicitly assume abundant known-class supervision; in few-shot regimes, the overlap among seen, unknown and background proposal distributions increases, making rejection boundaries fragile and sensitive to proxy quality or threshold calibration. Relevance: Calls for representation-level regularization that enlarges the seen–unseen margin under clutter.
Prompt learning: CoOp/CoCoOp/MaPLe [11,12,13]; PromptSRC/PromptKD [31,32]	Scope: Leverages aligned vision–text priors and parameter-efficient prompts to improve semantic transfer without full fine-tuning. Gap: Under scarce supervision, prompt parameters may drift toward base classes, degrading transfer to novel/unknown semantics and destabilizing region-level boundaries in detection. Relevance: Requires prompt-level decoupling that preserves transferability while maintaining base discrimination.
Our work	Contribution: CDP stabilizes semantic adaptation via teacher–student prompt decoupling, while SMM reduces region-level seen–unseen overlap using memory-anchored regularization and non-parametric fusion. Together, they address base-class bias and distributional overlap, supporting reliable unknown rejection under clutter/co-occurrence with scarce supervision.

Table 4. Few-shot open-set object detection results on VOC-COCO under different shot settings. Metrics include mean AP of known/novel classes ($mAP_{K}$/$mAP_{N}$), recall and average recall of unknowns ($R_{U}$/$AR_{U}$), and wilderness impact/absolute open-set error ($WI$/$AOSE$). CED-FOOD denotes the original method, CED-FOOD* indicates results reproduced by us using the publicly available code under the same experimental settings, and Ours denotes the proposed GPMN framework. The upward arrow (↑) indicates that higher values are preferable, while the downward arrow (↓) signifies that lower values are more desirable for optimal performance. Boldface denotes the best result for each metric among all compared methods under the corresponding shot setting.

Method	1-Shot $mAP_{K}/mAP_{N}$ ↑	1-Shot $R_{U}/AR_{U}$ ↑	1-Shot $WI/AOSE$ ↓	5-Shot $mAP_{K}/mAP_{N}$ ↑	5-Shot $R_{U}/AR_{U}$ ↑	5-Shot $WI/AOSE$ ↓
TFA [24]	15.77 / 2.50	0.00 / 0.00	10.73 / 1441.80	17.13 / 6.56	0.00 / 0.00	11.36 / 1673.30
DS [45]	15.47 / 2.11	3.57 / 1.69	9.15 / 711.60	17.10 / 6.30	3.86 / 1.71	9.91 / 1110.10
ORE [27]	14.14 / 2.18	4.59 / –	12.08 / 1087.00	16.21 / 6.29	4.99 / –	12.30 / 1344.00
PROSER [46]	13.58 / 2.32	7.53 / 3.07	11.68 / 925.30	15.67 / 6.40	9.59 / 4.08	12.56 / 1165.90
OPENDET [29]	16.01 / 2.29	7.24 / 3.14	9.82 / 690.90	17.16 / 6.56	11.49 / 5.21	9.55 / 1176.90
FOOD [3]	15.83 / 2.26	15.76 / 7.20	6.78 / 485.00	18.08 / 6.69	20.02 / 9.45	7.37 / 859.00
FOODv2 [4]	18.54 / 4.33	30.87 / 14.13	– / –	19.88 / 11.95	32.53 / 15.74	– / –
CED-FOOD [20]	19.49 / 5.41	38.53 / 16.68	4.51 / 638.70	20.23 / 13.24	40.52 / 17.91	2.99 / 808.90
CED-FOOD*	18.87 / 4.93	38.87 / 16.97	4.41 / 594.10	21.46 / 11.19	40.26 / 17.85	3.33 / 952.80
Ours	24.30 / 13.89	36.58 / 15.19	4.75 / 677.90	26.93 / 19.86	40.59 / 17.48	3.25 / 788.20
Method	10-Shot $mAP_{K}/mAP_{N}$ ↑	10-Shot $R_{U}/AR_{U}$ ↑	10-Shot $WI/AOSE$ ↓	30-Shot $mAP_{K}/mAP_{N}$ ↑	30-Shot $R_{U}/AR_{U}$ ↑	30-Shot $WI/AOSE$ ↓
TFA [24]	18.67 / 9.02	0.00 / 0.00	11.40 / 1732.20	23.01 / 15.16	0.00 / 0.00	10.48 / 2294.10
DS [45]	19.06 / 9.46	3.75 / 1.77	10.13 / 1336.40	23.40 / 15.27	3.95 / 1.83	9.84 / 1892.90
ORE [27]	17.98 / 8.75	5.13 / –	11.65 / 1463.20	23.07 / 15.17	5.51 / –	11.22 / 1867.00
PROSER [46]	17.00 / 8.75	10.06 / 4.89	12.47 / 1160.00	21.44 / 14.30	12.06 / 5.98	12.00 / 1561.60
OPENDET [29]	18.53 / 8.70	13.89 / 6.32	9.83 / 1400.60	22.93 / 14.02	18.07 / 8.76	9.02 / 1818.00
FOOD [3]	20.17 / 9.48	21.48 / 9.56	7.59 / 1099.30	23.90 / 14.17	23.17 / 11.45	8.13 / 1480.00
FOODv2 [4]	22.64 / 13.82	32.78 / 16.52	– / –	23.71 / 17.67	35.74 / 17.26	– / –
CED-FOOD [20]	23.75 / 16.77	38.69 / 17.06	2.58 / 856.40	25.72 / 21.16	39.43 / 17.52	2.46 / 1339.30
CED-FOOD*	22.90 / 15.79	37.92 / 16.84	2.46 / 916.30	23.94 / 19.78	40.01 / 18.09	2.43 / 1408.60
Ours	28.60 / 22.67	38.71 / 16.83	3.40 / 843.50	28.74 / 24.47	40.80 / 18.11	3.09 / 1233.50

Table 5. Few-shot open-set object detection results on VOC10-5-5 under different shot settings. Metrics include mean AP of known/novel classes ($mAP_{K}$/$mAP_{N}$), recall and average recall of unknowns ($R_{U}$/$AR_{U}$), and wilderness impact/absolute open-set error ($WI$/$AOSE$). CED-FOOD denotes the original method, CED-FOOD* indicates results reproduced by us using the publicly available code under the same experimental settings, and Ours denotes the proposed GPMN framework. The upward arrow (↑) indicates that higher values are preferable, while the downward arrow (↓) signifies that lower values are more desirable for optimal performance. Boldface denotes the best result for each metric among all compared methods under the corresponding shot setting.

Method	1-Shot $mAP_{K}/mAP_{N}$ ↑	1-Shot $R_{U}/AR_{U}$ ↑	1-Shot $WI/AOSE$ ↓	3-Shot $mAP_{K}/mAP_{N}$ ↑	3-Shot $R_{U}/AR_{U}$ ↑	3-Shot $WI/AOSE$ ↓
TFA [24]	45.31 / 8.50	0.00 / 0.00	10.69 / 1308.40	47.55 / 15.23	0.00 / 0.00	10.13 / 1335.40
DS [45]	43.82 / 7.22	23.99 / 12.15	9.14 / 772.60	46.89 / 14.48	23.62 / 11.98	9.08 / 969.90
ORE [27]	43.25 / 8.62	18.25 / –	9.54 / 930.30	45.88 / 14.52	22.23 / –	9.88 / 1058.70
PROSER [46]	41.64 / 8.49	30.95 / 15.41	11.15 / 994.60	43.30 / 15.16	32.30 / 16.17	10.45 / 1021.70
OPENDET [29]	43.45 / 8.27	33.64 / 17.28	10.47 / 867.30	46.47 / 14.09	30.62 / 15.89	9.27 / 954.50
FOOD [3]	43.97 / 8.95	43.72 / 23.51	6.96 / 598.60	48.48 / 16.83	44.52 / 23.58	7.83 / 859.00
FOODv2 [4]	45.12 / 11.56	60.03 / 31.19	– / –	48.90 / 18.96	61.21 / 32.02	– / –
CED-FOOD [20]	51.94 / 21.43	79.88 / 38.12	4.12 / 459.60	53.09 / 31.70	80.55 / 39.53	3.72 / 451.20
CED-FOOD*	51.36 / 21.35	75.71 / 35.91	5.73 / 625.70	53.74 / 32.25	77.75 / 38.49	4.65 / 555.70
Ours	58.73 / 40.95	79.97 / 37.23	5.68 / 770.30	57.48 / 34.14	78.14 / 37.34	5.81 / 558.20
Method	5-Shot $mAP_{K}/mAP_{N}$ ↑	5-Shot $R_{U}/AR_{U}$ ↑	5-Shot $WI/AOSE$ ↓	10-Shot $mAP_{K}/mAP_{N}$ ↑	10-Shot $R_{U}/AR_{U}$ ↑	10-Shot $WI/AOSE$ ↓
TFA [24]	47.88 / 19.74	0.00 / 0.00	9.99 / 1256.10	51.10 / 26.19	0.00 / 0.00	9.87 / 1267.20
DS [45]	48.01 / 19.27	19.99 / 10.08	8.97 / 990.60	48.01 / 25.66	19.99 / 10.83	8.81 / 1025.70
ORE [27]	46.29 / 18.49	23.01 / –	9.35 / 1040.70	46.29 / 24.40	23.48 / –	9.65 / 1063.70
PROSER [46]	45.12 / 20.08	32.68 / 16.48	10.59 / 1099.00	48.35 / 25.13	32.61 / 17.01	10.29 / 956.70
OPENDET [29]	47.56 / 19.07	32.13 / 16.72	9.86 / 1010.90	47.56 / 27.19	30.62 / 18.89	8.50 / 1201.40
FOOD [3]	50.18 / 23.10	45.65 / 23.61	7.59 / 908.00	53.23 / 28.26	45.84 / 23.86	6.99 / 900.20
FOODv2 [4]	52.55 / 27.31	60.21 / 31.05	– / –	53.24 / 32.73	62.14 / 32.80	– / –
CED-FOOD [20]	54.35 / 36.67	81.37 / 40.32	3.78 / 512.20	55.85 / 43.52	79.39 / 39.79	3.43 / 546.30
CED-FOOD*	54.82 / 36.77	80.15 / 40.17	4.28 / 587.10	57.47 / 44.16	78.02 / 39.33	3.85 / 604.00
Ours	59.90 / 43.20	81.68 / 38.50	4.13 / 509.60	61.78 / 47.40	78.36 / 38.16	3.36 / 535.20

Table 6. Comparison of computational resources on VOC10-5-5. CED-FOOD denotes the original method, OpenDet* indicates results reproduced by us using the publicly available code under the same experimental settings, and Ours denotes the proposed GPMN framework.

Computational Resource	DS	PROSER	ORE	OpenDet*	CED-FOOD	Ours
Training time (s/iter)	0.1963	0.1925	0.2071	0.1941	0.1963	0.1957
Inference time (s/iter)	0.3979	0.03981	0.03984	0.03985	0.03989	0.03995
Model parameters (KB)	224219	224125	231290	223645	225687	226945

Table 7. Ablation results of individual components under the 30-shot setting. Reported numbers are averaged over 10 runs with different random seeds; the best performance is highlighted in bold.

CDP	SMM	${mAP}_{K} ↑$	${mAP}_{N} ↑$	$R_{U} ↑$	${AR}_{U} ↑$	$WI ↓$	$AOSE ↓$
Baseline (CED-FOOD)		25.72	21.16	39.43	17.52	2.46	1339.30
✔		29.78	25.56	40.17	18	2.98	2083
	✔	24.28	19.96	40.88	18.08	3.16	1096
✔	✔	28.74	24.47	40.80	18.11	3.09	1233.50
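
To make the trade-off in the ablation easier to read, the short Python sketch below computes the change of the full model (CDP + SMM) relative to the baseline row, using only numbers copied from the table above. The variable and function names are illustrative, and the "higher is better" / "lower is better" directions follow the arrows in the table header.

```python
# Rows copied from Table 7 (30-shot): baseline and the full CDP + SMM model.
METRICS = ["mAP_K", "mAP_N", "R_U", "AR_U", "WI", "AOSE"]
LOWER_IS_BETTER = {"WI", "AOSE"}

baseline = dict(zip(METRICS, [25.72, 21.16, 39.43, 17.52, 2.46, 1339.30]))
full = dict(zip(METRICS, [28.74, 24.47, 40.80, 18.11, 3.09, 1233.50]))

def compare(reference, candidate):
    """Return {metric: (delta, improved)} for candidate relative to reference."""
    report = {}
    for name in METRICS:
        delta = round(candidate[name] - reference[name], 2)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        report[name] = (delta, improved)
    return report

report = compare(baseline, full)
for name, (delta, improved) in report.items():
    status = "improved" if improved else "not improved"
    print(f"{name:6s} {delta:+9.2f}  {status}")
```

On these numbers, the full model improves both mAP values, both unknown-recall metrics, and AOSE relative to the baseline, while WI increases from 2.46 to 3.09.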

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, H.; Chen, Y. Few-Shot Open-Set Object Detection with a Synthesized Monument Guided by Contrastive Distilled Prompts. Appl. Sci. 2026, 16, 3474. https://doi.org/10.3390/app16073474

