Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining for Open-World Object Detection

Iqbal, Muhammad Ali; Yoon, Yeo-Chan; Kim, Soo Kyun

doi:10.3390/app152010918

by Muhammad Ali Iqbal ¹, Yeo-Chan Yoon ² and Soo Kyun Kim ¹,*

¹ Department of Computer Engineering, Jeju National University, Jeju 63243, Republic of Korea
² Department of Artificial Intelligence, Jeju National University, Jeju 63243, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(20), 10918; https://doi.org/10.3390/app152010918

Submission received: 16 September 2025 / Revised: 6 October 2025 / Accepted: 9 October 2025 / Published: 11 October 2025

Abstract

Open-world object detection (OWOD) aims to build detectors that can recognize known categories while simultaneously identifying unknown objects and incrementally learning novel classes. Despite recent advances, existing OWOD approaches still struggle with two critical challenges: the severe bias toward head classes in long-tailed data distributions and the misclassification of unknown objects as background. To address these issues, we introduce TAPM (Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining), a novel framework that explicitly enhances tail-class representation and robustly reveals unknown objects. TAPM integrates three core innovations: (1) a transformer-based autoencoder that reconstructs region features to calibrate embeddings for rare categories, mitigating the dominance of frequent classes; (2) a prototype-guided mining strategy that leverages class prototypes to localize both overlooked tail instances and candidate unknowns; and (3) an uncertainty-aware soft-labeling mechanism that assigns probabilistic supervision to pseudo-unknowns, reducing noise in incremental learning. Extensive experiments on the MS-COCO and LVIS benchmarks demonstrate that TAPM significantly improves unknown-object recall while maintaining strong known-class accuracy, achieving state-of-the-art performance across both the superclass-separated (S-OWODB) and superclass-mixed (M-OWODB) benchmarks. In particular, TAPM achieves a +20.4-point gain in U-Recall over the strong PROB baseline, underscoring its effectiveness in detecting novel objects without sacrificing mean Average Precision (mAP). Furthermore, TAPM achieves better generalization on cross-dataset evaluations, highlighting its robustness in diverse open-world scenarios.

Keywords:

open-world object detection (OWOD); incremental learning; unknown-object discovery; uncertainty-aware labeling

1. Introduction

Object detectors deployed in dynamic, real-world environments must recognize not only a fixed set of training classes but also unknown objects that were never seen during training [1]. This challenge is formalized as open-world object detection (OWOD), where a model should detect and flag novel objects as unknown and later learn them incrementally when labels become available. A key difficulty in OWOD is that previously unseen objects are often misclassified as one of the known categories or even suppressed as background by traditional detectors [2]. The ability to generalize beyond the training distribution is thus essential [3]. Another major obstacle is the long-tailed distribution of object frequencies in open-world data [4]. In large-vocabulary benchmarks such as LVIS, some common classes have abundant examples, while many tail categories have very scarce training data [5]. This long tail poses a severe challenge for state-of-the-art detectors, which tend to struggle on rare categories with few samples [6]. In an open-world setting, models must handle extreme class imbalance alongside distributional shifts (new classes at test time). This combination makes detecting tail-class and novel objects even harder [7]. Moreover, unknown objects lack ground-truth annotations during training, making it particularly challenging for the model to learn to detect them [8]. These problems lead to missed detections or confusion with known classes, especially for objects from under-represented tail classes or entirely new classes [9].

To address these challenges, we propose TAPM (Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining), a novel framework for open-world object detection. TAPM introduces three key components: (1) Tail-calibrated transformer autoencoding, which uses a self-supervised reconstruction module within a transformer detector to calibrate feature representations towards data-sparse tail classes. By learning to autoencode object features, the model reduces its bias toward head classes and better captures the distinctive patterns of tail-category objects. (2) Prototype-guided mining, which leverages class prototypes (learned representative feature vectors for each class) to guide the discovery of new or hard-to-detect objects. This mechanism finds candidate regions in unlabeled data or background regions that are similar to learned prototypes, effectively revealing potential unknown objects or tail-class instances that the base detector might overlook. (3) Uncertainty-aware soft-labeling, an adaptive pseudo-labeling strategy that assigns soft labels to extracted proposals based on the predictive uncertainty of the model. Figure 1 gives an overview of the entire framework.

Instead of hard one-hot labels, our method produces probabilistic labels for unknown-object proposals, thereby mitigating the impact of labeling noise and model overconfidence. This uncertainty-aware labeling leads to more stable training on pseudo-unknowns by down-weighting ambiguous cases and up-weighting confident detections. We validate the TAPM approach on large-scale detection benchmarks, MS-COCO [10] and LVIS [5], under challenging open-world evaluation protocols. The results demonstrate that our method significantly improves the detection of rare and unseen objects compared with prior work. In particular, TAPM exhibits robust unknown-object detection, reducing misclassification of novel objects as known or background. These gains yield new state-of-the-art results on OWOD benchmarks, taking us a step closer to reliable open-world object detection at scale. Our main contributions are listed below.

We design a transformer-based autoencoding module that reconstructs object features to calibrate the representation of tail classes, improving detection for rare objects by mitigating the bias toward frequent classes.
We introduce a prototype-guided mining strategy that utilizes learned class prototypes to identify and propose regions containing potential unknown or tail-category objects, thereby providing effective pseudo-supervision for open-world learning.
We develop an uncertainty-driven soft-labeling scheme for pseudo-unknown objects which assigns probabilistic labels based on model confidence. This approach reduces noisy labels and enables the stable incremental learning of new object categories.
Our TAPM framework achieves state-of-the-art open-world object detection results on MS-COCO and LVIS, demonstrating improved unknown-object recall and significantly better performance on long-tailed categories in large-vocabulary settings.

2. Related Work

Open-Set Recognition (OSR) is a foundational paradigm related to OWOD, in which a classifier must reject samples from classes that were not seen during training. Building on this idea, several studies [11,12,13,14] have explored proposal generation strategies designed to improve generalization when encountering novel categories. Scheirer et al. [15] first formalized the open-set classification problem, and Bendale and Boult [16] extended it to open-world classification, which not only detects unknown classes but also updates the model with new classes over time. In computer vision, early work on open-set object recognition and anomaly detection tackled problems such as detecting novel objects in images or robotics scenes [17]. However, these were mostly limited to image classification or relied on coarse heuristics for detection. Open-world object detection (OWOD), introduced by Joseph et al. in ORE [1], adapts the open-set concept to the object detection task. ORE [1] formally defined the OWOD problem and proposed the first solution. This two-stage Faster R-CNN-based model combines contrastive clustering and energy-based scoring to identify unknown objects. The energy-based unknown identifier in ORE reduces the confidence of proposals that do not cluster with any known class, thereby labeling them as unknown. This work demonstrates the feasibility of detecting unknown objects without explicit annotations while also performing the incremental learning of new classes.

Building on ORE, OW-DETR [18] introduced a transformer-based framework for OWOD that operates in an end-to-end manner. It builds on Deformable-DETR and introduces a bottom-up attention-driven pseudo-labeling scheme: region proposals with high objectness scores (as measured by feature activation magnitude) are selected and pseudo-labeled as unknown objects for training. OW-DETR also adds a dedicated novelty classification head to distinguish between unknown objects and background, and a foreground objectness branch to enhance the transfer of knowledge from known to unknown classes. UC-OWOD [19] proposed a similarity-based clustering mechanism that groups unseen classes into distinct clusters separated from the known classes. Despite their successes, pseudo-labeling approaches face challenges: they can introduce false positives (noise) and are computationally heavy due to iterative re-labeling. These limitations motivated alternative strategies that focus on class-agnostic objectness modeling. Rather than relying on the pseudo-labeling of unknowns, several recent works [2,20,21] model objectness probabilistically to separate unknown objects from background. One such model is PROB [22], a transformer-based OWOD detector that incorporates a probabilistic objectness head, removing the need to generate pseudo-unknown labels. PROB extends Deformable-DETR by adding an unknown class and, crucially, decoupling objectness prediction from classification. It learns a multivariate Gaussian distribution in the query embedding space to parameterize an objectness probability for each detected region.

More recently, several studies were conducted to explore the open-world object detection problem. Fang et al. [9] designed an unsupervised discriminative learner to filter true unknowns from pseudo-labels through iterative self-training; however, it struggles in cluttered scenes by confusing the background with unknown objects. Greer et al. [6] proposed VisLED, a vision–language-guided active learning framework that queries informative 3D samples to capture rare or unseen categories better. Wang et al. [23] introduced OV-DQUO. This open-vocabulary DETR variant employs wildcard matching and denoising text queries, thereby reducing label bias and enhancing the recognition of novel classes. Xie et al. [24] developed OWSOL, which combines contrastive co-learning and unsupervised clustering to localize novel objects without bounding-box supervision. More recently, Fu et al. [25] presented LLMDet, which co-trains detectors with large language models using detailed region- and image-level captions, enhancing the ability to identify unseen categories. Yang et al. [26] proposed a Partial Attribute Assignment framework that formulates attribute selection as a Partial Optimal Transport problem to learn human-interpretable attributes for both known and unknown objects, yielding explainable unknown detection and strong OWOD performance without relying solely on objectness scores.

We follow RE-OWOD [27] for evaluation (IoU matching; WI/A-OSE at a fixed operating point) and use it throughout. Table 1 summarizes the key differences between the proposed method and the most recent OWOD work. Inside TAPM, we introduce tail-calibrated transformer encoding (TCTE), which equalizes reconstruction pressure with CLT-AE, converts deviations into calibrated evidence via a GPD bulk–tail fit, and applies a prototype gate to reduce head-biased shortcuts. This yields consistent gains in WI and A-OSE without changes to the detector head or training labels. PROB [22] adds a probabilistic objectness head to Deformable-DETR; we do not add an extra head. Instead, the prototype-guided soft score $s(r)$ down-weights head-aligned features even when objectness is high. TAPM lowers WI and A-OSE relative to PROB; ablations in Section 5.12 isolate the effect of the prototype gate versus a pure objectness prior. Objectness-unification methods [2] emphasize a single objectness score; we focus on tail calibration to separate unknowns from background. The two are compatible: a unified objectness model can feed proposals to our TCTE, which then performs calibration. UC-OWOD [19] aims to classify unknowns via a two-stage pipeline; our goal is cleaner unknown vs. background separation, and our calibrated unknowns can serve as better inputs to such grouping. For label-free discovery, MEPU-OWOD [9] uses an unsupervised discriminative model; our unknown branch is also label-free but bases routing on calibrated reconstruction density (GPD tails) with prototype alignment, and we report FKA/KOR alongside WI/A-OSE at the same operating point. Open-CRB [28] provides an active learning framework for open-world 3D learning with an oracle. Our scope is 2D image detection without human queries; we therefore do not adopt active strategies. Section 5.6 reports the computational cost of our overall method and shows that its inference overhead is negligible relative to other SOTA models.

In contrast to the studies mentioned earlier, which range from energy-based scoring (ORE), transformer-based pseudo-labeling, and objectness-driven proposals to prototype- and attribute-guided approaches, our work addresses a critical gap: the underperformance of open-world detectors on tail categories. We propose TAPM (Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining), which explicitly recalibrates the feature space for rare classes through a novel cross-level transformer autoencoder (CLT-AE). To ensure reliable unknown discovery, TAPM integrates distribution-tail calibration with GPD refinement and introduces a prototype-guided gating mechanism that filters pseudo-labels by cosine similarity. Additionally, an uncertainty-aware soft-labeling scheme is developed to dynamically weight detection losses. This combination of tail recalibration, prototype-guided mining, and uncertainty calibration enables TAPM to improve both known-class balance and unknown-class detection, setting it apart from existing methods.

3. Proposed Methodology

Open-world object detection (OWOD) necessitates a detector that simultaneously recognizes labeled classes, flags unfamiliar objects as unknown, and adapts incrementally as new categories emerge. Existing approaches either rely heavily on pseudo-labeling heuristics or overlook the severe class imbalance across head and tail categories. To address these challenges, we propose TAPM (Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining). This unified framework integrates reconstruction-driven uncertainty modeling, density-based calibration, and prototype-guided mining into an incremental OWOD pipeline. The training procedure of TAPM is summarized in Algorithm 1, which highlights the end-to-end flow: (i) generation of unsupervised proposals, (ii) feature reconstruction with our cross-level transformer autoencoder (CLT-AE), (iii) scoring of proposals using reconstruction errors, (iv) fitting of level-wise error distributions with best-of statistical models plus GPD tail refinement, (v) maintaining of a prototype vector of known-class features for contrastive gating, (vi) assignment of uncertainty-aware soft labels to proposals, and (vii) optimization of the detector with weighted losses while extending pseudo-unknowns via OLN. The algorithm provides an overview of the training pipeline, while the following subsections provide an in-depth explanation of each module. Figure 2 gives a high-level overview of TAPM, illustrating the flow from proposal generation and ROI feature extraction to our tail-calibrated encoding and the known/unknown prediction branches.

Algorithm 1: Proposed framework training with CLT-AE for incremental OWOD.

In the following subsections, we provide a detailed description of each module, which includes the mathematical details of CLT-AE reconstruction, reconstruction error modeling, probabilistic uncertainty estimation, prototype-guided contrastive learning, and incremental training protocol.

3.1. Problem Formulation: Long-Tailed Unsupervised Object Detection

Setting. Let $D = \{I_i\}_{i=1}^{N_{img}}$ be an unlabeled image set drawn from an open world of object categories. For each image $I$, an unsupervised proposal generator $G$ (e.g., Selective Search or FreeSOLO) produces candidate regions $R(I) = \{r_j\}_{j=1}^{N(I)}$ that may contain objects but have no class labels.

Over the dataset, let $R = \bigcup_{I \in D} R(I)$. The open-world label space is $C \cup \{U\}$, where $C$ denotes the currently known classes for the task and $U$ denotes unknown objects. Note that unknown $U$ is distinct from background (clutter or non-object regions); the detector’s standard negative class handles background in the classification branch.

Long-tailed class distribution. Let $n_c$ be the (latent) count of proposals in $R$ that correspond to class $c \in C$, and let $n_U$ be the count for unknowns. Define $n = (n_c)_{c \in C}$ and $N_{kn} = \sum_{c \in C} n_c$. We call the distribution long-tailed if the class frequencies, sorted in non-increasing order $n_{(1)} \geq n_{(2)} \geq \dots \geq n_{(|C|)}$, are heavy-tailed, e.g.,

$$n_{(k)} \propto k^{-\alpha} \quad \text{for some } \alpha > 0,$$

or, more generally, if they exhibit a large imbalance ratio

$$\rho = \frac{\max_{c \in C} n_c}{\min_{c \in C} n_c} \gg 1,$$

(where the minimum is taken over classes with $n_c > 0$). We partition classes into head and tail by using a frequency threshold $\tau$ (or percentile $q$): $H = \{c \in C : n_c \geq \tau\}$ and $T = C \setminus H$. In practice, the severity of imbalance is governed by $\rho$ and the mass concentrated in $T$. Counts $n_c$ are latent in the unsupervised setting and are used only for analysis; when needed, we estimate them from benchmark annotations or high-confidence pseudo-labels. As $\rho$ increases, mini-batch sampling and loss aggregation become dominated by head classes, which under-represents $T$ and exacerbates false-known assignments for unknown proposals ($U$).
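The imbalance ratio and the head/tail split defined above can be illustrated with a short NumPy sketch. The function names and the class counts are hypothetical and chosen only for illustration; the threshold value is likewise an assumption, not a setting reported by the authors.

```python
import numpy as np

def imbalance_ratio(counts):
    """rho = max_c n_c / min_c n_c, with the minimum taken over classes where n_c > 0."""
    counts = np.asarray(counts, dtype=float)
    positive = counts[counts > 0]
    return positive.max() / positive.min()

def split_head_tail(counts, tau):
    """Return (head, tail) class indices: H = {c : n_c >= tau}, T = C minus H."""
    counts = np.asarray(counts)
    head = np.flatnonzero(counts >= tau)
    tail = np.flatnonzero(counts < tau)
    return head, tail

# Hypothetical proposal counts for five known classes (illustrative only).
counts = [1200, 300, 40, 8, 2]
rho = imbalance_ratio(counts)
head, tail = split_head_tail(counts, tau=100)
print(rho, head.tolist(), tail.tolist())
```

With these made-up counts, two classes fall in the head and three in the tail, and the ratio between the most and least frequent class is large, which is the regime the paragraph describes.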

Detector objective under long tails. Let $\theta$ denote detector parameters. Each proposal $r \in R$ has a label $y \in C \cup \{U\}$ that is unobserved during training. The goals are to (i) correctly classify/regress known objects ($y \in C$) and (ii) assign unknown objects to $U$ while avoiding false-known assignments. Let $L_{\det}(r; \theta)$ denote the standard detection loss for known classes (classification with background negatives + box regression), and define an open-set penalty

$$L_{open}(r; \theta) = \mathbb{1}[y = U]\, \ell_U(r; \theta) + \mathbb{1}[y \in C]\, \ell_{conf}(r; \theta),$$

where $\ell_U$ encourages rejecting unknowns and $\ell_{conf}$ regularizes overconfident misclassifications on known classes (e.g., via margin/entropy terms). With class-frequency priors $\pi_c = n_c / N_{kn}$ and $\pi_U = n_U / (N_{kn} + n_U)$, the population risk decomposes as

$$R(\theta) = \sum_{c \in C} \pi_c\, \mathbb{E}_{r \sim p(r \mid y = c)}\left[L_{\det}(r; \theta)\right] + \pi_U\, \mathbb{E}_{r \sim p(r \mid y = U)}\left[\ell_U(r; \theta)\right].$$

In long-tailed regimes, small $\pi_c$ for $c \in T$ increases variance and induces bias toward head classes, which under-trains tails and elevates open-set error (unknowns mistaken as rare knowns).

Unsupervised mining with soft labels. Because $y$ is unavailable at train time, we estimate a soft unknown score $s(r) \in [0, 1]$ for each proposal via reconstruction error densities with tail refinement and a prototype-based cosine gate. We then optimize the empirical objective

$$\hat{R}(\theta) = \frac{1}{|R|} \sum_{r \in R} \Big( \underbrace{s(r)\, L_{open}(r; \theta)}_{\text{unknown mining}} + \underbrace{(1 - s(r))\, L_{\det}(r; \theta)}_{\text{known / background handling}} \Big),$$

where $(1 - s(r))$ routes background-like proposals to the detector’s negative class and emphasizes supervised updates for known classes. In contrast, $s(r)$ emphasizes rejecting or down-weighting proposals that behave like unknowns. Our approach (Algorithm 1) computes $s(r)$ from level-wise reconstruction error posteriors with generalized Pareto tail refinement and filters them with a cosine-similarity prototype gate; this mitigates the effect of $\rho$ by promoting tail-consistent features and suppressing head-biased shortcuts.

3.2. Tail-Calibrated Transformer Encoding: Modules and Objective

TCTE converts region proposals into object-centric representations that are robust to long-tailed frequency. The pipeline follows Algorithm 1: we (A) encode multi-level features with a cross-level transformer autoencoder (CLT-AE) and reconstruct each level, (B) convert per-proposal reconstruction errors into calibrated tail evidence using a bulk–tail density fit, (C) suppress head-biased shortcuts by comparing each proposal to learned known-class prototypes, and (D) combine tail evidence and prototype alignment into a soft unknown score that routes training signal between open-set rejection and standard detection. Each module is detailed in the following subsections.

3.2.1. Module A: Cross-Level Transformer Autoencoder (CLT-AE)

A backbone processes each image into feature maps at $L$ pyramid levels. The encoder uses cross-level attention to exchange context across levels, while the decoder reconstructs each level to enforce an object-centric structure rather than relying on head-class shortcuts. For a proposal $r$, we aggregate a compact latent $h(r)$ by pooling encoder tokens within the region of $r$ across levels. This $h(r)$ serves two roles: (i) it stabilizes per-proposal error estimation via consistent content encoding, and (ii) it provides a metric space where known-class prototypes are well-formed. The reconstruction objective couples the levels,

$$L_{rec} = \sum_{\ell=1}^{L} \left\| X_\ell - \hat{X}_\ell \right\|_1, \quad (1)$$

encouraging the encoder to represent fine and coarse structures uniformly, even for rare (tail) appearances.

3.2.2. Module B: Reconstruction Error Density with Tail Refinement

For each proposal $r$, we summarize its per-level reconstruction discrepancy by averaging absolute errors inside the region. Rather than using a fixed threshold, we model the error distribution with a light bulk component and a generalized Pareto tail on exceedance (peaks over threshold); this yields a calibrated tail probability that adapts to level-wise statistics. We then combine level-wise tail probabilities into a single “tail-evidence” score $p_{tail}(r)$ (via a noisy-OR or learned weights), which increases when $r$ exhibits a structure poorly explained by the current reconstruction model:

$$p_{tail}(r) = 1 - \prod_{\ell=1}^{L} \left(1 - p_\ell^{tail}(r)\right). \quad (2)$$
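The noisy-OR combination in Equation (2) can be sketched in a few lines of NumPy. The per-level probabilities below are hypothetical values, and the function name `noisy_or` is an illustrative choice; the learned-weights alternative mentioned in the text is not shown.

```python
import numpy as np

def noisy_or(level_tail_probs):
    """Combine per-level tail probabilities into one score, following Eq. (2).

    level_tail_probs: array of shape (num_proposals, L) with entries in [0, 1].
    """
    p = np.clip(np.asarray(level_tail_probs, dtype=float), 0.0, 1.0)
    return 1.0 - np.prod(1.0 - p, axis=-1)

# Hypothetical per-level tail probabilities for three proposals and L = 3 levels.
probs = np.array([
    [0.10, 0.20, 0.05],  # mildly unusual at every level
    [0.00, 0.00, 0.00],  # well explained at every level
    [0.90, 0.10, 0.10],  # extreme at one level
])
p_tail = noisy_or(probs)
print(p_tail)
```

A useful property of this form is that a single level with strong tail evidence is enough to push the combined score close to 1, while a proposal that is well explained at every level stays at 0.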

3.2.3. Module C: Prototype-Guided Gating

Frequent head classes often dominate long-tailed learning. To counter this, we build a prototype $\mu_c$ for each known class $c \in C$ by averaging $h(r)$ over high-confidence proposals. For any proposal $r$, we compute its maximum cosine alignment to the known prototypes; high alignment suggests head-like content, whereas low alignment indicates potential novelty or tail variation poorly captured by heads. We use the normalized alignment $\tilde{g}(r) \in [0, 1]$ as an anti-evidence term for unknownness:

$$\tilde{g}(r) = \max_{c \in C} \frac{h(r)^{\top} \mu_c}{\| h(r) \| \, \| \mu_c \|} \quad \text{(scaled to } [0, 1]\text{)}. \quad (3)$$
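A minimal sketch of the prototype gate in Equation (3) follows. The text does not state how the cosine similarity is scaled to [0, 1]; the affine map (cos + 1) / 2 used here is an assumption, as are the function name and the toy vectors.

```python
import numpy as np

def prototype_gate(h, prototypes, eps=1e-12):
    """Max cosine alignment of each latent h(r) to the known-class prototypes.

    h:          (N, d) proposal latents.
    prototypes: (K, d) class prototypes mu_c.
    Returns values in [0, 1]; the (cos + 1) / 2 rescaling is an assumption.
    """
    h = np.asarray(h, dtype=float)
    mu = np.asarray(prototypes, dtype=float)
    h_unit = h / (np.linalg.norm(h, axis=1, keepdims=True) + eps)
    mu_unit = mu / (np.linalg.norm(mu, axis=1, keepdims=True) + eps)
    cos = h_unit @ mu_unit.T          # (N, K) cosine similarities
    max_cos = cos.max(axis=1)         # best-matching prototype per proposal
    return np.clip((max_cos + 1.0) / 2.0, 0.0, 1.0)

# Two toy prototypes and three toy proposal latents (illustrative only).
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
h = np.array([[2.0, 0.0], [-1.0, -1.0], [1.0, 1.0]])
g_tilde = prototype_gate(h, prototypes)
print(g_tilde)
```

The first latent is perfectly aligned with a prototype and receives a gate value of 1, whereas the second points away from both prototypes and receives a low value, which leaves more room for tail evidence to mark it as unknown.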

3.2.4. Module D: Soft-Label Mining and Training Objective

Tail calibration emerges when we combine tail evidence with prototype anti-evidence to form a soft unknown score $s(r) \in [0, 1]$ and then route losses accordingly. Intuitively, $s(r)$ is high when $r$ looks distributionally extreme (Equation (2)) and poorly aligned with any known-class prototype (Equation (3)). We define

$$s(r) = \sigma\left(\alpha\, p_{tail}(r) - \beta\, \tilde{g}(r) + \gamma\right), \quad (4)$$

and optimize a balanced empirical risk that emphasizes open-set penalties for high-$s(r)$ proposals while preserving standard detection updates otherwise:

$$\hat{R}(\theta) = \frac{1}{|R|} \sum_{r \in R} \Big( s(r)\, L_{open}(r; \theta) + (1 - s(r))\, L_{\det}(r; \theta) \Big) + \lambda_{rec} L_{rec}. \quad (5)$$

Here, $L_{\det}$ is the standard detection loss (classification with background negatives plus box regression), and $L_{open}$ aggregates unknown-rejection and confidence regularization terms. Weights $(\alpha, \beta, \gamma)$ and $\lambda_{rec}$ are selected on a small validation split; they remain fixed across tasks.
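Equations (4) and (5) can be sketched as two small functions. The weight values below are hypothetical (the text only says they are selected on a validation split), and the per-proposal losses are passed in as plain numbers rather than computed by a detector.

```python
import numpy as np

def soft_unknown_score(p_tail, g_tilde, alpha, beta, gamma):
    """Eq. (4): sigmoid of tail evidence minus prototype anti-evidence plus a bias."""
    z = alpha * np.asarray(p_tail, dtype=float) - beta * np.asarray(g_tilde, dtype=float) + gamma
    return 1.0 / (1.0 + np.exp(-z))

def routed_risk(s, loss_open, loss_det, loss_rec, lambda_rec):
    """Eq. (5): s-weighted mix of open-set and detection losses plus reconstruction."""
    s = np.asarray(s, dtype=float)
    per_proposal = s * np.asarray(loss_open, dtype=float) + (1.0 - s) * np.asarray(loss_det, dtype=float)
    return float(per_proposal.mean() + lambda_rec * loss_rec)

# Hypothetical weights; the text does not report the selected values.
alpha, beta, gamma = 4.0, 4.0, 0.0

# High tail evidence with low prototype alignment should score above the reverse case.
s = soft_unknown_score(p_tail=[0.9, 0.5, 0.1], g_tilde=[0.1, 0.5, 0.9],
                       alpha=alpha, beta=beta, gamma=gamma)
risk = routed_risk(s=[0.0, 1.0], loss_open=[2.0, 3.0], loss_det=[5.0, 7.0],
                   loss_rec=10.0, lambda_rec=0.1)
print(s, risk)
```

In the toy call to `routed_risk`, the first proposal (s = 0) contributes only its detection loss and the second (s = 1) only its open-set loss, which makes the routing behavior of the soft score explicit.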

3.2.5. Why This Calibrates Tails

In long-tailed regimes, gradients are dominated by head classes, which suppresses learning for rare classes and increases false-positive assignments of unknowns. CLT-AE enforces an object-centric reconstruction objective applied uniformly across head and tail classes. The bulk–tail density fit converts reconstruction deviations into calibrated tail evidence, rather than using ad hoc thresholds. The prototype gate down-weights spurious alignments to head-class prototypes in the feature space. The soft score in Equation (4) then routes proposals with strong tail evidence to the open-set loss while preserving standard updates for known classes.

3.3. Prototype Guidance: Error Formulation

Prototype guidance relies on per-batch prototypes from high-confidence known proposals rather than corpus-level counts, so it remains effective when exact dataset statistics are unavailable. Let $y \in C \cup \{U\}$ be the ground-truth label for a proposal $r$ and let $s(r) \in [0, 1]$ be our soft unknown score. At a fixed decision threshold $\tau \in (0, 1)$, we predict unknown if $s(r) \geq \tau$ and known otherwise. We summarize operating-point errors with two conditional rates:

$$\mathrm{FKA}(\tau) := \Pr\left(s(r) < \tau \mid y = U\right) \quad \text{(false-known assignment of unknowns)},$$

$$\mathrm{KOR}(\tau) := \Pr\left(s(r) \geq \tau \mid y \in C\right) \quad \text{(known over-rejection)}.$$

Intuitively, FKA is the fraction of truly unknown proposals that are not flagged as unknown (they fall below $\tau$). In contrast, KOR is the fraction of truly known proposals that are incorrectly routed to the unknown branch (they reach or exceed $\tau$). In practice, we estimate these from the evaluation set as

$$\widehat{\mathrm{FKA}}(\tau) = \frac{\#\{r : y = U,\ s(r) < \tau\}}{\#\{r : y = U\}}, \qquad \widehat{\mathrm{KOR}}(\tau) = \frac{\#\{r : y \in C,\ s(r) \geq \tau\}}{\#\{r : y \in C\}}.$$

Counts are taken over proposals with ground-truth object labels; background-only regions are excluded. Lower FKA aligns with higher U-Recall, while lower KOR correlates with improved WI/A-OSE. In Section 5.10 we visualize calibration and the decision geometry of $s(r)$ and report $\widehat{\mathrm{FKA}}$ and $\widehat{\mathrm{KOR}}$ at $\tau = 0.5$.
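The two empirical rates can be computed directly from scores and ground-truth flags, as in the sketch below. The function name and the toy scores are hypothetical; background-only regions are assumed to have been removed before the call, as the text specifies.

```python
import numpy as np

def fka_kor(scores, is_unknown, tau=0.5):
    """Empirical FKA and KOR at threshold tau.

    scores:     soft unknown scores s(r) for proposals with object labels.
    is_unknown: boolean flags, True where the ground-truth label is U.
    Returns NaN for a rate whose conditioning group is empty.
    """
    scores = np.asarray(scores, dtype=float)
    is_unknown = np.asarray(is_unknown, dtype=bool)
    flagged = scores >= tau                      # predicted unknown
    fka = np.mean(~flagged[is_unknown])          # unknowns that were not flagged
    kor = np.mean(flagged[~is_unknown])          # knowns that were flagged
    return float(fka), float(kor)

# Toy evaluation set: three unknown and three known proposals (illustrative only).
scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
is_unknown = [True, True, True, False, False, False]
fka, kor = fka_kor(scores, is_unknown, tau=0.5)
print(fka, kor)
```

Here one of three unknowns falls below the threshold and one of three knowns reaches it, so both rates equal one third.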

3.4. Unsupervised Region Proposal Generation

To discover candidate objects without class bias, we leverage unsupervised region proposal methods. In particular, we utilize Selective Search and FreeSOLO for class-agnostic proposal generation. Selective Search employs bottom-up super-pixel merging to propose object regions, whereas FreeSOLO provides self-supervised instance segmentation, which we convert into bounding boxes. These complementary methods yield a diverse set of region proposals $B = \{b_i\}$ per image, covering both known and unknown objects. We apply non-maximum suppression and size filtering to refine the proposals, resulting in $N$ high-quality candidate regions per image that likely contain an object but have no assigned class. All proposals are treated uniformly in subsequent stages, ensuring that no class-specific heuristics eliminate unknown objects.
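The refinement step (size filtering followed by non-maximum suppression) can be sketched with a standard greedy NMS in NumPy. The IoU threshold, the minimum side length, and the per-proposal ranking scores are assumptions made for illustration; the text does not report these settings or say how proposals are ranked.

```python
import numpy as np

def filter_and_nms(boxes, scores, iou_thr=0.5, min_side=8.0):
    """Drop tiny boxes, then apply greedy NMS. Boxes are (x1, y1, x2, y2).

    iou_thr and min_side are hypothetical defaults, not values from the paper.
    Returns the indices of kept boxes, ordered by descending score.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    areas = w * h
    idx = np.flatnonzero((w >= min_side) & (h >= min_side))  # size filter
    idx = idx[np.argsort(-scores[idx])]                      # best first
    keep = []
    while idx.size > 0:
        i, rest = idx[0], idx[1:]
        keep.append(int(i))
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0.0, None) * np.clip(yy2 - yy1, 0.0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        idx = rest[iou <= iou_thr]                           # suppress heavy overlaps
    return keep

boxes = [[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300], [0, 0, 4, 4]]
scores = [0.9, 0.8, 0.7, 0.95]
keep = filter_and_nms(boxes, scores)
print(keep)
```

In this toy input the 4 x 4 box is removed by the size filter despite its high score, and the second box is suppressed because it overlaps the first almost entirely.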

3.5. Multi-Scale Proposal Feature Extraction

We extract features for each proposal using a Feature Pyramid Network (FPN) backbone. Given an input image, the backbone (e.g., ResNet-50 with FPN) produces a pyramid of feature maps $\{P_\ell\}_{\ell=1}^{L}$. For each region $b_i$, we perform RoIAlign on each pyramid level $P_\ell$ to obtain a fixed-size feature map $x_{i,\ell} \in \mathbb{R}^{C_\ell \times H \times W}$ capturing the content of $b_i$ at scale $\ell$. These multi-scale features $\{x_{i,1}, \dots, x_{i,L}\}$ preserve both coarse context and fine details. We project each $x_{i,\ell}$ to a common embedding dimension $d$ (using $1 \times 1$ convolutions) and flatten spatial dimensions, yielding $n_\ell$ tokens per level. The result is a collection of tokens $X_i = \{x_{i,\ell}^{j} \mid \ell = 1, \dots, L,\ j = 1, \dots, n_\ell\}$ representing proposal $b_i$ across all FPN levels. This forms the input to our transformer-based autoencoder, enabling multi-scale fusion. To ensure consistent assignment of proposals to pyramid levels, we define an area-based bucketing rule. A proposal $r$ with bounding-box area $A(r)$ is mapped to the appropriate feature map according to thresholds:

$$A(r) \in [0, 32^2) \mapsto P_3, \quad [32^2, 64^2) \mapsto P_4, \quad [64^2, 128^2) \mapsto P_5, \quad [128^2, 256^2) \mapsto P_6.$$

This function, denoted by $\mathrm{AreaBucket}(r)$, guarantees that small proposals align with higher-resolution maps and large proposals with lower-resolution ones, and it is used in Algorithm 1 for density indexing and reconstruction error computation.
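The bucketing rule can be written as a small lookup. The function name `area_bucket` and the choice to raise an error for areas of 256² or more are assumptions, since the rule as stated does not say how such proposals are handled.

```python
import bisect

# Upper edges of the half-open area intervals from the bucketing rule.
_EDGES = [32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2]
_LEVELS = ["P3", "P4", "P5", "P6"]

def area_bucket(box):
    """Map a box (x1, y1, x2, y2) to a pyramid level by its area."""
    x1, y1, x2, y2 = box
    area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # bisect_right places an area equal to an edge into the next interval,
    # which matches the half-open ranges [lo, hi).
    idx = bisect.bisect_right(_EDGES, area)
    if idx >= len(_LEVELS):
        # Assumption: areas >= 256^2 are not covered by the stated rule.
        raise ValueError(f"area {area} is outside the defined buckets")
    return _LEVELS[idx]

print(area_bucket((0, 0, 10, 10)))    # small box
print(area_bucket((0, 0, 200, 200)))  # large box
```

Because the intervals are half-open, a 32 x 32 box (area exactly 32²) maps to P4 rather than P3.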

3.6. Cross-Level Transformer Autoencoder

We design a transformer autoencoder to compress and reconstruct the multi-scale proposal features. A transformer encoder fuses cross-level tokens for each proposal. We prepend a learnable [PROPOSAL] token $x_i^{cls}$ to $X_i$ and feed the sequence into $L_e$ layers of multi-head self-attention (MHSA) and feed-forward networks, producing

$$Z_i = \{z_i^{cls}, z_{i,1}^{1}, \dots, z_{i,L}^{n_L}\}, \qquad z_i^{cls} \in \mathbb{R}^{d}.$$

Each attention head computes

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V. \quad (6)$$

In Equation (6), $Q$, $K$, and $V$ are obtained from linear projections of the inputs. The [PROPOSAL] token attends to all-level features, yielding a compact cross-level representation $z_i^{cls}$. A transformer decoder uses $z_i^{cls}$ to reconstruct the original multi-scale features via $L_d$ layers of cross-attention from $z_i^{cls}$ to level/location query tokens, producing initial reconstructions $\tilde{x}_{i,\ell}$. To preserve fine details, we add attention-gated skip connections: we compute $a_\ell = \sigma(\mathrm{MLP}(\mathrm{GAP}(x_{i,\ell})))$ (GAP: global average pooling; $\sigma$: sigmoid) and form the final reconstruction

$$\hat{x}_{i,\ell} = \tilde{x}_{i,\ell} + a_\ell \odot x_{i,\ell}, \quad (7)$$

where $\odot$ denotes channel-wise multiplication. The autoencoder is trained by minimizing the per-pixel reconstruction loss in Equation (8), which forces $z_i^{cls}$ to capture the essential content of known objects while being less faithful to unknown/background patterns:

$$L_{rec} = \sum_{i=1}^{|B|} \sum_{\ell=1}^{L} \left\| \hat{x}_{i,\ell} - x_{i,\ell} \right\|_2^2. \quad (8)$$
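The attention operation in Equation (6) and the gated skip connection in Equation (7) can be sketched in NumPy as follows. This is a single-head illustration without the learned projections; the gate `a` is passed in as a plain per-channel vector standing in for the output of the sigmoid-MLP-GAP branch, and all names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (6): scaled dot-product attention for one head.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (output, weights).
    """
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V, weights

def gated_skip(x_tilde, x, a):
    """Eq. (7): add a channel-gated copy of the input to the decoder output.

    x_tilde, x: (C, H, W) feature maps; a: (C,) gate values in [0, 1].
    """
    return x_tilde + a[:, None, None] * x

# Toy check: identical keys give uniform weights, so the output is the mean of V.
Q = np.ones((2, 4))
K = np.ones((3, 4))
V = np.arange(15.0).reshape(3, 5)
out, w = attention(Q, K, V)
print(out)
```

When the gate is zero the reconstruction relies entirely on the decoder output, and when it is one the input features pass through unchanged on top of it, so the gate controls how much fine detail bypasses the bottleneck.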

3.7. Reconstruction Error Estimation

After the cross-level autoencoding stage, each proposal $b_{i}$ has a set of reconstructed feature maps $\{{\hat{x}}_{i, ℓ}\}_{ℓ = 1}^{L}$. To quantify how well the autoencoder explains the content of $b_{i}$ at each FPN level ℓ, we compute a location-wise discrepancy between the original feature $x_{i, ℓ}$ and its reconstruction ${\hat{x}}_{i, ℓ}$. We adopt an $ℓ_{1}$ deviation because it is less sensitive to occasional large activations and yields sharper error maps that align with object boundaries. Formally, letting $(u, v)$ index the spatial grid,

E_{i, ℓ} (u, v) = ∥ x_{i, ℓ} (u, v) - {\hat{x}}_{i, ℓ} (u, v) ∥_{1} .

(9)

These per-pixel maps highlight regions that the autoencoder fails to reproduce, which typically arise for out-of-distribution structures (unknown objects or clutter). To summarize evidence across resolution levels and spatial locations into a single uncertainty score for the proposal, we average the error magnitudes first over the spatial grid and then across levels:

e_{i} = \frac{1}{L} \sum_{ℓ = 1}^{L} {mean}_{(u, v)} [E_{i, ℓ} (u, v)] .

(10)

Small values of $e_{i}$ indicate that the proposal lies on (or near) the manifold of known objects learned by the autoencoder. In contrast, large values signal content that is poorly reconstructable and thus likely unknown or background. A single threshold on $e_{i}$ is brittle because reconstruction errors vary with scale, texture, and context. Instead, we model the distribution of errors for two groups collected during training: (1) proposals aligned to ground-truth known instances, producing samples $\{e_{i}^{kn}\}$, and (2) proposals not matching any known instance, producing $\{e_{j}^{bg}\}$ (background and potential unknowns). We estimate class-conditional densities for these sets using three complementary families and then select the best one based on validation likelihood or unknown detection performance.
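As a minimal sketch of Equations (9) and (10), the snippet below computes the per-location $ℓ_{1}$ error map and the proposal-level score for toy features stored as nested `[channel][row][column]` lists. The layout and the function names are illustrative assumptions, not part of the original pipeline.

```python
def l1_error_map(x, x_hat):
    """Equation (9): channel-wise L1 deviation at every spatial location."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    return [[sum(abs(x[c][u][v] - x_hat[c][u][v]) for c in range(channels))
             for v in range(width)]
            for u in range(height)]

def proposal_error(levels):
    """Equation (10): spatial mean per level, then mean across levels.

    `levels` is a list of (x, x_hat) pairs, one pair per FPN level.
    """
    per_level = []
    for x, x_hat in levels:
        flat = [e for row in l1_error_map(x, x_hat) for e in row]
        per_level.append(sum(flat) / len(flat))
    return sum(per_level) / len(per_level)

# Toy proposal with two levels of different shape.
level_1 = ([[[1.0, 2.0], [3.0, 4.0]]], [[[0.0, 0.0], [0.0, 0.0]]])   # 1 x 2 x 2
level_2 = ([[[2.0]], [[-2.0]]], [[[1.0]], [[0.0]]])                   # 2 x 1 x 1
e_i = proposal_error([level_1, level_2])
```

Averaging over the spatial grid first keeps high-resolution levels from dominating the score simply because they contain more locations.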

3.8. Distribution Fitting for Uncertainty Estimation

Reconstruction errors vary significantly depending on whether a proposal corresponds to a known object, an unknown object, or pure background. A fixed threshold is insufficient to capture this variability, particularly when the error distributions are multi-modal or heavy-tailed. To achieve robust separation, we explicitly fit probabilistic models to the reconstruction errors collected during training. For proposals aligned to labeled ground-truth objects, we obtain a set $\{e_{i}^{kn}\}$ representing known-class errors, while errors from proposals not overlapping any labeled object form $\{e_{j}^{bg}\}$, which includes background and potentially unknown objects. By learning densities $f_{kn} (e)$ and $f_{bg} (e)$ for these two sets, our framework transforms raw errors into calibrated likelihoods that drive pseudo-labeling and unknown discovery.

3.8.1. Kernel Density Estimation (KDE)

As reconstruction errors do not necessarily follow simple parametric laws, we first employ nonparametric Kernel Density Estimation (KDE), which flexibly models arbitrary shapes. This is particularly useful when known-class errors are sharply peaked while background errors are dispersed. Given $N_{bg}$ background errors, the estimated density is

{\hat{f}}_{bg} (e) = \frac{1}{N_{bg} h} \sum_{j = 1}^{N_{bg}} K (\frac{e - e_{j}^{bg}}{h}),

(11)

where K is a Gaussian kernel and h is the bandwidth controlling smoothness. An analogous estimate ${\hat{f}}_{kn} (e)$ is obtained for known-class errors. KDE provides a data-driven baseline that adapts to the empirical structure of reconstruction errors.
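Equation (11) can be written directly in a few lines. The sketch below is a plain-Python Gaussian KDE. The sample values and the bandwidth `h=0.2` are made-up illustrations, since the paper does not state how the bandwidth is chosen.

```python
import math

def gaussian_kde(samples, h):
    """Return the density of Equation (11) with a standard Gaussian kernel K."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(e):
        return norm * sum(math.exp(-0.5 * ((e - s) / h) ** 2) for s in samples)

    return density

bg_errors = [0.7, 0.9, 1.4]            # hypothetical background errors
f_bg = gaussian_kde(bg_errors, h=0.2)  # hypothetical bandwidth

# Crude Riemann sum over [-2, 4): a valid density should integrate to about 1.
area = sum(f_bg(-2.0 + 0.01 * i) for i in range(600)) * 0.01
```

A smaller `h` follows individual samples more closely and a larger `h` smooths them out, which is the overfitting/underfitting trade-off that motivates the parametric alternatives that follow.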

3.8.2. Gaussian Mixture Model (GMM)

While KDE is flexible, it can overfit or underfit depending on the bandwidth. To obtain a smooth and interpretable alternative, we also fit a Gaussian Mixture Model (GMM). This model assumes that reconstruction errors arise from a mixture of two latent clusters—low error (known) and high error (background/unknown). Formally, the error density is expressed as

f (e) = π_{kn} N (e ∣ μ_{kn}, σ_{kn}^{2}) + π_{bg} N (e ∣ μ_{bg}, σ_{bg}^{2}),

(12)

where $π_{kn}$ and $π_{bg}$ are the mixture weights and $(μ_{kn}, σ_{kn}^{2})$ and $(μ_{bg}, σ_{bg}^{2})$ are the component means and variances, all learned via expectation–maximization. This parametric approach provides stable boundaries between error regimes, facilitating the probabilistic assignment of proposals.
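For illustration, a two-component, one-dimensional mixture like Equation (12) can be fitted with a short expectation–maximization loop. This is a sketch under assumptions: the initialization (splitting the sorted errors in half), the variance floor, the iteration count, and the synthetic errors are all choices made here and are not taken from the paper.

```python
import math
import random

def normal_pdf(e, mu, var):
    return math.exp(-0.5 * (e - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_two_component_gmm(errors, iters=100):
    """EM for Equation (12); index 0 is the low-error ("known") component."""
    srt = sorted(errors)
    half = len(srt) // 2
    mu = [sum(srt[:half]) / half, sum(srt[half:]) / (len(srt) - half)]
    mean_all = sum(srt) / len(srt)
    var_all = sum((e - mean_all) ** 2 for e in srt) / len(srt)
    var, weights = [var_all, var_all], [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each error value.
        resp = []
        for e in errors:
            p = [weights[k] * normal_pdf(e, mu[k], var[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300        # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(errors)
            mu[k] = sum(r[k] * e for r, e in zip(resp, errors)) / nk
            var[k] = max(sum(r[k] * (e - mu[k]) ** 2
                             for r, e in zip(resp, errors)) / nk, 1e-6)
    return weights, mu, var

rng = random.Random(0)                              # synthetic, illustrative errors
errors = ([rng.gauss(0.2, 0.05) for _ in range(200)]
          + [rng.gauss(1.0, 0.2) for _ in range(200)])
weights, mu, var = fit_two_component_gmm(errors)
```

On these well-separated synthetic errors, the low-error component plays the role of the known cluster and the high-error component that of the background/unknown cluster.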

3.8.3. Exponential Weibull with GPD Tail

Open-world detection must also handle extreme reconstruction errors produced by highly novel or noisy regions. To capture this heavy-tailed behavior, we fit an Exponential Weibull (EW) distribution to the background errors. The EW cumulative distribution function $F$ and probability density function $f_{EW}$ are given by

F (e) = {[1 - exp {- {(e / λ)}^{k}}]}^{α},

f_{EW} (e) = α \frac{k}{λ} {(\frac{e}{λ})}^{k - 1} {[1 - e^{- {(e / λ)}^{k}}]}^{α - 1} e^{- {(e / λ)}^{k}}, e \geq 0,

(13)

In Equation (13), $λ$ is the scale parameter and $k$ and $α$ are shape parameters. To further refine the modeling of extreme cases, we augment EW with a Generalized Pareto Distribution (GPD) fitted to the upper tail (e.g., errors beyond the 95th percentile). This hybrid treatment ensures that substantial errors characteristic of unknowns are not underestimated. With KDE, GMM, and EW+GPD as candidates, our framework selects the most appropriate density model for each error set based on validation likelihood and detection performance. The resulting calibrated densities $f_{kn} (e)$ and $f_{bg} (e)$ provide a principled probabilistic foundation for uncertainty-aware pseudo-labeling, ensuring that ambiguous proposals are weighted appropriately and extreme unknowns are properly emphasized.
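The bulk–tail idea can be sketched as follows: the functions below evaluate the EW distribution of Equation (13) (commonly called the exponentiated Weibull) and splice a GPD-shaped tail above a chosen quantile. How the paper actually fits and joins the two pieces is not specified, so the splicing rule, the parameter values, and all function names here are assumptions.

```python
import math

def ew_cdf(e, lam, k, alpha):
    """Equation (13), F(e): exponentiated Weibull CDF for e >= 0."""
    return (1.0 - math.exp(-((e / lam) ** k))) ** alpha

def ew_pdf(e, lam, k, alpha):
    """Equation (13), f_EW(e): derivative of ew_cdf with respect to e."""
    z = (e / lam) ** k
    return (alpha * (k / lam) * (e / lam) ** (k - 1)
            * (1.0 - math.exp(-z)) ** (alpha - 1) * math.exp(-z))

def ew_quantile(q, lam, k, alpha):
    """Inverse of ew_cdf, used here to place the tail threshold."""
    return lam * (-math.log(1.0 - q ** (1.0 / alpha))) ** (1.0 / k)

def gpd_survival(y, xi, beta):
    """P(Y > y) for an exceedance y >= 0 under a GPD with shape xi, scale beta."""
    if abs(xi) < 1e-12:
        return math.exp(-y / beta)
    return max(0.0, 1.0 + xi * y / beta) ** (-1.0 / xi)

def hybrid_survival(e, ew_params, tail_q, xi, beta):
    """EW bulk up to the tail_q quantile u, GPD-shaped tail beyond u."""
    u = ew_quantile(tail_q, *ew_params)
    if e <= u:
        return 1.0 - ew_cdf(e, *ew_params)
    return (1.0 - tail_q) * gpd_survival(e - u, xi, beta)

params = (0.5, 1.5, 2.0)             # hypothetical (lam, k, alpha)
u95 = ew_quantile(0.95, *params)     # threshold at the 95th percentile
```

Scaling the GPD survival by `1 - tail_q` keeps the spliced survival function continuous at the threshold, so the tail model only changes how quickly probability decays for extreme errors.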

3.9. Uncertainty-Aware Pseudo-Labeling

Rather than binarizing a proposal by a hard threshold, we compute posterior-like soft knownness using the fitted densities. Given error e, the probability of being generated by the known distribution is

P_{known} (e) = \frac{f_{kn} (e)}{f_{kn} (e) + f_{bg} (e)} .

(14)

This continuous score naturally down-weights ambiguous cases. During detection training, we weight the loss terms by these confidence scores: for proposals labeled as unknown, we use $w_{i} = 1 - P_{known} (e_{i})$, and for known proposals, we use $w_{i} = P_{known} (e_{i})$. In effect, uncertain samples contribute less, which stabilizes pseudo-supervision and reduces noise propagation.
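A minimal sketch of Equation (14) and the resulting loss weights is shown below. The two Gaussian densities are hypothetical stand-ins for whichever fitted models are selected, and the small `eps` term is an added numerical safeguard that does not appear in Equation (14).

```python
import math

def gaussian_pdf(e, mu, sigma):
    return math.exp(-0.5 * ((e - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def p_known(e, f_kn, f_bg, eps=1e-12):
    """Equation (14): soft knownness from the two fitted densities."""
    kn, bg = f_kn(e), f_bg(e)
    return kn / (kn + bg + eps)

def loss_weight(e, pseudo_unknown, f_kn, f_bg):
    """w = 1 - P_known for pseudo-unknown proposals, P_known for known ones."""
    p = p_known(e, f_kn, f_bg)
    return 1.0 - p if pseudo_unknown else p

f_kn = lambda e: gaussian_pdf(e, 0.2, 0.1)   # hypothetical known-error density
f_bg = lambda e: gaussian_pdf(e, 0.8, 0.1)   # hypothetical background density
```

With these stand-in densities, an error halfway between the two modes receives a weight of about 0.5 under either label, while errors close to a mode receive weights near 0 or 1.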

3.10. Prototype-Guided Contrastive Learning

To further separate known from unknown in the representation space, we maintain a dynamic prototype $p$, defined as the $ℓ_{2}$-normalized mean of confident known RoI embeddings, and align proposal features relative to this prototype. In practice, we update $p$ by using RoI features extracted from FPN level $P_{2}$, which provides a stable balance between spatial resolution and semantic abstraction across scales. This design ensures consistency and avoids scale dominance when maintaining the prototype throughout incremental training.

For a proposal feature $f_{i}$, we measure cosine similarity and its induced distance:

S_{i} = cos (f_{i}, p) = \frac{f_{i}^{⊤} p}{∥ f_{i} ∥ ∥ p ∥}, \quad D_{i} = 1 - S_{i} .

(15)

Let $y_{i}^{kn} \in \{0, 1\}$ indicate known (1) versus unknown (0). We then enforce compactness for known and a margin for unknown via

L_{proto} = y_{i}^{kn} D_{i} + (1 - y_{i}^{kn}) max \{0, Δ - D_{i}\},

(16)

where $Δ$ is a margin (e.g., $Δ = 1$). This objective pulls known features toward $p$ and pushes unknowns away, creating a wide buffer around the known manifold. The induced score $1 - cos (f_{i}, p)$ also acts as a feature-space anomaly cue that complements the reconstruction error signal.
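Equations (15) and (16) can be sketched for a single proposal as follows. The two-dimensional embeddings are toy values, and the prototype update shown is a plain batch mean. Any running-average schedule used in practice is not specified in the text and is therefore left out.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def update_prototype(embeddings):
    """Prototype p: the L2-normalized mean of confident known embeddings."""
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]
    return l2_normalize(mean)

def cosine_distance(f, p):
    """Equation (15): D = 1 - cos(f, p)."""
    dot = sum(a * b for a, b in zip(f, p))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_p = math.sqrt(sum(b * b for b in p))
    return 1.0 - dot / (norm_f * norm_p)

def proto_loss(f, p, is_known, margin=1.0):
    """Equation (16): pull known features in, push unknowns past the margin."""
    d = cosine_distance(f, p)
    return d if is_known else max(0.0, margin - d)

p = update_prototype([[1.0, 0.0], [1.0, 0.0]])   # toy known embeddings
```

With a margin of 1, an unknown feature stops contributing to the loss once it is orthogonal to (or points away from) the prototype.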

3.11. Incremental Open-World Training

We adopt an incremental learning setting. The model is first trained on an initial set of known classes; all other objects are treated as unknown. Then, over tasks $t = 1, \dots, T$, new classes with labels are introduced, and the detector is fine-tuned without forgetting previous classes. At task t, images contain labeled instances of current classes $K_{t}$ and unlabeled instances of future classes $U_{t}$. An "unknown" category is kept throughout. During training at task t, our reconstruction-based module provides soft labels $P_{known}$ to flag high-error proposals as unknown; the known-class prototype $p$ is updated to include $K_{t}$, and $L_{proto}$ refines the feature geometry. New-class heads are initialized and learned with standard cross-entropy on labeled proposals; old heads are preserved (optionally with distillation). Across tasks, proposals that do not match any current known class are assigned to the unknown category based on reconstruction uncertainty and objectness. When a previously unknown class becomes known, its instances contribute to supervised learning and to prototype $p$. By the final task, all classes have been introduced, and the detector has learned incrementally while remaining unknown-aware, enabling continual expansion without the need for manual annotation of unknowns.

4. Experimental Setup

4.1. Dataset and Splits

We use MS-COCO 2017 (train2017 for training and val2017 for evaluation). We conduct experiments on two established open-world object detection (OWOD) benchmarks that have been widely adopted in prior studies [1,18]. In line with recent work [22], we also refer to these datasets as S-OWODB (superclass-separated OWOD benchmark) and M-OWODB (superclass-mixed OWOD benchmark). As summarized in Table 2, the full set of 80 COCO categories is partitioned into four disjoint groups, each corresponding to a sequential task in the incremental learning protocol.

During training for task $T_{t}$, all categories belonging to the current and earlier tasks $\{T_{τ} : τ \leq t\}$ are treated as known, while categories from the future tasks $\{T_{τ} : τ > t\}$ are considered unknown. The detector is incrementally optimized using images annotated only for the currently known categories, without access to annotations for unseen or future classes. This setup ensures that the model progressively learns to recognize new categories while maintaining the ability to identify unknown instances across tasks.

4.2. Evaluation Protocol

Known-class performance is reported as COCO mAP (AP@[0.50], %). Unknown detection is evaluated with U-Recall (R@100, %), Absolute Open-Set Error (A-OSE), and Wilderness Impact (WI), as in [1]. All results use a single test-time configuration: confidence threshold $τ = 0.05$, class-agnostic NMS with IoU $θ = 0.5$, and top-100 detections per image. Unknown metrics use an IoU threshold of ≥0.5 for matching; WI is computed at a known-class recall of $R = 0.8$, the same operating point as A-OSE.

4.3. Evaluation Metrics

To comprehensively assess the performance of the proposed framework, we adopt a set of metrics that measure detection quality for both known and unknown categories.

4.3.1. Mean Average Precision (mAP)

For the known classes, we report mean Average Precision (mAP), the standard metric used in object detection benchmarks. Let $C$ denote the set of known categories and $AP_{c}$ the average precision for class $c \in C$. The mean AP is then given by

mAP = \frac{1}{| C |} \sum_{c \in C} AP_{c},

(17)

where $AP_{c}$ is computed as the area under the precision–recall curve of class c; this metric reflects both localization accuracy and classification reliability of the detector.

4.3.2. Unknown-Object Recall (U-Recall)

For unknown categories, we use the unknown recall metric, which measures the fraction of ground-truth unknown instances that are correctly identified as unknown by the detector. Let $TP^{unk}$ denote the number of correctly predicted unknown objects and $N_{gt}^{unk}$ the total number of ground-truth unknowns. U-Recall is

\text{U-Recall} = \frac{TP^{unk}}{N_{gt}^{unk}} .

(18)

Following prior work, we restrict predictions to those with confidence scores above a fixed threshold (0.05 in our experiments) to ensure robustness.

4.3.3. Recall@K

To analyze detection under ranking constraints, we also compute Recall@K, which evaluates the fraction of unknown objects correctly retrieved among the top-K predictions ranked by their unknown confidence scores. Formally,

\text{Recall@}K = \frac{TP_{@K}^{unk}}{N_{gt}^{unk}},

(19)

where $TP_{@K}^{unk}$ counts the number of true unknown detections within the top-K results.
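Equations (18) and (19) can be computed with one routine, as in the sketch below. It assumes boxes in `(x1, y1, x2, y2)` format and a simple greedy one-to-one matching in descending score order. The exact matching procedure of the evaluation code is not described here, so treat this as an illustration rather than the reference protocol.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def unknown_recall(gt_unknown, detections, k=None, iou_thr=0.5, score_thr=0.05):
    """U-Recall (k=None) or Recall@K over detections predicted as unknown.

    `detections` is a list of (box, unknown_score) pairs.
    """
    kept = sorted((d for d in detections if d[1] >= score_thr),
                  key=lambda d: -d[1])
    if k is not None:
        kept = kept[:k]
    matched = set()
    for box, _ in kept:
        best, best_iou = None, iou_thr
        for gi, g in enumerate(gt_unknown):
            overlap = iou(box, g)
            if gi not in matched and overlap >= best_iou:
                best, best_iou = gi, overlap
        if best is not None:
            matched.add(best)              # each ground truth counts once
    return len(matched) / len(gt_unknown) if gt_unknown else 0.0

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.3), ((50, 50, 60, 60), 0.8)]
```

In this toy example all unknowns are found when every detection is considered, but restricting to the top two scores drops the lower-scored true detection and halves the recall.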

4.3.4. Absolute Open-Set Error (A-OSE)

To quantify when unknown objects are claimed as known, we follow [1] and count unknown ground-truth boxes that are overlapped by any detection labeled as a known class at a fixed operating point. Let $G_{U}$ be the set of unknown ground-truth boxes and $D = \{(b_{i}, {\hat{c}}_{i})\}$ the set of post-NMS detections whose confidence exceeds a threshold chosen to reach known-class recall $R = 0.8$ (same operating point used for WI/mAP). With the standard overlap threshold $IoU \geq 0.5$, we define

\text{A-OSE} = | \{g \in G_{U} : \exists\, i \ \text{s.t.} \ IoU (b_{i}, g) \geq 0.5 \ \text{and} \ {\hat{c}}_{i} \in C\} | .

(20)

Each $g \in G_{U}$ contributes at most one unit regardless of how many detections overlap it (duplicates do not increase the count). Background-only regions are ignored. At the same operating point as A-OSE (confidence chosen to achieve known-class recall R = 0.8, IoU $\geq 0.5$), WI measures the precision drop on known classes in the presence of unknowns:

WI = 1 - \frac{P_{all}}{P_{kn\text{-}only}} \quad (\times 100 \%),

where $P_{kn\text{-}only}$ is the precision on images (or regions) containing only known objects and $P_{all}$ is the precision on the full evaluation set; both use the same confidence threshold and matching protocol.
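The counting rule of Equation (20) is short enough to sketch directly. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, the detections are assumed to be already filtered to the chosen operating point, and the class names are placeholders.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def a_ose(gt_unknown, known_detections, iou_thr=0.5):
    """Equation (20): unknown ground truths covered by a known-class detection.

    `known_detections` holds (box, predicted_known_class) pairs.
    """
    return sum(1 for g in gt_unknown
               if any(iou(box, g) >= iou_thr for box, _cls in known_detections))

gt_unknown = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
known_dets = [((0, 0, 10, 10), "car"),      # covers the first unknown
              ((1, 1, 10, 10), "car"),      # duplicate hit on the same unknown
              ((20, 20, 24, 24), "dog")]    # IoU 0.16 with the second: too small
count = a_ose(gt_unknown, known_dets)
```

Because the count is taken over ground-truth boxes rather than detections, the duplicate hit on the first unknown does not raise the count.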

4.3.5. Wilderness Impact (WI)

Finally, we report the Wilderness Impact (WI), which captures the effect of unknown instances on the overall performance of the known classes. WI is defined as

WI = \frac{\text{A-OSE}}{TP^{kn} + FP^{kn}},

(21)

where $TP^{kn}$ and $FP^{kn}$ denote the true positives and false positives of known categories, respectively. Lower WI indicates that the presence of unknowns causes fewer disruptions to the classification of known objects. Together, these metrics provide a holistic evaluation: mAP quantifies accuracy on known objects, U-Recall and Recall@K measure the ability to detect novel categories, and A-OSE/WI evaluate the robustness of the model against open-set misclassification.

4.3.6. Implementation Details

Our framework is built on the Detectron2 library, with Faster R-CNN [29] serving as the base detector and ResNet-50 [30] combined with a Feature Pyramid Network (FPN) [31] as the default backbone for multi-scale feature extraction. To stabilize training, we initialize the backbone with ImageNet-pre-trained weights and keep the early convolutional layers frozen during the fine-tuning process. The multi-scale feature maps $\{P_{3}, P_{4}, P_{5}, P_{6}\}$ correspond to objects of progressively larger spatial extent, and these are processed by our cross-level transformer autoencoder (CLT-AE). To avoid overfitting or under-representing features across levels, we assign distinct bottleneck dimensions to the autoencoders: $\{32, 16, 8, 4\}$ for $\{P_{3}, P_{4}, P_{5}, P_{6}\}$, respectively. Each CLT-AE consists of a linear embedding, positional and temporal encoding, a transformer encoder with cross-level fusion, and a reconstruction decoder equipped with U-Net-style skip connections and channel-wise attention.

For optimization, we employ stochastic gradient descent (SGD) with an initial learning rate of 0.02, a momentum of 0.9, and a weight decay of $10^{-4}$. The training schedule follows the standard 12-epoch configuration, with self-training performed in rounds of 4 epochs. All experiments are executed on a cluster of four NVIDIA RTX A6000 GPUs with a batch size of 16. Reconstruction error maps derived from the CLT-AE are used to construct proposal-level error distributions. Specifically, at each FPN level, we fit the best of three candidate models, namely Kernel Density Estimation (KDE), Gaussian Mixture Model (GMM), and Exponential Weibull (EW), and further refine the tail distribution using a Generalized Pareto Distribution (GPD). This design enables our method to capture error statistics for proposals of different scales in an adaptive way.

The calibrated densities, together with prototype-guided gating in the feature space, allow the system to assign uncertainty-aware soft labels to mined unknowns. This integration of reconstruction-driven uncertainty with prototype-based similarity ensures robust detection of both known and unknown objects across incremental tasks while maintaining generalization to objects of different sizes.

5. Results and Discussion

We evaluate TAPM under both superclass-separated (S-OWODB) and superclass-mixed (M-OWODB) settings, reporting incremental performance across tasks (T1–T4). Table 3 and Table 4 summarize detection quality on known classes (Prev/Curr mAP), discovery ability on unknown objects (U-Recall), and open-set robustness (A-OSE and WI). We compare our results against representative OWOD baselines, including Faster R-CNN (with/without fine-tuning), ORE, OW-DETR, PROB, and CAT. Two proposal generators are considered for TAPM: Selective Search (SS) [32] and FreeSOLO [33] (FS). Below, we analyze trends by setting and task, and relate gains to the components of TAPM (CLT-AE, density fitting with GPD tails, and prototype-gated soft labels).

5.1. Overall Trends

Across both benchmarks, TAPM consistently achieves the highest unknown recall while preserving or improving mAP on known categories. On S-OWODB, TAPM-FS raises T1 U-Recall to 38.0 (Table 3), surpassing the strongest baseline (CAT: 24.0) by +14.0 points; PROB (17.6) is outperformed by +20.4. These detection gains persist through T2 and T3 (e.g., T2 U-Recall: 36.0 vs. PROB 22.3/CAT 23.0; T3 U-Recall: 35.4 vs. PROB 24.8/CAT 24.6), while Curr mAP concurrently improves (T2: 42.0 vs. 36.0/35.5; T3: 38.9 vs. 30.4/32.6). Under M-OWODB, which is a more complex, label-imbalanced regime, TAPM again leads in unknown recall (e.g., T1: 31.6 vs. CAT 23.7; T2: 31.2 vs. 17.4–19.1; T3: 31.2 vs. 19.6–24.4) while keeping known-class performance competitive (e.g., T2 Curr mAP: 33.9 vs. 32.2/32.7; T3 Curr mAP: 23.6 vs. 22.2/18.7). Figure 3 shows the Current- and Previous-class mAP across all incremental learning tasks for our method and the previous SOTA methods. Figure 4 compares unknown recall between our method and the previous SOTA, highlighting our method's stronger performance on unknown detection. Figure 5 compares A-OSE and WI between our method and the current SOTA methods.

Open-set reliability corroborates these findings. On S-OWODB, TAPM reduces both absolute mistakes on unknowns (A-OSE) and the normalized penalty (WI) in T1 and T2 (e.g., T2: A-OSE 3012 vs. PROB 3358; WI 0.021 vs. 0.031) and remains competitive in T3 (A-OSE 2662 vs. PROB 2546; WI 0.019 vs. 0.018). On M-OWODB, TAPM delivers significant WI reductions in T2 (WI 0.020 vs. PROB 0.034) while also lowering A-OSE (5815 vs. 6452). Together, these results indicate that TAPM not only finds more unknowns but also curbs the tendency to mislabel them as known.

5.2. S-OWODB: Incremental Learning with Clear Class Boundaries

In task 1, with no prior knowledge of future classes, TAPM-FS achieves the strongest balance between discovery and precision (U-Recall of 38.0 and mAP of 75.0). Relative to the best baseline (CAT), TAPM improves unknown recall by +14.0 while also nudging mAP upward (+0.8). The corresponding A-OSE/WI are the lowest among all methods. As summarized in Table 4, both TAPM-FS and TAPM-SS achieve the lowest A-OSE and WI across tasks 1–3 on S-OWODB and M-OWODB, indicating fewer unknowns misclassified as known and a smaller precision drop when unknowns are present. The score-based analysis provides the probabilistic unknown score $s (r)$ and its operating-point error rates, directly linking the score to detection performance (U-Recall, WI, and A-OSE) at the same operating point (in Table 4, T1: A-OSE/WI of 1610/0.019), reflecting fewer misclassifications of unknowns as known. As new classes arrive in task 2, TAPM maintains strong retention (Prev mAP 68.5) and boosts the learning of current classes (Curr mAP 42.0), outperforming PROB/CAT by +6.0/+6.5 points, respectively. Unknown discovery remains high (U-Recall of 36.0), and open-set errors drop substantially (A-OSE of 3012 and WI of 0.021), indicating that prototype-gated soft-labeling effectively suppresses false-known assignments for ambiguous proposals.

In task 3, it is evident that TAPM continues to lead in U-Recall (35.4) and Curr mAP (38.9), with retention comparable to that of CAT (Prev mAP of 51.0 vs. 51.2). Although PROB attains a slightly lower A-OSE (2546) and WI (0.018), TAPM’s WI remains competitive (0.019), and the framework achieves higher recall and Curr mAP. This trade-off suggests that TAPM adopts a more recall-seeking stance with careful normalization of open-set risk. When all classes are revealed (no “unknowns”) in task 4, TAPM sustains top-tier mAP (Prev/Curr: 46.1/35.9), indicating that earlier pseudo-labeling did not erode final supervised learning.

5.3. M-OWODB: Mixed Superclasses and Stronger Imbalance

The mixed-superclass regime intensifies overlaps between known/unknown semantics and amplifies tail effects. TAPM remains robust. As shown in task 1, TAPM-FS yields 31.6 U-Recall and 61.8 mAP, improving unknown discovery by +7.9 over CAT with a +1.8 mAP gain. A-OSE/WI also improves (in Table 4, T1: 5019/0.055 vs. PROB 5195/0.057), indicating fewer open-set mistakes after normalization. In task 2, TAPM excels in both discovery and stability (U-Recall of 31.2; Prev/Curr mAP of 58.0/33.9), while WI is decreased to 0.020 (vs. PROB 0.034). This significant drop (∼41% relative) indicates that CLT-AE’s error modeling plus GPD tail calibration provides a better separation between unknowns and knowns even under mixed superclasses.

In task 3, TAPM secures the best U-Recall (31.2) and Curr mAP (23.6) and matches the PROB WI (both $\approx 0.015$). A-OSE is higher than that of PROB (5059 vs. 2641), indicating more residual confusion on especially ambiguous unknowns. If an application prioritizes conservative unknown handling, this can be mitigated by slightly tightening the prototype gate or raising the unknown threshold, typically with a marginal loss in recall. As in S-OWODB, TAPM maintains known-class accuracy in task 4, achieving Prev/Curr mAP of 36.8/20.1 and edging out alternatives (e.g., PROB: 35.7/18.9), which confirms that our pseudo-labeling strategy does not hinder the final fully supervised phase.

5.4. Impact of Proposal Generator: SS vs. FS

FreeSOLO proposals consistently provide a slight yet reliable edge over Selective Search in both discovery and detection quality. On S-OWODB T1, TAPM-FS improves U-Recall by +4.0 over TAPM-SS (38.0 vs. 34.0) with a slight mAP gain (75.0 vs. 74.5). On M-OWODB T1, the gap is smaller but still positive (31.6 vs. 30.9). The pattern repeats across tasks, indicating that denser, segmentation-driven proposals better cover unknown instances, letting CLT-AE and the density model operate on richer candidate sets.

5.5. Why TAPM Works

(i): CLT-AE for tail calibration. The cross-level transformer autoencoder aligns multi-scale features and reconstructs object-centric structure; unknowns and atypical tails then manifest as larger reconstruction errors. This shift makes low-data (tail) modes more separable from background, improving Curr mAP on newly introduced classes (e.g., S-OWODB T2/T3: +6–8 points over PROB/CAT).
(ii): Best-of density with GPD tails. Selecting the most suitable error model per level (KDE/GMM/EW) and refining extremes with a GPD tail sharpen the posterior $P_{known} (e)$. The substantial WI reductions in M-OWODB T2 (0.020 vs. 0.034) exemplify improved calibration where mixed semantics typically blur boundaries.
(iii): Prototype-gated soft labels. Cosine gating against a running known prototype suppresses background-like or off-manifold proposals, cutting false-known assignments (lower A-OSE/WI) without sacrificing recall. The joint effect with calibrated densities explains why TAPM achieves both higher unknown recall and competitive (often superior) mAP.

5.6. Computational Complexity Analysis

All training experiments were conducted on four NVIDIA RTX A6000 GPUs. Throughput (FPS) and floating-point cost (GFLOPs) are reported per image at inference and are measured under identical pre-/post-processing and input resolution for all methods. The notation “NG” in Table 5 indicates that an explicit external proposal generation stage is not required (i.e., DETR-family and end-to-end detectors); otherwise, the reported training time includes proposal generation where applicable. Table 5 summarizes training time, inference speed, and computational cost on S-OWODB. Relative to a fine-tuned Faster R-CNN baseline (14 h, 25 FPS, 185 GFLOPs), the proposed TAPM variants increase wall-clock training time moderately to 21 h while keeping inference essentially unchanged (24 FPS) with comparable arithmetic cost (180 GFLOPs). Most of the extra training time comes from CLT-AE reconstruction and the bulk–tail (GPD) fit; at inference, the overhead is negligible (a few lightweight projections and prototype cosine checks per proposal). Rows marked “NG” (CAT and PROB) are DETR-based and thus skip proposal generation but still train for substantially longer (46 h and 40 h) and run slower at inference (18–20 FPS) despite similar GFLOPs, reflecting the higher optimization cost of end-to-end transformer training under our settings. RE-OWOD (30 h, 24 FPS, 180 GFLOPs), MEPU (25 h, 24 FPS, 180 GFLOPs), and UC-OWOD (30 h, 24 FPS, 185 GFLOPs) use Faster R-CNN with proposals and show throughput close to the baseline. OWOBJ is a plugin; the numbers shown reflect a Faster R-CNN host (proposals are required). When attached to a DETR host, proposal generation is not used (NG would apply), and throughput follows the base DETR configuration.

For TAPM-FS/SS, the additional training time arises from the tail-calibrated transformer encoding, specifically (i) the CLT-AE reconstruction objective and (ii) the bulk–tail density calibration. At inference, the prototype gate computes a cosine similarity between each proposal embedding $h (r) \in \mathbb{R}^{d}$ and K stored prototypes, i.e., $O (K d)$ per proposal (or $O (P K d)$ per image with P proposals), which is small relative to backbone feature extraction FLOPs. This is consistent with the near parity in FPS and GFLOPs with Faster R-CNN: TAPM-FS/SS delivers the open-set robustness gains reported in Section 5 and Section 5.10 with minimal inference overhead (≈4% lower FPS than Faster R-CNN and similar GFLOPs) and a moderate increase in training time. For completeness, the prototype cache is a $K \times d$ matrix (about $4 K d$ bytes in FP32). By contrast, transformer-based baselines trained in an end-to-end manner require substantially longer optimization and exhibit lower runtime throughput under the same evaluation conditions.
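To give a sense of scale for the $O (P K d)$ gate cost and the $4 K d$-byte cache, the arithmetic can be spelled out as below. The values of K, d, and P are hypothetical placeholders chosen only for illustration. They are not reported in the text.

```python
def prototype_cache_bytes(num_prototypes, dim, bytes_per_value=4):
    """Size of a K x d prototype matrix; 4 bytes per value corresponds to FP32."""
    return num_prototypes * dim * bytes_per_value

def gate_macs_per_image(num_proposals, num_prototypes, dim):
    """Multiply-accumulates for P x K dot products of length d."""
    return num_proposals * num_prototypes * dim

K, d, P = 80, 256, 100                        # hypothetical sizes
cache = prototype_cache_bytes(K, d)           # 81,920 bytes (80 KiB)
macs = gate_macs_per_image(P, K, d)           # 2,048,000 multiply-accumulates
```

Under these assumed sizes the gate needs about two million multiply-accumulates per image, which is several orders of magnitude below the roughly 180 GFLOPs reported for the full detector.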

5.7. Discussion

TAPM maintains a high Prev mAP while increasing Curr mAP in T2/T3, indicating that pseudo-unknown mining does not destabilize incremental learning. The framework is also robust to mixing (M-OWODB): even with heavier class overlap, TAPM maintains the best U-Recall and competitive WI. If an application requires stricter open-set conservatism, a slightly higher prototype threshold or a more conservative posterior cutoff can reduce A-OSE with limited impact on recall. Across two challenging OWOD regimes, TAPM improves unknown-object discovery by large margins while maintaining or enhancing known-class accuracy, and it reduces open-set confusion as measured by WI/A-OSE in most phases. The gains stem from a principled combination of reconstruction-driven uncertainty, tail-aware density calibration, and prototype-guided filtering.

5.8. Training Loss Analysis Across Tasks

Figure 6 presents the overall incremental learning training loss trends for tasks 1–4. Both the raw training loss and its exponentially smoothed version (EMA, $α = 0.1$) are shown to better visualize the convergence behavior. Across all tasks, the curves follow a consistent pattern: an initial rapid loss reduction during the early iterations, reflecting the model’s rapid adaptation to the data from the current task, followed by a gradual stabilization phase as training progresses. The EMA curves demonstrate smooth convergence, highlighting the stability of the optimization process across all incremental tasks.
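For reference, exponential smoothing with $α = 0.1$ can be sketched as below. The text does not state which EMA convention is used for Figure 6, so this assumes the common form $s_{t} = α x_{t} + (1 - α) s_{t - 1}$, initialized with the first raw value.

```python
def ema(values, alpha=0.1):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed, state = [], None
    for value in values:
        state = value if state is None else alpha * value + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed

raw_loss = [1.4, 1.0, 1.2, 0.8, 0.9, 0.6]    # made-up loss values
smooth_loss = ema(raw_loss, alpha=0.1)
```

With a small $α$ the smoothed curve reacts slowly to individual iterations, which is why it exposes the overall trend more clearly than the raw loss.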

A task-wise comparison reveals that task 1, which introduces the largest group of classes, begins with the highest loss (around 1.4) but steadily converges to approximately 0.6. By contrast, tasks 2–4 start with lower initial loss values (ranging from 1.2 to 1.0), benefiting from the accumulated knowledge gained in previous tasks. These later tasks converge more quickly and reach lower final loss levels between 0.55 and 0.45. Another notable trend is the progressive reduction in final loss across tasks. Task 4 achieves the lowest overall loss, indicating that as more classes are introduced incrementally, the model becomes increasingly robust and effective in balancing previously learned knowledge with new information. This suggests that the incremental training framework not only preserves past knowledge but also improves efficiency in subsequent tasks.

5.9. Cross-Dataset Generalization Performance of TAPM

To evaluate the generalization ability of the proposed TAPM framework, we conduct experiments across two large-scale benchmarks: LVIS and Objects-365 (Table 6). These datasets differ significantly in their object distributions and annotation densities, making them a strong testbed for assessing robustness beyond the training domain. On LVIS, Faster R-CNN fails to recover any unknown objects (R@100 = 0.0), highlighting its limited open-world capability. CAT and PROB achieve modest recall values (R@100 = 35.2 and 40.5, respectively), but their known-class AP is notably lower than that of TAPM. By contrast, both TAPM-SS and TAPM-FS consistently outperform all baselines, yielding higher AP on known classes (39.3 and 38.5, respectively) while also improving the recall of unknown instances, with TAPM-FS achieving the best overall unknown recall (R@100 = 47.2).

A similar trend is observed on Objects-365. TAPM surpasses prior approaches in both known-class AP and unknown recall. While Faster R-CNN again fails to generalize to unknowns (R@100 = 0.0), CAT and PROB improve recall to 37.4 and 40.8, respectively, albeit at the cost of reduced AP. TAPM-SS achieves 38.2 AP with 46.9 R@100, while TAPM-FS maintains a competitive AP (37.0) and achieves the highest unknown recall (47.8). These results demonstrate that TAPM effectively balances closed-set accuracy and open-set discovery, a crucial property for real-world deployment. Overall, the results confirm that TAPM delivers superior cross-dataset generalization compared with existing open-world detection baselines. The consistent improvements on both LVIS and Objects-365 validate the robustness of our proposal generation and thresholding strategies in handling diverse and previously unseen object categories.

5.10. Prototype Guidance: Graphical and Quantitative Analysis

We evaluate $s (r)$ using two views: Figure 7 (left) is a reliability plot that bins predictions and compares the bin-mean score $\bar{s}$ to the empirical $Pr (y = U ∣ bin)$; Figure 7 (right) shows the decision geometry in $(p_{tail} (r), \tilde{g} (r))$ with the boundary $s (r) = τ$ overlaid. We then report the conditional operating-point error rates $\hat{FKA} (τ)$ and $\hat{KOR} (τ)$ at $τ = 0.5$ in Table 7. As Table 7 shows, prototype guidance substantially reduces $\hat{KOR}$ while preserving $\hat{FKA}$, consistent with the boundary shift observed in Figure 7. FKA and KOR are reported at the same operating point used for WI/A-OSE in the main results.

5.11. Statistical Analysis

We quantify variability and test significance under a matched evaluation protocol. Each method is trained with three independent random seeds; the main tables report mean ± standard deviation for mAP, U-Recall, WI, and A-OSE. To capture evaluation set uncertainty, we also use an image-level bootstrap. For each method on S-OWODB task 2 (FS pipeline), we draw 10,000 resamples of the test images (with replacement), recompute all metrics on each resample, and report 95% confidence intervals (CIs) via the percentile method (Table 8).

To assess whether TAPM–FS improves over baselines, we perform paired bootstrap testing: using the same resampled image sets for both methods, we compute per-resample differences as

$$\Delta = \mathrm{metric}(\text{TAPM-FS}) - \mathrm{metric}(\text{comparator}),$$

reporting the 95% CI of $\Delta$ (Table 9). Here, the comparator is the baseline method. CIs that do not include zero indicate statistical significance at $p < 0.05$ (two-sided). mAP and U-Recall are reported as percentages; differences are in percentage points (pp). WI is a unitless precision-drop measure (lower is better). A-OSE is the count of unknown ground-truth objects misclassified as known at the fixed operating point used throughout (IoU $\geq 0.5$ and known-class recall target $R = 0.8$).
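The paired test can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' code: both methods are scored on the same images, the metric is again a ratio of per-image hits to per-image totals, and the key step is that one set of resampled image indices is shared by both methods in each iteration. The name `paired_bootstrap_delta` is hypothetical.

```python
import numpy as np


def paired_bootstrap_delta(hits_a, hits_b, totals,
                           n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI of delta = metric(A) - metric(B), in percentage points."""
    a = np.asarray(hits_a, dtype=float)
    b = np.asarray(hits_b, dtype=float)
    t = np.asarray(totals, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(t)
    deltas = np.full(n_resamples, np.nan)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # the same resample is used for A and B
        denom = t[idx].sum()
        if denom > 0:
            deltas[i] = 100.0 * (a[idx].sum() - b[idx].sum()) / denom
    lo, hi = np.nanpercentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    point = 100.0 * (a.sum() - b.sum()) / t.sum()
    # A CI that excludes zero is read as significant at the chosen alpha.
    significant = not (lo <= 0.0 <= hi)
    return float(point), float(lo), float(hi), significant
```

Sharing the resample removes the variance that comes from image difficulty, so the interval reflects only the difference between the two methods.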

5.12. Ablation Study

Table 10 presents ablations on S-OWODB (task 2). All variants use the same detector and backbone, identical data partitions, and an identical evaluation protocol: greedy assignment with $\mathrm{IoU} \geq 0.5$ and a confidence threshold chosen to achieve known-class recall $R = 0.8$ when computing WI and A-OSE (consistent with ORE). The Baseline excludes all components of the proposed tail-calibrated transformer encoding (TCTE), i.e., no CLT-AE, no bulk–tail density calibration, and no prototype gating. The FS and SS blocks differ only in the proposal generator (FreeSOLO vs. Selective Search); all other settings are identical.
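The fixed operating point can be made concrete with a small sketch. It assumes that greedy matching at IoU $\geq 0.5$ has already been done upstream, so that each known ground-truth object carries the score of its best correct detection and each unknown ground-truth object carries the score of its best known-class detection (negative infinity when there is none). The function names and input layout are hypothetical and only illustrate how a recall target of $R = 0.8$ fixes the threshold at which A-OSE is counted.

```python
import numpy as np


def threshold_at_recall(known_gt_best_scores, target_recall=0.8):
    """Largest confidence threshold whose known-class recall is >= target_recall."""
    s = np.sort(np.asarray(known_gt_best_scores, dtype=float))[::-1]  # descending
    if len(s) == 0:
        raise ValueError("no known ground-truth objects")
    # Number of known objects that must be recalled (epsilon guards float error).
    k = max(1, int(np.ceil(target_recall * len(s) - 1e-9)))
    if np.isneginf(s[k - 1]):
        raise ValueError("target recall is unreachable with these detections")
    return float(s[k - 1])


def absolute_open_set_error(unknown_gt_known_scores, threshold):
    """Count unknown ground-truth objects claimed by a known-class detection."""
    scores = np.asarray(unknown_gt_known_scores, dtype=float)
    return int(np.sum(scores >= threshold))
```

Because every method is thresholded at the same known-class recall, differences in the resulting counts are not an artifact of one detector being run at a more permissive confidence level.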

Introducing CLT-AE substantially improves detection of unknown instances while preserving known-class accuracy. For the FS configuration, U-Recall increases from 0.0 to 28.0 with a concomitant rise in mAP from 39.2 to 40.9; WI decreases from 0.040 to 0.035, and A-OSE is reduced from 4007 to 3850. For SS, U-Recall increases to 27.1 and mAP to 39.9 (from 38.1), with WI improving from 0.045 to 0.039 and A-OSE decreasing from 4207 to 3920. These results indicate that the reconstruction objective already provides discriminative evidence for unknown objectness without degrading performance on known classes.

Converting reconstruction deviations into calibrated evidence yields further reductions in open-set error. For FS, WI decreases from 0.035 to 0.030 and A-OSE from 3850 to 3425, accompanied by an increase in U-Recall from 28.0 to 31.0 and a modest mAP gain to 41.2. For SS, WI decreases from 0.039 to 0.033 and A-OSE from 3920 to 3550, with U-Recall improving from 27.1 to 29.9 and mAP reaching 41.2. These trends confirm that calibration mitigates over-rejection and reduces misclassifications of unknowns as known.

Augmenting the calibrated model with prototype gating produces the largest additional improvement. For FS, WI decreases from 0.030 to 0.021 and A-OSE from 3425 to 3012, while U-Recall rises to 36.0 and mAP to 42.0. For SS, WI decreases to 0.023 and A-OSE to 3130, with a U-Recall of 35.2 and an mAP of 41.0. Relative to the respective baselines, the full models reduce A-OSE by 995 (FS) and 1077 (SS) and lower WI by 0.019 (FS) and 0.022 (SS), while improving mAP by +2.8 (FS) and +2.9 (SS). Across both proposal generators, the progression is monotonic and consistent with the mechanism-level analysis in Section 5.10: CLT-AE increases recall of unknowns, the bulk–tail fit calibrates evidence and reduces WI/A-OSE, and prototype guidance further suppresses head-aligned confounders in the feature space, yielding the largest additional reduction in WI and A-OSE with mAP being maintained or improved.
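The baseline-to-full-model deltas quoted above follow directly from the endpoint values given in the ablation discussion. The short check below recomputes them; the numbers are transcribed from the text, and the dictionary layout and the helper name `deltas` are purely illustrative.

```python
# Endpoint values transcribed from the ablation discussion (S-OWODB, task 2).
ablation = {
    "FS": {
        "baseline": {"mAP": 39.2, "WI": 0.040, "A-OSE": 4007},
        "full": {"mAP": 42.0, "WI": 0.021, "A-OSE": 3012},
    },
    "SS": {
        "baseline": {"mAP": 38.1, "WI": 0.045, "A-OSE": 4207},
        "full": {"mAP": 41.0, "WI": 0.023, "A-OSE": 3130},
    },
}


def deltas(block):
    """Improvement of the full model over the baseline for one proposal generator."""
    base, full = block["baseline"], block["full"]
    return {
        "mAP_gain_pp": round(full["mAP"] - base["mAP"], 1),  # higher is better
        "WI_drop": round(base["WI"] - full["WI"], 3),         # lower is better
        "A-OSE_drop": base["A-OSE"] - full["A-OSE"],          # lower is better
    }


for name, block in ablation.items():
    print(name, deltas(block))
```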

5.13. Qualitative Comparison with SOTA Methods

To complement the quantitative results, we provide a qualitative comparison of the proposed TAPM framework against two SOTA OWOD methods, PROB [22] and CAT [34]. Figure 8 and Figure 9 show diverse open-world scenes in which each SOTA method is compared side by side with TAPM, with PROB [22] or CAT [34] in the top row and TAPM in the bottom row of each example. In Figure 8, column A (tennis scene) shows that PROB frequently produces multiple redundant unknown detections around the same subject and even mislabels background regions, whereas TAPM outputs compact and targeted unknown boxes while maintaining high-confidence recognition of the known person. In column B (paddleboarding), PROB [22] marks large, ambiguous background areas, including regions of water and paddle edges, as unknown, whereas TAPM avoids such spurious detections and produces cleaner results, showing robustness against texture-heavy backgrounds. In column C (ski jump), TAPM successfully identifies additional occluded individuals while preventing the proliferation of background unknowns seen in PROB's outputs. These results highlight TAPM's ability to reduce redundant unknown detections and improve the recall of small or partially visible known objects.

Figure 9 further supports these observations. In the riverside scene (A, left), CAT [34] generates multiple spurious unknown boxes inside the known-class boxes. TAPM, by contrast, suppresses these false alarms and localizes only semantically plausible unknowns while detecting known objects such as people and the bird with tighter, non-overlapping bounding boxes. In the alpine snow scene (B, middle), CAT [34] misses many candidate unknown objects, whereas TAPM recovers them while restricting unknown predictions to genuine candidate regions. In the surf sequence (C), CAT [34] detects only a limited subset of surfers, missing some individuals who are partially occluded or submerged in the wave. Its predictions remain sparse, and uncertain regions are not highlighted, resulting in incomplete coverage of the scene. By contrast, TAPM (bottom) achieves more comprehensive detection, identifying all visible surfers along the wave crest with high confidence. Importantly, TAPM remains restrained in its treatment of background textures: the dynamic water surface is not misclassified as large unknown regions, and only a few small, localized unknown labels appear in ambiguous areas. This balance of recovering additional true positives while avoiding the proliferation of false unknowns demonstrates the improved calibration of TAPM in complex, cluttered environments.

Taken together, the qualitative evidence demonstrates that TAPM consistently achieves three key improvements over PROB [22] and CAT [34]: (1) suppression of background-driven false unknowns, (2) more compact and non-redundant detections, and (3) improved coverage of small or occluded known objects. These visual behaviors align directly with the quantitative trends observed in our experiments, where TAPM achieves higher unknown recall while simultaneously reducing open-set error metrics such as A-OSE and WI.

6. Limitations and Future Work

TAPM delivers consistent gains for open-world detection, yet several practical limitations remain. First, the approach relies on external proposal generators (Selective Search or FreeSOLO); therefore, performance hinges on proposal coverage, which can be fragile for small, thin, or heavily occluded objects. Second, although inference overhead is modest, cross-level transformer encoding and bulk–tail calibration add training time and memory relative to a plain Faster R-CNN (see Section 5.6). Third, the soft routing score requires calibration on a small validation split; under distribution shift, miscalibration may increase WI or A-OSE. Finally, under extreme class imbalance, prototype estimates can be biased toward head classes, potentially suppressing the discovery of rare categories. These constraints are most apparent when proposal recall is low, computational budgets are tight, or the data distribution varies substantially.

For future work, we identify three concrete directions. Real-time and edge deployment: develop lighter cross-level encoders, apply pruning, distillation, and quantization, and adopt approximate prototype search, targeting optimized runtimes for streaming video. Proposal-free integration: embed tail calibration into a one-stage detector (e.g., DETR/YOLO family) so that objectness and novelty cues are learned jointly, removing the external proposal stage and improving small-object recall. Vision–language grounding: leverage CLIP-style encoders to initialize or refine class prototypes, use text prompts to guide the discovery and naming of novel clusters, and evaluate language-conditioned unknown recall on large-vocabulary benchmarks. Additional directions include online or conformal calibration to handle distribution shift, drift-aware prototype updates for continual learning, and selective querying to minimize supervision in safety-critical settings. Collectively, these steps target real-time operation, reduce dependence on external proposals, and improve interpretability while maintaining the accuracy demonstrated in this work.

7. Conclusions

In this work, we presented TAPM (Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining), a novel framework designed to tackle two long-standing challenges in open-world object detection: (i) the bias toward head categories that suppresses tail-class performance, and (ii) the misclassification of unknown objects as background. By integrating a transformer-based autoencoder for feature calibration, prototype-guided mining for robust unknown discovery, and uncertainty-aware soft-labeling for stable pseudo-supervision, TAPM achieves superior performance across both superclass-separated (S-OWODB) and superclass-mixed (M-OWODB) benchmarks. Our experimental results on MS-COCO and LVIS, along with cross-dataset generalization to Objects-365, demonstrate significant improvements in unknown recall, tail-class detection, and overall reliability compared with state-of-the-art baselines. These findings highlight the effectiveness of TAPM in advancing OWOD toward more realistic and scalable deployment.

Author Contributions

Conceptualization, M.A.I.; methodology, M.A.I.; software, M.A.I.; validation, M.A.I., Y.-C.Y. and S.K.K.; formal analysis, M.A.I., Y.-C.Y. and S.K.K.; investigation, M.A.I. and S.K.K.; resources, Y.-C.Y. and S.K.K.; data curation, M.A.I. and S.K.K.; writing—original draft preparation, M.A.I.; writing—review and editing, M.A.I., Y.-C.Y. and S.K.K.; visualization, M.A.I.; supervision, Y.-C.Y. and S.K.K.; project administration, S.K.K.; funding acquisition, S.K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research study was supported by the Regional Innovation System and Education (RISE) program through the Jeju RISE center, funded by the Ministry of Education (MOE) and Jeju Special Self-Governing Province, Republic of Korea (2025-RISE-17-001).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

A-OSE	Absolute Open-Set Error
CI	Confidence interval
CLIP	Contrastive Language–Image Pre-training
CLT-AE	Cross-level transformer autoencoder
FKA	False-known assignment
FPS	Frames Per Second
FRCNN	Faster R-CNN
FS	FreeSOLO (proposal generator)
GFLOPs	Giga Floating-Point Operations
GPD	Generalized Pareto Distribution
IoU	Intersection over Union
KOR	Known over-rejection
mAP	Mean Average Precision
NMS	Non-maximum suppression
OWOD	Open-world object detection
SS	Selective Search (proposal generator)
TAPM	Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining
TCTE	Tail-calibrated transformer encoding
U-Recall	Unknown-object recall
WI	Wilderness Impact

References

1. Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5830–5840.
2. Zhang, S.; Ni, Y.; Du, J.; Xue, Y.; Torr, P.; Koniusz, P.; van den Hengel, A. Open-World Objectness Modeling Unifies Novel Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 30332–30342.
3. Du, X.; Wang, Z.; Cai, M.; Li, Y. Vos: Learning what you don’t know by virtual outlier synthesis. arXiv 2022, arXiv:2202.01197.
4. Hong, J.; Fang, P.; Li, W.; Han, J.; Petersson, L.; Harandi, M. Curved geometric networks for visual anomaly recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 17921–17934.
5. Gupta, A.; Dollar, P.; Girshick, R. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364.
6. Greer, R.; Antoniussen, B.; Møgelmose, A.; Trivedi, M. Language-driven active learning for diverse open-set 3d object detection. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 980–988.
7. Li, T.; Pang, G.; Bai, X.; Miao, W.; Zheng, J. Learning transferable negative prompts for out-of-distribution detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17584–17594.
8. Inoue, R.; Tsuchiya, M.; Yasui, Y. Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 8196–8205.
9. Fang, R.; Pang, G.; Miao, W.; Bai, X.; Zheng, J.; Ning, X. Unsupervised recognition of unknown objects for open-world object detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11340–11354.
10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
11. Kim, D.; Lin, T.Y.; Angelova, A.; Kweon, I.S.; Kuo, W. Learning open-world object proposals without learning to classify. IEEE Robot. Autom. Lett. 2022, 7, 5453–5460.
12. Saito, K.; Hu, P.; Darrell, T.; Saenko, K. Learning to detect every thing in an open world. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 268–284.
13. Wang, W.; Feiszli, M.; Wang, H.; Malik, J.; Tran, D. Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Denver, CO, USA, 3–7 June 2022; pp. 4422–4432.
14. Qi, L.; Kuen, J.; Wang, Y.; Gu, J.; Zhao, H.; Torr, P.; Lin, Z.; Jia, J. Open world entity segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8743–8756.
15. Scheirer, W.J.; de Rezende Rocha, A.; Sapkota, A.; Boult, T.E. Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1757–1772.
16. Bendale, A.; Boult, T.E. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1563–1572.
17. Dhamija, A.; Gunther, M.; Ventura, J.; Boult, T. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1021–1030.
18. Gupta, A.; Narayan, S.; Joseph, K.; Khan, S.; Khan, F.S.; Shah, M. Ow-detr: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9235–9244.
19. Wu, Z.; Lu, Y.; Chen, X.; Wu, Z.; Kang, L.; Yu, J. UC-OWOD: Unknown-classified open world object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 193–210.
20. Iqbal, M.A.; Yoon, Y.C.; Khan, M.U.; Kim, S.K. Improved open world object detection using class-wise feature space learning. IEEE Access 2023, 11, 131221–131236.
21. Iqbal, M.A.; Yoon, Y.C.; Kim, S.K. Redefining Object Detection for Open-World Settings: A Framework for Simultaneous Identification of Known and Unknown Classes. IEEE Access 2024, 12, 179707–179725.
22. Zohar, O.; Wang, K.C.; Yeung, S. Prob: Probabilistic objectness for open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11444–11453.
23. Wang, J.; Chen, B.; Kang, B.; Li, Y.; Xian, W.; Chen, Y.; Xu, Y. Ov-dquo: Open-vocabulary detr with denoising text query training and open-world unknown objects supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 7762–7770.
24. Xie, J.; Luo, Z.; Li, R.; Huang, Y.; Liu, H.; Li, Y.; Zheng, Y.; Zhang, Y.; Shen, L.; Shou, M.Z. Open-World Weakly-Supervised Object Localization. Pattern Recognit. 2026, 169, 111808.
25. Fu, S.; Yang, Q.; Mo, Q.; Yan, J.; Wei, X.; Meng, J.; Xie, X.; Zheng, W.S. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Denver, CO, USA, 11–15 June 2025; pp. 14987–14997.
26. Yang, M.; Goenawan, G.J.; Qin, H.; Han, K.; Peng, X.; Yang, Y.; Zhu, H. Detecting Open World Objects via Partial Attribute Assignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, Denver, CO, USA, 11–15 June 2025; pp. 20318–20328.
27. Zhao, X.; Ma, Y.; Wang, D.; Shen, Y.; Qiao, Y.; Liu, X. Revisiting open world object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 3496–3509.
28. Chen, Z.; Luo, Y.; Wang, Z.; Wang, Z.; Huang, Z. Open-CRB: Toward Open World Active Learning for 3D Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 8336–8350.
29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
30. Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8430–8439.
31. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3024–3033.
32. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
33. Wang, X.; Yu, Z.; De Mello, S.; Kautz, J.; Anandkumar, A.; Shen, C.; Alvarez, J.M. Freesolo: Learning to segment objects without annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14176–14186.
34. Ma, S.; Wang, Y.; Wei, Y.; Fan, J.; Li, T.H.; Liu, H.; Lv, F. Cat: Localization and identification cascade detection transformer for open-world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19681–19690.

Figure 1. System overview of the TAPM framework.

Figure 2. TAPM architecture overview. Backbone+FPN features and an RPN produce proposals that RoI Align pools. RoI features pass through the cross-level transformer autoencoder (CLT-AE); its tail evidence, gated by prototypes, yields an unknown-aware soft label. The known branch utilizes a classification head with class-specific box regression, while the unknown branch employs class-agnostic regression to output unknown detections.

Figure 3. Overall mAP comparison with current SOTA methods on incremental learning tasks 1–4. T1 is the first task, so it has no Previous mAP value. Data in this figure are taken from Table 3.

Figure 4. Overall U-Recall comparison of our method with the current SOTA methods ORE [1], OWDETR [18], PROB [22], and CAT [34] on incremental learning tasks 1–4.

Figure 5. A-OSE vs. WI trade-off (lower left is better). Scatter plots for each task using the values from Table 4. Each point represents a method; TAPM-FS/SS is denoted by larger, distinct markers. Lower A-OSE indicates fewer unknown objects misclassified as known, and lower WI indicates a smaller precision drop on known classes in the presence of unknowns.

Figure 6. Overall loss comparison across different tasks, showing the stability of the model across incremental learning training tasks.

Figure 7. Prototype guidance visualization. (Left) Reliability of the soft score $s(r)$; for ten equal-width bins, we plot the bin-mean score $\bar{s}$ versus the empirical unknown probability $\Pr(y = U \mid \text{bin})$ (closer to the diagonal is better). (Right) Decision geometry in $(p_{tail}(r), \tilde{g}(r))$ with the boundary $s(r) = \tau$ overlaid; unknown proposals concentrate at high $p_{tail}$ and low $\tilde{g}$.

Figure 8. Qualitative comparison with the baseline, PROB [22] (top), and our framework (bottom). Colors follow Detectron2 defaults; both use the same crop, score threshold, and NMS. (A) Tennis: PROB fires background and duplicate boxes, many marked as unknown; our framework suppresses them and yields a single bounding box for the person class with a true unknown near the racket. (B) Paddleboard: PROB flags horizon/sea as unknown and duplicates the paddler with conflicting labels; our framework removes horizon false positives and gives one person plus a compact unknown on the paddle. (C) Ski jump: PROB misses several unknown regions and adds background boxes; our framework recovers those unknowns while keeping accurate person detections.

Figure 9. Qualitative comparison with CAT [34] (top) and our framework (bottom). Detectron2 colors; same crops, score threshold, and NMS. (A) Riverside: CAT adds duplicate unknown boxes around the head and along the waterline; our framework suppresses them and keeps a clean person with one compact unknown where appropriate. (B) Snow: CAT misses small, unknown regions and flags sky/snow as unknown; our framework recovers the missed regions and reduces background false positives. (C) Surf: CAT misses several wave unknowns; our framework detects them while maintaining correct person boxes.

Table 1. Comparison of TAPM with recent OWOD works.

Work	Year	Setting	Their Key Idea	How the Proposed TAPM Differs
RE-OWOD [27]	2024	2D	Defines a consistent OWOD evaluation protocol (IoU matching; WI/A-OSE at a fixed operating point) and analyzes failure modes.	We adopt this protocol in an end-to-end manner. TAPM contributes tail calibration (CLT-AE + GPD tail fit) with a prototype gate to reduce WI/A-OSE at the same operating point, without modifying the detector head.
PROB [22]	2023	2D	Adds a probabilistic objectness head to Deformable-DETR and uses it to modulate class scores for novel objects.	No extra head. TAPM routes proposals using reconstruction-error tails (GPD) plus prototype gating, which lowers KOR and WI/A-OSE at the same operating point; objectness and our calibration are complementary.
OWOBJ [2]	2025	2D	Objectness-unification plugin (e.g., energy-based margin) that improves known/novel separation across detectors.	Complementary focus. OWOBJ supplies objectness; TAPM performs tail calibration + prototype gating to separate unknowns from background. OWOBJ can feed proposals; TAPM then calibrates and gates them.
UC-OWOD [19]	2022	2D	“Unknown-classified” OWOD: two-stage detector with unknown-aware heads, followed by similarity-based unknown classification and clustering refinement.	Different objective. UC-OWOD groups unknowns into classes; TAPM targets unknown vs. background separation using GPD tails + prototype gating. Our calibrated unknowns can serve as cleaner inputs for UC-OWOD grouping.
MEPU OWOD [9]	2025	2D	Label-free discovery via an unsupervised discriminative model refined by classification-free self-training.	TAPM is also label-free on the unknown branch but bases routing on calibrated reconstruction density (GPD tails) and prototype alignment; we additionally report FKA/KOR together with WI/A-OSE at the standard operating point.
Open-CRB [28]	2025	3D	Open-world active learning for LiDAR: an acquisition strategy (e.g., OLC) selects samples for oracle labeling.	Different domain/protocol. TAPM addresses 2D, passive OWOD. We also report complexity (wall-clock, FPS, and GFLOPs), showing near-parity inference cost with respect to Faster R-CNN despite CLT-AE + density fitting.

Table 2. Task-wise dataset partitions used in our experiments. The upper block corresponds to the S-OWODB benchmark (superclass-separated), while the lower block represents the M-OWODB benchmark (superclass-mixed). Each task specifies category groupings, along with the number of images and object instances, in both the training and evaluation sets.

	Category Groups	Train (Images)	Train (Instances)	Eval (Images)	Eval (Instances)
S-OWODB (Superclass-Separated)
Task 1	Animals, Person, and Vehicles	89,490	421,243	4952	36,781
Task 2	Appliances, Accessories, Outdoor, and Furniture	55,870	163,512	4952	36,781
Task 3	Sports and Food	39,402	114,452	4952	36,781
Task 4	Electronic, Indoor, and Kitchen	38,903	160,794	4952	36,781
M-OWODB (Superclass-Mixed)
Task 1	VOC Classes	16,551	47,223	4952	14,976
Task 2	Outdoor, Accessories, Appliance, and Truck	45,520	113,741	1914	4966
Task 3	Sports and Food	39,402	114,452	1642	4826
Task 4	Electronic, Indoor, Kitchen, and Furniture	40,260	138,996	1738	6039

Table 3. Comparison with SOTA OWOD methods on S-OWODB (top) and M-OWODB (bottom). TAPM is evaluated with Selective Search (SS) and FreeSOLO (FS) proposal generation. Task 4 omits U-Recall because all classes are known to the model at this stage, so no unknown classes remain. ↑ means higher is better.

Method	Task 1		Task 2			Task 3			Task 4
Method	U-Recall (↑)	mAP (↑)	U-Recall (↑)	Prev mAP (↑)	Curr mAP (↑)	U-Recall (↑)	Prev mAP (↑)	Curr mAP (↑)	Prev mAP (↑)	Curr mAP (↑)
Results on S-OWOD Settings
Faster R-CNN [29]	0.0	74.4	0.0	0.42	44.8	0.0	0.23	43.1	0.15	41.6
Faster R-CNN (fine-tuned)	0.0	74.4	0.0	68.2	42.2	0.0	50.1	38.7	43.9	35.6
ORE–EBUI [1]	1.5	71.4	3.9	61.0	30.9	3.6	43.1	32.2	33.6	26.3
OW-DETR [18]	5.7	73.1	6.2	65.0	29.0	6.9	46.7	25.7	38.2	28.1
PROB [22]	17.6	73.5	22.3	66.3	36.0	24.8	47.8	30.4	42.6	31.7
CAT [34]	24.0	74.2	23.0	67.6	35.5	24.6	51.2	32.6	45.4	35.1
TAPM-FS (Ours)	38.0	75.0	36.0	68.5	42.0	35.4	51.0	38.9	46.1	35.9
TAPM-SS (Ours)	34.0	74.5	35.2	67.3	41.0	32.6	50.9	38.0	45.6	35.5
Results on M-OWOD Settings
Faster R-CNN [29]	0.0	60.3	0.0	0.69	35.2	0.0	0.32	23.5	0.65	20.1
Faster R-CNN (fine-tuned)	0.0	60.3	0.0	57.6	34.0	0.0	43.8	22.3	35.6	19.5
ORE–EBUI [1]	4.9	56.0	2.9	52.7	26.0	3.9	38.2	12.7	29.6	12.4
OW-DETR [18]	7.5	59.2	6.2	53.6	33.5	5.7	38.3	15.8	31.4	17.1
PROB [22]	19.4	59.5	17.4	55.7	32.2	19.6	43.0	22.2	35.7	18.9
CAT [34]	23.7	60.0	19.1	55.5	32.7	24.4	42.8	18.7	34.4	16.6
TAPM-FS (Ours)	31.6	61.8	31.2	58.0	33.9	31.2	43.1	23.6	36.8	20.1
TAPM-SS (Ours)	30.9	61.0	30.2	57.3	33.1	30.3	42.5	22.7	35.8	19.5

Table 4. Comprehensive assessment of S-OWODB (top) and M-OWODB (bottom). The results highlight unknown-object recall together with open-set reliability indicators such as Absolute Open-set Error (A-OSE) and Wilderness Impact (WI), which provide an overview of detection accuracy and the extent of misclassifications under open-world conditions. ↓ means lower is better.

Method	Task 1		Task 2		Task 3
Method	A-OSE (↓)	WI (↓)	A-OSE (↓)	WI (↓)	A-OSE (↓)	WI (↓)
Results for S-OWOD Settings
Faster R-CNN (fine-tuned) [29]	1807	0.022	4007	0.033	4010	0.025
ORE–EBUI [1]	2486	0.024	6608	0.040	6896	0.026
OW-DETR [18]	12,721	0.029	14,970	0.041	9197	0.024
PROB [22]	2003	0.021	3358	0.031	2546	0.018
CAT [34]	2097	0.023	5784	0.040	3545	0.021
TAPM-FS (Ours)	1610	0.019	3012	0.021	2662	0.019
TAPM-SS (Ours)	1653	0.021	3130	0.023	2783	0.020
Results for M-OWOD Settings
Faster R-CNN [29]	13,396	0.070	12,291	0.037	9622	0.028
ORE–EBUI [1]	10,459	0.062	10,445	0.028	7990	0.021
OW-DETR [18]	10,240	0.057	8441	0.028	16,803	0.016
PROB [22]	5195	0.057	6452	0.034	2641	0.015
CAT [34]	20,364	0.066	16,768	0.032	7515	0.020
TAPM-FS (Ours)	5019	0.055	5815	0.020	5059	0.015
TAPM-SS (Ours)	5100	0.056	5942	0.022	5169	0.016

Table 5. Computational complexity on S-OWOD. “NG” indicates that proposal generation is not required.

Model	Detector	Training Time (h)	FPS	GFLOPs
Faster R-CNN [29]	Faster R-CNN	14	25	185
CAT [34]	DETR	46 (NG)	18	198
PROB [22]	DETR	40 (NG)	20	185
RE-OWOD [27]	Faster R-CNN	30	24	180
MEPU [9]	Faster R-CNN	25	24	180
OWOBJ [2]	Faster R-CNN/DETR	25	24	185
UC-OWOD [19]	Faster R-CNN	30	24	185
TAPM-FS	Faster R-CNN	21	24	180
TAPM-SS	Faster R-CNN	21	24	180

Table 6. Cross-dataset generalization on LVIS and Objects-365. We report AP on known classes and recall at top-k for unknowns.

Method	Dataset	AP (Known)	Unknown Recall (%)
			R@10	R@30	R@100
Faster R-CNN [29]	LVIS [5]	38.0	0.0	0.0	0.0
CAT [34]		34.5	12.5	22.6	35.2
PROB [22]		34.2	13.3	25.7	40.5
TAPM-SS (Ours)		39.3	16.2	31.0	46.5
TAPM-FS (Ours)		38.5	17.5	32.1	47.2
Faster R-CNN [29]	Objects-365 [30]	36.5	0.0	0.0	0.0
CAT [34]		34.2	11.0	20.5	37.4
PROB [22]		33.0	12.6	25.5	40.8
TAPM-SS (Ours)		38.2	17.0	31.2	46.9
TAPM-FS (Ours)		37.0	18.2	32.5	47.8

Table 7. Conditional operating-point error rates at $\tau = 0.5$. $\hat{FKA}$ is computed over $y = U$ and $\hat{KOR}$ over $y \in C$. ↓ indicates that lower is better.

Method	$\hat{FKA} (τ) ↓$	$\hat{KOR} (τ) ↓$
Tail evidence only	0.0838	0.2001
+ Prototype guidance	0.0834	0.0696

Table 8. Bootstrap 95% confidence intervals (CIs) for S-OWODB task 2 (FS pipeline), computed from 10,000 image-level resamples. mAP and U-Recall are percentages; WI is dimensionless (precision drop); A-OSE is a count of unknown ground-truth objects misclassified as known at IoU $\geq 0.5$ with recall target $R = 0.8$. Values are mean [lower, upper].

Method	mAP (%)	U-Recall (%)	WI	A-OSE (count)
Baseline (FRCNN) [29]	39.2 [38.7, 39.7]	0.0 [0.0, 0.1]	0.033 [0.031, 0.035]	4007 [3920, 4090]
CAT [34]	40.2 [39.7, 40.7]	29.0 [27.9, 30.1]	0.040 [0.038, 0.042]	5784 [5670, 5900]
PROB [22]	40.5 [40.0, 41.0]	30.5 [29.4, 31.6]	0.031 [0.029, 0.033]	3358 [3270, 3445]
ORE–EBUI [1]	30.9 [30.2, 31.6]	3.9 [3.2, 4.6]	0.040 [0.038, 0.042]	6608 [6480, 6735]
OW–DETR [18]	29.0 [28.4, 29.6]	6.2 [5.5, 6.9]	0.041 [0.039, 0.043]	14,970 [14,780, 15,160]
TAPM–FS (Ours)	42.0 [41.5, 42.5]	36.0 [34.9, 37.1]	0.021 [0.019, 0.023]	3012 [2930, 3095]
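
The image-level bootstrap behind Table 8 can be sketched as follows. This is a minimal percentile-bootstrap example, assuming that a metric can be recomputed from per-image records and that the interval is taken from the 2.5th and 97.5th percentiles of the resampled statistics; the function names, the per-image record format, and the toy data are illustrative assumptions rather than the evaluation code behind the table.

```python
import random


def u_recall(records):
    """U-Recall (%) pooled over images; records are (detected, total) pairs."""
    detected = sum(d for d, _ in records)
    total = sum(t for _, t in records)
    return 100.0 * detected / total if total else 0.0


def bootstrap_ci(records, metric, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of `metric` under image-level resampling."""
    rng = random.Random(seed)
    n = len(records)
    stats = []
    for _ in range(n_resamples):
        # Draw n images with replacement and recompute the metric.
        sample = [records[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    mean = sum(stats) / n_resamples
    return mean, lower, upper


# Toy per-image records (assumed data): (detected unknowns, total unknowns).
toy = [(1, 3), (0, 2), (2, 2), (1, 4), (3, 5), (0, 1), (2, 3), (1, 1)]
mean, lower, upper = bootstrap_ci(toy, u_recall, n_resamples=2000)
```

Resampling whole images rather than individual boxes keeps objects from the same image together, so the interval reflects image-to-image variability. The same routine applies to any metric that can be recomputed from a list of per-image records.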

Table 9. Paired bootstrap differences ($Δ = \text{TAPM-FS} - \text{comparator}$) on S-OWODB task 2 with 95% CIs from 10,000 paired resamples. mAP and U-Recall differences are in percentage points (pp). For WI and A-OSE (lower is better), a negative $Δ$ indicates improvement. A CI that does not include zero denotes $p < 0.05$ (two-sided).

Comparison	$Δ$ mAP (pp)	$Δ$ U-Recall (pp)	$Δ$ WI	$Δ$ A-OSE
TAPM–FS vs. Baseline (FRCNN) [29]	+2.8 [+2.0, +3.6]	+36.0 [+34.0, +38.0]	−0.012 [−0.014, −0.010]	−995 [−1120, −870]
TAPM–FS vs. CAT [34]	+1.8 [+1.1, +2.5]	+7.0 [+5.3, +8.7]	−0.019 [−0.022, −0.016]	−2772 [−2940, −2610]
TAPM–FS vs. PROB [22]	+1.5 [+0.8, +2.2]	+5.5 [+3.8, +7.2]	−0.010 [−0.012, −0.008]	−346 [−430, −260]
TAPM–FS vs. ORE–EBUI [1]	+11.1 [+10.2, +12.0]	+32.1 [+30.6, +33.6]	−0.019 [−0.022, −0.016]	−3596 [−3820, −3380]
TAPM–FS vs. OW–DETR [18]	+13.0 [+12.1, +13.9]	+29.8 [+28.3, +31.3]	−0.020 [−0.023, −0.017]	−11,958 [−12,250, −11,660]
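
A paired bootstrap of the kind summarized in Table 9 can be sketched as below. The key point is that both methods are evaluated on the same resampled set of images in every replicate, so the difference Δ is not inflated by image-to-image variation. The sketch assumes per-image records of (detected unknowns, total unknowns) for each method and a percentile interval; all names and the toy data are hypothetical.

```python
import random


def u_recall(records):
    """U-Recall (%) pooled over images; records are (detected, total) pairs."""
    detected = sum(d for d, _ in records)
    total = sum(t for _, t in records)
    return 100.0 * detected / total if total else 0.0


def paired_bootstrap_delta(rec_a, rec_b, metric,
                           n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean delta, lower, upper, significant) for metric(A) - metric(B)."""
    if len(rec_a) != len(rec_b):
        raise ValueError("both methods must be scored on the same images")
    rng = random.Random(seed)
    n = len(rec_a)
    deltas = []
    for _ in range(n_resamples):
        # One index draw is shared by both methods: this is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        a = metric([rec_a[i] for i in idx])
        b = metric([rec_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    mean = sum(deltas) / n_resamples
    # A CI that excludes zero corresponds to p < alpha (two-sided).
    significant = not (lower <= 0.0 <= upper)
    return mean, lower, upper, significant


# Toy records (assumed data): method A finds more unknowns on every image.
toy_a = [(2, 3), (1, 2), (2, 2), (3, 4), (4, 5), (1, 1)]
toy_b = [(1, 3), (0, 2), (1, 2), (1, 4), (2, 5), (0, 1)]
mean, lower, upper, significant = paired_bootstrap_delta(
    toy_a, toy_b, u_recall, n_resamples=2000)
```

For metrics where lower is better, such as WI and A-OSE, the same procedure applies and a negative interval that excludes zero indicates a significant improvement.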

Table 10. Ablation on S-OWODB (task 2). All variants share the same detector/backbone, data splits, and evaluation protocol (IoU $\geq 0.5$, confidence chosen to reach known-class recall $R = 0.8$ for WI/A-OSE). Baseline (no TCTE) is the detector without CLT-AE, bulk–tail fit, or prototype gate. Full adds all three. ↑ specifies higher is better, ↓ specifies lower is better.

Pipeline	Variant	mAP ↑	U-Recall ↑	WI ↓	A-OSE ↓
FS	Baseline (no TCTE)	39.2	0.0	0.040	4007
FS	CLT-AE only	40.9	28.0	0.035	3850
FS	CLT-AE + bulk–tail	41.2	31.0	0.030	3425
FS	Full (TAPM-FS)	42.0	36.0	0.021	3012
SS	Baseline (no TCTE)	38.1	0.0	0.045	4207
SS	CLT-AE only	39.9	27.1	0.039	3920
SS	CLT-AE + bulk–tail	41.2	29.9	0.033	3550
SS	Full (TAPM-SS)	41.0	35.2	0.023	3130

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Iqbal, M.A.; Yoon, Y.-C.; Kim, S.K. Tail-Calibrated Transformer Autoencoding with Prototype-Guided Mining for Open-World Object Detection. Appl. Sci. 2025, 15, 10918. https://doi.org/10.3390/app152010918
