Hierarchical Open-Set Object Detection in Unseen Data

Abstract: In this paper, we propose an open-set object detection framework based on a dynamic hierarchical structure with incremental learning capabilities for unseen object classes. We were motivated by the observation that deep features extracted from visual objects show a strong hierarchical clustering property. The hierarchical feature model (HFM) was used to learn a new object class by using collaborative sampling (CS) and open-set-aware active semi-supervised learning (ASSL) algorithms. We divided object proposals into superclasses by using the agglomerative clustering algorithm. Data samples in each superclass node were classified into multiple augmented class nodes instead of being directly associated with regular object classes. One or more augmented class nodes are related to a regular object class, and each augmented class has only one superclass. Object proposals from an unseen data distribution are assigned to an augmented class node. Dynamic HFM nodes in the decision path are assembled to constitute an ensemble prediction, and the new augmented object is associated with a new regular object class. Our experimental results on standard benchmark datasets such as PASCAL VOC, MS COCO, and ILSVRC DET, as well as on local datasets, showed that the proposed method performs better than state-of-the-art techniques.


Introduction
There have been many advances in object detection technology since the deep-learning breakthrough achieved by Krizhevsky et al. in 2012 [1]. However, collecting and labeling training samples are necessary and exhaustive tasks. Furthermore, all object classes to be detected should be decided in advance, but this is not achievable in many real-world applications. In such applications, object detection systems trained with limited object classes tend to be impractical because a new object class in an unseen data distribution cannot be properly handled and can result in false predictions [2][3][4]. Few researchers have focused on open-set object detection in unseen data distributions, even though there are strong requirements in many application fields. The unseen data distribution of a new object poses a difficult problem in real-world object detection systems. A larger dataset with extended classes and a more complex neural network can be adopted, but this requires a complicated process of network redesign or remodeling tasks. However, the need for unseen object class detection is very common in many real-world applications.
State-of-the-art detection methods such as OverFeat [5], Faster RCNN [6], Spatial Pyramid Pooling [7], the YOLO series [8][9][10], and RetinaNet [11] still cannot satisfy real-world requirements. Even though they employ high-dimensional deep-feature spaces, performance degradation is unavoidable in object detection, especially when it is due to the imperfect quality of training samples and the diversity of real-world image capturing qualities. In practice, completely reliable human-labeling requirements are not acceptable because the cost of a high-quality labeling method is very expensive. The tree structure approach is very effective for solving problems in a flat architecture and supporting open-set classification. The pros and cons of hierarchical architecture in classification have been extensively investigated in the past few decades [12]. In hierarchical architecture, the label information in higher level tree nodes generally captures more discriminable, relevant label concepts and can be inherited by lower level nodes that are more difficult to distinguish [13]. Most previous approaches have relied on handcrafted features and annotations for hierarchical classifier training, but they required much effort and time, and an error prone annotation process is used in current classification in diverse application environments.
Two related sources of open-set object detection errors are misclassified data points and out-of-distribution data points. Most object detectors tend to fail in noisy environments, for example with cluttered backgrounds, pose variations, and illumination changes. Furthermore, some input data samples do not belong to the closed set (i.e., the in-distribution classes of the training dataset) but to the open set (i.e., outlier data). Neither misclassified nor outlier data samples are consistent with the trained data samples in deep-feature space, so detection results based on maximum likelihood are unreliable for them. Many open-set algorithms distinguish between in-distribution and outlier data classes by relying on thresholds in the classification scheme [3,14,15]. They show that discriminating unseen classes from regular representative classes effectively minimizes open-space risk. However, the information about unseen classes, such as which feature space and object class the test samples would have, is not available in advance. Both training and testing in classifiers are processed in deep-feature space rather than in semantic object label space [12].
In this paper, we reduce ambiguities from the mixture within regular object classification by employing the concept of augmented object classes based on a physical feature in a tree structure, not relying on the semantic regular object class hierarchy. Augmented object class hierarchies in the deep-feature space are organized by considering the between-class deep feature correlations and the within class variation criteria. We propose a hierarchical open-set object detection framework with the learning capabilities of unseen data distribution for a new object class. The proposed method employs a dynamic hierarchical structure, each node of which keeps track of related data, features, and the deep model, and evolves itself in accordance with changing data distributions. It adopts outlier detection for open-set active semi-supervised learning. The dynamic hierarchical feature model (HFM) can increase its visual taxonomy in the flexible hierarchical structure based on the novel concept of the augmented object class. An augmented object class is a portion of the regular class, the mixture complexity of which is less complex than or equal to that of the regular object class. The proposed framework can adaptively learn both closed-set classes for performance improvement and open-set object classes using unlabeled data samples. Our method combines the incremental open-set aware active semi-supervised learning (ASSL) [16] and the dynamic hierarchical feature model (HFM) update algorithm for effectively grouping unseen objects together. In real-world scenarios, an object detection system should handle noisy and open-set data. Static world assumptions adopted by most detection methods [5][6][7][8][9][10] are no longer valid in practice. We tackled this problem by leveraging the discriminative capability of the dynamic HFM embedded by the outlier detection algorithm and the collaborative sample selection-based open-set ASSL. 
We propose efficient open-set object detection using a flexible hierarchical structure that provides informative and nonredundant sample selection and the open-set-aware ASSL algorithm.

Multi-Object Detection
Advanced object detection techniques depend primarily on the availability of large, properly labeled datasets for lengthy training [17][18][19]. Researchers have employed high-dimensional deep feature spaces to reduce performance degradation in detecting an object due to the imperfect quality of training samples and unseen data distributions, compared with varied and changing real-world environments. Most object detection schemes [5][6][7][8][9][10] assume that labeled training data samples that are independent and identically distributed (IID), are available in a static environment. Such a static IID assumption is not valid in many real-world object detection applications such as pedestrian detection [20], visual surveillance [21], activity recognition [15,22,23], and pose estimation [24]. Junwei et al. analyzed and compared the recent progress in a variety of state-of-the-art object detection studies using benchmark datasets [25]. They also proposed a two-stage co-segmentation algorithm with the assumption that dependable human labeling is not valid. This is because the cost of a high-quality labeling process becomes too expensive and it is unacceptable in real-time applications where an even background is necessary to reduce a disturbed background [26]. The quality of new data points in unseen distribution creates a challenging problem in object detection.

Open-Set Recognition
The open-set recognition problem is that the test sample is associated with an unseen class during training [4]. Most classification algorithms are based on the closed-world assumption, whereby a test sample will belong to one of K classes used during training. In many real-world applications, however, a testing sample may come from a class from the unseen distribution. Open-set recognition has incomplete knowledge of the world at training time, but unknown classes can be submitted to the algorithm during testing [4]. The open-set recognition issue has seldom been presented until now due to its strong generalization requirement. Most of the early works were based on hand engineered features that are not suitable for incremental learning. One of the works that is closely related to our approach is the Hasan et al. [27] method, where a deep neural network is applied in order to select the features incorporated with semi-supervised learning in a hybrid approach. Felzenszwalb et al. [22] proposed a deformable part model for employing latent information, which forms invariance in local transformations, leading to better localization. Tang et al. [28] presented a similarity-based knowledge transfer model and investigated how knowledge about object similarities can be transferred to adapt an object classifier to an object detector.
Most machine learning research is based on closed sets. In [4], the nature of open-set recognition was discussed and formalized as a constrained minimization problem, which is similar to our work. The authors applied their method to face verification, whereas we apply our proposed method to object detection. Neural networks show very strong generalization capability as long as the training and testing data distributions are the same.
In the real world, object detection performance usually decreases since new unseen object classes cannot be properly dealt with. In this context, we propose an open-set object framework based on a dynamic hierarchical structure with incremental learning capabilities for unseen object classes.
Neural networks tend to make high-confidence predictions, even for completely unrecognizable [29] or irrelevant inputs [30][31][32], but we often have very little control over the test data distribution in real-world applications. The correct detection of out-of-distribution samples is important for object detection/classification tasks [33]. It is important to be aware of the uncertainty of new types of input data in terms of in- and out-of-distribution samples, or those at their boundary [34]. One general difficulty of open-set recognition is selecting the right candidates through collaborative sampling strategies, which is not easy when uncertainty, diversity, and confidence criteria all play a major role. Labeling time also grows with the training data size, and labeling is very tedious work. Moreover, after a number of incremental learning iterations, the final model reaches saturation, and the classification performance no longer changes with more training. On the positive side, our method performs much better than state-of-the-art object detectors. The combined ASSL approach can effectively select a dataset that reduces the training time and labeling effort.

Active Learning and Semi-Supervised Learning Combination
Active learning (AL) and semi-supervised learning (SSL) try to solve the same problem from different theoretical foundations, and they share the goal of achieving high classification accuracy with minimum human labeling (Settles, 2010; Zhu, 2008). AL selects the most informative samples that are beneficial to classifier training by leveraging known information in the test data in accordance with the oracle's decisions [35,36]. Uncertainty sampling picks the samples nearest to the decision boundary [37,38]. The well-known AL sampling strategy query by committee (QBC) is an ensemble learning method that relies on the different hypotheses of a committee, in which the most informative samples are considered to be those with maximal disagreement between classifiers [36]. While AL allows for human intervention to some extent, SSL directly uses unlabeled data in the training process without any human labeling [39]. The AL and SSL techniques can be mixed to handle labeled and tentatively labeled samples for classification, so that only a few fresh samples need to be manually labeled. The ensemble methods of AL and SSL are categorized into the sequential combination, with SSL embedding into AL, and collaborative sampling. The sequential combination emphasizes the fact that the initial training set is important for SSL convergence to the target performance. Muslea et al. adopted this strategy by employing multiple views for both AL and SSL [35]. The AL and SSL ensemble method is based on several different architectures [36,39]. Wan et al. showed AL-based verification for low-confidence pseudo-labeled samples labeled by SSL [40]. The collaborative combination of AL and SSL using the confidence score from a boosting algorithm was applied to a spoken-language classification problem [41]. Several methods of AL and SSL collaboration were studied in [36,39].

System Overview
We present an open-set object detector framework that learns a new object class and retrains a regular one in an unseen data distribution. The framework begins with a very small number of labeled data samples and incrementally learns by using unlabeled data samples in an open-set setting. The main components of the proposed framework, shown in Figure 1, are the dynamic HFM, the outlier-detection algorithm, the collaborative sampling (CS) algorithm [16], and incremental ASSL [14]. We used the initial CNN model, which was pretrained by using labeled data samples, i.e., the PASCAL VOC 2007 and 2012 trainval datasets [17].
Figure 1. Since the red circle was included in the brown circle, the detector could not distinguish between the two object classes (the brown circle represents 'bottle'; the red circle represents 'fire extinguisher'). The data distribution change is shown in the lower right, where the brown and red circles were separated. The input dynamic HFM was trained by incremental ASSL and then updated with a new augmented class node.
The dynamic HFM builds the initial CNN model and improves the detection accuracy step by step by improving model performance and increasing the number of confident pseudo-labeled samples. The hierarchical structure of labeled samples was modeled in terms of the super- and augmented classes using the agglomerative clustering algorithm. Data samples in each superclass node were related to multiple augmented class nodes. An augmented object class is a portion of a regular object class whose data distribution is distinct from that of the other portions of the same regular object class; i.e., a regular object class consists of one or more augmented class nodes. Discrimination information is used in the dynamic HFM for open-set learning of an unseen object class when a new augmented object class is added. We considered object proposals generated from an unseen data distribution. The object proposal sequence was partitioned into bins, each of which is processed as a whole rather than one image sample at a time, as in many other approaches. This means that the learning/update for the dynamic HFM is not sensitive to noisy images, even in a new cluttered environment.
The open-set ASSL performs collaborative sampling and analyzes the object proposal distribution using the outlier-detection algorithm. Based on the results of the outlier detection, the ASSL retrains an existing augmented class or creates a new one for an unseen regular class in the dynamic HFM. In the dynamic HFM, object proposals are filtered by the CS algorithm, which combines the uncertainty and diversity criteria of AL with the confidence criterion of SSL. The collaboration between SSL and AL makes it possible to obtain more confident and informative pseudo-labeled training samples from unlabeled object proposals in the partition of the image. The outlier-detection algorithm is used to discriminate between in- and out-of-distribution object proposals in the current deep-feature space.
In the dynamic HFM, learning in the ASSL is divided into the AL cycle using the confident dataset, and incremental SSL using unlabeled data divided into batches and bin sequences. In the AL cycle, a batch of object samples is split into several bins, and bin-based incremental SSL cycles are performed. If unseen object proposals are detected by the outlier-detection algorithm, the dynamic HFM is updated using the unseen data samples. The detected outliers are accumulated and clustered. If the volume of a cluster exceeds a threshold value, a corresponding augmented-object-class predictor model is built and added to the associated superclass node in the dynamic HFM. The open-set ASSL trains the CNN by using the confidently labeled samples and repeatedly retrains the next deep model by adding the batch of samples chosen with the present object detector, until convergence. The proposed technique thus supports both exploration and exploitation by combining informative and reliable (well-founded) approaches to sampling in an open environment. The decision path for an object prediction ensemble is built by combining current object models and the object prediction model created for the new augmented class. Finally, the dynamic HFM is updated.

Dynamic Hierarchical Feature Model
We present a hierarchical deep-feature structure for open-set object detection that extends the HFM in [1] with the capability of incremental learning for regular objects and open-set learning for unseen objects. The dynamic HFM consists of two different levels: the superclass and augmented-class levels, as shown in Figure 2b. We employed the concept of an augmented class, which is defined as a distinctive portion of a regular class in data distribution. An augmented class shares common between-class characteristics with the superclass level, and closer within-class characteristics than the associated regular object class (see Figure 3).
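As an illustration only, the two-level structure described above can be sketched as a small Python data structure. The class names (`SuperclassNode`, `AugmentedClassNode`, `DynamicHFM`) and methods are hypothetical and are not taken from the paper; the sketch only encodes the stated constraints that each augmented class has exactly one superclass and that a regular class may map to one or more augmented class nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AugmentedClassNode:
    node_id: int
    superclass_id: int   # each augmented class has exactly one superclass
    regular_class: str   # regular object class this node is a portion of
    features: List[List[float]] = field(default_factory=list)  # node feature vectors


@dataclass
class SuperclassNode:
    node_id: int
    children: List[int] = field(default_factory=list)  # ids of augmented class nodes


class DynamicHFM:
    """Hypothetical container for the superclass / augmented-class hierarchy."""

    def __init__(self) -> None:
        self.superclasses: Dict[int, SuperclassNode] = {}
        self.augmented: Dict[int, AugmentedClassNode] = {}

    def add_superclass(self, node_id: int) -> SuperclassNode:
        node = SuperclassNode(node_id)
        self.superclasses[node_id] = node
        return node

    def add_augmented_class(self, superclass_id: int, regular_class: str) -> AugmentedClassNode:
        # New augmented nodes get the next free integer id (an assumption of this sketch).
        node = AugmentedClassNode(len(self.augmented), superclass_id, regular_class)
        self.augmented[node.node_id] = node
        self.superclasses[superclass_id].children.append(node.node_id)
        return node

    def nodes_of(self, regular_class: str) -> List[int]:
        # A regular class may be covered by several augmented class nodes.
        return [n.node_id for n in self.augmented.values() if n.regular_class == regular_class]

    def decision_path(self, augmented_id: int) -> Tuple[int, int, str]:
        # Path from superclass to augmented class to the associated regular class.
        node = self.augmented[augmented_id]
        return (node.superclass_id, node.node_id, node.regular_class)


hfm = DynamicHFM()
hfm.add_superclass(0)
hfm.add_augmented_class(0, "bottle")
hfm.add_augmented_class(0, "bottle")
new_node = hfm.add_augmented_class(0, "fire extinguisher")
```

In this toy usage, an unseen class ("fire extinguisher") is attached as a new augmented node under the superclass it shares with "bottle", mirroring the example used in the system overview.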

Outlier Detection
Symmetry 2019, 11, 1271
Unseen object classes cannot be correctly classified into a current augmented class and cannot produce regular object output. However, there exist superclass objects that share common between-class attributes with high probability, even though the new augmented-class prediction capability does not exist in the dynamic HFM. We adopted the outlier detection method of [42]. Object proposals were input to the dynamic HFM at time $t$. For object proposals $\{x_i\}$ extracted from the unlabeled dataset, the causes of object-detection errors are divided into two types: a particular data point that is hard to classify, and a data point that has not yet been defined in the dynamic HFM, i.e., an unseen data point (a data point belonging to a new class). Discriminating between these types of errors is hard since it is deeply related to the semantic goal of a detector [28]. We investigated the distribution of object proposals and updated the dynamic HFM using the current prediction models. Following [42], we chose centroids $\{\mu_j\}_{j=1}^{J}$ and the cluster membership weights to optimize the within-cluster sum of squares (WCSS). Let $w_{ij} \in [0, 1]$ denote the cluster membership weight of deep-feature vector $FV_i$ in cluster $j$. Let us define the outlier compensation vector $CV_i$, which is an $M$-dimensional zero vector if $FV_i$ is an inlier, or a nonzero vector if $FV_i$ is an outlier. Let us define $CV = [CV_1 \cdots CV_N]$, $\Lambda = [\mu_1 \cdots \mu_J]$, and the membership weight matrix $W \in \mathbb{R}^{N \times J}$, whose elements are the cluster membership weights $w_{ij} \in [0, 1]$. The soft K-means algorithm leverages the sparsity of the $CV_i$ and uses the outlier-compensated version $(FV_i - CV_i)$ in place of $FV_i$, together with $w_{ij}$. Outlier-aware soft K-means clustering is defined by Equation (1), where $\tau > 1$ is a tuning parameter.
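For reference, a sketch of the form that Equation (1) would take under a sparsity-regularized robust soft K-means formulation, written with the symbols defined in this section. This is an assumption based on the description above (sparse compensation vectors, membership exponent $\tau$), not a formula quoted from the paper; in particular, the regularization weight $\lambda \ge 0$ and the sum-to-one constraint on the memberships are not named in the text.

```latex
\min_{\Lambda,\; CV,\; W}\;
\sum_{i=1}^{N}\sum_{j=1}^{J} w_{ij}^{\tau}\,
\bigl\| FV_i - CV_i - \mu_j \bigr\|_2^{2}
\;+\; \lambda \sum_{i=1}^{N} \bigl\| CV_i \bigr\|_2
\quad \text{s.t.} \quad \sum_{j=1}^{J} w_{ij} = 1,\;\; w_{ij} \in [0, 1]
\tag{1}
```

Under this reading, the first term is the membership-weighted WCSS of the compensated features, and the second term drives most $CV_i$ to exactly zero so that only a few feature vectors are declared outliers.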
The soft K-means problem is solved with the block coordinate descent (BCD) algorithm [42], which iteratively optimizes the cost function with respect to one variable at a time while the other variables remain fixed. If the magnitude of $CV_i$ was greater than threshold $\rho$, we regarded the corresponding $FV_i$ as an outlier. For a given bin of object proposals, the dynamic HFM algorithm updates the current node attributes. The dynamic HFM executes the tasks of open-set learning in each augmented class node. It continuously updates its node attributes, which comprise the node dataset, node feature vectors, and node prediction model. These are updated using the open-set ASSL algorithm discussed in the following section.
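The BCD procedure can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the objective is the membership-weighted squared distance of the compensated features $(FV_i - CV_i)$ to the centroids plus a penalty `lam` times the sum of the norms of the $CV_i$. The function name, the parameter `lam`, the optional `init` centroids, and the default values are all hypothetical.

```python
import numpy as np


def outlier_aware_soft_kmeans(FV, J=2, tau=2.0, lam=1.0, rho=1e-8,
                              n_iter=50, init=None, seed=0):
    """Block coordinate descent over memberships W, centroids mu, and CV.

    Assumed objective (not quoted from the paper):
        sum_ij w_ij^tau * ||FV_i - CV_i - mu_j||^2 + lam * sum_i ||CV_i||
    """
    FV = np.asarray(FV, dtype=float)
    N, M = FV.shape
    if init is None:
        rng = np.random.default_rng(seed)
        mu = FV[rng.choice(N, size=J, replace=False)].copy()
    else:
        mu = np.array(init, dtype=float)
    CV = np.zeros((N, M))
    eps = 1e-12
    for _ in range(n_iter):
        Z = FV - CV  # outlier-compensated features
        # Membership update (soft K-means with exponent tau > 1).
        d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1) + eps
        inv = d2 ** (-1.0 / (tau - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
        Wt = W ** tau
        # Centroid update: weighted mean of the compensated features.
        mu = (Wt.T @ Z) / (Wt.sum(axis=0)[:, None] + eps)
        # Compensation update: group soft-thresholding of the weighted residual.
        s = Wt.sum(axis=1)
        r = FV - (Wt @ mu) / s[:, None]
        rn = np.linalg.norm(r, axis=1)
        shrink = np.maximum(0.0, 1.0 - lam / (2.0 * s * rn + eps))
        CV = r * shrink[:, None]
    # A feature vector is flagged when its compensation vector exceeds rho in norm.
    is_outlier = np.linalg.norm(CV, axis=1) > rho
    return mu, W, CV, is_outlier
```

A larger `lam` makes the method more conservative (fewer proposals are flagged), so in practice it would be tuned together with the threshold $\rho$.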

Open-Set-Aware Incremental ASSL
The proposed incremental open-set-aware ASSL combines the uncertainty and diversity characteristics of the AL paradigm with the confidence property of the incremental SSL paradigm. Under AL's uncertainty criterion, the most uncertain samples are considered the most helpful training samples to add, as they are expected to be wrongly classified with high probability by the present detection model. However, the uncertainty criterion may cause noisy or redundant samples to be selected. We adapted a pool-based (batch or bin) AL structure coupled with the incremental SSL approach, based on the collaborative sampling of AL and SSL in terms of the uncertainty, diversity, and confidence criteria, which is expected to select more informative and less redundant training samples.
We used an AL batch cycle similar to [43], and added a bin-based cycle for incremental SSL. In the AL batch cycle, the training dataset was divided into well-defined labeled training samples $D_{well}$, and weakly labeled and unlabeled training samples $D_{tentative}$. Open-set aware incremental ASSL processes them to increase the volume of $D_{well}$ relative to $D_{tentative}$ and to update the dynamic HFM. First, the original model learns from the prelabeled dataset used to build the CNN. Then, a batch of samples is chosen considering the distribution of the trained models and category balancing. The present detector assigns confidence scores to the pseudo-labeled samples. Depending on the confidence scores, which are measured and ranked by the present detector, confident and well-defined samples are separated from weak samples. A subset of pseudo-labeled samples is chosen using a collaborative sampling approach, whereby the present detector reassigns fresh labels or assigns higher scores to labels; some ambiguous samples that have previously been identified are removed or relabeled by the oracle after filtering by the criteria of uncertainty and diversity.
Open-set aware incremental ASSL reduces training time by building a pool $D_\Delta$ based on the uncertainty, diversity, and confidence criteria.
Candidate sample set $D_{diversity}$ covers the more revealing samples within rank $\vartheta$ (Equation (2)), where Rank(x) denotes the decreasing order of $f(x)$. Next, we initialized $D_\Delta$ with the sample $x_{top} = \arg\max_{x \in D_{diversity}} f(x)$, $x_{top} \in D_{diversity}$, using confidence criterion parameter $\gamma$. A sample from $D_{diversity}$ is then added to $D_\Delta$, chosen with respect to its most similar sample in $D_\Delta$ in terms of confidence score (Equation (3)). In Equation (3), we used the Euclidean distance between two features to calculate $d(x_i, x_j)$. When the cardinality of $D_\Delta$ reaches $\gamma$, the sample selection process is stopped, and the final sample set is $D_\Delta$. We retrained the CNN using the pool of samples, and the process was repeated until a convergence criterion was satisfied. The entire process and parameters are summarized in Algorithm 1 and [44].

Algorithm 1. Open-set aware incremental active semi-supervised learning using outlier detection
Input: Confident labeled dataset $D_{well}$, tentatively labeled dataset $D_{tentative}$, and dynamic HFM.
Output: Optimal dynamic HFM.
while $D_{tentative} \neq \emptyset$ do
    Train initial CNN model $f$ using $D_{well}$.
    while $f$ has not converged do
        Select batch pool of candidate samples from $D_{tentative}$.
        Select $D_\Delta$ tentatively labeled samples filtered by the $\eta$, $\vartheta$, $\gamma$ parameters using (2) and (3) and each selection criterion.
        Assign pseudo-label and score to each unlabeled sample in $D_\Delta$.
        Sort pseudo-labeled tentative samples $D_\Delta$ in decreasing order.
        Divide the sorted tentative samples into $j$ bins in decreasing order. The $i$-th bin has samples in the range of $(i-1)/|D_\Delta|$ to $i/|D_\Delta|$.
        Generate bin sequence $BSeq = [bin_i]_{i=0}^{j}$ by partitioning $D_\Delta$.
        while $i < j$ do
            Train with train $\cup$ $bin_i$ and calculate $Acc_{bin^*}$.
            If $Acc^{(i+1)} < Acc^{(i)}$ or an outlier is detected by (1), the oracle labels incorrectly labeled data in $bin^*$ and returns $f^{(i+1)} = f^{(i)}$.
            $i$++
        end
        Retrain $f$ using $D_{well} = D_{well} \cup D_\Delta$; $D_{tentative} = D_{tentative} - D_\Delta$.
    end
    Update dynamic HFM with $f$.
end

The computational complexity of Algorithm 1 depends on the number of bins and the number of augmented classes used in our experiment, which we discuss in the next section. When many unseen object classes were present, or when the number of bins increased as a result, the training time also increased, but the performance of the proposed method improved. In our experiments, we generally used a small number of bins, around ten, to optimize performance.
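The bin-based inner loop of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions: `fit_and_score` (trains on a sample list and returns an accuracy) and `is_outlier_bin` (reports whether the outlier detector fires on a bin) are hypothetical callbacks standing in for CNN training and the outlier test, and bins are formed as equal-sized contiguous slices of the score-sorted samples.

```python
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[object, float]  # (sample, confidence score)


def make_bins(scored: Sequence[Scored], n_bins: int) -> List[List[Scored]]:
    """Sort samples by decreasing score and split them into contiguous bins."""
    ordered = sorted(scored, key=lambda p: p[1], reverse=True)
    if not ordered:
        return []
    size = -(-len(ordered) // n_bins)  # ceiling division
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]


def incremental_bin_training(
    train_set: Sequence[object],
    bins: Sequence[List[Scored]],
    fit_and_score: Callable[[List[object]], float],
    is_outlier_bin: Callable[[List[Scored]], bool],
):
    """Add bins one at a time; keep a bin only if accuracy does not drop.

    Bins that lower accuracy or trigger the outlier test are set aside for
    the oracle, and the previous model state is kept.
    """
    accepted = list(train_set)
    acc = fit_and_score(accepted)
    needs_oracle = []
    for b in bins:
        candidate = accepted + [sample for sample, _ in b]
        new_acc = fit_and_score(candidate)
        if new_acc < acc or is_outlier_bin(b):
            needs_oracle.append(b)             # roll back: keep previous model
        else:
            accepted, acc = candidate, new_acc  # accept the bin
    return accepted, acc, needs_oracle
```

Processing the highest-confidence bins first means the model is extended with the most reliable pseudo-labels before the noisier ones are tried, which is the motivation given above for sorting before binning.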

Experiment
The main goal of our experiment was to evaluate the efficiency of our dynamic HFM tree model framework using ASSL [44]. To achieve this goal, we conducted several experiments on benchmark datasets, such as PASCAL VOC, MS COCO, and ILSVRC. We then compared the results with advanced detectors such as Faster RCNN. We used the evaluation metric in the VOC development kit. All implementations ran on a single server with a single NVIDIA TITAN X and TensorFlow [45]. The base detector was YOLOv2 [9], which was trained on the PASCAL VOC 2007 and 2012 trainval datasets, and the resolution of the input images was 416 × 416. Training parameters were the same as in [27,37].
Local dataset. In our experiment, we used a local dataset with a fire-extinguisher class and a hog class as input to train new classes into our model. This dataset [5][6][7][8][9][10] was used in training together with the existing PASCAL VOC dataset. The fire-extinguisher class had 110 images (100 training images and 10 testing images). The hog class likewise had 110 images (100 training images and 10 testing images).
MS COCO dataset. The MS COCO dataset has 80 object-detection classes. We used MS COCO 2017 as the training and validation dataset [18], which has 118,287 training images and 5000 validation images. For ASSL training and evaluation, we used the training and validation data of the MS COCO animal classes that are unseen in PASCAL VOC (bear, elephant, giraffe, zebra).
ILSVRC DET dataset. The ILSVRC DET dataset has 200 classes for object detection training. We used the ILSVRC DET 2017 training and validation dataset [19], which contains 456,567 training images, 20,121 validation images, and 40,152 testing images. For ASSL training and evaluation, we used the training and validation data of the ILSVRC vehicle classes that are unseen in PASCAL VOC (golf cart, snowmobile, snowplow, unicycle, watercraft).

Results
Local dataset. Figure 4a1 shows that the fire-extinguisher class neighbored augmented class 9, and the hog class was similar to augmented classes 8, 11, 14, 17, and 22. It was therefore possible to define the fire-extinguisher and hog classes, whose distributions were similar to those of existing augmented classes, as new augmented classes. When we applied dynamic HFM with ASSL (Figure 4a2-a4 and the VOC 2007 test + local dataset graph in Figure 5), the area under the precision-recall curve and the average precision increased over time. In Figure 4a5, the fire-extinguisher class was well separated from the existing augmented class 9, and the hog class was likewise distinct from the existing augmented classes.
MS COCO dataset. Figure 4b1 shows the animal classes (bear, elephant, giraffe, zebra) neighboring augmented classes 2, 12, 14, 16, and 17. It was possible to define the COCO animal classes, whose distributions were similar to those of existing augmented classes, as new augmented classes. With our method, ASSL (Figure 4b2-b4 and the VOC 2007 test + COCO animal dataset graph in Figure 5) increased the area under the precision-recall curve and the average precision over time. In Figure 4b5, the COCO animal classes were well separated from existing augmented classes such as augmented classes 12 and 16.
ILSVRC DET dataset. Figure 4c1 shows that the ILSVRC DET vehicle classes were very similar to augmented classes 8, 10, 12, 16, 21, and 23. It was possible to define the vehicle classes, whose distributions were similar to those of existing augmented classes, as new augmented classes. When we applied dynamic HFM with ASSL (Figure 4c2-c4 and the VOC 2007 test + ILSVRC vehicle dataset graph in Figure 5), the area under the precision-recall curve and the average precision increased over time. In Figure 4c5, the vehicle classes were well separated from existing augmented classes such as augmented classes 12 and 16.
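The results above describe new classes as neighboring particular augmented classes, which Figure 4 shows visually. One hypothetical way to quantify such neighborhood relations is to compare class centroids in feature space, as in the Python sketch below. This is an illustrative assumption rather than the procedure used to produce Figure 4: the function names, the Euclidean centroid distance, and the 2-D toy features are all made up for the example.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def nearest_augmented_classes(
    new_features: Sequence[Sequence[float]],
    augmented_features: Dict[int, Sequence[Sequence[float]]],
    k: int = 1,
) -> List[int]:
    """Return the ids of the k augmented classes whose centroids are closest."""
    target = centroid(new_features)
    distances = {
        class_id: math.dist(target, centroid(features))
        for class_id, features in augmented_features.items()
    }
    return sorted(distances, key=distances.get)[:k]


# Toy 2-D features for three augmented classes (ids and values are made up).
toy_augmented = {
    9: [[0.0, 0.0], [1.0, 0.0]],
    12: [[10.0, 10.0], [11.0, 10.0]],
    16: [[5.0, 5.0], [5.0, 6.0]],
}
toy_new_class = [[1.0, 1.0], [2.0, 1.0]]

neighbors = nearest_augmented_classes(toy_new_class, toy_augmented, k=2)
print(neighbors)
```

A new class whose nearest centroid distance shrinks or grows across ASSL rounds would correspond to the merging or separation visible in the figure panels.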
Comparison with other detection methods. We also experimented with other detection methods. Table 1 shows that their performance dropped significantly when unseen object classes appeared, because these methods had no knowledge of those classes. In contrast, dynamic HFM outperformed other detection methods such as Faster RCNN, SSD300, and YOLOv2 on unseen object classes. The number of augmented classes in the dataset affects the performance of our method: as new augmented classes are added, its performance improves relative to the state-of-the-art methods.

Conclusions
In this paper, we proposed an open-set object detection framework called dynamic HFM, which provides incremental learning capabilities for unseen object classes. Data samples were clustered into superclasses according to deep-feature hierarchy attributes using the agglomerative clustering algorithm, and each superclass node was built to have multiple augmented class nodes instead of being directly associated with regular object classes, as in many other hierarchical approaches. By learning augmented classes with low mixture complexity instead of regular classes with high mixture complexity, the dynamic HFM discovers more informative deep-feature information. The dynamic HFM learned new object classes by embedding outlier detection and a collaborative sampling method based on incremental ASSL algorithms. Dynamic HFM nodes in the decision path were assembled into a prediction ensemble that associates a sample with a regular object class. Finally, an unseen object class is added as a new regular class. Compared with pure object detection methods, our model delivers higher precision with fewer errors and requires less human effort. These results encourage further improvement of the proposed model. Future research directions include finding ways to handle very large datasets.