Ensemble-Based Out-of-Distribution Detection

Abstract: To design an efficient deep learning model that can be used in the real world, it is important to detect out-of-distribution (OOD) data well. Various studies have been conducted to solve the OOD problem. The current state-of-the-art approach uses a confidence score based on the Mahalanobis distance in a feature space. Although it outperformed the previous approaches, its results were sensitive to the quality of the trained model and the dataset complexity. Herein, we propose a novel OOD detection method that can train a more efficient feature space for OOD detection. The proposed method uses an ensemble of the features trained using a softmax-based classifier and a network based on distance metric learning (DML). Through the complementary interaction of these two networks, the trained feature space has a more clumped distribution and can fit well to a Gaussian distribution by class. Therefore, OOD data can be efficiently detected by setting a threshold in the trained feature space. To evaluate the proposed method, we applied it to various combinations of image datasets. The results show that the overall performance of the proposed approach is superior to that of other methods, including the state-of-the-art approach, on every combination of datasets.


Introduction
Deep learning has achieved state-of-the-art performance in various tasks, such as speech recognition [1,2], image classification [3,4], video prediction [5,6] and medical diagnosis [7,8]. Nevertheless, several problems with deep learning remain. This study focuses on two of them. The first is the closed-world assumption. Contemporary deep learning models are designed under the static, closed-world assumption that training and testing datasets have the same distribution [9]. However, in the real world, data distributions may undergo complex and dynamic shifts over time, and even a novel dataset with an unseen distribution might be presented to the model at test time. These shifted and unseen data distributions may cause critical failures because the model attempts to predict results under the closed-world assumption [10]. The second is the high-confidence problem. It is well known that modern deep learning models may yield improper predictions with high confidence, even for unseen data distributions [11]. These problems, which are called out-of-distribution (OOD) problems [12], cause overfitting and complicate the calibration of deep learning models [13,14]. Therefore, to design an efficient deep learning model that can be used in the real world, it is important to detect OOD data well.
Various studies have been conducted to solve the OOD problem. A baseline model was proposed to detect OOD data using a neural network's softmax value as a confidence score [12]. As an extension of the baseline method, the out-of-distribution detector for neural networks (ODIN) was proposed to improve performance using temperature scaling and input preprocessing [15]. ODIN outperformed the baseline method; however, it required hyperparameters to be tuned appropriately for each dataset. Additionally, approaches based on generative models and auxiliary datasets have been proposed [16][17][18]. The current state-of-the-art method uses confidence scores based on the Mahalanobis distance in a feature space [19]; however, its results are sensitive to the dataset complexity and the quality of the trained model. In that respect, we proposed an OOD detection method based on distance metric learning (DML) in our previous research [20]. This method trains a clumped feature space (in which data with the same label are located close together) by class using a DML-based network (instead of a softmax-based classifier) and can detect OOD samples efficiently in that feature space. Our previous method outperformed the state-of-the-art approach on 1-channel image datasets with relatively simple structures. However, it could not detect OOD well on 3-channel image datasets with more complex structures.
Herein, we propose a novel OOD detection method that uses not only DML-based networks but also softmax-based classifiers, as an extended version of our previous work. The proposed method obtains a more efficient feature space for OOD detection through an ensemble of the features trained using the softmax-based classifier and the DML-based networks, including the Siamese and triplet networks [21,22]. The trained feature space has a more clumped distribution and fits better to a Gaussian distribution by class, compared with the state-of-the-art approach and our previous method. An example of the trained feature spaces is shown in Figure 1. In the testing phase, OOD data can be detected as follows: (1) measure the distance between the features of the input data and each class distribution as a confidence score and (2) apply a threshold to that distance. To evaluate the proposed OOD detection module, we applied our method to various combinations of 1-channel and 3-channel image datasets. Subsequently, we verified the performance of OOD detection by comparing it with the previous approaches.
The remainder of this paper is organized as follows. Section 2 presents the related studies, Section 3 describes the proposed OOD detection method and Section 4 details our experiment. The paper is concluded in Section 5.

Related Work
In this section, we introduce several OOD detection methods and DML-based networks used in the proposed method and our previous approach.

OOD Detection Methods
This study mainly focuses on OOD detection methods based on a confidence score and a threshold (among various OOD-related studies). Therefore, this section reviews some threshold-based OOD detection methods relevant to this study. The baseline method of OOD detection [12] was proposed based on the tendency of a well-trained neural network to assign a higher softmax score to in-distribution examples than to OOD examples. In this approach, OOD data can be detected by using the maximum softmax value as a confidence score and applying a threshold to it. The softmax function is shown in Equation (1), where $f_i(x)$ is the logit of class $i$ and $N$ is the number of classes:

$$S_i(x) = \frac{\exp(f_i(x))}{\sum_{j=1}^{N} \exp(f_j(x))} \qquad (1)$$
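As a minimal illustration of the baseline idea (hypothetical function names; the threshold value below is arbitrary, not one used in the paper), the maximum softmax probability of Equation (1) can be used directly as a confidence score:

```python
import numpy as np

def softmax_confidence(logits):
    """Maximum softmax probability as a confidence score (Equation (1))."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

def is_ood(logits, threshold=0.9):
    """Flag a sample as OOD when its softmax confidence falls below a threshold."""
    return softmax_confidence(logits) < threshold
```

A confident prediction (one dominant logit) yields a score near 1, while a uniform logit vector yields 1/N, which is what makes thresholding possible.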
The ODIN method [15] was proposed to improve the performance of the baseline by using temperature scaling and by adding small controlled perturbations to the input data. The temperature scaling $T$ is applied to the baseline confidence scoring function, as shown in Equation (2):

$$S_i(x; T) = \frac{\exp(f_i(x)/T)}{\sum_{j=1}^{N} \exp(f_j(x)/T)} \qquad (2)$$

The ODIN method outperformed the baseline method; however, it required hyperparameters to be tuned appropriately for each dataset.
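A minimal sketch of the temperature-scaling part of Equation (2) (hypothetical function name; T = 1000 is only an illustrative value, and the input-perturbation step is omitted):

```python
import numpy as np

def odin_confidence(logits, T=1000.0):
    """Temperature-scaled softmax confidence (Equation (2)).

    T is a tuned hyperparameter; a large T flattens the softmax
    distribution, which tends to widen the score gap between
    in-distribution and OOD samples.
    """
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())
```

Note that temperature scaling lowers the raw score of every sample; it is the relative separation of the scores, not their absolute values, that improves detection.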
The Mahalanobis-based approach [19], which demonstrates state-of-the-art performance, uses a confidence score based on the Mahalanobis distance in a feature space. This approach was designed under the assumption that the output features of a well-trained softmax-based neural classifier can be fitted well by a class-conditional Gaussian distribution. The confidence score is defined by calculating the Mahalanobis distance using the class means and the covariance of the feature map, thereby enabling the effective detection of OOD samples. Although this method outperformed the previous approaches, its results were sensitive to the dataset complexity and the quality of the trained model.
The DML-based approach [20] was proposed in our previous research. To train a more efficient feature space than the state-of-the-art approach, this method uses DML-based networks (described in the next section) instead of a softmax-based classifier. The trained feature space has a more clumped distribution by class, and OOD samples can be detected by applying a threshold in this feature space. This method performed well on 1-channel image datasets with relatively simple structures. However, it could not detect OOD well on 3-channel image datasets with more complex structures.

Networks Based on Distance Metric Learning (DML)
DML is a branch of machine learning algorithms that aims to learn similarities between data samples using a distance-based loss function [23]. As this method embeds similar data samples closer, DML-based networks can train more clumped feature spaces by class. In this section, we introduce two DML-based networks used in the proposed method and our previous research.
The Siamese network [21] comprises one cost function and two sub-networks that share parameters and have the same structure. When training the Siamese network, two inputs are passed through the sub-networks. One is an anchor input $x_a$, and the other is either a positive input $x_p$ with the same label as the anchor or a negative input $x_n$ with a different label. After the sub-networks, the distance between the two output features is calculated by the cost function. The Siamese network uses the contrastive loss as its cost function; hence, inputs with the same label are embedded close together in the feature space, while inputs with different labels are pushed apart during training. The contrastive loss function is shown in Equation (3), where $y$ indicates whether the pair shares a label, $D$ is the distance between the two output features and $M$ is a constant margin:

$$L = y D^2 + (1 - y)\max(M - D, 0)^2 \qquad (3)$$

In the testing phase, the test dataset is entered into one of the trained sub-networks; subsequently, clumped feature vectors can be obtained by class. The Siamese network structure is shown in Figure 2a.
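A minimal sketch of the contrastive loss for a single pair, assuming the common squared-distance form with margin M (hypothetical function name; the paper's exact Equation (3) may differ in detail):

```python
import numpy as np

def contrastive_loss(f_a, f_b, same_label, margin=1.0):
    """Contrastive loss for one pair of embedded features.

    f_a, f_b   : output feature vectors of the two sub-networks
    same_label : True for a positive pair (x_a, x_p), False for (x_a, x_n)
    margin     : the constant M; negatives closer than M are penalized
    """
    d = np.linalg.norm(np.asarray(f_a, float) - np.asarray(f_b, float))
    if same_label:
        return float(d ** 2)                    # pull positives together
    return float(max(margin - d, 0.0) ** 2)     # push negatives beyond M
```

Positive pairs are penalized by their squared distance, while negative pairs incur no loss once they are farther apart than the margin, which produces the clumped per-class feature space described above.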
The triplet network [22] is based on the Siamese network. It comprises a triplet loss function and three sub-networks. The triplet loss function is shown in Equation (4), where $D(\cdot,\cdot)$ is the distance between two output features and $M$ is a constant margin:

$$L = \max\big(D(f(x_a), f(x_p)) - D(f(x_a), f(x_n)) + M,\; 0\big) \qquad (4)$$

When training the triplet network, three inputs are provided: the anchor input $x_a$, the positive input $x_p$ and the negative input $x_n$. Using these three inputs and the triplet loss function, the anchor is embedded farther from the negative input than from the positive input. After training, testing proceeds as in the Siamese network. The structure of the triplet network is shown in Figure 2b.
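Similarly, a minimal sketch of the triplet loss, assuming the standard margin-based form (hypothetical function name):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    """Triplet loss for one (anchor, positive, negative) triple:
    push the anchor at least `margin` closer to the positive than
    to the negative."""
    f_a, f_p, f_n = (np.asarray(f, dtype=float) for f in (f_a, f_p, f_n))
    d_ap = np.linalg.norm(f_a - f_p)    # anchor-positive distance
    d_an = np.linalg.norm(f_a - f_n)    # anchor-negative distance
    return float(max(d_ap - d_an + margin, 0.0))
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training focuses on triples that still violate that ordering.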

Methodology
In this section, we present our proposed method for detecting OOD samples. Our method improves upon the Mahalanobis-based approach [19], the current state-of-the-art method, and the DML-based approach [20], our previous work. The state-of-the-art approach uses only a softmax-based classifier, and our previous approach uses only a DML-based network to train the feature space. Figure 1 shows well-trained feature spaces obtained with a softmax-based classifier and with DML-based networks. In such feature spaces, we may detect OOD well using these approaches. However, there is no guarantee that the networks are always trained well; if they are not, we cannot detect OOD well in the resulting feature space. Therefore, in this study, we used both networks together (not only the softmax-based classifier but also the DML-based network) to train a more efficient feature space for OOD detection. In the proposed method, the feature space is trained by an ensemble of the features trained using the softmax-based classifier and the DML-based network. With the complementary interaction between the two networks provided by the ensemble, the trained feature space has a more clumped distribution and fits better to a Gaussian distribution by class. Thus, OOD samples can be efficiently detected in that feature space. Figure 3 shows the overall structure of our proposed method. Apart from the networks that train the feature spaces, the protocols of the state-of-the-art approach (such as input preprocessing, the Mahalanobis-based confidence score and the feature ensemble) were also used to efficiently detect OOD samples in this study.
Input Preprocessing [15]. In the testing phase, to increase the confidence score based on the Mahalanobis distance, the input preprocessing technique is applied, in which a small controlled noise is added to the test samples. This technique makes in-distribution and OOD samples more separable. The preprocessed test samples are obtained by Equation (5), where $x$ represents the test sample, $\epsilon$ is the magnitude of the noise and $M(x)$ is the confidence score based on the Mahalanobis distance:

$$\hat{x} = x + \epsilon \, \mathrm{sign}\big(\nabla_x M(x)\big) \qquad (5)$$

Confidence Score Based on Mahalanobis Distance [19]. The Mahalanobis distance between the test sample and the closest class distribution is used as a confidence score. The Mahalanobis distance of the $l$-th layer, $M_l(x)$, is calculated using Equation (6), where $c$ is the class index, $f_l(x)$ is the feature of the test sample at the $l$-th layer, and $\mu_{l,c}$ and $\Sigma_l$ are the class mean and covariance matrix, respectively:

$$M_l(x) = \max_{c} \; -\big(f_l(x) - \mu_{l,c}\big)^{\top} \Sigma_l^{-1} \big(f_l(x) - \mu_{l,c}\big) \qquad (6)$$

Feature Ensemble [19]. In the state-of-the-art approach, the feature ensemble technique is used to calculate a weighted sum of the confidence scores obtained from the features of several layers. Using this technique, we can ensemble the features trained by the softmax-based classifier and the DML-based network, and we can also measure and combine the confidence scores of the final feature and the other low-level features in the two networks. Effective layers can thus be assigned a higher weight, and ineffective layers a lower weight. This is expressed as Equation (7), where $M_l^S$ and $\alpha_l^S$ are the confidence score and its weight obtained from the $l$-th layer of the softmax-based classifier, and $M_l^D$ and $\alpha_l^D$ are those of the $l$-th layer of the DML-based network:

$$M(x) = \sum_{l} \alpha_l^S M_l^S(x) + \sum_{l} \alpha_l^D M_l^D(x) \qquad (7)$$

In the experiments of this study, both sets of weights were trained by logistic regression using a small validation dataset that consisted of 1000 images from each in- and out-of-distribution pair, similar to [19].
Here, M(x) is the total confidence score based on the Mahalanobis distance. Figure 4 shows the overall process of the proposed method. In the training phase, features are extracted from the training samples using both the softmax-based classifier and DML-based network trained on in-distribution dataset. Subsequently, the mean and covariance are calculated for each class from the extracted features. In the testing phase, features are extracted from the test samples consisting of the same ratio of in-and out-of-distribution datasets, with a small amount of controlled noise added. Thereafter, the Mahalanobis distance between the test samples and the closest class distribution is calculated using the class mean and covariance. The calculated Mahalanobis distances from the output features of several layers in the two networks are ensembled. Finally, OOD samples can be detected by applying a threshold to the ensembled Mahalanobis distance.
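To make the scoring pipeline concrete, the sketch below (hypothetical function names, NumPy only; the actual implementation extracts features from the layers of the trained ResNet34 networks) estimates per-class Gaussians with a shared covariance, computes the Mahalanobis-based confidence of Equation (6) for one feature vector, and combines layer-wise scores from the two networks as in Equation (7):

```python
import numpy as np

def class_statistics(features, labels):
    """Per-class means and a shared (tied) covariance matrix, i.e. a
    class-conditional Gaussian fit to the training features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    # pool the centered features of all classes for one shared covariance
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    return means, cov

def mahalanobis_confidence(f_x, means, cov):
    """Equation (6): negative Mahalanobis distance to the closest class."""
    f_x = np.asarray(f_x, dtype=float)
    precision = np.linalg.inv(cov)
    scores = [-(f_x - mu) @ precision @ (f_x - mu) for mu in means.values()]
    return max(scores)

def ensemble_confidence(scores_softmax, scores_dml, w_softmax, w_dml):
    """Equation (7): weighted sum of layer-wise scores from both networks.
    The weights would normally be fitted by logistic regression on a
    small validation set, as described in the text."""
    return (float(np.dot(w_softmax, scores_softmax))
            + float(np.dot(w_dml, scores_dml)))
```

An in-distribution feature vector lands near one class mean and receives a score near zero, while an OOD vector is far from every class distribution and receives a large negative score; thresholding this ensembled score yields the detector.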

Experiments
In this section, the performance of the proposed OOD detection method is evaluated, analyzed and compared with the previous approaches on various combinations of datasets.
All proposed methods were implemented using Python 3.7 and PyTorch 1.5 on two NVIDIA TITAN RTX 24 GB GPUs. In the experiments on 1-channel image datasets, we used ResNet34 [3] for the softmax-based classifier and a ResNet34-based Siamese or triplet network (described in Section 2) for the DML-based network. These networks were trained with a learning rate of 0.001, a batch size of 32 and the Adam optimizer. For the experiments on 3-channel image datasets, we used the trained ResNet34 for each dataset, provided with the state-of-the-art approach (https://github.com/pokaxpoka/deep_Mahalanobis_detector (accessed on 27 February 2021)), as the softmax-based classifier. A ResNet34-based Siamese or triplet network was again used for the DML-based network. We trained these DML-based networks with a batch size of 256 and the Adam optimizer. The learning rate was initialized at 0.001 and decreased to 0 by a cosine scheduler [29]. In addition, we used early stopping to prevent overfitting [30]. Subsequently, we applied our method to various combinations of the standard datasets mentioned above.
For performance comparison, we considered the baseline model, ODIN, the Mahalanobis-based approach and our previous method (as described in Section 2). Our previous method was trained using DML-based networks, and the other models were trained using ResNet34 in the same way as the proposed method. In the experiments on 3-channel image datasets, the trained ResNet34 for each dataset (provided with the state-of-the-art approach (https://github.com/pokaxpoka/deep_Mahalanobis_detector (accessed on 27 February 2021))) was used for those models, excluding our previous approach.
To evaluate the performance, the following metrics were used: the true negative rate (TNR) at a 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), the detection accuracy (DTACC) and the area under the precision-recall curve (AUPR). Using these metrics, the performance of OOD detection methods can be evaluated without selecting a specific threshold [15]. Our source code is available on GitHub (https://github.com/yangdonghun3/Ensemble_based_OOD_Detection (accessed on 27 February 2021)).
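As a concrete illustration of one of these metrics, the following sketch (hypothetical function name; higher scores are assumed to indicate in-distribution) computes the TNR at 95% TPR from two score arrays:

```python
import numpy as np

def tnr_at_tpr95(scores_in, scores_out, tpr=0.95):
    """TNR at 95% TPR: pick the threshold that keeps `tpr` of the
    in-distribution scores above it, then measure the fraction of
    OOD scores that fall below (true negatives)."""
    thr = np.quantile(scores_in, 1.0 - tpr)   # 95% of in-dist scores >= thr
    return float(np.mean(np.asarray(scores_out) < thr))
```

Because the threshold is derived from the in-distribution scores alone, this metric (like AUROC, DTACC and AUPR) characterizes a detector without committing to any single operating threshold.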
We detail our experimental results on 1-channel and 3-channel image datasets in the next subsections.

Experimental Results on 1-Channel Image Datasets
The performance of each method over the training epochs in the first experiment on 1-channel image datasets is shown in Figure 5a. The ensemble-triplet (navy line) version of the proposed method showed the best performance among all methods. In addition to their better performance, the triplet-based approaches (purple and navy lines) consistently remained robust during the entire training phase, whereas the other models showed unstable performance depending on the epoch. Consequently, the proposed method and our previous method can be considered less sensitive to the selection of hyperparameters for OOD detection in this experiment. Among our approaches, the triplet-based ones (purple and navy lines) performed better than the Siamese-based ones (yellow and green lines). Table 1 presents the average performance at the epoch showing the best TNR over several experiments on 1-channel image datasets. Table 1a also shows that the proposed method and our previous method outperformed the other methods, including the state-of-the-art approach. Turning to the experiment on (In) MNIST/(Out) Fashion-MNIST, except for the baseline method and ODIN, all other models showed nearly 100% OOD detection performance and stable states at all epoch points on all metrics, as shown in Figure 5b. This means that these three methods detected OOD samples perfectly during the entire training phase. Consequently, we consider that the proposed method, our previous method and the Mahalanobis-based approach could completely separate in-distribution and OOD samples because the structure of the MNIST dataset is simple. Table 1b also shows that the three methods detected the OOD samples perfectly.

Experimental Results on 3-Channel Image Datasets
To further verify the performance of the proposed method, additional experiments were performed on various combinations of 3-channel image datasets, which have more complex structures than 1-channel image datasets; the results are reported in Tables 2-4. The tables show the average performance at the epoch showing the best TNR over 10 experiments on 3-channel image datasets. In the case of (In) SVHN, the proposed method, our previous approach and the Mahalanobis-based approach performed well on all combinations of datasets, whereas the other models could not detect OOD well, as shown in Table 2. The table also shows that the overall performance of the proposed approaches (ensemble-Siamese, ensemble-triplet) was superior to the others, and the ensemble-triplet method achieved the best TNR among all methods, approximately 0.16-1.40% higher than that of the state-of-the-art approach. Moreover, except for OnlySiamese on (Out) CIFAR-10, our previous methods (OnlySiamese, OnlyTriplet) detected OOD well in most cases, in contrast to their results on the other combinations of 3-channel image datasets with more complex structures. In the case of (In) CIFAR-10, the proposed method and the Mahalanobis-based approach detected OOD samples well, while the other models did not perform well, as shown in Table 3. Furthermore, our proposed methods (ensemble-Siamese, ensemble-triplet) showed the best performance for all combinations of datasets; specifically, their TNR values were approximately 0.01-0.61% higher than those of the state-of-the-art approach. In the experiments on (In) CIFAR-100, similarly, the proposed method and the Mahalanobis-based approach outperformed the other methods, as shown in Table 4. The table also shows that the proposed methods (ensemble-Siamese, ensemble-triplet) were superior to the others and achieved the best TNR, approximately 1.80-7.17% higher than that of the state-of-the-art approach.
However, our previous methods (OnlySiamese, OnlyTriplet) showed performance similar to or worse than that of the ODIN method in both the (In) CIFAR-10 and (In) CIFAR-100 cases. We attribute these poor results to CIFAR-10 and CIFAR-100 having more complex structures than SVHN. In summary, our previous method, trained using only a DML-based network, outperformed the other models, including the state-of-the-art approach trained using only a softmax-based classifier, on simple datasets (such as 1-channel image datasets); however, it could not detect OOD well on complex datasets (such as 3-channel image datasets). In contrast, the proposed method, trained by the ensemble of the two networks, outperformed all other methods on both the combinations of simple image datasets and the combinations of complex datasets, showing up to 7% higher TNR than the state-of-the-art approach.

Conclusions
This study proposed a novel OOD detection method that can train a more efficient feature space. The proposed method uses an ensemble of the features trained using a softmax-based classifier and a DML-based network. With the complementary interaction between these two networks, the trained feature space has a more clumped distribution and can be better fitted to a Gaussian distribution by class. Thus, OOD samples can be efficiently detected by setting a threshold in this feature space. To verify the proposed method, we applied our OOD detection approach to various combinations of standard datasets that have been most actively used for evaluating and comparing OOD detection methods. We then compared its performance with that of the previous approaches. The results showed that the overall performance of the proposed approach was superior to that of the other methods, including the state-of-the-art approach trained using only a softmax-based classifier and our previous method trained using only a DML-based network. We believe that the proposed approach has the potential to be applied in designing various machine learning models that can be used efficiently in the real world, where data distributions undergo complex changes.

Conflicts of Interest:
The authors declare no conflict of interest.