Confidence measures for deep learning in domain adaptation

Abstract: In recent years, Deep Neural Networks (DNNs) have led to impressive results in a wide variety of machine learning tasks, typically relying on the existence of a huge amount of supervised data. However, in many applications (e.g., bio-medical image analysis), gathering large sets of labeled data can be very difficult and costly. Unsupervised domain adaptation exploits data from a source domain, where annotations are available, to train a model able to generalize also to a target domain, where labels are unavailable. Recent research has shown that Generative Adversarial Networks (GANs) can be successfully employed for domain adaptation, although deciding when to stop learning is a major concern for GANs. In this work, we propose some confidence measures that can be used to early stop the GAN training, also showing how such measures can be employed to predict the reliability of the network output. The effectiveness of the proposed approach has been tested in two domain adaptation tasks, with very promising results.


Introduction
Recently, deep learning has pushed the state of the art in several recognition tasks, e.g., object detection [1], semantic segmentation [2,3], speech recognition [4], and medical image analysis [5,6]. All of these results rely on the availability of large annotated datasets (e.g., ImageNet [7]). However, due to domain shift [8], models trained on these datasets do not generalize well to different sets of data [9,10]. Usually, a model is adapted to a new domain by fine-tuning; unfortunately, it is often difficult to obtain enough labeled data. Unsupervised domain adaptation [11] is used when supervision is available for the source domain but not for the target domain. It aims at building a model on the source dataset that also generalizes correctly to target samples, despite the domain shift. Domain adaptation can thus be seen as a particular case of transfer learning that leverages the labeled data of a source domain to learn a classifier for a target domain in which supervised data are not available. In general, it is assumed that the two domains share the same classes and that the target domain is related, but not identical, to the source domain. In this framework, Generative Adversarial Networks (GANs) [12] have been successfully employed [13], due to their ability to reproduce data distributions. Indeed, GANs are based on two competing networks, called the generator and the discriminator; the latter is engaged in determining whether a sample was produced by the generator or belongs to the original data distribution. In unsupervised domain adaptation, adversarial training has been employed to induce the model to learn the same distribution for the source and the target domain, either at the image or at the feature level. Three main problems are often encountered with this approach.
The first issue is directly related to the very nature of GANs, for which deciding when the training should be stopped is generally a difficult task. In fact, when convergence is not achieved, a visual evaluation of the generated images is usually used to stop the learning, a largely heuristic approach that is difficult to standardize. The second problem is directly related to the absence of labeled samples in the target domain, which makes it impossible to evaluate the network performance with the usual method based on a validation set. Finally, the last problem is related to the reliability of the model output, which is particularly affected when the input is significantly different from the training patterns. Although this limitation is common to all neural networks, it is particularly relevant in domain adaptation because the source and target datasets follow different data distributions.
In this research, we studied several confidence measures that do not depend on data labels and that allow evaluating the reliability of the network outputs, mitigating all the above-mentioned problems. The proposed measures are related to the two major sources of uncertainty that a model may have [14]:
• Epistemic uncertainty is related to the lack of knowledge and, in the case of a neural network model, means that the parameters are poorly determined due to data scarcity, so that the posterior distribution over them is broad.
• Aleatoric uncertainty is due to genuine stochasticity in the data; if the data are inherently noisy, then even the best prediction may have a high entropy.
The experiments were carried out using a recently proposed domain adaptation method [13] that employs GANs to learn features whose distributions are indistinguishable between the source and the target domain. The confidence measures were applied in two unsupervised domain adaptation tasks: SVHN [15] → MNIST [16] and CIFAR [17] → STL [18]. The experimental results show that, based on the confidence measures, the domain adaptation process can be stopped close to the optimum, which would be attainable only if a labeled validation set were available. Such measures are also used to evaluate the reliability of the classifier after the adaptation to the target domain.
The paper is organized as follows. In Section 2, the state-of-the-art approaches to unsupervised domain adaptation, generative adversarial networks and uncertainty estimation are reviewed. Section 3 presents the details of the proposed confidence measures, whereas Sections 4 and 5 describe the experimental setup and the obtained results, respectively. Finally, some conclusions and future perspectives are drawn in Section 6.

Related Work
In this section, the state-of-the-art research in unsupervised domain adaptation, generative adversarial networks and uncertainty estimation methods is briefly reviewed.

Unsupervised Domain Adaptation
Domain adaptation deals with the situation in which a source dataset with labeled instances is used to train a classifier with the aim of also generalizing to a target domain, where labeled data are not available [11]. Earlier solutions to domain adaptation employed a projection to align the source and target data space [19][20][21]. Recently, deep learning has been applied to this task, typically using weight sharing [22], reconstruction [22], or adding Maximum Mean Discrepancy (MMD) and association-based losses between the source and the target layers [23][24][25]. Moreover, the inclusion of adversarial loss functions allows for a better transfer of representations across domains [26][27][28]. Finally, the adversarial logic has been extended with the use of GANs, which are employed basically in two ways:

• To learn the feature distribution: The generator is trained to extract features that are indistinguishable for the target and the source domain [13][29][30][31][32].

Generative Adversarial Networks
Generative Adversarial Networks (GANs), first proposed in [12], consist of a pair of neural networks, a generator and a discriminator, trained through a min-max game. The generator aims at learning to reproduce the distribution of the training samples, whereas the discriminator tries to distinguish the generated samples from the real ones. Some approaches to stabilize the GAN training are presented in [39][40][41][42], while in [43,44], methods to control what GANs generate are proposed. In particular, CGANs [45] allow generating samples conditioned on the desired classes.
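For reference, the min-max game mentioned above is commonly written in the form below. The symbols G (generator), D (discriminator), p_data (distribution of the real samples) and p_z (noise prior) follow the usual notation of [12]; they are introduced here only for illustration and are not used elsewhere in this paper.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator maximizes V by assigning high scores to real samples and low scores to generated ones, while the generator minimizes it by producing samples that the discriminator cannot tell apart from real data.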

Uncertainty Estimation
Uncertainty estimation aims at detecting when a neural network is likely to make an incorrect prediction. Earlier works on this topic were traditionally based on Bayesian statistics. For instance, Bayesian Neural Networks (BNNs) [46] can learn a distribution over each of the network parameters. The uncertainty estimation naturally arises because the network is able to produce a distribution over the output for any given input. Unfortunately, BNNs are not easy to train for very complex problems. More recently, Monte-Carlo Dropout [47], Multiplicative Normalizing Flows [48] and Stochastic Batch Normalization [49] have been employed to produce uncertainty estimations with varying degrees of success. Deep Ensembles [50] are an alternative to BNNs, which estimate uncertainty by training more than one model and observing the variance of the predictions across all the models. A promising alternative to the computationally demanding approaches described above consists in training a neural network to learn the uncertainty for any given input [51][52][53] or, as for the contributions in [54,55], in employing the network output to measure its confidence. To the best of our knowledge, our proposed approach is the first to investigate the use of confidence measures to stop the training of a domain adaptation GAN and to evaluate the reliability of the network in the target domain.

Confidence Measures
In this section, the proposed confidence measures are introduced. In particular, Sections 3.1-3.3 describe some basic measures, while Section 3.4 illustrates how combining such measures helps in capturing complementary aspects related to the network uncertainty.

Entropy and Max Scaled Softmax Output
Unlike older architectures, most current neural networks are poorly calibrated, i.e., their outputs are not representative of the true output distribution. One of the most effective techniques to calibrate the output of a model is temperature scaling [54]. Given the vector of the network logits $z$ and the softmax output $\sigma_{SM}(z)$, to better represent the true posterior probability of the model, a scaled version of $\sigma_{SM}(z)$ can be defined as:
$$\hat{\sigma}_{SM}(z) = \sigma_{SM}(z / T)$$
with
$$\sigma_{SM}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}, \quad i = 1, \dots, c,$$
where $c$ is the number of classes. $T$ is a scalar parameter, called temperature, optimized on the validation set, which aims at "softening" the softmax (i.e., for $T > 1$, it raises the output entropy). The network output $y = \arg\max_i z_i$ does not change after temperature scaling. Two types of confidence measures have been considered in this case:
• Entropy
$$e = -\sum_{i=1}^{c} \hat{\sigma}_{SM}(z)_i \log \hat{\sigma}_{SM}(z)_i \qquad (3)$$
• Max scaled softmax output
$$s = \max_{i} \hat{\sigma}_{SM}(z)_i \qquad (4)$$
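The two measures of this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, the logits are made up, and T = 2.0 is an arbitrary value (in the paper, T is optimized on the validation set).

```python
import numpy as np

def scaled_softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / T, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # shifting does not change the softmax
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def entropy_measure(probs, eps=1e-12):
    """Entropy of the scaled softmax output (low entropy = confident)."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def max_softmax_measure(probs):
    """Largest scaled softmax output (high value = confident)."""
    return probs.max(axis=-1)

# Made-up logits for two samples and three classes.
logits = np.array([[4.0, 1.0, 0.5],
                   [1.2, 1.0, 0.9]])
p_raw = scaled_softmax(logits, temperature=1.0)
p_soft = scaled_softmax(logits, temperature=2.0)  # hypothetical temperature

print("entropy:", entropy_measure(p_raw), "->", entropy_measure(p_soft))
print("max softmax:", max_softmax_measure(p_raw), "->", max_softmax_measure(p_soft))
print("predicted classes:", p_raw.argmax(axis=-1), p_soft.argmax(axis=-1))
```

Note that the two measures point in opposite directions: a confident prediction has a low entropy but a high maximum softmax output, which matters when a single threshold convention is applied to both.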

Distance from the Classification Boundary
The fast gradient sign method [56], usually employed to generate adversarial examples, has been used to modify the input image until the classifier changes the original classification of the sample:
$$\eta = \epsilon \, \mathrm{sign}\left( \nabla_x J(\Theta, x, y) \right)$$
where $\epsilon$ is a constant, $\Theta$ collects the model parameters, $x$ is the input to the model, $y$ is the one-hot original prediction of the network, and $J(\Theta, x, y)$ is the cost function used to train the neural network.
$\eta$ is added to $x$ until a new image $x_{ADV}$, whose classification is different from $y$, is generated. A sample that lies near the classification boundary is more likely to be uncertain. Based on this assumption, two different ways to measure the confidence have been proposed:
• Euclidean distance between $x$ and $x_{ADV}$
$$d_{img} = \| x - x_{ADV} \|_2$$
• Magnitude of the gradient computed at the first step of the adversarial example generation procedure
$$g = \| \nabla_x J(\Theta, x, y) \|$$
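The procedure above can be illustrated on a toy model. The sketch below replaces the deep network with a linear softmax classifier, so that the input gradient of the cross-entropy has a closed form; the step size `epsilon`, the step budget, the choice of recomputing the gradient at every step, and the use of the L2 norm for the gradient magnitude are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(W, b, x, y_onehot):
    """Gradient of the cross-entropy J w.r.t. x for logits z = W x + b."""
    return W.T @ (softmax(W @ x + b) - y_onehot)

def boundary_measures(W, b, x, epsilon=0.01, max_steps=10_000):
    """Return (d_img, g) for one input x of a linear softmax classifier."""
    y = int(np.argmax(W @ x + b))          # original prediction
    y_onehot = np.eye(W.shape[0])[y]
    # g: gradient magnitude at the first step (L2 norm assumed here).
    g = float(np.linalg.norm(input_gradient(W, b, x, y_onehot)))
    x_adv = x.copy()
    for _ in range(max_steps):
        grad = input_gradient(W, b, x_adv, y_onehot)
        x_adv = x_adv + epsilon * np.sign(grad)   # fast gradient sign step
        if int(np.argmax(W @ x_adv + b)) != y:    # classification changed
            return float(np.linalg.norm(x - x_adv)), g
    return float("inf"), g  # boundary not reached within the step budget

# Toy two-class problem whose decision boundary is the line x[0] = x[1].
W, b = np.eye(2), np.zeros(2)
d_near, g_near = boundary_measures(W, b, np.array([0.55, 0.45]))
d_far, g_far = boundary_measures(W, b, np.array([2.0, -1.0]))
print("near the boundary:", d_near, g_near)
print("far from the boundary:", d_far, g_far)
```

In this toy example, the sample close to the boundary yields a small distance and a comparatively large first-step gradient, whereas the sample far from the boundary yields the opposite.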

Auto-Encoder Feature Distance
In the domain adaptation framework proposed in [13] and employed in this work, a feature extractor is trained on the source domain and a GAN is then used to learn the feature distribution, with the aim of extracting indistinguishable features from the source and the target data. If the extracted features significantly diverge from the distribution of the source domain features, the output on the target domain can be considered unreliable. Based on this assumption, $c$ auto-encoders ($c$ being the number of classes) are trained to reproduce the feature distribution of each class. We compute the Euclidean distances $d$ between the features $f$ extracted from the target dataset and those reproduced by the auto-encoders, as:
$$d_i = \| f - \hat{f}_i \|_2, \quad i = 1, \dots, c,$$
where $\hat{f}_i$ are the features produced by the auto-encoder of the $i$th class. Then, the network reliability is evaluated as follows.
• Difference between the first and the second minimum distance in $d$
$$d_{feat} = | d_j - d_l | \qquad (8)$$
where $j = \arg\min_{i} d_i$ and $l = \arg\min_{i \neq j} d_i$.
• Concordance between the prediction of the classifier and the class of the auto-encoder that best reconstructs the original features (CBP)
$$CBP = \begin{cases} 1 & \text{if } y = k \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$
where $y$ is the classifier output and $k = \arg\min_{i} d_i$.
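The two feature-based measures can be sketched as follows. To keep the example self-contained, each per-class auto-encoder is replaced by a linear stand-in based on PCA (projection onto the leading principal directions of the class features); this substitution, the class names, and the synthetic features are assumptions made for illustration and do not reproduce the auto-encoders trained in the experiments.

```python
import numpy as np

class PCAAutoEncoder:
    """Linear stand-in for a per-class auto-encoder (illustrative assumption)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, feats):
        self.mean_ = feats.mean(axis=0)
        # Rows of vt are the principal directions of the centered features.
        _, _, vt = np.linalg.svd(feats - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruct(self, f):
        code = (f - self.mean_) @ self.components_.T   # encode
        return code @ self.components_ + self.mean_    # decode

def feature_distance_measures(f, predicted_class, autoencoders):
    """Return the distance vector d, the gap d_feat, and the CBP flag."""
    d = np.array([np.linalg.norm(f - ae.reconstruct(f)) for ae in autoencoders])
    order = np.argsort(d)
    j, l = order[0], order[1]               # closest and second closest class
    d_feat = float(d[l] - d[j])             # non-negative by construction
    cbp = int(predicted_class == j)         # 1 if classifier and auto-encoder agree
    return d, d_feat, cbp

# Synthetic 8-dimensional "features" for three well-separated classes.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 8)
autoencoders = [PCAAutoEncoder(2).fit(c + 0.1 * rng.normal(size=(200, 8)))
                for c in centers]

f = centers[0] + 0.1 * rng.normal(size=8)   # a sample resembling class 0
d, d_feat, cbp_agree = feature_distance_measures(f, 0, autoencoders)
_, _, cbp_disagree = feature_distance_measures(f, 1, autoencoders)
print("distances:", d, "d_feat:", d_feat, "CBP:", cbp_agree, cbp_disagree)
```

A large gap between the two smallest reconstruction errors indicates that the features clearly resemble a single class, while CBP only checks whether that class coincides with the classifier prediction.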

Combined Confidence Measures
The proposed confidence measures can capture different and, often, complementary information related to the network uncertainty. For this reason, we define a way to combine their different contributions: each measure is first normalized between 0 and 1, and the normalized values are then averaged, for every possible combination of measures. For the sake of simplicity, we report here only the two combinations that, in our experiments, provided the best performance.
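A minimal sketch of this combination is given below. The text only states that each measure is normalized between 0 and 1 and that the results are averaged; the min-max normalization, the flipping of measures for which a low value means high confidence (such as the entropy), and the example values are assumptions.

```python
import numpy as np

def min_max_normalize(values, higher_is_confident=True):
    """Rescale a measure to [0, 1]; optionally flip it so that 1 = confident."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    scaled = np.zeros_like(v) if span == 0 else (v - v.min()) / span
    return scaled if higher_is_confident else 1.0 - scaled

def combine_measures(measures, orientation):
    """Average the normalized measures sample by sample."""
    normalized = [min_max_normalize(measures[name], orientation[name])
                  for name in measures]
    return np.mean(normalized, axis=0)

# Made-up values of two measures for three samples.
measures = {"e": [0.1, 0.9, 0.5],      # entropy: low = confident
            "s": [0.95, 0.40, 0.70]}   # max softmax: high = confident
orientation = {"e": False, "s": True}

combined = combine_measures(measures, orientation)
print(combined)
```

With this convention, the first sample (low entropy, high maximum softmax) receives the highest combined confidence and the second sample the lowest.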

Experimental Setup
The domain adaptation method used in our experiments is presented in Section 4.1, while Section 4.2 briefly introduces the employed benchmarks. Finally, in Section 4.3, the model training details are reported.

Domain Adaptation Network
The experiments were based on the domain adaptation network proposed in [13], whose architecture is depicted in Figure 1. Specifically, the network employs a classifier C, which is attached on top of a feature extractor E_S and trained on the source dataset (Step 0). After that, S, the generator of a conditional GAN (CGAN [45]), is used to learn the distribution of the features extracted by E_S from the source dataset (Step 1). Finally, a second GAN is used to train another feature encoder, E_I, aimed at extracting the same feature distribution from both the source and the target domain (Step 2). In this step, E_I acts as the GAN generator and learns to match the feature distribution produced by S (the weights of S are not updated in this phase). The procedure aims at extracting features from the target domain that are indistinguishable from those extracted from the source domain. Therefore, based on the features extracted by E_I, the classifier C can be used to classify images in the target domain. The training procedure proposed by Volpi et al. [13] was modified as follows:
• Step 0. The validation set of the source domain is used to early stop the training.
• Step 1. The CGAN generator, S, produces features for a given class; then, every 1000 training steps, C is engaged in classifying the features generated by S, using the conditioning labels to evaluate the error and early stop the training.
• Step 2. Every 1000 iterations, the classifier C is used to classify the features extracted by E_I. The proposed confidence measures are used to evaluate the performance of the classifier to early stop the training.
In the task of SVHN → MNIST, the network hyper-parameters (number of layers, initial weights, learning rate, batch size, etc.) are the same as in [13], while for CIFAR → STL, which is not considered in [13], the same network structure is maintained, changing only the weight initialization (truncated normal), the pooling kernel size (from 2 to 3), and using the padding "same" instead of "valid" in the feature extractor convolutions. (The type "same" means using zero-padding in order to have an output feature map with the same size as the input. Instead, "valid" means that no padding is added and the convolution is applied only inside the feature map; in this case, the resulting feature map size is reduced depending on the size of the convolutional kernel.)
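The effect of the two padding types on the feature map size can be checked with the short helper below. It follows the convention commonly used by deep learning frameworks for "same" and "valid" padding; the function name and the 5 × 5 kernel in the example are hypothetical, since the kernel sizes of the network are not restated here.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial size of a convolution output along one dimension."""
    if padding == "same":
        # Zero-padding is added so that only the stride reduces the size.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return math.floor((input_size - kernel_size) / stride) + 1
    raise ValueError(f"unknown padding type: {padding!r}")

# A 32-pixel-wide input and a hypothetical 5-pixel-wide kernel.
print(conv_output_size(32, 5, padding="same"))   # size is preserved
print(conv_output_size(32, 5, padding="valid"))  # size shrinks by kernel_size - 1
```

With unit stride, "same" keeps the 32-pixel width, whereas "valid" removes kernel_size - 1 pixels at every convolution, so deeper feature extractors lose resolution faster.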

Datasets
To evaluate the proposed confidence measures, we used two public source/target benchmark datasets, typically adopted in domain adaptation.

SVHN → MNIST
Street View House Numbers (SVHN) [15] is a dataset containing real images of house numbers taken from Google Street View, whereas MNIST [16] collects images of handwritten digits. MNIST images were resized to 32 × 32 pixels and SVHN images were converted to gray-scale. A subset of images taken from the extra set of SVHN was used as the validation set, to early stop the training of Step 0 and to compute the thresholds for the confidence measures.

CIFAR → STL
Both CIFAR-10 [17] and STL-10 [18] are 10-class image datasets with nine overlapping classes. Following the procedure described in [57], we removed the non-overlapping classes "frog" and "monkey", and reduced the problem to a nine-class classification problem. STL images were resized to 32 × 32 and a subset of the training set of CIFAR-10 was used as the validation set, to early stop the training of Step 0 and to compute the thresholds for the confidence measures.
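The class alignment described above can be sketched as a simple relabeling. The class name lists below are an assumption about how the two datasets order their labels and should be checked against the dataset documentation; the helper name and the "automobile"/"car" synonym table are hypothetical.

```python
# Assumed label orderings; verify them against the dataset documentation.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
STL10_CLASSES = ["airplane", "bird", "car", "cat", "deer",
                 "dog", "horse", "monkey", "ship", "truck"]
SYNONYMS = {"automobile": "car"}  # same class, different name

def build_label_maps(source_classes, target_classes, synonyms):
    """Map the original labels of both datasets onto the shared classes."""
    source_names = [synonyms.get(name, name) for name in source_classes]
    shared = sorted(set(source_names) & set(target_classes))
    new_index = {name: i for i, name in enumerate(shared)}
    source_map = {i: new_index[n] for i, n in enumerate(source_names)
                  if n in new_index}
    target_map = {i: new_index[n] for i, n in enumerate(target_classes)
                  if n in new_index}
    return shared, source_map, target_map

shared, cifar_map, stl_map = build_label_maps(
    CIFAR10_CLASSES, STL10_CLASSES, SYNONYMS)
print(shared)      # the nine shared classes
print(cifar_map)   # original CIFAR-10 label -> shared label ("frog" is dropped)
print(stl_map)     # original STL-10 label -> shared label ("monkey" is dropped)
```

Samples whose original label is missing from the corresponding map belong to a non-overlapping class and are discarded before training.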

Network Training
Each experiment was carried out following the same setup, based on the network architecture reported in Figure 1. First, E_S and C were trained on the source dataset; then, E_S was used in Step 1 to train the feature generator S. The validation set of the source domain was used to compute the scale factor T needed for both the confidence measures e and s (see Equations (3) and (4)). Features extracted by E_S from the validation set were then used to train an auto-encoder for each class, to compute d_feat and CBP (see Equations (8) and (9)). Each measure was used to evaluate the reliability of the classifier, predicting whether the network C would correctly classify the sample or not. For this purpose, a threshold for each measure needed to be set to discriminate certain and uncertain samples. Threshold values were selected to maximize the accuracy on the validation set of the source domain, based on the Receiver Operating Characteristic (ROC) (the ROC is a graphical plot that displays, at different thresholds, the true positive rate against the false positive rate achieved by a binary classifier) obtained using a balanced subset of 5000 classified/misclassified samples (see Figures 2 and 3).
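The threshold selection described above can be sketched as follows. The example scans the candidate thresholds that define the points of the ROC curve and keeps the one with the highest accuracy on a balanced subset; the helper names, the synthetic confidence values, and the subset size are assumptions made for illustration.

```python
import numpy as np

def balanced_subset(is_correct, n_per_group, rng):
    """Indices of an equal number of correctly and incorrectly classified samples."""
    is_correct = np.asarray(is_correct, dtype=bool)
    correct_idx = rng.choice(np.flatnonzero(is_correct), n_per_group, replace=False)
    wrong_idx = rng.choice(np.flatnonzero(~is_correct), n_per_group, replace=False)
    return np.concatenate([correct_idx, wrong_idx])

def select_threshold(confidence, is_correct):
    """Threshold th maximizing the accuracy of the rule 'confidence > th => certain'."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    values = np.unique(confidence)
    candidates = (values[:-1] + values[1:]) / 2.0  # one candidate per ROC point
    best_th, best_acc = None, -1.0
    for th in candidates:
        acc = float(np.mean((confidence > th) == is_correct))
        if acc > best_acc:
            best_th, best_acc = float(th), acc
    return best_th, best_acc

# Synthetic validation data: correct predictions tend to have higher confidence.
rng = np.random.default_rng(0)
is_correct = np.concatenate([np.ones(900, dtype=bool), np.zeros(300, dtype=bool)])
confidence = np.where(is_correct,
                      rng.normal(0.8, 0.1, size=1200),
                      rng.normal(0.5, 0.1, size=1200))

idx = balanced_subset(is_correct, 200, rng)
best_th, best_acc = select_threshold(confidence[idx], is_correct[idx])
print("threshold:", best_th, "accuracy on the balanced subset:", best_acc)
```

Balancing the subset prevents the threshold from collapsing to the trivial rule that labels every sample as certain when the classifier is very accurate.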
The obtained thresholds were also used at the end of the adaptation procedure to evaluate the reliability of the classifier on the target domain. Finally, E_I was trained on the target domain using the GAN of Step 2. In this phase, every 1000 iterations, features extracted from the target domain by E_I were fed into the classifier C. The reliability estimator M, a regressor that computes one of the proposed confidence measures, was used to decide when to stop the training.

Figure 3. ROC curves obtained using each measure to predict the accuracy on the CIFAR validation set. Measures e, s, d_img, g, d_feat, mix_1 and mix_2 are plotted in (a)-(g), respectively.

Results
Section 5.1 reports the experimental results obtained using the proposed confidence measures to stop the training of the GAN in Step 2, whereas Section 5.2 illustrates the results achieved when the confidence measures were employed to evaluate the reliability of the classifier after the domain adaptation phase.

Early Stop of the GAN Training
In the domain adaptation process, the labels of the target domain are not available. For this reason, a validation set cannot be used to evaluate the training progress; normally, a fixed number of training steps is set to stop the domain adaptation phase. This criterion does not guarantee that the training is stopped at the right time, especially if the training is not stable. The proposed confidence measures proved to be related to the performance of a classifier and, in addition, they do not require the labels of the target dataset to be computed. For these reasons, our idea was to use them to monitor the domain adaptation process and to stop the training properly. In particular, the proposed confidence measures could be used to stop the training of the GAN in Step 2. In this GAN, the generator E_I was initialized with the pre-trained weights of E_S and was fed with a combination of images from both the target and the source domain. E_I was trained to learn the feature distribution produced by S (the generator of the GAN in Step 1). This particular configuration seemed to guarantee much more stability than a classic GAN framework. To prove the effectiveness of the proposed measures in the GAN training, the following stopping strategies were compared:
• The training was stopped according to the confidence measures (e, s, d_img, g, d_feat, mix_1 and mix_2) computed by the reliability estimator M.
• The GAN was trained for a fixed number (200,000) of iterations (Fix Iter.).
• The early stop was based on the accuracy calculated on the validation set (Max Acc.).
The experiments aimed at comparing the performance of the classifier C when the GAN was stopped according to these different strategies. It is worth noting that the stopping criterion based on the use of a validation set is not feasible in domain adaptation and, in this work, it was used only for comparison purposes, since it provides an ideal optimal performance. The experiments were carried out on two domain adaptation tasks: SVHN → MNIST and CIFAR10 → STL.
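The first strategy can be sketched as a training loop that periodically evaluates a label-free confidence measure on target samples. The exact stopping rule is not detailed in the text, so the patience-based rule below, the function names, and the toy confidence curve are assumptions; `train_step` and `confidence_on_target` stand for the GAN update of Step 2 and for the reliability estimator M, respectively.

```python
import numpy as np

def train_with_confidence_stopping(train_step, confidence_on_target,
                                   max_iters, eval_every=1000, patience=5):
    """Stop when the mean target confidence has not improved for `patience` checks."""
    best_score, best_iter, stale = -np.inf, 0, 0
    for it in range(1, max_iters + 1):
        train_step(it)
        if it % eval_every != 0:
            continue
        score = float(np.mean(confidence_on_target()))  # no target labels needed
        if score > best_score:
            best_score, best_iter, stale = score, it, 0  # a checkpoint would be saved here
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iter, best_score

# Toy stand-ins: the confidence peaks at iteration 30,000 and then degrades.
state = {"it": 0}

def toy_train_step(it):
    state["it"] = it

def toy_confidence():
    value = 0.9 - ((state["it"] - 30_000) / 100_000) ** 2
    return np.full(8, value)  # one confidence value per target sample

best_iter, best_score = train_with_confidence_stopping(
    toy_train_step, toy_confidence, max_iters=200_000)
print("best iteration:", best_iter, "stopped at:", state["it"])
```

In this toy run, the loop selects the checkpoint at the confidence peak and stops well before the fixed budget of 200,000 iterations.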

SVHN → MNIST
In this experimental setup, SVHN was used as the source domain and MNIST as the target domain. The experiments were repeated 20 times to obtain statistically reliable results. Table 1 reports the accuracy obtained using different measures as stopping criteria, while Table 2 shows the mean and the standard deviation of the difference between the accuracy obtained with the use of a validation set (theoretical optimum) and the accuracy obtained with the proposed measures.

CIFAR → STL
In this experimental setup, CIFAR was used as the source domain and STL as the target domain. As in SVHN → MNIST, the experiments were repeated 20 times to obtain statistically reliable results. Table 3 shows the accuracy obtained using different measures as stopping criteria, while Table 4 displays the mean and the standard deviation of the difference between the accuracy obtained with the use of a validation set (theoretical optimum) and the accuracy obtained with the proposed measures.

The experiments showed that the proposed confidence measures allowed stopping the domain adaptation process when the accuracy was close to the performance that could be obtained only with a labeled validation set. Moreover, the obtained accuracy was better than the one achievable using a fixed number of iterations. It is also worth mentioning that, in our experiments, the training of the GAN tended to converge; in cases where this favorable condition is not satisfied, the proposed measures might become even more effective. Furthermore, in both experimental setups, the best results were obtained using a combined measure. This suggests that different measures capture different information about the network output uncertainty. Here, measures were combined simply by computing their mean (see Section 3.4), whereas it is a matter of future work to evaluate more sophisticated combination strategies.

Classifier Reliability
In a real domain adaptation setting, the performance of the classifier C cannot be evaluated, since a labeled test set is not available. Instead, the proposed confidence measures allow estimating the reliability of the classifier C on target samples. Formally, a reliability classifier M, implementing one of the proposed measures, estimated the confidence of the output of C. For each confidence measure, a threshold th was applied to the output of M to classify the network output as certain or uncertain. If the confidence provided by M was larger than th, then the output of C was considered to be reliable; otherwise, it was considered unreliable. In the experiments, th was set based on the ROC, using each measure to predict the accuracy on the validation set of the source dataset (see Figures 2 and 3). Tables 5 and 6 show the accuracy and the Area Under the ROC (AUROC) achieved by M on the test set of the target domain. (The area under the ROC is a measure of the performance of the classifier. Unlike the accuracy, the AUROC does not depend on a single predefined threshold, but provides a single value that summarizes the behavior of the classifier over all possible thresholds.) In Tables 7 and 8, the corresponding confusion matrices are shown. On SVHN → MNIST, the best reliability classifier M obtained an accuracy of about 89% and an AUROC of about 73%. Notice that, in this case, the accuracy was not a significant metric because the classifier C was very accurate and, therefore, correctly and incorrectly classified samples were unbalanced (i.e., we would obtain an accuracy of 92.34% simply by classifying each instance as "certain"). However, the value of the AUROC suggests that the proposed confidence values are correlated with the errors of C. The benchmark CIFAR → STL was more challenging and, in fact, the network proposed by Volpi et al. [13] reached an accuracy of just 56% on STL.
In this case, our measures allowed predicting the correctness of the output with an accuracy of about 67% and an AUROC of about 73%. Thus, the experiments suggested that the proposed confidence measures are closely related to the actual reliability of the classifier, and could be very useful where an accurate uncertainty estimation is fundamental for decision making, such as in the automatic analysis of medical images.
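The evaluation metrics used in this section can be sketched as follows. The AUROC is computed through its pairwise interpretation (the probability that a correctly classified sample receives a higher confidence than a misclassified one), and the baseline is the accuracy obtained by always predicting the majority outcome; the function names, the confidence values, and the threshold are made up for illustration.

```python
import numpy as np

def auroc(confidence, is_correct):
    """Probability that a correct sample outranks a wrong one (ties count 0.5)."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    pos, neg = confidence[is_correct], confidence[~is_correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

def reliability_report(confidence, is_correct, th):
    """Confusion matrix, accuracy of the certain/uncertain decision, and baseline."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    certain = confidence > th
    # Rows: C correct / C wrong; columns: predicted certain / uncertain.
    confusion = np.array([
        [np.sum(certain & is_correct), np.sum(~certain & is_correct)],
        [np.sum(certain & ~is_correct), np.sum(~certain & ~is_correct)],
    ])
    accuracy = float((confusion[0, 0] + confusion[1, 1]) / confusion.sum())
    rate = float(is_correct.mean())
    baseline = max(rate, 1.0 - rate)  # always predict the majority outcome
    return confusion, accuracy, baseline

# Made-up confidence values for eight target samples.
confidence = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
is_correct = np.array([1, 1, 1, 0, 1, 1, 0, 0], dtype=bool)

area = auroc(confidence, is_correct)
confusion, accuracy, baseline = reliability_report(confidence, is_correct, th=0.5)
print(confusion)
print("accuracy:", accuracy, "baseline:", baseline, "AUROC:", area)
```

Comparing the accuracy with the majority baseline makes explicit why, when C is very accurate, the AUROC is the more informative of the two metrics.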

Conclusions
We investigated the use of confidence measures in domain adaptation to stop the training of a GAN and to evaluate the reliability of the classifier on the target domain. The results show that the proposed measures allowed stopping the training early while approaching the optimum, which could otherwise be reached only if a labeled validation set were available. Moreover, confidence measures can also be used to estimate the reliability of the image classifier. Finally, simply mixing different confidence measures (based on an average operation) allowed capturing different types of network uncertainty, refining the reliability estimation, whereas it is a matter of future work to consider more targeted combination strategies. It will also be interesting to assess the proposed approach within different domain adaptation frameworks and with different benchmarks, particularly in decision support systems for medical imaging, where the uncertainty assessment of a predictive model is mandatory.