In recent years, deep learning models have achieved great success in many fields, but their training relies on large-scale labeled data. However, the annotation of large datasets is difficult, time-consuming, and costly, which is a major challenge for deep learning. In order to reduce the cost of annotation, unsupervised learning [1
], semi-supervised learning [4
], and active learning [6
] have attracted attention.
Unsupervised learning, which learns only from unlabeled images, and semi-supervised learning, which learns from a small number of labeled images and a large number of unlabeled images, aim to train a model by fully utilizing unlabeled images. Rather than annotating all of the data when creating a labeled dataset, active learning aims to reduce the total amount of annotation required by prioritizing images that would be the most effective in training a model.
As shown in Figure 1
, active learning selects the images from the unlabeled pool that are most useful for model learning, labels the selected images using an oracle (annotator), and finally adds the labeled images to the labeled pool to update the task model. This process is repeated until the performance of the task model meets the requirements or the annotation budget is exhausted. Active learning is widely used in image classification [7
], which is the task of categorizing objects in an image, and segmentation [10
], which is the task of classifying each pixel in an image according to the object to which that pixel belongs.
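The pool-based loop just described can be sketched in a few lines; this is a generic illustration, and the names `train`, `score_uncertainty`, and `oracle_label` are hypothetical placeholders rather than functions from the paper.

```python
# Hedged sketch of the pool-based active learning loop in Figure 1.
# `train`, `score_uncertainty`, and `oracle_label` are illustrative
# placeholders, not names from the paper or any specific framework.

def active_learning_loop(labeled, unlabeled, budget, batch_size,
                         train, score_uncertainty, oracle_label):
    """Repeat: train the task model, select the most uncertain images,
    have the oracle (annotator) label them, and grow the labeled pool."""
    model = train(labeled)
    while budget > 0 and unlabeled:
        k = min(batch_size, budget, len(unlabeled))
        # Rank the unlabeled images by the task model's uncertainty.
        ranked = sorted(unlabeled,
                        key=lambda x: score_uncertainty(model, x),
                        reverse=True)
        selected, unlabeled = ranked[:k], ranked[k:]
        labeled = labeled + [(x, oracle_label(x)) for x in selected]
        budget -= k
        model = train(labeled)  # update the task model on the larger pool
    return model, labeled

# Toy run: "uncertainty" is just the sample value, "training" counts data.
model, labeled = active_learning_loop(
    labeled=[(0, "a")], unlabeled=[1, 2, 3, 4], budget=2, batch_size=2,
    train=len, score_uncertainty=lambda m, x: x, oracle_label=lambda x: "b")
```

The loop terminates either when the budget is exhausted or when the unlabeled pool is empty, matching the stopping conditions described above.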
In this study, we propose a method that selects, from an unlabeled pool, images that are effective for training task models in an active learning framework. To identify such images, our method uses an uncertainty sampling approach, which annotates the data whose predictions the model is most uncertain about. In recent years, many methods based on uncertainty sampling have been proposed. Refs. [12
] use variational autoencoder (VAE) and Ref. [17
] uses a deep learning model that infers the image loss to select uncertain images. In the VAE-based methods, a VAE reduces an image to a low-dimensional representation, and a discriminator determines whether the obtained representation was derived from a labeled or an unlabeled image. In the learning-loss method, the model consists of a task module and a loss prediction module that predicts the loss of the task module. The two modules are learned simultaneously, and the loss prediction module is trained to estimate the target loss for unlabeled samples from mid-layer features, which serves as a proxy for model uncertainty.
These prior studies rely on specialized learning schemes such as adversarial learning, which tends to result in complex and inefficient pipelines that are difficult to use in practical applications. Specifically, they require a separate model for computing uncertainty in addition to the task model. Because the two models are independent of each other, it is difficult for the task model to select the images it really needs, and the larger number of computation nodes increases the computation time. This added time detracts from the very objective of active learning: improving the efficiency of annotation.
To solve this problem, Refs. [18
] have been proposed that compute uncertainty algorithmically, using only the training of the task model itself. These methods do not require any learning to predict uncertainty, making them easy to use in practice. We propose an uncertainty indicator generator (UIG), an algorithmic method in the same spirit. The UIG is a non-deep-learning module based on a simple and powerful idea: selecting samples near the decision boundary of the model. It expresses the uncertainty of each image in the unlabeled pool as a single importance score, calculated from the prediction vectors of the task model; this has the advantage of allowing the task model itself to choose the images it really needs. For example, in image classification, the prediction is a probability vector over the categories. The uncertainty of each image is measured by two algorithms similar to least confident [6
] and margin sampling [6
]. In Figure 2
, the samples selected by SRAAL [13
] and the proposed method are marked in black. The figure shows that the proposed method is actually able to select samples near the decision boundary (see Section 4 and Section 5).
Since the proposed method does not require additional training to compute uncertainty, it can update the labeled set at shorter intervals than conventional methods during actual annotation. In addition, the simplicity of the method makes it easy to incorporate into the training pipeline of a task model, and the method requires no hyperparameters, making it easy to apply in real-world applications. The UIG can be implemented in a few lines of code added to the task model's training pipeline, so selecting the images to annotate requires almost no extra engineering. Moreover, the UIG does not need to be trained at the dataset creation site, and the task model can be updated at a high frequency because image selection can be performed from scratch faster than with conventional methods.
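As stated above, the UIG can be implemented in a few lines of code. The sketch below shows one plausible reading: a score combining a least-confident term and a margin-style term computed from the task model's softmax output. Summing the two terms is an assumption made for illustration; the paper's exact combination may differ.

```python
# Minimal sketch of a UIG-style uncertainty score. Illustrative only:
# the exact combination of the two terms used in the paper may differ.

def uncertainty_indicator(probs):
    """Score a softmax prediction vector with two classic uncertainty
    terms: least confident (1 - top-1 probability) and margin
    (1 - gap between the top two probabilities)."""
    p = sorted(probs, reverse=True)
    least_confident = 1.0 - p[0]      # low top-1 confidence -> uncertain
    margin = 1.0 - (p[0] - p[1])      # small top-2 gap -> near the boundary
    return least_confident + margin   # summing is an assumed combination

# Higher score = closer to the decision boundary = annotate first.
pool = [[0.90, 0.05, 0.05],   # confident prediction
        [0.40, 0.35, 0.25]]   # prediction near the decision boundary
scores = [uncertainty_indicator(p) for p in pool]
```

Because the score is computed directly from the task model's prediction vectors, no extra model and no hyperparameters are involved, which matches the properties claimed above.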
The main contributions of this paper are the following:
The method can be applied to multiple tasks;
Experimental results show that it is able to select more uncertain (high loss) images than conventional methods;
It is 14 times faster in execution time compared to conventional methods.
2. Related Works
There are three main scenarios for active learning [6
]: the first is membership query synthesis, which generates new data that are valid for model training; the second is stream-based selective sampling, which examines incoming data one sample at a time and decides whether to label or discard it; the third is pool-based sampling, which selects the images that are most effective for model learning from an unlabeled pool of data. Of the three, pool-based sampling, shown in Figure 1
, is the most common [12
]. Specifically, a representative image is selected from a large amount of unlabeled data and annotated by an oracle (called the annotator), thus becoming the labeled data. There are multiple approaches to using pool-based sampling.
Some studies focus on how to map the distance of a distribution to the information content of data points [24
], while others estimate the diversity of a distribution by observing the gradient [27
], future errors [28
], or changes in the output of the trained model [29]. Active learning methods using a pool-based sampling approach have been proposed [22]. However, these methods are computationally inefficient for current deep networks and large datasets.
To solve this problem, active learning methods [12
] effective for deep networks and large datasets have appeared. Ref. [17
] proposed the learning-loss method. Their model consists of a task module and a loss prediction module that predicts the loss of the task module. These two modules are learned simultaneously, and the loss prediction module is trained to estimate the target loss for unlabeled samples based on mid-layer feature information as a proxy for model uncertainty. Since the accuracy of loss prediction is affected by the performance of the task module, there is a disadvantage that if the task module is inaccurate, the predicted loss will not reflect how informative the sample is.
The methods that have become mainstream in recent years are those that use adversarial training [12
]. In these methods, a variational autoencoder (VAE) reduces an image to a low-dimensional representation, and a discriminator determines whether the obtained representation was derived from a labeled or an unlabeled image. By training the VAE and discriminator adversarially, images whose features the model has not yet seen can be selected from the unlabeled pool. Because these methods sample images based on image features, they have the advantage of being task-independent, applying equally to classification and segmentation, but also the disadvantage of taking more time than training the task model itself. SRAAL [13
] is an improved method of VAAL [12
]. By relabeling the output of the discriminator with an online uncertainty indicator (OUI), SRAAL achieves better selection of uncertain images than VAAL. We think that the contribution of SRAAL lies more in the OUI than in the VAE: using the OUI to relabel the discriminator amounts to distilling the discriminator toward the output of the OUI, so the component that actually represents uncertainty can be found in the OUI. Verifying this hypothesis is one of the significant aspects of our research. TA-VAAL [15
] is a state-of-the-art method that combines VAAL [12
] and learning-loss [17
]. By changing the task-learning-based loss prediction to ranking loss prediction and embedding normalized ranking loss information into VAAL using a ranking-conditional generative adversarial network, the data distribution of both labeled and unlabeled pools is taken into account. There are also methods that use VAE [33
] or GAN [35
] to generate new images that are more informative for the current model. However, these synthesis methods have disadvantages such as high computational complexity and unstable performance [37].
As noted in Section 1, these studies rely on specialized learning schemes such as adversarial learning, which tends to result in complex and inefficient pipelines that are difficult to use in practical applications. They require a separate model for computing uncertainty in addition to the task model; because the two models are independent, it is difficult for the task model to select the images it really needs, and the larger number of computation nodes increases the computation time, which detracts from the objective of improving the efficiency of annotation.
To solve this problem, Refs. [18
] have been proposed that compute uncertainty algorithmically, using only the training of the task model itself. These methods do not require any learning to predict uncertainty, making them easy to use in practice. Our uncertainty indicator generator (UIG) is an algorithmic method in the same spirit.
The uncertainty sampling approach [38
] annotates the data whose predictions the model is most uncertain about. Essentially, uncertainty is the degree to which the model cannot recognize the image; in other words, a sample that is closer to the decision boundary of the model can be regarded as more uncertain [40
]. Therefore, our study focused on selecting samples around the decision boundary.
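The intuition that samples near the decision boundary are the most uncertain can be checked on a toy logistic model; this example is purely illustrative and not taken from the paper.

```python
import math

# Illustrative check (not from the paper): for a 1D logistic model
# p(y=1|x) = sigmoid(w*x), a margin-style uncertainty score
# 1 - |p - (1 - p)| is maximal exactly at the decision boundary x = 0.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def margin_uncertainty(x, w=2.0):
    p = sigmoid(w * x)
    return 1.0 - abs(p - (1.0 - p))  # small class margin -> high uncertainty

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
scores = [margin_uncertainty(x) for x in xs]
```

The score peaks at x = 0 (where p = 0.5) and decays as the sample moves away from the boundary, which is exactly the behavior uncertainty sampling exploits.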
4.1. Experimental Overview
In this study, we evaluate the proposed method on three tasks: classification, multi-label classification, and semantic segmentation, and compare the results with those obtained by sampling with five baselines: VAAL [12
], SRAAL [13
], TA-VAAL [15
], entropy [6
], and random sampling. Furthermore, to confirm that the proposed method is able to select more uncertain images than the conventional methods, we compare the uncertainty indicator with the real loss. We also measure the execution time on each dataset and compare it with that of VAAL and SRAAL.
We also confirm in our experiments whether the method selects samples near the decision boundary and whether it selects samples with higher actual losses.
In addition, since the UIG is a combination of least confident and margin sampling, we performed ablation studies with each component used alone.
We started the experiment with 10% of the total dataset randomly sampled as labeled data and the remaining 90% as unlabeled data. We continued sampling in 5% increments until the labeled data reached 40% of the total dataset. Afterwards, we compared the performance of each dataset with that of previous studies and random sampling. Because active learning aims to achieve high accuracy with even a small number of annotations, in this experiment, as in [12
], we limited the labeled data to 40% of the total data. For multi-label classification, we start with 1% of the labeled data and sample in 0.5% increments until we reach 4%, and for semantic segmentation, we start with 10% of the labeled data and sample in 10% increments until we reach 70%.
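For reference, the three sampling schedules described above can be generated with a small helper; `sampling_schedule` is an illustrative name, not code from the paper.

```python
def sampling_schedule(start_pct, step_pct, end_pct):
    """Labeled-set sizes (as % of the full dataset) at each AL round."""
    pcts = []
    p = start_pct
    while p <= end_pct + 1e-9:   # tolerance guards against float drift
        pcts.append(round(p, 2))
        p += step_pct
    return pcts

cls_schedule = sampling_schedule(10, 5, 40)    # classification: 10% -> 40%
multi_schedule = sampling_schedule(1, 0.5, 4)  # multi-label: 1% -> 4%
seg_schedule = sampling_schedule(10, 10, 70)   # segmentation: 10% -> 70%
```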
The experiments were each conducted five times, and the results were averaged to obtain the final value. The evaluation metrics were accuracy for the classification tasks and mean intersection over union (mIOU) for the semantic segmentation task. Since the purpose of active learning is to make annotation for ordinary supervised learning more efficient, we selected the metrics commonly used for each task.
In Table 1
, we show the configuration of the datasets we used. For the image classification task in this study, we used the CIFAR-10 [42
] and CIFAR-100 [42
] datasets. In the semantic segmentation task, we used the Cityscapes [43
] dataset. Lastly, for the multi-label classification task, we used the CelebA [44
] dataset. CIFAR-10 and CIFAR-100 each contain 60,000 32 × 32 × 3 images, of which 50,000 are training data and 10,000 are test data. As indicated by their names, CIFAR-10 has 10 classes and CIFAR-100 has 100 classes. For Cityscapes, we resized the images to 688 × 688 × 3 when training the model; 2975 images were used as training data and 500 images as test data. Cityscapes is annotated with 34 classes, including ambiguous classes; for this study, the labels were converted to the 19 non-ambiguous classes. For CelebA, we resized the images to 64 × 64 × 3; the dataset has 40 attribute classes, of which only a few labels per image are used as answers. With 162,770 images, this dataset is much larger than the others; therefore, its sampling rate was increased in 0.5% steps from 1% to 4% in this study. To confirm whether we were able to select samples near the boundaries, we used the scikit-learn make_moons function [21
] to generate a two-dimensional binary dataset for binary classification as shown in Figure 2
. The noise option was set to 0.2 and the size of the labeled and unlabeled datasets was set to 500 samples each. From the 500 unlabeled samples, 30 samples were selected for each method.
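A minimal stand-in for this toy setup is sketched below. The actual experiments used scikit-learn's `make_moons`; the hand-rolled generator here only mimics its two interleaving half-moons so that the sketch stays dependency-free.

```python
import math
import random

def make_moons_like(n_samples, noise, seed):
    """Illustrative stand-in for scikit-learn's make_moons: two
    interleaving half-circles with Gaussian noise added to each point."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        t = math.pi * rng.random()
        if i % 2 == 0:   # upper moon
            x1, x2, label = math.cos(t), math.sin(t), 0
        else:            # lower moon, shifted to interleave
            x1, x2, label = 1.0 - math.cos(t), 0.5 - math.sin(t), 1
        X.append((x1 + rng.gauss(0, noise), x2 + rng.gauss(0, noise)))
        y.append(label)
    return X, y

# 500 labeled and 500 unlabeled samples with noise 0.2, as in the text;
# 30 of the unlabeled samples are then selected by each method.
X_lab, y_lab = make_moons_like(500, noise=0.2, seed=0)
X_unl, _ = make_moons_like(500, noise=0.2, seed=1)
```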
4.3. Implementation Details
For each task, we trained for about 100 epochs on the training data of each dataset. The task model was VGG16 [45
] for the image classification task and the multi-label classification task. We also used a dilated residual network (DRN) [46
] for the semantic segmentation task.
The optimization algorithms were stochastic gradient descent (SGD) for the image classification and multi-label classification tasks and Adam for the semantic segmentation task, with learning rates of 0.01 and 0.0003, respectively. As loss functions, we used the cross-entropy error for the image classification and semantic segmentation tasks and the negative log likelihood (NLL) loss for the multi-label classification task. In addition, we used only RandomHorizontalFlip as data augmentation.
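The training configuration above can be summarized in one place; the dict layout is illustrative, while the values are taken from the text.

```python
# Summary of the per-task training configuration described above.
# The dict structure is illustrative; the values come from the text.
TRAIN_CONFIG = {
    "classification": {          # CIFAR-10 / CIFAR-100, VGG16
        "optimizer": "SGD", "lr": 0.01, "loss": "cross_entropy",
        "augmentation": ["RandomHorizontalFlip"], "epochs": 100,
    },
    "multi_label": {             # CelebA, VGG16
        "optimizer": "SGD", "lr": 0.01, "loss": "nll",
        "augmentation": ["RandomHorizontalFlip"], "epochs": 100,
    },
    "segmentation": {            # Cityscapes, DRN
        "optimizer": "Adam", "lr": 0.0003, "loss": "cross_entropy",
        "augmentation": ["RandomHorizontalFlip"], "epochs": 100,
    },
}
```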
Since the experiments in Section 5.5
use a simplified dataset, a simplified model was also used. The task model was a three-layer multilayer perceptron (MLP) with a learning rate of 0.1 and the SGD optimizer. SRAAL's VAE used two-layer MLPs with ReLU for the encoder and the decoder, and a two-layer MLP for the discriminator; the VAE and discriminator used the Adam optimizer. All models were trained for 100 epochs in these experiments.
Note that UIG does not require hyperparameters.
5.1. Experimental Results for CIFAR-10
The experimental results with CIFAR-10 are shown in Figure 4
a. When VGG16 was trained on all the data in CIFAR-10, the accuracy was 84.50%. The proposed method achieved 81.83% with 40% of the data using the same task model. When we randomly sampled 40% of the data, the accuracy was 76.69%, almost equal to the accuracy of the proposed method when 25% of the data was sampled. The accuracy of the conventional method, VAAL [12
], was 78.96% with 40% of the data, which was close to the accuracy the proposed method achieved with only 30% of the data.
The accuracy of SRAAL is higher than that of VAAL but lower than that of the proposed method, which has a much simpler structure. This is because SRAAL structurally trains its discriminator with the output of its uncertainty indicator (the OUI) as the correct label; as a result, the uncertainty can be expressed more accurately by the indicator alone. The accuracy of TA-VAAL is higher than that of SRAAL but lower than that of the proposed method, although it should be noted that its result at 15% is very close to that of the proposed method.
Figure 5a shows the improvement over the random sampling baseline for each method, with the standard deviation shaded. The proposed method outperformed the others on CIFAR-10 at all stages.
These results indicate that in 10-class classification, selecting samples near the decision boundary of the model is very effective for training. Based on these results, we can say that our method reduces the number of samples that must be annotated, which is the objective of active learning.
The accuracy of each method differs from that reported in the original papers because of the different task models used (ResNet in SRAAL versus VGG in this experiment). In particular, the SRAAL paper does not describe its data augmentation, which suggests that the experimental environment differed in aspects beyond the method itself. Because we conducted our experiments in the same environment for every method, the results may differ from those in the original papers.
5.2. Experimental Results for CIFAR-100
The experimental results with CIFAR-100 are shown in Figure 4
b and Figure 5
b. When VGG16 was trained on all the data in CIFAR-100, the accuracy was 51.59%. The proposed method achieved 39.66% with 40% of the data using the same task model. When we randomly sampled 40% of the data, the accuracy was 36.52%, almost equal to the accuracy of the proposed method when sampling 35% of the data. For VAAL, the accuracy was 37.43% with 40% of the data, which was also close to the accuracy the proposed method achieved with 35% of the data.
These results show that the accuracy of the proposed method was better than the other methods except TA-VAAL, even with more complex datasets that had more classes and a similar amount of data to CIFAR-10. The results also indicate that, compared to CIFAR-10, the proposed method was effective even before the convergence of the accuracy values.
For sampling beyond 10%, the proposed method showed better accuracy than the previous studies. This is because the proposed method computes uncertainty from the output of the task model and thus supplies exactly the information the task model lacks, whereas VAAL and SRAAL select a wide variety of data through a VAE that is separate from the task model, which does not necessarily contribute to its training. The accuracy of TA-VAAL is higher than that of the proposed method at 10%, 15%, and 40%; the values at 10% and 40% are almost the same as ours, but the value at 15% stands out. TA-VAAL's accuracy on CIFAR-10 was also higher at 15% sampling than at other points. Taken together with the accuracy reported in its original paper, TA-VAAL seems to be more effective when the labeled set is small. In contrast, the proposed method is stable and highly accurate at all sampling points.
On CIFAR-100, TA-VAAL showed comparable performance to the proposed method. However, TA-VAAL takes about 10 times longer than the proposed method for training, making it computationally expensive.
5.3. Experimental Results for Cityscapes
The experimental results with Cityscapes are shown in Figure 4
c and Figure 5
c. When a DRN was trained on all the data in Cityscapes, the mIOU was 58.14%. The proposed method achieved 56.72% with 70% of the data using the same task model. When we randomly sampled 70% of the data, the mIOU was 54.98%, almost equal to that of the proposed method when sampling 30% of the data. The conventional method, VAAL, achieved 56.10% with 70% of the data, which was close to the value the proposed method achieved with 50% of the data. These results show that the proposed method performed better than the conventional methods in semantic segmentation.
The Cityscapes results in Figure 5
c show the improvement over random sampling decreasing as sampling progresses, in contrast to the results for the other datasets. Because Cityscapes is sampled up to 70% of the data, the later selections become more similar to random sampling than in the other datasets.
Table 2 shows a comparison of the average IOUs for each class obtained by random sampling and by the proposed method; each sampling was performed five times at the 50% sampling point. Table 3
shows the average, minimum, and maximum values of IOUs for each class in random sampling.
From the results shown in Table 2
, the proposed method obtained higher IOUs than the random sampling in 15 of the 17 classes (the two exceptions were “Road” and “Fence”). In addition, the mean of the maximum value of each class in Table 3
was 53.07, while the mean of the proposed method was 52.90. The proposed method's mean was thus close to the mean of the best values attained by random sampling, indicating that its sampling was close to ideal.
Figure 6 compares the ground truth images, the results of training with all the training images in Cityscapes, and the results of training with 50% of the data sampled by the proposed method. The proposed method obtains almost the same results as training with all the data while using only half of it.
These quantitative and qualitative comparisons show the superiority of the proposed method.
The average values in Table 2
differ from those in Figure 4
c at 50% sampling because the values in Figure 4
c are the mIOUs of each sample averaged over all the samples. The value for each class in Table 2
is the average of the IOUs over all the samples in which that class occurs, and the AVERAGE row is the mean over the classes (which accounts for the different number of occurrences of each class).
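The difference between the two averaging schemes described above can be made concrete with a toy example (hypothetical IOU values; `road` and `car` are illustrative class names):

```python
# Toy illustration of the two averaging schemes (hypothetical IOU values).
# iou[s] maps each class that occurs in sample s to its IOU on that sample.
iou = [
    {"road": 0.9, "car": 0.5},
    {"road": 0.7},               # "car" does not occur in this sample
]

# Scheme 1 (per-sample): mIOU of each sample, then averaged over samples.
per_sample_miou = [sum(s.values()) / len(s) for s in iou]
scheme1 = sum(per_sample_miou) / len(per_sample_miou)

# Scheme 2 (per-class): IOU of each class averaged over the samples where
# the class occurs, then averaged over the classes.
classes = sorted({c for s in iou for c in s})
class_means = [sum(s[c] for s in iou if c in s) /
               sum(1 for s in iou if c in s) for c in classes]
scheme2 = sum(class_means) / len(class_means)
```

The two schemes give different numbers on the same predictions (0.70 versus 0.65 here) because scheme 2 weights rare classes more heavily, which is why the figure and the table report different values at the same sampling point.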
5.4. Experimental Results for CelebA
The results of the CelebA experiment are shown in Figure 4
d and Figure 5
d. When the VGG16 was trained on all the data of CelebA, the accuracy was 88.40%. The proposed method achieved 87.71% with 4% of the data using the same task model. When 4% of the data was sampled randomly, the accuracy was 87.36%. The conventional method, VAAL, achieved 87.42% accuracy with 4% of the data.
After 2.5%, there was little difference between VAAL and random sampling, while the proposed method achieved higher accuracy relatively stably. These results indicate that the proposed method outperforms the conventional methods in multi-label classification.
The accuracy of SRAAL is lower than that of other methods, which indicates that the uncertainty indicator inferred by SRAAL does not work well for multi-label classification.
In addition, the results of the proposed method, VAAL, and random sampling showed variations on this dataset that were not seen on the other datasets. In the case of CelebA, the accuracy was already close to convergence at 1% sampling, and the performance gap between the proposed method and the other methods was smaller than on the other datasets. Nevertheless, the advantage of the proposed method is that it obtains higher accuracy than the other methods on average, even if the margin is small.
5.5. Analysis of Selected Samples
Figure 2 shows which samples were selected from the two-dimensional dataset by SRAAL and by the proposed method. The samples selected by SRAAL are spread over the whole area, whereas the proposed method selects samples close to the decision boundary. Our method aims at efficient annotation by selecting samples near the decision boundary, and Figure 2
shows that our method actually selects samples closer to the decision boundary than conventional methods.
5.6. Analysis of Uncertainty Indicator
Figure 7 plots, for CIFAR-10, the actual loss on the x-axis against each method's uncertainty index on the y-axis for SRAAL and the proposed method. The points chosen in the 5% selection after training on the randomly chosen initial 10% are shown in red. Note that both SRAAL and the proposed method select the unlabeled data points with the highest uncertainty index. In this figure, it is ideal to sample the points on the right (where the real losses are higher). SRAAL selects images with a wide range of actual losses, as shown in Figure 7
a, whereas our proposed method can select images with relatively high real loss values, as shown in Figure 7
b. Furthermore, the correlation coefficient between the uncertainty index and the real loss is 0.43 for the proposed method, compared to 0.01 for SRAAL. These results suggest that the proposed method estimates uncertainty more faithfully than SRAAL.
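The correlation mentioned above (presumably Pearson's correlation coefficient; the paper does not name the variant, so that is an assumption) can be computed as follows:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    as would be used to compare an uncertainty index against real losses."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy check with made-up values: a near-monotone indicator should
# correlate strongly with the loss, as the proposed method does.
losses = [0.1, 0.4, 0.9, 1.3]
indicator = [0.2, 0.5, 0.8, 1.1]
r = pearson(losses, indicator)
```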
5.7. Comparison of Model Computational Complexity
We compared the execution time of the proposed method with that of VAAL and SRAAL, and the results are shown in Table 4
. The execution times were measured on an NVIDIA GeForce RTX 2080Ti. We succeeded in reducing the execution time approximately by the following factors: 10 for both CIFAR-10 and CIFAR-100, 2 for Cityscapes, and 5 for CelebA. The memory requirements for sampling were 4114 MiB for the proposed method and 5586 MiB for VAAL and SRAAL when the mini-batch size was 128.
5.8. Ablation Study
In order to evaluate the contributions of least confident and margin sampling in the proposed method (UIG), we compared least confident alone, margin sampling alone, and the full UIG on CIFAR-10. The results are shown in Table 5
. The full UIG was the highest for sampling after 10%, indicating the effectiveness of both least confident and margin sampling.
In order to select the most informative samples from an unlabeled pool, we proposed the UIG, a model that derives the most informative unlabeled samples from the output of the task model. Experimental results show that the proposed method selects images with higher loss than the conventional methods, so the annotation cost can be reduced by minimizing the amount of labeled data used. The proposed method outperforms the current SoTA method by 1% accuracy on CIFAR-10. We also reduced the execution time by about 90% for CIFAR-10 and CIFAR-100, about 50% for Cityscapes, and about 80% for CelebA compared to the conventional methods.
In this study, we have shown the superiority of our method for images, but we believe that it can also be applied to more advanced tasks, such as video. We will also consider useful methods for combining active learning with semi-supervised and unsupervised learning for real-world applications.