Recent Advancements in Active Learning

Kwon, Bokyung Amy; Kang, Kyungtae

doi:10.3390/math14081358

Open Access Review

by Bokyung Amy Kwon and Kyungtae Kang *

Department of Artificial Intelligence, College of Computing, Hanyang University, Ansan 15588, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(8), 1358; https://doi.org/10.3390/math14081358

Submission received: 13 March 2026 / Revised: 9 April 2026 / Accepted: 16 April 2026 / Published: 18 April 2026

(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Active learning (AL) aims to maximize model performance while minimizing annotation costs. With the rapid adoption of deep learning, AL approaches have evolved to meet contemporary demands. We systematically examine the literature published from 2018 to May 2025, focusing on four key trends: batch-mode selection, transfer learning integration, multi-strategy querying, and extension to diverse application domains. In addition, we summarize classical AL approaches. While observations show that combining AL with deep learning significantly enhances data efficiency, a critical limitation remains: the lack of standardized evaluation protocols across studies hinders precise comparisons. Nevertheless, we find that AL is well-aligned with modern trends, and we offer insights into underexplored opportunities to guide future research within the machine learning community.

Keywords:

active learning; query strategy; machine learning; deep learning; complexity; supervised learning

MSC:

68-11; 68T01; 68U01

1. Introduction

Over the past few decades, machine learning and deep learning have rapidly evolved and been applied across various domains. Within this field, active learning (AL) has consistently gained attention as an effective strategy. Traditional supervised learning typically requires large-scale labeled datasets, which often leads to high learning costs due to expensive annotation. AL addresses this issue when labeled data are limited but unlabeled data are abundant. Unlike traditional methods, AL strategically selects specific instances from the input space to optimize the learning process. While standard supervised learning begins with a model trained on a pre-annotated dataset, the AL framework starts with a small amount of annotated data. The active learner then selects the most informative unlabeled instances based on predefined criteria and submits them to an oracle for labeling. The current hypothesis is then updated with this new information, and the process repeats until a stopping criterion or budget constraint is met. Through this iterative approach, AL reduces error and converges toward the target function more efficiently. This framework is particularly advantageous for deep learning models, which generally depend on massive datasets to achieve high accuracy. AL has been examined both analytically [1] and empirically, and it has proven effective in diverse fields such as natural language processing, remote sensing, pattern recognition, and protein classification. Most of the literature demonstrates that AL can achieve significant performance gains while minimizing labeling effort. However, several key components are essential for the successful implementation of AL, both analytically and empirically. This study aims to provide a comprehensive reference on these critical components, reflecting current trends in the literature. The primary contributions of this paper are as follows:

Outline the problem settings of AL based on an extensive literature review.
Examine recent advancements and strategies within the AL framework.
Identify underexplored research opportunities to enhance AL capabilities.

The remainder of this paper is organized as follows: We first provide an overview of notations and the fundamental AL framework with formal definitions. Subsequently, we summarize recent trends, including classical results and baseline approaches that reflect modern developments. Finally, we discuss potential research directions for expanding AL capabilities and provide concluding remarks.

2. Brief Introduction of Active Learning

2.1. Problem Setting

The problem setting for AL can be broadly categorized into three frameworks. The first is the query synthesis framework, in which the learner synthesizes query instances near the decision boundary induced by the learned hypothesis [2]. This setting is often regarded as efficient, but the synthesized instances may be difficult for the oracle to recognize and label [3]. Accordingly, the synthesis framework is outside the scope of our study. The other two are the stream-based framework and the pool-based framework, which differ in how the learner decides what to query. In a stream-based framework, the learner decides, for each incoming instance, whether to query it, whereas in a pool-based framework the learner selects an instance from a pool of unlabeled instances according to a predefined criterion. The stream-based framework must make each decision within a short time interval, so it typically operates in an online setting, where a single-pass approach is ideal [4]. Meanwhile, the pool-based framework iterates over the instances in the pool, which can be relatively time-consuming. Because unlabeled instances can often be acquired freely or at low cost, the pool-based framework is commonly applied. In this context, we are mainly concerned with the stream-based and pool-based frameworks (see Figure 1).
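The difference between the two decision rules can be sketched in a few lines. This is only an illustration under stated assumptions: the names `uncertainty`, `stream_should_query`, and `pool_select`, the uncertainty score, and the threshold value are hypothetical and not taken from the surveyed literature.

```python
from typing import List, Sequence


def uncertainty(p_pos: float) -> float:
    """Binary uncertainty score: 1 at p = 0.5, 0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p_pos - 0.5)


def stream_should_query(p_pos: float, threshold: float = 0.6) -> bool:
    """Stream-based rule: decide immediately, in a single pass, whether
    the incoming instance is worth querying (threshold is an assumption)."""
    return uncertainty(p_pos) >= threshold


def pool_select(pool_probs: Sequence[float], k: int = 1) -> List[int]:
    """Pool-based rule: rank the whole unlabeled pool and return the
    indices of the k most uncertain instances."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: uncertainty(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The stream rule inspects one posterior at a time and never revisits an instance, whereas the pool rule needs the posteriors of every unlabeled instance before it can choose, which is where its additional cost comes from.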

2.2. Notations and Definition

Let random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ follow a probability distribution $D$, where $\mathcal{X}$ is an input space and $\mathcal{Y}$ is a response space. We observe a sequence of i.i.d. samples $\{z_{i} \equiv (x_{i}, y_{i}) : i = 1, \dots, n\}$, where $\mathcal{X}$ is a standard Borel space of $\mathbb{R}^{m}$ and $\mathcal{Y} = \{- 1, 1\}$. Denote by $H$ a hypothesis class of measurable functions $h : \mathcal{X} \to \mathcal{Y}$. With $L$ and $U$ denoting the labeled set and the unlabeled set, respectively, the performance of the learner can be measured by the classification error in Equation (1).

$\mathrm{err} (h) = P (h (x) \neq y)$

(1)

AL aims to maximize performance by identifying a correct hypothesis while minimizing the number of label queries within a budget constraint.

2.3. Noise Assumptions

AL algorithms are often analyzed under a specific noise assumption. Noise assumptions are mainly divided into the realizable condition and the non-realizable conditions, giving six primary categories in total. These conditions are described as follows:

Realizable noise condition:
The realizable assumption posits an ‘error-free’ environment, where some function $h \in H$ perfectly matches the target function and the instances are perfectly separable. This condition is assumed in many studies, but it is relatively stringent in practice since the target function is typically unknown.
Agnostic noise condition:
The agnostic noise assumption is the weakest non-realizable assumption. It makes no assumption on the target function and is often referred to as the adversarial noise condition. For $ϵ \in [0, 1]$, it is stated as in Equation (2).

$\inf_{h \in H} \mathrm{err} (h) \leq ϵ$

(2)
Benign noise condition:
The benign noise assumption is defined by constraining the range of $ϵ$ to $[0, \frac{1}{2}]$ in Equation (2).
Tsybakov’s noise condition:
Tsybakov’s noise assumption, also referred to as the low noise condition, is a widely applied non-realizable condition. For some $C \in [1, \infty)$ and $α \in [0, 1]$, it is stated as in Equation (3).

$P [h (x) \neq h^{*} (x)] \leq C \cdot {(\mathrm{err} (h) - \mathrm{err} (h^{*}))}^{α}$

(3)

This condition reflects the pragmatic hypothesis that the amount of label noise is inversely related to the distance from the decision boundary [5]. Here, $h^{*}$ is the Bayes classifier. Since the condition always holds when $α = 0$, it is of interest when $α \in (0, 1)$. The coefficient $α$ indicates how quickly the target function $η$ changes as an instance $x$ approaches the decision boundary [6].
Massart’s noise condition:
Massart’s noise assumption is interpreted as a margin condition on the probability distribution $D$ , as shown in Equation (4), which is also referred to as bounded noise assumption [7].

$P (x : |η (x) - \frac{1}{2}| \leq \frac{1}{2 \cdot C}) = 0$

(4)

where $C \in [1, \infty)$ , and $η$ is a target function. Massart’s noise condition is derived from Tsybakov’s noise assumption by setting $α = 1$ with $C \in [1, \infty)$ .
Bernstein’s noise condition:
Bernstein’s noise condition [8] is a weaker form of Tsybakov’s noise condition [9]. For $0 \leq β \leq 1$, a non-zero constant $C$, and a loss function $l$, it is defined as in Equation (5).

$E_{z} [{(l (h (x), y) - l (h^{*} (x), y))}^{2}] \leq C \cdot {[\mathrm{err} (h) - \mathrm{err} (h^{*})]}^{β}$

(5)
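The statement that Massart’s condition corresponds to Tsybakov’s condition with $α = 1$ can be checked with a short worked derivation. The sketch below assumes that $η (x) = P (Y = 1 | X = x)$ and that $h^{*}$ is the Bayes classifier, and it uses the standard excess-risk identity for binary classification, which is not stated in the text above.

```latex
% Assumptions: \eta(x) = P(Y = 1 \mid X = x), h^{*} is the Bayes classifier.
\begin{aligned}
\mathrm{err}(h) - \mathrm{err}(h^{*})
  &= E\big[\, |2\eta(x) - 1| \cdot \mathbb{1}\{h(x) \neq h^{*}(x)\} \big] \\
  &\geq \frac{1}{C}\, P\big(h(x) \neq h^{*}(x)\big)
  \qquad \text{since } \big|\eta(x) - \tfrac{1}{2}\big| > \tfrac{1}{2C}
  \text{ almost surely by Equation (4)}, \\
\Longrightarrow \quad
P\big(h(x) \neq h^{*}(x)\big)
  &\leq C \cdot \big(\mathrm{err}(h) - \mathrm{err}(h^{*})\big)^{1}.
\end{aligned}
```

The last line is Equation (3) with $α = 1$ and the same constant $C$.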

2.4. Standard Query Strategies

The query strategy represents a pivotal element of AL, as it facilitates the identification of optimal instances through a systematic analysis of the input domain. By selecting highly informative instances, the learner can converge toward the target function more efficiently, thereby minimizing the total number of required queries. The following sections provide a concise overview of classical query strategies, categorized by their fundamental characteristics.

The version space strategy:
The version space strategy repeatedly narrows down the version space by eliminating inconsistent concepts based on instances at each iteration, where the version space refers to all plausible versions of concepts [2]. Hence, the version space $V_{k}$ is defined as $\{h \in H : h (x_{k}) = y_{k}\}$ at the k-th iteration. Since the version space is refined by examining concepts at each iteration, effective implementation significantly reduces label complexity; however, inconsistent concepts hinder the identification of the optimal hypothesis [10] and lead to computational intractability.
The query by committee (QBC) strategy:
The QBC strategy selects the instance exhibiting maximal disagreement in voting among a committee consisting of multiple learners, according to a given criterion. This strategy is formulated as a weighted sum of a disagreement measure such as a divergence, as shown in Equation (6), where $n_{c}$ and $D_{0}$ indicate the number of committee members and the given criterion, respectively [11].

$\arg \max_{x \in \mathcal{X}} \sum_{c = 1}^{n_{c}} ω_{c} \cdot D_{0} (P (Y | ξ_{c, t - 1} (x)), P (Y | \bar{ξ} (x)))$

(6)

Here, $ω_{c}$ represents a mixing weight, and $ξ_{c, t - 1} (x) = \langle θ_{c, t - 1}, x \rangle$ denotes the inner product between committee member parameters and input, while $\bar{ξ} (x)$ represents the consensus model parameters, which is also referred to as the e-mixture of models [12].
This strategy exponentially decreases the generalization error: since each query bisects the version space, the bound on information gain asymptotically approaches 1 as the number of queries approaches ∞ [13]. Furthermore, the accuracy of the estimate generally increases if the number of committee members exceeds two.
The expected error reduction (EER) strategy:
The expected error is defined as the predicted error of a specific estimator when a new observation is incorporated into the training process. The expected error reduction (EER) strategy estimates the potential generalization error associated with an observation and selects the query instance that minimizes this expected error. According to Settles [3], this objective is typically formulated as:

$\arg \min_{x} \sum_{i} P_{θ} (y_{i} | x) (\sum_{j \in U} (1 - P_{θ^{+}} (\hat{y} | x^{j})))$

(7)

In (7), $θ^{+}$ denotes the model retrained after augmenting the training set with the pair $(x, y_{i})$ . Given that the error term can be decomposed into bias and variance components, the variance term—being more analytically tractable than the total error—is frequently utilized in practice, focusing on expected variance reduction [14].
The uncertainty sampling strategy:
Uncertainty sampling strategies prioritize instances for selection based on predefined uncertainty criteria, under the assumption that high uncertainty corresponds to high informativeness. The concept is intuitive and can be implemented using various measures, such as entropy or margin-based metrics. For binary classification using posterior probabilities, uncertainty can be formulated as in (8):

$\arg \min_{x} |P_{θ} (\hat{y} | x) - 0.5|$

(8)

As shown in (8), the most informative sample is the one characterized by maximum uncertainty. Consequently, the strategy selects the instance whose posterior probability is closest to 0.5. In the case of more than two class labels, the least confident instance is chosen as shown in (9), where $\hat{y} = \arg \max_{y} P_{θ} (y | x)$.

$\arg \max_{x} (1 - P_{θ} (\hat{y} | x))$

(9)
The representative sampling strategy:
The representative sampling strategy selects the most informative instances by analyzing the structural characteristics of the unlabeled data based on predefined criteria, such as density, diversity, or similarity. In density-based selection [15], the density score, Q, is defined as the average feature similarity among the nearest neighbors within a batch, as formulated in Equation (10):

$Q = \frac{1}{K} \sum_{i = 1}^{K} sim (z_{i}, z_{k})$

(10)

This approach identifies query instances that exhibit the highest density scores within the input space. Furthermore, diversity-based strategies are widely adopted across various information criteria [16,17,18], particularly when integrated with uncertainty sampling in batch settings. However, due to its inherent nature, this strategy often requires a larger number of samples, which may lead to slower convergence rates.
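The uncertainty criteria in Equations (8) and (9) and a per-instance version of the committee disagreement in Equation (6) can be sketched as follows. The function names are hypothetical. The QBC part assumes uniform mixing weights, uses the KL divergence as the criterion $D_{0}$, and takes the arithmetic mean of the member posteriors as the consensus, which is a simplification and not the e-mixture of [12].

```python
import math
from typing import Sequence


def binary_uncertainty_pick(p_hat: Sequence[float]) -> int:
    """Equation (8): index of the instance whose posterior is closest to 0.5."""
    return min(range(len(p_hat)), key=lambda i: abs(p_hat[i] - 0.5))


def least_confident_pick(posteriors: Sequence[Sequence[float]]) -> int:
    """Equation (9): index maximizing 1 - max_y P(y | x)."""
    return max(range(len(posteriors)), key=lambda i: 1.0 - max(posteriors[i]))


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q), smoothed by eps to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def qbc_disagreement(member_posteriors: Sequence[Sequence[float]]) -> float:
    """Equation (6) for a single instance, with uniform weights and KL as D_0."""
    n_c = len(member_posteriors)
    n_classes = len(member_posteriors[0])
    # Assumption: consensus = arithmetic mean of member posteriors.
    consensus = [
        sum(member[y] for member in member_posteriors) / n_c
        for y in range(n_classes)
    ]
    return sum(kl_divergence(member, consensus) for member in member_posteriors) / n_c
```

To select a query with the committee score, one would evaluate `qbc_disagreement` on each unlabeled instance and take the instance with the largest value.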

3. Literature Collection

Articles were retrieved from the Web of Science (WoS) and Scopus databases for the publication period from 2018 to 2025 using the keywords ‘active learning’, ‘machine learning’, and ‘deep learning’. From the initial search results, only regular articles and conference proceedings were selected, including early access publications. Abstract-only publications and poster presentations were excluded. Subsequently, duplicate articles and datasets were removed, followed by manual inspection of abstracts to ensure topical relevance. This process yielded 223 articles for final selection. This selection constitutes the primary dataset for analysis, with selected highly cited classical publications prior to 2018 included as needed (see Figure 2).

4. Modern Development of Active Learning Strategies

In this section, we analyze the AL strategies. First, we examine classical theoretical works that demonstrate improvements in label complexity under specific assumptions. Subsequently, we present a comprehensive taxonomy of AL strategies regarding domains, focusing on the selected literature. Before proceeding with the analysis, we illustrate publication trends in AL research based on our selected corpus in Figure 3.

4.1. Classical Results: Theoretical Guarantees in Active Learning

AL aims to minimize label complexity while maximizing performance, especially when labeling costs are significant, offering both theoretical and practical benefits. When noise assumptions are not relaxed, theoretical results may become inapplicable, but learning becomes impossible if noise restrictions are not imposed [19]. Some studies focus on theoretical improvements guaranteed by AL under specific noise assumptions compared to passive learning. These studies tolerate relaxation of the realizable noise assumption, which can never be realized in practice. There have been few analytical articles focusing on theoretical guarantees such as [20], which appeared earlier in the literature. In this section, we briefly summarize primary findings from the literature, especially those not covered in previous articles regarding non-realizable assumptions.

As discussed in Section 2.3, non-realizable noise assumptions are divided into five categories. Building on the realizable noise assumption, some articles propose AL algorithms that exhibit improved label complexity using excess error with respect to Bayes risk under specified noise conditions, where Bayes risk is defined as $\mathrm{err}^{*} = \min_{h \in H} \mathrm{err} (h)$, for simplicity. The agnostic noise assumption and Tsybakov’s noise assumption have been widely adopted. In [21], Tsybakov’s low noise assumption is applied, demonstrating that the convergence rate is asymptotically faster than passive learning for the coefficient $α \gtrsim 0.73$. In [18], a margin-based algorithm with membership query synthesis is adapted to Tsybakov’s low noise condition and achieves the optimal rate up to the polylogarithmic factor. Under Tsybakov’s low noise assumption, articles typically set the coefficient $α$ to around 2, indicating more noise near the decision boundary. Massart’s noise assumption shows less noise than Tsybakov’s noise condition, setting $α$ to 1. In [19], restricting noise to bounded noise conditions achieves exponential savings in label complexity compared to the realizable noise condition, revealing that label complexity depends on the coefficient $β$. The literature sometimes extends results from the realizable noise assumption to specified non-realizable noise assumptions. In [22], when the learner can abstain from prediction based on the negativity of excess risk, exponential savings in label complexity are achieved under agnostic noise assumptions equivalent to realizable noise assumptions. Additionally, this study shows that the results extend to Massart’s bounded noise assumption with rare abstention, and proves that Bernstein’s noise assumption is vacuous if $h^{*} \notin H$, although exponential savings in label complexity remain possible. Similarly, in [23], an importance-weighted AL algorithm achieves exponential savings under Bernstein’s noise assumption after relaxing the definition of the disagreement coefficient. Since Bernstein’s noise assumption is often interpreted as a weaker form of Tsybakov’s noise assumption [9], this improvement naturally aligns with previous studies. We briefly summarize the primary findings for each noise assumption in Table A1.

From a practical perspective, it is important for AL algorithms to demonstrate their performance in terms of label complexity under non-realizable noise assumptions. While some cases do not improve label complexity bounds as shown in [18], most of the literature demonstrates the superiority of AL in data efficiency.

4.2. Emerging Trends in Active Learning Research

AL provides a framework for data-efficient training. While traditional approaches have focused on theoretical label complexity based on uncertainty bounds, deep learning introduces architectural complexities, such as varying numbers of neurons and filters. Beyond merely reducing model uncertainty, deep learning models have fundamentally shifted learning environments and expanded application domains. Consequently, AL must be redesigned to align with these evolving trends. To this end, we categorize these distinctive patterns into the following four groups:

Multiple or Batch-mode Selection: Traditionally, AL involves selecting a single query instance at a time from a given dataset. However, because deep learning models typically operate on mini-batches, AL has evolved to select multiple instances simultaneously in a batch-mode fashion. This natural transition toward batch-mode AL has been extensively adopted in recent studies [24,25,26,27,28] and is discussed in detail in Section 4.3.
Transfer Learning Setting: AL is conventionally integrated into target environments as a generic framework. Deep learning models are typically pre-trained in a source domain and subsequently fine-tuned in the target domain. In this transfer learning (TL) setting, domain adaptation challenges inherently arise during the learning phase. Consequently, AL strategies must account for distributional shifts between domains. Several methodologies [16,29,30,31] developed under TL paradigms address these issues, as detailed in Section 4.4.
Multiple Query Strategy: Selecting informative query instances is pivotal in active learning. In conventional settings, instances are typically chosen from the entire dataset based on a single criterion. However, in batch-mode implementations, a single criterion may fail to identify diverse and informative samples, as examples within a batch often exhibit high homogeneity. To address this limitation, the recent literature has deployed query strategies based on multiple criteria [15,32,33]. These standard query strategies are further detailed in Section 2.4.
Extension to Diverse Applications: While AL algorithms are typically task-agnostic, their application domains have expanded significantly with the widespread adoption of deep learning models. Although many general challenges have been addressed, certain problems remain confined to domain-specific environments, necessitating tailored AL approaches. Among the various methodologies developed for specific application domains [31,34,35,36], several representative studies are discussed in Section 4.5.

Despite these methodological divergences, AL maintains its fundamental objective of minimizing annotation costs through strategic selection of informative instances, regardless of the underlying modeling paradigm.

4.3. Multiple or Batch-Mode Selection with Multi-Label Problems

Most of the directions described above stem from developments in deep learning models, particularly in solving multi-label image classification problems. The core-set approach [37] is combined with convolutional neural networks (CNNs) under the assumption that a model trained on a selected subset remains competitive for the remaining data points. The core-set is selected by computing the L2 distance between penultimate layer activations, formulating this as a k-center problem as shown in Equation (11). This involves selecting k instances as the subset $S$ from the set $L$ by minimizing the maximal distance between instances in $L \setminus S$ and their closest instances in $S$.

$\min_{S \subset L} \max_{x_{i} \in L \setminus S} \min_{x_{j} \in S} d (x_{i}, x_{j})$

(11)
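A common greedy approximation to the k-center objective in Equation (11) is farthest-first traversal, which repeatedly adds the point farthest from the centers chosen so far and is known to be a 2-approximation. The sketch below is only illustrative: the function names and the toy points are assumptions, plain Euclidean distance on raw coordinates stands in for the distance between penultimate layer activations, and it does not reproduce the bound-updating procedure of [37].

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_center_greedy(points: Sequence[Point], k: int, seed_index: int = 0) -> List[int]:
    """Farthest-first traversal: pick k center indices greedily."""
    centers = [seed_index]
    # min_dist[i] = distance from point i to its closest chosen center.
    min_dist = [math.dist(p, points[seed_index]) for p in points]
    while len(centers) < k:
        # The next center is the point currently worst covered.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return centers


def covering_radius(points: Sequence[Point], centers: Sequence[int]) -> float:
    """Value of the objective in Equation (11) for a given set of centers."""
    return max(min(math.dist(p, points[c]) for c in centers) for p in points)
```

On two tight clusters plus one outlier, for example `[(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]` with `k = 3`, the traversal places one center in each region, so the covering radius drops to the within-cluster spacing.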

The core-set approach solves this problem using an approximate greedy method by iteratively updating upper bounds on the optimal value. Sener et al. demonstrated that the upper bound of the core-set is independent of the number of labeled instances, which constitutes a desirable property for AL, particularly for addressing correlation issues that arise from batch sampling in deep AL (DAL).

For batch-mode selection, BatchBALD [27] approximates the mutual information between the model parameters of deep Bayesian networks and the sampling points within each batch using a greedy approach, selecting multiple instances with maximum mutual information. This mutual information criterion prevents the acquisition of redundant instances by considering dependencies within a batch, resulting in significant label efficiency.

BALQUE [25] alleviates the bias between accuracy and average confidence by addressing the forgetting phenomenon of deep neural networks (DNNs) during training, where a forgetting event [28] is defined as the transition of an individual instance from correctly classified to incorrectly classified. BALQUE utilizes calibrated confidence with DNNs to select query instances near the decision boundary that have a high probability of experiencing forgetting events. While traditional confidence is retrieved from the softmax function of the top layer in the DNN architecture, BALQUE defines calibrated confidence to account for confidence transitions during training by integrating confidences from each intermediate layer as shown in Equation (12), where $s_{c}^{x}$ is the c-th value of the accumulated one-hot prediction over intermediate layers throughout training for a given unlabeled instance $x$.

$p^{*} (y = c | x) = \frac{s_{c}^{x}}{\sum_{c' = 1}^{C} s_{c'}^{x}}$

(12)
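Equation (12) amounts to normalizing accumulated vote counts. The sketch below assumes that the one-hot predictions for a single instance have already been collected as a flat list of predicted class indices, one per intermediate layer and training step; how BALQUE gathers these predictions is not specified here, and the function name is hypothetical.

```python
from typing import List, Sequence


def calibrated_confidence(predicted_classes: Sequence[int], num_classes: int) -> List[float]:
    """Normalize accumulated one-hot predictions into p*(y = c | x).

    predicted_classes: class indices predicted for one instance x,
    accumulated over intermediate layers and training steps (assumed input).
    """
    s = [0] * num_classes
    for c in predicted_classes:
        s[c] += 1  # accumulating a one-hot vector is the same as counting votes
    total = sum(s)
    return [count / total for count in s]
```

An instance whose votes are split across classes receives a flatter distribution than the final-layer softmax alone might suggest, which is the behavior the calibrated confidence is meant to capture.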

ADS [38] proposes the Data Shapley value (DSv) of a point to measure its expected contribution to a classifier’s performance in AL. ADS calculates the DSvs of labeled data and trains a regressor with them to estimate the DSvs of unlabeled data points. Positive DSvs are interpreted as significant contributions to the classifier’s performance, while negative ones deteriorate its performance. Since it is infeasible to compute exact DSvs with a large number of data points, they need to be approximated in practice.

The literature has mostly focused on the pool-based framework, but VeSSAL [17] employs a stream-based AL framework. VeSSAL conducts approximate volume sampling of unlabeled instances in the gradient space computed from the top layer of deep neural networks under streaming settings. VeSSAL selects query instances according to the expected contribution of their gradients across all instances. This approach can be biased under non-i.i.d. streaming conditions, but since volume is typically computed based on covariance, it is advantageous for considering the total covariance of all samples for convergence.

Approximate class-balanced typical sampling (ACTS) [32] utilizes hierarchical classification by combining AL with a hierarchical classification (HC) framework through three techniques: hierarchical dependency representation entropy (HDRE), ACTS, and local probability suppression loss (LPSL). Using the class hierarchy, HDRE estimates global uncertainty by considering mutual influence between neighboring layers. Based on criteria using sorted global uncertainty, the ACTS component selects query instances according to the optimal query size per class. Since this approach loses label hierarchy information during training, LPSL adds constraints to the learning objective to preserve the hierarchical structure of labels. This approach interprets the multi-label problem as multiple positive branches from the HC perspective; however, the overall strategy appears overly complex.

MIRAL [24] integrates AL into a reinforcement learning (RL) paradigm. MIRAL implements AL based on a Markov decision process by leveraging an actor-critic strategy for multi-label image classification. At each state, the actor network selects the top k query instances based on an action vector computed by the learning policy that takes predictions of unlabeled instances in the state matrix. Once the selection is completed by the action, the selected samples are added to the labeled set. During this process, the reward function provides incentives to the actor network based on misclassification error. Since the policy gradient estimator can deviate due to both sampling data and advantage function approximation, this approach mitigates this issue through a clipped surrogate objective that constrains the policy update magnitude in the PPO algorithm, achieving fast convergence (these baseline approaches are detailed in Table A2).
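The clipped surrogate objective of PPO mentioned for MIRAL can be written compactly. The sketch below shows the generic PPO form only, assuming that probability ratios between the new and old policies and advantage estimates are already available; the function name and the clipping range `eps = 0.2` are illustrative choices, not values reported for MIRAL.

```python
from typing import Sequence


def clipped_surrogate(ratios: Sequence[float], advantages: Sequence[float], eps: float = 0.2) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - eps, 1 + eps] in the direction that raises the objective.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)
```

With a positive advantage, a ratio of 1.5 contributes as if it were 1.2, so a single large policy step is not rewarded beyond the clipping range.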

4.4. Active Learning Under Transfer Setting

Based on the advantages of pre-trained models, some recent studies consider the transfer learning (TL) paradigm. TL extracts knowledge from a source domain and enhances the learning performance of a model in a target domain [29]. However, the TL setting often causes a disparity problem between source and target domains, especially when the tasks are unrelated, and some AL methods directly address this problem.

TAL [30] integrates AL into a transfer learning framework for time series classification. During the fine-tuning phase with pre-trained model parameters, the learner selects query instances with the maximum product of entropy value and weighted average similarity to their neighbors from the unlabeled instance pool as a batch-mode selection criterion. While this combination of query selection criteria builds upon existing approaches, the informative measure based on average similarity considering neighboring feature values appears relatively robust when observations are noisy and temporally correlated.
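A selection score of this kind, entropy multiplied by average similarity to neighbors, might look like the sketch below. It is a hedged illustration and not the TAL implementation: cosine similarity, uniform weights over the `k` most similar neighbors, and all function names are assumptions, and the pool must contain at least two instances.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entropy_similarity_scores(probs, feats, k: int = 1) -> List[float]:
    """Score_i = entropy_i * mean similarity to the k most similar neighbors."""
    scores = []
    for i, f in enumerate(feats):
        sims = sorted((cosine(f, g) for j, g in enumerate(feats) if j != i), reverse=True)
        neighbors = sims[:k]
        # Assumption: uniform weights in the "weighted average similarity".
        scores.append(entropy(probs[i]) * sum(neighbors) / len(neighbors))
    return scores


def select_batch(probs, feats, batch_size: int, k: int = 1) -> List[int]:
    """Return the indices of the batch_size highest-scoring instances."""
    scores = entropy_similarity_scores(probs, feats, k)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:batch_size]
```

The product form means that an uncertain but isolated instance scores low, which matches the robustness to noisy observations noted above.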

DTSE [31] actively queries salient hyperspectral image (HSI) instances in both source and target domains to resolve the semantic disparity of the TL framework. This approach selects salient labeled instances regarding global and local densities according to uncertainty and representativeness principles. In particular, DTSE groups every two instances to compute the co-distance $s_{a}$ while comparing their global densities. Subsequently, it chooses instances by defining local density peaks based on $s_{a}$ and the highest global density $d_{a}$, as shown in Equation (13).

$a^{*} = \arg \max_{n_{l} < a < n} (s_{a} + λ \cdot d_{a})$

(13)

DTSE heuristically chooses instances to avoid optimizing the $λ$ value in Equation (13), while also selecting current pixels of unlabeled instances within an AL framework through a sequential augmenting process. The approach then builds auto-encoders for both source and target domains and trains the entire network by maximizing canonical correlation coefficients between corresponding layers during backpropagation.

From a similar perspective, ALFREDO [15] uses feature disentanglement techniques based on an encoder–decoder architecture with multiple criteria at the batch level. ALFREDO obtains domain-specific features containing source-specific information and task-specific features containing discriminative features from each domain by training on labeled data in the source domain and unlabeled data in the target domain. Subsequently, it selects query instances based on informative scores using a weighted average of these disentangled features according to four different criteria: uncertainty, dominance, density, and novelty. This approach utilizes the informative characteristics of observations by combining multiple criteria from a domain adaptation perspective.

Videos comprise sequential frames, which renders manual end-to-end annotation of image sequences inherently labor-intensive. To address this, the frame-level AL algorithm proposed by Goswami et al. [16] aims to classify videos while maximizing generalization capability within a transfer learning (TL) framework involving batches of both labeled and unlabeled data. This algorithm operates in two sequential phases: batch selection and frame selection. Initially, a video batch is selected by maximizing utility scores derived from both unlabeled and labeled videos. This utility is quantified using a diversity matrix $R (x_{i}, x_{j})$ and an entropy value $e (x_{i})$ for each video $x_{i}$. The selection process is formulated as the following optimization problem:

$\begin{matrix} \max_{z} \{e^{T} z + μ (z^{T} R z)\} \approx \max_{z} z^{T} Q z \\ \text{s.t. } z_{i} \in \{0, 1\}, \forall i \text{ and } \sum_{i = 1}^{| U |} z_{i} = b \end{matrix}$

(14)

where $z_{i}$ denotes the selection indicator and $μ$ is a weight parameter balancing the two objectives. In the reformulated objective function (the right side) in Equation (14), the matrix $Q$ integrates $e$ and $R$ as follows.

$Q (i, j) = \begin{cases} μ \cdot R (i, j), & \text{if } i \neq j \\ e (i), & \text{if } i = j \end{cases}$

(15)

Since the binary constraint on z renders this optimization NP-hard, an iterative truncated power algorithm is employed as an approximate solution. This algorithm updates z by multiplying the previous iteration’s value with the Q matrix and retaining only the b largest entries. It is proven to converge monotonically for a given Q [39]. Following batch selection, a core-set approach is utilized to select k representative frames from the batch, where a well-known k-center greedy algorithm provides an approximation for this second NP-hard problem. While frame-level approaches effectively mitigate the computational burden of end-to-end video annotation, the presence of ‘doubly NP-hard’ optimization remains a significant challenge. Although robust approximation methods offer clear advantages, performance may still be compromised by imbalanced batch compositions (these baseline approaches are detailed in Table A3).
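The batch-selection step in Equations (14) and (15) can be illustrated with a simplified binary variant of the truncated power iteration: multiply the current indicator by $Q$, then keep the $b$ largest entries. The function names, the initialization from the largest entropies, and the iteration cap are assumptions made for this sketch. The cap guards against cycling in this simplified variant, which does not inherit the monotone convergence guarantee of [39].

```python
from typing import List, Sequence


def build_q(entropy: Sequence[float], R: Sequence[Sequence[float]], mu: float) -> List[List[float]]:
    """Equation (15): entropies on the diagonal, mu * R off the diagonal."""
    n = len(entropy)
    return [[entropy[i] if i == j else mu * R[i][j] for j in range(n)] for i in range(n)]


def truncated_power_select(Q: Sequence[Sequence[float]], b: int, max_iters: int = 50) -> List[int]:
    """Approximately maximize z^T Q z subject to z binary with sum(z) = b."""
    n = len(Q)
    # Assumption: start from the b instances with the largest diagonal entries.
    start = sorted(range(n), key=lambda i: Q[i][i], reverse=True)[:b]
    z = [1 if i in start else 0 for i in range(n)]
    for _ in range(max_iters):
        scores = [sum(Q[i][j] * z[j] for j in range(n)) for i in range(n)]
        top = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:b])
        new_z = [1 if i in top else 0 for i in range(n)]
        if new_z == z:  # fixed point reached
            break
        z = new_z
    return [i for i in range(n) if z[i] == 1]
```

In a four-video toy case where two high-entropy videos are nearly identical, the iteration replaces one of them with a lower-entropy but more diverse video, which is the trade-off that $μ$ controls.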

4.5. Extension to Diverse Applications

Domain-specific research across diverse fields has increasingly incorporated AL methodologies. In this section, we review AL approaches based on the primary categories of recent applications, and also briefly introduce newly emerging application domains (see Figure 4).

4.5.1. Object Detection

Compared to classification, object detection (OD) requires substantially higher annotation costs. Furthermore, imperfect samples such as those with missing annotations, incorrectly labeled bounding boxes, or noisy data are prevalent in real-world datasets. ASSL [40] reduces the miss-rate by mitigating sampling bias regarding imperfect samples in the training set based on AL principles. ASSL selects uncertain samples using confidence bounds after training the detector with a well-defined confident set. Subsequently, it chooses a set of diverse samples based on clustering from the uncertain samples and selects query instances with the maximum Euclidean distance to the instances in the confidence set. ODBS [26] focuses on localization, also referred to as distance estimation, by integrating AL into 3D object detection problems. ODBS assigns a single score to each image by integrating two uncertainty criteria that consider localization and diversity aspects after obtaining a center location probability map based on CenterNet. The first uncertainty is defined as the variance of an infinite sum of Gaussian distributions, which is optimized by least squares errors after obtaining multiple peaks from the heatmap in converted polar coordinates. Similarly, the second uncertainty is obtained by Gaussian variance from a single combined heatmap after obtaining the pixel-wise maximum of eight altered images through augmentation and one original image sample. This uncertainty score is multiplied by the diversity score based on Euclidean distance, resulting in a single image score, and the sample with the highest image score is selected as a query instance. ODBS maintains a fixed training set size by discarding an equivalent number of samples that have lower image scores from the training set, unlike traditional pool-based AL. In the same task, Liang et al. [41] propose spatial and temporal diversity objectives for selecting query images.
Following [37], the spatial diversity objective is formulated as a k-center optimization problem. Based on samples at specific locations obtained through synchronized multi-modal sensor data, the spatial distance between two points is defined as the shortest path computed using Dijkstra’s algorithm. Subsequently, temporal diversity, defined as the absolute difference between timestamps of two given points along with their embedded features, complements the spatial diversity. These three diversity values are combined through a weighted sum after rescaling normalization for querying. Arthur et al. [34] propose iterative sampling strategies that consider class imbalance in vehicle detection problems using a YOLO detector within an AL framework. After obtaining embedding features through Vision Transformer (ViT), this approach reduces the output vector to two-dimensional vectors using t-SNE, then passes informative images to YOLOv8 by computing image scores based on a class imbalance score reflecting the variety of object classes in an image and a model uncertainty score based on the confidence score assigned to each bounding box in an image. Unlike the typical approach, this approach computes the model uncertainty score by iteratively repeating the computation of cosine similarity between the embedding of the training set and that of the test set until convergence is achieved. GMU-CS [14] utilizes collaborative sampling (CS) based on uncertainty measured by the GMU score to select high-value aerial images, given that aerial images typically exhibit dense visual interference. The candidate samples are first chosen as multiple images with higher GMU scores, which represent the product of the top three entropy values of predicted objects in an image and their marginal values. 
After training a main model and an auxiliary model separately to detect the chosen candidate samples, the method selects the top k candidate images that have the highest uncertainty scores based on the difference between the two predicted results for bounding boxes in each image. Instance-aware uncertainty (IAU) [42] also proposes uncertainty integration for the OD problem. Based on localized objects obtained from the detector, IAU defines the uncertainty score of an image as the highest uncertainty score among localized objects, where the uncertainty score is quantified as a weighted average of positional, dimensional, and categorical uncertainties. The positional and dimensional uncertainties are quantified as standard deviations of the center coordinates of bounding boxes and height–width dimensions, respectively. The categorical uncertainty is measured as entropy values. IAU demonstrates relatively straightforward implementation.
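The IAU image score described above (a weighted average of positional, dimensional, and categorical uncertainty per object, followed by a maximum over objects) can be sketched as follows. The function name `iau_image_score`, the equal default weights, and the assumption that the per-object standard deviations are already available are illustrative choices, not details taken from [42].

```python
import numpy as np

def iau_image_score(center_std, size_std, class_probs, weights=(1/3, 1/3, 1/3)):
    """Image-level uncertainty as the maximum per-object weighted uncertainty.

    center_std  : (n_objects,) std of predicted box centers (positional)
    size_std    : (n_objects,) std of predicted height/width (dimensional)
    class_probs : (n_objects, n_classes) predicted class probabilities
    weights     : assumed weights for the three components (sum to 1)
    """
    center_std = np.asarray(center_std, dtype=float)
    size_std = np.asarray(size_std, dtype=float)
    if center_std.size == 0:
        return 0.0                                    # no detected objects
    probs = np.clip(np.asarray(class_probs, dtype=float), 1e-12, 1.0)
    entropy = -(probs * np.log(probs)).sum(axis=1)    # categorical uncertainty
    w_pos, w_dim, w_cat = weights
    per_object = w_pos * center_std + w_dim * size_std + w_cat * entropy
    return float(per_object.max())                    # most uncertain object

# Two objects: the first has maximally uncertain class probabilities.
score = iau_image_score(
    center_std=[0.0, 0.0],
    size_std=[0.0, 0.0],
    class_probs=[[0.5, 0.5], [1.0, 0.0]],
)
```

Images would then be ranked by this score, and the highest-scoring ones sent to the oracle.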

OD problems are accompanied by noisy and complex environments, often involving dense data from multi-modal sensors, so achieving satisfactory performance remains challenging. Nevertheless, these studies empirically demonstrate performance improvements through the integration of AL strategies.

4.5.2. Biomedical Data Classification

Biomedical data consistently suffers from a limited number of annotations due to high annotation costs. The recent literature demonstrates the effectiveness of AL across various tasks in biomedical domains.

bioRE [36] compares six AL techniques for extracting relations among different entities based on PubMedBERT networks [43], and recommends margin-based and uncertainty-based strategies. MedAL [44] proposes AL for medical image classification. MedAL selects samples with maximum predictive entropy based on feature descriptors from CNNs, utilizing the optimal distance function in terms of information gain to derive the feature descriptors. AD [45] proposes multi-factor calculations to retrieve high-quality data for medical image classification. AD divides an image into multiple subregions and predicts the image class as the subregion class with maximum average confidence. For each image, a multi-factor score is computed based on rescaled entropy, Jensen–Shannon (JS) divergence, and Tamura factors from all subregions within the image. The multi-factor scores are sorted in descending order, and images with higher scores are selected for querying according to a predefined threshold. MDAL [33] integrates AL to reduce annotation costs for multi-modal image classification. MDAL quantifies multi-modality differences as sample-wise pointwise mutual information (PMI) through a contrastive learning framework, based on the hypothesis that larger multi-modality differences indicate more informative samples. After image augmentation, features are extracted by minimizing the InfoNCE loss function, which is equivalent to maximizing the lower bound of mutual information (MI) [46]. This results in straightforward computation of MI based on exponential cosine similarity, from which the sample-wise PMI values are derived. MDAL selects multiple query instances with the largest average PMI for each sample after calculating PMIs across all modality pairs. Additionally, following the diversity principle, it employs k-Center Greedy [37] to select multiple samples using a vector representation created by concatenating the PMIs from different modality pairs.
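The k-Center Greedy selection mentioned above can be sketched as follows. This is a generic version of the greedy 2-approximation, assuming Euclidean distance and a non-empty labeled set; the function name `k_center_greedy` and the toy data are hypothetical, and in MDAL the feature vectors would be the concatenated PMI values.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, k):
    """Greedily pick k points that are farthest from the current set of centers."""
    features = np.asarray(features, dtype=float)
    centers = features[list(labeled_idx)]
    # Distance from every point to its nearest already-labeled point.
    diffs = features[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2).min(axis=1)
    selected = []
    for _ in range(k):
        i = int(np.argmax(dists))         # farthest point becomes a new center
        selected.append(i)
        new_d = np.linalg.norm(features - features[i], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-center distances
    return selected

# One-dimensional toy pool; index 0 is already labeled.
pool = np.array([[0.0], [1.0], [10.0], [11.0], [5.0]])
picked = k_center_greedy(pool, labeled_idx=[0], k=2)
```

The first pick is the point farthest from the labeled instance, and the second fills the largest remaining gap, which is the diversity behavior the selection step relies on.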

M-VAAL [47] presents a task-agnostic approach using a variational auto-encoder (VAE) with adversarial learning to exploit multi-modal image data. Initially, task networks such as ResNet or UNet are trained on the labeled dataset. Independently, M-VAAL is trained in an adversarial manner on both labeled and unlabeled data from the first modality to generate latent representations using an encoder, which then reconstructs both the first modality image and the second modality image from the learned latent representation. Simultaneously, a discriminator is trained to distinguish whether first modality images belong to the labeled set or not. While the VAE and discriminator are trained by minimizing the Wasserstein GAN loss function with gradient penalty, samples that the discriminator predicts as belonging to the unlabeled set are selected for querying. These selected samples are then added to retrain the task networks. M-VAAL utilizes auxiliary information from the second modality in a manner opposite to [33]. Given that multi-modal data are essential across various domains, including biomedical applications, quantifying information from different modalities to identify informative samples remains challenging, although these studies demonstrate the effectiveness of AL approaches.

4.5.3. Entity Recognition for Network Security

Named entity recognition (NER) aims to identify and locate named entities within a text corpus, a process that generally requires large-scale annotated data. Recently, a few studies have applied this task to the cybersecurity domain. To address data scarcity, Dual Dimension Diversity Sampling (DDDS) [48] integrates an AL framework with bidirectional Long Short-Term Memory (Bi-LSTM) networks. DDDS combines uncertainty and diversity sampling by calculating a dual diversity score (the sum of internal and external similarity scores) to select instances based on both standardized cosine similarity and posterior probability distributions. While alternating between these two sampling strategies at each iteration effectively leverages both labeled and unlabeled data, the method may produce inconsistent performance based on the sampling distribution. Separately, the Dynamic Attention-based BiLSTM-LSTM (DA-BiLSTM-LSTM) [49] model employs AL to identify named entities within a Generative Adversarial Network (GAN) framework. It converts cybersecurity text into embedding feature maps for a Bi-LSTM encoder. After capturing contextual information, a self-attention mechanism learns dependencies between tokens, and an LSTM decoder increases tagging speed. The encoder predicts features of labeled sequences while deceiving a discriminator trained on both labeled and unlabeled data. The discriminator’s output provides similarity scores, enabling AL by identifying unlabeled samples that closely align with labeled data.

4.5.4. Transformation

Unlike classification tasks, ALLG [35] serves as a core module for learning non-linear transformations in unsupervised settings. After an auto-encoder maps instances into latent space, ALLG learns optimal graph structures using adjacency matrices for both representation learning and instance selection. During this process, k-nearest neighbors provide an a priori graph structure. ALLG captures representational changes across network layers through adjacency matrix propagation, where different learning layers are connected via shortcuts to prevent over-smoothing. Representative instances are selected through a dedicated selection mechanism. This approach effectively captures evolving feature patterns within network structures, addressing a key challenge in DAL.

4.5.5. Exploration and Mapping

DA-SLAM [50] employs an uncertainty-based agent for autonomous exploration and mapping within a deep RL framework. After constructing mapping information online using SLAM algorithms, the uncertainty-based agent identifies optimal paths to maximize rewards using an observation space constrained to five LiDAR measurements. The agent operates through episodes that continue via the PPO algorithm within a Markov decision process framework until collision occurs. During exploration, the agent selects paths that maintain low pose uncertainty as its primary action strategy. Since this uncertainty-based approach prioritizes decisions that reduce uncertainty, it proves particularly useful for larger and more complex environments, though the generated paths may be longer than those from completeness-based agents.

4.5.6. Facial Age Estimation

While DAL is more commonly used for classification tasks, DALRel [51] applies it to facial age estimation from images using CNNs within a regression framework. Unlike traditional DAL approaches, the learner receives only relative labels for query instances. During the training phase, DALRel minimizes the expected model output change (EMOC) loss, denoted as $Δ f (X^{'})$ and formulated in Equation (16).

Δ f (X^{'}) = \sum_{x \in X^{'}} \mathbb{E}_{x} {\left\| \nabla_{θ} f {(x; θ)}^{T} \nabla_{θ} L (θ; (x, {\bar{y}}^{'})) \right\|}_{1}

(16)

where $L$ is the loss function, $x$ is an unlabeled instance, and ${\bar{y}}^{'}$ is the most likely label inferred by the model $f$. Computing the gradient exactly would require marginalizing over all possible values of the unknown labels, which is computationally impractical; therefore, all instances in a set are assumed to share the label assigned by the maximum likelihood estimator. Once the CNN-based model is trained, EMOC scores are computed for all unlabeled instances based on the first-order approximation of Taylor expansion for the gradient of the objective function with respect to candidate instances, and the learner selects the k instances with the highest EMOC scores. This approach reduces the computational burden for the selection criterion since the model is trained by minimizing EMOC loss. However, AL within the regression paradigm remains relatively scarce in the literature.
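To make the EMOC score in Equation (16) concrete, the sketch below evaluates it for a simple logistic model rather than the CNN used by DALRel. The model choice, the cross-entropy loss, the use of a reference set to approximate the expectation, and the names `emoc_scores` and `select_top_k` are all assumptions made for illustration; the relative-label mechanism of DALRel is not modeled.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def emoc_scores(theta, X_ref, X_cand):
    """EMOC-style score per candidate for f(x; theta) = sigmoid(theta^T x).

    The expectation over x is approximated by the mean over X_ref.
    """
    p_ref = sigmoid(X_ref @ theta)
    grad_f = (p_ref * (1.0 - p_ref))[:, None] * X_ref   # d f / d theta, (n_ref, d)
    p_cand = sigmoid(X_cand @ theta)
    y_hat = (p_cand >= 0.5).astype(float)               # most likely label
    grad_L = (p_cand - y_hat)[:, None] * X_cand         # d L / d theta, (n_cand, d)
    change = np.abs(grad_f @ grad_L.T)                  # |grad_f^T grad_L|
    return change.mean(axis=0)                          # average over reference set

def select_top_k(scores, k):
    """Indices of the k candidates with the highest scores."""
    return np.argsort(-scores)[:k]

theta = np.array([1.0])
X_ref = np.array([[1.0]])
X_cand = np.array([[0.5], [5.0]])   # first candidate lies nearer the boundary
scores = emoc_scores(theta, X_ref, X_cand)
chosen = select_top_k(scores, k=1)
```

In this toy setting the candidate nearer the decision boundary receives the larger score, since labeling it would move the model output more.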

4.5.7. Demonstration

ARLD [52] is a streamlined RL framework based on active deep Q-network (ADQN), in which the agent can query for demonstrations. During training, ADQN iteratively computes uncertainty based on observed states using two types of DQN: bootstrapped DQN and noisy DQN. The bootstrapped DQN consists of multiple function heads that approximate the distribution over Q-values, with uncertainty measured as the entropy divergence between heads after obtaining the policy distribution. The noisy DQN estimates predictive variance as an exploration policy. The largest Q-value variance across actions serves as the uncertainty measure since action variance reflects the model’s confidence level. One challenge in streamlined RL is determining when to query using a given criterion, and ARLD addresses this by applying an adaptive threshold based on uncertainty measures derived from the policy distribution.

5. Underexplored Opportunity: Imbalanced Class Distribution

Machine learning algorithms typically assume balanced class distributions, which are uncommon in practice. Imbalanced class distributions hamper learning performance and also affect AL performance. Because AL focuses on selecting more informative samples, it can leverage this characteristic to address imbalance in the acquisition process. Despite this capability, relatively little literature addresses the class imbalance problem in AL.

Regarding the class imbalance problem, traditional approaches have adopted over- or under-sampling techniques, and AL-SVMSMOTE-DBN [53] integrates an over-sampling technique into AL. AL-SVMSMOTE-DBN is an AL framework based on deep belief networks (DBNs) designed to address imbalanced class distributions in multi-label classification. It applies the support vector machine (SVM)-based synthetic minority over-sampling technique (SMOTE) to the training set prior to training, and generates an optimizing set using this resampling approach. Once the DBN is trained, the learner selects the top k query instances from the optimizing set by solving a quadratic programming (QP) problem based on two criteria as shown in Equation (5): (a) diversity $g (r_{i}) = \sum_{i = 1}^{n} \sum_{j = 1}^{n} r_{i} \cdot r_{j} \cdot K_{i j}$, where $r_{i}$ and $r_{j}$ are rankings of instances $x_{i}$ and $x_{j}$, respectively, and $K_{i j}$ is an RBF-kernel-based similarity; and (b) uncertainty $I_{i} = (P_{θ} (y_{m} | x_{i}) - P_{θ} (y_{n} | x_{i}))$, where $y_{m}$ and $y_{n}$ are the top two most likely labels.

\begin{matrix} \min_{r_{i}} \sum_{i = 1}^{n} r_{i} \cdot I_{i} + g (r_{i}) \\ \text{s.t. } \sum_{i = 1}^{n} r_{i} = 1, \; r_{i} \geq 0 \end{matrix}

AL-SVMSMOTE-DBN employs a batch-based query strategy with a training set enriched by SMOTE to address the class imbalance problem. However, SMOTE may not always accurately represent minority class distributions [54], potentially degrading classification performance.
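The two ingredients of the QP above, the top-two margin $I_{i}$ and the kernel-weighted diversity term, can be evaluated as in the following sketch. It only computes the objective for a given ranking vector and does not solve the QP; the function names `top_two_margin` and `qp_objective` and the identity kernel in the example are assumptions.

```python
import numpy as np

def top_two_margin(probs):
    """I_i = P(y_m | x_i) - P(y_n | x_i) for the two most likely labels."""
    probs = np.asarray(probs, dtype=float)
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest per row, ascending
    return top2[:, 1] - top2[:, 0]

def qp_objective(r, margins, K):
    """sum_i r_i * I_i + sum_ij r_i * r_j * K_ij for a ranking vector r."""
    r = np.asarray(r, dtype=float)
    return float(r @ margins + r @ K @ r)

probs = np.array([[0.5, 0.4, 0.1],     # small margin: uncertain instance
                  [0.9, 0.05, 0.05]])  # large margin: confident instance
margins = top_two_margin(probs)
K = np.eye(2)                          # assumed similarity matrix for the example
value = qp_objective([0.5, 0.5], margins, K)
```

A small margin marks an uncertain instance, so minimizing the objective favors such instances while the kernel term discourages placing weight on mutually similar ones.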

InvProp [55], Aggarwal et al. (2022) [56], and CBAL [57] provide generalized frameworks that can be integrated into other algorithms. InvProp [55] is a parameter-free weighting scheme designed to address class imbalance. InvProp generates a probability distribution such that the random variable is uniformly distributed between 0 and 1 across unlabeled instances. The weights are determined as the inverse of preference scores by evenly partitioning the confidence score distribution into a fixed number of bins. InvProp is simple to integrate into other algorithms and improves performance under mild class imbalance, but shows limited effectiveness for severely imbalanced cases. Aggarwal et al. [56] propose minority-oriented sampling acquisition functions (AFs) based on certainty, uncertainty, and diversity for AL. The initial labeled set is created by annotating a randomly selected subset of unlabeled instances. Subsequently, a shallow classifier is iteratively fine-tuned using a pre-trained model or trained with feature embeddings after obtaining predictions for unlabeled data within a transfer learning paradigm. The AF then selects query instances based on certainty, uncertainty, or diversity principles. Aggarwal et al.’s approach integrates transfer learning with AL to address class imbalance problems when transferable features are available for improved performance. CBAL [57] is a general entropy-based optimization framework designed to solve the class imbalance problem for classification. CBAL focuses on the KCenterGreedy problem [58] to find b samples with a maximum distance from their nearest labeled samples while maintaining class balance, and minimizes the informative and balancing objectives as shown in Equation (17), where $P$ is the distance matrix between the outputs of labeled and unlabeled instances from the top layers of CNNs, $Ω (c)$ denotes the number of required instances in each class, and $z$ is a binary indicator vector whose entry $z_{i}$ marks whether the $i$-th instance is selected.

\begin{matrix} \min_{z} z^{T} (P \odot \log (P)) 1_{C \times 1} + λ {\left\| Ω (c) - P^{T} z \right\|}_{1} \\ \text{s.t. } z^{T} 1_{N \times 1} = b, \; z_{i} \in \{0, 1\}, \; \forall i = 1, 2, \dots, N \end{matrix}

(17)

By solving Equation (17), the learner obtains query instances with a maximum distance from the existing labeled instances. CBAL demonstrates consistent performance across a relatively wide range of imbalance ratios, from 1 to 0.1, and is advantageous for integrating greedy balancing algorithms into the selection process within the AL framework.
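The objective in Equation (17) can be evaluated for a candidate selection as in the sketch below. It assumes that $P$ is an N×C matrix with non-negative, row-normalized entries (so that $P \odot \log P$ is well defined) and treats $Ω (c)$ as a length-C vector; the function name `cbal_objective` and the toy values are hypothetical, and the sketch only scores a given $z$ rather than solving the constrained problem.

```python
import numpy as np

def cbal_objective(z, P, omega, lam):
    """z^T (P * log P) 1 + lam * || omega - P^T z ||_1 for a binary selection z."""
    z = np.asarray(z, dtype=float)
    P = np.asarray(P, dtype=float)
    safe_P = np.clip(P, 1e-12, 1.0)                 # avoid log(0)
    informative = z @ (safe_P * np.log(safe_P)).sum(axis=1)   # negative entropy
    balance = np.abs(np.asarray(omega, dtype=float) - P.T @ z).sum()
    return float(informative + lam * balance)

P = np.array([[0.5, 0.5],    # uncertain instance
              [1.0, 0.0]])   # confident instance
omega = np.array([0.5, 0.5]) # assumed per-class target for the example
uncertain = cbal_objective([1, 0], P, omega, lam=1.0)
confident = cbal_objective([0, 1], P, omega, lam=1.0)
```

Selecting the uncertain instance gives the lower objective here, because the first term rewards high entropy and the second term penalizes deviation from the per-class target.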

Based on the DAL framework, a diversity-based or a combined criterion is employed since uncertainty-based approaches are limited in considering the data distribution. BAL [59] employs a diversity criterion based on stratified unequal sampling, noting that uncertainty-based criteria fail to fully utilize the data distribution. BAL generates a probability vector based on feature mapping using a variational auto-encoder (VAE), and estimates tail probabilities with copulas, which are joint probability distribution functions $C$ defined as $F (x) = C (F (x_{1}), \dots, F (x_{d}))$ for a d-dimensional random variable $x = (x_{1}, \dots, x_{d})$ with marginal distributions $F (x_{1}), \dots, F (x_{d})$ [60].

The probability of the minority class is determined as the maximum tail probability among left tail, right tail, and skew-corrected tail probabilities after fitting the copulas to empirical cumulative distribution functions (ECDFs). After running the k-means algorithm, the learner selects $\lfloor \frac{k}{m} \rfloor$ instances at a time from each cluster based on the normalized probabilities of instances belonging to the minority class, where m denotes the number of labeled instances. BAL is advantageous in fully utilizing the data distribution, but copulas are computationally intensive.

6. Discussion

With the rapid adoption of deep learning, AL has naturally evolved to align with deep learning environments. This study provides a comprehensive examination of AL approaches based on the literature published from 2018 to May 2025. The emerging patterns in AL can be summarized as follows:

Batch-mode Selection: Modern AL approaches frequently employ multiple or batch-mode selection. While this accelerates data acquisition, samples within a single batch often exhibit similar characteristics. This necessitates more sophisticated query strategies to maintain sample efficiency.
Transfer Learning Integration: AL is increasingly applied within transfer learning frameworks. However, performance may degrade due to the domain gap between source and target data, requiring strategies that offer better generalization for unseen tasks.
Multiple or Hybrid Query Strategies: There is a growing trend toward using multiple query strategies simultaneously. Since uncertainty-based methods alone may select redundant samples in a batch, they are often paired with representativeness-based strategies (e.g., diversity) to prioritize heterogeneous samples. While effective, this approach increases computational overhead and risks introducing redundant information or sensitivity issues if the criteria overlap excessively.
Domain Expansion: AL has expanded into diverse application domains, proving particularly beneficial where annotation costs are prohibitive. Nevertheless, highly specialized fields continue to demand more tailored AL architectures.

Although AL has demonstrated its superiority in numerous studies, its evaluation remains somewhat limited as research often lacks consistent protocols for comprehensive comparison. Furthermore, we aim to share insights on a relatively underexplored area: the class imbalance problem. As widely recognized, class imbalance not only hinders overall model performance but also significantly deteriorates the decision-making process within AL. While some studies suggest AL’s potential to address this issue, to the best of our knowledge, this is the first study to specifically examine the capability of AL in mitigating class imbalance.

Author Contributions

Conceptualization, B.A.K. and K.K.; methodology, B.A.K.; writing—original draft preparation, B.A.K. and K.K.; writing—review and editing, B.A.K. and K.K.; visualization, B.A.K.; funding acquisition, B.A.K. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the research fund of Hanyang University ERICA (HY-2021000000001819). It was also partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00431388, the Global Research Support Program in the Digital Field program).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AL	Active Learning
DAL	Deep Active Learning
DNNs	Deep Neural Networks
CNNs	Convolutional Neural Networks

Appendix A. Appendix Tables

Table A1. Theoretical guarantees under different noise assumptions.

Noise	Setting	Query Strategy	Primary Findings	Reference
Realizable	pool	Greedy	Label complexity of greedy is $O (\ln | H |)$ times that of any other strategy	[61]
Benign	pool	Uncertainty	Both the upper and lower bounds of the expected excess error of active learning are proportional to $O (1 / n)$, as in passive learning	[62]
Agnostic	stream	Disagreement	A fast convergence rate based on the disagreement coefficient	[63]
	pool	Disagreement	Exponential savings using the negativity of the excess risk, also adapted to Massart’s condition	[22]
Tsybakov	stream	Margin	The convergence rate of active learning is faster than passive learning for $α > \sqrt{3} - 1 \approx 0.73$	[21]
	stream	Plug-in	Proposes the minimax lower bound for the excess risk; the optimal rate is attained by shrinking confidence bands	[64]
	stream	Synthesis	Achieves the optimal rate up to a polylogarithmic factor	[18]
Massart’s	pool	Adaptive	Exponential savings in label complexity of $Ω (\frac{β^{2}}{ϵ^{2}} \cdot \ln \frac{1}{δ})$ compared to the realizable noise condition	[19]
Bernstein’s	stream	Disagreement	Exponential savings in label complexity compared to passive learning	[23]

Table A2. Under batch-mode setting.

Methods	Query-Strategy	Main Contribution	Datasets	Ref.
MIRAL	EER	Selects top k query instances	VOC2007, MS-COCO	[24]
ACTS	Uncertainty	Global uncertainty criteria in HDRE	CIFAR10/100	[32]
BALQUE	Entropy	Calibrated confidence for k instances	CIFAR10/100, SVHN	[25]
DDDS	Uncert. + Div.	Dual diversity or least confidence	CoNLL-2003	[48]
ALLG	Representative	Optimal graph structures	GSAD, Waveform	[35]
VeSSAL	Uncert. + Div.	Expected detrimental contribution	MNIST, CIFAR10	[17]
ADS	Data Shapley	Points with high DSvs	CIFAR10/100, SVHN	[38]
DALRel	Exp. Model Change	Highest EMOC scores for k instances	IMDB, Wiki	[51]
BatchBALD	Mutual Info.	Mutual info between model and data	CINIC-10, MNIST	[27]
Core-set	Representative	$L^{2}$ distance to solve k-center	CIFAR10/100, SVHN	[37]

Table A3. Under transfer learning setting.

Methods	Query-Strategy	Main Contribution	Datasets	Ref.
TAL	Uncert. + Repre.	Uses entropy and weighted average similarity for time series classification	RAUS, MeteoNet, KenCentralMet	[30]
Goswami’s	Uncert. + Div.	Solves iterative quadratic programming to select videos and k frames	UCF-101, Kinetics, ImageNet	[16]
DTSE	Uncert. + Repre.	Selects salient instances in source and target domains for HSI fine-tuning	Pavia, Urban, etc.	[31]
ALFREDO	Uncert. + Repre.	Weighted sum of four criteria after obtaining disentangled features	CAMELYON17, NIH Chest Xray14	[15]

Table A4. Applications for diverse domains.

Methods	Query-Strategy	Main Contribution	Datasets	Ref.
ASSL	Uncert. + Div.	Euclidean distance between sets for Image OD	MS COCO, ILSVRC	[40]
ODBS	Uncert. + Div.	Gaussian variances from heatmaps for Autonomous Driving	KITTI (vehicle)	[26]
Lin’s	Diversity	Spatial/temporal diversity for Autonomous Driving	VoxelNet, BEVFusion	[41]
Arthur’s	Uncert. + Imb.	Cosine similarity of embeddings for Vehicle Detection	DOT CCTV, MIO-TCD	[34]
GMU-CS	Uncertainty	Collaborative sampling for Aerial Image OD	VisDrone2019, DOTA-1.5	[14]
IAU	Uncertainty	Positional and categorical components for Image OD	MS COCO, Pascal VOC	[42]
bioRE	Uncert., Margin	Six query strategies for Bio-relation Extraction	AIMED, BioRED, CDR	[36]
MedAL	Uncert. + Repre.	Maximizes average distance for Medical Image Class	Messidor, Breast Cancer	[44]
AD	Uncert. + Repre.	Subregion probability prediction for Medical Image Class	HQDS	[45]
MDAL	Uncert. + Repre.	Mutual info and diversity for Multi-modal Image	Brain glioma, Ovarian cancer	[33]
M-VAAL	Uncert. + Repre.	Predicted unlabeled set via discriminator for Multi-modal	BraTS2018, COVID-QU-Ex	[47]

References

Baum, E.; Haussler, D. What Size Net Gives Valid Generalization? In Advances in Neural Information Processing Systems; Morgan Kaufmann: San Francisco, CA, USA, 1989. [Google Scholar]
Angluin, D. Queries and Concept Learning. Mach. Learn. 1988, 2, 319–342. [Google Scholar] [CrossRef]
Settles, B. Active Learning Literature Survey; Computer Science Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009; pp. 1–47. [Google Scholar]
Lughofer, E. On-line active learning: A new paradigm to improve practical usability of data stream modeling methods. Inf. Sci. 2017, 415, 356–376. [Google Scholar] [CrossRef]
Tsybakov, A. Optimal aggregation of classifiers in statistical learning. Ann. Stat. 2004, 32, 135–166. [Google Scholar] [CrossRef]
Mammen, E.; Tsybakov, A.B. Smooth discrimination analysis. Ann. Stat. 1999, 27, 1808–1829. [Google Scholar] [CrossRef]
Massart, P.; Nédélec, E. Risk bounds for statistical learning. Ann. Stat. 2006, 34, 2326–2366. [Google Scholar] [CrossRef]
Bernstein, S. The Theory of Probabilities; Gastehizdat Publishing House: Moscow, Russia, 1946. [Google Scholar]
Bartlett, P.; Mendelson, S. Empirical Minimization. Probab. Theory Relat. Fields 2006, 135, 311–334. [Google Scholar] [CrossRef]
Beygelzimer, A.; Hsu, D.; Langford, J.; Zhang, T. Agnostic Active Learning without Constraints. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010. [Google Scholar]
McCallum, A.; Nigam, K. Employing EM and Pool-Based Active Learning for Text Classification. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; pp. 350–358. [Google Scholar]
Takano, K.; Hino, H.; Akaho, S.; Murata, N. Nonparametric e-mixture estimation. Neural Comput. 2016, 28, 2687–2725. [Google Scholar] [CrossRef] [PubMed]
Seung, H.S.; Opper, M.; Sompolinsky, H. Query by Committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 287–294. [Google Scholar]
Zhang, T.; Oles, F. A probability analysis on the value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 1191–1198. [Google Scholar]
Mahapatra, D.; Tennakoon, R.; George, Y.; Roy, S.; Bozorgtabar, B.; Ge, Z.; Reyes, M. ALFREDO: Active learning with feature disentanglement and domain adaptation for medical image classification. Med. Image Anal. 2024, 97, 103261.
Goswami, D.; Chakraborty, S. Active Learning for Video Classification with Frame-Level Queries. arXiv 2023, arXiv:2307.05587.
Saran, A.; Yousefi, S.; Krishnamurthy, A.; Langford, J.; Ash, J. Streaming Active Learning with Deep Neural Networks. arXiv 2023, arXiv:2303.02535.
Wang, Y.; Singh, A. Noise-Adaptive Margin-Based Active Learning and Lower Bounds under Tsybakov Noise Condition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, AZ, USA, 12–17 February 2016.
Kaariainen, M. Active Learning in the Non-Realizable Case. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, Barcelona, Spain, 7–10 October 2006; pp. 63–77.
Hanneke, S. Rates of Convergence in Active Learning. Ann. Stat. 2011, 39, 333–361.
Cavallanti, G.; Cesa-Bianchi, N.; Gentile, C. Linear Classification and Selective Sampling under Low Noise Conditions. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 249–256.
Puchkin, N.; Zhivotovskiy, N. Exponential Savings in Agnostic Active Learning Through Abstention. IEEE Trans. Inf. Theory 2021, 68, 4651–4665.
Shayestehmanesh, H. Active Learning Under the Bernstein Condition for General Losses. Master’s Thesis, University of Victoria, Victoria, BC, Canada, 2020.
Cai, Q.; Tao, R.; Fang, X.; Xie, X.; Liu, G. A deep reinforcement active learning method for multi-label image classification. Comput. Vis. Image Underst. 2025, 257, 104351.
Han, Y.; Liu, D.; Shang, J.; Zheng, L.; Zhong, J.; Cao, W.; Sun, H.; Xie, W. BALQUE: Batch Active Learning by Querying Unstable Examples with Calibrated Confidence. Pattern Recognit. 2024, 151, 110385.
Hekimoglu, A.; Schmidt, M.; Marcos-Ramiro, M.; Rigoll, G. Efficient Active Learning Strategies for Monocular 3D Object Detection. In Proceedings of the IEEE Intelligent Vehicles Symposium, Aachen, Germany, 4–9 June 2022; pp. 295–302.
Kirsch, A.; van Amersfoort, J.; Gal, Y. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning. In Proceedings of the Thirty-Third Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1–12.
Toneva, M.; Sordoni, A.; Tachet des Combes, R.; Trischler, A.; Bengio, Y.; Gordon, G.J. An empirical study of example forgetting during deep neural network learning. arXiv 2019, arXiv:1812.05159.
Cook, D.; Feuz, K.; Krishnan, N. Transfer Learning for Activity Recognition: A Survey. Knowl. Inf. Syst. 2013, 36, 537–556.
Gikunda, P.; Jouandeau, N. Homogeneous Transfer Active Learning for Time Series Classification. In Proceedings of the IEEE International Conference on Machine Learning and Applications, Pasadena, CA, USA, 13–16 December 2021; pp. 1–7.
Lin, J.; Zhao, L.; Li, S.; Ward, R.; Wang, Z. Active-Learning-Incorporated Deep Transfer Learning for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4048–4062.
Wu, Y.; Wang, M.; Fan, M.; Wang, Q.; Zhang, Z.; Zhang, H.; Zhou, X. Deep Active Learning for Image Hierarchical Classification by Introducing Dependencies and Constraints Between Classes. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 4396–4409.
Wang, H.; Jin, Q.; Du, X.; Wang, L.; Guo, Q.; Li, H.; Wang, M.; Song, Z. MDAL: Modality-difference-based active learning for multimodal medical image analysis via contrastive learning and pointwise mutual information. Comput. Med. Imaging Graph. 2025, 123, 102544.
Arthur, E.; Muturi, T.; Adu-Gyamfi, Y. Training Vehicle Detection and Classification Models with Less Data: An Active Learning Approach. Transp. Res. Rec. 2024, 2678, 2146–2164.
Ma, H.; Li, C.; Shi, X.; Yuan, Y.; Wang, G. Deep Unsupervised Active Learning on Learnable Graphs. IEEE Trans. Neural Netw. Learn. Syst. 2021, 35, 2894–2900.
Nachtegael, C.; De Stefani, J.; Lenaerts, T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS ONE 2023, 18, e0292356.
Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13.
Ghorbani, A.; Zou, J.; Esteban, A. Data Shapley Valuation for Efficient Batch Active Learning. In Proceedings of the 56th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 30 October–2 November 2022; pp. 1456–1462.
Yuan, X.; Zhang, T. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res. 2013, 14, 899–925.
Rhee, P.; Erdenee, E.; Shin, D.; Ahmed, M.U.; Jin, S. Active and semi-supervised learning for object detection with imperfect data. Cogn. Syst. Res. 2017, 45, 10–123.
Liang, Z.; Xu, X.; Deng, S.; Cai, L.; Jiang, T.; Jia, K. Exploring Diversity-Based Active Learning for 3D Object Detection in Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 25, 15454–15466.
Zhang, Z.; Ma, W.; Yuan, X.; Hao, Y.; Guo, M.; Tang, H.; Zhou, Z.; Yao, Z. Instance-Aware Uncertainty for Active Learning in Object Detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 298–304.
Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23.
Smailagic, A.; Noh, H.; Campilho, A.; Costa, P.; Walawalkar, D.; Khandelwal, K.; Mirshekari, M.; Fagert, J.; Galdran, A.; Xu, S. MedAL: Accurate and Robust Deep Active Learning for Medical Image Analysis. In Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1–8.
Zhou, J.; Cao, R.; Kang, J.; Guo, K.; Xu, Y. An Efficient High-Quality Medical Lesion Image Data Labeling Method Based on Active Learning. IEEE Access 2020, 8, 144331–144342.
Van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
Khanal, B.; Bhattarai, B.; Khanal, B.; Stoyanov, D.; Linte, C.A. M-VAAL: Multimodal Variational Adversarial Active Learning for Downstream Medical Image Analysis Tasks. arXiv 2023, arXiv:2306.12376.
Wang, L.; Ma, Y.; Li, M.; Li, H.; Zhang, P. A Method of Network Attack Named Entity Recognition Based on Deep Active Learning. In Proceedings of the IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), Cambridge, UK, 1–5 July 2024.
Li, T.; Hu, Y.; Ju, A.; Hu, Z. Adversarial Active Learning for Named Entity Recognition in Cybersecurity. Comput. Mater. Contin. 2020, 66, 407–414.
Alcalde, M.; Ferreira, M.; González, P.; Andrade, F.; Tejera, G. DA-SLAM: Deep Active SLAM Based on Deep Reinforcement Learning. In Proceedings of the 2022 Latin American Robotics Symposium (LARS)/Brazilian Symposium on Robotics (SBR)/Workshop on Robotics in Education (WRE), São Bernardo do Campo, Brazil, 18–21 October 2022; pp. 282–287.
Singh, A.; Chakraborty, S. Deep Active Learning with Relative Label Feedback: An Application to Facial Age Estimation. In Proceedings of the International Joint Conference on Neural Networks, Virtual, 18–22 July 2021; pp. 1–9.
Chen, S.; Tangkaratt, V.; Lin, H.; Sugiyama, M. Active Deep Q-Learning with Demonstration. arXiv 2020, arXiv:1812.02632.
Deng, J.; Sun, J.; Peng, W.; Zhang, D.; Vyatkin, V. Imbalanced Multiclass Classification with Active Learning in Strip Rolling Process. Knowl.-Based Syst. 2022, 255, 109754.
Elreedy, D.; Atiya, D.; Kamalov, F. A Theoretical Distribution Analysis of Synthetic Minority Oversampling Technique (SMOTE) for Imbalanced Learning. Mach. Learn. 2023, 113, 4903–4923.
Fairstein, Y.; Kalinsky, O.; Karnin, Z.; Kushilevitz, G.; Libov, A.; Tolmach, S. Class Balancing for Efficient Active Learning in Imbalanced Datasets. In Proceedings of the 18th Linguistic Annotation Workshop, St. Julians, Malta, 17–22 March 2024; pp. 77–86.
Aggarwal, U.; Popescu, A.; Hudelot, C. Minority Class-Oriented Active Learning for Imbalanced Datasets. arXiv 2022, arXiv:2202.00390.
Bengar, J.; van de Weijer, J.; Fuentes, L.; Raducanu, B. Class-Balanced Active Learning for Image Classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 1–10.
Wolf, G. Facility location: Concepts, models, algorithms and case studies. Int. J. Geogr. Inf. Sci. 2011, 25, 331–333.
Jin, Q.; Yuan, M.; Wang, H.; Wang, M.; Song, Z. Deep Active Learning Models for Imbalanced Image Classification. Knowl.-Based Syst. 2022, 257, 109817.
Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris 1959, 8, 229–231.
Dasgupta, S. Analysis of a Greedy Active Learning Strategy. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 337–344.
Mussmann, S.; Dasgupta, S. Constants matter: The performance gains of active learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 16123–16173.
Gelbhart, R.; El-Yaniv, R. The Relationship Between Agnostic Selective Classification, Active Learning, and the Disagreement Coefficient. J. Mach. Learn. Res. 2019, 20, 33–41.
Minsker, S. Plug-in approach to active learning. arXiv 2011, arXiv:1104.1450.

Figure 1. The general pipeline of AL implementation.

Figure 2. PRISMA flow diagram of the study selection process.

Figure 3. Publication trends in AL research (2018–2025). (Left) Publication distribution by article type in WoS and Scopus databases, including early access publications. (Right) Word cloud depicting frequently occurring terms in article titles from the selected corpus.

Figure 4. Overview of the general AL pipeline for object detection. Object detection (OD) networks utilize standard models to localize objects by predicting bounding boxes (BBoxes). The AL procedure is subsequently implemented following the initial localization phase.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kwon, B.A.; Kang, K. Recent Advancements in Active Learning. Mathematics 2026, 14, 1358. https://doi.org/10.3390/math14081358

