Article

Active Learning for Medical Article Classification with Bag of Words and Bag of Concepts Embeddings

by Radosław Pytlak 1,*, Paweł Cichosz 2, Bartłomiej Fajdek 3 and Bogdan Jastrzębski 1

1 Faculty of Mathematics and Information Sciences, Warsaw University of Technology, 00-665 Warsaw, Poland
2 Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland
3 Faculty of Mechatronics, Warsaw University of Technology, 00-665 Warsaw, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7955; https://doi.org/10.3390/app15147955
Submission received: 17 June 2025 / Revised: 9 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025

Abstract

Systems supporting systematic literature reviews often use machine learning algorithms to create classification models to assess the relevance of articles to study topics. The proper choice of text representation for such algorithms may have a significant impact on their predictive performance. This article presents an in-depth investigation of the utility of the bag of concepts representation for this purpose, which can be considered an enhanced form of the ubiquitous bag of words representation, with features corresponding to ontology concepts rather than words. Its utility is evaluated in the active learning setting, in which a sequence of classification models is created, with training data iteratively expanded by adding articles selected for human screening. Different versions of the bag of concepts are compared with bag of words, as well as with combined representations, including both word-based and concept-based features. The evaluation uses the support vector machine, naive Bayes, and random forest algorithms and is performed on datasets from 15 systematic medical literature review studies. The results show that concept-based features may have additional predictive value in comparison to standard word-based features and that the combined bag of concepts and bag of words representation is the most useful overall.

1. Introduction

With the rapid growth in the amount of published research results, it has become a necessity to perform literature reviews in a semi-automated way using systems that save some significant part of the human work necessary to make inclusion or exclusion decisions. This is particularly true for domains where the results of such reviews are used to make decisions or take actions of possibly severe consequences, as is definitely the case for medical publications, where both the completeness and timeliness of reviews are of extreme importance. Recent analyses show that active learning can reach a recall of about 98% with only about 60% of articles manually reviewed [1]. Additionally, large-scale simulations confirm that active learning consistently outperforms random sampling across various screening tasks [2]. Therefore, any enhancement to the existing methodology that reduces the workload without deteriorating the quality is valuable [3].

1.1. Systematic Literature Reviews

Due to the importance of literature reviews performed in a systematic way, several organizations, such as the Cochrane Collaboration [4] and the EFSA [5], provide guidelines for the process, calling it a systematic literature review (SLR). An SLR is an essential part of evidence-based medicine, and according to the recommendations of a roundtable convened in 2006 by the Institute of Medicine, 90 percent of clinical decisions should be evidence-based (preface in [6]).
SLR is a time-consuming process: firstly, the set of articles to be reviewed is often in the thousands; secondly, each article must be assessed by at least two experts, and according to the estimate given in [7], the review of 5000 abstracts can take at least 40 h of expert work. In recent years, several semi-automated tools supporting SLRs have been proposed: Abstrackr [8], Rayyan [9], RobotAnalyst [10], Fastread [11], Fast [12], Colandr [13], BioReader [14], ASReview [15], and Symbals [16] (see [17] for more details). These systems are based on an active learning process and, in most cases, use bag of words (BoW) for document representations. A recent scoping review of over 60 studies confirmed the effectiveness of active learning for SLRs [18].

1.2. Motivation and Contributions

As summarized above, most published research on the use of text classification for systematic literature reviews has used the bag of words representation, which, despite its simplicity, handles the task of mapping texts to a vector space useful for classification quite well. Although alternative embedding-based representations provide some clear advantages, such as usually reduced dimensionality and the potential to capture contextual information, this does not necessarily make them better for the text classification task. The utility of several such representations, including word2vec [19], doc2vec [20], Glove [21], fastText [22], and BioBERT [23], was evaluated in previous work [17,24], and they did not outperform the bag of words representation. While some of them were found to reach a similar level of predictive power, with the advantage of lower dimensionality, bag of words can still be considered the method of choice for text classification with conventional (non-neural) machine learning algorithms. While more complex neural network-based embeddings are promising and deserve further interest [25], their widespread adoption in SLR systems encounters limitations related to computational cost, challenges in interpretability, and complexities in integration with active learning pipelines [26,27]. Moreover, recent benchmark analyses continue to show that the bag of words representation remains competitive in comparison with modern Transformer architectures in various text classification tasks [27].
Another approach to improving text representation, based on the bag of concepts (BoC) approach, has not yet received as much attention as it deserves, and its utility for active learning-based SLR systems has not been investigated with sufficient depth. We find this direction promising because it potentially augments the simplicity and utility of bag of words with domain knowledge and semantic information represented by ontologies. Bag of concepts can actually be considered a family of representations sharing the same basic principle of using features corresponding to concepts but differing in several detailed design decisions, such as the choice of ontologies and annotation tools or the method of concept vocabulary formation. This research partially fills the gap left by prior work by experimentally verifying the utility of several instantiations of the bag of concepts representation for semi-automated SLRs, in combination with different classification algorithms and active learning setups. Unlike the form of bag of concepts previously described in the literature [28], the applied feature extraction method does not rely on the UMLS storage repository and management tools, which makes it more general and potentially applicable with different ontologies for different domains.
More specifically, the contributions of this work can be summarized as follows:
  • A general bag of concepts text representation procedure with several variants corresponding to different vocabulary formation methods;
  • Systematic experiments comparing the utility of the bag of words, bag of concepts, and combined bag of concepts and bag of words representations, using different classification algorithms and active learning setups, with evaluation methods based on workload savings and performance profiles;
  • Practical recommendations for semi-automated SLR system development.
A preliminary and limited investigation of the utility of the bag of concepts representation for active learning applied to systematic literature reviews was reported before [17], but the study presented in this article is considerably more comprehensive with respect to the scope of bag of concepts variants, classification algorithms, and active learning configurations.

2. Methods

This research applies selected conventional machine learning algorithms to create text classification models in active learning mode, using the bag of words and bag of concepts representations. The corresponding subsections below briefly describe the methods used for key components of our machine learning system, providing necessary implementation details.

2.1. Bag of Concepts

The bag of concepts representation enhances the traditional bag of words model by substituting individual words with pre-established concepts drawn from ontologies. This approach begins with a document annotation process to identify where these concepts occur within texts. Once identified, these concept occurrences are quantified, similar to how word frequencies are tabulated in BoW. Our annotator’s dictionaries are designed to recognize synonyms, ensuring they are mapped to their corresponding core concepts.
Our BoC implementation offers a notable advancement over methodologies like the one described in [28]. That work is tightly integrated with the UMLS (Unified Medical Language System) repository and its specific tools, such as the MetaMap annotator. In contrast, our framework is developed for broader applicability: it is not confined to biomedical contexts and operates independently of UMLS-specific ontologies. This design allows for compatibility with diverse dictionary-based annotators and a wide spectrum of ontological structures. Such flexibility, providing a generalized BoC framework unconstrained by domain-specific infrastructure, aligns directly with the growing emphasis on portability and the wider reuse of ontologies in systematic literature review research.
The full process for generating bag of concepts representations from articles involves several stages. Initially, relevant ontologies are chosen to define the annotation vocabulary. Next, a dictionary-based annotator is employed to identify concept instances within the articles. These raw annotations are then processed to compute the fundamental numerical text representations. Finally, an augmented version of the BoC representation is developed by leveraging the hierarchical and relational properties inherent in ontologies.
It is worth noting that the term “bag of concepts”, itself, refers to a family of text representation techniques. Specific instantiations within this family can vary not only in the particular ontologies employed but also in their chosen methods for concept vocabulary formation and the detailed processes of document annotation.

2.1.1. Ontology Management Tools and Text Annotation

An expanded bag of concepts representation incorporates not only concepts directly identified by the annotator but also additional concepts linked through the ontology’s relational structure. For the management and analysis of ontologies, we rely on the ROBOT environment, software version 1.9.1 [29]. This tool is particularly useful for mapping hierarchical paths, such as tracing from specific concept nodes to the ontology tree’s root. To ensure these paths are correctly identified, a full understanding of all relationships between concepts is essential. This comprehensive relationship discovery is achieved through the use of a semantic reasoner, a software tool that performs logical inference based on the formal definitions within the ontology.
In order to choose a suitable annotator, we analyzed the properties of several available annotators. They were compared not only with respect to their accuracy but also by taking into account their computational time efficiency. After reviewing suitable dictionary annotators (m-grep, as used in the NCBO portal [30]; NOBLE [31]; cTAKES [32]; and Neji [33]), we decided to use the NOBLE annotator, following the recommendation of [34] and taking into account some other requirements stated in [17].

2.1.2. Feature Space

Concepts in an ontology form a directed graph, which can be treated as a taxonomy tree. Concepts that lie close to the tree root have a broader meaning, and those that are farther away from the root have a more specific meaning. On the basis of concept positions in the ontology tree, we can categorize them. As in [28], we introduce a concept depth threshold and take the shortest path to the root as the basis for evaluating concept depth.
The set of ontology concepts identified by the annotator among all articles in the set subject to transformation (A) can be directly used as the concept vocabulary (C) for the bag of concepts representation. Besides this direct approach, there are other possible vocabulary formation methods, such as those based on the filtering of the set of identified concepts by their occurrence frequency or based on the extension of the set of identified concepts by including additional concepts lying on the paths between their corresponding nodes and the root node of the ontology tree.
For each concept $c \in C$, the number of articles $N(c)$ in which this concept occurs is determined. For a given article $a \in A$, the number of occurrences $n(a, c)$ of concept $c$ is determined. Then, the corresponding feature value is obtained using one of the following approaches:
TF:
$$w(a, c) = n(a, c) \cdot d(c),$$
TF-IDF:
$$w(a, c) = \left( 1 + \log n(a, c) \right) \cdot \log \frac{N}{N(c)} \cdot d(c),$$
where $d(c)$ is the depth of concept $c$. The first factor in the applied TF-IDF formula is a sublinearly scaled form of term frequency, intended to avoid assigning too much weight to multiple concept occurrences in the same article. For both TF and TF-IDF, feature values are scaled depending on concept depth. Notice that this reduces to the standard TF-IDF formula used in the BoW representation if $d(c) = 1$, $c$ is a term in the bag of words vocabulary, and $1 + \log n(a, c)$ is replaced by $n(a, c)$.
To extend the concept vocabulary, for each concept $c \in C$ (we call it a seed concept), all concepts that are on the shortest path from $c$ to the root of the ontology tree are identified. Those whose depth is greater than the fixed depth threshold level $T_L$ are used to build an extended set of concepts $EC(c)$ associated with $c$ by adding them to concept $c$. As a result, the set of concepts $EC$ that is the basis for defining the feature space is of the following form:
$$EC = \bigcup_{c \in C} EC(c).$$
The value of the $\tilde{c} \in EC(c)$ feature for article $a$ is then evaluated as
$$w(a, \tilde{c}) = c_r \cdot w(a, c), \qquad c_r = \frac{1}{d(c) - d(\tilde{c}) + 1}.$$
This method of evaluating bag of concepts feature values is similar to the approach described in [28], with the caveat that in [28], the term-frequency variant is used.
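As an illustration, the depth-scaled weighting and the propagation of weights to ancestor concepts can be sketched in a few lines of Python (a minimal sketch with hypothetical helper names, not the study’s implementation):

```python
import math

def boc_tfidf(n_ac, n_c, n_docs, depth_c):
    """Sublinear TF-IDF weight of seed concept c in article a,
    scaled by the concept depth d(c):
    w(a, c) = (1 + log n(a, c)) * log(N / N(c)) * d(c)."""
    if n_ac == 0:
        return 0.0
    return (1 + math.log(n_ac)) * math.log(n_docs / n_c) * depth_c

def ancestor_weight(w_ac, depth_c, depth_anc):
    """Weight propagated to an ancestor concept on the path to the root:
    w(a, c~) = w(a, c) / (d(c) - d(c~) + 1)."""
    return w_ac / (depth_c - depth_anc + 1)
```

For the seed concept itself the ratio equals 1, so the full weight is kept, and each step toward the root shrinks the contribution of the more general ancestor concept.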
Figure 1 illustrates the extended bag of concepts representation. In two articles, two seed concepts are identified—C7 in Article 1 and C9 in Article 2. If vector representations are created on the basis of concepts C7 and C9, they do not capture the similarity between the articles. However, if concept C1 is additionally used, then the similarity of the articles is reflected by their representations.
The scheme of the process of evaluating text representations according to the bag of concepts method is presented in Figure 2.

2.2. Active Learning

Active learning is a supervised learning scenario in which the learning system issues labeling queries to an external oracle and receives training information via answers to these queries  [35]. For the systematic literature review application considered in this work, class labels indicate the relevance or irrelevance of articles under study. At the beginning, a small initial training set is selected for labeling from a large pool of unlabeled articles. Then, in each iteration, in the active learning process, a classification model is trained, and its predictions are used to select additional query articles from the remaining unlabeled pool. These articles with their class labels are then added to the training set for the next iteration. Methods for initial training-set selection and query selection are two major configurable components of this process.

2.2.1. Initial Training-Set Selection

The initial training set is used to create the classification model in the first active learning iteration. Two active learning initialization strategies are used in this work.
  • Random initialization. This strategy selects initial training instances at random.
  • Cluster initialization. This strategy uses clustering to select the most representative articles [36]. It applies the fuzzy c-means clustering algorithm [37] to cluster pool instances, then selects a specified number of instances with the highest cluster membership values (cluster-center selection), as well as a specified number of instances for which membership values for two different clusters differ the least (cluster-border selection).
Since clustering in multidimensional spaces tends to degenerate due to the curse of dimensionality, LSA (i.e., truncated SVD) [38] dimensionality reduction is applied to bag of words or bag of concepts feature vectors when using the cluster initialization method.
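The two-part selection from the fuzzy c-means membership matrix can be sketched as follows (a pure-Python illustration; `select_initial` is a hypothetical name, and a real implementation would operate on the membership matrix returned by the clustering library):

```python
def select_initial(memberships, n_centers, n_borders):
    """memberships: one membership vector per pool instance (a value per
    cluster, summing to 1), e.g. from fuzzy c-means.
    Returns indices chosen by cluster-center and cluster-border selection."""
    # Cluster-center selection: highest membership in any single cluster.
    order = sorted(range(len(memberships)),
                   key=lambda i: max(memberships[i]), reverse=True)
    centers = order[:n_centers]

    # Cluster-border selection: smallest gap between the two largest
    # membership values (the instance lies between two clusters).
    def gap(m):
        top = sorted(m, reverse=True)
        return top[0] - top[1]

    remaining = order[n_centers:]
    borders = sorted(remaining, key=lambda i: gap(memberships[i]))[:n_borders]
    return centers, borders
```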

2.2.2. Query Selection

This work adopts several algorithm-independent strategies for selecting query articles in each active learning iteration [39].
  • Uncertainty sampling. This strategy selects articles for which the class probability predictions of the current model are the most uncertain (the least confident) [40]. In the case of binary classification, as addressed by this work, the uncertainty can be measured as the 1’s complement of the probability of the more probable class:
    $$\mathrm{unc}_i = 1 - \max \{ p_{i,1}, p_{i,0} \},$$
    where $p_{i,1}$ and $p_{i,0}$ are the probabilities of the relevant and irrelevant classes for the $i$-th article, respectively. The uncertainty is maximized if both probabilities are equal to $0.5$.
  • Diversity sampling. This strategy selects articles that are maximally different from those already in the training set using the cosine dissimilarity measure.
  • Random sampling. This strategy is used as a comparison baseline and selects query articles at random.
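Uncertainty sampling thus amounts to ranking pool articles by how close their predicted relevance probability is to 0.5. A minimal sketch (illustrative function names, not the study’s code):

```python
def uncertainty(p_rel):
    """unc_i = 1 - max{p_i,1, p_i,0} for a binary-class probability p_rel."""
    return 1 - max(p_rel, 1 - p_rel)

def select_queries(pool_probs, batch_size=5):
    """Pick the batch_size pool indices with the most uncertain predictions."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: uncertainty(pool_probs[i]), reverse=True)
    return ranked[:batch_size]
```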

2.3. Classification Algorithms

The experimental study reported in this article used three machine learning algorithms: random forest [41], linear support vector machine [42], and multinomial naive Bayes [43]. Their utility for text classification was confirmed by prior work [24,44,45,46]. They are relatively resistant to overfitting and not overly sensitive to hyperparameter settings, which is an important advantage in the active learning setting, with very limited availability of labeled training data. While we also experimented with the radial kernel for SVM, it handled the high dimensionality of the bag of concepts representation poorly, whereas for bag of words, it did not significantly outperform the linear version.

2.4. Performance Evaluation

Since the purpose of the experimental study is to assess the utility of different learning process configurations for systematic literature reviews, the popular SLR-specific metric of classification performance is adopted, estimating screening workload savings that can be obtained. To objectively compare the observed performance on several SLR studies, an SLR performance profile analysis is performed, inspired by the performance profile analysis used for optimization solvers [47].

2.5. Workload Savings

Following [7], let the Yield and Burden measures related to using a classification model for an SLR study be defined as follows:
$$\mathrm{Yield} = \frac{TP^L + TP^U}{TP^L + TP^U + FN^U},$$
$$\mathrm{Burden} = \frac{TP^L + TN^L + TP^U + FP^U}{N},$$
where
TP is the number of true-positive predictions, i.e., articles predicted to be relevant that are truly relevant;
TN is the number of true-negative predictions, i.e., articles predicted to be irrelevant that are truly irrelevant;
FP is the number of false-positive predictions, i.e., articles predicted to be relevant that are truly irrelevant;
FN is the number of false-negative predictions, i.e., articles predicted to be irrelevant that are truly relevant;
N is the number of all articles.
These numbers are identified for the entire set of articles being screened. The $L$ superscript designates labeled articles (screened manually). The $U$ superscript designates the remaining unlabeled articles. Since class labels obtained for training articles are taken to be ground truth, $FP^L = 0$ by definition; therefore, it is not used for Burden and Yield calculation.
Yield is actually the same as recall for machine learning and information retrieval performance: it represents the share of truly relevant articles identified in the screening process (either due to manual screening or by model predictions). Burden represents the share of articles that needed to be screened manually, either to provide class labels for training or to review articles predicted to be relevant by the model.
The Yield and Burden metrics can be used to estimate the workload savings as the amount of manual work saved due to model predictions while achieving a satisfactory Yield, e.g., 0.95 [48]. The corresponding WSS@95% metric (work saved over sampling at 95% Yield) is defined as follows:
$$\mathrm{WSS@95\%} = 1 - \mathrm{Burden} - (1 - \mathrm{Yield}) \;\Big|\; \mathrm{Yield} \ge 0.95.$$
When evaluating the performance of the classification model on any active learning iteration, the predicted probability of the relevant class is compared against all possible probability thresholds between 0 and 1. Lower thresholds favor the positive class and produce higher Yield values but also higher Burden values due to an increased number of false positives. Higher threshold values favor the negative class, leading to lower Yield and Burden values. The maximum threshold value that provides a Yield level of at least 0.95 is selected. The value of the WSS@95% metric is then calculated for predicted classes obtained using this threshold value. Then, the active learning iteration with the maximum WSS@95% is determined, as well as the corresponding percentage of labeled data (manually screened articles).
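The threshold search described above can be sketched as follows (a minimal pure-Python illustration; `wss_at_95` and its arguments are hypothetical names, not the study’s implementation):

```python
def wss_at_95(labeled, pool_probs, pool_truth):
    """labeled: true 0/1 labels of manually screened articles.
    pool_probs / pool_truth: predicted probabilities of relevance and true
    labels for the unlabeled pool. Finds the maximum threshold that still
    gives Yield >= 0.95 and returns the corresponding WSS@95%."""
    n_total = len(labeled) + len(pool_probs)
    tp_l = sum(labeled)
    tn_l = len(labeled) - tp_l
    # Candidate thresholds: the predicted probabilities themselves,
    # scanned from highest to lowest; the first qualifying one is maximal.
    for t in sorted(set(pool_probs), reverse=True):
        pred = [p >= t for p in pool_probs]
        tp_u = sum(1 for pr, y in zip(pred, pool_truth) if pr and y)
        fp_u = sum(1 for pr, y in zip(pred, pool_truth) if pr and not y)
        fn_u = sum(1 for pr, y in zip(pred, pool_truth) if not pr and y)
        denom = tp_l + tp_u + fn_u
        yld = (tp_l + tp_u) / denom if denom else 1.0
        burden = (tp_l + tn_l + tp_u + fp_u) / n_total
        if yld >= 0.95:
            return (1 - burden) - (1 - yld)
    return None  # no threshold reaches the required Yield
```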

2.6. Performance Profiles for Active Learning Strategies

The experimental study reported in this article includes different text representations, different algorithms, active learning initialization methods, and active learning query selection methods, yielding several dozens of learning process configurations to be evaluated for multiple SLR case studies.
Presenting detailed raw results in such a multidimensional space of possibilities may lead to confusion and provide little value, even when using a single performance metric, such as WSS@95%. While this could be partially alleviated by using aggregation to reduce the dimensionality (e.g., averaging over multiple datasets), some possibly important information about result distribution would be lost. Adding more aggregation methods (e.g., standard deviation, median, or quartiles) would, again, obscure any patterns, bringing back the complexity of the original multidimensional comparison. Another popular approach based on performance ranking (e.g., counting the number of times particular configurations took first, second, or third place, etc.) is robust with respect to outliers but loses information about the magnitude of performance differences.
While there appears to be no adequate multidimensional comparison procedure in the SLR literature, similar challenges also occur in other domains, one of which is optimization solvers, for which optimization performance profiles provide a useful comparison method [47]. This is the inspiration for the learning performance profile analysis proposed below.
Consider comparing a set of learning configurations $L$ on a set of test cases $C$ using the WSS@95% performance metric. For each learning configuration $l \in L$ and each case study $c \in C$, let $w_{l,c}$ denote the value of WSS@95% obtained by learning configuration $l$ in case study $c$. The performance ratio $r_{l,c}$, assessing the relative performance of learning configuration $l$ for case study $c$ in comparison to the best performance of any configuration for that case study, is then defined as follows:
$$r_{l,c} = \frac{w_{l,c}}{\max_{l' \in L} w_{l',c}}.$$
Then,
$$p_l(\tau) = \frac{1}{|C|} \left| \{ c \in C : r_{l,c} \ge \tau \} \right|$$
can be considered the estimated probability that the performance of learning configuration $l$ is within a factor of $\tau$ of the best performance obtained over all case studies. This is a non-increasing piecewise constant function such that $p_l(0) = 1$, and $p_l(1)$ is the probability that learning configuration $l$ yields the best performance. Therefore, if only the number of wins were of interest, it would be sufficient to evaluate $p_l(1)$ for all $l \in L$. The complete profile, presented visually as a plot of $p_l(\tau)$ for $\tau \in [0, 1]$, is much more useful, as it reveals all major performance characteristics.
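Computing a performance profile from a table of WSS@95% values is straightforward; the following sketch (hypothetical names, not the study’s code) returns the profile of one configuration over a grid of threshold values:

```python
def performance_profile(wss, cfg, taus):
    """wss[l][c]: WSS@95% of configuration l on case study c.
    Returns p_cfg(tau) for each tau: the fraction of case studies in which
    cfg is within a factor tau of the best configuration for that case."""
    cases = list(wss[cfg])
    ratios = []
    for c in cases:
        best = max(wss[other][c] for other in wss)  # best over all configs
        ratios.append(wss[cfg][c] / best)
    return [sum(1 for r in ratios if r >= t) / len(ratios) for t in taus]
```

Note that the profile at tau = 0 is always 1, and its value at tau = 1 is the share of case studies in which the configuration is the winner, matching the properties stated above.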

3. Experimental Study

The experimental evaluation of the utility of active learning for systematic literature reviews is based on a comparison of the performance obtained using different classification algorithms, text representations, and active learning setups on real medical SLR case study data.

3.1. Data

For our comparative analysis, we employed 15 systematic literature review case studies, which originated from the Drug Effectiveness Review Project (DERP) team [49]. These datasets, publicly available from [48], have previously served as a resource in various investigations into semi-automated SLR systems, notably those documented in [28,50,51]. Each of these 15 SLRs was rigorously compiled by experienced researchers, with all inclusion and exclusion decisions consistently confirmed by at least two independent experts. In our experiments, classification relied exclusively on article abstracts; consequently, the assignment of ‘relevant’ or ‘irrelevant’ labels directly corresponded to the ground-truth decisions made during the original abstract-screening phase. The datasets for these case studies were specifically assembled by downloading article titles and abstracts using their respective PubMed Identifiers (PMIDs). Table 1 provides a detailed overview for each case study, listing the total article count and the percentage of relevant articles, which inherently illustrates the degree of class imbalance present in each set.

3.2. Experimental Procedure

In the experimental study, we present the performance of active learning using two types of text representations—bag of words and bag of concepts—as well as the combined bag of concepts and bag of words (BoC+W) representation, with the random forest (RF), linear support vector machine (SVL), and multinomial naive Bayes (NB) algorithms.
Text cleaning and normalization were performed using the Python spaCy library, version 3.0.5 [52], with its en_core_web_sm English language model. The process includes removing punctuation and other special characters, converting to lower case, lemmatizing, and removing stopwords. This is only applied when using the bag of words representation, since for the bag of concepts representation, text preprocessing is handled by the annotator.
The bag of words representation is implemented using the TfidfVectorizer class in the Python scikit-learn library [53]. To determine the vocabulary, frequency-based filtering is applied, skipping words with document frequency below 0.01 or above 0.95, which corresponds to the setting of the min_df = 0.01 and max_df = 0.95 parameters.
For the bag of concepts representation, a custom implementation of the procedure described in Section 2.1 was developed, using the NOBLE annotator with the SNOMED CT and MeSH ontologies and the depth threshold ( T L ) for extended representation set to 3. These choices follow [28]. All datasets are from the medical domain, so using the most common ontologies in this domain—SNOMED CT and MeSH—in our paper, as in [28], is well justified.
The following representation variants corresponding to different vocabulary formation methods are used:
  • Normal (n): All concepts identified by the annotator are included in the representation;
  • Short (s): As above, but concepts with document frequency less than 1% or greater than 95% of the maximum document frequency are omitted;
  • Extended (x): The concept vocabulary is extended to include concepts lying on the path from the concept identified by the annotator to the ontology root;
  • Extended short (xs): As above, but concepts with document frequency less than 1% or greater than 95% of the maximum document frequency are omitted.
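The frequency-based filtering used in the short variants can be sketched as follows (an illustrative pure-Python helper; thresholds are expressed relative to the maximum document frequency, as described above):

```python
def filter_vocab(doc_freq, low=0.01, high=0.95):
    """doc_freq: concept -> document frequency (number of articles in
    which the concept occurs). Keeps concepts whose frequency, relative
    to the maximum document frequency, falls within [low, high]."""
    max_df = max(doc_freq.values())
    return {c for c, f in doc_freq.items() if low <= f / max_df <= high}
```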
It should be noted that the short variants are characterized by significantly shorter representation vectors; for example, in the case of the ACE Inhibitors study, the representation vector for the normal variant contains 10,462 elements, whereas for the short variant, it contains 1380 elements. On the other hand, the size of the vector representations in the extended variant does not differ that much from that of the normal one: for the same case study, these representations have sizes of 14,624 and 10,462, respectively.
Our experiments showed that the most common medical ontologies—SNOMED CT and MeSH—provide sufficiently rich vocabularies for creating numerical representations of medical texts. In our tests, we used SNOMED CT with 354,319 concepts and MeSH with 305,350 concepts. Table 2 presents the dimensions of feature vectors for BoW, the short version of BoC, and the extended short version of BoC representations for each considered case study.
The same four variants corresponding to different concept vocabulary formation methods are also considered for the combined bag of concepts and bag of words representation, in which representation vectors are obtained by concatenating the corresponding bag of concepts and bag of words vectors.
Classification algorithm implementations provided by the scikit-learn Python library [53] are applied (RandomForestClassifier, LinearSVC, and MultinomialNB classes) with mostly default settings, since hyperparameter tuning would be problematic, given the extremely limited availability of labeled data in SLR application conditions. In particular, this means the following:
  • Random forest algorithm: A total of 100 trees in random forest models, grown using the Gini index impurity measure without a depth limit and with the square-root strategy for determining the random feature subset size considered for split selection;
  • Linear SVM algorithm: The squared-hinge loss function, the L2 penalty, and the regularization (constraint violation cost) parameter set to 1;
  • Multinomial naive Bayes algorithm: The Laplace probability smoothing parameter set to 1.
The only manual adjustment made to the algorithm configuration involves enabling class-balancing weights for RF and SVL to compensate for severe class imbalance.
An implementation of the fuzzy c-means clustering algorithm used for cluster initialization provided by the scikit-fuzzy library [54] was used with cosine dissimilarity, the number of clusters set to 5, and the dimensionality reduced to 5 by truncated SVD.
A custom active learning research implementation was developed, following the description in Section 2.2 with the initialization methods and query selection strategies presented there. The active learning process starts with an initial training set of 20 articles. For cluster initialization, 13 of them are from cluster borders, and 7 are from cluster centers. To make it possible to create the first classification model, the initial training set must contain articles from both the relevant and irrelevant classes. If, after receiving the class labels of initial training articles, it turns out not to be the case, additional initial training articles are selected in batches of five until the requirement is satisfied. Then, five query articles are selected in each iteration using the selection strategies described above. Human screening is simulated by revealing the true relevant/irrelevant classes available for the used datasets. There is no limit on the number of iterations, and the issue of early stopping criteria is postponed for future work so that this study focuses entirely on the utility of different text representations. This is why the active learning process continues until all class labels are revealed. As described above in Section 2.5, the active learning iteration with the maximum WSS@95% is determined, as well as the corresponding percentage of labeled data (articles with class labels revealed).
In each active learning iteration, the currently available true class labels for training articles and the current model’s predicted class probabilities for the unlabeled pool articles are combined and evaluated. To reduce the variance resulting from the random initial training-set selection, each active learning process is repeated 10 times, and for each iteration, true class labels and model predictions from these 10 runs are combined for the purpose of performance evaluation.

3.3. Preliminary Experiments

In preliminary experiments, whose results are not presented here for brevity, the scope of learning process configurations for further consideration was reduced by
  • Keeping only the TF-IDF variant of bag of concepts with a sublinear TF factor, which performed considerably better than TF and the other TF-IDF variants;
  • Keeping only the standard TF-IDF variant of bag of words, which performed similarly to the TF variant and is widely used in semi-automated tools for SLRs;
  • Dropping the diversity sampling query selection strategy, which performed significantly below the uncertainty sampling level;
  • Dropping the entirely random active learning setup (random initialization and random query selection), which was clearly inferior to the other setups.
Therefore, the results presented below are limited to the standard TF-IDF variant of bag of words, the sublinear TF-IDF variant of bag of concepts, and the following active learning setups:
  • random–uncertainty (ru): Random initialization and uncertainty sampling query selection;
  • cluster–random (cr): Cluster initialization and random sampling query selection;
  • cluster–uncertainty (cu): Cluster initialization and uncertainty sampling query selection.
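The uncertainty sampling strategy used in the ru and cu setups can be illustrated with one common formulation: selecting the pool articles whose predicted relevance probability is closest to 0.5 (a sketch, not necessarily the exact criterion used in the study):

```python
import numpy as np

def uncertainty_sample(proba_relevant, pool, k=5):
    """Pick the k pool articles the current model is least certain about,
    i.e., those with predicted relevance probability closest to 0.5."""
    pool = np.asarray(pool)
    order = np.argsort(np.abs(proba_relevant[pool] - 0.5))
    return pool[order[:k]].tolist()
```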

3.4. Evaluation Results

Before examining the utility of different text representations, we identify the most useful active learning setup based on the performance profiles presented in Figure 3. The presented curves were obtained using the bag of words representation, all three classification algorithms (NB, SVL, and RF), and three different combinations of active learning initialization and query selection (random–uncertainty (ru), cluster–random (cr), and cluster–uncertainty (cu)).
The following observations can be made:
  • Active learning setups: The cluster–uncertainty and random–uncertainty active learning setups outperform the cluster–random setup for each algorithm. There is not much of a difference between these two, but the cluster–uncertainty setup has a slight advantage; therefore, this is the setup selected for subsequent experiments.
  • Classification algorithms: The RF algorithm clearly outperforms the other two, and NB is better than SVL.
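These performance profiles follow the Dolan–Moré methodology [47]. For a maximized metric such as WSS@95%, one way to compute them is sketched below; the exact ratio definition used in the study may differ, so this is an illustration only:

```python
import numpy as np

def performance_profile(scores, taus):
    """Sketch of a Dolan-More performance profile for a maximized metric.

    scores[s] holds per-dataset scores (e.g., WSS@95%) for setup s.  For
    each dataset, a setup's ratio is best_score / own_score (>= 1, with 1
    meaning the setup is best there); the profile value at tau is the
    fraction of datasets where the ratio does not exceed tau.
    """
    names = list(scores)
    S = np.array([scores[s] for s in names], dtype=float)  # (setups, datasets)
    ratios = S.max(axis=0) / S          # assumes strictly positive scores
    return {s: [float((ratios[i] <= t).mean()) for t in taus]
            for i, s in enumerate(names)}
```

Plotting each setup's list of profile values against the tau grid reproduces the kind of curves shown in Figures 3 and 4: the higher and the further left a curve, the better the setup.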
The four different variants of the bag of concepts representation with the random forest algorithm and the cluster–uncertainty active learning setup are compared in the performance profiles in Figure 4. Supplementary Material Figures S1 and S2 present the corresponding profiles for the linear SVM and naive Bayes algorithms, respectively. Each plot includes the performance profiles for the bag of words representation and the four variants of the combined bag of concepts and -words representation.
One can observe the following:
  • RF: The best results were obtained for the BoC+W variants, with the extended BoC+W variant having a slight advantage over the others; all BoC variants performed worse than BoW, which in turn performed significantly worse than the BoC+W variants.
  • SVL: The BoC+W representations behaved quite similarly to BoW; the performance profiles of the BoC representations fall clearly below those of the other variants.
  • NB: Adequate results were obtained only for the BoW variant.
Supplementary Material Figure S3 presents performance profiles for BoW and the normal and extended BoC representations with the cluster–uncertainty active learning setup and the random forest classification algorithm. Supplementary Material Figure S4 similarly compares BoW with extended and extended short BoC.
The following can be observed:
  • Normal versus extended: For the BoC representations, the extended variant is better than the normal variant; concatenating BoC with BoW mitigates the impact of the extended concept set on the classification results.
  • Extended normal versus extended short: Short variants yield slightly worse classification results.
Including in the numerical representation of an article not only the concepts found by the annotator but also the concepts located on the ontology tree paths from those concepts to the tree root makes it possible to identify similarity between articles even when they do not share the same ontology concepts. This mechanism is illustrated in Figure 1.
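This path-based extension can be sketched with a hypothetical mini-ontology; ancestors shared by different leaf concepts create overlapping features even when the annotated concepts themselves differ:

```python
def extend_concepts(concepts, parent):
    """Extended-vocabulary sketch: augment each annotated concept with all
    concepts on its path to the ontology root (parent maps child -> parent;
    the root maps to None)."""
    extended = set()
    for c in concepts:
        while c is not None:
            extended.add(c)
            c = parent.get(c)
    return extended

# Hypothetical mini-ontology: two articles annotated with different leaf
# concepts that share the ancestor "heart disease".
parent = {"myocardial infarction": "heart disease",
          "angina": "heart disease",
          "heart disease": "disease",
          "disease": None}
a = extend_concepts({"myocardial infarction"}, parent)
b = extend_concepts({"angina"}, parent)
shared = a & b   # overlap appears although the leaf concepts differ
```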
With the cluster–uncertainty active learning setup and the extended bag of concepts representation, which can be generally considered the most useful based on the results presented above, Supplementary Material Figure S5 compares the performance profiles of different classification algorithms.
It can be seen that the random forest algorithm with the extended bag of concepts and -words representation performs the best, and the advantage of RF with BoC+W over RF with BoW is substantial. The corresponding detailed results for the same configurations are presented in Table 3 and Supplementary Material Tables S1 and S2. They contain the obtained WSS@95% values and the percentage of articles manually screened, calculated as described in Section 2.5. The maximum WSS@95% values in each row are marked with bold font.
The results presented in the tables are consistent with the previous findings based on the performance plots. The most successful active learner is the one using the random forest algorithm with the bag of concepts and -words representation, and even if the advantage over bag of words is not huge, it is more than sufficient to recommend this setup for SLR applications. The workload savings across test cases vary between 15% and 65%, while the percentage of articles screened varies between 30% and 75%.
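For reference, the WSS@95% metric can be computed from a model's relevance ranking as sketched below, following the standard definition of work saved over sampling at 95% recall (cf. [48]); this is an illustrative formulation, not the study's exact evaluation code:

```python
import numpy as np

def wss_at_recall(y_true, scores, recall=0.95):
    """WSS sketch: rank articles by predicted relevance, find the smallest
    fraction that must be screened to recover `recall` of the relevant
    articles, and report the unscreened fraction minus (1 - recall)."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # most relevant first
    y = np.asarray(y_true)[order]
    needed = int(np.ceil(recall * y.sum()))   # relevant articles to recover
    screened = int(np.searchsorted(np.cumsum(y), needed) + 1)
    return (len(y) - screened) / len(y) - (1.0 - recall)
```

With a perfect ranking of 10 articles containing 2 relevant ones, only the top 2 need screening, giving WSS@95% = 8/10 − 0.05 = 0.75.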

4. Conclusions

This article examined the choice of text representation for semi-automated SLR systems based on active learning more thoroughly than before.
Motivated by the insufficient attention that the bag of concepts representation has received so far, we implemented and experimentally evaluated several variations thereof, varying in the vocabulary formation method, in comparison with the ubiquitous bag of words representation, as well as the combined bag of concepts and -words representation.
The experimental study included classification algorithms found to be the most useful for text classification by prior work, as well as several active learning setups, and was performed on a fairly large number of medical SLR case studies.
Besides calculating the WSS@95% metric of SLR workload savings, the performance profile methodology, which is known in the field of optimization but novel in the context of text classification and systematic literature reviews, was adopted to visualize the relative performance of different configurations of the learning process, permitting an easy and objective comparison.
The obtained results confirm that uncertainty sampling combined with cluster initialization provides the most useful active learning strategy. The concept vocabulary formation method that includes not only the concepts identified by the annotator but also the concepts on the paths leading from them to the ontology tree root turned out to provide the most useful variant of the bag of concepts representation. While bag of concepts alone still did not outperform bag of words, the combined bag of concepts and -words representation provided the highest predictive value for the random forest and SVM algorithms, with the former being the most successful. While this superiority of BoC+BoW is an empirical observation based on comparative experiments rather than a formally justified property, it can be reasonably explained by the fact that BoC emphasizes the occurrence of terms related to the domain of the texts being classified, while BoW captures their context through the occurrence of other words.
The findings not only extend the state of knowledge but also permit a direct practical recommendation for the development of semi-automated SLR systems: the learning process configuration leading to the highest workload savings combines the random forest classifier, the combined bag of concepts and -words representation with path concepts included in the vocabulary, and active learning with cluster initialization and uncertainty sampling query selection. This configuration is not supported by most currently available SLR tools, which suggests a promising direction for their future development.
While the reported research fills some gaps left by prior work, it still leaves several issues to be addressed in the future.
First of all, new variants of the vocabulary formation method can be considered; e.g., an extended dictionary of concepts can be built based not on the shortest path to the root but on all paths leading from a given concept.
Secondly, using an ontology closely related to the domain of the SLR case under consideration can be expected to improve the classification results obtained with the bag of concepts representation. The approach presented in this work allows any ontology to be used.
Thirdly, an NER annotator trained on data related to the SLR case-study domain would provide higher annotation accuracy, which should also translate into better classification of articles represented by the bag of concepts.
Better results than those reported in this article can also be expected if the MedCAT annotator, trained on a corpus of medical texts, is used [55]. In addition, MedCAT uses models that resolve ambiguities in the interpretation of terms found in the texts.
Last but not least, recent advances in large language models (LLMs) and demonstrations of their successful applications to text classification [56,57,58,59,60,61] provide very strong encouragement for investigating their utility for systematic literature reviews. While challenging due to computational and budget requirements, this is without doubt a promising and exciting direction, with several research issues to investigate, including prompt designs, few-shot instance selection, and the use of model output to obtain not only predicted class labels but also class probabilities or alternative confidence measures. The most natural continuation of this work, however, would be to systematically evaluate different approaches to combining powerful but resource-hungry LLMs with fast and efficient conventional classifiers created via active learning, which can still be hard to beat with respect to classification quality when provided with sufficient training data. Some noteworthy possibilities include, but are not limited to, using LLMs fed specialized domain knowledge to extract more useful features, generate synthetic abstracts for training data augmentation, or refine borderline (low-confidence) predictions of conventional models.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15147955/s1, Figure S1: Performance profiles for BoW and different BoC/BoC+W variants (SVL); Figure S2: Performance profiles for BoW and different BoC/BoC+W variants (NB); Figure S3: Performance profiles for BoW and the normal and extended BoC/BoC+W variants (RF); Figure S4: Performance profiles for BoW and the extended and extended short BoC/BoC+W variants (RF); Figure S5: Performance profiles for BoW and extended BoC/BoC+W; Table S1: WSS@95% metric values for BoW and extended BoC/BoC+W (SVL); Table S2: WSS@95% metric values for BoW and extended BoC/BoC+W (NB).

Author Contributions

Conceptualization, R.P., P.C. and B.F.; methodology, R.P., P.C. and B.F.; software, R.P., P.C., B.F. and B.J.; validation, R.P., P.C. and B.F.; formal analysis, R.P. and P.C.; investigation, R.P., P.C. and B.J.; resources, R.P.; data curation, B.F.; writing—original draft preparation, R.P., P.C. and B.F.; writing—review and editing, R.P., P.C., B.F. and B.J.; visualization, R.P., P.C. and B.F.; supervision, R.P.; project administration, R.P.; funding acquisition, R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by funds from Norway grants under the NOR/POLNOR/REFSA/0059/2019-00 contract. This research was carried out with the support of the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science, Warsaw University of Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used for the experimental study described in this article are available at https://dx.doi.org/10.5281/zenodo.14735961, accessed on 13 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, T.; Ruser, S.; Kalunga, L.; Ivanek, R. Active learning models to screen articles as part of a systematic review of literature on digital tools in food safety. J. Food Prot. 2025, 88, 100488. [Google Scholar] [CrossRef]
  2. Teijema, J.J.; de Bruin, J.; Bagheri, A.; van de Schoot, R. Large-scale simulation study of active learning models for systematic reviews. Int. J. Data Sci. Anal. 2025, 1–22. [Google Scholar] [CrossRef]
  3. Masoumi, S.; Amirkhani, H.; Sadeghian, N.; Shahraz, S. Natural language processing (NLP) to facilitate abstract review in medical research: The application of BioBERT to exploring the 20-year use of NLP in medical research. Syst. Rev. 2024, 13, 107. [Google Scholar] [CrossRef] [PubMed]
  4. Clarke, M. The Cochrane collaboration and the Cochrane library. Otolaryngol.-Head Neck Surg. 2007, 137, S52–S54. [Google Scholar] [CrossRef]
  5. EFSA. Application of systematic review methodology to food and feed safety assessments to support decision making. EFSA J. 2010, 8, 1637–1726. [Google Scholar] [CrossRef]
  6. McClellan, M.B. Evidence-Based Medicine and the Changing Nature of Health Care: 2007 IOM Annual Meeting Summary; National Academy Press: Washington, DC, USA, 2008. [Google Scholar]
  7. Wallace, B.C.; Trikalinos, T.A.; Lau, J.; Brodley, C.; Schmid, C.H. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinform. 2010, 11, 55. [Google Scholar] [CrossRef]
  8. Wallace, B.C.; Small, K.; Brodley, C.E.; Lau, J.; Trikalinos, T.A. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, Miami, FL, USA, 28–30 January 2012; pp. 819–824. [Google Scholar]
  9. Ouzzani, M.; Hammady, H.; Fedorowicz, Z.; Elmagarmid, A. Rayyan—A web and mobile app for systematic reviews. Syst. Rev. 2016, 5, 210. [Google Scholar] [CrossRef]
  10. Przybyła, P.; Brockmeier, A.J.; Kontonatsios, G.; Le Pogam, M.A.; McNaught, J.; von Elm, E.; Nolan, K.; Ananiadou, S. Prioritising references for systematic reviews with RobotAnalyst: A user study. Res. Synth. Methods 2018, 9, 470–488. [Google Scholar] [CrossRef]
  11. Yu, Z.; Kraft, N.A.; Menzies, T. Finding better active learners for faster literature reviews. Empir. Softw. Eng. 2018, 23, 3161–3186. [Google Scholar] [CrossRef]
  12. Yu, Z.; Menzies, T. FAST2: An intelligent assistant for finding relevant papers. Expert Syst. Appl. 2019, 120, 57–71. [Google Scholar] [CrossRef]
  13. Cheng, S.H.; Augustin, C.; Bethel, A.; Gill, D.; Anzaroot, S.; Brun, J.; DeWilde, B.; Minnich, R.; Garside, R.; Masuda, Y.; et al. Using Machine Learning to Advance Synthesis and Use of Conservation and Environmental Evidence. Conserv. Biol. 2018, 32, 762–764. [Google Scholar] [CrossRef]
  14. Simon, C.; Davidsen, K.; Hansen, C.; Seymour, E.; Barnkob, M.B.; Olsen, L.R. BioReader: A Text Mining Tool for Performing Classification of Biomedical Literature. BMC Bioinform. 2019, 19, 57. [Google Scholar] [CrossRef]
  15. Van De Schoot, R.; De Bruin, J.; Schram, R.; Zahedi, P.; De Boer, J.; Weijdema, F.; Kramer, B.; Huijts, M.; Hoogerwerf, M.; Ferdinands, G.; et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat. Mach. Intell. 2021, 3, 125–133. [Google Scholar] [CrossRef]
  16. van Haastrecht, M.; Sarhan, I.; Yigit Ozkan, B.; Brinkhuis, M.; Spruit, M. SYMBALS: A systematic review methodology blending active learning and snowballing. Front. Res. Metrics Anal. 2021, 6, 685591. [Google Scholar] [CrossRef]
  17. Pytlak, R.; Bukhvalova, B.; Cichosz, P.; Fajdek, B.; Grahek-Ogden, D.; Jastrzebski, B. Machine learning based system for the automation of systematic literature reviews. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Istanbul, Turkey, 5–8 December 2024; BIBM-2023. pp. 4389–4397. [Google Scholar]
  18. Teijema, J.J.; Seuren, S.; Anadria, D.; Bagheri, A.; van de Schoot, R. Simulation-based Active Learning for Systematic Reviews: A Systematic Review of the Literature. PsyArXiv 2025. preprint. [Google Scholar]
  19. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  20. Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the Thirty-First International Conference on Machine Learning, New York, NY, USA, 21–26 June 2014; ICML-2014. pp. 1188–1196. [Google Scholar]
  21. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 25–29 October 2014; EMNLP-2014. pp. 1532–1543. [Google Scholar]
  22. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. arXiv 2016, arXiv:1607.04606. [Google Scholar] [CrossRef]
  23. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  24. Cichosz, P. Bag of Words and Embedding Text Representation Methods for Medical Article Classification. Int. J. Appl. Math. Comput. Sci. 2023, 33, 603–621. [Google Scholar] [CrossRef]
  25. Hasny, M.; Vasile, A.P.; Gianni, M.; Bannach-Brown, A.; Nasser, M.; Mackay, M.; Donovan, D.; Šorli, J.; Domocos, I.; Dulloo, M.; et al. BERT for complex systematic review screening to support the future of medical research. In Proceedings of the International Conference on Artificial Intelligence in Medicine, Portoroz, Slovenia, 12–15 June 2023; Springer: Cham, Switzerland, 2023; pp. 173–182. [Google Scholar]
  26. Costa, W.M.; Pedrosa, G.V. A textual representation based on bag-of-concepts and thesaurus for legal information retrieval. In Proceedings of the Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), Campinas, Brazil, 28 November–1 December 2022; pp. 114–121. [Google Scholar]
  27. Graff, M.; Moctezuma, D.; Téllez, E.S. Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges. Nat. Lang. Process. J. 2025, 11, 100154. [Google Scholar] [CrossRef]
  28. Ji, X.; Ritter, A.; Yen, P.Y. Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews. J. Biomed. Inform. 2017, 69, 33–42. [Google Scholar] [CrossRef]
  29. Jackson, R.C.; Balhoff, J.P.; Douglass, E.; Harris, N.L.; Mungall, C.J.; Overton, J.A. ROBOT: A tool for automating ontology workflows. BMC Bioinform. 2019, 20, 407. [Google Scholar] [CrossRef] [PubMed]
  30. Jonquet, C.; Shah, N.H.; Musen, M.A. The open biomedical annotator. Summit Transl. Bioinform. 2009, 2009, 56. [Google Scholar]
  31. Tseytlin, E.; Mitchell, K.; Legowski, E.; Corrigan, J.; Chavan, G.; Jacobson, R.S. NOBLE–Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinform. 2016, 17, 32. [Google Scholar] [CrossRef] [PubMed]
  32. Savova, G.K.; Masanz, J.J.; Ogren, P.V.; Zheng, J.; Sohn, S.; Kipper-Schuler, K.C.; Chute, C.G. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. 2010, 17, 507–513. [Google Scholar] [CrossRef]
  33. Campos, D.; Matos, S.; Oliveira, J.L. A modular framework for biomedical concept recognition. BMC Bioinform. 2013, 14, 281. [Google Scholar] [CrossRef]
  34. Jovanović, J.; Bagheri, E. Semantic annotation in biomedicine: The current landscape. J. Biomed. Semant. 2017, 8, 44. [Google Scholar] [CrossRef]
  35. Cohn, D.; Atlas, L.; Ladner, R. Improving Generalization with Active Learning. Mach. Learn. 1994, 15, 201–221. [Google Scholar] [CrossRef]
  36. Yuan, W.; Han, Y.; Hee, K.; Guan, D.; Lee, S. Initial Training Data Selection for Active Learning. In Proceedings of the Fifth International Conference on Ubiquitous Information Management and Communication, Seoul, Korea, 21–23 February 2011. [Google Scholar]
  37. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Springer: Berlin/Heidelberg, Germany, 1981. [Google Scholar]
  38. Dumais, S.T. Latent Semantic Analysis. Annu. Rev. Inf. Sci. Technol. 2005, 38, 188–229. [Google Scholar] [CrossRef]
  39. Fu, Y.; Zhu, X.; Li, B. A Survey on Instance Selection for Active Learning. Knowl. Inf. Syst. 2013, 35, 249–283. [Google Scholar] [CrossRef]
  40. Nguyen, V.; Shaker, M.H.; Hüllermeier, E. How to Measure Uncertainty in Uncertainty Sampling for Active Learning. Mach. Learn. 2022, 111, 89–122. [Google Scholar] [CrossRef]
  41. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  42. Cortes, C.; Vapnik, V.N. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  43. McCallum, A.; Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification. In Proceedings of the AAAI/ICML-98 Workshop on Learning for Text Categorization, Menlo Park, CA, USA, 26–27 July 1998; pp. 41–48. [Google Scholar]
  44. Joachims, T. Learning to Classify Text by Support Vector Machines: Methods, Theory, and Algorithms; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  45. Rios, G.; Zha, H. Exploring Support Vector Machines and Random Forests for Spam Detection. In Proceedings of the First International Conference on Email and Anti Spam, Mountain View, CA, USA, 30–31 July 2004; CEAS-2004. pp. 284–292. [Google Scholar]
  46. Xue, D.; Li, F. Research of Text Categorization Model Based on Random Forests. In Proceedings of the 2015 IEEE International Conference on Computational Intelligence and Communication Technology, Los Alamitos, CA, USA, 13–14 February 2015; CICT-2015. pp. 173–176. [Google Scholar]
  47. Dolan, E.D.; Moré, J.J. Benchmarking optimization software with performance profiles. Math. Program. Ser. A 2002, 91, 201–213. [Google Scholar] [CrossRef]
  48. Cohen, A.M.; Hersh, W.R.; Peterson, K.; Yen, P.Y. Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc. 2006, 13, 206–219. [Google Scholar] [CrossRef] [PubMed]
  49. Cohen, A.M. Drug Effectiveness Review Project (DERP) Systematic Drug Class Review GoldStandard Data. 2006. Available online: https://dmice.ohsu.edu/cohenaa/systematic-drug-class-review-data.html (accessed on 1 February 2023).
  50. Cohen, A.M. Optimizing feature representation for automated systematic review work prioritization. In Proceedings of the AMIA Annual Symposium Proceedings, Washington, DC, USA, 8–12 November 2008; Volume 2008, pp. 121–125. [Google Scholar]
  51. Matwin, S.; Kouznetsov, A.; Inkpen, D.; Frunza, O.; O’Blenis, P. A new algorithm for reducing the workload of experts in performing systematic reviews. J. Am. Med. Inform. Assoc. 2010, 17, 446–453. [Google Scholar] [CrossRef] [PubMed]
  52. Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python, Library version 3.0.5.; Zenodo: Geneva, Switzerland, 2021. [Google Scholar]
  53. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  54. Warner, J.; Sexauer, J.; Van den Broeck, W.; Kinoshita, B.P.; Balinski, J.; Clauss, C.; Clauss, C.; twmeggs; alexsavio; Unnikrishnan, A.; et al. JDWarner/scikit-fuzzy: Scikit-Fuzzy 0.5.0, Library version 0.5.0; Zenodo: Geneva, Switzerland, 2024. [Google Scholar] [CrossRef]
  55. Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Mascio, A.; Zhu, L.; Folarin, A.A.; Roberts, A.; et al. Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit. Artif. Intell. Med. 2021, 117, 102083. [Google Scholar] [CrossRef]
  56. Chae, Y.; Davidson, T. Large Language Models for Text Classification: From Zero-Shot Learning to Instruction-Tuning. Sociol. Methods Res. 2025. [Google Scholar] [CrossRef]
  57. Sun, X.; Li, X.; Li, J.; Wu, F.; Guo, S.; Zhang, T.; Wang, G. Text Classification via Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Vienna, Austria, 27 July–1 August 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 8990–9005. [Google Scholar]
  58. Chen, S.; Li, Y.; Lu, S.; Van, H.; Aerts, H.J.W.L.; Savova, G.K.; Bitterman, D.S. Evaluating the ChatGPT Family of Models for Biomedical Reasoning and Classification. J. Am. Med. Inform. Assoc. 2024, 31, 940–948. [Google Scholar] [CrossRef]
  59. Guo, Y.; Ovadje, A.; Al-Garadi, M.A.; Sarker, A. Evaluating Large Language Models for Health-Related Text Classification Task with Public Social Media Data. J. Am. Med. Inform. Assoc. 2024, 31, 2181–2189. [Google Scholar] [CrossRef]
  60. Labrak, Y.; Rouvier, M.; Dufour, R. A Zero-Shot and Few-Shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks. arXiv 2024, arXiv:2307.12114. [Google Scholar]
  61. Wang, Z.; Pang, Y.; Lin, Y.; Zhu, X. Adaptable and Reliable Text Classification using Large Language Models. In Proceedings of the 2024 IEEE International Conference on Data Mining Workshops, Abu Dhabi, United Arab Emirates, 9 December 2024; ICDMW-2024. pp. 67–74. [Google Scholar]
Figure 1. Concept extension illustration.
Figure 2. Process of evaluating bag of concepts text representation.
Figure 3. Performance profiles for the BoW text representation, different active learning initialization methods (c—cluster; r—random), different active learning query selection strategies (u—uncertainty; r—random), and different classification algorithms (RF—random forest; NB—naive Bayes; SVL—SVM with linear kernel). For example, RF-cu designates the random forest algorithm applied with cluster initialization and uncertainty sampling.
Figure 4. Performance profiles for BoW and different BoC/BoC+W variants: normal (n), short (s), extended (x), extended short (xs) for active learning with cluster initialization and uncertainty sampling query selection, and the RF algorithm.
Table 1. Case study datasets.
Case Name | Total | Relevant (%)
ACE Inhibitors | 2544 | 183 (7.19%)
ADHD | 851 | 84 (9.87%)
Antihistamines | 310 | 92 (29.68%)
Atypical Antipsychotics | 1120 | 363 (32.41%)
Beta Blockers | 2072 | 302 (14.58%)
Calcium Channel Blockers | 1218 | 279 (22.91%)
Estrogens | 368 | 80 (21.74%)
NSAIDS | 393 | 88 (22.39%)
Opioids | 1915 | 48 (2.51%)
Oral Hypoglycemics | 503 | 139 (27.63%)
Proton Pump Inhibitors | 1333 | 238 (17.85%)
Skeletal Muscle Relaxants | 1643 | 34 (2.07%)
Statins | 3465 | 173 (4.99%)
Triptans | 671 | 218 (32.49%)
Urinary Incontinence | 327 | 78 (23.85%)
Table 2. The number of features in the BoW, BoC (s), and BoC (xs) representations.
Case Name | BoW | BoC (s) | BoC (xs)
ACE Inhibitors | 1121 | 1380 | 2285
ADHD | 1132 | 1319 | 2133
Antihistamines | 1231 | 1331 | 2207
Atypical Antipsychotics | 1088 | 1124 | 1818
Beta Blockers | 1187 | 1565 | 2555
Calcium Channel Blockers | 1080 | 1327 | 2170
Estrogens | 1204 | 1110 | 1916
NSAIDS | 1156 | 1353 | 2271
Opioids | 1181 | 1230 | 1999
Oral Hypoglycemics | 1189 | 1145 | 1851
Proton Pump Inhibitors | 1021 | 1032 | 1805
Skeletal Muscle Relaxants | 1236 | 1794 | 2879
Statins | 1201 | 1559 | 2491
Triptans | 1156 | 1131 | 1891
Urinary Incontinence | 1397 | 1423 | 2312
Table 3. WSS@95% metric values for the BoW and extended BoC/BoC+W text representations, active learning with cluster initialization and uncertainty sampling, and the RF algorithm.
Case Name | BoW (%) | BoW (WSS95) | BoC (%) | BoC (WSS95) | BoC+W (%) | BoC+W (WSS95)
ACE Inhibitors | 69.16 | 0.2550 | 55.03 | 0.1935 | 69.94 | 0.2293
ADHD | 29.26 | 0.6579 | 27.77 | 0.6672 | 27.15 | 0.6779
Antihistamines | 81.88 | 0.1006 | 78.39 | 0.1399 | 80.13 | 0.1472
Atypical Antipsychotics | 73.22 | 0.1157 | 79.53 | 0.1242 | 69.83 | 0.1522
Beta Blockers | 71.96 | 0.2046 | 70.44 | 0.1639 | 75.15 | 0.1934
Calcium Channel Blockers | 59.51 | 0.2247 | 67.62 | 0.1977 | 68.53 | 0.2634
Estrogens | 64.46 | 0.3059 | 60.17 | 0.3211 | 50.14 | 0.3209
NSAIDS | 54.46 | 0.3677 | 51.67 | 0.2857 | 55.86 | 0.3235
Opioids | 58.18 | 0.3289 | 46.55 | 0.1712 | 53.61 | 0.3625
Oral Hypoglycemics | 84.21 | 0.1082 | 81.05 | 0.1164 | 76.84 | 0.1280
Proton Pump Inhibitors | 75.97 | 0.1810 | 71.84 | 0.2201 | 70.60 | 0.2079
Skeletal Muscle Relaxants | 61.41 | 0.2899 | 68.95 | 0.1736 | 62.04 | 0.3144
Statins | 70.48 | 0.2448 | 65.99 | 0.2906 | 66.67 | 0.2632
Triptans | 48.82 | 0.3393 | 42.92 | 0.2766 | 49.66 | 0.3286
Urinary Incontinence | 68.66 | 0.2584 | 72.18 | 0.2281 | 66.90 | 0.2809

Share and Cite

MDPI and ACS Style

Pytlak, R.; Cichosz, P.; Fajdek, B.; Jastrzębski, B. Active Learning for Medical Article Classification with Bag of Words and Bag of Concepts Embeddings. Appl. Sci. 2025, 15, 7955. https://doi.org/10.3390/app15147955
