Adaptive Bi-Encoder Model Selection and Ensemble for Text Classification
Abstract
1. Introduction
- We introduce a foundational version of the proposed approaches, utilizing k-NN text classification with a bi-encoder model (Section 2).
- We present novel adaptive selection and ensemble techniques that significantly improve the performance of these basic approaches (Section 3).
- We present experimental results from datasets of various sizes (Section 4.1 and Section 4.2), followed by a comprehensive analysis of the proposed approaches (Section 4.3).
2. k-NN Text Classification Using a Bi-Encoder Model
2.1. Text Classification with a Bi-Encoder Model
- Training Phase: We employ a pre-trained Sentence BERT (SBERT) bi-encoder to convert each sentence $x_i$ in the training set $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N}$ into an embedding vector $v_i$. The training set is then updated to $D'_{\text{train}} = \{(v_i, y_i)\}_{i=1}^{N}$.
- Inference Phase: For each sentence $x_j$ in the test set $D_{\text{test}}$, we generate its corresponding embedding vector $v_j$ using the same SBERT bi-encoder. The test set is then updated to $D'_{\text{test}} = \{v_j\}_{j=1}^{M}$. To predict the label for each sentence $x_j$, we compute the cosine similarity between $v_j$ and the embedding vectors of all sentences in the training set $D'_{\text{train}}$. The predicted label $\hat{y}_j$ is assigned as follows: $\hat{y}_j = y_{i^*}$, where $i^* = \arg\max_{i \in \{1,\dots,N\}} \cos(v_j, v_i)$.
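Below is a minimal sketch of both phases, assuming the sentence-transformers and NumPy packages; the model name and function names are illustrative, not the authors' implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative SBERT bi-encoder; any of the models evaluated in Section 4
# can be substituted here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_training_set(train_sentences, train_labels):
    # Training phase: replace each sentence x_i with its embedding v_i.
    vectors = encoder.encode(train_sentences, normalize_embeddings=True)
    return vectors, np.asarray(train_labels)

def predict_1nn(test_sentences, train_vectors, train_labels):
    # Inference phase: with L2-normalized embeddings, cosine similarity
    # reduces to a dot product; each test sentence takes the label of its
    # single most similar training sentence.
    test_vectors = encoder.encode(test_sentences, normalize_embeddings=True)
    sims = test_vectors @ train_vectors.T  # shape: (n_test, n_train)
    return train_labels[sims.argmax(axis=1)]
```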
2.2. Enhanced k-NN Classification Techniques
- Majority Voting: For each sentence embedding $v_j$ in $D'_{\text{test}}$, we identify the set $N_k(v_j)$ of the $k$ most similar embeddings in $D'_{\text{train}}$ based on cosine similarity. The predicted label for the corresponding sentence $x_j$ is then determined by majority voting among the labels of these $k$ nearest neighbors: $\hat{y}_j = \arg\max_{y} \sum_{(v_i, y_i) \in N_k(v_j)} \mathbb{1}[y_i = y]$.
- Weighted Voting: For each sentence embedding $v_j$ in $D'_{\text{test}}$, we again identify the $k$ most similar embeddings $N_k(v_j)$ in $D'_{\text{train}}$ by their cosine similarity scores. Instead of simply counting label occurrences, we sum the similarity scores for each label. The predicted label for $x_j$ is the label with the highest sum of cosine similarity scores: $\hat{y}_j = \arg\max_{y} \sum_{(v_i, y_i) \in N_k(v_j)} \cos(v_j, v_i)\,\mathbb{1}[y_i = y]$.
- Fair Weighted Voting: To prevent bias towards labels that are more frequent in the training set, we introduce a “fair weighted voting” strategy. For each sentence embedding $v_j$ in $D'_{\text{test}}$ and each possible label $y$, we select the set $N_m^{y}(v_j)$ of the top $m$ nearest neighbors among the training embeddings with label $y$, based on their cosine similarity to $v_j$, where $m$ is a per-label neighbor count chosen so that every label is scored over the same number of neighbors. The predicted label is the one whose selected neighbors attain the highest total similarity: $\hat{y}_j = \arg\max_{y} \sum_{v_i \in N_m^{y}(v_j)} \cos(v_j, v_i)$. (Simplified sketches of all three voting strategies follow this list.)
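The following sketches of the three voting rules are simplified, assumed implementations: `sims` is one test embedding's row of cosine similarities against all training embeddings (as computed in the previous sketch), `train_labels` is a NumPy array, and the per-label budget `m` in fair weighted voting is treated as a hyperparameter.

```python
import numpy as np
from collections import Counter

def majority_vote(sims, train_labels, k):
    # Labels of the k most similar training embeddings; ties broken arbitrarily.
    top = np.argsort(-sims)[:k]
    return Counter(train_labels[top]).most_common(1)[0][0]

def weighted_vote(sims, train_labels, k):
    # Sum cosine similarities per label instead of counting occurrences.
    top = np.argsort(-sims)[:k]
    scores = {}
    for i in top:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + sims[i]
    return max(scores, key=scores.get)

def fair_weighted_vote(sims, train_labels, m):
    # Score every label over its own top-m nearest neighbors, so frequent
    # labels cannot dominate by sheer sample count. Labels with fewer than
    # m training samples contribute all of their neighbors.
    scores = {}
    for label in np.unique(train_labels):
        label_sims = sims[train_labels == label]
        scores[label] = np.sort(label_sims)[-m:].sum()
    return max(scores, key=scores.get)
```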
3. Adaptive Selection and Ensemble Techniques
3.1. Adaptive Selection of Bi-Encoders
- Divide the training set $D_{\text{train}}$ into $F$ equal-sized folds $D_1, \dots, D_F$, where each fold serves as a validation set once, and the remaining folds form the training subset.
- For each bi-encoder $b_e$ and for each fold $f$, train its k-NN classifier on the corresponding training subset $D_{\text{train}} \setminus D_f$ and evaluate it on the corresponding validation fold $D_f$.
- Define the cross-validated accuracy for bi-encoder $b_e$ as follows: $\text{CVAcc}(b_e) = \frac{1}{F} \sum_{f=1}^{F} \text{Acc}(b_e, D_f)$, where $\text{Acc}(b_e, D_f)$ denotes the accuracy obtained on fold $D_f$.
- Select the bi-encoder $b^{*}$ that maximizes the cross-validated accuracy: $b^{*} = \arg\max_{b_e} \text{CVAcc}(b_e)$.
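A compact sketch of the selection procedure, under the assumption that each fold is scored with the plain nearest-neighbor classifier of Section 2 for brevity (any of the voting variants of Section 2.2 can be slotted in); `StratifiedKFold` is used as a convenient approximation of the equal-sized folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(encoder, sentences, labels, F=5):
    # Embed every sentence once; each fold then reuses the cached vectors,
    # since the pre-trained bi-encoder itself is not fine-tuned.
    vectors = encoder.encode(sentences, normalize_embeddings=True)
    labels = np.asarray(labels)
    accs = []
    for train_idx, val_idx in StratifiedKFold(n_splits=F).split(vectors, labels):
        sims = vectors[val_idx] @ vectors[train_idx].T
        preds = labels[train_idx][sims.argmax(axis=1)]
        accs.append((preds == labels[val_idx]).mean())
    return float(np.mean(accs))

def select_best_encoder(encoders, sentences, labels, F=5):
    # Pick the bi-encoder b* with the highest F-fold cross-validated accuracy.
    return max(encoders,
               key=lambda e: cross_validated_accuracy(e, sentences, labels, F))
```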
3.2. Ensemble of Bi-Encoders
- Perform the F-fold cross-validation process as described in Section 3.1 for each bi-encoder $b_e$, and compute their respective cross-validated accuracies $\text{CVAcc}(b_e)$.
- Select the top $H$ bi-encoders $b^{(1)}, \dots, b^{(H)}$ with the highest cross-validated accuracies.
- For each sentence $x_i$ in $D_{\text{train}}$, generate $H$ embedding vectors using the selected $H$ models. This results in $H$ vectors $v_i^{(1)}, \dots, v_i^{(H)}$ for each sentence $x_i$.
- Similarly, transform each sentence $x_j$ in $D_{\text{test}}$ into $H$ embedding vectors $v_j^{(1)}, \dots, v_j^{(H)}$ using the same $H$ models.
- Compute the “ensemble similarity” between a sentence $x_j$ in $D_{\text{test}}$ and a sentence $x_i$ in $D_{\text{train}}$ as the average of the cosine similarities between their corresponding vectors from all $H$ models. This can be expressed as follows: $\text{sim}_{\text{ens}}(x_j, x_i) = \frac{1}{H} \sum_{h=1}^{H} \cos\bigl(v_j^{(h)}, v_i^{(h)}\bigr)$.
- Use the computed $\text{sim}_{\text{ens}}$ during the inference phase for the process of finding the nearest neighbors as described in Section 2.
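A sketch of the ensemble similarity computation, assuming `encoders` holds the $H$ selected SentenceTransformer models:

```python
import numpy as np

def ensemble_similarity(test_sentences, train_sentences, encoders):
    # Average cosine similarity across the H selected bi-encoders.
    # Each encoder contributes one (n_test, n_train) similarity matrix.
    sims = []
    for encoder in encoders:
        test_vecs = encoder.encode(test_sentences, normalize_embeddings=True)
        train_vecs = encoder.encode(train_sentences, normalize_embeddings=True)
        sims.append(test_vecs @ train_vecs.T)
    return np.mean(sims, axis=0)
```

Because each model's embeddings are L2-normalized, all $H$ similarity matrices share the same $[-1, 1]$ range, so a simple average remains meaningful even when the ensembled models have different embedding dimensionalities.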
3.3. Adaptive Selection of Existing and Proposed Approaches
- Fine-tune the BERT model $M_{\text{BERT}}$ on 90% of the training set $D_{\text{train}}$, while reserving the remaining 10% as a validation set $D_{\text{val}}$.
- Calculate the validation accuracy $\text{Acc}_{\text{val}}$ of $M_{\text{BERT}}$ on $D_{\text{val}}$.
- If $\text{Acc}_{\text{val}}$ exceeds a threshold $\theta$, use $M_{\text{BERT}}$ for inference on the test set $D_{\text{test}}$. Otherwise, use our bi-encoder-based model for inference on $D_{\text{test}}$.
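A minimal routing sketch, assuming hypothetical `finetuned_model` and `bi_encoder_model` wrappers that expose a common `predict` method, with the threshold $\theta$ left as a tunable hyperparameter:

```python
def route_inference(finetuned_model, bi_encoder_model,
                    val_accuracy, threshold, test_sentences):
    # Use the fine-tuned transformer only when its held-out validation
    # accuracy clears the threshold; otherwise fall back to the bi-encoder
    # approach, which tends to be more robust on small training sets.
    if val_accuracy > threshold:
        return finetuned_model.predict(test_sentences)
    return bi_encoder_model.predict(test_sentences)
```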
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Existing Models
4.1.3. Proposed Models
- Approaches Without Pre-trained Transformers: These models perform text classification based on the training and inference phases described in Section 2, without using transformer models. The “GloVe.6B.300d” model generates embedding vectors with the GloVe model [20] and classifies text by finding the most similar instances. The “Jaccard+Word1Gram” and “Jaccard+Char3Gram” models skip embedding generation, directly finding the most similar instances using Jaccard similarity based on word-level 1-grams and character-level 3-grams, respectively. Similarly, the “Cosine+TFIDF” model finds the most similar instances using cosine similarity applied to tf-idf values. These tf-idf values are calculated from the training data using the TfidfVectorizer class from the scikit-learn library [21]. Minimal sketches of these non-transformer baselines are given after this list.
- Bi-Encoder-Based Approaches: These approaches use Sentence BERT (SBERT) bi-encoder models. “Model NN” employs the default bi-encoder, following the training and inference phases described in Section 2, which finds the most similar embeddings. “Model MV” extends this by incorporating the majority voting technique. “Model WV” is similar to “Model MV” but uses weighted voting. Finally, “Model FW” builds on “Model WV” by employing our fair weighted voting technique.
- Bi-Encoder + Adaptive Selection: The approach “Model FW-AS” incorporates the proposed adaptive selection technique described in Section 3.1, along with our fair weighted voting method described in Section 2.
- Bi-Encoder + Adaptive Selection + Ensemble: These approaches use the techniques proposed in Section 2, Section 3.1, and Section 3.2. “Model FW-AS-2BI” extends “Model FW-AS” by applying the ensemble technique from Section 3.2, with $H$ set to 2. “Model FW-AS-3BI” is similar to “Model FW-AS-2BI,” but with $H$ set to 3.
- Bi-Encoder + Adaptive Selection + Ensemble + Existing Approach: The approach “Model FW-AS-2BI-BT” extends “Model FW-AS-2BI” by incorporating the RoBERTa-10 model, following the technique described in Section 3.3.
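For reference, minimal sketches of the non-transformer baselines described above (the helper names are hypothetical; the TF-IDF variant mirrors the scikit-learn usage the paper cites):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a_tokens, b_tokens):
    # Jaccard similarity over token sets.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def char_3grams(text):
    # Character-level 3-grams for the "Jaccard+Char3Gram" baseline.
    return {text[i:i + 3] for i in range(len(text) - 2)}

def tfidf_1nn(train_sentences, train_labels, test_sentences):
    # "Cosine+TFIDF": fit the vectorizer on training data only, then assign
    # each test sentence the label of its most similar training instance.
    vec = TfidfVectorizer()
    train_m = vec.fit_transform(train_sentences)
    test_m = vec.transform(test_sentences)
    sims = cosine_similarity(test_m, train_m)
    return [train_labels[i] for i in sims.argmax(axis=1)]
```

For the Jaccard baselines, word-level 1-grams would be compared as `jaccard(a.split(), b.split())` and character-level 3-grams as `jaccard(char_3grams(a), char_3grams(b))`.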
4.2. Experimental Results
4.3. Analysis of Results
5. Related Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
2. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
3. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
4. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
5. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding with Unsupervised Learning. Technical Report, OpenAI, 2018. Available online: https://openai.com/index/language-unsupervised/ (accessed on 30 September 2024).
6. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
7. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
8. Li, Z.; Li, X.; Liu, Y.; Xie, H.; Li, J.; Wang, F.-L.; Li, Q.; Zhong, X. Label supervised LLaMA finetuning. arXiv 2023, arXiv:2310.01208.
9. Park, Y.; Shin, Y. A block-based interactive programming environment for large-scale machine learning education. Appl. Sci. 2022, 12, 13008.
10. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351.
11. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019.
12. Reimers, N.; Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020.
13. Schopf, T.; Braun, D.; Matthes, F. Evaluating unsupervised text classification: Zero-shot and similarity-based approaches. In Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, Sanya, China, 16–18 December 2022; pp. 6–15.
14. Park, Y.; Shin, Y. Tooee: A novel Scratch extension for K-12 big data and artificial intelligence education using text-based visual blocks. IEEE Access 2021, 9, 149630–149646.
15. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642.
16. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461.
17. Del Corso, G.M.; Gulli, A.; Romani, F. Ranking a stream of news. In Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, 10–14 May 2005; pp. 97–106.
18. Almeida, T.A.; Hidalgo, J.M.G.; Yamakami, A. Contributions to the study of SMS spam filtering: New collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering, Mountain View, CA, USA, 19–22 September 2011; pp. 259–262.
19. Li, X.; Roth, D. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, 24 August–1 September 2002.
20. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
21. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
22. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
23. Bekamiri, H.; Hain, D.S.; Jurowetzki, R. PatentSBERTa: A deep NLP-based hybrid model for patent distance and classification using augmented SBERT. Technol. Forecast. Soc. Chang. 2024, 206, 123536.
24. Thakur, N.; Reimers, N.; Daxenberger, J.; Gurevych, I. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 296–310.
25. Piao, G. Scholarly text classification with Sentence BERT and entity embeddings. In Proceedings of the Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2021 Workshops, Delhi, India, 11 May 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 79–87.
26. Stammbach, D.; Ash, E. DocSCAN: Unsupervised text classification via learning from neighbors. arXiv 2021, arXiv:2105.04024.
27. Petrovic, A.; Jovanovic, L.; Bacanin, N.; Antonijevic, M.; Savanovic, N.; Zivkovic, M.; Milovanovic, M.; Gajic, V. Exploring metaheuristic optimized machine learning for software defect detection on natural language and classical datasets. Mathematics 2024, 12, 2918.
28. Sung, Y.W.; Park, D.S.; Kim, C.G. A study of BERT-based classification performance of text-based health counseling data. CMES-Comput. Model. Eng. Sci. 2023, 135, 1–20.
29. Veisi, H.; Awlla, K.M.; Abdullah, A.A. KuBERT: Central Kurdish BERT model and its application for sentiment analysis. Res. Sq. 2024.
Training Dataset | # of Classes | # of Samples | Average Class Sample Size | Minimum Class Sample Size | Maximum Class Sample Size |
---|---|---|---|---|---|
SST 100% | 2 | 67,349 | 33,675 | 29,780 | 37,569 |
SST 10% | 2 | 6734 | 3367 | 2932 | 3802 |
SST 1% | 2 | 673 | 337 | 291 | 382 |
SST 0.1% | 2 | 67 | 34 | 31 | 36 |
AGNEWS 100% | 4 | 120,000 | 30,000 | 30,000 | 30,000 |
AGNEWS 10% | 4 | 12,000 | 3000 | 2964 | 3019 |
AGNEWS 1% | 4 | 1200 | 300 | 269 | 320 |
AGNEWS 0.1% | 4 | 120 | 30 | 22 | 42 |
SMSSpam 100% | 2 | 5074 | 2537 | 686 | 4388 |
SMSSpam 10% | 2 | 507 | 254 | 55 | 452 |
SMSSpam 1% | 2 | 50 | 25 | 8 | 42 |
TREC 100% | 6 | 5452 | 909 | 86 | 1250 |
TREC 10% | 6 | 545 | 91 | 6 | 125 |
Test Dataset | # of Classes | # of Samples | Average Class Sample Size | Minimum Class Sample Size | Maximum Class Sample Size |
---|---|---|---|---|---|
SST | 2 | 872 | 436 | 428 | 444 |
AGNEWS | 4 | 7600 | 1900 | 1900 | 1900 |
SMSSpam | 2 | 500 | 250 | 61 | 439 |
TREC | 6 | 500 | 83 | 9 | 138 |
Bi-Encoder ID | Model Name | # of Embedding Dimensions | # of Parameters |
---|---|---|---|
Bi-encoder 1 | all-mpnet-base-v2 | 768 | 109,486,464 |
Bi-encoder 2 | multi-qa-mpnet-base-dot-v1 | 768 | 109,486,464 |
Bi-encoder 3 | all-distilroberta-v1 | 768 | 82,118,400 |
Bi-encoder 4 | all-MiniLM-L12-v2 | 384 | 33,360,000 |
Bi-encoder 5 | multi-qa-distilbert-cos-v1 | 768 | 66,362,880 |
Bi-encoder 6 | all-MiniLM-L6-v2 | 384 | 22,713,216 |
Bi-encoder 7 | multi-qa-MiniLM-L6-cos-v1 | 384 | 22,713,216 |
Bi-encoder 8 | paraphrase-multilingual-mpnet-base-v2 | 768 | 278,043,648 |
Bi-encoder 9 | paraphrase-albert-small-v2 | 768 | 11,683,584 |
Bi-encoder 10 | paraphrase-multilingual-MiniLM-L12-v2 | 384 | 117,653,760 |
Bi-encoder 11 | paraphrase-MiniLM-L3-v2 | 384 | 17,389,824 |
Bi-encoder 12 | distiluse-base-multilingual-cased-v1 | 512 | 135,127,808 |
Bi-encoder 13 | distiluse-base-multilingual-cased-v2 | 512 | 135,127,808 |
Model Name | SST 100% | SST 10% | SST 1% | SST 0.1% | SST Avg. | AGNEWS 100% | AGNEWS 10% | AGNEWS 1% | AGNEWS 0.1% | AGNEWS Avg. | SMSSpam 100% | SMSSpam 10% | SMSSpam 1% | SMSSpam Avg. | TREC 100% | TREC 10% | TREC Avg. | Total Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Existing Approaches | ||||||||||||||||||
BERT | 90.83 | 89.45 | 87.96 | 50.46 | 79.68 | 94.50 | 92.32 | 89.57 | 74.67 | 87.77 | 98.80 | 98.60 | 96.20 | 97.87 | 96.60 | 90.60 | 93.60 | 88.50 |
BERT-10 | 91.74 | 90.60 | 87.61 | 55.96 | 81.48 | 94.22 | 92.17 | 88.86 | 78.62 | 88.47 | 99.40 | 98.60 | 94.60 | 97.53 | 97.20 | 90.40 | 93.80 | 89.23 |
DistilBERT | 91.17 | 85.78 | 84.29 | 49.08 | 77.58 | 94.21 | 91.71 | 89.29 | 83.01 | 89.56 | 99.40 | 98.00 | 87.80 | 95.07 | 97.00 | 88.80 | 92.90 | 87.66 |
DistilBERT-10 | 89.79 | 89.45 | 83.03 | 51.72 | 78.50 | 94.37 | 91.25 | 88.67 | 83.70 | 89.50 | 99.20 | 98.80 | 87.80 | 95.27 | 97.00 | 90.60 | 93.80 | 88.11 |
ALBERT | 89.11 | 84.06 | 80.73 | 50.00 | 75.98 | 93.67 | 91.17 | 88.53 | 73.83 | 86.80 | 98.60 | 96.60 | 96.40 | 97.20 | 94.20 | 88.80 | 91.50 | 86.59 |
ALBERT-10 | 91.28 | 88.88 | 87.04 | 51.49 | 79.67 | 94.04 | 91.53 | 88.07 | 77.22 | 87.72 | 99.20 | 98.60 | 94.40 | 97.40 | 96.00 | 90.20 | 93.10 | 88.30 |
RoBERTa | 93.81 | 92.20 | 88.42 | 49.08 | 80.88 | 95.13 | 92.71 | 90.33 | 25.00 | 75.80 | 99.60 | 99.40 | 87.80 | 95.60 | 97.40 | 85.40 | 91.40 | 84.33 |
RoBERTa-10 | 94.04 | 92.89 | 88.88 | 49.08 | 81.22 | 94.45 | 92.38 | 89.99 | 87.59 | 91.10 | 99.60 | 99.00 | 87.80 | 95.47 | 97.60 | 93.80 | 95.70 | 89.78 |
Proposed Approaches | ||||||||||||||||||
Approaches Without Pretrained Transformers | ||||||||||||||||||
GloVe.6B.300d | 64.68 | 63.19 | 59.52 | 55.16 | 60.64 | 88.95 | 85.50 | 83.57 | 78.03 | 84.01 | 96.40 | 94.40 | 89.60 | 93.47 | 51.80 | 41.60 | 46.70 | 73.26 |
Jaccard+Word1Gram | 60.67 | 61.24 | 53.44 | 52.41 | 56.94 | 84.13 | 74.04 | 60.96 | 42.41 | 65.39 | 98.00 | 94.60 | 89.00 | 93.87 | 81.20 | 54.60 | 67.90 | 69.75 |
Jaccard+Char3Gram | 66.17 | 62.16 | 58.14 | 51.26 | 59.43 | 88.33 | 82.03 | 70.43 | 54.09 | 73.72 | 98.80 | 96.00 | 93.20 | 96.00 | 79.60 | 61.00 | 70.30 | 73.94 |
Cosine+TFIDF | 70.07 | 66.74 | 59.98 | 53.90 | 62.67 | 88.26 | 83.70 | 74.79 | 57.03 | 75.95 | 98.40 | 96.00 | 93.00 | 95.80 | 64.00 | 65.60 | 64.80 | 74.73 |
Bi-Encoder-Based Approaches | ||||||||||||||||||
Model NN | 74.43 | 66.06 | 65.14 | 64.11 | 67.44 | 90.00 | 87.49 | 83.24 | 77.01 | 84.44 | 98.60 | 95.00 | 93.60 | 95.73 | 64.80 | 52.40 | 58.60 | 77.84 |
Model MV | 73.74 | 72.48 | 71.79 | 66.06 | 71.02 | 92.11 | 90.26 | 87.05 | 82.75 | 88.04 | 98.60 | 95.40 | 95.60 | 96.53 | 76.60 | 64.80 | 70.70 | 82.10 |
Model WV | 74.66 | 72.71 | 71.90 | 65.94 | 71.30 | 92.13 | 90.21 | 87.18 | 83.08 | 88.15 | 98.60 | 95.40 | 95.40 | 96.47 | 76.40 | 64.80 | 70.60 | 82.19 |
Model FW | 74.89 | 72.59 | 70.87 | 67.09 | 71.36 | 92.49 | 90.45 | 87.62 | 84.14 | 88.68 | 99.00 | 95.60 | 96.00 | 96.87 | 78.60 | 63.40 | 71.00 | 82.52 |
Bi-Encoder + Adaptive Selection | ||||||||||||||||||
Model FW-AS | 83.49 | 83.94 | 84.40 | 69.84 | 80.42 | 92.49 | 91.04 | 88.46 | 84.74 | 89.18 | 99.40 | 99.20 | 97.60 | 98.73 | 89.60 | 80.60 | 85.10 | 88.06 |
Bi-Encoder + Adaptive Selection + Ensemble | ||||||||||||||||||
Model FW-AS-2BI | 86.01 | 86.47 | 86.24 | 79.93 | 84.66 | 92.88 | 91.04 | 88.61 | 85.58 | 89.53 | 99.60 | 99.40 | 97.80 | 98.93 | 89.00 | 80.80 | 84.90 | 89.49 |
Model FW-AS-3BI | 85.09 | 85.78 | 85.67 | 79.70 | 84.06 | 92.88 | 91.18 | 88.66 | 85.96 | 89.67 | 99.60 | 99.20 | 97.80 | 98.87 | 89.40 | 80.40 | 84.90 | 89.33 |
Bi-Encoder + Adaptive Selection + Ensemble + Existing Approach | ||||||||||||||||||
Model FW-AS-2BI-BT | 94.04 | 93.35 | 88.76 | 79.93 | 89.02 | 94.75 | 92.05 | 89.54 | 87.00 | 90.84 | 99.40 | 99.20 | 87.80 | 95.47 | 97.20 | 92.40 | 94.80 | 91.96 |
Bi-Encoder ID | SST 100% | SST 10% | SST 1% | SST 0.1% | SST Avg. | AGNEWS 100% | AGNEWS 10% | AGNEWS 1% | AGNEWS 0.1% | AGNEWS Avg. | SMSSpam 100% | SMSSpam 10% | SMSSpam 1% | SMSSpam Avg. | TREC 100% | TREC 10% | TREC Avg. | Total Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Bi-encoder 1 | 81.31 | 76.03 | 75.46 | 76.95 | 77.44 | 90.32 | 87.57 | 84.37 | 79.04 | 85.32 | 98.20 | 96.00 | 94.60 | 96.27 | 70.00 | 56.60 | 63.30 | 82.03 |
Bi-encoder 2 | 76.15 | 73.28 | 63.53 | 67.89 | 70.21 | 90.20 | 87.75 | 83.99 | 77.45 | 84.85 | 98.80 | 98.40 | 95.80 | 97.67 | 65.80 | 58.00 | 61.90 | 79.77 |
Bi-encoder 3 | 72.94 | 69.04 | 64.11 | 60.09 | 66.54 | 90.05 | 87.00 | 83.92 | 78.11 | 84.77 | 99.20 | 96.40 | 93.40 | 96.33 | 73.80 | 68.40 | 71.10 | 79.73 |
Bi-encoder 4 | 74.43 | 66.06 | 65.14 | 64.11 | 67.43 | 90.00 | 87.49 | 83.24 | 77.01 | 84.43 | 98.60 | 95.00 | 93.60 | 95.73 | 64.80 | 52.40 | 58.60 | 77.84 |
Bi-encoder 5 | 68.23 | 64.33 | 57.45 | 58.03 | 62.01 | 90.92 | 87.61 | 83.55 | 78.01 | 85.02 | 98.40 | 95.80 | 94.60 | 96.27 | 67.00 | 57.20 | 62.10 | 77.01 |
Bi-encoder 6 | 68.23 | 63.76 | 61.58 | 62.84 | 64.11 | 90.32 | 87.42 | 83.08 | 76.25 | 84.27 | 97.60 | 96.20 | 93.40 | 95.73 | 62.60 | 53.20 | 57.90 | 76.65 |
Bi-encoder 7 | 65.14 | 62.96 | 57.91 | 56.42 | 60.61 | 90.25 | 87.34 | 82.42 | 77.83 | 84.46 | 98.80 | 95.00 | 95.00 | 96.27 | 62.60 | 55.40 | 59.00 | 75.93 |
Bi-encoder 8 | 79.47 | 78.90 | 78.21 | 75.34 | 77.98 | 89.99 | 86.37 | 82.41 | 75.82 | 83.64 | 99.20 | 97.40 | 95.00 | 97.20 | 70.00 | 60.80 | 65.40 | 82.22 |
Bi-encoder 9 | 72.25 | 69.84 | 66.74 | 65.37 | 68.55 | 88.54 | 84.67 | 79.88 | 68.41 | 80.38 | 98.80 | 95.60 | 95.00 | 96.47 | 63.20 | 54.00 | 58.60 | 77.10 |
Bi-encoder 10 | 75.46 | 73.97 | 71.79 | 70.18 | 72.85 | 89.70 | 85.42 | 80.34 | 72.58 | 82.01 | 97.40 | 97.00 | 96.00 | 96.80 | 66.60 | 57.20 | 61.90 | 79.51 |
Bi-encoder 11 | 71.33 | 68.58 | 65.48 | 65.02 | 67.60 | 89.28 | 85.29 | 80.00 | 69.57 | 81.03 | 99.20 | 96.60 | 93.60 | 96.47 | 57.60 | 47.00 | 52.30 | 76.04 |
Bi-encoder 12 | 66.86 | 64.91 | 60.67 | 59.75 | 63.04 | 90.14 | 86.53 | 83.01 | 75.01 | 83.67 | 99.40 | 98.80 | 97.40 | 98.53 | 79.40 | 72.60 | 76.00 | 79.58 |
Bi-encoder 13 | 67.89 | 64.56 | 60.21 | 58.14 | 62.70 | 90.16 | 86.80 | 82.74 | 75.09 | 83.70 | 99.60 | 98.80 | 97.60 | 98.67 | 76.80 | 71.60 | 74.20 | 79.23 |