Active Learning: Encoder-Decoder-Outlayer and Vector Space Diversification Sampling

This study introduces a training pipeline comprising two components: the Encoder-Decoder-Outlayer framework and the Vector Space Diversification Sampling method. The framework efficiently separates the pre-training and fine-tuning stages, while the sampling method employs pivot nodes to divide the subvector space and selectively choose unlabeled data, thereby reducing the reliance on human labeling. The pipeline offers numerous advantages, including rapid training, parallelization, buffering capability, flexibility, low GPU memory usage, and a sampling method with nearly linear time complexity. Experimental results demonstrate that models trained with the proposed sampling algorithm generally outperform those trained with random sampling on small datasets. These characteristics make it a highly efficient and effective training approach for machine learning models. Further details can be found in the project repository on GitHub.


Introduction
The outcome of modeling is significantly influenced by labeled datasets, which are usually costly in terms of human effort. For many years, researchers relied on intuition or randomly sampled data points and split train-test data randomly. This paper was inspired by an active learning methodology that leverages neural networks to guide humans in preparing and labeling data, minimizing human effort and improving the overall performance of models.
Researchers in the field of natural language processing (NLP) commonly employ zero-shot, one-shot, and few-shot methods [1] to address issues of limited labeled data. However, these methods have limitations, such as the absence of modified parameters and limited customization, which make it difficult to achieve industry-level high scores for most large language models (LLMs). An alternative technique is the pre-training and fine-tuning method [2], which also requires substantial labeled data and is costly to train. Therefore, this paper proposes an Encoder-Decoder-Outlayer framework that addresses the aforementioned shortcomings and provides additional benefits.
To address the challenge of adapting pre-trained language models to specific downstream tasks without requiring extensive fine-tuning or re-training of the entire model, there are approaches similar to the adapter method [3]. However, these approaches, like the adapter, are nested inside the large language model (LLM), necessitating that the entire LLM fit in GPU memory for training and prediction. In this context, sampling methods are examined to categorize unlabeled datasets and thereby choose data that can enhance modeling accuracy. During training and prediction, only a small part of the pipeline needs to be held in GPU memory, and the experiments demonstrate the effectiveness of these techniques in resolving intricate classification challenges. The contributions of this work are as follows:
- Proposal of an Encoder-Decoder-Outlayer (EDO) active learning method for text classification;
- Exploration of the applicability of EDO, demonstrating its effectiveness in addressing issues of limited labeled data;
- Exploration of different models and techniques, such as BERTbase, S-BERT, Universal Sentence Encoder, Word2Vec, and Document2Vec, to optimize datasets for deep learning;
- Proposal of the use of T-SNE for dimension reduction and comparison of sentence vectors.

https://github.com/MarcoMozilla/cis_research_project

Literature Review
The optimization of datasets is a crucial part of deep learning, and it has been a critical research field for many researchers. This section reviews and compares related studies on classification (clustering) datasets and how the data are selected. There has been a limited amount of research on active learning (AL) in the context of text classification, especially with regard to the latest, cutting-edge natural language processing (NLP) models. The work in [10] involved an empirical analysis that evaluated various uncertainty-based algorithms utilizing BERTbase as the classifier. To compare different strategies for obtaining sentence vectors, ref. [11] set three objective functions for training and optimizing different tasks: a Classification Objective Function, a Regression Objective Function, and a Triplet Objective Function. All-mpnet-base-v2 is based on S-BERT. This framework is used to generate sentence or text embeddings, which can be compared to find sentences with similar meanings. In addition to S-BERT, Universal Sentence Encoder [12], Word2Vec [13], and Document2Vec [14] are all viable options.
For dimension reduction, ref. [15] presents T-SNE, a technique for visualizing high-dimensional data by giving each data point a location on a two- or three-dimensional map. The information contained in the high-dimensional vectors is preserved after they are transformed into low-dimensional vectors. The basic idea is that two vectors that are similar in the high-dimensional space should remain close after being reduced to a low dimension.
The authors of [1] propose that fine-tuning pre-trained models on small datasets with adapters that store in-domain knowledge and that are pre-trained in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries, improves the summary quality over standard fine-tuning and allows for summary personalization through aspect keyword queries. The authors of [2] examined the brittleness of fine-tuning pre-trained contextual word embedding models for natural language processing tasks by experimenting with four datasets from the GLUE benchmark and varying random seeds, finding substantial performance increases and quantifying how the performance of the best-found model varies with the number of fine-tuning trials, while also exploring factors influenced by the choice of random seed such as weight initialization and training data order.
The Encoder-Decoder architecture is commonly used in NLP tasks, such as machine translation and text summarization. The Encoder takes an input sequence, such as a sentence in one language, and transforms it into a fixed-dimensional vector representation. The Decoder then takes this representation as input and generates an output sequence, such as a translated sentence in another language. The work in [9] brings up a Transformer based on this Encoder-Decoder architecture.
For deeper neural network training, ref. [16] presents ResNet to ease the training of networks that are substantially deeper than those used previously. The learned representations also generalize well to other recognition tasks. However, overfitting may worsen results; combining the method with stronger regularization may improve them.
The authors of [9] propose that the Transformer, a new network architecture based solely on attention mechanisms, outperforms complex recurrent or convolutional neural networks with Encoder-Decoder attention mechanisms in machine translation tasks, achieving state-of-the-art BLEU scores with significantly less training time and cost, and showing good generalization to other tasks. The residual learning framework introduced by [16] won first place in the ILSVRC 2015 classification task and improved performance on the COCO object detection dataset. It eases the training of substantially deeper neural networks and achieves higher accuracy.
The authors of [17] developed a procedure for Int8 matrix multiplication in Transformer that reduces the memory needed for inference by half while retaining full precision performance by using vector-wise quantization and a mixed-precision decomposition scheme to cope with highly systematic emergent features in language models, enabling up to 175B-parameter LLMs to be used without any performance degradation.
For image classification, the authors of [16] present the use of parametric rectified linear units (PReLU) and a robust initialization method for training extremely deep rectified neural networks, achieving a 4.94% top-5 test error on the ImageNet 2012 classification dataset and surpassing human-level performance for the first time. In 2018, the work in [18] compared the performance of seven commonly used stochastic-gradient-based optimization techniques in a convolutional neural network (ConvNet), and Nadam achieved the best performance.

Data Description
The IMDB dataset contains highly polar movie reviews. The Amazon_polarity dataset contains product reviews from Amazon. Each sample from these two datasets was annotated by labels: 0 (negative) or 1 (positive). The Ag_news dataset is a collection of news articles gathered by ComeToMyHead. Each sample was labeled according to its category: World (0), Sports (1), Business (2), and Sci/Tech (3). The Emotion dataset contains Twitter messages classified by emotions, including sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5). The DBpedia dataset is constructed from 14 different classes in DBpedia. Each sample is annotated by its class. The YelpReviewFull dataset contains reviews from Yelp, each annotated by labels from 0 to 5 corresponding to the score associated with the review.

Methodology
The approach described draws inspiration from the concept of diversified investments in the realm of financial investment. In traditional financial investment strategies, the fundamental idea is to construct a portfolio by combining a set of unrelated assets. By diversifying the portfolio, investors aim to reduce overall risk while potentially increasing the return on investment [19,20]. The approach discussed here adopts the same principle of diversification to tackle a different kind of risk in the context of machine learning models.
To reduce variance and enhance the performance of the model, the approach utilizes a collection of smaller models instead of relying on a single large model. This ensemble of models is designed to work in tandem within an Encoder-Decoder framework. The Encoder part of the framework maps real data to encoded vectors, while the Decoder part allows for sampling on these encoded vectors. By performing sampling, the approach introduces an ordered structure to the dataset, where the first few data points provide the greatest amount of diversity.
The encoded vector spaces generated by the model maintain a unique property: similar original data points will have a smaller distance between each other in the vector space. This property facilitates the effective organization and representation of the data, enabling the model to capture important patterns and relationships more efficiently.
In the process of training the model, a subset of the dataset is selected for manual labeling, as opposed to labeling the entire dataset. This strategic approach minimizes the resources required for manual labeling while still obtaining valuable labeled data. The labeled subset is then used to train a simple Outlayer model. This model takes the encoded vectors as input and produces human labels as output. By training on this subset, the model can learn to generalize and predict labels for the remaining unlabeled data.
For building the training model, the approach adopts the Nadam optimizer. Nadam, a combination of Nesterov accelerated gradient descent [21] and Adam optimization algorithm [22], offers distinct advantages. It provides greater control over the learning rate and directly influences the gradient update, resulting in improved convergence and potentially faster training times.
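As an illustration, one common formulation of the Nadam update can be sketched in NumPy. This is a minimal toy version under standard hyperparameter assumptions, not the paper's training code, shown minimizing a simple quadratic:

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.002, b1=0.9, b2=0.999, eps=1e-8):
    """One Nadam update: Adam's moment estimates plus a Nesterov-style
    look-ahead that mixes the corrected momentum with the current gradient."""
    m = b1 * m + (1 - b1) * grad              # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2         # second moment (scale)
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    update = b1 * m_hat + (1 - b1) * grad / (1 - b1 ** t)  # Nesterov blend
    return theta - lr * update / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = (x - 3)^2 starting from x = 0.
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * (theta - 3.0)
    theta, m, v = nadam_step(theta, grad, m, v, t, lr=0.05)
```

The Nesterov blend in the update term is what gives Nadam its look-ahead character, i.e., the direct influence on the gradient update mentioned above.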
By incorporating these strategies and techniques inspired by diversified financial investments, the approach aims to mitigate risk and enhance performance in the realm of machine learning. The use of smaller models, the organization of data through encoded vectors, and the selective labeling process all contribute to a more robust and efficient learning framework. Additionally, the choice of the Nadam optimizer further optimizes the training process, ultimately leading to better outcomes in terms of accuracy and generalization.
One advantage of this approach is the separation of feature extraction and output, which helps reduce GPU RAM usage. During the pre-training stage, only raw data are required, and the Encoder-Decoder model is stored in the GPU RAM. During the encoded-vector buffering step, only the Encoder is kept in the GPU RAM. Similarly, during the sampling stage, only the sampling algorithm is running. When training the Outlayer, only the simple Outlayer and the batch data are stored in RAM.
By breaking down the large prediction model into smaller parts and executing the process step-by-step, the Outlayer model can accommodate more encoded vectors and process a larger batch of items within a fixed GPU RAM capacity. This partitioning of tasks and resource allocation allows for more efficient memory management during the different stages of the approach. It ensures that only the necessary components are stored in the GPU RAM at any given time, freeing up space for other operations. The advantage of this approach becomes particularly evident when dealing with large datasets or when working with limited GPU resources. By carefully managing the GPU RAM usage, the approach enables the model to handle a greater number of encoded vectors and process larger batches of items, without exceeding the memory constraints. This scalability and flexibility contribute to the overall effectiveness and practicality of the approach, making it suitable for a wide range of applications.

Basic Framework
The model utilized in this study consists of three primary components: Encoder, Decoder, and Outlayer. The Encoder component is responsible for transforming the data into feature vectors, which are then subjected to Vector Space Diversification sampling. This sampling process reorganizes the dataset, and the first N samples are selected for training the model. Figure 1 depicts the Encoder-Decoder-Outlayer framework, which comprises an Encoder, a Decoder, and an Outlayer.
During training, the chosen loss function is cross-entropy loss with weight. This loss function helps measure the discrepancy between the predicted outputs and the actual labels, taking into account the importance assigned to each class. Additionally, F1-score guidance is employed as a trigger mechanism. If the F1-score decreases below the previous score, specific actions are initiated to address and rectify the issue. Overall, the model's architecture and training process aim to effectively encode the data, generate diverse samples through vector space diversification, and train the model using the selected samples. The use of cross-entropy loss with weight assists in optimizing the model's performance, while the F1-score guidance helps monitor and manage the training progress, ensuring that the model maintains or improves its performance throughout the training process.
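As a sketch of the loss described above, the following NumPy snippet implements cross-entropy with per-class weights. Normalizing by the sum of the sample weights is one common convention; the paper's exact weight formula is not reproduced here:

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy loss where each sample is weighted by its class weight;
    the batch loss is normalized by the total weight (a common convention)."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]        # per-sample loss
    w = class_weights[labels]                               # per-sample weight
    return (w * nll).sum() / w.sum()

logits = np.zeros((2, 2))          # maximally uncertain predictions
labels = np.array([0, 1])
loss = weighted_cross_entropy(logits, labels, np.array([1.0, 1.0]))
# loss is ln 2 here: uniform predictions over two classes
```

Raising a class's weight increases its contribution to the batch loss, which counteracts class imbalance in the labeled subset.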

Settings
The Encoder used in this study was the "all-mpnet-base-v2" Sentence-BERT model with 768 features. It was employed to transform both the train and test datasets into vectors. Figure 2 illustrates the architecture of the Outlayer, which consists of a three-layer ResNet framework with PReLU activation, batch normalization, a linear layer, and a hidden layer size set to twice the cluster number.

The study employed the Nadam optimizer with an initial learning rate of 0.1 to train the three-layer ResNet framework. The activation function was PReLU, and a batch normalization [23] layer was applied before the linear layer. The hidden layer size was set to twice the cluster number, and cross-entropy loss with weight was used. The weight was determined using a specific formula.
F1-score-guidance was used, which triggered when the F1-score decreased below the previous score. When this occurred, the learning rate was reduced by half and the forgiveness count was decreased by one. The forgiveness count was initialized at 12, and when it reached zero, the training stopped. The F1-score threshold for early stop was set at 0.995.
The use of F1-score guidance and early stopping eliminated the need for a validation set, as no other models were compared. During training, the model was only saved if the loss was not NaN and the F1-score had improved. The study was conducted on an NVIDIA GeForce RTX 3080 Ti GPU.
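The F1-score guidance above can be sketched as a simple control loop. This hypothetical replay over a list of per-epoch F1 scores mirrors the stated rules: halve the learning rate and spend one forgiveness on each drop, and stop when forgiveness reaches zero or F1 reaches the 0.995 threshold:

```python
def f1_guided_schedule(f1_history, lr=0.1, forgiveness=12, early_stop=0.995):
    """Replay per-epoch F1 scores through the guidance rule and report the
    stopping epoch, final learning rate, remaining forgiveness, and reason."""
    prev = -1.0
    for epoch, f1 in enumerate(f1_history):
        if f1 >= early_stop:
            return epoch, lr, forgiveness, "early_stop"
        if f1 < prev:                 # F1 dropped below the previous score
            lr *= 0.5                 # halve the learning rate
            forgiveness -= 1          # spend one forgiveness
            if forgiveness == 0:
                return epoch, lr, forgiveness, "out_of_forgiveness"
        prev = f1
    return len(f1_history) - 1, lr, forgiveness, "exhausted_epochs"

epoch, lr, forg, status = f1_guided_schedule([0.5, 0.6, 0.55, 0.7])
```

A single drop (0.6 to 0.55) halves the learning rate once and leaves eleven forgiveness counts.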
GPU memory optimization and data buffering: this pipeline optimizes GPU memory and data buffering, allowing for efficient training of the Encoder-Decoder with separate training steps for each part of the network. The encoded vectors are smaller than the raw data, allowing for larger batches in Outlayer training. This approach can reduce the overall training cost of the neural network.
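A minimal sketch of the encode-once buffering step, with a stand-in random encoder in place of Sentence-BERT (the `encode` function and the buffer file name are assumptions for illustration):

```python
import os
import tempfile
import numpy as np

def encode(texts, dim=768):
    """Stand-in for the Sentence-BERT encoder: any function mapping a list
    of texts to fixed-size float vectors slots into the pipeline the same way."""
    rng = np.random.default_rng(0)          # deterministic for this sketch
    return rng.normal(size=(len(texts), dim)).astype(np.float32)

texts = ["a review", "another review", "a third review"]

# Encode once and buffer to disk; later stages (sampling, Outlayer training)
# reload the vectors instead of re-running the expensive Encoder.
buf = os.path.join(tempfile.mkdtemp(), "encoded.npy")
np.save(buf, encode(texts))
vectors = np.load(buf)
```

Because the buffered vectors are far smaller than raw text plus the Encoder's activations, Outlayer training can use much larger batches within the same memory budget.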

Vector Space Diversification Sampling
The basic idea is to find the 'center' point among the training set encoding vectors for each dataset. Then, we select the center point as the root and perform a binary split in each feature dimension. We record the comparison status as 0 or 1, which allows us to obtain a binary representation of an integer. We can then utilize these integers as keys to represent the vector subspace and create branches based on these keys. This process is repeated recursively for each branch.
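One level of the binary split described above can be sketched as follows: compare every point to the pivot in each feature dimension, read the 0/1 comparison flags as the bits of an integer key, and group points by key. The full method then recurses into each branch; this sketch shows a single level:

```python
import numpy as np

def subspace_keys(X, pivot=None):
    """Assign each point an integer subspace key by comparing every feature
    to the pivot (the centroid by default): bit d is 1 iff x[d] >= pivot[d]."""
    if pivot is None:
        pivot = X.mean(axis=0)               # 'center' point of the set
    bits = (X >= pivot).astype(int)          # (n, d) matrix of 0/1 flags
    powers = 1 << np.arange(X.shape[1])      # 2^0, 2^1, ... per dimension
    return bits @ powers                     # read the flags as an integer

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
keys = subspace_keys(X)   # centroid (0.5, 0.5) splits into four quadrants
```

Each comparison costs O(d) per point, so one level of splitting over n points runs in O(nd), consistent with the nearly linear time complexity claimed for the sampling method.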
In this study, a method is proposed for sampling points in a vector space to explore the variety of the feature space. A center-pivot picking algorithm selects a representative point of the space and divides the space into smaller subspaces. Distance measures, such as Euclidean distance and cosine similarity, are used to measure the distance between points. To introduce randomness, the ranks produced by each algorithm are merged and the indices are re-sorted. The outputs of the different methods are blended to create a series of sample methods. The behavior of the algorithms is visualized using 2D points sampled in a circle. Results show that exploring the first few indices after reranking provides the greatest diversity of the feature space. Although cosine similarity may be a reasonable choice as a distance measure, since Sentence-BERT is designed to work with unit vectors and perform cosine similarity on text pairs, our experiments did not find it to make a significant difference [24]. Nonetheless, our approach still provides a useful method for sampling points in a vector space to explore its variety. Figure 3 shows the sampling executed on a 2D unit circle employing a Gaussian distribution of theta values.
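The blending of the outputs of different sample methods admits a simple reading as a round-robin merge of their index orderings. This sketch is a plausible interpretation of the description, not necessarily the paper's exact procedure:

```python
def blend_orderings(*orderings):
    """Round-robin merge of several index orderings over the same items,
    dropping duplicates, so the blended sample draws evenly from each
    ranking method's top picks first."""
    seen, blended = set(), []
    for group in zip(*orderings):     # take rank 0 from each method, then rank 1, ...
        for idx in group:
            if idx not in seen:
                seen.add(idx)
                blended.append(idx)
    return blended
```

Taking a prefix of the blended ordering then yields a sample that mixes the diversity criteria of all contributing methods.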

Data and Vector Space: Understanding and Unfamiliarity
Understanding refers to measuring the extent to which a given set of vectors comprehends the properties of the data points or subspaces contained within it. It is measured by assessing how well the system identifies and represents all possible points in the vector space.
Unfamiliarity, on the other hand, refers to evaluating the degree to which a given data point is unfamiliar to a specific subspace within the vector space. This measure can be used to inform AI systems of the level of confusion or disinterest they should feel towards certain data points, based on their level of familiarity with the subspace to which they belong. These metrics can be described mathematically with the following definitions: let V be a vector space and B be a set of real or virtual points within that vector space. Let x be a vector that belongs to V, and let D be a distance function (e.g., Euclidean distance, arccos of cosine similarity, etc.). Finally, let g be an adjustment function that is positive, monotonically increasing, and has a monotonically decreasing derivative.
To ensure stable results and secure float representation, we can use the inverse of percentiles to obtain the rate, although this may introduce additional complexity.
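Since the formulas themselves are not reproduced here, the following is a hypothetical instance consistent with the definitions above: unfamiliarity as the g-adjusted distance from x to its nearest point in B, with g = log1p, which is positive, monotonically increasing, and has a decreasing derivative:

```python
import numpy as np

def unfamiliarity(x, B, g=np.log1p):
    """Hypothetical instance of the metric: distance from x to the nearest
    (real or virtual) point of B, squashed by the adjustment function g."""
    d = np.linalg.norm(B - x, axis=1)   # Euclidean D(x, b) for every b in B
    return g(d.min())

B = np.array([[0.0, 0.0], [1.0, 0.0]])              # points the system knows
near = unfamiliarity(np.array([0.1, 0.0]), B)       # close to a known point
far  = unfamiliarity(np.array([5.0, 5.0]), B)       # deep in unseen territory
```

Under this reading, points with high unfamiliarity relative to every subspace are the ones worth sending to a human for labeling.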

Results
This study investigated the effectiveness of different data sampling methods on the performance of trained models. The VSD sampling algorithm, which selects items that maximize understanding during the sampling process, was compared to a random sampling method. The experimental results indicate that models trained with the VSD sampling algorithm typically outperform those trained with random sampling on small datasets. However, once a large portion of the dataset is used, there is little difference, since the sample converges to the whole training set as its size approaches the maximum.
The improvement in model performance is more significant in metrics such as recall, F1, and accuracy, but not in precision score. The experiment demonstrated that VSD sampling leads to a substantial improvement in F1-score for several datasets, including amazon_polarity, dbpedia, agnews, and emotion, at small sample sizes. A 50-item trained model was evaluated using F1-score on a test set. The results of the study are presented in Table 1. F1 (trivial) represents the F1-score of randomly selecting items from each class. F1 (rand) represents the F1-score of the random sampling method. F1 (VSD min), F1 (VSD ave), and F1 (VSD max) are the F1-scores of the VSD sampling algorithm when selecting items with the minimum, average, and maximum understanding, respectively. F1 (VSD min-rand), F1 (VSD ave-rand), and F1 (VSD max-rand) represent the differences between the F1-scores of the VSD sampling algorithm and the random sampling method when selecting items with the minimum, average, and maximum understanding, respectively. The F1-score, accuracy, precision, and recall for each sampling approach on each dataset are illustrated in Figures 4-6. However, the nature of the Sentence-BERT encoding used in the experiment may have limited the performance on some datasets. The black line in the figures represents the trivial F1-score baseline achieved by randomly selecting items from each class. As the size of the dataset increases, the difference between the VSD sampling and random sampling methods becomes less significant. Table 1 shows the enhancements observed across various dataset sample sizes.
Overall, these findings suggest that VSD sampling can improve model performance on small datasets, but its effectiveness may vary depending on the nature of the data and the encoding method used. Therefore, researchers should consider using VSD sampling in conjunction with appropriate encoding techniques to improve model performance.

Conclusions
Data diversity is crucial for enhancing the performance of neural network models, and simply increasing the amount of data without considering diversity can be misleading. Traditionally, training datasets have contained redundant data, and researchers have resorted to brute force or AI-generated data to enhance diversity, which can be resource-intensive.
To address this issue, we propose an Encoder-Decoder-Outlayer (EDO) pipeline and a VSD sampling algorithm that leverages a pre-trained Encoder-Decoder framework for feature extraction. Our approach involves using a compact output layer and efficiently exploring the diversity of the encoded feature or hidden layer vector space to prevent overfitting and improve performance, even with limited data.
Experimental results demonstrate that our approach can yield satisfactory results in tasks that previously demanded substantial amounts of data. By employing a pretrained Encoder model for feature extraction and incorporating a small output layer, we can conserve computational resources and reduce human labor. Furthermore, storing the encoding process in a buffer allows for data to be encoded only once, further diminishing computational costs. Future work may involve extending the application of the EDO pipeline and VSD sampling to other tasks and developing a more generalized Encoder-Decoder approach.