REKP: Refined External Knowledge into Prompt-Tuning for Few-Shot Text Classification
Abstract
1. Introduction
- We introduce external knowledge to enhance the verbalizer by expanding each label with related words, which refines the granularity of the label thesaurus and enriches the mapping range of the labels (a construction sketch follows this list).
- We propose three methods to refine the enhanced verbalizer, removing noisy words to optimize the model.
- Extensive experiments conducted on four benchmark datasets, namely AG’s News, Yahoo, IMDB, and Amazon, demonstrate the superiority of REKP over the competitive baselines in terms of Micro-F1 on knowledge-enhanced text classification tasks.
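To make the first two contributions concrete, below is a minimal sketch of the "expand, then refine" verbalizer idea. WordNet as the knowledge source and the 0.5 cosine-similarity threshold are illustrative assumptions, not the authors' exact pipeline (Section 3.3 describes the actual WR/CR/IR refinements).

```python
# Sketch: expand a class name with external knowledge, then refine the result.
# WordNet and the similarity threshold are stand-ins for the paper's pipeline.
import torch
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
emb = mlm.get_input_embeddings().weight.detach()  # [vocab_size, hidden]

def expand_label(name: str) -> set[str]:
    """Expand a class name with related words from the external knowledge base."""
    words = {name}
    for syn in wordnet.synsets(name):
        words.update(lemma.name().replace("_", " ") for lemma in syn.lemmas())
    return words

def refine(name: str, candidates: set[str], threshold: float = 0.5) -> dict[str, float]:
    """WR: keep in-vocabulary single tokens; CR: drop weakly correlated words;
    IR: weight the survivors (similarity reused here as a stand-in importance)."""
    anchor_ids = tokenizer(" " + name, add_special_tokens=False)["input_ids"]
    anchor = emb[anchor_ids].mean(dim=0)
    kept = {}
    for word in candidates:
        ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
        if len(ids) != 1:        # WR: must map to a single PLM token
            continue
        sim = torch.cosine_similarity(emb[ids[0]], anchor, dim=0).item()
        if sim < threshold:      # CR: discard weakly correlated noise words
            continue
        kept[word] = sim         # IR: importance weight for this label word
    total = sum(kept.values()) or 1.0
    return {w: s / total for w, s in kept.items()}

# e.g. refine("sports", expand_label("sports")) -> weighted label thesaurus
```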
2. Related Work
2.1. Few-Shot Text Classification
2.2. Prompt Tuning for Few-Shot Learning
3. Proposed Approach
3.1. Overall Framework
3.2. Template Conversion
3.3. Verbalizer Construction
3.3.1. Label Word Refinement (WR)
3.3.2. Correlation Refinement (CR)
3.3.3. Importance Refinement (IR)
3.4. Verbalizer Utilization
4. Experiments
4.1. Research Questions
- (RQ1) Can our proposed REKP outperform the competitive baselines?
- (RQ2) How does each refinement in the verbalizer contribute to REKP?
- (RQ3) How do REKP and the baselines perform in terms of classification accuracy as the number of training samples varies?
- (RQ4) What is the impact of refinement on the size of the label thesaurus?
4.2. Datasets and Evaluation Metrics
4.3. Baselines
- FT inputs the [CLS] hidden embedding of the pre-trained model into the classification layer for label prediction.
- PT [8] is a classical prompt-learning model for few-shot text classification. It inserts a short text template containing the [MASK] token, uses the category name as the label word, and maps the prediction at the [MASK] position directly to the category.
- AUTO [9] is a model that automatically chooses the most informative terms from the PLM vocabulary as tag words, eliminating the need for manually established class names.
- SOFT [10] uses continuous vectors to represent the classes; the MLM output is dot-multiplied with the class vectors to obtain the predicted probability of each class. REKP carries this method forward, initializing the class vectors with the word embeddings of the class names (a scoring sketch follows this list).
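As a point of reference for how the prompt-based baselines score a class, the sketch below reads the MLM logits at the [MASK] position and compares the label words. The checkpoint, template, and label words are placeholder assumptions, not the paper's exact configuration.

```python
# Sketch: score classes by comparing label-word logits at the [MASK] position.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# hypothetical label words for AG's News-style topic classification
label_words = {"World": "world", "Sports": "sports",
               "Business": "business", "Sci/Tech": "technology"}

def word_id(word: str) -> int:
    ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
    return ids[0]  # assumes the label word is a single vocabulary token

text = "The Lakers clinched the championship last night."
prompt = f"A {tokenizer.mask_token} news : {text}"  # template style of Section 3.2
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = mlm(**inputs).logits  # [1, seq_len, vocab_size]

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
scores = {label: logits[0, mask_pos, word_id(w)].item()
          for label, w in label_words.items()}
print(max(scores, key=scores.get))  # -> "Sports", ideally
```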
4.4. Model Configuration
5. Results and Discussion
5.1. Overall Performance
5.2. Ablation Study
- w/o WR removes Label Word Refinement from REKP;
- w/o CR removes Correlation Refinement from REKP;
- w/o IR removes Importance Refinement from REKP;
- w/o RRR removes all three refinement modules from REKP.
5.3. Impact of Sample Number
5.4. Refinement Visualization
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146.
- Nakamura, A.; Harada, T. Revisiting fine-tuning for few-shot learning. arXiv 2019, arXiv:1910.00216.
- Shen, Z.; Liu, Z.; Qin, J.; Savvides, M.; Cheng, K.-T. Partial is Better Than All: Revisiting Fine-Tuning Strategy for Few-Shot Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 9594–9602.
- Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150.
- Kong, J.; Wang, J.; Zhang, X. Hierarchical BERT with an adaptive fine-tuning strategy for document classification. Knowl. Based Syst. 2022, 238, 107872.
- Jacobs, G.; Van Hee, C.; Hoste, V. Automatic classification of participant roles in cyberbullying: Can we detect victims, bullies, and bystanders in social media text? Nat. Lang. Eng. 2022, 28, 141–166.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Schick, T.; Schütze, H. Exploiting cloze questions for few shot text classification and natural language inference. arXiv 2020, arXiv:2001.07676.
- Schick, T.; Schmid, H.; Schütze, H. Automatically identifying words that can serve as labels for few-shot text classification. arXiv 2020, arXiv:2010.13641.
- Hambardzumyan, K.; Khachatrian, H.; May, J. WARP: Word-level adversarial reprogramming. arXiv 2021, arXiv:2101.00121.
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. arXiv 2015, arXiv:1509.01626.
- Han, C.; Fan, Z.; Zhang, D.; Qiu, M.; Gao, M.; Zhou, A. Meta-learning adversarial domain adaptation network for few-shot text classification. arXiv 2021, arXiv:2107.12262.
- Sui, D.; Chen, Y.; Mao, B.; Qiu, D.; Liu, K.; Zhao, J. Knowledge guided metric learning for few-shot text classification. arXiv 2020, arXiv:2004.01907.
- Sun, P.; Ouyang, Y.; Zhang, W.; Dai, X. MEDA: Meta-Learning with Data Augmentation for Few-Shot Text Classification. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 3929–3935.
- Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, Y. Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002.
- Boney, R.; Ilin, A. Semi-Supervised Few-Shot Learning with MAML. 2018. Available online: https://openreview.net/forum?id=r1n5Osurf (accessed on 3 October 2023).
- Bayer, M.; Kaufhold, M.-A.; Reuter, C. A Survey on Data Augmentation for Text Classification. ACM Comput. Surv. 2022, 55, 1–39.
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
- Kim, H.H.; Woo, D.; Oh, S.J.; Cha, J.-W.; Han, Y.-S. ALP: Data Augmentation Using Lexicalized PCFGs for Few-Shot Text Classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’2022), Vancouver, BC, Canada, 22 February–1 March 2022; pp. 10894–10902.
- Li, K.; Zhang, Y.; Li, K.; Fu, Y. Adversarial Feature Hallucination Networks for Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’2020), Virtual, 13–19 June 2020; pp. 13470–13479.
- Dixit, M.; Kwitt, R.; Niethammer, M.; Vasconcelos, N. AGA: Attribute-guided augmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7455–7463.
- Lyu, H.; Sha, N.; Qin, S.; Yan, M.; Xie, Y.; Wang, R. Manifold denoising by nonlinear robust principal component analysis. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://par.nsf.gov/biblio/10195511 (accessed on 3 October 2023).
- Chen, J.; Zhang, R.; Mao, Y.; Xu, J. ContrastNet: A Contrastive Learning Framework for Few-Shot Text Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 10492–10500.
- Munkhdalai, T.; Yu, H. Meta Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2554–2563.
- Guo, Y.; Cheung, N.-M. Attentive Weights Generation for Few Shot Learning via Information Maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020; pp. 13499–13508.
- Hong, S.K.; Jang, T.Y. LEA: Meta Knowledge-Driven Self-Attentive Document Embedding for Few-Shot Text Classification. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 99–106.
- Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-shot Learning. Adv. Neural Inf. Process. Syst. 2017, 30.
- Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6407–6414.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. arXiv 2021, arXiv:2103.10385.
- Han, X.; Zhao, W.; Ding, N.; Liu, Z.; Sun, M. PTR: Prompt tuning with rules for text classification. AI Open 2022, 3, 182–192.
- Ding, N.; Chen, Y.; Han, X.; Xu, G.; Xie, P.; Zheng, H.-T.; Liu, Z.; Li, J.; Kim, H.-G. Prompt-learning for fine-grained entity typing. arXiv 2021, arXiv:2108.10604.
- Ezepue, E.I.; Nwankwor, P.P.; Chukwuemeka-Nworu, I.J.; Ozioko, A.N.; Egbe, C.O.; Ujah, J.; Nduka, C.; Edikpa, E.C. Evaluating the Local Language Dimensions for Effective Teaching and Learning Sustainability in the Secondary Education System in Southeast Nigeria: Results from a Small-Scale Study. Sustainability 2023, 15, 7510.
- Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. arXiv 2020, arXiv:2012.15723.
- Hu, S.; Ding, N.; Wang, H.; Liu, Z.; Wang, J.; Li, J.; Wu, W.; Sun, M. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv 2021, arXiv:2108.02035.
| Category | Method | Introduction |
|---|---|---|
| Discrepancy | Manual Template | Manual template design based on the nature of specific tasks and prior knowledge. |
| Discrepancy | Heuristic-based Template | Build templates by heuristic search, etc. |
| Discrepancy | Generation | Generate an appropriate template from the given few-sample training data. |
| Continuity | Word Embedding | Initialize discrete templates and train them by gradient descent. |
| Continuity | Pseudo Token | Use the template itself as a trainable parameter. |
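To illustrate the "Continuity / Pseudo Token" row above, a minimal sketch: the template is a handful of trainable embeddings prepended to the input embeddings and optimized by gradient descent. The number of pseudo tokens and the hidden size are illustrative assumptions.

```python
# Sketch: a soft template whose "words" are parameters, not vocabulary items.
import torch
import torch.nn as nn

class PseudoTokenTemplate(nn.Module):
    def __init__(self, n_tokens: int = 4, hidden: int = 768):
        super().__init__()
        # trainable pseudo tokens, updated by gradient descent during tuning
        self.soft_prompt = nn.Parameter(torch.randn(n_tokens, hidden) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden] word embeddings of the input x
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)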
| Dataset | Template |
|---|---|
| AG's News | A [MASK] news : x |
| AG's News | x This topic is about [MASK] . |
| AG's News | [Category : [MASK]] x |
| AG's News | [Topic : [MASK]] x |
| Yahoo | A [MASK] question : x |
| Yahoo | x This topic is about [MASK] . |
| Yahoo | [Category : [MASK]] x |
| Yahoo | [Topic : [MASK]] x |
| IMDB, Amazon | It was [MASK] . |
| IMDB, Amazon | Just [MASK] ! |
| IMDB, Amazon | All in all, it was [MASK] . |
| IMDB, Amazon | In summary, it was [MASK] . |
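A small helper in the spirit of Section 3.2's template conversion, mirroring the first template of each row above. The dataset keys and the position of x in the sentiment templates are assumptions for illustration.

```python
# Sketch: wrap a raw input x into a cloze prompt for the MLM to fill.
TEMPLATES = {
    "agnews": "A {mask} news : {x}",
    "yahoo":  "A {mask} question : {x}",
    "imdb":   "{x} It was {mask} .",   # x placement assumed
    "amazon": "{x} It was {mask} .",   # x placement assumed
}

def apply_template(dataset: str, x: str, mask_token: str = "[MASK]") -> str:
    """Convert input text into the dataset's cloze template."""
    return TEMPLATES[dataset].format(mask=mask_token, x=x)

# apply_template("agnews", "Stocks rallied on Friday.")
# -> "A [MASK] news : Stocks rallied on Friday."
```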
| Shot | Method | AG's News | Yahoo | IMDB | Amazon |
|---|---|---|---|---|---|
| 1 | FT | 68.6 ± 8.9 | 35.1 ± 6.5 | 75.3 ± 5.3 | 71.2 ± 4.7 |
| 1 | PT | 79.7 ± 6.3 | 54.3 ± 5.8 | 90.5 ± 4.3 | 92.3 ± 3.4 |
| 1 | AUTO | 70.3 ± 7.2 | 45.2 ± 6.2 | 79.7 ± 4.9 | 85.6 ± 4.0 |
| 1 | SOFT | 79.8 ± 6.1 | 54.6 ± 5.6 | 89.6 ± 4.3 | 91.2 ± 3.5 |
| 1 | REKP | 83.9 ± 4.1 | 64.4 ± 3.5 | 91.1 ± 3.3 | 93.5 ± 3.1 |
| 5 | FT | 70.1 ± 8.2 | 39.3 ± 6.3 | 79.6 ± 4.2 | 77.5 ± 3.1 |
| 5 | PT | 82.3 ± 6.2 | 62.6 ± 4.2 | 91.1 ± 1.6 | 93.2 ± 1.9 |
| 5 | AUTO | 76.5 ± 7.2 | 50.2 ± 5.9 | 87.0 ± 3.4 | 88.9 ± 2.9 |
| 5 | SOFT | 82.3 ± 5.9 | 62.1 ± 4.8 | 90.8 ± 1.8 | 92.5 ± 1.5 |
| 5 | REKP | 85.8 ± 3.8 | 66.3 ± 2.8 | 91.7 ± 2.7 | 94.1 ± 2.5 |
| 10 | FT | 79.2 ± 5.2 | 50.1 ± 5.1 | 87.1 ± 3.8 | 87.8 ± 2.5 |
| 10 | PT | 84.7 ± 3.2 | 64.5 ± 4.5 | 92.0 ± 2.1 | 93.6 ± 2.1 |
| 10 | AUTO | 81.2 ± 4.1 | 59.9 ± 5.1 | 91.5 ± 2.8 | 93.0 ± 2.5 |
| 10 | SOFT | 84.8 ± 2.9 | 64.7 ± 3.8 | 91.8 ± 2.7 | 93.2 ± 2.6 |
| 10 | REKP | 86.1 ± 2.5 | 66.8 ± 1.9 | 91.3 ± 1.5 | 94.5 ± 1.7 |
| 15 | FT | 85.1 ± 2.5 | 59.8 ± 3.1 | 89.3 ± 2.4 | 90.2 ± 1.8 |
| 15 | PT | 86.2 ± 1.5 | 66.9 ± 2.4 | 92.1 ± 1.9 | 93.9 ± 1.4 |
| 15 | AUTO | 85.3 ± 2.1 | 64.7 ± 2.6 | 92.1 ± 2.1 | 93.8 ± 1.2 |
| 15 | SOFT | 86.1 ± 1.4 | 67.1 ± 0.9 | 92.5 ± 1.0 | 94.3 ± 1.1 |
| 15 | REKP | 86.9 ± 0.8 | 67.8 ± 0.3 | 92.2 ± 0.5 | 94.7 ± 0.6 |
| Dataset | Method | 1-shot | 5-shot | 10-shot | 15-shot |
|---|---|---|---|---|---|
| AG's News | REKP | 83.9 | 85.8 | 86.1 | 86.9 |
| AG's News | w/o WR | 83.7 ↓ | 85.2 ↓ | 86.1 = | 86.8 ↓ |
| AG's News | w/o CR | 83.0 ↓ | 85.6 ↓ | 86.5 ↑ | 86.9 = |
| AG's News | w/o IR | 83.9 = | 85.9 = | 86.7 ↑ | 87.4 ↑ |
| AG's News | w/o RRR | 82.5 * ↓ | 83.6 * ↓ | 85.6 * ↓ | 86.6 * ↓ |
| IMDB | REKP | 91.1 | 91.7 | 91.3 | 92.2 |
| IMDB | w/o WR | 91.0 ↓ | 91.7 = | 91.3 = | 92.2 = |
| IMDB | w/o CR | 90.8 ↓ | 91.6 ↓ | 90.5 ↓ | 92.4 = |
| IMDB | w/o IR | 90.8 ↓ | 91.8 = | 90.5 ↓ | 92.1 ↓ |
| IMDB | w/o RRR | 90.7 ↓ | 91.5 ↓ | 90.2 ↓ | 91.9 ↓ |