Prompt Tuning for Multi-Label Text Classification: How to Link Exercises to Knowledge Concepts?

Exercises serve as a measure of whether students have mastered specific knowledge concepts. Linking exercises to knowledge concepts is an important foundation in multiple disciplines such as intelligent education, and it is in essence a multi-label text classification problem. However, most existing methods do not take the automatic linking of exercises to knowledge concepts into consideration. In addition, most of the widely used approaches in multi-label text classification require large amounts of training data for model optimization, and collecting such data is usually time-consuming and labour-intensive in real-world scenarios. To address these problems, we propose a prompt tuning method for multi-label text classification, which can cope with the small number of labelled exercises that results from the lack of specialized expertise. Specifically, the relevance scores of exercise content and knowledge concepts are learned by a prompt tuning model with a unified template, and then the multiple associated knowledge concepts are selected with a threshold. An Exercises-Concepts dataset of the Data Structure course is constructed to verify the effectiveness of our proposed method. Extensive experimental results confirm that our proposed method outperforms other state-of-the-art baselines by up to 35.53% and 41.78% in Micro and Macro F1, respectively.


Introduction
In recent decades, personalized learning has become a mainstream solution for enhancing students' learning interest and experience in intelligent education systems [1][2][3]. One of the fundamental and key tasks in personalized learning is knowledge tracing [4,5], which aims to evaluate students' learning state with respect to knowledge concepts.
Exercises play an important role in knowledge tracing tasks, as they are one of the measures of whether students have mastered specific knowledge concepts [6,7]. Students in intelligent education systems choose suitable exercises according to their own needs and acquire specific knowledge concepts while working on them. In turn, we can track changes in students' acquisition of knowledge concepts during the exercising process. From this perspective, knowledge tracing should consist of a students-exercises-knowledge concepts hierarchy [8]. However, most existing knowledge tracing approaches [9][10][11] model only part of this hierarchy (i.e., students-exercises or students-concepts). This is because, in some intelligent systems, there is a lack of connection between exercises and knowledge concepts. To this end, we take the automatic linking of exercises to knowledge concepts into consideration for knowledge tracing tasks.
In essence, linking exercises to knowledge concepts is a multi-label text classification (MLTC) problem. As shown in Figure 1, the relationship between exercises and knowledge concepts is one-to-one or one-to-many, and the task is to assign one or more concepts to each input exercise in the dataset. Moreover, Figure 1 shows that the semantics of exercises and knowledge concepts are highly correlated. Recently, deep-learning-based methods have achieved fairly good performance in MLTC owing to their strength in feature representation learning. For example, Liu et al. [12] utilized the strengths of the existing convolutional neural network and took multi-label co-occurrence patterns into account in the optimization objective to produce good results in MLTC. Pal et al. [13] proposed a graph attention network-based model to capture the attentive dependency structure among the labels. Chang et al. [14] fine-tuned the BERT language model [15] to capture the contextual relations between input text and the induced label clusters. However, these deep-learning-based methods require large amounts of training data for model optimization, and collecting such data is usually time-consuming and labour-intensive in real-world scenarios. Unfortunately, training data for linking exercises to knowledge concepts is usually scarce, because some knowledge concepts correspond to only a few exercises and new courses may have very little labelled data.
To address these problems, we propose a Prompt Tuning method for Multi-Label Text Classification (PTMLTC for short). First, the prompt tuning model with a unified template predicts the relevance scores of exercises and knowledge concepts. Then, the multiple associated knowledge concepts are picked with a threshold. In order to verify the effectiveness of our proposed method, an Exercises-Concepts dataset of the Data Structure course is constructed. Extensive experimental results confirm that our method outperforms other state-of-the-art methods by up to 32.53% and 41.78% in Micro and Macro F1, respectively.
The contributions of our paper can be summarized as follows:
(1) To the best of our knowledge, this is the first attempt to automatically link exercises to knowledge concepts. We built an Exercises-Concepts dataset of the Data Structure course and reconstructed it as a few-shot dataset.
(2) We propose a prompt tuning method for multi-label text classification to link exercises to knowledge concepts. Large amounts of labelled or unlabelled training data are not required.
(3) Extensive experimental results confirm that our proposed method outperforms other state-of-the-art deep-learning-based methods.

Related Work
In this section, we first introduce deep-learning-based multi-label text classification methods. Then, the prompt tuning methods used in our model are presented.

Multi-Label Text Classification
The goal of MLTC is to associate one or more relevant labels with each input text instance. The traditional MLTC methods include one-vs-all methods [16,17], tree-based methods [18,19] and embedding-based methods [20,21]. For example, Babbar et al. [16] proposed a distributed learning mechanism for MLTC, which uses doubly parallel training to reduce the expensive computational cost of one-vs-all methods. Prabhu et al. [22] presented a method called FastXML, which optimizes an nDCG-based ranking loss function to further reduce computational costs. Tagami [21] proposed a graph embedding method, which learns to partition data points using the k-nearest neighbour graph (KNNG) and predicts results with an approximate k-nearest neighbour search over the KNNG in the embedding space.
In recent years, owing to their powerful ability to learn feature representations [23,24], deep models have gained much attention and achieved superior performance over traditional methods. Existing deep-learning-based MLTC methods focus on learning enhanced text representations to improve performance. For example, Liu et al. [12] utilized the strengths of the existing convolutional neural network (CNN) and dynamic pooling to model the text representation for MLTC. Xiao et al. [25] employed an attention mechanism to highlight important context representations in MLTC tasks. Ma et al. [26] utilized a bidirectional Gated Recurrent Unit network and hybrid embedding to learn the representation of the text level by level. Chang et al. [14] proposed to fine-tune the BERT language model [15] in order to capture the contextual relations in the input text for MLTC.
In addition, the dependencies or correlations among labels have recently been shown to improve performance in most MLTC tasks. Along this line, many deep-learning-based methods have been proposed to model label dependencies. For example, Chen et al. [27] explored label correlations through Recurrent Neural Networks, which were used to predict labels sequentially, one by one. Pal et al. [13] proposed a graph attention network-based model to capture the attentive dependency structure among the labels. Yang et al. [28] treated MLTC as a sequence generation problem and proposed a decoder structure that captures the dependencies between labels and automatically selects the most informative words when predicting different labels. Xun et al. [29] learned label correlation by introducing an extra CorNet module, which is applied to a deep model at the prediction layer to enhance raw label predictions with correlation knowledge.
However, most existing deep-learning-based MLTC methods require a large amount of labelled or unlabelled training data for model optimization, and obtaining such data is often time-consuming and labour-intensive. Therefore, designing methods that can achieve promising results in the few-shot scenario remains a huge challenge in real-world MLTC tasks.

Prompt Tuning
Prompt-based learning [30][31][32] is regarded as a new paradigm in natural language processing and has drawn great attention from multiple disciplines. It promotes downstream tasks by using the pre-training knowledge as much as possible. Starting from GPT-3 [33], prompt tuning has demonstrated unique strengths in a variety of tasks, including text classification [32,34], relation extraction [35], event extraction [36] and so on. Prompt-based learning directly models the probability of text on top of language models. It is different from traditional supervised learning, which trains a model to predict the output y from the input x as P(y | x). Specifically, in the prediction task, a template is first added to the original input x to form a new textual prompt x' that contains a [MASK] slot. Then, the language model processes the reconstructed x' and probabilistically fills in the unfilled information. For example, Cui et al. [37] employed closed prompts filled by a candidate named entity span as the target sequence in named entity recognition tasks. Li et al. [38] proposed Prefix-tuning, which uses continuous templates to achieve better performance than discrete prompts. There have also been recent efforts to incorporate external knowledge into prompt design. For example, Hu et al. [34] proposed knowledgeable prompt-tuning, which expands the label word space of the verbalizer with external knowledge bases. Chen et al. [35] proposed a knowledge-aware prompt-tuning approach, which introduces relation label knowledge into prompt construction. In addition, many works [34,39] have demonstrated that prompt-based learning greatly improves model performance in few-shot scenarios. Hambardzumyan et al. [40] proposed an automatic prompt generation method to transfer knowledge from large Pre-trained Language Models, which achieved excellent performance in a few-shot setting. Gu et al. [41] proposed to add soft prompts into the pre-training stage and pre-train soft prompts in the form of unified classification tasks, which can reach or even outperform full-model fine-tuning in few-shot settings. However, in knowledge tracing tasks, we are not aware of existing prompt-learning-based approaches that automatically link exercises to knowledge concepts. To this end, we propose a prompt tuning method for multi-label text classification to link exercises to knowledge concepts.

Prompt Tuning Method for Multi-Label Text Classification
In this section, the details of our proposed PTMLTC are given, and the general framework is shown in Figure 2.

Problem Formalization
In this paper, we aim to use a few exercises with labelled concepts to predict one or more related concepts for each input exercise text.
Formally, the support set is defined as $S = \{(E_i, C_i)\}_{i=1}^{N_S}$, where $E$ denotes the exercise-instance space, $C$ denotes the set of knowledge concepts, $S$ usually contains $K$ exercise-instances ($K$-shot) for each of the $N$ concept-labels ($N$-way), and $N_S$ is the size of the support set.
For each learning instance $(E_i, C_i)$, $E_i \in E$ is an $l$-dimensional input and $C_i \subseteq C$ is the set of related concepts. For an unseen instance $e$ in the query set, the classifier predicts a set of concepts $P = h(e) \subseteq C$.

Prompt Tuning Method for Multi-Label Text Classification
As shown in Figure 2, our method adopts a threshold-based strategy [42,43] to achieve multi-label text classification. Firstly, the estimation of the relevance scores between exercise content and knowledge concepts is cast as a masked language modelling task by prompt tuning. Specifically, a prompt template is defined as $V_{prompt}$ = "It belongs to [MASK]" and combined with the exercise text $x = \{x_0, x_1, x_2, \cdots, x_n\}$ to form the final prompt tuning input $e_{prompt}$, as shown in Equation (1):
$$e_{prompt} = x \oplus V_{prompt} \tag{1}$$
where $\oplus$ denotes concatenation. Suppose that $M$ is a Pre-trained Language Model (PLM for short) trained on a large corpus. The probability of filling the token [MASK] with each word of concept $c$ in the knowledge concepts set $C$ can be denoted as $P_M([\mathrm{MASK}] = c \mid e_{prompt})$. Here, we need a map function Sigmoid() to predict the probability of each concept independently. The relevance scores can be represented as Equation (2):
$$s(c \mid e) = \mathrm{Sigmoid}\big(P_M([\mathrm{MASK}] = c \mid e_{prompt})\big) \tag{2}$$
Finally, we add a threshold mechanism to determine the knowledge concepts corresponding to an exercise, which can be formulated as Equation (3):
$$P = \{\, c \in C \mid s(c \mid e) > t \,\} \tag{3}$$
where $t$ is the threshold.
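The prompt wrapping, sigmoid scoring and threshold selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_prompt` and `link_concepts` are hypothetical, placing the template after the exercise text is an assumption, and the `mask_scores` values are made-up numbers standing in for the scores a PLM would produce at the [MASK] position.

```python
import math

# Template text taken from the paper; appending it after the exercise is an assumption.
TEMPLATE = "It belongs to [MASK]"


def build_prompt(exercise_text: str, template: str = TEMPLATE) -> str:
    """Concatenate the exercise text with the unified template."""
    return f"{exercise_text} {template}"


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def link_concepts(mask_scores: dict, threshold: float):
    """Map each concept score to (0, 1) independently, then keep scores above the threshold."""
    relevance = {concept: sigmoid(score) for concept, score in mask_scores.items()}
    selected = {concept for concept, r in relevance.items() if r > threshold}
    return relevance, selected


exercise = ("The stack is characterized by first in, last out, and the queue "
            "is characterized by first in, first out. (right)")
prompt = build_prompt(exercise)

# Hypothetical [MASK] scores for each concept word (illustrative values only).
mask_scores = {"array": -3.0, "stack": 2.0, "queue": 1.5, "linked list": -2.5}
relevance, linked = link_concepts(mask_scores, threshold=0.24)

print(prompt)
print(sorted(linked))
```

Because each concept is scored independently, any number of concepts (including none) can pass the threshold, which is what makes the scheme multi-label rather than single-label.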
To better introduce our method, we take the example shown in Figure 3. The exercise text "The stack is characterized by first in, last out, and the queue is characterized by first in, first out. (right)" is wrapped with the template as the input. The PLM is adopted to predict the probability of filling the token [MASK] with each word in the knowledge concept set {array, stack, queue, linked list}. Then, the Sigmoid() function is used to obtain the probability of the exercise text linking to each of the labels {array, stack, queue, linked list}. Because the probabilities of the exercise text linking to stack and queue are greater than the threshold, the exercise text is regarded as linking to {stack, queue}.
It has been shown that binary cross-entropy (BCE) loss over sigmoid activation is better suited to multi-label problems and outperforms cross-entropy loss [12]. Therefore, in our paper, the BCE loss function is chosen to learn the parameters, which can be formulated as Equation (4):
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{|C|}\Big[y_{ij}\log\sigma(p_{ij}) + (1-y_{ij})\log\big(1-\sigma(p_{ij})\big)\Big] \tag{4}$$
where $n$ is the number of training exercises, $p_{ij}$ represents the predicted value of exercise $i$ belonging to concept $j$, $y_{ij} \in \{0, 1\}$ represents the ground-truth value of exercise $i$ belonging to concept $j$, and $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$.
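As a hedged illustration of the BCE objective over sigmoid activation, the following Python sketch computes the loss for a small batch. The function name `bce_with_logits` and the toy numbers are hypothetical; averaging over exercises (rather than summing) is an assumption, and the numerically stable form used here is a standard rewriting of the same quantity, not something stated in the paper.

```python
import math


def bce_with_logits(scores, targets):
    """Binary cross-entropy over sigmoid activation.

    scores[i][j]  : raw predicted value p_ij for exercise i and concept j
    targets[i][j] : 1 if exercise i is linked to concept j, else 0
    The per-term expression max(p, 0) - p*y + log(1 + exp(-|p|)) equals
    -[y*log(sigmoid(p)) + (1-y)*log(1-sigmoid(p))] but avoids log(0).
    The sum is averaged over exercises (an assumption of this sketch).
    """
    total = 0.0
    for score_row, target_row in zip(scores, targets):
        for p, y in zip(score_row, target_row):
            total += max(p, 0.0) - p * y + math.log1p(math.exp(-abs(p)))
    return total / len(scores)


# Concept order: array, stack, queue, linked list (illustrative values only).
targets = [[0, 1, 1, 0]]
good_scores = [[-3.0, 2.0, 1.5, -2.5]]   # agrees with the targets
bad_scores = [[3.0, -2.0, -1.5, 2.5]]    # disagrees with the targets

print(bce_with_logits(good_scores, targets))
print(bce_with_logits(bad_scores, targets))
```

Each concept contributes its own binary term, so the loss does not force the concept probabilities of one exercise to sum to one, unlike softmax cross-entropy.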

Experiment
In this section, we conduct extensive experiments on the constructed Exercises-Concepts dataset of the Data Structure course to verify the effectiveness of our proposed method for linking exercises to knowledge concepts. In the following, the Exercises-Concepts dataset of the Data Structure course and the few-shot dataset construction are first introduced in detail. Then, the compared methods and evaluation metrics of our experiments are presented. Finally, we analyze the experimental results and the influence of the main parameters.

Datasets
Exercises-Concepts dataset of the Data Structure course: To study the problem of linking exercises to knowledge concepts, we construct the Exercises-Concepts dataset of the Data Structure course. Referring to the MOOCCube_DS [44] data repository and several national planning textbooks, we extract 65 classic knowledge concepts. Subsequently, 2027 exercises used in these textbooks are labelled with the corresponding knowledge concepts. Details are shown in Table 1.
Few-shot dataset construction: To simulate the few-shot situation, we reconstruct the dataset into the form of few-shot learning, where each example is the combination of a query instance $(e_q, c_q)$ and the corresponding K-shot support set S. Unlike the single-label classification problem, instances in multi-label classification may be associated with multiple labels. Therefore, there is no guarantee that each label appears exactly K times during sampling. To address this problem, we approximately construct the K-shot support set S with the Minimum-including Algorithm [43]. It constructs a support set that generally complies with the following two conditions: (1) all labels in the original dataset appear at least K times in the support set S; (2) at least one label will appear fewer than K times in S if any (exercise, concepts) pair is removed from it. For the original dataset, we sampled $N_S$ different support sets. For each support set, we take the remaining data as the query set. Each support-query-set pair constitutes one few-shot episode.
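The two conditions above can be made concrete with a small Python sketch of a greedy support-set construction. This is an illustrative approximation, not the exact procedure of [43]: the function name `minimum_including_support_set`, the add-then-prune strategy, and the toy dataset are all assumptions made for this example.

```python
import random
from collections import Counter


def minimum_including_support_set(dataset, k, seed=0):
    """Build an approximate K-shot support set for multi-label data.

    dataset: list of (exercise_text, set_of_concepts) pairs.
    Returns (support, query), where query is the remaining data.
    """
    rng = random.Random(seed)
    labels = sorted({c for _, concepts in dataset for c in concepts})
    chosen = []          # indices of selected instances, in insertion order
    chosen_set = set()
    counts = Counter()   # how often each label occurs in the support set

    # Step 1: for every label, add instances until it appears at least k times.
    for label in labels:
        candidates = [i for i, (_, cs) in enumerate(dataset) if label in cs]
        rng.shuffle(candidates)
        for i in candidates:
            if counts[label] >= k:
                break
            if i not in chosen_set:
                chosen.append(i)
                chosen_set.add(i)
                counts.update(dataset[i][1])

    # Step 2: drop every instance whose removal keeps all labels at >= k,
    # so that removing any remaining instance would violate condition (1).
    for i in list(chosen):
        trial = counts.copy()
        trial.subtract(dataset[i][1])
        if all(trial[label] >= k for label in labels):
            chosen.remove(i)
            chosen_set.discard(i)
            counts = trial

    support = [dataset[i] for i in chosen]
    query = [ex for i, ex in enumerate(dataset) if i not in chosen_set]
    return support, query


# Toy data with hypothetical exercise identifiers.
toy_dataset = [
    ("e1", {"stack", "queue"}),
    ("e2", {"stack"}),
    ("e3", {"queue"}),
    ("e4", {"array"}),
    ("e5", {"array", "linked list"}),
    ("e6", {"linked list"}),
]
support_set, query_set = minimum_including_support_set(toy_dataset, k=1)
print(support_set)
print(len(query_set))
```

Because removals in the pruning step only ever lower the label counts, an instance that could not be removed earlier still cannot be removed later, so a single pass is enough to satisfy condition (2).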
In the test stage, we constructed 10 different few-shot episodes for each selected K-shot setting. Within each episode, the support set is used to fine-tune the model, and the query set is used to test the effectiveness of the methods.
Deep-learning-based methods, such as the model of Liu et al. [12] and MAGNET [13], require massive amounts of training data for model optimization, which inevitably leads to performance degradation in the few-shot scenario. In contrast, PLM tuning methods for multi-label text classification can provide a certain advantage in the few-shot problem. Therefore, three PLM tuning methods are used as compared methods, described as follows:
TextCNN [45]: The method uses a simple CNN with one layer of convolution on top of word vectors for sentence classification. In our experiments, PLMs are used to learn the word representations, and a multi-label classification layer is added to predict labels. Notably, the method is fine-tuned on the support set to select the optimal model and validated on the query set.
TagBert [46]: This is a model based on a large pre-trained model and a multi-label classification layer.Following the parameter setting of a threshold-based multi-label method, a fixed threshold tuned on the support set is used in the experiments.
The experimental setup of all the above methods is the same as that in TextCNN.

Evaluation
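In our paper, MacroF1 and MicroF1 are used to evaluate the effectiveness of our proposed method. MacroF1 calculates the average of the F1 scores obtained for each category, which can be formulated as Equation (5):
$$MacroF1 = \frac{1}{|C|}\sum_{t \in C}\frac{2 P_t R_t}{P_t + R_t}, \quad P_t = \frac{TP_t}{TP_t + FP_t}, \quad R_t = \frac{TP_t}{TP_t + FN_t} \tag{5}$$
where $P_t$ represents the precision of each category and $R_t$ represents the recall of each category. $TP_t$, $FP_t$ and $FN_t$ are the numbers of true-positive, false-positive and false-negative examples of the $t$-th label in the label set $C$, respectively. MicroF1 calculates the F1 score over all labels jointly, which can be formulated as Equation (6):
$$MicroF1 = \frac{2 P R}{P + R}, \quad P = \frac{\sum_{t \in C} TP_t}{\sum_{t \in C}(TP_t + FP_t)}, \quad R = \frac{\sum_{t \in C} TP_t}{\sum_{t \in C}(TP_t + FN_t)} \tag{6}$$
where $P$ represents the overall precision and $R$ represents the overall recall.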
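The difference between the two averages is easy to see in a short Python sketch. The helper names `f1_from_counts` and `micro_macro_f1` and the toy predictions are hypothetical; treating an undefined precision or recall (zero denominator) as 0 is a common convention and an assumption here.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; undefined precision/recall is treated as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, pred, labels):
    """gold, pred: lists of concept sets, one per exercise."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                tp[label] += 1
            elif label in p:
                fp[label] += 1
            elif label in g:
                fn[label] += 1
    # Macro: average the per-label F1 scores.
    macro = sum(f1_from_counts(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


labels = ["array", "queue", "stack"]
gold = [{"stack", "queue"}, {"array"}]
pred = [{"stack"}, {"array", "queue"}]
micro, macro = micro_macro_f1(gold, pred, labels)
print(micro, macro)
```

Macro F1 weights every concept equally, so rare concepts with few exercises influence it as much as frequent ones, whereas Micro F1 is dominated by the concepts that occur most often.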

Experiment Settings
We evaluate the performance of our proposed method on the few-shot Exercises-Concepts dataset. Because some concepts in the dataset have only 5 exercises, we set the value of K in K-shot to 1 and 5, respectively. Several hyper-parameters need to be initialized in the above methods. Firstly, we use uniform settings across all methods. The maximum sequence length is set to 512. The models are optimized by Adam with a batch size of 4 and a learning rate of $1 \times 10^{-5}$. Secondly, the threshold has an impact on the final performance. The thresholds are set to 0.10, 0.65, 0.82 and 0.24 in the 1-shot setting for TextCNN, TagBert, BertFGM and PTMLTC, respectively. In the 5-shot setting, the thresholds are 0.08, 0.70, 0.85 and 0.20. The reported results are the mean and variance of the experimental results on 10 randomly generated few-shot datasets.

Performance Comparison
Results of 1-shot setting: The results of linking exercises to knowledge concepts in the 1-shot setting are shown in Table 2. From the experimental results, we can make the following observations. Firstly, the MicroF1 and MacroF1 of the PTMLTC method are 54.74% and 46.11%, respectively, which are far better than those of the other three baselines. Secondly, with abundant training data, BertFGM performs better than TagBert. However, in the few-shot problem, the added adversarial training introduces interfering information, which makes the classes harder for the classifier to distinguish, so BertFGM achieves worse results than TagBert.
Results of 5-shot setting: The results of linking exercises to knowledge concepts in the 5-shot setting are shown in Table 3. The results are basically consistent with the trend of the 1-shot setting. Compared with the 1-shot setting, the results of all methods improve in the 5-shot setting. These results demonstrate that increasing the amount of training data improves classification performance. In addition, PTMLTC has a smaller margin of advantage in the 5-shot setting than in the 1-shot setting. This indicates that the less data is available, the more obvious the advantage of PTMLTC.
The success of prompt tuning mainly depends on the template design and the label words. Different templates are designed in our method to examine their effect. The details are shown in Table 5. The template "It belongs to [MASK]" was selected because it obtains the best result.
Regarding our proposed method, in this section we study the influence of the main parameter, which is the threshold t in Equation (3). The controlled-variable approach is adopted: when one variable is changed, the other variables remain unchanged. We randomly selected one dataset each from the 1-shot and 5-shot few-shot datasets for verification. After some preliminary tests, we found that the value of t has a relatively large impact on performance, although performance does not fluctuate excessively within a certain range. The value set of t is [0.14, 0.16, 0.18, 0.22, 0.24, 0.26]. It can be observed from Figure 4 that t = 0.24 in the 1-shot setting and t = 0.2 in the 5-shot setting lead to the best results.
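A threshold study of this kind amounts to a one-dimensional grid search. The following Python sketch shows one way it could be run, using Micro F1 as the selection criterion. The function names `micro_f1` and `sweep_threshold`, the choice of Micro F1 for selection, and the relevance scores are assumptions made for illustration; only the candidate grid is taken from the text.

```python
def micro_f1(gold, pred):
    """Micro F1 over lists of concept sets (pooled counts over all labels)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_threshold(relevance, gold, candidates):
    """Return (best_t, best_f1, all_results) for a grid of candidate thresholds."""
    results = {}
    best_t, best_f1 = None, -1.0
    for t in candidates:
        # Keep every concept whose relevance score exceeds the threshold.
        pred = [{c for c, s in scores.items() if s > t} for scores in relevance]
        results[t] = micro_f1(gold, pred)
        if results[t] > best_f1:
            best_t, best_f1 = t, results[t]
    return best_t, best_f1, results


# Hypothetical relevance scores for three exercises (illustrative values only).
relevance = [
    {"stack": 0.90, "queue": 0.30, "array": 0.15},
    {"stack": 0.17, "queue": 0.10, "array": 0.80},
    {"stack": 0.25, "queue": 0.70, "array": 0.05},
]
gold = [{"stack", "queue"}, {"array"}, {"queue"}]
candidates = [0.14, 0.16, 0.18, 0.22, 0.24, 0.26]  # grid from the text

best_t, best_f1, results = sweep_threshold(relevance, gold, candidates)
print(best_t, best_f1)
```

In a few-shot episode the sweep would be run on held-out labelled data rather than on the query set used for the final report, otherwise the chosen threshold would be tuned on the test data.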

Conclusions and Future Work
In this paper, a prompt tuning method for multi-label text classification is proposed to link exercises to knowledge concepts. The main idea is that the relevance scores of exercise content and knowledge concepts are learned by a prompt tuning model with a unified template, and then the multiple associated knowledge concepts are selected with a threshold. On the constructed dataset, we compare the proposed method with other baseline methods. The results show that PTMLTC achieves better performance than other state-of-the-art methods on the evaluation metrics, and that its advantage is more conspicuous when less training data is available. The knowledge concepts in a course naturally form a graph, but our work ignores the relationships between them. Future work will try to introduce the structural relationships between knowledge concepts into the model to achieve better results.

Figure 1. Examples of exercises linking to knowledge concepts from the dataset.

Figure 2. The general framework of our PTMLTC. The exercise text sequence is concatenated with the unified template as the input of the pre-trained language model, which then predicts the probability of filling the token [MASK] with each word of the knowledge concepts. The Sigmoid() function is used to obtain the probability of the exercise text linking to each knowledge concept label. Finally, a threshold mechanism is adopted to predict all the possible knowledge concept labels.

Figure 3. An example of our proposed method.

Figure 4. Effects of threshold t on two datasets.

Table 1. The number of exercises linked to each knowledge concept. C represents the label of the knowledge concept, and N denotes the number of exercises.

Table 2. Results of 1-shot on our dataset. Values marked in bold are the highest for each metric.

Table 3. Results of 5-shot on our dataset. Values marked in bold are the highest for each metric.

Table 4. Results of different PLMs on our dataset. Values marked in bold are the highest for each metric.

Table 5. Results of the different template designs. Values marked in bold are the highest for each metric.