Article

P-Distill: Efficient and Effective Prompt Tuning Using Knowledge Distillation

by Hyun-Sik Won 1, Joon-Young Choi 2, Namrah Zaman 1, Dinara Aliyeva 3 and Kang-Min Kim 1,4,*
1 Department of Artificial Intelligence, The Catholic University of Korea, Bucheon-si 14662, Republic of Korea
2 Danggeun Market Inc., Seoul 06611, Republic of Korea
3 Department of Computer Science, College of Arts & Sciences, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
4 Department of Data Science, The Catholic University of Korea, Bucheon-si 14662, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2420; https://doi.org/10.3390/app15052420
Submission received: 20 January 2025 / Revised: 14 February 2025 / Accepted: 20 February 2025 / Published: 24 February 2025
(This article belongs to the Special Issue Machine Learning Approaches in Natural Language Processing)

Abstract: In the field of natural language processing (NLP), prompt-based learning is widely used for parameter-efficient learning. However, this method has the drawback of reducing the available input length by the length of the attached prompt, leading to inefficient utilization of the input space. In this study, we propose P-Distill, a novel prompt compression method that mitigates this limitation of prompt-based learning while maintaining performance via knowledge distillation. The knowledge distillation process of P-Distill consists of two methods, namely prompt initialization and prompt distillation. Experiments on various NLP tasks demonstrated that P-Distill achieved performance comparable or superior to other state-of-the-art prompt-based learning methods, even with significantly shorter prompts. Specifically, we achieved a peak improvement of 1.90% even with prompt lengths compressed to one-eighth. An additional study further provides insights into the distinct impact of each method on the overall performance of P-Distill. Our code will be released upon acceptance.

1. Introduction

Pre-trained language models (PLMs) have been effective in improving the performance of various natural language processing (NLP) tasks [1,2,3]. These models are fine-tuned by optimizing all parameters to enhance performance on specific downstream tasks; however, fine-tuning demands substantial computational resources for both storage and training. This becomes a challenge especially when fine-tuning large language models such as Llama 2 [3], for which such resources may not be readily available to most users.
To reduce computational costs, researchers have explored various methods for parameter-efficient fine-tuning [4,5,6]. In contrast to traditional fine-tuning, which updates all parameters for a downstream task, P-tuning v2 [6] freezes the pre-trained parameters and only trains continuous prompts, which are trainable embeddings attached at the beginning of or throughout each layer of the model. While P-tuning v2 is computationally efficient, especially for PLMs with a large number of parameters, it does not address the inefficient utilization of the input space caused by continuous prompts [5]. This attachment increases the attention computation, and because positions must be allocated to the attached prompts, the available input token sequence length is reduced. Extending the input token sequence by forcibly modifying the code can lead to issues with attention computation beyond the model's training range [7]. Such modifications often result in performance degradation, as the attention mechanism may struggle with sequences extending beyond its originally intended scope. Moreover, as reported in [6], more challenging tasks require longer prompt lengths to achieve better performance.
In this paper, we propose P-Distill, which is a novel prompt compression method to mitigate the limitations of long prompts. Our method involves a two-step process where we first train a teacher model using P-tuning v2 to achieve superior performance with long prompts. We then transfer this knowledge to a student model with significantly shorter prompts through a distillation process. To ensure stability in training, we first perform prompt initialization based on the teacher model prompts. Then, we focus on distilling knowledge between the teacher and student models, specifically targeting the outputs of their intermediate and prediction layers. This is due to the impact of continuous prompts on the hidden states within these layers, which subsequently influences the model’s predictions. This method enables the compression of prompts to shorter lengths without a significant degradation in performance, thereby addressing the inefficiencies inherent in longer prompts.
To validate its effectiveness and efficiency, we evaluate P-Distill using various NLP benchmarks. Our results demonstrate that P-Distill exhibits comparable or superior performance to those of the existing state-of-the-art prompt-based models. To the best of our knowledge, this study is the first to train teacher prompts and transfer their knowledge to student prompts for compressing prompts. The main contributions of this study are summarized as follows:
  • We propose a method called P-Distill to compress the continuous prompts, effectively mitigating the limitation of reducing the model’s usable sequence length in prompt-based learning.
  • We introduce a prompt distillation method utilizing the teacher model’s hidden-state and prediction outputs, influenced by continuous prompts, and propose a prompt initialization for stable prompt distillation.
  • We validate P-Distill across multiple NLP benchmarks, demonstrating its ability to maintain or enhance accuracy while reducing prompt lengths by up to eight times.
This paper is structured as follows: Section 2 outlines the preliminaries; Section 3 delves into the proposed methodology in detail; Section 4 discusses the experimental results and provides an in-depth analysis; and Section 5 concludes the study with key findings and insights.

2. Preliminaries

2.1. Pre-Trained Language Models Based on the Transformer

The transformer model [8], comprising an encoder and decoder, is the fundamental architecture of the majority of recent PLMs, including BERT [1], RoBERTa [9], and GPT-3 [2]. Each encoder and decoder consists of multiple transformer layers and incorporates key components, such as multi-head attention modules (MHA), feed-forward networks, layer normalization, and residual connections. A key component of this architecture is the multi-head attention mechanism, which computes attention weights using query (Q), key (K), and value (V) matrices. Mathematically, the attention function in multi-head attention can be represented as follows:
$$\mathrm{Att}(x) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $d_k$ is the dimensionality of the key vectors and $\sqrt{d_k}$ acts as a scaling factor that stabilizes gradients during training. This attention mechanism is crucial for language understanding and generation tasks, modulating the model's focus on different parts of the input data.
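To make this concrete, the following is a minimal PyTorch sketch of scaled dot-product attention as defined above; the tensor shapes and function name are illustrative rather than taken from any particular library implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Att(x) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: tensors of shape (batch, seq_len, d_k); in multi-head
    attention, this is applied independently for each head.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # attention distribution
    return weights @ V                               # weighted sum of values

# Example: a batch of 2 sequences, 10 tokens, 64-dimensional heads
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape: (2, 10, 64)
```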

2.2. Prompt-Based Learning Methods

Prompt-based learning methods have emerged as an efficient alternative to full-model fine-tuning, especially for PLMs [6,10]. These methods use prompts to guide the model predictions for specific tasks. Several approaches [11,12] employ discrete prompts, which are fixed templates added to the input. For example, in sentiment analysis, a template might be “This text [Input Text] expresses a [MASK] sentiment”. However, discrete prompts are limited in that their performances significantly depend on template selection. Advanced approaches, such as prefix-tuning [13] and P-tuning [10], use continuous prompts that are trainable embeddings independent of the model vocabulary. Particularly, P-tuning v2 [6] attaches continuous prompts to each layer of the model, thereby influencing its behavior and enhancing its performance in downstream tasks. These continuous prompts are integrated into the attention mechanism of the transformer model as follows:
$$\mathrm{Att}(x) = \mathrm{softmax}\left(\frac{Q\,(P_k : K)^{\top}}{\sqrt{d_k}}\right)(P_v : V),$$
where $P_k \in \mathbb{R}^{n_p \times d}$ and $P_v \in \mathbb{R}^{n_p \times d}$ are the continuous prompts added to the key and value vectors, respectively, and the colon denotes the concatenation of these prompts with the key and value matrices. The dimension $n_p$ indicates the length of the prompts, and $d$ represents the dimension of the key and value vectors. This integration enables the prompts to influence layers closer to the output, significantly affecting the final predictions.
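The following sketch illustrates this concatenation at a single layer: trainable prompt embeddings are prepended to the frozen key and value matrices before attention is computed. Shapes and names are illustrative; practical implementations (e.g., via cached key/value mechanisms in HuggingFace Transformers) differ in detail.

```python
import torch
import torch.nn.functional as F

batch, n_x, n_p, d = 2, 10, 16, 64  # illustrative sizes

# Q, K, V produced by the frozen pre-trained projections for the input tokens
Q = torch.randn(batch, n_x, d)
K = torch.randn(batch, n_x, d)
V = torch.randn(batch, n_x, d)

# Trainable continuous prompts for this layer (the only trained parameters)
P_k = torch.nn.Parameter(torch.randn(n_p, d))
P_v = torch.nn.Parameter(torch.randn(n_p, d))

# Concatenate the prompts in front of the keys and values: (P_k : K), (P_v : V)
K_prompted = torch.cat([P_k.expand(batch, -1, -1), K], dim=1)  # (batch, n_p + n_x, d)
V_prompted = torch.cat([P_v.expand(batch, -1, -1), V], dim=1)

scores = Q @ K_prompted.transpose(-2, -1) / (d ** 0.5)  # (batch, n_x, n_p + n_x)
out = F.softmax(scores, dim=-1) @ V_prompted            # (batch, n_x, d)
```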
A significant disadvantage of P-tuning v2 is the necessity for long continuous prompts to achieve maximum performance, resulting in a reduced input token space. Figure 1 demonstrates the influence of continuous prompt length on model performance across diverse NLP tasks, indicating that more complicated tasks often need longer prompts.
This observation motivated our investigation into prompt compression, in which we aim to maintain model efficacy while minimizing prompt length. In Section 3, we present P-Distill, a method that uses knowledge distillation to address the inefficiencies associated with long prompts.

2.3. Knowledge Distillation

In artificial intelligence, knowledge distillation is a technique for reducing the size of large models while preserving their performance [14,15,16,17]. Recent improvements in knowledge distillation include parameter-efficient online distillation [18], prompt transfer-based knowledge distillation for efficient model adaptation [19], and retrieval-augmented knowledge distillation for enhanced generalization [20]. Furthermore, pseudo-target training has been investigated for enhancing knowledge distillation in natural language generation tasks [21]. Beyond NLP, knowledge distillation has been effectively applied in other domains, such as visible–infrared transmission line detection using contrastive learning [22], object detection through cross-head distillation [23], gradient-guided knowledge distillation for object detectors [24], and instance-, scale-, and teacher-adaptive knowledge distillation for visual detection in autonomous driving [25]. These advancements demonstrate the versatility of knowledge distillation across diverse domains. During knowledge distillation, a smaller student model is trained to internalize and emulate the complex decision-making patterns and behaviors of a larger teacher model. This process involves the behavior functions of the models, $f^T$ and $f^S$, which transform inputs into informative representations, typically defined as the output of any layer within the model. These representations contain abundant information for model predictions. Knowledge distillation is quantified using loss functions, such as the Kullback–Leibler divergence [26] or the Mean Squared Error (MSE) [17], as follows:
$$\mathcal{L}_{KD} = \sum_{x \in X} \mathcal{L}\left(f^{S}(x), f^{T}(x)\right),$$
where x is the input and X and L denote the dataset and the loss function, respectively. This approach enables the student model to gain a comprehensive understanding of various classes, enhancing its application in fields such as NLP.
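As a simple illustration of this formulation, the sketch below accumulates a distillation loss by comparing student and teacher outputs over a dataset; the choice of MSE as the loss and the model interfaces are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def knowledge_distillation_loss(student, teacher, dataloader):
    """L_KD = sum over x in X of L(f_S(x), f_T(x)), here with L = MSE."""
    total = 0.0
    for x in dataloader:                 # x: a batch of inputs
        with torch.no_grad():
            t_out = teacher(x)           # teacher behavior f_T(x), kept frozen
        s_out = student(x)               # student behavior f_S(x)
        total = total + F.mse_loss(s_out, t_out, reduction="sum")
    return total
```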
Table 1 summarizes the main notations used throughout this paper for clarity and ease of reference.

3. Methodology

Many existing prompt tuning methods, including P-tuning v2, occupy an unnecessarily large portion of the input token space owing to their long prompts. Inspired by knowledge distillation methods, we propose a novel prompt compression methodology called P-Distill. This approach compresses the prompts while maintaining performance, thereby increasing the available space for input tokens and enhancing overall model efficiency. To this end, P-Distill comprises two methods: prompt initialization and prompt distillation. Figure 2 shows the learning and compression processes of P-Distill. Our approach involves two main steps: the first trains a teacher model using P-tuning v2, and the second distills knowledge to a student model with shorter prompts, effectively reducing the prompt length.

3.1. Prompt-Based Teacher Learning

When solving downstream tasks using P-tuning v2, we froze the pre-trained weights of the language model and only trained the continuous prompts. Prompt lengths that yield good performance vary according to task complexity. In general, simple classification tasks tend to use shorter prompts, with lengths of around 20, while more complex sequence labeling tasks often require longer prompts, with lengths of around 100 [6]. Understanding the variation in prompt length is crucial, particularly during the inference stage, as longer prompts inherently limit the maximum sequence length that the model can handle.
We trained a teacher model on various tasks based on the P-tuning v2 methodology. This model tokenized input data, $x$, and embedded them into text embeddings, $\bar{x}$. Subsequently, the continuous prompts $P_k^T, P_v^T \in \mathbb{R}^{n_t \times d}$ of the teacher model were randomly initialized and concatenated with the key vectors $K \in \mathbb{R}^{n_x \times d}$ and value vectors $V \in \mathbb{R}^{n_x \times d}$ of each layer. Here, $d$ is the dimensionality of the hidden representations, $n_t$ is the prompt length of the teacher model, and $n_x$ is the length of the token embeddings. The teacher model, which utilizes attention heads incorporating continuous prompts, was trained to take the text embedding $\bar{x}$ as an input and generate the final logits, $y^T$. The parameter optimization of the teacher model was guided by the cross-entropy loss, which is formalized as follows:
$$\mathcal{L}_{CE}^{T} = -\frac{1}{|B|} \sum_{i=1}^{|B|} \log\left(\mathrm{softmax}(y_i^{T})[c_i]\right),$$
where $|B|$ is the number of data points in the current batch, $y_i^T$ is the logits output by the teacher model for the $i$-th data point in the batch, $\mathrm{softmax}(y_i^T)$ is the softmax-transformed probability distribution over the classes, and $c_i$ is the true class index for the $i$-th data point.
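In code, this objective corresponds to the standard batch-averaged cross-entropy over the teacher logits; a brief sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# y_T: teacher logits of shape (|B|, num_classes); c: true class indices (|B|,)
y_T = torch.randn(16, 3)
c = torch.randint(0, 3, (16,))

# Equivalent to -1/|B| * sum_i log(softmax(y_i^T)[c_i])
loss_ce_teacher = F.cross_entropy(y_T, c)
```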

3.2. Prompt-Enhanced Distillation (P-Distill)

We then trained a student model that employs shorter continuous prompts than the teacher model, using the same prompt attachment methodology. During the initial training phase, we initialized the continuous prompts of the student model, $P_k^S, P_v^S \in \mathbb{R}^{n_s \times d}$, based on the teacher model's prompts, $P_k^T$ and $P_v^T$. Subsequently, the student prompts were also attached to the key and value vectors across all layers to compute the attention heads. The length of the student model prompts, denoted by $n_s$, is shorter than that of the teacher model prompts, $n_t$. The student model, denoted by $f^S$, takes the text embedding $\bar{x}$ as input and generates the output logits $y^S$. The teacher and student models share the same underlying language model architecture, differing only in the length and content of their respective prompts. In this context, we focused on distilling the knowledge from the longer teacher model prompts into the shorter student model prompts. To enhance the effectiveness of knowledge transfer, we propose two novel methods for knowledge distillation.

3.2.1. Prompt Initialization

For solving downstream tasks, the model utilizes the attached prompts to generate answers. Starting with the randomly initialized prompts for the model can result in an unstable training process [27]. To mitigate this challenge, the study [28] employed a method for transferring the prompts learned in one task to another task. We aimed to stabilize the training by initializing the student model prompts P k S and P v S based on the teacher model prompts P k T , P v T . We experimented with various prompt initialization methods, including reparameterization, average pooling, and max. pooling, as illustrated in Figure 3. In reparameterization, we employed a reparameterization encoder to adjust the length of the teacher model prompts to that of the student model prompts. For average pooling, we divided the teacher model’s prompts into smaller segments and computed their averages to initialize the student prompts. In max. pooling, we focused on the most prominent features by obtaining the maximum value from each segment of the teacher model’s prompts. Based on the experimental results, we applied the reparameterization encoder to the teacher model’s prompts to construct the student model’s prompts as follows:
$$P_k^{S} = (P_k^{T} \cdot W_k^{T}) + b_k^{T},$$
$$P_v^{S} = (P_v^{T} \cdot W_v^{T}) + b_v^{T},$$
where $W_k^T$ and $W_v^T$ are the learnable weight matrices used to construct the student's prompts, and $b_k^T$ and $b_v^T$ are the corresponding biases. The results of the various prompt initialization experiments are shown in Section 4.6.
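The equations above leave the exact shape of the weight matrices implicit; the sketch below assumes one plausible reading in which a learnable linear map compresses the prompt-length dimension from $n_t$ to $n_s$, alongside the average- and max-pooling alternatives compared in Section 4.6. All names and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_t, n_s, d = 128, 16, 1024            # teacher/student prompt lengths, hidden size
P_k_teacher = torch.randn(n_t, d)      # trained teacher prompt (kept frozen)

# Reparameterization: learnable map over the length dimension, n_t -> n_s
reparam_k = nn.Linear(n_t, n_s)        # plays the role of W_k and b_k
P_k_student = reparam_k(P_k_teacher.T).T           # (n_s, d)

# Average pooling: mean over consecutive segments of length n_t // n_s
P_k_mean = F.avg_pool1d(P_k_teacher.T.unsqueeze(0), kernel_size=n_t // n_s).squeeze(0).T

# Max pooling: keep the most prominent feature within each segment
P_k_max = F.max_pool1d(P_k_teacher.T.unsqueeze(0), kernel_size=n_t // n_s).squeeze(0).T
```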

3.2.2. Prompt Distillation

In this section, we focus on prompt distillation, a key aspect of the proposed approach. Recognizing the influence of continuous prompts on both the hidden-state and the prediction-layer outputs within the model, we employed the following two distillation techniques: prediction-layer and hidden-state distillations. These techniques focused on different aspects of the teacher model’s output to ensure comprehensive knowledge transfer.
Prediction-layer distillation. In this method, the student model learns to emulate the predictions of the teacher model. The student model utilizes soft labels from the teacher model's output, which encapsulate the teacher model's understanding of the data. Specifically, a loss function was used to minimize the difference between the logits $y^S$ and $y^T$ produced by the student and teacher models, respectively. The distillation loss $\mathcal{L}_{pred}$ is formulated as follows:
$$\mathcal{L}_{pred} = \mathrm{KL}\left(\mathrm{softmax}(y_i^{S}/\theta),\ \mathrm{softmax}(y_i^{T}/\theta)\right),$$
where $y_i^S$ and $y_i^T$ are the logit vectors predicted by the student and teacher, respectively, and $\mathrm{KL}$ denotes the Kullback–Leibler divergence, which measures the difference between the probability distributions of the two models. $\theta$ is a temperature hyperparameter that adjusts the smoothness of these distributions, enabling a more nuanced transfer of knowledge from the teacher to the student model. The distillation loss $\mathcal{L}_{pred}$ was then used in the optimization process to update the parameters of the student model, thereby aligning its predictive behavior more closely with that of the teacher model.
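A minimal sketch of this temperature-scaled objective using PyTorch's KL divergence is given below; the variable names are illustrative, and the direction of the KL term follows common distillation practice (teacher distribution as the target).

```python
import torch
import torch.nn.functional as F

def prediction_distillation_loss(y_S, y_T, theta=5.0):
    """Temperature-scaled KL between student and teacher output distributions."""
    log_p_student = F.log_softmax(y_S / theta, dim=-1)
    p_teacher = F.softmax(y_T / theta, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Example with a batch of 16 examples and 3 classes
y_S = torch.randn(16, 3, requires_grad=True)  # student logits
y_T = torch.randn(16, 3)                      # teacher logits (frozen)
loss_pred = prediction_distillation_loss(y_S, y_T)
```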
Hidden-state distillation. In addition, we distilled knowledge from the intermediate representations of the teacher model. The concept of distilling knowledge through intermediate representations was initially introduced by FitNets [29], with the aim of enhancing the training process of the student model. Based on the provided prompts and inputs, we extracted knowledge from the transformer layers of the teacher model and distilled it into the student model. This process was formalized using the loss function $\mathcal{L}_{hidden}$, which is calculated as the MSE between the hidden states $H^S$ and $H^T$ of the student and teacher models, respectively, as follows:
$$\mathcal{L}_{hidden} = \mathrm{MSE}(H^{S}, H^{T}),$$
where the matrices $H^S, H^T \in \mathbb{R}^{n \times d}$ represent the hidden states, $n$ is the input sequence length, and $d$ is the hidden-state dimensionality of the two models.
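This term reduces to a mean-squared error between aligned hidden states; a brief sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# H_S, H_T: hidden states of shape (batch, n, d) from the student and teacher
H_S = torch.randn(8, 128, 1024, requires_grad=True)
H_T = torch.randn(8, 128, 1024)

loss_hidden = F.mse_loss(H_S, H_T)  # L_hidden = MSE(H_S, H_T)
```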

3.3. Distillation-Based Student Learning

While training the student model, the cross-entropy loss was computed similarly to that of the teacher model. This loss serves as a measure of the student model’s accuracy in predicting the true class labels as follows:
$$\mathcal{L}_{CE}^{S} = -\frac{1}{|B|} \sum_{i=1}^{|B|} \log\left(\mathrm{softmax}(y_i^{S})[c_i]\right).$$
The overall loss function $\mathcal{L}_{total}$ for the student model is then a weighted combination of the cross-entropy loss and the distillation losses, as follows:
$$\mathcal{L}_{total} = \lambda_1 \cdot \mathcal{L}_{CE}^{S} + \lambda_2 \cdot \mathcal{L}_{pred} + \lambda_3 \cdot \mathcal{L}_{hidden},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are learnable weighting coefficients constrained such that their sum equals 1. During training, the teacher model parameters were fixed to serve as the source of prior knowledge.
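The sketch below shows one way to assemble this objective while keeping the weights positive and summing to 1 via a softmax parameterization; the exact parameterization of the learnable coefficients is not specified in the paper, so this is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

# Unconstrained parameters; softmax keeps (lambda_1, lambda_2, lambda_3) positive
# and summing to 1, satisfying the stated constraint.
lambda_logits = torch.nn.Parameter(torch.zeros(3))

def total_loss(loss_ce_student, loss_pred, loss_hidden):
    lam = F.softmax(lambda_logits, dim=0)
    return lam[0] * loss_ce_student + lam[1] * loss_pred + lam[2] * loss_hidden
```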

4. Experiments

This section presents the datasets employed in our experiments, the baseline models used for comparison, the results on these datasets, and the analyses from our additional studies.

4.1. Datasets

Our evaluation of the proposed P-Distill method included a comprehensive range of natural language understanding tasks, utilizing datasets that are well-established benchmarks in the field.
We included various tasks from the SuperGLUE benchmark [30], which assesses a model’s understanding and reasoning abilities across different contexts, including BoolQ [31], CB [32], COPA [33], MultiRC [34], ReCoRD [35], RTE [36,37], WiC [38] and WSC [39]. We also utilized the CoNLL-2003 [40], CoNLL-2004 [41], CoNLL-2005 [42], CoNLL-2012 [43], and OntoNotes 5.0 datasets [44], each providing richly annotated text for entity classification. The SQuAD dataset, in its versions 1.1 [45] and 2.0 [46], facilitated testing reading comprehension, requiring the model to parse passages and answer questions with a high degree of understanding. In addition, we extended our experiments to the Allsides dataset [47], which consists of news articles with an average token length of over 1000. This dataset allowed us to examine the effects of P-Distill in scenarios where input tokens are frequently truncated. All these datasets are English, open-source, and used for academic research purposes only. For accurate comparisons, we followed the train, validation, and test set splits as specified in the referenced work [6].

4.2. Baselines

We compared P-Distill against the following methods to validate its competitive performance, with all methods using BERT-large (335 M parameters) as the backbone architecture.
Fine-tuning: all parameters of a PLM are updated and adapted to the given downstream task in a task-specific manner.
P-tuning v2 [6]: this appends trainable, continuous prompts to the key and value matrices of a model, enabling task-specific learning while keeping the model’s pre-trained weights fixed.

4.3. Experimental Details

In our training process, we exclusively trained the continuous prompts while keeping the backbone parameters of the model fixed. For full fine-tuning, we used a batch size of 8 with a learning rate of $2 \times 10^{-5}$. For P-tuning v2, we also used a batch size of 8 but with a higher learning rate of $5 \times 10^{-3}$ and a dropout rate of 0.2. For P-Distill, the model was trained with a batch size of 16, and the learning rate was individually optimized for each task. Furthermore, we employed the AdamW optimizer for training. For the temperature hyperparameter $\theta$ used in the distillation process, we experimentally determined the optimal setting by sweeping across {1, 5, 10}. For the learnable parameter $\lambda_2$, we explored initial values of {0.1, 0.5, 0.9}. Considering the significant impact of the hidden-state loss, we experimented with initial values of {$1 \times 10^{-3}$, $1 \times 10^{-4}$, $1 \times 10^{-5}$} for $\lambda_3$. All experiments were performed using PyTorch 2.0 (https://pytorch.org/, accessed on 21 February 2025) and HuggingFace Transformers [48] on three NVIDIA A100 GPUs, and to ensure consistency in our results, each task was conducted using a fixed random seed.
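For reference, the following sketch mirrors the optimizer setup and hyperparameter grids described above as a sweep skeleton; the training loop itself is omitted, and all values simply restate the text.

```python
import itertools
import torch

# Hyperparameter grids described in the text
temperatures = [1, 5, 10]            # theta for prediction-layer distillation
lambda2_inits = [0.1, 0.5, 0.9]      # initial weight of the prediction-layer loss
lambda3_inits = [1e-3, 1e-4, 1e-5]   # initial weight of the hidden-state loss

def make_optimizer(prompt_params, lr):
    # Only the continuous prompts (and related distillation modules) are trained
    return torch.optim.AdamW(prompt_params, lr=lr)

for theta, l2, l3 in itertools.product(temperatures, lambda2_inits, lambda3_inits):
    pass  # train and evaluate one P-Distill configuration here
```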

4.4. Results

Table 2 and Table 3 present the experimental results of fine-tuning, P-tuning v2, and P-Distill. In P-Distill, the prompt length was compressed to one-eighth of the teacher model's prompt length; when the teacher prompt length was less than eight, it was compressed to a length of 1. For a detailed analysis of prompt compression ratios, refer to Section 4.4.3. Overall, the proposed P-Distill method exhibited comparable or superior performance to the other methods while using shorter prompts.

4.4.1. Results on SuperGLUE

Table 2 shows the performance of each approach on the SuperGLUE benchmark. The experimental results show that despite using shorter prompts, P-Distill matched or exceeded the performance of P-tuning v2, which utilized long prompts. Specifically, P-Distill exhibited a 0.51% performance improvement on average on SuperGLUE tasks. The improvement increased to 2.75% when compared to P-tuning v2 with the same prompt length. These results reveal that P-Distill effectively compressed the prompt length while maintaining or even improving performance.
Additionally, we compared P-Distill with P-tuning v2 across varying prompt sizes, ensuring a fair comparison between simply starting with a short prompt and distillation-based compression. The results confirm that simply using a shorter prompt did not yield the same benefits as P-Distill, which applies structured knowledge transfer from a teacher model to a student model. P-Distill not only achieved competitive or superior performance but did so with significantly compressed prompt lengths, highlighting the effectiveness of the proposed distillation strategy in preserving and even enhancing task-specific knowledge while optimizing efficiency.

4.4.2. Results on Other Tasks

Table 3 presents the experimental results for diverse tasks, including named entity recognition, question answering, semantic role labeling, and long-sequence classification. We first observe that achieving superior performance via P-tuning v2 on these tasks required training with longer prompts (up to 128 tokens), which aligns with the phenomenon reported in previous work [6] where complex tasks tend to require long prompts.
In comparison to the baseline methods, P-Distill achieved comparable performance using significantly shorter prompts even for tasks where P-tuning v2 employed long prompts. In particular, P-Distill outperformed P-tuning v2 with 128 prompt tokens on CoNLL04 while utilizing a prompt of length 16. Furthermore, P-Distill achieved a performance improvement of 2.54% over the prompt of the same length trained using P-tuning v2 and a 0.90% improvement over the teacher model.
Moreover, P-Distill outperformed the baseline methods on the Allsides dataset, which consists of long input instances with an average token length exceeding 1000. Specifically, with prompts compressed to a length of four, P-Distill not only mitigated token truncation issues but also showed a 1.03% performance improvement over the best P-tuning v2 configuration. This highlights the advantage of P-Distill in handling long-sequence tasks by preserving input space with shorter prompts. Additionally, P-Distill exhibited a 1.42% performance improvement compared to P-tuning v2 with a prompt length of four, showing that P-Distill is an effective method for compressing prompt length. A qualitative evaluation on the Allsides dataset is provided in Section 4.9, offering further insights into long-sequence tasks.

4.4.3. Impact of Compression Ratio

To investigate the impact of different compression ratios on the performance of P-Distill, we conducted additional experiments with 2× and 4× compression ratios and compared the results. We used the CoNLL04 dataset in these experiments, as it is known to require longer prompts for sufficient training.
Table 4 presents the performance results for these different compression ratios within CoNLL04. We observed a trade-off between the compression ratio and model performance, as a higher compression ratio led to relatively lower performance. Yet, it is worth noting that a 2× compression with P-Distill did preserve the performance of the original long prompt and that P-Distill still matched the performance of fine-tuning at the compression ratio of 8×.

4.5. Ablation Study

To further verify the effectiveness of the proposed method, we conducted ablation studies using the following variants of P-Distill.
P-Distillinit: This variant omits prompt initialization and relies exclusively on the two types of distillation losses designed to transfer knowledge from the teacher to the student model in different ways.
P-Distillpred: This approach does not implement prediction-layer distillation loss. Following the application of the prompt initialization method, it trains the student model based on hidden-state distillation loss. This method aligns the internal representations of the student model with those of the teacher model without focusing on the final output predictions.
P-Distillhidden: This variant does not consider the differences between the hidden-state outputs of the teacher and student models. Instead, it focuses on training based on the differences in the prediction-layer outputs. This approach aligns the final predictions of the student model closely with those of the teacher model without directly focusing on their internal representations.
Results. Figure 4 shows that all three variants of P-Distill underperformed compared to the original P-Distill across various tasks. This shows that each component of P-Distill contributes to the performance achieved on downstream tasks.
Given these results, we further observe that the extent of degradation varied among the variants. First, P-Distillinit exhibited the most significant performance degradation across various tasks. Although conducting only prediction-layer and hidden-state distillation without prompt initialization still improved performance over P-tuning v2, the degradation relative to the other variants and the original P-Distill was clear. This indicates that prompt initialization based on the teacher model's prompts is crucial in prompt-based knowledge distillation. Second, P-Distillhidden and P-Distillpred also exhibited decreased prediction performance. This demonstrates that integrating prompt initialization with both hidden-state and prediction-layer distillation enhances the stability and effectiveness of knowledge distillation.

4.6. Impact of Prompt Initialization

To inspect the impact of different prompt initialization methods within P-Distill, we conducted experiments to compare the performance of P-Distill with two variants: P-Distillmean, which initializes the student model prompts using an average-pooling layer over the teacher model prompts, and P-Distillmax, which uses a max.-pooling layer for the same purpose.
The results, as detailed in Table 5, demonstrate that both P-Distillmean and P-Distillmax underperformed compared to P-Distill, which utilizes a reparameterization encoder for prompt initialization. We conjecture that the use of average pooling and max. pooling led to an excessive simplification of the teacher model’s prompts, resulting in the loss of crucial nuances and complexities. Conversely, the reparameterization encoder for prompt initialization effectively captured and transferred the complex knowledge of the teacher model prompts without the loss of crucial task information. This suggests that the reparameterization encoder is a more suitable method for prompt initialization in P-Distill, contributing significantly to the overall effectiveness of the knowledge distillation process.

4.7. Experimental Results of Applying P-Distill to P-Tuning Methodology

To evaluate the effectiveness of P-Distill when the prompts directly occupied the input sequence space, we applied P-Distill to the P-tuning methodology that attached prompts to the input embeddings. The experimental results are shown in Table 6.
These results demonstrate the effectiveness of applying P-Distill to the P-tuning methodology. When P-Distill was used, it showed higher performance across all datasets compared to using prompts of the same length without P-Distill. Particularly, on the CB and COPA datasets, P-Distill achieved the same performance as the teacher prompts despite compressing the prompts to one-eighth. These results indicate that P-Distill effectively compressed the prompt length while maintaining performance, even when the prompts were attached to the input embeddings.

4.8. Inference Costs

To examine the benefits of P-Distill at the inference stage, we compared the inference costs of P-Distill, fine-tuning, and P-tuning v2. The resulting computational requirements are quantified in Table 7, which presents the GFLOPs required during inference for average-length samples from each task in the SuperGLUE benchmark.
While both P-tuning v2 and P-Distill incurred more inference GFLOPs than fine-tuning due to the inclusion of prompts, P-Distill added fewer computations than P-tuning v2. The difference was more pronounced in cases where long prompts were utilized in P-tuning v2. This demonstrates the advantage of compressing prompt length in terms of lowering computational costs.

4.9. Qualitative Analysis in Long-Sequence Classification

To further examine the effectiveness of expanding the input space by using shorter prompts via P-Distill, we conducted a qualitative analysis on the Allsides dataset. The task was to predict the political perspective inherent in a news article, and the dataset consists of news articles that generally exceed the model's input capacity of 512 tokens.
We compared the input text and prediction results of P-tuning v2, trained with 32 prompt tokens, to those of P-Distill, which appended only four compressed prompt tokens; the results are shown in Table 8. Note that while P-tuning v2 could utilize up to 480 input tokens after attaching 32 prompt tokens, P-Distill extended the input capacity to 508 tokens with only four prompt tokens appended.
We observed that the P-tuning v2 model could not access the detailed explanation of the Fairness Doctrine due to its limited input space. In contrast, P-Distill could access the remainder of the sentence, namely 'broadcasters provide "equal time" to divergent political views', which provided the key information for the model to make an accurate prediction: 'Center'. This example verifies that the input space preserved by compressing the prompt length with P-Distill can contribute to accurate predictions, leading to better overall performance.

5. Conclusions

In this paper, we introduce P-Distill, a novel approach in NLP that uses two knowledge distillation techniques to compress unnecessarily long prompts while maintaining or enhancing performance. This approach combines prompt initialization with two types of prompt distillation to effectively transfer knowledge from a teacher model with longer prompts to a student model with prompts that are eight times shorter. To evaluate the efficacy of the proposed method, we conducted experiments across various NLP tasks. Our results demonstrate that, using prompts of the same length, the proposed method achieved an average improvement of 2.75% over the existing prompt-tuning method across the SuperGLUE benchmark. Furthermore, P-Distill exhibited competitive performance even against models trained with prompts that were eight times longer.

6. Limitations

One limitation of this study is that we evaluated our method only on the BERT architecture. Conducting additional experiments on other architectures could be beneficial to determine the generalizability of our findings. Additionally, while our model improves performance through the process of training a teacher model and transferring its knowledge, it incurs more time and cost compared to previous methods. Furthermore, since P-Distill compresses task-specific prompts, direct cross-task transferability remains limited without re-optimization, similar to other prompt-tuning approaches. In future work, we plan to develop an approach that integrates the training of the teacher model and the knowledge distillation process in an end-to-end manner.

Author Contributions

Conceptualization, K.-M.K. and H.-S.W.; methodology, H.-S.W.; software, H.-S.W.; validation, H.-S.W. and J.-Y.C.; formal analysis, H.-S.W. and J.-Y.C.; investigation, H.-S.W. and J.-Y.C.; resources, H.-S.W. and J.-Y.C.; data curation, H.-S.W. and J.-Y.C.; writing—original draft preparation, H.-S.W., J.-Y.C. and N.Z.; writing—review and editing, H.-S.W., J.-Y.C., N.Z. and D.A.; visualization, H.-S.W.; supervision, K.-M.K. and D.A.; project administration, K.-M.K.; funding acquisition, K.-M.K. and H.-S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Basic Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea Government [Ministry of Science and ICT (MSIT)] (No. 2022R1C1C1010317), and by the Seoul R&BD Program (CC240003) through the Seoul Business Agency (SBA) funded by the Seoul Metropolitan Government.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code is available from the corresponding author upon request.

Conflicts of Interest

Author Joon-Young Choi was employed by the company Danggeun Market Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  2. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  3. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  4. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. arXiv 2019, arXiv:1902.00751. [Google Scholar] [CrossRef]
  5. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  6. Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; Tang, J. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 61–68. [Google Scholar] [CrossRef]
  7. Chen, S.; Wong, S.; Chen, L.; Tian, Y. Extending Context Window of Large Language Models via Positional Interpolation. arXiv 2023, arXiv:2306.15595. [Google Scholar] [CrossRef]
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  9. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  10. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. arXiv 2023, arXiv:2103.10385. [Google Scholar] [CrossRef]
  11. Jiang, Z.; Xu, F.F.; Araki, J.; Neubig, G. How Can We Know What Language Models Know? Trans. Assoc. Comput. Linguist. 2020, 8, 423–438. [Google Scholar] [CrossRef]
  12. Shin, T.; Razeghi, Y.; IV, R.L.L.; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. arXiv 2020, arXiv:2010.15980. [Google Scholar] [CrossRef]
  13. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4582–4597. [Google Scholar] [CrossRef]
  14. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4163–4174. [Google Scholar] [CrossRef]
  15. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2158–2170. [Google Scholar] [CrossRef]
  16. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
  17. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  18. Wang, Y.; Wang, J.; Zhang, X. Parameter-efficient online knowledge distillation for pretrained language models. Expert Syst. Appl. 2025, 265, 126040. [Google Scholar] [CrossRef]
  19. Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. PanDa: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation. IEEE Trans. Knowl. Data Eng. 2024, 36, 4835–4848. [Google Scholar] [CrossRef]
  20. Zhang, J.; Muhamed, A.; Anantharaman, A.; Wang, G.; Chen, C.; Zhong, K.; Cui, Q.; Xu, Y.; Zeng, B.; Chilimbi, T.; et al. ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1128–1136. [Google Scholar] [CrossRef]
  21. Calderon, N.; Mukherjee, S.; Reichart, R.; Kantor, A. A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 14632–14659. [Google Scholar] [CrossRef]
  22. Zhou, W.; Wang, Y.; Qian, X. Knowledge Distillation and Contrastive Learning for Detecting Visible-Infrared Transmission Lines Using Separated Stagger Registration Network. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 1–13. [Google Scholar] [CrossRef]
  23. Wang, J.; Chen, Y.; Zheng, Z.; Li, X.; Cheng, M.M.; Hou, Q. CrossKD: Cross-Head Knowledge Distillation for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16520–16530. [Google Scholar]
  24. Lan, Q.; Tian, Q. Gradient-Guided Knowledge Distillation for Object Detectors. arXiv 2023, arXiv:2303.04240. [Google Scholar] [CrossRef]
  25. Lan, Q.; Tian, Q. Instance, Scale, and Teacher Adaptive Knowledge Distillation for Visual Detection in Autonomous Driving. IEEE Trans. Intell. Veh. 2023, 8, 2358–2370. [Google Scholar] [CrossRef]
  26. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  27. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3045–3059. [Google Scholar] [CrossRef]
  28. Vu, T.; Lester, B.; Constant, N.; Al-Rfou’, R.; Cer, D. SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5039–5059. [Google Scholar] [CrossRef]
  29. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. arXiv 2015, arXiv:1412.6550. [Google Scholar] [CrossRef]
  30. Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv 2019, arXiv:1905.00537. [Google Scholar] [CrossRef]
  31. Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2924–2936. [Google Scholar] [CrossRef]
  32. De Marneffe, M.C.; Simons, M.; Tonhauser, J. The CommitmentBank: Investigating Projection in Naturally Occurring Discourse. 2019. To Appear in Proceedings of Sinn und Bedeutung 23. Available online: https://github.com/mcdm/CommitmentBank/ (accessed on 20 January 2025).
  33. Roemmele, M.; Bejan, C.A.; Gordon, A.S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 2011 AAAI Spring Symposium Series, Stanford, CA, USA, 21–23 March 2011. [Google Scholar]
  34. Khashabi, D.; Chaturvedi, S.; Roth, M.; Upadhyay, S.; Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 252–262. [Google Scholar]
  35. Zhang, S.; Liu, X.; Liu, J.; Gao, J.; Duh, K.; Durme, B.V. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. arXiv 2018, arXiv:1810.12885. [Google Scholar]
  36. Dagan, I.; Glickman, O.; Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment; Springer: Berlin/Heidelberg, Germany, 2006; pp. 177–190. [Google Scholar]
  37. Bar Haim, R.; Dagan, I.; Dolan, B.; Ferro, L.; Giampiccolo, D.; Magnini, B.; Szpektor, I. The Second PASCAL Recognising Textual Entailment Challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy, 10–12 April 2006; Volume 7, pp. 785–794. [Google Scholar]
  38. Pilehvar, M.T.; Camacho-Collados, J. WiC: The Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  39. Levesque, H.J.; Davis, E.; Morgenstern, L. The Winograd schema challenge. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, Stanford, CA, USA, 21–23 March 2011; Volume 46, p. 47. [Google Scholar]
  40. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147. [Google Scholar]
  41. Carreras, X.; Màrquez, L. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, Boston, MA, USA, 6–7 May 2004; pp. 89–97. [Google Scholar]
  42. Carreras, X.; Màrquez, L. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, MI, USA, 29–30 June 2005; Dagan, I., Gildea, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 152–164. [Google Scholar]
  43. Pradhan, S.; Moschitti, A.; Xue, N.; Uryupina, O.; Zhang, Y. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Proceedings of the Joint Conference on EMNLP and CoNLL - Shared Task, Jeju Island, Republic of Korea, 12–14 July 2012; Pradhan, S., Moschitti, A., Xue, N., Eds.; Ninth Conference on Computational Natural Language Learning. pp. 1–40. [Google Scholar]
  44. Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. Ontonotes release 5.0 ldc2013t19. Linguist. Data Consort. 2013, 23, 170. [Google Scholar]
  45. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Su, J., Duh, K., Carreras, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
  46. Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, 15–20 July 2018; Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 784–789. [Google Scholar] [CrossRef]
  47. Li, C.; Goldwasser, D. Encoding Social Information with Graph Convolutional Networks for Political Perspective Detection in News Media. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2594–2604. [Google Scholar] [CrossRef]
  48. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Liu, Q., Schlangen, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 38–45. [Google Scholar] [CrossRef]
Figure 1. Performance variation in P-tuning v2 across tasks based on length of continuous prompts.
Figure 2. (a) An illustration of P-tuning v2 [6]. (b) An illustration of the proposed method, denoted as P-Distill. This method trains a teacher model to generate concise and effective prompts, followed by distilling the knowledge into a student model.
Figure 3. Illustration of various prompt initialization methods.
Figure 4. Comparison of ablation study results across various tasks, with different colors and bar styles representing distinct variants of P-Distill.
Table 1. Notation Table.

Symbol | Description
$x$ | Input sequence or data point
$Q, K, V$ | Query, key, and value matrices in the attention mechanism
$d_k$ | Dimensionality of key vectors; scaling factor for gradient stabilization
$\mathrm{softmax}$ | Softmax function used for normalization in attention
$P_k, P_v$ | Continuous prompts added to key and value vectors
$n_p$ | Length of continuous prompts
$d$ | Dimensionality of hidden states and key/value vectors
$f^T, f^S$ | Behavior functions of teacher and student models in knowledge distillation
$\mathcal{L}_{KD}$ | Knowledge distillation loss function
$X$ | Dataset used for training or evaluation
$\mathcal{L}$ | Loss function (e.g., KL divergence, MSE)
$n_t$ | Prompt length of teacher model
$n_s$ | Prompt length of student model
$W_k, W_v$ | Learnable weight matrices for prompt initialization
$b_k, b_v$ | Bias terms for prompt initialization
$\mathcal{L}_{CE}$ | Cross-entropy loss function
$\mathcal{L}_{pred}$ | Prediction-layer distillation loss
$\mathcal{L}_{hidden}$ | Hidden-state distillation loss
$\lambda_1, \lambda_2, \lambda_3$ | Weighted coefficients for loss combination
Table 2. Experimental results on the SuperGLUE validation dataset. For P-Distill, training was performed using a teacher model with the prompt length exhibiting the best performance for P-tuning v2. The numbers in parentheses indicate the lengths of the prompts attached to the model. (Acc.: accuracy; bold: the best; underline: the second best).

Method | BoolQ Acc. | CB Acc. | COPA Acc. | MultiRC F1a | ReCoRD F1 | RTE Acc. | WiC Acc. | WSC Acc. | Average
Fine-tuning | 0.777 | 0.946 | 0.710 | 0.705 | 0.706 | 0.762 | 0.746 | 0.683 | 0.754
P-tuning v2 | 0.764 (8) | 0.946 (32) | 0.810 (4) | 0.711 (16) | 0.728 (16) | 0.794 (4) | 0.756 (4) | 0.731 (16) | 0.780
P-tuning v2 | 0.738 (1) | 0.929 (4) | 0.790 (1) | 0.707 (2) | 0.721 (2) | 0.783 (1) | 0.745 (1) | 0.692 (2) | 0.763
P-Distill | 0.776 (1) | 0.964 (4) | 0.810 (1) | 0.718 (2) | 0.726 (2) | 0.798 (1) | 0.759 (1) | 0.721 (2) | 0.784
Table 3. Experimental results for each method on named entity recognition (NER), question answering (QA), semantic role labeling (SRL), and long-sequence classification (SC). For P-Distill, training was performed using a teacher model with the prompt length exhibiting the best performance for P-tuning v2. The numbers in parentheses indicate the lengths of the prompts attached to the model. All metrics are reported as F1-scores. (bold: the best; underline: the second best).

Method | CoNLL03 (NER) | CoNLL04 (NER) | OntoNotes 5.0 (NER) | CoNLL05 Brown (SRL) | CoNLL05 WSJ (SRL) | SQuAD 1.1 Dev (QA) | SQuAD 2.0 Dev (QA) | Allsides (SC)
Fine-tuning | 0.928 | 0.882 | 0.890 | 0.827 | 0.885 | 0.911 | 0.819 | 0.780
P-tuning v2 | 0.919 (64) | 0.880 (128) | 0.885 (128) | 0.837 (32) | 0.890 (128) | 0.902 (64) | 0.782 (128) | 0.775 (32)
P-tuning v2 | 0.914 (8) | 0.866 (16) | 0.881 (16) | 0.807 (4) | 0.877 (16) | 0.891 (8) | 0.771 (16) | 0.772 (4)
P-Distill | 0.919 (8) | 0.888 (16) | 0.886 (16) | 0.817 (4) | 0.885 (16) | 0.896 (8) | 0.775 (16) | 0.783 (4)
Table 4. Comparison of P-Distill performance across varying prompt compression ratios. Bold values indicate the best performance.

CoNLL05 WSJ | Fine-Tuning | P-Tuning v2 | P-Distill (2×) | P-Distill (4×) | P-Distill (8×)
F1 | 0.885 | 0.890 (128) | 0.890 (64) | 0.888 (32) | 0.885 (16)
Table 5. Comparison of additional experiment results across various tasks based on prompt initialization methods. All metrics are reported as micro-F1-scores. (bold: the best).

Method | CoNLL03 | CoNLL04 | CoNLL05 WSJ | CoNLL05 Brown
P-Distill | 0.919 | 0.888 | 0.885 | 0.817
P-Distillmean | 0.915 | 0.875 | 0.878 | 0.809
P-Distillmax | 0.912 | 0.872 | 0.872 | 0.803
Table 6. Experimental results on the SuperGLUE validation dataset for small datasets (CB, COPA, RTE). For P-Distill, training was performed using a teacher model with the prompt length exhibiting the best performance for P-tuning. The numbers in parentheses indicate the lengths of the prompts attached to the model. All metrics are accuracy.

Method | CB | COPA | RTE
Fine-tuning | 0.946 | 0.71 | 0.762
P-tuning | 0.821 (16) | 0.76 (16) | 0.657 (16)
P-tuning | 0.786 (2) | 0.70 (2) | 0.621 (4)
P-Distill | 0.821 (2) | 0.76 (2) | 0.646 (4)
Table 7. Comparing GFLOPs of baseline methods and P-Distill on SuperGLUE using BERT-large.

Task | Fine-Tuning | P-Tuning v2 | P-Distill
BoolQ | 89.06 | 89.17 (8) | 89.07 (1)
CB | 53.94 | 54.22 (32) | 53.97 (4)
COPA | 21.82 | 21.83 (4) | 21.82 (1)
MultiRC | 226.94 | 227.50 (16) | 227.01 (2)
ReCoRD | 163.12 | 163.53 (16) | 163.17 (2)
RTE | 42.78 | 42.81 (4) | 42.79 (1)
WiC | 18.83 | 18.84 (4) | 18.83 (1)
WSC | 23.11 | 23.17 (4) | 23.11 (1)
Table 8. Example of P-tuning v2 and P-Distill predictions on Allsides dataset.
P-tuning v2 Input Text (480 tokens): President Trump on Wednesday lashed out over a critical news report and escalated his previous attacks on the media by suggesting that news organizations he disagrees with be shut down, alarming free-speech advocates who compared the tactics to intimidation efforts by the Nixon administration.
[…]
Last week, angered by the ongoing investigations into his campaign’s ties to Russia, Trump suggested that the Senate Intelligence Committee investigate news outlets over “fake news”. Over the weekend, he expressed disdain at late-night television hosts over their “anti-Trump” material and proposed bringing back the Fairness Doctrine, a rule phased out in 1987 that had required
Prediction: Left
P-Distill Input Text (508 tokens): President Trump on Wednesday lashed out over a critical news report and escalated his previous attacks on the media by suggesting that news organizations he disagrees with be shut down, alarming free-speech advocates who compared the tactics to intimidation efforts by the Nixon administration.
[…]
Last week, angered by the ongoing investigations into his campaign’s ties to Russia, Trump suggested that the Senate Intelligence Committee investigate news outlets over “fake news”. Over the weekend, he expressed disdain at late-night television hosts over their “anti-Trump” material and proposed bringing back the Fairness Doctrine, a rule phased out in 1987 that had required broadcasters to provide “equal time” for divergent political views on certain issues. First Amendment advocates roundly condemned the president over his remarks, calling them an assault
Prediction: Center
