Discriminator-Enhanced Knowledge-Distillation Networks
Round 1
Reviewer 1 Report (Previous Reviewer 4)
Dear Editor,
The manuscript can be accepted in its present form.
Author Response
Dear Reviewer,
I wish to express my profound gratitude for your insightful suggestions and constructive feedback on our manuscript. Your expert critique has proven invaluable and has served to substantially improve the quality of our work.
I have undertaken a meticulous proofreading of the manuscript and have rectified several syntactic errors that were identified. These revisions have certainly enhanced the clarity and coherence of the paper.
Once again, I wish to extend my sincerest thanks for your valuable comments. Your time and effort in reviewing our manuscript are deeply appreciated, and your expertise has greatly contributed to the refinement of our work.
Best Regards
Author Response File: Author Response.pdf
Reviewer 2 Report (Previous Reviewer 3)
Congratulations to the authors; they have responded well to my previous comments and suggestions.
The authors should have the manuscript proofread by professional editors, as I can still find some minor errors.
Author Response
Dear Reviewer,
I wish to express my profound gratitude for your insightful suggestions and constructive feedback on our manuscript. Your expert critique has proven invaluable and has served to substantially improve the quality of our work.
Upon thorough review of your remarks, I have undertaken a meticulous proofreading of the manuscript and have rectified several syntactic errors that were identified. These revisions have certainly enhanced the clarity and coherence of the paper.
Once again, I wish to extend my sincerest thanks for your valuable comments. Your time and effort in reviewing our manuscript are deeply appreciated, and your expertise has greatly contributed to the refinement of our work.
Best Regards,
Reviewer 3 Report (Previous Reviewer 1)
This paper proposed the Discriminator enhanced Knowledge Distillation (DisKD) framework for Query Auto-Completion tasks. A discriminator LSTM was used to distinguish between the outputs generated by the teacher and the student networks, which provided an additional loss term to the overall training objective. The paper provided some evidence demonstrating the effectiveness of this change. I reviewed the previous submission and recommended acceptance. My decision remains the same for this resubmission.
Meets the standard.
Author Response
Dear Reviewer,
I wish to express my profound gratitude for your insightful suggestions and constructive feedback on our manuscript. Your expert critique has proven invaluable and has served to substantially improve the quality of our work.
Upon thorough review of your remarks, I have undertaken a meticulous proofreading of the manuscript and have rectified several syntactic errors that were identified. These revisions have certainly enhanced the clarity and coherence of the paper.
Once again, I wish to extend my sincerest thanks for your valuable comments. Your time and effort in reviewing our manuscript are deeply appreciated, and your expertise has greatly contributed to the refinement of our work.
Best Regards,
Reviewer 4 Report (New Reviewer)
The manuscript needs the following improvements:
1) The introduction section must end with the research objectives, while the core contributions must be moved to the conclusion section.
2) There must be a separate subsection on limitations.
3) Minor corrections are needed, such as changing the Section 2 title from "Relate Work" to "Related Work".
The manuscript needs some improvements as stated.
Author Response
Dear Reviewer,
I wish to express my profound gratitude for your insightful suggestions and constructive feedback on our manuscript. Your expert critique has proven invaluable and has served to substantially improve the quality of our work.
Upon thorough review of your remarks, I have undertaken a meticulous proofreading of the manuscript and have rectified several syntactic errors that were identified. These revisions have certainly enhanced the clarity and coherence of the paper.
(1) The introduction section must end with the research objectives while the core contributions must be moved to the conclusion section.
I rewrote the two parts as follows.
- research objectives:
This study aims to develop a novel methodology that improves the efficiency and speed of language models by integrating a whole-sentence evaluation module based on the discriminator concept from adversarial neural networks. The core objectives of the present research are:
(a) To shift the analysis of sentence quality toward a holistic, sentence-level evaluation, diverging from conventional pairwise methodologies, in order to rectify the over-correction issues observed in traditional models.
(b) To formulate a novel Knowledge Distillation framework that yields a more compact, efficient, and accurate student network, achieved via a distillation scaffold that incorporates an additional discriminator.
(c) To capitalize on the discriminator's signal by incorporating policy gradients from reinforcement learning, thereby overcoming the constraints of back-propagating discrete signals in Natural Language Processing tasks.
- contribution:
Our experimental results on the QAC task demonstrate that DisKD significantly outperforms baseline methods: the distilled two-layer student model even surpasses the six-layer teacher model. Moreover, our method is easy to optimize and can be combined with other methods to consistently improve performance. Our work makes the following main contributions.
(a) We propose DisKD, a novel Discriminator enhanced Knowledge Distillation framework, which can both enhance model accuracy and reduce parameter size.
(b) To make the discriminator's signal suitable for generation tasks, we introduce an easy-to-implement discriminator loss, since the discrete signal from the discriminator is not directly differentiable.
(c) Our approach leverages the discriminator's loss as an evaluation signal for the entire sentence. This effectively mitigates the over-correction problem while reducing the model size nearly threefold and improving inference speed by the same factor. Furthermore, the distilled model exceeds the performance of the original GPT-2 model in terms of Mean Reciprocal Rank (MRR).
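To make the combined objective concrete, below is a minimal sketch of how such a discriminator-enhanced distillation loss could be assembled in PyTorch. The function name, loss weights, and temperature are our assumptions for illustration; this is not the paper's released implementation.

```python
import torch.nn.functional as F

def diskd_total_loss(student_logits, teacher_logits, hard_loss,
                     disc_score_student, alpha=0.5, beta=0.1, T=2.0):
    """Hypothetical DisKD-style objective (weights and names are ours)."""
    # (i) Soft-label distillation: match the teacher's softened distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # (ii) Sentence-level discriminator term: disc_score_student is a sigmoid
    # output in (0, 1); minimizing (1 - score) rewards outputs the
    # discriminator judges "teacher-like".
    disc = (1.0 - disc_score_student).mean()
    return hard_loss + alpha * kd + beta * disc
```

The key design point is that the discriminator term operates on a single sentence-level score rather than per-token probabilities, which is what allows it to counteract over-correction.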
(2) There must be a separate subsection on limitations.
I adjusted the structure of the article and added this paragraph.
Limitations and Future Work
The primary limitation of Dis-KD, as observed in this study, is its relatively slow inference speed during query auto-completion. This could impact its ability to provide real-time suggestions to users, especially in high-volume or time-sensitive environments. Efforts to improve the computational efficiency of Dis-KD's inference would therefore likely improve its practical utility and broaden its appeal across various domains. In specific large-scale Internet applications, such as Alibaba and LinkedIn, deep learning methodologies have not yet been extensively deployed in actual production. Notably, the proposed method is approximately 145 times slower at inference than the Most Commonly Generated (MCG) approach.
Our investigation paves the way for further research in several directions. For example, while our results are encouraging, we have not explored extending our methodology to information beyond the immediate query. Incorporating elements such as users' behavioral history could yield additional insights into user preferences and interests, extending beyond the context available immediately prior to a search.
(3) Minor corrections are needed, such as changing the Section 2 title from "Relate Work" to "Related Work".
The Section 2 title had been mistyped; I have corrected it to "Related Work".
Once again, I wish to extend my sincerest thanks for your valuable comments. Your time and effort in reviewing our manuscript are deeply appreciated, and your expertise has greatly contributed to the refinement of our work.
Best Regards,
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
This paper proposed the Discriminator enhanced Knowledge Distillation (DisKD) framework for Query Auto-Completion tasks. A discriminator LSTM was used to distinguish between the outputs generated by the teacher and the student networks, which provided an additional loss term to the overall training objective. The paper provided some evidence demonstrating the effectiveness of this change. I have the following comments and questions for the authors regarding the technical details of this paper:
1. Equations (4) and (5) seem unclear. What is the dimension of "s"? My understanding is that a discriminator usually outputs a scalar. But in (5), "s" goes through a softmax layer.
2. Please avoid the overuse of notation “t” in (2), (3), and (5).
3. Although references are provided for equations (5), please provide explanation and description for equation (5) to complete the proposed algorithm.
4. The architecture details of the discriminator LSTM are missing. Also, insufficient details are given about how this LSTM and the student network were trained alternately.
Author Response
As for your questions, I marked the revisions in blue in the PDF and annotated them as Reviewer 1.
Dear Reviewer:
Thank you for taking the time to review my work and for pointing out the mistakes. I apologize for any confusion they may have caused. After carefully reviewing your feedback, I found that I had made many errors in Section 3.1 (details of the model), so I have almost completely rewritten that section. I hope this revision makes the problems I describe easier to understand.
1. Equations (4) and (5) seem unclear. What is the dimension of "s"? My understanding is that a discriminator usually outputs a scalar. But in (5), "s" goes through a softmax layer.
After carefully reviewing your feedback, I agree that the value in question should be a scalar rather than what I had previously used. I have rewritten the entire section, and the corrected value is now the result of applying the sigmoid function, as shown in equations (5) and (6), as you suggested.
2. Please avoid the overuse of notation "t" in (2), (3), and (5).
I have revised the example of sequence representation and reduced the overuse of the notation "t".
Previously, "t" denoted the parameters used from the teacher, but now the notation is used only where necessary to avoid confusion.
Furthermore, I have revised all formulas once again to improve their clarity.
With the improved clarity of formulas (1) through (8) and their descriptions, and the reduced use of the notation "t," I believe the presentation is now much easier to follow.
3. Although references are provided for equation (5), please provide an explanation and description of equation (5) to complete the proposed algorithm.
I would like to clarify that the equation previously labeled as (5) is now labeled as (6) in the new version. I have also added detailed notes after the equation to provide more clarity. Additionally, I have included equation (7) to explain how rewards help to update the student model.
In detail, the training of the proposed framework involves two alternating steps of backpropagation. In the first step, the discriminator is optimized as a binary classifier. Once the discriminator is trained, equation (6) is used to score the student's output and the teacher's output. We then return to training the discriminator, and this cycle iterates several times.
4. The architecture details of the discriminator LSTM are missing. Also, insufficient details are given about how this LSTM and the student network were trained alternately.
Our novel distillation framework incorporates a GAN-style discriminator (an LSTM network) to enhance the training of the student network. The LSTM only needs to output 0 or 1, labeling whether an output embedding comes from the student's model or the teacher's model.
The learning procedure of the model comprises three stages, which are repeated multiple times: learning, distinguishing, and cheating. The first stage is divided into two parts: initially learning in the traditional knowledge distillation manner, and subsequently learning with the additional discriminator loss. In the second stage, the discriminator distinguishes whether a generated sequence comes from the teacher or the student, evaluating it and outputting 0 or 1. In the third stage, the updated student model attempts to deceive the discriminator, improving the generator's performance through equation (7).
This approach can be considered a hybrid of learning and cheating, as the generator learns from the discriminator and then seeks to outwit it.
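As an illustration of this learning/distinguishing/cheating cycle, here is a self-contained toy in PyTorch. The encoders, shapes, optimizers, and schedule are invented stand-ins, not the paper's actual setup; in the real framework the student would also minimize the usual distillation losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = 16
teacher = nn.Linear(8, emb)   # stand-in encoders purely for illustration
student = nn.Linear(8, emb)
disc = nn.Sequential(nn.Linear(emb, 32), nn.ReLU(),
                     nn.Linear(32, 1), nn.Sigmoid())
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    x = torch.randn(4, 8)                 # a toy batch
    t_out = teacher(x).detach()           # "teacher" embeddings
    s_out = student(x)                    # "student" embeddings

    # Distinguishing: train the discriminator as a binary classifier
    # (teacher -> 1, student -> 0).
    d_loss = bce(disc(t_out), torch.ones(4, 1)) + \
             bce(disc(s_out.detach()), torch.zeros(4, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Cheating: the student raises its discriminator score by minimizing
    # (1 - score), alongside whatever distillation losses are used.
    s_loss = (1.0 - disc(s_out)).mean()
    s_opt.zero_grad(); s_loss.backward(); s_opt.step()
```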
Thank you again for bringing these to my attention, and please let me know if there is anything else that requires further attention.
Author Response File: Author Response.pdf
Reviewer 2 Report
Following are my comments on the paper:
> Paper needs copy editing.
> No domain background, no introduction of the problem, no problem statement, and no reason for this research in the abstract.
> The problem is not well explained in the introduction section. The sequence is also not adequate. The authors should explain the problem, build their problem statement on the basis of the limitations of the existing work, and then explain their model to overcome such limitations.
> Authors should be consistent while writing abbreviations, e.g. "large language models (LLMs)" in lowercase while "Natural Language Processing (NLP)" is capitalized; the same pattern should be used throughout the document. Also, abbreviations should be given in brackets and full terms outside the brackets, e.g. "queries per second (QPS)".
> Once an abbreviation is defined, authors should not use the full term again, as in line 22 for natural language processing, line 29 for LLM, and other places.
> Contributions should be written explicitly in bullets.
> Proper use of punctuation is required, e.g. in section 1, paragraph 1, "However the deployability" should be "However, the deployability", similar in other places.
> References are missing for "BERT and ERNIE models".
> Related work section is not up to date, as I have observed only a few papers from 2019-2023, while most of the references should be from this timeline.
> Figure 1 is an abstract view of the model and doesn't explain the process in detail.
> Section 3 seems to be a combination of equations rather than the process to solve the problem. Authors should explain their process which solves the problem and emphasize the steps with the help of examples.
> In section 4.1, authors should use a table to summarize the datasets so that readers can easily grasp the details.
> What are the evaluation parameters? None explained.
> Compared approaches are too old; MCG is from 2020 and the rest are older than that.
Author Response
Dear Reviewer,
I am writing to provide you with updates on the revisions made to the DisKD paper, based on your valuable feedback.
- Paper needs copy editing.
I have thoroughly edited the paper for clarity, and marked the areas where significant modifications have been made. I hope that the current version of the paper is much clearer than the previous one.
- No domain background, no introduction of the problem, no problem statement, and no reason for this research in the abstract.
I have rewritten the abstract to include the domain background, problem statement, and reason for this research. The rewritten portion is color-coded for easy identification.
- The problem is not well explained in the introduction section. The sequence is also not adequate. The authors should explain the problem, build their problem statement on the basis of the limitations of the existing work, and then explain their model to overcome such limitations.
I have rearranged the introduction section to better explain the problem and the limitations of existing work.
- Authors should be consistent while writing abbreviations, e.g. "large language models (LLMs)" in lowercase while "Natural Language Processing (NLP)" is capitalized; the same pattern should be used throughout the document. Also, abbreviations should be given in brackets and full terms outside the brackets, e.g. "queries per second (QPS)".
I have checked the consistency of abbreviations used throughout the paper and have made corrections where necessary. I have also followed the correct format.
- Contributions should be written explicitly in bullets.
I have rearranged the contributions section and presented it in bullet points as per your request.
- Proper use of punctuation is required, e.g. in section 1, paragraph 1, "However the deployability" should be "However, the deployability", similar in other places.
I have checked and corrected the punctuation errors in the paper, such as the missing comma in "However, the deployability".
- References are missing for "BERT and ERNIE models".
I added the citation and checked the references elsewhere in the article.
- Related work section is not up to date, as I have observed only a few papers from 2019-2023, while most of the references should be from this timeline.
I have added a new reference from 2022, called Topic-LSTM QAC, to the related work section. However, as there are not many recent studies in this area, my research still holds significance for anonymous user recommendations.
- Figure 1 is an abstract view of the model and doesn't explain the process in detail.
I have added a detailed process description in Figure 1 to explain the model in more detail.
- Section 3 seems to be a combination of equations rather than the process to solve the problem. Authors should explain their process which solves the problem and emphasize the steps with the help of examples.
As you noted, there were several errors in the formulation of the formulas, which I deeply regret. I take full responsibility for the mistake and want to assure you that I have taken steps to rectify the issue. Specifically, I have rearranged all the formulas from 1 through 8, and I am confident that the newly collated formulas are much clearer and more accurate. Again, I apologize for any confusion or inconvenience this may have caused, and I appreciate your feedback on this matter.
- In section 4.1, authors should use a table to summarize the datasets so that readers can easily grasp the details.
I have consolidated the dataset details in Table 1 for better readability.
- What are the evaluation parameters? None explained.
I am not sure whether you are referring to the model hyperparameters, but I have included them in Table 2.
- Compared approaches are too old; MCG is from 2020 and the rest are older than that.
I added a model from 2022, the LSTM-topic model, as a new baseline, which is similar to my previous comparisons. I understand that recent industry studies lean more towards combining a user's existing characteristics to make recommendations, which is faster and more accurate. However, I would like to emphasize that my research still has implications for anonymous user recommendations: it is aimed specifically at anonymous users, who still exist in real-world use and for whom it has practical significance.
Moreover, I would like to add that there haven't been many studies similar to mine in the past two years. Hence, my research is still relevant and valuable in the field.
Thank you for your valuable feedback and for giving us the opportunity to improve the paper. We hope that the revised version of the paper meets your expectations.
Author Response File: Author Response.pdf
Reviewer 3 Report
Thank you for the opportunity to review the manuscript titled "Discriminator enhanced Knowledge Distillation Networks". The topic is very interesting and new to the domain. However, the authors should address the following concerns.
1. The research justification for why DisKD is needed is missing from the introduction.
2. The significance and originality of DisKD compared to other techniques are not clearly identified.
3. In the experiment section, the authors compare with different models. That is very good. However, can DisKD work better than other natural language processing models (e.g., LDA or topic modelling)?
4. The conclusion section is too short. The authors should give readers the take-home message.
5. I was wondering: what are the limitations of the DisKD model compared to others?
Author Response
As for your questions, I marked the revisions in red in the PDF and annotated them as Reviewer 2.
Dear Reviewer,
I am writing to share with you some updates on DisKD.
- The research justification why DisKD is needed was missing in the introduction part.
The primary motivation behind DisKD is that inference with large models is costly at large user scale, while the inference accuracy of small models is insufficient. To address both of these challenges, improving the inference's precision while minimizing costs, I proposed DisKD.
- The significance and originality of DisKD compared to other techniques are not clearly identified.
We are the first to introduce the concept of a discriminator into the distillation model. This approach offers two significant advantages:
- Firstly, it allows the student network to be optimized by leveraging the identification information of both the student and teacher networks through the use of a discriminator. Adversarial learning principles dictate that a well-trained student model should generate outputs that are highly similar to those of the teacher network, to the point that the discriminator cannot distinguish between the two.
- Secondly, our approach analyzes sentence quality at the sentence level, rather than the pairwise approach taken by traditional cross-entropy loss functions. This enables our approach to avoid overcorrection issues to some extent.
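To make this contrast concrete, the following sketch uses our own notation (not the paper's equation numbering): token-level cross-entropy penalizes each position independently, whereas the discriminator term judges the generated sentence as a whole.

```latex
% Our notation, not the paper's: CE is per-token, the discriminator
% term is a single sentence-level score.
\[
\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{T} \log p_{\theta}(y_i \mid y_{<i}, x),
\qquad
\mathcal{L}_{\mathrm{D}} = 1 - D\big(e^{(s)}_{1:T}\big),
\]
% where D(.) in (0,1) is the discriminator's sentence-level score of the
% student's output embeddings e^(s)_{1:T}.
```

Because the discriminator term depends only on the whole-sequence score, a fluent alternative phrasing is not penalized token by token, which is the over-correction issue referred to above.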
- In the experiment section, the authors compare with different models. That is very good. However, can DisKD work better than other natural language processing models (e.g., LDA or topic modelling)?
I added a baseline called PCTM, which is based on LDA. It demonstrates exceptional performance on seen data and achieves better results there. However, one limitation of PCTM, and of topic models in general, is that they do not transfer effectively to scenarios involving unknown query completion. As a result, when encountering out-of-vocabulary words, such as "Dis-KD" in a dataset not seen during training, PCTM is unable to generate accurate completions.
- The conclusion section is too short. The authors should give readers the take-home message.
I have improved the conclusion in three aspects: the contributions, the shortcomings of the method, and future work.
- I was wondering: what are the limitations of the DisKD model compared to others?
In the QAC field, while we have made significant strides, achieving three times faster inference through distillation at the same performance, this still falls short of the standard for large-scale online applications. Specifically, in large-scale Internet applications such as typing prompts at Alibaba and LinkedIn, deep learning methods have yet to be widely deployed in actual production.
Notably, the proposed method exhibits an inference speed approximately 145 times slower than MCG. I have discussed this in the limitations section of the article.
I hope this update provides you with valuable insights into my ongoing research. If you have any questions or comments, please feel free to reach out to me.
Sincerely
Author Response File: Author Response.pdf
Reviewer 4 Report
Dear Authors,
In this manuscript, the authors present a novel framework for the Query Auto-Completion (QAC) task, called Discriminator enhanced Knowledge Distillation (DisKD). The authors' experimental results demonstrate that DisKD outperforms baseline methods and that the student model also performs better than the teacher model on QAC tasks in sub-word languages. However, the manuscript requires significant revisions before it can be considered for publication in a high-quality journal.
· English language needs to be refined.
· All the references featured in the related work section are conferences. Rewrite the section on related work using standard references.
· The output results must be plotted with any data visualization tools, to make readers better understanding.
· The simulation parameters must be listed in table format.
· The experimental results did not prove the validity.
· The proposed method lacks clarity in dataset selection and proposed algorithm.
· The results lack insights and are not presented properly.
Author Response
As for your questions, I marked the revisions in green in the PDF and annotated them as Reviewer 3.
Dear Reviewer:
I am writing to share with you some updates on DisKD.
A number of reviewers, particularly in regard to question 5, identified significant errors in my formula descriptions. I deeply regret this oversight, as it led to confusion and difficulty in understanding the algorithm section of the paper.
In order to address this issue, I have completely reworked Section 3.1 (details of the model) and carefully rewritten the descriptions of the formulas involved.
These changes have been made with the aim of enhancing the clarity and accuracy of the algorithm section, ensuring that readers can fully comprehend and utilize the concepts presented.
- English language needs to be refined.
I revised the language in various sections of the paper, including the method, introduction, and other areas, all of which were color-coded for ease of reference. These revisions aimed to enhance the clarity and coherence of the text, ensuring that the ideas were expressed in a concise and effective manner.
- All the references featured in the related work section are conferences. Rewrite the section on related work using standard references.
In addition to revising the introduction by reorganizing its structure and rewriting its content, I also incorporated several relevant journal articles, namely references 1, 2, 6, and 11. Furthermore, the language has been carefully rewritten to convey the ideas clearly.
- The output results must be plotted with any data visualization tools, to make readers better understanding.
Figure 2 now contains a graphic presentation of the results that were obtained. This visual representation offers readers a clear and concise way to interpret the data, making it easier to identify trends and patterns that may have otherwise been difficult to discern. By incorporating this graphic, the paper's presentation of the results is enhanced, offering readers a more complete and compelling analysis of the findings.
- The simulation parameters must be listed in table format.
I have included the relevant parameters in Table 1, which provide important details about the variables used in the study. By presenting this information in a clear and concise manner, readers can quickly and easily understand the parameters that were involved in the research. This addition enhances the paper's transparency and supports the accuracy and reproducibility of the findings.
- The proposed method lacks clarity in dataset selection and proposed algorithm.
The algorithm has been redescribed in more detail to improve its clarity and usability. Additionally, a new case study on data input has been added to the paper. This case study provides a practical illustration of the importance of accurate data entry and its impact on the reliability of the algorithm. By examining the challenges and best practices involved in data input, readers will gain a better understanding of how to effectively utilize the algorithm in practice.
- The experimental results did not prove the validity.
- The results lack insights and are not presented properly.
For questions 6 and 7:
It appears that the lack of specific experimental cases may have made the paper less intuitive to readers. To address this issue, I have included several new case studies that provide practical examples of the concepts discussed in the paper. By offering these detailed examples, readers can better understand how the theories and ideas presented in the paper can be applied in real-world situations.
Sincerely
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The manuscript has been largely improved; I thank the authors for their response.
One remaining question I had is about equations (6) and (7). Please write out both Score_t and Score_s in (6) for clarification, as both appear in (7).
Author Response
Dear Reviewer
- One remaining question I had is about equations (6) and (7). Please write out both Score_t and Score_s in (6) for clarification, as both appear in (7).
The quasi-reinforcement learning approach employed in the training process is not straightforward, as several factors are at play. Formula 6 pertains to the score of the discriminator, which is an LSTM and requires prior training. When the output is "say hello world", it scores the prefixes "say", "say hello", and "say hello world" in turn.
Formula 7, on the other hand, is the loss used to train the discriminator, which is a binary classification network. When an output is deemed to come from the teacher network, the score is pushed towards 1, whereas when it comes from the student network, the score is pushed towards 0.
Formulas 5 and 6 reflect that deep learning minimizes an expected loss while reinforcement learning maximizes an expected reward; the 1 - s term in Formula 6 follows from this. To fold the reward into training, Formula 6 is turned from a maximization into a minimization by incorporating 1 - s into the total loss of the model, as shown in Formula 9. Formula 7's binary loss, in turn, keeps the discriminator accurate so that it provides a reliable score for Formula 5.
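As a tiny numerical illustration of this maximization-to-minimization conversion, the snippet below treats three invented sigmoid scores as the discriminator's outputs for the prefixes "say", "say hello", and "say hello world"; minimizing the mean of 1 - s yields a gradient that pushes every score upward. Variable names and values are ours, not the paper's.

```python
import torch

# s mimics the per-prefix sigmoid scores from Formula 6 (values invented).
s = torch.tensor([0.9, 0.4, 0.7], requires_grad=True)
loss_term = (1.0 - s).mean()   # minimizing this maximizes the expected score
loss_term.backward()
print(loss_term.item())        # 0.333...
print(s.grad)                  # tensor([-0.3333, -0.3333, -0.3333])
```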
I understand that the numerous formulas and technical jargon can be overwhelming and confusing. I have published my code for reference, but I am unsure if providing a very detailed explanation of each case in the paper would be beneficial or if it would make the paper appear too verbose. I appreciate any feedback or suggestions you may have.
Thank you for your attention to this matter.
Best regards,
Author Response File: Author Response.pdf
Reviewer 2 Report
No more comments
Author Response
Dear Reviewer
I've changed the details and color-coded them in the hope that this will make the article more readable.
Thank you for your attention to this matter.
Best regards,
Author Response File: Author Response.pdf
Reviewer 3 Report
I found the authors addressed my comments well in their revision. However, why are the manuscript you uploaded in the system and the one you uploaded in the authors' response not the same?
I was confused about which one is the updated version.
Author Response
Dear Reviewer
The confusion may have arisen from the order in which I presented the information.
I have revised the details and color-coded them, in the hope that this makes the article more readable, and I have resubmitted the manuscript.
Thank you for your attention to this matter.
Best regards,
Author Response File: Author Response.pdf
Reviewer 4 Report
1. The dataset selection process is not clearly defined. This comment is not explained properly in the revised manuscript.
2. The features used in the dataset are not explained properly. Also, the importance of the selected input features must be explained.
3. It is proposed that the accuracy and parameter size of the model will be improved, but these improvements are not discussed in the results section.
4. Sufficient information is not provided for the simulation parameters listed in the revised manuscript.
Overall, the proposed methodology is not satisfactory.
Author Response
Dear Reviewer
I wanted to take a moment to thank you for your suggestions and feedback on our paper. After careful consideration, we have reevaluated the paper and taken your comments into account.
For those who are not familiar with the QAC task, the experimental section of our paper may be difficult to read.
In response to your suggestions, we have elaborated on the experimental section in greater detail. I hope these modifications meet your expectations.
- The dataset selection process is not clearly defined. This comment is not explained properly in the revised manuscript.
As you mentioned, our paper involves several terms, such as "seen data," "unseen data," and "MRR," as well as the preprocessing of the input string encoding, among others.
I apologize if my explanations were unclear or difficult to follow. In response to your feedback, I have re-edited the content you mentioned and highlighted it with color fonts to make it more easily identifiable and accessible.
- The features used in the dataset are not explained properly. Also, the importance of the selected input features must be explained.
To address this, I have described AOL's data format and the data characteristics I adopted in more detail, and marked the paragraphs in color.
In our study, the BPE (Byte Pair Encoding) tokenizer was utilized to preprocess the language model input, as proposed by Sennrich et al. Specifically, the input text "ask.com" is tokenized into three subwords, namely "ask", ".", and "com". Our model requires the input text to satisfy a minimum-subword criterion: for example, in the case of "ask.com", the input text used to test our model is "ask", rather than "as" or "a". This criterion was consistently applied to the input data in the test set. We used the model to generate 10 candidate completions to calculate the mean reciprocal rank (MRR). We show the model's output in Table 5 and compare it with the output of other models.
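For readers unfamiliar with the metric, here is a toy MRR computation under the protocol just described. The candidate lists and gold completions below are invented purely for illustration; the actual evaluation uses 10 candidates per prefix.

```python
# MRR is the mean of 1/rank of the true completion among the ranked
# candidates (0 if it is absent).
def mrr(ranked_lists, truths):
    total = 0.0
    for candidates, truth in zip(ranked_lists, truths):
        total += next((1.0 / (rank + 1)
                       for rank, c in enumerate(candidates) if c == truth), 0.0)
    return total / len(truths)

preds = [["ask.com", "ask.fm", "ask jeeves"],     # invented candidates
         ["linkedin jobs", "linkedin login"]]
gold = ["ask.com", "linkedin login"]
print(mrr(preds, gold))  # (1/1 + 1/2) / 2 = 0.75
```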
- It is proposed that the accuracy and parameter size of the model will be improved, but these improvements were not discussed in the results section.
I've added to the discussion, but I'm not sure if it's in the direction you'd expect.
Our experiments demonstrate the efficacy of distillation as a means of transferring knowledge from a highly regularized, ensemble model or a large-scale model to a smaller, distilled model.
Compared to DisGPT and Patient networks, our proposed approach presents a highly innovative discriminator framework that effectively mitigates the issue of over-correction. Notably, by eliminating the discriminator component, our network architecture remains identical to that of DisGPT, thus underscoring the substantial gains attributable to our novel discriminator concept in enhancing model performance.
- Sufficient information is not provided for the simulation parameters listed in the revised manuscript.
I rearranged the parameters, including those of the teacher network, the student network, and the discriminator.
Our model architecture comprises three components. The student network consists of a two-layer GPT-2 with an embedding size of 768, alongside a linear layer with the same embedding size; the teacher network differs only in that it uses a six-layer GPT-2. In addition, we employ a bidirectional LSTM, a linear layer, a dropout layer, another linear layer, and an activation function as the discriminator component, with an embedding size of 256.
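For concreteness, the following sketch instantiates the three components as described, using Hugging Face's GPT2Config for the student and teacher. Layer counts and embedding sizes come from the text above; the number of attention heads, LSTM hidden size, dropout rate, and pooling strategy are our assumptions.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

# Student: 2-layer GPT-2, 768-d embeddings; teacher: 6 layers, same width.
student = GPT2LMHeadModel(GPT2Config(n_layer=2, n_embd=768))
teacher = GPT2LMHeadModel(GPT2Config(n_layer=6, n_embd=768))

class Discriminator(nn.Module):
    """BiLSTM -> linear -> dropout -> linear -> sigmoid over 256-d inputs."""
    def __init__(self, emb=256, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden, hidden)
        self.drop = nn.Dropout(0.1)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, seq_len, 256)
        h, _ = self.lstm(x)                # (batch, seq_len, 2*hidden)
        pooled = h.mean(dim=1)             # simple mean pooling (assumption)
        return torch.sigmoid(self.fc2(self.drop(self.fc1(pooled))))

disc = Discriminator()
print(sum(p.numel() for p in student.parameters()),
      sum(p.numel() for p in teacher.parameters()))  # parameter-size contrast
```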
Author Response File: Author Response.pdf