EvoText: Enhancing Natural Language Generation Models via Self-Escalation Learning for Up-to-Date Knowledge and Improved Performance

In recent years, pretrained models have been widely used in various fields, including natural language understanding, computer vision, and natural language generation. However, the performance of these language generation models is highly dependent on model size and dataset size. While larger models excel in some respects, they struggle to acquire up-to-date knowledge and are relatively difficult to retrain. In this paper, we introduce EvoText, a novel training method that enhances the performance of any natural language generation model without requiring additional datasets during the entire training process (although a prior dataset is necessary for pretraining). EvoText employs two models: $G$, a text generation model, and $D$, a model that determines whether the data generated by $G$ are legitimate. Initially, the fine-tuned $D$ model serves as the knowledge base. The text generated by $G$ is then input to $D$ to determine whether it is legitimate. Finally, $G$ is fine-tuned based on $D$'s output. EvoText enables the model to learn up-to-date knowledge through a self-escalation process that builds on a priori knowledge. When EvoText needs to learn something new, it simply fine-tunes the $D$ model. Our approach applies to any Transformer-based autoregressive language model. With EvoText, eight models achieved stable improvements on seven natural language processing tasks without any changes to the model structure.


Introduction
Pretrained models have shown great promise in natural language processing, with the Transformer model [1] proposing an encoder-decoder architecture based solely on the self-attention mechanism, enabling the construction of large-scale models that can be pretrained on vast amounts of data. Language models [2][3][4] can be broadly categorized into two types: autoregressive language modeling and autoencoder language modeling. Autoregressive language models, such as ELMO [5], GPT [6], and T5 [7], predict the next possible word based on the preceding context, making them well-suited for generative tasks. On the other hand, autoencoder language models, such as BERT [8] and RoBERTa [9], predict intermediate words based on context and are better suited for natural language understanding tasks.
In recent years, generative models, including VAE [10], GAN [11], and DDPM [12], have made significant progress in computer vision. However, natural language generation presents unique challenges due to its discrete and temporal nature. To address these challenges, a common approach is unsupervised training of large language models, such as GPT-3 [13], which has 175 billion parameters. Despite their potential, these models are challenging to train because of their size, and further training once deployed can be difficult. As a result, a zero-shot approach is often adopted, which does not require fine-tuning the model for specific downstream tasks. However, this approach has limitations: large language models may not perform as well as smaller fine-tuned models on tasks that rely heavily on supervised learning. Moreover, a model that is never updated is akin to a computer that is never upgraded; its obsolescence is only a matter of time. Therefore, there is an urgent need for novel approaches that balance the benefits and limitations of large language models to improve their performance and longevity. Reinforcement learning from human feedback (RLHF) [14] is one such promising approach, but it is limited by the availability and quality of human feedback data, and it may not always generalize well to other tasks. This paper introduces a novel training process for pretrained models, which we call EvoText. EvoText can continuously learn from new data without suffering from the limitations of unsupervised learning or requiring additional datasets for fine-tuning. Specifically, we merge the input text and the generated text and then use a natural language understanding model to label the data for supervised fine-tuning of the generative model. Only the smaller discriminator model needs to be retrained to enable the generative model to acquire up-to-date knowledge.
To address the issues of natural language understanding errors and overfitting, we adopt a small learning rate and epoch size during fine-tuning. Unlike GAN models, we do not modify the parameters of the natural language understanding model during training, and only fine-tune it when necessary to incorporate new knowledge. The contributions of this work are as follows:
• EvoText partially mitigates the problem of low-quality samples generated by the generative model.
• EvoText improves the model's performance without additional data during the training process. (Note that while additional datasets are used in the system for warm-up and for learning up-to-date data, as illustrated in Figure 1, only the data generated by the generator are used in the crucial training process.)
• EvoText enables continuous and sustainable improvement on seven natural language processing tasks, spanning natural language understanding and natural language generation, without altering the model's structure.
• The proposed method achieves results comparable to those of large generative networks, even with relatively limited computational resources.
• We will make the source code for EvoText publicly available on GitHub (https://github.com/DLYuanGod/Auto-learning (accessed on 9 April 2023)).
The novelty of this paper lies in the proposed method for enhancing the performance of natural language generation models without the need for additional datasets during training. With this approach, EvoText allows the generative model to acquire up-to-date knowledge at minimal additional cost. This is achieved by updating the knowledge base through retraining the smaller discriminator model, as opposed to retraining the entire model with additional data.

Background
In this section, we briefly overview the language model, Transformer, the GPT series of models, the BERT series of models, and the GAN model.

Language Modeling
In this subsection, we briefly describe the implementation of these two types of language modeling.

Autoregressive Language Modeling
Autoregressive language modeling [15,16] is a type of language modeling used in natural language processing. It involves predicting the next token in a sequence based on the previous tokens. Given a sequence of N tokens, (t_1, t_2, . . . , t_N), the probability of the sequence is modeled by calculating the conditional probability of each token t_k given all preceding tokens, (t_1, t_2, . . . , t_{k−1}), using the following formula:

p(t_1, t_2, . . . , t_N) = ∏_{k=1}^{N} p(t_k | t_1, t_2, . . . , t_{k−1})    (1)

Formula (1) calculates the joint probability of all the tokens in the sequence as the product of the conditional probabilities, taken from the first token to the last. This means that to predict the probability of a token, we need to know the probabilities of all preceding tokens. In the case of a backward model, the tokens after t_k are conditioned on instead.
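As a toy illustration of the chain-rule factorization above (not an actual language model), the joint probability of a sequence can be computed from a lookup table of conditional probabilities:

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Joint log-probability of a token sequence under the chain rule:
    log p(t_1..t_N) = sum_k log p(t_k | t_1..t_{k-1})."""
    total = 0.0
    for k in range(len(tokens)):
        context = tuple(tokens[:k])          # all preceding tokens
        total += math.log(cond_prob[(context, tokens[k])])
    return total

# Tiny hand-crafted conditional distribution (illustrative values only).
cond_prob = {
    ((), "the"): 0.6,
    (("the",), "cat"): 0.5,
    (("the", "cat"), "sat"): 0.8,
}
lp = sequence_log_prob(["the", "cat", "sat"], cond_prob)
print(round(math.exp(lp), 3))  # 0.6 * 0.5 * 0.8 = 0.24
```

In a real autoregressive model, the lookup table is replaced by a neural network's softmax output over the vocabulary at each position.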

Autoencoder Language Modeling
Autoencoder language modeling (ALM) [17,18] is a type of language modeling that involves predicting a target token in a sequence of tokens, while considering the probabilities of all preceding and following tokens. This can be represented mathematically as predicting t k in the sequence (t 1 , t 2 , ..., t k , ..., t N ), where the probabilities of t 1:k−1 and t k+1:N are also calculated.
ALM is a powerful modeling method that can be applied to natural language understanding tasks. By considering the entire sentence when predicting a target token, ALM can capture the semantic and syntactic relationships between words in the sentence. This makes it particularly useful for understanding tasks such as text classification, question answering, and natural language inference.
The concept of ALM is closely related to the masked language model (MLM), which is used as a pretraining strategy in the popular BERT model. In MLM, a percentage of the tokens in a sequence are randomly masked out, and the model is trained to predict the missing tokens while considering the context of the surrounding tokens. This approach is similar to ALM, as it also involves predicting a target token while considering the context of the surrounding tokens.

Transformer Encoder
The Transformer Encoder takes a sequence of tokens as input, which is first processed through a word embedding and positional embedding layer. The resulting vector dimension is called d_model.

Next, the Transformer Encoder uses a self-attention mechanism to compute the output tokens. This mechanism involves creating three projections of the input, referred to as Q, K, and V. These are used in the attention calculation, which computes the weights between all pairs of tokens in the input sequence. The attention calculation formula is shown below:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Here, d_k represents the vector dimension processed by each attention head. The value of d_k is equal to d_model divided by the number of attention heads used in the multi-head attention mechanism.
Finally, the values calculated for each attention head are concatenated and passed through a multi-layer perceptron (MLP) layer to produce the final output tokens. Overall, the self-attention mechanism used in the Transformer Encoder allows the model to capture complex relationships between tokens in the input sequence and produce high-quality representations for natural language processing tasks.
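The scaled dot-product attention described above can be sketched in a few lines of NumPy; this is a single head with no learned projection matrices, purely to make the computation concrete:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the rows of V, so the output stays within the range of the value vectors.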

Transformer Decoder
The decoder module is calculated in the same way as the encoder module, except that masking is added to the multi-headed attention mechanism to mask tokens that have not yet been generated.

GPT
The Generative Pre-training Transformer (GPT) [6] was introduced by Radford et al. in 2018 as an improvement on the Transformer model, which had been mainly used for natural language understanding tasks. GPT was the first model to apply a pretrained Transformer model to natural language processing.
GPT uses a multi-layer Transformer Decoder for the language model, which consists of 12 blocks of Transformer Decoders [24] with up to 117 million parameters. This approach allows GPT to generate high-quality natural language text by predicting the next word in a sequence of words.
Overall, GPT has been shown to achieve impressive results on a range of natural language processing tasks, such as text classification, language translation, and text generation. Its success has led to the development of larger and more powerful Transformer-based models, such as GPT-2 and GPT-3, which continue to push the boundaries of natural language processing.
In addition to their larger model sizes, both GPT-2 and GPT-3 are trained on larger and more diverse datasets, enabling them to capture more complex and nuanced patterns in natural language. As a result, these models have achieved state-of-the-art performance on a range of natural language processing tasks, including language translation, question answering, and text generation.
The success of GPT-2 and GPT-3 highlights the tremendous potential of pretrained Transformer models in natural language processing research. With continued advances in this field, we can expect even more powerful language models in the future.

BERT Series of Models
BERT [8,26], RoBERTa [9], ALBERT [27], XLNET [28], TinyBERT [29], and ELECTRA [30] are all state-of-the-art natural language understanding models that employ the Transformer Encoder layer. While there are slight differences in their training approaches, they all share the same underlying architecture.
The architecture of the Transformer Encoder layer is particularly well-suited for natural language understanding tasks, as it allows the model to capture long-range dependencies between words and phrases in the text. By leveraging self-attention mechanisms, these models can dynamically weigh the importance of different words in the input sequence, enabling them to extract more nuanced and complex representations of language.

GAN
The main idea of generative adversarial networks (GAN) [11,31,32] is to build two models: a generator (G) and a discriminator (D). During training, the G model tries to improve its manufacturing process to create realistic outputs that can fool the D model, while the D model tries to accurately distinguish between real and generated data, like a police officer inspecting forgeries. Achieving a balance between the two models is essential, since an imbalance can prevent one of them from converging. Despite the challenges, GANs have shown great potential in generating high-quality and diverse samples in a range of applications, such as image synthesis, text generation, and music composition.

RLHF
Reinforcement-learning-based training methods have achieved state-of-the-art success in natural language processing tasks in recent years. One of the most advanced is reinforcement learning from human feedback (RLHF). This method combines the advantages of human feedback and self-supervised learning, enabling the model to learn from its own mistakes while also benefiting from human expertise. RLHF has been successfully applied in various tasks, including machine translation, text generation, and summarization. Since the approach proposed in this article involves no human feedback, we instead used feedback provided by ChatGPT [33] for comparison.

EvoText
In this section, the training process of EvoText is introduced, and the theoretical representation and algorithmic implementation are given.

Priori Learning of Discriminator
Given an a priori dataset Data = {(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N)}, which could be related to tasks such as grammar judgment or semantic rationalization, we aim to fine-tune the discriminator model using the following objective function:

min_θ (1/N) Σ_{n=1}^{N} L(D_θ(x_n), y_n)

Here, D_θ represents the pretrained natural language understanding model with parameters θ, and L represents the loss function.

Pre-Warm-Up Training for Generator
We first freeze all Transformer blocks in the pretrained natural language generative model G and add new linear and softmax layers on top. We then train the newly added layers on the prior dataset through the following objective:

min_ϑ (1/N) Σ_{n=1}^{N} L(F_ϑ(G_Θ^blocks(x_n)), y_n)

where G_Θ^blocks refers to all the frozen Transformer blocks of the pretrained model G with parameters Θ, and F_ϑ represents the newly added linear layer with trainable parameters ϑ. The objective is to minimize the expected value of the loss function L on the prior dataset, where y_n is the ground-truth label for input example x_n.

Training Dataset
Given a new token sequence Z_K^in = {z_1, z_2, · · · , z_k} as input to the generative model G, such as "Yesterday", the tokens are fed into the G model to obtain Z_K^out = {z_1, z_2, · · · , z_{k+n}}, e.g., "Yesterday, a man named Jack said he saw an alien in the skies near New Orleans". This can be expressed as

Z_K^out = G_Θ(Z_K^in)

The generated sequence Z_K^out is fed into the discriminator model D to obtain the label Y_K = D_θ(Z_K^out). Suppose we need to construct M samples; we then obtain the posterior data as

Data_post = {(Z_1^out, Y_1), (Z_2^out, Y_2), · · · , (Z_M^out, Y_M)}
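The data-construction loop can be sketched schematically as follows. The `generate` and `discriminate` functions here are trivial stand-ins for the actual generator (e.g., GPT-2) and discriminator (e.g., BERT) models, used only to show the shape of the loop:

```python
def build_posterior_dataset(prompts, generate, discriminate):
    """EvoText data construction: complete each prompt with the generator G,
    then label the completion with the discriminator D (1 = legitimate)."""
    data = []
    for z_in in prompts:
        z_out = generate(z_in)       # Z_out = G(Z_in)
        y = discriminate(z_out)      # Y = D(Z_out)
        data.append((z_out, y))
    return data

# Toy stand-in models for illustration only.
generate = lambda prompt: prompt + ", a man named Jack said he saw an alien."
discriminate = lambda sent: int(sent.endswith("."))  # trivial rule, not a real D

data = build_posterior_dataset(["Yesterday", "Today"], generate, discriminate)
print(data[0])  # ('Yesterday, a man named Jack said he saw an alien.', 1)
```

In the real system, `generate` would sample a continuation from the language model and `discriminate` would return the fine-tuned classifier's predicted label.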

Supervised Fine-Tuning for Generators
Supervised training has a greater impact on the model than unsupervised training, even with the same dataset size. To fine-tune the parameters of all Transformer blocks in the generator, we enable gradients for all generator parameters. The optimization objective is defined as follows:

min_{Φ,ϑ} (1/M) Σ_{m=1}^{M} L(F_ϑ(G_Φ^blocks(Z_m^out)), Y_m)

where Φ denotes the (now trainable) parameters of the generator's Transformer blocks, and ϑ represents the parameters of the linear layer.

Semi-Supervised Fine-Tuning for Generators
Assuming that grammatical sentences are labeled Y_K = 1 and ungrammatical sentences Y_K = 0, we extract all sequences with Y_K = 1 from the dataset Z as Z_fit = {z_1, z_2, · · · , z_m}. The generator model G is then pretrained in an unsupervised manner on the corpus Z_fit, where each z_i = (u_1, u_2, · · · , u_n) is an unsupervised token sequence. To accomplish this, we use autoregressive language modeling to maximize the following likelihood function:

L(Z_fit) = Σ_i log P(u_i | u_{i−k}, . . . , u_{i−1}; Θ)

where k is the size of the context window.
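A minimal sketch of this step: filter the generated data down to the sequences the discriminator labeled 1, and evaluate the windowed likelihood that the generator then maximizes. The `cond_prob` lookup table is a hypothetical stand-in for the model's predictive distribution:

```python
import math

def filter_grammatical(dataset):
    """Keep only generator outputs the discriminator labeled 1."""
    return [z for z, y in dataset if y == 1]

def windowed_log_likelihood(tokens, cond_prob, k=2):
    """sum_i log P(u_i | u_{i-k} .. u_{i-1}): the quantity the generator
    maximizes during its unsupervised pass over Z_fit."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])   # window of k preceding tokens
        total += math.log(cond_prob.get((context, tok), 1e-9))
    return total

corpus = filter_grammatical([("a b", 1), ("x y", 0), ("a c", 1)])
print(corpus)  # ['a b', 'a c']
```

In practice the filtered corpus would be fed back through the standard language-modeling training loop rather than a lookup table.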

Self-Escalation
The discriminator model D can be tuned further to improve its performance, which in turn improves EvoText. D can also be continually pretrained on new knowledge to keep it up-to-date. Certain words in the Z dataset are randomly masked, the masked words are predicted using the D model, and the result is input to G for supervised fine-tuning. As illustrated in Figure 2, the proposed method consists of several steps. First, the discriminator is retrained on up-to-date data. Second, the generator is given a prompt containing a specific year (e.g., "In 2022") to generate data. Third, 15% of the words in the generated data are masked. Fourth, the discriminator performs word completion on the masked words. Finally, the completed data are used for supervised fine-tuning of the generator, with all samples labeled one (by default, all statements are grammatically correct after the discriminator's completion).
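The 15% masking step can be sketched as follows. This is a toy illustration; the real system would use the discriminator's tokenizer and its [MASK] token, and the discriminator would then fill in the masked positions:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask ~15% of the tokens so the discriminator can complete
    them, as in the self-escalation step."""
    rng = random.Random(seed)                      # seeded for reproducibility
    n_mask = max(1, round(len(tokens) * mask_rate))
    idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in idx else t for i, t in enumerate(tokens)]
    return masked, sorted(idx)

sentence = "in 2022 a new model was released by the lab".split()
masked, positions = mask_tokens(sentence)
print(masked.count("[MASK]"))  # 2 of the 10 tokens are masked
```

The masked sequence is then passed to D for word completion, and the completed sentence (labeled 1) is used for supervised fine-tuning of G.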

Algorithm Implementation
As shown in Algorithms 1 and 2, this is the full process of EvoText.

Experimental Setup
In this section, we describe our experimental setup to demonstrate the performance of our algorithm.

Experimental Environment
We utilize a server configuration consisting of a 120-core Xeon(R) Platinum 8358P CPU @ 2.60 GHz and 8 NVIDIA A100 (80 GB) GPUs. In order to ensure optimal efficiency, we release the GPU when the model is not being trained after deploying it in a real-world scenario.

Experimental Model
To demonstrate the performance of EvoText, we adopted the GAN approach and selected 4 natural language understanding models and 8 natural language generation models. (These models are part of the PyTorch-Transformers library: https://github.com/huggingface/ pytorch-transformers (accessed on 17 February 2023)).

BERT
The BERT model is widely recognized as one of the most outstanding models of recent years, having topped the GLUE leaderboard when it was first released [34]. For this experiment, we exclusively utilize the large-cased version of BERT as the discriminator model and apply the EvoText method to its fine-tuning. Notably, the BERT large-cased model boasts 24 layers of Transformer encoders, 16 self-attention heads, and 340 million parameters, while the BERT base-cased model has 12 layers of Transformer encoders, 12 self-attention heads, and 110 million parameters.

RoBERTa
The RoBERTa model is an improved version of the BERT model that requires longer training time, a larger batch size, and more training data. Unlike BERT, RoBERTa uses dynamic masking and text encoding, moving away from BERT's NSP task. It modifies the key hyperparameters in BERT based on BERT's language masking strategy, resulting in better generalization to downstream tasks. Despite these modifications, the overall number of parameters in RoBERTa is consistent with BERT.

GPT-2
The primary contribution of GPT-2 is its exploration of the performance of larger-scale models in zero-shot scenarios, where no fine-tuning is used. With only pretraining, prompts, and predictions, GPT-2 achieved state-of-the-art results on 7 out of 8 language modeling datasets. Additionally, it is an exceptional model for natural language generation. The GPT-2 medium and GPT-2 small models have 24 and 12 layers of Transformer decoders, 16 and 12 self-attention heads, and 355M and 124M parameters, respectively. Moreover, GPT-2 also offers larger models, such as GPT-2 large with 774M parameters and GPT-2 xl with 1.5B parameters.

GPT-Neo
The GPT-Neo [35,36] 1.3B is a Transformer model trained on the Pile using cross-entropy loss. As an autoregressive language model, it learns to predict the next token in a given sequence of English text, thereby capturing internal representations of English. These representations can then be used to extract features that are useful for downstream tasks. Although language models have many applications beyond this, there are still many unknowns in this area of research.

OPT
The OPT model [37] is primarily pretrained using a causal language model (CLM) target, which belongs to the family of GPT-3 models. The pretrained model can be used to evaluate prompts and generate text for downstream tasks. Additionally, the model can be fine-tuned on downstream tasks using CLM instances. The experiments in this paper use models with 125M and 350M parameters.

Transformer-XL
The Transformer-XL model [38] introduces two innovations over the vanilla Transformer: a recurrence mechanism and relative positional encoding. An additional advantage of Transformer-XL is that it can be used for both word-level and character-level language modeling. By combining the recurrence mechanism with attention, the model can learn long-term dependencies, and it achieves state-of-the-art language modeling results on several different datasets.

Language Models with Pre-trained Word Embeddings and without Pre-Trained Word Embeddings
In addition to attention-based models, pretrained word embedding models such as Word2Vec [39] or Glove [40] can also yield good results when incorporated into the word embedding layer. Similarly, scratch-trained word embedding layers can be effective for specific tasks, such as hate detection or text toxicity detection [41][42][43][44][45][46]. In this paper, we evaluate and compare various pretrained and scratch-trained word embedding models as discriminators to assess their impact on the overall training system.

Dataset
In the subsequent experiments, we chose an a priori dataset for discriminators.

CoLA
The CoLA (Corpus of Linguistic Acceptability) [47] consists of 10,657 sentences from 23 linguistics publications, professionally annotated for acceptability (grammaticality) by their original authors. The public version presented here contains 9594 sentences from the training and development sets, excluding 1063 sentences from the held-out test set. The goal is to classify sentences as acceptable or unacceptable based on their grammaticality. Due to its carefully curated and annotated nature, CoLA is a valuable resource for evaluating the performance of various NLP models and techniques in the domain of language understanding.

LAMBADA
The LAMBADA corpus [48] is constructed from unpublished novels; the rationale is to minimize the influence of generic knowledge on the answers, i.e., to make it difficult for the model to derive answers from generic world knowledge alone. It consists of 5325 novels and 465 million words. LAMBADA has been widely used for language generation and language understanding tasks, such as language modeling and text comprehension, where the goal is to predict the next word in a given sentence based on the preceding context.

CBT
The children's book test (CBT) aims to directly measure the extent to which language models exploit the wider language environment. The CBT is built from books that are freely accessible. The CBT has been widely used for evaluating the performance of various NLP models and techniques in the domain of language understanding and generation.

WikiText
The WikiText [49] dataset is a large-scale language modeling dataset that is widely used in natural language processing research. It is created by extracting articles from the English Wikipedia and is available in two main versions: WikiText-2 and WikiText-103. The dataset includes articles covering a wide range of topics, providing a diverse range of text for training and evaluation. The WikiText dataset has been used in various language modeling tasks, including next-word prediction, text generation, and text classification. It is a valuable resource for training and evaluating natural language processing models, and its use has contributed significantly to the development of language modeling research.

PTB
The PTB (Penn Treebank) dataset [50] contains 42,000, 3000, and 3000 English sentences in the training, validation, and test sets, respectively. "<sos>" is the start signal of each sentence, and "<eos>" is the end signal of each sentence. The dataset is annotated with part-of-speech tags, constituency parse trees, and semantic roles, providing rich linguistic annotations for various natural language processing tasks. The PTB has been used in a wide range of natural language processing tasks, including language modeling, part-of-speech tagging, named entity recognition, parsing, and machine translation.

enwiki8 and text8
The text8 dataset (available for download at https://huggingface.co/datasets/enwik8 (accessed on 17 February 2023)) comes from enwik8 (available for download at http://mattmahoney.NET/dc/text8.zip (accessed on 17 February 2023)), which was first used for text compression. Simply put, enwik8 is the first 100,000,000 characters taken from Wikipedia; text8 is the result of removing all special symbols and non-English characters from these characters, converting uppercase characters to lowercase, and transforming numbers into the corresponding English words. This dataset aims to learn distributed representations of words that capture their semantic and syntactic relationships, and it has been used in various natural language processing tasks, including language modeling, text generation, and word embeddings.

1BW
The 1BW (one billion word) [51] dataset is a large English language corpus used for pretraining language models. It contains one billion words and is freely available for research purposes. This benchmark dataset is widely used for evaluating the performance of statistical language models and is composed of various genres and topics, including news, technology, and novels. It was proposed by the Google Brain team and is considered a standard for measuring progress in the field of natural language processing. The 1BW dataset has been used for pretraining language models to improve their performance on downstream NLP tasks, such as text classification, sentiment analysis, and language generation.

Model Evaluation Indicators
We use PPL (perplexity), ACC (accuracy), and BPC (bits-per-character) as performance metrics for our experiments. PPL measures the average number of choices available to the model when predicting the next word in a sentence and is calculated using the following formula:

PPL(S) = exp( −(1/m) Σ_{i=1}^{m} log p(w_i | w_1, . . . , w_{i−1}) )

where S is the sentence being evaluated, m is the length of the sentence, and p(w_i | w_1, . . . , w_{i−1}) is the probability of the i-th word given the preceding words in the sentence. A lower PPL value indicates better model performance. ACC measures the percentage of correct judgments out of all judgment cases and is calculated using the following formula:

ACC = (TP + TN) / (TP + TN + FP + FN)

where TP (true positive) is the number of cases correctly judged as positive, TN (true negative) is the number of cases correctly judged as negative, FP (false positive) is the number of cases incorrectly judged as positive, and FN (false negative) is the number of cases incorrectly judged as negative.
In this work, accuracy refers to the percentage of correctly predicted tokens in the test dataset. In other words, it measures how often the model predicted the correct next word given the previous words in the sentence. This metric is commonly used to evaluate the performance of language models.
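Both metrics follow directly from their definitions; a small illustration with made-up probabilities and counts:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/m) * sum_i log p(w_i | w_1..w_{i-1}))."""
    m = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / m)

def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# A model that always spreads probability uniformly over 4 choices
# has a perplexity of exactly 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
print(accuracy(40, 40, 10, 10))  # 0.8
```

The perplexity example confirms the intuition that PPL is the effective number of equally likely choices the model faces per token.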
BPC measures the number of bits required on average to encode each character in the text and is calculated using the following formula:

BPC = −(1/m) Σ_{i=1}^{m} log_2 p(w_i | w_1, . . . , w_{i−1})

where m is the length of the text, and −log_2 p(w_i | w_1, . . . , w_{i−1}) is the number of bits required to encode the i-th character given the preceding characters in the text. A lower BPC value indicates better model performance.
Specifically, BPC measures the number of bits needed to encode each character in the text. Lower BPC values indicate better compression, which in turn indicates that the model has learned to better capture the patterns and structure of the text.
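A minimal illustration of the BPC computation:

```python
import math

def bits_per_character(char_probs):
    """BPC = -(1/m) * sum_i log2 p(w_i | w_1..w_{i-1})."""
    m = len(char_probs)
    return -sum(math.log2(p) for p in char_probs) / m

# A model that always assigns probability 0.5 to the observed character
# needs exactly 1 bit per character.
print(bits_per_character([0.5, 0.5, 0.5]))  # 1.0
```

This matches the information-theoretic reading: better compression (fewer bits per character) means the model has captured more of the text's structure.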

Experimental Procedure
This section shows the parameters that need to be tuned in the actual training and the comparison with other models.

Data Preprocessing
In this paper, we preprocessed the data using common techniques such as regular expression substitution and expanding English abbreviations. Table 1 shows the details of the preprocessing steps.

Fine-Tuning of Discriminators For Priori Datasets
In this paper, we employed BERT large, BERT base, RoBERTa large, or RoBERTa base as the discriminator model. Since these models are pretrained, they need to be fine-tuned to achieve optimal results on downstream tasks. During fine-tuning, it is recommended to use a lower learning rate and a smaller number of epochs, because large learning rates and epoch counts may cause the model to fail to converge or to overfit, which can negatively impact its performance on this task.
Results

As illustrated in Figure 3, we observed that the large models performed better on the CoLA task, with RoBERTa exhibiting the lowest loss. During the pretraining process, we set the maximum length of the tokenizer to 45, enabled padding, and used a minibatch size of 512. We fine-tuned the model using several of the most commonly used parameter settings. As summarized in Table 2, we achieved the best results with a learning rate of 3 × 10−5 and 10 epochs. Following fine-tuning, the RoBERTa large model demonstrated the ability to make judgments about grammatical plausibility.

Prewarm-Up Training of Generator
During pretraining of GPT-2 medium , we followed the same data preprocessing steps as before, with one exception: GPT's tokenizer does not auto-pad sentences to the maximum length. Therefore, we used the special token "<|endoftext|>" to pad sentences that were not long enough. Unlike the input to the BERT large model, the generator model used in GPT-2 medium is an autoregressive language model that requires mask attention to model the data. As such, we needed to provide masks in the input to the GPT-2 medium model to ensure optimal performance.
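The padding-and-mask step described above can be sketched as follows. This is a simplified stand-in for the tokenizer logic, not the actual Hugging Face implementation:

```python
def pad_batch(sequences, pad_token="<|endoftext|>"):
    """Right-pad tokenized sequences to the batch maximum length and build
    the attention masks needed so the model ignores padding positions.
    (GPT-2's tokenizer does not pad automatically, hence this step.)"""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_token] * (max_len - len(s)) for s in sequences]
    masks = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, masks

padded, masks = pad_batch([["The", "cat", "sat"], ["Hello"]])
print(masks)  # [[1, 1, 1], [1, 0, 0]]
```

In the real pipeline these masks are passed alongside the input IDs so attention is not computed over the "<|endoftext|>" padding.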
As shown in Table 3, the best performance was achieved with a learning rate of 1 × 10−2 and 10 epochs. During training, we only updated the parameters of the last linear layer, which allowed the model to be easily fine-tuned for supervised tasks and prevented extensive updates to the linear layer parameters during later fine-tuning.

Table 3. With all Transformer block parameters of GPT-2 frozen, only the linear layer is trained; results are validated on the CoLA task.

Training Process
As illustrated in Figure 4, our framework involves training the discriminator model BERT large and the generator model GPT-2 medium in a training loop. First, we identify common sentence-initial words and use GPT-2 medium to complete the sentences. Subsequently, the completed sentences are fed into the fine-tuned BERT large model to evaluate their grammatical plausibility. This evaluation result is then used for supervised fine-tuning of the GPT-2 medium model. To prevent the discriminator model's errors from significantly affecting the generator model, we adopt a minimal learning rate and train the generator for only one round. Table 4 presents some examples of text generated by GPT-2 medium. We fed these data to the discriminator model to assess their syntactic plausibility; an output value of 1 or 0 indicates that the sentence is grammatically valid or invalid, respectively. We then used these labeled data to perform supervised fine-tuning of the GPT-2 medium model. As shown in Figure 5, we used a learning rate of 1 × 10−4 and a minibatch size of 64 for fine-tuning.

Table 4. Text generated by the GPT-2 medium model before fine-tuning is input to the discriminator for judgment. The red marker denotes the data generated by the generator; a D output of 0 indicates that the statement is not grammatically correct, and 1 indicates that it is.
Prompt | Generated sentence | D Output
That | That doesn'ot have any significance, right? | 0
It | It was a beautiful night of sunshine with some gorgeous light falling. | 1
He | He said: "[W]ith this being an attack of our religion I do feel the time will not have arrived." | 0
We | We have already started the implementation phase and will keep the project in mind throughout. | 1
I | I've done all those jobs. | 1

Figure 5. Training loss during supervised fine-tuning of the GPT-2 medium model.

Semi-Supervised Fine-Tuning Generator Model
Based on the discriminator output presented in Table 4, the sentences labeled 1 are fed back into the generator model for further pretraining. This process aims to enhance the generator's ability to produce grammatically correct text. Table 5 reports the evaluation of 10 tasks using 4 different natural language understanding models; the results indicate that RoBERTa large outperforms the other models, so we selected RoBERTa large as the discriminator for subsequent experiments. As demonstrated in Table 6, replacing the model's word embedding layer with pretrained word embeddings, or reinitializing the parameters and training from scratch, resulted in inferior performance compared to the original RoBERTa large model. Table 5. After 156 minibatch-sized EvoText sessions, we evaluated the performance of various natural language understanding models (D) paired with the same language generation model (GPT-2 medium) on 7 natural language processing tasks in a zero-shot setting. Each model was tested five times on each task, and the results were averaged. Table 6. After 156 minibatch-sized EvoText sessions, we evaluated the performance of various natural language understanding models (D), with or without pretrained word embeddings, paired with the same language generation model (GPT-2 medium) on 7 natural language processing tasks in a zero-shot setting. Each model was tested five times on each task, and the results were averaged.
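The semi-supervised selection step reduces to a simple filter over the discriminator's labels; `select_for_pretraining` is a hypothetical helper name, and the example pairs are taken from Table 4:

```python
def select_for_pretraining(labelled):
    """Keep only the sentences the discriminator judged grammatical (label == 1)."""
    return [sentence for sentence, label in labelled if label == 1]

labelled = [
    ("That doesn'ot have any significance, right?", 0),
    ("It was a beautiful night of sunshine with some gorgeous light falling.", 1),
    ("I've done all those jobs.", 1),
]
kept = select_for_pretraining(labelled)
```

Only the retained sentences are fed back to the generator for further pretraining; the label-0 sentences serve exclusively as negative examples in the supervised step.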

Experimental Results
After a training process consisting of 156 minibatch iterations, which corresponds to the average of the LAMBADA, CBT, WikiText, PTB, enwik8, text8, and 1BW dataset sizes, we evaluated the performance of eight natural language generation models on various datasets, including LAMBADA, CBT, WikiText, PTB, enwik8, text8, and 1BW. Table 7 shows that EvoText steadily improves the performance of all eight models. Notably, a training process of just 156 steps produces better results without altering the model architecture or the initial pretraining method. It is surprising that EvoText improves the performance of the GPT-2 small model enough to surpass the OPT 125M model. These results indicate that EvoText can significantly enhance model performance without requiring extensive modifications. Our approach also compares favorably with the current state-of-the-art RLHF approach based on ChatGPT feedback on nearly every task. Based on the results presented in Table 8, the EvoText training approach is highly effective at rectifying a significant portion of the grammatical errors produced by the model, which is an impressive outcome. Table 8. We entered "Once upon a time" into the baseline GPT-2 xl and the EvoText GPT-2 xl, respectively, for comparison.

Generation by Baseline GPT-2 xl
Once upon a time, girl name is Lisa. Lisa is like to go on walk in park, but yesterday, she goes on walk and she lost. She asks help from man which he see, but man doesn't speak English. She feels very scared and doesn't know how to come back to her house. Suddenly, she saw a police car and she run to them. Police helped her and she come back to her house safely.

Generation by EvoText GPT-2 xl
Once upon a time, there was a girl named Lisa. Lisa enjoyed going for walks in the park, but yesterday, she got lost while on a walk. She asked for help from a man she saw, but he didn't speak English. She felt very scared and didn't know how to get back home. Suddenly, she saw a police car and ran towards them. The police officers helped her and she was able to return home safely.

Up-to-Date Knowledge Update
We collected abstracts of preprints published on arXiv from June to September 2022 as an up-to-date knowledge dataset and partitioned it into training, validation, and testing sets with a split ratio of 7:1:2. In the conventional methodology, the generator model is directly fine-tuned. In contrast, our methodology, as described in Section 3.6, fine-tunes only the discriminator model, followed by EvoText training. To ensure a fair comparison, we used the same number of epochs across all trials. Table 9 demonstrates that the generator updated with up-to-date knowledge through EvoText's approach outperforms retraining the generator directly, while maintaining its performance on the zero-shot tasks. The scalability of EvoText on larger datasets and more complex natural language processing tasks is not discussed in this paper. Nevertheless, the results in Table 9 show that the model not only acquires new knowledge but also avoids catastrophic forgetting of its original knowledge, which is promising for future research on scalability. Table 9. We evaluated the performance of both models on the arXiv dataset, retraining the discriminator at a learning rate of 1 × 10−4 and the generator at a learning rate of 5 × 10−5.
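A 7:1:2 split of the kind described can be sketched as follows (the `split_7_1_2` helper and the fixed shuffle seed are assumptions for illustration):

```python
import random

def split_7_1_2(items, seed=0):
    """Shuffle once with a fixed seed, then cut at 70% and 80% of the data."""
    items = list(items)
    random.Random(seed).shuffle(items)     # deterministic shuffle for reproducibility
    n = len(items)
    n_train, n_val = int(n * 0.7), int(n * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_7_1_2(range(100))
```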

Ablation Experiments
To investigate the effect of each module in EvoText, we performed ablation experiments by removing them one by one. These modules include the fine-tuning discriminator model, the prewarm-up generator, and the supervised and semi-supervised fine-tuning generator models.
Based on Table 10, each module of EvoText contributes to the final results: removing any of them degrades overall performance. Notably, the supervised learning module proves indispensable in our approach.

Conclusions and Future Work
In this article, we introduced EvoText, a training process for two pretrained models that addresses the challenges of insufficient sample data and computational resources, allowing models to continue learning after deployment. Through fine-tuning the discriminator and prewarm-up training of the generator, we achieved better model performance with just 156 training steps, significantly improving performance without requiring additional training data. The approach steadily improves performance on natural language understanding and generation tasks without changing the model structure, with the potential for even greater gains over time. EvoText is an effective and scalable training method that holds great promise for low-resource NLP tasks. Our extensive experiments demonstrate its potential for improving pretrained model performance and highlight the importance of supervised learning.
Future research directions may include exploring the potential for EvoText in other NLP tasks and applications, investigating the impact of different discriminator and generator architectures on model performance, and further exploring the potential for continued learning after deployment in other settings. Additionally, our study highlights the importance of supervised learning in NLP and suggests that future research should continue to focus on developing effective training processes for pretrained models in low-resource settings.

Funding:
The authors gratefully acknowledge the support of the AIMTEEL 202201 Open Fund for Intelligent Mining Technology and Equipment Engineering Laboratory in Anhui Province and the Anhui Provincial Department of Education Scientific Research Key Project (Grant No. 2022AH050995). The financial assistance provided by these projects was instrumental in carrying out the research presented in this paper. We would like to thank all the members of the laboratory for their valuable support and assistance. Without their help, this research would not have been possible. Finally, we would like to express our gratitude to the Anhui Polytechnic University for providing the necessary facilities and resources for this study.