Article

Improving Text Classification with Large Language Model-Based Data Augmentation

1 Data Science and Engineering, The University of Tennessee, Knoxville, TN 37996, USA
2 Department of Information Science, The University of North Texas, Denton, TX 76203, USA
3 Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
4 Computational Sciences and Engineering, The University of North Texas, Denton, TX 76203, USA
5 Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(13), 2535; https://doi.org/10.3390/electronics13132535
Submission received: 30 April 2024 / Revised: 17 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue AI Test)

Abstract

Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch, but it is unclear which method is better without comparing their effectiveness. This study applies both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently enhanced the model’s classification results for both datasets. 2. Generating new data generally outperforms rewriting existing data, though crafting the prompts carefully is crucial for extracting the most valuable information from ChatGPT, particularly for domain-specific data. 3. The augmentation data size affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples per label. 4. Combining rewritten samples with newly generated samples can further improve the model’s performance.

1. Introduction

The classification of natural language texts is a highly researched topic in the fields of artificial intelligence (AI) and machine learning (ML). Since the emergence of deep learning, various applications such as automatic data collection, filtering, and curation have been significantly improved. Recent developments in self-attention-based language models [1,2,3] have made significant strides and have had a profound impact on our daily lives. In essence, ML models rely heavily on the corpus of training data, so their ability to produce accurate inferences is limited to the knowledge included in the training data. In scenarios where the dataset is imbalanced, with hundreds or thousands of training samples for certain labels and few or zero training samples for others, machine learning models typically generate acceptable predictions for the majority classes but struggle to make accurate predictions for the minority classes. Data Augmentation (DA), which aims to increase the volume, quality, and diversity of training data, has emerged as an effective technique to address this issue, and numerous studies and efforts have been made thus far [4,5,6,7,8]. Earlier DA methods usually focused on obtaining augmentation data by manipulating the original training data through techniques such as random deletion, insertion, swapping, synonym replacement [9,10,11,12], and back translation [13]. However, with the advent of recently developed large language models (LLMs) that exhibit advanced language understanding and text generation capabilities, researchers can leverage them to rewrite, rephrase, or summarize the training data or to generate entirely new samples as augmentation data [14]. Sarker et al. [15] instructed ChatGPT to rewrite clinical notes to improve both medication identification and medication event classification. Yuan et al. [16] asked ChatGPT to rewrite clinical notes to improve compatibility between electronic health records (EHRs) and clinical trial descriptions. Cohen et al. [17] combined back translation with GPT-3-rewritten samples to enhance social network hate detection. Dai et al. [18] tasked ChatGPT with rewriting each sentence in the training samples into multiple conceptually similar but semantically different samples; the rewritten samples were then used as augmentation data to aid the classification of the target dataset. Yoo et al. [19] randomly selected samples (sentences with the corresponding labels) from the original training dataset and embedded the samples into the prompt. Following these prompts, GPT-3 first generated sentences influenced by the samples and then assigned soft labels to the generated sentences. The generated samples were then used as augmentation data for classification tasks.
Although there are various ways to instruct an LLM to generate the desired data, these methods fall into two broad categories: rewriting the original training data and generating entirely new data from scratch. Rewritten samples remain more similar to the original dataset, whereas newly generated samples infuse new information (features) into the training dataset. It remains unclear which method benefits the model more; previous studies usually adopt one method without comparing it with the other. To maximize the effectiveness of LLM-based DA, this study conducts experiments with both DA methods. There is also an intuition that, for domain-specific topics, it is hard for an LLM to synthesize samples from scratch. Therefore, we chose one general-topic dataset, the Reuters news data, and one domain-specific dataset, the Mitigation dataset, for the experiments.
The primary contribution of this study is an analysis of two main LLM-based DA methods for enhancing the classification of imbalanced datasets. More specifically:
  • We conduct experiments with the two main LLM-based DA methods (rewriting samples and generating entirely new samples using ChatGPT) on both a general-topic and a domain-specific dataset.
  • We further investigate the optimal number of newly generated samples for DA.
  • We propose combining newly generated samples with rewritten samples to further improve the classification results for minority classes.
Section 2 provides a detailed overview of the approach, including data, classification model, and performance measurement. Comparative experimental results are presented in Section 3, and Section 4 discusses the findings and potential for further improvement. The paper is a substantially extended version of the IEEE AITest 2023 conference paper “Enhancing Text Classification Models with Generative AI-aided Data Augmentation” [20].

2. Materials and Methods

We conducted an experimental study to evaluate the effectiveness of two main LLM-based DA methods for enhancing a text classification model’s performance on two datasets. This section provides details about the ML model we tested and the experimental design employed in the study.

2.1. Dataset

For the study, we utilized the Reuters news data and the Mitigation dataset.

2.1.1. Reuters News Dataset

The Reuters corpus is provided by the Natural Language Toolkit (NLTK) [21] Python library. The corpus consists of 10,788 news articles, with a total of 1.3 million words, and contains pre-defined “training” and “test” sets with 7769 and 3019 cases, respectively. Note that we randomly held out 10% of the training samples for validation during model training. Each news article belongs to one or more of the 90 pre-defined categories, so the corpus provides a multi-labeled dataset for text classification tasks; each article carries between one and fifteen labels. However, this is a long-tail imbalanced dataset, as the number of samples (articles) for each label (topic) varies greatly, ranging from 1 to 2877. Figure 1 displays the count of samples for each label as a bar plot. The number of words in each article ranges from 2 to 1316, with an average of 130 words per article. The distribution of article lengths is shown in Figure 2.
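The corpus and its label distribution can be inspected directly from NLTK. The following is a minimal, illustrative sketch (not the exact script used in our experiments) of loading the pre-defined splits and counting labels:

```python
# Minimal sketch: load the Reuters corpus from NLTK and inspect its
# long-tail label distribution (illustrative; exact preprocessing may differ).
from collections import Counter

import nltk
from nltk.corpus import reuters

nltk.download("reuters")

# Pre-defined split: fileids beginning with "training/" vs. "test/"
train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids = [f for f in reuters.fileids() if f.startswith("test/")]
print(len(train_ids), len(test_ids))  # 7769, 3019

# Multi-label targets: each article belongs to one or more of the 90 categories
label_counts = Counter(c for f in train_ids for c in reuters.categories(f))
print(label_counts.most_common(5))  # the head of the long-tail distribution
```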

2.1.2. Mitigation Dataset

Environmental mitigation strategies, including water quality standards, fish passage infrastructure, and species conservation, are critical for the ecologically sustainable advancement of hydropower resources. The Federal Energy Regulatory Commission (FERC) requires these mitigations during the licensing procedure for non-federal hydropower projects [22], highlighting the need for unified, countrywide data regarding these mandates. Hydropower licensing documents serve as comprehensive reservoirs of scientific data, encapsulating critical information on environmental conditions key to the sustainable progression of hydropower resources. Each license document extends beyond 15,000 words, and 135 class labels must be identified from these documents.
Identification and collation of the environmental mitigation data have historically been conducted by human experts possessing profound scientific knowledge in the respective field. Nevertheless, the manual curation of this information poses a significant challenge, given the extensive nature of each licensing document and the large quantity of mitigation labels requiring identification. The implementation of Natural Language Processing (NLP) models can potentially ease the burden of manual labor and also decrease the variability of annotation due to differences between individual observers.
In this study, a trained analyst annotated 1869 segments. Each segment corresponds to at least one of 93 (out of the 135) mitigation categories. The analyst identified sentences and paragraphs related to mitigation terms and mapped them to their respective mitigation IDs. These segments were extracted from mitigation license documents issued between 2014 and 2017. Note that each segment may include specific terminology and phrases that detail the requirements of environmental mitigation plans, which often leads to the allocation of multiple mitigation IDs to a single segment. Consequently, the resulting annotations are inherently imbalanced, posing additional complexities for the implementation of machine learning models.

2.2. Machine Learning Model for Text Comprehension and Classification

Given that the datasets carry multi-label annotations, the model must be able to assign multiple labels to each sample. The output nodes of the final decision layer are therefore equipped with a sigmoid activation function, and binary cross-entropy is used as the loss for optimization during back-propagation. To implement the proposed approach of adding augmentation data for natural language text classification, we utilized the BERT model, implemented with the PyTorch [23] platform on Python 3.10. Bidirectional Encoder Representations from Transformers (BERT) [24] is currently among the most successful ML models for NLP, achieving superb classification accuracy across many applications. BERT applies multiple layers of the self-attention mechanism to identify keywords that characterize documents at scale, and we apply a fully connected layer on top to make the final inferences. For our study, we utilized the pre-trained bert_base_uncased model from the HuggingFace [25] library, which is widely recognized for its high performance in NLP tasks.
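As an illustration of this setup, the sketch below builds a multi-label BERT classifier with a fully connected decision layer and a binary cross-entropy loss; it is a simplified outline rather than our exact training code.

```python
# Sketch of the multi-label BERT classifier: bert-base-uncased encoder,
# a fully connected decision layer, and sigmoid/binary cross-entropy via
# BCEWithLogitsLoss (which fuses the sigmoid with the loss).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class MultiLabelBert(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation for document-level classification
        return self.classifier(hidden.last_hidden_state[:, 0])  # raw logits


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiLabelBert(num_labels=90)   # 90 labels for the Reuters corpus
criterion = nn.BCEWithLogitsLoss()      # sigmoid + binary cross-entropy
```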

2.3. Augmenting Generated Data to the Text Classification

2.3.1. Obtain Augmentation Data from ChatGPT

We utilized two approaches to obtain augmentation datasets from ChatGPT (GPT-3.5): 1. asking ChatGPT to generate new samples from scratch according to the given labels; 2. asking ChatGPT to rewrite samples from the training data. For the Reuters dataset, we directly instructed ChatGPT to write a new article on the given topic (Appendix A). For the Mitigation dataset, however, we performed prompt engineering to obtain the desired augmentation data. The mitigation classification system consists of six Tier 1 (T1) categories, twenty Tier 2 (T2) categories, and a total of 135 sub-categories (Tier 3) [26]. Each mitigation category is assigned a unique six-digit ID, and our task is to predict the Tier 3 mitigation IDs for given text data. When we instruct ChatGPT to write a sample according to a given Tier 3 mitigation ID, the generated samples usually contain other mitigation IDs under the same Tier 2 ID, which introduces too much noise into the augmentation data. For example, there are four Tier 3 IDs under the Tier 2 ID “Riparian”: “Riparian habitat monitoring or planning, Establish riparian buffers, Riparian habitat enhancement, Dust control and abatement”. Since these four mitigation requirements concern the same topic, “Riparian”, when given “Establish riparian buffers”, ChatGPT generates text mixing these four mitigation requirements together. However, we found that when we list all the mitigation requirements under one Tier 2 ID for ChatGPT, it is able to generate the samples for each label separately. Here is the prompt we used: “{list of mitigation requirements} is a list of mitigation requirements at hydropower project, write a paragraph for each of them as if they are extracted from the requirement license. The format of the answer should be like A: B where A represents the mitigation requirement, B represents the corresponding generated text. Do not change the mitigation requirement name in the list”. The prompts are presented in Appendix A.
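The sketch below illustrates how such prompts can be issued programmatically. The prompt strings follow Appendix A; the client library and model name ("gpt-3.5-turbo" via the openai Python package) are assumptions about the tooling rather than a statement of the exact setup used.

```python
# Illustrative sketch of requesting augmentation samples from ChatGPT (GPT-3.5).
# Prompt wording follows Appendix A; the client/model names are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Prompt 1: generate a new Reuters-style article for a topic label
new_article = ask("write an article with 150 words about grain in Reuters news format")

# Prompt 3: rewrite an existing training sample (placeholder text shown)
rewritten = ask("rewrite the following content: " + "GRAIN SHIPMENTS ROSE LAST WEEK ...")
```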
To ensure consistency, we applied the same text data pre-processing pipeline, which included tokenization and vectorization, as used in the training corpus from the Reuters dataset.

2.3.2. Integrate the Generated Data to the Model

To incorporate augmentation data into our model, we added an augmentation data training loop after each batch of the original data. During a given training update within each epoch, the following steps are taken (a minimal sketch of the loop follows the list):
  • The binary cross-entropy loss is calculated from the given minibatch of training samples and backpropagation is performed.
  • A minibatch of augmentation samples is randomly selected, the binary cross-entropy loss is calculated, and backpropagation is performed.
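A minimal sketch of this loop, assuming standard PyTorch DataLoader-style batches of (input_ids, attention_mask, labels), is shown below; it is illustrative rather than the exact implementation.

```python
# Sketch of the augmentation training loop: after each original minibatch,
# a randomly selected augmentation minibatch is used for an extra update.
import random


def train_one_epoch(model, optimizer, criterion, train_loader, aug_batches, device):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        # Step 1: binary cross-entropy and backpropagation on the original minibatch
        optimizer.zero_grad()
        loss = criterion(model(input_ids.to(device), attention_mask.to(device)),
                         labels.to(device))
        loss.backward()
        optimizer.step()

        # Step 2: the same update on a randomly selected augmentation minibatch
        aug_ids, aug_mask, aug_labels = random.choice(aug_batches)
        optimizer.zero_grad()
        aug_loss = criterion(model(aug_ids.to(device), aug_mask.to(device)),
                             aug_labels.to(device))
        aug_loss.backward()
        optimizer.step()
```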

2.4. Experimental Design

To study these two LLM-based DA methods comprehensively, we experimented with both a general-topic and a domain-specific dataset. We further investigated how the number of newly generated samples affects the effectiveness of the DA method. The following experiments relate to this procedure.

2.4.1. Evaluate the DA Effectiveness of Rewritten Samples and New Generated Samples

To evaluate the DA effectiveness of the rewritten samples and the newly generated samples, we generated 20 samples for each label in the Reuters news data and the Mitigation data using the prompts described in Section 2.3.1. For the number of rewritten samples, we referred to the Easy Data Augmentation (EDA) paper [4], as rewriting shares similarities with EDA: both involve modifying the original data to generate augmented data. According to the recommendation in [4], the appropriate number of augmented samples depends on the size of the original training set. We instructed ChatGPT to generate four rewritten samples for each training sample in the Reuters dataset and the Mitigation dataset. We then integrated the augmentation data into the training procedure as described in Section 2.3.2.

2.4.2. Investigate the Optimum New Generated Samples’ Size for DA

Adequate training samples for a label can significantly impact the classification performance of a model. Generally, more newly generated samples bring more new information, allowing the model to learn more features and produce better results. However, there may be a point at which adding more augmentation data no longer improves the results, as all useful features have already been covered. Furthermore, for labels that already have sufficient training samples and high accuracy, adding augmentation data may introduce noise and decrease performance. To learn how the augmentation sample size affects DA effectiveness, we experimented with different augmentation sample sizes for both datasets.

2.4.3. Combining Rewritten Data with New Generated Data

According to the similarity analysis in Figure 3, ChatGPT-generated data exhibit strong intrinsic similarity but are less similar to the training data. This suggests that ChatGPT-generated data introduce novel information, contributing to improved classification results, but may induce topic drift for minority classes. Including rewritten samples can help maintain feature consistency for minority classes. We therefore hypothesize that combining rewritten data with newly generated data would further improve DA effectiveness.
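For reference, the similarity analysis behind Figure 3 can be sketched as follows; the TF-IDF representation is an assumption here, and any sentence-embedding model could be substituted.

```python
# Sketch: pairwise cosine similarity between training articles and
# ChatGPT-generated articles (as visualized in Figure 3).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_matrix(train_texts, generated_texts):
    vectors = TfidfVectorizer().fit_transform(train_texts + generated_texts)
    # Rows/columns are ordered as: training samples first, then generated samples
    return cosine_similarity(vectors)


sim = similarity_matrix(["wheat prices rose sharply"], ["wheat futures climbed today"])
print(sim)
```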

2.5. Performance Measure

Due to the severe class imbalance and multi-label annotations present in our data corpus, it is necessary to calculate both macro- and micro-averaged F1 metrics using a class-wise multi-label confusion matrix. In this context, macro-averaged F1 scores are equally weighted among the class labels, while micro-averaged F1 scores are equally weighted among individual decisions. To calculate these scores, we use the Scikit-Learn [27] Python library.
For each class label i, we obtain a_i, b_i, c_i, and d_i, where a stands for true positives, b for true negatives, c for false negatives, and d for false positives. To calculate the macro-averaged precision, recall, and F1 scores, we computed those scores for each label separately and then took their average over all the labels. In contrast, to compute the micro-averaged precision, recall, and F1 scores, we aggregated a_i, b_i, c_i, and d_i across all labels and computed the corresponding overall scores. Equations (1) and (2) illustrate the calculation of the macro- and micro-averaged precision scores.
$$p_{\mathrm{macro}} = \frac{\sum_{i=1}^{N} p_i}{N}, \qquad p_i = \frac{a_i}{a_i + d_i} \quad (i = 1, \ldots, N) \tag{1}$$
$$p_{\mathrm{micro}} = \frac{\sum_{i=1}^{N} a_i}{\sum_{i=1}^{N} a_i + \sum_{i=1}^{N} d_i} \tag{2}$$
where $N$ is the total number of labels; in our case, $N = 90$.
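In practice, these scores can be obtained directly from Scikit-Learn given binary indicator matrices of true and predicted labels, as in the short sketch below (toy data shown purely for illustration).

```python
# Macro- and micro-averaged precision/recall/F1 with Scikit-Learn,
# matching Equations (1) and (2) for the precision terms.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy multi-label indicator matrices (rows = documents, columns = labels)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
print("macro P/R/F1:", macro[:3])
print("micro P/R/F1:", micro[:3])
```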

3. Results

3.1. Evaluate the DA Effectiveness of Rewritten Samples and New Generated Samples

Table 1 shows the BERT model’s performance on the Reuters news data without augmentation data, with ChatGPT-rewritten samples, and with newly generated samples. From the table, we can see that both rewritten data and newly generated data lead to improved accuracy across both macro- and micro-averaged metrics. With newly generated data, the macro-F1 increased from 49.87 to 65.73, and with rewritten data it increased from 49.87 to 61.70; newly generated data thus yields better DA effectiveness. Table 2 shows the BERT model’s performance on the Mitigation data without augmentation data, with ChatGPT-rewritten samples, and with newly generated samples. With newly generated data, the macro-F1 increased from 13.32 to 15.42, whereas with rewritten data the macro-F1 decreased. We can also observe that the enhancement was more pronounced in macro-averaged scores than in micro-averaged ones, suggesting that the DA methods significantly improve the accuracy of minority class labels.

3.2. Investigate the Optimum Samples Size of the LLM-Based DA Method

In Table 3, we present classification accuracy scores for the Reuters dataset with 5, 10, 15, and 20 newly generated samples per label. A notable increase in the macro-F1 score is observed from 5 to 10 samples (63.72, falling outside the 5-sample confidence interval of [60.03, 62.29]); however, we noticed a plateau from 10 samples to 20 samples. Table 4 shows a similar pattern for the Mitigation dataset, where 10 samples yield results comparable to 20 samples (15.13 vs. 15.42). These findings suggest that, when integrating ChatGPT-generated samples as augmentation data, generating 10 new samples for each label suffices, while additional samples provide only marginal enhancements.

3.3. Combining Rewritten Data with New Generated Data

Table 5 presents the BERT model’s performance on the Reuters dataset with three different augmentation datasets: ChatGPT-rewritten samples, ChatGPT newly generated samples, and newly generated samples plus rewritten samples. As depicted in Table 5, the combination raised the macro-F1 score from 65.73% to 67.14%, a substantial improvement compared with solely utilizing newly generated samples for augmentation. However, such an increase was not evident for the Mitigation dataset, as demonstrated in Table 6.

3.4. Difference Analysis of the Newly Generated Data and the Rewritten Data

To quantitatively evaluate the differences between the newly generated data and the rewritten data, we performed a vocabulary analysis on the training dataset, the newly generated data, and the rewritten data. For the Reuters dataset, the original training data contain 21,764 unique words, the newly generated data contain 8597 unique words, and the rewritten data contain 18,376 unique words. We then identified the words that appear in the augmentation data but not in the training dataset and plotted them in a heatmap. From Figure 4a, we can see that 1619 words are present in the rewritten data but not in the training data, and 2768 words are present in the newly generated data but not in the training data. This indicates that the newly generated data introduce more new information into the training. A similar trend is observed for the Mitigation dataset (Figure 4b), with 536 words in the rewritten data but not in the training data, and 964 words in the newly generated data but not in the training data. We also noticed that, for the Mitigation dataset, the rewritten vocabulary is much smaller than the training vocabulary (1892 vs. 4577); this is consistent with our earlier observation that, when rewriting domain-specific data, ChatGPT tends to replace sophisticated terminology with general words, which may harm the model’s performance (as shown in Table 6).
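The vocabulary comparison can be reproduced with simple set operations, as in the sketch below; the lowercase word tokenizer and the placeholder text lists are assumptions, and the exact tokenization used for the reported counts may differ.

```python
# Sketch of the vocabulary analysis: words present in the augmentation data
# but absent from the original training data.
import re


def vocab(texts):
    return {w for t in texts for w in re.findall(r"[a-z]+", t.lower())}


train_texts = ["..."]      # original training articles (placeholder)
rewritten_texts = ["..."]  # ChatGPT-rewritten samples (placeholder)
generated_texts = ["..."]  # ChatGPT newly generated samples (placeholder)

train_vocab = vocab(train_texts)
print(len(vocab(rewritten_texts) - train_vocab))  # words only in rewritten data
print(len(vocab(generated_texts) - train_vocab))  # words only in newly generated data
```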

3.5. Categorical Analysis

To analyze how the different DA methods impact prediction results for each class, a categorical analysis was conducted on the Reuters data results, as presented in Table 7. The first column denotes the label, the second column shows the number of samples in the original training data for each label, and the subsequent columns display F1 scores corresponding to the different DA methods. Each F1 score was computed by averaging results from ten runs. ChatGPT-generated samples, created from scratch, prove highly informative and beneficial. As revealed in Table 7, introducing 20 newly generated samples not only enhanced the F1 score for most minority classes but also improved scores for majority classes, such as ‘money-fx’ (from 78.56% to 83.91%) and ‘grain’ (from 90.31% to 94.20%), even with a much smaller augmentation sample size compared to the original training data.
Additionally, the average influences of DA on majority and minority classes were assessed by calculating the average F1 scores for labels with samples exceeding and falling below a threshold (set at 40), as depicted in Table 8. Without DA, the average F1 score for minority classes is significantly lower than that for majority classes. Introducing DA enhances the average F1 score for minority classes from 0.3026 to 0.6064. Notably, the improvement achieved with rephrased samples plus new samples (0.6064) surpasses that with only new samples (0.5701) and rephrased samples alone (0.5209). For majority classes, all three DA methods contribute to performance improvement (from 0.7627 to 0.8217). However, these three methods exhibit similar levels of improvement.

4. Discussion and Conclusions

LLMs have gained popularity since the debut of ChatGPT. With their large number of trainable parameters and pre-training on a substantial amount of articles and documents, they achieve noteworthy performance in chatting, question answering, and information retrieval. Early adoption studies have already shown remarkable results, making it clear that these models have great potential for various NLP tasks. However, it is still too early to expect GPT models to solve complex real-world problems independently. Nonetheless, with proper guidance, we can leverage the vast amount of information they provide to enhance various NLP tasks, such as data augmentation for text data classification. This paper evaluated the effectiveness of two main LLM-based DA methods for natural language text classification: rewriting samples and generating new samples using ChatGPT. Furthermore, we identified the optimal sample size for DA when using ChatGPT-generated samples. Finally, we investigated a hybrid data augmentation approach that may further improve the model’s classification results.
The results from Section 3.1 indicate that newly generated data produced better performance than rewritten data, with far fewer augmentation samples. The rewritten data for the Mitigation dataset diminished the model’s performance, which goes against intuition. This reveals a drawback of rewriting samples for data augmentation: in text classification tasks, the model may classify the text content according to several critical words in the sentence, especially for domain-specific data, and rewritten samples may replace these critical words with synonyms, thus losing important information. However, this issue may not arise for a general-topic dataset like the Reuters news data. For the Reuters news data, rewriting the minority classes and combining them with newly generated data further boosts the model’s performance. This verified our hypothesis that rewritten data help maintain feature consistency for minority classes, while newly generated data introduce new information to the entire dataset. The outcomes from Section 3.2 indicate that the sample size does affect the model’s performance; however, the margin of improvement decreases as the sample size increases, and an optimal augmentation size is attained with 10–20 newly generated samples for each label. From the categorical analysis in Section 3.5, it is evident that both DA methods substantially enhanced the prediction scores of the Reuters minority classes, aligning with the primary objective of data augmentation.
This study not only underscores the strengths and limitations of two main LLM-based DA methods but also guides optimal strategies for employing LLMs in enhancing text classification models. This study could be particularly useful in text classification tasks that suffer from severe class imbalance issues. The rise of other LLMs trained with domain knowledge provides good resources for DA. For instance, Med-PaLM2 [28] demonstrates impressive capabilities in answering medical questions, suggesting its potential use for generating medical data to enhance the classification and information extraction of clinical and health-related documents.

Author Contributions

Conceptualization, H.-J.Y.; methodology, H.Z., H.C. and H.-J.Y.; software, H.-J.Y.; validation, H.Z.; formal analysis, H.Z. and H.-J.Y.; investigation, H.Z., T.A.R. and H.C.; resources, Y.F., T.A.R. and D.S.; data curation, H.-J.Y., T.A.R. and D.S.; writing—original draft preparation, H.Z. and H.-J.Y.; writing—review and editing, H.-J.Y., H.C. and Y.F.; visualization, H.Z.; supervision, H.-J.Y.; project administration, H.-J.Y. and D.S.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by US Department of Energy’s Water Power Technologies Office.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data sets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

The paper is a substantially extended version of the IEEE AITest 2023 conference paper “Enhancing Text Classification Models with Generative AI-aided Data Augmentation” [20]. This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan (accessed on 30 April 2024)).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM     Large Language Model
DA      Data Augmentation
NLP     Natural Language Processing

Appendix A

Prompt 1: “write an article with N words about LABEL in Reuters news format”. Here, LABEL represents the topic for which we aimed to create data, and N represents a designated word count. We applied three specific word counts (50, 150, and 250) in our experiment.
Prompt 2: “You are a technique writer, {} is a list of mitigation requirements at hydropower project, write a paragraph for each of them as if they are extracted from the requirement license. The format of the answer should be like A: B where A represents the mitigation requirement, B represents the corresponding generated text. Do not change the mitigation requirement name in the list”.
Prompt 3: “rewrite the following content: samples from the original training dataset”.

References

  1. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2023, arXiv:1910.10683. [Google Scholar]
  2. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  3. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. Available online: https://api.semanticscholar.org/CorpusID:160025533 (accessed on 30 April 2024).
  4. Wei, J.; Zou, K. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar]
  5. Akkaradamrongrat, S.; Kachamas, P.; Sinthupinyo, S. Text generation for imbalanced text classification. In Proceedings of the 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE), Chonburi, Thailand, 10–12 July 2019; pp. 181–186. [Google Scholar]
  6. Hu, Z.; Tan, B.; Salakhutdinov, R.R.; Mitchell, T.M.; Xing, E.P. Learning data manipulation for augmentation and weighting. Adv. Neural Inf. Process. Syst. 2019, 32, 15764–15775. [Google Scholar]
  7. Xu, B.; Qiu, S.; Zhang, J.; Wang, Y.; Shen, X.; de Melo, G. Data augmentation for multiclass utterance classification—A systematic study. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 5494–5506. [Google Scholar]
  8. Chen, H.; Pieptea, L.F.; Ding, J. Construction and Evaluation of a High-Quality Corpus for Legal Intelligence Using Semiautomated Approaches. IEEE Trans. Reliab. 2022, 71, 657–673. [Google Scholar] [CrossRef]
  9. Karimi, A.; Rossi, L.; Prati, A. AEDA: An Easier Data Augmentation Technique for Text Classification. arXiv 2021, arXiv:2108.13230. [Google Scholar]
  10. Kolomiyets, O.; Bethard, S.; Moens, M.F. Model-Portability Experiments for Textual Temporal Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011. [Google Scholar]
  11. Xie, Z.; Wang, S.I.; Li, J.; Levy, D.; Nie, A.; Jurafsky, D.; Ng, A.Y. Data Noising as Smoothing in Neural Network Language Models. arXiv 2017, arXiv:1703.02573. [Google Scholar]
  12. Li, Y.; Cohn, T.; Baldwin, T. Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017. [Google Scholar]
  13. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. arXiv 2016, arXiv:1511.06709. [Google Scholar]
  14. Ye, J.; Gao, J.; Li, Q.; Xu, H.; Feng, J.; Wu, Z.; Yu, T.; Kong, L. ZEROGEN: Efficient Zero-shot Learning via Dataset Generation. arXiv 2022, arXiv:2202.07922. [Google Scholar]
  15. Sarker, S.; Qian, L.; Dong, X. Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification. arXiv 2023, arXiv:2306.07297. [Google Scholar]
  16. Yuan, J.; Tang, R.; Jiang, X.; Hu, X. Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching. arXiv 2023, arXiv:2303.16756. [Google Scholar]
  17. Cohen, S.; Presil, D.; Katz, O.; Arbili, O.; Messica, S.; Rokach, L. Enhancing social network hate detection using back translation and GPT-3 augmentations during training and test-time. Inf. Fusion 2023, 99, 101887. [Google Scholar] [CrossRef]
  18. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv 2023, arXiv:2302.13007. [Google Scholar]
  19. Yoo, K.M.; Park, D.; Kang, J.; Lee, S.W.; Park, W. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual, 16–20 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2225–2239. [Google Scholar] [CrossRef]
  20. Zhao, H.; Chen, H.; Yoon, H.J. Enhancing Text Classification Models with Generative AI-aided Data Augmentation. In Proceedings of the 2023 IEEE International Conference on Artificial Intelligence Testing (AITest), Athens, Greece, 17–20 July 2023; pp. 138–145. [Google Scholar] [CrossRef]
  21. Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028. [Google Scholar]
  22. Pracheil, B.M.; Levine, A.L.; Curtis, T.L.; Aldrovandi, M.S.; Uría-Martínez, R.; Johnson, M.M.; Welch, T. Influence of project characteristics, regulatory pathways, and environmental complexity on hydropower licensing timelines in the US. Energy Policy 2022, 162, 112801. [Google Scholar] [CrossRef]
  23. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  25. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  26. Schramm, M.P.; Bevelhimer, M.S.; DeRolph, C.R. A synthesis of environmental and recreational mitigation requirements at hydropower projects in the United States. Environ. Sci. Policy 2016, 61, 87–96. [Google Scholar] [CrossRef]
  27. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  28. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv 2023, arXiv:2305.09617. [Google Scholar]
Figure 1. The number of articles for each topic in the Reuters corpus, illustrating that the dataset is severely imbalanced.
Figure 2. The article length in the Reuters corpus.
Figure 3. Selected heatmap examples of cosine similarities between training data and ChatGPT-generated data. Rows 1–5 are training data and rows 6–10 are ChatGPT-generated data. The analysis was performed on the Reuters news data.
Figure 4. Heatmap showing the non-overlapping words among the training, newly generated, and rewritten data for the Reuters and Mitigation datasets.
Table 1. BERT model’s performance with rewritten and newly generated data—Reuters data.
                     Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
without DA           57.17  47.25  49.87  91.30  87.69  89.45
                     (53.82, 60.53)  (44.02, 50.48)  (46.70, 53.03)  (90.84, 91.75)  (86.80, 88.58)  (89.15, 89.75)
with rewritten data  68.25  59.32  61.70  90.83  89.16  89.98
                     (65.91, 70.58)  (56.46, 62.19)  (59.27, 64.13)  (90.48, 91.17)  (88.68, 89.64)  (89.67, 90.30)
with new data        75.23  61.44  65.73  92.50  87.90  90.13
                     (74.00, 76.46)  (59.65, 63.23)  (64.46, 66.99)  (91.80, 92.62)  (87.74, 89.06)  (90.07, 90.45)
The table shows the mean scores (unit: %) and 95% confidence intervals of the macro precision, recall, and F1 and the micro precision, recall, and F1 without augmentation data, with rewritten data, and with newly generated data. The newly generated data contain 20 samples for each label, and the rewritten data contain 4 samples for each training sample. The abbreviation DA stands for data augmentation.
Table 2. BERT model’s performance with rewritten and newly generated data—Mitigation data.
                     Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
without DA           15.12  12.77  13.32  75.14  64.01  69.13
                     (14.69, 16.04)  (12.34, 13.20)  (12.82, 13.82)  (73.75, 76.52)  (62.11, 65.92)  (67.46, 70.80)
with rewritten data  10.95  8.94  9.40  76.05  65.12  70.16
                     (9.54, 12.36)  (8.21, 9.67)  (8.55, 10.25)  (74.22, 77.88)  (63.84, 66.40)  (68.91, 71.41)
with new data        17.86  14.23  15.42  77.69  64.71  70.60
                     (16.34, 19.39)  (13.19, 15.28)  (13.91, 16.38)  (75.46, 79.92)  (64.11, 65.31)  (69.46, 71.75)
The table shows the mean scores (unit: %) and 95% confidence intervals of the macro precision, recall, and F1 and the micro precision, recall, and F1 without augmentation data, with rewritten data, and with newly generated data. The newly generated data contain 20 samples for each label, and the rewritten data contain 4 samples for each training sample. The abbreviation DA stands for data augmentation.
Table 3. Reuters dataset with different sample size.
             Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
5 samples    70.10  57.22  61.16  91.43  88.50  89.94
             (68.76, 71.43)  (55.92, 58.53)  (60.03, 62.29)  (91.00, 91.85)  (88.01, 89.00)  (89.72, 90.16)
10 samples   73.93  59.41  63.72  92.27  88.11  90.13
             (72.63, 75.22)  (57.51, 61.30)  (62.22, 65.23)  (91.74, 92.80)  (87.35, 88.86)  (89.1, 90.3)
15 samples   74.43  60.53  64.84  91.73  88.52  90.19
             (73.01, 75.85)  (59.04, 62.02)  (63.52, 66.16)  (91.61, 92.24)  (88.09, 88.96)  (89.99, 90.39)
20 samples   75.23  61.44  65.73  92.50  87.90  90.13
             (74.00, 76.46)  (59.65, 63.23)  (64.46, 66.99)  (91.80, 92.62)  (87.74, 89.06)  (90.07, 90.45)
The table above shows the mean scores (unit: %) of macro precision, recall, and F1 and micro precision, recall, and F1 of running the BERT model. The first row shows the result of adding 5 distinct samples for each label; the following rows show the results for 10, 15, and 20 distinct samples.
Table 4. Mitigation dataset with different sample size.
             Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
10 samples   18.14  14.17  15.13  77.15  64.58  70.29
             (16.51, 19.76)  (13.44, 14.90)  (14.63, 15.64)  (73.46, 80.83)  (61.13, 68.04)  (67.07, 73.52)
20 samples   17.86  14.23  15.42  77.69  64.71  70.60
             (16.34, 19.39)  (13.19, 15.28)  (13.91, 16.38)  (75.46, 79.92)  (64.11, 65.31)  (69.46, 71.75)
The table above shows the mean scores (unit: %) of macro precision, recall, and F1 and micro precision, recall, and F1 of running the BERT model. The first row shows the result of adding 10 distinct samples for each label. The second row shows the result of 20 distinct samples.
Table 5. Reuters dataset with ChatGPT rewritten data, ChatGPT generated new data, combination of rewritten data and new data.
                   Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
without DA         57.17  47.25  49.87  91.30  87.69  89.45
                   (53.82, 60.53)  (44.02, 50.48)  (46.70, 53.03)  (90.84, 91.75)  (86.80, 88.58)  (89.15, 89.75)
rewritten samples  68.25  59.32  61.70  90.83  89.16  89.98
                   (65.91, 70.58)  (56.46, 62.19)  (59.27, 64.13)  (90.48, 91.17)  (88.68, 89.64)  (89.67, 90.30)
new samples        75.23  61.44  65.73  92.50  87.90  90.13
                   (74.00, 76.46)  (59.65, 63.23)  (64.46, 66.99)  (91.80, 92.62)  (87.74, 89.06)  (90.07, 90.45)
rewritten + new    76.05  63.02  67.14  92.31  88.40  90.31
                   (73.86, 78.25)  (61.04, 65.01)  (65.62, 68.66)  (91.01, 93.62)  (87.44, 89.36)  (89.99, 90.62)
The table above shows the mean scores (unit: %) of macro precision, recall, and F1 and micro precision, recall, and F1 of running the BERT model without augmentation data, with ChatGPT rewritten samples, with 90 × 20 ChatGPT generated new samples, and with 90 × 20 ChatGPT generated new samples plus selected rewritten samples.
Table 6. Mitigation dataset with ChatGPT rewritten data, ChatGPT generated new data, combination of rewritten data and new data.
                   Macro Precision  Macro Recall  Macro F1 Score  Micro Precision  Micro Recall  Micro F1 Score
without DA         15.12  12.77  13.32  75.14  64.01  69.13
                   (14.69, 16.04)  (12.34, 13.20)  (12.82, 13.82)  (73.75, 76.52)  (62.11, 65.92)  (67.46, 70.80)
rewritten samples  10.95  8.94  9.40  76.05  65.12  70.16
                   (9.54, 12.36)  (8.21, 9.67)  (8.55, 10.25)  (74.22, 77.88)  (63.84, 66.40)  (68.91, 71.41)
new samples        17.86  14.23  15.42  77.69  64.71  70.60
                   (16.34, 19.39)  (13.19, 15.28)  (13.91, 16.38)  (75.46, 79.92)  (64.11, 65.31)  (69.46, 71.75)
rewritten + new    11.90  9.92  10.23  79.18  65.83  71.89
                   (11.45, 12.89)  (8.85, 10.69)  (9.45, 11.02)  (73.57, 79.98)  (61.90, 66.76)  (67.73, 72.24)
The table above shows the mean scores (unit: %) of macro precision, recall, and F1 and micro precision, recall, and F1 of running the BERT model without augmentation data, with ChatGPT rewritten samples, with 20 ChatGPT-generated new samples per label, and with the generated new samples plus selected rewritten samples.
Table 7. Categorical F1 scores of BERT model with no augmentation, ChatGPT rephrased data, ChatGPT generated new data, combination of rephrased data and new data.
Category  Samples  Avg_Noaug  Avg_Rephrase  Avg_New  Avg_Rephrase_New
earn  2877  0.9810  0.9854  0.9863  0.9825
acq  1650  0.9520  0.9732  0.9760  0.9767
money-fx  538  0.7856  0.8585  0.8391  0.8427
grain  433  0.9031  0.9420  0.9264  0.9451
crude  389  0.8723  0.9058  0.9143  0.9068
trade  368  0.7567  0.7995  0.8157  0.8037
interest  347  0.7535  0.8280  0.8594  0.8527
wheat  212  0.8537  0.8724  0.8584  0.8591
ship  197  0.8000  0.8854  0.8902  0.8932
corn  181  0.8761  0.8785  0.8692  0.8777
money-supply  140  0.7822  0.7966  0.8339  0.7831
dlr  131  0.6849  0.7733  0.7846  0.8182
sugar  126  0.8934  0.9100  0.8768  0.9011
oilseed  124  0.6273  0.7256  0.7231  0.7269
coffee  111  0.9524  0.9479  0.9641  0.9487
gnp  101  0.8186  0.8588  0.8169  0.8186
gold  94  0.8577  0.9075  0.9087  0.9325
veg-oil  87  0.6287  0.7074  0.6799  0.6616
soybean  78  0.6122  0.7203  0.7145  0.6813
nat-gas  75  0.6465  0.7660  0.6925  0.7284
livestock  75  0.5350  0.6998  0.7125  0.7188
bop  75  0.6772  0.7391  0.6831  0.6609
cpi  69  0.6121  0.7145  0.6757  0.6607
cocoa  55  0.9916  1.0000  1.0000  1.0000
reserves  55  0.6975  0.7945  0.8182  0.8311
carcass  50  0.5926  0.6547  0.6169  0.6114
copper  47  0.8628  0.9149  0.9261  0.9304
jobs  46  0.6790  0.6701  0.7224  0.7275
yen  45  0.3556  0.6460  0.6195  0.6448
ipi  41  0.8382  0.9212  0.9042  0.9246
iron-steel  40  0.7032  0.7926  0.8688  0.8572
cotton  39  0.7110  0.7386  0.7489  0.7310
gas  37  0.6808  0.8646  0.8625  0.8278
barley  37  0.6652  0.7463  0.7873  0.8225
rubber  37  0.8312  0.8775  0.8886  0.9579
alum  35  0.7176  0.9045  0.8871  0.9006
rice  35  0.7118  0.8204  0.7259  0.7939
meal-feed  30  0.2092  0.6469  0.4990  0.6037
palm-oil  30  0.7022  0.8571  0.8388  0.8487
sorghum  24  0.3567  0.5814  0.5387  0.6040
retail  23  0.1333  0.6000  0.6429  0.6334
silver  21  0.6590  0.7663  0.7776  0.7786
zinc  21  0.8908  0.9173  0.8854  0.9364
pet-chem  20  0.1943  0.7401  0.4578  0.6516
wpi  19  0.7067  0.9146  0.9140  0.9123
tin  18  0.8351  0.9565  0.9497  0.9565
rapeseed  18  0.7157  0.7695  0.6549  0.7629
strategic-metal  16  0.0333  0.3770  0.5645  0.5450
housing  16  0.7131  0.8571  0.7755  0.8190
hog  16  0.5438  0.6273  0.7567  0.7503
orange  16  0.7469  0.9124  0.9000  0.9210
lead  15  0.3336  0.8764  0.8143  0.9513
soy-oil  14  0.0507  0.3505  0.2468  0.3417
heat  14  0.6372  0.6616  0.7443  0.7073
fuel  13  0.3428  0.6963  0.6613  0.6833
soy-meal  13  0.0842  0.5265  0.6156  0.6171
lei  12  0.9457  1.0000  1.0000  0.9714
sunseed  11  0.3076  0.5428  0.4609  0.6367
dmk  10  0.0333  0.0800  0.0000  0.0800
lumber  10  0.2143  0.8436  0.8468  0.8436
tea  9  0.3067  0.8857  0.9592  0.9428
income  9  0.5578  0.7485  0.7273  0.7126
nickel  8  0.1500  0.7000  1.0000  1.0000
oat  8  0.2364  0.2794  0.4141  0.5200
l-cattle  6  0.0900  0.4667  0.5381  0.6600
rape-oil  5  0.0000  0.0000  0.0714  0.1000
sun-oil  5  0.0000  0.0000  0.1905  0.0000
groundnut  5  0.0000  0.0800  0.4000  0.4000
instal-debt  5  0.0000  1.0000  0.9524  1.0000
platinum  5  0.1100  0.6255  0.6697  0.7230
coconut  4  0.1000  0.6667  0.3572  0.7000
coconut-oil  4  0.1300  0.3000  0.1286  0.4000
jet  4  0.0000  0.2333  0.4048  0.7333
propane  3  0.0000  0.1000  0.6143  0.8400
potato  3  0.3600  0.7200  1.0000  1.0000
cpu  3  0.4000  1.0000  0.8571  1.0000
dfl  2  0.0000  0.0000  0.0000  0.0000
nzdlr  2  0.0000  0.0000  0.0952  0.5334
palmkernel  2  0.0000  0.0000  0.0000  0.0000
copra-cake  2  0.0000  0.0000  0.5714  0.0000
palladium  2  0.0000  0.4000  0.7143  0.4000
naphtha  2  0.0000  0.0800  0.5143  0.6667
rand  2  0.0000  0.6000  1.0000  1.0000
castor-oil  1  0.0000  0.0000  0.0000  0.0000
nkr  1  0.0000  0.0000  0.0000  0.0000
sun-meal  1  0.0000  0.0000  0.0000  0.0000
groundnut-oil  1  0.0000  0.0000  0.1429  0.0000
lin-oil  1  0.0000  0.0000  0.0000  0.0000
cotton-oil  1  0.0000  0.0000  0.0000  0.0000
rye  1  0.0000  0.0000  0.0000  0.0000
Table 8. Average scores of the majority and minority classes. In this context, majority classes pertain to those with more training samples than the specified threshold, while minority classes refer to those with fewer training samples than the threshold.
                           Avg_Noaug  Avg_Rephrase  Avg_New  Avg_Rephrase_New
majority (threshold = 40)  0.7627  0.8266  0.8203  0.8217
minority (threshold = 40)  0.3026  0.5209  0.5701  0.6064