This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Synthetic Text as Data: On Usefulness and Limitations

1
Research and Development Center, Nanum Space Co., Ltd., Jeonju 54907, Republic of Korea
2
Department of Statistics, Institute of Applied Statistics, Jeonbuk National University, Jeonju 54896, Republic of Korea
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(10), 5460; https://doi.org/10.3390/app15105460
Submission received: 15 April 2025 / Revised: 4 May 2025 / Accepted: 12 May 2025 / Published: 13 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This study investigates the utility of GPT-generated text as a training resource in supervised learning from two perspectives: its effectiveness as an augmentation tool in data-scarce or class-imbalanced settings, and its potential as a substitute for human-written data. Using MBTI personality classification as a benchmark task, we conducted controlled experiments under both class-imbalance and few-shot learning conditions. The results show that GPT-generated text can improve classification performance when used to supplement underrepresented classes. However, when synthetic data fully replace real data, performance declines significantly, particularly in tasks requiring fine-grained semantic distinctions. Further analysis reveals that GPT outputs often capture only partial personality traits, which enables coarse-level classification but falls short in nuanced cases. These findings suggest that GPT-generated text can function as a conditional training resource whose effectiveness is closely tied to the granularity of the classification task.
Keywords: GPT-generated text; synthetic data; data augmentation; MBTI classification; class imbalance; few-shot learning; fine-grained classification; data granularity; large language models

Share and Cite

MDPI and ACS Style

Choi, S.; Sim, J.; Choi, G. Synthetic Text as Data: On Usefulness and Limitations. Appl. Sci. 2025, 15, 5460. https://doi.org/10.3390/app15105460
