Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains
Abstract
1. Introduction
1.1. Toward Empowered Educational Research: Generative AI in Service of the Academic Community
1.2. Researcher Autonomy and Competencies in AI and NLP: Overcoming Small Data with Critical Thinking
1.3. Methodological Integrity and Pedagogical Responsibility in Data Generation
- -
- RQ1: What are the latent thematic structures in corpora of sensitive human narratives, and how are they semantically characterized?
- -
- RQ2: What seed-document selection strategy optimizes thematic purity before semantic saturation occurs in synthetic text generation?
- -
- RQ3: How does the semantic fidelity of synthetic texts vary according to the thematic complexity of the original cluster?
2. Materials and Methods
2.1. Origin of the Corpus
2.2. Pipeline Architecture
2.3. Hyperparameter Configuration
2.4. System Prompt and LLM Configuration
- -
- Platform Environment: Reddit (Subreddit: r/breakups)
- -
- Social Context: You are simulating a digital safe space where individuals express raw, unfiltered sentiments regarding heartbreak, relationship ruptures, and grief stages.
- -
- Topic Focus: {topic_description}
- -
- Follow structural and emotional patterns from “Few-Shot Seeds”
- -
- Ensure generated data is indistinguishable from real human venting
- -
- Create entirely new personas and stories
- -
- You are a deterministic generator (do NOT act as an AI assistant)
- -
- Target Length:
- -
- Reference Material:
- -
- Persona Rotation:Each output must use a distinct linguistic fingerprint.Vary:
- -
- Age
- -
- Gender
- -
- Stage of grief (denial, anger, bargaining, depression, acceptance)
- -
- Mimetic Accuracy:Replicate Reddit structural entropy:
- -
- Non-standard syntax
- -
- Irregular punctuation
- -
- Platform vernacular (M28, F21, TL;DR, ex-partner, NC)
- -
- No AI-style behavior:
- -
- No advice
- -
- No balanced framing
- -
- No hopeful conclusions
- -
- Emotional instability must remain intact
- -
- Narrative Depth:Include:
- -
- Sensory details
- -
- Internal monologues
- -
- Physical/emotional sensations
- -
- COLD START:
- -
- NO REASONING:
- -
- DELIMITER:
- -
- CLEAN OUTPUT:
2.5. Proposed Human Validation Protocol
3. Results
3.1. Characterization of the Original Corpus and Thematic Modeling
3.2. Answering the Research Questions
3.3. Optimal Extraction of Seed Documents via the Elbow Method
3.4. Computational Efficiency and Generative Orchestration
3.5. Semantic Fidelity Validation via Cosine Similarity
3.6. Human Validation Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Human Evaluation Coding Guide
- -
- Topic 0—Post-Breakup Emotional Turmoil/Social Distance: intermittent contact, involvement of friends/family, weeks since separation
- -
- Topic 1—Self-Sabotage and Avoidant Pattern: how they met the ex, reproaches, “he/she told me that…”, early-relationship memories
- -
- Topic 2—Unreciprocated Investment and Abandonment: cycle of hope and disillusionment, damaged self-esteem, emotional pain, missing the partner
- -
- Topic 3—Self-Loss and Unreadiness: books, therapy, advice that “helped”, search for realistic hope
- -
- Topic 4—Self-Created Healing Resources: “breakup kit”, apps, healing journals, concrete healing tools
- -
- Topic 5—Abrupt Departure and Communication Deficit: narratives of escape, feeling of “prison”, emotional exhaustion, abuse
References
- Adadi, A. (2021). A survey on data-efficient algorithms in big data era. Journal of Big Data, 8(1), 1. [Google Scholar] [CrossRef]
- Braunack-Mayer, A., Carolan, L., Street, J., Ha, T., Fabrianesi, B., & Carter, S. (2023). Ethical issues in big data: A qualitative study comparing responses in the health and higher education sectors. PLoS ONE, 18(4), e0282285. [Google Scholar] [CrossRef] [PubMed]
- Chim, J., Ive, J., & Liakata, M. (2025). Evaluating synthetic data generation from user generated text. Computational Linguistics, 51(1), 191–233. [Google Scholar] [CrossRef]
- Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G., Xia, W., Hu, J., Luu, A. T., & Joty, S. (2024). Data augmentation using LLMs: Data perspectives, learning paradigms and challenges. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the association for computational linguistics: ACL 2024 (pp. 1679–1705). Association for Computational Linguistics. [Google Scholar] [CrossRef]
- Feng, Y., Li, L., Qin, X., & Zhang, B. (2025). Improving event representation learning via generating and utilizing synthetic data. Information Processing & Management, 62(4), 104083. [Google Scholar] [CrossRef]
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv, arXiv:2203.05794. [Google Scholar] [CrossRef]
- Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. [Google Scholar] [CrossRef]
- Khodeir, N., & Elghannam, F. (2024). Efficient topic identification for urgent MOOC forum posts using BERTopic and traditional topic modeling techniques. Education and Information Technologies, 30(5), 5501–5527. [Google Scholar] [CrossRef]
- Leinonen, J., Hellas, A., & Taubert, N. (2024). LLM-itation is the sincerest form of data. arXiv, arXiv:2411.10455. [Google Scholar] [CrossRef]
- Lenatti, M., Paglialonga, A., Orani, V., Ferretti, M., & Mongelli, M. (2023). Characterization of synthetic health data using rule-based artificial intelligence models. IEEE Journal of Biomedical and Health Informatics, 27(8), 3760–3769. [Google Scholar] [CrossRef] [PubMed]
- Levin, D., & Singer, G. (2024). GB-AFS: Graph-based automatic feature selection for multi-class classification via mean simplified silhouette. Journal of Big Data, 11, 79. [Google Scholar] [CrossRef]
- Liu, Q., Khalil, M., Jovanovic, J., & Shakya, R. (2024). Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics. In Proceedings of the 14th learning analytics and knowledge conference (LAK ‘24) (pp. 620–631). ACM. [Google Scholar] [CrossRef]
- López-Pernas, S., Misiejuk, K., Kaliisa, R., & Saqr, M. (2025). Capturing the process of students’ AI interactions when creating and learning complex network structures. IEEE Transactions on Learning Technologies, 18, 556–568. [Google Scholar] [CrossRef]
- McDaniel, E. L., Scheele, S., & Liu, J. (2024). Zero-shot classification of crisis tweets using instruction-finetuned large language models. In 2024 IEEE international humanitarian technologies conference (IHTC) (pp. 1–7). IEEE. [Google Scholar] [CrossRef]
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv, arXiv:1802.03426. [Google Scholar] [CrossRef]
- Miletić, M., & Sariyar, M. (2025). Utility-based analysis of statistical approaches and deep learning models for synthetic data generation. JMIR AI, 4, e65729. [Google Scholar] [CrossRef] [PubMed]
- Nadǎş, M., Dioşan, L., & Tomescu, A. (2025). Synthetic data generation using large language models: Advances in text and code. IEEE Access, 13, 134615–134633. [Google Scholar] [CrossRef]
- Pattnayak, P., Chowdhuri, S., Agarwal, A., & Patel, H. L. (2025). LLM-guided lifecycle-aware clustering of multi-turn customer support conversations. In K. Inui, S. Sakti, H. Wang, D. F. Wong, P. Bhattacharyya, B. Banerjee, A. Ekbal, T. Chakraborty, & D. P. Singh (Eds.), Proceedings of the 14th international joint conference on natural language processing and the 4th conference of the Asia-Pacific chapter of the association for computational linguistics (pp. 3180–3206). AFNLP & ACL. [Google Scholar] [CrossRef]
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3982–3992). Association for Computational Linguistics. [Google Scholar] [CrossRef]
- Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a haystack: Detecting knee points in system behavior. In Proceedings of the 2011 31st international conference on distributed computing systems workshops (pp. 166–171). IEEE. [Google Scholar] [CrossRef]
- Shujon, S. (2025). Reddit break-up stories dataset (2023–2025) [Data set]. Kaggle. Available online: https://www.kaggle.com/datasets/shakhoyatshujon/reddit-break-up-stories-dataset-20232025 (accessed on 28 February 2026).



| Component | Parameter | Value |
|---|---|---|
| Embeddings | embedding_model | all-mpnet-base-v2 |
| UMAP | n_neighbors | 15 |
| UMAP | n_components | 5 |
| UMAP | min_dist | 0.0 |
| UMAP | metric | cosine |
| UMAP | random_state | 42 |
| HDBSCAN | min_cluster_size | 15 |
| HDBSCAN | min_samples | 10 |
| HDBSCAN | metric | Euclidean |
| HDBSCAN | cluster_selection_method | eom |
| HDBSCAN | prediction_data | True |
| VECTORIZER | min_df | 2 |
| VECTORIZER | max_df | 0.95 |
| VECTORIZER | ngram_range | (1, 2) |
| LLM API | model | DeepSeek V3.2 |
| LLM API | temperature | 0.85 |
| LLM API | top_p | 0.9 |
| LLM API | max_tokens | 3000 |
| LLM API | include_reasoning | False |
| Topic | Count | Name | Percentage | Cohesion | Diversity |
|---|---|---|---|---|---|
| 0 | 311 | contact_family_weeks_week | 32.88% | 0.672 | 0.551 |
| 1 | 288 | saying_asked_boyfriend_met | 30.44% | 0.670 | 0.552 |
| 2 | 79 | hope_heart_bad_let | 8.35% | 0.623 | 0.620 |
| 3 | 36 | helped_come_real_hope | 3.81% | 0.710 | 0.510 |
| 4 | 20 | kit_page_tracker_helped | 2.11% | 0.690 | 0.552 |
| 5 | 18 | friend_finally_leave_tired | 1.90% | 0.670 | 0.583 |
| Topic | Labels |
|---|---|
| 0 | Post-Breakup Emotional Turmoil |
| 1 | Self-Sabotage and Avoidant Pattern |
| 2 | Unreciprocated Investment and Abandonment |
| 3 | Self-Loss and Unreadiness |
| 4 | Self-Created Healing Resources |
| 5 | Abrupt Departure and Communication Deficit |
| Topic | Cut-Off Point (n) |
|---|---|
| T0 | 11 |
| T1 | 11 |
| T2 | 27 |
| T3 | 10 |
| T4 | 10 |
| T5 | 17 |
| Topic | Global Mean (Words) | Seed Mean (Words) |
|---|---|---|
| T0 | 335 | 157 |
| T1 | 368 | 343 |
| T2 | 245 | 352 |
| T3 | 159 | 165 |
| T4 | 88 | 74 |
| T5 | 349 | 366 |
| Topic | Posts Generated | Total Time | Rate (sec/Post) | Batch Size |
|---|---|---|---|---|
| T0 | 1200 | 53:47 | 2.69 | 5 |
| T1 | 1200 | 1:04:01 | 3.20 | 5 |
| T2 | 1200 | 1:07:29 | 3.36 | 5 |
| T3 | 1200 | 1:01:31 | 3.08 | 5 |
| T4 | 1200 | 56:07 | 2.80 | 5 |
| T5 | 1200 | 56:15 | 2.81 | 5 |
| TOTAL | 7200 | ~6.5 h | 2.99 (mean) | 5 |
| Topic | Posts Evaluated | P10 Threshold | Post Approved | Mean AI Similarity | Fidelity Rate (%) |
|---|---|---|---|---|---|
| T0 | 1200 | 0.5337 | 1194 | 0.6987 | 99.50% |
| T1 | 1200 | 0.5376 | 1189 | 0.7506 | 99.08% |
| T2 | 1200 | 0.4905 | 1193 | 0.7244 | 99.42% |
| T3 | 1200 | 0.6149 | 1161 | 0.7220 | 96.75% |
| T4 | 1200 | 0.5067 | 1200 | 0.7855 | 100.00% |
| T5 | 1200 | 0.5774 | 997 | 0.6413 | 83.08% |
| TOTAL | 7200 | — | 7134 | — | 99.08% |
| Topic | Posts Evaluated | AC2 Adherence | AC2 Authenticity | Status |
|---|---|---|---|---|
| T0 | 291 | 0.759 | 0.724 | Validated |
| T1 | 291 | 0.704 | 0.892 | Validated |
| T2 | 291 | 0.994 | 0.900 | Validated |
| T3 | 291 | 0.660 | 0.817 | Marginal |
| T4 | 291 | 0.947 | 0.838 | Validated |
| T5 | 291 | 0.447 | 0.216 | Not validated |
| Weighted mean | 1732 | 0.754 | 7134 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Sosa-Ramírez, R.; López-Meneses, E.; González-Zamar, M.-D.; Cevallos, M.B.M. Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains. Educ. Sci. 2026, 16, 885. https://doi.org/10.3390/educsci16060885
Sosa-Ramírez R, López-Meneses E, González-Zamar M-D, Cevallos MBM. Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains. Education Sciences. 2026; 16(6):885. https://doi.org/10.3390/educsci16060885
Chicago/Turabian StyleSosa-Ramírez, Rafael, Eloy López-Meneses, Mariana-Daniela González-Zamar, and María Belén Morales Cevallos. 2026. "Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains" Education Sciences 16, no. 6: 885. https://doi.org/10.3390/educsci16060885
APA StyleSosa-Ramírez, R., López-Meneses, E., González-Zamar, M.-D., & Cevallos, M. B. M. (2026). Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains. Education Sciences, 16(6), 885. https://doi.org/10.3390/educsci16060885

