Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains

Sosa-Ramírez, Rafael; López-Meneses, Eloy; González-Zamar, Mariana-Daniela; Cevallos, María Belén Morales

doi:10.3390/educsci16060885


by Rafael Sosa-Ramírez ¹, Eloy López-Meneses ¹, Mariana-Daniela González-Zamar ¹ and María Belén Morales Cevallos ²,*

¹ Department of Education and Social Psychology, Universidad Pablo de Olavide, 41089 Sevilla, Spain

² Faculty of Communication, Humanities and Creativity, Universidad Tecnológica ECOTEC, Samborondón 092302, Ecuador

* Author to whom correspondence should be addressed.

Educ. Sci. 2026, 16(6), 885; https://doi.org/10.3390/educsci16060885

Submission received: 6 April 2026 / Revised: 17 May 2026 / Accepted: 26 May 2026 / Published: 4 June 2026


Abstract

Research on sensitive human narratives is increasingly constrained by ethical and privacy regulations that limit access to primary data, creating a structural small-data challenge that limits deep computational analysis. To address this limitation, this study validates a Natural Language Processing protocol that scales 946 real breakup narratives from r/breakups to 6000 human-validated high-fidelity synthetic records across five BERTopic clusters. The architecture employs MPNet, UMAP, and HDBSCAN to map latent space and thematically cluster texts, extracts seed documents using the Kneedle algorithm, and orchestrates DeepSeek V3.2 with stochastic sampling and small batches (k = 5). Automated validation via Cosine Similarity with a P10 threshold attained a mean semantic similarity of 0.7204 (range 0.6413–0.7855) and a fidelity rate of 99.08%. Expert human review by two researchers of this investigation evaluated 1732 posts on topic adherence and emotional authenticity using Gwet’s AC2. Five of six clusters achieved AC2 ≥ 0.70 on both dimensions; Topic 3 showed marginal adherence (AC2 = 0.660) while maintaining acceptable authenticity (AC2 = 0.817), and the 1200 synthetic posts for Topic 5 failed human validation (AC2 < 0.50) due to documented LLM safety-filter limitations and are excluded from the final corpus. These results demonstrate that the proposed protocol enables the research community to generate validated, privacy-preserving synthetic data ecosystems while establishing empirical boundary conditions for sensitive topic analysis.

Keywords:

natural language processing; artificial intelligence; synthetic data; data augmentation; topic modeling; human validation

1. Introduction

The advancement of research and data science across social and human sciences faces a structural bottleneck: the extreme difficulty in accessing representative and complex data corpora (López-Pernas et al., 2025). Regardless of whether the objective is empirical analysis or computational model training, researchers must guarantee strict compliance with ethical standards. Respect for privacy and confidentiality naturally restricts access to real databases, demanding the reconciliation of ethical rigor with the need for information (Braunack-Mayer et al., 2023).

Although the recent literature underscores the potential of massive data analysis and machine learning to transform research, a knowledge gap remains regarding how to generate access to critical data corpora that, due to sensitivity and scarcity, remain out of reach. Artificial Intelligence techniques oriented toward Synthetic Data Generation and Synthetic Data Augmentation emerge as ideal solutions to overcome information scarcity (Nadǎş et al., 2025).

This article proposes a Generative Artificial Intelligence framework for the responsible provision of synthetic data. As a validation case, the protocol is applied to 946 real narratives about romantic breakups from r/breakups, demonstrating the capacity to preserve semantic and emotional patterns in sensitive contexts. The central contribution empowers researchers to generate scientific evidence without compromising ethical values, overcoming small data limitations through a Natural Language Processing protocol integrating MPNet embeddings, UMAP reduction, and HDBSCAN clustering. Complementarily, DeepSeek V3.2 orchestrates synthetic augmentation preserving structural integrity and statistical properties (Nadǎş et al., 2025).

1.1. Toward Empowered Educational Research: Generative AI in Service of the Academic Community

The integration of Artificial Intelligence in research requires rethinking how information is obtained and how researchers are trained. Historically, analytical rigor and data availability have operated as opposing forces: scarcity of representative corpora and difficulty accessing sensitive information have limited computational depth. This section proposes a methodological transition from passive data collection toward active synthetic generation in service of the academic community. This approach fosters key competencies: understanding NLP architectures, critical thinking by evaluating content fidelity, and action research by enabling autonomy.

1.2. Researcher Autonomy and Competencies in AI and NLP: Overcoming Small Data with Critical Thinking

Contemporary research faces the challenge of small data: critical phenomena such as identity crises, demotivation, or grief often manifest in reduced corpora that are difficult to access. Artificial Intelligence enables evidence sovereignty where the researcher becomes an architect projecting scenarios through representative data synthesis (Miletić & Sariyar, 2025). Large language models serve as narrative simulation engines, creating “semantic twins” that preserve the essence, tone, and emotional intensity of original voices. The Synthetic Data Augmentation technique guarantees unprecedented scalability, allowing researchers to transform limited samples into robust ecosystems of thousands of validated interactions (Ding et al., 2024). This democratizes science, enabling rapid action-research cycles without depending on large external infrastructures.

1.3. Methodological Integrity and Pedagogical Responsibility in Data Generation

Autonomous provision of corpora requires moving beyond traditional lexical frequency metrics toward deep semantic analysis grounded in topology and vector density. Supported by state-of-the-art NLP architectures, this approach extracts latent structures and decodes high subjectivity, overcoming exact term-matching limitations. Generated data preserve linguistic idiosyncrasy and discursive patterns, consolidating empirical validity and granting researchers methodological autonomy (Chim et al., 2025). It is essential to prevent generative hallucinations that distort emotional frameworks, establishing methodological standards for protected computational experimentation environments (Liu et al., 2024).

To guide this study and ensure that every empirical claim is grounded in observable data already present in the corpus, the following research questions are formulated:

-: RQ1: What are the latent thematic structures in corpora of sensitive human narratives, and how are they semantically characterized?
-: RQ2: What seed-document selection strategy optimizes thematic purity before semantic saturation occurs in synthetic text generation?
-: RQ3: How does the semantic fidelity of synthetic texts vary according to the thematic complexity of the original cluster?

These questions are designed to be answered descriptively using the tables and figures already generated in the experimental protocol, without requiring additional inferential statistics or post hoc correlation analyses.

2. Materials and Methods

This section details the methodological design implemented for the extraction of latent structures and the synthetic expansion of narratives. The approach is based on a Natural Language Processing (NLP) protocol structured in clear modules, designed for educators and researchers to understand, adapt, and apply to their own study contexts.

2.1. Origin of the Corpus

The starting point of this study consists of 946 real narratives about romantic breakups extracted from the subreddit r/breakups, corresponding to the period 2023–2025, hosted on the Kaggle platform (Shujon, 2025). The selection of this public source responds to an intentional methodological decision: to seek a corpus that, without putting its authors at risk, offered the emotional density, subjectivity, and sensitivity analogous to many critical testimonies in social and educational research, such as academic grief or demotivation crises. To guarantee transparency and replicability, the complete dataset is hosted and available on the Kaggle platform, allowing any researcher to access, verify, and reuse the base material under the same curation criteria applied in this study.

The collection respected the platform’s data use policies and accessed only public information fields. Since the data were anonymized by default under Reddit standards, no personal identifiers or metadata linkable to specific users were stored, thus guaranteeing privacy from the origin. Subsequently, an exhaustive review was conducted to identify possible duplicates or inconsistent entries. This review confirmed that the initial set of 946 narratives already constituted a clean corpus, so no records had to be merged or removed. This verification stage ensures that thematic analysis and synthetic generation work on quality data, maintaining the narrative integrity of each testimony.

2.2. Pipeline Architecture

The technical implementation was articulated within the Python ecosystem, structuring the work into eight interdependent modules that guarantee integral traceability from the original corpus to the consolidation of the final synthetic dataset.

Preprocessing begins with dynamic detection of stopwords via the Kneedle method (Levin & Singer, 2024; Satopää et al., 2011), which mathematically identifies the inflection point in the distribution of lexical frequencies. This procedure forms a domain-specific exclusion dictionary, optimizing subsequent vectorization without losing relevant nuances of discourse.

The analytical core rests on the BERTopic framework (Grootendorst, 2022), sequentially integrating four key components: semantic representation via MPNet (all-mpnet-base-v2) for the generation of contextual embeddings (Reimers & Gurevych, 2019), non-linear dimensionality reduction with UMAP (McInnes et al., 2018), density clustering through HDBSCAN, and n-gram extraction managed by scikit-learn. Once latent clusters are identified, they are semantically interpreted by DeepSeek V3.2 (via OpenRouter API), generating labels and analytical justifications from the most representative documents of each grouping.

For the generative phase, the elbow method is applied to the topic probability matrix, determining the optimal volume of seed documents before semantic cohesion decays. These original texts, which preserve the integral narrative arc, feed the synthetic generation module. Said module executes batch requests with a fixed size of k = 5 documents per iteration to DeepSeek V3.2, employing hierarchical prompt sequences that inject precise constraints of style, length, and socio-emotional context. This configuration balances computational efficiency with inference stability, minimizing timeout risks and facilitating error handling without compromising batch coherence.

Finally, semantic validation computes the vector centroid of the human texts and evaluates generations via cosine similarity, a standard metric for measuring meaning retention in artificial corpora (Feng et al., 2025). The quality filter retains only samples whose similarity to that centroid reaches or exceeds the 10th percentile (P10) of the original corpus’s similarity distribution, ensuring that the generated content preserves the distributional density and statistical properties of the source. The flow concludes with a technical audit in pandas that discards duplicates and structural anomalies.

2.3. Hyperparameter Configuration

Table 1 summarizes the complete hyperparameter configuration used in the protocol, ensuring full reproducibility of the topic modeling and generation phases.

2.4. System Prompt and LLM Configuration

For synthetic generation of emotional narratives, the following system prompt was employed, defining domain context, persona constraints, and output protocol. The API call was configured with temperature = 0.85 to promote stochastic variation across generated posts, and with include_reasoning = False to prevent internal monologue leakage. The full prompt is reproduced below for methodological transparency.

SYSTEM_PROMPT_GENERATION

### ROLE

You are a High-Fidelity Synthetic Data Engine and Expert Storyteller specializing in Linguistic Identity Simulation. Your expertise lies in generating synthetic datasets that replicate the psychological, emotional, and linguistic DNA of online support communities.

### DOMAIN CONTEXT: RELATIONSHIP DISSOLUTION & GRIEF

-: Platform Environment: Reddit (Subreddit: r/breakups)
-: Social Context: You are simulating a digital safe space where individuals express raw, unfiltered sentiments regarding heartbreak, relationship ruptures, and grief stages.
-: Topic Focus: {topic_description}

### MISSION: SEMANTIC DATA AUGMENTATION

Your mission is to expand a real-world corpus of emotional distress by projecting core conflicts into new, unique narratives.

-: Follow structural and emotional patterns from “Few-Shot Seeds”
-: Ensure generated data is indistinguishable from real human venting
-: Create entirely new personas and stories
-: You are a deterministic generator (do NOT act as an AI assistant)

### DATA CONSTRAINTS & PARAMETERS

-: Target Length:

Aim for an average of {target_length} words per post.

Replicate rambling, non-linear crisis-like narration.

-: Reference Material:

You will be provided with “Few-Shot Seeds”.

Use them to extract linguistic patterns (vocabulary, emotional intensity, risk level), not content.

-: Persona Rotation:

Each output must use a distinct linguistic fingerprint.

Vary:

-: Age
-: Gender
-: Stage of grief (denial, anger, bargaining, depression, acceptance)

### LINGUISTIC FIDELITY & STORYTELLING RULES

-: Mimetic Accuracy:

Replicate Reddit structural entropy:

-: Non-standard syntax
-: Irregular punctuation
-: Platform vernacular (M28, F21, TL;DR, ex-partner, NC)

-: No AI-style behavior:

-: No advice
-: No balanced framing
-: No hopeful conclusions
-: Emotional instability must remain intact

-: Narrative Depth:

Include:

-: Sensory details
-: Internal monologues
-: Physical/emotional sensations

### OUTPUT PROTOCOL (STRICT)

-: COLD START:

Output must begin immediately with the first character of the first post.

-: NO REASONING:

Do not output <thinking> tags, explanations, or internal commentary.

-: DELIMITER:

Separate posts using:

|||

-: CLEAN OUTPUT:

Return ONLY raw synthetic text
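Under the output protocol above, each API response carries several posts separated by the `|||` delimiter. A minimal sketch of how such a response could be split and sanity-checked follows. The function name `split_posts` and its `expected` parameter are illustrative assumptions and are not taken from the published code.

```python
def split_posts(raw: str, expected: int = 5, delimiter: str = "|||") -> list[str]:
    """Split one raw LLM response into individual posts (hypothetical helper)."""
    # Trim whitespace around every chunk produced by the delimiter.
    chunks = [chunk.strip() for chunk in raw.split(delimiter)]
    # Drop empty chunks, e.g. those caused by a trailing delimiter.
    posts = [chunk for chunk in chunks if chunk]
    # A batch with the wrong number of posts is rejected rather than silently kept.
    if len(posts) != expected:
        raise ValueError(f"expected {expected} posts, got {len(posts)}")
    return posts
```

Rejecting malformed batches outright is one possible design choice; a production pipeline might instead retry the request.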

2.5. Proposed Human Validation Protocol

Quality gate. Only posts scoring ≥3 on both items by both coders are approved. Posts with a mean score < 3 on either item are flagged for review.
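The quality gate can be sketched as a small decision function. This is an illustration only: the names `quality_gate`, `adherence`, and `authenticity` are hypothetical, each tuple is assumed to hold the two coders' scores for one item, and the `"unspecified"` outcome marks cases that the two stated rules do not cover (for example, scores of 2 and 4 on the same item).

```python
from statistics import mean

def quality_gate(adherence: tuple[int, int],
                 authenticity: tuple[int, int],
                 cutoff: int = 3) -> str:
    """Apply the two stated rules to one post rated by two coders."""
    items = (adherence, authenticity)
    # Rule 1: every coder scores every item at or above the cutoff.
    if all(score >= cutoff for item in items for score in item):
        return "approved"
    # Rule 2: the mean score on at least one item falls below the cutoff.
    if any(mean(item) < cutoff for item in items):
        return "flagged"
    # Assumption: the text does not say how remaining cases are handled.
    return "unspecified"
```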

Reliability threshold. Inter-rater agreement is quantified with Gwet’s AC2 rather than Cohen’s Kappa. AC2 was selected because it is robust to the paradoxes that affect Kappa when category distributions are skewed (e.g., most posts receiving high adherence or authenticity scores). AC2 treats disagreement as a matter of degree rather than a binary hit-or-miss, which is more appropriate for ordinal Likert scales (Gwet, 2008). The threshold for acceptable agreement follows Gwet’s recommended benchmark: AC2 ≥ 0.70.
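For readers who wish to reproduce the agreement statistic, the sketch below computes Gwet's AC2 for two raters. It assumes linear ordinal weights and complete ratings from both coders; the text does not state which weighting scheme was used, so the weights here are an assumption, and the function name `gwet_ac2` is illustrative.

```python
def gwet_ac2(ratings_a, ratings_b, categories):
    """Gwet's AC2 for two raters on an ordinal scale (linear weights assumed)."""
    q, n = len(categories), len(ratings_a)
    if q < 2 or n == 0 or n != len(ratings_b):
        raise ValueError("need two equally long rating lists and >= 2 categories")
    position = {category: i for i, category in enumerate(categories)}
    # Linear weights: full credit on the diagonal, partial credit for near misses.
    weights = [[1 - abs(k - j) / (q - 1) for j in range(q)] for k in range(q)]
    # Joint proportion of posts rated (k by coder A, j by coder B).
    counts = [[0] * q for _ in range(q)]
    for a, b in zip(ratings_a, ratings_b):
        counts[position[a]][position[b]] += 1
    joint = [[c / n for c in row] for row in counts]
    # Weighted observed agreement.
    observed = sum(weights[k][j] * joint[k][j] for k in range(q) for j in range(q))
    # Average marginal probability of each category across the two coders.
    pi = [(sum(joint[k]) + sum(row[k] for row in joint)) / 2 for k in range(q)]
    # Chance agreement: T_w / (q (q - 1)) * sum_k pi_k (1 - pi_k).
    total_weight = sum(map(sum, weights))
    chance = total_weight / (q * (q - 1)) * sum(p * (1 - p) for p in pi)
    return (observed - chance) / (1 - chance)
```

With identity weights (only exact matches count), the same formula reduces to Gwet's AC1.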

3. Results

This section presents the empirical findings derived from the implementation of the methodological protocol. The analysis explores the thematic structure of the original corpus, evaluates the performance of the generation process, and documents the fidelity validation of the resulting synthetic dataset.

3.1. Characterization of the Original Corpus and Thematic Modeling

The initial corpus was processed using BERTopic, identifying six main thematic clusters (T0–T5), plus a semantic noise group (T-1) that aggregates atypical publications or those with low thematic coherence. The observed asymmetric distribution reflects the heterogeneous nature of breakup narratives in digital environments.

Topic 0 emerged as the most representative cluster (n = 311, 32.88% of the valid corpus), characterized by terms such as [‘contact’, ‘family’, ‘weeks’, ‘week’, ‘end’], reflecting post-breakup grief narratives with family involvement and relational closure processes. Across the six clusters, internal cohesion ranged from 0.623 to 0.710, indicating robust semantic grouping, while intra-topic diversity ranged from 0.510 to 0.620, suggesting sufficient narrative variability to avoid redundancy.

The semantic interpretation of clusters was performed using DeepSeek V3.2 integrated via OpenRouter API, employing a zero-shot inference approach for label generation (McDaniel et al., 2024). In this configuration, the model processed the most representative documents from each grouping without pre-labeled reference examples, relying exclusively on a hierarchical system prompt that defined the domain context, output formatting rules, and semantic labeling criteria.

3.2. Answering the Research Questions

The following paragraphs provide descriptive answers to RQ1, RQ2, and RQ3 using only the empirical data reported in this section. No additional inferential tests or correlation analyses were performed, respecting the constraint of using solely the results generated by the protocol.

RQ1 asks what the latent thematic structures are in corpora of sensitive human narratives and how they are semantically characterized. BERTopic identified six main clusters (T0–T5) plus a semantic noise group (T-1). Table 2 documents the distribution and internal metrics of all six clusters: T0 (Post-Breakup Emotional Turmoil) dominates with 32.88% (n = 311, cohesion 0.672, diversity 0.551), T1 (Self-Sabotage and Avoidant Pattern) at 30.44% (n = 288, cohesion 0.670, diversity 0.552), T2 (Unreciprocated Investment and Abandonment) at 8.35% (n = 79, cohesion 0.623, diversity 0.620), T3 (Self-Loss and Unreadiness) at 3.81% (n = 36, cohesion 0.710, diversity 0.510), T4 (Self-Created Healing Resources) at 2.11% (n = 20, cohesion 0.690, diversity 0.552), and T5 (Abrupt Departure and Communication Deficit) at 1.90% (n = 18, cohesion 0.670, diversity 0.583). Table 3 presents the semantic labels assigned by DeepSeek V3.2 via zero-shot inference. Figure 1 provides the UMAP 2D visualization of embedding distribution, confirming separable cluster regions in latent space.

RQ2 asks what seed-document selection strategy optimizes thematic purity before semantic saturation. The data in Table 4 show that the Kneedle elbow method applied to the BERTopic probability matrix yields topic-specific saturation thresholds: T0 and T1 require 11 seeds each, T2 requires 27, and T3, T4, and T5 require 10, 10, and 17 seeds respectively. This heterogeneous pattern indicates that no universal seed count exists; rather, an adaptive, topic-dependent strategy is necessary. Topics with broader semantic dispersion (T2) demand larger seed pools to maintain purity, whereas more homogeneous topics (T3, T4) saturate earlier. Therefore, the optimal strategy is to apply an adaptive elbow detection per cluster rather than a fixed threshold across all topics.

RQ3 asks how semantic fidelity varies by thematic complexity. Table 7 documents that the fidelity rate ranges from 100% (T4, a small cluster with high cohesion, 0.690) to 83.08% (T5, the cluster with the most abrupt and heterogeneous narratives). The mean AI similarity scores follow the same gradient: T4 achieves 0.7855, whereas T5 drops to 0.6413. This pattern suggests a descriptive relationship: a cohesive cluster with low semantic dispersion (T4) yields synthetic outputs with higher fidelity, while a cluster characterized by abrupt, heterogeneous narratives (T5) presents greater replication difficulty. Thus, thematic complexity, as proxied by intra-topic diversity and narrative heterogeneity, appears inversely related to synthetic fidelity in this dataset.

3.3. Optimal Extraction of Seed Documents via the Elbow Method

The application of the elbow method (Kneedle) to the topic probability matrix enabled the determination of semantic saturation thresholds specific to each topic. The methodological objective was to identify how many real documents termed “seeds” maintain thematic purity before semantic similarity decays and noise is introduced into the generation process.

The analysis reveals that not all topics share the same semantic saturation threshold, justifying the use of an adaptive approach for selecting representative documents. Identifying the elbow point allows preservation of thematic purity and minimizes the introduction of noise in later modeling stages. To operationalize this criterion, the function get_optimal_full_seeds (Figure 2) was implemented, which aligns the BERTopic probability matrix with the original corpus, prioritizing documents with the highest thematic membership according to detected thresholds. This procedure guarantees the extraction of complete texts without truncation, capturing the full emotional arc and critical mental health mentions that would be lost in isolated fragments.
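The seed-selection idea can be illustrated with a short sketch. It is not the authors' `get_optimal_full_seeds` function: the names `knee_index` and `select_seeds` are hypothetical, and the knee is found with a simplified stand-in for Kneedle (the point farthest from the straight line joining the first and last values of the normalized, descending curve).

```python
def knee_index(curve):
    """Index of the knee of a descending curve (simplified, Kneedle-inspired)."""
    n = len(curve)
    lo, hi = (min(curve), max(curve)) if n else (0.0, 0.0)
    # Too short or completely flat: no knee to detect, keep everything.
    if n < 3 or hi == lo:
        return n - 1
    best, best_gap = 0, -1.0
    for i, value in enumerate(curve):
        x = i / (n - 1)                 # rank rescaled to [0, 1]
        y = (value - lo) / (hi - lo)    # value rescaled to [0, 1]
        gap = abs((1 - x) - y)          # distance from the chord y = 1 - x
        if gap > best_gap:
            best, best_gap = i, gap
    return best

def select_seeds(documents, probabilities):
    """Return the full texts ranked at or above the knee of topic probability."""
    ranked = sorted(zip(probabilities, documents), key=lambda pair: pair[0], reverse=True)
    cut = knee_index([p for p, _ in ranked])
    # Whole documents are returned, so no narrative is truncated.
    return [doc for _, doc in ranked[:cut + 1]]
```

For example, with membership probabilities of 0.99, 0.98, 0.97, 0.96 followed by a drop to 0.40 and below, the sketch keeps the first four documents as seeds.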

Additionally, two length metrics were calculated to characterize documents within each topic. The global mean represents the average word count of all documents assigned to the topic by BERTopic, reflecting the typical narrative length of the cluster. The seed mean corresponds to the average word count of documents selected as optimal via the elbow method, i.e., those with the highest thematic purity before the saturation point. As shown in Table 5, these metrics reveal heterogeneous patterns across topics: while T0 and T1 show more concise seeds than the average, T2 and T5 present longer seeds.

3.4. Computational Efficiency and Generative Orchestration

The use of multiple simultaneous documents in the prompt can cause attention dilution in the model, resulting in repetitive lexical patterns, artificial length reduction, and contextual drift. Therefore, a stochastic seed sampling strategy was implemented: in each iteration, three complete documents from the seed pool are randomly selected as a structural reference. This configuration reduces input context compared to traditional few-shot approaches and decreases generation time to 2.69–3.36 s per post, preserving complete narrative structure without truncation.

Additionally, a batch size of k = 5 posts per API call was employed to mitigate attention degradation, a phenomenon where language models progressively reduce quality in extended generations. The complete implementation of this scheme includes the synthetic generation functions (Figure 3) for individual batch production.
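The orchestration described in this subsection (three randomly sampled seeds per call, five posts per call) might be sketched as follows. The function `generate_topic` and the `call_llm` callable are hypothetical: `call_llm` stands in for whatever wrapper sends the prompt to the model and returns its raw text, and `max_calls` is an added safeguard that the source does not mention.

```python
import random

def generate_topic(seed_pool, target_posts, call_llm,
                   seeds_per_call=3, batch_size=5, max_calls=1000, rng=None):
    """Collect `target_posts` synthetic posts for one topic, batch by batch."""
    rng = rng or random.Random()
    posts = []
    for _ in range(max_calls):
        if len(posts) >= target_posts:
            break
        # Stochastic sampling: a fresh set of distinct seeds for every call.
        seeds = rng.sample(seed_pool, seeds_per_call)
        # `call_llm` is an assumed callable: (seeds, batch_size) -> raw text.
        raw = call_llm(seeds, batch_size)
        batch = [p.strip() for p in raw.split("|||") if p.strip()]
        # Never exceed the requested total, even if the last batch overshoots.
        posts.extend(batch[:target_posts - len(posts)])
    return posts
```

Passing the model call in as a function keeps the sampling logic testable without network access.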

The module produced 7200 posts (Table 6) distributed equally among the six topics (1200 per cluster), completing generation in approximately 53–67 min per topic, with a production rate of 2.69–3.36 s per post.
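As a quick consistency check, the reported totals follow from the per-post rates, assuming generation runs sequentially within each topic (the source implies this but does not state it):

```latex
\begin{align*}
6 \times 1200 &= 7200\ \text{posts} \\
1200 \times 2.69\ \text{s} &= 3228\ \text{s} = 53.8\ \text{min} \\
1200 \times 3.36\ \text{s} &= 4032\ \text{s} = 67.2\ \text{min}
\end{align*}
```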

3.5. Semantic Fidelity Validation via Cosine Similarity

Semantic fidelity evaluation was performed using cosine similarity between the embeddings of synthetic texts and the vector centroid calculated from the original Reddit documents for each topic. Embeddings were normalized using the L2 norm to eliminate length bias, ensuring that the metric reflects exclusively semantic alignment and not vector magnitude.

The acceptance criterion was established at the 10th percentile (P10) of the similarity distribution of the original corpus. This defines the minimum boundary, the value above which 90% of the original documents fall, so that every retained synthetic post lies within the similarity range occupied by the most central 90% of the reference documents, as detailed in Table 7.
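The validation logic can be sketched in plain Python. Several details are assumptions rather than facts from the source: the centroid is computed from the L2-normalized original embeddings, the percentile uses linear interpolation, and all function names are illustrative.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def fidelity_filter(real_vectors, synthetic_vectors, q=10):
    """Return (threshold, keep flags) for one topic."""
    real = [l2_normalize(v) for v in real_vectors]
    # Centroid of the original documents, one mean per embedding dimension.
    centroid = [sum(column) / len(real) for column in zip(*real)]
    # P10 of the original documents' own similarity to their centroid.
    threshold = percentile([cosine(v, centroid) for v in real], q)
    keep = [cosine(v, centroid) >= threshold for v in synthetic_vectors]
    return threshold, keep
```

The per-topic fidelity rate is then the share of `True` values in `keep`.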

Topic 4 achieved a 100% retention rate, while Topic 5 presented the lowest retention rate (83.08%), suggesting greater complexity in replicating its specific semantic patterns. At the global level, 99.08% of the generated posts met the semantic quality threshold, calculated as the proportion of approved posts relative to the total number of evaluated posts. Selecting the 10th percentile as the quality boundary prioritizes minimizing semantic noise over corpus volume, a recommended practice in synthetic data validation for social science research.

3.6. Human Validation Results

The protocol was executed by two researchers of this investigation, with a third senior researcher serving as arbitrator.

Five of six clusters achieved AC2 ≥ 0.70 on both dimensions; Topic 3 showed marginal adherence (AC2 = 0.660) while maintaining acceptable authenticity (AC2 = 0.817). The 1200 synthetic posts for Topic 5 failed human validation (AC2 < 0.50) due to documented LLM safety-filter limitations and are excluded from the final corpus.

4. Discussion

The results obtained in this study demonstrate the technical viability of an integral protocol for generating synthetic data applicable to sensitive research contexts. The combination of high-dimensional embeddings, non-linear dimensionality reduction via UMAP, and density-based clustering with HDBSCAN has enabled the identification of six main thematic clusters with internal cohesion levels between 0.623 and 0.710, values that evidence robust semantic grouping comparable to those reported in recent studies of neural topic modeling (Khodeir & Elghannam, 2024; Pattnayak et al., 2025).

Semantic fidelity validation via cosine similarity relative to the vector centroid of each topic yielded mean values between 0.6413 and 0.7855, with a global approval rate of 99.08% under the 10th percentile criterion. These results align with recent studies employing semantic similarity metrics to validate AI-generated data. Lenatti et al. (2023) validated synthetic health data using rule-based similarity metrics grounded in cosine similarity, establishing a threshold of 0.6 to consider rules between real and synthetic corpora as semantically equivalent. The values obtained in the present study exceed this baseline, supporting the adequacy of the P10 threshold as a quality gate.

Human validation across the six BERTopic clusters revealed asymmetric synthetic fidelity (Table 8). Topics 0, 1, and 4 achieved Gwet’s AC2 ≥ 0.70 on both adherence and authenticity, confirming that standard emotional breakup narratives, ranging from post-rupture turmoil to self-help resource adoption, replicate reliably. Topic 2 reached near-perfect adherence agreement (AC2 = 0.994), which we interpret as high internal cluster homogeneity rather than artificial uniformity; the model successfully captured the distinct semantic signature of broken-hope narratives. Topic 3 showed acceptable authenticity (AC2 = 0.817) but marginal adherence (AC2 = 0.660), indicating some semantic overlap with adjacent clusters (e.g., recovery-oriented posts bleeding into general emotional-turmoil narratives) that complicates clean thematic assignment.

Most critically, Topic 5 failed to reach the validation threshold on both dimensions (AC2 < 0.50). We attribute this to a documented limitation of current large language models: safety filters suppress explicit generation of content depicting control, abuse, or severe emotional manipulation, even when requested for research purposes. Consequently, the synthetic corpus for Topic 5 consists predominantly of generic exhaustion-and-departure posts lacking the specific semantic signature of toxic-relationship narratives identified in the original seed pool. This finding does not invalidate the general augmentation protocol but establishes a boundary condition: the method works robustly for standard emotional breakup narratives (T0–T4), while extreme negative valence and safety-sensitive content require either uncensored open-source models or larger seed pools with explicit traumatic detail. The 1200 posts generated for Topic 5 therefore remain in the raw output but are excluded from the validated release because of insufficient inter-coder agreement, yielding a final validated corpus of 6000 posts from five topics (T0–T4).

The application of the elbow method (Kneedle) for optimal extraction of seed documents constitutes a relevant methodological contribution. The identified saturation points, ranging between 10 and 27 documents depending on the topic, suggest that semantic purity can be preserved with samples significantly smaller than those conventionally used in data augmentation studies. This finding has substantial practical implications for researchers operating with limited corpora, as it demonstrates that high-quality synthetic datasets can be generated from reduced yet representative seed cores.

Beyond technical validation, these findings underscore the transformative potential of synthetic data in research domains constrained by privacy and ethical boundaries. By providing immediate access to datasets, synthetic generation enables researchers to work with study material without depending on prolonged institutional approval processes, facilitating efficient methodology refinement. Furthermore, synthetic data help overcome data scarcity and bias issues, facilitating information exchange between institutions while maintaining privacy safeguards (Adadi, 2021). The use of large language models to create privacy-preserving synthetic data opens new possibilities for exploring learning behaviors where real data collection is challenging (Leinonen et al., 2024). Ultimately, these datasets complement traditional methodologies such as meta-analysis, filling gaps left by real-world collection.

This study presents methodological and ethical limitations that contextualize its findings. Primarily, dependence on commercial APIs for generative orchestration conditions full reproducibility and economic sustainability, suggesting a future need to migrate toward open-source models that enable independent auditing of the protocol. Likewise, the absence of exhaustive evaluation of toxicity and demographic biases in the synthetic corpus constitutes a latent risk, given that language models can amplify stereotypes present in training data, which demands the integration of automated filtering mechanisms and human validation in subsequent iterations. Additionally, although semantic similarity metrics ensure structural fidelity, they may not fully capture cultural nuances or contextual adaptation that only expert review could discern, limiting ecological validity in diverse research contexts. Finally, the generalization of results is restricted by the specific nature of the original corpus of romantic breakup narratives from Reddit, which requires hyperparameter recalibration for application in domains with temporal or multimodal data structures.

Importantly, this study integrates both automated semantic validation via cosine similarity and expert human review. The human-in-the-loop stage executed by two researchers of this investigation addressed the limitations of automated-only metrics, providing inter-coder reliability data (Gwet’s AC2) and establishing empirical boundary conditions for sensitive topics (Appendix A). Future work may extend this validation to larger expert panels and cross-cultural samples.

5. Conclusions

This study has demonstrated the viability of an integral protocol for the analysis, thematic modeling, and synthetic generation of text narratives in sensitive research contexts. The proposed architecture, which integrates high-dimensional contextual embeddings, non-linear dimensionality reduction, density-based clustering, and generation orchestrated by large language models, enables the obtainment of high-quality synthetic datasets that preserve the statistical and semantic properties of the original corpus.

The results show that credible and diverse synthetic data can be generated without compromising participant privacy, provided that rigorous methodological and ethical safeguards are applied. The fidelity rate of 99.08%, the mean semantic similarity of 0.7204, and the privacy indicators (privacy-preserving generation that avoids verbatim reproduction of source text) place the protocol in line with international standards for responsible synthetic data generation in sensitive domains. These metrics, obtained on a corpus of sensitive public narratives (Reddit r/breakups), confirm the protocol’s capacity to handle emotionally complex data of the kind used in social and educational research.

The work offers a practical roadmap for researchers who face a scarcity of representative data due to ethical and privacy constraints. The modularity of the code and its exhaustive documentation make it easier to adapt the system to different contexts, languages, and objects of study, democratizing access to advanced natural language processing techniques for the research community. This approach puts technology at the service of researchers, empowering them to generate scientific evidence with autonomy and responsibility.

However, it is imperative to recognize that synthetic data constitute a complement, not a substitute, for primary corpora. The direct voice of participants remains irreplaceable for the deep understanding of social phenomena. The value of the proposed methodology resides in its capacity to expand the scope of research in areas where primary collection is impossible or ethically problematic, offering a pathway for researchers to generate rigorous scientific evidence without depending exclusively on extensive institutional authorizations.

Ultimately, this work contributes to a more responsible and sensitive data science, capable of working with intense human material without exploiting the people behind the texts.

Author Contributions

Conceptualization, R.S.-R.; methodology, R.S.-R.; software, R.S.-R.; validation, E.L.-M., M.-D.G.-Z. and M.B.M.C.; formal analysis, R.S.-R.; investigation, R.S.-R.; resources, M.B.M.C.; writing—original draft preparation, R.S.-R.; writing—review and editing, R.S.-R., E.L.-M., M.-D.G.-Z. and M.B.M.C.; visualization, R.S.-R.; supervision, R.S.-R. and E.L.-M.; project administration, R.S.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Universidad Tecnológica ECOTEC.

Data Availability Statement

The data supporting the conclusions of this article will be made available by the authors on request. These data were derived from the following resources available in the public domain: Reddit.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Human Evaluation Coding Guide

CODING GUIDE

Topic definitions (based on BERTopic cluster characteristics):

- Topic 0 (Post-Breakup Emotional Turmoil/Social Distance): intermittent contact, involvement of friends/family, weeks since separation
- Topic 1 (Self-Sabotage and Avoidant Pattern): how they met the ex, reproaches, “he/she told me that…”, early-relationship memories
- Topic 2 (Unreciprocated Investment and Abandonment): cycle of hope and disillusionment, damaged self-esteem, emotional pain, missing the partner
- Topic 3 (Self-Loss and Unreadiness): books, therapy, advice that “helped”, search for realistic hope
- Topic 4 (Self-Created Healing Resources): “breakup kit”, apps, healing journals, concrete healing tools
- Topic 5 (Abrupt Departure and Communication Deficit): narratives of escape, feeling of “prison”, emotional exhaustion, abuse

Rating dimensions:

(1) Topic adherence—Does the post fit the cluster it was assigned to? (1 = clearly off-topic, 5 = perfect fit)

(2) Emotional authenticity—Does it read as a genuine human breakup narrative? (1 = obviously synthetic/superficial, 5 = deeply authentic)

Criteria. Coders classify each post into one of the six topics defined in the guide (adherence), and rate its emotional authenticity on a 1–5 Likert scale. A score of 1 means the post clearly does not belong to the assigned topic or reads as artificial; 5 means full thematic fit and deeply human emotional expression.

Evaluator setup. Two researchers of this investigation conducted the manual expert review. Each has domain expertise in qualitative narrative analysis and natural language processing. They evaluated the same set of posts, with no access to the other reviewer’s ratings during the process. A third senior researcher adjudicated disagreements when reviewers diverged by more than one scale point on either dimension.

Sampling. Because the six synthetic topics are stored in separate files, coders receive one file per topic containing only the synthetic text and its assigned topic label. The sample size per topic is computed using the finite-population formula n = n₀/(1 + n₀/N), where n₀ = 384.16 (p = 0.5, E = 0.05, 95% confidence, z = 1.96) and N is the approved posts in that file. This yields 278–292 posts per file (1732 total), selected by simple random sampling without replacement.
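The per-file sample sizes can be reproduced with a short calculation. The following is a minimal sketch rather than the authors' code: it assumes that fractional sizes are rounded up to the next whole post and that N for each topic is the approved count listed in Table 7. The function and variable names are illustrative.

```python
import math

# Base sample size for p = 0.5, E = 0.05, z = 1.96, as stated in the guide.
N0 = 384.16


def finite_population_sample(population: int, n0: float = N0) -> int:
    """Apply the finite-population correction n = n0 / (1 + n0 / N).

    Rounding up to a whole post is an assumption of this sketch.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return math.ceil(n0 / (1 + n0 / population))


# Approved synthetic posts per topic, copied from Table 7.
approved = {"T0": 1194, "T1": 1189, "T2": 1193, "T3": 1161, "T4": 1200, "T5": 997}

sample_sizes = {topic: finite_population_sample(n) for topic, n in approved.items()}
total_sample = sum(sample_sizes.values())

if __name__ == "__main__":
    for topic, size in sample_sizes.items():
        print(topic, size)
    print("total", total_sample)
```

Under these assumptions the sizes fall between 278 (T5) and 292 (T4) and sum to 1732, which matches the figures reported in the sampling paragraph.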

The following two Likert-scale items constitute the complete rating instrument administered to human coders. No additional questions, demographic items, or open-ended prompts are used.

Item 1: Topic Adherence

Question: Does this post belong to the thematic cluster it was assigned to?

Response scale (1 to 5):

1 = No fit. The post clearly belongs to a different cluster or is off-topic.

2 = Poor fit. Weak connection to the assigned topic; likely misclassified.

3 = Moderate fit. Some connection but ambiguous; could belong to another cluster.

4 = Good fit. Mostly aligns with the assigned topic, though one element is underdeveloped.

5 = Perfect fit. The post clearly depicts the thematic content of the assigned cluster.

Item 2: Emotional Authenticity

Question: Does this post read as a genuine human breakup narrative?

Response scale (1 to 5):

1 = Obviously synthetic. Robotic, nonsensical, or completely generic.

2 = Inauthentic. Repetitive phrasing, odd word choices, lacks human nuance.

3 = Passable. Recognizable as a breakup post but feels flat or formulaic.

4 = Authentic. Convincing but slightly generic.

5 = Deeply authentic. Natural language, believable emotion, specific details.

References

Adadi, A. (2021). A survey on data-efficient algorithms in big data era. Journal of Big Data, 8(1), 1.
Braunack-Mayer, A., Carolan, L., Street, J., Ha, T., Fabrianesi, B., & Carter, S. (2023). Ethical issues in big data: A qualitative study comparing responses in the health and higher education sectors. PLoS ONE, 18(4), e0282285.
Chim, J., Ive, J., & Liakata, M. (2025). Evaluating synthetic data generation from user generated text. Computational Linguistics, 51(1), 191–233.
Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G., Xia, W., Hu, J., Luu, A. T., & Joty, S. (2024). Data augmentation using LLMs: Data perspectives, learning paradigms and challenges. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the association for computational linguistics: ACL 2024 (pp. 1679–1705). Association for Computational Linguistics.
Feng, Y., Li, L., Qin, X., & Zhang, B. (2025). Improving event representation learning via generating and utilizing synthetic data. Information Processing & Management, 62(4), 104083.
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv, arXiv:2203.05794.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48.
Khodeir, N., & Elghannam, F. (2024). Efficient topic identification for urgent MOOC forum posts using BERTopic and traditional topic modeling techniques. Education and Information Technologies, 30(5), 5501–5527.
Leinonen, J., Hellas, A., & Taubert, N. (2024). LLM-itation is the sincerest form of data. arXiv, arXiv:2411.10455.
Lenatti, M., Paglialonga, A., Orani, V., Ferretti, M., & Mongelli, M. (2023). Characterization of synthetic health data using rule-based artificial intelligence models. IEEE Journal of Biomedical and Health Informatics, 27(8), 3760–3769.
Levin, D., & Singer, G. (2024). GB-AFS: Graph-based automatic feature selection for multi-class classification via mean simplified silhouette. Journal of Big Data, 11, 79.
Liu, Q., Khalil, M., Jovanovic, J., & Shakya, R. (2024). Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics. In Proceedings of the 14th learning analytics and knowledge conference (LAK ’24) (pp. 620–631). ACM.
López-Pernas, S., Misiejuk, K., Kaliisa, R., & Saqr, M. (2025). Capturing the process of students’ AI interactions when creating and learning complex network structures. IEEE Transactions on Learning Technologies, 18, 556–568.
McDaniel, E. L., Scheele, S., & Liu, J. (2024). Zero-shot classification of crisis tweets using instruction-finetuned large language models. In 2024 IEEE international humanitarian technologies conference (IHTC) (pp. 1–7). IEEE.
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv, arXiv:1802.03426.
Miletić, M., & Sariyar, M. (2025). Utility-based analysis of statistical approaches and deep learning models for synthetic data generation. JMIR AI, 4, e65729.
Nadǎş, M., Dioşan, L., & Tomescu, A. (2025). Synthetic data generation using large language models: Advances in text and code. IEEE Access, 13, 134615–134633.
Pattnayak, P., Chowdhuri, S., Agarwal, A., & Patel, H. L. (2025). LLM-guided lifecycle-aware clustering of multi-turn customer support conversations. In K. Inui, S. Sakti, H. Wang, D. F. Wong, P. Bhattacharyya, B. Banerjee, A. Ekbal, T. Chakraborty, & D. P. Singh (Eds.), Proceedings of the 14th international joint conference on natural language processing and the 4th conference of the Asia-Pacific chapter of the association for computational linguistics (pp. 3180–3206). AFNLP & ACL.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3982–3992). Association for Computational Linguistics.
Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a haystack: Detecting knee points in system behavior. In Proceedings of the 2011 31st international conference on distributed computing systems workshops (pp. 166–171). IEEE.
Shujon, S. (2025). Reddit break-up stories dataset (2023–2025) [Data set]. Kaggle. Available online: https://www.kaggle.com/datasets/shakhoyatshujon/reddit-break-up-stories-dataset-20232025 (accessed on 28 February 2026).

Figure 1. UMAP 2D visualization of embedding distribution by topic.

Figure 2. Function for extracting optimal seed documents based on semantic saturation.

Figure 3. Functions for large-scale synthetic data augmentation.

Table 1. Hyperparameter configuration for BERTopic, UMAP, HDBSCAN, vectorizer, and LLM API.

Component	Parameter	Value
Embeddings	embedding_model	all-mpnet-base-v2
UMAP	n_neighbors	15
UMAP	n_components	5
UMAP	min_dist	0.0
UMAP	metric	cosine
UMAP	random_state	42
HDBSCAN	min_cluster_size	15
HDBSCAN	min_samples	10
HDBSCAN	metric	euclidean
HDBSCAN	cluster_selection_method	eom
HDBSCAN	prediction_data	True
VECTORIZER	min_df	2
VECTORIZER	max_df	0.95
VECTORIZER	ngram_range	(1, 2)
LLM API	model	DeepSeek V3.2
LLM API	temperature	0.85
LLM API	top_p	0.9
LLM API	max_tokens	3000
LLM API	include_reasoning	False
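As a reading aid, the values in Table 1 can be collected into keyword-argument dictionaries. The sketch below is an illustration, not the authors' code. It assumes the `bertopic`, `umap-learn`, `hdbscan`, `scikit-learn`, and `sentence-transformers` packages, and it assumes that the vectorizer is scikit-learn's `CountVectorizer`, since the table only names a vectorizer. `build_topic_model` is a hypothetical helper; its imports are deferred so that the dictionaries can be inspected without installing the libraries. The HDBSCAN metric name is written in lowercase, as the `hdbscan` package expects.

```python
# Values copied from Table 1; dictionary and function names are illustrative.
EMBEDDING_MODEL = "all-mpnet-base-v2"

UMAP_PARAMS = dict(n_neighbors=15, n_components=5, min_dist=0.0,
                   metric="cosine", random_state=42)

HDBSCAN_PARAMS = dict(min_cluster_size=15, min_samples=10, metric="euclidean",
                      cluster_selection_method="eom", prediction_data=True)

# Assumption: the "vectorizer" rows refer to scikit-learn's CountVectorizer.
VECTORIZER_PARAMS = dict(min_df=2, max_df=0.95, ngram_range=(1, 2))

# Sampling settings reported for the LLM API; how they are passed
# depends on the client library, which this section does not specify.
LLM_PARAMS = dict(temperature=0.85, top_p=0.9, max_tokens=3000)


def build_topic_model():
    """Assemble a BERTopic model from the dictionaries above (hypothetical helper)."""
    # Deferred imports: the third-party packages are needed only when the model is built.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(EMBEDDING_MODEL),
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
        vectorizer_model=CountVectorizer(**VECTORIZER_PARAMS),
    )
```

With the packages installed, `build_topic_model().fit_transform(docs)` on a list of post texts would return the topic assignments and their probabilities.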

Table 2. Distribution of thematic clusters and evaluation metrics of the original corpus.

Topic	Count	Name	Percentage	Cohesion	Diversity
0	311	contact_family_weeks_week	32.88%	0.672	0.551
1	288	saying_asked_boyfriend_met	30.44%	0.670	0.552
2	79	hope_heart_bad_let	8.35%	0.623	0.620
3	36	helped_come_real_hope	3.81%	0.710	0.510
4	20	kit_page_tracker_helped	2.11%	0.690	0.552
5	18	friend_finally_leave_tired	1.90%	0.670	0.583

Table 3. Thematic Labels of Clusters.

Topic	Labels
0	Post-Breakup Emotional Turmoil
1	Self-Sabotage and Avoidant Pattern
2	Unreciprocated Investment and Abandonment
3	Self-Loss and Unreadiness
4	Self-Created Healing Resources
5	Abrupt Departure and Communication Deficit

Table 4. Semantic saturation points identified by the elbow method.

Topic	Cut-Off Point (n)
T0	11
T1	11
T2	27
T3	10
T4	10
T5	17
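The idea behind a cut-off point can be illustrated with a simplified version of the Kneedle procedure. This sketch is not the authors' function (which appears only as Figure 2) and it omits the smoothing and sensitivity threshold of the full algorithm. It assumes a concave, increasing saturation curve over ranked documents with x values in ascending order. The curve below is invented for illustration, and all names are hypothetical.

```python
from typing import Sequence


def knee_index(x: Sequence[float], y: Sequence[float]) -> int:
    """Return the index of the knee of a concave, increasing curve.

    Simplified Kneedle idea: rescale both axes to [0, 1] and pick the point
    where the curve rises furthest above the diagonal y = x.
    Assumes x is sorted in ascending order.
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("x and y need the same length and at least 3 points")
    x_span = x[-1] - x[0]
    y_min = min(y)
    y_span = max(y) - y_min
    if x_span == 0 or y_span == 0:
        raise ValueError("curve is flat or x has no range")
    diffs = [(yi - y_min) / y_span - (xi - x[0]) / x_span for xi, yi in zip(x, y)]
    return max(range(len(diffs)), key=diffs.__getitem__)


# Toy saturation curve (invented): each added document contributes
# half as much new information as the previous one.
ranks = list(range(10))
coverage = [1 - 0.5 ** r for r in ranks]
cutoff = ranks[knee_index(ranks, coverage)]

if __name__ == "__main__":
    print("cut-off rank:", cutoff)
```

For real data, the `kneed` package offers a fuller implementation (`KneeLocator`) with smoothing and a sensitivity parameter.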

Table 5. Comparison of mean length: Seeds vs. Global Corpus.

Topic	Global Mean (Words)	Seed Mean (Words)
T0	335	157
T1	368	343
T2	245	352
T3	159	165
T4	88	74
T5	349	366

Table 6. Performance Metrics by Topic.

Topic	Posts Generated	Total Time	Rate (sec/Post)	Batch Size
T0	1200	53:47	2.69	5
T1	1200	1:04:01	3.20	5
T2	1200	1:07:29	3.36	5
T3	1200	1:01:31	3.08	5
T4	1200	56:07	2.80	5
T5	1200	56:15	2.81	5
TOTAL	7200	~6.5 h	2.99 (mean)	5

Table 7. Semantic Fidelity Validation Results by Topic.

Topic	Posts Evaluated	P10 Threshold	Posts Approved	Mean AI Similarity	Fidelity Rate (%)
T0	1200	0.5337	1194	0.6987	99.50%
T1	1200	0.5376	1189	0.7506	99.08%
T2	1200	0.4905	1193	0.7244	99.42%
T3	1200	0.6149	1161	0.7220	96.75%
T4	1200	0.5067	1200	0.7855	100.00%
T5	1200	0.5774	997	0.6413	83.08%
TOTAL	7200	—	7134	—	99.08%
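To make the columns of Table 7 concrete, the sketch below shows one way such a check could be computed. It is an illustration under stated assumptions, not the authors' implementation. It assumes that each post receives a cosine similarity score against a topic reference vector, that the P10 threshold is the 10th percentile of the scores of the real posts, that a synthetic post is approved when its score reaches that threshold, and that the mean is taken over all evaluated synthetic posts. All names and toy numbers are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = math.floor(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def fidelity_report(real_scores: Sequence[float],
                    synthetic_scores: Sequence[float]) -> Dict[str, float]:
    """Summarize approval against a P10 threshold (assumed definition)."""
    threshold = percentile(real_scores, 10)
    approved = sum(1 for s in synthetic_scores if s >= threshold)
    return {
        "p10_threshold": threshold,
        "approved": approved,
        "mean_similarity": sum(synthetic_scores) / len(synthetic_scores),
        "fidelity_rate": 100 * approved / len(synthetic_scores),
    }


# Toy scores, invented for illustration only.
real = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]
synthetic = [0.40, 0.60, 0.70, 0.80]
report = fidelity_report(real, synthetic)

if __name__ == "__main__":
    print(report)
```

In practice the scores would come from applying `cosine` to sentence embeddings of each post and of the chosen reference, for example a topic centroid.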

Table 8. Gwet’s AC2 adherence and authenticity metrics by topic and validation status.

Topic	Posts Evaluated	AC2 Adherence	AC2 Authenticity	Status
T0	291	0.759	0.724	Validated
T1	291	0.704	0.892	Validated
T2	291	0.994	0.900	Validated
T3	291	0.660	0.817	Marginal
T4	291	0.947	0.838	Validated
T5	291	0.447	0.216	Not validated
Weighted mean	1732	0.754	7134
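For readers who want to recompute agreement coefficients of the kind reported in Table 8, the following sketch implements Gwet's AC2 for two raters on a 1 to 5 scale. The weighting scheme is not stated in this section, so linear weights are assumed here; with identity weights the same formula reduces to Gwet's AC1. The sketch also flags posts whose ratings differ by more than one scale point, which is the adjudication rule described in Appendix A. Function names and the toy ratings are illustrative.

```python
from typing import List, Sequence


def gwet_ac2(rater_a: Sequence[int], rater_b: Sequence[int], categories: int = 5) -> float:
    """Gwet's AC2 for two raters scoring on the ordinal scale 1..categories.

    Linear weights w(k, l) = 1 - |k - l| / (categories - 1) are an
    assumption of this sketch.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of posts")
    q, n = categories, len(rater_a)

    def weight(k: int, l: int) -> float:
        return 1 - abs(k - l) / (q - 1)

    # Weighted observed agreement.
    p_a = sum(weight(a, b) for a, b in zip(rater_a, rater_b)) / n

    # pi_k: average share of category k across the two raters.
    a_list, b_list = list(rater_a), list(rater_b)
    pi = [(a_list.count(k) + b_list.count(k)) / (2 * n) for k in range(1, q + 1)]

    # Chance agreement: T_w / (q (q - 1)) * sum_k pi_k (1 - pi_k),
    # where T_w is the sum of all weights.
    t_w = sum(weight(k, l) for k in range(1, q + 1) for l in range(1, q + 1))
    p_e = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)

    return (p_a - p_e) / (1 - p_e)


def needs_adjudication(rater_a: Sequence[int], rater_b: Sequence[int]) -> List[int]:
    """Indices of posts whose two ratings differ by more than one scale point."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if abs(a - b) > 1]


# Toy ratings, invented for illustration only.
coder_1 = [5, 4, 3, 5]
coder_2 = [5, 5, 3, 4]
ac2 = gwet_ac2(coder_1, coder_2)

if __name__ == "__main__":
    print(round(ac2, 4))
    print(needs_adjudication([5, 2, 3], [5, 4, 3]))
```

Applying `gwet_ac2` separately to the adherence and the authenticity ratings of each topic file would give one pair of coefficients per topic.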

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sosa-Ramírez, R.; López-Meneses, E.; González-Zamar, M.-D.; Cevallos, M.B.M. Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains. Educ. Sci. 2026, 16, 885. https://doi.org/10.3390/educsci16060885

