Evolutionary Multi-Objective Prompt Learning for Synthetic Text Data Generation with Black-Box Large Language Models

Pastrián, Diego; Hidalgo, Nicolás; Reyes, Víctor; Rosas, Erika

doi:10.3390/app16083623

¹ Escuela de Informática y Telecomunicaciones, Facultad de Ingeniería y Ciencias, Universidad Diego Portales, Santiago 8370191, Chile

² Department of Computer Engineering (DISCA), Universidad Politécnica de Valencia, 46022 Valencia, Spain

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3623; https://doi.org/10.3390/app16083623

Submission received: 11 March 2026 / Revised: 31 March 2026 / Accepted: 2 April 2026 / Published: 8 April 2026

(This article belongs to the Special Issue Resource Management for AI-Centric Computing Systems)

Abstract

High-quality training data are essential for the performance and generalization of artificial intelligence systems, particularly in dynamic environments such as adaptive stream processing for disaster response. However, constructing large and representative datasets remains costly and time-consuming, especially in domains where real data are scarce or difficult to obtain. Large Language Models (LLMs) provide powerful capabilities for synthetic text generation, yet the quality of generated data strongly depends on the design of input prompts. Prompt engineering is therefore critical, but it remains largely manual and difficult to scale, particularly in black-box settings where model internals are inaccessible. This work introduces EVOLMD-MO, a multi-objective evolutionary framework for automated prompt learning aimed at generating high-quality synthetic text datasets using black-box LLMs. The proposed approach formulates prompt optimization as a multi-objective search problem in which candidate prompts evolve through genetic operators guided by two complementary objectives: semantic fidelity to reference data and generative diversity of the produced samples. To support scalable optimization, the framework integrates a modular multi-agent architecture that decouples prompt evolution, LLM interaction, and evaluation mechanisms. The evolutionary process is implemented using the NSGA-II algorithm, enabling the discovery of diverse Pareto-optimal prompts that balance semantic preservation and diversity. Experimental evaluation using large-scale disaster-related social media data demonstrates that the proposed approach consistently improves prompt quality across generations while maintaining a stable trade-off between fidelity and diversity. Compared with a single-objective baseline, EVOLMD-MO explores a significantly broader semantic search space and produces more diverse yet semantically coherent synthetic datasets. 
These results indicate that multi-objective evolutionary prompt learning constitutes a promising strategy for black-box LLM-driven data generation, with potential applicability to adaptive data analytics and real-time decision-support systems in highly dynamic environments, pending broader validation across domains and models.

Keywords:

data generation; LLMs; data diversification; genetic algorithm

1. Introduction

The continuous growth of data-intensive applications has intensified the demand for large, diverse, and high-quality datasets to support the training of machine learning models. This challenge is particularly evident in real-time and dynamic environments, such as stream processing systems (SPS), where models must operate under strict quality-of-service (QoS) constraints while adapting to rapidly changing data distributions [1,2]. In such contexts, the availability of representative training data directly impacts the robustness, generalization capacity, and operational reliability of AI-driven adaptive systems [3,4]. Traditionally, the construction of training datasets relies on manual data collection, annotation, and curation processes, which are often expensive, time-consuming, and prone to bias. These limitations are especially problematic in domains characterized by rare or extreme events, such as natural disasters, where historical data are scarce, heterogeneous, and difficult to obtain at scale. As a result, synthetic data generation has emerged as a promising alternative to augment real datasets and mitigate data scarcity issues. Recent advances in Large Language Models (LLMs) have introduced a new paradigm for synthetic text data generation. LLMs are capable of producing coherent and contextually rich textual content by leveraging knowledge encoded in large-scale pretraining corpora. However, despite their generative power, the quality of LLM-generated data is highly sensitive to the design of input prompts. Crafting effective prompts remains a non-trivial task that typically requires domain expertise and extensive trial-and-error experimentation, making the process difficult to scale and reproduce. Prompt engineering, defined as the process of designing input instructions to steer LLM behavior, has therefore become a central research topic. 
While several automated prompt optimization techniques have been proposed, many rely on gradient-based methods or assume white-box access to model parameters. These assumptions are often unrealistic in practice, particularly when dealing with proprietary or API-based LLM services, where only input–output interactions are available. Moreover, existing black-box prompt optimization frameworks focus primarily on single-objective fidelity metrics, neglecting the fundamental trade-off between semantic preservation and generative diversity required for effective synthetic data augmentation. Consequently, there is a strong need for black-box prompt optimization strategies that are model-agnostic, scalable, and capable of discovering high-quality prompts without relying on internal gradients or fine-tuning.

In this work, we address the problem of automatic synthetic text data generation using LLMs by formulating prompt learning as a multi-objective evolutionary optimization problem. We propose EVOLMD-MO, an evolutionary framework that leverages a multi-objective Genetic Algorithm (NSGA-II) to evolve populations of prompts under a black-box setting. Each prompt is evaluated on two objectives: semantic fidelity with respect to a reference dataset and generative diversity of the produced outputs. To support modularity and scalability, the framework adopts a multi-agent architecture that decouples prompt representation, LLM interaction, and evolutionary operators. Our approach is motivated by two key observations. First, evolutionary algorithms are naturally suited for black-box optimization in large, discrete, and non-differentiable search spaces, such as the space of natural language prompts. Second, maintaining a population of candidate prompts enables exploration of diverse generation strategies, reducing the risk of premature convergence and mode collapse in synthetic data generation. The main contributions of this work are threefold:

We introduce a multi-objective evolutionary prompt learning framework for black-box LLM-driven data generation.
We propose a structured prompt representation and semantic mutation mechanism that preserve linguistic coherence during evolution.
We empirically demonstrate the effectiveness of the approach on large-scale disaster-related social media data, highlighting its ability to balance semantic preservation and generative diversity.

Overall, this work positions evolutionary prompt learning as a principled and practical solution for automated synthetic data generation, with direct relevance to adaptive stream processing systems and data-driven decision support in dynamic environments.

The remainder of this paper is organized as follows: Section 2 reviews the related work on prompt optimization, black-box evolutionary approaches, multi-objective optimization in NLP, and evaluation metrics for synthetic data quality. Section 3 describes the preliminary EVOLMD framework upon which the proposed system is built. Section 4 presents the full architecture of EVOLMD-MO, including the LLM Interaction Layer, the NSGA-II Engine, the post-processing funnel, and the implementation details. Section 5 reports the experimental evaluation, covering fidelity metric validation, diversity metric assessment, search space visualization, Pareto front analysis, metric evolution across generations, and runtime analysis. Section 6 discusses the key findings, including computational efficiency, evolutionary dynamics, decision-making strategies, qualitative validation, and ethical considerations. Finally, Section 7 concludes the paper and outlines directions for future work.

2. Related Work

The transition from manual prompt engineering to automated optimization has shifted the focus of Generative AI research. While early approaches relied on gradient access [5,6], the proprietary nature of modern Large Language Models (LLMs) has necessitated the development of black-box strategies. This section reviews the evolution of these methods, the emerging crisis of “Mode Collapse”, and the necessary shift towards Multi-Objective Optimization and advanced evaluation metrics.

2.1. Single-Objective Evolutionary Baselines

To address the black-box constraint, the research community adopted metaheuristics. Pioneering frameworks such as PromptBreeder [7], EvoPrompt [8], and our own previous baseline [9] established the feasibility of evolutionary prompting. These systems use Evolutionary Algorithms (EAs) to evolve discrete prompts by treating the LLM as a mutation operator.

While effective for maximizing a specific metric (e.g., accuracy or fidelity), these methods function fundamentally as single-objective optimizers. They direct the entire evolutionary pressure towards a scalar fitness value. This limitation makes them the direct precursors to our work, highlighting the need for a multi-objective approach to overcome the diversity bottlenecks described below.

2.2. Black-Box Prompt Optimization and Mode Collapse

With gradients unavailable, recent research has turned to Derivative-Free Optimization (DFO). The state-of-the-art ZO-PoG (Zeroth-Order Policy Gradient) [10] demonstrates that hybrid strategies combining discrete token search with continuous zeroth-order estimation can achieve convergence without backpropagation.

However, a critical limitation of these single-objective optimizers is their susceptibility to Typicality Bias. As identified in Verbalized Sampling [11], models optimized for a scalar reward (e.g., average likelihood) tend to converge towards the mode of the distribution, suppressing rare or creative outputs. In the context of synthetic data generation, this leads to Mode Collapse, where the generated dataset lacks the variance required for robust downstream training.

2.3. Multi-Objective Optimization in NLP

To mitigate collapse, optimization must balance competing goals (e.g., Fidelity vs. Diversity). Traditional methods attempt to solve this via scalarization (weighted sums: $w_{1} \cdot F_{1} + w_{2} \cdot F_{2}$). However, recent theoretical findings from MetaAligner [12] and Pareto Prompt Optimization (PPO) [13] argue that scalarization is mathematically flawed for LLMs. Because the semantic manifold of language is non-convex, linear weighting schemes fail to discover concave regions of the Pareto front, resulting in an “Alignment Tax” where improving one objective disproportionately degrades the other.
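The limitation of weighted sums can be seen in a toy case. The sketch below uses three made-up candidates (the scores are hypothetical, not taken from the experiments) that are all mutually non-dominated, and sweeps the weight $w_{1}$ with $w_{2} = 1 - w_{1}$. The helper names `dominates` and `weighted_sum_winner` are introduced here only for illustration.

```python
# Toy illustration with hypothetical scores (F1, F2); both objectives are maximized.
candidates = {
    "high_fidelity": (1.00, 0.00),
    "balanced": (0.45, 0.45),  # sits in a concave region of the front
    "high_diversity": (0.00, 1.00),
}


def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def weighted_sum_winner(points, w1):
    """Name of the candidate that maximizes w1 * F1 + (1 - w1) * F2."""
    return max(points, key=lambda name: w1 * points[name][0] + (1 - w1) * points[name][1])


# Pareto view: keep every candidate that no other candidate dominates.
non_dominated = [
    name
    for name, point in candidates.items()
    if not any(dominates(other, point) for key, other in candidates.items() if key != name)
]

# Scalarized view: sweep w1 over 0.00, 0.01, ..., 1.00 and record each winner.
winners = {weighted_sum_winner(candidates, w / 100) for w in range(101)}

print("non-dominated:", non_dominated)
print("weighted-sum winners:", sorted(winners))
```

All three candidates survive the Pareto test, but no weight ever selects `balanced`, because its weighted score stays near 0.45 while one of the extremes always reaches at least 0.5.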

Consequently, the field is moving towards explicit Multi-Objective Evolutionary Algorithms (MOEAs). Recent studies have validated this efficacy: MOPO [14] demonstrated that NSGA-II could optimize prompts to balance coherence versus toxicity, while EMO-Prompts [15] applied evolutionary algorithms to balance conflicting sentiment requirements. Similarly, in sentiment analysis feature selection, NSGA-II has been used to effectively balance accuracy against feature reduction [16].

From a broader optimization perspective, Liu et al. [17] proposed a collaborative neurodynamic approach to multi-objective optimization that decomposes problems into scalar subproblems solved cooperatively by a population of recurrent neural networks, offering guaranteed convergence to Pareto-optimal solutions in continuous spaces. While this neurodynamic paradigm provides strong theoretical convergence guarantees, evolutionary algorithms such as NSGA-II remain the preferred approach for discrete, non-differentiable search spaces, such as the space of natural language prompts, where gradient information is unavailable and the fitness landscape is highly irregular.

Despite these advancements, a gap remains: existing frameworks focus largely on style transfer or safety (Red Teaming), lacking a dedicated focus on the specific trade-off between Semantic Fidelity and Generative Diversity for synthetic data augmentation.

2.4. Metrics for Fidelity and Diversity

Evaluating the quality of synthetic data requires metrics beyond simple lexical overlap. Traditional n-gram-based metrics often fail to capture the semantic nuances and distributional properties of the generated text. Therefore, we focus on two critical dimensions: Fidelity, which measures how accurately the synthetic data preserves the semantics of the original distribution, and Diversity, which assesses the variety and coverage of the generated samples to ensure the model is not merely memorizing or repeating patterns.

2.4.1. Semantic Fidelity

Traditional n-gram metrics (e.g., BLEU, ROUGE) correlate poorly with semantic consistency in open-ended generation. While BERTScore [18] utilizes contextual embeddings for high granularity, it incurs significant computational costs ($O (N^{2})$). For iterative evolutionary loops where speed is critical, Sentence-BERT (SBERT) [19] has become the preferred standard due to its efficiency in computing cosine similarity over dense vector representations.

2.4.2. Quantifying Diversity

Evaluating diversity is complex and has evolved from lexical to geometric approaches:

Lexical Diversity: Metrics like Distinct-n [20] and Self-BLEU [21] measure uniqueness at the token level. While useful for detecting exact repetition, they fail to capture semantic redundancy (paraphrases).
Content Diversity: To ensure variation in the generated subjects, metrics like Entity Entropy apply information theory principles [22] to calculate the distribution of Named Entities (persons, locations, organizations). High entropy indicates that the model is generating diverse scenarios rather than recycling generic placeholders or memorized entities.
Spectral and Geometric Diversity: The current theoretical gold standard is the Vendi Score [23], which calculates the spectral entropy of the similarity matrix to measure effective sample size. Other approaches, like Metric Space Magnitude [24], use algebraic topology to detect “holes” in data coverage.
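As a rough illustration of the first two families, the sketch below computes Distinct-n over whitespace tokens and the Shannon entropy of a list of entity labels. It assumes the named entities have already been extracted by some NER tool; the function names and the example strings are hypothetical.

```python
import math
from collections import Counter


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over whitespace-tokenized texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two identical sentences: every bigram appears twice, so Distinct-2 is 0.5.
repeated = ["flood hits the city", "flood hits the city"]
print(distinct_n(repeated, n=2))

# Entities assumed to come from an external NER step (hypothetical values).
entities = ["Santiago", "Valparaiso", "Santiago", "Valparaiso"]
print(shannon_entropy(entities))
```

Note that Distinct-n would score two paraphrases of the same event as fully distinct, which is the semantic-redundancy blind spot mentioned above.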

While spectral metrics like Vendi offer theoretical robustness, their cubic computational cost ($O (N^{3})$) renders them impractical for the rapid iterative cycles of evolutionary algorithms. This necessitates efficient geometric proxies capable of estimating latent manifold coverage with significantly lower complexity.

Two such proxies have emerged as effective alternatives. Global Pairwise Dissimilarity measures the average distance between all generated samples, providing a direct estimate of the population’s dispersion in the embedding space. Although technically quadratic ($O (N^{2})$), it operates on simple vector embeddings rather than complex matrix decompositions, making it orders of magnitude faster than spectral methods. Complementarily, as used in recent benchmarks [25], K-Means Inertia quantifies the tightness of data clusters; high inertia indicates that the generated samples are well-distributed across the semantic space rather than collapsed into a single high-density region. Together, these metrics offer a computationally viable approach to monitor and penalize Mode Collapse during real-time training.
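Both proxies are simple enough to sketch in plain Python. The example below assumes embeddings are already available as numeric tuples and that cluster centroids are supplied by the caller (in practice they would come from a K-Means fit); the 2-D vectors are hypothetical toy values, not real sentence embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def global_pairwise_dissimilarity(embeddings):
    """Mean cosine distance (1 - cosine similarity) over all unordered pairs."""
    pairs = list(combinations(embeddings, 2))
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def inertia(embeddings, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    return sum(
        min(sum((a - c) ** 2 for a, c in zip(point, centroid)) for centroid in centroids)
        for point in embeddings
    )


collapsed = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]  # every sample identical
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]    # samples pointing in different directions

print(global_pairwise_dissimilarity(collapsed))
print(global_pairwise_dissimilarity(spread))
print(inertia(collapsed, [(1.0, 0.0)]), inertia(spread, [(1.0, 0.0)]))
```

The collapsed set scores zero on both measures, while the spread set scores higher on both, which is the signal used to detect Mode Collapse.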

2.5. Decision Making in Pareto Fronts

MOEAs produce a set of non-dominated solutions (Pareto Front) rather than a single result. Selecting the optimal subset requires Multi-Criteria Decision Making (MCDM) techniques.

Standard approaches include TOPSIS [26], which ranks solutions based on their geometric distance to an ideal solution. To determine objective weights without human bias, data-driven methods like Entropy weighting [27] are preferred. Furthermore, to select a portfolio of K solutions that are not redundant among themselves, Maximal Marginal Relevance (MMR) [28] is widely used to re-rank candidates by balancing relevance (score) with novelty (dissimilarity to selected items). Our work integrates these MCDM strategies into a unified post-processing funnel.
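A minimal sketch of entropy-weighted TOPSIS is given below, assuming all criteria are positive benefit criteria (larger is better) and that at least two alternatives are ranked. It follows the textbook formulation with vector normalization; the example front is hypothetical, and the function names are ours rather than part of any library.

```python
import math


def entropy_weights(matrix):
    """Entropy-method weights: criteria whose values vary more receive larger weights."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    divergence = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n_rows)
        divergence.append(1.0 - entropy)
    norm = sum(divergence)
    if norm == 0:  # all criteria perfectly uniform: fall back to equal weights
        return [1.0 / n_cols] * n_cols
    return [d / norm for d in divergence]


def topsis(matrix, weights):
    """Closeness of each row to the ideal solution, in [0, 1] (higher is better)."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(n_cols)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(n_cols)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti_ideal)
        scores.append(d_neg / (d_pos + d_neg) if d_pos + d_neg > 0 else 0.0)
    return scores


# Hypothetical (fidelity, diversity) values for three non-dominated solutions.
front = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9)]
weights = entropy_weights(front)
scores = topsis(front, weights)
best = max(range(len(front)), key=lambda i: scores[i])
print(weights, scores, best)
```

In this symmetric example both criteria receive equal weight and the middle solution obtains the highest closeness score, which matches the intuition of a geometric compromise.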

3. Preliminaries: The EVOLMD Framework

Our work builds directly upon EVOLMD [9], an evolutionary framework designed for automated prompt optimization in black-box LLM settings. Since our proposed multi-objective approach (EVOLMD-MO) extends this architecture, we briefly describe its core mechanisms to establish the necessary context.

3.1. Prompt Representation and Agents

EVOLMD treats prompt engineering as a search problem guided by a Genetic Algorithm (GA). Unlike continuous prompt tuning methods that require gradient access, EVOLMD maintains the LLM as a frozen black box. The system employs a multi-agent architecture in which distinct agents handle initialization, data generation, and mutation.

It is important to note that prompts are not treated as unstructured strings but are defined as structured individuals $P_{i} = (R_{k}, T_{m}, A_{j})$, consisting of three semantic components:

Role ( $R_{k}$ ): Defines the persona or perspective (e.g., “As a medical professional”).
Topic ( $T_{m}$ ): Specifies the central subject matter derived from reference data.
Action ( $A_{j}$ ): Dictates the communicative intent.

To illustrate this encoding concretely, consider a reference text describing a hospital overwhelmed during a pandemic. The Init Agent would decompose this input into the following structured components: Role $R_{k}$ = ‘As a frontline healthcare worker’; Topic $T_{m}$ = ‘hospital overcrowding during a public health emergency’; Action $A_{j}$ = ‘describe the current situation and its impact on patients’. The assembled prompt passed to the Data Agent would then read: ‘As a frontline healthcare worker, describe the current situation and its impact on patients in the context of hospital overcrowding during a public health emergency.’ The Mutation Agent operates on individual components: for example, it may replace the Role component with a semantically related alternative such as ‘As a hospital administrator’, preserving the Topic and Action components unchanged. Following mutation, the Regeneration Step uses the LLM to refine the syntax of the modified prompt, ensuring grammatical validity before the individual is re-evaluated.
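The encoding in this example can be sketched as a small immutable record. The class name `Prompt`, its field names, and the `assemble` template are assumptions made for illustration; they reproduce the assembled sentence quoted above but are not claimed to match the authors' implementation. The LLM-driven steps (token suggestion and regeneration) are outside the scope of the sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prompt:
    """Structured individual P_i = (R_k, T_m, A_j)."""
    role: str    # R_k: persona or perspective
    topic: str   # T_m: subject matter derived from the reference data
    action: str  # A_j: communicative intent

    def assemble(self) -> str:
        """Join the three components into the text sent to the Data Agent."""
        return f"{self.role}, {self.action} in the context of {self.topic}."


seed = Prompt(
    role="As a frontline healthcare worker",
    topic="hospital overcrowding during a public health emergency",
    action="describe the current situation and its impact on patients",
)

# Component-level mutation: only the Role changes; Topic and Action are preserved.
mutated = replace(seed, role="As a hospital administrator")

print(seed.assemble())
print(mutated.assemble())
```

Keeping the record immutable means each mutation yields a new individual, so parents remain available for selection.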

3.2. Single-Objective Optimization

The original framework employs a single-objective optimization strategy focused primarily on semantic fidelity. The fitness function measures the similarity between the synthetic text generated by the LLM, $G (P_{i})$, and a randomly selected snippet from the real reference dataset, $X_{s}$. This similarity is quantified using BERTScore [18] to capture semantic alignment:

$\mathrm{Fitness} (P_{i}) = \mathrm{BERTScore} (G (P_{i}), X_{s})$

(1)

To evolve the population, EVOLMD utilizes tournament selection and standard crossover. A distinguishing feature is its Mutation Agent, which uses the LLM itself to generate semantically similar token replacements. Following the mutation, a regeneration step is applied, where the LLM refines the syntax of the modified prompts to ensure that they remain linguistically valid before evaluation.

Despite these advances, existing approaches remain limited in two key aspects: (i) they optimize scalarized objectives that fail to capture the non-convex structure of semantic trade-offs, and (ii) they lack explicit mechanisms to prevent diversity collapse in large-scale synthetic data generation scenarios.

4. Proposed Architecture

This section introduces EVOLMD-MO, a multi-objective evolutionary framework re-engineered to overcome the limitations of single-objective text generation. The baseline system introduced in [9], whose agent-based architecture we build upon, employed a Genetic Algorithm (GA) aimed exclusively at maximizing semantic similarity via BERTScore. Although effective for generating coherent text, this single-objective focus created a critical bottleneck regarding diversity: the system tended towards mode collapse, producing repetitive variations or near-exact copies of the reference text. This lack of variability severely restricts the utility of synthetic data for downstream tasks such as Data Augmentation, where diversity is as critical as accuracy.

To address this, we propose EVOLMD-MO, a new model that expands the architectural scope beyond simple fidelity optimization. The proposed system shifts to a multi-objective paradigm that integrates a new diversity-oriented objective, a mechanism to actively broaden the semantic search space, and a post-processing module aimed at refining the final selection of candidate solutions. Crucially, to support this increased algorithmic complexity without sacrificing scalability, we replaced BERTScore with SBERT for fidelity assessment. This change significantly reduces inference overhead, enabling a computationally efficient balance between semantic fidelity and generative diversity within a feasible execution time.

The EVOLMD-MO architecture is organized into three distinct layers: the LLM Interaction Layer, the EVOLMD-MO Engine, and the Post-Processing Funnel. Our proposal is presented in Figure 1.

4.1. LLM Interaction Layer

Following a black-box approach, this layer abstracts the interaction with the Large Language Model. It employs a multi-agent scheme inherited from the baseline, consisting of specialized agents:

Init Agent: Responsible for creating the initial seed prompts based on the reference dataset.
Data Agent: Handles the prompt construction and API communication to generate synthetic text.
Mutation Agent: Supports the evolutionary operators by suggesting semantically similar tokens to introduce variability.

The framework uses specialized system and user prompts to guide the LLM through these stages. The templates for the primary agents are detailed in Table 1.

This component decoupling ensures that the optimization logic remains agnostic to the specific LLM backend used.

This architectural decoupling has an important practical implication: the framework is theoretically compatible with any LLM accessible via an input–output API, regardless of its internal architecture or provider. Different backends (e.g., GPT-4, Claude, Mistral) would affect the quality and style of generated mutations, but the evolutionary selection pressure applied by NSGA-II would remain functionally equivalent, as it operates exclusively on the semantic embeddings of the generated outputs. Stronger generative models would be expected to produce higher-quality mutations, potentially accelerating convergence, while weaker models may require more generations to achieve comparable Pareto fronts. Empirical validation across multiple LLM providers is identified as an important direction for future work to formally establish the model-agnosticism of the framework.

4.2. EVOLMD-MO Engine (NSGA-II)

EVOLMD-MO proposes a multi-objective text-generation approach based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [29]. In this new model, each individual $P_{i}$ is evaluated against two conflicting objectives:

Objective 1: Maximizing Fidelity ($F_{1}$). We employ SBERT (specifically the all-MiniLM-L6-v2 model) to embed the generated text $G_{i} = G (P_{i})$ and the reference text $R$ taken from the reference data $D_{ref}$. $F_{1}$ is defined as the cosine similarity between the two embeddings, $E (G_{i})$ and $E (R)$, which ensures semantic coherence:

$F_{1} (i) = \frac{E (G_{i}) \cdot E (R)}{\lVert E (G_{i}) \rVert \, \lVert E (R) \rVert}$

(2)

This transition from the baseline’s BERTScore to SBERT is driven by computational efficiency. Since the multi-objective framework introduces additional algorithmic complexity (specifically for diversity assessment), it is critical to minimize the cost of fidelity evaluation. By reducing the inference overhead for semantic similarity, we offset the computational load of the new components, preventing prohibitive execution times in a system already constrained by the latency bottleneck of LLM queries.
Objective 2: Maximizing Diversity ($F_{2}$). To force the exploration of the search space, we introduce a Global Dissimilarity objective. Unlike local distance metrics, $F_{2}$ is the average semantic distance (1 − cosine similarity) between the output of individual $i$ and the outputs of all other individuals $j$ in the current population $P$:

$F_{2} (i) = 1 - \left( \frac{1}{| P | - 1} \sum_{j \in P, j \neq i} \frac{E (G_{i}) \cdot E (G_{j})}{\lVert E (G_{i}) \rVert \, \lVert E (G_{j}) \rVert} \right)$

(3)

This creates global selective pressure against redundancy, penalizing any prompt whose output is, on average, close to the rest of the population.
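Equations (2) and (3) can be sketched as follows, assuming the embeddings $E (\cdot)$ have already been computed (for instance by an SBERT encoder) and are passed in as plain numeric tuples. The 2-D vectors are hypothetical toy values chosen so the results can be checked by hand; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def fidelity(generated, reference):
    """F1 (Eq. 2): cosine similarity between E(G_i) and E(R)."""
    return cosine_similarity(generated, reference)


def global_dissimilarity(index, population):
    """F2 (Eq. 3): one minus the mean similarity to every other individual."""
    others = [e for j, e in enumerate(population) if j != index]
    mean_similarity = sum(cosine_similarity(population[index], e) for e in others) / len(others)
    return 1.0 - mean_similarity


reference = (1.0, 0.0)                             # toy E(R)
population = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy E(G_i) for three prompts

for i, embedding in enumerate(population):
    print(i, fidelity(embedding, reference), global_dissimilarity(i, population))
```

The two duplicated outputs obtain maximal fidelity but only moderate diversity, while the third output is the most diverse and the least faithful, which is the tension the Pareto ranking has to manage.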

The engine employs Pareto-based ranking (Non-dominated Sorting) to manage the intrinsic trade-off between semantic fidelity and diversity. Unlike single-objective approaches that collapse metrics into a weighted sum, this method preserves the trade-off surface, allowing the survival of solutions that excel in one dimension even if they are average in the other. The choice to restrict the objective space to two dimensions, semantic fidelity ($F_{1}$) and generative diversity ($F_{2}$), is grounded in the fundamental tension identified in the synthetic data generation literature. As demonstrated by Zhang et al. [11], single-objective optimizers targeting fidelity invariably collapse toward the mode of the distribution, suppressing rare outputs. Conversely, optimizing diversity without fidelity constraints produces semantically incoherent noise. These two objectives thus constitute a minimal yet sufficient formulation for the core synthetic data augmentation task: $F_{1}$ ensures that generated samples remain within the target semantic distribution, while $F_{2}$ guarantees sufficient variance for downstream generalization, consistent with the evaluation axes proposed in recent synthetic data benchmarks [25]. We acknowledge that additional objectives such as fluency, factual consistency, or task-specific utility could further enrich the optimization framework, and their integration is considered a priority direction for future work.

Complementarily, Crowding Distance sorting is utilized to estimate the solution density in the objective space. By prioritizing candidates located in less populated regions, the algorithm actively prevents genetic drift towards a single cluster, ensuring a uniform spread of solutions along the estimated Pareto front.
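The two ranking mechanisms can be sketched as below. For brevity, this version peels fronts off one at a time instead of using the bookkeeping of the fast non-dominated sort from the NSGA-II paper, so it is a simplified illustration rather than the reference algorithm. The objective values are hypothetical $(F_{1}, F_{2})$ pairs, both maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Return a list of fronts (lists of indices); front 0 is the non-dominated set."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Crowding distance of each index in one front; boundary points get infinity."""
    distance = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        low, high = points[ordered[0]][m], points[ordered[-1]][m]
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if high == low:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            distance[cur] += (points[nxt][m] - points[prev][m]) / (high - low)
    return distance


# Hypothetical (F1, F2) values for five individuals.
points = [(0.9, 0.1), (0.6, 0.5), (0.2, 0.9), (0.5, 0.4), (0.1, 0.1)]
fronts = non_dominated_sort(points)
distances = crowding_distance(points, fronts[0])
print(fronts)
print(distances)
```

Here the first three individuals form the first front, and the two extremes receive infinite crowding distance so that the ends of the trade-off surface are always retained.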

Genetic Operators and Selection

To isolate the impact of the multi-objective architecture, we retain the Structured Prompt Representation ($P_{i} = (R_{k}, T_{m}, A_{j})$) and the variation operators defined in the baseline. Specifically, we employ Component-wise Crossover (swapping roles or actions between parents) and LLM-Guided Mutation (replacing tokens with semantically similar alternatives).

We also maintain the Regeneration Step employed in the baseline to ensure linguistic coherence after stochastic mutation.

Finally, the selection mechanism represents a significant departure from the baseline: we implement the standard NSGA-II Binary Tournament, which prefers individuals with a lower non-domination rank and, among individuals of equal rank, those with a larger crowding distance, in order to preserve population diversity.
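A compact sketch of these operators is shown below. The `Prompt` record, the choice of swapping exactly one component per crossover, and the tie-breaking rule when both rank and crowding distance are equal are assumptions made for illustration; the LLM-guided mutation and the Regeneration Step require an LLM call and are not reproduced.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prompt:
    role: str
    topic: str
    action: str


def component_crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen component (role or action) between two parents."""
    field = rng.choice(["role", "action"])
    child_a = replace(parent_a, **{field: getattr(parent_b, field)})
    child_b = replace(parent_b, **{field: getattr(parent_a, field)})
    return child_a, child_b


def crowded_tournament(i, j, rank, crowding):
    """Binary tournament: lower rank wins; ties go to the larger crowding distance."""
    if rank[i] != rank[j]:
        return i if rank[i] < rank[j] else j
    return i if crowding[i] >= crowding[j] else j


rng = random.Random(0)
a = Prompt("As a firefighter", "urban wildfire", "report the evacuation status")
b = Prompt("As a local journalist", "coastal flooding", "summarize the damage")
child_a, child_b = component_crossover(a, b, rng)

rank = {0: 0, 1: 1, 2: 0}
crowding = {0: 0.4, 1: 9.0, 2: 1.2}
print(child_a, child_b)
print(crowded_tournament(0, 1, rank, crowding), crowded_tournament(0, 2, rank, crowding))
```

In the tournament calls, individual 0 beats individual 1 on rank despite its smaller crowding distance, while individual 2 beats individual 0 because both share the same rank and 2 lies in a less crowded region.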

4.3. Post-Processing Funnel

Unlike the baseline, which converged towards maximizing a single fitness metric, EVOLMD-MO produces a Pareto Front containing a set of optimal trade-off solutions. To extract a usable dataset from this front, we implement a three-stage filtering pipeline:

1. Safety Filter: Implements a dual-threshold mechanism that defines a valid acceptance window $[\tau_{\min}, \tau_{\max}]$ based on the fidelity score. This component automatically rejects candidates classified as semantic hallucinations ($F_{1} < \tau_{\min}$) or lexical duplicates ($F_{1} > \tau_{\max}$). For the experimental validation presented in this study, the lower threshold was set to $\tau_{\min} = 0.20$ and the upper threshold to $\tau_{\max} = 0.95$.
2. TOPSIS Ranking: To automate the final selection and avoid manual choices, which introduce subjectivity and operational bottlenecks, we implement TOPSIS. This method identifies the individual $i^{*}$ that offers the best geometric compromise between both objectives:

$i^{*} = \arg\max_{i \in Front} \frac{d_{i}^{-}}{d_{i}^{+} + d_{i}^{-}}$

(4)

where $d_{i}^{+}$ and $d_{i}^{-}$ are the Euclidean distances to the ideal positive and negative solutions, respectively. Weights are dynamically calculated using the Entropy Method ($w_{j} = \frac{k_{j}}{\sum_{l} k_{l}}$, where $k_{j} = 1 - \mathrm{Entropy} (f_{j})$). This approach avoids arbitrary human prioritization, allowing the data’s intrinsic variability to determine the relative importance of each objective. Consequently, the selection process becomes fully data-driven, adapting to the specific distribution of the Pareto front in every generation.
3. MMR Re-ranking: Finally, to construct the final Top-K dataset, we apply Maximal Marginal Relevance (MMR):

$MMR = \arg\max_{D_{i} \in R \setminus S} \left[ \lambda \cdot \mathrm{Sim}_{1} (D_{i}, Q) - (1 - \lambda) \max_{D_{j} \in S} \mathrm{Sim}_{2} (D_{i}, D_{j}) \right]$

(5)

where $R$ is the set of candidates, $S$ is the subset already selected, $\mathrm{Sim}_{1} (D_{i}, Q)$ is the relevance term (here, the TOPSIS score of $D_{i}$), and $\mathrm{Sim}_{2}$ is the similarity between two candidates. This iterative method selects candidates that maximize relevance while minimizing similarity to the already selected set. The balance between these two factors is controlled by a tunable parameter ($\lambda$): lower values increase the penalty for semantic redundancy. For our experiments, we set $\lambda = 0.35$ to prioritize semantic diversity, ensuring the final output covers the broadest possible semantic space and avoids repetition within the post-processed Pareto front.
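The first and third stages of the funnel can be sketched as below, using the thresholds and $\lambda$ reported above as defaults. The sketch assumes relevance scores (for example, TOPSIS closeness values) and a pairwise similarity matrix are already available; the dictionary layout, function names, and numeric values are hypothetical.

```python
def safety_filter(candidates, tau_min=0.20, tau_max=0.95):
    """Keep candidates whose fidelity F1 lies inside [tau_min, tau_max]."""
    return [c for c in candidates if tau_min <= c["f1"] <= tau_max]


def mmr_select(relevance, similarity, k, lam=0.35):
    """Greedy MMR: repeatedly pick the candidate with the best relevance/novelty balance."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [
    {"text": "off-topic output", "f1": 0.10},  # below tau_min: rejected
    {"text": "useful variation", "f1": 0.60},  # inside the window: kept
    {"text": "near copy", "f1": 0.99},         # above tau_max: rejected
]
kept = safety_filter(candidates)

# Candidates 0 and 1 are near-duplicates of each other; candidate 2 is distinct.
relevance = [0.90, 0.85, 0.50]
similarity = [
    [1.00, 0.98, 0.10],
    [0.98, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
top2 = mmr_select(relevance, similarity, k=2)
print([c["text"] for c in kept], top2)
```

With $\lambda = 0.35$, the second pick skips the near-duplicate with the higher relevance and takes the less relevant but novel candidate, which is the behaviour the re-ranking stage is meant to produce.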

5. Experiments

In this section, we evaluate the correctness of the EVOLMD-MO architecture and compare its performance against a baseline approach [9]. The evaluation is structured into four phases, each targeting a specific aspect of the algorithm: (i) configuration validation, (ii) fidelity assessment, (iii) diversity assessment, and (iv) overall performance analysis with respect to the baseline.

5.1. Fidelity Metric Validation

To justify the replacement of BERTScore with SBERT, we conduct an evaluation that considers both execution time and measurement quality. In this initial experiment, a Python 3 implementation compares the similarity scores produced by BERTScore and SBERT against those assigned by an LLM-as-a-Judge model. The evaluation relies on a dataset composed of 20 thematic groups (e.g., COVID-19, economy, technology, climate). Each group contains the following:

Reference Text: A complex phrase associated with the group’s theme.
4 Candidates: Variations in the reference phrase with different similarity levels (High, Medium-High, Medium-Low, Low/Null).
Oracle Score: A reference value (0.0 to 1.0) assigned to each candidate, representing the ideal semantic correlation. To ensure an objective and scalable evaluation, we adopt the well-documented LLM-as-a-Judge paradigm [30], employing the Gemini 3 Pro model [31] as an impartial automated oracle to evaluate how related the candidate is to the reference text.

Thus, a total of 80 paired evaluations were obtained for the BERT and SBERT metrics. While the state-of-the-art literature already validates the general robustness and speed of SBERT [19], the purpose of this targeted evaluation set is to serve as a preliminary pilot study. This pilot confirms that the established metric performance holds true specifically within our disaster-related dataset, enabling a direct comparison with the LLM-as-a-Judge oracle scores to verify which automatic metric provides the most reliable and efficient fitness signal for the evolutionary engine.

Table 2 presents the detailed structure of a representative thematic group (Topic: Economy). This example shows how candidates vary from a high-fidelity paraphrase of the reference text to completely out-of-context sentences, along with the ideal score (Oracle) assigned by the LLM-as-a-Judge oracle. It also visualizes the score value given by the BERT and SBERT metrics for each candidate. Similarly, the remaining groups in the validation set cover a diverse range of domains, addressing topics such as Biology, Marketing, Travel, and Literature, among others.

In this particular example, it is observed that SBERT evaluates correlations with greater precision in 3 out of the 4 candidates. Globally, after systematizing the tests over the 20 thematic groups, the consolidated results presented in Table 3 were obtained.

Table 3 reports the Mean Absolute Error (MAE) obtained by the SBERT and BERT models across the evaluated scenarios. The best results in each row are highlighted in bold, corresponding to the lowest MAE values and thus the highest estimation accuracy. The analysis indicates that SBERT, with its lower MAE, achieves a robust correlation with the LLM-as-a-Judge oracle, performing comparably to BERTScore and, in several tiers, slightly better. The most critical finding concerns noise detection (Low Similarity). BERTScore presents a high error (0.3659) in this category, suggesting that it tends to overestimate the quality of texts that bear no real relation to the reference. Conversely, SBERT maintains a minimal error (0.0497). This behavior is vital for system integrity: it indicates that SBERT acts as a robust mechanism to discard hallucinations.
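For readers who wish to reproduce this kind of comparison, a per-tier MAE against the oracle can be computed as in the sketch below. The record layout, the function name `mae_by_tier`, and the sample scores are hypothetical; the numbers are illustrative and are not the values of Table 3.

```python
from collections import defaultdict


def mae_by_tier(records):
    """Mean absolute error of a metric against oracle scores, grouped by tier.

    records -- iterable of (tier, oracle_score, metric_score) tuples
    """
    errors = defaultdict(list)
    for tier, oracle, metric in records:
        errors[tier].append(abs(metric - oracle))
    return {tier: sum(vals) / len(vals) for tier, vals in errors.items()}


# Hypothetical scores, NOT taken from Table 3.
sample = [
    ("high", 0.95, 0.90),
    ("high", 0.90, 0.92),
    ("low", 0.05, 0.10),
    ("low", 0.00, 0.40),
]
result = mae_by_tier(sample)
print(result)
```

Running the same function once per metric (BERT-based and SBERT-based scores) yields one row of errors per tier, which is the layout a table such as Table 3 would need.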

To empirically define the operational boundaries of the fidelity score for the feasible candidate solutions in the Safety Filter of the post-processing funnel (Section 4.3), we examined the specific score distributions across the four quality tiers (as shown in Table 4).

For the upper bound ($τ_{max}$), we observe that the High Quality tier, composed of optimal paraphrases, reaches a maximum SBERT score of 0.94. This suggests that candidates exceeding this value are likely lexical duplicates rather than creative variations. Consequently, we set $τ_{max} = 0.95$ to strictly penalize overfitting.

Regarding the lower bound ($τ_{min}$), the data reveal a critical intersection zone. While the Med-Low (Tangential) tier centers around a mean of $μ = 0.22$, the Low Quality (Noise) tier exhibits a maximum observed score of 0.18. To prioritize the semantic integrity of the dataset, we adopt a conservative thresholding strategy.

Rather than relying on the standard deviation of the tangential tier (which would extend the acceptance range into the noise zone), we establish the limit based on the empirical upper bound of the noise distribution. Consequently, we set $τ_{min} = 0.20$. This value acts as a safe margin slightly above the maximum noise outlier (0.18), ensuring the systematic rejection of hallucinations while retaining the upper quantile of semantically diverse candidates.
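The resulting acceptance band can be expressed as a simple filter. The sketch below is illustrative only: the function name `safety_filter`, the `(text, fidelity)` tuple layout, and the use of inclusive bounds are assumptions, since the text does not state whether the thresholds themselves are accepted.

```python
TAU_MIN = 0.20  # lower bound: just above the maximum observed noise score (0.18)
TAU_MAX = 0.95  # upper bound: above it, candidates are treated as near-duplicates


def safety_filter(candidates, tau_min=TAU_MIN, tau_max=TAU_MAX):
    """Keep (text, fidelity) pairs whose fidelity lies in [tau_min, tau_max].

    Inclusive bounds are an assumption of this sketch.
    """
    return [(text, f) for text, f in candidates if tau_min <= f <= tau_max]


# Hypothetical candidates with made-up SBERT fidelity scores.
candidates = [("a", 0.97), ("b", 0.60), ("c", 0.18), ("d", 0.20)]
kept = safety_filter(candidates)
print(kept)
```

In this toy input the near-duplicate (0.97) and the noise-level candidate (0.18) are rejected, while the two candidates inside the band survive.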

Overall, within this empirical pilot, SBERT reached a 96% correlation with the Oracle evaluation versus 90% for BERTScore. While universal metric superiority cannot be definitively claimed from an 80-pair subset, the results provide sufficient assurance that SBERT is a reliable substitute for this framework. Crucially, the transition to SBERT was primarily driven by the necessity for computational efficiency rather than absolute metric superiority. Table 5 details the comparative execution times for the same evaluation batch.

The results confirm a speedup factor of 34.22× in favor of SBERT. While the exact magnitude of this gain may vary with hardware and text length, the results confirm that our approach removes the primary bottleneck for scalability, effectively transforming a process that would require hours into minutes. This optimization is decisive for the scalability of the EVOLMD-MO architecture.

5.2. Semantic Diversity Metrics Evaluation

As outlined in the architecture, the Monitor component acts as an observational module designed to track the population’s evolutionary dynamics. Unlike the active selection mechanisms, the Monitor’s role is to provide real-time analytics using K-Means Inertia (geometric diversity) and Entity Entropy (content diversity). This enables us to diagnose critical states, such as convergence stability or potential mode collapse, during the experimental phase.
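To make the content-diversity signal concrete, the sketch below computes an entropy over named entities. It assumes that Entity Entropy is the Shannon entropy of the entity frequency distribution of the population, that entities have already been extracted by a named-entity recognition step (not shown), and that the logarithm is taken in base 2. The function name `entity_entropy` and the sample entities are hypothetical.

```python
import math
from collections import Counter


def entity_entropy(entity_lists):
    """Shannon entropy (in bits) of the entity distribution of a population.

    entity_lists -- one list of already-extracted entities per generated text
    """
    counts = Counter(e for entities in entity_lists for e in entities)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1 / p) summed over the distinct entities.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical populations: one repeating a single entity, one covering four.
collapsed = [["covid"], ["covid"], ["covid"]]
varied = [["covid", "italy"], ["who", "economy"]]
print(entity_entropy(collapsed), entity_entropy(varied))
```

A population that keeps mentioning the same entity scores zero, while one that spreads its mentions evenly over four entities reaches two bits, which is the behavior expected from a mode-collapse indicator.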

Prior to deploying these metrics within the evolutionary loop, it is crucial to verify their diagnostic capability. Therefore, this section performs two validation tests to demonstrate that the Monitor can effectively distinguish between redundancy, chaotic noise, and useful diversity:

1. A Macro-level Analysis to validate whether K-Means Inertia correctly reflects increases and decreases in population diversity.
2. A Micro-level Analysis to verify whether the selective pressure favors innovative individuals over redundant ones.

5.2.1. Macro-Level Analysis: K-Means Inertia and Entity Entropy

To validate the robustness of the K-Means Inertia metric employed by the diversity monitor, and specifically its capacity to quantify the semantic content covered by the population, we designed a controlled experiment using three synthetic populations.

Each population consists of five individuals ($N = 5$) representing distinct states of semantic convergence. The primary objective is to determine if the metric correctly aligns with the conceptual levels of diversity in each scenario. This validation step is essential to subsequently assess quantitatively whether the population evolves during the genetic algorithm by incorporating new content, increasing diversity, and expanding the semantic search space.

The experimental scenarios and their expected outcomes are defined as follows:

Mode Collapse (High Redundancy)

This population (Table 6) represents a failure state where the model produces semantically identical phrases or slight paraphrases of the same sentence. We hypothesize this scenario will result in low global inertia, as the semantic distance between embeddings should be minimal.

Chaotic Diversity (Irrelevance)

This population (Table 7) consists of phrases that are semantically disjoint and lack a cohesive central topic. While diverse, this represents “noise” rather than useful data generation. We anticipate a high global inertia value, significantly exceeding the useful range, due to the lack of semantic correlation between the sentences.

Ideal Diversity (Faithful and Diverse)

This population (Table 8) maintains a central topic (e.g., the Pandemic) but explores it from different semantic angles (economy, mental health, logistics, etc.). We expect an intermediate global inertia value, located between the values of the previous groups.

Following the qualitative definition of the scenarios, vector metrics were calculated to verify these hypotheses. The quantitative results are presented in Table 9.

The results confirm the metric’s utility as a driver for evolutionary exploration:

1. Differentiation: There is a clear numerical gap between the stagnation state (0.15) and diverse states (>0.22), allowing the algorithm to effectively penalize mode collapse.
2. Interpretation of High Values: The high inertia observed in the “Chaotic” scenario (0.37) confirms that the metric correctly measures semantic breadth. In the context of the Multi-Objective Evolutionary Algorithm (MOEA), high inertia values are desirable as they indicate extensive exploration of the search space.
3. Role in Optimization: While “Chaotic Diversity” achieves high inertia through irrelevance, the MOEA filters such solutions via the Fidelity objective. Therefore, the inertia metric successfully fulfills its role as a monitoring signal: it accurately quantifies the expansion of the semantic space (assigning higher values to wider spreads) and provides the necessary data for the optimization engine to reward diversification while the fidelity metric simultaneously constrains the search to relevant regions.

These findings confirm that the inertia metric effectively captures the content diversity within a group of phrases, providing the system with a clear operational range towards which to direct the optimization process.

Finally, it is important to note that these quantitative results should not be interpreted as objective constants, as they depend on population size and complexity. While normalization could offer a scale-independent alternative, for the purposes of this paper, the primary objective is to monitor the evolution of diversification. The metric is intended to track the relative expansion of the semantic space throughout the generations, rather than using these specific numerical values to force the solutions toward a fixed target state.
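The behavior of the inertia signal can be reproduced on toy data with a small K-Means routine. The sketch below is a simplified stand-in and not the monitor's implementation: the number of clusters `k`, the deterministic initialization from the first `k` points, and the two-dimensional toy vectors are assumptions. Inertia is taken as the sum of squared distances to the nearest centroid, the same quantity that scikit-learn's `KMeans` exposes as `inertia_`.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_inertia(points, k=2, iters=20):
    """Sum of squared distances from each point to its nearest centroid.

    Deterministic sketch: centroids start at the first k points (an assumption).
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the previous centre when a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical embeddings: a collapsed population and a widely spread one.
collapsed = [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0], [1.0, 0.0], [0.99, 0.01]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]]
print(kmeans_inertia(collapsed), kmeans_inertia(spread))
```

Because the value is an unnormalized sum, it grows with the population size, which is consistent with the caveat above that the reported numbers should be read as relative rather than absolute.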

5.2.2. Microscopic Analysis: Sensitivity and Selective Pressure

This experiment evaluates the discriminatory sensitivity of the proposed metric at the individual level. Specifically, it seeks to demonstrate the system’s capacity to assign differential fitness scores: rewarding semantically novel candidates while effectively penalizing redundancy within a heterogeneous population. This differential is critical to ensure that the genetic algorithm exerts the necessary selective pressure to escape local optima (mode collapse).

To verify this mechanism, we constructed a controlled mixed population designed to simulate a transition state in the evolutionary process. The population consists of two distinct subsets:

A Redundant Cluster ( $n = 3$ ): Three variations of the same sentence with high semantic overlap, representing a stagnant lineage. We anticipate these individuals will receive low novelty scores due to the high semantic overlap among them, and their marginal contribution to the group’s diversity is limited, as they add redundancy rather than expanding the semantic search space.
An Innovative Group ( $n = 2$ ): Two sentences introducing distinct concepts or perspectives, representing desired genetic mutations. These candidates are expected to achieve significantly higher scores compared to the redundant cluster, correlating with their ability to expand the semantic boundaries of the population and mitigate mode collapse.

We calculated the individual novelty score for each candidate to measure the gap in valuation between these two groups. Table 10 details the assigned scores.

The results demonstrate a clear discriminatory gap of +0.24 (a ≈60% increase relative to the redundant group). This differential validates the metric’s ability to impose significant selective pressure against redundancy.

By assigning higher fitness values to semantically distinct individuals, the system inherently reduces the survival probability of repetitive clusters. This mechanism naturally prioritizes the exploration of the semantic search space, preventing the algorithm from converging into local optima (mode collapse) without requiring external constraints.
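The gap between redundant and innovative individuals can be illustrated with a simple per-individual score. The sketch below uses one minus the mean cosine similarity to the rest of the population as a stand-in for the novelty score; this particular formula, the function name `novelty_scores`, and the toy embeddings are assumptions made for illustration and may differ from the score used in the framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def novelty_scores(embeddings):
    """Novelty of each individual: 1 - mean cosine similarity to all others.

    Stand-in definition (assumption); requires at least two individuals.
    """
    scores = []
    for i, e in enumerate(embeddings):
        sims = [cosine(e, o) for j, o in enumerate(embeddings) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Hypothetical mixed population: three redundant vectors, two innovative ones.
population = [
    [1.0, 0.0, 0.0],
    [0.98, 0.1, 0.0],
    [0.97, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
scores = novelty_scores(population)
print([round(s, 3) for s in scores])
```

In this toy population every innovative individual scores higher than every member of the redundant cluster, mirroring the qualitative pattern discussed above.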

5.2.3. Macroscopic Scaling Validation on Realistic Populations ( $N = 100$ )

While the preceding synthetic experiments conceptually demonstrate the mathematical mechanics of the diversity monitor, it is crucial to validate the scaling reliability of these metrics when operating over full-size populations ($N = 100$) derived from actual execution logs. To definitively address this, we conducted a robust scaling validation by evaluating the Semantic Dispersion (K-Means Inertia) and Conceptual Coverage (Entity Entropy) across four distinct, realistic states of generative diversity:

1. Absolute Collapse (Identical Clones): A total of 100 mathematically identical copies of a single sentence. This serves as the theoretical ground-truth minimum boundary.
2. Algorithmic Collapse (Baseline Failure State): A total of 100 texts extracted from the final generation (Gen 100) of the single-objective baseline algorithm. This represents realistic repetitive stagnation and mode collapse.
3. Structured Diversity (EVOLMD-MO Success State): A total of 100 texts extracted from the final generation (Gen 100) of the proposed multi-objective framework, representing the desired optimization goal.
4. Organic Diversity (Social Media Corpus): A total of 100 completely random sentences sampled from the original Lamsal dataset. This serves as the theoretical organic maximum benchmark.

As illustrated in Figure 2, both proposed metrics scale monotonically with the expected diversity of the selected populations. The metrics register near-zero values for Absolute Collapse, correctly penalize the marginal dispersion of the Baseline, and increase sharply for the structured dataset generated by EVOLMD-MO, which approaches the natural variance of the human-generated corpus. This empirical evidence indicates that the diversity monitor remains robust and consistently discriminates structural and semantic variance at full operational scale.

The central objective of the next phase is to contrast the performance of the proposed algorithm against the baseline, analyzing the quality of the explored search space and the pertinence of the selected individuals.

5.3. Comparative Performance

This section evaluates and contrasts the performance of the proposed EVOLMD-MO framework against the single-objective baseline (original algorithm). To ensure a robust assessment, we conducted experiments on the four reference scenarios defined in Section 5.3.1. The analysis focuses on three key dimensions: the diversity of the search space exploration, the semantic quality of the generated individuals, and the computational efficiency (runtime).

5.3.1. Experimental Setup and Hyperparameters

To guarantee a fair comparison regarding computational efficiency and execution times, all experiments were conducted on the same hardware infrastructure used in the baseline: a dedicated Linux workstation equipped with 32 GB of RAM and accelerated by dual NVIDIA GeForce RTX 3060 GPUs. The system utilizes a local, containerized LLM environment powered by Ollama to execute the Llama 3.1 8B model.

To ensure the robustness of the analysis, four reference texts presenting heterogeneous linguistic characteristics and semantic contexts were selected. We utilized a Reference Phrase Dataset [32] consisting of 2,536,837 distinct text segments.

While these seeds were chosen randomly, their representativeness was empirically verified by projecting them against the dataset’s global semantic distribution. As illustrated in Figure 3, a random uniform sample of the corpus ($N = 8000$) and the four seeds were embedded using SBERT and projected into a 2D space via PCA. Subsampling is standard practice in dense embedding visualization: it preserves the underlying structure of the latent semantic manifold while preventing the visual overplotting and excessive computational overhead that processing all 2.5 million records would cause. The broad spatial dispersion of the seeds indicates that they cover distinct semantic regions of the underlying dataset and capture the diverse conceptual dimensions required for a comprehensive generation benchmark without introducing manual selection bias.

From this pool, the specific reference texts selected as seed inputs for the experimental scenarios are defined below:

Reference 1 (Social/Informal Context): “brother just look italy case first few case came 31st jan now italy top the death list corona viruse cases boom very fast medical facility is also not good in south east asia and you have neighbor worst conditions your government be very careful”.
Reference 2 (Mental Health/Personal): “this quarantine has kicked my depression up a couple notches thanks to my work and routine being void now and im effectively avoiding my phone now bc everyone is nuts sending corona stuff dont blame them but christ the anxiety is driving me up the Wall”.
Reference 3 (Corporate/Informative): “update our action center has been updated with more information about restaurant shutdowns and disaster financing options for smbs”.
Reference 4 (Creative/Open-Ended): “why dont youwake me up when corona ends”.

Given the stochastic nature of both Genetic Algorithms and LLM generation, relying on a single execution per scenario is insufficient to draw robust conclusions. To ensure statistical significance and reproducibility, we adopted the following protocol:

Independent Runs: Each scenario was evaluated across 3 independent runs for both the Baseline and EVOLMD-MO, using different random seeds for initialization.
Total Executions: The complete experimental campaign comprises 24 distinct executions (2 Systems × 4 Scenarios × 3 Runs).
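The size of the campaign follows directly from the Cartesian product of systems, scenarios, and runs, as the sketch below shows. The seed scheme (a base value plus the execution index) and the names `build_campaign` and `base_seed` are hypothetical; the text only states that different random seeds were used.

```python
from itertools import product

SYSTEMS = ("baseline", "EVOLMD-MO")
SCENARIOS = (1, 2, 3, 4)
RUNS = (1, 2, 3)


def build_campaign(base_seed=1000):
    """Enumerate every (system, scenario, run) execution with its own seed.

    The seed scheme is an assumption made for illustration.
    """
    plan = []
    for idx, (system, scenario, run) in enumerate(product(SYSTEMS, SCENARIOS, RUNS)):
        plan.append(
            {"system": system, "scenario": scenario, "run": run, "seed": base_seed + idx}
        )
    return plan


campaign = build_campaign()
print(len(campaign))
```

Listing the executions explicitly also makes it straightforward to log each seed alongside its results so that individual runs can be repeated.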

The configuration used for all experimental runs is detailed in Table 11. To ensure direct comparability with the results of the baseline [9], the structural hyperparameters of the genetic algorithm used in the previous work were maintained, adapting them to the multi-objective nature of the proposal.

Explicit elitism parameters are not defined, as NSGA-II inherently preserves the fittest individuals through its non-dominated sorting and crowding distance mechanisms, ensuring that high-quality solutions from combined parent and offspring populations are retained. The choice of a high crossover probability (0.8) seeks to encourage the combination of semantic features among the best individuals, while a controlled mutation (0.05) introduces the necessary variability to avoid local minima without degrading text coherence.

For the final selection via MMR, we set the diversity parameter to $λ = 0.35$. This choice relies on TOPSIS acting as a quality filter, discarding candidates with poor fidelity. With quality assured, the MMR step can focus on aggressively penalizing redundancy. By setting $λ < 0.5$, we prioritize semantic diversity, ensuring that the selected candidates represent distinct areas of the latent space rather than being repetitive variations.

5.3.2. Search Space Visualization

To facilitate visual interpretation, semantic embeddings were projected onto a two-dimensional plane using PCA. The following figures compare the original algorithm (red dots) against the multi-objective approach (blue dots). To avoid visual occlusion caused by overlapping stochastic trials, we visualize the execution corresponding to the median performance (in terms of Hypervolume) for each scenario.

The PCA projections illustrate a fundamental difference in the exploration strategies of both algorithms. In Scenarios 1 through 3 (Figure 4, Figure 5 and Figure 6), the baseline results (sub-figure a) remain heavily condensed around the reference vector. This dense clustering suggests that the single-objective optimization prioritizes fidelity at the expense of exploration, effectively trapping the population in a local neighborhood of minimal semantic variation.

A distinct behavior emerges in Scenario 4 (Figure 7), where the brevity of the reference prompt forces the baseline to generate new content. Instead of remaining at the origin, the baseline drifts significantly to a specific region (visible as a dense red cloud to the right) and collapses there. This confirms that even when the single-objective approach is forced to diverge, it inevitably becomes trapped in a specific local optimum, failing to explore alternative semantic directions or maintain solutions near the original context.

The EVOLMD-MO populations (sub-figures b) demonstrate a consistent ability to maintain a broad and structured semantic dispersal. This is particularly evident in Scenario 4, where, unlike the baseline, which abandoned the reference zone entirely, EVOLMD-MO maintains a set of high-fidelity solutions near the reference while simultaneously extending the Pareto front into diverse territories. This structure proves that the multi-objective mechanism effectively navigates the trade-off, avoiding the “all-or-nothing” jumps characteristic of the baseline. We can also observe that the spatial arrangement of the final selected candidates (highlighted in the figures) validates the role of the post-processing stage. Rather than clustering around a single optimum, the selected points exhibit a deliberate separation. This spacing is enforced by the MMR mechanism which, regulated by the diversity parameter $λ$, actively penalizes redundancy among high-scoring solutions. Consequently, the algorithm ensures that the top-K outputs are not merely repetitive variations of the best individual but distinct representatives of the explored semantic space.

Finally, Figure 4c, Figure 5c and Figure 6c present a direct superposition of both search spaces. This overlay reveals that the baseline’s exploration (red) is consistently confined to narrow sub-regions, either pinned to the reference (S1–S3) or isolated in a distant cluster (S4). In contrast, the Multi-Objective framework (blue) encompasses these regions while also covering the intermediate space and alternative distinct gradients. The contrast effectively demonstrates how EVOLMD-MO breaks the boundaries of local optima, accessing semantic territories that remained unreachable for the single-objective approach.

5.3.3. Pareto Front Analysis

Next, graphs illustrating the trade-off between diversity and fidelity metrics are presented. Each point corresponds to a non-dominated solution of the Pareto Front. Consistent with the previous analysis, each plot depicts the Pareto Front of the median execution per scenario.

As expected, a clear trade-off relationship between diversity and fidelity was evidenced, confirming that they are conflicting metrics and validating the need for a multi-objective approach. As visible in Figure 8, Figure 9, Figure 10 and Figure 11, the Pareto front in each scenario comprises a proper subset of the final population, consistent with the expected behavior of NSGA-II’s non-dominated sorting procedure [29]. The remaining individuals occupy dominated ranks (front 2, front 3, etc.) and are excluded from the final selection stage. The non-dominated front exhibits a smooth, continuous descending trade-off curve between fidelity and diversity, with no significant discontinuities, confirming that the crowding distance mechanism successfully maintained a well-distributed spread of non-dominated solutions along the front. Because improving diversity almost invariably requires sacrificing some degree of fidelity, solutions rarely dominate each other entirely, which allows the algorithm to distribute the Pareto-front solutions along the optimal trade-off surface.
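The ranking into front 1, front 2, and so on can be illustrated with a short sketch for two maximized objectives (fidelity, diversity). For clarity it peels fronts with a simple quadratic scan rather than the bookkeeping-based fast non-dominated sort of NSGA-II; both yield the same fronts. The function names and the toy population are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one.

    Both objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Peel a population into successive fronts of indices (front 1 first)."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        # An individual joins the current front if nothing left dominates it.
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining)
        ]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Hypothetical (fidelity, diversity) pairs.
population = [(0.9, 0.1), (0.8, 0.3), (0.6, 0.5), (0.7, 0.2), (0.5, 0.4)]
fronts = non_dominated_fronts(population)
print(fronts)
```

In this toy population the first three individuals trade fidelity for diversity without dominating one another and form front 1, while the last two are each dominated by a front-1 member and fall into front 2.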

To provide a stronger quantitative analysis of the Pareto front quality, we evaluated the Hypervolume (HV) and Spacing (SP) indicators across all test configurations (Table 12). The single-objective baseline, which optimizes solely for fidelity, naturally fails to construct a trade-off surface, suffering from mode collapse. When its geometric diversity ($F_{2}$) is evaluated post hoc to plot its solutions in the bi-objective space, its Hypervolume remains extremely small. In stark contrast, EVOLMD-MO achieves an average Hypervolume increase of over 300% across all scenarios, indicating that the Multi-Objective architecture discovers a vastly broader array of valid textual representations. Furthermore, EVOLMD-MO maintains a consistent Spacing value near 0.010, confirming that the crowding-distance mechanism distributes the candidates uniformly across the discovered spectrum, offering the decision-maker smooth and varied alternatives rather than clustered duplicates.
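Both indicators are easy to compute in the bi-objective case. The sketch below is a minimal illustration under stated assumptions: both objectives are maximized, the Hypervolume reference point is the origin, and Spacing follows Schott's definition with Manhattan nearest-neighbor distances. The reference point and the exact Spacing variant used for Table 12 are not specified here, so these choices, like the function names, are hypothetical.

```python
import math


def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a bi-objective maximization front, relative to ref.

    Assumes every point is at least as good as ref in both objectives.
    """
    area, prev_y = 0.0, ref[1]
    # Sweep from the best objective-1 value downwards, adding new horizontal strips.
    for x, y in sorted(front, key=lambda p: p[0], reverse=True):
        if y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


def spacing(front):
    """Schott's spacing: spread of nearest-neighbour (Manhattan) distances."""
    n = len(front)
    if n < 2:
        return 0.0
    d = [
        min(sum(abs(a - b) for a, b in zip(p, q)) for j, q in enumerate(front) if j != i)
        for i, p in enumerate(front)
    ]
    mean = sum(d) / n
    return math.sqrt(sum((mean - x) ** 2 for x in d) / (n - 1))


# Hypothetical fronts of (fidelity, diversity) pairs.
front = [(0.9, 0.1), (0.8, 0.3), (0.6, 0.5)]
even_front = [(0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]
print(hypervolume_2d(front), spacing(front), spacing(even_front))
```

A collapsed population occupies a thin strip of the objective space and therefore yields a small Hypervolume, while a Spacing value close to zero indicates that neighboring solutions are roughly equidistant along the front.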

Beyond the structural quality of the Pareto front, we examined the absolute performance levels for the final populations (Table 13). The results highlight the fundamental trade-off managed by the framework: the Baseline algorithm achieves a higher average fidelity in Scenario 3 ($μ = 0.7166$) by converging to a concentrated cluster of high-similarity texts, but this comes at the cost of a significant diversity collapse ($F_{2} = 0.1108$). Conversely, EVOLMD-MO sustains a slightly lower yet stable average fidelity while boosting the average population diversity by a factor of 3× to 6× across all scenarios. This ensures that the generated dataset covers a significantly larger portion of the semantic manifold, fulfilling the primary requirement for robust data augmentation.

5.3.4. Metric Evolution

In this section, we analyze the dynamic evolution of population diversity throughout the 100 generations across all scenarios. This analysis aims to evaluate the distinct trajectories of the semantic space coverage and the vocabulary richness maintained by both algorithms. To quantify these aspects, we monitor K-Means Inertia (structural diversity) and Entity Entropy (lexical variety). It is important to note that while the Baseline algorithm does not optimize for these objectives, these metrics were logged during its execution to enable a direct quantitative comparison with EVOLMD-MO. Regarding the visualization, to ensure the reliability of the trends, each line in the following figures represents the average value calculated across the 3 independent executions. By plotting the mean rather than a single run, we smooth out the stochastic fluctuations inherent to the generation process, revealing the consistent convergence behavior of each algorithm.
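The averaging step behind each plotted line amounts to a generation-wise mean over the independent runs, as sketched below. The function name `mean_trajectory` and the short made-up trajectories are hypothetical; real logs would contain one value per generation for each of the 3 runs.

```python
def mean_trajectory(runs):
    """Average a logged metric generation by generation across independent runs.

    runs -- list of equally long lists, one metric value per generation
    """
    if len({len(r) for r in runs}) != 1:
        raise ValueError("all runs must log the same number of generations")
    return [sum(values) / len(values) for values in zip(*runs)]


# Hypothetical inertia logs for three runs over three generations.
runs = [
    [0.4, 0.3, 0.2],
    [0.5, 0.3, 0.1],
    [0.6, 0.3, 0.3],
]
averaged = mean_trajectory(runs)
print(averaged)
```

Plotting the per-generation spread (for example, the minimum and maximum across runs) next to the mean would additionally show how much the individual executions deviate from the averaged trend.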

As illustrated in Figure 12 and Figure 13, the evolutionary trajectories reveal a fundamental divergence between the two approaches. The Baseline (dashed lines) exhibits a rapid and consistent degradation in both Inertia and Entropy across all scenarios. This downward trend confirms that when diversity is not explicitly rewarded, the population suffers from severe mode collapse, converging within the first 20 generations to a state of minimal semantic variation (Inertia < 0.2) and repetitive vocabulary, with the exception of Scenario 4. In this specific case, the lack of content in the reference prompt forces the Baseline into an unstructured exploration phase; however, it still yields significantly worse diversity metrics compared to the EVOLMD-MO executions.

In contrast, EVOLMD-MO (solid lines) successfully counteracts this tendency. The multi-objective framework maintains significantly higher levels of global diversity throughout the entire process. These sustained trajectories provide empirical confirmation that the algorithm achieves a broader semantic coverage (K-means Inertia) and superior content richness (Entity Entropy) in the population compared to the baseline. Notably, the stability of the solid lines demonstrates that the algorithm does not merely generate random noise to satisfy the metrics; rather, it sustains a structural balance, preventing the population from imploding into a single local optimum. This gap between the solid and dashed lines quantitatively validates the hypothesis that an explicit diversity mechanism is essential to preserve the creative potential of the system over long evolutionary runs.

5.3.5. Runtime Analysis

Finally, the computational efficiency of the proposed framework was evaluated. Table 14 compares the total wall-clock time for 100 generations for all executions between the baseline (using BERTScore) and EVOLMD-MO (using SBERT).

Given the high stochastic variability observed across the executions, we cannot categorically conclude that EVOLMD-MO will be strictly faster in every single instance. As anticipated, the mathematical operations inherent to the multi-objective architecture—specifically the calculation of global inertia, entity entropy, and non-dominated sorting—introduce an unavoidable computational overhead compared to the baseline’s simple fitness comparison. However, the crucial finding is that this theoretical computational burden is successfully neutralized by the strategic architectural transition from heavy token-level computations (BERTScore) to lightweight sentence-level embeddings (SBERT). While the multi-objective method itself does not generate the speedup, the integration of SBERT acts as a necessary compensating mechanism. Consequently, the combined system achieves a global average execution time improvement of 8.26%, proving that it is structurally feasible to deploy advanced, computationally demanding multi-objective text evolution without suffering restrictive temporal penalties due to the extra diversity metrics measurement.

6. Discussion

This section interprets the experimental findings, synthesizing the results from computational efficiency, evolutionary dynamics, and decision-making strategies to validate the EVOLMD-MO framework.

6.1. Computational Efficiency as an Enabler for Quality

One of the most critical barriers in evolutionary prompt optimization is the computational cost of fitness evaluation. Our experiments demonstrated that the strategic replacement of BERTScore with SBERT successfully counterbalanced the inherent overhead of the multi-objective architecture (NSGA-II sorting and diversity calculations).

Although the stochastic nature of the process resulted in some variance across individual runs, the system achieved a global average execution time reduction of 8.26% compared to the baseline. This finding is significant because it proves that sophisticated optimization strategies can be implemented without incurring prohibitive runtime penalties.

Crucially, this efficiency acts as an enabler for superior search capabilities. Unlike the baseline, which optimized for speed and fidelity but collapsed into repetitive local optima, EVOLMD-MO utilized these computational resources to navigate a complex trade-off surface. The framework demonstrated the ability to generate solutions that are not only faithful to the reference but also semantically diverse, effectively breaking the “mode collapse” barrier while maintaining—and often surpassing—the temporal efficiency of the original single-objective approach.

We acknowledge that the experimental comparison presented in this work is limited to our own prior single-objective baseline [9]. While this comparison isolates the specific contribution of the multi-objective extension, it does not position EVOLMD-MO against the broader landscape of black-box prompt optimization methods. Relevant competing approaches include zeroth-order optimization methods such as ZO-PoG [10], reinforcement learning-based prompt search strategies, and other evolutionary frameworks such as PromptBreeder [7] and EvoPrompt [8]. A direct empirical comparison against these methods—using standardized evaluation benchmarks and datasets—is necessary to rigorously assess the competitiveness and novelty of the proposed framework. We identify this as a high-priority direction for future work, noting that such a comparison requires careful experimental design to ensure that differences in LLM backends, evaluation metrics, and task formulations do not confound the results. The current results should therefore be interpreted as a proof-of-concept demonstration of the multi-objective evolutionary paradigm for synthetic data generation, rather than a claim of absolute superiority over all existing black-box prompt optimization methods.

6.2. Evolutionary Dynamics: From Geometry to Content

A central hypothesis of this work was that maximizing geometric diversity (distance in embedding space) would translate into richer informational content. The evolution of the metrics presented in Figure 12 and Figure 13 supports this link.

We observed a tight positive correlation between Global Inertia (geometric spread) and Entity Entropy (informational variety). As the generations progressed, the population did not merely spread out randomly; it incorporated a growing number of unique named entities and concepts. This distinction is crucial: high inertia could theoretically be achieved by generating incoherent noise (random words), but the simultaneous rise in entropy confirms that the diversity is semantic and conceptual.

Furthermore, this “enrichment” process occurred while maintaining high fidelity scores. This behavior suggests that the multi-objective pressure successfully navigated the search space to find “islands” of valid solutions that are semantically distinct from each other, mitigating the mode collapse observed in the single-objective baseline.
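As a rough illustration of the two monitors discussed in this section, the sketch below computes a Shannon entropy over entity counts and a single-centroid inertia over embedding vectors. Both definitions are assumptions made for illustration: the entity lists and 2D vectors are toy, hypothetical inputs, and the K-Means-based inertia used in the experiments may rely on several clusters and a different normalization.

```python
import math
from collections import Counter


def entity_entropy(entities):
    """Shannon entropy (in bits) of the empirical entity distribution."""
    counts = Counter(entities)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def global_inertia(vectors):
    """Mean squared Euclidean distance to the centroid (single-cluster inertia)."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(
        sum((v[i] - centroid[i]) ** 2 for i in range(dim)) for v in vectors
    ) / len(vectors)


# Hypothetical toy populations (not taken from the experiments).
collapsed_entities = ["hospital"] * 4
varied_entities = ["hospital", "Chile", "WHO", "lockdown"]
collapsed_vectors = [[0.0, 0.0], [0.0, 0.0]]
spread_vectors = [[0.0, 0.0], [2.0, 0.0]]

print(entity_entropy(collapsed_entities), entity_entropy(varied_entities))
print(global_inertia(collapsed_vectors), global_inertia(spread_vectors))
```

In this toy setting both quantities are zero for the collapsed population and grow together for the varied one, which is the joint behavior the section describes; incoherent noise would raise the inertia without necessarily raising the entity entropy.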

6.3. Decision Making and the Fidelity-Diversity Trade-Off

The analysis of the Pareto Fronts reveals that the optimal strategy for synthetic data generation is not to maximize fidelity blindly but to exploit the trade-off region. As evidenced in the results, the selection algorithm (MMR) consistently favored solutions with acceptable fidelity over those with perfect fidelity.

This preference is a feature, not a limitation. Extreme fidelity solutions (>0.95) tended to form clusters of high redundancy, effectively mimicking the reference text with minimal variation. By introducing the diversity objective and the MMR re-ranking, the system penalized this internal similarity. The algorithm enforces a sacrifice of marginal fidelity (e.g., dropping from 0.95 to 0.85) to gain significant semantic coverage. This aligns with the ultimate goal of data augmentation: to produce a dataset that covers the latent manifold of the problem space, rather than concentrating density around the known reference points.
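The re-ranking behavior described above can be sketched with a greedy Maximal Marginal Relevance loop. This is a minimal sketch under stated assumptions: the function name `mmr_select`, the toy fidelity values, and the similarity matrix are hypothetical, and `lam` is assumed here to weight the relevance term (the value 0.35 comes from Table 11, but the role of $λ$ in the authors' implementation may differ).

```python
def mmr_select(fidelity, similarity, k, lam=0.35):
    """Greedy MMR: pick k items trading relevance against redundancy.

    fidelity[i]      -- relevance score of candidate i
    similarity[i][j] -- pairwise similarity between candidates i and j
    lam              -- weight on relevance (ASSUMPTION: the paper's lambda
                        may weight the diversity term instead)
    """
    selected = []
    remaining = list(range(len(fidelity)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * fidelity[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical example: candidates 0 and 1 are near-duplicates with high
# fidelity, candidate 2 has lower fidelity but is semantically distinct.
fidelity = [0.95, 0.94, 0.85]
similarity = [
    [1.00, 0.98, 0.30],
    [0.98, 1.00, 0.32],
    [0.30, 0.32, 1.00],
]
picked = mmr_select(fidelity, similarity, k=2)
print(picked)
```

With these toy inputs the second pick is the distinct candidate with fidelity 0.85 rather than the near-duplicate with fidelity 0.94, mirroring the sacrifice of marginal fidelity for coverage discussed above.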

6.4. Robustness to LLM Architecture

A critical consideration regarding the multi-agent design is the system’s reliance on the LLM as the primary evolutionary mutation operator. The LLM Interaction Layer abstracts the model provider, allowing for the seamless substitution of the backend (e.g., swapping Llama 3 for GPT-4 or Mistral). However, the overall evolutionary success remains inherently bounded by the reasoning capabilities of the chosen model.

If an inferior architecture lacking sufficient instruction-following capabilities is utilized, the mutation agent will likely fail to propose coherent semantic variations, instead generating hallucinations or severe deviations. While the framework is structurally protected against incorporating these failures, since the strict SBERT evaluation and Safety Filter ($τ_{min}$) will reject them, this protection comes at the cost of evolutionary stagnation. The algorithm would waste generations evaluating and discarding invalid mutations, unable to effectively expand the Pareto Front. Therefore, while the architecture is provider-agnostic, its operational viability strictly requires an LLM that meets a minimum threshold of linguistic reasoning and instruction adherence to function as a competent genetic operator.
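The rejection mechanism and its link to stagnation can be illustrated with a short sketch. The thresholds are the values listed in Table 11; everything else is an assumption for illustration, including the function names, the inclusive boundaries, and the example scores, which are hypothetical and not taken from the experiments.

```python
TAU_MIN = 0.20  # lower fidelity threshold (Table 11)
TAU_MAX = 0.95  # upper fidelity threshold (Table 11)


def safety_filter(scores, tau_min=TAU_MIN, tau_max=TAU_MAX):
    """Split candidate indices into accepted and rejected by SBERT score.

    ASSUMPTION: both boundaries are treated as inclusive.
    """
    accepted, rejected = [], []
    for i, s in enumerate(scores):
        (accepted if tau_min <= s <= tau_max else rejected).append(i)
    return accepted, rejected


def rejection_rate(scores):
    """Fraction of candidates discarded; a high value hints at stagnation."""
    _, rejected = safety_filter(scores)
    return len(rejected) / len(scores) if scores else 0.0


# Hypothetical scores from a weak mutation model: mostly noise-level outputs.
weak_model_scores = [0.03, 0.11, 0.18, 0.52]
print(safety_filter(weak_model_scores))
print(rejection_rate(weak_model_scores))
```

Tracking a quantity like `rejection_rate` per generation would be one simple way to detect that a backend is too weak to act as a mutation operator, although the framework described here does not claim to do so.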

6.5. Qualitative Validation: Linguistic Analysis

To evaluate the real quality of the synthetic data beyond numerical metrics, the textual results corresponding to Scenario 2 are analyzed. This case is of particular interest due to its semantic and emotional complexity, as the reference text addresses subjective feelings about depression and anxiety during quarantine:

Reference Text: “this quarantine has kicked my depression up a couple notches thanks to my work and routine being void now and im effectively avoiding my phone now bc everyone is nuts sending corona stuff dont blame them but christ the anxiety is driving me up the Wall”.

Table 15 presents a selection of candidates generated by the system.

Upon reviewing the candidates, the results of the metrics are confirmed. Unlike the baseline, which generated similar texts, EVOLMD-MO achieves clear variations:

1. Creativity Gradient: Candidate 1 is a direct high-fidelity paraphrase, while Candidate 2 uses complex metaphors (“swirling vortex of chaos”, “snakes”) to maintain the original emotion while changing the vocabulary.
2. Mode Collapse Mitigation: The system forces different perspectives, such as that of a first responder (Candidate 3) or an empathetic tone of advice (Candidate 5), avoiding repetition.
3. Coherence: Despite the high diversity, the majority of sentences maintain grammatical coherence and semantic relevance to the reference topic, supporting SBERT as an effective filter. The exception is Candidate 4, which illustrates the semantic drift phenomenon inherent to high-diversity generation. This is a known edge case that the Pareto framework addresses by allowing the user to apply a stricter fidelity threshold and exclude such outliers from the final dataset.

A noteworthy observation derived from the cross-comparison of the experimental scenarios is the apparent robustness of the proposed framework across heterogeneous linguistic registers. Despite the distinct structures of the reference texts, ranging from informal slang to formal business terminology, the evolutionary engine successfully expanded the Pareto front and mitigated mode collapse in all tested cases (as seen in Figure 4, Figure 5 and Figure 6). This suggests that the diversity mechanism, driven by geometric dissonance in the embedding space ($F_{2}$), may operate with some degree of independence from the specific semantic topic. However, we caution that these results are based on four scenarios within a single domain (disaster-related social media), and broader claims of domain agnosticism must await validation across multiple independent domains, datasets, and LLM backends. This multi-domain evaluation is identified as the primary direction for future work.

We also acknowledge that extreme diversity can occasionally lead to semantic drift, as seen for Candidate 4, where the topic shifts from anxiety to pet hygiene. This phenomenon underscores the necessity of the Pareto approach: rather than accepting a single output, the user is empowered to filter out these “hallucinations” by selecting a higher fidelity threshold from the available front. In conclusion, this qualitative analysis indicates that the proposed architecture produces rich, varied, and largely coherent text.
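The user-side filtering described above can be sketched directly on the fidelity and diversity values reported in Table 15. The helper names and the threshold of 0.30 are illustrative assumptions, not part of the published pipeline.

```python
# (fidelity, diversity) pairs copied from Table 15
candidates = {
    1: (0.8429, 0.5196),
    2: (0.2521, 0.7723),
    3: (0.3365, 0.7195),
    4: (0.2720, 0.7578),
    5: (0.4751, 0.6808),
}


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Return the ids of points not dominated by any other point."""
    return sorted(
        i for i, p in points.items()
        if not any(dominates(q, p) for j, q in points.items() if j != i)
    )


def filter_by_fidelity(points, threshold):
    """Keep only ids whose fidelity meets the user-chosen threshold."""
    return sorted(i for i, (f, _) in points.items() if f >= threshold)


front = pareto_front(candidates)
kept = filter_by_fidelity(candidates, threshold=0.30)  # illustrative threshold
print(front, kept)
```

All five candidates are mutually non-dominated, so the drift in Candidate 4 cannot be removed by Pareto dominance alone. Note also that, with these values, any fidelity threshold that excludes Candidate 4 (0.2720) also excludes Candidate 2 (0.2521), so the metaphorical variant is lost together with the off-topic one.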

6.6. Computational Cost and Real-World Feasibility

Despite the runtime efficiency gains achieved through the SBERT substitution (Section 6.1), the overall computational cost of EVOLMD-MO remains dominated by the LLM query bottleneck inherent to the evolutionary loop. With a population of $N = 100$ individuals evolving over 100 generations, each experimental run requires on the order of 10,000 LLM inference calls for data generation, plus additional calls for mutation and regeneration steps. In our controlled experimental environment using a locally hosted Llama 3.1 7B model, this resulted in wall-clock times in the range of 22,000–33,000 s per run (Table 14), which, while feasible for offline dataset construction, may be prohibitive in time-sensitive or resource-constrained settings. In cloud-based deployment scenarios where LLM access is mediated through commercial APIs, this query volume translates into non-trivial financial costs that scale linearly with population size and generation count. For example, at representative API pricing levels, a full experimental run of 100 generations with $N = 100$ individuals could incur costs that limit applicability for large-scale or iterative use cases. Practitioners should therefore carefully tune the population size, number of generations, and mutation rate to balance exploration quality against computational budget. To improve scalability, future work will investigate three complementary strategies: (i) surrogate fitness models that approximate LLM-generated outputs without requiring actual inference calls, reducing the number of true LLM queries per generation; (ii) adaptive population control mechanisms that dynamically adjust population size based on convergence indicators, avoiding unnecessary evaluations in stable generations; and (iii) asynchronous parallel evaluation schemes that exploit multi-GPU or distributed computing infrastructure to reduce wall-clock time. These directions are essential to establish EVOLMD-MO as a practically deployable framework beyond controlled experimental environments.
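The budget arithmetic in this section can be made explicit with a small sketch. The call count follows from the population size and generation count stated above, and the 24,859 s figure is the EVOLMD-MO global average from Table 14. The token count per call and the price per million tokens are purely hypothetical placeholders, since no pricing is reported here, and mutation and regeneration calls are deliberately left out.

```python
def generation_calls(population_size, generations):
    """Data-generation calls only: one per individual per generation."""
    return population_size * generations


def estimated_cost(calls, tokens_per_call, price_per_million_tokens):
    """Rough API cost; token count and price are hypothetical inputs."""
    return calls * tokens_per_call * price_per_million_tokens / 1_000_000


calls = generation_calls(100, 100)          # configuration from Table 11
half_pop_calls = generation_calls(50, 100)  # halving N halves the call count

# Upper bound on average seconds per data-generation call, using the
# EVOLMD-MO global average wall-clock time from Table 14.
seconds_per_call = 24859 / calls

# Placeholder pricing: 500 tokens per call at 1.0 currency units per million.
cost = estimated_cost(calls, tokens_per_call=500, price_per_million_tokens=1.0)
print(calls, half_pop_calls, seconds_per_call, cost)
```

Because the call count is the product of population size and generation count, reducing either one reduces the cost proportionally, which is why tuning both is the first lever available to practitioners.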

6.7. Ethical Considerations and Risk Mitigation

The use of LLMs for synthetic data generation raises important ethical concerns that must be addressed to ensure responsible deployment of the proposed framework. Regarding hallucination mitigation, the Safety Filter in the post-processing funnel (Section 4.3) applies a lower fidelity threshold $τ_{min} = 0.20$, automatically rejecting candidates whose SBERT score falls below the empirical upper bound of the noise distribution. This provides a systematic, metric-driven mechanism to discard semantically incoherent or hallucinated outputs before they enter the final dataset. Regarding bias propagation, because EVOLMD-MO evolves prompts conditioned on real reference texts, the generated outputs inherit the distributional biases present in the source corpus. In the context of disaster-related social media data, this may include linguistic biases, demographic underrepresentation, or the amplification of emotionally charged misinformation. We recommend that practitioners apply standard dataset auditing procedures including fairness metrics and demographic parity checks to synthetic outputs prior to downstream use. Regarding misuse prevention, synthetic text generation tools carry inherent dual-use risks, including the potential creation of misleading content at scale. We recommend that any deployment of EVOLMD-MO be governed by responsible AI usage policies, and that all synthetic datasets generated by the framework be clearly labeled as machine-generated in any downstream application or publication.

6.8. Limitations and Downstream Validation

A key limitation of the current evaluation is its exclusive reliance on intrinsic metrics, SBERT-based semantic fidelity and embedding-based diversity, to assess the quality of the generated synthetic data. While these metrics are well-established proxies in the synthetic data generation literature [25], they do not directly demonstrate that improvements in fidelity and diversity translate into measurable performance gains in downstream machine learning tasks. It is theoretically possible for a synthetic dataset to score highly on both intrinsic metrics while providing limited benefit, or even introducing noise, when used for model training or data augmentation. To address this limitation, future work will include extrinsic evaluation experiments in which synthetic datasets generated by EVOLMD-MO are used to augment training data for downstream classifiers, such as disaster event detection, sentiment analysis, or intent classification models, and performance is assessed using standard metrics including accuracy, F1-score, and AUC-ROC. This downstream validation is essential to establish the practical utility of the framework and to confirm that the Pareto-optimal prompts discovered by the evolutionary process generate data that is not only intrinsically diverse but also genuinely useful for real-world learning systems.

7. Conclusions

This paper introduces EVOLMD-MO, a multi-objective evolutionary framework for automated prompt learning aimed at synthetic text data generation using Large Language Models under a black-box setting. By formulating prompt optimization as a population-based search process and explicitly modeling semantic fidelity and generative diversity as competing objectives, the proposed approach enables the discovery of high-quality prompts without requiring model fine-tuning, gradient access, or manual prompt engineering. The integration of a Genetic Algorithm with a multi-agent architecture allows the framework to efficiently explore the discrete and non-differentiable prompt space while preserving semantic coherence through LLM-driven mutation operators. Experimental results on large-scale, real-world social media data demonstrate that the evolutionary process consistently improves prompt quality over successive generations and converges to stable solutions that achieve a meaningful trade-off between similarity to the reference distribution and output diversity. These results validate the effectiveness of multi-objective evolutionary prompt learning as a practical and general strategy for black-box, LLM-driven synthetic data generation.

Overall, this work contributes a modular and reproducible framework for synthetic data generation that supports downstream learning systems, particularly in adaptive stream processing scenarios where robustness to data variability is critical. By formalizing prompt evolution as an optimization mechanism, the approach highlights a new paradigm in which LLMs act not only as generators but also as controllable components within data-centric learning pipelines.

Future work will focus on extending the framework in four main directions. First, and most critically, we plan to conduct downstream task evaluations in which the quality of the synthetic datasets generated by EVOLMD-MO is measured by its direct impact on the performance of machine learning models trained on the augmented data. Specifically, we will assess whether synthetic data produced by Pareto-optimal prompts improves classification accuracy, robustness to distribution shift, and generalization in low-resource settings, providing empirical evidence of utility beyond proxy metrics. Second, we will evaluate EVOLMD-MO across a broader set of textual domains and datasets, including medical records, legal documents, and financial reports, to rigorously assess its robustness and formally establish its domain independence beyond the disaster-related social media context explored in this work. Third, we will investigate the integration of task-aware fitness functions, where the quality of synthetic data is measured based on its impact on downstream learning tasks. Fourth, we aim to explore more efficient evolutionary strategies, including surrogate fitness models and adaptive population control, to reduce the number of required LLM evaluations and improve scalability. Collectively, these directions position EVOLMD-MO as a foundational step toward fully autonomous, scalable, and controllable synthetic data generation systems driven by large language models.

Author Contributions

Conceptualization, D.P., N.H., V.R. and E.R.; Methodology, N.H., V.R. and E.R.; Software, D.P.; Validation, D.P.; Formal analysis, N.H. and V.R.; Investigation, D.P., N.H., V.R. and E.R.; Resources, D.P.; Data curation, D.P.; Writing—original draft, D.P., N.H., V.R. and E.R.; Writing—review & editing, N.H. and V.R.; Visualization, D.P. and V.R.; Supervision, N.H., V.R. and E.R.; Project administration, N.H. and E.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fondecyt Iniciación 11230225 and Proyecto Enlace UDP 2025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Victor Reyes thanks Fondecyt Iniciación 11230225. The authors also thank STIC-AmSud ITERATION-D, project number 24-STIC-13, and Proyecto Enlace UDP 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hidalgo, N.; Wladdimiro, D.; Rosas, E. Self-adaptive processing graph with operator fission for elastic stream processing. J. Syst. Softw. 2017, 127, 205–216.
2. Russo Russo, G.; Cardellini, V.; Lo Presti, F. Hierarchical auto-scaling policies for data stream processing on heterogeneous resources. ACM Trans. Auton. Adapt. Syst. 2023, 18, 1–44.
3. Russo, G.R.; D’Alessandro, E.; Cardellini, V.; Presti, F.L. Towards a Multi-Armed Bandit Approach for Adaptive Load Balancing in Function-as-a-Service Systems. In Proceedings of the 2024 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), Aarhus, Denmark, 16–20 September 2024; pp. 103–108.
4. Wladdimiro, D.; Arantes, L.; Sens, P.; Hidalgo, N. PA-SPS: A predictive adaptive approach for an elastic stream processing system. J. Parallel Distrib. Comput. 2024, 192, 104940.
5. Shin, T.; Razeghi, Y.; Logan, R.L., IV; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020.
6. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021.
7. Fernando, C.; Banarse, D.; Michalewski, H.; Osindero, S.; Rocktäschel, T. PromptBreeder: Self-Referential Self-Improvement Via Prompt Evolution. arXiv 2023, arXiv:2309.16797.
8. Guo, Q.; Wang, R.; Wang, J.; Li, B.; He, K.; Tan, X.; Bian, J.; Zheng, Y. EvoPrompt: Connecting Large Language Models with Evolutionary Algorithms for Prompt Engineering. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
9. Hidalgo, N.; Saez, P.; Meneses, N.; Reyes, V.; Rosas, E. Prompt’s Evolution for Language Model-Driven Data Generation. Appl. Sci. 2025, 15, 12911.
10. Zhang, H.; Zhang, H.; Liu, Z.; Gu, B.; Chang, Y. ZO-PoG: Collaborative Discrete-Continuous Black-Box Prompt Learning for Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
11. Zhang, J.; Yu, S.; Chong, D.; Sicilia, A.; Tomz, M.; Manning, C.; Shi, W. Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity. arXiv 2025, arXiv:2510.01171.
12. Yang, K.; Liu, Z.; Xie, Q.; Huang, J.; Zhang, T.; Ananiadou, S. MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models. Adv. Neural Inf. Process. Syst. 2024, 37, 34453–34486.
13. Zhao, G.; Yoon, B.J.; Park, G.; Jha, S.; Yoo, S.; Qian, X. Pareto Prompt Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
14. Menchaca Resendiz, Y.; Klinger, R. MOPO: Multi-objective prompt optimization for affective text generation. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 5588–5606.
15. Baumann, J.; Kramer, O. Evolutionary Multi-Objective Optimization of Large Language Model Prompts for Balancing Sentiments. In Applications of Evolutionary Computation. EvoApplications 2024; Springer: Cham, Switzerland, 2024.
16. Deniz, A.; Angin, M.; Angin, P. Evolutionary Multiobjective Feature Selection for Sentiment Analysis. IEEE Access 2021, 9, 142982–142996.
17. Leung, M.F.; Wang, J. A collaborative neurodynamic approach to multiobjective optimization. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5738–5748.
18. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
19. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992.
20. Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 110–119.
21. Tevet, G.; Berant, J. Evaluating the Evaluation of Diversity in Natural Language Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 326–346.
22. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
23. Friedman, D.; Dieng, A.B. The Vendi Score: A Diversity Evaluation Metric for Machine Learning. Trans. Mach. Learn. Res. 2023. Available online: https://www.scopus.com/inward/record.uri?eid=2-s2.0-86000643949&partnerID=40&md5=b49386c3214dc96c83717ba286a41b0a (accessed on 1 April 2026).
24. Limbeck, K.; Andreeva, R.; Sarkar, R.; Rieck, B. Metric Space Magnitude for Evaluating the Diversity of Latent Representations. Adv. Neural Inf. Process. Syst. 2024, 37, 123911–123953.
25. Zhu, Y.; Zhang, H.; Wu, B.; Li, J.; Zheng, Z.; Zhao, P.; Chen, L.; Bian, Y. Measuring Diversity in Synthetic Datasets. arXiv 2025, arXiv:2502.08512.
26. Taherdoost, H.; Madanchian, M. Multi-Criteria Decision Making (MCDM) Methods and Concepts. Encyclopedia 2023, 3, 77–87.
27. Merkepçi, H. Impact of the Objective Attribute Weighting on Five Popular Multicriteria Decision-Making Methods: An Empirical Study. Eskiseh. Tech. Univ. J. Sci. Technol. A - Appl. Sci. Eng. 2024, 25, 456–470.
28. Carbonell, J.; Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 335–336.
29. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
30. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623.
31. Google. Gemini 3 Pro: A Multimodal Language Model. Version Used as the Automated Evaluator (LLM-as-a-Judge). 2025. Available online: https://deepmind.google/technologies/gemini/ (accessed on 1 January 2026).
32. Lamsal, R. Coronavirus (COVID-19) Tweets Dataset. 2020. Available online: https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset (accessed on 1 December 2025).

Figure 1. The architecture of the proposed EVOLMD-MO framework. The system is organized into three distinct layers: (1) the LLM Interaction Layer, which abstracts black-box communications via specialized agents; (2) the EVOLMD-MO Engine, implementing the NSGA-II algorithm to simultaneously optimize for semantic fidelity and diversity; and (3) the Post-Processing Funnel, a multi-stage selection pipeline (Safety Filter, TOPSIS, MMR) used to refine the raw Pareto front into the final synthetic dataset.

Figure 2. Scaling validation of the diversity monitor metrics across distinct real-world population states of size $N = 100$. Both K-Means Inertia and Entity Entropy perfectly track the gradient from mode collapse to organic diversity, confirming their reliability and discrimination power beyond toy examples.

Figure 3. The semantic dispersion of the four random seeds within the 2D projected space of the COVID-19 Reference Phrase Dataset. The distinct locations of the seeds validate their structural representativeness across different semantic zones.

Figure 4. Analysis of Scenario 1 representative run (Social).

Figure 5. Analysis of Scenario 2 representative run (Mental Health).

Figure 6. Analysis of Scenario 3 representative run (Corporate).

Figure 7. Analysis of Scenario 4 representative run (Creative).

Figure 8. Fidelity vs. Diversity: Pareto frontier of semantic fidelity versus generative diversity for Scenario 1.

Figure 9. Fidelity vs. Diversity: Pareto frontier of semantic fidelity versus generative diversity for Scenario 2.

Figure 10. Fidelity vs. Diversity: Pareto frontier of semantic fidelity versus generative diversity for Scenario 3.

Figure 11. Fidelity vs. Diversity: Pareto frontier of semantic fidelity versus generative diversity for Scenario 4.

Figure 12. Entity Entropy through generations on MO and Baseline.

Figure 13. Global Diversity through generations on MO and Baseline.

Table 1. System and User Prompt Templates for EVOLMD-MO Agents.

Agent	System Prompt	User Prompt
Init Agent	You are an AI prompt engineer. Create one high-quality prompt aligned with the reference text, role, and topic.	Reference text: {ref} Role: {role} Topic: {topic}
Data Agent	You are an AI text generator. Produce ONE short output text that resembles the reference in style and tone, but adapted to the given role, topic, and keywords.	Ref: {ref} Role: {role} Topic: {topic} Keywords: {kws} Prompt: {prompt}
Mutation *	You are a creative AI assistant. Perform a MUTATION by selecting a creative and semantically valid synonym for the given word.	Context: {role, topic, kws} Parameter: {param} Word: {original} Options: {synonyms}

* The Mutation Agent also includes a Prompt Regeneration step to ensure syntax coherence after modifications.

Table 2. Comparison of fidelity metrics for the topic “Economy”.

Reference (Ref):
Inflation has reached a 40-year high, causing the central bank to raise interest rates aggressively.
Candidates	Oracle	SBERT	BERT
The central bank is raising interest rates aggressively because inflation hit a 40-year peak.	0.97	0.94	0.85
Prices are rising fast, so the bank is making it more expensive to borrow money.	0.80	0.52	0.62
The stock market is volatile due to uncertainty in global markets.	0.30	0.20	0.54
Basketball is a popular sport in the United States.	0.00	0.17	0.42
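As a worked example of how the error against the Oracle can be computed, the sketch below evaluates the mean absolute error (MAE) of each metric on the four candidates of Table 2. It uses only these four rows, so it is not expected to reproduce the per-level values of Table 3, which are computed on a larger evaluation set.

```python
# Scores copied from Table 2 (topic "Economy")
oracle = [0.97, 0.80, 0.30, 0.00]
sbert = [0.94, 0.52, 0.20, 0.17]
bertscore = [0.85, 0.62, 0.54, 0.42]


def mean_absolute_error(reference, predicted):
    """Average absolute deviation between two equally long score lists."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


sbert_mae = mean_absolute_error(oracle, sbert)
bertscore_mae = mean_absolute_error(oracle, bertscore)
print(round(sbert_mae, 3), round(bertscore_mae, 3))
```

On these four rows SBERT deviates from the Oracle by about 0.145 on average and BERTScore by about 0.240, with the largest BERTScore error on the unrelated "Basketball" sentence.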

Table 3. Mean Absolute Error (MAE) broken down by similarity level (Lower is better).

Similarity Level	SBERT MAE	BERTScore MAE	Best Performance
High (Reference)	0.0806	0.1818	SBERT
Medium-High (Paraphrase)	0.1517	0.1424	Tie (Slight BERT)
Medium-Low (Contextual)	0.1013	0.1730	SBERT
Low (Noise)	0.0497	0.3659	SBERT

Table 4. Positional Distribution Analysis of SBERT Sensitivity. The table summarizes the SBERT score statistics for each quality tier defined by the Oracle, confirming a clear semantic degradation across levels.

Quality Tier (Oracle)	Mean ( $μ$ )	Std Dev ( $σ$ )	Min	Max
1. High (Ref Paraphrase)	0.8564	0.0712	0.68	0.94
2. Med-High (Coherent)	0.5298	0.1049	0.38	0.76
3. Med-Low (Tangential)	0.2203	0.0911	0.06	0.44
4. Low (Noise/Hallucination)	0.0331	0.0653	−0.08	0.18

Table 5. Computational efficiency comparison (N = 80 evaluations).

Metric	Total Time (s)	Speedup Factor
BERTScore (Baseline)	32.2673	1.0×
SBERT (Proposed)	0.9430	34.22×

Table 6. Test Population: Mode Collapse.

Generated Text
The hospital is completely overwhelmed with patients.
There are no more beds available at the hospital.
The hospital has reached its maximum capacity.
The emergency room is totally full and cannot accept new people.
It’s a crisis; the hospital is overwhelmed.

Table 7. Test Population: Chaotic Diversity (Irrelevant).

Generated Text
The hospital is completely overwhelmed with patients.
I think I will adopt a dog tomorrow.
The weather in Chile is very sunny this time of year.
Software engineering is a complex but rewarding field.
The new city budget includes funding for parks.

Table 8. Test Population: Ideal Diversity (Faithful and Diverse).

Generated Text

The economic fallout from the global lockdowns was severe, with small businesses closing.
Many people reported feelings of isolation and anxiety due to the prolonged lockdowns.
Governments struggled to balance public health guidelines with the need to keep the economy moving.
The switch to remote work was a major consequence of the pandemic’s stay-at-home orders.
Supply chains were disrupted worldwide because of the global halt in transportation.

Table 9. Metric sensitivity across different dispersion levels.

Scenario	Global Inertia	Detected State
Mode Collapse	0.1539	High Redundancy
Coherent Diversity	0.2225	Moderate Dispersion
Chaotic Diversity	0.3732	High Dispersion

Table 10. Redundancy discrimination in mixed population.

Type	Candidate Text (Truncated)	Novelty
Redundant	Artificial Intelligence is transforming industries…	0.4033
Redundant	Industries are being transformed by AI automation…	0.3802
Redundant	The transformation of industries is driven by AI…	0.4133
Innovative	Machine learning models require vast amounts of data…	0.6702
Innovative	Ethical concerns about AI bias are growing among…	0.6120

Table 11. Comprehensive Configuration and Hyperparameters (EVOLMD-MO).

Parameter	Value	Description
Population Size (N)	100	Number of individuals per generation.
Number of Generations	100	Total iterations of the evolutionary process.
Crossover Probability ( $P_{c}$ )	0.8	Genetic recombination rate (80%).
Mutation Probability ( $P_{m}$ )	0.05	Random variation rate (5%).
Lower Threshold ( $τ_{m i n}$ )	0.20	Filtering boundary for hallucinations.
Upper Threshold ( $τ_{m a x}$ )	0.95	Filtering boundary for lexical duplicates.
Diversity Weight ( $λ$ )	0.35	MMR parameter for semantic redundancy.
Final Selection (K)	5	Number of solutions returned to the user.

Table 12. Pareto Front Quality Indicators: Baseline vs. EVOLMD-MO (Gen 100). The single-objective Baseline metrics were calculated post hoc for comparison.

Scenario	Baseline Hypervolume (HV)	Baseline Spacing (SP)	EVOLMD-MO Hypervolume (HV)	EVOLMD-MO Spacing (SP)
1	0.1557	0.0126	0.6906	0.0124
2	0.1920	0.0160	0.6055	0.0099
3	0.1325	0.0079	0.7274	0.0095
4	0.3165	0.0214	0.5897	0.0105
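For two maximized objectives, the hypervolume indicator can be read as the area dominated by a front relative to a reference point. The sketch below assumes a reference point at the origin and uses hypothetical toy points; the reference point and any normalization used in the experiments are not stated in this section, so the values of Table 12 are not expected to be reproduced.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by (f1, f2) points, both maximized, relative to ref.

    ASSUMPTION: every point is at least as good as ref on both objectives.
    """
    # Sweep from the highest f1 downwards, adding each new horizontal strip.
    ordered = sorted(points, key=lambda p: p[0], reverse=True)
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in ordered:
        if f2 > best_f2:
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area


# Hypothetical fronts: a single extreme point versus a two-point trade-off.
single_point = [(1.0, 0.5)]
toy_front = [(1.0, 0.5), (0.5, 1.0)]
print(hypervolume_2d(single_point), hypervolume_2d(toy_front))
```

Adding a second non-dominated point enlarges the dominated area, which is the intuition behind a front that spans the trade-off region scoring a higher hypervolume than a population concentrated at one end.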

Table 13. Average Fitness and Diversity Metrics of the Final Populations (Gen 100). Baseline metrics were computed post hoc for direct comparison.

Scenario	Baseline Avg Fidelity ( $F_{1}$ )	Baseline Avg Diversity ( $F_{2}$ )	EVOLMD-MO Avg Fidelity ( $F_{1}$ )	EVOLMD-MO Avg Diversity ( $F_{2}$ )
1	0.5040	0.1721	0.4632	0.6524
2	0.5328	0.1776	0.5321	0.6188
3	0.7166	0.1108	0.4469	0.7048
4	0.4610	0.3388	0.3894	0.7414

Table 14. Detailed Execution Time Comparison (Baseline vs. EVOLMD-MO). The proposed method achieves a reduction in computational cost on average, despite optimizing multiple objectives.

Scenario	Run	Baseline (s)	EVOLMD-MO (s)	Improvement (%)
Scenario 1	1	30,766	29,020	5.68%
Scenario 1	2	29,310	24,605	16.05%
Scenario 1	3	30,708	25,244	17.79%
Scenario 2	1	32,683	24,590	24.76%
Scenario 2	2	26,568	27,416	−3.19%
Scenario 2	3	28,935	25,457	12.02%
Scenario 3	1	26,264	22,880	12.88%
Scenario 3	2	22,910	25,542	−11.49%
Scenario 3	3	23,169	24,313	−4.94%
Scenario 4	1	23,667	21,987	7.10%
Scenario 4	2	27,287	23,935	12.28%
Scenario 4	3	22,923	23,325	−1.75%
Global Average		27,099	24,859	8.26%
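
The Improvement column can be reproduced from the raw timings as (Baseline − EVOLMD-MO) / Baseline. The short sketch below recomputes it from the values in Table 14; the variable and function names are illustrative. It also shows that the global figure of 8.26% corresponds to the improvement between the two mean runtimes, not to the mean of the twelve per-run percentages, which comes out lower at roughly 7.3%.

```python
# (baseline_seconds, evolmd_mo_seconds) per run, copied from Table 14.
runs = {
    "Scenario 1": [(30766, 29020), (29310, 24605), (30708, 25244)],
    "Scenario 2": [(32683, 24590), (26568, 27416), (28935, 25457)],
    "Scenario 3": [(26264, 22880), (22910, 25542), (23169, 24313)],
    "Scenario 4": [(23667, 21987), (27287, 23935), (22923, 23325)],
}


def improvement_pct(baseline_s, evolmd_s):
    """Relative runtime reduction in percent; negative means EVOLMD-MO was slower."""
    return 100.0 * (baseline_s - evolmd_s) / baseline_s


all_runs = [pair for pairs in runs.values() for pair in pairs]

# Global row: improvement between the two mean runtimes.
mean_baseline = sum(b for b, _ in all_runs) / len(all_runs)
mean_evolmd = sum(e for _, e in all_runs) / len(all_runs)
global_improvement = improvement_pct(mean_baseline, mean_evolmd)

# Alternative aggregate: mean of the per-run percentages.
mean_of_run_improvements = sum(
    improvement_pct(b, e) for b, e in all_runs
) / len(all_runs)

slower_runs = sum(1 for b, e in all_runs if e > b)
```

The two aggregates differ because the ratio of means weights longer runs more heavily, which is worth keeping in mind when comparing the global figure with individual rows.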

Table 15. Selection of candidates generated for the mental health scenario (Scenario 2).

#	Text Generated by EVOLMD-MO	Fidelity	Diversity
1	Quarantining has amplified my depression’s symptoms, as my usual routine and work have been disrupted. The constant stream of corona-related updates on social media is overwhelming, causing me to avoid my phone to mitigate the anxiety.	0.8429	0.5196
2	In this swirling vortex of chaos, a city wall once stood tall, now breached by waves of misinformation. False reports writhe like snakes, their hissing whispers fueling fear and anxiety. As an artist, I’m trapped in this digital purgatory, watching the world descend into madness. Frustration and panic seep through my pixels, as vivid colors bleed together, a cityscape mural that mirrors the turmoil within.	0.2521	0.7723
3	As a first responder, seeing misinformation spreading like wildfire is concerning me more than the actual virus. Travel restrictions are necessary for our safety, but with false reports of infected areas, it’s hard to know what’s real. Insurance companies need accurate data to provide fair coverage—I worry about the long-term impact on our economy if we can’t trust the information.	0.3365	0.7195
4	Quarantine life has me scrambling for reliable info on paw hygiene amidst the chaos. As an online schooler, I’m trying not to lose my mind while keeping Fido’s paws clean. Here are some sanity-saving tips: Wash those paws daily with mild soap and warm water. Dry thoroughly to prevent bacterial growth. Trim nails regularly to avoid dirt accumulation. Sanitize high-touch areas like door handles and toys.	0.2720	0.7578
5	I completely understand how you’re feeling right now—it’s like a never-ending nightmare! As someone who’s struggled with depression myself, I know how easy it is to get sucked into the vortex of false reports and misinformation about COVID-19. But here’s the thing: there is hope! By seeking out reliable information from trusted sources, you can start to regain control over your life. Don’t be afraid to reach out to your insurance provider or a mental health professional for support—they’re there to help you through this tough time.	0.4751	0.6808

5.2.3. Macroscopic Scaling Validation on Realistic Populations ( $N = 100$ )