Article

A Novel Llama 3-Based Prompt Engineering Platform for Textual Data Generation and Labeling †

by
Wedyan Salem Alsakran
and
Reham Alabduljabbar
*
Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
*
Author to whom correspondence should be addressed.
An earlier version of this work, focusing on the conceptual exploration of the platform design, was published in the proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM): Alsakran, W.; Alabduljabbar, R. Exploring the Potential of LLMs and Attributed Prompt Engineering for Efficient Text Generation and Labeling. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 244-252, doi: 10.1109/FLLM63129.2024.10852475. This article significantly extends that work by presenting the full system development, implementation, and detailed evaluation.
Electronics 2025, 14(14), 2800; https://doi.org/10.3390/electronics14142800
Submission received: 11 June 2025 / Revised: 4 July 2025 / Accepted: 8 July 2025 / Published: 11 July 2025
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)

Abstract

With the growing demand for labeled textual data in Natural Language Processing (NLP), traditional data collection and annotation methods face significant challenges, such as high cost, limited scalability, and privacy constraints. This study presents a novel web-based platform that automates text data generation and labeling by integrating Llama 3.3, an open-source large language model (LLM), with advanced prompt engineering techniques. A core contribution of this work is the Attributed Prompt Engineering Framework, which enables modular and configurable prompt templates for both data generation and labeling tasks. This framework combines zero-shot, few-shot, role-based, and chain-of-thought prompting strategies within a unified architecture to optimize output quality and control. Users can interactively configure prompt parameters and generate synthetic datasets or annotate raw data with minimal human intervention. We evaluated the platform using both benchmark datasets (AG News, Yelp, Amazon Reviews) and two fully synthetic datasets we generated (restaurant reviews and news articles). The system achieved 99% accuracy and F1-score on generated news article data, 98% accuracy and F1-score on generated restaurant review data, and 92%, 90%, and 89% accuracy and F1-scores on the benchmark labeling tasks for AG News, Yelp Reviews, and Amazon Reviews, respectively, demonstrating high effectiveness and generalizability. A usability study also confirmed the platform’s practicality for non-expert users. This work advances scalable NLP data pipeline design and provides a cost-effective alternative to manual annotation for supervised learning applications.

1. Introduction

In the field of Artificial Intelligence (AI), obtaining high-quality labeled data remains a major challenge [1]. The increasing reliance on Natural Language Processing (NLP) applications—such as machine translation, sentiment analysis, and content generation—has intensified the demand for large-scale, accurately labeled textual datasets. Supervised Machine Learning (ML) models, in particular, require vast volumes of such data for training, testing, and validation. These datasets span various domains and labeling schemes, including categories such as education, politics, and entertainment; sentiment polarity (positive, negative, neutral); and binary labels for tasks such as spam detection or phishing classification.
Current manual data collection methods—such as scraping, surveys, user feedback forms, and interviews—are time-consuming, resource-intensive, and often limited in scalability [2]. Similarly, data labeling typically involves expert annotators or crowdsourced workers. While trained annotators produce high-quality labels, their involvement is costly and time-consuming [3]. Crowd workers offer a more affordable alternative but often sacrifice label consistency and accuracy [4].
Traditional approaches face additional limitations, such as inefficiency, bias, and labor intensity [1,2]. A significant obstacle is data scarcity [5], especially in privacy-sensitive domains such as healthcare and finance, where access to data is restricted by legal and ethical considerations.
To overcome these challenges, there is an increasing shift toward automation using Generative AI—particularly large language models (LLMs). Models such as OpenAI’s ChatGPT-3.5 and ChatGPT-4, and Google’s Gemini have demonstrated human-like capabilities in Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks [1]. These include applications such as machine translation, creative writing, question answering, and code generation [6]. Their ability to synthesize labeled data significantly reduces the time, cost, and manual effort traditionally required for dataset construction.
In this study, we introduce a web-based platform designed to automate English text data generation and labeling. The platform leverages the capabilities of Llama 3.3, an open-source LLM developed by Meta AI [7], and integrates multiple prompt engineering techniques—including zero-shot, few-shot, role-based, and chain-of-thought prompting. A central innovation of our approach is the use of attributed prompts, which allow users to specify structured preferences such as style, tone, and classification labels to control the model’s output [8].
The system is evaluated on both synthetic and benchmark datasets, including AG News, Yelp Reviews, and Amazon Reviews. We assess its performance using common evaluation metrics—accuracy, precision, recall, and F1-score—to measure its effectiveness in generating and labeling high-quality text data.
To the best of our knowledge, there is currently no comprehensive platform that integrates both data generation and data labeling processes using AI tools. For example, LLMs such as ChatGPT [9] and Llama [6] are primarily designed for text generation and conversational tasks, but they are not explicitly designed for creating labeled datasets. On the other hand, platforms such as Refuel AI [10] focus on automated data labeling by leveraging fine-tuned LLMs but do not support synthetic data generation as part of their functionality. Our proposed platform is uniquely developed to automate both processes, data generation and data labeling, within a single unified system, allowing users to generate and label customized datasets and thereby streamlining dataset construction for training machine learning models.
Our contributions are threefold:
  • Automated platform for data generation and labeling: a user-friendly web platform that enables users to generate labeled datasets or annotate their own data by specifying task parameters through an interactive interface.
  • Integrated prompt engineering module: a novel framework that combines attributed prompts with multiple prompting techniques (zero-shot, few-shot, CoT, role-based) to enhance control and task alignment.
  • Synthetic dataset release: two fully synthetic, high-quality benchmark datasets were created using Llama 3.3:
    • LLM-generated restaurant reviews: 6028 sentiment-labeled reviews.
    • LLM-generated news: 6141 news articles labeled by topic (World, Sports, Sci_Tech, and Business).
These contributions offer scalable solutions for NLP model development and provide tools that benefit the broader ML and data science communities.
The remainder of this paper is structured as follows: Section 2 provides a review of related literature and theoretical foundations. Section 3 outlines the system methodology and design. Section 4 presents the implementation details. Section 5 reports on evaluation procedures, experimental results, and subsequent discussion. Finally, Section 6 addresses this study’s limitations, outlines directions for future research, and concludes the paper.

2. Background and Related Work

2.1. LLMs for Data Labeling

Labeled data is critical for training supervised models in NLP, but traditional annotation methods are costly, inconsistent, and difficult to scale. To address these limitations, researchers have increasingly explored the use of large language models (LLMs) for automating data labeling and generation.
Wang et al. [1] demonstrated that GPT-3 can reduce labeling costs by 50–96% while maintaining human-level accuracy across multiple tasks. Kojima et al. [11] introduced the “Zero-Shot Chain-of-Thought” technique to improve LLM reasoning without training examples, showing dramatic gains on reasoning benchmarks.
He et al. [12] proposed AnnoLLM, an explain-then-annotate framework using GPT-3.5, which outperformed crowd workers in semantic labeling tasks. Similarly, Alizadeh et al. [13] benchmarked ChatGPT and open-source LLMs against human annotators, finding that open models outperformed MTurk in several tasks and approached proprietary model performance.
Ding et al. [14] compared prompt-guided annotation with generation-based labeling, concluding that direct labeling is effective for tasks with small label sets, while generation is better suited for complex or large-label tasks. Gilardi et al. [4] confirmed that ChatGPT outperforms human annotators in relevance, stance, and topic detection—all while reducing annotation costs to USD 0.003 per label.
Shushkevich et al. [15] demonstrated the utility of LLMs for augmenting fake news classification datasets, though performance varied by class type. Trad and Chehab [16] compared prompt engineering and fine-tuning techniques, concluding that while prompt engineering is flexible, fine-tuning offers superior performance in phishing URL classification tasks. Table 1 provides a summary of LLMs used for text data labeling.

2.2. LLMs for Data Generation

Several studies have applied LLMs for synthetic dataset creation to alleviate data scarcity. Sahu et al. [17] addressed the issue of data scarcity by proposing the use of existing pretrained large language models (LLMs), such as GPT-3 and GPT-J, for data augmentation to generate labeled data for intent classification (IC) tasks. The experiments were conducted using four GPT-3 models, Ada, Babbage, Curie, and Davinci, developed by OpenAI [18], as well as GPT-J by EleutherAI [19]. The GPT-3 models are ordered from the smallest to the largest in terms of model size. The proposed approach was evaluated using a few-shot prompting technique with ten training examples across four intent classification datasets. The results demonstrated that the synthetic text data generated by GPT-3 improved classification accuracy, particularly when the intent categories were clearly well-defined.
Dai et al. [5] proposed AugGPT, a data augmentation framework using ChatGPT with few-shot examples, significantly enhancing BERT-based classification accuracy on low-resource datasets. Ubani et al. [20] applied task-specific zero-shot prompts to generate diverse samples in sentiment and question classification tasks, outperforming standard augmentation techniques.
Yang et al. [21] introduced “Tailor,” a multi-attribute prompting method using GPT-2, which required minimal additional parameters while achieving notable improvements in multi-attribute text generation. Yu et al. [22] advanced this idea by applying attributed prompts (length, style, etc.) with GPT-3.5 Turbo, reducing bias and generation cost.
Table 2 presents a summary of the studies that investigated LLMs as data generators, including the models used, prompt engineering techniques utilized in generating data, and key findings.

2.3. Prompt Engineering and Attributed Prompts

Prompt engineering is a technique to guide LLMs without modifying their parameters [23]. It includes the following:
  • Zero-shot prompting: describes the task only [8,20].
  • Few-shot prompting: provides input–output examples [24,25,26].
  • Chain-of-thought prompting (CoT): encourages step-by-step reasoning [27,28,29].
  • Role-based prompting: assigns persona or behavior to the model [16].
Attributed prompts go a step further by embedding user-defined constraints—e.g., tone, length, sentiment, or category—into the instruction format. This technique enhances control, diversity, and task specificity [21,22].
Example:
  • Simple prompt: “Write a restaurant review.”
  • Attributed prompt: “Write a 40–60-word, informal, positive review about a Mexican restaurant.”
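To make the contrast concrete, assembling such an attributed prompt can be sketched in a few lines of Python. The function and parameter names below are our own illustration, not part of any cited system:

```python
def build_attributed_prompt(domain, min_words, max_words, style, sentiment):
    """Embed user-selected attributes directly into the task instruction."""
    # Each attribute (length, style, sentiment, domain) becomes an
    # explicit constraint in the generated instruction text.
    return (f"Write a {min_words}-{max_words}-word, {style}, "
            f"{sentiment} review about a {domain} restaurant.")

prompt = build_attributed_prompt("Mexican", 40, 60, "informal", "positive")
# Produces the attributed prompt shown in the example above.
```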

2.4. Ethical Considerations

Although LLMs bring automation benefits, they pose risks related to misinformation, bias, toxicity, and misuse [27,30]. Zhang et al. [30] highlight prompt abuse, hallucinations, and multilingual bias. Hua et al. [27] raise concerns over originality, data privacy, and user safety. Törnberg et al. [31] propose best practices including data anonymization, structured prompt use, and ethical licensing. Table 3 presents the ethical considerations of utilizing LLMs to generate and label data.

3. Methodology and System Design

This section outlines the proposed platform’s methodology and technical architecture, including user interaction workflows, prompt engineering techniques, interface design, and system components. The platform is a web-based tool that supports two primary NLP tasks: textual data generation and textual data labeling, utilizing Llama 3.3 with attributed prompt engineering.

3.1. Data Generation and Labeling Technique

1. Objective
The objective of this technique is to generate and label high-quality textual data using the Llama 3.3 language model, attributed prompts, and structured prompt engineering techniques.
2. Attributed Prompts
Attributed prompts are used to guide data generation and labeling. Key parameters include word length, domain specification, classification type, label categories, and the number of samples to generate or annotate.
3. Prompt Engineering
Four prompting techniques are incorporated into the system:
  • Zero-shot.
  • Few-shot.
  • Chain-of-thought.
  • Role-playing.
Prompts are constructed dynamically using the LangChain framework (v0.3.25), incorporating user-defined preferences and attributes.
4. Data Generation
Constructed prompts are sent to the Llama model, which generates textual data along with corresponding labels. The generated data is displayed in real time via the user interface, allowing users to preview the output and download it in CSV or JSON format.
5. Data Labeling
For data labeling, user-provided data and prompt configurations are sent to Llama. The model returns the labeled data, which is displayed to the user through the same interface.
6. Validation and Storage
All generated or labeled data is presented to the user in real time for validation. Users can review the content before downloading it as CSV or JSON files.
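The platform constructs these prompts with LangChain; as an illustration, the same template logic can be sketched with Python's standard library alone. The template text and variable names below are our own assumptions, not the platform's actual templates:

```python
from string import Template

# Illustrative generation template combining role, attributes, and labels.
GENERATION_TEMPLATE = Template(
    "You are $role.\n"
    "Generate $n samples of $domain text, each $min_w-$max_w words long, "
    "labeled with one of: $labels.\n"
    "Think step by step and avoid duplicates."
)

prompt = GENERATION_TEMPLATE.substitute(
    role="an expert restaurant review writer",
    n=10,
    domain="restaurant review",
    min_w=20,
    max_w=50,
    labels="positive, negative",
)
```

In the deployed system, the equivalent step is performed by LangChain prompt templates populated from the interactive interface.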

3.2. Functional Overview

The platform enables users to configure settings through an interactive interface. These settings are transformed into attributed prompts using predefined templates and then processed by the LLM. The two supported workflows are:
A. Data generation workflow
  • Users choose the domain (e.g., restaurant reviews), output volume, word length, and optional class labels.
  • Optionally, users may include few-shot examples (up to 10) for guidance.
  • The system constructs an attributed prompt using the LangChain framework (v0.3.25).
  • Llama 3.3 70B generates the textual outputs, which are displayed for review and available for download.
B. Data labeling workflow
  • Users input raw data and choose the task (e.g., sentiment analysis, binary/multi-class classification, or NER).
  • Labels can be predefined or custom-defined (up to 10).
  • Few-shot examples can be added to improve LLM labeling accuracy.
  • The system classifies the data and returns structured results.
  • Feedback can be submitted for human-in-the-loop refinement.
Figure 1 and Figure 2 illustrate the architecture of data labeling and generation workflows implemented in the system.

3.3. Prompting Techniques and Attributed Prompts

The system applies advanced prompt engineering techniques to guide LLM behavior effectively [11,22,32]:
  • Zero-shot prompting: task-only prompts for general cases.
  • Few-shot Prompting: up to 10 labeled examples provided inline for context.
  • Chain-of-thought (CoT) prompting: encourages step-by-step reasoning using phrases such as “Let’s think step by step” [8]. CoT mimics the way humans break down complex tasks into smaller steps during problem-solving and has demonstrated notable improvements in tasks such as solving mathematical problems [25].
  • Role-playing prompting: assigns roles such as “food critic” or “expert annotator” to match task context.
These are enhanced through attributed prompts, which embed user-selected fields (e.g., domain, style, tone, text length) directly into the instructions. This approach improves specificity, control, and reproducibility [19]. Prompt templates are generated dynamically and formatted using LangChain logic. Figure 3 shows prompting template examples for data generation and labeling.

3.4. Evaluation Strategy

To validate the system’s effectiveness, a five-phase evaluation plan is implemented (see Section 5). It includes the following:
  • Automated benchmarking using labeled datasets.
  • Human-in-the-loop review of sampled outputs for quality.
  • Ethical bias and hallucination analysis of generated text.
  • Manual validation of output label accuracy and relevance.
  • Usability testing and user feedback collection.

3.5. Benchmark Datasets

The evaluation will utilize a combination of benchmark datasets for labeling and classification tasks. These datasets were selected due to their diversity in domain, structure, and complexity, as well as their frequent use in prior research, making them suitable for evaluating generalization and robustness.
The datasets are as follows:
  • AG News (AG’s News Corpus) [33] contains 30,000 training samples and 1900 test samples for each class. It includes the title and description fields of news articles from AG’s news corpus, categorized into four main classes: Sport, World, Business, Sci/Tech. This dataset was selected to evaluate the system performance in multi-class classifying with a small number of classes (labels).
  • Amazon Reviews dataset [34] consists of 13,800 training data samples, 230 validation samples, and 1130 testing samples of customer reviews for Amazon products. It comprises 23 categories. This dataset was chosen to assess system performance in multi-class classifying with a large number of classes (labels).
  • Yelp dataset [34] consists of 67,000 training samples and 38,000 test samples of restaurant reviews, categorized into two classes: positive and negative. This dataset was selected to represent and evaluate binary sentiment classification within the domain of business and social reviews.
These datasets are used to compare the system-generated outputs against ground truth labels and were selected to ensure diversity in classification tasks (including binary and multi-class classification) and in domains (such as news, e-commerce, and customer reviews). Furthermore, the complexity of the labeling facilitates a comprehensive evaluation of model performance under both few-shot and zero-shot configurations.

3.6. System Architecture

The system comprises five core modules:
  • LLM engine: uses Llama 3.3 (70B, open source) for generation and labeling.
  • User interface (UI): provides fields for task configuration, preview, feedback, and export.
  • Prompt engineering module: builds attributed prompts with user-defined settings.
  • Machine learning backend: handles communication between UI, prompt module, and LLM.
  • Evaluation module: validates the quality of generated and labeled outputs.
Figure 4 shows the system’s overall architecture.

3.7. User Interface Design

The interface is divided into two functional screens: Data Generation and Data Labeling. It supports the following:
  • Domain and classification type selection.
  • Custom label input.
  • Additional attributes (e.g., tone, volume, length).
  • Few-shot example entries (up to 10).
  • User/system role fields.
  • Dynamic prompt preview and system prompt formatting.
  • Output preview, download, and feedback submission.
The platform’s main user interface is shown in Figure 5. The data generation interface is illustrated in Figure 6, and the data labeling interface is illustrated in Figure 7. Figure 8 presents the data generation system prompt.

3.8. Backend Components

  • Prompt formatter: uses LangChain to merge user inputs into structured templates.
  • Request handler: sends prompts to the LLM and returns formatted responses.
  • Data validator: evaluates label consistency and output structure.
  • Feedback engine: captures human feedback to refine prompt strategies.
The system intentionally preserves Llama’s system messages to maintain ethical compliance and avoid misuse or manipulation.

4. Implementation

This section details the system development process, from environment setup and model evaluation to frontend/backend implementation, prompt logic, and synthetic dataset creation. Our goal was to deploy a flexible, scalable solution for prompt-engineered data generation and labeling using Llama 3.3.

4.1. Development Environment and Model Evaluation

The platform was initially developed in Google Colab (https://colab.research.google.com/ accessed on 9 July 2025) [35] and later deployed on Hugging Face Spaces (https://huggingface.co accessed on 9 July 2025) [36] for web-based access.

Model Variants Tested

During development, we evaluated three Llama variants for accuracy and efficiency:
  • Llama 3.2 3B Instruct.
  • Llama 3 8B Instruct.
  • Llama 3.3 70B Instruct (the final model used in deployment).
Although the smaller and earlier Llama 3 variants produced broadly similar outputs, we selected Llama 3.3 70B for its extended context window, which enables the generation and labeling of larger data batches. More importantly, in the classification task, when a large number of reviews (e.g., 200) were provided as plain text, each on a new line but without numerical identifiers (e.g., “1.”, “2.”), Llama 3.3 70B consistently distinguished between individual reviews. In contrast, smaller Llama versions sometimes failed to treat line breaks as boundaries between reviews, merging multiple reviews into a single input. We also opted for the most advanced, state-of-the-art version to ensure optimal performance. Table 4 presents the processing time and labeling accuracy of the Llama 3 variants during the data generation process within the same task.
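The boundary problem described above suggests a simple mitigation: pre-number the input lines before sending them to the model. A minimal sketch of such a preprocessing helper (our own assumption, not part of the deployed platform):

```python
def number_reviews(raw_text):
    """Prefix each non-empty line with a numeric identifier so the model
    treats every line as a separate review, even when the input has no
    explicit markers between items."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    return "\n".join(f"{i}. {ln}" for i, ln in enumerate(lines, start=1))

numbered = number_reviews("Great food\n\nSlow service")
# numbered == "1. Great food\n2. Slow service"
```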

4.2. Frontend Implementation

The frontend was built using Streamlit (v1.44.1) [37], offering a minimal, responsive UI. Users can select between labeling and generation services, configure tasks, and preview/download outputs. Figure 5, Figure 6 and Figure 7 in Section 3.7 illustrate the user interface components.

4.3. Backend and Prompt Logic

The backend logic manages the integration with Llama 3 through API calls, builds prompts dynamically using LangChain, and handles data persistence.

4.3.1. Prompt Engineering Techniques

We implemented the four most common prompting methods:
  • Few-shot prompting: Standard prompting technique where users can input up to 10 examples, mapped to their corresponding labels. Each label is ideally represented at least once, improving classification precision, helping the model understand the desired output format and task requirements, and enhancing the model’s ability to control hallucinations [26]. This is aligned with prior studies recommending 1–3 examples per class [17].
  • Zero-shot prompting: Standard prompting technique used when no examples are provided. The model relies entirely on instructions, system roles, and its prior knowledge. This technique is used when the task is straightforward and does not require explicit guidance.
  • Role-based prompting: The system auto-generates role messages using domain and classification context (e.g., “You are a restaurant critic”). Users can modify this as needed. By employing role-play prompting, the system can simulate realistic interactions and generate persona-driven data.
  • Chain-of-thought prompting: CoT logic is embedded in our system prompts to encourage stepwise reasoning, e.g., “Think step by step. Check for duplicates. Modify if necessary.”, or, for binary sentiment, “Restrict to Positive/Negative only. Lean toward dominant tone.”
LangChain Templates
We utilized LangChain templates to programmatically compose full system prompts, integrating:
  • Domain.
  • Classification type.
  • Few-shot examples.
  • Output count.
  • System/user roles.
  • Custom user instructions.
The final prompt is sent to Llama 3.3 via API. The response is parsed, displayed in real time, and saved as CSV or JSON using pandas (v2.2.3) and Python’s json module.
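The parse-and-export step can be sketched as follows. The platform uses pandas for this; the stdlib-only version below illustrates the same idea, and the “text | label” line format is an assumption rather than the system’s actual response format:

```python
import csv
import io
import json

def parse_and_export(llm_response, delimiter="|"):
    """Parse 'text | label' lines from a model response and return
    (csv_string, json_string). The line format is an assumption."""
    rows = []
    for line in llm_response.strip().splitlines():
        # rpartition keeps any earlier delimiter characters inside the text.
        text, _, label = line.rpartition(delimiter)
        rows.append({"text": text.strip(), "label": label.strip()})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue(), json.dumps(rows, indent=2)

csv_s, json_s = parse_and_export(
    "Great tacos | positive\nCold soup | negative"
)
```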

4.3.2. Synthetic Data Generation Pipeline

To validate system performance, we generated two synthetic benchmark datasets using Llama 3.3 with prompt tuning. The system-generated datasets are presented in Table 5.
The synthetic datasets were generated using the following settings:
  • Temperature: 0.7.
  • Min/max word limits.
  • Optional few-shot examples.
  • Role assignment in prompts (e.g., expert reviewer, journalist).
The output was validated, formatted as CSV, and visualized using word clouds. Both datasets were generated using prompt-tuned configurations.
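These settings can be bundled into a single configuration object. The sketch below uses illustrative parameter names under our own assumptions; it is not the platform's actual configuration schema:

```python
def make_generation_config(min_words, max_words, n_samples,
                           role, few_shot=None, temperature=0.7):
    """Bundle the sampling and prompt attributes used for synthetic
    data generation into one configuration dict."""
    cfg = {
        "temperature": temperature,   # 0.7: creative but mostly on-task
        "min_words": min_words,
        "max_words": max_words,
        "n_samples": n_samples,
        "role": role,                 # e.g., "expert reviewer", "journalist"
    }
    if few_shot:                      # optional few-shot examples
        cfg["few_shot"] = list(few_shot)
    return cfg

cfg = make_generation_config(20, 50, 100, "expert reviewer")
```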
A. Restaurant Reviews Workflow
To generate our dataset, we used different attributes and prompt engineering techniques to produce diverse reviews and avoid duplicated or repetitive data. First, we used our prompt-tuned Llama 3.3 70B model for the data generation process and set the system temperature to 0.7, which allows the model to be creative while still producing mostly reasonable, expected outputs. We then selected the data generation process, set the domain to restaurant reviews, and chose binary classification with positive and negative labels. The ‘Min words’ and ‘Max words’ attributes were set to different values in each run: ‘Min words’ defaults to 20 but allows values between 1 and 50, since real-world datasets contain one-word reviews, while ‘Max words’ defaults to 50 and allows up to 100 words. In some generation runs, we added few-shot examples. We then selected the number of examples to generate, ranging from 1 to 100 depending on the number of tokens in the model’s context window. We could also modify the system role to have the model adopt a persona, such as an angry customer, a happy customer, or an expert restaurant review writer. In the ‘User Prompt’ field, we sometimes included additional instructions or attributes, such as the type of cuisine, the menu, or special features of the restaurant, or added chain-of-thought instructions to guide the model toward the preferred review. Before generating, we reviewed the system prompt, which combines all user entries. Finally, we generated the data, saved each batch as a CSV file with its corresponding labels, and combined the batches into one CSV file, removing duplicates. The final dataset was saved as CSV and visualized with a word cloud, as shown in Figure 9.
B. News Dataset Workflow
To generate the news dataset, we followed an approach similar to the LLM-generated restaurant review workflow, using different attributes and prompt engineering techniques to produce diverse news articles and avoid duplicated or repetitive data. First, we used our prompt-tuned Llama 3.3 70B model for the data generation process and set the system temperature to 0.7. We then selected the data generation process, set the domain to ‘Custom’, and entered news in ‘Specify custom domain’. In ‘Number of classes’, we selected 4 and defined our news classes as ‘World’, ‘Sports’, ‘Business’, and ‘Sci_Tech’. The remaining steps mirror the restaurant review workflow, with one difference: we added extra instructions about the dataset and its classes to ensure that the model understood the dataset and generated outputs correctly with their corresponding labels. In some runs, we added the word news to the user prompt and asked the model to generate news accordingly. We then reviewed the system prompt, modified our entries where needed, generated the data, and saved it as CSV files, finally combining them into one CSV and removing duplicates. The final dataset was saved as CSV and visualized with a word cloud, as shown in Figure 10.

4.3.3. Data-Labeling Pipeline

Labeling follows a similar prompt construction process but targets existing user-provided data. The pipeline supports the following:
  • Task types: sentiment analysis, binary/multi-class classification, NER.
  • Domains: Restaurants, News, E-Commerce, Tourism.
  • Labels: fixed or user-defined (up to 10).
  • Prompt configuration: zero-/few-shot support, plus a custom role and user prompt.
  • Label format output: rendered in the UI and exportable.
Backend flow:
  • Merge user input and task type into a system prompt.
  • Send prompt to Llama 3.3
  • Parse labeled output.
  • Display in the UI.
  • Export as CSV.
Role-based instructions guide labeling behavior (e.g., “You are a sentiment analysis expert”), and iterative improvements are based on user feedback.
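The backend flow above can be sketched end to end with a stubbed model call. Here `call_llm` is a placeholder for the real Llama 3.3 API, and the one-label-per-line response format is our assumption:

```python
def label_pipeline(samples, task, labels, call_llm):
    """Sketch of the backend flow: build the system prompt, send it to
    the model via the injected `call_llm` callable, and pair one parsed
    label with each input sample."""
    system_prompt = (
        f"You are an expert annotator for {task}. "
        f"Assign exactly one of these labels to each line: {', '.join(labels)}."
    )
    user_prompt = "\n".join(samples)
    response = call_llm(system_prompt, user_prompt)
    predicted = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return list(zip(samples, predicted))
```

A stub standing in for the model makes the flow testable without any API access, which is also how human-in-the-loop feedback on parsing logic could be iterated cheaply.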

5. Evaluation, Results, and Discussion

This section presents a comprehensive evaluation of the proposed system through five structured phases. Each evaluation phase focuses on a specific aspect of performance or usability to ensure the robustness, accuracy, and practicality of the solution. Through these phases, we aim to validate not only the technical correctness of the outputs but also their usability and competitiveness in real-world applications.

5.1. Evaluation Framework Overview

To validate the effectiveness and robustness of our proposed platform, we designed a structured evaluation framework comprising five distinct evaluation phases. Each phase focuses on a different aspect of system performance, ranging from technical accuracy to usability and comparative benchmarking. Figure 11 presents the Evaluation Methodology Framework, which outlines the five evaluation phases, along with their corresponding evaluation methods and performance metrics. This structure ensures comprehensive coverage of both automated assessment techniques and human-centered evaluations.
The five evaluation phases are as follows:
  • Evaluation Phase A: Benchmark Dataset Labeling
    Assesses the system’s labeling performance on benchmark datasets such as AG News, Amazon Reviews, and Yelp Reviews using zero-shot or few-shot prompting techniques.
  • Evaluation Phase B: Generated Data Labeling (Labeling of the system-generated data)
    Evaluates the internal consistency of the system by measuring how accurately it can label the text it has generated, using structured prompts and self-labeling pipelines.
  • Evaluation Phase C: Data Generation Quality
    Assesses the effectiveness of the system in generating textual data based on user-defined prompts, incorporating few-shot examples and attributes such as domain and specified labels.
  • Evaluation Phase D: Comparative Analysis with GPT-4
    Benchmarks the system’s output quality and efficiency against GPT-4 by comparing labeling accuracy, time efficiency, and cost-effectiveness. The same dataset and prompts are used across both systems to ensure a fair comparison of the labeling process.
  • Evaluation Phase E: Usability Testing
    Evaluates the user experience through structured testing using the System Usability Scale (SUS). Qualitative feedback and user suggestions are also gathered to identify usability issues and opportunities for improvement.
This framework provides a balanced blend of objective performance metrics and subjective usability insights, ensuring the system’s practicality in real-world applications.

5.2. Evaluation Metrics

To assess annotation performance, we use precision, recall, and F1-score. These standard metrics evaluate how accurately the generated labels match ground truth annotations. For synthetic data quality, we assess fluency, relevance, and diversity based on expert judgment. Usability is evaluated using the System Usability Scale (SUS) across 10 Likert-scale items.
For classification and labeling tasks, the system computes the following metrics [38]:
  • Accuracy
The percentage of correctly predicted labels out of all predictions.
Formula:
Accuracy = Number of Correct Predictions / Total Number of Predictions
  • Precision
The ratio of correctly predicted positive results to the total number of predicted positive results.
Formula:
Precision = True Positives (TP) / (True Positives (TP) + False Positives (FP))
  • Recall
The percentage of correctly predicted positive labels out of all actual positive labels.
Formula:
Recall = True Positives (TP) / (True Positives (TP) + False Negatives (FN))
  • F1 Score
The harmonic mean of precision and recall, providing a single balanced measure of classification performance.
Formula:
F1 = 2 × (precision × recall)/(precision + recall)
  • Fleiss’ Kappa
A statistical measure of agreement among multiple annotators. It is applied in Evaluation Phase C to quantify how consistently the evaluators rated the quality of the generated data.
  • System Usability Scale (SUS)
A ten-item questionnaire used to evaluate user satisfaction and interface usability. Scores range from 0 to 100, with scores above 68 considered above average.
Together, these metrics offer a multi-dimensional view of the platform’s performance, covering technical accuracy, fluency, consistency, and usability.
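As noted above, the classification metrics are computed with Scikit-learn. A minimal, self-contained sketch with illustrative (hypothetical) labels for a binary sentiment task:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels for a binary sentiment task.
y_true = ["positive", "negative", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "negative"]

accuracy = accuracy_score(y_true, y_pred)
# Weighted averages, matching the weighted-average figures reported in the tables.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```

The same call with `average=None` returns the per-class values used in the per-class result tables.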

5.3. Evaluation Phase A: Benchmark Dataset Labeling

5.3.1. Methodology

The model was tested on three benchmark datasets: AG News, Amazon Reviews, and Yelp Reviews. Ground truth labels were compared with system-generated labels under zero-shot and few-shot prompt configurations. We computed accuracy, precision, recall, and F1-score for each class.
As part of our evaluation, we compared the performance of multiple prompting strategies across benchmark datasets. The strategies included zero-shot, few-shot, role-playing, and chain-of-thought. Each was tested in the context of different labeling tasks to determine which configuration achieved the best results for specific dataset types. This comparative approach helped us assess the impact of prompt selection on labeling accuracy.
This phase consists of three subtasks according to the nature of each provided benchmark dataset:
Subtask one
  • Task: sentiment analysis.
  • Dataset: the Yelp Reviews dataset containing 67,000 reviews with ground truth labels.
  • Model: prompt-tuned Llama 3.3 70B for labeling data.
  • The dataset consists of binary sentiments: positive and negative.
  • Predictions: the model predicts labels for 33,800 reviews.
  • Evaluation: compare predictions with ground truth labels and compute accuracy, precision, recall, and F1-score using Scikit-learn (v1.7.0rc1).
Subtask two
  • Task: topic classification.
  • Dataset: the AG News dataset containing 67,000 news articles with ground truth labels.
  • Model: prompt-tuned Llama 3.3 70B for labeling data.
  • The dataset consists of four news article categories: World, Sports, Business, and Sci_Tech.
  • Predictions: the model predicts labels for 500 news articles.
  • Evaluation: compare predictions with ground truth labels and compute accuracy, precision, recall, and F1-score using Scikit-learn (v1.7.0rc1).
Subtask three
  • Task: topic classification.
  • Dataset: Amazon Reviews dataset containing 13,291 reviews with ground truth labels.
  • Model: prompt-tuned Llama 3.3 70B for labeling data.
  • The dataset consists of 23 Amazon product review categories: magazines, camera and photos, office products, kitchen, cell phones service, computer video games, grocery and gourmet food, tools hardware, automotive, music album, health and personal care, electronics, outdoor living, video, apparel, toys games, sports outdoors, books, software, baby, musical and instruments, beauty, and jewelry and watches.
  • Predictions: the model predicts labels for all 13,291 reviews.
  • Evaluation: compare predictions with ground truth labels and compute accuracy, precision, recall, and F1-score using Scikit-learn (v1.7.0rc1).
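The batched labeling used in these subtasks (e.g., 100 news articles or 250 reviews per call) can be sketched as follows. The `call_llm` callable is a hypothetical stand-in for the platform’s Llama 3.3 API call, and the prompt wording is illustrative only:

```python
from typing import Callable, List

def label_in_batches(texts: List[str], labels: List[str],
                     call_llm: Callable[[str], List[str]],
                     batch_size: int = 100) -> List[str]:
    """Split the dataset into fixed-size batches, build one labeling prompt
    per batch, and collect the predicted labels returned by the model."""
    predictions: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = (
            "You are a text classification expert. "              # role-play cue
            f"Assign exactly one of the labels {labels} to each item:\n"
            + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(batch))
        )
        predictions.extend(call_llm(prompt))
    return predictions

# Stub model for demonstration: labels every item "positive".
def stub_model(prompt: str) -> List[str]:
    n_items = len(prompt.splitlines()) - 1  # first line is the instruction
    return ["positive"] * n_items

preds = label_in_batches(["great food", "slow service", "average"],
                         ["positive", "negative"], stub_model, batch_size=2)
print(preds)  # -> ['positive', 'positive', 'positive']
```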

5.3.2. Results

Tables and confusion matrices for each dataset were generated to visualize labeling performance.
A. AG News dataset
The results of the AG News dataset labeling show an accuracy of 92%, with a weighted average of 92% for precision, recall, and F1-score. The evaluation results of the AG News dataset labeling are presented in Table 6. Figure 12 illustrates the confusion matrices for the Business, Sci_Tech, Sports, and World classes.
B. The Yelp Reviews Dataset
The results of the Yelp Reviews dataset labeling show an accuracy of 90%, with a weighted average of 90% for precision, recall, and F1-score. The evaluation results of the Yelp Reviews dataset labeling are presented in Table 6. Figure 13 illustrates the confusion matrices for the positive and negative classes of the Yelp reviews.
C. Amazon Reviews Dataset
The results of the Amazon Reviews dataset labeling show an accuracy of 89%, with a weighted average of 92% for precision and a weighted average of 89% for recall and F1-score. The evaluation results of the Amazon Reviews dataset labeling are presented in Table 6. Figure 14 illustrates the confusion matrices for the following classes: ‘magazines’, ‘camera photo’, ‘office_products’, ‘kitchen’, ‘cell_phones_service’, ‘computer_video_games’, ‘grocery_and_gourmet_food’, and ‘tools_hardware’. Figure 15 illustrates the confusion matrices for the following classes: ‘automotive’,’music_album’, ‘health_and_personal_care’, ‘electronics’, ‘outdoor_living’, ‘video’, ‘apparel’, and ‘toys_games’. Figure 16 illustrates the confusion matrix for the following classes: ‘sports_outdoors’, ‘books’, ‘software’, ‘baby’, ‘musical_and_instruments’, ‘beauty’, ‘jewelry_and_watches’.

5.3.3. Analysis and Discussion

In the benchmark data labeling phase, we labeled three benchmark datasets. The best results were achieved on the AG News dataset, where labeling demonstrated an accuracy of 92%, with a weighted average of 92% for precision, recall, and F1-score.
The evaluation results for the Yelp Reviews and Amazon Reviews datasets also showed strong performance. For the Yelp dataset, labeling achieved an accuracy of 90%, with a weighted average of 90% for precision, recall, and F1-score. For the Amazon Reviews dataset, labeling achieved an accuracy of 89%, with a weighted average of 92% for precision, and a weighted average of 89% for both recall and F1-score.
Labeling AG News achieved the best results because of the nature of the dataset: we extracted a balanced subset of 500 news articles, 125 for each class, and used our system to classify the data, processing 100 items at a time. For the Yelp Reviews dataset, we classified 33,800 reviews, more than half of the full dataset, at 250 reviews at a time, which reduced the model’s performance when handling large amounts of data. We believe that labeling a balanced subset of the Yelp Reviews would have yielded better results than labeling all 33,800 reviews, especially since we observed that our system performs very well on sentiment analysis data.
The confusion matrix of the true predicted labels of the four classes of AG News demonstrated that the system performs well in correctly classifying true positives (TPs), with high numbers for Sports (124), World (117), Sci_Tech (111), and Business (110), all out of 125 true instances per class.
These results indicate that the Sports category has the highest classification accuracy and least confusion, with only one article misclassified as World.
The most significant confusion occurs between the Business and Sci_Tech categories: 12 Sci_Tech articles were incorrectly predicted as Business (false positives), and eight Business articles were misclassified as Sci_Tech.
The confusion matrix for the 23-class Amazon Reviews classification task reveals both strong performance in certain categories and notable overlaps in others.
The music_album category was perfectly classified, achieving 827 true positives (TPs) out of 827 instances. Similarly, the books category showed high accuracy, with 851 TPs out of 852. The electronics category had 756 TPs out of 776, and sports_outdoors achieved 588 TPs out of 782. However, substantial confusion was observed in categories with overlapping characteristics. For instance, the apparel category had 387 TPs but was frequently classified as toys_games (119 false positives), baby (38 FPs), and beauty (17 FPs). The health_and_personal_care category was often confused with beauty (36 FPs), baby (21 FPs), and grocery (8 FPs). The kitchen category had 383 TPs but was significantly misclassified as grocery_and_gourmet_food, with 276 FPs.
These patterns highlight how semantically similar categories can lead to misclassifications, underscoring the challenge in multi-class classification tasks where product categories often overlap, such as ‘health and personal care’ overlapping with the ‘beauty’ category and the ‘kitchen’ category overlapping with ‘grocery and gourmet’ food.
The confusion matrix of the Yelp Reviews classification task reveals strong performance. For the negative class, there are 13,183 true positives (TPs) out of 15,097 instances, with 1,554 false positives (FPs). For the positive class, the system achieved 17,149 TPs out of 18,703 instances, with 1,914 FPs. The overall accuracy is 90%, the precision for the positive class is 90%, and the recall is 92%, indicating that the system effectively predicts positive instances. The F1-score, which balances precision and recall, stands at 91%, suggesting that the system is both reliable and effective. Notably, the recall is slightly higher than the precision, which means that the system prioritizes reducing false negatives. This is useful in applications where missing a positive review is more costly than incorrectly flagging a negative one.
In general, the system shows very good performance in classifying real-world data. This performance slightly varies according to the nature of the dataset, the number of classes, and whether there are overlaps between these classes, such as in the Amazon Reviews dataset. In sentiment analysis, the system performs slightly better in classifying positive classes than negative classes. The weaknesses of the system appear in the classification of overlapping categories within the Amazon Reviews dataset due to the semantic similarity between those categories.
To address the semantic overlap between similar categories (e.g., Amazon’s “health and personal care” vs. “beauty”), we recommend providing clear descriptions for each category or label to help the model distinguish between them. Additionally, including a few representative examples for each category, especially those prone to overlap, can guide the model’s classification behavior and reduce misclassification.
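This recommendation could be operationalized by embedding short category descriptions directly in the labeling prompt. The descriptions and the helper below are hypothetical examples, not the platform’s actual prompt templates:

```python
# Hypothetical category descriptions to help the model separate
# semantically overlapping Amazon categories.
CATEGORY_DESCRIPTIONS = {
    "health_and_personal_care": "vitamins, medical supplies, hygiene products",
    "beauty": "cosmetics, skincare, fragrance",
    "kitchen": "cookware, utensils, small kitchen appliances",
    "grocery_and_gourmet_food": "food and beverage items themselves",
}

def build_labeling_prompt(review: str) -> str:
    """Build a labeling prompt that spells out what each category covers."""
    lines = [f"- {name}: {desc}" for name, desc in CATEGORY_DESCRIPTIONS.items()]
    return (
        "You are a product categorization expert.\n"
        "Categories and what they cover:\n" + "\n".join(lines) +
        f"\nAssign exactly one category to this review:\n{review}"
    )

print(build_labeling_prompt("This moisturizer left my skin feeling soft."))
```

A few representative example reviews per overlap-prone category could be appended to the same prompt in few-shot style.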

5.4. Evaluation Phase B: Synthetic Data Labeling

5.4.1. Methodology

The synthetic datasets introduced in Section 4, namely the LLM-Generated Restaurant Reviews dataset for binary classification and the LLM-Generated News dataset for multi-class classification, were labeled concurrently during the data generation process using our system’s pipeline.
In this phase, the evaluation was conducted to measure how accurately our Llama 3.3 model assigned labels to the generated text, for both sentiment analysis on our system-generated dataset (LLM-Generated Restaurant Reviews) and for category labeling on our LLM-Generated News datasets.
Define Ground Truth Labels
  • Number of evaluators: We used the Upwork (https://www.upwork.com/ accessed on 9 July 2025) [39] platform to hire three expert annotators, fluent in English and experienced in data annotation, for each dataset. Each annotator was asked to manually label every item in the dataset without the use of AI tools. Two rounds of revisions were conducted to ensure the accuracy and consistency of the expert-assigned labels.
  • Inter-rater reliability: For each text item in the system-generated datasets (LLM-Generated Restaurant Reviews or LLM-Generated News), we applied majority voting to determine inter-rater agreement among annotators. The label agreed upon by at least two out of three annotators was selected as the ‘gold standard’ for the dataset.
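The majority-voting rule used to derive the gold standard can be sketched as follows (the helper name `gold_label` is ours for illustration):

```python
from collections import Counter
from typing import List, Optional

def gold_label(annotations: List[str]) -> Optional[str]:
    """Return the label agreed on by at least two of the three annotators,
    or None when all three disagree (such items would need adjudication)."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None

# Example: two of three annotators chose "positive".
print(gold_label(["positive", "positive", "negative"]))  # -> positive
print(gold_label(["World", "Sports", "Business"]))       # -> None
```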
Table 7 and Table 8 present randomly selected examples of manual labeling applied to our system-generated data for restaurant reviews and news, respectively.
After creating the gold standard dataset, we compared our system labels with those of the expert annotators and computed the following metrics: accuracy, precision, recall, and F1-Score.

5.4.2. Results

Labeling of the synthetic data showed high consistency, with accuracy scores approaching 0.99 across all classes, demonstrating exceptional performance by the system in this phase.
A. LLM-Generated News dataset
The performance evaluation of our system’s predicted labels for the LLM-Generated News dataset demonstrates an accuracy of 98.5%, with a weighted average of 99% for precision, recall, and F1-score. The evaluation results of the system-generated News dataset are presented in Table 9. Figure 17 illustrates the confusion matrices for the Business, Sci_Tech, Sports, and World classes.
B. LLM-Generated Restaurant Reviews dataset
The performance evaluation of our system’s predicted labels for the LLM-Generated Restaurant Reviews dataset also demonstrates an accuracy of 97.9%, with a weighted average of 98% for precision, recall, and F1-score. The evaluation results for the system-generated Restaurant Reviews dataset are presented in Table 9. Figure 18 illustrates the confusion matrices for the positive and negative classes for LLM-Generated Restaurant Reviews dataset.

5.4.3. Analysis and Discussion

The results confirm that synthetic data generated using Llama 3.3 and our attributed prompt engineering module is internally coherent and can be effectively self-labeled. The structured nature of these prompts enables the model to assign labels with high precision, enhancing the reliability of the labeling process.
A. LLM-Generated News dataset
In the multi-class classification task for the LLM-Generated News dataset, the system exhibited strong performance across all categories.
  • Business: 1512 true positives out of 1530 actual Business samples. There were 13 misclassified as Sci_Tech and 5 as World.
  • Sci_Tech: 1602 true positives out of 1641 actual Sci_Tech samples. Misclassifications included 1 as Sports and 38 as World.
  • Sports: 1533 true positives out of 1556 actual Sports samples. Misclassifications included 9 to Business, 11 to Sci_Tech, and 3 to World.
  • World: 1403 true positives out of 1414 actual World samples. Misclassifications include 3 to Business and 8 to Sci_Tech.
These results indicate accurate predictions across all categories. Each class (Business, Sci_Tech, Sports, and World) achieved high classification accuracy, with only minor misclassifications. The few errors observed, such as Sci_Tech misclassification as Sports or World, are likely due to the semantic overlap between categories. Overall, the system demonstrates high precision and recall across all classes, making it highly suitable for real-world applications in news topic classification.
B. LLM-Generated Restaurant Reviews dataset
The confusion matrix of the LLM-Generated Restaurant Reviews dataset demonstrates excellent performance in labeling synthetic data.
In the binary sentiment classification task, the system achieved 2857 true negatives (TNs) and 3045 true positives (TPs), with 20 false positives (FPs) and 106 false negatives (FNs), across a total of 6028 samples.
This results in an overall accuracy of 97.9%. The precision and recall for the positive class are 99% and 97%, respectively, yielding an F1-score of approximately 98%, while the precision and recall for the negative class are 96% and 99%, respectively, yielding an F1-score of approximately 98%.
These metrics indicate that the model is highly reliable, with strong performance in detecting both positive and negative sentiments, though it performs slightly better at identifying negative sentiments. The low false positive and false negative rates suggest that the system excels at sentiment classification, with minimal errors.
  • Comparison with Real-World Data Performance
Our system performs better on synthetic (generated) data, achieving between 96% and 99% across all key metrics: accuracy, precision, recall, and F1-score. In contrast, when labeling real-world data, the performance ranges from 89% to 92% across the same metrics. This suggests that the system is more effective when working with its own generated (synthetic) data.
One possible explanation is that misclassification occurs more frequently when the system processes large volumes of real-world data with ambiguity and variability. Additionally, during synthetic data generation, labels are assigned concurrently with the text, leading to stronger alignment between the generated content and its corresponding label, which contributes to the higher accuracy observed on synthetic data.
Context window limits also affect accuracy on real-world tasks, for example, when the batch size is restricted (e.g., to 200 reviews) and the model is asked to return only the labels rather than each review together with its corresponding label. This method decreases labeling accuracy, as we explored through experiments with ChatGPT and Llama 3.3 in Section 5.6.3. Most real-world datasets are large, and a limited context window can truncate excess input, causing data loss and, in turn, decreased accuracy.
  • Human Labeling vs. Llama 3.3-70B-Instruct Labeling
For human labeling, our system generated two datasets with 6000 items each. The cost was USD 0.02 per data item, requiring three weeks of work from six annotators (three for each dataset) and one team member to review the results and ensure inter-rater agreement. In contrast, Llama 3.3-70B-Instruct, which is integrated into our system, labels the data automatically during generation, at a cost of USD 0.003 per item. These costs reflect the pro-subscription pricing on the Hugging Face platform. Table 10 presents the costs of our system and those of human annotators.
Potential Reuse of LLM-Generated News and Restaurant Reviews
Our generated datasets for news and restaurant reviews represent valuable community resources that can be reused across a range of research and development domains. Importantly, these datasets are synthetic and thus adhere to privacy, ethical, and legal constraints.
Potential Reuse of the Datasets:
  • NLP training and evaluation: the datasets can be employed in NLP tasks such as sentiment analysis, text classification, and information retrieval to train and evaluate machine learning models.
  • Dialogue system development: the Restaurant Reviews dataset is well suited for building and training conversational agents (e.g., customer support chatbots).
  • Data augmentation: Both datasets can be used to augment real-world datasets, helping to address issues such as class imbalance, data scarcity, and data privacy.

5.5. Evaluation Phase C: Data Generation (Generated Data Quality)

5.5.1. Methodology

Because each of our system’s prompts can combine multiple prompting techniques, we conducted an ablation study to isolate the contribution of each technique.
In this phase, we evaluated how our system responds to different prompt configurations. To do this, we crafted twenty scenarios (prompts) to generate textual data across various domains. These include six e-commerce reviews, three children’s stories, four restaurant reviews, two yes/no questions, and five cake recipes.
For the e-commerce reviews, we designed six prompts using diverse prompt engineering techniques: pure zero-shot, pure few-shot, pure chain-of-thought (CoT), pure role-play, zero-shot with CoT and role-play, and few-shot with CoT and role-play. In the children’s stories domain, we used pure zero-shot, pure CoT, and pure role-play prompts. For the restaurant reviews, the techniques included pure zero-shot, pure few-shot, pure role-play, and role-play with CoT.
In the yes/no question category, we used two pure zero-shot prompts to generate one question with a “yes” answer and one with a “no” answer.
For the cake recipes, we applied a variety of prompt engineering techniques, including zero-shot, few-shot, role-play, zero-shot with CoT and role-play, and few-shot with CoT and role-play.
Each prompt was submitted to the model in a new session to prevent any influence from previous interactions. The quality of the responses was assessed using a qualitative, human-based evaluation approach.
Three fluent English-speaking evaluators were provided with an evaluation form containing 20 tables of prompts and their corresponding system-generated responses. The evaluators assessed the quality of each response using four criteria on a 5-point Likert scale:
  • Relevance: Does the output appropriately address the input prompt?
  • Coherence: Does the text demonstrate logical flow and readability?
  • Fluency: Is the text grammatically correct and natural sounding?
  • Correctness: Is the content factually and contextually accurate (where applicable)?
To measure inter-rater reliability, we calculated Fleiss’ Kappa to determine the level of agreement among the evaluators.
An example of the evaluation form is presented in Table 11.
In this phase, we calculated the evaluators’ inter-rater agreement using Fleiss’ Kappa; the mean evaluation scores for each domain are as follows:
In the e-commerce domain, the zero-shot approach received an overall score of 5. The few-shot and CoT approaches both scored 4.9, while the role-play prompt technique scored 4.6. A combination of CoT, role-play, and zero-shot resulted in a score of 4.7, whereas combining CoT, role-play, and few-shot achieved the highest score of 5 out of 5.
In the children’s stories domain, CoT received a score of 5, outperforming the role-play and zero-shot prompting techniques, each of which received a score of 4.9.
In the domain of restaurant reviews, zero-shot prompting outperformed other techniques, receiving a score of 5. This was followed by role-play, with a score of 4.7; role-play with CoT, with a score of 4.5; and few-shot prompting, with a score of 4.3.
In the domain of yes/no questions, zero-shot prompting received a perfect score of 5.
In the cake recipes domain, both pure zero-shot and pure role-play outperformed the other techniques, each achieving a score of 5. Meanwhile, pure few-shot, the combination of role-play and CoT with zero-shot, and the combination of role-play and CoT with few-shot each received a score of 4.9.

5.5.2. Results

Mean evaluation scores ranged from 4.3 to 5 across domains, indicating substantial agreement between the LLM outputs and human judgment.
Table 12 shows the inter-rater agreement of the three evaluators using Fleiss’ Kappa.
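For reference, Fleiss’ Kappa can be computed from a ratings matrix as in this minimal sketch (our own implementation for illustration; libraries such as statsmodels offer equivalents):

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' Kappa for a ratings matrix: counts[i][j] is the number of
    raters who assigned item i to category j (same rater total per item)."""
    n = len(counts)        # number of items
    r = sum(counts[0])     # raters per item
    k = len(counts[0])     # number of categories
    # Mean per-item agreement P_bar
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts
    ) / n
    # Chance agreement P_e from overall category proportions
    p_e = sum((sum(row[j] for row in counts) / (n * r)) ** 2 for j in range(k))
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 raters on 2 items and 2 categories:
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```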

5.5.3. Analysis and Discussion

The results demonstrate that the LLM with attributed prompting achieves near-expert-level quality in text generation, particularly in sentiment tasks. Most disagreements among the evaluators occurred in borderline cases involving sarcasm or ambiguous tones.
The quality of the data generated is consistently high. The zero-shot technique performed surprisingly well, producing responses that were both relevant and coherent with the given question. Role-play also proved effective, generating responses that aligned with the assigned persona while remaining relevant to the prompt.
Chain-of-thought (CoT) stood out for its ability to produce structured and logically coherent responses. The few-shot technique also yielded strong results, with responses closely following the tone and length of the provided examples.
The prompt engineering techniques that yielded the best results in our evaluation were as follows:
  • Zero-shot, chain-of-thought (CoT), and role-play: these techniques demonstrated high performance in terms of both relevance and coherence.
  • Zero-shot prompting performs particularly well when the model has sufficient prior knowledge of the topic.
  • Role-play: effectively aligns the output tone and content with a specific persona or scenario, significantly enhancing relevance.
  • Chain-of-thought (CoT): delivered especially strong results in generating children’s stories, achieving a perfect score (5/5), and also performed well in e-commerce review tasks, with a score of 4.9/5.
  • Combinations: techniques combining CoT, role-play, and either zero-shot or few-shot learning produced consistently high scores, often ranging from 4.7 to 5.0 overall.
Table 13 presents the best prompting techniques by domain.

5.6. Evaluation Phase D: Comparative Analysis with ChatGPT-4

5.6.1. Methodology

We conducted a performance comparison between our system and ChatGPT-4-turbo (ChatGPT web interface) by classifying a balanced subset of 200 Yelp Reviews test samples labeled with either ‘Positive’ or ‘Negative’ sentiments.
To ensure fair comparison, both systems were prompted to classify 20 reviews at a time using the same prompt. Our system processed the reviews directly, while for ChatGPT, the same prompt was manually submitted each time. The responses were then parsed to extract the sentiment labels.
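Parsing the model responses to extract the sentiment labels might look like the following sketch; the numbered response format and the helper name are assumptions for illustration:

```python
import re
from typing import List

def parse_sentiment_labels(response: str) -> List[str]:
    """Extract 'Positive'/'Negative' labels from a numbered model response,
    e.g. '1. Positive\\n2. Negative'. The exact response layout is an
    assumption for this example."""
    return [m.group(1).capitalize()
            for m in re.finditer(r"^\s*\d+\.\s*(positive|negative)\b",
                                 response, flags=re.IGNORECASE | re.MULTILINE)]

print(parse_sentiment_labels("1. Positive\n2. negative\n3. POSITIVE"))
# -> ['Positive', 'Negative', 'Positive']
```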

5.6.2. Results

ChatGPT-4-turbo vs. our system (prompt-tuned Llama 3.3-70B)
ChatGPT-4-turbo achieved solid performance, with an accuracy of 90%, with precision, recall, and F1-score all at 90%.
Our system demonstrated stronger performance, achieving 98% accuracy, precision, recall, and F1-score. Table 14 presents the classification results from ChatGPT and our system. Confusion matrices for both models are illustrated in Figure 19.

5.6.3. Analysis and Discussion

Out of 200 reviews, ChatGPT correctly identified 90 true negatives and 90 true positives but also produced 10 false positives and 10 false negatives. This resulted in a 10% misclassification rate for both the positive and negative classes.
In contrast, our system identified 98 true negatives and 98 true positives, with only two false negatives and two false positives. This represents a 2% misclassification rate, indicating significantly higher precision and recall.
These results indicate that while ChatGPT maintains a balanced performance, our system outperforms it in binary sentiment classification tasks, offering superior accuracy, precision, and recall.
In terms of time efficiency, our system completed the classification task in less than half the time required by ChatGPT. This improvement is primarily due to our system being purpose-built for this task, enabling direct label extraction via CSV file operations. In contrast, ChatGPT required repeated manual prompting and response parsing, which significantly increased the overall processing time.
In terms of cost, ChatGPT outperformed Llama due to the small dataset size and the use of a free version of ChatGPT. However, our system significantly outperformed ChatGPT in terms of time efficiency and classification accuracy. Table 15 presents a comparison between ChatGPT and Llama in terms of processing time, cost, and accuracy.
A performance comparison between our system and ChatGPT is illustrated in Figure 20.
We observed that when LLMs such as our system and ChatGPT were tasked with classifying data such as restaurant reviews and instructed to return only the labels (e.g., positive or negative), they occasionally misclassified the data, resulting in reduced classification accuracy. In our experiment, our system classified 200 reviews at a time, achieving an accuracy of 81%. In comparison, ChatGPT-4 achieved an accuracy of 72% under the same conditions. Due to context token limitations, ChatGPT-4 was only able to classify 20 reviews at a time. While this approach reduced classification time, it also led to lower classification accuracy. Our system processed all 200 reviews in one minute, whereas ChatGPT-4 took approximately 20 min. Notably, our system was originally designed to return both the classified text and the corresponding labels.

5.7. Evaluation Phase E: Usability Testing via SUS

5.7.1. Methodology

Usability was tested using the System Usability Scale (SUS) with seven participants, including five males and two females, from computer science and linguistics backgrounds. The participants used the platform to label and generate data, then rated their experience. The participants’ demographics are detailed in Table 16.
We conducted unmoderated usability testing for our system. The participants were given a scenario and asked to complete usability testing. Afterward, they were required to fill out a questionnaire, which began with a consent form, followed by questions about their demographic information, and then the main usability questions.
User feedback was collected using a Google Forms questionnaire incorporating the System Usability Scale (SUS), a standardized tool for assessing usability. The SUS consists of ten questions related to the system being tested, each rated on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The questions alternate between positive and negative statements to measure overall usability. Additionally, we added two open-ended questions to collect users’ feedback and suggestions, along with a question asking users to rate the app from 1 to 5 stars:
  • Do you have any suggestions for improving the system?
  • What do you think of the system? Please share your feedback here.
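The standard SUS scoring procedure applied to each participant’s ten responses can be sketched as:

```python
from typing import List

def sus_score(responses: List[int]) -> float:
    """Standard SUS scoring: odd-numbered (positive) items contribute
    (response - 1), even-numbered (negative) items contribute
    (5 - response); the sum is scaled by 2.5 to a 0-100 range."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # i is 0-based, so even i = odd item
                for i, r in enumerate(responses))
    return total * 2.5

# A respondent who strongly agrees with every positive statement and
# strongly disagrees with every negative one scores the maximum:
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # -> 100.0
```

The reported system score of 86.67 is the mean of the per-participant scores produced this way.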

5.7.2. Results

The average SUS score of our system was 86.67, indicating above-average usability. Qualitative feedback highlighted prompt customization as a particular strength.

5.7.3. Analysis

Strengths identified:
  • Overall, the system is well-built, functional, and responsive.
  • Data generation and labeling quality are very good, with great potential for business applications.
  • The system is easy to use once users understand how to work through it.
Areas for improvement:
  • User interface: Many users pointed out that the UI needs to be improved to be more modern, polished, and visually engaging. A smoother, more attractive design would take the experience to the next level.
  • Submit button: The submit button functionality could be improved for better user feedback after submission.
  • Minor bugs:
The minimum word generation parameter does not always work as expected.
System prompts could be better placed and should avoid unnecessary repetition.
  • Data labeling flow: auto-adding custom attributes (such as NER) after filling them in would make the flow smoother, instead of requiring manual addition afterward.
This comprehensive evaluation validates the system’s technical robustness and usability. The results demonstrate the effectiveness of the attributed prompt engineering pipeline in generating and labeling high-quality data. The system performs on par with, or better than, ChatGPT-4, with improved efficiency, scalability, and user experience. These findings support the platform’s applicability in real-world NLP tasks and data pipeline automation.

5.8. Evaluation of Results Against Previous Studies

We evaluated our system’s results in both data generation and data labeling by comparing them with previous work that applied LLMs to these tasks through prompt engineering or fine-tuning; the results are presented in Table 17.
Previous work [16] demonstrated significant potential in AI-based data generation and labeling. In the labeling task, their system achieved a 92.74% F1-score in detecting and classifying phishing URLs using GPT-3.5 and Claude 2, through three prompt engineering techniques: zero-shot, role-playing, and chain-of-thought, a methodology similar to the one adopted in our research. Unlike that study, however, we enhance the approach by using attributed prompts and, in some cases, substituting few-shot for zero-shot prompting. Our results on the AG News dataset achieved an F1-score of 92%, comparable to their phishing URL classification performance.
When the same study [16] fine-tuned the LLM instead of relying on prompt engineering, it achieved a higher F1-score of 99.29%.
In terms of data generation, our system also performs strongly, achieving 99% accuracy on the news generation dataset and 98% accuracy on the restaurant reviews dataset. By comparison, previous work [22] achieved 83.95% accuracy using ChatGPT to augment the Amazon dataset, and another study [5] reported 88.9% accuracy on the Symptoms dataset using ChatGPT and BERT.
These results suggest that Llama 3.3 70B is a strong competitor to language models such as ChatGPT-4 and Claude 3. Moreover, combining attributed prompts with advanced prompt engineering techniques has proven to be highly effective for both data labeling and data generation tasks.

6. Conclusions

This study introduced a web-based platform for synthetic text data generation and automated data labeling using prompt-engineered interactions with Llama 3.3. By combining zero-shot, few-shot, and chain-of-thought prompting with attributed prompt templates and role-based customization, the system empowers users to generate high-quality textual data with minimal effort.
The platform was validated through extensive evaluation on benchmark datasets (AG News, Amazon Reviews, and Yelp) and the generation of two large-scale synthetic datasets. Experimental results demonstrated strong performance in classification accuracy, generation quality, and speed, compared to traditional approaches. The system also showcased adaptability, allowing users to guide model behavior through intuitive UI inputs and real-time feedback.
In both labeling and generation tasks, the use of attributed prompts and role-based instructions significantly improved output quality. Few-shot prompting consistently outperformed zero-shot prompting in complex or multi-class scenarios. The platform was also found to be cost-effective, reducing the need for expert annotators and delivering outputs in a fraction of the time required by manual processes. Usability testing and expert feedback confirmed the system’s reliability and practical value for NLP workflows.

Limitations and Future Directions

Despite its success, the platform faces several limitations. Generated text length tends to shorten when large batches are requested, and exact adherence to user-specified word limits is not guaranteed. To avoid duplication, data generation is currently capped at 100 entries per request. Additionally, labeling accuracy can decline when the system is instructed to label high volumes of text without sufficient prompt context. Finally, the system presently accepts only plain text input, lacking support for document uploads or other data formats.
To address these challenges, in future work we plan to evaluate our system’s performance on more diverse datasets. Our goal is to allow users to upload datasets of any size and enable the labeling process by splitting the dataset into multiple subsets, performing classification, and then returning the fully labeled dataset to the user. Additionally, we aim to support document uploads for both data labeling and data generation, allowing the system to extract valuable information or even learn new insights from these documents. To overcome batch-size constraints, we propose implementing a LangChain-based orchestration framework that automatically partitions data into appropriately sized chunks (e.g., 100 entries per request) and manages interactions with the Llama model. This scalable architecture would allow users to submit or request data of any size while maintaining efficient and reliable processing.
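The chunked orchestration described above can be sketched framework-independently in plain Python. Here `label_fn` is a hypothetical stand-in for the call to the Llama 3.3 endpoint (the platform’s actual client code is not shown):

```python
def chunked(items, size=100):
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def label_dataset(entries, label_fn, batch_size=100):
    """Label a dataset of arbitrary size by batching requests.

    label_fn is a placeholder for the LLM call: it takes a batch of texts
    and returns one label per text. Batches keep each request within the
    100-entry cap while the loop reassembles the fully labeled dataset.
    """
    labels = []
    for batch in chunked(entries, batch_size):
        labels.extend(label_fn(batch))
    return labels
```

An orchestration layer such as LangChain would add retries and prompt templating around the same partition-and-merge loop.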
For data variability and duplication issues, our system can be optimally utilized through attributed prompts: changing the word length in each prompt, adding new attributes with suitable instructions, applying prompt engineering techniques, and adding few-shot examples. These strategies can mitigate the problem and produce diverse, customized datasets.
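The attribute-driven diversification strategy above amounts to sampling a fresh attribute combination for every generation request. The attribute pools and prompt wording below are illustrative placeholders, not the platform’s actual templates:

```python
import random

# Hypothetical attribute pools; the platform's real attributes are configured in the UI.
ATTRIBUTES = {
    "sentiment": ["positive", "negative"],
    "length": ["20-40 words", "40-70 words"],
    "cuisine": ["Italian", "Greek", "Egyptian", "fried chicken"],
}


def build_prompt(n_examples=1):
    """Assemble an attributed generation prompt by sampling one value per attribute,
    so consecutive requests vary in sentiment, length, and topic."""
    picks = {key: random.choice(values) for key, values in ATTRIBUTES.items()}
    return (
        f"Generate {n_examples} {picks['sentiment']} restaurant review(s) "
        f"about {picks['cuisine']} food, each {picks['length']} long."
    )
```

Randomizing attributes per request reduces near-duplicate outputs across batches, which is the behavior reference [22] attributes to attributed prompts.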
We intend to assess the system’s performance on datasets that are scarce in specific domains by generating and labeling such data. Furthermore, we plan to fine-tune the model using a wide range of datasets to improve its ability to classify text data effectively. We also aim to integrate support for the Arabic language to broaden the system’s applicability. In the future, our model could be extended to generate images by leveraging the same techniques used in this project, in combination with our custom-attributed prompt engineering module. Through these enhancements, the platform aims to offer a more robust, inclusive, and intelligent tool for synthetic data generation and automated labeling in real-world NLP applications.

Author Contributions

Conceptualization, R.A.; methodology, R.A. and W.S.A.; software, W.S.A.; validation, R.A. and W.S.A.; formal analysis, W.S.A.; investigation, W.S.A.; resources, R.A. and W.S.A.; data curation, W.S.A.; writing—original draft preparation, W.S.A. and R.A.; writing—review and editing, W.S.A. and R.A.; visualization, W.S.A.; supervision, R.A.; project administration, R.A.; funding acquisition, R.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ongoing Research Funding Program (ORF-2025-905), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

The datasets generated and analyzed during the current study are available on GitHub:
  1. LLM-Generated Restaurant Reviews Dataset: https://github.com/Wedyan2023/LLM-Generated-Resaturant-Reviews./blob/main/LLM-Generated%20Restaurant%20Reviews%20Dataset-1.csv (accessed on 9 July 2025).
  2. LLM-Generated News Dataset: https://github.com/Wedyan2023/LLM-Generated-News-Dataset/blob/main/LLM-Generated%20News%20Dataset%202.csv (accessed on 9 July 2025).

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Wang, S.; Liu, Y.; Xu, Y.; Zhu, C.; Zeng, M. Want To Reduce Labeling Cost? GPT-3 Can Help. arXiv 2021. [Google Scholar] [CrossRef]
  2. Reyes, O.; Morell, C.; Ventura, S. Effective active learning strategy for multi-label learning. Neurocomputing 2018, 273, 494–508. [Google Scholar] [CrossRef]
  3. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.-G. Data collection and quality challenges in deep learning: A data-centric AI perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  4. Gilardi, F.; Alizadeh, M.; Kubli, M. ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks. Proc. Natl. Acad. Sci. USA 2023, 120, e2305016120. [Google Scholar] [CrossRef]
  5. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv 2023. [Google Scholar] [CrossRef]
  6. Llama. Available online: https://llama.meta.com/ (accessed on 5 February 2024).
  7. Meta Llama 3. Available online: https://llama.meta.com/llama3/ (accessed on 7 May 2024).
  8. Alsakran, W.; Alabduljabbar, R. Exploring the Potential of LLMs and Attributed Prompt Engineering for Efficient Text Generation and Labeling. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 244–252. [Google Scholar]
  9. ChatGPT. Available online: https://chatgpt.com (accessed on 3 July 2025).
  10. Refuel.ai: High-Quality Data at the Speed of Thought. Available online: https://refuel.ai (accessed on 3 July 2025).
  11. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023. [Google Scholar] [CrossRef]
  12. He, X.; Lin, Z.; Gong, Y.; Jin, A.-L.; Zhang, H.; Lin, C.; Jiao, J.; Yiu, S.M.; Duan, N.; Chen, W. AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators. arXiv 2023. [Google Scholar] [CrossRef]
  13. Alizadeh, M.; Kubli, M.; Samei, Z.; Dehghani, S.; Bermeo, J.D.; Korobeynikova, M.; Gilardi, F. Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks. arXiv 2023. [Google Scholar] [CrossRef]
  14. Ding, B.; Qin, C.; Liu, L.; Chia, Y.K.; Joty, S.; Li, B.; Bing, L. Is GPT-3 a Good Data Annotator? arXiv 2023. [Google Scholar] [CrossRef]
  15. Shushkevich, E.; Alexandrov, M.; Cardiff, J. Improving Multiclass Classification of Fake News Using BERT-Based Models and ChatGPT-Augmented Data. Inventions 2023, 8, 112. [Google Scholar] [CrossRef]
  16. Trad, F.; Chehab, A. Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models. Mach. Learn. Knowl. Extr. 2024, 6, 367–384. [Google Scholar] [CrossRef]
  17. Sahu, G.; Rodriguez, P.; Laradji, I.H.; Atighehchian, P.; Vazquez, D.; Bahdanau, D. Data Augmentation for Intent Classification with Off-the-shelf Large Language Models. arXiv 2022. [Google Scholar] [CrossRef]
  18. OpenAI Platform. Available online: https://platform.openai.com (accessed on 22 March 2024).
  19. GPT-J. Available online: https://www.eleuther.ai/artifacts/gpt-j (accessed on 22 March 2024).
  20. Ubani, S.; Polat, S.O.; Nielsen, R. ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT. arXiv 2023. [Google Scholar] [CrossRef]
  21. Yang, K.; Liu, D.; Lei, W.; Yang, B.; Xue, M.; Chen, B.; Xie, J. Tailor: A Soft-Prompt-Based Approach to Attribute-Based Controlled Text Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 410–427. [Google Scholar]
  22. Yu, Y.; Zhuang, Y.; Zhang, J.; Meng, Y.; Ratner, A.; Krishna, R.; Shen, J.; Zhang, C. Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias. arXiv 2023. [Google Scholar] [CrossRef]
  23. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023. [Google Scholar] [CrossRef]
  24. Fei-Fei, L.; Fergus, R.; Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 594–611. [Google Scholar] [CrossRef]
  25. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776. [Google Scholar] [CrossRef]
  26. Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-HALT: Medical Domain Hallucination Test for Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
  27. Hua, S.; Jin, S.; Jiang, S. The Limitations and Ethical Considerations of ChatGPT. Data Intell. 2023, 6, 201–239. [Google Scholar] [CrossRef]
  28. Tan, Z.; Beigi, A.; Wang, S.; Guo, R.; Bhattacharjee, A.; Jiang, B.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation: A Survey. arXiv 2024. [Google Scholar] [CrossRef]
  29. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  30. Zhang, J.; Ji, X.; Zhao, Z.; Hei, X.; Choo, K.-K.R. Ethical Considerations and Policy Implications for Large Language Models: Guiding Responsible Development and Deployment. arXiv 2023. [Google Scholar] [CrossRef]
  31. Törnberg, P. Best Practices for Text Annotation with Large Language Models. arXiv 2024. [Google Scholar] [CrossRef]
  32. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2024. [Google Scholar] [CrossRef]
  33. Papers with Code—AG News Dataset. Available online: https://paperswithcode.com/dataset/ag-news (accessed on 20 May 2024).
  34. AttrPrompt/Datasets at Main yueyu1030/AttrPrompt. Available online: https://github.com/yueyu1030/AttrPrompt/tree/main/datasets (accessed on 4 May 2024).
  35. Google Colaboratory. Available online: https://colab.research.google.com/ (accessed on 23 January 2024).
  36. Spaces—Hugging Face. Available online: https://huggingface.co/spaces (accessed on 21 April 2025).
  37. Streamlit: A Faster Way to Build and Share Data Apps. Available online: https://streamlit.io/ (accessed on 21 April 2025).
  38. Bandi, A.; Adapa, P.V.S.R.; Kuchi, Y.E.V.P.K. The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges. Future Internet 2023, 15, 260. [Google Scholar] [CrossRef]
  39. Upwork|Hire Top Freelance Talent with Confidence. Available online: https://www.upwork.com/ (accessed on 3 July 2025).
Figure 1. Framework for the data labeling process.
Figure 2. Framework for the data generation process.
Figure 3. Prompting template examples for data generation and labeling.
Figure 4. The overall architecture of the system.
Figure 5. The platform’s main user interface.
Figure 6. The data generation user interface.
Figure 7. The data labeling user interface.
Figure 8. The data generation system prompt.
Figure 9. Word cloud of LLM-generated restaurant review dataset.
Figure 10. Word cloud of our generated dataset, LLM-generated news.
Figure 11. Detailed evaluation methodology framework.
Figure 12. AG News confusion matrices for the Business, Sci_Tech, Sports, and World classes.
Figure 13. The confusion matrices for the positive and negative classes of the Yelp dataset.
Figure 14. Confusion matrices of the first 8 classes of the Amazon product reviews: ‘magazines’, ‘camera_photo’, ‘office_products’, ‘kitchen’, ‘cell_phones_service’, ‘computer_video_games’, ‘grocery_and_gourmet_food’, and ‘tools_hardware’.
Figure 15. Confusion matrices of the second 8 classes of Amazon product reviews: ‘automotive’, ‘music_album’, ‘health_and_personal_care’, ‘electronics’, ‘outdoor_living’, ‘video’, ‘apparel’, and ‘toys_games’.
Figure 16. Confusion matrices of the third 7 classes of Amazon product reviews: ‘sports_outdoors’, ‘books’, ‘software’, ‘baby’, ‘musical_and_instruments’, ‘beauty’, and ‘jewelry_and_watches’.
Figure 17. Confusion matrices for all classes of the LLM-Generated News dataset.
Figure 18. Confusion matrices for LLM-Generated Restaurant Reviews classes.
Figure 19. Overall confusion matrices of (A) ChatGPT-4-turbo and (B) Llama 3.3-70B.
Figure 20. Model performance comparison between ChatGPT and our system.
Table 1. Summary of LLMs used for text data labeling.

| Ref. | Year | Task | Model(s) | Prompt Technique | Key Findings |
|---|---|---|---|---|---|
| [1] | 2021 | Data labeling | GPT-3 (Davinci) | Zero-shot | The proposed labeling method can reduce labeling costs by 50% to 96% while achieving the same performance as human-labeled data. |
| [11] | 2023 | Reasoning: arithmetic, commonsense, symbolic, and logical reasoning | PaLM (540B), InstructGPT (text-davinci-002), GPT-3, GPT-Neo, GPT-J, OPT (13B) | Zero-shot + CoT | ‘Zero-Shot CoT’ outperforms zero-shot LLM performance with the same prompting template on diverse reasoning tasks such as arithmetic and commonsense reasoning. It improved the accuracy of text-davinci-002 on the MultiArith and GSM8K benchmarks from 17.7% to 78.7% and from 10.4% to 40.7%, respectively, with similar improvements on the 540B-parameter PaLM. |
| [12] | 2023 | Data labeling | GPT-3.5 | Few-shot + CoT prompts | GPT-3.5 outperforms crowdsourced annotators in user input and keyword relevance assessment and achieves comparable results on BoolQ and WiC tasks. |
| [13] | 2023 | Text annotation tasks | ChatGPT and open-source LLMs | Zero-shot and few-shot, different temperature parameters | ChatGPT outperforms MTurk in ten out of eleven annotation tasks, while open-source models outperform MTurk in six out of eleven. Open-source LLMs outperform human annotators and approach ChatGPT in text annotation tasks. |
| [14] | 2023 | Data annotation | GPT-3 | Zero-shot, one-shot, and few-shot prompts | Direct annotation of unlabeled data is effective for tasks with a small label space, whereas generation-based methods suit tasks with a large label space and are more cost-effective than direct annotation. |
| [4] | 2023 | Data annotation | ChatGPT | Zero-shot learning | ChatGPT outperforms both trained annotators and crowd workers in all annotation tasks; its zero-shot learning outperforms crowd workers by an average of 25%. |
| [15] | 2023 | Data labeling (fake news multiclass classification) | BERT-based models (mBERT, SBERT, XLM-RoBERTa) and ChatGPT | Fine-tuning | ChatGPT-generated messages improved classification quality for True, Partially True, and other news classes. |
| [16] | 2024 | URL classification | GPT-3.5, Claude 2, GPT-2, Bloom-560m, Baby Llama-58m, DistilGPT-2 | Zero-shot, role-playing, chain-of-thought, and fine-tuning | F1-score of 92.74% with prompt engineering; F1-score of 99.29% and AUC of 99.56% with fine-tuning. |
Table 2. Summary of LLMs used for text data generation.

| Ref. | Year | Task | Model(s) | Prompt Technique | Key Findings |
|---|---|---|---|---|---|
| [17] | 2022 | Data augmentation | GPT-3 models (Ada, Babbage, Curie, and Davinci) and GPT-J | Few-shot (10 training examples) | The generated GPT-3 samples enhanced classification accuracy when the intended categories are clearly differentiated from one another. |
| [5] | 2023 | Text data augmentation | ChatGPT | Few-shot prompting | AugGPT outperforms state-of-the-art text data augmentation methods, achieving 88.9% and 89.9% accuracy for BERT and BERT with contrastive learning, respectively, on the Symptoms dataset. |
| [20] | 2023 | Augmenting data in low-resource scenarios | ChatGPT | Task-specific ChatGPT prompts, zero-shot prompting | ChatGPT zero-shot prompts outperform the most popular data augmentation methodologies. |
| [21] | 2023 | Text generation | GPT-2 | Single- and multi-attribute prompts | Tailor requires only 0.08% extra training parameters and achieves significant improvements on eleven attribute-specific tasks. |
| [22] | 2023 | Generating training data | gpt-3.5-turbo | Attributed prompts | Attributed prompts outperform simple prompts, use only 5% of the ChatGPT querying cost of simple prompts, and reduce bias in the newly generated data. |
Table 3. Ethical challenges and proposed mitigations.

| Ref. | Year | Task | Model | Risk | Recommendation |
|---|---|---|---|---|---|
| [27] | 2023 | Content generation | ChatGPT | Hallucination; originality; toxicity; privacy and security issues | Review the generated content of AI models; keep regulations and policies related to generative AI up to date with the rapid evolution of these models. |
| [30] | 2023 | Content generation | ChatGPT and LLMs | System role-playing prompts; perturbation; image-related issues; hallucinations; generation-related problems; bias and discrimination | Develop specialized detectors to manage generation-related issues; develop LLMs that prioritize ethical, legal, and moral issues; use LLMs responsibly; implement legislation for AI models. |
| [31] | 2024 | Data annotation | LLMs | Quality of labeled data; data annotation integrity; overall validity of annotation | Select suitable LLMs; consider ethical and legal implications; conduct strict model evaluation; ensure transparency; use DPA; employ prompt engineering, structuring, and analyzing prompts; use private or copyrighted data only with license, permission, or consent. |
Table 4. Processing time of Llama 3 variants during data generation for the same task.

| Model | #Data | Task | Accuracy | Processing Time |
|---|---|---|---|---|
| Llama 3 8B | 100 | Positive/Negative Restaurant Review Generation (20–70 Words) | 99% | 1:04:10 min |
| Llama 4 Scout 17B | 100 | Positive/Negative Restaurant Review Generation (20–70 Words) | 99% | 1:07:74 min |
| Llama 3.2 3B | 100 | Positive/Negative Restaurant Review Generation (20–70 Words) | 99% | 1:32:77 min |
| Llama 3.3 70B | 100 | Positive/Negative Restaurant Review Generation (20–70 Words) | 99% | 1:38:91 min |
Table 5. System-generated datasets.

| Dataset | #Data | Classes | Task | Domain |
|---|---|---|---|---|
| Restaurant Reviews | 6028 | Positive, Negative | Sentiment Analysis | Restaurants |
| News Dataset | 6141 | World, Sports, Business, Sci_Tech | Multi-Class | News |
Table 6. Evaluation results of the AG News, Yelp Reviews, and Amazon Reviews dataset labeling.

| Dataset | Support | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| AG News | 500 | 0.92 | 0.92 | 0.92 | 0.92 |
| Yelp Reviews | 33,800 | 0.90 | 0.90 | 0.90 | 0.90 |
| Amazon Reviews | 13,291 | 0.89 | 0.92 | 0.89 | 0.89 |
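The accuracy and F1 figures in Table 6 can be reproduced from predicted and gold labels. The pure-Python sketch below computes accuracy and a macro-averaged F1; it is our illustration (the authors’ exact evaluation code and averaging mode are not shown, and a library such as scikit-learn would typically be used instead):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 computed from TP/FP/FN, then averaged."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The same per-class TP/FP/FN counts also populate the confusion matrices shown in Figures 12–18.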
Table 7. Randomly selected examples of LLM-Generated Restaurant Reviews dataset.

| Generated Text | Predicted Label (Llama 3.3) | Human Gold Standard Label |
|---|---|---|
| I’ve tried fried chicken at many restaurants in Riyadh, but Canes stands out from the rest. The chicken is crispy on the outside and juicy on the inside. A great option for those looking for a delicious meal. | Positive | Positive ✔️ |
| Ta’ameya was the highlight of my meal, but the spicy sauce was a joke. The restaurant’s decor was straight out of an ancient Egyptian movie. I would not recommend it. | Negative | Negative ✔️ |
| I had an amazing experience at this Greek restaurant! The moussaka was tender and flavorful, and the service was top-notch. The staff was friendly and attentive. 10/10 would recommend! | Positive | Positive ✔️ |
| We celebrated our graduation in a private room, but the noise level was deafening. The music was too loud, and the staff seemed more interested in the party next door than in attending us. Not exactly the quiet night we were looking for. | Negative | Negative ✔️ |
Table 8. Randomly selected examples of LLM-Generated News dataset.

| Generated Text | Predicted Label (Llama 3) | Human Gold Standard Label |
|---|---|---|
| The United Nations has launched a new initiative aimed at promoting education and awareness about the rights of people with disabilities. The program will provide funding and support to organizations and individuals working to promote disability rights. | World | World ✔️ |
| FIFA Announces Saudi Arabia as 2034 World Cup Host: Saudi Arabia has been officially announced as the host of the 2034 FIFA World Cup, marking a historic moment for the kingdom’s sports industry. | Sports | Sports ✔️ |
| The World Trade Organization (WTO) has launched a new initiative aimed at promoting trade facilitation and reducing customs barriers, supporting the growth of international trade. | Business | Business ✔️ |
| Researchers at a leading university have made a groundbreaking discovery in the field of renewable energy, developing a new solar panel that can harness energy from the sun more efficiently than ever before. | Sci_Tech | Sci_Tech ✔️ |
Table 9. Evaluation results for LLM-Generated News and Restaurant Reviews datasets.

| Dataset | Support | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| LLM-Generated News | 6141 | 0.99 | 0.99 | 0.99 | 0.99 |
| LLM-Generated Restaurant Reviews | 6028 | 0.98 | 0.98 | 0.98 | 0.98 |
Table 10. Comparison between the costs of our system and those of human annotators.

| Labeling Method | Support | Total Cost | Estimated Cost per Label | Notes |
|---|---|---|---|---|
| Our System Labeling | 2 * (6000) | USD 36 | USD 0.003 | Hugging Face Pro subscription. |
| Human Annotators | 2 * (6000) | USD 240 | USD 0.02 | Six human annotators (hired via Upwork). |

* indicates that we used two datasets, each consisting of 6000 entries; the costs shown are for 12,000 entries in total.
Table 11. Example of generated data evaluation form.

| Text ID | Prompt Technique | Relevance (1–5) | Coherence (1–5) | Fluency (1–5) | Correctness (1–5) | Overall Score | Notes/Comments |
|---|---|---|---|---|---|---|---|
| 01 | Zero_shot | 5 | 5 | 5 | ----- | 5 | Accurate and understandable detailed review. |
| 02 | Few_shot | 5 | 5 | 5 | ----- | 5 | Followed the example in its own unique way. |
| 03 | Role-play | 5 | 5 | 4 | ----- | 4.67 | Great detailed role-play review. Grammar error: no “a” before “refund”. |
Table 12. The inter-rater agreement of the three evaluators, measured using Fleiss’ Kappa.

| Text ID | Prompt Technique | Relevance (1–5) | Coherence (1–5) | Fluency (1–5) | Correctness (1–5) | Overall Score | Domain |
|---|---|---|---|---|---|---|---|
| 01 | Zero_shot | 5 | 5 | 5 | ----- | 5 | E-commerce reviews |
| 02 | Few_shot | 5 | 4.6 | 5 | ----- | 4.9 | E-commerce reviews |
| 03 | Role-play | 5 | 4.6 | 4.2 | ----- | 4.6 | E-commerce reviews |
| 04 | CoT | 5 | 5 | 4.6 | ----- | 4.9 | E-commerce reviews |
| 05 | CoT, role-play, and zero-shot | 5 | 4.6 | 4.6 | ----- | 4.7 | E-commerce reviews |
| 06 | CoT, role-play, and few-shot | 5 | 5 | 5 | ----- | 5 | E-commerce reviews |
| 07 | Zero-shot | 4.8 | 4.8 | 5 | ----- | 4.9 | Children’s stories |
| 08 | CoT | 5 | 5 | 5 | 5 | 5 | Children’s stories |
| 09 | Role-play | 5 | 4.6 | 5 | 5 | 4.9 | Children’s stories |
| 10 | Zero_shot | 5 | 5 | 5 | ----- | 5 | Restaurant reviews |
| 11 | Few_shot | 3.4 | 4.8 | 4.6 | ----- | 4.3 | Restaurant reviews |
| 12 | Role-play | 4.6 | 5 | 4.6 | ----- | 4.7 | Restaurant reviews |
| 13 | Role-play and CoT | 4.2 | 4.6 | 4.8 | ----- | 4.5 | Restaurant reviews |
| 14 | Zero-shot | 5 | 5 | 5 | 5 | 5 | Yes/no questions |
| 15 | Zero-shot | 5 | 5 | 5 | 5 | 5 | Yes/no questions |
| 16 | Zero-shot | 5 | 5 | 5 | 5 | 5 | Cake recipes |
| 17 | Few-shot | 5 | 4.6 | 5 | 5 | 4.9 | Cake recipes |
| 18 | Role-play | 5 | 5 | 5 | 5 | 5 | Cake recipes |
| 19 | Role-play, CoT, and zero-shot | 4.6 | 5 | 5 | 5 | 4.9 | Cake recipes |
| 20 | Few-shot, role-play, and CoT | 5 | 4.6 | 5 | 5 | 4.9 | Cake recipes |
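The inter-rater agreement in Table 12 is measured with Fleiss’ Kappa. As a hedged illustration (not the authors’ analysis code), the standard computation takes, for each rated item, the count of raters per category:

```python
def fleiss_kappa(table):
    """Fleiss' Kappa for a rating table.

    table[i][j] = number of raters who assigned item i to category j;
    each item is assumed to be rated by the same number of raters.
    """
    n = sum(table[0])      # raters per item
    N = len(table)         # number of items
    # Mean observed per-item agreement P_bar.
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in table) / N
    # Expected chance agreement P_e from marginal category proportions.
    p_e = sum((sum(row[j] for row in table) / (N * n)) ** 2
              for j in range(len(table[0])))
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement among the three evaluators yields a kappa of 1, while agreement at chance level yields 0.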
Table 13. The best prompting techniques by domain.

| Domain | Best Techniques | Text ID | Reason |
|---|---|---|---|
| E-commerce | Few-shot + CoT + role-play | 06 | Balanced contextual understanding and reasoning |
| Children’s stories | CoT | 08 | Clear logic and structured storytelling |
| Restaurant reviews | Zero-shot; role-play | 10, 12 | Natural expression or factual summary |
| Yes/no questions | Zero-shot | 14 and 15 | Simple format |
| Cake recipes | Zero-shot; role-play; role-play + CoT + zero-shot; role-play and CoT combinations | 16, 17, 19, 20 | Step-by-step clarity |
Table 14. ChatGPT and Llama classification results.

| Model | Support | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| ChatGPT-4-turbo | 200 | 0.90 | 0.90 | 0.90 | 0.90 |
| Llama 3.3 | 200 | 0.98 | 0.98 | 0.98 | 0.98 |
Table 15. ChatGPT vs. Llama 3.3 performance.

| Model | Support | Accuracy | Processing Time | Cost |
|---|---|---|---|---|
| ChatGPT-4-turbo (Web interface) | 200 | 90% | 55 min | Free |
| Llama 3.3 (our system) | 200 | 98% | 20 min | USD 9 per month (USD 0.3 per day) |
Table 16. Demographics of usability testing participants.

| Participant | Gender | Age | Degree | Work |
|---|---|---|---|---|
| P1 | Female | 25–34 | Bachelor | Freelancer |
| P2 | Male | 35–44 | Diploma | IT |
| P3 | Male | 25–34 | Master | Astronomer and data scientist |
| P4 | Male | 25–34 | Bachelor | Cybersecurity and technical writing |
| P5 | Female | 25–34 | Bachelor | E-commerce/Interpreting |
| P6 | Male | 25–34 | Bachelor | Data scientist |
| P7 | Male | 18–24 | Bachelor | QA |
Table 17. System evaluation results compared with previous work.

| Ref. | Model | Methodology | Task | Dataset | Accuracy | F1-Score |
|---|---|---|---|---|---|---|
| [22] | ChatGPT | Attributed prompts | Data Augmentation | Amazon Reviews | 83.95% | 83.93% |
| [5] | ChatGPT & BERT | Few-shot prompting | Data Augmentation | Symptoms Dataset | 88.9% | --- |
| Our System | Llama 3.3 70B | Attributed prompts and prompt engineering techniques | Data Generation | News Dataset | 99% | 99% |
| Our System | Llama 3.3 70B | Attributed prompts and prompt engineering techniques | Data Generation | Restaurant Reviews Dataset | 98% | 98% |
| [16] | GPT-3.5 and Claude 2 | Zero-shot, role-playing, chain-of-thought | URL Classification | The Phishing Dataset | --- | 92.74% |
| [16] | GPT-3.5 and Claude 2 | Fine-tuning | URL Classification | The Phishing Dataset | --- | 99.29% |
| Our System | Llama 3.3 70B | Attributed prompts and prompt engineering techniques | Data Labeling | AG News Dataset | 92% | 92% |
| Our System | Llama 3.3 70B | Attributed prompts and prompt engineering techniques | Data Labeling | Yelp Reviews | 90% | 90% |
| Our System | Llama 3.3 70B | Attributed prompts and prompt engineering techniques | Data Labeling | Amazon Reviews | 89% | 89% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
