4.3. Experimental Setup
The experiment adopts a combination of quantitative and qualitative evaluation methods. The experimental environment is as follows: Intel Core i5 2.8 GHz processor, DDR3 16 GB RAM, and JDK 11 (JDK 11 has been integrated into RM2PT). RM2PT [4,5,7] is an advanced prototyping method that can quickly generate a prototype system from the requirements model and improve the efficiency of requirements confirmation.
RQ1: To evaluate the effectiveness of InitialGPT, this study invited 10 participants: 6 are master’s degree students majoring in software engineering, and the other 4 are software engineers from Internet companies with 3 years of software development experience.
Before the experiment, the subjects received a training session covering the basics of requirements identification using CASE tools. We familiarized the participants with the construction of requirements models and the process of requirements identification using RM2PT, as well as the basic information of each case study. After the training, the participants were randomly divided into two groups for a controlled experiment: manual preparation of prototype data for requirements validation versus automatic generation of prototype data with InitialGPT. In addition, this study explored the effect of the amount of initial data on the efficiency of requirements validation; the experiments compared initial datasets of 25, 100, and 300 entries.
RQ2: Since the prototype data generated by the model are mainly used for requirements validation, the data generation method differs from that used for software testing data. Therefore, a user study with 15 subjects was conducted to evaluate the quality of the initial data generated by InitialGPT.
Ten of the subjects hold master’s degrees in software engineering, and the other five are software engineers. The experiment was designed around the prototype data quality assessment metrics shown in Table 2. All subjects experienced the practical use of both manually written prototype data and InitialGPT-generated data in the requirements-identification process, and then independently assessed the quality of both kinds of data.
In this paper, we first conducted a systematic survey of the current mainstream large language models and, after comprehensively considering model performance, breadth of application, and degree of recognition, selected GPT-4.5, GPT-4o, o1-preview, o3-mini, GPT-4, GPT-3.5-turbo, Gemini-2.0-flash, Claude-3.5-sonnet, and DeepSeek-R1 as experimental subjects; manually written data were used as the benchmark for comparative analysis.
The evaluation used a 5-point Likert scale [48], with the following scoring criteria: strongly agree (5 points), agree (4 points), neutral (3 points), disagree (2 points), and strongly disagree (1 point). To ensure the credibility of the experimental results, subjects were required to submit their evaluations within 2 h. To ensure the accuracy of the experiment, the assessment results of three independent experiments were counted in this study [49]. In each experiment, the manually written data and the data generated by each model through InitialGPT were evaluated separately, and the average score over the three experiments was taken as the final analysis result.
RQ3: This experiment evaluates the difference in data generation quality between the InitialGPT prompt (a specially designed, generated prompt template) and an ordinary prompt (a directly written basic prompt) to measure the advantage of InitialGPT in prompt optimization.
The experimental object is the CoCoME supermarket system case dataset, the requirements are confirmed using both ordinary prompts and InitialGPT prompts, and GPT-4 is adopted as the data generation model. The experiment is divided into two groups: the first group feeds ordinary prompts directly into GPT-4 for data generation; the second group uses the prompt templates generated by InitialGPT to prompt GPT-4 to generate data. During the experiment, all generated data were based on the same requirements model to ensure comparability. Ten subjects (six master’s degree students and four software engineers) independently scored the data using a five-point Likert scale, and the average quality scores of the two groups were compared statistically to verify whether the prompt templates significantly improve the quality of the generated data.
RQ4: This experiment compares the data generation quality of the multi-agent workflow with that of the single-agent mode in order to assess the advantages of the multi-agent workflow in data generation. The experiment, also based on the above case study, used GPT-4 as the data generation model and generated data with the single-agent and multi-agent collaborative modes, respectively. In the single-agent mode, GPT-4 completes all data generation tasks independently, while in the multi-agent mode several agents collaborate, including a data generation agent, data validation and cleaning agents, and a data format output agent (a sketch of this pipeline is given below). The experimental results were independently evaluated several times and averaged to verify whether the multi-agent workflow outperforms the single agent in terms of data generation quality.
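To make the workflow concrete, the following is a minimal sketch of how such a three-agent pipeline could be chained. The agent roles follow the description above, but the function names, prompt texts, and the call_llm helper are hypothetical illustrations of the orchestration pattern, not InitialGPT’s actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a GPT-4 chat-completion call (details omitted)."""
    raise NotImplementedError

def generation_agent(domain_info: dict, n_rows: int) -> str:
    # Agent 1: draft candidate records from the domain description.
    prompt = (f"Generate {n_rows} JSON records for entities described by: "
              f"{json.dumps(domain_info)}")
    return call_llm(prompt)

def validation_agent(raw_output: str, domain_info: dict) -> str:
    # Agent 2: check constraints (types, ranges, duplicates) and clean the draft.
    prompt = (f"Validate and clean these records against the schema, fixing invalid values.\n"
              f"Schema: {json.dumps(domain_info)}\nRecords: {raw_output}")
    return call_llm(prompt)

def format_agent(cleaned_output: str) -> list:
    # Agent 3: enforce the output format expected by the prototype (a JSON array).
    return json.loads(call_llm(f"Return the following records strictly as a JSON array: {cleaned_output}"))

def multi_agent_workflow(domain_info: dict, n_rows: int = 25) -> list:
    # The single-agent mode would stop after generation_agent; the multi-agent
    # mode pipes the draft through the validation and formatting stages.
    draft = generation_agent(domain_info, n_rows)
    cleaned = validation_agent(draft, domain_info)
    return format_agent(cleaned)
```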
The total number of participants in the experiment was 35. The student group (15) consisted of 10 master’s degree students in computer science (all with 2–3 years of experience in software development) and 5 undergraduate students in computer science (all having completed at least two software engineering courses and 1 year of development experience). The average age was 24.1 years (Standard Deviation = 1.8).
The professional group (20) were all working software engineers, recruited through an industry collaboration platform. The inclusion criteria were: at least 3 years of software development experience, more than 1 year of which had to involve a field directly related to the evaluation task (e.g., machine learning system development, API design, etc.). The average length of experience was 5.2 years, and the age range was 26–45 years (mean = 32.4, standard deviation = 4.3). All participants volunteered to participate in the experiment and signed an informed consent form. Familiarity with the assessment task was confirmed by a questionnaire prior to the experiment (all participants indicated “familiar” or “very familiar”).
RQ5: In this experiment, several representative types of data generation methods were selected for comparison with InitialGPT, including data generation based on random selection (random selection generation, RSG), data generation based on statistical distributions (statistical distribution generation, SDG), and data generation based on deep learning (conditional tabular generative adversarial network, CTGAN).
The datasets are publicly available datasets covering fields such as business, finance, and healthcare, including a supermarket sales dataset, a loan approval dataset, an adult census dataset, and a medical cost personal dataset, all of which are widely used in scientific research and data analysis.
For generic datasets that lack formal requirements models (e.g., the UCI adult income dataset), in order to evaluate the performance of InitialGPT on real datasets, this paper first reverse-parses the metadata structure, extracts entity-attribute mappings, and constructs domain information in a format similar to that extracted from a formal requirements model. This information is then injected into the InitialGPT prompt templates to drive the large language model to generate data that conform to the domain characteristics (a sketch of this adaptation step is given below). The generated results are evaluated against ISO quality standards together with the baseline methods to ensure that the comparison experiments are conducted under a unified quality framework. This approach solves the method-adaptation problem when formal requirements are missing and demonstrates the feasibility of cross-domain data generation.
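As an illustration of this adaptation step, the sketch below derives an entity-attribute description from a dataset’s metadata with pandas and injects it into a prompt template; the template wording and field names are hypothetical stand-ins for InitialGPT’s actual prompt format.

```python
import pandas as pd

def extract_domain_info(df: pd.DataFrame, entity_name: str) -> dict:
    # Reverse-parse the dataset's metadata into an entity-attribute mapping,
    # mimicking the domain information extracted from a formal requirements model.
    attributes = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            attributes.append({"name": col, "type": "numeric",
                               "range": [float(df[col].min()), float(df[col].max())]})
        else:
            attributes.append({"name": col, "type": "categorical",
                               "values": df[col].dropna().unique().tolist()[:20]})
    return {"entity": entity_name, "attributes": attributes}

def build_prompt(domain_info: dict, n_rows: int) -> str:
    # Inject the reconstructed domain information into a (hypothetical) prompt template.
    return (f"You are generating prototype data for the entity '{domain_info['entity']}'.\n"
            f"Attributes and constraints: {domain_info['attributes']}\n"
            f"Produce {n_rows} realistic records as a JSON array.")

# Example: adapting the UCI adult income dataset (the file path is a placeholder).
adult = pd.read_csv("adult.csv")
prompt = build_prompt(extract_domain_info(adult, "AdultCensusRecord"), n_rows=100)
```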
For data evaluation, this paper uses four metrics, Wasserstein distance, Jensen–Shannon divergence (JSD), Pearson correlation, and Cramér’s V, to evaluate the quality of data generation. Wasserstein distance and Jensen–Shannon divergence assess distribution similarity, for continuous and discrete distributions, respectively. Pearson correlation and Cramér’s V assess feature correlation, for numerical and categorical data, respectively.
The Wasserstein distance is a metric that measures the difference between two probability distributions. It is based on optimal transport theory and calculates the minimum amount of “work” required to “move” one distribution onto another; here it is used to assess the similarity of continuously distributed numerical data between the generated data and the real data. It is defined as follows:

W(p, q) = \inf_{\gamma \in \Gamma(p, q)} \mathbb{E}_{(x, y) \sim \gamma}\left[ d(x, y) \right],

where p and q are the two probability distributions, \Gamma(p, q) is the set of all joint distributions whose marginals are p and q, and \gamma is a joint distribution that describes how the mass in p is transported to q. The distance d(x, y) represents the cost of moving a unit mass from x to y. The Wasserstein distance is the minimum total cost of transporting the mass from p to q over all possible joint distributions \gamma \in \Gamma(p, q).
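A minimal sketch of how this metric can be computed for one numerical attribute, assuming SciPy’s empirical one-dimensional Wasserstein distance and illustrative stand-in data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real_values = rng.normal(loc=20.0, scale=5.0, size=1000)       # stand-in for a real numerical attribute
generated_values = rng.normal(loc=20.5, scale=5.2, size=1000)  # stand-in for the generated attribute

# Empirical 1-D Wasserstein distance between the two samples; lower means more similar.
w = wasserstein_distance(real_values, generated_values)
print(f"Wasserstein distance: {w:.4f}")
```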
The Jensen–Shannon divergence (JSD) is a symmetric, non-negative measure of the similarity of categorical data under discrete distributions between the generated data and the real data. The JSD is defined as follows:

\mathrm{JSD}(p \parallel q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \parallel m),

where m = \tfrac{1}{2}(p + q) is the average distribution of p and q, and D_{\mathrm{KL}} is the Kullback–Leibler divergence, given by

D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.

The JSD has several desirable properties for our evaluation purposes. It is symmetric, meaning that \mathrm{JSD}(p \parallel q) = \mathrm{JSD}(q \parallel p), and it is bounded between 0 and 1 when the base-2 logarithm is used (or \ln 2 with the natural logarithm). The JSD is zero if and only if the two distributions are identical. This makes the JSD a suitable metric for assessing the quality of the generated data in comparison with the real data.
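The following sketch computes the base-2 JSD between the category frequencies of a real and a generated categorical attribute; the counts are illustrative placeholders.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) yields the Kullback-Leibler divergence

def jsd(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    # Normalize counts to probability distributions over the same category set.
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)
    # The base-2 logarithm keeps the result in [0, 1]; 0 means identical distributions.
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)

real_counts = np.array([120, 300, 80])       # category frequencies in the real data
generated_counts = np.array([118, 305, 77])  # category frequencies in the generated data
print(f"JSD: {jsd(real_counts, generated_counts):.4f}")
```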
Pearson’s correlation coefficient (PCC) is used to measure the linear correlation between two variables and is used here to evaluate the generation of continuous numerical patterns. The PCC takes a value between −1 and 1, where 1 indicates a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. For two random variables X and Y, the PCC is defined as follows:

r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2}\,\sqrt{\sum_{i} (y_i - \bar{y})^2}},

where x_i and y_i are the individual sample points indexed by i, and \bar{x} and \bar{y} are the sample means of X and Y, respectively. The numerator represents the covariance between X and Y, while the denominator is the product of the standard deviations of X and Y.
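As a short sketch, the coefficient can be computed on the same attribute pair in the real and the generated data and the two values compared; the data below are illustrative stand-ins.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
# Stand-ins for two numerical attributes (e.g., quantity and total price).
real_x = rng.normal(10, 2, 500)
real_y = 3.0 * real_x + rng.normal(0, 1, 500)
gen_x = rng.normal(10, 2, 500)
gen_y = 2.8 * gen_x + rng.normal(0, 1.5, 500)

r_real, _ = pearsonr(real_x, real_y)
r_gen, _ = pearsonr(gen_x, gen_y)
# The closer the generated-data coefficient is to the real-data coefficient,
# the better the linear relationship between the attributes is preserved.
print(f"real r = {r_real:.3f}, generated r = {r_gen:.3f}")
```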
Cramér’s V is a statistic based on the chi-square test that measures the strength of association between two categorical variables. It is particularly useful for analyzing contingency tables and takes a value between 0 and 1, where higher values indicate stronger associations. Cramér’s V is defined as follows:

V = \sqrt{\frac{\chi^2 / n}{\min(k - 1, r - 1)}},

where \chi^2 is the chi-square statistic, n is the total sample size, k is the number of columns, and r is the number of rows in the contingency table. The term \min(k - 1, r - 1) is used to normalize the statistic, ensuring that the value of V lies within the interval [0, 1].
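A minimal sketch implementing the formula above from a contingency table of two categorical attributes; the table values are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    # Chi-square statistic of the r x k contingency table.
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, k = table.shape
    # Normalizing by min(k - 1, r - 1) keeps V within [0, 1].
    return float(np.sqrt((chi2 / n) / min(k - 1, r - 1)))

# Illustrative contingency table: rows = one categorical attribute, columns = another.
table = np.array([[30, 15, 5],
                  [20, 25, 10],
                  [10, 10, 25]])
print(f"Cramér's V: {cramers_v(table):.3f}")
```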
In this study, we applied the four data generation methods to the four datasets described above. The quality of the generated data was then systematically compared with that of the real data using the four preset assessment metrics, in order to comprehensively assess the effectiveness of the data generation methods and the reliability of the generated data.
4.4. Experimental Results
Experiment 1: This experiment mainly counts the time consumed by the two requirements-validation methods. According to the data in Table 2, when developers manually write data for 100 initial entities, the average time consumed is 25.03 min, whereas using InitialGPT to automatically generate the data takes only 2.27 min; as the number of initial data entities increases, the time gap between the two methods becomes even more pronounced.
The “Manual” column shows the time taken to manually write the corresponding amount of initial data, while the “InitialGPT” column records the time taken to generate the data using InitialGPT (both in minutes).
Table 3 shows the time-efficiency improvement obtained by using InitialGPT for requirements validation, calculated from these comparisons. The results show that the InitialGPT approach improves the efficiency of requirements validation by an average of 7.02 times compared with the traditional manual initial-prototyping approach.
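Assuming the improvement factors in Table 3 are computed as the ratio of manual time to InitialGPT time, the 100-entity case from Table 2 works out, as a worked example, to

\text{speedup}_{100} = \frac{T_{\text{Manual}}}{T_{\text{InitialGPT}}} = \frac{25.03\ \text{min}}{2.27\ \text{min}} \approx 11.0,

with the reported 7.02 presumably being the average of such ratios over the tested data sizes (25, 100, and 300 entities).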
Experiment 2: As shown in Table 4, this paper reports the results of three manual assessments of the quality of the prototype data generated by each model. From the experimental data, the average quality score of the manually written data is 4.42, the best among all test subjects. The average quality score of GPT-4-generated data is 4.33, the closest to the manual level, followed by GPT-4.5 with an average score of 4.30, which is also outstanding. The average scores of Gemini-2.0-flash and o1-preview are at a medium level, and GPT-3.5-turbo has an average score of 4.02, an average performance. In summary, the comparative analysis reveals that GPT-4 and GPT-4.5 are the strongest in terms of the quality of automatically generated prototype data; they are closest to the quality level of manually written data and have the best practical value. In this experiment, despite the technical upgrades and optimizations of newer models such as GPT-4.5, DeepSeek-R1, o1-preview, and o3-mini, GPT-4 still outperformed these models in terms of data generation quality. This result may be attributed to several factors. First, it may be related to the complexity of the reasoning models: o1-preview and o3-mini may over-complicate the inference process, leading to over-reasoning on specific tasks, which affects data quality. In addition, the newer models may not be as well adapted to this specific task as expected, despite their optimizations in other areas; this lack of fit may reduce the quality of the generated data, reflecting the delicate balance between model complexity and task matching.
Experiment 3: In Experiment 3, we compared the effect of an unadjusted ordinary prompt with the InitialGPT prompt. As shown in Table 5, the InitialGPT prompts significantly improve data quality, by 12.18%, which indicates that the method has high adaptability and reliability in prompt generation.
Experiment 4: Table 6 shows the results of Experiment 4, where “Single” denotes the single-agent mode and “Multi” denotes the multi-agent mode. The results show that the multi-agent workflow significantly outperforms the single-agent mode in terms of data generation quality, with an average improvement of 12.38%.
In several evaluations, the multi-agent workflow performs well, and its advantage is more obvious in complex tasks, indicating that multi-agent collaboration can effectively improve the quality of data generation. However, the generation efficiency of the multi-agent workflow is lower than that of a single agent, as more time and resources are needed to coordinate the different agents. This suggests that multi-agent collaboration is valuable when high-quality data generation is the priority, but the quality gain must be weighed against the efficiency cost.
Experiment 5:
Table 7 shows the results of Experiment 5. In this comparison experiment, the InitialGPT method demonstrated significant advantages in the similarity between generated and real data, particularly in distributional similarity and variable correlation. The experimental results show that the Jensen–Shannon divergence (JSD) of InitialGPT is only 0.001, much lower than that of the other methods, indicating that its generated data are highly consistent with the real data in terms of distribution. In addition, the Cramér’s V value of InitialGPT is 0.277, higher than that of the other methods, indicating that the generated data are not only similar in distribution but also better reflect the relationships between variables found in the real data. This ability is particularly important in tasks that require preserving the internal structure and relationships of the data; InitialGPT therefore has a significant advantage in generating high-quality synthetic data and can provide more representative and usable data for a variety of application scenarios.
InitialGPT’s performance in terms of the Pearson correlation coefficient is relatively weak, with a value of 0.174, lower than that of the other methods. The PCC measures the linear correlation between two variables, and the low value indicates that InitialGPT-generated data may not fully reproduce some of the linear relationships in the real data. This shortcoming may stem from the fact that InitialGPT focuses more on the overall distribution and the complex relationships between variables at the expense of the accuracy of some linear correlations. However, this weakness may not be significant in practical applications, especially in scenarios that emphasize data distribution and non-linear relationships. Therefore, this limitation does not detract from InitialGPT’s overall advantage in generating high-quality synthetic data.
Statistical Analysis: This study validated the comparability of the LLM-based data generation method with manually prepared data in terms of quality assessment through systematic analysis. As shown in Table 8, the intraclass correlation coefficient analysis revealed some subjective differences among raters (ICC(1) = 0.340), but the key model comparison showed that the quality ratings of the proposed LLM-based method (GPT-4.5) were highly similar to those of the manually written data (Manual) (4.30 vs. 4.42, a difference of <3%). The standard deviation analysis of each model further revealed that the LLM-based method (GPT-4.5: SD = 0.132) exhibited a rater-variability pattern comparable to that of the manually written data (Manual: SD = 0.329), suggesting that the two are comparable in perceived quality.
The analysis of variance (ANOVA) results (F(9,20) = 0.014, p = 0.986) confirmed that there were no statistically significant differences among the models, supporting the hypothesis of equivalence between LLM-generated and manually written data in terms of quality assessment. Notably, the proposed LLM method (GPT-4.5, SD = 0.132) outperformed the manual data (SD = 0.329) in terms of rater agreement, suggesting more stable quality evaluations. Although the highest mean score (4.42) was obtained for the manually prepared data, its small difference (Δ = 0.12) from the LLM method (GPT-4.5: 4.30) did not reach statistical significance (p > 0.05), which verifies that the LLM-based generation method approaches the manual level on the quality dimensions.
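As a sketch of how these statistics can be reproduced from a raw rating matrix (models as rated targets, the three assessment rounds as raters), the following uses SciPy’s one-way ANOVA and derives ICC(1) from the same mean squares; the rating matrix shown is an illustrative placeholder, not the study’s raw data.

```python
import numpy as np
from scipy.stats import f_oneway

# Illustrative rating matrix: rows = models/conditions, columns = assessment rounds.
ratings = np.array([
    [4.5, 4.3, 4.5],   # e.g., Manual
    [4.2, 4.4, 4.3],   # e.g., GPT-4.5
    [4.3, 4.3, 4.4],   # e.g., GPT-4
])

# One-way ANOVA across models (each row is treated as one group of ratings).
f_stat, p_value = f_oneway(*ratings)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# ICC(1): one-way random-effects, single-rater form, derived from the ANOVA mean squares.
n_targets, k = ratings.shape
grand_mean = ratings.mean()
ms_between = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum() / (n_targets - 1)
ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n_targets * (k - 1))
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1) = {icc1:.3f}")
```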