Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overall, this manuscript presents a timely and interesting exploration into whether persona-driven prompts can lead Large Language Models (LLMs) to produce authentically diverse linguistic outputs, akin to those of human authors with varying backgrounds. However, to further strengthen the manuscript, certain areas would benefit from additional clarification and elaboration. Below are some specific comments and suggestions:
1. Please consider employing multiple similarity or distance measures, rather than relying solely on cosine similarity, to provide a more comprehensive evaluation of the differences among personas.
2. Please offer more detail regarding the representativeness of the four personas, considering attributes such as cultural or professional backgrounds, and please explore approaches to enhance both the number and diversity of personas to strengthen the analysis.
Author Response
We would like to thank the esteemed reviewers for the time and effort spent on considering our manuscript and for the recommendations that help improve our paper.
We have revised the manuscript, adding and modifying it according to the reviewers’ comments. The list of references has been modified accordingly.
Our responses to the reviewers’ comments follow.
R1
1.> Please consider employing multiple similarity or distance measures, rather than relying solely on cosine similarity, to provide a more comprehensive evaluation of the differences among personas.
We thank the esteemed reviewer for the suggestion and agree that an analysis based on more than a single measure is more robust. So, we supplemented the cosine similarity (CS) with another relevant measure, the Mahalanobis distance (MD), and used it as the second dependent variable in the extended analysis. The combination of these two measures appears to be exhaustive, whereas certain other existing measures would not be appropriate for our experiment design (see the new sub-section 2.4, The Vector Similarity Measures). The additional analysis suggests that the outcome of the hypotheses testing for the new metric (incorporated into the manuscript) is largely the same: the associations of different humans differ, whereas the associations of different artificial personas do not.
2.> Please offer more detail regarding the representativeness of the four personas, considering attributes such as cultural or professional backgrounds, and please explore approaches to enhance both the number and diversity of personas to strengthen the analysis.
The four personas devised in our study correspond to the classical groups of external target users (i.e., not considering intranet users) of a university website: undergraduate and graduate applicants, their parents, and young researchers (job-seekers in academia). Although the number and diversity of the personas could have been increased with relative technical ease (e.g., by adding a journalist who seeks tech-related news on the website), we decided that this would be artificial and foreign to the university website content and the academic language domain. Instead, we opted for maximum representativeness in terms of HCI, since the context prompt-specification paradigm is what we explore in the current work.
We would like to note that while the human participants were also relatively homogeneous, their associations did differ significantly, unlike those of the personas. In our opinion, this reinforces the validity of our experiment design. To support the similarity between the groups, we did not employ the factor of cultural background difference in the current study: all the human participants are Russian by origin, whereas the personas’ descriptions mention the names of Russian cities. The professional background factor, kindly suggested by the esteemed reviewer, appears to be a relevant one, and we shall consider it in a future study.
We thank the esteemed reviewer for the comment and have incorporated the respective justification into the manuscript’s text.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper investigates the potential of using persona-driven approaches to generate diverse synthetic training data for large language models (LLMs), particularly focusing on imitating human-like differences in authorship. The authors conducted experiments with human participants and artificial personas created via GPT-4 and YandexGPT, measuring the cosine similarity of responses to stimulus words. They found significant differences in responses among human participants but not between the artificial personas, suggesting that current LLMs may not effectively mimic human associative diversity.
Technical Comments:
1. The authors should clarify the selection criteria for the 50 human participants and how they ensure that this sample is representative of the broader population, as the current participant selection process lacks detailed justification and could bias the results.
2. The discussion section lacks a critical evaluation of the limitations of the synthetic data generated by the LLMs. The authors should explore the potential impacts of these limitations on the practical application of the data in real-world scenarios.
3. The paper does not compare its findings with existing research on the effectiveness of synthetic data in improving model performance. A comparison with other studies would strengthen the paper's contributions to the field.
4. The use of cosine similarity as the sole metric for comparing associations cannot capture the complexity of human language understanding. Is it possible to consider incorporating additional linguistic metrics to provide a more comprehensive analysis?
5. The authors should include the following references to strengthen the literature on applications of LLMs:
https://www.mdpi.com/2227-9032/11/20/2776
https://www.mdpi.com/1999-5903/16/10/365
Author Response
We would like to thank the esteemed reviewers for the time and effort spent on considering our manuscript and for the recommendations that help improve our paper.
We have revised the manuscript, adding and modifying it according to the reviewers’ comments. The list of references has been modified accordingly.
Our responses to the reviewers’ comments follow.
> 1. The authors should clarify the selection criteria for the 50 human participants and how they ensure that this sample is representative of the broader population, as the current participant selection process lacks detailed justification and could bias the results.
We did not actually aim to make the human participants representative of the broader population. Since the human participants act as a sort of control group in the study, we sought to make their characteristics as close to those of the four-persona group as possible (quota sampling). For instance, the mean “age” of the personas is 29.3, whereas it is 30.8 for the human participants. Similarly, all the selected human participants were Russian by origin, whereas the personas’ descriptions mention the names of Russian cities. Correspondingly, we note these limitations in the Discussion section and never claim that our findings automatically generalize to all LLMs, languages and domains, or context-specification prompt techniques.
We thank the esteemed reviewer for the suggestion to add more detail about the selection of the human participants in our study, and have added the related information to the manuscript.
> 2. The discussion section lacks a critical evaluation of the limitations of the synthetic data generated by the LLMs. The authors should explore the potential impacts of these limitations on the practical application of the data in real-world scenarios.
> 3. The paper does not compare its findings with existing research on the effectiveness of synthetic data in improving model performance. A comparison with other studies would strengthen the paper's contributions to the field.
Per the suggestion of the esteemed reviewer, we have extended our manuscript with a critical evaluation of the limitations of synthetic data. In particular, we mention real-world examples from such domains as networking technologies [48. Large Language Models Meet Next-Generation Networking Technologies: A Review. Future Internet 2024], facial recognition systems [49. Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention, ACM FAccT 2024], etc. We also highlight the importance of overcoming such limitations of synthetic data as its lower diversity, and provide some additional references to existing research.
As for the second comment of the esteemed reviewer, we have to say that we cannot directly compare our findings with existing research on the effectiveness of synthetic data, since we did not use the obtained synthetic data to improve any ML model’s performance. In our study, the synthetic data itself was the object of study. However, we added a reference to a recent publication [47. Scaling Laws of Synthetic Images for Model Training ... for Now, IEEE/CVF 2024] that formally describes the scaling effect of synthetic data quality (though for images, not text) on the improvement in the models’ performance.
> 4. The use of cosine similarity as the sole metric for comparing associations cannot capture the complexity of human language understanding. Is it possible to consider incorporating additional linguistic metrics to provide a more comprehensive analysis?
We thank the esteemed reviewer for the suggestion and agree that using more than one linguistic metric can make our analysis more comprehensive. So, we added the Mahalanobis distance (MD) as an alternative metric (see the new sub-section 2.4, The Vector Similarity Measures) and used it to calculate the distances between the associative vectors, in a similar manner to the cosine similarity (CS). The additional analysis suggests that the outcome of the hypotheses testing for the new metric is largely the same: the associations of different humans differ, whereas the associations of different artificial personas do not. The related results and discussion have been added to the manuscript.
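For readers unfamiliar with the two measures, the following is a minimal sketch of how CS and MD can be computed over embedded association vectors. The random vectors, their dimensionality, and the covariance estimation over the pooled respondents are illustrative assumptions, not the manuscript’s actual data or pipeline:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two association vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mahalanobis_distance(u: np.ndarray, v: np.ndarray,
                         cov_inv: np.ndarray) -> float:
    """Mahalanobis distance between two vectors, given the inverse
    covariance matrix estimated over the whole vector population."""
    diff = u - v
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy population of embedded association vectors (hypothetical data;
# in practice, each respondent's association list would be embedded first).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 4))

# Unlike CS, MD needs a covariance estimate, here taken over the
# pooled vectors of all respondents.
cov_inv = np.linalg.inv(np.cov(vectors, rowvar=False))

cs = cosine_similarity(vectors[0], vectors[1])
md = mahalanobis_distance(vectors[0], vectors[1], cov_inv)
```

Note the design difference: CS is scale-invariant and bounded in [-1, 1], while MD accounts for the correlations and variances of the vector components, which is why the two measures complement each other.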
> 5. The authors should include the following references to strengthen the literature on applications of LLMs:
We thank the esteemed reviewer for supplying us with additional references. We have incorporated them into our manuscript’s literature review and discussion, as references [4] and [48] respectively.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper describes an experiment comparing associations produced by synthetic LLM personas and real people. The authors explore the ability of the LLMs to imitate human-like actions. They examine ChatGPT and YandexGPT, creating 4 virtual person profiles on each of the two LLMs. Each profile is described by a short background story, which in my subjective opinion is too superficial. The LLM was switched to each profile and presented its 10 one-word associations to each of 5 stimulus words: center, university, science, education, program. Similar tests were conducted for real persons of different ages and occupations. The association lists were compared using the cosine similarity measure for each possible pair of respondents (synthetic and real). Based on the results, the authors accept or reject hypotheses about the difference or similarity of human and synthetic associations and their properties.
The problem that the authors study is important and relevant in connection with the introduction of neural networks into everyday life. At the same time, the stated motivation associated with the depletion of texts produced by real people is somewhat strange. At a minimum, scientific articles in various fields of science are written constantly, not to mention blogs, social networks, etc. The question in the title of the article also looks superfluous. The article lacks a clear and strict description of the methods used with a declaration of input and output data and calculation formulas. No examples of compared data vectors are given. The methodology of the experiment raises questions. The experiment was conducted on data from a very narrowly limited area. All stimulus words are already associatively related to each other. The same list of associations, even from a real person, can be suitable for any of the stimulus words. Therefore, it is unclear whether the studied data set is representative enough to draw conclusions about the similarity or difference in the reactions of a person or a synthetic personality.
In the last section, the authors provide conclusions on accepting or rejecting the hypotheses about the ability of LLM to distinguish language tokens, about the difference or similarity of the results for people and LLM. It is clear that these conclusions are valid only for the narrow sample of data that was studied.
Author Response
We would like to thank the esteemed reviewers for the time and effort spent on considering our manuscript and for the recommendations that help improve our paper.
We have revised the manuscript, adding and modifying it according to the reviewers’ comments. The list of references has been modified accordingly.
Our responses to the reviewers’ comments follow.
R3
> Each profile is described by a short background story, which in my subjective opinion is too superficial.
We thank the esteemed reviewer for the interesting comment. We do not have a rigorous basis for a discussion here either, since we did not utilize different persona descriptions in the current study. However, we would like to note that the major source of individual differences in the human participants’ associations is clearly the emotional load, which in turn stems from prior life experience (see the new Table 3 in the manuscript, especially the unconventional “Queue” and “Circus” associations for the education stimulus word). With this in mind, one should expect that even longer specifications for the personas, with detailed bios and values, would make the contextual prompts more effective in creating artificial authors with diverse thesauri and writing styles.
> the stated motivation associated with the depletion of texts produced by real people is somewhat strange. At a minimum, scientific articles in various fields of science are written constantly, not to mention blogs, social networks, etc.
We thank the esteemed reviewer for this observation and agree that new texts and data are produced constantly. We have added this to the manuscript.
However, we would like to note that since the advent of deep learning models about a decade ago, their sizes (and thus the required amount of training data) have been increasing at an unprecedented rate, far exceeding the production of new information. For instance, over the last five years, frontier LLMs have grown from about 340 million parameters (BERT) to 175 billion parameters (GPT-3) to over a trillion parameters (GPT-4) [2. Data Quality May Be All You Need: Model size is not everything, 2024; 4. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration, Healthcare 2023]. This corresponds to an annual growth of about 400%. There are different estimations for the growth rate of online information and data, but most commonly they state that the volume doubles every 1 or 2 years, corresponding to an annual growth of just 41-100%. So, we believe that the statement from [5. Will we run out of data? Limits of LLM scaling based on human-generated data, ICML 2024] that “models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032” is plausible.
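The back-of-the-envelope arithmetic behind these growth figures can be checked directly. The parameter counts and the five-year span are the approximate values cited above, treated here as compound annual growth:

```python
# Approximate figures cited above: ~340M parameters (BERT) growing to
# over a trillion parameters (GPT-4) in roughly five years.
start_params = 340e6
end_params = 1e12
years = 5

# Compound annual growth factor of frontier model size.
annual_factor = (end_params / start_params) ** (1 / years)
annual_growth_pct = (annual_factor - 1) * 100   # roughly 400% per year

# Data volume doubling every 1 or 2 years, for comparison.
growth_1y_doubling = (2 ** (1 / 1) - 1) * 100   # 100% per year
growth_2y_doubling = (2 ** (1 / 2) - 1) * 100   # about 41% per year
```

With these inputs, model size grows roughly 5x per year (about 400%), while data volume grows 41-100% per year, which is the gap the response relies on.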
> The question in the title of the article also looks superfluous.
We thank the esteemed reviewer for the observation. To make the question in the title of our article look less superfluous, we have changed it to “Who Will Author the Synthetic Texts?”. We believe this accurately reflects the problem addressed in our work and at the same time hints to a broad audience that synthetic data generation is not straightforward.
> The article lacks a clear and strict description of the methods used with a declaration of input and output data and calculation formulas.
To make the description of the methods in our manuscript clearer and more rigorous, we have provided more information about the two employed similarity/distance measures, as well as the calculation formulas (in the new sub-section 2.4). We have also provided more detail on the calculation of the measures in our study (see 3.4.1).
> No examples of compared data vectors are given.
To better illustrate our data, as the esteemed reviewer suggests, we have added several examples. In the new Table 2, we provide the most distant associations produced by ChatGPT. In the new Table 3, we provide the most distant associations of the human participants, and also illustrate how the associations vary for the same participants across different stimuli and for different participants with the same stimuli. Also, appendices with more raw data will be made available upon the acceptance of the manuscript.
> The experiment was conducted on data from a very narrowly limited area. All stimulus words are already associatively related to each other. The same list of associations, even from a real person, can be suitable for any of the stimulus words. Therefore, it is unclear whether the studied data set is representative enough to draw conclusions about the similarity or difference in the reactions of a person or a synthetic personality. … It is clear that these conclusions are valid only for the narrow sample of data that was studied.
The esteemed reviewer is correct in noting that the stimulus words used in our study belong to a relatively narrow domain. However, we believe that this is justified, as the same is true of most topical textual content (cf. dictionaries, where terms from different domains have equal chances of being represented). Besides, our selection of the stimuli was performed via a corpus-driven approach applied to two real university websites featuring real textual content. Finally, we believe that the validity of our study design is supported by the outcome of testing the hypotheses H1-1 and H1-2: the effect of Is_different_stimulus was significant for both the models and the human participants (p < 0.001). This suggests that they could distinguish the different stimulus words well enough.
That said, we added a specific notice in the Discussion that our study has been performed for the stimuli and associations belonging to a specific domain.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made considerable efforts to address the comments, and I am satisfied with the revised manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors took into account the reviewer's comments, improving the presentation of the article material. The description of the mathematical model and the experimental part has been expanded. The question of the representativeness of the studied data remains open, since the stimulus words are associatively related to each other. It is recommended to take this into account in further research. I recommend this version of the article for publication as one of the pilot studies in this direction.