Article

Personality Emulation Utilizing Large Language Models

by Jack Kolenbrander 1,* and Alan J. Michaels 1,2

1 Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA
2 National Security Institute, Virginia Tech, Blacksburg, VA 24061, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6636; https://doi.org/10.3390/app15126636
Submission received: 22 May 2025 / Revised: 10 June 2025 / Accepted: 11 June 2025 / Published: 12 June 2025

Abstract
Fake identities have proven to be an effective methodology for conducting privacy and cybersecurity research; however, existing models are limited in their ability to interact with and respond to received communications. To perform privacy research in more complex Internet domains, withstand enhanced scrutiny, and persist long-term, fake identities must be capable of automatically generating responses while maintaining consistent behavior and personality. This work proposes a method for assigning personality to fake identities using the widely accepted psychometric Big Five model. Leveraging this model, the potential application of large language models (LLMs) to generate email responses that emulate human personality traits is investigated to enhance fake identity capabilities for privacy research at scale.

1. Introduction

The use of fake identities has proven to be an effective method for performing active open-source intelligence (OSINT) and privacy research [1,2]. The Use & Abuse project at the Virginia Tech National Security Institute has created and developed the architecture necessary to generate and deploy fake identities at scale [2,3]. Previous research has identified that account activity and communications are key indicators used to detect fake accounts on the Internet [4,5,6]. Current fake identity models, however, are largely stagnant after deployment to target websites. Therefore, to conduct privacy research with fake identities in more complex Internet domains and establish persistent and human-passable digital identities, it is necessary to develop a method to provide fake identities with the capability of actively responding to and interacting with received content. Additionally, to enhance credibility, these responses must be cohesive and realistic and must demonstrate consistent personality traits. The work performed in this paper investigates the potential application of large language models (LLMs) to generate personality-reflective email responses for fake identity applications.
Previous Use & Abuse (U&A) research has identified that activity and responses are key components of establishing an effective fake digital identity for privacy research [2]. The majority of communications received by fake identities arrive via email; however, identities currently do not generate or send any responses. This lack of capability greatly limits their long-term application as well as their ability to avoid detection in more scrutinized Internet domains. LLMs are one tool that has demonstrated the ability to efficiently generate written content [7,8,9] and email responses [10,11]. Leveraging LLMs to efficiently generate responses at scale for emails received by fake identities represents a potential solution that allows identities to interact automatically. To convincingly pass as human, the emails generated for each identity must maintain a cohesive personality and display consistent behavior across all chains and interactions.
Representing personality and generating personality-consistent responses for fake identities is an abstract challenge, particularly since personality is a complex and subjective concept [12]. Therefore, applying a scientific or psychological model commonly used to represent real human personality can offer a structured and reputable framework. Numerous psychometric models exist; the most popular and widely accepted are the Five Factor Model (Big Five model) [13], the Myers–Briggs Type Indicator model [14], and the VIA Character Strengths model [15]. Researchers have investigated the application and tailoring of personality for various automated communication platforms, including chatbots and large language models [12,16,17]. The exploration of using AI and LLMs to mimic personality traits in written content, however, remains limited.
The work in this paper applies the Big Five model, as the most widely researched model [18], to construct personality profiles for fake identities. These profiles are then used as input to an LLM to investigate its ability to generate emails emulating the specified personality traits. This research contributes to the long-term goal of the Use & Abuse project to provide fake identities with the ability to respond automatically to received content, enhancing their realism and allowing them to withstand increased scrutiny.

Ethical Implications of Fake Identities and LLMs

Using LLMs to mimic human beings and, more broadly, for active OSINT applications raises ethical concerns about potential misuse for harmful purposes such as deception and manipulation. Open-source intelligence techniques represent a unique field for ethical concerns: although the information involved is publicly accessible, negligent handling or malicious use can lead to significant consequences for individuals, organizations, and businesses. Researchers have identified gaps in the implementation of OSINT processes and have proposed frameworks and models to help address these shortcomings [19,20,21]. LLMs can also be misused to trick or mislead individuals at scales much larger than would be possible manually. Researchers have investigated ethical concerns surrounding LLMs for human–computer interactions and the responsible design and governance of their applications [22,23,24,25]. The work performed in this paper aims to provide fake identities with a prechosen personality and leverage LLMs to generate communications emulating that personality. Although there is potential for LLMs to be abused when interacting with individuals and businesses, the main goal of this research is to explore their application for the creation of consistent, convincing identities for privacy research.

2. Literature Review

2.1. Psychometric Personality Models

Human personality has been heavily studied, and a large number of models have been developed to represent it. Of these, the Big Five model (Five Factor Model) is the most widely accepted and researched [18], which is why it was chosen as the foundation for this work. The Big Five model breaks personality into five main traits: conscientiousness, agreeableness, neuroticism, extraversion, and openness to experience [13]. Popular alternative models include the Myers–Briggs Type Indicator (MBTI) model, which represents personality using four preference pairs: extraversion or introversion, sensing or intuition, thinking or feeling, and judging or perceiving [14]. The VIA Classification model analyzes 24 different character strengths to model individual personality [15]. The Enneagram model categorizes individuals into nine personality types based on their dominant tendencies and traits [26]. Each of these models attempts to characterize human personality differently, reflecting the nuanced and complex nature of “scoring” human personality.

2.2. Determining Big Five Characteristics from Online Data

Existing research has demonstrated that Big Five personality traits can be predicted from individuals’ online data. Understanding how researchers extract personality traits from Internet history can help identify the traits and components necessary for creating effective fake digital identities. One study of online social networks (OSNs) demonstrated that users’ personalities, especially the extraversion and neuroticism traits, can be predicted from online behavior data [27]. A second study of OSN data identified that individuals high in openness are more likely to express emotions in their online posts, while neurotic individuals are more likely to be reserved [28]. Another study demonstrated the ability to analyze posts and online activity to compare the personality trait levels of celebrities [29]. Researchers have also applied machine learning techniques, such as random forest regressors, to analyze OSN data and predict personality traits, clearly identifying and distinguishing between the personality traits of different user groups [30]. Existing research has thus shown that personality traits can be extracted from OSN behavior and language, highlighting that online personality is a key component of digital identity.

2.3. LLM Model Comparison

LLMs differ in training methodology, number of training parameters, use case, and performance. As models are constantly evolving, up-to-date research and comparisons are often unavailable. Furthermore, some companies, such as OpenAI and Anthropic, do not disclose all information about their models, which further limits comparisons. Researchers have developed benchmarks and tests to score and rate LLM capabilities. Popular benchmarks include the AI2 Reasoning Challenge (ARC), HellaSwag, Massive Multitask Language Understanding (MMLU), and TruthfulQA; however, an extensive number of benchmarks exist, tailored towards different purposes [31,32]. A benchmark specific to personality-based text generation does not currently exist; however, MMLU and HELM Lite are two benchmarks that evaluate the overall knowledge and performance capabilities of LLMs [33,34]. The MMLU benchmark rates LLM performance on 57 unique tasks spanning a wide variety of subject areas [33]. Similarly, HELM Lite performs a holistic analysis ranging from book and movie knowledge to medical exam questions [34]. WritingBench is a benchmark purposely designed to analyze an LLM’s writing abilities across six domains: Academic and Engineering, Finance and Business, Politics and Law, Literature and Art, Education, and Advertising and Marketing [35].
Multiple LLMs were employed to validate the generated response emails, as detailed in Section 3.3. The models were chosen based on novelty, popularity, and overall performance. In total, seven models were utilized: GPT-4.5, GPT-4o, DeepSeek R1, Gemma 3, Hermes 3, Llama 3, and Claude 3 Opus. These models represent a combination of open-source and proprietary LLMs, which can limit the ability to compare them head-to-head. Most of the models have been scored by the MMLU benchmark, while fewer have been scored by the HELM Lite and WritingBench benchmarks. Table 1 provides a comparison of LLM performance.

2.4. Personality Emulation and LLMs

Although limited, existing research has investigated the capability of LLMs to emulate personality and explored methods to improve this ability. One study of the GPT-4, Llama 4, and Mixtral models found that each model had a unique Big Five personality [38]. The GPT-4 and Mixtral models rated higher in agreeableness, conscientiousness, extraversion, and openness when compared to the Llama model [38]. Understanding the underlying personality of LLMs helps determine the areas in which those models may excel, as well as the likelihood of bias when attempting to mimic alternative personalities. Another study, of the GPT-3.5 Turbo model, explored its ability to conform to specific personality profiles and maintain a consistent personality throughout interactions [39]. The study found that the model more consistently maintained a personality high in all Big Five traits; when prompted to mimic a personality low in all Big Five traits, however, the model trended upward across all traits over time [39]. Similar research has focused on tailoring LLM responses and personas to individual users [40,41,42]. A third study found that different LLMs have different Myers–Briggs Type Indicator (MBTI) scores [43]. Notable models studied included ChatGPT-3.5 (ENTJ—extraverted, intuitive, thinking, judging), GPT-4 (INTJ—introverted, intuitive, thinking, judging), and OpenLlama_v2 (INFJ—introverted, intuitive, feeling, judging) [43]. These studies demonstrate that different models display different personality traits and that they are capable of emulating personality traits, particularly those typically considered more social or extroverted.
The work performed in this paper investigates the ability of LLMs to generate responses to email while emulating certain personality models. There is some similar existing research, but none specifically investigating the email use case. One similar study investigated the ability of LLMs to mimic personality while writing short stories [44]. This study found that there was overlap in linguistic behavior when comparing the GPT-3.5 and GPT-4 models investigated and human writing with the same Big Five personality [44]. The study also found that the accuracy of humans predicting personality trait levels varied across each trait and that knowledge of AI authorship decreased prediction accuracy [44]. In an investigation of the ability of LLMs to emulate personality for video game NPCs, researchers found that some LLM models were capable of aligning behaviors with psychometric values at up to 100% accuracy based on the International Personality Item Pool (IPIP) questionnaire [45]. In a study of open-source LLM models, researchers investigated the effect of different prompts on the LLMs’ ability to mimic personality [46]. This study found that LLMs more consistently demonstrated the personality assigned to them when provided with a role and an associated personality when compared to only being assigned personality traits or types [46]. Existing research has shown that LLMs display some ability to emulate personality traits; however, the accuracy varies based on trait, model, prompt, and application.

2.5. Fake Identities and the Use & Abuse Project

The work performed in this paper builds upon ongoing research of the Use & Abuse (U&A) of Personal Information project at the Virginia Tech National Security Institute. The Use & Abuse project has developed the capability to generate and deploy fake identities at scale [2], as well as the underlying architecture to collect and track all related communications [3], in order to support privacy research and active OSINT applications. Currently, the fake identities deployed by the project remain largely static; however, to persist long-term and increase credibility, these identities require the capability to interact with and respond automatically to received content. This paper investigates the use of the Big Five model in combination with LLMs to enable the automatic generation of responses that display consistent personality traits for fake identity applications.

3. Methodology

The experimental process for this work can be broken down into four main stages: creation of personality profiles, email creation and response generation, LLM personality ratings, and the human perception personality survey. Figure 1 provides a high-level overview of the experimental process. Personality profile creation refers to the process used to break down the Big Five personality model into different personality profiles that are used to guide the LLM on the traits to display when responding. Email creation and response generation refers to the process of designing a series of initial emails and gathering the responses generated by the LLM with various personality profiles. The LLM personality rating was the first step of validating the responses generated and focused on collecting information on how other LLMs perceive the personality traits of each individual response. Finally, the human perception survey sought to collect information on the perceived personalities from the human perspective. The personality profile creation process is outlined in Section 3.1, the email creation and response generation process is described in Section 3.2, the LLM personality validation process is described in Section 3.3, and the human perception survey overview is described in Section 3.4.

3.1. Personality Profile Creation

As mentioned previously, the Big Five personality model was selected as the basis for this experiment due to its straightforward structure and widespread acceptance in human psychology. To develop personality vectors, it was necessary to designate potential levels for each individual trait, which could then be varied to generate the set of personality vectors. For this experiment, each trait was designated either high or low, giving 2⁵ = 32 possible combinations of personality traits. A case with all neutral traits was also created, resulting in 33 unique personality vectors. Table 2 provides an overview of three potential personality vectors. These personality trait vectors are utilized as input to ChatGPT’s 4o model when generating personality-specific responses, as further described in Section 3.2.
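As a concrete illustration, the 33 vectors can be enumerated in a few lines of Python; this is a hypothetical sketch (the project’s actual tooling is not published), but it reproduces the construction described above.

```python
from itertools import product

# The five Big Five traits, in the order used throughout this paper.
TRAITS = ["conscientiousness", "agreeableness", "neuroticism",
          "extraversion", "openness"]

# 2^5 = 32 high/low combinations, plus a single all-neutral case = 33 vectors.
vectors = [dict(zip(TRAITS, levels))
           for levels in product(["high", "low"], repeat=5)]
vectors.append({trait: "neutral" for trait in TRAITS})

assert len(vectors) == 33
```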

3.2. Email Creation and Response Generation

3.2.1. Email Base Set Creation

The second step in the experimental process was the creation of a base set of emails and the generation of LLM responses to those emails. For this experiment, three initial emails were written to represent three common email use cases: a professional project email, a phishing email, and a love letter email. These use cases were chosen to evaluate the LLM’s ability to emulate personality across three unique email domains requiring differing levels of emotion. The professional email, outlined in Figure 2, represents an email from a project manager to an employee asking for a project status update. The phishing email, provided in Figure 3, attempts to trick an individual into clicking a malicious link by informing them that suspicious activity has been detected on their account. The love letter email, included in Figure 4, consists of an individual expressing their personal feelings towards another individual. Through the utilization of a diverse base set of emails, the LLM’s ability to mimic personality in various use cases can be investigated.

3.2.2. Response Generation and Collection

The next step was leveraging an LLM to generate responses, utilizing the base set emails and a personality vector as input. For this experiment, OpenAI’s GPT-4o model [47] was selected because it is the most popular LLM [32] and performs well across a wide variety of use cases. To efficiently query the model and ensure that the LLM was provided with the same input and phrasing for each response generation, a Python script was created. The script first formulates a query consisting of a description of the Big Five traits, the desired personality vector, and the initial email the LLM is responding to. This query is then passed to the OpenAI API [48], and the script automatically stores the response in a text file. Each email is assigned a unique ID, which is stored alongside the corresponding personality vector in a CSV file for future mapping and analysis. An overview of this process is provided in Figure 5. In total, 99 responses were generated: the 32 possible combinations of high and low Big Five traits, plus an all-neutral vector, for each of the three base emails. An example response to the project email from a personality high in all traits is provided in Figure 6.
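A minimal sketch of this generation pipeline is given below. It assumes the `openai` Python package with an `OPENAI_API_KEY` environment variable; the prompt wording, file layout, and the `vectors` list from the earlier sketch are illustrative assumptions rather than the authors’ exact script.

```python
import csv
import uuid
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed placeholder: the trait description supplied to the model.
BIG_FIVE_DESCRIPTION = "The Big Five model describes personality along five traits: ..."

def generate_response(base_email: str, vector: dict) -> str:
    """Ask GPT-4o to answer base_email while portraying the given trait levels."""
    trait_text = ", ".join(f"{t}: {lvl}" for t, lvl in vector.items())
    prompt = (
        f"{BIG_FIVE_DESCRIPTION}\n\n"
        f"Write a response to the email below as a person whose Big Five trait "
        f"levels are: {trait_text}.\n\nEmail:\n{base_email}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Store each response under a unique ID and record the ID-to-vector mapping.
base_email = open("emails/project.txt").read()  # assumed location of one base email
for vector in vectors:                           # vectors from the earlier sketch
    email_id = str(uuid.uuid4())
    with open(f"responses/{email_id}.txt", "w") as f:
        f.write(generate_response(base_email, vector))
    with open("mapping.csv", "a", newline="") as f:
        csv.writer(f).writerow([email_id, *vector.values()])
```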

3.3. LLM Personality Ratings

The first method utilized for validating the responses was querying other LLMs to rate each email’s perceived traits. By querying other LLMs, it can be determined whether other AI models are capable of identifying personality characteristics from written emails or broader text. In total, seven LLMs were queried: OpenAI’s GPT-4.5 and GPT-4o [47] models, DeepSeek’s R1 [49] model, Nous Research’s Hermes 3 [50] model, Google’s Gemma 3 [51] model, Meta’s Llama 3 [52] model, and Anthropic’s Claude 3 Opus [53] model. By utilizing models from different organizations and companies, it can be identified whether one model outperforms the others. The OpenAI models were queried using the OpenAI API [48]; the DeepSeek R1 and Hermes 3 models used Lambda Labs’ [54] API; the Gemma 3 and Llama 3 models were queried locally through Ollama’s Python library [55]; and the Claude 3 Opus model was queried through Anthropic’s API [56]. An overview of each model, as well as how it was queried, is provided in Table 3.
A script was developed to query each LLM to ensure that each was provided with the same input information to classify the responses. Each query provides the model with context, the original email, and the response email. The model is then prompted to rate its perception of each Big Five trait on a scale from 0 to 100. As in the human survey, the LLMs were not provided with any information about the distribution of responses or the fact that responses were generated with high, neutral, or low levels of each trait. Rather than asking the LLM to classify each response as high, neutral, or low, the 0-to-100 scale was used to allow more granular and nuanced data to be collected. Additionally, by collecting numeric data, it can be determined whether certain LLMs are predisposed to rating traits higher or lower than others. The results of the LLM validation process are presented in Section 4.1.
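The sketch below shows one plausible form such a rating query could take for the OpenAI-hosted models (the other providers expose analogous chat APIs). The instruction wording and the JSON output format are assumptions; the exact query text is not published.

```python
import json
from openai import OpenAI

client = OpenAI()

RATING_INSTRUCTIONS = (
    "You will be shown an original email and a response to it. Rate the Big Five "
    "personality traits displayed by the response's author on a scale from 0 to 100. "
    'Answer only with JSON, e.g. {"conscientiousness": 70, "agreeableness": 55, '
    '"neuroticism": 30, "extraversion": 60, "openness": 45}.'
)

def rate_response(original: str, response: str, model: str = "gpt-4o") -> dict:
    """Ask a judge LLM for 0-100 ratings of each Big Five trait."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RATING_INSTRUCTIONS},
            {"role": "user",
             "content": f"Original email:\n{original}\n\nResponse email:\n{response}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)
```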

3.4. Human Perception Survey

The second method employed for response validation was a human perception survey, in which individuals were asked to rate individual email responses based on their perception of the personality showcased. Prior to completing the survey, individuals were given an overview presentation of the Big Five model and the common characteristics associated with high and low levels of each personality trait. Due to time constraints, the survey covered a random sample of 37 of the 99 emails. Individuals were asked to rate the personality traits of each email as low, neutral, or high. An example question from the survey is included in Figure 7. In total, responses were gathered from 11 individuals.

4. Results

Validation data on the GPT-4o model’s ability to emulate personality was collected from both the LLM queries and the human perception survey, described in Section 3.3 and Section 3.4, respectively. Although the generated responses were created using the polar cases of high and low, the LLMs were prompted to provide a rating from 0 to 100 for each trait, while humans were asked to rate each email as low, neutral, or high. By mapping scores of 33 or below to low, scores between 33 and 66 to neutral, and scores of 66 or above to high, the LLMs achieved an average accuracy rate of 42.94%. This suggests that the model does not embed personality traits into the email messages in a way that is readily detectable by other LLMs. While this initial mapping is somewhat arbitrary, it serves as a baseline method for categorizing the LLMs’ continuous scores into three discrete classes; more refined and systematic approaches are explored in the normalization sections (Section 4.1.2 and Section 4.1.3). After normalization of the LLM data, overall average accuracy increased to 66.72%. The human perception survey resulted in an overall average accuracy of 46.48%, which is higher than the non-normalized accuracy of the LLMs. After normalization, however, the LLMs performed 43.5% better than the human data and 55.4% better than the non-normalized LLM data. The detailed LLM results are discussed in Section 4.1, and the human perception survey results are described in Section 4.2.

4.1. LLM Validation Results

Three methods for analysis of the LLM scoring data are presented: no normalization, normalization with standard deviation, and normalization with median. Analysis of the non-normalized LLM data, described in Section 4.1.1, revealed that the LLMs tended not to score traits at either extreme, resulting in low overall accuracy. To address this bias, two normalization approaches were attempted and are described in Section 4.1.2 and Section 4.1.3. As mentioned, normalization of the LLM data resulted in an overall increase from 42.94% to 66.72% average accuracy.

4.1.1. Non-Normalized LLM Data

Utilizing no normalization, the average overall accuracy of all LLM models was 42.94%, with a 95% confidence interval range of 39.82% to 46.07%. For this analysis, an LLM score of 0 to 33 was designated low, a score greater than 33 to less than 66 was designated neutral, and a score 66 or greater was designated high. These scores were compared to the true low, neutral, and high trait levels of the emails analyzed to calculate prediction accuracy. The Claude 3 Opus model achieved the highest accuracy of 46.67%, while the Gemma 3 model had the lowest overall accuracy of 39.39%. When comparing individual traits, the LLM models as a whole performed substantially worse on extraversion, scoring only 29.72%, with the next lowest being openness at 43.14%. The accuracy of each model by trait and overall is presented in Table 4.
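For reference, the fixed-threshold mapping and the accuracy computation can be expressed as follows. The treatment of scores exactly at 33 and 66 and the use of a normal-approximation confidence interval are assumptions consistent with the ranges reported above.

```python
import math

def to_label(score: float) -> str:
    """Fixed thresholds used in the non-normalized analysis."""
    if score <= 33:
        return "low"
    if score < 66:
        return "neutral"
    return "high"

def accuracy_with_ci(predicted: list[str], truth: list[str]) -> tuple[float, float, float]:
    """Accuracy plus a 95% normal-approximation confidence interval (assumed method)."""
    n = len(truth)
    p = sum(a == b for a, b in zip(predicted, truth)) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width
```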
As shown previously in Table 3, each LLM model is trained on a specific number of parameters. For open-source models, this count is typically publicly available. For proprietary models such as GPT-4.5 and GPT-4o, however, parameter count is not disclosed. Table 5 provides a comparison of parameter count and LLM model accuracy. As demonstrated, for the models with known parameter counts, there is no clear correlation between parameters and accuracy for detecting personality. The highest-performing models, which are both proprietary models, were GPT-4.5 and Claude 3 Opus.
Although the generated emails were meant to represent polar sides of each trait, no LLM scored any trait at a level of zero. Outside of neuroticism, where low meant a more stable individual, only the GPT-4o model provided a score less than 20. The violin plot shown in Figure 8 provides an overview of the distribution of scoring for each trait and model. LLMs have been shown to reflect the cultural values and biases of the language they were trained on [57], which likely led the models to score personality more centrally. As shown in the plot, models were less likely to provide lower ratings, especially for conscientiousness and agreeableness. For extraversion, the distribution of scoring is more uniform; however, the majority of scores land between 20 and 80. For neuroticism, a lower score meant the response portrayed a more emotionally stable individual. Overall, models rarely provided scores at either extreme, especially lower scores. This suggests that either the responses did not emulate extremes of any trait or that the models exhibited a bias toward scoring traits more favorably. The high and low scores by trait for each model are provided in Table 6. Additionally, confusion matrices created from the combined scoring of all LLMs are provided in Figure 9.

4.1.2. Normalization with Standard Deviation

To address the models’ tendency to avoid ratings at either extreme, two normalization methods were tested. The first adjusted the score classification range using the standard deviation of each individual trait for each model. The formula used to create the classification range is provided in Equation (1), where μ represents the mean score for each trait, σ the standard deviation, and N an adjustable scaling value; the range can therefore be shifted by varying N. Two approaches were used: a static value of N applied uniformly across all traits and models, and a dynamic value of N calculated for each trait so that one-third of scores are classified as low, neutral, and high.
$$\text{Classification}(x) = \begin{cases} \text{Low} & \text{if } x < \mu - N\sigma \\ \text{High} & \text{if } x > \mu + N\sigma \\ \text{Neutral} & \text{otherwise} \end{cases} \tag{1}$$
For static values of N, values of 0.25, 0.5, 0.75, and 1 were used. A lower value of N corresponds to a smaller neutral range, causing more scores to be classified as either high or low. Since the majority of responses consisted of traits that were either high or low (apart from the three all-neutral emails), a smaller value of N therefore yields better accuracy for each trait, and N = 1 results in the lowest accuracy values. Using a dynamic value of N calculated for each model and trait results in an accuracy similar to that of N = 0.25. With a dynamic value of N, the average accuracy across all models was 60.84%, with a 95% confidence interval of 57.9% to 63.78%. The value of N was calculated for each trait, utilizing the mean and standard deviation to determine the value that would classify one-third of scores as low, neutral, and high, respectively. A comparison of the average accuracy across all models for all five traits is provided in Figure 10.
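The classification rule of Equation (1), together with one plausible grid search for the dynamic value of N (the exact procedure used is not specified), can be sketched as follows.

```python
import statistics

def classify_sd(scores: list[float], n_value: float) -> list[str]:
    """Equation (1): thresholds at mu - N*sigma and mu + N*sigma for one model/trait pair."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    labels = []
    for x in scores:
        if x < mu - n_value * sigma:
            labels.append("low")
        elif x > mu + n_value * sigma:
            labels.append("high")
        else:
            labels.append("neutral")
    return labels

def dynamic_n(scores: list[float], steps: int = 200) -> float:
    """Grid-search the N that most evenly splits scores into thirds (assumed method)."""
    target = len(scores) / 3

    def imbalance(n_value: float) -> float:
        labels = classify_sd(scores, n_value)
        return sum(abs(labels.count(c) - target) for c in ("low", "neutral", "high"))

    return min((i / steps for i in range(1, steps + 1)), key=imbalance)
```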
More granular analysis of the dynamic N scenario reveals how the LLMs mispredicted various traits. Figure 11 provides an overview of the proportion of predictions by each LLM that were completely correct, near misses, or complete errors. The green highlighted column represents the proportion of cases predicted correctly, the yellow columns represent predictions one label off, and the red columns represent complete misclassifications. Under dynamic standard-deviation normalization, nearly all models correctly predicted roughly one-third of the high and low responses, corresponding to the normalization tactic of placing one-third of scores in each classification category. Compared to other traits, models were less likely to make near-miss errors when predicting neuroticism, reflected by the low proportion of predictions in the yellow “Near Miss” column. The proportion of complete error predictions, represented in the red column, varied widely across models and traits.

4.1.3. Normalization Utilizing Median

The second normalization tactic ignored the neutral cases and classified anything above the median value by trait as high, and anything below as low. This approach recognizes that 96 of the 99 generated emails represented polar cases, while only 3 consisted of neutral traits; the neutral cases were therefore removed from consideration. With this normalization tactic, an overall average accuracy of 68.81% was achieved, with a 95% confidence interval of 67.77% to 69.85%. The highest-performing model was Gemma 3 with an accuracy of 70.21%, closely followed by DeepSeek R1 at 70%. Notably, Gemma 3 was the model with the lowest overall accuracy without normalization. The lowest-performing model, Hermes 3, still achieved an overall accuracy of 67.29%. The accuracy by trait and model is provided in Figure 12. Conscientiousness stood out as the trait predicted least accurately, while neuroticism was predicted most accurately.
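A sketch of the median rule follows; the treatment of scores exactly at the median is an assumption, and neutral ground-truth emails are assumed to have been removed beforehand.

```python
import statistics

def classify_median(scores: list[float]) -> list[str]:
    """Binary split per trait: above the median is high, otherwise low.
    Ties at the median are treated as low (an assumption; the paper does not say)."""
    med = statistics.median(scores)
    return ["high" if x > med else "low" for x in scores]
```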
After normalization using the median, the only wrong predictions are true misclassifications. Figure 13 provides a breakdown of the proportions of correct and incorrect predictions by LLM. Across all models, conscientiousness and extraversion were predicted incorrectly at the highest rates, likely because these traits are the most difficult to perceive through written text. The LLMs likely based agreeableness scores on the responses’ agreement or disagreement with the original email, leading to higher accuracy for this trait. Additionally, for all traits apart from neuroticism (which is reversed, with low indicating an emotionally stable individual), models were more likely to misclassify a low trait as high.

4.2. Human Perception Validation Results

The results of the human perception survey allow for an analysis of how well humans were able to identify personality traits in the email responses generated by the GPT-4o model. As mentioned, 11 individuals rated each Big Five trait for a random sample of 37 different emails. Individuals were asked to rate their perception of each trait as high, neutral, or low, without being provided with any information about the distribution of personality traits within the email responses. Overall, humans were 46.48% accurate in trait-level identification across all traits, with a 95% confidence interval of 44.32% to 48.64%. Across the 11 individuals, accuracy ranged from 37.1% to 61.82%. Accuracy by individual Big Five trait is provided in Table 7.
The confusion matrices in Figure 14 compare how individuals rated the levels of personality traits against the true level of each trait. The total number of entries for each matrix is 407, representing the 11 individuals who rated the trait across 37 emails. For agreeableness, shown in Figure 14a, individuals rated responses high 61.9% of the time; however, of the responses they rated high, 39.7% were designated low in agreeableness. Similarly, participants rated 73.5% of responses high in conscientiousness, yet 42.8% of those were designated low. Individuals tended to rate these traits higher, suggesting that the responses did not effectively portray low agreeableness and conscientiousness.
The confusion matrix for extraversion, the lowest-accuracy trait for participants, is shown in Figure 14c. Individuals rated 81.2% of the truly high responses as either high or neutral and 76.3% of the truly low responses as either low or neutral. This suggests that individuals were generally able to determine that the true high and low responses were not the opposite extreme; however, a subset of responses was perceived as ambiguous. For neuroticism, shown in Figure 14d, individuals rated 50.6% of responses as low, 19.7% as neutral, and 29.5% as high. This distribution suggests that participants were generally able to identify the true low cases; however, the cases of high neuroticism were not clearly perceived. Finally, for openness to experience, participants were approximately equally likely to correctly identify responses high in openness (46.7% correct) as those that were low (45.6% correct). They were also equally likely to misclassify the true trait as its opposite, which occurred 28.6% of the time for true low responses and 28.1% of the time for true high responses. This suggests that a subset of the responses did not clearly portray the characteristics of a highly open personality.
Overall, individuals achieved a higher accuracy rate than would be expected from random guessing; however, the results suggest that the levels of some traits, such as agreeableness and conscientiousness, were more difficult for individuals to perceive accurately in the responses. For extraversion, neuroticism, and openness to experience, individuals were generally able to correctly identify a subset of cases as true high or true low. However, the results also suggest that a subset of emails was more ambiguous, making correct classification by participants more difficult.

4.3. Overall Validation Results

Overall, after normalization, the LLMs performed substantially better than the human perception survey. The non-normalized LLM data, however, performed similarly to the human data for all traits except extraversion, where the LLMs performed substantially worse. Of the two normalization tactics, the median-based approach achieved the highest overall accuracy (68.81%), while the standard-deviation approach with a dynamic N value reached 60.84%. The need for normalization suggests that the LLMs tended to avoid providing scores at either extreme. The comparison of the LLM and human survey results is provided in Figure 15.

5. Conclusions

The work performed in this paper and the data collected suggest that LLMs are capable of emulating personality in a single email-based text to an extent; however, improvement is required before long-term fake identity applications. Although the generated emails were designed to represent polar personality trait cases, the LLM validation results revealed that the models tended to avoid scoring traits at either extreme. This suggests that either the responses generated did not effectively portray extreme personality traits or that the LLMs avoided providing polarizing scores. Analysis of the human results, however, revealed that humans frequently misclassified high or low trait levels as the opposite. This further emphasizes that the generated responses did not emulate the designated personality traits clearly enough for humans to perceive.
There are several opportunities for future work, further described in Section 6, that could increase LLMs’ ability to emulate personality. These include the use of an alternative psychometric model and the use or creation of alternative LLMs. Overall, the results of this paper highlight that LLMs possess a limited capability to generate written content reflective of human personality traits. To effectively emulate personality for long-term privacy research using fake identities, LLMs must improve their ability to emulate human personality traits, as well as demonstrate the ability to maintain a consistent personality across extended conversations and cross-platform interactions.

6. Future Work

  • Investigation and formulation of the ethical model for fake identity OSINT and LLM applications
As mentioned, the work performed in this paper, along with the broader application of fake identities for privacy research, is in support of the Use & Abuse research project, which aims to perform privacy research ethically. The development of a more robust and practical ethical framework for applying fake identities would help mitigate ethics-related concerns. The U&A project is currently conducting an in-depth analysis of the ethics of active OSINT to develop a quantitative and qualitative model for the broader application of OSINT techniques.
  • Analysis of alternative personality models
The work performed in this paper attempted to emulate personality based on the Big Five model. One opportunity for future work is investigation into alternative personality models to determine if another model offers better results. The Big Five model is fairly limited in that it breaks personality into five broad categories, whereas a more granular model may allow for an LLM to better comprehend the personality traits it is attempting to emulate. Alternative personality models for investigation include the Myers–Briggs Type Indicator (MBTI), VIA Classification, and the Enneagram model. Additionally, there is some existing literature that models an individual’s writing personality [58], which could help LLMs more aptly mimic an individual’s writing style.
  • Analysis of alternative LLM Models
The responses analyzed in this experiment were generated by a single LLM: OpenAI’s GPT-4o. As mentioned previously, this model was chosen due to its popularity and wide-scale applications. There are, however, numerous alternative LLMs, each developed and trained by different companies and organizations. Evaluating responses from other LLMs would allow for comparison and help determine whether other models are more capable of generating responses that clearly emulate personality traits. A more complex avenue for future expansion, which would likely yield the best results, is the design and training of an LLM specifically tailored to the purpose of generating personality-mimicking text. Overall, investigation into alternative LLMs could help identify a model better suited to the intended use case of this work.
  • Investigation of LLM capability in extended conversations and alternative communication contexts
In this work, evaluation was limited to one-time email-based responses across three different contexts: professional, phishing, and romantic. These scenarios, however, represent only a limited subset of the areas where fake identities may be applied. Alternative areas include social media websites, customer support interactions, employment applications, and casual texting or email communications. In addition, typical email interactions involve multiple back-and-forth exchanges rather than a single response. To be effective, LLMs must be capable of demonstrating consistent personality traits throughout extended conversations, as well as across alternative applications and platforms. The average individual interacts in many different areas of the Internet, and an effective fake identity would be required to do the same. For fake identity applications, future investigation into extended email chains and the generation of responses for other communication formats is required.

Author Contributions

Conceptualization, J.K. and A.J.M.; methodology, J.K.; software, J.K.; investigation, J.K.; writing—original draft preparation, J.K.; writing—review and editing, A.J.M.; supervision, A.J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Commonwealth Cyber Initiative, an investment in the advancement of cyber R&D, innovation, and workforce development. For more information about CCI, visit https://cyberinitiative.org/. Additional support was also received from the VT National Security Institute’s Spectrum Dominance Division. Additionally, this material is based upon work supported by the National Science Foundation under Grant Number 1946493. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elovici, Y.; Fire, M.; Herzberg, A.; Shulman, H. Ethical Considerations when Employing Fake Identities in Online Social Networks for Research. Sci. Eng. Ethics 2014, 20, 1027–1043. [Google Scholar] [CrossRef]
  2. Kolenbrander, J.; Husmann, E.; Henshaw, C.; Rheault, E.; Boswell, M.; Michaels, A.J. Use & Abuse of Personal Information, Part II: Robust Generation of Fake IDs for Privacy Experimentation. J. Cybersecur. Priv. 2024, 4, 546–571. [Google Scholar] [CrossRef]
  3. Rheault, E.; Nerayo, M.; Leonard, J.; Kolenbrander, J.; Henshaw, C.; Boswell, M.; Michaels, A.J. Use and Abuse of Personal Information, Part I: Design of a Scalable OSINT Collection Engine. J. Cybersecur. Priv. 2024, 4, 572–593. [Google Scholar] [CrossRef]
  4. Gurajala, S.; White, J.S.; Hudson, B.; Matthews, J.N. Fake Twitter Accounts: Profile Characteristics Obtained Using an Activity-Based Pattern Detection Approach. In Proceedings of the 2015 International Conference on Social Media & Society, Toronto, ON, Canada, 27–29 July 2015. [Google Scholar] [CrossRef]
  5. Khaled, S.; El-Tazi, N.; Mokhtar, H.M.O. Detecting Fake Accounts on Social Media. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3672–3681. [Google Scholar] [CrossRef]
  6. Elyusufi, Y.; Elyusufi, Z.; Kbir, M.A. Social Networks Fake Profiles Detection Based on Account Setting and Activity. In Proceedings of the 4th International Conference on Smart City Applications, Casablanca, Morocco, 2–4 October 2019. [Google Scholar] [CrossRef]
  7. Wu, Y. Large Language Model and Text Generation. In Natural Language Processing in Biomedicine: A Practical Guide; Xu, H., Demner Fushman, D., Eds.; Springer International Publishing: Cham, Switzerland, 2024; pp. 265–297. [Google Scholar] [CrossRef]
  8. Mo, Y.; Qin, H.; Dong, Y.; Zhu, Z.; Li, Z. Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm. arXiv 2024, arXiv:2405.06652. [Google Scholar]
  9. Yuan, A.; Coenen, A.; Reif, E.; Ippolito, D. Wordcraft: Story Writing With Large Language Models. In Proceedings of the 27th International Conference on Intelligent User Interfaces, Helsinki, Finland, 22–25 March 2022; pp. 841–852. [Google Scholar] [CrossRef]
  10. Wallwork, A. Using Large Language Models to Improve, Correct and Generate Your Emails. In English for Academic Research: Grammar Exercises; Springer: Cham, Switzerland, 2024; pp. 191–210. [Google Scholar] [CrossRef]
  11. Thiergart, J.; Huber, S.; Übellacker, T. Understanding Emails and Drafting Responses—An Approach Using GPT-3. arXiv 2021, arXiv:2102.03062. [Google Scholar]
  12. Ait Baha, T.; El Hajji, M.; Es-Saady, Y.; Fadili, H. The Power of Personalization: A Systematic Review of Personality-Adaptive Chatbots. SN Comput. Sci. 2023, 4, 661. [Google Scholar] [CrossRef]
  13. McCrae, R.R.; John, O.P. An Introduction to the Five-Factor Model and Its Applications. J. Pers. 1992, 60, 175–215. [Google Scholar] [CrossRef]
  14. The Myers & Briggs Foundation. Myers-Briggs Overview. 2023. Available online: https://www.myersbriggs.org/my-mbti-personality-type/myers-briggs-overview/ (accessed on 15 May 2025).
  15. The VIA Institute. VIA Character Strengths Survey & Character Reports: Via Institute. Available online: https://www.viacharacter.org/ (accessed on 15 May 2025).
  16. Sutcliffe, R. A Survey of Personality, Persona, and Profile in Conversational Agents and Chatbots. arXiv 2023, arXiv:2401.00609. [Google Scholar]
  17. Yu, B.; Kim, J. Personality of AI. In Artificial Intelligence and Soft Computing; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Springer: Cham, Switzerland, 2025; pp. 244–252. [Google Scholar]
  18. Li, T.; Yan, X.; Li, Y.; Wang, J.; Li, Q.; Li, H.; Li, J. Neuronal Correlates of Individual Differences in the Big Five Personality Traits: Evidences from Cortical Morphology and Functional Homogeneity. Front. Neurosci. 2017, 11, 414. [Google Scholar] [CrossRef]
  19. van der Woude, M.; Dodds, T.; Torres, G. The ethics of open source investigations: Navigating privacy challenges in a gray zone information landscape. Journalism 2024, 1–19. [Google Scholar] [CrossRef]
  20. Hribar, G.; Podbregar, I.; Ivanuša, T. OSINT: A “Grey Zone”? Int. J. Intell. Counterintell. 2014, 27, 529–549. [Google Scholar] [CrossRef]
  21. Rahman, M.S. The Art of Open Source Intelligence (OSINT): Addressing Cybercrime, Opportunities, and Challenges. 2025. Available online: https://ssrn.com/abstract=5281845 (accessed on 15 May 2025).
  22. Desai, S.; Dubiel, M.; Zargham, N.; Mildner, T.; Spillner, L. Personas Evolved: Designing Ethical LLM-Based Conversational Agent Personalities. arXiv 2025, arXiv:2502.20513. [Google Scholar]
  23. Sun, G.; Zhan, X.; Such, J. Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents. In Proceedings of the 6th ACM Conference on Conversational User Interfaces, Luxembourg, 8–10 July 2024. [Google Scholar] [CrossRef]
  24. Kapania, S.; Wang, R.; Li, T.J.J.; Li, T.; Shen, H. ’I’m Categorizing LLM as a Productivity Tool’: Examining Ethics of LLM Use in HCI Research Practices. Proc. ACM Hum.-Comput. Interact. 2025, 9, 1–26. [Google Scholar] [CrossRef]
  25. Wang, T.; Tao, M.; Fang, R.; Wang, H.; Wang, S.; Jiang, Y.E.; Zhou, W. AI PERSONA: Towards Life-long Personalization of LLMs. arXiv 2024, arXiv:2412.13103. [Google Scholar]
  26. The Enneagram Institute. 2024. Available online: https://www.enneagraminstitute.com/ (accessed on 4 June 2025).
  27. Bai, S.; Zhu, T.; Cheng, L. Big-Five Personality Prediction Based on User Behaviors at Social Network Sites. arXiv 2012, arXiv:1204.4809. [Google Scholar]
  28. Sitaraman, G.; Cock, M.d. Inferring Big 5 Personality from Online Social Networks. Master’s Thesis, University of Washington Libraries, Washington, DC, USA, 2014. [Google Scholar]
  29. Dutta, K.; Singh, V.K.; Chakraborty, P.; Sidhardhan, S.K.; Krishna, B.S.; Dash, C. Analyzing Big-Five Personality Traits of Indian Celebrities Using Online Social Media. Psychol. Stud. 2017, 62, 113–124. [Google Scholar] [CrossRef]
  30. Karanatsiou, D.; Sermpezis, P.; Gruda, D.; Kafetsios, K.; Dimitriadis, I.; Vakali, A. My Tweets Bring All the Traits to the Yard: Predicting Personality and Relational Traits in Online Social Networks. ACM Trans. Web 2022, 16. [Google Scholar] [CrossRef]
  31. Ivanov, T.; Penchev, V. AI Benchmarks and Datasets for LLM Evaluation. arXiv 2024, arXiv:2412.01020. [Google Scholar]
  32. Vardhan, H. Top 10 LLM Models. 2024. Available online: https://medium.com/@harsh.vardhan7695/top-10-llms-model-e4d8c2c440bd (accessed on 15 May 2025).
  33. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2021, arXiv:2009.03300. [Google Scholar]
  34. Liang, P.; Mai, Y.; Somerville, J.; Kaiyom, F.; Lee, T.; Bommasani, R. Helm Lite V1.0.0: Lightweight and Broad Capabilities Evaluation. 2023. Available online: https://crfm.stanford.edu/2023/12/19/helm-lite.html (accessed on 15 May 2025).
  35. Wu, Y.; Mei, J.; Yan, M.; Li, C.; Lai, S.; Ren, Y.; Wang, Z.; Zhang, J.; Wu, M.; Jin, Q.; et al. WritingBench: A Comprehensive Benchmark for Generative Writing. arXiv 2025, arXiv:2503.05244. [Google Scholar]
  36. Wang, Y.; Ma, X.; Zhang, G.; Ni, Y.; Chandra, A.; Guo, S.; Ren, W.; Arulraj, A.; He, X.; Jiang, Z.; et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv 2024, arXiv:2406.01574. [Google Scholar]
  37. Stanford. A Holistic Framework for Evaluating Foundation Models. Holistic Evaluation of Language Models (HELM), Center for Research on Foundation Models. 2024. Available online: https://crfm.stanford.edu/helm/lite/latest/ (accessed on 15 May 2025).
  38. Sorokovikova, A.; Fedorova, N.; Rezagholi, S.; Yamshchikov, I.P. LLMs Simulate Big Five Personality Traits: Further Evidence. arXiv 2024, arXiv:2402.01765. [Google Scholar]
  39. Frisch, I.; Giulianelli, M. LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models. arXiv 2024, arXiv:2402.02896. [Google Scholar]
  40. Bo, J.Y.; Xu, T.; Chatterjee, I.; Passarella-Ward, K.; Kulshrestha, A.; Shin, D. Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering. arXiv 2025, arXiv:2505.04260. [Google Scholar]
  41. Sun, R.; Li, X.; Akella, A.; Konstan, J.A. Multi-Prompting Scenario-based Movie Recommendation with Large Language Models: Real User Case Study. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025. [Google Scholar] [CrossRef]
  42. He, J.Z.Y.; Pandey, S.; Schrum, M.L.; Dragan, A. Context Steering: Controllable Personalization at Inference Time. arXiv 2025, arXiv:2405.01768. [Google Scholar]
  43. Pan, K.; Zeng, Y. Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models. arXiv 2023, arXiv:2307.16180. [Google Scholar]
  44. Jiang, H.; Zhang, X.; Cao, X.; Breazeal, C.; Roy, D.; Kabbara, J. PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 3605–3627. [Google Scholar] [CrossRef]
  45. Klinkert, L.J.; Buongiorno, S.; Clark, C. Evaluating the Efficacy of LLMs to Emulate Realistic Human Personalities. In Proceedings of the Twentieth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Lexington, KY, USA, 18–22 November 2024; Volume 20, pp. 65–75. [Google Scholar] [CrossRef]
  46. La Cava, L.; Tagarelli, A. Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities through Open Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, Pennsylvania, 25 February–4 March 2025; Volume 39, pp. 1355–1363. [Google Scholar] [CrossRef]
  47. OpenAI. Hello gpt-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o (accessed on 15 May 2025).
  48. OpenAI. API Platform. 2025. Available online: https://openai.com/api (accessed on 15 May 2025).
  49. DeepSeek. DeepSeek. 2025. Available online: https://www.deepseek.com/ (accessed on 15 May 2025).
  50. NOUS RESEARCH. Hermes 3. 2024. Available online: https://nousresearch.com/hermes3/ (accessed on 15 May 2025).
  51. Farabet, C. Introducing Gemma 3: The Most Capable Model You Can Run on a Single GPU or TPU. 2025. Available online: https://blog.google/technology/developers/gemma-3/ (accessed on 15 May 2025).
  52. Ollama. Llama 3.3. 2025. Available online: https://ollama.com/library/llama3.3 (accessed on 15 May 2025).
  53. Anthropic. Meet Claude: Anthropic. 2025. Available online: https://www.anthropic.com/claude (accessed on 15 May 2025).
  54. Lambda. Lambda Cloud API Documentation. 2025. Available online: https://cloud.lambda.ai/api/v1/docs#overview–response-types-and-formats (accessed on 15 May 2025).
  55. Ollama. Python & JavaScript Libraries · Ollama Blog. 2024. Available online: https://ollama.com/blog/python-javascript-libraries (accessed on 15 May 2025).
  56. Anthropic. Getting Started. 2025. Available online: https://docs.anthropic.com/en/api/getting-started (accessed on 15 May 2025).
  57. Tao, Y.; Viberg, O.; Baker, R.S.; Kizilcec, R.F. Cultural bias and cultural alignment of large language models. PNAS Nexus 2024, 3, pgae346. [Google Scholar] [CrossRef]
  58. Helt, A. What’s Your Writing Personality Type? 2018. Available online: https://rootedinwriting.com/writing-personality/ (accessed on 15 May 2025).
Figure 1. Experimental process overview.
Figure 2. Example of a professional project email used in the experiment.
Figure 3. Example of a phishing email used in the experiment.
Figure 4. Example of a romantic love letter email used in the experiment.
Figure 5. Overview of response generation script.
Figure 6. Example of a professional email response from a personality high in all Big Five traits.
Figure 7. Example question from human perception survey.
Figure 8. Violin plot of distribution of LLM scoring by trait and model.
Figure 9. Combined confusion matrix for all non-normalized LLM model data. The shade of blue represents the proportion of ratings, with darker blue indicating a higher number of items classified in that category.
Figure 10. Accuracy for different values of N.
Figure 11. Proportion of correct, near-miss, and complete error predictions by model and trait. The green columns match predictions that were completely correct by LLM models (high predicted high, low predicted low, etc.). The yellow highlighted columns represent the near-miss cases, and the red highlighted columns represent the complete error predictions.
Figure 12. Accuracy by trait and model after normalization using median.
Figure 13. LLM prediction proportions after normalization utilizing median.
Figure 14. Confusion matrices for the Big Five personality traits: (a) Agreeableness, (b) Conscientiousness, (c) Extraversion, (d) Neuroticism, and (e) Openness.
Figure 15. Comparison of human and LLM model accuracy.
Table 1. Overview of LLM models.

Model | Parameters | MMLU Pro [36] | HELM [37] | WritingBench [35]
--- | --- | --- | --- | ---
GPT-4.5 | N/A | 0.861 | Unscored | Unscored
GPT-4o | N/A | 87.2 | 0.779 | 8.16
DeepSeek R1 | 671B | 0.84 | Unscored | 8.55
Gemma 3 | 27B | 0.675 | Unscored | Unscored
Hermes 3 | 405B | Unscored | Unscored | Unscored
Llama 4 | 400B | 0.6592 | 0.812 | 7.01
Claude 3 Opus | N/A | 0.6845 | 0.683 | Unscored
Table 2. Example Big Five personality vectors.

Big Five Trait | Vector 1 | Vector 2 | Vector 3
--- | --- | --- | ---
Conscientiousness | High | High | Neutral
Agreeableness | High | Low | Neutral
Neuroticism | Low | High | Neutral
Extraversion | Low | Low | Neutral
Openness to Experience | Low | High | Neutral
Table 3. Overview of LLMs queried for personality validation.

Model | Query Method
--- | ---
GPT-4.5 | OpenAI API
GPT-4o | OpenAI API
DeepSeek R1 | Lambda API
Gemma 3 | Ollama Python Library
Hermes 3 | Lambda API
Llama 3 | Ollama Python Library
Claude 3 Opus | Anthropic API
Table 4. Accuracy (%) of each model for the Big Five personality traits (C: Conscientiousness, A: Agreeableness, N: Neuroticism, O: Openness, E: Extraversion).

Model | C | A | N | O | E | Overall
--- | --- | --- | --- | --- | --- | ---
DeepSeek R1 | 46.46 | 48.48 | 50.51 | 37.37 | 31.31 | 42.83
Gemma 3-27b | 37.37 | 51.52 | 54.55 | 35.35 | 18.18 | 39.39
GPT-4.5 | 49.49 | 49.49 | 49.49 | 47.47 | 33.33 | 45.86
GPT-4o | 48.48 | 49.49 | 44.44 | 43.43 | 24.24 | 42.02
Claude 3 Opus | 48.48 | 50.51 | 50.51 | 50.51 | 33.33 | 46.67
Hermes 3-405b | 49.49 | 50.51 | 35.35 | 45.45 | 23.23 | 40.81
Llama 4-Maverick | 48.48 | 48.48 | 39.39 | 42.42 | 36.36 | 43.03
Average by Trait | 47.18 | 49.78 | 46.61 | 43.14 | 29.72 | 42.94
Table 5. Accuracy comparison by parameter count for LLM models.

Model | Parameter Count | Overall Accuracy (%)
--- | --- | ---
DeepSeek R1 | 671B | 42.83
Gemma 3 | 27B | 39.39
GPT-4.5 | Proprietary | 45.86
GPT-4o | Proprietary | 42.02
Claude 3 Opus | Proprietary | 46.67
Hermes 3 | 405B | 40.81
Llama 4-Maverick | 400B | 43.03
Table 6. Minimum and maximum trait scores for each model across the Big Five traits (non-normalized).

Model | Conscient. Min/Max | Agreeab. Min/Max | Neurotic. Min/Max | Openness Min/Max | Extrav. Min/Max
--- | --- | --- | --- | --- | ---
DeepSeek R1 | 40/95 | 45/95 | 10/75 | 20/90 | 20/85
Gemma 3-27b | 25/92 | 30/95 | 15/85 | 30/85 | 20/75
GPT-4.5 | 45/95 | 40/95 | 12/85 | 25/95 | 20/85
GPT-4o | 45/90 | 40/95 | 10/75 | 20/95 | 10/80
Claude Opus | 60/90 | 30/95 | 20/75 | 20/90 | 20/80
Hermes 3-405b | 50/90 | 30/95 | 20/80 | 25/95 | 20/80
Llama 4-Maverick | 35/85 | 40/90 | 25/70 | 30/90 | 35/90
Table 7. Human accuracy by Big Five trait.

Trait | Accuracy
--- | ---
Conscientiousness | 0.479115
Agreeableness | 0.484029
Neuroticism | 0.488943
Openness to Experience | 0.452088
Extraversion | 0.432432
