1. Introduction
Despite the emergence of various digital communication platforms, email remains a cornerstone of modern communication. In 2024, the number of email users worldwide reached 4.48 billion, building on a user base that already encompassed over half the world’s population in 2023 [1]. Furthermore, the volume of email traffic is enormous. The total number of business and consumer emails sent and received per day reached 361.6 billion in 2024, and this figure is expected to surpass 408 billion by the end of 2027 [1]. This persistent reliance on email significantly contributes to the cognitive load of knowledge workers, as 60% of people prefer email for work communication and 35% of surveyed workers spend between two and five hours of their day in their inbox [2].
Despite its ubiquity, email communication poses persistent challenges that motivate research into automation. Knowledge workers experience cognitive overload when managing high volumes of messages, while organizations face inefficiencies arising from delayed or inconsistent responses. Traditional rule-based or template-driven automation lacks personalization and contextual understanding, resulting in impersonal and often irrelevant outputs. The emergence of Large Language Models (LLMs) offers new possibilities for generating context-aware, stylistically adaptive, and user-aligned messages. However, these advances also raise critical issues regarding model bias, privacy protection, and maintaining a consistent tone across multi-turn conversations. Closing these gaps is essential for achieving effective and trustworthy email automation.
This survey examines the transformative role of personalized LLMs in automating and improving email response generation. Unlike early systems that relied on static templates or rule-based approaches, contemporary methods increasingly use LLMs to generate context-aware, stylistically consistent, and personalized responses. However, this shift introduces new challenges, including user privacy concerns, the risk of bias in automated responses, and difficulty maintaining contextual coherence across email threads.
In light of these challenges, the aim of this review is to summarize existing approaches and pinpoint open research questions at the intersection of LLMs and personalized email automation. To this end, we conducted a thorough review of LLM-driven email personalization, focusing on state-of-the-art personalization strategies, technical frameworks, user perceptions, evaluation methods, and the security implications of deploying such systems.
Given the increasing prevalence and impact of personalized LLMs in digital communication, our research aims to address the following research questions:
RQ1: What are the core strategies and frameworks that enable the effective generation of personalized email responses using LLMs?
RQ2: What are the primary technical methods, security vulnerabilities, user perceptions, and evaluation benchmarks for personalized LLM-based email assistants?
This survey is structured as a systematic literature review, following the PRISMA methodology, to ensure transparency, replicability, and comprehensive coverage of the research domain. Specifically, we focus on text-based systems powered by LLMs designed for email response generation. Voice-based assistants, multimodal communication tools, and non-email-centric dialogue systems are beyond the scope of this survey. Similarly, we exclude works that focus exclusively on general-purpose personalization techniques that are not applied to email composition or response tasks.
Existing studies on email automation tend to be fragmented, addressing isolated aspects such as response generation or tone adjustment rather than integrating personalization, privacy, and adaptability. There has been no comprehensive synthesis connecting the technical methods, ethical implications, and user-centric evaluation of LLM-based email systems. This review aims to address this gap.
This work provides the first PRISMA-based survey that focuses exclusively on personalized LLMs for email automation. It critically compares 32 studies across personalization, adaptation, and evaluation dimensions, and outlines open challenges—including multimodal, privacy-preserving, and cross-domain extensions—to guide future research. Therefore, the main contributions of this work are the following:
A critical analysis of frameworks and tools designed to improve LLM-based writing assistants in email contexts;
A structured thematic synthesis of the identified research areas and an evaluation of personalization techniques tailored for email response generation;
An examination of security vulnerabilities associated with personalized email generation systems;
A synthesis of user perception studies to understand trust, usability, and satisfaction with AI-mediated communication;
An assessment of benchmarking methodologies used to evaluate personalized email systems.
The rest of this paper is organized as follows. Section 2 introduces foundational concepts in LLMs and email generation. Section 3 describes our survey methodology. Section 4 presents an in-depth review of current techniques and findings across the identified research dimensions. Section 5 highlights the key benefits of personalized email assistants. Section 6 outlines their limitations and risks, and Section 7 proposes directions for future research. Finally, Section 8 concludes the paper.
2. Background
In this context, email automation primarily refers to the automatic generation, adaptation, and management of email content through LLMs. While traditional automation focused on scheduling and routing, LLMs raise automation to the semantic level, enabling the creation of contextual and personalized messages.
Earlier forms of email assistance on commercial platforms offered features such as autocomplete and short reply suggestions, demonstrating the potential of automated support in everyday communication. However, as these systems relied on shallow contextual cues and template-based patterns, they could not adapt to different tones, message objectives or extended conversational contexts. LLM-based systems represent a shift towards deeper semantic processing, enabling more contextually responsive and stylistically flexible email generation. Nevertheless, real-world email communication often involves lengthy threads, nested quotations, and forwarded content, which can strain the model’s ability to maintain coherence and interpret evolving discourse.
While the development of AI assistants that can generate email responses is not widely covered in the academic literature, the impressive capabilities of LLMs such as GPT are becoming increasingly apparent. These models understand context, generate human-like text, and adapt tone and structure based on input prompts. These strengths suggest a natural alignment with email-related tasks, for which personalization, clarity, and efficiency are key.
2.1. Foundational Concepts in LLMs and Email Generation
The sophisticated functionality of LLMs relies on several interconnected core components. First, tokenization methods (such as Byte-Pair Encoding, or BPE) break raw input text into smaller, manageable numerical units (tokens) that the model can process [3]. Because the transformer architecture processes these tokens in parallel, Positional Encoding techniques (such as absolute, relative, or Rotary Positional Encoding—RoPE) are crucial for preserving the original sequence order of the tokens. The fundamental Attention Mechanisms, particularly self-attention, then enable the model to dynamically calculate the relevance of each token to all other tokens in the sequence, allowing it to focus on the most pertinent information. Within the network’s layers, non-linear Activation Functions (e.g., GeLU, SwiGLU) are essential for enabling the model to learn complex patterns and relationships in the data [3]. Finally, Layer Normalization techniques (e.g., RMSNorm) are applied at various layers to stabilize the learning process, ensuring that activations within the network remain within a suitable range during training. The specific arrangement and configuration of these components define the model’s overall architecture. Typical structures include Decoder-Only models (like the GPT series), which are autoregressive and excel at text generation; Encoder-Only models (like BERT), which process the entire input bidirectionally for deep understanding; and Encoder–Decoder models (such as T5) suited for tasks transforming input sequences to output sequences, such as translation and email answering [4,5].
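To make the attention computation concrete, the following minimal sketch (our illustration, not code from any surveyed system) implements scaled dot-product self-attention for a toy sequence of four tokens:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # Project token embeddings into query, key, and value spaces.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise relevance of every token to every other token.
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                   # context-mixed token representations

d_model = 8
tokens = torch.randn(4, d_model)  # 4 token embeddings (positional info added beforehand)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)  # shape: (4, 8)
```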
The scale of LLMs, often involving hundreds of billions of parameters, enables emergent capabilities such as in-context learning, instruction following, and multi-step reasoning. Tokenization methods and positional encoding strategies (e.g., rotary embeddings) are crucial components of their architecture, while post-training techniques such as Instruction Tuning and Reinforcement Learning from Human Feedback (RLHF) enhance alignment with human preferences [6]. In the context of email generation, these advancements have evolved into Transformer-based systems. Later innovations have integrated personalization strategies such as fine-tuning, prompt engineering, and Retrieval Augmented Generation (RAG). These strategies enhance the relevance and coherence of responses while addressing challenges like multi-turn thread management and stylistic alignment. Together, these foundational developments form the basis of the growing potential of LLMs in personalized, efficient, and context-aware email response generation systems, a topic that will be explored further [4,5].
2.2. The Need for Personalization in Email Communication
From a technical standpoint, personalization is not merely a stylistic preference but a functional requirement for effective automation. Without modeling user intent and tone, LLM-based systems risk producing generic outputs that undermine the usefulness of the automation itself. Understanding the need for personalization is therefore essential to guiding technical design choices in email automation.
Email serves as a distinct and essential communication channel, characterized by specific norms of formality, task-orientation, and interpersonal nuance that set it apart from other digital platforms. While LLMs exhibit strong capabilities for general-purpose text generation, in their generic, unadapted form they often struggle to meet the subtle, context-sensitive demands of email communication [6]. Early AI tools, such as Smart Reply [7] and Smart Compose [8], demonstrated the feasibility of lightweight automated assistance but were constrained by their reliance on generic patterns, which limited their ability to adapt to individual tone and communicative intent.
The core challenge is not only generating grammatically correct text but also accurately capturing an individual’s unique communication style and intent within the immediate context of an email thread. Research in conversational AI indicates that conditioning models on user personas or profiles significantly improves consistency and relevance [9]. In the email domain, this requires personalization along several dimensions: aligning with the sender’s characteristic tone and vocabulary, modulating formality levels, maintaining coherence across asynchronous threads, and incorporating relevant contextual cues [10]. Although achieving this alignment is essential for crafting authentic and effective emails, it remains a significant hurdle.
Implementing robust personalization also presents notable technical and ethical challenges. One major obstacle is scalability: training and maintaining a distinct large model for each user is typically infeasible [10]. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as adapter modules [11] and Low-Rank Adaptation (LoRA) [12], offer promising solutions by enabling efficient adaptation with minimal parameter overhead. Collaborative strategies, such as PER-PCS, pursue further gains by sharing and composing PEFT components across users [13].
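To illustrate why PEFT is so economical, the sketch below implements a minimal LoRA-style layer in the spirit of [12]; the rank, scaling, and dimensions are illustrative choices, not a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the large pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. 590,592 in the frozen base layer
```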
User privacy is another critical concern, as effective personalization often depends on access to sensitive historical email data [10]. RAG [14] offers a more privacy-preserving alternative, allowing LLMs to retrieve relevant content from external, potentially local, knowledge sources (e.g., a personal email archive) at inference time. This enables contextually appropriate generation without requiring the model to internalize private data. Optimizing retrieval pipelines [15] and integrating RAG with local PEFT are key strategies for balancing personalization with privacy.
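A minimal sketch of this pattern is shown below; the toy hash-based embedding stands in for a locally hosted embedding model, and the archive contents are invented for illustration:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a locally hosted embedding model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

archive = [
    "Re: Q3 budget review - numbers attached, happy to discuss Friday.",
    "Thanks for the kind words about the launch!",
    "Following up on the contract redlines from legal.",
]
index = np.stack([embed(m) for m in archive])  # built and kept on-device

def retrieve(query: str, k: int = 2) -> list[str]:
    sims = index @ embed(query)                # cosine similarity (unit vectors)
    return [archive[i] for i in np.argsort(-sims)[:k]]

context = retrieve("Can we meet about the budget?")
prompt = ("Relevant past mail:\n" + "\n".join(context)
          + "\n\nDraft a reply in the user's usual style.")
# Only `prompt` is sent to the LLM; the raw archive never leaves the machine.
```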
Moreover, personalization must account for the evolving nature of user preferences and communication styles [16]. This requires models that can learn continually and in real time based on context, rather than relying on static user profiles. Interaction design also plays a crucial role, and moving beyond plain text prompts towards structured question-and-answer or guided interfaces may enable users to steer the personalization process more effectively.
Taken together, these factors highlight the urgent need for more effective personalization in email generation. Techniques such as PEFT and RAG offer strong capabilities in terms of style adaptation, privacy preservation, and computational efficiency [16]. Nevertheless, key challenges remain. Fine-tuning using limited user data often fails to capture the full range of real-world email contexts, particularly when dealing with multiple personas or varied communicative goals. Similarly, applying RAG to extensive, unstructured personal archives requires highly accurate and context-sensitive retrieval—something current systems can only partially deliver. The ultimate objective is to reflect not only a user’s stylistic voice, but also their current knowledge state and communicative intent in each message. Bridging the gap between surface-level personalization and deep contextual understanding while maintaining scalability and reliability remains a central challenge in developing truly effective and trustworthy AI email assistants.
4. Literature Review
This review examined the scientific literature on the use of large language models for generating email responses. As mentioned above, 32 papers were selected for this comprehensive review. To present the key research directions systematically, we analyzed these documents using the thematic organization detailed in Section 3.5 of the survey—Tools and Frameworks, Personalization Techniques, Security Vulnerabilities, User Perceptions, and Benchmarking and Evaluation. The analysis concludes with a comparative summary.
4.1. Tools and Frameworks to Enhance AI-Powered Writing Assistants
The development of AI-based writing assistants, particularly those designed for tasks such as composing personalized emails, is primarily driven by the innovative tools and architectures that facilitate their creation. The following is an extensive overview of the key systems and architectural frameworks created to advance different aspects of AI writing, including promoting creative exploration, improving the quality of instruction data, and achieving better human–AI interaction. Understanding these underlying developments is crucial to appreciating how LLMs are being adapted and utilized to meet the requirements of personalized email communication.
One of the key innovations in enhancing AI-powered writing assistants is the Luminate system [18], which introduces a unique approach to creative exploration, overcoming the limitations of traditional prompt-based methods. Conventional systems often suffer from premature convergence and limited exploration of creative possibilities, leading to restricted output diversity. In response, the authors introduce a novel Prompting for Design Space framework. This system enhances idea generation by employing a two-step process: first, generating key dimensions relevant to the user’s prompt and then using these dimensions to guide the creation of a diverse array of responses. This framework facilitates a structured exploration of the creative design space, enabling users to engage with and actively organize the generated content. Its key innovations include the automated generation of creative dimensions, dimension-guided response generation to promote output diversity, interactive selection of dimensions to help structure responses, and semantic zooming to explore content at varying levels of detail. Unlike traditional prompt-refinement methods used in systems like ChatGPT, these features offer a systematic, expansive approach to creative exploration.
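The two-step process can be sketched as follows, where `llm` stands for any chat-completion call and the prompt wording is our approximation rather than Luminate’s actual prompts:

```python
def explore_design_space(llm, user_prompt: str, n_dims: int = 3) -> dict:
    # Step 1: generate key dimensions relevant to the user's prompt.
    dims = llm(
        f"List {n_dims} dimensions along which responses to the following "
        f"task could meaningfully vary (one per line): {user_prompt}"
    ).splitlines()
    # Step 2: use each dimension to guide a distinct response, yielding a
    # structured space of alternatives the user can browse and organize.
    return {
        dim: llm(f"{user_prompt}\nEmphasize this dimension: {dim}")
        for dim in dims
        if dim.strip()
    }
```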
While Luminate [18] focuses on enhancing creative exploration, other research has addressed improving instructional data quality to enable more accurate AI writing assistance. REALIGN [19] presents a novel solution to improve the quality and alignment of instruction data for LLMs, addressing a key challenge in developing generative AI applications, such as email-writing assistants. Traditional fine-tuning methods often rely on extensive human annotation or suffer from inaccuracies, such as hallucinations, which limit their effectiveness. REALIGN addresses these issues through a three-step process—Criteria Definition, Retrieval Augmentation, and Reformatting—that improves the structure and factual accuracy of LLM outputs. The approach begins by incorporating human-defined response formats, thereby better aligning with user expectations, particularly for tasks such as email generation. By leveraging the Google Search API to gather relevant, up-to-date information, it enhances the factual accuracy of generated content, reducing the risk of errors. This retrieval-based step also improves scalability, enabling LLMs to access external knowledge without relying solely on pre-existing training data. The last step, reformatting, ensures that the outputs are structured, readable, and meet user-defined criteria.
Empirical results demonstrate that REALIGN [19] significantly improves overall alignment, mathematical reasoning, factuality, and readability in LLMs. For example, LLaMA-2-13B’s mathematical reasoning ability on the GSM8K dataset improved from 46.77% to 56.63% through reformatting alone, without requiring additional data or advanced training. Moreover, just 5% of REALIGN data led to a 67% improvement in overall alignment, as measured on the Alpaca dataset. This highlights the importance of organized formats, particularly for tasks that require complex reasoning. The framework also uses a task classifier to apply the correct format based on the query type, while post-processing steps, such as length and task-based filtering, ensure the generation of high-quality data.
Taking a different approach to user engagement, Miura et al. developed ResQ [20], an artificial intelligence-based email response system that replaced traditional prompt-based drafting with a question-and-answer (QA) method. ResQ addressed a prevalent problem of email support systems: helping users articulate their intended responses clearly. The system automatically retrieved relevant information from incoming emails and generated structured questions for users to answer, using these answers to shape the final reply. This approach reduced the cognitive burden of manual prompt engineering, improving response accuracy and user satisfaction. A comparative experiment with 20 users revealed that ResQ improved response efficiency without compromising email quality, outperforming conventional LLM-based email composition tools.
Beyond the quality of instruction data, effective interfaces for managing multiple AI-generated variations represent another crucial advancement in this field. The work presented in [21] significantly contributes to generative AI for writing assistance by addressing the critical need for interfaces that effectively support the exploration and organization of multiple writing variations generated by the models. Recognizing the limitations of existing chat-based and in-place editing interfaces in handling the growing volume of AI-generated text, the authors propose ABScribe [21], a novel interface designed to support and enhance human–AI co-writing. This system aims to enhance collaboration by providing more effective methods for managing and interacting with generated content. The technical approach involves the implementation of five integrated interface elements: Variation Fields for non-linear, in-place storage of multiple variations; a Popup Toolbar for swift comparison; a Variation Sidebar for structured navigation; AI Modifiers that transform LLM instructions into reusable buttons; and an AI Drafter for direct text insertion. To evaluate the practical effectiveness of ABScribe, the study employed a within-subjects design with 12 writers who completed guided writing tasks (a LinkedIn post and an email) using both ABScribe and a baseline interface featuring a chat-based AI assistant. Data collection included the NASA-TLX questionnaire to assess subjective task workload and Likert-scale measures to capture user perceptions of the revision process.
Building on the foundational work summarized in the literature review by Rasheed et al. [21], recent research has introduced novel methods for incrementally improving AI-generated content through self-feedback mechanisms. The paper presented in this context introduces SELF-REFINE [22], an iterative refinement framework that enhances the output quality of LLMs by enabling them to generate self-feedback and subsequently refine their output. This work holds significant importance for generative AI, particularly in applications such as email and writing assistants, as it demonstrates a supervision-free method for improving the sophistication and appropriateness of LLM-generated text. Other approaches, such as REALIGN [19], also recognize the specific demands of such applications, explicitly including email generation among the 46 tasks with tailored criteria and formatting. The technical approach follows a three-step iterative process: initial generation, self-generated feedback, and refinement based on that feedback, with each stage performed by the same underlying LLM. The approach utilizes few-shot prompting to guide the LLM in generating initial drafts, offering constructive feedback, and creating improved revisions. Notably, the process eliminates the need for additional training data, model fine-tuning, or reinforcement learning, thereby alleviating one key drawback of current refinement methods, which tend to depend on domain-specific data, external supervision, or reward models.
The results extend prior findings by demonstrating that this straightforward, standalone method can improve even cutting-edge LLMs like GPT-4 at test time. The study evaluates the effectiveness of the method proposed by Madaan et al. [22] across seven diverse tasks involving natural language and code generation. Using automatic metrics and human evaluations, it demonstrates consistent and substantial improvements in task performance, with an average absolute improvement of approximately 20%, and a clear preference over baseline LLM outputs.
Further advancing the concept of self-feedback and refinement, more specialized models have been developed to enhance output quality through iterative critique. As a follow-up to this critical evaluation of training processes, Wang T. et al. [23] introduced Shepherd, a task-conditioned 7B-parameter language model fine-tuned to critique and refine the output of other language models. This innovation addressed one of the key challenges of personalized email systems: how to produce higher-quality output through iterative feedback. Building upon LLaMA-7B, Shepherd improved significant processes in LLM self-improvement, with a focus on identifying diverse errors and providing helpful feedback. The model was trained on a high-quality community feedback dataset (Stack Exchange and Reddit) and human annotations. Shepherd’s performance was rigorously tested against rival baselines, including Alpaca-7B, SelFee-7B, and ChatGPT, using both automated (GPT-4) and human evaluation. The results indicated that Shepherd performed better, with win rates of 53–87% against alternatives in the GPT-4 evaluation, and human evaluators also found it to be comparable to or better than ChatGPT. Unlike untuned models that generate output passively, Shepherd [23] actively identifies mistakes, suggests corrections, and enhances multiple key determinants of quality, such as coherence, factuality, and fluency. This is especially relevant in highlighting the effectiveness of iterative, feedback-driven fine-tuning for significantly enhancing AI-response quality and usability, with clear ramifications for improving personal email generation systems through continuous refinement with user feedback.
To address the issue of manually creating improvement goals and detailed rules (known as “rubrics”) in approaches such as Self-Refine [22], the authors propose a new framework called ImPlicit Self-Improvement (PIT) [24]. PIT’s main innovation is that it learns how to improve responses independently, without needing explicit instructions. Instead of providing detailed guidelines, it utilizes existing human preference data (used to train reward models) to understand what makes a response more effective.
This is achieved by reformulating the RLHF training objective to maximize the quality gap between a model’s response and a reference response, rather than simply optimizing for response quality given an input. The PIT framework uses a three-stage training pipeline. First, supervised fine-tuning (SFT) is applied to both satisfactory and unsatisfactory responses. Then, a reward model is trained based on the relative quality gap between responses. Finally, a curriculum-based reinforcement learning strategy progressively refines the model’s outputs.
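Schematically, the reformulated objective can be contrasted with standard RLHF as follows (notation ours; the paper’s exact formulation may differ):

```latex
% Standard RLHF maximizes the expected reward of a response y to input x:
\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
% PIT instead maximizes the quality gap over a reference response y_ref:
\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x,\, y_{\mathrm{ref}})}\big[\, r(x, y) - r(x, y_{\mathrm{ref}}) \,\big]
```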
Notably, PIT [24] can be integrated into the inference process to refine LLM outputs iteratively. Its effectiveness is demonstrated through extensive evaluations on three diverse datasets, including Anthropic/HH-RLHF and OpenAI/Summary. Regarding response quality, it outperforms prompting-based self-improvement methods, such as Self-Refine, as validated by automatic metrics (GPT-4 and DeBERTa) and human evaluations.
Building on the development of advanced interfaces and LLM capabilities, Script&Shift [25] offers a layered interface paradigm designed to better align with natural writing processes, particularly for complex tasks. Although not explicitly designed for email, its architecture provides valuable insights into generative AI writing tools in diverse communication settings. This approach uniquely integrates content development (“scripting”) with rhetorical strategy (“shifting”) in a zoomable, non-linear workspace, facilitating fluid movement between drafting and organization.
The system’s technical foundation rests on “layer primitives”: distinct modules, including the Writing Layer for content creation, the Meta Layer for global context (audience, tone, and goals), and the Document Layer for compilation. These layers enable dynamic interaction with content generated by an LLM (Claude 3.5 Sonnet, in this case) through features such as embedded placeholders, context-sensitive prompts, and output rendering that respects document structure. Coordination is managed by a Prompt Composer, which formulates system instructions based on task knowledge, and a Workspace Manager, which orchestrates component interactions, ensures structural consistency, and manages content distribution across layers.
Complementing these interface-focused and refinement-driven systems, recent work has also explored architectural strategies to support scalable and low-latency personalization pipelines. The authors of Serverless RAG-Stream [26] present a cloud-native retrieval-augmented generation pipeline designed for real-time operation. The system uses AWS Kinesis, a managed streaming service for handling continuous data flows in real time, to capture and route contextual email or communication events as they occur. Semantic retrieval is conducted using Dense Passage Retrieval (DPR), which embeds documents and queries into the same vector space, enabling meaning-based rather than keyword-based searches. The pipeline is orchestrated through serverless functions, allowing computation to scale automatically with workload while avoiding the cost of maintaining dedicated servers. Empirical results demonstrate that the architecture sustains an average end-to-end latency of approximately 150 ms and a throughput of around 500 requests per second, while maintaining generation quality across standard text evaluation metrics [26]. The Serverless RAG-Stream design, which combines event-driven scaling with pay-per-use execution, offers a cost-efficient and operationally lightweight foundation for email writing assistants that must integrate dynamic contextual data without compromising responsiveness.
In summary, the reviewed tools and frameworks demonstrate a clear evolution from prompt-based generation to more interactive and structured co-writing systems. Innovations such as ABScribe and Script&Shift illustrate a growing emphasis on user control and interface design. At the same time, approaches like SELF-REFINE and Shepherd suggest promising avenues for improving generation quality through feedback and self-assessment. However, most tools still lack deep integration with user-specific context or real-time personalization, indicating a gap between interface design and underlying model adaptation.
4.2. Personalization Techniques in LLM Writing Assistants
While generic LLMs demonstrate remarkable performance in text generation, the true promise of artificial intelligence in email communication lies in its ability to deliver highly personalized responses. In this subsection, we discuss advanced methodologies specifically designed to personalize LLMs for the unique styles, preferences, and subtle contextual requirements of email communication. We consider a variety of methods aimed at achieving significant and effective personalization, analyzing their strategies critically in terms of data efficiency, privacy, and adaptive responsiveness to users in email contexts.
Recent advances in personalized generative AI, such as PersonaAI [27], have addressed the limitations of traditional general-purpose LLMs, including ChatGPT, in delivering deeply personalized experiences, particularly in user-facing applications and sophisticated use cases, such as email writing assistants. By combining RAG with the LLAMA model, this layered interface significantly enhances personalization, allowing it to respond dynamically to each user’s unique preferences and nuances. PersonaAI’s solution uses a mobile app to capture real-time user data via voice-to-text transcription. This data is saved to a cloud database for processing, where it is formatted and converted into 384-dimensional vectors. To perform this conversion, the system utilizes an embedding model, specifically the BAAI/bge-small-en model [28] from Hugging Face, which transforms the text into numerical representations that capture its semantic meaning. This process is crucial because it enables fast semantic retrieval of data. The system then dynamically retrieves the top-k most contextually similar contexts using a cosine similarity function, returning responses tailored to the user’s specific needs.
Additionally, advanced prompt engineering tunes the model to produce contextually appropriate content, and built-in error handling strengthens user confidence. The lightweight, scalable architecture makes it a viable option for large-scale personalized AI deployment, and its ethical design focus supports user trust and privacy. High contextual retrieval precision (91%) and low query response time (<1 s) were achieved in performance testing using datasets such as simulated university journals, demonstrating the system’s feasibility in real-world applications. Such technological advances underscore the potential of PersonaAI to drive personalized AI, particularly in tasks that require fine-grained communication, such as advanced email writing assistants.
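The retrieval step can be sketched as below; the embedding model name comes from the paper, while the data and surrounding code are our illustrative reconstruction rather than PersonaAI’s implementation:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en")  # 384-dimensional embeddings

user_snippets = [
    "Prefers short, friendly openings.",
    "Signs off with 'Best, Alex'.",
    "Avoids jargon when writing to clients.",
]
corpus = model.encode(user_snippets, normalize_embeddings=True)

query = model.encode(
    "Reply to a client asking for a project update",
    normalize_embeddings=True,
)
hits = util.cos_sim(query, corpus)[0].topk(k=2)  # top-k by cosine similarity
context = [user_snippets[i] for i in hits.indices]
# `context` is then injected into the generation prompt for the LLAMA model.
```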
While the work presented in [27] focuses on retrieval techniques for personalization, another significant advancement in this domain is the preference agent approach [29]. This method introduces a way to personalize content generated by LLMs for tasks like email writing, explicitly addressing the challenge of adapting their broad capabilities to meet individual user preferences. Traditional methods, such as in-context learning and parameter-efficient fine-tuning, often struggle to capture the nuanced complexity of human preferences, particularly with smaller, personalized datasets. To overcome this limitation, the authors of [29] introduced these agents as small, locally trainable models that encode user preferences into concise natural language rules. These agents act as a “steering wheel,” guiding the output of a larger, more generic LLM to align with a personalized style and content, all without requiring fine-tuning of the larger model. This modular method separates the preference learning process from the generic LLM, offering a more scalable and flexible solution for personalization compared to the RAG approach used in PersonaAI [27].
The authors Shashidhar S., Chinta A., et al. [29] built their method on a systematic process for capturing user preferences. The proposed method uses a large LLM to generate zero-shot responses, which are then compared with the ground-truth outputs to identify differences. These differences are distilled into preference rules, which a smaller model then learns. This model becomes the personalized preference agent, capable of generating rules that guide the large LLM at inference time. During inference, the trained preference agent provides context in the form of natural language rules to the large model, allowing it to generate outputs aligned with the user’s preferences. Evaluations across three diverse datasets—Enron emails, New Yorker articles, and Amazon product reviews—demonstrate that preference-guided LLMs significantly outperform both fine-tuning baselines and standard prompting methods in terms of automatic metrics (such as GPT-4o evaluation) and human judgments.
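At inference time, the flow can be sketched as follows; `small_agent` and `large_llm` are hypothetical callables, and the prompt wording is ours:

```python
def generate_with_preferences(small_agent, large_llm, incoming_email: str) -> str:
    # The small, locally trained agent emits natural-language preference
    # rules conditioned on the task at hand.
    rules = small_agent(
        f"Produce style and content rules for replying to:\n{incoming_email}"
    )
    # The larger model stays frozen; the rules steer its output.
    return large_llm(
        f"Follow these user preference rules:\n{rules}\n\n"
        f"Write a reply to this email:\n{incoming_email}"
    )
```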
Following the research on personalization techniques, Panza [30] emerges as a novel solution focused on personalized text generation, particularly for emails, while prioritizing user privacy through local execution. It tackles the efficiency challenges of fine-tuning and RAG faced by its predecessors. Panza uniquely combines a variant of Reverse Instructions with RAG and PEFT methods, such as ROSA, enabling personalization using tiny datasets (under 100 emails) on commodity hardware. This makes it highly viable for users with limited personal data.
A key methodological contribution is its evaluation approach, demonstrating that a combination of BLEU and MAUVE scores correlates strongly with human preferences for personalized text quality. This combined metric helps validate Panza’s ability to replicate writing styles effectively. Unlike purely RAG-based or preference-agent approaches, Panza provides a scalable and flexible solution that enables local execution, low-cost fine-tuning, and inference on standard hardware. Its practical utility is showcased through a Google Chrome plugin designed for Gmail integration.
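A minimal sketch of such a combined evaluation, using common open-source implementations rather than Panza’s exact pipeline, might look as follows (note that MAUVE is intended for much larger text samples than this toy pair):

```python
import mauve                       # pip install mauve-text
from sacrebleu import corpus_bleu  # pip install sacrebleu

generated = ["Hi Sam, thanks for the update. Let's sync on Friday. Best, A."]
references = ["Hi Sam, thanks for the update. Friday works for me. Best, A."]

# BLEU captures n-gram overlap with the user's actual email.
bleu = corpus_bleu(generated, [references]).score

# MAUVE compares the distributions of generated and human text.
out = mauve.compute_mauve(p_text=generated, q_text=references)
print(f"BLEU={bleu:.1f}  MAUVE={out.mauve:.3f}")
```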
The authors Shaikh O., Lam M., et al. [31] introduced a distinct method for aligning LLMs with individual user preferences, complementing other personalization frameworks. Their “Demonstration Iterated Task Optimization” (DITTO) approach directly leverages user-provided demonstrations. This iterative process requires only a small number of demonstrations (fewer than 10) to guide the model’s output, making it a data-efficient solution that contrasts with the more resource-intensive fine-tuning and RLHF methods often used for personalization.
The online imitation learning method introduced in [31] stands out by treating user demonstrations as preferred outputs and utilizing this feedback to update the model via algorithms such as Direct Preference Optimization (DPO). Compared to Panza, which focuses on fine-tuning foundation models to mimic writing styles with minimal data while prioritizing privacy, this approach offers a simpler alternative by directly optimizing model outputs based on few-shot user demonstrations. This allows it to bypass the need to fine-tune the entire model or implement complex RAG-based retrieval systems.
Furthermore, this technique addresses some challenges inherent in the preference-agent approach, where user preferences are encoded into specific rules. While preference agents can effectively guide output, the online imitation learning framework offers a more scalable and flexible approach to leveraging direct user feedback for preference alignment, thereby reducing the need for complex rule creation and extensive preference elicitation. Its methodology also contrasts with Panza’s emphasis on local execution and privacy, as it focuses on the iterative refinement of user preferences, making the system adaptable to diverse user needs without compromising performance. Evaluations using static benchmarks (such as CMCC and CCAT) and real-world user studies for email writing consistently show its effectiveness in personalizing LLMs with minimal data. By outperforming traditional methods, such as supervised fine-tuning and few-shot prompting (even with GPT-4), this system demonstrates its ability to capture fine-grained stylistic preferences with only a few demonstrations, positioning it as a powerful and efficient solution alongside the personalization techniques discussed previously.
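For reference, the standard DPO objective that DITTO builds on is shown below; in DITTO’s pairing scheme, the preferred output is a user demonstration and the dispreferred one is a sample from the model being aligned:

```latex
% y_w: user demonstration (preferred); y_l: model-generated sample (dispreferred)
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```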
The evolution of personalization techniques reflects a broader shift in how LLMs are transforming individualized AI systems. Zhang et al. [9] explored the disruptive impact of Large Language Models such as GPT-3.5, GPT-4, and LLaMA-7B, highlighting how LLMs transform personalization from static filtering to dynamic, real-time user engagement. Unlike static embedding-based models, LLMs use few-shot prompting and in-context learning to adapt recommendations dynamically. RLHF further refines personalization by aligning responses with user intent. The study highlighted LLMs’ ability to integrate external tools, including retrieval-based recommendation engines (e.g., the Generalized Dual Encoder Model), search APIs, and vector databases, thereby enhancing context-aware personalization.
These observations provide valuable context for understanding the technological foundations that underpin the personalization approaches of PersonaAI, preference agents, Panza, and DITTO, all of which leverage these capabilities in different ways.
The reviewed personalization techniques highlight a transition from model-centric fine-tuning to modular, data-efficient personalization strategies. Approaches such as PersonaAI, Panza, and DITTO show that personalized outputs can be achieved even with limited user data, particularly through RAG and preference agents. Yet, a trade-off remains between scalability, privacy, and stylistic fidelity. While some models excel in fast deployment or local inference, others prioritize accuracy in capturing writing nuances, suggesting that no single method currently balances all personalization goals effectively.
Synthesis and Discussion of Personalization Techniques
Personalization emerges as the most critical and least mature component of email automation across studies. While most studies rely on PEFT or RAG variants to emulate user tone and contextual relevance, these approaches are still limited by small datasets and static user models. There is an increasing convergence in the literature on hybrid retrieval–fine-tuning pipelines as a promising direction, yet long-term adaptation and cross-user generalization remain open problems. Furthermore, the ethical and privacy implications are rarely quantified, indicating the necessity for standardized evaluation protocols that incorporate both performance and responsible AI criteria.
4.3. Security Vulnerabilities in Generative AI Email Systems
This section highlights key threats in AI-driven email environments, ranging from ecosystem-level attacks, such as the Morris II worm, to advanced exploits, including LLM-enabled spear-phishing and Trojan attacks on fine-tuning processes. Understanding these risks is crucial for building secure and reliable AI email assistants that protect user data and maintain communication integrity.
A recent study discovered a significant security weakness in Generative AI (GenAI) environments, particularly those powered by RAG, such as email assistants. The authors Cohen S., Bitton R., et al. [32] present Morris II, a new computer worm that propagates across GenAI platforms via adversarial prompts embedded in emails. The worm exploits indirect prompt injection, which can activate malicious behaviors such as data exfiltration. To examine this threat, the authors performed empirical experiments with mock GenAI email assistants developed using the LangChain library and RAG. The assistants were exposed to tailored databases extracted from real-world email datasets, such as the Enron and Hillary Clinton email datasets, to replicate the behavior of real-world GenAI systems. For instance, the Enron dataset simulated 20 workers, each with a personal database of one hundred emails, whereas the Hillary Clinton dataset consisted of 1500 emails. These datasets served as the “external knowledge sources” for the RAG components of the assistants.
The simulations were used to demonstrate the worm’s potential to infect the systems under various scenarios. The paper proposes Virtual Donkey, an effective countermeasure that detects worm propagation based on input–output similarity in the GenAI model. The solution achieves high accuracy with minimal false positives. While the research acknowledges the potential for adaptive attacks, it makes a valuable contribution by revealing this ecosystem-level threat and proposing an effective solution to secure GenAI-based email assistants.
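The detection principle can be sketched as follows; this is our illustration of the input-output similarity idea only, and the actual Virtual Donkey guardrail is more sophisticated:

```python
from difflib import SequenceMatcher

def flags_worm_propagation(retrieved_context: str, model_output: str,
                           threshold: float = 0.8) -> bool:
    # A self-replicating prompt must reappear near-verbatim in the output,
    # so unusually high input-output overlap signals propagation.
    similarity = SequenceMatcher(None, retrieved_context, model_output).ratio()
    return similarity >= threshold
```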
Building on these ecosystem-level concerns, Hazell [33] examined the capacity of LLMs, such as OpenAI’s GPT-3.5 and GPT-4, to facilitate spear-phishing attacks, a sophisticated form of social engineering that leverages personalized information to manipulate targets.
The research highlights LLMs’ ability to assist with the reconnaissance phase by processing unstructured data to gather target information, and with the message generation phase by producing realistic and contextually relevant spear-phishing emails at a fraction of a cent per message. Notably, the study demonstrates how basic prompt engineering can bypass safety measures in these models, enabling the generation of malicious content and advice on conducting attacks, including crafting persuasive emails and basic malware. By generating unique spear-phishing messages for a large group of UK Members of Parliament, the study provides evidence that LLMs can produce realistic and cost-effective phishing attempts, potentially scaling such campaigns significantly and lowering the barrier for less skilled cyber criminals. The findings also compare the sophistication of different LLMs, including open-source models, in this context. While the study focuses on the malicious application of LLMs for email-based attacks, it underscores the dual-use nature of this technology and the governance challenges associated with preventing its misuse. The paper further discusses potential solutions, including structured access schemes and the development of LLM-based defensive systems for email security.
Researchers subsequently began identifying additional security vulnerabilities that could arise with the deployment of LLMs. Building on the phishing concerns identified by Hazell [33], the authors Kang D., Li X., et al. [34] proposed a critical security evaluation demonstrating how instruction-following large language models can be exploited for malicious purposes through methods borrowed from classical computer security. They developed three primary attack vectors—obfuscation (inserting typos or replacing synonyms to evade detection), code injection/payload splitting (indirectly programming instructions), and virtualization (framing attacks within virtual scenarios)—that evaded OpenAI’s content filtering mechanisms 100% of the time in scenarios such as hate speech, conspiracy theories, and phishing schemes.
The economic implications of these vulnerabilities are significant. Using human evaluators and GPT-4 to assess generation quality, Kang et al. [34] found that instruction-tuned models of larger capacities produced much more realistic malicious content than earlier models. Their economic analysis revealed that tailored malicious content could be generated at $0.0064–$0.016 per instance, significantly lower than human-generated content, which is estimated to cost $0.10 per instance. This cost-effectiveness creates strong economic incentives for adversaries to deploy these systems and underscores the dual-use potential of AI-powered email composition, where features such as personalization and productivity can be leveraged for malicious purposes, including sophisticated phishing attacks.
Beyond prompt-based vulnerabilities and economic incentives, adaptation-based security concerns have also emerged. Motivated by these broader security concerns, the authors Dong T., Xue M., et al. [35] analyzed another potential vulnerability in their paper, “The Philosopher’s Stone: Trojaning Plugins of LLMs.” Their research on low-rank adaptations for LLMs identified critical security concerns for AI-mediated communication systems. The researchers identified weaknesses in LoRA adapters, demonstrating how attackers could design “Trojan plugins” that cause LLMs to generate toxic text when triggered by specific phrases.
These adaptation-based attacks represent a sophisticated evolution of security threats in generative AI systems. The authors [35] also introduced two new attack methods: POLISHED, which uses LLM-based paraphrasing to produce naturalistic poisoned datasets, and FUSION, which transforms benign adapters through an over-poisoning process. Testing on real-world LLMs, such as LLaMA (7B, 13B, 33B) and ChatGLM (6B), confirmed the approaches, with FUSION achieving a nearly 100% success rate in producing target keywords with just 5% poisoned data. These findings highlighted key security issues for AI-powered communication systems that utilize adapter-based fine-tuning techniques to adjust models to specific domains or users. Since adapter-based techniques are commonly used in email personalization systems, these vulnerabilities raise questions about the secure deployment of generative AI in email ecosystems, connecting back to the ecosystem-level threats identified in [32].
Large-scale organizational data further demonstrates the severity of these risks. In a comparative study involving approximately 9000 employees, GPT-4 was used to generate lateral phishing emails, which were then evaluated against phishing messages written by trained human communications professionals [36]. The LLM-generated emails proved equally persuasive and were even more effective in time-sensitive scenarios, achieving higher rates of email opens, link clicks, and credential submissions. This success was primarily due to the exploitation of existing trust relationships, particularly when emails appeared to originate from supervisors or internal colleagues, with average internal trust ratings measured at 4.77/5. The study also identified role-specific vulnerability patterns, with student workers showing the highest susceptibility (14.06% click rate and 6.76% credential submission rate), suggesting that the risk of exposure to phishing is not uniform across organizational layers. The authors propose mitigation strategies such as tagging potential LLM-generated emails but also emphasize that detection becomes increasingly difficult as model quality improves [36].
In contrast to the vulnerable personalization pipelines discussed in this section, systems that intentionally avoid external model calls and online inference offer a different approach. The CerebralChat assistant, for example, is implemented entirely offline, with all language processing performed locally rather than through remote model hosting or API-based services [37]. This architectural choice removes the attack surface associated with prompt injection through retrieved content and eliminates the risk of data exposure to third-party model providers. The system also integrates a biometric face-recognition login mechanism with a reported accuracy of around 98% under controlled conditions, providing an additional physical-layer authentication barrier. Although not an LLM-based email assistant, CerebralChat shows that security can be achieved at the architectural level by limiting the model’s execution boundary rather than just improving defensive techniques within generative pipelines [37].
The security analysis reveals that personalized LLMs for email systems are vulnerable to a wide range of attacks, including prompt injections, Trojan plugins, and socially engineered phishing vectors facilitated by model-generated content. While solutions such as Virtual Donkey and structured access models offer partial mitigation, the literature shows that many personalization methods, particularly those relying on adapters or external retrieval, introduce new vectors for exploitation. Recent large-scale phishing studies further demonstrate that these vulnerabilities translate into operational risks in real organizational settings, rather than remaining purely theoretical. This underscores the urgent need for security-aware personalization frameworks and defensive evaluation protocols to accompany the design and deployment of LLM-based email assistants.
4.4. User Perceptions of AI Communication and Writing Assistants
Beyond AI’s technical capabilities, the success of email writing assistants depends heavily on user perception and interaction. This section reviews key studies highlighting the balance between AI-driven productivity and concerns about authenticity, emotional tone, and trust. Examining user experiences across contexts—from accessibility to professional use—emphasizes the need for AI that is both efficient and aligned with human communication nuances.
In 2022, Goodman et al. introduced LaMPost [38], a language-model-powered email composition tool designed for dyslexic adults. The research highlighted the potential for AI systems to be tailored to specific user needs rather than merely enhancing overall productivity. Unlike traditional spell-check and grammar-check software, LaMPost provided high-level composition support, including content structuring, subject line generation, and stylistic rewriting. The system was built with LaMDA, a conversation-specific LLM, in a browser-based email editor. A study with 19 dyslexic adults showed that the “rewrite” and “subject line” tools significantly enhanced writing productivity, though participants occasionally experienced issues with coherence and deviations in tone. Perceived self-efficacy was not influenced by awareness of AI assistance, indicating that the system was experienced as an empowering tool, not an intrusive helper. The study emphasized the need for adaptive AI-provided feedback to better accommodate cognitive diversity within writing support tools, demonstrating how personalization can be extended to incorporate cognitive and accessibility dimensions beyond style considerations.
Following these findings on personalized writing support, later research investigated user experience with AI-facilitated communication tools in daily life. The authors Fu Y., Foell S., et al. [39] presented an in-depth diary and interview study of user experience with tools that mediate communication through artificial intelligence in daily interpersonal interactions. They conducted the study with 15 users who used tools of their choice for one week, resulting in 227 diary entries that captured their experiences and attitudes. The study found generally positive acceptance, with an average satisfaction score of 7.1 on a scale of 1–10, and satisfaction rose after participants’ initial learning curve. One of the key contributions was the definition of four communication spaces, distinguished by stakes (high/low) and relationship dynamics (formal/informal); the AI tools were perceived as considerably more appropriate in formal relationships than in informal ones. Participants noted that these systems were beneficial in increasing communication confidence, helping to find precise words to express ideas, overcoming cross-cultural communication challenges, and expanding vocabulary. However, the study also identified ongoing flaws in current AI communication systems, including excessive verbosity, unnatural phrasing, exaggerated emotional intensity, and the iterative revision required to achieve satisfactory outputs. These findings suggest that such tools must be tuned for specific communication contexts, with features matched to their functions, conclusions that are directly applicable to crafting customized email systems sensitive to differing communication contexts.
Researchers then shifted their attention to professional settings, examining AI writing tools in business environments. In a corporate context, the authors Jovic M. and Mnasri S. [40] compared four well-known LLMs—ChatGPT 3.5, Llama 2, Bing Chat, and Bard—in terms of their ability to generate business emails, examining the implications for AI implementation in business communication. Using a detailed framework, the study assessed performance across three types of emails: routine, negative, and persuasive. Each LLM was subjected to identical email scenarios, with outputs scored on content, format, and tone. Despite the formulaic nature of business emails, the researchers found significant variations in quality between models. Llama 2 achieved the highest overall score (48.9/60), followed closely by Bing Chat (47.8), ChatGPT 3.5 (46.7), and Bard (45.2). Common weaknesses across all models included difficulties in following the requested structure, maintaining tone consistency, and responding to emotional cues, which highlighted the need for further development of LLMs in business email generation, particularly regarding emotional aspects. However, the study was limited to these tests and did not examine prior personalization methods, suggesting that further work is needed to address these issues in tailored business communication.
Beyond performance comparisons, understanding user trust in AI-generated content emerged as a critical area of research. One study explored user perceptions of AI-generated content, revealing that trust in AI-generated emails decreased as perceived AI involvement increased [41]. Using a “Wizard-of-Oz” approach, in which participants believed they were interacting with an AI system but were actually interacting with a human, the study found that users were more willing to accept AI-generated content for factual emails but preferred human authorship for emotionally charged content, such as condolences.
To enhance understanding of stylistic variation in artificial intelligence versus human writing, the authors Li W., Saha K., et al. [42] compared AI- and human-generated emails and investigated the primary distinctions that may influence user acceptance and perception. Based on the W3C email corpus, the research compared the syntactic, semantic, and psycholinguistic attributes of emails created by GPT-3.5, GPT-4, Llama-2, and Mistral-7B. Although the findings echoed concerns about AI-generated emails being formal and emotionally monotonous, Li et al. also highlighted a difference in style: AI-generated emails were verbose and lexically redundant, whereas human-generated emails were succinct, personalized, and linguistically varied. Although polite and positive, LLM-generated emails often lack contextual specificity, underscoring the need for personalization, as demonstrated by several studies in this literature review. These findings are corroborated by a small-scale user study involving 41 participants, who praised the efficiency of AI-generated writing but criticized its limited diversity and personalization. This highlights the reality that although LLM-based email automation is efficient, it can be complemented by hybrid approaches that balance AI-driven efficiency with user customization, thereby adding authenticity and adaptability. The study highlighted a significant trade-off in email generation systems: reconciling the grammatical accuracy and speed of AI-produced messages with the conversational, personalized tone of human-composed emails.
Recent studies have examined how generative AI can support users in emotionally sensitive communication scenarios, such as instructors responding to challenging or confrontational emails from students. Reisman [
43] evaluated several LLMs, including Claude, Gemini, Perplexity, and Copilot, and found that these tools consistently produced replies that maintained a professional and non-hostile tone, even in situations where a human’s initial reaction might have been defensive or impulsive. Notably, the study highlights the role of AI as a “calming time-out” mechanism: the act of prompting the system requires the instructor to slow down and reflect, reducing the likelihood of sending regrettable or emotionally charged responses. The generated replies also clearly articulated boundaries and expectations while remaining respectful, demonstrating AI’s potential to mediate tone in stressful communication. The article also predicts a shift towards locally trained Small Language Models (SLMs) that can learn a user’s communication style over time. This raises questions about the degree of autonomy such systems should have when responding to emails on behalf of the user [
43].
Studies on user perception consistently point to a tension between productivity and authenticity. While users appreciate speed and ease of use, concerns persist regarding emotional tone, verbosity, and lack of contextual precision in AI-generated emails. Importantly, acceptance varies across domains—business, accessibility, and personal contexts—highlighting that personalization must extend beyond style to include purpose, audience, and emotional resonance. Future systems will need to adapt not just to how users write, but why they write.
4.5. Benchmarking and Evaluation
Robust evaluation methods are crucial for accurately measuring the performance, personalization, fidelity, and impact of LLM-based email systems. This section reviews emerging benchmarks focused on user feedback responsiveness, long-form coherence, and on-device efficiency, highlighting a shift toward hybrid evaluation frameworks that combine automated metrics with human judgment.
To more effectively evaluate LLMs in interactive settings, the authors Yan J., Luo Y., et al. [
44] introduced RefuteBench, a novel benchmark designed to systematically assess an LLM’s ability to incorporate refuting user feedback. This approach addresses a vital challenge in generative AI: LLMs often struggle to incorporate user corrections, limiting their effectiveness in applications like email writing, where user-specific adjustments are essential. RefuteBench takes a dynamic approach by generating counter-instructions to test models across various tasks, including question answering, machine translation, and email writing. This contribution addresses a key gap in current instruction-following evaluations by focusing on scenarios in which users actively refine or correct model-generated responses. The benchmark rigorously evaluates LLMs’ compliance with updated instructions across single- and multi-feedback scenarios. It introduces two novel evaluation metrics: Feedback Acceptance (FA), which assesses whether feedback is positively acknowledged, and Response Rate (RR), which evaluates whether feedback is correctly incorporated into subsequent interactions. To address the weaknesses these metrics expose, the authors present a pragmatic “recall-and-repeat” prompting strategy that leverages past feedback to increase model responsiveness. The experimental results showed significant gains in Response Rate across a range of LLMs, testifying to the effectiveness of this strategy. Furthermore, the study’s examination of the relationship between FA and RR provides valuable insight into the importance of accepting initial feedback for ongoing compliance. Data gathering involved existing datasets, such as RIPPLEEDITS and WMT2023, along with custom-created email writing instructions, to provide a comprehensive evaluation. This study contributes to a deeper understanding of LLMs’ interactive capabilities by identifying their limitations in responding to refuted instructions and proposing a viable solution to enhance their responsiveness, with direct implications for the field of generative AI assistants.
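The following sketch illustrates our reading of the recall-and-repeat idea: accumulated user feedback is restated before each new request so the model re-commits to earlier corrections. The prompt wording is assumed and does not reproduce the exact strategy from [44].

```python
# Hedged sketch of a "recall-and-repeat" style prompt builder: before each new
# request, prior user feedback is restated so the model re-affirms compliance.
def build_recall_and_repeat_prompt(task: str, feedback_history: list[str]) -> str:
    if not feedback_history:
        return task
    recalled = "\n".join(f"- {fb}" for fb in feedback_history)
    return (
        "Before answering, restate the following user feedback and confirm "
        "you will comply with every item:\n"
        f"{recalled}\n\n"
        f"Now complete the task: {task}"
    )

history = ["Always sign off as 'Dr. Lee'.", "Keep replies under 100 words."]
print(build_recall_and_repeat_prompt("Draft a reply declining the invitation.", history))
```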
While the work by Yan et al. [
44] prioritizes the use of feedback and responsiveness, a different study by the authors Xu et al. [
45] tackles another significant challenge for generative AI email assistants: high inference latency in on-device LLMs, which is a critical issue for systems processing long contextual prompts. Their paper introduces llm.npu, the first LLM inference system to successfully utilize on-device Neural Processing Unit (NPU) offloading to solve this problem, particularly in the dominant prefill stage. The system integrates three new techniques: chunk-sharing graphs, which improve efficiency and lower memory overhead by allowing for variable-length prompt handling through division into fixed-size chunks; shadow outlier execution, which preserves accuracy by offloading the processing of significant activation outliers to the CPU/GPU in parallel; and out-of-order subgraph execution, a scheduling framework that intelligently improves the utilization of heterogeneous mobile processors (CPU/GPU and NPU) by allocating and executing Transformer blocks based on their hardware affinity and precision sensitivity.
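To make the chunking idea concrete, the sketch below pads and splits a variable-length prompt into fixed-size chunks, the precondition that lets a single pre-built NPU graph be reused per chunk. It models only the chunking step, not the compiled Transformer subgraphs or outlier handling of llm.npu; the chunk size and token IDs are arbitrary.

```python
# Illustrative sketch of the fixed-size chunking idea behind chunk-sharing
# graphs [45]: a variable-length prompt is padded and split into equal chunks
# so one pre-built execution graph can be reused for every chunk.
def to_fixed_chunks(token_ids: list[int], chunk_size: int, pad_id: int = 0) -> list[list[int]]:
    chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))  # pad tail
    return chunks

prompt_tokens = list(range(1, 11))            # a 10-token prompt
for chunk in to_fixed_chunks(prompt_tokens, chunk_size=4):
    print(chunk)                              # three reusable 4-token chunks
```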
Rigorous testing performed by the authors of [
45] on standard mobile hardware showed that the performance improvements of their system are significant, with prefill speedups of up to 43.6 times over GPU baselines and power savings of up to 59.5 times, all with preserved inference accuracy. In real-world usage, particularly in scenarios involving intelligent email assistants with lengthy prompts, the approach achieves latency improvements of between 1.4 and 32.8 times compared to competing systems. Notably, the system achieves prefill speeds of over 1000 tokens per second for billion-parameter models on mobile devices, marking a significant advancement in enabling rapid and efficient on-device generative AI capabilities for apps that require accelerated processing of large contextual information.
Shifting from system-level advances to personalization techniques, a key contribution to the literature is provided by work on retrieval optimization in LLM personalization. The LaMP benchmark [
46] evaluates performance on seven personalized NLP tasks; the associated study reported that retrieval optimization led to a 5.5% average improvement in LLM personalization, with a 33.8% improvement in cold-start settings. Among its novel features were pre- and post-generation retrieval selection mechanisms, which picked alternate retrieval methods per query to trade off recency, keyword relevance, and user writing style. These findings underscore the importance of adaptive retrieval selection in producing highly personalized LLM responses, with direct applications to email response generation systems.
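The following minimal sketch illustrates the per-query retrieval selection idea: a router chooses between a recency-oriented and a keyword-oriented retriever based on surface cues in the query. The routing rule and retrievers are simplified placeholders, not the mechanisms evaluated in [46].

```python
# Hedged sketch of per-query retrieval selection: route each query to the
# retriever whose ranking criterion best fits it. All logic here is assumed.
from typing import Callable

Doc = dict
Retriever = Callable[[str, list[Doc]], list[Doc]]

def by_recency(query: str, docs: list[Doc]) -> list[Doc]:
    return sorted(docs, key=lambda d: d["timestamp"], reverse=True)[:2]

def by_keyword(query: str, docs: list[Doc]) -> list[Doc]:
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(terms & set(d["text"].lower().split())),
                  reverse=True)[:2]

def select_retriever(query: str) -> Retriever:
    # Hypothetical routing rule: time-sensitive wording favors the recency retriever.
    time_cues = {"latest", "today", "yesterday", "recent"}
    return by_recency if time_cues & set(query.lower().split()) else by_keyword

docs = [
    {"text": "budget review notes", "timestamp": 1},
    {"text": "latest travel itinerary", "timestamp": 3},
    {"text": "travel reimbursement form", "timestamp": 2},
]
query = "latest travel plans"
print(select_retriever(query)(query, docs))
```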
To raise the standard of personalization benchmarks, the authors Kumar I., Viswanathan S., et al. [
47] addressed an essential gap in evaluation tools by introducing LongLaMP, a specialized benchmark for evaluating personalized generation of long-form text across various tasks, including email writing. This contribution is highly relevant to email because previous personalization datasets concentrated primarily on short-form textual outputs, without regard for the coherence and consistency issues that arise in lengthy communications. In contrast to the state of the art at the time, LongLaMP focused on long-range coherence, stylistic consistency, and topic consistency within lengthy passages. The authors used a RAG model in which documents and user-specific features were leveraged to constrain the generation step and avoid common problems like topic drift. The LongLaMP [
47] dataset was large, consisting of multi-paragraph custom text samples from email conversations, scientific abstracts, and web reviews. The quantitative metrics presented in the article indicate that RAG-improved models outperform conventional fine-tuning approaches by 5.7% to 128%, as observed across various evaluation metrics, including BLEU, ROUGE-L, and METEOR. The proposed evaluation framework categorizes models along two critical dimensions: user-based (cold-start) personalization, which assesses a model’s capacity to generalize to new users with minimal or no prior history, and temporal personalization, which examines how models adjust to changing user preferences over time. Among the most noteworthy findings was that dense retrieval-based Contriever methods, which use neural networks to capture the semantic meaning of text, vastly outperformed traditional keyword-based methods such as BM25, which rely on word frequency and occurrence, in identifying useful personalization signals, with significant implications for retrieval-based personalization methods in email systems.
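The sketch below contrasts the two retrieval families compared in LongLaMP: keyword-based BM25 scoring versus dense embedding similarity. It assumes the rank-bm25 and sentence-transformers packages are installed, and a small MiniLM model stands in for Contriever purely for illustration.

```python
# Hedged sketch: keyword (BM25) vs. dense retrieval scoring over a toy corpus.
# Assumes `pip install rank-bm25 sentence-transformers`; model choice is illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Thanks for the quarterly budget figures, attached are my notes.",
    "Happy to reschedule our sync to Thursday afternoon.",
    "Please find the signed contract enclosed.",
]
query = "can we move our meeting to later this week"

# Keyword scoring: relies on exact word overlap with the query.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print("BM25 scores: ", bm25.get_scores(query.lower().split()))

# Dense scoring: captures semantic similarity even without shared words.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
print("Dense scores:", util.cos_sim(query_emb, doc_emb))
```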
Evaluation practices for personalized LLMs are evolving, with new benchmarks that address context incorporation, responsiveness to feedback, and stylistic coherence. However, inconsistencies in datasets, lack of long-form personalization metrics, and overreliance on automatic scoring still hinder comprehensive assessment. The emergence of benchmarks like RefuteBench and LongLaMP shows progress but also signals the need for hybrid evaluations that combine automated measures with human judgment—particularly for nuanced tasks like email-response generation.
Building on these benchmark-level insights, recent research has also examined how evaluation can be structured at the pipeline component level, rather than focusing solely on final output quality. A hybrid chatbot implementation demonstrates this approach by separating intent classification, entity extraction, retrieval, and response generation into distinct evaluable stages. The system integrates BERT-based intent recognition with named-entity extraction and retrieval over a ChromaDB vector store, with generated replies produced through a lightweight LLM orchestration layer [
48]. The reported metrics show high intent classification accuracy (Bi-LSTM at 98.12%, BERT at 99.70%) across an 11-intent corpus. Qualitative relevance assessments also suggest that specificity and correctness of the retrieved context strongly influence final message coherence. Although this evaluation structure has been applied to an e-commerce domain, it can be mapped directly onto email assistance workflows, where determining communicative intent (e.g., follow-up, scheduling or clarification) is a prerequisite for producing contextually appropriate, high-quality email text [
48].
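The component-level evaluation idea can be expressed as a small harness that scores each pipeline stage against its own labeled cases rather than judging only the final reply. The stage function and test cases below are stubs; the cited system [48] uses BERT and Bi-LSTM classifiers with a ChromaDB store, which this sketch does not reproduce.

```python
# Minimal sketch of component-level evaluation: score one pipeline stage
# (intent classification) on its own labeled data. Stage logic is a stub.
def classify_intent(email: str) -> str:
    # Placeholder rule standing in for a trained BERT/Bi-LSTM classifier.
    return "scheduling" if "meeting" in email.lower() else "follow_up"

def stage_accuracy(stage_fn, labeled_cases: list[tuple[str, str]]) -> float:
    hits = sum(stage_fn(text) == label for text, label in labeled_cases)
    return hits / len(labeled_cases)

intent_cases = [
    ("Can we set up a meeting next week?", "scheduling"),
    ("Just checking in on the report.", "follow_up"),
]
print(f"Intent accuracy: {stage_accuracy(classify_intent, intent_cases):.0%}")
```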
This pipeline-oriented perspective can also be observed in real-world deployments. For example, a large-scale cold outreach system for staffing applications was implemented using open-source LLMs and retrieval-based personalization. This system reported an email response rate of 9% and a meeting conversion rate of 4%, outperforming conventional outreach baselines [
49]. A comparative evaluation of multiple open-source models revealed clear trade-offs between response effectiveness and computational cost. LLaMA-3-70B produced the highest engagement (9.8% response rate and 4.7% conversion rate), but required approximately 140 GB of memory and 7.3 s of latency per message. In contrast, the Mistral-8×7B (MoE) model delivered nearly equivalent engagement (9.3% response rate and 4.2% conversion rate) with substantially reduced latency of approximately 3.8 s and lower resource demand. These findings demonstrate that effective personalization strategies must balance model expressiveness, retrieval grounding precision and the computational feasibility of deployment [
49].
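Using the deployment figures reported in [49], the short sketch below shows one way to weigh engagement against serving latency when selecting a model. The efficiency formula is our illustrative assumption, not a metric from the study.

```python
# Illustrative trade-off calculation over figures reported in [49]. The
# "responses per latency-second" score is a hypothetical efficiency metric.
models = {
    "LLaMA-3-70B":        {"response_rate": 0.098, "latency_s": 7.3},
    "Mistral-8x7B (MoE)": {"response_rate": 0.093, "latency_s": 3.8},
}
for name, m in models.items():
    score = m["response_rate"] / m["latency_s"]
    print(f"{name}: {score:.4f} responses per latency-second")
```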
Beyond automated metrics such as BLEU, ROUGE and METEOR, the evaluation of email-writing assistants is increasingly dependent on qualitative indicators, including user satisfaction, the appropriateness of the tone, contextual alignment and the perceived authenticity of the text. The studies reviewed in this section consistently demonstrate that models that perform well on traditional text similarity metrics can still produce outputs that feel impersonal, emotionally flat or misaligned with communicative intent. Similarly, real-world deployments suggest that efficiency and personalization should be evaluated in terms of practical usability rather than model accuracy alone. These findings highlight the need for hybrid evaluation frameworks that integrate quantitative scoring with human-centered assessments and structured user feedback. As research progresses, developing standardized, domain-specific benchmarks for personalized email writing remains a critical step in ensuring that evaluations capture not only textual correctness, but also the social, stylistic and relational dimensions of communication.
4.6. Summary of Comparative Analysis
Table 2 summarizes the technical dimensions of the reviewed studies, mapping each contribution across personalization methods, retrieval mechanisms, user interaction design, and evaluation practices. This comparative overview reveals a wide range of approaches, from modular architectures such as Luminate and REALIGN, to retrieval-centric systems like LongLaMP and RefuteBench.
The comparison reveals several notable trends. Firstly, retrieval-augmented pipelines are increasingly replacing traditional fine-tuning for context adaptation, striking a balance between personalization and privacy preservation. Secondly, evaluation practices remain heterogeneous, with only a minority of studies incorporating human-centered or longitudinal assessment. Thirdly, despite methodological advances, there is limited convergence on standard benchmarks, and interoperability between tools is rarely discussed.
These observations provide a structured overview of the current landscape. However, a broader synthesis is required to understand how these technical dimensions interconnect, and to identify any cross-cutting challenges that persist. The following subsection therefore provides an integrated discussion of emerging trends and unresolved issues in LLM-based email automation.
To facilitate a better understanding of the data presented in
Table 2, the following points detail the definitions for each column:
Personalization: RAG Focus: This column highlights systems leveraging Retrieval-Augmented Generation, a critical technique discussed throughout this survey for enhancing responses with contextual, external knowledge.
Personalization: PEFT/Fine-tuning Focus: The use of Parameter-Efficient Fine-Tuning or similar deep adaptation methods is noted, as these methods are central to tailoring general models to specific domains.
Personalization: Iterative Refinement/Feedback-Driven: This column identifies approaches that emphasize continuous improvement through self-critique, learning from preferences, or direct user feedback, which are key for evolving model performance.
Addresses User Experience/Interaction Design: Given that these systems are user-facing, a focus on intuitive interfaces, usability, and overall user interaction design is a significant factor for practical adoption and effectiveness.
Addresses Security/Privacy Aspects: Given the inherent sensitivity of email communication, this column highlights systems or studies that explicitly consider and address security vulnerabilities, defenses, or privacy-preserving mechanisms. These threats include ecosystem-level attacks, such as the Morris II worm, which propagates via adversarial prompts embedded in emails. The literature also highlights vulnerabilities to prompt-based attacks, with LLMs being exploited to generate convincing spear-phishing emails at low cost. Furthermore, a more advanced security concern involves adaptation-based attacks, like the Trojan plugins that compromise fine-tuning components. In response, studies propose mitigation strategies, such as Virtual Donkey, a defense solution that detects worm propagation based on input-output similarity, and emphasize the need for security-aware personalization frameworks. On the privacy front, systems like Panza prioritize local execution and low-cost fine-tuning with tiny datasets to ensure sensitive data remains on the user’s device.
Includes Evaluation/User Study: Indicates whether the system or approach has undergone formal evaluation, been tested against benchmarks, or included studies of user perceptions, which are vital for validating efficacy and user acceptance.
The comparative overview in
Table 2, along with the literature review, provides clear answers to the research questions guiding this survey.
In response to RQ1 (What are the core strategies and frameworks that enable the effective generation of personalized email responses using LLMs?), this review concludes that effective personalization in email generation cannot be achieved through a single technology, but rather through a multifaceted approach that combines three core technical strategies: RAG, PEFT, and iterative refinement. RAG is a key strategy for providing LLMs with up-to-date external context, such as a user’s past correspondence, without costly retraining. This approach is central to frameworks like PersonaAI, which uses semantic retrieval for dynamic personalization, and Panza, which leverages RAG within a privacy-preserving, locally run architecture. PEFT, in turn, addresses the challenge of adapting massive models to individual users in a scalable manner. Techniques like Low-Rank Adaptation (LoRA) enable the fine-tuning of an LLM to a user’s specific writing style by training only a small fraction of the model’s parameters, making personalization feasible on commodity hardware with limited data. Panza successfully employs this method, and LoRA is likewise foundational to the alignment process in DITTO, which updates a model based on user demonstrations. The field is also advancing toward more dynamic systems through iterative, feedback-driven learning. Frameworks such as SELF-REFINE enable an LLM to critique and revise its own output, while dedicated critic models like Shepherd and implicit learning systems like PIT further enhance this capability. Alongside these core techniques, the most effective systems recognize the importance of user-centric design, innovating on the user-AI interaction with structured interfaces for creative exploration (Luminate), cognitive load reduction through guided Q&A (ResQ), and novel interfaces for managing multiple writing variations (ABScribe).
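As a concrete illustration of the PEFT strategy discussed above, the following minimal sketch attaches LoRA adapters to a small causal language model using the Hugging Face peft library. GPT-2 and the hyperparameter values are illustrative stand-ins, not the configuration used by Panza or DITTO.

```python
# Hedged sketch: wrapping a small model with LoRA adapters via the peft library.
# Only the low-rank adapter weights become trainable; the base model is frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model
config = LoraConfig(
    r=8,                        # low-rank dimension: few trainable parameters
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```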
Regarding RQ2 (What are the primary technical methods, security vulnerabilities, user perceptions, and evaluation benchmarks for personalized LLM-based email assistants?), our analysis highlights several key dimensions beyond these core strategies. The foundational technology for these systems is the Transformer architecture, with RAG and PEFT being the key methods for personalization. A critical emerging technical consideration is on-device performance to ensure privacy and low latency, with systems like llm.npu demonstrating successful offloading to mobile NPUs for significant speedups. However, the deployment of these systems introduces considerable security vulnerabilities, including ecosystem-level threats such as the Morris II worm, the misuse of LLMs for generating convincing spear phishing emails, and the discovery of “Trojan” attacks that can compromise PEFT components. User perception studies reveal a clear trade-off: while users value the productivity gains, with tools like LaMPost showing clear benefits for users with dyslexia, they remain concerned about the authenticity of AI-generated emails, often finding them verbose and emotionally flat. This can lead to a drop in trust, especially in sensitive contexts. Finally, evaluation benchmarks are evolving beyond traditional metrics, with the emergence of more robust assessments, such as RefuteBench (for feedback incorporation), LaMP (for personalized NLP tasks), and LongLaMP (for long-form coherence), underscoring a consensus that a hybrid approach combining automated metrics with human evaluation is essential.
The primary contribution of this survey, therefore, is to synthesize and structure this diverse body of work into a single, coherent overview explicitly focused on the domain of email response generation. The classification used serves as a practical, quick-reference guide for researchers and practitioners, highlighting the current state of the art and identifying key architectural patterns in the design of personalized email assistants.
4.7. Cross-Thematic Synthesis and Open Challenges
The thematic analysis presented in this review reveals clear interdependencies among the five research dimensions: Tools and Frameworks, Personalization Techniques, Security Vulnerabilities, User Perceptions and Benchmarking and Evaluation. While each dimension advances the field independently, it is their convergence that defines the maturity and scalability of LLM-based email automation.
The following points outline the key open challenges observed in the reviewed literature across these interconnected dimensions, highlighting the persistent technical and methodological gaps:
Methods of personalization such as PEFT and RAG have a direct impact on data privacy and model security. Systems that rely on fine-tuning private email corpora risk leaking sensitive information. In contrast, retrieval-augmented models reduce this exposure but introduce new vulnerabilities through external data pipelines. A recurring limitation across studies is the lack of a unified framework for privacy-preserving personalization that balances contextual adaptation with data minimization.
Most of the systems reviewed demonstrate effective personalization only on a small scale. Training or adapting individual models for each user remains computationally prohibitive. Lightweight strategies (e.g., LoRA and adapters) mitigate this issue, but they often degrade performance in long-context or multi-thread scenarios, which are typical of email workflows. Future work must optimize inference latency and parameter efficiency simultaneously to enable enterprise-level deployment.
Although benchmarking studies (e.g., LaMP, LongLaMP and RefuteBench) have advanced personalized evaluation, they remain siloed and assess narrow capabilities rather than end-to-end communication quality. There is an urgent need for hybrid evaluation frameworks that incorporate human input and combine automated metrics (e.g., BLEU, ROUGE-L and METEOR) with subjective measures of trust, satisfaction and tone alignment; a minimal sketch of one such combined score follows this list.
Although user-perception research acknowledges concerns regarding trust and authenticity, few technical frameworks operationalize these insights. Integrating affective computing and user feedback loops into model adaptation pipelines could enhance the perception of empathy and reduce the “synthetic” tone that often undermines the acceptance of AI-generated emails.
The current literature treats email generation as an isolated task, but true automation requires the integration of scheduling and context retrieval. Collaboration between natural-language processing, knowledge graph construction, and HCI research is required to bridge these layers.
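To make the hybrid-evaluation point above concrete, the sketch below blends an automatic ROUGE-L score with a normalized human tone rating into a single value. It assumes the rouge-score package; the equal 0.5/0.5 weighting and the 1-5 rating scale are illustrative assumptions, not a standard from the reviewed literature.

```python
# Hedged sketch of a hybrid evaluation score: automatic ROUGE-L blended with a
# human tone rating. Weights and rating scale are assumed for illustration.
from rouge_score import rouge_scorer

def hybrid_score(reference: str, candidate: str, human_tone_1_to_5: int) -> float:
    auto = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
    rouge_l = auto["rougeL"].fmeasure            # automatic component, 0..1
    human = (human_tone_1_to_5 - 1) / 4          # normalize rating to 0..1
    return 0.5 * rouge_l + 0.5 * human           # assumed equal weighting

print(hybrid_score("Thanks, see you Thursday.",
                   "Thank you, see you on Thursday.",
                   human_tone_1_to_5=4))
```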
Summary of Emerging Research Directions
The synthesis indicates that progress towards trustworthy and scalable email automation depends on three priorities: (i) developing privacy-preserving personalization architectures; (ii) establishing unified evaluation benchmarks that blend human and automatic metrics; and (iii) designing adaptive learning systems that can continually align with user preferences. These cross-disciplinary challenges define the next frontier of research for personalized, LLM-driven communication assistants.
In addition to traditional encryption and access-control mechanisms, emerging techniques such as federated fine-tuning, differential privacy, and secure aggregation provide practical solutions for privacy-preserving email automation. These approaches allow models to be adapted without centralized access to sensitive email data, providing a technical basis for scalable, regulation-compliant personalization.
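As a minimal illustration of one of these techniques, the sketch below applies differentially private SGD via the Opacus library to a toy model, adding per-sample gradient clipping and noise during training. The linear model, synthetic data, and noise parameters are placeholders, not a recommended configuration for email data.

```python
# Hedged sketch of DP-SGD fine-tuning with Opacus: per-sample gradient
# clipping plus calibrated noise. All hyperparameters here are illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Linear(16, 2)   # toy stand-in for a small adapter
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=8)

model, optimizer, loader = PrivacyEngine().make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.1,   # noise scale (assumed, not tuned)
    max_grad_norm=1.0,      # per-sample clipping bound
)
for x, y in loader:
    optimizer.zero_grad()
    torch.nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()
```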
5. Potential Benefits of Using Personalized LLM Email Assistants
Despite limited literature on personalized email assistants powered by LLMs, substantial research indicates significant benefits. These include enhanced productivity, improved personalization, and reduced cognitive load, among other benefits. These potential benefits, identified through our literature review, are summarized in detail in
Table 3.
Projects like Panza [
30] demonstrate that LLMs can be effectively adjusted to reflect a user’s unique writing style with limited data, using techniques such as Reverse Instructions and PEFT, as noted earlier. This stylistic personalization, considered fundamental for email assistants, is complemented by RAG, which enhances contextual awareness by leveraging previous correspondence. Recent evaluations of email composition capabilities across various LLMs indicate that models like ChatGPT demonstrate superior performance based on clarity, tone, and relevance metrics. Empirical studies using tools like M365 Copilot have revealed tangible productivity improvements, with 64% of participants reporting a positive impact on email writing quality and efficiency during six-month trials [
50].
Beyond basic content generation, these systems offer substantial reductions in cognitive load. Tools incorporating LLMs facilitate the exploration of multiple text variations in a non-linear fashion, allowing users to consider different expression approaches—particularly valuable for nuanced communications. The ability to convert brief prompts into complete, personalized emails minimizes repetitive effort while maintaining communication authenticity. Integration approaches that minimize context switching further enhance workflow efficiency, with research suggesting that initially creating rudimentary requirement summaries and subsequently requesting expansion produces optimal results.
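The draft-then-expand pattern described above can be expressed as a simple two-call prompting routine. The sketch below is hypothetical: call_llm stands in for any chat-completion client, and the prompt wording is assumed rather than taken from a reviewed study.

```python
# Hedged sketch of the two-step pattern: summarize requirements first, then
# ask for expansion into a full email. `call_llm` is a hypothetical stub.
def call_llm(prompt: str) -> str:
    # Stand-in for a real model client; replace with an actual API call.
    return f"[model output for prompt: {prompt[:40]}...]"

def draft_then_expand(bullet_points: str, recipient: str) -> str:
    summary = call_llm(
        "Summarize these points as a terse list of email requirements:\n"
        f"{bullet_points}"
    )
    return call_llm(
        f"Expand the following requirements into a polite email to {recipient}, "
        f"keeping my usual concise tone:\n{summary}"
    )

print(draft_then_expand("- decline invite\n- suggest next month", "the organizer"))
```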
Personalized email assistants also demonstrate valuable adaptive learning capabilities through mechanisms like the “Self-Refine” concept [
22] and the PIT framework [
24], which enable iterative improvement based on implicit user feedback. From a credibility perspective, studies on “impersonation” capabilities indicate that properly personalized outputs maintain authenticity. Additionally, the research emphasizes the importance of user control, with participants valuing the ability to customize, edit, and remove suggestions to ensure alignment with individual communication preferences and contexts.
The technological foundation for these benefits leverages advanced Transformer architectures, practical prompt engineering, and efficient fine-tuning methods like LoRA, making personalization increasingly feasible. While complete end-to-end personalized email assistant systems remain limited in the academic literature, the convergence of evidence from related fields suggests these tools can significantly transform digital correspondence by making it more efficient, contextually relevant, and aligned with users’ authentic communication styles. This transformation appears particularly pronounced in structured communication tasks, where the cognitive overhead of composition can be substantially reduced without sacrificing personal voice or communication efficacy.
However, the integration of personalized LLM assistants into real-world email systems, particularly on edge devices such as smartphones or embedded enterprise platforms, introduces necessary trade-offs between model complexity, latency, and energy efficiency. While recent work like llm.npu [
45] demonstrates that significant inference acceleration can be achieved via Neural Processing Units (NPUs), these optimizations often require architectural compromises, such as precision reduction or chunked execution strategies.
Thus, the deployment of high-quality, personalized assistants must strike a balance between the depth of personalization (e.g., fine-grained stylistic adaptation) and the responsiveness and energy constraints of target devices. In edge scenarios, lightweight methods such as RAG and PEFT become critical not only for privacy and personalization but also for enabling feasible inference under resource limitations.
6. Limitations and Risks of Using Personalized LLMs
Customizing LLMs for email writing poses significant challenges, particularly in evaluation and practical implementation. The current assessment measures do not always accurately reflect the degree to which an LLM captures the individual’s writing style, tone, or communicative intent. Most metrics rely on task-specific, pre-established quality criteria that fall short of accounting for the complexities of personalized email generation, especially in professional and multicultural settings [
50]. An underlying issue further compounds this limitation: what would an email have to sound like for a user to say, “I could have written that”? It remains unlikely that most users possess a consistently unique and recognizable writing fingerprint in business emails, beyond conforming to general politeness, structural, or professional norms. This uncertainty is itself a significant challenge for developing meaningful assessment approaches.
Thus, there is an evident need for more advanced benchmarks that can evaluate personalized email outputs in real-life environments, encompassing diverse professional settings, varying formality levels, and nuanced interactions. Modern standards often overlook such nuances, and the evidence suggests that open-source models consistently fall behind closed-source models in successful email personalization. While frameworks aiming to understand email conventions and maintain stylistic consistency are crucial for effective personalization, established email categorizations often oversimplify the multi-faceted nature of real-world interactions, where single communications can blend business and personal purposes. Furthermore, an overemphasis on mimicking stylistic traits may overlook the potentially greater impact of providing rich, accurate context to already skillful LLMs. Delivering the correct contextual information—about the relationship, history, and specific goals of the communication—may be more critical than achieving perfect stylistic replication in generating an appropriate and authentic-feeling response, presenting a limitation or perhaps a necessary shift in focus for current personalization approaches.
Beyond these functional limitations, significant ethical concerns surround the use of personalized email assistants. The ability to mimic style, imperfect though it may be, poses risks of impersonation, including phishing or social engineering attacks. Privacy remains a top priority, as emails often contain personal or confidential business information. Adversaries may also exploit personalization systems to craft malicious content or prompt the LLM to emit confidential information retrieved via RAG or drawn from user history.
Achieving a balance between personalization and authenticity remains challenging. LLMs may fail to fully incorporate a user’s characteristic phrases or specialized technical jargon without direct, ongoing input. Practical considerations also limit personalization, including the requirement for sufficient user email history, which poses challenges in cold-start scenarios and raises privacy concerns, as well as the need for hardware efficiency to support large-scale deployment of these systems. Overall, although RAG improves personalization by presenting contextually relevant suggestions, it also raises security and privacy concerns regarding access to and processing of potentially sensitive external user data. This increases the system’s vulnerability to data leakage and manipulation through injection attacks.
The threat model for LLM-based email clients assumes that adversaries are capable of injecting malicious content via incoming messages, compromised retrieval sources, or unvalidated plugin calls. Mitigation controls include least-privilege API scopes, schema-validated tool invocation, whitelisted retrieval, and content-provenance checks.
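The sketch below illustrates two of these controls, a tool whitelist combined with schema-validated invocation, using the jsonschema package. The tool name, schema, and field limits are hypothetical examples rather than a prescribed policy.

```python
# Hedged sketch: vet model-proposed tool calls against a whitelist and a JSON
# schema before execution. Tool names and schema fields are hypothetical.
from jsonschema import validate, ValidationError

ALLOWED_TOOLS = {
    "send_reply": {
        "type": "object",
        "properties": {
            "thread_id": {"type": "string"},
            "body": {"type": "string", "maxLength": 5000},
        },
        "required": ["thread_id", "body"],
        "additionalProperties": False,  # reject unexpected fields outright
    }
}

def vet_tool_call(name: str, args: dict) -> bool:
    if name not in ALLOWED_TOOLS:       # whitelist check
        return False
    try:
        validate(instance=args, schema=ALLOWED_TOOLS[name])
        return True
    except ValidationError:
        return False

print(vet_tool_call("send_reply", {"thread_id": "t-42", "body": "Thanks!"}))  # True
print(vet_tool_call("delete_all", {}))                                        # False
```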
Privacy considerations are divided between training-time and runtime data flows. Training-time risks relate to dataset collection and consent, while runtime risks involve user inputs, retrieval context, and storage policies. Approaches such as local inference, retrieval filtering, and differential privacy help to ensure compliance with data protection frameworks such as the GDPR and the EU AI Act, particularly with respect to the use of potentially sensitive external or historical data sources. Addressing these complex limitations and associated risks is crucial to ensuring the ethical development and implementation of personalized LLM email assistants.
Ethical and Governance Frameworks in Personalized Communication Systems
While ethical concerns around impersonation and privacy are pressing, a comprehensive analysis also requires situating these issues within structured AI ethics frameworks. From a fairness perspective, personalized assistants risk reproducing and amplifying linguistic or cultural biases present in their training data. For instance, formality norms and politeness strategies can vary across languages and regions, and bias in these patterns may lead to unequal representation or misinterpretation in automated email responses.
Bias detection and mitigation techniques, such as counterfactual data augmentation, adversarial debiasing, and fairness-aware evaluation metrics, are essential to ensure equitable communication outcomes. These tools should be integrated into future personalization pipelines, not only to protect underrepresented user groups but also to prevent systemic distortions in tone and intent.
Accountability and transparency mechanisms are equally important. In professional email contexts, the line between human and AI authorship must remain clear to preserve trust and legal responsibility. Techniques such as model card documentation, audit trails, and usage disclaimers can provide traceability, enabling users and organizations to understand how model outputs were generated and adapted.
Furthermore, frameworks like the EU AI Act and the OECD AI Principles emphasize human oversight, explainability, and risk classification, all of which apply directly to AI-mediated communication. Embedding these governance principles into LLM-based email assistants will be critical to ensuring that personalization systems remain not only intelligent and efficient but also fair, accountable, and aligned with human values.
Risks can be grouped into input-level (prompt injection, data leakage), model-level (fine-tuning misuse, data poisoning), and output-level (misinformation, impersonation, bias). In line with AI Ethics Guidelines and Privacy by Design, future systems should ensure compliance with regulations such as the GDPR and the EU AI Act.
In summary, current research gaps remain in context retention across extended email threads, bias mitigation in tone and style adaptation, dynamic personalization that evolves with user preferences, real-time and user-centric evaluation methodologies, and the lack of standardized benchmarks. Addressing these challenges is essential for advancing toward secure, dependable, and trustworthy LLM-based communication assistants.
7. Conclusions
This survey consolidates the fragmented yet rapidly growing body of research on personalized LLMs for generating email responses. Using a systematic methodology and thematic synthesis, we analyzed 32 recent studies to map the evolving technical landscape and identify key architectural patterns, including RAG, PEFT, and iterative refinement mechanisms.
Rather than reiterating individual findings, we emphasize a broader insight: effective email personalization demands not only technical sophistication but also an alignment between system capabilities and human communication needs, including tone, purpose, and trust. The convergence of lightweight fine-tuning, secure on-device deployment, and responsive interfaces represents a significant shift towards scalable, privacy-conscious assistants that can support real-world writing workflows.
In the future, interdisciplinary collaboration will be essential. Research in NLP, cybersecurity, HCI, and applied ethics must converge to ensure that personalized email assistants evolve responsibly, enhancing productivity without compromising authenticity or privacy. As future work, we plan to propose a minimal evaluation bundle that integrates thread-aware test cases, human tone and faithfulness ratings, privacy sensitivity checks, and runtime cost reporting to support more standardized evaluation. As generative AI systems continue to influence digital communication, the development of trustworthy, adaptive, and human-centered assistants remains a frontier worthy of sustained exploration.