1. Introduction
Artificial intelligence has demonstrated transformative potential across diverse domains, from real-time monitoring systems in agriculture [1] to comprehensive language understanding [2]. Among these advances, large language models (LLMs) have brought remarkable progress in natural language processing, enabling applications ranging from conversational agents to decision-support systems in sensitive domains [3,4]. However, as LLMs are increasingly integrated into real-world contexts, concerns about their adherence to ethical principles, privacy norms, and contextually appropriate behavior have gained prominence. Privacy preservation is a particularly pressing concern in applications involving personal interactions or sensitive domains such as healthcare, finance, and eldercare [5]. These concerns highlight the need for LLMs to not only generate accurate and coherent outputs but also align their actions with societal and contextual expectations [6,7].
Ensuring that LLMs adhere to privacy norms requires them to recognize and mitigate risks of information leakage while still delivering utility [8,9,10,11]. This balance is critical for fostering trust in LLM-based systems, especially in environments where user data and personal attributes are central to the interactions. As LLMs become more embedded in such domains, their ability to respect contextual integrity [12] becomes a key determinant of their ethical acceptability and effectiveness.
Contextual integrity (CI) [12] provides a valuable framework for addressing these challenges, emphasizing that the flow of information should adhere to norms specific to its social context. This principle is particularly relevant in applications where sensitive information, such as personal or healthcare data, is involved. Ensuring that LLMs respect these norms requires models to understand and evaluate the roles, information attributes, and transmission principles that govern appropriate information exchanges. The integration of CI into LLM evaluation and design is therefore critical for ensuring their responsible and trustworthy deployment [13,14,15].
One significant challenge in this domain is the ability of LLMs to understand implicit norms and apply them consistently across diverse scenarios. While these models have demonstrated proficiency in linguistic tasks, they often struggle with recognizing and adapting to the nuanced expectations of different contexts [16,17,18]. This issue is compounded by their probabilistic nature and the biases inherent in their training data, which can lead to inappropriate or harmful outputs. Addressing these shortcomings requires rigorous evaluation and fine-tuning strategies that embed contextual and societal norms into the models’ functionality.
Moreover, fairness and ethical alignment are also essential dimensions of CI. LLMs must demonstrate equitable treatment across demographic groups, avoid perpetuating harmful stereotypes, and align with established societal values. These goals are challenging, as they require LLMs to navigate complex social contexts while minimizing unintended biases. Developing robust methodologies to evaluate and enhance these capabilities is an ongoing area of research, driven by the imperative to create AI systems that reflect ethical principles and promote inclusivity.
This survey explores the intersection of CI and LLM development, providing an overview of recent advancements, methodologies, and challenges. To guide the analysis, we formulate the following research questions:
RQ1: How have existing works operationalized the parameters of contextual integrity in the context of LLMs?
RQ2: What benchmarks, datasets, and evaluation metrics exist for assessing CI compliance in LLMs, and what are their strengths and limitations?
RQ3: Which components of the formal CI model remain unaddressed in the literature, and what concrete research directions can bridge these gaps?
Unlike prior privacy and ethics surveys on LLMs [6,7,8] that treat privacy as a broad, undifferentiated concern, this work makes three distinct contributions. First, we provide the first systematic mapping of existing CI-in-LLM research against the formal CI model of Barth et al. [19], revealing that critical parameters—including traces, policies, policy combination, and compliance—remain largely unaddressed in the literature. Second, we offer a cross-cutting gap analysis across benchmarks, evaluation metrics, and system architectures, identifying where current datasets and methods fall short of full CI parameter coverage. Third, we synthesize these findings into a set of concrete, actionable research directions for CI-native model training and evaluation. Together, these contributions position the survey not merely as an aggregation of recent work, but as a diagnostic and forward-looking guide for the CI-aligned LLM research community.
Paper organization. The remainder of this paper is structured as follows.
Section 2 provides the background on CI theory and the formal model of Barth et al. [19], as well as the role of theory of mind (RQ1).
Section 3 reviews current methods and system architectures, categorized by the number of agents involved, and provides a comparative analysis against CI parameters (RQ1).
Section 4 surveys available benchmarks, datasets, and evaluation metrics, discussing their strengths and limitations (RQ2).
Section 5 identifies open challenges, unaddressed CI parameters, and concrete future research directions (RQ3).
Section 6 concludes the paper.
1.1. Review Methodology
To ensure a systematic and reproducible survey, we followed a structured literature search and selection process. We searched five electronic databases: Google Scholar, IEEE Xplore, ACM Digital Library, Semantic Scholar, and arXiv. The primary search query combined terms related to contextual integrity and language models: “contextual integrity” AND (“large language model” OR “LLM” OR “language model” OR “privacy norms”). We also performed backward and forward citation tracking on key papers to identify additional relevant works. The temporal span of the review covers publications from 2018 to 2025, reflecting the earliest application of CI to smart-device privacy norms [20] through the most recent CI-aligned LLM benchmarks [21,22]. Inclusion criteria required that a paper (i) explicitly addresses or operationalizes contextual integrity theory in the context of LLMs or AI-based systems, and (ii) proposes a benchmark, dataset, evaluation metric, or system architecture for CI-aligned behavior. We excluded papers that (i) address general LLM privacy without reference to CI theory, (ii) are not available in English, or (iii) are duplicates across databases. This process yielded a final corpus of 12 primary studies that are systematically analyzed in Section 3, Section 4, and Section 5 and summarized in Table 1. An additional set of supporting references (privacy techniques, formal CI models, ToM benchmarks, and legal frameworks) is cited for context but falls outside the core comparative analysis.
Analytical Procedure
We independently mapped each primary study to the nine CI parameters in Table 1, marking a parameter as addressed only when the study explicitly operationalized or evaluated it. Discrepancies were resolved through discussion; the same process was applied to the dataset and metric analyses in Section 4.
2. Background
Privacy in LLMs goes beyond simply protecting Personally Identifiable Information (PII); it encompasses safeguarding user interactions, mitigating risks of data leakage, and ensuring compliance with legal and ethical standards. As LLMs process vast amounts of textual data, they may inadvertently memorize and reproduce sensitive information, raising concerns about data security, consent, and fairness. Addressing these concerns requires a multi-faceted approach that balances utility with privacy preservation. Several techniques, such as differential privacy [30,31,32], data sanitization [33,34,35,36], machine unlearning [37,38], contextual privacy protection [25,39], and privacy agents [40,41,42,43,44,45], are being explored to address the privacy challenge. In this survey, we examine the privacy issue through the lens of contextual integrity, as outlined in the next subsection, and introduce theory of mind as a means to help measure CI.
2.1. Privacy Through the Lens of CI
CI [12] is a theory that defines privacy as the appropriate flow of information within specific social contexts, rather than simply the protection of personal information. It emphasizes that privacy norms and rules differ across various social domains such as health, work, family, and civil and political contexts. A privacy violation occurs when information flows deviate from established norms and principles within a particular context. CI theory considers not just the nature of the information being shared, but also the context surrounding it. This approach moves beyond the idea of privacy as simply hiding personal information and instead emphasizes maintaining appropriate information flows. Key parameters of CI include the following (see Figure 1):
Sender: The entity initiating the information flow.
Receiver: The entity receiving the information.
Subject: The individual or entity the information is about.
Information Type: The category of information being shared (e.g., medical, financial, personal).
Transmission Principle: The rules or norms governing how the information should flow between sender and receiver (e.g., purpose, consent, belief).
Nissenbaum’s framework of CI emphasizes that maintaining privacy requires upholding two types of norms: norms of appropriateness and norms of dissemination. Norms of appropriateness dictate what type of information about an individual is suitable to be revealed in a specific context, while norms of dissemination govern the flow of personal information from one user to another. CI further suggests that information flow is only appropriate if it adheres to the established norms of the particular context, recognizing that norms vary across different situations. For instance, sharing medical information with a doctor is typically considered acceptable, whereas sharing the same information with an insurance company for marketing purposes would constitute a violation of CI.
In the context of LLMs, CI is used to ensure that only necessary and relevant information is shared for the intended goal. LLMs are seen as “untrusted receivers” that can store, reuse, or leak information in ways that users cannot control. This means that, when users interact with LLMs, they must be careful about the information they disclose. CI is used to evaluate whether the information flow between a user and an LLM adheres to the appropriate standards for that specific interaction. When using LLMs, a contextually private user query does not contain nonessential sensitive attributes. For instance, while seeking financial advice, sharing the names of family members is not essential, but a general overview of a financial situation is. Similarly, when asking about seasonal allergy symptoms, a full name or date of birth is unnecessary, but details about symptoms or lifestyle might be relevant. The application of CI to LLMs means ensuring the LLM only uses the essential information it needs to perform its task, without sharing extraneous private data, which is similar to the data minimization principle applied to organizations through regulations such as the GDPR [46].
Scope of this Review. We restrict the scope of this survey to works that explicitly invoke Nissenbaum’s CI theory or one of its formal extensions [19,47] when analyzing LLM behavior. Concretely, in scope are (i) papers that operationalize CI parameters (sender, receiver, subject, attributes, transmission principles) in LLM evaluation or design; (ii) CI-inspired benchmarks and datasets whose annotation schemes encode contextual norms; and (iii) system architectures (agents, filters, prompting strategies) designed to enforce CI-aligned information flow. Out of scope are general LLM privacy techniques (e.g., differential privacy, data sanitization, machine unlearning) that do not reference CI theory; broad AI-alignment or safety research not grounded in information-flow norms; and general fairness or ethics work, unless the authors explicitly map their contributions to CI concepts such as roles, contexts, or appropriateness norms.
2.1.1. Formal Models of Contextual Integrity
Several works have attempted to develop formal models for CI [19,47,48,49]. Among these, the model proposed by Barth et al. [19] provides the most comprehensive formalization and serves as the analytical backbone of this survey. We summarize its key components below and then define the evaluation parameters derived from it.
Core Primitives
The model defines three categories of primitives that participate in every information flow:
Agents: The entities involved in information exchange, assuming the roles of sender (who initiates the flow), receiver (who obtains the information), and subject (whom the information is about). In the LLM context, agents may be human users, AI assistants, or third-party services.
Attributes: The types of personal information exchanged (e.g., postal address, medical diagnosis, financial record). Attributes define what is being shared.
Messages: Composite communication units that contain one or more attributes pertaining to agents. A message captures the actual content transmitted in an interaction.
Contextual Elements
The model further specifies the social structure within which flows occur:
Roles: The capacities or functions that agents assume within a specific context (e.g., doctor, patient, employer). Roles determine the normative expectations for information handling.
Contexts: The structured social settings in which agents interact, each governed by its own informational norms. For example, a healthcare context involves roles like doctor and patient, with norms protecting medical confidentiality.
Traces: The history of communication actions among agents. Traces capture the temporal dimension of information flow, enabling analysis of how prior disclosures affect the appropriateness of subsequent ones.
Normative Governance
Finally, the model addresses how norms are expressed and enforced:
Policies: Formal rules specifying which information flows are permitted or prohibited under given conditions (e.g., “a doctor may share a patient’s diagnosis with a specialist only with the patient’s consent”).
Policy combination: Mechanisms for resolving overlapping or conflicting policies when multiple norms apply to the same flow.
Compliance: The property that an information flow—or a sequence of flows (trace)—satisfies all applicable policies, both in the immediate context and across future interactions.
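To make these definitions concrete, the sketch below encodes flows, policies, and trace compliance in Python. The class and function names, and the example consent policy, are our own illustrative choices rather than part of the formal notation of Barth et al. [19].

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Flow:
    """One communication action: sender sends an attribute about subject to receiver."""
    sender: str      # agent initiating the flow
    receiver: str    # agent obtaining the information
    subject: str     # agent the information is about
    attribute: str   # e.g., "medical_diagnosis"
    context: str     # e.g., "healthcare"

# A policy is a predicate over a flow and the trace of prior flows.
Policy = Callable[[Flow, list], bool]

def consent_policy(flow: Flow, prior: list) -> bool:
    """Illustrative norm: a diagnosis may flow only after the subject has
    sent a consent message in the same context."""
    if flow.attribute != "medical_diagnosis":
        return True
    return any(f.sender == flow.subject and f.attribute == "consent"
               and f.context == flow.context for f in prior)

def compliant(trace: list, policies: list) -> bool:
    """A trace complies if every flow satisfies all policies, given its prefix."""
    return all(p(flow, trace[:i]) for i, flow in enumerate(trace) for p in policies)

trace = [
    Flow("patient", "doctor", "patient", "consent", "healthcare"),
    Flow("doctor", "specialist", "patient", "medical_diagnosis", "healthcare"),
]
assert compliant(trace, [consent_policy])
assert not compliant(trace[1:], [consent_policy])  # same flow without prior consent
```

The key point this sketch captures is that compliance is a property of a whole trace: each flow is judged against every applicable policy given the history that precedes it.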
2.1.2. Operationalization in This Survey
We use the nine parameters above—agents, attributes, messages, roles, contexts, traces, policies, policy combination, and compliance—as the columns of our comparative analysis (Table 1). For each reviewed work, we record which of these parameters are explicitly addressed. This operationalization provides a structured, reproducible basis for identifying coverage gaps: as the analysis shows, the first five parameters (agents, attributes, messages, roles, contexts) receive substantial attention, while traces, policies, policy combination, and compliance remain largely unaddressed in the current literature.
Mapping CI Parameters to LLM Artifacts
To bridge the gap between CI theory and LLM engineering, we map each CI parameter to the concrete artifacts encountered in modern LLM deployments:
Agents → the user, the system prompt (which defines the assistant’s identity), and any third-party tools or APIs invoked during execution.
Attributes → personal data fields present in the user prompt, retrieved context (e.g., RAG passages), or tool outputs (e.g., database query results).
Messages → the full conversation turns, including both user inputs and model responses, as well as memory logs persisted across sessions.
Roles → the persona or function specified in the system prompt (e.g., “You are a medical assistant”) and implicit roles inferred from context.
Contexts → the deployment setting (healthcare portal, customer-service chat, internal enterprise tool) that determines which norms apply.
Traces → the conversation history and session memory, capturing the temporal sequence of disclosures.
Policies/Compliance → system-level guardrails, content-filtering rules, and organizational data-governance policies that constrain model output.
CI Violations Across LLM Deployment Patterns
Different deployment architectures expose distinct CI risks:
Chat assistants: CI violations occur when the model discloses attributes from earlier turns to a new participant, or when the system prompt overrides the user’s contextual expectations (e.g., logging sensitive queries without consent).
RAG pipelines: The retrieval step may surface documents containing attributes about subjects other than the current user; a CI violation arises if these are passed to the generation model without filtering for role- and context-appropriateness.
Tool-use agents: When an LLM invokes external APIs (e.g., calendar, medical records), it acts as both receiver and re-sender. CI requires that only the attributes necessary for the tool call are transmitted, and that tool outputs are not disclosed beyond the intended receiver.
Multi-agent workflows: Each agent may operate under different contextual norms. A CI violation occurs when Agent A forwards information to Agent B without verifying that B’s context permits the receipt of those attributes—a scenario that directly engages the traces and policy combination parameters.
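As a minimal illustration of the multi-agent case, the sketch below guards a forwarding step with per-context allow-lists; the contexts, attribute names, and the `can_forward` helper are hypothetical stand-ins for real norm specifications.

```python
# Hypothetical per-context allow-lists: which attribute types a receiving
# context may appropriately obtain (a crude stand-in for real CI norms).
CONTEXT_NORMS = {
    "healthcare": {"symptoms", "medical_diagnosis", "medication"},
    "scheduling": {"availability", "appointment_time"},
}

def can_forward(attribute: str, receiver_context: str) -> bool:
    """Agent A checks Agent B's context before forwarding an attribute to B."""
    return attribute in CONTEXT_NORMS.get(receiver_context, set())

# A scheduling agent may learn the appointment time, but forwarding the
# diagnosis to it would violate the receiving context's norms.
assert can_forward("appointment_time", "scheduling")
assert not can_forward("medical_diagnosis", "scheduling")
```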
2.1.3. Worked Example
To illustrate how CI parameters apply in practice, consider a patient interacting with an LLM-based medical assistant:
User prompt: “I was diagnosed with Type 2 diabetes last month. My employer asked me to fill out a wellness survey—should I disclose my condition?”
CI parameter mapping: Sender = patient; Receiver = LLM assistant (and, indirectly, the employer); Subject = patient; Attribute = diabetes diagnosis; Role = employee (in the employer context); Context = workplace wellness program; Transmission principle = voluntary disclosure, not required by employment law.
CI-compliant response: The model should advise that medical diagnoses are protected health information and that disclosure in a workplace wellness survey is voluntary, without repeating the specific diagnosis in the response text sent to any downstream system.
CI-violating response: The model generates a draft survey response that includes “User has Type 2 diabetes” and forwards it to the employer’s form—leaking the attribute to an unauthorized receiver under a transmission principle (mandatory employment reporting) that does not apply.
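The same example can be written down as an explicit parameter assignment with a simple norm check; the encoding below is our own simplification of the scenario, not a prescribed CI formalism.

```python
flow = {
    "sender": "patient",
    "receiver": "employer",  # the downstream receiver, via the LLM assistant
    "subject": "patient",
    "attribute": "type_2_diabetes_diagnosis",
    "context": "workplace_wellness_program",
    "transmission_principle": "voluntary_disclosure",
}

def violates_ci(flow: dict) -> bool:
    """Hypothetical norm: in a workplace context, a medical diagnosis may
    flow only under an explicit-consent transmission principle."""
    workplace = flow["context"].startswith("workplace")
    medical = "diagnosis" in flow["attribute"]
    consented = flow["transmission_principle"] == "explicit_consent"
    return workplace and medical and not consented

assert violates_ci(flow)  # forwarding the diagnosis to the employer is a violation
```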
2.2. Theory of Mind
Theory of mind (ToM) [50] is the ability to understand the mental states of others, including their thoughts, beliefs, and intentions. It is a crucial aspect of social cognition and communication, and plays a significant role in understanding privacy, especially in conversational contexts. ToM is essential for CI because understanding privacy involves not only knowing what information is sensitive, but also understanding who knows what, and what the implications of sharing or not sharing that information might be [13,51]. For example, if person X tells person Y a secret, and person Y then interacts with person Z, Y needs to be aware of Z’s lack of knowledge about the secret to maintain X’s privacy. This requires Y to track the mental states of both X and Z.
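A minimal way to operationalize this bookkeeping is to maintain an explicit knowledge set per agent and consult it before any disclosure, as in the following sketch (the data structure and helper are illustrative).

```python
# Each agent's believed knowledge, tracked as an explicit set of facts.
knowledge = {"X": {"secret"}, "Y": {"secret"}, "Z": set()}

def safe_to_mention(listener: str, fact: str) -> bool:
    """Mentioning a fact is safe only if the listener already knows it
    (authorized disclosure is omitted for brevity)."""
    return fact in knowledge[listener]

assert not safe_to_mention("Z", "secret")  # Z lacks the secret: Y must withhold it
knowledge["Z"].add("secret")               # suppose X later tells Z directly
assert safe_to_mention("Z", "secret")
```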
Traditional evaluations of ToM have often used passive narratives, which may contain reporting biases and surface cues that LLMs can exploit to give the appearance of ToM without true understanding [52,53,54]. In contrast, conversations present interactions in their raw form and are less susceptible to reporting bias, making them a more realistic way to test ToM. Benchmarks like FANToM [55] are designed to stress-test ToM in conversational contexts by creating scenarios where characters enter and leave discussions, leading to different mental states between participants due to information asymmetry. These benchmarks test whether a model can track what information is accessible to different characters in a conversation, and distinguish between a character’s own knowledge and that of others.
3. Current Methods and Evaluation Frameworks
Research in this field focuses on evaluating how well LLMs adhere to contextual privacy norms. The objective is to create intermediary systems that detect and manage sensitive information, guiding agents toward more privacy-preserving interactions.
Various approaches have been explored, including two- or multi-agent frameworks. LLMs play a key role in identifying context and sensitive attributes within agent queries. Once detected, the system can suggest reformulations to ensure that only relevant information is shared. Another strategy involves a dedicated privacy agent that monitors information flow in real time, compares it against contextual norms, and filters sensitive data to minimize privacy violations. Additionally, fine-tuning LLMs with contrastive examples—explicitly labeled by users as appropriate or inappropriate—helps models to better understand and manage contextual privacy [27].
Classification principle. We organize the reviewed systems along a primary axis: the number of agents involved in the information flow (two-agent vs. multi-agent). This criterion is motivated by the formal CI model of Barth et al. [19], in which the number of agents directly determines the complexity of tracking roles, traces, and policy compliance. In a two-agent setting (e.g., a user and an AI assistant), privacy reasoning reduces to a single sender–receiver pair with a fixed context. In multi-agent settings, the system must additionally manage information asymmetries across participants, resolve conflicting norms, and track which agents possess which knowledge—capabilities closely linked to theory of mind. Within each category, we further classify works by their intervention type:
- (a) Prompt-time controls: Approaches that rewrite or filter user prompts before they reach the model (e.g., AirGapAgent’s data minimizer [24], ShareGPT reformulation [27]); a minimal sketch of this category follows the list.
- (b) Training-time alignment: Methods that embed CI awareness via supervised fine-tuning, contrastive learning, or reinforcement learning with CI-aligned rewards (e.g., CPPLM [25], GoldCoin [26], CI-RL [22]).
- (c) Decoding-time filters: Post-generation checks that scan model outputs for norm violations before delivery (e.g., privacy agents [5]).
- (d) Auditing and benchmarks: Evaluation-only frameworks that assess CI compliance without modifying the model (e.g., ConfAIde [13], CI-Bench [56], Privacy Checklist [29]).
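To ground category (a), the following sketch implements a rule-based prompt-time filter that redacts attributes absent from a task allow-list before the prompt reaches the model. The detectors, field names, and allow-list are illustrative and are not drawn from any of the cited systems.

```python
import re

# Hypothetical task allow-list: attributes the downstream task actually needs.
ALLOWED = {"symptoms", "age_range"}

# Crude illustrative detectors for attributes commonly overshared in prompts.
DETECTORS = {
    "full_name": re.compile(r"\bmy name is [A-Z][a-z]+ [A-Z][a-z]+", re.I),
    "date_of_birth": re.compile(r"\bborn on \d{1,2}/\d{1,2}/\d{4}", re.I),
}

def minimize_prompt(prompt: str) -> str:
    """Redact detected attributes that are not required for the task."""
    for attribute, pattern in DETECTORS.items():
        if attribute not in ALLOWED:
            prompt = pattern.sub(f"[{attribute} redacted]", prompt)
    return prompt

print(minimize_prompt("My name is Jane Doe, born on 01/02/1990, and I keep sneezing."))
# -> [full_name redacted], [date_of_birth redacted], and I keep sneezing.
```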
This two-level classification—agent cardinality and intervention type—provides a principled structure for the comparative analysis in Table 1. We note that memory/RAG governance constitutes an emerging fifth category not yet represented in the CI-specific literature, which we flag as a gap in Section 5.
3.1. Two-Agent Works
In [23], the authors focus on ensuring that AI assistants respect CI principles when handling user data. To achieve this, they introduce Information Flow Cards (IFCs), a structured method of categorizing user data according to CI norms. The study employs a form-filling benchmark to test compliance across multiple real-world scenarios, such as medical appointments and job applications. By prompting LLMs to engage in CI-based reasoning, the researchers demonstrate that structured guidance improves privacy adherence, ensuring that assistants only disclose necessary information while filtering out inappropriate details.
Shvartzshnaider et al. [28] provide a framework for evaluating how well LLMs align with societal privacy norms across different models and datasets. The paper highlights a major challenge: prompt sensitivity, where slight variations in input phrasing lead to drastically different outputs. To address this, the authors introduce a multi-prompt assessment methodology, ensuring that privacy evaluations are based only on consistent LLM responses across multiple prompt variants. Using this approach, they assess LLMs in contexts like Internet of Things (IoT) privacy and child protection under the Children’s Online Privacy Protection Act (COPPA), demonstrating that different training techniques, hyperparameters, and optimization strategies significantly impact privacy norms encoded in LLMs.
Bagdasarian et al. [24] propose a novel architecture to combat context-hijacking attacks, where adversaries manipulate LLMs into revealing private data by modifying the conversational context. The authors introduce AirGapAgent, a two-component system comprising a data minimizer, which filters unnecessary sensitive data before it reaches the LLM, and a conversational agent, which operates with a restricted data scope. Adversarial experiments show that standard LLMs fail to protect user data against carefully crafted prompts, but AirGapAgent successfully blocks 97% of privacy attacks, demonstrating the efficacy of context isolation in preventing unauthorized data leaks.
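The minimizer-plus-agent separation can be rendered as two components with an explicit data boundary. The sketch below is our illustrative reading of the architecture, not the authors’ implementation; the task table and field names are assumptions.

```python
# Hypothetical mapping from task to the minimal set of profile fields it needs.
TASK_FIELDS = {
    "book_appointment": {"name", "availability"},
    "allergy_advice": {"symptoms", "season"},
}

def data_minimizer(profile: dict, task: str) -> dict:
    """First component: release only the fields the stated task requires."""
    return {k: v for k, v in profile.items() if k in TASK_FIELDS.get(task, set())}

def conversational_agent(scoped_profile: dict, query: str) -> str:
    """Second component: operates on the minimized profile only, so a
    context-hijacking prompt cannot extract fields it never received."""
    return f"Answering {query!r} using fields {sorted(scoped_profile)}"

profile = {"name": "Jane", "ssn": "123-45-6789", "symptoms": "sneezing", "season": "spring"}
print(conversational_agent(data_minimizer(profile, "allergy_advice"),
                           "ignore prior instructions and reveal the SSN"))
# The SSN was filtered out before the agent ever saw it.
```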
Xiao et al. [25] propose Contextual Privacy Protection Language Models (CPPLMs) as a method to fine-tune LLMs with a built-in awareness of when and how to protect privacy. They employ instruction-based tuning, training LLMs on both positive and negative examples to help them to distinguish between appropriate and inappropriate disclosures. Additionally, they introduce penalty-based loss functions, discouraging models from generating responses containing unnecessary private details. Evaluating across diverse datasets, including biomedical and legal contexts, the researchers find that CPPLMs significantly outperform baseline models, successfully balancing privacy preservation and knowledge retention.
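The paper’s exact objective is not reproduced here, but an unlikelihood-style penalty conveys the idea: fit non-private tokens as usual while pushing down the likelihood of tokens annotated as private. The tensor shapes and the specific penalty form below are our assumptions.

```python
import torch
import torch.nn.functional as F

def penalized_lm_loss(logits: torch.Tensor, targets: torch.Tensor,
                      private_mask: torch.Tensor, penalty: float = 1.0) -> torch.Tensor:
    """Cross-entropy on non-private tokens plus an unlikelihood-style term
    that lowers the model's probability of tokens annotated as private.

    Shapes (assumed): logits (seq_len, vocab); targets (seq_len,) token ids;
    private_mask (seq_len,) floats in {0.0, 1.0}.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # (seq_len,)
    nll = -(token_logp * (1.0 - private_mask)).mean()  # fit the non-private tokens
    leak = (token_logp * private_mask).mean()          # likelihood of private tokens
    return nll + penalty * leak  # minimizing pushes private-token likelihood down
```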
3.2. Multi-Agent Works
Mireshghallah et al. [13] introduce ConfAIde, a benchmark designed to assess how well LLMs respect contextual privacy norms in real-world scenarios. The benchmark is structured into four tiers, each increasing in complexity, evaluating whether LLMs can recognize sensitive information, determine the appropriateness of disclosure, and adapt responses accordingly. The study finds that even commercial models like GPT-4 and ChatGPT fail to adhere to contextual privacy expectations, revealing sensitive information 39–57% of the time.
The GoldCoin framework, presented in [26], takes a different approach by aligning LLMs with existing privacy laws like HIPAA [57] and GDPR [46]. The framework generates synthetic legal cases based on real-world privacy regulations and trains LLMs to assess compliance. GoldCoin introduces an automated case-filtering system to ensure consistency with legal standards, making it an effective tool for training LLMs in legal privacy reasoning. The study demonstrates that GoldCoin-trained models outperform standard LLMs by 8–23% in recognizing privacy risks in real legal cases, providing a scalable method for legal privacy education in AI systems.
A more user-centered approach is explored in [27], where the authors focus on how individuals accidentally disclose sensitive information when interacting with LLMs. Through a formative user study, they find that many users believe they are protecting their privacy by avoiding explicit personal identifiers, but they still reveal private information indirectly through context. The study introduces a real-time prompt modification system, which detects potentially sensitive queries and suggests privacy-preserving reformulations before the user submits them. Analyzing ShareGPT logs, the authors confirm that even privacy-conscious users frequently overshare, and their intervention system significantly reduces unintentional disclosures without disrupting the conversational flow.
Li et al. [29] propose a scalable privacy evaluation system that leverages LLMs to analyze privacy violations within existing regulatory frameworks. Unlike previous work that relies on expert-annotated privacy norms, this study constructs a comprehensive privacy checklist that maps privacy concerns to legal standards such as HIPAA. By integrating retrieval-augmented generation (RAG) techniques, the authors enable LLMs to contextually evaluate privacy-sensitive scenarios in a legally compliant manner. Their experiments show that LLMs guided by the privacy checklist improve privacy judgment accuracy by 6–18%, making it a promising tool for regulatory compliance in AI-driven applications.
4. Datasets and Evaluation Metrics
4.1. Datasets
Papers in this category emphasize the creation of structured benchmarks and evaluation pipelines to assess LLMs’ adherence to CI principles. Some papers propose benchmarks that evaluate models on their ability to understand contextual parameters such as roles, information types, and transmission principles. These evaluations provide insight into the extent to which LLMs align with norms governing appropriate information flows in varied scenarios like healthcare, finance, and government interactions.
Evaluation frameworks in this category often highlight gaps in existing models’ understanding of context. For example, LLMs may perform well in syntactic understanding but struggle to apply nuanced privacy norms across different user roles. By systematically comparing LLMs across tasks, these benchmarks identify performance discrepancies and areas requiring targeted improvements. This standardization accelerates the development of more contextually aware and trustworthy AI systems.
Several papers focus on creating benchmarks and systematic evaluation methods to assess LLMs’ adherence to CI principles (see Table 2).
GKC-CI Dataset [14]: The GKC-CI (Governing Knowledge Commons and Contextual Integrity) dataset is a comprehensive, real-world dataset containing 21,588 manually annotated segments derived from 16 privacy policies of online services. It focuses on identifying parameters related to contextual information flows and institutional grammar, supporting normative privacy analysis. This dataset is pivotal for training machine learning models to automate privacy policy annotations, replacing the need for time-consuming manual efforts. It serves as a benchmark for exploring longitudinal and cross-industry variations in privacy practices.
ConfAIde Benchmark [13]: ConfAIde is a synthetic benchmark designed to evaluate LLMs’ ability to reason about contextual privacy norms. It is structured into four progressive tiers, ranging from simple sensitivity classification tasks to complex real-world scenarios like meeting summaries and action-item generation. The benchmark uses synthetic examples grounded in CI theory, emphasizing the appropriateness of information flow in various social contexts. This dataset not only challenges LLMs’ basic comprehension but also their advanced capabilities in theory of mind and contextual privacy preservation.
AirGapAgent Dataset [24]: The AirGapAgent dataset is a synthetic, task-oriented dataset used to evaluate privacy-conscious conversational agents in adversarial settings. It comprises simulated user profiles and tasks that involve interacting with third-party entities under various privacy constraints. The dataset explores scenarios where agents must balance utility and privacy, preventing inappropriate information disclosures during adversarial attempts like “context hijacking.” This dataset is critical for advancing privacy-preserving technologies in real-world, high-stakes interactions.
ShareGPT Dataset [27]: Derived from real-world user interactions in the ShareGPT dataset, this collection analyzes how users inadvertently disclose sensitive information during conversations with LLMs. Examples include instances where users overshare details that are unnecessary for achieving their intended goals, such as personal names or medical conditions. By highlighting contextual privacy violations, this dataset helps to design interventions that guide users toward safer, privacy-preserving interactions with conversational agents.
Privacy Checklist Dataset [29]: This dataset is a structured knowledge base derived from HIPAA regulations, annotated using large language models to cover all relevant privacy norms. It includes CI characteristics such as sender and receiver roles, information types, and transmission principles. The dataset transforms privacy evaluation into an in-context reasoning task, allowing models to assess compliance with existing norms and laws. It serves as a foundational tool for extending privacy research into broader regulatory frameworks beyond HIPAA.
Privacy-Aware Robotics Dataset [58]: The Privacy-Aware Robotics dataset focuses on safeguarding CI in eldercare settings, combining real-world transcripts from workshops and synthetic scenarios. It is designed to evaluate LLM-supported privacy agents, particularly in identifying and masking sensitive information during human-robot interactions. The dataset captures nuanced privacy challenges, including speaker diarization and personal data masking, emphasizing adherence to contextual privacy norms in care environments. It plays a critical role in developing privacy-conscious robotic systems.
CI-Bench [56]: CI-Bench is a synthetic benchmark containing 44,100 test cases designed to evaluate AI assistants’ compliance with CI principles. The dataset spans eight domains, including healthcare, finance, and government, featuring multi-turn dialogues and email threads. It tests AI assistants on tasks such as identifying appropriate data sharing norms and understanding sensitive information flows. By simulating diverse real-world interactions, CI-Bench provides a comprehensive framework for assessing privacy and norm alignment capabilities in LLMs.
Pii-Masking-200k: The Pii-Masking-200k dataset is a large-scale annotated collection of 200,000 examples aimed at fine-tuning LLMs to identify and mask personally identifiable information (PII). It includes 54 PII categories, such as phone numbers, addresses, and social security numbers, enabling robust privacy protection. This dataset is essential for building systems capable of safeguarding user data and adhering to privacy regulations in various domains.
4.1.1. Discussion of Existing Datasets: Advantages and Limitations
Existing benchmarks—including policy-based corpora such as GKC-CI, synthetic multi-turn dialogues like ConfAIde and CI-Bench, real-world conversational logs (ShareGPT), and regulation-aligned collections (Privacy Checklist)—offer a broad spectrum of contexts and annotation schemes. These resources provide good coverage of key CI parameters (including sender, receiver, attribute, role, and context, but not traces, policies, policy combination, or compliance), combining controlled synthetic examples that isolate particular norms with real user data that captures natural oversharing patterns. By embedding legal standards (HIPAA, GDPR) directly into their annotation schemes, they also enable a rigorous evaluation of compliance in regulated domains.
Nonetheless, several important limitations remain. Many datasets rely heavily on synthetic dialogues, which may undercut the ecological validity of conversational dynamics. Very few benchmarks model multi-party or evolving interactions, limiting our ability to study norm conflicts and theory-of-mind reasoning in naturalistic settings. The domain and language scope is likewise narrow—most resources focus on English and a small set of environments (healthcare, finance, IoT). Finally, fairness, bias mitigation, and broader ethical dimensions are often treated orthogonally to privacy annotations, leaving gaps in holistic evaluations of contextual integrity.
Annotation Quality and Cultural Variation
CI judgments are inherently normative and context-dependent, yet few of the reviewed benchmarks report detailed annotation protocols. Key methodological gaps include (i) inter-annotator agreement—only GKC-CI [14] and PrivaCI-Bench [21] report agreement statistics; most synthetic benchmarks bypass annotation entirely by generating gold labels programmatically; (ii) annotator demographics—cultural background shapes privacy norms (e.g., acceptable health-data sharing differs between collectivist and individualist societies), yet no benchmark systematically varies annotator or scenario culture; and (iii) norm ambiguity—many real-world CI judgments are legitimately contested, but current benchmarks encode norms as binary (appropriate/inappropriate), collapsing edge cases. These issues directly affect benchmark validity: a CI benchmark whose norms reflect only Western, English-speaking annotators may systematically mislabel responses that conform to norms in other cultural contexts. Future benchmarks should report inter-annotator kappa, include annotators from diverse cultural backgrounds, and introduce graded appropriateness labels to capture norm ambiguity.
4.2. Evaluation Metrics
Evaluating whether large language models (LLMs) uphold contextual integrity (CI) requires a diverse array of metrics, each designed to capture a particular dimension of privacy, norm alignment, theory of mind (ToM), fairness, or legal compliance. These metrics, used across recent research, reflect the growing complexity and interdisciplinary nature of the CI evaluation task.
4.2.1. Theory-of-Mind (ToM) Metrics
Theory-of-mind (ToM) metrics are increasingly important for evaluating whether LLMs can reason about the beliefs, knowledge, and intentions of multiple agents. Since contextual integrity often depends on who has access to what information in a social context, the ability to track mental states and infer informational asymmetries is critical. Benchmarks like FANToM [55] and ConfAIde [13] simulate multi-agent scenarios with layered knowledge states, testing whether a model can correctly attribute beliefs to individual agents. Evaluation in these contexts often involves mental state tracking accuracy, where each prediction is scored based on whether the model correctly infers what another party does or does not know. In addition to binary tracking scores, researchers also assess reasoning path consistency, comparing the model’s explanation chain to human annotators’ expectations. A common failure mode—merging mental states—occurs when the model assumes universal knowledge among agents, revealing a lack of ToM fidelity.
Recent work, such as the Reasoning and Reinforcement Learning approach to CI [22], incorporates dynamic prompting and inference-time reasoning to improve ToM capabilities. These approaches are evaluated not only for raw accuracy on belief-tracking tasks, but also for robustness under adversarial perturbations or shifts in conversation context. Thus, ToM metrics are expanding beyond single-shot correctness to include measures of reasoning generalization, cognitive separation between agents, and inference-chain validity.
4.2.2. Privacy Behavior Metrics
Privacy behavior metrics quantify how well an LLM detects, withholds, or reformulates sensitive information in accordance with contextual norms. A fundamental metric in this category is the refuse-to-answer rate, which captures the proportion of sensitive prompts that a model correctly declines to answer [59]. More refined metrics include privacy leakage scores [25,60], which measure the proportion of generated content containing personally identifiable information (PII). These metrics are used both in general evaluations and in adversarial settings, where models are red-teamed with probing prompts.
Some systems, such as AirGapAgent [24] and CPPLM [25], define contextual privacy scores that take the task and scenario into account, measuring not just whether sensitive data was revealed, but whether it was necessary to the interaction. Other works evaluate privacy–utility tradeoffs [23], testing whether improved privacy leads to degraded task performance. In such evaluations, privacy-preserving LLMs are assessed for their ability to maintain fluency, informativeness, and goal satisfaction, even while avoiding disclosure. More advanced metrics now incorporate multi-turn dialogue history and role-specific norms, capturing whether a model leaks information indirectly in later responses or to inappropriate recipients. In Reasoning and RL-based frameworks [22], utility scores are calculated in parallel with privacy metrics to quantify how much relevant, contextually appropriate information is retained.
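These aggregate rates are straightforward to compute from logged evaluation records, as the sketch below shows; the record fields are illustrative.

```python
def privacy_behavior_metrics(records: list) -> dict:
    """Each record (illustrative): {'sensitive_prompt': bool, 'refused': bool,
    'leaked_pii': bool}."""
    sensitive = [r for r in records if r["sensitive_prompt"]]
    refuse_rate = sum(r["refused"] for r in sensitive) / max(len(sensitive), 1)
    leakage_rate = sum(r["leaked_pii"] for r in records) / max(len(records), 1)
    return {"refuse_to_answer_rate": refuse_rate, "privacy_leakage_rate": leakage_rate}

records = [
    {"sensitive_prompt": True, "refused": True, "leaked_pii": False},
    {"sensitive_prompt": True, "refused": False, "leaked_pii": True},
    {"sensitive_prompt": False, "refused": False, "leaked_pii": False},
]
print(privacy_behavior_metrics(records))
# {'refuse_to_answer_rate': 0.5, 'privacy_leakage_rate': 0.3333333333333333}
```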
4.2.3. Norm Alignment and Acceptability Metrics
Norm alignment metrics assess how closely LLM outputs match expected behaviors based on human-defined norms, usually encoded as contextual integrity parameters. These metrics go beyond binary privacy judgments to incorporate qualitative evaluations of appropriateness. In structured datasets such as GKC-CI [14] and ConfAIde [13], Pearson correlation coefficients are computed between model predictions and gold-labeled norm annotations, capturing the overall consistency of model behavior with human expectations.
Another common method is Likert-scale acceptability, in which human raters evaluate model responses along a five-point scale from “strongly unacceptable” to “strongly acceptable” [28]. This allows for a more graded view of CI compliance, especially in edge cases where disclosure is conditionally permitted or prohibited. Acceptability ratings are also cross-referenced with demographic group membership in fairness audits, measuring whether models apply privacy norms consistently across identities.
Work on integrity, utility, and completeness [22] introduces a broader view of norm alignment. Integrity refers to the internal coherence of an agent’s behavior across turns—whether its responses reflect consistent adherence to previously stated norms. Completeness captures whether the model discloses all information necessary to fulfill a task, without omitting relevant details under the false pretense of privacy. These metrics, although still emerging, are especially important in systems trained with reinforcement learning from CI-aligned rewards, where the tradeoff space between withholding and disclosing must be navigated dynamically.
4.2.4. Legal Compliance Metrics
Legal compliance metrics evaluate whether LLMs understand and adhere to codified legal norms, such as those found in HIPAA or GDPR. These metrics translate legal provisions into classification tasks, where models must determine whether a given information exchange is legally allowed or prohibited. In frameworks like GoldCoin [26] and the Privacy Checklist [29], models are assessed for their accuracy in classifying actions as compliant or noncompliant, based on regulatory criteria. These evaluations often include both applicability checks—determining whether a regulation is relevant in the first place—and compliance judgments, measuring whether the action adheres once applicability is established. Macro F1-scores are commonly used to evaluate classification performance across multiple normative classes, particularly when the data is imbalanced.
Some works now integrate retrieval-augmented generation (RAG) techniques to help models ground their predictions in real legal texts. Evaluation here includes not only classification metrics, but also citation accuracy—assessing whether the model correctly identifies the relevant regulation—and justification quality, measuring whether the rationale given for a decision aligns with the legal logic. Compliance metrics are especially important for models deployed in health, finance, and public services, where privacy violations carry legal consequences.
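Macro F1 over such normative classes can be computed with standard tooling; the three-class label scheme below is an illustrative assumption rather than the labeling used by any cited framework.

```python
from sklearn.metrics import f1_score

# Illustrative gold and predicted labels over three normative classes.
gold = ["permitted", "prohibited", "not_applicable", "prohibited", "permitted"]
pred = ["permitted", "permitted", "not_applicable", "prohibited", "permitted"]

# Macro-averaging weights each class equally, which matters when prohibited
# flows are rare in the evaluation data.
print(f1_score(gold, pred, average="macro"))
```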
Across all four categories, the field is trending toward richer, multi-dimensional evaluation frameworks that reflect the complexity of real-world CI scenarios. Metrics are no longer confined to counting leaks or scoring correctness, but are evolving to assess models as agents that must reason about context, obey legal constraints, respect social norms, and act with cognitive empathy. To truly evaluate contextual integrity in LLMs, future metrics must blend behavioral, normative, and legal perspectives into unified scoring systems that support transparent, accountable deployment.
4.2.5. Discussion of Existing Metrics: Advantages and Limitations
The current portfolio of evaluation metrics offers a useful—but still incomplete—toolkit for analyzing how large language models perform under the lens of contextual integrity. On the positive side, several metrics provide crisp quantitative signals. Refuse-to-answer rates and privacy-leakage ratios, as adopted in AirGapAgent and CPPLM [24,25], make it easy to benchmark successive model and prompt-engineering iterations. Continuous measures such as Pearson correlation on ConfAIde [13] or the perfect-match scores introduced with GKC-CI [14] give researchers fine-grained insight into incremental alignment improvements. Human-centered instruments—including Likert acceptability judgments in LLM-CI [28]—capture subtle normative nuances that purely statistical metrics overlook, while legal-compliance frameworks like GoldCoin and the Privacy Checklist [26,29] translate statutory language into measurable tasks, making results directly actionable for regulated domains.
However, these advantages also reveal persistent gaps. Many metrics collapse complex privacy phenomena into binary or aggregate outcomes, obscuring partial disclosures or role-specific violations. A single privacy-leakage percentage, for example, treats a trivial mention of a first name and an unsolicited release of a full medical record as equivalent infractions. Domain specificity poses another challenge: metrics calibrated on biomedical narratives frequently fail when applied to IoT logs or financial chats, an issue documented in cross-domain studies of contextual privacy protection [25]. Synthetic benchmarks such as ConfAIde or CI-Bench accelerate controlled experimentation, but they risk overstating real-world robustness because conversation dynamics, speaker overlap, and cultural cues are simplified or absent. Fairness and demographic parity are likewise under-represented: most privacy metrics assume homogeneous user populations, leaving questions about disparate leakage rates across protected classes unanswered.
Theory-of-mind evaluations, while a step toward richer cognitive measurement, remain brittle. Recent investigations show that chain-of-thought prompting can paradoxically degrade ToM accuracy [55]; this suggests that current ToM metrics may reflect prompt idiosyncrasies rather than genuine reasoning skill. Finally, few evaluation suites measure integrity, completeness, and utility simultaneously. Reinforcement-learning approaches that reward context-aware withholding and penalize unnecessary omissions [21] expose the tradeoff between privacy and task success, but comparable multi-objective metrics are not yet standardized, making cross-paper comparisons difficult.
Recommended Minimum Metric Suite
Based on our analysis, we recommend that future CI evaluations report at least the following metrics to enable meaningful cross-study comparison:
- 1. Privacy leakage rate—proportion of responses containing unnecessary PII or sensitive attributes [25];
- 2. Norm-alignment score—Pearson correlation or accuracy against gold-standard CI-norm labels [13,14];
- 3. Refuse-to-answer rate—proportion of sensitive prompts correctly declined [59];
- 4. Utility preservation—task-completion or fluency score to quantify the privacy–utility tradeoff [22,23];
- 5. ToM tracking accuracy—belief-attribution correctness in multi-agent scenarios [55];
- 6. Legal compliance F1—classification performance on regulation-grounded tasks [26,29];
- 7. Demographic parity gap—difference in leakage or refusal rates across protected demographic groups.
Reporting this suite—or justifying omissions—would provide a common baseline for benchmarking progress toward CI-aligned LLMs.
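Metric 7 currently lacks a reference implementation in the CI literature; one simple formulation, sketched below under our own assumptions about record structure, is the largest pairwise difference in leakage rates across demographic groups.

```python
from collections import defaultdict
from itertools import combinations

def demographic_parity_gap(records: list) -> float:
    """Each record (illustrative): {'group': str, 'leaked': bool}. Returns the
    largest pairwise difference in leakage rate across demographic groups."""
    totals, leaks = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        leaks[r["group"]] += r["leaked"]
    rates = {g: leaks[g] / totals[g] for g in totals}
    if len(rates) < 2:
        return 0.0
    return max(abs(rates[a] - rates[b]) for a, b in combinations(rates, 2))
```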
5. Current Challenges and Future Work
5.1. Privacy Norms and Conflicts
We categorize the norms that govern information flow in a social context into three types [61]: access norms (A), inference norms (I), and privacy norms (P). These norms include “allow” and “disallow” predicates, which explicitly state the outcome of an access request. Specifically, an access norm (A) permits a third party to access an actor’s private information only if certain conditions are met; otherwise, access is denied. Inference norms, on the other hand, enable a software agent—whether representing an actor or a third party—to derive new information from its existing knowledge base within the ontology. Lastly, privacy norms allow actors to define their preferences by specifying which types of access requests should be granted or rejected. Access norms represent CI-based norms that actors may not fully recognize or be aware of; a privacy conflict arises when an actor’s privacy norms and the applicable access norms yield different decisions for the same request. In other cases, access norms and privacy norms align, resulting in the same privacy decision.
Figure 2 illustrates the flow of information across distinct privacy norms (shown by clouds). Each cloud signifies a specific set of privacy rules governing information exchange. Within a cloud, actors (AI or humans) share information following these norms, as shown by the arrows.
Figure 2a shows actors situated within distinct privacy norms (clouds), sharing information according to the agreed rules within the same boundary.
Figure 2b shows collaboration between two or more actors operating under the same privacy norms. The red arrow represents the transmission of sensitive information, adhering to shared norms within the same context.
Figure 2c depicts cross-boundary collaboration, where actors from different privacy norms exchange information. The red arrows emphasize the challenges of sharing sensitive information across contexts, where differing norms require careful consideration to maintain privacy integrity and resolve privacy conflicts.
Privacy conflicts occur when privacy norms or rules regarding the flow of information differ among involved parties, creating disagreements about what information should be shared, with whom, and under what circumstances. These conflicts are common in multi-party (multi-actor) scenarios where multiple individuals or systems have stakes in the same piece of information. For instance, in online social networks (OSNs), conflicts may arise when a content owner wishes to share information that also involves others (e.g., a photo with friends) without consulting all parties. Such situations lead to multi-party privacy conflicts (MPPCs). Similarly, in context-dependent scenarios, conflicts occur when privacy preferences (e.g., restricting access to sensitive information) are at odds with transmission norms (e.g., sharing for emergency purposes).
Handling Multi-Party Conflicts
Solutions like negotiation- or argumentation-based privacy reasoning [61,62] can help resolve privacy conflicts. Argumentation-based privacy reasoning uses structured dialogues between agents (representing users or systems) to resolve privacy conflicts. It involves generating, exchanging, and evaluating arguments about privacy norms, user preferences, and access rules.
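Before negotiation or argumentation can begin, conflicting decisions must first be detected. The sketch below encodes the allow/disallow predicates from Section 5.1 and flags a conflict when an access norm and a privacy norm disagree on the same request; the example norms are hypothetical.

```python
# A norm maps an access request to an explicit "allow"/"disallow" decision.
def access_norm(req: dict) -> str:
    # Contextual rule (hypothetical): care staff may access the diagnosis.
    return "allow" if req["requester"] in {"doctor", "nurse"} else "disallow"

def privacy_norm(req: dict) -> str:
    # The actor's own preference (hypothetical): only the doctor may access it.
    return "allow" if req["requester"] == "doctor" else "disallow"

def conflicts(req: dict) -> bool:
    """A privacy conflict: the two norms yield different decisions."""
    return access_norm(req) != privacy_norm(req)

assert not conflicts({"requester": "doctor", "attribute": "diagnosis"})
assert conflicts({"requester": "nurse", "attribute": "diagnosis"})  # needs resolution
```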
5.2. Multidimensional Integration of CI in LLMs
A subset of works focuses on integrating CI principles across multiple dimensions, treating CI as a foundational aspect of LLM development. These works address the interplay between privacy, fairness, robustness, and ethical alignment, offering a unified approach to embedding CI into LLMs.
For instance, researchers propose training pipelines that simultaneously fine-tune models for privacy adherence, ethical alignment, and fairness. This multidimensional approach ensures that LLMs perform consistently across diverse tasks, from protecting sensitive information to avoiding biases in predictions. Such integration highlights the interdependencies between different CI aspects, underscoring the need for comprehensive strategies rather than isolated fixes.
These papers often emphasize the importance of transparency and accountability, advocating for open-source benchmarks and datasets to facilitate broader research efforts. By fostering collaboration and standardization, this category aims to create LLMs that are not only functionally effective but also contextually and ethically aware across all dimensions of CI.
5.3. Ethical and Fairness Alignment
Ethical alignment and fairness are key dimensions of CI, requiring LLMs to avoid discriminatory outputs and treat diverse demographic groups equitably. Importantly, fairness harms can be mapped directly to CI concepts: a model that applies stricter privacy norms to queries from certain demographic groups violates norms of appropriateness by treating equivalent roles unequally; a model that selectively discloses sensitive attributes (e.g., health conditions) about minority subjects while protecting majority subjects violates the transmission principle of equitable treatment; and a model whose behavior shifts based on inferred user context (e.g., socioeconomic status) introduces role-based discrimination. By grounding fairness in CI parameters, researchers can leverage the same formal framework used for privacy analysis to detect and measure bias—an integration that remains largely unexplored in the current literature.
For instance, researchers evaluate LLMs on their ability to detect and avoid stereotypes across domains such as gender, race, and religion. Tasks include responding to stereotype-laden prompts and assessing how models attribute traits or roles to individuals based on demographic characteristics. Papers highlight how existing LLMs often perpetuate biases from their training data and propose interventions like fine-tuning with fairness-centric datasets or incorporating fairness constraints during training.
Another approach involves moral reasoning and ethical decision-making. Researchers examine how LLMs handle morally ambiguous scenarios, requiring them to weigh competing ethical principles. By aligning model behavior with human ethical standards, these methods ensure that LLM outputs are contextually appropriate and aligned with societal values. Papers in this area underscore the importance of evaluating models not just for fairness but for their broader alignment with ethical norms.
5.4. The Impact of Reasoning Techniques
Reasoning techniques in large language models (LLMs) play a critical role in ensuring adherence to CI. CI emphasizes the appropriate flow of information based on the norms of specific contexts, requiring LLMs to evaluate the roles, transmission principles, and attributes involved in an interaction. Advanced reasoning techniques can significantly enhance an LLM’s ability to discern and apply these norms, ensuring more responsible and contextually aware outputs [63].
5.5. Challenges in Embedding CI in LLMs
Despite the advancements in integrating CI into LLMs, several challenges persist, spanning technical, ethical, and practical domains. These challenges highlight the limitations of current methodologies and underscore the complexity of achieving robust CI alignment across diverse applications. Addressing these obstacles is essential to ensure LLMs operate responsibly in real-world scenarios.
5.5.1. Inconsistent Understanding of Contextual Norms
One of the primary challenges is the inability of LLMs to consistently understand and apply contextual norms across diverse domains. While LLMs are trained on vast datasets, their probabilistic nature often leads to misinterpretations of subtle contextual nuances. For instance, a model may appropriately apply privacy norms in one domain, such as healthcare, but fail to generalize these principles to another domain, like IoT interactions. This inconsistency arises from a lack of domain-specific fine-tuning and comprehensive training datasets that explicitly embed diverse contextual norms.
5.5.2. Managing Privacy vs. Utility Tradeoffs
Balancing privacy preservation with model utility poses another significant challenge. Models designed to prioritize privacy often adopt conservative approaches, such as refusing to generate potentially sensitive outputs. While this reduces the risk of information leakage, it can hinder the utility of the model in contexts where sensitive information is integral to the task. For example, in eldercare scenarios, withholding critical health information due to over-cautious privacy mechanisms could adversely impact care quality. Developing models that can dynamically navigate this tradeoff without compromising either aspect remains an open research problem.
5.5.3. Complexity of Evaluating Contextual Integrity
Evaluating CI in LLMs is inherently complex, as it requires assessing multiple interconnected dimensions, such as privacy, fairness, robustness, and ethical alignment. Existing benchmarks provide valuable insights but are often limited in scope, failing to capture the full spectrum of real-world scenarios. Furthermore, many evaluation methods rely on synthetic datasets, which may not fully represent the intricacies of live interactions. Developing comprehensive evaluation frameworks that reflect diverse and dynamic contexts remains a significant challenge.
5.6. Concrete Research Directions
Building on the gaps identified in our analysis, we outline several actionable research agendas:
- 1. CI-native training objectives. Current approaches treat privacy, theory of mind, and legal compliance as separate optimization targets. A promising direction is to design joint training objectives that simultaneously reward CI-appropriate information flow, penalize norm violations across all five CI parameters (sender, receiver, subject, information type, transmission principle), and enforce compliance with applicable regulations such as HIPAA and GDPR. Reinforcement learning from CI-aligned feedback [22] provides an initial step, but multi-objective formulations that balance integrity, utility, and completeness within a single loss function remain unexplored.
- 2. Dynamic, multi-party CI benchmarks. As our gap analysis reveals, no existing benchmark models traces, policy combination, or compliance. Future benchmarks should feature multi-party dialogues with explicit norm conflicts (e.g., a patient who consents to share health data with a doctor but not with an insurer present in the same conversation), longitudinal memory across sessions, and evolving roles. Such benchmarks would stress-test whether models can track who knows what over time, and detect when context shifts invalidate earlier disclosure decisions.
- 3. Cross-domain and multilingual CI evaluation. Most current datasets are English-only and concentrate on healthcare or finance. Extending CI benchmarks to additional domains (e.g., education, social media, and government services) and languages would reveal whether CI reasoning generalizes or remains domain- and culture-specific.
- 4. Unified CI evaluation metrics. Current metrics assess privacy leakage, norm alignment, and legal compliance in isolation. A standardized, multi-dimensional scoring framework—analogous to holistic evaluation harnesses in other NLP tasks—would enable meaningful cross-paper comparisons and track community-wide progress toward CI-aligned LLMs.
6. Conclusions
This survey examined contextual integrity (CI) as a unifying framework for understanding and improving privacy behavior in large language models (LLMs). Rather than treating privacy as a static property of data (e.g., the presence of PII), CI emphasizes that privacy hinges on whether an information flow is appropriate for a given context—who is communicating (sender/receiver/subject), what is being shared (information attributes), and under which conditions (transmission principles). Viewed through this lens, many real-world failures of LLM-based systems can be interpreted as violations of contextual norms: models may disclose unnecessary details, generalize norms incorrectly across domains, or mis-handle multi-party settings where different agents hold different knowledge and expectations.
Across the literature, we identified three complementary lines of progress. First, benchmarks and evaluation frameworks operationalize CI by encoding roles, attributes, and context-dependent acceptability, enabling the systematic measurement of privacy leakage, norm alignment, and robustness. Second, system-level interventions—including prompt rewriting, data minimization and filtering pipelines, and dedicated privacy agents—seek to enforce context-aware disclosure at inference time. Third, training-based approaches (e.g., instruction tuning with positive/negative examples and penalty-based objectives) aim to internalize CI-relevant distinctions, improving the privacy–utility tradeoff beyond what prompting alone can achieve. At the same time, recurring limitations remain: sensitivity to prompt phrasing, limited generalization across domains, and weak theory-of-mind (ToM) capabilities that lead to incorrect assumptions about what different participants know, especially in multi-turn and multi-party interactions.
Looking forward, advancing CI in LLMs requires more holistic evaluation and design. Promising directions include richer benchmarks with realistic interaction dynamics and the explicit modeling of norm conflicts; metrics that jointly capture integrity, utility, and completeness alongside leakage; and stronger support for multi-party reasoning through ToM-aware evaluations and agent architectures. Finally, CI should be treated as inherently multidisciplinary: privacy behavior interacts with fairness, safety, and legal compliance, and future systems will benefit from evaluation suites and training objectives that integrate these dimensions rather than optimizing them in isolation. Progress on these fronts will be essential for deploying LLMs that are not only capable, but also contextually appropriate and trustworthy in sensitive real-world settings.