Neighborhood Deviation Attack Against In-Context Learning
Abstract
1. Introduction
- We propose a text-only MIA method against ICL, revealing its inherent privacy risks. Our method combines neighbor-sample generation with MIA techniques to achieve higher stability and accuracy.
- We conduct experiments on multiple datasets and LLMs, demonstrating that our approach effectively determines whether a specific sample belongs to the ICL examples used by the LLM.
- We conduct an extensive study of the factors that affect attack performance, including the hyperparameters of the attack method and the architecture of the attack model, and analyze the likely causes of these effects.
- We explore potential defense strategies to mitigate the privacy risks of LLMs in ICL, including reducing the LLM's memorization of ICL examples and limiting the deviation revealed by its outputs.
2. Related Work
2.1. Membership Inference Attack Against Machine Learning
2.2. Membership Inference Attack Against Large Language Model
2.3. In-Context Learning
3. Preliminaries
3.1. Large Language Model
3.2. In-Context Learning
- “Review: Delicious food! Sentiment: Positive”.
- “Review: The food is awful. Sentiment: Negative”.
- “Review: Terrible dishes! Sentiment: Negative”.
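Such demonstrations are concatenated, together with the query to be classified, into a single prompt string. The following minimal sketch shows one plausible way to assemble such a prompt; `build_icl_prompt` and the exact formatting are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: assembling a sentiment-classification ICL prompt.
# build_icl_prompt is a hypothetical helper, not taken from the paper.

demonstrations = [
    ("Delicious food!", "Positive"),
    ("The food is awful.", "Negative"),
    ("Terrible dishes!", "Negative"),
]

def build_icl_prompt(demos, query):
    lines = [f"Review: {text} Sentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query} Sentiment:")  # the model completes the label
    return "\n".join(lines)

print(build_icl_prompt(demonstrations, "Great service and tasty meals."))
```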
3.3. Membership Inference Attack
4. Proposed Algorithm
4.1. Method Overview
Algorithm 1: Neighborhood Deviation Attack.
Input: target sample, LLM, attack model, number of neighbors N, metric calculation function.
Output: membership status (Member or Non-Member).
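Below is a minimal sketch of the pipeline Algorithm 1 describes, assuming the target sample is split into a prefix and a reference suffix (the prefix proportion is a hyperparameter, cf. Section 5.5.2), completions are scored against the suffix, and a simple threshold stands in for the trained attack model. `llm_complete`, `generate_neighbors`, and the threshold value are illustrative assumptions.

```python
import numpy as np

def neighborhood_deviation_attack(target, llm_complete, generate_neighbors,
                                  metric, n_neighbors=10, threshold=0.1):
    """Sketch of Algorithm 1: classify one sample as Member or Non-Member."""
    # 1. Split the target sample into a prefix (fed to the LLM) and the
    #    true suffix used as a reference for scoring completions.
    cut = len(target) // 2
    prefix, reference = target[:cut], target[cut:]

    # 2. Score the LLM's completion of the original prefix against the suffix.
    target_score = metric(llm_complete(prefix), reference)

    # 3. Score completions of N perturbed neighbor prefixes the same way.
    neighbor_scores = [metric(llm_complete(p), reference)
                       for p in generate_neighbors(prefix, n_neighbors)]

    # 4. Deviation: how far the target's score departs from its neighborhood.
    deviation = target_score - float(np.mean(neighbor_scores))

    # 5. A sample the LLM has seen among its ICL examples tends to be completed
    #    unusually faithfully; a fixed threshold stands in here for the paper's
    #    trained attack model.
    return "Member" if deviation > threshold else "Non-Member"
```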
4.2. Neighbor Prefix Generation
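One plausible instantiation of neighbor prefix generation, in the spirit of the BERT-based lexical substitution cited above [43], masks individual words of the prefix and substitutes suggestions from a masked language model. The model choice and loop structure here are illustrative assumptions; the paper additionally varies the number of replaced words per neighbor (Section 5.5.3).

```python
from transformers import pipeline

# Masked-LM fill-in pipeline; using BERT is an assumption consistent with the
# BERT-based lexical substitution work cited in the references [43].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def generate_neighbors(prefix, n_neighbors):
    """Create neighbor prefixes by replacing one word with a masked-LM suggestion."""
    words = prefix.split()
    neighbors = []
    for i in range(min(n_neighbors, len(words))):
        pos = i % len(words)  # cycle through positions in the prefix
        masked = words.copy()
        masked[pos] = fill_mask.tokenizer.mask_token
        replacement = words[pos]  # fall back to the original word
        # Take the top suggestion that differs from the original word.
        for suggestion in fill_mask(" ".join(masked)):
            token = suggestion["token_str"].strip()
            if token.lower() != words[pos].lower():
                replacement = token
                break
        candidate = words.copy()
        candidate[pos] = replacement
        neighbors.append(" ".join(candidate))
    return neighbors
```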
4.3. Predictive Completion
4.4. Metric Deviation Calculation
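Assuming the quantities sketched above, the deviation score admits a natural formalization; the notation here is ours rather than the paper's. Let $\mathcal{M}$ denote the target LLM, $x_{\mathrm{pre}}$ and $x_{\mathrm{suf}}$ the prefix and true suffix of the target sample $x$, $\tilde{x}^{(i)}_{\mathrm{pre}}$ the $i$-th neighbor prefix, and $f$ a text-similarity metric such as BLEU [52] or ROUGE [54]:

$$\Delta(x) \;=\; f\!\left(\mathcal{M}(x_{\mathrm{pre}}),\, x_{\mathrm{suf}}\right) \;-\; \frac{1}{N}\sum_{i=1}^{N} f\!\left(\mathcal{M}\!\left(\tilde{x}^{(i)}_{\mathrm{pre}}\right),\, x_{\mathrm{suf}}\right)$$

A markedly positive $\Delta(x)$ indicates the LLM completes the original prefix unusually well relative to its neighborhood, hinting that $x$ appeared among the ICL examples.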
4.5. Membership Determination
4.6. Workflow Example
5. Experiments
5.1. Experimental Basis
5.2. Evaluation Metrics
5.3. Comparative Methods
5.4. Results
5.5. Ablation Experiments
5.5.1. Number of Neighbors
5.5.2. Prefix Proportion
5.5.3. Number of Replacements
5.5.4. Attack Model
5.5.5. Distribution of Metric Deviation
6. Potential Defenses
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
2. Hadi, M.U.; Al-Tashi, Q.; Qureshi, R.; Shah, A.; Muneer, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; et al. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Prepr. 2023, 3.
3. Rao, A.; Kim, J.; Kamineni, M.; Pang, M.; Lie, W.; Succi, M.D. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv 2023.
4. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
5. Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274.
6. Liu, X.Y.; Wang, G.; Yang, H.; Zha, D. FinGPT: Democratizing Internet-scale Data for Financial Large Language Models. In Proceedings of the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, New Orleans, LA, USA, 15 December 2023.
7. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564.
8. Wang, X.; Anwer, N.; Dai, Y.; Liu, A. ChatGPT for design, manufacturing, and education. Procedia CIRP 2023, 119, 7–14.
9. Fraiwan, M.; Khasawneh, N. A review of ChatGPT applications in education, marketing, software engineering, and healthcare: Benefits, drawbacks, and research directions. arXiv 2023, arXiv:2305.00237.
10. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A survey on in-context learning. arXiv 2022, arXiv:2301.00234.
11. Duan, H.; Dziedzic, A.; Yaghini, M.; Papernot, N.; Boenisch, F. On the privacy risk of in-context learning. arXiv 2024, arXiv:2411.10512.
12. Ren, K.; Meng, Q.R.; Yan, S.K.; Qin, Z. Survey of artificial intelligence data security and privacy protection. Chin. J. Netw. Inf. Secur. 2021, 7, 1–10.
13. Hu, H.; Salcic, Z.; Sun, L.; Dobbie, G.; Yu, P.S.; Zhang, X. Membership inference attacks on machine learning: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 235.
14. Huang, W.; Wang, Y.; Chen, C. Privacy Evaluation Benchmarks for NLP Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 2615–2636.
15. Das, B.C.; Amini, M.H.; Wu, Y. Security and privacy challenges of large language models: A survey. ACM Comput. Surv. 2025, 57, 152.
16. Ko, M.; Jin, M.; Wang, C.; Jia, R. Practical membership inference attacks against large-scale multi-modal models: A pilot study. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4871–4881.
17. Fu, W.; Wang, H.; Gao, C.; Liu, G.; Li, Y.; Jiang, T. Practical membership inference attacks against fine-tuned large language models via self-prompt calibration. arXiv 2023, arXiv:2311.06062.
18. Shi, W.; Ajith, A.; Xia, M.; Huang, Y.; Liu, D.; Blevins, T.; Chen, D.; Zettlemoyer, L. Detecting Pretraining Data from Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
19. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3–18.
20. Rahimian, S.; Orekondy, T.; Fritz, M. Sampling Attacks: Amplification of Membership Inference Attacks by Repeated Queries. arXiv 2020, arXiv:2009.00395.
21. Hayes, J.; Melis, L.; Danezis, G.; De Cristofaro, E. LOGAN: Membership Inference Attacks Against Generative Models. Proc. Priv. Enhancing Technol. 2019, 1, 133–152.
22. Nasr, M.; Shokri, R.; Houmansadr, A. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 739–753.
23. Wu, B.; Chen, C.; Zhao, S.; Chen, C.; Yao, Y.; Sun, G.; Wang, L.; Zhang, X.; Zhou, J. Characterizing membership privacy in stochastic gradient Langevin dynamics. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6372–6379.
24. Ye, J.; Maddi, A.; Murakonda, S.K.; Bindschaedler, V.; Shokri, R. Enhanced membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, Los Angeles, CA, USA, 7–11 November 2022; pp. 3093–3106.
25. Ullah, N.; Aman, M.N.; Sikdar, B. meMIA: Multi-level Ensemble Membership Inference Attack. IEEE Trans. Artif. Intell. 2025, 6, 93–106.
26. Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; et al. Extracting training data from large language models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Vancouver, BC, Canada, 11–13 August 2021; pp. 2633–2650.
27. Mireshghallah, F.; Goyal, K.; Uniyal, A.; Berg-Kirkpatrick, T.; Shokri, R. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8332–8347.
28. Mattern, J.; Mireshghallah, F.; Jin, Z.; Schoelkopf, B.; Sachan, M.; Berg-Kirkpatrick, T. Membership Inference Attacks against Language Models via Neighbourhood Comparison. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023.
29. Feng, Q.; Kasa, S.R.; Yun, H.; Teo, C.H.; Bodapati, S.B. Exposing privacy gaps: Membership inference attack on preference data for LLM alignment. arXiv 2024, arXiv:2407.06443.
30. Xie, R.; Wang, J.; Huang, R.; Zhang, M.; Ge, R.; Pei, J.; Gong, N.; Dhingra, B. ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8671–8689.
31. Wang, Z.; Liu, G.; Yang, Y.; Wang, C. Membership Inference Attack against Long-Context Large Language Models. arXiv 2024, arXiv:2411.11424.
32. Wen, R.; Li, Z.; Backes, M.; Zhang, Y. Membership inference attacks against in-context learning. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 3481–3495.
33. Truex, S.; Liu, L.; Gursoy, M.E.; Yu, L.; Wei, W. Demystifying membership inference attacks in machine learning as a service. IEEE Trans. Serv. Comput. 2019, 14, 2073–2089.
34. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2020; Volume 33, pp. 1877–1901.
35. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
36. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
37. Sun, T.; Shao, Y.; Qian, H.; Huang, X.; Qiu, X. Black-box tuning for language-model-as-a-service. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 20841–20855.
38. Mo, Y.; Liu, J.; Yang, J.; Wang, Q.; Zhang, S.; Wang, J.; Li, Z. C-ICL: Contrastive in-context learning for information extraction. arXiv 2024, arXiv:2402.11254.
39. Zhang, K.; Lv, A.; Chen, Y.; Ha, H.; Xu, T.; Yan, R. Batch-ICL: Effective, Efficient, and Order-Agnostic In-Context Learning. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10728–10739.
40. Jiang, G.; Ding, Z.; Shi, Y.; Yang, D. P-ICL: Point in-context learning for named entity recognition with large language models. arXiv 2024, arXiv:2405.04960.
41. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682.
42. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11048–11064.
43. Zhou, W.; Ge, T.; Xu, K.; Wei, F.; Zhou, M. BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3368–3373.
44. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28.
45. Li, X.; Roth, D. Learning question classifiers. In Proceedings of COLING 2002: The 19th International Conference on Computational Linguistics, Taipei, Taiwan, 26–30 August 2002.
46. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
47. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
48. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
49. Nguyen, D.; Luo, W.; Vo, B.; Nguyen, L.T.; Pedrycz, W. Con2Vec: Learning embedding representations for contrast sets. Knowl.-Based Syst. 2021, 229, 107382.
50. Ni, J.; Abrego, G.H.; Constant, N.; Ma, J.; Hall, K.; Cer, D.; Yang, Y. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. In Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022.
51. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
52. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
53. Martin, J.H.; Jurafsky, D. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Pearson/Prentice Hall: Upper Saddle River, NJ, USA, 2009; Volume 23.
54. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
55. Yeom, S.; Giacomelli, I.; Fredrikson, M.; Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In Proceedings of the 2018 IEEE 31st Computer Security Foundations Symposium (CSF), Oxford, UK, 9–12 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 268–282.
56. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
57. Freitag, M.; Al-Onaizan, Y. Beam Search Strategies for Neural Machine Translation. arXiv 2017, arXiv:1702.01806.
58. Holtzman, A.; Buys, J.; Forbes, M.; Bosselut, A.; Golub, D.; Choi, Y. Learning to Write with Cooperative Discriminators. arXiv 2018, arXiv:1805.06087.
59. Holtzman, A.; Buys, J.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv 2019, arXiv:1904.09751.
| Attack Method | Target | Attack Content | Training Data (Sample) | Training Data (Distribution) | Target Model (Structure) | Target Model (Parameter) |
|---|---|---|---|---|---|---|
| Shadow Attack [19] | ML Model | Training Data | ✓ | ✓ | ✓ | × |
| Sampling Attacks [20] | ML Model | Training Data | × | ✓ | ✓ | × |
| LOGAN [21] | Generative Model | Training Data | × | × | ✓ | ✓ |
| Nasr et al. [22] | ML Model | Training Data | × | × | ✓ | × |
| Wu et al. [23] | ML Model | Training Data | ✓ | × | × | ✓ |
| Truex et al. [33] | ML Model | Training Data | ✓ | × | × | × |
| Ye et al. [24] | ML Model | Training Data | ✓ | ✓ | × | × |
| MEMIA [25] | ML Model | Training Data | × | ✓ | × | × |
| Carlini et al. [26] | LLM | Training Data | × | × | × | × |
| Mireshghallah et al. [27] | Masked LM | Training Data | × | ✓ | × | × |
| Mattern et al. [28] | LM | Training Data | ✓ | × | × | × |
| PREMIA [29] | LLM | Training Data | × | ✓ | × | × |
| RECALL [30] | LLM | Training Data | ✓ | × | × | × |
| Wang et al. [31] | LLM | Context Content | × | × | × | × |
| Wen et al. [32] | LLM | ICL Samples | × | × | × | × |
| Our Work * | LLM | ICL Samples | × | × | × | × |
| Model | Attack Method | AGNews | DBPedia | TREC |
|---|---|---|---|---|
| LLaMA3 | GAP | 0.49 | 0.40 | 0.35 |
| LLaMA3 | RA | 0.89 | 0.72 | 0.83 |
| LLaMA3 | NDA | 0.82 | 0.83 | 0.91 |
| LLaMA2 | GAP | 0.36 | 0.42 | 0.38 |
| LLaMA2 | RA | 0.82 | 0.62 | 0.82 |
| LLaMA2 | NDA | 0.87 | 0.85 | 0.91 |
| GPT2-XL | GAP | 0.47 | 0.25 | 0.15 |
| GPT2-XL | RA | 0.89 | 0.84 | 0.86 |
| GPT2-XL | NDA | 0.89 | 0.86 | 0.85 |
| Model | Number of Replacements | AGNews | DBPedia |
|---|---|---|---|
| LLaMA2 | 1 | 0.87 | 0.85 |
| LLaMA2 | 2 | 0.80 | 0.72 |
| LLaMA2 | 3 | 0.71 | 0.72 |
| LLaMA3 | 1 | 0.82 | 0.88 |
| LLaMA3 | 2 | 0.90 | 0.85 |
| LLaMA3 | 3 | 0.86 | 0.84 |
| Train \ Test | LLaMA2&TREC | LLaMA2&AGNews | GPT2-XL&TREC | GPT2-XL&AGNews |
|---|---|---|---|---|
| LLaMA2&TREC | / | 0.76 | 0.87 | 0.83 |
| LLaMA2&AGNews | 0.88 | / | 0.82 | 0.91 |
| GPT2-XL&TREC | 0.92 | 0.80 | / | 0.77 |
| GPT2-XL&AGNews | 0.88 | 0.90 | 0.83 | / |