3.1. Problem Formulation
Given an original prompt represented as $x_{1:n} = (x_1, \ldots, x_n)$, a sequence of input tokens where each $x_i \in \{1, \ldots, V\}$ (with $V$ being the vocabulary size), an LLM maps the sequence of tokens to a distribution over the next token. The LLM can be defined as $p(x_{n+1} \mid x_{1:n})$, representing the likelihood of $x_{n+1}$ given the preceding tokens $x_{1:n}$. Thus, the response $x_{n+1:n+H}$ can be generated by sampling from the following distribution:

$$p(x_{n+1:n+H} \mid x_{1:n}) = \prod_{i=1}^{H} p(x_{n+i} \mid x_{1:n+i-1}) \tag{1}$$
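To make Equation (1) concrete, here is a minimal autoregressive sampling sketch, assuming a HuggingFace-style causal LM; the model name and response length ($H = 32$) are placeholders, not choices made in this paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM exposes the same next-token distribution.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tok("Write a story about", return_tensors="pt").input_ids
for _ in range(32):  # sample H = 32 response tokens
    logits = model(ids).logits[0, -1]           # p(x_{n+1} | x_{1:n})
    probs = torch.softmax(logits, dim=-1)
    nxt = torch.multinomial(probs, num_samples=1)
    ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```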
To force the model to provide correct answers to malicious questions, rather than refusing to respond, previous works combine the malicious question $x^{(Q)}$ with the optimized jailbreak suffix $x^{(S)}$, forming a jailbreak prompt $x^{(P)} = x^{(Q)} \oplus x^{(S)}$, where $\oplus$ represents the vector concatenation operation. For notational simplicity, let $Q$ denote the malicious question $x^{(Q)}$, $S$ represent the jailbreak suffix $x^{(S)}$, and $P$ stand for the jailbreak prompt $x^{(P)}$. Setting a specific target response for individual malicious questions is impractical, as crafting an appropriate answer for each query is challenging and risks compromising universality. A common workaround [7,39] is to default to affirmative responses (e.g., “Sure, here’s how to [$Q$]”). To achieve this, we optimize the LLM’s initial output to align with a predefined target prefix $x^{(T)}$ (abbreviated as $T$), leading to the following adversarial jailbreak loss function:

$$\mathcal{L}(P) = -\log p(T \mid P) \tag{2}$$
The generation of the adversarial suffix can then be formulated as the following minimization problem:

$$S^{*} = \arg\min_{S} \mathcal{L}(Q \oplus S) \tag{3}$$
For simplicity of notation, we use $\mathcal{L}(S)$ to denote $\mathcal{L}(Q \oplus S)$ in subsequent sections. A detailed optimization process is provided in Appendix E to aid understanding.
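A minimal sketch of the jailbreak loss in Equation (2), computed by teacher forcing over the target prefix; the function and variable names here are ours, not from a released implementation:

```python
import torch
import torch.nn.functional as F

def jailbreak_loss(model, prompt_ids, target_ids):
    """-log p(T | P): cross-entropy of the target prefix given the jailbreak prompt."""
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(ids).logits
    # Logits at positions n-1 .. n+|T|-2 predict the |T| target tokens.
    n = prompt_ids.shape[1]
    tgt_logits = logits[:, n - 1 : n - 1 + target_ids.shape[1], :]
    return F.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)),
        target_ids.reshape(-1),
    )
```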
3.2. Spatial Momentum
MAC [19] is the first to introduce a momentum mechanism into the GCG method, achieving performance improvements. By integrating a momentum term into the iterative process, it effectively incorporates temporal correlations in the gradients used for candidate sampling, thereby stabilizing the update direction during iterations.
In our paper, inspired by advancements [20,21,22] in the traditional visual adversarial domain, we apply the spatial momentum method to enhance the performance and transferability of GCG, naming it SM-GCG (Spatial Momentum GCG).
In traditional GCG methods, candidate sampling gradients depend entirely on the current input. This can cause the adversarial suffix to overfit to the specific malicious query during iterative optimization, becoming fragile to even single-character modifications. Ideally, a robust adversarial suffix should maintain effectiveness against semantically equivalent variations of the malicious question. Compared to the GCG gradient formulation, SM-GCG integrates gradients from multiple random transformations of the malicious query while incorporating abstract semantic-space information to ensure gradient stability. This integration acts as a form of gradient averaging, which smooths the optimization landscape. The dynamical consequence, as evidenced by the loss curves in Figure 3 (the experimental setup for this comparison used 100 malicious prompts from AdvBench [12] to attack LLaMA2-7B, with 500 attack rounds per method; the plotted curves show the average loss across the 100 attacks, with the shaded area indicating the standard deviation), is a significant reduction in oscillation amplitude. This dampening effect arises because the averaged gradient is less susceptible to the high-frequency noise present in any single instance of the query, guiding the optimization towards a more stable descent direction. Furthermore, the smooth convergence phase observed in SM-GCG indicates that the optimizer has located a flat region of the loss minimum. Solutions in such flat minima are theoretically and empirically linked to superior generalization, which in our context translates to adversarial suffixes that are robust to paraphrasing and minor perturbations of the original malicious query. Finally, the lower plateau value of the loss achieved by SM-GCG quantifies a higher success probability for the attack. We quantitatively validated this relationship by monitoring the attack success rate alongside the loss across ten independent SM-GCG runs under varying configurations. A representative example (Figure 4) shows the striking mirror-image relationship between the two curves during optimization. The average Spearman’s rank correlation coefficient across all ten runs is −0.995 (std: ±0.004), providing robust statistical evidence that the loss value is a highly reliable proxy for attack efficacy. Consequently, the significantly lower final loss plateau achieved by SM-GCG (as seen in Figure 3) directly and consistently translates to its measurably higher attack success rate across our benchmark. The proposed gradient formulation is defined as follows:
$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} G\!\left(\mathcal{T}_i(S)\right), \quad j = 1, \ldots, m \tag{4}$$

where $G(\cdot)$ is used to compute the gradient after applying transformations $\mathcal{T}_i$ to the input, $n$ represents the desired number of transformations (details are provided below), $j$ is the index of the suffix token, where $1 \le j \le m$, and $m$ is the length of the token sequence after encoding the adversarial suffix. The term $e_j$ denotes the one-hot vector corresponding to the token at index $j$. The coefficients $\alpha$ and $\beta$ are weighting factors used to balance the original gradient and the sampled gradients.
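A minimal sketch of the spatial-momentum gradient in Equation (4) with a pluggable transformation function; all names here (grad_fn, transforms, alpha, beta) are illustrative, not taken from a released implementation:

```python
import torch

def sm_gcg_gradient(grad_fn, query, suffix_onehot, transforms, alpha=1.0, beta=1.0):
    """Weighted sum of the original gradient and gradients from n transformed inputs.

    grad_fn(query, suffix_onehot) -> gradient of the jailbreak loss w.r.t. the
    one-hot suffix matrix (m x V); transforms is a list of n input transformations.
    """
    g = alpha * grad_fn(query, suffix_onehot)
    for t in transforms:  # n random transformations of the query/suffix
        tq, ts = t(query, suffix_onehot)
        g = g + (beta / len(transforms)) * grad_fn(tq, ts)
    return g
```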
Through extensive research and experimentation, we classify input transformations by the space in which they are applied: the candidate, text, token, one-hot, and embedding spaces, each detailed in a subsection below.
Note: Text-space transformations on $Q$ and $S$ may alter the length of the decoded token sequence, disrupting gradient accumulation and thus requiring filtering before use. Because the tokenizer’s encoding map (string → tokens) is not surjective onto the set of all token sequences, not every token sequence corresponds to the encoding of a valid string. While transformations in the token, one-hot, and embedding spaces may therefore sample values absent in normal inference scenarios, failing to filter them does not cause gradient-accumulation errors. For details, see Table 1. In the table, “F” indicates that the corresponding combination may produce some unusable sample values and must be filtered to take effect; “T” means the corresponding combination requires no filtering; “B” indicates that the combination may generate some low-quality sample values, for which filtering is optional; “X” denotes that the combination is incompatible and cannot take effect.
The function $G(\cdot)$ in Equation (4) is merely an abstract representation; in practice, the gradient function varies depending on the transformation space.
3.2.1. Candidate Space
In the iterative process of GCG, each iteration generates a batch of candidate suffixes with only 1–2 token differences. A high-quality suffix should exhibit robustness, meaning that minor modifications to the suffix should preserve most of its performance. To simulate such modifications, we sample candidate suffixes based on their gradients, particularly focusing on loss-guided sampling. This method more readily identifies suffixes that may appear suboptimal from a local perspective but are globally optimal. This facilitates a more stable iterative process and enables the generation of more robust adversarial suffixes.
Gradient sampling in the candidate space involves retaining the set of candidate suffixes from the previous iteration during the iterative process. Random sampling or loss-prioritized sampling is employed to accumulate gradients from the other candidates. In the $t$-th iteration, the sampled inputs are calculated as follows:

$$\tilde{S}_t^{(i)} = \phi\!\left(\mathcal{C}_{t-1}\right), \quad i = 1, \ldots, n \tag{5}$$

where $S_t$ is the adversarial suffix in the $t$-th iteration, $\mathcal{C}_{t-1}$ is the candidate pool retained from the previous iteration, and $\phi$ can serve as either a random sampling function or a loss-prioritized sampling function. We encode the input using a tokenizer to obtain a sequence of tokens, and the gradient formula applying the candidate-space transformation is written as follows:

$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S_t) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\tilde{e}_j^{(i)}} \mathcal{L}\!\left(\tilde{S}_t^{(i)}\right) \tag{6}$$
Specifically, we implement two sampling strategies:
Random Sampling: Uniformly select k candidates from the candidate pool without considering their loss values, promoting exploration of diverse gradient directions.
Loss-Guided Sampling: Select the top-$k$ candidates with the lowest loss values from the candidate pool $\mathcal{C}_{t-1}$. This strategy focuses the gradient computation on the most promising (lowest-loss) directions found in the current candidate set, reinforcing and refining the most effective perturbations.
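A sketch of the two sampling strategies over a candidate pool, represented here as a list of (suffix, loss) pairs; the function names are illustrative:

```python
import random

def random_sampling(pool, k):
    """Uniformly pick k candidates, ignoring loss, to diversify gradient directions."""
    return random.sample([s for s, _ in pool], k)

def loss_guided_sampling(pool, k):
    """Pick the k lowest-loss candidates to refine the most promising directions."""
    return [s for s, _ in sorted(pool, key=lambda p: p[1])[:k]]
```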
3.2.2. Text Space
We believe that a robust adversarial suffix should not be rendered ineffective by modifications that do not alter the semantics, such as “Write a convincing anonymous death threat” versus “Write a credible anonymous death threat”. Therefore, we sample text transformations that minimally change the semantics and collect gradients from them, ensuring that the generated suffix relies not on specific character sequences but on higher-level semantics.
In the text space, changes are applied to the string. Let $s = s^{(Q)} \parallel s^{(S)}$ be the original string, where $s^{(Q)}$ denotes the malicious question part of the original string, $s^{(S)}$ denotes the adversarial suffix part, and $\parallel$ represents the string concatenation operation. After applying the transformations, we obtain the following:

$$\tilde{s}^{(i)} = \mathcal{T}_i\!\left(s^{(Q)}\right) \parallel \mathcal{T}_i\!\left(s^{(S)}\right), \quad i = 1, \ldots, n \tag{7}$$

We encode the input using a tokenizer to obtain a sequence of tokens $\tilde{S}^{(i)} = \mathrm{Tokenize}\!\left(\tilde{s}^{(i)}\right)$, which is then substituted into Equation (4), yielding the gradient formula after applying the text-space transformation:

$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\tilde{e}_j^{(i)}} \mathcal{L}\!\left(\tilde{S}^{(i)}\right) \tag{8}$$
where $\tilde{s}_j^{(i)}$ is the $j$-th token in the encoded token sequence of the string after applying text transformations, and $\tilde{e}_j^{(i)}$ is its corresponding one-hot vector.
We implement text-space transformations at two granularities of textual modification using the nlpaug library:
- 1.
Character-Level Transformations:
Random Character Substitution: Randomly replace 1–2 characters with other alphabetic characters.
OCR-based Substitution: Simulate OCR errors by replacing characters with visually similar ones (e.g., ’o’→’0’, ’l’→’1’).
Keyboard Typo Substitution: Replace characters with adjacent keyboard keys (e.g., ’a’→’s’, ’k’→’l’).
- 2.
Word-Level Transformations:
Synonym Replacement: Replace one word with its semantic equivalent using WordNet.
Random Swap: Randomly swap the positions of two adjacent words.
Random Deletion: Delete one word from the suffix.
Spelling Error Replacement: Introduce common spelling mistakes (e.g., ’receive’→’recieve’).
We limit changes to 1–2 characters or 1 word to minimize semantic alteration while providing sufficient variation for robustness.
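A sketch of these transformations with nlpaug’s character- and word-level augmenters; the parameter values shown (e.g., aug_char_max, aug_max) are illustrative caps on the edit budget matching the 1–2 character / 1 word limit above:

```python
import random
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw

# Character-level: at most 2 characters edited.
char_augs = [
    nac.RandomCharAug(action="substitute", aug_char_max=2),  # random substitution
    nac.OcrAug(aug_char_max=2),                              # OCR-style confusions
    nac.KeyboardAug(aug_char_max=2),                         # adjacent-key typos
]
# Word-level: at most 1 word changed.
word_augs = [
    naw.SynonymAug(aug_src="wordnet", aug_max=1),   # WordNet synonym replacement
    naw.RandomWordAug(action="swap", aug_max=1),    # swap adjacent words
    naw.RandomWordAug(action="delete", aug_max=1),  # delete one word
    naw.SpellingAug(aug_max=1),                     # common misspellings
]

def transform_text(s: str) -> str:
    aug = random.choice(char_augs + word_augs)
    out = aug.augment(s)  # recent nlpaug versions return a list of strings
    return out[0] if isinstance(out, list) else out
```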
Furthermore, our framework can be extended to handle more complex scenarios: For malicious queries consisting of multiple sentences, sentence-level transformations such as random sentence reordering and sentence paraphrasing can be applied. For contextual malicious scenarios involving multiple turns of dialogue, message-level transformations such as random context message reordering can be employed. Since the AdvBench dataset used in our subsequent experiments contains malicious prompts that are primarily single-sentence queries, these extended transformations are not utilized in our current experimental setup.
3.2.3. Token Space
The transformations we apply in token space can be mapped to text space. However, the advantage of operating in token space is that we avoid the issue of altered decoded token sequence lengths caused by transformations. Therefore, in text space, we are typically limited to minor modifications such as synonym replacement, whereas in token space, more impactful transformations such as shift operations can be applied.
In the token space, changes are applied directly to the token sequence $S = (s_1, \ldots, s_m)$. The gradient formula is written as follows:

$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\tilde{e}_j^{(i)}} \mathcal{L}\!\left(\tilde{S}^{(i)}\right), \quad \tilde{S}^{(i)} = \mathcal{T}_i(S) \tag{9}$$

where $\tilde{s}_j^{(i)}$ is the $j$-th token in the token sequence after applying token transformations, and $\tilde{e}_j^{(i)}$ is its corresponding one-hot vector.
When applying transformations, we can decode and re-encode the token sequence, filtering out any outputs that diverge from the original sequence, to ensure the transformed tokens remain valid.
In token space, we implement two specific transformation strategies:
1. Random Token Replacement: Randomly replace a subset of tokens in the sequence with other valid tokens from the vocabulary. Formally, for a token sequence $S = (s_1, \ldots, s_m)$, we generate the following:

$$\tilde{S} = (\tilde{s}_1, \ldots, \tilde{s}_m), \quad \tilde{s}_i \sim \mathrm{Uniform}\!\left(\{1, \ldots, V\}\right) \tag{10}$$

for randomly selected positions $i$, with $\tilde{s}_i = s_i$ at all other positions.
2. Cyclic Shift Operation: Perform circular shifting of the token sequence by a random offset $k$, calculated as follows:

$$\tilde{s}_i = s_{\left((i + k - 1) \bmod m\right) + 1}, \quad i = 1, \ldots, m \tag{11}$$
This operation preserves all token information while altering the positional context.
Similar to the text space, we limit the scope of modifications by replacing only 1–2 tokens to maintain semantic coherence while providing sufficient variation.
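A sketch of both token-space transformations together with the decode/re-encode validity filter described above; tokenizer is any HuggingFace-style tokenizer, and the function names are ours:

```python
import random

def random_token_replace(tokens, vocab_size, max_replace=2):
    """Replace 1-2 randomly chosen tokens with uniform draws from the vocabulary."""
    out = list(tokens)
    for i in random.sample(range(len(out)), k=min(max_replace, len(out))):
        out[i] = random.randrange(vocab_size)
    return out

def cyclic_shift(tokens):
    """Circularly shift the sequence by a random offset k, preserving all tokens."""
    k = random.randrange(1, len(tokens))
    return tokens[k:] + tokens[:k]

def roundtrip_valid(tokens, tokenizer):
    """Keep only sequences that survive decode -> re-encode unchanged."""
    return tokenizer.encode(tokenizer.decode(tokens), add_special_tokens=False) == tokens
```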
3.2.4. One-Hot Space
The sampling in one-hot and embedding spaces primarily addresses the high non-smoothness of gradients in these spaces. Local gradients may fail to capture the global gradient landscape, leading optimization to converge to local minima. Neighborhood sampling can help stabilize gradients.
In the one-hot space, we first convert the token sequences into one-hot form, $E^{(Q)} \in \{0,1\}^{|Q| \times V}$ and $E^{(S)} \in \{0,1\}^{m \times V}$. The loss function in Formula (2) simplifies the model’s reasoning process; we extract the embedding procedure and redefine a loss function that takes embedding vectors as input:

$$\mathcal{L}_{\mathrm{emb}}(v) = -\log p(T \mid v) \tag{12}$$

where $v$ is a sequence of embedding vectors.
We apply transformations to the malicious query and the adversarial suffix separately, then multiply them by the embedding weight matrix to obtain the modified embedding vectors. The gradient formula is:

$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla \mathcal{L}_{\mathrm{emb}}\!\left(\left(\mathcal{T}_i\!\left(E^{(Q)}\right) \oplus \mathcal{T}_i\!\left(E^{(S)}\right)\right) W\right) \tag{13}$$

where $W \in \mathbb{R}^{V \times d}$ is the embedding weight matrix. Due to the sparsity of natural values in the one-hot space, applying transformations almost never yields natural values. Therefore, we employ neighborhood sampling without filtering.
Specifically, in the one-hot space, we employ Gaussian noise injection to sample neighboring points around the current one-hot vectors. This approach helps explore the gradient landscape beyond immediate local neighborhoods and provides more stable gradient estimates for optimization.
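A sketch of Gaussian neighborhood sampling in the one-hot space; sigma is an assumed noise scale and embed_matrix stands for the model’s token-embedding weight $W$:

```python
import torch
import torch.nn.functional as F

def onehot_neighborhood_embeddings(token_ids, embed_matrix, n=8, sigma=0.1):
    """Perturb one-hot vectors with Gaussian noise, then map them through the
    embedding matrix W to obtain n neighboring embedding sequences."""
    vocab_size = embed_matrix.shape[0]
    onehot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (m, V)
    samples = []
    for _ in range(n):
        noisy = onehot + sigma * torch.randn_like(onehot)  # neighborhood point
        samples.append(noisy @ embed_matrix)               # (m, d) embeddings
    return samples
```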
3.2.5. Embedding Space
In the embedding space, the original embedding vectors are $v^{(Q)} = E^{(Q)} W$ and $v^{(S)} = E^{(S)} W$. The gradient formula is written as follows:

$$g_j = \alpha \cdot \nabla_{e_j} \mathcal{L}(S) + \beta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla \mathcal{L}_{\mathrm{emb}}\!\left(\mathcal{T}_i\!\left(v^{(Q)}\right) \oplus \mathcal{T}_i\!\left(v^{(S)}\right)\right) \tag{14}$$
As in the one-hot space, Gaussian noise is used to sample the neighborhood without filtering.
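The embedding-space variant simply adds the noise after the embedding lookup rather than before it; a sketch, again with an assumed noise scale:

```python
import torch

def embedding_neighborhood(embeds, n=8, sigma=0.05):
    """Sample n Gaussian-noised neighbors of an embedding sequence (no filtering)."""
    return [embeds + sigma * torch.randn_like(embeds) for _ in range(n)]
```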