In this section, we first present a formal formulation of the adversarial attack problem, along with the corresponding notation. We then elaborate on the motivation driving this study. Subsequently, we detail the design of the loss functions and the semantic-consistency transformations incorporated into CBLA. Finally, we describe the overall training framework and algorithmic procedure of CBLA.
3.1. Notation and Definitions
Given a trained classification model $f$, the model predicts a label $y$ for an input $x$, where the ground-truth label of $x$ is denoted by $y_{\mathrm{true}}$. An adversarial attack aims to construct a small perturbation $\delta$ that is added to a clean sample such that the classifier produces an incorrect prediction, i.e., $f(x + \delta) \neq y_{\mathrm{true}}$. The perturbation $\delta$ is typically constrained by an $\ell_p$-norm bound, where $p$ denotes the norm order. The perturbed input is defined as $x_{adv} = x + \delta$, which is referred to as an adversarial example. The adversarial attack can therefore be formulated as:
$$\max_{\delta} \; \mathcal{L}\big(f(x + \delta),\, y_{\mathrm{true}}\big) \quad \text{s.t.} \quad \|\delta\|_p \le \epsilon,$$
where $\mathcal{L}$ typically denotes the cross-entropy loss, and $\epsilon$ is a small constant controlling the perturbation magnitude.
In this work, we focus on targeted attacks, in which the adversarial example is required not only to induce misclassification but also to drive the prediction toward a specified target label $t$, i.e., $f(x + \delta) = t$. Compared with untargeted attacks, targeted attacks constitute a more challenging optimization scenario.
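In this study, adversarial examples are generated using a generative model $G_\theta$, and the targeted attack objective is formulated as:
$$\min_{\theta} \; \mathcal{L}\big(f(G_\theta(x)),\, t\big) \quad \text{s.t.} \quad \|G_\theta(x) - x\|_\infty \le \epsilon,$$
where $G_\theta$ denotes the generator network. In this work, the perturbation is constrained by the $\ell_\infty$-norm.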
3.2. A Geometric Perspective on Transferability
Adversarial examples generated on a surrogate model often transfer to other models trained for the same task. Nevertheless, a significant performance gap remains between transfer-based attacks and white-box attacks, especially in the targeted setting. This gap suggests that the geometric relationships among the decision regions of different models intrinsically constrain transferability.
A key observation is that adversarial examples optimized on a single surrogate model frequently lie near the boundary of its attackable region, rather than in regions that are jointly vulnerable across multiple models. Consequently, such perturbations may fail to generalize when evaluated on unseen architectures. To formalize this intuition and motivate our method, we introduce a geometric description of structures shared across classifiers.
Shared Target Region Across Models. Let $\mathcal{F} = \{f_1, \dots, f_N\}$ denote a collection of classifiers trained for the same task, with label set $\mathcal{Y}$ and input space $\mathcal{X}$. For each class $c \in \mathcal{Y}$, define the decision region of model $f_i$ as:
$$\mathcal{R}_i(c) = \{\, x \in \mathcal{X} : f_i(x) = c \,\}.$$
The shared target region across models is given by
$$\mathcal{R}^*(c) = \bigcap_{i=1}^{N} \mathcal{R}_i(c).$$
Under standard supervised training on a common dataset, different models tend to agree on a nontrivial subset of samples per class. Therefore, $\mathcal{R}^*(c)$ is typically non-empty in practice. This is consistent with the empirical observation that clean samples are often consistently classified across heterogeneous architectures.
Shared Targeted Region Across Models. For a fixed input $x$ and target label $t$, define the targeted adversarial region of model $f_i$ under perturbation budget $\epsilon$ as:
$$\mathcal{A}_i(x, t, \epsilon) = \{\, x' \in \mathcal{X} : \|x' - x\|_\infty \le \epsilon,\; f_i(x') = t \,\}.$$
The shared targeted region across models is
$$\mathcal{A}^*(x, t, \epsilon) = \bigcap_{i=1}^{N} \mathcal{A}_i(x, t, \epsilon).$$
In general, the non-emptiness of $\mathcal{A}^*(x, t, \epsilon)$ cannot be guaranteed without specific assumptions on model geometry. However, the well-established phenomenon of adversarial transfer indicates that perturbations effective on one surrogate model often remain effective across multiple models. This empirical evidence suggests that transferable targeted adversarial examples tend to lie in the vicinity of shared target-consistent regions.
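The two set definitions above can be illustrated with a small membership test. This is only a sketch under stated assumptions: the two 1-D threshold classifiers, the input, and the budget are hypothetical toy values chosen for illustration, and the function names are not from the paper.

```python
def in_shared_target_region(models, x, c):
    """True if every model in the collection assigns label c to x."""
    return all(f(x) == c for f in models)


def in_shared_targeted_region(models, x, x_prime, t, eps):
    """True if x_prime lies in the l_inf ball of radius eps around x
    and every model classifies x_prime as the target label t."""
    within_budget = max(abs(a - b) for a, b in zip(x_prime, x)) <= eps
    return within_budget and in_shared_target_region(models, x_prime, t)


# Hypothetical classifiers with slightly different decision boundaries.
MODELS = [
    lambda x: 1 if x[0] > 0.50 else 0,   # plays the role of the surrogate
    lambda x: 1 if x[0] > 0.56 else 0,   # plays the role of an unseen model
]
X_CLEAN = [0.48]   # classified as 0 by both toy models
EPS = 0.12         # toy perturbation budget

if __name__ == "__main__":
    for candidate in ([0.53], [0.58], [0.70]):
        print(candidate, in_shared_targeted_region(MODELS, X_CLEAN, candidate, 1, EPS))
```

In this toy setting, the candidate at 0.53 crosses only the first model's boundary, the candidate at 0.58 lies in the shared targeted region, and the candidate at 0.70 is in the shared target region but is not reachable within the budget.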
These constructions lead to the following observations.
(1) Relative size of shared regions. The shared targeted adversarial region $\mathcal{A}^*(x, t, \epsilon)$ is generally much smaller than the shared decision region $\mathcal{R}^*(t)$. While $\mathcal{R}^*(t)$ consists of inputs naturally classified as class $t$ by all models, $\mathcal{A}^*(x, t, \epsilon)$ additionally requires reachability from a given input $x$ under a constrained perturbation budget. Therefore, transferable targeted attacks can be interpreted as searching for feasible perturbations that move $x$ into a relatively small intersection region embedded within a larger shared class-consistent domain.
(2) Prototype-based interpretation. The existence of $\mathcal{R}^*(t)$ implies that there exist samples that are consistently recognized as class $t$ across models. Such samples can be regarded as prototype points located in the interior of the shared decision region. From this perspective, effective targeted adversarial examples should not only cross the decision boundary of a surrogate model, but also move in directions that approximate the geometric structure of these prototype regions. In particular, cosine similarity encourages alignment between the logit vector $z$ and the target basis vector $e_t$, which provides a normalized directional objective independent of logit magnitude. This can be interpreted as approximating movement toward a target-consistent prototype direction in logit space, thereby improving cross-model transferability.
Transfer-based targeted attacks seek perturbations that move an input sample into the shared adversarial region using only a surrogate model. The effectiveness of this process depends critically on the attack objective and the gradient feedback it provides. In most existing methods, the cross-entropy loss is adopted to guide perturbation updates.
However, once the perturbed sample crosses the surrogate model’s decision boundary, the cross-entropy gradient rapidly diminishes. As a consequence, the optimization trajectory often stagnates near the boundary of the surrogate model’s attack region. We refer to regions in the optimization space where gradient magnitudes become extremely small and provide little informative direction as attack dead zones. In these regions, optimization updates become unstable or ineffective, and perturbations are dominated by local surrogate-specific decision boundaries rather than target-consistent directions, resulting in poor transferability. This issue is particularly evident in iterative attacks such as FGSM [14] (Figure 1a) and MIM [7] (Figure 1b). Although generative approaches alleviate sample-wise overfitting [18] (Figure 1c), they still inherit the gradient-vanishing behavior induced by the cross-entropy objective.
Motivated by this limitation, we replace the cross-entropy objective with a loss that mitigates gradient decay and enlarges the region in which meaningful update signals can be obtained. By shrinking the attack dead zone, the optimization is less confined to surrogate-specific boundary artifacts and more likely to move toward the shared adversarial region $\mathcal{A}^*(x, t, \epsilon)$.
Nevertheless, reducing gradient vanishing alone does not guarantee that adversarial examples will lie inside $\mathcal{A}^*(x, t, \epsilon)$, whose structure is unknown. From the geometric perspective introduced earlier, highly transferable perturbations should approximate the shared target-consistent region $\mathcal{R}^*(t)$. Clean target samples naturally reside in this region and exhibit strong cross-model agreement. Existing methods, such as TTP [18] and M3D [19], attempt to exploit this observation by encouraging similarity between adversarial and target samples, either through logit alignment or feature-level discrimination (Figure 1c). While effective to some extent, these strategies remain dependent on surrogate-model representations and require access to real target samples.
To further reduce surrogate dependence, we instead impose constraints directly in the input space. Clean target samples preserve their class identity under semantic-preserving augmentations. Therefore, if an adversarial example maintains its attack objective under similar transformations, it is likely to capture more stable and model-agnostic target features. This transformation-based consistency serves as implicit regularization toward a shared target structure, without requiring target-domain data.
In summary, improving targeted transferability requires addressing two coupled aspects: (i) reducing model-dependent perturbations caused by gradient saturation, and (ii) enhancing target-oriented, model-agnostic features. Our framework jointly mitigates gradient vanishing and enforces semantic-invariance constraints, thereby promoting adversarial examples that are both structurally stable and strongly transferable.
3.3. Cosine Similarity as an Alternative to Cross-Entropy
The choice of loss function plays a central role in transfer-based targeted attacks, as it determines the optimization trajectory in the surrogate model’s attack space. In standard classification, cross-entropy (CE) is widely used for its stable gradients and strong class-separability properties [36]. However, in the context of targeted adversarial optimization, its behavior becomes suboptimal [32].
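For a logit vector $z = (z_1, \dots, z_K)$ and target class $t$, the gradient of cross-entropy (CE) with respect to $z$ is
$$\nabla_z \mathcal{L}_{CE} = p - e_t, \qquad p = \mathrm{softmax}(z),$$
where $e_t$ denotes the one-hot target label.
In targeted attacks, the objective encourages $z_t$ to dominate other logits. Let $m = z_t - \max_{j \neq t} z_j$ denote the logit margin. Without loss of generality, assume $z_t$ is the largest logit. Then,
$$p_t = \frac{1}{1 + \sum_{j \neq t} e^{-(z_t - z_j)}},$$
where $z_t - z_j \ge m$ for all $j \neq t$.
As $m \to \infty$, all terms $e^{-(z_t - z_j)}$ vanish exponentially, yielding
$$p_t \to 1, \qquad 1 - p_t \le (K - 1)\, e^{-m}.$$
Thus, the gradient on the target logit satisfies
$$\left| \frac{\partial \mathcal{L}_{CE}}{\partial z_t} \right| = 1 - p_t = O(e^{-m}),$$
and similarly for non-target classes. Therefore,
$$\|\nabla_z \mathcal{L}_{CE}\| = O(e^{-m}).$$
This shows that CE gradients decay exponentially as the logit margin increases, leading to rapid gradient saturation once the decision boundary is crossed.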
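The exponential decay can be checked numerically with a few lines of code. The sketch below evaluates the softmax-minus-one-hot gradient on toy logits; the choice of ten classes, with the target logit set to the margin and all other logits set to zero, is an assumption made purely for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(z, t):
    """Gradient of cross-entropy w.r.t. the logits: softmax(z) - one_hot(t)."""
    p = softmax(z)
    return [p_j - (1.0 if j == t else 0.0) for j, p_j in enumerate(p)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def ce_grad_norm_at_margin(m, num_classes=10, t=0):
    # Toy logits (assumption): target logit = m, every other logit = 0,
    # so the logit margin equals m exactly.
    z = [0.0] * num_classes
    z[t] = m
    return l2_norm(ce_grad(z, t))


if __name__ == "__main__":
    for m in (0.0, 2.0, 5.0, 10.0, 20.0):
        print(f"margin={m:5.1f}  ||grad CE||={ce_grad_norm_at_margin(m):.3e}")
```

For these toy logits the gradient norm stays below $\sqrt{90}\, e^{-m}$, so each additional unit of margin shrinks the update signal by roughly a factor of $e$.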
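To address this limitation, we instead optimize directly in logit space using cosine similarity. Let $z$ denote the logit vector and $e_t$ the one-hot target vector. The cosine loss is defined as:
$$\mathcal{L}_{cos}(z, t) = 1 - \frac{z^\top e_t}{\|z\|_2 \, \|e_t\|_2}.$$
Since $\|e_t\|_2 = 1$, we have $z^\top e_t = z_t$, and thus
$$\mathcal{L}_{cos}(z, t) = 1 - \frac{z_t}{\|z\|_2}.$$
Taking the gradient with respect to $z$, we obtain
$$\nabla_z \mathcal{L}_{cos} = \frac{z_t \, z}{\|z\|_2^{3}} - \frac{e_t}{\|z\|_2}.$$
Bounding the norm of this gradient yields
$$\|\nabla_z \mathcal{L}_{cos}\|_2 = \frac{\sqrt{\|z\|_2^{2} - z_t^{2}}}{\|z\|_2^{2}} \le \frac{C}{\|z\|_2}$$
for some constant $C$. Therefore,
$$\|\nabla_z \mathcal{L}_{cos}\|_2 = O\big(\|z\|_2^{-1}\big).$$
Unlike CE, whose gradients decay exponentially with respect to the logit margin, the cosine loss exhibits only polynomial decay. As a result, it maintains non-negligible gradient signals even when the target logit is dominant.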
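A short numerical comparison makes the contrast between the two decay rates concrete. This is a minimal sketch assuming the loss form $1 - z_t / \|z\|_2$; the four-class toy logits with fixed non-target entries are hypothetical, and the helper names are introduced here only for illustration.

```python
import math


def l2(v):
    return math.sqrt(sum(x * x for x in v))


def cosine_loss(z, t):
    """1 - cos(z, e_t), which reduces to 1 - z_t / ||z||_2."""
    return 1.0 - z[t] / l2(z)


def cosine_grad(z, t):
    """Analytic gradient of the cosine loss: z_t * z / ||z||^3 - e_t / ||z||."""
    n = l2(z)
    return [z[t] * v / n ** 3 - (1.0 / n if j == t else 0.0)
            for j, v in enumerate(z)]


def ce_grad(z, t):
    """Cross-entropy gradient softmax(z) - e_t, computed stably."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total - (1.0 if j == t else 0.0) for j, e in enumerate(exps)]


def toy_logits(m):
    # Hypothetical logits: target logit m at index 0, fixed non-target logits.
    return [m, 1.0, -1.0, 0.5]


if __name__ == "__main__":
    for m in (2.0, 5.0, 10.0, 20.0):
        z = toy_logits(m)
        print(f"z_t={m:5.1f}  ||grad CE||={l2(ce_grad(z, 0)):.3e}  "
              f"||grad cos||={l2(cosine_grad(z, 0)):.3e}")
```

With the non-target logits held fixed, the cosine gradient norm here equals $1.5 / (z_t^2 + 2.25)$, so it shrinks polynomially while the CE gradient norm shrinks exponentially. Note that the cosine gradient is exactly zero when $z$ is perfectly aligned with $e_t$, which is why the toy non-target logits are chosen to be non-zero.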
Geometrically, cross-entropy emphasizes probability saturation, whereas cosine similarity enforces directional alignment in logit space. This allows CBLA to continue optimizing after successful misclassification, pushing adversarial examples deeper into target-consistent regions and improving transferability. Therefore, compared with CE or other magnitude-based objectives, cosine similarity yields gradients that vanish far more slowly and provides a more suitable optimization signal for transferable targeted attacks.
3.4. Semantic-Invariant Constraint
The previous section addresses gradient degeneration by modifying the attack objective in logit space. However, even with sustained gradients, the optimization trajectory may remain surrogate-dependent, since the update direction is entirely determined by surrogate-specific feedback.
To further reduce this dependency, we introduce a structural constraint directly in the input domain, termed the Semantic-Invariant Constraint (SIC). The key idea is to enforce target consistency under semantic-preserving transformations.
Let $x_{adv}$ denote the generated adversarial example targeting class $t$. We consider two semantic-invariant transformations (SIT): an appearance-domain transformation $T_a$ and a spatial-domain transformation $T_s$. SIC requires that the adversarial objective remain valid after applying either transformation:
$$f\big(T_a(x_{adv})\big) = t, \qquad f\big(T_s(x_{adv})\big) = t.$$
This design is motivated by the observation that clean target samples located in the shared decision region $\mathcal{R}^*(t)$ remain correctly classified under semantic-invariant transformations. Therefore, enforcing such invariance encourages adversarial examples to approximate the structural properties of genuine target samples, rather than merely crossing surrogate-specific decision boundaries.
In this work, the following transformations are used:
Appearance-domain transformation $T_a$. This includes global, small-amplitude perturbations such as horizontal flipping, color jittering (brightness, contrast, saturation, hue), and grayscale conversion.
Spatial-domain transformation $T_s$. This consists of four discrete rotation operations: $0^\circ$, $90^\circ$, $180^\circ$, and $270^\circ$.
These transformations are selected according to four principles:
Semantic preservation. The chosen operations do not alter class identity and act globally on the image, avoiding local occlusion or cropping that may distort semantics.
High-frequency robustness. Transfer failures are often induced by high-frequency perturbation artifacts; these transformations reduce sensitivity to such components.
Gradient stability. Color jittering is implemented via small parametric adjustments, and rotations are realized through tensor permutations without interpolation, ensuring stable gradient propagation.
Controlled comparison. By adopting the same base transformations as in TTP [18], performance differences can be attributed to the application target (adversarial examples versus training samples) rather than to the transformation design.
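The interpolation-free nature of these operations can be sketched on a single-channel image stored as a nested list. This is an illustrative stand-in rather than the implementation used in this work: the function names are hypothetical, only flipping, right-angle rotation, and a brightness adjustment are shown, and the rotation angles are assumed to be multiples of 90 degrees.

```python
def hflip(img):
    """Horizontal flip: reverse every row (pure index permutation)."""
    return [row[::-1] for row in img]


def rot90(img, k=1):
    """Rotate counterclockwise by k * 90 degrees using transpose and
    row reversal only, so no pixel value is interpolated."""
    out = [list(row) for row in img]
    for _ in range(k % 4):
        out = [list(row) for row in zip(*out)][::-1]
    return out


def adjust_brightness(img, factor):
    """Small global brightness change, clamped to the [0, 1] pixel range."""
    return [[min(max(v * factor, 0.0), 1.0) for v in row] for row in img]


if __name__ == "__main__":
    image = [[0.1, 0.2],
             [0.3, 0.4]]
    print(rot90(image, 1))
    print(hflip(image))
    print(adjust_brightness(image, 1.1))
```

Because flipping and right-angle rotation only permute pixel positions, they leave the multiset of pixel values unchanged, and gradients pass through them as a simple re-indexing.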
Unlike previous approaches that apply transformations to training data to regularize surrogate learning, SIC directly constrains the adversarial example itself. Rather than modifying the surrogate model’s representation space, we impose structural consistency on the perturbation outcome. Consequently, the generated adversarial examples are encouraged to move toward target-consistent regions that are stable under semantic variation, thereby improving cross-model transferability.
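Formally, the SIC loss is defined as:
$$\mathcal{L}_{SIC} = \mathcal{L}_{cos}\big(f(T_a(x_{adv})),\, t\big) + \mathcal{L}_{cos}\big(f(T_s(x_{adv})),\, t\big).$$
The overall generator objective combines the logit-based loss and the semantic-invariance constraint:
$$\mathcal{L}_{G} = \mathcal{L}_{cos}\big(f(x_{adv}),\, t\big) + \mathcal{L}_{SIC}.$$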
This joint formulation simultaneously maintains gradient signals in logit space and enforces semantic consistency in input space, thereby providing complementary mechanisms to enhance transferable targeted attacks.
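One way to read the joint objective is as a sum of cosine losses over the original and transformed adversarial samples. The sketch below assumes an unweighted sum, which is an assumption made here for illustration; the linear toy surrogate, its weights, and the two list-based stand-in transformations are all hypothetical.

```python
import math

# Hypothetical 3-class linear surrogate acting on a 3-pixel "image".
WEIGHTS = [[0.9, -0.2, 0.1],
           [0.1, 0.8, -0.3],
           [-0.4, 0.3, 0.7]]


def toy_surrogate(x):
    """Return logits as a fixed linear map of the input (illustration only)."""
    return [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]


def cosine_loss(z, t):
    """1 - z_t / ||z||_2, the cosine loss against the one-hot target."""
    return 1.0 - z[t] / math.sqrt(sum(v * v for v in z))


def generator_loss(surrogate, x_adv, t, transforms):
    """Cosine loss on x_adv plus cosine losses on each transformed copy
    (unweighted sum, an assumption of this sketch)."""
    main = cosine_loss(surrogate(x_adv), t)
    sic = sum(cosine_loss(surrogate(T(x_adv)), t) for T in transforms)
    return main + sic


def flip(x):
    """Stand-in for a spatial transformation: reverse the pixel order."""
    return x[::-1]


def dim(x):
    """Stand-in for an appearance transformation: scale brightness by 0.9."""
    return [0.9 * v for v in x]


if __name__ == "__main__":
    x_adv = [0.2, 0.9, 0.4]
    print(generator_loss(toy_surrogate, x_adv, 1, [dim, flip]))
```

With a linear surrogate, the brightness scaling rescales the logits without changing their direction, so its cosine term equals the untransformed one; this is the magnitude independence noted earlier for the cosine objective.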
3.5. Overall Framework and Implementation
The overall training procedure of the proposed method is illustrated in Figure 2. Given a mini-batch of clean images $x$, the generator $G_\theta$ directly outputs unconstrained adversarial candidates within the valid pixel range $[0, 1]$. These preliminary outputs may contain high-frequency artifacts induced by optimization dynamics. To suppress such undesirable noise and promote smoother perturbation structures, a Gaussian smoothing operator is applied to the generated images. After smoothing, the samples are projected onto the feasible perturbation set to ensure compliance with the $\ell_\infty$ constraint and valid image bounds. The resulting adversarial batch is then subjected to semantic-preserving transformations; both original and transformed samples are forwarded through the surrogate model. Cosine-based logit alignment losses are computed and combined with semantic-invariance regularization to update the generator parameters.
Generator. The generator $G_\theta$ adopts a U-Net [37] architecture and learns a parametric mapping from clean images to unconstrained perturbed examples.
Gaussian Smoothing. To suppress high-frequency noise and stabilize gradient propagation, a Gaussian smoothing operator $W$ is applied to the generated examples. Specifically, $W$ is implemented as a Gaussian filter with a fixed kernel size and standard deviation. This operation reduces localized artifacts and encourages smoother perturbation patterns, which are empirically more transferable across architectures [18].
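A Gaussian smoothing step of this kind can be sketched in a few lines. The kernel size of 3, the standard deviation of 1.0, and the edge-replication padding used below are placeholder assumptions for illustration, not the settings of this work; the function names are likewise hypothetical.

```python
import math


def gaussian_kernel(size=3, sigma=1.0):
    """Normalized 2-D Gaussian kernel (size and sigma are placeholder values)."""
    half = size // 2
    raw = [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
            for j in range(-half, half + 1)]
           for i in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def smooth(img, kernel):
    """Filter a single-channel image with edge-replication padding."""
    height, width = len(img), len(img[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for i, k_row in enumerate(kernel):
                for j, k_val in enumerate(k_row):
                    rr = min(max(r + i - half, 0), height - 1)
                    cc = min(max(c + j - half, 0), width - 1)
                    acc += k_val * img[rr][cc]
            out[r][c] = acc
    return out


if __name__ == "__main__":
    impulse = [[0.0] * 5 for _ in range(5)]
    impulse[2][2] = 1.0   # an isolated high-frequency spike
    for row in smooth(impulse, gaussian_kernel()):
        print(" ".join(f"{v:.3f}" for v in row))
```

The filter spreads an isolated spike over its neighbors while leaving flat regions untouched, which is the behavior that suppresses localized high-frequency artifacts.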
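Projection and Clipping. The adversarial example must satisfy two constraints: (i) the $\ell_\infty$ perturbation budget relative to the clean image, and (ii) the valid pixel range $[0, 1]$. To enforce these requirements, we first project the smoothed perturbation onto the admissible perturbation set:
$$\tilde{x} = x + \mathrm{clip}\big(W(G_\theta(x)) - x,\; -\epsilon,\; \epsilon\big),$$
which guarantees that
$$\|\tilde{x} - x\|_\infty \le \epsilon.$$
Subsequently, we apply element-wise clipping to obtain a valid image:
$$x_{adv} = \mathrm{clip}\big(\tilde{x},\; 0,\; 1\big),$$
where $\mathrm{clip}(\cdot, 0, 1)$ denotes pixel-wise truncation into the interval $[0, 1]$.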
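The two-step constraint can be sketched on flattened pixel lists as follows. This is a minimal illustration assuming an $\ell_\infty$ budget and a $[0, 1]$ pixel range; the function names and the sample values are hypothetical.

```python
def project_linf(candidate, clean, eps):
    """Clamp the perturbation (candidate - clean) to [-eps, eps] per pixel."""
    return [c + min(max(g - c, -eps), eps) for g, c in zip(candidate, clean)]


def clip_to_range(x, lo=0.0, hi=1.0):
    """Truncate every pixel into the valid interval [lo, hi]."""
    return [min(max(v, lo), hi) for v in x]


def make_adversarial(candidate, clean, eps):
    """Projection onto the l_inf ball followed by pixel-range clipping."""
    return clip_to_range(project_linf(candidate, clean, eps))


if __name__ == "__main__":
    clean = [0.0, 0.5, 1.0, 0.2]
    candidate = [-0.5, 0.45, 0.7, 0.9]   # smoothed generator output (toy values)
    print(make_adversarial(candidate, clean, eps=16 / 255))
```

Since the clean image already lies in $[0, 1]$, the final clipping can only move a pixel closer to its clean value, so the $\ell_\infty$ bound established by the projection is preserved.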
Semantic-Invariant Transformation. To enforce structural consistency, two semantic-preserving transformations are applied to $x_{adv}$: an appearance-domain transformation $T_a$ and a spatial-domain transformation $T_s$. This produces two transformed adversarial samples:
$$x_{adv}^{a} = T_a(x_{adv}), \qquad x_{adv}^{s} = T_s(x_{adv}).$$
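Attack Loss. All adversarial samples are forwarded through the pretrained surrogate model $f$ to obtain logits:
$$z = f(x_{adv}), \qquad z_a = f(x_{adv}^{a}), \qquad z_s = f(x_{adv}^{s}).$$
The total generator loss combines the primary cosine objective and the semantic-invariance regularization:
$$\mathcal{L}_{G} = \mathcal{L}_{cos}(z, t) + \mathcal{L}_{cos}(z_a, t) + \mathcal{L}_{cos}(z_s, t).$$
The generator parameters $\theta$ are optimized via gradient descent to minimize $\mathcal{L}_{G}$. The complete training procedure is summarized in Algorithm 1.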
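| Algorithm 1 Cosine-Based Logit Alignment Algorithm |
| Require: Training data $\mathcal{D}$, pretrained substitute model $f$, perturbation budget $\epsilon$, target class $t$, loss criterion $\mathcal{L}_{cos}$, number of iterations $T$. |
| Ensure: Generator $G_\theta$ |
- 1: Initialize the generator $G_\theta$.
- 2: for $i = 1, \dots, T$ do
- 3: Randomly sample a mini-batch $x \sim \mathcal{D}$.
- 4: Generate unbounded perturbed examples: $\hat{x} = G_\theta(x)$
- 5: Get adversarial examples $x_{adv}$: $x_{adv} = \mathrm{clip}\big(x + \mathrm{clip}(W(\hat{x}) - x, -\epsilon, \epsilon),\, 0,\, 1\big)$
- 6: Obtain transformed adversarial samples $x_{adv}^{a}$, $x_{adv}^{s}$ using Equation (28).
- 7: Forward pass through $f$ and get logits: $z = f(x_{adv})$, $z_a = f(x_{adv}^{a})$, $z_s = f(x_{adv}^{s})$
- 8: Calculate attack loss: $\mathcal{L}_{cos}(z, t)$
- 9: Calculate loss of the generator: $\mathcal{L}_{G} = \mathcal{L}_{cos}(z, t) + \mathcal{L}_{cos}(z_a, t) + \mathcal{L}_{cos}(z_s, t)$
- 10: Update $\theta$ by a gradient descent step on $\mathcal{L}_{G}$.
- 11: end for
- 12: return Generator $G_\theta$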