Article

FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition

1 School of Computer and Cyber Sciences, Communication University of China, Beijing 100024, China
2 School of Information and Intelligent Science, Donghua University, Shanghai 201620, China
3 School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2026, 15(5), 1064; https://doi.org/10.3390/electronics15051064
Submission received: 30 December 2025 / Revised: 12 February 2026 / Accepted: 12 February 2026 / Published: 4 March 2026

Abstract

Scene text recognition (STR) and automatic speech recognition (ASR) translate visual or acoustic signals into linguistic sequences and underpin many modern perception systems. Although their front-ends and decoders differ (e.g., CTC-based, attention-based, or variants), both tasks ultimately rely on deep models that align input frames to output tokens, which exposes a shared vulnerability to adversarial perturbations. Existing attacks commonly optimize global sequence-level objectives. As a result, decisive frames are treated only implicitly, and optimization can become unnecessarily diffuse over long input sequences, hindering convergence and perceptual quality. To address these issues, we propose FLAMA, a unified Frame-Level Alignment Margin Attack applicable to both STR and ASR models. FLAMA explicitly targets alignment through per-frame (or per-step) recognition margins. The design is decoder-agnostic and applies to both CTC-based and attention-based pipelines. It employs a recognition-score-aware Step/Halt gate that concentrates updates on the most critical frames, and a stabilization stage that suppresses late-iteration oscillations to improve optimization stability and perceptual control. Ablation analyses show that stabilization consistently enhances attack success and reduces distortion. We evaluate FLAMA on STR benchmarks (SVT, CUTE80, and IC13) with CRNN, STAR, and TRBA, and on the ASR benchmark LibriSpeech with a Wav2Vec 2.0 model. Across modalities and architectures, FLAMA achieves near-100% attack success while substantially reducing $\ell_2$ distortion and improving perceptual metrics compared with FGSM/PGD baselines. These results highlight frame-level alignment as a shared weak point across visual and audio sequence recognizers and suggest localized margin objectives as a principled route to effective sequence attacks.

1. Introduction

Scene text recognition (STR) [1] and automatic speech recognition (ASR) [2] are two core technologies that translate visual or acoustic signals into structured linguistic sequences, forming the backbone of numerous modern perception systems across applications such as autonomous driving [3], intelligent transportation [4], and voice assistants such as Google Assistant [5]. Thanks to advances in deep learning [6], particularly architectures such as convolutional neural networks (CNNs) [7], recurrent neural networks (RNNs) [8], and more recent models, both STR and ASR have achieved remarkable recognition accuracy and robustness, even surpassing human performance in many real-world scenarios.
Although STR and ASR models adopt many different structures (e.g., CTC-based, attention-based, or variants), both tasks ultimately rely on deep models that align input frames to output tokens, and deep models have been proven vulnerable to adversarial perturbations. As in classification, despite their impressive accuracy, both STR and ASR models have been shown to be susceptible to adversarial examples [9,10,11].
On the visual side, STR models can be fooled by small perturbations that induce drastically different text predictions [10,12,13,14]. On the audio side, an attacker can add carefully crafted but human-imperceptible perturbations to a speech waveform and force an incorrect transcription [11,15,16]. Recently, studies further demonstrate that such attacks can remain effective under realistic constraints, including over-the-air playback [17,18,19], and even novel modalities such as lasers [20]. These findings motivate principled attack mechanisms for evaluating the robustness of sequence recognition models. In this paper, we focus on untargeted attacks in the white-box setting in particular, which provide a direct and conservative stress test for overall model robustness.
A key limitation of many existing attacks for STR [10,13,14,21] and ASR [11,15,16] is that they optimize a global sequence-level adversarial objective, such as the CTC negative log-likelihood or an end-to-end cross-entropy loss. While effective, global objectives treat the sequence as a monolithic target and make it difficult to identify which input frames are most directly responsible for flipping the final transcription. In alignment-based recognizers, however, decisions are rarely uniform over time. CTC decoding often relies on sparse, high-confidence spikes that survive the collapse rule, and attention-based decoding relies on peaked attention distributions at specific steps. In addition, existing works only study the algorithm to attack STR [10,13,14,21] or ASR [11,15,16] models; few works study whether their methods could attack both STR and ASR models.
To address these gaps, we study a novel attack that departs from traditional methods and targets both STR and ASR models. Previous experimental results suggest that perturbing a small subset of decisive frames/steps may be sufficient to change the output, while perturbing the rest is wasteful and may unnecessarily degrade perceptual quality. In addition, prior work in STR has shown the benefit of explicitly modeling alignment at a finer granularity. For example, Yang et al. [12] proposed a Step/Halt (S/H) mechanism that targets step-level margins along the alignment path, focusing optimization on the weakest positions and halting once a flip is achieved. This motivates attacks that explicitly reason about where the recognizer commits to a token, rather than optimizing a single undifferentiated loss.
Our work is inspired by this alignment–margin view, but differs in both scope and technical design. First, we pursue a unified cross-modal attack principle spanning STR and ASR, which requires the attack objective to be formulated directly on frame-wise logits and to remain compatible with different decoders (CTC and attention). Second, to accommodate the collapse-based nature of CTC decoding, we refine the margin tracking and gating criteria so that “critical frames” correspond to those supporting non-blank evidence and symbol transitions, rather than treating all frames equally. Third, we introduce an H-warm-up mechanism that gradually activates the halting gate, improving stability on long sequences and preventing premature stopping before the attack effect reliably propagates through decoding. These refinements are essential for achieving consistently strong untargeted performance and for enabling a subsequent stabilization stage to operate on a meaningful and already-aligned perturbation pattern.
In this paper, we propose FLAMA (Frame-Level Alignment Margin Attack), a unified adversarial framework for attacking sequence recognizers through frame-level alignment margins. FLAMA operates on frame-wise logits and alignment information, making it applicable to both visual and audio modalities. Intuitively, a recognizer first maps the input into frame-level logits and then collapses them into a final transcription through alignment and decoding. In this process, only a subset of frames is decisive, in the sense that changing their local evidence can alter the decoded sequence. FLAMA explicitly tracks margins along the current alignment, identifies the weak frames, and concentrates perturbation updates on these positions. Once the weak margins become negative and remain stable, a recognition-score-aware Step/Halt gate (with H-warm-up) automatically halts further optimization to avoid unnecessary changes to already-flipped regions.
In addition, we incorporate perceptual constraints in a way that is compatible with localized alignment attacks. In the visual domain, imperceptibility is typically reflected by high SSIM and small $\ell_2$ distortion. In the audio domain, psychoacoustic constraints and smoothness regularization have been introduced to improve perceptual metrics such as PESQ and STOI [16]. In FLAMA, we introduce a stabilization stage that suppresses late-iteration oscillations and encourages smoother perturbations, improving perceptual quality without sacrificing attack success. Together, Step/Halt localization and stabilization allow FLAMA to achieve effective untargeted attacks while maintaining strong perceptual control and practical runtime.
We explicitly highlight that proactively studying vulnerabilities serves as an indispensable and complementary pathway to achieving genuine robustness. To build truly reliable and secure STR/ASR systems, one must first understand their failure modes under adversarial conditions. By characterizing these threats, we aim to empower the community to address them and, ultimately, to improve the robustness and security baseline of practical sequence recognition systems.
Our main contributions are summarized as follows:
  • We propose FLAMA, a unified Frame-Level Alignment Margin Attack for untargeted adversarial generation on sequence recognizers. FLAMA localizes optimization by tracking alignment margins and updating only the most critical frames via a recognition-score-aware Step/Halt gate.
  • We extend and refine Step/Halt for cross-modal sequence recognition to improve stability on long sequences and avoid premature halting. We also introduce a stabilization stage that combines smoothness-oriented regularization with perturbation scaling, which suppresses late-iteration oscillations and improves perceptual metrics (PESQ, STOI, SSIM) while preserving attack success.
  • Extensive experiments on STR and ASR benchmarks show near-100% attack success, with substantially reduced perturbation and improved perceptual metrics.
Overall, FLAMA provides a unified and alignment-centric method for probing the robustness of modern sequence recognizers across both visual and acoustic domains.

2. Related Work

This section reviews adversarial attacks on non-sequence recognition (classification), summarizes representative sequence recognition architectures, reviews adversarial attacks on sequence recognition (including STR and ASR models), and discusses perceptual fidelity constraints that aim to preserve human-perceived quality.

2.1. Adversarial Attacks on Non-Sequence Recognition (Classification)

The vulnerability of deep neural networks to adversarial perturbations was first established in image classification [22,23] and has since been extended to other tasks, including sequence recognition [24]. Existing attacks on non-sequence recognition (classification) can be broadly categorized under white-box and black-box threat models. White-box attacks assume access to model parameters and gradients, enabling direct optimization of perturbations, whereas black-box attacks rely on query-based gradient estimation or transfer from surrogate models. Classic white-box algorithms for classification include L-BFGS [22], FGSM [23], BIM [25], PGD [26], APGD [27], etc. Common black-box adversarial attacks for classification include Square attack [28], Surfree [29], CGBA [30], etc. We refer the reader to the review article [31] for more details.
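To make the white-box gradient-based family concrete, the sketch below applies a single FGSM step to a toy two-feature logistic model; the weights, input, and step size are illustrative assumptions, not part of any referenced attack implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_step(x, grad, eps):
    """One FGSM step: move each dimension by eps in the gradient-sign direction."""
    return [xi + eps * (1.0 if g > 0 else -1.0) for xi, g in zip(x, grad)]

# Toy logistic model p(y=1|x) = sigmoid(w.x) with true label 1 (illustrative values).
w = [2.0, -1.0]
x = [0.5, 0.3]
p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
# Gradient of the cross-entropy loss -log p w.r.t. x is -(1 - p) * w.
grad = [-(1.0 - p_clean) * wi for wi in w]
x_adv = fgsm_step(x, grad, eps=0.1)
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))  # confidence drops
```

The sign operation makes the update magnitude uniform across dimensions, which is what gives FGSM its single-step $\ell_\infty$ character.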

2.2. Sequence Recognition Architectures

Sequence recognition models are typically built upon two paradigms: Connectionist Temporal Classification (CTC) [32] and attention-based encoder–decoder architectures [33,34]. Understanding where and how these models form alignments is crucial for designing adversarial attacks.
CTC-based models. CTC [32] is widely adopted in STR (e.g., CRNN [35] and STAR [36]) and ASR (e.g., Wav2Vec 2.0 [37]). Formally, a CTC model outputs a distribution over the vocabulary augmented with a special blank symbol for each frame (STR width position or ASR time step). Training computes the probability of a target label sequence by marginalizing over all valid alignment paths that collapse to the same output after removing blanks and repeated symbols. This dynamic-programming formulation enables learning without frame-level annotations and enforces a monotonic alignment structure. At inference time, decoding is typically performed using greedy decoding or beam search, followed by the collapse operation to obtain the final transcription. Importantly, the collapse rule implies that only a subset of frames/columns that provide non-blank evidence and resolve symbol transitions are decisive for the final output, whereas many other positions are dominated by blanks or redundant repeats. This characteristic explains why CTC systems can be highly sensitive to perturbations that selectively affect a small set of decisive frames, and it supports attacks that focus on the weakest frames instead of spreading perturbations across all positions.
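The collapse rule can be summarized in a few lines; the following is a minimal sketch of greedy CTC post-processing (using `-` as the blank symbol, a notational choice here), showing why only the sparse non-blank "spike" frames are decisive.

```python
def ctc_collapse(path, blank="-"):
    """Greedy CTC post-processing: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many frame-level paths collapse to the same transcript; only the non-blank
# spikes and the transitions between them matter.
ctc_collapse("--c--aa--t-")  # -> "cat"
ctc_collapse("hheell-lloo")  # -> "hello"
```

Perturbing a blank-dominated frame rarely changes the collapsed output, whereas perturbing one of the few frames carrying non-blank evidence can.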
Attention-based models. Attention-based recognition [33,34] is used in STR, such as TRBA (TPS-ResNet-BiLSTM-Attn) [38], and also used in sequence-to-sequence ASR [34]. In an encoder–decoder framework, the encoder maps the input into a sequence (or grid) of feature vectors, and the decoder generates tokens autoregressively. At each decoding step t, the decoder computes attention weights over encoder features to form a context vector, which is combined with the decoder state and the previously generated token to predict the next token. Compared with CTC, attention-based decoding relaxes the conditional independence assumption and can model richer dependencies between output tokens. It also provides a flexible mechanism for localization, which is beneficial for irregular text in STR and for long-range contextual modeling in ASR. However, the alignment in attention models is still concentrated: a small number of highly attended regions or steps often dominate the next-token decision. Because decoding is autoregressive, an early alignment shift may propagate to later steps and amplify transcription errors. These properties imply that attacks that perturb a few salient regions or decisive decoding steps can steer the attention distribution and change the transcription without requiring large global distortion.
Implications for alignment-aware attacks. CTC and attention differ in how alignments are represented (explicit path marginalization versus implicit attention weights), but both concentrate decision-making on limited critical positions. CTC concentrates probability mass into sparse non-blank evidence over frames, while attention concentrates mass into peaked attention distributions over features and steps. This shared concentration phenomenon provides a unifying rationale for frame-/step-level margin objectives across modalities and architectures.

2.3. Adversarial Attacks on Sequence Recognition

Attacks on scene text recognition (STR). When considering attacking STR models, a straightforward baseline is to transfer image attacks such as FGSM and PGD to text images, by optimizing a global objective under an $\ell_p$ constraint. However, STR introduces additional complexity because the output is a variable-length sequence, and the prediction depends on an input–output alignment process. Beyond direct transfer, STR-specific iterative attacks have explored how to balance multiple objectives (e.g., misrecognition versus distortion) and how to exploit the alignment structure of sequence decoders. Yuan et al. [21] introduced Adaptive Attack, which formulates adversarial generation as a multi-task optimization problem and learns the relative weights via task uncertainty, reducing the need for manual hyper-parameter tuning. Yang et al. [12] further highlighted that explicitly alignment-aware objectives can significantly reduce the perturbation needed to alter sequence decoding, and proposed cost-sensitive formulations tailored to untargeted STR attacks.
Attacks on automatic speech recognition (ASR). Early work demonstrated that adversarial examples can be constructed for CTC-based ASR systems by directly optimizing the waveform [15,39]. Later studies extended the threat models to other settings, such as universal perturbations [40,41,42]. Despite their effectiveness, many ASR attacks are still formulated with coarse sequence-level surrogates (e.g., CTC loss or cross-entropy), which can obscure which frames dominate the decoding outcome. Notably, Carlini and Wagner observed that different parts of a phrase can have very different attack difficulty, and introduced per position weighting to avoid over-optimizing easy segments at the expense of imperceptibility [15]. This line of evidence suggests that local (frame/step) contributions are highly non-uniform in sequence recognition and motivates attacks that explicitly localize optimization to decisive moments.
A shared alignment-driven attack surface. Although STR and ASR differ in modality and preprocessing, both must ultimately align input evidence to output tokens. In practice, the final transcription is often dominated by a small subset of decisive frames (CTC spikes) or decoding steps (attention peaks), while many other positions contribute weakly. This decision concentration creates a common attack surface: manipulating a few critical alignments can disproportionately change the output sequence. These observations motivate our frame-level alignment margin view as a unified principle for attacking both speech and scene text recognition.
In this work, we focus on the white-box setting to characterize worst-case robustness and to better expose architectural weaknesses shared by sequence recognizers.

2.4. Perceptual Quality and Fidelity Constraints

For adversarial examples to be practical, attack success alone is insufficient. Perturbations should remain imperceptible or unobtrusive to humans. In the visual domain (STR), fidelity is typically encouraged by constraining the $\ell_p$-norm of perturbations and preserving structural similarity, for example, by maximizing SSIM between adversarial and original images. In the audio domain, fidelity is more challenging due to the sensitivity of human hearing [43]. Prior work has integrated psychoacoustic masking thresholds and related perceptual constraints to hide perturbations under human perception [16,44], and has also employed time-domain smoothness regularization to suppress high-frequency artifacts and improve perceptual metrics such as PESQ and STOI.
Beyond the definition of fidelity metrics, the optimization trajectory also affects perceptual quality. Uniformly updating all frames can introduce unnecessary artifacts, especially for long sequences in which many frames contribute weakly to the decoded output. In contrast, step-/frame-selective updates can reduce wasted perturbations by concentrating changes on decisive alignments, and stability-aware procedures can avoid late-iteration oscillations that degrade quality. This perspective complements perceptual constraints: high-fidelity adversarial generation benefits from both appropriate fidelity objectives and localized, stable optimization procedures. In this work, we incorporate modality-appropriate fidelity terms into a frame-level optimization framework, so that perturbations are both localized to decisive alignments and perceptually controlled.

3. Methodology

3.1. Problem Definition

Given an input sample $x$ (image or audio) and its ground-truth transcription $y$, we study an untargeted white-box adversarial setting; that is, we assume full access to the model's structure, parameters, and gradients. Our goal is to construct an adversarial example $x' = x + \delta$ such that the recognizer produces an incorrect output, i.e., $M(x') = \hat{y} \neq y$, while keeping the perturbation $\delta$ small so that $x'$ remains perceptually close to $x$. We list all the notation and parameters used in the rest of this paper in Table A1.
Following the spirit of C&W-style optimization-based attacks [45], we formulate the core objective for sequence models as finding the optimal perturbation that induces misrecognition:
$$\delta^{*} = \arg\min_{\delta}\; \|\delta\|_2^2 + c \cdot L_{adv}(x+\delta,\, y), \quad \mathrm{s.t.}\; x+\delta \in \mathcal{D},$$
where the primary objective is to find the optimal perturbation $\delta^{*}$. The term $\|\delta\|_2^2$ minimizes the squared $\ell_2$ norm of the adversarial noise, ensuring the perturbation is small in energy and thus less perceptible. The adversarial loss $L_{adv}$ is the key adaptation for sequence models; in practice, we implement it using a Connectionist Temporal Classification (CTC) loss or a cross-entropy loss over the sequence decoder's output distributions. The trade-off coefficient $c$ is determined per input via an automated binary search procedure. $\mathcal{D}$ denotes the valid input domain (e.g., $[-1, 1]$ after normalization).

3.2. Framework

The overall framework of the proposed FLAMA is illustrated in Figure 1. FLAMA iteratively updates the perturbation based on frame-level margins, utilizing a Step/Halt gate to freeze optimized frames and a stabilization stage to remove high-frequency artifacts, until it reaches the maximum number of iterations or the attack succeeds.
Specifically, as shown in Figure 1, the framework consists of three stages with five main components. The original input (image or audio) is first fed to the STR/ASR model (CTC- or attention-based), from which we obtain the frame alignment and identify the Top-k frames. Stage 1 then combines three components: a recognition confidence score S, which modulates optimization pressure across samples by measuring the confidence of the predicted sequence label; a frame-/step-level margin objective L_margin that explicitly targets the alignment process; and a differentiable Step/Halt gate H that concentrates updates on critical positions and suppresses redundant optimization once the attack becomes stable. If the attack does not succeed, these operations continue. Once the attack succeeds, Stage 2, the stabilization stage, improves perturbation smoothness and perceptual quality while further shrinking the perturbation magnitude without breaking attack success. In Stage 3, a binary-search scaling procedure rescales the perturbation to obtain a minimal adversarial solution while preserving attack effectiveness. Finally, we obtain the adversarial image/audio.

3.3. Stage 1: Initial Generation

3.3.1. Recognition Confidence Score

At Stage 1, we first measure the confidence $S$ of the ground-truth sequence label to determine the attack strength. As STR and ASR are both sequence prediction tasks in which decoding relies on aligning input evidence to output tokens, we unify CTC- and attention-based recognizers by viewing them as producing a sequence of log-probability vectors $\{z_t\}_{t=1}^{T}$, where $z_t \in \mathbb{R}^{|\mathcal{V}|}$ denotes the frame- or step-wise log-probabilities over the vocabulary $\mathcal{V}$ (i.e., $z_t = \log\operatorname{softmax}(\cdot)$) at time index $t$.
For CTC-based models (e.g., CRNN/STAR for STR and Wav2Vec 2.0 for ASR), t corresponds to an input frame (time frame in audio or width-wise step in text images), and V includes a special blank symbol. For attention-based models (e.g., TRBA for STR), t corresponds to an autoregressive decoding step and z t is the decoder output distribution at step t.
We modulate the attack strength through the recognition confidence score $S$ for each sample. Instead of multiplying frame-wise probabilities (which can underflow on long sequences), we compute $S$ from the mean log-probability along the reference path:
$$S = \exp\!\left(\frac{1}{T}\sum_{t=1}^{T} z_t[y_t]\right),$$
where $z_t[y_t]$ is the log-probability of the reference (correct) token $y_t$ at step $t$, and $T$ is the total number of time steps for the input $x$.
A larger $S$ indicates higher model confidence on the clean decision trajectory; we thus apply stronger adversarial pressure to high-confidence samples.
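A minimal sketch of Equation (2), assuming per-frame log-probabilities are available as token-indexed dictionaries (a simplification of the model's $|\mathcal{V}|$-dimensional log-softmax output):

```python
import math

def confidence_score(log_probs, ref_tokens):
    """S = exp(mean log-probability along the reference path), Equation (2).

    log_probs: one token -> log-probability mapping per frame/step.
    ref_tokens: the reference token y_t at each index t.
    """
    T = len(ref_tokens)
    mean_lp = sum(lp[y] for lp, y in zip(log_probs, ref_tokens)) / T
    return math.exp(mean_lp)
```

If every reference token has probability 0.9, then $S = 0.9$ regardless of sequence length, which is exactly the underflow-free behavior the mean formulation provides over a product of probabilities.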

3.3.2. Frame-Level Margin Attack Loss

We also calculate the frame-level margin attack loss to prepare for subsequent gradient backpropagation.
Reference token sequence. We do not require forced alignment to the ground-truth transcript. Instead, we define a reference token at each index t using the model’s original output on the clean input:
$$y_t = \arg\max_{y \in \mathcal{V}} z_t(y; x),$$
where $z_t(y; x)$ denotes the logit score of token $y$ at index $t$ on input $x$ (equivalently, the log-probability after log-softmax for the $\arg\max$); $\mathcal{V}$ denotes the vocabulary set comprising all possible output tokens predictable by the recognition model.
This operation naturally matches the untargeted goal by explicitly pushing the model away from its current frame-/step-wise decision boundary afterwards.
Frame- and step-wise margin. Given the reference token $y_t$, we define the margin at time step $t$ as:
$$d_t = z_t[y_t] - \max_{y \in \mathcal{V}_{comp}(t)} z_t[y],$$
where $\mathcal{V}_{comp}(t)$ is the competitor set.
Note that we set $\mathcal{V}_{comp}(t) = \mathcal{V} \setminus \{y_t\}$, and for CTC models we further exclude the blank token:
$$\mathcal{V}_{comp}(t) = \begin{cases} \mathcal{V} \setminus \{y_t, \text{blank}\} & \text{for CTC-based models}, \\ \mathcal{V} \setminus \{y_t\} & \text{for attention-based models}, \end{cases}$$
where the operator $\setminus$ denotes the set difference, which excludes the reference token $y_t$ (and, for CTC, the blank) from the vocabulary set $\mathcal{V}$.
This design avoids over-exploiting blank-dominant behaviors. In CTC, pushing many positions toward blank may not reliably change the collapsed transcript due to the collapse rule (removing blanks and merging repeats). Moreover, such “blank shortcuts” can interact unfavorably with gating terms that down-weight the adversarial pressure once local margins flip, causing the optimization to be dominated by the l 2 term. Excluding blank from competitors encourages more informative confusions among non-blank tokens, which is more consistent with our margin-driven gating.
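The margin of Equations (4) and (5), including the CTC blank exclusion, can be sketched as follows; representing frame logits as a token-indexed dictionary is an illustrative simplification:

```python
def frame_margin(logits_t, ref_token, blank=None):
    """d_t from Equation (4): reference score minus the best competitor score.

    For CTC models, pass the blank symbol so it is excluded from the
    competitor set, as in Equation (5).
    """
    competitors = {tok: z for tok, z in logits_t.items() if tok != ref_token}
    if blank is not None:
        competitors.pop(blank, None)  # avoid rewarding "blank shortcuts"
    return logits_t[ref_token] - max(competitors.values())
```

With blank excluded, a high blank score cannot dominate the competitor maximum, so the optimization is pushed toward informative confusions among non-blank tokens.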
Margin loss. We then penalize the positive margins along the reference path as follows:
$$L_{\text{margin}} = \sum_{t=1}^{T} \max(d_t, 0),$$
where $T$ is the total number of time steps produced by the model.
Minimizing $L_{\text{margin}}$ reduces the separation between the reference token and its strongest competitor at critical indices, weakening the original alignment scores and inducing decoding changes under small perturbations. Minimizing Equation (6) involves solving a constrained optimization problem; we employ the classical gradient descent approach, adapted for our objective.
Although Equation (6) is defined over all indices, FLAMA does not update all positions uniformly in practice. The subsequent Step/Halt mechanism selects and emphasizes the Top-k most vulnerable margins, preventing the optimization from becoming overly diffuse on long sequences.
On long sequences or difficult samples, the most vulnerable positions may drift as perturbations evolve. When attack success is low or the optimization stalls, we optionally refresh the reference sequence y t every R iterations using the current perturbed logits, so that the margin objective continues to track the active decoding trajectory rather than a stale alignment. This lightweight heuristic can improve convergence stability and attack success in challenging cases.

3.3.3. Step/Halt Dynamic Gating Mechanism

To improve attack efficiency and reduce redundant distortion after crossing the decision boundary, FLAMA dynamically modulates adversarial pressure using a differentiable Step/Halt (S/H) gate [12]. The key idea is to monitor the most vulnerable margins and down-weight the adversarial term once these critical margins become non-positive, so that the optimizer does not keep injecting unnecessary perturbations.
Let $\{d_t\}_{t=1}^{T}$ be the per-step margins defined in Equation (4). We first select the indices of the Top-k smallest margins:
$$K = \arg\min_{\Omega \subseteq \{1,\dots,T\},\, |\Omega| = k}\; \sum_{t \in \Omega} d_t,$$
where $\Omega$ denotes a subset of time indices of size $k$. The resulting set $K$ thus contains the indices of the $k$ smallest elements in the margin sequence $\{d_t\}_{t=1}^{T}$.
We then compute a smooth surrogate of the minimum margin within the selected set $K$ using a LogSumExp-based soft-min:
$$\tilde{d}_{\min} = \operatorname{softmin}_{\beta}\{d_i\}_{i \in K} = -\frac{1}{\beta}\log\sum_{i \in K}\exp(-\beta d_i),$$
where $\beta > 0$ controls the sharpness (larger $\beta$ is closer to $\min_{i \in K} d_i$). The gate is defined as
$$H_{\text{raw}} = \sigma(\kappa\, \tilde{d}_{\min}),$$
where $\sigma(\cdot)$ is the sigmoid function and $\kappa$ controls the transition steepness.
When the critical positions are not yet flipped, $\tilde{d}_{\min} > 0$ and thus $H_{\text{raw}} \approx 1$, maintaining strong adversarial pressure. As optimization progresses and the critical margins become non-positive ($\tilde{d}_{\min} \leq 0$), $H_{\text{raw}}$ decreases toward 0, suppressing further updates that would mainly increase distortion.
Warm-up and floor. In practice, to avoid shrinking gradients too early, we introduce a warm-up schedule that keeps $H$ close to 1 during the initial iterations and gradually enables the halting behavior:
$$H_{\text{warm}} = 1 + (H_{\text{raw}} - 1)\cdot \min\!\left(1, \frac{r}{R_w}\right),$$
where $r$ is the iteration index and $R_w$ is the warm-up length. Then,
$$H = \max(H_{\text{warm}}, H_{\min}),$$
where $H_{\min}$ is a small floor that prevents the adversarial term from vanishing completely.
Together with the Top-k selection, this gate encourages the optimization to focus on a small number of decisive positions and to reduce redundant updates after the attack becomes stable.
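Putting Equations (7)–(11) together, a plain-Python sketch of the gate value at iteration $r$ could look like the following; the hyper-parameter defaults are illustrative, not the paper's tuned values.

```python
import math

def step_halt_gate(margins, k=3, beta=10.0, kappa=5.0, r=0, warmup=20, h_min=0.05):
    """Step/Halt gate H of Equations (7)-(11) (hyper-parameters illustrative)."""
    topk = sorted(margins)[:k]                        # k smallest margins (Eq. 7)
    # LogSumExp soft-min (Eq. 8): approaches min(topk) as beta grows.
    d_min = -(1.0 / beta) * math.log(sum(math.exp(-beta * d) for d in topk))
    h_raw = 1.0 / (1.0 + math.exp(-kappa * d_min))    # sigmoid gate (Eq. 9)
    h_warm = 1.0 + (h_raw - 1.0) * min(1.0, r / warmup)  # warm-up (Eq. 10)
    return max(h_warm, h_min)                         # floor (Eq. 11)
```

Early on ($r = 0$) the gate is exactly 1 regardless of the margins; once warm-up completes, flipped (negative) critical margins drive the gate down to the floor, suppressing further distortion.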

3.3.4. Comprehensive Objective Function

Combining the margin objective, the confidence score, and the Step/Halt gate, the adversarial loss L 1 in the main attack stage is:
$$L_1 = S \cdot H \cdot L_{\text{margin}},$$
where $L_{\text{margin}}$ is defined in Equation (6). Here, $S$ (Equation (2)) provides sample-level modulation, while $H$ (Equation (11)) suppresses redundant optimization once the most vulnerable margins are flipped.
The overall loss in the main attack stage is:
$$L(\delta) = \|\delta\|_2^2 + c \cdot L_1 = \|\delta\|_2^2 + c \cdot S \cdot H \cdot L_{\text{margin}},$$
where L 1 is defined in Equation (12). The coefficient c is a hyper-parameter, which balances attack effectiveness and distortion.
In practice, we use a simple binary search to reduce sensitivity to $c$. We maintain bounds $(c_{\text{lo}}, c_{\text{hi}})$ and run the attack with the current $c$; if the attack succeeds, we tighten the upper bound and decrease $c$ (e.g., $c \leftarrow (c_{\text{lo}} + c_{\text{hi}})/2$); otherwise, we raise the lower bound and increase $c$. This procedure typically yields a more reasonable $c$ while maintaining attack success.
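The binary search over $c$ can be sketched as below; `run_attack` is a hypothetical callable standing in for one full Stage 1 optimization at a fixed $c$, returning a success flag and the resulting perturbation.

```python
def search_tradeoff_c(run_attack, c_lo=1e-3, c_hi=1e3, iters=10):
    """Binary search on the trade-off coefficient c (Section 3.3.4).

    run_attack(c) -> (success, delta) performs one Stage 1 attack at a fixed c.
    On success we shrink c (less distortion); on failure we raise it.
    """
    best = None
    for _ in range(iters):
        c = (c_lo + c_hi) / 2.0
        success, delta = run_attack(c)
        if success:
            best = (c, delta)  # keep the smallest c that still succeeds
            c_hi = c
        else:
            c_lo = c
    return best
```

Because each probe requires a full attack run, the number of iterations is kept small; the search returns the last successful $(c, \delta)$ pair rather than an exact threshold.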
After finding an adversarial noise $\delta_0$ that successfully attacks the target model in Stage 1, we perform the stabilization stage (Stage 2) to improve its quality.

3.4. Stage 2: Stabilization Stage (Total Variation Smoothing with Success-Keeping)

While Stage 1 achieves misrecognition with low $\ell_2$ energy, the resulting perturbation may still contain undesirable high-frequency components (e.g., rapid local variations in audio waveforms or spatially jagged patterns in images). We therefore apply a stabilization stage (Stage 2) to a successful perturbation $\delta_0$ to improve perceptual quality while maintaining attack success.
Starting from $\delta_0$, we refine the perturbation by optimizing:
$$\delta^{*} = \arg\min_{\delta} L_{\text{stab}}(\delta) = \arg\min_{\delta}\left(\lambda_{\text{tv}}\, L_{\text{tv}}(\delta) + J(\delta;\tau)\right),$$
where $L_{\text{stab}}$ is a stabilization objective that combines a Total Variation regularizer $L_{\text{tv}}$ [46], which suppresses high-frequency artifacts (smoothing), with a margin-keeping term $J(\delta;\tau)$, which ensures the attack remains successful. Both terms are defined in detail below.
$L_{\mathrm{tv}}$ is the Total Variation (TV) regularizer. For images, we use anisotropic 2D TV on the perturbation, where $\delta_{i,j}$ denotes the perturbation value at spatial coordinates $(i, j)$ (i.e., the $(i,j)$-th element of the matrix $\delta$):
$$L_{\mathrm{tv}}^{\mathrm{img}}(\delta) = \sum_{i,j} \big( |\delta_{i+1,j} - \delta_{i,j}| + |\delta_{i,j+1} - \delta_{i,j}| \big).$$
For audio, TV reduces to a 1D difference penalty across adjacent time samples, where $\delta_i$ denotes the $i$-th time sample of the perturbation vector $\delta$:
$$L_{\mathrm{tv}}^{\mathrm{aud}}(\delta) = \sum_{i=1}^{N-1} |\delta_{i+1} - \delta_i|.$$
As Equations (15) and (16) show, minimizing $L_{\mathrm{tv}}$ suppresses sharp local changes and encourages a smoother perturbation that is more consistent with natural signals.
The margin-keeping term $J(\delta; \tau)$ enforces a $\tau$-margin ($\tau > 0$) to maintain attack success:
$$J(\delta; \tau) = \sum_{t=1}^{T} \max\big( d_t(x + \delta) + \tau,\ 0 \big),$$
where $\tau$ is a hyper-parameter.
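A minimal PyTorch sketch of this stabilization objective, combining Equations (15)-(17), is given below; the default values of `lam_tv` and `tau` are illustrative, not values reported in the paper.

```python
import torch

def tv_image(delta):
    """Anisotropic 2D TV on an image perturbation of shape [H, W] (Eq. 15)."""
    dv = (delta[1:, :] - delta[:-1, :]).abs().sum()   # vertical differences
    dh = (delta[:, 1:] - delta[:, :-1]).abs().sum()   # horizontal differences
    return dv + dh

def tv_audio(delta):
    """1D TV on a waveform perturbation of shape [N] (Eq. 16)."""
    return (delta[1:] - delta[:-1]).abs().sum()

def stabilization_loss(delta, margins, lam_tv=0.1, tau=0.05, image=True):
    """L_stab = lam_tv * L_tv + J(delta; tau) (Eq. 14).

    margins: frame-level margins d_t(x + delta); a successful attack keeps
    each margin below -tau, so J = sum_t max(d_t + tau, 0) penalizes any
    margin that drifts back toward the decision boundary while TV smooths
    the perturbation.
    """
    l_tv = tv_image(delta) if image else tv_audio(delta)
    j = torch.clamp(margins + tau, min=0.0).sum()     # margin-keeping (Eq. 17)
    return lam_tv * l_tv + j
```

Both terms are differentiable almost everywhere, so Stage 2 can reuse the same optimizer as Stage 1, starting from $\delta_0$.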

3.5. Stage 3: Minimal Feasible Scaling

After smoothing, we further reduce the perturbation magnitude by searching for the smallest scaling factor $\gamma \in (0, 1]$ that keeps the attack successful:
$$\gamma^* = \arg\min \big\{ \gamma \in (0, 1] \ \big|\ M(x + \gamma \delta') \neq y \big\},$$
where $\delta'$ is the output of Stage 2.
We solve Equation (18) with a binary search over $\gamma$, which is valid because attack success is monotonic in the scaling factor. We initialize the search range as (0, 1] and, in each iteration, evaluate the perturbed input scaled by the current $\gamma$: if the attack remains successful, we lower the upper bound to the current $\gamma$; otherwise, we raise the lower bound. Repeating this for a fixed number of iterations locates the minimal feasible $\gamma$.
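The Stage 3 search can be sketched as follows; `is_success` is a hypothetical callable that checks $M(x + \gamma \delta') \neq y$ for a given $\gamma$.

```python
def minimal_scale(is_success, iters=20):
    """Binary search for the minimal gamma in (0, 1] that keeps the attack
    successful (Eq. 18). is_success(gamma) is assumed to evaluate whether
    M(x + gamma * delta') != y; success is monotonic in gamma.
    """
    lo, hi = 0.0, 1.0
    assert is_success(1.0), "Stage 2 output must already be a successful attack"
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if is_success(mid):
            hi = mid   # attack still succeeds: try an even smaller scale
        else:
            lo = mid   # attack fails: the minimal gamma lies above mid
    return hi          # minimal feasible scaling factor gamma*
```

After `iters` iterations the returned $\gamma$ is within $2^{-\text{iters}}$ of the true minimum, at the cost of one model evaluation per iteration.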

4. Experiments

4.1. Models and Datasets

We evaluate FLAMA on two representative sequence recognition tasks: scene text recognition (STR) and automatic speech recognition (ASR).
  • STR models. We consider three widely used architectures that instantiate the four-stage STR framework [38]:
    • CRNN [35]: A VGG-like convolutional encoder, BiLSTM sequence modeling, and a CTC decoder.
    • STAR [36]: A spatial-transformer-based rectification front-end with a ResNet-like backbone and CTC-style decoding (also known as STAR-Net).
    • TRBA [38]: A TPS rectification module, a ResNet encoder, BiLSTM sequence modeling, and an attention-based decoder.
These models cover both CTC-based and attention-based STR design choices.
STR datasets. We evaluate on three popular benchmarks: SVT [1], CUTE80 [47] (80 scene images, from which 288 word images are cropped [35]), and IC13 [48]. Following [12], input images are resized to a fixed height and width and normalized before being fed into the model.
ASR models. We adopt the self-supervised Wav2Vec 2.0 Base model fine-tuned with a CTC head on LibriSpeech [37,49].
ASR datasets. We conduct experiments on a subset of LibriSpeech test-clean [49]. Specifically, we select utterances with durations between 3 and 8 s, resulting in 989 evaluation samples, and resample all waveforms to 16 kHz if necessary.

4.2. Implementation

Our STR experiments were conducted on a workstation equipped with an Intel Core i9-13900K CPU, 64 GB of RAM, and a single NVIDIA GeForce RTX 3060 GPU (12 GB VRAM), running Windows 11. The software environment was managed with Python 3.13 and PyTorch 2.9.0, built with CUDA 13.0 and cuDNN for GPU acceleration. All critical random seeds were fixed to 1 to ensure reproducibility. Unless otherwise stated, we use the Adam optimizer [50]. The ASR experiments were conducted on a laptop with an NVIDIA GeForce RTX 3050 GPU (Ampere architecture, 4 GB VRAM) and a driver supporting CUDA 12.7.
For adversarial attacks on STR, we set the batch size to one, use a learning rate of 0.1, and run 150 steps for iterative methods. For adversarial attacks on ASR, we use a learning rate of 0.01 and run 200 iterations; decoding is performed with beam search (beam width = 10) without an external language model. For each method, we report both attack success and distortion/perceptual metrics (e.g., l 2 , SNR, SSIM/PESQ/STOI), rather than enforcing an identical distortion bound across all methods.
Before reporting adversarial results, we summarize the clean recognition accuracy of the three STR models on SVT, CUTE80, and IC13 in Table 1.
The clean accuracies indicate that TRBA and STAR achieve comparable performance across the datasets, while CRNN is relatively weaker, especially on CUTE80. This gap should be kept in mind when interpreting adversarial robustness.

4.3. Threat Model and Baselines

We adopt a white-box threat model for both STR and ASR: the attacker has full access to model parameters, logits, and gradients. Unless otherwise stated, we focus on untargeted attacks. We first evaluate the model on clean inputs and only attack clean-correct samples to avoid ambiguity caused by already-misrecognized inputs. For untargeted attacks, a trial is counted as successful if the adversarial example changes the transcription w.r.t. the ground-truth label, i.e., M ( x ) y , and we report the attack success rate over the clean-correct subset.
For STR, perturbations are applied to normalized input images, and the adversarial image is clipped to the valid range at each step during optimization. For ASR, perturbations are added directly to the time-domain waveform, and the perturbed waveform is clipped to the valid amplitude range after each update.
To validate the effectiveness of FLAMA, we compare it with representative baselines. Specifically, we include two generic gradient-based baselines:
  • FGSM (Fast Gradient Sign Method) [23]: A single-step method that perturbs the input along the gradient sign of a sequence-level loss (with an l budget for STR in our implementation).
  • PGD (Projected Gradient Descent) [26]: A multi-step iterative baseline that updates the perturbation and projects it back to the feasible set each step (also using an l budget for STR in our implementation).
In addition, for CTC-based STR models (CRNN/STAR), we also compare against CE-ASTR [12], an STR-specific, cost-effective attack baseline.
FLAMA is evaluated under the same model and data settings for fairness in all experiments, and we compare methods by jointly considering success rate, distortion magnitude, perceptual quality, and runtime.

4.4. Evaluation Metrics

4.4.1. STR Metrics

For STR experiments on TRBA, CRNN, and STAR, we report:
  • SR (success rate): The percentage of clean-correct images that become misrecognized under attack, i.e., M ( x ) y (higher is better for the attacker).
  • L 2 : The average l 2 norm of the image-domain perturbation (computed on normalized inputs), where lower values indicate smaller distortion.
  • SSIM: The Structural Similarity Index between original and adversarial images, capturing perceptual similarity in luminance, contrast, and structure (higher is better).
  • ED (edit distance): The average Levenshtein edit distance between the adversarial prediction and the ground-truth transcript, computed on the same clean-correct subset (lower indicates milder transcription corruption). Since the clean prediction matches y on this subset, ED also reflects the deviation from the original correct prediction.
Overall, for STR, we seek adversarial examples that achieve high SR with small l 2 distortion and high SSIM.
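For reference, the ED metric can be computed with a standard Levenshtein dynamic program; the sketch below is a generic single-row implementation, not the authors' evaluation code.

```python
def edit_distance(pred, gt):
    """Levenshtein distance between the adversarial prediction and the
    ground-truth transcript, as used for the ED metric."""
    m, n = len(pred), len(gt)
    dp = list(range(n + 1))            # dp[j] = distance(pred[:0], gt[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i         # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (pred[i - 1] != gt[j - 1]))  # substitution
            prev = cur
    return dp[n]
```

On the clean-correct subset the clean prediction equals the ground truth, so this distance simultaneously measures how far the adversarial output drifts from the original correct prediction.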

4.4.2. ASR Metrics

For audio-based experiments on Wav2Vec 2.0, we report:
  • SR (success rate): The percentage of utterances for which the final transcription differs from the ground-truth transcript, i.e., M ( x ) y .
  • L 2 : The average l 2 norm of the perturbation.
  • SNR (signal-to-noise ratio): The ratio of signal power to perturbation power in decibels (higher is better).
  • PESQ (perceptual evaluation of speech quality) [51]: An objective speech-quality metric that correlates with human perception (higher is better).
  • STOI (short-time objective intelligibility) [52]: An objective measure of speech intelligibility (higher is better).
  • Time: The average end-to-end wall-clock time per utterance in our implementation, including waveform I/O, decoding, attack optimization, and metric computation.
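As a reference for the distortion metrics above, the SNR entry can be computed from the clean waveform and the perturbation. The paper does not spell out the formula, so this sketch assumes the standard power-ratio definition in decibels.

```python
import numpy as np

def snr_db(signal, perturbation):
    """Signal-to-noise ratio in decibels between the clean waveform and the
    adversarial perturbation: 10 * log10(P_signal / P_noise)."""
    p_sig = np.mean(np.asarray(signal, dtype=np.float64) ** 2)
    p_noise = np.mean(np.asarray(perturbation, dtype=np.float64) ** 2)
    return 10.0 * np.log10(p_sig / p_noise)
```

A higher SNR means the perturbation carries less power relative to the speech signal; a perturbation at 1% of the signal's amplitude corresponds to 40 dB.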

4.5. Attack Performance

We now present the main attack results. Consistent with our setup, we first discuss STR results, followed by ASR findings.

4.5.1. STR Attack Results

We evaluate FLAMA on models of CRNN, STAR, and TRBA. Table 2 presents the combined results on SVT and CUTE80.
Quantitative Analysis. As shown in Table 2, FLAMA consistently achieves high attack success rates comparable to or better than baselines across all architectures on SVT, CUTE80, and IC13, but with substantially lower perturbation energy. For instance, on the attention-based TRBA model (CUTE80), FLAMA reduces L 2 distortion from 4.04 (PGD) to 0.56, while maintaining a 100% success rate.
Perceptual Study. We designed a perception experiment involving 20 participants (recruited via a university participant pool). Each participant was shown 50 image/speech pairs (clean vs. perturbed) in randomized order and asked to rate each pair on a 5-point Likert scale ("Can you perceive any difference between these two samples?"; 1 = no difference, …, 5 = very obvious difference). As shown in Figure 2, our method receives lower perception scores than FGSM, PGD, and CE-ASTR, indicating that its perturbations are less perceptible.
Qualitative Analysis. To visualize this, Figure 3 displays eight pairs of original and adversarial images, highlighting the imperceptibility of our attack: although the adversarial examples in (b) are visually indistinguishable from the originals, they successfully mislead the recognition models. To further show where the perturbation concentrates, Figure 4 provides a three-row qualitative comparison with stroke-aware overlays and width-wise perturbation profiles.
Failure Analysis. Despite achieving a near-perfect attack success rate (>99%), FLAMA fails on a very small subset of samples (e.g., ∼0.6% on IC13). Our analysis reveals that these failures primarily occur on short text instances (3–4 characters) comprising high-frequency words (e.g., “the”, “this”), where recognition models appear to have learned highly robust features. Representative failure cases are visualized in Figure 5.

4.5.2. ASR Attack Results

Quantitative Analysis. Table 3 summarizes the attack performance of FGSM, PGD, and FLAMA on the Wav2Vec 2.0 CTC model, evaluated on a 989-utterance subset of LibriSpeech test-clean (see Section 4.3 for the untargeted setting). The success rate (SR) is defined w.r.t. the ground-truth transcript, i.e., M ( x ) y .
As shown in Table 3, FLAMA achieves a 100% SR, slightly surpassing PGD and far outperforming FGSM. More importantly, FLAMA dramatically reduces the perturbation: the average L 2 norm drops to 0.60, more than 3× smaller than that of PGD (2.12). This improvement is also reflected in SNR, where FLAMA reaches 40.66 dB compared to 18.96 dB for PGD. In terms of perceptual quality, PESQ increases from 1.59 (PGD) to 3.28 and STOI from 0.94 to 0.98, indicating that FLAMA yields adversarial examples that remain much closer to the original audio for human listeners. While FLAMA is slower than FGSM due to iterative optimization, its time cost remains moderate and is substantially lower than that of PGD.
Qualitative Analysis. Figure 6 provides a qualitative spectrogram comparison of both the perturbations and the resulting adversarial audio across methods: FLAMA produces substantially weaker and less structured perturbation patterns, leading to a cleaner adversarial spectrogram.
This close perceptual similarity is also evident in the time domain: in Figure 7, the adversarial waveform closely overlaps the original, making the perturbation hard to perceive visually.
Quality–efficiency Trade-off Analysis. Figure 8 further summarizes the quality–efficiency trade-off, where each method is visualized as a bubble with size proportional to SR. FLAMA consistently occupies a favorable region in this trade-off space, achieving near-perfect success and high quality with a moderate time cost in practice.
Brief Summary. Taken together, the STR and ASR results demonstrate that FLAMA is highly effective (near-100% success) while substantially reducing perturbation energy and preserving perceptual quality across both visual and acoustic sequence recognition tasks.

4.6. Ablation Study

We conduct ablation studies to quantify the contribution of key FLAMA components to both STR and ASR.

4.6.1. STR Ablation

We examine the effect of the stabilization stage (Stage 2) and scaling stage (Stage 3) on STR by comparing FLAMA-A (w/o stabilization) with the full FLAMA.
Following the main STR evaluation, we report L 2 and SSIM, and additionally include the edit distance (ED) between the adversarial prediction and the ground-truth transcript (a smaller ED indicates a more subtle character-level change while still achieving misrecognition). Note that all variants reach 100% SR on the clean-correct subset in this study; thus, we focus on distortion and output-change severity.
From Table 4, the stabilization stage consistently improves visual fidelity (lower L 2 and higher SSIM). It also tends to reduce ED, suggesting that FLAMA often achieves misrecognition with fewer character edits (i.e., a less drastic change in the decoded sequence), while keeping the perturbation visually negligible.

4.6.2. ASR Ablation

For ASR, we evaluate the following variants on Wav2Vec 2.0: (1) w/o stabilize, removing the stabilization stage after a successful adversarial example is found; (2) w/o refresh, disabling alignment refresh during optimization; and (3) w/o warm-up, enabling the S/H gate from the first iteration. Table 5 summarizes the results.
Figure 9 provides a visual comparison of key metrics. We observe that: (1) stabilization is crucial for high-fidelity perturbations: removing it leaves SR almost unchanged but greatly increases L 2 and degrades PESQ/STOI; (2) alignment refresh mainly helps the hardest samples reach full success (SR = 100%) in our setting; and (3) the warm-up strategy prevents premature halting of the S/H gate, reducing the perturbation energy needed to obtain a stable attack. Overall, the full FLAMA configuration offers the most reliable balance between success, fidelity, and computational cost.

4.6.3. Computational Cost and Convergence

We further compare the optimization dynamics of FLAMA with a standard sequence-level CTC-loss-based attack. On test samples with durations of 3–5 s, FLAMA typically converges within 50–100 iterations in practice, while the CTC baseline often requires more than 200 iterations to reach a comparable success level. This gap is mainly attributable to the Step/Halt gating mechanism: once the weakest frames are flipped and their margins stabilize, FLAMA suppresses redundant updates and reduces late-iteration oscillations. In practice, this yields lower wall-clock time on a single GPU and also results in smaller perturbation energy.

5. Discussion

FLAMA is designed to balance attack success, perceptual quality, and computational cost for sequence recognition. Its core idea is to optimize frame-/step-wise margins along the decoding alignment, while using the recognition score S and the Step/Halt gate H to focus updates on critical positions and suppress redundant perturbations after the attack stabilizes. Empirically, this strategy yields consistent gains on both STR and ASR, achieving high success with substantially reduced distortion and improved perceptual quality compared with global sequence-level baselines.

5.1. Analysis of Method Efficacy

A key feature of FLAMA is that its objective is defined at the frame/step level and relies only on logits and alignment-related margins, without modality-specific heuristics. As a result, the same formulation can be instantiated on audio waveforms (e.g., Wav2Vec 2.0) and on word images processed by STR models (e.g., CRNN/STAR/TRBA). The consistent improvements on SVT/CUTE80/IC13 and the LibriSpeech test-clean subset indicate that manipulating a small number of decisive alignment positions is a reliable way to induce recognition errors with limited perceptual change. In this sense, FLAMA not only serves as a strong attack baseline but also provides a diagnostic lens for studying how alignment dynamics shape robustness in modern sequence recognizers.

5.2. Limitations and Future Work

Our study has several limitations. First, all experiments are conducted in a digital white-box setting with direct access to inputs and gradients. Extending FLAMA to real-world scenarios in the wild (e.g., over-the-air ASR attacks) would require modeling acoustic channels, device responses, and environmental noise, which may substantially change the threat surface. Second, we primarily focus on English ASR and Latin-script STR benchmarks. Evaluating FLAMA on multilingual corpora with different writing systems, tokenization schemes, and decoding behaviors is an important direction for future work. Third, while the stabilization stage improves objective quality metrics, a more thorough evaluation with stronger perceptual models and user studies would further strengthen claims about perceptual fidelity.

5.3. Implications for Defense

FLAMA suggests that defenses for sequence recognition should go beyond global sequence-level objectives and explicitly consider alignment dynamics. The cross-modal effectiveness of alignment-driven attacks also indicates that robustness improvements may transfer across modalities when they directly address alignment sensitivity (e.g., regularizing frame/step margins, stabilizing alignments, or incorporating alignment-aware detection). We hope FLAMA can serve as a foundation for future work on alignment-aware defenses and cross-modal robustness analysis. Representative defense directions include adversarial training [26], certified robustness via randomized smoothing [53], and detection/mitigation for audio adversarial examples [54,55,56].

6. Conclusions

In this paper, we propose FLAMA, a unified adversarial attack framework for sequence recognition under a white-box, untargeted setting. FLAMA explicitly optimizes frame-/step-level alignment margins and uses a recognition-score-aware Step/Halt gate to concentrate updates on critical positions, while a stabilization stage (TV-based smoothing and minimal feasible scaling) improves perceptual quality and suppresses late-iteration oscillation. Experiments with three STR models on SVT/CUTE80/IC13, and with a Wav2Vec 2.0 ASR model on a subset of LibriSpeech test-clean, show that FLAMA achieves near-100% success with substantially lower distortion and better perceptual metrics (PESQ, STOI, SSIM) than global sequence-level baselines. These results highlight alignment-dependent decoding as a shared weak point across visual and audio sequence recognizers, and motivate future, more systematic alignment-aware robustness analysis and defenses for safety- and security-sensitive applications in both modalities.

Author Contributions

Conceptualization, Y.X. and Z.X.; methodology, Z.X. and Y.X.; formal analysis, Z.X.; investigation, Z.X.; data curation, Y.X.; writing—original draft preparation, Z.X.; writing—review and editing, Y.X.; supervision, Y.X. and P.D.; funding acquisition, Z.X. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Youth Fund of the National Natural Science Foundation of China (Grant No. 62402459), the Fundamental Research Funds for the Central Universities (Grant No. CUC25QT05), and the Foundation of Yunnan Key Laboratory of Smart Education (Grant No. YNSE2024C001).

Data Availability Statement

Data sets, Materials and Code are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Summary of notations and parameters used in FLAMA.
Notation    Meaning
x , y Input sample (image or audio) and its ground-truth transcription sequence.
δ Additive adversarial perturbation; the Stage 2 output is denoted $\delta'$.
x Adversarial example, defined as x + δ .
y ^ Incorrect output transcription produced by the model on x .
M ( · ) Sequence recognition model (e.g., STR or ASR) in inference mode.
L a d v Adversarial loss function (e.g., CTC loss or cross-entropy loss).
L ( δ ) Overall attack objective function for adversarial generation.
D Valid input domain (e.g., [ 1 , 1 ] after normalization).
T Total number of input frames or decoding steps.
V Vocabulary set comprising all possible output tokens (includes blank for CTC).
z t Logit/log-probability vector z t R | V | at time index t.
z t [ y ] Scalar score for token y in vector z t at step t.
y t Reference token: arg max y V z t ( y ; x ) , derived from clean prediction.
$V_{\mathrm{comp}}(t)$ Competitor token set at step t (excludes $y_t$ and blank).
d t Frame-level alignment margin at step t.
d ˜ min Soft-min aggregated margin across Top-k positions.
S Recognition confidence score of the clean sample.
H Step/Halt gate value used to modulate adversarial pressure.
c Trade-off coefficient between adversarial loss and distortion.
k Number of selected weakest positions in the Top-k operation.
β Sharpness control parameter for the soft-min operator.
κ Steering parameter for the transition steepness of the gate H.
R Update frequency (in iterations) for the reference token sequence.
r Current iteration index during adversarial optimization.
R w Total iterations for the gate warm-up stage.
τ Margin threshold used in the stabilization objective L stab .
λ tv Weight coefficient for the Total Variation (TV) regularizer.
L 1 Gated adversarial objective used in Stage 1 optimization.
$L_{\mathrm{margin}}$ Sum of positive alignment margins: $\sum_{t=1}^{T} \max(d_t, 0)$.
L tv Total Variation (TV) regularizer.
J ( δ ; τ ) Margin-keeping term in Stage 2 to ensure attack success.
L stab Stabilization objective: success-keeping + TV regularization.

References

  1. Wang, K.; Babenko, B.; Belongie, S. End-to-End Scene Text Recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar] [CrossRef]
  2. Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016; Volume 1. [Google Scholar]
  3. Maracani, A.; Ozkan, S.; Cho, S.; Kim, H.; Noh, E.; Min, J.; Min, C.J.; Park, D.; Ozay, M. Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14516–14526. [Google Scholar] [CrossRef]
  4. Zhang, C.; Ding, W.; Peng, G.; Fu, F.; Wang, W. Street view text recognition with deep learning for urban scene understanding in intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4727–4743. [Google Scholar] [CrossRef]
  5. Velikovich, L.; Williams, I.; Scheiner, J.; Aleksic, P.S.; Moreno, P.J.; Riley, M. Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2222–2226. [Google Scholar]
  6. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  7. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  8. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2. [Google Scholar]
  9. Zhou, S.; Liu, C.; Ye, D.; Zhu, T.; Zhou, W.; Yu, P.S. Adversarial Attacks and Defenses in Deep Learning: From a Perspective of Cybersecurity. In ACM Computing Surveys; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  10. Xu, Y.; Dai, P.; Li, Z.; Wang, H.; Cao, X. The Best Protection is Attack: Fooling Scene Text Recognition with Minimal Pixels. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1580–1595. [Google Scholar] [CrossRef]
  11. Zhang, X.; Tan, H.; Huang, X.; Zhang, D.; Tang, K.; Gu, Z. Adversarial attacks on ASR systems: An overview. arXiv 2022, arXiv:2208.02250. [Google Scholar] [CrossRef]
  12. Yang, M.; Zheng, H.; Bai, X.; Luo, J. Cost-Effective Adversarial Attacks against Scene Text Recognition. In IEEE International Conference on Document Analysis and Recognition (ICDAR); IEEE: New York, NY, USA, 2021. [Google Scholar]
  13. Xu, X.; Chen, J.; Xiao, J.; Gao, L.; Shen, F.; Shen, H.T. What machines see is not what they get: Fooling scene text recognition models with adversarial text images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 12304–12314. [Google Scholar]
  14. Xu, Y.; Dai, P.; Cao, X. Less is better: Fooling scene text recognition with minimal perturbations. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2021; pp. 537–544. [Google Scholar]
  15. Carlini, N.; Wagner, D. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. In IEEE Symposium on Security and Privacy (SP) Workshops; IEEE: New York, NY, USA, 2018. [Google Scholar]
  16. Qin, Y.; Carlini, N.; Goodfellow, I.; Cottrell, G.; Raffel, C. Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition. In International Conference on Machine Learning (ICML); PMLR: Waurn Ponds, VIC, Australia, 2019. [Google Scholar]
  17. Yuan, X.; Chen, Y.; Zhao, Y.; Long, Y.; Liu, X.; Zhang, K.; Wang, S.; Gunter, C. Commandersong: A Systematic Approach for Practical Adversarial Voice Recognition. In Proceedings of the USENIX Security Symposium, Baltimore, MD, USA, 15–17 August 2018. [Google Scholar]
  18. Chen, Y.; Yuan, X.; Zhang, S.; Gunter, C. Devil’s Whisper: A General Approach for Physical Adversarial Attacks against Commercial Black-Box Speech Recognition Devices. In Proceedings of the USENIX Security Symposium, Boston, MA, USA, 12–14 August 2020. [Google Scholar]
  19. Schönherr, L.; Kohls, K.; Zeiler, S.; Holz, T.; Kolossa, D. Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems. In Proceedings of the Annual Computer Security Applications Conference (ACSAC), Austin, TX, USA, 7–11 December 2020; pp. 284–296. [Google Scholar]
  20. Zhang, G.; Yan, C.; Xu, X.; Zhang, T.; Li, T.; Xu, W. LaserAdv: Laser Adversarial Attacks on Speech Recognition Systems. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024. [Google Scholar]
  21. Yuan, X.; He, P.; Li, X.; Wu, D. Adaptive Adversarial Attack on Scene Text Recognition. In Proceedings of the IEEE INFOCOM 2020 Workshops (INFOCOM WKSHPS), Virtually, 6–9 July 2020; pp. 358–363. [Google Scholar] [CrossRef]
  22. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  23. Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  24. Cheng, M.; Yi, J.; Chen, P.Y.; Zhang, H.; Hsieh, C.J. Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 3601–3608. [Google Scholar]
  25. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Proceedings of the ICLR Workshop, Toulon, France, 24–26 April 2017. [Google Scholar]
  26. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  27. Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML; PMLR: Waurn Ponds, VIC, Australia, 2020; pp. 2206–2216. [Google Scholar]
  28. Andriushchenko, M.; Croce, F.; Flammarion, N.; Hein, M. Square attack: A query-efficient black-box adversarial attack via random search. In ECCV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 484–501. [Google Scholar]
  29. Maho, T.; Furon, T.; Le Merrer, E. Surfree: A fast surrogate-free black-box attack. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021; pp. 10430–10439. [Google Scholar]
  30. Reza, M.F.; Rahmati, A.; Wu, T.; Dai, H. CGBA: Curvature-aware geometric black-box attack. In Proceedings of the ICCV, Paris, France, 1–6 October 2023; pp. 124–133. [Google Scholar]
  31. Li, Y.; Cheng, M.; Hsieh, C.J.; Lee, T.C. A review of adversarial attack and defense for classification methods. Am. Stat. 2022, 76, 329–345. [Google Scholar] [CrossRef]
Figure 1. Overall pipeline of the proposed cost-effective adversarial attack.
Figure 2. Human perceptual evaluation of adversarial imperceptibility for (a) STR (clean vs. adversarial image) and (b) ASR (clean vs. adversarial audio).
Figure 3. Qualitative comparison on STR. We visualize the original samples as group (a) and the adversarial samples as group (b).
Figure 4. Three-row qualitative comparison on STR: (Row 1) adversarial images with stroke-aware glow overlay of the perturbation magnitude |δ|; (Row 2) adversarial images with heatmap overlay of |δ|; and (Row 3) width-wise perturbation profiles (column-wise RMS of |δ|) aligned with stroke energy, illustrating how perturbations distribute across STR time steps.
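The width-wise profile in Row 3 of Figure 4 (column-wise RMS of |δ|) is straightforward to reproduce. Below is a minimal NumPy sketch, assuming the perturbation is available as an (H, W) or (C, H, W) array; the function name `width_profile` is illustrative, not taken from the paper's code.

```python
import numpy as np

def width_profile(delta: np.ndarray) -> np.ndarray:
    """Column-wise RMS of the perturbation magnitude |delta|.

    delta: perturbation of shape (H, W) or (C, H, W).
    Returns a length-W profile, one value per image column; after the
    encoder's horizontal downsampling, each group of columns maps to
    one STR time step.
    """
    if delta.ndim == 3:                 # collapse channels to a per-pixel magnitude
        delta = np.linalg.norm(delta, axis=0)
    return np.sqrt(np.mean(np.abs(delta) ** 2, axis=0))

# Toy example: a perturbation concentrated on columns 40-59.
delta = np.zeros((32, 100))
delta[:, 40:60] = 0.5
profile = width_profile(delta)
print(int(profile.argmax()), float(profile.max()))  # → 40 0.5
```

In this toy case the profile is flat at 0.5 exactly over the perturbed columns and zero elsewhere, mirroring how the Row 3 plots localize perturbation energy on text strokes.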
Figure 5. Visualization of representative failure cases on IC13.
Figure 6. Spectrogram comparison. (a) Original audio; (b–d) spectrograms of the perturbations (δ) generated by FGSM, PGD, and FLAMA; (e–g) spectrograms of the corresponding adversarial audio. In (a,e–g), warmer colors indicate higher spectral magnitude (dB) and cooler colors indicate lower magnitude. In (b–d), red/blue indicate positive/negative perturbation components.
Figure 7. Waveform visualization for a 0.25 s segment. (a) Original waveform; (b) Adversarial waveform; (c) Overlay of the original and adversarial waveforms.
Figure 8. Quality–efficiency trade-off on a 989-utterance subset of LibriSpeech test-clean. Bubble size denotes SR. FLAMA achieves a favorable balance between perceptual quality (PESQ) and time cost, while maintaining a 100% SR.
Figure 9. Ablation study visualization on ASR. (a) Normalized radar chart of main metrics. (b) Comparison of scalar metrics.
Table 1. Clean recognition accuracy of three STR models on SVT, CUTE80, and IC13.
Model   SVT     CUTE80   IC13
CRNN    80.53   64.93    89.73
STAR    86.09   70.49    92.30
TRBA    86.86   74.31    92.88
Table 2. Attack performance on STR benchmarks SVT, CUTE80, and IC13 (evaluated on the clean-correct subset). Units: SR (%); L2 is computed on normalized images (unitless); SSIM is unitless; ED is measured in characters. ↑: higher is better; ↓: lower is better; Bold: best results.
                  |            SVT             |           CUTE80           |            IC13
Model   Method    | SR ↑    L2 ↓  SSIM ↑ ED ↓ | SR ↑    L2 ↓  SSIM ↑ ED ↓ | SR ↑    L2 ↓  SSIM ↑ ED ↓
CRNN    FGSM      |  67.37  5.63  0.72   2.64 |  65.24  5.56  0.74   2.52 |  44.86  5.61  0.72   1.98
        PGD       | 100.00  4.23  0.82   4.64 | 100.00  4.33  0.82   4.47 | 100.00  4.32  0.80   3.98
        CE-ASTR   |  95.78  4.29  0.85   1.15 |  96.79  4.31  0.85   1.13 |  85.57  6.87  0.75   1.10
        FLAMA     | 100.00  0.69  0.99   1.01 | 100.00  0.61  0.99   1.01 |  98.31  1.15  0.97   1.01
STAR    FGSM      |  58.48  5.64  0.75   2.54 |  56.16  5.56  0.75   2.32 |  30.42  5.60  0.74   2.14
        PGD       | 100.00  4.05  0.84   5.15 | 100.00  4.05  0.84   4.80 | 100.00  4.11  0.82   5.17
        CE-ASTR   |  96.41  5.04  0.83   1.18 |  94.58  4.63  0.85   1.17 |  82.43  8.41  0.72   1.15
        FLAMA     |  99.82  0.60  0.99   1.00 | 100.00  0.68  0.98   1.00 |  98.48  1.07  0.97   1.01
TRBA    FGSM      |  54.20  5.64  0.74   2.60 |  46.26  5.57  0.75   2.51 |  26.83  5.59  0.74   2.23
        PGD       | 100.00  4.04  0.84   3.46 | 100.00  4.04  0.84   3.27 | 100.00  4.08  0.82   2.95
        CE-ASTR   |  99.11  3.85  0.88   1.43 |  99.53  3.74  0.88   1.69 |  93.22  5.65  0.80   1.59
        FLAMA     | 100.00  0.51  0.99   1.27 | 100.00  0.56  0.99   1.32 |  99.37  0.70  0.98   1.42
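Two of the columns in Table 2 are easy to misread: ED is the character-level Levenshtein (edit) distance between the predicted and ground-truth strings, and L2 is the Euclidean norm of the perturbation on normalized images. A minimal sketch of both metrics follows; the authors' exact evaluation code is not reproduced here, so the function names are illustrative.

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (the ED column)."""
    dp = list(range(len(b) + 1))          # one-row dynamic program
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # candidates: deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def l2_distortion(x: np.ndarray, x_adv: np.ndarray) -> float:
    """Euclidean norm of the perturbation on normalized images (the L2 column)."""
    return float(np.linalg.norm((x_adv - x).ravel()))

print(edit_distance("kitten", "sitting"))  # → 3
```

An ED near 1.0, as FLAMA achieves, means the attack typically flips the prediction by roughly a single character, while the PGD baseline scrambles several.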
Table 3. Attack performance on a 989-utterance subset of LibriSpeech test-clean with the Wav2Vec 2.0 CTC model. Units: SR (%), SNR (dB), L2 (unitless), PESQ and STOI (unitless), and time (s).
Method          SR ↑     SNR ↑    L2 ↓    PESQ ↑   STOI ↑   Time ↓
FGSM             68.86   15.24    3.26    1.30     0.92      0.47
PGD              99.70   18.96    2.12    1.59     0.94     14.12
FLAMA (ours)    100.0    40.66    0.60    3.28     0.98      3.30
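The SNR column in Table 3 measures how much perturbation energy the attack injects relative to the clean waveform. A minimal sketch, assuming both signals are NumPy arrays of raw audio samples (`snr_db` is an illustrative name, not the paper's code):

```python
import numpy as np

def snr_db(x: np.ndarray, x_adv: np.ndarray) -> float:
    """Signal-to-noise ratio of the perturbation delta = x_adv - x, in dB."""
    delta = x_adv - x
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum(delta ** 2)))

# Toy check: a perturbation with 1/100 of the signal energy gives 20 dB.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)     # one second of 16 kHz audio
x_adv = x + 0.1 * x                # hypothetical perturbation
print(round(snr_db(x, x_adv), 1))  # → 20.0
```

Higher SNR means a quieter perturbation; PESQ and STOI, by contrast, require dedicated perceptual models and are usually computed with reference implementations rather than a one-line formula.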
Table 4. Ablation of the stabilization stage on STR models. All variants achieve 100% SR on the clean-correct subset. Units: L2 is computed on normalized images (unitless), SSIM is unitless, and ED is measured in characters.
                  |         SVT            |        CUTE80
Model   Method    | L2 ↓   SSIM ↑   ED ↓   | L2 ↓   SSIM ↑   ED ↓
CRNN    FLAMA-A   | 0.74   0.9853   1.73   | 0.80   0.9736   1.57
        FLAMA     | 0.69   0.99     1.01   | 0.61   0.99     1.01
STAR    FLAMA-A   | 0.91   0.9799   1.84   | 0.90   0.9689   1.95
        FLAMA     | 0.60   0.99     1.00   | 0.68   0.98     1.00
TRBA    FLAMA-A   | 0.87   0.9804   1.56   | 0.90   0.9703   1.65
        FLAMA     | 0.51   0.99     1.27   | 0.56   0.99     1.32
Table 5. Ablation study of FLAMA components on a 989-utterance subset of LibriSpeech test-clean. Units: SR in %, SNR in dB, and time in seconds.
Method           SR ↑     SNR ↑    L2 ↓    PESQ ↑   STOI ↑   Time ↓
w/o Stabilize     99.80   10.33    5.71    1.13     0.86     0.69
w/o Refresh       99.70   41.06    0.58    3.31     0.99     2.76
w/o Warm-up       99.80   30.31    2.05    2.49     0.95     2.58
FLAMA (full)     100.0    40.66    0.60    3.28     0.98     3.30
Share and Cite

MDPI and ACS Style

Xu, Y.; Xu, Z.; Dai, P. FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition. Electronics 2026, 15, 1064. https://doi.org/10.3390/electronics15051064