FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition
Abstract
1. Introduction
- We propose FLAMA, a unified Frame-Level Alignment Margin Attack for untargeted adversarial generation on sequence recognizers. FLAMA localizes optimization by tracking alignment margins and updating only the most critical frames via a recognition-score-aware Step/Halt gate.
- We extend and refine Step/Halt for cross-modal sequence recognition, improving stability on long sequences and avoiding premature halting. In addition, we introduce a stabilization stage that combines smoothness-oriented regularization with perturbation scaling, suppressing late-iteration oscillations and improving perceptual metrics (PESQ, STOI, SSIM) while preserving attack success.
- Extensive experiments on STR and ASR benchmarks show near-100% attack success, with substantially reduced perturbation and improved perceptual metrics.
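The frame-level margin tracking and soft-min Top-k aggregation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the exact competitor-set definition, and the soft-min weighting are illustrative choices based on the notation in Section 3.3.

```python
import numpy as np

def frame_margins(logits, ref_tokens, blank=0):
    """Per-frame alignment margin: reference-token score minus the best
    competitor score (competitors exclude the reference token and blank)."""
    T, V = logits.shape
    margins = np.empty(T)
    for t in range(T):
        ref = ref_tokens[t]
        mask = np.ones(V, dtype=bool)
        mask[[ref, blank]] = False          # drop reference and blank
        margins[t] = logits[t, ref] - logits[t, mask].max()
    return margins

def softmin_topk(margins, k=3, tau=1.0):
    """Soft-min aggregation over the k weakest (smallest-margin) frames;
    tau controls how sharply the weighting favors the minimum."""
    weakest = np.sort(margins)[:k]
    w = np.exp(-weakest / tau)
    return float((w * weakest).sum() / w.sum())
```

Frames whose margins sit near zero are the cheapest to flip, so concentrating updates on the k weakest positions localizes the perturbation where it matters most.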
2. Related Work
2.1. Adversarial Attacks on Non-Sequence Recognition (Classification)
2.2. Sequence Recognition Architectures
2.3. Adversarial Attacks on Sequence Recognition
2.4. Perceptual Quality and Fidelity Constraints
3. Methodology
3.1. Problem Definition
3.2. Framework
3.3. Stage 1: Initial Generation
3.3.1. Recognition Confidence Score
3.3.2. Frame-Level Margin Attack Loss
3.3.3. Step/Halt Dynamic Gating Mechanism
3.3.4. Comprehensive Objective Function
3.4. Stage 2: Stabilization Stage (Total Variation Smoothing with Success-Keeping)
3.5. Stage 3: Minimal Feasible Scaling
4. Experiments
4.1. Models and Datasets
- STR models. We consider three widely used architectures that instantiate the four-stage STR framework [38]:
- CRNN [35]: A VGG-like convolutional encoder, BiLSTM sequence modeling, and a CTC decoder.
- STAR [36]: A spatial-transformer-based rectification front-end with a ResNet-like backbone and CTC-style decoding, following STAR-Net.
- TRBA [38]: A TPS rectification module, a ResNet encoder, BiLSTM sequence modeling, and an attention-based decoder.
4.2. Implementation
4.3. Threat Model and Baselines
- FGSM (Fast Gradient Sign Method) [23]: A single-step method that perturbs the input along the gradient sign of a sequence-level loss (with a perturbation budget for STR in our implementation).
- PGD (Projected Gradient Descent) [26]: A multi-step iterative baseline that updates the perturbation and projects it back onto the feasible set at each step (also using a perturbation budget for STR in our implementation).
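The PGD baseline above can be sketched as follows. This is a generic NumPy version with an L-infinity-style projection; the `loss_grad` callable, step size, and budget values are chosen for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def pgd_attack(loss_grad, x, eps=0.03, alpha=0.007, steps=10):
    """Multi-step PGD: ascend the adversarial loss via gradient-sign steps,
    then project the perturbation back into the eps-ball and valid range."""
    x_adv = x.copy()
    for _ in range(steps):
        g = loss_grad(x_adv)                       # gradient of the attack loss
        x_adv = x_adv + alpha * np.sign(g)         # gradient-sign ascent step
        delta = np.clip(x_adv - x, -eps, eps)      # project onto the eps-ball
        x_adv = np.clip(x + delta, 0.0, 1.0)       # stay in the valid input domain
    return x_adv
```

FGSM is recovered as the single-step special case (`steps=1`, `alpha=eps`).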
4.4. Evaluation Metrics
4.4.1. STR Metrics
- SR (success rate): The percentage of clean-correct images that become misrecognized under attack (higher is better for the attacker).
- Perturbation norm: The average norm of the image-domain perturbation (computed on normalized inputs); lower values indicate smaller distortion.
- SSIM: The Structural Similarity Index between original and adversarial images, capturing perceptual similarity in luminance, contrast, and structure (higher is better).
- ED (edit distance): The average Levenshtein edit distance between the adversarial prediction and the ground-truth transcript, computed on the same clean-correct subset (lower indicates milder transcription corruption). Since the clean prediction matches on this subset, ED also reflects the deviation from the original correct prediction.
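The ED metric above is the standard Levenshtein distance; a compact single-row dynamic-programming implementation (illustrative, not the paper's code) is:

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b, allowing
    insertions, deletions, and substitutions, each at unit cost."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))                 # dp[j] = distance(a[:i], b[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]
```

Averaging this distance between adversarial predictions and ground-truth transcripts over the clean-correct subset yields the reported ED.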
4.4.2. ASR Metrics
- SR (success rate): The percentage of utterances for which the final transcription differs from the ground-truth transcript.
- Perturbation norm: The average norm of the perturbation (lower is better).
- SNR (signal-to-noise ratio): The ratio of signal power to perturbation power in decibels (higher is better).
- PESQ (perceptual evaluation of speech quality) [51]: An objective speech-quality metric that correlates with human perception (higher is better).
- STOI (short-time objective intelligibility) [52]: An objective measure of speech intelligibility (higher is better).
- Time: The average end-to-end wall-clock time per utterance in our implementation, including waveform I/O, decoding, attack optimization, and metric computation.
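The SNR metric above follows the standard definition (signal power over perturbation power, in decibels). A small sketch, with the function name and the `eps` guard being illustrative choices:

```python
import numpy as np

def snr_db(clean, perturbation, eps=1e-12):
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise).
    eps guards against division by zero for silent inputs."""
    p_sig = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(perturbation.astype(np.float64) ** 2)
    return 10.0 * np.log10((p_sig + eps) / (p_noise + eps))
```

For example, a perturbation that is a 0.1-scaled copy of the signal has a power ratio of 100, i.e., 20 dB.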
4.5. Attack Performance
4.5.1. STR Attack Results
4.5.2. ASR Attack Results
4.6. Ablation Study
4.6.1. STR Ablation
4.6.2. ASR Ablation
4.6.3. Computational Cost and Convergence
5. Discussion
5.1. Analysis of Method Efficacy
5.2. Limitations and Future Work
5.3. Implications for Defense
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| Notation | Meaning |
|---|---|
|  | Input sample (image or audio) and its ground-truth transcription sequence. |
|  | Additive adversarial perturbation; the Stage 2 output is denoted separately. |
|  | Adversarial example (the input with the perturbation added). |
|  | Incorrect output transcription produced by the model on the adversarial example. |
|  | Sequence recognition model (e.g., STR or ASR) in inference mode. |
|  | Adversarial loss function (e.g., CTC loss or cross-entropy loss). |
|  | Overall attack objective function for adversarial generation. |
|  | Valid input domain (e.g., after normalization). |
| T | Total number of input frames or decoding steps. |
|  | Vocabulary set comprising all possible output tokens (includes blank for CTC). |
|  | Logit/log-probability vector at time index t. |
|  | Scalar score for token y at step t. |
|  | Reference token at step t, derived from the clean prediction. |
|  | Competitor token set at step t (excludes the reference token and blank). |
|  | Frame-level alignment margin at step t. |
|  | Soft-min aggregated margin across the Top-k positions. |
| S | Recognition confidence score of the clean sample. |
| H | Step/Halt gate value used to modulate adversarial pressure. |
| c | Trade-off coefficient between the adversarial loss and distortion. |
| k | Number of selected weakest positions in the Top-k operation. |
|  | Sharpness control parameter for the soft-min operator. |
|  | Steering parameter for the transition steepness of the gate H. |
| R | Update frequency (in iterations) for the reference token sequence. |
| r | Current iteration index during adversarial optimization. |
|  | Total iterations for the gate warm-up stage. |
|  | Margin threshold used in the stabilization objective. |
|  | Weight coefficient for the Total Variation (TV) regularizer. |
|  | Gated adversarial objective used in Stage 1 optimization. |
|  | Sum of positive alignment margins. |
|  | Total Variation (TV) regularizer. |
|  | Margin-keeping term in Stage 2 that preserves attack success. |
|  | Stabilization objective: success-keeping plus TV regularization. |
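The Stage 2 quantities in the table (TV regularizer, sum of positive alignment margins, stabilization objective) can be combined into a small sketch. The sign convention and hinge form below are assumptions consistent with the notation above, not the paper's exact objective.

```python
import numpy as np

def tv_1d(delta):
    """Total-variation penalty: sum of absolute first differences,
    which discourages jagged, high-frequency perturbations."""
    return float(np.abs(np.diff(delta)).sum())

def stabilization_objective(delta, margins, m0=0.0, lam=0.01):
    """Stage 2 sketch: a success-keeping hinge plus TV smoothing.
    With m0 = 0 the hinge reduces to the sum of positive alignment
    margins, penalizing frames the attack has not yet flipped."""
    keep = np.maximum(margins + m0, 0.0).sum()
    return float(keep + lam * tv_1d(delta))
```

Minimizing this objective smooths the perturbation while the hinge term keeps every frame's margin on the adversarial side of the threshold, preserving attack success.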
References
- Wang, K.; Babenko, B.; Belongie, S. End-to-End Scene Text Recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar] [CrossRef]
- Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016; Volume 1. [Google Scholar]
- Maracani, A.; Ozkan, S.; Cho, S.; Kim, H.; Noh, E.; Min, J.; Min, C.J.; Park, D.; Ozay, M. Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14516–14526. [Google Scholar] [CrossRef]
- Zhang, C.; Ding, W.; Peng, G.; Fu, F.; Wang, W. Street view text recognition with deep learning for urban scene understanding in intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4727–4743. [Google Scholar] [CrossRef]
- Velikovich, L.; Williams, I.; Scheiner, J.; Aleksic, P.S.; Moreno, P.J.; Riley, M. Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2222–2226. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
- Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2. [Google Scholar]
- Zhou, S.; Liu, C.; Ye, D.; Zhu, T.; Zhou, W.; Yu, P.S. Adversarial Attacks and Defenses in Deep Learning: From a Perspective of Cybersecurity. In ACM Computing Surveys; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
- Xu, Y.; Dai, P.; Li, Z.; Wang, H.; Cao, X. The Best Protection is Attack: Fooling Scene Text Recognition with Minimal Pixels. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1580–1595. [Google Scholar] [CrossRef]
- Zhang, X.; Tan, H.; Huang, X.; Zhang, D.; Tang, K.; Gu, Z. Adversarial attacks on ASR systems: An overview. arXiv 2022, arXiv:2208.02250. [Google Scholar] [CrossRef]
- Yang, M.; Zheng, H.; Bai, X.; Luo, J. Cost-Effective Adversarial Attacks against Scene Text Recognition. In IEEE International Conference on Document Analysis and Recognition (ICDAR); IEEE: New York, NY, USA, 2021. [Google Scholar]
- Xu, X.; Chen, J.; Xiao, J.; Gao, L.; Shen, F.; Shen, H.T. What Machines See Is Not What They Get: Fooling Scene Text Recognition Models with Adversarial Text Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12304–12314. [Google Scholar]
- Xu, Y.; Dai, P.; Cao, X. Less is better: Fooling scene text recognition with minimal perturbations. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2021; pp. 537–544. [Google Scholar]
- Carlini, N.; Wagner, D. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. In IEEE Symposium on Security and Privacy (SP) Workshops; IEEE: New York, NY, USA, 2018. [Google Scholar]
- Qin, Y.; Carlini, N.; Goodfellow, I.; Cottrell, G.; Raffel, C. Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
- Yuan, X.; Chen, Y.; Zhao, Y.; Long, Y.; Liu, X.; Zhang, K.; Wang, S.; Gunter, C. Commandersong: A Systematic Approach for Practical Adversarial Voice Recognition. In Proceedings of the USENIX Security Symposium, Baltimore, MD, USA, 15–17 August 2018. [Google Scholar]
- Chen, Y.; Yuan, X.; Zhang, S.; Gunter, C. Devil’s Whisper: A General Approach for Physical Adversarial Attacks against Commercial Black-Box Speech Recognition Devices. In Proceedings of the USENIX Security Symposium, Boston, MA, USA, 12–14 August 2020. [Google Scholar]
- Schönherr, L.; Kohls, K.; Zeiler, S.; Holz, T.; Kolossa, D. Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems. In Proceedings of the Annual Computer Security Applications Conference (ACSAC), Austin, TX, USA, 7–11 December 2020; pp. 284–296. [Google Scholar]
- Zhang, G.; Yan, C.; Xu, X.; Zhang, T.; Li, T.; Xu, W. LaserAdv: Laser Adversarial Attacks on Speech Recognition Systems. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024. [Google Scholar]
- Yuan, X.; He, P.; Li, X.; Wu, D. Adaptive Adversarial Attack on Scene Text Recognition. In Proceedings of the IEEE INFOCOM 2020 Workshops (INFOCOM WKSHPS), Virtually, 6–9 July 2020; pp. 358–363. [Google Scholar] [CrossRef]
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Cheng, M.; Yi, J.; Chen, P.Y.; Zhang, H.; Hsieh, C.J. Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 3601–3608. [Google Scholar]
- Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Proceedings of the ICLR Workshop, Toulon, France, 24–26 April 2017. [Google Scholar]
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML; PMLR: Waurn Ponds, VIC, Australia, 2020; pp. 2206–2216. [Google Scholar]
- Andriushchenko, M.; Croce, F.; Flammarion, N.; Hein, M. Square attack: A query-efficient black-box adversarial attack via random search. In ECCV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 484–501. [Google Scholar]
- Maho, T.; Furon, T.; Le Merrer, E. Surfree: A fast surrogate-free black-box attack. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021; pp. 10430–10439. [Google Scholar]
- Reza, M.F.; Rahmati, A.; Wu, T.; Dai, H. CGBA: Curvature-aware geometric black-box attack. In Proceedings of the ICCV, Paris, France, 1–6 October 2023; pp. 124–133. [Google Scholar]
- Li, Y.; Cheng, M.; Hsieh, C.J.; Lee, T.C. A review of adversarial attack and defense for classification methods. Am. Stat. 2022, 76, 329–345. [Google Scholar] [CrossRef]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar] [CrossRef]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based Models for Speech Recognition. In 29th International Conference on Neural Information Processing Systems; NIPS’15; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 577–585. [Google Scholar]
- Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2017, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Chen, C.; Wong, K.Y.K. STAR-Net: A Spatial Attention Residue Network for Scene Text Recognition. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, 19–22 September 2016; pp. 43.1–43.13. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020. [Google Scholar]
- Baek, J.; Kim, G.; Lee, J.; Park, S.; Han, D.; Yun, S.; Oh, S.J.; Lee, H. What Is Wrong with Scene Text Recognition Model Comparisons? Dataset and Model Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 2 November 2019; pp. 4715–4723. [Google Scholar]
- Alzantot, M.; Sharma, Y.; Song, J.; Shrivastava, G.; Chang, H.; Wang, M.; Vorobeychik, Y. Did You Hear That? Adversarial Examples against Automatic Speech Recognition. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Neekhara, P.; Hussain, S.; Pandey, P.; Dubnov, S.; McAuley, J.; Koushanfar, F. Universal Adversarial Perturbations for Speech Recognition Systems. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019. [Google Scholar]
- Li, Z.; Wu, Y.; Liu, J.; Chen, Y.; Yuan, B. AdvPulse: Universal, Synchronization-Free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), Virtual Event, 9–13 November 2020. [Google Scholar]
- Sun, Z.; Zhao, J.; Guo, F.; Chen, Y.; Ju, L. CommanderUAP: A Practical and Transferable Universal Adversarial Attacks on Speech Recognition Models. Cybersecurity 2024, 7, 38. [Google Scholar] [CrossRef]
- Li, J.; Deng, L.; Gong, Y.; Haeb-Umbach, R. An Overview of Noise-Robust Automatic Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 745–777. [Google Scholar] [CrossRef]
- Schönherr, L.; Aichroth, P.; Backes, M.; Lander, C. Adversarial Attacks against Automatic Speech Recognition Systems via Psychoacoustic Hiding. In Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 24–27 February 2019. [Google Scholar]
- Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017. [Google Scholar]
- Rudin, L.; Osher, S.; Fatemi, E. Nonlinear Total Variation-Based Noise Removal Algorithms. Phys. D 1992, 60, 259–268. [Google Scholar] [CrossRef]
- Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A Robust Arbitrary Text Detection System for Natural Scene Images. Expert Syst. Appl. 2014, 41, 8027–8048. [Google Scholar] [CrossRef]
- Lucas, S.M.; Panaretos, A.; Sosa, L.; Tang, A.; Wong, S.; Young, R. ICDAR 2003 Robust Reading Competitions. In Proceedings of the ICDAR, Edinburgh, UK, 3–6 August 2003; pp. 682–687. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Salt Lake City, UT, USA, 7–11 May 2001; pp. 749–752. [Google Scholar]
- Taal, C.; Hendriks, R.; Heusdens, R.; Jensen, J. A Short-Time Objective Intelligibility Measure for Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011. [Google Scholar]
- Cohen, J.; Rosenfeld, E.; Kolter, J.Z. Certified Adversarial Robustness via Randomized Smoothing. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 1310–1320. [Google Scholar]
- Hussain, S.; Neekhara, P.; Dubnov, S.; McAuley, J.; Koushanfar, F. WaveGuard: Understanding and Mitigating Audio Adversarial Examples. In Proceedings of the USENIX Security Symposium, Vancouver, BC, Canada, 11–13 August 2021. [Google Scholar]
- Kwon, H.; Nam, S.H. Audio Adversarial Detection through Classification Score on Speech Recognition Systems. Comput. Secur. 2023, 126, 103061. [Google Scholar] [CrossRef]
- Chen, G.; Zhao, Z.; Song, F.; Chen, S.; Fan, L.; Wang, F.; Wang, J. Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition. arXiv 2022, arXiv:2206.03393. [Google Scholar] [CrossRef]
| Model | SVT | CUTE80 | IC13 |
|---|---|---|---|
| CRNN | 80.53 | 64.93 | 89.73 |
| STAR | 86.09 | 70.49 | 92.30 |
| TRBA | 86.86 | 74.31 | 92.88 |
Per-dataset metric groups, left to right: SVT, CUTE80, IC13.

| Model | Method | SR ↑ | Norm ↓ | SSIM ↑ | ED ↓ | SR ↑ | Norm ↓ | SSIM ↑ | ED ↓ | SR ↑ | Norm ↓ | SSIM ↑ | ED ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRNN | FGSM | 67.37 | 5.63 | 0.72 | 2.64 | 65.24 | 5.56 | 0.74 | 2.52 | 44.86 | 5.61 | 0.72 | 1.98 |
| CRNN | PGD | 100.00 | 4.23 | 0.82 | 4.64 | 100.00 | 4.33 | 0.82 | 4.47 | 100.00 | 4.32 | 0.80 | 3.98 |
| CRNN | CE-ASTR | 95.78 | 4.29 | 0.85 | 1.15 | 96.79 | 4.31 | 0.85 | 1.13 | 85.57 | 6.87 | 0.75 | 1.10 |
| CRNN | FLAMA | 100.00 | 0.69 | 0.99 | 1.01 | 100.00 | 0.61 | 0.99 | 1.01 | 98.31 | 1.15 | 0.97 | 1.01 |
| STAR | FGSM | 58.48 | 5.64 | 0.75 | 2.54 | 56.16 | 5.56 | 0.75 | 2.32 | 30.42 | 5.60 | 0.74 | 2.14 |
| STAR | PGD | 100.00 | 4.05 | 0.84 | 5.15 | 100.00 | 4.05 | 0.84 | 4.80 | 100.00 | 4.11 | 0.82 | 5.17 |
| STAR | CE-ASTR | 96.41 | 5.04 | 0.83 | 1.18 | 94.58 | 4.63 | 0.85 | 1.17 | 82.43 | 8.41 | 0.72 | 1.15 |
| STAR | FLAMA | 99.82 | 0.60 | 0.99 | 1.00 | 100.00 | 0.68 | 0.98 | 1.00 | 98.48 | 1.07 | 0.97 | 1.01 |
| TRBA | FGSM | 54.20 | 5.64 | 0.74 | 2.60 | 46.26 | 5.57 | 0.75 | 2.51 | 26.83 | 5.59 | 0.74 | 2.23 |
| TRBA | PGD | 100.00 | 4.04 | 0.84 | 3.46 | 100.00 | 4.04 | 0.84 | 3.27 | 100.00 | 4.08 | 0.82 | 2.95 |
| TRBA | CE-ASTR | 99.11 | 3.85 | 0.88 | 1.43 | 99.53 | 3.74 | 0.88 | 1.69 | 93.22 | 5.65 | 0.80 | 1.59 |
| TRBA | FLAMA | 100.00 | 0.51 | 0.99 | 1.27 | 100.00 | 0.56 | 0.99 | 1.32 | 99.37 | 0.70 | 0.98 | 1.42 |
| Method | SR ↑ | SNR (dB) ↑ | Norm ↓ | PESQ ↑ | STOI ↑ | Time ↓ |
|---|---|---|---|---|---|---|
| FGSM | 68.86 | 15.24 | 3.26 | 1.30 | 0.92 | 0.47 |
| PGD | 99.70 | 18.96 | 2.12 | 1.59 | 0.94 | 14.12 |
| FLAMA (ours) | 100.0 | 40.66 | 0.60 | 3.28 | 0.98 | 3.30 |
Per-dataset metric groups, left to right: SVT, CUTE80.

| Model | Method | Norm ↓ | SSIM ↑ | ED ↓ | Norm ↓ | SSIM ↑ | ED ↓ |
|---|---|---|---|---|---|---|---|
| CRNN | FLAMA-A | 0.74 | 0.9853 | 1.73 | 0.80 | 0.9736 | 1.57 |
| CRNN | FLAMA | 0.69 | 0.99 | 1.01 | 0.61 | 0.99 | 1.01 |
| STAR | FLAMA-A | 0.91 | 0.9799 | 1.84 | 0.90 | 0.9689 | 1.95 |
| STAR | FLAMA | 0.60 | 0.99 | 1.00 | 0.68 | 0.98 | 1.00 |
| TRBA | FLAMA-A | 0.87 | 0.9804 | 1.56 | 0.90 | 0.9703 | 1.65 |
| TRBA | FLAMA | 0.51 | 0.99 | 1.27 | 0.56 | 0.99 | 1.32 |
| Method | SR ↑ | SNR (dB) ↑ | Norm ↓ | PESQ ↑ | STOI ↑ | Time ↓ |
|---|---|---|---|---|---|---|
| w/o Stabilize | 99.80 | 10.33 | 5.71 | 1.13 | 0.86 | 0.69 |
| w/o Refresh | 99.70 | 41.06 | 0.58 | 3.31 | 0.99 | 2.76 |
| w/o Warm-up | 99.80 | 30.31 | 2.05 | 2.49 | 0.95 | 2.58 |
| FLAMA (full) | 100.0 | 40.66 | 0.60 | 3.28 | 0.98 | 3.30 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Xu, Y.; Xu, Z.; Dai, P. FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition. Electronics 2026, 15, 1064. https://doi.org/10.3390/electronics15051064