Optimizer-Aware Fine-Tuning of Whisper Small with Low-Rank Adaptation: An Empirical Study of Adam and AdamW
Abstract
1. Introduction
1.1. Research Gap
1.2. Novelty of the Research Work
2. Literature Review
3. Materials and Methods
3.1. Data Workflow and Processing Stages
3.1.1. Dataset Collection
3.1.2. Data Pre-Processing
3.1.3. Dataset Splitting
3.1.4. Model Selection: Whisper Architecture
3.1.5. Proposed Model and Fine-Tuning
LoRA reparameterizes each adapted weight matrix as W′ = W + ΔW = W + AB, where:
- W ∈ R^{d×k} is the frozen pre-trained weight matrix;
- A ∈ R^{d×r} and B ∈ R^{r×k} are the trainable low-rank matrices, with rank r ≪ min(d, k);
- only A and B are updated during fine-tuning, ensuring a significant reduction in trainable parameters.
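As a concrete illustration, the following is a minimal sketch (not the authors' exact script) of how LoRA adapters with the configuration reported in Section 4 (r = 8, α = 16, dropout = 0.1) can be attached to Whisper Small using the Hugging Face PEFT library; the choice of target modules (the attention query and value projections) is an assumption.

```python
# Minimal LoRA sketch for Whisper Small (assumed configuration, not the authors' exact script).
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")  # 244 M parameters

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank matrices A and B
    lora_alpha=16,                        # scaling factor α
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed targets: attention query/value projections
    bias="none",
)

model = get_peft_model(base, lora_cfg)    # W stays frozen; only A and B are trainable
model.print_trainable_parameters()        # reports the small trainable-parameter fraction
```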
3.1.6. Model Evaluation
WER is computed as WER = (S + D + I) / N, where:
- S = number of substitutions (a wrong word in place of the correct one);
- D = number of deletions (a reference word that was missed);
- I = number of insertions (an extra word that was not in the reference);
- N = total number of words in the reference (ground-truth) transcript.
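To make the metric concrete, the following self-contained example (with made-up sentences) computes the word-level edit distance and the resulting WER exactly as defined above.

```python
# Worked WER example: word-level edit distance counting substitutions (S),
# deletions (D), and insertions (I); the sentences are illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution / match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / len(ref)        # (S + D + I) / N

# One deleted word out of six reference words: prints 0.1666... (≈ 16.7% WER).
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```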
3.1.7. Pseudo-Code
4. Experimental Results
4.1. Hyper-Parameter Tuning and Optimization
4.1.1. Optimizer Analysis
4.1.2. Batch-Wise vs. Sample-Wise Evaluation
4.2. Visual Analysis of Training Behavior
4.2.1. Training Loss Curve
AdamW Optimizer Loss
Adam Optimizer Loss
4.3. WER Progression Curve
4.3.1. AdamW Optimizer WER
4.3.2. Adam Optimizer WER
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ASR | Automatic Speech Recognition |
MFCC | Mel-Frequency Cepstral Coefficients |
PLP | Perceptual Linear Predictive features |
LoRA | Low-Rank Adaptation |
References
- Kheddar, H.; Hemis, M.; Himeur, Y. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
- Goldstein, A.; Wang, H.; Niekerken, L.; Schain, M.; Zada, Z.; Aubrey, B.; Sheffer, T.; Nastase, S.A.; Gazula, H.; Singh, A.; et al. A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations. Nat. Hum. Behav. 2025, 9, 1041–1055. [Google Scholar] [CrossRef] [PubMed]
- Singh, A.; Kaur, N.; Kukreja, V.; Kadyan, V.; Kumar, M. Computational intelligence in processing of speech acoustics: A survey. Complex Intell. Syst. 2022, 8, 2623–2661. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. Available online: https://aclanthology.org/N19-1423.pdf (accessed on 10 October 2025).
- Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
- Miao, H.; Cheng, G.; Gao, C.; Zhang, P.; Yan, Y. Transformer-based online CTC/attention end-to-end speech recognition architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6084–6088. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 28492–28518. [Google Scholar]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
- Wang, S.; Yu, L.; Li, J. LoRA-GA: Low-Rank Adaptation with Gradient Approximation. arXiv 2024, arXiv:2407.05000. [Google Scholar]
- Guo, P.; Chang, X.; Lv, H.; Watanabe, S.; Xie, L. SQ-Whisper: Speaker-Querying Based Whisper Model for Target-Speaker ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 33, 175–185. [Google Scholar] [CrossRef]
- McGuire, M.; Larson-Hall, J. Assessing Whisper automatic speech recognition and WER scoring for elicited imitation: Steps toward automation. Res. Methods Appl. Linguist. 2025, 4, 100197. [Google Scholar] [CrossRef]
- Pour, M.H.R.; Rastin, N.; Kermani, M.M. Persian Automatic Speech Recognition by the Use of Whisper Model. In Proceedings of the 2024 20th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP), Babol, Iran, 21–22 February 2024; pp. 1–7. [Google Scholar]
- Wang, S.; Yang, C.H.; Wu, J.; Zhang, C. Can Whisper Perform Speech-Based in-Context Learning? In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 13421–13425. [Google Scholar]
- Yeo, J.H.; Kim, M.; Watanabe, S.; Ro, Y.M. Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10471–10475. [Google Scholar]
- Liu, W.; Qin, Y.; Peng, Z.; Lee, T. Sparsely Shared Lora on Whisper for Child Speech Recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11751–11755. [Google Scholar]
- Xu, T.; Huang, K.; Guo, P.; Zhou, Y.; Huang, L.; Xue, H.; Xie, L. Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper. arXiv 2024, arXiv:2408.10680. [Google Scholar]
- Liu, Y.; Yang, X.; Qu, D. Exploration of Whisper fine-tuning strategies for low-resource ASR. EURASIP J. Audio Speech Music Process. 2024, 2024, 29. [Google Scholar] [CrossRef]
- Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227. [Google Scholar] [CrossRef]
- Ou, L.; Feng, G. Parameter-Efficient Fine-Tuning Large Speech Model Based on LoRA. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 36–41. [Google Scholar]
- Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–26. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing—ICASSP, Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Jannet, M.A.B.; Galibert, O.; Adda-Decker, M.; Rosset, S. How to evaluate ASR output for named entity recognition? In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2015, Dresden, Germany, 6–10 September 2015; pp. 1289–1293. [Google Scholar]
- Arif, S.; Khan, A.J.; Abbas, M.; Raza, A.A.; Athar, A. WER We Stand: Benchmarking Urdu ASR Models. In Proceedings of the International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 5952–5961. [Google Scholar]
- Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493. [Google Scholar] [CrossRef] [PubMed]
- Oyedotun, O.K.; Papadopoulos, K.; Aouada, D. A new perspective for understanding generalization gap of deep neural networks trained with large batch sizes. Appl. Intell. 2023, 53, 15621–15637. [Google Scholar] [CrossRef]
- Zhuang, Z.; Liu, M.; Cutkosky, A.; Orabona, F. Understanding AdamW Through Proximal Methods and Scale-Freeness. arXiv 2022, arXiv:2202.00089. [Google Scholar]
- Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2017, arXiv:1609.04836. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Hoffer, E.; Hubara, I.; Soudry, D. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 1–12. [Google Scholar]
- Hu, J.Y.C.; Su, M.; Kuo, E.J.; Song, Z.; Liu, H. Computational Limits of Low-Rank Adaptation (LoRA) Fine-Tuning for Transformer Models. In Proceedings of the 13th International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025; pp. 54606–54645. [Google Scholar]
- Zhang, L.; Wu, S.; Wang, Z. LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices. Sensors 2025, 25, 5404. [Google Scholar] [CrossRef] [PubMed]
Variant | Blocks | Feature Size | Heads per Layer | Parameters | References |
---|---|---|---|---|---|
Tiny | 4 | 384 | 6 | 39 M | [14] |
Base | 6 | 512 | 8 | 74 M | [14] |
Small | 12 | 768 | 12 | 244 M | [14] |
Medium | 24 | 1024 | 16 | 769 M | [14] |
Large | 32 | 1280 | 20 | 1550 M | [14] |
Optimizer | Batch Size | Training Granularity | LoRA Configuration | WER (%) |
---|---|---|---|---|
AdamW | 8 | Batch-wise | R = 8, α = 16, dropout = 0.1 | 7.98 |
AdamW | 4 | Sample-by-sample | R = 8, α = 16, dropout = 0.1 | 8.02 |
Adam | 8 | Batch-wise | R = 8, α = 16, dropout = 0.1 | 8.45 |
Adam | 4 | Sample-by-sample | R = 8, α = 16, dropout = 0.1 | 6.08 |
Batch Size | Optimizer | Training Mode | LoRA Configuration | WER (%) |
---|---|---|---|---|
8 | AdamW | Batch-wise | R = 8, α = 16, dropout = 0.1 | 7.98 |
4 | AdamW | Sample-by-sample | R = 8, α = 16, dropout = 0.1 | 8.02 |
8 | Adam | Batch-wise | R = 8, α = 16, dropout = 0.1 | 8.45 |
4 | Adam | Sample-by-sample | R = 8, α = 16, dropout = 0.1 | 6.08 |
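For reference, the following hedged sketch (not the authors' training script) shows how the two optimizer settings compared above could be instantiated in PyTorch for the LoRA-wrapped model; the learning rate and weight-decay values are assumptions, as they are not part of the configuration listed in the table.

```python
# Hedged sketch: instantiating the two optimizers compared above.
# The learning rate (1e-4) and weight decay (0.01) are assumed values.
import torch
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = get_peft_model(
    base,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["q_proj", "v_proj"]),
)
trainable = [p for p in model.parameters() if p.requires_grad]  # only the LoRA matrices A and B

# Adam couples the weight-decay term with the adaptive gradient update (L2 penalty in the gradient).
adam = torch.optim.Adam(trainable, lr=1e-4, weight_decay=0.01)

# AdamW applies weight decay directly to the weights (decoupled), the key
# behavioural difference examined in Section 4.1.1.
adamw = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
```

In practice only one of the two optimizers would be constructed per run; both are shown here to highlight that a single line separates the two configurations.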
Proposed by | Model Used | Dataset Used | WER |
---|---|---|---|
[13] | | | |
[14] | | | |
[15] | | | |
[16] | | | |
[17] | | | |
[18] | | | |
[19] | | | |
[20] | | | |
[21] | | | |
[22] | | | |
[23] | | | |
Proposed Model | Whisper Small fine-tuned with LoRA | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).