Comparative Analysis of Energy Consumption and Carbon Footprint in Automatic Speech Recognition Systems: A Case Study Comparing Whisper and Google Speech-to-Text †
Abstract
1. Introduction
- To measure and compare the energy consumption and carbon footprint of Whisper and Google Speech-to-Text when processing a standardized audio dataset;
2. Materials and Methodology
2.1. Dataset Selection
- Total files: 20,000 audio clips;
- Duration range: 1 to 10 s per clip;
- Mean duration: 4 s;
- Total audio duration: Approximately 22.2 h;
- Format: WAV files (16 kHz, 16-bit PCM);
- Language: Urdu (Pakistan).
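As a quick sanity check, the stated dataset figures are mutually consistent: 20,000 clips at a 4 s mean duration give roughly 22.2 h of audio. A minimal arithmetic sketch:

```python
# Sanity-check the dataset totals: 20,000 clips with a 4 s mean duration
# should amount to roughly 22.2 hours of audio.
n_clips = 20_000
mean_duration_s = 4

total_seconds = n_clips * mean_duration_s
total_hours = total_seconds / 3600

print(round(total_hours, 1))  # -> 22.2
```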
2.2. Overview of the ASR Systems
- Speech processing [6]: Enhances audio quality through noise reduction, filtering, and normalization, making speech clearer for subsequent processing.
- Acoustic Model (AM) [7]: A feed-forward deep neural network (DNN) that processes raw audio waveforms to predict phoneme probabilities. The AM is trained on hundreds of hours of transcribed audio recordings, making it computationally intensive and requiring GPUs for training.
- Language Model (LM) [7]: Generally based on n-gram statistics, the LM predicts word sequences based on linguistic structure. The LM is trained on millions of textual phrases and is computationally lightweight, typically running on CPUs.
- Architecture: A transformer-based end-to-end model trained on large-scale multilingual data. Uses an encoder–decoder architecture [9];
- Training: Trained on 680,000 h of diverse multilingual audio scraped from the internet;
- Version: Base model (74 M parameters);
- Deployment: Local execution on user hardware;
- Implementation: Official Python implementation with PyTorch backend;
- Configuration: Default settings, beam size 5.
- Architecture: Uses a traditional deep neural network (DNN)-based ASR pipeline. Likely follows a hybrid approach combining acoustic models, pronunciation models, and language models.
- Training: Trained on a large proprietary dataset, but exact details are undisclosed.
- Version: v2023.11.
- Deployment: Cloud-based API.
- Endpoint: Global (europe-west1).
2.3. The Measurement Tools
- CodeCarbon (v2.8.3) [12]:
- Purpose: Track CO2 emissions based on energy consumption;
- Implementation: Python package integrated with ASR processing code;
- Metrics: kWh consumption, CO2 emissions (kg).
- PyJoules (v0.15.0) [13]:
- Purpose: Fine-grained energy profiling of CPU, GPU, and RAM;
- Implementation: Python decorators and context managers (part of the PowerAPI initiative);
- Metrics: Energy consumption per component (Joules, converted to kWh).
- PowerAPI (v2.3.1) [14]:
- Purpose: Enables process-level power monitoring with high sampling rate;
- Implementation: Standalone monitoring process synchronized with ASR tasks;
- Metrics: Power consumption over time (Watts), cumulative energy (kWh).
- Green Algorithms [15]:
2.4. Experimental Setup
- CPU: Intel i7-12800H (16 cores, 32 threads);
- GPU: NVIDIA RTX A1000 (16 GB VRAM);
- RAM: 32 GB DDR4-3200.
- Python: 3.12.10;
- PyTorch: 2.0.1+cu118;
- CUDA: 11.8;
- Google Cloud SDK: 447.0.0;
- Whisper: OpenAI’s latest official implementation;
- Google Speech-to-Text: Python client library v1.
1. Energy Consumption Estimation;
2. Carbon Footprint.
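The carbon footprint follows from measured energy via the standard relation: emissions equal energy consumed multiplied by the grid's carbon intensity. A minimal sketch of this conversion, where the intensity value in the example is an illustrative placeholder rather than the study's exact region-specific factor:

```python
# Convert measured energy (kWh) into CO2 emissions (kg) using the grid's
# carbon intensity. The intensity below is a hypothetical example; the study
# uses region-specific factors (e.g., France's grid for local runs and the
# Google Cloud region mix for cloud runs).

def emissions_kg(energy_kwh: float, intensity_kg_per_kwh: float) -> float:
    """kg CO2 = kWh consumed x kg CO2 emitted per kWh of grid power."""
    return energy_kwh * intensity_kg_per_kwh

# Hypothetical example: 0.5 kWh on a low-carbon grid (0.056 kg CO2/kWh)
print(round(emissions_kg(0.5, 0.056), 4))  # -> 0.028
```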
3. Results and Discussion
3.1. Results
3.1.1. Performance Metrics
3.1.2. Energy Consumption
- Whisper: 0.59 kWh;
- Google STT: 0.298 kWh.
3.1.3. Carbon Footprint
3.2. Discussion
4. Conclusions and Perspectives
Perspectives and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mehrish, A.; Majumder, N.; Bhardwaj, R.; Mihalcea, R.; Poria, S. A Review of Deep Learning Techniques for Speech Processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
- Vishwanath, A.; Jalali, F.; Hinton, K.; Alpcan, T.; Ayre, R.W.A.; Tucker, R.S. Energy Consumption Comparison of Interactive Cloud-Based and Local Applications. IEEE J. Sel. Areas Commun. 2015, 33, 616–626. [Google Scholar] [CrossRef]
- Henderson, P.; Hu, J.; Romoff, J.; Brunskill, E.; Jurafsky, D.; Pineau, J. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. J. Mach. Learn. Res. 2020, 21, 1–43. [Google Scholar]
- Anthony, L.F.W.; Kanding, B.; Selvan, R. Tracking and Predicting the Carbon Footprint of Training Deep Learning Models. arXiv 2020, arXiv:2007.03051. [Google Scholar]
- Urdu Audio Dataset with Transcription-20000 File. Available online: https://www.kaggle.com/datasets/muhammadahmedansari/urdu-dataset-20000 (accessed on 10 January 2025).
- Available online: https://research.google/research-areas/speech-processing/ (accessed on 15 January 2025).
- Lei, Z.; Xu, M.; Han, S.; Liu, L.; Huang, Z.; Ng, T.; Zhang, Y.; Pusateri, E.; Hannemann, M.; Deng, Y. Acoustic Model Fusion for End-to-end Speech Recognition. arXiv 2023, arXiv:2310.07062. [Google Scholar]
- Whisper. Available online: https://openai.com/index/whisper/ (accessed on 15 January 2025).
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
- Speech-to-Text. Available online: https://cloud.google.com/speech-to-text (accessed on 15 January 2025).
- Heguerte, L.B.; Bugeau, A.; Lannelongue, L. How to estimate carbon footprint when training deep learning models? A guide and review. arXiv 2023, arXiv:2306.08323v2. [Google Scholar]
- Available online: https://codecarbon.io (accessed on 15 January 2025).
- Available online: https://github.com/powerapi-ng/pyJoules (accessed on 15 January 2025).
- Available online: https://powerapi.org (accessed on 15 January 2025).
- Lannelongue, L.; Grealey, J.; Inouye, M. Green Algorithms: Quantifying the Carbon Footprint of Computation. Adv. Sci. 2021, 8, 2100707. Available online: https://advanced.onlinelibrary.wiley.com/doi/epdf/10.1002/advs.202100707 (accessed on 15 January 2025). [Google Scholar]
- Available online: https://mlco2.github.io/impact/ (accessed on 15 January 2025).
- Cloud Carbon Footprint. Cloud Carbon Footprint Free and Open Source: Cloud Carbon Emissions Measurement and Analysis Tool. Available online: https://www.cloudcarbonfootprint.org (accessed on 15 January 2025).
- Patterson, D.; Gonzalez, J.; Le, Q.; Liang, C.; Munguia, L.-M.; Rothchild, D.; So, D.; Texier, M.; Dean, J. Carbon Emissions and Large Neural Network Training. 21 April 2021. Available online: https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf (accessed on 15 January 2025).
- Available online: https://datacenters.google/efficiency/ (accessed on 15 January 2025).
- Statista. France: Power Sector Carbon Intensity 2023. July 2024. Available online: https://www.statista.com/statistics/1290216/carbon-intensity-power-sector-france/ (accessed on 15 January 2025).
- Carbon Free Energy for Google Cloud Regions. Available online: https://cloud.google.com/sustainability/region-carbon (accessed on 15 January 2025).
- Available online: https://cloud.google.com/carbon-footprint (accessed on 15 January 2025).
- Aksënova, A.; van Esch, D.; Flynn, J.; Golik, P. How Might We Create Better Benchmarks for Speech Recognition? In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Bangkok, Thailand, 5–6 August 2021. [Google Scholar] [CrossRef]
| Metrics | Google Speech-to-Text | Whisper |
|---|---|---|
| Word Error Rate (WER) | 14.9% | 17.2% |
| Total Processing Time | 122 min | 238 min |
| Processing Time per Audio Hour | 5.4 min/h | 10.7 min/h |
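The per-audio-hour rates follow directly from total processing time divided by total audio duration; a quick check against the reported totals (small discrepancies reflect rounding of the ~22.2 h figure):

```python
# Derive processing time per audio hour from the reported totals.
total_audio_hours = 22.2      # approximate dataset duration
whisper_minutes = 238         # total Whisper processing time
google_minutes = 122          # total Google STT processing time

print(round(whisper_minutes / total_audio_hours, 1))  # ~10.7 min/h
print(round(google_minutes / total_audio_hours, 1))   # ~5.5 min/h (reported as 5.4)
```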
| System | PowerAPI | PyJoules | CodeCarbon |
|---|---|---|---|
| Whisper | 0.43 kWh | 0.51 kWh | 0.53 kWh |
| Google STT | 0.21 kWh | 0.19 kWh | 0.25 kWh |
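Although the three profilers report somewhat different absolute readings, the relative ordering is stable: Whisper draws roughly two to three times more energy than Google STT under every tool. A small sketch computing the per-tool ratios from the table:

```python
# Per-tool energy readings (kWh) taken from the measurement table:
# (Whisper, Google STT) pairs for each profiler.
readings = {
    "PowerAPI":   (0.43, 0.21),
    "PyJoules":   (0.51, 0.19),
    "CodeCarbon": (0.53, 0.25),
}

for tool, (whisper_kwh, google_kwh) in readings.items():
    ratio = whisper_kwh / google_kwh
    print(f"{tool}: Whisper/Google = {ratio:.2f}x")  # roughly 2.0x-2.7x
```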
| System | Local Emissions (kg CO2) | Cloud Emissions (kg CO2) | Total Emissions (kg CO2) |
|---|---|---|---|
| Whisper | 0.38 | 0 | 0.38 |
| Google Speech-to-Text | 0.002 | 0.02 | 0.022 |
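The totals column is simply local plus cloud emissions, and the resulting gap between the two systems is roughly an order of magnitude. A quick verification of the table's arithmetic:

```python
# Verify the totals column (local + cloud) and the Whisper/Google gap.
whisper_total = 0.38 + 0        # all Whisper emissions occur locally
google_total = 0.002 + 0.02     # Google STT emissions are mostly cloud-side

print(whisper_total, round(google_total, 3))
print(round(whisper_total / google_total, 1))  # ~17.3x higher for Whisper
```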
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
El Bahri, J.; Kouissi, M.; Achkari Begdouri, M. Comparative Analysis of Energy Consumption and Carbon Footprint in Automatic Speech Recognition Systems: A Case Study Comparing Whisper and Google Speech-to-Text. Comput. Sci. Math. Forum 2025, 10, 6. https://doi.org/10.3390/cmsf2025010006