Proceeding Paper

Comparative Analysis of Energy Consumption and Carbon Footprint in Automatic Speech Recognition Systems: A Case Study Comparing Whisper and Google Speech-to-Text †

by
Jalal El Bahri
1,
Mohamed Kouissi
2 and
Mohammed Achkari Begdouri
2
1
SIGL Research Laboratory, ENSATE of Tetouan, Abdelmalek Essaâdi University, Tetouan 93040, Morocco
2
DSAI2S Research Team, C35 Laboratory, FST of Tangier, Abdelmalek Essaâdi University, Tangier 90000, Morocco
Presented at the International Conference on Sustainable Computing and Green Technologies (SCGT’2025), Larache, Morocco, 14–15 May 2025.
Comput. Sci. Math. Forum 2025, 10(1), 6; https://doi.org/10.3390/cmsf2025010006
Published: 16 June 2025

Abstract

This study investigates the energy consumption and carbon footprint of two prominent automatic speech recognition (ASR) systems: OpenAI’s Whisper and Google’s Speech-to-Text API. We evaluate local and cloud-based speech recognition approaches on a public Kaggle dataset of 20,000 short Urdu audio clips, using CodeCarbon, pyJoules, and PowerAPI for comprehensive energy profiling. Our analysis reveals substantial differences between the two systems in energy efficiency and carbon emissions, with the cloud-based solution showing a markedly lower environmental impact despite comparable accuracy. We discuss the implications of these findings for sustainable AI deployment and for minimizing the ecological footprint of speech recognition technologies.

1. Introduction

Automatic speech recognition (ASR) systems convert spoken language into text and are integral to large-scale speech transcription and Interactive Voice Response (IVR) applications such as virtual assistants, transcription services, and customer service automation. ASR is computationally expensive, particularly during the data-intensive training phase; yet the operational (inference) phase, now widespread across devices and platforms, remains understudied in terms of carbon footprint [1].
This study focuses on comparing two fundamentally different ASR approaches:
Locally run open-source models represented by OpenAI’s Whisper, and cloud-based services represented by Google’s Speech-to-Text API. These systems represent two distinct architectural philosophies and deployment paradigms, leading to significant differences in environmental impacts [2].
We have two primary aims for this work:
  • To measure and compare the energy consumption and carbon footprint of Whisper and Google Speech-to-Text when processing a standardized audio dataset;
  • To analyze the relationship between performance metrics (accuracy, speed) and environmental impact, to provide recommendations for sustainable ASR applications deployment [3,4].

2. Materials and Methodology

2.1. Dataset Selection

For our experiments, we selected a public dataset from Kaggle titled “Urdu 20000 Audio Dataset with Transcription” [5].
This dataset was chosen for its diversity and representativeness of real-world ASR use cases. It provides a diverse range of speakers, accents, and audio qualities for a particular language, making it suitable for a robust comparison.
The dataset characteristics are as follows:
  • Total files: 20,000 audio clips;
  • Duration range: 1 to 10 s per clip;
  • Mean duration: 4 s;
  • Total audio duration: Approximately 22.2 h;
  • Format: WAV files (16 kHz, 16-bit PCM);
  • Language: Urdu (Pakistan).
This dataset, with its short clips being particularly relevant for common voice command and dictation scenarios, provides a realistic test bed for ASR evaluation.
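The headline figures above are internally consistent and can be checked with simple arithmetic (no access to the actual Kaggle files is assumed here):

```python
# Sanity check of the dataset statistics quoted above.
NUM_CLIPS = 20_000
MEAN_DURATION_S = 4  # reported mean clip length, in seconds

total_seconds = NUM_CLIPS * MEAN_DURATION_S
total_hours = total_seconds / 3600
print(f"Total audio: {total_hours:.1f} h")  # ~22.2 h, matching the dataset card
```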

2.2. Overview of the ASR Systems

A conventional ASR pipeline typically comprises three stages:
  • Speech processing [6]: Enhances audio quality through noise reduction, filtering, and normalization, making speech clearer for subsequent processing.
  • Acoustic Model (AM) [7]: A feed-forward deep neural network (DNN) that processes raw audio waveforms to predict phoneme probabilities. The AM is trained on hundreds of hours of transcribed audio recordings, making it computationally intensive and requiring GPUs for training.
  • Language Model (LM) [7]: Generally based on n-gram statistics, the LM predicts word sequences based on linguistic structure. The LM is trained on millions of textual phrases and is computationally lightweight, typically running on CPUs.
During inference, the AM and LM work together to generate transcriptions from audio input.
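To illustrate how the two models interact at inference time, a simplified decoder can rescore candidate transcriptions by combining AM and LM log-probabilities. The candidate strings, probabilities, and weight below are illustrative placeholders, not values from either evaluated system:

```python
import math

def rescore(candidates, lm_weight=0.5):
    """Pick the transcription maximizing log P_AM + lm_weight * log P_LM.

    `candidates` maps a transcription to (acoustic_prob, language_prob);
    both probabilities here are illustrative placeholders.
    """
    def score(item):
        _, (p_am, p_lm) = item
        return math.log(p_am) + lm_weight * math.log(p_lm)

    best, _ = max(candidates.items(), key=score)
    return best

candidates = {
    "recognize speech": (0.40, 0.30),
    "wreck a nice beach": (0.45, 0.02),  # acoustically close, linguistically unlikely
}
print(rescore(candidates))  # the LM term favors "recognize speech"
```

This is why a computationally light LM can noticeably improve output quality: it vetoes acoustically plausible but linguistically improbable hypotheses.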
We evaluated two prominent ASR systems representing different approaches.
Whisper (OpenAI) [8]:
  • Architecture: A transformer-based end-to-end model trained on large-scale multilingual data. Uses an encoder–decoder architecture [9];
  • Training: Trained on 680,000 h of diverse multilingual audio scraped from the internet;
  • Version: Base model (74 M parameters);
  • Deployment: Local execution on user hardware;
  • Implementation: Official Python implementation with PyTorch backend;
  • Configuration: Default settings, beam size 5.
Google Speech-to-Text [10]:
  • Architecture: Uses a traditional deep neural network (DNN)-based ASR pipeline. Likely follows a hybrid approach combining acoustic models, pronunciation models, and language models.
  • Training: Trained on a large proprietary dataset, but exact details are undisclosed.
  • Version: v2023.11.
  • Deployment: Cloud-based API.
  • Endpoint: Global (europe-west1).
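A minimal sketch of how each system can be invoked from Python with the configurations listed above. The file paths are illustrative; the `openai-whisper` and `google-cloud-speech` packages and valid Google Cloud credentials are assumed (imports are deferred so the sketch loads without them):

```python
def transcribe_with_whisper(audio_path, model_name="base"):
    """Local inference with OpenAI's Whisper (assumes `pip install openai-whisper`)."""
    import whisper  # imported lazily so the sketch loads without the package
    model = whisper.load_model(model_name)  # "base" = 74M-parameter model
    result = model.transcribe(audio_path, language="ur", beam_size=5)
    return result["text"]

def transcribe_with_google(audio_path):
    """Cloud inference with Google Speech-to-Text v1 (assumes credentials are configured)."""
    from google.cloud import speech  # assumes `pip install google-cloud-speech`
    client = speech.SpeechClient()
    with open(audio_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # 16-bit PCM WAV
        sample_rate_hertz=16000,
        language_code="ur-PK",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```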

2.3. The Measurement Tools

We employed multiple measurement tools to ensure comprehensive and reliable energy and carbon footprint assessments [11]:
  • CodeCarbon (v2.8.3) [12]:
    • Purpose: Track CO2 emissions based on energy consumption;
    • Implementation: Python package integrated with ASR processing code;
    • Metrics: kWh consumption, CO2 emissions (kg).
  • pyJoules (v0.15.0) [13]:
    • Purpose: Fine-grained energy profiling of CPU, GPU, and RAM;
    • Implementation: Part of the PowerAPI initiative; used via Python decorators and context managers;
    • Metrics: Energy consumption per component (Joules, converted to kWh).
  • PowerAPI (v2.3.1) [14]:
    • Purpose: Enables process-level power monitoring with high sampling rate;
    • Implementation: Standalone monitoring process synchronized with ASR tasks;
    • Metrics: Power consumption over time (Watts), cumulative energy (kWh).
  • Carbon Algorithms [15]:
    • Purpose: Theoretical estimation of carbon emissions;
    • Implementation: Mathematical models based on published methodologies;
    • Metrics: Estimated CO2 emissions (kg);
    • Related tools: [16,17,18].
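Because the tools report energy in different units (pyJoules in joules; CodeCarbon and PowerAPI in kWh), readings must be normalized to a common unit before comparison. The conversion is straightforward; the sample reading below is illustrative:

```python
def joules_to_kwh(joules: float) -> float:
    """1 kWh = 3.6 MJ, so divide joule readings (e.g. from pyJoules) by 3.6e6."""
    return joules / 3_600_000

# Example: a pyJoules-style reading of 1.836 MJ for a transcription run
print(joules_to_kwh(1_836_000))  # 0.51 kWh
```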

2.4. Experimental Setup

Below is the configuration of our local development environment.
Hardware Configuration:
  • CPU: Intel i7-12800H (16 cores, 32 threads);
  • GPU: NVIDIA RTX A1000 (16 GB VRAM);
  • RAM: 32 GB DDR4-3200.
Software Environment:
  • Python: 3.12.10;
  • PyTorch: 2.0.1+cu118;
  • CUDA: 11.8;
  • Google Cloud SDK: 447.0.0;
  • Whisper: latest official OpenAI implementation;
  • Google Speech-to-Text: Python client library v1.
Measurement Procedures:
For consistency, we performed three runs on the same standard development machine and averaged the measurements from each tool. Calling Google Cloud STT from our local Python environment allowed for both local and remote (Google Cloud) measurements.
1. Energy Consumption Estimation:
  • Local processing: CPU, GPU, and RAM energy recorded separately via pyJoules and CodeCarbon, with PowerAPI for system-wide power;
  • Cloud processing: energy estimated from published cloud workload models and the Green Algorithms calculator, using Google’s published Power Usage Effectiveness (PUE) value [19] (PUE = 1.1, indicating high efficiency).
2. Carbon Footprint:
  • Local emissions: calculated using the regional electricity carbon intensity; based on the latest statistics for France: 56 g CO2e/kWh [20];
  • Cloud emissions: calculated using the Google Cloud Carbon Footprint tool [21], with Google’s reported carbon intensity of 122 g CO2e/kWh for europe-west1 [22].
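Both emission figures reduce to a single formula, emissions = energy × carbon intensity, with cloud energy further scaled by the data center’s PUE. The carbon intensities and PUE below are the ones cited above; the energy values are illustrative, not the measured results:

```python
def emissions_kg(energy_kwh: float, intensity_g_per_kwh: float, pue: float = 1.0) -> float:
    """CO2e in kg: IT energy x PUE x grid carbon intensity (g/kWh converted to kg)."""
    return energy_kwh * pue * intensity_g_per_kwh / 1000

# Local run in France (56 g CO2e/kWh; no PUE adjustment for a workstation)
local = emissions_kg(0.5, 56)             # 0.028 kg CO2e
# Cloud run in europe-west1 (122 g CO2e/kWh; Google PUE = 1.1)
cloud = emissions_kg(0.32, 122, pue=1.1)  # ~0.043 kg CO2e
print(local, cloud)
```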

3. Results and Discussion

3.1. Results

3.1.1. Performance Metrics

The Word Error Rate (WER) [23] is the standard metric for evaluating automatic speech recognition (ASR) systems. It measures how different the transcribed text is from the reference (ground truth) transcription (provided in the dataset).
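WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal implementation (the example sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance, one row at a time.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(
                prev_row[j] + 1,             # deletion
                cur_row[j - 1] + 1,          # insertion
                prev_row[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev_row = cur_row
    return prev_row[-1] / len(ref)

print(wer("the cat sat", "the cat sat on"))  # one insertion over 3 words -> 0.333...
```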
Both ASR systems were evaluated for accuracy and processing efficiency before analyzing their energy consumption (Table 1):
Google Speech-to-Text demonstrated superior performance across all metrics, with 13.4% lower WER and 51% faster processing time compared to Whisper.
The Google cloud-based solution showed particular advantages for shorter audio clips, due to optimized batch processing capabilities.

3.1.2. Energy Consumption

Based on Google’s published efficiency metrics and computational models, we estimated the energy consumption in Google’s data centers (Table 2):
Total cloud energy: 0.35 kWh (via the Cloud Carbon Footprint tool, based on the PUE).
Local Energy:
Theoretical energy estimation via Green Algorithms:
  • Whisper: 0.59 kWh;
  • Google STT: 0.298 kWh.

3.1.3. Carbon Footprint

The carbon footprint was calculated by applying appropriate carbon intensity factors to the measured energy consumption (Table 3).

3.2. Discussion

The theoretical estimates are higher than the actual measurements. This is expected: the theoretical models assume that the available hardware resources are fully utilized, which is often not the case in practice.
The substantial difference in energy consumption and carbon footprint between Whisper and Google Speech-to-Text can be attributed to several key factors, including hardware optimization, datacenter efficiency, and especially renewable energy usage.
These findings should broadly apply to other languages, as modern ASR models (e.g., Whisper, Google STT) are trained on multilingual data and designed for cross-lingual use. However, results may vary with language complexity, accent diversity, and audio duration.

4. Conclusions and Perspectives

This study compared local and cloud-based speech recognition systems in terms of energy consumption and carbon footprint. When processing short audio clips, Google Speech-to-Text consumed approximately 51% less energy and produced 42% fewer emissions than locally run Whisper.
However, these environmental benefits of cloud-based ASR (especially Google implementation) must be weighed against other important considerations like privacy, cost, customizability, and internet dependency. The optimal choice will depend on the specific requirements and constraints of each real-life application.
Our methodological approach, employing multiple complementary measurement tools, provides a solid base for assessing the environmental impact of AI systems applicable to domains beyond speech recognition.

Perspectives and Future Work

For further work, we could investigate the adoption of hybrid solutions to optimize the use of local and cloud resources for ASR in order to maintain control and security over sensitive data held on-premises, while leveraging the scalability and processing power of cloud services for tasks such as large-scale audio transcription. Crucially, future research should also quantify the carbon footprint of these hybrid approaches, considering the energy consumption of both local and cloud infrastructure.

Author Contributions

Conceptualization, J.E.B.; methodology, J.E.B. and M.K.; software, J.E.B.; validation, M.K.; formal analysis, J.E.B.; investigation, J.E.B.; resources, J.E.B.; data curation, J.E.B.; writing—original draft preparation, J.E.B.; writing—review and editing, J.E.B.; visualization, J.E.B.; supervision, J.E.B.; project administration, J.E.B. and M.A.B.; funding acquisition, J.E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is publicly available on Kaggle [5].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mehrish, A.; Majumder, N.; Bhardwaj, R.; Mihalcea, R.; Poria, S. A Review of Deep Learning Techniques for Speech Processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
  2. Vishwanath, A.; Jalali, F.; Hinton, K.; Alpcan, T.; Ayre, R.W.A.; Tucker, R.S. Energy Consumption Comparison of Interactive Cloud-Based and Local Applications. IEEE J. Sel. Areas Commun. 2015, 33, 616–626. [Google Scholar] [CrossRef]
  3. Henderson, P.; Hu, J.; Romoff, J.; Brunskill, E.; Jurafsky, D.; Pineau, J. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. J. Mach. Learn. Res. 2020, 21, 1–43. [Google Scholar]
  4. Anthony, L.F.W.; Kanding, B.; Selvan, R. Tracking and Predicting the Carbon Footprint of Training Deep Learning Models. arXiv 2020, arXiv:2007.03051. [Google Scholar]
  5. Urdu Audio Dataset with Transcription-20000 File. Available online: https://www.kaggle.com/datasets/muhammadahmedansari/urdu-dataset-20000 (accessed on 10 January 2025).
  6. Available online: https://research.google/research-areas/speech-processing/ (accessed on 15 January 2025).
  7. Lei, Z.; Xu, M.; Han, S.; Liu, L.; Huang, Z.; Ng, T.; Zhang, Y.; Pusateri, E.; Hannemann, M.; Deng, Y. Acoustic Model Fusion for End-to-end Speech Recognition. arXiv 2023, arXiv:2310.07062. [Google Scholar]
  8. Whisper. Available online: https://openai.com/index/whisper/ (accessed on 15 January 2025).
  9. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  10. Speech-to-Text. Available online: https://cloud.google.com/speech-to-text?%20ML%20~%20Speech-to-Text-KWID_43700066957023758-userloc_ (accessed on 15 January 2025).
  11. Heguerte, L.B.; Bugeau, A.; Lannelongue, L. How to estimate carbon footprint when training deep learning models? A guide and review. arXiv 2023, arXiv:2306.08323v2. [Google Scholar]
  12. Available online: https://codecarbon.io (accessed on 15 January 2025).
  13. Available online: https://github.com/powerapi-ng/pyJoules (accessed on 15 January 2025).
  14. Available online: https://powerapi.org (accessed on 15 January 2025).
  15. Lannelongue, L.; Grealey, J.; Inouye, M. Green Algorithms: Quantifying the Carbon Footprint of Computation. Available online: https://advanced.onlinelibrary.wiley.com/doi/epdf/10.1002/advs.202100707 (accessed on 15 January 2025).
  16. Available online: https://mlco2.github.io/impact/ (accessed on 15 January 2025).
  17. Cloud Carbon Footprint. Cloud Carbon Footprint Free and Open Source: Cloud Carbon Emissions Measurement and Analysis Tool. Available online: https://www.cloudcarbonfootprint.org (accessed on 15 January 2025).
  18. Patterson, D.; Gonzalez, J.; Le, Q.; Liang, C.; Munguia, L.-M.; Rothchild, D.; So, D.; Texier, M.; Dean, J. Carbon Emissions and Large Neural Network Training. 21 April 2021. Available online: https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf (accessed on 15 January 2025).
  19. Available online: https://datacenters.google/efficiency/ (accessed on 15 January 2025).
  20. Statista. France: Power Sector Carbon Intensity 2023. July 2024. Available online: https://www.statista.com/statistics/1290216/carbon-intensity-power-sector-france/ (accessed on 15 January 2025).
  21. Carbon Free Energy for Google Cloud Regions. Available online: https://cloud.google.com/sustainability/region-carbon (accessed on 15 January 2025).
  22. Available online: https://cloud.google.com/carbon-footprint (accessed on 15 January 2025).
  23. Aksënova, A.; van Esch, D.; Flynn, J.; Golik, P. How Might We Create Better Benchmarks for Speech Recognition? In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Bangkok, Thailand, 5–6 August 2021. [Google Scholar] [CrossRef]
Table 1. The averaged performance metrics measured for Whisper and Google STT.

Metric | Google Speech-to-Text | Whisper
Word Error Rate (WER) | 14.9% | 17.2%
Total Processing Time | 122 min | 238 min
Processing Time per Audio Hour | 5.4 min/h | 10.7 min/h
Table 2. The energy consumption (kWh) measured by each tool.

System | PowerAPI | pyJoules | CodeCarbon
Whisper | 0.43 kWh | 0.51 kWh | 0.53 kWh
Google STT | 0.21 kWh | 0.19 kWh | 0.25 kWh
Table 3. Average local vs. cloud carbon emissions (kg CO2e) for both ASR systems.

System | Local Emissions (kg CO2) | Cloud Emissions (kg CO2) | Total Emissions (kg CO2)
Whisper | 0.38 | 0 | 0.38
Google Speech-to-Text | 0.002 | 0.02 | 0.022
