Modern Speech Recognition for Romanian Language
Abstract
1. Introduction
- Public release of two Romanian resources: the 378 h Echo benchmark (https://huggingface.co/datasets/upb-nlp/echo; accessed on 30 December 2025) and the 9000 h CRoWL corpus (https://huggingface.co/datasets/upb-nlp/crowl-speech; accessed on 30 December 2025).
- Competitive Romanian baselines with wav2vec 2.0 and Conformer, reaching a WER as low as 3.01% (https://huggingface.co/upb-nlp/ro-wav2vec2 and https://huggingface.co/upb-nlp/ro-fast-conformer; accessed on 30 December 2025).
- Open-source plans for dataset creation and model fine-tuning (https://github.com/upb-nlp/speech-aio; accessed on 8 February 2026), including the CRoWL crawling pipeline (https://github.com/upb-nlp/crowl-speech; accessed on 8 February 2026).
2. Related Work
2.1. English ASR Models
2.2. Romanian Datasets and ASR Models
3. Method
3.1. Datasets
3.1.1. Echo Dataset
3.1.2. CRoWL: A Weakly Supervised Dataset
CRoWL Processing Pipeline
Data Processing and Normalization
- Audio standardization: All audio is converted to mono 16 kHz WAV.
- Text normalization:
  - Lowercasing;
  - Whitespace normalization and removal of non-linguistic symbols frequent in crawled data;
  - Romanian diacritics are preserved; legacy cedilla forms (ş, ţ) are mapped to the Unicode-compliant comma-below forms (ș, ț);
  - Punctuation is removed for WER computation to avoid penalizing formatting;
  - Digits are expanded into their full word equivalents in Romanian (e.g., 10 → zece).
- Audio–text consistency filters:
  - Characters-per-second (CPS): We compute CPS as the number of transcript characters divided by the audio duration in seconds and retain utterances with CPS in [1.2, 35.5]. Values outside this range usually indicate misalignment, non-speech regions, or truncated transcripts;
  - Duration: We keep segments in the [1 s, 80 s] range;
  - Trimming: We remove leading/trailing non-speech using an energy/VAD trimming pass and discard clips with excessive remaining silence.
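The normalization and filtering rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the exact pipeline code: the regexes and the digit-expansion lexicon (only a few number words shown) are assumptions.

```python
import re
import unicodedata

# Legacy cedilla diacritics -> Unicode-compliant comma-below forms.
LEGACY_DIACRITICS = str.maketrans({"ş": "ș", "ţ": "ț", "Ş": "Ș", "Ţ": "Ț"})
# Partial digit-expansion lexicon, for illustration only.
DIGIT_WORDS = {"0": "zero", "1": "unu", "2": "doi", "10": "zece"}

def normalize_transcript(text: str) -> str:
    text = text.translate(LEGACY_DIACRITICS)
    text = text.lower()
    # Expand standalone digit tokens into Romanian words where known.
    text = re.sub(r"\b\d+\b", lambda m: DIGIT_WORDS.get(m.group(), m.group()), text)
    # Drop punctuation and symbols; keep letters (incl. diacritics) and spaces.
    text = "".join(c for c in text if not unicodedata.category(c).startswith(("P", "S")))
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def passes_filters(text: str, duration_s: float) -> bool:
    """Apply the duration and CPS consistency filters described above."""
    if not (1.0 <= duration_s <= 80.0):
        return False
    cps = len(text) / duration_s
    return 1.2 <= cps <= 35.5
```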
3.1.3. Consolidated Echo + CRoWL Test Set
3.2. Training ASR Models for Romanian
3.2.1. Model Selection
3.2.2. Training and Fine-Tuning
- Transfer learning: The fundamental strategy is to leverage the rich representations learned by models pre-trained on vast datasets. XLS-R, for instance, has already been exposed to thousands of hours of Romanian speech during its pre-training phase, providing a strong starting point.
- Layer freezing: In some low-data regimes, it can be beneficial to freeze the initial layers of the pre-trained model (e.g., the feature encoder) and only fine-tune the upper layers (e.g., the Transformer blocks and the final classification head). This helps preserve the general acoustic representations learned during pre-training while adapting the task-specific layers. The wav2vec 2.0 paper notes that the feature encoder is not trained during fine-tuning.
- Data augmentation: Techniques like SpecAugment [24], which involve masking frequency bands and time steps in the spectrogram, are often applied during fine-tuning to improve model robustness and prevent overfitting, especially with limited data.
- Tokenizer: For character-based models like wav2vec 2.0-CTC (as used for LibriSpeech), the output vocabulary consists of characters. For Conformer models that might use sub-word units, a Romanian-specific tokenizer or adaptation of a multilingual tokenizer would be necessary.
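To illustrate the tokenizer point, a character-level CTC vocabulary for Romanian can be built directly from normalized transcripts. The special tokens and their ordering below follow common wav2vec 2.0-CTC conventions and are assumptions, not the exact vocabulary used in the paper.

```python
def build_ctc_vocab(transcripts):
    """Build a character-level CTC vocabulary from normalized Romanian text.

    Assumed conventions: <pad> doubles as the CTC blank, and '|' replaces
    the space as the word delimiter (as in common wav2vec 2.0 setups).
    """
    chars = sorted({c for t in transcripts for c in t if c != " "})
    vocab = {"<pad>": 0, "<unk>": 1, "|": 2}
    for c in chars:
        vocab[c] = len(vocab)
    return vocab
```

Built from Romanian text, the vocabulary naturally includes the diacritic characters (ă, â, î, ș, ț) that a purely English character set would miss.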
3.2.3. Experimental Setup
1. Echo only: The models were fine-tuned solely on Echo, a fully supervised dataset.
2. Echo + CRoWL: The models were fine-tuned on Echo and CRoWL, incorporating both fully supervised and weakly supervised learning approaches.
wav2vec 2.0
Conformer
Training Configurations
3.3. Evaluation Metric
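Word Error Rate (WER), the metric reported throughout, is the word-level Levenshtein distance between hypothesis and reference divided by the number of reference words. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j], one row kept at a time.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                               # deletion
                dp[j - 1] + 1,                           # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
    return dp[-1] / len(ref)
```

For example, one substituted word out of a three-word reference yields a WER of 1/3 ≈ 33.3%.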
4. Results
5. Discussion
5.1. Analysis of wav2vec 2.0 Performance
5.2. Analysis of Conformer Performance
5.3. Limitations
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ASR | Automatic Speech Recognition |
| CNN | Convolutional Neural Network |
| CTC | Connectionist Temporal Classification |
| DNN | Deep Neural Network |
| HMM | Hidden Markov Model |
| LM | Language Model |
| RNN | Recurrent Neural Network |
| WER | Word Error Rate |
References
- Eberhard, D.M.; Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 26th ed.; SIL International (SIL Global Publishing): Dallas, TX, USA, 2023. [Google Scholar]
- Posner, R. Romanian Language. Encyclopaedia Britannica. 2026. Available online: https://www.britannica.com/topic/Romanian-language (accessed on 7 February 2026).
- Ungureanu, D.; Toma, S.A.; Filip, I.D.; Mocanu, B.C.; Aciobăniței, I.; Marghescu, B.; Balan, T.; Dascalu, M.; Bica, I.; Pop, F. ODIN112–AI-Assisted Emergency Services in Romania. Appl. Sci. 2023, 13, 639. [Google Scholar] [CrossRef]
- Ungureanu, D.; Ruseti, S.; Toma, I.; Dascalu, M. pROnounce: Automatic Pronunciation Assessment for Romanian. In Conference on Smart Learning Ecosystems and Regional Development; Springer: Singapore, 2022; pp. 103–114. [Google Scholar]
- Ungureanu, D.; Dascalu, M. Echo: A Crowd-sourced Romanian Speech Dataset. Interact. Des. Archit. J.—IxD&A 2024, 62, 141–152. [Google Scholar] [CrossRef]
- Chitoran, I. The Phonology of Romanian: A Constraint-Based Approach; Walter de Gruyter: Berlin, Germany, 2013; Volume 56. [Google Scholar]
- Pană Dindelegan, G. (Ed.) The Grammar of Romanian; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Renwick, M.E. Vowels of Romanian: Historical, Phonological and Phonetic Studies. Ph.D. Thesis, Cornell University, Ithaca, NY, USA, 2012. [Google Scholar]
- Stan, C.; Moldoveanu Pologea, M. Inflectional and Derivational Morphophonological Alternations. In The Grammar of Romanian; Pană Dindelegan, G., Ed.; Oxford University Press: Oxford, UK, 2013; pp. 607–611. [Google Scholar]
- Şulea, O.M. Semi-supervised Approach to Romanian Noun Declension. Procedia Comput. Sci. 2016, 96, 664–671. [Google Scholar] [CrossRef]
- Tufis, D.; Ceausu, A. Diacritics Restoration in Romanian Texts. In Proceedings of the RANLP 2007 Workshop “A Common Natural Language Processing Paradigm for Balkan Languages”, Borovets, Bulgaria, 27–29 September 2007; Paskaleva, E., Slavcheva, M., Eds.; INCOMA Ltd.: Shoumen, Bulgaria, 2007; pp. 49–56. [Google Scholar]
- Roseano, P.; Turculeţ, A.; Bibiri, A.D.; Cerdà Massó, R.; Fernández Planas, A.M.; Elvira-García, W. A dialectometric approach to Romanian intonation. Onomázein 2022, 105–139. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar] [CrossRef]
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; Von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv 2021, arXiv:2111.09296. [Google Scholar]
- Rekesh, D.; Koluguri, N.R.; Kriman, S.; Majumdar, S.; Noroozi, V.; Huang, H.; Hrinchuk, O.; Puvvada, K.; Kumar, A.; Balam, J.; et al. Fast conformer with linearly scalable attention for efficient speech recognition. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–8. [Google Scholar]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; International Machine Learning Society (IMLS): San Diego, CA, USA, 2016; pp. 173–182. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Caranica, A.; Burileanu, C. An automatic speech recognition system with speaker-independent identification support. In Proceedings of the Advanced Topics in Optoelectronics, Microelectronics, and Nanotechnologies VII, Constanta, Romania, 21–24 August 2014; SPIE: Bellingham, WA, USA, 2015; Volume 9258, pp. 769–775. [Google Scholar]
- Georgescu, A.L.; Cucu, H.; Buzo, A.; Burileanu, C. RSC: A Romanian read speech corpus for automatic speech recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 6606–6612. [Google Scholar]
- Stan, A.; Dinescu, F.; Ţiple, C.; Meza, Ş.; Orza, B.; Chirilă, M.; Giurgiu, M. The SWARA speech corpus: A large parallel Romanian read speech dataset. In Proceedings of the 2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, Romania, 6–9 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
- Georgescu, A.L.; Cucu, H.; Burileanu, C. Kaldi-based DNN architectures for speech recognition in Romanian. In Proceedings of the 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania, 10–12 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Interspeech 2019; ISCA: Grenoble, France, 2019. [Google Scholar] [CrossRef]

| Model Name and Variant | Parameters | LibriSpeech Test-Clean WER | LibriSpeech Test-Other WER |
|---|---|---|---|
| Conformer (small) [14] | 10 M | 2.1% | 5.0% |
| Conformer (medium) [14] | 30 M | 2.0% | 4.3% |
| Conformer (large) [14] | 118 M | 1.9% | 3.9% |
| FastConformer (large) [16] (available at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_transducer_large; accessed on 8 February 2026) | 114 M | 1.8% | 3.8% |
| FastConformer (xlarge) [16] (available at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_transducer_xlarge; accessed on 8 February 2026) | 600 M | 1.6% | 3.0% |
| FastConformer (xxlarge) [16] (available at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_transducer_xxlarge; accessed on 8 February 2026) | 1.1 B | 1.5% | 2.7% |
| wav2vec 2.0 (base) [13] | 95 M | 2.1% | 4.8% |
| wav2vec 2.0 (large) [13] | 317 M | 1.8% | 3.3% |
| Whisper (tiny) [18] | 39 M | 5.6% | 14.6% |
| Whisper (base) [18] | 74 M | 4.2% | 10.2% |
| Whisper (small) [18] | 244 M | 3.1% | 7.4% |
| Whisper (medium) [18] | 769 M | 3.1% | 6.3% |
| Whisper (large) [18] | 1.55 B | 2.7% | 5.6% |
| Whisper (large-v2) (available at https://huggingface.co/openai/whisper-large-v2; accessed on 8 February 2026) | 1.55 B | 2.7% | 5.2% |
| Whisper (large-v3) (available at https://huggingface.co/openai/whisper-large-v3; accessed on 8 February 2026) | 1.55 B | 2.3% | 4.6% |
| Humans [17] | - | 5.83% | 12.69% |
| Corpus | Hours | Speakers | Speech Type/Domain | Supervision |
|---|---|---|---|---|
| SWARA [21] | >21 | 17 | read (studio-quality) | manual transcripts |
| RSC [20] | 100 | 164 | read (varied microphones) | manual transcripts |
| Common Voice (RO) [22] | varies | varies | read (crowd-sourced) | manual transcripts |
| Echo [5] | 378 | 343 | multi-domain; read + spontaneous | manual transcripts |
| CRoWL (this work) | 9000 | N/A | parliamentary speech | weak labels |
| Model Name | Parameters | Echo-Test WER |
|---|---|---|
| Whisper (small) [18] | 244 M | 35.0% |
| Whisper (large-v3) [18] | 1.55 B | 7.6% |
| Whisper-RO (small) [5] (available at https://huggingface.co/readerbench/whisper-ro; accessed on 8 February 2026) | 244 M | 7.3% |
| Domain | Recordings | Duration (h) | Speakers | Vocabulary | Details |
|---|---|---|---|---|---|
| Literature | 34,896 | 69 | 207 | 10,661 | high variability from different subtypes |
| - Drama | 9077 | 13 | 198 | 2581 | |
| - Epic | 23,852 | 48 | 204 | 7643 | |
| - Poems | 1967 | 7 | 168 | 1182 | |
| News | 65,216 | 156 | 200 | 38,120 | clean, up-to-date language |
| Emergency | 8560 | 11 | 314 | 768 | read with accent; disfluencies in speech |
| Legal | 8832 | 28 | 194 | 2903 | longer sentences and formal register |
| Wikipedia | 45,193 | 111 | 329 | 7249 | mixed topics and mixed conditions |
| Total | 162,697 | 378 | 343 | 49,664 | |
| Step | Input | Output | Tooling/Notes |
|---|---|---|---|
| 1. Crawling | session pages | media URLs + metadata | Official Parliamentary archive; session-level IDs |
| 2. Audio extraction | video/stream | 16 kHz mono WAV | ffmpeg; loudness normalization |
| 3. Diarization + VAD | long-form audio | speech turns | PyAnnote-based diarization; remove non-speech |
| 4. Segmentation | speech turns | short utterances | Split on pauses; keep within duration bounds |
| 5. Weak transcription | segments | ASR pseudo-transcripts | Echo-trained ASR model; no manual correction |
| 6. Filtering | audio + text | cleaned pairs | CPS + duration + trimming filters |
| 7. Split + remove duplicates | cleaned pairs | train/val/test | transcript string matching; minimize speaker overlap |
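Step 2 of the pipeline (audio extraction to 16 kHz mono WAV) maps directly onto an ffmpeg invocation. The sketch below only constructs the command; the use of the `loudnorm` filter for loudness normalization is an assumption about the exact settings used.

```python
import subprocess

def extraction_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command for step 2: 16 kHz mono WAV with loudness normalization."""
    return [
        "ffmpeg", "-i", src,
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        "-af", "loudnorm",   # EBU R128 loudness normalization (assumed filter choice)
        dst,
    ]

# To run on a downloaded session recording (requires ffmpeg on PATH):
# subprocess.run(extraction_command("session.mp4", "session.wav"), check=True)
```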
| Source | Utterances | Duration (h) | Speakers | Vocabulary |
|---|---|---|---|---|
| Echo (test split) | 53,818 | 72.26 | 337 | 41,979 |
| CRoWL (test split) | 16,305 | 33.69 | N/A | 2990 |
| Total | 70,123 | 105.95 | N/A | 42,957 |
| Model Name | Architecture | Strengths | Weaknesses |
|---|---|---|---|
| DeepSpeech | RNN | Well-known, historically important | No longer SOTA |
| Kaldi | HMM-DNN Hybrid | Powerful, flexible toolkit; ideal for research | Complex to develop new models |
| Whisper | Transformer | Multilingual and multitask; end-to-end with built-in LM | Large models; slow fine-tuning and inference; prone to hallucinations |
| wav2vec 2.0 (XLS-R) | Transformer | SOTA self-supervised learning; excellent for low-resource settings; robust multilingual pre-training | Requires fine-tuning; CTC output may need an LM |
| Conformer | Transformer + CNN | SOTA hybrid; efficient architecture | Needs substantial data for best accuracy |
| Item | Setting |
|---|---|
| Audio sampling rate | 16 kHz; mono |
| Optimizer | AdamW |
| Mixed precision | FP16 |
| Epochs | 20 |
| Decoding for WER | Greedy CTC decoding without an external language model; same text normalization for all evaluations |
| Seed | Fixed for data shuffling and initialization |
| Hardware | Single-node GPU (NVIDIA A100 with 80 GB) |
| Total training time | wav2vec 2.0: 69 h on Echo + CRoWL, 42 h on Echo only; Conformer: 24 h on Echo + CRoWL, 11 h on Echo only |
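The greedy CTC decoding used for WER evaluation collapses repeated frame-level predictions and removes blank tokens before mapping ids to characters. A minimal sketch, assuming the blank id is 0 and '|' is the word delimiter (common wav2vec 2.0 conventions, not confirmed specifics of this setup):

```python
from itertools import groupby

def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, drop CTC blanks, map ids to text."""
    collapsed = [k for k, _ in groupby(frame_ids) if k != blank_id]
    return "".join(id_to_char[i] for i in collapsed).replace("|", " ")
```

Because no external language model is used, the decoded text is exactly the argmax path of the acoustic model, which keeps the comparison between configurations fair.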
| Model | Training Dataset | Common Voice | Echo | SWARA | RSC | Echo + CRoWL |
|---|---|---|---|---|---|---|
| wav2vec 2.0 | Echo | 9.21 | 4.04 | 7.51 | 6.63 | 6.39 |
| wav2vec 2.0 | Echo + CRoWL | 4.58 | 4.51 | 2.98 | 3.04 | 4.17 |
| Conformer | Echo | 9.47 | 8.43 | 8.98 | 7.91 | 12.16 |
| Conformer | Echo + CRoWL | 2.81 | 4.23 | 2.80 | 2.75 | 3.01 |
Share and Cite
Ungureanu, R.-D.; Dascalu, M. Modern Speech Recognition for Romanian Language. Appl. Sci. 2026, 16, 1928. https://doi.org/10.3390/app16041928

