Open Access Article
Automatic Diacritization Models for a High-Population Low-Resource African Language (Yorùbá)
by Joshua I. Ayoola * and Peter O. Olukanmi *
Department of Electrical Engineering Technology, University of Johannesburg, Johannesburg 2006, South Africa
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(12), 6195; https://doi.org/10.3390/app16126195 (registering DOI)
Submission received: 12 May 2026 / Revised: 11 June 2026 / Accepted: 15 June 2026 / Published: 18 June 2026
Abstract
Diacritization is essential to reading and writing Yorùbá, a widely spoken tonal language of West Africa and parts of the Americas. Typical computer-typed Yorùbá text, however, is not diacritized. Automatic diacritization is therefore a critical problem in Yorùbá natural language processing (NLP), because missing tone marks and underdots impair text comprehension, translation, and speech technology. This paper begins by reviewing the state of the art. Although Yorùbá diacritization models are scarce, we identified four and evaluated them on the standardised Yorùbá Automatic Diacritization (YAD) Dataset: the 2018 Volta Baseline, mT5_base_yoruba_adr, GPT-5.2, and Gemini 3.1 Pro. We measured performance with six metrics, namely Word Error Rate (WER), Character Error Rate (CER), Diacritization Error Rate (DER), Word Diacritization Error Rate (WDER), BLEU, and ChrF, under the complete diacritic removal condition of the YAD test set. To ensure reproducibility, the LLM evaluations were conducted via the respective official APIs and AI Studio with pinned snapshots and deterministic settings, and each model was evaluated across three independent full-dataset runs. The specialised mT5_base_yoruba_adr model outperformed both LLMs, narrowly in the case of Gemini 3.1 Pro, achieving the lowest error rates (34.85% CER, 18.34% WER, 43.37% DER, and 18.33% WDER) together with a BLEU of 0.6872 and a ChrF of 0.8436. Gemini 3.1 Pro ranked second across all error-rate metrics, with 35.68% CER, 18.96% WER, and 44.84% DER, but outperformed mT5 by a small margin on ChrF (0.8469). GPT-5.2 followed with 54.01% CER, 38.05% WER, and 62.64% DER. The Volta Baseline, built on an early seq2seq approach, showed the weakest performance, with 92.37% CER and 94.42% DER.
These results challenge the assumption that large parameter counts and massive pre-training guarantee superior performance on low-resource language tasks, and they show that targeted fine-tuning on Yorùbá-specific data remains important. Our work serves as a reference for researchers seeking an overview of the state of the art, as well as a detailed and reproducible evaluation of existing models. The results highlight both methodological progress and gaps in current systems. Addressing these gaps will require domain-adaptive fine-tuning, improved algorithms, and robust datasets to advance the state of the art in African-language automatic diacritization research.
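To make the evaluation setup concrete, the following minimal sketch shows one way to produce the "complete diacritic removal" input and to score an output with CER and WER. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, CER is computed over NFC-normalised Unicode code points, and both rates use plain Levenshtein distance divided by reference length. The paper's exact definitions of DER and WDER are not given in this section, so they are not reproduced here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove every combining mark (tone marks and underdots) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over NFC code points (an assumed convention)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "ọjọ́ ọjà"                    # fully diacritized reference
    model_input = strip_diacritics(reference)  # what a model would receive
    hypothesis = "ojo ọjà"                     # hypothetical model output
    print(model_input, cer(reference, hypothesis), wer(reference, hypothesis))
```

Normalising both strings to the same Unicode form before comparison matters for Yorùbá, because a character carrying both an underdot and a tone mark has no single precomposed code point, so differently encoded but visually identical strings would otherwise be counted as errors.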