Automatic Diacritization Models for a High-Population Low-Resource African Language (Yorùbá)

Ayoola, Joshua I.; Olukanmi, Peter O.

doi:10.3390/app16126195

by Joshua I. Ayoola * and Peter O. Olukanmi *

Department of Electrical Engineering Technology, University of Johannesburg, Johannesburg 2006, South Africa

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6195; https://doi.org/10.3390/app16126195 (registering DOI)

Submission received: 12 May 2026 / Revised: 11 June 2026 / Accepted: 15 June 2026 / Published: 18 June 2026

(This article belongs to the Special Issue Natural Language Processing (NLP): Technologies and Applications)

Abstract

Diacritization is essential to reading and writing Yorùbá, a widely spoken tonal language of West Africa and parts of the Americas. Unfortunately, typical computer-typed texts are not diacritized. Automatic diacritization is therefore a critical problem in Yorùbá natural language processing (NLP), since missing tone marks and underdots impair text comprehension, translation and speech technology. This paper begins by reviewing the state of the art. Yorùbá diacritization models are scarce; the four models found were evaluated on the standardised Yorùbá Automatic Diacritization Dataset (YAD): the 2018 Volta Baseline, mT5_base_yoruba_adr, GPT-5.2 and Gemini 3.1 Pro. We measured performance with six metrics: Word Error Rate (WER), Character Error Rate (CER), Diacritization Error Rate (DER), Word Diacritization Error Rate (WDER), BLEU and ChrF, using the complete diacritic removal condition of the YAD test set. To ensure reproducibility, the LLM evaluations were conducted via the respective official APIs and AI Studio with pinned snapshots and deterministic settings, and each model was evaluated across three independent full-dataset runs. The specialised mT5_base_yoruba_adr model achieved the lowest error rates, 34.85% CER, 18.34% WER, 43.37% DER and 18.33% WDER, together with a BLEU of 0.6872 and a ChrF of 0.8436, narrowly ahead of Gemini 3.1 Pro and well ahead of GPT-5.2. Gemini 3.1 Pro ranked second on all error rate metrics with 35.68% CER, 18.96% WER and 44.84% DER, but outperformed mT5 by a small margin on ChrF (0.8469). GPT-5.2 followed with 54.01% CER, 38.05% WER and 62.64% DER. The Volta Baseline, built on an early seq2seq architecture, showed the weakest performance with 92.37% CER and 94.42% DER.
These results challenge the assumption that a large parameter count and massive pre-training guarantee superior performance on low-resource language tasks, and they show that targeted fine-tuning on Yorùbá-specific data remains important. Our work serves as a reference for researchers seeking an overview of the state of the art, as well as a detailed and reproducible evaluation of existing models. The results highlight both methodological progress and gaps in current systems. Addressing these gaps will require domain-adaptive fine-tuning, improved algorithms and robust datasets to advance the state of the art in African-language automatic diacritization research.

Keywords: Yorùbá; automatic diacritization; natural language processing (NLP); large language models; low-resource languages; benchmarking; tone restoration; domain-adaptive fine-tuning; mT5
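To make the evaluation setup in the abstract concrete, the following is a minimal Python sketch of complete diacritic removal and of three of the error rates named there (CER, WER and DER). It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, CER and WER use the textbook Levenshtein definitions over NFC code points and whitespace tokens, and DER is a simplified per-letter comparison of combining marks that assumes the model left the base letters unchanged. The paper's own metric definitions may differ, so the reported scores should not be expected to be reproduced by this sketch.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove every combining mark (tone marks and underdots) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over NFC code points (an assumed definition)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def split_marks(text: str):
    """Return (base character, sorted combining marks) pairs for text."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def der(reference: str, hypothesis: str) -> float:
    """Simplified diacritization error rate (assumption, not the paper's):
    the share of letters whose combining marks differ from the reference."""
    ref, hyp = split_marks(reference), split_marks(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base letters differ; this sketch needs aligned text")
    letters = [(r, h) for r, h in zip(ref, hyp) if r[0].isalpha()]
    if not letters:
        raise ValueError("reference contains no letters")
    return sum(1 for r, h in letters if r[1] != h[1]) / len(letters)


if __name__ == "__main__":
    gold = "ọjọ́ dára"                 # illustrative reference text
    undiacritized = strip_diacritics(gold)  # model input under full removal
    print(undiacritized)
    print(cer(gold, undiacritized), wer(gold, undiacritized),
          der(gold, undiacritized))
```

Scoring the undiacritized input against the reference, as in the usage lines above, gives a "do nothing" baseline against which a diacritization model's output can be compared. Normalizing both sides to the same Unicode form before scoring matters here, because a letter such as ọ́ can be encoded in more than one way and mismatched encodings would otherwise be counted as errors.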
