Open Access Article

Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

1 School of Languages, Literatures and Linguistics, Bangor University, Bangor, Gwynedd LL57 2DG, UK
2 Language Technologies Unit, Bangor University, Bangor, Gwynedd LL57 2DG, UK
* Author to whom correspondence should be addressed.
Information 2019, 10(8), 247; https://doi.org/10.3390/info10080247
Received: 30 June 2019 / Revised: 17 July 2019 / Accepted: 23 July 2019 / Published: 25 July 2019
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Abstract

Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.
Keywords: low-resource languages; linguistic diversity; speech recognition; speech technology; corpus

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Cooper, S.; Jones, D.B.; Prys, D. Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology. Information 2019, 10, 247.
